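-- An aside: I recently noticed that a good share of the content with high search volume has little value; most of it turns out to be error logs for one problem or another. I tend to post screenshots myself, but search engines index a post by tokenizing its text, so whatever is inside an image is generally not searchable. For the same problem it is therefore better to post the error codes and logs as text: screenshots are intuitive, but they do not help a post spread. I hope people will improve their posts along these lines so that valuable content reaches more readers.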
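This post records how to set up IK word segmentation and synonym/associative search on Elasticsearch 5.3.1. I originally meant to write about importing data in multiple formats (html, pdf, word...) with fscrawler, but the IK analyzer and the synonym configuration took me two days to get working and I could not find a detailed write-up, so I decided to record them here.
IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through three major versions. It started out as a Chinese segmentation component built around the open-source project Lucene, combining dictionary-based segmentation with grammar analysis. The newer IK Analyzer 3.0 became a general-purpose segmentation component for Java that is independent of Lucene while still providing a default implementation optimized for Lucene. That makes IK a natural match for ES, at least for Chinese; for English, splitting on whitespace is simple and good enough. For Chinese retrieval, segmentation quality matters a great deal for search quality, so the IK plugin is well worth trying.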
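I. Installing the IK analyzer
1. Download the IK analyzer: https://github.com/medcl/elasticsearch-analysis-ik/releases I downloaded the prebuilt 5.3.2 package, because no 5.3.1 release is listed there.
2. Create a directory named analysis-ik under the Elasticsearch plugins directory: mkdir analysis-ik
3. Extract the IK analyzer archive into the analysis-ik directory.
4. Edit plugin-descriptor.properties (the 5.3.2 build is being installed on 5.3.1, so presumably the Elasticsearch version declared there has to be changed to match).
5. Start Elasticsearch and test the IK analyzer: [rzxes@rzxes elasticsearch-5.3.1]$ bin/elasticsearch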
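Step 4 can also be scripted. The sketch below assumes that the edit consists of changing the elasticsearch.version entry in plugin-descriptor.properties to the version of the running node (5.3.1 here). The helper name set_es_version and the sample descriptor text are illustrative only and are not copied from the real plugin archive.

```python
def set_es_version(descriptor_text, es_version):
    """Return the descriptor text with elasticsearch.version set to es_version."""
    lines = []
    found = False
    for line in descriptor_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key == "elasticsearch.version":
            lines.append("elasticsearch.version=" + es_version)
            found = True
        else:
            lines.append(line)  # comments and other properties stay untouched
    if not found:
        # Assumption: a missing entry is simply appended at the end.
        lines.append("elasticsearch.version=" + es_version)
    return "\n".join(lines) + "\n"


# Illustrative descriptor content, not the real file shipped with the plugin.
sample = "name=analysis-ik\nversion=5.3.2\nelasticsearch.version=5.3.2\n"
patched = set_es_version(sample, "5.3.1")
print(patched)
```

Only the elasticsearch.version line is rewritten; the plugin's own version entry is left as it is.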
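IK provides two analyzers, ik_smart and ik_max_word, and two tokenizers with the same names, ik_smart and ik_max_word.
ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination;
ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”.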
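The difference between the two granularities can be illustrated with a toy dictionary segmenter. This is only a sketch: it is not IK's actual algorithm, and the nine-word DICTIONARY is a hypothetical stand-in for IK's real dictionary, so the fine-grained output is shorter than the list quoted above.

```python
# Hypothetical mini-dictionary; IK ships a far larger one.
DICTIONARY = {"中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
              "人民", "共和国", "共和", "国歌"}
MAX_LEN = max(len(word) for word in DICTIONARY)


def fine_grained(text):
    """Emit every dictionary word found at any position (overlaps allowed)."""
    tokens = []
    for start in range(len(text)):
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            if text[start:end] in DICTIONARY:
                tokens.append(text[start:end])
    return tokens


def coarse_grained(text):
    """Greedy forward longest match: no overlaps, unknown characters stand alone."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(min(len(text), pos + MAX_LEN), pos, -1):
            if text[pos:end] in DICTIONARY:
                break
        else:
            end = pos + 1  # no dictionary word starts here
        tokens.append(text[pos:end])
        pos = end
    return tokens


print(fine_grained("中华人民共和国国歌"))
print(coarse_grained("中华人民共和国国歌"))
```

The fine-grained variant favors recall and suits indexing, while the coarse variant yields fewer, longer tokens and suits query analysis.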
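The following output has the token types of the default standard analyzer, which splits the Chinese text into single characters:
{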
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "word",
"start_offset" : 6,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "西",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "红",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "柿",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}
]
}
With IK, 西红柿 is kept as a single word:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "ENGLISH",
"position" : 0
},
{
"token" : "word",
"start_offset" : 6,
"end_offset" : 10,
"type" : "ENGLISH",
"position" : 1
},
{
"token" : "西红柿",
"start_offset" : 10,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "9f",
"start_offset" : 13,
"end_offset" : 15,
"type" : "LETTER",
"position" : 3
}
]
}
Fine-grained output from ik_max_word, which contains overlapping tokens:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "ENGLISH",
"position" : 0
},
{
"token" : "word",
"start_offset" : 6,
"end_offset" : 10,
"type" : "ENGLISH",
"position" : 1
},
{
"token" : "中华人民",
"start_offset" : 10,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "中华",
"start_offset" : 10,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "华人",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "人民",
"start_offset" : 12,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 5
}
]
}
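The responses above are easier to compare once they are reduced to token spans. The following sketch checks whether a response contains overlapping tokens, which is the visible signature of fine-grained segmentation. The function names are illustrative, and the two JSON strings are abridged copies of the outputs above with only the fields that are used.

```python
import json


def token_spans(analyze_response):
    """Return (token, start_offset, end_offset) tuples from an _analyze response."""
    data = json.loads(analyze_response)
    return [(t["token"], t["start_offset"], t["end_offset"]) for t in data["tokens"]]


def has_overlaps(spans):
    """True if some token starts before the furthest end offset seen so far."""
    furthest_end = 0
    for _, start, end in sorted(spans, key=lambda s: (s[1], s[2])):
        if start < furthest_end:
            return True
        furthest_end = max(furthest_end, end)
    return False


# Abridged from the outputs above.
single_word = ('{"tokens": [{"token": "西红柿", "start_offset": 10, "end_offset": 13},'
               ' {"token": "9f", "start_offset": 13, "end_offset": 15}]}')
fine_grained_response = ('{"tokens": [{"token": "中华人民", "start_offset": 10, "end_offset": 14},'
                         ' {"token": "中华", "start_offset": 10, "end_offset": 12},'
                         ' {"token": "华人", "start_offset": 11, "end_offset": 13},'
                         ' {"token": "人民", "start_offset": 12, "end_offset": 14}]}')

print(has_overlaps(token_spans(single_word)))
print(has_overlaps(token_spans(fine_grained_response)))
```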
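II. Synonym configuration
The synonym rules go into analysis/synonyms.txt under the Elasticsearch config directory (the synonyms_path referenced in the index settings), one rule per line:
西红柿,番茄 => 西红柿,番茄
社保,公积金 => 社保,公积金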
Create the index with two custom analyzers built on the IK tokenizers:
curl -XPUT 'http://192.168.230.150:9200/index' -d'
{
  "index": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["by_tfr","by_sfr"],
          "char_filter": ["by_cfr"]
        },
        "by_max_word": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["by_tfr","by_sfr"],
          "char_filter": ["by_cfr"]
        }
      },
      "filter": {
        "by_tfr": {
          "type": "stop",
          "stopwords": [" "]
        },
        "by_sfr": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "char_filter": {
        "by_cfr": {
          "type": "mapping",
          "mappings": ["| => |"]
        }
      }
    }
  }
}'
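A typo in one of the filter names in these settings only shows up when Elasticsearch rejects the request, so a small local check can help. The sketch below verifies that every filter and char_filter referenced by an analyzer is defined in the same body. The function name is illustrative, and the check only knows about definitions in the body itself, so a built-in filter referenced by name (for example lowercase) would be reported as missing.

```python
def undefined_references(body):
    """List (analyzer, kind, name) for filters the analyzers use but the body never defines."""
    analysis = body["index"]["analysis"]
    missing = []
    for analyzer_name, analyzer in analysis.get("analyzer", {}).items():
        for kind in ("filter", "char_filter"):
            defined = analysis.get(kind, {})
            for ref in analyzer.get(kind, []):
                if ref not in defined:
                    missing.append((analyzer_name, kind, ref))
    return missing


# Same structure as the index settings used in this post.
body = {
    "index": {
        "analysis": {
            "analyzer": {
                "by_smart": {"type": "custom", "tokenizer": "ik_smart",
                             "filter": ["by_tfr", "by_sfr"], "char_filter": ["by_cfr"]},
                "by_max_word": {"type": "custom", "tokenizer": "ik_max_word",
                                "filter": ["by_tfr", "by_sfr"], "char_filter": ["by_cfr"]},
            },
            "filter": {
                "by_tfr": {"type": "stop", "stopwords": [" "]},
                "by_sfr": {"type": "synonym", "synonyms_path": "analysis/synonyms.txt"},
            },
            "char_filter": {
                "by_cfr": {"type": "mapping", "mappings": ["| => |"]},
            },
        }
    }
}

print(undefined_references(body))
```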
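Create the mapping for the title field, using by_max_word at index time and by_smart at search time. The type in the URL is set to title here so that it matches the type used by the indexing commands that follow:
curl -XPUT 'http://192.168.230.150:9200/index/_mapping/title' -d'
{
  "properties": {
    "title": {
      "type": "text",
      "index": "analyzed",
      "analyzer": "by_max_word",
      "search_analyzer": "by_smart"
    }
  }
}'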
Index a few test documents:
curl -XPOST http://192.168.230.150:9200/index/title/1 -d'{"title":"我有一个西红柿"}'
curl -XPOST http://192.168.230.150:9200/index/title/2 -d'{"title":"番茄炒蛋饭"}'
curl -XPOST http://192.168.230.150:9200/index/title/3 -d'{"title":"西红柿鸡蛋面"}'
Search for 番茄 with highlighting:
curl -XPOST http://192.168.230.150:9200/index/title/_search -d'
{
  "query" : { "match" : { "title" : "番茄" }},
  "highlight" : {
    "pre_tags" : ["<tag1>", "<tag2>"],
    "post_tags" : ["</tag1>", "</tag2>"],
    "fields" : {
      "title" : {}
    }
  }
}
'
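In the highlight section every entry in pre_tags needs a matching closing entry in post_tags. One way to keep them paired is to generate both lists from plain tag names. A minimal sketch follows; the helper names are illustrative, and the body is only built and printed, not sent.

```python
import json


def highlight_tags(names):
    """Build paired opening and closing tags from plain tag names."""
    return ["<%s>" % n for n in names], ["</%s>" % n for n in names]


def search_body(field, text, tag_names):
    """Build a match query on one field with highlighting on the same field."""
    pre_tags, post_tags = highlight_tags(tag_names)
    return {
        "query": {"match": {field: text}},
        "highlight": {
            "pre_tags": pre_tags,
            "post_tags": post_tags,
            "fields": {field: {}},
        },
    }


body = search_body("title", "番茄", ["tag1", "tag2"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```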
Result: all three documents are hit, matching both "番茄" and its synonym "西红柿".
Elasticsearch 5.x no longer accepts index-level settings in the node configuration and reports the following message:
Since elasticsearch 5.x index level settings can NOT be set on the nodes
configuration like the elasticsearch.yaml, in system properties or command line
arguments. In order to upgrade all indices the settings must be updated via the
/${index}/_settings API. Unless all settings are dynamic all indices must be closed
in order to apply the upgrade. Indices created in the future should use index templates
to set default values.
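Following that message, a settings change on an existing index goes through the /{index}/_settings API, and an index has to be closed first when the settings are not dynamic. The sketch below lists the three requests involved, assuming the change concerns analysis settings. The function name is illustrative, and nothing is sent to the server.

```python
import json


def settings_update_plan(index, analysis):
    """Return the (method, path, body) requests for changing analysis settings on an existing index."""
    return [
        ("POST", "/%s/_close" % index, None),
        ("PUT", "/%s/_settings" % index,
         json.dumps({"analysis": analysis}, ensure_ascii=False)),
        ("POST", "/%s/_open" % index, None),
    ]


plan = settings_update_plan(
    "index",
    {"filter": {"by_sfr": {"type": "synonym",
                           "synonyms_path": "analysis/synonyms.txt"}}},
)
for method, path, body in plan:
    print(method, path, body or "")
```

Each tuple maps directly onto a curl call of the kind used elsewhere in this post.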
Add a group of equivalent synonyms to synonyms.txt, so that the file reads:
儿童,青年,少年,幼年
西红柿,番茄 => 西红柿,番茄
社保,公积金 => 社保,公积金
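The file mixes two kinds of rules: a plain comma-separated line makes all of its terms equivalent, while a line with => maps each term on the left to the terms on the right. The sketch below mimics that expansion for single-word terms. It assumes the Solr-style rule format used in this file and is not the synonym filter's actual implementation; the function name is illustrative.

```python
def parse_synonyms(lines):
    """Map each input term to the list of terms it is expanded to."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
        else:
            # Equivalent synonyms: every term expands to the whole group.
            sources = targets = [t.strip() for t in line.split(",") if t.strip()]
        for source in sources:
            bucket = table.setdefault(source, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return table


rules = [
    "儿童,青年,少年,幼年",
    "西红柿,番茄 => 西红柿,番茄",
    "社保,公积金 => 社保,公积金",
]
table = parse_synonyms(rules)
print(table["青年"])
print(table["番茄"])
```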
Restart ES and run the analysis again; the result is as follows:
[rzxes@rzxes elasticsearch-5.3.1]$ curl -XGET 'http://192.168.230.150:9200/index/_analyze?pretty=true&analyzer=by_smart' -d '{"text":"青年"}'
{
"tokens" : [
{
"token" : "青年",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "儿童",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "少年",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
},
{
"token" : "幼年",
"start_offset" : 0,
"end_offset" : 2,
"type" : "SYNONYM",
"position" : 0
}
]
}
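In this response the original token and its three synonyms all report position 0, meaning they are alternatives for the same slot in the text. Grouping tokens by position makes that easy to see. A small sketch follows; the function name is illustrative, and the JSON string is an abridged copy of the output above.

```python
import json


def tokens_by_position(analyze_response):
    """Group (token, type) pairs from an _analyze response by their position."""
    grouped = {}
    for t in json.loads(analyze_response)["tokens"]:
        grouped.setdefault(t["position"], []).append((t["token"], t["type"]))
    return grouped


# Abridged from the output above (offsets omitted).
response = ('{"tokens": ['
            '{"token": "青年", "type": "CN_WORD", "position": 0},'
            '{"token": "儿童", "type": "SYNONYM", "position": 0},'
            '{"token": "少年", "type": "SYNONYM", "position": 0},'
            '{"token": "幼年", "type": "SYNONYM", "position": 0}]}')

grouped = tokens_by_position(response)
for position, tokens in grouped.items():
    print(position, tokens)
```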