分词就是指将一个文本转化成一系列单词的过程,也叫文本分析,在Elasticsearch中称之为Analysis。
接口:POST http://192.168.12.10:9200/_analyze
参数
{
"analyzer":"standard",
"text":"我是中国人"
}
IK Analyzer是一个开源的,基于java语言开发的轻量级的中文分词工具包。从2006年12月推出1.0版开始,IKAnalyzer已经推出了3个大版本。最初,它是以开源项目Luence为应用主体的,结合词典分词和文法分析算法的中文分词组件。新版本的IK Analyzer 3.0则发展为面向Java的公用分词组件,独立于Lucene项目,同时提供了对Lucene的默认优化实现
下载 下载地址 https://download.csdn.net/download/zhangxm_qz/12553781
解压文件到 ES安装目录 plugins下
yum install -y unzip zip –安装 unzip命令
unzip elasticsearch-analysis-ik-6.5.4.zip –解压安装文件
[root@localhost plugins]# ll
总用量 0
drwxr-xr-x. 5 root root 135 11月 20 2018 zk
[root@localhost plugins]#
重启es服务
调用接口测试
POST http://192.168.12.10:9200/_analyze
参数:
{
"analyzer":"ik_max_word",
"text":"我是中国人"
}
响应
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "CN_CHAR",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "CN_CHAR",
"position": 1
},
{
"token": "中国人",
"start_offset": 2,
"end_offset": 5,
"type": "CN_WORD",
"position": 2
},
{
"token": "中国",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 3
},
{
"token": "国人",
"start_offset": 3,
"end_offset": 5,
"type": "CN_WORD",
"position": 4
}
]
}
创建索引
PUT http://192.168.12.10:9200/myindex2
{
"settings": {
"index": {
"number_of_shards": "1",
"number_of_replicas": "0"
}
},
"mappings": {
"person": {
"properties": {
"name": {
"type": "text"
},
"age": {
"type": "integer"
},
"mail": {
"type": "keyword"
},
"hobby": {
"type": "text",
"analyzer":"ik_max_word"
}
}
}
}
}
添加数据
接口 :POST http://192.168.12.10:9200/myindex2/person/_bulk
参数:
{"index":{"_index":"myindex2","_type":"person"}}
{"name":"张三","age": 20,"mail": "111@qq.com","hobby":"羽毛球、乒乓球、足球"}
{"index":{"_index":"myindex2","_type":"person"}}
{"name":"李四","age": 21,"mail": "222@qq.com","hobby":"羽毛球、乒乓球、足球、篮球"}
{"index":{"_index":"myindex2","_type":"person"}}
{"name":"王五","age": 22,"mail": "333@qq.com","hobby":"羽毛球、篮球、游泳、听音乐"}
{"index":{"_index":"myindex2","_type":"person"}}
{"name":"赵六","age": 23,"mail": "444@qq.com","hobby":"跑步、游泳、篮球"}
{"index":{"_index":"myindex2","_type":"person"}}
{"name":"孙七","age": 24,"mail": "555@qq.com","hobby":"听音乐、看电影、羽毛球"}
数据如下:
POST http://192.168.12.10:9200/myindex2/person/_search
参数:
{
"query":{
"match":{
"hobby":"音乐"
}
},
"highlight": {
"fields": {
"hobby": {}
}
}
}
查询出 爱好包含音乐的两个用户
过程说明
POST http://192.168.12.10:9200/myindex2/person/_search
参数:
{
"query":{
"match":{
"hobby":"音乐 篮球"
}
},
"highlight": {
"fields": {
"hobby": {}
}
}
}
可以看到,只要包含了“音乐”、“篮球” 中某一个的数据都被搜索到了。二者是或的关系。如果我们想搜索的是既包含“音乐”又包含“篮球”的用户,以指定词之间的逻辑关系,如下,这样就只查询出一个用户数据
{
"query":{
"match":{
"hobby":{
"query":"音乐 篮球",
"operator":"and"
}
}`在这里插入代码片`
},
"highlight": {
"fields": {
"hobby": {}
}
}
}
“OR” 和 “AND”搜索,这是两个极端,其实在实际场景中,并不会选取这2个极端,大多时候是只需要符合一定的相似度就可以查询到数据,在Elasticsearch中也支持这样的查询,通过minimum_should_match来指定匹配度,如:80%。
如下查询就只返回匹配度大于80%的数据4条数据
{
"query":{
"match":{
"hobby":{
"query":"游泳 羽毛球",
"minimum_should_match":"80%"
}
}
},
"highlight": {
"fields": {
"hobby": {}
}
}
}
可以通过bool组合查询如下:
查询 必须包含篮球,不能包含音乐,如果包含了游泳,那么它的相似度更高 的结果
{
"query":{
"bool":{
"must":{
"match":{
"hobby":"篮球"
}
},
"must_not":{
"match":{
"hobby":"音乐"
}
},
"should":[
{
"match": {
"hobby":"游泳"
}
}
]
}
},
"highlight": {
"fields": {
"hobby": {}
}
}
}
结果如下:
{
"took": 46,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.8336569,
"hits": [
{
"_index": "myindex2",
"_type": "person",
"_id": "cTw053IBd5Ym0N5fXExh",
"_score": 1.8336569,
"_source": {
"name": "赵六",
"age": 23,
"mail": "444@qq.com",
"hobby": "跑步、游泳、篮球"
},
"highlight": {
"hobby": [
"跑步、游泳、篮球"
]
}
},
{
"_index": "myindex2",
"_type": "person",
"_id": "bzw053IBd5Ym0N5fXExh",
"_score": 0.50270504,
"_source": {
"name": "李四",
"age": 21,
"mail": "222@qq.com",
"hobby": "羽毛球、乒乓球、足球、篮球"
},
"highlight": {
"hobby": [
"羽毛球、乒乓球、足球、篮球"
]
}
}
]
}
}
bool 查询会为每个文档计算相关度评分 _score , 再将所有匹配的 must 和 should 语句的分数 _score 求和
must_not 语句不会影响评分; 它的作用只是将不相关的文档排除。
默认情况下,should中的内容不是必须匹配的,如果查询语句中没有must,那么就会至少匹配其中一个。也可以通过minimum_should_match参数进行控制,该值可以是数字也可以的百分比.
如下should最好满足两个会被查询出
{
"query":{
"bool":{
"should":[
{
"match": {
"hobby":"游泳"
}
},
{
"match": {
"hobby":"篮球"
}
},
{
"match": {
"hobby":"音乐"
}
}
],
"minimum_should_match":2
}
},
"highlight": {
"fields": {
"hobby": {}
}
}
}
结果如下:
{
"took": 24,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 2.135749,
"hits": [
{
"_index": "myindex2",
"_type": "person",
"_id": "cDw053IBd5Ym0N5fXExh",
"_score": 2.135749,
"_source": {
"name": "王五",
"age": 22,
"mail": "333@qq.com",
"hobby": "羽毛球、篮球、游泳、听音乐"
},
"highlight": {
"hobby": [
"羽毛球、篮球、游泳、听音乐"
]
}
},
{
"_index": "myindex2",
"_type": "person",
"_id": "cTw053IBd5Ym0N5fXExh",
"_score": 1.8336569,
"_source": {
"name": "赵六",
"age": 23,
"mail": "444@qq.com",
"hobby": "跑步、游泳、篮球"
},
"highlight": {
"hobby": [
"跑步、游泳、篮球"
]
}
}
]
}
}
有些时候,我们可能需要对某些词增加权重来影响该条数据的得分:
搜索关键字为“游泳篮球”,如果结果中包含了“音乐”权重为10,包含了“跑步”权重为2,如下:
{
"query": {
"bool": {
"must": {
"match": {
"hobby": {
"query": "游泳篮球",
"operator": "and"
}
}
},
"should": [
{
"match": {
"hobby": {
"query": "音乐",
"boost": 10
}
}
},
{
"match": {
"hobby": {
"query": "跑步",
"boost": 2
}
}
}
]
}
},
"highlight": {
"fields": {
"hobby": {}
}
}
}
结果:
{
"took": 46,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 9.484448,
"hits": [
{
"_index": "myindex2",
"_type": "person",
"_id": "cDw053IBd5Ym0N5fXExh",
"_score": 9.484448,
"_source": {
"name": "王五",
"age": 22,
"mail": "333@qq.com",
"hobby": "羽毛球、篮球、游泳、听音乐"
},
"highlight": {
"hobby": [
"羽毛球、篮球、游泳、听音乐"
]
}
},
{
"_index": "myindex2",
"_type": "person",
"_id": "cTw053IBd5Ym0N5fXExh",
"_score": 5.4279313,
"_source": {
"name": "赵六",
"age": 23,
"mail": "444@qq.com",
"hobby": "跑步、游泳、篮球"
},
"highlight": {
"hobby": [
"跑步、游泳、篮球"
]
}
}
]
}
}