Elasticsearch 5: IK + pinyin Analyzer Configuration Explained

Author: liqiqinai | Source: Internet | 2023-05-17 21:33

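1. Applications of pinyin analysis

Pinyin analysis is common in everyday use; you may well rely on it every day. On Taobao, for example, typing the pinyin "zhonghua" brings up suggestions for products whose names contain the corresponding Chinese word "中华".

Pinyin analysis suggests the matching Chinese text for pinyin input, which improves the search experience and makes searching faster. The rest of this article describes how to configure and use pinyin + IK analysis in Elasticsearch 5.1.1.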

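2. Downloading and installing the IK analyzer

IK needs little introduction: it is currently one of the most widely used Chinese analyzers and gives good segmentation results. Most ES developers who work with Chinese text use IK.

Download: https://github.com/medcl/elasticsearch-analysis-ik
Stop Elasticsearch before configuring, and restart it once configuration is complete.
The IK version must match the ES version in use; the README explains this. I am using ES 5.1.1 with IK 5.1.1. (You may wonder why IK jumped from 1.x straight to 5.x. The reason is the unification of version numbers: previously ES was at 2.x, Logstash at 2.x, Kibana at 4.x, and IK at 1.x, which was confusing. From 5.0 onward the numbers are aligned, so with ES 5.1.1 you simply use version 5.1.1 of the other components as well.)

After downloading, change into the elasticsearch-analysis-ik-master directory and build the package with Maven (install Maven first if you do not have it) by running: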

    mvn package

Once packaging succeeds, a target folder is generated. In elasticsearch-analysis-ik-master/target/releases, find elasticsearch-analysis-ik-5.1.1.zip, which is the installation file we need. Unzipping it yields the following:

commons-codec-1.9.jar
commons-logging-1.2.jar
config
elasticsearch-analysis-ik-5.1.1.jar
httpclient-4.5.2.jar
httpcore-4.4.4.jar
plugin-descriptor.properties

Then create a new folder named ik under elasticsearch-5.1.1/plugins and copy the files extracted from elasticsearch-analysis-ik-5.1.1.zip into elasticsearch-5.1.1/plugins/ik.
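Since the plugin version has to match the ES version, it can help to check the extracted plugin-descriptor.properties before restarting. The following is a minimal sketch, assuming the descriptor carries an `elasticsearch.version` key; the helper names and the sample descriptor text are made up for illustration.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def plugin_matches_es(descriptor_text, es_version):
    """Return True if the descriptor targets exactly the given ES version."""
    props = parse_properties(descriptor_text)
    return props.get("elasticsearch.version") == es_version


# Hypothetical descriptor content, for illustration only.
sample = """
# Elasticsearch plugin descriptor file
name=analysis-ik
version=5.1.1
elasticsearch.version=5.1.1
"""
print(plugin_matches_es(sample, "5.1.1"))
```

In practice the text would be read from elasticsearch-5.1.1/plugins/ik/plugin-descriptor.properties rather than from an inline string.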

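3. Downloading and installing the pinyin analyzer

Download the pinyin analyzer from:
https://github.com/medcl/elasticsearch-analysis-pinyin

Installation is the same as for IK: download, package, and add to ES. The steps are not repeated here.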

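4. Analyzer tests
After configuring IK and pinyin, restart ES. If ES reports errors during startup, the installation is faulty; a clean startup indicates the plugins were loaded successfully.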
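Besides watching the startup log, the plugin list returned by `GET /_cat/plugins` can be checked. Below is a rough sketch that parses such a listing; the column layout (node name, plugin name, version), the plugin names analysis-ik and analysis-pinyin, and the sample text are assumptions, so compare them with what your node actually prints.

```python
def missing_plugins(cat_output, required=("analysis-ik", "analysis-pinyin")):
    """Return the required plugin names absent from a _cat/plugins listing."""
    installed = set()
    for line in cat_output.splitlines():
        columns = line.split()
        # Assumed layout: node name, plugin (component) name, version.
        if len(columns) >= 2:
            installed.add(columns[1])
    return [name for name in required if name not in installed]


# Hypothetical output, for illustration only.
sample_output = "node-1 analysis-ik     5.1.1\nnode-1 analysis-pinyin 5.1.1\n"
print(missing_plugins(sample_output))
```

An empty list means every required plugin name appeared in the listing.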

4.1 IK analysis test

Create an index:

curl -XPUT "http://localhost:9200/index"

Test the analyzer:

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_max_word&text=中华人民共和国国歌"

Result:

{
  "tokens": [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
    {"token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1},
    {"token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2},
    {"token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3},
    {"token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4},
    {"token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5},
    {"token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6},
    {"token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7},
    {"token": "国", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8},
    {"token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9}
  ]
}
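Passing Chinese text directly in a URL depends on the shell and on curl handling the encoding. As a small sketch, the same `_analyze` URL can be built with the text percent-encoded as UTF-8; the function name and the default host are only illustrative.

```python
from urllib.parse import urlencode


def analyze_url(index, analyzer, text, host="http://localhost:9200"):
    """Build an _analyze URL with the query parameters percent-encoded."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{host}/{index}/_analyze?{query}"


print(analyze_url("index", "ik_max_word", "中华人民共和国国歌"))
```

The resulting URL can be passed to curl in place of the hand-written one.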

Using the ik_smart analyzer:

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=ik_smart&text=中华人民共和国国歌"

Result:

{
  "tokens": [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0},
    {"token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 1}
  ]
}

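The two results differ in granularity: ik_max_word emits every word it can find, so its tokens overlap, while ik_smart returns a coarser split without overlaps. The sketch below checks this from the offsets alone; the token lists are shortened copies of the results above and the helper name is made up.

```python
def has_overlap(tokens):
    """Return True if any token starts before an earlier token has ended."""
    furthest_end = 0
    ordered = sorted(tokens, key=lambda t: (t["start_offset"], t["end_offset"]))
    for token in ordered:
        if token["start_offset"] < furthest_end:
            return True
        furthest_end = max(furthest_end, token["end_offset"])
    return False


# Shortened copies of the analyzer results shown above.
max_word_tokens = [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7},
    {"token": "中华人民", "start_offset": 0, "end_offset": 4},
    {"token": "国歌", "start_offset": 7, "end_offset": 9},
]
smart_tokens = [
    {"token": "中华人民共和国", "start_offset": 0, "end_offset": 7},
    {"token": "国歌", "start_offset": 7, "end_offset": 9},
]
print(has_overlap(max_word_tokens), has_overlap(smart_tokens))
```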

4.2 Pinyin analysis test

Test the pinyin analyzer:

curl -XPOST "http://localhost:9200/index/_analyze?analyzer=pinyin&text=张学友"

Result:

{
  "tokens": [
    {"token": "zhang", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    {"token": "xue", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
    {"token": "you", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
    {"token": "zxy", "start_offset": 0, "end_offset": 3, "type": "word", "position": 3}
  ]
}
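The shape of this result is easy to describe: one token per character holding its full pinyin, plus one token made of the first letters. The sketch below only reproduces that shape from a hand-supplied list of syllables; it is not the plugin's algorithm, and the character-to-pinyin lookup is assumed to have happened already.

```python
def pinyin_tokens(syllables):
    """Mimic the token layout above: one token per syllable, then the initials."""
    tokens = []
    for position, syllable in enumerate(syllables):
        tokens.append({
            "token": syllable,
            "start_offset": position,
            "end_offset": position + 1,
            "type": "word",
            "position": position,
        })
    # The abbreviation spans the whole input and takes the next position.
    tokens.append({
        "token": "".join(s[0] for s in syllables),
        "start_offset": 0,
        "end_offset": len(syllables),
        "type": "word",
        "position": len(syllables),
    })
    return tokens


# Syllables for 张学友, supplied by hand.
for token in pinyin_tokens(["zhang", "xue", "you"]):
    print(token["token"], token["start_offset"], token["end_offset"])
```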

5. IK + pinyin analyzer configuration

5.1 Creating the index and configuring the analyzer

Create an index and set its analysis properties:

curl -XPUT "http://localhost:9200/medcl/" -d'
{
    "index": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"]
                }
            },
            "filter": {
                "my_pinyin": {
                    "type": "pinyin",
                    "first_letter": "prefix",
                    "padding_char": " "
                }
            }
        }
    }
}'
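In this request the custom analyzer chains the ik_smart tokenizer with two token filters: my_pinyin, which is defined in the same body, and word_delimiter, which is built into Elasticsearch. A quick sanity check, sketched below, is to confirm that every filter an analyzer names is either defined locally or known to be built in. The helper and the short built-in list are illustrative only.

```python
# A deliberately partial list of built-in token filters (assumption: extend as needed).
BUILT_IN_FILTERS = {"word_delimiter", "lowercase"}


def undefined_filters(settings):
    """Return (analyzer, filter) pairs whose filter is neither defined nor built in."""
    analysis = settings["index"]["analysis"]
    defined = set(analysis.get("filter", {}))
    missing = []
    for name, analyzer in analysis.get("analyzer", {}).items():
        for filter_name in analyzer.get("filter", []):
            if filter_name not in defined and filter_name not in BUILT_IN_FILTERS:
                missing.append((name, filter_name))
    return missing


settings = {
    "index": {
        "analysis": {
            "analyzer": {
                "ik_pinyin_analyzer": {
                    "type": "custom",
                    "tokenizer": "ik_smart",
                    "filter": ["my_pinyin", "word_delimiter"],
                }
            },
            "filter": {
                "my_pinyin": {"type": "pinyin", "first_letter": "prefix", "padding_char": " "}
            },
        }
    }
}
print(undefined_filters(settings))
```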

Create a type and set its mapping:

curl -XPOST http://localhost:9200/medcl/folks/_mapping -d'
{
    "folks": {
        "properties": {
            "name": {
                "type": "keyword",
                "fields": {
                    "pinyin": {
                        "type": "text",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_pinyin_analyzer",
                        "boost": 10
                    }
                }
            }
        }
    }
}'
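This mapping is a multi-field: name itself is a keyword (stored as one exact value), while the name.pinyin sub-field is text analyzed with ik_pinyin_analyzer. The sketch below lists the searchable paths that such a mapping produces; the helper name is made up and the mapping is trimmed to the relevant keys.

```python
def field_paths(properties):
    """Map each field and sub-field path to its declared type."""
    paths = {}
    for name, spec in properties.items():
        paths[name] = spec.get("type")
        for sub_name, sub_spec in spec.get("fields", {}).items():
            paths[f"{name}.{sub_name}"] = sub_spec.get("type")
    return paths


# The "properties" part of the mapping above, trimmed to the relevant keys.
properties = {
    "name": {
        "type": "keyword",
        "fields": {
            "pinyin": {"type": "text", "analyzer": "ik_pinyin_analyzer"}
        },
    }
}
print(field_paths(properties))
```

Queries that should benefit from IK and pinyin analysis therefore need to target name.pinyin rather than name.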

5.2 Indexing test documents

Index two test documents.
Document 1:

curl -XPOST http://localhost:9200/medcl/folks/andy -d'{"name":"刘德华"}'

Document 2:

curl -XPOST http://localhost:9200/medcl/folks/tina -d'{"name":"中华人民共和国国歌"}'
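With more than a couple of documents, the `_bulk` endpoint is usually more convenient than one request per document. As a sketch, the helper below builds a bulk body for the same two documents: one action line followed by one source line per document, each terminated by a newline. The helper name is made up, and sending the body is left out.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a newline-delimited _bulk body that indexes each document by id."""
    lines = []
    for doc_id, source in docs.items():
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = {"andy": {"name": "刘德华"}, "tina": {"name": "中华人民共和国国歌"}}
print(bulk_body("medcl", "folks", docs), end="")
```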

5.3 Test (1): pinyin analysis

All four of the following commands match "刘德华":

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:liu"

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:de"

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:hua"

curl -XPOST "http://localhost:9200/medcl/folks/_search?q=name.pinyin:ldh"
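Why all four queries hit the same document is easiest to see with a toy inverted index. The sketch below assumes that the analyzer emits liu, de, hua and ldh for 刘德华, by analogy with the 张学友 result in 4.2; that token list is an assumption and was not captured from the cluster.

```python
from collections import defaultdict


def build_index(doc_tokens):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in doc_tokens.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


# Assumed tokens for 刘德华 (document id "andy").
index = build_index({"andy": ["liu", "de", "hua", "ldh"]})
for term in ["liu", "de", "hua", "ldh"]:
    print(term, sorted(index.get(term, set())))
```

Each query term is looked up as a whole token, so any of the syllables or the initials finds the document.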

5.4 Test (2): IK analysis

curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{
    "query": {
        "match": {
            "name.pinyin": "国歌"
        }
    },
    "highlight": {
        "fields": {
            "name.pinyin": {}
        }
    }
}'

Response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 16.698704,
    "hits" : [
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "tina",
        "_score" : 16.698704,
        "_source" : {
          "name" : "中华人民共和国国歌"
        },
        "highlight" : {
          "name.pinyin" : [
            "中华人民共和国国歌"
          ]
        }
      }
    ]
  }
}

This shows that the IK analyzer is taking effect.
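The search requests in this section share one shape: a match query on name.pinyin plus a highlight on the same field. A small sketch that builds that body for any search text, so it does not have to be typed by hand each time (the function name is made up):

```python
import json


def match_with_highlight(field, text):
    """Build a match query on one field and highlight the same field."""
    return {
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


body = match_with_highlight("name.pinyin", "国歌")
print(json.dumps(body, ensure_ascii=False, indent=4))
```

The printed JSON can be passed to curl with -d against the medcl/_search endpoint.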

5.5 Test (3): pinyin + IK analysis

curl -XPOST "http://localhost:9200/medcl/_search?pretty" -d'
{
    "query": {
        "match": {
            "name.pinyin": "zhonghua"
        }
    },
    "highlight": {
        "fields": {
            "name.pinyin": {}
        }
    }
}'

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 5.9814634,
    "hits" : [
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "tina",
        "_score" : 5.9814634,
        "_source" : {
          "name" : "中华人民共和国国歌"
        },
        "highlight" : {
          "name.pinyin" : [
            "中华人民共和国国歌"
          ]
        }
      },
      {
        "_index" : "medcl",
        "_type" : "folks",
        "_id" : "andy",
        "_score" : 2.2534127,
        "_source" : {
          "name" : "刘德华"
        },
        "highlight" : {
          "name.pinyin" : [
            "刘德华"
          ]
        }
      }
    ]
  }
}

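Both documents come back for "zhonghua", with tina ranked well above andy. One plausible reading, which I have not verified against the plugin, is that the query is split into zhong and hua, and 刘德华 also contains 华 (hua), so it matches on one term only. The sketch below just pulls ids and scores out of a response trimmed from the one above, which makes the ranking easier to compare; the helper name is made up.

```python
def summarize_hits(response):
    """Return (document id, score) pairs in the order Elasticsearch ranked them."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


# Trimmed copy of the response shown above.
response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "tina", "_score": 5.9814634, "_source": {"name": "中华人民共和国国歌"}},
            {"_id": "andy", "_score": 2.2534127, "_source": {"name": "刘德华"}},
        ],
    }
}
for doc_id, score in summarize_hits(response):
    print(doc_id, score)
```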

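With the pinyin analyzer set up this way, searches must target the field with the .pinyin suffix. Searching the original name field returns no results, because name is mapped as keyword and therefore only matches its exact, complete value.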

6. References

https://github.com/medcl/elasticsearch-analysis-ik

https://github.com/medcl/elasticsearch-analysis-pinyin

https://my.oschina.net/xiaohui249/blog/214505
