Using the IK Analyzer in an Elasticsearch Cluster

2016-04-26 11:28

Installing the IK Analysis Plugin

ES Cluster Environment

Three VMware virtual machines running Ubuntu 14.04.2 LTS

JDK 1.8.0_66

Elasticsearch 2.3.1

elasticsearch-jdbc-2.3.1.0

IK Analyzer 1.9.1

Cluster name (cluster.name): my-application

The nodes are assigned as follows:

VM | IP | Node
----|----|----
search1 | 192.168.235.133 | node-1
search2 | 192.168.235.134 | node-2
search3 | 192.168.235.135 | node-3

Downloading and Building the IK Analyzer

Download the IK Analyzer source as a zip archive from GitHub:

https://github.com/myitroad/elasticsearch-analysis-ik

Unzip it and import it into IntelliJ IDEA as a Maven project.

Building the Package

In the IntelliJ IDEA terminal, run the following Maven commands:

mvn clean
mvn compile
mvn package

The build writes the following file to F:\workspace_idea\elasticsearch-analysis-ik-master\target\releases:

elasticsearch-analysis-ik-1.9.1.zip

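Uploading the IK Analyzer

Upload the zip package to one of the Elasticsearch nodes (any node will do for this test, for example node-1) and unzip it into the directory

/home/es/cluster/elasticsearch-2.3.1/plugins/ik

Note that Elasticsearch expects Java plugins such as IK to be installed on every node in the cluster, so repeat this step on the remaining nodes before relying on the IK analyzers in an index.

The resulting ik folder has the following layout: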

ik
├── commons-codec-1.9.jar
├── commons-logging-1.2.jar
├── config
│   └── ik
│       ├── custom
│       │   ├── ext_stopword.dic
│       │   ├── mydict.dic
│       │   ├── single_word.dic
│       │   ├── single_word_full.dic
│       │   ├── single_word_low_freq.dic
│       │   └── sougou.dic
│       ├── IKAnalyzer.cfg.xml
│       ├── main.dic
│       ├── preposition.dic
│       ├── quantifier.dic
│       ├── stopword.dic
│       ├── suffix.dic
│       └── surname.dic
├── elasticsearch-analysis-ik-1.9.1.jar
├── httpclient-4.4.1.jar
├── httpcore-4.4.1.jar
└── plugin-descriptor.properties

Configuring the Dictionaries (IK ships with a Sogou dictionary)

Edit $ES_HOME/plugins/ik/config/ik/IKAnalyzer.cfg.xml

and add the following entry:

<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic;custom/sougou.dic</entry>
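Before restarting, it can be worth confirming that every dictionary named in ext_dict actually exists. The sketch below is only an illustration: check_dicts is a hypothetical helper, not part of IK, and it assumes the entries are paths relative to the directory that holds IKAnalyzer.cfg.xml, as the folder layout above suggests.

```shell
#!/bin/sh
# check_dicts CONFIG_DIR "a.dic;b.dic"
# Hypothetical helper: prints every listed dictionary that is missing under
# CONFIG_DIR and returns non-zero if at least one is missing.
# The body runs in a subshell so the IFS change does not leak to the caller.
check_dicts() (
  config_dir=$1
  set -f             # disable globbing while the list is split
  IFS=';'            # ext_dict entries are separated by semicolons
  missing=0
  for rel in $2; do  # unquoted on purpose so the value splits on IFS
    if [ ! -f "$config_dir/$rel" ]; then
      echo "missing: $config_dir/$rel"
      missing=1
    fi
  done
  exit "$missing"
)
```

On node-1 it could be called as check_dicts "$ES_HOME/plugins/ik/config/ik" "custom/mydict.dic;custom/single_word_low_freq.dic;custom/sougou.dic"; no output and exit status 0 mean all three files were found.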

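Restart node-1 so that the new dictionary configuration is loaded.

Testing IK Tokenization

Passing raw Chinese text to the _analyze API in the URL query string may produce garbled output, so URL-encode the Chinese text first.

%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA is the URL-encoded form of "我是中国人" ("I am Chinese").

If "我是中国人" is sent unencoded to test tokenization, the response may contain garbled characters.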
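The encoded string does not have to come from an online converter. As a minimal sketch, the hypothetical urlencode function below percent-encodes every byte of its UTF-8 argument using only od, tr and sed. Encoding every byte, ASCII included, is more than strictly necessary but still valid percent-encoding.

```shell
#!/bin/sh
# urlencode STRING
# Hypothetical helper: percent-encodes every byte of STRING.
urlencode() {
  printf '%s' "$1" |
    od -An -tx1 |            # dump the bytes as hex pairs
    tr -d ' \n' |            # join the pairs into one string
    sed 's/\(..\)/%\1/g' |   # prefix each pair with %
    tr 'a-f' 'A-F'           # upper-case the hex digits
}

# Build the _analyze URL used in this article from the raw text.
encoded=$(urlencode '我是中国人')
echo "localhost:9200/myindex/_analyze?analyzer=ik_smart&text=${encoded}&pretty"
```

The printed URL can be passed to curl -XGET in single quotes, exactly like the hand-encoded URLs in the tests that follow.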

Finest-grained tokenization with IK's ik_max_word

es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_max_word&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'

The response lists the tokens:

{
  "tokens" : [ {
    "token" : "我是",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "我",
    "start_offset" : 0,
    "end_offset" : 1,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "是中国人",
    "start_offset" : 1,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "中国人",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "中国",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "国人",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 5
  }, {
    "token" : "人",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 6
  } ]
}

Coarsest-grained tokenization with IK's ik_smart

es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_smart&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'

Response:

{
  "tokens" : [ {
    "token" : "我是",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "中国人",
    "start_offset" : 2,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 1
  } ]
}
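To compare the two analyzers at a glance, it helps to reduce each response to its token list. The sketch below assumes the pretty-printed layout shown in this article, with one "token" field per line; extract_tokens is a hypothetical helper and not a general JSON parser.

```shell
#!/bin/sh
# extract_tokens
# Hypothetical helper: reads a pretty-printed _analyze response on stdin and
# prints one token per line. Splitting each line on double quotes makes the
# key the 2nd field and its value the 4th.
extract_tokens() {
  awk -F'"' '$2 == "token" { print $4 }'
}

# Demo on a fragment of the ik_smart response shown above.
extract_tokens <<'EOF'
{
  "tokens" : [ {
    "token" : "我是",
    "start_offset" : 0,
    "end_offset" : 2
  }, {
    "token" : "中国人",
    "start_offset" : 2,
    "end_offset" : 5
  } ]
}
EOF
```

Piping the output of either curl command above into extract_tokens gives a compact list; for the responses in this article that is seven tokens for ik_max_word and two for ik_smart.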

Importing MySQL Data with the IK Analyzer

Creating the myindex Index

On node-1, run:

curl -XPUT 'localhost:9200/myindex?pretty'

Write the MySQL-to-ES import script mysql-es-all.sh (it can be stored anywhere):

#!/bin/sh
# Locations of the elasticsearch-jdbc log4j2 config (bin) and jars (lib)
bin=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/bin
lib=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/lib
# The importer definition is passed to the JDBC importer on stdin
echo '
{
    "type" : "jdbc",
    "jdbc" : {
        "locale" : "zh_CN",
        "statefile" : "statefile.json",
        "timezone" : "GMT+8",
        "autocommit" : true,
        "elasticsearch" : {
            "cluster" : "my-application",
            "host" : "192.168.235.133",
            "port" : "9300"
        },
        "index" : "myindex",
        "type" : "mytype",
        "url" : "jdbc:mysql://10.110.1.47:3306/ispider_data",
        "user" : "root",
        "password" : "xxx",
        "sql" : "select uuid as _id,title,content,release_time from JCY_VOICE_NEWS_INFO",
        "metrics" : {
            "enabled" : true,
            "interval" : "5m"
        },
        "index_settings" : {
            "index" : {
                "number_of_shards" : 2,
                "number_of_replicas" : 2
            }
        },
        "type_mapping": {
            "mytype" : {
                "properties" : {
                    "title" : {
                        "type" : "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                        "include_in_all": "true"
                    },
                    "content" : {
                        "type" : "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                        "include_in_all": "true"
                    },
                    "release_time":{
                        "type":"date",
                        "store":"no",
                        "format":"YYYY-MM-dd HH:mm:ss",
                        "index":"not_analyzed",
                        "include_in_all":"true"
                    }
                }
            }
        }
    }
}
' | java \
    -cp "${lib}/*" \
    -Dlog4j.configurationFile=${bin}/log4j2.xml \
    org.xbib.tools.Runner \
    org.xbib.tools.JDBCImporter

Make the script executable and run it:

es@search1:~/cluster/elasticsearch-2.3.1$ chmod +x mysql-es-all.sh
es@search1:~/cluster/elasticsearch-2.3.1$ ./mysql-es-all.sh
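Once the import finishes, a simple match query against one of the IK-analyzed fields is a quick way to see whether the documents are searchable. The sketch below only builds the request body; match_query is a hypothetical helper that does no JSON escaping, so it assumes the field name and search text contain no double quotes or backslashes.

```shell
#!/bin/sh
# match_query FIELD TEXT
# Hypothetical helper: prints a minimal match query body for FIELD and TEXT.
# No JSON escaping is done, so both arguments must be free of " and \.
match_query() {
  printf '{"query":{"match":{"%s":"%s"}}}\n' "$1" "$2"
}

# Print the body for a search on the title field.
match_query title '中国'

# To send it to the cluster from this article (needs a running node):
#   match_query title '中国' | curl -XPOST 'localhost:9200/myindex/mytype/_search?pretty' -d @-
```

Because title and content are mapped with ik_max_word as both the index and search analyzer, the query text is tokenized the same way as the stored documents.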

References

IK Analysis for Elasticsearch

https://github.com/myitroad/elasticsearch-analysis-ik

[LNMP] Full-text search solution: distributed Elasticsearch + MySQL

http://www.jianshu.com/p/638ff7b848cc

Fixing garbled Chinese text in Elasticsearch (the _analyze process)

http://www.52brt.com/2015/09/19/Elasticsearch%E4%B8%AD%E6%96%87%E4%B9%B1%E7%A0%81%E9%97%AE%E9%A2%98%E7%9A%84%E8%A7%A3%E5%86%B3/

Online encoding converter

http://tool.oschina.net/encode?type=4