Using the IK Analyzer in an Elasticsearch Cluster
2016-04-26 11:28
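Installing the IK analysis plugin
ES cluster environment
Three virtual machines under VMware, each running Ubuntu 14.04.2 LTS
JDK 1.8.0_66
Elasticsearch 2.3.1
elasticsearch-jdbc-2.3.1.0
IK Analyzer 1.9.1
Cluster name: my-application
The nodes are assigned as follows: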
VM | IP | Node
----|----|----
search1 | 192.168.235.133 | node-1
search2 | 192.168.235.134 | node-2
search3 | 192.168.235.135 | node-3
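As a rough sketch of how the table maps onto per-node settings, the function below prints an elasticsearch.yml fragment for a given node. The original post does not show its configuration files, so the choice of settings (cluster.name, node.name, network.host, discovery.zen.ping.unicast.hosts) is an assumption based on standard Elasticsearch 2.x options, and the helper name node_config is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print an elasticsearch.yml fragment for one node.
# The post does not show its config files; the values mirror the table above.
CLUSTER_NAME="my-application"
HOSTS='"192.168.235.133", "192.168.235.134", "192.168.235.135"'

node_config() {
    # $1 = node name (e.g. node-1), $2 = IP address of that node
    printf 'cluster.name: %s\n' "$CLUSTER_NAME"
    printf 'node.name: %s\n' "$1"
    printf 'network.host: %s\n' "$2"
    printf 'discovery.zen.ping.unicast.hosts: [%s]\n' "$HOSTS"
}

# Print the fragment for each node in the table.
node_config node-1 192.168.235.133
node_config node-2 192.168.235.134
node_config node-3 192.168.235.135
```

Each fragment would go into the config/elasticsearch.yml of the matching virtual machine.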
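Downloading and building the IK Analyzer
Download the IK Analyzer zip package from GitHub: https://github.com/myitroad/elasticsearch-analysis-ik
Unzip it and import it into IntelliJ IDEA as a Maven project.
Building the jar package
In IntelliJ IDEA's Terminal tool window, run the following Maven commands:
mvn clean
mvn compile
mvn package
This produces the following file under F:\workspace_idea\elasticsearch-analysis-ik-master\target\releases: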
elasticsearch-analysis-ik-1.9.1.zip
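Uploading the IK Analyzer
Upload the zip package above to one of the Elasticsearch nodes (e.g. node-1) and unzip it into the directory
/home/es/cluster/elasticsearch-2.3.1/plugins/ik
A single node is enough for the _analyze tests below. Because Elasticsearch expects plugins to be installed on every node of a cluster, repeat this step on node-2 and node-3 before indexing data with the IK analyzers.
The resulting contents of the ik folder are: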
ik
├── commons-codec-1.9.jar
├── commons-logging-1.2.jar
├── config
│   └── ik
│       ├── custom
│       │   ├── ext_stopword.dic
│       │   ├── mydict.dic
│       │   ├── single_word.dic
│       │   ├── single_word_full.dic
│       │   ├── single_word_low_freq.dic
│       │   └── sougou.dic
│       ├── IKAnalyzer.cfg.xml
│       ├── main.dic
│       ├── preposition.dic
│       ├── quantifier.dic
│       ├── stopword.dic
│       ├── suffix.dic
│       └── surname.dic
├── elasticsearch-analysis-ik-1.9.1.jar
├── httpclient-4.4.1.jar
├── httpcore-4.4.1.jar
└── plugin-descriptor.properties
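A quick way to confirm that the archive was unpacked at the right level is to check for a few of the files in the listing. The sketch below is only an illustration: the helper name check_ik_layout and the choice of which files to test are assumptions, not part of the plugin.

```shell
#!/bin/sh
# Hypothetical check: confirm that an unpacked IK plugin directory contains
# a few of the files from the listing. Returns 0 if all are present.
check_ik_layout() {
    # $1 = path to the plugin directory, e.g. $ES_HOME/plugins/ik
    for f in \
        plugin-descriptor.properties \
        elasticsearch-analysis-ik-1.9.1.jar \
        config/ik/IKAnalyzer.cfg.xml \
        config/ik/main.dic
    do
        if [ ! -f "$1/$f" ]; then
            echo "missing: $1/$f"
            return 1
        fi
    done
    echo "layout ok: $1"
}

# Example call; the default path follows the post and is only checked if it exists.
IK_DIR="${IK_DIR:-/home/es/cluster/elasticsearch-2.3.1/plugins/ik}"
if [ -d "$IK_DIR" ]; then
    check_ik_layout "$IK_DIR" || true
fi
```

A common mistake this catches is an extra nesting level, such as plugins/ik/elasticsearch-analysis-ik-1.9.1/..., where plugin-descriptor.properties does not sit directly under ik.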
Configuring the dictionaries (IK ships with a Sogou dictionary)
Edit: $ES_HOME/plugins/ik/config/ik/IKAnalyzer.cfg.xml
Add the following entry:
<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic;custom/sougou.dic</entry>
Restart node-1 so that the plugin and the dictionary configuration are loaded.
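A sanity check on the ext_dict entry, ideally run before the restart, can be sketched as follows. It assumes that the dictionary paths are relative to the directory holding IKAnalyzer.cfg.xml, which is what the directory listing suggests. The helper name check_ext_dicts is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: read the dictionaries named in the ext_dict entry of
# IKAnalyzer.cfg.xml and report any that are missing on disk.
check_ext_dicts() {
    # $1 = directory containing IKAnalyzer.cfg.xml (e.g. .../plugins/ik/config/ik)
    cfg="$1/IKAnalyzer.cfg.xml"
    status=0
    # Pull the text between <entry key="ext_dict"> and </entry>, then turn
    # the semicolon-separated paths into whitespace-separated words.
    for dict in $(sed -n 's:.*<entry key="ext_dict">\(.*\)</entry>.*:\1:p' "$cfg" | tr ';' ' '); do
        if [ -f "$1/$dict" ]; then
            echo "found:   $dict"
        else
            echo "missing: $dict"
            status=1
        fi
    done
    return $status
}

# Example call, only attempted when ES_HOME points at an installation.
if [ -n "${ES_HOME:-}" ] && [ -f "$ES_HOME/plugins/ik/config/ik/IKAnalyzer.cfg.xml" ]; then
    check_ext_dicts "$ES_HOME/plugins/ik/config/ik" || true
fi
```

The function assumes the dictionary paths contain no spaces, which holds for the three files configured above.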
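Testing the IK analyzer
Passing Chinese text directly to the _analyze API on the command line may produce garbled characters, so URL-encode the text first. %E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA is the URL-encoded (UTF-8) form of “我是中国人” ("I am Chinese").
If “我是中国人” is passed unencoded, the response may come back garbled.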
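For readers who prefer not to rely on an online converter, the encoding can be reproduced in the shell. The following is a minimal sketch, assuming a UTF-8 environment and the standard od, tr and sed utilities. The function name urlencode is only illustrative.

```shell
#!/bin/sh
# Percent-encode every byte of a string. This simple sketch also encodes
# plain ASCII characters, which is valid but more verbose than necessary.
urlencode() {
    # Dump the bytes as hex, strip spaces and newlines, prefix each
    # two-digit pair with '%', and upper-case the hex digits.
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | sed 's/../%&/g' | tr 'a-f' 'A-F'
    echo
}

urlencode '我是中国人'
# prints %E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA
```

The result can be pasted into the text= parameter of the _analyze requests. Alternatively, curl can do the encoding itself when invoked with -G together with --data-urlencode 'text=我是中国人'.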
Using IK's ik_max_word analyzer (finest-grained segmentation, emitting every candidate word)
es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_max_word&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'
The returned tokens:
{
  "tokens" : [
    { "token" : "我是", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "我", "start_offset" : 0, "end_offset" : 1, "type" : "CN_WORD", "position" : 1 },
    { "token" : "是中国人", "start_offset" : 1, "end_offset" : 5, "type" : "CN_WORD", "position" : 2 },
    { "token" : "中国人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 3 },
    { "token" : "中国", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 4 },
    { "token" : "国人", "start_offset" : 3, "end_offset" : 5, "type" : "CN_WORD", "position" : 5 },
    { "token" : "人", "start_offset" : 4, "end_offset" : 5, "type" : "CN_WORD", "position" : 6 }
  ]
}
Using IK's ik_smart analyzer (coarsest-grained segmentation)
es@search1:~/cluster/elasticsearch-2.3.1$ curl -XGET 'localhost:9200/myindex/_analyze?analyzer=ik_smart&text=%E6%88%91%E6%98%AF%E4%B8%AD%E5%9B%BD%E4%BA%BA&pretty'
This returns:
{
  "tokens" : [
    { "token" : "我是", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "中国人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 1 }
  ]
}
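When comparing the two analyzers it can be easier to look at the tokens alone. The sketch below pulls them out of an _analyze response with grep and sed. It relies on the "token" : "..." layout shown in the responses above, so a JSON-aware tool would be more robust. The function name extract_tokens is hypothetical.

```shell
#!/bin/sh
# Print one token per line from an _analyze response read on stdin.
# Matches only the "token" key, not the enclosing "tokens" array name.
extract_tokens() {
    grep -o '"token" *: *"[^"]*"' | sed 's/.*: *"\(.*\)"$/\1/'
}

# Sample input: the ik_smart response from the post.
extract_tokens <<'EOF'
{ "tokens" : [ { "token" : "我是", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 }, { "token" : "中国人", "start_offset" : 2, "end_offset" : 5, "type" : "CN_WORD", "position" : 1 } ] }
EOF
```

Against a live node, the output of either curl command above could be piped into extract_tokens in the same way.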
Importing MySQL data with the IK Analyzer
Create the myindex index by running the following on node-1:
curl -XPUT 'localhost:9200/myindex?pretty'
Write the MySQL-to-ES import script mysql-es-all.sh (it can be stored anywhere):
#!/bin/sh
bin=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/bin
lib=/home/es/cluster/elasticsearch-2.3.1/elasticsearch-jdbc-2.3.1.0/lib
echo '
{
    "type" : "jdbc",
    "jdbc" : {
        "locale" : "zh_CN",
        "statefile" : "statefile.json",
        "timezone" : "GMT+8",
        "autocommit" : true,
        "elasticsearch" : {
            "cluster" : "my-application",
            "host" : "192.168.235.133",
            "port" : "9300"
        },
        "index" : "myindex",
        "type" : "mytype",
        "url" : "jdbc:mysql://10.110.1.47:3306/ispider_data",
        "user" : "root",
        "password" : "xxx",
        "sql" : "select uuid as _id,title,content,release_time from JCY_VOICE_NEWS_INFO",
        "metrics" : {
            "enabled" : true,
            "interval" : "5m"
        },
        "index_settings" : {
            "index" : {
                "number_of_shards" : 2,
                "number_of_replicas" : 2
            }
        },
        "type_mapping": {
            "mytype" : {
                "properties" : {
                    "title" : {
                        "type" : "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                        "include_in_all": "true"
                    },
                    "content" : {
                        "type" : "string",
                        "store": "no",
                        "term_vector": "with_positions_offsets",
                        "analyzer": "ik_max_word",
                        "search_analyzer": "ik_max_word",
                        "include_in_all": "true"
                    },
                    "release_time": {
                        "type": "date",
                        "store": "no",
                        "format": "YYYY-MM-dd HH:mm:ss",
                        "index": "not_analyzed",
                        "include_in_all": "true"
                    }
                }
            }
        }
    }
}
' | java \
    -cp "${lib}/*" \
    -Dlog4j.configurationFile=${bin}/log4j2.xml \
    org.xbib.tools.Runner \
    org.xbib.tools.JDBCImporter
Make the script executable and run it:
es@search1:~/cluster/elasticsearch-2.3.1$ chmod +x mysql-es-all.sh
es@search1:~/cluster/elasticsearch-2.3.1$ ./mysql-es-all.sh
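Once the import has finished, one way to check that the IK-analyzed fields are searchable is a match query with highlighting on the title field. The original post stops at the import, so the following is only an illustrative sketch. The helper name build_title_query is hypothetical, and the field, index and type names are taken from the import script.

```shell
#!/bin/sh
# Hypothetical helper: build a match query on the "title" field with
# highlighting. The field name comes from the mapping in the import script.
build_title_query() {
    # $1 = search text; assumed to contain no double quotes or backslashes
    printf '{"query":{"match":{"title":"%s"}},"highlight":{"fields":{"title":{}}}}\n' "$1"
}

# Print the request body for a sample search term.
build_title_query '中国'

# To send it to the cluster (requires a running node), something like:
#   build_title_query '中国' | curl -XPOST 'localhost:9200/myindex/mytype/_search?pretty' -d @-
```

Because the body is sent as JSON rather than in the URL, the Chinese text does not need to be URL-encoded here.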
References
IK Analysis for Elasticsearch
https://github.com/myitroad/elasticsearch-analysis-ik
[LNMP] Full-text search solution: distributed Elasticsearch + MySQL
http://www.jianshu.com/p/638ff7b848cc
Fixing garbled Chinese text in Elasticsearch (the _analyze process)
http://www.52brt.com/2015/09/19/Elasticsearch%E4%B8%AD%E6%96%87%E4%B9%B1%E7%A0%81%E9%97%AE%E9%A2%98%E7%9A%84%E8%A7%A3%E5%86%B3/
Online encoding converter
http://tool.oschina.net/encode?type=4