
Solr: configuring synonym indexing and search with mmseg4j (IKAnalyzer needs source changes to fit the Solr interface before it can use synonyms)

2014-07-02 14:15

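Concept: with synonym support, when a user enters a term, Solr also retrieves from the index the passages that contain terms with the same or a similar meaning and shows them to the user, which makes search friendlier. The synonyms themselves must be defined in advance in a configuration file. For example, if 周笔畅 and 笔笔 are defined as synonyms, a search for 笔笔 can also return documents that mention 周笔畅.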

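Search: http://localhost:8080/solr/testcore/select/?q=content:笔笔音乐会

The results include not only 笔笔音乐会 but also 周笔畅音乐会. (This assumes that 笔笔 and 周笔畅 both exist in the dictionary, that they are configured as synonyms, and that the mmseg4j analyzer can segment them out as terms.)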
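The query URL above contains a colon and Chinese characters, which are safer to send percent-encoded. The following is a minimal sketch that builds the same request URL; the host, port, core name (testcore), and field name (content) are taken from the example above, while the helper name build_select_url is made up for illustration. It only constructs the URL and does not contact Solr.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Endpoint taken from the example in this post; adjust to your own setup.
BASE_URL = "http://localhost:8080/solr/testcore/select/"


def build_select_url(field, term):
    """Return a select URL whose q parameter is percent-encoded as UTF-8."""
    query = f"{field}:{term}"
    return BASE_URL + "?" + urlencode({"q": query}, quote_via=quote)


url = build_select_url("content", "笔笔音乐会")
print(url)

# Round-trip check: decoding the query string gives back the original query.
decoded = parse_qs(urlsplit(url).query)["q"][0]
print(decoded)
```

Encoding the whole field:term pair keeps the URL pure ASCII, so it can be pasted into tools that do not handle raw Chinese characters in URLs.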

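The configuration:

Solr ships with a synonyms feature, but it is quite limited: Chinese search has to be built on top of word segmentation, so the stock configuration is of little use by itself.

I) The official configuration: this configuration is mentioned in the cookbook, but as given there it cannot be combined with Chinese word segmentation, so it is basically useless for Chinese.

1. Add a <fieldType> inside the <types> tag of schema.xml, as follows: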

<fieldtype name="text_mmseg" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="mmseg4jdic"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<field name="content" type="text_mmseg" indexed="true" stored="true"/>

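The part highlighted in blue in the original post (presumably the tokenizer class) can be replaced with another one, such as "solr.WhitespaceTokenizerFactory". Note: IKAnalyzer's IKTokenizerFactory cannot be used in Solr directly; its source has to be modified, recompiled, and repackaged before it can be used with the synonym configuration.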
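In a Solr analyzer the tokenizer runs first and the filters are then applied in the order in which they are declared, so the synonym filter only sees the terms that the tokenizer produced. The sketch below simply parses the fieldtype definition from this post with Python's standard XML library and lists that order; the helper name analysis_chain is made up for illustration, and nothing here talks to Solr.

```python
import xml.etree.ElementTree as ET

# The fieldtype definition from this post, copied as a string.
FIELD_TYPE_XML = """
<fieldtype name="text_mmseg" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="mmseg4jdic"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
"""


def analysis_chain(field_type_xml):
    """Return the tokenizer class followed by the filter classes, in declared order."""
    root = ET.fromstring(field_type_xml.strip())
    analyzer = root.find("analyzer")
    tokenizer = analyzer.find("tokenizer")
    filters = analyzer.findall("filter")
    return [tokenizer.get("class")] + [f.get("class") for f in filters]


chain = analysis_chain(FIELD_TYPE_XML)
for step, class_name in enumerate(chain, start=1):
    print(step, class_name)
```

Swapping the tokenizer, as the note above suggests, changes only the first entry of this chain; the synonym filter then works on whatever terms the new tokenizer emits.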

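2. The synonyms.txt file referenced here already ships with the configuration files. It is the synonym configuration file and sits in the same directory as schema.xml. Its general format is as follows: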

# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#-----------------------------------------------------------------------
#some test synonym mappings unlikely to appear in real input text
aaafoo => aaabar
bbbfoo => bbbfoo bbbbar
cccfoo => cccbar cccbaz
fooaaa,baraaa,bazaaa

# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.

# Synonym mappings can be used for spelling correction too
pixima => pixma

杨颖 => angelababy
angelababy => 杨颖
徐熙媛 => 大S
徐熙娣 => 小S
周笔畅 => 笔笔
杨颖,angelababy
angelababy,杨颖
徐熙媛,大S,小S姐姐
徐熙娣,小S,大S妹妹
周笔畅,笔笔,超女笔笔

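I have added the Chinese entries to the file above. Because of character-set issues, after editing the file open it in EditNote and choose Format --> "UTF-8 without BOM" encoding; if garbled characters appear, fix the encoding. The entries mean that searching for any of the Chinese terms in an entry returns the same results.

The file uses two kinds of rules: those written with => and those separated by commas. According to the official documentation, => defines an explicit mapping, in which the terms on the left are replaced by the terms on the right, while a comma-separated line defines a group of equivalent terms, that is, many-to-many. How a group is applied depends on the expand attribute of the filter: with expand="true" every term in the group maps to all of them, and with expand="false", as in the fieldtype above, every term maps to the first one in the line.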
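The two rule types can be sketched in a few lines of Python. This is a simplified model of the documented behaviour, not Solr's implementation: it ignores ignoreCase, treats a multi-word entry as one string, and the function name parse_synonym_rules is made up for illustration.

```python
def parse_synonym_rules(text, expand=True):
    """Build a term -> replacement-list table from synonyms.txt-style text."""
    table = {}

    def add(source, targets):
        # Rules for the same source term are merged, without duplicates.
        bucket = table.setdefault(source, [])
        for target in targets:
            if target not in bucket:
                bucket.append(target)

    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: each left-hand term becomes the right-hand terms.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",") if t.strip()]
            targets = [t.strip() for t in right.split(",") if t.strip()]
            for source in sources:
                add(source, targets)
        else:
            # Equivalence group: the result depends on the expand setting.
            group = [t.strip() for t in line.split(",") if t.strip()]
            targets = group if expand else group[:1]
            for source in group:
                add(source, targets)
    return table


RULES = """
# explicit mapping
pixima => pixma
# equivalence group
周笔畅,笔笔,超女笔笔
"""

expanded = parse_synonym_rules(RULES, expand=True)
reduced = parse_synonym_rules(RULES, expand=False)
print(expanded["笔笔"])  # every member of the group
print(reduced["笔笔"])   # only the first term of the group
print("pixma" in expanded)  # explicit mappings are one-way
```

Under this model the explicit mapping is one-way (pixima leads to pixma, but not the reverse), whereas a group makes all of its members interchangeable when expand is on.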
