
Solr: basic principles and configuration, plus an introductory Java application

2016-01-05 17:16

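Our company's e-commerce platform needs a search engine for product search. This post briefly records what I learned while setting up and using Solr. It is for reference only; if you spot a mistake, please point it out.

Solr version: 4.10.1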

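Basic principles of a search engine:

1. Build the index through tokenization

a. Tokenize the stored data field by field to obtain a term dictionary.

Solr/Lucene uses an inverted index. An inverted index maps each term (keyword) to the documents that contain it; the index that stores this mapping is what we call the inverted index.

b. Once the index and the term dictionary exist, the index works much like the radical and stroke-count lookup tables of the Xinhua Dictionary: a term leads to document IDs, and the document IDs lead to the documents themselves.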
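To make the term-to-document mapping concrete, here is a minimal sketch of an inverted index in plain Java. It is not how Lucene stores its index internally; the class name InvertedIndexSketch, the whitespace tokenizer and the sample product names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {

    /** Maps each lower-cased term to the ids (list positions) of the documents containing it. */
    static Map<String, List<Integer>> build(List<String> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Toy tokenizer: split on whitespace. Chinese text needs a real analyzer such as IK.
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<Integer> postings = index.computeIfAbsent(term, k -> new ArrayList<>());
                // Document ids arrive in increasing order, so checking the last entry avoids duplicates.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("red running shoes", "basketball shoes", "red jacket");
        Map<String, List<Integer>> index = build(docs);
        // Looking up a term returns the ids of the documents that contain it.
        System.out.println("shoes -> " + index.get("shoes"));
        System.out.println("red   -> " + index.get("red"));
    }
}
```

A search for a term is then a single map lookup followed by fetching the listed documents, which is the dictionary analogy from step b.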

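Solr is already installed in our environment and already contains one core (collection1), so we need to add another core alongside it.

In the Solr installation directory, go to example/solr and create a new folder, for example collection2.

Inside collection2, create conf and data directories to hold the configuration files and the index data.

You can copy everything from the existing collection1/conf and then edit the two files schema.xml and solrconfig.xml.

Note: the installation creates one core automatically. I recommend creating a new core yourself as practice; the configuration files of the bundled demo are a good reference.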
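The directory steps above can also be scripted. The following is a rough sketch in plain Java (java.nio.file only); the class name NewCoreSketch and its method are hypothetical. One part goes beyond what the post describes: it writes a core.properties file containing name=<core>, on the assumption that your Solr 4.x instance discovers cores that way. If your solr.xml lists cores explicitly, register the core there or through the admin UI instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

final class NewCoreSketch {

    /** Creates solrHome/newCore/{conf,data} and copies the template core's conf tree into it. */
    static Path createCore(Path solrHome, String templateCore, String newCore) throws IOException {
        Path srcConf = solrHome.resolve(templateCore).resolve("conf");
        Path coreDir = solrHome.resolve(newCore);
        Path dstConf = coreDir.resolve("conf");
        Files.createDirectories(dstConf);
        Files.createDirectories(coreDir.resolve("data"));

        // Walk the template conf directory (parents come before children) and mirror it.
        try (Stream<Path> walk = Files.walk(srcConf)) {
            for (Path src : (Iterable<Path>) walk::iterator) {
                Path dst = dstConf.resolve(srcConf.relativize(src).toString());
                if (Files.isDirectory(src)) {
                    Files.createDirectories(dst);
                } else {
                    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }

        // ASSUMPTION: core discovery through core.properties; not mentioned in the original post.
        Files.write(coreDir.resolve("core.properties"),
                ("name=" + newCore + "\n").getBytes(StandardCharsets.UTF_8));
        return coreDir;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: NewCoreSketch <path to example/solr>");
            return;
        }
        System.out.println("created " + createCore(Paths.get(args[0]), "collection1", "collection2"));
    }
}
```

After running it, schema.xml and solrconfig.xml in the new conf directory still have to be edited by hand, and Solr has to be restarted or the core reloaded before it shows up.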

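schema.xml:

schema.xml can be found under \solr\example\solr\collection1\conf in the directory where you unpacked the Solr download. It is the file that defines the Solr schema.

Open it and you will find detailed comments. It mainly configures the fields to be stored. The schema is organized into three main parts: types, fields, and other settings.

types defines the field types that will be used: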

<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
<filter class="solr.EnglishMinimalStemFilterFactory"/>
-->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
<filter class="solr.EnglishMinimalStemFilterFactory"/>
-->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>

<fieldType name="text_cn" class="solr.TextField">
<!-- isMaxWordLength: true uses maximum-word-length segmentation, false uses finest-grained segmentation -->
<analyzer type="index" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
<analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

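The IK Analyzer is used here for Chinese tokenization.

fields: these are similar to the columns of a database table.

Each field declares its type using one of the types defined above.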

<field name="_version_" type="long" indexed="true" stored="true"/>

<!-- points to the root document of a block of nested documents. Required for nested
document support, may be removed otherwise
-->
<field name="_root_" type="string" indexed="true" stored="false"/>

<field name="id" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="goodsId" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="goodsName" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="coverUrl" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="thumbUrl" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="goodsPrice" type="float" indexed="true" stored="true" multiValued="false"/>
<field name="categoryId" type="long" indexed="true" stored="true" multiValued="false"/>
<field name="categoryName" type="string" indexed="true" stored="true" multiValued="false"/>

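Other settings:

uniqueKey: the unique key. It names one of the fields declared above, usually something that never repeats, such as id or url. It is used when updating and deleting documents.

When you call add(document) and a document with the same id already exists, the newly added document overwrites the existing one.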
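The overwrite behaviour of the unique key can be pictured as a map keyed by id. This is only a toy model in plain Java, not Solr's implementation; the class UniqueKeySketch and the sample field values are made up for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class UniqueKeySketch {

    // One entry per unique key value, like one live document per id in a core.
    private final Map<String, Map<String, Object>> docsById = new LinkedHashMap<>();

    /** Adds a document; a document with the same "id" replaces the one stored earlier. */
    void add(Map<String, Object> doc) {
        String id = String.valueOf(doc.get("id"));
        docsById.put(id, doc);
    }

    Map<String, Object> get(String id) {
        return docsById.get(id);
    }

    int size() {
        return docsById.size();
    }

    static Map<String, Object> doc(String id, String goodsName) {
        Map<String, Object> d = new HashMap<>();
        d.put("id", id);
        d.put("goodsName", goodsName);
        return d;
    }

    public static void main(String[] args) {
        UniqueKeySketch core = new UniqueKeySketch();
        core.add(doc("1001", "running shoes"));
        core.add(doc("1001", "running shoes (2016 edition)")); // same id: replaces the first
        System.out.println(core.size() + " document(s), goodsName = " + core.get("1001").get("goodsName"));
    }
}
```

This is why re-indexing a product with its goodsId as the id updates the existing entry instead of creating a duplicate.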

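defaultSearchField: the default search field, that is, the field searched when a query such as q=solr does not name one.

solrQueryParser: the default operator used to combine query terms, either AND or OR (in query strings, AND/OR must be written in uppercase).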
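The effect of the default operator is easiest to see on a tiny example. The sketch below is plain Java and only models the idea: with AND every query term must occur in the document, with OR one matching term is enough. The names DefaultOperatorSketch, Op and matches are hypothetical, and real query parsing involves much more (scoring, phrases, field names).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DefaultOperatorSketch {

    enum Op { AND, OR }

    /** Returns true if the document's terms satisfy the query terms under the given operator. */
    static boolean matches(Set<String> docTerms, List<String> queryTerms, Op op) {
        if (op == Op.AND) {
            return docTerms.containsAll(queryTerms); // every term is required
        }
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                return true; // one hit is enough
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("red", "running", "shoes"));
        List<String> query = Arrays.asList("red", "jacket");
        System.out.println("AND matches: " + matches(doc, query, Op.AND));
        System.out.println("OR matches:  " + matches(doc, query, Op.OR));
    }
}
```

For product search, AND usually gives fewer but more precise hits, while OR favours recall; which default suits you depends on how users type their queries.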

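solrconfig.xml can mostly be used as is.

The rest of this post focuses on using Solr during development. The configuration above is fairly simple, and matching examples are easy to find online.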

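With Solr installed and the core created, here is a simple application.

1. Before doing anything else, obtain a connection to Solr:

Note: this method lives in the SolrJUtil class.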

private static final String DEFAULT_URL = "http://localhost:8080/solr/collection1";
private static HttpSolrServer server; // shared client, created on first use

public static HttpSolrServer getServer() {
    synchronized (SolrJUtil.class) {
        if (server == null) {
            server = new HttpSolrServer(DEFAULT_URL);

            server.setMaxRetries(1); // defaults to 0. > 1 not recommended.
            server.setConnectionTimeout(5000); // 5 seconds to establish TCP
            server.setSoTimeout(1000); // socket read timeout
            server.setDefaultMaxConnectionsPerHost(100);
            server.setMaxTotalConnections(100);
            server.setFollowRedirects(false); // defaults to false
            server.setAllowCompression(true);
        }
    }
    return server;
}
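As a side note on the design, a shared client that is created once on first use can also be written with the initialization-on-demand holder idiom, which needs no explicit lock. The sketch below uses a hypothetical stand-in class Client instead of HttpSolrServer so that it runs without SolrJ on the classpath; it only illustrates the idiom and is not a drop-in replacement for the method above.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LazyClientSketch {

    /** Counts constructor calls so the single-creation property can be observed. */
    static final AtomicInteger CREATED = new AtomicInteger();

    /** Hypothetical stand-in for the real Solr client class. */
    static final class Client {
        final String url;

        Client(String url) {
            this.url = url;
            CREATED.incrementAndGet();
        }
    }

    // The JVM initializes Holder once, thread-safely, the first time getClient() touches it.
    private static final class Holder {
        static final Client INSTANCE = new Client("http://localhost:8080/solr/collection1");
    }

    static Client getClient() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        Client a = getClient();
        Client b = getClient();
        System.out.println("same instance: " + (a == b) + ", constructed " + CREATED.get() + " time(s)");
    }
}
```

The synchronized version in the post is equally correct; the holder form simply avoids taking a lock on every call.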

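2. Fetch and tidy the data, tokenize the relevant fields to build a basic term list, and add the data to Solr as documents

// Partial pseudocode
// Steps:
// 1. Get the connection for the Solr core
// 2. Delete all data in the core
// 3. Fetch and tidy the data from the database
// 4. Tokenize
// 5. Build a document from each token and store it in Solr; build another document from the other queried data and store it in Solr
// 6. Commit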

public static boolean initDbIndex(Connection conn) {
ResultSet result = null;
HanyuPinyinOutputFormat spellFormat = new HanyuPinyinOutputFormat();
int i = 0;
try {

server.deleteByQuery("*:*");

Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
List<String> uniqueData = new ArrayList<String>();

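// Fetch the data to be stored in the search engine from the database here (queries omitted).
// result = the result set fetched from the database
// NOTE: the two loops below need two separate result sets (comment counts, then goods rows).
// A forward-only ResultSet that the first loop has read to the end yields no rows in the second.
Map<Long,Integer> commentsCountMap = new HashMap<Long,Integer>();
while (result.next()) {
commentsCountMap.put(result.getLong("goodsId"), result.getInt("num"));
}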
while (result.next()) {
Long goodsId = result.getLong("goodsId");
String goodsName = result.getString("goods_name");
String coverUrl = result.getString("cover_url");
String thumbUrl = result.getString("thumb_url");
String categoryId = result.getString("cat_id");
String categoryName = result.getString("cat_name");

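// Token list for the goods name. Once stored in Solr it drives search suggestions:
// e.g. the user types 鞋 (shoe) and the search box suggests 鞋子 (shoes), 篮球鞋 (basketball shoes), etc.
List<String> suggestList = SolrJUtil.getFieldDefaultAnalysis("content",goodsName.toLowerCase());// manual tokenization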
List<Long> goodsIds = new ArrayList<Long>();
for (String suggest : suggestList) {
SolrInputDocument doc = null;
if (!uniqueData.contains(suggest)) {// do not add the same suggestion twice
String pinyin = PinyinHelper.toHanyuPinyinString(suggest,spellFormat, "");
doc = new SolrInputDocument();
doc.addField("id", UUIDGenerator.getUUID());
String trimSuggest = suggest.replaceAll(" ", "");// Chinese suggestion with spaces removed
String pinyin_Text = pinyin.replaceAll(" ", "");// full pinyin
String firstSpellText = getFirstSpell(suggest);// pinyin initials
doc.addField("suggest", suggest);// Chinese suggestion
doc.addField("content", trimSuggest+" "+pinyin_Text+" "+ firstSpellText);// combined field
uniqueData.add(suggest);
}
if(!goodsIds.contains(goodsId)){
String goodsName_py = PinyinHelper.toHanyuPinyinString(goodsName.replaceAll(" ", ""), spellFormat, "");
if(doc == null){
doc = new SolrInputDocument();
//doc.addField("id", UUIDGenerator.getUUID());
}
doc.setField("id", goodsId);
doc.addField("goodsId", goodsId);
doc.addField("goodsName", goodsName);
doc.addField("searchContent", goodsName.replaceAll(" ", "").toLowerCase()+goodsName_py);
if(prices.size()>0){
doc.addField("goodsPrice", prices.get(0));
}
if(minNums.size()>0){
doc.addField("minOrderNum", minNums.get(0));
}
doc.addField("coverUrl", Global.IMG_SERVER_URL+coverUrl);
doc.addField("thumbUrl", Global.IMG_SERVER_URL+thumbUrl);
doc.addField("categoryId", categoryId);
doc.addField("categoryName", categoryId+"_"+categoryName);
goodsIds.add(goodsId);
}
if(doc != null){
docs.add(doc);
}
}
}
try {
if(docs.size()>0){
UpdateResponse response = server.add(docs);
server.commit();
i = docs.size();
}
} catch (SolrServerException e) {
e.printStackTrace();
return false;
} catch (IOException e) {
e.printStackTrace();
return false;
}

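} catch (Exception e) {
LOG.error("initDbIndex error:"+e.getMessage());
e.printStackTrace();
return false;
} finally {
try {
if (result != null) { // result stays null if the query never ran
result.close();
}
conn.close(); // 4. close the database connection
} catch (Exception e) {
// Log only: returning from a finally block would override the method's real result.
LOG.error("initDbIndex error while closing resources:"+e.getMessage());
e.printStackTrace();
}
}
System.out.println("Index build finished; " + i + " documents were updated in this run.");
LOG.info("Index build finished; " + i + " documents were updated in this run.");
return true;
}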
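One detail of the loop above is worth a short illustration: it keeps the suggestions seen so far in a List and calls contains for every token, which rescans the list each time. A Set does the same de-duplication with, on average, constant-time lookups. The sketch below is plain Java with hypothetical names (SuggestDedupSketch, uniqueSuggestions) and made-up token lists; in the real code the tokens come from the analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SuggestDedupSketch {

    /** Collects each suggestion token once, keeping the order of first appearance. */
    static List<String> uniqueSuggestions(List<List<String>> tokensPerGoods) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> tokens : tokensPerGoods) {
            for (String token : tokens) {
                // add() returns false for a token that is already present, so no contains() call is needed
                seen.add(token);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<List<String>> tokens = Arrays.asList(
                Arrays.asList("篮球鞋", "鞋"),
                Arrays.asList("跑步鞋", "鞋"));
        // 鞋 appears in both goods names but is kept only once
        System.out.println(uniqueSuggestions(tokens));
    }
}
```

With a few thousand goods the difference is negligible, but for a full catalogue rebuild the Set version avoids quadratic behaviour.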

Other helper methods, mainly used for tokenization.

Get the tokens of a given string, using the default analyzer of the given field

public static List<String> getFieldDefaultAnalysis(String tokenField, String content) {
    FieldAnalysisRequest request = new FieldAnalysisRequest("/analysis/field");
    request.addFieldName(tokenField); // field name: any field whose type supports Chinese tokenization
    request.setFieldValue(""); // field value: may be empty, but the parameter must be set explicitly
    request.setQuery(content);

    List<String> results = new ArrayList<String>();
    FieldAnalysisResponse response;
    try {
        response = request.process(server);
    } catch (Exception e) {
        e.printStackTrace();
        return results; // analysis failed: return no tokens rather than dereference a null response
    }

    // Collect the token text produced by each query-time analysis phase.
    for (AnalysisPhase phase : response.getFieldNameAnalysis(tokenField).getQueryPhases()) {
        for (TokenInfo info : phase.getTokens()) {
            results.add(info.getText());
        }
    }

    return results;
}

Get the pinyin initial of each Chinese character; English characters are left unchanged

public static String getFirstSpell(String chinese) {
StringBuffer pybf = new StringBuffer();
char[] arr = chinese.toCharArray();
HanyuPinyinOutputFormat defaultFormat = new HanyuPinyinOutputFormat();
defaultFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE);
defaultFormat.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
for (int i = 0; i < arr.length; i++) {
if (arr[i] > 128) {
try {
String[] temp = PinyinHelper.toHanyuPinyinStringArray(
arr[i], defaultFormat);
if (temp != null) {
pybf.append(temp[0].charAt(0));
}
} catch (BadHanyuPinyinOutputFormatCombination e) {
e.printStackTrace();
}
} else {
pybf.append(arr[i]);
}
}
return pybf.toString().replaceAll("\\W", "").trim();
}

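Some of the methods above are optional. For example, if you do not need pinyin matching, you can drop the last one.

After the steps above, the data in Solr is initialized and ready. The next step is to query it from the program: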

public String productList() throws SolrServerException {
    SearchResult result = new SearchResult();
    HttpSolrServer server = SolrJUtil.getServer();
    SolrQuery query = new SolrQuery();
    keywords = keywords.trim();
    String escaped = StringUtils.escapeAllSpecialChars(keywords);
    query.set("q", "searchContent:*" + escaped + "*");
    query.set("spellcheck.q", escaped);
    query.set("qt", "/spell"); // route the request to the /spell handler
    query.set("qf", "searchContent"); // query field (only honored by the dismax/edismax parsers)
    query.set("fl", "goodsId,goodsName,goodsPrice,minOrderNum,productUnit,coverUrl,isProvideSample,"
            + "transactions,countryCode,sourcePlace,productFeature,score,isEnquiry,isDiscount,fobMinPrice,fobMaxPrice,discount"); // fields to return

    query.setStart((page - 1) * Global.PAGE_SIZE); // offset of the first row on this page
    query.setRows(Global.PAGE_SIZE); // page size
    QueryResponse rp = server.query(query);

    SolrDocumentList docList = rp.getResults();
    // Check for null before reading numFound; the original read it first.
    result.setTotalRows(docList == null ? 0 : docList.getNumFound());
    if (docList != null && docList.getNumFound() > 0) {
        // Read each document and copy its values into a JavaBean (bean construction omitted in the original).
        for (SolrDocument doc : docList) {
            String goodsName = (String) doc.getFieldValue("goodsName");
        }
        result.setProducts(products);
        result.setSuccess(true);
    } else {
        SpellCheckResponse re = rp.getSpellCheckResponse(); // spell-check results
        if (re != null) {
            result.setSuccess(false);
            result.setSuggestStr(re.getFirstSuggestion(keywords.replaceAll(" ", "")));
        }
    }
    // The original omits the return statement; return the result/view name here.
}
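Two small pieces of the query method lend themselves to a standalone sketch: escaping the user's keywords and computing the paging offset. StringUtils.escapeAllSpecialChars is the author's own helper and its body is not shown, so the escape method below is only a guess at what such a helper might do, written in the spirit of SolrJ's ClientUtils.escapeQueryChars. The class name QueryHelpersSketch and the exact character list are assumptions.

```java
final class QueryHelpersSketch {

    // Characters with special meaning in Lucene/Solr query syntax (assumed list).
    private static final String SPECIALS = "\\+-!():^[]\"{}~*?|&;/";

    /** Prefixes every special character and every whitespace character with a backslash. */
    static String escape(String keywords) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keywords.length(); i++) {
            char c = keywords.charAt(i);
            if (SPECIALS.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Converts a 1-based page number into the 0-based row offset expected by the start parameter. */
    static int startOffset(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (page - 1) * pageSize;
    }

    public static void main(String[] args) {
        System.out.println(escape("nike (air):max"));
        System.out.println("page 3 of size 20 starts at row " + startOffset(3, 20));
    }
}
```

Validating the page number before computing the offset matters in practice, because a page of 0 taken straight from a request parameter would otherwise produce a negative start value.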

That is a plain query. Search results are usually highlighted, so to add highlighting:

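query.setHighlight(true);
query.addHighlightField("companyName");
query.setHighlightSimplePre("<font color='red'>");
query.setHighlightSimplePost("</font>");

// After the query has run: the highlighting map is keyed by each document's uniqueKey value (idStr here).
Map<String, Map<String, List<String>>> highlightMap = queryResponse.getHighlighting();
Map<String, List<String>> docHighlights = highlightMap.get(idStr);
List<String> companyNameList = docHighlights == null ? null : docHighlights.get("companyName");
if (CollectionUtils.isNotEmpty(companyNameList)) {
    // Replace the value taken from the document with the highlighted snippet.
    supplier.setCompanyName(companyNameList.get(0));
}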
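The lookup in the highlighting map has two places where a key may be missing (no entry for the document, or no snippet for the field), so it helps to wrap it in a small null-safe helper. The sketch below works on plain java.util maps of the same shape as the highlighting result; the class name HighlightMergeSketch, the method highlightedOr and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HighlightMergeSketch {

    /** Returns the first highlighted snippet for (docId, field), or the fallback if there is none. */
    static String highlightedOr(Map<String, Map<String, List<String>>> highlighting,
                                String docId, String field, String fallback) {
        Map<String, List<String>> perDoc = highlighting.get(docId);
        if (perDoc == null) {
            return fallback; // this document has no highlighting entry
        }
        List<String> snippets = perDoc.get(field);
        return (snippets == null || snippets.isEmpty()) ? fallback : snippets.get(0);
    }

    public static void main(String[] args) {
        Map<String, List<String>> perDoc = new HashMap<>();
        perDoc.put("companyName", Arrays.asList("<font color='red'>Acme</font> Trading"));
        Map<String, Map<String, List<String>>> highlighting = new HashMap<>();
        highlighting.put("42", perDoc);

        System.out.println(highlightedOr(highlighting, "42", "companyName", "Acme Trading"));
        System.out.println(highlightedOr(highlighting, "43", "companyName", "Other Co"));
    }
}
```

Falling back to the stored value keeps the result list complete even when a hit matched on a field that was not highlighted.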

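That covers Solr's configuration and basic usage, although the explanation may be a little disorganized.

Note: all of the code above is pseudocode meant to show the steps and the approach. Please use it as a reference.

If you repost this article, please credit the source.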


Tags: solr, java, search engine
