Lucene in Action, Chapter 3: Search

2010-12-03 18:42

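Key classes

IndexSearcher: the central class for searching an index.

Query (and its concrete subclasses): passed to IndexSearcher's search method; it encapsulates the logical query.

QueryParser: turns a query string typed by a person into a Query object.

TopDocs: holds the top-scoring documents; returned by IndexSearcher's search method.

ScoreDoc: one entry in a TopDocs. It keeps only a reference to the Document (its document ID), not the Document itself.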

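3.1 Implementing a simple search feature

Complex queries can be expressed in two ways: as a query string in Lucene's syntax, which QueryParser turns into a Query, or by combining Query objects directly through the API.

Term

A Term searches one specific field (Field).

A simple search example: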

    public class BasicSearchingTest extends TestCase {
        public void testTerm() throws Exception {
            Directory dir = TestUtil.getBookIndexDirectory();       // directory of the sample book index
            IndexSearcher searcher = new IndexSearcher(dir, true);  // true = open read-only

            Term t = new Term("subject", "ant");
            Query query = new TermQuery(t);
            TopDocs docs = searcher.search(query, 10);
            assertEquals("JDwA", 1, docs.totalHits);                // one book matches "ant"

            t = new Term("subject", "junit");
            docs = searcher.search(new TermQuery(t), 10);
            assertEquals(2, docs.totalHits);                        // two books match "junit"

            searcher.close();
        }
    }

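Constructing the Term (a field name plus the text to look for) is the key step.

QueryParser

QueryParser converts a query string into a Query object. It supports operators such as OR, + and -.

Its constructor is:

    QueryParser(Version matchVersion, String field, Analyzer analyzer)

matchVersion: Version.LUCENE_CURRENT is fine for these examples.

field: the default field to search when the query string does not name one.

analyzer: the analyzer applied to the text of the query string. At search time an analyzer is used only by QueryParser; Query objects built directly through the API are not analyzed.

QueryParser's parse method parses the string and produces the Query object:

    public Query parse(String query) throws ParseException

It throws ParseException if parsing fails and otherwise returns the Query.

If the query string contains several terms, they are combined with OR by default.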

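Common query expressions:

java : documents containing the term java in the default field

java junit, or java OR junit : documents containing java or junit (or both) in the default field

+java +junit, or java AND junit : documents containing both java and junit in the default field

title:ant : documents whose title field contains ant

title:extreme -subject:sports, or title:extreme AND NOT subject:sports : documents whose title field contains extreme and whose subject field does not contain sports

title:"junit in action" : documents whose title contains the exact phrase "junit in action"

java* : terms that start with java, such as javascript or java.net

java~ : terms similar to java, such as lava

lastmodified:[1/1/04 TO 12/31/04] : documents whose lastmodified field falls between the two dates

The syntax is powerful, and every string query above can equally be built by combining Query objects.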

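3.2 Using IndexSearcher

Setting up an IndexSearcher takes three steps:

    Directory dir = FSDirectory.open(new File("/path/xxx"));
    IndexReader reader = IndexReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);

IndexReader wraps the low-level index access API. Opening a reader is expensive, so a reader should be reused across searches.

A reader sees the index only as it was when the reader was opened, not later updates. To pick up changes, call reopen. reopen reuses as much of the existing reader as it can and creates a new IndexReader only when it cannot hand back the current one, so the result has to be compared with the old reader:

    IndexReader newReader = reader.reopen();
    if (reader != newReader) {
        reader.close();                        // release the old reader
        reader = newReader;
        searcher = new IndexSearcher(reader);  // a searcher is bound to its reader
    }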

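Running a search

IndexSearcher offers many search methods; any of the following can be used:

    TopDocs search(Query query, int n)
    TopDocs search(Query query, Filter filter, int n)
    TopFieldDocs search(Query query, Filter filter, int n, Sort sort)

TopDocs

Most search methods return a TopDocs holding the results, already sorted by relevance score. It exposes three members:

totalHits: the number of documents that matched the query

scoreDocs: the array of top results; each ScoreDoc carries a document ID and its score

getMaxScore(): the highest score among the matches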

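Paging through results

Two approaches are common with Lucene:

1. Fetch many results on the first search and serve each page request from that set.

2. Run the query again for every page request.

Because HTTP is stateless, re-running the query avoids keeping per-user state on the server and is usually the better choice. When the user moves to a later page, increase n so that the top hits cover that page, then display only the hits that belong to it.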
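The re-query approach can be sketched without any Lucene types. This is a minimal illustration, not code from the book: PagingSketch, hitsNeeded and pageOf are hypothetical names, pages are zero-based, and the int array stands in for the document IDs found in TopDocs.scoreDocs.

```java
import java.util.Arrays;

final class PagingSketch {
    /** Value to pass as n so that the given zero-based page is fully covered. */
    static int hitsNeeded(int page, int pageSize) {
        return (page + 1) * pageSize;
    }

    /** Slice of the ranked document IDs that belongs to the requested page. */
    static int[] pageOf(int[] rankedDocIds, int page, int pageSize) {
        int from = Math.min(page * pageSize, rankedDocIds.length);
        int to = Math.min(from + pageSize, rankedDocIds.length);
        return Arrays.copyOfRange(rankedDocIds, from, to);
    }
}
```

For page 2 with 10 hits per page, hitsNeeded returns 30; the search is run with n = 30 and only the last 10 hits are shown.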

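Near-real-time search

The key to near-real-time search is not to build the Directory -> IndexReader chain yourself, but to use the following:

IndexWriter.getReader(): returns a reader that sees the writer's latest changes immediately, without a commit.

IndexReader newReader = reader.reopen(): reuses the existing reader and is much faster than a fresh open. If newReader != reader, the old reader must be closed.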

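3.3 Understanding the score

Lucene uses a score to measure how well a Document matches a Query.

The scoring formula

For a detailed derivation of the score, see "Lucene打分公式的数学推导" (a mathematical derivation of Lucene's scoring formula):

http://topic.csdn.net/u/20100308/21/3386acef-d853-4738-9941-2a8b0ee157ca.html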
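For orientation, the scoring factors discussed in this section combine roughly as follows. This is a sketch in the form used by the Similarity javadoc of Lucene 3.x, not a formula taken from the linked article; boost(t) here is assumed to be the query-time boost of term t, and norm(t, d) folds together the index-time boosts and the length norm.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
  \sum_{t \in q}\Bigl(\mathrm{tf}(t \text{ in } d)\cdot\mathrm{idf}(t)^{2}\cdot
  \mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Bigr)
```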

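The factors in the formula play the following roles:

tf(t in d): term frequency, how often term t occurs in document d.

idf(t): inverse document frequency, derived from the number of documents that contain term t; the fewer documents contain it, the higher the value.

norm(t, d): normalization factor, made up of three parts:

Document boost: the larger the value, the more important the document.

Field boost: the larger the value, the more important the field.

lengthNorm(field) = (1.0 / Math.sqrt(numTerms)): the more terms a field contains (the longer the text), the smaller the value; the shorter the field, the larger the value.

boost(t.field in d): an additional boost.

coord(q, d): coordination factor based on how many of the query's terms the document contains. A document that matches more of the terms scores higher than one that matches fewer, which gives an AND-like preference even in OR queries.

queryNorm(q): normalization factor computed from the sum of the squared weights of the query terms. It does not affect ranking; it only makes scores from different queries comparable.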
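To get a feel for how these factors behave, here is a small standalone sketch. Only the lengthNorm formula comes from the text; the formulas for tf, idf, coord and queryNorm are assumptions modelled on the defaults of DefaultSimilarity in Lucene 3.x, and the class name ScoringFactorsSketch is hypothetical.

```java
final class ScoringFactorsSketch {
    /** Term frequency factor: grows with the square root of the raw count (assumed default). */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Inverse document frequency: terms found in fewer documents get a larger value (assumed default). */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Length norm from the text: fields with more terms get a smaller value. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Coordination factor: fraction of the query's terms found in the document (assumed default). */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** Query norm: inverse square root of the summed squared term weights (assumed default). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}
```

Under these assumptions a 4-term field has a length norm of 0.5 while a 100-term field has 0.1, so the same term hit counts for more in the shorter field.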

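Boosts can raise the rank of particular documents, and the similarity computation itself can be customized by extending Similarity.

Using explain to understand a score

The formula is complex, but the built-in explain() method shows how a particular score was computed:

    Explanation explanation = searcher.explain(query, docId);

Here query is the Query that was searched and docId is the document ID taken from ScoreDoc.doc. The returned Explanation breaks the score down step by step.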

3.4 Lucene's query types

TermQuery

Searches for a single term in one field.

    IndexSearcher searcher = new IndexSearcher(TestUtil.getBookIndexDirectory());
    Term t = new Term("isbn", "1930110995");
    Query query = new TermQuery(t);
    TopDocs docs = searcher.search(query, 10);
    assertEquals("JUnit in Action", 1, docs.totalHits);
    searcher.close();

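TermRangeQuery

Terms are kept in the index in lexicographic order, so Lucene can easily search a range of terms.

    Directory dir = TestUtil.getBookIndexDirectory();
    IndexSearcher searcher = new IndexSearcher(dir);
    TermRangeQuery query = new TermRangeQuery("title2", "d", "j", true, true);
    TopDocs matches = searcher.search(query, 100);
    assertEquals(3, matches.totalHits);
    searcher.close();
    dir.close();

The two boolean arguments state whether the lower bound (d) and the upper bound (j) are themselves included in the range.

A Collator can also be supplied so that the bounds are compared by locale-specific collation rather than plain lexicographic order, but this performs very poorly.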
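The inclusive and exclusive bounds can be illustrated with plain string comparison. This is only an approximation of how a term range is evaluated (String.compareTo stands in for the index's term order), and TermRangeSketch and inRange are hypothetical names.

```java
final class TermRangeSketch {
    /** True if term lies between lower and upper, honoring the two inclusion flags. */
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int vsLower = term.compareTo(lower);
        int vsUpper = term.compareTo(upper);
        boolean aboveLower = includeLower ? vsLower >= 0 : vsLower > 0;
        boolean belowUpper = includeUpper ? vsUpper <= 0 : vsUpper < 0;
        return aboveLower && belowUpper;
    }
}
```

Note that under this ordering "junit" sorts after "j", so it falls outside the range d..j even with an inclusive upper bound. Plain-text numbers also sort as strings ("10" comes before "9"), which is why numeric ranges need separate handling.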

NumericRangeQuery

Similar to TermRangeQuery, except that the range is over numeric values.

    Directory dir = TestUtil.getBookIndexDirectory();
    IndexSearcher searcher = new IndexSearcher(dir);

    // pub date of TTC was October 1988
    NumericRangeQuery query = NumericRangeQuery.newIntRange("pubmonth",
                                                            198805,
                                                            198810,
                                                            true,
                                                            true);
    TopDocs matches = searcher.search(query, 100);
    assertEquals(1, matches.totalHits);
    searcher.close();
    dir.close();

PrefixQuery

Prefix search: matches only terms that begin with the given string.

    IndexSearcher searcher = new IndexSearcher(TestUtil.getBookIndexDirectory());

    // search for programming books, including subcategories
    Term term = new Term("category", "/technology/computers/programming");
    PrefixQuery query = new PrefixQuery(term);
    TopDocs matches = searcher.search(query, 10);
    int programmingAndBelow = matches.totalHits;

    // only programming books, not subcategories
    matches = searcher.search(new TermQuery(term), 10);
    int justProgramming = matches.totalHits;

    assertTrue(programmingAndBelow > justProgramming);
    searcher.close();

BooleanQuery

Combines other queries with AND, OR and NOT logic. Clauses are added with:

    public void add(Query query, BooleanClause.Occur occur)

The occur argument selects the logic:

AND: set occur to Occur.MUST

OR: set occur to Occur.SHOULD

NOT: set occur to Occur.MUST_NOT
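How the three Occur values interact can be modelled in a few lines. The sketch below is a simplified stand-in, not Lucene's BooleanQuery: a document is reduced to a set of terms, each clause is a single term, and BooleanSketch, Clause and terms are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class BooleanSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    /** True if a document containing docTerms satisfies all clauses. */
    static boolean matches(Set<String> docTerms, Clause... clauses) {
        boolean hasMust = false;
        boolean anyShouldMatched = false;
        for (Clause c : clauses) {
            boolean present = docTerms.contains(c.term);
            switch (c.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return false;   // a required term is missing
                    break;
                case MUST_NOT:
                    if (present) return false;    // a prohibited term is present
                    break;
                case SHOULD:
                    if (present) anyShouldMatched = true;
                    break;
            }
        }
        // Without any MUST clause at least one SHOULD clause has to match,
        // so a query made only of MUST_NOT clauses matches nothing.
        return hasMust || anyShouldMatched;
    }

    static Set<String> terms(String... t) {
        return new HashSet<String>(Arrays.asList(t));
    }
}
```

With a MUST clause on a and a MUST_NOT clause on b, a document containing a and c matches, while one containing a and b does not.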

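PhraseQuery

PhraseQuery matches documents in which several terms occur close to one another.

The slop expresses the allowed "distance"; setting the slop on a PhraseQuery controls how loosely the terms may be arranged.

For example, given this field:

    doc.add(new Field("field", "the quick brown fox jumped over the lazy dog", Field.Store.YES, Field.Index.ANALYZED));

two adjacent terms in their original order always match, whatever the slop:

    PhraseQuery query = new PhraseQuery();
    query.add(new Term("field", "quick"));
    query.add(new Term("field", "brown"));

This query finds the document.

Order matters as well. The reversed pair fox, quick is at distance 3 from the document text, so it matches only when the slop is at least 3.

The same applies to longer phrases, as in the following checks, where matched(terms, slop) builds a PhraseQuery from the terms with the given slop and reports whether the document is found:

    assertFalse("not close enough",
                matched(new String[] {"quick", "jumped", "lazy"}, 3));
    assertTrue("just enough",
               matched(new String[] {"quick", "jumped", "lazy"}, 4));
    assertFalse("almost but not quite",
                matched(new String[] {"lazy", "jumped", "quick"}, 7));
    assertTrue("bingo",
               matched(new String[] {"lazy", "jumped", "quick"}, 8));

The slop is really a move distance: the number of position moves needed to rearrange the query terms so that they line up with the document.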
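The slop values in the assertions above can be reproduced with a small position-based model. This is a simplified sketch, not Lucene's phrase-matching code: it assumes whitespace tokenization and that each phrase term occurs once in the text, and SlopSketch and requiredSlop are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

final class SlopSketch {
    /**
     * Smallest slop at which the phrase would match the text, or -1 if a term is missing.
     * Each term's position is shifted back by its index in the phrase; the spread of
     * the shifted positions is the number of moves needed.
     */
    static int requiredSlop(String text, String... phrase) {
        if (phrase.length == 0) return 0;
        List<String> tokens = Arrays.asList(text.split(" "));
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < phrase.length; i++) {
            int pos = tokens.indexOf(phrase[i]);   // first occurrence only
            if (pos < 0) return -1;
            int shifted = pos - i;                 // where the phrase would start if this term were in place
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return max - min;
    }
}
```

For the sample sentence this model gives 0 for quick, brown; 4 for quick, jumped, lazy; and 8 for lazy, jumped, quick, in line with the assertions.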

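WildcardQuery: wildcard search

    Query query = new WildcardQuery(new Term("contents", "?ild*"));

WildcardQuery can have serious performance problems. The longer the literal prefix before the first wildcard (* or ?), the fewer terms have to be enumerated; at the other extreme, a pattern that starts with a wildcard forces a scan of all terms.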
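The matching rule and the role of the literal prefix can be sketched in plain Java. The semantics assumed here are those described in the WildcardQuery javadoc (* matches any run of characters including none, ? matches exactly one character); WildcardSketch and its methods are hypothetical names, and the recursive matcher is chosen for clarity, not speed.

```java
final class WildcardSketch {
    /** True if text matches pattern, where '*' is any run of characters and '?' is exactly one. */
    static boolean matches(String pattern, String text) {
        return matches(pattern, 0, text, 0);
    }

    private static boolean matches(String p, int pi, String s, int si) {
        if (pi == p.length()) return si == s.length();
        char c = p.charAt(pi);
        if (c == '*') {
            // Try every possible length for the run consumed by '*'.
            for (int k = si; k <= s.length(); k++) {
                if (matches(p, pi + 1, s, k)) return true;
            }
            return false;
        }
        if (si == s.length()) return false;
        return (c == '?' || c == s.charAt(si)) && matches(p, pi + 1, s, si + 1);
    }

    /** Literal text before the first wildcard; only terms starting with it need to be visited. */
    static String literalPrefix(String pattern) {
        int i = 0;
        while (i < pattern.length() && pattern.charAt(i) != '*' && pattern.charAt(i) != '?') {
            i++;
        }
        return pattern.substring(0, i);
    }
}
```

The pattern ?ild* has an empty literal prefix, so every term is a candidate, whereas a pattern such as wil*d narrows the candidates to terms starting with wil.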

FuzzyQuery: fuzzy search

FuzzyQuery is based on the edit distance (Levenshtein distance): the number of character deletions, insertions, or substitutions required to transform one string into the other.

For example:

    indexSingleFieldDocs(new Field[] {
        new Field("contents", "fuzzy", Field.Store.YES, Field.Index.ANALYZED),
        new Field("contents", "wuzzy", Field.Store.YES, Field.Index.ANALYZED)
    });

    IndexSearcher searcher = new IndexSearcher(directory);
    Query query = new FuzzyQuery(new Term("contents", "wuzza"));
    TopDocs matches = searcher.search(query, 10);
    assertEquals("both close enough", 2, matches.totalHits);
    assertTrue("wuzzy closer than fuzzy",
               matches.scoreDocs[0].score != matches.scoreDocs[1].score);
    Document doc = searcher.doc(matches.scoreDocs[0].doc);

With FuzzyQuery, the query term wuzza matches both wuzzy and fuzzy.

FuzzyQuery does not take a distance; it takes a similarity threshold between 0 and 1. For example, this constructor:

    FuzzyQuery(Term term, float minimumSimilarity, int prefixLength)

A term matches when its similarity to the query term, computed as 1 - editDistance / min(length of the query term, length of the candidate term), is greater than minimumSimilarity. prefixLength is the number of leading characters that must match exactly.

FuzzyQuery enumerates every term in the index, which is expensive.
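The wuzza example can be checked by hand with a standalone edit-distance sketch. It assumes the similarity formula 1 - editDistance / min(length of query term, length of candidate term) and ignores prefixLength; FuzzySketch and its methods are hypothetical names, not Lucene's implementation.

```java
final class FuzzySketch {
    /** Levenshtein distance: insertions, deletions and substitutions needed to turn a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;   // the row just filled becomes the previous row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in the range up to 1; higher means closer (assumed formula, prefix ignored). */
    static float similarity(String query, String candidate) {
        int shorter = Math.min(query.length(), candidate.length());
        if (shorter == 0) return query.length() == candidate.length() ? 1f : 0f;
        return 1f - (float) editDistance(query, candidate) / shorter;
    }

    static boolean matches(String query, String candidate, float minimumSimilarity) {
        return similarity(query, candidate) > minimumSimilarity;
    }
}
```

Under this model wuzza is 1 edit from wuzzy (similarity 0.8) and 2 edits from fuzzy (similarity 0.6), so both pass a threshold of 0.5 and wuzzy ranks as the closer term.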

MatchAllDocsQuery

MatchAllDocsQuery matches every document in the index. By default each document gets a boost of 1.0; the query can also derive the boost from a given field.

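3.5 QueryParser

The query API can build powerful queries, but a Query does not have to be assembled entirely through the API. It can also be produced along this path:

String -> QueryParser -> Query

For example:

+pubdate:[20040101 TO 20041231] Java AND (Jakarta OR Apache)

In a query string, the following characters must be escaped by putting a backslash (\) in front of them:

\ + - ! ( ) : ^ ] { } ~ * ?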
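Escaping user input before handing it to the parser can be sketched as follows. Lucene's QueryParser provides its own static escape(String) helper, which is the better choice in real code; the standalone version below only illustrates the idea, uses the character set listed in the text with backslash as the escape character, and EscapeSketch is a hypothetical name.

```java
final class EscapeSketch {
    // Characters the query syntax treats as special (the set given in the text).
    private static final String SPECIAL = "\\+-!():^]{}~*?";

    /** Returns raw with a backslash inserted before every special character. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

For instance, a user-entered title:ant becomes title\:ant, so the colon is treated as literal text and not as a field separator.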

For any Query object, Query.toString() returns its query-string representation.

For example, this query:

    BooleanQuery query = new BooleanQuery();
    query.add(new FuzzyQuery(new Term("field", "kountry")),
              BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("title", "western")),
              BooleanClause.Occur.SHOULD);

has this toString():

+field:kountry~0.5 title:western

Note: the default minimum similarity of FuzzyQuery is 0.5.

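TermQuery

    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
                                         "subject", analyzer);
    Query query = parser.parse("computers");
    System.out.println("term: " + query);

The single word computers is parsed against the default field, and the output is:

term: subject:computers

that is, the field:term form.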

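TermRangeQuery

Range searches in string form:

title2:[K TO N]   // both ends inclusive

title2:{K TO Mindstorms}   // both ends exclusive

Numeric and date ranges

QueryParser cannot by itself turn a string into a numeric range query (NumericRangeQuery) or a date range query; this has to be implemented in a subclass of QueryParser (see sections 6.3.3 and 6.3.4 of the book).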

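Prefix and wildcard queries

    Query q = new QueryParser(Version.LUCENE_CURRENT,
                              "field", analyzer).parse("PrefixQuery*");

By default, the string form of this query is

prefixquery*

that is, prefix and wildcard terms are lowercased by default.

Lowercasing can be turned off on the QueryParser instance (qp here):

    qp.setLowercaseExpandedTerms(false);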

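Boolean queries

AND, OR and NOT must be written in upper case.

By default, a space between terms means OR:

abc xyz => abc OR xyz

The default operator can be changed:

    parser.setDefaultOperator(QueryParser.AND_OPERATOR);

after which

abc xyz => abc AND xyz

The shorthand + and - can be used as well:

a AND b == +a +b

a OR b == a b

a AND NOT b == +a -b

Note that a NOT clause needs at least one non-negated clause next to it; NOT word on its own cannot be used to find all documents that lack word.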

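PhraseQuery

Enclosing text in double quotes in the query string makes QueryParser create a PhraseQuery.

The double quotes are essential.

For example, the unquoted string

This is Some Phrase*

is parsed word by word into separate clauses (with the trailing Phrase* as a prefix query), not as a phrase. The quoted form does produce a PhraseQuery:

"This is Some Phrase*"

The quoted text is run through the analyzer, so in this example This and is are dropped as stop words, and the asterisk is not interpreted as a wildcard.

A ~N after the closing quote sets the slop, for example:

"sloppy phrase"~5 sets the slop of the PhraseQuery to 5.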

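FuzzyQuery

A trailing ~ on a term requests a fuzzy search, that is, a FuzzyQuery.

For example:

    Query query = parser.parse("kountry~");

or, with an explicit minimum similarity:

    query = parser.parse("kountry~0.7");

MatchAllDocsQuery

*:* stands for MatchAllDocsQuery, which matches every Document.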

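Grouping

Parentheses group clauses:

    Query query = new QueryParser(
        Version.LUCENE_CURRENT,
        "subject",
        analyzer).parse("(agile OR extreme) AND methodology");

Field selection

Prefix a term with a field name, as in title:ant, to search a field other than the default one.

Boosting a single term

Appending ^ and a float to a term sets that term's boost. For example:

junit^2.0 testing

doubles the boost of junit to 2.0, while testing keeps its default boost.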
