
Solr: a large amount of index data makes searches slow

2014-02-28 11:41
This had been bothering me for a long time. I tried quite a few things, including changing mergeFactor, tuning cache autowarming, and various ways of optimizing the index (roughly the settings sketched below), but none of it made a noticeable difference.
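For reference, these were the kinds of knobs I had been turning, shown here as a minimal solrconfig.xml sketch (Solr 4.x style; the element names are real Solr settings, but the values are only illustrative, not a recommendation):

    <!-- solrconfig.xml (sketch): segment merging -->
    <indexConfig>
      <mergeFactor>10</mergeFactor>
    </indexConfig>

    <!-- solrconfig.xml (sketch): cache autowarming -->
    <query>
      <filterCache      class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
      <queryResultCache class="solr.LRUCache"     size="512" initialSize="512" autowarmCount="64"/>
    </query>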

Today I came across two articles on this; the first (part 1) is here:

http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1

One of them makes this point:

The short answer is "more disk seeks". The slowest queries are phrase queries with commonly occurring words such as "a", "of", "the", "and", "an", etc. Queries with common words take longer because the data structures containing the lists of documents containing those words in the index are larger. For example, in our 1 million document index, the word "the" occurs in about 800,000 documents whereas the word "aardvark" occurs in 1,500 documents. In processing a query, the search engine has to look at the list of document ids for each word in the query and determine which documents contain all of the words. The longer the lists, the more data must be read from disk into memory and the more processing must be done to compare the lists.

Phrase queries take longer than Boolean queries because when the search engine processes phrase queries it needs to insure not only that all the words in the query are in a document, but that the words occur in the document adjacent to each other and in the same order as the query. In order to do this the search engine needs to consult the positions index, which indexes the position that each word occurs in each document.

Word       | Number of Documents | Postings list Size (KB) | Total Term Occurrences (Millions) | Position list Size (GB)
the        | 800,000             | 800                     | 4,351                             | 4.35
of         | 892,000             | 892                     | 2,795                             | 2.8
and        | 769,000             | 769                     | 1,870                             | 1.87
literature | 453,000             | 453                     | 9                                 | 0.13
generation | 414,000             | 414                     | 5                                 | 0.01
lives      | 432,000             | 432                     | 5                                 | 0.01
beat       | 278,000             | 278                     | 1                                 | 0.01
Total      |                     | 4,037                   |                                   | 9
To process "the lives and literature of the beat generation" as a Boolean query about 4 megabytes of data would have to be read from cache or disk. To process the phrase query, nearly 9 gigabytes of data has to be read from cache or disk.

The gist is that, unlike a Boolean query, a phrase query has to take the order in which terms appear into account, so the index must record the position of every occurrence of a term within each document. The more common a word is, the longer both its posting list (which documents contain it) and its position list (where the term occurs inside those documents). Hence, according to the table above:

The word "the" appears in 800,000 documents and occurs about 4,351 million times in total across them. Even if each position only costs a single byte in the index (already the ideal case, ignoring delta encoding and other more advanced tricks), recording these positions already needs about 4,351,000,000 bytes, i.e. roughly 4.35 GB. In other words, even for the single term "the", just reading those ~4 GB of position data from disk is already a struggle.

Now back to our own index:

The fields we query include var_poi_chinese, var_poi_alias, text (where text = var_address_chinese + var_poi_chinese), var_address_chinese and BigTag. Among these, the address field almost always contains the character "市" (city), and the poi and text fields are very likely to contain it as well (a sketch of how text is assembled follows).
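For what it's worth, a combined field like text is normally built with copyField directives. The sketch below is only an assumption about our schema.xml (the field names are the ones listed above; the type name text_cn and the attributes are made up for illustration):

    <!-- schema.xml (sketch): text = var_address_chinese + var_poi_chinese -->
    <field name="var_address_chinese" type="text_cn" indexed="true" stored="true"/>
    <field name="var_poi_chinese"     type="text_cn" indexed="true" stored="true"/>
    <field name="text"                type="text_cn" indexed="true" stored="false" multiValued="true"/>

    <copyField source="var_address_chinese" dest="text"/>
    <copyField source="var_poi_chinese"     dest="text"/>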

So assume we have 10 million documents. When we search for "上海市复旦大学" (Fudan University, Shanghai), the term "市" produced by the tokenizer alone has at least 10,000,000 position entries, and in practice the real number is at least double that, without even counting the other information stored for "市" in the posting list. We therefore do a lot of unnecessary work for a term as unimportant as "市", and because "市" occurs in every document, the query ends up matching every single document in the index. And that only covers reading the position list of "市", ignoring all the other overhead.

Given this actual situation, we need to look at the CommonGramsFilterFactory filter. For every term that appears in commonwords.txt it produces output of the following kind: "市" → "市", "海_市", "市_上海". But because the bare "市" is still emitted, we also added "市" to the stopwords, and in the end we simply made commonwords.txt the same file as stopwords.txt (see the configuration sketch below).
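Concretely, the analyzer chain looks roughly like the sketch below. This assumes a single stopwords.txt is reused as the common-words list and that the field type is the hypothetical text_cn from above; CommonGramsQueryFilterFactory is the usual query-side companion of CommonGramsFilterFactory:

    <!-- schema.xml (sketch): NGram tokenizer + common grams + stopwords -->
    <fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="3"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="3"/>
        <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>

With a chain like this, a common word such as "市" is glued to its neighbours into bigrams like "海_市"; the bigram carries the position information that phrase matching needs, while the stop filter drops the bare "市" so its enormous position list never has to be read at query time.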

After tokenization and filtering, the query "上海市" therefore produces the following terms (the tokenizer used here is NGram with min=1, max=3):

上,海,海_市,市_上海,上海,海市,上海市

This approach makes the index larger, but at query time it effectively limits the damage that common words do to performance.

Common words added to the stopwords: 省 (province), 市 (city), 县 (county), 地区 (region), 自治州 (autonomous prefecture), 郊县 (suburban county), 城区 (urban district). The file itself is sketched below.
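The corresponding stopwords.txt (doubling as the common-words file, as described above) then simply lists one term per line:

    # stopwords.txt / commonwords.txt (one entry per line)
    省
    市
    县
    地区
    自治州
    郊县
    城区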