构建一个倒排索引 A first take at building an inverted index
2016-01-07 10:08
656 查看
A first take at building an inverted index
To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are:Collect the documents to be indexed:
...
Tokenize the text, turning each document into a list of tokens:
...
Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms:
...
Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
We will define and discuss the earlier stages of processing, that is, steps 1-3, in Section 2.2 (page
).
Until then you can think of tokensand normalized tokens as
also loosely equivalent to words. Here, we assume that the first 3 steps have already been done, and we examine building a basic inverted index
by sort-based indexing .
Within a document collection, we assume that each document has a unique serial number, known as the document identifier ( docID ).
During index construction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID,
as in Figure 1.4 . The core indexing step is sorting this list so that
the terms are alphabetical, giving us the representation in the middle column of Figure 1.4 . Multiple occurrences
of the same term from the same document are then merged.
Instances
of the same term are then grouped, and the result is split into a dictionary andpostings , as shown in the right column of Figure 1.4 .
Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of documents which contain each term (the document
frequency , which is here also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the search engine at query time, and it is a statistic later used in many
ranked retrieval models. The postings are secondarily sorted by docID. This provides the basis for efficient query processing. This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.
In the resulting index, we pay for storage of both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, while postings lists are normally kept on disk,
so the size of each is important, and in Chapter 5 we will examine how each can be optimized for storage and access efficiency. What data structure
should be used for a postings list? A fixed length array would be wasteful as some words occur in many documents, and others in very few. For an in-memory postings list, two good alternatives are singly linked lists or variable length arrays. Singly linked
lists allow cheap insertion of documents into postings lists (following updates, such as when recrawling the web for updated documents), and naturally extend to more advanced indexing strategies such as skip lists (Section 2.3 ),
which require additional pointers. Variable length arrays win in space requirements by avoiding the overhead for pointers and in time requirements because their use of contiguous memory increases speed on modern processors with memory caches.
Extra pointers can in practice be encoded into the lists as offsets. If updates are relatively infrequent, variable length arrays will be more compact and faster to traverse. We can also use a hybrid scheme with a linked list of fixed length arrays for each
term. When postings lists are stored on disk, they are stored (perhaps compressed) as a contiguous run of postings without explicit pointers (as in Figure 1.3 ),
so as to minimize the size of the postings list and the number of disk seeks to read a postings list into memory.
Exercises.
Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for
an example.)
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise
Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
Draw the term-document incidence matrix for this document collection.
Draw the inverted index representation for this collection, as in Figure 1.3 (page
).
For the document collection shown in Exercise 1.2 , what are the returned results for these queries:
schizophrenia AND drug
for AND NOT(drug OR approach)
from: http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
相关文章推荐
- 使用jQuery.wechat构建微信WEB应用
- 用 Composer构建自己的 PHP 框架之基础准备
- Java并发编程中构建自定义同步工具
- (WJW)构建企业级Docker_Registry_V2
- 以NFS启动方式构建arm-linux仿真运行环境
- 中型企业网络构建案例
- 中型企业网络构建案例
- 中型企业网络构建案例
- 菜鸟如何构建一个入门级***检测系统
- 利用ASP.NET构建网上考试系统
- 如何用三层交换机构建企业VLAN
- 突破三大挑战 构建学习型企业
- 用Linux防火墙构建软路由
- 数据仓库构建步骤(上)
- gradle基础
- ant 构建时遇到 “非法字符: \65279”的解决办法
- 使用shell脚本curl调用jenkins进行构建并判断是否构建成功
- 《构建之法》 –读书笔记
- 使用Eclipse构建Maven项目
- Jenkins Maven Ant Hudson几个工具的比较