Inverted Indexes – Inside How Search Engines Work

2016-01-07 17:25
An inverted index is a structure used by search engines and databases to map search terms to the files or documents that contain them, trading slower writes when a document is added to the index for faster searches later on. There are two
versions of an inverted index: a record-level index, which tells you which documents contain a term, and a fully inverted index, which tells you both which document a term is in and where in that document it appears. For example, if you built a search engine to
search the contents of sentences and it was fed these sentences:
{0} - "Turtles love pizza"
{1} - "I love my turtles"
{2} - "My pizza is good"

Then you would store them in an inverted index like this:
Term        Record Level     Fully Inverted
"turtles"   {0, 1}           { (0, 0), (1, 3) }
"love"      {0, 1}           { (0, 1), (1, 1) }
"pizza"     {0, 2}           { (0, 2), (2, 1) }
"i"         {1}              { (1, 0) }
"my"        {1, 2}           { (1, 2), (2, 0) }
"is"        {2}              { (2, 2) }
"good"      {2}              { (2, 3) }

The record-level sets hold just the ids of the documents that contain each word. In the fully inverted sets, the first number inside each pair of parentheses is the document id and the second number is the
word’s position within that document, counting from zero.
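To make the two layouts concrete, here is a minimal sketch that builds both kinds of index from a list of sentences. The class and method names (`IndexBuilder`, `BuildRecordLevel`, `BuildFullyInverted`) are made up for this illustration, and it assumes words are runs of ASCII letters that are lowercased before indexing, as in the table above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from any library.
public static class IndexBuilder
{
    // Assumption: a "word" is any run of ASCII letters.
    private static readonly Regex Words = new Regex(@"[A-Za-z]+");

    // Record-level index: term -> ids of the documents that contain it.
    public static Dictionary<string, SortedSet<int>> BuildRecordLevel(IList<string> docs)
    {
        var index = new Dictionary<string, SortedSet<int>>();
        for (var docId = 0; docId < docs.Count; docId++)
        {
            foreach (Match m in Words.Matches(docs[docId]))
            {
                var term = m.Value.ToLowerInvariant();
                if (!index.ContainsKey(term))
                    index[term] = new SortedSet<int>();
                index[term].Add(docId); // the set ignores repeated ids
            }
        }
        return index;
    }

    // Fully inverted index: term -> (document id, word position) pairs.
    public static Dictionary<string, List<(int DocId, int Position)>> BuildFullyInverted(IList<string> docs)
    {
        var index = new Dictionary<string, List<(int DocId, int Position)>>();
        for (var docId = 0; docId < docs.Count; docId++)
        {
            var matches = Words.Matches(docs[docId]);
            for (var pos = 0; pos < matches.Count; pos++)
            {
                var term = matches[pos].Value.ToLowerInvariant();
                if (!index.ContainsKey(term))
                    index[term] = new List<(int DocId, int Position)>();
                index[term].Add((docId, pos)); // pos is the zero-based word position
            }
        }
        return index;
    }
}
```

Feeding it the three example sentences should reproduce the table: "turtles" maps to documents 0 and 1 in the record-level index, and to the pairs (0, 0) and (1, 3) in the fully inverted one.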
So now if you wanted to search all three documents for the words “my turtles” you would grab the sets (looking at record level only):
"turtles"   {0, 1}
"my"        {1, 2}

Then you would intersect those sets, which leaves document 1 as the only match. Using the fully inverted index would also let us know that, in document 1, the word “my” appeared at position 2 and the word “turtles” at position
3, assuming the word position is important to your search.
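If word position does matter, for example when “my turtles” should match only as an exact phrase, the positions in the fully inverted index can be compared directly. The following is a rough sketch of that idea, not a standard algorithm: `PhraseSearch` and `AdjacentMatches` are hypothetical names, and the method handles only two-word phrases.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; the names are not from any library.
public static class PhraseSearch
{
    // Returns the ids of documents in which the second term appears
    // immediately after the first term (position + 1).
    public static List<int> AdjacentMatches(
        IEnumerable<(int DocId, int Position)> first,
        IEnumerable<(int DocId, int Position)> second)
    {
        // Hash the second term's postings so each lookup is O(1) on average.
        var secondSet = new HashSet<(int DocId, int Position)>(second);

        return first
            .Where(p => secondSet.Contains((p.DocId, p.Position + 1)))
            .Select(p => p.DocId)
            .Distinct()
            .OrderBy(id => id)
            .ToList();
    }
}
```

With the postings from the table, “my” at { (1, 2), (2, 0) } followed by “turtles” at { (0, 0), (1, 3) } matches only document 1, because (1, 2) is directly followed by (1, 3).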
There is no standard implementation for an inverted index, as it’s more of a concept than an actual algorithm; this, however, gives you a lot of options.
For the index you can choose to use things like hash tables, B-trees, or any other fast search data structure.
The intersection becomes a more interesting problem. You can try using Bloom
filters if 100% accuracy isn’t needed, or you can brute-force the problem by doing a full scan of each set, which takes O(M+N) time to join two sets of sizes M and N, provided the sets are sorted or hashed. You can also try to do something a
little more complicated. Rumor has it that search engines like Google and Bing only merge results until they have enough for a search page and then dump the sets they are loading, though I know very little about how they actually solve this problem.
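As a sketch of the O(M+N) scan, here is one way to intersect two lists of document ids, assuming both are already sorted in ascending order. The optional `limit` parameter illustrates the "stop once you have enough results" idea; it is only a guess at the general approach, not a description of how any real search engine works, and `PostingLists` is a made-up name.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration; the names are not from any library.
public static class PostingLists
{
    // Intersects two ascending lists of document ids by walking both at once.
    // Each step advances at least one cursor, so the cost is O(M+N).
    // Stops early once `limit` matches have been collected.
    public static List<int> Intersect(IReadOnlyList<int> a, IReadOnlyList<int> b, int limit = int.MaxValue)
    {
        var result = new List<int>();
        int i = 0, j = 0;

        while (i < a.Count && j < b.Count && result.Count < limit)
        {
            if (a[i] == b[j])
            {
                result.Add(a[i]); // id present in both lists
                i++;
                j++;
            }
            else if (a[i] < b[j])
            {
                i++; // a is behind, catch up
            }
            else
            {
                j++; // b is behind, catch up
            }
        }

        return result;
    }
}
```

Calling `PostingLists.Intersect(new[] { 0, 1 }, new[] { 1, 2 })` on the “turtles” and “my” sets from earlier yields a list containing only document 1.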
Here is an example of a simple inverted index written in C# that uses a Dictionary as the index and the LINQ Intersect function:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class InvertedIndex
{
    // Maps each lowercased word to the ids of the documents that contain it.
    private readonly Dictionary<string, HashSet<int>> _index = new Dictionary<string, HashSet<int>>();
    private readonly Regex _findWords = new Regex(@"[A-Za-z]+");

    public void Add(string text, int docId)
    {
        foreach (Match match in _findWords.Matches(text))
        {
            // Lowercase so that "Turtles" and "turtles" share one entry.
            var word = match.Value.ToLowerInvariant();

            if (!_index.TryGetValue(word, out var docIds))
            {
                docIds = new HashSet<int>();
                _index[word] = docIds;
            }

            docIds.Add(docId); // HashSet ignores ids it already holds
        }
    }

    public List<int> Search(string keywords)
    {
        IEnumerable<int> rtn = null;

        foreach (Match match in _findWords.Matches(keywords))
        {
            var word = match.Value.ToLowerInvariant();

            // A keyword that was never indexed means no document can match.
            if (!_index.TryGetValue(word, out var docIds))
                return new List<int>();

            rtn = rtn == null ? docIds : rtn.Intersect(docIds);
        }

        return rtn != null ? rtn.ToList() : new List<int>();
    }
}

from: https://nullwords.wordpress.com/2013/04/18/inverted-indexes-inside-how-search-engines-work/