
Elasticsearch scoring detailed explanation

2015-07-08 15:55

Score computation mechanism

I have been learning Elasticsearch recently, and I was really curious about how Elasticsearch computes the score of retrieved documents.

The official Elasticsearch documentation gives the formula for computing a document's score, but its explanation of the queryNorm factor is far from clear. I also wanted to know how the query-time boosting given to different fields enters the score computation, so I spent some time working through each step of the calculation. The detailed computation of queryNorm is given below.

This article mainly follows the score formula given in Lucene's Practical Scoring Function.

Important background to keep in mind

Always remember that Elasticsearch computes each document's score per shard. That is, tf, idf, and norm are computed with the shard, not the index, as the basic unit. This follows from how Elasticsearch builds its indices: each index can be split into several shards, and each shard may live on a different server, so computing scores per shard is the reasonable choice. If an index has multiple shards, the search runs on every shard, the score of each document found in a shard is computed there, and finally the results from all shards are merged and re-sorted by score.

Also note that even within a shard, the score is actually computed separately on each field of a document, and the per-field scores are then summed to give the document's final score. (Elasticsearch in fact builds a separate index for each field.)

Score Equation

Lucene's Practical Scoring Function gives the following score formula:

score(q,d) = queryNorm(q) ⋅ coord(q,d) ⋅ ∑ (tf(t in d) ⋅ idf(t)² ⋅ t.getBoost() ⋅ norm(t,d))   (t in q)    (1)

Most of the factors in this formula are straightforward; the trickiest part is queryNorm, whose detailed computation is given below.

score(q,d) is computed separately on each field and then summed (depending on how you ask Elasticsearch to combine the fields).

Term frequency

tf(t in d) = √frequency

The term frequency (tf) for term t in document d is the square root of the number of times the term appears in the document.

Note that tf is actually counted per field.

Inverse document frequency

idf(t) = 1 + log ( numDocs / (docFreq + 1))

The inverse document frequency (idf) of term t is the logarithm of the number of documents in the index, divided by the number of documents that contain the term.

idf is also computed per field.

Field-length norm

norm(d) = 1 / √numTerms

The field-length norm (norm) is the inverse square root of the number of terms in the field.

These three statistics are the most common ones in text retrieval. Elasticsearch multiplies them together to obtain the weight of a token in a field: tf * idf * norm.
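As a quick illustration, here is a minimal Python sketch of the three per-field statistics and their product (the function names are mine, not Lucene's):

import math

def tf(term_freq):
    # tf(t in d) = sqrt(frequency)
    return math.sqrt(term_freq)

def idf(num_docs, doc_freq):
    # idf(t) = 1 + log(numDocs / (docFreq + 1)), natural logarithm
    return 1 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    # norm(d) = 1 / sqrt(number of terms in the field)
    return 1 / math.sqrt(num_terms)

# Weight of a token that occurs once in a 3-term field,
# in a 2-document shard where both documents contain the token:
weight = tf(1) * idf(2, 2) * field_norm(3)
print(weight)  # 1.0 * 0.5945349 * 0.5773503 ≈ 0.3433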

Elasticsearch does not use a pure Vector Space Model, because computing full document vectors is expensive. Instead, its scoring combines the Boolean Model, the TF/IDF model, and the Vector Space Model.

As formula (1) shows, a per-field score is computed for each token in the query; the per-token scores are summed and then multiplied by queryNorm(q) and coord(q,d) to produce the final score.

Note in particular that although the official formula multiplies each token's score by t.getBoost(), that is not what happens in practice. In the actual computation, t.getBoost() is folded into queryNorm(q), and the computation of queryNorm(q) also incorporates query-time boosting.

The official explanation of t.getBoost() reads:

In fact, reading the explain output is a little more complex than that. You won’t see the boost value or t.getBoost() mentioned in the explanation at all. Instead, the boost is rolled into the queryNorm that is applied to a particular term. Although we said that the queryNorm is the same for every term, you will see that the queryNorm for a boosted term is higher than the queryNorm for an unboosted term.

So in the actual computation, the document-dependent part of each token's per-field score is just tf * idf * norm (the fieldWeight in the explain output); the boost ends up inside queryNorm instead.

As for the idf(t)² in the formula: in the explain output, idf(t) appears once inside queryWeight and once inside fieldWeight, and it is the product of the two that realizes the idf(t)² factor for each matched term.

To emphasize once more: tf, idf, and norm are all computed per field for each token.

Query Coordination

The coord(q,d) factor in the formula is easy to understand. Roughly speaking, if the query contains three words, then the more of those words appear in a retrieved document, the more relevant that document is.

For example, take the query "oracle database setup":

if the title field of a retrieved doc1 contains only the two words "oracle database", then coord(q,d) = 2/3 on that field.

The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.

Imagine that we have a query for quick brown fox, and that the weight for each term is 1.5. Without the coordination factor, the score would just be the sum of the weights of the terms in a document. For instance:

Document with fox → score: 1.5

Document with quick fox → score: 3.0

Document with quick brown fox → score: 4.5

The coordination factor multiplies the score by the number of matching terms in the document, and divides it by the total number of terms in the query. With the coordination factor, the scores would be as follows:

Document with fox → score: 1.5 * 1 / 3 = 0.5

Document with quick fox → score: 3.0 * 2 / 3 = 2.0

Document with quick brown fox → score: 4.5 * 3 / 3 = 4.5

The coordination factor results in the document that contains all three terms being much more relevant than the document that contains just two of them.
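A small Python sketch reproducing the arithmetic above (the 1.5 per-term weight is taken from the quoted example):

def coord(matching_terms, total_query_terms):
    # coord(q, d) = matched query terms / total query terms
    return matching_terms / total_query_terms

term_weight = 1.5
for matched in (1, 2, 3):
    score = term_weight * matched * coord(matched, 3)
    print(matched, score)  # 1 -> 0.5, 2 -> 2.0, 3 -> 4.5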

Query Normalization Factor

That leaves the hardest part: queryNorm(q). The official explanation is rather muddled:

The query normalization factor (queryNorm) is an attempt to normalize a query so that the results from one query may be compared with the results of another.

The point of queryNorm(q) is to put the scores of different queries onto the same scale, so that the results of different queries could be compared directly.

Even though the intent of the query norm is to make results from different queries comparable, it doesn’t work very well. The only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries.

This factor is calculated at the beginning of the query. The actual calculation depends on the queries involved, but a typical implementation is as follows:

queryNorm = 1 / √sumOfSquaredWeights

The sumOfSquaredWeights is calculated by adding together the IDF of each term in the query, squared.

The same query normalization factor is applied to every document, and you have no way of changing it. For all intents and purposes, it can be ignored.
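Taking the quoted "typical implementation" at face value, a naive sketch (my own code, using the IDF values from the worked example later in this article) would be:

import math

idf_apple = 1 + math.log(2 / (2 + 1))   # 0.5945349
idf_iphone = 1 + math.log(2 / (1 + 1))  # 1.0

sum_of_squared_weights = idf_apple ** 2 + idf_iphone ** 2
query_norm = 1 / math.sqrt(sum_of_squared_weights)
print(query_norm)  # ≈ 0.8596, not the 0.80482966 seen in the explain output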

Following this official description, there is no way to reproduce the queryNorm shown in Elasticsearch's explain output. So let's look at how Lucene defines it.

queryNorm in Lucene

TFIDFSimilarity defines queryNorm as follows:

queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / √sumOfSquaredWeights

sumOfSquaredWeights = q.getBoost()² ⋅ ∑ (idf(t) ⋅ t.getBoost())²   (t in q)

OK, that is somewhat clearer. But how are q.getBoost() and t.getBoost() obtained? The documentation seems to go no further.

Lucene offers only the following explanation:

t.getBoost() is a search time boost of term t in the query q as specified in the query text (see query syntax), or as set by application calls to setBoost(). Notice that there is really no direct API for accessing a boost of one term in a multi term query, but rather multi terms are represented in a query as multi TermQuery objects, and so the boost of a term in the query is accessible by calling the sub-query getBoost().

This explanation is of little practical help.

Returning to my original question: I mainly wanted to understand how, after giving different fields different boosts, the boost information is folded into queryNorm(q). After a fairly painful investigation, I finally figured out how the boosts are incorporated into queryNorm(q).

For example, suppose we run the following query against Elasticsearch:

GET /test/news/_search?explain
{
  "query": {
    "multi_match": {
      "query": "apple iphone6",
      "fields": ["title^3", "body^2"],
      "type": "most_fields"
    }
  }
}


Here my documents have "title", "body", and other fields. I want to query on both "title" and "body", giving the title field a boost of 3 and the body field a boost of 2, and I want the per-field scores summed to form the document's total score ("type": "most_fields").

One might read this as q.getBoost() being 3 on the title field and 2 on the body field, and likewise t.getBoost() being 3 for terms on title and 2 for terms on body.

But plugging these into the sumOfSquaredWeights formula above does not yield the queryNorm value in the explain output Elasticsearch gives.

Working from this concrete example, the actual formula for the queryNorm of a given field turns out to be:

sumOfSquaredWeights = (1 / fieldBoost)² ⋅ ∑(t in q) ∑(field in searchFields) (idf(t in field) ⋅ t.getBoost())²

where fieldBoost is the boost of the field whose queryNorm is being computed, and t.getBoost() for a term on a field is that field's boost.

It honestly took me quite a while to figure out that this is how sumOfSquaredWeights is computed.

With this formula, it is clear how the field boosts and the other boost information are folded into queryNorm.
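Here is a minimal Python sketch of this per-field formula (all names are mine: field_boosts holds the per-field boosts and field_idfs maps each searched field to the IDFs of the query terms in that field). The numbers are taken from the worked example that follows:

import math

def query_norm(field, field_boosts, field_idfs):
    # sumOfSquaredWeights = (1 / fieldBoost)^2
    #   * sum over all terms and all searched fields of (idf * boost)^2
    s = sum((idf * field_boosts[f]) ** 2
            for f, idfs in field_idfs.items()
            for idf in idfs)
    s *= (1 / field_boosts[field]) ** 2
    return 1 / math.sqrt(s)

field_boosts = {"title": 8, "body": 3}
# IDFs of "apple" and "iphone" in each field (see the worked example):
field_idfs = {"title": [0.5945349, 1.0], "body": [0.5945349, 1.0]}

print(query_norm("title", field_boosts, field_idfs))  # ≈ 0.80482966
print(query_norm("body", field_boosts, field_idfs))   # ≈ 0.30181113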

Good, the formulas are all sorted out. Let's verify them with a concrete example.

Worked example

First, set up the index so that it has only one shard; this keeps the scores simpler to follow. With multiple shards, each document is hashed by its id to decide which shard it lands in, so we would not know exactly which documents a shard contains and could not compute tf, idf, and norm by hand.

PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}


Next, define a simple mapping:

PUT /test/_mapping/news
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "english"
    },
    "body": {
      "type": "string",
      "analyzer": "english"
    },
    "version": {
      "type": "string",
      "analyzer": "english"
    }
  }
}


Then index two documents:

PUT /test/news/1
{
  "title": "apple released iphone",
  "body": "last day, apple company has released their latest product iphone 6, which is the biggest ihpone in histroy"
}

PUT /test/news/2
{
  "title": "microsoft suied apple",
  "body": "microsoft told that apple has used many of their patents, apple need to pay for these patents for 12 billion"
}


Now, run the search:

GET /test/news/_search?explain
{
  "query": {
    "multi_match": {
      "query": "apple iphone",
      "fields": ["title^8", "body^3"],
      "type": "most_fields"
    }
  }
}


The JSON result is long, so it is not reproduced in full here. First, the parts relevant to our calculation:

1. The score of "apple" in the title field of document 1:

{
  "value": 0.14224225,
  "description": "score(doc=0,freq=1.0), product of:",
  "details": [
    {
      "value": 0.4784993,
      "description": "queryWeight, product of:",
      "details": [
        {
          "value": 0.5945349,
          "description": "idf(docFreq=2, maxDocs=2)"
        },
        {
          "value": 0.80482966,
          "description": "queryNorm"
        }
      ]
    },
    {
      "value": 0.29726744,
      "description": "fieldWeight in 0, product of:",
      "details": [
        {
          "value": 1,
          "description": "tf(freq=1.0), with freq of:",
          "details": [
            {
              "value": 1,
              "description": "termFreq=1.0"
            }
          ]
        },
        {
          "value": 0.5945349,
          "description": "idf(docFreq=2, maxDocs=2)"
        },
        {
          "value": 0.5,
          "description": "fieldNorm(doc=0)"
        }
      ]
    }
  ]
}


We can check this ourselves: for "apple", tf = 1 and idf = 1 + Math.log(2 / (2 + 1)) = 0.5945349 (maxDocs = 2, docFreq = 2).

The field norm is 1 / √3 = 0.5773502691896258, but since Elasticsearch stores the norm in a single byte, precision is lost and it becomes 0.5.

Next comes the crucial part: computing queryNorm.

Definitions:

Let idf1 be the idf of "apple" in title: idf1 = 0.5945349

Let idf2 be the idf of "apple" in body: idf2 = 0.5945349

("apple" appears in both documents in both the title and body fields, so idf1 = idf2 = 1 + log(2/3).)

Let idf3 be the idf of "iphone" in title: idf3 = 1 + log(2/2) = 1

Let idf4 be the idf of "iphone" in body: idf4 = 1

("iphone" appears in only one document in each of the title and body fields, so idf3 = idf4 = 1 + log(2/2).)

Then compute sumOfSquaredWeights.

The sumOfSquaredWeights of this query on the title field:

1/8 * 1/8 * (idf1 * idf1 * 8 * 8 + idf2 * idf2 * 3 * 3 + idf3 * idf3 * 8 * 8 + idf4 * idf4 * 3 * 3) = 1.543803711784605

queryNorm = 1/Math.sqrt(1.543803711784605) = 0.8048296354648813

As you can see, this matches exactly the queryNorm for the title field in Elasticsearch's explain output.

The sumOfSquaredWeights of this query on the body field:

1/3 * 1/3 * (idf1 * idf1 * 8 * 8 + idf2 * idf2 * 3 * 3 + idf3 * idf3 * 8 * 8 + idf4 * idf4 * 3 * 3) = 10.97816

queryNorm = 1/Math.sqrt(10.97816) = 0.30181113
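A quick sanity check of both values (a sketch using the definitions above):

import math

idf1 = idf2 = 1 + math.log(2 / 3)  # apple in title / body
idf3 = idf4 = 1.0                  # iphone in title / body

s = (idf1 * 8) ** 2 + (idf2 * 3) ** 2 + (idf3 * 8) ** 2 + (idf4 * 3) ** 2
print(1 / math.sqrt(s / 8 ** 2))  # 0.8048296... (queryNorm on title)
print(1 / math.sqrt(s / 3 ** 2))  # 0.3018111... (queryNorm on body)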

queryNorm comparison

Once field boosts are given, you can observe that the ratio between the queryNorm values of different fields equals the ratio of the field boosts.

In this example:

the field boost ratio is 8/3;

the queryNorm ratio is 0.80482966 / 0.30181113 = 8/3.

O(∩_∩)O Haha, COOL!

The complete explain result for this query is given below, for readers who want to work through the numbers themselves:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.6467803,
"hits": [
{
"_shard": 0,
"_node": "hwVl0ucyS_6Ps9-xQ2Ihbw",
"_index": "test",
"_type": "news",
"_id": "1",
"_score": 0.6467803,
"_source": {
"title": "apple released iphone",
"body": "last day, apple company has released their latest product iphone 6, which is the biggest ihpone in histroy"
},
"_explanation": {
"value": 0.6467803,
"description": "sum of:",
"details": [
{
"value": 0.5446571,
"description": "sum of:",
"details": [
{
"value": 0.14224225,
"description": "weight(title:appl in 0) [PerFieldSimilarity], result of:",
"details": [
{ "value": 0.14224225, "description": "score(doc=0,freq=1.0), product of:", "details": [ { "value": 0.4784993, "description": "queryWeight, product of:", "details": [ { "value": 0.5945349, "description": "idf(docFreq=2, maxDocs=2)" }, { "value": 0.80482966, "description": "queryNorm" } ] }, { "value": 0.29726744, "description": "fieldWeight in 0, product of:", "details": [ { "value": 1, "description": "tf(freq=1.0), with freq of:", "details": [ { "value": 1, "description": "termFreq=1.0" } ] }, { "value": 0.5945349, "description": "idf(docFreq=2, maxDocs=2)" }, { "value": 0.5, "description": "fieldNorm(doc=0)" } ] } ] }
]
},
{
"value": 0.40241483,
"description": "weight(title:iphon in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.40241483,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.80482966,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)"
},
{
"value": 0.80482966,
"description": "queryNorm"
}
]
},
{
"value": 0.5,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)"
},
{
"value": 0.5,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
},
{
"value": 0.10212321,
"description": "sum of:",
"details": [
{
"value": 0.026670424,
"description": "weight(body:appl in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.026670424,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.17943723,
"description": "queryWeight, product of:",
"details": [
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.30181113,
"description": "queryNorm"
}
]
},
{
"value": 0.14863372,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.25,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
},
{
"value": 0.07545278,
"description": "weight(body:iphon in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.07545278,
"description": "score(doc=0,freq=1.0), product of:",
"details": [
{
"value": 0.30181113,
"description": "queryWeight, product of:",
"details": [
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)"
},
{
"value": 0.30181113,
"description": "queryNorm"
}
]
},
{
"value": 0.25,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 1,
"description": "idf(docFreq=1, maxDocs=2)"
},
{
"value": 0.25,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
}
]
}
},
{
"_shard": 0,
"_node": "hwVl0ucyS_6Ps9-xQ2Ihbw",
"_index": "test",
"_type": "news",
"_id": "2",
"_score": 0.08997996,
"_source": {
"title": "microsoft suied apple",
"body": "microsoft told that apple has used many of their patents, apple need to pay for these patents for 12 billion"
},
"_explanation": {
"value": 0.08997996,
"description": "sum of:",
"details": [
{
"value": 0.07112113,
"description": "product of:",
"details": [
{
"value": 0.14224225,
"description": "sum of:",
"details": [
{
"value": 0.14224225,
"description": "weight(title:appl in 0) [PerFieldSimilarity], result of:",
"details": [
{ "value": 0.14224225, "description": "score(doc=0,freq=1.0), product of:", "details": [ { "value": 0.4784993, "description": "queryWeight, product of:", "details": [ { "value": 0.5945349, "description": "idf(docFreq=2, maxDocs=2)" }, { "value": 0.80482966, "description": "queryNorm" } ] }, { "value": 0.29726744, "description": "fieldWeight in 0, product of:", "details": [ { "value": 1, "description": "tf(freq=1.0), with freq of:", "details": [ { "value": 1, "description": "termFreq=1.0" } ] }, { "value": 0.5945349, "description": "idf(docFreq=2, maxDocs=2)" }, { "value": 0.5, "description": "fieldNorm(doc=0)" } ] } ] }
]
}
]
},
{
"value": 0.5,
"description": "coord(1/2)"
}
]
},
{
"value": 0.018858837,
"description": "product of:",
"details": [
{
"value": 0.037717674,
"description": "sum of:",
"details": [
{
"value": 0.037717674,
"description": "weight(body:appl in 0) [PerFieldSimilarity], result of:",
"details": [
{
"value": 0.037717674,
"description": "score(doc=0,freq=2.0), product of:",
"details": [
{
"value": 0.17943723,
"description": "queryWeight, product of:",
"details": [
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.30181113,
"description": "queryNorm"
}
]
},
{
"value": 0.21019982,
"description": "fieldWeight in 0, product of:",
"details": [
{
"value": 1.4142135,
"description": "tf(freq=2.0), with freq of:",
"details": [
{
"value": 2,
"description": "termFreq=2.0"
}
]
},
{
"value": 0.5945349,
"description": "idf(docFreq=2, maxDocs=2)"
},
{
"value": 0.25,
"description": "fieldNorm(doc=0)"
}
]
}
]
}
]
}
]
},
{
"value": 0.5,
"description": "coord(1/2)"
}
]
}
]
}
}
]
}
}