
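Elasticsearch: The Definitive Guide - Multifield Search

2018-03-01 21:55

Multifield Search

Queries are rarely a single, simple match clause. We frequently need to search for the same or different query strings in one or more fields, which means that we need to be able to combine multiple query clauses and their relevance scores in a way that makes sense.

Perhaps we're looking for a book called War and Peace by an author called Leo Tolstoy. Perhaps we're searching the Elasticsearch documentation for "minimum should match", which might be in the title or the body of a page. Or perhaps we're searching for users with first name John and last name Smith.

In this chapter, we present the available tools for constructing multiclause searches and explain how to figure out which solution to apply to your particular use case.

Multiple Query Strings

The simplest multifield query to deal with is the one where we can map search terms to specific fields. If we know that War and Peace is the title and Leo Tolstoy is the author, it is easy to write each of these conditions as a match clause and to combine them with a bool query: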

GET /_search
{
"query": {
"bool": {
"should": [
{ "match": { "title":  "War and Peace" }},
{ "match": { "author": "Leo Tolstoy"   }}
]
}
}
}


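The bool query takes a more-matches-is-better approach, so the scores from each match clause are added together to provide the final _score for each document. Documents that match both clauses score higher than documents that match just one.

Of course, you're not restricted to match clauses: the bool query can wrap any other query type, including other bool queries. We could add a clause to specify that we prefer versions of the book translated by particular translators: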

GET /_search
{
"query": {
"bool": {
"should": [
{ "match": { "title":  "War and Peace" }},
{ "match": { "author": "Leo Tolstoy"   }},
{ "bool":  {
"should": [
{ "match": { "translator": "Constance Garnett" }},
{ "match": { "translator": "Louise Maude"      }}
]
}}
]
}
}
}


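Why did we put the translator clauses inside a separate bool query? All four match queries are should clauses, so why didn't we just put the translator clauses at the same level as the title and author clauses?

The answer lies in how the score is calculated. The bool query runs each match query, adds their scores together, then multiplies by the number of matching clauses and divides by the total number of clauses. Each clause at the same level has the same weight. In the preceding query, the bool query containing the translator clauses counts for one-third of the total score. If we had put the translator clauses at the same level as title and author, they would have reduced the contribution of the title and author clauses to one-quarter each.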
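The effect of nesting on clause weight can be sketched in a few lines of Python. This is a simplified model rather than Elasticsearch code: clause_shares is a hypothetical helper, and it assumes only the rule stated above, namely that sibling should clauses split their parent's weight equally.

```python
def clause_shares(clauses, weight=1.0):
    """Return {leaf_name: share} for a nested list of should clauses.

    A leaf is a string; a nested bool query is a list of clauses.
    Assumption: siblings split their parent's weight equally.
    """
    shares = {}
    for clause in clauses:
        part = weight / len(clauses)
        if isinstance(clause, list):
            # A nested bool query passes its share down to its own clauses.
            shares.update(clause_shares(clause, part))
        else:
            shares[clause] = part
    return shares


nested = clause_shares(["title", "author", ["translator_1", "translator_2"]])
flat = clause_shares(["title", "author", "translator_1", "translator_2"])
print(nested)  # title and author keep one-third each
print(flat)    # every clause drops to one-quarter
```

With the nested form, the two translator clauses share a single third between them, so they cannot dilute the title and author clauses.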

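Prioritizing Clauses

An even one-third split between the clauses is probably not what we need for the preceding query. We are likely more interested in the title and author clauses than in the translator clauses, so we need to tune the query to make the title and author clauses relatively more important.

The simplest tool for this is the boost parameter. To increase the weight of the title and author fields, give them a boost value higher than 1: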

GET /_search
{
"query": {
"bool": {
"should": [
{ "match": { <1>
"title":  {
"query": "War and Peace",
"boost": 2
}}},
{ "match": { <1>
"author":  {
"query": "Leo Tolstoy",
"boost": 2
}}},
{ "bool":  { <2>
"should": [
{ "match": { "translator": "Constance Garnett" }},
{ "match": { "translator": "Louise Maude"      }}
]
}}
]
}
}
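}

<1> The title and author clauses have a boost value of 2.

<2> The nested bool clause has the default boost of 1.

The "best" value for the boost parameter is most easily determined by trial and error: set a boost value, run test queries, repeat. A reasonable range for boost lies between 1 and 10, maybe 15. Boosts higher than that have little additional impact because scores are normalized.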
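When trying out several boost values, it can help to generate the request body instead of editing it by hand. The following is a minimal sketch under that assumption; boosted_match and build_query are hypothetical helper names, and the code only builds the JSON body shown above without sending it anywhere.

```python
import json


def boosted_match(field, text, boost=1.0):
    """Build a match clause; the boost is spelled out only when it is not 1."""
    if boost == 1.0:
        return {"match": {field: text}}
    return {"match": {field: {"query": text, "boost": boost}}}


def build_query(title_boost, author_boost):
    """Build the War and Peace query with the given title and author boosts."""
    return {"query": {"bool": {"should": [
        boosted_match("title", "War and Peace", title_boost),
        boosted_match("author", "Leo Tolstoy", author_boost),
        {"bool": {"should": [
            boosted_match("translator", "Constance Garnett"),
            boosted_match("translator", "Louise Maude"),
        ]}},
    ]}}}


# One request body per candidate boost value, ready to paste into a console.
for boost in (1, 2, 5):
    print(json.dumps(build_query(boost, boost)))
```

Each printed body can then be run as a test query and the resulting rankings compared.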

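Single Query String

The bool query is the mainstay of multiclause queries. It works well for many cases, especially when you are able to map different query strings to individual fields.

The problem is that, these days, users expect to type all of their search terms into a single field and expect the application to figure out how to give them the right results. It is ironic that the multifield search form is known as Advanced Search: it may look advanced to the user, but it is much simpler to implement.

There is no simple one-size-fits-all approach to multiword, multifield queries. To get the best results, you have to know your data and know how to use the appropriate tools.

Know Your Data

When your only user input is a single query string, you will encounter three scenarios frequently:

1. Best fields

When searching for words that represent a concept, such as "brown fox", the words mean more together than they do individually. Fields like title and body, while related, can be considered to be in competition with each other. Documents should have as many words as possible in the same field, and the score should come from the best-matching field.

2. Most fields

A common technique for fine-tuning relevance is to index the same data into multiple fields, each with its own analysis chain.

The main field may contain words in their stemmed form, synonyms, and words stripped of their diacritics (accents). It is used to match as many documents as possible.

The same text could then be indexed in other fields to provide more precise matching. One field may contain the unstemmed version, another the original word with accents, and a third might use shingles to provide information about word proximity.

These other fields act as signals to increase the relevance score of each matching document. The more fields that match, the better.

3. Cross fields

For some entities, the identifying information is spread across multiple fields, each of which contains just a part of the whole:

Person: first_name and last_name

Book: title, author, and description

Address: street, city, country, and postcode

In this case, we want to find as many words as possible in any of the listed fields. We need to search across multiple fields as if they were one big field.

All of these are multiword, multifield queries, but each requires a different strategy. We will examine each strategy in turn in the rest of this chapter.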


Best Fields

Imagine that we have a website that allows users to search blog posts, such as these two documents:

PUT /my_index/my_type/1
{
"title": "Quick brown rabbits",
"body":  "Brown rabbits are commonly seen."
}

PUT /my_index/my_type/2
{
"title": "Keeping pets healthy",
"body":  "My quick brown fox eats rabbits on a regular basis."
}


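The user types in the words "Brown fox" and clicks Search. We don't know ahead of time whether the user's search terms will be found in the title or the body field of the post, but the user is likely searching for related words. To our eyes, document 2 appears to be the better match, as it contains both of the words we are looking for.

Now we run the following bool query: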

{
"query": {
"bool": {
"should": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body":  "Brown fox" }}
]
}
}
}


And we find that this query gives document 1 the higher score:

{
"hits": [
{
"_id":      "1",
"_score":   0.14809652,
"_source": {
"title": "Quick brown rabbits",
"body":  "Brown rabbits are commonly seen."
}
},
{
"_id":      "2",
"_score":   0.09256032,
"_source": {
"title": "Keeping pets healthy",
"body":  "My quick brown fox eats rabbits on a regular basis."
}
}
]
}


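To understand why, think about how the bool query calculates its score:

1. It runs both of the queries in the should clause.

2. It adds their scores together.

3. It multiplies the total by the number of matching clauses.

4. It divides the result by the total number of clauses (two).

Document 1 contains the word brown in both fields, so both match clauses succeed and have a score. Document 2 contains both brown and fox in the body field but neither word in the title field. The high score from the body query is added to the zero score from the title query, multiplied by the number of matching clauses (one), and divided by the total number of clauses (two), resulting in a lower overall score than for document 1.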
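The four steps above can be sketched as a small Python function. This is an illustrative model only: bool_score is a hypothetical name, the per-clause scores are made-up numbers rather than real Elasticsearch output, and other factors that go into a real score are ignored.

```python
def bool_score(clause_scores):
    """Combine should-clause scores the way the four steps describe.

    clause_scores holds one score per should clause, 0 when it did not match.
    """
    matching = [s for s in clause_scores if s > 0]
    if not matching:
        return 0.0
    # Sum the scores, multiply by the matching count, divide by the total count.
    return sum(matching) * len(matching) / len(clause_scores)


# Hypothetical per-clause scores in the order [title, body].
doc1 = bool_score([0.3, 0.3])  # "brown" matches weakly in both fields
doc2 = bool_score([0.0, 0.5])  # "brown fox" matches strongly in body only
print(doc1, doc2)
```

Even though document 2 has the strongest single clause, it is halved for matching only one of the two clauses, so document 1 comes out ahead.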

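In this example, the title and body fields are competing with each other. We want to find the single best-matching field.

What if, instead of combining the scores from each field, we used the score from the best-matching field as the overall score for the query? This would give preference to a single field that contains both of the words we are looking for, rather than the same word repeated in different fields.

dis_max Query

Instead of the bool query, we can use the dis_max query (Disjunction Max Query). Disjunction means "or" (while conjunction means "and"), so the Disjunction Max Query simply means: return documents that match any of these queries, and return the score of the best-matching query: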

{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Brown fox" }},
{ "match": { "body":  "Brown fox" }}
]
}
}
}


This produces the results that we want:

{
"hits": [
{
"_id":      "2",
"_score":   0.21509302,
"_source": {
"title": "Keeping pets healthy",
"body":  "My quick brown fox eats rabbits on a regular basis."
}
},
{
"_id":      "1",
"_score":   0.12713557,
"_source": {
"title": "Quick brown rabbits",
"body":  "Brown rabbits are commonly seen."
}
}
]
}



Tuning Best Fields Queries

What would happen if the user had searched instead for "quick pets"? Both documents contain the word quick, but only document 2 contains the word pets. Neither document contains both words in the same field.

A simple dis_max query like the following would choose the single best-matching field and ignore the other:

{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body":  "Quick pets" }}
]
}
}
}



{
"hits": [
{
"_id": "1",
"_score": 0.12713557, <1>
"_source": {
"title": "Quick brown rabbits",
"body": "Brown rabbits are commonly seen."
}
},
{
"_id": "2",
"_score": 0.12713557, <1>
"_source": {
"title": "Keeping pets healthy",
"body": "My quick brown fox eats rabbits on a regular basis."
}
}
]
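}

<1> Note that the scores are exactly the same.

We would probably expect documents that match on both the title field and the body field to rank higher than documents that match on just one field, but this isn't the case. Remember: the dis_max query simply uses the _score from the single best-matching clause.

tie_breaker

It is possible, however, to also take the _score from the other matching clauses into account, by specifying the tie_breaker parameter: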

{
"query": {
"dis_max": {
"queries": [
{ "match": { "title": "Quick pets" }},
{ "match": { "body":  "Quick pets" }}
],
"tie_breaker": 0.3
}
}
}


This gives us the following results:

{
"hits": [
{
"_id": "2",
"_score": 0.14757764, <1>
"_source": {
"title": "Keeping pets healthy",
"body": "My quick brown fox eats rabbits on a regular basis."
}
},
{
"_id": "1",
"_score": 0.124275915, <1>
"_source": {
"title": "Quick brown rabbits",
"body": "Brown rabbits are commonly seen."
}
}
]
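}

<1> Document 2 now has a small lead over document 1.

The tie_breaker parameter makes the dis_max query behave more like a halfway house between dis_max and bool. It changes the score calculation as follows:

1. Take the _score of the best-matching clause.

2. Multiply the score of each of the other matching clauses by the tie_breaker.

3. Add them all together and normalize.

With the tie_breaker, all matching clauses count, but the best-matching clause counts most.

Tip: The tie_breaker can be a floating-point value between 0 and 1, where 0 uses just the best-matching clause (the same as a dis_max query without tie_breaker) and 1 counts all matching clauses equally. The exact value can be tuned based on your data and queries, but a reasonable value should be close to zero (for example, 0.1 to 0.4), in order not to overwhelm the best-matching nature of dis_max.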
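The tie_breaker calculation can be sketched in Python as follows. This is a simplified model: dis_max_score is a hypothetical name, the clause scores are invented for illustration, and the final normalization step is left out.

```python
def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times every other matching score.

    Normalization is omitted, so only the relative order is meaningful.
    """
    matching = [s for s in clause_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    others = sum(matching) - best
    return best + tie_breaker * others


# Hypothetical [title, body] scores for the "quick pets" search.
doc1 = [0.2, 0.0]  # quick in title only
doc2 = [0.2, 0.1]  # pets in title, quick in body

for tb in (0.0, 0.3, 1.0):
    print(tb, dis_max_score(doc1, tb), dis_max_score(doc2, tb))
```

With a tie_breaker of 0 the two documents tie, as in the plain dis_max results. Any positive value lets document 2 pull ahead, and a value of 1 simply sums all matching clauses.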

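multi_match Query

The multi_match query provides a convenient shorthand for running the same query against multiple fields.

Tip: There are several types of multi_match query, three of which happen to coincide with the three scenarios listed under "Know Your Data" in the "Single Query String" section: best_fields, most_fields, and cross_fields.

By default, this query runs as type best_fields, which means that it generates a match query for each field and wraps them in a dis_max query. This dis_max query: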

{
"dis_max": {
"queries":  [
{
"match": {
"title": {
"query": "Quick brown fox",
"minimum_should_match": "30%"
}
}
},
{
"match": {
"body": {
"query": "Quick brown fox",
"minimum_should_match": "30%"
}
}
}
],
"tie_breaker": 0.3
}
}


can be rewritten more concisely with multi_match as follows:

{
"multi_match": {
"query":                "Quick brown fox",
"type":                 "best_fields", <1>
"fields":               [ "title", "body" ],
"tie_breaker":          0.3,
"minimum_should_match": "30%" <2>
}
}


<1> The best_fields type is the default and can be left out.

<2> Parameters like minimum_should_match and operator are passed through to the generated match queries.
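The correspondence between the two forms can be made explicit with a short Python sketch. It is only a client-side illustration of the rewrite described above, not how Elasticsearch implements it internally; expand_best_fields is a hypothetical helper, and it handles just the parameters used in this example.

```python
def expand_best_fields(multi_match):
    """Rewrite a best_fields multi_match body as an equivalent dis_max body."""
    # These parameters are copied into every generated match clause.
    passthrough = {key: multi_match[key]
                   for key in ("minimum_should_match", "operator")
                   if key in multi_match}
    queries = [{"match": {field: {"query": multi_match["query"], **passthrough}}}
               for field in multi_match["fields"]]
    dis_max = {"queries": queries}
    if "tie_breaker" in multi_match:
        dis_max["tie_breaker"] = multi_match["tie_breaker"]
    return {"dis_max": dis_max}


multi_match = {
    "query": "Quick brown fox",
    "type": "best_fields",
    "fields": ["title", "body"],
    "tie_breaker": 0.3,
    "minimum_should_match": "30%",
}
expanded = expand_best_fields(multi_match)
print(expanded)
```

The printed structure has one match clause per field, each carrying its own copy of minimum_should_match, while tie_breaker stays on the enclosing dis_max.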

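Using Wildcards in Field Names

Field names can be specified with wildcards: any field that matches the wildcard pattern will be included in the search. You could match on the book_title, chapter_title, and section_title fields with the following: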

{
"multi_match": {
"query":  "Quick brown fox",
"fields": "*_title"
}
}


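Boosting Individual Fields

Individual fields can be boosted by using the caret (^) syntax: just add ^boost after the field name, where boost is a floating-point number: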

{
"multi_match": {
"query":  "Quick brown fox",
"fields": [ "*_title", "chapter_title^2" ] <1>
}
}

<1> The chapter_title field has a boost of 2, while the book_title and section_title fields have the default boost of 1.
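To make the field syntax concrete, here is a small Python sketch that reads a field specification the way a person would. The helper names parse_field_spec and expand_pattern are hypothetical, the list of known fields is invented, and shell-style matching from the standard library stands in for whatever pattern matching Elasticsearch performs against the mapping.

```python
from fnmatch import fnmatchcase


def parse_field_spec(spec):
    """Split 'name^boost' into (name, boost); the boost defaults to 1.0."""
    name, sep, boost = spec.partition("^")
    return name, float(boost) if sep else 1.0


def expand_pattern(pattern, known_fields):
    """Return the known field names that match a wildcard pattern."""
    return [field for field in known_fields if fnmatchcase(field, pattern)]


# Hypothetical field names, as they might appear in a mapping.
known_fields = ["book_title", "chapter_title", "section_title", "body"]

print(parse_field_spec("chapter_title^2"))
print(parse_field_spec("*_title"))
print(expand_pattern("*_title", known_fields))
```

The wildcard picks up all three title fields but not body, and only the field written with a caret carries a boost other than 1.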

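Most Fields

Full-text search is a battle between recall (returning all the documents that are relevant) and precision (not returning irrelevant documents). The goal is to present the user with the most relevant documents on the first page of results.

To improve recall, we cast the net wide: we include not only documents that match the user's search terms exactly, but also documents that we believe to be pertinent to the query. If a user searches for "quick brown fox", a document that contains fast foxes may well be a reasonable result to return.

If the only pertinent document we have is the one containing fast foxes, it will appear at the top of the results list. But if we have 100 documents that contain the words quick brown fox, the document containing fast foxes is less relevant, and we want to push it further down the list. After including many potential matches, we need to ensure that the best ones rise to the top.

A common technique for fine-tuning full-text relevance is to index the same text in multiple ways, each of which provides a different relevance signal. The main field contains terms in their broadest-matching form to match as many documents as possible. For instance, we could do the following:

Use a stemmer to index jumps, jumping, and jumped as their root form: jump. Then it doesn't matter if the user searches for jumped; we could still match documents containing jumping.

Include synonyms like jump, leap, and hop.

Remove diacritics, or accents: for example, ésta, está, and esta would all be indexed without accents as esta.

However, if we have two documents, one of which contains jumped and the other jumping, the user would probably expect the first document to rank higher, as it contains exactly what was typed in.

We can achieve this by indexing the same text in other fields to provide more precise matching. One field may contain the unstemmed version, another the original word with diacritics, and a third might use shingles to provide information about word proximity. These other fields act as signals that increase the relevance score of each matching document. The more fields that match, the better.

A document is included in the results list if it matches the broad-matching main field. If it also matches the signal fields, it gets extra points and is pushed up the results list.

We discuss synonyms, word proximity, partial matching, and other potential signals later in the book, but we will use the simple example of stemmed and unstemmed fields to illustrate this technique.

Multifield Mapping

The first thing to do is to set up our field to be indexed twice: once in a stemmed form and once in an unstemmed form. To do this, we will use multifields, which we introduced in the section on string sorting and multifields: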

DELETE /my_index

PUT /my_index
{
"settings": { "number_of_shards": 1 }, <1>
"mappings": {
"my_type": {
"properties": {
"title": { <2>
"type":     "string",
"analyzer": "english",
"fields": {
"std":   { <3>
"type":     "string",
"analyzer": "standard"
}
}
}
}
}
}
}


<1> See the section "Relevance Is Broken!".

<2> The title field is stemmed by the english analyzer.

<3> The title.std field uses the standard analyzer and so is not stemmed.

Next we index some documents:

PUT /my_index/my_type/1
{ "title": "My rabbit jumps" }

PUT /my_index/my_type/2
{ "title": "Jumping jack rabbits" }


Here is a simple match query on the title field for jumping rabbits:

GET /my_index/_search
{
"query": {
"match": {
"title": "jumping rabbits"
}
}
}


This becomes a query for the two stemmed terms jump and rabbit, thanks to the english analyzer. The title field of both documents contains both of those terms, so both documents receive the same score:

{
"hits": [
{
"_id": "1",
"_score": 0.42039964,
"_source": {
"title": "My rabbit jumps"
}
},
{
"_id": "2",
"_score": 0.42039964,
"_source": {
"title": "Jumping jack rabbits"
}
}
]
}


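If we were to query just the title.std field, only document 2 would match. However, if we query both fields and combine their scores by using the bool query, both documents match (thanks to the title field) and document 2 scores higher (thanks to the title.std field):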

GET /my_index/_search
{
"query": {
"multi_match": {
"query":  "jumping rabbits",
"type":   "most_fields", <1>
"fields": [ "title", "title.std" ]
}
}
}


<1> We want to combine the scores from all matching fields, so we use the most_fields type. This causes the multi_match query to wrap the two field clauses in a bool query instead of a dis_max query.

{
"hits": [
{
"_id": "2",
"_score": 0.8226396, <1>
"_source": {
"title": "Jumping jack rabbits"
}
},
{
"_id": "1",
"_score": 0.10741998, <1>
"_source": {
"title": "My rabbit jumps"
}
}
]
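}

<1> Document 2 now scores much higher than document 1.

We are using the broad-matching title field to include as many documents as possible (to increase recall), but we use the title.std field as a signal to push the most relevant results to the top (to increase precision).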
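Why the stemmed field matches both documents while the unstemmed field matches only one can be shown with a toy model in Python. Everything here is an assumption made for illustration: toy_stem is a crude suffix stripper, not the english analyzer, and standard_tokens only lowercases and splits on whitespace, which is far simpler than the standard analyzer.

```python
def standard_tokens(text):
    """Rough stand-in for the unstemmed title.std field."""
    return text.lower().split()


def toy_stem(token):
    """Strip a few common suffixes; a toy, not a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def stemmed_tokens(text):
    """Rough stand-in for the stemmed title field."""
    return [toy_stem(token) for token in standard_tokens(text)]


docs = {1: "My rabbit jumps", 2: "Jumping jack rabbits"}
query = "jumping rabbits"


def matching_ids(analyze):
    """IDs of documents sharing at least one analyzed term with the query."""
    terms = set(analyze(query))
    return [doc_id for doc_id, title in docs.items()
            if terms & set(analyze(title))]


print(matching_ids(stemmed_tokens))   # stemmed field: both documents
print(matching_ids(standard_tokens))  # unstemmed field: document 2 only
```

The stemmed field supplies recall, and the unstemmed field contributes an extra match only for the document that contains the words exactly as typed.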

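The contribution of each field to the final score can be controlled by specifying custom boost values. For instance, we could boost the title field to make it the most important field, thus reducing the effect of any other signal fields: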

GET /my_index/_search
{
"query": {
"multi_match": {
"query":       "jumping rabbits",
"type":        "most_fields",
"fields":      [ "title^10", "title.std" ] <1>
}
}
}


<1> The boost value of 10 on the title field makes that field relatively much more important than the title.std field.


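Cross-fields Entity Search

Now we come to a common pattern: cross-fields entity search. With entities like person, product, or address, the identifying information is spread across several fields. We may have a person indexed as follows: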

{
"firstname":  "Peter",
"lastname":   "Smith"
}


Or an address like this:

{
"street":   "5 Poland Street",
"city":     "London",
"country":  "United Kingdom",
"postcode": "W1V 3DG"
}


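This sounds a lot like the example we described in "Multiple Query Strings", but there is a big difference between the two scenarios. In "Multiple Query Strings", we used a separate query string for each field. In this scenario, we want to search across multiple fields with a single query string.

Our user might search for the person "Peter Smith" or for the address "Poland Street W1V". Each of those words appears in a different field, so using a dis_max/best_fields query to find the single best-matching field is clearly the wrong approach.

A Naive Approach

Really, we want to query each field in turn and add up the scores of every field that matches, which sounds like a job for the bool query: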

{
"query": {
"bool": {
"should": [
{ "match": { "street":    "Poland Street W1V" }},
{ "match": { "city":      "Poland Street W1V" }},
{ "match": { "country":   "Poland Street W1V" }},
{ "match": { "postcode":  "Poland Street W1V" }}
]
}
}
}


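Repeating the query string for every field soon becomes tedious. We can use the multi_match query instead, and set the type to most_fields to tell it to combine the scores of all matching fields: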

{
"query": {
"multi_match": {
"query":       "Poland Street W1V",
"type":        "most_fields",
"fields":      [ "street", "city", "country", "postcode" ]
}
}
}


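Problems with the most_fields Approach

The most_fields approach to entity search has some problems that are not immediately obvious:

It is designed to find the most fields matching any words, rather than to find the most matching words across all fields.

It can't use the operator or minimum_should_match parameters to reduce the long tail of less-relevant results.

Term frequencies are different in each field and could interfere with each other to produce badly ordered results.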


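Field-Centric Queries

All three of the preceding problems stem from most_fields being field-centric rather than term-centric: it looks for the most matching fields, when really what we're interested in is the most matching terms.

Tip: The best_fields type is also field-centric and suffers from similar problems.

First we'll look at why these problems exist, and then at how we can combat them.

Problem 1: Matching the Same Word in Multiple Fields

Think about how the most_fields query is executed: Elasticsearch generates a separate match query for each field and then wraps these match queries in an outer bool query.

We can see this by passing our query through the validate-query API: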

GET /_validate/query?explain
{
"query": {
"multi_match": {
"query":   "Poland Street W1V",
"type":    "most_fields",
"fields":  [ "street", "city", "country", "postcode" ]
}
}
}


It yields this explanation:

(street:poland   street:street   street:w1v)
(city:poland     city:street     city:w1v)
(country:poland  country:street  country:w1v)
(postcode:poland postcode:street postcode:w1v)


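You can see that a document matching just the word poland in two fields could score higher than a document matching poland and street in one field.

Problem 2: Trimming the Long Tail

In "Controlling Precision", we talked about using the and operator or the minimum_should_match parameter to trim the long tail of almost irrelevant results: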

{
  "query": {
    "multi_match": {
      "query":       "Poland Street W1V",
      "type":        "most_fields",
      "operator":    "and", <1>
      "fields":      [ "street", "city", "country", "postcode" ]
    }
  }
}


<1> All terms must be present.

However, with `best_fields` or `most_fields`, these parameters are passed down to the generated `match` queries. The explanation for this query (obtained through the `validate-query` API) shows the following:

(+street:poland   +street:street   +street:w1v)
(+city:poland     +city:street     +city:w1v)
(+country:poland  +country:street  +country:w1v)
(+postcode:poland +postcode:street +postcode:w1v)

In other words, using the `and` operator means that all words must exist in the same field, which is clearly wrong! It is unlikely that any documents would match this query.
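To see why the per-field `and` is so restrictive, consider this small Python sketch. It mimics only the boolean logic of the explanation above under a toy whitespace analyzer; the sample address and the helper name `field_centric_and` are assumptions for illustration.

```python
def analyze(text):
    """Toy analyzer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def field_centric_and(doc, query, fields):
    """True only if a single field contains every query term."""
    terms = set(analyze(query))
    return any(terms <= set(analyze(doc.get(f, ""))) for f in fields)


FIELDS = ["street", "city", "country", "postcode"]

# A hypothetical, perfectly reasonable address document.
address = {
    "street": "Poland Street",
    "city": "London",
    "country": "United Kingdom",
    "postcode": "W1V",
}

# No single field holds all three terms, so the document is rejected.
print(field_centric_and(address, "Poland Street W1V", FIELDS))  # False
```

The address clearly contains all three words, but they are spread across `street` and `postcode`, so the field-centric `and` never matches it.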

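Problem 3: Term Frequencies

In What Is Relevance?, we explained that the default similarity algorithm used to calculate the relevance score for each term is TF/IDF:

Term frequency::

The more often a term appears in a field in a single document, the more relevant the document.

Inverse document frequency::

The more often a term appears in a field in all documents in the index, the less relevant that term is.

When searching against multiple fields, TF/IDF can introduce some surprising results.

Consider our example of searching for "Peter Smith" using the `first_name` and `last_name` fields. Peter is a common first name and Smith is a common last name, so both have low IDFs. But what if the index contains another person named Smith Williams? Smith as a first name is very uncommon and so has a high IDF!

A simple query like the following may well return Smith Williams above Peter Smith, even though Peter Smith is clearly the better match: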

{
  "query": {
    "multi_match": {
      "query":       "Peter Smith",
      "type":        "most_fields",
      "fields":      [ "*_name" ]
    }
  }
}


The high IDF of `smith` in the `first_name` field can overwhelm the two low IDFs of `peter` in the `first_name` field and `smith` in the `last_name` field.
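To make the IDF effect concrete, here is a minimal Python sketch. It assumes a made-up index of 102 people and a deliberately simplified score that only sums the classic Lucene-style IDF, 1 + ln(N / (df + 1)), of each matching field/term pair. Term frequency, field-length norms, and the coordination factor are ignored, and the corpus and function names are illustrative rather than part of Elasticsearch.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def doc_freq(people, field, term):
    """Number of documents whose given field contains the term."""
    return sum(1 for p in people if term in p[field].lower().split())


def most_fields_score(person, query, fields, people):
    """Simplified field-centric score: sum the IDF of every matching (field, term)."""
    score = 0.0
    for field in fields:
        tokens = person[field].lower().split()
        for term in query.lower().split():
            if term in tokens:
                score += idf(doc_freq(people, field, term), len(people))
    return score


# Hypothetical index: Peter is a common first name, Smith a common last name,
# and exactly one person has Smith as a first name.
peter_smith = {"first_name": "Peter", "last_name": "Smith"}
smith_williams = {"first_name": "Smith", "last_name": "Williams"}
people = (
    [peter_smith]
    + [{"first_name": "Peter", "last_name": "Jones"}] * 50
    + [{"first_name": "Mary", "last_name": "Smith"}] * 50
    + [smith_williams]
)

FIELDS = ["first_name", "last_name"]
for person in (peter_smith, smith_williams):
    print(person, round(most_fields_score(person, "Peter Smith", FIELDS, people), 3))
```

Under these assumptions the single rare match on `first_name:smith` outweighs the two common matches, so Smith Williams is ranked first.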

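Solution

These problems only exist because we are dealing with multiple fields. If we were to combine all of these fields into a single field, the problems would vanish. We could achieve this by adding a `full_name` field to our `person` document:

{
  "first_name":  "Peter",
  "last_name":   "Smith",
  "full_name":   "Peter Smith"
}

When querying just the `full_name` field:

Documents with more matching words would trump documents with the same word repeated.

The `minimum_should_match` and `operator` parameters would function as expected.

The inverse document frequencies for first and last names would be combined, so it wouldn't matter whether Smith were a first or last name anymore.

While this would work, we don't like having to store redundant data. Instead, Elasticsearch offers us two solutions, one at index time and one at search time, which we discuss next.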


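Custom _all Fields

In Metadata: _all Field, we explained that the special `_all` field indexes the values from all other fields as one big string. Having all fields indexed into one field is not terribly flexible, though. It would be nice to have one custom `_all` field for the person's name, and another custom `_all` field for the address.

Elasticsearch provides us with this functionality via the `copy_to` parameter in a field mapping: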

PUT /my_index
{
  "mappings": {
    "person": {
      "properties": {
        "first_name": {
          "type":     "string",
          "copy_to":  "full_name" <1>
        },
        "last_name": {
          "type":     "string",
          "copy_to":  "full_name" <1>
        },
        "full_name": {
          "type":     "string"
        }
      }
    }
  }
}


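<1> The values in the `first_name` and `last_name` fields are also copied to the `full_name` field.

With this mapping in place, we can query the `first_name` field for first names, the `last_name` field for last names, or the `full_name` field for first and last names.

NOTE: Mappings of the `first_name` and `last_name` fields have no bearing on how the `full_name` field is indexed. The `full_name` field copies the string values from the other two fields, then indexes them according to the mapping of the `full_name` field only.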
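The effect of `copy_to` at index time can be imitated with a short Python sketch. It is a toy model under stated assumptions: the `MAPPING` dictionary only mirrors the `copy_to` entries of the mapping above, every field uses the same whitespace analyzer, and `index_document` and `match_all_terms` are hypothetical helpers, not Elasticsearch APIs.

```python
def analyze(text):
    """Toy analyzer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


# Only the copy_to relationships from the mapping are modelled here.
MAPPING = {
    "first_name": {"copy_to": "full_name"},
    "last_name": {"copy_to": "full_name"},
    "full_name": {},
}


def index_document(source, mapping):
    """Return the terms indexed for each field after applying copy_to."""
    raw = {field: [] for field in mapping}
    for field, value in source.items():
        raw[field].append(value)
        target = mapping[field].get("copy_to")
        if target:
            # The raw string is copied; the target field analyzes it on its own.
            raw[target].append(value)
    return {f: [tok for v in values for tok in analyze(v)] for f, values in raw.items()}


def match_all_terms(indexed, field, query):
    """Equivalent of a match query with operator "and" against one field."""
    return all(term in indexed[field] for term in analyze(query))


indexed = index_document({"first_name": "Peter", "last_name": "Smith"}, MAPPING)
print(indexed["full_name"])
print(match_all_terms(indexed, "full_name", "Peter Smith"))
print(match_all_terms(indexed, "first_name", "Peter Smith"))
```

Because both names end up in `full_name`, a query that requires all terms works against that single field, while the same requirement fails against `first_name` alone.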

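Cross-fields Queries

The custom `_all` approach is a good solution, as long as you thought about setting it up before you indexed your documents. However, Elasticsearch also provides a search-time solution to the problem: the `multi_match` query with type `cross_fields`. The `cross_fields` type takes a term-centric approach, quite different from the field-centric approach taken by `best_fields` and `most_fields`. It treats all of the fields as one big field, and looks for each term in any field.

To illustrate the difference between field-centric and term-centric queries, look at the explanation for this field-centric `most_fields` query (obtained through the `validate-query` API):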

GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":       "peter smith",
      "type":        "most_fields",
      "operator":    "and", <1>
      "fields":      [ "first_name", "last_name" ]
    }
  }
}


<1> The `operator` is set to `and`, so all terms are required.

For a document to match, both `peter` and `smith` must appear in the same field, either the `first_name` field or the `last_name` field:

(+first_name:peter +first_name:smith)
(+last_name:peter  +last_name:smith)

A term-centric approach would use this logic instead:

+(first_name:peter last_name:peter)
+(first_name:smith last_name:smith)

In other words, the term `peter` must appear in either field, and the term `smith` must appear in either field.
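The term-centric logic is easy to reproduce in Python. The sketch below builds the same clause layout as the explanation above and evaluates it against a document. It assumes a toy whitespace analyzer, and the function names are illustrative only.

```python
def analyze(text):
    """Toy analyzer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def term_centric_explanation(query, fields):
    """One required group per term, each listing that term in every field."""
    return "\n".join(
        "+(" + " ".join(f"{field}:{term}" for field in fields) + ")"
        for term in analyze(query)
    )


def term_centric_match(doc, query, fields):
    """Every term must appear in at least one of the fields."""
    return all(
        any(term in analyze(doc.get(field, "")) for field in fields)
        for term in analyze(query)
    )


FIELDS = ["first_name", "last_name"]
person = {"first_name": "Peter", "last_name": "Smith"}

print(term_centric_explanation("peter smith", FIELDS))
print(term_centric_match(person, "peter smith", FIELDS))
```

Unlike the field-centric version, this check accepts a person whose first name is Peter and whose last name is Smith, because each term only has to be found somewhere.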

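The `cross_fields` type first analyzes the query string to produce a list of terms, and then it searches for each term in any field. That difference alone solves two of the three problems that we listed in Field-centric Queries, leaving us with just the issue of differing inverse document frequencies.

Fortunately, the `cross_fields` type solves this too, as can be seen from this `validate-query` request: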

GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":       "peter smith",
      "type":        "cross_fields", <1>
      "operator":    "and",
      "fields":      [ "first_name", "last_name" ]
    }
  }
}


<1> Use `cross_fields` term-centric matching.

It solves the term-frequency problem by blending inverse document frequencies across fields:

+blended("peter", fields: [first_name, last_name])
+blended("smith", fields: [first_name, last_name])

In other words, it looks up the IDF of `smith` in both the `first_name` and the `last_name` fields and uses the minimum of the two as the IDF for both fields. The fact that `smith` is a common last name means that it will be treated as a common first name too.
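As a rough illustration of blending, the following Python sketch takes the minimum IDF across fields, as described above, and sums it per matching term. The document frequencies are invented, the IDF formula is the classic Lucene-style 1 + ln(N / (df + 1)), and the real implementation inside Elasticsearch is more involved than this.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def blended_idf(term, fields, doc_freqs, num_docs):
    """Use the lowest IDF found in any field as the IDF for all fields."""
    return min(idf(doc_freqs[field].get(term, 0), num_docs) for field in fields)


def blended_score(person, query, fields, doc_freqs, num_docs):
    """Simplified term-centric score: one blended IDF per term found in any field."""
    tokens = {tok for field in fields for tok in person[field].lower().split()}
    return sum(
        blended_idf(term, fields, doc_freqs, num_docs)
        for term in query.lower().split()
        if term in tokens
    )


# Hypothetical statistics: 102 people, Smith is rare as a first name.
NUM_DOCS = 102
DOC_FREQS = {
    "first_name": {"peter": 51, "smith": 1},
    "last_name": {"smith": 51},
}
FIELDS = ["first_name", "last_name"]

peter_smith = {"first_name": "Peter", "last_name": "Smith"}
smith_williams = {"first_name": "Smith", "last_name": "Williams"}

for person in (peter_smith, smith_williams):
    print(person, round(blended_score(person, "peter smith", FIELDS, DOC_FREQS, NUM_DOCS), 3))
```

With blended statistics the rare `first_name:smith` no longer gets an inflated weight, so the document matching both terms scores higher.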

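NOTE: For the `cross_fields` query type to work optimally, all fields should have the same analyzer. Fields that share an analyzer are grouped together as blended fields.

If you include fields with a different analysis chain, they will be added to the query in the same way as for `best_fields`. For instance, if we added the `title` field to the preceding query (assuming it uses a different analyzer), the explanation would be as follows:

(+title:peter +title:smith)
(
  +blended("peter", fields: [first_name, last_name])
  +blended("smith", fields: [first_name, last_name])
)

This is particularly important when using the `minimum_should_match` and `operator` parameters.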

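Per-field Boosting

One of the advantages of using the `cross_fields` query over custom `_all` fields is that you can boost individual fields at query time.

For fields of equal value like `first_name` and `last_name`, this generally isn't required, but if you were searching for books using the `title` and `description` fields, you might want to give more weight to the `title` field. This can be done with the caret (`^`) syntax described earlier: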

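GET /books/_search
{
  "query": {
    "multi_match": {
      "query":       "peter smith",
      "type":        "cross_fields",
      "fields":      [ "title^2", "description" ] <1>
    }
  }
}

<1> The `title` field has a boost of `2`, while the `description` field has the default boost of `1`.

The advantage of being able to boost individual fields should be weighed against the cost of querying multiple fields instead of querying a single custom `_all` field. Use whichever of the two solutions delivers the most benefit for your use case.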
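If you assemble the `fields` list in application code, it can help to see how the caret syntax breaks down. This small Python helper is purely illustrative (the name `parse_field_spec` is an assumption, not an Elasticsearch function): it splits a field specification into its name and boost, falling back to the default boost of 1.

```python
def parse_field_spec(spec):
    """Split "title^2" into ("title", 2.0); a bare field name gets boost 1.0."""
    name, sep, boost = spec.partition("^")
    return name, (float(boost) if sep else 1.0)


fields = ["title^2", "description"]
boosts = dict(parse_field_spec(spec) for spec in fields)
print(boosts)
```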


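Exact-value Fields

The final topic that we should touch on before leaving multifield queries is that of exact-value `not_analyzed` fields. It is not useful to mix `not_analyzed` fields with `analyzed` fields in `multi_match` queries.

The reason for this can be demonstrated easily by looking at a query explanation. Imagine that we have set the `title` field to be `not_analyzed`: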

GET /_validate/query?explain
{
  "query": {
    "multi_match": {
      "query":       "peter smith",
      "type":        "cross_fields",
      "fields":      [ "title", "first_name", "last_name" ]
    }
  }
}


Because the `title` field is not analyzed, it searches that field for a single term consisting of the whole query string!

title:peter smith
(
  blended("peter", fields: [first_name, last_name])
  blended("smith", fields: [first_name, last_name])
)

That term clearly does not exist in the inverted index of the `title` field, and can never be found. Avoid using `not_analyzed` fields in `multi_match` queries.
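The mismatch can be reproduced with a toy inverted index in Python. The sketch assumes that an analyzed field indexes one lowercase term per word while an exact-value field indexes the whole string as a single term. The sample title and the helper names are hypothetical.

```python
def analyze(text):
    """Toy analyzer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def index_terms(value, analyzed):
    """Analyzed fields index one term per word; exact-value fields index the whole string."""
    return analyze(value) if analyzed else [value]


def query_terms(query, analyzed):
    """The query string is processed the same way as the field it targets."""
    return analyze(query) if analyzed else [query]


def matches(value, query, analyzed):
    """True if any query term is among the terms indexed for the field."""
    indexed = set(index_terms(value, analyzed))
    return any(term in indexed for term in query_terms(query, analyzed))


title = "The Life of Peter Smith"  # hypothetical book title

print(matches(title, "peter smith", analyzed=True))
print(matches(title, "peter smith", analyzed=False))
```

Against the exact-value field, the only query term is the full string `peter smith`, which is not the single term stored for the title, so the clause contributes nothing.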


https://github.com/uxff/elasticsearch-definitive-guide-cn