Elasticsearch: The Definitive Guide - Proximity Matching
2018-03-01 21:56
== Proximity Matching
Standard full-text retrieval uses TF/IDF to process the text of a document or field. The text is split into individual terms (words) for indexing, and a `match` query uses the terms in the query string to look up those terms in the index. A `match` query can tell us whether a document contains the terms we are searching for, but that alone is not enough: it tells us nothing about the relationship between the terms. Consider these three sentences:

* Sue ate the alligator
* The alligator ate Sue
* Sue never goes anywhere without her alligator skin purse

A `match` query for `sue alligator` would match all three sentences, but TF/IDF does not tell us whether the two terms appear in the same sentence or merely in the same paragraph; it only reports how frequently they occur in the text.

Understanding how words relate to each other in a piece of text is a hard problem, and it cannot be solved by simply rephrasing the query. What we can know, however, is how far apart two terms are in the text, and even whether they are adjacent, and that distance is a reasonable signal of how strongly the terms are related.

A typical document is much longer than our examples: `sue` and `alligator` might be scattered across different paragraphs. We still want to find such documents, but we want to give documents in which the terms appear close together a higher relevance score.

This is the job of phrase matching, or proximity matching.
[TIP]
==================================================
In this chapter, we are using the same example documents that we used for
the <>.
==================================================

[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json

Like the `match` query, the `match_phrase` query first analyzes the query
string to produce a list of terms. It then searches for all the terms, but
keeps only documents that contain all of the search terms, in the same
positions relative to each other. A query for the phrase `quick fox`
would not match any of our documents, because no document contains the word
`quick` immediately followed by `fox`.

[TIP]
==================================================
The `match_phrase` query can also be written as a `match` query with type
`phrase`:

[source,js]
    "match": {
        "title": {
            "query": "quick brown fox",
            "type":  "phrase"
        }
    }
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json
==================================================
==== Term Positions
When a string is analyzed, the analyzer returns not((("phrase matching", "term positions")))((("match_phrase query", "position of terms")))((("position-aware matching"))) only a list of terms, but
also the position, or order, of each term in the original string:
[source,js]
GET /_analyze?analyzer=standard
Quick brown fox
// SENSE: 120_Proximity_Matching/05_Term_positions.json

This returns the following:

[role="pagebreak-before"]
[source,js]
{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1 <1>
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2 <1>
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3 <1>
      }
   ]
}
<1> The `position` of each term in the original string.

Positions can be stored in the inverted index, and position-aware queries like
the `match_phrase` query can use them to match only documents that contain
all the words in exactly the order specified, with no words in-between.
==== What Is a Phrase
For a document to be considered a((("match_phrase query", "documents matching a phrase")))((("phrase matching", "criteria for matching documents"))) match for the phrase "quick brown fox," the following must be true:

* `quick`, `brown`, and `fox` must all appear in the field.
* The position of `brown` must be `1` greater than the position of `quick`.
* The position of `fox` must be `2` greater than the position of `quick`.

If any of these conditions is not met, the document is not considered a match.
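These conditions can be sketched in a few lines of Python. This is only an illustrative model of the matching rules, not Elasticsearch's implementation; it assumes each term occurs at most once in the document, with a `doc_positions` dict mapping each indexed term to its position:

```python
def phrase_match(query_terms, doc_positions):
    """Apply the match_phrase conditions: every query term must be present,
    and term i must appear exactly i positions after the first term.
    doc_positions maps each document term to its position (this simplified
    model assumes each term occurs only once in the document)."""
    if query_terms[0] not in doc_positions:
        return False
    base = doc_positions[query_terms[0]]
    return all(
        doc_positions.get(term) == base + i
        for i, term in enumerate(query_terms)
    )

doc = {"quick": 1, "brown": 2, "fox": 3}
print(phrase_match(["quick", "brown", "fox"], doc))  # True
print(phrase_match(["quick", "fox"], doc))           # False: fox is 2, not 1, after quick
```

Note how the `quick fox` query fails exactly as described above: both terms are present, but `fox` is not in the required relative position.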
[TIP]
==================================================
Internally, the `match_phrase` query uses the low-level `span` query family to
do position-aware matching. ((("match_phrase query", "use of span queries for position-aware matching")))((("span queries")))Span queries are term-level queries, so they have
no analysis phase; they search for the exact term specified.

Thankfully, most people never need to use the `span` queries directly, as the
`match_phrase` query is usually good enough. However, certain specialized
fields, like patent searches, use these low-level queries to perform very
specific, carefully constructed positional searches.
==================================================
[[slop]]
=== Mixing It Up
Requiring exact-phrase matches ((("proximity matching", "slop parameter")))may be too strict a constraint. Perhaps we do
want documents that contain "quick brown fox" to be considered a match for
the query "quick fox," even though the positions aren't exactly equivalent.

We can introduce a degree ((("slop parameter")))of flexibility into phrase matching by using the
`slop` parameter:
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop": 1
            }
        }
    }
}
// SENSE: 120_Proximity_Matching/10_Slop.json

The `slop` parameter tells the `match_phrase` query how((("match_phrase query", "slop parameter"))) far apart terms are
allowed to be while still considering the document a match. By _how far
apart_ we mean _how many times do you need to move a term in order to make
the query and document match_?
We'll start with a simple example. To make the query `quick fox` match
a document containing `quick brown fox`, we need a `slop` of just `1`:

                Pos 1         Pos 2         Pos 3
    -----------------------------------------------
    Doc:        quick         brown         fox
    -----------------------------------------------
    Query:      quick         fox
    Slop 1:     quick                 ↳     fox
Although all words need to be present in phrase matching, even when using `slop`,
the words don't necessarily need to be in the same sequence in order to
match. With a high enough `slop` value, words can be arranged in any order.

To make the query `fox quick` match our document, we need a `slop` of `3`:

                Pos 1         Pos 2         Pos 3
    -----------------------------------------------
    Doc:        quick         brown         fox
    -----------------------------------------------
    Query:      fox           quick
    Slop 1:     fox|quick  ↵  <1>
    Slop 2:     quick      ↳  fox
    Slop 3:     quick                 ↳     fox
<1> Note that `fox` and `quick` occupy the same position in this step.
Switching word order from `fox quick` to `quick fox` thus requires two
steps, or a `slop` of `2`.
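The move counting described above can be approximated in Python. The sketch below is a deliberate simplification of Lucene's sloppy-phrase matching, not its actual algorithm: it assumes every query term occurs exactly once in the document, and measures the required slop as the spread between each term's document position and the position the query expects it at:

```python
def min_slop(query_terms, doc_positions):
    """Estimate the slop a phrase query needs in order to match.
    For each query term (expected at query position 0, 1, 2, ...) compute
    the offset between its document position and its query position; the
    number of single-position moves required is the spread of those offsets.
    Simplified sketch: assumes each query term occurs once in the document."""
    offsets = [
        doc_positions[term] - query_pos
        for query_pos, term in enumerate(query_terms)
    ]
    return max(offsets) - min(offsets)

doc = {"quick": 1, "brown": 2, "fox": 3}
print(min_slop(["quick", "brown", "fox"], doc))  # 0: exact phrase match
print(min_slop(["quick", "fox"], doc))           # 1
print(min_slop(["fox", "quick"], doc))           # 3
```

This reproduces the worked examples above: `quick fox` needs a slop of 1, and the reversed `fox quick` needs a slop of 3.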
=== Multivalue Fields
A curious thing can happen when you try to use phrase matching on multivalue
fields. ((("proximity matching", "on multivalue fields")))((("match_phrase query", "on multivalue fields"))) Imagine that you index this document:
[source,js]
PUT /my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json

Then run a phrase query for `Abraham Lincoln`:
[source,js]
GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json

Surprisingly, our document matches, even though `Abraham` and `Lincoln`
belong to two different people in the `names` array. The reason for this comes
down to the way arrays are indexed in Elasticsearch.

When `John Abraham` is analyzed, it produces this:
* Position 1: `john`
* Position 2: `abraham`

Then when `Lincoln Smith` is analyzed, it produces this:

* Position 3: `lincoln`
* Position 4: `smith`
In other words, Elasticsearch produces exactly the same list of tokens as it would have
for the single string `John Abraham Lincoln Smith`. Our example query
looks for `abraham` directly followed by `lincoln`, and these two terms do
indeed exist, and they are right next to each other, so the query matches.

Fortunately, there is a simple workaround for cases like these, called the
`position_offset_gap`, which((("mapping (types)", "position_offset_gap")))((("position_offset_gap"))) we need to configure in the field mapping:
[source,js]
DELETE /my_index/groups/ <1>

PUT /my_index/_mapping/groups <2>
{
    "properties": {
        "names": {
            "type": "string",
            "position_offset_gap": 100
        }
    }
}
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json

<1> First delete the `groups` mapping and all documents of that type.
<2> Then create a new `groups` mapping with the correct values.
The `position_offset_gap` setting tells Elasticsearch that it should increase
the current term `position` by the specified value for every new array
element. So now, when we index the array of names, the terms are emitted with
the following positions:

* Position 1: `john`
* Position 2: `abraham`
* Position 103: `lincoln`
* Position 104: `smith`

Our phrase query would no longer match a document like this because `abraham`
and `lincoln` are now 100 positions apart. You would have to add a `slop`
value of 100 in order for this document to match.
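The position bookkeeping described above is easy to simulate in Python. This is only a sketch of the indexing behavior, not Elasticsearch itself: a whitespace split plus lowercasing stands in for the full analyzer chain (and note that later Elasticsearch versions rename the setting to `position_increment_gap`):

```python
def positions_with_gap(values, gap=100):
    """Simulate how term positions are assigned to a multivalue field:
    each new array element bumps the position counter by `gap` before
    its own terms are numbered. Whitespace split + lowercase stand in
    for the real analyzer."""
    pos = 0
    positions = {}
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # the position_offset_gap is applied between array elements
        for term in value.lower().split():
            pos += 1
            positions[term] = pos
    return positions

print(positions_with_gap(["John Abraham", "Lincoln Smith"]))
# {'john': 1, 'abraham': 2, 'lincoln': 103, 'smith': 104}
```

With `gap=0` the terms come out at positions 1 through 4, which is exactly why the phrase query matched across the array boundary in the first place.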
=== Closer Is Better
Whereas a phrase query simply excludes documents that don't contain the exact
query phrase, a proximity query (a ((("proximity matching", "proximity queries")))((("slop parameter", "proximity queries and")))phrase query where `slop` is greater
than `0`) incorporates the proximity of the query terms into the final
relevance `_score`. By setting a high `slop` value like `50` or `100`, you can
exclude documents in which the words are really too far apart, but give a higher
score to documents in which the words are closer together.

The following proximity query for `quick dog` matches both documents that
contain the words `quick` and `dog`, but gives a higher score to the
document((("relevance scores", "for proximity queries"))) in which the words are nearer to each other:
[source,js]
POST /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick dog",
                "slop": 50 <1>
            }
        }
    }
}
// SENSE: 120_Proximity_Matching/20_Scoring.json

<1> Note the high `slop` value.
[source,js]
{“hits”: [
{
“_id”: “3”,
“_score”: 0.75, <1>
“_source”: {
“title”: “The quick brown fox jumps over the quick dog”
}
},
{
“_id”: “2”,
“_score”: 0.28347334, <2>
“_source”: {
“title”: “The quick brown fox jumps over the lazy dog”
}
}
]
}
<1> Higher score becausequickand
dogare close together
<2> Lower score because
quickand
dogare further apart
[[proximity-relevance]]
=== Proximity for Relevance
Although proximity queries are useful, the fact that they require all terms to be
present can make them overly strict.((("proximity matching", "using for relevance")))((("relevance", "proximity queries for"))) It's the same issue that we discussed in
<> in <>: if six out of seven terms match,
a document is probably relevant enough to be worth showing to the user, but
the `match_phrase` query would exclude it.

Instead of using proximity matching as an absolute requirement, we can
use it as a signal: as one of potentially many queries, each of which
contributes to the overall score for each document (see <>).

The fact that we want to add together the scores from multiple queries implies
that we should combine them by using the `bool` query.((("bool query", "proximity query for relevance in")))

We can use a simple `match` query as a `must` clause. This is the query that
will determine which documents are included in our result set. We can trim
the long tail with the `minimum_should_match` parameter. Then we can add other,
more specific queries as `should` clauses. Every one that matches will
increase the relevance of the matching docs.
[source,js]
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { <1>
          "title": {
            "query": "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": { <2>
          "title": {
            "query": "quick brown fox",
            "slop": 50
          }
        }
      }
    }
  }
}
// SENSE: 120_Proximity_Matching/25_Relevance.json

<1> The `must` clause includes or excludes documents from the result set.
<2> The `should` clause increases the relevance score of those documents that
match.
We could, of course, include other queries in the `should` clause, where each
query targets a specific aspect of relevance.
[role="pagebreak-before"]
=== Improving Performance
Phrase and proximity queries are more ((("proximity matching", "improving performance")))((("phrase matching", "improving performance")))expensive than simple `match` queries.
Whereas a `match` query just has to look up terms in the inverted index, a
`match_phrase` query has to calculate and compare the positions of multiple
possibly repeated terms.

The http://people.apache.org/~mikemccand/lucenebench/[Lucene nightly
benchmarks] show that a simple `term` query is about 10 times as fast as a
phrase query, and about 20 times as fast as a proximity query (a phrase query
with `slop`). And of course, this cost is paid at search time instead of at
index time.
[TIP]
==================================================
Usually the extra cost of phrase queries is not as scary as these numbers
suggest. Really, the difference in performance is a testimony to just how fast
a simple `term` query is. Phrase queries on typical full-text data usually
complete within a few milliseconds, and are perfectly usable in practice, even
on a busy cluster.

In certain pathological cases, phrase queries can be costly, but this is
unusual. An example of a pathological case is DNA sequencing, where there are
many many identical terms repeated in many positions. Using higher `slop`
values in this case results in a huge growth in the number of position
calculations.
==================================================
So what can we do to limit the performance cost of phrase and proximity
queries? One useful approach is to reduce the total number of documents that
need to be examined by the phrase query.
[[rescore-api]]
==== Rescoring Results
In <>, we used a proximity query purely as a relevance signal. A query may match many documents, but our users are likely to look at only the first few pages of results, so we can limit the cost of the phrase query by using it to rescore just the top results returned by a cheaper `match` query:
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": { <1>
            "title": {
                "query": "quick brown fox",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        "window_size": 50, <2>
        "query": { <3>
            "rescore_query": {
                "match_phrase": {
                    "title": {
                        "query": "quick brown fox",
                        "slop": 50
                    }
                }
            }
        }
    }
}
// SENSE: 120_Proximity_Matching/30_Performance.json

<1> The `match` query decides which results will be included in the final
result set and ranks results according to TF/IDF.((("window_size parameter")))
<2> The `window_size` is the number of top results to rescore, per shard.
<3> The only rescoring algorithm currently supported is another query, but
there are plans to add more algorithms later.
[[shingles]]
=== Finding Associated Words
As useful as phrase and proximity queries can be, they still have a downside.
They are overly strict: all terms must be present for a phrase query to match,
even when using `slop`.((("proximity matching", "finding associated words", range="startofrange", id="ix_proxmatchassoc")))

The flexibility in word ordering that you gain with `slop` also comes at a
price, because you lose the association between word pairs. While you can
identify documents in which `sue`, `alligator`, and `ate` occur close together,
you can't tell whether _Sue ate_ or _the alligator ate_.

When words are used in conjunction with each other, they express an idea that
is bigger or more meaningful than each word in isolation. The two clauses
_I'm not happy I'm working_ and _I'm happy I'm not working_ contain the same
words, in close proximity, but have quite different meanings.
If, instead of indexing each word independently, we were to index pairs of
words, then we could retain more of the context in which the words were used.

For the sentence _Sue ate the alligator_, we would not only index each word
(or unigram) as((("unigrams"))) a term

    ["sue", "ate", "the", "alligator"]

but also each word and its neighbor as single terms:

    ["sue ate", "ate the", "the alligator"]

These word ((("bigrams")))pairs (or bigrams) are ((("shingles")))known as shingles.
[TIP]
==================================================
Shingles are not restricted to being pairs of words; you could index word
triplets (trigrams) as ((("trigrams")))well:

    ["sue ate the", "ate the alligator"]

Trigrams give you a higher degree of precision, but greatly increase the
number of unique terms in the index. Bigrams are sufficient for most use
cases.
==================================================
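Shingle production itself is easy to picture in Python. The sketch below mimics a shingle token filter with `output_unigrams` disabled; a whitespace split plus lowercasing is an assumption standing in for the full tokenizer and filter chain:

```python
def shingles(text, size=2):
    """Produce word n-gram shingles (bigrams by default), like a shingle
    token filter with min_shingle_size == max_shingle_size == size and
    output_unigrams disabled. Whitespace split + lowercase stand in for
    the standard tokenizer and lowercase filter."""
    terms = text.lower().split()
    return [" ".join(terms[i:i + size]) for i in range(len(terms) - size + 1)]

print(shingles("Sue ate the alligator"))          # ['sue ate', 'ate the', 'the alligator']
print(shingles("Sue ate the alligator", size=3))  # ['sue ate the', 'ate the alligator']
```

Each shingle is indexed as a single term, which is why word order survives: `ate sue` and `sue ate` become entirely different terms in the index.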
Of course, shingles are useful only if the user enters the query in the same
order as in the original document; a query for `sue alligator` would match
the individual words but none of our shingles.
Fortunately, users tend to express themselves using constructs similar to
those that appear in the data they are searching. But this point is an
important one: it is not enough to index just bigrams; we still need unigrams,
but we can use matching bigrams as a signal to increase the relevance score.
==== Producing Shingles
Shingles need to be created at index time as part of the analysis process.((("shingles", "producing at index time"))) We
could index both unigrams and bigrams into a single field, but it is cleaner
to keep unigrams and bigrams in separate fields that can be queried
independently. The unigram field would form the basis of our search, with the
bigram field being used to boost relevance.

First, we need to create an analyzer that uses the `shingle` token filter:
[source,js]
DELETE /my_index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1, <1>
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": 2, <2>
                    "max_shingle_size": 2, <2>
                    "output_unigrams": false <3>
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" <4>
                    ]
                }
            }
        }
    }
}
// SENSE: 120_Proximity_Matching/35_Shingles.json

<1> See <>.
<2> The default min/max shingle size is `2`, so we don't really need to set
these.
<3> The `shingle` token filter outputs unigrams by default, but we want to
keep unigrams and bigrams separate.
<4> The `my_shingle_analyzer` uses our custom `my_shingle_filter` token
filter.
Let's test that our analyzer is working as expected with the `analyze` API:

[source,js]
GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator

Sure enough, we get back three terms:

* `sue ate`
* `ate the`
* `the alligator`

Now we can proceed to setting up a field to use the new analyzer.
==== Multifields
We said that it is cleaner to index unigrams and bigrams separately, so we
will create the `title` field ((("multifields")))as a multifield (see <>):
[source,js]
PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type": "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}
With this mapping, values from our JSON document in the field `title` will be
indexed both as unigrams (`title`) and as bigrams (`title.shingles`), meaning
that we can query these fields independently.
And finally, we can index our example documents:

[source,js]
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }
==== Searching for Shingles

To understand the benefit ((("shingles", "searching for")))that the `shingles` field adds, let's first look at
the results from a simple `match` query for "The hungry alligator ate Sue":

[source,js]
GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}

This query returns all three documents, but note that documents 1 and 2
have the same relevance score because they contain the same words:
[source,js]
{
  "hits": [
     {
        "_id": "1",
        "_score": 0.44273707, <1>
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "2",
        "_score": 0.44273707, <1>
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "3", <2>
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}
<1> Both documents contain `the`, `alligator`, and `ate` and so have the
same score.
<2> We could have excluded document 3 by setting the `minimum_should_match`
parameter. See <>.
Now let's add the `shingles` field into the query. Remember that we want
matches on the `shingles` field to act as a signal, to increase the
relevance score, so we still need to include the query on the main `title`
field:
[source,js]
GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}

We still match all three documents, but document 2 has now been bumped into
first place because it matched the shingled term `ate sue`:
[source,js]
{
  "hits": [
     {
        "_id": "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}
Even though our query included the word `hungry`, which doesn't appear in
any of our documents, we still managed to use word proximity to return the
most relevant document first.
==== Performance
Not only are shingles more flexible than phrase queries,((("shingles", "better performance than phrase queries"))) but they perform better
as well. Instead of paying the price of a phrase query every time you search,
queries for shingles are just as efficient as a simple `match` query. A small
price is paid at index time, because more terms need to be indexed, which also
means that fields with shingles use more disk space. However, most applications
write once and read many times, so it makes sense to optimize for fast queries.

This is a theme that you will encounter frequently in Elasticsearch: it enables
you to achieve a lot at search time, without requiring any up-front setup. Once
you understand your requirements more clearly, you can achieve better results
with better performance by modeling your data correctly at index time.
(((“proximity matching”, “finding associated words”, range=”endofrange”, startref =”ix_proxmatchassoc”)))
https://github.com/uxff/elasticsearch-definitive-guide-cn