
The Elasticsearch Definitive Guide - Proximity Matching


Proximity Matching

Standard full-text search uses TF/IDF to process a document, or a single field within a document. The text is broken into individual terms (words), which are indexed, and a `match` query matches the terms in the query string against the terms in the index. A `match` query can tell us whether a document contains the terms we are searching for, but that is only part of the story: it tells us nothing about the relationship between words.

Consider the difference between these sentences:

* Sue ate the alligator.
* The alligator ate Sue.
* Sue never goes anywhere without her alligator skin purse.

A `match` query for `sue alligator` would match all three sentences, but TF/IDF won't tell us whether the two words appear in the same sentence or merely in the same paragraph; it only reports how often each term occurs in the text.
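For example, a `match` query like the following (a sketch, assuming the three sentences above are indexed in a `title` field) would return all three documents:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "sue alligator"
        }
    }
}
--------------------------------------------------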

Understanding how words relate to one another is a complicated problem, and we can't solve it just by rephrasing the query. But we can tell how far apart two words are in the text, and even whether they are adjacent, and that closeness goes some way toward indicating that the words are related.

Real documents may be much longer than our examples: `Sue` and `alligator` may be separated by whole paragraphs of text. Perhaps we still want to return those documents, but we want to give documents in which the words are close together a higher relevance score.

This is the province of phrase matching, or proximity matching.

[TIP]
==================================================
In this chapter, we are using the same example documents that we used for
the `match` query in <>.
==================================================

=== Phrase Matching

To find words that are near each other, reach for the `match_phrase` query:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json

Like the `match` query, the `match_phrase` query first analyzes the query
string to produce a list of terms. It then searches for all the terms, but
keeps only documents that contain all of the search terms, in the same
positions relative to each other. A query for the phrase `quick fox`
would not match any of our documents, because no document contains the word
`quick` immediately followed by `fox`.
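You can see this for yourself; this sketch of the phrase query, run against the same example documents, should return no hits:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick fox"
        }
    }
}
--------------------------------------------------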

[TIP]
==================================================
The `match_phrase` query can also be written as a `match` query with type
`phrase`:

[source,js]
--------------------------------------------------
"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json

==================================================

==== Term Positions

When a string is analyzed, the analyzer returns not((("phrase matching", "term positions")))((("match_phrase query", "position of terms")))((("position-aware matching"))) only a list of terms, but
also the position, or order, of each term in the original string:

[source,js]
--------------------------------------------------
GET /_analyze?analyzer=standard
Quick brown fox
--------------------------------------------------
// SENSE: 120_Proximity_Matching/05_Term_positions.json

This returns the following:

[role="pagebreak-before"]
[source,js]
--------------------------------------------------
{
   "tokens": [
      {
         "token":        "quick",
         "start_offset": 0,
         "end_offset":   5,
         "type":         "<ALPHANUM>",
         "position":     1 <1>
      },
      {
         "token":        "brown",
         "start_offset": 6,
         "end_offset":   11,
         "type":         "<ALPHANUM>",
         "position":     2 <1>
      },
      {
         "token":        "fox",
         "start_offset": 12,
         "end_offset":   15,
         "type":         "<ALPHANUM>",
         "position":     3 <1>
      }
   ]
}
--------------------------------------------------

<1> The `position` of each term in the original string.

Positions can be stored in the inverted index, and position-aware queries like
the `match_phrase` query can use them to match only documents that contain
all the words in exactly the order specified, with no words in-between.
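Positions are stored by default for analyzed string fields. To make that explicit in a mapping, you could use the `index_options` setting (a sketch for illustration only; `positions` is already the default for analyzed string fields):

[source,js]
--------------------------------------------------
PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":          "string",
            "index_options": "positions"
        }
    }
}
--------------------------------------------------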

==== What Is a Phrase

For a document to be considered a((("match_phrase query", "documents matching a phrase")))((("phrase matching", "criteria for matching documents"))) match for the phrase "quick brown fox," the following must be true:

* `quick`, `brown`, and `fox` must all appear in the field.
* The position of `brown` must be `1` greater than the position of `quick`.
* The position of `fox` must be `2` greater than the position of `quick`.

If any of these conditions is not met, the document is not considered a match.

[TIP]
==================================================
Internally, the `match_phrase` query uses the low-level `span` query family to
do position-aware matching. ((("match_phrase query", "use of span queries for position-aware matching")))((("span queries")))Span queries are term-level queries, so they have
no analysis phase; they search for the exact term specified.

Thankfully, most people never need to use the `span` queries directly, as the
`match_phrase` query is usually good enough. However, certain specialized
fields, like patent searches, use these low-level queries to perform very
specific, carefully constructed positional searches.
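For illustration, a `match_phrase` query for `quick brown fox` behaves roughly like this hand-built `span_near` query (a sketch; note that the terms are already lowercased, because span queries skip the analysis phase):

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "span_near": {
            "clauses": [
                { "span_term": { "title": "quick" }},
                { "span_term": { "title": "brown" }},
                { "span_term": { "title": "fox" }}
            ],
            "slop":     0,
            "in_order": true
        }
    }
}
--------------------------------------------------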

==================================================

[[slop]]

=== Mixing It Up

Requiring exact-phrase matches ((("proximity matching", "slop parameter")))may be too strict a constraint. Perhaps we do
want documents that contain `quick brown fox` to be considered a match for
the query `quick fox`, even though the positions aren't exactly equivalent.

We can introduce a degree ((("slop parameter")))of flexibility into phrase matching by using the
`slop` parameter:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop":  1
            }
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/10_Slop.json

The `slop` parameter tells the `match_phrase` query how((("match_phrase query", "slop parameter"))) far apart terms are
allowed to be while still considering the document a match. By _how far
apart_ we mean _how many times do you need to move a term in order to make
the query and document match_?

We'll start with a simple example. To make the query `quick fox` match
a document containing `quick brown fox`, we need a `slop` of just `1`:

            Pos 1         Pos 2         Pos 3
-----------------------------------------------
Doc:        quick         brown         fox
-----------------------------------------------
Query:      quick         fox
Slop 1:     quick                 ↳     fox


Although all words need to be present in phrase matching, even when using `slop`,
the words don't necessarily need to be in the same sequence in order to
match. With a high enough `slop` value, words can be arranged in any order.

To make the query `fox quick` match our document, we need a `slop` of `3`:

            Pos 1         Pos 2         Pos 3
-----------------------------------------------
Doc:        quick         brown         fox
-----------------------------------------------
Query:      fox           quick
Slop 1:     fox|quick  ↵  <1>
Slop 2:     quick      ↳  fox
Slop 3:     quick                 ↳     fox


<1> Note that `fox` and `quick` occupy the same position in this step.

Switching word order from `fox quick` to `quick fox` thus requires two
steps, or a `slop` of `2`.
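Putting that to the test, this sketch of the reversed query with a `slop` of `3` should still match our `quick brown fox` document:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "fox quick",
                "slop":  3
            }
        }
    }
}
--------------------------------------------------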

=== Multivalue Fields

A curious thing can happen when you try to use phrase matching on multivalue
fields. ((("proximity matching", "on multivalue fields")))((("match_phrase query", "on multivalue fields"))) Imagine that you index this document:

[source,js]
--------------------------------------------------
PUT /my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json

Then run a phrase query for `Abraham Lincoln`:

[source,js]
--------------------------------------------------
GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json

Surprisingly, our document matches, even though `Abraham` and `Lincoln`
belong to two different people in the `names` array. The reason for this comes
down to the way arrays are indexed in Elasticsearch.

When `John Abraham` is analyzed, it produces this:

* Position 1: `john`
* Position 2: `abraham`

Then when `Lincoln Smith` is analyzed, it produces this:

* Position 3: `lincoln`
* Position 4: `smith`

In other words, Elasticsearch produces exactly the same list of tokens as it would have
for the single string `John Abraham Lincoln Smith`. Our example query
looks for `abraham` directly followed by `lincoln`, and these two terms do
indeed exist, and they are right next to each other, so the query matches.

Fortunately, there is a simple workaround for cases like these, called the
`position_offset_gap`, which((("mapping (types)", "position_offset_gap")))((("position_offset_gap"))) we need to configure in the field mapping:

[source,js]
--------------------------------------------------
DELETE /my_index/groups <1>

PUT /my_index/_mapping/groups <2>
{
    "properties": {
        "names": {
            "type":                "string",
            "position_offset_gap": 100
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json

<1> First delete the `groups` mapping and all documents of that type.
<2> Then create a new `groups` mapping with the correct values.

The `position_offset_gap` setting tells Elasticsearch that it should increase
the current term `position` by the specified value for every new array
element. So now, when we index the array of names, the terms are emitted with
the following positions:

* Position 1: `john`
* Position 2: `abraham`
* Position 103: `lincoln`
* Position 104: `smith`

Our phrase query would no longer match a document like this, because `abraham`
and `lincoln` are now 100 positions apart. You would have to add a `slop`
value of 100 in order for this document to match.
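For instance, this sketch of the earlier phrase query with a `slop` of `100` would match the document again, in spite of the `position_offset_gap`:

[source,js]
--------------------------------------------------
GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln",
                "slop":  100
            }
        }
    }
}
--------------------------------------------------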

=== Closer Is Better

Whereas a phrase query simply excludes documents that don't contain the exact
query phrase, a proximity query ((("proximity matching", "proximity queries")))((("slop parameter", "proximity queries and")))(a phrase query where `slop` is greater
than `0`) incorporates the proximity of the query terms into the final
relevance `_score`. By setting a high `slop` value like `50` or `100`, you can
exclude documents in which the words are really too far apart, but give a higher
score to documents in which the words are closer together.

The following proximity query for `quick dog` matches both documents that
contain the words `quick` and `dog`, but gives a higher score to the
document((("relevance scores", "for proximity queries"))) in which the words are nearer to each other:

[source,js]
--------------------------------------------------
POST /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick dog",
                "slop":  50 <1>
            }
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/20_Scoring.json

<1> Note the high `slop` value.

[source,js]
--------------------------------------------------
{
  "hits": [
     {
        "_id":    "3",
        "_score": 0.75, <1>
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id":    "2",
        "_score": 0.28347334, <2>
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
  ]
}
--------------------------------------------------

<1> Higher score because `quick` and `dog` are close together
<2> Lower score because `quick` and `dog` are further apart

[[proximity-relevance]]

=== Proximity for Relevance

Although proximity queries are useful, the fact that they require all terms to be
present can make them overly strict.((("proximity matching", "using for relevance")))((("relevance", "proximity queries for"))) It's the same issue that we discussed in
<> in <>: if six out of seven terms match, a document is probably relevant
enough to be worth showing to the user, but the `match_phrase` query would
exclude it.

Instead of using proximity matching as an absolute requirement, we can
use it as a signal: as one of potentially many queries, each of which
contributes to the overall score for each document (see <>). The fact that
we want to add together the scores from multiple queries implies that we
should combine them by using the `bool` query.((("bool query", "proximity query for relevance in")))

We can use a simple `match` query as a `must` clause. This is the query that
will determine which documents are included in our result set. We can trim
the long tail with the `minimum_should_match` parameter. Then we can add other,
more specific queries as `should` clauses. Every one that matches will
increase the relevance of the matching docs.

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { <1>
          "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": { <2>
          "title": {
            "query": "quick brown fox",
            "slop":  50
          }
        }
      }
    }
  }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/25_Relevance.json

<1> The `must` clause includes or excludes documents from the result set.
<2> The `should` clause increases the relevance score of those documents that
match.

We could, of course, include other queries in the `should` clause, where each
query targets a specific aspect of relevance.

[role="pagebreak-before"]

=== Improving Performance

Phrase and proximity queries are more ((("proximity matching", "improving performance")))((("phrase matching", "improving performance")))expensive than simple `match` queries.
Whereas a `match` query just has to look up terms in the inverted index, a
`match_phrase` query has to calculate and compare the positions of multiple,
possibly repeated, terms.

The http://people.apache.org/~mikemccand/lucenebench/[Lucene nightly
benchmarks] show that a simple `term` query is about 10 times as fast as a
phrase query, and about 20 times as fast as a proximity query (a phrase query
with `slop`). And of course, this cost is paid at search time instead of at
index time.

[TIP]
==================================================
Usually the extra cost of phrase queries is not as scary as these numbers
suggest. Really, the difference in performance is a testimony to just how fast
a simple `term` query is. Phrase queries on typical full-text data usually
complete within a few milliseconds, and are perfectly usable in practice, even
on a busy cluster.

In certain pathological cases, phrase queries can be costly, but this is
unusual. An example of a pathological case is DNA sequencing, where there are
many, many identical terms repeated in many positions. Using higher `slop`
values in this case results in a huge growth in the number of position
calculations.

==================================================

So what can we do to limit the performance cost of phrase and proximity
queries? One useful approach is to reduce the total number of documents that
need to be examined by the phrase query.

[[rescore-api]]

==== Rescoring Results

In <>, we discussed using a proximity query to improve relevance, rather than to include or exclude documents from the result set. A query may match millions of documents, but our users are probably interested in only the first few pages of results, so proximity matters only for the top results. This is what the `rescore` parameter does: it lets the cheap `match` query rank all matching documents, and then applies a more expensive `match_phrase` query to just the top results from each shard:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": { <1>
            "title": {
                "query":                "quick brown fox",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        "window_size": 50, <2>
        "query": { <3>
            "rescore_query": {
                "match_phrase": {
                    "title": {
                        "query": "quick brown fox",
                        "slop":  50
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/30_Performance.json

<1> The `match` query decides which results will be included in the final
result set and ranks results according to TF/IDF.((("window_size parameter")))
<2> The `window_size` is the number of top results to rescore, per shard.
<3> The only rescoring algorithm currently supported is another query, but
there are plans to add more algorithms later.

[[shingles]]

=== Finding Associated Words

As useful as phrase and proximity queries can be, they still have a downside.
They are overly strict: all terms must be present for a phrase query to match,
even when using `slop`.((("proximity matching", "finding associated words", range="startofrange", id="ix_proxmatchassoc")))

The flexibility in word ordering that you gain with `slop` also comes at a
price, because you lose the association between word pairs. While you can
identify documents in which `sue`, `alligator`, and `ate` occur close together,
you can't tell whether _Sue ate_ or _the alligator ate_.

When words are used in conjunction with each other, they express an idea that
is bigger or more meaningful than each word in isolation. The two clauses
_I'm not happy I'm working_ and _I'm happy I'm not working_ contain the same
words, in close proximity, but have quite different meanings.

If, instead of indexing each word independently, we were to index pairs of
words, then we could retain more of the context in which the words were used.

For the sentence `Sue ate the alligator`, we would not only index each word
(or unigram) as((("unigrams"))) a term

  ["sue", "ate", "the", "alligator"]

but also each word and its neighbor as single terms:

  ["sue ate", "ate the", "the alligator"]

These word ((("bigrams")))pairs (or bigrams) are ((("shingles")))known as shingles.

[TIP]
==================================================
Shingles are not restricted to being pairs of words; you could index word
triplets (trigrams) as ((("trigrams")))well:

  ["sue ate the", "ate the alligator"]

Trigrams give you a higher degree of precision, but greatly increase the
number of unique terms in the index. Bigrams are sufficient for most use
cases.

==================================================

Of course, shingles are useful only if the user enters the query in the same
order as in the original document; a query for `sue alligator` would match
the individual words but none of our shingles.

Fortunately, users tend to express themselves using constructs similar to
those that appear in the data they are searching. But this point is an
important one: it is not enough to index just bigrams; we still need unigrams,
but we can use matching bigrams as a signal to increase the relevance score.

==== Producing Shingles

Shingles need to be created at index time as part of the analysis process.((("shingles", "producing at index time"))) We
could index both unigrams and bigrams into a single field, but it is cleaner
to keep unigrams and bigrams in separate fields that can be queried
independently. The unigram field would form the basis of our search, with the
bigram field being used to boost relevance.

First, we need to create an analyzer that uses the `shingle` token filter:

[source,js]
--------------------------------------------------
DELETE /my_index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1, <1>
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, <2>
                    "max_shingle_size": 2, <2>
                    "output_unigrams":  false <3>
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" <4>
                    ]
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/35_Shingles.json

<1> See <>.
<2> The default min/max shingle size is `2`, so we don't really need to set
these.
<3> The `shingle` token filter outputs unigrams by default, but we want to
keep unigrams and bigrams separate.
<4> The `my_shingle_analyzer` uses our custom `my_shingle_filter` token
filter.

First, let's test that our analyzer is working as expected with the `analyze`
API:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator
--------------------------------------------------

Sure enough, we get back three terms:

* `sue ate`
* `ate the`
* `the alligator`

Now we can proceed to setting up a field to use the new analyzer.

==== Multifields

We said that it is cleaner to index unigrams and bigrams separately, so we
will create the `title` field ((("multifields")))as a multifield (see <>):

[source,js]
--------------------------------------------------
PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type":     "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

With this mapping, values from our JSON document in the field `title` will be
indexed both as unigrams (`title`) and as bigrams (`title.shingles`), meaning
that we can query these fields independently.

And finally, we can index our example documents:

[source,js]
--------------------------------------------------
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }
--------------------------------------------------

==== Searching for Shingles

To understand the benefit ((("shingles", "searching for")))that the `shingles` field adds, let's first look at
the results from a simple `match` query for "The hungry alligator ate Sue":

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}
--------------------------------------------------

This query returns all three documents, but note that documents 1 and 2
have the same relevance score because they contain the same words:

[source,js]
--------------------------------------------------
{
  "hits": [
     {
        "_id":    "1",
        "_score": 0.44273707, <1>
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id":    "2",
        "_score": 0.44273707, <1>
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id":    "3", <2>
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}
--------------------------------------------------

<1> Both documents contain `the`, `alligator`, and `ate` and so have the
same score.
<2> We could have excluded document 3 by setting the `minimum_should_match`
parameter. See <>.

Now let's add the `shingles` field into the query. Remember that we want
matches on the `shingles` field to act as a signal, to increase the relevance
score, so we still need to include the query on the main `title` field:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}
--------------------------------------------------

We still match all three documents, but document 2 has now been bumped into
first place because it matched the shingled term `ate sue`:

[source,js]
--------------------------------------------------
{
  "hits": [
     {
        "_id":    "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id":    "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id":    "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}
--------------------------------------------------

Even though our query included the word `hungry`, which doesn't appear in
any of our documents, we still managed to use word proximity to return the
most relevant document first.

==== Performance

Not only are shingles more flexible than phrase queries,((("shingles", "better performance than phrase queries"))) but they perform
better as well. Instead of paying the price of a phrase query every time you
search, queries for shingles are just as efficient as a simple `match` query.
A small price is paid at index time, because more terms need to be indexed,
which also means that fields with shingles use more disk space. However, most
applications write once and read many times, so it makes sense to optimize
for fast queries.

This is a theme that you will encounter frequently in Elasticsearch: it
enables you to achieve a lot at search time, without requiring any up-front
setup. Once you understand your requirements more clearly, you can achieve
better results with better performance by modeling your data correctly at
index time.

(((“proximity matching”, “finding associated words”, range=”endofrange”, startref =”ix_proxmatchassoc”)))

https://github.com/uxff/elasticsearch-definitive-guide-cn