
The Elasticsearch Definitive Guide - Proximity Matching


Proximity Matching

Standard full-text search uses TF/IDF to process a document, or a single field within a document. The text is broken into individual terms (words), which are indexed, and a `match` query matches the terms in the query string against the terms in the index. A `match` query can tell us whether a document contains the terms we are searching for, but that is only part of the story: it tells us nothing about the relationship between words.

Consider the difference between these sentences:

* Sue ate the alligator.
* The alligator ate Sue.
* Sue never goes anywhere without her alligator skin purse.

A `match` query for `sue alligator` would match all three sentences, but TF/IDF won't tell us whether the two words appear in the same sentence or merely in the same paragraph; it only reports how often each term occurs in the text.
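For example, a `match` query like the following (a sketch, assuming the three sentences above are indexed in a `title` field) would return all three documents:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": "sue alligator"
        }
    }
}
--------------------------------------------------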

Understanding how words relate to one another is a complicated problem, and we can't solve it just by rephrasing the query. But we can tell how far apart two words are in the text, and even whether they are adjacent, and that closeness goes some way toward indicating that the words are related.

Real documents may be much longer than our examples: `Sue` and `alligator` may be separated by whole paragraphs of text. Perhaps we still want to return those documents, but we want to give documents in which the words are close together a higher relevance score.

This is the province of phrase matching, or proximity matching.

[TIP]
==================================================
In this chapter, we are using the same example documents that we used for
the `match` query in <>.
==================================================

=== Phrase Matching

To find words that are near each other, reach for the `match_phrase` query:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json

Like the `match` query, the `match_phrase` query first analyzes the query
string to produce a list of terms. It then searches for all the terms, but
keeps only documents that contain all of the search terms, in the same
positions relative to each other. A query for the phrase `quick fox`
would not match any of our documents, because no document contains the word
`quick` immediately followed by `fox`.
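You can see this for yourself; this sketch of the phrase query, run against the same example documents, should return no hits:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick fox"
        }
    }
}
--------------------------------------------------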

[TIP]
==================================================
The `match_phrase` query can also be written as a `match` query with type
`phrase`:

[source,js]
--------------------------------------------------
"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/05_Match_phrase_query.json

==================================================

==== Term Positions

When a string is analyzed, the analyzer returns not((("phrase matching", "term positions")))((("match_phrase query", "position of terms")))((("position-aware matching"))) only a list of terms, but
also the position, or order, of each term in the original string:

[source,js]
--------------------------------------------------
GET /_analyze?analyzer=standard
Quick brown fox
--------------------------------------------------
// SENSE: 120_Proximity_Matching/05_Term_positions.json

This returns the following:

[role="pagebreak-before"]
[source,js]
--------------------------------------------------
{
   "tokens": [
      {
         "token":        "quick",
         "start_offset": 0,
         "end_offset":   5,
         "type":         "<ALPHANUM>",
         "position":     1 <1>
      },
      {
         "token":        "brown",
         "start_offset": 6,
         "end_offset":   11,
         "type":         "<ALPHANUM>",
         "position":     2 <1>
      },
      {
         "token":        "fox",
         "start_offset": 12,
         "end_offset":   15,
         "type":         "<ALPHANUM>",
         "position":     3 <1>
      }
   ]
}
--------------------------------------------------

<1> The `position` of each term in the original string.

Positions can be stored in the inverted index, and position-aware queries like
the `match_phrase` query can use them to match only documents that contain
all the words in exactly the order specified, with no words in-between.
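Positions are stored by default for analyzed string fields. To make that explicit in a mapping, you could use the `index_options` setting (a sketch for illustration only; `positions` is already the default for analyzed string fields):

[source,js]
--------------------------------------------------
PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":          "string",
            "index_options": "positions"
        }
    }
}
--------------------------------------------------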

==== What Is a Phrase

For a document to be considered a((("match_phrase query", "documents matching a phrase")))((("phrase matching", "criteria for matching documents"))) match for the phrase "quick brown fox," the following must be true:

* `quick`, `brown`, and `fox` must all appear in the field.
* The position of `brown` must be `1` greater than the position of `quick`.
* The position of `fox` must be `2` greater than the position of `quick`.

If any of these conditions is not met, the document is not considered a match.

[TIP]
==================================================
Internally, the `match_phrase` query uses the low-level `span` query family to
do position-aware matching. ((("match_phrase query", "use of span queries for position-aware matching")))((("span queries")))Span queries are term-level queries, so they have
no analysis phase; they search for the exact term specified.

Thankfully, most people never need to use the `span` queries directly, as the
`match_phrase` query is usually good enough. However, certain specialized
fields, like patent searches, use these low-level queries to perform very
specific, carefully constructed positional searches.
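For illustration, a `match_phrase` query for `quick brown fox` behaves roughly like this hand-built `span_near` query (a sketch; note that the terms are already lowercased, because span queries skip the analysis phase):

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "span_near": {
            "clauses": [
                { "span_term": { "title": "quick" }},
                { "span_term": { "title": "brown" }},
                { "span_term": { "title": "fox" }}
            ],
            "slop":     0,
            "in_order": true
        }
    }
}
--------------------------------------------------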

==================================================

[[slop]]

=== Mixing It Up

Requiring exact-phrase matches ((("proximity matching", "slop parameter")))may be too strict a constraint. Perhaps we do
want documents that contain `quick brown fox` to be considered a match for
the query `quick fox`, even though the positions aren't exactly equivalent.

We can introduce a degree ((("slop parameter")))of flexibility into phrase matching by using the
`slop` parameter:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop":  1
            }
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/10_Slop.json

The `slop` parameter tells the `match_phrase` query how((("match_phrase query", "slop parameter"))) far apart terms are
allowed to be while still considering the document a match. By _how far
apart_ we mean _how many times do you need to move a term in order to make
the query and document match_?

We'll start with a simple example. To make the query `quick fox` match
a document containing `quick brown fox`, we need a `slop` of just `1`:

            Pos 1         Pos 2         Pos 3
-----------------------------------------------
Doc:        quick         brown         fox
-----------------------------------------------
Query:      quick         fox
Slop 1:     quick                 ↳     fox


Although all words need to be present in phrase matching, even when using `slop`,
the words don't necessarily need to be in the same sequence in order to
match. With a high enough `slop` value, words can be arranged in any order.

To make the query `fox quick` match our document, we need a `slop` of `3`:

            Pos 1         Pos 2         Pos 3
-----------------------------------------------
Doc:        quick         brown         fox
-----------------------------------------------
Query:      fox           quick
Slop 1:     fox|quick  ↵  <1>
Slop 2:     quick      ↳  fox
Slop 3:     quick                 ↳     fox


<1> Note that `fox` and `quick` occupy the same position in this step.

Switching word order from `fox quick` to `quick fox` thus requires two
steps, or a `slop` of `2`.
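Putting that to the test, this sketch of the reversed query with a `slop` of `3` should still match our `quick brown fox` document:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "fox quick",
                "slop":  3
            }
        }
    }
}
--------------------------------------------------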

=== Multivalue Fields

A curious thing can happen when you try to use phrase matching on multivalue
fields. ((("proximity matching", "on multivalue fields")))((("match_phrase query", "on multivalue fields"))) Imagine that you index this document:

[source,js]
--------------------------------------------------
PUT /my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json

Then run a phrase query for `Abraham Lincoln`:

[source,js]
--------------------------------------------------
GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json

Surprisingly, our document matches, even though `Abraham` and `Lincoln`
belong to two different people in the `names` array. The reason for this comes
down to the way arrays are indexed in Elasticsearch.

When `John Abraham` is analyzed, it produces this:

* Position 1: `john`
* Position 2: `abraham`

Then when `Lincoln Smith` is analyzed, it produces this:

* Position 3: `lincoln`
* Position 4: `smith`

In other words, Elasticsearch produces exactly the same list of tokens as it would have
for the single string `John Abraham Lincoln Smith`. Our example query
looks for `abraham` directly followed by `lincoln`, and these two terms do
indeed exist, and they are right next to each other, so the query matches.

Fortunately, there is a simple workaround for cases like these, called the
`position_offset_gap`, which((("mapping (types)", "position_offset_gap")))((("position_offset_gap"))) we need to configure in the field mapping:

[source,js]
--------------------------------------------------
DELETE /my_index/groups <1>

PUT /my_index/_mapping/groups <2>
{
    "properties": {
        "names": {
            "type":                "string",
            "position_offset_gap": 100
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/15_Multi_value_fields.json

<1> First delete the `groups` mapping and all documents of that type.
<2> Then create a new `groups` mapping with the correct values.

The `position_offset_gap` setting tells Elasticsearch that it should increase
the current term `position` by the specified value for every new array
element. So now, when we index the array of names, the terms are emitted with
the following positions:

* Position 1: `john`
* Position 2: `abraham`
* Position 103: `lincoln`
* Position 104: `smith`

Our phrase query would no longer match a document like this, because `abraham`
and `lincoln` are now 100 positions apart. You would have to add a `slop`
value of 100 in order for this document to match.
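For instance, this sketch of the earlier phrase query with a `slop` of `100` would match the document again, in spite of the `position_offset_gap`:

[source,js]
--------------------------------------------------
GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln",
                "slop":  100
            }
        }
    }
}
--------------------------------------------------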

=== Closer Is Better

Whereas a phrase query simply excludes documents that don't contain the exact
query phrase, a proximity query ((("proximity matching", "proximity queries")))((("slop parameter", "proximity queries and")))(a phrase query where `slop` is greater
than `0`) incorporates the proximity of the query terms into the final
relevance `_score`. By setting a high `slop` value like `50` or `100`, you can
exclude documents in which the words are really too far apart, but give a higher
score to documents in which the words are closer together.

The following proximity query for `quick dog` matches both documents that
contain the words `quick` and `dog`, but gives a higher score to the
document((("relevance scores", "for proximity queries"))) in which the words are nearer to each other:

[source,js]
--------------------------------------------------
POST /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick dog",
                "slop":  50 <1>
            }
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/20_Scoring.json

<1> Note the high `slop` value.

[source,js]
--------------------------------------------------
{
  "hits": [
     {
        "_id":    "3",
        "_score": 0.75, <1>
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id":    "2",
        "_score": 0.28347334, <2>
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
  ]
}
--------------------------------------------------

<1> Higher score because `quick` and `dog` are close together
<2> Lower score because `quick` and `dog` are further apart

[[proximity-relevance]]

=== Proximity for Relevance

Although proximity queries are useful, the fact that they require all terms to be
present can make them overly strict.((("proximity matching", "using for relevance")))((("relevance", "proximity queries for"))) It's the same issue that we discussed in
<> in <>: if six out of seven terms match, a document is probably relevant
enough to be worth showing to the user, but the `match_phrase` query would
exclude it.

Instead of using proximity matching as an absolute requirement, we can
use it as a signal: as one of potentially many queries, each of which
contributes to the overall score for each document (see <>). The fact that
we want to add together the scores from multiple queries implies that we
should combine them by using the `bool` query.((("bool query", "proximity query for relevance in")))

We can use a simple `match` query as a `must` clause. This is the query that
will determine which documents are included in our result set. We can trim
the long tail with the `minimum_should_match` parameter. Then we can add other,
more specific queries as `should` clauses. Every one that matches will
increase the relevance of the matching docs.

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { <1>
          "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": { <2>
          "title": {
            "query": "quick brown fox",
            "slop":  50
          }
        }
      }
    }
  }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/25_Relevance.json

<1> The `must` clause includes or excludes documents from the result set.
<2> The `should` clause increases the relevance score of those documents that
match.

We could, of course, include other queries in the `should` clause, where each
query targets a specific aspect of relevance.

[role="pagebreak-before"]

=== Improving Performance

Phrase and proximity queries are more ((("proximity matching", "improving performance")))((("phrase matching", "improving performance")))expensive than simple `match` queries.
Whereas a `match` query just has to look up terms in the inverted index, a
`match_phrase` query has to calculate and compare the positions of multiple,
possibly repeated, terms.

The http://people.apache.org/~mikemccand/lucenebench/[Lucene nightly
benchmarks] show that a simple `term` query is about 10 times as fast as a
phrase query, and about 20 times as fast as a proximity query (a phrase query
with `slop`). And of course, this cost is paid at search time instead of at
index time.

[TIP]
==================================================
Usually the extra cost of phrase queries is not as scary as these numbers
suggest. Really, the difference in performance is a testimony to just how fast
a simple `term` query is. Phrase queries on typical full-text data usually
complete within a few milliseconds, and are perfectly usable in practice, even
on a busy cluster.

In certain pathological cases, phrase queries can be costly, but this is
unusual. An example of a pathological case is DNA sequencing, where there are
many, many identical terms repeated in many positions. Using higher `slop`
values in this case results in a huge growth in the number of position
calculations.

==================================================

So what can we do to limit the performance cost of phrase and proximity
queries? One useful approach is to reduce the total number of documents that
need to be examined by the phrase query.

[[rescore-api]]

==== Rescoring Results

In <>, we discussed using a proximity query to improve relevance, rather than to include or exclude documents from the result set. A query may match millions of documents, but our users are probably interested in only the first few pages of results, so proximity matters only for the top results. This is what the `rescore` parameter does: it lets the cheap `match` query rank all matching documents, and then applies a more expensive `match_phrase` query to just the top results from each shard:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": { <1>
            "title": {
                "query":                "quick brown fox",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        "window_size": 50, <2>
        "query": { <3>
            "rescore_query": {
                "match_phrase": {
                    "title": {
                        "query": "quick brown fox",
                        "slop":  50
                    }
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/30_Performance.json

<1> The `match` query decides which results will be included in the final
result set and ranks results according to TF/IDF.((("window_size parameter")))
<2> The `window_size` is the number of top results to rescore, per shard.
<3> The only rescoring algorithm currently supported is another query, but
there are plans to add more algorithms later.

[[shingles]]

=== Finding Associated Words

As useful as phrase and proximity queries can be, they still have a downside.
They are overly strict: all terms must be present for a phrase query to match,
even when using `slop`.((("proximity matching", "finding associated words", range="startofrange", id="ix_proxmatchassoc")))

The flexibility in word ordering that you gain with `slop` also comes at a
price, because you lose the association between word pairs. While you can
identify documents in which `sue`, `alligator`, and `ate` occur close together,
you can't tell whether _Sue ate_ or _the alligator ate_.

When words are used in conjunction with each other, they express an idea that
is bigger or more meaningful than each word in isolation. The two clauses
_I'm not happy I'm working_ and _I'm happy I'm not working_ contain the same
words, in close proximity, but have quite different meanings.

If, instead of indexing each word independently, we were to index pairs of
words, then we could retain more of the context in which the words were used.

For the sentence `Sue ate the alligator`, we would not only index each word
(or unigram) as((("unigrams"))) a term

  ["sue", "ate", "the", "alligator"]

but also each word and its neighbor as single terms:

  ["sue ate", "ate the", "the alligator"]

These word ((("bigrams")))pairs (or bigrams) are ((("shingles")))known as shingles.

[TIP]
==================================================
Shingles are not restricted to being pairs of words; you could index word
triplets (trigrams) as ((("trigrams")))well:

  ["sue ate the", "ate the alligator"]

Trigrams give you a higher degree of precision, but greatly increase the
number of unique terms in the index. Bigrams are sufficient for most use
cases.

==================================================

Of course, shingles are useful only if the user enters the query in the same
order as in the original document; a query for `sue alligator` would match
the individual words but none of our shingles.

Fortunately, users tend to express themselves using constructs similar to
those that appear in the data they are searching. But this point is an
important one: it is not enough to index just bigrams; we still need unigrams,
but we can use matching bigrams as a signal to increase the relevance score.

==== Producing Shingles

Shingles need to be created at index time as part of the analysis process.((("shingles", "producing at index time"))) We
could index both unigrams and bigrams into a single field, but it is cleaner
to keep unigrams and bigrams in separate fields that can be queried
independently. The unigram field would form the basis of our search, with the
bigram field being used to boost relevance.

First, we need to create an analyzer that uses the `shingle` token filter:

[source,js]
--------------------------------------------------
DELETE /my_index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1, <1>
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, <2>
                    "max_shingle_size": 2, <2>
                    "output_unigrams":  false <3>
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" <4>
                    ]
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 120_Proximity_Matching/35_Shingles.json

<1> See <>.
<2> The default min/max shingle size is `2`, so we don't really need to set
these.
<3> The `shingle` token filter outputs unigrams by default, but we want to
keep unigrams and bigrams separate.
<4> The `my_shingle_analyzer` uses our custom `my_shingle_filter` token
filter.

First, let's test that our analyzer is working as expected with the `analyze`
API:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator
--------------------------------------------------

Sure enough, we get back three terms:

* `sue ate`
* `ate the`
* `the alligator`

Now we can proceed to setting up a field to use the new analyzer.

==== Multifields

We said that it is cleaner to index unigrams and bigrams separately, so we
will create the `title` field ((("multifields")))as a multifield (see <>):

[source,js]
--------------------------------------------------
PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type":     "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}
--------------------------------------------------

With this mapping, values from our JSON document in the field `title` will be
indexed both as unigrams (`title`) and as bigrams (`title.shingles`), meaning
that we can query these fields independently.

And finally, we can index our example documents:

[source,js]
--------------------------------------------------
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }
--------------------------------------------------

==== Searching for Shingles

To understand the benefit ((("shingles", "searching for")))that the `shingles` field adds, let's first look at
the results from a simple `match` query for "The hungry alligator ate Sue":

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}
--------------------------------------------------

This query returns all three documents, but note that documents 1 and 2
have the same relevance score because they contain the same words:

[source,js]
--------------------------------------------------
{
  "hits": [
     {
        "_id":    "1",
        "_score": 0.44273707, <1>
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id":    "2",
        "_score": 0.44273707, <1>
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id":    "3", <2>
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}
--------------------------------------------------

<1> Both documents contain `the`, `alligator`, and `ate` and so have the
same score.
<2> We could have excluded document 3 by setting the `minimum_should_match`
parameter. See <>.

Now let's add the `shingles` field into the query. Remember that we want
matches on the `shingles` field to act as a signal, to increase the relevance
score, so we still need to include the query on the main `title` field:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}
--------------------------------------------------

We still match all three documents, but document 2 has now been bumped into
first place because it matched the shingled term `ate sue`:

[source,js]
--------------------------------------------------
{
  "hits": [
     {
        "_id":    "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id":    "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id":    "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}
--------------------------------------------------

Even though our query included the word `hungry`, which doesn't appear in
any of our documents, we still managed to use word proximity to return the
most relevant document first.

==== Performance

Not only are shingles more flexible than phrase queries,((("shingles", "better performance than phrase queries"))) but they perform
better as well. Instead of paying the price of a phrase query every time you
search, queries for shingles are just as efficient as a simple `match` query.
A small price is paid at index time, because more terms need to be indexed,
which also means that fields with shingles use more disk space. However, most
applications write once and read many times, so it makes sense to optimize
for fast queries.

This is a theme that you will encounter frequently in Elasticsearch: it
enables you to achieve a lot at search time, without requiring any up-front
setup. Once you understand your requirements more clearly, you can achieve
better results with better performance by modeling your data correctly at
index time.

(((“proximity matching”, “finding associated words”, range=”endofrange”, startref =”ix_proxmatchassoc”)))

https://github.com/uxff/elasticsearch-definitive-guide-cn