Elasticsearch Definitive Guide - Partial Matching
2018-03-01 21:59
[[partial-matching]]
== Partial Matching
A keen observer will notice that all the queries so far in this book have
operated on whole terms.(((“partial matching”))) To match something, the smallest unit had to be a
single term. You can find only terms that exist in the inverted index.
But what happens if you want to match parts of a term but not the whole thing?
Partial matching allows users to specify a portion of the term they are
looking for and find any words that contain that fragment.
The requirement to match on part of a term is less common in the full-text
search-engine world than you might think. If you have come from an SQL
background, you likely have, at some stage of your career,
implemented a poor man’s full-text search using SQL constructs like this:

[source,js]
    WHERE text LIKE "*quick*"
      AND text LIKE "*brown*"
      AND text LIKE "*fox*" <1>

<1> `*fox*` would match both ``fox'' and ``foxes.''

Of course, with Elasticsearch, we have the analysis process and the inverted
index that remove the need for such brute-force techniques. To handle the
case of matching both ``fox'' and ``foxes,'' we could simply use a stemmer to
index words in their root form. There is no need to match partial terms.
That said, on some occasions partial matching can be useful.
Common use (((“partial matching”, “common use cases”)))cases include the following:
Matching postal codes, product serial numbers, or other `not_analyzed` values
that start with a particular prefix or match a wildcard pattern
or even a regular expression
search-as-you-type—displaying the most likely results before the
user has finished typing the search terms
Matching in languages like German or Dutch, which contain long compound
words, like Weltgesundheitsorganisation (World Health Organization)
We will start by examining prefix matching on exact-value `not_analyzed`
fields.
=== Postcodes and Structured Data
We will use United Kingdom postcodes (postal codes in the United States) to illustrate how(((“partial matching”, “postcodes and structured data”))) to use partial matching with
structured data. UK postcodes have a well-defined structure. For instance, the
postcode `W1V 3DG` can((("postcodes (UK), partial matching with"))) be broken down as follows:

* `W1V`: This outer part identifies the postal area and district:
** `W` indicates the area (one or two letters)
** `1V` indicates the district (one or two numbers, possibly followed by a letter)
* `3DG`: This inner part identifies a street or building:
** `3` indicates the sector (one number)
** `DG` indicates the unit (two letters)
Let’s assume that we are indexing postcodes as exact-value `not_analyzed`
fields, so we could create our index as follows:

[source,js]
PUT /my_index
{
    "mappings": {
        "address": {
            "properties": {
                "postcode": {
                    "type":  "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}
And index some (((“indexing”, “postcodes”)))postcodes:

[source,js]
PUT /my_index/address/1
{ "postcode": "W1V 3DG" }

PUT /my_index/address/2
{ "postcode": "W2F 8HW" }

PUT /my_index/address/3
{ "postcode": "W1F 7HW" }

PUT /my_index/address/4
{ "postcode": "WC1N 1LZ" }

PUT /my_index/address/5
{ "postcode": "SW5 0BE" }

Now our data is ready to be queried.
[[prefix-query]]
=== prefix Query
To find all postcodes beginning with `W1`, we could use a
(((“prefix query”)))(((“postcodes (UK), partial matching with”, “prefix query”)))simple `prefix`
query:

[source,js]
GET /my_index/address/_search
{
    "query": {
        "prefix": {
            "postcode": "W1"
        }
    }
}

The `prefix` query is a low-level query that works at the term level. It
doesn’t analyze the query string before searching. It assumes that you have
passed it the exact prefix that you want to find.
[TIP]
==================================================
By default, the `prefix` query does no relevance scoring. It just finds
matching documents and gives them all a score of `1`. Really, it behaves more
like a filter than a query. The only practical difference between the
`prefix` query and the `prefix` filter is that the filter can be cached.
==================================================
Previously, we said that ``you can find only terms that exist in the inverted
index,'' but we haven’t done anything special to index these postcodes; each
postcode is simply indexed as the exact value specified in each document. So
how does the `prefix` query work?
[role="pagebreak-after"]
Remember that the inverted index consists(((“inverted index”, “for postcodes”))) of a sorted list of unique terms (in
this case, postcodes). For each term, it lists the IDs of the documents
containing that term in the postings list. The inverted index for our
example documents looks something like this:

    Term:          Doc IDs:
    -------------------------
    "SW5 0BE"    |  5
    "W1F 7HW"    |  3
    "W1V 3DG"    |  1
    "W2F 8HW"    |  2
    "WC1N 1LZ"   |  4
    -------------------------
To support prefix matching on the fly, the query does the following:
1. Skips through the terms list to find the first term beginning with `W1`.
2. Collects the associated document IDs.
3. Moves to the next term.
4. If that term also begins with `W1`, the query repeats from step 2; otherwise, we’re finished.
While this works fine for our small example, imagine that our inverted index
contains a million postcodes beginning with `W1`. The prefix query
would need to visit all one million terms in order to calculate the result!
And the shorter the prefix, the more terms need to be visited. If we were to
look for the prefix `W` instead of `W1`, perhaps we would match 10 million
terms instead of just one million.
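The stepwise scan described above can be sketched in Python. This is a toy model, not Elasticsearch internals: the in-memory `index` dictionary and the `prefix_search` helper are illustrative names standing in for the sorted terms list and postings of the inverted index.

```python
import bisect

# Toy inverted index mirroring the postcode example above:
# sorted unique terms mapped to their postings (doc IDs).
index = {
    "SW5 0BE":  [5],
    "W1F 7HW":  [3],
    "W1V 3DG":  [1],
    "W2F 8HW":  [2],
    "WC1N 1LZ": [4],
}
terms = sorted(index)

def prefix_search(prefix):
    """Walk the sorted terms list the way the prefix query does:
    skip to the first term >= prefix, then collect doc IDs until a
    term no longer starts with the prefix."""
    docs = []
    i = bisect.bisect_left(terms, prefix)   # skip ahead to the first candidate
    while i < len(terms) and terms[i].startswith(prefix):
        docs.extend(index[terms[i]])        # collect the associated doc IDs
        i += 1                              # move to the next term
    return sorted(docs)

print(prefix_search("W1"))  # [1, 3]: the docs containing W1V 3DG and W1F 7HW
```

Note that the loop visits every matching term, which is exactly why a short prefix over millions of terms is expensive.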
CAUTION: The `prefix` query or filter are useful for ad hoc prefix matching, but
should be used with care. (((“prefix query”, “caution with”))) They can be used freely on fields with a small
number of terms, but they scale poorly and can put your cluster under a lot of
strain. Try to limit their impact on your cluster by using a long prefix;
this reduces the number of terms that need to be visited.
Later in this chapter, we present an alternative index-time solution that
makes prefix matching much more efficient. But first, we’ll take a look at
two related queries: the `wildcard` and `regexp` queries.

=== wildcard and regexp Queries
The `wildcard` query is a low-level, term-based query (((“wildcard query”)))(((“partial matching”, “wildcard and regexp queries”)))similar in nature to the
`prefix` query, but it allows you to specify a pattern instead of just a prefix.
It uses the standard shell wildcards: `?` matches any character, and `*`
matches zero or more characters.(((“postcodes (UK), partial matching with”, “wildcard queries”)))

This query would match the documents containing `W1F 7HW` and `W2F 8HW`:
[source,js]
GET /my_index/address/_search
{
    "query": {
        "wildcard": {
            "postcode": "W?F*HW" <1>
        }
    }
}

<1> The `?` matches the `1` and the `2`, while the `*` matches the space
and the `7` and `8`.
Imagine now that you want to match all postcodes just in the `W` area. A
prefix match would also include postcodes starting with `WC`, and you would
have a similar problem with a wildcard match. We want to match only postcodes
that begin with a `W`, followed by a number.(((“postcodes (UK), partial matching with”, “regexp query”)))(((“regexp query”))) The `regexp` query allows you to
write these more complicated patterns:
[source,js]
GET /my_index/address/_search
{
    "query": {
        "regexp": {
            "postcode": "W[0-9].+" <1>
        }
    }
}

<1> The regular expression says that the term must begin with a `W`, followed
by any number from 0 to 9, followed by one or more other characters.
The `wildcard` and `regexp` queries work in exactly the same way as the
`prefix` query. They also have to scan the list of terms in the inverted
index to find all matching terms, and gather document IDs term by term. The
only difference between them and the `prefix` query is that they support
more-complex patterns.

This means that the same caveats apply. Running these queries on a field with
many unique terms can be resource intensive indeed. Avoid using a
pattern that starts with a wildcard (for example, `*foo` or, as a regexp, `.*foo`).
Whereas prefix matching can be made more efficient by preparing your data at
index time, wildcard and regular expression matching can be done only
at query time. These queries have their place but should be used sparingly.
[CAUTION]
=================================================
The `prefix`, `wildcard`, and `regexp` queries operate on terms. If you use
them to query an `analyzed` field, they will examine each term in the
field, not the field as a whole.(((“prefix query”, “on analyzed fields”)))(((“wildcard query”, “on analyzed fields”)))(((“regexp query”, “on analyzed fields”)))(((“analyzed fields”, “prefix, wildcard, and regexp queries on”)))

For instance, let’s say that our `title` field contains ``Quick brown fox,''
which produces the terms `quick`, `brown`, and `fox`.

This query would match:

[source,json]
{ "regexp": { "title": "br.*" }}

But neither of these queries would match:

[source,json]
{ "regexp": { "title": "Qu.*" }} <1>
{ "regexp": { "title": "quick br*" }} <2>

<1> The term in the index is `quick`, not `Quick`.
<2> `quick` and `brown` are separate terms.
=================================================
=== Query-Time Search-as-You-Type
Leaving postcodes behind, let’s take a look at how prefix matching can help
with full-text queries. (((“partial matching”, “query time search-as-you-type”))) Users have become accustomed to seeing search results
before they have finished typing their query–so-called instant search, or
search-as-you-type. (((“search-as-you-type”)))(((“instant search”))) Not only do users receive their search results in less
time, but we can guide them toward results that actually exist in our index.
For instance, if a user types in `johnnie walker bl`, we would like to show
results for Johnnie Walker Black Label and Johnnie Walker Blue
Label before they can finish typing their query.
As always, there are more ways than one to skin a cat! We will start by
looking at the way that is simplest to implement. You don’t need to prepare your
data in any way; you can implement search-as-you-type at query time on any
full-text field.
In <>, we introduced the `match_phrase` query, which matches
all the specified words in the same positions relative to each other. For
query-time search-as-you-type, we can use a specialization of this query,
called (((“prefix query”, “match_phrase_prefix query”)))(((“match_phrase_prefix query”)))the
`match_phrase_prefix` query:

[source,js]
{
    "match_phrase_prefix" : {
        "brand" : "johnnie walker bl"
    }
}
This query behaves in the same way as the `match_phrase` query, except that it
treats the last word in the query string as a prefix. In other words, the
preceding example would look for the following:

* `johnnie`
* Followed by `walker`
* Followed by words beginning with `bl`

If you were to run this query through the `validate-query` API, it would
produce this explanation:

    "johnnie walker bl*"
Like the `match_phrase` query, it accepts a `slop` parameter (see <>) to
make the word order and relative positions (((“slop parameter”, “match_phrase_prefix query”)))(((“match_phrase_prefix query”, “slop parameter”)))somewhat less rigid:
[source,js]
{
    "match_phrase_prefix" : {
        "brand" : {
            "query": "walker johnnie bl", <1>
            "slop":  10
        }
    }
}
<1> Even though the words are in the wrong order, the query still matches
because we have set a high enough `slop` value to allow some flexibility
in word positions.
However, it is always only the last word in the query string that is treated
as a prefix.
Earlier, in <>, we warned about the perils of the prefix–how
`prefix` queries can be resource intensive. The same is true in this
case.(((“match_phrase_prefix query”, “caution with”))) A prefix of `a` could match hundreds of thousands of terms. Not only
would matching on this many terms be resource intensive, but it would also not be
useful to the user.
We can limit the impact (((“match_phrase_prefix query”, “max_expansions”)))(((“max_expansions parameter”)))of the prefix expansion by setting `max_expansions` to
a reasonable number, such as 50:
[source,js]
{
    "match_phrase_prefix" : {
        "brand" : {
            "query":          "johnnie walker bl",
            "max_expansions": 50
        }
    }
}
The `max_expansions` parameter controls how many terms the prefix is allowed
to match. It will find the first term starting with `bl` and keep collecting
terms (in alphabetical order) until it either runs out of terms with prefix
`bl`, or it has more terms than `max_expansions`.
Don’t forget that we have to run this query every time the user types another
character, so it needs to be fast. If the first set of results isn’t what users are after, they’ll keep typing until they get the results that they want.
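The expansion step can be sketched as follows; a minimal model in which `expand_prefix` and the `brands` term list are illustrative, not actual Elasticsearch code:

```python
import bisect

def expand_prefix(terms, prefix, max_expansions=50):
    """Collect terms starting with `prefix`, in alphabetical order,
    stopping as soon as max_expansions terms have been gathered."""
    out = []
    i = bisect.bisect_left(terms, prefix)   # jump to the first candidate term
    while (i < len(terms)
           and terms[i].startswith(prefix)
           and len(out) < max_expansions):  # the cap bounds the work done
        out.append(terms[i])
        i += 1
    return out

brands = sorted(["black", "blended", "blue", "bourbon", "brandy"])
print(expand_prefix(brands, "bl", max_expansions=2))  # ['black', 'blended']
```

The cap trades completeness for speed: terms sorting after the first `max_expansions` matches are simply never considered.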
=== Index-Time Optimizations
All of the solutions we’ve talked about so far are implemented at
query time. (((“index time optimizations”)))(((“partial matching”, “index time optimizations”)))They don’t require any special mappings or indexing patterns;
they simply work with the data that you’ve already indexed.
The flexibility of query-time operations comes at a cost: search performance.
Sometimes it may make sense to move the cost away from the query. In a real-
time web application, an additional 100ms may be too much latency to tolerate.
By preparing your data at index time, you can make your searches more flexible
and improve performance. You still pay a price: increased index size and
slightly slower indexing throughput, but it is a price you pay once at index
time, instead of paying it on every query.
Your users will thank you.
=== Ngrams for Partial Matching
As we have said before, ``You can find only terms that exist in the inverted
index.'' Although the `prefix`, `wildcard`, and `regexp` queries demonstrated that
that is not strictly true, it is true that doing a single-term lookup is
much faster than iterating through the terms list to find matching terms on
the fly.(((“partial matching”, “index time optimizations”, “n-grams”))) Preparing your data for partial matching ahead of time will increase
your search performance.
Preparing your data at index time means choosing the right analysis chain, and
the tool that we use for partial matching is the n-gram.(((“n-grams”))) An n-gram can be
best thought of as a moving window on a word. The n stands for a length.
If we were to n-gram the word `quick`, the results would depend on the length
we have chosen:

[horizontal]
* Length 1 (unigram): [ `q`, `u`, `i`, `c`, `k` ]
* Length 2 (bigram): [ `qu`, `ui`, `ic`, `ck` ]
* Length 3 (trigram): [ `qui`, `uic`, `ick` ]
* Length 4 (four-gram): [ `quic`, `uick` ]
* Length 5 (five-gram): [ `quick` ]
Plain n-grams are useful for matching somewhere within a word, a technique
that we will use in <>. However, for search-as-you-type,
we use a specialized form of n-grams called edge n-grams. (((“edge n-grams”))) Edge
n-grams are anchored to the beginning of the word. Edge n-gramming the word
`quick` would result in these terms:

* `q`
* `qu`
* `qui`
* `quic`
* `quick`

You may notice that this conforms exactly to the letters that a user searching for “quick” would type. In other words, these are the
perfect terms to use for instant search!
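Both variants are easy to reproduce outside of Elasticsearch. This short sketch (the helper names are our own) shows the moving window versus the anchored window:

```python
def ngrams(word, n):
    """All substrings of length n: a moving window over the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word, min_gram=1, max_gram=20):
    """N-grams anchored to the start of the word, mimicking what an
    edge_ngram token filter with these settings would emit."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("quick", 3))    # ['qui', 'uic', 'ick']
print(edge_ngrams("quick"))  # ['q', 'qu', 'qui', 'quic', 'quick']
```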
=== Index-Time Search-as-You-Type
The first step to setting up index-time search-as-you-type is to(((“search-as-you-type”, “index time”)))(((“partial matching”, “index time search-as-you-type”))) define our
analysis chain, which we discussed in <>, but we will
go over the steps again here.
==== Preparing the Index
The first step is to configure a (((“partial matching”, “index time search-as-you-type”, “preparing the index”)))custom `edge_ngram` token filter, which we
will call the `autocomplete_filter`:
[source,js]
{
    "filter": {
        "autocomplete_filter": {
            "type":     "edge_ngram",
            "min_gram": 1,
            "max_gram": 20
        }
    }
}
This configuration says that, for any term that this token filter receives,
it should produce an n-gram anchored to the start of the word of minimum
length 1 and maximum length 20.
Then we need to use this token filter in a custom analyzer,(((“analyzers”, “autocomplete custom analyzer”))) which we will call
the `autocomplete` analyzer:
[source,js]
{
    "analyzer": {
        "autocomplete": {
            "type":      "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "autocomplete_filter" <1>
            ]
        }
    }
}
This analyzer will tokenize a string into individual terms by using the
`standard` tokenizer, lowercase each term, and then produce edge n-grams of each
term, thanks to our `autocomplete_filter`.
The full request to create the index and instantiate the token filter and
analyzer looks like this:
[source,js]
PUT /my_index
{
    "settings": {
        "number_of_shards": 1, <1>
        "analysis": {
            "filter": {
                "autocomplete_filter": { <2>
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" <3>
                    ]
                }
            }
        }
    }
}
<1> See <>.
<2> First we define our custom token filter.
<3> Then we use it in an analyzer.
You can test this new analyzer to make sure it is behaving correctly by using
the `analyze` API:

[source,js]
GET /my_index/_analyze?analyzer=autocomplete
quick brown

The results show us that the analyzer is working correctly. It returns these
terms:

    q
    qu
    qui
    quic
    quick
    b
    br
    bro
    brow
    brown
To use the analyzer, we need to apply it to a field, which we can do
with(((“update-mapping API, applying custom autocomplete analyzer to a field”))) the `update-mapping` API:
[source,js]
PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "name": {
                "type":     "string",
                "analyzer": "autocomplete"
            }
        }
    }
}
Now, we can index some test documents:

[source,js]
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Brown foxes" }
{ "index": { "_id": 2 }}
{ "name": "Yellow furballs" }
==== Querying the Field
If you test out a query for ``brown fo'':
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}
you will see that both documents match, even though the `Yellow furballs`
doc contains neither `brown` nor `fo`:
"hits": [
    {
        "_id": "1",
        "_score": 1.5753809,
        "_source": {
            "name": "Brown foxes"
        }
    },
    {
        "_id": "2",
        "_score": 0.012520773,
        "_source": {
            "name": "Yellow furballs"
        }
    }
]
To understand why, run the query through the `validate-query` API:

[source,js]
GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}

The `explanation` shows us that the query is looking for edge n-grams of every
word in the query string:

    name:b name:br name:bro name:brow name:brown name:f name:fo

The `name:f` condition is satisfied by the second document because `furballs`
has been indexed as `f`, `fu`, `fur`, and so forth. In retrospect, this
is not surprising. The same `autocomplete` analyzer is applied at
index time and at search time, which in most situations is the right thing to
do. This is one of the few occasions when it makes sense to break this rule.
We want to ensure that our inverted index contains edge n-grams of every word,
but we want to match only the full words that the user has entered (`brown` and
`fo`). We can do this by using the `autocomplete` analyzer at
index time and the `standard` analyzer at search time. One way to change the
search analyzer is just to specify it in the query:
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": {
                "query":    "brown fo",
                "analyzer": "standard" <1>
            }
        }
    }
}
<1> This overrides the `analyzer` setting in the mapping.
Alternatively, we can specify (((“search_analyzer parameter”)))(((“index_analyzer parameter”)))the `index_analyzer` and `search_analyzer` in
the mapping for the `name` field itself. Because we want to change only the
`search_analyzer`, we can update the existing mapping without having to
reindex our data:
[source,js]
PUT /my_index/my_type/_mapping
{
    "my_type": {
        "properties": {
            "name": {
                "type":            "string",
                "index_analyzer":  "autocomplete", <1>
                "search_analyzer": "standard" <2>
            }
        }
    }
}
<1> Use the `autocomplete` analyzer at index time to produce edge n-grams of
every term.
<2> Use the `standard` analyzer at search time to search only on the terms
that the user has entered.
If we were to repeat the `validate-query` request, it would now give us this
explanation:

    name:brown name:fo
Repeating our query correctly returns just the `Brown foxes`
document.
Because most of the work has been done at index time, all this query needs to
do is to look up the two terms `brown` and `fo`, which is much more efficient
than the `match_phrase_prefix` approach of having to list all terms beginning
with `fo`.
.Completion Suggester
Using edge n-grams for search-as-you-type is easy to set up, flexible, and
fast. However, sometimes it is not fast enough. Latency matters, especially
when you are trying to provide instant feedback. Sometimes the fastest way of
searching is not to search at all.
The http://bit.ly/1IChV5j[completion suggester] in
Elasticsearch(((“completion suggester”))) takes a completely different approach. You feed it a list
of all possible completions, and it builds them into a _finite state
transducer_, an(((“Finite State Transducer”))) optimized data structure that resembles a big graph. To
search for suggestions, Elasticsearch starts at the beginning of the graph and
moves character by character along the matching path. Once it has run out of
user input, it looks at all possible endings of the current path to produce a
list of suggestions.
This data structure lives in memory and makes prefix lookups extremely fast,
much faster than any term-based query could be. It is an excellent match for
autocompletion of names and brands, whose words are usually organized in a
common order: ``Johnny Rotten'' rather than ``Rotten Johnny.''
When word order is less predictable, edge n-grams can be a better solution
than the completion suggester. This particular cat may be skinned in myriad
ways.
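The prefix walk over a finite state transducer can be approximated with a much simpler structure. In this toy sketch (our own `completions` list and `suggest` helper, far less compact than a real FST), a sorted list of full completions gives the same jump-then-read-off behaviour:

```python
import bisect

# Stand-in for the completion suggester's precomputed completions.
completions = sorted([
    "johnnie walker black label",
    "johnnie walker blue label",
    "johnnie walker gold label",
])

def suggest(prefix, limit=5):
    """Jump straight to the first completion >= prefix, then read off
    every stored completion that extends it -- no term-by-term search."""
    i = bisect.bisect_left(completions, prefix)
    out = []
    while (i < len(completions)
           and completions[i].startswith(prefix)
           and len(out) < limit):
        out.append(completions[i])
        i += 1
    return out

print(suggest("johnnie walker bl"))
# ['johnnie walker black label', 'johnnie walker blue label']
```

As with the real suggester, the whole structure lives in memory, which is why lookups cost almost nothing compared to a term-based query.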
==== Edge n-grams and Postcodes
The edge n-gram approach can(((“postcodes (UK), partial matching with”, “using edge n-grams”)))(((“edge n-grams”, “and postcodes”))) also be used for structured data, such as the
postcodes example from <>. Of course, the `postcode` field would need to be
`analyzed` instead of `not_analyzed`, but you can use the `keyword` tokenizer
to treat the postcodes as if they were `not_analyzed`.

[TIP]
==================================================
The `keyword` tokenizer is the no-operation tokenizer of the analyzer world. It does
nothing. Whatever string it receives as input, it emits exactly the same
string as a single token. It can therefore be used for values that we would
normally treat as `not_analyzed` fields but that require some in-between
transformation, such as lowercasing.
==================================================
This example uses the `keyword` tokenizer to convert the postcode string into a
token stream, so that we can use the edge n-gram token filter:

[source,js]
{
    "analysis": {
        "filter": {
            "postcode_filter": {
                "type":     "edge_ngram",
                "min_gram": 1,
                "max_gram": 8
            }
        },
        "analyzer": {
            "postcode_index": { <1>
                "tokenizer": "keyword",
                "filter":    [ "postcode_filter" ]
            },
            "postcode_search": { <2>
                "tokenizer": "keyword"
            }
        }
    }
}
<1> The `postcode_index` analyzer would use the `postcode_filter`
to turn postcodes into edge n-grams.
<2> The `postcode_search` analyzer would treat search terms as
if they were `not_analyzed`.
[[ngrams-compound-words]]
=== Ngrams for Compound Words
Finally, let’s take a look at how n-grams can be used to search languages with
compound words. (((“languages”, “using many compound words, indexing of”)))(((“n-grams”, “using with compound words”)))(((“partial matching”, “using n-grams for compound words”)))(((“German”, “compound words in”))) German is famous for combining several small words into one
massive compound word in order to capture precise or complex meanings. For
example:
Aussprachewörterbuch::
Pronunciation dictionary
Militärgeschichte::
Military history
Weißkopfseeadler::
White-headed sea eagle, or bald eagle
Weltgesundheitsorganisation::
World Health Organization
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz::
The law concerning the delegation of duties for the supervision of cattle
marking and the labeling of beef
Somebody searching for ``Wörterbuch'' (dictionary) would expect to see
``Aussprachewörterbuch'' in the list of results. Similarly, a search for
``Adler'' (eagle) should include ``Weißkopfseeadler.''
One approach to indexing languages like this is to break compound words into
their constituent parts using the http://bit.ly/1ygdjjC[compound word token filter].
However, the quality of the results depends on how good your compound-word
dictionary is.
Another approach is just to break all words into n-grams and to search for any
matching fragments–the more fragments that match, the more relevant the
document.
Given that an n-gram is a moving window on a word, an n-gram of any length
will cover all of the word. We want to choose a length that is long enough
to be meaningful, but not so long that we produce far too many unique terms.
A trigram (length 3) is (((“trigrams”)))probably a good starting point:
[source,js]
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 3,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "trigrams_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "text": {
                    "type":     "string",
                    "analyzer": "trigrams" <1>
                }
            }
        }
    }
}
<1> The `text` field uses the `trigrams` analyzer to index its contents as
n-grams of length 3.
Testing the trigrams analyzer with the `analyze` API:

[source,js]
GET /my_index/_analyze?analyzer=trigrams
Weißkopfseeadler

returns these terms:

    wei, eiß, ißk, ßko, kop, opf, pfs, fse, see, eea, ead, adl, dle, ler
We can index our example compound words to test this approach:

[source,js]
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Aussprachewörterbuch" }
{ "index": { "_id": 2 }}
{ "text": "Militärgeschichte" }
{ "index": { "_id": 3 }}
{ "text": "Weißkopfseeadler" }
{ "index": { "_id": 4 }}
{ "text": "Weltgesundheitsorganisation" }
{ "index": { "_id": 5 }}
{ "text": "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" }
A search for ``Adler'' (eagle) becomes a query for the three terms `adl`, `dle`,
and `ler`:
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": "Adler"
        }
    }
}
which correctly matches “Weißkopfsee-adler”:
"hits": [
    {
        "_id": "3",
        "_score": 3.3191128,
        "_source": {
            "text": "Weißkopfseeadler"
        }
    }
]
A similar query for ``Gesundheit'' (health) correctly matches
``Welt-gesundheits-organisation,'' but it also matches ``Militär-ges-chichte''
and ``Rindfleischetikettierungsüberwachungsaufgabenübertragungs-ges-etz,''
both of which also contain the trigram `ges`.
Judicious use of the `minimum_should_match` parameter can remove these
spurious results by requiring that a minimum number of trigrams must be
present for a document to be considered a match:
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": {
                "query":                "Gesundheit",
                "minimum_should_match": "80%"
            }
        }
    }
}
This is a bit of a shotgun approach to full-text search and can result in a
large inverted index, but it is an effective generic way of indexing languages
that use many compound words or that don’t use whitespace between words,
such as Thai.
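The trigram-overlap idea behind `minimum_should_match` is easy to model. This is a rough sketch of the counting logic only (the `trigrams` and `matches` helpers are ours, and real Elasticsearch scoring is more involved): a query becomes a bag of trigrams, and a document qualifies when it contains at least the required fraction of them.

```python
def trigrams(text):
    """Lowercase the text and emit every substring of length 3."""
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]

def matches(doc, query_grams, minimum_should_match=0.8):
    """A doc matches when it contains at least the given fraction
    of the query's trigrams."""
    doc_grams = set(trigrams(doc))
    hits = sum(1 for g in query_grams if g in doc_grams)
    return hits / len(query_grams) >= minimum_should_match

# 'Gesundheit' yields 8 trigrams: ges, esu, sun, und, ndh, dhe, hei, eit.
query = trigrams("Gesundheit")

print(matches("Weltgesundheitsorganisation", query))  # True: all 8 present
print(matches("Militärgeschichte", query))            # False: only 'ges' matches
```

With the threshold at 80%, a document sharing only the single trigram `ges` is filtered out, while one containing the whole query word survives.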
This technique is used to increase recall—the number of relevant
documents that a search returns. It is usually used in combination with
other techniques, such as shingles (see <>) to improve precision and
the relevance score of each document.
https://github.com/uxff/elasticsearch-definitive-guide-cn
== Partial Matching
A keen observer will notice that all the queries so far in this book have
operated on whole terms.(((“partial matching”))) To match something, the smallest unit had to be a
single term. You can find only terms that exist in the inverted index.
But what happens if you want to match parts of a term but not the whole thing?
Partial matching allows users to specify a portion of the term they are
looking for and find any words that contain that fragment.
The requirement to match on part of a term is less common in the full-text
search-engine world than you might think. If you have come from an SQL
background, you likely have, at some stage of your career,
implemented a poor man’s full-text search using SQL constructs like this:
[source,js]
WHERE text LIKE "*quick*" AND text LIKE "*brown*"
AND text LIKE “fox” <1>
<1>*fox*would match
fox'' andfoxes.”
Of course, with Elasticsearch, we have the analysis process and the inverted
index that remove the need for such brute-force techniques. To handle the
case of matching both
fox'' andfoxes,” we could simply use a stemmer to
index words in their root form. There is no need to match partial terms.
That said, on some occasions partial matching can be useful.
Common use (((“partial matching”, “common use cases”)))cases include the following:
Matching postal codes, product serial numbers, or other
not_analyzedvalues
that start with a particular prefix or match a wildcard pattern
or even a regular expression
search-as-you-type—displaying the most likely results before the
user has finished typing the search terms
Matching in languages like German or Dutch, which contain long compound
words, like Weltgesundheitsorganisation (World Health Organization)
We will start by examining prefix matching on exact-value
not_analyzed
fields.
=== Postcodes and Structured Data
We will use United Kingdom postcodes (postal codes in the United States) to illustrate how(((“partial matching”, “postcodes and structured data”))) to use partial matching with
structured data. UK postcodes have a well-defined structure. For instance, the
postcode
W1V 3DGcan(((“postcodes (UK), partial matching with”))) be broken down as follows:
W1V: This outer part identifies the postal area and district:
**
Windicates the area (one or two letters)
**
1Vindicates the district (one or two numbers, possibly followed by a letter
3DG: This inner part identifies a street or building:
**
3indicates the sector (one number)
**
DGindicates the unit (two letters)
Let’s assume that we are indexing postcodes as exact-value
not_analyzed
fields, so we could create our index as follows:
[source,js]
PUT /my_index{
“mappings”: {
“address”: {
“properties”: {
“postcode”: {
“type”: “string”,
“index”: “not_analyzed”
}
}
}
}
}
// SENSE: 130_Partial_Matching/10_Prefix_query.jsonAnd index some (((“indexing”, “postcodes”)))postcodes:
[source,js]
PUT /my_index/address/1{ “postcode”: “W1V 3DG” }
PUT /my_index/address/2
{ “postcode”: “W2F 8HW” }
PUT /my_index/address/3
{ “postcode”: “W1F 7HW” }
PUT /my_index/address/4
{ “postcode”: “WC1N 1LZ” }
PUT /my_index/address/5
{ “postcode”: “SW5 0BE” }
// SENSE: 130_Partial_Matching/10_Prefix_query.jsonNow our data is ready to be queried.
[[prefix-query]]
=== prefix Query
To find all postcodes beginning with
W1, we could use a (((“prefix query”)))(((“postcodes (UK), partial matching with”, “prefix query”)))simple
prefix
query:
[source,js]
GET /my_index/address/_search{
“query”: {
“prefix”: {
“postcode”: “W1”
}
}
}
// SENSE: 130_Partial_Matching/10_Prefix_query.jsonThe
prefixquery is a low-level query that works at the term level. It
doesn’t analyze the query string before searching. It assumes that you have
passed it the exact prefix that you want to find.
[TIP]
By default, theprefixquery does no relevance scoring. It just finds
matching documents and gives them all a score of
1. Really, it behaves more
like a filter than a query. The only practical difference between the
prefixquery and the
prefixfilter is that the filter can be cached.
==================================================
Previously, we said that
`you can find only terms that exist in the inverted index,'' but we haven't done anything special to index these postcodes; each postcode is simply indexed as the exact value specified in each document. So how does theprefix` query work?
[role=”pagebreak-after”]
Remember that the inverted index consists(((“inverted index”, “for postcodes”))) of a sorted list of unique terms (in
this case, postcodes). For each term, it lists the IDs of the documents
containing that term in the postings list. The inverted index for our
example documents looks something like this:
Term: Doc IDs: ------------------------- "SW5 0BE" | 5 "W1F 7HW" | 3 "W1V 3DG" | 1 "W2F 8HW" | 2 "WC1N 1LZ" | 4 -------------------------
To support prefix matching on the fly, the query does the following:
Skips through the terms list to find the first term beginning with
W1.
Collects the associated document IDs.
Moves to the next term.
If that term also begins with
W1, the query repeats from step 2; otherwise, we’re finished.
While this works fine for our small example, imagine that our inverted index
contains a million postcodes beginning with
W1. The prefix query
would need to visit all one million terms in order to calculate the result!
And the shorter the prefix, the more terms need to be visited. If we were to
look for the prefix
Winstead of
W1, perhaps we would match 10 million
terms instead of just one million.
CAUTION: The
prefixquery or filter are useful for ad hoc prefix matching, but
should be used with care. (((“prefix query”, “caution with”))) They can be used freely on fields with a small
number of terms, but they scale poorly and can put your cluster under a lot of
strain. Try to limit their impact on your cluster by using a long prefix;
this reduces the number of terms that need to be visited.
Later in this chapter, we present an alternative index-time solution that
makes prefix matching much more efficient. But first, we’ll take a look at
two related queries: the
wildcardand
regexpqueries.
=== wildcard and regexp Queries
The
wildcardquery is a low-level, term-based query (((“wildcard query”)))(((“partial matching”, “wildcard and regexp queries”)))similar in nature to the
prefixquery, but it allows you to specify a pattern instead of just a prefix.
It uses the standard shell wildcards:
?matches any character, and
*
matches zero or more characters.(((“postcodes (UK), partial matching with”, “wildcard queries”)))
This query would match the documents containing
W1F 7HWand
W2F 8HW:
[source,js]
GET /my_index/address/_search{
“query”: {
“wildcard”: {
“postcode”: “W?F*HW” <1>
}
}
}
// SENSE: 130_Partial_Matching/15_Wildcard_regexp.json<1> The
?matches the
1and the
2, while the
*matches the space
and the
7and
8.
Imagine now that you want to match all postcodes just in the
Warea. A
prefix match would also include postcodes starting with
WC, and you would
have a similar problem with a wildcard match. We want to match only postcodes
that begin with a
W, followed by a number.(((“postcodes (UK), partial matching with”, “regexp query”)))(((“regexp query”))) The
regexpquery allows you to
write these more complicated patterns:
[source,js]
GET /my_index/address/_search{
“query”: {
“regexp”: {
“postcode”: “W[0-9].+” <1>
}
}
}
// SENSE: 130_Partial_Matching/15_Wildcard_regexp.json<1> The regular expression says that the term must begin with a
W, followed
by any number from 0 to 9, followed by one or more other characters.
The
wildcardand
regexpqueries work in exactly the same way as the
prefixquery. They also have to scan the list of terms in the inverted
index to find all matching terms, and gather document IDs term by term. The
only difference between them and the
prefixquery is that they support more-complex patterns.
This means that the same caveats apply. Running these queries on a field with
many unique terms can be resource intensive indeed. Avoid using a
pattern that starts with a wildcard (for example,
*fooor, as a regexp,
.*foo).
Whereas prefix matching can be made more efficient by preparing your data at
index time, wildcard and regular expression matching can be done only
at query time. These queries have their place but should be used sparingly.
[CAUTION]
Theprefix,
wildcard, and
regexpqueries operate on terms. If you use
them to query an
analyzedfield, they will examine each term in the
field, not the field as a whole.(((“prefix query”, “on analyzed fields”)))(((“wildcard query”, “on analyzed fields”)))(((“regexp query”, “on analyzed fields”)))(((“analyzed fields”, “prefix, wildcard, and regexp queries on”)))
For instance, let’s say that our
titlefield contains
`Quick brown fox'' which produces the termsquick
,brown
, andfox`.
This query would match:
[source,json]
{ “regexp”: { “title”: “br.*” }}
But neither of these queries would match:[source,json]
{ “regexp”: { “title”: “Qu.*” }} <1>{ “regexp”: { “title”: “quick br*” }} <2>
<1> The term in the index isquick, not
Quick.
<2>
quickand
brownare separate terms.
=================================================
=== Query-Time Search-as-You-Type
Leaving postcodes behind, let’s take a look at how prefix matching can help
with full-text queries. (((“partial matching”, “query time search-as-you-type”))) Users have become accustomed to seeing search results
before they have finished typing their query–so-called instant search, or
search-as-you-type. (((“search-as-you-type”)))(((“instant search”))) Not only do users receive their search results in less
time, but we can guide them toward results that actually exist in our index.
For instance, if a user types in
johnnie walker bl, we would like to show results for Johnnie Walker Black Label and Johnnie Walker Blue
Label before they can finish typing their query.
As always, there are more ways than one to skin a cat! We will start by
looking at the way that is simplest to implement. You don’t need to prepare your
data in any way; you can implement search-as-you-type at query time on any
full-text field.
In <>, we introduced the
match_phrasequery, which matches
all the specified words in the same positions relative to each other. For-query time search-as-you-type, we can use a specialization of this query,
called (((“prefix query”, “match_phrase_prefix query”)))(((“match_phrase_prefix query”)))the
match_phrase_prefixquery:
[source,js]
{“match_phrase_prefix” : {
“brand” : “johnnie walker bl”
}
}
// SENSE: 130_Partial_Matching/20_Match_phrase_prefix.jsonThis query behaves in the same way as the
match_phrasequery, except that it
treats the last word in the query string as a prefix. In other words, the
preceding example would look for the following:
johnnie
Followed by
walker
Followed by words beginning with
bl
If you were to run this query through the
validate-queryAPI, it would
produce this explanation:
"johnnie walker bl*"
Like the
match_phrasequery, it accepts a
slopparameter (see <>) to
make the word order and relative positions (((“slop parameter”, “match_prhase_prefix query”)))(((“match_phrase_prefix query”, “slop parameter”)))somewhat less rigid:
[source,js]
{“match_phrase_prefix” : {
“brand” : {
“query”: “walker johnnie bl”, <1>
“slop”: 10
}
}
}
// SENSE: 130_Partial_Matching/20_Match_phrase_prefix.json<1> Even though the words are in the wrong order, the query still matches
because we have set a high enough
slopvalue to allow some flexibility
in word positions.
However, it is always only the last word in the query string that is treated
as a prefix.
Earlier, in <>, we warned about the perils of the prefix–how
prefixqueries can be resource intensive. The same is true in this
case.(((“match_phrase_prefix query”, “caution with”))) A prefix of
acould match hundreds of thousands of terms. Not only
would matching on this many terms be resource intensive, but it would also not be
useful to the user.
We can limit the impact (((“match_phrase_prefix query”, “max_expansions”)))(((“max_expansions parameter”)))of the prefix expansion by setting
max_expansionsto
a reasonable number, such as 50:
[source,js]
--------------------------------------------------
{
    "match_phrase_prefix" : {
        "brand" : {
            "query":          "johnnie walker bl",
            "max_expansions": 50
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/20_Match_phrase_prefix.json

The `max_expansions` parameter controls how many terms the prefix is allowed
to match. It will find the first term starting with `bl` and keep collecting
terms (in alphabetical order) until it either runs out of terms with prefix
`bl`, or it has more terms than `max_expansions`.
Don’t forget that we have to run this query every time the user types another
character, so it needs to be fast. If the first set of results isn’t what users are after, they’ll keep typing until they get the results that they want.
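The expansion process can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not Elasticsearch's actual implementation: jump to the first term that sorts at or after the prefix, then collect matching terms until `max_expansions` is reached.

```python
import bisect

def expand_prefix(terms, prefix, max_expansions):
    """Collect terms starting with `prefix`, in sorted order,
    stopping once `max_expansions` terms have been gathered."""
    terms = sorted(terms)
    # Jump to the first term >= prefix (the terms list of an
    # inverted index is already stored in sorted order).
    start = bisect.bisect_left(terms, prefix)
    expanded = []
    for term in terms[start:]:
        if len(expanded) >= max_expansions or not term.startswith(prefix):
            break
        expanded.append(term)
    return expanded

# With max_expansions=2, only the first two matching terms are used.
print(expand_prefix(["brown", "blue", "black", "bland"], "bl", 2))
# → ['black', 'bland']
```

Each extra character the user types lengthens the prefix and shrinks the candidate list, which is why keeping the per-keystroke cost bounded matters.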
=== Index-Time Optimizations
All of the solutions we’ve talked about so far are implemented at
query time. (((“index time optimizations”)))(((“partial matching”, “index time optimizations”)))They don’t require any special mappings or indexing patterns;
they simply work with the data that you’ve already indexed.
The flexibility of query-time operations comes at a cost: search performance.
Sometimes it may make sense to move the cost away from the query. In a
real-time web application, an additional 100ms may be too much latency to tolerate.
By preparing your data at index time, you can make your searches more flexible
and improve performance. You still pay a price: increased index size and
slightly slower indexing throughput, but it is a price you pay once at index
time, instead of paying it on every query.
Your users will thank you.
=== Ngrams for Partial Matching
As we have said before, ``You can find only terms that exist in the inverted
index.'' Although the `prefix`, `wildcard`, and `regexp` queries demonstrated that
that is not strictly true, it is true that doing a single-term lookup is
much faster than iterating through the terms list to find matching terms on
the fly.((("partial matching", "index time optimizations", "n-grams"))) Preparing your data for partial matching ahead of time will increase
your search performance.
Preparing your data at index time means choosing the right analysis chain, and
the tool that we use for partial matching is the _n-gram_.((("n-grams"))) An n-gram can be
best thought of as a _moving window on a word_. The _n_ stands for a length.
If we were to n-gram the word `quick`, the results would depend on the length
we have chosen:

[horizontal]
* Length 1 (unigram):   [ `q`, `u`, `i`, `c`, `k` ]
* Length 2 (bigram):    [ `qu`, `ui`, `ic`, `ck` ]
* Length 3 (trigram):   [ `qui`, `uic`, `ick` ]
* Length 4 (four-gram): [ `quic`, `uick` ]
* Length 5 (five-gram): [ `quick` ]
Plain n-grams are useful for matching _somewhere within a word_, a technique
that we will use in <>. However, for search-as-you-type,
we use a specialized form of n-grams called _edge n-grams_.((("edge n-grams"))) Edge
n-grams are anchored to the beginning of the word. Edge n-gramming the word
`quick` would result in this:

    q
    qu
    qui
    quic
    quick

You may notice that this conforms exactly to the letters that a user searching
for "quick" would type. In other words, these are the perfect terms to use for
instant search!
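Both flavors are straightforward to express directly. This Python sketch is an illustration of the definitions above, not Elasticsearch's tokenizer code, and reproduces the term lists for `quick`:

```python
def ngrams(word, n):
    """Plain n-grams: a window of length n sliding across the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word, min_gram=1, max_gram=20):
    """Edge n-grams: prefixes of the word, anchored to its start."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("quick", 3))    # trigrams: ['qui', 'uic', 'ick']
print(edge_ngrams("quick"))  # ['q', 'qu', 'qui', 'quic', 'quick']
```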
=== Index-Time Search-as-You-Type
The first step to setting up index-time search-as-you-type is to(((“search-as-you-type”, “index time”)))(((“partial matching”, “index time search-as-you-type”))) define our
analysis chain, which we discussed in <>, but we will
go over the steps again here.
==== Preparing the Index
The first step is to configure a ((("partial matching", "index time search-as-you-type", "preparing the index")))custom
`edge_ngram` token filter,((("edge_ngram token filter"))) which we
will call the `autocomplete_filter`:
[source,js]
--------------------------------------------------
{
    "filter": {
        "autocomplete_filter": {
            "type":     "edge_ngram",
            "min_gram": 1,
            "max_gram": 20
        }
    }
}
--------------------------------------------------

This configuration says that, for any term that this token filter receives,
it should produce an n-gram anchored to the start of the word of minimum
length 1 and maximum length 20.
Then we need to use this token filter in a custom analyzer,((("analyzers", "autocomplete custom analyzer"))) which we will call
the `autocomplete` analyzer:
[source,js]
--------------------------------------------------
{
    "analyzer": {
        "autocomplete": {
            "type":      "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "autocomplete_filter" <1>
            ]
        }
    }
}
--------------------------------------------------
<1> Our custom edge-ngram token filter

This analyzer will tokenize a string into individual terms by using the
`standard` tokenizer, lowercase each term, and then produce edge n-grams of each
term, thanks to our `autocomplete_filter`.
The full request to create the index and instantiate the token filter and
analyzer looks like this:
[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "number_of_shards": 1, <1>
        "analysis": {
            "filter": {
                "autocomplete_filter": { <2>
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" <3>
                    ]
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

<1> See <>.
<2> First we define our custom token filter.
<3> Then we use it in an analyzer.
You can test this new analyzer to make sure it is behaving correctly by using
the `analyze` API:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=autocomplete
quick brown
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

The results show us that the analyzer is working correctly. It returns these
terms:

    q
    qu
    qui
    quic
    quick
    b
    br
    bro
    brow
    brown
To use the analyzer, we need to apply it to a field, which we can do
with((("update-mapping API, applying custom autocomplete analyzer to a field"))) the `update-mapping` API:

[source,js]
--------------------------------------------------
PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "name": {
                "type":     "string",
                "analyzer": "autocomplete"
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

Now, we can index some test documents:

[source,js]
--------------------------------------------------
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Brown foxes" }
{ "index": { "_id": 2 }}
{ "name": "Yellow furballs" }
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

==== Querying the Field

If you test out a query for ``brown fo'' by using ((("partial matching", "index time search-as-you-type", "querying the field")))a simple `match` query
[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

you will see that both documents match, even though the `Yellow furballs`
doc contains neither `brown` nor `fo`:
[source,js]
--------------------------------------------------
{
    "hits": [
        {
            "_id": "1",
            "_score": 1.5753809,
            "_source": {
                "name": "Brown foxes"
            }
        },
        {
            "_id": "2",
            "_score": 0.012520773,
            "_source": {
                "name": "Yellow furballs"
            }
        }
    ]
}
--------------------------------------------------
As always, the `validate-query` API shines some light:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

The `explanation` shows us that the query is looking for edge n-grams of every
word in the query string:

    name:b name:br name:bro name:brow name:brown name:f name:fo
The `name:f` condition is satisfied by the second document because
`furballs` has been indexed as `f`, `fu`, `fur`, and so forth. In retrospect, this
is not surprising. The same `autocomplete` analyzer is being applied both at
index time and at search time, which in most situations is the right thing to
do. This is one of the few occasions when it makes sense to break this rule.

We want to ensure that our inverted index contains edge n-grams of every word,
but we want to match only the full words that the user has entered
(`brown` and `fo`).((("analyzers", "changing search analyzer from index analyzer"))) We can do this by using the
`autocomplete` analyzer at index time and the `standard` analyzer at search
time. One way to change the search analyzer is just to specify it in the query:
[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": {
                "query":    "brown fo",
                "analyzer": "standard" <1>
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

<1> This overrides the `analyzer` setting on the `name` field.
Alternatively, we can specify ((("search_analyzer parameter")))((("index_analyzer parameter")))the `index_analyzer` and `search_analyzer` in
the mapping for the `name` field itself. Because we want to change only the
`search_analyzer`, we can update the existing mapping without having to
reindex our data:
[source,js]
--------------------------------------------------
PUT /my_index/my_type/_mapping
{
    "my_type": {
        "properties": {
            "name": {
                "type":            "string",
                "index_analyzer":  "autocomplete", <1>
                "search_analyzer": "standard" <2>
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

<1> Use the `autocomplete` analyzer at index time to produce edge n-grams of
every term.
<2> Use the `standard` analyzer at search time to search only on the terms
that the user has entered.
If we were to repeat the `validate-query` request, it would now give us this
explanation:

    name:brown name:fo

Repeating our query correctly returns just the `Brown foxes` document.

Because most of the work has been done at index time, all this query needs to
do is to look up the two terms `brown` and `fo`, which is much more efficient
than the `match_phrase_prefix` approach of having to find all terms beginning
with `fo`.
.Completion Suggester
****
Using edge n-grams for search-as-you-type is easy to set up, flexible, and
fast. However, sometimes it is not fast enough. Latency matters, especially
when you are trying to provide instant feedback. Sometimes the fastest way of
searching is not to search at all.

The http://bit.ly/1IChV5j[completion suggester] in
Elasticsearch((("completion suggester"))) takes a completely different approach. You feed it a list
of all possible completions, and it builds them into a _finite state
transducer_, an((("Finite State Transducer"))) optimized data structure that resembles a big graph. To
search for suggestions, Elasticsearch starts at the beginning of the graph and
moves character by character along the matching path. Once it has run out of
user input, it looks at all possible endings of the current path to produce a
list of suggestions.

This data structure lives in memory and makes prefix lookups extremely fast,
much faster than any term-based query could be. It is an excellent match for
autocompletion of names and brands, whose words are usually organized in a
common order: ``Johnny Rotten'' rather than ``Rotten Johnny.''

When word order is less predictable, edge n-grams can be a better solution
than the completion suggester. This particular cat may be skinned in myriad
ways.
****
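To get a feel for walking a graph of completions character by character, here is a toy Python sketch using a plain trie. A real finite state transducer also shares suffixes and carries weights, so this illustrates only the lookup idea, not Elasticsearch's implementation:

```python
def build_trie(phrases):
    """Index each complete phrase, character by character, into a nested dict."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node["$end"] = True  # marks the end of a complete phrase
    return root

def complete(root, prefix):
    """Walk the graph along `prefix`, then collect every possible ending."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []  # the prefix matches no path in the graph
        node = node[ch]
    suggestions = []
    def walk(n, tail):
        if "$end" in n:
            suggestions.append(prefix + tail)
        for ch in sorted(k for k in n if k != "$end"):
            walk(n[ch], tail + ch)
    walk(node, "")
    return suggestions

trie = build_trie(["johnnie walker black", "johnnie walker blue", "jim beam"])
print(complete(trie, "johnnie walker bl"))
# → ['johnnie walker black', 'johnnie walker blue']
```

Because the lookup is a pointer chase through an in-memory structure rather than a terms-list scan, it stays fast no matter how many completions are indexed.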
==== Edge n-grams and Postcodes

The edge n-gram approach can((("postcodes (UK), partial matching with", "using edge n-grams")))((("edge n-grams", "and postcodes"))) also be used for structured data, such as the
postcodes example from <>.

[TIP]
==================================================
The `keyword` tokenizer is the no-operation tokenizer, the tokenizer that does
nothing. Whatever string it receives as input, it emits exactly the same
string as a single token. It can therefore be used for values that we would
normally treat as `not_analyzed` but that require some other analysis
transformation such as lowercasing.
==================================================

This example uses the `keyword` tokenizer to convert the postcode string into
a token stream, so that we can use the edge n-gram token filter:
[source,js]
--------------------------------------------------
{
    "analysis": {
        "filter": {
            "postcode_filter": {
                "type":     "edge_ngram",
                "min_gram": 1,
                "max_gram": 8
            }
        },
        "analyzer": {
            "postcode_index": { <1>
                "tokenizer": "keyword",
                "filter": [ "postcode_filter" ]
            },
            "postcode_search": { <2>
                "tokenizer": "keyword"
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Postcodes.json

<1> The `postcode_index` analyzer would use the `postcode_filter`
to turn postcodes into edge n-grams.
<2> The `postcode_search` analyzer would treat search terms as
if they were `not_analyzed`.
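To see what the `postcode_index` analyzer emits, here is a Python sketch of the keyword-tokenizer-plus-`edge_ngram` chain. The postcode `W1V 3DG` is simply an illustrative UK-format value, and this models the analysis chain rather than reproducing Elasticsearch's code:

```python
def postcode_index_terms(postcode, min_gram=1, max_gram=8):
    # The keyword tokenizer emits the whole input as a single token, and
    # postcode_filter then produces edge n-grams of it. No lowercase
    # filter appears in the analyzer, so the original case is preserved.
    return [postcode[:n] for n in range(min_gram, min(max_gram, len(postcode)) + 1)]

print(postcode_index_terms("W1V 3DG"))
# → ['W', 'W1', 'W1V', 'W1V ', 'W1V 3', 'W1V 3D', 'W1V 3DG']
```

Note that the space is part of the token: with the `keyword` tokenizer, the whole postcode, whitespace included, is one term.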
[[ngrams-compound-words]]
=== Ngrams for Compound Words
Finally, let’s take a look at how n-grams can be used to search languages with
compound words. (((“languages”, “using many compound words, indexing of”)))(((“n-grams”, “using with compound words”)))(((“partial matching”, “using n-grams for compound words”)))(((“German”, “compound words in”))) German is famous for combining several small words into one
massive compound word in order to capture precise or complex meanings. For
example:
Aussprachewörterbuch::
Pronunciation dictionary
Militärgeschichte::
Military history
Weißkopfseeadler::
White-headed sea eagle, or bald eagle
Weltgesundheitsorganisation::
World Health Organization
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz::
The law concerning the delegation of duties for the supervision of cattle
marking and the labeling of beef
Somebody searching for ``Wörterbuch'' (dictionary) would probably expect to
see ``Aussprachewörterbuch'' in the results list. Similarly, a search for
``Adler'' (eagle) should include ``Weißkopfseeadler.''
One approach to indexing languages like this is to break compound words into
their constituent parts using the http://bit.ly/1ygdjjC[compound word token filter].
However, the quality of the results depends on how good your compound-word
dictionary is.
Another approach is just to break all words into n-grams and to search for any
matching fragments. The more fragments that match, the more relevant the
document.
Given that an n-gram is a moving window on a word, an n-gram of any length
will cover all of the word. We want to choose a length that is long enough
to be meaningful, but not so long that we produce far too many unique terms.
A trigram (length 3) is (((“trigrams”)))probably a good starting point:
[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 3,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "trigrams_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "text": {
                    "type":     "string",
                    "analyzer": "trigrams" <1>
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

<1> The `text` field uses the `trigrams` analyzer to index its contents as
n-grams of length 3.
Testing the trigrams analyzer with the `analyze` API

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=trigrams
Weißkopfseeadler
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

returns these terms:

    wei, eiß, ißk, ßko, kop, opf, pfs, fse, see, eea, ead, adl, dle, ler
We can index our example compound words to test this approach:
[source,js]
--------------------------------------------------
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Aussprachewörterbuch" }
{ "index": { "_id": 2 }}
{ "text": "Militärgeschichte" }
{ "index": { "_id": 3 }}
{ "text": "Weißkopfseeadler" }
{ "index": { "_id": 4 }}
{ "text": "Weltgesundheitsorganisation" }
{ "index": { "_id": 5 }}
{ "text": "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" }
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

A search for ``Adler'' (eagle) becomes a query for the three terms `adl`,
`dle`, and `ler`:
[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": "Adler"
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

which correctly matches ``Weißkopfsee-__adler__'':
[source,js]
--------------------------------------------------
{
    "hits": [
        {
            "_id": "3",
            "_score": 3.3191128,
            "_source": {
                "text": "Weißkopfseeadler"
            }
        }
    ]
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

A similar query for ``Gesundheit'' (health) correctly matches
``Welt-__gesundheit__-sorganisation,'' but it also matches
``Militär-__ges__-chichte'' and
``Rindfleischetikettierungsüberwachungsaufgabenübertragungs-__ges__-etz,''
both of which also contain the trigram `ges`.
Judicious use of the `minimum_should_match` parameter can remove these
spurious results by requiring that a minimum number of trigrams must be
present for a document to be considered a match:
[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": {
                "query":                "Gesundheit",
                "minimum_should_match": "80%"
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json
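The arithmetic behind that 80% cutoff can be checked with a short Python sketch, assuming the standard behavior that a positive percentage is rounded down to a whole number of matching terms:

```python
import math

def trigrams(word):
    """Length-3 n-grams of the lowercased word, as the trigrams analyzer produces."""
    word = word.lower()
    return [word[i:i + 3] for i in range(len(word) - 2)]

def required_matches(num_terms, percentage):
    # A positive percentage is rounded down to a whole number of terms.
    return int(math.floor(num_terms * percentage / 100.0))

terms = trigrams("Gesundheit")           # 8 trigrams: ges, esu, sun, und, ...
print(required_matches(len(terms), 80))  # → 6: six of the eight trigrams must match
```

``Militär-__ges__-chichte'' shares only the single trigram `ges` with the query, one of eight, so at 80% it falls well short of the six required and is filtered out.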
This is a bit of a shotgun approach to full-text search and can result in a
large inverted index, but it is an effective generic way of indexing languages
that use many compound words or that don't use whitespace between words,
such as Thai.
This technique is used to increase recall—the number of relevant
documents that a search returns. It is usually used in combination with
other techniques, such as shingles (see <>) to improve precision and
the relevance score of each document.
https://github.com/uxff/elasticsearch-definitive-guide-cn