Elasticsearch Definitive Guide - Partial Matching
2018-03-01 21:59
[[partial-matching]]
== Partial Matching
A keen observer will notice that all the queries so far in this book have
operated on whole terms.(((“partial matching”))) To match something, the smallest unit had to be a
single term. You can find only terms that exist in the inverted index.
But what happens if you want to match parts of a term but not the whole thing?
Partial matching allows users to specify a portion of the term they are
looking for and find any words that contain that fragment.
The requirement to match on part of a term is less common in the full-text
search-engine world than you might think. If you have come from an SQL
background, you likely have, at some stage of your career,
implemented a poor man’s full-text search using SQL constructs like this:

[source,js]
    WHERE text LIKE "*quick*"
      AND text LIKE "*brown*"
      AND text LIKE "*fox*" <1>

<1> `*fox*` would match both ``fox'' and ``foxes.''

Of course, with Elasticsearch, we have the analysis process and the inverted
index that remove the need for such brute-force techniques. To handle the
case of matching both ``fox'' and ``foxes,'' we could simply use a stemmer to
index words in their root form. There is no need to match partial terms.
That said, on some occasions partial matching can be useful.
Common use (((“partial matching”, “common use cases”)))cases include the following:
Matching postal codes, product serial numbers, or other `not_analyzed` values
that start with a particular prefix or match a wildcard pattern
or even a regular expression
search-as-you-type—displaying the most likely results before the
user has finished typing the search terms
Matching in languages like German or Dutch, which contain long compound
words, like Weltgesundheitsorganisation (World Health Organization)
We will start by examining prefix matching on exact-value `not_analyzed`
fields.
=== Postcodes and Structured Data
We will use United Kingdom postcodes (postal codes in the United States) to illustrate how(((“partial matching”, “postcodes and structured data”))) to use partial matching with
structured data. UK postcodes have a well-defined structure. For instance, the
postcode `W1V 3DG` can((("postcodes (UK), partial matching with"))) be broken down as follows:

* `W1V`: This outer part identifies the postal area and district:
** `W` indicates the area (one or two letters)
** `1V` indicates the district (one or two numbers, possibly followed by a letter)
* `3DG`: This inner part identifies a street or building:
** `3` indicates the sector (one number)
** `DG` indicates the unit (two letters)
Let’s assume that we are indexing postcodes as exact-value `not_analyzed`
fields, so we could create our index as follows:

[source,js]
PUT /my_index
{
    "mappings": {
        "address": {
            "properties": {
                "postcode": {
                    "type":  "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}
And index some (((“indexing”, “postcodes”)))postcodes:

[source,js]
PUT /my_index/address/1
{ "postcode": "W1V 3DG" }

PUT /my_index/address/2
{ "postcode": "W2F 8HW" }

PUT /my_index/address/3
{ "postcode": "W1F 7HW" }

PUT /my_index/address/4
{ "postcode": "WC1N 1LZ" }

PUT /my_index/address/5
{ "postcode": "SW5 0BE" }

Now our data is ready to be queried.
[[prefix-query]]
=== prefix Query
To find all postcodes beginning with `W1`, we could use a
(((“prefix query”)))(((“postcodes (UK), partial matching with”, “prefix query”)))simple `prefix`
query:

[source,js]
GET /my_index/address/_search
{
    "query": {
        "prefix": {
            "postcode": "W1"
        }
    }
}

The `prefix` query is a low-level query that works at the term level. It
doesn’t analyze the query string before searching. It assumes that you have
passed it the exact prefix that you want to find.
[TIP]
==================================================
By default, the `prefix` query does no relevance scoring. It just finds
matching documents and gives them all a score of `1`. Really, it behaves more
like a filter than a query. The only practical difference between the
`prefix` query and the `prefix` filter is that the filter can be cached.
==================================================
Previously, we said that ``you can find only terms that exist in the inverted
index,'' but we haven’t done anything special to index these postcodes; each
postcode is simply indexed as the exact value specified in each document. So
how does the `prefix` query work?
[role="pagebreak-after"]
Remember that the inverted index consists(((“inverted index”, “for postcodes”))) of a sorted list of unique terms (in
this case, postcodes). For each term, it lists the IDs of the documents
containing that term in the postings list. The inverted index for our
example documents looks something like this:

    Term:          Doc IDs:
    -------------------------
    "SW5 0BE"    |  5
    "W1F 7HW"    |  3
    "W1V 3DG"    |  1
    "W2F 8HW"    |  2
    "WC1N 1LZ"   |  4
    -------------------------
To support prefix matching on the fly, the query does the following:
1. Skips through the terms list to find the first term beginning with `W1`.
2. Collects the associated document IDs.
3. Moves to the next term.
4. If that term also begins with `W1`, the query repeats from step 2; otherwise, we’re finished.
While this works fine for our small example, imagine that our inverted index
contains a million postcodes beginning with `W1`. The prefix query
would need to visit all one million terms in order to calculate the result!
And the shorter the prefix, the more terms need to be visited. If we were to
look for the prefix `W` instead of `W1`, perhaps we would match 10 million
terms instead of just one million.
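The stepwise scan described above can be sketched in Python. This is a toy model, not Elasticsearch internals: the in-memory `index` dictionary and the `prefix_search` helper are illustrative names standing in for the sorted terms list and postings of the inverted index.

```python
import bisect

# Toy inverted index mirroring the postcode example above:
# sorted unique terms mapped to their postings (doc IDs).
index = {
    "SW5 0BE":  [5],
    "W1F 7HW":  [3],
    "W1V 3DG":  [1],
    "W2F 8HW":  [2],
    "WC1N 1LZ": [4],
}
terms = sorted(index)

def prefix_search(prefix):
    """Walk the sorted terms list the way the prefix query does:
    skip to the first term >= prefix, then collect doc IDs until a
    term no longer starts with the prefix."""
    docs = []
    i = bisect.bisect_left(terms, prefix)   # skip ahead to the first candidate
    while i < len(terms) and terms[i].startswith(prefix):
        docs.extend(index[terms[i]])        # collect the associated doc IDs
        i += 1                              # move to the next term
    return sorted(docs)

print(prefix_search("W1"))  # [1, 3]: the docs containing W1V 3DG and W1F 7HW
```

Note that the loop visits every matching term, which is exactly why a short prefix over millions of terms is expensive.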
CAUTION: The `prefix` query or filter are useful for ad hoc prefix matching, but
should be used with care. (((“prefix query”, “caution with”))) They can be used freely on fields with a small
number of terms, but they scale poorly and can put your cluster under a lot of
strain. Try to limit their impact on your cluster by using a long prefix;
this reduces the number of terms that need to be visited.
Later in this chapter, we present an alternative index-time solution that
makes prefix matching much more efficient. But first, we’ll take a look at
two related queries: the `wildcard` and `regexp` queries.

=== wildcard and regexp Queries
The `wildcard` query is a low-level, term-based query (((“wildcard query”)))(((“partial matching”, “wildcard and regexp queries”)))similar in nature to the
`prefix` query, but it allows you to specify a pattern instead of just a prefix.
It uses the standard shell wildcards: `?` matches any character, and `*`
matches zero or more characters.(((“postcodes (UK), partial matching with”, “wildcard queries”)))

This query would match the documents containing `W1F 7HW` and `W2F 8HW`:
[source,js]
GET /my_index/address/_search
{
    "query": {
        "wildcard": {
            "postcode": "W?F*HW" <1>
        }
    }
}

<1> The `?` matches the `1` and the `2`, while the `*` matches the space
and the `7` and `8`.
Imagine now that you want to match all postcodes just in the `W` area. A
prefix match would also include postcodes starting with `WC`, and you would
have a similar problem with a wildcard match. We want to match only postcodes
that begin with a `W`, followed by a number.(((“postcodes (UK), partial matching with”, “regexp query”)))(((“regexp query”))) The `regexp` query allows you to
write these more complicated patterns:
[source,js]
GET /my_index/address/_search
{
    "query": {
        "regexp": {
            "postcode": "W[0-9].+" <1>
        }
    }
}

<1> The regular expression says that the term must begin with a `W`, followed
by any number from 0 to 9, followed by one or more other characters.
The `wildcard` and `regexp` queries work in exactly the same way as the
`prefix` query. They also have to scan the list of terms in the inverted
index to find all matching terms, and gather document IDs term by term. The
only difference between them and the `prefix` query is that they support
more-complex patterns.

This means that the same caveats apply. Running these queries on a field with
many unique terms can be resource intensive indeed. Avoid using a
pattern that starts with a wildcard (for example, `*foo` or, as a regexp, `.*foo`).
Whereas prefix matching can be made more efficient by preparing your data at
index time, wildcard and regular expression matching can be done only
at query time. These queries have their place but should be used sparingly.
[CAUTION]
=================================================
The `prefix`, `wildcard`, and `regexp` queries operate on terms. If you use
them to query an `analyzed` field, they will examine each term in the
field, not the field as a whole.(((“prefix query”, “on analyzed fields”)))(((“wildcard query”, “on analyzed fields”)))(((“regexp query”, “on analyzed fields”)))(((“analyzed fields”, “prefix, wildcard, and regexp queries on”)))

For instance, let’s say that our `title` field contains ``Quick brown fox,''
which produces the terms `quick`, `brown`, and `fox`.

This query would match:

[source,json]
{ "regexp": { "title": "br.*" }}

But neither of these queries would match:

[source,json]
{ "regexp": { "title": "Qu.*" }} <1>
{ "regexp": { "title": "quick br*" }} <2>

<1> The term in the index is `quick`, not `Quick`.
<2> `quick` and `brown` are separate terms.
=================================================
=== Query-Time Search-as-You-Type
Leaving postcodes behind, let’s take a look at how prefix matching can help
with full-text queries. (((“partial matching”, “query time search-as-you-type”))) Users have become accustomed to seeing search results
before they have finished typing their query–so-called instant search, or
search-as-you-type. (((“search-as-you-type”)))(((“instant search”))) Not only do users receive their search results in less
time, but we can guide them toward results that actually exist in our index.
For instance, if a user types in `johnnie walker bl`, we would like to show
results for Johnnie Walker Black Label and Johnnie Walker Blue
Label before they can finish typing their query.
As always, there are more ways than one to skin a cat! We will start by
looking at the way that is simplest to implement. You don’t need to prepare your
data in any way; you can implement search-as-you-type at query time on any
full-text field.
In <>, we introduced the `match_phrase` query, which matches
all the specified words in the same positions relative to each other. For
query-time search-as-you-type, we can use a specialization of this query,
called (((“prefix query”, “match_phrase_prefix query”)))(((“match_phrase_prefix query”)))the
`match_phrase_prefix` query:

[source,js]
{
    "match_phrase_prefix" : {
        "brand" : "johnnie walker bl"
    }
}
This query behaves in the same way as the `match_phrase` query, except that it
treats the last word in the query string as a prefix. In other words, the
preceding example would look for the following:

* `johnnie`
* Followed by `walker`
* Followed by words beginning with `bl`

If you were to run this query through the `validate-query` API, it would
produce this explanation:

    "johnnie walker bl*"
Like the `match_phrase` query, it accepts a `slop` parameter (see <>) to
make the word order and relative positions (((“slop parameter”, “match_phrase_prefix query”)))(((“match_phrase_prefix query”, “slop parameter”)))somewhat less rigid:
[source,js]
{
    "match_phrase_prefix" : {
        "brand" : {
            "query": "walker johnnie bl", <1>
            "slop":  10
        }
    }
}
<1> Even though the words are in the wrong order, the query still matches
because we have set a high enough `slop` value to allow some flexibility
in word positions.
However, it is always only the last word in the query string that is treated
as a prefix.
Earlier, in <>, we warned about the perils of the prefix–how
`prefix` queries can be resource intensive. The same is true in this
case.(((“match_phrase_prefix query”, “caution with”))) A prefix of `a` could match hundreds of thousands of terms. Not only
would matching on this many terms be resource intensive, but it would also not be
useful to the user.
We can limit the impact (((“match_phrase_prefix query”, “max_expansions”)))(((“max_expansions parameter”)))of the prefix expansion by setting `max_expansions` to
a reasonable number, such as 50:
[source,js]
{
    "match_phrase_prefix" : {
        "brand" : {
            "query":          "johnnie walker bl",
            "max_expansions": 50
        }
    }
}
The `max_expansions` parameter controls how many terms the prefix is allowed
to match. It will find the first term starting with `bl` and keep collecting
terms (in alphabetical order) until it either runs out of terms with prefix
`bl`, or it has more terms than `max_expansions`.
Don’t forget that we have to run this query every time the user types another
character, so it needs to be fast. If the first set of results isn’t what users are after, they’ll keep typing until they get the results that they want.
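The expansion step can be sketched as follows; a minimal model in which `expand_prefix` and the `brands` term list are illustrative, not actual Elasticsearch code:

```python
import bisect

def expand_prefix(terms, prefix, max_expansions=50):
    """Collect terms starting with `prefix`, in alphabetical order,
    stopping as soon as max_expansions terms have been gathered."""
    out = []
    i = bisect.bisect_left(terms, prefix)   # jump to the first candidate term
    while (i < len(terms)
           and terms[i].startswith(prefix)
           and len(out) < max_expansions):  # the cap bounds the work done
        out.append(terms[i])
        i += 1
    return out

brands = sorted(["black", "blended", "blue", "bourbon", "brandy"])
print(expand_prefix(brands, "bl", max_expansions=2))  # ['black', 'blended']
```

The cap trades completeness for speed: terms sorting after the first `max_expansions` matches are simply never considered.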
=== Index-Time Optimizations
All of the solutions we’ve talked about so far are implemented at
query time. (((“index time optimizations”)))(((“partial matching”, “index time optimizations”)))They don’t require any special mappings or indexing patterns;
they simply work with the data that you’ve already indexed.
The flexibility of query-time operations comes at a cost: search performance.
Sometimes it may make sense to move the cost away from the query. In a real-
time web application, an additional 100ms may be too much latency to tolerate.
By preparing your data at index time, you can make your searches more flexible
and improve performance. You still pay a price: increased index size and
slightly slower indexing throughput, but it is a price you pay once at index
time, instead of paying it on every query.
Your users will thank you.
=== Ngrams for Partial Matching
As we have said before, ``You can find only terms that exist in the inverted
index.'' Although the `prefix`, `wildcard`, and `regexp` queries demonstrated that
that is not strictly true, it is true that doing a single-term lookup is
much faster than iterating through the terms list to find matching terms on
the fly.(((“partial matching”, “index time optimizations”, “n-grams”))) Preparing your data for partial matching ahead of time will increase
your search performance.
Preparing your data at index time means choosing the right analysis chain, and
the tool that we use for partial matching is the n-gram.(((“n-grams”))) An n-gram can be
best thought of as a moving window on a word. The n stands for a length.
If we were to n-gram the word `quick`, the results would depend on the length
we have chosen:

[horizontal]
* Length 1 (unigram): [ `q`, `u`, `i`, `c`, `k` ]
* Length 2 (bigram): [ `qu`, `ui`, `ic`, `ck` ]
* Length 3 (trigram): [ `qui`, `uic`, `ick` ]
* Length 4 (four-gram): [ `quic`, `uick` ]
* Length 5 (five-gram): [ `quick` ]
Plain n-grams are useful for matching somewhere within a word, a technique
that we will use in <>. However, for search-as-you-type,
we use a specialized form of n-grams called edge n-grams. (((“edge n-grams”))) Edge
n-grams are anchored to the beginning of the word. Edge n-gramming the word
`quick` would result in these terms:

* `q`
* `qu`
* `qui`
* `quic`
* `quick`

You may notice that this conforms exactly to the letters that a user searching for “quick” would type. In other words, these are the
perfect terms to use for instant search!
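Both variants are easy to reproduce outside of Elasticsearch. This short sketch (the helper names are our own) shows the moving window versus the anchored window:

```python
def ngrams(word, n):
    """All substrings of length n: a moving window over the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word, min_gram=1, max_gram=20):
    """N-grams anchored to the start of the word, mimicking what an
    edge_ngram token filter with these settings would emit."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("quick", 3))    # ['qui', 'uic', 'ick']
print(edge_ngrams("quick"))  # ['q', 'qu', 'qui', 'quic', 'quick']
```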
=== Index-Time Search-as-You-Type
The first step to setting up index-time search-as-you-type is to(((“search-as-you-type”, “index time”)))(((“partial matching”, “index time search-as-you-type”))) define our
analysis chain, which we discussed in <>, but we will
go over the steps again here.
==== Preparing the Index
The first step is to configure a (((“partial matching”, “index time search-as-you-type”, “preparing the index”)))custom `edge_ngram` token filter, which we
will call the `autocomplete_filter`:
[source,js]
{
    "filter": {
        "autocomplete_filter": {
            "type":     "edge_ngram",
            "min_gram": 1,
            "max_gram": 20
        }
    }
}
This configuration says that, for any term that this token filter receives,
it should produce an n-gram anchored to the start of the word of minimum
length 1 and maximum length 20.
Then we need to use this token filter in a custom analyzer,(((“analyzers”, “autocomplete custom analyzer”))) which we will call
the `autocomplete` analyzer:
[source,js]
{
    "analyzer": {
        "autocomplete": {
            "type":      "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "autocomplete_filter" <1>
            ]
        }
    }
}
This analyzer will tokenize a string into individual terms by using the
`standard` tokenizer, lowercase each term, and then produce edge n-grams of each
term, thanks to our `autocomplete_filter`.
The full request to create the index and instantiate the token filter and
analyzer looks like this:
[source,js]
PUT /my_index
{
    "settings": {
        "number_of_shards": 1, <1>
        "analysis": {
            "filter": {
                "autocomplete_filter": { <2>
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" <3>
                    ]
                }
            }
        }
    }
}
<1> See <>.
<2> First we define our custom token filter.
<3> Then we use it in an analyzer.
You can test this new analyzer to make sure it is behaving correctly by using
the `analyze` API:

[source,js]
GET /my_index/_analyze?analyzer=autocomplete
quick brown

The results show us that the analyzer is working correctly. It returns these
terms:

    q
    qu
    qui
    quic
    quick
    b
    br
    bro
    brow
    brown
To use the analyzer, we need to apply it to a field, which we can do
with(((“update-mapping API, applying custom autocomplete analyzer to a field”))) the `update-mapping` API:
[source,js]
PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "name": {
                "type":     "string",
                "analyzer": "autocomplete"
            }
        }
    }
}
Now, we can index some test documents:

[source,js]
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Brown foxes" }
{ "index": { "_id": 2 }}
{ "name": "Yellow furballs" }
==== Querying the Field
If you test out a query for ``brown fo'':
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}
you will see that both documents match, even though the `Yellow furballs`
doc contains neither `brown` nor `fo`:
"hits": [
    {
        "_id": "1",
        "_score": 1.5753809,
        "_source": {
            "name": "Brown foxes"
        }
    },
    {
        "_id": "2",
        "_score": 0.012520773,
        "_source": {
            "name": "Yellow furballs"
        }
    }
]
To understand why, run the query through the `validate-query` API:

[source,js]
GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}

The `explanation` shows us that the query is looking for edge n-grams of every
word in the query string:

    name:b name:br name:bro name:brow name:brown name:f name:fo

The `name:f` condition is satisfied by the second document because `furballs`
has been indexed as `f`, `fu`, `fur`, and so forth. In retrospect, this
is not surprising. The same `autocomplete` analyzer is applied at
index time and at search time, which in most situations is the right thing to
do. This is one of the few occasions when it makes sense to break this rule.
We want to ensure that our inverted index contains edge n-grams of every word,
but we want to match only the full words that the user has entered (`brown` and
`fo`). We can do this by using the `autocomplete` analyzer at
index time and the `standard` analyzer at search time. One way to change the
search analyzer is just to specify it in the query:
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": {
                "query":    "brown fo",
                "analyzer": "standard" <1>
            }
        }
    }
}
<1> This overrides the `analyzer` setting in the mapping.
Alternatively, we can specify (((“search_analyzer parameter”)))(((“index_analyzer parameter”)))the `index_analyzer` and `search_analyzer` in
the mapping for the `name` field itself. Because we want to change only the
`search_analyzer`, we can update the existing mapping without having to
reindex our data:
[source,js]
PUT /my_index/my_type/_mapping
{
    "my_type": {
        "properties": {
            "name": {
                "type":            "string",
                "index_analyzer":  "autocomplete", <1>
                "search_analyzer": "standard" <2>
            }
        }
    }
}
<1> Use the `autocomplete` analyzer at index time to produce edge n-grams of
every term.
<2> Use the `standard` analyzer at search time to search only on the terms
that the user has entered.
If we were to repeat the `validate-query` request, it would now give us this
explanation:

    name:brown name:fo
Repeating our query correctly returns just the `Brown foxes`
document.
Because most of the work has been done at index time, all this query needs to
do is to look up the two terms `brown` and `fo`, which is much more efficient
than the `match_phrase_prefix` approach of having to list all terms beginning
with `fo`.
.Completion Suggester
Using edge n-grams for search-as-you-type is easy to set up, flexible, and
fast. However, sometimes it is not fast enough. Latency matters, especially
when you are trying to provide instant feedback. Sometimes the fastest way of
searching is not to search at all.
The http://bit.ly/1IChV5j[completion suggester] in
Elasticsearch(((“completion suggester”))) takes a completely different approach. You feed it a list
of all possible completions, and it builds them into a _finite state
transducer_, an(((“Finite State Transducer”))) optimized data structure that resembles a big graph. To
search for suggestions, Elasticsearch starts at the beginning of the graph and
moves character by character along the matching path. Once it has run out of
user input, it looks at all possible endings of the current path to produce a
list of suggestions.
This data structure lives in memory and makes prefix lookups extremely fast,
much faster than any term-based query could be. It is an excellent match for
autocompletion of names and brands, whose words are usually organized in a
common order: ``Johnny Rotten'' rather than ``Rotten Johnny.''
When word order is less predictable, edge n-grams can be a better solution
than the completion suggester. This particular cat may be skinned in myriad
ways.
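The prefix walk over a finite state transducer can be approximated with a much simpler structure. In this toy sketch (our own `completions` list and `suggest` helper, far less compact than a real FST), a sorted list of full completions gives the same jump-then-read-off behaviour:

```python
import bisect

# Stand-in for the completion suggester's precomputed completions.
completions = sorted([
    "johnnie walker black label",
    "johnnie walker blue label",
    "johnnie walker gold label",
])

def suggest(prefix, limit=5):
    """Jump straight to the first completion >= prefix, then read off
    every stored completion that extends it -- no term-by-term search."""
    i = bisect.bisect_left(completions, prefix)
    out = []
    while (i < len(completions)
           and completions[i].startswith(prefix)
           and len(out) < limit):
        out.append(completions[i])
        i += 1
    return out

print(suggest("johnnie walker bl"))
# ['johnnie walker black label', 'johnnie walker blue label']
```

As with the real suggester, the whole structure lives in memory, which is why lookups cost almost nothing compared to a term-based query.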
==== Edge n-grams and Postcodes
The edge n-gram approach can(((“postcodes (UK), partial matching with”, “using edge n-grams”)))(((“edge n-grams”, “and postcodes”))) also be used for structured data, such as the
postcodes example from <>. Of course, the `postcode` field would need to be
`analyzed` instead of `not_analyzed`, but you can use the `keyword` tokenizer
to treat the postcodes as if they were `not_analyzed`.

[TIP]
==================================================
The `keyword` tokenizer is the no-operation tokenizer of the analyzer world. It does
nothing. Whatever string it receives as input, it emits exactly the same
string as a single token. It can therefore be used for values that we would
normally treat as `not_analyzed` fields but that require some in-between
transformation, such as lowercasing.
==================================================
This example uses the `keyword` tokenizer to convert the postcode string into a
token stream, so that we can use the edge n-gram token filter:

[source,js]
{
    "analysis": {
        "filter": {
            "postcode_filter": {
                "type":     "edge_ngram",
                "min_gram": 1,
                "max_gram": 8
            }
        },
        "analyzer": {
            "postcode_index": { <1>
                "tokenizer": "keyword",
                "filter":    [ "postcode_filter" ]
            },
            "postcode_search": { <2>
                "tokenizer": "keyword"
            }
        }
    }
}
<1> The `postcode_index` analyzer would use the `postcode_filter`
to turn postcodes into edge n-grams.
<2> The `postcode_search` analyzer would treat search terms as
if they were `not_analyzed`.
[[ngrams-compound-words]]
=== Ngrams for Compound Words
Finally, let’s take a look at how n-grams can be used to search languages with
compound words. (((“languages”, “using many compound words, indexing of”)))(((“n-grams”, “using with compound words”)))(((“partial matching”, “using n-grams for compound words”)))(((“German”, “compound words in”))) German is famous for combining several small words into one
massive compound word in order to capture precise or complex meanings. For
example:
Aussprachewörterbuch::
Pronunciation dictionary
Militärgeschichte::
Military history
Weißkopfseeadler::
White-headed sea eagle, or bald eagle
Weltgesundheitsorganisation::
World Health Organization
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz::
The law concerning the delegation of duties for the supervision of cattle
marking and the labeling of beef
Somebody searching for ``Wörterbuch'' (dictionary) would expect to see
``Aussprachewörterbuch'' in the list of results. Similarly, a search for
``Adler'' (eagle) should include ``Weißkopfseeadler.''
One approach to indexing languages like this is to break compound words into
their constituent parts using the http://bit.ly/1ygdjjC[compound word token filter].
However, the quality of the results depends on how good your compound-word
dictionary is.
Another approach is just to break all words into n-grams and to search for any
matching fragments–the more fragments that match, the more relevant the
document.
Given that an n-gram is a moving window on a word, an n-gram of any length
will cover all of the word. We want to choose a length that is long enough
to be meaningful, but not so long that we produce far too many unique terms.
A trigram (length 3) is (((“trigrams”)))probably a good starting point:
[source,js]
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 3,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "trigrams_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "text": {
                    "type":     "string",
                    "analyzer": "trigrams" <1>
                }
            }
        }
    }
}
<1> The `text` field uses the `trigrams` analyzer to index its contents as
n-grams of length 3.
Testing the trigrams analyzer with the `analyze` API:

[source,js]
GET /my_index/_analyze?analyzer=trigrams
Weißkopfseeadler

returns these terms:

    wei, eiß, ißk, ßko, kop, opf, pfs, fse, see, eea, ead, adl, dle, ler
We can index our example compound words to test this approach:

[source,js]
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Aussprachewörterbuch" }
{ "index": { "_id": 2 }}
{ "text": "Militärgeschichte" }
{ "index": { "_id": 3 }}
{ "text": "Weißkopfseeadler" }
{ "index": { "_id": 4 }}
{ "text": "Weltgesundheitsorganisation" }
{ "index": { "_id": 5 }}
{ "text": "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" }
A search for ``Adler'' (eagle) becomes a query for the three terms `adl`, `dle`,
and `ler`:
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": "Adler"
        }
    }
}
which correctly matches “Weißkopfsee-adler”:
"hits": [
    {
        "_id": "3",
        "_score": 3.3191128,
        "_source": {
            "text": "Weißkopfseeadler"
        }
    }
]
A similar query for ``Gesundheit'' (health) correctly matches
``Welt-gesundheits-organisation,'' but it also matches ``Militär-ges-chichte''
and ``Rindfleischetikettierungsüberwachungsaufgabenübertragungs-ges-etz,''
both of which also contain the trigram `ges`.
Judicious use of the `minimum_should_match` parameter can remove these
spurious results by requiring that a minimum number of trigrams must be
present for a document to be considered a match:
[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": {
                "query":                "Gesundheit",
                "minimum_should_match": "80%"
            }
        }
    }
}
This is a bit of a shotgun approach to full-text search and can result in a
large inverted index, but it is an effective generic way of indexing languages
that use many compound words or that don’t use whitespace between words,
such as Thai.
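The trigram-overlap idea behind `minimum_should_match` is easy to model. This is a rough sketch of the counting logic only (the `trigrams` and `matches` helpers are ours, and real Elasticsearch scoring is more involved): a query becomes a bag of trigrams, and a document qualifies when it contains at least the required fraction of them.

```python
def trigrams(text):
    """Lowercase the text and emit every substring of length 3."""
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]

def matches(doc, query_grams, minimum_should_match=0.8):
    """A doc matches when it contains at least the given fraction
    of the query's trigrams."""
    doc_grams = set(trigrams(doc))
    hits = sum(1 for g in query_grams if g in doc_grams)
    return hits / len(query_grams) >= minimum_should_match

# 'Gesundheit' yields 8 trigrams: ges, esu, sun, und, ndh, dhe, hei, eit.
query = trigrams("Gesundheit")

print(matches("Weltgesundheitsorganisation", query))  # True: all 8 present
print(matches("Militärgeschichte", query))            # False: only 'ges' matches
```

With the threshold at 80%, a document sharing only the single trigram `ges` is filtered out, while one containing the whole query word survives.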
This technique is used to increase recall—the number of relevant
documents that a search returns. It is usually used in combination with
other techniques, such as shingles (see <>) to improve precision and
the relevance score of each document.
https://github.com/uxff/elasticsearch-definitive-guide-cn
== Partial Matching
A keen observer will notice that all the queries so far in this book have
operated on whole terms.(((“partial matching”))) To match something, the smallest unit had to be a
single term. You can find only terms that exist in the inverted index.
But what happens if you want to match parts of a term but not the whole thing?
Partial matching allows users to specify a portion of the term they are
looking for and find any words that contain that fragment.
The requirement to match on part of a term is less common in the full-text
search-engine world than you might think. If you have come from an SQL
background, you likely have, at some stage of your career,
implemented a poor man’s full-text search using SQL constructs like this:
[source,js]
WHERE text LIKE "*quick*" AND text LIKE "*brown*"
AND text LIKE “fox” <1>
<1>*fox*would match
fox'' andfoxes.”
Of course, with Elasticsearch, we have the analysis process and the inverted
index that remove the need for such brute-force techniques. To handle the
case of matching both
fox'' andfoxes,” we could simply use a stemmer to
index words in their root form. There is no need to match partial terms.
That said, on some occasions partial matching can be useful.
Common use (((“partial matching”, “common use cases”)))cases include the following:
Matching postal codes, product serial numbers, or other
not_analyzedvalues
that start with a particular prefix or match a wildcard pattern
or even a regular expression
search-as-you-type—displaying the most likely results before the
user has finished typing the search terms
Matching in languages like German or Dutch, which contain long compound
words, like Weltgesundheitsorganisation (World Health Organization)
We will start by examining prefix matching on exact-value
not_analyzed
fields.
=== Postcodes and Structured Data
We will use United Kingdom postcodes (postal codes in the United States) to illustrate how(((“partial matching”, “postcodes and structured data”))) to use partial matching with
structured data. UK postcodes have a well-defined structure. For instance, the
postcode
W1V 3DGcan(((“postcodes (UK), partial matching with”))) be broken down as follows:
W1V: This outer part identifies the postal area and district:
**
Windicates the area (one or two letters)
**
1Vindicates the district (one or two numbers, possibly followed by a letter
3DG: This inner part identifies a street or building:
**
3indicates the sector (one number)
**
DGindicates the unit (two letters)
Let’s assume that we are indexing postcodes as exact-value
not_analyzed
fields, so we could create our index as follows:
[source,js]
PUT /my_index{
“mappings”: {
“address”: {
“properties”: {
“postcode”: {
“type”: “string”,
“index”: “not_analyzed”
}
}
}
}
}
// SENSE: 130_Partial_Matching/10_Prefix_query.jsonAnd index some (((“indexing”, “postcodes”)))postcodes:
[source,js]
PUT /my_index/address/1{ “postcode”: “W1V 3DG” }
PUT /my_index/address/2
{ “postcode”: “W2F 8HW” }
PUT /my_index/address/3
{ “postcode”: “W1F 7HW” }
PUT /my_index/address/4
{ “postcode”: “WC1N 1LZ” }
PUT /my_index/address/5
{ “postcode”: “SW5 0BE” }
// SENSE: 130_Partial_Matching/10_Prefix_query.jsonNow our data is ready to be queried.
[[prefix-query]]
=== prefix Query
To find all postcodes beginning with
W1, we could use a (((“prefix query”)))(((“postcodes (UK), partial matching with”, “prefix query”)))simple
prefix
query:
[source,js]
GET /my_index/address/_search{
“query”: {
“prefix”: {
“postcode”: “W1”
}
}
}
// SENSE: 130_Partial_Matching/10_Prefix_query.jsonThe
prefixquery is a low-level query that works at the term level. It
doesn’t analyze the query string before searching. It assumes that you have
passed it the exact prefix that you want to find.
[TIP]
By default, theprefixquery does no relevance scoring. It just finds
matching documents and gives them all a score of
1. Really, it behaves more
like a filter than a query. The only practical difference between the
prefixquery and the
prefixfilter is that the filter can be cached.
==================================================
Previously, we said that
`you can find only terms that exist in the inverted index,'' but we haven't done anything special to index these postcodes; each postcode is simply indexed as the exact value specified in each document. So how does theprefix` query work?
[role=”pagebreak-after”]
Remember that the inverted index consists(((“inverted index”, “for postcodes”))) of a sorted list of unique terms (in
this case, postcodes). For each term, it lists the IDs of the documents
containing that term in the postings list. The inverted index for our
example documents looks something like this:
Term: Doc IDs: ------------------------- "SW5 0BE" | 5 "W1F 7HW" | 3 "W1V 3DG" | 1 "W2F 8HW" | 2 "WC1N 1LZ" | 4 -------------------------
To support prefix matching on the fly, the query does the following:
Skips through the terms list to find the first term beginning with
W1.
Collects the associated document IDs.
Moves to the next term.
If that term also begins with
W1, the query repeats from step 2; otherwise, we’re finished.
While this works fine for our small example, imagine that our inverted index
contains a million postcodes beginning with
W1. The prefix query
would need to visit all one million terms in order to calculate the result!
And the shorter the prefix, the more terms need to be visited. If we were to
look for the prefix
Winstead of
W1, perhaps we would match 10 million
terms instead of just one million.
CAUTION: The
prefixquery or filter are useful for ad hoc prefix matching, but
should be used with care. (((“prefix query”, “caution with”))) They can be used freely on fields with a small
number of terms, but they scale poorly and can put your cluster under a lot of
strain. Try to limit their impact on your cluster by using a long prefix;
this reduces the number of terms that need to be visited.
Later in this chapter, we present an alternative index-time solution that
makes prefix matching much more efficient. But first, we’ll take a look at
two related queries: the
wildcardand
regexpqueries.
=== wildcard and regexp Queries
The
wildcardquery is a low-level, term-based query (((“wildcard query”)))(((“partial matching”, “wildcard and regexp queries”)))similar in nature to the
prefixquery, but it allows you to specify a pattern instead of just a prefix.
It uses the standard shell wildcards:
?matches any character, and
*
matches zero or more characters.(((“postcodes (UK), partial matching with”, “wildcard queries”)))
This query would match the documents containing
W1F 7HWand
W2F 8HW:
[source,js]
GET /my_index/address/_search{
“query”: {
“wildcard”: {
“postcode”: “W?F*HW” <1>
}
}
}
// SENSE: 130_Partial_Matching/15_Wildcard_regexp.json<1> The
?matches the
1and the
2, while the
*matches the space
and the
7and
8.
Imagine now that you want to match all postcodes just in the
Warea. A
prefix match would also include postcodes starting with
WC, and you would
have a similar problem with a wildcard match. We want to match only postcodes
that begin with a
W, followed by a number.(((“postcodes (UK), partial matching with”, “regexp query”)))(((“regexp query”))) The
regexpquery allows you to
write these more complicated patterns:
[source,js]
GET /my_index/address/_search{
“query”: {
“regexp”: {
“postcode”: “W[0-9].+” <1>
}
}
}
// SENSE: 130_Partial_Matching/15_Wildcard_regexp.json<1> The regular expression says that the term must begin with a
W, followed
by any number from 0 to 9, followed by one or more other characters.
The
wildcardand
regexpqueries work in exactly the same way as the
prefixquery. They also have to scan the list of terms in the inverted
index to find all matching terms, and gather document IDs term by term. The
only difference between them and the
prefixquery is that they support more-complex patterns.
This means that the same caveats apply. Running these queries on a field with
many unique terms can be resource intensive indeed. Avoid using a
pattern that starts with a wildcard (for example,
*fooor, as a regexp,
.*foo).
Whereas prefix matching can be made more efficient by preparing your data at
index time, wildcard and regular expression matching can be done only
at query time. These queries have their place but should be used sparingly.
[CAUTION]
Theprefix,
wildcard, and
regexpqueries operate on terms. If you use
them to query an
analyzedfield, they will examine each term in the
field, not the field as a whole.(((“prefix query”, “on analyzed fields”)))(((“wildcard query”, “on analyzed fields”)))(((“regexp query”, “on analyzed fields”)))(((“analyzed fields”, “prefix, wildcard, and regexp queries on”)))
For instance, let’s say that our
titlefield contains
`Quick brown fox'' which produces the termsquick
,brown
, andfox`.
This query would match:
[source,json]
{ “regexp”: { “title”: “br.*” }}
But neither of these queries would match:[source,json]
{ “regexp”: { “title”: “Qu.*” }} <1>{ “regexp”: { “title”: “quick br*” }} <2>
<1> The term in the index isquick, not
Quick.
<2>
quickand
brownare separate terms.
=================================================
=== Query-Time Search-as-You-Type
Leaving postcodes behind, let’s take a look at how prefix matching can help
with full-text queries. (((“partial matching”, “query time search-as-you-type”))) Users have become accustomed to seeing search results
before they have finished typing their query–so-called instant search, or
search-as-you-type. (((“search-as-you-type”)))(((“instant search”))) Not only do users receive their search results in less
time, but we can guide them toward results that actually exist in our index.
For instance, if a user types in
johnnie walker bl, we would like to show results for Johnnie Walker Black Label and Johnnie Walker Blue
Label before they can finish typing their query.
As always, there are more ways than one to skin a cat! We will start by
looking at the way that is simplest to implement. You don’t need to prepare your
data in any way; you can implement search-as-you-type at query time on any
full-text field.
In <>, we introduced the
match_phrasequery, which matches
all the specified words in the same positions relative to each other. For-query time search-as-you-type, we can use a specialization of this query,
called (((“prefix query”, “match_phrase_prefix query”)))(((“match_phrase_prefix query”)))the
match_phrase_prefixquery:
[source,js]
{“match_phrase_prefix” : {
“brand” : “johnnie walker bl”
}
}
// SENSE: 130_Partial_Matching/20_Match_phrase_prefix.jsonThis query behaves in the same way as the
match_phrasequery, except that it
treats the last word in the query string as a prefix. In other words, the
preceding example would look for the following:
johnnie
Followed by
walker
Followed by words beginning with
bl
If you were to run this query through the
validate-queryAPI, it would
produce this explanation:
"johnnie walker bl*"
Like the
match_phrasequery, it accepts a
slopparameter (see <>) to
make the word order and relative positions (((“slop parameter”, “match_prhase_prefix query”)))(((“match_phrase_prefix query”, “slop parameter”)))somewhat less rigid:
[source,js]
{“match_phrase_prefix” : {
“brand” : {
“query”: “walker johnnie bl”, <1>
“slop”: 10
}
}
}
// SENSE: 130_Partial_Matching/20_Match_phrase_prefix.json<1> Even though the words are in the wrong order, the query still matches
because we have set a high enough
slopvalue to allow some flexibility
in word positions.
However, it is always only the last word in the query string that is treated
as a prefix.
Earlier, in <>, we warned about the perils of the prefix–how
prefixqueries can be resource intensive. The same is true in this
case.(((“match_phrase_prefix query”, “caution with”))) A prefix of
acould match hundreds of thousands of terms. Not only
would matching on this many terms be resource intensive, but it would also not be
useful to the user.
We can limit the impact (((“match_phrase_prefix query”, “max_expansions”)))(((“max_expansions parameter”)))of the prefix expansion by setting
max_expansionsto
a reasonable number, such as 50:
[source,js]
--------------------------------------------------
{
    "match_phrase_prefix" : {
        "brand" : {
            "query":          "johnnie walker bl",
            "max_expansions": 50
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/20_Match_phrase_prefix.json

The `max_expansions` parameter controls how many terms the prefix is allowed
to match. It will find the first term starting with `bl` and keep collecting
terms (in alphabetical order) until it either runs out of terms with prefix
`bl`, or it has more terms than `max_expansions`.
Don’t forget that we have to run this query every time the user types another
character, so it needs to be fast. If the first set of results isn’t what users are after, they’ll keep typing until they get the results that they want.
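The expansion process can be sketched in a few lines of Python. This is a simplified model of the behavior described above, not Elasticsearch's actual implementation: jump to the first term that sorts at or after the prefix, then collect matching terms until `max_expansions` is reached.

```python
import bisect

def expand_prefix(terms, prefix, max_expansions):
    """Collect terms starting with `prefix`, in sorted order,
    stopping once `max_expansions` terms have been gathered."""
    terms = sorted(terms)
    # Jump to the first term >= prefix (the terms list of an
    # inverted index is already stored in sorted order).
    start = bisect.bisect_left(terms, prefix)
    expanded = []
    for term in terms[start:]:
        if len(expanded) >= max_expansions or not term.startswith(prefix):
            break
        expanded.append(term)
    return expanded

# With max_expansions=2, only the first two matching terms are used.
print(expand_prefix(["brown", "blue", "black", "bland"], "bl", 2))
# → ['black', 'bland']
```

Each extra character the user types lengthens the prefix and shrinks the candidate list, which is why keeping the per-keystroke cost bounded matters.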
=== Index-Time Optimizations
All of the solutions we’ve talked about so far are implemented at
query time. (((“index time optimizations”)))(((“partial matching”, “index time optimizations”)))They don’t require any special mappings or indexing patterns;
they simply work with the data that you’ve already indexed.
The flexibility of query-time operations comes at a cost: search performance.
Sometimes it may make sense to move the cost away from the query. In a
real-time web application, an additional 100ms may be too much latency to tolerate.
By preparing your data at index time, you can make your searches more flexible
and improve performance. You still pay a price: increased index size and
slightly slower indexing throughput, but it is a price you pay once at index
time, instead of paying it on every query.
Your users will thank you.
=== Ngrams for Partial Matching
As we have said before, ``You can find only terms that exist in the inverted
index.'' Although the `prefix`, `wildcard`, and `regexp` queries demonstrated that
that is not strictly true, it is true that doing a single-term lookup is
much faster than iterating through the terms list to find matching terms on
the fly.((("partial matching", "index time optimizations", "n-grams"))) Preparing your data for partial matching ahead of time will increase
your search performance.
Preparing your data at index time means choosing the right analysis chain, and
the tool that we use for partial matching is the _n-gram_.((("n-grams"))) An n-gram can be
best thought of as a _moving window on a word_. The _n_ stands for a length.
If we were to n-gram the word `quick`, the results would depend on the length
we have chosen:

[horizontal]
* Length 1 (unigram):   [ `q`, `u`, `i`, `c`, `k` ]
* Length 2 (bigram):    [ `qu`, `ui`, `ic`, `ck` ]
* Length 3 (trigram):   [ `qui`, `uic`, `ick` ]
* Length 4 (four-gram): [ `quic`, `uick` ]
* Length 5 (five-gram): [ `quick` ]
Plain n-grams are useful for matching _somewhere within a word_, a technique
that we will use in <>. However, for search-as-you-type,
we use a specialized form of n-grams called _edge n-grams_.((("edge n-grams"))) Edge
n-grams are anchored to the beginning of the word. Edge n-gramming the word
`quick` would result in this:

    q
    qu
    qui
    quic
    quick

You may notice that this conforms exactly to the letters that a user searching
for "quick" would type. In other words, these are the perfect terms to use for
instant search!
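Both flavors are straightforward to express directly. This Python sketch is an illustration of the definitions above, not Elasticsearch's tokenizer code, and reproduces the term lists for `quick`:

```python
def ngrams(word, n):
    """Plain n-grams: a window of length n sliding across the word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word, min_gram=1, max_gram=20):
    """Edge n-grams: prefixes of the word, anchored to its start."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]

print(ngrams("quick", 3))    # trigrams: ['qui', 'uic', 'ick']
print(edge_ngrams("quick"))  # ['q', 'qu', 'qui', 'quic', 'quick']
```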
=== Index-Time Search-as-You-Type
The first step to setting up index-time search-as-you-type is to(((“search-as-you-type”, “index time”)))(((“partial matching”, “index time search-as-you-type”))) define our
analysis chain, which we discussed in <>, but we will
go over the steps again here.
==== Preparing the Index
The first step is to configure a ((("partial matching", "index time search-as-you-type", "preparing the index")))custom
`edge_ngram` token filter,((("edge_ngram token filter"))) which we
will call the `autocomplete_filter`:
[source,js]
--------------------------------------------------
{
    "filter": {
        "autocomplete_filter": {
            "type":     "edge_ngram",
            "min_gram": 1,
            "max_gram": 20
        }
    }
}
--------------------------------------------------

This configuration says that, for any term that this token filter receives,
it should produce an n-gram anchored to the start of the word of minimum
length 1 and maximum length 20.
Then we need to use this token filter in a custom analyzer,((("analyzers", "autocomplete custom analyzer"))) which we will call
the `autocomplete` analyzer:
[source,js]
--------------------------------------------------
{
    "analyzer": {
        "autocomplete": {
            "type":      "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "autocomplete_filter" <1>
            ]
        }
    }
}
--------------------------------------------------
<1> Our custom edge-ngram token filter

This analyzer will tokenize a string into individual terms by using the
`standard` tokenizer, lowercase each term, and then produce edge n-grams of each
term, thanks to our `autocomplete_filter`.
The full request to create the index and instantiate the token filter and
analyzer looks like this:
[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "number_of_shards": 1, <1>
        "analysis": {
            "filter": {
                "autocomplete_filter": { <2>
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" <3>
                    ]
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

<1> See <>.
<2> First we define our custom token filter.
<3> Then we use it in an analyzer.
You can test this new analyzer to make sure it is behaving correctly by using
the `analyze` API:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=autocomplete
quick brown
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

The results show us that the analyzer is working correctly. It returns these
terms:

    q
    qu
    qui
    quic
    quick
    b
    br
    bro
    brow
    brown
To use the analyzer, we need to apply it to a field, which we can do
with((("update-mapping API, applying custom autocomplete analyzer to a field"))) the `update-mapping` API:

[source,js]
--------------------------------------------------
PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "name": {
                "type":     "string",
                "analyzer": "autocomplete"
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

Now, we can index some test documents:

[source,js]
--------------------------------------------------
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Brown foxes" }
{ "index": { "_id": 2 }}
{ "name": "Yellow furballs" }
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

==== Querying the Field

If you test out a query for ``brown fo'' by using ((("partial matching", "index time search-as-you-type", "querying the field")))a simple `match` query
[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

you will see that both documents match, even though the `Yellow furballs`
doc contains neither `brown` nor `fo`:
[source,js]
--------------------------------------------------
{
    "hits": [
        {
            "_id": "1",
            "_score": 1.5753809,
            "_source": {
                "name": "Brown foxes"
            }
        },
        {
            "_id": "2",
            "_score": 0.012520773,
            "_source": {
                "name": "Yellow furballs"
            }
        }
    ]
}
--------------------------------------------------
As always, the `validate-query` API shines some light:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "match": {
            "name": "brown fo"
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

The `explanation` shows us that the query is looking for edge n-grams of every
word in the query string:

    name:b name:br name:bro name:brow name:brown name:f name:fo
The `name:f` condition is satisfied by the second document because
`furballs` has been indexed as `f`, `fu`, `fur`, and so forth. In retrospect, this
is not surprising. The same `autocomplete` analyzer is being applied both at
index time and at search time, which in most situations is the right thing to
do. This is one of the few occasions when it makes sense to break this rule.

We want to ensure that our inverted index contains edge n-grams of every word,
but we want to match only the full words that the user has entered
(`brown` and `fo`).((("analyzers", "changing search analyzer from index analyzer"))) We can do this by using the
`autocomplete` analyzer at index time and the `standard` analyzer at search
time. One way to change the search analyzer is just to specify it in the query:
[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "name": {
                "query":    "brown fo",
                "analyzer": "standard" <1>
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

<1> This overrides the `analyzer` setting on the `name` field.
Alternatively, we can specify ((("search_analyzer parameter")))((("index_analyzer parameter")))the `index_analyzer` and `search_analyzer` in
the mapping for the `name` field itself. Because we want to change only the
`search_analyzer`, we can update the existing mapping without having to
reindex our data:
[source,js]
--------------------------------------------------
PUT /my_index/my_type/_mapping
{
    "my_type": {
        "properties": {
            "name": {
                "type":            "string",
                "index_analyzer":  "autocomplete", <1>
                "search_analyzer": "standard" <2>
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Search_as_you_type.json

<1> Use the `autocomplete` analyzer at index time to produce edge n-grams of
every term.
<2> Use the `standard` analyzer at search time to search only on the terms
that the user has entered.
If we were to repeat the `validate-query` request, it would now give us this
explanation:

    name:brown name:fo

Repeating our query correctly returns just the `Brown foxes` document.

Because most of the work has been done at index time, all this query needs to
do is to look up the two terms `brown` and `fo`, which is much more efficient
than the `match_phrase_prefix` approach of having to find all terms beginning
with `fo`.
.Completion Suggester
****
Using edge n-grams for search-as-you-type is easy to set up, flexible, and
fast. However, sometimes it is not fast enough. Latency matters, especially
when you are trying to provide instant feedback. Sometimes the fastest way of
searching is not to search at all.

The http://bit.ly/1IChV5j[completion suggester] in
Elasticsearch((("completion suggester"))) takes a completely different approach. You feed it a list
of all possible completions, and it builds them into a _finite state
transducer_, an((("Finite State Transducer"))) optimized data structure that resembles a big graph. To
search for suggestions, Elasticsearch starts at the beginning of the graph and
moves character by character along the matching path. Once it has run out of
user input, it looks at all possible endings of the current path to produce a
list of suggestions.

This data structure lives in memory and makes prefix lookups extremely fast,
much faster than any term-based query could be. It is an excellent match for
autocompletion of names and brands, whose words are usually organized in a
common order: ``Johnny Rotten'' rather than ``Rotten Johnny.''

When word order is less predictable, edge n-grams can be a better solution
than the completion suggester. This particular cat may be skinned in myriad
ways.
****
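To get a feel for walking a graph of completions character by character, here is a toy Python sketch using a plain trie. A real finite state transducer also shares suffixes and carries weights, so this illustrates only the lookup idea, not Elasticsearch's implementation:

```python
def build_trie(phrases):
    """Index each complete phrase, character by character, into a nested dict."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node["$end"] = True  # marks the end of a complete phrase
    return root

def complete(root, prefix):
    """Walk the graph along `prefix`, then collect every possible ending."""
    node = root
    for ch in prefix:
        if ch not in node:
            return []  # the prefix matches no path in the graph
        node = node[ch]
    suggestions = []
    def walk(n, tail):
        if "$end" in n:
            suggestions.append(prefix + tail)
        for ch in sorted(k for k in n if k != "$end"):
            walk(n[ch], tail + ch)
    walk(node, "")
    return suggestions

trie = build_trie(["johnnie walker black", "johnnie walker blue", "jim beam"])
print(complete(trie, "johnnie walker bl"))
# → ['johnnie walker black', 'johnnie walker blue']
```

Because the lookup is a pointer chase through an in-memory structure rather than a terms-list scan, it stays fast no matter how many completions are indexed.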
==== Edge n-grams and Postcodes

The edge n-gram approach can((("postcodes (UK), partial matching with", "using edge n-grams")))((("edge n-grams", "and postcodes"))) also be used for structured data, such as the
postcodes example from <>.

[TIP]
==================================================
The `keyword` tokenizer is the no-operation tokenizer, the tokenizer that does
nothing. Whatever string it receives as input, it emits exactly the same
string as a single token. It can therefore be used for values that we would
normally treat as `not_analyzed` but that require some other analysis
transformation such as lowercasing.
==================================================

This example uses the `keyword` tokenizer to convert the postcode string into
a token stream, so that we can use the edge n-gram token filter:
[source,js]
--------------------------------------------------
{
    "analysis": {
        "filter": {
            "postcode_filter": {
                "type":     "edge_ngram",
                "min_gram": 1,
                "max_gram": 8
            }
        },
        "analyzer": {
            "postcode_index": { <1>
                "tokenizer": "keyword",
                "filter": [ "postcode_filter" ]
            },
            "postcode_search": { <2>
                "tokenizer": "keyword"
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/35_Postcodes.json

<1> The `postcode_index` analyzer would use the `postcode_filter`
to turn postcodes into edge n-grams.
<2> The `postcode_search` analyzer would treat search terms as
if they were `not_analyzed`.
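To see what the `postcode_index` analyzer emits, here is a Python sketch of the keyword-tokenizer-plus-`edge_ngram` chain. The postcode `W1V 3DG` is simply an illustrative UK-format value, and this models the analysis chain rather than reproducing Elasticsearch's code:

```python
def postcode_index_terms(postcode, min_gram=1, max_gram=8):
    # The keyword tokenizer emits the whole input as a single token, and
    # postcode_filter then produces edge n-grams of it. No lowercase
    # filter appears in the analyzer, so the original case is preserved.
    return [postcode[:n] for n in range(min_gram, min(max_gram, len(postcode)) + 1)]

print(postcode_index_terms("W1V 3DG"))
# → ['W', 'W1', 'W1V', 'W1V ', 'W1V 3', 'W1V 3D', 'W1V 3DG']
```

Note that the space is part of the token: with the `keyword` tokenizer, the whole postcode, whitespace included, is one term.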
[[ngrams-compound-words]]
=== Ngrams for Compound Words
Finally, let’s take a look at how n-grams can be used to search languages with
compound words. (((“languages”, “using many compound words, indexing of”)))(((“n-grams”, “using with compound words”)))(((“partial matching”, “using n-grams for compound words”)))(((“German”, “compound words in”))) German is famous for combining several small words into one
massive compound word in order to capture precise or complex meanings. For
example:
Aussprachewörterbuch::
Pronunciation dictionary
Militärgeschichte::
Military history
Weißkopfseeadler::
White-headed sea eagle, or bald eagle
Weltgesundheitsorganisation::
World Health Organization
Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz::
The law concerning the delegation of duties for the supervision of cattle
marking and the labeling of beef
Somebody searching for ``Wörterbuch'' (dictionary) would probably expect to
see ``Aussprachewörterbuch'' in the results list. Similarly, a search for
``Adler'' (eagle) should include ``Weißkopfseeadler.''
One approach to indexing languages like this is to break compound words into
their constituent parts using the http://bit.ly/1ygdjjC[compound word token filter].
However, the quality of the results depends on how good your compound-word
dictionary is.
Another approach is just to break all words into n-grams and to search for any
matching fragments. The more fragments that match, the more relevant the
document.
Given that an n-gram is a moving window on a word, an n-gram of any length
will cover all of the word. We want to choose a length that is long enough
to be meaningful, but not so long that we produce far too many unique terms.
A trigram (length 3) is (((“trigrams”)))probably a good starting point:
[source,js]
--------------------------------------------------
PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 3,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "trigrams_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "text": {
                    "type":     "string",
                    "analyzer": "trigrams" <1>
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

<1> The `text` field uses the `trigrams` analyzer to index its contents as
n-grams of length 3.
Testing the trigrams analyzer with the `analyze` API

[source,js]
--------------------------------------------------
GET /my_index/_analyze?analyzer=trigrams
Weißkopfseeadler
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

returns these terms:

    wei, eiß, ißk, ßko, kop, opf, pfs, fse, see, eea, ead, adl, dle, ler
We can index our example compound words to test this approach:
[source,js]
--------------------------------------------------
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Aussprachewörterbuch" }
{ "index": { "_id": 2 }}
{ "text": "Militärgeschichte" }
{ "index": { "_id": 3 }}
{ "text": "Weißkopfseeadler" }
{ "index": { "_id": 4 }}
{ "text": "Weltgesundheitsorganisation" }
{ "index": { "_id": 5 }}
{ "text": "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz" }
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

A search for ``Adler'' (eagle) becomes a query for the three terms `adl`,
`dle`, and `ler`:
[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": "Adler"
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

which correctly matches ``Weißkopfsee-__adler__'':
[source,js]
--------------------------------------------------
{
    "hits": [
        {
            "_id": "3",
            "_score": 3.3191128,
            "_source": {
                "text": "Weißkopfseeadler"
            }
        }
    ]
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json

A similar query for ``Gesundheit'' (health) correctly matches
``Welt-__gesundheit__-sorganisation,'' but it also matches
``Militär-__ges__-chichte'' and
``Rindfleischetikettierungsüberwachungsaufgabenübertragungs-__ges__-etz,''
both of which also contain the trigram `ges`.
Judicious use of the `minimum_should_match` parameter can remove these
spurious results by requiring that a minimum number of trigrams must be
present for a document to be considered a match:
[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "text": {
                "query":                "Gesundheit",
                "minimum_should_match": "80%"
            }
        }
    }
}
--------------------------------------------------
// SENSE: 130_Partial_Matching/40_Compound_words.json
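The arithmetic behind that 80% cutoff can be checked with a short Python sketch, assuming the standard behavior that a positive percentage is rounded down to a whole number of matching terms:

```python
import math

def trigrams(word):
    """Length-3 n-grams of the lowercased word, as the trigrams analyzer produces."""
    word = word.lower()
    return [word[i:i + 3] for i in range(len(word) - 2)]

def required_matches(num_terms, percentage):
    # A positive percentage is rounded down to a whole number of terms.
    return int(math.floor(num_terms * percentage / 100.0))

terms = trigrams("Gesundheit")           # 8 trigrams: ges, esu, sun, und, ...
print(required_matches(len(terms), 80))  # → 6: six of the eight trigrams must match
```

``Militär-__ges__-chichte'' shares only the single trigram `ges` with the query, one of eight, so at 80% it falls well short of the six required and is filtered out.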
This is a bit of a shotgun approach to full-text search and can result in a
large inverted index, but it is an effective generic way of indexing languages
that use many compound words or that don't use whitespace between words,
such as Thai.
This technique is used to increase recall—the number of relevant
documents that a search returns. It is usually used in combination with
other techniques, such as shingles (see <>) to improve precision and
the relevance score of each document.
https://github.com/uxff/elasticsearch-definitive-guide-cn