
ES Definitive Guide_03_Dealing with Human Language_01 Getting Started

2017-02-06 17:32
https://www.elastic.co/guide/en/elasticsearch/guide/current/language-intro.html

Full-text search is a battle between precision—returning as few irrelevant documents as possible—and recall—returning as many relevant documents as possible.

Wouldn’t you expect a search for “quick brown fox” to match a document containing “fast brown foxes”?

A document that matches exactly should rank highest, but weaker, potential matches should still be returned to improve recall.

Remove diacritics like ´, ^, and ¨ (see Normalizing Tokens).

Remove the distinction between singular and plural, fox versus foxes (see Reducing Words to Their Root Form).

Remove commonly used words or stopwords like the, and, and or (see Stopwords: Performance Versus Precision).

Include synonyms, so that a query for quick can also match fast (see Synonyms).

Check for misspellings or alternate spellings, or match on homophones, words that sound alike such as their and there, meat and meet (see Typoes and Mispelings).
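As a quick taste of the diacritics point, a custom analyzer built on the built-in asciifolding token filter can strip accents before indexing. The analyzer name folding below is my own choice, not from the original text:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter":    [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=folding
Déjà vu
```

Running Déjà vu through this analyzer should emit the accent-free tokens deja and vu.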

Before we can manipulate individual words, we need to divide text into words, which means that we need to know what constitutes a word. We will tackle this in Identifying Words.

Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages:

These analyzers typically perform four roles:

1. Tokenize text into individual words.

2. Lowercase tokens.

3. Remove common stopwords.

4. Stem tokens to their root form.

Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:

The english analyzer removes the possessive ’s: John’s → john
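All four roles can be observed in a single request by running a sentence through the english analyzer:

```json
GET /_analyze?analyzer=english
The quick brown foxes jumped
```

This should emit quick, brown, fox, jump: the text is tokenized, lowercased, the stopword The is removed, and foxes and jumped are stemmed to their root forms.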

1 Using Language Analyzers

The built-in language analyzers are available globally and don’t need to be configured before being used.

PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english" // use english instead of the default standard analyzer
        }
      }
    }
  }
}


GET /my_index/_analyze?field=title
I'm not happy about the foxes


Emits tokens: i’m, happi, about, fox

To get the best of both worlds, we can use multifields to index the title field twice: once with the standard analyzer and once with the english analyzer:


"title": {
  "type": "string",
  "fields": {
    "english": {
      "type":     "string",
      "analyzer": "english"
    }
  }
}


title: indexed with the standard analyzer

title.english: indexed with the english analyzer

PUT /my_index/blog/1
{ "title": "I'm happy for this fox" }

PUT /my_index/blog/2
{ "title": "I'm not happy about my fox problem" }

GET /_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", // match the same text in as many fields as possible
      "query":    "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}


2 Configuring Language Analyzers

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ],
          "stopwords": [ "a", "an", "and", "are" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english
The World Health Organization does not sell organs.


Emits tokens: world, health, organization, does, not, sell, organ

stem_exclusion: prevents the words organization and organizations from being stemmed.

stopwords: replaces the analyzer's default stopword list with a custom, shorter one.

3 Pitfalls of Mixing Languages (overview)

If you have to deal with only a single language, count yourself lucky.

4 One Language per Doc

Documents from different languages can be stored in separate indices: blogs-en, blogs-fr, and so on.

PUT /blogs-en
{
  "mappings": {
    "post": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "stemmed": {
              "type":     "string",
              "analyzer": "english" // a different analyzer for each language's index
            }
          }
        }
      }
    }
  }
}
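The French index would mirror this mapping, swapping in the built-in french analyzer (a sketch, not shown in the original):

```json
PUT /blogs-fr
{
  "mappings": {
    "post": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "stemmed": {
              "type":     "string",
              "analyzer": "french"
            }
          }
        }
      }
    }
  }
}
```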


GET /blogs-*/post/_search
{
  "query": {
    "multi_match": {
      "query":   "deja vu",
      "fields":  [ "title", "title.stemmed" ],
      "type":    "most_fields"
    }
  },
  "indices_boost": {
    "blogs-en": 3,
    "blogs-fr": 2
  }
}


Don’t Use Types for Languages

You may be tempted to use a separate type for each language, instead of a separate index.

Fields from different types but with the same field name are indexed into the same inverted index. This means that the term frequencies (TF) from each type, and thus each language, are mixed together.

5 One Language per Field

PUT /movies
{
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type":       "string"
        },
        "title_br": {
          "type":     "string",
          "analyzer": "brazilian"
        }
        ...
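With one field per language, a single query can target all of the title fields at once. A sketch, with an illustrative query string of my own:

```json
GET /movies/movie/_search
{
  "query": {
    "multi_match": {
      "query":  "fight club",
      "fields": [ "title*" ],
      "type":   "most_fields"
    }
  }
}
```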


6 Mixed-Language Fields

Split into separate fields, as described in One Language per Field.

Analyze multiple times, indexing the same text into several subfields, each with a different language analyzer:

"title": {
  "type": "string",
  "fields": {
    "de": {
      "type":     "string",
      "analyzer": "german"
    },
    "en": {
      "type":     "string",
      "analyzer": "english"
    }
    ...


Use n-grams

You could index all words as n-grams, using the same approach as described in Ngrams for Compound Words.
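The trigrams analyzer used in the mapping below is a custom one. Following the approach in Ngrams for Compound Words, it could be defined along these lines (a sketch; the filter name trigrams_filter is my own):

```json
PUT /movies
{
  "settings": {
    "analysis": {
      "filter": {
        "trigrams_filter": {
          "type":     "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "trigrams": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter":    [ "lowercase", "trigrams_filter" ]
        }
      }
    }
  }
}
```

This splits every lowercased word into overlapping three-letter fragments, giving a language-agnostic catchall field.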

"es": {
  "type":     "string",
  "analyzer": "spanish"
},
"general": {
  "type":     "string",
  "analyzer": "trigrams" // a custom trigram analyzer defined in the index settings
}


When querying the catchall general field, you can use minimum_should_match to reduce the number of low-quality matches.

GET /movies/movie/_search
{
  "query": {
    "multi_match": {
      "query":    "club de la lucha",
      "fields": [ "title*^1.5", "title.general" ],
      "type":     "most_fields",
      "minimum_should_match": "75%"
    }
  }
}


References:

Analysis » Tokenizers » NGram Tokenizer

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

An Introduction to Ngrams in Elasticsearch

https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch