ES Definitive Guide_03_Dealing with Human Language_01 Getting Started
2017-02-06 17:32
https://www.elastic.co/guide/en/elasticsearch/guide/current/language-intro.html
Full-text search is a battle between precision—returning as few irrelevant documents as possible—and recall—returning as many relevant documents as possible.
Wouldn’t you expect a search for “quick brown fox” to match a document containing “fast brown foxes”?
Documents that match the query exactly should rank highest, but weaker, potential matches can still be returned further down the list.
Several techniques make such weaker matches possible:
Remove diacritics such as ´ and ^ (see Normalizing Tokens).
Remove the distinction between singular and plural, fox versus foxes (see Reducing Words to Root Form).
Remove commonly used words, or stopwords, such as the, and, and or (see Stopwords: Performance Versus Precision).
Include synonyms so that a query for quick also matches fast (see Synonyms).
Check for misspellings or alternate spellings, or match on homophones, words that sound alike such as their and there, or meat and meet (see Typoes and Mispelings).
Before we can manipulate individual words, we need to divide text into words, which means that we need to know what constitutes a word. We will tackle this in Identifying Words.
Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages.
These analyzers typically perform four roles:
1. Tokenize text into individual words.
2. Lowercase tokens.
3. Remove common stopwords.
4. Stem tokens to their root form.
Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:
The english analyzer removes the possessive ’s: John’s → john
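The four roles and the possessive rule above can be sketched in a few lines of Python. This is only a toy illustration: the names STOPWORDS, strip_possessive, naive_stem, and analyze are hypothetical, the stopword list is a made-up sample, and the suffix-stripping rule is far cruder than the stemmer the real english analyzer uses.

```python
import re

# Hypothetical sample list; the real english analyzer ships its own stopwords.
STOPWORDS = {"the", "a", "an", "and", "or"}

def tokenize(text):
    # Role 1: split text into words (letters and straight apostrophes only).
    return re.findall(r"[A-Za-z']+", text)

def strip_possessive(token):
    # Language-specific step: drop a trailing 's.
    return token[:-2] if token.endswith("'s") else token

def naive_stem(token):
    # Role 4, deliberately crude: strip a plural "es" or "s".
    # This is NOT the algorithm used by Elasticsearch.
    if token.endswith("es") and len(token) > 4:
        return token[:-2]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def analyze(text):
    tokens = tokenize(text)
    tokens = [t.lower() for t in tokens]                 # Role 2: lowercase
    tokens = [strip_possessive(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]   # Role 3: stopwords
    return [naive_stem(t) for t in tokens]

print(analyze("The quick brown foxes"))  # ['quick', 'brown', 'fox']
print(analyze("John's fox"))             # ['john', 'fox']
```

The order of the steps matters in this sketch: lowercasing happens before the stopword check so that "The" is recognized as the stopword "the".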
1 Using Language Analyzers
The built-in language analyzers are available globally and don’t need to be configured before being used.
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english" // replaces the default standard analyzer
        }
      }
    }
  }
}
GET /my_index/_analyze?field=title
I'm not happy about the foxes
Emits the tokens: i’m, happi, about, fox
To get the best of both worlds, we can use multifields to index the title field twice:
"title": {
  "type": "string",
  "fields": {
    "english": {
      "type": "string",
      "analyzer": "english"
    }
  }
}
The main title field uses the standard analyzer, and the title.english field uses the english analyzer.
PUT /my_index/blog/1
{ "title": "I'm happy for this fox" }
PUT /my_index/blog/2
{ "title": "I'm not happy about my fox problem" }
GET /_search
{
  "query": {
    "multi_match": {
      "type": "most_fields", // match the same text in as many fields as possible
      "query": "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}
2 Configuring Language Analyzers
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ],
          "stopwords": [ "a", "an", "and", "are" ]
        }
      }
    }
  }
}
GET /my_index/_analyze?analyzer=my_english
The World Health Organization does not sell organs.
Emits the tokens: world, health, organization, does, not, sell, organ
stem_exclusion prevents the words organization and organizations from being stemmed.
stopwords supplies a custom list of stopwords (abbreviated here; the output above implies that the full list also contains the).
2 Pitfalls of Mixing Languages (overview)
If you have to deal with only a single language, count yourself lucky.
3 One Language per Doc
Documents from different languages can be stored in separate indices, such as blogs-en and blogs-fr:
PUT /blogs-en
{
  "mappings": {
    "post": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "stemmed": {
              "type": "string",
              "analyzer": "english" // use a different ... for each language
            }
          }
        }
      }
    }
  }
}
GET /blogs-*/post/_search
{
  "query": {
    "multi_match": {
      "query": "deja vu",
      "fields": [ "title", "title.stemmed" ],
      "type": "most_fields"
    }
  },
  "indices_boost": {
    "blogs-en": 3,
    "blogs-fr": 2
  }
}
Don’t Use Types for Languages
You may be tempted to use a separate type for each language, instead of a separate index. This is best avoided, because fields from different types but with the same field name are indexed into the same inverted index. This means that the term frequencies (TF) from each type (and thus each language) are mixed together.
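To see why mixed statistics are a problem, here is a small Python sketch that counts how many documents contain each term, first for English titles alone and then for English and German titles sharing one index. The titles and the helper name document_frequencies are invented for illustration, and the sketch does not reproduce Elasticsearch's actual scoring.

```python
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for text in docs:
        # set() so that a term is counted once per document.
        df.update(set(text.lower().split()))
    return df

# Hypothetical sample data.
english_titles = ["die hard", "the quick fox"]
german_titles = ["die katze", "die maus", "die stadt"]

english_only = document_frequencies(english_titles)
mixed = document_frequencies(english_titles + german_titles)

print(english_only["die"], "of", len(english_titles))             # 1 of 2
print(mixed["die"], "of", len(english_titles + german_titles))    # 4 of 5
```

In the English-only statistics, die is a fairly distinctive term. Once the German titles share the same inverted index, the German article die makes it look like one of the most common terms, so it would carry less weight for English searches as well.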
4 One Language per Field
PUT /movies
{
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string"
        },
        "title_br": {
          "type": "string",
          "analyzer": "brazilian"
        }
        ...
5 Mixed-Language Fields
Split into separate fields: use the approach described in One Language per Field.
Analyze multiple times
"title": {
  "type": "string",
  "fields": {
    "de": {
      "type": "string",
      "analyzer": "german"
    },
    "en": {
      "type": "string",
      "analyzer": "english"
    }
    ...
Use n-grams
You could index all words as n-grams, using the same approach as described in Ngrams for Compound Words.
"es": {
  "type": "string",
  "analyzer": "spanish"
},
"general": {
  "type": "string",
  "analyzer": "trigrams"
}
When querying the catchall general field, you can use minimum_should_match to reduce the number of low-quality matches.
GET /movies/movie/_search
{
  "query": {
    "multi_match": {
      "query": "club de la lucha",
      "fields": [ "title*^1.5", "title.general" ],
      "type": "most_fields",
      "minimum_should_match": "75%"
    }
  }
}
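A rough Python sketch of what happens on the general field: the query text is broken into trigrams, and minimum_should_match decides how many of them have to match. Several things here are assumptions rather than facts from these notes: that the trigrams analyzer lowercases words and emits only 3-character grams, that words shorter than three characters produce no grams, and that a percentage is rounded down to a whole number of clauses, as the Elasticsearch reference describes. The function names are hypothetical, and only positive integers and positive percentages are handled.

```python
def trigrams(word, n=3):
    """Return the character n-grams of one lowercased word.

    Words shorter than n yield an empty list (an assumption of this sketch).
    """
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def analyze_trigrams(text):
    """Split on whitespace and collect the trigrams of every word."""
    grams = []
    for word in text.split():
        grams.extend(trigrams(word))
    return grams

def required_matches(total_clauses, minimum_should_match):
    """Convert a value such as "75%" or "2" into a number of clauses."""
    if minimum_should_match.endswith("%"):
        percent = int(minimum_should_match[:-1])
        return total_clauses * percent // 100  # integer division rounds down
    return int(minimum_should_match)

grams = analyze_trigrams("club de la lucha")
print(grams)                                 # ['clu', 'lub', 'luc', 'uch', 'cha']
print(required_matches(len(grams), "75%"))   # 3
```

Under these assumptions, a title in the general field would need to share at least three of the five query trigrams, which filters out documents that match only a single stray gram.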
References:
Analysis » Tokenizers » NGram Tokenizer
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html
An Introduction to Ngrams in Elasticsearch
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch