ES Definitive Guide_03_Dealing with Human Language_03 Normalizing Tokens
2017-02-06 17:34
https://www.elastic.co/guide/en/elasticsearch/guide/current/token-normalization.html
Breaking text into tokens is only half the job. To make those tokens more easily searchable, they need to go through a normalization process to remove insignificant differences between otherwise identical words, such as uppercase versus lowercase. Perhaps we also need to remove significant differences, to make esta, ésta, and está all searchable as the same word.
This is the job of the token filters, which receive a stream of tokens from the tokenizer. You can have multiple token filters, each doing its particular job.
1 In That Case
The most frequently used token filter is the lowercase filter:

GET /_analyze?tokenizer=standard&filters=lowercase
The QUICK Brown FOX!
A custom analyzer that combines the standard tokenizer with the lowercase token filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_lowercaser": {
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
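The effect of this standard-tokenizer-plus-lowercase pipeline can be sketched in plain Python (the regex here is a rough stand-in for the real standard tokenizer, not its actual implementation):

```python
import re

def analyze(text):
    # rough stand-in for the standard tokenizer: split on non-word characters
    tokens = re.findall(r"\w+", text)
    # lowercase token filter
    return [token.lower() for token in tokens]

print(analyze("The QUICK Brown FOX!"))  # ['the', 'quick', 'brown', 'fox']
```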
2 You Have an Accent
English uses diacritics (like ´, ^, and ¨) only for imported words—like rôle, déjà, and däis—but usually they are optional. Other languages require diacritics in order to be correct. Of course, just because words are spelled correctly in your index doesn't mean that the user will search for the correct spelling. It is often useful to strip diacritics from words, allowing rôle to match role, and vice versa. With Western languages, this can be done with the asciifolding token filter. Actually, it does more than just strip diacritics. It tries to convert many Unicode characters into a simpler ASCII representation:
ß ⇒ ss
æ ⇒ ae
ł ⇒ l
ɰ ⇒ m
⁇ ⇒ ??
❷ ⇒ 2
⁶ ⇒ 6
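A rough approximation of the diacritic-stripping part can be written with Python's unicodedata module (this sketch handles only combining marks; it does not cover one-to-many mappings such as ß ⇒ ss, which the real asciifolding filter does):

```python
import unicodedata

def strip_diacritics(text):
    # decompose each character into base letter + combining marks (NFKD),
    # then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("rôle"))  # role
print(strip_diacritics("déjà"))  # deja
```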
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}
Retaining Meaning
The title field is indexed with the standard analyzer, while the title.folded subfield applies the folding analyzer:

"title": {
  "type": "string",
  "analyzer": "standard",
  "fields": {
    "folded": {
      "type": "string",
      "analyzer": "folding"
    }
  }
}
3 Living in a Unicode World
When Elasticsearch compares one token with another, it does so at the byte level; for two tokens to be considered the same, they need to consist of exactly the same bytes. Unicode, however, allows you to write the same letter in different ways. There are four Unicode normalization forms, all of which convert Unicode characters into a standard format, making all characters comparable at a byte level: nfc, nfd, nfkc, and nfkd.
It doesn’t really matter which normalization form you choose, as long as all your text is in the same form. That way, the same tokens consist of the same bytes.
You can use the icu_normalizer token filter to ensure that all of your tokens are in the same form:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "nfkc_normalizer" ]
        }
      }
    }
  }
}

The nfkc_normalizer filter normalizes all tokens into the nfkc form.
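The byte-level mismatch that this filter guards against is easy to demonstrate with Python's unicodedata module:

```python
import unicodedata

composed = "caf\u00e9"        # é as a single code point (U+00E9)
decomposed = "cafe\u0301"     # e followed by a combining acute accent (U+0301)

print(composed == decomposed)  # False: same letter on screen, different bytes
# after normalizing both to the same form (here NFC), they become identical
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```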
4 Unicode Case Folding
The whole point of lowercasing terms is to make them more likely to match, not less! In Unicode, this job is done by case folding rather than by lowercasing. Case folding is the act of converting words into a (usually lowercase) form that does not necessarily result in the correct spelling, but does allow case-insensitive comparisons.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_lowercaser": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_normalizer" ]
        }
      }
    }
  }
}

The icu_normalizer defaults to the nfkc_cf form, which normalizes and case folds in one step; in effect it is a Unicode-aware version of the lowercase token filter.
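The difference between plain lowercasing and case folding shows up with characters like the German ß, which Python's str.casefold exposes directly:

```python
# lowercasing leaves ß unchanged, so 'WEISS' would never match 'weiß'
print("ß".lower())       # 'ß'
# case folding expands ß to 'ss', enabling a case-insensitive match
print("ß".casefold())    # 'ss'
print("WEISS".casefold() == "weiß".casefold())  # True
```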
5 Unicode Character Folding
The icu_folding token filter applies the Unicode normalization and case folding from nfkc_cf automatically, so the icu_normalizer is not required:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}
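A simplified sketch of what such character folding does, combining normalization, case folding, and diacritic stripping (an approximation of the idea, not the actual icu_folding implementation):

```python
import unicodedata

def fold(text):
    # normalize compatibility forms, then case fold (nfkc_cf-like)
    text = unicodedata.normalize("NFKC", text).casefold()
    # strip diacritics via decomposition
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold("DÉJÀ"))  # deja
print(fold("Weiß"))  # weiss
```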
6 Sorting and Collations
String Sorting and Multifields【analyzed + not_analyzed】: sorting on an analyzed string field sorts by its tokens rather than by the original string, so the field usually needs a second, not_analyzed (keyword-tokenized) version purely for sorting. To make that sort case-insensitive, combine the keyword tokenizer with the lowercase filter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

The keyword tokenizer emits the entire input as a single token, and the lowercase filter makes the resulting sort key case-insensitive.
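The effect of sorting on a lowercased single-token key, versus the raw value, mirrors what happens in Python when sorting with and without case folding:

```python
names = ["Boffey", "BROWN", "bailey", "Box"]

# raw code point order: every uppercase letter sorts before lowercase
print(sorted(names))                    # ['BROWN', 'Boffey', 'Box', 'bailey']

# sorting on a case-folded key gives the expected alphabetical order
print(sorted(names, key=str.casefold))  # ['bailey', 'Boffey', 'Box', 'BROWN']
```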
Every language has its own sort order, and sometimes even multiple sort orders.
Unicode Sorting
Collation is the process of sorting text into a predefined order.
The Unicode Collation Algorithm, or UCA, defines a method of sorting strings into the order defined in a Collation Element Table (usually referred to just as a collation).
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ducet_sort": {
          "tokenizer": "keyword",
          "filter": [ "icu_collation" ]
        }
      }
    }
  }
}

The icu_collation filter defaults to the DUCET collation table for sorting.
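Why code point order alone is not enough: accented characters sit outside the a–z range, so a naive sort misplaces them, which is exactly what a collation table corrects:

```python
words = ["role", "ruby", "rôle"]

# code point order puts ô (U+00F4) after all of a-z,
# so 'rôle' incorrectly sorts after 'ruby'
print(sorted(words))  # ['role', 'ruby', 'rôle']
```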
Specifying a Language
The icu_collation filter can be configured to use the collation table for a specific language, for example { "language": "en" }, or even a language- and country-specific variant such as the German phonebook sort order:
"analysis": {
  "filter": {
    "german_phonebook": {
      "type": "icu_collation",
      "language": "de",
      "country": "DE",
      "variant": "@collation=phonebook"
    }
  }
}