
Elasticsearch: The Definitive Guide, 03 Dealing with Human Language, 03 Normalizing Tokens

2017-02-06 17:34
https://www.elastic.co/guide/en/elasticsearch/guide/current/token-normalization.html

Breaking text into tokens is only half the job.

To make those tokens more easily searchable, they need to go through a normalization process to remove insignificant differences between otherwise identical words, such as uppercase versus lowercase. Perhaps we also need to remove significant differences, to make esta, ésta, and está all searchable as the same word.

This is the job of the token filters, which receive a stream of tokens from the tokenizer. You can have multiple token filters, each doing its particular job.
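To make the tokenizer-then-filters flow concrete, here is a minimal Python sketch of an analysis chain. It is only an illustration under stated assumptions: `simple_tokenizer`, `lowercase_filter`, and `analyze` are hypothetical names, and the regex split is a crude stand-in for the standard tokenizer, not how Elasticsearch implements it.

```python
import re
from typing import Callable, Iterable, Iterator, List

# A token filter takes a stream of tokens and yields a (possibly changed) stream.
TokenFilter = Callable[[Iterable[str]], Iterator[str]]


def simple_tokenizer(text: str) -> Iterator[str]:
    """Yield runs of word characters (hypothetical stand-in for a real tokenizer)."""
    for match in re.finditer(r"\w+", text):
        yield match.group()


def lowercase_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Lowercase every token in the stream."""
    for token in tokens:
        yield token.lower()


def analyze(text: str, filters: List[TokenFilter]) -> List[str]:
    """Tokenize the text, then pass the token stream through each filter in order."""
    stream: Iterable[str] = simple_tokenizer(text)
    for token_filter in filters:
        stream = token_filter(stream)
    return list(stream)


print(analyze("The QUICK Brown FOX!", [lowercase_filter]))
# ['the', 'quick', 'brown', 'fox']
```

Because each filter only consumes and produces a token stream, filters can be chained in any number, and their order in the list is the order in which they run.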

1 In That Case

The most frequently used token filter is the lowercase filter.

GET /_analyze?tokenizer=standard&filters=lowercase
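The QUICK Brown FOX!

This emits the tokens the, quick, brown, and fox. To apply the filter as part of the analysis process, include it in a custom analyzer:
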
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_lowercaser": { // custom analyzer
          "tokenizer": "standard",
          "filter":  [ "lowercase" ]
        }
      }
    }
  }
}


2 You Have an Accent

English uses diacritics (like ´, ^, and ¨) only for imported words, like rôle, déjà, and däis, but usually they are optional. Other languages require diacritics in order to be correct.

Of course, just because words are spelled correctly in your index doesn't mean that the user will search for the correct spelling.

It is often useful to strip diacritics from words, allowing rôle to match role, and vice versa. With Western languages, this can be done with the asciifolding token filter. Actually, it does more than just strip diacritics. It tries to convert many Unicode characters into a simpler ASCII representation:

ß ⇒ ss

æ ⇒ ae

ł ⇒ l

ɰ ⇒ m

⁇ ⇒ ??

❷ ⇒ 2

⁶ ⇒ 6

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}
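The diacritic-stripping part of this idea can be sketched in Python with the standard library. This is an approximation under a stated assumption, not the asciifolding filter itself: it decomposes each character and drops the combining marks, so it handles rôle and déjà but does not reproduce the filter's wider mapping table (ß and ł, for example, pass through unchanged). The function names `strip_diacritics` and `fold` are hypothetical.

```python
import unicodedata


def strip_diacritics(token: str) -> str:
    """Decompose characters (NFD) and drop the combining marks that remain."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(token: str) -> str:
    """Lowercase first, then strip diacritics, mirroring the filter order above."""
    return strip_diacritics(token.lower())


print(fold("Rôle"))                                # role
print([fold(w) for w in ["esta", "ésta", "está"]])  # ['esta', 'esta', 'esta']
print(strip_diacritics("ł"))                       # ł (no decomposition, so unchanged)
```

The last line shows the limit of this approach: characters without a canonical decomposition need an explicit mapping table, which is what a real folding filter supplies.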


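Retaining Meaning

Stripping diacritics loses meaning: esta, ésta, and está are three different Spanish words. To keep both behaviors, index the text twice, once in its original form and once with diacritics removed, by using a multifield:

"title": {
  "type":           "string",
  "analyzer":       "standard",
  "fields": {
    "folded": { // title.folded strips diacritics with the folding analyzer
      "type":       "string",
      "analyzer":   "folding"
    }
  }
}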


3 Living in a Unicode World

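When Elasticsearch compares one token with another, it does so at the byte level. In other words, for two tokens to be considered the same, they need to consist of exactly the same bytes. Unicode, however, allows you to write the same letter in different ways. For instance, é can be a single precomposed character or a plain e followed by a combining acute accent.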

There are four Unicode normalization forms, all of which convert Unicode characters into a standard format, making all characters comparable at a byte level: nfc, nfd, nfkc, nfkd.

It doesn’t really matter which normalization form you choose, as long as all your text is in the same form. That way, the same tokens consist of the same bytes.
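The byte-level mismatch, and how normalization removes it, can be checked with Python's standard `unicodedata` module. This is a small sketch outside Elasticsearch; `normalize_token` is a hypothetical helper name.

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# The two strings render the same but consist of different bytes.
print(composed == decomposed)      # False
print(composed.encode("utf-8"))    # b'\xc3\xa9'
print(decomposed.encode("utf-8"))  # b'e\xcc\x81'


def normalize_token(token: str, form: str = "NFKC") -> str:
    """Convert a token into one of the four Unicode normalization forms."""
    return unicodedata.normalize(form, token)


# Whichever form is chosen, both spellings end up as identical bytes.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = normalize_token(composed, form) == normalize_token(decomposed, form)
    print(form, same)  # True for every form

# The compatibility forms (NFKC, NFKD) additionally replace characters such as
# the ligature ﬁ with their plain equivalents.
print(normalize_token("\ufb01", "NFC") == "\ufb01")  # True: left as-is
print(normalize_token("\ufb01", "NFKC"))             # fi
```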

You can use the icu_normalizer token filter to ensure that all of your tokens are in the same form:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc" // normalize all tokens into the nfkc form
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "nfkc_normalizer" ]
        }
      }
    }
  }
}


4 Unicode Case Folding

The whole point of lowercasing terms is to make them more likely to match, not less! In Unicode, this job is done by case folding rather than by lowercasing.

Case folding is the act of converting words into a (usually lowercase) form that does not necessarily result in the correct spelling, but does allow case-insensitive comparisons.
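The difference between lowercasing and case folding shows up with the German ß. The sketch below uses Python's built-in `str.lower` and `str.casefold` purely as an illustration; the helper names `matches_lowercased` and `matches_folded` are hypothetical.

```python
def matches_lowercased(a: str, b: str) -> bool:
    """Compare two words after simple lowercasing."""
    return a.lower() == b.lower()


def matches_folded(a: str, b: str) -> bool:
    """Compare two words after Unicode case folding."""
    return a.casefold() == b.casefold()


# Lowercasing leaves ß alone, so the two spellings do not match.
print("Weiß".lower(), "WEISS".lower())      # weiß weiss
print(matches_lowercased("Weiß", "WEISS"))  # False

# Case folding maps ß to ss. The result "weiss" is not the correct spelling
# of "weiß", but it makes the case-insensitive comparison succeed.
print("Weiß".casefold(), "WEISS".casefold())  # weiss weiss
print(matches_folded("Weiß", "WEISS"))        # True
```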

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_lowercaser": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "icu_normalizer" ] // defaults to nfkc_cf, the Unicode-aware equivalent of the lowercase token filter
        }
      }
    }
  }
}


5 Unicode Character Folding

The icu_folding token filter applies Unicode normalization and case folding from nfkc_cf automatically, so the icu_normalizer is not required:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "icu_folding" ]
        }
      }
    }
  }
}


6 Sorting and Collations

The final use case for normalization is string sorting.

String Sorting and Multifields (analyzed + not_analyzed)

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": { // case-insensitive sorting
          "tokenizer": "keyword", // emits the whole string as a single token
          "filter":  [ "lowercase" ] // lowercases the token
        }
      }
    }
  }
}
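The effect of this analyzer on sort order can be imitated in Python: treat the whole string as one token, lowercase it, and sort on that value. The sample names and the function `case_insensitive_sort_key` are assumptions made for illustration.

```python
from typing import List


def case_insensitive_sort_key(value: str) -> str:
    """Keep the whole string as one token and lowercase it, like keyword + lowercase."""
    return value.lower()


names: List[str] = ["Boffey", "BROWN", "bailey"]

# Plain sorting compares code points, so every uppercase letter sorts
# before every lowercase letter.
print(sorted(names))
# ['BROWN', 'Boffey', 'bailey']

# Sorting on the lowercased key ignores case.
print(sorted(names, key=case_insensitive_sort_key))
# ['bailey', 'Boffey', 'BROWN']
```

The original values are still what gets returned. Only the sort key is lowercased, which is the same division of labor as sorting on a lowercased subfield while displaying the main field.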


Every language has its own sort order, and sometimes even multiple sort orders.

Unicode Sorting

Collation is the process of sorting text into a predefined order.

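The Unicode Collation Algorithm, or UCA, defines a method of sorting strings into the order defined in a Collation Element Table (usually referred to just as a collation). The UCA also defines the Default Unicode Collation Element Table, or DUCET, which gives the default sort order for all Unicode characters.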

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ducet_sort": {
          "tokenizer": "keyword",
          "filter": [ "icu_collation" ] // default DUCET collation for sorting
        }
      }
    }
  }
}


Specifying a Language

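The icu_collation filter can be configured to use the collation table for a specific language, a country-specific version of a language, or some other subset such as German phonebook order, by setting the language, country, and variant parameters. For English:

{ "language": "en" }

For German phonebook order:
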
"analysis": {
  "filter": {
    "german_phonebook": {
      "type":     "icu_collation",
      "language": "de",
      "country":  "DE",
      "variant":  "@collation=phonebook"
    }
  }
}
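To see why a language can need more than one sort order, the toy Python sketch below contrasts German dictionary order (ä sorts like a) with phonebook order (ä sorts like ae). It is an assumption-laden illustration, not the ICU implementation: the real icu_collation filter produces binary sort keys from full collation tables, whereas `dictionary_key`, `phonebook_key`, and the sample names here are hypothetical and cover only a handful of characters.

```python
import unicodedata

# Phonebook order expands umlauts to a vowel plus e (toy table, not exhaustive).
PHONEBOOK_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def phonebook_key(name: str) -> str:
    """Toy German phonebook key: ü sorts as ue, ä as ae, ö as oe, ß as ss."""
    lowered = unicodedata.normalize("NFC", name).lower()
    return "".join(PHONEBOOK_EXPANSIONS.get(ch, ch) for ch in lowered)


def dictionary_key(name: str) -> str:
    """Toy German dictionary key: umlauts sort like their base vowel, ß as ss."""
    lowered = unicodedata.normalize("NFC", name).lower().replace("ß", "ss")
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


names = ["Mugler", "Müller", "Mutter"]

print(sorted(names, key=dictionary_key))
# ['Mugler', 'Müller', 'Mutter']  (Müller compared as "muller")

print(sorted(names, key=phonebook_key))
# ['Müller', 'Mugler', 'Mutter']  (Müller compared as "mueller")
```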