
ES Study Notes 5: Search Relevance

2015-02-21 19:09
By default, results are returned sorted by relevance, with the most relevant docs first.

First, let's look at sorting:

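{
    "query": { },
    "from": 0,
    "size": 10,
    "sort": "field"
}

The "sort" value can also be a list of fields, "sort": ["field1", "field2"], or an object that sets the order, "sort": {"field": "desc"}. For a multivalue field, the object form can also choose which value to sort on with "mode":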

"sort": {
    "dates": {
        "order": "asc",
        "mode":  "min"
    }
}


String sorting and multifields

Analyzed string fields are also multivalue fields, but sorting on them seldom gives you the results you want. If you analyze a string like "fine old art", it results in three terms. We probably want to sort alphabetically on the first term, then the second term, and so forth, but Elasticsearch doesn’t have this information at its disposal at sort time.

In short: an analyzed string field is a multivalue field, so sorting on it will most likely not produce the order you expect.
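The difference can be sketched in a few lines of Python. This is only an illustration: the analyze function is a hypothetical stand-in for a real analyzer (it just lowercases and splits on whitespace), and the two sample strings are made up.

```python
def analyze(text):
    # Hypothetical stand-in for a real analyzer: lowercase, split on whitespace.
    return text.lower().split()

docs = ["fine old art", "great ape"]

# Sorting on the whole, unanalyzed string.
by_raw = sorted(docs)

# Sorting ascending on the smallest term of each document, i.e. reducing the
# multivalue field to a single key the way a "min" sort mode would.
by_min_term = sorted(docs, key=lambda d: min(analyze(d)))

print(by_raw)       # whole-string order
print(by_min_term)  # order driven by the terms "art" and "ape"
```

The two orders differ: the whole-string sort puts "fine old art" first, while the term-based sort puts "great ape" first because "ape" sorts before "art".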

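The solution is to define a mapping that indexes the field twice: the analyzed main field is used for searching, and a not_analyzed "raw" subfield is used for sorting.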

"tweet": { 
    "type":     "string",
    "analyzer": "english",
    "fields": {
        "raw": { 
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}

GET /_search
{
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    },
    "sort": "tweet.raw"
}

Search result relevance

The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:

Term frequency: How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.

Inverse document frequency: How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms.

Field-length norm: How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.
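The three factors can be sketched numerically. The formulas below are the ones I understand Lucene's classic TF/IDF similarity to use; they do not appear in the notes above, so treat them as an assumption. The real score also combines these values with other factors that are left out here.

```python
import math

def tf(freq):
    # Term frequency: grows with the number of occurrences in the field.
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    # Inverse document frequency: shrinks as more documents contain the term.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    # Field-length norm: shrinks as the field gets longer.
    return 1.0 / math.sqrt(num_terms)

print(tf(1), tf(5))                        # five mentions outweigh one
print(idf(1000, 9), idf(1000, 499))        # a rare term outweighs a common one
print(field_norm(4), field_norm(100))      # a short title outweighs long content
```

Each pair printed above moves in the direction the corresponding factor describes.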
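Setting the explain parameter on a search request returns, for each hit, an explanation of how its _score was computed. It also adds information about the shard and the node that the document came from, which is useful to know because term and document frequencies are calculated per shard, rather than per index.
In other words, relevance scores are computed per shard, not per index.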
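A small sketch of why the shard matters. The document counts are invented for illustration, and the idf formula is assumed to be the classic Lucene one, which the notes do not state.

```python
import math

def idf(num_docs, doc_freq):
    # Assumed classic inverse document frequency formula.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards: (documents on the shard, documents containing the term).
shards = {"shard_0": (1000, 9), "shard_1": (1000, 99)}

# What each shard computes from its own local statistics.
per_shard = {name: idf(n, df) for name, (n, df) in shards.items()}

# What a single index-wide calculation would give instead.
total_docs = sum(n for n, _ in shards.values())
total_freq = sum(df for _, df in shards.values())
index_wide = idf(total_docs, total_freq)

print(per_shard)
print(index_wide)
```

The same term gets a different weight on each shard, and neither matches the index-wide value, so two otherwise identical documents can score differently depending on where they live.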

GET /_search?explain 
{
   "query"   : { "match" : { "tweet" : "honeymoon" }}
}

Remember to use explain only for debugging. Turn it off in production, because it carries a significant performance cost.


fielddata

To make sorting efficient, Elasticsearch loads all the values for the field that you want to sort on into memory. This is referred to as fielddata.
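As a rough mental model only: the inverted index maps terms to documents, while sorting needs the opposite direction, from a document to its values. The sketch below uninverts a toy index into an in-memory dictionary; the data and the build_fielddata helper are hypothetical, and the real structure is far more compact.

```python
from collections import defaultdict

# Toy inverted index for one field: term -> ids of documents containing it.
inverted_index = {
    "art": [1],
    "fine": [1, 2],
    "wine": [2],
}

def build_fielddata(index):
    """Uninvert the index into doc id -> sorted list of that document's values."""
    values_by_doc = defaultdict(list)
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            values_by_doc[doc_id].append(term)
    return {doc_id: sorted(values) for doc_id, values in values_by_doc.items()}

fielddata = build_fielddata(inverted_index)

# Sorting hits is now a cheap in-memory lookup of each document's smallest value.
hits = [2, 1]
sorted_hits = sorted(hits, key=lambda doc_id: fielddata[doc_id][0])
print(sorted_hits)
```

Because every value of the field is held in memory, the cost grows with the size of the field across the index rather than with the number of hits.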