[Reading Notes] Ranking Relevance in Yahoo Search (Part 2) — machine learned ranking
2018-01-29 10:06
3. MACHINE LEARNED RANKING
1) Training a model only on bad data is not feasible, because negative results can never cover every aspect;
2) Search can be viewed as a binary classification problem. In this work, gradient boosted decision trees (GBDT) with logistic loss are used to reduce the number of bad URLs on the first page:
the logistic loss first draws the boundary between URLs that are relevant to a given query and those that are not;
GBDT then adds the Perfect/Excellent/Good information to the model to further differentiate URLs.
3.1 Core Ranking (comparable to the booster component in chinaso)
Uses GBDT with logistic loss;
3.1.1 Logistic loss: a binary-classification view, used to reduce the bad/fair URLs on the first page
1) Steps:
Split labels into two groups: Perfect, Excellent, Good → +1; Fair, Bad → −1
Formula: to be added
2) Advantage
Compared with other loss functions (e.g. hinge loss), logistic loss yields a more reliable ranking,
because logistic loss always places a force on positive/negative samples towards positive/negative infinity;
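The binary split above can be sketched in a few lines (a minimal illustration with scikit-learn, not the paper's production system; the feature vectors and model settings below are made up for demonstration):

```python
# Sketch: map the five editorial grades to +1/-1 and fit GBDT with logistic loss.
from sklearn.ensemble import GradientBoostingClassifier

# Label mapping described in the notes: Perfect/Excellent/Good -> +1, Fair/Bad -> -1
LABEL_MAP = {"Perfect": 1, "Excellent": 1, "Good": 1, "Fair": -1, "Bad": -1}

def binarize(grades):
    """Collapse five relevance grades into the two classes used by core ranking."""
    return [LABEL_MAP[g] for g in grades]

# Toy query-url feature vectors and their editorial grades (invented values)
X = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4], [0.2, 0.1], [0.1, 0.0]]
y = binarize(["Perfect", "Excellent", "Good", "Fair", "Bad"])

# GradientBoostingClassifier's default loss is the logistic (log) loss
model = GradientBoostingClassifier(n_estimators=50).fit(X, y)
```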
3.1.2 GBDT: used to differentiate Perfect, Excellent, and Good
1) Steps:
Use different scale levels to separate Perfect, Excellent, and Good (so that Perfect samples receive relatively stronger forces towards positive infinity than Excellent ones, which in turn are stronger than Good ones)
Formula: to be added
Note: scale(label) can be set empirically to scale(Perfect)=3, scale(Excellent)=2, scale(Good/Fair/Bad)=1 in order to separate Perfect / Excellent / Good;
2) For Fair / Bad samples there is no need to distinguish levels, since their scores are always negative;
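The scale(label) trick can be made concrete. Assuming the standard logistic loss l(y, f) = log(1 + exp(−y·f)), whose negative gradient (the pseudo-response each new tree fits) is y / (1 + exp(y·f)), multiplying that response by scale(label) is one plausible way the per-grade "force" is applied; this is a sketch of my reading, not the paper's exact code:

```python
import math

# Empirical scales from the note above
SCALE = {"Perfect": 3.0, "Excellent": 2.0, "Good": 1.0, "Fair": 1.0, "Bad": 1.0}

def scaled_gradient(y, f, label):
    """Negative gradient (pseudo-response) of the logistic loss
    l(y, f) = log(1 + exp(-y * f)), multiplied by scale(label) so that
    higher grades are pushed harder towards positive infinity."""
    return SCALE[label] * y / (1.0 + math.exp(y * f))
```

At the same current score f = 0, a Perfect sample gets a response of 3 × 0.5 = 1.5 while a Good sample gets only 0.5, so Perfect results are pulled upward faster; Fair/Bad samples (y = −1) always get a negative response, which is why they need no further levels.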
3.1.3 Evaluation (the paper names this learning algorithm LogisticRank)
Compared against GBRank and LambdaMART
1) Setup:
Data: 2 million query-url pairs;
2) Results & analysis
Figures/tables: to be added;
3.2 Contextual Reranking (comparable to the tuner component in chinaso)
1) When reranking runs:
Core ranking only considers features of each individual query-url pair and ignores other contextual information (the data volume during core ranking is too large to do otherwise);
Reranking is suited to reordering the few dozen results returned by core ranking on a single machine (because the data is small, informative contextual features can be extracted);
2) Features extracted over these tens of results:
Rank: sorting URLs by the feature value in ascending order to get the rank of each URL
Mean: the mean of the feature values of the top 30 URLs
Variance: the variance of the feature values of the top 30 URLs
Normalized feature: normalizing the feature using the mean and standard deviation
Topic model feature: aggregating the topical distributions of the 30 URLs into a query topic model vector, then computing its similarity with each individual result
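The Rank / Mean / Variance / Normalized features above can be sketched as a small helper over one feature column of the top results (a toy illustration; the topic-model feature is omitted because it requires per-URL topical distributions):

```python
import statistics

def contextual_features(values):
    """Compute the contextual reranking features over one feature column
    of the top results (assumed to be at most 30 URLs)."""
    mean = statistics.mean(values)
    variance = statistics.pvariance(values)
    stdev = statistics.pstdev(values) or 1.0  # guard against zero stdev

    # Rank: position of each URL after sorting by feature value, ascending
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for r, i in enumerate(order):
        rank[i] = r + 1

    # Normalized feature: (value - mean) / stdev for each URL
    normalized = [(v - mean) / stdev for v in values]
    return {"rank": rank, "mean": mean, "variance": variance,
            "normalized": normalized}
```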
3.3 Implementation and deployment
Deploying core ranking corresponds to the leaf nodes in chinaso;
Deploying reranking corresponds to the searchroot in chinaso.