
Sent2vec: Microsoft's Sentence-Vector Toolkit

2015-05-13 10:02

About the tool:

What is sent2vec

sent2vec maps a pair of short text strings (e.g., sentences or query-answer pairs) to a pair of feature vectors in a continuous, low-dimensional space where the semantic similarity between the text strings is computed as the cosine similarity between their vectors in that space.

sent2vec performs the mapping using the Deep Structured Semantic Model (DSSM) proposed in (Huang et al. 2013), or the DSSM with convolutional-pooling structure (CDSSM) proposed in (Shen et al. 2014; Gao et al. 2014). Please cite the papers if you use sent2vec in published research.
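
The cosine similarity mentioned in the description above can be sketched in a few lines. This is only an illustration, not sent2vec's own code: the function name `cosine_similarity` and the two three-dimensional vectors are made up for the example, and real sentence vectors would come from the trained DSSM/CDSSM model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention chosen for this sketch: a zero vector is similar to nothing.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical sentence vectors, invented for illustration only.
query_vec = [0.2, 0.7, 0.1]
answer_vec = [0.1, 0.8, 0.3]
print(cosine_similarity(query_vec, answer_vec))
```

A value near 1 means the two vectors point in almost the same direction, which the tool interprets as the two texts being semantically close.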

Toolkit download:

http://research.microsoft.com/en-us/downloads/731572aa-98e4-4c50-b99d-ae3f0c9562b9/default.aspx

Slides:

http://emnlp2014.org/material/presentation-EMNLP2014002.pdf

The Deep Semantic Similarity Model (DSSM) in the slides



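Looking at the figure above, the tool turns out to be a convolutional neural network. The network's input is passed through word hashing (after word hashing, the dimensionality of the sentence features is fixed), followed by convolution and pooling (for what convolution and pooling are, see: http://blog.csdn.net/silence1214/article/details/11809947).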
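
To make the convolution-and-pooling step concrete, here is a rough sketch of the idea as I understand it: slide a window over the per-word vectors, project each window through a weight matrix with tanh, then take the max over positions. Everything in it is an assumption for illustration: the name `conv_max_pool`, the window size of 3, the toy dimensions, and the random weights. It is not the sent2vec implementation, and it expects a non-empty sentence.

```python
import math
import random

def conv_max_pool(word_vectors, weights, window=3):
    """Convolve a window over word vectors, then max-pool over positions.

    `weights` has one row per output feature; each row has window * dim
    entries. `window` is assumed to be odd.
    """
    dim = len(word_vectors[0])
    # Zero-pad both ends so each word is the centre of one window.
    pad = [[0.0] * dim] * (window // 2)
    padded = pad + word_vectors + pad
    local_features = []
    for start in range(len(word_vectors)):
        # Concatenate the vectors inside the current window.
        concat = [x for vec in padded[start:start + window] for x in vec]
        local_features.append(
            [math.tanh(sum(w * x for w, x in zip(row, concat))) for row in weights]
        )
    # Max pooling: one value per output feature, whatever the sentence length.
    return [max(column) for column in zip(*local_features)]

random.seed(0)
dim, out_dim = 4, 5  # toy sizes, chosen only for this example
weights = [[random.uniform(-1, 1) for _ in range(3 * dim)] for _ in range(out_dim)]
short_sentence = [[random.random() for _ in range(dim)] for _ in range(2)]
long_sentence = [[random.random() for _ in range(dim)] for _ in range(7)]
print(len(conv_max_pool(short_sentence, weights)),
      len(conv_max_pool(long_sentence, weights)))  # 5 5
```

The point of the max pooling is visible in the last line: a 2-word and a 7-word sentence both end up as a vector of the same length.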

The word hashing step in the slides is where a question comes up. See the figure below:



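To keep the dimensionality of the input space under control, the authors use a letter-trigram representation: each word is broken into a set of letter trigrams. This does not look workable for Chinese. After word segmentation, most Chinese words are only two or three characters long. Would a letter-trigram representation built on top of that still work well?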
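
A small sketch helps show the concern. It follows the scheme described for DSSM word hashing, where a word is wrapped in boundary marks (so "good" becomes "#good#") and then cut into overlapping three-letter pieces. The function name `letter_trigrams` is my own, and treating each Chinese character as one "letter" is an assumption made for the comparison, not something the toolkit is known to do.

```python
def letter_trigrams(word, boundary="#"):
    """Return the letter trigrams of `word` after adding boundary marks."""
    padded = boundary + word + boundary
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(letter_trigrams("good"))  # ['#go', 'goo', 'ood', 'od#']
print(letter_trigrams("中国"))   # ['#中国', '中国#']
```

An English word is split into pieces that are shared with many other words, which is what keeps the input space small. A two-character Chinese word yields only two trigrams, and each of them contains the whole word, so the trigram vocabulary would be about as large as the word vocabulary itself.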

Source: http://weibo.com/1402400261/ChhIgASO1?type=comment#_rnd1431482545348
Tags: NLP tools, sent2vec