Finding Similar Items — algorithms for computing text similarity: machine learning, vector-space cosine, NLTK, diff, Levenshtein distance
2017-02-21 12:05
Much of this topic is summarized in http://infolab.stanford.edu/~ullman/mmds/ch3.pdf ; the book at http://www-nlp.stanford.edu/IR-book/ also covers the vector space model, SVMs, and related techniques.
http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27b_ir2-vectorspace-95.pdf is devoted to the vector space model.
https://courses.cs.washington.edu/courses/cse573/12sp/lectures/17-ir.pdf also mentions other approaches, apparently statistical models similar to those used in speech recognition.
Using deep learning for document similarity: https://cs224d.stanford.edu/reports/PoulosJackson.pdf and http://www.cms.waikato.ac.nz/~ml/publications/2012/JASIST2012.pdf
An online tool that compares text similarity directly: http://www.scurtu.it/documentSimilarity.html
http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents collects several answers, including using the NLTK NLP library, diff, or scikit-learn's vector space model plus cosine similarity.
http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene also gives a cosine similarity computation.
Cosine similarity in Lucene 3: https://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53 (note: Lucene 4 and Lucene 3 compute it differently).
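One of the simpler approaches mentioned in the Stack Overflow roundup above is a diff-style comparison. As a minimal sketch (the class and method names here are my own, not from any of the linked posts), similarity can be estimated from the length of the longest common subsequence of the two texts:

```java
// LcsSimilarity.java -- diff-style similarity sketch (names are illustrative).
// Similarity = |LCS(a, b)| / max(|a|, |b|), giving a value in [0, 1].
public class LcsSimilarity {

    // Classic O(n*m) dynamic-programming LCS length.
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length()][b.length()];
    }

    static double similarity(String a, String b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        return (double) lcsLength(a, b) / Math.max(a.length(), b.length());
    }

    public static void main(String[] args) {
        // "kitten" vs "sitting" share the subsequence "ittn" (length 4).
        System.out.println(similarity("kitten", "sitting")); // 4/7
    }
}
```

For word-level rather than character-level comparison, the same DP can be run over token arrays instead of characters.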
The vector space model (http://stackoverflow.com/questions/10649898/better-way-of-calculating-document-similarity-using-lucene):

Once you've got your data components properly standardized, then you can worry about what's better: fuzzy match, Levenshtein distance, or cosine similarity (etc.)

As I told you in my comment, I think you made a mistake somewhere. The vectors actually contain the <word, frequency> pairs, not words only. Therefore, when you delete the sentence, only the frequency of the corresponding words is subtracted by 1 (the words after are not shifted). Consider the following example:

Document a: A B C A A B C. D D E A B. D A B C B A.
Document b: A B C A A B C. D A B C B A.
Vector a: A:6, B:5, C:3, D:3, E:1
Vector b: A:5, B:4, C:3, D:1, E:0

Which results in the following similarity measure:
(6*5 + 5*4 + 3*3 + 3*1 + 1*0) / (Sqrt(6^2+5^2+3^2+3^2+1^2) * Sqrt(5^2+4^2+3^2+1^2+0^2)) = 62 / (8.94427 * 7.14143) = 0.970648
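The worked example above can be reproduced directly: build <word, frequency> maps for both documents and take the cosine of the two vectors. A minimal sketch (class and method names are my own):

```java
// CosineFromCounts.java -- cosine similarity over <word, frequency> vectors,
// reproducing the worked example above (documents a and b, result 0.970648).
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineFromCounts {

    // Cosine similarity of two term-frequency vectors:
    // dot(a, b) / (|a| * |b|), iterating over the union of terms.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String t : terms) {
            int fa = a.getOrDefault(t, 0);
            int fb = b.getOrDefault(t, 0);
            dot += fa * fb;
            normA += fa * fa;
            normB += fb * fb;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Count term frequencies in a whitespace-tokenized document.
    static Map<String, Integer> counts(String doc) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : doc.split("\\s+")) {
            if (!t.isEmpty()) m.merge(t, 1, Integer::sum);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = counts("A B C A A B C D D E A B D A B C B A");
        Map<String, Integer> b = counts("A B C A A B C D A B C B A");
        System.out.printf("%.6f%n", cosine(a, b)); // prints 0.970648
    }
}
```

Note that this uses raw term frequencies, as in the example; in practice the counts are usually TF-IDF-weighted before taking the cosine.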
MoreLikeThis in Lucene:

You may want to check the MoreLikeThis feature of Lucene.

MoreLikeThis constructs a Lucene query based on terms within a document to find other similar documents in the index.

http://lucene.apache.org/java/3_0_1/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html

Sample code example (Java reference):

MoreLikeThis mlt = new MoreLikeThis(reader); // Pass the index reader
mlt.setFieldNames(new String[] {"title", "author"}); // specify the fields for similarity
Query query = mlt.like(docID); // Pass the doc id
TopDocs similarDocs = searcher.search(query, 10); // Use the searcher
if (similarDocs.totalHits == 0) {
    // Do handling
}
I have built an index in Lucene. I want, without specifying a query, to just get a score (cosine similarity or another distance?) between two documents in the index.

For example, I am getting from a previously opened IndexReader ir the documents with ids 2 and 4: Document d1 = ir.document(2); Document d2 = ir.document(4);

How can I get the cosine similarity between these two documents?

Thank you

When indexing, there's an option to store term frequency vectors.

During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.

An easier way might be to submit doc A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for doc B in the result set.
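Besides cosine similarity, the answers above also mention Levenshtein (edit) distance as an alternative for fuzzy matching. As a minimal sketch of the standard dynamic-programming algorithm (the similarity normalization is one common convention, not from the linked posts):

```java
// Levenshtein.java -- standard edit distance: minimum number of single-character
// insertions, deletions, and substitutions needed to turn a into b.
public class Levenshtein {

    // Two-row dynamic programming: O(|a|*|b|) time, O(|b|) space.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // One common way to turn the distance into a similarity in [0, 1].
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

Edit distance works well for short strings (spell checking, record linkage); for whole documents the vector-space methods above usually scale better.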
Related articles
- Python — implementing various distance measures and text similarity algorithms for machine learning
- Java text similarity with the Levenshtein Distance (edit distance) algorithm — code and explanation
- [python] My Unique JsonDiff algorithm — computing and diffing the distance between two JSON strings (part 1): the Levenshtein (edit distance) algorithm
- The vector space model (VSM): a text similarity algorithm
- Text similarity algorithms (part 2): Levenshtein distance
- Computing text similarity with the vector space model
- Text similarity computation: edit distance and longest common substring
- Computing string similarity with Levenshtein Distance + LCS
- Levenshtein similarity distance
- The Levenshtein algorithm: computing string similarity
- Machine learning series (14) — SVM notes part 2: vectors and spatial distance in SVMs
- String similarity algorithms: Levenshtein
- Edit distance and LCS explained: computing string similarity with the Levenshtein Distance algorithm
- NLP notes — text similarity: computing distances between texts
- [repost] A brief introduction to the vector space model (VSM) for document similarity
- C# article similarity: the Levenshtein edit distance algorithm (repost)
- (6) Text mining (3): TF-IDF weighting and vector space (VSM) representation of text
- word2vec word-vector training and Chinese text similarity computation