
Finding Similar Items: algorithms for computing text similarity (machine learning, vector-space cosine, NLTK, diff, Levenshtein distance)

2017-02-21 12:05 · 926 views
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf collects most of this material; the book at http://www-nlp.stanford.edu/IR-book/ also introduces the vector space model, SVMs, and related topics.
http://pages.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27b_ir2-vectorspace-95.pdf is dedicated to the vector space model.
https://courses.cs.washington.edu/courses/cse573/12sp/lectures/17-ir.pdf also mentions other approaches, apparently statistical models similar to those used in speech recognition.

Deep-learning approaches to document similarity: https://cs224d.stanford.edu/reports/PoulosJackson.pdf and http://www.cms.waikato.ac.nz/~ml/publications/2012/JASIST2012.pdf
A web page that compares two texts directly: http://www.scurtu.it/documentSimilarity.html
http://stackoverflow.com/questions/8897593/similarity-between-two-text-documents collects several answers, including using the NLTK NLP library, diff, or scikit-learn's vector space model plus cosine similarity.
http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene also covers cosine similarity computation.

Cosine similarity in Lucene 3: https://darakpanand.wordpress.com/2013/06/01/document-comparison-by-cosine-methodology-using-lucene/#more-53 (note: the computation differs between Lucene 3 and Lucene 4).

Vector space model (http://stackoverflow.com/questions/10649898/better-way-of-calculating-document-similarity-using-lucene):

Once you've got your data components properly standardized, then you can worry about what's better: fuzzy match, Levenshtein distance, or cosine similarity (etc.)
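Of the three, Levenshtein distance is the simplest to sketch. A minimal dynamic-programming version in plain Java (class and method names are mine, not from any of the linked libraries):

```java
// Levenshtein distance: the minimum number of single-character edits
// (insertions, deletions, substitutions) that turn one string into another.
public class Levenshtein {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // edits from "" to b[0..j)
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // delete all i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int subst = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(subst, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp; // roll the two rows
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

Unlike the vector-space measures below, this compares character sequences directly, so it suits short strings (titles, names) more than whole documents.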

As I told you in my comment, I think you made a mistake somewhere. The vectors actually contain
<word, frequency>
pairs, not
words
only. Therefore, when you delete the sentence, only the frequencies of the corresponding words are subtracted by 1 (the words after are not shifted). Consider the following example:

Document a:

A B C A A B C . D D E A B . D A B C B A .

Document b:

A B C A A B C . D A B C B A .

Vector a:

A: 6, B: 5, C: 3, D: 3, E: 1

Vector b:

A: 5, B: 4, C: 3, D: 1, E: 0

Which results in the following similarity measure:

(6*5 + 5*4 + 3*3 + 3*1 + 1*0) / (Sqrt(6^2+5^2+3^2+3^2+1^2) * Sqrt(5^2+4^2+3^2+1^2+0^2))
= 62 / (8.94427 * 7.14143)
= 0.970648
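The arithmetic above can be checked mechanically. A small sketch, with the two count vectors hard-coded from the example:

```java
// Cosine similarity of two dense term-count vectors:
// dot(a, b) / (|a| * |b|).
public class CosineExample {
    static double cosine(int[] a, int[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        int[] va = {6, 5, 3, 3, 1}; // A, B, C, D, E counts for document a
        int[] vb = {5, 4, 3, 1, 0}; // A, B, C, D, E counts for document b
        System.out.printf("%.6f%n", cosine(va, vb)); // prints 0.970648
    }
}
```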


MoreLikeThis in Lucene:

You may want to check the MoreLikeThis feature of Lucene.

MoreLikeThis constructs a Lucene query based on terms within a document to find other similar documents in the index.

http://lucene.apache.org/java/3_0_1/api/contrib-queries/org/apache/lucene/search/similar/MoreLikeThis.html

Sample code example (Java reference):

MoreLikeThis mlt = new MoreLikeThis(reader); // Pass the index reader
mlt.setFieldNames(new String[] {"title", "author"}); // specify the fields for similarity

Query query = mlt.like(docID); // Pass the doc id
TopDocs similarDocs = searcher.search(query, 10); // Use the searcher
if (similarDocs.totalHits == 0) {
    // Do handling
}


http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene mentions:


I have built an index in Lucene. I want, without specifying a query, just to get a score (cosine similarity or another distance?) between two documents in the index.

For example I am getting from a previously opened IndexReader ir the documents with ids 2 and 4. Document d1 = ir.document(2); Document d2 = ir.document(4);

How can I get the cosine similarity between these two documents?

Thank you

When indexing, there's an option to store term frequency vectors.

During runtime, look up the term frequency vectors for both documents using IndexReader.getTermFreqVector(), and look up document frequency data for each term using IndexReader.docFreq(). That will give you all the components necessary to calculate the cosine similarity between the two docs.

An easier way might be to submit document A as a query (adding all words to the query as OR terms, boosting each by term frequency) and look for document B in the result set.
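However the per-document term frequencies are obtained (from Lucene's stored term vectors or otherwise), the final step is just a cosine over sparse frequency maps. A minimal plain-Java sketch with no Lucene dependency (the whitespace tokenizer and all names here are mine, a stand-in for a real analyzer):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TfCosine {
    // Build a term -> frequency map by naive whitespace tokenization.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    // Cosine similarity over two sparse term-frequency maps; terms absent
    // from one document contribute a frequency of 0, as in the example above.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocab = new HashSet<>(a.keySet());
        vocab.addAll(b.keySet());
        double dot = 0, na = 0, nb = 0;
        for (String term : vocab) {
            int fa = a.getOrDefault(term, 0);
            int fb = b.getOrDefault(term, 0);
            dot += (double) fa * fb;
            na  += (double) fa * fa;
            nb  += (double) fb * fb;
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // The two documents from the worked example, as space-separated terms.
        Map<String, Integer> d1 = termFrequencies("a b c a a b c d d e a b d a b c b a");
        Map<String, Integer> d2 = termFrequencies("a b c a a b c d a b c b a");
        System.out.printf("%.6f%n", cosine(d1, d2)); // prints 0.970648
    }
}
```

The Lucene 4 class below does the same computation, but lets Lucene's analyzer and index supply the frequency maps.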

As Julia points out, Sujit Pal's example is very useful, but the Lucene 4 API has substantial changes. Here is a version rewritten for Lucene 4.

import java.io.IOException;
import java.util.*;

import org.apache.commons.math3.linear.*;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;

public class CosineDocumentSimilarity {

    public static final String CONTENT = "Content";

    private final Set<String> terms = new HashSet<>();
    private final RealVector v1;
    private final RealVector v2;

    CosineDocumentSimilarity(String s1, String s2) throws IOException {
        Directory directory = createIndex(s1, s2);
        IndexReader reader = DirectoryReader.open(directory);
        Map<String, Integer> f1 = getTermFrequencies(reader, 0);
        Map<String, Integer> f2 = getTermFrequencies(reader, 1);
        reader.close();
        v1 = toRealVector(f1);
        v2 = toRealVector(f2);
    }

    Directory createIndex(String s1, String s2) throws IOException {
        Directory directory = new RAMDirectory();
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_CURRENT);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
                analyzer);
        IndexWriter writer = new IndexWriter(directory, iwc);
        addDocument(writer, s1);
        addDocument(writer, s2);
        writer.close();
        return directory;
    }

    /* Indexed, tokenized, stored. */
    public static final FieldType TYPE_STORED = new FieldType();

    static {
        TYPE_STORED.setIndexed(true);
        TYPE_STORED.setTokenized(true);
        TYPE_STORED.setStored(true);
        TYPE_STORED.setStoreTermVectors(true);
        TYPE_STORED.setStoreTermVectorPositions(true);
        TYPE_STORED.freeze();
    }

    void addDocument(IndexWriter writer, String content) throws IOException {
        Document doc = new Document();
        Field field = new Field(CONTENT, content, TYPE_STORED);
        doc.add(field);
        writer.addDocument(doc);
    }

    double getCosineSimilarity() {
        return (v1.dotProduct(v2)) / (v1.getNorm() * v2.getNorm());
    }

    public static double getCosineSimilarity(String s1, String s2)
            throws IOException {
        return new CosineDocumentSimilarity(s1, s2).getCosineSimilarity();
    }

    Map<String, Integer> getTermFrequencies(IndexReader reader, int docId)
            throws IOException {
        Terms vector = reader.getTermVector(docId, CONTENT);
        TermsEnum termsEnum = null;
        termsEnum = vector.iterator(termsEnum);
        Map<String, Integer> frequencies = new HashMap<>();
        BytesRef text = null;
        while ((text = termsEnum.next()) != null) {
            String term = text.utf8ToString();
            int freq = (int) termsEnum.totalTermFreq();
            frequencies.put(term, freq);
            terms.add(term);
        }
        return frequencies;
    }

    RealVector toRealVector(Map<String, Integer> map) {
        RealVector vector = new ArrayRealVector(terms.size());
        int i = 0;
        for (String term : terms) {
            int value = map.containsKey(term) ? map.get(term) : 0;
            vector.setEntry(i++, value);
        }
        return (RealVector) vector.mapDivide(vector.getL1Norm());
    }
}

