Python: Evaluating How Important a Word Is Within a Set of Documents (Term Frequency and Inverse Document Frequency, TF-IDF)
2015-01-17 08:18
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 2015-1-17
@author: beyondzhou
@name: tf_idf_sample.py
'''

from tfIdf import tf, tf_idf, idf

# Enter in a query term from the corpus variable
QUERY_TERMS = ['mr.', 'green']

corpus = \
    {'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.',
     'b': 'Professor Plum has a green plant in his study.',
     'c': "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."}

for (k, v) in sorted(corpus.items()):
    print k, ':', v
print

# Score queries by calculating cumulative tf_idf score for each term in query
query_scores = {'a': 0, 'b': 0, 'c': 0}
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc])
    print 'IDF: %s' % (term, ), idf(term, corpus.values())
    print

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print 'TF-IDF(%s): %s' % (doc, term), score
        query_scores[doc] += score
    print

print "Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), )
for (doc, score) in sorted(query_scores.items()):
    print doc, score
# tfIdf.py: the module imported by tf_idf_sample.py
from math import log


def tf(term, doc, normalize=True):
    # Term frequency: occurrences of term in doc, divided by the
    # number of whitespace-separated tokens when normalize is True.
    doc = doc.lower().split()
    if normalize:
        return doc.count(term.lower()) / float(len(doc))
    else:
        return doc.count(term.lower()) / 1.0


def idf(term, corpus):
    # Number of texts in the corpus that contain the term as a token.
    num_texts_with_term = len([True for text in corpus
                               if term.lower() in text.lower().split()])

    # The tf-idf calculation involves multiplying against a tf value less
    # than 1, so it's necessary to return a value of at least 1 for
    # consistent scoring. (Multiplying two values less than 1 returns a
    # value less than each of them.)
    try:
        return 1.0 + log(float(len(corpus)) / num_texts_with_term)
    except ZeroDivisionError:
        # The term appears in no text at all.
        return 1.0


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)
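For reference, the three functions in tfIdf.py can be written as formulas. The notation is my own and not part of the original code: $t$ is the term, $d$ a document, $|d|$ its number of whitespace-separated tokens, $D$ the corpus, and $n_t$ the number of documents containing $t$. The last line works through 'mr.' in document a, which has 19 tokens and is the only one of the 3 documents containing the term.

```latex
\begin{align*}
\mathrm{tf}(t, d) &= \frac{\mathrm{count}(t, d)}{|d|} \\
\mathrm{idf}(t, D) &=
  \begin{cases}
    1 + \ln\dfrac{|D|}{n_t} & \text{if } n_t > 0 \\
    1 & \text{if } n_t = 0
  \end{cases} \\
\mathrm{tfidf}(t, d, D) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \\[1ex]
\mathrm{tfidf}(\text{mr.}, a, D) &= \frac{2}{19}\left(1 + \ln\frac{3}{1}\right)
  \approx 0.1053 \times 2.0986 \approx 0.2209
\end{align*}
```

Since 'green' appears in all three documents, its idf is $1 + \ln(3/3) = 1$, so its tf-idf score equals its tf.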
Output:

a : Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.
b : Professor Plum has a green plant in his study.
c : Miss Scarlett watered Professor Plum's green plant while he was away from his office last week.

TF(a): mr. 0.105263157895
TF(b): mr. 0.0
TF(c): mr. 0.0
IDF: mr. 2.09861228867

TF-IDF(a): mr. 0.220906556702
TF-IDF(b): mr. 0.0
TF-IDF(c): mr. 0.0

TF(a): green 0.105263157895
TF(b): green 0.111111111111
TF(c): green 0.0625
IDF: green 1.0

TF-IDF(a): green 0.105263157895
TF-IDF(b): green 0.111111111111
TF-IDF(c): green 0.0625

Overall TF-IDF scores for query 'mr. green'
a 0.326169714597
b 0.111111111111
c 0.0625
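The listing above is Python 2 (print statements). As a rough sketch only, the same scoring could look like this in Python 3. The helper name `score_query` and the single-file layout are my own assumptions, not part of the original post; the formulas mirror tfIdf.py.

```python
from math import log


def tf(term, doc):
    """Occurrences of term divided by the number of tokens in doc."""
    tokens = doc.lower().split()
    return tokens.count(term.lower()) / len(tokens)


def idf(term, texts):
    """1 + ln(N / n_t); falls back to 1.0 when no text contains the term."""
    texts = list(texts)
    n_with_term = sum(1 for text in texts
                      if term.lower() in text.lower().split())
    if n_with_term == 0:
        return 1.0
    return 1.0 + log(len(texts) / n_with_term)


def score_query(query_terms, corpus):
    """Hypothetical helper: sum tf * idf over the query terms per document."""
    texts = list(corpus.values())
    return {
        name: sum(tf(t, text) * idf(t, texts) for t in query_terms)
        for name, text in corpus.items()
    }


corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the '
         'candlestick. Mr. Green is not a very nice fellow.',
    'b': 'Professor Plum has a green plant in his study.',
    'c': "Miss Scarlett watered Professor Plum's green plant while he was "
         "away from his office last week.",
}

scores = score_query(['mr.', 'green'], corpus)

# Print documents from highest to lowest cumulative score.
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, scores[name])
```

Python 3 prints floats with more digits than Python 2 did, so the printed values will look longer than those in the output above while agreeing with them to the digits shown. As in the original, tokens come from a plain whitespace split, so punctuation stays attached ('candlestick.' and 'candlestick' are different tokens), and an empty document would raise ZeroDivisionError in `tf`.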