Computing TF-IDF in Python
2016-08-16 21:28
This example comes from Mining the Social Web.
from math import log

# XXX: Enter in a query term from the corpus variable
QUERY_TERMS = ['mr.', 'green']


def tf(term, doc, normalize=True):
    # Term frequency: how often term occurs in doc, optionally
    # divided by the number of tokens in doc.
    doc = doc.lower().split()
    if normalize:
        return doc.count(term.lower()) / float(len(doc))
    else:
        return doc.count(term.lower()) / 1.0


def idf(term, corpus):
    # Count the texts in the corpus that contain the term.
    num_texts_with_term = len([True for text in corpus
                               if term.lower() in text.lower().split()])

    # tf-idf calc involves multiplying against a tf value less than 1, so it's
    # necessary to return a value of at least 1 for consistent scoring.
    # (Multiplying two values less than 1 returns a value less than each of
    # them.)
    try:
        return 1.0 + log(float(len(corpus)) / num_texts_with_term)
    except ZeroDivisionError:
        return 1.0


def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. '
         'Mr. Green is not a very nice fellow.',
    'b': 'Professor Plum has a green plant in his study.',
    'c': "Miss Scarlett watered Professor Plum's green plant while he was away "
         "from his office last week.",
}

for (k, v) in sorted(corpus.items()):
    print(k, ':', v)
print()

# Score queries by calculating cumulative tf_idf score for each term in query
query_scores = {'a': 0, 'b': 0, 'c': 0}

for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print('TF(%s): %s' % (doc, term), tf(term, corpus[doc]))
    print('IDF: %s' % (term, ), idf(term, corpus.values()))
    print()

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print('TF-IDF(%s): %s' % (doc, term), score)
        query_scores[doc] += score
    print()

print("Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), ))
for (doc, score) in sorted(query_scores.items()):
    print(doc, score)
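
The scoring loop above can also be folded into a small helper that returns the documents ordered by their cumulative score. What follows is only a minimal sketch in Python 3, assuming the same whitespace tokenization and the same `1.0 + log(N / df)` form of idf as the listing. The names `tokenize` and `rank_documents` are hypothetical and do not come from the book, and the empty-document guard in `tf` is an addition of this sketch.

```python
from math import log


def tokenize(text):
    # Same naive tokenization as the listing: lowercase, split on whitespace.
    # Punctuation stays attached, which is why the query term is 'mr.'.
    return text.lower().split()


def tf(term, tokens):
    # Term frequency normalized by document length.
    # Assumption of this sketch: an empty document scores 0.0.
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


def idf(term, tokenized_docs):
    # 1.0 + ln(N / df), falling back to 1.0 when no document has the term.
    df = sum(1 for tokens in tokenized_docs if term.lower() in tokens)
    if df == 0:
        return 1.0
    return 1.0 + log(len(tokenized_docs) / df)


def rank_documents(query_terms, corpus):
    # Hypothetical helper: sum tf * idf over the query terms for each
    # document and return (name, score) pairs, highest score first.
    tokenized = {name: tokenize(text) for name, text in corpus.items()}
    docs = list(tokenized.values())
    scores = {
        name: sum(tf(t, tokens) * idf(t, docs) for t in query_terms)
        for name, tokens in tokenized.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


corpus = {
    'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. '
         'Mr. Green is not a very nice fellow.',
    'b': 'Professor Plum has a green plant in his study.',
    'c': "Miss Scarlett watered Professor Plum's green plant while he was away "
         "from his office last week.",
}

ranking = rank_documents(['mr.', 'green'], corpus)
for name, score in ranking:
    print(name, round(score, 4))
```

Tokenizing each document once, rather than on every call to `tf` and `idf`, avoids repeated splitting; the scores themselves are computed the same way as in the listing.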