TF-IDF Algorithm Implementation and Conversion to a Sparse Matrix
2019-05-10 12:54
[code]
from math import log10

import numpy as np


# docList is the corpus: each element is one document, given as a
# space-separated string of words
def tfidf(docList):
    docNum = len(docList)
    docList = [doc.split(' ') for doc in docList]

    # Document frequency: the number of documents that contain each term.
    # set(doc) drops repeated words, so a document counts a term at most once.
    term_idf = dict()
    for doc in docList:
        for term in set(doc):
            term_idf[term] = term_idf.get(term, 0.0) + 1.0

    # IDF: base-10 log of (number of documents / document frequency)
    for term in term_idf:
        term_idf[term] = log10(docNum / term_idf[term])
    print('all word num = ', len(term_idf))

    # term_tfidf maps doc_id -> {term: TF-IDF score}
    term_tfidf = dict()
    for doc_id, doc in enumerate(docList):
        # Term counts within this document
        term_tf = dict()
        for term in doc:
            term_tf[term] = term_tf.get(term, 0.0) + 1.0
        # Number of words in this document
        docLen = len(doc)
        term_tfidf[doc_id] = {
            term: count / docLen * term_idf[term]
            for term, count in term_tf.items()
        }

    # Vocabulary list; its order fixes the column order of the matrix below
    all_word = list(term_idf.keys())
    return term_tfidf, all_word


with open('demo.txt') as f:
    data = []
    for line in f:
        if line != '\n':
            data.append(line.strip('\n').strip('.[]()'))
print('all doc num = ', len(data))

score, all_word = tfidf(data)

# Convert the scores into a (documents x vocabulary) matrix.
# X is stored as a dense NumPy array; most of its entries are zero.
word_index = {word: i for i, word in enumerate(all_word)}
X = np.zeros((len(data), len(all_word)))
for doc_id, scores in score.items():
    for term, value in scores.items():
        X[doc_id][word_index[term]] = float(value)
print(X)
[/code]
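The matrix X above is allocated with np.zeros, so every zero is stored explicitly even though most entries are zero. One possible way to keep only the non-zero entries is the coordinate ("triplet") form: three parallel lists of row index, column index, and value. The following is a minimal sketch, not part of the original post. It assumes the scores have the {doc_id: {term: value}} shape that tfidf() returns; the names to_triplets, example_score, and example_vocab are hypothetical, and the numbers are made up for illustration.

```python
def to_triplets(score, vocab):
    """Turn {doc_id: {term: value}} into parallel (rows, cols, vals) lists."""
    col_of = {term: j for j, term in enumerate(vocab)}  # term -> column index
    rows, cols, vals = [], [], []
    for doc_id in sorted(score):
        for term, value in score[doc_id].items():
            # Store only non-zero entries for terms that are in the vocabulary
            if value != 0.0 and term in col_of:
                rows.append(doc_id)
                cols.append(col_of[term])
                vals.append(float(value))
    return rows, cols, vals


# Hypothetical scores for two documents (illustrative values only)
example_score = {
    0: {'apple': 0.15, 'pie': 0.0},
    1: {'banana': 0.30},
}
example_vocab = ['apple', 'pie', 'banana']

rows, cols, vals = to_triplets(example_score, example_vocab)
print(list(zip(rows, cols, vals)))
```

A term that appears in every document has an IDF of log10(1) = 0, so its TF-IDF score is zero and it is skipped here. If SciPy is available, the same three lists can be passed to scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(n_docs, n_terms)) to get a sparse matrix object.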