
Topic Model: implementing LDA with gensim

Homepage: http://radimrehurek.com/gensim/models/ldamodel.html

gensim is a free Python library that automatically extracts semantic topics from documents. The algorithms in gensim include LSA (Latent Semantic Analysis), LDA (Latent Dirichlet Allocation), and RP (Random Projections). By examining statistical co-occurrence patterns of words within a corpus of training documents, they can uncover the semantic structure of the documents. These algorithms are unsupervised and work on raw, unstructured text ("plain text").
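To make that workflow concrete, here is a minimal sketch (not from the original post; the toy documents and num_topics=2 are invented purely for illustration) of training an LDA model with gensim on a tiny hand-made corpus:

from gensim import corpora, models

# three tiny, already-tokenized "documents" (made up for illustration)
texts = [["human", "machine", "interface", "computer"],
         ["graph", "trees", "minors", "survey"],
         ["human", "computer", "survey", "graph"]]

dictionary = corpora.Dictionary(texts)             # word <-> integer id mapping
corpus = [dictionary.doc2bow(t) for t in texts]    # each doc -> sparse bag-of-words vector
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
print(lda.show_topic(0))                           # top words of topic 0 with their weights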

Gensim is a fairly specialized topic-modeling toolkit for Python. In text processing, for example when mining product reviews, you sometimes want to know how similar each review is to the product description, as a proxy for how objective the review is. The more similar a review is to the description, the more "official" its wording tends to be: less emotional, more focused on the product's attributes and features, and therefore more objective in perspective.
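As a rough illustration of that idea (not the author's code; the example sentences are invented), the sketch below represents a product description and a review as sparse bag-of-words vectors and compares them with gensim's cosine-similarity helper:

from gensim import corpora, matutils

description = "lightweight laptop with long battery life".split()
review = "battery life is long and the laptop is lightweight".split()

dictionary = corpora.Dictionary([description, review])
vec_desc = dictionary.doc2bow(description)
vec_rev = dictionary.doc2bow(review)

# cosine similarity between the two sparse vectors; values closer to 1.0
# suggest the review sticks to the wording of the official description
print(matutils.cossim(vec_desc, vec_rev))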

Gensim: implementation language: Python; implemented models: LDA, Dynamic Topic Model, Dynamic Influence Model, HDP, LSI, Random Projections, and the deep-learning models word2vec and paragraph2vec. GitHub repository: https://github.com/piskvorky/gensim
Memory independence: the entire training corpus never needs to reside in RAM at any one time
Efficient implementations of several popular vector space algorithms, including tf-idf, distributed LSA, distributed LDA, and RP; adding new algorithms is easy
I/O wrappers and converters for popular data formats
Similarity queries over documents in their semantic representation
gensim was created to fill the gap left by the absence of a simple (the Java alternatives are complex), scalable software framework for topic modeling.

The gensim package revolves around three concepts: corpus, vector, and model.
Corpus

A collection of documents used to automatically infer the structure of the documents, their topics, and so on; it is also called the training corpus.

Vector

In the vector space model (VSM), each document is represented as an array of features. For example, a single feature can be thought of as a question-answer pair:

[1] How many times does the word "splonge" appear in the document? 0

[2] How many sentences does the document contain? 2

[3] How many fonts does the document use? 5

The questions can be represented by integer ids (e.g. 1, 2, 3), so the document above becomes (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we can leave them implicit and simply write (0.0, 2.0, 5.0). This sequence of answers can be thought of as a vector (3-dimensional in this case). For practical purposes, only questions whose answer is a single real number are used.

The questions are the same for every document, so given two vectors (each representing a document) we would like to be able to conclude: "if the numbers in the two vectors are similar, then the original documents are similar as well". Of course, whether such conclusions hold depends on how well we choose our questions.

Sparse vector

Typically, the answer to most questions is 0.0. To save space, we omit those entries from the document's representation and write only (2, 2.0), (3, 5.0) (note that (1, 0.0) is dropped). Since the full set of questions is known in advance, any feature missing from a sparse vector representation can be taken to be 0.0.
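To tie the notation together, here is a small hand-rolled sketch (nothing gensim-specific; the ids 1-3 simply follow the example's question numbering):

# dense form: every question answered, zeros kept
dense = {1: 0.0, 2: 2.0, 3: 5.0}

# sparse form: (question_id, answer) pairs, zero entries dropped
sparse = [(2, 2.0), (3, 5.0)]

# any id missing from the sparse form is implicitly 0.0
recovered = {qid: 0.0 for qid in dense}
for qid, value in sparse:
    recovered[qid] = value
assert recovered == dense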

What makes gensim special is that it does not prescribe any particular corpus format; a corpus can be anything that, when iterated over, yields these sparse vectors. For example, the collection ([(2, 2.0), (3, 5.0)], ([0, -1.0], [3, -1.0])) is a corpus of two documents, each with two non-zero pairs.
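Because the only requirement is that iterating over a corpus yields sparse vectors, a corpus can be streamed from disk one document at a time, which is what makes the memory independence mentioned above possible. A hedged sketch (the file name mycorpus.txt and the one-document-per-line format are assumptions made up for illustration):

from gensim import corpora

class MyStreamingCorpus:
    """Yield one sparse bag-of-words vector per line of a text file,
    so the whole collection never has to sit in RAM at once."""
    def __init__(self, path, dictionary):
        self.path = path
        self.dictionary = dictionary

    def __iter__(self):
        with open(self.path, encoding='utf8') as f:
            for line in f:
                yield self.dictionary.doc2bow(line.lower().split())

# write a tiny throwaway file so the example is self-contained
with open('mycorpus.txt', 'w', encoding='utf8') as f:
    f.write("human machine interface\ngraph of trees\n")

dictionary = corpora.Dictionary(line.lower().split()
                                for line in open('mycorpus.txt', encoding='utf8'))
corpus = MyStreamingCorpus('mycorpus.txt', dictionary)
for sparse_vector in corpus:   # documents are converted on the fly, one at a time
    print(sparse_vector)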

Model

For our purposes, a model is a transformation that converts one document representation into another. Both the initial and the target representations are vectors; they differ only in which questions and answers are involved. The transformation is learned automatically from the training corpus, without human supervision, and the resulting document representation is more compact and more useful: similar documents end up with similar representations.
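A minimal sketch of this transformation idea (not from the original post; the toy corpus is made up), using gensim's TF-IDF model, which maps bag-of-words vectors to TF-IDF vectors:

from gensim import corpora, models

texts = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# learn the transformation from the corpus (no supervision needed)
tfidf = models.TfidfModel(bow_corpus)

# applying the model converts a bag-of-words vector into a TF-IDF vector
print(tfidf[dictionary.doc2bow(["cat", "sat"])])

The full 20 Newsgroups LDA script from the original post follows.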

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = 'topic model - build lda - 20news dataset'
__author__ = 'pi'
__mtime__ = '12/26/2014-026'
# code is far away from bugs with the god animal protecting
I love animals. They taste delicious.
┏┓      ┏┓
┏┛┻━━━┛┻┓
┃      ☃      ┃
┃  ┳┛  ┗┳  ┃
┃      ┻      ┃
┗━┓      ┏━┛
┃      ┗━━━┓
┃  神兽保佑    ┣┓
┃ 永无BUG!   ┏┛
┗┓┓┏━┳┓┏┛
┃┫┫  ┃┫┫
┗┻┛  ┗┻┛
"""
from Colors import *  # author's local helper: terminal colour constants (REDH, RED, REDL, GREENL, DEFAULT)
from collections import defaultdict
import re
import datetime
from sklearn import datasets
import nltk
from gensim import corpora
from gensim import models
import numpy as np
from scipy import spatial
from CorePyPro.Fun.TimeStump import totalTime  # author's local helper; not used below

def load_texts(dataset_type='train', groups=None):
    """
    load datasets to bytes list
    :return: train_dataset_bunch.data bytes list
    """
    if groups == 'small':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc']  # only for quick tests on a small subset  #1368
    elif groups == 'medium':
        groups = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                  'comp.windows.x', 'sci.space']  # medium-sized subset  #3414
    train_dataset_bunch = datasets.load_mlcomp('20news-18828', dataset_type, mlcomp_root='./datasets',
                                               categories=groups)  # 13180
    return train_dataset_bunch.data

def preprocess_texts(texts, test_doc_id=1):
    """
    texts preprocessing
    :param texts: bytes list
    :return: list of word lists (lowercased, filtered, stemmed)
    """
    texts = [t.decode(errors='ignore') for t in texts]  # bytes2str
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', texts[test_doc_id])
    # split_texts = [t.lower().split() for t in texts]
    # print(REDH, 'split texts[%d]: #%d' % (test_doc_id, len(split_texts)), DEFAULT, '\n', split_texts[test_doc_id])

    # lowercase & split each text into a word list on the separators below & drop empty tokens
    SEPS = r'[\s()-/,:.?!]\s*'  # note: ')-/' is a character range covering ) * + , - . /
    texts = [re.split(SEPS, t.lower()) for t in texts]
    for t in texts:
        while '' in t:
            t.remove('')
    # print(REDH, 'texts[%d] lower & split(seps= %s) & delete None: #%d' % (test_doc_id, SEPS, len(texts[test_doc_id])), DEFAULT, '\n', texts[test_doc_id])

    # nltk.download()   # then choose the corpus.stopwords
    stopwords = set(nltk.corpus.stopwords.words('english'))  # #127
    stopwords.update(['from', 'subject', 'writes'])  # #129
    word_usage = defaultdict(int)
    for t in texts:
        for w in t:
            word_usage[w] += 1
    COMMON_LINE = len(texts) / 10
    # words whose total count exceeds one tenth of the number of documents are treated as stopwords
    too_common_words = [w for w in word_usage if word_usage[w] > COMMON_LINE]
    # print('too_common_words: #', len(too_common_words), '\n', too_common_words)   #68
    stopwords.update(too_common_words)
    # print('stopwords: #', len(stopwords), '\n', stopwords)  #   #147

    english_stemmer = nltk.SnowballStemmer('english')
    MIN_WORD_LEN = 3  # 4
    texts = [[english_stemmer.stem(w) for w in t if
              not set(w) & set('@+>0123456789*') and w not in stopwords and len(w) >= MIN_WORD_LEN] for t in
             texts]  # set('+-.?!()>@0123456789*/')
    # print(REDH, 'texts[%d] delete ^alphanum & stopwords & len<%d & stemmed: #' % (test_doc_id, MIN_WORD_LEN),
    #       len(texts[test_doc_id]), DEFAULT, '\n', texts[test_doc_id])
    return texts

def build_corpus(texts):
    """
    build corpora
    :param texts: list of word lists
    :return: corpus DirectTextCorpus(corpora.TextCorpus)
    """

    class DirectTextCorpus(corpora.TextCorpus):
        def get_texts(self):
            return self.input

        def __len__(self):
            return len(self.input)

    corpus = DirectTextCorpus(texts)
    return corpus

def build_id2word(corpus):
    """
    from corpus build id2word=dict
    :param corpus:
    :return: dict = corpus.dictionary
    """
    dict = corpus.dictionary  # gensim.corpora.dictionary.Dictionary
    # print(dict.id2token)
    try:
        dict['anything']  # looking up any key forces the dictionary to populate id2token
    except Exception:
        pass
        # print("dict.id2token is not {} now")
    # print(dict.id2token)
    return dict

def save_corpus_dict(dict, corpus, dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict.save(dictDir)
    print(GREENL, 'dict saved into %s successfully ...' % dictDir, DEFAULT)
    corpora.MmCorpus.serialize(corpusDir, corpus)
    print(GREENL, 'corpus saved into %s successfully ...' % corpusDir, DEFAULT)
    # corpus.save(fname='./LDA/corpus.mm')  # stores only the (tiny) iteration object

def load_ldamodel(modelDir='./lda.pkl'):
    model = models.LdaModel.load(fname=modelDir)
    print(GREENL, 'ldamodel load from %s successfully ...' % modelDir, DEFAULT)
    return model

def load_corpus_dict(dictDir='./LDA/id_word.dict', corpusDir='./LDA/corpus.mm'):
    dict = corpora.Dictionary.load(fname=dictDir)
    print(GREENL, 'dict load from %s successfully ...' % dictDir, DEFAULT)
    # dict = corpora.Dictionary.load_from_text('./id_word.txt')
    corpus = corpora.MmCorpus(corpusDir)  # corpora.mmcorpus.MmCorpus
    print(GREENL, 'corpus load from %s successfully ...' % corpusDir, DEFAULT)
    return dict, corpus

def build_doc_word_mat(corpus, model, num_topics):
    """
    build the document-topic matrix in topic space
    :param corpus:
    :param model:
    :param num_topics: int
    :return: doc_word_mat np.array of shape (len(corpus), num_topics)
    """
    topics = [model[c] for c in corpus]  # one (topic_id, weight) list per document
    doc_word_mat = np.zeros((len(topics), num_topics))
    for doc, topic in enumerate(topics):
        for topic_id, weight in topic:
            doc_word_mat[doc, topic_id] += weight
    return doc_word_mat

def compute_pairwise_dist(doc_word_mat):
    """
    compute pairwise dist
    :param doc_word_mat: np.array of shape (num_docs, num_topics)
    :return: pairwise_dist <class 'numpy.ndarray'>
    """
    pairwise_dist = spatial.distance.squareform(spatial.distance.pdist(doc_word_mat))
    max_weight = pairwise_dist.max() + 1
    for i in list(range(len(pairwise_dist))):
        pairwise_dist[i, i] = max_weight  # so that a document is never its own nearest neighbour
    return pairwise_dist

def closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=5):
    """
    find the closest_doc_ids for doc[test_doc_id]
    :param corpus:
    :param model:
    :param num_topics:
    :param original_texts: raw texts, used only for printing the matches
    :param test_doc_id:
    :param topn:
    :return:
    """
    doc_word_mat = build_doc_word_mat(corpus, model, num_topics)
    pairwise_dist = compute_pairwise_dist(doc_word_mat)
    # print(REDH, 'original texts[%d]: ' % test_doc_id, DEFAULT, '\n', original_texts[test_doc_id])
    closest_doc_ids = pairwise_dist[test_doc_id].argsort()
    # return closest_doc_ids[:topn]
    for closest_doc_id in closest_doc_ids[:topn]:
        print(RED, 'closest doc[%d]' % closest_doc_id, DEFAULT, '\n', original_texts[closest_doc_id])

def evaluate_model(model):
    """
    compute the model's perplexity on the test data
    :param model:
    :return: model.log_perplexity float
    """
    test_texts = load_texts(dataset_type='test', groups='small')
    test_texts = preprocess_texts(test_texts)
    test_corpus = build_corpus(test_texts)
    # note: the test corpus is built with its own dictionary, so its word ids are not
    # guaranteed to line up with the training dictionary
    return model.log_perplexity(test_corpus)

def test_num_topics():
    dict, corpus = load_corpus_dict()
    print("#corpus_items:", len(corpus))
    for num_topics in [3, 5, 10, 30, 50, 100, 150, 200, 300]:
        start_time = datetime.datetime.now()
        model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)
        end_time = datetime.datetime.now()
        print("total running time = ", end_time - start_time)
        print(REDL, 'model.log_perplexity for test_texts with num_topics=%d : ' % num_topics, evaluate_model(model),
              DEFAULT)

def test():
    texts = load_texts(dataset_type='train', groups='small')
    original_texts = texts
    test_doc_id = 1

    # texts = preprocess_texts(texts, test_doc_id=test_doc_id)
    # corpus = build_corpus(texts=texts)  # corpus DirectTextCorpus(corpora.TextCorpus)
    # dict = build_id2word(corpus)
    # save_corpus_dict(dict, corpus)
    dict, corpus = load_corpus_dict()
    # print(len(corpus))

    num_topics = 100
    model = models.LdaModel(corpus, num_topics=num_topics, id2word=dict)  # results differ from run to run
    model.show_topic(0)
    # model.save(fname='./lda.pkl')

    # model = load_ldamodel()
    # closest_texts(corpus, model, num_topics, original_texts, test_doc_id=1, topn=3)

    print(REDL, 'model.log_perplexity for test_texts', evaluate_model(model), DEFAULT)

if __name__ == '__main__':
    test()
    # test_num_topics()


Using the gensim Python package

from:http://blog.csdn.net/pipisorry/article/details/46447561

ref: [Gensim official tutorial translation (1): Quickstart]

[Gensim LDA topic model experiments]