
Artificial Intelligence with Python: Chapter 10, NLP, Day 7: Topic Models

2018-02-27 21:41

Document topic generation models

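A topic model is a statistical model used to discover abstract topics in a collection of documents. If a text contains multiple topics, this technique can identify and separate them. Doing so uncovers the hidden thematic structure of a given set of texts.
Topic modeling helps us organize documents in a way that lends itself to analysis. Notably, topic modeling algorithms do not need any labeled data. Like other unsupervised learning methods, they identify patterns on their own. This matters for the huge volume of text produced on the web, because topic modeling lets us summarize data at a scale that no person could work through by hand.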

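LDA (Latent Dirichlet Allocation) is a generative topic model for documents. It is also described as a three-level Bayesian probabilistic model, with words, topics, and documents as the three levels. "Generative" means we assume that every word of a document is produced by a process of "choosing a topic with some probability, then choosing a word from that topic with some probability". The document-to-topic distribution and the topic-to-word distribution are both multinomial, and LDA places Dirichlet priors on them, which is where the name comes from. LDA is an unsupervised machine learning technique that can be used to identify the topics hidden in a large document collection or corpus.

LDA uses the bag-of-words approach, which treats each document as a vector of word counts and so turns text into numeric data that is easy to model. The bag-of-words approach ignores the order of the words. This simplifies the problem, and it also leaves room for improved models. Each document is represented as a probability distribution over topics, and each topic is in turn a probability distribution over words. In this section we will use a library called gensim, which we already installed in the first section.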
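The generative process described above can be sketched in a few lines. This is a toy illustration only, not how gensim fits a model: the names `sample_dirichlet` and `generate_document`, the four-word vocabulary, the two topic-word distributions, and the prior `alpha` are all assumptions made up for this example. It draws a topic mixture for one document, then draws a topic and a word for each position.

```python
import random

def sample_dirichlet(alpha, rng):
    # A Dirichlet sample can be obtained by normalizing independent Gamma draws
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    # Draw this document's topic mixture (document -> topic distribution)
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Choose a topic according to the document's topic mixture
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Choose a word according to that topic's word distribution
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words

# Hypothetical toy setup: topic 0 is "history", topic 1 is "mathematics"
vocab = ['empire', 'war', 'set', 'number']
topic_word = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, [0.5, 0.5], 8, rng)
print('topic mixture:', theta)
print('generated words:', words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that most plausibly produced them.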
The implementation is as follows:

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora

# Load input data, one document per line
def load_data(input_file):
    data = []
    with open(input_file, 'r') as f:
        for line in f:
            # Strip only the trailing newline, so the last line keeps
            # its final character even if the file has no newline at the end
            data.append(line.rstrip('\n'))

    return data

# Processor function for tokenizing, removing stop
# words, and stemming
def process(input_text):
    # Create a regular expression tokenizer
    tokenizer = RegexpTokenizer(r'\w+')

    # Create a Snowball stemmer
    stemmer = SnowballStemmer('english')

    # Get the list of stop words
    # (needs a one-time nltk.download('stopwords'))
    stop_words = stopwords.words('english')

    # Tokenize the input string
    tokens = tokenizer.tokenize(input_text.lower())

    # Remove the stop words
    tokens = [x for x in tokens if x not in stop_words]

    # Perform stemming on the tokenized words
    tokens_stemmed = [stemmer.stem(x) for x in tokens]

    return tokens_stemmed

if __name__=='__main__':
    # Load input data
    data = load_data('data.txt')

    # Create a list for sentence tokens
    tokens = [process(x) for x in data]

    # Create a dictionary based on the sentence tokens
    dict_tokens = corpora.Dictionary(tokens)

    # Create a document-term matrix
    doc_term_mat = [dict_tokens.doc2bow(token) for token in tokens]

    # Define the number of topics for the LDA model
    num_topics = 2

    # Generate the LDA model
    ldamodel = models.ldamodel.LdaModel(doc_term_mat,
            num_topics=num_topics, id2word=dict_tokens, passes=25)

    num_words = 5
    print('\nTop ' + str(num_words) + ' contributing words to each topic:')
    for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
        print('\nTopic', item[0])

        # Print the contributing words along with their relative contributions
        list_of_strings = item[1].split(' + ')
        for text in list_of_strings:
            # Each entry has the form weight*"word"
            weight, word = text.split('*')
            print(word, '==>', str(round(float(weight) * 100, 2)) + '%')

data.txt

The Roman empire expanded very rapidly and it was the biggest empire in the world for a long time.
An algebraic structure is a set with one or more finitary operations defined on it that satisfies a list of axioms.
Renaissance started as a cultural movement in Italy in the Late Medieval period and later spread to the rest of Europe.
The line of demarcation between prehistoric and historical times is crossed when people cease to live only in the present.
Mathematicians seek out patterns and use them to formulate new conjectures.
A notational symbol that represents a number is called a numeral in mathematics.
The process of extracting the underlying essence of a mathematical concept is called abstraction.
Historically, people have frequently waged wars against each other in order to expand their empires.
Ancient history indicates that various outside influences have helped formulate the culture and traditions of Eastern Europe.
Mappings between sets which preserve structures are of special interest in many fields of mathematics.

Results



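We can see that the model does a good job of separating the text into two topics, mathematics and history. If you read through the text file, you can verify this result.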
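To make the bag-of-words step concrete, here is a minimal standard-library sketch of what the dictionary and document-term matrix in the script represent. It is not gensim's implementation: the helpers `build_vocabulary` and `to_bow` and the two stemmed token lists are hypothetical, and ids are assigned here in first-seen order, which may differ from the ids that `corpora.Dictionary` assigns.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    # Map each distinct token to an integer id, in first-seen order
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(doc, vocab):
    # Count how often each known token id occurs; word order is discarded
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Hypothetical stemmed tokens for two short documents
docs = [
    ['empir', 'expand', 'empir'],
    ['set', 'oper', 'set', 'axiom'],
]

vocab = build_vocabulary(docs)
bows = [to_bow(d, vocab) for d in docs]

print(vocab)
print(bows)
```

Each document becomes a sparse list of (token id, count) pairs, which is the same shape of input that the LDA model in the script is trained on.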