
text_mining_tutorial


Working with Text Data

Goal: classify a collection of newsgroup posts by topic.

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

from sklearn.datasets import fetch_20newsgroups

from sklearn.datasets import load_files



Problem statement

The code here is not exactly the original; see the scikit-learn text-mining tutorial site for the source.

Because the original code can fail with a server-connection error while downloading the data, a workaround is used:

manually download the dataset, then load it with a function; the two required parameters are shown in the code below.
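
Since load_files is imported above as the fallback loader, here is a minimal sketch of how a hand-downloaded copy could be read with it. The directory path and the encoding choice are assumptions for illustration, not from the original post:

# Sketch: load a manually extracted 20 Newsgroups training folder.
# load_files expects one sub-folder per category under container_path.
twenty_train = load_files('20news-bydate/20news-bydate-train',  # hypothetical local path
                          categories=categories,   # keep only the four topics above
                          encoding='latin-1',      # decode raw bytes into str
                          shuffle=True, random_state=42)

The rest of the tutorial instead uses fetch_20newsgroups, which downloads and caches the data automatically: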

twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories, shuffle=True, random_state=42)


type(twenty_train)


sklearn.datasets.base.Bunch


twenty_train.target_names


['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']


len(twenty_train.data)


2257


len(twenty_train.filenames)


2257


type(twenty_train.data[0])


str


print("\n".join(twenty_train.data[0].split("\n")[:3]))


From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton


print(twenty_train.target_names[twenty_train.target[0]])


comp.graphics


print(twenty_train.data[0])


From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format. We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance. Michael.
--
Michael Collier (Programmer) The Computer Unit,
Email: M.P.Collier@uk.ac.city The City University,
Tel: 071 477-8000 x3769 London,
Fax: 071 477-8565 EC1V 0HB.


twenty_train.target[0]


1


for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])


comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


twenty_train.target


array([1, 1, 3, ..., 2, 2, 2])


from sklearn.feature_extraction.text import CountVectorizer


count_vect = CountVectorizer()


X_train_counts = count_vect.fit_transform(twenty_train.data)


X_train_counts.shape


(2257, 35788)


count_vect.vocabulary_.get(u'algorithm')


4690


Extracting features from text files

To run machine learning on text, the documents must first be converted into numerical feature vectors.

Bags of words

1. Assign a fixed integer ID to each word occurring in the training set (building a dictionary of words).

2. For each document, count the occurrences of each word and store them as
X[i, j]
where i is the document index and j is the word's index in the dictionary; in other words,

X[document index, dictionary index] gives that word's count.

In this bag-of-words view,
n_features
is the number of distinct words, i.e., each column of the dataset corresponds to one word.

- The number of such feature columns typically exceeds 100,000.

If
n_samples
is 10,000, storing the data densely would take 10000 × 100000 × 4 bytes = 4 GB in RAM.

Bag-of-words matrices are mostly zeros, making them high-dimensional sparse datasets; keeping only the non-zero entries saves a great deal of memory.

scipy.sparse matrices do exactly that, and scikit-learn has built-in support for them.
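
To make the X[i, j] indexing and the sparsity claim concrete, here is a small check (a sketch reusing count_vect and X_train_counts built above; the word 'algorithm' is just an example):

# X[i, j] holds the count of word j in document i, stored sparsely.
j = count_vect.vocabulary_.get('algorithm')  # column index of 'algorithm'
print(X_train_counts[0, j])                  # its count in document 0

# Fraction of cells actually stored (the non-zero entries):
density = X_train_counts.nnz / float(X_train_counts.shape[0] * X_train_counts.shape[1])
print('density: %.4f' % density)             # well under 1% of cells are non-zero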

Processing the data

Turning text into the bag-of-words form described above is done with

CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer


count_vect = CountVectorizer()


count_vect.fit_transform(twenty_train.data)


<2257x35788 sparse matrix of type '<class 'numpy.int64'>'
with 365886 stored elements in Compressed Sparse Row format>


X_train_counts = count_vect.fit_transform(twenty_train.data)

X_train_counts.shape


(2257, 35788)


CountVectorizer builds a vocabulary dictionary of the words, so that individual words can be looked up by index:

count_vect.vocabulary_.get('algorithm')


4690


From occurrence counts to frequencies

Counting occurrences is a good start, but there is a catch:

for documents on the same topic, longer texts will have higher counts across the board.

1. Fix 1

- Divide each word's count by the total number of words in the document, and use that as the frequency.

- This quantity is called tf (term frequency).

2. Fix 2

- Downweight words that appear in many documents (common words carry little information).

- This refinement is called tf-idf (term frequency times inverse document frequency).

from sklearn.feature_extraction.text import TfidfTransformer


A few concepts to pin down:

1. term frequency

- Wikipedia's definition:

- f(t, d): the number of times term t occurs in document d

- tf(t, d) = f(t, d)

- t: a term; d: a single document

- D: the full set of documents

inverse document frequency

idf(t, D) = log( total number of documents / number of documents containing the term )

term frequency-inverse document frequency

tfidf(t, d, D) = tf(t, d) × idf(t, D)
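
A tiny worked example of these formulas, using a made-up three-document corpus (a sketch, not part of the tutorial):

import math

# Hypothetical toy corpus: three tokenized documents.
docs = [['graphics', 'image', 'graphics'],   # d0: 'graphics' occurs twice
        ['god', 'love'],
        ['graphics', 'gpu']]

t, d0 = 'graphics', docs[0]
tf = d0.count(t)                               # f(t, d0) = 2
n_containing = sum(1 for d in docs if t in d)  # 2 of the 3 documents contain t
idf = math.log(len(docs) / float(n_containing))  # log(3 / 2)
print(tf * idf)                                # tfidf(t, d0, D) = 2 × log(1.5) ≈ 0.811

Note that scikit-learn's TfidfTransformer uses a smoothed, normalized variant of this formula, so its values will differ slightly from the hand computation.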

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)


X_train_tf = tf_transformer.transform(X_train_counts)


X_train_tf.shape


(2257, 35788)


Using fit_transform

The two previous steps (fit, then transform) can be combined into a single fit_transform call:

tfidf_transformer = TfidfTransformer()


X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


X_train_tfidf.shape


(2257, 35788)


Training a classifier

The steps above turned the raw data into bag-of-words form and then reweighted it to be more informative.

The next step is to train a classifier that can categorize the data:

from sklearn.naive_bayes import MultinomialNB


clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)


docs_new = ['God is love', 'OpenGL on the GPU is fast']


X_new_counts = count_vect.transform(docs_new)


X_new_tfidf = tfidf_transformer.transform(X_new_counts)


predicted = clf.predict(X_new_tfidf)


X_new_counts


<2x35788 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Row format>


X_new_counts.toarray()


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])


for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))


'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


predicted


array([3, 1])


[i for i in zip(docs_new, predicted)]


[('God is love', 3), ('OpenGL on the GPU is fast', 1)]


Using a Pipeline

To make the vectorizer => transformer => classifier sequence easier to work with, we use a Pipeline.

A Pipeline is a compound classifier that bundles all of the stages above.

from sklearn.pipeline import Pipeline


text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])


Now the whole classifier can be trained with a single line of code:

text_clf = text_clf.fit(twenty_train.data, twenty_train.target)


Measuring accuracy

Evaluating predictive accuracy on the test set is straightforward:

import numpy as np


twenty_test = fetch_20newsgroups(subset='test',
                                 categories=categories,
                                 shuffle=True,
                                 random_state=42)


docs_test = twenty_test.data


predicted = text_clf.predict(docs_test)


np.mean(predicted == twenty_test.target)


0.83488681757656458


predicted == twenty_test.target


array([ True,  True, False, ...,  True,  True,  True], dtype=bool)


Using a support vector machine (SVM)

SVMs are widely regarded as among the best text-classification algorithms,

though they are slower to train than naive Bayes.

Swapping one component of the pipeline is all it takes:

from sklearn.linear_model import SGDClassifier


text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge',
                                           penalty='l2',
                                           alpha=1e-3,
                                           n_iter=5,
                                           random_state=42))])


_ = text_clf.fit(twenty_train.data, twenty_train.target)


predicted = text_clf.predict(docs_test)


np.mean(predicted == twenty_test.target)


0.9127829560585885


Predictive accuracy improves from about 83% to 91%.

For a more detailed report on the classifier's results:

from sklearn import metrics


print(metrics.classification_report(twenty_test.target,
                                    predicted,
                                    target_names=twenty_test.target_names))


                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502


metrics.confusion_matrix(twenty_test.target, predicted)


array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])


The confusion matrix

In the confusion matrix, larger off-diagonal values mean more samples confused between two classes; each row sums to the number of true samples in that class, and each column sums to the number of samples predicted as that class.
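
A quick sanity check of that reading, reusing the objects above (a sketch):

import numpy as np
cm = metrics.confusion_matrix(twenty_test.target, predicted)
print(cm.sum(axis=1))  # row sums = true samples per class: [319 389 396 398]
print(cm.sum(axis=0))  # column sums = predictions per class
print(np.trace(cm))    # diagonal sum = 1371 correct out of 1502 (~91%)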

Because only four categories were used in this experiment, the detailed report above stays small; with the full dataset used earlier, both the number of samples and the number of classes are far larger than here.
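
For comparison, loading the training split with no category filter gives the full problem (a sketch; the counts are those of the standard 20newsgroups-bydate train split):

all_train = fetch_20newsgroups(subset='train')  # no categories argument: all groups
print(len(all_train.target_names))              # 20 classes instead of 4
print(len(all_train.data))                      # 11314 training documents vs. 2257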

Parameter tuning with grid search

Algorithms usually expose many parameters, and the best settings can be searched for with

from sklearn.grid_search import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
              }


GridSearchCV

searches for the best parameters by trying every candidate combination. Because the estimator above is a Pipeline, keys in the parameter dictionary take the step__parameter form (step name, double underscore, parameter name).

Passing n_jobs=-1 detects the number of CPU cores and runs the search in parallel on all of them.

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)


gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)


twenty_train.target_names[gs_clf.predict(['God is love'])[0]]


'soc.religion.christian'


gs_clf.grid_scores_


[mean: 0.90430, std: 0.00570, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': True, 'clf__alpha': 0.01},
mean: 0.92113, std: 0.01206, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': True, 'clf__alpha': 0.01},
mean: 0.81303, std: 0.01682, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': False, 'clf__alpha': 0.01},
mean: 0.83562, std: 0.02234, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': False, 'clf__alpha': 0.01},
mean: 0.96544, std: 0.00329, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': True, 'clf__alpha': 0.001},
mean: 0.95968, std: 0.00641, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': True, 'clf__alpha': 0.001},
mean: 0.92158, std: 0.00284, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': False, 'clf__alpha': 0.001},
mean: 0.93088, std: 0.00290, params: {'vect__ngram_range': (1, 2), 'tfidf__use_idf': False, 'clf__alpha': 0.001}]


Each entry of the list is an sklearn class:

type(gs_clf.grid_scores_[0])


sklearn.grid_search._CVScoreTuple


gs_clf.grid_scores_[0]


mean: 0.90430, std: 0.00570, params: {'vect__ngram_range': (1, 1), 'tfidf__use_idf': True, 'clf__alpha': 0.01}


type(gs_clf.grid_scores_)


list


best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))


clf__alpha: 0.001
tfidf__use_idf: True
vect__ngram_range: (1, 1)


score


0.96544085068675234
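
As a side note, in this sklearn version the fitted GridSearchCV object also exposes the winning configuration directly, without scanning grid_scores_ by hand:

print(gs_clf.best_score_)   # 0.96544..., the same best mean score as above
print(gs_clf.best_params_)  # the parameter dict that achieved it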


This concludes the text-mining tutorial; next, a worked example will put these techniques into practice.