
scikit-learn: 0.3. Extracting features from text files (tf, tf-idf) and training a classifier

2015-07-12 20:52
The previous post covered how to load the data.

This post follows: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

It mainly covers the following parts:

Extracting features from text files

Training a classifier



Before running a model, the contents of the text files must be converted into numeric feature vectors. The most common representations are tf and tf-idf.

1. tf:

The first issue is handling high-dimensional sparse datasets: scipy.sparse matrices are designed for exactly this, and scikit-learn has built-in support for these structures.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(rawData.data)

X_train_counts
Out[43]: 
<6x11 sparse matrix of type '<type 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

X_train_counts.shape
Out[44]: (6, 11)

print count_vect.vocabulary_.get(u'like')
print count_vect.vocabulary_.get(u'good')
3
1

print X_train_counts
  (0, 8)        1
  (0, 0)        1
  (0, 3)        1
  (1, 8)        1
  (1, 3)        1
  (1, 10)       1
  (1, 9)        1
  (2, 8)        1
  (2, 4)        1
  (3, 8)        1
  (3, 6)        1
  (3, 1)        1
  (4, 8)        1
  (4, 2)        1
  (5, 8)        1
  (5, 1)        1
  (5, 5)        1
  (5, 7)        1


2. tf-idf:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
Out[53]: (6, 11)

X_train_tfidf
Out[54]: 
<6x11 sparse matrix of type '<type 'numpy.float64'>'
	with 18 stored elements in Compressed Sparse Row format>

print X_train_tfidf
  (0, 3)        0.599738830611
  (0, 0)        0.731376058697
  (0, 8)        0.324657351406
  (1, 9)        0.590335838052
  (1, 10)       0.590335838052
  (1, 3)        0.484083832074
  (1, 8)        0.262049690228
  (2, 4)        0.913996360826
  (2, 8)        0.405722383406
  (3, 1)        0.599738830611
  (3, 6)        0.731376058697
  (3, 8)        0.324657351406
  (4, 2)        0.913996360826
  (4, 8)        0.405722383406
  (5, 7)        0.590335838052
  (5, 5)        0.590335838052
  (5, 1)        0.484083832074
  (5, 8)        0.262049690228


3. Training a classifier:

Take naive Bayes as an example:

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, rawData.target)


4. Prediction:

When new documents arrive, they must go through exactly the same feature extraction process. The difference is that we call transform instead of fit_transform on the transformers, because they were already fitted on the training set:

docs_new = ['i like this', 'haha, start.']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, rawData.target_names[category]))
'i like this' => category_2_folder
'haha, start.' => category_1_folder
It seems even this simple prediction is fairly accurate.
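The whole flow above (counts, then tf-idf, then naive Bayes) can also be chained with scikit-learn's `Pipeline`, which handles the fit-on-train / transform-on-new bookkeeping automatically. A hedged sketch, with made-up toy documents and labels standing in for `rawData`:

```python
# Chain the three steps from this post into a single estimator.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data (labels 1 = positive, 0 = negative)
train_docs = ["i like this", "good movie", "i hate this", "bad movie"]
train_labels = [1, 1, 0, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),     # text -> term counts
    ("tfidf", TfidfTransformer()),   # counts -> tf-idf weights
    ("clf", MultinomialNB()),        # tf-idf -> class label
])
text_clf.fit(train_docs, train_labels)

# predict() runs transform (not fit_transform) on the fitted steps,
# which is exactly the rule described in step 4 above.
print(text_clf.predict(["i like this"]))  # [1]
```

With a pipeline there is no risk of accidentally calling fit_transform on new documents, since only the final `fit` ever fits the transformers.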
