sklearn: Naive Bayes text classification, part 4

2017-02-17 15:08
Stripping 'headers', 'footers' and 'quotes' from the data actually lowered the accuracy.

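from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load every post, with headers, footers and quoted replies stripped
news = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
# Hold out 25% of the posts for testing
X_train, X_test, Y_train, Y_test = train_test_split(news.data, news.target, test_size=0.25)

# Learn the vocabulary and IDF weights on the training split only
tfidf = TfidfVectorizer()
X_tfidf_train = tfidf.fit_transform(X_train)
X_tfidf_test = tfidf.transform(X_test)

mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(X_tfidf_train, Y_train)
# Mean accuracy on the held-out split
print(mnb_tfidf.score(X_tfidf_test, Y_test))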
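
The results further down also vary a stop-word setting that the listing above does not show. Here is a minimal sketch of how that switch could be wired in. It assumes scikit-learn's built-in English list (stop_words='english') was the one used, which the post does not confirm. The helper name nb_accuracy and the toy corpus are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def nb_accuracy(train_texts, train_labels, test_texts, test_labels, stop_words=None):
    """Fit TF-IDF + MultinomialNB on the training texts, return test accuracy.

    stop_words=None keeps every token; stop_words='english' drops
    scikit-learn's built-in English stop-word list (an assumption about
    which list the original experiment used).
    """
    tfidf = TfidfVectorizer(stop_words=stop_words)
    X_train = tfidf.fit_transform(train_texts)
    X_test = tfidf.transform(test_texts)
    model = MultinomialNB()
    model.fit(X_train, train_labels)
    return model.score(X_test, test_labels)


# Toy corpus (hypothetical, only to show the call shape).
train_texts = [
    "the engine and the brakes of the car",
    "a car with a new engine",
    "the pitcher threw the ball",
    "a ball game with a great pitcher",
]
train_labels = [0, 0, 1, 1]
test_texts = ["the car engine", "the pitcher and the ball"]
test_labels = [0, 1]

acc_plain = nb_accuracy(train_texts, train_labels, test_texts, test_labels)
acc_stop = nb_accuracy(train_texts, train_labels, test_texts, test_labels,
                       stop_words="english")
print(acc_plain, acc_stop)
```

On the real data, the same helper could be called with the X_train, Y_train, X_test, Y_test splits from the listing above, once with stop_words=None and once with stop_words='english'.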
After removing 'headers', 'footers' and 'quotes', a post in the dataset looks like this:

A "moment of silence" doesn't mean much unless *everyone*
participates.  Otherwise it's not silent, now is it?

Non-religious reasons for having a "moment of silence" for a dead
classmate: (1) to comfort the friends by showing respect to the
deceased , (2) to give the classmates a moment to grieve together, (3)
to give the friends a moment to remember their classmate *in the
context of the school*, (4) to deal with the fact that the classmate
is gone so that it's not disruptive later.

Blindly opposing everything with a flavor of religion in it is
utterly idiotic.

Results:

(1 = yes, 0 = no)

Use TF-IDF   Remove stop words   Remove headers/footers   Accuracy
1            0                   1                        0.6
1            1                   1                        0.68
1            0                   0                        0.85
1            1                   0                        0.87
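In these runs, removing 'headers', 'footers' and 'quotes' lowered accuracy from 0.85-0.87 to 0.6-0.68, so for this classifier it is better to keep them.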
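
One way to probe where the extra accuracy comes from is to list, for each class, the tokens to which the fitted model assigns the highest likelihood, and compare those lists between a model trained with remove=(...) and one trained without it. The sketch below is not part of the original experiment. The function name top_tokens_per_class and the toy corpus are hypothetical, and it relies only on the vocabulary_ attribute of TfidfVectorizer and the classes_ and feature_log_prob_ attributes of MultinomialNB.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def top_tokens_per_class(texts, labels, k=3):
    """Return {class_label: [k tokens with the highest log P(token | class)]}."""
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(texts)
    model = MultinomialNB().fit(X, labels)
    # vocabulary_ maps token -> column index; invert it to column -> token.
    index_to_token = {int(idx): tok for tok, idx in tfidf.vocabulary_.items()}
    result = {}
    for row, cls in enumerate(model.classes_.tolist()):
        # Column indices sorted by descending log-likelihood for this class.
        best = np.argsort(model.feature_log_prob_[row])[::-1][:k]
        result[cls] = [index_to_token[int(i)] for i in best]
    return result


# Toy corpus (hypothetical), two classes with clearly separate vocabularies.
toy_texts = [
    "engine engine engine car",
    "engine car brakes",
    "pitcher pitcher ball",
    "pitcher ball game",
]
toy_labels = [0, 0, 1, 1]

top = top_tokens_per_class(toy_texts, toy_labels, k=2)
print(top)
```

Applied to news.data and news.target from both variants of the dataset, the two lists would show whether tokens that only occur in headers, footers or quotes rank near the top when that text is kept.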