
Classifying News with Mahout's Naive Bayes Classifier

2015-12-18 14:48
Reposted from: http://www.letiantian.me/2014-10-22-mahout-naive-bayes-newsgroups/

Mahout version: 0.9; Hadoop version: 1.2.1.

Download the 20 Newsgroups dataset and extract it to get the 20news-bydate directory (if you still need to fetch it, see the sketch below). Create a 20news-all directory and merge all of the samples into it:
$ mkdir 20news-all
$ cp -R 20news-bydate/*/* 20news-all

In the 20news-all directory, each subdirectory represents a category; every category contains multiple text files, and each text file is one sample.

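For reference, a minimal sketch of fetching and unpacking the archive; the mirror URL is an assumption, any copy of 20news-bydate.tar.gz will do:
$ wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
$ tar -xzf 20news-bydate.tar.gz
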
Then copy it into HDFS:
$ hadoop  distcp file:///home/sunlt/20news-all 20news-all

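If distcp feels like overkill for a single local directory, a plain put also works (assuming the same local path as above):
$ hadoop dfs -put /home/sunlt/20news-all 20news-all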

Take a look:
$ hadoop  dfs -ls
Warning: $HADOOP_HOME is deprecated.

Found 6 items
......
drwxr-xr-x   - sunlt supergroup          0 2014-10-13 10:09 /user/sunlt/20news-all
......



Convert it into sequence files:

$ mahout seqdirectory \
-i 20news-all \
-o 20news-seq \
-ow

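To confirm that seqdirectory produced output, you can simply list the target directory (the exact part-file names may vary):
$ hadoop dfs -ls 20news-seq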


Convert to <Text, VectorWritable> sequence files:

$ mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm  -nv -wt tfidf


Here TF-IDF (term frequency-inverse document frequency) is used to generate the vectors; plain TF can also be specified.

For more options, run mahout seq2sparse --help.

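For example, a sketch of the same command with plain term-frequency weighting; only the -wt value changes:
$ mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tf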

Splitting the data

Use 60% of the data for training and 40% for testing.
$ mahout split \
-i 20news-vectors/tfidf-vectors \
--trainingOutput 20news-train-vectors \
--testOutput 20news-test-vectors  \
--randomSelectionPct 40 \
--overwrite --sequenceFiles -xm sequential

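As with seq2sparse, the full list of split options (including the -xm execution method used above) can be shown with --help:
$ mahout split --help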

At this point:
$ hadoop dfs -ls
Warning: $HADOOP_HOME is deprecated.

Found 10 items
drwxr-xr-x   - sunlt supergroup          0 2014-10-13 09:52 /user/sunlt/20_newsgroups
drwxr-xr-x   - sunlt supergroup          0 2014-10-13 10:09 /user/sunlt/20news-all
drwxr-xr-x   - sunlt supergroup          0 2014-10-13 10:28 /user/sunlt/20news-seq
drwxr-xr-x   - sunlt supergroup          0 2014-10-13 11:17 /user/sunlt/20news-test-vectors
drwxr-xr-x   - sunlt supergroup          0 2014-10-13 11:17 /user/sunlt/20news-train-vectors
drwxr-xr-x   - sunlt supergroup          0 2014-10-13 10:55 /user/sunlt/20news-vectors



Training:

$ mahout trainnb \
-i 20news-train-vectors \
-el  \
-o model \
-li labelindex \
-ow \
-c

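Here -el tells trainnb to extract the category labels from the input, -li is the path where the label index is written, and -c trains the complementary Naive Bayes variant. To inspect the label-to-index mapping, a sketch (assuming labelindex was written to the relative path used above):
$ mahout seqdumper -i labelindex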


Test the trained Bayes classifier:

$ mahout testnb \
-i 20news-test-vectors \
-m model \
-l labelindex \
-ow \
-o 20news-testing  \
-c


The results:
.......
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       3023       93.7946%
Incorrectly Classified Instances        :        200        6.2054%
Total Classified Instances              :       3223

=======================================================
Confusion Matrix
-------------------------------------------------------
a        b       c       d       e       f       g       h       i       <--Classified as
306      1       0       0       1       2       1       1       16       |  328     a     = alt.atheism
0        391     2       1       3       1       0       0       2        |  400     b     = comp.windows.x
0        7       345     6       12      12      2       5       1        |  390     c     = misc.forsale
0        1       2       399     2       1       1       0       0        |  406     d     = rec.motorcycles
0        5       10      2       371     3       0       3       1        |  395     e     = sci.electronics
0        5       0       1       4       371     1       2       1        |  385     f     = sci.med
1        3       0       2       1       1       317     8       0        |  333     g     = talk.politics.guns
1        0       0       1       2       3       11      311     5        |  334     h     = talk.politics.misc
22       1       2       1       0       0       8       6       212      |  252     i     = talk.religion.misc

=======================================================
Statistics
-------------------------------------------------------
Kappa                                       0.9105
Accuracy                                   93.7946%
Reliability                                84.0504%
Reliability (standard deviation)            0.2984


The numbers on the diagonal are the correctly classified counts. For example, row a shows that 306 of the 328 alt.atheism documents were classified correctly, while 16 were misclassified as talk.religion.misc (column i).


Additional notes

Export the data:
$ mahout seqdumper -i 20news-testing/part-m-00000 -o ./20news_testing.res


View it:
$ gedit ./20news_testing.res


You can also view it like this:
$ hadoop dfs -cat 20news-testing/part-m-00000


Because it is a sequence file, the output of -cat is garbled.

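The -text subcommand decodes sequence files, so the following should print readable records (a sketch; -text auto-detects the SequenceFile format):
$ hadoop dfs -text 20news-testing/part-m-00000
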
Copy the files in an HDFS directory to the local filesystem:
$ hadoop dfs -copyToLocal 20news-vectors/* .

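The -get subcommand is equivalent, e.g.:
$ hadoop dfs -get 20news-vectors/* .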


References

Twenty Newsgroups Classification Example

Bayes classification algorithm example