Mahout Installation, Configuration, and a Clustering Test
2015-08-12 20:14
Mahout is an open-source project under the Apache Software Foundation (ASF) that provides scalable implementations of classic machine-learning algorithms, aiming to help developers build intelligent applications more quickly and easily. Mahout ships many implementations, covering clustering, classification, recommendation filtering, and frequent-itemset mining. In addition, by building on the Apache Hadoop libraries, Mahout scales out effectively into the cloud.
The algorithms shipped in the latest apache-mahout-distribution-0.11.0 are:
arff.vector: Generate Vectors from an ARFF file or directory
baumwelch: Baum-Welch algorithm for unsupervised HMM training
buildforest: Build the random forest classifier
canopy: Canopy clustering (a clustering algorithm; takes text input)
cat: Print a file or resource as the logistic regression models would see it
cleansvd: Cleanup and verification of SVD output
clusterdump: Dump cluster output to text (inspect clustering output)
clusterpp: Groups Clustering Output In Clusters
cmdump: Dump confusion matrix in HTML or text formats
concatmatrices: Concatenates 2 matrices of same cardinality into a single matrix
cvb: LDA via Collapsed Variation Bayes (0th deriv. approx)
cvb0_local: LDA via Collapsed Variation Bayes, in memory locally
describe: Describe the fields and target variable in a data set
evaluateFactorization: Compute RMSE and MAE of a rating matrix factorization against probes
fkmeans: Fuzzy K-means clustering
hmmpredict: Generate random sequence of observations by given HMM
itemsimilarity: Compute the item-item-similarities for item-based collaborative filtering (operates on preference data; takes text triples as input)
kmeans: K-means clustering
lucene.vector: Generate Vectors from a Lucene index
lucene2seq: Generate Text SequenceFiles from a Lucene index
matrixdump: Dump matrix in CSV format
matrixmult: Take the product of two matrices
parallelALS: ALS-WR factorization of a rating matrix
qualcluster: Runs clustering experiments and summarizes results in a CSV
recommendfactorized: Compute recommendations using the factorization of a rating matrix
recommenditembased: Compute recommendations using item-based collaborative filtering
regexconverter: Convert text files on a per line basis based on regular expressions
resplit: Splits a set of SequenceFiles into a number of equal splits
rowid: Map SequenceFile(Text,VectorWritable) to {SequenceFile(IntWritable,VectorWritable), SequenceFile(IntWritable,Text)}
rowsimilarity: Compute the pairwise similarities of the rows of a matrix (row similarity; takes a matrix of vectors as input)
runAdaptiveLogistic: Score new production data using a probably trained and validated AdaptiveLogisticRegression model
runlogistic: Run a logistic regression model against CSV data
seq2encoded: Encoded Sparse Vector generation from Text sequence files
seq2sparse: Sparse Vector generation from Text sequence files (vectorizes text content only; not applicable to numeric matrices)
seqdirectory: Generate sequence files (of Text) from a directory (convert plain text to SequenceFile format)
seqdumper: Generic Sequence File dumper (inspect the contents of a SequenceFile)
seqmailarchives: Creates SequenceFile from a directory containing gzipped mail archives
seqwiki: Wikipedia xml dump to sequence file
spectralkmeans: Spectral k-means clustering
split: Split Input data into test and train sets
splitDataset: Split a rating dataset into training and probe parts
ssvd: Stochastic SVD
streamingkmeans: Streaming k-means clustering
svd: Lanczos Singular Value Decomposition
testforest: Test the random forest classifier
testnb: Test the Vector-based Bayes classifier
trainAdaptiveLogistic: Train an AdaptiveLogisticRegression model
trainlogistic: Train a logistic regression using stochastic gradient descent
trainnb: Train the Vector-based Bayes classifier (the Bayes trainer)
transpose: Take the transpose of a matrix
validateAdaptiveLogistic: Validate an AdaptiveLogisticRegression model against hold-out data set
vecdist: Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors (e.g. distances from the seeds produced by canopy)
vectordump: Dump vectors from a sequence file to text (inspect the contents of a vector file)
viterbi: Viterbi decoding of hidden states from given output states sequence
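Every entry above is a subcommand of the bin/mahout driver script, and each driver prints its own usage summary on request. For example (a sketch, assuming bin/mahout is on the PATH as configured below):

mahout kmeans --help    # prints every option the kmeans driver accepts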
Preparation
As with most installations on Linux, first download the corresponding tar.gz package; here the latest release, apache-mahout-distribution-0.11.0, was used (note: do not assume the newest release is the easiest to work with; forgetting this can be painful). Place the downloaded archive in the user's home directory. In addition, a sample data source for learning is available for download; this article uses the synthetic_control data set, which can be fed to kmeans directly and does not need a separate conversion to SequenceFile format.
Installation and Deployment
To keep things easy to manage, create a software folder under the user's home directory; every newly installed tool can then live there. Move the Mahout archive into the software folder with cp or mv, then extract it with tar.
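For instance, assuming the archive was downloaded straight to the home directory, creating the folder and moving the package looks like this, after which the extraction commands below run from inside ~/software:

mkdir -p ~/software    # one place for all locally installed tools
mv ~/apache-mahout-distribution-0.11.0.tar.gz ~/software/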
cd ~/software
tar -zxvf apache-mahout-distribution-0.11.0.tar.gz
Then return to the home directory; ls -a shows that it contains a .bashrc file. Open it:
vim .bashrc
Add the following configuration (mainly the Hadoop and Mahout settings):
export MAHOUT_HOME=/home/username/software/apache-mahout-distribution-0.11.0
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export HADOOP_HOME=/home/username/software/hadoop
export JAVA_HOME=/home/username/software/java
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$MAHOUT_HOME/conf:$MAHOUT_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
After saving the file, log out and log back in for the settings to take effect.
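Alternatively, the settings can be picked up without logging out; either way, a quick sanity check (the paths printed should match the ones configured above):

source ~/.bashrc      # reload the file in the current shell
echo $MAHOUT_HOME     # should print the Mahout install directory
which mahout          # should resolve to $MAHOUT_HOME/bin/mahout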
Testing
After logging back in, type the mahout command; if it prints the list of algorithms shown at the top of this article, the installation succeeded.
Next, the downloaded synthetic_control.data data set is used to check that the algorithms run correctly.
Upload the data set to the testdata folder in HDFS (conventionally the default input folder) with:

hadoop fs -put ~/synthetic_control.data testdata
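If the folder does not exist yet, it can be created first and the upload verified afterwards; a minimal sketch, assuming HDFS is running and the user's HDFS home directory exists:

hadoop fs -mkdir -p testdata    # create the input folder if it is missing
hadoop fs -ls testdata          # confirm synthetic_control.data arrived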
To run Mahout's kmeans clustering, the bare command mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job kept failing on this latest version because the required parameters must be given explicitly, so the command used here is:

mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -i testdata/synthetic_control.data -o output/ -t1 4 -t2 10 -x 100

(The parameters are not explained in depth here; briefly, -i and -o are the input and output paths, -t1 and -t2 are the canopy distance thresholds used to seed the initial clusters, and -x caps the number of iterations.)
Alternatively, run mahout kmeans directly, e.g. mahout kmeans -i iptrends -o ipcluster -k 4 -c tempPoint -x 30. Note that the -i input must already be in vector form; when -k is given, -c can be any scratch directory (Mahout samples k random initial centroids into it), otherwise -c must point at the initial centroids.
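A minimal sketch contrasting the two seeding modes (iptrends, ipcluster, tempPoint, and seedCentroids are illustrative path names, not fixed ones, and the optional -ow/-cl flags, which overwrite old output and also assign points to clusters, are additions here):

# with -k: Mahout samples 4 random initial centroids into tempPoint
mahout kmeans -i iptrends -o ipcluster -k 4 -c tempPoint -x 30 -ow -cl
# without -k: seedCentroids must already contain the initial centroids
mahout kmeans -i iptrends -o ipcluster -c seedCentroids -x 30 -ow -cl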
For details of how the clustering actually runs, see http://my.oschina.net/BreathL/blog/58104
The run finally writes its results, the iteration folders and the clustered points, under the output directory.
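To actually inspect the clusters, the clusterdump utility from the list at the top can be pointed at the job output; a minimal sketch (clusters-10-final is an assumption, since the exact name depends on the iteration at which kmeans converged):

# dump the final clusters, plus their member points, to a local text file
mahout clusterdump -i output/clusters-10-final -p output/clusteredPoints -o ~/clusteranalyze.txt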
Note: the job jar can also be run the same way any jar is run on Hadoop:
hadoop jar /home/username/software/apache-mahout-distribution-0.11.0/mahout-examples-0.11.0-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -input testdata -output output
A side note: if you hit Exception in thread "main" java.io.FileNotFoundException: /tmp/safe/hadoop-unjar8255800323492605035/org/apache/mahout/cf/taste/impl/eval/LoadStatistics.class (No space left on device), a look inside the mahout script shows that it simply calls hadoop to run the jar; the jar could not be unpacked into Hadoop's configured temporary folder (hadoop.tmp.dir) because the disk was full, so clearing out the files under /tmp fixes it.
PS: for background, see the notes on understanding Linux inodes.
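Before clearing anything, it is worth checking whether disk blocks or inodes ran out, since an exhausted inode table raises the same "No space left on device" error; a short sketch (the hadoop-unjar* pattern matches the temporary directories named in the error above):

df -h /tmp                  # free disk blocks
df -i /tmp                  # free inodes
rm -rf /tmp/hadoop-unjar*   # remove leftover unjar directories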
If things refuse to run correctly, prefer falling back to an older release package.