Mahout Installation, Configuration, and a Clustering Test
2015-08-12 20:14
Mahout is an open-source project under the Apache Software Foundation (ASF) that provides scalable implementations of classic machine-learning algorithms, aiming to help developers build intelligent applications more quickly and easily. Mahout ships many implementations, covering clustering, classification, recommendation filtering, and frequent-itemset mining. In addition, by building on the Apache Hadoop libraries, Mahout scales out effectively into the cloud.
The algorithms shipped in the latest apache-mahout-distribution-0.11.0 are:
arff.vector: Generate Vectors from an ARFF file or directory
baumwelch: Baum-Welch algorithm for unsupervised HMM training
buildforest: Build the random forest classifier
canopy: Canopy clustering (a clustering algorithm; takes text input)
cat: Print a file or resource as the logistic regression models would see it
cleansvd: Cleanup and verification of SVD output
clusterdump: Dump cluster output to text (inspect clustering output)
clusterpp: Groups Clustering Output In Clusters
cmdump: Dump confusion matrix in HTML or text formats
concatmatrices: Concatenates 2 matrices of same cardinality into a single matrix
cvb: LDA via Collapsed Variation Bayes (0th deriv. approx)
cvb0_local: LDA via Collapsed Variation Bayes, in memory locally
describe: Describe the fields and target variable in a data set
evaluateFactorization: Compute RMSE and MAE of a rating matrix factorization against probes
fkmeans: Fuzzy K-means clustering
hmmpredict: Generate random sequence of observations by given HMM
itemsimilarity: Compute the item-item-similarities for item-based collaborative filtering (operates on preference data; takes text triples as input)
kmeans: K-means clustering
lucene.vector: Generate Vectors from a Lucene index
lucene2seq: Generate Text SequenceFiles from a Lucene index
matrixdump: Dump matrix in CSV format
matrixmult: Take the product of two matrices
parallelALS: ALS-WR factorization of a rating matrix
qualcluster: Runs clustering experiments and summarizes results in a CSV
recommendfactorized: Compute recommendations using the factorization of a rating matrix
recommenditembased: Compute recommendations using item-based collaborative filtering
regexconverter: Convert text files on a per line basis based on regular expressions
resplit: Splits a set of SequenceFiles into a number of equal splits
rowid: Map SequenceFile(Text,VectorWritable) to {SequenceFile(IntWritable,VectorWritable), SequenceFile(IntWritable,Text)}
rowsimilarity: Compute the pairwise similarities of the rows of a matrix (row similarity; takes a matrix of vectors as input)
runAdaptiveLogistic: Score new production data using a probably trained and validated AdaptiveLogisticRegression model
runlogistic: Run a logistic regression model against CSV data
seq2encoded: Encoded Sparse Vector generation from Text sequence files
seq2sparse: Sparse Vector generation from Text sequence files (vectorizes text content only; not applicable to numeric matrices)
seqdirectory: Generate sequence files (of Text) from a directory (convert plain text to SequenceFile format)
seqdumper: Generic Sequence File dumper (inspect the contents of a SequenceFile)
seqmailarchives: Creates SequenceFile from a directory containing gzipped mail archives
seqwiki: Wikipedia xml dump to sequence file
spectralkmeans: Spectral k-means clustering
split: Split Input data into test and train sets
splitDataset: Split a rating dataset into training and probe parts
ssvd: Stochastic SVD
streamingkmeans: Streaming k-means clustering
svd: Lanczos Singular Value Decomposition
testforest: Test the random forest classifier
testnb: Test the Vector-based Bayes classifier
trainAdaptiveLogistic: Train an AdaptiveLogisticRegression model
trainlogistic: Train a logistic regression using stochastic gradient descent
trainnb: Train the Vector-based Bayes classifier (the Bayes trainer)
transpose: Take the transpose of a matrix
validateAdaptiveLogistic: Validate an AdaptiveLogisticRegression model against hold-out data set
vecdist: Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors (e.g. distances from the seeds produced by canopy)
vectordump: Dump vectors from a sequence file to text (inspect the contents of a vector file)
viterbi: Viterbi decoding of hidden states from given output states sequence
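Every entry above is a subcommand of the bin/mahout driver script, and each driver prints its own usage summary on request. For example (a sketch, assuming bin/mahout is on the PATH as configured below):

mahout kmeans --help    # prints every option the kmeans driver accepts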
Preparation
As with most installations on Linux, first download the corresponding tar.gz package; here the latest release, apache-mahout-distribution-0.11.0, was used (note: do not assume the newest release is the easiest to work with; forgetting this can be painful). Place the downloaded archive in the user's home directory. In addition, a sample data source for learning is available for download; this article uses the synthetic_control data set, which can be fed to kmeans directly and does not need a separate conversion to SequenceFile format.
Installation and Deployment
To keep things easy to manage, create a software folder under the user's home directory; every newly installed tool can then live there. Move the Mahout archive into the software folder with cp or mv, then extract it with tar.
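For instance, assuming the archive was downloaded straight to the home directory, creating the folder and moving the package looks like this, after which the extraction commands below run from inside ~/software:

mkdir -p ~/software    # one place for all locally installed tools
mv ~/apache-mahout-distribution-0.11.0.tar.gz ~/software/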
cd ~/software
tar -zxvf apache-mahout-distribution-0.11.0.tar.gz
Then return to the home directory; ls -a shows that it contains a .bashrc file. Open it:
vim .bashrc
Add the following configuration (mainly the Hadoop and Mahout settings):
export MAHOUT_HOME=/home/username/software/apache-mahout-distribution-0.11.0
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export HADOOP_HOME=/home/username/software/hadoop
export JAVA_HOME=/home/username/software/java
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_HOME/bin:$MAHOUT_HOME/conf:$MAHOUT_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
After saving the file, log out and log back in for the settings to take effect.
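Alternatively, the settings can be picked up without logging out; either way, a quick sanity check (the paths printed should match the ones configured above):

source ~/.bashrc      # reload the file in the current shell
echo $MAHOUT_HOME     # should print the Mahout install directory
which mahout          # should resolve to $MAHOUT_HOME/bin/mahout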
Testing
After logging back in, type the mahout command; if it prints the list of algorithms shown at the top of this article, the installation succeeded.
Next, the downloaded synthetic_control.data data set is used to check that the algorithms run correctly.
Upload the data set to the testdata folder in HDFS (conventionally the default input folder) with:

hadoop fs -put ~/synthetic_control.data testdata
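If the folder does not exist yet, it can be created first and the upload verified afterwards; a minimal sketch, assuming HDFS is running and the user's HDFS home directory exists:

hadoop fs -mkdir -p testdata    # create the input folder if it is missing
hadoop fs -ls testdata          # confirm synthetic_control.data arrived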
To run Mahout's kmeans clustering, the bare command mahout -core org.apache.mahout.clustering.syntheticcontrol.kmeans.Job kept failing on this latest version because the required parameters must be given explicitly, so the command used here is:

mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -i testdata/synthetic_control.data -o output/ -t1 4 -t2 10 -x 100

(The parameters are not explained in depth here; briefly, -i and -o are the input and output paths, -t1 and -t2 are the canopy distance thresholds used to seed the initial clusters, and -x caps the number of iterations.)
Alternatively, run mahout kmeans directly, e.g. mahout kmeans -i iptrends -o ipcluster -k 4 -c tempPoint -x 30. Note that the -i input must already be in vector form; when -k is given, -c can be any scratch directory (Mahout samples k random initial centroids into it), otherwise -c must point at the initial centroids.
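A minimal sketch contrasting the two seeding modes (iptrends, ipcluster, tempPoint, and seedCentroids are illustrative path names, not fixed ones, and the optional -ow/-cl flags, which overwrite old output and also assign points to clusters, are additions here):

# with -k: Mahout samples 4 random initial centroids into tempPoint
mahout kmeans -i iptrends -o ipcluster -k 4 -c tempPoint -x 30 -ow -cl
# without -k: seedCentroids must already contain the initial centroids
mahout kmeans -i iptrends -o ipcluster -c seedCentroids -x 30 -ow -cl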
For details of how the clustering actually runs, see http://my.oschina.net/BreathL/blog/58104
The run finally writes its results, the iteration folders and the clustered points, under the output directory.
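To actually inspect the clusters, the clusterdump utility from the list at the top can be pointed at the job output; a minimal sketch (clusters-10-final is an assumption, since the exact name depends on the iteration at which kmeans converged):

# dump the final clusters, plus their member points, to a local text file
mahout clusterdump -i output/clusters-10-final -p output/clusteredPoints -o ~/clusteranalyze.txt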
Note: the job jar can also be run the same way any jar is run on Hadoop:
hadoop jar /home/username/software/apache-mahout-distribution-0.11.0/mahout-examples-0.11.0-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -input testdata -output output
A side note: if you hit Exception in thread "main" java.io.FileNotFoundException: /tmp/safe/hadoop-unjar8255800323492605035/org/apache/mahout/cf/taste/impl/eval/LoadStatistics.class (No space left on device), a look inside the mahout script shows that it simply calls hadoop to run the jar; the jar could not be unpacked into Hadoop's configured temporary folder (hadoop.tmp.dir) because the disk was full, so clearing out the files under /tmp fixes it.
PS: for background, see the notes on understanding Linux inodes.
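Before clearing anything, it is worth checking whether disk blocks or inodes ran out, since an exhausted inode table raises the same "No space left on device" error; a short sketch (the hadoop-unjar* pattern matches the temporary directories named in the error above):

df -h /tmp                  # free disk blocks
df -i /tmp                  # free inodes
rm -rf /tmp/hadoop-unjar*   # remove leftover unjar directories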
If things refuse to run correctly, prefer falling back to an older release package.