Spark Machine Learning (1) -- Machine Learning Library (MLlib)
2016-04-23 08:51
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:
spark.mllib contains the original API built on top of RDDs.
spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.
Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and can expect more features to come. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.
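To make the "ML pipeline concept" a little more concrete, here is a minimal plain-Python sketch of the idea. It is not the spark.ml API: the class names `Tokenizer`, `VocabIndexer`, `VocabModel`, `Pipeline` and `PipelineModel` are hypothetical stand-ins, and rows are plain dicts rather than DataFrames. The assumption illustrated is only the general shape: a transformer maps a dataset to a new dataset, an estimator is fit on data to produce a transformer, and a pipeline chains such stages.

```python
# Toy illustration of the pipeline idea in plain Python (NOT the spark.ml API).
# Rows are dicts; every class name here is a hypothetical stand-in.

class Tokenizer:
    """Transformer: adds a 'words' field by lowercasing and splitting 'text'."""
    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class VocabModel:
    """Transformer produced by fitting VocabIndexer: words -> count vector."""
    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        out = []
        for row in rows:
            counts = [0] * len(self.vocab)
            for word in row["words"]:
                if word in self.vocab:      # words unseen at fit time are skipped
                    counts[self.vocab[word]] += 1
            out.append(dict(row, features=counts))
        return out


class VocabIndexer:
    """Estimator (a feature extractor): learns a vocabulary from the data."""
    def fit(self, rows):
        words = sorted({w for row in rows for w in row["words"]})
        return VocabModel({w: i for i, w in enumerate(words)})


class PipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fitting replaces every estimator with its fitted model."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):       # estimator: fit it on the data so far
                stage = stage.fit(rows)
            rows = stage.transform(rows)    # feed the output to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [{"text": "spark is fast"}, {"text": "Spark ML is flexible"}]
model = Pipeline([Tokenizer(), VocabIndexer()]).fit(train)
result = model.transform([{"text": "spark spark is new"}])
print(result[0]["features"])
```

The point of the split is that anything learned from data (here, the vocabulary) lives in the fitted model, so the same fitted pipeline can be applied unchanged to new rows.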
We list major functionality from both below, with links to detailed guides.
spark.mllib: data types, algorithms, utilities
- Data types
- Basic statistics
  - summary statistics
  - correlations
  - stratified sampling
  - hypothesis testing
  - streaming significance testing
  - random data generation
- Classification and regression
  - linear models (SVMs, logistic regression, linear regression)
  - naive Bayes
  - decision trees
  - ensembles of trees (Random Forests and Gradient-Boosted Trees)
  - isotonic regression
- Collaborative filtering
  - alternating least squares (ALS)
- Clustering
  - k-means
  - Gaussian mixture
  - power iteration clustering (PIC)
  - latent Dirichlet allocation (LDA)
  - bisecting k-means
  - streaming k-means
- Dimensionality reduction
  - singular value decomposition (SVD)
  - principal component analysis (PCA)
- Feature extraction and transformation
- Frequent pattern mining
  - FP-growth
  - association rules
  - PrefixSpan
- Evaluation metrics
- PMML model export
- Optimization (developer)
  - stochastic gradient descent
  - limited-memory BFGS (L-BFGS)
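As a rough illustration of what the optimization entries in the list refer to, the sketch below runs stochastic gradient descent on a tiny least-squares linear regression in plain Python. It is not MLlib's optimizer and does not use any Spark API; the function name `sgd_linear_regression`, its parameters, and the toy data are assumptions made purely for illustration.

```python
import random


def sgd_linear_regression(points, step=0.5, epochs=200, seed=0):
    """Fit y ~ w*x + b by SGD on squared error, one sample per update.

    Hypothetical helper for illustration only, not an MLlib function.
    """
    rng = random.Random(seed)       # fixed seed so runs are repeatable
    w, b = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)           # visit samples in a new order each epoch
        for x, y in data:
            err = (w * x + b) - y   # residual for this single sample
            w -= step * err * x     # gradient of 0.5*err**2 with respect to w
            b -= step * err         # gradient of 0.5*err**2 with respect to b
    return w, b


# Noise-free samples of y = 2x + 1 on x = 0, 0.25, 0.5, 0.75, 1
points = [(i / 4, 2 * (i / 4) + 1) for i in range(5)]
w, b = sgd_linear_regression(points)
print(round(w, 3), round(b, 3))
```

Each update uses the gradient of a single sample instead of the full dataset, which is what makes the method cheap per step. The step size here is chosen for inputs in the range 0 to 1 and would need tuning for other data.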