Spark Machine Learning (1) -- Machine Learning Library (MLlib)

2016-04-23 08:51
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
It divides into two packages:

spark.mllib contains the original API built on top of RDDs.

spark.ml provides a higher-level API built on top of DataFrames for constructing ML pipelines.

Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.
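
The pipeline concept mentioned above can be illustrated with a minimal plain-Python sketch. This is not the spark.ml API: the classes SimpleTokenizer, TermCounter and SimplePipeline are hypothetical stand-ins, and rows are ordinary dictionaries rather than DataFrames. The sketch only shows the idea that each stage (a transformer or a feature extractor) adds a column and that a pipeline chains the stages in order.

```python
from collections import Counter


class SimpleTokenizer:
    """Hypothetical transformer: adds a 'tokens' field split from 'text'."""

    def transform(self, rows):
        return [dict(row, tokens=row["text"].lower().split()) for row in rows]


class TermCounter:
    """Hypothetical feature extractor: adds term counts as 'features'."""

    def transform(self, rows):
        return [dict(row, features=Counter(row["tokens"])) for row in rows]


class SimplePipeline:
    """Hypothetical pipeline: applies each stage's transform() in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


rows = [{"text": "Spark makes ML easy"}, {"text": "ML pipelines chain stages"}]
pipeline = SimplePipeline([SimpleTokenizer(), TermCounter()])
result = pipeline.transform(rows)
print(result[0]["tokens"])
print(dict(result[1]["features"]))
```

A new algorithm fits this shape well when it can be expressed as one more stage that reads existing fields and adds new ones, which is roughly what "fits the ML pipeline concept" suggests here.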

We list major functionality from both below, with links to detailed guides.

spark.mllib: data types, algorithms, utilities

Data types

Basic statistics
    summary statistics
    correlations
    stratified sampling
    hypothesis testing
    streaming significance testing
    random data generation

Classification and regression
    linear models (SVMs, logistic regression, linear regression)
    naive Bayes
    decision trees
    ensembles of trees (Random Forests and Gradient-Boosted Trees)
    isotonic regression

Collaborative filtering
    alternating least squares (ALS)

Clustering
    k-means
    Gaussian mixture
    power iteration clustering (PIC)
    latent Dirichlet allocation (LDA)
    bisecting k-means
    streaming k-means

Dimensionality reduction
    singular value decomposition (SVD)
    principal component analysis (PCA)

Feature extraction and transformation

Frequent pattern mining
    FP-growth
    association rules
    PrefixSpan

Evaluation metrics

PMML model export

Optimization (developer)
    stochastic gradient descent
    limited-memory BFGS (L-BFGS)
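
As an illustration of one entry in the list, stochastic gradient descent, here is a small single-machine sketch in plain Python. It is not MLlib's optimizer, which runs distributed over RDDs; the function name sgd_linear_regression, the step size, the epoch count and the toy data are all assumptions chosen for the example. It fits a line y ≈ w * x + b by updating the parameters after every individual data point.

```python
import random


def sgd_linear_regression(points, step=0.05, epochs=200, seed=0):
    """Fit y ≈ w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the points in a fresh random order
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= step * error * x
            b -= step * error
    return w, b


# Noise-free toy data generated from y = 2x + 1 (an assumption for the demo).
points = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = sgd_linear_regression(points)
print(w, b)  # both values should be close to 2 and 1 respectively
```

Because the toy data has no noise, the estimates settle very close to the generating slope and intercept; with noisy data a decaying step size would usually be needed.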
