Spark Machine Learning (2): Local Vectors -- Data Types - MLlib
2016-04-23 10:01
Local vector
Labeled point
Local matrix
Distributed matrix
RowMatrix
IndexedRowMatrix
CoordinateMatrix
BlockMatrix
MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by Breeze and jblas. A training example used in supervised learning is called a “labeled point” in MLlib.
Local vector
A local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine. MLlib supports two types of local vectors: dense and sparse. A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values. For example, the vector (1.0, 0.0, 3.0) can be represented in dense format as [1.0, 0.0, 3.0] or in sparse format as (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector.
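The relationship between the two layouts can be sketched in plain Scala without Spark. The object and helper names below (LocalVectorLayouts, sparseToDense, denseToSparse) are hypothetical and are not part of MLlib; they only illustrate how the sparse triple (size, indices, values) maps to the dense array and back.

```scala
// Illustrative helpers only; these are NOT MLlib classes or methods.
object LocalVectorLayouts {
  // Expand a sparse triple (size, indices, values) into a dense array.
  def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must be parallel arrays")
    val dense = Array.fill(size)(0.0)
    for ((i, v) <- indices.zip(values)) dense(i) = v
    dense
  }

  // Collect the nonzero entries of a dense array into a sparse triple.
  def denseToSparse(dense: Array[Double]): (Int, Array[Int], Array[Double]) = {
    val nonzero = dense.zipWithIndex.filter { case (v, _) => v != 0.0 }
    (dense.length, nonzero.map(_._2), nonzero.map(_._1))
  }

  def main(args: Array[String]): Unit = {
    val dense = sparseToDense(3, Array(0, 2), Array(1.0, 3.0))
    println(dense.mkString("[", ", ", "]"))
    val (size, indices, values) = denseToSparse(dense)
    println(s"($size, ${indices.mkString("[", ", ", "]")}, ${values.mkString("[", ", ", "]")})")
  }
}
```

Running main prints the dense form [1.0, 0.0, 3.0] followed by the sparse triple (3, [0, 2], [1.0, 3.0]), matching the example in the text. The sparse layout stores only the nonzero entries, so it tends to pay off when most entries are zero.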
Scala
The base class of local vectors is Vector, and we provide two implementations: DenseVector and SparseVector. We recommend using the factory methods implemented in Vectors to create local vectors.
Refer to the Vector Scala docs and Vectors Scala docs for details on the API.
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0) from an array of doubles.
val dv: Vector = Vectors.dense(Array(1.0, 0.0, 3.0))
// Create the same dense vector from a variable-length argument list.
val densityVector: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Create a sparse vector (1.0, 0.0, 3.0) in two ways.
// 1. Parallel arrays: (size, indices: Array[Int], values: Array[Double]),
//    listing only the nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// 2. A Seq of (index, value) pairs for the nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

println(dv)
println(densityVector)
println(sv1)
println(sv2)

result:
[1.0,0.0,3.0]
[1.0,0.0,3.0]
(3,[0,2],[1.0,3.0])
(3,[0,2],[1.0,3.0])
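Once created, a local vector can be inspected through the common Vector interface or through its concrete implementation. The following is a minimal sketch, assuming the spark-mllib library is on the classpath; the object name InspectLocalVectors and the helper describe are hypothetical and exist only for this illustration.

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector, Vectors}

// Hypothetical example object; requires spark-mllib on the classpath.
object InspectLocalVectors {
  // Report which implementation backs a vector and how many values it stores.
  def describe(v: Vector): String = v match {
    case d: DenseVector  => s"dense, ${d.values.length} stored values"
    case s: SparseVector => s"sparse, ${s.indices.length} stored values out of ${s.size}"
  }

  def main(args: Array[String]): Unit = {
    val dv = Vectors.dense(1.0, 0.0, 3.0)
    val sv = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    println(describe(dv))
    println(describe(sv))
    // Both layouts describe the same logical vector.
    println(dv.size == sv.size)
    println(dv.toArray.sameElements(sv.toArray))
  }
}
```

The dense vector stores all three entries, while the sparse vector stores only the two nonzero ones; converting either with toArray yields the same array of doubles.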
Note: Scala imports scala.collection.immutable.Vector by default, so you have to import org.apache.spark.mllib.linalg.Vector explicitly to use MLlib's Vector.
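The default binding described in this note can be checked in plain Scala. This is a small sketch with a hypothetical object name (DefaultVectorBinding); it runs without Spark and only shows what the unqualified name Vector refers to when nothing is imported.

```scala
// Hypothetical example object; runs without Spark.
object DefaultVectorBinding {
  def main(args: Array[String]): Unit = {
    // With no import, Vector is Scala's immutable collection.
    // The type annotation compiles only because of that default binding.
    val v: scala.collection.immutable.Vector[Double] = Vector(1.0, 0.0, 3.0)
    println(v)

    // To make Vector mean MLlib's type instead, the explicit import from the
    // text is needed (this line requires Spark on the classpath):
    // import org.apache.spark.mllib.linalg.{Vector, Vectors}
  }
}
```

Because the two types share a simple name, forgetting the import usually shows up as a type mismatch at compile time rather than as a runtime error.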