Spark MLlib: Basic Statistics
2016-01-06 14:43
Spark MLlib provides a set of basic statistical algorithms. The main ones are described below:
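1. Summary statistics
For data of type RDD[Vector], Spark MLlib provides the colStats method, which returns an instance of MultivariateStatisticalSummary. This object holds the column-wise maximum, minimum, mean, variance, and number of nonzeros, as well as the total row count. For example: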
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

    val conf = new SparkConf().setAppName("Simple Application").setMaster("yarn-cluster")
    val sc = new SparkContext(conf)

    val observations = sc.textFile("/user/liujiyu/spark/mldata1.txt")
      .map(_.split(' ')                  // each line becomes an Array[String]
        .map(_.toDouble))                // then an Array[Double]
      .map(line => Vectors.dense(line))  // then a Vector, giving an RDD[Vector]

    // Compute column summary statistics.
    val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
    println(summary.mean)        // a dense vector containing the mean value for each column
    println(summary.variance)    // column-wise variance
    println(summary.numNonzeros) // number of nonzeros in each column
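To make the reported quantities concrete, here is a minimal plain-Scala sketch (no Spark involved) that computes the same kind of per-column values on a small in-memory dataset. The names ColumnSummarySketch, ColumnSummary, and summarize are made up for this illustration and are not part of MLlib. The sketch also assumes the variance is the unbiased sample variance (dividing by n - 1), so check that against the MLlib version you are using.

```scala
// Plain Scala, no Spark required: a local sketch of the per-column
// quantities that a column summary reports.
object ColumnSummarySketch {

  // Hypothetical container, named here for illustration only.
  final case class ColumnSummary(
      count: Long,
      mean: Array[Double],
      variance: Array[Double],
      numNonzeros: Array[Double],
      max: Array[Double],
      min: Array[Double])

  def summarize(rows: Seq[Array[Double]]): ColumnSummary = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.length
    val numCols = rows.head.length
    require(rows.forall(_.length == numCols), "all rows must have the same length")

    // Transpose: each inner sequence now holds the values of one column.
    val columns: IndexedSeq[Seq[Double]] =
      (0 until numCols).map(j => rows.map(row => row(j)))

    val means: IndexedSeq[Double] = columns.map(c => c.sum / n)

    // Assumption: unbiased sample variance (divide by n - 1).
    val variances: IndexedSeq[Double] = columns.indices.map { j =>
      val m = means(j)
      if (n > 1) columns(j).map(x => (x - m) * (x - m)).sum / (n - 1) else 0.0
    }

    ColumnSummary(
      count = n.toLong,
      mean = means.toArray,
      variance = variances.toArray,
      numNonzeros = columns.map(c => c.count(_ != 0.0).toDouble).toArray,
      max = columns.map(_.max).toArray,
      min = columns.map(_.min).toArray)
  }
}
```

For the three rows (1, 0), (2, 4), (3, 8), this sketch gives column means 2 and 4, sample variances 1 and 16, and nonzero counts 3 and 2.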
2. Correlations
MLlib can compute the correlation between two series, using either Pearson's or Spearman's method. For example:
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD

    val conf = new SparkConf().setAppName("Simple Application").setMaster("yarn-cluster")
    val sc = new SparkContext(conf)

    // Parse each text line into a Vector so the result is an RDD[Vector].
    val observations: RDD[Vector] = sc.textFile("/user/liujiyu/spark/mldata1.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    val data1 = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val data2 = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val distData1: RDD[Double] = sc.parallelize(data1)
    // Must have the same number of partitions and cardinality as distData1.
    val distData2: RDD[Double] = sc.parallelize(data2)

    // Compute the correlation using Pearson's method. Pass "spearman" for Spearman's method.
    // If no method is specified, Pearson's method is used by default.
    val correlation: Double = Statistics.corr(distData1, distData2, "pearson")

    // Note that each Vector is a row, not a column.
    val data: RDD[Vector] = observations

    // Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
    // If no method is specified, Pearson's method is used by default.
    val correlMatrix: Matrix = Statistics.corr(data, "pearson")
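The difference between the two methods is easier to see on local data. The following is a minimal plain-Scala sketch (no Spark involved) of the textbook definitions: Pearson's coefficient on the raw values, and Spearman's coefficient as Pearson's coefficient applied to the ranks. The names CorrelationSketch, pearson, ranks, and spearman are made up for this illustration. Giving tied values the average of their ranks is an assumption of this sketch, and MLlib's internal implementation may differ in detail.

```scala
// Plain Scala, no Spark required: textbook Pearson and Spearman correlation.
object CorrelationSketch {

  // Pearson's r: covariance divided by the product of the standard deviations.
  // Returns NaN when either series is constant (zero variance).
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.nonEmpty && x.length == y.length, "series must be non-empty and of equal length")
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - meanX) * (b - meanY) }.sum
    val devX = math.sqrt(x.map(a => (a - meanX) * (a - meanX)).sum)
    val devY = math.sqrt(y.map(b => (b - meanY) * (b - meanY)).sum)
    cov / (devX * devY)
  }

  // 1-based ranks; tied values share the average of the positions they occupy
  // (an assumption of this sketch).
  def ranks(x: Seq[Double]): Seq[Double] = {
    val sorted = x.zipWithIndex.sortBy(_._1).toIndexedSeq
    val out = new Array[Double](x.length)
    var i = 0
    while (i < sorted.length) {
      // Extend j to the end of the run of values equal to sorted(i).
      var j = i
      while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avgRank = (i + j) / 2.0 + 1.0
      (i to j).foreach(k => out(sorted(k)._2) = avgRank)
      i = j + 1
    }
    out.toList
  }

  // Spearman's rho: Pearson's r computed on the ranks of each series.
  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))
}
```

For a relationship that is monotonic but not linear, such as x against x cubed, Spearman's coefficient is 1 up to floating-point rounding while Pearson's is below 1. That is the usual reason to choose "spearman" over the default.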