Spark MLlib Part 2: Basic Statistics
2016-01-12 19:28
Summary statistics
We provide column summary statistics for RDD[Vector] through the function colStats available in Statistics.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val observations: RDD[Vector] = ... // an RDD of Vectors

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)        // a dense vector containing the mean of each column
println(summary.variance)    // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
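To make the per-column math concrete, here is a plain-Python sketch (no Spark; the function name and data are made up) of what colStats reports for mean, variance, and nonzero counts. Like MLlib, it uses the unbiased sample variance:

```python
def col_stats(rows):
    """Per-column mean, unbiased sample variance, and nonzero count."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(dim)]
    variances = [sum((r[c] - means[c]) ** 2 for r in rows) / (n - 1)  # divide by n - 1
                 for c in range(dim)]
    nonzeros = [sum(1 for r in rows if r[c] != 0.0) for c in range(dim)]
    return means, variances, nonzeros

# Three observations with three features each (made-up data).
rows = [[1.0, 0.0, 3.0],
        [2.0, 0.0, 5.0],
        [3.0, 6.0, 7.0]]
means, variances, nonzeros = col_stats(rows)
print(means)      # [2.0, 2.0, 5.0]
print(variances)  # [1.0, 12.0, 4.0]
print(nonzeros)   # [3, 1, 3]
```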
In addition, MultivariateStatisticalSummary exposes the following abstract value members:

- count: Long. Sample size.
- max: Vector. Maximum value of each column.
- mean: Vector. Sample mean vector.
- min: Vector. Minimum value of each column.
- normL1: Vector. L1 norm of each column.
- normL2: Vector. Euclidean magnitude of each column.
- numNonzeros: Vector. Number of nonzero elements (including explicitly presented zero values) in each column.
- variance: Vector. Sample variance vector.
Correlation
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics

val sc: SparkContext = ...

val seriesX: RDD[Double] = ... // a series
val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX

// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
// method is not specified, Pearson's method will be used by default.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")

val data: RDD[Vector] = ... // note that each Vector is a row and not a column

// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
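The Pearson coefficient itself is easy to state; as a plain-Python sketch (not Spark code, inputs made up), it is the covariance of the two series divided by the product of their deviations:

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of the deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfectly linear relation yields a coefficient of (about) 1.0.
print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

Spearman's method applies the same formula to the ranks of the values rather than the values themselves.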
Stratified sampling
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions

val sc: SparkContext = ...

val data = ... // an RDD[(K, V)] of any key value pairs
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key

// Get an approximate sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions)
// Get an exact sample from each stratum
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
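The idea behind sampleByKey can be sketched in plain Python: keep each pair with the probability assigned to its key. This only illustrates the approximate, per-stratum Bernoulli sampling, not Spark's implementation, and the data here is made up:

```python
import random

def sample_by_key(pairs, fractions, seed=42):
    """Keep each (key, value) pair with probability fractions[key]."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

# Two strata of 1000 pairs each; sample roughly 10% of "a" and 50% of "b".
data = [("a", i) for i in range(1000)] + [("b", i) for i in range(1000)]
sample = sample_by_key(data, {"a": 0.1, "b": 0.5})
counts = {k: sum(1 for key, _ in sample if key == k) for k in ("a", "b")}
print(counts)  # close to {'a': 100, 'b': 500}
```

sampleByKeyExact goes further and guarantees exactly ⌈fraction · count⌉ items per stratum, at the cost of significantly more resources.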
Hypothesis testing
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics._

val sc: SparkContext = ...

val vec: Vector = ... // a vector composed of the frequencies of events

// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,
// the test runs against a uniform distribution.
val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom,
                                 // test statistic, the method used, and the null hypothesis.

val mat: Matrix = ... // a contingency matrix

// conduct Pearson's independence test on the input contingency matrix
val independenceTestResult = Statistics.chiSqTest(mat)
println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...

val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.

// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
// the independence test. Returns an array containing the ChiSquaredTestResult for every feature
// against the label.
val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
var i = 1
featureTestResults.foreach { result =>
  println(s"Column $i:\n$result") // summary of the test
  i += 1
}
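The goodness-of-fit statistic behind chiSqTest is Pearson's chi-squared formula. A plain-Python sketch (statistic only; the p-value lookup is omitted), defaulting to a uniform expected distribution just as chiSqTest does when no second vector is supplied:

```python
def chi_sq_statistic(observed, expected=None):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E over all cells."""
    if expected is None:
        # No expected counts supplied: test against a uniform distribution.
        total = sum(observed)
        expected = [total / len(observed)] * len(observed)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(chi_sq_statistic([25.0, 25.0, 25.0, 25.0]))  # 0.0: fits uniform exactly
print(chi_sq_statistic([10.0, 20.0, 30.0, 40.0]))  # 20.0
```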
Statistics provides methods to run a 1-sample, 2-sided Kolmogorov-Smirnov test for equality of probability distributions.
import org.apache.spark.mllib.stat.Statistics

val data: RDD[Double] = ... // an RDD of sample data

// run a KS test for the sample versus a standard normal distribution
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
println(testResult) // summary of the test including the p-value, test statistic,
                    // and null hypothesis; if our p-value indicates significance,
                    // we can reject the null hypothesis

// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
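The statistic this test computes is the largest distance between the empirical CDF of the sample and the theoretical CDF. A plain-Python sketch (statistic only, no p-value; the sample data is made up):

```python
def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - cdf(x)), abs(cdf(x) - i / n))
    return d

# Compare a small sample against the uniform CDF on [0, 1].
print(ks_statistic([0.1, 0.2, 0.3, 0.4, 0.5], lambda x: x))  # 0.5
```

A user-supplied CDF, like myCDF above, slots in the same way as the lambda here.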
Random data generation
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
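The shift-and-scale step relies on the fact that if X ~ N(0, 1) then a + b·X ~ N(a, b²). A quick plain-Python check of that identity, standard library only:

```python
import random

random.seed(0)
u = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # i.i.d. draws from N(0, 1)
v = [1.0 + 2.0 * x for x in u]                        # shifted/scaled toward N(1, 4)

mean = sum(v) / len(v)
var = sum((x - mean) ** 2 for x in v) / (len(v) - 1)
print(round(mean, 2), round(var, 2))  # close to 1.0 and 4.0
```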
Kernel density estimation
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0)

// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
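Conceptually, a Gaussian KDE averages a normal density centered at every sample point. A plain-Python sketch of the estimate (an illustration of the technique, not MLlib's implementation; the sample values are made up):

```python
import math

def kernel_density(sample, bandwidth, points):
    """Average a Gaussian density, centered at each sample point, over the sample."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    def density(p):
        return sum(norm * math.exp(-0.5 * ((p - x) / bandwidth) ** 2)
                   for x in sample) / len(sample)
    return [density(p) for p in points]

# With one sample point and bandwidth 1.0, the estimate at that point is the
# peak of a standard normal density, 1 / sqrt(2 * pi) ~= 0.3989.
print(kernel_density([0.0], 1.0, [0.0]))
# Mirroring the Scala example: estimates at -1.0, 2.0, 5.0 with bandwidth 3.0.
print(kernel_density([-2.0, -1.0, 0.0, 1.0, 2.0], 3.0, [-1.0, 2.0, 5.0]))
```

A larger bandwidth smooths the estimate more; 3.0 here matches the setBandwidth(3.0) call above.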