
Spark MLlib: Basic Statistics

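Spark MLlib provides several basic statistical algorithms. The main ones are described below:

1. Summary statistics

For an RDD[Vector], Spark MLlib provides the Statistics.colStats method, which returns an instance of MultivariateStatisticalSummary. The summary holds the column-wise maximum, minimum, mean, variance, and number of nonzeros, along with the total row count. For example: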

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val conf = new SparkConf().setAppName("Simple Application").setMaster("yarn-cluster")
val sc = new SparkContext(conf)
// Each input line holds one observation: space-separated numeric values.
val observations = sc.textFile("/user/liujiyu/spark/mldata1.txt")
  .map(_.split(' '))                 // convert to RDD[Array[String]]
  .map(_.map(_.toDouble))            // convert to RDD[Array[Double]]
  .map(line => Vectors.dense(line))  // convert to RDD[Vector]
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
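
To make the three printed quantities concrete, here is a minimal plain-Scala sketch (no Spark) of the same column-wise statistics on a small in-memory data set. The object name ColStatsSketch and its helpers are hypothetical and exist only for illustration. The variance uses the unbiased n - 1 denominator, on the assumption that this matches what colStats reports.

```scala
object ColStatsSketch {
  // Turn row-major observations into a sequence of columns.
  private def columns(rows: Seq[Array[Double]]): Seq[Seq[Double]] =
    rows.map(_.toSeq).transpose

  // Mean of each column.
  def mean(rows: Seq[Array[Double]]): Seq[Double] =
    columns(rows).map(c => c.sum / c.length)

  // Unbiased sample variance of each column (assumption: n - 1 denominator).
  def variance(rows: Seq[Array[Double]]): Seq[Double] =
    columns(rows).map { c =>
      val m = c.sum / c.length
      if (c.length > 1) c.map(x => (x - m) * (x - m)).sum / (c.length - 1)
      else 0.0
    }

  // Number of nonzero entries in each column.
  def numNonzeros(rows: Seq[Array[Double]]): Seq[Int] =
    columns(rows).map(c => c.count(_ != 0.0))
}
```

For the two rows (1.0, 0.0) and (3.0, 4.0), the column means are 2.0 and 2.0, the variances are 2.0 and 8.0, and the nonzero counts are 2 and 1.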

2. Correlations

Statistics.corr computes the correlation between two series. Both Pearson's and Spearman's correlation are supported. For example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("Simple Application").setMaster("yarn-cluster")
val sc = new SparkContext(conf)
// Parse each space-separated line into a dense vector (one row per line).
val observations: RDD[Vector] = sc.textFile("/user/liujiyu/spark/mldata1.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val data1 = Array(1.0, 2.0, 3.0, 4.0, 5.0)
val data2 = Array(1.0, 2.0, 3.0, 4.0, 5.0)
val distData1: RDD[Double] = sc.parallelize(data1)
val distData2: RDD[Double] = sc.parallelize(data2) // must have the same number of partitions and cardinality as distData1
// Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
// method is not specified, Pearson's method will be used by default.
val correlation: Double = Statistics.corr(distData1, distData2, "pearson")
val data: RDD[Vector] = observations // note that each Vector is a row and not a column
// Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
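
For reference, the two coefficients can be sketched in plain Scala without Spark. The object CorrelationSketch and its functions are hypothetical names used only for illustration. Spearman's coefficient is computed here as Pearson's coefficient on ranks, with tied values sharing their average rank; treating ties this way is an assumption of this sketch, not a statement about MLlib's internals.

```scala
object CorrelationSketch {
  // Pearson correlation: covariance divided by the product of the standard deviations.
  // Returns NaN if either series is constant (zero variance).
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length && x.nonEmpty, "series must be non-empty and of equal length")
    val mx = x.sum / x.length
    val my = y.sum / y.length
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // 1-based ranks; tied values receive the average of the positions they occupy.
  def ranks(v: Seq[Double]): Seq[Double] = {
    val sorted = v.zipWithIndex.sortBy(_._1)
    val out = Array.ofDim[Double](v.length)
    var i = 0
    while (i < sorted.length) {
      var j = i
      while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
      val avg = (i + j) / 2.0 + 1.0
      for (k <- i to j) out(sorted(k)._2) = avg
      i = j + 1
    }
    out.toSeq
  }

  // Spearman correlation: Pearson correlation of the rank-transformed series.
  def spearman(x: Seq[Double], y: Seq[Double]): Double =
    pearson(ranks(x), ranks(y))
}
```

With the two identical series used above (1.0 through 5.0), both coefficients come out at 1.0 up to floating-point rounding. Spearman's coefficient also stays at 1.0 for any strictly increasing transformation of one series, such as cubing its values, because only the ranks matter.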

 


kenson4930