
Machine Learning Algorithms: K-means in Spark

2017-06-16 11:02
1 Clustering

A quick recap:

First, randomly pick K points from the point set to serve as the seed points of the clusters;

Then, compute the distance from every point in the set to each of these K seed points;

Next, assign every point to the seed point it is closest to;

Finally, keep repeating the second and third steps until no point needs to be reassigned.

The above is only a conceptual sketch; to understand it clearly, we really need to look at the formulas:

1. Randomly choose K seed points, denoted $\mu_1, \dots, \mu_k$;

2. For every point $x_i$ in the set, compute

$c_i = \arg\min_j \|x_i - \mu_j\|^2$, which gives the index $j$ of the seed point at minimum distance.

That is, the point belongs to class $c_i$ among the K classes.

3. Are we done once every point has been assigned?

Of course not. The seed points were chosen at random, so the resulting partition is generally far from optimal. We therefore compute the centroid of each new cluster, take those centroids as the new seed points, and repeat step 2.

We iterate until the centroid positions barely change, at which point the clustering is essentially settled.
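To make the steps concrete, here is a minimal plain-Scala sketch of the procedure above (my own illustration on 2-D points, not the Spark implementation analyzed later):

import scala.util.Random

object NaiveKMeans {
  type Point = (Double, Double)

  def sqDist(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  def run(points: Seq[Point], k: Int, maxIter: Int = 20, eps: Double = 1e-4): Seq[Point] = {
    // 1. randomly pick k points as the initial seed points
    var centers: Seq[Point] = Random.shuffle(points).take(k)
    var moved = true
    var iter = 0
    while (moved && iter < maxIter) {
      // 2. assign every point to its closest seed point
      val clusters = points.groupBy(p => centers.indices.minBy(j => sqDist(p, centers(j))))
      // 3. recompute each centroid as the mean of its cluster
      val newCenters: Seq[Point] = centers.indices.map { j =>
        clusters.get(j) match {
          case Some(ps) => (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)
          case None     => centers(j) // empty cluster: keep the old seed point
        }
      }
      // 4. stop once no centroid moves farther than eps
      moved = centers.zip(newCenters).exists { case (c, n) => sqDist(c, n) > eps * eps }
      centers = newCenters
      iter += 1
    }
    centers
  }
}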

2 Clustering and EM

When clustering a set of points, we do not know the true class of each point. So to start, we simply hypothesize a class $y$ for a sample $x$.

Via maximum likelihood estimation, we can measure how plausible the assignment $(x, y)$ is.

So we can first adjust the other parameters to make the likelihood $P(x, y)$ as large as possible under the current assignment.

But the class $y$ was picked arbitrarily. Even if the likelihood is now as good as it can be under this assignment, there may exist a better assignment $y'$ that makes $P(x, y')$ larger still; in that case we switch to $y'$ and again adjust the parameters to maximize $P(x, y')$.

We iterate back and forth like this until no better $y'$ exists.

$y'$: as if the clustering world would tolerate such an unbeatable $y$.

3 The EM Algorithm

The training samples are the points $x_1, x_2, \dots, x_m$ from above. Assume the points are mutually independent, and that each sample $x_i$ has a hidden class label $y_i$. Each point $x$ is an $n$-dimensional vector, naturally.

Under these assumptions, the maximum likelihood objective is:

$L(\theta) = \sum_{i=1}^{m} \log \sum_{y_i} p(x_i, y_i; \theta)$

At this point, solving for the parameters is blocked by the unknown $y$; taking derivatives directly gets us nowhere.

So EM posits a distribution $Q_i$ over the class of each sample.

Then apply a simple equivalent rewrite:

$L(\theta) = \sum_{i=1}^{m} \log \sum_{y_i} p(x_i, y_i; \theta) = \sum_{i=1}^{m} \log \sum_{y_i} Q_i(y_i)\,\frac{p(x_i, y_i; \theta)}{Q_i(y_i)} \geq \sum_i \sum_{y_i} Q_i(y_i) \log \frac{p(x_i, y_i; \theta)}{Q_i(y_i)}$

This "greater than or equal" follows from Jensen's inequality.

For a concave function such as $\log$, Jensen's inequality says that the expectation of the function is no larger than the function of the expectation: $E[f(X)] \le f(E[X])$.

Equality holds when the argument $X$ is a constant.

So what we now have is a lower bound on $L(\theta)$.

The condition for equality to hold is:

$\frac{p(x_i, y_i; \theta)}{Q_i(y_i)} = C$

And since $Q_i$ is a distribution over the classes, it satisfies:

$\sum_{y_i} Q_i(y_i) = 1$

Combining the two, we get:

$\sum_{y_i} p(x_i, y_i; \theta) = C$

$Q_i(y_i) = \frac{p(x_i, y_i; \theta)}{\sum_{y_i} p(x_i, y_i; \theta)} = p(y_i \mid x_i; \theta)$

With $Q_i$ chosen this way the lower bound becomes tight, which completes the E-step; now the M-step begins.

The M-step does exactly what sections 1 and 2 described: with $Q_i$ fixed, iterate over $\theta$ to maximize the lower bound, and hence push up the likelihood $P$.
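Putting the two phases together in the notation above, one EM iteration is:

E-step: $Q_i(y_i) := p(y_i \mid x_i; \theta)$

M-step: $\theta := \arg\max_{\theta} \sum_i \sum_{y_i} Q_i(y_i) \log \frac{p(x_i, y_i; \theta)}{Q_i(y_i)}$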

4 Spark Source Code Analysis of the Clustering Algorithm

def run(data: RDD[Vector]): KMeansModel = {
run(data, None)
}

private[spark] def run(
data: RDD[Vector],
instr: Option[Instrumentation[NewKMeans]]): KMeansModel = {
// warn if the input RDD is not cached
if (data.getStorageLevel == StorageLevel.NONE) {
logWarning("The input data is not directly cached, which may hurt performance if its"
+ " parent RDDs are also uncached.")
}

"// 计算 每一行的L2范数
val norms = data.map(Vectors.norm(_, 2.0))
#while (i < size) {
# sum += values(i) * values(i)
# i += 1
#}
norms.persist()//cache
val zippedData = data.zip(norms).map { case (v, norm) =>
new VectorWithNorm(v, norm)
}
val model = runAlgorithm(zippedData, instr)
norms.unpersist()

// Warn at the end of the run as well, for increased visibility.
if (data.getStorageLevel == StorageLevel.NONE) {
logWarning("The input data was not directly cached, which may hurt performance if its"
+ " parent RDDs are also uncached.")
}
model
}
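For reference, here is a minimal sketch of reaching the run path above through the public MLlib API (it assumes a SparkContext named sc is already in scope):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
)).cache() // cache the input to avoid the warning logged above

val model = KMeans.train(points, 2, 20) // k = 2, maxIterations = 20
model.clusterCenters.foreach(println)
println(model.predict(Vectors.dense(0.05, 0.05))) // index of the closest center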

/**
* Implementation of K-Means algorithm.
*/
private def runAlgorithm(
data: RDD[VectorWithNorm],
instr: Option[Instrumentation[NewKMeans]]): KMeansModel = {

val sc = data.sparkContext

val initStartTime = System.nanoTime()

val centers = initialModel match {
case Some(kMeansCenters) => // user-supplied seed points
kMeansCenters.clusterCenters.map(new VectorWithNorm(_))
case None =>
if (initializationMode == KMeans.RANDOM) {
initRandom(data) // random initialization
} else {
initKMeansParallel(data) // k-means|| initialization
}
}
val initTimeInSeconds = (System.nanoTime() - initStartTime) / 1e9
logInfo(f"Initialization with $initializationMode took $initTimeInSeconds%.3f seconds.")

var converged = false // whether the iteration has converged
var cost = 0.0
var iteration = 0

val iterationStartTime = System.nanoTime()

instr.foreach(_.logNumFeatures(centers.head.vector.size))

// For each sample, find the closest seed point, accumulate the values and count of the samples assigned to it, then update the seed points
while (iteration < maxIterations && !converged) {
val costAccum = sc.doubleAccumulator
val bcCenters = sc.broadcast(centers) // centers is broadcast via SparkContext.broadcast,
// so each RDD partition can use the broadcast variable during its iteration.

// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
val thisCenters = bcCenters.value
val dims = thisCenters.head.vector.size

val sums = Array.fill(thisCenters.length)(Vectors.zeros(dims)) // initialize a (thisCenters.length, dims) all-zero matrix
val counts = Array.fill(thisCenters.length)(0L)

points.foreach { point =>
val (bestCenter, cost) = KMeans.findClosest(thisCenters, point) // find the closest seed point for each point
costAccum.add(cost)
val sum = sums(bestCenter)
axpy(1.0, point.vector, sum) // accumulate the sum of the points in each cluster
counts(bestCenter) += 1
}

counts.indices.filter(counts(_) > 0).map(j => (j, (sums(j), counts(j)))).iterator
}.reduceByKey { case ((sum1, count1), (sum2, count2)) =>
axpy(1.0, sum2, sum1)
(sum1, count1 + count2)
}.collectAsMap()

bcCenters.destroy(blocking = false)

// Update the seed points and check whether to keep iterating
converged = true
totalContribs.foreach { case (j, (sum, count)) =>
scal(1.0 / count, sum) // divide by count to get the new centroid of the cluster
val newCenter = new VectorWithNorm(sum)
if (converged && KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) { // check whether newCenter moved more than epsilon from centers(j)
converged = false // it moved more than the threshold, so not converged yet
}
centers(j) = newCenter // store the new centroid
}

cost = costAccum.value
iteration += 1
}

val iterationTimeInSeconds = (System.nanoTime() - iterationStartTime) / 1e9
logInfo(f"Iterations took $iterationTimeInSeconds%.3f seconds.")

if (iteration == maxIterations) {
logInfo(s"KMeans reached the max number of iterations: $maxIterations.")
} else {
logInfo(s"KMeans converged in $iteration iterations.")
}

logInfo(s"The cost is $cost.")

new KMeansModel(centers.map(_.vector))
}


4.1 Parameter Configuration

private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol
with HasSeed with HasPredictionCol with HasTol {

// the K in K-means
@Since("1.5.0")
final val k = new IntParam(this, "k", "The number of clusters to create. " +
"Must be > 1.", ParamValidators.gt(1))

/** @group getParam */
@Since("1.5.0")
def getK: Int = $(k)

/**
The method used to initialize the cluster centers: random or k-means||
*/
@Since("1.5.0")
final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " +
"Supported options: 'random' and 'k-means||'.",
(value: String) => MLlibKMeans.validateInitMode(value))

/** @group expertGetParam */
@Since("1.5.0")
def getInitMode: String = $(initMode)

/**
The number of steps for the k-means|| initialization method
*/
@Since("1.5.0")
final val initSteps = new IntParam(this, "initSteps", "The number of steps for k-means|| " +
"initialization mode. Must be > 0.", ParamValidators.gt(0))

/** @group expertGetParam */
@Since("1.5.0")
def getInitSteps: Int = $(initSteps)

/**
* Validates and transforms the input schema.
* @param schema input schema
* @return output schema
*/
protected def validateAndTransformSchema(schema: StructType): StructType = {
SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
}
}
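A minimal sketch of setting these parameters through the DataFrame-based spark.ml KMeans estimator (it assumes a SparkSession named spark is already in scope):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
).map(Tuple1.apply)).toDF("features")

val kmeans = new KMeans()
  .setK(2)                  // k: must be > 1
  .setInitMode("k-means||") // initMode: "random" or "k-means||"
  .setInitSteps(2)          // initSteps: only used by k-means||
  .setMaxIter(20)
  .setSeed(1L)

val model = kmeans.fit(df)
model.clusterCenters.foreach(println)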


4.2 findClosest

Returns the index of the seed point closest to the given point.

There is a shortcut check here first.

The true squared Euclidean distance (square root omitted for now) is:

$(a_1-a_2)^2+(b_1-b_2)^2 = a_1^2+a_2^2+b_1^2+b_2^2-2(a_1a_2+b_1b_2)$

Relaxing it with an inequality (Cauchy-Schwarz):

$a_1^2+a_2^2+b_1^2+b_2^2-2(a_1a_2+b_1b_2) \geq a_1^2+a_2^2+b_1^2+b_2^2-2\sqrt{(a_1^2+b_1^2)(a_2^2+b_2^2)} = \left(\sqrt{a_1^2+b_1^2}-\sqrt{a_2^2+b_2^2}\right)^2$

If even $\left(\sqrt{a_1^2+b_1^2}-\sqrt{a_2^2+b_2^2}\right)^2$, i.e. $(\|a\|-\|b\|)^2$, is larger than the best distance found so far, there is no need to compute the full Euclidean distance.

private[mllib] def findClosest(
centers: TraversableOnce[VectorWithNorm],
point: VectorWithNorm): (Int, Double) = {
var bestDistance = Double.PositiveInfinity
var bestIndex = 0
var i = 0
centers.foreach { center =>
// Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
// distance computation.
var lowerBoundOfSqDist = center.norm - point.norm
lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
if (lowerBoundOfSqDist < bestDistance) {
val distance: Double = fastSquaredDistance(center, point)
if (distance < bestDistance) {
bestDistance = distance
bestIndex = i
}
}
i += 1
}
(bestIndex, bestDistance)
}


The key piece in it is:

private[mllib] def fastSquaredDistance(
v1: Vector,
norm1: Double,
v2: Vector,
norm2: Double,
precision: Double = 1e-6): Double = {
val n = v1.size
require(v2.size == n)
require(norm1 >= 0.0 && norm2 >= 0.0)
val sumSquaredNorm = norm1 * norm1 + norm2 * norm2//a_1^2+a_2^2+b_1^2+b_2^2
val normDiff = norm1 - norm2
var sqDist = 0.0

val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON) // precision estimate
if (precisionBound1 < precision) {
sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)//a_1^2+a_2^2+b_1^2+b_2^2-2(a_1a_2+b_1b_2)
} else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
val dotValue = dot(v1, v2)
sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
(sqDist + EPSILON)
if (precisionBound2 > precision) {
sqDist = Vectors.sqdist(v1, v2)//(a_1-a_2)^2+(b_1-b_2)^2
}
} else {
sqDist = Vectors.sqdist(v1, v2)
}
sqDist
}
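As an aside (my own illustration, not from the Spark source): the precision bound exists because computing the squared distance as $\|a\|^2+\|b\|^2-2\,a\cdot b$ subtracts two nearly equal numbers when the vectors are close, which can cancel away most of the significant digits, while Vectors.sqdist stays accurate. A small sketch of the effect:

// two nearly identical 1000-dimensional vectors; the true squared distance is (1e-3)^2 = 1e-6
val a = Array.fill(1000)(1000.0)
val b = a.clone(); b(0) += 1e-3

val normSq = a.map(x => x * x).sum + b.map(x => x * x).sum // ||a||^2 + ||b||^2, about 2e9
val dotAB  = a.zip(b).map { case (x, y) => x * y }.sum

val viaDot = normSq - 2.0 * dotAB // subtracts two ~2e9 values; may drift far from 1e-6
val direct = a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum // stays close to 1e-6

println(s"via dot product: $viaDot, direct sqdist: $direct")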