您的位置：首页 > 理论基础 > 计算机网络

spark mllib源码分析之逻辑回归弹性网络ElasticNet

2018-01-03 00:00 701 查看

相关文章
spark mllib源码分析之逻辑回归弹性网络ElasticNet（二）
spark源码分析之L-BFGS
spark mllib源码分析之OWLQN
spark中的online均值/方差统计
spark源码分析之二分类逻辑回归evaluation
spark正则化

spark在ml包中将逻辑回归封装了下，同时在算法中引入了L1和L2正则化，通过elasticNetParam来调节两种正则化的系数，同时根据选择的正则化，决定使用L-BFGS还是OWLQN优化，是谓Elastic Net。

1. 辅助类

我们首先介绍下模型训练和预测，评价中使用到的一些类。

1.1. MultiClassSummarizer

主要用在样本的训练过程中，统计数据中各种label出现的次数及其weight，这里引入了样本weight，可以用在unbalance的数据中，通过惩罚数量大的class达到样本均衡，默认为1

class MultiClassSummarizer extends Serializable {
private val distinctMap = new mutable.HashMap[Int, (Long, Double)]
private var totalInvalidCnt: Long = 0L

1

2

3

distinctMap的key是label，类型为Long，value是个tuple，第一个元素是label出现的次数，第二维是weight的和，

∑wil=l∗∑wi

l是label，weight为1的时候，这里相当于label的数量。
因为这个累积器主要用在treeAggregator中，重要的是两个函数，add用于累积样本，merge用于两个MultiClassSummarizer的合并

/**
* Add a new label into this MultilabelSummarizer, and update the distinct map.
*
* @param label The label for this data point.
* @param weight The weight of this instances.
* @return This MultilabelSummarizer
*/
def add(label: Double, weight: Double = 1.0): this.type = {
require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")

if (weight == 0.0) return this
//这里要求label必须为整数，否则认为非法
if (label - label.toInt != 0.0 || label < 0) {
totalInvalidCnt += 1
this
}else {
val (counts: Long, weightSum: Double) = distinctMap.getOrElse(label.toInt, (0L, 0.0))
//累加样本次数及weight
distinctMap.put(label.toInt, (counts + 1L, weightSum + weight))
this
}
}

/**
* Merge another MultilabelSummarizer, and update the distinct map.
* (Note that it will merge the smaller distinct map into the larger one using in-place
* merging, so either `this` or `other` object will be modified and returned.)
*
* @param other The other MultilabelSummarizer to be merged.
* @return Merged MultilabelSummarizer object.
*/
def merge(other: MultiClassSummarizer): MultiClassSummarizer = {
//将size小的并入大的，性能
val (largeMap, smallMap) = if (this.distinctMap.size > other.distinctMap.size) {
(this, other)
} else {
(other, this)
}
smallMap.distinctMap.foreach {
case (key, value) =>
val (counts: Long, weightSum: Double) = largeMap.distinctMap.getOrElse(key, (0L, 0.0))
//直接累加
largeMap.distinctMap.put(key, (counts + value._1, weightSum + value._2))
}
largeMap.totalInvalidCnt += smallMap.totalInvalidCnt
largeMap
}

返回统计到的class数，默认从0开始，所以是最大label+1

def numClasses: Int = if (distinctMap.isEmpty) 0 else distinctMap.keySet.max + 1

1

返回weight累积和

def histogram: Array[Double] = {
val result = Array.ofDim[Double](numClasses)
var i = 0
//应该是val len = numClasses
val len = result.length
//这里要求class从0到k-1
while (i < len) {
result(i) = distinctMap.getOrElse(i, (0L, 0.0))._2
i += 1
}
result
}

对比numClasses，可以看到这里result实现是有点问题的，必须要求class从0到k-1全部出现了，否则会丢失部分的class的统计。

1.2. MultivariateOnlineSummarizer

在spark中的online均值/方差统计中已有介绍，计算样本集的方差，用于归一化。

1.3 LogisticRegressionModel

逻辑回归model，放着训练得到的系数矩阵，矩阵，class数，是否多分类等参数。

1.3.1. 预测

override protected def predict(features: Vector): Double = if (isMultinomial) {
super.predict(features)
} else {
// Note: We should use getThreshold instead of $(threshold) since getThreshold is overridden.
if (score(features) > getThreshold) 1 else 0
}

可以看到二分类与多分类是分开处理的，其原理是不同的

1.3.1.1. 二分类

从上面可以看到二分类的预测是通过计算特征得分，与threshold比较，大于为1，否则0，score函数代码

private val score: Vector => Double = (features) => {
val m = margin(features)
1.0 / (1.0 + math.exp(-m))
}

从score函数可以看到，这里是将margin带入了sigmoid函数，我们看margin函数

private val margin: Vector => Double = (features) => {
BLAS.dot(features, _coefficients) + _intercept
}

1

2

3

就是将特征与系数相乘，再加上截距。
二分类中还实现了一些低级API，用在evaluate model，分别计算margin，预测值，预测label

//计算二分类的margin，返回DenseVector
override protected def predictRaw(features: Vector): Vector = {
val m = margin(features)
Vectors.dense(-m, m)
}
//由margin计算原始的预测值，也就是经过sigmoid函数的值
override protected def raw2probabilityInPlace(rawPrediction: Vector): Vector = {
rawPrediction match {
case dv: DenseVector =>
var i = 0
val size = dv.size
while (i < size) {
dv.values(i) = 1.0 / (1.0 + math.exp(-dv.values(i)))
i += 1
}
dv
case sv: SparseVector =>
throw new RuntimeException("Unexpected error in LogisticRegressionModel:" +
" raw2probabilitiesInPlace encountered SparseVector")
}
}
//由原始的预测值，预测label，从上面可知vector(1)为实际的预测值，用来预测label
override protected def raw2prediction(rawPrediction: Vector): Double = {
// Note: We should use getThreshold instead of $(threshold) since getThreshold is overridden.
val t = getThreshold
val rawThreshold = if (t == 0.0) {
Double.NegativeInfinity
} else if (t == 1.0) {
Double.PositiveInfinity
} else {
math.log(t / (1.0 - t))
}
if (rawPrediction(1) > rawThreshold) 1 else 0
}

1.3.1.2. 多分类

多分类时，其调用了父类的predict函数

override protected def predict(features: FeaturesType): Double = {
raw2prediction(predictRaw(features))
}

1

2

3

调用了raw2prediction函数

override protected def raw2prediction(rawPrediction: Vector): Double = {
if (!isDefined(thresholds)) {
rawPrediction.argmax
} else {
probability2prediction(raw2probability(rawPrediction))
}
}

1

2

3

4

5

6

7

可以看到，如果没有设置thresholds数组（一般不会设置），直接返回了入参rawPrediction向量中最大元素所在的位置（index），举例来说rawPrediction如果是[2.3, 1.2, 5.1, 3.4]，则返回2(最大值5.1)。rawPrediction来自于predictRaw函数

override protected def predictRaw(features: Vector): Vector = {
if (isMultinomial) {
margins(features)
} else {
val m = margin(features)
Vectors.dense(-m, m)
}
}

1

2

3

4

5

6

7

8

直接调用了margins函数

private val margins: Vector => Vector = (features) => {
val m = interceptVector.toDense.copy
//m = alpha * coefficientMatrix * features + beta * m
BLAS.gemv(1.0, coefficientMatrix, features, 1.0, m)
m
}

1

2

3

4

5

6

代码比较简单，系数矩阵分别于特征向量相乘，再与截距向量相加。

1.3.2. save model

使用LogisticRegressionModelWriter将训练的参数和得到的系数矩阵写入hdfs

class LogisticRegressionModelWriter(instance: LogisticRegressionModel)
extends MLWriter with Logging {

private case class Data(
numClasses: Int,
numFeatures: Int,
interceptVector: Vector,
coefficientMatrix: Matrix,
isMultinomial: Boolean)

override protected def saveImpl(path: String): Unit = {
//训练时的参数
DefaultParamsWriter.saveMetadata(instance, path, sc)
//保存训练结果
val data = Data(instance.numClasses, instance.numFeatures, instance.interceptVector,
instance.coefficientMatrix, instance.isMultinomial)
val dataPath = new Path(path, "data").toString
sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
}
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

metadata中除了训练参数，还保存了训练时的环境，官方demo的训练参数保存结果

{
"class": "org.apache.spark.ml.classification.LogisticRegressionModel",
"timestamp": 1500886361787,
"sparkVersion": "2.0.2",
"uid": "logreg_ea57ce7dcde4",
"paramMap": {
"fitIntercept": true,
"rawPredictionCol": "rawPrediction",
"predictionCol": "prediction",
"tol": 0.000001,
"labelCol": "label",
"standardization": true,
"regParam": 0.3,
"probabilityCol": "probability",
"featuresCol": "features",
"maxIter": 10,
"elasticNetParam": 0.8,
"threshold": 0.5
}
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

1.3.3. load model

使用LogisticRegressionModelReader将save保存的模型读取回来，metadata使用json解析回来，解析parquet获取系数矩阵，截距等，比较简单。

1.4. LogisticAggregator

LogisticAggregator用于训练过程中，计算每轮迭代的梯度和loss，需要分布式计算，类似于上面的summarizer，也是用在treeAggregator中。

1.4.1. 算法

用于训练过程中计算梯度与loss，在前面介绍L-BFGS时说过其训练结果返回的系数向量只有k-1维，预测时则默认class 0的margin是0，这种是带pivot class，二分类属于这种；这里的多分类不使用这种方法，而是训练得到k个class分别对应的系数。

1.4.1. 1. 二分类

如前文所述，二分类是有pivot，一般二分类的梯度

δδθjJ(θ)=−1m∑i=1m(hθ(xi)−yi)xi,jJ(θ)=−1m∑i=1m[yiloghθ(xi)+(1−yi)log(1−hθ(xi))]

这里的m是样本数，对于单个样本m=1，h是sigmoid函数，y是label，整理可得

margin=−x⃗ ⋅β⃗ ⋯(1)multiplier=wi∗(11+emargin−label)⋯(2)∂ℓ(β,x⃗ i,wi)∂β=xi,j∗multiplier⋯(3)whenlabel=1ℓ(β,x⃗ i,wi)whenlabel=0ℓ(β,x⃗ i,wi)=−yiloghθ(xi)=log(1+emargin)⋯(4)=−(1−yi)log(1−hθ(xi))=−logemargin1+emargin=log(1+emargin)−margin⋯(5)

1.4.1. 2. 多分类

多分类时，

P(yi=0|x⃗ i,β)=ex⃗ Tiβ⃗ 0∑K−1k=0ex⃗ Tiβ⃗ kP(yi=1|x⃗ i,β)=ex⃗ Tiβ⃗ 1∑K−1k=0ex⃗ Tiβ⃗ k⋯⋯P(yi=K−1|x⃗ i,β)=ex⃗ Tiβ⃗ K−1∑K−1k=0ex⃗ Tiβ⃗ k

模型的系数组成一个K(classNum)乘N（特征数，如果有截距就是N+1)的矩阵。对比有pivot class的方式，这种方式其实更加简洁优雅，但是其实我们对所有P都分子分母同时除以ex⃗ Tiβ⃗ 0，就是pivot class方式的表述，而且这种方式带来一个问题，就是从形式上看当截距变化时，概率p是不随其改变的

ex⃗ Ti(β⃗ 0+c⃗ )∑K−1k=0ex⃗ Ti(β⃗ k+c⃗ )=ex⃗ Tiβ⃗ 0ex⃗ Tic⃗ ex⃗ Tic⃗ ∑K−1k=0ex⃗ Tiβ⃗ k=ex⃗ Tiβ⃗ 0∑K−1k=0ex⃗ Tiβ⃗ k

但是如果加入正则化，我们则只有一组系数矩阵可以最小化正则项，则这个系数矩阵就是具有区分度的（或者说是唯一的）。对于单个样本，其loss（忽略正则项）可写作

ℓ(β,xi)=−logP(yi|x⃗ i,β)=log(∑k=0K−1ex⃗ Tiβ⃗ k)−x⃗ Tiβ⃗ y=log(∑k=0K−1emarginsk)−marginsy⋯(8)wheremarginsk=x⃗ Tiβ⃗ k

优化求导可得

∂ℓ(β,x⃗ i,wi)∂βj,k=xi,j⋅wi⋅(ex⃗ i⋅β⃗ k∑K−1k′=0ex⃗ i⋅β⃗ k′−Iy=k)=xi,j⋅wi⋅multiplierk⋯(6)Iy=k={10y=kelsemultiplierk=⎛⎝ex⃗ i⋅β⃗ k∑K−1k=0ex⃗ i⋅β⃗ k−Iy=k⎞⎠⋯(7)

这里的I的含义是对于class k的样本，我们计算所有class的梯度向量时，只有当k==label时，I为1，其他时候为0。wi是样本权重。
类似于我们在L-BFGS文章中的讨论，这里的指数超过709.78时，有溢出的风险，类似处理

ℓ(β,x)=log(∑k=0K−1emarginsk−maxMargin)−marginsy+maxMargin⋯(9)

梯度也是做类似处理

multiplierk=ex⃗ i⋅β⃗ k−maxMargin∑K−1k′=0ex⃗ i⋅β⃗ k′−maxMargin−Iy=k

1.4.2. 实现

梯度和loss的计算支持分布式计算，add函数用于计算样本，merge用户累积器的合并。

1.4.2.1. add

add的入参是特征向量features，样本weight，label。

1.4.2.1.1. 二分类

add直接调用binaryUpdateInPlace函数

private def binaryUpdateInPlace(
features: Vector,
weight: Double,
label: Double): Unit = {

val localFeaturesStd = bcFeaturesStd.value
val localCoefficients = bcCoefficients.value
val localGradientArray = gradientSumArray
//指数部分，式(1)
val margin = - {
var sum = 0.0
features.foreachActive { (index, value) =>
if (localFeaturesStd(index) != 0.0 && value != 0.0) {
//归一化
sum += localCoefficients(index) * value / localFeaturesStd(index)
}
}
//截距
if (fitIntercept) sum += localCoefficients(numFeaturesPlusIntercept - 1)
sum
}
//式（2）
val multiplier = weight * (1.0 / (1.0 + math.exp(margin)) - label)
//式(3)，更新梯度
features.foreachActive { (index, value) =>
if (localFeaturesStd(index) != 0.0 && value != 0.0) {
//归一化
localGradientArray(index) += multiplier * value / localFeaturesStd(index)
}
}

if (fitIntercept) {
localGradientArray(numFeaturesPlusIntercept - 1) += multiplier
}
//loss
if (label > 0) {
// 式(4)
lossSum += weight * MLUtils.log1pExp(margin)
} else {
//式(5)
lossSum += weight * (MLUtils.log1pExp(margin) - margin)
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

1.4.2.1.2. 多分类

add调用multinomialUpdateInPlace，对应上述算法，源码实现

private def multinomialUpdateInPlace(
features: Vector,
weight: Double,
label: Double): Unit = {
// TODO: use level 2 BLAS operations
/*
Note: this can still be used when numClasses = 2 for binary
logistic regression without pivoting.
*/
val localFeaturesStd = bcFeaturesStd.value
val localCoefficients = bcCoefficients.value
val localGradientArray = gradientSumArray

// marginOfLabel is margins(label) in the formula
var marginOfLabel = 0.0
var maxMargin = Double.NegativeInfinity
//计算每个class的margin
val margins = new Array[Double](numClasses)
//计算系数与特征部分
features.foreachActive { (index, value) =>
val stdValue = value / localFeaturesStd(index)
var j = 0
while (j < numClasses) {
margins(j) += localCoefficients(index * numClasses + j) * stdValue
j += 1
}
}
//加截距
var i = 0
while (i < numClasses) {
if (fitIntercept) {
margins(i) += localCoefficients(numClasses * numFeatures + i)
}
//记录label对应的margin，用于loss计算
if (i == label.toInt) marginOfLabel = margins(i)
//记录最大的margin，看是否需要额外处理
if (margins(i) > maxMargin) {
maxMargin = margins(i)
}
i += 1
}

/**
* When maxMargin is greater than 0, the original formula could cause overflow.
* We address this by subtracting maxMargin from all the margins, so it's guaranteed
* that all of the new margins will be smaller than zero to prevent arithmetic overflow.
*/
val multipliers = new Array[Double](numClasses)
//式(7)的分母，所有class的margin和
val sum = {
var temp = 0.0
var i = 0
while (i < numClasses) {
//最大margin大于0，先减去max
if (maxMargin > 0) margins(i) -= maxMargin
val exp = math.exp(margins(i))
temp += exp
multipliers(i) = exp
i += 1
}
temp
}
//式(7)
margins.indices.foreach { i =>
//label对应的margin，I=1，否则I=0
multipliers(i) = multipliers(i) / sum - (if (label == i) 1.0 else 0.0)
}
features.foreachActive { (index, value) =>
if (localFeaturesStd(index) != 0.0 && value != 0.0) {
val stdValue = value / localFeaturesStd(index)
var j = 0
//式(6)，更新梯度
while (j < numClasses) {
localGradientArray(index * numClasses + j) +=
weight * multipliers(j) * stdValue
j += 1
}
}
}
//截距当做特征值全为1的一维特征，更新方法可类比于正常特征
if (fitIntercept) {
var i = 0
while (i < numClasses) {
localGradientArray(numFeatures * numClasses + i) += weight * multipliers(i)
i += 1
}
}

val loss = if (maxMargin > 0) {
//式(8)
math.log(sum) - marginOfLabel + maxMargin
} else {
//式(9)
math.log(sum) - marginOfLabel
}
lossSum += weight * loss
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

89

90

91

92

93

94

95

96

97

1.4.2.2. merge

merge处理累积器之间的合并，loss和梯度都是直接累加即可，这里不再赘述

1.4.2.3. 结果返回

merge之后的结果需要对weight（如样本weight为1，这里相当于m)平均

def loss: Double = {
require(weightSum > 0.0, s"The effective number of instances should be " +
s"greater than 0.0, but $weightSum.")
lossSum / weightSum
}

def gradient: Matrix = {
require(weightSum > 0.0, s"The effective number of instances should be " +
s"greater than 0.0, but $weightSum.")
val result = Vectors.dense(gradientSumArray.clone())
scal(1.0 / weightSum, result)
new DenseMatrix(numCoefficientSets, numFeaturesPlusIntercept, result.toArray)
}

1

2

3

4

5

6

7

8

9

10

11

12

13

梯度与系数矩阵是对应的，在迭代中是当成一维的向量存储，按维度展开有两种展开方式，如下图

结合梯度更新的代码，我们可以看出梯度向量在迭代中的存储格式是图中的第一种，先存特征0在各class的梯度，再存特征1，以此类推。对应到上面的DenseMatrix，其行是numCoefficientSets，列是numFeaturesPlusIntercept，是一个K*N的矩阵，取元素(i,j)(从0开始)则是i+j∗numCoefficientSets，例如我们要取class1，特征2对应的梯度值，应该是1+2k，对号对应上图第一种f2的第2个位置，对应代码

private[ml] def index(i: Int, j: Int): Int = {
require(i >= 0 && i < numRows, s"Expected 0 <= i < $numRows, got i = $i.")
require(j >= 0 && j < numCols, s"Expected 0 <= j < $numCols, got j = $j.")
//本例isTransposed=false
if (!isTransposed) i + numRows * j else j + numCols * i
}

1

2

3

4

5

6

1.5. LogisticCostFun

逻辑回归的损失函数，用于每轮迭代中计算所有样本的loss和gradient，对所有样本累积的时候会使用LogisticAggregator，然后再加上正则项，返回本次更新的梯度。类成员

private class LogisticCostFun(
instances: RDD[Instance],  //样本集
numClasses: Int,           //分类数
fitIntercept: Boolean,     //是否拟合截距
standardization: Boolean,   //是否归一化
bcFeaturesStd: Broadcast[Array[Double]], //各维特征的标准差
regParamL2: Double,         //L2正则化系数
multinomial: Boolean,       //是否是多分类
//累积层数，从样本逐层累积，类似于树
aggregationDepth: Int) extends DiffFunction[BDV[Double]] {

1

2

3

4

5

6

7

8

9

10

calculate用于计算每轮迭代时的loss和gradient

override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
val coeffs = Vectors.fromBreeze(coefficients)
val bcCoeffs = instances.context.broadcast(coeffs)
val featuresStd = bcFeaturesStd.value
val numFeatures = featuresStd.length
val numCoefficientSets = if (multinomial) numClasses else 1
val numFeaturesPlusIntercept = if (fitIntercept) numFeatures + 1 else numFeatures
//所有样本，计算loss和gradient，参见LogisticAggregator的add和merge
val logisticAggregator = {
val seqOp = (c: LogisticAggregator, instance: Instance) => c.add(instance)
val combOp = (c1: LogisticAggregator, c2: LogisticAggregator) => c1.merge(c2)

instances.treeAggregate(
new LogisticAggregator(bcCoeffs, bcFeaturesStd, numClasses, fitIntercept,
multinomial)
)(seqOp, combOp, aggregationDepth)
}
//正则项
val totalGradientMatrix = logisticAggregator.gradient
val coefMatrix = new DenseMatrix(numCoefficientSets, numFeaturesPlusIntercept, coeffs.toArray)
// regVal is the sum of coefficients squares excluding intercept for L2 regularization.
val regVal = if (regParamL2 == 0.0) {
0.0
} else {
var sum = 0.0
coefMatrix.foreachActive { case (classIndex, featureIndex, value) =>
// We do not apply regularization to the intercepts
val isIntercept = fitIntercept && (featureIndex == numFeatures)
if (!isIntercept) {
//计算带正则项的梯度和loss，更新梯度矩阵
sum += {
if (standardization) {
val gradValue = totalGradientMatrix(classIndex, featureIndex)
totalGradientMatrix.update(classIndex, featureIndex, gradValue + regParamL2 * value)
value * value
} else {
if (featuresStd(featureIndex) != 0.0) {
//设置不使用归一化，但是在计算梯度时使用了归一化，这里正则项需要反归一化，使得优化函数与无归一化等效
val temp = value / (featuresStd(featureIndex) * featuresStd(featureIndex))
val gradValue = totalGradientMatrix(classIndex, featureIndex)
totalGradientMatrix.update(classIndex, featureIndex, gradValue + regParamL2 * temp)
value * temp
} else {
0.0
}
}
}
}
}
0.5 * regParamL2 * sum
}
bcCoeffs.destroy(blocking = false)
//更新loss和梯度
(logisticAggregator.loss + regVal, new BDV(totalGradientMatrix.toArray))
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

50

51

52

53

54

55

1.6. BinaryLogisticRegressionSummary

计算二分类逻辑回归的模型评估指标，如AUC，F-measure等

class BinaryLogisticRegressionSummary private[classification] (
//样本集
@Since("1.5.0") @transient override val predictions: DataFrame,
//预测值score的类名，用于DataFrame select
@Since("1.5.0") override val probabilityCol: String,
//label列名，用于DataFrame select
@Since("1.5.0") override val labelCol: String,
//特征向量列名，用于DataFrame select
@Since("1.6.0") override val featuresCol: String)

1

2

3

4

5

6

7

8

9

这里的predictions是样本经过模型预测，增加了预测值。

private val binaryMetrics = new BinaryClassificationMetrics(
predictions.select(col(probabilityCol), col(labelCol).cast(DoubleType)).rdd.map {
case Row(score: Vector, label: Double) => (score(1), label)
}, 100
)

1

2

3

4

5

计算评价指标只需要预测值与label两列，用来初始化BinaryClassificationMetrics类，参见 spark源码分析之二分类逻辑回归evaluation。这里返回的评估指标其实都来自于BinaryClassificationMetrics中，只不过在其返回的数据中加入了列名，构造成DataFrame，包括ROC曲线，AUC值，pr曲线，threshold-fMeasure曲线，threshold-precision曲线，threshold-recall曲线，比较简单，不再赘述。

相关文章
spark mllib源码分析之逻辑回归弹性网络ElasticNet（一）
spark源码分析之L-BFGS
spark mllib源码分析之OWLQN
spark中的online均值/方差统计
spark源码分析之二分类逻辑回归evaluation
spark正则化

2. 训练

2.1. 训练参数设置

设置用于控制训练的参数

setMaxIter，最大迭代次数，训练的截止条件，默认100次

setFamily，binomial(二分类)/multinomial(多分类)/auto，默认为auto。设为auto时，会根据schema或者样本中实际的class情况设置是二分类还是多分类，最好明确设置

setElasticNetParam，弹性参数，用于调节L1和L2之间的比例，两种正则化比例加起来是1，详见后面正则化的设置，默认为0，只使用L2正则化，设置为1就是只用L1正则化

setRegParam，正则化系数，默认为0，不使用正则化

setTol，训练的截止条件，两次迭代之间的改善小于tol训练将截止

setFitIntercept，是否拟合截距，默认为true

setStandardization，是否使用归一化，这里归一化只针对各维特征的方差进行

setThresholds/setThreshold，设置多分类/二分类的判决阈值，多分类是一个数组，二分类是double值

setAggregationDepth，设置分布式统计时的层数，主要用在treeAggregate中，数据量越大，可适当加大这个值，默认为2

set*Col，因为训练时使用的样本先被搞成了DataFrame结构，这些是设置列名，方便训练时选取label，weight，feature等列，spark实现了默认libsvm格式的读取，如果需要，可以自己读取样本文件，转成DataFrame格式，并设置这些列名

2.2.数据准备

2.2.1. 样本处理

将封装成DataFrame的输入数据再转成简单结构的instance，包括label，weight，特征，默认每个样本的weight为1

val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
val instances: RDD[Instance] =
dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
case Row(label: Double, weight: Double, features: Vector) =>
Instance(label, weight, features)
}

1

2

3

4

5

6

7

2.2.2. 统计

统计样本每个特征的方差，均值，label的分布情况，用到了MultivariateOnlineSummarizer和MultiClassSummarizer，前面有介绍

val (summarizer, labelSummarizer) = {
val seqOp = (c: (MultivariateOnlineSummarizer, MultiClassSummarizer),
instance: Instance) =>
(c._1.add(instance.features, instance.weight), c._2.add(instance.label, instance.weight))

val combOp = (c1: (MultivariateOnlineSummarizer, MultiClassSummarizer),
c2: (MultivariateOnlineSummarizer, MultiClassSummarizer)) =>
(c1._1.merge(c2._1), c1._2.merge(c2._2))

instances.treeAggregate(
new MultivariateOnlineSummarizer, new MultiClassSummarizer
)(seqOp, combOp, $(aggregationDepth))
}
//各维特征的weightSum
val histogram = labelSummarizer.histogram
//label非法，主要是label非整数和小于0的情况
val numInvalid = labelSummarizer.countInvalid
val numFeatures = summarizer.mean.size
//如果有截距，相当于增加一维值全为1的特征
val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

2.2.3. 其他参数推断

class数，如果在schema中指定，就要求其大于等于统计得到的class数，否则取统计得到的class数

val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
case Some(n: Int) =>
require(n >= histogram.length, s"Specified number of classes $n was " +
s"less than the number of unique labels ${histogram.length}.")
n
//最好是labelSummarizer.numClasses
case None => histogram.length
}

1

2

3

4

5

6

7

8

是否多分类，根据family参数确定。如果是二分类，要求numClasses为1或2，返回false；如果是多分类，则返回true；如果是默认的auto，则根据numClasses是否大于2得到

val isMultinomial = $(family) match {
case "binomial" =>
require(numClasses == 1 || numClasses == 2, s"Binomial family only supports 1 or 2 " +
s"outcome classes but found $numClasses.")
false
case "multinomial" => true
case "auto" => numClasses > 2
case other => throw new IllegalArgumentException(s"Unsupported family: $other")
}

1

2

3

4

5

6

7

8

9

系数矩阵包含的系数集个数，我们之前的文章说过，多分类的系数矩阵在实际模型训练时（breeze库）是将多个weight向量是拼在一起的，相当于[[w11,w12,...],[w21,w22,...],...]，因此个数与class数相同，这里的系数矩阵是使用矩阵存储的，这个值作为矩阵的行；而二分类因为只有一个weight向量，因此为1

val numCoefficientSets = if (isMultinomial) numClasses else 1

1

阈值，因为这里兼容了二分类与多分类，thresholds实际是个Array，length应该等于class的个数；如果是二分类，阈值就只有一个，设置在threshold，Double类型

if (isDefined(thresholds)) {
require($(thresholds).length == numClasses, this.getClass.getSimpleName +
".train() called with non-matching numClasses and thresholds.length." +
s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
}

1

2

3

4

5

2.3. 确定待训练模型

根据二/多分类，是否拟合截距，L1/L2等确定训练使用的优化方法，损失函数

2.3.1. 拟合截距 && label唯一

判断条件为

$(fitIntercept) && isConstantLabel

1

label是否唯一的判断

val isConstantLabel = histogram.count(_ != 0.0) == 1

1

histogram是Array，里面放着样本中各label的数量，也就是说样本里只有一种label。
这种情况返回的系数矩阵为全0的SparseMatrix，对于截距，如果是多分类，返回稀疏向量，向量长度为numClasses，只有index为label的位置有值Double.PositiveInfinity；如果是二分类，返回dense vector，值为Double.PositiveInfinity

2.3.2. 不拟合截距 && label唯一

判断条件为

!$(fitIntercept) && isConstantLabel

1

此种情况下，算法可能不会收敛，给出了警告信息，但是会继续尝试优化。

2.3.3. 不拟合截距 && 各维特征的特征值相同却非0

判断条件

!$(fitIntercept) && (0 until numFeatures).exists { i =>
featuresStd(i) == 0.0 && featuresMean(i) != 0.0 }

1

2

各维特征的方差为0，但是均值不为0，说明各维特征值各自相同且非0，这种情况从警告信息来看，spark最终会给出全为0的系数矩阵，但是会尝试优化。

2.3.4. 只用L2或者不使用正则化

判断条件

$(elasticNetParam) == 0.0 || $(regParam) == 0.0

1

对应损失函数

//实际的L1正则化系数
val regParamL1 = $(elasticNetParam) * $(regParam)
//实际的L2正则化系数
val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)

val bcFeaturesStd = instances.context.broadcast(featuresStd)
//损失函数，参见前文
val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
$(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
$(aggregationDepth))

1

2

3

4

5

6

7

8

9

10

使用LBFGS模型优化

val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
}

1

2

3

2.3.5. 只使用L1 或者 L1和L2同时用

L2与损失函数类似2.3.4节，L1正则化

def regParamL1Fun = (index: Int) => {
// 正则化与截距无关，系数矩阵是按列展开的，每一列是一维特征，因此所有的截距在展开数组的最后
val isIntercept = $(fitIntercept) && index >= numFeatures * numCoefficientSets
if (isIntercept) {
0.0
} else {
if (standardizationParam) {
regParamL1
} else {
//因为优化时统一使用标准化，这里如果设置不使用，需要先进行反过程已使得目标函数一致
val featureIndex = index / numCoefficientSets
if (featuresStd(featureIndex) != 0.0) {
regParamL1 / featuresStd(featureIndex)
} else {
0.0
}
}
}
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

使用OWLQN模型优化

new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol))

2.4. 初始化weight矩阵

先将稀疏矩阵初始化成全0的矩阵，行是numCoefficientSets（numClasses），列为numFeaturesPlusIntercept（包括截距），然后根据是否有初始模型，二/多分类对weight矩阵进行初始化。

2.4.1. 有初始模型

这种情况是存在着一个初始的模型，里面存放着已经训练的系数矩阵，先验证模型的有效性

val initialModelIsValid = optInitialModel match {
case Some(_initialModel) =>
val providedCoefs = _initialModel.coefficientMatrix
//系数矩阵的行（numClasses）是否相同
val modelIsValid = (providedCoefs.numRows == numCoefficientSets) &&
//稀疏矩阵的列(特征数)是否相同
(providedCoefs.numCols == numFeatures) &&
//截距数是否正确（等于numClasses）
(_initialModel.interceptVector.size == numCoefficientSets) &&
//是否拟合截距
(_initialModel.getFitIntercept == $(fitIntercept))
if (!modelIsValid) {
logWarning(s"Initial coefficients will be ignored! Its dimensions " +
s"(${providedCoefs.numRows}, ${providedCoefs.numCols}) did not match the " +
s"expected size ($numCoefficientSets, $numFeatures)")
}
modelIsValid
case None => false
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

如果初始模型有效，则用初始化模型中的系数初始化

val providedCoef = optInitialModel.get.coefficientMatrix
providedCoef.foreachActive { (classIndex, featureIndex, value) =>
// We need to scale the coefficients since they will be trained in the scaled space
//更新weight矩阵，行号classIndex，列号featureIndex，第三个参数是值
initialCoefWithInterceptMatrix.update(classIndex, featureIndex,
value * featuresStd(featureIndex))
}
//更新截距
if ($(fitIntercept)) {
optInitialModel.get.interceptVector.foreachActive { (classIndex, value) =>
initialCoefWithInterceptMatrix.update(classIndex, numFeatures, value)
}
}

1

2

3

4

5

6

7

8

9

10

11

12

13

2.4.2. 没有初始模型 && 拟合截距 && 多分类

对于多分类逻辑回归，如果系数矩阵初始化为0，则把截距初始化与label同分布，能够更快的收敛

P(1)=eb1/Z...P(K)=ebK/ZwhereZ=∑k=1Kebk

但是label的真正分布很难知道，一般用label在样本中的分布近似，即

ebk=countk∗eλbk=log(countk)+λ

我们让λ=0，并使其均值为0（均值中心化）

bk=log(countk)b′k=bk−E(bk)

对应到spark的实现

//histogram里面放的是weightSum
val rawIntercepts = histogram.map(c => math.log(c + 1)) // add 1 for smoothing
val rawMean = rawIntercepts.sum / rawIntercepts.length
rawIntercepts.indices.foreach { i =>
initialCoefWithInterceptMatrix.update(i, numFeatures, rawIntercepts(i) - rawMean)
}

1

2

3

4

5

6

2.4.3. 没有初始模型 && 拟合截距&&二分类

类似于多分类，截距与label同分布可加快收敛

P(0)=1(1+eb)P(1)=eb1+eb

同理

b=logP(1)P(0)=logcount1count0

源码实现

initialCoefWithInterceptMatrix.update(0, numFeatures,
math.log(histogram(1) / histogram(0)))

1

2

2.5. 训练

优化求解是通过上面确定的optimizer，主要的训练过程都隐藏在iterations背后，无论是L-BFGS还是OWLQN都需要实现这个方法，返回所有轮的状态放在stats里面

val states = optimizer.iterations(new CachedDiffFunction(costFun),
//矩阵按列展开为数组，对照1.4.2.3节的第一种存储格式
new BDV[Double](initialCoefWithInterceptMatrix.toArray))

1

2

3

2.6. 获取训练结果

从训练结果里取回系数矩阵和截距

2.6.1. 初始化系数和截距

//训练结果以数组的形式返回
val allCoefficients = state.x.toArray.clone()
//数组转成矩阵，numClasses*numberFeature
val allCoefMatrix = new DenseMatrix(numCoefficientSets,
numFeaturesPlusIntercept, allCoefficients)
val denseCoefficientMatrix = new DenseMatrix(numCoefficientSets, numFeatures,
new Array[Double](numCoefficientSets * numFeatures), isTransposed = true)
//需要拟合截距或者二分类的情况，每类的截距都是实际有值的，用dense vector
val interceptVec = if ($(fitIntercept) || !isMultinomial) {
Vectors.zeros(numCoefficientSets)
} else {
//sparse vector
Vectors.sparse(numCoefficientSets, Seq())
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

2.6.2. 用训练结果赋值

从原始结果中分别提取系数与截距

allCoefMatrix.foreachActive { (classIndex, featureIndex, value) =>
//allCoefMatrix的列numFeaturesPlusIntercept，如果拟合了截距，每行的最后一列就是截距
val isIntercept = $(fitIntercept) && (featureIndex == numFeatures)
if (!isIntercept && featuresStd(featureIndex) != 0.0) {
//我们在分析L-BFGS源码的文章中提过，因为训练时特征是对标准差归一化的，为了在使用时直接使用特征值，这里相当于对返回的系数标准化
denseCoefficientMatrix.update(classIndex, featureIndex,
value / featuresStd(featureIndex))
}
if (isIntercept) interceptVec.toArray(classIndex) = value
}

1

2

3

4

5

6

7

8

9

10

2.6.3. 对待返回的系数进行调整

我们前面提到过，如果无pivot class多分类不使用正则化其系数矩阵不是唯一的，因此这里对系数矩阵进行0均值中心化(mean centered coefficients)，使其可复现(reproducibility，我的理解是重复训练时结果一样)

//按行向量化
val denseValues = denseCoefficientMatrix.values
//均值
val coefficientMean = denseValues.sum / denseValues.length
denseCoefficientMatrix.update(_ - coefficientMean)

1

2

3

4

5

如果系数矩阵不是多分类，则根据系数非0的情况决定是否使用稀疏矩阵SparseMatrix

val compressedCoefficientMatrix = if (isMultinomial) {
denseCoefficientMatrix
} else {
//矩阵向量化后进行压缩
val compressedVector = Vectors.dense(denseCoefficientMatrix.values).compressed
compressedVector match {
case dv: DenseVector =>
//系数稠密，不需要压缩
denseCoefficientMatrix
case sv: SparseVector =>
//0系数较多，返回稀疏矩阵
new SparseMatrix(1, numFeatures, Array(0, sv.indices.length), sv.indices, sv.values,
isTransposed = true)
}
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

多分类时，对截距进行中心化

val compressedCoefficientMatrix = if (isMultinomial) {
denseCoefficientMatrix
} else {
if ($(fitIntercept) && isMultinomial) {
val interceptArray = interceptVec.toArray
val interceptMean = interceptArray.sum / interceptArray.length
(0 until interceptVec.size).foreach { i => interceptArray(i) -= interceptMean }
}

1

2

3

4

5

6

7

8

2.7. 构造逻辑回归模型

用训练得到的稀疏矩阵和截距向量，class数等构造逻辑回归模型

val model = copyValues(new LogisticRegressionModel(uid,
coefficientMatrix, interceptVector,numClasses, isMultinomial))
//如果是二分类，加入模型的评估指标，见BinaryLogisticRegressionSummary
val m = if (!isMultinomial) {
val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol()
val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
//加计算每个样本的margin（rawPredictionCol），预测值（probabilityCol，margin带入sigmoid函数），预测label（predictionCol，与threshold比较）
summaryModel.transform(dataset),
probabilityColName,
$(labelCol),
$(featuresCol),
objectiveHistory)
model.setSummary(Some(logRegSummary))
} else {
model
}

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

2.8. 归一化

从上面的介绍中我们可以看到，在计算loss和梯度时，spark没有判断是否使用归一化(standardization)，直接进行了归一化，也就是说如果我们设置了归一化，其损失函数

lossstd+L1+L2

并且在2.6.2节我们看到，返回的系数是统一进行了归一化，注释中也有解释，系数进行归一化，在预测的时候就不需要对特征进行归一化了，整个训练过程可以看做是对系数进行归一化，因此如果我们设置不进行归一化，loss和梯度计算部分仍旧归一化，则可以

lossstd+L1std+L2std2

所有项都除以std，对整体的目标函数是没有影响的，因此可以看到源码中L1正则化和L2正则化在设置不归一化的时候对方差进行了归一化，因为L2是平方项，因此分母也是std的平方。
在实际的业务场景中，其实很多都是非数值型特征，而且数值型的往往也会离散化，全部变成0/1的特征，这种情况使用归一化其实是没什么意义的，spark这里强制使用归一化我感觉欠妥了。

3. 结语

本文分析了spark ElasticNet的实现，其中用到的一些技巧等。其计算过程中使用归一化及要求label必须连续且为整数等其实限制了其使用场景，希望后续能够优化这块，多考虑一些实际的业务场景。

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签： ElasticNet

相关文章推荐

新的分享

章节导航