scikit-learn (models used relatively often in engineering): 1.4. Support Vector Machines
2015-08-04 07:33
Reference: http://scikit-learn.org/stable/modules/svm.html
In real projects we actually use the simple classics such as LR, kNN and NB fairly rarely; they are classic, but not that practical in engineering.
Today we focus on SVM, which gets used relatively often in engineering.
SVMs are versatile: Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
Advantages: effective in high-dimensional spaces; still effective when the number of dimensions is greater than the number of samples; memory-efficient, because only a subset of the training points (called support vectors) is used in the decision function; versatile, with different kernel functions to choose from.
Disadvantages: when the number of features is much greater than the number of samples, performance tends to degrade badly; SVMs do not directly provide probability estimates, which instead require an expensive five-fold cross-validation (see Scores and probabilities, below).
(SVMs support both dense and sparse sample vectors, but if you will predict on sparse data, you must train on sparse data as well. For optimal performance, use a C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64.)
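As a minimal sketch of the dense-vs-sparse point above (the tiny data set here is made up purely for illustration), training and predicting on CSR matrices looks like this:
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> from sklearn import svm
>>> X_dense = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]], dtype=np.float64)
>>> y = [0, 0, 1, 1]
>>> X_sparse = csr_matrix(X_dense)                # train on sparse input ...
>>> sparse_clf = svm.SVC(kernel='linear')
>>> sparse_clf.fit(X_sparse, y)
SVC(...)
>>> sparse_clf.predict(csr_matrix([[1.5, 1.5]]))  # ... and predict on sparse input too
array(...)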
1. Classification
SVC, NuSVC and LinearSVC are three models capable of multi-class classification. The essential difference between them is that they have different mathematical formulations; see the formulas at the end of this article.
Like any other classifier, SVC, NuSVC and LinearSVC are used through the fit and predict methods:
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])
The properties of the support vectors can be retrieved through the attributes support_vectors_, support_ and n_support_:
>>> # get support vectors
>>> clf.support_vectors_
array([[ 0.,  0.],
       [ 1.,  1.]])
>>> # get indices of support vectors
>>> clf.support_
array([0, 1]...)
>>> # get number of support vectors for each class
>>> clf.n_support_
array([1, 1]...)
For multi-class classification:
SVC and NuSVC use the "one-against-one" scheme (training n_class * (n_class - 1) / 2 models), while LinearSVC uses the "one-vs-the-rest" strategy (training n_class models). In practice one-vs-rest is the usual and preferred choice, because the results are generally about the same while the training time is much lower.
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC()
>>> clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> lin_clf = svm.LinearSVC()
>>> lin_clf.fit(X, Y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)
>>> dec = lin_clf.decision_function([[1]])
>>> dec.shape[1]
4
On the confidence of a sample's predicted class: the SVC method decision_function gives per-class scores for each sample. There is also the probability option, but: if confidence scores are required, and these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba (mainly because the theory behind the probability estimates has known flaws).
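A minimal, hedged sketch of the two options (the toy data and values below are invented for illustration; note that probability=True has to be set when the model is constructed):
>>> X_p = [[0, 0], [0, 1], [1, 0], [0.5, 0.5], [2, 2], [2, 3], [3, 2], [2.5, 2.5]]
>>> y_p = [0, 0, 0, 0, 1, 1, 1, 1]
>>> scorer = svm.SVC().fit(X_p, y_p)
>>> scorer.decision_function([[1.5, 1.5]])               # confidence score(s) per class/pair
array(...)
>>> prob_clf = svm.SVC(probability=True).fit(X_p, y_p)   # turns on the costly Platt scaling
>>> prob_clf.predict_proba([[1.5, 1.5]])                 # probabilities, but slower and less consistent
array(...)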
When different classes or individual samples should carry different weights, use the keywords class_weight and sample_weight:
Class weights: SVC (but not NuSVC) implements a keyword class_weight in the fit method. It's a dictionary of the form {class_label : value}, where value is a floating point number > 0 that sets the parameter C of class class_label to C * value.
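A hedged sketch of that dictionary form (the label and the factor 10 are arbitrary illustration values):
>>> # penalize mistakes on class 1 ten times harder, i.e. its effective C becomes C * 10
>>> wclf = svm.SVC(class_weight={1: 10})
>>> wclf.fit([[0, 0], [1, 1], [1, 0], [0, 1]], [0, 0, 1, 1])
SVC(...)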
Sample weights: SVC, NuSVC, SVR, NuSVR and OneClassSVM also implement weights for individual samples in the fit method through the keyword sample_weight. Similar to class_weight, these set the parameter C for the i-th example to C * sample_weight[i].
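And a matching sketch for per-sample weights (again, the weights are invented for illustration):
>>> # the last sample only contributes a tenth of the usual penalty: C * 0.1
>>> swclf = svm.SVC()
>>> swclf.fit([[0, 0], [1, 1], [1, 0], [0, 1]], [0, 0, 1, 1],
...           sample_weight=[1.0, 1.0, 1.0, 0.1])
SVC(...)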
Finally, a few examples:
- Plot different SVM classifiers in the iris dataset
- SVM: Maximum margin separating hyperplane
- SVM: Separating hyperplane for unbalanced classes
- SVM-Anova: SVM with univariate feature selection
- Non-linear SVM
- SVM: Weighted samples
2. Regression
Support Vector Regression.
See whether this sentence makes sense: Analogously (to SVC), the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction.
Again there are three models: SVR, NuSVR and LinearSVR.
>>> from sklearn import svm
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = svm.SVR()
>>> clf.fit(X, y)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.0,
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
>>> clf.predict([[1, 1]])
array([ 1.5])
An example:
- Support Vector Regression (SVR) using linear and non-linear kernels
3. Density estimation, novelty detection
First, how Wikipedia defines novelty detection: novelty detection is the identification of new or unknown data that a machine learning system has not been trained with and was not previously aware of, with the help of either statistical or machine learning based approaches.
OneClassSVM is used for novelty detection, that is, given a set of samples, it will detect the soft boundary of that set so as to classify new points as belonging to that set or not. The procedure is unsupervised, so only X is passed as input.
For detailed usage, see the section Novelty and Outlier Detection.
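A minimal, hedged sketch of the unsupervised fit/predict cycle (the Gaussian training blob and the nu/gamma values are illustrative assumptions, not recommendations):
>>> import numpy as np
>>> from sklearn import svm
>>> rng = np.random.RandomState(0)
>>> X_train = 0.3 * rng.randn(100, 2)             # "normal" observations around the origin
>>> oc = svm.OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1)
>>> oc.fit(X_train)                               # unsupervised: no y is passed
OneClassSVM(...)
>>> oc.predict([[0., 0.], [4., 4.]])              # +1 = inside the learned boundary, -1 = novelty
array(...)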
Finally, two examples:
- One-class SVM with non-linear kernel (RBF)
- Species distribution modeling
4. Complexity
The QP (quadratic programming) solver used by this libsvm-based implementation scales between $O(n_{features} \times n_{samples}^2)$ and $O(n_{features} \times n_{samples}^3)$, depending on how efficiently the libsvm cache is used in practice (dataset dependent).
5. Practical tips
- Avoid unnecessary data copies: pass C-ordered float64 arrays so the underlying libsvm/liblinear code does not have to copy your data.
- Kernel cache size: for larger problems, increase cache_size from its default.
- Setting C: C defaults to 1; if the data contains many noisy observations, decrease it.
- Scaling: it is highly recommended to scale your data. For example, scale each attribute of the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results (see the sketch right after this list).
- Unbalanced data: in SVC, if the samples are unbalanced, set class_weight='auto' and/or try different penalty parameters C.
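A hedged sketch of the scaling and unbalanced-data tips above (the scaler/pipeline combination and the toy data are just one reasonable way to set this up):
>>> from sklearn import svm
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.pipeline import make_pipeline
>>> X_u = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]]   # made-up, imbalanced toy data
>>> y_u = [0, 0, 0, 0, 1]
>>> # scale to mean 0 / variance 1, then fit an SVC that re-weights the rare class
>>> model = make_pipeline(StandardScaler(), svm.SVC(C=1.0, class_weight='auto'))
>>> model.fit(X_u, y_u)                              # the fitted scaler is reused at predict time
Pipeline(...)
>>> model.predict([[4, 4]])
array(...)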
6. Kernel functions
Kernels are selected like svm.SVC(kernel='linear'); the common kernels are:
- linear: $\langle x, x' \rangle$.
- polynomial: $(\gamma \langle x, x' \rangle + r)^d$, where $d$ is specified by keyword degree and $r$ by coef0.
- rbf: $\exp(-\gamma \|x - x'\|^2)$, where $\gamma$ is specified by keyword gamma and must be greater than 0.
- sigmoid: $\tanh(\gamma \langle x, x' \rangle + r)$, where $r$ is specified by coef0.
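A hedged sketch of how the parameters above map onto the SVC constructor (the particular values are arbitrary):
>>> from sklearn import svm
>>> linear_svc = svm.SVC(kernel='linear')
>>> poly_svc = svm.SVC(kernel='poly', degree=3, coef0=1.0)       # d = degree, r = coef0
>>> rbf_svc = svm.SVC(kernel='rbf', gamma=0.5)                   # gamma must be > 0
>>> sigmoid_svc = svm.SVC(kernel='sigmoid', gamma=0.5, coef0=0.0)
>>> linear_svc.kernel
'linear'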
You can also define your own kernel, for example:
>>> import numpy as np
>>> from sklearn import svm
>>> def my_kernel(x, y):
...     return np.dot(x, y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)
- SVM with custom kernel
7. Mathematical formulation
1. SVC (the standard C-SVC primal problem):

$$\min_{w, b, \zeta} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \zeta_i$$
$$\text{subject to } y_i (w^T \phi(x_i) + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0, \quad i = 1, \dots, n$$

2. SVR (the $\varepsilon$-SVR primal problem):

$$\min_{w, b, \zeta, \zeta^*} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)$$
$$\text{subject to } y_i - w^T \phi(x_i) - b \le \varepsilon + \zeta_i, \quad w^T \phi(x_i) + b - y_i \le \varepsilon + \zeta_i^*, \quad \zeta_i, \zeta_i^* \ge 0$$