scikit-learn:3.3. Model evaluation: quantifying the quality of predictions
2015-07-29 08:57
Reference: http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
There are three ways to evaluate the quality of a model's predictions:
Estimator score method: every estimator has a score method that provides a default evaluation criterion. It is not covered in this section; see the documentation of each estimator.
Scoring parameter: model-evaluation tools using cross-validation (such as cross_validation.cross_val_score and grid_search.GridSearchCV) rely on an internal scoring strategy. This is discussed in "The scoring parameter: defining model evaluation rules" (see part 1).
Metric functions: the metrics module evaluates prediction quality more comprehensively. This section covers Classification metrics, Multilabel ranking metrics, Regression metrics and Clustering metrics (see parts 2, 3, 4 and 5).
Finally, Dummy estimators provide random-guess strategies that can serve as a baseline when judging prediction quality (see part 6).
See also
For "pairwise" metrics, between samples and not estimators or predictions, see the Pairwise metrics, Affinities and Kernels section.
The details will be filled in as time permits...
1、The scoring parameter: defining model evaluation rules
Model selection and evaluation tools, such as grid_search.GridSearchCV and cross_validation.cross_val_score, take a scoring parameter that controls what metric they apply to the estimators evaluated.
1)Predefined scoring values
All scorers follow the convention that higher values are better. Metrics that measure the distance between the predictions and the data, such as mean_absolute_error and mean_squared_error, are therefore reported as negative values.
Scoring | Function | Comment |
---|---|---|
Classification | | |
'accuracy' | metrics.accuracy_score | |
'average_precision' | metrics.average_precision_score | |
'f1' | metrics.f1_score | for binary targets |
'f1_micro' | metrics.f1_score | micro-averaged |
'f1_macro' | metrics.f1_score | macro-averaged |
'f1_weighted' | metrics.f1_score | weighted average |
'f1_samples' | metrics.f1_score | by multilabel sample |
'log_loss' | metrics.log_loss | requires predict_proba support |
'precision' etc. | metrics.precision_score | suffixes apply as with 'f1' |
'recall' etc. | metrics.recall_score | suffixes apply as with 'f1' |
'roc_auc' | metrics.roc_auc_score | |
Clustering | | |
'adjusted_rand_score' | metrics.adjusted_rand_score | |
Regression | | |
'mean_absolute_error' | metrics.mean_absolute_error | |
'mean_squared_error' | metrics.mean_squared_error | |
'median_absolute_error' | metrics.median_absolute_error | |
'r2' | metrics.r2_score | |
For example:

    >>> from sklearn import svm, cross_validation, datasets
    >>> iris = datasets.load_iris()
    >>> X, y = iris.data, iris.target
    >>> model = svm.SVC()
    >>> cross_validation.cross_val_score(model, X, y, scoring='wrong_choice')
    Traceback (most recent call last):
    ValueError: 'wrong_choice' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']
    >>> clf = svm.SVC(probability=True, random_state=0)
    >>> cross_validation.cross_val_score(clf, X, y, scoring='log_loss')
    array([-0.07..., -0.16..., -0.06...])
3)Custom scoring
A custom scorer must satisfy the following two rules:
It can be called with parameters (estimator, X, y), where estimator is the model that should be evaluated, X is validation data, and y is the ground truth target for X (in the supervised case) or None (in the unsupervised case).
It returns a floating point number that quantifies the estimator prediction quality on X, with reference to y. Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.
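The two rules above can be illustrated with a minimal sketch in plain Python. The names MeanPredictor and neg_mean_absolute_error_scorer are hypothetical and exist only for this illustration. The toy estimator stands in for a real model, and the scorer negates a loss so that higher values are better.

```python
# Hypothetical toy estimator (not part of scikit-learn): it always predicts
# the mean of the training targets.
class MeanPredictor:
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def neg_mean_absolute_error_scorer(estimator, X, y):
    """Callable following the (estimator, X, y) protocol.

    The mean absolute error is a loss (lower is better), so it is negated
    to respect the "higher is better" convention.
    """
    predictions = estimator.predict(X)
    mae = sum(abs(t - p) for t, p in zip(y, predictions)) / len(y)
    return -mae


X_train, y_train = [[0], [1], [2], [3]], [1.0, 2.0, 3.0, 4.0]
X_valid, y_valid = [[4], [5]], [2.0, 4.0]

model = MeanPredictor().fit(X_train, y_train)
score = neg_mean_absolute_error_scorer(model, X_valid, y_valid)
print(score)  # the negated MAE on the validation data
```

A callable of this shape is what the scoring parameter is expected to accept in place of a predefined string; treat the sketch as an illustration of the protocol rather than a tested recipe for any particular scikit-learn version.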
2、Classification metrics
The sklearn.metrics module implements several loss, score, and utility functions to measure classification performance.

Some of these are restricted to the binary classification case:

Function | Description |
---|---|
matthews_corrcoef(y_true, y_pred) | Compute the Matthews correlation coefficient (MCC) for binary classes |
precision_recall_curve(y_true, probas_pred) | Compute precision-recall pairs for different probability thresholds |
roc_curve(y_true, y_score[, pos_label, ...]) | Compute Receiver operating characteristic (ROC) |

Others also work in the multiclass case:

Function | Description |
---|---|
confusion_matrix(y_true, y_pred[, labels]) | Compute confusion matrix to evaluate the accuracy of a classification |
hinge_loss(y_true, pred_decision[, labels, ...]) | Average hinge loss (non-regularized) |

Some also work in the multilabel case:

Function | Description |
---|---|
accuracy_score(y_true, y_pred[, normalize, ...]) | Accuracy classification score. |
classification_report(y_true, y_pred[, ...]) | Build a text report showing the main classification metrics |
f1_score(y_true, y_pred[, labels, ...]) | Compute the F1 score, also known as balanced F-score or F-measure |
fbeta_score(y_true, y_pred, beta[, labels, ...]) | Compute the F-beta score |
hamming_loss(y_true, y_pred[, classes]) | Compute the average Hamming loss. |
jaccard_similarity_score(y_true, y_pred[, ...]) | Jaccard similarity coefficient score |
log_loss(y_true, y_pred[, eps, normalize, ...]) | Log loss, aka logistic loss or cross-entropy loss. |
precision_recall_fscore_support(y_true, y_pred) | Compute precision, recall, F-measure and support for each class |
precision_score(y_true, y_pred[, labels, ...]) | Compute the precision |
recall_score(y_true, y_pred[, labels, ...]) | Compute the recall |
zero_one_loss(y_true, y_pred[, normalize, ...]) | Zero-one classification loss. |

And some work with binary and multilabel (but not multiclass) problems:

Function | Description |
---|---|
average_precision_score(y_true, y_score[, ...]) | Compute average precision (AP) from prediction scores |
roc_auc_score(y_true, y_score[, average, ...]) | Compute Area Under the Curve (AUC) from prediction scores |

In the following sub-sections, we will describe each of those functions, preceded by some notes on common API and metric definition.
2)accuracy score:
The accuracy_score function computes the accuracy. By default it returns the fraction of correct predictions; with normalize=False it returns the number of correct predictions instead. An example makes this clear:

    >>> import numpy as np
    >>> from sklearn.metrics import accuracy_score
    >>> y_pred = [0, 2, 1, 3]
    >>> y_true = [0, 1, 2, 3]
    >>> accuracy_score(y_true, y_pred)
    0.5
    >>> accuracy_score(y_true, y_pred, normalize=False)
    2

For multilabel classification, a sample counts as correctly predicted only if all of its labels are predicted correctly. For example:

    >>> accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2)))
    0.5
See also:
See Test with permutations the significance of a classification score for an example of accuracy score usage using permutations of the dataset.
3)confusion matrix:
The confusion_matrix function evaluates classification accuracy by computing the confusion matrix. For example:

    >>> from sklearn.metrics import confusion_matrix
    >>> y_true = [2, 0, 2, 2, 0, 1]
    >>> y_pred = [0, 0, 2, 2, 0, 2]
    >>> confusion_matrix(y_true, y_pred)
    array([[2, 0, 0],
           [0, 0, 1],
           [1, 0, 2]])

(Note: rows correspond to true labels and columns to predicted labels.)
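To make the row and column orientation concrete, here is a small hand-rolled sketch that counts (true, predicted) pairs for the same data. The helper name confusion_counts is hypothetical. It is not how scikit-learn implements confusion_matrix, only a way to check which axis is which.

```python
def confusion_counts(y_true, y_pred):
    """Count (true, predicted) pairs; rows are true labels, columns predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
matrix = confusion_counts(y_true, y_pred)
for row in matrix:
    print(row)
```

The last row, for example, says that of the three samples whose true label is 2, one was predicted as 0 and two were predicted as 2.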
See also:
See Confusion matrix for an example of using a confusion matrix to evaluate classifier output quality.
See Recognizing hand-written digits for an example of using a confusion matrix to classify hand-written digits.
See Classification of text documents using sparse features for an example of using a confusion matrix to classify text documents.
4)classification report:
The classification_report function builds a text report showing the main classification metrics. For example:

    >>> from sklearn.metrics import classification_report
    >>> y_true = [0, 1, 2, 2, 0]
    >>> y_pred = [0, 0, 2, 2, 0]
    >>> target_names = ['class 0', 'class 1', 'class 2']
    >>> print(classification_report(y_true, y_pred, target_names=target_names))
                 precision    recall  f1-score   support

        class 0       0.67      1.00      0.80         2
        class 1       0.00      0.00      0.00         1
        class 2       1.00      1.00      1.00         2

    avg / total       0.67      0.80      0.72         5
See also:
See Recognizing hand-written digits for an example of classification report usage for hand-written digits.
See Classification of text documents using sparse features for an example of classification report usage for text documents.
See Parameter estimation using grid search with cross-validation for an example of classification report usage for grid search with nested cross-validation.
The remaining metrics are used less often. They are listed briefly below, without much explanation or translation:
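5)hamming loss:
If $\hat{y}_j$ is the predicted value for the $j$-th label of a given sample, $y_j$ is the corresponding true value, and $n_\text{labels}$ is the number of classes or labels, then the Hamming loss $L_{Hamming}$ between two samples is defined as:

$$L_{Hamming}(y, \hat{y}) = \frac{1}{n_\text{labels}} \sum_{j=0}^{n_\text{labels} - 1} 1(\hat{y}_j \neq y_j)$$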
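As a minimal sketch of this definition, the Hamming loss between a true and a predicted label vector is the fraction of label positions that differ. The helper name hamming_loss_sample is hypothetical and handles a single sample only.

```python
def hamming_loss_sample(y_true, y_pred):
    """Fraction of label positions where the prediction differs from the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of labels")
    mismatches = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return mismatches / len(y_true)


# Two of the four label positions differ.
loss = hamming_loss_sample([1, 0, 1, 1], [0, 0, 1, 0])
print(loss)
```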
6)jaccard similarity coefficient score:
The Jaccard similarity coefficient of the $i$-th sample, with a ground truth label set $y_i$ and predicted label set $\hat{y}_i$, is defined as

$$J(y_i, \hat{y}_i) = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}$$
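For a single sample, the Jaccard coefficient is the size of the intersection of the two label sets divided by the size of their union. The sketch below uses a hypothetical helper, jaccard_similarity. Returning 1.0 when both sets are empty is an assumption made here for illustration, not something stated in the text above.

```python
def jaccard_similarity(true_labels, predicted_labels):
    """|intersection| / |union| of the true and predicted label sets."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    union = true_set | pred_set
    if not union:
        # Assumption for this sketch: two empty label sets count as identical.
        return 1.0
    return len(true_set & pred_set) / len(union)


# The sets share {"b", "c"} and their union has four elements.
similarity = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})
print(similarity)
```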
7)precision, recall and F-measures:
Several functions allow you to analyze the precision, recall and F-measures score:

Function | Description |
---|---|
average_precision_score(y_true, y_score[, ...]) | Compute average precision (AP) from prediction scores |
f1_score(y_true, y_pred[, labels, ...]) | Compute the F1 score, also known as balanced F-score or F-measure |
fbeta_score(y_true, y_pred, beta[, labels, ...]) | Compute the F-beta score |
precision_recall_curve(y_true, probas_pred) | Compute precision-recall pairs for different probability thresholds |
precision_recall_fscore_support(y_true, y_pred) | Compute precision, recall, F-measure and support for each class |
precision_score(y_true, y_pred[, labels, ...]) | Compute the precision |
recall_score(y_true, y_pred[, labels, ...]) | Compute the recall |

Note that the precision_recall_curve function is restricted to the binary case. The average_precision_score function works only in binary classification and multilabel indicator format.
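The per-class numbers in the classification_report example earlier can be reproduced by hand. The sketch below is a plain-Python illustration with a hypothetical helper, precision_recall_f1. It assumes the usual definitions (precision = TP / predicted positives, recall = TP / actual positives, F1 = harmonic mean of the two) and returns 0.0 whenever a denominator is zero.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]
results = {label: precision_recall_f1(y_true, y_pred, label) for label in (0, 1, 2)}
for label, (precision, recall, f1) in results.items():
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))
```

Class 1 is never predicted, so its precision, recall and F1 all come out as zero, matching the report.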
8)hinge loss:
9)log loss:
10)matthews correlation coefficient:
11)receiver operating characteristic (ROC):
12)zero one loss:
3、Multilabel ranking metrics
In multilabel learning, each sample can have any number of ground truth labels associated with it. The goal is to give high scores and better rank to the ground truth labels.
1)coverage error:
2)label ranking average precision:
4、Regression metrics
The sklearn.metrics module implements several loss, score, and utility functions to measure regression performance. Some of those have been enhanced to handle the multioutput case: mean_absolute_error, mean_squared_error, median_absolute_error and r2_score.
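1)explained variance score:
If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ is Variance, the square of the standard deviation, then the explained variance is estimated as follows:

$$\texttt{explained\_variance}(y, \hat{y}) = 1 - \frac{Var\{y - \hat{y}\}}{Var\{y\}}$$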
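A minimal sketch of this estimate using only the standard library follows. The helper name explained_variance is hypothetical, and the population variance (statistics.pvariance) is used for both terms; because the score is a ratio of two variances, the choice of normalization cancels out.

```python
from statistics import pvariance


def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y), with population variances."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - pvariance(residuals) / pvariance(y_true)


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
ev = explained_variance(y_true, y_pred)
print(round(ev, 3))
```

A perfect prediction gives a score of 1.0, since the residuals then have zero variance.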
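2)mean absolute error:
If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean absolute error (MAE) estimated over $n_\text{samples}$ is defined as

$$\text{MAE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} \left| y_i - \hat{y}_i \right|$$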
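3)mean squared error:
If $\hat{y}_i$ is the predicted value of the $i$-th sample, and $y_i$ is the corresponding true value, then the mean squared error (MSE) estimated over $n_\text{samples}$ is defined as

$$\text{MSE}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2$$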
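4)R² score, the coefficient of determination:
If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the score R² estimated over $n_\text{samples}$ is defined as

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=0}^{n_\text{samples} - 1} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{n_\text{samples} - 1} (y_i - \bar{y})^2}$$

where $\bar{y} = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples} - 1} y_i$.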
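The three definitions above (MAE, MSE and R²) translate directly into a few lines of plain Python. The function names below are hypothetical helpers written for this illustration; they operate on simple lists and are not scikit-learn's implementations.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (y_i - y_hat_i)^2 over all samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, mse, round(r2, 3))
```

MAE and MSE are losses (lower is better), which is why the corresponding scorers report them as negative values, whereas R² is already a score where higher is better.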
5、Clustering metrics
The sklearn.metrics module implements several loss, score, and utility functions. For more information see the Clustering performance evaluation section for instance clustering, and Biclustering evaluation for biclustering.
6、Dummy estimators
For supervised learning, comparing a model against trivially generated predictions is a simple sanity check, and those predictions serve as the baseline.
DummyClassifier implements several simple strategies for producing such predictions:
stratified generates random predictions by respecting the training set class distribution.
most_frequent always predicts the most frequent label in the training set.
uniform generates predictions uniformly at random.
constant always predicts a constant label that is provided by the user. (A major motivation of this method is F1-scoring, when the positive class is in the minority.)
Note that with all these strategies, the predict method completely ignores the input data!
A simple example: first, let’s create an imbalanced dataset:

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.cross_validation import train_test_split
    >>> iris = load_iris()
    >>> X, y = iris.data, iris.target
    >>> y[y != 1] = -1
    >>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Next, let’s compare the accuracy of SVC and most_frequent:

    >>> from sklearn.dummy import DummyClassifier
    >>> from sklearn.svm import SVC
    >>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
    >>> clf.score(X_test, y_test)
    0.63...
    >>> clf = DummyClassifier(strategy='most_frequent', random_state=0)
    >>> clf.fit(X_train, y_train)
    DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
    >>> clf.score(X_test, y_test)
    0.57...

We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:

    >>> clf = SVC(kernel='rbf', C=1).fit(X_train, y_train)
    >>> clf.score(X_test, y_test)
    0.97...
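To see why a most_frequent baseline can already score well on imbalanced data, here is a small plain-Python sketch of the idea. The class MostFrequentBaseline and the toy data are hypothetical and only mimic the strategy described above: the features are ignored, and the accuracy equals the share of the majority training label among the test targets.

```python
from collections import Counter


class MostFrequentBaseline:
    """Toy baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # The input features are ignored entirely.
        return [self.label_ for _ in X]

    def score(self, X, y):
        predictions = self.predict(X)
        return sum(1 for t, p in zip(y, predictions) if t == p) / len(y)


# Imbalanced toy data: 7 negatives and 3 positives in the training set.
X_train = [[i] for i in range(10)]
y_train = [-1] * 7 + [1] * 3
X_test = [[i] for i in range(5)]
y_test = [-1, -1, 1, -1, 1]

baseline = MostFrequentBaseline().fit(X_train, y_train)
accuracy = baseline.score(X_test, y_test)
print(baseline.label_, accuracy)
```

A real classifier is only informative to the extent that it beats this number.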
Likewise, for regression problems:
DummyRegressor also implements four simple rules of thumb for regression:
mean always predicts the mean of the training targets.
median always predicts the median of the training targets.
quantile always predicts a user provided quantile of the training targets.
constant always predicts a constant value that is provided by the user.
In all these strategies, the predict method completely ignores the input data.
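The regression baselines can be sketched the same way. The helper below, baseline_predictions, is hypothetical and covers only the mean, median and constant rules. The quantile rule is left out because its result depends on an interpolation convention that is not specified here.

```python
from statistics import mean, median


def baseline_predictions(y_train, n_samples, strategy="mean", constant=None):
    """Return n_samples identical predictions derived from the training targets."""
    if strategy == "mean":
        value = mean(y_train)
    elif strategy == "median":
        value = median(y_train)
    elif strategy == "constant":
        if constant is None:
            raise ValueError("the constant strategy needs a user-provided value")
        value = constant
    else:
        raise ValueError("unknown strategy: %r" % strategy)
    # Every sample gets the same prediction; the inputs play no role.
    return [value] * n_samples


y_train = [1.0, 2.0, 3.0, 10.0]
mean_preds = baseline_predictions(y_train, 3, strategy="mean")
median_preds = baseline_predictions(y_train, 3, strategy="median")
constant_preds = baseline_predictions(y_train, 3, strategy="constant", constant=0.0)
print(mean_preds, median_preds, constant_preds)
```

With a skewed target like this one, the mean and median baselines differ noticeably, which is one reason to compare a model against both.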