
Hyperopt-sklearn, a Python tool for machine learning model selection and hyperparameter tuning (1): overview and classification

2017-03-23 15:46

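Choosing a suitable machine learning algorithm for a given dataset is a lengthy process. Even once the algorithm is fixed, tuning its hyperparameters until the model performs well takes considerable time and effort. Hyperopt-sklearn helps with both problems.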

Home page: http://hyperopt.github.io/hyperopt-sklearn/

Installation:

git clone https://github.com/hyperopt/hyperopt-sklearn.git
cd hyperopt-sklearn
pip install -e .

Basic example:

from hpsklearn import HyperoptEstimator

# Load Data
# ...

# Create the estimator object
estim = HyperoptEstimator()

# Search the space of classifiers and preprocessing steps and their
# respective hyperparameters in sklearn to fit a model to the data
estim.fit(train_data, train_label)

# Make a prediction using the optimized model
prediction = estim.predict(unknown_data)

# Report the accuracy of the classifier on a given set of data
score = estim.score(test_data, test_label)

# Return instances of the classifier and preprocessing steps
model = estim.best_model()

For classification problems, HyperoptEstimator can be configured as follows:

from hyperopt import tpe
from hpsklearn import HyperoptEstimator, any_classifier
estim = HyperoptEstimator(classifier=any_classifier('clf'),algo=tpe.suggest)
estim.fit(X_train,y_train)

Here any_classifier is a collection of commonly used classifiers. According to the source code:

def any_classifier(name):
    return hp.choice('%s' % name, [
        svc(name + '.svc'),
        knn(name + '.knn'),
        random_forest(name + '.random_forest'),
        extra_trees(name + '.extra_trees'),
        ada_boost(name + '.ada_boost'),
        gradient_boosting(name + '.grad_boosting', loss='deviance'),
        sgd(name + '.sgd'),
    ])

So the classifiers currently supported are:

(1) svc (based on sklearn.svm.SVC)

(2) knn (based on sklearn.neighbors.KNeighborsClassifier)

(3) random_forest (based on sklearn.ensemble.RandomForestClassifier)

(4) extra_trees (based on sklearn.ensemble.ExtraTreesClassifier)

(5) ada_boost (based on sklearn.ensemble.AdaBoostClassifier)

(6) gradient_boosting (based on sklearn.ensemble.GradientBoostingClassifier)

(7) sgd (based on sklearn.linear_model.SGDClassifier)
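To make the idea of a choice over classifiers more concrete, here is a toy stand-in written in plain Python. It picks one classifier family at random and then samples only that family's hyperparameters, which is roughly what a nested hp.choice space describes. The names SPACE and sample_config and all of the ranges are hypothetical and chosen for illustration; they are not the ranges hpsklearn uses.

```python
import random

# Hypothetical, simplified search space: each classifier label maps to a
# function that samples that classifier's own hyperparameters.
# The ranges are illustrative only, NOT the ones used by hpsklearn.
SPACE = {
    "svc": lambda rng: {"C": 10 ** rng.uniform(-3, 3),
                        "kernel": rng.choice(["linear", "rbf"])},
    "knn": lambda rng: {"n_neighbors": rng.randint(1, 30)},
    "random_forest": lambda rng: {"n_estimators": rng.randint(10, 200)},
}


def sample_config(rng):
    """Pick one classifier family, then sample only its hyperparameters."""
    name = rng.choice(sorted(SPACE))
    return name, SPACE[name](rng)


rng = random.Random(0)
for _ in range(3):
    print(sample_config(rng))
```

Each draw yields a classifier label together with a dictionary holding only the parameters relevant to that classifier, so a knn draw never carries an unused C or kernel.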

By default, HyperoptEstimator also tries to preprocess the data. According to the source code:

def any_preprocessing(name):
    """Generic pre-processing appropriate for a wide variety of data
    """
    return hp.choice('%s' % name, [
        [pca(name + '.pca')],
        [standard_scaler(name + '.standard_scaler')],
        [min_max_scaler(name + '.min_max_scaler')],
        [normalizer(name + '.normalizer')],
        # -- not putting in one-hot because it can make vectors huge
        #[one_hot_encoder(name + '.one_hot_encoder')],
        []
    ])

So the preprocessing methods currently supported are:

(1) pca (based on sklearn.decomposition.PCA)

(2) standard_scaler (based on sklearn.preprocessing.StandardScaler)

(3) min_max_scaler (based on sklearn.preprocessing.MinMaxScaler)

(4) normalizer (based on sklearn.preprocessing.Normalizer)
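As a rough illustration of what three of these preprocessing steps do to the numbers, the sketch below re-implements the basic arithmetic in plain Python on tiny inputs. The helper names are hypothetical, and the functions assume a non-constant column and a non-zero row. The scikit-learn classes work on whole arrays and handle edge cases that this sketch ignores. PCA is left out because it does not reduce to a few lines.

```python
import math


def standard_scale(column):
    """Shift one feature column to zero mean and unit (population) variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def min_max_scale(column, low=0.0, high=1.0):
    """Rescale one feature column linearly into the range [low, high]."""
    lo, hi = min(column), max(column)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]


def l2_normalize(row):
    """Rescale one sample (row) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


print(min_max_scale([1, 2, 3]))             # [0.0, 0.5, 1.0]
print(min_max_scale([1, 2, 3], -1.0, 1.0))  # [-1.0, 0.0, 1.0]
print(l2_normalize([3, 4]))                 # [0.6, 0.8]
print(standard_scale([1, 2, 3]))
```

Note the difference in direction: the two scalers work per feature (column), while the normalizer works per sample (row).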

Classification example:

First, load the data:

import time
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from hyperopt import tpe
from hpsklearn import HyperoptEstimator, any_classifier
from hpsklearn import svc

digits = load_digits()
X = digits.data
y = digits.target
test_size = int(0.2*len(y))
np.random.seed(0)
indices = np.random.permutation(len(X))
X_train = X[indices[:-test_size]]
y_train = y[indices[:-test_size]]
X_test = X[indices[-test_size:]]
y_test = y[indices[-test_size:]]
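The split above shuffles the sample indices and holds out the last 20% as the test set. The same slicing logic can be checked without NumPy. The sketch below assumes the digits dataset has 1797 samples (its size in scikit-learn) and uses the standard library's random module, so the actual permutation differs from the one np.random.seed(0) produces.

```python
import random

n_samples = 1797                  # assumed size of sklearn's digits dataset
test_size = int(0.2 * n_samples)  # truncates 359.4 down to 359

# Shuffle the indices, then slice exactly as in the NumPy version above.
rng = random.Random(0)
indices = list(range(n_samples))
rng.shuffle(indices)

train_idx = indices[:-test_size]  # everything except the last 359 indices
test_idx = indices[-test_size:]   # the last 359 indices

print(len(train_idx), len(test_idx))  # 1438 359
```

Because both slices come from one permutation, the training and test sets are disjoint and together cover every sample.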

Then run the classification:

estim = HyperoptEstimator(classifier=any_classifier('clf'),algo=tpe.suggest)
estim.fit(X_train,y_train)
print(estim.score(X_test,y_test))
print(estim.best_model())

The output looks like this (your results may differ):

0.983286908078
{'learner': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
metric_params=None, n_jobs=1, n_neighbors=10, p=2,
weights='uniform'), 'preprocs': (), 'ex_preprocs': ()}

To get the same result on every run, set the seed parameter:

# ensure that the result is the same
estim = HyperoptEstimator(classifier=any_classifier('clf'),algo=tpe.suggest, seed=0)
estim.fit(X_train,y_train)
print(estim.score(X_test,y_test))
print(estim.best_model())

Output:

0.980501392758
{'learner': SVC(C=61953.1811067, cache_size=512, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=1, gamma='auto', kernel='linear',
max_iter=18658754.0, probability=False, random_state=3, shrinking=False,
tol=7.18807580055e-05, verbose=False), 'preprocs': (StandardScaler(copy=True, with_mean=False, with_std=True),), 'ex_preprocs': ()}
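Why a seed makes the search repeatable can be shown with a much simpler search procedure. The sketch below is plain random search over a toy objective, not TPE and not hpsklearn's internals. The functions objective and random_search are hypothetical names used only for this illustration.

```python
import random


def objective(x):
    """Toy loss with its minimum at x = 2 (hypothetical, for illustration)."""
    return (x - 2.0) ** 2


def random_search(seed, max_evals=20):
    """Try max_evals random candidates and keep the one with the lowest loss."""
    rng = random.Random(seed)  # all randomness flows from this one seed
    best_x, best_loss = None, float("inf")
    for _ in range(max_evals):
        x = rng.uniform(-10, 10)
        loss = objective(x)
        if loss < best_loss:
            best_x, best_loss = x, loss
    return best_x, best_loss


# Same seed, same sequence of candidates, same winner.
print(random_search(seed=0) == random_search(seed=0))  # True
```

With a fixed seed, the sequence of candidates is identical from run to run, so the selected model is too. Without one, each run explores different candidates and can end on a different model, as the two outputs in this article show.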

To optimize one specific algorithm, specify it through the classifier parameter.

Taking SVM as an example, test accuracy is 39.28% before optimization and 98.61% after.

start = time.time()
clf1 = SVC()
clf1.fit(X_train, y_train)
end = time.time()
print('old test score:', clf1.score(X_test, y_test))
print('old time:', end - start, 's')
print('old model:', clf1)

old test score: 0.392757660167
old time: 0.422000169754 s
old model: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

# significant improvement
start = time.time()
clf2 = HyperoptEstimator(classifier=svc('mySVC'), seed=0)
clf2.fit(X_train, y_train)
end = time.time()
print('new score', clf2.score(X_test, y_test))
print('new time:', end - start, 's')
print('new model:', clf2.best_model())

new score 0.986072423398
new time: 9.24400019646 s
new model: {'learner': SVC(C=3148.38646281, cache_size=512, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=1, gamma=0.0475906452129,
kernel='rbf', max_iter=46434501.0, probability=False, random_state=4,
shrinking=False, tol=0.00158569665523, verbose=False), 'preprocs': (MinMaxScaler(copy=True, feature_range=(-1.0, 1.0)),), 'ex_preprocs': ()}
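The scores and timings printed above are easier to judge as counts. The short calculation below turns them into numbers of correct test predictions and a runtime ratio. It assumes the test set holds 359 samples, which is int(0.2 * 1797) for the 1797-sample digits dataset. Everything else is copied from the printed output.

```python
n_test = 359  # assumed: int(0.2 * 1797) test samples

# Accuracies copied from the printed output above.
scores = {"default SVC": 0.392757660167, "tuned SVC": 0.986072423398}

counts = {}
for name, score in scores.items():
    correct = round(score * n_test)
    counts[name] = (correct, n_test - correct)
    print(name, "correct:", correct, "wrong:", n_test - correct)
# default SVC correct: 141 wrong: 218
# tuned SVC correct: 354 wrong: 5

# Timings copied from the printed output above, in seconds.
slowdown = 9.24400019646 / 0.422000169754
print("search cost relative to a single fit:", round(slowdown, 1))
```

So in this run the tuned model misclassifies 5 test digits instead of 218, at the cost of a search that takes roughly twenty times as long as a single default fit.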
