
Scikit-learn: Main Modules and Basic Usage

2016-08-19 10:22

http://blog.csdn.net/pipisorry/article/details/52128222

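scikit-learn: Machine Learning in Python. The scikit-learn library implements a wide range of machine learning algorithms.
scikit-learn is an open-source machine learning toolkit built on NumPy, SciPy, and Matplotlib. It mainly covers classification, regression, and clustering algorithms such as SVM, logistic regression, naive Bayes, random forests, and k-means. Its code and documentation are both very good, and it is used in many Python projects. For example, the familiar NLTK provides a dedicated interface to scikit-learn for its classifiers, so scikit-learn's classification algorithms can be called to train a classifier model on the training data.
scikit-learn's core functionality is divided into six parts: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing. See the documentation on the official website for details.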

Installation

Note: install NumPy and SciPy first. [Installing Python extension packages on Linux and Windows, and installing libraries from requirement.txt]

Installing scikit-learn on Linux

Building scikit-learn with pip
This is usually the fastest way to install or upgrade to the latest stable release:
pip install -U scikit-learn
pip install --user --install-option="--prefix=" -U scikit-learn
Note: 1. The --user flag asks pip to install scikit-learn in the $HOME/.local folder, therefore not requiring root permission. This flag should make pip ignore any old version of scikit-learn previously installed on the system while benefiting from system packages for numpy and scipy. Those dependencies can be long and complex to build correctly from source.
2. The --install-option="--prefix=" flag is only required if Python has a distutils.cfg configuration with a predefined prefix= entry.
[Installing scikit-learn]

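A scikit-learn approach to solving machine learning problems

A concrete machine learning problem can usually be broken into three steps: data preparation and preprocessing, model selection and training, and model validation and parameter tuning.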
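The three steps can be sketched end to end as follows. This is only an illustration: the synthetic data from make_classification, the choice of StandardScaler plus LogisticRegression, and the candidate values of C are all assumptions made for the example, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: prepare the data (synthetic stand-in for a real dataset) and hold out a test set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: choose a model; preprocessing and classifier are chained in one pipeline.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 3: validate and tune with a 5-fold cross-validated search over C.
search = GridSearchCV(pipeline, {"logisticregression__C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Accuracy of the refitted best model on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Putting the scaler inside the pipeline means it is re-fitted on each training fold, so no information from the validation folds leaks into preprocessing.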

Logistic regression example

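scikit-learn supports data in many formats, including the classic iris dataset and LibSVM-format data. For convenience, the LibSVM format is recommended; see the LibSVM website for details.
Importing load_svmlight_file is enough to load LibSVM-format data:
from sklearn.datasets import load_svmlight_file
t_X, t_y = load_svmlight_file("filename")
The machine learning model also needs its module imported; logistic regression lives in sklearn.linear_model:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# hold out part of the data so the test score is measured on unseen samples (the 0.3 split is an arbitrary choice)
train_X, test_X, train_y, test_y = train_test_split(t_X, t_y, test_size=0.3, random_state=0)
regressionFunc = LogisticRegression(C=10, penalty='l2', tol=0.0001)
train_sco = regressionFunc.fit(train_X, train_y).score(train_X, train_y)
test_sco = regressionFunc.score(test_X, test_y)
This completes training and testing of the model.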
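For readers without a LibSVM file at hand, the following sketch first writes a small synthetic dataset in LibSVM format and then loads and fits it. The file name demo.libsvm, the synthetic data, and the 70/30 split are assumptions made purely for illustration.

```python
import os
import tempfile

from sklearn.datasets import (dump_svmlight_file, load_svmlight_file,
                              make_classification)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Write a synthetic dataset in LibSVM format so there is a file to read.
X_dense, y_dense = make_classification(n_samples=200, n_features=10, random_state=0)
path = os.path.join(tempfile.mkdtemp(), "demo.libsvm")  # hypothetical file name
dump_svmlight_file(X_dense, y_dense, path)

# load_svmlight_file returns a SciPy sparse matrix and a NumPy label array.
t_X, t_y = load_svmlight_file(path)

train_X, test_X, train_y, test_y = train_test_split(
    t_X, t_y, test_size=0.3, random_state=0)
clf = LogisticRegression(C=10, penalty="l2", tol=1e-4, max_iter=1000)
clf.fit(train_X, train_y)

train_sco = clf.score(train_X, train_y)  # accuracy on the training part
test_sco = clf.score(test_X, test_y)     # accuracy on the held-out part
print(train_sco, test_sco)
```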
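To select a better model, you can run cross-validation experiments or tune the parameters with a grid search.
Import the following modules.
CV:
# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# seed_i (an integer seed) and regressionFunc_2 (an estimator such as LogisticRegression) are assumed to be defined earlier
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(t_X, t_y, test_size=0.5, random_state=seed_i)
regressionFunc_2.fit(X_train_m, y_train_m)
sco = regressionFunc_2.score(X_test_m, y_test_m, sample_weight=None)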
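A single train/test split gives one score that depends on the random seed. As a hedged alternative, k-fold cross-validation averages over several splits. The sketch below assumes synthetic data and a default LogisticRegression; neither comes from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)

# One accuracy value per fold; each fold serves once as the validation set.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```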

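GridSearch:
from sklearn.model_selection import GridSearchCV
tuned_parameters = [{'penalty': ['l1'], 'tol': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'penalty': ['l2'], 'tol': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]}]
# liblinear supports both l1 and l2 penalties; with several scoring metrics,
# refit names the metric used to pick the best model (precision is an arbitrary choice here)
clf = GridSearchCV(LogisticRegression(solver='liblinear'), tuned_parameters, cv=5,
                   scoring=['precision', 'recall'], refit='precision')
clf.fit(X_train_m, y_train_m)
print(clf.best_estimator_)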
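When several metrics are scored at once, the per-candidate results are stored in cv_results_ under keys named after each metric. The sketch below shows one way to read them. The synthetic data, the reduced grid over C, and the choice of precision as the refit metric are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    {"C": [0.1, 1, 10, 100]},
    cv=5,
    scoring=["precision", "recall"],
    refit="precision",  # metric used to select best_estimator_
)
clf.fit(X, y)

# One row per parameter combination, one column per metric.
results = clf.cv_results_
for params, prec, rec in zip(results["params"],
                             results["mean_test_precision"],
                             results["mean_test_recall"]):
    print(params, round(prec, 3), round(rec, 3))
print(clf.best_params_)
```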

You can also plot learning curves with matplotlib. Import the relevant functions as follows:
from sklearn.model_selection import learning_curve, validation_curve
The core code is shown below; see the official scikit-learn documentation for details:
train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
train_scores, test_scores = validation_curve(
        estimator, X, y, param_name=param_name, param_range=param_range,
        cv=cv, scoring=scoring, n_jobs=n_jobs)
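A minimal, self-contained version of these two calls might look like the following. The estimator, the synthetic data, the training-size fractions, and the range of C values are assumptions. Plotting is left out, but the per-size mean scores printed here are what would typically be passed to matplotlib.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve, validation_curve

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
estimator = LogisticRegression(max_iter=1000)

# Learning curve: score as a function of the number of training samples.
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5))

# Rows are training-set sizes, columns are cross-validation folds.
for n, tr, te in zip(train_sizes,
                     train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print(n, round(tr, 3), round(te, 3))

# Validation curve: score as a function of one hyperparameter (here C).
param_range = np.logspace(-2, 2, 5)
vc_train_scores, vc_test_scores = validation_curve(
    estimator, X, y, param_name="C", param_range=param_range, cv=5)
print(vc_test_scores.mean(axis=1))
```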

皮皮blog

Preprocessing

Data Loading

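We assume the input is a feature matrix or a CSV file.
First, the data has to be loaded into memory.
scikit-learn's implementation works on NumPy arrays, so we use NumPy to load the CSV file.
The data below is downloaded from the UCI Machine Learning Repository.
import numpy as np
import urllib.request
# url with dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urllib.request.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the data from the target attributes (columns 0-7 are the features, column 8 is the label)
X = dataset[:, 0:8]
y = dataset[:, 8]
We use this dataset as the running example, with the feature matrix as X and the target variable as y.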
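The same loading and slicing logic can be tried without a network connection by reading CSV text from memory. The three rows below are the first rows of the dataset as shown in the table later in this post. Using io.StringIO in place of a downloaded file is an assumption made for the sketch.

```python
import io

import numpy as np

# First three rows of the Pima Indians Diabetes data: 8 features, then the label.
csv_text = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

dataset = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# Columns 0-7 are features; the slice end index 8 is exclusive.
X = dataset[:, 0:8]
# Column 8 is the target label.
y = dataset[:, 8]
print(X.shape, y)
```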

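Data Normalization

Most gradient-based machine learning algorithms are sensitive to the scale of the data, so the features should be normalized or standardized before running them. Normalization rescales each sample to unit norm, while standardization rescales each feature to zero mean and unit variance. scikit-learn provides functions for both:
from sklearn import preprocessing
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
[Scikit-learn: Preprocessing data]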
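The difference between these rescaling options is easiest to see on a tiny array. The sketch below uses a made-up 3x2 matrix and additionally shows MinMaxScaler, which is the scikit-learn transformer that maps each feature to the 0-1 range. Its inclusion here is an addition for comparison, not something the original code uses.

```python
import numpy as np
from sklearn import preprocessing

# Toy data: two features on very different scales (made-up values).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# normalize: each ROW (sample) is rescaled to unit L2 norm.
normalized_X = preprocessing.normalize(X)

# scale: each COLUMN (feature) is shifted to mean 0 and scaled to std 1.
standardized_X = preprocessing.scale(X)

# MinMaxScaler: each COLUMN is mapped linearly onto [0, 1].
minmax_X = preprocessing.MinMaxScaler().fit_transform(X)

print(np.linalg.norm(normalized_X, axis=1))
print(standardized_X.mean(axis=0), standardized_X.std(axis=0))
print(minmax_X.min(axis=0), minmax_X.max(axis=0))
```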

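Feature Selection

When solving a real problem, the ability to choose suitable features or to construct them is especially important. This is called feature selection or feature engineering.
Feature selection is a highly creative process that relies largely on intuition and domain knowledge, although many ready-made algorithms exist for it.
The tree-based algorithm below computes how informative each feature is:
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)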
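One possible way to act on these importances is SelectFromModel, which keeps the features whose importance reaches a threshold (for tree ensembles the default threshold is the mean importance). The synthetic data and the number of trees below are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data in which only 3 of the 8 features are informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Wrap the already-fitted model; features below the mean importance are dropped.
selector = SelectFromModel(model, prefit=True)
X_selected = selector.transform(X)

print(model.feature_importances_)
print(X.shape, "->", X_selected.shape)
```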

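Using scikit-learn's algorithms

scikit-learn implements most of the basic machine learning algorithms; let us take a quick look at them.

Logistic regression

Most problems can be reduced to binary classification. An advantage of this algorithm is that it outputs the probability of each class for a sample.

Here we use the Pima Indians Diabetes dataset, which contains health measurements and diabetes status for 768 patients.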

import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names)
pima.head()


	pregnant	glucose	bp	skin	insulin	bmi	pedigree	age	label
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

In the label column of the table above, 1 means the patient has diabetes and 0 means the patient does not.

# define X and y
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
X = pima[feature_cols]
y = pima.label

# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1,
max_iter=100, multi_class='ovr', penalty='l2', random_state=None, solver='liblinear', tol=0.0001,verbose=0)

# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Result:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, penalty=l2, random_state=None, tol=0.0001)
             precision    recall  f1-score   support
        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268
avg / total       0.77      0.77      0.77       768
[[447  53]
 [120 148]]
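The class probabilities that logistic regression provides can be read with predict_proba. The sketch below uses synthetic data rather than the diabetes dataset, which is an assumption made so the example runs offline, and evaluates on a held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per test sample; columns follow the order of logreg.classes_.
proba = logreg.predict_proba(X_test)

# Hard class predictions for the same samples.
y_pred_class = logreg.predict(X_test)

print(logreg.classes_)
print(proba[:3])
print(y_pred_class[:3])
```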

Naive Bayes

This is another well-known machine learning algorithm. Its task is to recover the distribution density of the training data, and it works well for multi-class classification.
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Result:
GaussianNB()
             precision    recall  f1-score   support
        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268
avg / total       0.76      0.77      0.76       768
[[429  71]
 [108 160]]
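To illustrate the multi-class case mentioned above, the following sketch fits GaussianNB to the three-class iris dataset bundled with scikit-learn. The choice of iris and of 5-fold cross-validation are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Iris: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Cross-validated accuracy, measured on folds not used for fitting.
scores = cross_val_score(GaussianNB(), X, y, cv=5)
print(scores.mean())

# The fitted model stores one Gaussian mean per class and feature.
model = GaussianNB().fit(X, y)
print(model.theta_.shape)
```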

k-Nearest Neighbors

The k-nearest neighbors algorithm is often used as one component of a larger classification method; for example, it can be used to evaluate features, which makes it useful in feature selection.
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Result:
KNeighborsClassifier(algorithm=auto, leaf_size=30, metric=minkowski,
n_neighbors=5, p=2, weights=uniform)
             precision    recall  f1-score   support
        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268
avg / total       0.80      0.80      0.80       768
[[448  52]
 [ 98 170]]
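Because k-nearest neighbors relies on distances, it is commonly combined with feature scaling, and the number of neighbors is usually chosen by validation. The sketch below compares a few values of n_neighbors. The iris data, the candidate values, and the use of StandardScaler are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Mean 5-fold accuracy for several neighborhood sizes.
cv_scores = {}
for k in (1, 5, 15):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

for k, score in cv_scores.items():
    print(k, round(score, 3))
```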

Decision Trees

Classification and Regression Trees (CART) are commonly used for classification or regression problems whose features carry categorical information, and the method is well suited to multi-class classification.
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Result:
DecisionTreeClassifier(compute_importances=None, criterion=gini,
max_depth=None, max_features=None, min_density=None,
min_samples_leaf=1, min_samples_split=2, random_state=None,
splitter=best)
             precision    recall  f1-score   support
        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268
avg / total       1.00      1.00      1.00       768
[[500   0]
 [  0 268]]
Note that the perfect scores arise because the tree is evaluated on the same data it was trained on; an unconstrained tree can memorize its training set.
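Scores computed on the training data can be misleading for an unpruned tree. A rough way to check this is to compare the training accuracy with a cross-validated one, as sketched below. The synthetic data with 10% label noise (flip_y=0.1) is an assumption chosen to make the gap visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some randomly reassigned labels.
X, y = make_classification(n_samples=300, n_features=8, flip_y=0.1,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the data the tree was fitted on.
train_accuracy = tree.fit(X, y).score(X, y)

# Mean accuracy on held-out folds (cross_val_score refits clones of the tree).
cv_accuracy = cross_val_score(tree, X, y, cv=5).mean()

print(train_accuracy, cv_accuracy)
```

Limiting max_depth or min_samples_leaf is one common way to narrow the gap between the two numbers.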

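Support Vector Machines

[Scikit-learn: classification: svm]

Besides classification and regression algorithms, scikit-learn also provides more complex algorithms such as clustering, and implements techniques for combining algorithms, such as Bagging and Boosting.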
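As a small illustration of the ensemble techniques just mentioned, the sketch below cross-validates a bagging ensemble of decision trees and an AdaBoost ensemble. The synthetic data and the number of estimators are assumptions, and BaggingClassifier and AdaBoostClassifier are only two of the ensemble classes scikit-learn offers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Bagging: many trees trained on bootstrap samples, predictions combined by voting.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0)

# Boosting: weak learners trained in sequence, each focusing on earlier mistakes.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

ensemble_scores = {}
for name, model in [("bagging", bagging), ("boosting", boosting)]:
    ensemble_scores[name] = cross_val_score(model, X, y, cv=5).mean()

print(ensemble_scores)
```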

How to optimize algorithm parameters

A harder task is to build an effective procedure for choosing the right parameters; this requires searching over candidate values, and scikit-learn provides functions for that purpose.
The following example selects the regularization parameter:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
Result:
GridSearchCV(cv=None,
estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver=auto, tol=0.001),
estimator__alpha=1.0, estimator__copy_X=True,
estimator__fit_intercept=True, estimator__max_iter=None,
estimator__normalize=False, estimator__solver=auto,
estimator__tol=0.001, fit_params={}, iid=True, loss_func=None,
n_jobs=1,
param_grid={'alpha': array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,
1.00000e-04, 0.00000e+00])},
pre_dispatch=2*n_jobs, refit=True, score_func=None, scoring=None,
verbose=0)
0.282118955686
1.0
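Beyond best_score_, the mean cross-validated score of every candidate is available in cv_results_. The sketch below prints them for a Ridge model. It assumes synthetic regression data from make_regression, so its scores are not comparable to the numbers above. For a regressor, the default score is R².

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001])
grid = GridSearchCV(Ridge(), param_grid={"alpha": alphas}, cv=5)
grid.fit(X, y)

# Candidates are evaluated in the order given in param_grid.
for alpha, score in zip(alphas, grid.cv_results_["mean_test_score"]):
    print(alpha, round(score, 4))

print(grid.best_params_, grid.best_score_)
```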
Sometimes it is effective to sample parameter values at random from a given range, evaluate the algorithm for each sampled value, and pick the best one.
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
Result:
RandomizedSearchCV(cv=None,
estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver=auto, tol=0.001),
estimator__alpha=1.0, estimator__copy_X=True,
estimator__fit_intercept=True, estimator__max_iter=None,
estimator__normalize=False, estimator__solver=auto,
estimator__tol=0.001, fit_params={}, iid=True, n_iter=100,
n_jobs=1,
param_distributions={'alpha': <scipy.stats.distributions.rv_frozen object at 0x04B86DD0>},
pre_dispatch=2*n_jobs, random_state=None, refit=True,
scoring=None, verbose=0)
0.282118643885
0.988443794636
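Regularization strengths often span several orders of magnitude, in which case sampling on a log scale may cover the range more evenly than a uniform distribution on [0, 1]. The sketch below is one such variant. The use of scipy.stats.loguniform, the bounds 1e-4 to 10, the synthetic data, and n_iter=20 are all assumptions, not part of the original example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Sample alpha so that each decade between 1e-4 and 10 is equally likely.
param_distributions = {"alpha": loguniform(1e-4, 1e1)}

rsearch = RandomizedSearchCV(Ridge(), param_distributions, n_iter=20, cv=5,
                             random_state=0)
rsearch.fit(X, y)

best_alpha = rsearch.best_params_["alpha"]
print(best_alpha, rsearch.best_score_)
```

Fixing random_state makes the sampled candidates reproducible between runs.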
[Jianshu: [Machine Learning Experiments] Main modules and basic usage of scikit-learn]

from: http://blog.csdn.net/pipisorry/article/details/52128222

