Python Data Analysis: scikit-learn Basics (Part 2)

2019-05-06 14:55

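Workflow for using scikit-learn

Preparing the dataset

Data processing
Dataset format: a two-dimensional array of shape (n_samples, n_features)
Use np.reshape() to convert the dataset to that shape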
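
As a minimal sketch of the shape requirement above, the snippet below turns flat arrays into the (n_samples, n_features) layout. The variable names and the sample values are made up for illustration and do not come from the original post.

```python
import numpy as np

# A single feature measured on 5 samples, stored as a flat 1-D array
# (illustrative values only)
heights = np.array([1.62, 1.75, 1.80, 1.68, 1.71])
print(heights.shape)  # (5,)

# -1 lets NumPy infer the number of rows: 5 samples, 1 feature
X = np.reshape(heights, (-1, 1))
print(X.shape)  # (5, 1)

# 12 flat values arranged as 3 samples with 4 features each
X_wide = np.arange(12).reshape(3, 4)
print(X_wide.shape)  # (3, 4)
```

Passing -1 for one dimension is usually the safer choice, since the sample count then follows from the data rather than from a hard-coded number.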

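Feature engineering

Feature normalization
……

Splitting into training and test sets: train_test_split()

Feature normalization

preprocessing.scale() standardizes each feature column to zero mean and unit variance

Choosing a suitable model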

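Training the model

The estimator object
Learned from the training data
Can be a classification algorithm, a regression algorithm, or a feature-extraction algorithm
The fit method trains the estimator
An estimator's parameters can be set when it is created or updated afterwards
get_params() returns the parameters defined earlier
score() evaluates the estimator
Regression models are scored with the coefficient of determination (R²)
Classification models are scored with accuracy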
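
The estimator interface described above can be sketched as follows. KNeighborsClassifier and the iris data are only a convenient choice here, and set_params() is not mentioned in the original text; it is scikit-learn's standard way to update parameters after construction.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Parameters can be set when the estimator is created ...
knn = KNeighborsClassifier(n_neighbors=3)
print(knn.get_params()["n_neighbors"])  # 3

# ... or updated afterwards with set_params()
knn.set_params(n_neighbors=5)
print(knn.get_params()["n_neighbors"])  # 5

# fit() trains the estimator; score() returns accuracy for a classifier
knn.fit(X, y)
train_accuracy = knn.score(X, y)
print(train_accuracy)
```

Scoring on the training data, as done here for brevity, tends to be optimistic; a held-out test set gives a more honest number.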

Tuning parameters

By experience
Cross-validation: cross_val_score()

Testing the model

model.predict(X_test) returns the predicted labels for the test samples
model.score(X_test, y_test) computes a score from the predicted and true values
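
To make the relationship between the two calls concrete, here is a small sketch for a classifier: the fraction of predictions that match the true labels should equal what score() reports. The model and dataset are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3., random_state=10)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# predict() returns one label per test sample
y_pred = model.predict(X_test)

# Accuracy computed by hand: share of predictions equal to the true label
manual_accuracy = np.mean(y_pred == y_test)

# score() runs predict() internally and reports the same accuracy
builtin_accuracy = model.score(X_test, y_test)
print(manual_accuracy, builtin_accuracy)
```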

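import numpy as np
from sklearn.model_selection import train_test_split

# 10 samples with 4 integer features each, and 10 labels drawn from {0, 1, 2}
X = np.random.randint(0, 100, (10, 4))
y = np.random.randint(0, 3, 10)
y.sort()  # sort the labels in place

print('Samples:')
print(X)
print('Labels:', y)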

# Split into training and test sets
# random_state makes the random split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)

print('Training set:')
print(X_train)
print(y_train)

print('Test set:')
print(X_test)
print(y_test)

# Feature normalization
from sklearn import preprocessing

# Three features on very different scales
x1 = np.random.randint(0, 1000, 5).reshape(5, 1)
x2 = np.random.randint(0, 10, 5).reshape(5, 1)
x3 = np.random.randint(0, 100000, 5).reshape(5, 1)

X = np.concatenate([x1, x2, x3], axis=1)
print(X)
print('Dataset after normalization:')
print(preprocessing.scale(X))
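
What preprocessing.scale() computes can be checked by hand: it subtracts each column's mean and divides by that column's standard deviation. The sketch below uses a small fixed array (values invented for illustration) so the comparison is reproducible.

```python
import numpy as np
from sklearn import preprocessing

# Three features on very different scales (illustrative values only)
X = np.array([[120.0, 3.0, 52000.0],
              [870.0, 9.0, 1300.0],
              [455.0, 1.0, 98000.0],
              [30.0, 6.0, 40500.0],
              [640.0, 4.0, 7700.0]])

X_scaled = preprocessing.scale(X)

# Same computation by hand: per-column mean and population standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))  # True
print(X_scaled.mean(axis=0).round(6))   # each column is centered near 0
print(X_scaled.std(axis=0).round(6))    # each column has unit spread
```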

Verifying how much normalization matters for the result

Without normalization:

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Two informative features, multiplied by scale=100 so their values are large
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           random_state=25, n_clusters_per_class=1, scale=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)
svm_classifier = svm.SVC()
svm_classifier.fit(X_train, y_train)
print(svm_classifier.score(X_test, y_test))  # accuracy on the test set

With normalization:

from sklearn import preprocessing, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           random_state=25, n_clusters_per_class=1, scale=100)
# Normalize the features before splitting
X = preprocessing.scale(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=7)
svm_classifier = svm.SVC()
svm_classifier.fit(X_train, y_train)
print(svm_classifier.score(X_test, y_test))  # accuracy on the test set
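
The example above scales the whole dataset before splitting, so the test samples contribute to the mean and variance used for scaling. One possible variant, not part of the original post, is to learn the scaling from the training set only. The sketch below assumes StandardScaler and make_pipeline from scikit-learn for that purpose.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=2,
                           random_state=25, n_clusters_per_class=1, scale=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3., random_state=7)

# The pipeline fits StandardScaler on X_train only, then reuses those
# statistics to transform X_test inside score()
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_train, y_train)

pipeline_accuracy = scaled_svm.score(X_test, y_test)
print(pipeline_accuracy)
```

Keeping the scaler inside the pipeline also means the same object can be passed to cross_val_score(), where the scaling is then refit on each training fold.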

Cross-validation:

from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# %matplotlib inline  # Jupyter-only magic; uncomment when running in a notebook

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3., random_state=10)

# Mean 10-fold cross-validation accuracy for each K from 1 to 30
k_range = range(1, 31)
cv_scores = []
for n in k_range:
    knn = KNeighborsClassifier(n_neighbors=n)
    # accuracy is the scoring metric used for classification problems
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

plt.plot(k_range, cv_scores)
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.show()

# Choose the best K
best_knn = KNeighborsClassifier(n_neighbors=5)
best_knn.fit(X_train, y_train)
print(best_knn.score(X_test, y_test))
print(best_knn.predict(X_test))
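
Instead of reading the best K off the plot and typing it in, it can also be picked from the cross-validation scores directly. This is a sketch under the same setup as above; the names k_values, mean_scores, and best_k are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3., random_state=10)

# Mean 10-fold cross-validation accuracy for each candidate K
k_values = list(range(1, 31))
mean_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train,
                    cv=10, scoring='accuracy').mean()
    for k in k_values
]

# np.argmax returns the first index of the maximum, so ties go to the smaller K
best_k = k_values[int(np.argmax(mean_scores))]
print('best K:', best_k)

# Refit on the full training set and evaluate once on the held-out test set
best_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = best_knn.score(X_test, y_test)
print(test_accuracy)
```

The test set is touched only once, after K has been fixed, so the reported accuracy is not influenced by the parameter search.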
