您的位置：首页 > 编程语言 > Python开发

[Python][Scikit-learn][学习笔记01]线性回归之波士顿房价实例分析

2018-03-17 23:28 976 查看

>数据的选择

从Scikit-learn的数据集里载入波士顿的房价数据：

from sklearn import datasets
boston = datasetd.load_boston()

波士顿数据集是一个具有13个特征的常见线性数据集，也是NG网课里的第一个例子。我们可以打印其描述文档来获取其各项属性：

print boston.DESCR

结果如下：

Data Set Characteristics:

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive

:Median Value (attribute 14) is usually the target

:Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

>线性回归模型——手动分割训练集和测试集

我们先给定一个默认的采样频率，如0.5，用于将训练集和测试集分为两个相等的集合：

sampleRatio = 0.5
n_samples = len(boston.target)
sampleBoundary = int(n_samples * sampleRatio)

接着，洗乱整个集合，并取出相应的训练集和测试集数据：

shuffleIdx = range(n_samples)
numpy.random.shuffle(shuffleIdx) # 需要导入numpy
# 训练集的特征和回归值
train_features = boston.data[shuffleIdx[:sampleBoundary]]
train_targets = boston.target[shuffleIdx[:sampleBoundary]]
# 测试集的特征和回归值
test_features = boston.data[shuffleIdx[sampleBoundary:]]
test_targets = boston.target[shuffleIdx[sampleBoundary:]]

接下来，获取回归模型，拟合并得到测试集的预测结果：

lr = sklearn.linear_model.LinearRegression() # 需要导入sklearn的linear_model
lr.fit(train_features, train_targets) # 拟合
y = lr.predict(test_features) # 预测

最后，把预测结果通过matplotlib画出来：

import matplotlib.pyplot as plt
plt.plot(y, test_targets, 'rx') # y = ωX
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'b-.', lw=4) # f(x)=x
plt.ylabel("Predieted Price")
plt.xlabel("Real Price")
plt.show()

得到的结果如下：

在蓝线上的点是准确预测的点，而在蓝线以下及以上的点，分别是过低预测及过高预测的结果。

>线性回归模型——KFlod交叉验证

来自官方的样例：

from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plt

lr = linear_model.LinearRegression()
boston = datasets.load_boston()
y = boston.target

# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, boston.data, y, cv=10)

fig, ax = plt.subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

主要用的是交叉验证模型中的cross_val_predict，同样给定了线性回归模型(linear_model.LinearRegression()，模型需要实现fit()方法)，并划分了cv=10个交叉验证集合。比起手动划分集合，代码更加简短且易读性更好，没什么好过多分析的。默认的是KFlod方式，结果如下：

>交叉验证模型的打分

考虑到使用了交叉验证，我们可以对一种估计模型（estimator）进行评分，需要用到sklearn.cross_validation的cross_val_score()：

from sklearn import cross_validation
print cross_validation.cross_val_score(lr, boston.data, y, cv=10)

得到10个交叉验证集的结果：

[ 0.73334917  0.47229799 -1.01097697  0.64126348  0.54709821  0.73610181
0.37761817 -0.13026905 -0.78372253  0.41861839]

显然结果并不算好。

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签： python scikit-learn 线性回归模型交叉验证

相关文章推荐

新的分享

章节导航