您的位置:首页 > 编程语言 > Go语言

Google---机器学习速成课程(3.3)-合成特征和离群值

2018-03-16 09:47 330 查看
请按照指定顺序运行以下三项练习:Pandas 简介。 Pandas 是用于进行数据分析和建模的重要库,广泛应用于 TensorFlow 编码。该教程提供了您学习本课程所需的全部 Pandas 信息。如果您已了解 Pandas,则可以跳过此练习。
使用 TensorFlow 的起始步骤。此练习介绍了线性回归。
合成特征和离群值。此练习介绍了合成特征,以及输入离群值会造成的影响。(本篇介绍这节)
-------------------------------------------------------

合成特征和离群值

学习目标:创建一个合成特征,即另外两个特征的比例
将此新特征用作线性回归模型的输入
通过识别和截取(移除)输入数据中的离群值来提高模型的有效性
我们来回顾下之前的“使用 TensorFlow 的基本步骤”练习中的模型。首先,我们将加利福尼亚州住房数据导入 Pandas 
DataFrame
 中:import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.metrics as metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")

california_housing_dataframe = california_housing_dataframe.reindex(
np.random.permutation(california_housing_dataframe.index))
california_housing_dataframe["median_house_value"] /= 1000.0
california_housing_dataframe接下来,我们将设置输入函数,并针对模型训练来定义该函数:
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
"""Trains a linear regression model of one feature.

Args:
features: pandas DataFrame of features
targets: pandas DataFrame of targets
batch_size: Size of batches to be passed to the model
shuffle: True or False. Whether to shuffle the data.
num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
Returns:
Tuple of (features, labels) for next data batch
"""

# Convert pandas data into a dict of np arrays.
features = {key:np.array(value) for key,value in dict(features).items()}

# Construct a dataset, and configure batching/repeating
ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit
ds = ds.batch(batch_size).repeat(num_epochs)

# Shuffle the data, if specified
if shuffle:
ds = ds.shuffle(buffer_size=10000)

# Return the next batch of data
features, labels = ds.make_one_shot_iterator().get_next()
return features, labels
def train_model(learning_rate, steps, batch_size, input_feature):
"""Trains a linear regression model.

Args:
learning_rate: A `float`, the learning rate.
steps: A non-zero `int`, the total number of training steps. A training step
consists of a forward and backward pass using a single batch.
batch_size: A non-zero `int`, the batch size.
input_feature: A `string` specifying a column from `california_housing_dataframe`
to use as input feature.

Returns:
A Pandas `DataFrame` containing targets and the corresponding predictions done
after training the model.
"""

periods = 10
steps_per_period = steps / periods

my_feature = input_feature
my_feature_data = california_housing_dataframe[[my_feature]].astype('float32')
my_label = "median_house_value"
targets = california_housing_dataframe[my_label].astype('float32')

# Create input functions
training_input_fn = lambda: my_input_fn(my_feature_data, targets, batch_size=batch_size)
predict_training_input_fn = lambda: my_input_fn(my_feature_data, targets, num_epochs=1, shuffle=False)

# Create feature columns
feature_columns = [tf.feature_column.numeric_column(my_feature)]

# Create a linear regressor object.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
linear_regressor = tf.estimator.LinearRegressor(
feature_columns=feature_columns,
optimizer=my_optimizer
)

# Set up to plot the state of our model's line each period.
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
plt.title("Learned Line by Period")
plt.ylabel(my_label)
plt.xlabel(my_feature)
sample = california_housing_dataframe.sample(n=300)
plt.scatter(sample[my_feature], sample[my_label])
colors = [cm.coolwarm(x) for x in np.linspace(-1, 1, periods)]

# Train the model, but do so inside a loop so that we can periodically assess
# loss metrics.
print "Training model..."
print "RMSE (on training data):"
root_mean_squared_errors = []
for period in range (0, periods):
# Train the model, starting from the prior state.
linear_regressor.train(
input_fn=training_input_fn,
steps=steps_per_period,
)
# Take a break and compute predictions.
predictions = linear_regressor.predict(input_fn=predict_training_input_fn)
predictions = np.array([item['predictions'][0] for item in predictions])

# Compute loss.
root_mean_squared_error = math.sqrt(
metrics.mean_squared_error(predictions, targets))
# Occasionally print the current loss.
print "  period %02d : %0.2f" % (period, root_mean_squared_error)
# Add the loss metrics from this period to our list.
root_mean_squared_errors.append(root_mean_squared_error)
# Finally, track the weights and biases over time.
# Apply some math to ensure that the data and line are plotted neatly.
y_extents = np.array([0, sample[my_label].max()])

weight = linear_regressor.get_variable_value('linear/linear_model/%s/weigh
4000
ts' % input_feature)[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

x_extents = (y_extents - bias) / weight
x_extents = np.maximum(np.minimum(x_extents,
sample[my_feature].max()),
sample[my_feature].min())
y_extents = weight * x_extents + bias
plt.plot(x_extents, y_extents, color=colors[period])
print "Model training finished."

# Output a graph of loss metrics over periods.
plt.subplot(1, 2, 2)
plt.ylabel('RMSE')
plt.xlabel('Periods')
plt.title("Root Mean Squared Error vs. Periods")
plt.tight_layout()
plt.plot(root_mean_squared_errors)

# Create a table with calibration data.
calibration_data = pd.DataFrame()
calibration_data["predictions"] = pd.Series(predictions)
calibration_data["targets"] = pd.Series(targets)
display.display(calibration_data.describe())

print "Final RMSE (on training data): %0.2f" % root_mean_squared_error

return calibration_data

----------------------------------

任务 1:尝试合成特征

total_rooms
 和 
population
 特征都会统计指定街区的相关总计数据。但是,如果一个街区比另一个街区的人口更密集,会怎么样?我们可以创建一个合成特征(即 
total_rooms
 与 
population
 的比例)来探索街区人口密度与房屋价值中位数之间的关系。在以下单元格中,创建一个名为 
rooms_per_person
 的特征,并将其用作 
train_model()
 的 
input_feature
。通过调整学习速率,您使用这一特征可以获得的最佳效果是什么?(效果越好,回归线与数据的拟合度就越高,最终 RMSE 也会越低。)california_housing_dataframe["rooms_per_person"] = (
california_housing_dataframe["total_rooms"] / california_housing_dataframe["population"])

calibration_data = train_model(
learning_rate=0.05,
steps=500,
batch_size=5,
input_feature="rooms_per_person")
-------------------------------------------------------

任务 2:识别离群值

我们可以通过创建预测值与目标值的散点图来可视化模型效果。理想情况下,这些值将位于一条完全相关的对角线上。使用您在任务 1 中训练过的人均房间数模型,并使用 Pyplot 的 
scatter()
 创建预测值与目标值的散点图。您是否看到任何异常情况?通过查看 
rooms_per_person
 中值的分布情况,将这些异常情况追溯到源数据。plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
plt.scatter(calibration_data["predictions"], calibration_data["targets"])校准数据显示,大多数散点与一条线对齐。这条线几乎是垂直的,我们稍后再讲解。现在,我们重点关注偏离这条线的点。我们注意到这些点的数量相对较少。如果我们绘制 
rooms_per_person
 的直方图,则会发现我们的输入数据中有少量离群值:plt.subplot(1, 2, 2)
_ = california_housing_dataframe["rooms_per_person"].hist()

任务 3:截取离群值

看看您能否通过将 
rooms_per_person
 的离群值设置为相对合理的最小值或最大值来进一步改进模型拟合情况。以下是一个如何将函数应用于 Pandas 
Series
 的简单示例,供您参考:
clipped_feature = my_dataframe["my_feature_name"].apply(lambda x: max(x, 0))
上述 
clipped_feature
 没有小于 
0
 的值。

解决方案

我们在任务 2 中创建的直方图显示,大多数值都小于 
5
。我们将 
rooms_per_person
 的值截取为 5,然后绘制直方图以再次检查结果。california_housing_dataframe["rooms_per_person"] = (
california_housing_dataframe["rooms_per_person"]).apply(lambda x: min(x, 5))

_ = california_housing_dataframe["rooms_per_person"].hist()为了验证截取是否有效,我们再训练一次模型,并再次输出校准数据calibration_data = train_model(
learning_rate=0.05,
steps=500,
batch_size=5,
input_feature="rooms_per_person")
_ = plt.scatter(calibration_data["predictions"], calibration_data["targets"])
-------------------------------------------------------
以上整理转载在谷歌出品的机器学习速成课程点击打开链接 侵删!
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: