
TensorFlow Study Notes 9: TensorFlow Wide & Deep Learning Tutorial

2017-05-25 19:03

Original tutorial: the official TensorFlow tutorial

These notes record the key points and my impressions while studying. To be continued.

TensorFlow Wide & Deep Learning Tutorial

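In the previous tutorial, TensorFlow Linear Model Tutorial, we trained a logistic regression model on the Census Income Dataset to predict the probability that a person's annual income exceeds 50,000 dollars. TensorFlow is also well suited to training deep neural networks, so you may wonder which one to choose. Why not both? Is it possible to combine the strengths of both in a single model?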

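In this tutorial, we show how to use the tf.learn API to jointly train a wide linear model and a deep feed-forward neural network. This approach combines the strengths of memorization and generalization. It is useful for generic large-scale regression and classification problems with sparse input features (for example, categorical features with a large number of possible feature values). If you want to learn more about how Wide & Deep Learning works, see the research paper.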

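The figure in the original tutorial compares a wide model (logistic regression with sparse features and transformations), a deep model (a feed-forward neural network with an embedding layer and several hidden layers), and a Wide & Deep model (joint training of both). At a high level, only three steps are needed to configure a wide, deep, or Wide & Deep model with the tf.learn API.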

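1、Select features for the wide part: choose the sparse base columns and crossed columns you want to use.

2、Select features for the deep part: choose the continuous columns, the embedding dimension for each categorical column, and the hidden layer sizes.

3、Put them all together into a Wide & Deep model (DNNLinearCombinedClassifier).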

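Now let's walk through a simple example.

1、Setup

To try the code for this tutorial:

Install TensorFlow

Download the tutorial code

Install the pandas data analysis library. tf.learn does not require pandas, but it does support it, and this tutorial uses pandas. To install pandas:

Install pip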

# Ubuntu/Linux 64-bit

$ sudo apt-get install python-pip python-dev

# Mac OS X

$ sudo easy_install pip
$ sudo easy_install --upgrade six

Use pip to install pandas

$ sudo pip install pandas

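If you run into problems installing pandas, consult the pandas installation instructions.

Run the tutorial code with the following command to train the Wide & Deep model described in this tutorial: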

$ python wide_n_deep_tutorial.py --model_type=wide_n_deep

Read on to see how this code builds its model.

2、Define Base Feature Columns

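First, define the base categorical and continuous feature columns we will use. These base columns are the building blocks for both the wide part and the deep part of the model.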

import tensorflow as tf

# categorical base columns
gender = tf.contrib.layers.sparse_column_with_keys(
column_name = "gender",
keys = ["Female","Male"]
)
race = tf.contrib.layers.sparse_column_with_keys(
column_name = "race",
keys = ["Amer-Indian-Eskimo","Asian-Pac-Islander",
"Black","Other","White"]
)
education = tf.contrib.layers.sparse_column_with_hash_bucket("education",hash_bucket_size=1000)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship",hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass",hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation",hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country",hash_bucket_size=1000)

# continuous base columns
age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(
age,
boundaries = [18,25,30,35,40,45,50,55,60,65]
)
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")
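To make the two column transformations above more concrete, here is a minimal pure-Python sketch of what bucketizing and hash bucketing do to a raw value. It is only an illustration: the helper names `bucketize` and `hash_bucket` are hypothetical, and MD5 stands in for TensorFlow's own hashing, so the bucket ids will not match what TensorFlow produces.

```python
import bisect
import hashlib

# Same boundaries as age_buckets in the tutorial code.
AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucketize(value, boundaries):
    """Return the index of the bucket that value falls into.

    With N boundaries there are N + 1 buckets:
    (-inf, b0), [b0, b1), ..., [b_{N-1}, +inf).
    """
    return bisect.bisect_right(boundaries, value)


def hash_bucket(value, hash_bucket_size):
    """Map a string to an integer id in [0, hash_bucket_size).

    MD5 is an assumption made for this sketch, not TensorFlow's hash.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


for age in (17, 18, 24, 25, 70):
    print(age, "->", bucketize(age, AGE_BOUNDARIES))

for occupation in ("Tech-support", "Sales"):
    print(occupation, "->", hash_bucket(occupation, 1000))
```

Hashing means no vocabulary has to be listed in advance, at the cost of occasional collisions when two different values land in the same bucket.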

3、The Wide Model: Linear Model with Crossed Feature Columns

The wide model is a linear model with a wide set of sparse and crossed feature columns.

wide_columns = [
gender, native_country, education, occupation, workclass,
relationship, age_buckets,
tf.contrib.layers.crossed_column([education,occupation],hash_bucket_size=int(1e4)),
tf.contrib.layers.crossed_column([native_country,occupation],hash_bucket_size=int(1e4)),
tf.contrib.layers.crossed_column([age_buckets,education,occupation],hash_bucket_size=int(1e6))
]

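Wide models with crossed feature columns can memorize sparse interactions between features effectively. That said, one limitation of crossed feature columns is that they do not generalize to feature combinations that never appeared in the training data. We now add a deep model with embeddings to address this.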
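The memorization behaviour of crossed columns can be sketched without TensorFlow. In the toy example below, the weights and category pairs are made up for illustration, and the crossed value is used directly as a dictionary key instead of being hashed into a bucket as crossed_column does.

```python
# Hypothetical weights of a wide model, one per (education, occupation)
# combination that was seen during training.
cross_weights = {
    ("Bachelors", "Exec-managerial"): 0.8,
    ("HS-grad", "Craft-repair"): -0.3,
}


def wide_score(education, occupation):
    """Return the weight of the crossed feature, or 0.0 if it was never seen."""
    return cross_weights.get((education, occupation), 0.0)


# A combination seen in training gets its memorized weight.
print(wide_score("Bachelors", "Exec-managerial"))
# An unseen combination contributes nothing, even though both values
# appeared separately in other combinations.
print(wide_score("Bachelors", "Craft-repair"))
```

The second call shows the limitation: nothing learned about either value on its own carries over to a new pairing.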

4、The Deep Model: Neural Network with Embeddings

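As mentioned earlier, the deep model is a feed-forward neural network. Each sparse, high-dimensional categorical feature is first converted into a low-dimensional, dense, real-valued vector, often called an embedding vector. These low-dimensional dense embedding vectors are concatenated with the continuous features and then fed into the hidden layers of the neural network in the forward pass. The embedding values are usually initialized randomly and are trained together with all other model parameters to minimize the training loss. If you want to learn more about embeddings, see the TensorFlow tutorial Vector Representations of Words.

We use embedding_column to configure the embeddings for the categorical columns, and concatenate them with the continuous columns.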

deep_columns = [
tf.contrib.layers.embedding_column(workclass,dimension=8),
tf.contrib.layers.embedding_column(education,dimension=8),
tf.contrib.layers.embedding_column(gender,dimension=8),
tf.contrib.layers.embedding_column(relationship,dimension=8),
tf.contrib.layers.embedding_column(native_country,dimension=8),
tf.contrib.layers.embedding_column(occupation,dimension=8),
age,education_num,capital_gain,capital_loss,hours_per_week
]
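As a rough picture of what the deep part receives as input, the sketch below looks up a randomly initialized embedding for one categorical value and concatenates it with the five continuous features. The vocabulary, the initialization range, and the helper name `deep_input` are assumptions made for illustration, not TensorFlow's behaviour, and no training happens here.

```python
import random

random.seed(0)

VOCAB = ["Private", "Self-emp-inc", "State-gov"]  # hypothetical workclass values
DIMENSION = 8  # same embedding dimension as in the tutorial code

# One randomly initialized row of DIMENSION floats per vocabulary entry.
embedding_table = {
    word: [random.uniform(-0.1, 0.1) for _ in range(DIMENSION)]
    for word in VOCAB
}


def deep_input(workclass, continuous_features):
    """Concatenate the embedding of a categorical value with continuous features."""
    return embedding_table[workclass] + list(continuous_features)


# Continuous features in the order:
# age, education_num, capital_gain, capital_loss, hours_per_week
example = deep_input("Private", [39.0, 13.0, 2174.0, 0.0, 40.0])
print(len(example))  # 8 embedding values followed by 5 continuous values
```

In the real model there is one such table per embedding column, and the table entries are updated by gradient descent along with the hidden layer weights.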

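The higher the dimension of an embedding column, the more degrees of freedom the model has to learn the representation of that feature. For simplicity, we set the dimension to 8 for every feature column here. As a rule of thumb, a more informed choice is to start with a value on the order of log2(n) or k * n^(1/4), where n is the number of unique feature values in a feature column and k is a small constant (usually smaller than 10).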
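The rule of thumb for the embedding dimension, which the official tutorial states as log2(n) or k * n^(1/4), can be evaluated for a few vocabulary sizes. This is only a sketch: the function name and the default k = 2.0 are arbitrary choices for illustration.

```python
import math


def suggested_dimensions(n, k=2.0):
    """Return (log2(n), k * n ** 0.25) as two starting points for the dimension.

    n is the number of unique values in the feature column.
    k is a small constant; 2.0 is just an example value.
    """
    return math.log2(n), k * n ** 0.25


for n in (16, 1000, 10 ** 6):
    log2_dim, root_dim = suggested_dimensions(n)
    print("n=%d  log2(n)=%.1f  k*n^(1/4)=%.1f" % (n, log2_dim, root_dim))
```

Both expressions grow slowly with n, so even a column with a million distinct values gets a starting dimension in the tens rather than the thousands.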

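With dense embeddings, deep models can generalize better and make predictions on feature pairs that were never seen in the training data. However, it is difficult to learn effective low-dimensional representations for feature columns when the underlying interaction matrix between two feature columns is sparse and high-rank. In such cases, the interaction between most feature pairs should be zero except for a few, but dense embeddings lead to nonzero predictions for all feature pairs and can therefore over-generalize. Linear models with crossed features, on the other hand, can memorize these "exception rules" effectively with fewer model parameters.

Now let's see how to jointly train the wide and deep models so that each compensates for the other's weaknesses.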

5、Combining Wide and Deep Models into One

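The wide and deep models are combined by summing their final output log odds to form the prediction, which is then fed to a logistic loss function. All the graph definition and variable allocation is handled for you under the hood, so you only need to create a DNNLinearCombinedClassifier: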

import tempfile
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.DNNLinearCombinedClassifier(
model_dir = model_dir,
linear_feature_columns = wide_columns,
dnn_feature_columns = deep_columns,
dnn_hidden_units = [100,50]
)
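The way the two parts are combined, summing the log odds and applying a logistic loss, can be written out by hand. The sketch below uses made-up logit values and hypothetical function names; it only illustrates the arithmetic and is not the classifier's actual implementation.

```python
import math


def sigmoid(x):
    """Convert log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def combined_probability(wide_logit, deep_logit, bias=0.0):
    """Sum the log odds of the wide and deep parts, then squash to a probability."""
    return sigmoid(wide_logit + deep_logit + bias)


def logistic_loss(label, probability):
    """Cross-entropy loss for a single example with a 0/1 label."""
    return -(label * math.log(probability)
             + (1 - label) * math.log(1.0 - probability))


# Made-up outputs of the two parts for one example.
p = combined_probability(wide_logit=0.4, deep_logit=1.1)
print(p, logistic_loss(1, p))
```

Because the loss is computed on the summed output, its gradient flows into both parts at once, which is what joint training means here.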

6、Training and Evaluating The Model

Before training the model, we read in the Census dataset as we did in the TensorFlow Linear Model tutorial. The input data processing code is repeated here for convenience.

import pandas as pd
import urllib

# define the column names for the data sets.
COLUMNS = [
"age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race","gender",
"capital_gain","capital_loss","hours_per_week","native_country",
"income_bracket"
]
LABEL_COLUMN = 'label'
CATEGORICAL_COLUMNS = [
"workclass","education","marital_status","occupation",
"relationship","race","gender","native_country"
]
CONTINUOUS_COLUMNS = [
"age", "education_num","capital_gain","capital_loss",
"hours_per_week"
]

# download the training and test data to temporary files.
# alternatively, you can download them yourself and change
# train_file and test_file to your own paths.
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",train_file.name)
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",test_file.name)

# read the training and test data sets into pandas dataframe.
df_train = pd.read_csv(train_file,names=COLUMNS,skipinitialspace=True)
df_test = pd.read_csv(test_file,names=COLUMNS,skipinitialspace=True,skiprows=1)
# the label is 1 if the income bracket contains ">50K", otherwise 0.
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)

def input_fn(df):
    # creates a dictionary mapping from each continuous feature
    # column name (k) to the values of that column stored in a
    # constant tensor.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # creates a dictionary mapping from each categorical feature
    # column name (k) to the values of that column stored in a
    # tf.SparseTensor.
    categorical_cols = {
        k: tf.SparseTensor(
            indices=[[i, 0] for i in range(df[k].size)],
            values=df[k].values,
            dense_shape=[df[k].size, 1]
        )
        for k in CATEGORICAL_COLUMNS
    }
    # merges the two dictionaries into one.
    feature_cols = dict(continuous_cols.items() + categorical_cols.items())
    # converts the label column into a constant tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    # returns the feature columns and the label.
    return feature_cols, label

def train_input_fn():
    return input_fn(df_train)

def eval_input_fn():
    return input_fn(df_test)
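One detail of the label column is easy to get wrong, so here is a small standalone sketch of it. In the UCI files the income bracket is written with a capital K, and the values in adult.test carry a trailing period, which is why a substring test is used rather than an equality test. The helper name `to_label` and the sample rows are for illustration only.

```python
def to_label(income_bracket):
    """Return 1 if the income bracket string denotes more than 50K, else 0."""
    return int(">50K" in income_bracket)


# Values as they appear in adult.data and adult.test respectively.
train_values = ["<=50K", ">50K"]
test_values = ["<=50K.", ">50K."]

print([to_label(v) for v in train_values + test_values])

# A lowercase 'k' never matches, so every label would silently become 0.
print([int(">50k" in v) for v in train_values + test_values])
```

A model trained on all-zero labels would still report a plausible-looking accuracy on this dataset, since most rows are in the lower bracket, so this mistake is worth checking for explicitly.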

After reading in the data, you can train and evaluate the model.

m.fit(input_fn=train_input_fn, steps=200)
results = m.evaluate(input_fn=eval_input_fn, steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))

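The first line of the output should be the accuracy: 0.84429705. The accuracy improves from about 83.6% with the wide-only model to about 84.4% with the Wide & Deep model. If you would like to see an end-to-end example, you can download the example code.

Keep in mind that this tutorial is only a quick example on a small dataset, meant to familiarize you with the tf.learn API. Wide & Deep Learning becomes more powerful on a large dataset with many sparse feature columns that have a large number of possible feature values. See the research paper for more ideas on applying Wide & Deep Learning to real-world, large-scale machine learning problems.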

7、Actual run results

The complete code is as follows:

import tensorflow as tf

# step1: define base feature columns
# categorical base columns
gender = tf.contrib.layers.sparse_column_with_keys(
column_name = "gender",
keys = ["Female","Male"]
)
race = tf.contrib.layers.sparse_column_with_keys(
column_name = "race",
keys = ["Amer-Indian-Eskimo","Asian-Pac-Islander",
"Black","Other","White"]
)
education = tf.contrib.layers.sparse_column_with_hash_bucket("education",hash_bucket_size=1000)
relationship = tf.contrib.layers.sparse_column_with_hash_bucket("relationship",hash_bucket_size=100)
workclass = tf.contrib.layers.sparse_column_with_hash_bucket("workclass",hash_bucket_size=100)
occupation = tf.contrib.layers.sparse_column_with_hash_bucket("occupation",hash_bucket_size=1000)
native_country = tf.contrib.layers.sparse_column_with_hash_bucket("native_country",hash_bucket_size=1000)

# continuous base columns
age = tf.contrib.layers.real_valued_column("age")
age_buckets = tf.contrib.layers.bucketized_column(
age,
boundaries = [18,25,30,35,40,45,50,55,60,65]
)
education_num = tf.contrib.layers.real_valued_column("education_num")
capital_gain = tf.contrib.layers.real_valued_column("capital_gain")
capital_loss = tf.contrib.layers.real_valued_column("capital_loss")
hours_per_week = tf.contrib.layers.real_valued_column("hours_per_week")

# step2: define wide feature columns
wide_columns = [
gender, native_country, education, occupation, workclass,
relationship, age_buckets,
tf.contrib.layers.crossed_column([education,occupation],hash_bucket_size=int(1e4)),
tf.contrib.layers.crossed_column([native_country,occupation],hash_bucket_size=int(1e4)),
tf.contrib.layers.crossed_column([age_buckets,education,occupation],hash_bucket_size=int(1e6))
]

# step3: define deep feature columns
deep_columns = [
tf.contrib.layers.embedding_column(workclass,dimension=8),
tf.contrib.layers.embedding_column(education,dimension=8),
tf.contrib.layers.embedding_column(gender,dimension=8),
tf.contrib.layers.embedding_column(relationship,dimension=8),
tf.contrib.layers.embedding_column(native_country,dimension=8),
tf.contrib.layers.embedding_column(occupation,dimension=8),
age,education_num,capital_gain,capital_loss,hours_per_week
]

# step4: create a dnnlinearcombinedclassifier
import tempfile
model_dir = tempfile.mkdtemp() # create a temp path
m = tf.contrib.learn.DNNLinearCombinedClassifier(
model_dir = model_dir,
linear_feature_columns = wide_columns,
dnn_feature_columns = deep_columns,
dnn_hidden_units = [100,50]
)

# step5: process input_data
import pandas as pd
import urllib

# define the column names for the data sets.
COLUMNS = [
"age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race","gender",
"capital_gain","capital_loss","hours_per_week","native_country",
"income_bracket"
]
LABEL_COLUMN = 'label'
CATEGORICAL_COLUMNS = [
"workclass","education","marital_status","occupation",
"relationship","race","gender","native_country"
]
CONTINUOUS_COLUMNS = [
"age", "education_num","capital_gain","capital_loss",
"hours_per_week"
]

# download the training and test data to temporary files.
# alternatively, you can download them yourself and change
# train_file and test_file to your own paths.
train_file = tempfile.NamedTemporaryFile()
test_file = tempfile.NamedTemporaryFile()
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",train_file.name)
urllib.urlretrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",test_file.name)

# read the training and test data sets into pandas dataframe.
df_train = pd.read_csv(train_file,names=COLUMNS,skipinitialspace=True)
df_test = pd.read_csv(test_file,names=COLUMNS,skipinitialspace=True,skiprows=1)

# the label is 1 if the income bracket contains ">50K", otherwise 0.
df_train[LABEL_COLUMN] = (df_train['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)
df_test[LABEL_COLUMN] = (df_test['income_bracket'].apply(lambda x: '>50K' in x)).astype(int)

def input_fn(df):
    # creates a dictionary mapping from each continuous feature
    # column name (k) to the values of that column stored in a
    # constant tensor.
    continuous_cols = {k: tf.constant(df[k].values)
                       for k in CONTINUOUS_COLUMNS}
    # creates a dictionary mapping from each categorical feature
    # column name (k) to the values of that column stored in a
    # tf.SparseTensor.
    categorical_cols = {
        k: tf.SparseTensor(
            indices=[[i, 0] for i in range(df[k].size)],
            values=df[k].values,
            dense_shape=[df[k].size, 1]
        )
        for k in CATEGORICAL_COLUMNS
    }
    # merges the two dictionaries into one.
    feature_cols = dict(continuous_cols.items() + categorical_cols.items())
    # converts the label column into a constant tensor.
    label = tf.constant(df[LABEL_COLUMN].values)
    # returns the feature columns and the label.
    return feature_cols, label

def train_input_fn():
    return input_fn(df_train)

def eval_input_fn():
    return input_fn(df_test)

# step6: train and evaluate
m.fit(input_fn = train_input_fn,steps=200)
results = m.evaluate(input_fn=eval_input_fn,steps=1)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))

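As for the actual run: my computer was too slow and the run never finished. The code downloaded from the official site does run, but the version I typed in from the tutorial hangs, which is exhausting. Below are the results from the official code.

For now, here is the output produced by the code written from the tutorial: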

“/usr/local/lib/python2.7/dist-packages/pandas/core/computation/__init__.py:18: UserWarning: The installed version of numexpr 2.2.2 is not supported in pandas and will be not be used

The minimum supported version is 2.4.6

ver=ver, min_ver=_MIN_NUMEXPR_VERSION), UserWarning)

WARNING:tensorflow:The default stddev value of initializer will change from “1/sqrt(vocab_size)” to “1/sqrt(dimension)” after 2017/02/25.

WARNING:tensorflow:From wide_deep_train.py:58: calling __init__ (from tensorflow.contrib.learn.python.learn.estimators.dnn_linear_combined) with fix_global_step_increment_bug=False is deprecated and will be removed after 2017-04-15.

Instructions for updating:

Please set fix_global_step_increment_bug=True and update training steps in your pipeline. See pydoc for details.

WARNING:tensorflow:Rank of input Tensor (1) should be the same as output_rank (2) for column. Will attempt to expand dims. It is highly recommended that you resize your input, as this behavior may change.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py:95: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.

Instructions for updating:

Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py:96: histogram_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.

Instructions for updating:

Please switch to tf.summary.histogram. Note that tf.summary.histogram uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/feature_column.py:1861: calling sparse_feature_cross (from tensorflow.contrib.layers.python.ops.sparse_feature_cross_op) with hash_key=None is deprecated and will be removed after 2016-11-20.

Instructions for updating:

The default behavior of sparse_feature_cross is changing, the default

value for hash_key will change to SPARSE_FEATURE_CROSS_DEFAULT_HASH_KEY.

From that point on sparse_feature_cross will always use FingerprintCat64

to concatenate the feature fingerprints. And the underlying

_sparse_feature_cross_op.sparse_feature_cross operation will be marked

as deprecated.

WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/head.py:615: scalar_summary (from tensorflow.python.ops.logging_ops) is deprecated and will be removed after 2016-11-30.

Instructions for updating:

Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.

2017-06-01 15:36:31.892371: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn’t compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

Killed

”