
LSTM Recurrent Neural Networks: Basic Structure and a Walkthrough of the TensorFlow Example Model

2017-05-01 21:55, 1061 views

The excerpts below, taken from the article at the following link, give an introduction to LSTMs.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Understanding LSTM Networks

Essential to these successes is the use of “LSTMs”, a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.

The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky”, we don’t need any further context: it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France... I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as the gap grows, RNNs become unable to learn to connect the information.

In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter ..., who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don’t have this problem.

LSTM Networks

Long Short Term Memory networks, usually just called “LSTMs”, are a special kind of RNN, capable of learning long-term dependencies.

They were introduced by ... and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn! (In other words, LSTMs handle the long-term dependency problem well.)

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
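As a rough illustration of that repeating module, here is a minimal pure-Python sketch of a standard RNN whose module is a single tanh layer. The names rnn_step, W_xh, W_hh and b, the sizes, and the random weights are all assumptions made for this sketch; they come neither from the article nor from TensorFlow.

```python
import math
import random


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One step of a vanilla RNN: h_t = tanh(W_xh * x_t + W_hh * h_prev + b)."""
    hidden_size = len(b)
    h = []
    for j in range(hidden_size):
        total = b[j]
        total += sum(W_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(W_hh[j][k] * h_prev[k] for k in range(hidden_size))
        h.append(math.tanh(total))
    return h


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical helper: a rows x cols matrix of small random weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
input_size, hidden_size = 3, 4
W_xh = random_matrix(hidden_size, input_size, rng)
W_hh = random_matrix(hidden_size, hidden_size, rng)
b = [0.0] * hidden_size

# The same weights are applied at every time step: this is the
# "chain of repeating modules" unrolled over a three-step sequence.
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h = [0.0] * hidden_size
for x in sequence:
    h = rnn_step(x, h, W_xh, W_hh, b)

print(h)
```

The hidden state h is the only thing carried from one step to the next, which is why information from far back in the sequence has to survive many repeated tanh transformations.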

(The original post showed a schematic diagram here.)

Below, the LSTM model example from the official TensorFlow site is used to illustrate how the model is built.

Code link:

https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb

ptb_word_lm.py

Three model sizes are provided: small, medium, and large.

Parameter settings

init_scale the initial scale of the weights

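This parameter sets the range from which the initial values of the network's weight matrices are drawn.

It is passed to tf.random_uniform_initializer, whose first argument is minval (the lower bound of the range of random values to generate); the code uses -init_scale as minval and init_scale as maxval.

Its value shrinks as the model grows (the large configuration has the smallest init_scale).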
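To make the role of init_scale concrete, the following sketch draws a small weight matrix uniformly from [-init_scale, init_scale] using only the standard library. The helper uniform_init and the 4 x 4 shape are assumptions for illustration; the three scale values are the ones that appear in the small, medium, and large configurations of the script.

```python
import random


def uniform_init(rows, cols, init_scale, rng):
    """Hypothetical helper: sample a rows x cols matrix uniformly
    from the interval [-init_scale, init_scale]."""
    return [[rng.uniform(-init_scale, init_scale) for _ in range(cols)]
            for _ in range(rows)]


rng = random.Random(0)
init_scales = {"small": 0.1, "medium": 0.05, "large": 0.04}

for name, scale in init_scales.items():
    w = uniform_init(4, 4, scale, rng)
    largest = max(abs(v) for row in w for v in row)
    # No sampled weight can exceed the configured scale in magnitude.
    print(name, scale, largest <= scale)
```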

learning_rate

max_grad_norm the maximum permissible norm of the gradient
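The script applies this limit through tf.clip_by_global_norm. The sketch below is a plain-Python approximation of that idea, assuming the usual definition: the global norm is the L2 norm of all gradients taken together, and every gradient is rescaled by the same factor only when that norm exceeds the limit. The function name and the toy gradient values are made up for the example.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so that their joint L2 norm
    does not exceed max_norm. Returns (clipped_grads, original_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # When global_norm <= max_norm the scale is exactly 1.0 (no change).
    scale = max_norm / max(global_norm, max_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


# Two toy gradient vectors whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))

print(norm, clipped_norm)
```

Because all gradients share one scaling factor, the direction of the overall update is preserved and only its length is capped.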

num_layers the number of LSTM layers

This parameter determines how many cell instances are created for the list passed to the MultiRNNCell object, i.e. the number of stacked layers.

num_steps the number of unrolled steps of LSTM

hidden_size the number of LSTM units

For an explanation of hidden_size, see the following Stack Overflow link:
http://stackoverflow.com/questions/37901047/what-is-num-units-in-tensorflow-basiclstmcell
The number of hidden units is a direct representation of the learning capacity of a neural network--it reflects the number of learned parameters. The value was likely selected arbitrarily or empirically. You can change that value experimentally and rerun the program to see how it affects the training accuracy (you can get better than 90% test accuracy with a lot fewer hidden units). Using more units makes it more likely to perfectly memorize the complete training set (although it will take longer, and you run the risk of over-fitting).

The key thing to understand, which is somewhat subtle, is that x is an array of data (a tensor)--it is not meant to be a scalar value. Where, for example, the tanh function is shown, it is meant to imply that the function is broadcast across the entire array (an implicit for loop)--and not simply performed once per time-step.

This is still not entirely clear; see the sentence quoted from the link below.

https://www.quora.com/What-is-the-meaning-of-%E2%80%9CThe-number-of-units-in-the-LSTM-cell

Tensorflow’s num_units is the size of the LSTM’s hidden state(which is also the size of the output if no projection is used)

That is, if no projection (dimensionality reduction) is applied, it is also the dimensionality of the output.
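The point about num_units can be checked with a small sketch of a single LSTM step. This uses a common textbook formulation of the LSTM (input, forget, and output gates plus a tanh candidate) written in plain Python; it is not the TensorFlow implementation, and lstm_step, make_params, and the sizes are assumptions for this example. Whatever the input size, the hidden state h and the cell state c both have num_units entries.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W * v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params, forget_bias=0.0):
    """One LSTM step; returns the new hidden state h and cell state c."""
    xh = x + h_prev  # concatenate the input with the previous hidden state
    i = [sigmoid(z) for z in affine(params["W_i"], xh, params["b_i"])]
    f = [sigmoid(z + forget_bias) for z in affine(params["W_f"], xh, params["b_f"])]
    o = [sigmoid(z) for z in affine(params["W_o"], xh, params["b_o"])]
    g = [math.tanh(z) for z in affine(params["W_g"], xh, params["b_g"])]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def make_params(input_size, num_units, rng, scale=0.1):
    """Hypothetical helper: random weights and zero biases for the four gates."""
    params = {}
    for gate in "ifog":
        params["W_" + gate] = [
            [rng.uniform(-scale, scale) for _ in range(input_size + num_units)]
            for _ in range(num_units)
        ]
        params["b_" + gate] = [0.0] * num_units
    return params


rng = random.Random(0)
input_size, num_units = 3, 5
params = make_params(input_size, num_units, rng)
h, c = lstm_step([1.0, 0.5, -0.5], [0.0] * num_units, [0.0] * num_units, params)

# Both states have num_units entries, independent of input_size.
print(len(h), len(c))
```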

max_epoch the number of epochs trained with the initial learning rate.

The word "initial" suggests that the learning rate may change during training. TensorFlow recommends a decaying learning rate (when training a model, it is often recommended to lower the learning rate as the training progresses).

(See the lr_decay parameter in the configuration classes.)

max_max_epoch the total number of epochs for training.
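How max_epoch, max_max_epoch, and lr_decay interact is easiest to see by listing the learning rate for every epoch. The sketch below reuses the expression from the script's training loop, lr_decay ** max(i + 1 - max_epoch, 0), with the small configuration's values (learning_rate 1.0, lr_decay 0.5, max_epoch 4, max_max_epoch 13). The function name learning_rate_schedule is made up for this example.

```python
def learning_rate_schedule(learning_rate, lr_decay, max_epoch, max_max_epoch):
    """Return the learning rate used in each of the max_max_epoch epochs."""
    rates = []
    for i in range(max_max_epoch):
        # No decay for the first max_epoch epochs, then one extra
        # factor of lr_decay for every further epoch.
        decay = lr_decay ** max(i + 1 - max_epoch, 0)
        rates.append(learning_rate * decay)
    return rates


rates = learning_rate_schedule(
    learning_rate=1.0, lr_decay=0.5, max_epoch=4, max_max_epoch=13)

for epoch, rate in enumerate(rates, start=1):
    print("Epoch %d learning rate %.6f" % (epoch, rate))
```

With these values the rate stays at 1.0 for epochs 1 to 4 and is halved in every epoch after that.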

keep_prob the probability of keeping weights in the dropout layer

Dropout is used to prevent overfitting. The idea comes from

https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

A random subset of neurons is dropped during training (the original post showed a schematic of this here).
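A minimal sketch of dropout with a keep probability, written in plain Python. It assumes the "inverted" variant in which surviving values are divided by keep_prob so that the expected activation is unchanged; the function name, the seed, and the toy activations are assumptions for this example.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and rescale survivors
    by 1 / keep_prob; dropped values become 0.0."""
    out = []
    for v in values:
        if rng.random() < keep_prob:
            out.append(v / keep_prob)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, keep_prob=0.5, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / float(len(dropped))
mean_after = sum(dropped) / len(dropped)

# Roughly half of the units survive, and the mean stays close to 1.0.
print(kept_fraction, mean_after)
```

With keep_prob = 1.0, as in the small configuration, every unit is kept and the layer is left unchanged.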

# coding: utf-8
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import inspect
# inspect.getargspec returns the names and default values of a function's arguments;
# it is used in the lstm_cell closure defined below.

import time
import numpy as np
import tensorflow as tf

import reader
# Helper module from the same tutorial directory; it reads the PTB data files.

flags = tf.flags
# Flags are Google's way of handling command-line parameters, similar to Python's
# argparse module.

logging = tf.logging

flags.DEFINE_string("model", "small", "A type of model. Possible options are: small, medium, large")
flags.DEFINE_string("data_path", None, "Where the training/test data is stored.")
flags.DEFINE_string("save_path", None, "Model output directory.")

flags.DEFINE_bool("use_fp16", False, "Train using 16-bit floats instead of 32-bit floats")
# This flag controls the floating-point precision.
# Each flag has a name, a default value, and a docstring.

FLAGS = flags.FLAGS
# Global container and accessor for flags and their values.

def data_type():
    return tf.float16 if FLAGS.use_fp16 else tf.float32

class PTBinput(object):
    """The input data."""

    def __init__(self, config, data, name=None):
        self.batch_size = batch_size = config.batch_size
        self.num_steps = num_steps = config.num_steps
        self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
        self.input_data, self.targets = reader.ptb_producer(data, batch_size, num_steps, name=name)
        # reader.ptb_producer returns a pair of Tensors, each shaped [batch_size, num_steps].
        # The second element of the tuple is the same data time-shifted to the right by one.

class PTBModel(object):
    """The PTB Model."""

    def __init__(self, is_training, config, input_):
        self._input = input_

        batch_size = input_.batch_size
        num_steps = input_.num_steps
        size = config.hidden_size
        vocab_size = config.vocab_size

        # Slightly better results can be obtained with forget gate biases
        # (possibly the constant term of the linear transformation inside the sigmoid)
        # initialized to 1, but the hyperparameters of the model would need to be
        # different than reported in the paper.
        def lstm_cell():
            # With the latest TensorFlow source code, BasicLSTMCell needs a reuse parameter,
            # which is unfortunately not defined in TensorFlow 1.0. To maintain backwards
            # compatibility, we add an argument check here:
            if 'reuse' in inspect.getargspec(tf.contrib.rnn.BasicLSTMCell.__init__).args:
                return tf.contrib.rnn.BasicLSTMCell(
                    size, forget_bias=0.0, state_is_tuple=True,
                    reuse=tf.get_variable_scope().reuse)
            else:
                return tf.contrib.rnn.BasicLSTMCell(size, forget_bias=0.0, state_is_tuple=True)
        # The if-else block above relates to the concept of variable scope in TensorFlow,
        # which is loosely analogous to a namespace in C++.
        # See https://www.tensorflow.org/programmers_guide/variable_scope
        # Variable scope was introduced to solve the problem of sharing variables (a problem that
        # also exists in Theano), which comes up often in CNNs when sharing weights or filters.
        # It is lighter-weight than defining a manager class or a dict: inside a scope,
        # tf.get_variable either creates a new variable (reuse defaults to False) or reuses one.
        # The code further down calls reuse_variables() inside a variable scope opened
        # with a "with" statement.

        attn_cell = lstm_cell
        if is_training and config.keep_prob < 1:
            def attn_cell():
                return tf.contrib.rnn.DropoutWrapper(lstm_cell(), output_keep_prob=config.keep_prob)
            # Apply dropout to the cell's outputs.

        cell = tf.contrib.rnn.MultiRNNCell([attn_cell() for _ in range(config.num_layers)], state_is_tuple=True)
        # With state_is_tuple=True the MultiRNNCell state is a tuple of per-layer states
        # rather than a single tensor concatenated along the column axis.

        self._initial_state = cell.zero_state(batch_size, data_type())

        with tf.device("/cpu:0"):
            embedding = tf.get_variable("embedding", [vocab_size, size], dtype=data_type())
            inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
        # For get_variable, see the notes on variable scope above.
        # tf.device pins the operations inside the "with" block to the specified device.
        # The embedding matrix gives every vocabulary word its own randomly initialized vector;
        # tf.nn.embedding_lookup returns the rows of that matrix for the given word ids.
        # See http://stackoverflow.com/questions/34870614/what-does-tf-nn-embedding-lookup-function-do
        # The use of the embedding here may be trivial, but in word2vec it carries the real sense
        # of embedding words into a dense vector space where words with similar meanings lie
        # close together, which may be analogous to factor analysis.
        # See http://stackoverflow.com/questions/40184537/what-does-embedding-do-in-tensorflow

        if is_training and config.keep_prob < 1:
            inputs = tf.nn.dropout(inputs, config.keep_prob)

        # Simplified version of models/tutorials/rnn/rnn.py's rnn().
        # This builds an unrolled LSTM for tutorial purposes only.
        # In general, use the rnn() or state_saving_rnn() from rnn.py.
        #
        # The alternative version of the code below is:
        #
        # inputs = tf.unstack(inputs, num=num_steps, axis=1)
        # outputs, state = tf.contrib.rnn.static_rnn(cell, inputs, initial_state=self._initial_state)
        # tf.unstack above splits the tensor along the given axis into a list.

        outputs = []
        state = self._initial_state
        with tf.variable_scope("RNN"):
            for time_step in range(num_steps):
                if time_step > 0:
                    tf.get_variable_scope().reuse_variables()
                (cell_output, state) = cell(inputs[:, time_step, :], state)
                outputs.append(cell_output)

        output = tf.reshape(tf.concat(axis=1, values=outputs), [-1, size])
        # Basic array operations such as reshape and concat work much like their NumPy counterparts.
        softmax_w = tf.get_variable("softmax_w", [size, vocab_size], dtype=data_type())
        softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
        logits = tf.matmul(output, softmax_w) + softmax_b
        # matmul multiplies matrices.
        loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
            [logits],
            [tf.reshape(input_.targets, [-1])],
            [tf.ones([batch_size * num_steps], dtype=data_type())])
        # sequence_loss_by_example is the weighted cross-entropy loss for a sequence of logits (per example).
        # Its form is described at https://hit-scir.gitbooks.io/neural-networks-and-deep-learning-zh_cn/content/chap3/c3s1.html
        # It resembles the maximum-likelihood objective for a 0-1 (Bernoulli) distribution.
        # Equal weights are used for all examples here.

        self._cost = cost = tf.reduce_sum(loss) / batch_size
        self._final_state = state

        if not is_training:
            return

        # Everything above builds the loss;
        # what follows sets up the gradient-based optimization.

        self._lr = tf.Variable(0.0, trainable=False)
        # This variable is used as the learning_rate of GradientDescentOptimizer.
        # As discussed at http://stackoverflow.com/questions/33919948/how-to-set-adaptive-learning-rate-for-gradientdescentoptimizer
        # GradientDescentOptimizer is usually given a constant learning_rate that is used for all steps.
        # Using a variable instead lets the learning rate be updated during training,
        # which is what the lr_decay parameter mentioned above requires.

        tvars = tf.trainable_variables()
        # tvars is a list of all variables created with trainable=True.

        grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), config.max_grad_norm)
        optimizer = tf.train.GradientDescentOptimizer(self._lr)
        self._train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            global_step=tf.contrib.framework.get_or_create_global_step())
        # global_step in apply_gradients is an optional Variable that is incremented by one
        # after the variables have been updated.

        self._new_lr = tf.placeholder(tf.float32, shape=[], name="new_learning_rate")
        # tf.placeholder inserts a placeholder for a tensor that will always be fed;
        # its value must be supplied through the feed_dict optional argument to Session.run().
        # It is a little like a final variable in Java, which must be initialized or an error results.
        # The Session object is the basic execution unit in TensorFlow (mostly through its run
        # method); it is worth understanding first.

        self._lr_update = tf.assign(self._lr, self._new_lr)

    def assign_lr(self, session, lr_value):
        session.run(self._lr_update, feed_dict={self._new_lr: lr_value})

    @property
    def input(self):
        return self._input

    @property
    def initial_state(self):
        return self._initial_state

    @property
    def cost(self):
        return self._cost

    @property
    def final_state(self):
        return self._final_state

    @property
    def lr(self):
        return self._lr

    @property
    def train_op(self):
        return self._train_op

# The following classes define the model configurations of different sizes.
class SmallConfig(object):
    """Small config."""
    init_scale = 0.1
    learning_rate = 1.0
    max_grad_norm = 5
    num_layers = 2
    num_steps = 20
    hidden_size = 200
    max_epoch = 4
    max_max_epoch = 13
    keep_prob = 1.0
    lr_decay = 0.5
    batch_size = 20
    vocab_size = 10000

class MediumConfig(object):
    """Medium config."""
    init_scale = 0.05
    learning_rate = 1.0
    max_grad_norm = 5
    num_layers = 2
    num_steps = 35
    hidden_size = 650
    max_epoch = 6
    max_max_epoch = 39
    keep_prob = 0.5
    lr_decay = 0.8
    batch_size = 20
    vocab_size = 10000

class LargeConfig(object):
    """Large config."""
    init_scale = 0.04
    learning_rate = 1.0
    max_grad_norm = 10
    num_layers = 2
    num_steps = 35
    hidden_size = 1500
    max_epoch = 14
    max_max_epoch = 55
    keep_prob = 0.35
    lr_decay = 1 / 1.15
    batch_size = 20
    vocab_size = 10000

class TestConfig(object):
    """Tiny config, for testing."""
    init_scale = 0.1
    learning_rate = 1.0
    max_grad_norm = 1
    num_layers = 1
    num_steps = 2
    hidden_size = 2
    max_epoch = 1
    max_max_epoch = 1
    keep_prob = 1.0
    lr_decay = 0.5
    batch_size = 20
    vocab_size = 10000

def run_epoch(session, model, eval_op=None, verbose=False):
    """Run the model on the given data."""
    # "op" is short for operation. In main(), eval_op is the train_op defined above,
    # which performs the gradient-descent update.
    start_time = time.time()
    costs = 0.0
    iters = 0
    state = session.run(model.initial_state)

    # Session.run takes fetches as its first argument and returns values in the same structure.
    # fetches may be tf.Tensor objects, containers of Tensors, and so on.

    fetches = {
        "cost": model.cost,
        "final_state": model.final_state,
    }
    if eval_op is not None:
        fetches["eval_op"] = eval_op

    for step in range(model.input.epoch_size):
        # Feed the final state of the previous batch in as the initial state of this one.
        feed_dict = {}
        for i, (c, h) in enumerate(model.initial_state):
            feed_dict[c] = state[i].c
            feed_dict[h] = state[i].h

        vals = session.run(fetches, feed_dict)
        cost = vals["cost"]
        state = vals["final_state"]

        costs += cost
        iters += model.input.num_steps

        if verbose and step % (model.input.epoch_size // 10) == 10:
            print("%.3f perplexity: %.3f speed: %.0f wps" %
                  (step * 1.0 / model.input.epoch_size, np.exp(costs / iters),
                   iters * model.input.batch_size / (time.time() - start_time)))

    return np.exp(costs / iters)

def get_config():
    if FLAGS.model == "small":
        return SmallConfig()
    elif FLAGS.model == "medium":
        return MediumConfig()
    elif FLAGS.model == "large":
        return LargeConfig()
    elif FLAGS.model == "test":
        return TestConfig()

def main(_):
    if not FLAGS.data_path:
        raise ValueError("Must set --data_path to PTB data directory")

    raw_data = reader.ptb_raw_data(FLAGS.data_path)
    train_data, valid_data, test_data, _ = raw_data
    # Split the raw data into training, validation, and test sets.

    config = get_config()

    eval_config = get_config()
    eval_config.batch_size = 1
    eval_config.num_steps = 1
    # eval_config is used for the test model.

    with tf.Graph().as_default():
        initializer = tf.random_uniform_initializer(-config.init_scale, config.init_scale)

        with tf.name_scope("Train"):
            train_input = PTBinput(config=config, data=train_data, name="TrainInput")
            with tf.variable_scope("Model", reuse=None, initializer=initializer):
                m = PTBModel(is_training=True, config=config, input_=train_input)
            tf.summary.scalar("Training Loss", m.cost)

        with tf.name_scope("Valid"):
            valid_input = PTBinput(config=config, data=valid_data, name="ValidInput")
            with tf.variable_scope("Model", reuse=True, initializer=initializer):
                mvalid = PTBModel(is_training=False, config=config, input_=valid_input)
            tf.summary.scalar("Validation Loss", mvalid.cost)

        with tf.name_scope("Test"):
            test_input = PTBinput(config=eval_config, data=test_data, name="TestInput")
            with tf.variable_scope("Model", reuse=True, initializer=initializer):
                mtest = PTBModel(is_training=False, config=eval_config,
                                 input_=test_input)

        sv = tf.train.Supervisor(logdir=FLAGS.save_path)
        with sv.managed_session() as session:
            for i in range(config.max_max_epoch):
                lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
                m.assign_lr(session, config.learning_rate * lr_decay)

                print("Epoch: %d Learning rate: %.3f" % (i + 1, session.run(m.lr)))
                train_perplexity = run_epoch(session, m, eval_op=m.train_op,
                                             verbose=True)
                print("Epoch: %d Train Perplexity: %.3f" % (i + 1, train_perplexity))
                valid_perplexity = run_epoch(session, mvalid)
                print("Epoch: %d Valid Perplexity: %.3f" % (i + 1, valid_perplexity))

            test_perplexity = run_epoch(session, mtest)
            print("Test Perplexity: %.3f" % test_perplexity)

            if FLAGS.save_path:
                print("Saving model to %s." % FLAGS.save_path)
                sv.saver.save(session, FLAGS.save_path, global_step=sv.global_step)

if __name__ == "__main__":
    tf.app.run()
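The number that run_epoch reports is a perplexity: the exponential of the accumulated cost divided by the number of time steps processed. The sketch below repeats that bookkeeping in plain Python, assuming (as in the model above) that each batch cost is the per-word cross-entropy summed over num_steps and averaged over the batch. The function name and the toy costs are made up for this example.

```python
import math


def perplexity(batch_costs, num_steps):
    """Mirror run_epoch's bookkeeping: accumulate the batch costs,
    count num_steps per batch, and exponentiate the average."""
    costs = 0.0
    iters = 0
    for cost in batch_costs:
        costs += cost
        iters += num_steps
    return math.exp(costs / iters)


# A model that spreads probability evenly over a 10000-word vocabulary
# pays log(10000) per word, so each 20-step batch costs 20 * log(10000).
vocab_size = 10000
num_steps = 20
uniform_costs = [num_steps * math.log(vocab_size)] * 3

result = perplexity(uniform_costs, num_steps)
print(result)
```

For this uniform model the perplexity comes out at roughly the vocabulary size, which gives a useful reference point: a trained model should score well below it.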
