LSTM Recurrent Neural Networks: Basic Structure and a TensorFlow Example Model
2017-05-01 21:55
The excerpts below, taken from the article linked here, give an overview of LSTMs.
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Understanding LSTM Networks
Essential to these successes is the use of “LSTMs”, a very special kind of recurrent neural network which works, for many tasks, much much better than the standard version. Almost all exciting results based on recurrent neural networks are achieved with them. It’s these LSTMs that this essay will explore.
The problem of Long-Term Dependencies
One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.
Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky”, we don’t need any further context; it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.
But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France... I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.
Unfortunately, as the gap grows, RNNs become unable to learn to connect the information.
In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter ...., who found some pretty fundamental reasons why it might be difficult.
Thankfully, LSTMs don’t have this problem.
LSTM Networks
Long Short Term Memory networks, usually just called “LSTMs”, are a special kind of RNN, capable of learning long-term dependencies.
They were introduced by ... and were refined and popularized by many people in following work. They work tremendously well on a large variety of problems, and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn! (In other words, LSTMs handle the long-term dependency problem well.)
All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.
[Figure: the repeating module of a standard RNN, containing a single tanh layer.]
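As a rough illustration of the "single tanh layer" repeating module described above, here is a minimal pure-Python sketch of a vanilla RNN. The function names, weight names (w_xh, w_hh, b) and toy sizes are assumptions made for this example and do not come from the article or from TensorFlow.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One step of a vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    hidden_size = len(b)
    h = []
    for j in range(hidden_size):
        total = b[j]
        total += sum(w_xh[j][i] * x[i] for i in range(len(x)))
        total += sum(w_hh[j][k] * h_prev[k] for k in range(hidden_size))
        h.append(math.tanh(total))
    return h


def run_rnn(inputs, h0, w_xh, w_hh, b):
    """Apply the same module (same weights) at every time step of the chain."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states


# Hypothetical toy sizes: 2 input features, 3 hidden units, 2 time steps.
w_xh = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
w_hh = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
b = [0.0, 0.0, 0.0]
states = run_rnn([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0, 0.0], w_xh, w_hh, b)
```

The hidden state h is the only thing carried from one step to the next, which is why information from far back has to survive many repeated tanh transformations.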
The rest of this post uses the LSTM example model from the TensorFlow website to walk through how such a model is built.
Code link:
https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb
ptb_word_lm.py
The script defines three model sizes: small, medium, and large.
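Parameter settings
init_scale the initial scale of the weights
This parameter sets the range from which the network's initial weight matrices are drawn.
It is passed to tf.random_uniform_initializer: -init_scale is the first argument, minval ("Lower bound of the range of random values to generate"), and init_scale is the upper bound.
It decreases as the model gets larger (the large config has the smallest init_scale).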
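To make the role of init_scale concrete, the following sketch draws a small weight matrix uniformly from [-init_scale, init_scale] using only the standard library. The helper uniform_init and the 4x3 matrix shape are assumptions for illustration; the three init_scale values are the ones in the config classes of the script below.

```python
import random


def uniform_init(rows, cols, init_scale, seed=0):
    """Draw a rows x cols matrix uniformly from [-init_scale, init_scale]."""
    rng = random.Random(seed)
    return [[rng.uniform(-init_scale, init_scale) for _ in range(cols)]
            for _ in range(rows)]


# init_scale values from SmallConfig, MediumConfig and LargeConfig.
init_scales = {"small": 0.1, "medium": 0.05, "large": 0.04}
weights = {name: uniform_init(4, 3, scale) for name, scale in init_scales.items()}

for name, matrix in weights.items():
    largest = max(abs(w) for row in matrix for w in row)
    print(name, "largest initial weight magnitude:", largest)
```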
learning_rate
max_grad_norm the maximum permissible norm of the gradient
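The script applies max_grad_norm through tf.clip_by_global_norm. A minimal pure-Python sketch of that rule is shown here, treating each gradient as a flat list of floats; the function below is a stand-in written for this example, not the TensorFlow implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # When the global norm is already within the limit the scale is exactly 1.
    scale = max_norm / max(global_norm, max_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


# Two hypothetical gradient "tensors" whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its length is limited.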
num_layers the number of LSTM layers
This parameter determines how many times the cell is instantiated in the list passed to MultiRNNCell, i.e. the number of stacked layers.
num_steps the number of unrolled steps of LSTM
hidden_size the number of LSTM units
For an explanation of hidden_size, see the following Stack Overflow link:
http://stackoverflow.com/questions/37901047/what-is-num-units-in-tensorflow-basiclstmcell
The number of hidden units is a direct representation of the learning capacity of a neural network; it reflects the number of learned parameters. The value was likely selected arbitrarily or empirically. You can change that value experimentally and rerun the program to see how it affects the training accuracy (you can get better than 90% test accuracy with a lot fewer hidden units). Using more units makes it more likely to perfectly memorize the complete training set (although it will take longer, and you run the risk of over-fitting).
The key thing to understand, which is somewhat subtle, is that x is an array of data (a tensor); it is not meant to be a scalar value. Where, for example, the tanh function is shown, it is meant to imply that the function is broadcast across the entire array (an implicit for loop), and not simply performed once per time-step.
This is still not entirely clear; see the following sentence from the link below:
https://www.quora.com/What-is-the-meaning-of-%E2%80%9CThe-number-of-units-in-the-LSTM-cell
TensorFlow’s num_units is the size of the LSTM’s hidden state (which is also the size of the output if no projection is used).
That is, without a projection (dimensionality-reducing) transform, it is also the dimensionality of the output.
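To see how hidden_size "reflects the number of learned parameters", the sketch below counts the weights of one LSTM layer. It assumes a standard LSTM cell without peepholes or projection, where the input, forget and output gates and the candidate update each have a weight matrix over the concatenated [input, hidden] vector plus a bias. The function name is made up for this example; the layer input size equals hidden_size here because the script's embedding has shape [vocab_size, hidden_size].

```python
def lstm_param_count(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 blocks of (input + hidden) x hidden weights plus biases."""
    return 4 * hidden_size * (input_size + hidden_size + 1)


# hidden_size values from SmallConfig (200) and LargeConfig (1500).
small_params = lstm_param_count(200, 200)
large_params = lstm_param_count(1500, 1500)
print("small:", small_params, "parameters per layer")
print("large:", large_params, "parameters per layer")
```

The count grows roughly with the square of hidden_size, so the large model has far more capacity (and is far easier to over-fit) than the small one.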
max_epoch the number of epochs trained with the initial learning rate.
The word "initial" suggests that the learning rate changes during training. TensorFlow recommends a decaying learning rate: "when training a model, it is often recommended to lower the learning rate as the training progresses."
(See the lr_decay parameter in the code below.)
max_max_epoch the total number of epochs for training.
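How max_epoch, max_max_epoch and lr_decay interact can be sketched with the same expression the script's training loop uses, lr_decay ** max(i + 1 - max_epoch, 0). The wrapper function name is an assumption for this example; the numbers are the SmallConfig values.

```python
def learning_rate_for_epoch(epoch, learning_rate, lr_decay, max_epoch):
    """Learning rate used in a 0-based epoch: constant at first, then decayed geometrically."""
    decay = lr_decay ** max(epoch + 1 - max_epoch, 0)
    return learning_rate * decay


# SmallConfig: learning_rate = 1.0, lr_decay = 0.5, max_epoch = 4, max_max_epoch = 13.
schedule = [learning_rate_for_epoch(i, 1.0, 0.5, 4) for i in range(13)]
for i, lr in enumerate(schedule):
    print("Epoch: %d Learning rate: %.3f" % (i + 1, lr))
```

The first max_epoch epochs run at the initial learning rate, and each later epoch multiplies it by lr_decay until max_max_epoch epochs have been trained.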
keep_prob the probability of keeping weights in the dropout layer
Dropout is used to prevent overfitting. The idea comes from
https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
It randomly drops a number of neurons during training.
[Figure: schematic of dropout randomly removing units.]
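A minimal sketch of what keep_prob means is given here, assuming the "inverted dropout" convention in which kept values are scaled by 1 / keep_prob so that the expected value is unchanged (tf.nn.dropout, used in the script below, scales kept elements this way). The helper is pure Python and written for this example only.

```python
import random


def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob and rescale it; otherwise output 0."""
    out = []
    for v in values:
        if rng.random() < keep_prob:
            out.append(v / keep_prob)
        else:
            out.append(0.0)
    return out


rng = random.Random(0)
activations = [1.0] * 10000          # hypothetical layer outputs
dropped = dropout(activations, keep_prob=0.5, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_value = sum(dropped) / len(dropped)
print("fraction kept:", kept_fraction)
print("mean after dropout:", mean_value)
```

With keep_prob = 1.0 nothing is dropped, which is why the script only adds dropout when is_training is true and config.keep_prob < 1.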
# coding: utf-8
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import inspect  # inspect.getargspec lists a function's argument names; used inside lstm_cell below
import time

import numpy as np
import tensorflow as tf

import reader  # reader.py from the same tutorial directory; reads and batches the PTB files

flags = tf.flags  # Google's way of handling command-line parameters, similar to Python's argparse module
logging = tf.logging

# Each flag has a name, a default value and a docstring.
flags.DEFINE_string("model", "small", "A type of model. Possible options are: small, medium, large.")
flags.DEFINE_string("data_path", None, "Where the training/test data is stored.")
flags.DEFINE_string("save_path", None, "Model output directory.")
flags.DEFINE_bool("use_fp16", False, "Train using 16-bit floats instead of 32bit floats")  # controls numeric precision
FLAGS = flags.FLAGS  # global container and accessor for flags and their values


def data_type():
    return tf.float16 if FLAGS.use_fp16 else tf.float32


class PTBinput(object):
    """The input data."""

    def __init__(self, config, data, name=None):
        self.batch_size = batch_size = config.batch_size
        self.num_steps = num_steps = config.num_steps
        self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
        # reader.ptb_producer returns a pair of Tensors, each shaped [batch_size, num_steps].
        # The second element of the tuple is the same data time-shifted to the right by one.
        self.input_data, self.targets = reader.ptb_producer(data, batch_size, num_steps, name=name)


class PTBModel(object):
    """The PTB Model."""

    def __init__(self, is_training, config, input_):
        self._input = input_

        batch_size = input_.batch_size
        num_steps = input_.num_steps
        size = config.hidden_size
        vocab_size = config.vocab_size

        # Slightly better results can be obtained with forget gate biases
        # initialized to 1 but the hyperparameters of the model would need to be
        # different than reported in the paper.
        # (The forget gate bias is presumably the constant term of the linear
        # transformation inside the gate's sigmoid.)
        def lstm_cell():
            # With the latest TensorFlow source code, BasicLSTMCell needs a reuse
            # parameter, which is not defined in TensorFlow 1.0. To maintain
            # backwards compatibility, we add an argument check here:
            if 'reuse' in inspect.getargspec(tf.contrib.rnn.BasicLSTMCell.__init__).args:
                return tf.contrib.rnn.BasicLSTMCell(size, forget_bias=0.0, state_is_tuple=True,
                                                    reuse=tf.get_variable_scope().reuse)
            else:
                return tf.contrib.rnn.BasicLSTMCell(size, forget_bias=0.0, state_is_tuple=True)
        # The if/else block above relates to TensorFlow's concept of variable scope,
        # which is loosely analogous to namespaces in C++.
        # See https://www.tensorflow.org/programmers_guide/variable_scope
        # Variable scope was introduced to solve the problem of sharing variables
        # (which also exists in Theano), as is often needed in CNNs for sharing weights or filters.
        # It is lighter-weight than defining a manager class or dict: tf.get_variable
        # either creates a new variable (reuse defaults to False) or reuses an existing one.
        # The RNN loop further down calls reuse_variables() inside a variable scope
        # opened with a with statement.

        attn_cell = lstm_cell
        if is_training and config.keep_prob < 1:
            def attn_cell():
                # Apply dropout to the cell output.
                return tf.contrib.rnn.DropoutWrapper(lstm_cell(), output_keep_prob=config.keep_prob)
        cell = tf.contrib.rnn.MultiRNNCell([attn_cell() for _ in range(config.num_layers)], state_is_tuple=True)
        # With state_is_tuple=True the MultiRNNCell state is a tuple with one entry per layer
        # rather than a single tensor concatenated along the column axis.

        self._initial_state = cell.zero_state(batch_size, data_type())

        with tf.device("/cpu:0"):
            embedding = tf.get_variable("embedding", [vocab_size, size], dtype=data_type())
            inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
        # get_variable is explained in the variable scope notes above.
        # tf.device assigns the operations inside the with block to the specified device.
        # The embedding matrix gives every vocabulary word its own randomly initialized vector,
        # and tf.nn.embedding_lookup fetches the rows of that matrix for the given word ids.
        # See http://stackoverflow.com/questions/34870614/what-does-tf-nn-embedding-lookup-function-do
        # The use of the embedding here may be trivial, but in word2vec the embedding has
        # real meaning: words are mapped into a dense vector space in which words with
        # similar meanings are close together, which may be analogous to factor analysis.
        # See http://stackoverflow.com/questions/40184537/what-does-embedding-do-in-tensorflow

        if is_training and config.keep_prob < 1:
            inputs = tf.nn.dropout(inputs, config.keep_prob)

        # Simplified version of models/tutorials/rnn/rnn.py's rnn().
        # This builds an unrolled LSTM for tutorial purposes only.
        # In general, use the rnn() or state_saving_rnn() from rnn.py.
        #
        # The alternative version of the code below is:
        #
        # inputs = tf.unstack(inputs, num=num_steps, axis=1)
        # outputs, state = tf.contrib.rnn.static_rnn(cell, inputs, initial_state=self._initial_state)
        # (tf.unstack splits the tensor along the given axis into a list.)
        outputs = []
        state = self._initial_state
        with tf.variable_scope("RNN"):
            for time_step in range(num_steps):
                if time_step > 0:
                    tf.get_variable_scope().reuse_variables()
                (cell_output, state) = cell(inputs[:, time_step, :], state)
                outputs.append(cell_output)

        # reshape and concat are basic NumPy-like operations and are not explained here.
        output = tf.reshape(tf.concat(axis=1, values=outputs), [-1, size])
        softmax_w = tf.get_variable("softmax_w", [size, vocab_size], dtype=data_type())
        softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
        logits = tf.matmul(output, softmax_w) + softmax_b  # matmul multiplies matrices
        loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
            [logits],
            [tf.reshape(input_.targets, [-1])],
            [tf.ones([batch_size * num_steps], dtype=data_type())])
        # sequence_loss_by_example is the weighted cross-entropy loss for a sequence of logits (per example).
        # Its form is described at
        # https://hit-scir.gitbooks.io/neural-networks-and-deep-learning-zh_cn/content/chap3/c3s1.html
        # and is similar to the MLE of a 0-1 distribution. Every example gets equal weight here.
        self._cost = cost = tf.reduce_sum(loss) / batch_size
        self._final_state = state

        if not is_training:
            return

        # The code above constructs the loss; the code below minimizes it by gradient descent.
        self._lr = tf.Variable(0.0, trainable=False)
        # This variable is used as the learning_rate of GradientDescentOptimizer.
        # As discussed at
        # http://stackoverflow.com/questions/33919948/how-to-set-adaptive-learning-rate-for-gradientdescentoptimizer
        # GradientDescentOptimizer is usually given a constant learning_rate that is used for all steps.
        # Using a variable instead lets the learning rate be updated, as mentioned for lr_decay above.
        tvars = tf.trainable_variables()  # list of all variables created with trainable=True
        grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), config.max_grad_norm)
        optimizer = tf.train.GradientDescentOptimizer(self._lr)
        self._train_op = optimizer.apply_gradients(
            zip(grads, tvars),
            global_step=tf.contrib.framework.get_or_create_global_step())
        # global_step is an optional Variable that apply_gradients increments by one
        # after the variables have been updated.

        self._new_lr = tf.placeholder(tf.float32, shape=[], name="new_learning_rate")
        # tf.placeholder inserts a placeholder for a tensor that will always be fed;
        # its value must be supplied through the feed_dict optional argument of Session.run().
        # Compare a final variable in Java, which must be initialized or an error results.
        # The Session object is the basic execution unit in TensorFlow (mostly through its
        # run method); it is worth understanding first.
        self._lr_update = tf.assign(self._lr, self._new_lr)

    def assign_lr(self, session, lr_value):
        session.run(self._lr_update, feed_dict={self._new_lr: lr_value})

    @property
    def input(self):
        return self._input

    @property
    def initial_state(self):
        return self._initial_state

    @property
    def cost(self):
        return self._cost

    @property
    def final_state(self):
        return self._final_state

    @property
    def lr(self):
        return self._lr

    @property
    def train_op(self):
        return self._train_op


# The following classes define the different model sizes.
class SmallConfig(object):
    """Small config."""
    init_scale = 0.1
    learning_rate = 1.0
    max_grad_norm = 5
    num_layers = 2
    num_steps = 20
    hidden_size = 200
    max_epoch = 4
    max_max_epoch = 13
    keep_prob = 1.0
    lr_decay = 0.5
    batch_size = 20
    vocab_size = 10000


class MediumConfig(object):
    """Medium config."""
    init_scale = 0.05
    learning_rate = 1.0
    max_grad_norm = 5
    num_layers = 2
    num_steps = 35
    hidden_size = 650  # corrected from 60 to match the upstream ptb_word_lm.py
    max_epoch = 6
    max_max_epoch = 39
    keep_prob = 0.5
    lr_decay = 0.8  # was missing here but is read in main(); value from the upstream ptb_word_lm.py
    batch_size = 20
    vocab_size = 10000


class LargeConfig(object):
    """Large config."""
    init_scale = 0.04
    learning_rate = 1.0
    max_grad_norm = 10
    num_layers = 2
    num_steps = 35
    hidden_size = 1500
    max_epoch = 14
    max_max_epoch = 55
    keep_prob = 0.35
    lr_decay = 1 / 1.15
    batch_size = 20
    vocab_size = 10000


class TestConfig(object):
    """Tiny config, for testing."""
    init_scale = 0.1
    learning_rate = 1.0
    max_grad_norm = 1
    num_layers = 1
    num_steps = 2
    hidden_size = 2
    max_epoch = 1
    max_max_epoch = 1
    keep_prob = 1.0
    lr_decay = 0.5
    batch_size = 20
    vocab_size = 10000


def run_epoch(session, model, eval_op=None, verbose=False):
    """Run the model on the given data."""
    # eval_op is the operation to run in addition to the cost. In main() below it is
    # the train_op defined above, which performs the gradient descent step.
    start_time = time.time()
    costs = 0.0
    iters = 0
    state = session.run(model.initial_state)
    # Session.run takes the fetches as its first argument and returns values with the
    # same structure as the fetches. The fetches may be tf.Tensor objects, containers
    # of tensors, and so on.

    fetches = {
        "cost": model.cost,
        "final_state": model.final_state
    }
    if eval_op is not None:
        fetches["eval_op"] = eval_op

    for step in range(model.input.epoch_size):
        feed_dict = {}
        for i, (c, h) in enumerate(model.initial_state):
            feed_dict[c] = state[i].c
            feed_dict[h] = state[i].h

        vals = session.run(fetches, feed_dict)
        cost = vals["cost"]
        state = vals["final_state"]

        costs += cost
        iters += model.input.num_steps

        if verbose and step % (model.input.epoch_size // 10) == 10:
            print("%.3f perplexity: %.3f speed: %.0f wps" %
                  (step * 1.0 / model.input.epoch_size, np.exp(costs / iters),
                   iters * model.input.batch_size / (time.time() - start_time)))

    return np.exp(costs / iters)


def get_config():
    if FLAGS.model == "small":
        return SmallConfig()
    elif FLAGS.model == "medium":
        return MediumConfig()
    elif FLAGS.model == "large":
        return LargeConfig()
    elif FLAGS.model == "test":
        return TestConfig()
    else:
        raise ValueError("Invalid model: %s" % FLAGS.model)


def main(_):
    if not FLAGS.data_path:
        raise ValueError("Must set --data_path to PTB data directory")

    # ptb_raw_data loads the files; ptb_producer (used in PTBinput) batches them.
    raw_data = reader.ptb_raw_data(FLAGS.data_path)
    train_data, valid_data, test_data, _ = raw_data  # data split

    config = get_config()
    eval_config = get_config()
    eval_config.batch_size = 1
    eval_config.num_steps = 1
    # eval_config is used for the test model.

    with tf.Graph().as_default():
        initializer = tf.random_uniform_initializer(-config.init_scale, config.init_scale)

        with tf.name_scope("Train"):
            train_input = PTBinput(config=config, data=train_data, name="TrainInput")
            with tf.variable_scope("Model", reuse=None, initializer=initializer):
                m = PTBModel(is_training=True, config=config, input_=train_input)
            tf.summary.scalar("Training Loss", m.cost)

        with tf.name_scope("Valid"):
            valid_input = PTBinput(config=config, data=valid_data, name="ValidInput")
            with tf.variable_scope("Model", reuse=True, initializer=initializer):
                mvalid = PTBModel(is_training=False, config=config, input_=valid_input)
            tf.summary.scalar("Validation Loss", mvalid.cost)

        with tf.name_scope("Test"):
            test_input = PTBinput(config=eval_config, data=test_data, name="TestInput")
            with tf.variable_scope("Model", reuse=True, initializer=initializer):
                mtest = PTBModel(is_training=False, config=eval_config, input_=test_input)

        sv = tf.train.Supervisor(logdir=FLAGS.save_path)
        with sv.managed_session() as session:
            for i in range(config.max_max_epoch):
                lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
                m.assign_lr(session, config.learning_rate * lr_decay)

                print("Epoch: %d Learning rate: %.3f" % (i + 1, session.run(m.lr)))
                train_perplexity = run_epoch(session, m, eval_op=m.train_op, verbose=True)
                print("Epoch: %d Train Perplexity: %.3f" % (i + 1, train_perplexity))
                valid_perplexity = run_epoch(session, mvalid)
                print("Epoch: %d Valid Perplexity: %.3f" % (i + 1, valid_perplexity))

            test_perplexity = run_epoch(session, mtest)
            print("Test Perplexity: %.3f" % test_perplexity)

            if FLAGS.save_path:
                print("Saving model to %s." % FLAGS.save_path)
                sv.saver.save(session, FLAGS.save_path, global_step=sv.global_step)


if __name__ == "__main__":
    tf.app.run()
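run_epoch reports np.exp(costs / iters), i.e. perplexity: the exponential of the average negative log-probability that the model assigns to each target word. A small standard-library sketch of that quantity follows; the function name and the example probabilities are assumptions chosen for illustration.

```python
import math


def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the correct words."""
    total_cost = -sum(math.log(p) for p in target_probs)
    return math.exp(total_cost / len(target_probs))


# A model that spreads probability evenly over a 10000-word vocabulary
# has perplexity of about 10000, the vocab_size used in the configs.
uniform_guess = perplexity([1.0 / 10000] * 20)

# A model that always gives the correct word probability 0.5 has perplexity of about 2.
coin_flip = perplexity([0.5, 0.5, 0.5])

print("uniform over vocabulary:", uniform_guess)
print("probability 0.5 per word:", coin_flip)
```

Read this way, a perplexity of N means the model is on average about as uncertain as if it were choosing uniformly among N words, so lower values are better.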