
A high-level (but very good) introduction to LSTMs: A Gentle Introduction to Long Short-Term Memory Networks by the Experts

http://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/

The author is excellent; thanks for the summary:

This provides you both with a clear and precise idea of what LSTMs are and how they work, as well as important articulation on the promise of LSTMs in the
field of recurrent neural networks.




A rough translation of the key points:

1) An ordinary RNN can only handle time lags of about 5-10 steps (the dependency between input and target must fall within roughly 10 steps); beyond that, the vanishing gradients and exploding gradients problems appear. An LSTM can handle dependencies spanning more than 1000 time steps!

2) The core of how an LSTM works is the memory cell and its associated gates. The input gate blocks irrelevant information in the input, so that the key information held in the memory cell is preserved for a long time; the output gate blocks currently irrelevant information in the memory cell, so that the output stays correct.

3) Learning rate and network size are the most crucial LSTM hyperparameters. The good news is that they can be tuned independently; a common recipe in practice is to calibrate the learning rate on a small network first, then fix the learning rate and increase the network size, which saves a lot of experimentation time.

4) LSTMs apply to essentially any sequence problem, i.e. any task where the result may depend on information from earlier time steps. Concretely: language modeling, speech recognition, machine translation, question answering, handwriting recognition and generation, protein secondary structure prediction, predicting the next frame of a video, and so on.

5) Bidirectional LSTM (B-LSTM): train two LSTMs, one fed the input sequence forwards and one fed it backwards, with both connected to the same output layer. This means that when producing the output at each time step, the B-LSTM knows the complete input sequence, not just the part before the current time step! ==>

… for temporal problems like speech recognition, relying on knowledge of the future seems at first sight to violate causality … How can we base our understanding
of what we’ve heard on something that hasn’t been said yet? However, human listeners do exactly that. Sounds, words, and even whole sentences that at first mean nothing are found to make sense in the light of future context.

6) Seq2Seq-LSTM: see http://blog.csdn.net/mmc2015/article/details/72773854, which covers the common LSTM-LSTM (encoder-decoder) model. For tasks like image captioning, a CNN-LSTM is used instead, since it is natural for a CNN to act as the image encoder. ==>

An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a “decoder” RNN that generates the target sentence. Here, we propose
to follow this elegant recipe, replacing the encoder RNN by a deep convolution neural network (CNN). … it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and
using the last hidden layer as an input to the RNN decoder that generates sentences.


— Oriol Vinyals, et al., Show and Tell: A Neural Image Caption Generator,
2014

The original article:

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems.

This is a behavior required in complex problem domains like machine translation, speech recognition, and more.

LSTMs are a complex area of deep learning. It can be hard to get your hands around what LSTMs are, and how terms like bidirectional and sequence-to-sequence relate to the field.

In this post, you will get insight into LSTMs using the words of research scientists that developed the methods and applied them to new and important problems.

There are few that are better at clearly and precisely articulating both the promise of LSTMs and how they work than the experts that developed them.

We will explore key questions in the field of LSTMs using quotes from the experts, and if you’re interested, you will be able to dive into the original papers from which the quotes were taken.



A Gentle Introduction to Long Short-Term Memory Networks by the Experts

Photo by Oran Viriyincy, some rights reserved.


The Promise of Recurrent Neural Networks

Recurrent neural networks are different from traditional feed-forward neural networks.

This added complexity comes with the promise of new behaviors that the traditional methods cannot achieve.

Recurrent networks … have an internal state that can represent context information. … [they] keep information about past inputs for an amount of time that is not fixed a priori, but rather depends on its weights and on the input data.



A recurrent network whose inputs are not fixed but rather constitute an input sequence can be used to transform an input sequence into an output sequence while taking into account contextual information in a flexible way.

— Yoshua Bengio, et al., Learning Long-Term
Dependencies with Gradient Descent is Difficult, 1994.

The paper defines 3 basic requirements of a recurrent neural network:

That the system be able to store information for an arbitrary duration.

That the system be resistant to noise (i.e. fluctuations of the inputs that are random or irrelevant to predicting a correct output).

That the system parameters be trainable (in reasonable time).

The paper also describes the “minimal task” for demonstrating recurrent neural networks.

Context is key.

Recurrent neural networks must use context when making predictions, but to that end, the required context must also be learned.

… recurrent neural networks contain cycles that feed the network activations from a previous time step as inputs to the network to influence predictions at the current time step. These activations are stored in the internal states of the network which can in
principle hold long-term temporal contextual information. This mechanism allows RNNs to exploit a dynamically changing contextual window over the input sequence history

— Hasim Sak, et al., Long Short-Term Memory Recurrent Neural Network
Architectures for Large Scale Acoustic Modeling, 2014


LSTMs Deliver on the Promise

The success of LSTMs is in their claim to be one of the first implementations to overcome the technical problems and deliver on the promise of recurrent neural networks.

Hence standard RNNs fail to learn in the presence of time lags greater than 5 – 10 discrete time steps between relevant input events and target signals. The vanishing error problem casts doubt on whether standard RNNs can indeed exhibit significant practical advantages over time window-based feedforward networks. A recent model, “Long Short-Term Memory” (LSTM), is not affected by this problem. LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through “constant error carrousels” (CECs) within special units, called cells

— Felix A. Gers, et al., Learning to
Forget: Continual Prediction with LSTM, 2000

The two technical problems overcome by LSTMs are vanishing gradients and exploding gradients, both related to how the network is trained.

Unfortunately, the range of contextual information that standard RNNs can access is in practice quite limited. The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially
as it cycles around the network’s recurrent connections. This shortcoming … referred to in the literature as the vanishing gradient problem … Long Short-Term Memory (LSTM) is an RNN architecture specifically designed to address the vanishing gradient problem.

— Alex Graves, et al., A Novel Connectionist System for Unconstrained
Handwriting Recognition, 2009
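
A quick back-of-the-envelope illustration of the effect described in these quotes (a sketch of my own, not from any of the papers): the error signal backpropagated through a simple recurrent connection is scaled again and again by roughly the same recurrent factor, so over many time steps it either shrinks toward zero or blows up.

```python
# Toy illustration of the vanishing/exploding gradient effect: backpropagating
# through T time steps repeatedly multiplies the error by (roughly) the same
# recurrent factor. Purely illustrative, not a real RNN gradient computation.
def scaled_error(recurrent_factor, time_steps, initial_error=1.0):
    error = initial_error
    for _ in range(time_steps):
        error *= recurrent_factor  # one step of backpropagation through time
    return error

for factor in (0.9, 1.1):
    print(f"factor={factor}: 10 steps -> {scaled_error(factor, 10):.4f}, "
          f"100 steps -> {scaled_error(factor, 100):.3e}")
# factor 0.9 decays toward zero (vanishing); factor 1.1 grows without bound (exploding).
```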

The key to the LSTM solution to the technical problems was the specific internal structure of the units used in the model.

… governed by its ability to deal with vanishing and exploding gradients, the most common challenge in designing and training RNNs. To address this challenge, a particular form of recurrent nets, called LSTM, was introduced and applied with great success to
translation and sequence generation.

— Alex Graves, et al., Framewise
Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures, 2005.


How do LSTMs Work?

Rather than go into the equations that govern how LSTMs are fit, analogy is a useful tool to quickly get a handle on how they work.

We use networks with one input layer, one hidden layer, and one output layer… The (fully) self-connected hidden layer contains memory cells and corresponding gate units…



Each memory cell’s internal architecture guarantees constant error flow within its constant error carrousel CEC… This represents the basis for bridging very long time lags. Two gate units learn to open and close access to error flow within each memory cell’s CEC.
The multiplicative input gate affords protection of the CEC from perturbation by irrelevant inputs. Likewise, the multiplicative output gate protects other units from perturbation by currently irrelevant memory contents.

— Sepp Hochreiter and Jurgen Schmidhuber, Long
Short-Term Memory, 1997.

Multiple analogies can help to give purchase on what differentiates LSTMs from traditional neural networks comprised of simple neurons.

The Long Short Term Memory architecture was motivated by an analysis of error flow in existing RNNs which found that long time lags were inaccessible to existing architectures, because backpropagated error either blows up or decays exponentially.

An LSTM layer consists of a set of recurrently connected blocks, known as memory blocks. These blocks can be thought of as a differentiable version of the memory chips in a digital computer. Each one contains one or more recurrently connected memory cells and
three multiplicative units – the input, output and forget gates – that provide continuous analogues of write, read and reset operations for the cells. … The net can only interact with the cells via the gates.

— Alex Graves, et al., Framewise
Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures, 2005.
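
To make the memory cell and gate vocabulary above concrete, here is a minimal NumPy sketch of one time step of a single LSTM layer with input, forget and output gates. The weight names, shapes and gate ordering are my own illustrative choices, not taken from the papers quoted here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time step of an LSTM layer (input, forget and output gates)."""
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # stacked pre-activations, shape (4*hidden,)
    i = sigmoid(z[0*hidden:1*hidden])      # input gate: admit relevant new information
    f = sigmoid(z[1*hidden:2*hidden])      # forget gate: keep or reset cell contents
    o = sigmoid(z[2*hidden:3*hidden])      # output gate: expose cell contents to other units
    g = np.tanh(z[3*hidden:4*hidden])      # candidate cell update
    c_t = f * c_prev + i * g               # additive cell update (the "constant error carrousel")
    h_t = o * np.tanh(c_t)                 # gated output of the memory block
    return h_t, c_t

# Illustrative sizes only.
hidden, n_in = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, n_in))
U = rng.normal(size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(10, n_in)):    # run over a 10-step input sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```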

It is interesting to note that even after more than 20 years, the simple (or vanilla) LSTM may still be the best place to start when applying the technique.

The most commonly used LSTM architecture (vanilla LSTM) performs reasonably well on various datasets…

Learning rate and network size are the most crucial tunable LSTM hyperparameters …

… This implies that the hyperparameters can be tuned independently. In particular, the learning rate can be calibrated first using a fairly small network, thus saving a lot of experimentation time.

— Klaus Greff, et al., LSTM: A Search Space Odyssey, 2015
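
The tuning recipe from the Greff et al. quote can be sketched roughly as follows. This is a hedged sketch assuming Keras is available; the data shapes, the learning-rate grid and the epoch counts are placeholders of my own, not values from the paper.

```python
# Sketch of the tuning order suggested above: calibrate the learning rate on a
# deliberately small LSTM, then reuse the best value while growing the network.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam

def build_model(units, learning_rate, timesteps=20, features=1):
    model = Sequential([
        LSTM(units, input_shape=(timesteps, features)),
        Dense(1),
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=learning_rate))
    return model

X = np.random.rand(256, 20, 1)   # placeholder sequence data
y = np.random.rand(256, 1)

# Step 1: calibrate the learning rate on a small network.
best_lr, best_loss = None, float("inf")
for lr in (1e-2, 1e-3, 1e-4):
    small = build_model(units=16, learning_rate=lr)
    history = small.fit(X, y, epochs=5, batch_size=32, verbose=0)
    if history.history["loss"][-1] < best_loss:
        best_lr, best_loss = lr, history.history["loss"][-1]

# Step 2: keep that learning rate and increase the network size.
large = build_model(units=128, learning_rate=best_lr)
large.fit(X, y, epochs=20, batch_size=32, verbose=0)
```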


What are LSTM Applications?

It is important to get a handle on exactly what types of sequence learning problems LSTMs are suited to address.

Long Short-Term Memory (LSTM) can solve numerous tasks not solvable by previous learning algorithms for recurrent neural networks (RNNs).



… LSTM holds promise for any sequential processing task in which we suspect that a hierarchical decomposition may exist, but do not know in advance what this decomposition is.

— Felix A. Gers, et al., Learning to
Forget: Continual Prediction with LSTM, 2000

The Recurrent Neural Network (RNN) is a neural sequence model that achieves state of the art performance on important tasks that include language modeling, speech recognition, and machine translation.

— Wojciech Zaremba, Recurrent Neural Network Regularization, 2014.

Since LSTMs are effective at capturing long-term temporal dependencies without suffering from the optimization hurdles that plague simple recurrent networks (SRNs), they have been used to advance the state of the art for many difficult problems. This includes
handwriting recognition and generation, language modeling and translation, acoustic modeling of speech, speech synthesis, protein secondary structure prediction, analysis of audio, and video data among others.

— Klaus Greff, et al., LSTM: A Search Space Odyssey, 2015


What are Bidirectional LSTMs?

A commonly mentioned improvement upon LSTMs is the bidirectional LSTM.

The basic idea of bidirectional recurrent neural nets is to present each training sequence forwards and backwards to two separate recurrent nets, both of which are connected to the same output layer. … This means that for every point in a given sequence, the
BRNN has complete, sequential information about all points before and after it. Also, because the net is free to use as much or as little of this context as necessary, there is no need to find a (task-dependent) time-window or target delay size.

… for temporal problems like speech recognition, relying on knowledge of the future seems at first sight to violate causality … How can we base our understanding of what we’ve heard on something that hasn’t been said yet? However, human listeners do exactly
that. Sounds, words, and even whole sentences that at first mean nothing are found to make sense in the light of future context.

— Alex Graves, et al., Framewise
Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures, 2005.

One shortcoming of conventional RNNs is that they are only able to make use of previous context. … Bidirectional RNNs (BRNNs) do this by processing the data in both directions with two separate hidden layers, which are then fed forwards to the same output layer.
… Combining BRNNs with LSTM gives bidirectional LSTM, which can access long-range context in both input directions

— Alex Graves, et al., Speech recognition with deep
recurrent neural networks, 2013

Unlike conventional RNNs, bidirectional RNNs utilize both the previous and future context, by processing the data from two directions with two separate hidden layers. One layer processes the input sequence in the forward direction, while the other processes the input in the reverse direction. The output of the current time step is then generated by combining both layers’ hidden vector…

— Di Wang and Eric Nyberg, A Long Short-Term Memory Model for
Answer Sentence Selection in
Question Answering, 2015
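
In code, the idea of two separate hidden layers feeding the same output layer maps neatly onto the Bidirectional wrapper in Keras. The sketch below assumes Keras and uses illustrative sizes and a per-time-step labelling task of my own choosing.

```python
# Minimal sketch of a bidirectional LSTM for per-time-step prediction.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, TimeDistributed

timesteps, features, n_classes = 50, 10, 5

model = Sequential([
    # One LSTM reads the sequence forwards, the other backwards;
    # their hidden vectors are concatenated at every time step.
    Bidirectional(LSTM(64, return_sequences=True),
                  merge_mode="concat",
                  input_shape=(timesteps, features)),
    # Both directions feed the same output layer, so each prediction
    # can use context from before and after the current time step.
    TimeDistributed(Dense(n_classes, activation="softmax")),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```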


What are seq2seq LSTMs or RNN Encoder-Decoders?

Sequence-to-sequence LSTMs, also called encoder-decoder LSTMs, are an application of LSTMs that is receiving a lot of attention given their impressive capability.

… a straightforward application of the Long Short-Term Memory (LSTM) architecture can solve general sequence to sequence problems.



The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector. The second LSTM is essentially a recurrent
neural network language model except that it is conditioned on the input sequence.

The LSTM’s ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs.

We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much
simpler. … The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work

— Ilya Sutskever, et al., Sequence to Sequence Learning with Neural Networks,
2014
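
The recipe Sutskever et al. describe, one LSTM that compresses the source sequence into its final states and a second LSTM language model conditioned on them, can be sketched as follows. This assumes Keras; vocabulary sizes, dimensions and variable names are placeholders of my own, and the teacher-forcing training loop and inference decoding are omitted.

```python
# Sketch of an LSTM encoder-decoder for sequence-to-sequence learning.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Embedding, Dense

src_vocab, tgt_vocab, emb_dim, hidden = 8000, 8000, 128, 256

# Encoder: read the source sequence and keep only its final states
# (the "fixed-dimensional vector representation" of the input).
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab, emb_dim)(enc_in)
_, enc_h, enc_c = LSTM(hidden, return_state=True)(enc_emb)

# Decoder: a conditional language model initialised with the encoder states.
dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, emb_dim)(dec_in)
dec_out = LSTM(hidden, return_sequences=True)(dec_emb, initial_state=[enc_h, enc_c])
word_probs = Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], word_probs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# The Sutskever et al. trick: reverse the source sequences (only), which
# introduces many short-term dependencies and eases optimisation.
def reverse_source(padded_source_ids):
    # padded_source_ids: integer array of shape (batch, timesteps)
    return padded_source_ids[:, ::-1]
```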

An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a “decoder” RNN that generates the target sentence. Here, we propose to follow this elegant recipe,
replacing the encoder RNN by a deep convolution neural network (CNN). … it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates
sentences.

— Oriol Vinyals, et al., Show and Tell: A Neural Image Caption Generator,
2014
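
A corresponding sketch of the CNN-as-encoder idea, assuming Keras and a pretrained InceptionV3 (my choice of backbone): the image classifier's pooled features are projected and used to condition a caption-generating LSTM. Conditioning the decoder via its initial state is one common variant, not necessarily the exact scheme of the paper.

```python
# Sketch of a CNN-LSTM image captioning model: CNN encoder, LSTM decoder.
from tensorflow.keras import Input, Model
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, Embedding, LSTM

vocab, emb_dim, hidden = 10000, 256, 256

# Image encoder: a CNN pretrained for classification, classifier head removed;
# its pooled features stand in for the "last hidden layer" in the quote.
cnn = InceptionV3(weights="imagenet", include_top=False, pooling="avg")
cnn.trainable = False
img_in = Input(shape=(299, 299, 3))
img_vec = Dense(hidden, activation="relu")(cnn(img_in))

# Caption decoder: an LSTM language model conditioned on the image vector.
cap_in = Input(shape=(None,))
cap_emb = Embedding(vocab, emb_dim)(cap_in)
dec = LSTM(hidden, return_sequences=True)(cap_emb, initial_state=[img_vec, img_vec])
word_probs = Dense(vocab, activation="softmax")(dec)

model = Model([img_in, cap_in], word_probs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```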

… an RNN Encoder–Decoder, consists of two recurrent neural networks (RNN) that act as an encoder and a decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length
target sequence.

— Kyunghyun Cho, et al., Learning Phrase Representations using RNN Encoder-Decoder
for Statistical Machine Translation, 2014


Summary

In this post, you received a gentle introduction to LSTMs in the words of the research scientists that developed and applied the techniques.

This provides you both with a clear and precise idea of what LSTMs are and how they work, as well as important articulation on the promise of LSTMs in the field of recurrent neural networks.

Did any of the quotes help your understanding or inspire you?

Let me know in the comments below.



