[Deep Learning Paper Notes][Attention] Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention
2016-11-15 19:53
Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." arXiv preprint arXiv:1502.03044 (2015). (Citations: 401).
1 Motivation
In previous image captioning models, the RNN decoder looks at the whole image only once. Besides, the CNN encoder encodes fc7 representations, which distill the information in the image down to the most salient objects.
However, this has the potential drawback of losing information that could be useful for richer, more descriptive captions. Using lower-level representations (conv4/conv5 features) can help preserve this information. However, working with these features necessitates an attention mechanism that learns to fix its gaze on salient objects while generating the corresponding words in the output sequence, in order to relieve the computational burden.
Another benefit of the attention model is the ability to visualize what the model "sees".
The attention model is also in accord with the human visual system. Rather than compressing an entire image into a static representation, attention allows salient features to dynamically come to the forefront as needed. This is especially important when there is a lot of clutter in an image.
2 Pipeline
See Fig., where $\vec{z}$ is the context vector, capturing the visual information associated with attention. $L$ represents the possible locations (different conv4/conv5 grid cells in our case), each of which is a $D$-dimensional embedding vector $\vec{a}_i$. The distribution $\vec{p}$ over the $L$ locations satisfies
$$\sum_{i=1}^{L} p_i = 1, \qquad p_i \ge 0.$$
Note that $\vec{p}$ corresponds to the $\vec{\alpha}$ used in the Fig.
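As a concrete illustration, here is a minimal NumPy sketch of how such an attention distribution over the $L$ locations can be computed. The scoring MLP (`W_a`, `W_h`, `w`) and the decoder hidden state `h` are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

# L grid locations, each a D-dim annotation vector; n-dim decoder hidden state.
L, D, n = 196, 512, 1024              # e.g. a 14x14 conv feature map
rng = np.random.default_rng(0)
a = rng.standard_normal((L, D))       # annotation vectors a_i
h = rng.standard_normal(n)            # previous decoder hidden state

# Hypothetical scoring MLP (illustration only): e_i = w^T tanh(W_a a_i + W_h h)
W_a = 0.01 * rng.standard_normal((128, D))
W_h = 0.01 * rng.standard_normal((128, n))
w = 0.01 * rng.standard_normal(128)

e = np.tanh(a @ W_a.T + W_h @ h) @ w  # unnormalized scores, shape (L,)
p = softmax(e)                        # attention distribution over locations
assert np.isclose(p.sum(), 1.0)       # a valid distribution, as required
```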
3 Hard Attention
At each time step, $\vec{z}$ is taken from one location of $\vec{a}$:
$$\vec{z} = \vec{a}_{l^\star}, \qquad l^\star = \arg\max_i p_i.$$
Because of the $\arg\max$, $\partial \vec{z} / \partial \vec{p}$ is zero almost everywhere, since slightly changing $\vec{p}$ will not affect $l^\star$. Therefore, it cannot be trained using SGD; reinforcement learning is used instead.
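Continuing the variables from the sketch above, the hard-attention selection looks like this; the sampled variant in the comment is what the paper actually trains with REINFORCE:

```python
# Hard attention: pick a single location; non-differentiable w.r.t. p,
# because small perturbations of p do not change the argmax.
l_star = np.argmax(p)      # deterministic variant described above
z_hard = a[l_star]         # context vector = one annotation vector, shape (D,)

# The paper's stochastic variant samples the location instead and trains
# the sampler with REINFORCE:
# l_star = rng.choice(L, p=p)
```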
4 Soft Attention
At each time step, $\vec{z}$ is a summarization of all locations:
$$\vec{z} = \sum_{i=1}^{L} p_i \vec{a}_i.$$
This form is easy to differentiate, so it can be trained with SGD.
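In code (again continuing the sketch above), the soft context vector is a single weighted sum:

```python
# Soft attention: expectation of the annotation vectors under p.
# Gradients flow through p, so the model trains end-to-end with plain SGD.
z_soft = p @ a             # weighted sum over locations, shape (D,)
```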
5 Doubly Stochastic Attention
Besides $\sum_i p_{t,i} = 1$ at each time step $t$, we also encourage $\sum_t p_{t,i} \approx 1$ for each location $i$, implemented in the paper as a penalty $\lambda \sum_i \left(1 - \sum_t p_{t,i}\right)^2$ added to the training loss.
This can be interpreted as encouraging the model to pay equal attention to every part of the image over the course of generation. In practice, we found that this regularization leads to richer, more descriptive captions.
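A minimal sketch of this penalty, stacking the per-step attention weights into a $T \times L$ matrix; `T`, `lam`, and the Dirichlet stand-in for the softmax outputs are illustrative assumptions:

```python
# Doubly stochastic penalty: each row of P (a decoding step) sums to 1 by
# construction; the penalty pushes each column (a location) to also sum to ~1.
T = 12                                 # illustrative caption length
P = rng.dirichlet(np.ones(L), size=T)  # T x L stand-in for T softmax outputs
lam = 1.0                              # illustrative penalty weight
penalty = lam * np.sum((1.0 - P.sum(axis=0)) ** 2)
# total_loss = caption_negative_log_likelihood + penalty
```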
6 Results
See Fig. The model can attend to "non-object" salient regions.