[Deep Learning Paper Notes][Attention] Spatial Transformer Networks
2016-11-15 22:02
Jaderberg, Max, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. "Spatial Transformer Networks." Advances in Neural Information Processing Systems. 2015. (Citations: 116).
1 Motivation
Show, Attend and Tell only allows attention constrained to a fixed grid. We want the model to be able to attend to arbitrary parts of the image.
The pooling operation allows a network to be somewhat spatially invariant to the position of features. However, due to the typically small spatial support for max-pooling, this
spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps in a CNN are not actually invariant to large transformations of the input data.
Our goal is to introduce a spatial transformer module, which intelligently selects features of interest (attention) and transforms them by scaling, cropping, rotation, and non-rigid deformations.
2 Spatial Transformers
We want a differentiable module that applies a spatial transformation to a feature map in a single forward pass. For each pixel coordinate $(x^t, y^t)$ of the output, we compute the corresponding source coordinate $(x^s, y^s)$ in the input feature map by

$$\begin{pmatrix} x^s \\ y^s \end{pmatrix} = A_\theta \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x^t \\ y^t \\ 1 \end{pmatrix}.$$

The coordinates $x^s, y^s$ and $x^t, y^t$ are normalized to the range $[-1, 1]$. This affine transformation allows cropping, translation, rotation, scale, and skew to be applied to the input feature map.
For multi-channel inputs, the same warping is applied to each channel. Repeating this for every output pixel yields a sampling grid, and bilinear interpolation of the input feature map at the grid points produces the output.
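To make the pointwise transform and the bilinear sampler concrete, here is a minimal NumPy sketch for a single-channel feature map. The function name `affine_bilinear_sample` and the explicit loops are illustrative assumptions for readability; the paper computes the grid and samples it in a vectorized, differentiable way.

```python
import numpy as np

def affine_bilinear_sample(U, theta, out_h, out_w):
    """Warp a (H, W) feature map U with a 2x3 affine matrix theta.

    For each output pixel (x_t, y_t), normalized to [-1, 1], compute the
    source location (x_s, y_s) = theta @ [x_t, y_t, 1] and bilinearly
    interpolate U there.
    """
    H, W = U.shape
    V = np.zeros((out_h, out_w))
    xs_t = np.linspace(-1, 1, out_w)   # normalized target grid
    ys_t = np.linspace(-1, 1, out_h)
    for i, y_t in enumerate(ys_t):
        for j, x_t in enumerate(xs_t):
            x_s, y_s = theta @ np.array([x_t, y_t, 1.0])
            # Map normalized source coordinates back to pixel indices.
            x = (x_s + 1) * (W - 1) / 2
            y = (y_s + 1) * (H - 1) / 2
            x0, y0 = int(np.floor(x)), int(np.floor(y))
            # Accumulate the four neighbours, skipping out-of-bounds ones.
            for yy, xx in [(y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)]:
                if 0 <= yy < H and 0 <= xx < W:
                    V[i, j] += (1 - abs(x - xx)) * (1 - abs(y - yy)) * U[yy, xx]
    return V

# The identity transform reproduces the input (up to resampling).
U = np.random.rand(8, 8)
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
assert np.allclose(affine_bilinear_sample(U, theta, 8, 8), U)
```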
3 Architecture
See Fig. One can also use multiple spatial transformers in parallel, which is useful if there are multiple objects or parts of interest in a feature map that should be focused on individually. A limitation of this architecture in a purely feed-forward network is that the number of parallel spatial transformers limits the number of objects the network can model.
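The module has three parts: a localization network that regresses the six affine parameters $\theta$ from the feature map, a grid generator, and a bilinear sampler. A minimal PyTorch sketch follows; the layer sizes of the localization network are illustrative assumptions, not the configuration from the paper, and `F.affine_grid` / `F.grid_sample` stand in for the grid generator and sampler.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Localization network -> grid generator -> bilinear sampler."""

    def __init__(self, in_channels=1):
        super().__init__()
        # Localization network: regresses the 6 affine parameters theta.
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(3), nn.Flatten(),
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                           # (N, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)   # sampling grid
        return F.grid_sample(x, grid, align_corners=False)           # bilinear sampling
```

Because bilinear sampling is (sub-)differentiable in both the grid coordinates and the input, the whole module can be trained end-to-end with backpropagation.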
4 Training Details
For training, we initialize the transformation to the identity, $A_\theta = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$. This makes the output of the spatial transformer equal to its input at the start of training.
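A minimal sketch of this initialization in PyTorch: zero the weights of the final regression layer of the localization network and set its bias to the flattened identity matrix. Here `fc_theta` is a hypothetical final layer, and the hidden size 32 is an assumption carried over from the sketch above.

```python
import torch
import torch.nn as nn

fc_theta = nn.Linear(32, 6)           # final layer regressing the 6 affine parameters
nn.init.zeros_(fc_theta.weight)       # zero weights: theta ignores the input at first
with torch.no_grad():
    fc_theta.bias.copy_(torch.tensor([1.0, 0.0, 0.0,
                                      0.0, 1.0, 0.0]))  # bias = flattened identity

theta = fc_theta(torch.randn(4, 32)).view(-1, 2, 3)
print(theta[0])  # [[1, 0, 0], [0, 1, 0]] for every input at initialization
```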
5 Results
See Fig. We insert spatial transformers into a classification network, and they learn to attend to and transform the input.