您的位置:首页 > 其它

[深度学习论文笔记][Attention] Spatial Transformer Networks

2016-11-15 22:02 471 查看
Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. “Spatial transformer networks.” Advances in Neural Information Processing Systems. 2015. (Citations: 116).

1 Motivation

The Show, Attend and Tell only allow attention constrained to fixed grid. We want the model can attend to arbitary part of the image.

The pooling operation allows a network to be somewhat spatially invariant to the position of features. However, due to the typically small spatial support for max-pooling, this

spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps in a CNN are not actually invariant to large transformations of the input data.

Our goal is to introduce a spatial transformer module, which intelligently select features of interest (attention), and transform them by scaling, cropping, rotations, and non-rigid
deformations.

2 Spatial Transformers

We want a differentiable module which applies a spatial transformation to a feature map during a single forward pass. For each pixel coordinates .x s , y s /, we compute the corresponding ouput .x t , y t / by 

We normalize coordinates x s , y s in range [-1, 1]. This transformation allows cropping, translation, rotation, scale, and skew to be applied to the input feature map.



For multi-channel inputs, the same warping is applied to each channel. We repeat for all pixels in output to get a sampling grid, and then use bilinear interpolation to compute

output.

3 Architecture

See Fig. One can also use multiple spatial transformers in parallel — this can be useful if there are multiple objects or parts of interest in a feature map that should be

focussed on individually. A limitation of this architecture in a purely feed-forward network is that the number of parallel spatial transformers limits the number of objects that the

network can model.



4 Training Details

For training, we initialize



This allows the output to be the same as input.

5 Results

See Fig. We insert spatial transformers into a classification network and it learns to attend and transform the input. 

6 References

[1]. https://www.youtube.com/watch?v=Ywv0Xi2-14Y.
[2]. https://www.youtube.com/watch?v=T5k0GnBmZVI.
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
相关文章推荐