[Deep Learning Paper Notes][Video Classification] Two-Stream Convolutional Networks for Action Recognition in Videos
2016-11-17 09:27
Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos.” Advances in Neural Information Processing Systems. 2014.
(Citations: 425).
1 Motivation
The features learned by a spatio-temporal CNN do not capture motion well. The idea is to use separate CNN streams, one for appearance from still frames and one for motion between frames, and to combine them by late fusion. Decoupling the spatial and temporal nets also allows us to exploit the availability of large amounts of annotated image data by pre-training the spatial net on the ImageNet challenge dataset.
2 Architecture
See Fig.
The spatial stream performs action recognition from still frames. This is the standard image classification task, so we can use a CNN pre-trained on ImageNet.
The temporal stream performs action recognition from motion. The input to this model is formed by stacking optical flow displacement fields between several consecutive frames; such input explicitly describes the motion between video frames. The stacked flow is treated as a multi-channel image, and the network convolves over all 2T flow channels at once.
The final fusion is done either by averaging the class scores of the two streams or by training a linear SVM on their L2-normalized softmax scores as features.
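A minimal pure-Python sketch of the two fusion strategies described above (the function names are illustrative, not from the paper):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Late fusion: average the per-class softmax scores of the two streams."""
    p_s, p_t = softmax(spatial_scores), softmax(temporal_scores)
    return [(a + b) / 2 for a, b in zip(p_s, p_t)]

def svm_feature(spatial_scores, temporal_scores):
    """Concatenate the two softmax vectors and L2-normalize, producing
    the feature vector on which a linear SVM would be trained."""
    feat = softmax(spatial_scores) + softmax(temporal_scores)
    norm = math.sqrt(sum(x * x for x in feat))
    return [x / norm for x in feat]
```

Averaging needs no extra training, while the SVM variant learns a weighting of the per-class scores of both streams.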
3 Temporal Stream
There are several variations of the temporal stream part.
3.1 Optical Flow Stacking
The input is a set of displacement vector fields d_t between pairs of consecutive frames t and t + 1. By d_t(i, j) we denote the displacement vector at the point (i, j) in frame t, which moves the point to the corresponding point in frame t + 1. To represent the motion across a sequence of frames, we stack the horizontal and vertical components d_t^x(i, j) and d_t^y(i, j) of T consecutive vector fields to form a total of 2T input channels.
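The channel layout can be sketched as follows, with each flow field represented as a pair of H x W grids holding its x- and y-components (a toy illustration, not the paper's implementation):

```python
def stack_optical_flow(flows):
    """
    Stack T consecutive flow fields into a single 2T-channel input.

    `flows` is a list of T flow fields; each field is a pair (dx, dy)
    where dx and dy are H x W grids (lists of lists) holding the
    horizontal and vertical displacement at every pixel.
    Channel 2k holds the x-component of flow field k, channel 2k + 1
    its y-component, giving 2T channels in total.
    """
    channels = []
    for dx, dy in flows:
        channels.append(dx)
        channels.append(dy)
    return channels

# Example: T = 3 flow fields on a 2x2 grid with constant displacement.
T, H, W = 3, 2, 2
flows = [([[t] * W for _ in range(H)],   # dx: every pixel moves t to the right
          [[-t] * W for _ in range(H)])  # dy: every pixel moves t upward
         for t in range(T)]
volume = stack_optical_flow(flows)
assert len(volume) == 2 * T              # 2T input channels
```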
3.2 Trajectory Stacking
Trajectory stacking replaces the optical flow sampled at the same image locations across several frames with flow sampled along the motion trajectories of anchor points.
See Fig. for illustration.
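A toy sketch of tracking a single anchor point along its motion trajectory (positions are rounded to the nearest pixel and clamped to the grid; the helper name is illustrative):

```python
def trajectory_samples(flows, start):
    """
    Sample flow along a motion trajectory (trajectory stacking): at each
    step, read the displacement at the tracked point, record it, then move
    the point by that displacement before reading the next flow field.
    Returns (samples, path): the (u, v) displacements sampled along the
    trajectory, and the integer pixel positions visited.
    """
    i, j = start                        # (row, col) of the anchor point
    path, samples = [(i, j)], []
    for dx, dy in flows:
        u, v = dx[i][j], dy[i][j]       # displacement at the current point
        samples.append((u, v))
        j = min(max(j + round(u), 0), len(dx[0]) - 1)  # move horizontally
        i = min(max(i + round(v), 0), len(dx) - 1)     # move vertically
        path.append((i, j))
    return samples, path
```

With optical flow stacking, all T samples would instead be read at the fixed location `start`.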
3.3 Bi-directional Optical Flow
We can construct an input volume by stacking T/2 forward flows between frames t and t + T/2 and T/2 backward flows between frames t - T/2 and t. The input thus has the same number of channels (2T) as before. The flows can be represented using either optical flow stacking or trajectory stacking.
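Reusing the per-field layout from the optical flow stacking sketch, the bi-directional variant simply concatenates the two half-stacks (an illustrative helper, not the paper's code):

```python
def stack_bidirectional(forward_flows, backward_flows):
    """
    Build a 2T-channel input from T/2 forward flow fields (frames
    t .. t + T/2) and T/2 backward flow fields (frames t - T/2 .. t).
    Each field contributes its x- and y-component as two channels,
    matching the 2T channels of uni-directional stacking.
    """
    assert len(forward_flows) == len(backward_flows)  # T/2 each
    channels = []
    for dx, dy in forward_flows + backward_flows:
        channels.extend([dx, dy])
    return channels
```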
3.4 Training Details
It is generally beneficial to perform zero-centering of the network input, as it allows the model to better exploit the rectification non-linearities. In our case, the displacement vector field components can be dominated by a particular displacement, e.g., one caused by camera movement. We consider a simpler approach: from each displacement field d we subtract its mean vector. Because the action recognition datasets (UCF-101 and HMDB-51) are small, we use multi-task learning to combat overfitting: the CNN architecture is modified to have two softmax classification layers on top of the last fully-connected layer, one for each dataset.
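The mean-vector subtraction amounts to zero-centering each flow field independently; a pure translation (e.g., camera pan) then cancels out entirely. A minimal sketch:

```python
def subtract_mean_displacement(dx, dy):
    """
    Zero-center one displacement field: subtract the field's mean vector
    (mean of dx, mean of dy) from every pixel, suppressing a global
    displacement component such as camera motion.
    """
    n = sum(len(row) for row in dx)
    mx = sum(v for row in dx for v in row) / n
    my = sum(v for row in dy for v in row) / n
    return ([[v - mx for v in row] for row in dx],
            [[v - my for v in row] for row in dy])
```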
4 Results
Stacking multiple (T > 1) displacement fields in the input is highly beneficial, as it provides the network with long-term motion information. Mean subtraction is helpful, as it
reduces the effect of global motion between the frames. Optical flow stacking performs better than trajectory stacking, and using the bi-directional optical flow is only slightly
better than a uni-directional forward flow. The temporal CNN significantly outperforms the spatial CNN, which confirms the importance of motion information for action recognition.
Temporal and spatial recognition streams are complementary, as their fusion significantly improves on both.