
[Deep Learning Paper Notes][Video Classification] Two-Stream Convolutional Networks for Action Recognition in Videos

Simonyan, Karen, and Andrew Zisserman. “Two-stream convolutional networks for action recognition in videos.” Advances in Neural Information Processing Systems. 2014.

(Citations: 425).

1 Motivation

The features learnt by a spatio-temporal CNN do not capture motion well. The idea is to use two separate CNN streams, one for appearance from still frames and one for motion between frames, and to combine them by late fusion. Decoupling the spatial and temporal nets also allows us to exploit the large amount of annotated image data by pre-training the spatial net on the ImageNet challenge dataset.

2 Architecture
See Fig. 1 of the paper for the two-stream architecture: a spatial stream ConvNet operating on a single still frame and a temporal stream ConvNet operating on multi-frame optical flow, fused at the class score level.

The spatial stream is used to perform action recognition from still frames. This is a standard image classification task, so we can use a CNN pre-trained on ImageNet.

The temporal stream is used to perform action recognition from motion. The input to this model is formed by stacking optical flow displacement fields between several consecutive frames. Such input explicitly describes the motion between video frames. The network itself is a standard 2D CNN whose first convolutional layer simply operates over all of the stacked flow channels.

The final fusion is done either by averaging the class scores of the two streams or by training a linear SVM on their l_2-normalized softmax scores used as features.
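
A minimal sketch of the two fusion options (my own illustration, not code from the paper), assuming the per-class softmax scores of each stream are already available as NumPy arrays of shape (num_videos, num_classes):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse_by_averaging(spatial_scores, temporal_scores):
    """Late fusion by averaging the per-class softmax scores of the two streams."""
    return (spatial_scores + temporal_scores) / 2.0

def fuse_by_svm(spatial_scores, temporal_scores, labels):
    """Late fusion by a linear SVM trained on the l_2-normalized,
    concatenated softmax scores of the two streams."""
    feats = np.concatenate([spatial_scores, temporal_scores], axis=1)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # l_2 normalization
    return LinearSVC(C=1.0).fit(feats, labels)
```

Predicted labels then come from the argmax of the averaged scores, or from the SVM's predict method.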

3 Temporal Stream

There are several variants of the temporal stream input.

3.1 Optical Flow Stacking

The input is a set of displacement vector fields d_t between pairs of consecutive frames t and t + 1. By d_t(i, j) we denote the displacement vector at the point (i, j) in frame t, which moves the point to the corresponding point in the following frame t + 1. To represent the motion across a sequence of frames, we stack the horizontal and vertical components d_t^x(i, j) and d_t^y(i, j) of T consecutive fields to form a total of 2T input channels.
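
As a concrete (hypothetical) illustration of the stacking, assuming `flows` is a list of T dense flow fields of shape (H, W, 2), e.g. computed with OpenCV's `cv2.calcOpticalFlowFarneback` (the paper itself uses a different, GPU-based flow method):

```python
import numpy as np

def stack_optical_flow(flows):
    """Stack the horizontal and vertical components of T consecutive
    flow fields into a single (H, W, 2T) input volume."""
    channels = []
    for d in flows:                  # each d has shape (H, W, 2)
        channels.append(d[..., 0])   # horizontal component d^x
        channels.append(d[..., 1])   # vertical component d^y
    return np.stack(channels, axis=-1)
```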

3.2 Trajectory Stacking

This replaces the optical flow sampled at the same image locations across several frames with the flow sampled along the motion trajectories of a set of anchor points; see the sketch below.

See Fig. 2 of the paper for an illustration of optical flow stacking versus trajectory stacking.


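Under the same assumptions as above, a rough sketch of trajectory stacking: instead of sampling every field at the same pixel, each successive field is sampled at the point the trajectory has reached so far (the nearest-neighbour sampling here is a simplification of my own):

```python
import numpy as np

def stack_trajectories(flows):
    """Trajectory stacking: sample each flow field along the motion
    trajectory started at every pixel of the first frame."""
    H, W, _ = flows[0].shape
    T = len(flows)
    out = np.zeros((H, W, 2 * T), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)   # current trajectory points p_k
    for k, d in enumerate(flows):
        # Sample the flow at the current (rounded, clipped) trajectory positions.
        yi = np.clip(np.round(ys), 0, H - 1).astype(int)
        xi = np.clip(np.round(xs), 0, W - 1).astype(int)
        sampled = d[yi, xi]                 # shape (H, W, 2)
        out[..., 2 * k] = sampled[..., 0]   # d^x along the trajectory
        out[..., 2 * k + 1] = sampled[..., 1]
        # Advance the trajectory: p_{k+1} = p_k + d_k(p_k).
        xs = xs + sampled[..., 0]
        ys = ys + sampled[..., 1]
    return out
```
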
3.3 Bi-directional Optical Flow

We can construct an input volume by stacking T/2 forward flows between frames t and t + T/2 and T/2 backward flows between frames t − T/2 and t. The input thus has the same number of channels (2T) as before. The flows can be represented using either optical flow stacking or trajectory stacking.
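
A minimal sketch, reusing the hypothetical `stack_optical_flow` helper from above; `forward_flows` holds the T/2 fields computed forwards from frame t, and `backward_flows` the T/2 fields computed in the opposite direction for the frames before t:

```python
import numpy as np

def stack_bidirectional(forward_flows, backward_flows):
    """Concatenate T/2 forward and T/2 backward flow stacks into a
    single (H, W, 2T) input volume."""
    fwd = stack_optical_flow(forward_flows)   # (H, W, T) channels
    bwd = stack_optical_flow(backward_flows)  # (H, W, T) channels
    return np.concatenate([fwd, bwd], axis=-1)
```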

3.4 Training Details

It is generally beneficial to zero-center the network input, as it allows the model to better exploit the rectification non-linearities. In our case, the displacement vector field components can be dominated by a particular displacement, e.g. one caused by camera movement. Rather than explicitly estimating and compensating for the camera motion, we consider a simpler approach: from each displacement field d we subtract its mean vector. Because the video datasets are small, multi-task learning is used to combat overfitting: the CNN architecture is modified so that it has two softmax classification layers on top of the last fully-connected layer, one for each dataset.
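
A minimal sketch of the per-field mean subtraction, under the same flow-array assumptions as above (the multi-task part is simply two parallel softmax heads sharing all earlier layers):

```python
import numpy as np

def subtract_mean_displacement(flow):
    """Subtract the mean displacement vector from one flow field of shape
    (H, W, 2), reducing the effect of global (camera) motion."""
    return flow - flow.mean(axis=(0, 1), keepdims=True)
```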

5 Results

Stacking multiple (T > 1) displacement fields in the input is highly beneficial, as it provides the network with long-term motion information. Mean subtraction is helpful, as it reduces the effect of global motion between the frames. Optical flow stacking performs better than trajectory stacking, and bi-directional optical flow is only slightly better than uni-directional forward flow. The temporal CNN significantly outperforms the spatial CNN, which confirms the importance of motion information for action recognition.

The temporal and spatial recognition streams are complementary, as their fusion significantly improves on both.
