
[Deep Learning Paper Notes][Video Classification] Beyond Short Snippets: Deep Networks for Video Classification

Yue-Hei Ng, Joe, et al. “Beyond short snippets: Deep networks for video classification.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. (Citations: 171).

1 Architecture

See Fig. We propose processing only one frame per second and incorporating explicit motion information in the form of optical flow images computed over adjacent frames.
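
As a concrete illustration of the optical flow input, here is a minimal sketch that computes dense flow between adjacent grayscale frames with OpenCV's Farneback method. The paper does not commit to this particular algorithm or these parameters, so treat them as assumptions.

```python
# Sketch only: dense optical flow between adjacent frames via Farneback.
# The flow algorithm and all parameters below are illustrative assumptions,
# not the paper's exact pipeline.
import cv2
import numpy as np

def flow_images(frames):
    """frames: list of HxW grayscale uint8 arrays sampled from a video."""
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Args: pyr_scale=0.5, levels=3, winsize=15, iterations=3,
        # poly_n=5, poly_sigma=1.2, flags=0 -> one (dx, dy) per pixel.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Rescale to [0, 255] so the flow can be fed to a CNN as an
        # ordinary image (a common convention, assumed here).
        flow = cv2.normalize(flow, None, 0, 255, cv2.NORM_MINMAX)
        flows.append(flow.astype(np.uint8))
    return flows
```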

In all methods, the CNN shares parameters across frames. We found that initializing from a model trained on raw image frames helps in classifying optical flow images, allowing faster convergence than training from scratch.
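
A minimal PyTorch sketch of this parameter sharing, with a tiny stand-in encoder in place of the paper's AlexNet/GoogLeNet towers: a single CNN instance processes every frame, so all time steps reuse the same weights.

```python
# One shared frame encoder (stand-in for the paper's CNN towers).
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())

video = torch.randn(30, 3, 224, 224)  # 30 frames (e.g. one per second)
frame_features = cnn(video)           # identical weights for every frame
print(frame_features.shape)           # torch.Size([30, 16])
```

Initializing the optical flow network from this model then amounts to copying the shared weights before fine-tuning on flow images.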



2 Feature Pooling

The feature pooling networks independently process each frame using a CNN and then combine frame-level information using various pooling layers (the different pooling architectures can be seen in Fig.). We found that both average pooling and a fully connected layer for pooling failed to learn effectively due to the large number of gradients that they generate. Max-pooling generates much sparser updates and, as a result, tends to yield networks that learn faster, since the gradient update is generated by a sparse set of features from each frame.
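
The gradient-sparsity argument can be checked directly. In the toy setup below (assumed shapes), max pooling over time routes a gradient to only the winning frame per feature, while average pooling updates every frame:

```python
# Toy check: gradient sparsity of max vs. average pooling over time.
import torch

feats = torch.randn(30, 512, requires_grad=True)  # 30 frames x 512 features

feats.max(dim=0).values.sum().backward()
print((feats.grad != 0).float().mean())  # ~1/30: one winning frame per feature

feats.grad = None
feats.mean(dim=0).sum().backward()
print((feats.grad != 0).float().mean())  # 1.0: every frame gets a gradient
```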



[Conv Pooling] Pooling over the final convolutional layer across the video’s frames. The spatial information in the output of the convolutional layer is preserved through a max operation over the time domain (see the sketch after this list).

[Late Pooling] First passes convolutional features through two fc layers before applying the max-pooling layer.

[Slow Pooling] Max-pooling is first applied over 10-frame windows of convolutional features with stride 5, and each pooled window passes through an fc layer. In the second stage, a single max-pooling layer combines the outputs of all fc layers.

[Local Pooling] Contains only a single stage of max-pooling after the convolutional layers.

[Time-Domain Convolution] Contains an extra time-domain convolutional layer before feature pooling across frames.
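
A minimal sketch of Conv Pooling under assumed shapes: with final conv feature maps of size (T, C, H, W), the max over the time axis collapses T while keeping the C x H x W spatial layout intact.

```python
# Conv Pooling sketch: max over time preserves spatial structure.
import torch

conv_maps = torch.randn(30, 512, 7, 7)  # T=30 frames of final conv features
pooled = conv_maps.max(dim=0).values    # shape (512, 7, 7)
print(pooled.shape)                     # time collapsed, space preserved
```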

3 Results

We find that Conv Pooling provides the best results. Late Pooling performs worse than all other methods, indicating that it is important to preserve spatial information while performing the pooling operation across the time domain. Time-Domain Convolution also gives inferior results, suggesting that a single time-domain convolutional layer is not effective at learning temporal relations on high-level features; this motivates exploring more sophisticated architectures such as LSTMs, which learn from temporal sequences.
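
A hedged sketch of the LSTM alternative: frame-level CNN features are treated as a temporal sequence and fed to a stacked LSTM (the paper uses five layers of 512 cells each), whose output drives the classifier. The class count and single-step readout below are illustrative simplifications.

```python
# LSTM over frame features; sizes follow the paper's five 512-cell layers,
# but the 101-class head and last-step readout are illustrative choices.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=512, hidden_size=512, num_layers=5)
classifier = nn.Linear(512, 101)       # e.g. UCF-101's 101 classes

frame_feats = torch.randn(30, 1, 512)  # (time, batch, feature)
outputs, _ = lstm(frame_feats)
logits = classifier(outputs[-1])       # read out the final time step
print(logits.shape)                    # torch.Size([1, 101])
```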

Using LSTMs on both image frames and optical flow yields the highest published performance at the time.
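
The notes do not spell out the fusion scheme, so the sketch below simply averages the class probabilities of the frame-based and flow-based networks; this is a common two-stream convention, not necessarily the paper's exact weighting.

```python
# Assumed fusion: average the two streams' class probabilities.
import torch

p_rgb = torch.softmax(torch.randn(1, 101), dim=1)   # image-frame stream
p_flow = torch.softmax(torch.randn(1, 101), dim=1)  # optical-flow stream
prediction = ((p_rgb + p_flow) / 2).argmax(dim=1)
```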