您的位置：首页 > 其它

[深度学习论文笔记][Video Classification] Large-scale Video Classification with Convolutional Neural Networks

2016-11-16 10:16 931 查看

Karpathy, Andrej, et al. “Large-scale video classification with convolutional neural networks.” Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014. (Citations: 654).

1 Spatio-Temporal CNN

We treat every video as a bag of short, fixed-sized clips (15 frames in our case). Since each clip contains several contiguous frames in time, we can extend the connectivity of the network in time dimension to learn spatio-temporal features. There are four
fuse information across temporal domain. See Fig.

[Single-frame] Process each single frame independently.

Late Fustion] Place two separate single-frame neworks with shared parameters a distance of 15 frames apart, and then merges the two streams in the first fully connected layer, which can compute global motion characteristics by comparing outputs of both networks.

[Early Fusion] Modify the filters on conv1 in the single-frame network by extending them to be size (DT) × F H × F W .

[Slow Fusion] This is a balance between late fustion and slow fusion, in which higher layers get access to progressively more global information in both spatial and temporal dimensions. This is implemented by extending the connectivity of all convolutional
layers in time dimension and carrying out temporal convolutions in addition to spatial convolutions to compute activations. This model turns to work best.

2 Multi-resolution CNNs

We want to speed up the networks. However, simply reducing the nuber of layers or neurons or training with lower resolution will hurt the performance. We proposed multi-
resolution CNN which composed by two separate streams. The context stream receives the downsampled frames at half the original spatial resolution (89 × 89 pixels), while the fovea stream receives the center 89 × 89 region at the original resolution. In this
way, the the total input dimensionality is halved. Notably, this design takes advantage of the camera bias present in many online videos, since the object of interest often occupies the center regio he activations from both streams are concatenated and fed
into the first fully connected layer with dense connections. See Fig.

3 Results

The single-frame model already displays strong performance, suggesting that local motion cues may not be critically important.

4 References

[1]. https://www.youtube.com/watch?v=qrzQ_AB1DZk.
[2]. http://techtalks.tv/talks/large-scale-video-classification-with-convolutional-neural-networks-2/60272/.

[3]. https://vimeo.com/101555393.

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签： Video Classification CNN Computer Vision Deep Learning Papers

相关文章推荐

新的分享

章节导航