[Deep Learning Paper Notes][Video Classification] Large-scale Video Classification with Convolutional Neural Networks
2016-11-16 10:16
Karpathy, Andrej, et al. “Large-scale video classification with convolutional neural networks.” Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014. (Citations: 654).
1 Spatio-Temporal CNN
We treat every video as a bag of short, fixed-size clips (15 frames in our case). Since each clip contains several contiguous frames in time, we can extend the connectivity of the network in the time dimension to learn spatio-temporal features. There are four ways to fuse information across the temporal domain. See Fig.
[Single-frame] Process each single frame independently.
[Late Fusion] Place two separate single-frame networks with shared parameters 15 frames apart, and merge the two streams in the first fully connected layer, which can compute global motion characteristics by comparing the outputs of both networks.
[Early Fusion] Modify the filters on conv1 in the single-frame network by extending them to size (D·T) × F_H × F_W, reading D as the per-frame input depth, T as the temporal extent (number of frames combined), and F_H × F_W as the spatial filter size.
[Slow Fusion] This is a balance between early fusion and late fusion, in which higher layers get access to progressively more global information in both the spatial and temporal dimensions. It is implemented by extending the connectivity of all convolutional layers in the time dimension and carrying out temporal convolutions in addition to spatial convolutions to compute activations. This model turns out to work best.
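To make the shapes behind early and slow fusion concrete, here is a minimal sketch in plain Python. It only does shape arithmetic and does not implement any network. The helper names are hypothetical, and the numbers used (3 input channels, 11 × 11 conv1 filters, a temporal extent of 10 for early fusion, and temporal extents 4, 2, 2 with stride 2 over a 10-frame clip for slow fusion) are assumptions for illustration rather than values stated in these notes.

```python
def early_fusion_filter_shape(channels, temporal_extent, filter_h, filter_w):
    """Shape of one conv1 filter when T frames are stacked along the channel axis."""
    return (channels * temporal_extent, filter_h, filter_w)


def temporal_output_length(num_frames, temporal_extent, stride):
    """Number of time steps left after a valid (unpadded) temporal convolution."""
    if num_frames < temporal_extent:
        raise ValueError("clip is shorter than the temporal extent")
    return (num_frames - temporal_extent) // stride + 1


def slow_fusion_schedule(num_frames, layers):
    """Time steps remaining after each temporally extended layer.

    `layers` is a list of (temporal_extent, stride) pairs, one per layer.
    """
    lengths = [num_frames]
    for temporal_extent, stride in layers:
        lengths.append(temporal_output_length(lengths[-1], temporal_extent, stride))
    return lengths


# Single-frame baseline: T = 1, so the filter depth is just the channel count.
single_frame = early_fusion_filter_shape(3, 1, 11, 11)

# Early fusion: conv1 sees all T frames at once (assumed T = 10).
early = early_fusion_filter_shape(3, 10, 11, 11)

# Slow fusion: the temporal dimension shrinks layer by layer (assumed schedule).
slow = slow_fusion_schedule(10, [(4, 2), (2, 2), (2, 2)])

print(single_frame, early, slow)
```

Under these assumptions early fusion collapses time in a single step at conv1, while the slow-fusion schedule only reaches a single time step after the third layer, which is the sense in which higher layers see progressively more temporal context.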
2 Multi-resolution CNNs
We want to speed up the networks. However, simply reducing the number of layers or neurons, or training at a lower resolution, hurts performance. We propose a multi-resolution CNN composed of two separate streams. The context stream receives the frames downsampled to half the original spatial resolution (89 × 89 pixels), while the fovea stream receives the center 89 × 89 region at the original resolution. In this way, the total input dimensionality is halved. Notably, this design takes advantage of the camera bias present in many online videos, since the object of interest often occupies the center region. The activations from both streams are concatenated and fed into the first fully connected layer with dense connections. See Fig.
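The input split described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' preprocessing: the frame is a single-channel 178 × 178 grid (178 is implied by the 89 × 89 half-resolution figure), the downsampling is done by 2 × 2 average pooling (the notes do not say which method is used), and all function names are hypothetical.

```python
def center_crop(frame, size):
    """Return the central size x size region of a 2-D frame (list of rows)."""
    height, width = len(frame), len(frame[0])
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in frame[top:top + size]]


def downsample_half(frame):
    """Halve the resolution with 2 x 2 average pooling (an assumed method)."""
    out_h, out_w = len(frame) // 2, len(frame[0]) // 2
    return [
        [
            (frame[2 * i][2 * j] + frame[2 * i][2 * j + 1]
             + frame[2 * i + 1][2 * j] + frame[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def multires_inputs(frame, fovea_size):
    """Build the (context, fovea) pair fed to the two streams."""
    return downsample_half(frame), center_crop(frame, fovea_size)


# Synthetic 178 x 178 frame whose pixel value encodes its own position.
frame = [[float(r * 178 + c) for c in range(178)] for r in range(178)]
context, fovea = multires_inputs(frame, 89)

pixels_in = len(frame) * len(frame[0])
pixels_out = len(context) * len(context[0]) + len(fovea) * len(fovea[0])
print(pixels_in, pixels_out)
```

Both streams end up with 89 × 89 inputs, so together they carry half as many pixels as the original frame, which matches the halved input dimensionality claimed above.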
3 Results
The single-frame model already displays strong performance, suggesting that local motion cues may not be critically important.
4 References
[1]. https://www.youtube.com/watch?v=qrzQ_AB1DZk.
[2]. http://techtalks.tv/talks/large-scale-video-classification-with-convolutional-neural-networks-2/60272/.
[3]. https://vimeo.com/101555393.