
【深度学习论文翻译】Learning Spatiotemporal Features with 3D Convolutional Networks全文对照翻译

用3D卷积网络学习时空特征
摘要

We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable
for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional
3D), with a simple linear classifier outperform state-of-the-art methods on 4 different benchmarks and are comparable with current best methods on the other 2 benchmarks.

In addition, the features are compact: achieving 52.8% accuracy on UCF101 dataset with only 10 dimensions and also very efficient to compute due to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.

我们提出了一种简单而有效的方法,利用在大规模有监督视频数据集上训练的深层三维卷积网络(3D ConvNets)进行时空特征学习。我们有三点发现:

1)与二维ConvNets相比,3D ConvNets更适合于时空特征学习;

2)在所有层中使用3×3×3小卷积核的均匀架构是3D ConvNets中性能最好的架构之一;

3)我们学习到的特征,即C3D(Convolutional 3D),配合简单的线性分类器,在4个不同的基准上优于最先进的方法,并且在其他2个基准上与当前最佳方法相当。

另外,这些特征非常紧凑:仅用10维特征就在UCF101数据集上达到52.8%的精度;并且由于ConvNets的快速推理,其计算效率也非常高。最后,它们在概念上很简单,易于训练和使用。

 

1.介绍

Multimedia on the Internet is growing rapidly resulting in an increasing number of videos being shared every minute. To combat the information explosion it is essential to understand and analyze these videos for various purposes like search, recommendation,
ranking etc. The computer vision community has been working on video analysis for decades and tackled different problems such as action recognition, abnormal event detection, and activity understanding. Considerable progress has been made in these individual
problems by employing different specific solutions. However, there is still a growing need for a generic video descriptor that helps in solving large-scale video tasks in a homogeneous way.

互联网上的多媒体正在快速增长,每分钟都有越来越多的视频被分享。为了应对信息爆炸,必须理解和分析这些视频,用于搜索、推荐、排名等各种用途。计算机视觉社区已经对视频分析进行了数十年的研究,解决了动作识别、异常事件检测和活动理解等不同问题。通过采用不同的专门解决方案,在这些单独的问题上已取得相当大的进展。然而,人们仍然越来越需要一种通用的视频描述符,以统一的方式帮助解决大规模视频任务。

 

There are four properties for an effective video descriptor:

(i) it needs to be generic, so that it can represent different types of videos well while being discriminative. For example, Internet videos can be of landscapes, natural scenes, sports, TV shows, movies, pets, food and so on;

(ii) the descriptor needs to be compact: as we are working with millions of videos, a compact descriptor helps make processing, storing, and retrieving tasks much more scalable;

(iii) it needs to be efficient to compute, as thousands of videos are expected to be processed every minute in real world systems; and

(iv) it must be simple to implement. Instead of using complicated feature encoding methods and classifiers, a good descriptor should work well even with a simple model (e.g. linear classifier).

有效的视频描述符有四个属性:

(i)它需要是通用的,所以它可以代表不同类型的视频,同时具有识别力。例如,互联网视频可以是景观、自然场景、体育、电视节目、电影、宠物、食物等;

(ii)描述符需要紧凑:由于我们要处理数百万个视频,紧凑的描述符能使处理、存储和检索任务更具可扩展性;

(iii)需要有效的计算,因为在现实系统中每分钟预计会处理数千个视频;

(iv)实现必须很简单。代替使用复杂的特征编码方法和分类器,即使使用简单的模型(例如线性分类器),一个好的描述符也应该很好地工作。

 

Inspired by the deep learning breakthroughs in the image domain where rapid progress has been made in the past few years in feature learning, various pre-trained convolutional network (ConvNet) models are made available for extracting image features. These
features are the activations of the network’s last few fully-connected layers which perform well on transfer learning tasks. However, such image based deep features are not directly suitable for videos due to lack of motion modeling (as shown in our experiments
in sections 4,5,6). In this paper we propose to learn spatio-temporal features using deep 3D ConvNet. We empirically show that these learned features with a simple linear classifier can yield good performance on various video analysis tasks. Although 3D ConvNets
were proposed before, to our knowledge this work exploits 3D ConvNets in the context of large-scale supervised training datasets and modern deep architectures to achieve the best performance on different types of video analysis tasks. The features from these
3D ConvNets encapsulate information related to objects, scenes and actions in a video, making them useful for various tasks without requiring finetuning of the model for each task. C3D has the properties that a good descriptor should have: it is generic, compact,
simple and efficient. To summarize, our contributions in this paper are:

• We experimentally show 3D convolutional deep networks are good feature learning machines that model appearance and motion simultaneously.

• We empirically find that 3 × 3 × 3 convolution kernel for all layers to work best among the limited set of explored architectures.

• The proposed features with a simple linear model outperform or approach the current best methods on 4 different tasks and 6 different benchmarks (see Table 1). They are also compact and efficient to compute.

受到图像领域深度学习突破的启发(过去几年在特征学习方面进展迅速),各种预训练卷积网络(ConvNet)模型被公开提供,用于提取图像特征。这些特征是网络最后几个全连接层的激活,在迁移学习任务上表现良好。然而,由于缺乏运动建模,这种基于图像的深层特征并不直接适用于视频(如第4,5,6节中的实验所示)。在本文中,我们提出使用深度3D ConvNet来学习时空特征。我们通过实验表明,这些学习到的特征配合简单的线性分类器可以在各种视频分析任务上取得良好的性能。尽管3D ConvNets早已被提出,但据我们所知,这项工作在大规模监督训练数据集和现代深度架构的背景下利用3D ConvNets,在不同类型的视频分析任务上实现了最佳性能。这些来自3D ConvNets的特征封装了与视频中的对象、场景和动作相关的信息,使其对各种任务有用,而无需为每个任务微调模型。C3D具有良好描述符应具备的属性:它是通用的、紧凑的、简单的和高效的。总而言之,我们在本文中的贡献是:

•我们实验表明3D卷积深度网络是同时对外观和运动进行建模的良好特征学习机。

•我们通过实验发现,在我们探索的有限架构集合中,所有层使用3×3×3卷积核效果最好。

•所提出的特征配合简单的线性模型,在4个不同任务和6个不同基准上优于或接近当前最佳方法(见表1)。它们也很紧凑,计算高效。

 

Table 1. C3D compared to best published results. C3D outperforms all previous best reported methods on a range of benchmarks except for Sports-1M and UCF101. On UCF101, we report accuracy for two groups of methods. The first set of methods use only RGB frame
inputs while the second set of methods (in parentheses) use all possible features (e.g. optical flow, improved Dense Trajectory).

表1. C3D与最佳公布结果相比较。 除了Sports-1M和UCF101之外,C3D在一系列基准上胜过以前最好的报告方法。 在UCF101上,我们报告了两组方法的准确性。 第一组方法仅使用RGB帧输入,而第二组方法(括号中)使用所有可能的特征(例如光流、改进的密集轨迹)。

 

2.相关工作

Videos have been studied by the computer vision community for decades. Over the years various problems like action recognition, anomaly detection, video retrieval, event and action detection, and many more have been proposed. A considerable portion of these works are about video representations. Laptev and Lindeberg proposed spatio-temporal interest points (STIPs) by extending Harris corner detectors to 3D. SIFT and HOG are also extended into SIFT-3D and HOG3D for action recognition. Dollar et al. proposed Cuboids features for behavior recognition. Sadanand and Corso built ActionBank for action recognition. Recently, Wang et al. proposed improved Dense Trajectories (iDT) which is currently the state-of-the-art hand-crafted feature. The iDT descriptor is an interesting example showing that temporal signals can be handled differently from spatial signals. Instead of extending the Harris corner detector into 3D, it starts with densely-sampled feature points in video frames and uses optical flows to track them. For each tracked corner, different hand-crafted features are extracted along the trajectory. Despite its good performance, this method is computationally intensive and becomes intractable on large-scale datasets.

视频已被计算机视觉界研究了几十年。多年以来,人们提出了动作识别、异常检测、视频检索、事件和动作检测等诸多问题。这些工作中有相当一部分是关于视频表示的。Laptev和Lindeberg通过将Harris角点检测器扩展到3D,提出了时空兴趣点(STIPs)。SIFT和HOG也被扩展为SIFT-3D和HOG3D用于动作识别。Dollar等人提出了用于行为识别的Cuboids特征。Sadanand和Corso建立了ActionBank来进行动作识别。最近,Wang等人提出了改进的密集轨迹(iDT),这是目前最先进的人为设计特征。iDT描述符是一个有趣的例子,它表明时间信号可以用与空间信号不同的方式处理。它不是将Harris角点检测器扩展到3D,而是从视频帧中密集采样的特征点开始,使用光流来跟踪它们。对每个被跟踪的点,沿其轨迹提取不同的人为设计特征。尽管性能很好,但这种方法计算量很大,并且在大规模数据集上变得难以处理。

 

With the recent availability of powerful parallel machines (GPUs, CPU clusters), together with large amounts of training data, convolutional neural networks (ConvNets) have made a comeback, providing breakthroughs on visual recognition. ConvNets have also been applied to the problem of human pose estimation in both images and videos. More interestingly, these deep networks are used for image feature learning. Similarly, the deep features of Zhou et al. perform well on transfer learning tasks. Deep learning has also been applied to video feature learning in an unsupervised setting. In Le et al., the authors use stacked ISA to learn spatio-temporal features for videos. Although this method showed good results on action recognition, it is still computationally intensive at training and hard to scale up for testing on large datasets. 3D ConvNets were proposed for human action recognition and for medical image segmentation. 3D convolution was also used with Restricted Boltzmann Machines to learn spatiotemporal features. Recently, Karpathy et al. trained deep networks on a large video dataset for video classification. Simonyan and Zisserman used two-stream networks to achieve the best results on action recognition.

随着最近强大的并行机器(GPU、CPU集群)以及大量训练数据的可用,卷积神经网络(ConvNets)重新兴起,在视觉识别上取得了突破。ConvNets也被应用于图像和视频中的人体姿态估计问题。更有趣的是,这些深层网络被用于图像特征学习。同样,Zhou等人的深度特征在迁移学习任务上也表现良好。深度学习也被应用于无监督设置下的视频特征学习。在Le等人的研究中,作者使用堆叠的ISA来学习视频的时空特征。虽然这种方法在动作识别方面表现出良好的效果,但在训练上仍然是计算密集型的,并且难以扩展到大规模数据集上进行测试。3D ConvNets曾被提出用于人类动作识别和医学图像分割。3D卷积也曾与受限玻尔兹曼机结合用于学习时空特征。最近,Karpathy等人在大型视频数据集上训练深度网络进行视频分类。Simonyan和Zisserman使用双流网络获得了最佳的动作识别结果。

 

Among these approaches, the 3D ConvNets approach is most closely related to us. This method used a human detector and head tracking to segment human subjects in videos. The segmented video volumes are used as inputs for a 3-convolution-layer 3D ConvNet to
classify actions. In contrast, our method takes full video frames as inputs and does not rely on any preprocessing, thus easily scaling to large datasets. We also share some similarities with Karpathy et al. and Simonyan and Zisserman in terms of using full
frames for training the ConvNet. However, these methods are built on using only 2D convolution and 2D pooling operations (except for the Slow Fusion model in [18]) whereas our model performs 3D convolutions and 3D pooling propagating temporal information across
all the layers in the network (further detailed in section 3). We also show that gradually pooling space and time information and building deeper networks achieves best results and we discuss more about the architecture search in section 3.2.

在这些方法中,3D ConvNets方法与我们的最为相关。该方法使用人体检测器和头部跟踪在视频中分割出人体,分割出的视频体积被用作一个3层卷积的3D ConvNet的输入,以对动作进行分类。相比之下,我们的方法将完整的视频帧作为输入,并且不依赖于任何预处理,因此容易扩展到大型数据集。在使用完整帧训练ConvNet这一点上,我们与Karpathy等人以及Simonyan和Zisserman也有一些相似之处。然而,这些方法仅使用2D卷积和2D池化操作([18]中的Slow Fusion模型除外),而我们的模型执行3D卷积和3D池化,在网络的所有层中传播时间信息(详见第3节)。我们还表明,逐步池化空间和时间信息并构建更深的网络可以取得最佳效果,我们将在3.2节中进一步讨论架构搜索。

 

3.使用3D ConvNets学习特征

In this section we explain in detail the basic operations of 3D ConvNets, analyze different architectures for 3D ConvNets empirically, and elaborate how to train them on largescale datasets for feature learning.

在本节中,我们详细介绍了3D ConvNets的基本操作,以经验为主地分析了3D ConvNets的不同体系结构,并阐述了如何在特征学习的大规模数据集上进行训练。

 

3.1. 3D卷积和池化

We believe that 3D ConvNet is well-suited for spatiotemporal feature learning. Compared to 2D ConvNet, 3D ConvNet has the ability to model temporal information better owing to 3D convolution and 3D pooling operations. In 3D ConvNets, convolution and pooling
operations are performed spatio-temporally while in 2D ConvNets they are done only spatially. Figure 1 illustrates the difference, 2D convolution applied on an image will output an image, 2D convolution applied on multiple images (treating them as different
channels) also results in an image. Hence, 2D ConvNets lose temporal information of the input signal right after every convolution operation. Only 3D convolution preserves the temporal information of the input signals resulting in an output volume. The same
phenomenon applies to 2D and 3D pooling. In [36], although the temporal stream network takes multiple frames as input, because of the 2D convolutions, temporal information is collapsed completely after the first convolution layer. Similarly, since the fusion
models in [18] use 2D convolutions, most of those networks lose their input's temporal signal after the first convolution layer. Only the Slow Fusion model in [18] uses 3D convolutions and average pooling in its first 3 convolution layers. We believe this
is the key reason why it performs best among all networks studied in [18]. However, it still loses all temporal information after the third convolution layer.

我们相信3D ConvNet非常适合时空特征学习。与2D ConvNet相比,3D ConvNet能够通过3D卷积和3D池化操作更好地建模时间信息。在3D ConvNets中,卷积和池化操作在时空上执行,而在2D ConvNets中,它们仅在空间上完成。图1展示了这一差异:应用于图像的2D卷积会输出一个图像,应用于多个图像(将它们视为不同通道)的2D卷积也会输出一个图像。因此,2D ConvNets在每次卷积运算之后就会丢失输入信号的时间信息。只有3D卷积才能保留输入信号的时间信息,从而产生一个输出体积。相同的现象也适用于2D和3D池化。在[36]中,虽然时间流网络采用多个帧作为输入,但是由于2D卷积,在第一个卷积层之后,时间信息被完全坍缩。类似地,[18]中的融合模型使用2D卷积,其中大多数网络在第一个卷积层之后就失去了输入的时间信号。只有[18]中的Slow Fusion模型在其前3个卷积层中使用3D卷积和平均池化。我们认为这是它在[18]所研究的所有网络中表现最好的关键原因。然而,它仍然在第三个卷积层之后失去所有时间信息。

 

Figure 1. 2D and 3D convolution operations. a) Applying 2D convolution on an image results in an image. b) Applying 2D convolution on a video volume (multiple frames as multiple channels) also results in an image. c) Applying 3D convolution on a video volume results in another volume, preserving temporal information of the input signal.

图1. 2D和3D卷积运算。a)在图像上应用2D卷积会产生一个图像。b)在视频体积上应用2D卷积(多个帧作为多个通道)也会产生一个图像。c)在视频体积上应用3D卷积会产生另一个体积,保留了输入信号的时间信息。
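To make the difference illustrated in Figure 1 concrete, here is a minimal sketch (PyTorch is assumed purely for illustration; the original C3D implementation is in Caffe) contrasting the output shapes of 2D and 3D convolution on a 16-frame RGB clip:

import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)            # batch, channels, frames, height, width

# 2D convolution: the 16 frames must be folded into channels, so the temporal
# axis disappears after a single layer and the output is just an image.
conv2d = nn.Conv2d(in_channels=3 * 16, out_channels=64, kernel_size=3, padding=1)
print(conv2d(clip.reshape(1, 3 * 16, 112, 112)).shape)   # torch.Size([1, 64, 112, 112])

# 3D convolution: the temporal axis is preserved, so motion information can
# propagate through deeper layers as an output volume.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)
print(conv3d(clip).shape)                                 # torch.Size([1, 64, 16, 112, 112])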

 

In this section, we empirically try to identify a good architecture for 3D ConvNets. Because training deep networks on large-scale video datasets is very time-consuming, we first experiment with UCF101, a medium-scale dataset, to search for the best architecture.
We verify the findings on a large scale dataset with a smaller number of network experiments. According to the findings in 2D ConvNet [37], small receptive fields of 3 × 3 convolution kernels with deeper architectures yield best results. Hence, for our architecture
search study we fix the spatial receptive field to 3 × 3 and vary only the temporal depth of the 3D convolution kernels.

在本节中,我们通过实验尝试为3D ConvNets确定一个良好的架构。由于在大规模视频数据集上训练深层网络非常耗时,我们首先在中等规模的数据集UCF101上实验,以搜索最佳架构,然后在大规模数据集上用较少的网络实验验证这些发现。根据2D ConvNet [37]的研究结果,采用更深的体系结构和3×3卷积核的小感受野能产生最佳效果。因此,在我们的架构搜索研究中,我们将空间感受野固定为3×3,仅改变3D卷积核的时间深度。

 

Notations: For simplicity, from now on we refer to video clips with a size of c × l × h × w where c is the number of channels, l is length in number of frames, h and w are the height and width of the frame, respectively. We also refer to 3D convolution and pooling kernel size as d × k × k, where d is kernel temporal depth and k is kernel spatial size.

符号:为了简单起见,从现在开始,我们用c×l×h×w表示视频片段的尺寸,其中c是通道数,l是帧数,h和w分别是帧的高度和宽度。我们还用d×k×k表示3D卷积和池化核的大小,其中d是核的时间深度,k是核的空间大小。

Common network settings: In this section we describe the network settings that are common to all the networks we trained. The networks are set up to take video clips as inputs and predict the class labels which belong to 101 different actions. All video
frames are resized into 128 × 171. This is roughly half resolution of the UCF101 frames. Videos are split into non-overlapped 16-frame clips which are then used as input to the networks. The input dimensions are 3 × 16 × 128 × 171. We also use jittering by
using random crops with a size of 3 × 16 × 112 × 112 of the input clips during training. The networks have 5 convolution layers and 5 pooling layers (each convolution layer is immediately followed by a pooling layer), 2 fully-connected layers and a softmax
loss layer to predict action labels. The number of filters for 5 convolution layers from 1 to 5 are 64, 128, 256, 256, 256, respectively.

常用网络设置:在本节中,我们将介绍我们训练的所有网络通用的网络设置。网络被设置为以视频片段作为输入,并预测属于101个不同动作的类标签。所有视频帧都被调整为128×171,约为UCF101原始帧分辨率的一半。视频被分割成不重叠的16帧片段,作为网络的输入。输入尺寸为3×16×128×171。我们还在训练期间对输入片段进行尺寸为3×16×112×112的随机裁剪,以实现抖动。网络具有5个卷积层和5个池化层(每个卷积层后紧跟一个池化层)、2个全连接层和一个用于预测动作标签的softmax损失层。第1到第5个卷积层的滤波器数量分别为64、128、256、256、256。

 

All convolution kernels have a size of d where d is the kernel temporal depth (we will later vary the value d of these layers to search for a good 3D architecture). All of these convolution layers are applied with appropriate padding (both spatial and temporal)
and stride 1, thus there is no change in terms of size from the input to the output of these convolution layers. All pooling layers are max pooling with kernel size 2 × 2 × 2 (except for the first layer) and stride 2, which means the size of the output signal is reduced by a factor of 8 compared with the input signal. The first pooling layer has kernel size 1 × 2 × 2 with the intention of not merging the temporal signal too early and also to satisfy the clip length of 16 frames (e.g. we can temporally pool with factor
2 at most 4 times before completely collapsing the temporal signal). The two fully connected layers have 2048 outputs. We train the networks from scratch using mini-batches of 30 clips, with initial learning rate of 0.003. The learning rate is divided by 10
after every 4 epochs. The training is stopped after 16 epochs.

所有卷积核都具有大小d,其中d是核的时间深度(稍后我们将改变这些层的d值以搜索良好的3D体系结构)。所有这些卷积层都应用适当的填充(空间和时间)和步长1,因此这些卷积层从输入到输出的尺寸没有变化。所有池化层都是核大小为2×2×2(第一层除外)、步长为2的最大池化,这意味着与输入信号相比,输出信号的大小缩小为原来的八分之一。第一个池化层的核大小为1×2×2,目的是不过早地合并时间信号,同时满足16帧的片段长度(即在时间信号被完全坍缩之前,我们最多可以在时间维度上进行4次因子为2的池化)。两个全连接层各有2048个输出。我们从头开始使用每批30个片段的小批量训练网络,初始学习率为0.003,学习率每4个epoch除以10,训练在16个epoch后停止。
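As an illustration of these common settings, the following is a hedged sketch (PyTorch assumed, not the authors' Caffe implementation) of the 5-convolution-layer search network; the kernel temporal depth of each layer is left as a parameter because it is the quantity varied in the architecture search below:

import torch
import torch.nn as nn

def make_search_net(temporal_depths, num_classes=101):
    """temporal_depths: kernel temporal depth d_i of each of the 5 conv layers,
    e.g. [3, 3, 3, 3, 3] for depth-3 or [3, 3, 5, 5, 7] for the increasing variant."""
    filters = [64, 128, 256, 256, 256]
    layers, in_ch = [], 3
    for i, (d, out_ch) in enumerate(zip(temporal_depths, filters)):
        layers += [
            nn.Conv3d(in_ch, out_ch, kernel_size=(d, 3, 3), padding=(d // 2, 1, 1)),  # size-preserving padding
            nn.ReLU(inplace=True),
            # pool1 keeps the temporal length (1x2x2); the later pools halve all three dimensions
            nn.MaxPool3d((1, 2, 2) if i == 0 else (2, 2, 2)),
        ]
        in_ch = out_ch
    return nn.Sequential(
        *layers, nn.Flatten(),
        nn.Linear(256 * 1 * 3 * 3, 2048), nn.ReLU(inplace=True),  # a 16x112x112 crop ends up as a 1x3x3 map here
        nn.Linear(2048, 2048), nn.ReLU(inplace=True),
        nn.Linear(2048, num_classes),                             # softmax loss applied via cross-entropy at training time
    )

net = make_search_net([3, 3, 3, 3, 3])
print(net(torch.randn(2, 3, 16, 112, 112)).shape)                 # torch.Size([2, 101])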

 

Varying network architectures: For the purposes of this study we are mainly interested in how to aggregate temporal information through the deep networks. To search for a good 3D ConvNet architecture, we only vary kernel temporal depth di of the convolution
layers while keeping all other common settings fixed as stated above. We experiment with two types of architectures: 1) homogeneous temporal depth: all convolution layers have the same kernel temporal depth; and 2) varying temporal depth: kernel temporal depth
is changing across the layers. For homogeneous setting, we experiment with 4 networks having kernel temporal depth of d equal to 1, 3, 5, and 7. We name these networks as depth-d, where d is their homogeneous temporal depth. Note that depth-1 net is equivalent
to applying 2D convolutions on separate frames. For the varying temporal depth setting, we experiment with two networks with temporal depth increasing: 3-3-5-5-7 and decreasing: 7-5-5-3-3 from the first to the fifth convolution layer respectively. We note that
all of these networks have the same size of the output signal at the last pooling layer, thus they have the same number of parameters for fully connected layers. Their number of parameters is only different at convolution layers due to different kernel temporal
depth. These differences are quite minute compared to millions of parameters in the fully connected layers. For example, any two of the above nets with a temporal depth difference of 2 differ from each other by only 17K parameters. The biggest difference in the number of parameters is between the depth-1 net and the depth-7 net, where the depth-7 net has 51K more parameters, which is less than 0.3% of the total of 17.5 million parameters of each network. This indicates that the learning capacity of the networks is comparable
and the differences in number of parameters should not affect the results of our architecture search.

不同的网络架构:就本研究的目的而言,我们主要关注如何通过深层网络聚合时间信息。为了寻找一个好的3D ConvNet架构,我们只改变卷积层的核时间深度di,同时保持上述所有其他通用设置不变。我们尝试两种类型的架构:1)均匀时间深度:所有卷积层具有相同的核时间深度;2)变化的时间深度:核时间深度在层之间变化。对于均匀设置,我们试验了核时间深度d等于1、3、5和7的4个网络。我们将这些网络命名为depth-d,其中d是其均匀时间深度。请注意,depth-1网络相当于在单独的帧上应用2D卷积。对于变化的时间深度设置,我们试验了两个网络,从第一到第五个卷积层,时间深度分别为递增的3-3-5-5-7和递减的7-5-5-3-3。我们注意到,所有这些网络在最后一个池化层具有相同大小的输出信号,因此它们的全连接层具有相同数量的参数。由于核时间深度不同,它们的参数数量仅在卷积层上有所不同。与全连接层中的数百万个参数相比,这些差异相当微小。例如,上述任意两个时间深度相差2的网络之间,参数数量仅相差17K。参数数量的最大差异在depth-1网络和depth-7网络之间:depth-7网络多出51K个参数,不到每个网络约1750万总参数量的0.3%。这表明这些网络的学习能力是可比的,参数数量的差异不应影响我们架构搜索的结果。

Figure 2. 3D convolution kernel temporal depth search. Action recognition clip accuracy on UCF101 test split-1 of different kernel temporal depth settings. 2D ConvNet performs worst and 3D ConvNet with 3 × 3 × 3 kernels performs best among the experimented
nets.

图2. 3D卷积核时间深度搜索。不同核时间深度设置在UCF101测试split-1上的动作识别片段精度。在实验的网络中,2D ConvNet表现最差,使用3×3×3核的3D ConvNet表现最佳。

 

3.2.探索内核时间深度

We train these networks on the train split 1 of UCF101. Figure 2 presents clip accuracy of different architectures on UCF101 test split 1. The left plot shows results of nets with homogeneous temporal depth and the right plot presents results of nets with changing kernel temporal depth. Depth-3 performs best among the homogeneous nets. Note that depth-1 is significantly worse than the other nets, which we believe is due to lack of motion modeling. Compared to the varying temporal depth nets, depth-3 is the best
performer, but the gap is smaller. We also experiment with bigger spatial receptive field (e.g. 5 × 5) and/or full input resolution (240 × 320 frame inputs) and still observe similar behavior. This suggests 3 × 3 × 3 is the best kernel choice for 3D ConvNets
(according to our subset of experiments) and 3D ConvNets are consistently better than 2D ConvNets for video classification. We also verify that 3D ConvNet consistently performs better than 2D ConvNet on a large-scale internal dataset, namely I380K.

我们在UCF101的训练split-1上训练这些网络。图2显示了不同架构在UCF101测试split-1上的片段精度。左图显示了具有均匀时间深度的网络的结果,右图显示了变化核时间深度的网络的结果。depth-3在均匀网络中表现最好。请注意,depth-1明显比其他网络差,我们认为这是由于缺乏运动建模。与变化时间深度的网络相比,depth-3仍然表现最好,但差距较小。我们还尝试了更大的空间感受野(例如5×5)和/或全输入分辨率(240×320的帧输入),仍然观察到类似的行为。这表明3×3×3是3D ConvNets的最佳核选择(就我们的实验子集而言),并且在视频分类上3D ConvNets始终优于2D ConvNets。我们还在一个大规模内部数据集(即I380K)上验证了3D ConvNet的性能始终优于2D ConvNet。

 

3.3.时空特征学习

Network architecture: Our findings in the previous section indicate that homogeneous setting with convolution kernels of 3 × 3 × 3 is the best option for 3D ConvNets. This finding is also consistent with a similar finding in 2D ConvNets . With a large-scale
dataset, one can train a 3D ConvNet with 3×3×3 kernel as deep as possible subject to the machine memory limit and computation affordability. With current GPU memory, we design our 3D ConvNet to have 8 convolution layers, 5 pooling layers, followed by two fully
connected layers, and a softmax output layer. The network architecture is presented in figure 3. For simplicity, we call this net C3D from now on. All of 3D convolution filters are 3 × 3 × 3 with stride 1 × 1 × 1. All 3D pooling layers are 2 × 2 × 2 with stride
2 × 2 × 2 except for pool1 which has kernel size of 1 × 2 × 2 and stride 1 × 2 × 2 with the intention of preserving the temporal information in the early phase. Each fully connected layer has 4096 output units.

网络架构:上一节的发现表明,所有层使用3×3×3卷积核的均匀设置是3D ConvNets的最佳选择。这一发现也与2D ConvNets中的类似发现一致。对于大规模数据集,可以在机器内存限制和计算承受能力允许的范围内,训练尽可能深的3×3×3核3D ConvNet。基于目前的GPU内存,我们将3D ConvNet设计为具有8个卷积层、5个池化层,后接两个全连接层和一个softmax输出层。网络架构如图3所示。为简单起见,我们从现在起将这个网络称为C3D。所有3D卷积滤波器均为3×3×3,步长为1×1×1。除了pool1的核大小为1×2×2、步长为1×2×2(目的是保留早期的时间信息)之外,所有3D池化层均为2×2×2,步长为2×2×2。每个全连接层有4096个输出单元。
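For reference, the C3D architecture just described can be sketched as follows (PyTorch assumed; the per-layer filter counts 64-128-256-256-512-512-512-512 are taken from the published C3D figure rather than from the text above, so treat them as quoted values):

import torch
import torch.nn as nn

def conv(in_ch, out_ch):
    # every C3D convolution is 3x3x3 with stride 1x1x1 and size-preserving padding
    return nn.Sequential(nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True))

c3d = nn.Sequential(
    conv(3, 64),    nn.MaxPool3d((1, 2, 2)),                 # conv1a, pool1 (keeps temporal length)
    conv(64, 128),  nn.MaxPool3d(2),                         # conv2a, pool2
    conv(128, 256), conv(256, 256), nn.MaxPool3d(2),         # conv3a, conv3b, pool3
    conv(256, 512), conv(512, 512), nn.MaxPool3d(2),         # conv4a, conv4b, pool4
    conv(512, 512), conv(512, 512), nn.MaxPool3d(2),         # conv5a, conv5b, pool5
    nn.Flatten(),
    nn.Linear(512 * 1 * 3 * 3, 4096), nn.ReLU(inplace=True), # fc6 (exact input size depends on pooling/padding details)
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),            # fc7
    nn.Linear(4096, 487),                                    # softmax over the 487 Sports-1M classes
)

print(c3d(torch.randn(1, 3, 16, 112, 112)).shape)            # torch.Size([1, 487])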

 

Dataset. To learn spatiotemporal features, we train our C3D on the Sports-1M dataset [18] which is currently the largest video classification benchmark. The dataset consists of 1.1 million sports videos. Each video belongs to one of 487 sports categories. Compared
with UCF101, Sports- 1M has 5 times the number of categories and 100 times the number of videos.

数据集。为了学习时空特征,我们在Sports-1M数据集[18]上训练我们的C3D,这是目前最大的视频分类基准。该数据集由110万个体育视频组成,每个视频属于487个运动类别之一。与UCF101相比,Sports-1M的类别数是其5倍,视频数量是其100倍。

 

Training: Training is done on the Sports-1M train split. As Sports-1M has many long videos, we randomly extract five 2-second long clips from every training video. Clips are resized to have a frame size of 128 × 171. On training, we randomly crop input clips
into 16×112×112 crops for spatial and temporal jittering. We also horizontally flip them with 50% probability. Training is done by SGD with minibatch size of 30 examples. Initial learning rate is 0.003, and is divided by 2 every 150K iterations. The optimization
is stopped at 1.9M iterations (about 13 epochs). Beside the C3D net trained from scratch, we also experiment with C3D net fine-tuned from the model pre-trained on I380K.

训练:训练在Sports-1M的训练划分上进行。由于Sports-1M有许多长视频,我们从每个训练视频中随机提取五个2秒长的片段。片段被调整为128×171的帧大小。在训练中,我们将输入片段随机裁剪为16×112×112,以进行空间和时间抖动,并以50%的概率对其进行水平翻转。训练由SGD完成,小批量大小为30个样本。初始学习率为0.003,每150K次迭代除以2。优化在1.9M次迭代(约13个epoch)时停止。除了从头开始训练的C3D网络外,我们还试验了从I380K预训练模型微调得到的C3D网络。
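A hedged sketch of the training-time jittering described above (PyTorch-style tensors assumed): a random 16-frame, 112 × 112 crop of a resized clip plus a 50% horizontal flip. The number of frames T in a 2-second clip depends on the video frame rate and is left to the data loader:

import torch

def jitter(clip):
    """clip: float tensor of shape (3, T, 128, 171) with T >= 16."""
    _, T, H, W = clip.shape
    t = torch.randint(0, T - 16 + 1, (1,)).item()        # temporal jitter
    y = torch.randint(0, H - 112 + 1, (1,)).item()       # spatial jitter
    x = torch.randint(0, W - 112 + 1, (1,)).item()
    crop = clip[:, t:t + 16, y:y + 112, x:x + 112]
    if torch.rand(1).item() < 0.5:                       # horizontal flip with 50% probability
        crop = torch.flip(crop, dims=[3])
    return crop                                          # (3, 16, 112, 112)

print(jitter(torch.randn(3, 50, 128, 171)).shape)        # torch.Size([3, 16, 112, 112])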

 

Sports-1M classification results: Table 2 presents the results of our C3D networks compared with DeepVideo and Convolution pooling . We use only a single center crop per clip, and pass it through the network to make the clip prediction. For video predictions,
we average clip predictions of 10 clips which are randomly extracted from the video. It is worth noting some setting differences between the comparing methods. DeepVideo and C3D use short clips while Convolution pooling uses much longer clips. DeepVideo uses
more crops: 4 crops per clip and 80 crops per video compared with 1 and 10 used by C3D, respectively. The C3D network trained from scratch yields an accuracy of 84.4% and the one fine-tuned from the I380K pre-trained model yields 85.5% video top-5 accuracy.
Both C3D networks outperform DeepVideo’s networks. C3D is still 5.6% below the method of [29]. However, this method uses convolution pooling of deep image features on long clips of 120 frames, thus it is not directly comparable to C3D and DeepVideo which operate
on much shorter clips. We note that the difference in top-1 accuracy for clips and videos of this method is small (1.6%) as it already uses 120-frame clips as inputs. In practice, convolution pooling or more sophisticated aggregation schemes can be applied
on top of C3D features to improve video hit performance.

Sports-1M分类结果:表2给出了我们的C3D网络与DeepVideo和卷积池化方法的结果对比。我们每个片段只使用一个中心裁剪,并将其通过网络得到片段预测。对于视频预测,我们对从视频中随机提取的10个片段的预测结果取平均。值得注意的是,这些对比方法之间存在一些设置差异:DeepVideo和C3D使用短片段,而卷积池化方法使用更长的片段;DeepVideo使用更多的裁剪,每个片段4个裁剪、每个视频80个裁剪,而C3D分别只使用1个和10个。从零开始训练的C3D网络取得了84.4%的视频top-5精度,从I380K预训练模型微调的C3D网络取得了85.5%。两个C3D网络都优于DeepVideo的网络。C3D仍比[29]的方法低5.6%。然而,该方法在120帧的长片段上对深度图像特征做卷积池化,因此不能直接与在更短片段上操作的C3D和DeepVideo相比较。我们注意到,由于该方法已经使用120帧片段作为输入,其片段级与视频级top-1精度的差异很小(1.6%)。在实践中,可以在C3D特征之上应用卷积池化或更复杂的聚合方案,以进一步提高视频级预测性能。
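The video-level prediction rule described above amounts to averaging the softmax outputs of the 10 sampled clips; a small sketch (NumPy assumed):

import numpy as np

def video_prediction(clip_probs):
    """clip_probs: (10, num_classes) array of per-clip softmax outputs."""
    return int(np.mean(clip_probs, axis=0).argmax())

print(video_prediction(np.random.rand(10, 487)))         # e.g. 10 clips over the 487 Sports-1M classes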

 

C3D video descriptor: After training, C3D can be used as a feature extractor for other video analysis tasks. To extract a C3D feature, a video is split into 16-frame-long clips with an 8-frame overlap between two consecutive clips. These clips are passed to
the C3D network to extract fc6 activations. These clip fc6 activations are averaged to form a 4096-dim video descriptor which is then followed by an L2-normalization. We refer to this representation as C3D video descriptor/feature in all experiments, unless
we clearly specify the difference.

C3D视频描述符:训练后,C3D可用作其他视频分析任务的特征提取器。 为了提取C3D特征,视频被分割成16帧长的片段,在两个连续片段之间具有8帧重叠。 这些片段被传递到C3D网络以提取fc6激活。 对这些片段fc6激活进行平均以形成4096维的视频描述符,然后接着做L2标准化。 在所有实验中,我们将此表示法称为C3D视频描述符/特征,除非我们明确指出差异。
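The C3D video descriptor above can be sketched as a small helper; extract_fc6 below is a hypothetical placeholder for a forward pass through the C3D net up to fc6, not part of any released API:

import numpy as np

def c3d_video_descriptor(frames, extract_fc6, clip_len=16, stride=8):
    """frames: (num_frames, H, W, 3) array; returns a 4096-dim, L2-normalized video descriptor."""
    feats = []
    for start in range(0, len(frames) - clip_len + 1, stride):      # 8-frame overlap between clips
        feats.append(extract_fc6(frames[start:start + clip_len]))   # 4096-dim fc6 activation per clip
    video_feat = np.mean(feats, axis=0)                             # average the clip features
    return video_feat / (np.linalg.norm(video_feat) + 1e-10)        # followed by L2-normalization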

 

Figure 3. C3D architecture. C3D net has 8 convolution, 5 max-pooling, and 2 fully connected layers, followed by a softmax output layer. All 3D convolution kernels are 3 × 3 × 3 with stride 1 in both spatial and temporal dimensions. The number of filters is denoted in each box. The 3D pooling layers are denoted from pool1 to pool5. All pooling kernels are 2 × 2 × 2, except for pool1, which is 1 × 2 × 2. Each fully connected layer has 4096 output units.

图3. C3D架构。C3D网络有8个卷积层、5个最大池化层和2个全连接层,后接一个softmax输出层。所有3D卷积核均为3×3×3,在空间和时间维度上的步长均为1。每个框中标注了滤波器的数量。3D池化层由pool1到pool5表示。除pool1为1×2×2外,所有池化核均为2×2×2。每个全连接层有4096个输出单元。

 

Table 2. Sports-1M classification result. C3D outperforms [18] by 5% on top-5 video-level accuracy. (*)We note that the method of [29] uses long clips, thus its clip-level accuracy is not directly comparable to that of C3D and DeepVideo.

表2. Sports-1M分类结果。C3D在视频级top-5精度上比[18]高出5%。(*)我们注意到,[29]的方法使用长片段,因此其片段级精度不能与C3D和DeepVideo直接比较。

 

What does C3D learn? We use the deconvolution method explained in [46] to understand what C3D is learning internally. We observe that C3D starts by focusing on appearance in the first few frames and tracks the salient motion in the subsequent frames. Figure
4 visualizes deconvolution of two C3D conv5b feature maps with highest activations projected back to the image space. In the first example, the feature focuses on the whole person and then tracks the motion of the pole vault performance over the rest of the
frames. Similarly in the second example it first focuses on the eyes and then tracks the motion happening around the eyes while applying the makeup. Thus C3D differs from standard 2D ConvNets in that it selectively attends to both motion and appearance. We
provide more visualizations in the supplementary material to give a better insight about the learned feature.

C3D学习到了什么?我们使用[46]中描述的反卷积方法来了解C3D在内部学习到了什么。我们观察到,C3D首先关注前几帧中的外观,然后跟踪后续帧中的显著运动。图4将激活最高的两个C3D conv5b特征图通过反卷积投射回图像空间进行可视化。在第一个例子中,该特征先聚焦于整个人,然后在其余帧中跟踪撑杆跳表演的运动。类似地,在第二个例子中,它首先关注眼睛,然后跟踪化妆时眼睛周围发生的运动。因此,C3D与标准2D ConvNets的不同之处在于它有选择地同时关注运动和外观。我们在补充材料中提供了更多的可视化,以便更好地理解所学特征。

 

Figure 4. Visualization of C3D model, using the method from [46]. Interestingly, C3D captures appearance for the first few frames but thereafter only attends to salient motion. Best viewed on a color screen.

图4. C3D模型的可视化,使用[46]中的方法。有趣的是,C3D捕获了前几帧的外观,但其后仅出现在显著的运动上。 最好在彩色屏幕上观看。

 

4.动作识别

Dataset: We evaluate C3D features on UCF101 dataset . The dataset consists of 13, 320 videos of 101 human action categories. We use the three split setting provided with this dataset.

数据集:我们评估UCF101数据集上的C3D特征。 数据集由101个人类动作类别的13,320个视频组成。 我们使用此数据集提供的三个拆分设置。

 

Classification model: We extract C3D features and input them to a multi-class linear SVM for training models. We experiment with C3D descriptor using 3 different nets: C3D trained on I380K, C3D trained on Sports-1M, and C3D trained on I380K and fine-tuned
on Sports-1M. In the multiple nets setting, we concatenate the L2-normalized C3D descriptors of these nets.

分类模型:我们提取C3D特征并将其输入到用于训练模型的多类线性SVM。 我们使用3个不同网络的C3D描述符进行试验:在I380K上训练的C3D,在Sports-1M上训练的C3D,以及在I380K上训练并在Sports-1M上进行微调的C3D。 在多网络设置中,我们连接这些网络的L2标准化C3D描述符。
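A hedged sketch of this classification protocol, assuming scikit-learn and pre-computed C3D video descriptors; the .npy file names are placeholders, not artifacts that ship with the paper:

import numpy as np
from sklearn.svm import LinearSVC

# placeholder files holding N x 4096 C3D video descriptors and their action labels
X_train, y_train = np.load("c3d_train_feat.npy"), np.load("c3d_train_labels.npy")
X_test, y_test = np.load("c3d_test_feat.npy"), np.load("c3d_test_labels.npy")

clf = LinearSVC()                       # multi-class linear SVM; the SVM's C parameter is not specified in the paper
clf.fit(X_train, y_train)
print("split accuracy:", clf.score(X_test, y_test))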

 

Baselines: We compare C3D features with a few baselines: the current best hand-crafted features, namely improved dense trajectories (iDT), and the popular deep image features, namely Imagenet, using Caffe's Imagenet pre-trained model. For iDT, we use the bag-of-words representation with a codebook size of 5000 for each feature channel of iDT, which are trajectories, HOG, HOF, MBHx, and MBHy. We normalize the histogram of each channel separately using the L1-norm and concatenate these normalized histograms to form a
25K feature vector for a video. For Imagenet baseline, similar to C3D, we extract Imagenet fc6 feature for each frame, average these frame features to make video descriptor. A multi-class linear SVM is also used for these two baselines for a fair comparison.

基线:我们将C3D特征与几个基线进行比较:目前最好的人为设计特征,即改进的密集轨迹(iDT),以及流行的深度图像特征,即使用Caffe的Imagenet预训练模型得到的Imagenet特征。对于iDT,我们对其每个特征通道(轨迹、HOG、HOF、MBHx和MBHy)使用码本大小为5000的词袋表示。我们使用L1范数分别对每个通道的直方图进行归一化,并将这些归一化直方图连接起来,形成一个视频的25K维特征向量。对于Imagenet基线,与C3D类似,我们为每一帧提取Imagenet fc6特征,并对这些帧特征取平均得到视频描述符。为了公平比较,这两个基线同样使用多类线性SVM。

 

Results: Table 3 presents action recognition accuracy of C3D compared with the two baselines and current best methods. The upper part shows results of the two baselines. The middle part presents methods that use only RGB frames as inputs. And the lower part
reports all current best methods using all possible feature combinations (e.g. optical flows, iDT).

结果:表3给出了C3D与两个基线以及当前最佳方法的动作识别精度对比。上半部分显示了两个基线的结果;中间部分给出了仅使用RGB帧作为输入的方法;下半部分报告了使用所有可能的特征组合(例如光流、iDT)的当前最佳方法。

Table 3. Action recognition results on UCF101. C3D compared with baselines and current state-of-the-art methods. Top: simple features with linear SVM; Middle: methods taking only RGB frames as inputs; Bottom: methods using multiple feature combinations.

表3. UCF101的动作识别结果。 C3D与基线和当前最先进的方法相比。 顶部:线性SVM的简单特征; 中间:仅采用RGB帧作为输入的方法; 底部:使用多个特征组合的方法。

 

C3D fine-tuned net performs best among the three C3D nets described previously. The performance gap between these three nets, however, is small (1%). From now on, we refer to the fine-tuned net as C3D, unless otherwise stated. C3D using one net, which has only 4,096 dimensions, obtains an accuracy of 82.3%. C3D with 3 nets boosts the accuracy to 85.2% while the dimension is increased to 12,288. C3D when combined with iDT further improves the accuracy to 90.4%, while when it is combined with Imagenet, we observe
only 0.6% improvement. This indicates C3D can well capture both appearance and motion information, thus there is no benefit to combining with Imagenet which is an appearance based deep feature. On the other hand, it is beneficial to combine C3D with iDT as
they are highly complementary to each other. In fact, iDT are hand-crafted features based on optical flow tracking and histograms of low-level gradients while C3D captures high level abstract/semantic information.

C3D微调网络在前面描述的三个C3D网络中表现最好。然而,这三个网络之间的性能差距很小(1%)。从现在开始,除非另有说明,我们将微调后的网络称为C3D。仅使用一个网络(4096维)的C3D获得了82.3%的精度;使用3个网络的C3D将精度提高到85.2%,同时维度增加到12,288。C3D与iDT组合后进一步将精度提高到90.4%,而与Imagenet相结合时,我们只观察到0.6%的提升。这表明C3D可以很好地同时捕获外观和运动信息,因此与Imagenet这种基于外观的深度特征相结合没有什么好处。另一方面,将C3D与iDT相结合是有益的,因为它们彼此高度互补。事实上,iDT是基于光流跟踪和低级梯度直方图的人为设计特征,而C3D则捕获高级抽象/语义信息。

 

C3D with 3 nets achieves 85.2% which is 9% and 16.4% better than the iDT and Imagenet baselines, respectively. On the RGB-only input setting, compared with CNN-based approaches, our C3D outperforms the deep networks of [18] and the spatial stream network in [36] by 19.8% and 12.6%, respectively. Both the deep networks [18] and the spatial stream network in [36] use the AlexNet architecture. While in [18] the net is fine-tuned from their model pre-trained on Sports-1M, the spatial stream network in [36] is fine-tuned from an Imagenet pretrained model. Our C3D is different from these CNN-based methods in terms of network architecture and basic operations. In addition, C3D is trained on Sports-1M and used as is without any finetuning. Compared with Recurrent Neural Networks (RNN) based methods, C3D outperforms
Longterm Recurrent Convolutional Networks (LRCN)  and LSTM composite model by 14.1% and 9.4%, respectively. C3D with only RGB input still outperforms these two RNN-based methods when they used both optical flows and RGB as well as the temporal stream network
in [36]. However, C3D needs to be combined with iDT to outperform two-stream networks , the other iDT-based methods [31, 25], and the method that focuses on long-term modeling [29]. Apart from the promising numbers, C3D also has the advantage of simplicity
compared to the other methods.

使用3个网络的C3D达到了85.2%,比iDT和Imagenet基线分别高出9%和16.4%。在仅使用RGB输入的设置中,与基于CNN的方法相比,我们的C3D分别比[18]中的深度网络和[36]中的空间流网络高出19.8%和12.6%。[18]中的深度网络和[36]中的空间流网络都使用AlexNet架构。[18]中的网络由他们在Sports-1M上预训练的模型微调而来,而[36]中的空间流网络由Imagenet预训练模型微调而来。我们的C3D在网络架构和基本操作方面与这些基于CNN的方法不同。此外,C3D在Sports-1M上训练后直接使用,没有任何微调。与基于循环神经网络(RNN)的方法相比,C3D分别比长期循环卷积网络(LRCN)和LSTM复合模型高出14.1%和9.4%。即使这两种基于RNN的方法同时使用了光流和RGB,以及[36]使用了时间流网络,仅使用RGB输入的C3D仍然优于它们。然而,C3D需要与iDT组合才能优于双流网络、其他基于iDT的方法[31,25]以及专注于长期建模的方法[29]。除了可观的性能数字之外,与其他方法相比,C3D还具有简单的优点。

 

Figure 5. C3D compared with Imagenet and iDT in low dimensions. C3D, Imagenet, and iDT accuracy on UCF101 using PCA dimensionality reduction and a linear SVM. C3D outperforms Imagenet and iDT by 10-20% in low dimensions.

图5. C3D与Imagenet和iDT在低维度下的比较。使用PCA降维和线性SVM时,C3D、Imagenet和iDT在UCF101上的精度。在低维度下,C3D比Imagenet和iDT高出10-20%。

 

Figure 6. Feature embedding. Feature embedding visualizations of Imagenet and C3D on UCF101 dataset using t-SNE . C3D features are semantically separable compared to Imagenet suggesting that it is a better feature for videos. Each clip is visualized as a
point and clips belonging to the same action have the same color. Best viewed in color.

图6.特征嵌入。 在UCF101数据集上使用t-SNE对Imagenet和C3D的特征嵌入可视化。 与Imagenet相比,C3D特征在语义上可分离,表明对于视频它是更好的特征。 每个片段可视化为一个点,属于同一动作的片段具有相同的颜色。 最好在着色条件下观看。

 

C3D is compact: In order to evaluate the compactness of C3D features we use PCA to project the features into lower dimensions and report the classification accuracy of the projected features on UCF101 using a linear SVM. We apply the same process with iDT
as well as Imagenet features and compare the results in Figure 5. At the extreme setting with only 10 dimensions, C3D accuracy is 52.8% which is more than 20% better than the accuracy of Imagenet and iDT which are about 32%. At 50 and 100 dim, C3D obtains
an accuracy of 72.6% and 75.6% which are about 10-12% better than Imagenet and iDT. Finally, with 500 dimensions, C3D is able to achieve 79.4% accuracy which is 6% better than iDT and 11% better than Imagenet. This indicates that our features are both compact
and discriminative. This is very helpful for large-scale retrieval applications where low storage cost and fast retrieval are crucial.

C3D是紧凑的:为了评估C3D特征的紧凑性,我们使用PCA将特征投影到较低维度,并使用线性SVM报告投影后特征在UCF101上的分类精度。我们对iDT和Imagenet特征应用相同的流程,并在图5中比较结果。在仅有10个维度的极端设置下,C3D的精度为52.8%,比Imagenet和iDT约32%的精度高出20%以上。在50维和100维时,C3D的精度分别为72.6%和75.6%,比Imagenet和iDT高出约10-12%。最后,在500维时,C3D达到79.4%的精度,比iDT高6%,比Imagenet高11%。这表明我们的特征既紧凑又具有判别力。这对于低存储成本和快速检索至关重要的大规模检索应用非常有用。
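The compactness experiment can be reproduced in outline as follows (scikit-learn assumed; the placeholder .npy files are the same hypothetical pre-computed descriptors as in the section 4 sketch):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

X_train, y_train = np.load("c3d_train_feat.npy"), np.load("c3d_train_labels.npy")
X_test, y_test = np.load("c3d_test_feat.npy"), np.load("c3d_test_labels.npy")

for dim in (10, 50, 100, 500):
    pca = PCA(n_components=dim).fit(X_train)                   # project C3D features to `dim` dimensions
    clf = LinearSVC().fit(pca.transform(X_train), y_train)     # linear SVM on the projected features
    print(dim, clf.score(pca.transform(X_test), y_test))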

 

We qualitatively evaluate our learned C3D features to verify if they are good generic features for video by visualizing the learned feature embedding on another dataset. We randomly select 100K clips from UCF101, then extract fc6 features for those clips using Imagenet and C3D. These features are then projected to 2-dimensional space using t-SNE. Figure 6 visualizes the feature embedding of the features from Imagenet and our C3D on UCF101. It is worth noting that we did not do any finetuning as we wanted to verify if the features show good generalization capability across datasets. We qualitatively observe that C3D is better than Imagenet.

我们通过在另一个数据集上可视化所学特征的嵌入,定性地评估学习到的C3D特征是否是良好的通用视频特征。我们从UCF101中随机选择100K个片段,然后分别使用Imagenet和C3D提取这些片段的fc6特征,再使用t-SNE将这些特征投影到二维空间。图6可视化了Imagenet和我们的C3D在UCF101上的特征嵌入。值得注意的是,我们没有做任何微调,因为我们想验证这些特征是否具有跨数据集的良好泛化能力。我们定性地观察到C3D优于Imagenet。
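A short sketch of the t-SNE visualization (scikit-learn and matplotlib assumed; the .npy files are placeholders for the fc6 features and action labels of the sampled clips):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feats = np.load("clip_fc6_features.npy")            # placeholder: fc6 features of the sampled clips
labels = np.load("clip_action_labels.npy")          # placeholder: one action label per clip

emb = TSNE(n_components=2).fit_transform(feats)     # project to 2-D
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap="tab20")
plt.show()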

5.动作相似性标签

Dataset: The ASLAN dataset consists of 3, 631 videos from 432 action classes. The task is to predict if a given pair of videos belong to the same or different action. We use the prescribed 10-fold cross validation with the splits provided with the dataset.
This problem is different from action recognition, as the task focuses on predicting action similarity not the actual action label. The task is quite challenging because the test set contains videos of “never-seenbefore” actions.

数据集:ASLAN数据集由来自432个动作类的3,631个视频组成。任务是预测给定的一对视频是否属于相同的动作。我们使用数据集提供的划分进行规定的10折交叉验证。这个问题与动作识别不同,因为任务着重于预测动作相似性而不是实际的动作标签。该任务非常具有挑战性,因为测试集包含"从未见过的"动作的视频。

 

Features: We split videos into 16-frame clips with an overlap of 8 frames. We extract C3D features: prob, fc7, fc6, pool5 for each clip. The features for videos are computed by averaging the clip features separately for each type of feature, followed by
an L2 normalization.

特征:我们将视频分为重叠8帧的16帧片段。 我们提取每个片段的C3D特征:prob,fc7,fc6,pool5。 通过分别平均每种特征类型的片段特征,然后进行L2归一化,来计算视频特征。

 

Classification model: We follow the same setup used in [21]. Given a pair of videos, we compute the 12 different distances provided in [21]. With 4 types of features, we obtain 48-dimensional (12 × 4 = 48) feature vector for each video pair. As these 48
distances are not comparable to each other, we normalize them independently such that each dimension has zero mean and unit variance. Finally, a linear SVM is trained to classify video pairs into same or different on these 48-dim feature vectors. Beside comparing
with current methods, we also compare C3D with a strong baseline using deep image-based features. The baseline has the same setting as our C3D and we replace C3D features with Imagenet features.

分类模型:我们遵循[21]中使用的相同设置。给定一对视频,我们计算[21]中提供的12个不同的距离。使用4种特征,我们为每对视频获得48维(12×4=48)的特征向量。由于这48个距离彼此之间不可比,我们将它们独立地归一化,使得每个维度具有零均值和单位方差。最后,在这些48维特征向量上训练一个线性SVM,将视频对分类为相同或不同。除了与当前方法进行比较之外,我们还将C3D与一个使用深度图像特征的强基线进行比较。该基线与我们的C3D设置相同,只是用Imagenet特征替换C3D特征。
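A hedged sketch of the pair-classification pipeline; the three distances below are illustrative examples only, not the exact 12 distances of [21] (SciPy and scikit-learn assumed):

import numpy as np
from scipy.spatial.distance import cosine, euclidean, correlation
from sklearn.svm import LinearSVC

def pair_vector(feats_a, feats_b):
    """feats_*: dict mapping feature type -> L2-normalized video descriptor."""
    dists = []
    for key in ("prob", "fc7", "fc6", "pool5"):
        a, b = feats_a[key], feats_b[key]
        dists += [cosine(a, b), euclidean(a, b), correlation(a, b)]   # example distances only
    return np.array(dists)

# X: (num_pairs, num_distances) matrix of pair vectors, y: 1 for "same action", 0 for "different"
# X = np.stack([pair_vector(fa, fb) for fa, fb in pairs])
# X = (X - X.mean(axis=0)) / X.std(axis=0)          # normalize each dimension to zero mean, unit variance
# clf = LinearSVC().fit(X, y)                       # linear SVM on the pair vectors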

 

Figure 7. Action similarity labeling result. ROC curve of C3D evaluated on ASLAN. C3D achieves 86.5% on AUC and outperforms current state-of-the-art by 11.1%.

图7.动作相似性标注结果。C3D的ROC曲线在ASLAN上评估。C3D在AUC上达到86.5%,优于目前最先进的11.1%。

 

Table 4. Action similarity labeling result on ASLAN. C3D significantly outperforms state-of-the-art method [45] by 9.6% in accuracy and by 11.1% in area under ROC curve.

表4. ASLAN的动作相似性标注结果。C3D显著优于最先进的方法[45],精度提升了9.6%,ROC曲线下面积提升了11.1%。

 

Results: We report the result of C3D and compare with state-of-the-art methods in table 4. While most current methods use multiple hand-crafted features, strong encoding methods (VLAD, Fisher Vector), and complex learning models, our method uses a simple
averaging of C3D features over the video and a linear SVM. C3D significantly outperforms state-of-the-art method [45] by 9.6% on accuracy and 11.1% on area under ROC curve (AUC). Imagenet baseline performs reasonably well which is just 1.2% below state-of-the-art
method, but 10.8% worse than C3D due to lack of motion modeling. Figure 7 plots the ROC curves of C3D compared with current methods and human performance. C3D has clearly made a significant improvement which is a halfway from current state-of-the-art method
to human performance (98.9%).

结果:我们在表4中报告C3D的结果,并与最先进的方法进行比较。目前大多数方法使用多种人为设计特征、强编码方法(VLAD、Fisher Vector)和复杂的学习模型,而我们的方法仅对视频上的C3D特征做简单平均,并使用线性SVM。C3D显著优于最先进的方法[45],精度高出9.6%,ROC曲线下面积(AUC)高出11.1%。Imagenet基线表现相当不错,仅比最先进的方法低1.2%,但由于缺乏运动建模,比C3D低10.8%。图7绘制了C3D与当前方法和人类表现相比的ROC曲线。C3D取得了显著的提升,大约走到了从目前最先进方法到人类表现(98.9%)的中间位置。

 

Table 5. Scene recognition accuracy. C3D using a simple linear SVM outperforms current methods on Maryland and YUPENN.

表5. 场景识别精度。使用简单线性SVM的C3D在Maryland和YUPENN上优于当前方法。

 

6.场景和物体识别

Datasets: For dynamic scene recognition, we evaluate C3D on two benchmarks: YUPENN and Maryland. YUPENN consists of 420 videos of 14 scene categories and Maryland has 130 videos of 13 scene categories. For object recognition, we test C3D on the egocentric dataset [32] which consists of 42 types of everyday objects. A point to note: this dataset is egocentric and all videos are recorded in a first-person view, which has quite different appearance and motion characteristics than any of the videos we have in the training
dataset.

数据集:对于动态场景识别,我们在两个基准上评估C3D:YUPENN和Maryland。YUPENN包含14个场景类别的420个视频,Maryland有13个场景类别的130个视频。对于物体识别,我们在以自我为中心(egocentric)的数据集[32]上测试C3D,该数据集包含42类日常物体。值得注意的是,该数据集是以自我为中心的,所有视频都以第一人称视角录制,其外观和运动特征与我们训练数据集中的任何视频都截然不同。

 

Classification model: For both datasets, we use the same setup of feature extraction and linear SVM for classification and follow the same leave-one-out evaluation protocol as described by the authors of these datasets. For object dataset, the standard evaluation
is based on frames. However, C3D takes a video clip of length 16 frames to extract the feature. We slide a window of 16 frames over all videos to extract C3D features. We choose the ground truth label for each clip to be the most frequently occurring label of the clip. If the most frequent label in a clip occurs in fewer than 8 frames, we consider it a negative clip with no object and discard it in both training and testing. We train and test C3D features using a linear SVM and report the object recognition accuracy.
We follow the same split provided in [32]. We also compare C3D with a baseline using Imagenet feature on these 3 benchmarks.

分类模型:对于这两类数据集,我们使用相同的特征提取设置和线性SVM进行分类,并遵循这些数据集的作者所述的留一法评估协议。对于物体数据集,标准评估基于帧。但是,C3D以长度为16帧的视频片段为输入来提取特征。我们在所有视频上滑动16帧的窗口来提取C3D特征,并将每个片段中出现最频繁的标签作为该片段的真实标签。如果片段中最常见的标签出现少于8帧,我们就认为它是没有物体的负片段,并在训练和测试中丢弃它。我们使用线性SVM训练和测试C3D特征,并报告物体识别精度。我们遵循[32]中提供的相同划分。我们还在这3个基准上将C3D与使用Imagenet特征的基线进行比较。
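The clip-labeling rule described above can be written as a small helper (a sketch; frame_labels is assumed to hold the per-frame ground-truth labels of one 16-frame window):

from collections import Counter

def clip_label(frame_labels, min_count=8):
    """Return the most frequent per-frame label of a 16-frame clip, or None if it
    occurs on fewer than `min_count` frames (the clip is then treated as a negative
    clip with no object and discarded from both training and testing)."""
    label, count = Counter(frame_labels).most_common(1)[0]
    return label if count >= min_count else None

print(clip_label(["cup"] * 9 + ["bowl"] * 7))                   # "cup"
print(clip_label(["cup"] * 7 + ["bowl"] * 5 + ["spoon"] * 4))   # None -> clip discarded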

 

Results: Table 5 reports our C3D results and compares them with the current best methods. On scene classification, C3D outperforms the state-of-the-art method by 10% and 1.9% on Maryland and YUPENN respectively. It is worth noting that C3D uses only a linear
SVM with simple averaging of clip features while the second best method [9] uses different complex feature encodings (FV, LLC, and dynamic pooling). The Imagenet baseline achieves similar performance with C3D on Maryland and 1.4% lower than C3D on YUPENN.
On object recognition, C3D obtains 22.3% accuracy and outperforms [32] by 10.3% with only linear SVM where the comparing method used RBF-kernel on strong SIFT-RANSAC feature matching. Compared with Imagenet baseline, C3D is still 3.4% worse. This can be explained
by the fact that C3D uses a smaller input resolution (128 × 128) compared to the full-size resolution (256 × 256) used by Imagenet. Since C3D is trained only on Sports-1M videos without any fine-tuning while Imagenet is fully trained on 1000 object categories,
we did not expect C3D to work that well on this task. The result is very surprising and shows how generic C3D is on capturing appearance and motion information in videos.

结果:表5报告了我们的C3D结果,并将其与当前最佳方法进行比较。在场景分类上,C3D在Maryland和YUPENN上分别比最先进的方法高出10%和1.9%。值得注意的是,C3D仅使用对片段特征做简单平均的线性SVM,而第二好的方法[9]使用了多种复杂的特征编码(FV、LLC和动态池化)。Imagenet基线在Maryland上与C3D表现相近,在YUPENN上比C3D低1.4%。在物体识别方面,C3D仅使用线性SVM就获得了22.3%的精度,比[32]高出10.3%,而对比方法在强SIFT-RANSAC特征匹配的基础上使用了RBF核。与Imagenet基线相比,C3D仍低3.4%。这可以解释为:与Imagenet使用的全尺寸分辨率(256×256)相比,C3D使用了较小的输入分辨率(128×128)。由于C3D仅在Sports-1M视频上训练且没有任何微调,而Imagenet是在1000个物体类别上充分训练的,我们原本并不期望C3D在这个任务上表现这么好。这个结果非常令人惊讶,显示了C3D在捕捉视频中的外观和运动信息方面的通用性。

 

Table 6. Runtime analysis on UCF101. C3D is 91x faster than improved dense trajectories and 274x faster than Brox’s GPU implementation in OpenCV.

表6. UCF101的运行时间分析。 C3D比改进密集轨迹快了91倍,比Brox在OpenCV中的GPU实现速度快了274倍。

 

7.运行时间分析

We compare the runtime of C3D and with iDT [44] and the Temporal stream network [36]. For iDT, we use the code kindly provided by the authors [44]. For [36], there is no public model available to evaluate. However, this method uses Brox’s optical flows [3]
as inputs. We manage to evaluate runtime of Brox’s method using two different versions: CPU implementation provided by the authors [3] and the GPU implementation provided in OpenCV.

我们比较C3D和iDT [44]和时间流网络[36]的运行时间。对于iDT,我们使用作者提供的代码[44]。对于[36],没有可用的评估公共模型。然而,该方法使用Brox的光流[3]作为输入。我们设法使用两种不同的版本来评估Brox方法的运行时间:作者提供的CPU实现[3]和OpenCV中提供的GPU实现。

 

We report the runtime of the three above-mentioned methods to extract features (including I/O) for the whole UCF101 dataset in table 6, using a single CPU or a single K40 Tesla GPU. [36] reported a computation time (without I/O) of 0.06s for a pair of images.
In our experiment, Brox’s GPU implementation takes 0.85-0.9s per image pair including I/O. Note that this is not a fair comparison for iDT as it uses only CPU. We cannot find any GPU implementation of this method and it is not trivial to implement a parallel
version of this algorithm on GPU. Note that C3D is much faster than real-time, processing at 313 fps while the other two methods have a processing speed of less than 4 fps.

我们报告上述三种方法使用单个CPU或单个K40 Tesla GPU为整个UCF101数据集提取特征(包括I/O)的运行时间,见表6。[36]报告了一对图像的计算时间(不含I/O)为0.06秒。在我们的实验中,Brox的GPU实现每对图像需要0.85-0.9秒(包括I/O)。请注意,这对iDT来说不是公平的比较,因为它只使用CPU。我们找不到该方法的任何GPU实现,而在GPU上实现该算法的并行版本并非易事。请注意,C3D远快于实时,处理速度为313 fps,而其他两种方法的处理速度不到4 fps。

 

8.结论

In this work we try to address the problem of learning spatiotemporal features for videos using 3D ConvNets trained on large-scale video datasets. We conducted a systematic study to find the best temporal kernel length for 3D ConvNets. We showed that C3D
can model appearance and motion information simultaneously and outperforms the 2D ConvNet features on various video analysis tasks. We demonstrated that C3D features with a linear classifier can outperform or approach current best methods on different video
analysis benchmarks. Last but not least, the proposed C3D features are efficient, compact, and extremely simple to use.

在这项工作中,我们试图解决使用在大规模视频数据集上训练的3D ConvNets来学习视频时空特征的问题。我们进行了系统的研究,以找到3D ConvNets的最佳时间核长度。我们展示了C3D可以同时对外观和运动信息进行建模,在各种视频分析任务上优于2D ConvNet特征。我们还展示了具有线性分类器的C3D特征可以在不同的视频分析基准上优于或接近当前的最佳方法。最后同样重要的是,所提出的C3D特征高效、紧凑,而且使用起来非常简单。

 

C3D源代码和预训练模型可从http://vlg.cs.dartmouth.edu/c3d获得。