Deep Learning For Generic Object Detection : A Survey

  • 4 Fundamental SubProblems
  • 4.1 DCNN based Object Representation
  • 4.2 Context Modeling
  • 4.3 Detection Proposal Methods
  • 4.4 Other Special Issues
  • 5 Datasets and Performance Evaluation
  • 6 Conclusions
  • References

    4 Fundamental SubProblems

      In this section important subproblems are described, including feature representation, region proposal, context information mining, and training strategies. Each approach is reviewed with respect to its primary contribution.

    4.1 DCNN based Object Representation

      As one of the main components in any detector, good feature representations are of primary importance in object detection [46, 65, 62, 249]. In the past, a great deal of effort was devoted to designing local descriptors (e.g., SIFT [139] and HOG [42]) and to exploring approaches (e.g., Bag of Words [194] and Fisher Vector [166]) to group and abstract the descriptors into higher level representations that allow discriminative object parts to emerge; however, these feature representation methods required careful engineering and considerable domain expertise.

      In contrast, deep learning methods (especially deep CNNs, or DCNNs), which are composed of multiple processing layers, can learn powerful feature representations with multiple levels of abstraction directly from raw images [12, 116]. As the learning procedure reduces the dependence on the specific domain knowledge and complex procedures needed in traditional feature engineering [12, 116], the burden of feature representation has been transferred to the design of better network architectures.

      The leading frameworks reviewed in Section 3 (RCNN [65], Fast RCNN [64], Faster RCNN [175], YOLO [174], SSD [136]) have steadily improved detection accuracy and speed. It is generally accepted that the CNN representation plays a crucial role and that the CNN architecture is the engine of a detector. As a result, most of the recent improvements in detection accuracy have been achieved via research into the development of novel networks. Therefore we begin by reviewing popular CNN architectures used in generic object detection, followed by a review of the effort devoted to improving object feature representations, such as developing invariant features to accommodate geometric variations in object scale, pose, viewpoint and part deformation, and performing multiscale analysis to improve object detection over a wide range of scales.

    4.1.1 Popular CNN Architectures

      CNN architectures serve as network backbones for the detection frameworks described in Section 3. Representative architectures include AlexNet [110], ZFNet [234], VGGNet [191], GoogLeNet [200], the Inception series [99, 201, 202], ResNet [79], DenseNet [94] and SENet [91], which are summarized in Table 2; the resulting improvement in object recognition can be seen in Fig. 9. A further review of recent CNN advances can be found in [71].

      Briefly, a CNN has a hierarchical structure and is composed of a number of layers such as convolution, nonlinearity and pooling. From finer to coarser layers, the image repeatedly undergoes filtered convolution, and with each layer the receptive field (region of support) of these filters increases. For example, the pioneering AlexNet [110] has five convolutional layers and two Fully Connected (FC) layers, where the first layer contains 96 filters of size 11 × 11 × 3. In general, the first CNN layer extracts low level features (e.g. edges), intermediate layers extract features of increasing complexity, such as combinations of low level features, and later convolutional layers detect objects as combinations of earlier parts [234, 12, 116, 157].

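      To make the layer hierarchy concrete, the following is a minimal sketch of an AlexNet-style feature extractor. PyTorch is an assumption here (the survey is framework agnostic), and the input size and padding choices are illustrative; only the five-conv-layer structure and the 96 first-layer filters of size 11 × 11 × 3 follow the description above.

```python
import torch
import torch.nn as nn

# AlexNet-style hierarchy: five convolutional layers (two FC layers would
# follow for classification). Pooling between stages progressively enlarges
# the receptive field of the later filters.
features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),    # 96 filters of 11 x 11 x 3: low-level edges
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2),  # mid-level combinations of edges
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), # high-level combinations of parts
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2),
)

x = torch.randn(1, 3, 227, 227)  # dummy RGB input
print(features(x).shape)         # torch.Size([1, 256, 6, 6]): coarse, semantically strong
```
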
      As can be observed from Table 2, the trend in architecture evolution is that networks are getting deeper: AlexNet consisted of 8 layers, VGGNet of 16 layers, and more recently ResNet and DenseNet have both surpassed the 100 layer mark; it was VGGNet [191] and GoogLeNet [200], in particular, which showed that increasing depth can improve the representational power of deep networks. Interestingly, as can also be observed from Table 2, networks such as AlexNet, OverFeat, ZFNet and VGGNet have an enormous number of parameters despite being only a few layers deep, since a large fraction of the parameters come from the FC layers. Therefore, newer networks like Inception, ResNet and DenseNet, although much deeper, have far fewer parameters by avoiding the use of FC layers.

      With the use of Inception modules in carefully designed topologies, the parameter count of GoogLeNet is dramatically reduced. Similarly, ResNet demonstrated the effectiveness of skip connections for learning extremely deep networks with hundreds of layers, winning the ILSVRC 2015 classification task. Inspired by ResNet [79], InceptionResNets [202] combine Inception networks with shortcut connections, claiming that shortcut connections can significantly accelerate the training of Inception networks. Extending ResNets, Huang et al. [94] proposed DenseNets, which are built from dense blocks that connect each layer to every other layer in a feed-forward fashion, leading to compelling advantages such as parameter efficiency, implicit deep supervision and feature reuse. Recently, Hu et al. [91] proposed an architectural unit termed the Squeeze and Excitation (SE) block, which can be combined with existing deep architectures to boost their performance at minimal additional computational cost: by explicitly modeling the interdependencies between convolutional feature channels, it adaptively recalibrates channelwise feature responses, and it won the ILSVRC 2017 classification task. Research on CNN architectures remains active, and new backbone networks keep emerging, such as Dilated Residual Networks [230], Xception [35], DetNet [127], and Dual Path Networks (DPN) [31].

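      As an illustration of the channelwise recalibration performed by the SE block, here is a minimal sketch; the reduction ratio of 16 is an assumption borrowed from common practice rather than something this survey prescribes.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of a Squeeze and Excitation block: global context is squeezed
    into a channel descriptor, which then gates (recalibrates) each channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # squeeze to a bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # excite back to C channels
            nn.Sigmoid(),                                # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (B, C)
        w = self.fc(w).view(b, c, 1, 1)   # excitation: model channel interdependencies
        return x * w                      # recalibrate channelwise responses

print(SEBlock(256)(torch.randn(2, 256, 14, 14)).shape)  # torch.Size([2, 256, 14, 14])
```
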
      The training of a CNN requires a large labelled dataset with sufficient label and intraclass diversity. Unlike image classification, detection requires localizing (possibly many) objects in an image. It has been shown [161] that pretraining the deep model with a large scale dataset having object-level annotations (such as the ImageNet classification and localization dataset), instead of only image-level annotations, improves detection performance. However, collecting bounding box labels is expensive, especially for hundreds of thousands of categories. A common scenario is for a CNN to be pretrained on a large dataset (usually with a large number of visual categories) with image-level labels; the pretrained CNN can then be applied to a small dataset directly as a generic feature extractor [172, 8, 49, 228], which can support a wider range of visual recognition tasks. For detection, the pretrained network is typically finetuned on a given detection dataset [49, 65, 67]. Several large scale image classification datasets are used for CNN pretraining, among them the ImageNet1000 dataset [44, 179], with 1.2 million images of 1000 object categories; the Places dataset [245], which is much larger than ImageNet1000 but has fewer classes; and a recent hybrid dataset [245] combining the Places and ImageNet datasets.

      Pretrained CNNs without finetuning were explored for object classification and detection in [49, 67, 1], where it was shown that feature performance is a function of the layer from which features are extracted; for example, for AlexNet pretrained on ImageNet, FC6 / FC7 / Pool5 are in descending order of detection accuracy [49, 67]. Finetuning a pretrained network can increase detection performance significantly [65, 67], although in the case of AlexNet the finetuning performance boost was shown to be much larger for FC6 and FC7 than for Pool5, suggesting that the Pool5 features are more general. Furthermore, the relationship or similarity between the source and target datasets plays a critical role; for example, ImageNet based CNN features show better performance [243] on object related image datasets.

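      The two transfer regimes above can be sketched as follows, using torchvision's ImageNet-pretrained AlexNet as an assumed stand-in for any pretrained backbone; the 21-class head (20 PASCAL VOC classes plus background) is likewise an illustrative assumption.

```python
import torch
import torchvision

model = torchvision.models.alexnet(weights="IMAGENET1K_V1")

# (a) Generic feature extractor: freeze all weights and read off features
# from an intermediate layer (here the end of model.features, i.e. Pool5).
for p in model.parameters():
    p.requires_grad = False
pool5 = model.features(torch.randn(1, 3, 224, 224)).flatten(1)  # (1, 9216)

# (b) Finetuning: replace the classification head for the target dataset
# and continue training (typically with a small learning rate).
model.classifier[6] = torch.nn.Linear(4096, 21)  # 20 VOC classes + background
for p in model.parameters():
    p.requires_grad = True
```
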
    4.1.2 Methods For Improving Object Representation

      Deep CNN based detectors such as RCNN [65], Fast RCNN [64], Faster RCNN [175] and YOLO [174] typically use the deep CNN architectures listed in Table 2 as the backbone network and use features from the top layer of the CNN as the object representation; however, detecting objects across a large range of scales is a fundamental challenge. A classical strategy to address this issue is to run the detector over a number of scaled input images (e.g., an image pyramid) [56, 65, 77], which typically produces more accurate detection, but with obvious costs in inference time and memory. In contrast, a CNN computes its feature hierarchy layer by layer, and the subsampling layers in the feature hierarchy lead to an inherent multiscale pyramid.

      This inherent feature hierarchy produces feature maps of different spatial resolutions, but has inherent structural problems [75, 138, 190]: the later (or higher) layers have a large receptive field and strong semantics, and are the most robust to variations such as object pose, illumination and part deformation, but their resolution is low and geometric details are lost; on the contrary, the earlier (or lower) layers have a small receptive field, rich geometric details and high resolution, but are much less sensitive to semantics. Intuitively, semantic concepts of objects can emerge in different layers, depending on the size of the objects. If a target object is small, it requires fine detail information from earlier layers and may very well disappear in later layers, in principle making small object detection very challenging; tricks such as dilated convolutions [229] or atrous convolution [40, 27] have been proposed for this. On the other hand, if the target object is large, then its semantic concept will emerge in much later layers. Clearly it is not optimal to predict objects of different scales with features from only one layer, and therefore a number of methods [190, 241, 130, 104] have been proposed to improve detection accuracy by exploiting multiple CNN layers, broadly falling into three types of multiscale object detection:

    1. Detecting with combined features of multiple CNN layers [75, 103, 10];
    2. Detecting at multiple CNN layers;
    3. Combinations of the above two methods [58, 130, 190, 104, 246, 239].

        (1) Detecting with combined features of multiple CNN layers seeks to combine features from multiple layers before making a prediction. Representative approaches include Hypercolumns [75], HyperNet [103], and ION [10]. Such feature combining is commonly accomplished via skip connections, a classic neural network idea that skips some layers in the network and feeds the output of an earlier layer as the input to a later layer, an architecture which has recently become popular for semantic segmentation [138, 185, 75]. As shown in Fig. 10 (a), ION [10] uses skip pooling to extract RoI features from multiple layers, and the object proposals generated by selective search and EdgeBoxes are then classified using the combined features. HyperNet [103], shown in Fig. 10 (b), follows a similar idea and integrates deep, intermediate and shallow features to generate object proposals and predict objects via an end to end joint training strategy, extracting only 100 candidate regions in each image. The combined feature is more descriptive and more beneficial for localization and classification, but at increased computational complexity.

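        A minimal sketch of the skip-pooling idea: the same RoI is pooled from feature maps of several layers and the results are concatenated into one combined representation. The feature map shapes, strides and the single example box are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

c3 = torch.randn(1, 256, 64, 64)   # assumed backbone map at stride 8
c4 = torch.randn(1, 512, 32, 32)   # stride 16
c5 = torch.randn(1, 512, 16, 16)   # stride 32

# One proposal in (batch_index, x1, y1, x2, y2) image coordinates.
boxes = torch.tensor([[0.0, 48.0, 48.0, 160.0, 160.0]])

pooled = [
    roi_align(c3, boxes, output_size=7, spatial_scale=1 / 8),
    roi_align(c4, boxes, output_size=7, spatial_scale=1 / 16),
    roi_align(c5, boxes, output_size=7, spatial_scale=1 / 32),
]
combined = torch.cat(pooled, dim=1)  # concatenated multi-layer RoI feature
print(combined.shape)                # torch.Size([1, 1280, 7, 7])
```
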
        (2) Detecting at multiple CNN layers [138, 185] combines coarse to fine predictions from multiple layers by averaging segmentation probabilities. SSD [136], MSCNN [20], RFBNet [135] and DSOD [186] combine predictions from multiple feature maps to handle objects of various sizes. SSD spreads out default boxes of different scales to multiple layers within a CNN and enforces each layer to focus on predicting objects of a certain scale. Liu et al. [135] proposed RFBNet, which simply replaces the later convolution layers of SSD with a Receptive Field Block (RFB) to enhance the discriminability and robustness of features; the RFB is a multibranch convolutional block, similar to the Inception block [200], but combining multiple branches with different kernels and convolution layers [27]. MSCNN [20] applies deconvolution on multiple layers of a CNN to increase feature map resolution before using the layers to learn region proposals and pool features.

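        The following sketch illustrates detection at multiple CNN layers in the SSD spirit: each feature map gets its own small convolutional predictor, so finer maps handle smaller objects and coarser maps larger ones. The map sizes, 6 anchors per location and 21 classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

num_classes, num_anchors = 21, 6
feature_maps = [torch.randn(1, 512, 38, 38),    # fine map: small objects
                torch.randn(1, 1024, 19, 19),
                torch.randn(1, 512, 10, 10)]    # coarse map: large objects

# One 3x3 predictor per layer: class scores plus 4 box offsets per anchor.
heads = nn.ModuleList(
    nn.Conv2d(f.shape[1], num_anchors * (num_classes + 4), 3, padding=1)
    for f in feature_maps
)

preds = [h(f).permute(0, 2, 3, 1).reshape(1, -1, num_classes + 4)
         for h, f in zip(heads, feature_maps)]
print(torch.cat(preds, dim=1).shape)  # all per-anchor predictions, pre-NMS
```
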
        (3) Combinations of the above two methods recognize that, on the one hand, simply incorporating skip features into detection to form a hyper feature representation, as in UNet [154], Hypercolumns [75], HyperNet [103] and ION [10], does not yield significant improvements, due to the high dimensionality. On the other hand, it is natural to detect large objects from later layers with large receptive fields and to use earlier layers with small receptive fields to detect small objects; however, simply detecting objects from earlier layers may result in low performance, because earlier layers possess less semantic information. Therefore, in order to combine the best of both worlds, some recent works propose detecting objects at multiple layers, where the feature of each detection layer is obtained by combining features from different layers. Representative methods include SharpMask [168], the Deconvolutional Single Shot Detector (DSSD) [58], the Feature Pyramid Network (FPN) [130], Top Down Modulation (TDM) [190], Reverse connection with Objectness prior Network (RON) [104], ZIP [122] (shown in Fig. 12), the Scale Transfer Detection Network (STDN) [246], RefineDet [239] and StairNet [217], as listed in Table 3 and contrasted in Fig. 11.

        As can be observed from Fig. 11 (a1) to (e1), these methods have highly similar detection architectures which incorporate a top-down network with lateral connections to supplement the standard bottom-up, feedforward network. Specifically, after a bottom-up pass, the final high level semantic features are transmitted back by the top-down network to be combined, after lateral processing, with the bottom-up features from intermediate layers. The combined features are further processed, then used for detection and also transmitted further down by the top-down network. As can be seen from Fig. 11 (a2) to (e2), one main difference lies in the design of the Reverse Fusion Block (RFB), which handles the selection of the lower layer filters and the combination of multilayer features. The top-down and lateral features are processed with small convolutions and combined via elementwise sum, elementwise product or concatenation. FPN shows significant improvement as a generic feature extractor in several applications including object detection [130, 131] and instance segmentation [80], e.g. when used in a basic Faster RCNN detector. These methods have to add additional layers to obtain multiscale features, introducing a cost that cannot be neglected. STDN [246] used DenseNet [94] to combine features of different layers and designed a scale transfer module to obtain feature maps of different resolutions; the scale transfer module can be directly embedded into DenseNet at little additional cost.

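        The shared top-down-plus-lateral pattern (closest in spirit to FPN) can be sketched as follows: 1 × 1 lateral convolutions unify channel widths, the coarser map is upsampled and merged by elementwise sum, and every resulting level can feed its own detection head. Channel counts and map sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c3, c4, c5 = (torch.randn(1, 256, 64, 64),    # bottom-up maps, fine to coarse
              torch.randn(1, 512, 32, 32),
              torch.randn(1, 1024, 16, 16))

lat3, lat4, lat5 = (nn.Conv2d(256, 256, 1), nn.Conv2d(512, 256, 1),
                    nn.Conv2d(1024, 256, 1))  # lateral 1x1 convolutions
smooth = nn.Conv2d(256, 256, 3, padding=1)    # small conv after each merge

p5 = lat5(c5)                                              # strongest semantics
p4 = smooth(lat4(c4) + F.interpolate(p5, scale_factor=2))  # top-down + lateral
p3 = smooth(lat3(c3) + F.interpolate(p4, scale_factor=2))

# Each pyramid level now carries high-level semantics at its own resolution.
print(p3.shape, p4.shape, p5.shape)
```
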
        (4) Model Geometric Transformations. DCNNs are inherently limited in their ability to model significant geometric transformations. An empirical study of the invariance and equivalence of DCNN representations to image transformations can be found in [118]. Some approaches have been presented to enhance the robustness of CNN representations, aiming at learning CNN representations that are invariant to different types of transformations such as scale [101, 18], rotation [18, 32, 218, 248], or both [100].

        Modeling Object Deformations: Before deep learning, Deformable Part based Models (DPMs) [56] were very successful for generic object detection, representing objects by component parts arranged in a deformable configuration. Because the parts are positioned accordingly and their local appearances are stable, this DPM modeling is less sensitive to transformations in object pose, viewpoint and nonrigid deformations, motivating researchers [41, 66, 147, 160, 214] to explicitly model object composition to improve CNN based detection. The first attempts [66, 214] combined DPMs with CNNs by using deep features learned by AlexNet in DPM based detection, but without region proposals. To give a CNN a built-in capability for modeling the deformations of object parts, a number of approaches were proposed, including DeepIDNet [160], DCN [41] and DPFCN [147] (shown in Table 3). Although similar in spirit, they compute deformations in different ways: DeepIDNet [161] designed a deformation constrained pooling layer to replace the regular max pooling layer, learning shared visual patterns and their deformation properties across different object classes; Dai et al. [41] designed a deformable convolution layer and a deformable RoI pooling layer, both based on the idea of augmenting the regular grid sampling locations in the feature maps with additional position offsets learned via convolutions, leading to Deformable Convolutional Networks (DCN); and in DPFCN [147], Mordan et al. proposed a deformable part based RoI pooling layer which selects discriminative parts of objects around object proposals by simultaneously optimizing the latent displacements of all parts.

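        The deformable convolution idea can be sketched with torchvision's DeformConv2d: a regular convolution predicts per-location sampling offsets which deform the 3 × 3 grid of the main convolution. Initializing the offsets to zero, so that training starts from the regular grid, is an assumption following common practice.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

in_ch, out_ch, k = 256, 256, 3
offset_conv = nn.Conv2d(in_ch, 2 * k * k, kernel_size=3, padding=1)  # (dx, dy) per tap
deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

nn.init.zeros_(offset_conv.weight)  # zero offsets: start from the regular grid
nn.init.zeros_(offset_conv.bias)

x = torch.randn(1, in_ch, 32, 32)
offsets = offset_conv(x)            # offsets learned via convolution, as in DCN
y = deform_conv(x, offsets)
print(y.shape)                      # torch.Size([1, 256, 32, 32])
```
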
    4.2 Context Modeling

      In the physical world, visual objects occur in particular environments and usually coexist with other related objects, and there is strong psychological evidence [13, 9] that context plays an essential role in human object recognition. It is recognized that proper modeling of context helps object detection and recognition [203, 155, 27, 26, 47, 59], especially when object appearance features are insufficient because of small object size, occlusion, or poor image quality. Many different types of context have been discussed; see in particular the surveys [47, 59]. Context can broadly be grouped into one of three categories [13, 59]:

    1. Semantic context: the likelihood of an object being found in some scenes but not in others;
    2. Spatial context: the likelihood of finding an object in some positions and not others with respect to other objects in the scene;
    3. Scale context: objects have a limited set of sizes relative to other objects in the scene.

      A great deal of work [28, 47, 59, 143, 152, 171, 162] preceded the prevalence of deep learning; however, much of this work has not yet been explored in DCNN based object detectors [29, 90].

      The current state of the art in object detection [175, 136, 80] detects objects without explicitly exploiting any contextual information. It is broadly agreed that DCNNs make use of contextual information implicitly [234, 242] since they learn hierarchical representations with multiple levels of abstraction. Nevertheless there is still value in exploring contextual information explicitly in DCNN based detectors [90, 29, 236], and so the following reviews recent work in exploiting contextual cues in DCNN based object detectors, organized into categories of global and local contexts, motivated by earlier work in [240, 59]. Representative approaches are summarized in Table 4.

      Global context [240, 59] refers to image or scene level context, which can serve as a cue for object detection (e.g., a bedroom will predict the presence of a bed). In DeepIDNet [160], the image classification scores were used as contextual features and concatenated with the object detection scores to improve detection results. In ION [10], Bell et al. proposed using spatial Recurrent Neural Networks (RNNs) to explore contextual information across the entire image. In SegDeepM [250], Zhu et al. proposed an MRF model that scores appearance as well as context for each detection, and allows each candidate box to select a segment and score the agreement between them. In [188], semantic segmentation was used as a form of contextual priming.

      Local context [240, 59, 171] considers the local surroundings in object relations, i.e., the interactions between an object and its surrounding area. In general, modeling object relations is challenging, requiring reasoning about bounding boxes of different classes, locations, scales etc. In the deep learning era, research that explicitly models object relations is quite limited, with representative works being the Spatial Memory Network (SMN) [29], the Object Relation Network (ORN) [90], and the Structure Inference Network (SIN) [137]. In SMN, spatial memory essentially assembles object instances back into a pseudo image representation that is easy to feed into another CNN for object relations reasoning, leading to a new sequential reasoning architecture where image and memory are processed in parallel to obtain detections which further update the memory. Inspired by the recent success of attention modules in the natural language processing field [211], Hu et al. [90] proposed a lightweight ORN, which processes a set of objects simultaneously through interaction between their appearance features and geometry; it requires no additional supervision, is easy to embed in existing networks, and has been shown to be effective in improving object recognition and the duplicate removal step in modern object detection pipelines, giving rise to the first fully end-to-end object detector. SIN [137] considered two kinds of context, scene contextual information and object relationships, within a single image, formulating object detection as a problem of graph structure inference where, given an image, objects are treated as nodes in a graph and relationships between objects are modeled as edges.

      A wider range of methods has approached the problem more simply, normally by enlarging the detection window size to extract some form of local context. Representative approaches include MRCNN [62], Gated BiDirectional CNN (GBDNet) [235, 236], Attention to Context CNN (ACCNN) [123], CoupleNet [251], and Sermanet et al. [182].

      In MRCNN [62] (Fig. 13 (a)), in addition to the features extracted from the original object proposal at the last CONV layer of the backbone, Gidaris and Komodakis proposed to extract features from a number of different regions of an object proposal (half regions, border regions, central regions, contextual region and semantically segmented regions), in order to obtain a richer and more robust object representation. All of these features are combined simply by concatenation.

      Quite a number of methods, all closely related to MRCNN, have been proposed since. The method in [233] used only four contextual regions, organized in a foveal structure, where the classifiers are trained jointly, end to end. Zeng et al. proposed GBDNet [235, 236] (Fig. 13 (b)) to extract features from multiscale contextualized regions surrounding an object proposal to improve detection performance. Different from the naive way of learning CNN features for each region separately and then concatenating them, as in MRCNN, GBDNet passes messages among the features from different contextual regions, implemented through convolution. Noting that message passing is not always helpful but dependent on individual samples, Zeng et al. used gated functions to control message transmission, as in Long Short Term Memory (LSTM) networks [83]. Concurrent with GBDNet, Li et al. [123] presented ACCNN (Fig. 13 (c)) to utilize both global and local contextual information to facilitate object detection. To capture global context, a Multiscale Local Contextualized (MLC) subnetwork was proposed, which recurrently generates an attention map for an input image to highlight useful global contextual locations, through multiple stacked LSTM layers. To encode the local surroundings context, Li et al. [123] adopted a method similar to that in MRCNN [62]. As shown in Fig. 13 (d), CoupleNet [251] is conceptually similar to ACCNN [123], but built upon RFCN [40]. In addition to the original branch in RFCN [40], which captures object information with position sensitive RoI pooling, CoupleNet [251] added a branch to encode the global context information with RoI pooling.

    4.3 Detection Proposal Methods

      An object can be located at any position and scale in an image. During the heyday of handcrafted feature descriptors (e.g., SIFT [140], HOG [42] and LBP [153]), the Bag of Words (BoW) [194, 37] and the DPM [55] used sliding window techniques [213, 42, 55, 76, 212]. However, the number of windows is large and grows with the number of pixels in an image, and the need to search at multiple scales and aspect ratios further increases the search space significantly. Therefore, it is computationally too expensive to apply more sophisticated classifiers at every window.

      Around 2011, researchers proposed relieving the tension between computational tractability and high detection quality by using detection proposals [210, 209] (we use the terms detection proposals, object proposals and region proposals interchangeably). Originating in the idea of objectness proposed by [2], object proposals are a set of candidate regions in an image that are likely to contain objects. Detection proposals are usually used as a preprocessing step to reduce the computational complexity by limiting the number of regions that need to be evaluated by the detector. Therefore, a good detection proposal method should have the following characteristics:

    1. High recall, which can be achieved with only a few proposals;
    2. The proposals match the objects as accurately as possible;
    3. High efficiency.

    The success of object detection based on detection proposals given by selective search [210, 209] has attracted broad interest [21, 7, 3, 33, 254, 50, 105, 144].

      A comprehensive review of object proposal algorithms is outside the scope of this paper, because object proposals have applications beyond object detection [6, 72, 252]. We refer interested readers to the recent surveys [86, 23], which provide an in-depth analysis of many classical object proposal algorithms and their impact on detection performance. Our interest here is to review object proposal methods that are based on DCNNs, output class agnostic proposals, and are related to generic object detection.

      In 2014, the integration of object proposals [210, 209] and DCNN features [109] led to the milestone RCNN [65] in generic object detection. Since then, detection proposal algorithms have quickly become a standard preprocessing step, evidenced by the fact that all winning entries in the PASCAL VOC [53], ILSVRC [179] and MS COCO [129] object detection challenges since 2014 used detection proposals [65, 160, 64, 175, 236, 80].

      Among object proposal approaches based on traditional low-level cues (e.g., color, texture, edges and gradients), Selective Search [209], MCG [7] and EdgeBoxes [254] are among the more popular. As the domain rapidly progressed, traditional object proposal approaches [86] (e.g. Selective Search [209] and EdgeBoxes [254]), which were adopted as external modules independent of the detectors, became the bottleneck of the detection pipeline [175]. An emerging class of object proposal algorithms [52, 175, 111, 61, 167, 224] using DCNNs has attracted broad attention.

      Recent DCNN based object proposal methods generally fall into two categories: bounding box based and object segment based, with representative methods summarized in Table 5.

      Bounding Box Proposal Methods are best exemplified by the RPN method [175] of Ren et al., illustrated in Fig. 14. RPN predicts object proposals by sliding a small network over the feature map of the last shared CONV layer. At each sliding window location, it predicts k proposals simultaneously by using k anchor boxes, where each anchor box is centered at some location in the image and is associated with a particular scale and aspect ratio. Ren et al. [175] proposed integrating RPN and Fast RCNN into a single network by sharing their convolutional layers, a design that led to substantial speedups and the first end-to-end detection pipeline, Faster RCNN [175]. RPN has been broadly adopted as the proposal method by many state of the art object detectors, as can be observed from Tables 3 and 4.

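      The anchor mechanism can be sketched as follows: k = 9 anchors (3 scales × 3 aspect ratios) are centered at every sliding window position and mapped back to image coordinates through the feature stride. The stride, scales, ratios and feature map size below are illustrative assumptions in the spirit of [175].

```python
import torch

stride = 16                                       # feature map stride
scales = torch.tensor([128.0, 256.0, 512.0])      # anchor area ~ scale^2
ratios = torch.tensor([0.5, 1.0, 2.0])            # width / height

ws = (scales[:, None] * ratios[None, :].sqrt()).reshape(-1)  # 9 anchor widths
hs = (scales[:, None] / ratios[None, :].sqrt()).reshape(-1)  # 9 anchor heights

fh, fw = 38, 50                                   # feature map size
cy, cx = torch.meshgrid((torch.arange(fh) + 0.5) * stride,
                        (torch.arange(fw) + 0.5) * stride, indexing="ij")
centers = torch.stack([cx, cy], dim=-1).reshape(-1, 1, 2)

half = torch.stack([ws, hs], dim=-1) / 2          # half widths / heights
anchors = torch.cat([centers - half, centers + half], dim=-1).reshape(-1, 4)
print(anchors.shape)                              # torch.Size([17100, 4]) = 38*50*9 boxes
```
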
      Instead of fixing a set of anchors a priori, as in MultiBox [52, 199] and RPN [175], Lu et al. [141] proposed generating anchor locations by using a recursive search strategy which can adaptively guide computational resources to focus on subregions likely to contain objects. Starting with the whole image, all regions visited during the search process serve as anchors. For any anchor region encountered during the search procedure, a scalar zoom indicator is used to decide whether to further partition the region, and a set of bounding boxes with objectness scores are computed by a deep network called the Adjacency and Zoom Network (AZNet), which extends RPN by adding a branch to compute the scalar zoom indicator in parallel with the existing branch.

      There is further work attempting to generate object proposals by exploiting multilayer convolutional features [103, 61, 224, 122]. Concurrent with RPN [175], Ghodrati et al. [61] proposed DeepProposal, which generates object proposals by using a cascade of multiple convolutional features, building an inverse cascade to select the most promising object locations and to refine their boxes in a coarse to fine manner. HyperNet [103], an improved variant of RPN, designs Hyper Features which aggregate multilayer convolutional features and shares them for both generating proposals and detecting objects via an end to end joint training strategy. Yang et al. proposed CRAFT [224], which also used a cascade strategy, first training an RPN network to generate object proposals and then using them to train a binary Fast RCNN network to further distinguish objects from background. Li et al. [122] proposed ZIP to improve RPN by leveraging the commonly used idea of predicting object proposals with multiple convolutional feature maps at different depths of a network, integrating both low level details and high level semantics. The backbone network used in ZIP is a “zoom out and in” network inspired by the conv and deconv structure [138].

      Finally, recent work which deserves mention includes DeepBox [111], which proposed a lightweight CNN to learn to rerank proposals generated by EdgeBoxes, and DeNet [208], which introduces bounding box corner estimation to predict object proposals efficiently, replacing RPN in a Faster RCNN style two stage detector.

      Object Segment Proposal Methods [167, 168] aim to generate segment proposals that are likely to correspond to objects. Segment proposals are more informative than bounding box proposals, and take a step further towards object instance segmentation [74, 39, 126]. A pioneering work was DeepMask, proposed by Pinheiro et al. [167], where segment proposals are learned directly from raw image data with a deep network. Sharing similarities with RPN, DeepMask splits the network, after a number of shared convolutional layers, into two branches which predict a class agnostic mask and an associated objectness score. Similar to the efficient sliding window prediction strategy in OverFeat [183], the trained DeepMask network is applied in a sliding window manner to an image (and its rescaled versions) during inference. More recently, Pinheiro et al. [168] proposed SharpMask by augmenting the DeepMask architecture with a refinement module, similar to the architectures shown in Fig. 11 (b1) and (b2), augmenting the feedforward network with a top-down refinement process. SharpMask can efficiently integrate the spatially rich information of early features with the strong semantic information encoded in later layers to generate high fidelity object masks.

      Motivated by Fully Convolutional Networks (FCN) for semantic segmentation [138] and DeepMask [167], Dai et al. proposed InstanceFCN [38] for generating instance segment proposals. Like DeepMask, the InstanceFCN network is split into two branches; however, both branches are fully convolutional, with one branch generating a small set of instance sensitive score maps, followed by an assembling module that outputs instances, and the other branch predicting the objectness score. Hu et al. proposed FastMask [89] to efficiently generate instance segment proposals in a one-shot manner similar to SSD [136], in order to make use of multiscale convolutional features in a deep network. Sliding windows extracted densely from multiscale convolutional feature maps are input to a scale-tolerant attentional head module to predict segmentation masks and objectness scores. FastMask is claimed to run at 13 FPS on 800 × 600 resolution images, with a slight trade off in average recall. Qiao et al. [170] proposed ScaleNet to extend previous object proposal methods like SharpMask [168] by explicitly adding a scale prediction phase: ScaleNet estimates the distribution of object scales for an input image, upon which SharpMask searches the input image at the predicted scales and outputs instance segment proposals. Qiao et al. [170] showed that their method outperformed the previous state of the art on supermarket datasets by a large margin.

    4.4 Other Special Issues

      Aiming at obtaining better and more robust DCNN feature representations, data augmentation tricks are commonly used [22, 64, 65], at training time, at test time, or both. Augmentation refers to perturbing an image by transformations that leave the underlying category unchanged, such as cropping, flipping, rotating, scaling and translating, in order to generate additional samples of the class. Data augmentation can improve the recognition performance of deep feature representations; nevertheless, it has obvious limitations, as both training and inference computational complexity increase significantly, limiting its usage in real applications. Detecting objects under a wide range of scale variations, and especially detecting very small objects, stands out as one of the key challenges. It has been shown [96, 136] that image resolution has a considerable impact on detection accuracy; therefore, among data augmentation tricks, scaling (especially towards higher resolution inputs) is used the most, since high resolution inputs increase the possibility of small objects being detected [96]. Recently, Singh et al. proposed the advanced and efficient data augmentation methods SNIP [192] and SNIPER [193] to address the scale invariance problem, as summarized in Table 6. Motivated by the intuitive understanding that small and large objects are difficult to detect at smaller and larger scales respectively, Singh et al. presented a novel training scheme named SNIP which reduces scale variations during training without reducing the number of training samples. SNIPER [193] is an approach for efficient multiscale training: it processes only context regions around the ground truth objects at the appropriate scale, instead of processing a whole image pyramid. Shrivastava et al. [189] and Lin et al. [131] explored approaches to handle the extreme foreground-background class imbalance issue. Wang et al. [216] proposed training an adversarial network to generate examples with occlusions and deformations that are difficult for the object detector to recognize. Some works focus on developing better methods for nonmaximum suppression [16, 87, 207].

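      A small sketch of the label-preserving augmentations mentioned above, composed with torchvision transforms; the particular parameters are assumptions, and for detection the geometric transforms must of course be applied to the bounding boxes as well, which is why detection pipelines use paired image/target transforms.

```python
import torchvision.transforms as T

# Flip, crop + rescale, and photometric jitter leave the category unchanged.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomResizedCrop(600, scale=(0.5, 1.0)),    # crop and rescale
    T.ColorJitter(brightness=0.2, contrast=0.2),   # photometric perturbation
    T.ToTensor(),
])
```
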
    5 Datasets and Performance Evaluation

    5.1 Datasets

      Datasets have played a key role throughout the history of object recognition research. They have been one of the most important factors for the considerable progress in the field, not only as a common ground for measuring and comparing performance of competing algorithms, but also pushing the field towards increasingly complex and challenging problems. The present access to large numbers of images on the Internet makes it possible to build comprehensive datasets of increasing numbers of images and categories in order to capture an ever greater richness and diversity of objects. The rise of large scale datasets with millions of images has paved the way for significant breakthroughs and enabled unprecedented performance in object recognition. Recognizing space limitations, we refer interested readers to several papers [53, 54, 129, 179, 107] for detailed description of related datasets.

      Beginning with Caltech101 [119], representative datasets include Caltech256 [70], Scenes15 [114], PASCAL VOC (2007) [54], Tiny Images [204], CIFAR10 [108], SUN [221], ImageNet [44], Places [245], MS COCO [129], and Open Images [106]. The features of these datasets are summarized in Table 7, and selected sample images are shown in Fig. 15.

      Earlier datasets, such as Caltech101 or Caltech256, were criticized for the lack of intraclass variation they exhibit. In response, SUN [221] was collected by finding images depicting various scene categories, and many of its images have scene and object annotations which can support scene recognition and object detection. Tiny Images [204] created a dataset at an unprecedented scale, giving comprehensive coverage of all object categories and scenes; however, its annotations were not manually verified and contain numerous errors, so two benchmarks with reliable labels (CIFAR10 and CIFAR100 [108]) were derived from Tiny Images.

      PASCAL VOC [53, 54], a multiyear effort devoted to the creation and maintenance of a series of benchmark datasets for classification and object detection, created the precedent for standardized evaluation of recognition algorithms in the form of annual competitions. It started with only four categories in 2005, growing to the 20 categories common in everyday life shown in Fig. 15. ImageNet [44] contains over 14 million images and over 20,000 categories and is the backbone of the ILSVRC challenge [44, 179], which has pushed object recognition research to new heights.

      ImageNet has been criticized because the objects in the dataset tend to be large and well centered, making the dataset atypical of real-world scenarios. With the goal of addressing this problem and pushing research towards richer image understanding, researchers created the MS COCO database [129]. Images in MS COCO are complex everyday scenes containing common objects in their natural context, closer to real life, and objects are labeled using fully segmented instances to provide more accurate detector evaluation. The Places database [245] contains 10 million scene images, labeled with scene semantic categories, offering data hungry deep learning algorithms the opportunity to reach human level recognition of visual patterns. More recently, Open Images [106] is a dataset of about 9 million images annotated with image-level labels and object bounding boxes.

      There are three famous challenges for generic object detection: PASCAL VOC [53, 54], ILSVRC [179] and MS COCO [129]. Each challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotations and standardized evaluation software; and (ii) an annual competition and corresponding workshop. Statistics for the number of images and object instances in the training, validation and testing datasets for the detection challenges are given in Table 8. (The annotations on the test sets are not publicly released, except for PASCAL VOC2007.)

      For the PASCAL VOC challenge, since 2009 the data have consisted of the previous years’ images augmented with new images, allowing the number of images to grow each year and, more importantly, meaning that test results can be compared with those of previous years.

      ILSVRC [179] scales up PASCAL VOC’s goal of standardized training and evaluation of detection algorithms by more than an order of magnitude in the number of object classes and images. The ILSVRC object detection challenge has been run annually from 2013 to the present.

      The COCO object detection challenge is designed to push the state of the art in generic object detection forward, and has been run annually from 2015 to the present. It features two object detection tasks: using either bounding box output or object instance segmentation output. It has fewer object categories than ILSVRC (80 in COCO versus 200 in ILSVRC object detection) but more instances per category (11,000 on average, compared to about 2,600 in ILSVRC object detection). In addition, it contains object segmentation annotations which are not currently available in ILSVRC. COCO introduced several new challenges: (1) it contains objects over a wide range of scales, including a high percentage of small objects (e.g. smaller than 1% of the image area [192]); (2) objects are less iconic, appearing amid clutter or under heavy occlusion; and (3) the evaluation metric (see Table 9) encourages more accurate object localization.

      COCO对象检测挑战旨在推动通用对象检测技术的发展,从2015年开始每年举办一次。它包含两个对象检测任务:使用边界框输出或对象实例分割输出。它的对象类别比ILSVRC少(COCO中是80个,而ILSVRC对象检测中是200个),但每个类别的实例更多(平均11000个,而ILSVRC对象检测中约为2600个)。此外,它还包含对象分割标注,这些标注目前在ILSVRC中是没有的。COCO引入了一些新的挑战:(1)它包含尺度范围很大的对象,其中小对象的比例很高(例如小于图像面积的1%[192]);(2)对象不那么具有标志性,并且处于杂乱或严重遮挡之中;(3)评价指标(见Table 9)鼓励更精确的对象定位。

      COCO has become the most widely used dataset for generic object detection, with the dataset statistics for training, validation and testing summarized in Table 8. Starting in 2017, the test set has only the Dev and Challenge splits, where the Test-Dev split is the default test data, and results in papers are generally reported on Test-Dev to allow for fair comparison.

      COCO已成为应用最广泛的通用对象检测数据集,Table 8总结了用于训练、验证和测试的数据集统计数据。从2017年开始,测试集只有Dev和Challenge部分,其中Test-Dev部分是默认的测试数据,并且论文中的结果通常报告在Test-Dev上,以便进行公平的比较。

      2018 saw the introduction of the Open Images Object Detection Challenge, following in the tradition of PASCAL VOC, ImageNet and COCO, but at an unprecedented scale. It offers a broader range of object classes than previous challenges, and has two tasks: bounding box object detection of 500 different classes and visual relationship detection which detects pairs of objects in particular relations.

      2018年,继PASCAL VOC、ImageNet和COCO的传统之后,Open Images Object Detection Challenge(开放图像目标检测挑战)以空前的规模被引入。与以前的挑战相比,它提供了更广泛的对象类别,并包含两个任务:针对500个不同类别的边界框对象检测,以及检测处于特定关系中的成对对象的视觉关系检测(visual relationship detection)。

    5.2 Evaluation Criteria

      There are three criteria for evaluating the performance of detection algorithms: detection speed (Frames Per Second, FPS), precision, and recall. The most commonly used metric is Average Precision (AP), derived from precision and recall. AP is usually evaluated in a category-specific manner, i.e., computed for each object category separately. In generic object detection, detectors are usually tested in terms of detecting a number of object categories. To compare performance over all object categories, the mean AP (mAP), averaged over all object categories, is adopted as the final measure of performance. (In object detection challenges such as PASCAL VOC and ILSVRC, the winning entry for each object category is the one with the highest AP score, and the winner of the challenge is the team that wins on the most object categories. The mAP is also used as the measure of a team's performance, which is justified since the ranking of teams by mAP was always the same as the ranking by the number of object categories won [179].) More details on these metrics can be found in [53, 54, 179, 84].

      评价检测算法性能的标准有三个:检测速度(Frames Per Second, FPS,每秒帧数)、精度和召回率。最常用的度量标准是平均精度(AP),它由精度和召回率导出。AP通常以特定于类别的方式进行评估,即对每个对象类别分别计算。在通用对象检测中,检测器通常在检测多个对象类别的任务上进行测试。为了比较所有对象类别上的性能,使用对所有对象类别取平均的mAP作为最终的性能度量。(在PASCAL VOC和ILSVRC等对象检测挑战中,每个对象类别的获胜者是AP分数最高的参赛作品,而挑战的总冠军是在最多对象类别上获胜的队伍。mAP也被用来衡量一个团队的表现,这是合理的,因为按mAP对团队的排名总是与按赢得的对象类别数量的排名相同[179]。)关于这些指标的更多细节可以在[53,54,179,84]中找到。
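      For reference, the two underlying quantities at a given confidence threshold β can be written as follows, using the TP/FP counts that the criteria below make precise (our notation, stating only the standard textbook definitions):

```latex
P(\beta) = \frac{TP(\beta)}{TP(\beta) + FP(\beta)}, \qquad
R(\beta) = \frac{TP(\beta)}{\text{number of ground truth instances}}
```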

      The standard outputs of a detector applied to a testing image I are the predicted detections {(b_j, c_j, p_j)}_j, indexed by j. A given detection (b, c, p) (omitting j for notational simplicity) denotes the predicted location (i.e., the Bounding Box, BB) b with its predicted category label c and its confidence level p. A predicted detection (b, c, p) is regarded as a True Positive (TP) if:

    • The predicted class label c is the same as the ground truth label c^g.
    • The overlap ratio IOU (Intersection Over Union) [53, 179]

      IOU(b, b^g) = area(b ∩ b^g) / area(b ∪ b^g)

      between the predicted BB b and the ground truth BB b^g is not smaller than a predefined threshold ε, where area(b ∩ b^g) denotes the intersection of the predicted and ground truth BBs, and area(b ∪ b^g) their union. A typical value of ε is 0.5.

    应用于测试图像 I 的检测器的标准输出是预测检测结果 {(b_j, c_j, p_j)}_j,由 j 索引。一个给定的检测 (b, c, p)(为符号简洁起见省略 j)表示预测的位置(即 Bounding Box,BB)b、其预测类别标签 c 及其置信度 p。一个预测检测 (b, c, p) 被认为是 True Positive(TP),如果:

    • 预测类别标签 c 与 ground truth 标签 c^g 相同。
    • 预测 BB b 与 ground truth BB b^g 之间的重叠比 IOU(Intersection Over Union)[53, 179]

      IOU(b, b^g) = area(b ∩ b^g) / area(b ∪ b^g)

      不小于预定义的阈值 ε,其中 area(b ∩ b^g) 表示预测 BB 与 ground truth BB 的交集,area(b ∪ b^g) 表示它们的并集。ε 的典型值为 0.5。

      Otherwise, it is considered as a False Positive (FP). The confidence level p is usually compared with some threshold β to determine whether the predicted class label c is accepted.

      否则,它被认为是 False Positive(FP,假正例)。置信度 p 通常与某个阈值 β 进行比较,以确定预测类别标签 c 是否被接受。
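      To make the TP criterion concrete, the following is a minimal Python sketch (our own illustration, not code from any benchmark toolkit), assuming boxes in (x1, y1, x2, y2) corner format:

```python
def iou(box_a, box_b):
    """Intersection Over Union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # area(b ∩ b^g)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                     # area(b ∪ b^g)
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, pred_label, gt_box, gt_label, eps=0.5):
    """A detection is a TP iff the labels match and IOU is at least the threshold ε."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= eps
```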

      AP is computed separately for each of the object classes, based on Precision and Recall. For a given object class c and a testing image I_i, let {(b_ij, p_ij)}_{j=1}^{M} denote the detections returned by a detector, ranked by confidence p_ij in decreasing order. Let B = {b^g_ik}_{k=1}^{K} be the ground truth boxes on image I_i for the given object class c. Each detection (b_ij, p_ij) is either a TP or an FP, which can be determined via the algorithm in Fig. 16. Based on the TP and FP detections, the precision P(β) and recall R(β) [53] can be computed as functions of the confidence threshold β, so by varying the confidence threshold different pairs (P, R) can be obtained, in principle allowing precision to be regarded as a function of recall, i.e. P(R), from which the Average Precision (AP) [53, 179] can be found.

      Table 9 summarizes the main metrics used in the PASCAL, ILSVRC and MS COCO object detection challenges.
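      As an illustration of how AP emerges from this procedure, here is a simplified Python sketch (reusing the iou helper from the sketch above) of the greedy matching and the VOC2010-style all-point integration of P(R). This is our own sketch: official toolkits such as the PASCAL VOC devkit and pycocotools handle further details ("difficult" objects, multiple IOU thresholds, per-image detection caps) that are omitted here.

```python
import numpy as np

def average_precision(detections, gt_boxes, eps=0.5):
    """AP for a single object class.

    detections: list of (image_id, box, confidence) for this class.
    gt_boxes:   dict mapping image_id -> list of ground truth boxes of this class.
    """
    detections = sorted(detections, key=lambda d: d[2], reverse=True)  # rank by confidence
    matched = {img: [False] * len(boxes) for img, boxes in gt_boxes.items()}
    n_gt = sum(len(boxes) for boxes in gt_boxes.values())
    tp = np.zeros(len(detections))
    fp = np.zeros(len(detections))
    for j, (img, box, _) in enumerate(detections):
        cands = gt_boxes.get(img, [])
        ious = [iou(box, g) for g in cands]
        k = int(np.argmax(ious)) if ious else -1
        if k >= 0 and ious[k] >= eps and not matched[img][k]:
            matched[img][k] = True   # greedily claim the best-overlapping ground truth
            tp[j] = 1
        else:
            fp[j] = 1                # duplicate detection, IOU < ε, or no ground truth
    recall = np.cumsum(tp) / max(n_gt, 1)
    precision = np.cumsum(tp) / np.arange(1, len(detections) + 1)
    # Integrate P(R): pad the curve, enforce a monotonically decreasing
    # precision envelope, then sum the areas of the resulting rectangles.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

      The mAP is then simply the mean of these per-class AP values; the COCO metric (Table 9) additionally averages AP over ten IOU thresholds from 0.5 to 0.95.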

    5.3 Performance

      A large variety of detectors has appeared in the last several years, and the introduction of standard benchmarks such as PASCAL VOC [53, 54], ImageNet [179] and COCO [129] has made it easier to compare detectors with respect to accuracy. As can be seen from our earlier discussion in Sections 3 and 4, it is difficult to objectively compare detectors in terms of accuracy, speed and memory alone, as they can differ in fundamental / contextual respects, including the following:

    • Meta detection frameworks, such as RCNN [65], Fast RCNN [64], Faster RCNN [175], RFCN [40], Mask RCNN [80], YOLO [174] and SSD [136];
    • Backbone networks such as VGG [191], Inception [200, 99, 201], ResNet [79], ResNeXt [223], Xception [35] and DetNet[127] etc. listed in Table 2;
    • Innovations such as multi-layer feature combination [130, 190, 58], deformable convolutional networks [41], deformable RoI pooling [160, 41], heavier heads [177, 164], and lighter heads [128];
    • Pretraining with datasets such as ImageNet [179], COCO [129], Places [245], JFT [82] and Open Images [106];
    • Different detection proposal methods and different numbers of object proposals;
    • Train/test data augmentation “tricks” such as multi-crop, horizontal flipping, multi-scale images, novel multi-scale training strategies [192, 193], mask tightening, and model ensembling.

      近年来出现了大量的检测器,而PASCAL VOC[53,54]、ImageNet[179]和COCO[129]等标准基准的引入,使得检测器在精确度方面的比较更加容易。从我们之前在第3节和第4节的讨论中可以看出,仅就精确度、速度和内存而言,很难客观地比较检测器,因为它们可能在基本设计或上下文条件方面存在差异,包括以下方面:

    • Meta detection frameworks,如RCNN[65]、Fast RCNN[64]、Faster RCNN[175]、RFCN[40]、Mask RCNN[80]、YOLO[174]、SSD [136];
    • Backbone networks(骨干网络),如VGG[191]、Inception[200, 99, 201]、ResNet[79]、ResNeXt[223]、Xception[35]、DetNet[127]等,见Table 2;
    • 创新的点,如 multi-layer feature combination [130, 190, 58], deformable convolutional networks [41], deformable RoI pooling [160, 41], heavier heads [177, 164], and lighter heads [128];
    • 使用ImageNet[179]、COCO[129]、Places[245]、JFT[82]和Open Images[106]等数据集进行预训练
    • 不同的检测提议方法和不同数量的对象提议;
    • 训练/测试数据增强“技巧”,如多裁剪(multi-crop)、水平翻转(horizontal flipping)、多尺度图像及新的多尺度训练策略[192, 193]、mask tightening、模型集成(model ensembling)等。

      Although it may be impractical to compare every recently proposed detector, it is nevertheless highly valuable to integrate representative and publicly available detectors into a common platform and to compare them in a unified manner. There has been very limited work in this regard, except for Huang's study [96] of the trade-off between accuracy and speed of three main families of detectors (Faster RCNN [175], RFCN [40] and SSD [136]), obtained by varying the backbone network, image resolution, the number of box proposals, etc.

      尽管将最近提出的每个检测器进行比较可能不切实际,但是将具有代表性的和公开可用的检测器集成到一个通用平台并以统一的方式进行比较是非常有价值的。这方面的工作非常有限,除了Huang[96]通过改变主干网络、图像分辨率和盒子提议数量等,研究了三大类检测器(Faster RCNN[175]、RFCN[40]和SSD[136])的准确性和速度之间的权衡。

      As can be seen from Tables 3, 4, 5, 6 and 10, we have summarized the best reported performance of many methods on three widely used standard benchmarks. The results of these methods were reported on the same test benchmark, despite their differing in one or more of the aspects listed above.

      从Tables 3、4、5、6、10可以看出,我们总结了许多方法在三个广泛使用的标准基准上的最佳性能报告。这些方法的结果是在相同的测试基准上报告的,尽管它们在上面列出的一个或多个方面有所不同。

      Figs. 1 and 17 present a very brief overview of the state of the art, summarizing the best detection results of the PASCAL VOC, ILSVRC and MSCOCO challenges. More results can be found at detection challenge websites [98, 148, 163]. In summary, the backbone network, the detection framework design and the availability of large scale datasets are the three most important factors in detection. Furthermore ensembles of multiple models, the incorporation of context features, and data augmentation all help to achieve better accuracy.

      Figs. 1和17简要概述了当前的技术水平,总结了PASCAL VOC、ILSVRC和MS COCO挑战的最佳检测结果。更多的结果可以在各检测挑战网站上找到[98,148,163]。综上所述,骨干网络(backbone network)、检测框架设计和大规模数据集的可用性是检测中最重要的三个因素。此外,多模型集成、上下文特征的结合和数据增强都有助于实现更好的准确性。

      In less than five years since AlexNet [109] was proposed, the Top5 error on ImageNet classification [179] with 1000 classes has dropped from 16% to 2%, as shown in Fig. 9. However, the mAP of the best performing detector [164] (which is only trained to detect 80 classes) on COCO [129] has reached only 73%, even at 0.5 IoU, illustrating clearly how object detection is much harder than image classification. The accuracy level achieved by the state of the art detectors is far from satisfying the requirements of general purpose practical applications, so there remains significant room for future improvement.

      自AlexNet[109]提出后不到5年,ImageNet分类[179]中1000个类别的Top5错误率从16%下降到2%,如Fig. 9所示。然而,在COCO[129]上,性能最好的检测器[164](仅训练用于检测80个类别)的mAP即使在0.5 IoU下也只达到73%,这清楚地说明了对象检测比图像分类难得多。目前最先进的检测器所达到的精度水平远不能满足通用实际应用的要求,因此仍有很大的改进空间。

    6 Conclusions

      Generic object detection is an important and challenging problem in computer vision, and has received considerable attention. Thanks to remarkable development of deep learning techniques, the field of object detection has dramatically evolved. As a comprehensive survey on deep learning for generic object detection, this paper has highlighted the recent achievements, provided a structural taxonomy for methods according to their roles in detection, summarized existing popular datasets and evaluation criteria, and discussed performance for the most representative methods.

      通用目标检测是计算机视觉中一个重要而又具有挑战性的问题,受到了广泛的关注。由于深度学习技术的显著发展,目标检测领域发生了巨大的变化。本文是对通用目标检测的深度学习的综合考察,重点介绍了近年来的研究成果,根据方法在检测中的作用进行了结构分类,总结了现有的常用数据集和评价标准,讨论了最具代表性的方法的性能。

      Despite the tremendous successes achieved in the past several years (e.g. detection accuracy improving significantly from 23% in ILSVRC2013 to 73% in ILSVRC2017), there remains a huge gap between the state-of-the-art and human-level performance, especially in terms of open world learning. Much work remains to be done, which we see focused on the following eight domains:

      尽管在过去几年中取得了巨大的成功(例如,检测准确率从ILSVRC2013的23%显著提高到ILSVRC2017的73%),但在技术水平和人类水平之间仍然存在巨大的差距,尤其是在开放世界学习方面。还有许多工作要做,我们认为集中于以下八个领域:

      (1) Open World Learning: The ultimate goal is to develop object detection systems that are capable of accurately and efficiently recognizing and localizing instances of all object categories (thousands or more object classes [43]) in all open world scenes, competing with the human visual system. Recent object detection algorithms are learned with limited datasets [53, 54, 129, 179], recognizing and localizing the object categories included in the dataset, but blind, in principle, to other object categories outside the dataset, although ideally a powerful detection system should be able to recognize novel object categories [112, 73]. Current detection datasets [53, 179, 129] contain only dozens to hundreds of categories, which is significantly smaller than those which can be recognized by humans. To achieve this goal, new large-scale labeled datasets with significantly more categories for generic object detection will need to be developed, since the state of the art in CNNs require extensive data to train well. However collecting such massive amounts of data, particularly bounding box labels for object detection, is very expensive, especially for hundreds of thousands categories.

      **(1)开放世界学习:**最终目标是开发能够在所有开放世界场景中准确而高效地识别和定位所有对象类别(数千个或更多对象类[43])实例的对象检测系统,与人类视觉系统相媲美。最近的对象检测算法是用有限的数据集[53,54,129,179]学习的,能够识别和定位数据集中包含的对象类别,但原则上对数据集之外的其他对象类别是盲目的,而理想情况下一个强大的检测系统应该能够识别新的对象类别[112,73]。目前的检测数据集[53,179,129]只包含几十到几百个类别,远远少于人类能够识别的类别。为了实现这一目标,需要开发类别数量显著更多的新的大规模标注数据集用于通用对象检测,因为最先进的CNN需要大量数据才能训练好。然而,收集如此大量的数据,特别是用于对象检测的边界框标注,是非常昂贵的,尤其是对于数十万个类别而言。

      (2) Better and More Efficient Detection Frameworks: One of the factors in the tremendous success of generic object detection has been the development of better detection frameworks, both region-based (RCNN [65], Fast RCNN [64], Faster RCNN [175], Mask RCNN [80]) and one-stage detectors (YOLO [174], SSD [136]). Region-based detectors have the highest accuracy, but are too computationally intensive for embedded or real-time systems. One-stage detectors have the potential to be faster and simpler, but have not yet reached the accuracy of region-based detectors. One possible limitation is that state of the art object detectors depend heavily on the underlying backbone network, which has initially been optimized for image classification, causing a learning bias due to the differences between classification and detection; one potential strategy is therefore to learn object detectors from scratch, like the DSOD detector [186].

      **(2)更好、更高效的检测框架:**通用对象检测取得巨大成功的因素之一是开发了更好的检测框架,包括基于区域的检测器(RCNN[65]、Fast RCNN[64]、Faster RCNN[175]、Mask RCNN[80])和单阶段检测器(YOLO[174]、SSD[136])。基于区域的检测器精度最高,但对于嵌入式或实时系统来说计算量太大。单阶段检测器有可能更快、更简单,但尚未达到基于区域的检测器的精度。一个可能的限制是,最先进的对象检测器严重依赖底层骨干网络,而这些骨干网络最初是为图像分类优化的,分类与检测之间的差异会导致学习偏差(bias);因此一个潜在的策略是像DSOD检测器[186]那样从头开始学习对象检测器。

      (3) Compact and Efficient Deep CNN Features: Another significant factor in the considerable progress in generic object detection has been the development of powerful deep CNNs, which have increased remarkably in depth, from several layers (e.g., AlexNet [110]) to hundreds of layers (e.g., ResNet [79], DenseNet [94]). These networks have millions to hundreds of millions of parameters, requiring massive data and power-hungry GPUs for training, which again limits their use in real-time / embedded applications. In response, there has been growing research interest in designing compact and lightweight networks [25, 4, 95, 88, 132, 231], network compression and acceleration [34, 97, 195, 121, 124], and network interpretation and understanding [19, 142, 146].

      **(3)紧凑高效的深度CNN特征:**通用对象检测取得重大进展的另一个重要因素是功能强大的深度CNN的发展,其深度从几层(如AlexNet[110])显著增加到数百层(如ResNet[79]、DenseNet[94])。这些网络有数百万至数亿个参数,需要大量数据和高功耗的GPU进行训练,这再次限制了它们在实时/嵌入式应用中的使用。因此,在设计紧凑和轻量级网络[25,4,95,88,132,231]、网络压缩和加速[34,97,195,121,124]以及网络解释和理解[19,142,146]方面的研究兴趣越来越大。

      (4) Robust Object Representations: One important factor which makes the object recognition problem so challenging is the great variability in real-world images, including viewpoint and lighting changes, object scale, object pose, object part deformations, back-ground clutter, occlusions, changes in appearance, image blur, image resolution, noise, and camera limitations and distortions. Despite the advances in deep networks, they are still limited by a lack of robustness to these many variations [134, 24], which significantly constrains the usability for real-world applications.

      **(4)鲁棒的对象表示:**使对象识别问题如此具有挑战性的一个重要因素是现实世界图像的巨大变异性,包括视点和光照变化、对象尺度、对象姿态、对象部件变形、背景杂乱、遮挡、外观变化、图像模糊、图像分辨率、噪声以及相机限制和失真。尽管深度网络取得了进展,但它们仍然缺乏对这些变化的鲁棒性[134,24],这极大地限制了实际应用中的可用性。

      (5) Context Reasoning: Real-world objects typically coexist with other objects and environments. It has been recognized that contextual information (object relations, global scene statistics) helps object detection and recognition [155], especially in situations of small or occluded objects or poor image quality. There was extensive work preceding deep learning [143, 152, 171, 47, 59], however since the deep learning era there has been only very limited progress in exploiting contextual information [29, 62, 90]. How to efficiently and effectively incorporate contextual information remains to be explored, ideally guided by how humans are quickly able to guide their attention to objects of interest in natural scenes.

      **(5)上下文推理:**现实世界中的对象通常与其他对象和环境共存。人们已经认识到,上下文信息(对象关系、全局场景统计)有助于对象检测和识别[155],特别是对于小对象、被遮挡对象或图像质量较差的情况。在深度学习之前已有大量相关工作[143,152,171,47,59],但自深度学习时代以来,在利用上下文信息方面的进展非常有限[29,62,90]。如何高效且有效地结合上下文信息仍有待探索,理想情况下可以借鉴人类如何快速将注意力引导到自然场景中感兴趣的对象。

      (6) Object Instance Segmentation: Continuing the trend of moving towards a richer and more detailed understanding of image content (e.g., from image classification to single object localization to object detection), a next challenge would be to tackle pixel-level object instance segmentation [129, 80, 93], as object instance segmentation can play an important role in many potential applications that require the precise boundaries of individual instances.

      **(6)对象实例分割:**随着朝更丰富、更详细的图像内容理解不断发展的趋势(例如,从图像分类到单对象定位再到对象检测),下一个挑战将是解决像素级的对象实例分割[129,80,93],因为对象实例分割在许多需要个体实例精确边界的潜在应用中可以发挥重要作用。

      (7) Weakly Supervised or Unsupervised Learning: Current state of the art detectors employ fully-supervised models learned from labelled data with object bounding boxes or segmentation masks [54, 129, 179], however such fully supervised learning has serious limitations, where the assumption of bounding box annotations may become problematic, especially when the number of object categories is large. Fully supervised learning is not scalable in the absence of fully labelled training data, therefore it is valuable to study how the power of CNNs can be leveraged in weakly supervised or unsupervised detection [15, 45, 187].

      **(7)弱监督或无监督学习:**当前最先进的检测器都采用从带有对象边界框或分割掩码标注的数据中学习的全监督模型[54,129,179],然而这种全监督学习存在严重的局限性:边界框标注的假设可能会出现问题,尤其是当对象类别数量很大时。在缺乏完全标注的训练数据的情况下,全监督学习是不可扩展的,因此研究如何在弱监督或无监督检测中利用CNN的能力是有价值的[15,45,187]。

      (8) 3D Object Detection: The progress of depth cameras has enabled the acquisition of depth information in the form of RGB-D images or 3D point clouds. The depth modality can be employed to help object detection and recognition; however, there is only limited work in this direction [30, 165, 220], which might benefit from taking advantage of large collections of high-quality CAD models [219].

      **(8)三维目标检测:**深度相机的发展使得深度信息能够以RGB-D图像或三维点云(3D point clouds)的形式获取。深度模态可以用来帮助对象检测和识别;不过这个方向上的工作还很有限[30,165,220],其可能得益于利用大量高质量CAD模型的集合[219]。

      The research field of generic object detection is still far from complete; given the massive algorithmic breakthroughs over the past five years, we remain optimistic of the opportunities over the next five years.

      一般目标检测的研究领域还很不完善;鉴于过去5年在算法方面取得的巨大突破,我们仍对未来5年的机遇持乐观态度。

    References

    1. Sivic J., Zisserman A. (2003) Video Google: A text retrieval approach to object matching in videos. In: International Conference on Computer Vision (ICCV), vol 2, pp. 1470–1477 ↩︎
