
Key Points of 《Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks》

Faster R-CNN is a landmark 2015 paper by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Its Region Proposal Network (RPN) broke the old pattern in which proposal generation was a stage separate from the detection network, and for the first time integrated region proposals and detection in a single network. Although the later YOLO and SSD achieve better detection results, the creative idea of the RPN is well worth studying.

Paper

http://www.rossgirshick.info/

Code

https://github.com/rbgirshick/py-faster-rcnn

First reading: 2017/2/13

Second reading: 2017/2/16

Abstract

In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features (what are the full-image convolutional features?) with the detection network. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with "attention" mechanisms, the RPN component tells the unified network where to look.

Introduction

Three existing problems

(1) Region proposals are the test-time computational bottleneck in state-of-the-art detection systems that are based on region proposal methods.

(2) Selective Search (SS) is an order of magnitude slower than the detection network, at 2 seconds per image in a CPU implementation.

(3) Even when SS is re-implemented on the GPU, the re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.

The proposed method

(1) The RPN (Region Proposal Network) computes proposals with a deep convolutional neural network.

(2) The RPN shares convolutional layers with state-of-the-art object detection networks; sharing the convolutional features greatly reduces computation time at the test stage.

(3) The convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. (Core insight: the convolutional features can be used not only for detection but also for generating proposals.)

(4) On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) and can be trained end-to-end specifically for the task of generating detection proposals.

(5) RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. RPNs introduce novel "anchor" boxes (the key innovation) that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references. (I am not yet entirely clear on what this pyramid means.)

(6) To unify RPNs with Fast R-CNN object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.

(7) RPN and Faster R-CNN have broad uses, such as 3D object detection, part-based detection, instance segmentation, and image captioning.

Related work

Faster R-CNN

(Core description of the model) Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions.

3.1 Region Proposal Network

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.

Because our ultimate goal is to share computation with a Fast R-CNN object detection network, we assume that both nets share a common set of convolutional layers.

To generate region proposals, we slide a small network over the conv feature map output by the last shared conv layer.

The small network

Input: an n×n (n=3) spatial window of the input conv feature map. (A 3×3 spatial window slides over the last conv feature map; note that this feature map is still multi-channel, not a single channel.)

Each sliding window is mapped to a lower-dimensional feature (256-d for ZF, 512-d for VGG-16).

This feature is fed into two sibling fully connected layers: a box-regression layer (reg) and a box-classification layer (cls).

The following passage needs to be understood further by reading the code:

Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1×1 convolutional layers (for reg and cls, respectively).
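
To make "an n×n conv followed by two sibling 1×1 convs" concrete, here is a minimal PyTorch sketch of the RPN head (the official py-faster-rcnn code is Caffe-based; the class and layer names here, and the 512-channel width of the VGG-16 feature map, are my assumptions for illustration):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN mini-network: a 3x3 conv slides over the shared
    feature map, then two sibling 1x1 convs produce cls and reg outputs."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        # the n x n (n=3) sliding window, implemented as a 3x3 convolution
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        # sibling 1x1 convs: 2k objectness scores and 4k box coordinates
        self.cls = nn.Conv2d(in_channels, 2 * k, kernel_size=1)
        self.reg = nn.Conv2d(in_channels, 4 * k, kernel_size=1)

    def forward(self, feat):                # feat: (N, 512, H, W)
        x = torch.relu(self.conv(feat))     # lower-dimensional feature per window
        return self.cls(x), self.reg(x)     # (N, 2k, H, W), (N, 4k, H, W)
```

Because the 3×3 and 1×1 convolutions are applied at every position, the same weights are shared across all spatial locations, which is exactly the sliding-window sharing described above.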

Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k.

So the reg layer has 4k outputs encoding the coordinates of k boxes (4 location values per box), and the cls layer outputs 2k scores (2 values per proposal) that estimate the probability of object vs. not object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question (the anchor's center coincides with the sliding window's center) and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of size W×H (typically ~2,400 positions), there are WHk anchors in total.
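
A minimal NumPy sketch of enumerating the WHk anchors (the 3 scales of 128, 256, and 512 pixels, the ratios 1:2, 1:1, 2:1, and the feature stride of 16 follow the paper's VGG-16 setting; the exact box rounding in the official generate_anchors.py differs slightly):

```python
import numpy as np

def make_anchors(H, W, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate k = len(scales) * len(ratios) anchors at each of the
    H*W feature-map positions, as (x1, y1, x2, y2) in image coordinates."""
    base = []
    for s in scales:
        for r in ratios:
            # box with area s*s and aspect ratio w/h = r, centered at (0, 0)
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                 # (k, 4)
    # centers of all sliding-window positions, in image coordinates
    cx, cy = np.meshgrid(np.arange(W) * stride, np.arange(H) * stride)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (centers + base).reshape(-1, 4)                # (H*W*k, 4)
```

For a typical ~60×40 feature map this yields 60×40×9 ≈ 20,000 anchors, matching the ~2,400 positions times k = 9.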

Translation-Invariant Anchors

An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors.

Translation invariance means that if an object translates in the image, the corresponding proposal should translate with it, and the same function should be able to predict the proposal at any location. MultiBox (which generates anchors by k-means clustering) does not guarantee that a translated object still produces the same proposal.

The translation-invariant property also reduces the model size. MultiBox has a (4+1)×800-dimensional fully-connected output layer, whereas our method has a (4+2)×9-dimensional convolutional output layer in the case of k = 9 anchors (4 location parameters and 2 class scores per anchor, 9 anchors per position). As a result, our output layer has 2.8×10^4 parameters: 512×(4+2)×9 = 27,648 for VGG-16, where 512 is not the number of sliding positions but the channel dimension of the VGG-16 conv feature that each window is mapped to. If the feature projection layers are considered, our proposal layers still have an order of magnitude fewer parameters than MultiBox. We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC (fewer parameters, lower overfitting risk).

Multiple scales and aspect ratios

As a comparison, our anchor-based method is built on a pyramid of anchors. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It relies only on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size: the feature map underlying the anchors stays at one scale, and the filters on it are single-scale too.

Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image. The design of multiscale anchors is a key component for sharing features without extra cost for addressing scales.

Loss Function

For training RPNs, we assign a binary class label (of being an object or not) to each anchor.

How positive anchors are labeled

We assign a positive label to two kinds of anchors:

(i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box (i.e., with some particular ground-truth box);

(ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples, but we still adopt the first condition because in some rare cases the second condition may find no positive sample.

How negative anchors are labeled

We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
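
Putting the positive and negative rules together, a minimal NumPy sketch of the labeling step (iou is assumed to be a precomputed (num_anchors, num_gt) IoU matrix; this is an illustration, not the official anchor_target_layer):

```python
import numpy as np

def label_anchors(iou, pos_thresh=0.7, neg_thresh=0.3):
    """iou: (num_anchors, num_gt) IoU matrix.
    Returns labels per anchor: 1 = positive, 0 = negative, -1 = ignored."""
    labels = -np.ones(iou.shape[0], dtype=np.int64)   # ignored by default
    max_iou = iou.max(axis=1)                         # best overlap per anchor
    # negative: IoU lower than 0.3 for all ground-truth boxes
    labels[max_iou < neg_thresh] = 0
    # positive rule (ii): IoU higher than 0.7 with any ground-truth box
    labels[max_iou > pos_thresh] = 1
    # positive rule (i): the anchor(s) with the highest IoU for each
    # ground-truth box, so every ground-truth box gets >= 1 positive anchor
    gt_best = iou.max(axis=0)                         # best overlap per GT box
    rule_i = np.where((iou == gt_best[np.newaxis, :]) & (gt_best > 0))[0]
    labels[rule_i] = 1
    return labels
```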

Loss function definition

Our loss function for an image is defined as:

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)

i is the index of an anchor in a mini-batch.

p_i is the predicted probability that anchor i is an object.

p_i^* is the ground-truth label:

p_i^* = 1 if the anchor is positive,

p_i^* = 0 if the anchor is negative.

t_i is a vector of the 4 parameterized coordinates of the predicted bounding box.

t_i^* is the same parameterization for the ground-truth box associated with a positive anchor (how well the ground truth matches the anchor).

L_cls is the classification loss: log loss over two classes (object vs. not object).

The regression loss is L_reg(t_i, t_i^*) = R(t_i - t_i^*),

where R is the smooth L1 loss.

The term p_i^* L_reg means the regression loss is activated only for positive anchors (p_i^* = 1).

The outputs of the cls and reg layers consist of {p_i} and {t_i} respectively.

The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ.

N_cls is the mini-batch size: N_cls = 256.

N_reg is the number of anchor locations: N_reg ≈ 2,400.

With λ = 10, the cls and reg terms are roughly equally weighted, and the results are insensitive to the value of λ over a wide range.

We also note that the normalization as above is not required and could be simplified. (Neither λ nor the normalization constants matters much.)
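
A minimal PyTorch sketch of this loss (assuming the labels and regression targets have been computed per anchor as above; the function name and tensor layout are illustrative):

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, bbox_pred, labels, bbox_targets,
             n_cls=256.0, n_reg=2400.0, lam=10.0):
    """cls_logits: (A, 2); bbox_pred, bbox_targets: (A, 4);
    labels: (A,) long tensor with 1 = positive, 0 = negative, -1 = ignored."""
    sampled = labels >= 0                 # anchors that contribute to the loss
    # L_cls: log loss over two classes (object vs. not object)
    l_cls = F.cross_entropy(cls_logits[sampled], labels[sampled],
                            reduction='sum') / n_cls
    # L_reg: smooth L1, activated only for positive anchors (p_i^* = 1)
    pos = labels == 1
    l_reg = F.smooth_l1_loss(bbox_pred[pos], bbox_targets[pos],
                             reduction='sum') / n_reg
    return l_cls + lam * l_reg
```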

For bounding-box regression, we adopt the parameterizations of the 4 coordinates following:

(The detailed analysis of these formulas is deferred to the next reading.)
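
For reference, the parameterization as given in the paper:

t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)

t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)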

x, y, w, and h denote the box's center coordinates and its width and height.

x, x_a, and x^* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box (that is, the predicted bounding box starts from the anchor box and moves toward the ground-truth box).

However, our method achieves bounding-box regression in a different manner from previous RoI-based (Region of Interest) methods, in which bounding-box regression is performed on features pooled from arbitrarily sized RoIs and the regression weights are shared by all region sizes. (Here, bounding-box regression does not rely on RoIs of arbitrary size.)

In our formulation, the features used for regression are of the same spatial size (3×3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors is learned (multiple scales are handled by k anchors whose regressors do not share weights). Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.

Training RPNs

The RPN can be trained end-to-end by backpropagation and stochastic gradient descent (SGD). Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate (there are far more negatives than positives). Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones (i.e., when positives cannot fill half of the 256-anchor batch, the remainder is filled with negatives).
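
A minimal NumPy sketch of this sampling scheme (the 256 and 128 figures come from the paper; the helper itself is illustrative):

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, pos_fraction=0.5):
    """Subsample labeled anchors so that at most batch_size of them
    contribute to the loss, with up to a 1:1 positive:negative ratio;
    surplus anchors are reset to -1 (ignored)."""
    labels = labels.copy()
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    n_pos = min(len(pos), int(batch_size * pos_fraction))  # at most 128 positives
    if len(pos) > n_pos:                                   # drop surplus positives
        drop = np.random.choice(pos, len(pos) - n_pos, replace=False)
        labels[drop] = -1
    n_neg = batch_size - n_pos                             # pad with negatives
    if len(neg) > n_neg:                                   # drop surplus negatives
        drop = np.random.choice(neg, len(neg) - n_neg, replace=False)
        labels[drop] = -1
    return labels
```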

We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. (Note: only the new layers are initialized from a Gaussian.)

All other layers (i.e., the shared convolutional layers) are initialized from a model pretrained for ImageNet classification.
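
In PyTorch terms, this two-part initialization might look like the following sketch (RPNHead is the hypothetical class from the earlier sketch; the backbone loading is standard torchvision usage):

```python
import torch.nn as nn
import torchvision

# shared conv layers: initialized by ImageNet pretraining
backbone = torchvision.models.vgg16(weights='IMAGENET1K_V1').features

# new RPN layers: zero-mean Gaussian with standard deviation 0.01
def init_new_layers(module):
    if isinstance(module, nn.Conv2d):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

rpn_head = RPNHead(in_channels=512, k=9)
rpn_head.apply(init_new_layers)
```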

Hyperparameter settings: to be analyzed in the next reading.

Sharing Features for RPN and Fast R-CNN

Reference article:

http://blog.csdn.net/u011534057/article/details/51247371