
[Deep Learning Paper Notes][Object Detection] You Only Look Once: Unified, Real-Time Object Detection

2016-11-12 17:57

Redmon, Joseph, et al. “You only look once: Unified, real-time object detection.” arXiv preprint arXiv:1506.02640 (2015). (Citations: 76).

1 Motivation

We frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.

2 Pipeline

See Fig.

1. Resize the input image to 448 × 448 (448 rather than 224, to capture fine-grained visual information).

2. Divide the input image into an S × S grid (S = 7 in our case).

3. Each grid cell predicts B bounding boxes and a confidence value Pr(object) for each box (B = 2 in our case).

4. Each grid cell also predicts K class probabilities conditioned on an object being present, Pr(k|object). Only one set of class probabilities is predicted per grid cell, regardless of the number of boxes B.

5. Combine the class and individual box predictions: Pr(k) = Pr(object) · Pr(k|object).

6. Finally, apply NMS and threshold the detections.
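
Steps 5 and 6 can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function names, the (x1, y1, x2, y2) box format, and the threshold values 0.2 and 0.5 are all assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_scores(p_object, p_class_given_object):
    """Step 5: Pr(k) = Pr(object) * Pr(k|object) for one predicted box."""
    return [p_object * p for p in p_class_given_object]


def nms(detections, score_threshold=0.2, iou_threshold=0.5):
    """Step 6: drop low scores, then greedy NMS over (box, score) pairs of one class.

    Both thresholds are hypothetical values picked for this sketch.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Made-up detections for a single class: two overlapping boxes, one separate
# box, and one box whose score falls below the threshold.
detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),
    ((20, 20, 30, 30), 0.7),
    ((0, 0, 5, 5), 0.1),
]
kept = nms(detections)
```

With these made-up inputs, the second box is suppressed by the first (their IOU is about 0.68) and the last box is removed by the score threshold, so two detections remain.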

3 Training Details

During training, if the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. The loss function penalizes classification error (the conditional class probability) only if an object is present in that grid cell. Likewise, it penalizes bounding box coordinate error only for the predictor that is “responsible” for the ground truth box (i.e. the one with the highest IOU among the B predictors in that grid cell). For the confidence values, we increase the confidence score of the “responsible” predictor and decrease the confidence of the other boxes. Consequently, for grid cells that contain no ground-truth object, we only decrease the confidence of their boxes and do not adjust the class probabilities or coordinates.
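
The two notions of "responsible" described above can be illustrated with a minimal sketch. The helper names `responsible_cell` and `responsible_predictor`, the pixel coordinates, and the (x1, y1, x2, y2) box format are assumptions made for this example; the paper's actual code may organize this differently.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(center_x, center_y, image_size=448, S=7):
    """Return (row, col) of the grid cell that contains the object center."""
    cell = image_size / S
    # Clamp so a center lying exactly on the right/bottom edge stays in the grid.
    col = min(int(center_x // cell), S - 1)
    row = min(int(center_y // cell), S - 1)
    return row, col


def responsible_predictor(predicted_boxes, ground_truth_box):
    """Return the index of the predictor with the highest IOU to the ground truth."""
    ious = [iou(box, ground_truth_box) for box in predicted_boxes]
    return max(range(len(ious)), key=lambda i: ious[i])


# An object centered at (100, 300) in a 448 x 448 image with a 7 x 7 grid.
row, col = responsible_cell(100, 300)

# B = 2 hypothetical predictions from that cell and one ground-truth box.
best = responsible_predictor([(0, 0, 10, 10), (2, 2, 12, 12)], (3, 3, 12, 12))
```

Here each cell is 64 pixels wide, so the center lands in row 4, column 1, and the second predictor (index 1) is the one whose coordinates would be penalized.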

The reason we need two kinds of probabilities is that if we predicted Pr(k) directly for every box in every grid cell, there would be S × S × B × K predicted values, most of which should be zero. Introducing Pr(object) solves this problem: we update Pr(object) in every grid cell, but update Pr(k|object) only when there is an object in that grid cell.
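
To make the counting argument concrete, the short sketch below compares the two parameterizations. S = 7 and B = 2 come from the pipeline section; K = 20 is an assumption for illustration (the number of PASCAL VOC classes) and is not stated in these notes.

```python
# S and B are taken from the notes; K = 20 is an assumed example value.
S, B, K = 7, 2, 20

# Predicting Pr(k) directly: one value per cell, per box, per class.
direct = S * S * B * K

# Factored form: one Pr(object) per box plus one set of Pr(k|object) per cell.
factored = S * S * B + S * S * K

print(direct, factored)  # prints: 1960 1078
```

Under these assumed values, the factored form needs roughly half as many probability outputs, and only the S × S × B confidence values are trained on every cell.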

4 Results

See Tab. YOLO is faster than Faster R-CNN, but less accurate. This is because YOLO imposes strong spatial constraints on bounding box predictions: each grid cell predicts only B boxes and can have only one class. Besides, since the model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Finally, the loss function treats errors the same in small bounding boxes and large bounding boxes, although a small error in a large box is generally benign while a small error in a small box has a much greater effect on IOU. The main source of error is incorrect localization.
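
The claim about small versus large boxes can be checked with a small worked example. The box sizes (100 and 10 pixels) and the 5-pixel shift are made-up values for illustration, and `shifted_iou` is a hypothetical helper, not something from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IOU between a size x size box and the same box shifted right by `shift`."""
    box = (0.0, 0.0, float(size), float(size))
    moved = (float(shift), 0.0, float(size + shift), float(size))
    return iou(box, moved)


large = shifted_iou(100, 5)  # 95 / 105, about 0.90
small = shifted_iou(10, 5)   # 5 / 15, about 0.33
```

The same 5-pixel localization error leaves the large box with an IOU of about 0.90 but drops the small box to about 0.33, which is why treating both errors equally in the loss is a limitation.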

5 References

[1]. CVPR 2016. https://www.youtube.com/watch?v=NM6lrxy0bxs.
