YOLO Source Code Explained (5): Tracing the 7*7 Grid Back to Its Source

2016-12-26 13:26, 288 views

Author: 木凌

Date: November 2016.

Article link: http://blog.csdn.net/u014540717

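People keep asking what it means to divide the image into a 7*7 grid with two proposed boxes per cell, and many never quite get it. Today we start from the source code, trace everything back to its origin, and work out exactly what the 7*7 grid is.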

In YOLO Source Code Explained (3): Forward Propagation (forward), we analyzed the code of detection_layer.c. Let's look at how truth_index is defined:

int truth_index = (b*locations + i)*(1+l.coords+l.classes);

The terms mean the following:

locations: 7*7

b: index into the batch

i: index into locations

1: the confidence

l.coords: 4, one each for x, y, w, h

l.classes: 20
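
To make the indexing concrete, here is a minimal sketch in C of the same arithmetic, assuming the settings listed above (49 locations, 4 coords, 20 classes). The helper names truth_offset and truth_floats_per_image are hypothetical and do not exist in darknet; they only mirror the truth_index expression.

```c
#include <assert.h>

/* Hypothetical helper (not part of darknet): offset of the first truth
 * value for grid cell i of image b, mirroring the truth_index expression. */
int truth_offset(int b, int i, int locations, int coords, int classes)
{
    assert(i >= 0 && i < locations);
    /* Each cell stores 1 confidence + `classes` class flags + `coords` box values. */
    int cell_stride = 1 + coords + classes;
    return (b * locations + i) * cell_stride;
}

/* Hypothetical helper: number of ground-truth floats stored per image. */
int truth_floats_per_image(int locations, int coords, int classes)
{
    return locations * (1 + coords + classes);
}
```

With these values each cell occupies 25 floats, so truth_offset(0, 1, 49, 4, 20) is 25 and the second image in the batch starts at truth_offset(1, 0, 49, 4, 20), which is 1225.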

Further down, we find the following code segment:

// l.n is the number of boxes each grid cell proposes; in the paper l.n = 2
for(j = 0; j < l.n; ++j){
    int box_index = index + locations*(l.classes + l.n) + (i*l.n + j) * l.coords;
    box out = float_to_box(l.output + box_index);
    out.x /= l.side;
    out.y /= l.side;

    if (l.sqrt){
        out.w = out.w*out.w;
        out.h = out.h*out.h;
    }

    // compute the IoU
    float iou  = box_iou(out, truth);
    //iou = 0;
    // compute the root-mean-square error (RMSE)
    float rmse = box_rmse(out, truth);
    // pick the box with the highest IoU; if no box overlaps the truth at all
    // (every IoU is 0), fall back to the box with the lowest RMSE
    if(best_iou > 0 || iou > 0){
        if(iou > best_iou){
            best_iou = iou;
            best_index = j;
        }
    }else{
        if(rmse < best_rmse){
            best_rmse = rmse;
            best_index = j;
        }
    }
}
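
The box_index expression implies how one image's output vector is laid out: the first locations*(l.classes + l.n) values hold the class scores and per-box confidences, and the boxes follow. The sketch below reproduces that arithmetic without the per-image base offset (the index term). The helper names box_offset and outputs_per_image are hypothetical, and the values used are the paper's settings, taken here as assumptions.

```c
#include <assert.h>

/* Hypothetical helper (not part of darknet): offset, relative to the start
 * of one image's output, of box j in cell i. Mirrors box_index minus `index`. */
int box_offset(int i, int j, int locations, int classes, int n, int coords)
{
    assert(j >= 0 && j < n);
    /* Skip the segment holding class scores and per-box confidences. */
    int boxes_start = locations * (classes + n);
    return boxes_start + (i * n + j) * coords;
}

/* Hypothetical helper: total number of output values per image. */
int outputs_per_image(int locations, int classes, int n, int coords)
{
    return locations * (classes + n + n * coords);
}
```

For locations = 49, classes = 20, n = 2 and coords = 4, the boxes start at offset 1078, the last box (cell 48, box 1) starts at 1466, and the whole output has 1470 values, that is 7*7*30.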

In that loop, the most important line is box_iou(out, truth);, which computes the IoU between a predicted box and the ground-truth box. truth is defined as follows:

box truth = float_to_box(state.truth + truth_index + 1 + l.classes);

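From this definition, the coordinates of the ground-truth box come from state.truth, so let's trace it back to its source.

state.truth is first assigned in network.c: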

//network.c
state.truth = y;

And where does y come from? It originates in the function load_data_in_thread(args);. Let's dissect it:

//data.c
pthread_t load_data_in_thread(load_args args)
{
    pthread_t thread;
    struct load_args *ptr = calloc(1, sizeof(struct load_args));
    *ptr = args;
    // start a thread that runs load_thread
    if(pthread_create(&thread, 0, load_thread, ptr)) error("Thread creation failed");
    return thread;
}

//data.c
void *load_thread(void *ptr)
{
    //printf("Loading data: %d\n", rand());
    load_args a = *(struct load_args*)ptr;
    if(a.exposure == 0) a.exposure = 1;
    if(a.saturation == 0) a.saturation = 1;
    if(a.aspect == 0) a.aspect = 1;

    if (a.type == OLD_CLASSIFICATION_DATA){
        *a.d = load_data_old(a.paths, a.n, a.m, a.labels, a.classes, a.w, a.h);
    } else if (a.type == CLASSIFICATION_DATA){
        *a.d = load_data_augment(a.paths, a.n, a.m, a.labels, a.classes, a.hierarchy, a.min, a.max, a.size, a.angle, a.aspect, a.hue, a.saturation, a.exposure);
    } else if (a.type == SUPER_DATA){
        *a.d = load_data_super(a.paths, a.n, a.m, a.w, a.h, a.scale);
    } else if (a.type == WRITING_DATA){
        *a.d = load_data_writing(a.paths, a.n, a.m, a.w, a.h, a.out_w, a.out_h);
    } else if (a.type == REGION_DATA){
        // a.type == REGION_DATA in our case, so this is the function that gets called; we keep tracing
        *a.d = load_data_region(a.n, a.paths, a.m, a.w, a.h, a.num_boxes, a.classes, a.jitter, a.hue, a.saturation, a.exposure);
.
.
.

//data.c
data load_data_region(int n, char **paths, int m, int w, int h, int size, int classes, float jitter, float hue, float saturation, float exposure)
{
    char **random_paths = get_random_paths(paths, n, m);
    int i;
    data d = {0};
    d.shallow = 0;
    // n is the batch size
    d.X.rows = n;
    // allocate the row pointers for X (the image data)
    d.X.vals = calloc(d.X.rows, sizeof(float*));
    d.X.cols = h*w*3;

    int k = size*size*(5+classes);
    // Found it at last. Memory for y is allocated here: n*k floats in total. Why so many? Read on.
    d.y = make_matrix(n, k);
    for(i = 0; i < n; ++i){
        // load the image
        image orig = load_image_color(random_paths[i], 0, 0);

        int oh = orig.h;
        int ow = orig.w;

        // jitter = 0.2 here (set in the cfg file). This is the so-called jitter, which is really a random crop (a form of data augmentation).
        // The crop must not remove too much: at most dw (1/5 of the image width) from the left or right, and at most dh (1/5 of the image height) from the top or bottom.
        // A negative value extends the crop beyond the image edge instead of cutting into it.
        int dw = (ow*jitter);
        int dh = (oh*jitter);
        // rand_uniform returns a random number in (-dw, dw)
        int pleft  = rand_uniform(-dw, dw);
        int pright = rand_uniform(-dw, dw);
        int ptop   = rand_uniform(-dh, dh);
        int pbot   = rand_uniform(-dh, dh);

        // swidth and sheight are the width and height after cropping
        int swidth =  ow - pleft - pright;
        int sheight = oh - ptop - pbot;

        // sx is the ratio of the cropped width to the original width; sy likewise for the height
        float sx = (float)swidth  / ow;
        float sy = (float)sheight / oh;

        // randomly decide whether to flip the image
        int flip = rand()%2;
        // crop the image; the implementation is simple, so we skip it
        image cropped = crop_image(orig, pleft, ptop, swidth, sheight);

        // dx = pleft/swidth, dy = ptop/sheight
        float dx = ((float)pleft/ow)/sx;
        float dy = ((float)ptop /oh)/sy;

        // after cropping, resize the image back to 448*448 (the input size given in the paper)
        image sized = resize_image(cropped, w, h);
        // flip the image
        if(flip) flip_image(sized);
        // randomly distort the image's hue, saturation, and exposure
        random_distort_image(sized, hue, saturation, exposure);
        // d.X.vals[] now holds the network input. X is ready, so let's prepare y
        d.X.vals[i] = sized.data;

        // now trace y
        fill_truth_region(random_paths[i], d.y.vals[i], classes, size, flip, dx, dy, 1./sx, 1./sy);

.
.
.
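
The crop arithmetic above is easier to follow with numbers. The sketch below recomputes swidth, sx and dx for the x axis and maps a normalized x coordinate from the original image into the cropped one. The struct crop_x and both function names are hypothetical. The mapping in to_cropped_x is derived from the crop geometry and from the arguments passed to fill_truth_region (dx and 1./sx); it is an assumption about what correct_boxes does, not a copy of its code, and it ignores flipping and clamping.

```c
#include <assert.h>

/* Hypothetical struct: values derived from one random crop along the x axis. */
typedef struct {
    int swidth;   /* width after cropping, in pixels */
    float sx;     /* cropped width / original width */
    float dx;     /* left offset in units of the cropped width */
} crop_x;

/* Recompute the x-axis quantities from load_data_region for given offsets. */
crop_x crop_params_x(int ow, int pleft, int pright)
{
    crop_x c;
    c.swidth = ow - pleft - pright;
    assert(c.swidth > 0);
    c.sx = (float)c.swidth / ow;
    c.dx = ((float)pleft / ow) / c.sx;   /* algebraically pleft / swidth */
    return c;
}

/* Map a normalized x from the original image to the cropped image:
 * (x*ow - pleft) / swidth, rewritten as x * (1/sx) - dx. */
float to_cropped_x(float x, crop_x c)
{
    return x * (1.0f / c.sx) - c.dx;
}
```

For a 500-pixel-wide image cropped by 50 pixels on each side, swidth is 400 and the image centre (x = 0.5) stays at the centre. Cropping 100 pixels from the left only moves the point at x = 0.2 to the left edge of the crop.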

//data.c
void fill_truth_region(char *path, float *truth, int classes, int num_boxes, int flip, float dx, float dy, float sx, float sy)
{
    char labelpath[4096];
    // Some people wonder where the labels come from: the source never sets a labels path, so how are the labels read, and isn't that unsupervised learning?
    // The source simply doesn't set the path directly: replace "images" with "labels" and ".jpg" with ".txt", and you have the label path.
    find_replace(path, "images", "labels", labelpath);
    find_replace(labelpath, "JPEGImages", "labels", labelpath);

    find_replace(labelpath, ".jpg", ".txt", labelpath);
    find_replace(labelpath, ".png", ".txt", labelpath);
    find_replace(labelpath, ".JPG", ".txt", labelpath);
    find_replace(labelpath, ".JPEG", ".txt", labelpath);
    int count = 0;
    // read the labels from the .txt file; count records the number of boxes
    box_label *boxes = read_boxes(labelpath, &count);
    // shuffle the boxes
    randomize_boxes(boxes, count);
    // the image has been cropped, so the box coordinates must change too:
    // correct_boxes converts them from the original image to the cropped image
    correct_boxes(boxes, count, dx, dy, sx, sy, flip);
    float x,y,w,h;
    int id;
    int i;
    for (i = 0; i < count; ++i) {
        x =  boxes[i].x;
        y =  boxes[i].y;
        w =  boxes[i].w;
        h =  boxes[i].h;
        id = boxes[i].id;

        // boxes that became too small after cropping are not used as positive samples
        if (w < .01 || h < .01) continue;

        // x lies in 0~1 (0 and 1 may not be reachable, because cropping changed the coordinate range);
        // num_boxes = 7, so col and row are integers in 0~6
        int col = (int)(x*num_boxes);
        int row = (int)(y*num_boxes);

        // x and y become offsets within the cell, again in the range 0~1
        x = x*num_boxes - col;
        y = y*num_boxes - row;

        // the index: there are 7*7 cells, row and col each run 0~6, and each cell occupies 5+classes floats
        int index = (col+row*num_boxes)*(5+classes);
        // the cell already holds an object, so skip this box
        if (truth[index]) continue;
        // box i falls in this cell, so set the cell's confidence to 1
        truth[index++] = 1;
        // set the entry for the label's class id to 1
        if (id < classes) truth[index+id] = 1;
        index += classes;
        // then write the box's x, y, w, h into truth
        truth[index++] = x;
        truth[index++] = y;
        truth[index++] = w;
        truth[index++] = h;
    }
    free(boxes);
}
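
The cell assignment in the loop above can be tried in isolation. The following is a minimal sketch, assuming a 7*7 grid and 20 classes; cell_slot, locate_cell and write_truth are hypothetical names that only repackage the arithmetic of fill_truth_region, without file reading or box correction.

```c
#include <assert.h>

/* Hypothetical struct: where a ground-truth box centre lands in the grid. */
typedef struct {
    int col, row;   /* cell coordinates, each in [0, side) */
    float x, y;     /* box centre as an offset inside that cell, in [0, 1) */
    int index;      /* offset of the cell's first value in the truth array */
} cell_slot;

/* Find the cell containing the normalized centre (x, y). */
cell_slot locate_cell(float x, float y, int side, int classes)
{
    assert(x >= 0 && x < 1 && y >= 0 && y < 1);
    cell_slot s;
    s.col = (int)(x * side);
    s.row = (int)(y * side);
    s.x = x * side - s.col;
    s.y = y * side - s.row;
    s.index = (s.col + s.row * side) * (5 + classes);
    return s;
}

/* Write one object using the layout [confidence][class flags][x y w h].
 * Returns 0 without writing if the cell already holds an object. */
int write_truth(float *truth, cell_slot s, int id, int classes, float w, float h)
{
    int i = s.index;
    if (truth[i]) return 0;
    truth[i++] = 1;                        /* confidence */
    if (id < classes) truth[i + id] = 1;   /* class flag */
    i += classes;
    truth[i++] = s.x;
    truth[i++] = s.y;
    truth[i++] = w;
    truth[i++] = h;
    return 1;
}
```

A box centred at (0.5, 0.5) lands in column 3, row 3, whose values start at offset 600 in a zero-initialized array of 7*7*25 = 1225 floats. A second box centred in the same cell is rejected, which is why each cell carries at most one ground-truth object.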

That completes the trace of y. Now let's go back to:

float iou  = box_iou(out, truth);

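out is what the network regresses (each cell has l.n of them; in the paper l.n = 2). Each out is compared with the corresponding values in truth to compute an IoU, and the box with the highest of the l.n IoUs is chosen as the final predicted box. Put plainly: only that box is held responsible for the object. It alone contributes the coordinate and object-confidence terms to the loss function, while the other boxes in the cell contribute only the no-object confidence term. That is all there is to it.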
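
The selection rule can be sketched as a small standalone C function. This is an illustration under assumptions, not darknet's implementation: the box struct, iou, sq_err and best_box are written from scratch here, boxes are given as centre plus width and height, and squared error stands in for RMSE because it ranks candidates in the same order without needing a square root.

```c
#include <assert.h>

typedef struct { float x, y, w, h; } box;   /* centre x, y and width, height */

/* Overlap length of two 1-D intervals given by centre and width. */
static float overlap(float c1, float w1, float c2, float w2)
{
    float l1 = c1 - w1 / 2, l2 = c2 - w2 / 2;
    float r1 = c1 + w1 / 2, r2 = c2 + w2 / 2;
    float left  = l1 > l2 ? l1 : l2;
    float right = r1 < r2 ? r1 : r2;
    return right - left;
}

/* Intersection over union of two boxes; 0 when they do not overlap. */
float iou(box a, box b)
{
    float w = overlap(a.x, a.w, b.x, b.w);
    float h = overlap(a.y, a.h, b.y, b.h);
    if (w <= 0 || h <= 0) return 0;
    float inter = w * h;
    return inter / (a.w * a.h + b.w * b.h - inter);
}

/* Squared error over the four box values (same ordering as RMSE). */
float sq_err(box a, box b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dw = a.w - b.w, dh = a.h - b.h;
    return dx * dx + dy * dy + dw * dw + dh * dh;
}

/* Index of the box responsible for the truth: highest IoU, or lowest
 * error when none of the n boxes overlaps the truth. */
int best_box(const box *out, int n, box truth)
{
    assert(n > 0);
    int best_index = 0;
    float best_iou = 0, best_err = 1e30f;
    for (int j = 0; j < n; ++j) {
        float v = iou(out[j], truth);
        float e = sq_err(out[j], truth);
        if (best_iou > 0 || v > 0) {
            if (v > best_iou) { best_iou = v; best_index = j; }
        } else if (e < best_err) {
            best_err = e;
            best_index = j;
        }
    }
    return best_index;
}
```

With two candidates for one cell, the box that overlaps the ground truth more wins. When neither overlaps at all, the one whose values are numerically closer to the ground truth is chosen, so there is always exactly one responsible box per object.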

Now do you see what "a 7*7 grid with two proposed boxes per cell" really means?

(END)
