
Object Detection 6 - Spatial Transformer Networks

2018-02-28 11:00

Spatial Transformer Networks

Introduction

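Convolutional neural networks (CNNs) have proven to be powerful classification models, but, like traditional pattern-recognition methods, they are still affected by spatial variation in the data. This paper proposes a network called the Spatial Transformer Network (STN). It needs no keypoint annotations and learns, driven by the classification task or any other task, to spatially transform and align its input (translation, scaling, rotation and other geometric transformations). When the inputs vary widely in space, the module can be added to an existing convolutional network to improve classification accuracy.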

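Background

Linear interpolation

Given two points $(x_0, y_0)$ and $(x_1, y_1)$ on a line $l$, we want the value $y$ on the line at some position $x$ in the interval $[x_0, x_1]$ (the reverse direction works the same way and is omitted). Since the slopes are equal:

$$\frac{y - y_0}{x - x_0} = \frac{y_1 - y_0}{x_1 - x_0} \tag{1}$$

$$y = \frac{x_1 - x}{x_1 - x_0}\, y_0 + \frac{x - x_0}{x_1 - x_0}\, y_1 \tag{2}$$

Equivalently, the distances from $x$ to $x_0$ and $x_1$ act as weights for averaging $y_0$ and $y_1$: each endpoint is weighted by the distance from $x$ to the opposite endpoint, so the nearer endpoint counts more. Bilinear interpolation is essentially linear interpolation in two directions.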
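As a quick check of Eq. (2), here is a minimal Python sketch. The function name `lerp` and the sample numbers are illustrative choices, not part of the original post.

```python
def lerp(x, x0, y0, x1, y1):
    """Linearly interpolate y at x between (x0, y0) and (x1, y1), as in Eq. (2)."""
    if x1 == x0:
        raise ValueError("x0 and x1 must differ")
    w0 = (x1 - x) / (x1 - x0)  # weight of y0: grows as x approaches x0
    w1 = (x - x0) / (x1 - x0)  # weight of y1: grows as x approaches x1
    return w0 * y0 + w1 * y1


# Halfway between x0 = 2 and x1 = 3, both endpoints get weight 0.5.
print(lerp(2.5, 2.0, 10.0, 3.0, 20.0))  # 15.0
```

The two weights always sum to 1, which is why the result stays between $y_0$ and $y_1$ whenever $x$ lies in $[x_0, x_1]$.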

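Bilinear interpolation

Mathematically, bilinear interpolation extends linear interpolation to functions of two variables. The core idea is to perform linear interpolation once in each of the two directions.

Suppose we want the value of an unknown function $f$ at the point $P = (x, y)$, and we know the value of $f$ at the four points $Q_{11} = (x_1, y_1)$, $Q_{12} = (x_1, y_2)$, $Q_{21} = (x_2, y_1)$ and $Q_{22} = (x_2, y_2)$. In the most common case, $f$ is simply the pixel value at a pixel location. First interpolate linearly in the $x$ direction:

$$f(R_1) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{x - x_1}{x_2 - x_1} f(Q_{21}) \quad \text{where } R_1 = (x, y_1)$$

$$f(R_2) \approx \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \quad \text{where } R_2 = (x, y_2)$$

Then interpolate linearly in the $y$ direction:

$$f(P) \approx \frac{y_2 - y}{y_2 - y_1} f(R_1) + \frac{y - y_1}{y_2 - y_1} f(R_2)$$

Combining the two steps gives the final bilinear interpolation result:

$$f(x, y) \approx \frac{y_2 - y}{y_2 - y_1} \frac{x_2 - x}{x_2 - x_1} f(Q_{11}) + \frac{y_2 - y}{y_2 - y_1} \frac{x - x_1}{x_2 - x_1} f(Q_{21}) + \frac{y - y_1}{y_2 - y_1} \frac{x_2 - x}{x_2 - x_1} f(Q_{12}) + \frac{y - y_1}{y_2 - y_1} \frac{x - x_1}{x_2 - x_1} f(Q_{22}) \tag{3}$$

Because bilinear interpolation on an image only uses the 4 neighbouring pixels, whose coordinates differ by 1, every denominator in the formula above equals 1.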
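The two-step procedure can be sketched in plain Python for the image case, where neighbouring pixels are 1 apart and the denominators vanish. The function name `bilinear` and the convention that `x` indexes columns and `y` indexes rows are assumptions made for this illustration.

```python
import math


def bilinear(img, x, y):
    """Bilinearly interpolate img[row][col] at real-valued (x, y).

    x is the column coordinate and y the row coordinate. The image must be
    at least 2x2 and the caller must keep 0 <= x <= width - 1 and
    0 <= y <= height - 1.
    """
    h, w = len(img), len(img[0])
    # Top-left neighbour; clamped so that (x2, y2) stays inside the image.
    x1 = min(int(math.floor(x)), w - 2)
    y1 = min(int(math.floor(y)), h - 2)
    x2, y2 = x1 + 1, y1 + 1
    # Step 1: interpolate along x on the two neighbouring rows (R1 and R2).
    r1 = (x2 - x) * img[y1][x1] + (x - x1) * img[y1][x2]
    r2 = (x2 - x) * img[y2][x1] + (x - x1) * img[y2][x2]
    # Step 2: interpolate along y between R1 and R2.
    return (y2 - y) * r1 + (y - y1) * r2


img = [[0.0, 10.0],
       [20.0, 30.0]]
print(bilinear(img, 0.5, 0.5))  # 15.0, the mean of the four pixels
```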

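Affine transformation

scaling: scales the x and y directions by the scalars $p$ and $q$.

$$K' = \begin{bmatrix} p & 0 & 0 \\ 0 & q & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} px \\ qy \end{bmatrix}$$

shearing: offsets x by a number proportional to y, and y by a number proportional to x (for example, turning upright type into italics).

$$K' = \begin{bmatrix} 1 & m & 0 \\ n & 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} x + my \\ y + nx \end{bmatrix}$$

rotating: rotates the points around the origin by an angle $\theta$.

$$K' = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} x\cos\theta - y\sin\theta \\ x\sin\theta + y\cos\theta \end{bmatrix}$$

translating: shifts the points by an offset $\Delta$.

$$K' = \begin{bmatrix} 1 & 0 & \Delta \\ 0 & 1 & \Delta \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} x + \Delta \\ y + \Delta \end{bmatrix}$$

Note that the first three operations rewrite $\begin{bmatrix} x \\ y \end{bmatrix}$ as $\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$ mainly in order to generalize: a translation cannot be written as a $2 \times 2$ matrix product, so its transformation matrix must be $2 \times 3$.

summary

Hence, we can generalize our results and represent our 4 affine transformations (all linear transformations are affine) by the 6-parameter matrix:

$$M = \begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix}$$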
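To make the four matrices concrete, the following NumPy sketch applies each of them to the point $(1, 1)$ in homogeneous form $[x, y, 1]$. The helper `apply_affine` and all numeric values are illustrative assumptions.

```python
import numpy as np


def apply_affine(M, points):
    """Apply a 2x3 affine matrix M to an (N, 2) array of (x, y) points."""
    points = np.asarray(points, dtype=float)
    ones = np.ones((points.shape[0], 1))
    homogeneous = np.hstack([points, ones])             # rows are [x, y, 1]
    return homogeneous @ np.asarray(M, dtype=float).T   # (N, 2)


theta = np.pi / 2
scale = [[2, 0, 0], [0, 3, 0]]                      # p = 2, q = 3
shear = [[1, 0.5, 0], [0, 1, 0]]                    # m = 0.5, n = 0
rotate = [[np.cos(theta), -np.sin(theta), 0],
          [np.sin(theta), np.cos(theta), 0]]        # 90 degrees counterclockwise
translate = [[1, 0, 5], [0, 1, 5]]                  # Delta = 5

p = [[1.0, 1.0]]
for name, M in [("scale", scale), ("shear", shear),
                ("rotate", rotate), ("translate", translate)]:
    print(name, apply_affine(M, p))
```

Because every operation shares the same $2 \times 3$ shape, a network only has to predict six numbers to express any of them, or any combination.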

Limitations of pooling layers

Q1: How does a pooling layer give a model some spatial invariance?

Well think of it this way. The idea behind pooling is to take a complex input, split it up into cells, and “pool” the information from these complex cells to produce a set of simpler cells that describe the output. So for example, say we have 3 images of the number 7(MNIST), each in a different orientation. A pool over a small grid in each image would detect the number 7 regardless of its position in that grid since we’d be capturing approximately the same information by aggregating pixel values.

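Q2: What are the downsides of pooling?

Pooling is destructive. A 2x2 max pooling layer discards 75% of the feature activations, which means some precise positional information is inevitably lost. That trade buys spatial robustness for classification, but in object detection this positional information is very important.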
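Both answers can be seen in a toy NumPy sketch. The helper `max_pool_2x2` and the 4x4 arrays are made up for illustration: a one-pixel shift inside a pooling cell leaves the output unchanged, a larger shift does not, and only a quarter of the values survive.

```python
import numpy as np


def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array with even height and width."""
    h, w = x.shape
    # Split into (h/2, 2, w/2, 2) blocks and take the max inside each 2x2 block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


a = np.zeros((4, 4)); a[0, 0] = 1.0   # activation in the top-left pooling cell
b = np.zeros((4, 4)); b[1, 1] = 1.0   # shifted by one pixel, still the same cell
c = np.zeros((4, 4)); c[2, 2] = 1.0   # shifted into a different pooling cell

pooled_a, pooled_b, pooled_c = (max_pool_2x2(t) for t in (a, b, c))
kept = pooled_a.size / a.size          # fraction of values that survive pooling

print(np.array_equal(pooled_a, pooled_b))  # True: small shift is absorbed
print(np.array_equal(pooled_a, pooled_c))  # False: larger shift is not
print(kept)                                # 0.25
```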

Another limitation of pooling is that it is local and predefined.

With a small receptive field, the effects of a pooling operator are only felt towards deeper layers of the network, meaning intermediate feature maps may suffer from large input distortions. And remember, we can’t just increase the receptive field arbitrarily because then that would downsample our feature map too aggressively.

The main takeaway is that ConvNets are not invariant to relatively large input distortions. This limitation is due to having only a restricted, pre-defined pooling mechanism for dealing with spatial variation of the data. This is where Spatial Transformer Networks come into play!


The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster. (Geoffrey Hinton, Reddit AMA)

Spatial Transformer Networks

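Spatial Transformers have three properties that make them very appealing:

modular: STNs can be inserted into any existing architecture with only minor adjustments;

differentiable: STNs can be trained with backpropagation, so the model they are inserted into can be trained end to end;

dynamic: STNs perform an active spatial transformation on a feature map for each input sample, whereas a pooling layer acts identically on all input samples.

A Spatial Transformer consists of three parts: the localisation network, the grid generator and the sampler. The localisation network computes the parameters $\theta$ of the spatial transformation. The grid generator produces the correspondence $\mathcal{T}_\theta$ between each position of the output map $V \in \mathbb{R}^{H' \times W' \times C}$ and the input map $U \in \mathbb{R}^{H \times W \times C}$. The sampler uses the input map $U$ and the correspondence $\mathcal{T}_\theta$ to generate the final output map.

The module transforms its input feature map, and how it transforms is learned during training. Put plainly, during training the module learns which transformation of the input most helps the model classify, and at test time the trained network applies the corresponding transformation to each input, which improves recognition accuracy. The three parts of the STN are described in detail below.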

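[Note] When we apply an affine transformation to an image, we do not transform the original image directly. Instead, we first generate a grid of sampling points, transform those grid points, and then use them to sample the original image to produce the transformed image.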

An image processing affine transformation usually follows the 3-step pipeline below:

First, we create a sampling grid composed of $(x, y)$ coordinates. For example, given a 400x400 grayscale image, we create a meshgrid of the same dimensions, that is, evenly spaced $x \in [0, W]$ and $y \in [0, H]$.

We then apply the transformation matrix to the sampling grid generated in the step above.

Finally, we sample the resulting grid from the original image using the desired interpolation technique.

As you can see, this is different than directly applying a transform to the original image.
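The 3-step pipeline can be sketched in NumPy as follows. This is a simplified illustration that works in pixel coordinates and uses nearest-neighbour sampling with zeros outside the image; the function name `warp_nearest` and those choices are assumptions, not something the quoted text prescribes.

```python
import numpy as np


def warp_nearest(image, M):
    """Warp a 2D image with a 2x3 affine matrix M by transforming a sampling grid."""
    h, w = image.shape
    # Step 1: sampling grid holding the (x, y) coordinate of every output pixel.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])   # (3, h*w)
    # Step 2: apply the transformation matrix to the grid.
    src = np.asarray(M, dtype=float) @ grid                      # (2, h*w)
    # Step 3: sample the original image at the transformed grid points.
    sx = np.rint(src[0]).astype(int)
    sy = np.rint(src[1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(h * w, dtype=image.dtype)
    out[valid] = image[sy[valid], sx[valid]]
    return out.reshape(h, w)


image = np.arange(16).reshape(4, 4)
shifted = warp_nearest(image, [[1, 0, 1], [0, 1, 0]])
print(shifted)
```

In this sketch the matrix adds 1 to the x coordinate of every grid point, so each output pixel reads the input pixel one column to its right and the picture appears shifted left. The matrix acts on the grid, not on the image content.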

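According to the authors, an STN can be installed at any layer of any CNN. Some readers misread this, assuming that the path from U to V in Figure 2 is the original convolution and that the STN is a branch added alongside it. That is not the case: the module takes the output U of an existing layer and transforms it into V, with no convolution in between. As the right side of Figure 3 shows, the U-to-V transformation effectively produces new data, and this transformation is learned rather than fixed. Because it is learned, it acts to reduce the loss: a simple spatial transformation of the input makes the features easier to classify (it moves in the direction of lower loss). In addition, with an STN the network can dynamically achieve the rotation invariance, translation invariance and similar properties that were previously attributed to pooling layers. It can also select the most important region of the image (which helps classification) and transform it into an ideal pose (for example, turning a character upright).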

Localisation Network

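The input of the localisation network is the feature map $U \in \mathbb{R}^{H \times W \times C}$ (height, width, channels), and its output is $\theta = f_{loc}(U)$, the parameters of the transformation $\mathcal{T}_\theta$ to apply to that feature map. The function $f_{loc}()$ can take any form; typically it is a fully connected or convolutional network followed by a regression layer that outputs $\theta$. The size of $\theta$ depends on the transformation: for a 2D affine transformation, $\theta$ has 6 entries arranged as a $2 \times 3$ matrix (see the affine transformation section).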

As we train our network, we would like our localisation net to output more and more accurate thetas. What do we mean by accurate? Well, think of our digit 9 rotated by 90 degrees counterclockwise. After say 2 epochs, our localisation net may output a transformation matrix which performs a 45 degree clockwise rotation and after 5 epochs for example, it may actually learn to do a complete 90 degree clockwise rotation. The effect is that our output image looks like a standard digit 9, something our neural network has seen in the training data and can easily classify.
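A minimal forward-pass sketch of such a localisation network in NumPy is shown below. Everything here is hypothetical: the class name `TinyLocalisationNet`, the single hidden layer, and the layer sizes. Starting the final regression layer at the identity transform (zero weights, identity bias) is a commonly used choice assumed for this sketch, not something the post specifies.

```python
import numpy as np


class TinyLocalisationNet:
    """Hypothetical one-hidden-layer regressor: feature map -> 2x3 theta."""

    def __init__(self, in_features, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.01, size=(in_features, hidden))
        self.b1 = np.zeros(hidden)
        # Regression layer: zero weights and an identity bias, so the
        # untrained network outputs the identity transform for any input.
        self.w2 = np.zeros((hidden, 6))
        self.b2 = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])

    def __call__(self, U):
        """U: (N, H, W, C) batch of feature maps -> theta: (N, 2, 3)."""
        x = U.reshape(U.shape[0], -1)                  # flatten each sample
        h = np.maximum(x @ self.w1 + self.b1, 0.0)     # ReLU hidden layer
        return (h @ self.w2 + self.b2).reshape(-1, 2, 3)


U = np.random.default_rng(1).normal(size=(4, 8, 8, 1))
loc_net = TinyLocalisationNet(in_features=8 * 8 * 1)
theta = loc_net(U)
print(theta.shape)  # (4, 2, 3)
```

Training would then move `w2` and `b2` away from the identity so that each sample gets its own $\theta$, which is the "more and more accurate thetas" behaviour described above.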

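Grid Generator

The grid generator outputs a parametrised sampling grid: the set of points in the source image where the input map should be sampled (using bilinear interpolation) to produce the desired transformed output.

Suppose each pixel of $U$ (not necessarily the input image; it can also be the feature map output by another layer) has coordinates $(x_i^s, y_i^s)$, each pixel of $V$ has coordinates $(x_i^t, y_i^t)$, and the spatial transformation $\mathcal{T}_\theta$ is an affine transformation. Then the correspondence between $(x_i^s, y_i^s)$ and $(x_i^t, y_i^t)$ can be written as:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

This maps a position $(x_i^t, y_i^t)$ on the output feature map to a position $(x_i^s, y_i^s)$ on the input feature map according to the parameters $\theta$, as shown in Figure 3.

Here s stands for source (coordinates in U) and t stands for target (coordinates in V). Why is $x_i^s$ on the left-hand side of the equation and $x_i^t$ on the right? Aren't we trying to obtain the target? It looks as if we are transforming the target. In fact, $\mathcal{T}_\theta(G_i)$ denotes the sampling grid points obtained by transforming the target grid; it is not a transformation applied directly to the original image. The original image never changes; it is the grid that is transformed. We apply the affine transformation to the grid, lay the transformed grid over the original image, and fill each cell of the output image with the pixel value found at the corresponding position in the original image. This guarantees that the output always has the size of the grid we chose, which means we can control the resolution of this layer's output through the grid size (the affine matrix also affects what the output shows).

At this point $(x_i^s, y_i^s)$ usually falls between pixels of the input feature map, so bilinear interpolation is needed to compute the value at that point. Note also that the paper uses normalised coordinates (normalised by height and width) for the transformation, i.e. $x_i, y_i \in [-1, 1]$.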
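A grid generator for the affine case can be sketched in NumPy as below. The function name `affine_grid` and the choice to place the first and last pixel centres exactly at -1 and 1 are assumptions made for illustration; other implementations use slightly different normalisation conventions.

```python
import numpy as np


def affine_grid(theta, out_h, out_w):
    """Map every target pixel of an out_h x out_w output to source coordinates.

    theta: (2, 3) affine matrix acting on normalised coordinates in [-1, 1].
    Returns (x_s, y_s), each of shape (out_h, out_w).
    """
    x_t = np.linspace(-1.0, 1.0, out_w)
    y_t = np.linspace(-1.0, 1.0, out_h)
    x_t, y_t = np.meshgrid(x_t, y_t)                 # both (out_h, out_w)
    # Target grid G in homogeneous form: one column [x_t, y_t, 1] per pixel.
    G = np.stack([x_t.ravel(), y_t.ravel(), np.ones(out_h * out_w)])
    src = np.asarray(theta, dtype=float) @ G         # (2, out_h * out_w)
    return src[0].reshape(out_h, out_w), src[1].reshape(out_h, out_w)


# A theta that scales by 0.5 makes the output look at the central half of the
# input in each direction, i.e. it zooms in.
x_s, y_s = affine_grid([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]], out_h=3, out_w=5)
print(x_s.min(), x_s.max(), y_s.min(), y_s.max())
```

The output size `(out_h, out_w)` is chosen independently of the input size, which matches the point above that the grid size controls the output resolution.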

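Differentiable Image Sampling

Given the input feature map $U$ and the parametrised sampling grid $\mathcal{T}_\theta(G_i)$, the pixel values of the output feature map $V$ can be related to those of the input feature map $U$ by the following formula:

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y) \quad \forall i \in [1 \ldots H'W'] \quad \forall c \in [1 \ldots C]$$

* $U_{nm}^c$ is the value at location $(n, m)$ in channel $c$ of the input, and $V_i^c$ is the output value for pixel $i$ at location $(x_i^t, y_i^t)$ in channel $c$.

* Note that the sampling is done identically for each channel of the input (this preserves spatial consistency between channels).

* $\Phi_x$ and $\Phi_y$ are the parameters of a generic sampling kernel $k()$ which defines the image interpolation (e.g. bilinear). The kernel $k()$ determines how the input and output feature maps are related.

One way to read this: (1) the value of a point $V_i^c$ on the output feature map corresponds to the value at a point $(x_i^s, y_i^s)$ on the input feature map, which is determined jointly by the values $U_{nm}^c$ of several surrounding pixels; the closer a pixel is to $(x_i^s, y_i^s)$ (the distance is given by $x_i^s - m$ and $y_i^s - n$), the greater its influence (weight); (2) the specific interpolation method is determined by $\Phi_x$ and $\Phi_y$ in $k()$.

In theory any sampling kernel can be used. The paper uses the bilinear sampling kernel, written in a more compact form:

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|)$$

This formula means the same thing as the bilinear interpolation described earlier; because adjacent pixels in an image are 1 apart, the denominators disappear. With this kernel, the target value depends only on the 4 pixels surrounding $(x_i^s, y_i^s)$. Specifically, when $|x_i^s - m|$ or $|y_i^s - n|$ is at least 1, the corresponding $\max()$ term is $0$, so only the 4 pixels around $(x_i^s, y_i^s)$ determine the target value. The smaller $|x_i^s - m|$ and $|y_i^s - n|$ are (that is, the larger $1 - |x_i^s - m|$ and $1 - |y_i^s - n|$), the greater the weight: the closer a pixel is to $(x_i^s, y_i^s)$, the more it contributes.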
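The bilinear kernel formula can be transcribed almost literally into NumPy. In this sketch the function name `bilinear_sample` is illustrative, and the source coordinates are assumed to be already expressed in pixel units (x along the width, y along the height) rather than in the normalised $[-1, 1]$ range.

```python
import numpy as np


def bilinear_sample(U, x_s, y_s):
    """V = sum_n sum_m U[n, m, c] * max(0, 1-|x_s-m|) * max(0, 1-|y_s-n|).

    U: (H, W, C) input feature map.
    x_s, y_s: arrays of equal shape holding source coordinates in pixel units.
    Returns V with shape x_s.shape + (C,).
    """
    H, W, C = U.shape
    m = np.arange(W)
    n = np.arange(H)
    # Kernel weights; they are non-zero only for the neighbouring pixels.
    wx = np.maximum(0.0, 1.0 - np.abs(x_s[..., None] - m))   # (..., W)
    wy = np.maximum(0.0, 1.0 - np.abs(y_s[..., None] - n))   # (..., H)
    # Double sum over n and m, applied identically to every channel c.
    return np.einsum("...n,...m,nmc->...c", wy, wx, U)


U = np.arange(12, dtype=float).reshape(3, 4, 1)   # U[n, m, 0] = 4 * n + m
x_s = np.array([1.0, 1.5])
y_s = np.array([2.0, 0.5])
V = bilinear_sample(U, x_s, y_s)
print(V[:, 0])
```

The first sample lands exactly on pixel $(n, m) = (2, 1)$ and returns its value; the second lands in the middle of four pixels and returns their mean. This dense version touches every input pixel for every output pixel, which mirrors the formula but is wasteful; a practical implementation would gather only the 4 neighbours.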

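Take the output element $a^l_{22}$. Under the transform $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ 2 \end{bmatrix} + \begin{bmatrix} -1 \\ -1 \end{bmatrix}$ its index maps to $(1, 1)$, so it takes its value from $a^{l-1}_{11}$. Here the transformed point happens to land on integers. What if we are not so lucky?

Suppose $a^l_{22}$ is mapped to $\begin{bmatrix} 1.6 \\ 2.4 \end{bmatrix}$. If we set the value of $a^l_{22}$ to that of the nearest point in the input map, we would pick $a^{l-1}_{22}$. The problem with this approach is that it cannot be trained with gradient descent. Recall what a gradient measures: how the output changes when we change the parameters slightly. If we perturb this network's parameters (the 2x3 matrix) slightly, the transformed position changes slightly (to $\begin{bmatrix} 1.62 \\ 2.38 \end{bmatrix}$, say), but $a^l_{22}$ is still connected to $a^{l-1}_{22}$, so the result does not change, the gradient is 0, and the network cannot be trained.

Hence we need interpolation.

The point (1.6, 2.4) lies among 4 points of the input map. The interpolation weights depend on the distances between (1.6, 2.4) and these 4 points, in the form $(1 - d_x)(1 - d_y)$. With this formulation gradient descent works: when the parameter matrix $\begin{bmatrix} 0 & 0.5 & 0.6 \\ 1 & 0 & 0.4 \end{bmatrix}$ changes slightly, the position (1.6, 2.4) changes slightly, and the value of $a^l_{22}$ changes accordingly.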
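The argument can be checked numerically. The NumPy sketch below reuses the example's parameter matrix and perturbs its translation so that the position moves from (1.6, 2.4) to (1.62, 2.38). The helper names, the 4x4 input array and the use of 0-based array indices for the example's index (2, 2) are assumptions for illustration.

```python
import numpy as np


def transform_index(theta, idx):
    """Map an output index to an input position with a 2x3 affine matrix."""
    theta = np.asarray(theta, dtype=float)
    return theta[:, :2] @ np.asarray(idx, dtype=float) + theta[:, 2]


def sample_nearest(a, p):
    """Value of the input element closest to position p."""
    r, c = np.rint(p).astype(int)
    return a[r, c]


def sample_bilinear(a, p):
    """Value at p, weighting the 4 surrounding elements by (1 - d)(1 - d')."""
    r0, c0 = np.floor(p).astype(int)
    dr, dc = p - np.floor(p)
    return ((1 - dr) * (1 - dc) * a[r0, c0] + (1 - dr) * dc * a[r0, c0 + 1]
            + dr * (1 - dc) * a[r0 + 1, c0] + dr * dc * a[r0 + 1, c0 + 1])


a_prev = np.arange(16, dtype=float).reshape(4, 4)    # toy input layer
theta = [[0.0, 0.5, 0.6], [1.0, 0.0, 0.4]]           # maps (2, 2) to (1.6, 2.4)
theta_eps = [[0.0, 0.5, 0.62], [1.0, 0.0, 0.38]]     # maps (2, 2) to (1.62, 2.38)

p, p_eps = transform_index(theta, (2, 2)), transform_index(theta_eps, (2, 2))
nearest_change = sample_nearest(a_prev, p_eps) - sample_nearest(a_prev, p)
bilinear_change = sample_bilinear(a_prev, p_eps) - sample_bilinear(a_prev, p)
print(nearest_change, bilinear_change)
```

With nearest-neighbour sampling the output does not move at all under the perturbation, so no gradient reaches the parameters; with bilinear sampling the output changes by a small non-zero amount.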

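Another very important point is that the formula above is differentiable with respect to $U_{nm}^c$ and $(x_i^s, y_i^s)$, so the transformation performed by the Spatial Transformer can be trained inside the network and its parameters continually corrected. The derivatives are:

$$\frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_n^H \sum_m^W \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|)$$

$$\frac{\partial V_i^c}{\partial x_i^s} = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - |y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases}$$

$\frac{\partial V_i^c}{\partial y_i^s}$ is analogous to $\frac{\partial V_i^c}{\partial x_i^s}$. The derivative with respect to $\theta$ is:

$$\frac{\partial V_i^c}{\partial \theta} = \begin{pmatrix} \dfrac{\partial V_i^c}{\partial x_i^s} \cdot \dfrac{\partial x_i^s}{\partial \theta} \\[2ex] \dfrac{\partial V_i^c}{\partial y_i^s} \cdot \dfrac{\partial y_i^s}{\partial \theta} \end{pmatrix}$$

where $\frac{\partial x_i^s}{\partial \theta}$ and $\frac{\partial y_i^s}{\partial \theta}$ follow from the specific transformation function:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$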
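As a worked step (derived here from the affine map, not spelled out in the post), expanding the matrix product and differentiating each source coordinate with respect to each entry of $\theta$ gives:

$$x_i^s = \theta_{11} x_i^t + \theta_{12} y_i^t + \theta_{13}, \qquad y_i^s = \theta_{21} x_i^t + \theta_{22} y_i^t + \theta_{23}$$

$$\frac{\partial x_i^s}{\partial \theta_{11}} = x_i^t, \quad \frac{\partial x_i^s}{\partial \theta_{12}} = y_i^t, \quad \frac{\partial x_i^s}{\partial \theta_{13}} = 1, \quad \frac{\partial x_i^s}{\partial \theta_{21}} = \frac{\partial x_i^s}{\partial \theta_{22}} = \frac{\partial x_i^s}{\partial \theta_{23}} = 0$$

$$\frac{\partial y_i^s}{\partial \theta_{21}} = x_i^t, \quad \frac{\partial y_i^s}{\partial \theta_{22}} = y_i^t, \quad \frac{\partial y_i^s}{\partial \theta_{23}} = 1, \quad \frac{\partial y_i^s}{\partial \theta_{11}} = \frac{\partial y_i^s}{\partial \theta_{12}} = \frac{\partial y_i^s}{\partial \theta_{13}} = 0$$

So the first row of $\theta$ receives gradient only through $x_i^s$ and the second row only through $y_i^s$, each scaled by the target coordinate it multiplies.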

Gradient flow after adding the spatial transformer layer: see the reference link.

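Experimental results

Fix two entries of the parameter matrix to 0 so that the localisation network outputs only four numbers $(a, d, e, f)$, i.e. the matrix $\begin{bmatrix} a & 0 & e \\ 0 & d & f \end{bmatrix}$ (note that these letters are assigned differently from $M$ in the affine transformation section). Such a matrix can only scale and translate, not rotate, which in effect strengthens the model's ability to focus on a region. On the 200-class bird classification problem this improves results.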
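A small NumPy sketch of this constrained transform is shown below. The helper name `attention_theta` and the numeric values are hypothetical, and the matrix layout $[[a, 0, e], [0, d, f]]$ is the reading assumed here for the four numbers. Mapping the corners of the normalised output grid shows that the sampled region is always an axis-aligned box, i.e. a crop.

```python
import numpy as np


def attention_theta(a, d, e, f):
    """Hypothetical helper: scale-and-translate-only matrix [[a, 0, e], [0, d, f]]."""
    return np.array([[a, 0.0, e], [0.0, d, f]])


theta = attention_theta(0.5, 0.5, 0.25, -0.25)

# Corners of the output grid in normalised coordinates, as rows [x_t, y_t, 1].
corners = np.array([[-1.0, -1.0, 1.0],
                    [1.0, -1.0, 1.0],
                    [-1.0, 1.0, 1.0],
                    [1.0, 1.0, 1.0]])
src = corners @ theta.T   # where each output corner samples from in the input

print(src[:, 0].min(), src[:, 0].max())  # x range of the crop: -0.25 0.75
print(src[:, 1].min(), src[:, 1].max())  # y range of the crop: -0.75 0.25
```

With both scales below 1 the output covers only part of the input, so the network spends its full output resolution on the selected region.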

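Summary

Without any keypoint annotations, an STN learns task-driven spatial transformation parameters for images or features and aligns the input images or learned features in space. This reduces the impact of geometric variation such as rotation, translation, scale and distortion on tasks such as classification and localisation. Added to an existing CNN or FCN, it can improve the network's learning capacity.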

Source

https://kevinzakka.github.io/2017/01/10/stn-part1/

https://kevinzakka.github.io/2017/01/18/stn-part2/

http://blog.csdn.net/sinat_34474705/article/details/75268248

http://tang.su/2017/04/paper-notes-spatial-transformer-network/

http://www.cnblogs.com/neopenx/p/4851806.html

http://blog.csdn.net/xbinworld/article/details/69049680

http://blog.csdn.net/shaoxiaohu1/article/details/51809605
