
TensorFlow: Understanding Fast Neural Style in Depth

2016-10-09 14:15


reference: http://hacker.duanshishi.com/?p=1639


Preface

The previous posts covered models commonly used in Computer Vision. Over the next while I will spend some time studying TensorFlow applications in Computer Vision, mainly by analyzing the relevant papers and source code. Today I will look in detail at fast neural style. An earlier post analyzed neural style, which is where this line of work started, but it is hard to use in practice. Why? Every run requires specifying a content image and a style image and then minimizing the content loss and style loss to generate an image. That takes a long time, and there is no way to save a model for a particular style, so every generated image amounts to training a model from scratch. Fast neural style, by contrast, can save a model trained for a given style and then simply transform any content image with it. The paper also mentions another application of image transformation networks: Super-Resolution, using deep learning to turn low-resolution images into high-resolution ones, which is already deployed at many large internet companies, especially video sites.


How the Paper Works

A few months ago I wrote about Neural Style in TensorFlow之深入理解Neural Style. A Neural Algorithm of Artistic Style builds a multi-layer convolutional network and generates an image that combines content and style by minimizing the defined content loss and style loss, which is very interesting. Perceptual Losses for Real-Time Style Transfer and Super-Resolution instead uses a perceptual loss, computed with a pre-trained VGG model, in place of the per-pixel loss, simplifying the original loss computation, and adds a transform network that directly produces the stylized version of the content image. How is this achieved? See the figure below (the two-network architecture from the paper); let me walk through it:

The whole system consists of two parts: an image transformation network and a loss network. The image transformation network is a deep residual convolutional network that transforms the input (content) image directly into a stylized image. The loss network's parameters are fixed; its structure is the same as the network in A Neural Algorithm of Artistic Style, but its weights are never updated and it is only used to compute the content loss and the style loss. This is the so-called perceptual loss: as the authors explain, a convolutional model pretrained for image classification has already learned perceptual and semantic information, so the loss network serves purely to measure content and style losses rather than being updated as in A Neural Algorithm of Artistic Style; what gets updated are the parameters of the transform network in front of it. Viewed end to end, the input image passes through the transform network to obtain the transformed image, the corresponding losses are computed, and minimizing that loss updates the transform network. Simple, isn't it?
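To make the data flow concrete before diving into the code, here is a minimal sketch of the training setup described above, written against the TensorFlow r0.10 API used throughout this post. The module names (transform, vgg), the VGG file name, and the hyperparameters are illustrative assumptions rather than the author's exact script, and input/output scaling plus the style loss term are glossed over here.

import tensorflow as tf
import vgg        # the fixed loss network defined later in this post
import transform  # the transform network defined later (module name assumed)

BATCH_SIZE = 4            # illustrative values, not the author's settings
images = tf.placeholder(tf.float32, [BATCH_SIZE, 256, 256, 3])  # content images from COCO
generated = transform.net(images)                               # stylized output, same shape

# One shared pass through the fixed VGG loss network for both generated and
# content images (vgg.preprocess / mean subtraction omitted here for brevity).
net, mean_pixel = vgg.net('imagenet-vgg-verydeep-19.mat', tf.concat(0, [generated, images]))

# Content (perceptual) loss on a single layer; the full script adds the
# Gram-matrix style loss over several layers in the same fashion.
gen_feat, content_feat = tf.split(0, 2, net['relu4_2'])
content_loss = tf.nn.l2_loss(gen_feat - content_feat) / tf.to_float(tf.size(gen_feat))

# Only transform.net contains trainable tf.Variables (the VGG weights are
# tf.constant), so this step updates the transform network alone.
train_op = tf.train.AdamOptimizer(1e-3).minimize(content_loss)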

The losses are computed much as before. The content (feature reconstruction) loss compares the generated image $\hat{y}$ with the content target $y$ in the feature space of layer $j$ of the loss network $\phi$:

$$\ell_{feat}^{\phi,j}(\hat{y}, y) = \frac{1}{C_j H_j W_j}\,\lVert \phi_j(\hat{y}) - \phi_j(y) \rVert_2^2$$

The style loss is the squared Frobenius norm of the difference of Gram matrices:

$$\ell_{style}^{\phi,j}(\hat{y}, y) = \lVert G_j^{\phi}(\hat{y}) - G_j^{\phi}(y) \rVert_F^2$$

where the Gram matrix used in the style loss is

$$G_j^{\phi}(x)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(x)_{h,w,c}\, \phi_j(x)_{h,w,c'}$$

The Gram matrix is the key ingredient: whatever the spatial size of a feature map, its Gram matrix is always $C_j \times C_j$, so the style loss stays well defined even when $\hat{y}$ and the style image $y$ have different shapes. For the full derivation see the corresponding section of the paper; it is easy to follow once you read it.
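To see this concretely, here is a small NumPy illustration (not from the paper or the repo): feature maps with different spatial sizes but the same channel count produce Gram matrices of identical shape.

import numpy as np

def gram_np(features):
    # features: a single feature map of shape (H, W, C); returns a C x C Gram matrix
    h, w, c = features.shape
    f = features.reshape(h * w, c)
    return f.T.dot(f) / (h * w * c)

a = np.random.rand(64, 64, 128)    # e.g. features of the generated image
b = np.random.rand(96, 128, 128)   # style features with a different spatial size
print(gram_np(a).shape, gram_np(b).shape)  # both (128, 128): the style loss stays well defined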


By now you should have a good picture of how this paper achieves fast neural style. To summarize:
The transform network is a deep residual network that turns the input image into an image with a particular style; its parameters are updated during training.
The loss network has essentially the same structure as in the earlier paper; it is only used to compute the content loss and the style loss, and its parameters are never updated.
Thanks to the Gram matrix, the loss remains easy to compute even when the transformed image and the style target passed through the loss network have different shapes.


Fast Neural Style on TensorFlow

The code is based on https://github.com/OlavHN/fast-neural-style, but when I ran it, it did not work. The cause is roughly some issues around local_variables after a TensorFlow update; see this issue for details: https://github.com/tensorflow/tensorflow/issues/1045#issuecomment-239789244. Also, that project keeps everything in one place and is a bit messy, so I separated the training code from the code that generates the stylized image. The project is on my personal GitHub as neural_style_tensorflow. Basic requirements:
Python 2.7.x
TensorFlow r0.10
VGG-19 model
COCO dataset

Transform Network architecture
import tensorflow as tf

def conv2d(x, input_filters, output_filters, kernel, strides, padding='SAME'):
    with tf.variable_scope('conv') as scope:
        shape = [kernel, kernel, input_filters, output_filters]
        weight = tf.Variable(tf.truncated_normal(shape, stddev=0.1), name='weight')
        convolved = tf.nn.conv2d(x, weight, strides=[1, strides, strides, 1], padding=padding, name='conv')

        normalized = batch_norm(convolved, output_filters)

        return normalized

def conv2d_transpose(x, input_filters, output_filters, kernel, strides, padding='SAME'):
    with tf.variable_scope('conv_transpose') as scope:
        shape = [kernel, kernel, output_filters, input_filters]
        weight = tf.Variable(tf.truncated_normal(shape, stddev=0.1), name='weight')

        batch_size = tf.shape(x)[0]
        height = tf.shape(x)[1] * strides
        width = tf.shape(x)[2] * strides
        # tf.pack is the pre-1.0 name of tf.stack (this post targets TensorFlow r0.10)
        output_shape = tf.pack([batch_size, height, width, output_filters])
        convolved = tf.nn.conv2d_transpose(x, weight, output_shape, strides=[1, strides, strides, 1], padding=padding, name='conv_transpose')

        normalized = batch_norm(convolved, output_filters)
        return normalized

def batch_norm(x, size):
    # batch normalization using statistics computed from the current batch
    batch_mean, batch_var = tf.nn.moments(x, [0, 1, 2], keep_dims=True)
    beta = tf.Variable(tf.zeros([size]), name='beta')
    scale = tf.Variable(tf.ones([size]), name='scale')
    epsilon = 1e-3
    return tf.nn.batch_normalization(x, batch_mean, batch_var, beta, scale, epsilon, name='batch')

def residual(x, filters, kernel, strides, padding='SAME'):
    with tf.variable_scope('residual') as scope:
        conv1 = conv2d(x, filters, filters, kernel, strides, padding=padding)
        conv2 = conv2d(tf.nn.relu(conv1), filters, filters, kernel, strides, padding=padding)

        residual = x + conv2

        return residual

def net(image):
    # conv layers (stride 1, 2, 2), five residual blocks, then two stride-2
    # transposed convolutions and a final stride-1 layer back to 3 channels
    with tf.variable_scope('conv1'):
        conv1 = tf.nn.relu(conv2d(image, 3, 32, 9, 1))
    with tf.variable_scope('conv2'):
        conv2 = tf.nn.relu(conv2d(conv1, 32, 64, 3, 2))
    with tf.variable_scope('conv3'):
        conv3 = tf.nn.relu(conv2d(conv2, 64, 128, 3, 2))
    with tf.variable_scope('res1'):
        res1 = residual(conv3, 128, 3, 1)
    with tf.variable_scope('res2'):
        res2 = residual(res1, 128, 3, 1)
    with tf.variable_scope('res3'):
        res3 = residual(res2, 128, 3, 1)
    with tf.variable_scope('res4'):
        res4 = residual(res3, 128, 3, 1)
    with tf.variable_scope('res5'):
        res5 = residual(res4, 128, 3, 1)
    with tf.variable_scope('deconv1'):
        deconv1 = tf.nn.relu(conv2d_transpose(res5, 128, 64, 3, 2))
    with tf.variable_scope('deconv2'):
        deconv2 = tf.nn.relu(conv2d_transpose(deconv1, 64, 32, 3, 2))
    with tf.variable_scope('deconv3'):
        deconv3 = tf.nn.tanh(conv2d_transpose(deconv2, 32, 3, 9, 1))

    # tanh output scaled to roughly [-127.5, 127.5]
    y = deconv3 * 127.5

    return y
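As a quick sanity check (not part of the original project), the transform network can be run on a dummy batch to confirm that the two stride-2 convolutions are undone by the two stride-2 transposed convolutions, so the output keeps the input resolution. This assumes the code above is saved as transform.py and uses the same r0.10 API.

import numpy as np
import tensorflow as tf
import transform  # the code above, assumed saved as transform.py

x = tf.placeholder(tf.float32, [1, 256, 256, 3])
y = transform.net(x)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    out = sess.run(y, feed_dict={x: np.zeros((1, 256, 256, 3), dtype=np.float32)})
    print(out.shape)  # (1, 256, 256, 3): same resolution as the input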


Using a deep residual network trained on the COCO dataset makes it possible to train a deeper model without sacrificing performance. The loss network is a pretrained VGG network with the following structure:
import tensorflow as tf
import numpy as np
import scipy.io
from scipy import misc

def net(data_path, input_image):
    layers = (
        'conv1_1', 'relu1_1', 'conv1_2', 'relu1_2', 'pool1',

        'conv2_1', 'relu2_1', 'conv2_2', 'relu2_2', 'pool2',

        'conv3_1', 'relu3_1', 'conv3_2', 'relu3_2', 'conv3_3',
        'relu3_3', 'conv3_4', 'relu3_4', 'pool3',

        'conv4_1', 'relu4_1', 'conv4_2', 'relu4_2', 'conv4_3',
        'relu4_3', 'conv4_4', 'relu4_4', 'pool4',

        'conv5_1', 'relu5_1', 'conv5_2', 'relu5_2', 'conv5_3',
        'relu5_3', 'conv5_4', 'relu5_4'
    )

    data = scipy.io.loadmat(data_path)
    mean = data['normalization'][0][0][0]
    mean_pixel = np.mean(mean, axis=(0, 1))
    weights = data['layers'][0]

    net = {}
    current = input_image
    for i, name in enumerate(layers):
        kind = name[:4]
        if kind == 'conv':
            kernels, bias = weights[i][0][0][0][0]
            # matconvnet: weights are [width, height, in_channels, out_channels]
            # tensorflow: weights are [height, width, in_channels, out_channels]
            kernels = np.transpose(kernels, (1, 0, 2, 3))
            bias = bias.reshape(-1)
            current = _conv_layer(current, kernels, bias, name=name)
        elif kind == 'relu':
            current = tf.nn.relu(current, name=name)
        elif kind == 'pool':
            current = _pool_layer(current, name=name)
        net[name] = current

    assert len(net) == len(layers)
    return net, mean_pixel

def _conv_layer(input, weights, bias, name=None):
    conv = tf.nn.conv2d(input, tf.constant(weights), strides=(1, 1, 1, 1),
                        padding='SAME', name=name)
    return tf.nn.bias_add(conv, bias)

def _pool_layer(input, name=None):
    return tf.nn.max_pool(input, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1),
                          padding='SAME', name=name)

def preprocess(image, mean_pixel):
    return image - mean_pixel

def unprocess(image, mean_pixel):
    return image + mean_pixel
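The loss network can be exercised on its own in the same way. A hedged example, assuming the code above is saved as vgg.py and the standard imagenet-vgg-verydeep-19.mat file is available, pulling out the relu4_2 activations used for the content loss:

import numpy as np
import tensorflow as tf
import vgg  # the code above, assumed saved as vgg.py

image = tf.placeholder(tf.float32, [1, 224, 224, 3])
# Every weight is wrapped in tf.constant inside _conv_layer, so the loss
# network adds no trainable variables to the graph.
net, mean_pixel = vgg.net('imagenet-vgg-verydeep-19.mat', image)

with tf.Session() as sess:
    img = np.random.rand(1, 224, 224, 3).astype(np.float32)
    feats = sess.run(net['relu4_2'], feed_dict={image: vgg.preprocess(img, mean_pixel)})
    print(feats.shape)  # (1, 28, 28, 512) for a 224x224 input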


Content Loss:
def compute_content_loss(content_layers, net):
    content_loss = 0
    # tf.app.flags.DEFINE_string("CONTENT_LAYERS", "relu4_2", "Which VGG layer to extract content loss from")
    for layer in content_layers:
        generated_images, content_images = tf.split(0, 2, net[layer])
        size = tf.size(generated_images)
        content_loss += tf.nn.l2_loss(generated_images - content_images) / tf.to_float(size)
    content_loss = content_loss / len(content_layers)

    return content_loss


Style Loss:
def compute_style_loss(style_features_t, style_layers, net):
    style_loss = 0
    for style_gram, layer in zip(style_features_t, style_layers):
        generated_images, _ = tf.split(0, 2, net[layer])
        size = tf.size(generated_images)
        for style_image in style_gram:
            style_loss += tf.nn.l2_loss(tf.reduce_sum(gram(generated_images) - style_image, 0)) / tf.to_float(size)
    style_loss = style_loss / len(style_layers)
    return style_loss


gram:
def gram(layer):
    shape = tf.shape(layer)
    num_images = shape[0]
    num_filters = shape[3]
    size = tf.size(layer)
    filters = tf.reshape(layer, tf.pack([num_images, -1, num_filters]))
    # FLAGS.BATCH_SIZE comes from tf.app.flags in the training script
    grams = tf.batch_matmul(filters, filters, adj_x=True) / tf.to_float(size / FLAGS.BATCH_SIZE)

    return grams
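One piece the snippets above take as given is style_features_t, the Gram matrices of the style image (one per style layer) that compute_style_loss compares against. A hedged sketch of that precomputation, run once before training; the function name and the preprocessing/normalization details are assumptions and may differ from the actual script:

import numpy as np
import tensorflow as tf
from scipy import misc
import vgg  # the loss network above, assumed saved as vgg.py

def precompute_style_grams(style_path, style_layers, vgg_path):
    # Returns one Gram matrix per style layer, playing the role of style_features_t.
    img = misc.imread(style_path).astype(np.float32)[np.newaxis, :]
    with tf.Graph().as_default(), tf.Session() as sess:
        image = tf.placeholder(tf.float32, img.shape)
        net, mean_pixel = vgg.net(vgg_path, image)
        grams = []
        for layer in style_layers:
            feat = sess.run(net[layer], feed_dict={image: vgg.preprocess(img, mean_pixel)})
            feat = feat.reshape(-1, feat.shape[3])      # flatten to (H*W, C)
            g = feat.T.dot(feat) / feat.size            # C x C, normalized by the feature size
            grams.append(g[np.newaxis, :])              # (1, C, C): compute_style_loss iterates over the leading dim
        return grams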


One part of main() in train_fast_neural_style.py puzzled me at first: net, _ = vgg.net(FLAGS.VGG_PATH, tf.concat(0, [generated, images])). My first thought was that generated and images should each be fed to vgg.net separately and the content and style losses computed afterwards. Instead, the code first concatenates generated and the original content images along axis 0 (see tf.concat) and then splits the network outputs with tf.split to recover the activations of each. This works because every operation in the loss network, convolution, ReLU and pooling, acts on each example of the batch independently, so concatenating along the batch axis and splitting afterwards is exactly equivalent to two separate forward passes while only building and running the loss network once. (If you have a different reading of this, please leave a comment below.) It is a handy trick worth adopting when writing similar code.
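A small, self-contained check of this point (not from the repo, written against the r0.10 concat/split signatures used in this post): running two images through the same convolution as one batch and splitting the result matches running them separately.

import numpy as np
import tensorflow as tf

x1 = tf.placeholder(tf.float32, [1, 32, 32, 3])
x2 = tf.placeholder(tf.float32, [1, 32, 32, 3])
w = tf.constant(np.random.rand(3, 3, 3, 8).astype(np.float32))

def features(x):
    # conv + relu act on each batch element independently
    return tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'))

batched = features(tf.concat(0, [x1, x2]))  # one pass over the concatenated batch
f1, f2 = tf.split(0, 2, batched)            # recover the per-image feature maps

with tf.Session() as sess:
    a = np.random.rand(1, 32, 32, 3).astype(np.float32)
    b = np.random.rand(1, 32, 32, 3).astype(np.float32)
    from_batch, separate = sess.run([f1, features(x1)], feed_dict={x1: a, x2: b})
    print(np.allclose(from_batch, separate))  # True: batching never mixes the two images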

The images generated by this code are not great yet. I suspect the relative weighting of the content and style losses, or possibly the number of epochs, is to blame; I will adjust the weights later and see whether I can get better results.