
TensorFlow: Understanding Fast Neural Style in Depth

2016-10-09 14:15


reference: http://hacker.duanshishi.com/?p=1639


Preface

The previous posts covered models commonly used in Computer Vision. Over the next while I will spend some time studying TensorFlow applications in Computer Vision, mainly by analyzing the relevant papers and source code. Today I will look in detail at fast neural style. An earlier post analyzed neural style, which is where this line of work started, but it is hard to use in practice. Why? Every run requires specifying a content image and a style image and then minimizing the content loss and style loss to generate an image. That takes a long time, and there is no way to save a model for a particular style, so every generated image amounts to training a model from scratch. Fast neural style, by contrast, can save a model trained for a given style and then simply transform any content image with it. The paper also mentions another application of image transformation networks: Super-Resolution, using deep learning to turn low-resolution images into high-resolution ones, which is already deployed at many large internet companies, especially video sites.


How the Paper Works

A few months ago I wrote about Neural Style in TensorFlow之深入理解Neural Style. A Neural Algorithm of Artistic Style builds a multi-layer convolutional network and generates an image that combines content and style by minimizing the defined content loss and style loss, which is very interesting. Perceptual Losses for Real-Time Style Transfer and Super-Resolution instead uses a perceptual loss, computed with a pre-trained VGG model, in place of the per-pixel loss, simplifying the original loss computation, and adds a transform network that directly produces the stylized version of the content image. How is this achieved? See the figure below (the two-network architecture from the paper); let me walk through it:

The whole system consists of two parts: an image transformation network and a loss network. The image transformation network is a deep residual convolutional network that transforms the input (content) image directly into a stylized image. The loss network's parameters are fixed; its structure is the same as the network in A Neural Algorithm of Artistic Style, but its weights are never updated and it is only used to compute the content loss and the style loss. This is the so-called perceptual loss: as the authors explain, a convolutional model pretrained for image classification has already learned perceptual and semantic information, so the loss network serves purely to measure content and style losses rather than being updated as in A Neural Algorithm of Artistic Style; what gets updated are the parameters of the transform network in front of it. Viewed end to end, the input image passes through the transform network to obtain the transformed image, the corresponding losses are computed, and minimizing that loss updates the transform network. Simple, isn't it?
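To make the data flow concrete before diving into the code, here is a minimal sketch of the training setup described above, written against the TensorFlow r0.10 API used throughout this post. The module names (transform, vgg), the VGG file name, and the hyperparameters are illustrative assumptions rather than the author's exact script, and input/output scaling plus the style loss term are glossed over here.

import tensorflow as tf
import vgg        # the fixed loss network defined later in this post
import transform  # the transform network defined later (module name assumed)

BATCH_SIZE = 4            # illustrative values, not the author's settings
images = tf.placeholder(tf.float32, [BATCH_SIZE, 256, 256, 3])  # content images from COCO
generated = transform.net(images)                               # stylized output, same shape

# One shared pass through the fixed VGG loss network for both generated and
# content images (vgg.preprocess / mean subtraction omitted here for brevity).
net, mean_pixel = vgg.net('imagenet-vgg-verydeep-19.mat', tf.concat(0, [generated, images]))

# Content (perceptual) loss on a single layer; the full script adds the
# Gram-matrix style loss over several layers in the same fashion.
gen_feat, content_feat = tf.split(0, 2, net['relu4_2'])
content_loss = tf.nn.l2_loss(gen_feat - content_feat) / tf.to_float(tf.size(gen_feat))

# Only transform.net contains trainable tf.Variables (the VGG weights are
# tf.constant), so this step updates the transform network alone.
train_op = tf.train.AdamOptimizer(1e-3).minimize(content_loss)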

The losses are computed much as before. The content (feature reconstruction) loss compares the generated image $\hat{y}$ with the content target $y$ in the feature space of layer $j$ of the loss network $\phi$:

$$\ell_{feat}^{\phi,j}(\hat{y}, y) = \frac{1}{C_j H_j W_j}\,\lVert \phi_j(\hat{y}) - \phi_j(y) \rVert_2^2$$

The style loss is the squared Frobenius norm of the difference of Gram matrices:

$$\ell_{style}^{\phi,j}(\hat{y}, y) = \lVert G_j^{\phi}(\hat{y}) - G_j^{\phi}(y) \rVert_F^2$$

where the Gram matrix used in the style loss is

$$G_j^{\phi}(x)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(x)_{h,w,c}\, \phi_j(x)_{h,w,c'}$$

The Gram matrix is the key ingredient: whatever the spatial size of a feature map, its Gram matrix is always $C_j \times C_j$, so the style loss stays well defined even when $\hat{y}$ and the style image $y$ have different shapes. For the full derivation see the corresponding section of the paper; it is easy to follow once you read it.
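To see this concretely, here is a small NumPy illustration (not from the paper or the repo): feature maps with different spatial sizes but the same channel count produce Gram matrices of identical shape.

import numpy as np

def gram_np(features):
    # features: a single feature map of shape (H, W, C); returns a C x C Gram matrix
    h, w, c = features.shape
    f = features.reshape(h * w, c)
    return f.T.dot(f) / (h * w * c)

a = np.random.rand(64, 64, 128)    # e.g. features of the generated image
b = np.random.rand(96, 128, 128)   # style features with a different spatial size
print(gram_np(a).shape, gram_np(b).shape)  # both (128, 128): the style loss stays well defined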


By now you should have a good picture of how this paper achieves fast neural style. To summarize:
The transform network is a deep residual network that turns the input image into an image with a particular style; its parameters are updated during training.
The loss network has essentially the same structure as in the earlier paper; it is only used to compute the content loss and the style loss, and its parameters are never updated.
Thanks to the Gram matrix, the loss remains easy to compute even when the transformed image and the style target passed through the loss network have different shapes.


Fast Neural Style on TensorFlow

The code is based on https://github.com/OlavHN/fast-neural-style, but when I ran it, it did not work. The cause is roughly some issues around local_variables after a TensorFlow update; see this issue for details: https://github.com/tensorflow/tensorflow/issues/1045#issuecomment-239789244. Also, that project keeps everything in one place and is a bit messy, so I separated the training code from the code that generates the stylized image. The project is on my personal GitHub as neural_style_tensorflow. Basic requirements:
Python 2.7.x
TensorFlow r0.10
VGG-19 model
COCO dataset

Transform Network architecture
import tensorflow as tf

def conv2d(x, input_filters, output_filters, kernel, strides, padding='SAME'):
    with tf.variable_scope('conv') as scope:
        shape = [kernel, kernel, input_filters, output_filters]
        weight = tf.Variable(tf.truncated_normal(shape, stddev=0.1), name='weight')
        convolved = tf.nn.conv2d(x, weight, strides=[1, strides, strides, 1], padding=padding, name='conv')

        normalized = batch_norm(convolved, output_filters)

        return normalized

def conv2d_transpose(x, input_filters, output_filters, kernel, strides, padding='SAME'):
    with tf.variable_scope('conv_transpose') as scope:
        shape = [kernel, kernel, output_filters, input_filters]
        weight = tf.Variable(tf.truncated_normal(shape, stddev=0.1), name='weight')

        batch_size = tf.shape(x)[0]
        height = tf.shape(x)[1] * strides
        width = tf.shape(x)[2] * strides
        # tf.pack is the pre-1.0 name of tf.stack (this post targets TensorFlow r0.10)
        output_shape = tf.pack([batch_size, height, width, output_filters])
        convolved = tf.nn.conv2d_transpose(x, weight, output_shape, strides=[1, strides, strides, 1], padding=padding, name='conv_transpose')

        normalized = batch_norm(convolved, output_filters)
        return normalized

def batch_norm(x, size):
    # batch normalization using statistics computed from the current batch
    batch_mean, batch_var = tf.nn.moments(x, [0, 1, 2], keep_dims=True)
    beta = tf.Variable(tf.zeros([size]), name='beta')
    scale = tf.Variable(tf.ones([size]), name='scale')
    epsilon = 1e-3
    return tf.nn.batch_normalization(x, batch_mean, batch_var, beta, scale, epsilon, name='batch')

def residual(x, filters, kernel, strides, padding='SAME'):
    with tf.variable_scope('residual') as scope:
        conv1 = conv2d(x, filters, filters, kernel, strides, padding=padding)
        conv2 = conv2d(tf.nn.relu(conv1), filters, filters, kernel, strides, padding=padding)

        residual = x + conv2

        return residual

def net(image):
    # conv layers (stride 1, 2, 2), five residual blocks, then two stride-2
    # transposed convolutions and a final stride-1 layer back to 3 channels
    with tf.variable_scope('conv1'):
        conv1 = tf.nn.relu(conv2d(image, 3, 32, 9, 1))
    with tf.variable_scope('conv2'):
        conv2 = tf.nn.relu(conv2d(conv1, 32, 64, 3, 2))
    with tf.variable_scope('conv3'):
        conv3 = tf.nn.relu(conv2d(conv2, 64, 128, 3, 2))
    with tf.variable_scope('res1'):
        res1 = residual(conv3, 128, 3, 1)
    with tf.variable_scope('res2'):
        res2 = residual(res1, 128, 3, 1)
    with tf.variable_scope('res3'):
        res3 = residual(res2, 128, 3, 1)
    with tf.variable_scope('res4'):
        res4 = residual(res3, 128, 3, 1)
    with tf.variable_scope('res5'):
        res5 = residual(res4, 128, 3, 1)
    with tf.variable_scope('deconv1'):
        deconv1 = tf.nn.relu(conv2d_transpose(res5, 128, 64, 3, 2))
    with tf.variable_scope('deconv2'):
        deconv2 = tf.nn.relu(conv2d_transpose(deconv1, 64, 32, 3, 2))
    with tf.variable_scope('deconv3'):
        deconv3 = tf.nn.tanh(conv2d_transpose(deconv2, 32, 3, 9, 1))

    # tanh output scaled to roughly [-127.5, 127.5]
    y = deconv3 * 127.5

    return y
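As a quick sanity check (not part of the original project), the transform network can be run on a dummy batch to confirm that the two stride-2 convolutions are undone by the two stride-2 transposed convolutions, so the output keeps the input resolution. This assumes the code above is saved as transform.py and uses the same r0.10 API.

import numpy as np
import tensorflow as tf
import transform  # the code above, assumed saved as transform.py

x = tf.placeholder(tf.float32, [1, 256, 256, 3])
y = transform.net(x)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    out = sess.run(y, feed_dict={x: np.zeros((1, 256, 256, 3), dtype=np.float32)})
    print(out.shape)  # (1, 256, 256, 3): same resolution as the input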


Using a deep residual network trained on the COCO dataset makes it possible to train a deeper model without sacrificing performance. The loss network is a pretrained VGG network with the following structure:
import tensorflow as tf
import numpy as np
import scipy.io
from scipy import misc

def net(data_path, input_image):
    layers = (
        'conv1_1', 'relu1_1', 'conv1_2', 'relu1_2', 'pool1',

        'conv2_1', 'relu2_1', 'conv2_2', 'relu2_2', 'pool2',

        'conv3_1', 'relu3_1', 'conv3_2', 'relu3_2', 'conv3_3',
        'relu3_3', 'conv3_4', 'relu3_4', 'pool3',

        'conv4_1', 'relu4_1', 'conv4_2', 'relu4_2', 'conv4_3',
        'relu4_3', 'conv4_4', 'relu4_4', 'pool4',

        'conv5_1', 'relu5_1', 'conv5_2', 'relu5_2', 'conv5_3',
        'relu5_3', 'conv5_4', 'relu5_4'
    )

    data = scipy.io.loadmat(data_path)
    mean = data['normalization'][0][0][0]
    mean_pixel = np.mean(mean, axis=(0, 1))
    weights = data['layers'][0]

    net = {}
    current = input_image
    for i, name in enumerate(layers):
        kind = name[:4]
        if kind == 'conv':
            kernels, bias = weights[i][0][0][0][0]
            # matconvnet: weights are [width, height, in_channels, out_channels]
            # tensorflow: weights are [height, width, in_channels, out_channels]
            kernels = np.transpose(kernels, (1, 0, 2, 3))
            bias = bias.reshape(-1)
            current = _conv_layer(current, kernels, bias, name=name)
        elif kind == 'relu':
            current = tf.nn.relu(current, name=name)
        elif kind == 'pool':
            current = _pool_layer(current, name=name)
        net[name] = current

    assert len(net) == len(layers)
    return net, mean_pixel

def _conv_layer(input, weights, bias, name=None):
    conv = tf.nn.conv2d(input, tf.constant(weights), strides=(1, 1, 1, 1),
                        padding='SAME', name=name)
    return tf.nn.bias_add(conv, bias)

def _pool_layer(input, name=None):
    return tf.nn.max_pool(input, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1),
                          padding='SAME', name=name)

def preprocess(image, mean_pixel):
    return image - mean_pixel

def unprocess(image, mean_pixel):
    return image + mean_pixel
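The loss network can be exercised on its own in the same way. A hedged example, assuming the code above is saved as vgg.py and the standard imagenet-vgg-verydeep-19.mat file is available, pulling out the relu4_2 activations used for the content loss:

import numpy as np
import tensorflow as tf
import vgg  # the code above, assumed saved as vgg.py

image = tf.placeholder(tf.float32, [1, 224, 224, 3])
# Every weight is wrapped in tf.constant inside _conv_layer, so the loss
# network adds no trainable variables to the graph.
net, mean_pixel = vgg.net('imagenet-vgg-verydeep-19.mat', image)

with tf.Session() as sess:
    img = np.random.rand(1, 224, 224, 3).astype(np.float32)
    feats = sess.run(net['relu4_2'], feed_dict={image: vgg.preprocess(img, mean_pixel)})
    print(feats.shape)  # (1, 28, 28, 512) for a 224x224 input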


Content Loss:
def compute_content_loss(content_layers, net):
    content_loss = 0
    # tf.app.flags.DEFINE_string("CONTENT_LAYERS", "relu4_2", "Which VGG layer to extract content loss from")
    for layer in content_layers:
        generated_images, content_images = tf.split(0, 2, net[layer])
        size = tf.size(generated_images)
        content_loss += tf.nn.l2_loss(generated_images - content_images) / tf.to_float(size)
    content_loss = content_loss / len(content_layers)

    return content_loss


Style Loss:
def compute_style_loss(style_features_t, style_layers, net):
    style_loss = 0
    for style_gram, layer in zip(style_features_t, style_layers):
        generated_images, _ = tf.split(0, 2, net[layer])
        size = tf.size(generated_images)
        for style_image in style_gram:
            style_loss += tf.nn.l2_loss(tf.reduce_sum(gram(generated_images) - style_image, 0)) / tf.to_float(size)
    style_loss = style_loss / len(style_layers)
    return style_loss


gram:
def gram(layer):
    shape = tf.shape(layer)
    num_images = shape[0]
    num_filters = shape[3]
    size = tf.size(layer)
    filters = tf.reshape(layer, tf.pack([num_images, -1, num_filters]))
    # FLAGS.BATCH_SIZE comes from tf.app.flags in the training script
    grams = tf.batch_matmul(filters, filters, adj_x=True) / tf.to_float(size / FLAGS.BATCH_SIZE)

    return grams
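One piece the snippets above take as given is style_features_t, the Gram matrices of the style image (one per style layer) that compute_style_loss compares against. A hedged sketch of that precomputation, run once before training; the function name and the preprocessing/normalization details are assumptions and may differ from the actual script:

import numpy as np
import tensorflow as tf
from scipy import misc
import vgg  # the loss network above, assumed saved as vgg.py

def precompute_style_grams(style_path, style_layers, vgg_path):
    # Returns one Gram matrix per style layer, playing the role of style_features_t.
    img = misc.imread(style_path).astype(np.float32)[np.newaxis, :]
    with tf.Graph().as_default(), tf.Session() as sess:
        image = tf.placeholder(tf.float32, img.shape)
        net, mean_pixel = vgg.net(vgg_path, image)
        grams = []
        for layer in style_layers:
            feat = sess.run(net[layer], feed_dict={image: vgg.preprocess(img, mean_pixel)})
            feat = feat.reshape(-1, feat.shape[3])      # flatten to (H*W, C)
            g = feat.T.dot(feat) / feat.size            # C x C, normalized by the feature size
            grams.append(g[np.newaxis, :])              # (1, C, C): compute_style_loss iterates over the leading dim
        return grams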


One part of main() in train_fast_neural_style.py puzzled me at first: net, _ = vgg.net(FLAGS.VGG_PATH, tf.concat(0, [generated, images])). My first thought was that generated and images should each be fed to vgg.net separately and the content and style losses computed afterwards. Instead, the code first concatenates generated and the original content images along axis 0 (see tf.concat) and then splits the network outputs with tf.split to recover the activations of each. This works because every operation in the loss network, convolution, ReLU and pooling, acts on each example of the batch independently, so concatenating along the batch axis and splitting afterwards is exactly equivalent to two separate forward passes while only building and running the loss network once. (If you have a different reading of this, please leave a comment below.) It is a handy trick worth adopting when writing similar code.
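A small, self-contained check of this point (not from the repo, written against the r0.10 concat/split signatures used in this post): running two images through the same convolution as one batch and splitting the result matches running them separately.

import numpy as np
import tensorflow as tf

x1 = tf.placeholder(tf.float32, [1, 32, 32, 3])
x2 = tf.placeholder(tf.float32, [1, 32, 32, 3])
w = tf.constant(np.random.rand(3, 3, 3, 8).astype(np.float32))

def features(x):
    # conv + relu act on each batch element independently
    return tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'))

batched = features(tf.concat(0, [x1, x2]))  # one pass over the concatenated batch
f1, f2 = tf.split(0, 2, batched)            # recover the per-image feature maps

with tf.Session() as sess:
    a = np.random.rand(1, 32, 32, 3).astype(np.float32)
    b = np.random.rand(1, 32, 32, 3).astype(np.float32)
    from_batch, separate = sess.run([f1, features(x1)], feed_dict={x1: a, x2: b})
    print(np.allclose(from_batch, separate))  # True: batching never mixes the two images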

The images generated by this code are not great yet. I suspect the relative weighting of the content and style losses, or possibly the number of epochs, is to blame; I will adjust the weights later and see whether I can get better results.