AI Challenger scene classification with TensorFlow Inception-ResNet-v2 (LB: 0.94361)

2017-10-06 04:46
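The model is the Inception-ResNet-v2 from tf-slim, pretrained on ImageNet. You can choose which layers to train, for example retraining only the last layer or the last several layers. No special data augmentation is used, only tf-slim's default Inception preprocessing. With the parameter settings below, the online score is 0.94361.

The images in the tfrecord files were all resized to 299*299 before conversion; see the earlier blog post for the method.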

learning_rate=0.0001

batch_size=32

num_epochs=80
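
For a rough sense of scale, these settings imply the number of optimizer steps computed below. This is a small sketch, assuming the training-set size of 53879 images that the script hard-codes when it computes `total_steps`; the helper name `total_training_steps` is made up for illustration.

```python
def total_training_steps(num_examples, batch_size, num_epochs):
    """Approximate number of minibatch steps, as a float (same formula as the script)."""
    return 1.0 * num_epochs * num_examples / batch_size


# 53879 is the training-set size hard-coded in the script's main().
steps = total_training_steps(num_examples=53879, batch_size=32, num_epochs=80)
print("total_steps is %d" % steps)  # prints: total_steps is 134697
```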

Results:

training accuracy: 0.836019

Final testing accuracy: 0.945787 (val)

Final testing accuracy: 0.94361 (testA)
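
These are top-3 accuracies: the script counts a prediction as correct when the true label is among the three highest logits (it uses `tf.nn.in_top_k` with `k=3`). The plain-Python sketch below shows the same idea on made-up numbers; the function name `top_k_accuracy` is hypothetical, and tie handling may differ from TensorFlow's.

```python
def top_k_accuracy(logits, labels, k=3):
    """Fraction of examples whose true label is among the k largest logits."""
    correct = 0
    for row, label in zip(logits, labels):
        # Indices of the k highest-scoring classes for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            correct += 1
    return correct / float(len(labels))


# Toy data (made up): 2 examples, 4 classes.
logits = [[0.1, 0.5, 0.3, 0.9],     # top-3 classes: 3, 1, 2
          [0.8, 0.1, 0.06, 0.04]]   # top-3 classes: 0, 1, 2
labels = [2, 3]                     # first is a hit, second is a miss
print(top_k_accuracy(logits, labels))
```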

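There still seems to be plenty of room for improvement through tuning (data augmentation, input resolution, number of epochs, and so on), but:

The code currently has one problem: it does not monitor validation accuracy during training. This is a pitfall of using tfrecord in TensorFlow; working around it requires writing some ugly code yourself, and it remains to be solved (this is very important, because overfitting has already been observed with some parameter settings). Newer versions of TF will gradually address this; see the two issues cited in the header comment. Using the image-reading approach from the official code solves the problem easily, but reading may be about twice as slow, and it cannot be used on some cloud computing platforms.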
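
The missing piece is mostly flow control: interleaving training steps with a full pass over the validation set. A framework-agnostic sketch of that loop follows. `train_step` and `evaluate` are hypothetical callables (they are not part of the script or of TensorFlow), and the sketch deliberately sidesteps the hard part, namely making a queue-based tfrecord reader deliver exactly one validation epoch each time.

```python
def train_with_periodic_validation(train_step, evaluate, total_steps, eval_every):
    """Run total_steps training steps, evaluating every eval_every steps.

    train_step: hypothetical callable that runs one optimizer step.
    evaluate:   hypothetical callable that runs one full pass over the
                validation set and returns its accuracy.
    Returns a list of (step, validation_accuracy) pairs.
    """
    history = []
    for step in range(1, total_steps + 1):
        train_step()
        if step % eval_every == 0:
            history.append((step, evaluate()))
    return history


# Stand-in callables so the sketch runs without TensorFlow.
state = {"steps": 0}

def fake_train_step():
    state["steps"] += 1

def fake_evaluate():
    # Pretend accuracy grows with the number of steps taken.
    return state["steps"] / 100.0

history = train_with_periodic_validation(
    fake_train_step, fake_evaluate, total_steps=30, eval_every=10)
print(history)
```

Watching the gap between training and validation accuracy in such a history is what would reveal the overfitting mentioned above.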
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 20 16:05:02 2017

@author: wayne

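FEELINGS
Native tf and tfrecord still have quite a few pitfalls, and a lot of "generic code" has to be written by hand, especially for the input pipeline and for training/validation [flow control and accuracy monitoring].
Datasets were introduced in the latest version, 1.3; for the features planned for 1.4 see
https://github.com/tensorflow/tensorflow/issues/7902 and
https://github.com/tensorflow/tensorflow/issues/7951
For now, PyTorch is actually nicer to use, with more intuitive and readable code.

This script combines native tf modules with slim models. The official slim boilerplate code is worth studying, though it is fairly abstract.

CHANGES
- Can restore our own previously saved checkpoint instead of always starting from the official model: tf.flags.DEFINE_bool('use_official', True)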

REFERENCES
https://web.stanford.edu/class/cs20si/syllabus.html

Input data
https://stackoverflow.com/questions/44054656/creating-tfrecords-from-a-list-of-strings-and-feeding-a-graph-in-tensorflow-afte
https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/
https://indico.io/blog/tensorflow-data-input-part2-extensions/

Overall architecture
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/2_fullyconnected.ipynb

Saving and loading models
http://blog.csdn.net/u014595019/article/details/53912710
http://blog.csdn.net/u012436149/article/details/52883747 (restoring a subset of variables)
https://github.com/SymphonyPy/Valified_Code_Classify/tree/master/Classified
http://blog.csdn.net/lwplwf/article/details/76177296 (defines a loop that watches for new checkpoints and runs one validation pass whenever one appears)

Transfer learning (tutorials that combine native tf modules with slim CNN models are really scarce!)
https://github.com/AIChallenger/AI_Challenger/tree/master/Baselines/caption_baseline (uses a slim CNN)
https://github.com/kwotsin/transfer_learning_tutorial (a fairly complete program, but built entirely from slim modules; also uses tf.train.Supervisor and tensorboard)
http://blog.csdn.net/ArtistA/article/details/52860050
(CNN implemented directly in tf): https://github.com/joelthchao/tensorflow-finetune-flickr-style
http://blog.csdn.net/nnnnnnnnnnnny/article/details/70244232 (tensorflow_inception_graph.pb. Because each training image is used many times, the feature vectors that Inception-v3 computes from the original images can be saved to a file to avoid recomputation.)
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py
https://github.com/tensorflow/models/issues/391 ([slim] weird result with parameter is_training)
https://github.com/YanWang2014/models/tree/master/slim (the various slim models)
http://pytorch.org/docs/master/torchvision/models.html
http://data.mxnet.io/models/

Data augmentation
https://github.com/wzhang1/iNaturalist (MXNet finetune baseline (res152) for challenger.ai/competition/scene)
https://github.com/AIChallenger/AI_Challenger/tree/master/Baselines/caption_baseline/im2txt/im2txt/ops

Tuning
https://zhuanlan.zhihu.com/p/22252270 (a comprehensive comparison of deep learning optimization methods: SGD, Adagrad, Adadelta, Adam, Adamax, Nadam)
http://www.360doc.com/content/16/1010/08/36492363_597225745.shtml
https://www.zhihu.com/question/41631631 (What experience do you have tuning deep learning models (rnn, cnn)?)
https://www.zhihu.com/question/25097993 (What are some tricks for tuning deep learning models?)
https://www.zhihu.com/question/24529483 (What role does weight decay play in neural networks? And momentum? Normalization?)
https://zhuanlan.zhihu.com/p/27555858?utm_medium=social&utm_source=wechat_session ([Explainer] How to tune parameters with fancy methods)

tfrecord validation-set problem: there are many hacky approaches around whether to build an extra graph
https://github.com/tensorflow/tensorflow/issues/7902 (each validation pass has to read the whole validation set exactly once, and this has to happen repeatedly; how can that be done (elegantly) with tfrecord?)
https://github.com/tensorflow/tensorflow/issues/7951 (newer versions will improve the input pipeline)
https://stackoverflow.com/questions/39187764/tensorflow-efficient-feeding-of-eval-train-data-using-queue-runners
https://stackoverflow.com/questions/44270198/when-using-tfrecord-how-can-i-run-intermediate-validation-check-a-better-way
https://stackoverflow.com/questions/40146428/show-training-and-validation-accuracy-in-tensorflow-using-same-graph

Visualizing the AdamOptimizer learning rate
https://stackoverflow.com/questions/36990476/getting-the-current-learning-rate-from-a-tf-train-adamoptimizer/44688307#44688307
"""

from __future__ import division, print_function, absolute_import

import os
import time

import tensorflow as tf

# The slim model definition and preprocessing modules must be importable.
from inception_resnet_v2 import *
import inception_preprocessing

slim = tf.contrib.slim

tf.reset_default_graph()
FLAGS = tf.flags.FLAGS

tf.flags.DEFINE_bool('train_flag', False, 'train_flag')
# Scopes to train; 'None' trains all layers. Ignored when testing.
tf.flags.DEFINE_string('trainable_scopes', 'InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits', 'scopes of the layers to train')
tf.flags.DEFINE_bool('use_official', True, 'start training from the official model or from our own saved model; back up your own model first, otherwise it may be overwritten')

# Learning-rate notes: 0.1 for the last layer
# 1e-3, 5e-4. 0.001 for the last layer, 0.0001 for the whole network? 0.1 0.05 0.00001
tf.flags.DEFINE_float('learning_rate', 0.001, 'learning_rate')
tf.flags.DEFINE_string('val_test', 'None', 'which data to test on when train_flag=False: val.tfrecord, testA, testB')

tf.flags.DEFINE_float('beta1', 0.9, 'beta1')
tf.flags.DEFINE_float('beta2', 0.999, 'beta2')
tf.flags.DEFINE_float('epsilon', 0.1, 'epsilon')  # 1e-8. ImageNet: 1.0 or 0.1

tf.flags.DEFINE_integer('batch_size', 2, 'batch size')
tf.flags.DEFINE_integer('num_epochs', 1, 'epochs')

tf.flags.DEFINE_string('buckets', 'oss://scene2017', 'folder containing the training images')
official_model_path = 'oss://scene2017/slim/inception_resnet_v2_2016_08_30.ckpt'

tf.flags.DEFINE_string('checkpointDir', 'oss://scene2017', 'model output folder')
model_path = os.path.join(FLAGS.checkpointDir, 'model.ckpt')  # the fine-tuned model
tf.flags.DEFINE_string('writes', 'oss://scene2017/slim/submit.txt', 'where to save the predictions')

image_size = inception_resnet_v2.default_image_size  # 299
num_labels = 80

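'''
Each validation pass has to read the entire validation set exactly once and then read it again next time.
That cannot currently be done (elegantly) with tfrecord, so we control the queue manually with magic numbers:
https://github.com/tensorflow/tensorflow/issues/7951
'''
magic_val_len = 7120  # size of the validation set
magic_vac_batch_size = 128  # the validation batch_size can be large, as long as RAM/GPU memory allows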

def read_and_decode(tfrecord_file, batch_size, num_epochs):

    filename_queue = tf.train.string_input_producer([tfrecord_file], num_epochs=num_epochs)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    img_features = tf.parse_single_example(
        serialized_example,
        features={
            'label': tf.FixedLenFeature([], tf.int64),
            'h': tf.FixedLenFeature([], tf.int64),
            'w': tf.FixedLenFeature([], tf.int64),
            'c': tf.FixedLenFeature([], tf.int64),
            'image': tf.FixedLenFeature([], tf.string),
        })

    h = tf.cast(img_features['h'], tf.int32)
    w = tf.cast(img_features['w'], tf.int32)
    c = tf.cast(img_features['c'], tf.int32)

    image = tf.decode_raw(img_features['image'], tf.uint8)
    image = tf.reshape(image, [h, w, c])

    label = tf.cast(img_features['label'], tf.int32)

    ##########################################################
    '''data augmentation here'''
    #    distorted_image = tf.random_crop(images, [530, 530, img_channel])
    #    distorted_image = tf.image.random_flip_left_right(distorted_image)
    #    distorted_image = tf.image.random_brightness(distorted_image, max_delta=63)
    #    distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)

    #    image = tf.image.resize_images(image, (image_size,image_size))
    #    image = tf.image.per_image_standardization(image)
    #    image = tf.reshape(image, [image_size * image_size * 3])
    image = inception_preprocessing.preprocess_image(image,
                                                     image_size,
                                                     image_size,
                                                     is_training=True)

    ##########################################################
    '''shuffle here'''
    image_batch, label_batch = tf.train.shuffle_batch(
        [image, label],
        batch_size=batch_size,
        num_threads=64,  # note: multiple threads may change the image order
        capacity=10240,
        min_after_dequeue=256)

    return image_batch, label_batch

def read_and_decode_test(tfrecord_file, batch_size, num_epochs):

    filename_queue = tf.train.string_input_producer([tfrecord_file], num_epochs=num_epochs)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)

    # https://stackoverflow.com/questions/41921746/tensorflow-varlenfeature-vs-fixedlenfeature
    img_features = tf.parse_single_example(
        serialized_example,
        features={
            'label': tf.FixedLenFeature([], tf.int64),
            'h': tf.FixedLenFeature([], tf.int64),
            'w': tf.FixedLenFeature([], tf.int64),
            'c': tf.FixedLenFeature([], tf.int64),
            'image': tf.FixedLenFeature([], tf.string),
            'image_id': tf.FixedLenFeature([], tf.string)
        })

    h = tf.cast(img_features['h'], tf.int32)
    w = tf.cast(img_features['w'], tf.int32)
    c = tf.cast(img_features['c'], tf.int32)
    image_id = img_features['image_id']

    image = tf.decode_raw(img_features['image'], tf.uint8)
    image = tf.reshape(image, [h, w, c])

    label = tf.cast(img_features['label'], tf.int32)

    ##########################################################
    '''no data augmentation'''
    #    image = tf.image.resize_images(image, (image_size,image_size))
    #    image = tf.image.per_image_standardization(image)
    #    image = tf.reshape(image, [image_size * image_size * 3])
    image = inception_preprocessing.preprocess_image(image,
                                                     image_size,
                                                     image_size,
                                                     is_training=False)
    '''
    A bug in inception_preprocessing.preprocess_for_eval? The channel dimension
    is still unknown after preprocessing, so it is set by hand here.
    '''
    image.set_shape([None, None, 3])

    image_batch, label_batch, image_id_batch = tf.train.batch(
        [image, label, image_id],
        batch_size=batch_size,
        num_threads=64,  # note: multiple threads may change the image order
        capacity=2000,
        allow_smaller_final_batch=True)

    return image_batch, label_batch, image_id_batch

def batch_to_list_of_dicts(indices2, image_id_batch2):
    # Builds e.g. [{"image_id":"a0563eadd9ef79fcc137e1c60be29f2f3c9a65ea.jpg","label_id": [5,18,32]}]
    result = []
    for item in range(indices2.shape[0]):
        dict_ = {}
        dict_['image_id'] = image_id_batch2[item].decode()
        dict_['label_id'] = indices2[item, :].tolist()
        result.append(dict_)
    return result

'''https://github.com/tensorflow/models/blob/master/research/slim/train_image_classifier.py'''
def get_variables_to_train():
    """Returns a list of variables to train.
    Returns:
      A list of variables to train by the optimizer.
    """
    trainable_scopes = FLAGS.trainable_scopes

    if trainable_scopes == "None":
        # All layers are trained; weights are still restored from a checkpoint.
        print("train all layers")
        return tf.trainable_variables()
    else:
        print("train the specified layer")
        scopes = [scope.strip() for scope in trainable_scopes.split(',')]

    variables_to_train = []
    for scope in scopes:
        variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
        variables_to_train.extend(variables)
    #    variables_to_train = [i.name for i in variables_to_train]
    return variables_to_train

def read_tfrecord2(tfrecord_file, batch_size, train_flag, num_epochs, total_steps):

    # Test data carries image_id; otherwise it could share the input function with train.
    # Also, read_and_decode applies data augmentation during training, so the validation
    # and test sets both use the second function.
    if train_flag:
        train_batch, train_label_batch = read_and_decode(tfrecord_file, batch_size, num_epochs)

        with slim.arg_scope(inception_resnet_v2_arg_scope()):
            train_logits, end_points = inception_resnet_v2(train_batch, num_classes=num_labels, is_training=True)
        # Define the scopes that you want to exclude for restoration
        exclude = ['InceptionResnetV2/Logits', 'InceptionResnetV2/AuxLogits']
        variables_to_restore = slim.get_variables_to_restore(exclude=exclude)
        variables_to_train = get_variables_to_train()

        # Performs the equivalent to tf.nn.sparse_softmax_cross_entropy_with_logits but enhanced with checks
        loss = tf.losses.sparse_softmax_cross_entropy(labels=train_label_batch, logits=train_logits)
        # slim.losses.add_loss(pose_loss)
        total_loss = tf.losses.get_total_loss()    # obtain the regularization losses as well

        # http://blog.csdn.net/xierhacker/article/details/53174558
        optimizer = tf.train.AdamOptimizer(
            learning_rate=FLAGS.learning_rate,
            beta1=FLAGS.beta1,
            beta2=FLAGS.beta2,
            epsilon=FLAGS.epsilon,
            use_locking=False,
            name='Adam'
        )

        '''Use this function to choose which layers to train; by default all of them are trained:
        https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/learning.py#L374'''
        train_op = slim.learning.create_train_op(total_loss, optimizer, variables_to_train=variables_to_train)

        '''minibatch accuracy, non-streaming'''
        train_accuracy_batch = tf.reduce_mean(tf.cast(tf.nn.in_top_k(predictions=train_logits, targets=train_label_batch, k=3), tf.float32))
        '''Streaming accuracy'''
        train_accuracy, train_accuracy_update = tf.metrics.mean(tf.cast(tf.nn.in_top_k(predictions=train_logits, targets=train_label_batch, k=3), tf.float32))

    else:
        val_test_batch, val_test_label_batch, image_id_batch = read_and_decode_test(tfrecord_file, batch_size, num_epochs)

        with slim.arg_scope(inception_resnet_v2_arg_scope()):
            val_test_logits, end_points = inception_resnet_v2(val_test_batch, num_classes=num_labels, is_training=False)

        '''Useless minibatch accuracy, non-streaming'''
        val_test_accuracy_batch = tf.reduce_mean(tf.cast(tf.nn.in_top_k(predictions=val_test_logits, targets=val_test_label_batch, k=3), tf.float32))
        '''Streaming accuracy'''
        val_test_accuracy, val_test_accuracy_update = tf.metrics.mean(tf.cast(tf.nn.in_top_k(predictions=val_test_logits, targets=val_test_label_batch, k=3), tf.float32))

        values, indices = tf.nn.top_k(val_test_logits, 3)

    saver = tf.train.Saver()  # create the saver
    if train_flag:
        if FLAGS.use_official:
            saver_step0 = tf.train.Saver(variables_to_restore)
        else:
            saver_step0 = tf.train.Saver()

    with tf.Session() as sess:
        # https://github.com/tensorflow/tensorflow/issues/1045
        sess.run(tf.group(tf.global_variables_initializer(), tf.local_variables_initializer()))
        print("Initialized")

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)

        if train_flag:

            '''
            How is the final logits layer, whose number of classes was changed, initialized?
            By the sess.run(tf.group(...)) above??? With use_official=True the Logits and
            AuxLogits scopes are excluded from the restore below, so they keep the values
            assigned by that initializer.
            '''
            if FLAGS.use_official:
                saver_step0.restore(sess, official_model_path)
            else:
                saver_step0.restore(sess, model_path)

            try:
                step = 0
                start_time = time.time()
                while not coord.should_stop():
                    _, l, logits2, train_acc2_batch, train_acc2, train_acc2_update = sess.run(
                        [train_op, total_loss, train_logits, train_accuracy_batch, train_accuracy, train_accuracy_update])

                    duration = time.time() - start_time

                    if (step % 10 == 0):
                        print("Minibatch loss at step %d - %d: %.6f (%.3f sec)" % (step, total_steps, l, duration))
                        # train_acc2 is the streaming mean, not the accuracy of this minibatch.
                        print("Streaming training accuracy: %.6f" % train_acc2)
                        # https://stackoverflow.com/questions/38882593/learning-rate-doesnt-change-for-adamoptimizer-in-tensorflow
                        print("lr: %.6f" % optimizer._lr)
                    # if (step % 100 == 0):
                    #     Validating accuracy

                    step += 1
            except tf.errors.OutOfRangeError:
                print('Done training for %d epochs, %d steps.' % (num_epochs, step))
                print('Final training accuracy: %.6f' % (train_acc2_update))
                # Final Validating accuracy

                saver.save(sess, model_path)
            finally:
                coord.request_stop()

        else:
            saver.restore(sess, model_path)  # restores the saved variable values into the variables
            results = []
            try:
                step = 0
                start_time = time.time()
                while not coord.should_stop():
                    (val_test_logits2, val_test_acc2_batch, val_test_acc2, val_test_acc2_update,
                     image_id_batch2, indices2, values2) = sess.run(
                        [val_test_logits, val_test_accuracy_batch, val_test_accuracy, val_test_accuracy_update,
                         image_id_batch, indices, values])
                    step += 1

                    results += batch_to_list_of_dicts(indices2, image_id_batch2)
                    if (step % 10 == 0):
                        print('Useless minibatch testing accuracy at step %d: %.6f' % (step, val_test_acc2_batch))
                        print(indices2.shape[0])

            except tf.errors.OutOfRangeError:
                print('Done testing in %d steps.' % (step))
                print('Final testing accuracy: %.6f' % (val_test_acc2_update))

                '''Writing JSON data'''
                # results = [{"image_id":"a0563eadd9ef79fcc137e1c60be29f2f3c9a65ea.jpg","label_id": [5,18,32]}]
                print(len(results))
                # A PAI pitfall: the output has to be written through tf.gfile.
                with tf.gfile.GFile(FLAGS.writes, 'w') as f:
                    f.write(str(results))
                # with open('oss://scene2017.oss-cn-shanghai-internal.aliyuncs.com/softmax/submit.json', 'w') as f:
                #     json.dump(results, f)
            finally:
                coord.request_stop()

        coord.join(threads)

def main(_):

    train_flag = FLAGS.train_flag

    if train_flag:
        tfrecord_file = os.path.join(FLAGS.buckets, 'train.tfrecord')
        # '../ai_challenger_scene_train_20170904/train.tfrecord'
        # tfrecord_file_val = '../ai_challenger_scene_train_20170904/val.tfrecord'  # validate while training
        batch_size = FLAGS.batch_size  # 256
        num_epochs = FLAGS.num_epochs
        total_steps = 1.0 * num_epochs * 53879 / batch_size
        print("total_steps is %d" % total_steps)
        print("num_epochs is %d" % num_epochs)
        print("batch_size is %d" % batch_size)
        print("lr %.6f" % FLAGS.learning_rate)
        read_tfrecord2(tfrecord_file, batch_size, train_flag, num_epochs, total_steps)
    else:
        tfrecord_file = os.path.join(FLAGS.buckets, FLAGS.val_test)  # '../ai_challenger_scene_train_20170904/val.tfrecord'  # test
        batch_size = FLAGS.batch_size  # 16
        num_epochs = FLAGS.num_epochs  # 1
        # 7120 is the size of val.tfrecord; the other test sets are slightly off, which is ignored here
        total_steps = 1.0 * num_epochs * 7120 / batch_size
        print("total_steps is %d" % total_steps)
        read_tfrecord2(tfrecord_file, batch_size, train_flag, num_epochs, total_steps)

# 53879  7120   7040

if __name__ == "__main__":  # ensures the code in main is not executed when this file is imported by another file
    tf.app.run()  # parses the command-line arguments and calls main: main(sys.argv)
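
One caveat on the output file: the script writes `str(results)`, which is the Python repr of the list (single-quoted strings) rather than strict JSON. If the submission has to be valid JSON, serializing with the standard `json` module is one option. A minimal sketch, using the example record from the script's own comment:

```python
import json

# Example record copied from the comment in the script.
results = [{"image_id": "a0563eadd9ef79fcc137e1c60be29f2f3c9a65ea.jpg",
            "label_id": [5, 18, 32]}]

as_repr = str(results)          # Python repr: single-quoted strings
as_json = json.dumps(results)   # strict JSON: double-quoted strings

# The JSON string round-trips through a JSON parser.
assert json.loads(as_json) == results

# The repr is rejected by a JSON parser.
try:
    json.loads(as_repr)
    repr_is_json = True
except ValueError:
    repr_is_json = False
print(repr_is_json)
```

In the script, the same `json.dumps(results)` string could presumably be passed to the existing `tf.gfile.GFile` writer in place of `str(results)`; that substitution is a suggestion, not something the original post tested.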