AI Challenger scene classification: TensorFlow inception-resnet-v2, LB: 0.94361
2017-10-06 04:46
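The model is tf-slim's inception-resnet-v2, pretrained on ImageNet. You can choose which layers to train, for example retraining only the last layer, or retraining several of the later layers. No special data augmentation is used, only tf-slim's default Inception input preprocessing. The parameter configuration listed in this post scores 0.94361 on the online leaderboard.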
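The layer selection described above is driven by the script's `trainable_scopes` flag, a comma-separated list of scope prefixes. A minimal pure-Python sketch of that filtering follows. It works on variable names rather than real TensorFlow variables, the helper name `select_by_scope` and the example variable names are illustrative assumptions, and it uses a plain prefix test where the script calls `tf.get_collection`.

```python
def select_by_scope(variable_names, trainable_scopes):
    """Return the names that fall under any of the comma-separated scopes.

    Mirrors the flag convention used by the script: the string "None"
    means "train everything".
    """
    if trainable_scopes == "None":
        return list(variable_names)
    scopes = [scope.strip() for scope in trainable_scopes.split(",")]
    return [name for name in variable_names
            if any(name.startswith(scope) for scope in scopes)]


# Illustrative variable names (assumed, not read from a real checkpoint).
names = [
    "InceptionResnetV2/Conv2d_1a_3x3/weights",
    "InceptionResnetV2/AuxLogits/Logits/weights",
    "InceptionResnetV2/Logits/Logits/weights",
]

selected = select_by_scope(
    names, "InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits")
print(selected)
```

With the script's default flag value only the two logits heads are selected, which corresponds to retraining only the last layer; passing more scopes widens the set of trained layers.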
The images in the tfrecord were all resized to 299*299 before conversion; see the earlier blog post for the details.
learning_rate=0.0001
batch_size=32
num_epochs=80
Results:
training accuracy: 0.836019
Final Testing accuracy: 0.945787 (val)
Final Testing accuracy: 0.94361 (testA)
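The accuracies above come from the script's `tf.nn.in_top_k(..., k=3)` calls, so they are top-3 accuracies: a prediction counts as correct when the true label is among the three highest-scoring classes. A small pure-Python sketch of that metric follows. The function name and the toy scores are made up for illustration, and tie handling is not modelled.

```python
def top_k_accuracy(logits, labels, k=3):
    """Fraction of rows whose true label is among the k largest scores."""
    hits = 0
    for row, label in zip(logits, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy scores for 4 images and 5 classes (illustrative values only).
logits = [
    [0.1, 0.5, 0.2, 0.9, 0.0],    # top-3 classes: 3, 1, 2
    [0.7, 0.05, 0.1, 0.15, 0.0],  # top-3 classes: 0, 3, 2
    [0.2, 0.3, 0.1, 0.0, 0.4],    # top-3 classes: 4, 1, 0
    [0.0, 0.1, 0.2, 0.3, 0.4],    # top-3 classes: 4, 3, 2
]
labels = [2, 1, 4, 0]

print(top_k_accuracy(logits, labels, k=3))
print(top_k_accuracy(logits, labels, k=1))
```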
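There still seems to be a lot of room for improvement (tuning), including data augmentation, resolution, the number of epochs and so on. However:
The code currently has one problem: it does not monitor validation accuracy while training. This is a pitfall of using tfrecord with TensorFlow; it requires writing some ugly workarounds yourself and is still to be solved (this matters a lot, because overfitting has already been observed with some parameter configurations). Newer versions of tf will gradually address this; see the two issues in the comment at the top of the code. Using the image-reading approach from the official code would solve the problem easily, but reading may be about twice as slow, and it cannot be used on some cloud computing platforms.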
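One building block of any such workaround is knowing exactly how many batches make up one full pass over the validation set, so that the evaluation loop can stop after reading every image once. The sketch below does only that arithmetic. The numbers 7120 and 128 are taken from the script's `magic_val_len` and `magic_vac_batch_size`; the helper name `validation_plan` is hypothetical, and this is not the author's solution to the queue problem.

```python
import math


def validation_plan(num_examples, batch_size):
    """Return (number of batches, size of the final batch) for one full pass."""
    num_batches = math.ceil(num_examples / batch_size)
    last_batch = num_examples - (num_batches - 1) * batch_size
    return num_batches, last_batch


num_batches, last_batch = validation_plan(7120, 128)
print("batches per validation pass:", num_batches)
print("size of the final batch:", last_batch)
```

Because the final batch is smaller than the others, per-batch accuracies should not simply be averaged; an accumulated count of correct predictions divided by the total number of examples avoids that bias.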
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 20 16:05:02 2017

@author: wayne

FEELINGS
Native tf and tfrecord still have quite a few pitfalls, and a lot of "generic code" has to be
written by hand, especially for the input pipeline and for training/validation
[flow control and accuracy monitoring].
Datasets were introduced in the latest 1.3 release; for the features of the upcoming 1.4 release see
https://github.com/tensorflow/tensorflow/issues/7902
and
https://github.com/tensorflow/tensorflow/issues/7951
For now PyTorch is honestly nicer to use; its code is more intuitive and easier to follow.
This script combines native tf modules with slim models. Learning from slim's official
boilerplate code is an option, but its level of abstraction is rather high.

CHANGES
- Can restore our own previously saved model instead of starting from the official model
  every time: tf.flags.DEFINE_bool('use_official', True)
-

REFERENCES
https://web.stanford.edu/class/cs20si/syllabus.html

Input data
https://stackoverflow.com/questions/44054656/creating-tfrecords-from-a-list-of-strings-and-feeding-a-graph-in-tensorflow-afte
https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/
https://indico.io/blog/tensorflow-data-input-part2-extensions/

Overall architecture
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/2_fullyconnected.ipynb

Saving and loading models
http://blog.csdn.net/u014595019/article/details/53912710
http://blog.csdn.net/u012436149/article/details/52883747 (restoring a subset of variables)
https://github.com/SymphonyPy/Valified_Code_Classify/tree/master/Classified
http://blog.csdn.net/lwplwf/article/details/76177296 (defines a loop that watches for new checkpoints and runs validation once whenever one appears)

Transfer learning (there are really few tutorials that combine native tf modules with slim CNN models!)
https://github.com/AIChallenger/AI_Challenger/tree/master/Baselines/caption_baseline (uses a slim CNN)
https://github.com/kwotsin/transfer_learning_tutorial (a fairly complete program, but it uses only slim-provided modules, plus tf.train.Supervisor and tensorboard)
http://blog.csdn.net/ArtistA/article/details/52860050
(a CNN implemented directly in tf): https://github.com/joelthchao/tensorflow-finetune-flickr-style
http://blog.csdn.net/nnnnnnnnnnnny/article/details/70244232 (tensorflow_inception_graph.pb. Since each training example is used many times, the feature vectors computed from the original images by the Inception-v3 model can be saved to a file to avoid recomputation.)
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py
https://github.com/tensorflow/models/issues/391 [slim] weird result with parameter is_training
https://github.com/YanWang2014/models/tree/master/slim (the various slim models)
http://pytorch.org/docs/master/torchvision/models.html
http://data.mxnet.io/models/

Data augmentation
https://github.com/wzhang1/iNaturalist MXNet finetune baseline (res152) for challenger.ai/competition/scene
https://github.com/AIChallenger/AI_Challenger/tree/master/Baselines/caption_baseline/im2txt/im2txt/ops

Hyperparameter tuning
https://zhuanlan.zhihu.com/p/22252270 A comprehensive summary and comparison of deep learning optimization methods (SGD, Adagrad, Adadelta, Adam, Adamax, Nadam)
http://www.360doc.com/content/16/1010/08/36492363_597225745.shtml
https://www.zhihu.com/question/41631631 What experience do you have tuning deep learning (rnn, cnn) models?
https://www.zhihu.com/question/25097993 What are some tricks for tuning deep learning models?
https://www.zhihu.com/question/24529483 What role does weight decay play in neural networks? What about momentum? Normalization?
https://zhuanlan.zhihu.com/p/27555858?utm_medium=social&utm_source=wechat_session [Explainer] How to tune parameters with fancy methods

The tfrecord validation-set problem: there are many hacky approaches, differing in whether they build an extra graph
https://github.com/tensorflow/tensorflow/issues/7902 Each validation pass must read exactly the whole validation set, many times over; how can this be done (elegantly) with tfrecord?
https://github.com/tensorflow/tensorflow/issues/7951 New versions will improve the input pipeline
https://stackoverflow.com/questions/39187764/tensorflow-efficient-feeding-of-eval-train-data-using-queue-runners
https://stackoverflow.com/questions/44270198/when-using-tfrecord-how-can-i-run-intermediate-validation-check-a-better-way
https://stackoverflow.com/questions/40146428/show-training-and-validation-accuracy-in-tensorflow-using-same-graph

Visualizing the lr of AdamOptimizer
https://stackoverflow.com/questions/36990476/getting-the-current-learning-rate-from-a-tf-train-adamoptimizer/44688307#44688307
"""

from __future__ import division, print_function, absolute_import

import os
import time

import tensorflow as tf

slim = tf.contrib.slim
from inception_resnet_v2 import *
import inception_preprocessing

tf.reset_default_graph()

FLAGS = tf.flags.FLAGS
tf.flags.DEFINE_bool('train_flag', False, 'train_flag')
# 'None' trains all layers. Ignored when testing.
tf.flags.DEFINE_string('trainable_scopes',
                       'InceptionResnetV2/Logits,InceptionResnetV2/AuxLogits',
                       'layers to train')
tf.flags.DEFINE_bool('use_official', True,
                     'start training from the official model or from our own saved model; '
                     'back up your own model first, otherwise it may be overwritten')
tf.flags.DEFINE_float('learning_rate', 0.001, 'learning_rate')
tf.flags.DEFINE_string('val_test', 'None',
                       'which data to test on when train_flag=False: val.tfrecord, testA, testB')
# 0.1 for the last layer
# 1e-3 5e-4. 0.001 for the last layer, 0.0001 for whole0? 0.1 0.05 0.00001
tf.flags.DEFINE_float('beta1', 0.9, 'beta1')
tf.flags.DEFINE_float('beta2', 0.999, 'beta2')
tf.flags.DEFINE_float('epsilon', 0.1, 'epsilon')  # 1e-8. Imagenet: 1.0 or 0.1
tf.flags.DEFINE_integer('batch_size', 2, 'batch size')
tf.flags.DEFINE_integer('num_epochs', 1, 'epochs')
tf.flags.DEFINE_string('buckets', 'oss://scene2017', 'folder containing the training images')
official_model_path = 'oss://scene2017/slim/inception_resnet_v2_2016_08_30.ckpt'
tf.flags.DEFINE_string('checkpointDir', 'oss://scene2017', 'model output folder')
model_path = os.path.join(FLAGS.checkpointDir, 'model.ckpt')  # the fine-tuned model
tf.flags.DEFINE_string('writes', 'oss://scene2017/slim/submit.txt', 'where to save the predictions')

image_size = inception_resnet_v2.default_image_size  # 299
num_labels = 80

'''
Each validation pass must read exactly the whole validation set, and the next pass must read
it again. This currently cannot be done (elegantly) with tfrecord, so we control the queue
manually: magic
https://github.com/tensorflow/tensorflow/issues/7951
'''
magic_val_len = 7120  # validation set size
magic_vac_batch_size = 128  # batch_size can be large for validation, as long as RAM/GPU memory suffices


def read_and_decode(tfrecord_file, batch_size, num_epochs):
    filename_queue = tf.train.string_input_producer([tfrecord_file], num_epochs=num_epochs)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    img_features = tf.parse_single_example(
        serialized_example,
        features={
            'label': tf.FixedLenFeature([], tf.int64),
            'h': tf.FixedLenFeature([], tf.int64),
            'w': tf.FixedLenFeature([], tf.int64),
            'c': tf.FixedLenFeature([], tf.int64),
            'image': tf.FixedLenFeature([], tf.string),
        })

    h = tf.cast(img_features['h'], tf.int32)
    w = tf.cast(img_features['w'], tf.int32)
    c = tf.cast(img_features['c'], tf.int32)

    image = tf.decode_raw(img_features['image'], tf.uint8)
    image = tf.reshape(image, [h, w, c])
    label = tf.cast(img_features['label'], tf.int32)

    ##########################################################
    '''data augmentation here'''
    # distorted_image = tf.random_crop(images, [530, 530, img_channel])
    # distorted_image = tf.image.random_flip_left_right(distorted_image)
    # distorted_image = tf.image.random_brightness(distorted_image, max_delta=63)
    # distorted_image = tf.image.random_contrast(distorted_image, lower=0.2, upper=1.8)
    # image = tf.image.resize_images(image, (image_size,image_size))
    # image = tf.image.per_image_standardization(image)
    # image = tf.reshape(image, [image_size * image_size * 3])
    image = inception_preprocessing.preprocess_image(image, image_size, image_size,
                                                     is_training=True)

    ##########################################################
    '''shuffle here'''
    image_batch, label_batch = tf.train.shuffle_batch(
        [image, label],
        batch_size=batch_size,
        num_threads=64,  # note: multithreading may change the image order
        capacity=10240,
        min_after_dequeue=256)
    return image_batch, label_batch


def read_and_decode_test(tfrecord_file, batch_size, num_epochs):
    filename_queue = tf.train.string_input_producer([tfrecord_file], num_epochs=num_epochs)
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    img_features = tf.parse_single_example(
        serialized_example,
        features={
            'label': tf.FixedLenFeature([], tf.int64),
            'h': tf.FixedLenFeature([], tf.int64),
            'w': tf.FixedLenFeature([], tf.int64),
            'c': tf.FixedLenFeature([], tf.int64),
            'image': tf.FixedLenFeature([], tf.string),
            # https://stackoverflow.com/questions/41921746/tensorflow-varlenfeature-vs-fixedlenfeature
            'image_id': tf.FixedLenFeature([], tf.string),
        })

    h = tf.cast(img_features['h'], tf.int32)
    w = tf.cast(img_features['w'], tf.int32)
    c = tf.cast(img_features['c'], tf.int32)
    image_id = img_features['image_id']

    image = tf.decode_raw(img_features['image'], tf.uint8)
    image = tf.reshape(image, [h, w, c])
    label = tf.cast(img_features['label'], tf.int32)

    ##########################################################
    '''no data augmentation'''
    # image = tf.image.resize_images(image, (image_size,image_size))
    # image = tf.image.per_image_standardization(image)
    # image = tf.reshape(image, [image_size * image_size * 3])
    image = inception_preprocessing.preprocess_image(image, image_size, image_size,
                                                     is_training=False)
    '''a bug in inception_preprocessing.preprocess_for_eval?'''
    image.set_shape([None, None, 3])

    image_batch, label_batch, image_id_batch = tf.train.batch(
        [image, label, image_id],
        batch_size=batch_size,
        num_threads=64,  # note: multithreading may change the image order
        capacity=2000,
        allow_smaller_final_batch=True)
    return image_batch, label_batch, image_id_batch


def batch_to_list_of_dicts(indices2, image_id_batch2):
    # Output format:
    # [{"image_id":"a0563eadd9ef79fcc137e1c60be29f2f3c9a65ea.jpg","label_id": [5,18,32]}]
    result = []
    dict_ = {}
    for item in range(indices2.shape[0]):
        dict_['image_id'] = image_id_batch2[item].decode()
        dict_['label_id'] = indices2[item, :].tolist()
        result.append(dict_)
        dict_ = {}
    return result


'''https://github.com/tensorflow/models/blob/master/research/slim/train_image_classifier.py'''
def get_variables_to_train():
    """Returns a list of variables to train.

    Returns:
      A list of variables to train by the optimizer.
    """
    trainable_scopes = FLAGS.trainable_scopes
    if trainable_scopes == "None":
        print("from scratch")
        return tf.trainable_variables()
    else:
        print("train the specified layer")
        scopes = [scope.strip() for scope in trainable_scopes.split(',')]

    variables_to_train = []
    for scope in scopes:
        variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope)
        variables_to_train.extend(variables)
    # variables_to_train = [i.name for i in variables_to_train]
    return variables_to_train


def read_tfrecord2(tfrecord_file, batch_size, train_flag, num_epochs, total_steps):
    # The test data carries image_id; otherwise it could share the input function with train.
    # Also, read_and_decode adds data augmentation during training, so both the validation
    # and test sets use the second function.
    if train_flag:
        train_batch, train_label_batch = read_and_decode(tfrecord_file, batch_size, num_epochs)
        with slim.arg_scope(inception_resnet_v2_arg_scope()):
            train_logits, end_points = inception_resnet_v2(
                train_batch, num_classes=num_labels, is_training=True)

        # Define the scopes that you want to exclude for restoration
        exclude = ['InceptionResnetV2/Logits', 'InceptionResnetV2/AuxLogits']
        variables_to_restore = slim.get_variables_to_restore(exclude=exclude)
        variables_to_train = get_variables_to_train()

        # Performs the equivalent to tf.nn.sparse_softmax_cross_entropy_with_logits
        # but enhanced with checks
        loss = tf.losses.sparse_softmax_cross_entropy(labels=train_label_batch,
                                                      logits=train_logits)
        # slim.losses.add_loss(pose_loss)
        total_loss = tf.losses.get_total_loss()  # obtain the regularization losses as well

        # http://blog.csdn.net/xierhacker/article/details/53174558
        optimizer = tf.train.AdamOptimizer(
            learning_rate=FLAGS.learning_rate,
            beta1=FLAGS.beta1,
            beta2=FLAGS.beta2,
            epsilon=FLAGS.epsilon,
            use_locking=False,
            name='Adam')

        '''To choose which layers are trained you need this function; by default all are trained:
        https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/learning.py#L374'''
        train_op = slim.learning.create_train_op(total_loss, optimizer,
                                                 variables_to_train=variables_to_train)

        '''minibatch accuracy, non-streaming'''
        train_accuracy_batch = tf.reduce_mean(tf.cast(
            tf.nn.in_top_k(predictions=train_logits, targets=train_label_batch, k=3),
            tf.float32))
        '''Streaming accuracy'''
        train_accuracy, train_accuracy_update = tf.metrics.mean(tf.cast(
            tf.nn.in_top_k(predictions=train_logits, targets=train_label_batch, k=3),
            tf.float32))
    else:
        val_test_batch, val_test_label_batch, image_id_batch = read_and_decode_test(
            tfrecord_file, batch_size, num_epochs)
        with slim.arg_scope(inception_resnet_v2_arg_scope()):
            val_test_logits, end_points = inception_resnet_v2(
                val_test_batch, num_classes=num_labels, is_training=False)

        '''Useless minibatch accuracy, non-streaming'''
        val_test_accuracy_batch = tf.reduce_mean(tf.cast(
            tf.nn.in_top_k(predictions=val_test_logits, targets=val_test_label_batch, k=3),
            tf.float32))
        '''Streaming accuracy'''
        val_test_accuracy, val_test_accuracy_update = tf.metrics.mean(tf.cast(
            tf.nn.in_top_k(predictions=val_test_logits, targets=val_test_label_batch, k=3),
            tf.float32))
        values, indices = tf.nn.top_k(val_test_logits, 3)

    saver = tf.train.Saver()  # create the saver
    if train_flag:
        if FLAGS.use_official:
            saver_step0 = tf.train.Saver(variables_to_restore)
        else:
            saver_step0 = tf.train.Saver()

    with tf.Session() as sess:
        # https://github.com/tensorflow/tensorflow/issues/1045
        sess.run(tf.group(tf.global_variables_initializer(),
                          tf.local_variables_initializer()))
        print("Initialized")
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)

        if train_flag:
            '''
            How is the final logits layer, whose number of classes was changed, initialized?
            By sess.run(tf.group(tf.global_variables_initializer(),
                                 tf.local_variables_initializer()))???
            '''
            if FLAGS.use_official:
                saver_step0.restore(sess, official_model_path)
            else:
                saver_step0.restore(sess, model_path)

            try:
                step = 0
                start_time = time.time()
                while not coord.should_stop():
                    _, l, logits2, train_acc2_batch, train_acc2, train_acc2_update = sess.run(
                        [train_op, total_loss, train_logits, train_accuracy_batch,
                         train_accuracy, train_accuracy_update])
                    duration = time.time() - start_time
                    if (step % 10 == 0):
                        print("Minibatch loss at step %d - %d: %.6f (%.3f sec)"
                              % (step, total_steps, l, duration))
                        print("Minibatch accuracy: %.6f" % train_acc2)
                        # https://stackoverflow.com/questions/38882593/learning-rate-doesnt-change-for-adamoptimizer-in-tensorflow
                        print("lr: %.6f" % optimizer._lr)
                    # if (step % 100 == 0):
                    #     Validating accuracy
                    step += 1
            except tf.errors.OutOfRangeError:
                print('Done training for %d epochs, %d steps.' % (num_epochs, step))
                print('Final training accuracy: %.6f' % (train_acc2_update))
                # Final Validating accuracy
                saver.save(sess, model_path)
            finally:
                coord.request_stop()
        else:
            saver.restore(sess, model_path)  # restores the saved variable values into the variables
            results = []
            try:
                step = 0
                start_time = time.time()
                while not coord.should_stop():
                    (val_test_logits2, val_test_acc2_batch, val_test_acc2,
                     val_test_acc2_update, image_id_batch2, indices2, values2) = sess.run(
                        [val_test_logits, val_test_accuracy_batch, val_test_accuracy,
                         val_test_accuracy_update, image_id_batch, indices, values])
                    step += 1
                    results += batch_to_list_of_dicts(indices2, image_id_batch2)
                    if (step % 10 == 0):
                        print('Useless minibatch testing accuracy at step %d: %.6f'
                              % (step, val_test_acc2_batch))
                        print(indices2.shape[0])
            except tf.errors.OutOfRangeError:
                print('Done testing in, %d steps.' % (step))
                print('Final Testing accuracy: %.6f' % (val_test_acc2_update))

                '''Writing JSON data'''
                # results = [{"image_id":"a0563eadd9ef79fcc137e1c60be29f2f3c9a65ea.jpg","label_id": [5,18,32]}]
                print(len(results))
                # A PAI pitfall: plain open() does not work on oss:// paths
                with tf.gfile.GFile(FLAGS.writes, 'w') as f:
                    f.write(str(results))
                # with open('oss://scene2017.oss-cn-shanghai-internal.aliyuncs.com/softmax/submit.json', 'w') as f:
                #     json.dump(results, f)
            finally:
                coord.request_stop()

        coord.join(threads)


def main(_):
    train_flag = FLAGS.train_flag
    if train_flag:
        tfrecord_file = os.path.join(FLAGS.buckets, 'train.tfrecord')  # '../ai_challenger_scene_train_20170904/train.tfrecord'
        # tfrecord_file_val = '../ai_challenger_scene_train_20170904/val.tfrecord'  # validate while training
        batch_size = FLAGS.batch_size  # 256
        num_epochs = FLAGS.num_epochs
        total_steps = 1.0 * num_epochs * 53879 / batch_size
        print("total_steps is %d" % total_steps)
        print("num_epochs is %d" % num_epochs)
        print("batch_size is %d" % batch_size)
        print("lr %.6f" % FLAGS.learning_rate)
        read_tfrecord2(tfrecord_file, batch_size, train_flag, num_epochs, total_steps)
    else:
        tfrecord_file = os.path.join(FLAGS.buckets, FLAGS.val_test)  # '../ai_challenger_scene_train_20170904/val.tfrecord'  # test
        batch_size = FLAGS.batch_size  # 16
        num_epochs = FLAGS.num_epochs  # 1
        # 7120 is for val.tfrecord; the other test sets are slightly off, ignore that
        total_steps = 1.0 * num_epochs * 7120 / batch_size
        print("total_steps is %d" % total_steps)
        read_tfrecord2(tfrecord_file, batch_size, train_flag, num_epochs, total_steps)
        # 53879 7120 7040


if __name__ == "__main__":  # ensures main is not run when this file is imported by another file
    tf.app.run()  # parses the command-line arguments, then calls main(sys.argv)
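A note on the prediction file: the script writes `str(results)`, which is the Python repr of the list (single-quoted strings), while its own comments and the commented-out `json.dump` call suggest that JSON is the intended format. Assuming the submission does need to be strict JSON, a minimal sketch of the difference follows, using the example record from the script's comments.

```python
import json

# Example record in the format shown in the script's comments.
results = [
    {"image_id": "a0563eadd9ef79fcc137e1c60be29f2f3c9a65ea.jpg",
     "label_id": [5, 18, 32]},
]

as_repr = str(results)         # what the script writes: Python repr, single quotes
as_json = json.dumps(results)  # strict JSON: double-quoted keys and strings

print(as_repr)
print(as_json)

# Only the json.dumps output is guaranteed to round-trip through a JSON parser.
assert json.loads(as_json) == results
```

In the script, the `json.dumps(results)` string could be passed to the existing `tf.gfile.GFile(...).write` call in place of `str(results)`, which keeps the write compatible with `oss://` paths.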