
resnet-in-tensorflow's Introduction

ResNet in Tensorflow

This implementation of ResNet and its variants is designed to be straightforward and friendly to new ResNet users. You can train a ResNet on CIFAR-10 by downloading and running the code. Screen output, TensorBoard statistics and TensorBoard graph visualization help you monitor the training process and visualize the model.

The code now works with TensorFlow 1.0.0 and 1.1.0, but it is no longer compatible with earlier versions.

If you like the code, please star it! You are welcome to post questions and suggestions on my github.

Validation errors

The lowest validation errors of ResNet-32, ResNet-56 and ResNet-110 are 6.7%, 6.5% and 6.2% respectively. You can change the total number of layers by changing the hyper-parameter num_residual_blocks: total layers = 6 * num_residual_blocks + 2.

Network Lowest Validation Error
ResNet-32 6.7%
ResNet-56 6.5%
ResNet-110 6.2%
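
As a quick check of the depth formula above, the following minimal sketch maps num_residual_blocks to the network names in the table. The helper names total_layers and blocks_for_depth are hypothetical and are not part of the repository.

```python
def total_layers(num_residual_blocks):
    """Depth of the network according to the README: 6 * n + 2."""
    return 6 * num_residual_blocks + 2


def blocks_for_depth(depth):
    """Invert the formula. The depth must be of the form 6 * n + 2."""
    if depth < 8 or (depth - 2) % 6 != 0:
        raise ValueError("depth must equal 6 * n + 2 for a positive integer n")
    return (depth - 2) // 6


# Hypothetical usage: list the settings that give the networks in the table.
for n in (5, 9, 18):
    print("num_residual_blocks=%d -> ResNet-%d" % (n, total_layers(n)))
```

So num_residual_blocks values of 5, 9 and 18 correspond to ResNet-32, ResNet-56 and ResNet-110.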

Training curves

[Figure: training curves]

User's guide

Run cifar10_train.py and follow the screen output to see how it works (the code downloads the data for you if you don't have it yet). It is better to specify a version identifier before running, since the training logs, checkpoints and error.csv file are saved in a folder named logs_$version. You can do this on the command line: python cifar10_train.py --version='test'. You may also change the version inside the hyper_parameters.py file.

The training and validation errors are printed on the screen. They can also be viewed in TensorBoard: run tensorboard --logdir='logs_$version' to bring them up. (For example, if the version is 'test', the logdir should be 'logs_test'.) The relevant statistics of each layer can also be found in TensorBoard.

Pre-requisites

pandas, numpy, opencv, tensorflow (1.0.0)

Overall structure

There are four Python files in the repository: cifar10_input.py, resnet.py, cifar10_train.py and hyper_parameters.py.

cifar10_input.py includes helper functions to download, extract and pre-process the CIFAR-10 images. resnet.py defines the ResNet structure. cifar10_train.py is responsible for training and validation. hyper_parameters.py defines hyper-parameters related to training, the ResNet structure, data augmentation, etc.

The following sections explain the code in detail.


hyper-parameters

The hyper_parameters.py file defines all the hyper-parameters that you may change to customize your training. You may use python cifar10_train.py --hyper_parameter1=value1 --hyper_parameter2=value2 to set all the hyper-parameters. You may also change the default values inside the python script.

There are five categories of hyper-parameters.


1. Hyper-parameters about saving training logs, tensorboard outputs and screen outputs, which include:

version: str. The checkpoints and output events will be saved in logs_$version/

report_freq: int. Run a full validation and print the screen output once every report_freq batches.

train_ema_decay: float. TensorBoard records a moving average of the batch training errors in addition to the raw ones. This decay factor is used to define an ExponentialMovingAverage object in TensorFlow with tf.train.ExponentialMovingAverage(FLAGS.train_ema_decay, global_step). Essentially, recorded_error = train_ema_decay * shadowed_error + (1 - train_ema_decay) * current_batch_error. The larger train_ema_decay is, the smoother the training curve will be.
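
The update rule above can be sketched in plain Python. This is a minimal illustration, not the repository's code: the function names are hypothetical, and starting the shadow value at the first batch error is an assumption made for the example. The effective_decay helper reflects the documented behaviour of tf.train.ExponentialMovingAverage when a num_updates argument (here global_step) is passed, namely that the decay actually used is min(decay, (1 + num_updates) / (10 + num_updates)).

```python
def effective_decay(decay, num_updates):
    """Decay used by tf.train.ExponentialMovingAverage when num_updates is given."""
    return min(decay, (1.0 + num_updates) / (10.0 + num_updates))


def ema_update(shadowed_error, current_batch_error, decay):
    """One step of the moving average described in the README."""
    return decay * shadowed_error + (1.0 - decay) * current_batch_error


def smooth(batch_errors, train_ema_decay):
    """Smooth a list of batch errors (assumption: shadow starts at the first error)."""
    shadow = batch_errors[0]
    smoothed = []
    for step, error in enumerate(batch_errors):
        decay = effective_decay(train_ema_decay, step)
        shadow = ema_update(shadow, error, decay)
        smoothed.append(shadow)
    return smoothed


# Hypothetical batch errors, for illustration only.
print(smooth([0.9, 0.5, 0.7, 0.3], train_ema_decay=0.95))
```

Early in training the effective decay is small, so the average follows the raw errors closely; it approaches train_ema_decay as the step count grows.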


2. Hyper-parameters regarding the training process

train_steps: int. Total training steps

is_full_validation: boolean. Whether to run validation on all 10000 validation images (True) or on a randomly drawn batch of validation data (False)

train_batch_size: int. Training batch size

validation_batch_size: int. Validation batch size (which is only effective if is_full_validation=False)

init_lr: float. The initial learning rate. The learning rate may decay based on the settings below

lr_decay_factor: float. The decaying factor of learning rate. The learning rate will become lr_decay_factor * current_learning_rate every time it is decayed.

decay_step0: int. The learning rate will decay at decay_step0 for the first time

decay_step1: int. The second time when the learning rate will decay
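
The learning rate settings above describe a piecewise-constant schedule, which can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the decay is assumed to take effect at exactly decay_step0 and decay_step1, and the numeric values in the usage lines are examples rather than the repository's defaults.

```python
def learning_rate_at(step, init_lr, lr_decay_factor, decay_step0, decay_step1):
    """Learning rate in effect at a given training step."""
    lr = init_lr
    if step >= decay_step0:
        lr *= lr_decay_factor  # first decay
    if step >= decay_step1:
        lr *= lr_decay_factor  # second decay
    return lr


# Illustrative values only (assumed, not read from hyper_parameters.py).
for step in (0, 39999, 40000, 60000):
    print(step, learning_rate_at(step, init_lr=0.1, lr_decay_factor=0.1,
                                 decay_step0=40000, decay_step1=60000))
```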


3. Hyper-parameters that control the network

num_residual_blocks: int. The total layers of the ResNet = 6 * num_residual_blocks + 2

weight_decay: float. The weight decay used to regularize the network. total_loss = train_loss + weight_decay * sum of squares of the weights
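
The regularized loss above can be written out as a small sketch. The function name and the nested-list representation of the weights are assumptions for illustration; the repository builds the penalty inside the TensorFlow graph instead. Note also that if the regularizer is built on tf.nn.l2_loss, which computes sum(t ** 2) / 2, the effective penalty would be half of what this sketch returns.

```python
def total_loss(train_loss, weights, weight_decay):
    """train_loss plus weight_decay times the sum of squared weights.

    `weights` is a list of flat lists of floats, one list per layer
    (a hypothetical stand-in for the network's weight tensors).
    """
    sum_of_squares = sum(w * w for layer in weights for w in layer)
    return train_loss + weight_decay * sum_of_squares


# Hypothetical numbers: 1.0 + 0.1 * (1 + 4 + 9)
print(total_loss(1.0, [[1.0, 2.0], [3.0]], weight_decay=0.1))
```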


4. About data augmentation

padding_size: int. The number of zero pads to add on each side of the image. Padding and random cropping during training can help prevent overfitting.
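
A minimal NumPy sketch of the padding and random cropping idea follows. It is not the code from cifar10_input.py: the function name, the explicit random number generator argument and the 32x32x3 image shape are assumptions made for the example.

```python
import numpy as np


def pad_and_random_crop(image, padding_size, rng):
    """Zero-pad an HxWxC image on each side, then crop a random HxW window."""
    height, width, _ = image.shape
    padded = np.pad(
        image,
        ((padding_size, padding_size), (padding_size, padding_size), (0, 0)),
        mode="constant",
    )
    # Offsets range over [0, 2 * padding_size], so the crop stays inside the padded image.
    y = rng.randint(0, 2 * padding_size + 1)
    x = rng.randint(0, 2 * padding_size + 1)
    return padded[y:y + height, x:x + width, :]


rng = np.random.RandomState(0)
image = np.ones((32, 32, 3), dtype=np.float32)  # placeholder image
cropped = pad_and_random_crop(image, padding_size=2, rng=rng)
print(cropped.shape)
```

The crop has the same shape as the input, but its content is shifted by up to padding_size pixels in each direction, with zeros filling the uncovered border.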


5. Loading checkpoints

ckpt_path: str. The path of the checkpoint that you want to load

is_use_ckpt: boolean. If True, load the checkpoint and continue training from it


ResNet Structure

Here we use the latest version of ResNet. The structure of the residual block is as follows (ref):

[Figure: residual block structure]

The inference() function is the main function of resnet.py. It is used twice: once to build the training graph and once to build the validation graph.

Training

The class Train() defines all the functions for the training process, with train() being the main function. The basic idea is to run train_op FLAGS.train_steps times. If step % FLAGS.report_freq == 0, it validates once, trains once and writes all the summaries to TensorBoard.

Test

The test() function in the class Train() helps you predict. It returns the softmax probabilities with shape [num_test_images, num_labels]. You need to prepare and pre-process your test data and pass it to the function. You may use either your own checkpoints or the pre-trained ResNet-110 checkpoint I uploaded. You may write the following lines at the end of the cifar10_train.py file:

train = Train()
test_image_array = ... # Better to be whitened in advance. Shape = [-1, img_height, img_width, img_depth]
predictions = train.test(test_image_array)
# predictions is the predicted softmax array.

Run the following commands in the command line:

# If you want to use my checkpoint. 
python cifar10_train.py --test_ckpt_path='model_110.ckpt-79999'

resnet-in-tensorflow's People

Contributors

lokhande-vishnu, wenxinxu

resnet-in-tensorflow's Issues

UnrecognizedFlagError: Unknown command line flag 'f'

UnrecognizedFlagError Traceback (most recent call last)
in
----> 1 train_dir = 'logs_' + FLAGS.version + '/'

~\AppData\Local\Continuum\anaconda3\lib\site-packages\tensorflow\python\platform\flags.py in getattr(self, name)
82 # a flag.
83 if not wrapped.is_parsed():
---> 84 wrapped(_sys.argv)
85 return wrapped.getattr(name)
86

~\AppData\Local\Continuum\anaconda3\lib\site-packages\absl\flags_flagvalues.py in call(self, argv, known_only)
631 suggestions = _helpers.get_flag_suggestions(name, list(self))
632 raise _exceptions.UnrecognizedFlagError(
--> 633 name, value, suggestions=suggestions)
634
635 self.mark_as_parsed()

inference Error

When I call inference, I got an error:
Traceback (most recent call last):
File "/home/forever/PycharmProjects/PIG/resnet.py", line 313, in
conv9_out = inference(x, FLAGS.num_residual_blocks, reuse=False)
File "/home/forever/PycharmProjects/PIG/resnet.py", line 194, in inference
assert conv3.get_shape().as_list()[1:] == [8, 8, 64]
AssertionError

can you tell where should I change my code?

If I would like to get the ’Training curve‘

I can change the ‘num_residual_blocks’ if I would like to get a plot like your Training curve.
If I set ‘num_residual_blocks’=3 this is a 20-resnet ?
If I set ‘num_residual_blocks’=5 this is a 32-resnet ?
If I set ‘num_residual_blocks’=9 this is a 56-resnet ?
If I set ‘num_residual_blocks’=18 this is a 110-resnet ?

It is OK?

about weight initialization

why are you using tf.contrib.layers.xavier initializer instead of tf.contrib.layers.variance_scaling_initializer() ??

all the input arrays must have same number of dimensions

working on python3
and have changed cPickle to pickle && data = dicts['data'] to data = dicts.get('data')

I am encountering a problem about
ValueError: all the input arrays must have same number of dimensions
in cifar10_input.py

Traceback (most recent call last):
File "/data/tmp/pycharm_project_979/cifar10_train.py", line 425, in
train.train()
File "/data/tmp/pycharm_project_979/cifar10_train.py", line 86, in train
all_data, all_labels = prepare_train_data(padding_size=FLAGS.padding_size)
File "/data/tmp/pycharm_project_979/cifar10_input.py", line 176, in prepare_train_data
data, label = read_in_all_images(path_list, is_random_label=TRAIN_RANDOM_LABEL)
File "/data/tmp/pycharm_project_979/cifar10_input.py", line 96, in read_in_all_images
data = np.concatenate((data, batch_data))

and when I print (data.shape) it shows (0, 3072), print(batch_data) it shows None

How can I fix the problem?

train the model with gpu

Hello, I want to know how to train the model with gpu? Now when I execute "python cifar10_train.py" it only uses cpu. Please tell me how to train the model with gpu. Thank you!

set step of lr decay

why do you choose 40000 as the first step to change lr? it seems that smaller step of changing lr works better.

the predict is not with the same result of evaluate

when I predict the model, the result is not the same with evaluate, and the result has much difference. I fetch the fc_weight , save to model and restore from model of the weight is not the same. Maybe the model has something wrong.

DuplicateFlagError: The flag 'version' is defined twice.

DuplicateFlagError: The flag 'version' is defined twice. First at C:\Users\khazi\AppData\Local\Continuum\anaconda3\Lib\site-packages\ipkernel_launcher.py and second at C:\Users\khazi\AppData\Local\Continuum\anaconda3\Lib\site-packages\ipkernel_launcher.py

Python Version = 3.7.3
Tensorflow Version = 1.14.0

About pre-train model

Hi @wenxinxu ,

Was the model model_110.ckpt-79999 fine tune from others like caffe model or totally retrain from cifar10 dataset?

Thanks

Error "no such file or directory" while training the model using the uploaded checkpoint

Hi. I am using this project as a practice of understanding CNN deeper. Since this model takes 80000 steps to finish training, I was trying to use the uploaded checkpoint of step 79999 to accelerate the training process. However, when I typed the following command
python cifar10_train.py --is_use_ckpt=True --test-ckpt_path='model_110.ckpt-79999'
an error saying "no such file or directory" showed up. What might be the potential problem? Thank you very much.

function create_variables in resnet.py


def create_variables(name, shape, initializer=tf.contrib.layers.xavier_initializer(), is_fc_layer=False):
    if is_fc_layer is True:
        regularizer = tf.contrib.layers.l2_regularizer(scale=FLAGS.weight_decay)
    else:
        regularizer = tf.contrib.layers.l2_regularizer(scale=FLAGS.weight_decay)

    new_variables = tf.get_variable(name, shape=shape, initializer=initializer,
                                    regularizer=regularizer)
    return new_variables

These two lines, the same?

please help error with checkpoint

I have an error: NotFoundError (see above for traceback): Tensor name "truediv_1/ExponentialMovingAverage_1" not found in checkpoint files model_110.ckpt-79999
I've changed the number of residual blocks to 18 to get 110 layers. It doesn't help

About batch normalization

The batch_normalization_layer() function doesn't compute the statistics of population i.e. population mean and variance. The part implemented is only taking care of the training procedure (batch statistics), but while testing one will need the population statistics

_read_one_batch

`fo = open(path, 'rb')
dicts = pickle.load(fo)
fo.close()

data = dicts['data']
if is_random_label is False:
    label = np.array(dicts[b'labels'])
else:
    labels = np.random.randint(low=0, high=10, size=10000)
    label = np.array(labels)
return data, label`

1. When execution reaches dicts = pickle.load(fo), it raises: UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 6: ordinal not in range(128). Have you not run into this?

2. After changing it to dicts = pickle.load(fo, encoding='bytes') the program continues, but data = dicts['data'] then raises KeyError: 'data'. When I inspect dicts.keys(), the result is dict_keys([b'data', b'labels', b'batch_label', b'filenames']). Why does the letter b appear in front of each key?
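
The key behaviour described in this issue can be reproduced without the CIFAR-10 files. In the sketch below, the in-memory fake_batch dictionary is a hypothetical stand-in for a batch file that was pickled under Python 2. With encoding='bytes', Python 3 loads Python 2 strings as bytes objects, so the keys must be looked up as b'data' or decoded first.

```python
import io
import pickle

# Hypothetical stand-in for a CIFAR-10 batch file written by Python 2.
fake_batch = {b"data": [[0] * 3072], b"labels": [3]}
fo = io.BytesIO(pickle.dumps(fake_batch))

dicts = pickle.load(fo, encoding="bytes")
fo.close()

# Option 1: index with bytes keys directly.
data = dicts[b"data"]

# Option 2: decode the keys once, then use ordinary string keys.
decoded = {
    (key.decode("ascii") if isinstance(key, bytes) else key): value
    for key, value in dicts.items()
}
labels = decoded["labels"]
print(sorted(decoded.keys()), labels)
```

Either option avoids the KeyError; mixing dicts['data'] with dicts[b'labels'] in the same function does not.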

load checkpoint

Can I load checkpoint model_110.ckpt-79999 into ResNet-32?

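Thank you for sharing.
There is one thing I don't quite understand and would like to ask about.
Using your code, I set
num_residual_blocks = 5 (32 layers)
and loaded the checkpoint model_110.ckpt-79999,
then continued training on the CIFAR-10 dataset for 2000 steps.
Why is the top-1 error around 15%?
Shouldn't it be around 7%?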

The result of the resnet-32

Hi,
Firstly, I cannot get the best accuracy (6.7%) as reported. With 'is_full_validation' set to 'True' and the other settings kept the same as in the source code, I ran 'cifar10_train.py'. My best result was only 7.22% at about 77809 iters, and my second best was 7.28% at about 69207 iters. Maybe there are some tips that I have missed. Could you give me some suggestions?
Secondly, I notice that the validation curve is more unstable compared to the results in the original paper. I ran the code and found that it doesn't seem to converge. The results on the validation set still fluctuate at the end. Is there something wrong?

Please give a exact example for test

example below is not enough. please give a exact example for test code

Test

The test() function in the class Train() help you predict. It returns the softmax probability with shape [num_test_images, num_labels]. You need to prepare and pre-process your test data and pass it to the function. You may either use your own checkpoints or the pre-trained ResNet-110 checkpoint I uploaded. You may wrote the following lines at the end of cifar10_train.py file
train = Train()
test_image_array = ... # Better to be whitened in advance. Shape = [-1, img_height, img_width, img_depth]
top1_error, loss = train.test(test_image_array)

Run the following commands in the command line:

If you want to use my checkpoint.

python cifar10_train.py --test_ckpt_path='model_110.ckpt-79999'

Start working on resnet

Hi,
I am new in Resnet. So, I would like to ask how can I put my data in the code. Furthermore, I need to solve a regression task, so, could you give some information about how I can modify the code in order to do this task.
Thank you.

how to test

how to test?please list detailed codes.thank you,
the code below your page is not detailed, I don't know how to test this resnet .please you could list detailed codes .thank you very much. @wenxinxu

Want to validate once before training.

In the cifar10_main.py, the train() function.
The author comment: Want to validate once before training. You may check the theoretical validation.
What does this mean? Thanks a lot

there are no way to test only?

i successed to train the model in my pc.
but i can't get accuracy of test.

there are no way to test only?
when i checked test(self, test_image_array) method
there are no call.

Batch_Normalization

when I run this ,I encounter a problem,it shows 'InvalidArgumentError:Input to reshape is a tensor
with 8 values ,but the requested shape has 64' ,the problem locates here ' mean, variance = tf.nn.moments(input_layer, axes=[0, 1, 2])' when I change to ' mean, variance = tf.nn.moments(input_layer, axes=[0])' ,it is ok ,but when the 'axes=[0,1,2]' it is wrong ,I dont know why ,can you help me ?

Custom dataset

Hi, I want to apply your resnet on my dataset. I have created the dataset similar Cifar10, Binay format. Can anyone help me to use my dataset instead of cifar10 train data?

value error due to too many values to unpack

I got this error when i run python_train.py
Can anyone please tell me how to resolve these errors?

Traceback (most recent call last):

File "cifar10_train.py", line 426, in Model restored from model_110.ckpt-79999
0 batches finished!
10 batches finished!
20 batches finished!
30 batches finished!
40 batches finished!
50 batches finished!
60 batches finished!
70 batches finished!

top1_error, loss = train.test(test_image_array)

ValueError: too many values to unpack

Working on python 2.7

this code is written on python 2.7, libraries like cPickle is not working on python 3.7

The validation loss is so big?

Hello sir:
I ran the demo on my dataset, but I have many questions. The top1 error is 0 during training, but the validation top1 error is about 0.7. My training dataset has 1300 images and my validation dataset has 400. Thanks!

Whole validation accuracy using provided model_110.ckpt-79999 is extremely low

I try to test accuracy on the whole validation set using the provided ckpt, so I modify cifar10_train.py like below

# Initialize the Train object
train = Train()
# Start the training session
# train.train()

validation_array, validation_labels = read_in_all_images([vali_dir],
                                                         is_random_label=VALI_RANDOM_LABEL)
predictions=train.test(validation_array)
vali_accu=np.mean((np.argmax(predictions,1)==validation_labels.astype(int)).astype(float))
print 'total accu on vali is %f'%vali_accu

But the result is extremely low

total accu on vali is 0.334300

Then I trained my own checkpoint from scratch, and got a much better accuracy

total accu on vali is 0.9161000

Is there anything wrong during my test process?

Resnet_train Error

I have been running inference with small number of images and then training; code only runs for one step and then breaks with following error:

step 0, loss = 1.13 (14.0 examples/sec; 0.642 sec/batch)
Traceback (most recent call last):

File "", line 1, in
runfile('C:/Users/Fariha/Desktop/MS/CervCancer/tensorflow-resnet-master/wth.py', wdir='C:/Users/Fariha/Desktop/MS/CervCancer/tensorflow-resnet-master')

File "C:\Users\Fariha\Anaconda3\envs\py36\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
execfile(filename, namespace)

File "C:\Users\Fariha\Anaconda3\envs\py36\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/Fariha/Desktop/MS/CervCancer/tensorflow-resnet-master/wth.py", line 76, in
image_tensor = sess.run(error)

File "C:\Users\Fariha\Anaconda3\envs\py36\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)

File "C:\Users\Fariha\Anaconda3\envs\py36\lib\site-packages\tensorflow\python\client\session.py", line 984, in _run
self._graph, fetches, feed_dict_string, feed_handles=feed_handles)

File "C:\Users\Fariha\Anaconda3\envs\py36\lib\site-packages\tensorflow\python\client\session.py", line 410, in init
self._fetch_mapper = _FetchMapper.for_fetch(fetches)

File "C:\Users\Fariha\Anaconda3\envs\py36\lib\site-packages\tensorflow\python\client\session.py", line 227, in for_fetch
(fetch, type(fetch)))

TypeError: Fetch argument None has invalid type <class 'NoneType'>

INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.CancelledError'>, Run call was cancelled
