tensorflow_multigpu_imagenet's People

Contributors

arashno

tensorflow_multigpu_imagenet's Issues

Using vgg or googlenet results in ValueError

I am trying to train with googlenet and vgg; alexnet and resnet train fine. I am training on ImageNet data with 2 classes. The script fails here:

top1acc = tf.reduce_mean(
    tf.cast(tf.nn.in_top_k(logits, labels, 1), tf.float32))

The error:

ValueError: Tried to convert 'predictions' to a tensor and failed. Error: None values not supported.

Upon inspection, the logits are None in the case of vgg or googlenet.
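
For context, a minimal sanity check one could add before building the accuracy ops (a sketch only, using assumed names such as get_model and args.architecture, not code from this repository) would turn the opaque conversion failure inside in_top_k into an explicit error pointing at the architecture whose inference returned None:

# Hypothetical guard, sketch only: fail early and clearly if the selected
# architecture's model function returned None instead of a logits tensor.
logits = get_model(images, wd, is_training, args)   # assumed repo call
if logits is None:
    raise ValueError(
        "get_model returned None for architecture %r; cannot build "
        "softmax/accuracy ops." % args.architecture)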

Got a “None values not supported” error for densenet121 training

...
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Filling queue with 2000 images before starting to train. This may take some times.
WARNING:tensorflow:From /home/yuanxh/DenseNet121/tensorflow_multigpu_imagenet-master/data_loader.py:162: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
Traceback (most recent call last):
  File "run.py", line 393, in <module>
    main()
  File "run.py", line 366, in main
    do_train(sess, args)
  File "run.py", line 78, in do_train
    train_ops = dnn_model.train_ops()
  File "/home/yuanxh/DenseNet121/tensorflow_multigpu_imagenet-master/architectures/model.py", line 180, in train_ops
    grads, last_grads, batchnorm_updates, cross_entropy_mean, top1acc, topnacc = self.get_grads('/gpu:0')
  File "/home/yuanxh/DenseNet121/tensorflow_multigpu_imagenet-master/architectures/model.py", line 100, in get_grads
    probabilities= tf.nn.softmax(logits, name='output')
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2958, in softmax
    return _softmax(logits, gen_nn_ops.softmax, axis, name)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2884, in _softmax
    logits = ops.convert_to_tensor(logits)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1184, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1242, in convert_to_tensor_v2
    as_ref=False)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1297, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 286, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
    allow_broadcast=True)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 265, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_util.py", line 437, in make_tensor_proto
    raise ValueError("None values not supported.")
ValueError: None values not supported.

(tensorflow 1.15.0, python 3.7.5)
I've trained vgg and resnet50 successfully, but I got a problem when I tried densenet121.

Cannot evaluate on DenseNet

When I try to run eval.py, I get this problem:

python eval.py --num_threads 8 --architecture densenet --log_dir densenet_Run-31-07-2018_12-45-51/ --path_prefix /root/ImageNet/val/ --data_info val.txt

Namespace(architecture='densenet', batch_size=512, crop_size=[224, 224], data_info='val.txt', delimiter=' ', depth=50, load_size=[256, 256], log_dir='densenet_Run-31-07-2018_12-45-51/', num_batches=98, num_channels=3, num_classes=1000, num_samples=50000, num_threads=8, path_prefix='/root/ImageNet/val/', save_predictions=None, top_n=5)
Filling queue with 2000 images before starting to train. This may take some times.
Traceback (most recent call last):
  File "eval.py", line 134, in <module>
    main()
  File "eval.py", line 130, in main
    evaluate(args)
  File "eval.py", line 33, in evaluate
    logits = arch.get_model(images, 0.0, False, args)
  File "/root/ImageNet/repositories/tensorflow_multigpu_imagenet/arch.py", line 15, in get_model
    return architectures.densenet.inference(inputs, args.depth, args.num_classes, wd, is_training, transferMode)
  File "/root/ImageNet/repositories/tensorflow_multigpu_imagenet/architectures/densenet.py", line 21, in inference
    return getModel(x, num_output, K, stages, wd, is_training, transfer_mode= transfer_mode)
  File "/root/ImageNet/repositories/tensorflow_multigpu_imagenet/architectures/densenet.py", line 60, in getModel
    x = block(x, stages[0], K, is_training= is_training, wd= wd)
IndexError: list index out of range

How to solve this?
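
Judging from the Namespace line above, eval.py was run with its default depth=50, which densenet.py does not map to any stage configuration, so stages stays an empty list and stages[0] raises the IndexError. Passing e.g. --depth 121 on the command line would avoid it. A hypothetical guard (my own sketch, not repository code) that fails with a clearer message:

# Hypothetical guard, sketch only: validate --depth before indexing stages.
SUPPORTED_STAGES = {
    121: [6, 12, 24, 16],
    169: [6, 12, 32, 32],
    201: [6, 12, 48, 32],
    161: [6, 12, 36, 24],
}

def densenet_stages(depth):
    if depth not in SUPPORTED_STAGES:
        raise ValueError("Unsupported DenseNet depth %d; expected one of %s"
                         % (depth, sorted(SUPPORTED_STAGES)))
    return SUPPORTED_STAGES[depth]

stages = densenet_stages(121)   # OK
# densenet_stages(50)           # raises ValueError instead of a later IndexError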

std::bad_alloc

2018-08-15 12:23:05.209896: epoch 21, step 30, loss = 0.51, Top-1 = 0.78 Top-5 = 1.00 (46.9 examples/sec; 0.682 sec/batch)
2018-08-15 12:23:32.467195: epoch 21, step 40, loss = 0.29, Top-1 = 0.88 Top-5 = 1.00 (46.2 examples/sec; 0.693 sec/batch)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

When evaluating the trained model, the accuracy is not changing, as shown below. What's wrong?

Batch Number: 182 of 196, Top-1 Hit: 353, Top-5 Hit: 880, Loss 27.7487, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0188
Batch Number: 183 of 196, Top-1 Hit: 353, Top-5 Hit: 881, Loss 27.7495, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0187
Batch Number: 184 of 196, Top-1 Hit: 356, Top-5 Hit: 889, Loss 27.7460, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0188
Batch Number: 185 of 196, Top-1 Hit: 357, Top-5 Hit: 894, Loss 27.7505, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0188
Batch Number: 186 of 196, Top-1 Hit: 358, Top-5 Hit: 896, Loss 27.7570, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0187
Batch Number: 187 of 196, Top-1 Hit: 358, Top-5 Hit: 902, Loss 27.7570, Top-1 Accuracy: 0.0074, Top-5 Accuracy: 0.0187
Batch Number: 188 of 196, Top-1 Hit: 360, Top-5 Hit: 907, Loss 27.7517, Top-1 Accuracy: 0.0074, Top-5 Accuracy: 0.0187
Batch Number: 189 of 196, Top-1 Hit: 364, Top-5 Hit: 914, Loss 27.7487, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0188
Batch Number: 190 of 196, Top-1 Hit: 369, Top-5 Hit: 927, Loss 27.7452, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0190
Batch Number: 191 of 196, Top-1 Hit: 371, Top-5 Hit: 932, Loss 27.7508, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0190
Batch Number: 192 of 196, Top-1 Hit: 374, Top-5 Hit: 938, Loss 27.7472, Top-1 Accuracy: 0.0076, Top-5 Accuracy: 0.0190
Batch Number: 193 of 196, Top-1 Hit: 375, Top-5 Hit: 943, Loss 27.7488, Top-1 Accuracy: 0.0076, Top-5 Accuracy: 0.0190
Batch Number: 194 of 196, Top-1 Hit: 375, Top-5 Hit: 947, Loss 27.7512, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0190
Batch Number: 195 of 196, Top-1 Hit: 376, Top-5 Hit: 948, Loss 27.7516, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0190

Evaluating Densenet Model

When I ran my test with your code, I ran into the problem below:

File "/home/ImageNet/Image-Net/architectures/densenet.py", line 58, in getModel
x = common.spatialConvolution(x, 7, 2, 2 * K, wd=wd)
AttributeError: 'module' object has no attribute 'spatialConvolution'

I cannot find anything to solve this problem.
What should I do?

The model never converges

Hi, I followed the readme and trained the model. I chose num_classes = 2 and trained for a very long time, but it never converges. I don't know what to do; please advise. Thank you.

Dense Layers

This is a discussion rather than a bug report.

In training the DenseNet model:
Q1) There is a parameter --depth; if it is set to "--depth 201", does that also mean the network has 201 layers?
If depth=201 and stages = [6, 12, 48, 32], what is the relationship between layers and stages? (See the sketch after this list of questions.)

Q2) From densenet.py:

stages = []
K = 32
if depth == 121:
    stages = [6, 12, 24, 16]
elif depth == 169:
    stages = [6, 12, 32, 32]
elif depth == 201:
    stages = [6, 12, 48, 32]
elif depth == 161:
    stages = [6, 12, 36, 24]
    K = 48

Any idea why K is set to 48 only when depth=161, while the other depths use K=32?

Q3) Does K mean the growth rate?

Q4) From "python train.py --help":
--depth DEPTH The depth of ResNet architecture
Can --depth be applied to the DenseNet architecture too?

Look forward to hearing from you, thanks!
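
As a rough sanity check on Q1 and Q3 (a sketch based on the DenseNet paper's convention, not on this repository's code): K is the growth rate, and in the paper the 161 configuration is simply the wider variant that uses growth rate 48. The named depth counts the weighted layers: the initial 7x7 convolution, two convolutions per dense layer (the 1x1 bottleneck plus the 3x3), one 1x1 convolution per transition block, and the final classifier.

# Sketch only: reconstruct the DenseNet depth from the per-block stages.
def densenet_depth(stages):
    initial_conv = 1                 # the first 7x7 convolution
    dense_convs = 2 * sum(stages)    # 1x1 bottleneck + 3x3 conv per dense layer
    transitions = len(stages) - 1    # one 1x1 conv per transition block
    classifier = 1                   # final fully connected layer
    return initial_conv + dense_convs + transitions + classifier

print(densenet_depth([6, 12, 24, 16]))  # 121
print(densenet_depth([6, 12, 48, 32]))  # 201
print(densenet_depth([6, 12, 36, 24]))  # 161 (the K=48 variant)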

It seems computation does not run on GPU

In line 35 of run.py, it seems the GPUs are not used.

When I run the program in training mode, the GPUs' memory is not used.

I guess that maybe the images and labels are loaded by the CPU and the server's physical memory, and only the backpropagation is calculated on the GPU.

Is that right?

Thx
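
For reference, the usual TF 1.x multi-tower layout keeps the input pipeline on the CPU and pins each tower's forward/backward pass to a GPU, so seeing the data loading happen on the CPU is expected. A minimal sketch of that pattern (assumed names and shapes, not this repository's exact code):

import tensorflow as tf

num_gpus = 2

# Input pipeline stays on the CPU (placeholders stand in for the queue here).
with tf.device('/cpu:0'):
    images = tf.placeholder(tf.float32, [8, 224, 224, 3])
    image_splits = tf.split(images, num_gpus)

tower_logits = []
for i in range(num_gpus):
    # Each tower's compute graph (forward and backward pass) is pinned to one GPU.
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('model', reuse=(i > 0)):
            logits = tf.layers.dense(tf.layers.flatten(image_splits[i]), 1000)
            tower_logits.append(logits)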

Loss does not converge

I trained densenet 201 on ImageNet 2012 for about 200 epochs in total, but the total loss is still about 1.3.
The top-1 and top-5 accuracies are about 0.72 and 0.91, still behind the paper's results.
Is the total loss meant to drop to 0 eventually?
Another problem is that every time I resume training from the previous model, the loss is very high and the accuracy is 0 at the beginning, and they return to normal very quickly. As I understand it, the results should be the same as when the model was saved, right? Did I do anything wrong?

A bug?

Namespace(LR_steps=[19, 30, 44, 53], LR_values=[0.01, 0.005, 0.001, 0.0005, 0.0001], WD_steps=[30], WD_values=[0.0005, 0.0], architecture='resnet', batch_size=128, chunked_batch_size=128, crop_size=[224, 224], data_info='train.txt', delimiter=' ', depth=50, load_size=[256, 256], log_debug_info=False, log_device_placement=False, log_dir='resnet_Run-14-03-2018_16-14-32', max_to_keep=5, num_batches=10010, num_channels=3, num_classes=1000, num_epochs=55, num_gpus=1, num_samples=1281166, num_threads=20, path_prefix='./', retrain_from=None, run_name='Run-14-03-2018_16-14-32', shuffle=True, snapshot_prefix='snapshot', top_n=5, transfer_mode=[0])
Saving everything in resnet_Run-14-03-2018_16-14-32
Filling queue with 2000 images before starting to train. This may take some times.
Traceback (most recent call last):
  File "train.py", line 350, in <module>
    main()
  File "train.py", line 346, in main
    train(args)
  File "train.py", line 130, in train
    logits = arch.get_model(images, wd, is_training, args)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/arch.py", line 12, in get_model
    return architectures.resnet.inference(inputs, args.depth, args.num_classes, wd, is_training, transferMode)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/architectures/resnet.py", line 24, in inference
    return getModel(x, num_output, wd, is_training, num_blocks= num_blocks, bottleneck= bottleneck, transfer_mode= transfer_mode)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/architectures/resnet.py", line 33, in getModel
    x = common.batchNormalization(x, is_training= is_training)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/architectures/common.py", line 39, in batchNormalization
    initializer= tf.zeros_initializer)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/architectures/common.py", line 26, in _get_variable
    trainable= trainable)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 988, in get_variable
    custom_getter=custom_getter)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 890, in get_variable
    custom_getter=custom_getter)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 348, in get_variable
    validate_shape=validate_shape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 333, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 684, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 226, in __init__
    expected_shape=expected_shape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 303, in _init_from_args
    initial_value(), name="initial_value", dtype=dtype)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 673, in <lambda>
    shape.as_list(), dtype=dtype, partition_info=partition_info)
TypeError: __init__() got multiple values for keyword argument 'dtype'
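
For what it's worth, this traceback looks like the classic symptom of passing the initializer class instead of an instance: from TF 1.0 onward tf.zeros_initializer is a class, so when get_variable later calls it with the shape as a positional argument plus dtype as a keyword, __init__ receives dtype twice. A hedged sketch of the usual fix (assuming common.py really does pass the bare class, as line 39 in the traceback suggests):

import tensorflow as tf

# Sketch only: instantiate the initializer instead of passing the class.
# With the bare class, get_variable ends up calling
# tf.zeros_initializer(shape, dtype=dtype, ...), so __init__ sees 'dtype'
# both positionally (via shape) and as a keyword, hence the TypeError.
beta = tf.get_variable('beta', shape=[64],
                       initializer=tf.zeros_initializer())   # note the ()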

pre-trained model

Thank you for the excellent code that solved my problem.
Can you provide a pre-trained ResNet model on the IMDB-WIKI or ImageNet dataset?

Beginners ask for help

I'm a novice. I'm looking at your code, but I don't know how to run it end to end (for example, which folders need to be created and where to place the datasets). If possible, could you tell me more about it?

AlexNet convolution layer filter counts are different

The numbers of filters in the AlexNet convolution layers are different: 64 instead of 96 in the first layer, and 196 instead of 256 in the second convolution layer. Is there any particular reason for this?

VGG accuracy always stays at 0

I train VGG type D (the default) from scratch, but the loss doesn't decline and the accuracy stays at 0; other models like resnet50 are fine. Has anyone else met the same problem?

Cannot reach the ResNet accuracy reported in the published paper

I used your code to train ResNet34 on ImageNet, but the single-crop accuracy is 0.655, which is much lower than the 0.75 reported in the published paper.

batch_size = 256
epoch_number = 120
num_gpus = 2
lr = tf.train.piecewise_constant(epoch_number, [60, 90, 110], [0.1, 0.01, 0.001, 0.0001], name='LearningRate')
wd = tf.train.piecewise_constant(epoch_number, [60], [0.0001, 0.0001], name='WeightDecay')

And in common.py:

def resnetBlock():
    ...
    with tf.variable_scope('B'):
        x = spatialConvolution(x, 3, 1, filters_out, weight_initializer=conv_weight_initializer, wd=wd)  # "the kernel size should be 3"

For data_loader.py, I added code to preprocess images for multi-scale training.

These are the changes I made.
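
One detail worth double-checking in the schedule above (a hedged sketch, not a claim about the actual training code): tf.train.piecewise_constant picks its value from the current value of its first argument, so that argument has to be a tensor that advances during training, e.g. an epoch counter derived from the global step. If a fixed Python int such as 120 were passed, the learning rate would sit at the last value for the whole run.

import tensorflow as tf

# Sketch only; steps_per_epoch is an assumed value (~1281167 / 256).
global_step = tf.train.get_or_create_global_step()
steps_per_epoch = 5005
epoch = tf.cast(global_step // steps_per_epoch, tf.int32)
lr = tf.train.piecewise_constant(epoch, [60, 90, 110],
                                 [0.1, 0.01, 0.001, 0.0001],
                                 name='LearningRate')
wd = tf.train.piecewise_constant(epoch, [60], [0.0001, 0.0001],
                                 name='WeightDecay')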

Missing control dependency on the batch norm updates

Following the note in the official TensorFlow documentation:

Note: when training, the moving_mean and moving_variance need to be updated. By default, the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)

So, in your code at line 169:

batchnorm_updates_op = tf.group(*batchnorm_updates)
train_op = tf.group(apply_gradient_op, batchnorm_updates_op)

I think it should be:

batchnorm_updates_op = tf.group(*batchnorm_updates)
train_op = tf.group(apply_gradient_op, batchnorm_updates_op)
with tf.control_dependencies(batchnorm_updates_op):
    train_op = tf.group(apply_gradient_op, batchnorm_updates_op)

But it raises an error saying the argument is not iterable. Do you agree that we need to add the line with tf.control_dependencies(batchnorm_updates_op):?
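
For reference, a hedged sketch of how the pattern from the quoted TensorFlow note would look here (assuming batchnorm_updates is the list of update ops and apply_gradient_op already exists, as in the snippet above): tf.control_dependencies expects an iterable of ops, which is why passing the single grouped op raises the "not iterable" error. Pass the list itself, or wrap the grouped op in a list.

import tensorflow as tf

# Sketch only: make the training op depend on the batch norm updates.
with tf.control_dependencies(batchnorm_updates):
    train_op = tf.group(apply_gradient_op)

# Equivalent form, wrapping the grouped op in a list:
batchnorm_updates_op = tf.group(*batchnorm_updates)
with tf.control_dependencies([batchnorm_updates_op]):
    train_op = tf.group(apply_gradient_op)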
