tensorflow_multigpu_imagenet's People

Contributors

arashno

tensorflow_multigpu_imagenet's Issues

Using vgg or googlenet results in ValueError

I am trying to train with googlenet and vgg; alexnet and resnet train fine. I am training on ImageNet data with 2 classes. The script fails here:

top1acc = tf.reduce_mean(
    tf.cast(tf.nn.in_top_k(logits, labels, 1), tf.float32))

The error:

ValueError: Tried to convert 'predictions' to a tensor and failed. Error: None values not supported.

Upon inspection, the logits are None in the case of vgg or googlenet.
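
For context, a minimal sanity check one could add before building the accuracy ops (a sketch only, using assumed names such as get_model and args.architecture, not code from this repository) would turn the opaque conversion failure inside in_top_k into an explicit error pointing at the architecture whose inference returned None:

# Hypothetical guard, sketch only: fail early and clearly if the selected
# architecture's model function returned None instead of a logits tensor.
logits = get_model(images, wd, is_training, args)   # assumed repo call
if logits is None:
    raise ValueError(
        "get_model returned None for architecture %r; cannot build "
        "softmax/accuracy ops." % args.architecture)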

Got a “None values not supported” error for densenet121 training

...
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Filling queue with 2000 images before starting to train. This may take some times.
WARNING:tensorflow:From /home/yuanxh/DenseNet121/tensorflow_multigpu_imagenet-master/data_loader.py:162: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
Traceback (most recent call last):
  File "run.py", line 393, in <module>
    main()
  File "run.py", line 366, in main
    do_train(sess, args)
  File "run.py", line 78, in do_train
    train_ops = dnn_model.train_ops()
  File "/home/yuanxh/DenseNet121/tensorflow_multigpu_imagenet-master/architectures/model.py", line 180, in train_ops
    grads, last_grads, batchnorm_updates, cross_entropy_mean, top1acc, topnacc = self.get_grads('/gpu:0')
  File "/home/yuanxh/DenseNet121/tensorflow_multigpu_imagenet-master/architectures/model.py", line 100, in get_grads
    probabilities= tf.nn.softmax(logits, name='output')
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2958, in softmax
    return _softmax(logits, gen_nn_ops.softmax, axis, name)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_ops.py", line 2884, in _softmax
    logits = ops.convert_to_tensor(logits)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1184, in convert_to_tensor
    return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1242, in convert_to_tensor_v2
    as_ref=False)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1297, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 286, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 227, in constant
    allow_broadcast=True)
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py", line 265, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/usr/local/python3/lib/python3.7/site-packages/tensorflow_core/python/framework/tensor_util.py", line 437, in make_tensor_proto
    raise ValueError("None values not supported.")
ValueError: None values not supported.

(tensorflow 1.15.0, python 3.7.5)
I've trained vgg and resnet50 successfully, but I got a problem when I tried densenet121.

Cannot evaluate on DenseNet

When I try to run eval.py, I get this problem:

python eval.py --num_threads 8 --architecture densenet --log_dir densenet_Run-31-07-2018_12-45-51/ --path_prefix /root/ImageNet/val/ --data_info val.txt

Namespace(architecture='densenet', batch_size=512, crop_size=[224, 224], data_info='val.txt', delimiter=' ', depth=50, load_size=[256, 256], log_dir='densenet_Run-31-07-2018_12-45-51/', num_batches=98, num_channels=3, num_classes=1000, num_samples=50000, num_threads=8, path_prefix='/root/ImageNet/val/', save_predictions=None, top_n=5)
Filling queue with 2000 images before starting to train. This may take some times.
Traceback (most recent call last):
  File "eval.py", line 134, in <module>
    main()
  File "eval.py", line 130, in main
    evaluate(args)
  File "eval.py", line 33, in evaluate
    logits = arch.get_model(images, 0.0, False, args)
  File "/root/ImageNet/repositories/tensorflow_multigpu_imagenet/arch.py", line 15, in get_model
    return architectures.densenet.inference(inputs, args.depth, args.num_classes, wd, is_training, transferMode)
  File "/root/ImageNet/repositories/tensorflow_multigpu_imagenet/architectures/densenet.py", line 21, in inference
    return getModel(x, num_output, K, stages, wd, is_training, transfer_mode= transfer_mode)
  File "/root/ImageNet/repositories/tensorflow_multigpu_imagenet/architectures/densenet.py", line 60, in getModel
    x = block(x, stages[0], K, is_training= is_training, wd= wd)
IndexError: list index out of range

How to solve this?
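
Judging from the Namespace line above, eval.py was run with its default depth=50, which densenet.py does not map to any stage configuration, so stages stays an empty list and stages[0] raises the IndexError. Passing e.g. --depth 121 on the command line would avoid it. A hypothetical guard (my own sketch, not repository code) that fails with a clearer message:

# Hypothetical guard, sketch only: validate --depth before indexing stages.
SUPPORTED_STAGES = {
    121: [6, 12, 24, 16],
    169: [6, 12, 32, 32],
    201: [6, 12, 48, 32],
    161: [6, 12, 36, 24],
}

def densenet_stages(depth):
    if depth not in SUPPORTED_STAGES:
        raise ValueError("Unsupported DenseNet depth %d; expected one of %s"
                         % (depth, sorted(SUPPORTED_STAGES)))
    return SUPPORTED_STAGES[depth]

stages = densenet_stages(121)   # OK
# densenet_stages(50)           # raises ValueError instead of a later IndexError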

std::bad_alloc

2018-08-15 12:23:05.209896: epoch 21, step 30, loss = 0.51, Top-1 = 0.78 Top-5 = 1.00 (46.9 examples/sec; 0.682 sec/batch)
2018-08-15 12:23:32.467195: epoch 21, step 40, loss = 0.29, Top-1 = 0.88 Top-5 = 1.00 (46.2 examples/sec; 0.693 sec/batch)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

When evaluating the trained model, the accuracy is not changing, as shown below. What's wrong?

Batch Number: 182 of 196, Top-1 Hit: 353, Top-5 Hit: 880, Loss 27.7487, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0188
Batch Number: 183 of 196, Top-1 Hit: 353, Top-5 Hit: 881, Loss 27.7495, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0187
Batch Number: 184 of 196, Top-1 Hit: 356, Top-5 Hit: 889, Loss 27.7460, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0188
Batch Number: 185 of 196, Top-1 Hit: 357, Top-5 Hit: 894, Loss 27.7505, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0188
Batch Number: 186 of 196, Top-1 Hit: 358, Top-5 Hit: 896, Loss 27.7570, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0187
Batch Number: 187 of 196, Top-1 Hit: 358, Top-5 Hit: 902, Loss 27.7570, Top-1 Accuracy: 0.0074, Top-5 Accuracy: 0.0187
Batch Number: 188 of 196, Top-1 Hit: 360, Top-5 Hit: 907, Loss 27.7517, Top-1 Accuracy: 0.0074, Top-5 Accuracy: 0.0187
Batch Number: 189 of 196, Top-1 Hit: 364, Top-5 Hit: 914, Loss 27.7487, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0188
Batch Number: 190 of 196, Top-1 Hit: 369, Top-5 Hit: 927, Loss 27.7452, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0190
Batch Number: 191 of 196, Top-1 Hit: 371, Top-5 Hit: 932, Loss 27.7508, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0190
Batch Number: 192 of 196, Top-1 Hit: 374, Top-5 Hit: 938, Loss 27.7472, Top-1 Accuracy: 0.0076, Top-5 Accuracy: 0.0190
Batch Number: 193 of 196, Top-1 Hit: 375, Top-5 Hit: 943, Loss 27.7488, Top-1 Accuracy: 0.0076, Top-5 Accuracy: 0.0190
Batch Number: 194 of 196, Top-1 Hit: 375, Top-5 Hit: 947, Loss 27.7512, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0190
Batch Number: 195 of 196, Top-1 Hit: 376, Top-5 Hit: 948, Loss 27.7516, Top-1 Accuracy: 0.0075, Top-5 Accuracy: 0.0190

Evaluating Densenet Model

When I ran my test with your code, I ran into the problem below:

File "/home/ImageNet/Image-Net/architectures/densenet.py", line 58, in getModel
x = common.spatialConvolution(x, 7, 2, 2 * K, wd=wd)
AttributeError: 'module' object has no attribute 'spatialConvolution'

I cannot find anything to solve this problem.
What should I do?

The model never converges

Hi, I followed the readme and trained the model. I chose num_classes = 2 and trained for a very long time, but it never converges. I don't know what to do; please advise. Thank you.

Dense Layers

This is a discussion rather than a bug report.

In training the DenseNet model:
Q1) There is a parameter --depth; if it is set to "--depth 201", does that also mean the network has 201 layers?
If depth=201 and stages = [6, 12, 48, 32], what is the relationship between layers and stages? (See the sketch after this list of questions.)

Q2) From densenet.py:

stages = []
K = 32
if depth == 121:
    stages = [6, 12, 24, 16]
elif depth == 169:
    stages = [6, 12, 32, 32]
elif depth == 201:
    stages = [6, 12, 48, 32]
elif depth == 161:
    stages = [6, 12, 36, 24]
    K = 48

Any idea why K is set to 48 only when depth=161, while the other depths use K=32?

Q3) Does K mean the growth rate?

Q4) From "python train.py --help":
--depth DEPTH The depth of ResNet architecture
Can --depth be applied to the DenseNet architecture too?

Look forward to hearing from you, thanks!
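
As a rough sanity check on Q1 and Q3 (a sketch based on the DenseNet paper's convention, not on this repository's code): K is the growth rate, and in the paper the 161 configuration is simply the wider variant that uses growth rate 48. The named depth counts the weighted layers: the initial 7x7 convolution, two convolutions per dense layer (the 1x1 bottleneck plus the 3x3), one 1x1 convolution per transition block, and the final classifier.

# Sketch only: reconstruct the DenseNet depth from the per-block stages.
def densenet_depth(stages):
    initial_conv = 1                 # the first 7x7 convolution
    dense_convs = 2 * sum(stages)    # 1x1 bottleneck + 3x3 conv per dense layer
    transitions = len(stages) - 1    # one 1x1 conv per transition block
    classifier = 1                   # final fully connected layer
    return initial_conv + dense_convs + transitions + classifier

print(densenet_depth([6, 12, 24, 16]))  # 121
print(densenet_depth([6, 12, 48, 32]))  # 201
print(densenet_depth([6, 12, 36, 24]))  # 161 (the K=48 variant)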

It seems computation does not run on GPU

In line 35 of run.py, it seems the GPUs are not used.

When I run the program in training mode, the GPUs' memory is not used.

I guess that maybe the images and labels are loaded by the CPU and the server's physical memory, and only the backpropagation is calculated on the GPU.

Is that right?

Thx
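
For reference, the usual TF 1.x multi-tower layout keeps the input pipeline on the CPU and pins each tower's forward/backward pass to a GPU, so seeing the data loading happen on the CPU is expected. A minimal sketch of that pattern (assumed names and shapes, not this repository's exact code):

import tensorflow as tf

num_gpus = 2

# Input pipeline stays on the CPU (placeholders stand in for the queue here).
with tf.device('/cpu:0'):
    images = tf.placeholder(tf.float32, [8, 224, 224, 3])
    image_splits = tf.split(images, num_gpus)

tower_logits = []
for i in range(num_gpus):
    # Each tower's compute graph (forward and backward pass) is pinned to one GPU.
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('model', reuse=(i > 0)):
            logits = tf.layers.dense(tf.layers.flatten(image_splits[i]), 1000)
            tower_logits.append(logits)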

Loss does not converge

I trained densenet 201 on ImageNet 2012 for about 200 epochs in total, but the total loss is still about 1.3.
The top-1 and top-5 accuracies are about 0.72 and 0.91, still behind the paper's results.
Is the total loss meant to drop to 0 eventually?
Another problem is that every time I resume training from the previous model, the loss is very high and the accuracy is 0 at the beginning, and they return to normal very quickly. As I understand it, the results should be the same as when the model was saved, right? Did I do anything wrong?

A bug?

Namespace(LR_steps=[19, 30, 44, 53], LR_values=[0.01, 0.005, 0.001, 0.0005, 0.0001], WD_steps=[30], WD_values=[0.0005, 0.0], architecture='resnet', batch_size=128, chunked_batch_size=128, crop_size=[224, 224], data_info='train.txt', delimiter=' ', depth=50, load_size=[256, 256], log_debug_info=False, log_device_placement=False, log_dir='resnet_Run-14-03-2018_16-14-32', max_to_keep=5, num_batches=10010, num_channels=3, num_classes=1000, num_epochs=55, num_gpus=1, num_samples=1281166, num_threads=20, path_prefix='./', retrain_from=None, run_name='Run-14-03-2018_16-14-32', shuffle=True, snapshot_prefix='snapshot', top_n=5, transfer_mode=[0])
Saving everything in resnet_Run-14-03-2018_16-14-32
Filling queue with 2000 images before starting to train. This may take some times.
Traceback (most recent call last):
  File "train.py", line 350, in <module>
    main()
  File "train.py", line 346, in main
    train(args)
  File "train.py", line 130, in train
    logits = arch.get_model(images, wd, is_training, args)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/arch.py", line 12, in get_model
    return architectures.resnet.inference(inputs, args.depth, args.num_classes, wd, is_training, transferMode)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/architectures/resnet.py", line 24, in inference
    return getModel(x, num_output, wd, is_training, num_blocks= num_blocks, bottleneck= bottleneck, transfer_mode= transfer_mode)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/architectures/resnet.py", line 33, in getModel
    x = common.batchNormalization(x, is_training= is_training)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/architectures/common.py", line 39, in batchNormalization
    initializer= tf.zeros_initializer)
  File "/home/liuzhisheng/.workspace/test/.workspace/tensorflow_multigpu_imagenet/architectures/common.py", line 26, in _get_variable
    trainable= trainable)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 988, in get_variable
    custom_getter=custom_getter)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 890, in get_variable
    custom_getter=custom_getter)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 348, in get_variable
    validate_shape=validate_shape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 333, in _true_getter
    caching_device=caching_device, validate_shape=validate_shape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 684, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 226, in __init__
    expected_shape=expected_shape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 303, in _init_from_args
    initial_value(), name="initial_value", dtype=dtype)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 673, in <lambda>
    shape.as_list(), dtype=dtype, partition_info=partition_info)
TypeError: __init__() got multiple values for keyword argument 'dtype'
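
For what it's worth, this traceback looks like the classic symptom of passing the initializer class instead of an instance: from TF 1.0 onward tf.zeros_initializer is a class, so when get_variable later calls it with the shape as a positional argument plus dtype as a keyword, __init__ receives dtype twice. A hedged sketch of the usual fix (assuming common.py really does pass the bare class, as line 39 in the traceback suggests):

import tensorflow as tf

# Sketch only: instantiate the initializer instead of passing the class.
# With the bare class, get_variable ends up calling
# tf.zeros_initializer(shape, dtype=dtype, ...), so __init__ sees 'dtype'
# both positionally (via shape) and as a keyword, hence the TypeError.
beta = tf.get_variable('beta', shape=[64],
                       initializer=tf.zeros_initializer())   # note the ()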

pre-trained model

Thank you for the excellent code that solved my problem.
Can you provide a pre-trained ResNet model on the IMDB-WIKI or ImageNet dataset?

Beginners ask for help

I'm a novice. I'm looking at your code, but I don't know how to run it end to end (for example, which folders need to be created and where to place the datasets). If possible, could you tell me more about it?

AlexNet convolution layer filter counts are different

The numbers of filters in the AlexNet convolution layers are different: 64 instead of 96 in the first layer, and 196 instead of 256 in the second convolution layer. Is there any particular reason for this?

VGG accuracy always stays at 0

I train VGG type D (the default) from scratch, but the loss doesn't decline and the accuracy stays at 0; other models like resnet50 are fine. Has anyone else met the same problem?

Cannot reach the ResNet accuracy reported in the published paper

I used your code to train ResNet34 on ImageNet, but the single-crop accuracy is 0.655, which is much lower than the 0.75 reported in the published paper.

batch_size = 256
epoch_number = 120
num_gpus = 2
lr = tf.train.piecewise_constant(epoch_number, [60, 90, 110], [0.1, 0.01, 0.001, 0.0001], name='LearningRate')
wd = tf.train.piecewise_constant(epoch_number, [60], [0.0001, 0.0001], name='WeightDecay')

And in common.py:

def resnetBlock():
    ...
    with tf.variable_scope('B'):
        x = spatialConvolution(x, 3, 1, filters_out, weight_initializer=conv_weight_initializer, wd=wd)  # "the kernel size should be 3"

For data_loader.py, I added code to preprocess images for multi-scale training.

These are the changes I made.
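
One detail worth double-checking in the schedule above (a hedged sketch, not a claim about the actual training code): tf.train.piecewise_constant picks its value from the current value of its first argument, so that argument has to be a tensor that advances during training, e.g. an epoch counter derived from the global step. If a fixed Python int such as 120 were passed, the learning rate would sit at the last value for the whole run.

import tensorflow as tf

# Sketch only; steps_per_epoch is an assumed value (~1281167 / 256).
global_step = tf.train.get_or_create_global_step()
steps_per_epoch = 5005
epoch = tf.cast(global_step // steps_per_epoch, tf.int32)
lr = tf.train.piecewise_constant(epoch, [60, 90, 110],
                                 [0.1, 0.01, 0.001, 0.0001],
                                 name='LearningRate')
wd = tf.train.piecewise_constant(epoch, [60], [0.0001, 0.0001],
                                 name='WeightDecay')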

Missing control dependency on the batch norm updates

Following the note in the official TensorFlow documentation:

Note: when training, the moving_mean and moving_variance need to be updated. By default, the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)

So, in your code at line 169:

batchnorm_updates_op = tf.group(*batchnorm_updates)
train_op = tf.group(apply_gradient_op, batchnorm_updates_op)

I think it should be:

batchnorm_updates_op = tf.group(*batchnorm_updates)
train_op = tf.group(apply_gradient_op, batchnorm_updates_op)
with tf.control_dependencies(batchnorm_updates_op):
    train_op = tf.group(apply_gradient_op, batchnorm_updates_op)

But it raises an error saying the argument is not iterable. Do you agree that we need to add the line with tf.control_dependencies(batchnorm_updates_op):?
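
For reference, a hedged sketch of how the pattern from the quoted TensorFlow note would look here (assuming batchnorm_updates is the list of update ops and apply_gradient_op already exists, as in the snippet above): tf.control_dependencies expects an iterable of ops, which is why passing the single grouped op raises the "not iterable" error. Pass the list itself, or wrap the grouped op in a list.

import tensorflow as tf

# Sketch only: make the training op depend on the batch norm updates.
with tf.control_dependencies(batchnorm_updates):
    train_op = tf.group(apply_gradient_op)

# Equivalent form, wrapping the grouped op in a list:
batchnorm_updates_op = tf.group(*batchnorm_updates)
with tf.control_dependencies([batchnorm_updates_op]):
    train_op = tf.group(apply_gradient_op)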
