
tensorflow-deeplab-resnet's People

Contributors

arslan-chaudhry, drsleep


tensorflow-deeplab-resnet's Issues

What are the GPU memory requirements?

My GPU is a TITAN X (Pascal) with 12 GiB.
When I train the model using 'python train.py', there is a warning: "Ran out of memory trying to allocate 3.19GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available."
Does this mean the required GPU memory is around 15 GiB?
Even though it ran out of memory, training still runs, and the performance does not seem to be affected.

Have you encountered this kind of problem?

cv@cv:~/tf/tf-deeplab-resnet$ python train.py
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:02:00.0
Total memory: 11.90GiB
Free memory: 11.61GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x54536c0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:03:00.0
Total memory: 11.90GiB
Free memory: 11.76GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x54574e0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:82:00.0
Total memory: 11.90GiB
Free memory: 11.76GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x545b300
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties:
name: TITAN X (Pascal)
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:83:00.0
Total memory: 11.90GiB
Free memory: 11.76GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: N N Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: TITAN X (Pascal), pci bus id: 0000:03:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: TITAN X (Pascal), pci bus id: 0000:82:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: TITAN X (Pascal), pci bus id: 0000:83:00.0)
Restored model parameters from /home/cv/tf/tf-deeplab-resnet/deeplab_resnet_init.ckpt
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.19GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.19GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
The checkpoint has been created.
step 0 loss = 4.333, (12.162 sec/step)
step 1 loss = 3.095, (1.228 sec/step)
step 2 loss = 3.916, (0.890 sec/step)
step 3 loss = 3.019, (0.836 sec/step)
step 4 loss = 3.412, (0.824 sec/step)
step 5 loss = 2.240, (0.909 sec/step)
step 6 loss = 2.528, (0.962 sec/step)
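
For what it's worth, the "Ran out of memory ... this is not a failure" message comes from TensorFlow's BFC allocator and is usually harmless: the allocator simply retries with smaller chunks. If you want to keep TensorFlow from creating contexts on all four GPUs, you can restrict it to one card with CUDA_VISIBLE_DEVICES=0 and, optionally, tune the session's GPU options. A minimal sketch of those options (TF 0.x/1.x API); the actual session setup in train.py may differ:

import tensorflow as tf

# Sketch only: illustrates ConfigProto GPU options, not the repo's exact session setup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                      # allocate GPU memory on demand
# config.gpu_options.per_process_gpu_memory_fraction = 0.8  # or cap the memory fraction
sess = tf.Session(config=config)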

W tensorflow/core/framework/op_kernel.cc:968]

I have downloaded VOC2012 from here
and I ran this script:

python train.py --random-scale

but I encountered the following error:

XXX@imcl-test:/data/XXX/deeplab/tensorflow-deeplab-resnet$ python train.py --random-scale
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:06:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:06:00.0)
Restored model parameters from ./deeplab_resnet.ckpt



#### this is the error


W tensorflow/core/framework/op_kernel.cc:968] Not found: ./VOCdevkit/JPEGImages/2009_004475.jpg
W tensorflow/core/framework/op_kernel.cc:968] Not found: ./VOCdevkit/SegmentationClassAug/2009_004475.png
W tensorflow/core/framework/op_kernel.cc:968] Out of range: FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 4, current size 0)
	 [[Node: create_inputs/batch = QueueDequeueMany[_class=["loc:@create_inputs/batch/fifo_queue"], component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]
W tensorflow/core/framework/op_kernel.cc:968] Out of range: FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 4, current size 0)
	 [[Node: create_inputs/batch = QueueDequeueMany[_class=["loc:@create_inputs/batch/fifo_queue"], component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]
Traceback (most recent call last):
  File "train.py", line 192, in <module>
    main()
  File "train.py", line 181, in main
    loss_value, images, labels, preds, summary, _ = sess.run([reduced_loss, image_batch, label_batch, pred, total_summary, optim])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 717, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 915, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 985, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.OutOfRangeError: FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 4, current size 0)
	 [[Node: create_inputs/batch = QueueDequeueMany[_class=["loc:@create_inputs/batch/fifo_queue"], component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]

Caused by op u'create_inputs/batch', defined at:
  File "train.py", line 192, in <module>
    main()
  File "train.py", line 106, in main
    image_batch, label_batch = reader.dequeue(args.batch_size)
  File "/data/weigq/deeplab/tensorflow-deeplab-resnet/deeplab_resnet/image_reader.py", line 103, in dequeue
    num_elements)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/input.py", line 597, in batch
    dequeued = queue.dequeue_many(batch_size, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/data_flow_ops.py", line 458, in dequeue_many
    self._queue_ref, n=n, component_types=self._dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 905, in _queue_dequeue_many
    timeout_ms=timeout_ms, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 749, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2380, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1298, in __init__
    self._traceback = _extract_stack()

OutOfRangeError (see above for traceback): FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 4, current size 0)
	 [[Node: create_inputs/batch = QueueDequeueMany[_class=["loc:@create_inputs/batch/fifo_queue"], component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]


Can anyone help me?
Thanks! T_T
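
The two "Not found" warnings are the real cause here: the queue closes because the reader cannot open ./VOCdevkit/JPEGImages/2009_004475.jpg and the matching SegmentationClassAug PNG, and the OutOfRangeError is just the consequence. A quick sanity check one could run before training (a sketch, assuming the list file uses the repo's "<image path> <label path>" per-line format and that --data-dir/--data-list have the hypothetical values below):

import os

data_dir = './VOCdevkit'             # hypothetical value of --data-dir
data_list = './dataset/train.txt'    # hypothetical value of --data-list

with open(data_list) as f:
    for line in f:
        for rel_path in line.strip().split():
            full = data_dir + rel_path if rel_path.startswith('/') else os.path.join(data_dir, rel_path)
            if not os.path.isfile(full):
                print('Missing:', full)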

Restore called with invalid save path

I'm new here.
I just followed the steps and all the requirements are met. When I run:
python train.py --random-scale
I get this:

xxx@imcl-test:/data/xxx/deeplab/tensorflow-deeplab-resnet$ python train.py --random-scale
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: Tesla K40m
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:06:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:06:00.0)
Traceback (most recent call last):
  File "train.py", line 189, in <module>
    main()
  File "train.py", line 168, in main
    load(loader, sess, args.restore_from)
  File "train.py", line 85, in load
    saver.restore(sess, ckpt_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1342, in restore
    "File path is: %r" % (save_path, file_path))
ValueError: Restore called with invalid save path: './deeplab_resnet.ckpt'. File path is: './deeplab_resnet.ckpt'

What should I do?
Thanks!
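
The message means the file passed via --restore-from does not exist at that path; the pre-trained deeplab_resnet.ckpt has to be downloaded separately and placed where the flag points. A trivial check one could add before saver.restore (an illustration only, not repo code; it assumes the single-file checkpoint format that deeplab_resnet.ckpt ships in):

import os

ckpt_path = './deeplab_resnet.ckpt'   # value passed via --restore-from
if not os.path.isfile(ckpt_path):
    raise IOError('Checkpoint %s not found: download deeplab_resnet.ckpt first, '
                  'or point --restore-from at its actual location.' % ckpt_path)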

Continuing training of the DeepLab-ResNet model: the loss function diverges

I use the PASCAL VOC 2012 augmented dataset; here is the loss:

step 500 loss = 0.221, (20.800 sec/step)
step 501 loss = 0.115, (0.445 sec/step)
step 502 loss = 0.115, (0.425 sec/step)
step 503 loss = 0.204, (0.420 sec/step)
step 504 loss = 0.104, (0.418 sec/step)
step 505 loss = 0.098, (0.418 sec/step)
step 506 loss = 0.219, (0.417 sec/step)
step 507 loss = 0.128, (0.419 sec/step)
step 508 loss = 0.129, (0.419 sec/step)
step 509 loss = 0.149, (0.421 sec/step)
step 510 loss = 0.117, (0.418 sec/step)
step 511 loss = 0.091, (0.489 sec/step)
step 512 loss = 0.341, (0.546 sec/step)
step 513 loss = 0.109, (0.517 sec/step)
step 514 loss = 0.176, (0.517 sec/step)
step 515 loss = 0.181, (0.559 sec/step)
step 516 loss = 0.140, (0.524 sec/step)
step 517 loss = 0.117, (0.538 sec/step)
step 518 loss = 0.111, (0.519 sec/step)
step 519 loss = 0.213, (0.516 sec/step)
step 520 loss = 0.192, (0.518 sec/step)
step 521 loss = 0.424, (0.509 sec/step)
step 522 loss = 0.096, (0.492 sec/step)
step 523 loss = 0.169, (0.537 sec/step)
step 524 loss = 0.171, (0.493 sec/step)
step 525 loss = 0.158, (0.536 sec/step)
step 526 loss = 0.150, (0.506 sec/step)
step 527 loss = 0.201, (0.549 sec/step)
step 528 loss = 0.146, (0.511 sec/step)
step 529 loss = 0.134, (0.507 sec/step)
step 530 loss = 0.162, (0.496 sec/step)
step 531 loss = 0.234, (0.524 sec/step)
step 532 loss = 0.228, (0.572 sec/step)
step 533 loss = 0.115, (0.529 sec/step)
step 534 loss = 0.174, (0.522 sec/step)
step 535 loss = 0.149, (0.504 sec/step)
step 536 loss = 0.206, (0.525 sec/step)
step 537 loss = 0.221, (0.542 sec/step)
step 538 loss = 0.093, (0.513 sec/step)
step 539 loss = 0.251, (0.565 sec/step)
step 540 loss = 0.084, (0.744 sec/step)
step 541 loss = 0.086, (0.534 sec/step)
step 542 loss = 0.175, (0.560 sec/step)
step 543 loss = 0.058, (0.538 sec/step)
step 544 loss = 0.214, (0.520 sec/step)
step 545 loss = 0.124, (0.498 sec/step)
step 546 loss = 0.097, (0.535 sec/step)
step 547 loss = 0.172, (0.547 sec/step)
step 548 loss = 0.234, (0.549 sec/step)
step 549 loss = 0.186, (0.518 sec/step)
step 550 loss = 0.262, (0.518 sec/step)
step 551 loss = 0.132, (0.522 sec/step)
step 552 loss = 0.156, (0.502 sec/step)
step 553 loss = 0.066, (0.534 sec/step)
step 554 loss = 0.155, (0.573 sec/step)
step 555 loss = 0.145, (0.530 sec/step)
step 556 loss = 0.225, (0.513 sec/step)
step 557 loss = 0.136, (0.519 sec/step)
step 558 loss = 0.223, (0.499 sec/step)
step 559 loss = 0.109, (0.532 sec/step)
step 560 loss = 0.133, (0.523 sec/step)
step 561 loss = 0.107, (0.534 sec/step)
step 562 loss = 0.247, (0.523 sec/step)
step 563 loss = 0.084, (0.528 sec/step)
step 564 loss = 0.172, (0.508 sec/step)
step 565 loss = 0.125, (0.504 sec/step)
step 566 loss = 0.244, (0.531 sec/step)
step 567 loss = 0.144, (0.536 sec/step)
step 568 loss = 0.110, (0.562 sec/step)
step 569 loss = 0.111, (0.513 sec/step)
step 570 loss = 6.620, (0.511 sec/step)
step 571 loss = 2.842, (0.539 sec/step)
step 572 loss = 5.740, (0.515 sec/step)
step 573 loss = 4.940, (0.513 sec/step)
step 574 loss = 4.994, (0.535 sec/step)
step 575 loss = 2.670, (0.522 sec/step)
step 576 loss = 2.900, (0.537 sec/step)
step 577 loss = 2.638, (0.513 sec/step)
step 578 loss = 3.622, (0.529 sec/step)
step 579 loss = 2.524, (0.531 sec/step)
step 580 loss = 2.554, (0.586 sec/step)
step 581 loss = 3.612, (0.525 sec/step)
step 582 loss = 2.171, (0.512 sec/step)
step 583 loss = 3.751, (0.548 sec/step)
step 584 loss = 2.245, (0.506 sec/step)
step 585 loss = 3.306, (0.536 sec/step)
step 586 loss = 1.847, (0.526 sec/step)
step 587 loss = 4.644, (0.556 sec/step)
step 588 loss = 2.619, (0.513 sec/step)
step 589 loss = 3.449, (0.534 sec/step)
step 590 loss = 2.106, (0.528 sec/step)
step 591 loss = 1.349, (0.519 sec/step)
step 592 loss = 1.875, (0.557 sec/step)
step 593 loss = 2.142, (0.527 sec/step)
step 594 loss = 1.511, (0.529 sec/step)
step 595 loss = 2.507, (0.537 sec/step)
step 596 loss = 3.175, (0.531 sec/step)
step 597 loss = 2.369, (0.536 sec/step)
step 598 loss = 1.895, (0.521 sec/step)
step 599 loss = 2.188, (0.533 sec/step)
The checkpoint has been created.
step 600 loss = 0.994, (21.880 sec/step)
step 601 loss = 2.987, (0.433 sec/step)
step 602 loss = 2.304, (0.417 sec/step)
step 603 loss = 2.919, (0.418 sec/step)
step 604 loss = 2.105, (0.416 sec/step)
step 605 loss = 1.149, (0.422 sec/step)
step 606 loss = 1.857, (0.416 sec/step)
step 607 loss = 2.461, (0.415 sec/step)
step 608 loss = 1.290, (0.415 sec/step)
step 609 loss = 1.688, (0.416 sec/step)
step 610 loss = 2.122, (0.418 sec/step)
step 611 loss = 2.046, (0.447 sec/step)
step 612 loss = 2.429, (0.523 sec/step)
step 613 loss = 2.406, (0.520 sec/step)

Restore partial weights after changing the number of input channels

I want to add one additional channel (a grey-level image), so the number of input channels changes from 3 to 4.
I want to restore the weights for the original 3 channels.
How can I do that?

I tried not restoring the 'conv1' weights, but then the model can't converge.
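
One possible approach (a rough sketch under my own assumptions, not something tested against this repo): restore everything except conv1 as usual, then build the new 4-channel conv1 kernel by copying the pretrained RGB filters into its first three input channels and initialising the extra channel near zero. The variable name 'conv1/weights' and the kernel layout are assumptions; the pretrained kernel could be read with tf.train.NewCheckpointReader.

import numpy as np
import tensorflow as tf

def expand_conv1(sess, conv1_var, pretrained_rgb_kernel):
    """conv1_var: the new [k, k, 4, out] variable; pretrained_rgb_kernel: numpy array [k, k, 3, out]."""
    k, _, _, out = pretrained_rgb_kernel.shape
    new_kernel = np.zeros((k, k, 4, out), dtype=np.float32)
    new_kernel[:, :, :3, :] = pretrained_rgb_kernel                 # keep the pretrained RGB filters
    new_kernel[:, :, 3:, :] = 1e-3 * np.random.randn(k, k, 1, out)  # small noise for the 4th channel
    sess.run(tf.assign(conv1_var, new_kernel))

# Hypothetical usage: read the 3-channel kernel from the checkpoint, then expand it.
# reader = tf.train.NewCheckpointReader('./deeplab_resnet.ckpt')
# expand_conv1(sess, conv1_var, reader.get_tensor('conv1/weights'))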

Mean subtraction after cropping

There is a bug in the image transformation. The mean subtraction must happen before the zero padding performed by tf.image.resize_image_with_crop_or_pad.

I'm trying to squeeze the functions

img = tf.random_crop(img, [h, w, 3])
label = tf.random_crop(label, [h, w, 1])

in there as well to get a random crop instead of a center crop, but I'm not sure whether they take the same crop from the image and the label if I do it like this... so I keep looking into that.
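
For reference, one common trick (a sketch using TF 1.x-style tf.concat; on 0.x the argument order differs) is to stack the image and label along the channel axis, crop once, and split them again, so both get exactly the same random crop:

import tensorflow as tf

def random_crop_pair(img, label, h, w):
    label = tf.cast(label, tf.float32)               # match dtypes for the concat
    combined = tf.concat([img, label], axis=2)       # [H, W, 3 + 1]
    combined = tf.random_crop(combined, [h, w, 4])   # one crop applied to both tensors
    img_crop = combined[:, :, :3]
    label_crop = tf.cast(combined[:, :, 3:], tf.uint8)
    return img_crop, label_crop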

Does this project shuffle the data list during training?

It seems that shuffling is applied during training but not during testing.

self.queue = tf.train.slice_input_producer([self.images, self.labels],
                                           shuffle=input_size is not None)  # not shuffling if it is val

However, when I train the model several times, the losses are exactly the same, which suggests there is no shuffling during training.
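
One way to check this independently of the training loss (just an illustrative experiment, not repo code): feed a small list of strings to tf.train.slice_input_producer with shuffle=True and no fixed seed, and see whether the dequeue order changes between runs.

import tensorflow as tf

paths = tf.constant(['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'])
queue = tf.train.slice_input_producer([paths], shuffle=True, seed=None, num_epochs=1)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())   # num_epochs uses a local counter variable
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while True:
            print(sess.run(queue[0]))            # order should differ from run to run
    except tf.errors.OutOfRangeError:
        pass
    finally:
        coord.request_stop()
        coord.join(threads)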

Failing to run inference with a saved checkpoint

I modified the train.py code and trained on my own data. After all 200001 steps, I have a checkpoint and many saved models in the 'snapshots' folder. I try to run inference on a single image with the saved model. My command line is:

python inference.py /home/username/testing_data/1.jpeg /home/username/tf_deeplab_resnet/snapshots/checkpoint

However, the error below keeps coming up:

W tensorflow/core/framework/op_kernel.cc:975] Data loss: Unable to open table file /home/zhaohe/tf_deeplab_resnet/snapshots/checkpoint: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
Traceback (most recent call last):
  File "inference.py", line 100, in <module>
    main()
  File "inference.py", line 85, in main
    load(loader, sess, args.model_weights)
  File "inference.py", line 47, in load
    saver.restore(sess, ckpt_path)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1388, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file /home/zhaohe/tf_deeplab_resnet/snapshots/checkpoint: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
	 [[Node: save/RestoreV2_378 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_378/tensor_names, save/RestoreV2_378/shape_and_slices)]]
	 [[Node: save/RestoreV2_378/_99 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3540_save/RestoreV2_378", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Caused by op u'save/RestoreV2_378', defined at:
  File "inference.py", line 100, in <module>
    main()
  File "inference.py", line 84, in main
    loader = tf.train.Saver(var_list=restore_var)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1000, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1030, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 624, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 361, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 200, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 441, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
    self._traceback = _extract_stack()

DataLossError (see above for traceback): Unable to open table file /home/zhaohe/tf_deeplab_resnet/snapshots/checkpoint: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
	 [[Node: save/RestoreV2_378 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_378/tensor_names, save/RestoreV2_378/shape_and_slices)]]
	 [[Node: save/RestoreV2_378/_99 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3540_save/RestoreV2_378", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Is there any suggestion on what I can do?

Best
He Zhao
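
The 'checkpoint' file inside snapshots/ is only a small text index written by the Saver, which is why restoring from it fails with "not an sstable". The restore should point at the actual model.ckpt-XXXXX prefix instead, which can be resolved like this (paths below are simply the ones from this issue, used as an example):

import tensorflow as tf

snapshot_dir = '/home/username/tf_deeplab_resnet/snapshots'
ckpt_prefix = tf.train.latest_checkpoint(snapshot_dir)   # e.g. the newest model.ckpt-<step> prefix
print(ckpt_prefix)

# Then pass that prefix to inference.py instead of the 'checkpoint' file:
#   python inference.py /home/username/testing_data/1.jpeg <printed prefix>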

About evaluate.py

After training my model and getting the snapshots, I want to evaluate the final model.
However, it doesn't work. I ran into this issue: http://stackoverflow.com/questions/41048819/how-to-restore-a-model-by-filename-in-tensorflow-r12

I fixed it by using the answer:

import tensorflow as tf
from tensorflow.core.protobuf import saver_pb2
...
saver = tf.train.Saver(write_version = saver_pb2.SaverDef.V1)
saver.save(sess, './model.ckpt', global_step = step)

Perhaps evaluate.py doesn't handle the new saver format.

Could you fix it?

Thanks

Error when training on ADE20K

I want to train a model on ADE20K, so I modified 'DATA_DIRECTORY' and 'DATA_LIST_PATH' in 'train.py' but didn't modify anything else. I know there are 150 classes in ADE20K, but I just wanted to have a quick try first. I ran into a weird problem, shown below, saying the image cannot be found. But I'm sure the image is there. Can anybody give me a hand?
W tensorflow/core/framework/op_kernel.cc:993] Not found: /Storage/zhixy/ADE20k/annotations/training/ADE_train_00009635.png
W tensorflow/core/framework/op_kernel.cc:993] Not found: /Storage/zhixy/ADE20k/annotations/training/ADE_train_00009635.png
[[Node: create_inputs/ReadFile_1 = ReadFile_device="/job:localhost/replica:0/task:0/cpu:0"]]
W tensorflow/core/framework/op_kernel.cc:993] Not found: /Storage/zhixy/ADE20k/annotations/training/ADE_train_00009635.png
[[Node: create_inputs/ReadFile_1 = ReadFile_device="/job:localhost/replica:0/task:0/cpu:0"]]
W tensorflow/core/framework/op_kernel.cc:993] Not found: /Storage/zhixy/ADE20k/annotations/training/ADE_train_00009635.png
[[Node: create_inputs/ReadFile_1 = ReadFile_device="/job:localhost/replica:0/task:0/cpu:0"]]
Traceback (most recent call last):
  File "train.py", line 244, in <module>
    main()
  File "train.py", line 233, in main
    loss_value, images, labels, preds, summary, _ = sess.run([reduced_loss, image_batch, label_batch, pred, total_summary, train_op], feed_dict=feed_dict)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 16, current size 0)
	 [[Node: create_inputs/batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]
	 [[Node: Gather/_1075 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5339_Gather", tensor_type=DT_UINT8, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Caused by op u'create_inputs/batch', defined at:
  File "train.py", line 244, in <module>
    main()
  File "train.py", line 132, in main
    image_batch, label_batch = reader.dequeue(args.batch_size)
  File "/Storage/zhixy/speedinghzl/tensorflow-deeplab-resnet/deeplab_resnet/image_reader.py", line 180, in dequeue
    num_elements)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 872, in batch
    name=name)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 667, in _batch
    dequeued = queue.dequeue_many(batch_size, name=name)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 458, in dequeue_many
    self._queue_ref, n=n, component_types=self._dtypes, name=name)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 1310, in _queue_dequeue_many_v2
    timeout_ms=timeout_ms, name=name)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/zhixy/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()

OutOfRangeError (see above for traceback): FIFOQueue '_1_create_inputs/batch/fifo_queue' is closed and has insufficient elements (requested 16, current size 0)
	 [[Node: create_inputs/batch = QueueDequeueManyV2[component_types=[DT_FLOAT, DT_UINT8], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](create_inputs/batch/fifo_queue, create_inputs/batch/n)]]
	 [[Node: Gather/_1075 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_5339_Gather", tensor_type=DT_UINT8, _device="/job:localhost/replica:0/task:0/gpu:0"]]

COCO dataset pretraining

Recently, I have been trying to add the COCO dataset to this framework, which is a common trick used by many state-of-the-art semantic segmentation methods. I will try to add the code as soon as possible.

error

I ran into the following error:

Traceback (most recent call last):
  File "train.py", line 187, in <module>
    main()
  File "train.py", line 106, in main
    image_batch, label_batch = reader.dequeue(args.batch_size)
  File "/data6/weigq/deeplab/tensorflow-deeplab-resnet/deeplab_resnet/image_reader.py", line 121, in dequeue
    image_batch, label_batch = tf.train.batch([self.image, self.label], num_elements)
  File "/home/liusen/local/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 586, in batch
    capacity=capacity, dtypes=types, shapes=shapes, shared_name=shared_name)
  File "/home/liusen/local/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 638, in __init__
    shapes = _as_shape_list(shapes, dtypes)
  File "/home/liusen/local/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 72, in _as_shape_list
    raise ValueError("All shapes must be fully defined: %s" % shapes)
ValueError: All shapes must be fully defined: [TensorShape([Dimension(321), Dimension(321), Dimension(3)]), TensorShape([Dimension(321), Dimension(321), Dimension(None)])]

What is the matter?
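
The error says tf.train.batch is given a label tensor whose channel dimension is statically unknown (Dimension(None)), and batching requires fully defined shapes. A small sketch of the situation and the usual workaround, pinning the static shape with set_shape before batching (illustrative only; in the repo this would go into image_reader.py where the label is produced):

import tensorflow as tf

label = tf.placeholder(tf.uint8, shape=[321, 321, None])   # stand-in for the decoded label
try:
    tf.train.batch([label], batch_size=4)                  # raises: shapes must be fully defined
except ValueError as e:
    print(e)

label.set_shape([321, 321, 1])                             # pin the channel dimension
label_batch = tf.train.batch([label], batch_size=4)        # now builds fine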

question for "evaluate.py"

Hi, @DrSleep

I had run the evaluation:
$ python evaluate.py --data-dir /data/VOC_dataset/voc2012_trainval/JPEGImages --data-list ./dataset/val.txt

However, using the provided model deeplab_resnet.ckpt, the mean IoU score of 0.047 is very low.
Could you tell me what's wrong with that? Thanks!

Here is the output:

(python_2.7_tf_0.12) root@milton-OptiPlex-9010:/data/code/tensorflow-deeplab-resnet# python evaluate.py --data-dir /data/VOC_dataset/voc2012_trainval/JPEGImages --data-list ./dataset/val.txt 
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX TITAN
major: 3 minor: 5 memoryClockRate (GHz) 0.8755
pciBusID 0000:01:00.0
Total memory: 5.94GiB
Free memory: 5.56GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN, pci bus id: 0000:01:00.0)
Restored model parameters from ./deeplab_resnet.ckpt
step 0
step 100
step 200
step 300
step 400
step 500
step 600
step 700
step 800
step 900
step 1000
step 1100
step 1200
step 1300
step 1400
Mean IoU: 0.047
(python_2.7_tf_0.12) root@milton-OptiPlex-9010:/data/code/tensorflow-deeplab-resnet# 

TypeError: split() got an unexpected keyword argument 'split_dim'

When trying to run inference.py, I get the following error:

ubuntu@ip-Address:~/tensorflow-deeplab-resnet$ python inference.py --save_dir ./output/ 2718714.jpg deeplab_resnet.ckpt
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
Traceback (most recent call last):
  File "inference.py", line 102, in <module>
    main()
  File "inference.py", line 60, in main
    img_r, img_g, img_b = tf.split(split_dim=2, num_split=3, value=img)
TypeError: split() got an unexpected keyword argument 'split_dim'
ubuntu@ip-Address:~/tensorflow-deeplab-resnet$

I am unsure of how to resolve it.
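
This is the tf.split API change introduced in TensorFlow 1.0: the keyword arguments split_dim and num_split were replaced by axis and num_or_size_splits (and the argument order changed). A sketch of the 1.0-style call that corresponds to the failing line:

import tensorflow as tf

img = tf.placeholder(tf.float32, shape=[321, 321, 3])
# TF >= 1.0:
img_r, img_g, img_b = tf.split(img, num_or_size_splits=3, axis=2)
# TF <= 0.12 (the form used in the repo at the time):
# img_r, img_g, img_b = tf.split(split_dim=2, num_split=3, value=img)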

Finetuning on new dataset / Modify input images on the fly

How can I modify images on the fly? Say I would like to set a certain region of the input images to 0; where in your code would I need to do the surgery for that?
Should it rather be in the ImageReader function, where the image is loaded?
Or rather in the network graph itself, say by adding a layer after the data layer in DeepLabResNetModel that multiplies elementwise with some mask?

Sorry for bothering you with this stupid question; I'm new to TensorFlow. Also sorry for asking a usage question, but since your code differs quite a lot from the TensorFlow tutorial code, I don't really know where else to turn. Thanks a lot for providing the DeepLab-ResNet model for TensorFlow!
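
For what it's worth, the second option can be sketched in a few lines: multiply the image batch elementwise with a fixed binary mask just before it is passed to DeepLabResNetModel. This is only my own illustration (the 321x321 size and the masked region are arbitrary), not code from the repo; doing the same multiplication inside ImageReader right after the image is loaded would work just as well.

import numpy as np
import tensorflow as tf

image_batch = tf.placeholder(tf.float32, shape=[None, 321, 321, 3])

mask_np = np.ones((321, 321, 1), dtype=np.float32)
mask_np[:100, :100, :] = 0.0                    # zero out the top-left 100x100 region
mask = tf.constant(mask_np)

masked_batch = image_batch * mask               # broadcasts over batch and channel dims
# net = DeepLabResNetModel({'data': masked_batch}, ...)   # hypothetical: feed the masked batch in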

Fail to train when loading pretrained model

Hi, I am using TensorFlow 0.12.0 on Ubuntu 16.04 with CUDA 8.0. When I ran train.py loading the pretrained model from model.ckpt-init or model.ckpt-pretrained, I got the TensorFlow NotFoundError ("Tensor name ..."), but it was OK with deeplab_resnet.ckpt.
BTW, would you please provide the scripts you used to convert the caffemodel to TensorFlow checkpoints? Thank you in advance!

problem with label reading

I followed your code and wrote a program to read the labels for a single image. However, the labels are weird.

%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np

label_name = tf.placeholder(tf.string, None)
label = tf.image.decode_png(tf.read_file(label_name), channels=1)
sess = tf.Session()

output = sess.run(label, feed_dict={label_name: '/opt/data/penggao/dataset/VOCdevkit/VOC2012/SegmentationClass/2010_001131.png'})
np.unique(output)
# -> array([  0,  52, 132, 147, 220], dtype=uint8)

question about deeplab_resnet_init.ckpt and deeplab_resnet.ckpt

deeplab_resnet_init.ckpt is the original Caffe ResNet trained on ImageNet by Kaiming He, while deeplab_resnet.ckpt is fine-tuned on MS-COCO after adding the atrous convolutions.
Do you know how they fine-tune on the MS-COCO dataset? Do they use classification or segmentation for fine-tuning?

Very low mIOU

Training from scratch using the initialization model deeplab_resnet_init.ckpt (provided in the instructions) and the default parameters (learning rate, number of iterations, etc.) results in a network that gives a very low mIoU (53.2%) on the validation set, whereas the pre-trained model (deeplab_resnet.ckpt, provided in the instructions) gives 78.9% mIoU. I am just wondering whether you used the same parameters (20k iterations, 1e-4 learning rate, etc.) for training the model that gives 78.9% on the validation set.

P.S.: I didn't run fine_tune.py. I just ran train.py, followed by evaluate.py.

inference.py

I have trained the net, and when I run inference.py, I get this error:

W tensorflow/core/framework/op_kernel.cc:993] Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for deeplab_resnet_init.ckpt

Can anyone help me?

Train doubts

Hi! I've tried to train the ResNet with my own data and I have some doubts:

1. When the training finishes, where are the final weights? (Sorry, maybe it's an easy question, but I'm new to TensorFlow.) I.e., when the training finishes I look for a .ckpt file to run inference with, and I can't find one.

2. During training, what is the "path/to/log-directory" to use with TensorBoard? (I saw on the TensorFlow website that you must execute "tensorboard --logdir=path/to/log-directory" to use TensorBoard.)

Many thanks in advance!

No need to remove the void-class during training

I think the following is trying to remove label 255.
raw_prediction = tf.reshape(raw_output, [-1, n_classes])
label_proc = prepare_label(label_batch, tf.pack(raw_output.get_shape()[1:3]), one_hot=False) # [batch_size, h, w]
raw_gt = tf.reshape(label_proc, [-1,])
indices = tf.squeeze(tf.where(tf.less_equal(raw_gt, n_classes - 1)), 1)
gt = tf.cast(tf.gather(raw_gt, indices), tf.int32)
prediction = tf.gather(raw_prediction, indices)

However, tf.one_hot can do this automatically.
input_batch = tf.one_hot(input_batch, depth=n_classes)

The following code is for testing:
a = tf.placeholder(tf.uint8)
out = tf.one_hot(a, depth=21)
sess = tf.Session()
sess.run(out, feed_dict={a: 7})
# -> array([ 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
#            0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)
sess.run(out, feed_dict={a: 67})
# -> array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#            0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)

As we can see, the one-hot target is all zeros when the label falls outside the 21 valid classes, so such pixels contribute zero loss and no gradient under the cross-entropy loss.

These two approaches seem to be the same, but using the one-hot encoding achieved better performance in my experiments: about a 3-point improvement for the VGG-16 network.

tf.split()

I ran train.py on TensorFlow 1.0.0 and the following error happens:

img_r, img_g, img_b = tf.split(split_dim=2, num_split=3, value=img)
TypeError: split() got an unexpected keyword argument 'split_dim'

What is wrong with 'split_dim'?

npy2ckpt.py error

Hi! During the weight conversion, I used npy2ckpt.py and it gives me the following error:

ValueError: Variable bn4b2_branch2a/mean does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?

To create the numpy file of weights, I used caffe-tensorflow, with the pretrained .caffemodel you can download here and the deploy.prototxt you can see here.

version

Can I run it on TF 0.10?
If so, what should I change in the code?

How to restore part of the model?

For example, I want to train 'fc1_voc12_c0', 'fc1_voc12_c1', 'fc1_voc12_c2' and 'fc1_voc12_c3' from scratch. To do that, I changed the name to 'fc1_voc12_c0_random'. How can I load the other weights while making sure 'fc1_voc12_c0_random' is initialised from noise?

Directly changing the name causes errors.
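
The usual pattern (a sketch with names taken from this issue; the surrounding train.py graph is assumed, not shown) is to build the restore list so that it skips everything under the renamed scopes, initialise those variables randomly, and only then restore the rest:

import tensorflow as tf

all_vars = tf.global_variables()                 # tf.all_variables() on pre-1.0 TF
restore_var = [v for v in all_vars if not v.name.startswith('fc1_voc12')]
new_var     = [v for v in all_vars if v.name.startswith('fc1_voc12')]

sess = tf.Session()
sess.run(tf.variables_initializer(new_var))      # random init for the renamed classifier layers
loader = tf.train.Saver(var_list=restore_var)    # only asks the checkpoint for the old names
# loader.restore(sess, './deeplab_resnet.ckpt')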

tensorflow 1.0

When will it support TF 1.0? Many APIs have changed and the code doesn't work now.

where is the scale layer in your code?

In the original Caffe code there are a batch normalization layer and a scale layer. What is the corresponding scale layer in your TensorFlow implementation? Another question: how do you deal with the padding problem in your code?
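
Regarding the scale layer, my understanding (sketched below, not the repo's exact code) is that TensorFlow's batch-norm op already takes the learned offset and scale (beta/gamma) together with the normalisation statistics, so Caffe's separate Scale layer has no standalone counterpart:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 41, 41, 256])
mean = tf.Variable(tf.zeros([256]), trainable=False)       # Caffe BatchNorm running mean
variance = tf.Variable(tf.ones([256]), trainable=False)    # Caffe BatchNorm running variance
beta = tf.Variable(tf.zeros([256]))                        # Caffe Scale layer bias
gamma = tf.Variable(tf.ones([256]))                        # Caffe Scale layer scale

y = tf.nn.batch_normalization(x, mean, variance, beta, gamma, variance_epsilon=1e-5)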

train my own dataset

I want to train the model on my own image dataset. Apart from changing the code that contains the image paths so they point to my own dataset, is there anything else I should do?

Thanks!

tf version

I want to know whether the train-orig branch can run on TF 0.10.

Augmented dataset.

Hi.

Thank you for your repository!

I have a small question: where did you get the augmented dataset from?

Visualization of the image batch

Hi, I use matplotlib to visualise the image batch for debugging purposes.
However, for a colour image, matplotlib only shows noise, like this:

[attached screenshot: colour image rendered as noise]

The single channel is correct:

[attached screenshot: single-channel image]
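
A likely explanation (hedged, since it depends on how the batch was produced): the images coming out of the reader are mean-subtracted float values, and in the Caffe/DeepLab convention also channel-swapped to BGR, so imshow renders them as noise. Undoing that before plotting usually gives a sensible picture; the mean values below are assumptions taken from the usual DeepLab setup:

import numpy as np
import matplotlib.pyplot as plt

IMG_MEAN = np.array((104.00698793, 116.66876762, 122.67891434), dtype=np.float32)  # BGR mean (assumed)

def to_displayable(img):
    """img: one image from the batch, float32 [H, W, 3], mean-subtracted, BGR."""
    img = img + IMG_MEAN            # undo the mean subtraction
    img = img[:, :, ::-1]           # BGR -> RGB for matplotlib
    return np.clip(img, 0, 255).astype(np.uint8)

# plt.imshow(to_displayable(images[0])); plt.show()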

how to do evaluation on the pascal server

This is not an issue with your project. Recently, I uploaded the results.tgz to the PASCAL evaluation server. However, there is no response and no submission results.
Do you know the reason?

Extending the model

@DrSleep I am trying to extend the model to add a distillation loss in addition to the cross-entropy with the ground-truth labels. Something like:

loss = cross_entropy(predictions, gt) + cross_entropy(predictions, activations_from_earlier_model)

I have everything covered except for the input pipeline. Right now, I have stored the activations from an earlier model in .npz files, where each activation is of size [H_ x W_ x n_classes] (down-sampled activations). I have added a new data_list of the following form:

/path/to/image_1.jpg /path/to/mask_1.png /path/to/activation_1.npz
/path/to/image_2.jpg /path/to/mask_2.png /path/to/activation_2.npz
.
.

The problem with the above is that I don't know of any reader in TensorFlow that matches the npz file format; the same goes for the decoder. Next, I tried dumping the activations in CSV format and used TextLineReader with the corresponding decode_csv to read the data, but that does not seem to accept the outputs of tf.train.slice_input_producer, giving the following error:

self.image, self.label, self.activation = read_images_from_disk(self.queue, self.input_size, random_scale, random_mirror)
...
activation_reader = tf.TextLineReader()
key, activation_contents = activation_reader.read(input_queue[2])
...
TypeError: Input 'queue_handle' of 'ReaderRead' Op requires l-value input

I think I could instead use something like this:

self.image_list, self.label_list, self.actv_list = read_labeled_image_list(self.data_dir, self.actv_dir, self.data_list)
self.queue[0] = tf.train.string_input_producer(self.images_list, shuffle=input_size is not None, seed='123')
self.queue[1] = tf.train.string_input_producer(self.label_list, shuffle=input_size is not None, seed='123')
self.queue[2] = tf.train.string_input_producer(self.actv_list, shuffle=input_size is not None, seed='123')
self.image, self.label, self.activation = read_images_from_disk(self.queue, self.input_size, random_scale, random_mirror)

But I am not sure.
Is there a nicer solution to what I am trying to achieve here in TensorFlow? I would prefer to use the npz format because of memory constraints.
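
One possible workaround (a sketch under my own assumptions, not a tested pipeline): keep the .npz files and load them inside the graph with tf.py_func, so the existing string-producing queue can stay as it is. The key name 'activation' and the shapes are hypothetical:

import numpy as np
import tensorflow as tf

def load_npz(path):
    # 'path' arrives from the queue as a byte string holding the .npz file path.
    data = np.load(path)
    return data['activation'].astype(np.float32)

def read_activation(actv_path, h, w, n_classes):
    actv = tf.py_func(load_npz, [actv_path], tf.float32)
    actv.set_shape([h, w, n_classes])    # py_func loses static shape info, so pin it here
    return actv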

Finetuning on the foreground/background of VOC2012

I want to change the VOC2012 semantic segmentation task (21 classes, including background) into a foreground/background task (2 classes, with every object class assigned to foreground).
I don't want to change the PNG images;
I want to change the code to achieve this effect.

I changed the code as follows:

  1. In deeplab_resnet/image_reader.py, I added this code at line 122:
line 121:       label = tf.image.decode_png(label_contents, channels=1)
line 122:       label = tf.cast(tf.not_equal(label, 0), tf.int8)

PS: I want to change all non-background labels to 1 and keep the background label at 0.

  2. In train.py:
    n_classes = 2

  3. In deeplab_resnet/model.py:
    I replaced 21 with 2 in the calls to atrous convolution (as you said in #12).

  4. In train.py:

restore_var = [v for v in trainable if not v.name.startswith('fc1_voc12_c')]
not_restore_var = [v for v in trainable if v.name.startswith('fc1_voc12_c')]

optim = optimiser.minimize(reduced_loss, var_list=not_restore_var)

saver = tf.train.Saver(var_list=restore_var, max_to_keep=40)
if args.restore_from is not None:
     load(saver, sess, args.restore_from)

The code is the same as in #12.

However, it can't converge.
step 0 loss = 76.737
step 1 loss = 42903.938
step 3 loss = 16467024.000
step 4 loss = nan
step 5 loss = nan
step 6 loss = nan
step 7 loss = nan
...

I ran it for 6000 steps, but the loss is still NaN.

If you don't mind, could you please help me with this problem?
I have spent a dozen hours reading the code and trying to implement this.
I am almost going crazy....... orz orz orz

Update: I found that utils.py should be changed, and image_reader.py should not.
I am trying that now, and it is so hard..... orz
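
One speculative thing worth checking, related to step 1 above: tf.not_equal(label, 0) also maps the void label 255 to 1, whereas the training code filters void pixels by keeping only labels <= n_classes - 1. A sketch (TF 1.x-style tf.where; older versions use tf.select) of a remapping that keeps 255 untouched so that filtering still works:

import tensorflow as tf

label = tf.placeholder(tf.uint8, shape=[321, 321, 1])        # stand-in for the decoded label

is_void = tf.equal(label, 255)
foreground = tf.cast(tf.not_equal(label, 0), tf.uint8)       # background stays 0, classes 1..20 -> 1
binary_label = tf.where(is_void, label, foreground)          # void pixels keep the value 255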

Did you update the Google Drive models?

The old models, deeplab_init_bn.ckpt and deeplab_pretrained_bn.ckpt, do not work with the current code. What is the difference between the current Google Drive files and the original ones?

print

I want to print the height and width of new_shape as ints, like '321':

    h, w = input_size
    if random_scale:
        scale = tf.random_uniform([1], minval=0.75, maxval=1.25, dtype=tf.float32, seed=None)
        h_new = tf.to_int32(tf.mul(tf.to_float(tf.shape(img)[1]), scale))
        w_new = tf.to_int32(tf.mul(tf.to_float(tf.shape(img)[1]), scale))
        new_shape = tf.squeeze(tf.pack([h_new, w_new]), squeeze_dims=[1])

        img = tf.image.resize_images(img, new_shape)

I'm new to TF; what should I do?
Thanks.
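
new_shape is a symbolic tensor, so a plain print only shows the Tensor object, not numbers like 321. Two ways to see the actual values (illustrative; the tf.constant below is just a stand-in for the real tensor):

import tensorflow as tf

# 1) Evaluate it in the running session, e.g. inside the training loop:
#    print(sess.run(new_shape))

# 2) Or attach a print op so the value is logged whenever the graph computes it:
new_shape = tf.constant([321, 321])    # stand-in for the real new_shape tensor
new_shape = tf.Print(new_shape, [new_shape], message='new_shape = ')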

data augmentation compared with the original code

Cropping and mirroring are the two techniques used in the original paper to prevent overfitting. I have two questions here.
(1) Is the crop implementation in your repository the same as the original one? You are using a central crop, right?
(2) How should the mirror augmentation be implemented?
tf.image.random_flip_left_right cannot flip the label and the image simultaneously. Another problem with this operator is that the label of shape (W, H, 1) gets changed into (W, H, 3).

Can you suggest any method for the mirror implementation?

Best wishes
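
For the mirror question, one way (a sketch with TF 1.x-style tf.reverse; this is my own example, not the repo's implementation) is to draw a single random number and flip both tensors with it, instead of calling tf.image.random_flip_left_right on each tensor independently:

import tensorflow as tf

def random_mirror(img, label):
    """img: [H, W, 3], label: [H, W, 1]; both get the same left-right flip."""
    mirror = tf.random_uniform([], 0, 1.0) < 0.5
    img = tf.cond(mirror, lambda: tf.reverse(img, [1]), lambda: img)        # flip along the width axis
    label = tf.cond(mirror, lambda: tf.reverse(label, [1]), lambda: label)
    return img, label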

Does the mIoU include the background?

I wonder whether the mIoU calculation includes the background.
Does it calculate the 21-class IoU, including the background, and then take the mean?
Or does it calculate the 20-class IoU, excluding the background?
I have checked the VOC 2012 dev toolkit, and it seems to include the background,
but I am not sure.
Could you please help me figure it out?
