
Comments (27)

John1231983 commented on August 17, 2024

OK, I found the bug. It is at line 138 of train_voc12.py:

image_batch_val, label_batch_val = read_data(is_training=False)

The function above needs one more argument, the split name; otherwise it reads from the training set :)

def read_data(is_training, batch_size=args.batch_size):
  file_pattern = '{}_{}.tfrecord'.format(args.data_name, args.split_name)

I can also reach about 80% mIoU with the code above, but when I corrected the function (by passing 'val' as the split name), the performance dropped to 73.6%, the same as my own implementation.
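A minimal sketch of the fix described above, assuming a hypothetical `read_data` signature that takes the split name explicitly (only the file-pattern logic mirrors the repo's code; the rest is illustrative):

```python
# Hypothetical sketch: pass the split name into read_data instead of always
# using args.split_name, so that is_training=False actually reads the
# validation TFRecord rather than the training one.
def read_data(data_name, split_name, is_training, batch_size=8):
    # Mirrors the repo's pattern: '{data_name}_{split_name}.tfrecord'
    file_pattern = '{}_{}.tfrecord'.format(data_name, split_name)
    # ... build the TFRecord input pipeline from file_pattern (omitted) ...
    return file_pattern  # returned here only so the sketch is easy to check

# Training reads the 'train' split; validation must read 'val'.
print(read_data('voc2012', 'train', is_training=True))  # voc2012_train.tfrecord
print(read_data('voc2012', 'val', is_training=False))   # voc2012_val.tfrecord
```

With the original code, both calls would resolve to the same split, which is exactly the bug reported here.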

from deeplabv3-tensorflow.

eveningdong commented on August 17, 2024

Hi, everyone, I made a mistake. @John1231983 is right about the validation, thanks for your patience. I will fix the bug and rerun the program.

John1231983 commented on August 17, 2024

@NanqingD Could you look at my problem?

bhack commented on August 17, 2024

Have you seen 4804235?

John1231983 commented on August 17, 2024

Have you tried to run it and reproduce that performance? I am sure the implementation has the issue described above; I saw it when I looked at the code.

bhack commented on August 17, 2024

Not yet, but 85.41% is better than every ablation in the original paper. Have you followed the training protocol in the README?

John1231983 commented on August 17, 2024

I trained for 4 days but still could not reach it, so I cancelled the process. I found another option: using a pretrained model to speed up convergence. But I am not clear whether we should then train all parameters (like conv_trainable in this implementation) or just the conv + BN layers in the ASPP. Do you know? In my opinion, we should first copy the pretrained weights from ResNet, use them as the initial point, and then train the weights again.

bhack commented on August 17, 2024

It seems that loading pretrained weights is on the task list. But I am still curious to know how 85.41% was achieved.

John1231983 commented on August 17, 2024

So let's try to run it. I may not be totally right, but I guess the author mixed the train and val sets together, so the reported value is effectively measured on data the model was trained on.

bhack commented on August 17, 2024

OK. Of course, if he mixed train and val, these numbers don't make sense.

John1231983 commented on August 17, 2024

I think so. Is my understanding of pretraining correct? It means we load the pretrained weights and use them as initial values instead of random initialization; then the values are updated during training.

bhack commented on August 17, 2024

Good catch!

John1231983 commented on August 17, 2024

@bhack: could you rerun his code and let me know your performance?

dongzhuoyao commented on August 17, 2024

It will be the next CVPR best paper if the repo author can reach an 85% validation result without ImageNet weight initialization. 💯

John1231983 commented on August 17, 2024

Hi, I believe the author achieved that number, but one line of the implementation was wrong, so the number he shows comes from the training set, not the validation set. It is as if you train for 3 days and then measure mIoU on the same data you trained on; the performance will keep increasing day by day. I guess after a week he could reach 99%. It is just one typo in the implementation; the rest is fine.

ksnzh commented on August 17, 2024

According to the code in this repo, it runs validation every 1000 steps.
If the code is correct, the val_mean_iou looks like the log below.

step 1000, train_mean_iou: 0.038962, val_mean_iou: 0.010141
step 2000, train_mean_iou: 0.042974, val_mean_iou: 0.028284
step 3000, train_mean_iou: 0.042751, val_mean_iou: 0.013903
step 4000, train_mean_iou: 0.046491, val_mean_iou: 0.027592
step 5000, train_mean_iou: 0.051442, val_mean_iou: 0.028335
step 6000, train_mean_iou: 0.063231, val_mean_iou: 0.043679
step 7000, train_mean_iou: 0.068127, val_mean_iou: 0.059629
step 8000, train_mean_iou: 0.079136, val_mean_iou: 0.067333
step 9000, train_mean_iou: 0.090848, val_mean_iou: 0.079447
step 10000, train_mean_iou: 0.091812, val_mean_iou: 0.080594
step 11000, train_mean_iou: 0.094520, val_mean_iou: 0.082270
step 12000, train_mean_iou: 0.091182, val_mean_iou: 0.037890
step 13000, train_mean_iou: 0.094305, val_mean_iou: 0.100399
step 14000, train_mean_iou: 0.108395, val_mean_iou: 0.087806
step 15000, train_mean_iou: 0.130730, val_mean_iou: 0.077178
step 16000, train_mean_iou: 0.151436, val_mean_iou: 0.094387
step 17000, train_mean_iou: 0.162695, val_mean_iou: 0.099858
step 18000, train_mean_iou: 0.161996, val_mean_iou: 0.092451
step 19000, train_mean_iou: 0.166216, val_mean_iou: 0.121211
step 20000, train_mean_iou: 0.173018, val_mean_iou: 0.138802
step 21000, train_mean_iou: 0.178273, val_mean_iou: 0.109637
step 22000, train_mean_iou: 0.165977, val_mean_iou: 0.163120
step 23000, train_mean_iou: 0.157765, val_mean_iou: 0.100000
step 24000, train_mean_iou: 0.170514, val_mean_iou: 0.124348
step 25000, train_mean_iou: 0.198407, val_mean_iou: 0.144772
step 26000, train_mean_iou: 0.222081, val_mean_iou: 0.098426
step 27000, train_mean_iou: 0.237516, val_mean_iou: 0.219594
step 28000, train_mean_iou: 0.227344, val_mean_iou: 0.148576
step 29000, train_mean_iou: 0.237092, val_mean_iou: 0.163101
step 30000, train_mean_iou: 0.246426, val_mean_iou: 0.156120
step 31000, train_mean_iou: 0.238966, val_mean_iou: 0.121964
step 32000, train_mean_iou: 0.232442, val_mean_iou: 0.172701
step 33000, train_mean_iou: 0.212796, val_mean_iou: 0.135206
step 34000, train_mean_iou: 0.221639, val_mean_iou: 0.164141
step 35000, train_mean_iou: 0.232702, val_mean_iou: 0.187685
step 36000, train_mean_iou: 0.258620, val_mean_iou: 0.126137
step 37000, train_mean_iou: 0.286420, val_mean_iou: 0.137220
step 38000, train_mean_iou: 0.281958, val_mean_iou: 0.220993
step 39000, train_mean_iou: 0.294367, val_mean_iou: 0.146129
step 40000, train_mean_iou: 0.286681, val_mean_iou: 0.180327
step 41000, train_mean_iou: 0.291149, val_mean_iou: 0.230863
step 42000, train_mean_iou: 0.268544, val_mean_iou: 0.259150
step 43000, train_mean_iou: 0.302505, val_mean_iou: 0.246976
step 44000, train_mean_iou: 0.264544, val_mean_iou: 0.209577
step 45000, train_mean_iou: 0.265050, val_mean_iou: 0.175589
step 46000, train_mean_iou: 0.282189, val_mean_iou: 0.162485
step 47000, train_mean_iou: 0.314683, val_mean_iou: 0.185510
step 48000, train_mean_iou: 0.316408, val_mean_iou: 0.259217
step 49000, train_mean_iou: 0.331183, val_mean_iou: 0.288642
step 50000, train_mean_iou: 0.330321, val_mean_iou: 0.292073
step 51000, train_mean_iou: 0.331521, val_mean_iou: 0.251862
step 52000, train_mean_iou: 0.321944, val_mean_iou: 0.223568
step 53000, train_mean_iou: 0.324300, val_mean_iou: 0.177001
step 54000, train_mean_iou: 0.313486, val_mean_iou: 0.237518
step 55000, train_mean_iou: 0.301724, val_mean_iou: 0.293159
step 56000, train_mean_iou: 0.319151, val_mean_iou: 0.163533
step 57000, train_mean_iou: 0.330160, val_mean_iou: 0.263894
step 58000, train_mean_iou: 0.361916, val_mean_iou: 0.243117
step 59000, train_mean_iou: 0.366205, val_mean_iou: 0.249002
step 60000, train_mean_iou: 0.367591, val_mean_iou: 0.183255
step 61000, train_mean_iou: 0.366055, val_mean_iou: 0.250439
step 62000, train_mean_iou: 0.366535, val_mean_iou: 0.318944
step 63000, train_mean_iou: 0.363185, val_mean_iou: 0.266321
step 64000, train_mean_iou: 0.344529, val_mean_iou: 0.293947
step 65000, train_mean_iou: 0.342547, val_mean_iou: 0.275548
step 66000, train_mean_iou: 0.346227, val_mean_iou: 0.219023
step 67000, train_mean_iou: 0.355806, val_mean_iou: 0.235097
step 68000, train_mean_iou: 0.382246, val_mean_iou: 0.169041
step 69000, train_mean_iou: 0.393502, val_mean_iou: 0.314861

John1231983 commented on August 17, 2024

Great to see the log, @ksnzh. Actually, I ran the code for 3 days and only reached around 60% validation mIoU (I don't remember the exact number). One thing to note: you have to change the learning rate by hand. For example, after 69k steps (as your log shows), you can reduce the learning rate and the performance should improve. Also, the model is trained from scratch, so it takes a long time to reach good performance.

John1231983 commented on August 17, 2024

@NanqingD Could you please look at the thread?

dongzhuoyao commented on August 17, 2024

Hi, everyone. The paper says:

Inference strategy on val set: The proposed model is trained with output stride = 16, and then during inference we apply output stride = 8 to get a more detailed feature map. As shown in Tab. 4, interestingly, when evaluating our best cascaded model with output stride = 8, the performance improves over evaluating with output stride = 16 by 1.39%.

Do you know what this sentence means? In my understanding, it means setting the block5 rate = 2 during training and the block5 rate = 1 during inference. Am I right? What is your opinion?

John1231983 commented on August 17, 2024

No. During training, block 3 is unchanged (same as the original ResNet), block 4 uses rate 2, and block 5 uses rate 4. At inference, block 3 uses rate 2, block 4 uses rate 4, and block 5 uses rate 8.
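The rates above can be summarized in a small sketch. This encodes my reading of this comment only, not something verified against the paper's released code; block names are illustrative:

```python
# Atrous rates per ResNet block, as described in the comment above:
# output stride 16 during training, output stride 8 at inference.
TRAIN_RATES = {'block3': 1, 'block4': 2, 'block5': 4}  # output stride 16
INFER_RATES = {'block3': 2, 'block4': 4, 'block5': 8}  # output stride 8

# Moving from output stride 16 to output stride 8 doubles every atrous rate,
# which is what keeps the effective receptive field the same.
for block in TRAIN_RATES:
    assert INFER_RATES[block] == 2 * TRAIN_RATES[block]
```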

John1231983 commented on August 17, 2024

One more thing: if you want to use a pretrained model (or plan to in the future), you have to change the ResNet scope name at line 94 from

scope ='resnet{}'.format(depth) 

to

scope ='resnet_v1_{}'.format(depth) 

I think it is better to use the pretrained model, as the paper's authors did, so you converge faster. After changing that line, just add the checkpoint path in train_voc12.py:

 ckpt_path = './resnet_v1_101.ckpt'

And add the following lines after the net, endpoint = deeplabv3(...) call in train_voc12.py:

exclude = ['resnet_v1_101/aspp/1x1conv', 'resnet_v1_101/aspp/1x1conv/BatchNorm',
           'resnet_v1_101/aspp/rate6', 'resnet_v1_101/aspp/rate6/BatchNorm',
           'resnet_v1_101/aspp/rate12', 'resnet_v1_101/aspp/rate12/BatchNorm',
           'resnet_v1_101/aspp/rate18', 'resnet_v1_101/aspp/rate18/BatchNorm',
           'resnet_v1_101/img_pool/1x1conv', 'resnet_v1_101/img_pool/1x1conv/BatchNorm',
           'resnet_v1_101/fusion/1x1conv', 'resnet_v1_101/fusion/1x1conv/BatchNorm',
           'resnet_v1_101/logits/Conv']
# Restore everything except the newly added (randomly initialized) layers:
restore_var = slim.get_variables_to_restore(exclude=exclude)
restorer = tf.train.Saver(restore_var)
restorer.restore(sess, ckpt_path)

Hope you achieve a good result. In my case, my modified code reached about 74% mIoU with a batch size of 8.

eveningdong commented on August 17, 2024

@John1231983 Hi, John. I will follow up on your idea in the next few months and try to fix it piece by piece in my free time. Again, thanks for your support.

John1231983 commented on August 17, 2024

Good job. One more thing I forgot: in deeplabv3.py, the line aspp = tf.add_n(aspp_list) should be changed from add to concat. I think the original authors used concatenation.
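To illustrate the difference, here is a sketch using NumPy in place of the TF ops (the shape behavior is the same): tf.add_n sums the branches and keeps the channel count, while tf.concat stacks them along the channel axis, which is what ASPP does before its final 1x1 convolution. Branch count and tensor sizes below are illustrative:

```python
import numpy as np

# Five ASPP branches, each an NHWC tensor with 256 channels.
aspp_list = [np.ones((1, 4, 4, 256)) for _ in range(5)]

added = sum(aspp_list)                     # like tf.add_n(aspp_list)
fused = np.concatenate(aspp_list, axis=3)  # like tf.concat(aspp_list, axis=3)

print(added.shape)  # (1, 4, 4, 256)  -- channel count unchanged
print(fused.shape)  # (1, 4, 4, 1280) -- 5 * 256 channels, reduced by a 1x1 conv
```

So switching from add_n to concat also requires the following 1x1 conv to accept the larger channel dimension.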

eveningdong commented on August 17, 2024

@John1231983 Done. I reran the code for 12 hours; the training mIoU is 77% and the validation mIoU is 64%. I think the sanity check passed. I am going to close this issue.

Welcome to (re)open it if you find more problems.

John1231983 commented on August 17, 2024

Good job. Did you run with the pretrained model or from scratch? I trained with the pretrained model and achieved 73% mIoU on validation using a poly learning rate schedule. If you used the pretrained model, your mIoU still seems low; you could use a poly schedule instead of a step schedule.
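For reference, a sketch of the poly schedule (the form used in the DeepLab papers; the base learning rate and step counts below are just illustrative, not the repo's settings):

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    # 'poly' decay: smoothly anneals the learning rate to zero at max_steps,
    # instead of dropping it in discrete steps.
    return base_lr * (1.0 - float(step) / max_steps) ** power

print(poly_lr(0.007, 0, 30000))      # 0.007 at the start
print(poly_lr(0.007, 30000, 30000))  # 0.0 at the end
```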

Balabala-Hong commented on August 17, 2024

@John1231983 Hi, John. I want to know how you set the learning rate and batch size. If possible, can you share the snapshots/checkpoint with me? My email address is [email protected].

bhack commented on August 17, 2024

@FlyingIce1 There is now an official reference implementation, if you are interested.
