Comments (11)
Experiments are conducted on a machine with 8 Nvidia 2080 Ti GPUs. We use batch_size=32 and fewer total iterations to save time.
classical_sr_x2
(trained on DIV2K, patch size = 48x48) takes about 1.75 days to train for 500K iterations.
classical_sr_x4
(trained on DIV2K, patch size = 48x48) takes about 0.95 days to train for 250K iterations. Note that we fine-tune the x3/x4/x8 models from x2 and halve both the learning rate and the total training iterations to reduce training time.
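A minimal PyTorch sketch of this fine-tuning recipe, using a toy stand-in model and a hypothetical checkpoint file name (the real architecture is in models/network_swinir.py): the initial lr and the milestone schedule are both halved relative to x2 training.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Conv2d(3, 3, 3, padding=1)  # toy stand-in for SwinIR

# Hypothetical checkpoint name; x3/x4/x8 fine-tuning starts from x2 weights.
# state = torch.load("swinir_classical_sr_x2.pth")
# model.load_state_dict(state, strict=False)  # upsampler weights differ per scale

# x2 training: lr=2e-4 for 500K iters. Fine-tuning halves both: lr=1e-4, 250K iters.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = MultiStepLR(optimizer, milestones=[125000, 200000, 225000, 237500], gamma=0.5)

total_iters = 250000  # half of the 500K used for x2
# Training loop (sketch):
# for it in range(total_iters):
#     ...forward/backward on a batch of 48x48 patches...
#     optimizer.step(); scheduler.step()
```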
from swinir.
@JingyunLiang
Hi, thanks for your work!
I found something strange. You fine-tune BIX3/4/8 with half the learning rate. Why only halve the lr for fine-tuning?
If you use 1e-4 to train SwinIR-BIX2 and then use 1e-5 to fine-tune BIX4, you get a much better result (even an unbelievable one, with very high PSNR/SSIM) than with the halved lr=5e-5.
Is this kind of training trick cheating? (It still improves PSNR by 0.2 dB on Manga109, and by 0.0x dB on the other image SR benchmarks.)
Thanks. Does that mean it is difficult to train on a single card because of memory, even with batch size 16?
For a middle-size SwinIR (used for classical image SR, around 12M parameters), we need about 24GB (2x12GB GPUs) for batch_size=16.
For a small-size SwinIR (used for lightweight image SR, around 900K parameters), we need about 12GB (1x12GB GPU) for batch_size=16.
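When a single GPU cannot hold the full batch, gradient accumulation is one standard workaround (not part of the SwinIR repo; a hedged sketch with a toy model): two micro-batches of 8 give the same effective batch of 16, trading time for memory. It assumes the loss is a mean over the batch, so each micro-batch loss is divided by the accumulation count.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)  # toy stand-in for SwinIR
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
accum = 2  # 2 micro-batches x 8 samples = effective batch 16

losses = []
for step in range(3):  # a few toy iterations
    opt.zero_grad()
    for _ in range(accum):
        lr_patch = torch.randn(8, 3, 48, 48)  # micro-batch of 48x48 patches
        # Scale the mean loss so the summed gradient matches a full batch of 16.
        loss = nn.functional.l1_loss(model(lr_patch), lr_patch) / accum
        loss.backward()  # gradients accumulate across micro-batches
        losses.append(loss.item())
    opt.step()
```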
We fine-tune with half the lr (2e-4 may be too large, so we use 1e-4) and half the iterations, which saves half of the training time. Fine-tuning from x2 is a common practice, e.g., RCAN (ECCV 2018).
As for using 1e-4 to train SwinIR-BIX2, I have never tried it. Your observation is really surprising! In my experience, the learning rate does not have much impact as long as you decrease it gradually. Maybe Transformers have different characteristics from CNNs in learning-rate selection.
@Senwang98 Thank you for your report. Can you post more details here? Is it classical SR or lightweight SR? What are your PSNR values on the five benchmarks? Is your network architecture identical to ours (see models/network_swinir.py)?
I will try it and validate your finding~~ My results will be updated here.
@JingyunLiang
Thanks for your quick reply!
The result was found with a CNN. I am not sure whether my code is wrong, because I used the EDSR-pytorch repo to train my model.
Several days ago, I ran an experiment on RCAN. I took RCAN-BIX2.pt (trained with lr=1e-4, decayed by half every 200,000 iterations) and used 1e-5 to fine-tune RCAN-BIX4. This time I did not decay the learning rate every 200,000 iterations; that is, I used 1e-5 for the whole training without ever changing the lr!
I think you are an expert in this field, so do you think this training trick is cheating?
If I decay the lr when training BIX4, the result is OK. If I use a much smaller lr and never change it, the final result is better (for RCAN, BIX4 performance on Manga109 improves from 31.22 to around 31.45).
Can you give me some suggestions? (I don't mean your SwinIR is wrong, I just want to explain this strange thing!)
Thanks again for your interesting work. Maybe you can fine-tune BIX4 from SwinIR-BIX2 without changing the lr during training!
It is possible for RCAN, as it is a very deep network and should have strong representation ability. A better training strategy or longer training time may help.
In my view, if all other settings are the same as the original RCAN (same datasets, same patch size, same training iterations, same optimizer, etc.), changing the lr could be a good trick. I think it is fair. If it proves useful for other CNNs as well, all future works should adopt this strategy! However, we should probably point out the lr strategy in the paper and do some ablation studies when comparing with these older methods.
As for your advice on fine-tuning SwinIR-BIX4 by using fixed lr (1e-5), I will try it and keep you updated. Thank you.
@JingyunLiang
Yes, you are right; this training setting should be reported in the paper, and some studies should also be done to support this trick.
I will try training other CNN-based models later. If it actually works, I will tell you.
Thanks for your reply!
- Update: I compared three learning-rate strategies when fine-tuning the x4 classical SR model from x2.
| Case | Init LR | LR steps (milestones) | Set5 | Set14 | BSD100 | Urban100 | Manga109 |
|---|---|---|---|---|---|---|---|
| 1 (used in the paper) | 1e-4 | [125000, 200000, 225000, 237500] (total_iter=250000) | 32.72/0.9021 | 28.94/0.7914 | 27.83/0.7459 | 27.07/0.8164 | 31.67/0.9226 |
| 2 | 1e-5 | none (total_iter=250000) | 32.69/0.9018 | 28.96/0.7920 | 27.84/0.7463 | 27.07/0.8165 | 31.73/0.9227 |
| 3 | 1e-5 | none (total_iter=500000) | 32.69/0.9020 | 28.96/0.7918 | 27.84/0.7462 | 27.08/0.8168 | 31.69/0.9228 |
- Conclusion: PSNR changes range from -0.03 to +0.06 dB. The second lr strategy (fix the lr at 1e-5 for x4 fine-tuning) is only slightly better.
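The schedules in the table can be written as a small pure-Python helper (a sketch; milestone values taken from the table above, not from the training code):

```python
def multistep_lr(it, base_lr, milestones, gamma=0.5):
    """Learning rate after `it` iterations: multiplied by gamma at each passed milestone."""
    return base_lr * gamma ** sum(it >= m for m in milestones)

# Case 1 (paper): init lr 1e-4, halved at four milestones within 250K iters.
milestones = [125000, 200000, 225000, 237500]
case1 = [multistep_lr(i, 1e-4, milestones) for i in (0, 125000, 200000, 225000, 240000)]
# lr halves at each milestone: 1e-4, 5e-5, 2.5e-5, 1.25e-5, 6.25e-6

# Cases 2 and 3: constant lr 1e-5, no milestones, so the lr never changes.
case2 = [multistep_lr(i, 1e-5, []) for i in (0, 125000, 249999)]
```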
@JingyunLiang
OK, I will check. Maybe it is more useful for CNN-based models. (Although I think this strategy should not work, the results are really better in my repo, haha.)
Feel free to reopen the issue if you have more questions.