Comments (9)
@gaopengcuhk we didn't scale the learning rate for our experiments; we found that with Adam it was fine to use the same default values for all configurations (even with 64 GPUs).
The linear scaling rule is definitely too aggressive, and the model will probably not train at all with it. If you want to try a scaling rule for the learning rate, square-root scaling could potentially work (so if you double the batch size, multiply the learning rate by sqrt(2)).
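A minimal sketch of the square-root rule (the helper name and reference batch size are illustrative, not from the DETR code):

```python
import math

# Square-root scaling: if the total batch size grows by a factor k
# relative to a reference run, multiply the learning rate by sqrt(k).
def sqrt_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    return base_lr * math.sqrt(new_batch / base_batch)

# Doubling the batch size multiplies the learning rate by sqrt(2):
print(sqrt_scaled_lr(1e-4, 32, 64))  # ~1.414e-4
```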
I believe I've answered your question, and as such I'm closing the issue, but let us know if you have further questions.
@szagoruyko I trained DETR for 150 epochs with 8 V100 GPUs and with 8 V100 GPUs × 4 nodes, keeping the learning rate unchanged. However, there is still a performance gap:
| GPU config | AP |
|---|---|
| 8 | 39.9 |
| 8 × 4 | 38.4 |
| 8 × 8 | running |
Did you make a similar observation? Or will the gap diminish in the 300-epoch setting?
If you keep the learning rate unchanged, the performance with 16 GPUs is worse than with 8 GPUs at the same epoch, right?
@gaopengcuhk it depends on the total batch size. For example, if we keep a total batch size of 32 images, 2 im/gpu on 16 cards gives the same results as 4 im/gpu on 8 cards. If we increase the total batch size, e.g. by training with 4 im/gpu on 16 cards, we observe that the model converges more slowly, but with longer training it catches up.
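Put as plain arithmetic (an illustration only; the helper name is made up):

```python
def total_batch_size(images_per_gpu: int, num_gpus: int) -> int:
    # In distributed data-parallel training, the effective batch size per
    # optimizer step is the local batch size times the number of GPUs.
    return images_per_gpu * num_gpus

assert total_batch_size(2, 16) == total_batch_size(4, 8) == 32  # equivalent runs
assert total_batch_size(4, 16) == 64  # doubled batch: slower convergence per epoch
```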
I tried scaling up the learning rate and the backbone learning rate from 1e-4/1e-5 to 3e-4/3e-5 when training with 24 GPUs, but the mAP is always zero. Can you suggest a learning-rate scaling rule?
Similar answer here: #46
Keep the learning rate unchanged for all GPU configurations.
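For reference, a simplified sketch of the DETR-style optimizer setup with its default learning rates. The toy model is a stand-in so the snippet runs on its own; the param-group pattern (a separate, 10× smaller lr for parameters whose names contain "backbone") follows what DETR's main.py does:

```python
import torch
from torch import nn

# Toy stand-in for DETR so the snippet is self-contained; only the
# "backbone" prefix in parameter names matters for the grouping below.
model = nn.ModuleDict({
    "backbone": nn.Linear(4, 4),
    "transformer": nn.Linear(4, 4),
})

# Defaults, kept unchanged regardless of the number of GPUs.
lr, lr_backbone, weight_decay = 1e-4, 1e-5, 1e-4

param_dicts = [
    # Transformer and prediction heads train at the base learning rate.
    {"params": [p for n, p in model.named_parameters()
                if "backbone" not in n and p.requires_grad]},
    # The pretrained backbone trains at a 10x smaller learning rate.
    {"params": [p for n, p in model.named_parameters()
                if "backbone" in n and p.requires_grad],
     "lr": lr_backbone},
]
optimizer = torch.optim.AdamW(param_dicts, lr=lr, weight_decay=weight_decay)
```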
Hi, I observe the same thing:
2 im/gpu on 8 cards gets better results than 2 im/gpu on 16 cards at the same epoch. I guess the 16-card run will eventually catch up; I will update the results when I finish the full training.
> @gaopengcuhk it depends on the total batch size. For example, if we keep a total batch size of 32 images, 2 im/gpu on 16 cards gives the same results as 4 im/gpu on 8 cards. If we increase the total batch size, e.g. by training with 4 im/gpu on 16 cards, we observe that the model converges more slowly, but with longer training it catches up.

Hi, could you please share your GPU model and how much GPU memory (in MB) is actually used on each card when training with 2 im/gpu? Many thanks!
@gaopengcuhk could you please share whether your larger-batch-size model finally caught up?