
Comments (8)

drboog commented on August 16, 2024

Hi, my suggestion is to start with gamma=10 and try different itc and itd values. The reasons are: 1) I only roughly tuned the hyper-parameters, so the provided values are not optimal; if you are looking for better performance, you may want to tune them; 2) although each GPU has 16 samples, the total mini-batch size differs when you use a different number of GPUs, which I think will influence the hyper-parameter settings. Still, I think gamma=10 will lead to promising results.
I used 'ada=noaug' because 1) for a fair comparison, I wanted to compare with previous methods, which didn't use augmentation, under the same setting, to show our effectiveness; 2) in my experiments on different datasets (although the hyper-parameters may not have been well tuned), ADA augmentation is not guaranteed to improve performance on every dataset, so I didn't use it in the final experiments, for simplicity.
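For concreteness, here is a minimal sketch of the kind of sweep suggested above. It assumes a stylegan2-ada-style train.py and reuses the flag names mentioned in this thread (--gamma, --itd, --itc); the dataset path, output directory, and grid values are placeholders, so adapt them to the actual repository before running.

```python
# Hypothetical hyper-parameter sweep over itd/itc with gamma fixed at 10.
# Flag names are taken from this thread; paths and grid values are placeholders.
import itertools
import subprocess

GAMMA = 10                   # starting point suggested above
ITD_GRID = [2, 5, 10, 20]    # example values inside the 0-50 range from the paper
ITC_GRID = [5, 10, 20, 50]

for itd, itc in itertools.product(ITD_GRID, ITC_GRID):
    cmd = [
        "python", "train.py",
        "--outdir", f"runs/gamma{GAMMA}_itd{itd}_itc{itc}",
        "--data", "datasets/mydataset.zip",   # placeholder dataset path
        "--gpus", "1",
        "--gamma", str(GAMMA),
        "--itd", str(itd),
        "--itc", str(itc),
    ]
    print("launching:", " ".join(cmd))
    subprocess.run(cmd, check=True)           # comment out for a dry run
```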


StolasIn commented on August 16, 2024

Thanks for your advice, it's a great help to me.

I still have some questions about hyper-parameter tuning and batch size.

  1. In the paper, you select itd and itc from 0 to 50. My question is: should those hyper-parameters still be in this range, perhaps close to the original setting (e.g. itd = 5, itc = 10)?

  2. In the latest models which use contrastive learning, I noticed the batch size is usually set quite large (>256). Does a large batch size result in better performance in Lafite? Or does a lower batch size have some benefits?


drboog commented on August 16, 2024

Yes, I think you can tune the hyper-parameters by searching in this range. I think a larger batch size will lead to a performance improvement, because it provides better discriminative information for training. But in that case you may also need to tune some hyper-parameters of the contrastive loss (--temp=0.5 and --lam=0. were tuned with a batch size of 16 per GPU).
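For intuition about how --temp interacts with the batch size, here is a minimal, self-contained sketch of a symmetric InfoNCE-style contrastive loss with a temperature (an illustration, not Lafite's exact implementation): every other sample in the mini-batch serves as a negative, so the number of negatives grows with the batch size, which is why --temp and --lam may need re-tuning when the batch grows.

```python
# Minimal symmetric contrastive (InfoNCE-style) loss -- illustration only,
# not the exact loss used in Lafite.
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, temp=0.5):
    """img_feats, txt_feats: (B, D) feature batches; every off-diagonal pair
    in the B x B similarity matrix serves as a negative."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temp            # (B, B) scaled cosine similarities
    targets = torch.arange(img.size(0))      # matched pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# With B=16 each sample sees 15 negatives; with B=256 it sees 255,
# which changes how sharp the softmax (and hence the temperature) should be.
img = torch.randn(16, 512)
txt = torch.randn(16, 512)
print(contrastive_loss(img, txt, temp=0.5))
```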


StolasIn commented on August 16, 2024

I appreciate you answering my questions. I will close this issue and run the experiments mentioned above.


Cwj1212 commented on August 16, 2024

@StolasIn Have you reproduced the results with one GPU now? I have reproduced the results of the paper under the four-GPU setting (batch=32, batch_gpu=8 gets better results than the paper). But when I try to experiment with one GPU, I only get poor results.
@drboog, thank you so much for your work, for which I have to keep harassing you again. As for why the hyper-parameters need to be re-adjusted for one GPU, my observation is that with the {gather: false} setting the contrastive loss in the code is computed separately on each GPU. I don't know what else causes the difference between one GPU and four GPUs?
What confuses me is that I modified the calculation of the contrastive loss under the one-GPU setting to simulate the {gather: false} setting (dividing a batch of samples into four parts and calculating their contrastive losses separately), but I still only get poor results.
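A toy version of the comparison described above, using a generic InfoNCE-style loss as a stand-in for Lafite's contrastive loss: {gather: false} corresponds to computing the loss inside each per-GPU chunk, while the gathered version computes it over the whole batch, and the two generally give different values.

```python
# Toy comparison of per-GPU (gather: false) vs. gathered contrastive loss.
# The loss here is a generic InfoNCE-style stand-in, not Lafite's implementation.
import torch
import torch.nn.functional as F

def info_nce(img, txt, temp=0.5):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temp
    targets = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

torch.manual_seed(0)
img = torch.randn(32, 512)   # full batch of 32 = 4 chunks of 8 (batch=32, batch_gpu=8)
txt = torch.randn(32, 512)

# gather: false -- each "GPU" only sees its own 8 samples as negatives
per_chunk = torch.stack([info_nce(img[i:i + 8], txt[i:i + 8])
                         for i in range(0, 32, 8)]).mean()

# gathered -- all 32 samples compete as negatives
global_loss = info_nce(img, txt)

print(per_chunk.item(), global_loss.item())  # the two values differ
```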


drboog commented on August 16, 2024

The performance is related to many things: batch size, learning rate, regularizer... For example, for StyleGAN2 without contrastive loss (for image generation, not text-to-image generation), the number of GPUs still matters a lot:
https://github.com/NVlabs/stylegan2-ada/blob/main/docs/stylegan2-ada-training-curves.png


drboog commented on August 16, 2024

Assume that under the 4-GPU setting each GPU has N samples, resulting in a batch size of 4N. Are you using a batch size of 4N when using one GPU?


Cwj1212 commented on August 16, 2024

@drboog Yes, I did, but the performance with one card is still significantly worse than with four cards. Thank you very much for providing this picture. I originally thought the difference was caused by cfg=auto choosing different hyper-parameters for different numbers of GPUs.
But as a beginner, I still can't understand why the forward and backward propagation of the network are not equivalent in this case. Are they equivalent in theory?
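On the "equivalent in theory?" question, a toy gradient check (not Lafite's code) illustrates the usual answer: for a loss that decomposes per sample, averaging the gradients of four chunks of size N equals the gradient of one batch of size 4N, so data-parallel training is equivalent; a batch-coupled term, such as a contrastive loss computed per GPU or a minibatch-stddev layer, breaks that equivalence.

```python
# Toy gradient check: per-sample losses are equivalent under data parallelism,
# batch-coupled losses are not. Illustration only, not Lafite's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W0 = torch.randn(8, 8)
x = torch.randn(32, 8)     # batch of 32 = 4 chunks of 8
x2 = torch.randn(32, 8)    # second "view" for the contrastive case
y = torch.randn(32, 1)

def grads(loss_fn):
    """Return (full-batch gradient, average of the four chunk gradients)."""
    def run(idx):
        W = W0.clone().requires_grad_(True)
        loss_fn(W, idx).backward()
        return W.grad
    full = run(slice(0, 32))
    chunked = torch.stack([run(slice(i, i + 8)) for i in range(0, 32, 8)]).mean(0)
    return full, chunked

# 1) Per-sample loss (MSE): 1 GPU with batch 4N and 4 GPUs with batch N match.
def mse(W, idx):
    return F.mse_loss((x[idx] @ W).sum(dim=1, keepdim=True), y[idx])

full, chunked = grads(mse)
print("per-sample loss equivalent:", torch.allclose(full, chunked, atol=1e-6))   # True

# 2) Batch-coupled loss (InfoNCE over the batch): the equivalence breaks.
def nce(W, idx):
    a = F.normalize(x[idx] @ W, dim=-1)
    b = F.normalize(x2[idx] @ W, dim=-1)
    logits = a @ b.t() / 0.5
    return F.cross_entropy(logits, torch.arange(a.size(0)))

full, chunked = grads(nce)
print("batch-coupled loss equivalent:", torch.allclose(full, chunked, atol=1e-6))  # False
```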

