Git Product home page Git Product logo

Comments (6)

simran-arora avatar simran-arora commented on August 25, 2024

Hi, Is there any resolution to this question for the initialization and recommended training configs to reproduce the paper results? I am also seeing some instability with the default configs.
Thanks so much!

from torchscale.

sunyt32 avatar sunyt32 commented on August 25, 2024
  1. --share-decoder-input-output-embed saves model parameters especially when the model size is small. The performance is almost the same. We activate it in our experiment.
  2. Don't activate --subln or --deepnorm. The current initialization is good enough.
  3. The training instability comes from Linear bias and eps in LayerNorm. In our experiment, we set bias=False and eps=1e-5. Besides, RMSNorm is helpful for stability so we make a modification.

from torchscale.

donglixp avatar donglixp commented on August 25, 2024

Hi, Is there any resolution to this question for the initialization and recommended training configs to reproduce the paper results? I am also seeing some instability with the default configs. Thanks so much!

@simran-arora @hanlinxuy

  • The LN eps was modified from 1e-6 to 1e-5 as in the commit d1fefe9

  • The RMSNorm is also used in the commit 5c89ffb , so that the effects of LN_eps can be eliminated

  • For the RetNet implementation, the initialization principle proposed in DeepNet has been integrated. So the arguments --subln or --deepnorm should not be added.

  • Removing bias also improves training stability.

The latest released code has considered the above points.

from torchscale.

simran-arora avatar simran-arora commented on August 25, 2024

Thanks so much! I had used layer norm and did not set the bias=False. Will try switching these.

Adding the explicit deepnorm initialization also improved stability for my downstream runs, but I will try using the recommended techniques instead.

from torchscale.

sunyt32 avatar sunyt32 commented on August 25, 2024

@simran-arora It's better to set bias=False both in layer norm and nn.Linear.

Besides, would you mind sharing the training details with us? e.g. corpus, model size, and hyper-parameters. We'd like to see the instability setting.

from torchscale.

hanlinxuy avatar hanlinxuy commented on August 25, 2024

Hi, Is there any resolution to this question for the initialization and recommended training configs to reproduce the paper results? I am also seeing some instability with the default configs. Thanks so much!

@simran-arora @hanlinxuy

  • The LN eps was modified from 1e-6 to 1e-5 as in the commit d1fefe9
  • The RMSNorm is also used in the commit 5c89ffb , so that the effects of LN_eps can be eliminated
  • For the RetNet implementation, the initialization principle proposed in DeepNet has been integrated. So the arguments --subln or --deepnorm should not be added.
  • Removing bias also improves training stability.

The latest released code has considered the above points.

Thank you very much! Will try later with those new information!

from torchscale.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.