Hello, I have followed the training configuration introduced here (<

retnet training config about torchscale HOT 6 OPEN

microsoft commented on August 25, 2024 2

retnet training config

from torchscale.

Comments (6)

simran-arora commented on August 25, 2024

Hi, is there any resolution to this question about the initialization and the recommended training configs for reproducing the paper results? I am also seeing some instability with the default configs.
Thanks so much!

sunyt32 commented on August 25, 2024

--share-decoder-input-output-embed saves model parameters, especially when the model is small, and the performance is almost the same. We enable it in our experiments.
Don't enable --subln or --deepnorm. The current initialization is good enough.
The training instability comes from the bias in Linear layers and the eps in LayerNorm. In our experiments, we set bias=False and eps=1e-5. RMSNorm also helps stability, so we modified the code accordingly.
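
To see why sharing the decoder input and output embeddings matters more for small models, here is a rough sketch of the parameter arithmetic. The vocabulary size, embedding widths, and "body" parameter counts are hypothetical numbers chosen for illustration, not values from this thread or from torchscale.

```python
# Rough parameter arithmetic for tying the decoder input embedding to the
# output projection. All sizes below are hypothetical, not from the thread.

def embedding_params(vocab_size: int, embed_dim: int, tied: bool) -> int:
    """Parameters in the input embedding plus the output projection."""
    input_embed = vocab_size * embed_dim
    # When tied, the output projection reuses the input embedding matrix.
    output_proj = 0 if tied else vocab_size * embed_dim
    return input_embed + output_proj

def saved_fraction(vocab_size: int, embed_dim: int, body_params: int) -> float:
    """Fraction of total parameters removed by tying the two matrices."""
    untied = body_params + embedding_params(vocab_size, embed_dim, tied=False)
    tied = body_params + embedding_params(vocab_size, embed_dim, tied=True)
    return (untied - tied) / untied

# Hypothetical small and large models sharing the same vocabulary.
small = saved_fraction(vocab_size=50_000, embed_dim=512, body_params=20_000_000)
large = saved_fraction(vocab_size=50_000, embed_dim=4096, body_params=6_000_000_000)
print(f"small model: {small:.1%} saved, large model: {large:.1%} saved")
```

With these made-up sizes, tying removes roughly a third of the small model's parameters but only a few percent of the large one's, which matches the remark that the saving is most visible at small scale.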

donglixp commented on August 25, 2024

@simran-arora @hanlinxuy

The LN eps was changed from 1e-6 to 1e-5 in commit d1fefe9.
RMSNorm is also used as of commit 5c89ffb, so that the effect of the LN eps can be eliminated.
The RetNet implementation already integrates the initialization principle proposed in DeepNet, so the arguments --subln and --deepnorm should not be added.
Removing bias also improves training stability.

The latest released code takes the above points into account.
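
For readers who want a feel for how much eps can matter, the following dependency-free sketch compares a textbook LayerNorm and RMSNorm (without learnable weights) on a hypothetical input whose values sit close together around 1.0. It only illustrates the arithmetic; it is not the torchscale implementation and does not reproduce the authors' analysis.

```python
import math

def layer_norm(x, eps):
    """Textbook LayerNorm without affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps):
    """Textbook RMSNorm without a learnable scale (no mean subtraction)."""
    mean_square = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_square + eps) for v in x]

def spread(x):
    """Standard deviation of a list, used to measure the output scale."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))

# Hypothetical activations: mean close to 1.0, variance close to 1e-6.
x = [1.001, 0.999, 1.001, 0.999]

for eps in (1e-6, 1e-5):
    ln_scale = spread(layer_norm(x, eps))
    rms_first = rms_norm(x, eps)[0]
    print(f"eps={eps:g}  LayerNorm output std={ln_scale:.4f}  RMSNorm first value={rms_first:.6f}")
```

Because LayerNorm subtracts the mean, the variance here is about the same size as eps, so moving eps from 1e-6 to 1e-5 changes the LayerNorm output scale from about 0.71 to about 0.30. RMSNorm divides by the root mean square, which is close to 1 for this input, so the same change in eps barely affects its output.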

simran-arora commented on August 25, 2024

Thanks so much! I had used layer norm and had not set bias=False. I will try switching these.

Adding the explicit deepnorm initialization also improved stability for my downstream runs, but I will try the recommended techniques instead.

sunyt32 commented on August 25, 2024

@simran-arora It's better to set bias=False in both layer norm and nn.Linear.

Also, would you mind sharing the training details with us, e.g. corpus, model size, and hyper-parameters? We'd like to look into the setting that shows the instability.
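
As a minimal PyTorch sketch of the bias-free advice above: the class names BiasFreeLayerNorm and TinyBlock, the layer sizes, and the feed-forward structure are all hypothetical and chosen only for illustration; this is not the torchscale RetNet code. The bias-free layer norm is written by hand on top of torch.nn.functional.layer_norm so that it does not depend on a particular PyTorch version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiasFreeLayerNorm(nn.Module):
    """Layer norm with a learnable scale but no bias (hypothetical helper)."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps  # 1e-5, the value recommended in the thread

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Passing bias=None keeps the normalization free of any bias term.
        return F.layer_norm(x, (x.shape[-1],), self.weight, None, self.eps)

class TinyBlock(nn.Module):
    """Hypothetical pre-norm feed-forward block with no bias anywhere."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = BiasFreeLayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden, bias=False)
        self.fc2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc2(F.gelu(self.fc1(self.norm(x))))

block = TinyBlock(dim=16, hidden=32)
x = torch.randn(2, 4, 16)
out = block(x)
param_names = [name for name, _ in block.named_parameters()]
print(out.shape, param_names)
```

Listing the parameter names is a quick way to confirm that no bias tensors remain in a module before starting a long run.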

hanlinxuy commented on August 25, 2024

Thank you very much! I will try again later with this new information.
