Comments (2)
Please note that the MSR block includes an additional swish gate compared with the MHSA block in the vanilla Transformer. If we do not double the dimension of v, the MSR block has 5d^2 parameters, while the MHSA block in the vanilla Transformer has only 4d^2. This makes it hard to choose a width and depth for RetNet that give a fair comparison against a baseline vanilla Transformer of the same size. Therefore, the authors decided to double the output dimension of W_V and halve d_ffn (from 4d to 2d), keeping the total parameter count of each RetNet block equal to 12d^2.
Alternatively, another option is to keep W_V the same shape as W_K and set d_ffn to 3.5d. However, a wider swish gate is preferable to a wider MLP as the FFN; for details, please refer to https://arxiv.org/abs/2202.10447. I believe it is even better to use the MSR block only and set d_v = 3.33d.
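The parameter accounting above can be checked with a quick sketch. This is my own illustrative counting, not torchscale code: it assumes bias-free projections and counts W_Q, W_K, W_V, the swish-gate projection W_G, and the output projection W_O for the MSR block.

```python
# Hedged sketch: verify the per-block parameter counts discussed above.
# Helper names and the bias-free counting are assumptions for illustration,
# not torchscale's actual API.

def msr_params(d, v_mult):
    """MSR block: W_Q and W_K are d x d; W_V and the swish-gate
    projection W_G are d x (v_mult*d); W_O maps v_mult*d back to d."""
    return d * d + d * d + 2 * d * (v_mult * d) + (v_mult * d) * d

def ffn_params(d, d_ffn):
    """Two-layer FFN: d -> d_ffn -> d (biases ignored)."""
    return 2 * d * d_ffn

d = 256  # any model width; all counts scale as d^2

# Vanilla Transformer block for reference: 4d^2 (MHSA) + 8d^2 (FFN) = 12d^2
assert 4 * d * d + ffn_params(d, 4 * d) == 12 * d * d

# Paper's choice: double v (v_mult=2), halve d_ffn to 2d -> 8d^2 + 4d^2
assert msr_params(d, 2) + ffn_params(d, 2 * d) == 12 * d * d

# Alternative: keep v at d, widen the FFN to 3.5d -> 5d^2 + 7d^2
assert msr_params(d, 1) + ffn_params(d, 3.5 * d) == 12 * d * d

# MSR-only variant: no FFN, d_v = (10/3) d, i.e. roughly 3.33d
assert abs(msr_params(d, 10 / 3) - 12 * d * d) < 1e-6 * d * d
```

Each configuration lands at the same 12d^2 per block, which is what makes the width/depth comparison with the vanilla Transformer fair.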
from torchscale.
Cool, pretty much makes sense to me. Thanks for the thorough explanation.