
Comments (7)

vpj avatar vpj commented on April 27, 2024

I think you are right! I've only had a brief look so far, and I'm double-checking because it's surprising this wasn't caught for so long. We even used the same implementation in a few other models and it worked fine.

Maybe multiplying the input by the gate gives a gradient signal for how much each expert "likes" to handle that input, and this may actually be working.
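To make the two variants concrete, here is a minimal sketch (not the repo's code; the toy FFN expert, names, and shapes are my assumptions) of multiplying the gate into the expert's input versus its output:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)       # token representation
p = 0.5                          # routing probability of the picked expert
W1 = rng.standard_normal((16, 8))
b1 = rng.standard_normal(16)
W2 = rng.standard_normal((8, 16))

def expert(h):
    # toy FFN expert: Linear -> ReLU -> Linear
    return W2 @ np.maximum(W1 @ h + b1, 0.0)

# "Early" multiplication: scale the *input* by the gate probability.
y_early = expert(p * x)

# "Late" multiplication: scale the expert's *output*
# (the Switch-Transformer-style formulation).
y_late = p * expert(x)

# Because of the bias and nonlinearity the two are not equivalent,
# and they send different gradient signals to the router.
assert not np.allclose(y_early, y_late)
```

With a purely linear, bias-free expert the two would coincide (ReLU is positively homogeneous), which may be why the discrepancy went unnoticed.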

I will spend a little more time checking before fixing this.

Thank you

from annotated_deep_learning_paper_implementations.

hobbitlzy avatar hobbitlzy commented on April 27, 2024

😁 Thanks for your attention. Actually, we also ran several experiments under this setting and everything looked fine.

Another attempt I made was to derive the gradient flow using the chain rule. The difference between the two settings is that with early multiplication, the update of the gate includes the gradient of the expert. I still haven't figured out exactly what that means and am working on it.
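The chain-rule difference can be checked on a scalar toy case (the `tanh` "expert" and the variable names here are illustrative assumptions, not the repo's code). With early multiplication `y = f(p*x)`, the gate gradient `dy/dp = f'(p*x) * x` involves the expert's derivative; with late multiplication `y = p * f(x)`, it is just the expert's output `f(x)`:

```python
import numpy as np

def expert(h):           # toy nonlinear "expert"
    return np.tanh(h)

def d_expert(h):         # its derivative
    return 1.0 - np.tanh(h) ** 2

x, p = 1.3, 0.5

# Early multiplication: y = f(p * x)  =>  dy/dp = f'(p * x) * x
grad_early = d_expert(p * x) * x

# Late multiplication:  y = p * f(x)  =>  dy/dp = f(x)
grad_late = expert(x)

# Finite-difference sanity check of both derivatives
eps = 1e-6
fd_early = (expert((p + eps) * x) - expert((p - eps) * x)) / (2 * eps)
fd_late = ((p + eps) * expert(x) - (p - eps) * expert(x)) / (2 * eps)
assert abs(grad_early - fd_early) < 1e-6
assert abs(grad_late - fd_late) < 1e-6
```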

Hope this will help.


vpj avatar vpj commented on April 27, 2024

Fixed it! Thanks again for pointing it out.


vpj avatar vpj commented on April 27, 2024

Interesting, will try it out and let you know


hobbitlzy avatar hobbitlzy commented on April 27, 2024

Sorry, I think I misinterpreted the results. It was because I set a large load-balancing loss coefficient, so it dominated the gradient update. Early multiplication and late multiplication do indeed behave differently.
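For context, here is a minimal sketch of a Switch-style load-balancing auxiliary loss and the coefficient in question (the function and variable names are mine, chosen for illustration):

```python
import numpy as np

def load_balance_loss(route_probs, picked):
    """route_probs: [tokens, experts] softmax outputs;
    picked: [tokens] index of the expert chosen per token."""
    n_experts = route_probs.shape[1]
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(picked, minlength=n_experts) / len(picked)
    # P_i: mean routing probability mass on expert i
    P = route_probs.mean(axis=0)
    return n_experts * float(f @ P)

rng = np.random.default_rng(0)
logits = rng.standard_normal((64, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
picked = probs.argmax(axis=1)

loss = load_balance_loss(probs, picked)

# A large coefficient on this term can dominate the router's gradient,
# masking the difference between the two gating variants:
total_aux = 10.0 * loss   # coefficient chosen purely for illustration
assert loss > 0.0
```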

By the way, I am curious what the routing probabilities are supposed to look like. Should they be close to 1 for a certain expert with the others small, or almost evenly distributed among the experts?


vpj avatar vpj commented on April 27, 2024

It seems like they get around 0.5 for the selected expert. This may be different for larger datasets where model capacity is a bottleneck.

https://app.labml.ai/run/2de889d0185c11ecb8bbbdc36d3aa63a/metrics

I also tested without multiplication before I saw your comment and here's what I got

https://app.labml.ai/run/43358364185d11ecbebaabd247ae98e1/metrics
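A quick way to measure the statistic being discussed (the average probability the router assigns to its selected expert) is sketched below; this is purely illustrative, not the repo's logging code:

```python
import numpy as np

def mean_selected_prob(route_probs):
    # route_probs: [tokens, experts]; take the probability of the
    # argmax expert for each token and average over tokens.
    return float(route_probs.max(axis=1).mean())

# With 4 experts and a perfectly uniform router this is 0.25; values
# near 0.5 indicate moderate specialization, values near 1.0 indicate
# hard, confident routing.
uniform = np.full((10, 4), 0.25)
assert mean_selected_prob(uniform) == 0.25
```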


hobbitlzy avatar hobbitlzy commented on April 27, 2024

Thanks! I notice there are four experts in your model, so the picked expert gets about half the probability and the others share the remaining half. I get similar results too, but in other settings I also get an extremely biased distribution. Do you know which one is expected?

As for the second run, thanks for taking the time to run it. But I am curious why the loss of the without-multiplication version is smaller than that of the normal version. Does this mean it is better to have no gradient signal from the experts?

