
Comments (7)

vpj avatar vpj commented on April 27, 2024

I think you are right! I've only had a brief look so far, and I'm double-checking because it's surprising this wasn't caught for so long. We even used the same implementation in a few other models and it worked fine.

Maybe multiplying the input by the gate gives a gradient signal for how much each expert "likes" to handle that input, and this may actually be working.
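To make the two variants concrete, here is a minimal sketch (not the repo's code; the toy FFN expert, names, and shapes are my assumptions) of multiplying the gate into the expert's input versus its output:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)       # token representation
p = 0.5                          # routing probability of the picked expert
W1 = rng.standard_normal((16, 8))
b1 = rng.standard_normal(16)
W2 = rng.standard_normal((8, 16))

def expert(h):
    # toy FFN expert: Linear -> ReLU -> Linear
    return W2 @ np.maximum(W1 @ h + b1, 0.0)

# "Early" multiplication: scale the *input* by the gate probability.
y_early = expert(p * x)

# "Late" multiplication: scale the expert's *output*
# (the Switch-Transformer-style formulation).
y_late = p * expert(x)

# Because of the bias and nonlinearity the two are not equivalent,
# and they send different gradient signals to the router.
assert not np.allclose(y_early, y_late)
```

With a purely linear, bias-free expert the two would coincide (ReLU is positively homogeneous), which may be why the discrepancy went unnoticed.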

I will spend a little more time checking before fixing this.

Thank you

from annotated_deep_learning_paper_implementations.

hobbitlzy avatar hobbitlzy commented on April 27, 2024

😁 Thanks for your attention. Actually, we also ran several experiments under this setting and everything looked fine.

Another attempt I made was to derive the gradient flow using the chain rule. The difference between the two settings is that with early multiplication, the update of the gate includes the gradient of the expert. I still haven't figured out exactly what that means and am working on it.
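The chain-rule difference can be checked on a scalar toy case (the `tanh` "expert" and the variable names here are illustrative assumptions, not the repo's code). With early multiplication `y = f(p*x)`, the gate gradient `dy/dp = f'(p*x) * x` involves the expert's derivative; with late multiplication `y = p * f(x)`, it is just the expert's output `f(x)`:

```python
import numpy as np

def expert(h):           # toy nonlinear "expert"
    return np.tanh(h)

def d_expert(h):         # its derivative
    return 1.0 - np.tanh(h) ** 2

x, p = 1.3, 0.5

# Early multiplication: y = f(p * x)  =>  dy/dp = f'(p * x) * x
grad_early = d_expert(p * x) * x

# Late multiplication:  y = p * f(x)  =>  dy/dp = f(x)
grad_late = expert(x)

# Finite-difference sanity check of both derivatives
eps = 1e-6
fd_early = (expert((p + eps) * x) - expert((p - eps) * x)) / (2 * eps)
fd_late = ((p + eps) * expert(x) - (p - eps) * expert(x)) / (2 * eps)
assert abs(grad_early - fd_early) < 1e-6
assert abs(grad_late - fd_late) < 1e-6
```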

Hope this will help.


vpj avatar vpj commented on April 27, 2024

Fixed it! Thanks again for pointing it out.


vpj avatar vpj commented on April 27, 2024

Interesting, will try it out and let you know


hobbitlzy avatar hobbitlzy commented on April 27, 2024

Sorry, I think I misinterpreted the results. It was because I set a large load-balancing loss coefficient, so it dominated the gradient update. Early multiplication and late multiplication do indeed behave differently.
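For context, here is a minimal sketch of a Switch-style load-balancing auxiliary loss and the coefficient in question (the function and variable names are mine, chosen for illustration):

```python
import numpy as np

def load_balance_loss(route_probs, picked):
    """route_probs: [tokens, experts] softmax outputs;
    picked: [tokens] index of the expert chosen per token."""
    n_experts = route_probs.shape[1]
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(picked, minlength=n_experts) / len(picked)
    # P_i: mean routing probability mass on expert i
    P = route_probs.mean(axis=0)
    return n_experts * float(f @ P)

rng = np.random.default_rng(0)
logits = rng.standard_normal((64, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
picked = probs.argmax(axis=1)

loss = load_balance_loss(probs, picked)

# A large coefficient on this term can dominate the router's gradient,
# masking the difference between the two gating variants:
total_aux = 10.0 * loss   # coefficient chosen purely for illustration
assert loss > 0.0
```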

By the way, I am curious what the routing probabilities are supposed to look like. Should they be close to 1 for a certain expert with the others small, or almost evenly distributed among the experts?


vpj avatar vpj commented on April 27, 2024

It seems like they get around 0.5 for the selected expert. This may be different for larger datasets where model capacity is a bottleneck.

https://app.labml.ai/run/2de889d0185c11ecb8bbbdc36d3aa63a/metrics

I also tested without multiplication before I saw your comment and here's what I got

https://app.labml.ai/run/43358364185d11ecbebaabd247ae98e1/metrics
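A quick way to measure the statistic being discussed (the average probability the router assigns to its selected expert) is sketched below; this is purely illustrative, not the repo's logging code:

```python
import numpy as np

def mean_selected_prob(route_probs):
    # route_probs: [tokens, experts]; take the probability of the
    # argmax expert for each token and average over tokens.
    return float(route_probs.max(axis=1).mean())

# With 4 experts and a perfectly uniform router this is 0.25; values
# near 0.5 indicate moderate specialization, values near 1.0 indicate
# hard, confident routing.
uniform = np.full((10, 4), 0.25)
assert mean_selected_prob(uniform) == 0.25
```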


hobbitlzy avatar hobbitlzy commented on April 27, 2024

Thanks! I notice there are four experts in your model, so the picked expert gets about half the probability and the others share the remaining half. I get similar results too, but in other settings I also get an extremely biased distribution. Do you know which one is expected?

As for the second run, thanks for taking the time to run it. But I am curious why the loss of the without-multiplication version is smaller than that of the normal version. Does this mean it is better to have no gradient signal from the experts?

