Comments (5)

DanFu09 commented on August 23, 2024

For the M2-BERT synthetics, we ran a non-causal form of associative recall to fine-tune the architecture.

For induction heads, the same analogy applies to what we implemented here, but you're right that the setup is a bit different.
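
To make this concrete, here is a minimal sketch of what an associative recall example can look like, with the causal and non-causal training setups noted in comments. The function name and vocabulary layout are illustrative only, not the actual safari / M2 data pipeline.

```python
import random

def make_ar_example(num_kv_pairs=8, vocab_size=20, seed=None):
    # Toy associative recall sequence: interleaved key-value pairs followed by a
    # query key; the label is that key's value. Illustrative sketch only.
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), num_kv_pairs)
    values = [rng.randrange(vocab_size, 2 * vocab_size) for _ in keys]
    seq = [tok for kv in zip(keys, values) for tok in kv]
    query = rng.choice(keys)
    label = values[keys.index(query)]
    return seq + [query], label

# Causal setup: a left-to-right LM is trained to predict every next token, and the
# test metric only scores the prediction at the final (query) position.
# Non-causal setup (as for the M2-BERT synthetic above): the model sees the whole
# sequence bidirectionally and is trained to predict only the label.
```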

Karami-m commented on August 23, 2024

Thanks for your quick response. Regarding this dataset: if the goal is to optimize the test accuracy reported in the paper, which is evaluated on the last token only, it seems we don't need to train the model as a causal language model; we could simply design the training loss to match the test loss by evaluating only the last token. I understand the goal was to design a benchmark that evaluates the autoregressive (causal) behaviour of the LM, but in this dataset the causal training procedure seems to act as a regularizer: it forces the model to predict every next token, while the final goal is the accuracy on the last token only.
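
To make the contrast concrete, here is a minimal sketch of the two objectives, assuming a generic PyTorch model that returns per-position logits of shape (batch, seq_len, vocab); the function names are illustrative, not from safari.

```python
import torch.nn.functional as F

def last_token_loss(logits, labels):
    # Matches the reported test metric: only the final position is scored.
    return F.cross_entropy(logits[:, -1, :], labels)

def causal_lm_loss(logits, tokens):
    # Standard next-token objective: every position predicts the token after it.
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
```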

DanFu09 commented on August 23, 2024

My intuition is that next-token prediction actually makes it easier for the model to learn the circuit it needs for associative recall, since there are more tokens it needs to predict correctly during training (roughly the second half of the tokens, once it has seen each key-value pair once). You can see this in the fact that the training accuracy jumps around the same time the test accuracy jumps: the model has learned a generalizable behavior, not just a shortcut.
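
As a rough illustration of that intuition: given a key-value sequence in which keys repeat, the positions whose next token is already determined by the prefix can be counted directly. This is only a sketch; the exact fraction depends on the data generator.

```python
def determined_positions(seq):
    # seq is [k1, v1, k2, v2, ...]; a value slot is predictable once its key has
    # already appeared earlier in the sequence.
    seen = set()
    determined = []
    for i in range(0, len(seq) - 1, 2):
        key = seq[i]
        if key in seen:
            determined.append(i)  # the token after position i is already known
        seen.add(key)
    return determined
```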

In any case, the synthetic is mostly useful insofar as it predicts downstream performance on language modeling. There are a lot of techniques that can solve associative recall (e.g., a Python function); the interesting bit is when we can use the synthetic to guess how well a layer will perform downstream.
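
For example, assuming the toy key/value layout sketched earlier in the thread (which may differ from the actual safari generator), a plain dictionary lookup solves the task exactly:

```python
def associative_recall_oracle(seq):
    # Keys sit at even indices with their values right after; the last token is the query.
    pairs = dict(zip(seq[:-1:2], seq[1:-1:2]))
    return pairs[seq[-1]]
```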

Karami-m commented on August 23, 2024

Thanks for your clarification. For the Monarch Mixer (M2) that you mentioned, did you use the causal form of M2 for the associative recall dataset?

I also have another question regarding the induction head dataset: does the same analogy (causality and next-token prediction) apply to it as well? I am asking because, in contrast to associative recall, this dataset only requires recalling the content after a special token that occurred in the middle of the sequence, not all of the tokens.
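
To frame the question, here is a minimal sketch of how I understand such an induction head example; it is illustrative only, not the generator in safari.

```python
import random

def make_induction_example(seq_len=30, vocab_size=20, special=0, seed=None):
    # Random token stream in which a special token appears twice; the test is
    # whether the model reproduces the token that followed its first occurrence.
    rng = random.Random(seed)
    seq = [rng.randrange(1, vocab_size) for _ in range(seq_len)]
    first = rng.randrange(1, seq_len // 2)  # first occurrence, early in the sequence
    seq[first] = special
    label = seq[first + 1]                  # token the model should recall
    seq[-1] = special                       # second occurrence at the end
    return seq, label
```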
