
Comments (7)

BlackSamorez commented on July 22, 2024

I've merged #39 and released aqlm==1.1.0 where I got rid of the need to use aqlm.optimize_for_training(). Everything is determined automatically from here on.

from aqlm.

hiyouga commented on July 22, 2024

Sounds great! We will instruct users to use the latest AQLM in our training framework.


BlackSamorez commented on July 22, 2024

Hi @hiyouga!
Sadly, it's not properly documented yet, but you should do:

import aqlm
from transformers import AutoModelForCausalLM

# Load the model inside the wrapper so the AQLM layers are built
# with the training-friendly kernels.
with aqlm.optimize_for_training():
    model = AutoModelForCausalLM.from_pretrained(...)

The thing is, there are a few ways to compute a forward pass: some of them work better for a very small number of tokens (e.g. generation), and some are optimized for large batch sizes (e.g. training). We're hoping to determine which kernels to use dynamically in later versions of aqlm, but, for now, please add that wrapper explicitly. Also, keep in mind that a model loaded under that wrapper will be very slow at generation. We're working on making it a more pleasant experience!
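Conceptually, the wrapper behaves like a process-wide mode flag consulted when layers are built. Below is a minimal sketch of that idea, not aqlm's actual implementation; the flag name `_TRAINING_MODE` and the kernel labels are invented for illustration:

```python
from contextlib import contextmanager

# Illustrative global flag; aqlm's real mechanism may differ.
_TRAINING_MODE = False

@contextmanager
def optimize_for_training():
    """While active, layers constructed inside pick the large-batch kernel."""
    global _TRAINING_MODE
    _TRAINING_MODE = True
    try:
        yield
    finally:
        _TRAINING_MODE = False

def pick_kernel() -> str:
    # Layers built under the wrapper are committed to the training-friendly
    # kernel, even for later single-token generation -- which is why
    # generation gets slow for a model loaded this way.
    return "large_batch_matmul" if _TRAINING_MODE else "single_token_gemv"

with optimize_for_training():
    kernel = pick_kernel()  # -> "large_batch_matmul"
```

The key consequence shown here: the choice is made at load time, not per forward pass, so a model loaded for training keeps the large-batch kernel everywhere.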


hiyouga commented on July 22, 2024

@BlackSamorez Indeed! This is very important to me; I will try to fine-tune the model again following your advice. Thanks for pointing it out!


BlackSamorez commented on July 22, 2024

A bit more context: these are the speeds for a typical layer on an RTX 3090 GPU. We have a kernel for a single-token pass (generation), which is slightly faster than fp16, and a kernel that introduces a huge but constant overhead over fp16, meaning it's asymptotically as fast as fp16.

| num_tokens (batch_size × seq_len) | with `optimize_for_training`, ms/pass | without `optimize_for_training`, ms/pass | fp16 baseline, ms/pass |
|---:|---:|---:|---:|
| 1 | 4.71 | 0.18 | 0.14 |
| 4 | 4.69 | 0.53 | 0.14 |
| 16 | 4.70 | 1.91 | 0.14 |
| 64 | 4.72 | 7.43 | 0.16 |
| 256 | 5.02 | too slow | 0.46 |
| 1024 | 6.14 | too slow | 1.57 |
| 4096 | 10.04 | too slow | 5.54 |
| 16384 | 25.68 | too slow | 21.15 |
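The "huge but constant overhead" claim can be sanity-checked against these numbers with a quick back-of-the-envelope calculation (the measurements are copied from the table above; only the pairing into a dict is mine):

```python
# Training-kernel time vs. fp16 baseline (ms/pass) on an RTX 3090,
# taken from the table above.
measured = {
    1: (4.71, 0.14), 64: (4.72, 0.16), 256: (5.02, 0.46),
    1024: (6.14, 1.57), 4096: (10.04, 5.54), 16384: (25.68, 21.15),
}

# Overhead = training kernel minus fp16. It hovers around 4.5 ms at every
# token count, so the per-pass cost is fp16 plus a constant: asymptotically
# the kernel is as fast as fp16, but it dominates small passes.
overheads = {n: round(aqlm_ms - fp16_ms, 2) for n, (aqlm_ms, fp16_ms) in measured.items()}
```

For example, the overhead is 4.57 ms at 1 token and still 4.53 ms at 16384 tokens, which is why the wrapper barely matters for large training batches but cripples single-token generation.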

As of now, we don't have good enough kernels for anything between 4 and 4000 tokens processed in a pass. We're hoping to implement them someday.
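A dynamic dispatcher of the kind hinted at above could simply pick whichever measured kernel is cheaper for a given token count. The sketch below is a hypothetical heuristic built from this table, not aqlm's real selection logic:

```python
# Hypothetical dispatcher: choose the cheaper of the two measured kernels
# per token count, using the RTX 3090 numbers from the table above.
# ("too slow" entries are recorded as None and treated as unusable.)
MEASURED_MS = {
    # num_tokens: (training_kernel_ms, generation_kernel_ms)
    1: (4.71, 0.18),
    4: (4.69, 0.53),
    16: (4.70, 1.91),
    64: (4.72, 7.43),
    256: (5.02, None),
    1024: (6.14, None),
    4096: (10.04, None),
    16384: (25.68, None),
}

def pick_kernel(num_tokens: int) -> str:
    """Return which kernel is faster for this pass size."""
    train_ms, gen_ms = MEASURED_MS[num_tokens]
    if gen_ms is None or train_ms < gen_ms:
        return "training"
    return "generation"
```

On these numbers the crossover falls between 16 and 64 tokens, which matches the observation that neither kernel serves the mid-range well.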


hiyouga commented on July 22, 2024

I see. Generation is much faster than training, and the gap might also be related to the gradient checkpointing technique used in training.


github-actions commented on July 22, 2024

This issue is stale because it has been open for 30 days with no activity.

