
Comments (9)

justheuristic commented on August 21, 2024

Hi!

May be overly presumptuous, but does it have to do with MoE? I see there is an AQLM Mixtral, but I'm guessing MoEs don't perform as well as a standard AQLM quant model.

We have quantized MoE models in the past, e.g. the Mixtral model you mentioned.
To the best of my knowledge, MoE models quantize roughly as well as non-MoE ones.

Any reason why one of 'em isn't on the to-do list?

There is no such reason, but unfortunately, this is not how it works.

We have very limited compute and "manpower", and what we have is split between research and model quantization.
The research comes first because if we don't do that quickly enough, the lab will be shut down.

So, our "GPU budget" for releasing quantized models is only enough to quantize a few of the most popular models, like Llama and Phi-3. For the rest, we release the code in the hope that volunteers will run it and publish their own pre-quantized models.

As such, we hope we can reach ChatQA or Wizard eventually, but we can't guarantee anything. If tomorrow we find out we need those GPUs to get our research done in time, the extra models will have to wait.

the Llama 3-70B models are both linked to the same model.

Please specify which ones? (a link would be nice)

justheuristic commented on August 21, 2024

Hi! Sorry for not responding to you for so long. Unfortunately, we are also quite limited on compute, but I've added this to the queue of models to quantize eventually. Right now we're processing phi-3 medium, then qwen2, then this

justheuristic commented on August 21, 2024

Yes, you would indeed need group size 8.

1x 16-bit code per 8 weights gives you 2 bits per weight, plus some extra bits for the codebook itself.
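To make that arithmetic concrete, here's a rough back-of-the-envelope sketch (the 8192x8192 layer shape and the fp16 codebook entries are just illustrative assumptions, not figures for any particular model):

    # Bits per weight for a "1x16g8" AQLM layer (back-of-the-envelope sketch).
    num_codebooks = 1        # "1x"  -> a single codebook
    nbits_per_codebook = 16  # "16"  -> 2**16 codebook entries
    group_size = 8           # "g8"  -> one code covers 8 consecutive weights

    code_bits_per_weight = num_codebooks * nbits_per_codebook / group_size  # 2.0

    # The codebook stores 2**16 vectors of 8 values; assume fp16 (16-bit) entries.
    codebook_bits = num_codebooks * 2**nbits_per_codebook * group_size * 16

    # Amortized over a hypothetical 8192x8192 linear layer.
    layer_weights = 8192 * 8192
    extra_bits_per_weight = codebook_bits / layer_weights  # ~0.125

    print(code_bits_per_weight + extra_bits_per_weight)  # ~2.13 bits per weight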

I've specified an example training script for this exact case (Llama-3-70B derivative, ~2 bits per weight) here: #98 (comment)

justheuristic commented on August 21, 2024

Good catch! I fixed the link in this commit, with an acknowledgement.

Your setup would be enough to quantize smaller LLMs (e.g. 7B, maybe 13B with basic tweaking), but you are right that it would probably OOM for 70B+ models.
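For a rough sense of scale (back-of-the-envelope only, counting just the raw fp16 weights and ignoring calibration data and any quantization state):

    # Memory needed just to hold the unquantized weights in fp16.
    bytes_per_param = 2  # fp16
    for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        gib = params * bytes_per_param / 2**30
        print(f"{name}: ~{gib:.0f} GiB")
    # 7B:  ~13 GiB  -- fits in 64 GB of unified memory with room to spare
    # 13B: ~24 GiB
    # 70B: ~130 GiB -- already exceeds 64 GB before any quantization overhead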

BuildBackBuehler commented on August 21, 2024

Hi! Sorry for not responding to you for so long. Unfortunately, we are also quite limited on compute, but I've added this to the queue of models to quantize eventually. Right now we're processing phi-3 medium, then qwen2, then this

Not too bad of a wait! Worth being patient =)

Totally OT, but I'd like to (try to, at least) run a model with MLC-LLM, and I'm wondering what the equivalent would be to these quantization settings:

https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/quantization/quantization.py

I was figuring something like (for Llama-3-70B 2-bit 1x16, for ex.):

        "AQLM_2bit": GroupQuantize(
        name="AQLM_2bit",
        kind="group-quant",
        group_size=16,
        quantize_dtype="int2",
        storage_dtype="uint32",
        model_dtype="float16",
        linear_weight_layout="NK",

But I dunno; with the group in vs. out, it needs to be divisible by the quantized dtype IIRC. So perhaps group_size=8 and, without a direct equivalent, adding:

    nbits_per_codebook=16
    num_codebooks=1
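For comparison, on the AQLM side that "1x16, group size 8" scheme would be spelled out roughly like this (a sketch based on the AqlmConfig exposed by transformers; field names may differ between versions):

    from transformers import AqlmConfig

    # The "1x16g8" setup discussed above: one 16-bit codebook over groups of 8 input weights.
    aqlm_config = AqlmConfig(
        in_group_size=8,        # weights per code along the input dimension ("g8")
        out_group_size=1,       # no grouping along the output dimension
        num_codebooks=1,        # "1x" -> a single codebook
        nbits_per_codebook=16,  # "16" -> 2**16 codebook entries
    )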

BuildBackBuehler commented on August 21, 2024

Hi! Sorry for not responding to you for so long. Unfortunately, we are also quite limited on compute, but I've added this to the queue of models to quantize eventually. Right now we're processing phi-3 medium, then qwen2, then this

I was also interested in Wizard 8x22B @ 2-bit and there was a thread created with regards to its base, Mixtral 8x22B. Any reason why one of 'em isn't on the to-do list? May be overly presumptuous, but does it have to do with MoE? I see there is an AQLM Mixtral, but I'm guessing MoEs don't perform as well as a standard AQLM quant model.

And thank you for the rapid responses and that comprehensive example script, super clutch! Shall I go ahead and close this or keep it open until the model is up and running?

Edit: Just noticed the PV-tuning models and a small FYI -- looks like the Llama 3-70B models are both linked to the same model.

AKA Meta-Llama-3-70B | 1x16g16 is the one whose link needs correction.

BuildBackBuehler commented on August 21, 2024

Hi!

May be overly presumptuous, but does it have to do with MoE? I see there is an AQLM Mixtral, but I'm guessing MoEs don't perform as well as a standard AQLM quant model.

We have quantized MoE models in the past, e.g. the Mixtral model you mentioned. To the best of my knowledge, MoE models quantize roughly as well as non-MoE ones.

Any reason why one of 'em isn't on the to-do list?

There is no such reason, but unfortunately, this is not how it works.

We have very limited compute and "manpower", and what we have is split between research and model quantization. The research comes first because if we don't do that quickly enough, the lab will be shut down.

So, our "GPU budget" for releasing quantized models is only enough to quantize a few of the most popular models, like Llama and Phi-3. For the rest, we release the code in the hope that volunteers will run it and publish their own pre-quantized models.

As such, we hope we can reach ChatQA or Wizard eventually, but we can't guarantee anything. If tomorrow we find out we need those GPUs to get our research done in time, the extra models will have to wait.

the Llama 3-70B models are both linked to the same model.

Please specify which ones? (a link would be nice)

Hah, aw. On one hand I know how competitive academic research funding can be, so it makes sense; on the other, given such a cutting-edge field and producing the SotA quantization, I'm surprised ISTA doesn't have a hefty grant for y'all (=

I wish I had more computing power. I haven't read it in a while, but it sounded like anything less than an A100 is SOL. I'm using an M1 Max w/ 64GB VRAM; I'd be fine if it were compiling in the background for days, but I imagine if I didn't OOM for a 70B or 8x22B model, it'd take a month 😂. Unless it's set up for multi-computer quantization, but that'd be a mess and half of what I'd need in power, I imagine (RTX3070/Tesla M40 24GB).

https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16 -- it's the 13GB model (3rd from the bottom) that also has this as its link. It appears that the model in question is not up on Hugging Face, so maybe that was an intentional placeholder?

github-actions commented on August 21, 2024

This issue is stale because it has been open for 30 days with no activity.

github-actions commented on August 21, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
