Comments (9)
Hi!
May be overly presumptuous, but does it have to do with MoE? I see there is an AQLM Mixtral, but I'm guessing MoEs don't perform as well as a standard AQLM quant model.
We have quantized MoE models in the past, e.g. the Mixtral model you mentioned.
To the best of my knowledge, MoE models quantize roughly as well as non-MoE ones.
Any reason why one of 'em isn't on the to-do list?
There is no such reason, but unfortunately, this is not how it works.
We have very limited compute and "manpower", and what we have is split between research and model quantization.
The research comes first because if we don't do that quickly enough, the lab will be shut down.
So, our "GPU budget" for releasing quantized models is only enough to quantize a few of the most popular models, like Llama and Phi-3. For the rest, we release the code in the hope that volunteers will run it and publish their own pre-quantized models.
As such, we hope we can reach ChatQA or Wizard eventually, but we can't guarantee anything. If tomorrow we find out we need those GPUs to get our research done in time, the extra models will have to wait.
the Llama 3-70B models are both linked to the same model.
Please specify which ones? (a link would be nice)
from aqlm.
Hi! Sorry for not responding to you for so long. Unfortunately, we are also quite limited on compute, but I've added this to the queue of models to quantize eventually. Right now we're processing Phi-3 Medium, then Qwen2, then this one.
Yes, you would indeed need group size 8.
One 16-bit code per 8 weights gives you 2 bits per weight, plus some extra bits for the codebook itself.
I've provided an example training script for this exact case (a Llama-3-70B derivative at ~2 bits per weight) here: #98 (comment)
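To make that arithmetic concrete, here's a back-of-envelope sketch. The layer size and fp16 codebook entries are illustrative assumptions, not measured values from our models:

```python
# Rough effective-bitrate estimate for the AQLM "1x16, group size 8" scheme.
# Assumes one shared codebook per linear layer with fp16 entries; the
# 8192x8192 layer size is an illustrative assumption.
def bits_per_weight(num_codebooks=1, nbits_per_codebook=16,
                    group_size=8, layer_weights=8192 * 8192):
    # Each group of `group_size` weights stores one code per codebook.
    code_bits = num_codebooks * nbits_per_codebook / group_size
    # Codebook storage: 2^nbits entries, each a group_size-long fp16 vector.
    codebook_bits = num_codebooks * (2 ** nbits_per_codebook) * group_size * 16
    return code_bits + codebook_bits / layer_weights

print(f"{bits_per_weight():.3f} bits/weight")  # codes alone are exactly 2.0
```

The codebook overhead shrinks as the layer grows, which is why the big 70B layers land very close to the nominal 2 bits per weight.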
Good catch! I fixed the link in this commit, with an acknowledgement.
Your setup would be enough to quantize smaller LLMs (e.g. 7B, maybe 13B with basic tweaking), but you're right that it would probably OOM for 70B+ models.
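For anyone gauging their own hardware, a minimal check is to count just the fp16 weights. This is a hard floor, not a real estimate: actual quantization also needs calibration activations, accumulators, and codebook optimization state on top:

```python
# Floor on memory needed just to hold the fp16 weights; real AQLM
# quantization (calibration data, per-layer optimization) needs more.
def fp16_weight_gib(params_billion):
    return params_billion * 1e9 * 2 / 1024**3  # 2 bytes per fp16 weight

for n_b in (7, 13, 70):
    print(f"{n_b}B model: ~{fp16_weight_gib(n_b):.0f} GiB of fp16 weights")
```

A 70B model already needs ~130 GiB for the weights alone, which is why 64 GB setups run out of memory before quantization even starts.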
Not too bad of a wait! Worth being patient =)
Totally OT, but I'd like to (try to, at least) run a model with MLC-LLM and I'm wondering what the equivalent would be to these Quant. Settings.
https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/quantization/quantization.py
I was figuring something like (for Llama-3-70B 2-bit 1x16, for example):

```python
"AQLM_2bit": GroupQuantize(
    name="AQLM_2bit",
    kind="group-quant",
    group_size=16,
    quantize_dtype="int2",
    storage_dtype="uint32",
    model_dtype="float16",
    linear_weight_layout="NK",
),
```

But I dunno, with the group-in vs. group-out distinction, the group size needs to be divisible by the quant dtype IIRC. So perhaps `group_size=8`, and there's no equivalent for adding:

```python
nbits_per_codebook=16
num_codebooks=1
```
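For reference, here's my understanding of how the AQLM scheme names map onto its parameters. The field names are my assumption based on the `aqlm`/Transformers integration, so treat them as a sketch rather than a verified config:

```python
# "MxN" in AQLM model names = num_codebooks x nbits_per_codebook;
# in_group_size is how many consecutive input-dim weights share one code.
# Field names are assumed from the aqlm/Transformers integration.
aqlm_2bit_1x16 = {
    "quant_method": "aqlm",
    "num_codebooks": 1,        # the "1x"
    "nbits_per_codebook": 16,  # the "16"
    "in_group_size": 8,        # 8 weights per code -> 1*16/8 = 2 bits/weight
    "out_group_size": 1,
}

code_bits = aqlm_2bit_1x16["num_codebooks"] * aqlm_2bit_1x16["nbits_per_codebook"]
weights_per_code = aqlm_2bit_1x16["in_group_size"] * aqlm_2bit_1x16["out_group_size"]
print(code_bits / weights_per_code, "bits per weight (excluding codebook storage)")
```

If that mapping is right, MLC's single `group_size` knob has no slot for the codebook parameters, which is what makes a direct translation awkward.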
I was also interested in Wizard 8x22B @ 2-bit and there was a thread created with regards to its base, Mixtral 8x22B. Any reason why one of 'em isn't on the to-do list? May be overly presumptuous, but does it have to do with MoE? I see there is an AQLM Mixtral, but I'm guessing MoEs don't perform as well as a standard AQLM quant model.
And thank you for the rapid responses and that comprehensive example script, super clutch! Shall I go ahead and close this or keep it open until the model is up and running?
Edit: Just noticed the PV-tuning models and a small FYI -- looks like the Llama 3-70B models are both linked to the same model.
AKA Meta-Llama-3-70B | 1x16g16's link is the one needing correction
Hah aw, on one hand I know how competitive academic research funding can be, so it makes sense; on the other, I'm surprised that in such a cutting-edge field, and while producing the SotA quant, ISTA doesn't have a hefty grant for y'all (=
I wish I had more computing power. I haven't read it in a while, but it sounded like anything less than an A100 is SOL. I'm using an M1 Max w/ 64GB of unified memory; I'd be fine if it were crunching in the background for days, but I imagine that even if I didn't OOM on a 70B or 8x22B model, it'd take a month 😂. Unless it's set up for multi-computer quantization, but that'd be a mess and half of what I'd need in power, I imagine (RTX 3070 / Tesla M40 24GB).
https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16 is the 13GB model that also has this as its link (3rd from the bottom). It appears that the model in question is not up on Hugging Face, so maybe that was an intentional placeholder?
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.