Comments (9)
Hi!
May be overly presumptuous, but does it have to do with MoE? I see there is an AQLM Mixtral, but I'm guessing MoEs don't perform as well as a standard AQLM quant model.
We have quantized MoE models in the past, e.g. the Mixtral model you mentioned.
To the best of my knowledge, MoE models quantize roughly as well as non-MoE ones.
Any reason why one of 'em isn't on the to-do list?
There is no such reason, but unfortunately, this is not how it works.
We have very limited compute and "manpower", and what we have is split between research and model quantization.
The research comes first because if we don't do that quickly enough, the lab will be shut down.
So, our "GPU budget" for releasing quantized models is only enough to quantize a few of the most popular models, like Llama and Phi-3. For the rest, we release the code in the hope that volunteers will run it and publish their own pre-quantized models.
As such, we hope we can reach ChatQA or Wizard eventually, but we can't guarantee anything. If tomorrow we find out we need those GPUs to get our research done in time, the extra models will have to wait.
the Llama 3-70B models are both linked to the same model.
Please specify which ones? (a link would be nice)
from aqlm.
Hi! Sorry for not responding to you for so long. Unfortunately, we are also quite limited on compute, but I've added this to the queue of models to quantize eventually. Right now we're processing Phi-3 Medium, then Qwen2, then this one.
Yes, you would indeed need group size 8.
One 16-bit code per 8 weights gives you 2 bits per weight, plus some extra bits for the codebook itself.
I've provided an example training script for this exact case (a Llama-3-70B derivative at ~2 bits per weight) here: #98 (comment)
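To make that arithmetic concrete, here's a back-of-envelope sketch. The layer size and fp16 codebook entries are illustrative assumptions, not measured values from our models:

```python
# Rough effective-bitrate estimate for the AQLM "1x16, group size 8" scheme.
# Assumes one shared codebook per linear layer with fp16 entries; the
# 8192x8192 layer size is an illustrative assumption.
def bits_per_weight(num_codebooks=1, nbits_per_codebook=16,
                    group_size=8, layer_weights=8192 * 8192):
    # Each group of `group_size` weights stores one code per codebook.
    code_bits = num_codebooks * nbits_per_codebook / group_size
    # Codebook storage: 2^nbits entries, each a group_size-long fp16 vector.
    codebook_bits = num_codebooks * (2 ** nbits_per_codebook) * group_size * 16
    return code_bits + codebook_bits / layer_weights

print(f"{bits_per_weight():.3f} bits/weight")  # codes alone are exactly 2.0
```

The codebook overhead shrinks as the layer grows, which is why the big 70B layers land very close to the nominal 2 bits per weight.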
Good catch! I fixed the link in this commit, with an acknowledgement.
Your setup would be enough to quantize smaller LLMs (e.g. 7B, maybe 13B with basic tweaking), but you're right that it would probably OOM for 70B+ models.
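For anyone gauging their own hardware, a minimal check is to count just the fp16 weights. This is a hard floor, not a real estimate: actual quantization also needs calibration activations, accumulators, and codebook optimization state on top:

```python
# Floor on memory needed just to hold the fp16 weights; real AQLM
# quantization (calibration data, per-layer optimization) needs more.
def fp16_weight_gib(params_billion):
    return params_billion * 1e9 * 2 / 1024**3  # 2 bytes per fp16 weight

for n_b in (7, 13, 70):
    print(f"{n_b}B model: ~{fp16_weight_gib(n_b):.0f} GiB of fp16 weights")
```

A 70B model already needs ~130 GiB for the weights alone, which is why 64 GB setups run out of memory before quantization even starts.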
Not too bad of a wait! Worth being patient =)
Totally OT, but I'd like to (try to, at least) run a model with MLC-LLM and I'm wondering what the equivalent would be to these Quant. Settings.
https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/quantization/quantization.py
I was figuring something like (for Llama-3-70B 2-bit 1x16, for example):

```python
"AQLM_2bit": GroupQuantize(
    name="AQLM_2bit",
    kind="group-quant",
    group_size=16,
    quantize_dtype="int2",
    storage_dtype="uint32",
    model_dtype="float16",
    linear_weight_layout="NK",
),
```

But I dunno, with the group-in vs. group-out distinction, the group size needs to be divisible by the quant dtype IIRC. So perhaps `group_size=8`, and there's no equivalent for adding:

```python
nbits_per_codebook=16
num_codebooks=1
```
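For reference, here's my understanding of how the AQLM scheme names map onto its parameters. The field names are my assumption based on the `aqlm`/Transformers integration, so treat them as a sketch rather than a verified config:

```python
# "MxN" in AQLM model names = num_codebooks x nbits_per_codebook;
# in_group_size is how many consecutive input-dim weights share one code.
# Field names are assumed from the aqlm/Transformers integration.
aqlm_2bit_1x16 = {
    "quant_method": "aqlm",
    "num_codebooks": 1,        # the "1x"
    "nbits_per_codebook": 16,  # the "16"
    "in_group_size": 8,        # 8 weights per code -> 1*16/8 = 2 bits/weight
    "out_group_size": 1,
}

code_bits = aqlm_2bit_1x16["num_codebooks"] * aqlm_2bit_1x16["nbits_per_codebook"]
weights_per_code = aqlm_2bit_1x16["in_group_size"] * aqlm_2bit_1x16["out_group_size"]
print(code_bits / weights_per_code, "bits per weight (excluding codebook storage)")
```

If that mapping is right, MLC's single `group_size` knob has no slot for the codebook parameters, which is what makes a direct translation awkward.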
I was also interested in Wizard 8x22B @ 2-bit and there was a thread created with regards to its base, Mixtral 8x22B. Any reason why one of 'em isn't on the to-do list? May be overly presumptuous, but does it have to do with MoE? I see there is an AQLM Mixtral, but I'm guessing MoEs don't perform as well as a standard AQLM quant model.
And thank you for the rapid responses and that comprehensive example script, super clutch! Shall I go ahead and close this or keep it open until the model is up and running?
Edit: Just noticed the PV-tuning models and a small FYI -- looks like the Llama 3-70B models are both linked to the same model.
AKA Meta-Llama-3-70B | 1x16g16's link is the one needing correction
Hah aw, on one hand I know how competitive academic research funding can be, so it makes sense; on the other, I'm surprised that in such a cutting-edge field, and while producing the SotA quant, ISTA doesn't have a hefty grant for y'all (=
I wish I had more computing power. I haven't read it in a while, but it sounded like anything less than an A100 is SOL. I'm using an M1 Max w/ 64GB of unified memory; I'd be fine if it were crunching in the background for days, but I imagine that even if I didn't OOM on a 70B or 8x22B model, it'd take a month 😂. Unless it's set up for multi-computer quantization, but that'd be a mess and half of what I'd need in power, I imagine (RTX 3070 / Tesla M40 24GB).
https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16 is the 13GB model that also has this as its link (3rd from the bottom). It appears that the model in question is not up on Hugging Face, so maybe that was an intentional placeholder?
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.