Comments (22)
Not sure if AutoTokenizer is ready for Llama 3 models, since they changed both the chat template and the tokenizer itself, so take a look at https://github.com/meta-llama/llama-recipes to avoid surprises in the final quality of the quantized model.
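For what it's worth, a quick way to check is to load the tokenizer and see what AutoTokenizer actually returns and whether the new chat template is applied. A minimal sketch, assuming a recent transformers release and access to the gated meta-llama repo:

```python
# Sanity check of the Llama 3 tokenizer before quantizing (sketch, not from the repo).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Llama 3 switched to a tiktoken-style BPE tokenizer and a new chat template;
# if these look wrong, the calibration data will be tokenized incorrectly.
print(type(tok).__name__)        # should be a fast tokenizer class
print(tok.special_tokens_map)    # expect the new <|begin_of_text|> / <|eot_id|> style tokens
print(tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
))
```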
@catid In addition, you may reduce memory usage via --finetune_dtype=bfloat16.
@catid Unfortunately, 2-bit quantization at the moment isn't lossless. In addition, it seems that very accurate models are harder to compress without noticeable degradation relative to the floating-point model.
We have just published a quantized Meta-Llama-3-8B-Instruct with 1x16 quantization to the hub. Drops are more pronounced on the more challenging MMLU and GSM8k tasks.
We are currently running the 70B models anyway, so I would suggest you not spend your money.
Concerning the slow inference: did you run the conversion script convert_to_hf.py? It seems like the checkpoint is in an improper format, as it is larger than the one posted in our quantized-model repo.
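In case it helps, this is roughly how a published AQLM checkpoint can be loaded through transformers (a sketch; the exact hub repo id below is an assumption, check the ISTA-DASLab page, and `pip install aqlm[gpu]` is required):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- substitute the actual 1x16 Llama-3 repo from the hub.
repo = "ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16"

model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")
tok = AutoTokenizer.from_pretrained(repo)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain AQLM in one sentence."}],
    tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

# Note: a checkpoint produced by main.py has to go through convert_to_hf.py first;
# an unconverted checkpoint is both larger and not loadable this way.
```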
This was fixed by editing your code to use AutoTokenizer instead of LlamaTokenizer.
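Concretely, the change was just swapping the tokenizer class; a minimal sketch of the edit (your loading code may look slightly different):

```python
from transformers import AutoTokenizer  # was: from transformers import LlamaTokenizer

# AutoTokenizer resolves to the fast tokenizer that Llama 3 ships with, whereas
# LlamaTokenizer expects the old SentencePiece model and breaks on Llama 3 checkpoints.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
```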
Now it runs out of memory:
```
Loaded data from pajama; len(data)=1024 sequences
Starting AQ quantization ...
catching layer inputs from data
Traceback (most recent call last):
  File "/home/catid/sources/AQLM/main.py", line 892, in <module>
    quantize_model(model, args)
  File "/home/catid/sources/AQLM/main.py", line 59, in quantize_model
    results = quantize_aq(model, train_data, val_data, args)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/catid/sources/AQLM/main.py", line 169, in quantize_aq
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
  File "/home/catid/sources/AQLM/main.py", line 169, in <listcomp>
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Running on a big server fixed that issue
Documenting my run here: https://github.com/catid/AQLM/blob/main/catid_readme.md
Please suggest any improvements based on your experience
Model is up here if you want to check it out: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft
Trying to figure out how to get the finetune script to work. Currently it runs out of memory on the big server.
Using a smaller microbatch_size fixed that
Hi, @catid. We are also running Llama-3 quantization at the moment.
Concerning the tokenizer issue, we have a fix for this and will soon merge it into the main branch. The cause of the issue is that Llama-3 uses a fast tokenizer instead of the default one.
About the OOM: I guess it is hard to fit microbatch > 1 into 80 GB of VRAM.
Final model here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm
Currently running lm-eval stuff on it
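For reference, my lm-eval run looks roughly like this (a sketch, assuming lm-eval-harness >= 0.4 with its Python API; the task list is just what I'm checking, not an official config):

```python
import lm_eval

# Evaluate the quantized hub model with the harness's HF backend
# (aqlm[gpu] must be installed so transformers can load the checkpoint).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=catid/cat-llama-3-8b-instruct-aqlm",
    tasks=["arc_challenge", "winogrande", "gsm8k"],
    batch_size=4,
)
print(results["results"])
```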
I'd like to do the 70B model at 3 bits, which seems like it would cost about $5k on runpod. Do you have access to cheaper compute, or are you already doing it, @Godofnothing?
Evaluation results without global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft
Evaluation results with global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm
The results are exactly the same, so I think the fine-tuned model did not get saved out properly? I think the weights are the same for both =(
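I'll try to confirm this by diffing the weight files of the two repos directly. A rough sketch, assuming both repos store a single model.safetensors file (adjust the filename if the weights are sharded):

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download the weight file from each repo and compare tensors key by key.
f_noft = hf_hub_download("catid/cat-llama-3-8b-instruct-aqlm-noft", "model.safetensors")
f_ft = hf_hub_download("catid/cat-llama-3-8b-instruct-aqlm", "model.safetensors")

sd_noft, sd_ft = load_file(f_noft), load_file(f_ft)
diffs = [k for k in sd_noft if k in sd_ft and not torch.equal(sd_noft[k], sd_ft[k])]
print(f"{len(diffs)} / {len(sd_noft)} tensors differ")  # 0 would mean the finetuned weights were never saved
```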
Honestly a bit disappointed in the performance of AQLM. It loses 2.5% accuracy on arc_challenge, and the inference is very slow. Worried the 70B version will be disappointing as well.
Cool, I hope you do a 3-bit version so that we can do longer context or speculative decoding with the 8B model.
Ah, mine is bigger because I did a 4-bit quant, not a 2-bit quant. IMHO 2 bits is too small for an 8B model.
Oof, your GSM8k (8-shot) result is really bad: 74% vs 34%. Maybe something is broken in the quantization or your scripts?
Maybe I should fix that before spending the $$ on a 70B model.
@catid We observed earlier on Llama-2 that the quality on GSM8k drops much more dramatically than on other tasks. Perplexity evaluation, hellaswag, winogrande, arc-easy/challenge and similar tasks give a too-optimistic estimate of model performance.
We have measurements on a 1x15 AQLM quant and 2-bit QuIP# for Llama-2-7b, and the drops are quite severe in either case. Specifically, fp16 llama-2-7b has ~14.7% accuracy on GSM8k, whereas AQLM and QuIP# (the finetuned version) yield ~6.2% and ~5.4%, respectively.
Our conjecture is that the calibration set used to find the optimal model configuration doesn't involve math, so this task is in some sense OOD for the resulting model.
@catid The 2x16 kernel at the moment is not as efficient as the 1x16 and 2x8 ones, unfortunately, so I would not recommend using it.
@catid, we added more evaluations for our 1x16 model on the hub. Drops on the five 0-shot tasks are significant, but not catastrophic.