Comments (22)
Not sure if AutoTokenizer is ready for Llama 3 models, since they changed both the chat template and the tokenizer itself, so take a look at https://github.com/meta-llama/llama-recipes to avoid surprises in the final quality of the quantized model.
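For what it's worth, a quick way to check is to load the tokenizer and see what AutoTokenizer actually returns and whether the new chat template is applied. A minimal sketch, assuming a recent transformers release and access to the gated meta-llama repo:

```python
# Sanity check of the Llama 3 tokenizer before quantizing (sketch, not from the repo).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Llama 3 switched to a tiktoken-style BPE tokenizer and a new chat template;
# if these look wrong, the calibration data will be tokenized incorrectly.
print(type(tok).__name__)        # should be a fast tokenizer class
print(tok.special_tokens_map)    # expect the new <|begin_of_text|> / <|eot_id|> style tokens
print(tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    tokenize=False,
    add_generation_prompt=True,
))
```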
@catid In addition, you may reduce memory usage via --finetune_dtype=bfloat16.
@catid Unfortunately, 2-bit quantization at the moment isn't lossless. In addition, it seems that very accurate models are harder to compress without noticeable degradation relative to the floating-point model.
We have just published a quantized Meta-Llama-3-8B-Instruct with 1x16 quantization to the hub. Drops are more pronounced on the more challenging MMLU and GSM8k tasks.
We are currently running the 70B models anyway, so I would suggest you not spend your money.
Concerning the slow inference: did you run the conversion script convert_to_hf.py? It seems like the checkpoint is in an improper format, as it is larger than the one posted in our quantized-model repo.
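In case it helps, this is roughly how a published AQLM checkpoint can be loaded through transformers (a sketch; the exact hub repo id below is an assumption, check the ISTA-DASLab page, and `pip install aqlm[gpu]` is required):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- substitute the actual 1x16 Llama-3 repo from the hub.
repo = "ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16"

model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")
tok = AutoTokenizer.from_pretrained(repo)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Explain AQLM in one sentence."}],
    tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

# Note: a checkpoint produced by main.py has to go through convert_to_hf.py first;
# an unconverted checkpoint is both larger and not loadable this way.
```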
This was fixed by editing your code to use AutoTokenizer instead of LlamaTokenizer.
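Concretely, the change was just swapping the tokenizer class; a minimal sketch of the edit (your loading code may look slightly different):

```python
from transformers import AutoTokenizer  # was: from transformers import LlamaTokenizer

# AutoTokenizer resolves to the fast tokenizer that Llama 3 ships with, whereas
# LlamaTokenizer expects the old SentencePiece model and breaks on Llama 3 checkpoints.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
```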
Now it runs out of memory:
```
Loaded data from pajama; len(data)=1024 sequences
Starting AQ quantization ...
catching layer inputs from data
Traceback (most recent call last):
  File "/home/catid/sources/AQLM/main.py", line 892, in <module>
    quantize_model(model, args)
  File "/home/catid/sources/AQLM/main.py", line 59, in quantize_model
    results = quantize_aq(model, train_data, val_data, args)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/catid/sources/AQLM/main.py", line 169, in quantize_aq
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
  File "/home/catid/sources/AQLM/main.py", line 169, in <listcomp>
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Running on a big server fixed that issue
Documenting my run here: https://github.com/catid/AQLM/blob/main/catid_readme.md
Please suggest any improvements based on your experience
Model is up here if you want to check it out: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft
Trying to figure out how to get the finetune script to work. Currently it runs out of memory on the big server.
Using a smaller microbatch_size fixed that
Hi, @catid. We are also running Llama-3 quantization at the moment.
Concerning the tokenizer issue, we have a fix for this and will soon merge it into the main branch. The cause of the issue is that Llama-3 uses a fast tokenizer instead of the default one.
About the OOM: I guess it is hard to fit microbatch > 1 into 80 GB of VRAM.
Final model here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm
Currently running lm-eval stuff on it
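For reference, my lm-eval run looks roughly like this (a sketch, assuming lm-eval-harness >= 0.4 with its Python API; the task list is just what I'm checking, not an official config):

```python
import lm_eval

# Evaluate the quantized hub model with the harness's HF backend
# (aqlm[gpu] must be installed so transformers can load the checkpoint).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=catid/cat-llama-3-8b-instruct-aqlm",
    tasks=["arc_challenge", "winogrande", "gsm8k"],
    batch_size=4,
)
print(results["results"])
```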
I'd like to do the 70B model at 3 bits, which seems like it would cost about $5k on runpod. Do you have access to cheaper compute, or are you already doing it, @Godofnothing?
Evaluation results without global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft
Evaluation results with global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm
The results are exactly the same, so I think the fine-tuned model did not get saved out properly? I think the weights are the same for both =(
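I'll try to confirm this by diffing the weight files of the two repos directly. A rough sketch, assuming both repos store a single model.safetensors file (adjust the filename if the weights are sharded):

```python
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download the weight file from each repo and compare tensors key by key.
f_noft = hf_hub_download("catid/cat-llama-3-8b-instruct-aqlm-noft", "model.safetensors")
f_ft = hf_hub_download("catid/cat-llama-3-8b-instruct-aqlm", "model.safetensors")

sd_noft, sd_ft = load_file(f_noft), load_file(f_ft)
diffs = [k for k in sd_noft if k in sd_ft and not torch.equal(sd_noft[k], sd_ft[k])]
print(f"{len(diffs)} / {len(sd_noft)} tensors differ")  # 0 would mean the finetuned weights were never saved
```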
Honestly a bit disappointed in the performance of AQLM. It loses 2.5% accuracy on arc_challenge, and the inference is very slow. Worried the 70B version will be disappointing as well.
Cool, I hope you do a 3-bit version so that we can do longer context or speculative decoding with the 8B model.
Ah, mine is bigger because I did a 4-bit quant, not a 2-bit quant. IMHO 2 bits is too small for an 8B model.
Oof, your GSM8k (8-shot) result is really bad: 74% vs 34%. Maybe something is broken in the quantization or your scripts?
Maybe I should fix that before spending the $$ on a 70B model.
@catid We observed earlier on Llama-2 that the quality on GSM8k drops much more dramatically than on other tasks. Perplexity evaluation, hellaswag, winogrande, arc-easy/challenge and similar tasks give a too-optimistic estimate of model performance.
We have measurements on a 1x15 AQLM quant and 2-bit QuIP# for Llama-2-7b, and the drops are quite severe in either case. Specifically, fp16 llama-2-7b has ~14.7% accuracy on GSM8k, whereas AQLM and QuIP# (the finetuned version) yield ~6.2% and ~5.4%, respectively.
Our conjecture is that the calibration set used to find the optimal model configuration doesn't involve math, so this task is in some sense OOD for the resulting model.
@catid The 2x16 kernel at the moment is not as efficient as the 1x16 and 2x8 ones, unfortunately, so I would not recommend using it.
@catid, we added more evaluations for our 1x16 model on the hub. Drops on the five 0-shot tasks are significant, but not catastrophic.