
Comments (22)

Mayorc1978 commented on August 21, 2024

Not sure if AutoTokenizer is ready for Llama 3 models, because they changed both the chat template and the tokenizer itself. Take a look at https://github.com/meta-llama/llama-recipes to avoid surprises in the final quality of the quantized model.
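
For what it's worth, a quick way to sanity-check the tokenizer before spending GPU time is something like the sketch below (a minimal sketch assuming a recent transformers release; the model id is the one discussed later in this thread):

from transformers import AutoTokenizer

# Load the Llama 3 tokenizer and render the chat template once; if the new template
# was not picked up, this will raise or produce an unexpected prompt format.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "Hello"}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))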


Godofnothing commented on August 21, 2024

@catid In addition, you may reduce memory usage via --finetune_dtype=bfloat16.


Godofnothing commented on August 21, 2024

@catid Unfortunately, 2-bit quantization is not lossless at the moment. In addition, it seems that very accurate models are harder to compress without noticeable degradation relative to the floating-point model.

We have just published quantized Meta-Llama-3-8B-Instruct with 1x16 quantization to the hub. Drops are more pronounced on more challenging MMLU and GSM8k tasks.

We are currently running the 70B models anyway, so I would suggest you not spend your money.

Concerning the slow inference: did you run the conversion script convert_to_hf.py? It seems like the checkpoint is in an improper format, as it is larger than the one posted in our quantized model repo.


catid commented on August 21, 2024

This was fixed by editing your code to use AutoTokenizer instead of LlamaTokenizer.
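
Concretely, the change was along these lines (a sketch of the swap; the exact spot in the AQLM code may differ):

# before: from transformers import LlamaTokenizer; LlamaTokenizer.from_pretrained(...) failed for Llama 3
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # whatever model path is passed to main.py
tokenizer = AutoTokenizer.from_pretrained(model_path)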


catid commented on August 21, 2024

Now it runs out of memory:

Loaded data from pajama; len(data)=1024 sequences

Starting AQ quantization ...
catching layer inputs from data
Traceback (most recent call last):
  File "/home/catid/sources/AQLM/main.py", line 892, in <module>
    quantize_model(model, args)
  File "/home/catid/sources/AQLM/main.py", line 59, in quantize_model
    results = quantize_aq(model, train_data, val_data, args)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/catid/sources/AQLM/main.py", line 169, in quantize_aq
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
  File "/home/catid/sources/AQLM/main.py", line 169, in <listcomp>
    outs = [torch.zeros_like(inp_tensor, pin_memory=inp_tensor.is_pinned()) for inp_tensor in inps]
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


catid commented on August 21, 2024

Running on a big server fixed that issue


catid commented on August 21, 2024

Documenting my run here: https://github.com/catid/AQLM/blob/main/catid_readme.md

Please suggest any improvements based on your experience


catid commented on August 21, 2024

Model is up here if you want to check https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft
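
Loading it should look roughly like this (a sketch assuming the checkpoint was converted with convert_to_hf.py, a transformers release with AQLM support, and the aqlm pip package installed):

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "catid/cat-llama-3-8b-instruct-aqlm-noft"
# device_map="auto" places the quantized layers on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo)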


catid commented on August 21, 2024

Trying to figure out how to get the finetune script to work. Currently it runs out of memory on the big server.


catid commented on August 21, 2024

Using a smaller microbatch_size fixed that


Godofnothing commented on August 21, 2024

Hi, @catid. We are also running Llama-3 quantization at the moment.

Concerning the tokenizer issue, we have a fix for this and will soon merge it into the main branch. The cause of the issue is that Llama-3 uses a FastTokenizer instead of the default one.

About the OOM: I guess it is hard to fit a microbatch > 1 into 80 GB of VRAM.


catid commented on August 21, 2024

Final model here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm

Currently running the lm-eval suite on it.
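
The run is roughly the following (a sketch assuming lm-evaluation-harness >= 0.4 with the aqlm package installed; the task and shot count just mirror the GSM8k numbers discussed below):

import lm_eval

# Evaluate the quantized checkpoint on GSM8k with 8-shot prompting.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=catid/cat-llama-3-8b-instruct-aqlm,dtype=auto",
    tasks=["gsm8k"],
    num_fewshot=8,
    batch_size=8,
)
print(results["results"])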


catid commented on August 21, 2024

I'd like to do the 70B model at 3 bits, which seems like it would cost about $5k on runpod. Do you have access to cheaper compute, or are you already doing it, @Godofnothing?


catid commented on August 21, 2024

Evaluation results without global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm-noft

Evaluation results with global fine tune are here: https://huggingface.co/catid/cat-llama-3-8b-instruct-aqlm

The results are exactly the same, so I think the fine-tuned model did not get saved out properly? I think the weights are the same for both =(
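
A quick way to confirm whether the two repos really hold identical weights (a minimal sketch; the repo ids are the ones linked above):

import hashlib, os
from huggingface_hub import snapshot_download

def weight_hashes(repo_id):
    # Download only the weight shards and hash each file.
    path = snapshot_download(repo_id, allow_patterns=["*.safetensors", "*.bin"])
    return {
        name: hashlib.sha256(open(os.path.join(path, name), "rb").read()).hexdigest()
        for name in sorted(os.listdir(path))
        if name.endswith((".safetensors", ".bin"))
    }

print(weight_hashes("catid/cat-llama-3-8b-instruct-aqlm-noft") ==
      weight_hashes("catid/cat-llama-3-8b-instruct-aqlm"))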


catid commented on August 21, 2024

Honestly, I'm a bit disappointed in the performance of AQLM. It loses about 2.5% accuracy on arc_challenge, and inference is very slow. Worried the 70B version will be disappointing as well.


catid commented on August 21, 2024

Cool, I hope you do a 3-bit version so that we can do longer context or speculative decoding with the 8B model.


catid commented on August 21, 2024

Ah, mine is bigger because I did a 4-bit quant, not a 2-bit quant. IMHO 2 bits is too small for an 8B model.
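
Back-of-the-envelope, the size difference is expected (a rough sketch that ignores AQLM codebooks, layers kept in higher precision, and other overhead, so real checkpoints will be somewhat larger):

params = 8e9  # ~8B weights
for bits in (2, 4, 16):
    print(f"{bits}-bit: ~{params * bits / 8 / 1e9:.0f} GB")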


catid commented on August 21, 2024

Oof, your GSM8k (8-shot) is really bad: 74% vs. 34%. Maybe something is broken in the quantization or your scripts?


catid commented on August 21, 2024

Maybe that should be fixed before spending the $$ on a 70B model.


Godofnothing commented on August 21, 2024

@catid We observed earlier on Llama-2 that the quality on GSM8k drops much more dramatically than on other tasks. Perplexity evaluation, hellaswag, winogrande, arc-easy/challenge, and similar benchmarks give a too-optimistic estimate of model performance.

We have measurements for a 1x15 AQLM quant and 2-bit QuIP# on Llama-2-7b, and the drops in either case are quite severe. Specifically, fp16 Llama-2-7b has ~14.7% accuracy on GSM8k, whereas AQLM and QuIP# (the finetuned version) yield ~6.2% and ~5.4%, respectively.

Our conjecture is that the calibration set used to find the optimal model configuration doesn't involve math, so this task is in some sense OOD for the resulting model.


Godofnothing commented on August 21, 2024

@catid The 2x16 kernel at the moment is not as efficient as the 1x16 and 2x8 kernels, unfortunately, so I would not recommend using it.


Godofnothing commented on August 21, 2024

@catid, we added more evaluations for our 1x16 model on the hub. The drops on the five 0-shot tasks are significant, but not catastrophic.

