
aqlm's Introduction

AQLM

Official PyTorch implementation for Extreme Compression of Large Language Models via Additive Quantization

Inference

Demo

Learn how to run the prequantized models using these Google Colab examples:

  • Basic AQLM generation
  • Streaming with GPU/CPU
  • Inference with CUDA graphs (3x speedup)
  • Fine-tuning with PEFT
  • Serving with vLLM

Models

This repository is currently designed to work with models of the LLaMA, Mistral, and Mixtral families. The models reported below use full model fine-tuning as described in Appendix A, with a cross-entropy objective against teacher logits.

We provide a number of prequantized models:

| Model | AQLM scheme | WikiText-2 PPL | MMLU (5-shot), FP16→AQLM | Model size, GB | Hub link |
|---|---|---|---|---|---|
| Llama-3-8b | 1x16 | - | 0.65→0.56 | 4.1 | Link |
| Llama-3-8b-Instruct | 1x16 | - | 0.66→0.59 | 4.1 | Link |
| Llama-3-70b | 1x16 | - | 0.79→0.75 | 21.9 | Link |
| Llama-3-70b-Instruct | 1x16 | - | 0.80→0.76 | 21.9 | Link |
| Command-R | 1x16 | - | 0.68→0.57 | 12.7 | Link |
| Command-R+ | 1x16 | - | 0.74→0.68 | 31.9 | Link |
| Mistral-7b | 1x16 | 5.40 | - | 2.5 | Link |
| Mistral-7B-Instruct-v0.2 | 2x8 | - | 0.59→0.44 | 2.5 | Link |
| Mixtral-8x7b | 1x16 | 3.35 | - | 12.6 | Link |
| Mixtral-8x7b-Instruct | 1x16 | - | - | 12.6 | Link |
| Llama-2-7b | 1x16 | 5.92 | 0.46→0.39 | 2.4 | Link |
| Llama-2-7b | 2x8 | 6.69 | - | 2.2 | Link |
| Llama-2-7b | 8x8 | 6.61 | - | 2.2 | Link |
| Llama-2-13b | 1x16 | 5.22 | 0.55→0.49 | 4.1 | Link |
| Llama-2-70b | 1x16 | 3.83 | 0.69→0.65 | 18.8 | Link |
| Llama-2-70b | 2x8 | 4.21 | - | 18.2 | Link |
| gemma-2b | 1x16 | - | - | 1.7 | Link |
| gemma-2b | 2x8 | - | - | 1.6 | Link |

The perplexity above is evaluated at a 4k context length for Llama-2 models and 8k for Mistral/Mixtral models. Please see the model pages for more evaluation results.

Inference kernels

AQLM quantization setups vary mainly in the number of codebooks used and the codebook size in bits. The most popular setups, as well as the inference kernels they support, are:

| Kernel | Number of codebooks | Codebook size, bits | Scheme notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|---|---|---|---|---|---|---|---|
| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ |
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ |
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ |
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ |
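
As a small illustration of the scheme notation (a sketch, not the library's actual kernel-dispatch logic): a scheme string is (number of codebooks) x (codebook size in bits), which maps directly onto the --num_codebooks and --nbits_per_codebook options used for quantization below.

def parse_scheme(scheme: str) -> tuple[int, int]:
    """'1x16' -> (num_codebooks=1, nbits_per_codebook=16); '2x8' -> (2, 8)."""
    num_codebooks, nbits_per_codebook = scheme.split("x")
    return int(num_codebooks), int(nbits_per_codebook)

print(parse_scheme("1x16"))  # (1, 16) -- served by the CUDA 1x16 kernel on GPU
print(parse_scheme("2x8"))   # (2, 8)  -- served by the CUDA 2x8 kernel on GPU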

Installation

To run the models, install the inference library:

pip install aqlm[gpu,cpu]

specifying gpu, cpu, or both, depending on your inference setup.

Then, one can use the familiar .from_pretrained method provided by the transformers library:

from transformers import AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",
    trust_remote_code=True, torch_dtype="auto"
).cuda()

Notice that torch_dtype should be set to either torch.float16 or "auto" on GPU, and to torch.float32 on CPU. After that, the model can be used exactly the same way as an unquantized model.
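
For example, a quick generation check (a minimal sketch; it assumes the quantized repository ships matching tokenizer files, otherwise load the tokenizer of the original model):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto"
).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))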

Quantization

Dependencies

Install packages from requirements.txt:

pip install -r requirements.txt

Loading / caching datasets and tokenizer

The script requires downloading and caching the relevant tokenizer and datasets locally. They will be saved in the default Hugging Face Datasets directory unless an alternative location is provided via environment variables. See the relevant Datasets documentation section.

Data

When quantizing models with AQLM, we recommend that you use a subset of the original data the model was trained on.

For Llama-2 models, the closest available dataset is RedPajama. To load a subset of RedPajama, pass "pajama" as the --dataset argument. This will process nsamples sequences and tokenize them using the model's tokenizer.

Additionally, we provide tokenized RedPajama for Llama and Solar/Mistral models with a 4096 context length, stored on Hugging Face. To load it, use:

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="Vahe1994/AQLM", filename="data/name.pth", repo_type="dataset")

To use the downloaded data from HF, optionally place it in the data folder and set the correct path to it in the --dataset argument of main.py.

Warning: These subsets are already processed with the corresponding model tokenizer. If you want to quantize another model (e.g. Mistral/Mixtral), please re-tokenize the data with the provided script in src/datautils.
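
Putting the two steps above together, a minimal sketch (the filename below is taken from the Llama-2 example further down and is an assumption; check the dataset repository's file listing for the file you actually need):

from huggingface_hub import hf_hub_download

# Download the tokenized calibration file into the local HF cache and get its path.
local_path = hf_hub_download(
    repo_id="Vahe1994/AQLM",
    filename="data/red_pajama_n=1024_4096_context_length.pth",
    repo_type="dataset",
)
print(local_path)  # pass this path as the --dataset argument of main.py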

WandB logging

One can optionally log the data to the Weights & Biases service (wandb). Run pip install wandb to enable W&B logging. Specify the $WANDB_ENTITY, $WANDB_PROJECT, and $WANDB_NAME environment variables prior to running experiments, and use the --wandb argument to enable logging.

GPU and RAM requirements

This code was developed and tested using several A100 GPUs with 80GB of GPU RAM. You can use the --offload_activations option to reduce VRAM usage. For Language Model Evaluation Harness evaluation, one needs enough memory to load the whole model plus activation tensors on one or several devices.

Quantization time

AQLM quantization takes considerably longer to calibrate than simpler quantization methods such as GPTQ. This only impacts quantization time, not inference time.

For instance, quantizing a 7B model with the default configuration takes about 1 day on a single A100 GPU. Similarly, quantizing a 70B model on a single GPU would take 10-14 days. If you have multiple GPUs with fast interconnect, you can run AQLM on multiple GPUs to speed things up: simply set CUDA_VISIBLE_DEVICES to several GPUs. Quantizing a 7B model on two GPUs reduces quantization time to ~14.5 hours. Similarly, quantizing a 70B model on 8 x A100 GPUs takes 3 days 18 hours.

If you need to speed up quantization without adding more GPUs, you may also increase --relative_mse_tolerance, set --init_max_points_per_centroid, or limit --finetune_max_epochs. However, that usually comes at the cost of reduced model accuracy.

Model downloading

The code requires the LLaMA model to be downloaded in Huggingface format and saved locally. The scripts below assume that $TRANSFORMERS_CACHE variable points to the Huggingface Transformers cache folder. To download and cache the models, run this in the same environment:

from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "meta-llama/Llama-2-7b-hf"  # or whatever else you wish to download
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

How to quantize a model with AQLM

This script compresses the model and then tests its performance in terms of perplexity using the WikiText-2, C4, and Penn Treebank datasets.

The command to launch the script should look like this:

export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=<PATH_TO_MODEL_ON_HUB>
export DATASET_PATH=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export SAVE_PATH=/path/to/save/quantized/model/
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --val_size=128 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 \
 --finetune_max_epochs=10 \
 --finetune_early_stop=3 \
 --finetune_keep_best \
 --local_batch_size=1 \
 --offload_activations \
 --wandb \
 --resume \
 --save $SAVE_PATH

Main CLI arguments:

  • CUDA_VISIBLE_DEVICES - by default, the code will use all available GPUs. If you want to use specific GPUs (or one GPU), use this variable.
  • MODEL_PATH - a path to either Hugging Face hub (e.g. meta-llama/Llama-2-7b-hf) or a local folder with transformers model and a tokenizer.
  • DATASET_PATH - either a path to calibration data (see above) or a standard dataset [c4, ptb, wikitext2]
    • for llama-2 models, you can use DATASET_PATH=./data/red_pajama_n=1024_4096_context_length.pth for a slice of RedPajama (up to 1024 samples)
  • --nsamples - the number of calibration data sequences (train + validation). If this parameter is not set, all available calibration data is used.
  • --val_size - the number of validation sequences for early stopping on block finetuning. By default equal to 0. Must be smaller than --nsamples.
  • --num_codebooks - number of codebooks per layer
  • --nbits_per_codebook - each codebook will contain 2 ** nbits_per_codebook vectors
  • --in_group_size - how many weights are quantized together (aka "g" in the arXiv paper); see the bit-budget sketch after this list
  • --finetune_batch_size - (for fine-tuning only) the total number of sequences used for each optimization step
  • --local_batch_size - when accumulating finetune_batch_size, process this many samples per GPU per forward pass (affects GPU RAM usage)
  • --relative_mse_tolerance - (for initial calibration) stop training when (current_epoch_mse / previous_epoch_mse) > (1 - relative_mse_tolerance)
  • --finetune_max_epochs - maximal number of passes through calibration data on block tuning.
  • --finetune_early_stop - maximal number of passes through calibration data without improvement on validation.
  • --offload_activations -- during calibration, move activations from GPU memory to RAM. This reduces VRAM usage while slowing calibration by ~10% (depending on your hardware).
  • --save -- path to save/load quantized model. (see also: --load)
  • --wandb - if this parameter is set, the code will log results to wandb
  • --attn_implementation - specify the attention implementation (for transformers >= 4.38). SDPA attention sometimes causes issues; it is recommended to use the eager implementation.
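
To get a feel for how these options translate into an average bitrate, here is a rough back-of-the-envelope sketch (assumptions: fp16 codebooks and one fp16 scale per output unit; real checkpoints may differ slightly). With the default 1x16 scheme and in_group_size=8, it gives roughly 2.5 bits per weight for a 4096x4096 projection, consistent with the bits_per_parameter values printed by the quantizer.

def aqlm_bits_per_parameter(out_features: int, in_features: int,
                            num_codebooks: int = 1, nbits_per_codebook: int = 16,
                            in_group_size: int = 8, out_group_size: int = 1) -> float:
    """Rough storage estimate: codes + fp16 codebooks + one fp16 scale per output unit."""
    num_weights = out_features * in_features
    num_groups = num_weights // (in_group_size * out_group_size)
    code_bits = num_groups * num_codebooks * nbits_per_codebook
    codebook_bits = num_codebooks * 2 ** nbits_per_codebook * in_group_size * out_group_size * 16
    scale_bits = out_features * 16
    return (code_bits + codebook_bits + scale_bits) / num_weights

print(aqlm_bits_per_parameter(4096, 4096))    # ~2.504 bits/weight (e.g. a Llama/Mistral q_proj)
print(aqlm_bits_per_parameter(14336, 4096))   # ~2.147 bits/weight (e.g. a Llama/Mistral gate/up_proj)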

There are additional hyperparameters available. Run python main.py --help for more details on command line arguments, including compression parameters.

Finetuning

The accuracy of the quantized model can be further improved via block finetuning. First, the logits of the float16/bfloat16 model are cached in RAM. Then, the differentiable parameters of the quantized model are optimized to minimize the KL-divergence with the teacher logits. Typically, we use the same calibration data that was used for model quantization.
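
For illustration, a minimal sketch of such a distillation loss (assuming logits tensors of shape [batch, seq_len, vocab]; this is not the repository's exact implementation):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over the batch of token positions."""
    student_logprobs = F.log_softmax(student_logits.float(), dim=-1)
    teacher_probs = F.softmax(teacher_logits.float(), dim=-1)
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")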

The command to launch the script should look like this:

python finetune.py \
  --base_model $MODEL_PATH \
  --quant_model $INPUT_PATH \
  --dataset $DATASET_PATH \
  --nsamples=<TOTAL_SIZE> \
  --val_size=<VAL_SIZE> \
  --lr=1e-5 \
  --adam_beta1=0.90 \
  --adam_beta2=0.999 \
  --epochs=5 \
  --early_stop=3 \
  --batch_size=8 \
  --microbatch_size=4 \
  --save $DATA_PATH \
  --gradient_checkpointing

Main CLI arguments:

  • --base_model - path or name of the original floating-point model
  • --quant_model - path to quantized model weights.
  • --dataset - path or name of the calibration dataset
  • --nsamples - the number of calibration data sequences (train + validation). If this parameter is not set, all available calibration data is used.
  • --val_size - the number of validation sequences for early stopping on end-to-end finetuning. By default equal to 0. Must be smaller than --nsamples.
  • --gradient_checkpointing - whether to use gradient checkpointing. Reduces peak memory usage at the cost of longer runtime.
  • --finetune_dtype - which dtype to use for finetuning. Defaults to float32.
  • --amp - whether to use AMP during finetuning. Requires --finetune_dtype=float32.

Note that larger models require multi-GPU training. At the moment, FSDP training is not implemented and the model is finetuned in a single process with parameters sharded across available devices.

Zero-shot benchmarks via LM Evaluation Harness

To perform zero-shot evaluation, we use the Language Model Evaluation Harness framework with slight modifications. This repository contains a copy of the LM Evaluation Harness repo from early 2023 in the lm-evaluation-harness folder.

Before running the code, make sure that you have all the requirements and dependencies of lm-evaluation-harness installed. To install them, run:

pip install -r lm-evaluation-harness/requirements.txt

The main script launching the evaluation procedure is lmeval.py.

export CUDA_VISIBLE_DEVICES=0,1,2,3  # optional: select GPUs
export QUANTIZED_MODEL=<PATH_TO_SAVED_QUANTIZED_MODEL_FROM_MAIN.py>
export MODEL_PATH=<INSERT_PATH_TO_ORIGINAL_MODEL_ON_HUB>
export DATASET=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export WANDB_PROJECT=MY_AQ_LM_EVAL
export WANDB_NAME=COOL_EVAL_NAME

python lmeval.py \
    --model hf-causal \
    --model_args pretrained=$MODEL_PATH,dtype=float16,use_accelerate=True \
    --load $QUANTIZED_MODEL \
    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \
    --batch_size 1

Preparing models for inference

To convert a model into a Hugging Face compatible format, use convert_to_hf.py with corresponding arguments:

  • --model - the original pretrained model (corresponds to MODEL_PATH of main.py, e.g. meta-llama/Llama-2-7b-hf).
  • --in_path - the folder containing an initially quantized model (corresponds to --save of main.py).
  • --out_path - the folder to save transformers model to.

The conversion automatically

Contributing

If you want to contribute something substantial (more than a typo), please open an issue first. We use black and isort for all pull requests. Before committing your code, run black . && isort .

Cite

If you found this work useful, please consider citing:

@misc{egiazarian2024extreme,
      title={Extreme Compression of Large Language Models via Additive Quantization}, 
      author={Vage Egiazarian and Andrei Panferov and Denis Kuznedelev and Elias Frantar and Artem Babenko and Dan Alistarh},
      year={2024},
      eprint={2401.06118},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

aqlm's People

Contributors

blacksamorez, dalistarh, efrantar, eltociear, galqiwi, godofnothing, justheuristic, lmmx, vahe1994, vectozavr


aqlm's Issues

What are the hyperparameters to get an average of 2.02 bits per layer for LLaMA2-7b?

Hi @Vahe1994, I really like the idea of the paper and thank you for releasing the codes.

I am currently studying the code, and I would like to know how you obtain an average of 2.02 bits per layer for LLaMA2-7b (shown in Table 1 of the paper). I tried the hyperparameters from the README (num_codebooks=1, codebook_value_nbits=16, in_group_size=8, out_group_size=1, codebook_value_nbits=16, scale_nbits=0, with the rest left at their default values), but I get around 2.50 bits for attention weights and 2.2 bits for MLP weights, and their average cannot be 2.02. How do I tune the parameters so that I can reproduce the paper's results? Thank you very much!

'QuantizedLinear' object has no attribute 'weight' error

I'm trying to do quantization with these parameters:

python main.py ../models/my-mistral-7B wikitext2 --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --save ../models/my-mistral-7B-AQLM --model_seqlen 8192 --offload_activations

and I got this error

Traceback (most recent call last):
  File "/home/fahadh/AQLM/main.py", line 779, in <module>
    quantize_model(model, args)
  File "/home/fahadh/AQLM/main.py", line 48, in quantize_model
    results = quantize_aq(model, dataloader, args)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/fahadh/AQLM/main.py", line 252, in quantize_aq
    layer = finetune_groupwise(layer=layer, inps=inps, outs=outs, args=args, **forward_args)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/fahadh/AQLM/src/finetune.py", line 105, in finetune_groupwise
    loss = _compute_mse_parallel(
  File "/home/fahadh/AQLM/src/finetune.py", line 224, in _compute_mse_parallel
    mse_components = torch.nn.parallel.parallel_apply(
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
    output.reraise()
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
AttributeError: Caught AttributeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
  File "/home/fahadh/AQLM/src/finetune.py", line 201, in _compute_mse_on_batch
    outs_prediction, *_unused = layer(inps_batch, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 754, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 437, in forward
    target_dtype = self.q_proj.weight.dtype
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'QuantizedLinear' object has no attribute 'weight'

This happens after fine-tuning layer 0 of the Mistral model. Here is what was printed before the error:

PREPARING TO FINETUNE
MistralDecoderLayer(
  (self_attn): MistralFlashAttention2(
    (q_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=4096, self.in_features=4096, bits_per_parameter=2.50390625)
    )
    (k_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=1024, self.in_features=4096, bits_per_parameter=4.00390625)
    )
    (v_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=1024, self.in_features=4096, bits_per_parameter=4.00390625)
    )
    (o_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=4096, self.in_features=4096, bits_per_parameter=2.50390625)
    )
    (rotary_emb): MistralRotaryEmbedding()
  )
  (mlp): MistralMLP(
    (gate_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=14336, self.in_features=4096, bits_per_parameter=2.146763392857143)
    )
    (up_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=14336, self.in_features=4096, bits_per_parameter=2.146763392857143)
    )
    (down_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=4096, self.in_features=14336, bits_per_parameter=2.1439732142857144)
    )
    (act_fn): SiLU()
  )
  (input_layernorm): MistralRMSNorm()
  (post_attention_layernorm): MistralRMSNorm()
)
Fine-tuning 3721216 parameters

How model_seqlen affects quantization quality

Hi!
Thanks for such a useful tool!
I have a question about model_seqlen:

As I can see, the default value in main.py is 4096. What if I use a smaller value, e.g. 1024, when quantizing the MoE Mixtral model? Will it affect the quality of the quantized model, or the quality at contexts longer than 1024? Will it significantly speed up the quantization process?

Thanks in advance!

    parser.add_argument(
        "--model_seqlen",
        type=int,
        default=4096,
        help="Model seqlen and calibration data context length.",
    )

Supported Models

Congratulations on coming up with such an excellent quantization algorithm! I'm trying to use AQLM to quantize Deepseek-Coder and Starcoder2, but the repository doesn't seem to have direct support for them. Are there any plans to support more models? Or any suggestions on how to modify the source code to add support quickly? Thanks.

RuntimeError: Unknown layout

Hi, I've been encountering this problem when running this notebook on a local machine with an L4. I've followed the instructions from the notebook on Python 3.10.

Name: torch
Version: 2.2.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /home/ubuntu/anaconda3/envs/aqlm/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, aqlm, torchaudio, torchvision

  +---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                       On | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0               33W /  75W|  13858MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     16482      C   ...untu/anaconda3/envs/aqlm/bin/python    13856MiB |
+---------------------------------------------------------------------------------------+

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:2 for open-end generation.

RuntimeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/generation/utils.py:1513, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1496 return self.assisted_decoding(
1497 input_ids,
1498 candidate_generator=candidate_generator,
(...)
1509 **model_kwargs,
1510 )
1511 if generation_mode == GenerationMode.GREEDY_SEARCH:
1512 # 11. run greedy search
-> 1513 return self.greedy_search(
1514 input_ids,
1515 logits_processor=prepared_logits_processor,
1516 stopping_criteria=prepared_stopping_criteria,
1517 pad_token_id=generation_config.pad_token_id,
1518 eos_token_id=generation_config.eos_token_id,
1519 output_scores=generation_config.output_scores,
1520 return_dict_in_generate=generation_config.return_dict_in_generate,
1521 synced_gpus=synced_gpus,
1522 streamer=streamer,
1523 **model_kwargs,
1524 )
1526 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
1527 if not model_kwargs["use_cache"]:

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/generation/utils.py:2350, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2347 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2349 # forward pass to get next token
-> 2350 outputs = self(
2351 **model_inputs,
2352 return_dict=True,
2353 output_attentions=output_attentions,
2354 output_hidden_states=output_hidden_states,
2355 )
2357 if synced_gpus and this_peer_finished:
2358 continue # don't waste resources running the code we don't need

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1360, in MixtralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1357 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1359 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1360 outputs = self.model(
1361 input_ids=input_ids,
1362 attention_mask=attention_mask,
1363 position_ids=position_ids,
1364 past_key_values=past_key_values,
1365 inputs_embeds=inputs_embeds,
1366 use_cache=use_cache,
1367 output_attentions=output_attentions,
1368 output_hidden_states=output_hidden_states,
1369 output_router_logits=output_router_logits,
1370 return_dict=return_dict,
1371 )
1373 hidden_states = outputs[0]
1374 logits = self.lm_head(hidden_states)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1228, in MixtralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1217 layer_outputs = self._gradient_checkpointing_func(
1218 decoder_layer.call,
1219 hidden_states,
(...)
1225 use_cache,
1226 )
1227 else:
-> 1228 layer_outputs = decoder_layer(
1229 hidden_states,
1230 attention_mask=attention_mask,
1231 position_ids=position_ids,
1232 past_key_value=past_key_values,
1233 output_attentions=output_attentions,
1234 output_router_logits=output_router_logits,
1235 use_cache=use_cache,
1236 )
1238 hidden_states = layer_outputs[0]
1240 if use_cache:

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:934, in MixtralDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, output_router_logits, use_cache, **kwargs)
931 hidden_states = self.input_layernorm(hidden_states)
933 # Self Attention
--> 934 hidden_states, self_attn_weights, present_key_value = self.self_attn(
935 hidden_states=hidden_states,
936 attention_mask=attention_mask,
937 position_ids=position_ids,
938 past_key_value=past_key_value,
939 output_attentions=output_attentions,
940 use_cache=use_cache,
941 )
942 hidden_states = residual + hidden_states
944 # Fully Connected

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:730, in MixtralSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
719 return super().forward(
720 hidden_states=hidden_states,
721 attention_mask=attention_mask,
(...)
725 use_cache=use_cache,
726 )
728 bsz, q_len, _ = hidden_states.size()
--> 730 query_states = self.q_proj(hidden_states)
731 key_states = self.k_proj(hidden_states)
732 value_states = self.v_proj(hidden_states)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/aqlm/inference.py:65, in QuantizedLinear.forward(self, input)
59 if (
60 not input.is_cuda
61 and self.codebook_size == 256
62 and self.codes.shape[0] == self.out_features // self.out_group_size
63 ):
64 self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous() # TODO: fix this thing
---> 65 return forward_pass_quantized_linear(input, self.codes, self.codebooks, self.scales, self.bias)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py:24, in forward_pass_quantized_linear(input, codes, codebooks, scales, bias)
19 from .cuda_kernel import CUDA_KERNEL
21 assert (
22 input.dtype == torch.float16
23 ), f"please load the model with torch_dtype=torch.float16, as {input.dtype} is not supported on GPU yet"
---> 24 return CUDA_KERNEL.code1x16_matmat(input, codes, codebooks, scales) + (bias if bias is not None else 0)
25 case (True, 2, 256, 1, 8):
26 from .cuda_kernel import CUDA_KERNEL

RuntimeError: Unknown layout

Can't run the model after building with Docker, fails in ninja build.

This is my Dockerfile. I've tried a lot of permutations based on your notebooks, e.g., using pip install git+https://github.com/huggingface/accelerate.git@main instead of the plain pip install below. I paste the error after the Dockerfile.

FROM nvidia/cuda:11.8.0-base-ubuntu20.04

# Run system updates and install any desired packages
RUN apt-get update && apt-get upgrade -y && apt-get install -y curl

# need to make sure sudo is installed
RUN apt-get update && apt-get install -y sudo
RUN sudo apt-get install python3.8 -y

RUN curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN mkdir /root/.conda
RUN bash Miniconda3-latest-Linux-x86_64.sh -b
ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"

RUN apt-get update && apt-get install -y gcc g++
RUN apt-get update && apt-get install -y ninja-build

RUN apt-get update && apt-get install -y git
RUN pip install accelerate
RUN pip install aqlm[gpu]==1.0.0


#docker run --gpus '"all"' --rm -it --name aqlm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,source=/home/josh/PycharmProjects/vectors/DSPy/scratch,target=/scratch -v ${HOME}/.cache/huggingface:/root/.cache/huggingface --network=host aqlm:0

Here's the error:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
    subprocess.run(
  File "/root/miniconda3/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in chat_loop
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py", line 1520, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py", line 2617, in sample
    outputs = self(
              ^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 1194, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 1081, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 809, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 704, in forward
    query_states = self.q_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/aqlm/inference.py", line 65, in forward
    return forward_pass_quantized_linear(input, self.codes, self.codebooks, self.scales, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py", line 19, in forward_pass_quantized_linear
    from .cuda_kernel import CUDA_KERNEL
  File "/root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.py", line 8, in <module>
    CUDA_KERNEL = load(
                  ^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1306, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2112, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
FAILED: cuda_kernel.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
[2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o 
ninja: build stopped: subcommand failed.

And here's the little script I threw together to test the model:

from transformers import AutoModelForCausalLM

model="BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf"
quantized_model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True, torch_dtype="auto", device_map="cuda").cuda()
# unload the model from cuda
# quantized_model = quantized_model.cpu()

# Let's set up a chat loop
# need this token for the tokenizer
token=<MY_TOKEN>
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-13b-hf', trust_remote_code=True, token=token)

def chat_loop():
    while True:
        prompt = input(">>> ")
        if prompt.strip() == "exit":
            break
        input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
        output = quantized_model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=tokenizer.eos_token_id)
        print(tokenizer.decode(output[0], skip_special_tokens=True))

chat_loop()
# Quiz me about the principles of scrum.

Just not familiar enough yet with your code to troubleshoot dependency / build issues. Happy to learn more!

Fine-tune colab example doesn't work

Fine-tune colab example fails when running trainer.train()

Last cell gives output

max_steps is given, it will override any value given in num_train_epochs
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(


RuntimeError Traceback (most recent call last)

in <cell line: 24>()
22 )
23 model.config.use_cache = False # silence the warnings. Please re-enable for inference!
---> 24 trainer.train()

37 frames

/usr/local/lib/python3.10/dist-packages/torch/_ops.py in call(self, *args, **kwargs)
753 # We save the function ptr as the op attribute on
754 # OpOverloadPacket to access it here.
--> 755 return self._op(*args, **(kwargs or {}))
756
757 # TODO: use this to make a dir

RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

AttributeError: module 'torch.library' has no attribute 'impl_abstract'

Note: I tested this for aqlm==1.0.3 and 1.1.0

I'm having an issue trying to run inference with an HF model in Transformers.

pip install transformers aqlm[gpu,cpu]

Pytorch 2.1.2 is installed

Then

from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained("ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf", torch_dtype="auto", device_map="cuda").cuda()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
print(output)

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/generation/utils.py", line 1544, in generate
    return self.greedy_search(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/generation/utils.py", line 2404, in greedy_search
    outputs = self(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
    outputs = self.model(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1008, in forward
    layer_outputs = decoder_layer(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 734, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 633, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference.py", line 70, in forward
    self.prepare_matmul_op(input)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference.py", line 86, in prepare_matmul_op
    get_forward_pass_kernel(self.codebooks, False),
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py", line 35, in get_forward_pass_kernel
    from .cuda_kernel import CUDA_FOLDER
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.py", line 20, in <module>
    @torch.library.impl_abstract("aqlm::code1x16_matmat")
AttributeError: module 'torch.library' has no attribute 'impl_abstract'

RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d

Environment: Google Colab
CUDA Info:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |

Code run from: Basic AQLM generation demo

error line:

%%capture
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py](https://localhost:8080/#) in _run_ninja_build(build_directory, verbose, error_prefix)
   2095         stdout_fileno = 1
-> 2096         subprocess.run(
   2097             command,

26 frames
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py](https://localhost:8080/#) in _run_ninja_build(build_directory, verbose, error_prefix)
   2110         if hasattr(error, 'output') and error.output:  # type: ignore[union-attr]
   2111             message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}"  # type: ignore[union-attr]
-> 2112         raise RuntimeError(message) from e
   2113 
   2114 

RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -std=c++17 -c /usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
FAILED: cuda_kernel.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -std=c++17 -c /usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(59): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(a[j], b[j], res2);
                             ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(59): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(a[j], b[j], res2);
                                   ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(59): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(a[j], b[j], res2);
                                         ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(147): error: no suitable user-defined conversion from "__nv_bfloat162" to "__half2" exists
              res2 = __hfma2(__hadd2(a0[j], a1[j]), b[j], res2);
                             ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(147): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(__hadd2(a0[j], a1[j]), b[j], res2);
                                                    ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(147): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(__hadd2(a0[j], a1[j]), b[j], res2);
                                                          ^

6 errors detected in the compilation of "/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu".
[2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o 
ninja: build stopped: subcommand failed.

Actual bitrate of models on github?

Are the models you report in your README supposed to be actual 2-bit models or just 2.x-bit models? For example, the two 7B models below are both larger than a pure 2-bit decoder model, which would take about 2.1 GB on disk. Also, why is there such a large difference in size between the 1x16 and 2x8 models? The size gap is much larger than the codebook size delta should account for. Are you using different group sizes in each model? Thanks

[image]
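
For reference, here is a back-of-the-envelope estimate behind the 2.1 GB figure above; the 6.5B linear-weight count and the fp16 embedding/LM-head assumption are illustrative guesses, not numbers taken from the repo.

# Rough estimate of a "pure" 2-bit 7B model (assumed figures, not exact configs):
linear_params = 6.5e9              # weights living in the quantized linear layers
fp16_params = 2 * 32_000 * 4_096   # input embeddings + LM head, assumed kept in fp16

size_bytes = linear_params * 2 / 8 + fp16_params * 2
print(f"~{size_bytes / 1e9:.1f} GB")  # ~2.1 GB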

How long does it take to quantize?

I've been using quantization tools like GPTQ, Exllama, and QUIP#. Those tools are quite fast at quantizing on a single A6000 GPU. This tool, however, takes a really long time even though I'm using two A6000 GPUs. How long should it take to quantize Mistral 7B on two A6000 GPUs with these parameters:

python main.py ../models/my-mistral-7B wikitext2 --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --save ../models/my-mistral-7B-AQLM --model_seqlen 8192 --offload_activations

The printed total number of parameters is incomplete.

model:ISTA-DASLab/Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf

import torch
from transformers import AutoModelForCausalLM

base_model = "ISTA-DASLab/Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf"
model = AutoModelForCausalLM.from_pretrained(base_model,
                                             trust_remote_code=True,
                                             attn_implementation="flash_attention_2",
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto",
                                             low_cpu_mem_usage=True)
total_parameters = sum(p.numel() for p in model.parameters())

print(f"Total number of parameters: {total_parameters}")

output:
Total number of parameters: 6546853888
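
A likely explanation (my assumption, not confirmed in this thread) is that AQLM layers store packed integer codes, each covering a whole group of weights, so summing numel() over the stored parameters undercounts the logical weight count. Below is a hedged sketch that counts logical weights by layer shape instead; it assumes quantized linear layers expose in_features/out_features like nn.Linear, and the helper name is hypothetical.

import torch.nn as nn

def logical_weight_count(model: nn.Module) -> int:
    """Hypothetical helper: count weights by layer shape, not by stored tensor size."""
    total = 0
    counted = set()
    for module in model.modules():
        # Covers nn.Linear and (by assumption) AQLM's quantized linear replacement.
        if hasattr(module, "in_features") and hasattr(module, "out_features"):
            total += module.in_features * module.out_features
            if getattr(module, "bias", None) is not None:
                total += module.bias.numel()
            counted.update(id(p) for p in module.parameters(recurse=False))
    # Everything else (embeddings, norms, ...) is counted as usual.
    total += sum(p.numel() for p in model.parameters() if id(p) not in counted)
    return total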

aqlm/inference_kernels/cuda_kernel.cu compilation errors

Hi! I'm running into the following issue on the forward pass while prompt tuning an AQLM model (it only happens with AQLM models). I'm using https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch/tree/main.


CalledProcessError Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2096, in _run_ninja_build(build_directory, verbose, error_prefix)
2095 stdout_fileno = 1
-> 2096 subprocess.run(
2097 command,
2098 stdout=stdout_fileno if verbose else subprocess.PIPE,
2099 stderr=subprocess.STDOUT,
2100 cwd=build_directory,
2101 check=True,
2102 env=env)
2103 except subprocess.CalledProcessError as e:
2104 # Python 2 and 3 compatible way of getting the error object.

File /usr/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
525 if check and retcode:
--> 526 raise CalledProcessError(retcode, process.args,
527 output=stdout, stderr=stderr)
528 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)
File , line 41
39 batch = {k: v.to(device) for k, v in batch.items()}
40 with torch.cuda.amp.autocast():
---> 41 outputs = model(**batch)
42 loss = outputs.loss
44 loss = loss / gradient_accumulation_steps

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/peft/peft_model.py:1295, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs)
1293 prompts = prompts.to(inputs_embeds.dtype)
1294 inputs_embeds = torch.cat((prompts, inputs_embeds), dim=1)
-> 1295 return self.base_model(inputs_embeds=inputs_embeds, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1360, in MixtralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1357 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1359 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1360 outputs = self.model(
1361 input_ids=input_ids,
1362 attention_mask=attention_mask,
1363 position_ids=position_ids,
1364 past_key_values=past_key_values,
1365 inputs_embeds=inputs_embeds,
1366 use_cache=use_cache,
1367 output_attentions=output_attentions,
1368 output_hidden_states=output_hidden_states,
1369 output_router_logits=output_router_logits,
1370 return_dict=return_dict,
1371 )
1373 hidden_states = outputs[0]
1374 logits = self.lm_head(hidden_states)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1217, in MixtralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1214 all_hidden_states += (hidden_states,)
1216 if self.gradient_checkpointing and self.training:
-> 1217 layer_outputs = self._gradient_checkpointing_func(
1218 decoder_layer.call,
1219 hidden_states,
1220 attention_mask,
1221 position_ids,
1222 past_key_values,
1223 output_attentions,
1224 output_router_logits,
1225 use_cache,
1226 )
1227 else:
1228 layer_outputs = decoder_layer(
1229 hidden_states,
1230 attention_mask=attention_mask,
(...)
1235 use_cache=use_cache,
1236 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_compile.py:24, in _disable_dynamo..inner(*args, **kwargs)
20 @functools.wraps(fn)
21 def inner(*args, **kwargs):
22 import torch._dynamo
---> 24 return torch._dynamo.disable(fn, recursive)(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:489, in _TorchDynamoContext.call.._fn(*args, **kwargs)
487 dynamo_config_ctx.enter()
488 try:
--> 489 return fn(*args, **kwargs)
490 finally:
491 set_eval_frame(prior)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_dynamo/external_utils.py:17, in wrap_inline..inner(*args, **kwargs)
15 @functools.wraps(fn)
16 def inner(*args, **kwargs):
---> 17 return fn(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/checkpoint.py:482, in checkpoint(function, use_reentrant, context_fn, determinism_check, debug, *args, **kwargs)
477 if context_fn is not noop_context_fn or debug is not False:
478 raise ValueError(
479 "Passing context_fn or debug is only supported when "
480 "use_reentrant=False."
481 )
--> 482 return CheckpointFunction.apply(function, preserve, *args)
483 else:
484 gen = _checkpoint_without_reentrant_generator(
485 function, preserve, context_fn, determinism_check, debug, *args, **kwargs
486 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/autograd/function.py:553, in Function.apply(cls, *args, **kwargs)
550 if not torch._C._are_functorch_transforms_active():
551 # See NOTE: [functorch vjp and autograd interaction]
552 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 553 return super().apply(*args, **kwargs) # type: ignore[misc]
555 if not is_setup_ctx_defined:
556 raise RuntimeError(
557 "In order to use an autograd.Function with functorch transforms "
558 "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
559 "staticmethod. For more details, please see "
560 "https://pytorch.org/docs/master/notes/extending.func.html"
561 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/checkpoint.py:261, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
258 ctx.save_for_backward(*tensor_inputs)
260 with torch.no_grad():
--> 261 outputs = run_function(*args)
262 return outputs

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:934, in MixtralDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, output_router_logits, use_cache, **kwargs)
931 hidden_states = self.input_layernorm(hidden_states)
933 # Self Attention
--> 934 hidden_states, self_attn_weights, present_key_value = self.self_attn(
935 hidden_states=hidden_states,
936 attention_mask=attention_mask,
937 position_ids=position_ids,
938 past_key_value=past_key_value,
939 output_attentions=output_attentions,
940 use_cache=use_cache,
941 )
942 hidden_states = residual + hidden_states
944 # Fully Connected

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:730, in MixtralSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
719 return super().forward(
720 hidden_states=hidden_states,
721 attention_mask=attention_mask,
(...)
725 use_cache=use_cache,
726 )
728 bsz, q_len, _ = hidden_states.size()
--> 730 query_states = self.q_proj(hidden_states)
731 key_states = self.k_proj(hidden_states)
732 value_states = self.v_proj(hidden_states)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference.py:70, in QuantizedLinear.forward(self, input)
68 def forward(self, input: torch.Tensor) -> torch.Tensor:
69 if self.gemv_op is None:
---> 70 self.prepare_matmul_op(input)
72 if self.use_gemv_rule(input):
73 return self.gemv_op.apply(input, self.codes, self.codebooks, self.scales, self.bias)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference.py:86, in QuantizedLinear.prepare_matmul_op(self, input)
78 if (
79 not input.is_cuda
80 and self.codebook_size == 256
81 and self.codes.shape[0] == self.out_features // self.out_group_size
82 ):
83 self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous() # TODO: fix this thing
85 self.gemv_op = _get_autograd_matmul_op(
---> 86 get_forward_pass_kernel(self.codebooks, False),
87 get_backward_pass_kernel(self.codebooks, False),
88 )
90 self.gemm_op = _get_autograd_matmul_op(
91 get_forward_pass_kernel(self.codebooks, True),
92 get_backward_pass_kernel(self.codebooks, True),
93 )
95 self.use_gemv_rule = lambda input: math.prod(input.shape[:-1]) <= 6

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py:35, in get_forward_pass_kernel(codebooks, optimize_for_training)
25 num_codebooks, codebook_size, out_group_size, in_group_size = codebooks.shape
27 if (optimize_for_training, codebooks.device.type, num_codebooks, codebook_size, out_group_size, in_group_size) == (
28 False,
29 "cuda",
(...)
33 8,
34 ):
---> 35 from .cuda_kernel import CUDA_FOLDER
37 return torch.ops.aqlm.code1x16_matmat
38 elif (
39 optimize_for_training,
40 codebooks.device.type,
(...)
51 8,
52 ):

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.py:8
5 from torch.utils.cpp_extension import load
7 CUDA_FOLDER = os.path.dirname(os.path.abspath(file))
----> 8 CUDA_KERNEL = load(
9 name="codebook_cuda",
10 sources=[os.path.join(CUDA_FOLDER, "cuda_kernel.cpp"), os.path.join(CUDA_FOLDER, "cuda_kernel.cu")],
11 )
13 torch.library.define(
14 "aqlm::code1x16_matmat", "(Tensor input, Tensor codes, Tensor codebooks, Tensor scales, Tensor bias) -> Tensor"
15 )
17 torch.library.impl("aqlm::code1x16_matmat", "default", CUDA_KERNEL.code1x16_matmat)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1306, in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1214 def load(name,
1215 sources: Union[str, List[str]],
1216 extra_cflags=None,
(...)
1224 is_standalone=False,
1225 keep_intermediates=True):
1226 """
1227 Load a PyTorch C++ extension just-in-time (JIT).
1228
(...)
1304 ... verbose=True)
1305 """
-> 1306 return _jit_compile(
1307 name,
1308 [sources] if isinstance(sources, str) else sources,
1309 extra_cflags,
1310 extra_cuda_cflags,
1311 extra_ldflags,
1312 extra_include_paths,
1313 build_directory or _get_build_directory(name, verbose),
1314 verbose,
1315 with_cuda,
1316 is_python_module,
1317 is_standalone,
1318 keep_intermediates=keep_intermediates)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1710, in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1706 hipified_sources.add(hipify_result[s_abs].hipified_path if s_abs in hipify_result else s_abs)
1708 sources = list(hipified_sources)
-> 1710 _write_ninja_file_and_build_library(
1711 name=name,
1712 sources=sources,
1713 extra_cflags=extra_cflags or [],
1714 extra_cuda_cflags=extra_cuda_cflags or [],
1715 extra_ldflags=extra_ldflags or [],
1716 extra_include_paths=extra_include_paths or [],
1717 build_directory=build_directory,
1718 verbose=verbose,
1719 with_cuda=with_cuda,
1720 is_standalone=is_standalone)
1721 finally:
1722 baton.release()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1823, in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
1821 if verbose:
1822 print(f'Building extension module {name}...', file=sys.stderr)
-> 1823 _run_ninja_build(
1824 build_directory,
1825 verbose,
1826 error_prefix=f"Error building extension '{name}'")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2112, in _run_ninja_build(build_directory, verbose, error_prefix)
2110 if hasattr(error, 'output') and error.output: # type: ignore[union-attr]
2111 message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}" # type: ignore[union-attr]
-> 2112 raise RuntimeError(message) from e

RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
FAILED: cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(270): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists

/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(270): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists

/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(256): warning #177-D: variable "res" was declared but never referenced

2 errors detected in the compilation of "/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu".
[2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o
ninja: build stopped: subcommand failed.

optimized 1x16 and 2x8 decompression/dequantization kernels now exist

Hello, this isn't really a problem, I just wanted to let you know that I wrote 1x16 and 2x8 decompression/dequantization kernels that run 6x and 9x faster (respectively) than the generic PyTorch ones you coded up, which is good for prefill performance. I didn't roll the scale into them because scaling the output is very fast.

Anyway, if you want to see them or bring them over, feel free, they are in:

vllm-project/vllm#3287

Cheers, and AQLM is really great; reaching Pareto optimality is a really nice achievement!

-James

Global finetuning?

How does your updated fine-tuning method work compared to the one described in your arXiv paper?

Case Study: Instruction Tuning on AQLM Models

Hi, we have performed a small experiment on fine-tuning the Llama-2-70B-AQLM-2Bit model using the PEFT QLoRA method. We utilized the Alpaca and Glaive datasets for instruction tuning, and the fine-tuned version demonstrates preliminary conversation and tool-using abilities. We found that training only requires 24GB of GPU RAM, while inference needs only 20GB, so fine-tuning a 70B model on consumer devices can be feasible. However, we found that AQLM significantly increases the overall training time; it would be better if the training speed could be improved. Thanks again for your excellent work!

Adapter weights: https://huggingface.co/hiyouga/Llama-2-70b-AQLM-2Bit-QLoRA-function-calling

[images: examples, train_loss, gpu]
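
For readers who want to try a similar setup, here is a minimal sketch of attaching LoRA adapters to a prequantized AQLM checkpoint with peft; the checkpoint name follows the repo's hub naming pattern, and the LoRA hyperparameters and target modules are illustrative, not the exact configuration used in this case study.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative LoRA setup: train small adapters on top of the frozen 2-bit weights.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable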

Minor race condition in CPU 2x8 inference code

The current implementation sometimes returns close_but_not_very_accurate outputs because of this line:

https://github.com/Vahe1994/AQLM/blob/main/inference_lib/src/aqlm/inference_kernels/numba_kernel.py#L43

This is because the CPU cores may concurrently write to the same output unit without atomic addition.

Changing the outer parallel loop to go over output_groups instead fixes the issue, but it may be slower.
Note that the layer error is within <1% in my experiments and the model outputs look reasonable.

@BlackSamorez, would the code be any slower if you changed the summation to go in parallel over output groups or used atomic addition?
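
To illustrate the point, here is a standalone sketch (not the actual AQLM Numba kernel): when the parallel loop runs over output rows, each thread writes only to its own output element, so no atomic addition is needed; parallelizing over input groups instead would require atomics or per-thread partial sums.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def gemv_over_output_rows(codes, codebook, x, out):
    # codes:    (num_out, num_in_groups) integer indices into the codebook
    # codebook: (codebook_size, group_size) float32 entries
    # x:        (num_in_groups * group_size,) float32 input vector
    num_out, num_in_groups = codes.shape
    group_size = codebook.shape[1]
    for i in prange(num_out):  # each thread owns out[i]: no concurrent writes
        acc = np.float32(0.0)
        for g in range(num_in_groups):
            entry = codebook[codes[i, g]]
            base = g * group_size
            for k in range(group_size):
                acc += entry[k] * x[base + k]
        out[i] = acc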

How to fine-tune the compressed model

How can I fine-tune the compressed model? Thank you very much for the authors' work. I previously tried to deploy Llama-2 on a 3090 but failed due to a lack of GPU memory, and recently I found this work on compressing large models. If compressed model parameters can be used for fine-tuning tasks, are there more detailed steps? I am very interested in your current work.

Why only last token used in datasets for training?

Dear maintainers,

Thank you for your awesome paper and open-source project. I recently ran into a detail of the dataset pre-processing which I cannot understand properly.

In the processing of the datasets you ignore all the labels except the last one (as in `tar[:, :-1] = -100`). This seems a bit odd, since transformers apply causal masking inside the forward pass, and when training a language model we want to propagate the loss through all of the tokens, not just the last one.

Could you please explain this detail?
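
For context, here is a minimal illustration of the two labeling schemes being compared (not the repo's exact preprocessing code):

import torch

input_ids = torch.tensor([[11, 22, 33, 44]])

# Scheme from the snippet above: every label except the last is set to -100,
# which cross-entropy ignores, so only the final next-token prediction
# contributes to the loss.
last_token_labels = input_ids.clone()
last_token_labels[:, :-1] = -100

# Standard causal-LM fine-tuning: label every position; the model's internal
# shift makes each position predict the following token, so the loss is
# averaged over the whole sequence.
full_labels = input_ids.clone()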

I wanted to know what the beauty of this technology is

Greetings to the developers of the new method. I launched your Google Colab and got a very dubious generation result. Can you explain a little about the essence of this compression breakthrough? I may not have understood something about how to use it, but judging by the output below, the quality is completely lost.

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf", trust_remote_code=True, torch_dtype=torch.float16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained("BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf")

inputs = tokenizer(["Write a poem about python"], return_tensors="pt")["input_ids"].cuda()

streamer = TextStreamer(tokenizer)
_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=120)

<s> Write a poem about python.
Write a poem about python.
Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem

ERROR: Could not find a version that satisfies the requirement aqlm[gpu]==1.0.0

Hi, I am trying to run colab_example.ipynb on an NVIDIA A100 in a non-Colab environment. When I run `!pip install aqlm[gpu]==1.0.0`, I get the message below. I am using Python 3.10, CUDA 12.2 and an NVIDIA A4 GPU.


ERROR: Could not find a version that satisfies the requirement aqlm[gpu]==1.0.0 (from versions: 1.0.2, 1.0.3)
ERROR: No matching distribution found for aqlm[gpu]==1.0.0


The requirements.txt file in the repo gives the issues below:

ERROR: tensorflow 2.11.0 has requirement protobuf<3.20,>=3.9.2, but you'll have protobuf 3.20.3 which is incompatible.
ERROR: dask-sql 2023.6.0 has requirement pandas>=1.4.0, but you'll have pandas 1.1.5 which is incompatible.
ERROR: azureml-inference-server-http 0.8.4 has requirement flask<2.3.0, but you'll have flask 2.3.2 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.51.0 has requirement datasets<=2.3.2,>=1.7.0, but you'll have datasets 2.15.0 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.51.0 has requirement torch<=1.12.0,>=1.5.0, but you'll have torch 2.2.1 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.51.0 has requirement transformers[sentencepiece]<=4.16.0, but you'll have transformers 4.37.0 which is incompatible.
ERROR: azure-cli 2.49.0 has requirement azure-keyvault-keys==4.8.0b2, but you'll have azure-keyvault-keys 4.8.0 which is incompatible.
ERROR: azure-cli 2.49.0 has requirement azure-mgmt-keyvault==10.2.0, but you'll have azure-mgmt-keyvault 10.2.1 which is incompatible.
ERROR: azure-cli 2.49.0 has requirement azure-mgmt-resource==22.0.0, but you'll have azure-mgmt-resource 21.1.0b1 which is incompatible.
ERROR: autokeras 1.0.16 has requirement tensorflow<=2.5.0,>=2.3.0, but you'll have tensorflow 2.11.0 which is incompatible.
ERROR: arviz 0.11.2 has requirement typing-extensions<4,>=3.7.4.3, but you'll have typing-extensions 4.10.0 which is incompatible.

33B llama: inference time after quantization

Why is it that after I quantize, my inference is 2 times slower than with the original model?

Quantize command:python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --offload_activations --wandb --save $SAVE_PATH --dtype float32

Quantitative results:
[image]

In addition, looking at the GPU usage, it seems the model is being run as a pipeline across devices: I load the model onto four GPUs, but only one GPU is at 100% utilization at any given time.
[image]

Have you encountered the same problem? Is the inference speed of your quantized models normal?

Reproduce perplexity

In the README the perplexity is

Llama-2-7b | 1x16 | 5.92 | 2.4

In the paper it is:

Llama-2-7b AQLM 2.29 6.29 8.11

When I run locally using the same command as in the readme

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 \
 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --wandb --save $SAVE_PATH

it gives me

Llama-2-7b AQLM 2.29 6.45 8.39

May I know why there is such a mismatch? Thanks for any clarification.

AQLM models cannot be trained in parallel.

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file deepspeed_zero2-sft.yaml --num_processes=1 ft-4bit-freedom-2bit.py \
    --base_model 'Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf' \
    --data_path 'eval_set_IA_multi_dialog3_sft.json' \
    --output_dir 'eval_set_IA_multi_dialog3_sft' \
    --batch_size 1 \
    --micro_batch_size 1 \
    --num_epochs 1 \
    --learning_rate 5e-4 \
    --cutoff_len 1024 \
    --val_set_size 0 \
    --lora_r 32 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --train_on_inputs \
    --group_by_length

Traceback (most recent call last):
  File "/home/luhao/alpaca-2bit-sft/ft-4bit-freedom-2bit.py", line 345, in <module>
    fire.Fire(train)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/luhao/alpaca-2bit-sft/ft-4bit-freedom-2bit.py", line 336, in train
    trainer.train()
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer.py", line 1933, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/accelerator.py", line 1255, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
    self._broadcast_model()
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work = group.broadcast([tensor], opts)
TypeError: Input tensor data type is not supported for NCCL process group: Short

Finetuning ISTA-DASLab/Mistral-7B-Instruct-v0.2-AQLM-2Bit-2x8: RuntimeError: CUDA error: invalid argument

I've tried to finetune the model "ISTA-DASLab/Mistral-7B-Instruct-v0.2-AQLM-2Bit-2x8" using the notebook "aqlm_2bit_training.ipynb" but I get the below runtime error after initiating the training. Before trying to finetune this model, I had successfully run this notebook with the original model "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf" that was defined in the notebook.

ERROR:
"
max_steps is given, it will override any value given in num_train_epochs
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(

RuntimeError Traceback (most recent call last)
in <cell line: 24>()
22 )
23 model.config.use_cache = False # silence the warnings. Please re-enable for inference!
---> 24 trainer.train()

34 frames
/usr/local/lib/python3.10/dist-packages/torch/_ops.py in call(self, *args, **kwargs)
753 # We save the function ptr as the op attribute on
754 # OpOverloadPacket to access it here.
--> 755 return self._op(*args, **(kwargs or {}))
756
757 # TODO: use this to make a dir

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions."

Any chance to have K x 16 / 1 x 16 scheme implemented for Numba?

Hi. I am interested in some ideas that require a Mixtral-type model and, for resource reasons, quantization plus partial CPU offloading.

Since this quantization technique seems relatively good, I am going to try it.

However, as far as I understand from https://github.com/Vahe1994/AQLM/blob/main/README.md, the best results are achieved with the 1x16 scheme, and it is not implemented for the CPU case (well, there is no optimized kernel: judging by https://github.com/Vahe1994/AQLM/blob/main/inference_lib/src/aqlm/inference_kernels/kernel_selector.py, it should fall back to dequantization plus torch's built-in matmul on top of that).

So I guess I may later dive into implementing such a kernel myself, but maybe it is planned anyway?
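
For reference, a rough sketch of the fallback path described above, under assumed tensor shapes (this is not the library's actual code): with no specialized 1x16 CPU kernel, the weight can be materialized by gathering codebook entries and the matmul delegated to torch.

import torch

def dequantize_then_matmul(x, codes, codebook):
    """Hypothetical 1x16 CPU fallback: materialize the weight, then use torch.matmul.

    codes:    (out_features, in_groups) int64 indices into the codebook
    codebook: (2**16, in_group_size) float32 codebook entries
    x:        (..., in_groups * in_group_size) activations
    """
    out_features = codes.shape[0]
    weight = codebook[codes].reshape(out_features, -1)  # gather + flatten groups
    return x @ weight.T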

Query on Evaluation Support for C4 Validation

Thanks for the excellent work and I really appreciate the public code.

I am currently working with the lm-evaluation-harness for validating some of the language models and have encountered a slight issue. It appears that the lm-evaluation-harness does not provide support for validation on the C4 dataset.

I was wondering if you might have any recommendations or alternative approaches for conducting validation on the C4 dataset using the lm-evaluation-harness or similar tools. Your guidance on this matter would be greatly appreciated.

Thank you once again for your contribution to the community. I am looking forward to your response.

Quantization time & VRAM requirements

Hello,

I have two basic questions:

  1. Do you have any data on how long it takes to quantize a 70b model using 24GB VRAM (assuming that's possible)?
  2. Do you plan to release prequantized models on Hugging Face? Having llama-2-70b for comparison with other methods would be useful.

Issues while attempting LLaMA-3 Quantization

Checking to see if this repo works for the new L3 models. Running this script:

export CUDA_VISIBLE_DEVICES=0,1   # or e.g. 0,1,2,3
export MODEL_PATH=/home/catid/models/Meta-Llama-3-8B-Instruct
export DATASET_PATH=pajama
export SAVE_PATH=/home/catid/models/cat-llama-3-8b-instruct-aqlm
export WANDB_PROJECT=aqlm
export WANDB_NAME=aqlm8

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --val_size=128 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 \
 --finetune_max_epochs=10 \
 --finetune_early_stop=3 \
 --finetune_keep_best \
 --local_batch_size=1 \
 --offload_activations \
 --wandb \
 --resume \
 --save $SAVE_PATH

I see:

============ Load model... ============
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 23.65it/s]
Loading pretrained model ...
Model loaded sucсessfully ...

============ Quantizing model... ============
Loading data ...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'.
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
  File "/home/catid/sources/AQLM/main.py", line 892, in <module>
    quantize_model(model, args)
  File "/home/catid/sources/AQLM/main.py", line 41, in quantize_model
    data = get_loaders(
  File "/home/catid/sources/AQLM/src/datautils.py", line 226, in get_loaders
    tokenizer = LlamaTokenizer.from_pretrained(
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 169, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

`RuntimeError: CUDA error: invalid argument` while running

I have an Ubuntu 23.10 system.

I installed cudatoolkit 12.1 using https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

(since it needs headers and such, so I can't just install CUDA through conda).

The rest of my environment:

accelerate @ git+https://github.com/huggingface/accelerate.git@97d2168e5953fe7373a06c69c02c5a00a84d5344
anyio==4.2.0
aqlm @ file:///home/alex4321/Documents/AQLM/inference_lib
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==23.2.0
Babel==2.14.0
beautifulsoup4==4.12.3
bleach==6.1.0
Brotli @ file:///work/perseverance-python-buildout/croot/brotli-split_1698805593785/work
certifi @ file:///croot/certifi_1696279375225/work/certifi
cffi @ file:///croot/cffi_1700254295673/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
comm==0.2.1
cryptography @ file:///work/perseverance-python-buildout/croot/cryptography_1698845900024/work
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
executing==2.0.1
fastjsonschema==2.19.1
filelock @ file:///work/perseverance-python-buildout/croot/filelock_1698846025262/work
fqdn==1.5.1
fsspec==2024.2.0
h11==0.14.0
httpcore==1.0.3
httpx==0.26.0
huggingface-hub==0.20.3
idna @ file:///work/perseverance-python-buildout/croot/idna_1698845632828/work
ipykernel==6.29.2
ipython==8.21.0
isoduration==20.11.0
jedi==0.19.1
Jinja2 @ file:///work/perseverance-python-buildout/croot/jinja2_1698847462642/work
json5==0.9.14
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter-events==0.9.0
jupyter-lsp==2.2.2
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyter_server==2.12.5
jupyter_server_terminals==0.5.2
jupyterlab==4.1.1
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.3
llvmlite==0.42.0
MarkupSafe @ file:///work/perseverance-python-buildout/croot/markupsafe_1698846636000/work
matplotlib-inline==0.1.6
mistune==3.0.2
mkl-service==2.4.0
mpmath @ file:///work/perseverance-python-buildout/croot/mpmath_1698864994882/work
nbclient==0.9.0
nbconvert==7.16.0
nbformat==5.9.2
nest-asyncio==1.6.0
networkx @ file:///work/perseverance-python-buildout/croot/networkx_1698865062738/work
ninja==1.11.1.1
notebook_shim==0.2.4
numba==0.59.0
numpy @ file:///work/perseverance-python-buildout/croot/numpy_and_numpy_base_1698845160062/work/dist/numpy-1.26.0-cp312-cp312-linux_x86_64.whl#sha256=fdc35057024038070345ff9f7f47ed48ecdb21dd72461617bdadf4f5d1634fcb
overrides==7.7.0
packaging==23.2
pandocfilters==1.5.1
parso==0.8.3
pexpect==4.9.0
Pillow @ file:///work/perseverance-python-buildout/croot/pillow_1698847657722/work
platformdirs==4.2.0
prometheus_client==0.20.0
prompt-toolkit==3.0.43
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
Pygments==2.17.2
pyOpenSSL @ file:///work/perseverance-python-buildout/croot/pyopenssl_1698863523157/work
PySocks @ file:///work/perseverance-python-buildout/croot/pysocks_1698845478203/work
python-dateutil==2.8.2
python-json-logger==2.0.7
PyYAML @ file:///work/perseverance-python-buildout/croot/pyyaml_1698849903511/work
pyzmq==25.1.2
referencing==0.33.0
regex==2023.12.25
requests @ file:///work/perseverance-python-buildout/croot/requests_1698846321763/work
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.18.0
safetensors==0.4.2
scipy==1.12.0
Send2Trash==1.8.2
setuptools==68.0.0
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
stack-data==0.6.3
sympy @ file:///croot/sympy_1701397643339/work
terminado==0.18.0
tinycss2==1.2.1
tokenizers==0.15.2
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
tornado==6.4
tqdm==4.66.2
traitlets==5.14.1
transformers==4.37.0
triton==2.2.0
types-python-dateutil==2.8.19.20240106
typing_extensions==4.9.0
uri-template==1.3.0
urllib3 @ file:///work/perseverance-python-buildout/croot/urllib3_1698845837793/work
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
wheel==0.41.2

AQLM installed from latest github state.

Now if I try to run some code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

llama_cpu_quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf",
    trust_remote_code=True, torch_dtype=torch.float32, device_map="cpu"
)
llama_gpu_quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf",
    trust_remote_code=True, torch_dtype=torch.float16, device_map="cuda:0"
).cuda()
llama_tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-hf")

output = llama_cpu_quantized_model.generate(llama_tokenizer("Test is", return_tensors="pt")["input_ids"], max_new_tokens=10)
print(llama_tokenizer.decode(output[0]))

it tells me

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 4096, 2)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Compiling AQLM numba kernel with parameters: kernel_key=(8, 11008, 4096, 2)
Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 11008, 2)
<s> Test is a 19999 film directed by

which I guess is more or less fine.

But:

output = llama_gpu_quantized_model.generate(llama_tokenizer("Test is", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
print(llama_tokenizer.decode(output[0]))

gives me

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[7], line 1
----> 1 output = llama_gpu_quantized_model.generate(llama_tokenizer("Test is", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
      2 print(llama_tokenizer.decode(output[0]))

File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:1474, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1457     return self.assisted_decoding(
   1458         input_ids,
   1459         candidate_generator=candidate_generator,
   (...)
   1470         **model_kwargs,
   1471     )
   1472 if generation_mode == GenerationMode.GREEDY_SEARCH:
   1473     # 11. run greedy search
-> 1474     return self.greedy_search(
   1475         input_ids,
   1476         logits_processor=prepared_logits_processor,
   1477         stopping_criteria=prepared_stopping_criteria,
   1478         pad_token_id=generation_config.pad_token_id,
   1479         eos_token_id=generation_config.eos_token_id,
   1480         output_scores=generation_config.output_scores,
   1481         return_dict_in_generate=generation_config.return_dict_in_generate,
   1482         synced_gpus=synced_gpus,
   1483         streamer=streamer,
   1484         **model_kwargs,
   1485     )
   1487 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
   1488     if not model_kwargs["use_cache"]:

File ~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2335, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2332 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2334 # forward pass to get next token
-> 2335 outputs = self(
   2336     **model_inputs,
   2337     return_dict=True,
   2338     output_attentions=output_attentions,
   [2339](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2339)     output_hidden_states=output_hidden_states,
   [2340](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2340) )
   [2342](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2342) if synced_gpus and this_peer_finished:
   [2343](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2343)     continue  # don't waste resources running the code we don't need

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1195](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1195), in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   [1192](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1192) return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   [1194](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1194) # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> [1195](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1195) outputs = self.model(
   [1196](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1196)     input_ids=input_ids,
   [1197](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1197)     attention_mask=attention_mask,
   [1198](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1198)     position_ids=position_ids,
   [1199](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1199)     past_key_values=past_key_values,
   [1200](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1200)     inputs_embeds=inputs_embeds,
   [1201](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1201)     use_cache=use_cache,
   [1202](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1202)     output_attentions=output_attentions,
   [1203](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1203)     output_hidden_states=output_hidden_states,
   [1204](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1204)     return_dict=return_dict,
   [1205](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1205) )
   [1207](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1207) hidden_states = outputs[0]
   [1208](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1208) if self.config.pretraining_tp > 1:

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1082](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1082), in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
   [1072](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1072)     layer_outputs = self._gradient_checkpointing_func(
   [1073](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1073)         decoder_layer.__call__,
   [1074](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1074)         hidden_states,
   (...)
   [1079](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1079)         use_cache,
   [1080](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1080)     )
   [1081](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1081) else:
-> [1082](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1082)     layer_outputs = decoder_layer(
   [1083](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1083)         hidden_states,
   [1084](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1084)         attention_mask=attention_mask,
   [1085](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1085)         position_ids=position_ids,
   [1086](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1086)         past_key_value=past_key_values,
   [1087](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1087)         output_attentions=output_attentions,
   [1088](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1088)         use_cache=use_cache,
   [1089](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1089)     )
   [1091](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1091) hidden_states = layer_outputs[0]
   [1093](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1093) if use_cache:

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:810](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:810), in LlamaDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, **kwargs)
    [807](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:807) hidden_states = self.input_layernorm(hidden_states)
    [809](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:809) # Self Attention
--> [810](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:810) hidden_states, self_attn_weights, present_key_value = self.self_attn(
    [811](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:811)     hidden_states=hidden_states,
    [812](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:812)     attention_mask=attention_mask,
    [813](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:813)     position_ids=position_ids,
    [814](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:814)     past_key_value=past_key_value,
    [815](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:815)     output_attentions=output_attentions,
    [816](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:816)     use_cache=use_cache,
    [817](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:817)     **kwargs,
    [818](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:818) )
    [819](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:819) hidden_states = residual + hidden_states
    [821](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:821) # Fully Connected

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:705](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:705), in LlamaSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    [694](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:694)     return super().forward(
    [695](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:695)         hidden_states=hidden_states,
    [696](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:696)         attention_mask=attention_mask,
   (...)
    [700](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:700)         use_cache=use_cache,
    [701](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:701)     )
    [703](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:703) bsz, q_len, _ = hidden_states.size()
--> [705](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:705) query_states = self.q_proj(hidden_states)
    [706](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:706) key_states = self.k_proj(hidden_states)
    [707](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:707) value_states = self.v_proj(hidden_states)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:65](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:65), in QuantizedLinear.forward(self, input)
     [59](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:59) if (
     [60](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:60)     not input.is_cuda
     [61](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:61)     and self.codebook_size == 256
     [62](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:62)     and self.codes.shape[0] == self.out_features [/](https://file+.vscode-resource.vscode-cdn.net/)[/](https://file+.vscode-resource.vscode-cdn.net/) self.out_group_size
     [63](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:63) ):
     [64](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:64)     self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous()  #  TODO: fix this thing
---> [65](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:65) return forward_pass_quantized_linear(input, self.codes, self.codebooks, self.scales, self.bias)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:31](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:31), in forward_pass_quantized_linear(input, codes, codebooks, scales, bias)
     [26](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:26)     from .cuda_kernel import CUDA_KERNEL
     [28](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:28)     assert (
     [29](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:29)         input.dtype == torch.float16
     [30](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:30)     ), f"please load the model with `torch_dtype=torch.float16`, as {input.dtype} is not supported on GPU yet"
---> [31](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:31)     return CUDA_KERNEL.code2x8_matmat(input, codes, codebooks, scales) + (bias if bias is not None else 0)
     [32](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:32) case (True, _, _, _, _):
     [33](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:33)     from .triton_kernel import triton_matmul

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
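
For reference, a minimal sketch (not part of the original report) of setting that debugging flag from Python, before torch initializes CUDA:

    import os

    # Force synchronous CUDA kernel launches so the failing kernel is reported at
    # its actual call site; must be set before CUDA is first initialized.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # imported only after the variable is set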

P.S. It is somewhat off-topic, but I don't understand why these two out_features computations differ:

auto out_features = codes.size(0) * codebooks.size(2);
out_features = codes.shape[1] * out_group_size
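
A possible reconciliation, offered only as an assumption about AQLM's tensor layouts rather than a verified answer: if codes has shape (num_out_groups, num_in_groups, num_codebooks) on GPU and codebooks has shape (num_codebooks, codebook_size, out_group_size, in_group_size), both expressions reduce to num_out_groups * out_group_size; the CPU path simply operates on codes permuted with dims (1, 0, 2), as seen in the traceback above.

    # Hypothetical shape bookkeeping under the layout assumptions stated above.
    num_out_groups, out_group_size = 512, 8

    out_features_cuda = num_out_groups * out_group_size  # codes.size(0) * codebooks.size(2)
    out_features_cpu = num_out_groups * out_group_size   # codes.shape[1] * out_group_size, after the (1, 0, 2) permute
    assert out_features_cuda == out_features_cpu == 4096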

Do not hard pin the `transformers` dependency version

Hard pinning the transformers version causes a dependency conflict in pip:

transformers==4.37.0

Successfully built transformers
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.37.0
    Uninstalling transformers-4.37.0:
      Successfully uninstalled transformers-4.37.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aqlm 1.0.1 requires transformers==4.37.0, but you have transformers 4.38.0.dev0 which is incompatible.
Successfully installed transformers-4.38.0.dev0

Changing == to >= would allow later versions with updates to be installed (in this case, the update in question added AQLM compatibility!).
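
For illustration, a relaxed bound in the package metadata might look like the sketch below (a hypothetical setup.py; the actual aqlm packaging may be laid out differently):

    # Hypothetical setup.py sketch, not the actual aqlm packaging: a lower bound
    # instead of a hard pin lets newer transformers releases install cleanly.
    from setuptools import find_packages, setup

    setup(
        name="aqlm",
        version="1.0.1",
        packages=find_packages(),
        install_requires=[
            "torch>=2.1.1",
            "transformers>=4.37.0",  # was: transformers==4.37.0
        ],
    )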

Request for the Llama-2-13B with AQLM (2x8 scheme)

Hello,

Thanks for your outstanding work. I want to do a comprehensive comparison of recent quantization methods.

Since the latest lm-eval can obtain higher accuracy than the numbers reported in the paper, I have to re-evaluate each quantized model.

I found that there is no Llama-2-13B model quantized with the AQLM 2x8 scheme; could you share one on Hugging Face?

Thank you!

“RuntimeError: Only Tensors of floating point and complex dtype can require gradients” when trying to load a model with device_map="auto" or low_cpu_mem_usage

code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x15-hf",
    trust_remote_code=True, torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True
)

I assume the reason lies in big_modeling.py from accelerate, namely in this code:

    def register_empty_parameter(module, name, param):
        old_register_parameter(module, name, param)
        if param is not None:
            param_cls = type(module._parameters[name])
            kwargs = module._parameters[name].__dict__
            module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)

Apparently this happens because kwargs does not contain requires_grad=False.
The following (admittedly crude) patch fixed the error for me:

    def register_empty_parameter(module, name, param):
        old_register_parameter(module, name, param)
        if param is not None:
            param_cls = type(module._parameters[name])
            kwargs = module._parameters[name].__dict__
            kwargs["requires_grad"] = False # silly, but works
            module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)

But I understand this is the wrong solution and that there must be a proper way to fix the problem.
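
For context, the underlying constraint is easy to reproduce in isolation. Assuming the affected parameters are AQLM's integer code tensors, re-registering them without requires_grad=False triggers exactly this error, since integer tensors cannot require gradients:

    import torch

    # Minimal reproduction (assumption: the re-registered parameters are integer
    # AQLM code tensors). Integer dtypes cannot require gradients, so creating the
    # Parameter without requires_grad=False raises the same RuntimeError.
    codes = torch.zeros(4, 4, dtype=torch.int8)

    ok = torch.nn.Parameter(codes, requires_grad=False)  # fine
    try:
        torch.nn.Parameter(codes)  # requires_grad defaults to True
    except RuntimeError as e:
        print(e)  # Only Tensors of floating point and complex dtype can require gradients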

KV Cache Quantization

Hey, thanks for your work. I saw https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16/discussions/2 about how 8-bit KV cache quantization can be enabled in vLLM. I am not sure how exactly the KV cache is handled for AQLM in Transformers, but would KV cache quantization be theoretically possible? It might address some of the concerns about high VRAM usage for long contexts raised in https://www.reddit.com/r/LocalLLaMA/comments/1clinlb/bringing_2bit_llms_to_production_new_aqlm_models/.

To be specific, would 4-bit cache quantization be possible? Turboderp managed to achieve negligible perplexity loss somehow: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md. For reference, turboderp/exllamav2@324404e.

Thanks!
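
For what it's worth, recent transformers releases expose a generic quantized KV cache through generate(); whether it composes cleanly with AQLM weights is exactly the open question here, but a hedged sketch of that API looks roughly like this:

    # Hedged sketch, assuming a recent transformers release (>= 4.40) with the
    # optional `quanto` package installed; not verified against AQLM checkpoints.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        cache_implementation="quantized",                # use a quantized KV cache
        cache_config={"backend": "quanto", "nbits": 4},  # 4-bit keys/values
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))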

unable to install

py -m pip install aqlm[gpu,cpu]
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting aqlm[cpu,gpu]
Downloading aqlm-1.0.0-py3-none-any.whl (10 kB)
Collecting torch>=2.1.1
Downloading torch-2.2.0-cp311-cp311-win_amd64.whl (198.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 198.6/198.6 MB 12.6 MB/s eta 0:00:00
Collecting transformers==4.37.0
Downloading transformers-4.37.0-py3-none-any.whl (8.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 8.5 MB/s eta 0:00:00
Collecting numba>=0.56.4
Downloading numba-0.59.0-cp311-cp311-win_amd64.whl (2.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 18.7 MB/s eta 0:00:00
Collecting scipy>=1.11.3
Downloading scipy-1.12.0-cp311-cp311-win_amd64.whl (46.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.2/46.2 MB 13.1 MB/s eta 0:00:00
ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.53.0 Requires-Python >=3.6,<3.10; 0.53.0rc1.post1 Requires-Python >=3.6,<3.10; 0.53.0rc2 Requires-Python >=3.6,<3.10; 0.53.0rc3 Requires-Python >=3.6,<3.10; 0.53.1 Requires-Python >=3.6,<3.10; 0.54.0 Requires-Python >=3.7,<3.10; 0.54.0rc2 Requires-Python >=3.7,<3.10; 0.54.0rc3 Requires-Python >=3.7,<3.10; 0.54.1 Requires-Python >=3.7,<3.10; 0.55.0 Requires-Python >=3.7,<3.11; 0.55.0rc1 Requires-Python >=3.7,<3.11; 0.55.1 Requires-Python >=3.7,<3.11; 0.55.2 Requires-Python >=3.7,<3.11; 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10; 1.7.2 Requires-Python >=3.7,<3.11; 1.7.3 Requires-Python >=3.7,<3.11; 1.8.0 Requires-Python >=3.8,<3.11; 1.8.0rc1 Requires-Python >=3.8,<3.11; 1.8.0rc2 Requires-Python >=3.8,<3.11; 1.8.0rc3 Requires-Python >=3.8,<3.11; 1.8.0rc4 Requires-Python >=3.8,<3.11; 1.8.1 Requires-Python >=3.8,<3.11
ERROR: Could not find a version that satisfies the requirement triton>=2.1; extra == "gpu" (from aqlm[cpu,gpu]) (from versions: none)
ERROR: No matching distribution found for triton>=2.1; extra == "gpu"
