
aqlm's Introduction

AQLM

Official PyTorch implementation for Extreme Compression of Large Language Models via Additive Quantization

Inference

Demo

Learn how to run the prequantized models using these Google Colab examples:

  • Basic AQLM generation
  • Streaming with GPU/CPU
  • Inference with CUDA graphs (3x speedup)
  • Fine-tuning with PEFT
  • Serving with vLLM

Models

This repository is currently designed to work with models of the LLaMA, Mistral, and Mixtral families. The models reported below use full model fine-tuning as described in Appendix A, with a cross-entropy objective against teacher logits.

We provide a number of prequantized models:

| Model | AQLM scheme | WikiText-2 PPL | MMLU (5-shot), FP16→AQLM | Model size, GB | Hub link |
|---|---|---|---|---|---|
| Llama-3-8b | 1x16 | - | 0.65→0.56 | 4.1 | Link |
| Llama-3-8b-Instruct | 1x16 | - | 0.66→0.59 | 4.1 | Link |
| Llama-3-70b | 1x16 | - | 0.79→0.75 | 21.9 | Link |
| Llama-3-70b-Instruct | 1x16 | - | 0.80→0.76 | 21.9 | Link |
| Command-R | 1x16 | - | 0.68→0.57 | 12.7 | Link |
| Command-R+ | 1x16 | - | 0.74→0.68 | 31.9 | Link |
| Mistral-7b | 1x16 | 5.40 | - | 2.5 | Link |
| Mistral-7B-Instruct-v0.2 | 2x8 | - | 0.59→0.44 | 2.5 | Link |
| Mixtral-8x7b | 1x16 | 3.35 | - | 12.6 | Link |
| Mixtral-8x7b-Instruct | 1x16 | - | - | 12.6 | Link |
| Llama-2-7b | 1x16 | 5.92 | 0.46→0.39 | 2.4 | Link |
| Llama-2-7b | 2x8 | 6.69 | - | 2.2 | Link |
| Llama-2-7b | 8x8 | 6.61 | - | 2.2 | Link |
| Llama-2-13b | 1x16 | 5.22 | 0.55→0.49 | 4.1 | Link |
| Llama-2-70b | 1x16 | 3.83 | 0.69→0.65 | 18.8 | Link |
| Llama-2-70b | 2x8 | 4.21 | - | 18.2 | Link |
| gemma-2b | 1x16 | - | - | 1.7 | Link |
| gemma-2b | 2x8 | - | - | 1.6 | Link |

The perplexity above is evaluated at a 4k context length for Llama-2 models and 8k for Mistral/Mixtral models. Please see the model pages for more evaluation results.

Inference kernels

AQLM quantization setups vary mainly in the number of codebooks used and the codebook size in bits. The most popular setups, as well as the inference kernels they support, are:

| Kernel | Number of codebooks | Codebook size, bits | Scheme notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|---|---|---|---|---|---|---|---|
| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ |
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ |
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ |
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ |
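
As a small illustration of the scheme notation (a sketch, not the library's actual kernel-dispatch logic): a scheme string is (number of codebooks) x (codebook size in bits), which maps directly onto the --num_codebooks and --nbits_per_codebook options used for quantization below.

def parse_scheme(scheme: str) -> tuple[int, int]:
    """'1x16' -> (num_codebooks=1, nbits_per_codebook=16); '2x8' -> (2, 8)."""
    num_codebooks, nbits_per_codebook = scheme.split("x")
    return int(num_codebooks), int(nbits_per_codebook)

print(parse_scheme("1x16"))  # (1, 16) -- served by the CUDA 1x16 kernel on GPU
print(parse_scheme("2x8"))   # (2, 8)  -- served by the CUDA 2x8 kernel on GPU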

Installation

To run the models, install the inference library:

pip install aqlm[gpu,cpu]

specifying gpu, cpu, or both, depending on your inference setup.

Then, one can use the familiar .from_pretrained method provided by the transformers library:

from transformers import AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",
    trust_remote_code=True, torch_dtype="auto"
).cuda()

Notice that torch_dtype should be set to either torch.float16 or "auto" on GPU, and to torch.float32 on CPU. After that, the model can be used exactly the same way as an unquantized model.
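
For example, a quick generation check (a minimal sketch; it assumes the quantized repository ships matching tokenizer files, otherwise load the tokenizer of the original model):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto"
).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
output = quantized_model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))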

Quantization

Dependencies

Install packages from requirements.txt:

pip install -r requirements.txt

Loading / caching datasets and tokenizer

The script requires downloading and caching the relevant tokenizer and datasets locally. They will be saved in the default Hugging Face Datasets directory unless an alternative location is provided via environment variables. See the relevant Datasets documentation section.

Data

When quantizing models with AQLM, we recommend that you use a subset of the original data the model was trained on.

For Llama-2 models, the closest available dataset is RedPajama. To load a subset of RedPajama, pass "pajama" as the --dataset argument. This will process nsamples sequences and tokenize them using the model's tokenizer.

Additionally, we provide tokenized RedPajama for Llama and Solar/Mistral models with a 4096 context length, stored on Hugging Face. To load it, use:

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="Vahe1994/AQLM", filename="data/name.pth", repo_type="dataset")

To use the downloaded data from HF, optionally place it in the data folder and set the correct path to it in the --dataset argument of main.py.

Warning: These subsets are already processed with the corresponding model tokenizer. If you want to quantize another model (e.g. Mistral/Mixtral), please re-tokenize the data with the provided script in src/datautils.
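
Putting the two steps above together, a minimal sketch (the filename below is taken from the Llama-2 example further down and is an assumption; check the dataset repository's file listing for the file you actually need):

from huggingface_hub import hf_hub_download

# Download the tokenized calibration file into the local HF cache and get its path.
local_path = hf_hub_download(
    repo_id="Vahe1994/AQLM",
    filename="data/red_pajama_n=1024_4096_context_length.pth",
    repo_type="dataset",
)
print(local_path)  # pass this path as the --dataset argument of main.py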

WandB logging

One can optionally log the data to the Weights & Biases service (wandb). Run pip install wandb to enable W&B logging. Specify the $WANDB_ENTITY, $WANDB_PROJECT, and $WANDB_NAME environment variables prior to running experiments, and use the --wandb argument to enable logging.

GPU and RAM requirements

This code was developed and tested using several A100 GPUs with 80GB of GPU RAM. You can use the --offload_activations option to reduce VRAM usage. For Language Model Evaluation Harness evaluation, one needs enough memory to load the whole model plus activation tensors on one or several devices.

Quantization time

AQLM quantization takes considerably longer to calibrate than simpler quantization methods such as GPTQ. This only impacts quantization time, not inference time.

For instance, quantizing a 7B model with the default configuration takes about 1 day on a single A100 GPU. Similarly, quantizing a 70B model on a single GPU would take 10-14 days. If you have multiple GPUs with fast interconnect, you can run AQLM on multiple GPUs to speed things up: simply set CUDA_VISIBLE_DEVICES to several GPUs. Quantizing a 7B model on two GPUs reduces quantization time to ~14.5 hours. Similarly, quantizing a 70B model on 8 x A100 GPUs takes 3 days 18 hours.

If you need to speed up quantization without adding more GPUs, you may also increase --relative_mse_tolerance, set --init_max_points_per_centroid, or limit --finetune_max_epochs. However, that usually comes at the cost of reduced model accuracy.

Model downloading

The code requires the LLaMA model to be downloaded in Huggingface format and saved locally. The scripts below assume that $TRANSFORMERS_CACHE variable points to the Huggingface Transformers cache folder. To download and cache the models, run this in the same environment:

from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "meta-llama/Llama-2-7b-hf"  # or whatever else you wish to download
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

How to quantize a model with AQLM

This script compresses the model and then tests its performance in terms of perplexity using the WikiText-2, C4, and Penn Treebank datasets.

The command to launch the script should look like this:

export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3
export MODEL_PATH=<PATH_TO_MODEL_ON_HUB>
export DATASET_PATH=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export SAVE_PATH=/path/to/save/quantized/model/
export WANDB_PROJECT=MY_AQ_EXPS
export WANDB_NAME=COOL_EXP_NAME

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --val_size=128 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 \
 --finetune_max_epochs=10 \
 --finetune_early_stop=3 \
 --finetune_keep_best \
 --local_batch_size=1 \
 --offload_activations \
 --wandb \
 --resume \
 --save $SAVE_PATH

Main CLI arguments:

  • CUDA_VISIBLE_DEVICES - by default, the code will use all available GPUs. If you want to use specific GPUs (or one GPU), use this variable.
  • MODEL_PATH - a path to either Hugging Face hub (e.g. meta-llama/Llama-2-7b-hf) or a local folder with transformers model and a tokenizer.
  • DATASET_PATH - either a path to calibration data (see above) or a standard dataset [c4, ptb, wikitext2]
    • for llama-2 models, you can use DATASET_PATH=./data/red_pajama_n=1024_4096_context_length.pth for a slice of RedPajama (up to 1024 samples)
  • --nsamples - the number of calibration data sequences (train + validation). If this parameter is not set, all available calibration data is used.
  • --val_size - the number of validation sequences for early stopping on block finetuning. By default equal to 0. Must be smaller than --nsamples.
  • --num_codebooks - number of codebooks per layer
  • --nbits_per_codebook - each codebook will contain 2 ** nbits_per_codebook vectors
  • --in_group_size - how many weights are quantized together (aka "g" in the arXiv paper); see the bit-budget sketch after this list
  • --finetune_batch_size - (for fine-tuning only) the total number of sequences used for each optimization step
  • --local_batch_size - when accumulating finetune_batch_size, process this many samples per GPU per forward pass (affects GPU RAM usage)
  • --relative_mse_tolerance - (for initial calibration) stop training when (current_epoch_mse / previous_epoch_mse) > (1 - relative_mse_tolerance)
  • --finetune_max_epochs - maximal number of passes through calibration data on block tuning.
  • --finetune_early_stop - maximal number of passes through calibration data without improvement on validation.
  • --offload_activations -- during calibration, move activations from GPU memory to RAM. This reduces VRAM usage while slowing calibration by ~10% (depending on your hardware).
  • --save -- path to save/load quantized model. (see also: --load)
  • --wandb - if this parameter is set, the code will log results to wandb
  • --attn_implementation - specify the attention implementation (for transformers >= 4.38). SDPA attention sometimes causes issues; it is recommended to use the eager implementation.
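
To get a feel for how these options translate into an average bitrate, here is a rough back-of-the-envelope sketch (assumptions: fp16 codebooks and one fp16 scale per output unit; real checkpoints may differ slightly). With the default 1x16 scheme and in_group_size=8, it gives roughly 2.5 bits per weight for a 4096x4096 projection, consistent with the bits_per_parameter values printed by the quantizer.

def aqlm_bits_per_parameter(out_features: int, in_features: int,
                            num_codebooks: int = 1, nbits_per_codebook: int = 16,
                            in_group_size: int = 8, out_group_size: int = 1) -> float:
    """Rough storage estimate: codes + fp16 codebooks + one fp16 scale per output unit."""
    num_weights = out_features * in_features
    num_groups = num_weights // (in_group_size * out_group_size)
    code_bits = num_groups * num_codebooks * nbits_per_codebook
    codebook_bits = num_codebooks * 2 ** nbits_per_codebook * in_group_size * out_group_size * 16
    scale_bits = out_features * 16
    return (code_bits + codebook_bits + scale_bits) / num_weights

print(aqlm_bits_per_parameter(4096, 4096))    # ~2.504 bits/weight (e.g. a Llama/Mistral q_proj)
print(aqlm_bits_per_parameter(14336, 4096))   # ~2.147 bits/weight (e.g. a Llama/Mistral gate/up_proj)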

There are additional hyperparameters available. Run python main.py --help for more details on command line arguments, including compression parameters.

Finetuning

The accuracy of the quantized model can be further improved via block finetuning. First, the logits of the float16/bfloat16 model are cached in RAM. Then, the differentiable parameters of the quantized model are optimized to minimize the KL-divergence with the teacher logits. Typically, we use the same calibration data that was used for model quantization.
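
For illustration, a minimal sketch of such a distillation loss (assuming logits tensors of shape [batch, seq_len, vocab]; this is not the repository's exact implementation):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over the batch of token positions."""
    student_logprobs = F.log_softmax(student_logits.float(), dim=-1)
    teacher_probs = F.softmax(teacher_logits.float(), dim=-1)
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")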

The command to launch the script should look like this:

python finetune.py \
  --base_model $MODEL_PATH \
  --quant_model $INPUT_PATH \
  --dataset $DATASET_PATH \
  --nsamples=<TOTAL_SIZE> \
  --val_size=<VAL_SIZE> \
  --lr=1e-5 \
  --adam_beta1=0.90 \
  --adam_beta2=0.999 \
  --epochs=5 \
  --early_stop=3 \
  --batch_size=8 \
  --microbatch_size=4 \
  --save $DATA_PATH \
  --gradient_checkpointing

Main CLI arguments:

  • --base_model - path or name of the original floating-point model
  • --quant_model - path to quantized model weights.
  • --dataset - path or name of the calibration dataset
  • --nsamples - the number of calibration data sequences (train + validation). If this parameter is not set, all available calibration data is used.
  • --val_size - the number of validation sequences for early stopping on end-to-end finetuning. By default equal to 0. Must be smaller than --nsamples.
  • --gradient_checkpointing - whether to use gradient checkpointing. Reduces peak memory usage at the cost of longer runtime.
  • --finetune_dtype - which dtype to use for finetuning. Defaults to float32.
  • --amp - whether to use AMP during finetuning. Requires --finetune_dtype=float32.

Note that larger models require multi-GPU training. At the moment, FSDP training is not implemented and the model is finetuned in a single process with parameters sharded across available devices.

Zero-shot benchmarks via LM Evaluation Harness

To perform zero-shot evaluation, we use the Language Model Evaluation Harness framework with slight modifications. This repository contains a copy of the LM Evaluation Harness repo from early 2023 in the lm-evaluation-harness folder.

Before running the code, make sure that you have all the requirements and dependencies of lm-evaluation-harness installed. To install them, run:

pip install -r lm-evaluation-harness/requirements.txt

The main script launching the evaluation procedure is lmeval.py.

export CUDA_VISIBLE_DEVICES=0,1,2,3  # optional: select GPUs
export QUANTIZED_MODEL=<PATH_TO_SAVED_QUANTIZED_MODEL_FROM_MAIN.py>
export MODEL_PATH=<INSERT_PATH_TO_ORIGINAL_MODEL_ON_HUB>
export DATASET=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export WANDB_PROJECT=MY_AQ_LM_EVAL
export WANDB_NAME=COOL_EVAL_NAME

python lmeval.py \
    --model hf-causal \
    --model_args pretrained=$MODEL_PATH,dtype=float16,use_accelerate=True \
    --load $QUANTIZED_MODEL \
    --tasks winogrande,piqa,hellaswag,arc_easy,arc_challenge \
    --batch_size 1

Preparing models for inference

To convert a model into a Hugging Face compatible format, use convert_to_hf.py with corresponding arguments:

  • --model - the original pretrained model (corresponds to MODEL_PATH of main.py, e.g. meta-llama/Llama-2-7b-hf).
  • --in_path - the folder containing an initially quantized model (corresponds to --save of main.py).
  • --out_path - the folder to save transformers model to.

The conversion automatically

Contributing

If you want to contribute something substantial (more than a typo), please open an issue first. We use black and isort for all pull requests. Before committing your code, run black . && isort .

Cite

If you found this work useful, please consider citing:

@misc{egiazarian2024extreme,
      title={Extreme Compression of Large Language Models via Additive Quantization}, 
      author={Vage Egiazarian and Andrei Panferov and Denis Kuznedelev and Elias Frantar and Artem Babenko and Dan Alistarh},
      year={2024},
      eprint={2401.06118},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

aqlm's People

Contributors

blacksamorez, dalistarh, efrantar, eltociear, galqiwi, godofnothing, justheuristic, lmmx, vahe1994, vectozavr


aqlm's Issues

What are the hyperparameters to get an average of 2.02 bits per layer for LLaMA2-7b?

Hi @Vahe1994, I really like the idea of the paper and thank you for releasing the codes.

I am currently studying the code, and I would like to know how you obtain an average of 2.02 bits per layer for LLaMA2-7b (shown in Table 1 of the paper). I tried the hyperparameters from the README (num_codebooks=1, codebook_value_nbits=16, in_group_size=8, out_group_size=1, codebook_value_nbits=16, scale_nbits=0, with the rest left at their default values), but I get around 2.50 bits for attention weights and 2.2 bits for MLP weights, and their average cannot be 2.02. How do I tune the parameters so that I can reproduce the paper's results? Thank you very much!

'QuantizedLinear' object has no attribute 'weight' error

I'm trying to do quantization with these parameters:

python main.py ../models/my-mistral-7B wikitext2 --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --save ../models/my-mistral-7B-AQLM --model_seqlen 8192 --offload_activations

and I got this error

Traceback (most recent call last):
  File "/home/fahadh/AQLM/main.py", line 779, in <module>
    quantize_model(model, args)
  File "/home/fahadh/AQLM/main.py", line 48, in quantize_model
    results = quantize_aq(model, dataloader, args)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/fahadh/AQLM/main.py", line 252, in quantize_aq
    layer = finetune_groupwise(layer=layer, inps=inps, outs=outs, args=args, **forward_args)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/fahadh/AQLM/src/finetune.py", line 105, in finetune_groupwise
    loss = _compute_mse_parallel(
  File "/home/fahadh/AQLM/src/finetune.py", line 224, in _compute_mse_parallel
    mse_components = torch.nn.parallel.parallel_apply(
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
    output.reraise()
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
    raise exception
AttributeError: Caught AttributeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
  File "/home/fahadh/AQLM/src/finetune.py", line 201, in _compute_mse_on_batch
    outs_prediction, *_unused = layer(inps_batch, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 754, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 437, in forward
    target_dtype = self.q_proj.weight.dtype
  File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'QuantizedLinear' object has no attribute 'weight'

This happens after fine-tuning layer 0 of the Mistral model. Here is what was printed before the error:

PREPARING TO FINETUNE
MistralDecoderLayer(
  (self_attn): MistralFlashAttention2(
    (q_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=4096, self.in_features=4096, bits_per_parameter=2.50390625)
    )
    (k_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=1024, self.in_features=4096, bits_per_parameter=4.00390625)
    )
    (v_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=1024, self.in_features=4096, bits_per_parameter=4.00390625)
    )
    (o_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=4096, self.in_features=4096, bits_per_parameter=2.50390625)
    )
    (rotary_emb): MistralRotaryEmbedding()
  )
  (mlp): MistralMLP(
    (gate_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=14336, self.in_features=4096, bits_per_parameter=2.146763392857143)
    )
    (up_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=14336, self.in_features=4096, bits_per_parameter=2.146763392857143)
    )
    (down_proj): QuantizedLinear(
      (quantized_weight): QuantizedWeight(self.out_features=4096, self.in_features=14336, bits_per_parameter=2.1439732142857144)
    )
    (act_fn): SiLU()
  )
  (input_layernorm): MistralRMSNorm()
  (post_attention_layernorm): MistralRMSNorm()
)
Fine-tuning 3721216 parameters

How model_seqlen affects quantization quality

Hi!
Thanks for such a useful tool!
I have a question about model_seqlen:

As I can see, the default value in main.py is 4096. What if I use a smaller value, e.g. 1024, when quantizing the MoE Mixtral model? Will it affect the quality of the quantized model, or the quality at contexts longer than 1024? Will it significantly speed up the quantization process?

Thanks in advance!

    parser.add_argument(
        "--model_seqlen",
        type=int,
        default=4096,
        help="Model seqlen and calibration data context length.",
    )

Supported Models

Congratulations on coming up with such an excellent quantization algorithm! I'm trying to use AQLM to quantize Deepseek-Coder and Starcoder2, but the repository doesn't seem to have direct support for them. Are there any plans to support more models? Or any suggestions on how to modify the source code to add support quickly? Thanks.

RuntimeError: Unknown layout

Hi, I've been encountering this problem when running this notebook on a local machine with an L4. I've followed the instructions from the notebook on Python 3.10.

Name: torch
Version: 2.2.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /home/ubuntu/anaconda3/envs/aqlm/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, aqlm, torchaudio, torchvision

  +---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                       On | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0               33W /  75W|  13858MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     16482      C   ...untu/anaconda3/envs/aqlm/bin/python    13856MiB |
+---------------------------------------------------------------------------------------+

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:2 for open-end generation.

RuntimeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator..decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/generation/utils.py:1513, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1496 return self.assisted_decoding(
1497 input_ids,
1498 candidate_generator=candidate_generator,
(...)
1509 **model_kwargs,
1510 )
1511 if generation_mode == GenerationMode.GREEDY_SEARCH:
1512 # 11. run greedy search
-> 1513 return self.greedy_search(
1514 input_ids,
1515 logits_processor=prepared_logits_processor,
1516 stopping_criteria=prepared_stopping_criteria,
1517 pad_token_id=generation_config.pad_token_id,
1518 eos_token_id=generation_config.eos_token_id,
1519 output_scores=generation_config.output_scores,
1520 return_dict_in_generate=generation_config.return_dict_in_generate,
1521 synced_gpus=synced_gpus,
1522 streamer=streamer,
1523 **model_kwargs,
1524 )
1526 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
1527 if not model_kwargs["use_cache"]:

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/generation/utils.py:2350, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2347 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2349 # forward pass to get next token
-> 2350 outputs = self(
2351 **model_inputs,
2352 return_dict=True,
2353 output_attentions=output_attentions,
2354 output_hidden_states=output_hidden_states,
2355 )
2357 if synced_gpus and this_peer_finished:
2358 continue # don't waste resources running the code we don't need

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1360, in MixtralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1357 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1359 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1360 outputs = self.model(
1361 input_ids=input_ids,
1362 attention_mask=attention_mask,
1363 position_ids=position_ids,
1364 past_key_values=past_key_values,
1365 inputs_embeds=inputs_embeds,
1366 use_cache=use_cache,
1367 output_attentions=output_attentions,
1368 output_hidden_states=output_hidden_states,
1369 output_router_logits=output_router_logits,
1370 return_dict=return_dict,
1371 )
1373 hidden_states = outputs[0]
1374 logits = self.lm_head(hidden_states)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1228, in MixtralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1217 layer_outputs = self._gradient_checkpointing_func(
1218 decoder_layer.call,
1219 hidden_states,
(...)
1225 use_cache,
1226 )
1227 else:
-> 1228 layer_outputs = decoder_layer(
1229 hidden_states,
1230 attention_mask=attention_mask,
1231 position_ids=position_ids,
1232 past_key_value=past_key_values,
1233 output_attentions=output_attentions,
1234 output_router_logits=output_router_logits,
1235 use_cache=use_cache,
1236 )
1238 hidden_states = layer_outputs[0]
1240 if use_cache:

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:934, in MixtralDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, output_router_logits, use_cache, **kwargs)
931 hidden_states = self.input_layernorm(hidden_states)
933 # Self Attention
--> 934 hidden_states, self_attn_weights, present_key_value = self.self_attn(
935 hidden_states=hidden_states,
936 attention_mask=attention_mask,
937 position_ids=position_ids,
938 past_key_value=past_key_value,
939 output_attentions=output_attentions,
940 use_cache=use_cache,
941 )
942 hidden_states = residual + hidden_states
944 # Fully Connected

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:730, in MixtralSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
719 return super().forward(
720 hidden_states=hidden_states,
721 attention_mask=attention_mask,
(...)
725 use_cache=use_cache,
726 )
728 bsz, q_len, _ = hidden_states.size()
--> 730 query_states = self.q_proj(hidden_states)
731 key_states = self.k_proj(hidden_states)
732 value_states = self.v_proj(hidden_states)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/aqlm/inference.py:65, in QuantizedLinear.forward(self, input)
59 if (
60 not input.is_cuda
61 and self.codebook_size == 256
62 and self.codes.shape[0] == self.out_features // self.out_group_size
63 ):
64 self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous() # TODO: fix this thing
---> 65 return forward_pass_quantized_linear(input, self.codes, self.codebooks, self.scales, self.bias)

File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py:24, in forward_pass_quantized_linear(input, codes, codebooks, scales, bias)
19 from .cuda_kernel import CUDA_KERNEL
21 assert (
22 input.dtype == torch.float16
23 ), f"please load the model with torch_dtype=torch.float16, as {input.dtype} is not supported on GPU yet"
---> 24 return CUDA_KERNEL.code1x16_matmat(input, codes, codebooks, scales) + (bias if bias is not None else 0)
25 case (True, 2, 256, 1, 8):
26 from .cuda_kernel import CUDA_KERNEL

RuntimeError: Unknown layout

Can't run the model after building with Docker, fails in ninja build.

This is my Dockerfile. I've tried a lot of permutations based on your notebooks, e.g., using pip install git+https://github.com/huggingface/accelerate.git@main instead of the plain pip install below. I paste the error after the Dockerfile.

FROM nvidia/cuda:11.8.0-base-ubuntu20.04

# Run system updates and install any desired packages
RUN apt-get update && apt-get upgrade -y && apt-get install -y curl

# need to make sure sudo is installed
RUN apt-get update && apt-get install -y sudo
RUN sudo apt-get install python3.8 -y

RUN curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN mkdir /root/.conda
RUN bash Miniconda3-latest-Linux-x86_64.sh -b
ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"

RUN apt-get update && apt-get install -y gcc g++
RUN apt-get update && apt-get install -y ninja-build

RUN apt-get update && apt-get install -y git
RUN pip install accelerate
RUN pip install aqlm[gpu]==1.0.0


#docker run --gpus '"all"' --rm -it --name aqlm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,source=/home/josh/PycharmProjects/vectors/DSPy/scratch,target=/scratch -v ${HOME}/.cache/huggingface:/root/.cache/huggingface --network=host aqlm:0

Here's the error:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
    subprocess.run(
  File "/root/miniconda3/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in chat_loop
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py", line 1520, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py", line 2617, in sample
    outputs = self(
              ^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 1194, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 1081, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 809, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 704, in forward
    query_states = self.q_proj(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/aqlm/inference.py", line 65, in forward
    return forward_pass_quantized_linear(input, self.codes, self.codebooks, self.scales, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py", line 19, in forward_pass_quantized_linear
    from .cuda_kernel import CUDA_KERNEL
  File "/root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.py", line 8, in <module>
    CUDA_KERNEL = load(
                  ^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1306, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2112, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
FAILED: cuda_kernel.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
[2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o 
ninja: build stopped: subcommand failed.

And here's the little script I threw together to test the model:

from transformers import AutoModelForCausalLM

model="BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf"
quantized_model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True, torch_dtype="auto", device_map="cuda").cuda()
# unload the model from cuda
# quantized_model = quantized_model.cpu()

# Let's set up a chat loop
# need this token for the tokenizer
token=<MY_TOKEN>
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-13b-hf', trust_remote_code=True, token=token)

def chat_loop():
    while True:
        prompt = input(">>> ")
        if prompt.strip() == "exit":
            break
        input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
        output = quantized_model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=tokenizer.eos_token_id)
        print(tokenizer.decode(output[0], skip_special_tokens=True))

chat_loop()
# Quiz me about the principles of scrum.

Just not familiar enough yet with your code to troubleshoot dependency / build issues. Happy to learn more!

Fine-tune colab example doesn't work

Fine-tune colab example fails when running trainer.train()

Last cell gives output

max_steps is given, it will override any value given in num_train_epochs
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(


RuntimeError Traceback (most recent call last)

in <cell line: 24>()
22 )
23 model.config.use_cache = False # silence the warnings. Please re-enable for inference!
---> 24 trainer.train()

37 frames

/usr/local/lib/python3.10/dist-packages/torch/_ops.py in call(self, *args, **kwargs)
753 # We save the function ptr as the op attribute on
754 # OpOverloadPacket to access it here.
--> 755 return self._op(*args, **(kwargs or {}))
756
757 # TODO: use this to make a dir

RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

AttributeError: module 'torch.library' has no attribute 'impl_abstract'

Note: I tested this for aqlm==1.0.3 and 1.1.0

I'm having an issue trying to run inference with an HF model in Transformers.

pip install transformers aqlm[gpu,cpu]

Pytorch 2.1.2 is installed

Then

from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained("ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf", torch_dtype="auto", device_map="cuda").cuda()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
print(output)

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/generation/utils.py", line 1544, in generate
    return self.greedy_search(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/generation/utils.py", line 2404, in greedy_search
    outputs = self(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
    outputs = self.model(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1008, in forward
    layer_outputs = decoder_layer(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 734, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 633, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference.py", line 70, in forward
    self.prepare_matmul_op(input)
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference.py", line 86, in prepare_matmul_op
    get_forward_pass_kernel(self.codebooks, False),
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py", line 35, in get_forward_pass_kernel
    from .cuda_kernel import CUDA_FOLDER
  File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.py", line 20, in <module>
    @torch.library.impl_abstract("aqlm::code1x16_matmat")
AttributeError: module 'torch.library' has no attribute 'impl_abstract'

RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d

Environment: Google Colab
CUDA Info:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |

Code run from: Basic AQLM generation demo

error line:

%%capture
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py](https://localhost:8080/#) in _run_ninja_build(build_directory, verbose, error_prefix)
   2095         stdout_fileno = 1
-> 2096         subprocess.run(
   2097             command,

26 frames
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py](https://localhost:8080/#) in _run_ninja_build(build_directory, verbose, error_prefix)
   2110         if hasattr(error, 'output') and error.output:  # type: ignore[union-attr]
   2111             message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}"  # type: ignore[union-attr]
-> 2112         raise RuntimeError(message) from e
   2113 
   2114 

RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -std=c++17 -c /usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
FAILED: cuda_kernel.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -std=c++17 -c /usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o 
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(59): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(a[j], b[j], res2);
                             ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(59): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(a[j], b[j], res2);
                                   ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(59): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(a[j], b[j], res2);
                                         ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(147): error: no suitable user-defined conversion from "__nv_bfloat162" to "__half2" exists
              res2 = __hfma2(__hadd2(a0[j], a1[j]), b[j], res2);
                             ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(147): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(__hadd2(a0[j], a1[j]), b[j], res2);
                                                    ^

/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(147): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
              res2 = __hfma2(__hadd2(a0[j], a1[j]), b[j], res2);
                                                          ^

6 errors detected in the compilation of "/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu".
[2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o 
ninja: build stopped: subcommand failed.

Actual bitrate of models on github?

Are the models you report in your README supposed to be actual 2-bit models or just 2.x-bit models? For example, the two 7B models below are both larger than a pure 2-bit decoder model, which would take about 2.1 GB on disk. Also, why is there such a large difference in size between the 1x16 and 2x8 models? The size gap is much larger than the codebook size delta should account for. Are you using different group sizes in each model? Thanks

[image]
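
For reference, here is a back-of-the-envelope estimate behind the 2.1 GB figure above; the 6.5B linear-weight count and the fp16 embedding/LM-head assumption are illustrative guesses, not numbers taken from the repo.

# Rough estimate of a "pure" 2-bit 7B model (assumed figures, not exact configs):
linear_params = 6.5e9              # weights living in the quantized linear layers
fp16_params = 2 * 32_000 * 4_096   # input embeddings + LM head, assumed kept in fp16

size_bytes = linear_params * 2 / 8 + fp16_params * 2
print(f"~{size_bytes / 1e9:.1f} GB")  # ~2.1 GB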

How long does it take to quantize?

I've been using quantization tools like GPTQ, Exllama, and QUIP#. Those tools are quite fast at quantizing on a single A6000 GPU. This tool, however, takes a really long time even though I'm using two A6000 GPUs. How long should it take to quantize Mistral 7B on two A6000 GPUs with these parameters:

python main.py ../models/my-mistral-7B wikitext2 --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --save ../models/my-mistral-7B-AQLM --model_seqlen 8192 --offload_activations

The printed total number of parameters is incomplete.

model:ISTA-DASLab/Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf

import torch
from transformers import AutoModelForCausalLM

base_model = "ISTA-DASLab/Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf"
model = AutoModelForCausalLM.from_pretrained(base_model,
                                             trust_remote_code=True,
                                             attn_implementation="flash_attention_2",
                                             torch_dtype=torch.bfloat16,
                                             device_map="auto",
                                             low_cpu_mem_usage=True)
total_parameters = sum(p.numel() for p in model.parameters())

print(f"Total number of parameters: {total_parameters}")

output:
Total number of parameters: 6546853888
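
A likely explanation (my assumption, not confirmed in this thread) is that AQLM layers store packed integer codes, each covering a whole group of weights, so summing numel() over the stored parameters undercounts the logical weight count. Below is a hedged sketch that counts logical weights by layer shape instead; it assumes quantized linear layers expose in_features/out_features like nn.Linear, and the helper name is hypothetical.

import torch.nn as nn

def logical_weight_count(model: nn.Module) -> int:
    """Hypothetical helper: count weights by layer shape, not by stored tensor size."""
    total = 0
    counted = set()
    for module in model.modules():
        # Covers nn.Linear and (by assumption) AQLM's quantized linear replacement.
        if hasattr(module, "in_features") and hasattr(module, "out_features"):
            total += module.in_features * module.out_features
            if getattr(module, "bias", None) is not None:
                total += module.bias.numel()
            counted.update(id(p) for p in module.parameters(recurse=False))
    # Everything else (embeddings, norms, ...) is counted as usual.
    total += sum(p.numel() for p in model.parameters() if id(p) not in counted)
    return total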

aqlm/inference_kernels/cuda_kernel.cu compilation errors

Hi! I'm running into the following issue on the forward pass while prompt tuning an AQLM model (it only happens with AQLM models). I'm using https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch/tree/main.


CalledProcessError Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2096, in _run_ninja_build(build_directory, verbose, error_prefix)
2095 stdout_fileno = 1
-> 2096 subprocess.run(
2097 command,
2098 stdout=stdout_fileno if verbose else subprocess.PIPE,
2099 stderr=subprocess.STDOUT,
2100 cwd=build_directory,
2101 check=True,
2102 env=env)
2103 except subprocess.CalledProcessError as e:
2104 # Python 2 and 3 compatible way of getting the error object.

File /usr/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
525 if check and retcode:
--> 526 raise CalledProcessError(retcode, process.args,
527 output=stdout, stderr=stderr)
528 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)
File , line 41
39 batch = {k: v.to(device) for k, v in batch.items()}
40 with torch.cuda.amp.autocast():
---> 41 outputs = model(**batch)
42 loss = outputs.loss
44 loss = loss / gradient_accumulation_steps

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/peft/peft_model.py:1295, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs)
1293 prompts = prompts.to(inputs_embeds.dtype)
1294 inputs_embeds = torch.cat((prompts, inputs_embeds), dim=1)
-> 1295 return self.base_model(inputs_embeds=inputs_embeds, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1360, in MixtralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1357 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1359 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1360 outputs = self.model(
1361 input_ids=input_ids,
1362 attention_mask=attention_mask,
1363 position_ids=position_ids,
1364 past_key_values=past_key_values,
1365 inputs_embeds=inputs_embeds,
1366 use_cache=use_cache,
1367 output_attentions=output_attentions,
1368 output_hidden_states=output_hidden_states,
1369 output_router_logits=output_router_logits,
1370 return_dict=return_dict,
1371 )
1373 hidden_states = outputs[0]
1374 logits = self.lm_head(hidden_states)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1217, in MixtralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1214 all_hidden_states += (hidden_states,)
1216 if self.gradient_checkpointing and self.training:
-> 1217 layer_outputs = self._gradient_checkpointing_func(
1218 decoder_layer.call,
1219 hidden_states,
1220 attention_mask,
1221 position_ids,
1222 past_key_values,
1223 output_attentions,
1224 output_router_logits,
1225 use_cache,
1226 )
1227 else:
1228 layer_outputs = decoder_layer(
1229 hidden_states,
1230 attention_mask=attention_mask,
(...)
1235 use_cache=use_cache,
1236 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_compile.py:24, in _disable_dynamo..inner(*args, **kwargs)
20 @functools.wraps(fn)
21 def inner(*args, **kwargs):
22 import torch._dynamo
---> 24 return torch._dynamo.disable(fn, recursive)(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:489, in _TorchDynamoContext.call.._fn(*args, **kwargs)
487 dynamo_config_ctx.enter()
488 try:
--> 489 return fn(*args, **kwargs)
490 finally:
491 set_eval_frame(prior)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_dynamo/external_utils.py:17, in wrap_inline..inner(*args, **kwargs)
15 @functools.wraps(fn)
16 def inner(*args, **kwargs):
---> 17 return fn(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/checkpoint.py:482, in checkpoint(function, use_reentrant, context_fn, determinism_check, debug, *args, **kwargs)
477 if context_fn is not noop_context_fn or debug is not False:
478 raise ValueError(
479 "Passing context_fn or debug is only supported when "
480 "use_reentrant=False."
481 )
--> 482 return CheckpointFunction.apply(function, preserve, *args)
483 else:
484 gen = _checkpoint_without_reentrant_generator(
485 function, preserve, context_fn, determinism_check, debug, *args, **kwargs
486 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/autograd/function.py:553, in Function.apply(cls, *args, **kwargs)
550 if not torch._C._are_functorch_transforms_active():
551 # See NOTE: [functorch vjp and autograd interaction]
552 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 553 return super().apply(*args, **kwargs) # type: ignore[misc]
555 if not is_setup_ctx_defined:
556 raise RuntimeError(
557 "In order to use an autograd.Function with functorch transforms "
558 "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
559 "staticmethod. For more details, please see "
560 "https://pytorch.org/docs/master/notes/extending.func.html"
561 )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/checkpoint.py:261, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
258 ctx.save_for_backward(*tensor_inputs)
260 with torch.no_grad():
--> 261 outputs = run_function(*args)
262 return outputs

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:934, in MixtralDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, output_router_logits, use_cache, **kwargs)
931 hidden_states = self.input_layernorm(hidden_states)
933 # Self Attention
--> 934 hidden_states, self_attn_weights, present_key_value = self.self_attn(
935 hidden_states=hidden_states,
936 attention_mask=attention_mask,
937 position_ids=position_ids,
938 past_key_value=past_key_value,
939 output_attentions=output_attentions,
940 use_cache=use_cache,
941 )
942 hidden_states = residual + hidden_states
944 # Fully Connected

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:730, in MixtralSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
719 return super().forward(
720 hidden_states=hidden_states,
721 attention_mask=attention_mask,
(...)
725 use_cache=use_cache,
726 )
728 bsz, q_len, _ = hidden_states.size()
--> 730 query_states = self.q_proj(hidden_states)
731 key_states = self.k_proj(hidden_states)
732 value_states = self.v_proj(hidden_states)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference.py:70, in QuantizedLinear.forward(self, input)
68 def forward(self, input: torch.Tensor) -> torch.Tensor:
69 if self.gemv_op is None:
---> 70 self.prepare_matmul_op(input)
72 if self.use_gemv_rule(input):
73 return self.gemv_op.apply(input, self.codes, self.codebooks, self.scales, self.bias)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference.py:86, in QuantizedLinear.prepare_matmul_op(self, input)
78 if (
79 not input.is_cuda
80 and self.codebook_size == 256
81 and self.codes.shape[0] == self.out_features // self.out_group_size
82 ):
83 self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous() # TODO: fix this thing
85 self.gemv_op = _get_autograd_matmul_op(
---> 86 get_forward_pass_kernel(self.codebooks, False),
87 get_backward_pass_kernel(self.codebooks, False),
88 )
90 self.gemm_op = _get_autograd_matmul_op(
91 get_forward_pass_kernel(self.codebooks, True),
92 get_backward_pass_kernel(self.codebooks, True),
93 )
95 self.use_gemv_rule = lambda input: math.prod(input.shape[:-1]) <= 6

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py:35, in get_forward_pass_kernel(codebooks, optimize_for_training)
25 num_codebooks, codebook_size, out_group_size, in_group_size = codebooks.shape
27 if (optimize_for_training, codebooks.device.type, num_codebooks, codebook_size, out_group_size, in_group_size) == (
28 False,
29 "cuda",
(...)
33 8,
34 ):
---> 35 from .cuda_kernel import CUDA_FOLDER
37 return torch.ops.aqlm.code1x16_matmat
38 elif (
39 optimize_for_training,
40 codebooks.device.type,
(...)
51 8,
52 ):

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.py:8
5 from torch.utils.cpp_extension import load
7 CUDA_FOLDER = os.path.dirname(os.path.abspath(file))
----> 8 CUDA_KERNEL = load(
9 name="codebook_cuda",
10 sources=[os.path.join(CUDA_FOLDER, "cuda_kernel.cpp"), os.path.join(CUDA_FOLDER, "cuda_kernel.cu")],
11 )
13 torch.library.define(
14 "aqlm::code1x16_matmat", "(Tensor input, Tensor codes, Tensor codebooks, Tensor scales, Tensor bias) -> Tensor"
15 )
17 torch.library.impl("aqlm::code1x16_matmat", "default", CUDA_KERNEL.code1x16_matmat)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1306, in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1214 def load(name,
1215 sources: Union[str, List[str]],
1216 extra_cflags=None,
(...)
1224 is_standalone=False,
1225 keep_intermediates=True):
1226 """
1227 Load a PyTorch C++ extension just-in-time (JIT).
1228
(...)
1304 ... verbose=True)
1305 """
-> 1306 return _jit_compile(
1307 name,
1308 [sources] if isinstance(sources, str) else sources,
1309 extra_cflags,
1310 extra_cuda_cflags,
1311 extra_ldflags,
1312 extra_include_paths,
1313 build_directory or _get_build_directory(name, verbose),
1314 verbose,
1315 with_cuda,
1316 is_python_module,
1317 is_standalone,
1318 keep_intermediates=keep_intermediates)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1710, in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1706 hipified_sources.add(hipify_result[s_abs].hipified_path if s_abs in hipify_result else s_abs)
1708 sources = list(hipified_sources)
-> 1710 _write_ninja_file_and_build_library(
1711 name=name,
1712 sources=sources,
1713 extra_cflags=extra_cflags or [],
1714 extra_cuda_cflags=extra_cuda_cflags or [],
1715 extra_ldflags=extra_ldflags or [],
1716 extra_include_paths=extra_include_paths or [],
1717 build_directory=build_directory,
1718 verbose=verbose,
1719 with_cuda=with_cuda,
1720 is_standalone=is_standalone)
1721 finally:
1722 baton.release()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1823, in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
1821 if verbose:
1822 print(f'Building extension module {name}...', file=sys.stderr)
-> 1823 _run_ninja_build(
1824 build_directory,
1825 verbose,
1826 error_prefix=f"Error building extension '{name}'")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2112, in _run_ninja_build(build_directory, verbose, error_prefix)
2110 if hasattr(error, 'output') and error.output: # type: ignore[union-attr]
2111 message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}" # type: ignore[union-attr]
-> 2112 raise RuntimeError(message) from e

RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
FAILED: cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(270): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists

/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(270): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists

/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(256): warning #177-D: variable "res" was declared but never referenced

2 errors detected in the compilation of "/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu".
[2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o
ninja: build stopped: subcommand failed.

optimized 1x16 and 2x8 decompression/dequantization kernels now exist

Hello, this isn't really a problem, I just wanted to let you know that I wrote 1x16 and 2x8 decompression/dequantization kernels that run 6x and 9x faster (respectively) than the generic PyTorch ones you coded up, which is good for prefill performance. I didn't roll the scale into them because scaling the output is very fast.

Anyway, if you want to see them or bring them over, feel free, they are in:

vllm-project/vllm#3287

Cheers, and AQLM is really great; reaching Pareto optimality is a really nice achievement!

-James

Global finetuning?

How does your updated fine-tuning method work compared to the one described in your arXiv paper?

Case Study: Instruction Tuning on AQLM Models

Hi, we have performed a small experiment on fine-tuning the Llama-2-70B-AQLM-2Bit model using the PEFT QLoRA method. We utilized the Alpaca and Glaive datasets for instruction tuning, and the fine-tuned version demonstrates preliminary conversation and tool-using abilities. We found that training only requires 24GB of GPU RAM, while inference needs only 20GB, so fine-tuning a 70B model on consumer devices can be feasible. However, we found that AQLM significantly increases the overall training time; it would be better if the training speed could be improved. Thanks again for your excellent work!

Adapter weights: https://huggingface.co/hiyouga/Llama-2-70b-AQLM-2Bit-QLoRA-function-calling

[images: examples, train_loss, gpu]
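
For readers who want to try a similar setup, here is a minimal sketch of attaching LoRA adapters to a prequantized AQLM checkpoint with peft; the checkpoint name follows the repo's hub naming pattern, and the LoRA hyperparameters and target modules are illustrative, not the exact configuration used in this case study.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative LoRA setup: train small adapters on top of the frozen 2-bit weights.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable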

Minor race condition in CPU 2x8 inference code

The current implementation sometimes returns close_but_not_very_accurate outputs because of this line:

https://github.com/Vahe1994/AQLM/blob/main/inference_lib/src/aqlm/inference_kernels/numba_kernel.py#L43

This is because the CPU cores may concurrently write to the same output unit without atomic addition.

Changing the outer parallel loop to go over output_groups instead fixes the issue, but it may be slower.
Note that the layer error is within <1% in my experiments and the model outputs look reasonable.

@BlackSamorez, would the code be any slower if you changed the summation to go in parallel over output groups or used atomic addition?
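
To illustrate the point, here is a standalone sketch (not the actual AQLM Numba kernel): when the parallel loop runs over output rows, each thread writes only to its own output element, so no atomic addition is needed; parallelizing over input groups instead would require atomics or per-thread partial sums.

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def gemv_over_output_rows(codes, codebook, x, out):
    # codes:    (num_out, num_in_groups) integer indices into the codebook
    # codebook: (codebook_size, group_size) float32 entries
    # x:        (num_in_groups * group_size,) float32 input vector
    num_out, num_in_groups = codes.shape
    group_size = codebook.shape[1]
    for i in prange(num_out):  # each thread owns out[i]: no concurrent writes
        acc = np.float32(0.0)
        for g in range(num_in_groups):
            entry = codebook[codes[i, g]]
            base = g * group_size
            for k in range(group_size):
                acc += entry[k] * x[base + k]
        out[i] = acc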

How to fine-tune the compressed model

How can I fine-tune the compressed model? Thank you very much for the authors' work. I previously tried to deploy Llama-2 on a 3090 but failed due to a lack of GPU memory, and recently I found this work on compressing large models. If compressed model parameters can be used for fine-tuning tasks, are there more detailed steps? I am very interested in your current work.

Why only last token used in datasets for training?

Dear maintainers,

Thank you for your awesome paper and open-source project. I recently ran into a detail of the dataset pre-processing which I cannot understand properly.

In the processing of the datasets you ignore all the labels except the last one (as in `tar[:, :-1] = -100`). This seems a bit odd, since transformers apply causal masking inside the forward pass, and when training a language model we want to propagate the loss through all of the tokens, not just the last one.

Could you please explain this detail?
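
For context, here is a minimal illustration of the two labeling schemes being compared (not the repo's exact preprocessing code):

import torch

input_ids = torch.tensor([[11, 22, 33, 44]])

# Scheme from the snippet above: every label except the last is set to -100,
# which cross-entropy ignores, so only the final next-token prediction
# contributes to the loss.
last_token_labels = input_ids.clone()
last_token_labels[:, :-1] = -100

# Standard causal-LM fine-tuning: label every position; the model's internal
# shift makes each position predict the following token, so the loss is
# averaged over the whole sequence.
full_labels = input_ids.clone()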

I wanted to know what the beauty of this technology is

Greetings to the developers of the new method. I launched your Google Colab and got a very dubious generation result. Can you explain a little about the essence of this compression breakthrough? I may not have understood something about how to use it, but judging by the output below, the quality is completely lost.

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf", trust_remote_code=True, torch_dtype=torch.float16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained("BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf")

inputs = tokenizer(["Write a poem about python"], return_tensors="pt")["input_ids"].cuda()

streamer = TextStreamer(tokenizer)
_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=120)

<s> Write a poem about python.
Write a poem about python.
Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem

ERROR: Could not find a version that satisfies the requirement aqlm[gpu]==1.0.0

Hi, I am trying to run colab_example.ipynb on an NVIDIA A100 in a non-Colab environment. When I run `!pip install aqlm[gpu]==1.0.0`, I get the message below. I am using Python 3.10, CUDA 12.2 and an NVIDIA A4 GPU.


ERROR: Could not find a version that satisfies the requirement aqlm[gpu]==1.0.0 (from versions: 1.0.2, 1.0.3)
ERROR: No matching distribution found for aqlm[gpu]==1.0.0


The requirements.txt file in the repo gives the issues below:

ERROR: tensorflow 2.11.0 has requirement protobuf<3.20,>=3.9.2, but you'll have protobuf 3.20.3 which is incompatible.
ERROR: dask-sql 2023.6.0 has requirement pandas>=1.4.0, but you'll have pandas 1.1.5 which is incompatible.
ERROR: azureml-inference-server-http 0.8.4 has requirement flask<2.3.0, but you'll have flask 2.3.2 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.51.0 has requirement datasets<=2.3.2,>=1.7.0, but you'll have datasets 2.15.0 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.51.0 has requirement torch<=1.12.0,>=1.5.0, but you'll have torch 2.2.1 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.51.0 has requirement transformers[sentencepiece]<=4.16.0, but you'll have transformers 4.37.0 which is incompatible.
ERROR: azure-cli 2.49.0 has requirement azure-keyvault-keys==4.8.0b2, but you'll have azure-keyvault-keys 4.8.0 which is incompatible.
ERROR: azure-cli 2.49.0 has requirement azure-mgmt-keyvault==10.2.0, but you'll have azure-mgmt-keyvault 10.2.1 which is incompatible.
ERROR: azure-cli 2.49.0 has requirement azure-mgmt-resource==22.0.0, but you'll have azure-mgmt-resource 21.1.0b1 which is incompatible.
ERROR: autokeras 1.0.16 has requirement tensorflow<=2.5.0,>=2.3.0, but you'll have tensorflow 2.11.0 which is incompatible.
ERROR: arviz 0.11.2 has requirement typing-extensions<4,>=3.7.4.3, but you'll have typing-extensions 4.10.0 which is incompatible.

33B llama: inference time after quantization

Why is it that after I quantize, my inference is 2 times slower than with the original model?

Quantize command:python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --offload_activations --wandb --save $SAVE_PATH --dtype float32

Quantitative results:
[image]

In addition, looking at the GPU usage, it seems the model is being run as a pipeline across devices: I load the model onto four GPUs, but only one GPU is at 100% utilization at any given time.
[image]

Have you encountered the same problem? Is the inference speed of your quantized models normal?

Reproduce perplexity

In the README the perplexity is

Llama-2-7b | 1x16 | 5.92 | 2.4

In the paper it is:

Llama-2-7b AQLM 2.29 6.29 8.11

When I run locally using the same command as in the readme

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 \
 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
 --finetune_batch_size=32 --local_batch_size=1 --offload_activations \
 --wandb --save $SAVE_PATH

it gives me

Llama-2-7b AQLM 2.29 6.45 8.39

May I know why there is such a mismatch? Thanks for any clarification.

AQLM models cannot be trained in parallel.

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file deepspeed_zero2-sft.yaml --num_processes=1 ft-4bit-freedom-2bit.py \
    --base_model 'Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf' \
    --data_path 'eval_set_IA_multi_dialog3_sft.json' \
    --output_dir 'eval_set_IA_multi_dialog3_sft' \
    --batch_size 1 \
    --micro_batch_size 1 \
    --num_epochs 1 \
    --learning_rate 5e-4 \
    --cutoff_len 1024 \
    --val_set_size 0 \
    --lora_r 32 \
    --lora_alpha 32 \
    --lora_dropout 0.1 \
    --train_on_inputs \
    --group_by_length

Traceback (most recent call last):
  File "/home/luhao/alpaca-2bit-sft/ft-4bit-freedom-2bit.py", line 345, in <module>
    fire.Fire(train)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/luhao/alpaca-2bit-sft/ft-4bit-freedom-2bit.py", line 336, in train
    trainer.train()
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer.py", line 1933, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/accelerator.py", line 1255, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/__init__.py", line 176, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
    self._configure_distributed_model(model)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
    self._broadcast_model()
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
    dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
    work = group.broadcast([tensor], opts)
TypeError: Input tensor data type is not supported for NCCL process group: Short

Finetuning ISTA-DASLab/Mistral-7B-Instruct-v0.2-AQLM-2Bit-2x8: RuntimeError: CUDA error: invalid argument

I've tried to finetune the model "ISTA-DASLab/Mistral-7B-Instruct-v0.2-AQLM-2Bit-2x8" using the notebook "aqlm_2bit_training.ipynb" but I get the below runtime error after initiating the training. Before trying to finetune this model, I had successfully run this notebook with the original model "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf" that was defined in the notebook.

ERROR:
"
max_steps is given, it will override any value given in num_train_epochs
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(

RuntimeError Traceback (most recent call last)
in <cell line: 24>()
22 )
23 model.config.use_cache = False # silence the warnings. Please re-enable for inference!
---> 24 trainer.train()

34 frames
/usr/local/lib/python3.10/dist-packages/torch/_ops.py in call(self, *args, **kwargs)
753 # We save the function ptr as the op attribute on
754 # OpOverloadPacket to access it here.
--> 755 return self._op(*args, **(kwargs or {}))
756
757 # TODO: use this to make a dir

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions."

Any chance to have K x 16 / 1 x 16 scheme implemented for Numba?

Hi. I am interested in some ideas that require a Mixtral-type model and, for resource reasons, quantization plus partial CPU offloading.

Since this quantization technique seems relatively good, I am going to try it.

However, as far as I understand from https://github.com/Vahe1994/AQLM/blob/main/README.md, the best results are achieved with the 1x16 scheme, and it is not implemented for the CPU case (well, there is no optimized kernel: judging by https://github.com/Vahe1994/AQLM/blob/main/inference_lib/src/aqlm/inference_kernels/kernel_selector.py, it should fall back to dequantization plus torch's built-in matmul on top of that).

So I guess I may later dive into implementing such a kernel myself, but maybe it is planned anyway?
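
For reference, a rough sketch of the fallback path described above, under assumed tensor shapes (this is not the library's actual code): with no specialized 1x16 CPU kernel, the weight can be materialized by gathering codebook entries and the matmul delegated to torch.

import torch

def dequantize_then_matmul(x, codes, codebook):
    """Hypothetical 1x16 CPU fallback: materialize the weight, then use torch.matmul.

    codes:    (out_features, in_groups) int64 indices into the codebook
    codebook: (2**16, in_group_size) float32 codebook entries
    x:        (..., in_groups * in_group_size) activations
    """
    out_features = codes.shape[0]
    weight = codebook[codes].reshape(out_features, -1)  # gather + flatten groups
    return x @ weight.T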

Query on Evaluation Support for C4 Validation

Thanks for the excellent work and I really appreciate the public code.

I am currently working with the lm-evaluation-harness for validating some of the language models and have encountered a slight issue. It appears that the lm-evaluation-harness does not provide support for validation on the C4 dataset.

I was wondering if you might have any recommendations or alternative approaches for conducting validation on the C4 dataset using the lm-evaluation-harness or similar tools. Your guidance on this matter would be greatly appreciated.

Thank you once again for your contribution to the community. I am looking forward to your response.

Quantization time & VRAM requirements

Hello,

I have two basic questions:

  1. Do you have any data on how long it takes to quantize a 70b model using 24GB VRAM (assuming that's possible)?
  2. Do you plan to release prequantized models on Hugging Face? Having llama-2-70b for comparison with other methods would be useful.

Issues while attempting LLaMA-3 Quantization

Checking to see if this repo works for the new L3 models. Running this script:

export CUDA_VISIBLE_DEVICES=0,1   # or e.g. 0,1,2,3
export MODEL_PATH=/home/catid/models/Meta-Llama-3-8B-Instruct
export DATASET_PATH=pajama
export SAVE_PATH=/home/catid/models/cat-llama-3-8b-instruct-aqlm
export WANDB_PROJECT=aqlm
export WANDB_NAME=aqlm8

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --val_size=128 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 \
 --finetune_max_epochs=10 \
 --finetune_early_stop=3 \
 --finetune_keep_best \
 --local_batch_size=1 \
 --offload_activations \
 --wandb \
 --resume \
 --save $SAVE_PATH

I see:

============ Load model... ============
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 23.65it/s]
Loading pretrained model ...
Model loaded sucсessfully ...

============ Quantizing model... ============
Loading data ...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'.
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
  File "/home/catid/sources/AQLM/main.py", line 892, in <module>
    quantize_model(model, args)
  File "/home/catid/sources/AQLM/main.py", line 41, in quantize_model
    data = get_loaders(
  File "/home/catid/sources/AQLM/src/datautils.py", line 226, in get_loaders
    tokenizer = LlamaTokenizer.from_pretrained(
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 169, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

`RuntimeError: CUDA error: invalid argument` while running

I have an Ubuntu 23.10 system.

I installed cudatoolkit 12.1 using https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local

(since it needs headers and such, so I can't just install CUDA through conda).

The rest of my environment:

accelerate @ git+https://github.com/huggingface/accelerate.git@97d2168e5953fe7373a06c69c02c5a00a84d5344
anyio==4.2.0
aqlm @ file:///home/alex4321/Documents/AQLM/inference_lib
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==23.2.0
Babel==2.14.0
beautifulsoup4==4.12.3
bleach==6.1.0
Brotli @ file:///work/perseverance-python-buildout/croot/brotli-split_1698805593785/work
certifi @ file:///croot/certifi_1696279375225/work/certifi
cffi @ file:///croot/cffi_1700254295673/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
comm==0.2.1
cryptography @ file:///work/perseverance-python-buildout/croot/cryptography_1698845900024/work
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
executing==2.0.1
fastjsonschema==2.19.1
filelock @ file:///work/perseverance-python-buildout/croot/filelock_1698846025262/work
fqdn==1.5.1
fsspec==2024.2.0
h11==0.14.0
httpcore==1.0.3
httpx==0.26.0
huggingface-hub==0.20.3
idna @ file:///work/perseverance-python-buildout/croot/idna_1698845632828/work
ipykernel==6.29.2
ipython==8.21.0
isoduration==20.11.0
jedi==0.19.1
Jinja2 @ file:///work/perseverance-python-buildout/croot/jinja2_1698847462642/work
json5==0.9.14
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter-events==0.9.0
jupyter-lsp==2.2.2
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyter_server==2.12.5
jupyter_server_terminals==0.5.2
jupyterlab==4.1.1
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.3
llvmlite==0.42.0
MarkupSafe @ file:///work/perseverance-python-buildout/croot/markupsafe_1698846636000/work
matplotlib-inline==0.1.6
mistune==3.0.2
mkl-service==2.4.0
mpmath @ file:///work/perseverance-python-buildout/croot/mpmath_1698864994882/work
nbclient==0.9.0
nbconvert==7.16.0
nbformat==5.9.2
nest-asyncio==1.6.0
networkx @ file:///work/perseverance-python-buildout/croot/networkx_1698865062738/work
ninja==1.11.1.1
notebook_shim==0.2.4
numba==0.59.0
numpy @ file:///work/perseverance-python-buildout/croot/numpy_and_numpy_base_1698845160062/work/dist/numpy-1.26.0-cp312-cp312-linux_x86_64.whl#sha256=fdc35057024038070345ff9f7f47ed48ecdb21dd72461617bdadf4f5d1634fcb
overrides==7.7.0
packaging==23.2
pandocfilters==1.5.1
parso==0.8.3
pexpect==4.9.0
Pillow @ file:///work/perseverance-python-buildout/croot/pillow_1698847657722/work
platformdirs==4.2.0
prometheus_client==0.20.0
prompt-toolkit==3.0.43
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
Pygments==2.17.2
pyOpenSSL @ file:///work/perseverance-python-buildout/croot/pyopenssl_1698863523157/work
PySocks @ file:///work/perseverance-python-buildout/croot/pysocks_1698845478203/work
python-dateutil==2.8.2
python-json-logger==2.0.7
PyYAML @ file:///work/perseverance-python-buildout/croot/pyyaml_1698849903511/work
pyzmq==25.1.2
referencing==0.33.0
regex==2023.12.25
requests @ file:///work/perseverance-python-buildout/croot/requests_1698846321763/work
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.18.0
safetensors==0.4.2
scipy==1.12.0
Send2Trash==1.8.2
setuptools==68.0.0
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
stack-data==0.6.3
sympy @ file:///croot/sympy_1701397643339/work
terminado==0.18.0
tinycss2==1.2.1
tokenizers==0.15.2
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
tornado==6.4
tqdm==4.66.2
traitlets==5.14.1
transformers==4.37.0
triton==2.2.0
types-python-dateutil==2.8.19.20240106
typing_extensions==4.9.0
uri-template==1.3.0
urllib3 @ file:///work/perseverance-python-buildout/croot/urllib3_1698845837793/work
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
wheel==0.41.2

AQLM installed from latest github state.

Now if I try to run some code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

llama_cpu_quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf",
    trust_remote_code=True, torch_dtype=torch.float32, device_map="cpu"
)
llama_gpu_quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf",
    trust_remote_code=True, torch_dtype=torch.float16, device_map="cuda:0"
).cuda()
llama_tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-hf")

output = llama_cpu_quantized_model.generate(llama_tokenizer("Test is", return_tensors="pt")["input_ids"], max_new_tokens=10)
print(llama_tokenizer.decode(output[0]))

it tells me

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 4096, 2)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Compiling AQLM numba kernel with parameters: kernel_key=(8, 11008, 4096, 2)
Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 11008, 2)
<s> Test is a 19999 film directed by

which I guess is more or less fine.

But:

output = llama_gpu_quantized_model.generate(llama_tokenizer("Test is", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
print(llama_tokenizer.decode(output[0]))

gives me

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[7], line 1
----> 1 output = llama_gpu_quantized_model.generate(llama_tokenizer("Test is", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
      2 print(llama_tokenizer.decode(output[0]))

File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:1474, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1457     return self.assisted_decoding(
   1458         input_ids,
   1459         candidate_generator=candidate_generator,
   (...)
   1470         **model_kwargs,
   1471     )
   1472 if generation_mode == GenerationMode.GREEDY_SEARCH:
   1473     # 11. run greedy search
-> 1474     return self.greedy_search(
   1475         input_ids,
   1476         logits_processor=prepared_logits_processor,
   1477         stopping_criteria=prepared_stopping_criteria,
   1478         pad_token_id=generation_config.pad_token_id,
   1479         eos_token_id=generation_config.eos_token_id,
   1480         output_scores=generation_config.output_scores,
   1481         return_dict_in_generate=generation_config.return_dict_in_generate,
   1482         synced_gpus=synced_gpus,
   1483         streamer=streamer,
   1484         **model_kwargs,
   1485     )
   1487 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
   1488     if not model_kwargs["use_cache"]:

File ~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2335, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2332 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2334 # forward pass to get next token
-> 2335 outputs = self(
   2336     **model_inputs,
   2337     return_dict=True,
   2338     output_attentions=output_attentions,
   [2339](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2339)     output_hidden_states=output_hidden_states,
   [2340](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2340) )
   [2342](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2342) if synced_gpus and this_peer_finished:
   [2343](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2343)     continue  # don't waste resources running the code we don't need

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1195](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1195), in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
   [1192](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1192) return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   [1194](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1194) # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> [1195](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1195) outputs = self.model(
   [1196](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1196)     input_ids=input_ids,
   [1197](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1197)     attention_mask=attention_mask,
   [1198](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1198)     position_ids=position_ids,
   [1199](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1199)     past_key_values=past_key_values,
   [1200](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1200)     inputs_embeds=inputs_embeds,
   [1201](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1201)     use_cache=use_cache,
   [1202](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1202)     output_attentions=output_attentions,
   [1203](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1203)     output_hidden_states=output_hidden_states,
   [1204](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1204)     return_dict=return_dict,
   [1205](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1205) )
   [1207](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1207) hidden_states = outputs[0]
   [1208](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1208) if self.config.pretraining_tp > 1:

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1082](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1082), in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
   [1072](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1072)     layer_outputs = self._gradient_checkpointing_func(
   [1073](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1073)         decoder_layer.__call__,
   [1074](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1074)         hidden_states,
   (...)
   [1079](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1079)         use_cache,
   [1080](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1080)     )
   [1081](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1081) else:
-> [1082](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1082)     layer_outputs = decoder_layer(
   [1083](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1083)         hidden_states,
   [1084](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1084)         attention_mask=attention_mask,
   [1085](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1085)         position_ids=position_ids,
   [1086](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1086)         past_key_value=past_key_values,
   [1087](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1087)         output_attentions=output_attentions,
   [1088](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1088)         use_cache=use_cache,
   [1089](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1089)     )
   [1091](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1091) hidden_states = layer_outputs[0]
   [1093](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1093) if use_cache:

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:810](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:810), in LlamaDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, **kwargs)
    [807](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:807) hidden_states = self.input_layernorm(hidden_states)
    [809](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:809) # Self Attention
--> [810](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:810) hidden_states, self_attn_weights, present_key_value = self.self_attn(
    [811](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:811)     hidden_states=hidden_states,
    [812](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:812)     attention_mask=attention_mask,
    [813](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:813)     position_ids=position_ids,
    [814](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:814)     past_key_value=past_key_value,
    [815](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:815)     output_attentions=output_attentions,
    [816](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:816)     use_cache=use_cache,
    [817](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:817)     **kwargs,
    [818](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:818) )
    [819](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:819) hidden_states = residual + hidden_states
    [821](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:821) # Fully Connected

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:705](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:705), in LlamaSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    [694](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:694)     return super().forward(
    [695](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:695)         hidden_states=hidden_states,
    [696](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:696)         attention_mask=attention_mask,
   (...)
    [700](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:700)         use_cache=use_cache,
    [701](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:701)     )
    [703](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:703) bsz, q_len, _ = hidden_states.size()
--> [705](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:705) query_states = self.q_proj(hidden_states)
    [706](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:706) key_states = self.k_proj(hidden_states)
    [707](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:707) value_states = self.v_proj(hidden_states)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
   [1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509)     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   [1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511)     return self._call_impl(*args, **kwargs)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
   [1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
   [1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
   [1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   [1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518)         or _global_backward_pre_hooks or _global_backward_hooks
   [1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519)         or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520)     return forward_call(*args, **kwargs)
   [1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
   [1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523)     result = None

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:65](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:65), in QuantizedLinear.forward(self, input)
     [59](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:59) if (
     [60](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:60)     not input.is_cuda
     [61](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:61)     and self.codebook_size == 256
     [62](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:62)     and self.codes.shape[0] == self.out_features [/](https://file+.vscode-resource.vscode-cdn.net/)[/](https://file+.vscode-resource.vscode-cdn.net/) self.out_group_size
     [63](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:63) ):
     [64](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:64)     self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous()  #  TODO: fix this thing
---> [65](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:65) return forward_pass_quantized_linear(input, self.codes, self.codebooks, self.scales, self.bias)

File [~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:31](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:31), in forward_pass_quantized_linear(input, codes, codebooks, scales, bias)
     [26](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:26)     from .cuda_kernel import CUDA_KERNEL
     [28](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:28)     assert (
     [29](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:29)         input.dtype == torch.float16
     [30](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:30)     ), f"please load the model with `torch_dtype=torch.float16`, as {input.dtype} is not supported on GPU yet"
---> [31](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:31)     return CUDA_KERNEL.code2x8_matmat(input, codes, codebooks, scales) + (bias if bias is not None else 0)
     [32](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:32) case (True, _, _, _, _):
     [33](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:33)     from .triton_kernel import triton_matmul

RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
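
For reference, a minimal sketch (not part of the original report) of setting that debugging flag from Python, before torch initializes CUDA:

    import os

    # Force synchronous CUDA kernel launches so the failing kernel is reported at
    # its actual call site; must be set before CUDA is first initialized.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # imported only after the variable is set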

P.S. It is somewhat off-topic, but I don't understand why these two out_features computations differ:

auto out_features = codes.size(0) * codebooks.size(2);
out_features = codes.shape[1] * out_group_size
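
A possible reconciliation, offered only as an assumption about AQLM's tensor layouts rather than a verified answer: if codes has shape (num_out_groups, num_in_groups, num_codebooks) on GPU and codebooks has shape (num_codebooks, codebook_size, out_group_size, in_group_size), both expressions reduce to num_out_groups * out_group_size; the CPU path simply operates on codes permuted with dims (1, 0, 2), as seen in the traceback above.

    # Hypothetical shape bookkeeping under the layout assumptions stated above.
    num_out_groups, out_group_size = 512, 8

    out_features_cuda = num_out_groups * out_group_size  # codes.size(0) * codebooks.size(2)
    out_features_cpu = num_out_groups * out_group_size   # codes.shape[1] * out_group_size, after the (1, 0, 2) permute
    assert out_features_cuda == out_features_cpu == 4096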

Do not hard pin the `transformers` dependency version

Hard pinning the transformers version causes a dependency conflict in pip:

transformers==4.37.0

Successfully built transformers
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.37.0
    Uninstalling transformers-4.37.0:
      Successfully uninstalled transformers-4.37.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aqlm 1.0.1 requires transformers==4.37.0, but you have transformers 4.38.0.dev0 which is incompatible.
Successfully installed transformers-4.38.0.dev0

Changing == to >= would allow later versions with updates to be installed (in this case, the update in question added AQLM compatibility!).
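
For illustration, a relaxed bound in the package metadata might look like the sketch below (a hypothetical setup.py; the actual aqlm packaging may be laid out differently):

    # Hypothetical setup.py sketch, not the actual aqlm packaging: a lower bound
    # instead of a hard pin lets newer transformers releases install cleanly.
    from setuptools import find_packages, setup

    setup(
        name="aqlm",
        version="1.0.1",
        packages=find_packages(),
        install_requires=[
            "torch>=2.1.1",
            "transformers>=4.37.0",  # was: transformers==4.37.0
        ],
    )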

Request for the Llama-2-13B with AQLM (2x8 scheme)

Hello,

Thanks for your outstanding work. I want to do a comprehensive comparison of recent quantization methods.

Since the latest lm-eval can obtain higher accuracy than the numbers reported in the paper, I have to re-evaluate each quantized model.

I found that there is no Llama-2-13B model quantized with the AQLM 2x8 scheme; could you share one on Hugging Face?

Thank you!

“RuntimeError: Only Tensors of floating point and complex dtype can require gradients” when trying to load a model with device_map="auto" or low_cpu_mem_usage

code:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x15-hf",
    trust_remote_code=True, torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True
)

I assume the reason lies in big_modeling.py from accelerate, namely in this code:

    def register_empty_parameter(module, name, param):
        old_register_parameter(module, name, param)
        if param is not None:
            param_cls = type(module._parameters[name])
            kwargs = module._parameters[name].__dict__
            module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)

Apparently this happens because kwargs does not contain requires_grad=False.
The following (admittedly crude) patch fixed the error for me:

    def register_empty_parameter(module, name, param):
        old_register_parameter(module, name, param)
        if param is not None:
            param_cls = type(module._parameters[name])
            kwargs = module._parameters[name].__dict__
            kwargs["requires_grad"] = False # silly, but works
            module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)

But I understand this is the wrong solution and that there must be a proper way to fix the problem.
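
For context, the underlying constraint is easy to reproduce in isolation. Assuming the affected parameters are AQLM's integer code tensors, re-registering them without requires_grad=False triggers exactly this error, since integer tensors cannot require gradients:

    import torch

    # Minimal reproduction (assumption: the re-registered parameters are integer
    # AQLM code tensors). Integer dtypes cannot require gradients, so creating the
    # Parameter without requires_grad=False raises the same RuntimeError.
    codes = torch.zeros(4, 4, dtype=torch.int8)

    ok = torch.nn.Parameter(codes, requires_grad=False)  # fine
    try:
        torch.nn.Parameter(codes)  # requires_grad defaults to True
    except RuntimeError as e:
        print(e)  # Only Tensors of floating point and complex dtype can require gradients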

KV Cache Quantization

Hey, thanks for your work. I saw https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16/discussions/2 about how 8-bit KV cache quantization can be enabled in vLLM. I am not sure how exactly the KV cache is handled for AQLM in Transformers, but would KV cache quantization be theoretically possible? It might address some of the concerns about high VRAM usage for long contexts raised in https://www.reddit.com/r/LocalLLaMA/comments/1clinlb/bringing_2bit_llms_to_production_new_aqlm_models/.

To be specific, would 4-bit cache quantization be possible? Turboderp managed to achieve negligible perplexity loss somehow: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md. For reference, turboderp/exllamav2@324404e.

Thanks!
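
For what it's worth, recent transformers releases expose a generic quantized KV cache through generate(); whether it composes cleanly with AQLM weights is exactly the open question here, but a hedged sketch of that API looks roughly like this:

    # Hedged sketch, assuming a recent transformers release (>= 4.40) with the
    # optional `quanto` package installed; not verified against AQLM checkpoints.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        cache_implementation="quantized",                # use a quantized KV cache
        cache_config={"backend": "quanto", "nbits": 4},  # 4-bit keys/values
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))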

unable to install

py -m pip install aqlm[gpu,cpu]
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting aqlm[cpu,gpu]
Downloading aqlm-1.0.0-py3-none-any.whl (10 kB)
Collecting torch>=2.1.1
Downloading torch-2.2.0-cp311-cp311-win_amd64.whl (198.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 198.6/198.6 MB 12.6 MB/s eta 0:00:00
Collecting transformers==4.37.0
Downloading transformers-4.37.0-py3-none-any.whl (8.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 8.5 MB/s eta 0:00:00
Collecting numba>=0.56.4
Downloading numba-0.59.0-cp311-cp311-win_amd64.whl (2.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 18.7 MB/s eta 0:00:00
Collecting scipy>=1.11.3
Downloading scipy-1.12.0-cp311-cp311-win_amd64.whl (46.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.2/46.2 MB 13.1 MB/s eta 0:00:00
ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.53.0 Requires-Python >=3.6,<3.10; 0.53.0rc1.post1 Requires-Python >=3.6,<3.10; 0.53.0rc2 Requires-Python >=3.6,<3.10; 0.53.0rc3 Requires-Python >=3.6,<3.10; 0.53.1 Requires-Python >=3.6,<3.10; 0.54.0 Requires-Python >=3.7,<3.10; 0.54.0rc2 Requires-Python >=3.7,<3.10; 0.54.0rc3 Requires-Python >=3.7,<3.10; 0.54.1 Requires-Python >=3.7,<3.10; 0.55.0 Requires-Python >=3.7,<3.11; 0.55.0rc1 Requires-Python >=3.7,<3.11; 0.55.1 Requires-Python >=3.7,<3.11; 0.55.2 Requires-Python >=3.7,<3.11; 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10; 1.7.2 Requires-Python >=3.7,<3.11; 1.7.3 Requires-Python >=3.7,<3.11; 1.8.0 Requires-Python >=3.8,<3.11; 1.8.0rc1 Requires-Python >=3.8,<3.11; 1.8.0rc2 Requires-Python >=3.8,<3.11; 1.8.0rc3 Requires-Python >=3.8,<3.11; 1.8.0rc4 Requires-Python >=3.8,<3.11; 1.8.1 Requires-Python >=3.8,<3.11
ERROR: Could not find a version that satisfies the requirement triton>=2.1; extra == "gpu" (from aqlm[cpu,gpu]) (from versions: none)
ERROR: No matching distribution found for triton>=2.1; extra == "gpu"
