vahe1994 / aqlm
Official Pytorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf
License: Apache License 2.0
Note: I tested this for aqlm==1.0.3 and 1.1.0
I'm having an issue running inference with an HF model in Transformers.
pip install transformers aqlm[gpu,cpu]
Pytorch 2.1.2 is installed
Then
from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained("ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf", torch_dtype="auto", device_map="cuda").cuda()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
print(output)
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/generation/utils.py", line 1544, in generate
return self.greedy_search(
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/generation/utils.py", line 2404, in greedy_search
outputs = self(
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
outputs = self.model(
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1008, in forward
layer_outputs = decoder_layer(
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 734, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 633, in forward
query_states = self.q_proj(hidden_states)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference.py", line 70, in forward
self.prepare_matmul_op(input)
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference.py", line 86, in prepare_matmul_op
get_forward_pass_kernel(self.codebooks, False),
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py", line 35, in get_forward_pass_kernel
from .cuda_kernel import CUDA_FOLDER
File "/home/michael/venvs/nm-vllm/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.py", line 20, in <module>
@torch.library.impl_abstract("aqlm::code1x16_matmat")
AttributeError: module 'torch.library' has no attribute 'impl_abstract'
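The last frame shows `aqlm` calling `torch.library.impl_abstract`, which the installed PyTorch 2.1.2 does not provide. As far as I know this decorator first shipped in PyTorch 2.2, so upgrading PyTorch is the likely fix; treat that version number as an assumption to verify. A minimal sketch of a pre-flight check follows (the helper name `supports_impl_abstract` is hypothetical and not part of `aqlm`):

```python
import importlib
import importlib.util


def supports_impl_abstract(torch_module):
    """Return True if the given torch module exposes torch.library.impl_abstract."""
    library = getattr(torch_module, "library", None)
    return library is not None and hasattr(library, "impl_abstract")


if __name__ == "__main__":
    # Only import torch if it is actually installed in this environment.
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed")
    else:
        torch = importlib.import_module("torch")
        ok = supports_impl_abstract(torch)
        print(f"torch {torch.__version__}: impl_abstract available = {ok}")
```

Running this before loading the model turns the late `AttributeError` into an early, readable message.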
Hello,
Phi-3 is a small (3.8B) but well-known model that is quite useful, judging by evals and my own prompting.
Would it be possible to quantize Phi-3? There is also a 4k version of the model.
HF: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
HF 4k: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
Thank you very much,
Vaclav
Hi, what command should I run in your codebase to get perplexity numbers on wikitext and c4 for your HF hub models (e.g. https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-2Bit-1x16/tree/main)?
Congratulations on coming up with such an excellent quantization algorithm! I'm trying to use AQLM to quantize Deepseek-Coder and Starcoder2, but the repository doesn't seem to have direct support. Are there any plans to support more models? Or do you have any suggestions on how to modify the source code to add support quickly? Thanks.
Thanks for the excellent work and I really appreciate the public code.
I am currently working with the lm-evaluation-harness for validating some of the language models and have encountered a slight issue. It appears that the lm-evaluation-harness does not provide support for validation on the C4 dataset.
I was wondering if you might have any recommendations or alternative approaches for conducting validation on the C4 dataset using the lm-evaluation-harness or similar tools. Your guidance on this matter would be greatly appreciated.
Thank you once again for your contribution to the community. I am looking forward to your response.
Does the current AQLM version work for DBRX?
Hi, I am trying to run colab_example.ipynb on an NVIDIA A100 in a non-Colab environment. When I run !pip install aqlm[gpu]==1.0.0, I get the message below. I am using Python 3.10, CUDA 12.2 and an NVIDIA A4 GPU.
ERROR: Could not find a version that satisfies the requirement aqlm[gpu]==1.0.0 (from versions: 1.0.2, 1.0.3)
ERROR: No matching distribution found for aqlm[gpu]==1.0.0
The requirements.txt file in the repo gives the issues below:
ERROR: tensorflow 2.11.0 has requirement protobuf<3.20,>=3.9.2, but you'll have protobuf 3.20.3 which is incompatible.
ERROR: dask-sql 2023.6.0 has requirement pandas>=1.4.0, but you'll have pandas 1.1.5 which is incompatible.
ERROR: azureml-inference-server-http 0.8.4 has requirement flask<2.3.0, but you'll have flask 2.3.2 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.51.0 has requirement datasets<=2.3.2,>=1.7.0, but you'll have datasets 2.15.0 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.51.0 has requirement torch<=1.12.0,>=1.5.0, but you'll have torch 2.2.1 which is incompatible.
ERROR: azureml-automl-dnn-nlp 1.51.0 has requirement transformers[sentencepiece]<=4.16.0, but you'll have transformers 4.37.0 which is incompatible.
ERROR: azure-cli 2.49.0 has requirement azure-keyvault-keys==4.8.0b2, but you'll have azure-keyvault-keys 4.8.0 which is incompatible.
ERROR: azure-cli 2.49.0 has requirement azure-mgmt-keyvault==10.2.0, but you'll have azure-mgmt-keyvault 10.2.1 which is incompatible.
ERROR: azure-cli 2.49.0 has requirement azure-mgmt-resource==22.0.0, but you'll have azure-mgmt-resource 21.1.0b1 which is incompatible.
ERROR: autokeras 1.0.16 has requirement tensorflow<=2.5.0,>=2.3.0, but you'll have tensorflow 2.11.0 which is incompatible.
ERROR: arviz 0.11.2 has requirement typing-extensions<4,>=3.7.4.3, but you'll have typing-extensions 4.10.0 which is incompatible.
The package metadata on PyPI points to a repo from Vage1994 not Vahe1994
https://pypi.org/project/aqlm/
This is set in
Line 9 in 48c44b1
How does your updated fine-tuning method compare with the one in your arXiv paper?
I'm trying to do quantization with these parameters:
python main.py ../models/my-mistral-7B wikitext2 --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --save ../models/my-mistral-7B-AQLM --model_seqlen 8192 --offload_activations
and I got this error
Traceback (most recent call last):
File "/home/fahadh/AQLM/main.py", line 779, in <module>
quantize_model(model, args)
File "/home/fahadh/AQLM/main.py", line 48, in quantize_model
results = quantize_aq(model, dataloader, args)
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/fahadh/AQLM/main.py", line 252, in quantize_aq
layer = finetune_groupwise(layer=layer, inps=inps, outs=outs, args=args, **forward_args)
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/fahadh/AQLM/src/finetune.py", line 105, in finetune_groupwise
loss = _compute_mse_parallel(
File "/home/fahadh/AQLM/src/finetune.py", line 224, in _compute_mse_parallel
mse_components = torch.nn.parallel.parallel_apply(
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
output.reraise()
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/_utils.py", line 722, in reraise
raise exception
AttributeError: Caught AttributeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
output = module(*input, **kwargs)
File "/home/fahadh/AQLM/src/finetune.py", line 201, in _compute_mse_on_batch
outs_prediction, *_unused = layer(inps_batch, **kwargs)
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 754, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 437, in forward
target_dtype = self.q_proj.weight.dtype
File "/home/fahadh/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'QuantizedLinear' object has no attribute 'weight'
This happens after fine-tuning layer 0 of the Mistral model. Here is what was printed before the error:
PREPARING TO FINETUNE
MistralDecoderLayer(
(self_attn): MistralFlashAttention2(
(q_proj): QuantizedLinear(
(quantized_weight): QuantizedWeight(self.out_features=4096, self.in_features=4096, bits_per_parameter=2.50390625)
)
(k_proj): QuantizedLinear(
(quantized_weight): QuantizedWeight(self.out_features=1024, self.in_features=4096, bits_per_parameter=4.00390625)
)
(v_proj): QuantizedLinear(
(quantized_weight): QuantizedWeight(self.out_features=1024, self.in_features=4096, bits_per_parameter=4.00390625)
)
(o_proj): QuantizedLinear(
(quantized_weight): QuantizedWeight(self.out_features=4096, self.in_features=4096, bits_per_parameter=2.50390625)
)
(rotary_emb): MistralRotaryEmbedding()
)
(mlp): MistralMLP(
(gate_proj): QuantizedLinear(
(quantized_weight): QuantizedWeight(self.out_features=14336, self.in_features=4096, bits_per_parameter=2.146763392857143)
)
(up_proj): QuantizedLinear(
(quantized_weight): QuantizedWeight(self.out_features=14336, self.in_features=4096, bits_per_parameter=2.146763392857143)
)
(down_proj): QuantizedLinear(
(quantized_weight): QuantizedWeight(self.out_features=4096, self.in_features=14336, bits_per_parameter=2.1439732142857144)
)
(act_fn): SiLU()
)
(input_layernorm): MistralRMSNorm()
(post_attention_layernorm): MistralRMSNorm()
)
Fine-tuning 3721216 parameters
In the readme the ppl is
Llama-2-7b | 1x16 | 5.92 | 2.4
In the paper it is:
Llama-2-7b AQLM 2.29 6.29 8.11
When I run locally using the same command as in the readme
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 \
--num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 \
--relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 \
--finetune_batch_size=32 --local_batch_size=1 --offload_activations \
--wandb --save $SAVE_PATH
it gives me
Llama-2-7b AQLM 2.29 6.45 8.39
Can I know why there is such a mismatch? Thanks for any clarifications.
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file deepspeed_zero2-sft.yaml --num_processes=1 ft-4bit-freedom-2bit.py \
--base_model 'Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf' \
--data_path 'eval_set_IA_multi_dialog3_sft.json' \
--output_dir 'eval_set_IA_multi_dialog3_sft' \
--batch_size 1 \
--micro_batch_size 1 \
--num_epochs 1 \
--learning_rate 5e-4 \
--cutoff_len 1024 \
--val_set_size 0 \
--lora_r 32 \
--lora_alpha 32 \
--lora_dropout 0.1 \
--train_on_inputs \
--group_by_length
Traceback (most recent call last):
File "/home/luhao/alpaca-2bit-sft/ft-4bit-freedom-2bit.py", line 345, in <module>
fire.Fire(train)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/luhao/alpaca-2bit-sft/ft-4bit-freedom-2bit.py", line 336, in train
trainer.train()
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer.py", line 1780, in train
return inner_training_loop(
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer.py", line 1933, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/accelerator.py", line 1255, in prepare
result = self._prepare_deepspeed(*args)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/accelerator.py", line 1640, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/__init__.py", line 176, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
self._configure_distributed_model(model)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1157, in _configure_distributed_model
self._broadcast_model()
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1077, in _broadcast_model
dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1914, in broadcast
work = group.broadcast([tensor], opts)
TypeError: Input tensor data type is not supported for NCCL process group: Short
Hi there!
Thanks again for your great work !
I was wondering if you would like to create a new HF org similar to https://huggingface.co/unsloth, push the HF-compatible weights there, and expose a collection? That way users could easily refer to these checkpoints as the "official" AQLM checkpoints (e.g. aqlm/mixtral-aqlm-2bit, etc.).
What do you think?
Fine-tune colab example fails when running trainer.train()
Last cell gives output
max_steps is given, it will override any value given in num_train_epochs
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
RuntimeError Traceback (most recent call last)
in <cell line: 24>()
22 )
23 model.config.use_cache = False # silence the warnings. Please re-enable for inference!
---> 24 trainer.train()
37 frames
/usr/local/lib/python3.10/dist-packages/torch/_ops.py in __call__(self, *args, **kwargs)
753 # We save the function ptr as the `op` attribute on
754 # OpOverloadPacket to access it here.
--> 755 return self._op(*args, **(kwargs or {}))
756
757 # TODO: use this to make a __dir__
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
Hello,
I have two basic questions:
I've tried to finetune the model "ISTA-DASLab/Mistral-7B-Instruct-v0.2-AQLM-2Bit-2x8" using the notebook "aqlm_2bit_training.ipynb" but I get the below runtime error after initiating the training. Before trying to finetune this model, I had successfully run this notebook with the original model "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf" that was defined in the notebook.
RuntimeError Traceback (most recent call last)
in <cell line: 24>()
22 )
23 model.config.use_cache = False # silence the warnings. Please re-enable for inference!
---> 24 trainer.train()
34 frames
/usr/local/lib/python3.10/dist-packages/torch/_ops.py in __call__(self, *args, **kwargs)
753 # We save the function ptr as the `op` attribute on
754 # OpOverloadPacket to access it here.
--> 755 return self._op(*args, **(kwargs or {}))
756
757 # TODO: use this to make a __dir__
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I'm curious about quantizing a 7B model like Mistral Instruct v2. From what I understand, the choice of dataset is an important point. What would be a good dataset for quantizing with AQLM?
Are there any other important points for creating a good-quality quantization of a model with AQLM?
model:ISTA-DASLab/Mixtral-8x7B-Instruct-v0_1-AQLM-2Bit-1x16-hf
model = AutoModelForCausalLM.from_pretrained(base_model,
trust_remote_code=True,
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
device_map="auto",
low_cpu_mem_usage=True)
total_parameters = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_parameters}")
output:
Total number of parameters: 6546853888
The current implementation sometimes returns close_but_not_very_accurate outputs because of this line:
This is because the CPU cores may concurrently write to the same output unit without atomic addition.
Changing the outer parallel loop to go over output_groups instead fixes the issue, but it may be slower.
Note that the layer error is within 1% in my experiments and the model outputs look reasonable.
@BlackSamorez, would the code be any slower if you changed the summation to run in parallel over output groups or used atomic addition?
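The proposed fix can be illustrated without the actual kernel. Below is a minimal sketch in plain Python threads (not the repository's CPU kernel, and with hypothetical function names): each task owns exactly one output unit and is the only writer to that slot, so no atomic addition is needed.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_parallel_over_outputs(weight, vector, max_workers=4):
    """Compute weight @ vector with one task per output unit."""
    output = [0.0] * len(weight)

    def compute_row(row_index):
        # This task is the only writer to output[row_index], so tasks never collide.
        output[row_index] = sum(w * x for w, x in zip(weight[row_index], vector))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) forces evaluation and re-raises any worker exception.
        list(pool.map(compute_row, range(len(weight))))
    return output


if __name__ == "__main__":
    print(matvec_parallel_over_outputs([[1, 2], [3, 4], [5, 6]], [10, 1]))
```

By contrast, splitting the work over input groups makes several workers accumulate into the same `output[i]`, which is where a non-atomic `+=` can lose updates. Whether the output-parallel order is slower depends on memory access patterns in the real kernel, which this sketch does not model.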
code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
quantized_model = AutoModelForCausalLM.from_pretrained(
"BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x15-hf",
trust_remote_code=True, torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True
)
I assume that the reason lies in big_modeling.py from accelerate, namely in the code
def register_empty_parameter(module, name, param):
    old_register_parameter(module, name, param)
    if param is not None:
        param_cls = type(module._parameters[name])
        kwargs = module._parameters[name].__dict__
        module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)
Apparently, this happens because kwargs does not contain requires_grad=False.
This ridiculous patch that I applied fixed the error:
def register_empty_parameter(module, name, param):
    old_register_parameter(module, name, param)
    if param is not None:
        param_cls = type(module._parameters[name])
        kwargs = module._parameters[name].__dict__
        kwargs["requires_grad"] = False  # silly, but works
        module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)
But I understand that this is the wrong solution and there is a correct way to solve this problem.
Does it take too long to quantize a model?
Checking to see if this repo works for the new L3 models. Running this script:
export CUDA_VISIBLE_DEVICES=0,1 # or e.g. 0,1,2,3
export MODEL_PATH=/home/catid/models/Meta-Llama-3-8B-Instruct
export DATASET_PATH=pajama
export SAVE_PATH=/home/catid/models/cat-llama-3-8b-instruct-aqlm
export WANDB_PROJECT=aqlm
export WANDB_NAME=aqlm8
python main.py $MODEL_PATH $DATASET_PATH \
--nsamples=1024 \
--val_size=128 \
--num_codebooks=1 \
--nbits_per_codebook=16 \
--in_group_size=8 \
--relative_mse_tolerance=0.01 \
--finetune_batch_size=32 \
--finetune_max_epochs=10 \
--finetune_early_stop=3 \
--finetune_keep_best \
--local_batch_size=1 \
--offload_activations \
--wandb \
--resume \
--save $SAVE_PATH
I see:
============ Load model... ============
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 23.65it/s]
Loading pretrained model ...
Model loaded sucсessfully ...
============ Quantizing model... ============
Loading data ...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'.
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
File "/home/catid/sources/AQLM/main.py", line 892, in <module>
quantize_model(model, args)
File "/home/catid/sources/AQLM/main.py", line 41, in quantize_model
data = get_loaders(
File "/home/catid/sources/AQLM/src/datautils.py", line 226, in get_loaders
tokenizer = LlamaTokenizer.from_pretrained(
File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
return cls._from_pretrained(
File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 169, in __init__
self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 196, in get_spm_processor
tokenizer.Load(self.vocab_file)
File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
return self.LoadFromFile(model_file)
File "/home/catid/mambaforge/envs/aqlm/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
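The traceback ends with SentencePiece being handed `self.vocab_file`, and `TypeError: not a string` is what it raises when that value is `None`. A likely cause, stated here as an assumption, is that the Llama 3 checkpoint ships only `tokenizer.json` and no SentencePiece `tokenizer.model`, so `LlamaTokenizer` has nothing to load, whereas `AutoTokenizer` would pick the fast tokenizer. A minimal sketch of that check (the helper `pick_tokenizer_class_name` is hypothetical and not part of the repository):

```python
from pathlib import Path


def pick_tokenizer_class_name(model_dir):
    """Suggest a transformers tokenizer class based on the files in model_dir."""
    # The slow LlamaTokenizer needs a SentencePiece file named tokenizer.model.
    if (Path(model_dir) / "tokenizer.model").is_file():
        return "LlamaTokenizer"
    # Otherwise defer to AutoTokenizer, which can load tokenizer.json.
    return "AutoTokenizer"
```

With a helper like this, the data loader could fall back to `AutoTokenizer.from_pretrained(model_dir)` for checkpoints without a SentencePiece vocabulary instead of failing inside `LlamaTokenizer`.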
py -m pip install aqlm[gpu,cpu]
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting aqlm[cpu,gpu]
Downloading aqlm-1.0.0-py3-none-any.whl (10 kB)
Collecting torch>=2.1.1
Downloading torch-2.2.0-cp311-cp311-win_amd64.whl (198.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 198.6/198.6 MB 12.6 MB/s eta 0:00:00
Collecting transformers==4.37.0
Downloading transformers-4.37.0-py3-none-any.whl (8.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 8.5 MB/s eta 0:00:00
Collecting numba>=0.56.4
Downloading numba-0.59.0-cp311-cp311-win_amd64.whl (2.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 18.7 MB/s eta 0:00:00
Collecting scipy>=1.11.3
Downloading scipy-1.12.0-cp311-cp311-win_amd64.whl (46.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.2/46.2 MB 13.1 MB/s eta 0:00:00
ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.53.0 Requires-Python >=3.6,<3.10; 0.53.0rc1.post1 Requires-Python >=3.6,<3.10; 0.53.0rc2 Requires-Python >=3.6,<3.10; 0.53.0rc3 Requires-Python >=3.6,<3.10; 0.53.1 Requires-Python >=3.6,<3.10; 0.54.0 Requires-Python >=3.7,<3.10; 0.54.0rc2 Requires-Python >=3.7,<3.10; 0.54.0rc3 Requires-Python >=3.7,<3.10; 0.54.1 Requires-Python >=3.7,<3.10; 0.55.0 Requires-Python >=3.7,<3.11; 0.55.0rc1 Requires-Python >=3.7,<3.11; 0.55.1 Requires-Python >=3.7,<3.11; 0.55.2 Requires-Python >=3.7,<3.11; 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10; 1.7.2 Requires-Python >=3.7,<3.11; 1.7.3 Requires-Python >=3.7,<3.11; 1.8.0 Requires-Python >=3.8,<3.11; 1.8.0rc1 Requires-Python >=3.8,<3.11; 1.8.0rc2 Requires-Python >=3.8,<3.11; 1.8.0rc3 Requires-Python >=3.8,<3.11; 1.8.0rc4 Requires-Python >=3.8,<3.11; 1.8.1 Requires-Python >=3.8,<3.11
ERROR: Could not find a version that satisfies the requirement triton>=2.1; extra == "gpu" (from aqlm[cpu,gpu]) (from versions: none)
ERROR: No matching distribution found for triton>=2.1; extra == "gpu"
I've been using quantization tools like GPTQ, Exllama, and QuIP#. Those tools are quite fast at quantizing on a single A6000 GPU. This tool, however, takes a really long time even though I'm using two A6000 GPUs. How long does it take to quantize Mistral 7B using two A6000 GPUs and these parameters:
python main.py ../models/my-mistral-7B wikitext2 --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --save ../models/my-mistral-7B-AQLM --model_seqlen 8192 --offload_activations
Hard pinning the transformers version is causing a dependency conflict in pip
Line 35 in 48c44b1
Successfully built transformers
Installing collected packages: transformers
Attempting uninstall: transformers
Found existing installation: transformers 4.37.0
Uninstalling transformers-4.37.0:
Successfully uninstalled transformers-4.37.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aqlm 1.0.1 requires transformers==4.37.0, but you have transformers 4.38.0.dev0 which is incompatible.
Successfully installed transformers-4.38.0.dev0
If you change == to >= it would allow for later versions with updates (in this case, the update was for AQLM compatibility!)
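To make the difference between the two specifiers concrete, here is a deliberately simplified sketch that compares only the numeric release part of a version string. It is not a PEP 440 implementation (it ignores `dev`, `rc` and `post` ordering), and the helper names are hypothetical:

```python
def release_tuple(version):
    """Keep only the leading numeric components: '4.38.0.dev0' -> (4, 38, 0)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def satisfies(installed, operator, required):
    """Check an '==' or '>=' requirement on the numeric release part only."""
    have, want = release_tuple(installed), release_tuple(required)
    if operator == "==":
        return have == want
    if operator == ">=":
        return have >= want
    raise ValueError(f"unsupported operator: {operator}")


if __name__ == "__main__":
    print(satisfies("4.38.0.dev0", "==", "4.37.0"))  # False
    print(satisfies("4.38.0.dev0", ">=", "4.37.0"))  # True
```

The hard pin rejects the development build that carries the AQLM integration, while a lower bound accepts it.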
Hi @Vahe1994, I really like the idea of the paper, and thank you for releasing the code.
I am currently studying the code, and I would like to know how you obtain an average of 2.02 bits per layer for LLaMA2-7b (shown in the paper's Table 1). I tried the hyperparameters from the README (num_codebooks=1, codebook_value_nbits=16, in_group_size=8, out_group_size=1, codebook_value_nbits=16, scale_nbits=0, with the rest left at their default values), but I get around 2.50 bits for the attention weights and 2.2 bits for the MLP weights, so their average cannot be 2.02. How do I tune the parameters to get the paper's results? Thank you very much!
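The per-layer figures can be reproduced with simple accounting. The sketch below is an assumption about how the bits are counted, not the repository's code: each group of weights costs one code per codebook, each codebook stores 2^nbits vectors of 16-bit values, and there is one 16-bit scale per output unit. The function names and the `scale_storage_nbits` parameter are hypothetical. These assumptions do reproduce the 2.50 and 2.2 figures quoted above.

```python
def quantized_layer_bits(out_features, in_features, num_codebooks=1,
                         nbits_per_codebook=16, in_group_size=8,
                         out_group_size=1, codebook_value_nbits=16,
                         scale_storage_nbits=16):
    """Return (total_bits, num_weights) for one quantized linear layer."""
    num_weights = out_features * in_features
    group_size = in_group_size * out_group_size
    # One code per codebook for every group of weights.
    code_bits = (num_weights // group_size) * num_codebooks * nbits_per_codebook
    # Each codebook holds 2**nbits vectors of group_size values.
    codebook_bits = num_codebooks * 2 ** nbits_per_codebook * group_size * codebook_value_nbits
    # Assumed: one scale per group of output units.
    scale_bits = (out_features // out_group_size) * scale_storage_nbits
    return code_bits + codebook_bits + scale_bits, num_weights


def average_bits(shapes):
    """Parameter-weighted average bits per weight over (out, in) shapes."""
    totals = [quantized_layer_bits(out_f, in_f) for out_f, in_f in shapes]
    return sum(bits for bits, _ in totals) / sum(count for _, count in totals)


# Llama-2-7b decoder block shapes as (out_features, in_features).
attention = [(4096, 4096)] * 4
mlp = [(11008, 4096), (11008, 4096), (4096, 11008)]

bits, weights = quantized_layer_bits(4096, 4096)
print(bits / weights)                 # 2.50390625
print(average_bits(mlp))              # about 2.19
print(average_bits(attention + mlp))  # about 2.29
```

Under these assumptions the codebook alone adds 0.5 bits per weight to a 4096x4096 matrix, so the 1x16 settings with in_group_size=8 land near 2.29 bits on average; a 2.02 average would need a different configuration.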
...
Hi. I am interested in some ideas that require a Mixtral-type model and, for resource reasons, quantization and partial CPU offloading.
Since this quantization technique seems relatively good, I am going to try it.
However, as far as I understand from https://github.com/Vahe1994/AQLM/blob/main/README.md, the best result is achieved with the 1x16 scheme, and it is not implemented for the CPU case (well, there is no optimized kernel: according to https://github.com/Vahe1994/AQLM/blob/main/inference_lib/src/aqlm/inference_kernels/kernel_selector.py it should do dequantization and torch built-in matmul on top of that).
So I guess I may later dive into implementing such a scheme, but maybe it's already planned?
Hi, we have performed a small experiment on fine-tuning the Llama-2-70B-AQLM-2Bit model using the PEFT QLoRA method. We utilized the Alpaca and Glaive datasets for instruction tuning, and the fine-tuned version demonstrates preliminary conversation and tool-using abilities. We found that the training only requires 24GB of GRAM, while inference only needs 20GB. Thus fine-tuning a 70B model on consumer devices can be feasible. However, we found that AQLM significantly increases the overall training time. It would be better if the training speed could be improved. Thanks again for your excellent work!
Adapter weights: https://huggingface.co/hiyouga/Llama-2-70b-AQLM-2Bit-QLoRA-function-calling
Hello, this isn't really a problem, I just wanted to inform you all that I wrote 1x16 and 2x8 decompression/dequantization kernels that run a factor of 6 and 9 (respectively) faster than the generic pytorch one you guys coded up, good for prefill performance. I didn't roll the scale into them because scaling the output is very fast.
Anyway, if you want to see them or bring them over, feel free, they are in:
Cheers, and AQLM is really great, reaching Pareto optimal is a really nice achievement!
-James
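The remark about leaving the scale out of the kernel rests on a simple identity, assuming one scale per output row: scaling the rows of the weight and then multiplying gives the same result as multiplying first and scaling the output. A minimal sketch with hypothetical helper names:

```python
def matvec(weight, x):
    """Plain dense matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def scale_rows(weight, scales):
    """Multiply every row of the weight by its own scale."""
    return [[s * w for w in row] for row, s in zip(weight, scales)]


weight = [[1.0, 2.0], [3.0, 4.0]]
scales = [0.5, 2.0]
x = [1.0, -1.0]

# Option 1: fold the scale into the weights (out_features * in_features multiplies).
scaled_first = matvec(scale_rows(weight, scales), x)

# Option 2: scale only the output (out_features multiplies per input vector).
scaled_last = [s * y for s, y in zip(scales, matvec(weight, x))]
```

Both options produce the same vector here, and the second touches far fewer elements, which matches the observation that scaling the output is cheap.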
Hi! I'm having the following issue on the forward pass while prompt tuning; it only happens when using an AQLM model. I'm using https://huggingface.co/BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch/tree/main.
CalledProcessError Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2096, in _run_ninja_build(build_directory, verbose, error_prefix)
2095 stdout_fileno = 1
-> 2096 subprocess.run(
2097 command,
2098 stdout=stdout_fileno if verbose else subprocess.PIPE,
2099 stderr=subprocess.STDOUT,
2100 cwd=build_directory,
2101 check=True,
2102 env=env)
2103 except subprocess.CalledProcessError as e:
2104 # Python 2 and 3 compatible way of getting the error object.
File /usr/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
525 if check and retcode:
--> 526 raise CalledProcessError(retcode, process.args,
527 output=stdout, stderr=stderr)
528 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
File , line 41
39 batch = {k: v.to(device) for k, v in batch.items()}
40 with torch.cuda.amp.autocast():
---> 41 outputs = model(**batch)
42 loss = outputs.loss
44 loss = loss / gradient_accumulation_steps
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/peft/peft_model.py:1295, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, task_ids, **kwargs)
1293 prompts = prompts.to(inputs_embeds.dtype)
1294 inputs_embeds = torch.cat((prompts, inputs_embeds), dim=1)
-> 1295 return self.base_model(inputs_embeds=inputs_embeds, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1360, in MixtralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1357 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1359 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1360 outputs = self.model(
1361 input_ids=input_ids,
1362 attention_mask=attention_mask,
1363 position_ids=position_ids,
1364 past_key_values=past_key_values,
1365 inputs_embeds=inputs_embeds,
1366 use_cache=use_cache,
1367 output_attentions=output_attentions,
1368 output_hidden_states=output_hidden_states,
1369 output_router_logits=output_router_logits,
1370 return_dict=return_dict,
1371 )
1373 hidden_states = outputs[0]
1374 logits = self.lm_head(hidden_states)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1217, in MixtralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1214 all_hidden_states += (hidden_states,)
1216 if self.gradient_checkpointing and self.training:
-> 1217 layer_outputs = self._gradient_checkpointing_func(
1218 decoder_layer.__call__,
1219 hidden_states,
1220 attention_mask,
1221 position_ids,
1222 past_key_values,
1223 output_attentions,
1224 output_router_logits,
1225 use_cache,
1226 )
1227 else:
1228 layer_outputs = decoder_layer(
1229 hidden_states,
1230 attention_mask=attention_mask,
(...)
1235 use_cache=use_cache,
1236 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_compile.py:24, in _disable_dynamo.<locals>.inner(*args, **kwargs)
20 @functools.wraps(fn)
21 def inner(*args, **kwargs):
22 import torch._dynamo
---> 24 return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:489, in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs)
487 dynamo_config_ctx.enter()
488 try:
--> 489 return fn(*args, **kwargs)
490 finally:
491 set_eval_frame(prior)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/_dynamo/external_utils.py:17, in wrap_inline.<locals>.inner(*args, **kwargs)
15 @functools.wraps(fn)
16 def inner(*args, **kwargs):
---> 17 return fn(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/checkpoint.py:482, in checkpoint(function, use_reentrant, context_fn, determinism_check, debug, *args, **kwargs)
477 if context_fn is not noop_context_fn or debug is not False:
478 raise ValueError(
479 "Passing `context_fn` or `debug` is only supported when "
480 "use_reentrant=False."
481 )
--> 482 return CheckpointFunction.apply(function, preserve, *args)
483 else:
484 gen = _checkpoint_without_reentrant_generator(
485 function, preserve, context_fn, determinism_check, debug, *args, **kwargs
486 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/autograd/function.py:553, in Function.apply(cls, *args, **kwargs)
550 if not torch._C._are_functorch_transforms_active():
551 # See NOTE: [functorch vjp and autograd interaction]
552 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 553 return super().apply(*args, **kwargs) # type: ignore[misc]
555 if not is_setup_ctx_defined:
556 raise RuntimeError(
557 "In order to use an autograd.Function with functorch transforms "
558 "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
559 "staticmethod. For more details, please see "
560 "https://pytorch.org/docs/master/notes/extending.func.html"
561 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/checkpoint.py:261, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
258 ctx.save_for_backward(*tensor_inputs)
260 with torch.no_grad():
--> 261 outputs = run_function(*args)
262 return outputs
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:934, in MixtralDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, output_router_logits, use_cache, **kwargs)
931 hidden_states = self.input_layernorm(hidden_states)
933 # Self Attention
--> 934 hidden_states, self_attn_weights, present_key_value = self.self_attn(
935 hidden_states=hidden_states,
936 attention_mask=attention_mask,
937 position_ids=position_ids,
938 past_key_value=past_key_value,
939 output_attentions=output_attentions,
940 use_cache=use_cache,
941 )
942 hidden_states = residual + hidden_states
944 # Fully Connected
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:730, in MixtralSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
719 return super().forward(
720 hidden_states=hidden_states,
721 attention_mask=attention_mask,
(...)
725 use_cache=use_cache,
726 )
728 bsz, q_len, _ = hidden_states.size()
--> 730 query_states = self.q_proj(hidden_states)
731 key_states = self.k_proj(hidden_states)
732 value_states = self.v_proj(hidden_states)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference.py:70, in QuantizedLinear.forward(self, input)
68 def forward(self, input: torch.Tensor) -> torch.Tensor:
69 if self.gemv_op is None:
---> 70 self.prepare_matmul_op(input)
72 if self.use_gemv_rule(input):
73 return self.gemv_op.apply(input, self.codes, self.codebooks, self.scales, self.bias)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference.py:86, in QuantizedLinear.prepare_matmul_op(self, input)
78 if (
79 not input.is_cuda
80 and self.codebook_size == 256
81 and self.codes.shape[0] == self.out_features // self.out_group_size
82 ):
83 self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous() # TODO: fix this thing
85 self.gemv_op = _get_autograd_matmul_op(
---> 86 get_forward_pass_kernel(self.codebooks, False),
87 get_backward_pass_kernel(self.codebooks, False),
88 )
90 self.gemm_op = _get_autograd_matmul_op(
91 get_forward_pass_kernel(self.codebooks, True),
92 get_backward_pass_kernel(self.codebooks, True),
93 )
95 self.use_gemv_rule = lambda input: math.prod(input.shape[:-1]) <= 6
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py:35, in get_forward_pass_kernel(codebooks, optimize_for_training)
25 num_codebooks, codebook_size, out_group_size, in_group_size = codebooks.shape
27 if (optimize_for_training, codebooks.device.type, num_codebooks, codebook_size, out_group_size, in_group_size) == (
28 False,
29 "cuda",
(...)
33 8,
34 ):
---> 35 from .cuda_kernel import CUDA_FOLDER
37 return torch.ops.aqlm.code1x16_matmat
38 elif (
39 optimize_for_training,
40 codebooks.device.type,
(...)
51 8,
52 ):
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.py:8
5 from torch.utils.cpp_extension import load
7 CUDA_FOLDER = os.path.dirname(os.path.abspath(__file__))
----> 8 CUDA_KERNEL = load(
9 name="codebook_cuda",
10 sources=[os.path.join(CUDA_FOLDER, "cuda_kernel.cpp"), os.path.join(CUDA_FOLDER, "cuda_kernel.cu")],
11 )
13 torch.library.define(
14 "aqlm::code1x16_matmat", "(Tensor input, Tensor codes, Tensor codebooks, Tensor scales, Tensor bias) -> Tensor"
15 )
17 torch.library.impl("aqlm::code1x16_matmat", "default", CUDA_KERNEL.code1x16_matmat)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1306, in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1214 def load(name,
1215 sources: Union[str, List[str]],
1216 extra_cflags=None,
(...)
1224 is_standalone=False,
1225 keep_intermediates=True):
1226 """
1227 Load a PyTorch C++ extension just-in-time (JIT).
1228
(...)
1304 ... verbose=True)
1305 """
-> 1306 return _jit_compile(
1307 name,
1308 [sources] if isinstance(sources, str) else sources,
1309 extra_cflags,
1310 extra_cuda_cflags,
1311 extra_ldflags,
1312 extra_include_paths,
1313 build_directory or _get_build_directory(name, verbose),
1314 verbose,
1315 with_cuda,
1316 is_python_module,
1317 is_standalone,
1318 keep_intermediates=keep_intermediates)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1710, in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
1706 hipified_sources.add(hipify_result[s_abs].hipified_path if s_abs in hipify_result else s_abs)
1708 sources = list(hipified_sources)
-> 1710 _write_ninja_file_and_build_library(
1711 name=name,
1712 sources=sources,
1713 extra_cflags=extra_cflags or [],
1714 extra_cuda_cflags=extra_cuda_cflags or [],
1715 extra_ldflags=extra_ldflags or [],
1716 extra_include_paths=extra_include_paths or [],
1717 build_directory=build_directory,
1718 verbose=verbose,
1719 with_cuda=with_cuda,
1720 is_standalone=is_standalone)
1721 finally:
1722 baton.release()
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1823, in _write_ninja_file_and_build_library(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_standalone)
1821 if verbose:
1822 print(f'Building extension module {name}...', file=sys.stderr)
-> 1823 _run_ninja_build(
1824 build_directory,
1825 verbose,
1826 error_prefix=f"Error building extension '{name}'")
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2112, in _run_ninja_build(build_directory, verbose, error_prefix)
2110 if hasattr(error, 'output') and error.output: # type: ignore[union-attr]
2111 message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}" # type: ignore[union-attr]
-> 2112 raise RuntimeError(message) from e
RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
FAILED: cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(270): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(270): error: no suitable user-defined conversion from "nv_bfloat162" to "__half2" exists
/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu(256): warning #177-D: variable "res" was declared but never referenced
2 errors detected in the compilation of "/local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cu".
[2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-3ecdb189-0e4f-43fa-829c-0de80b503f55/lib/python3.10/site-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o
ninja: build stopped: subcommand failed.
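One thing that may be worth checking in the log above is the target architecture: the nvcc command compiles for compute_70, while native bfloat16 arithmetic arrived with compute capability 8.0 (Ampere). Whether that is the actual cause of the `__half2` conversion error is an assumption, not something the log confirms. A small helper (hypothetical, for illustration only) that pulls the targets out of a build command:

```python
import re
from typing import List

# Compute capability 8.0 (Ampere), written the way nvcc spells it: compute_80.
BF16_MIN_ARCH = 80


def gencode_archs(nvcc_command: str) -> List[int]:
    """Return the sorted, de-duplicated compute_XX targets in an nvcc command."""
    return sorted({int(a) for a in re.findall(r"arch=compute_(\d+)", nvcc_command)})


def archs_below_bf16(nvcc_command: str) -> List[int]:
    """List the targets that predate native bfloat16 support."""
    return [a for a in gencode_archs(nvcc_command) if a < BF16_MIN_ARCH]


# Shortened version of the failing command from the log.
command = (
    "nvcc -gencode=arch=compute_70,code=compute_70 "
    "-gencode=arch=compute_70,code=sm_70 -std=c++17 -c cuda_kernel.cu"
)
old_archs = archs_below_bf16(command)
print(old_archs)
```

If the list is non-empty, trying a newer GPU or a build without the bfloat16 code path would be a reasonable next experiment.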
I have an Ubuntu 23.10 system.
I installed CUDA Toolkit 12.1 using https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local
(since the build needs the headers and so on, I can't just install CUDA through conda).
The rest of my environment:
accelerate @ git+https://github.com/huggingface/accelerate.git@97d2168e5953fe7373a06c69c02c5a00a84d5344
anyio==4.2.0
aqlm @ file:///home/alex4321/Documents/AQLM/inference_lib
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==23.2.0
Babel==2.14.0
beautifulsoup4==4.12.3
bleach==6.1.0
Brotli @ file:///work/perseverance-python-buildout/croot/brotli-split_1698805593785/work
certifi @ file:///croot/certifi_1696279375225/work/certifi
cffi @ file:///croot/cffi_1700254295673/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
comm==0.2.1
cryptography @ file:///work/perseverance-python-buildout/croot/cryptography_1698845900024/work
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
executing==2.0.1
fastjsonschema==2.19.1
filelock @ file:///work/perseverance-python-buildout/croot/filelock_1698846025262/work
fqdn==1.5.1
fsspec==2024.2.0
h11==0.14.0
httpcore==1.0.3
httpx==0.26.0
huggingface-hub==0.20.3
idna @ file:///work/perseverance-python-buildout/croot/idna_1698845632828/work
ipykernel==6.29.2
ipython==8.21.0
isoduration==20.11.0
jedi==0.19.1
Jinja2 @ file:///work/perseverance-python-buildout/croot/jinja2_1698847462642/work
json5==0.9.14
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter-events==0.9.0
jupyter-lsp==2.2.2
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyter_server==2.12.5
jupyter_server_terminals==0.5.2
jupyterlab==4.1.1
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.3
llvmlite==0.42.0
MarkupSafe @ file:///work/perseverance-python-buildout/croot/markupsafe_1698846636000/work
matplotlib-inline==0.1.6
mistune==3.0.2
mkl-service==2.4.0
mpmath @ file:///work/perseverance-python-buildout/croot/mpmath_1698864994882/work
nbclient==0.9.0
nbconvert==7.16.0
nbformat==5.9.2
nest-asyncio==1.6.0
networkx @ file:///work/perseverance-python-buildout/croot/networkx_1698865062738/work
ninja==1.11.1.1
notebook_shim==0.2.4
numba==0.59.0
numpy @ file:///work/perseverance-python-buildout/croot/numpy_and_numpy_base_1698845160062/work/dist/numpy-1.26.0-cp312-cp312-linux_x86_64.whl#sha256=fdc35057024038070345ff9f7f47ed48ecdb21dd72461617bdadf4f5d1634fcb
overrides==7.7.0
packaging==23.2
pandocfilters==1.5.1
parso==0.8.3
pexpect==4.9.0
Pillow @ file:///work/perseverance-python-buildout/croot/pillow_1698847657722/work
platformdirs==4.2.0
prometheus_client==0.20.0
prompt-toolkit==3.0.43
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
Pygments==2.17.2
pyOpenSSL @ file:///work/perseverance-python-buildout/croot/pyopenssl_1698863523157/work
PySocks @ file:///work/perseverance-python-buildout/croot/pysocks_1698845478203/work
python-dateutil==2.8.2
python-json-logger==2.0.7
PyYAML @ file:///work/perseverance-python-buildout/croot/pyyaml_1698849903511/work
pyzmq==25.1.2
referencing==0.33.0
regex==2023.12.25
requests @ file:///work/perseverance-python-buildout/croot/requests_1698846321763/work
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.18.0
safetensors==0.4.2
scipy==1.12.0
Send2Trash==1.8.2
setuptools==68.0.0
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
stack-data==0.6.3
sympy @ file:///croot/sympy_1701397643339/work
terminado==0.18.0
tinycss2==1.2.1
tokenizers==0.15.2
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
tornado==6.4
tqdm==4.66.2
traitlets==5.14.1
transformers==4.37.0
triton==2.2.0
types-python-dateutil==2.8.19.20240106
typing_extensions==4.9.0
uri-template==1.3.0
urllib3 @ file:///work/perseverance-python-buildout/croot/urllib3_1698845837793/work
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
wheel==0.41.2
AQLM is installed from the latest GitHub state.
Now if I try to run some code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
llama_cpu_quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf",
    trust_remote_code=True, torch_dtype=torch.float32, device_map="cpu"
)
llama_gpu_quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf",
    trust_remote_code=True, torch_dtype=torch.float16, device_map="cuda:0"
).cuda()
llama_tokenizer = AutoTokenizer.from_pretrained("daryl149/llama-2-7b-hf")
output = llama_cpu_quantized_model.generate(llama_tokenizer("Test is", return_tensors="pt")["input_ids"], max_new_tokens=10)
print(llama_tokenizer.decode(output[0]))
it tells me
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 4096, 2)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Compiling AQLM numba kernel with parameters: kernel_key=(8, 11008, 4096, 2)
Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 11008, 2)
<s> Test is a 19999 film directed by
which I guess is more or less fine.
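As a side note, the tokenizers fork warning in the output above names its own fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs before the tokenizer is first used:

```python
import os

# setdefault keeps any value that was already exported in the shell,
# and otherwise disables tokenizer parallelism to silence the warning.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
```

This only affects the warning; it should not change the generated text.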
But:
output = llama_gpu_quantized_model.generate(llama_tokenizer("Test is", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
print(llama_tokenizer.decode(output[0]))
gives me
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[7], [line 1](vscode-notebook-cell:?execution_count=7&line=1)
----> [1](vscode-notebook-cell:?execution_count=7&line=1) output = llama_gpu_quantized_model.generate(llama_tokenizer("Test is", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
[2](vscode-notebook-cell:?execution_count=7&line=2) print(llama_tokenizer.decode(output[0]))
File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/utils/_contextlib.py:115](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/utils/_contextlib.py:115), in context_decorator.<locals>.decorate_context(*args, **kwargs)
[112](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/utils/_contextlib.py:112) @functools.wraps(func)
[113](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/utils/_contextlib.py:113) def decorate_context(*args, **kwargs):
[114](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/utils/_contextlib.py:114) with ctx_factory():
--> [115](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/utils/_contextlib.py:115) return func(*args, **kwargs)
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:1474, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1457 return self.assisted_decoding(
1458 input_ids,
1459 candidate_generator=candidate_generator,
(...)
1470 **model_kwargs,
1471 )
1472 if generation_mode == GenerationMode.GREEDY_SEARCH:
1473 # 11. run greedy search
-> 1474 return self.greedy_search(
1475 input_ids,
1476 logits_processor=prepared_logits_processor,
1477 stopping_criteria=prepared_stopping_criteria,
1478 pad_token_id=generation_config.pad_token_id,
1479 eos_token_id=generation_config.eos_token_id,
1480 output_scores=generation_config.output_scores,
1481 return_dict_in_generate=generation_config.return_dict_in_generate,
1482 synced_gpus=synced_gpus,
1483 streamer=streamer,
1484 **model_kwargs,
1485 )
1487 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
1488 if not model_kwargs["use_cache"]:
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/transformers/generation/utils.py:2335, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2332 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2334 # forward pass to get next token
-> 2335 outputs = self(
2336 **model_inputs,
2337 return_dict=True,
2338 output_attentions=output_attentions,
2339 output_hidden_states=output_hidden_states,
2340 )
2342 if synced_gpus and this_peer_finished:
2343 continue # don't waste resources running the code we don't need
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1195, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1192 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1194 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1195 outputs = self.model(
1196 input_ids=input_ids,
1197 attention_mask=attention_mask,
1198 position_ids=position_ids,
1199 past_key_values=past_key_values,
1200 inputs_embeds=inputs_embeds,
1201 use_cache=use_cache,
1202 output_attentions=output_attentions,
1203 output_hidden_states=output_hidden_states,
1204 return_dict=return_dict,
1205 )
1207 hidden_states = outputs[0]
1208 if self.config.pretraining_tp > 1:
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:1082, in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
1072 layer_outputs = self._gradient_checkpointing_func(
1073 decoder_layer.__call__,
1074 hidden_states,
(...)
1079 use_cache,
1080 )
1081 else:
-> 1082 layer_outputs = decoder_layer(
1083 hidden_states,
1084 attention_mask=attention_mask,
1085 position_ids=position_ids,
1086 past_key_value=past_key_values,
1087 output_attentions=output_attentions,
1088 use_cache=use_cache,
1089 )
1091 hidden_states = layer_outputs[0]
1093 if use_cache:
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:810, in LlamaDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, **kwargs)
807 hidden_states = self.input_layernorm(hidden_states)
809 # Self Attention
--> 810 hidden_states, self_attn_weights, present_key_value = self.self_attn(
811 hidden_states=hidden_states,
812 attention_mask=attention_mask,
813 position_ids=position_ids,
814 past_key_value=past_key_value,
815 output_attentions=output_attentions,
816 use_cache=use_cache,
817 **kwargs,
818 )
819 hidden_states = residual + hidden_states
821 # Fully Connected
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:705, in LlamaSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
694 return super().forward(
695 hidden_states=hidden_states,
696 attention_mask=attention_mask,
(...)
700 use_cache=use_cache,
701 )
703 bsz, q_len, _ = hidden_states.size()
--> 705 query_states = self.q_proj(hidden_states)
706 key_states = self.k_proj(hidden_states)
[707](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf/2df1b7a5cbb2a8b584eade2de5c2b4975072a644/modeling_llama_aqlm.py:707) value_states = self.v_proj(hidden_states)
File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511), in Module._wrapped_call_impl(self, *args, **kwargs)
[1509](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1509) return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
[1510](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1510) else:
-> [1511](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1511) return self._call_impl(*args, **kwargs)
File [~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520), in Module._call_impl(self, *args, **kwargs)
[1515](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1515) # If we don't have any hooks, we want to skip the rest of the logic in
[1516](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1516) # this function, and just call forward.
[1517](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1517) if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
[1518](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1518) or _global_backward_pre_hooks or _global_backward_hooks
[1519](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1519) or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1520) return forward_call(*args, **kwargs)
[1522](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1522) try:
[1523](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/torch/nn/modules/module.py:1523) result = None
File [~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:65](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:65), in QuantizedLinear.forward(self, input)
[59](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:59) if (
[60](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:60) not input.is_cuda
[61](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:61) and self.codebook_size == 256
[62](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:62) and self.codes.shape[0] == self.out_features [/](https://file+.vscode-resource.vscode-cdn.net/)[/](https://file+.vscode-resource.vscode-cdn.net/) self.out_group_size
[63](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:63) ):
[64](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:64) self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous() # TODO: fix this thing
---> [65](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference.py:65) return forward_pass_quantized_linear(input, self.codes, self.codebooks, self.scales, self.bias)
File [~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:31](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:31), in forward_pass_quantized_linear(input, codes, codebooks, scales, bias)
[26](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:26) from .cuda_kernel import CUDA_KERNEL
[28](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:28) assert (
[29](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:29) input.dtype == torch.float16
[30](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:30) ), f"please load the model with `torch_dtype=torch.float16`, as {input.dtype} is not supported on GPU yet"
---> [31](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:31) return CUDA_KERNEL.code2x8_matmat(input, codes, codebooks, scales) + (bias if bias is not None else 0)
[32](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:32) case (True, _, _, _, _):
[33](https://file+.vscode-resource.vscode-cdn.net/home/alex4321/Documents/AQLM/~/anaconda3/envs/llms/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py:33) from .triton_kernel import triton_matmul
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
P.S. This is somewhat off-topic, but there is something I don't get: CUDA_KERNEL.code2x8_matmat and numba_gemm_lut both consume the codes argument as is, as well as codebooks:
auto out_features = codes.size(0) * codebooks.size(2);
out_features = codes.shape[1] * out_group_size
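One possible reading of the two expressions, sketched with plain shape tuples. The layouts below are assumptions that this thread does not confirm: `codes` stored as (num_out_groups, num_in_groups, num_codebooks) and `codebooks` as (num_codebooks, codebook_size, out_group_size, in_group_size). Under those assumptions, the permute at line 64 of inference.py in the traceback above, which only runs when the input is not on CUDA, moves the output-group axis of `codes` from position 0 to position 1, so both expressions would give the same `out_features`.

```python
def permute_shape(shape, order):
    """Return the shape a tensor would have after torch.permute(tensor, order)."""
    return tuple(shape[axis] for axis in order)


# Hypothetical layer: 4096 -> 4096, 2 codebooks of 256 entries, groups of 1 x 8.
out_features, in_features = 4096, 4096
num_codebooks, codebook_size = 2, 256
out_group_size, in_group_size = 1, 8

# Assumed storage layouts (not confirmed by the thread).
codes_shape = (out_features // out_group_size, in_features // in_group_size, num_codebooks)
codebooks_shape = (num_codebooks, codebook_size, out_group_size, in_group_size)

# CUDA-style expression: codes.size(0) * codebooks.size(2), with codes left as stored.
cuda_out_features = codes_shape[0] * codebooks_shape[2]

# CPU path: codes are permuted with (1, 0, 2) before the kernel sees them,
# so the output-group axis is now at index 1.
cpu_codes_shape = permute_shape(codes_shape, (1, 0, 2))
numba_out_features = cpu_codes_shape[1] * out_group_size

print(cuda_out_features, numba_out_features)
```

If the assumed layouts are right, the difference between the two lines is only which axis holds the output groups at the time each kernel is called.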
Hi!
Thanks for such a useful tool!
I have a question about model_seqlen:
As far as I can see, the default value in main.py is 4096. What if I use a smaller value, e.g. 1024, when quantizing the Mixtral MoE model? Will it affect the quality of the quantized model, or the quality on contexts longer than 1024 tokens? Will it significantly speed up the quantization process?
Thanks in advance!
parser.add_argument(
"--model_seqlen",
type=int,
default=4096,
help="Model seqlen and calibration data context length.",
)
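A minimal sketch of overriding this default from the command line, using only the argument definition quoted above. The parser is a stand-in for the one in main.py, which defines many more arguments; `calibration_tokens` and the sample count of 1024 are hypothetical and only illustrate how the calibration token budget scales with the context length.

```python
import argparse

# Stand-in parser holding only the argument quoted above.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--model_seqlen",
    type=int,
    default=4096,
    help="Model seqlen and calibration data context length.",
)

default_args = parser.parse_args([])
short_args = parser.parse_args(["--model_seqlen", "1024"])


def calibration_tokens(num_sequences, model_seqlen):
    """Total number of tokens seen during calibration (hypothetical helper)."""
    return num_sequences * model_seqlen


num_sequences = 1024  # hypothetical sample count, not taken from the repository
for args in (default_args, short_args):
    print(args.model_seqlen, calibration_tokens(num_sequences, args.model_seqlen))
```

With the sample count held fixed, a context four times shorter means four times fewer calibration tokens. Whether that changes quality on longer contexts is the open question of this issue.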
Hi authors!
With the recent AQLM integration in transformers, would it make sense to quantize the Google Gemma models in 2-bit?
The list of the models can be found here: https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b
Are the models you report in your readme supposed to be actual 2-bit models or just 2.x-bit models? For example, the two 7B models below are both larger than a 2-bit decoder model, which would take 2.1G on disk. Also, why is there such a large difference in size between the 1x16 and 2x8 models? The size gap is much larger than the difference in codebook size should account for. Are you using different group sizes in each model? Thanks
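A rough way to reason about this question is to count bits per weight for one quantized matrix. The sketch below assumes an accounting that is not stated in this thread: each group of `in_group_size` weights stores `num_codebooks` codes of `nbits` bits, every matrix carries its own 16-bit codebooks, and there is one 16-bit scale per output unit. The matrix sizes are those of Llama-2-7B (hidden size 4096, MLP size 11008). Embeddings and the output head, which may be stored unquantized, are ignored, and `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(out_features, in_features, num_codebooks, nbits,
                    in_group_size=8, out_group_size=1):
    """Average storage cost per weight of one quantized matrix (assumed accounting)."""
    num_weights = out_features * in_features
    group = in_group_size * out_group_size
    code_bits = (num_weights // group) * num_codebooks * nbits
    codebook_bits = num_codebooks * (2 ** nbits) * group * 16  # 16-bit codebook entries
    scale_bits = out_features * 16  # one 16-bit scale per output unit
    return (code_bits + codebook_bits + scale_bits) / num_weights


for name, (num_codebooks, nbits) in {"1x16": (1, 16), "2x8": (2, 8)}.items():
    attention = bits_per_weight(4096, 4096, num_codebooks, nbits)
    mlp = bits_per_weight(11008, 4096, num_codebooks, nbits)
    print(f"{name}: attention {attention:.3f} bits/weight, MLP {mlp:.3f} bits/weight")
```

Under these assumptions both schemes spend 2 bits per weight on codes, and the difference comes from the codebooks: a single 16-bit codebook has 65536 entries per matrix, while two 8-bit codebooks have 512 in total. This does not settle the question about the published checkpoints; it only shows where a gap of this kind could come from.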
Hi there,
Thanks for the great work !
We recently added HfQuantizer support in HF transformers thanks to @poedator's work. I was wondering if it would make sense to create a new quantizer, AqlmQuantizer, for this method to natively support inference in HF transformers: https://huggingface.co/docs/transformers/main/en/hf_quantizer
Let me know if this makes sense!
Dear maintainers,
Thank you for your awesome paper and open-source project. I recently ran into a detail of the dataset pre-processing that I cannot understand properly.
When processing the datasets, you ignore all the labels except the last one (as in Line 51 in bf84f39).
Could you please explain this detail?
https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1/tree/main
This is way more powerful than the previous Mixtral, on essentially all benchmarks
Tried quantizing this myself, it seemed like it was working but I started running out of cloud credits :p
Environment: Google Colab
CUDA Info:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 57C P8 12W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
Code run from: Basic AQLM generation DEMO
error line:
%%capture
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py](https://localhost:8080/#) in _run_ninja_build(build_directory, verbose, error_prefix)
2095 stdout_fileno = 1
-> 2096 subprocess.run(
2097 command,
26 frames
CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py](https://localhost:8080/#) in _run_ninja_build(build_directory, verbose, error_prefix)
2110 if hasattr(error, 'output') and error.output: # type: ignore[union-attr]
2111 message += f": {error.output.decode(*SUBPROCESS_DECODE_ARGS)}" # type: ignore[union-attr]
-> 2112 raise RuntimeError(message) from e
2113
2114
RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -std=c++17 -c /usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
FAILED: cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --compiler-options '-fPIC' -std=c++17 -c /usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(59): error: no suitable user-defined conversion from "__nv_bfloat162" to "__half2" exists
res2 = __hfma2(a[j], b[j], res2);
^
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(59): error: no suitable user-defined conversion from "__nv_bfloat162" to "__half2" exists
res2 = __hfma2(a[j], b[j], res2);
^
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(59): error: no suitable user-defined conversion from "__nv_bfloat162" to "__half2" exists
res2 = __hfma2(a[j], b[j], res2);
^
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(147): error: no suitable user-defined conversion from "__nv_bfloat162" to "__half2" exists
res2 = __hfma2(__hadd2(a0[j], a1[j]), b[j], res2);
^
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(147): error: no suitable user-defined conversion from "__nv_bfloat162" to "__half2" exists
res2 = __hfma2(__hadd2(a0[j], a1[j]), b[j], res2);
^
/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu(147): error: no suitable user-defined conversion from "__nv_bfloat162" to "__half2" exists
res2 = __hfma2(__hadd2(a0[j], a1[j]), b[j], res2);
^
6 errors detected in the compilation of "/usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cu".
[2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /usr/local/lib/python3.10/dist-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o
ninja: build stopped: subcommand failed.
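The nvcc command in the log above targets `compute_75` (the Tesla T4), and the failing lines call `__hfma2` on bfloat16 pairs. One plausible reading, offered as an assumption and not as a confirmed diagnosis, is that the bfloat16 arithmetic intrinsics require compute capability 8.0 or newer, so the compiler falls back to the half-precision overload and fails on the conversion. A small sketch of the corresponding check follows. `needs_fp16_only_build` is a hypothetical helper, and the capability tuples are written out by hand so the example runs without a GPU.

```python
def needs_fp16_only_build(compute_capability):
    """Return True when the GPU predates native bfloat16 arithmetic.

    The 8.0 threshold is an assumption about where the bfloat16 intrinsics
    become available; it is not taken from the AQLM sources.
    """
    return tuple(compute_capability) < (8, 0)


# Compute capabilities of the GPUs mentioned in these issues.
gpus = {"Tesla T4": (7, 5), "RTX 3090": (8, 6), "L4": (8, 9)}
for name, capability in gpus.items():
    print(name, capability, needs_fp16_only_build(capability))
```

In a live session the tuple could be read with `torch.cuda.get_device_capability()` instead of being hard-coded.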
The AQLM quantization method is recent, is there any tool that supports serving the models as an OpenAI API-compatible server?
Hey, thanks for your work. I saw https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-Instruct-AQLM-2Bit-1x16/discussions/2 about how 8-bit KV cache quantization can be enabled on vLLM. I am not too sure of how exactly KV cache is implemented for AQLM using Transformers, but would KV cache quantization be theoretically possible? It might address some of the concerns regarding high vram usage for context from https://www.reddit.com/r/LocalLLaMA/comments/1clinlb/bringing_2bit_llms_to_production_new_aqlm_models/.
To be specific, would 4-bit cache quantization be possible? Turboderp somehow managed to achieve negligible ppl loss: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md. For reference, turboderp/exllamav2@324404e.
Thanks!
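For a sense of scale, the memory taken by the KV cache can be estimated from the model shape. The sketch below is a back-of-the-envelope estimate that assumes the published Llama-3-70B configuration (80 layers, 8 key/value heads, head dimension 128) and ignores the scales and zero points that a real quantized cache would also store. `kv_cache_bytes` is a hypothetical helper, not part of AQLM, Transformers, or vLLM.

```python
def kv_cache_bytes(seq_len, bits_per_value, num_layers=80, num_kv_heads=8,
                   head_dim=128, batch_size=1):
    """Approximate KV cache size in bytes, ignoring quantization metadata."""
    # Factor 2 accounts for storing both keys and values in every layer.
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return num_values * bits_per_value // 8


for bits in (16, 8, 4):
    size_gib = kv_cache_bytes(8192, bits) / 2**30
    print(f"{bits}-bit cache at 8192 tokens: {size_gib:.3f} GiB")
```

The estimate scales linearly with both context length and bit width, so under these assumptions halving the bit width halves the cache footprint.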
Are Ollama or other frameworks or tools capable of serving AQLM quantized models?
Hello,
Thanks for your outstanding work. I want to do a comprehensive comparison of recent quantization methods.
Because the latest lm-eval can yield higher accuracy than the numbers reported in the paper, I have to re-evaluate each quantized model.
I found that there is no Llama-2-13B model with AQLM (2x8 scheme). Can you share it on Hugging Face?
Thank you!
How can I fine-tune the compressed model? Thank you very much for your work. I previously tried to deploy Llama 2 on a 3090 but failed due to insufficient video memory. Recently I found work on compressing large models. Are there more detailed steps for using the compressed model parameters in fine-tuning tasks? I am very interested in your current work.
Greetings to the developers of the new method. I ran your Google Colab notebook and got very dubious inference output. Can you explain a little about the essence of this compression breakthrough? I may have misunderstood something about the usage, but judging by the output, the model is completely lost.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf", trust_remote_code=True, torch_dtype=torch.float16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained("BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf")
inputs = tokenizer(["Write a poem about python"], return_tensors="pt")["input_ids"].cuda()
# Stream tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer)
_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=120)
<s> Write a poem about python.
Write a poem about python.
Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem about python. Write a poem
This is my Dockerfile. I've tried a lot of permutations based on your notebooks, e.g., using pip install git+https://github.com/huggingface/accelerate.git@main instead of the plain pip install below. The error is pasted after the Dockerfile.
FROM nvidia/cuda:11.8.0-base-ubuntu20.04
# Run system updates and install any desired packages
RUN apt-get update && apt-get upgrade -y && apt-get install -y curl
# need to make sure sudo is installed
RUN apt-get update && apt-get install -y sudo
RUN sudo apt-get install python3.8 -y
RUN curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
RUN mkdir /root/.conda
RUN bash Miniconda3-latest-Linux-x86_64.sh -b
ENV PATH="/root/miniconda3/bin:${PATH}"
ARG PATH="/root/miniconda3/bin:${PATH}"
RUN apt-get update && apt-get install -y gcc g++
RUN apt-get update && apt-get install -y ninja-build
RUN apt-get update && apt-get install -y git
RUN pip install accelerate
RUN pip install aqlm[gpu]==1.0.0
#docker run --gpus '"all"' --rm -it --name aqlm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --mount type=bind,source=/home/josh/PycharmProjects/vectors/DSPy/scratch,target=/scratch -v ${HOME}/.cache/huggingface:/root/.cache/huggingface --network=host aqlm:0
Here's the error:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
subprocess.run(
File "/root/miniconda3/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 7, in chat_loop
File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py", line 1520, in generate
return self.sample(
^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py", line 2617, in sample
outputs = self(
^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 1194, in forward
outputs = self.model(
^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 1081, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 809, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.cache/huggingface/modules/transformers_modules/BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf/057668a874cfe55d73bed6f244367eb072da75a7/modeling_llama_aqlm.py", line 704, in forward
query_states = self.q_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/aqlm/inference.py", line 65, in forward
return forward_pass_quantized_linear(input, self.codes, self.codebooks, self.scales, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/kernel_selector.py", line 19, in forward_pass_quantized_linear
from .cuda_kernel import CUDA_KERNEL
File "/root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.py", line 8, in <module>
CUDA_KERNEL = load(
^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1306, in load
return _jit_compile(
^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
_write_ninja_file_and_build_library(
File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/root/miniconda3/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2112, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'codebook_cuda': [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
FAILED: cuda_kernel.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -std=c++17 -c /root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.cu -o cuda_kernel.cuda.o
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
[2/3] c++ -MMD -MF cuda_kernel.o.d -DTORCH_EXTENSION_NAME=codebook_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/torch/csrc/api/include -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/TH -isystem /root/miniconda3/lib/python3.12/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /root/miniconda3/include/python3.12 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /root/miniconda3/lib/python3.12/site-packages/aqlm/inference_kernels/cuda_kernel.cpp -o cuda_kernel.o
ninja: build stopped: subcommand failed.
And here's the little script I threw together to test the model:
from transformers import AutoModelForCausalLM
model="BlackSamorez/Llama-2-13b-AQLM-2Bit-1x16-hf"
quantized_model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True, torch_dtype="auto", device_map="cuda").cuda()
# unload the model from cuda
# quantized_model = quantized_model.cpu()
# Let's set up a chat loop
# need this token for the tokenizer
token = "<MY_TOKEN>"  # placeholder: substitute your own Hugging Face access token
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-13b-hf', trust_remote_code=True, token=token)
def chat_loop():
    while True:
        prompt = input(">>> ")
        if prompt.strip() == "exit":
            break
        input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()
        output = quantized_model.generate(input_ids, do_sample=True, max_length=50, pad_token_id=tokenizer.eos_token_id)
        print(tokenizer.decode(output[0], skip_special_tokens=True))
chat_loop()
# Quiz me about the principles of scrum.
Just not familiar enough yet with your code to troubleshoot dependency / build issues. Happy to learn more!
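The decisive line in the log above is `/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found`: the kernel is compiled on first use, and no CUDA compiler is present in the container. A likely cause, stated here as an assumption, is that the `-base` flavour of the nvidia/cuda image ships only the CUDA runtime, whereas the `-devel` flavour includes nvcc. A minimal pre-flight check using only the standard library follows. `find_nvcc` is a hypothetical helper, and the `/usr/local/cuda` default mirrors the path in the log.

```python
import os
import shutil


def find_nvcc(cuda_home=None):
    """Return the path to nvcc, or None when no CUDA compiler can be found."""
    cuda_home = cuda_home or os.environ.get("CUDA_HOME", "/usr/local/cuda")
    candidate = os.path.join(cuda_home, "bin", "nvcc")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Fall back to whatever nvcc is reachable through PATH, if any.
    return shutil.which("nvcc")


nvcc_path = find_nvcc()
if nvcc_path is None:
    print("nvcc not found: JIT compilation of the CUDA kernel would fail")
else:
    print(f"nvcc found at {nvcc_path}")
```

Running this inside the container before loading the model would surface the missing compiler without waiting for the first forward pass.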
How long is it expected to take to quantize a 7B Mistral model?
Considering it is probably the best open-source LLM at the moment, an optimized AQLM model would be great.
Hi, I've been encountering this problem when running this notebook on a local machine with an L4. I followed the instructions from the notebook on Python 3.10.
Name: torch
Version: 2.2.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /home/ubuntu/anaconda3/envs/aqlm/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, aqlm, torchaudio, torchvision
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 On | 00000000:00:04.0 Off | 0 |
| N/A 69C P0 33W / 75W| 13858MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 16482 C ...untu/anaconda3/envs/aqlm/bin/python 13856MiB |
+---------------------------------------------------------------------------------------+
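For reports like this one, it can help to capture the installed package versions in one step alongside the `pip show` and `nvidia-smi` output above. The following is only a minimal sketch using the standard library; the helper name `report_versions` and the package list are illustrative assumptions, not part of aqlm.

```python
from importlib import metadata


def report_versions(packages):
    """Return a dict mapping each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The package list is an assumption; adjust it to whatever the issue involves.
    for name, version in report_versions(["torch", "transformers", "aqlm", "accelerate"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting this output into an issue makes it easier to compare environments across reports.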
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
RuntimeError Traceback (most recent call last)
Cell In[3], line 1
----> 1 output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/generation/utils.py:1513, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1496 return self.assisted_decoding(
1497 input_ids,
1498 candidate_generator=candidate_generator,
(...)
1509 **model_kwargs,
1510 )
1511 if generation_mode == GenerationMode.GREEDY_SEARCH:
1512 # 11. run greedy search
-> 1513 return self.greedy_search(
1514 input_ids,
1515 logits_processor=prepared_logits_processor,
1516 stopping_criteria=prepared_stopping_criteria,
1517 pad_token_id=generation_config.pad_token_id,
1518 eos_token_id=generation_config.eos_token_id,
1519 output_scores=generation_config.output_scores,
1520 return_dict_in_generate=generation_config.return_dict_in_generate,
1521 synced_gpus=synced_gpus,
1522 streamer=streamer,
1523 **model_kwargs,
1524 )
1526 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
1527 if not model_kwargs["use_cache"]:
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/generation/utils.py:2350, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2347 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2349 # forward pass to get next token
-> 2350 outputs = self(
2351 **model_inputs,
2352 return_dict=True,
2353 output_attentions=output_attentions,
2354 output_hidden_states=output_hidden_states,
2355 )
2357 if synced_gpus and this_peer_finished:
2358 continue # don't waste resources running the code we don't need
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1360, in MixtralForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1357 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1359 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1360 outputs = self.model(
1361 input_ids=input_ids,
1362 attention_mask=attention_mask,
1363 position_ids=position_ids,
1364 past_key_values=past_key_values,
1365 inputs_embeds=inputs_embeds,
1366 use_cache=use_cache,
1367 output_attentions=output_attentions,
1368 output_hidden_states=output_hidden_states,
1369 output_router_logits=output_router_logits,
1370 return_dict=return_dict,
1371 )
1373 hidden_states = outputs[0]
1374 logits = self.lm_head(hidden_states)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:1228, in MixtralModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict)
1217 layer_outputs = self._gradient_checkpointing_func(
1218 decoder_layer.__call__,
1219 hidden_states,
(...)
1225 use_cache,
1226 )
1227 else:
-> 1228 layer_outputs = decoder_layer(
1229 hidden_states,
1230 attention_mask=attention_mask,
1231 position_ids=position_ids,
1232 past_key_value=past_key_values,
1233 output_attentions=output_attentions,
1234 output_router_logits=output_router_logits,
1235 use_cache=use_cache,
1236 )
1238 hidden_states = layer_outputs[0]
1240 if use_cache:
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:934, in MixtralDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, output_router_logits, use_cache, **kwargs)
931 hidden_states = self.input_layernorm(hidden_states)
933 # Self Attention
--> 934 hidden_states, self_attn_weights, present_key_value = self.self_attn(
935 hidden_states=hidden_states,
936 attention_mask=attention_mask,
937 position_ids=position_ids,
938 past_key_value=past_key_value,
939 output_attentions=output_attentions,
940 use_cache=use_cache,
941 )
942 hidden_states = residual + hidden_states
944 # Fully Connected
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/transformers/models/mixtral/modeling_mixtral.py:730, in MixtralSdpaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
719 return super().forward(
720 hidden_states=hidden_states,
721 attention_mask=attention_mask,
(...)
725 use_cache=use_cache,
726 )
728 bsz, q_len, _ = hidden_states.size()
--> 730 query_states = self.q_proj(hidden_states)
731 key_states = self.k_proj(hidden_states)
732 value_states = self.v_proj(hidden_states)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/aqlm/inference.py:65, in QuantizedLinear.forward(self, input)
59 if (
60 not input.is_cuda
61 and self.codebook_size == 256
62 and self.codes.shape[0] == self.out_features // self.out_group_size
63 ):
64 self.codes.data = torch.permute(self.codes.data, (1, 0, 2)).contiguous() # TODO: fix this thing
---> 65 return forward_pass_quantized_linear(input, self.codes, self.codebooks, self.scales, self.bias)
File ~/anaconda3/envs/aqlm/lib/python3.10/site-packages/aqlm/inference_kernels/kernel_selector.py:24, in forward_pass_quantized_linear(input, codes, codebooks, scales, bias)
19 from .cuda_kernel import CUDA_KERNEL
21 assert (
22 input.dtype == torch.float16
23 ), f"please load the model with `torch_dtype=torch.float16`, as {input.dtype} is not supported on GPU yet"
---> 24 return CUDA_KERNEL.code1x16_matmat(input, codes, codebooks, scales) + (bias if bias is not None else 0)
25 case (True, 2, 256, 1, 8):
26 from .cuda_kernel import CUDA_KERNEL
RuntimeError: Unknown layout
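The last frames of the traceback suggest that `forward_pass_quantized_linear` picks a kernel from a tuple of layer properties and asserts a float16 input before calling the CUDA 1x16 kernel. The sketch below mimics that dispatch in plain Python so the selection logic is easier to follow. It is not the aqlm implementation: the `(1, 65536, 1, 8)` key, the `code2x8_matmat` name, the fallback label, and the function `select_kernel` are assumptions made for illustration, and it does not reproduce or explain the `Unknown layout` error itself.

```python
# Illustrative table keyed by (num_codebooks, codebook_size, out_group_size, in_group_size).
# Only two CUDA cases are hinted at in the traceback; the keys and the second name are assumed.
CUDA_KERNELS = {
    (1, 65536, 1, 8): "code1x16_matmat",  # called on traceback line 24
    (2, 256, 1, 8): "code2x8_matmat",     # assumed name for the case on traceback line 25
}


def select_kernel(is_cuda, num_codebooks, codebook_size, out_group_size, in_group_size, input_dtype):
    """Return the name of the kernel this sketch would use for a quantized linear layer."""
    key = (num_codebooks, codebook_size, out_group_size, in_group_size)
    if is_cuda and key in CUDA_KERNELS:
        # Mirrors the assertion in the traceback: the CUDA path expects float16 inputs.
        if input_dtype != "float16":
            raise TypeError(
                f"please load the model with torch_dtype=torch.float16, "
                f"as {input_dtype} is not supported on GPU yet"
            )
        return CUDA_KERNELS[key]
    # Anything else falls through to a generic (hypothetical) path in this sketch.
    return "generic_fallback"


print(select_kernel(True, 1, 65536, 1, 8, "float16"))
```

Under these assumptions, a 1x16 model on a GPU with float16 inputs reaches the `code1x16_matmat` call, which is the frame where the error above was raised.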
Why is inference with my quantized model 2 times slower than with the original model?
Quantize command: python main.py $MODEL_PATH $DATASET_PATH --nsamples=1024 --num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8 --relative_mse_tolerance=0.01 --finetune_relative_mse_tolerance=0.001 --finetune_batch_size=32 --local_batch_size=1 --offload_activations --wandb --save $SAVE_PATH --dtype float32
In addition, judging by the GPU usage, the model seems to run as a pipeline: I load it across four devices, but only one GPU is at 100% utilization at any given time.
Has anyone else encountered the same problem? Is the inference speed of your quantized model normal?
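As a rough sanity check on the quantize command above, the flags `--num_codebooks=1 --nbits_per_codebook=16 --in_group_size=8` imply 16 code bits shared by every group of 8 weights. The sketch below works through that arithmetic. The function names, the 4096 x 4096 layer shape, and the 16-bit codebook storage are assumptions for illustration, and per-output scales are ignored.

```python
def code_bits_per_weight(num_codebooks, nbits_per_codebook, in_group_size, out_group_size=1):
    """Bits spent on codes per weight: one code per codebook for each group of weights."""
    return num_codebooks * nbits_per_codebook / (in_group_size * out_group_size)


def codebook_bits_per_weight(num_codebooks, nbits_per_codebook, in_group_size,
                             out_features, in_features,
                             out_group_size=1, codebook_dtype_bits=16):
    """Codebook storage amortized over one layer's weights (assumes a separate codebook per layer)."""
    entries = num_codebooks * 2 ** nbits_per_codebook
    codebook_bits = entries * in_group_size * out_group_size * codebook_dtype_bits
    return codebook_bits / (out_features * in_features)


# Settings taken from the command above.
codes = code_bits_per_weight(num_codebooks=1, nbits_per_codebook=16, in_group_size=8)
# Hypothetical 4096 x 4096 layer with fp16 codebooks (both are assumptions).
overhead = codebook_bits_per_weight(1, 16, 8, out_features=4096, in_features=4096)
print(codes, overhead)  # prints: 2.0 0.5
```

Under these assumptions the codes alone cost 2 bits per weight, and the codebook adds a layer-size-dependent overhead that shrinks for larger layers. This only estimates model size; it says nothing about inference speed, which depends on the kernels and on how the model is placed across devices.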