
quip-sharp's People

Contributors

chaosagent, jerry-chee, tsengalb99


quip-sharp's Issues

LLaMA-3 support and questions

This seems like one of the best quantization options for the important new LLaMA-3 70B model, so that it can be run on 1-2 consumer-grade GPUs. However, it looks like support for MQA is not present in llama.py, so I think it will not work.

Are you planning to add support for LLaMA-3?
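For context, grouped-query/multi-query attention mainly changes how key/value heads are shared across query heads. A minimal sketch of the kind of handling llama.py would need (this mirrors the standard Hugging Face repeat_kv pattern and is illustrative, not the repo's code):

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand shared key/value heads so num_key_value_heads * n_rep == num_attention_heads.
    # With MQA/GQA (e.g. Llama 3 70B), n_rep > 1; plain multi-head attention has n_rep == 1.
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    expanded = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)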

What could have gone wrong

Hi,

I was going to upload a QuIP 2-bit version of a llama2 model, which I took as a chance to experiment with this method.
https://huggingface.co/Yhyu13/Xwin-Math-7B-V1.0-QUIP-2bit

But as I mentioned in its readme, the Hessian pass took quite long, about 6 hours, and the final ppl for QuIP 2-bit is not ideal. The model's performance also degraded noticeably.

The conversion process does not throw any errors, though, and the evaluation process runs smoothly, too. What could have gone wrong? I might spend some time re-running the whole process as a double check.

Here is my script

#!/bin/bash
eval "$(conda shell.bash hook)"
conda activate quip
MODEL_NAME=Xwin-Math-7B-V1.0
#MODEL_NAME=ShareGPT4V-7B
BASE_MODEL_DIR=/media/hangyu5/Home/Documents/Hugging-Face/$MODEL_NAME/
SAVE_MODEL_DIR=/media/hangyu5/Home/Documents/Hugging-Face/$MODEL_NAME-QUIP/
TMP_MODEL_DIR=./$MODEL_NAME
BATCH_SIZE=2
CTX_LEN=4096

cd repo/quip-sharp/

if [ ! -d "$TMP_MODEL_DIR" ]; then
    mkdir -p $TMP_MODEL_DIR
fi

TRANSFORMERS_VERBOSITY=debug CUDA_VISIBLE_DEVICES=0 python ./hessian_offline_llama.py \
    --seed 34 \
    --base_model $BASE_MODEL_DIR \
    --save_path $TMP_MODEL_DIR/Hessian/ \
    --ctx_size $CTX_LEN \
    --batch_size $BATCH_SIZE \
    | tee $TMP_MODEL_DIR/Hessian-quip.log

TRANSFORMERS_VERBOSITY=debug CUDA_VISIBLE_DEVICES=0 python ./quantize_llama.py \
    --seed 34 \
    --base_model $BASE_MODEL_DIR \
    --hessian_path $TMP_MODEL_DIR/Hessian \
    --save_path $TMP_MODEL_DIR/Ckpt \
    --ctx_size $CTX_LEN \
    --batch_size $BATCH_SIZE \
    --codebook E8P12 \
    --scale_override 0.9 \
     | tee $TMP_MODEL_DIR/Ckpt-quip.log

TRANSFORMERS_VERBOSITY=debug CUDA_VISIBLE_DEVICES=0 python hfize_llama.py \
    --quantized_path $TMP_MODEL_DIR/Ckpt \
    --hf_output_path $SAVE_MODEL_DIR \
    | tee $TMP_MODEL_DIR/HFize-quip.log

TRANSFORMERS_VERBOSITY=debug CUDA_VISIBLE_DEVICES=0 python eval_ppl.py \
    --seed 34 \
    --hf_path $BASE_MODEL_DIR \
    --seqlen $CTX_LEN \
    | tee $TMP_MODEL_DIR/PPl-quip.log

cd ../../

Can you provide an example or code to compute Hessians?

Does each model need different Hessians for best performance? For example, if I want to use QuIP for DeepSeek models, do I have to compute new Hessians before quantizing, or can I just re-use the llama2 Hessians?

If I need to re-compute the Hessians, can you provide a script or code example to do so?
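Since the Hessians are statistics of a specific model's own activations, a different architecture such as DeepSeek would generally need its own Hessian pass; Hessians from another model would not even match the layer shapes. As a rough sketch of the quantity hessian_offline_llama.py accumulates per linear layer (illustrative only; the function name and shapes here are hypothetical):

import torch

def accumulate_hessian(layer_inputs, hidden_dim):
    # Proxy Hessian H = (1/n) * sum_i x_i x_i^T over calibration activations feeding
    # one linear layer, accumulated in float64 for numerical stability.
    H = torch.zeros(hidden_dim, hidden_dim, dtype=torch.float64)
    n = 0
    for x in layer_inputs:                      # x: (..., hidden_dim) activations
        x = x.reshape(-1, hidden_dim).double()
        H += x.T @ x
        n += x.shape[0]
    return H / max(n, 1)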

Exception: Saved weights version (0) does not match the codebook version (1).

I started quantizing an LLM with a slightly older version of your library. It took a while to calculate 8k-context Hessians, so by the time it finally finished, the library version I used was already outdated. I get the exception when I try to do inference with the model on your newer library version. I can get inference to work with the older version. Are my weights, which took a week to calculate, now effectively deprecated?

https://huggingface.co/KnutJaegersberg/Tess-M-34B-2bit

Traceback (most recent call last):
File "/home/knut/New Folder/quip-sharp/hfize_llama.py", line 126, in
main(args)
File "/home/knut/New Folder/quip-sharp/hfize_llama.py", line 112, in main
outputs = model.generate(input_ids=inputs['input_ids'].cuda(),
File "/home/knut/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/transformers/generation/utils.py", line 1606, in generate
return self.greedy_search(
File "/home/knut/transformers/lib/python3.9/site-packages/transformers/generation/utils.py", line 2454, in greedy_search
outputs = self(
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/knut/New Folder/quip-sharp/model/llama.py", line 1056, in forward
outputs = self.model(
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/knut/New Folder/quip-sharp/model/llama.py", line 943, in forward
layer_outputs = decoder_layer(
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/knut/New Folder/quip-sharp/model/llama.py", line 652, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/knut/New Folder/quip-sharp/model/llama.py", line 453, in forward
query_states, key_states, value_states = self.qkv_proj(hidden_states.to(torch.float32))
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/knut/New Folder/quip-sharp/lib/linear/fused_quantized_linear.py", line 20, in forward
fused_output = super(FusedQuantizedLinear, self).forward(input)
File "/home/knut/New Folder/quip-sharp/lib/linear/quantized_linear.py", line 86, in forward
raise Exception(
Exception: Saved weights version (0) does not match the codebook version (1). Please download the latest weights from https://huggingface.co/relaxml

HF Mistral-7B and Llama 2 7b chat Not working.

Hello Team, I am experimenting with the E8P 2 Bit, E8P RVQ 3 Bit, and E8P RVQ 4 Bit versions of the Mistral-7B and Llama 2 7b chat models for a text-generation task. The issues I encountered are:

  1. Gibberish output (Random characters) generated by the model.
  2. Very long time to generate the output - around 5 minutes.
  3. Very High GPU consumption: around 33 GB

The issue is consistent with all three Lattice Codebooks.

I am directly using the HF model from your repo.
System detail:
OS: Ubuntu 20.04.6 LTS x86_64 GPU: NVIDIA A100 40 GB

Code:

from langchain import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch
from langchain.prompts import PromptTemplate, FewShotPromptTemplate
from langchain import LLMChain


model_name="relaxml/Mistral-7b-E8PRVQ-4Bit"
hf_access_token = "hf_XXXX" # Replace with your HF access token

tokenizer = AutoTokenizer.from_pretrained(
    model_name, use_auth_token=hf_access_token #, use_fast=True
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    use_auth_token=hf_access_token,
)

pipeline = transformers.pipeline(
    "text-generation",  # task
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    use_fast=True,
    trust_remote_code=True,
    device_map="auto",
    max_length=4000,
    do_sample=True,
    top_k=3,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0.0})

text_gen_intent_examples = [
{
"input_text": "please write a text to john saying I am running late",
"generated_text": "Hi John, I wanted to let you know that I'm running a bit late. I apologize for any inconvenience this may cause. I'll be there as soon as I can. Thanks for understanding,"
},
{
"input_text": "write text to Jessica saying the we have a meeting tomorrow",
"generated_text": "Hi Jessica, Just a reminder that we have a meeting scheduled for tomorrow. Please let me know if there's anything specific you'd like to discuss or prepare ahead of time. Looking forward to it!"
},
{
"input_text": "write a message to Patrick to send the presentation slides",
"generated_text": "Hi Patrick, Hope you're doing well. Could you please send over the presentation slides when you get a chance?"
}
]

text_gen_intent_prefix = """
[INST]
Generate a brief and concise text message or email that a person could quickly send while driving.
- It should be polite but direct, without any unnecessary details.
- your response must not be in chat format as shown below:
[/INST]
"""

prompt_template = """
<s>[INST]
input_text: {input_text}
generated_text: {generated_text}
</s>[/INST]
"""

prompt = PromptTemplate(
    template=prompt_template, input_variables=["input_text", "generated_text"]
)


few_shot_template = FewShotPromptTemplate(
    examples=text_gen_intent_examples,
    example_prompt=prompt,
    prefix=text_gen_intent_prefix,
    suffix="input_text: {input_text} \n generated_text:",
    input_variables=["input_text"],
    example_separator="\n\n",
)

llm_chain = LLMChain(prompt=few_shot_template, llm=llm)

text = """
       write an email to Mark stating I am running late for the meeting and won't be able to join. Also, to send the minutes of the
       meeting.
       """
output_lst = []

print(few_shot_template.format(input_text=text.strip()))
print("-" * 30)
time_list = []
iter_num = 5
import time
for i in range(iter_num):
    st_time = time.time()
    print(llm_chain.run(text.strip()))
    end_time = time.time()
    time_taken = end_time - st_time
    print(f' {time_taken=}')
    time_list.append(time_taken)
    print("-" * 30)

total_time_taken = sum(time_list)
avg_time = total_time_taken / len(time_list)
print(f"Average time taken = {round(avg_time, 2)}")

Output Generated:

utz cleohl cleufohlarrowarrowarrow orient appar cle indul appar cleailleöt士ötötightssndbugbugบohlufijioenohlohlohlufesser («ohlmulticol Barceluf Plus negarrow Negarrow Inf Infarrowarrow Infarrow Bedarrowaille Dead febrohlohlaille郎郎郎郎arrowiji sealbuglijbuglijlij PlusMSMbugbugbug Nutbugsndbugbugbugbug bleedingportal Negouthbugštštštštiekiekatoniek Contin Contin bleeding Continiekiekiekiekiek Brigiek бы Negwick бы Barcel original seal Negesseresseroke Barcelmulticolmulticolwick Barcel retro retro retro retroFF retro retro retroervicesieriiasmiasmiasmervicesunch plainiasmiasmervicesosoosoiasmiasmiasmwickINEwickbugbugwickwickót Barcel Barcelwickötwick Barcelwickšt Barcelwickwick Barcelwick notenwick Barcel org Barcel Barcel Barcel CL Barcel Barcel Barcel wides Barcelieriieritee absolutteeervicesieriieriieriulumieri Barcel absolutungsungswickwickwickwickwickwickvíervicesieriwickwickieriwickieriieriwickšt济œuverviceservices济erviceservicesoki CHAPTERitsch CHAPTERitschwickwick济

requirements.txt

transformers
torch
accelerate
bitsandbytes
xformers
langchain
scipy

Could you please help me with this?
Question: Do I need to train your models for a downstream task?

NOTE: The above code generates decent output with the non-quantized, nf4-quantized, and AWQ versions of Mistral-7B and Llama 2 7b chat.
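One way to narrow this down may be to take LangChain and the pipeline out of the loop and call the already-loaded model and tokenizer from the snippet above directly. A hedged diagnostic sketch (the probe function and its parameters are just for illustration):

import time
import torch

def probe_generation(model, tokenizer, prompt, max_new_tokens=64):
    # Call the model directly (no LangChain, no pipeline) to check speed, memory,
    # and output quality in isolation; max_new_tokens is deliberately small, unlike
    # the max_length=4000 used in the pipeline above.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.time() - start
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    print(f"{elapsed:.1f}s, peak {peak_gib:.1f} GiB")
    print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))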

custom 1.3B llama quant

I trained a custom 1.3B model with the llama architecture, then tried to compress it to 2-bit using quip-sharp. Except that the max length of our model is 2,048, the other hyper-params are kept the same. I used a devset of 4096 samples for computing the Hessians. However, the ppl of the quantized model on c4 is quite high: the original fp16 model has 21.56 ppl, while the quantized model has >21716 ppl... This is the config of the quantized model:

{
"_name_or_path": "llama-xl-seq2048-bsz128-2e-4-0-w32a32-vquip/checkpoint_1_15000-quip/hf",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"bos_token_id": 0,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 5504,
"max_position_embeddings": 2048,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 24,
"num_key_value_heads": 32,
"pad_token_id": 1,
"pretraining_tp": 1,
"quip_params": {
"codebook": "E8P12",
"codebook_version": 0,
"codesz": 8,
"fused": true,
"idx_dtype": "torch.int16",
"lora_rank": 0,
"model_version": 0,
"outlier_channel_split": false,
"packsz": 1,
"rescale_WH": false
},
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": true,
"torch_dtype": "float16",
"transformers_version": "4.34.0",
"use_cache": true,
"vocab_size": 32002
}

The difference in ppl on the dev set is not very large:

I1210 08:04:11.968106 2240291 quantize_llama.py:398] calculating perplexity on devset
original model perplexity: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 14.53it/s]
I1210 08:04:12.570923 2240291 quantize_llama.py:417] original model perplexity: 8.982492446899414
quantized model perplexity: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 13.92it/s]
I1210 08:04:13.150637 2240291 quantize_llama.py:430] quantized model perplexity: 9.49695873260498

Could you provide some advice on how to resolve this problem?

[Question] Word Embedding Quantization?

Hello, I was wondering if we can apply the same approach to post-training quantization of the word embeddings in a transformer architecture. The word embedding module is usually demanding in size, since the embedding map is effectively a large matrix of floats. What are your thoughts on this?
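For illustration only (not something quip-sharp does today), codebook quantization of an embedding matrix usually means snapping each d-dimensional group of every row to its nearest codeword, so storage drops to roughly log2(K)/d bits per weight plus the codebook. A minimal sketch, assuming a generic codebook of K codewords of dimension d:

import torch

def quantize_embedding_rows(E, codebook):
    # E: (vocab, hidden); codebook: (K, d) with hidden divisible by d.
    # Each d-dimensional group of every row is assigned to its nearest codeword.
    vocab, hidden = E.shape
    K, d = codebook.shape
    groups = E.reshape(vocab * (hidden // d), d)
    idx = torch.cdist(groups, codebook).argmin(dim=-1)      # nearest-codeword indices
    E_hat = codebook[idx].reshape(vocab, hidden)             # dequantized embedding
    return idx.reshape(vocab, hidden // d), E_hat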

Thanks

Procedures for quantizing generic architectures

I have a custom 13B model I'm trying to quantize with QuIP#. In the quantize_llama scripts, llama2's architecture is hardcoded in a few places. It seems brittle and difficult for me to try to use my architecture with these, since I won't be able to track upstream changes, and I won't know when I have made all the necessary changes.

Many other quantization libraries have the same issue, but there have been two solutions:

  1. Some libraries let us specify the layer names to quantize, such as: https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/transformers/sparsification/obcq#-how-to-adapt-a-recipe-for-a-new-model So when we want to test new architectures, we only need to change the layer names.
  2. Gpt-fast automatically deduces the layer configuration using torch._dynamo.export: https://github.com/pytorch-labs/gpt-fast/blob/40ff31458beffa3ddda8c45215240d6ce43a768a/GPTQ.py#L134

If either of these solutions is possible with QuIP#, it would be convenient for experimenting with new transformer architectures; a rough sketch of the first option is below.
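As an illustration of option 1, the mapping could be as small as a dictionary of module-name suffixes that the quantization driver iterates over; the keys and helper below are hypothetical, not an existing quip-sharp API:

import torch.nn as nn

# Hypothetical layer-name mapping: the quantization driver would read these names
# instead of hardcoding llama2's module layout.
QUANT_TARGETS = {
    "attention": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj"],
    "mlp": ["mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"],
}

def iter_quant_layers(model, targets=QUANT_TARGETS):
    # Yield (name, module) for every linear layer whose name ends with a target suffix.
    suffixes = tuple(targets["attention"] + targets["mlp"])
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and name.endswith(suffixes):
            yield name, module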

Group-wise Quantization

Hi,

I understand that currently you are quantizing the model weights in a per-row fashion. Can you extend QuIP# to per-group granularity? Can you elaborate on why or why not?

Thanks

Low Ppl benchmark results

Hi, I am in the process of adding QuIP inference support to ExLlamaV2, and this is the PR.

The problem I am having right now is that my ppl testing results are somewhat worse compared to your blog results.

So I am wondering whether there is something wrong with my implementation, or if there are other reasons.

Ppl Benchmarks

Using dataset: [wikitext-2-v1_validation_0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-v1/validation)

2-bit:

| Model | Performance (ppl) |
| --- | --- |
| Llama-2-7b-E8P-2Bit | 8.7339 |
| Llama2-7b-exl2-2.5bpw | 8.0745 |
| Llama-2-13b-E8P-2Bit | 7.1207 |
| Llama2-13b-exl2-2.5bpw | 7.2741 |
| Llama-2-70b-E8P-2Bit | 6.2192 |
| Llama2-70b-exl2-2.5bpw | 5.8270 |

4-bit:

| Model | Performance (ppl) |
| --- | --- |
| Llama-2-7b-HI-4Bit-Packed | 6.0748 |
| Llama2-7b-exl2-4.0bpw | 6.0300 |
| Llama-2-13b-HI-4Bit-Packed | 7.4169 |
| Llama2-13b-exl2-4.0bpw | 5.4905 |
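For what it's worth, differences in tokenization, context length, and windowing between ExLlamaV2's evaluator and eval_ppl.py can move wikitext perplexity by a few tenths on their own. A minimal non-overlapping-window sketch of the standard recipe (not the repo's exact evaluation code), mainly to make the comparison settings explicit:

import math
import torch

def window_ppl(model, tokenizer, text, ctx_len=4096):
    # Non-overlapping-window perplexity; ctx_len, stride, and tokenizer choice alone
    # can shift wikitext-2 numbers noticeably between implementations.
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nll_sum, n_tokens = 0.0, 0
    for i in range(0, ids.size(1) - 1, ctx_len):
        chunk = ids[:, i:i + ctx_len]
        if chunk.size(1) < 2:
            continue
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss   # mean NLL over chunk.size(1) - 1 targets
        nll_sum += loss.item() * (chunk.size(1) - 1)
        n_tokens += chunk.size(1) - 1
    return math.exp(nll_sum / n_tokens)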

Llamafied model has some issues in hfize_llama.py

python quantize_llama.py --save_path Qwen-1_8B-Chat_LLaMAfied_2bitcc --base_model Qwen-1_8B-Chat_LLaMAfied/ --hessian_path hessians/llama2_70b/ --codebook E8P12 --scale_override 0.9 --batch_size 1 --devset_size 8 --ctx_size 500
[screenshot]

python hfize_llama.py --quantized_path Qwen-1_8B-Chat_LLaMAfied_2bitcc/ --hf_output_path Qwen-1_8B-Chat_LLaMAfied_2bit

error:
[screenshots]

python interactive_gen.py --hf_path Qwen-1_8B-Chat_2bit --max_length 100
bad output:
[screenshot]

TypeError: decompress_e8p_origorder(): incompatible function arguments.

I used a version of quip-sharp (not the most recent one, as that came out during quantization) to quantize MoMo-72b (https://huggingface.co/moreh/MoMo-70B-LoRA-V1.4), a llamafied Qwen fine-tuned on SlimOrca, to 2 bit.
I get an exception during the final step, which makes HF weights from the quantized weights, as the script tries to load the weights. I also can't use the weights otherwise.
So you can reproduce this, I'm uploading the quantized (non-HF) weights here:
https://huggingface.co/KnutJaegersberg/MoMo-72b-quants

The perplexity reported at the end of quantization was OK, about 7, up from 6.5, which leads me to think that the weights are not completely broken.

I've added the complete exception below:

I0113 09:07:36.679814 209425 hfize_llama.py:93] loading layer 79 down
I0113 09:07:36.699936 209425 hfize_llama.py:149] saving model...
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use attn_implementation="flash_attention_2" instead.
I0113 09:08:12.982370 209425 modeling.py:799] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set max_memory in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████| 5/5 [00:03<00:00, 1.40it/s]
I0113 09:08:19.277218 209425 hfize_llama.py:156] successfully loaded hfized model
I0113 09:08:19.277296 209425 hfize_llama.py:158] generating some text...
Setting pad_token_id to eos_token_id:2 for open-end generation.
Traceback (most recent call last):
File "/run/media/knut/HD/quip-sharp/hfize_llama.py", line 177, in
main(args)
File "/run/media/knut/HD/quip-sharp/hfize_llama.py", line 163, in main
outputs = model.generate(input_ids=inputs['input_ids'].cuda(),
File "/home/knut/transformers/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/transformers/generation/utils.py", line 1718, in generate
return self.greedy_search(
File "/home/knut/transformers/lib/python3.9/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
outputs = self(
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/knut/HD/quip-sharp/model/llama.py", line 1086, in forward
outputs = self.model(
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/knut/HD/quip-sharp/model/llama.py", line 973, in forward
layer_outputs = decoder_layer(
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/knut/HD/quip-sharp/model/llama.py", line 682, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/knut/HD/quip-sharp/model/llama.py", line 373, in forward
qkv_states = self.qkv_proj(hidden_states.to(torch.float32))
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/knut/HD/quip-sharp/lib/linear/quantized_linear.py", line 90, in forward
return self.codebook_class(input,
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/knut/transformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/knut/HD/quip-sharp/lib/codebook/latticee8_padded12.py", line 252, in forward
quiptools_cuda.decompress_e8p_origorder(Qidxs, self.codebook.grid_abs,
TypeError: decompress_e8p_origorder(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor) -> None

Invoked with: tensor([[-15210, 24677, -15616, ..., 24930, 26312, 26133],
[ 26256, 21329, 13018, ..., 28524, 24142, 31816],
[-25804, 7775, 937, ..., -18345, 9130, -8142],
...,
[-22927, -1611, -31657, ..., -8611, 25484, -30339],
[ 9292, -27004, 28969, ..., 17868, 7946, -11222],
[ 25934, -3162, 28537, ..., -4720, 8854, 21136]],
device='cuda:0', dtype=torch.int16), tensor([[0.5000, 0.5000, 0.5000, ..., 0.5000, 0.5000, 0.5000],
[0.5000, 0.5000, 0.5000, ..., 0.5000, 0.5000, 1.5000],
[0.5000, 0.5000, 0.5000, ..., 0.5000, 0.5000, 2.5000],
...,
[0.5000, 1.5000, 1.5000, ..., 1.5000, 0.5000, 1.5000],
[0.5000, 1.5000, 1.5000, ..., 0.5000, 1.5000, 1.5000],
[1.5000, 1.5000, 0.5000, ..., 1.5000, 1.5000, 0.5000]],
device='cuda:0', dtype=torch.float16), tensor([ True, False, True, False, True, False, True, False, False, True,
False, True, False, False, True, False, False, False, True, False,
True, False, False, True, False, False, True, False, True, False,
False, False, False, True, False, True, False, False, True, False,
False, True, False, True, False, False, True, False, True, True,
False, True, False, False, False, False, False, True, False, True,
False, False, True, False, False, True, False, True, False, False,
True, False, True, True, False, True, False, False, True, False,
True, True, False, True, True, True, False, True, False, False,
False, False, False, False, True, False, True, False, False, True,
False, False, True, False, True, False, False, True, False, True,
True, False, True, False, False, True, False, True, True, False,
True, True, True, False, True, False, False, True, False, True,
True, False, True, True, True, False, True, True, True, True,
False, True, False, False, False, False, False, False, False, True,
False, True, False, False, True, False, False, True, False, True,
False, False, True, False, True, True, False, True, False, False,
True, False, True, True, False, True, True, True, False, True,
False, False, True, False, True, True, False, True, True, True,
False, True, True, True, True, False, True, False, False, True,
False, True, True, False, True, True, True, False, True, True,
True, True, False, True, True, True, True, True, False, True,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False], device='cuda:0'), tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16)


mistral-7b

Can I please have an example of how to quantize mistral-7b?

support qwen model

I spent several days trying to modify the code to support the qwen model, but failed: the ppl was far too high (original model 2xx, quantized model 7xx).
[screenshot]

Question about error proxy in show_metrics

Hi all,
Thanks for sharing such an interesting method, and I like your analysis and proof in the paper.

I have a question about the error proxy in the source code.

I found that the optimization objective of PTQ is to minimize $\mathrm{tr}(\Delta W^T H \Delta W) = \sum \mathrm{diag}(\Delta W^T H \Delta W)$.


However, I found that the QuIP# repo uses the following in its evaluation code. Is there a reason for this metric?

    err_proxy = (((hatW - W_orig) @ H) * (hatW - W_orig)).sum() / ((W_orig @ H) * W_orig).sum()

https://github.com/Cornell-RelaxML/quip-sharp/blob/main/lib/utils/misc.py#L15
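For reference, the elementwise form in misc.py is the same trace quantity written without materializing the full product, since $\mathrm{tr}(\Delta W H \Delta W^T) = \sum_{ij} (\Delta W H)_{ij}\,(\Delta W)_{ij}$; the denominator just normalizes by the same proxy evaluated on the original weights. A quick numerical check (illustrative only):

import torch

W = torch.randn(8, 16)
hatW = W + 0.01 * torch.randn_like(W)
X = torch.randn(64, 16)
H = X.T @ X / 64                      # symmetric PSD proxy Hessian

dW = hatW - W
lhs = ((dW @ H) * dW).sum()           # elementwise form used in misc.py
rhs = torch.trace(dW @ H @ dW.T)      # trace form from the paper
assert torch.allclose(lhs, rhs, atol=1e-5)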

BTW, I guess you should use $\approx$ in equation (1): $L(\hat{W}) \approx \mathrm{tr}(\Delta W H \Delta W^T)$?

Thanks!

Yang

ROCm Build Error

Are you planning on adding ROCm support or did you already test it on AMD?

I just tried building the package and it crashes with the following error:

FAILED: /data/linux_data/AI/LLM/WebUI/repositories/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-310/quiptools_e8p_gemv.o 
/opt/rocm/bin/hipcc  -I/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/include -I/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/include/TH -I/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/include/THC -I/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/include/THH -I/opt/rocm/include -I/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/include -I/home/lukas/.pyenv/versions/3.10.12/include/python3.10 -c -c /data/linux_data/AI/LLM/WebUI/repositories/quip-sharp/quiptools/quiptools_e8p_gemv.hip -o /data/linux_data/AI/LLM/WebUI/repositories/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-310/quiptools_e8p_gemv.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O2 -g -Xcompiler -rdynamic -lineinfo -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 --offload-arch=gfx900 --offload-arch=gfx906 --offload-arch=gfx908 --offload-arch=gfx90a --offload-arch=gfx1030 --offload-arch=gfx1100 -fno-gpu-rdc -std=c++17
clang++: warning: -lineinfo: 'linker' input unused [-Wunused-command-line-argument]
clang++: warning: argument unused during compilation: '-Xcompiler' [-Wunused-command-line-argument]
clang++: warning: argument unused during compilation: '-rdynamic' [-Wunused-command-line-argument]
/data/linux_data/AI/LLM/WebUI/repositories/quip-sharp/quiptools/quiptools_e8p_gemv.hip:11:10: fatal error: 'mma.h' file not found
#include <mma.h>
         ^~~~~~~
1 error generated when compiling for gfx1030.
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/data/linux_data/AI/LLM/WebUI/.venvs/llm1100/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2099, in _run_ninja_build
    subprocess.run(
  File "/home/lukas/.pyenv/versions/3.10.12/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

model size confirmation

Hi team,

I found https://huggingface.co/TheBloke/Llama-2-70B-Chat-AWQ; this model declares it uses 4-bit quantization:
"AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference."

Meanwhile, https://huggingface.co/relaxml/Llama-2-70b-chat-E8P-2Bit/tree/main uses 2-bit quantization.

But both model sizes are almost the same, ~36.5 GB. I assumed Llama-2-70b-chat-E8P-2Bit would have a more compact footprint, right? Could you help confirm?
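As a back-of-the-envelope check (ignoring embeddings, scales, and other fp16 metadata), 2-bit weights for a ~70B-parameter model should come to roughly 17-18 GB, while ~36.5 GB is about what 4-bit weights occupy, so identical file sizes would be surprising:

# Rough weight-only storage estimate; real checkpoints add embeddings, norms,
# per-layer scales, and packing overhead on top of this.
params = 70e9
for bits, label in [(16, "fp16"), (4, "4-bit (AWQ)"), (2, "2-bit (E8P)")]:
    print(f"{label:>12}: ~{params * bits / 8 / 1e9:.1f} GB")
# Roughly: fp16 ~140 GB, 4-bit ~35 GB, 2-bit ~17.5 GB.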

Cannot re-initialize CUDA in forked subprocess

Thanks for your wonderful work!
I am trying to compute the Hessians of our custom llama-architecture model, but it reports "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method" while executing hessian_offline_llama.py, line 56, in forward_layer (layer = layer.to(device)). How can I resolve this? Could you provide a version with multiprocessing disabled, since the code is quite complex?
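The usual workaround for this PyTorch error is to force the 'spawn' start method before any CUDA call happens in the parent process; a minimal sketch of that guard (an assumption about where the fix belongs, not a confirmed change to the repo's code):

import torch.multiprocessing as mp

if __name__ == "__main__":
    # Must run before the parent process touches CUDA and before any worker is started.
    mp.set_start_method("spawn", force=True)
    # ... then launch the Hessian collection as usual ...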

How many samples do you use in checkpoints?

Hi all,
Thanks for sharing the interesting idea.

I have a question about Hessian matrices for fair comparison with other methods.

How many samples do you use in the checkpoints? I found that the default devset_size is [256](https://github.com/Cornell-RelaxML/quip-sharp/blob/main/hessian_offline_llama.py#L23). I just want to confirm the settings used for the checkpoints.

Thanks!
Yang

Load LORA?

It seems that there are parameters named A and B in QuantizedLinear. I guess they are for LoRA, right? How do I load a LoRA model into it? Were the LoRA weights also quantized?
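For context, the usual pattern when low-rank factors A and B accompany a quantized linear layer is to keep them in full precision and add their product on top of the quantized matmul. A sketch of that pattern (illustrative; not necessarily how quip-sharp wires A and B):

import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    # y = quantized_out + (x @ A^T) @ B^T, with A (rank x in) and B (out x rank)
    # kept in full precision alongside the quantized weight.
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(rank, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x, quantized_out):
        return quantized_out + (x @ self.A.t()) @ self.B.t()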

[Question] Different target in the e2e finetuning

Hi,

Thanks for your outstanding work! I see that QuIP# takes orig_logits as the training target in the e2e fine-tuning process.

I want to know how much performance is gained by using the logits compared to using one-hot labels. Do you have any detailed ablation experiments?
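For comparison, the two objectives differ only in the target distribution: hard one-hot labels versus the unquantized model's softmax. A minimal sketch of both losses (illustrative names and shapes, not the repo's training code):

import torch.nn.functional as F

def one_hot_loss(student_logits, labels):
    # Standard next-token cross-entropy against hard dataset labels.
    return F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)), labels.reshape(-1))

def logit_distill_loss(student_logits, orig_logits):
    # KL divergence against the unquantized model's softmax (soft targets from orig_logits).
    return F.kl_div(F.log_softmax(student_logits, dim=-1).reshape(-1, student_logits.size(-1)),
                    F.softmax(orig_logits, dim=-1).reshape(-1, orig_logits.size(-1)),
                    reduction="batchmean")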

Thank you!

Llama-2-7b-E8P-2Bit not loading correctly.

Hi,

I was loading this model https://huggingface.co/relaxml/Llama-2-7b-E8P-2Bit from huggingface.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("relaxml/Llama-2-7b-E8P-2Bit")
model = AutoModelForCausalLM.from_pretrained("relaxml/Llama-2-7b-E8P-2Bit")

It gives an error message about model loading:

Some weights of the model checkpoint at relaxml/Llama-2-7b-E8P-2Bit were not used when initializing LlamaForCausalLM: ['model.layers.27.self_attn.qkv_proj.codebook_id', 'model.layers.19.mlp.down_proj.Qidxs', 'model.layers.26.self_attn.qkv_proj.codebook_id', 'model.layers.15.self_attn.qkv_proj.SV', 'model.layers.16.self_attn.o_proj.Wscale', 'model.layers.23.mlp.down_proj.SU', 'model.layers.8.mlp.upgate_proj.SU', 'model.layers.6.self_attn.qkv_proj.Qidxs', 'model.layers.24.self_attn.o_proj.SU', 'model.layers.17.mlp.upgate_proj.SU', 'model.layers.23.self_attn.o_proj.SU', 'model.layers.9.mlp.down_proj.SU', 'model.layers.24.mlp.down_proj.codebook_id', 'model.layers.28.self_attn.o_proj.Qidxs', 'model.layers.2.self_attn.o_proj.codebook_id', 'model.layers.26.self_attn.o_proj.Qidxs', 'model.layers.10.mlp.down_proj.SV', 'model.layers.25.self_attn.o_proj.SU', 'model.layers.2.mlp.upgate_proj.fuse_scales', 'model.layers.21.mlp.upgate_proj.codebook_id', 'model.layers.31.self_attn.o_proj.codebook_id', 'model.layers.29.self_attn.qkv_proj.fuse_scales', 'model.layers.0.self_attn.qkv_proj.Qidxs', 'model.layers.31.self_attn.o_proj.Wscale', 'model.layers.0.self_attn.qkv_proj.Wscale', 'model.layers.18.mlp.down_proj.Qidxs', 'model.layers.23.mlp.upgate_proj.SU', 'model.layers.21.mlp.down_proj.Wscale', 'model.layers.24.self_attn.o_proj.Qidxs', 'model.layers.2.self_attn.qkv_proj.SV', 'model.layers.27.mlp.down_proj.codebook_id', 'model.layers.18.mlp.down_proj.codebook_id', 'model.layers.3.self_attn.o_proj.Wscale', 'model.layers.19.mlp.upgate_proj.SU', 'model.layers.3.self_attn.o_proj.SV', 'model.layers.20.mlp.upgate_proj.codebook_id', 'model.layers.8.self_attn.qkv_proj.fuse_scales', 'model.layers.20.self_attn.qkv_proj.Wscale', 'model.layers.31.mlp.upgate_proj.codebook_id', 'model.layers.2.self_attn.o_proj.Qidxs', 'model.layers.4.mlp.down_proj.codebook_id', 'model.layers.28.mlp.upgate_proj.Qidxs', 'model.layers.9.mlp.down_proj.Qidxs', 'model.layers.15.self_attn.o_proj.Wscale', 'model.layers.11.self_attn.qkv_proj.SU', 'model.layers.0.mlp.down_proj.SU', 'model.layers.9.mlp.upgate_proj.SU', 'model.layers.19.mlp.down_proj.codebook_id', 'model.layers.6.mlp.upgate_proj.Qidxs', 'model.layers.29.mlp.down_proj.SU', 'model.layers.18.mlp.upgate_proj.Qidxs', 'model.layers.12.self_attn.qkv_proj.SV', 'model.layers.30.mlp.down_proj.SU', 'model.layers.12.mlp.down_proj.SV', 'model.layers.16.mlp.upgate_proj.SV', 'model.layers.3.mlp.upgate_proj.SU', 'model.layers.16.self_attn.o_proj.SU', 'model.layers.11.mlp.upgate_proj.codebook_id', 'model.layers.4.self_attn.qkv_proj.fuse_scales', 'model.layers.29.mlp.down_proj.Wscale', 'model.layers.31.self_attn.qkv_proj.SV', 'model.layers.4.self_attn.qkv_proj.Wscale', 'model.layers.16.mlp.upgate_proj.fuse_scales', 'model.layers.18.mlp.upgate_proj.codebook_id', 'model.layers.24.mlp.upgate_proj.SV', 'model.layers.23.self_attn.o_proj.SV', 'model.layers.7.self_attn.qkv_proj.fuse_scales', 'model.layers.23.mlp.upgate_proj.Wscale', 'model.layers.19.mlp.down_proj.Wscale', 'model.layers.15.self_attn.qkv_proj.codebook_id', 'model.layers.2.mlp.down_proj.SU', 'model.layers.6.mlp.upgate_proj.Wscale', 'model.layers.0.self_attn.qkv_proj.SU', 'model.layers.2.self_attn.o_proj.Wscale', 'model.layers.24.mlp.upgate_proj.Wscale', 'model.layers.25.self_attn.qkv_proj.Qidxs', 'model.layers.1.mlp.upgate_proj.Qidxs', 'model.layers.10.mlp.upgate_proj.SV', 'model.layers.29.mlp.down_proj.Qidxs', 'model.layers.12.mlp.upgate_proj.Wscale', 'model.layers.15.mlp.upgate_proj.SV', 'model.layers.12.mlp.upgate_proj.SV', 
'model.layers.30.mlp.upgate_proj.codebook_id', 'model.layers.0.mlp.upgate_proj.Qidxs', 'model.layers.20.self_attn.o_proj.Qidxs', 'model.layers.19.self_attn.qkv_proj.Qidxs', 'model.layers.29.self_attn.o_proj.SU', 'model.layers.10.self_attn.o_proj.SV', 'model.layers.17.mlp.upgate_proj.Wscale', 'model.layers.13.mlp.down_proj.codebook_id', 'model.layers.11.self_attn.o_proj.Qidxs', 'model.layers.12.self_attn.qkv_proj.codebook_id', 'model.layers.11.mlp.down_proj.Qidxs', 'model.layers.28.self_attn.qkv_proj.codebook_id', 'model.layers.29.mlp.upgate_proj.SU', 'model.layers.24.mlp.down_proj.Qidxs', 'model.layers.23.mlp.down_proj.SV', 'model.layers.20.mlp.down_proj.SV', 'model.layers.20.self_attn.o_proj.codebook_id', 'model.layers.27.self_attn.o_proj.SV', 'model.layers.8.self_attn.o_proj.Qidxs', 'model.layers.0.self_attn.o_proj.SU', 'model.layers.4.self_attn.o_proj.Qidxs', 'model.layers.15.self_attn.qkv_proj.Wscale', 'model.layers.21.self_attn.qkv_proj.Qidxs', 'model.layers.17.mlp.upgate_proj.fuse_scales', 'model.layers.19.self_attn.o_proj.SU', 'model.layers.31.mlp.down_proj.codebook_id', 'model.layers.24.self_attn.qkv_proj.Qidxs', 'model.layers.20.mlp.down_proj.Qidxs', 'model.layers.5.self_attn.o_proj.SU', 'model.layers.11.self_attn.o_proj.SV', 'model.layers.13.mlp.down_proj.Qidxs', 'model.layers.0.mlp.upgate_proj.SV', 'model.layers.19.mlp.upgate_proj.SV', 'model.layers.22.self_attn.qkv_proj.Wscale', 'model.layers.30.self_attn.o_proj.Qidxs', 'model.layers.30.mlp.upgate_proj.fuse_scales', 'model.layers.25.mlp.upgate_proj.SV', 'model.layers.8.self_attn.qkv_proj.SU', 'model.layers.17.self_attn.qkv_proj.SU', 'model.layers.19.mlp.upgate_proj.Qidxs', 'model.layers.20.mlp.down_proj.Wscale', 'model.layers.16.self_attn.o_proj.Qidxs', 'model.layers.26.self_attn.o_proj.SU', 'model.layers.11.mlp.down_proj.SU', 'model.layers.1.mlp.down_proj.SV', 'model.layers.12.self_attn.qkv_proj.Qidxs', 'model.layers.23.self_attn.qkv_proj.codebook_id', 'model.layers.26.mlp.down_proj.Qidxs', 'model.layers.3.self_attn.o_proj.Qidxs', 'model.layers.11.mlp.upgate_proj.SV', 'model.layers.14.mlp.upgate_proj.SU', 'model.layers.14.mlp.upgate_proj.codebook_id', 'model.layers.14.self_attn.o_proj.SU', 'model.layers.30.self_attn.o_proj.SV', 'model.layers.24.mlp.upgate_proj.Qidxs', 'model.layers.0.mlp.upgate_proj.fuse_scales', 'model.layers.7.mlp.upgate_proj.SV', 'model.layers.6.self_attn.o_proj.SV', 'model.layers.27.self_attn.o_proj.Wscale', 'model.layers.0.mlp.down_proj.Wscale', 'model.layers.28.mlp.down_proj.codebook_id', 'model.layers.29.self_attn.qkv_proj.Wscale', 'model.layers.22.mlp.down_proj.Qidxs', 'model.layers.15.mlp.down_proj.Qidxs', 'model.layers.10.self_attn.qkv_proj.SU', 'model.layers.25.self_attn.o_proj.Qidxs', 'model.layers.29.mlp.down_proj.codebook_id', 'model.layers.26.mlp.down_proj.Wscale', 'model.layers.22.self_attn.qkv_proj.SU', 'model.layers.17.self_attn.qkv_proj.Wscale', 'model.layers.9.self_attn.qkv_proj.Qidxs', 'model.layers.21.self_attn.o_proj.SV', 'model.layers.26.mlp.upgate_proj.codebook_id', 'model.layers.31.mlp.upgate_proj.SU', 'model.layers.10.self_attn.o_proj.SU', 'model.layers.17.self_attn.o_proj.SV', 'model.layers.10.mlp.down_proj.SU', 'model.layers.13.self_attn.qkv_proj.fuse_scales', 'model.layers.21.self_attn.o_proj.Qidxs', 'model.layers.5.mlp.down_proj.codebook_id', 'model.layers.25.mlp.down_proj.Wscale', 'model.layers.19.mlp.down_proj.SV', 'model.layers.14.self_attn.qkv_proj.Wscale', 'model.layers.25.self_attn.o_proj.Wscale', 'model.layers.3.self_attn.qkv_proj.fuse_scales', 
'model.layers.3.self_attn.o_proj.SU', 'model.layers.9.mlp.upgate_proj.fuse_scales', 'model.layers.13.self_attn.qkv_proj.Qidxs', 'model.layers.2.mlp.upgate_proj.SU', 'model.layers.14.mlp.upgate_proj.fuse_scales', 'model.layers.28.mlp.upgate_proj.codebook_id', 'model.layers.6.mlp.down_proj.SU', 'model.layers.9.self_attn.o_proj.SV', 'model.layers.1.mlp.upgate_proj.Wscale', 'model.layers.26.self_attn.qkv_proj.fuse_scales', 'model.layers.16.self_attn.qkv_proj.Wscale', 'model.layers.24.mlp.upgate_proj.fuse_scales', 'model.layers.10.self_attn.o_proj.Wscale', 'model.layers.30.self_attn.qkv_proj.SU', 'model.layers.17.mlp.down_proj.Wscale', 'model.layers.28.mlp.down_proj.SU', 'model.layers.22.self_attn.o_proj.Wscale', 'model.layers.29.self_attn.o_proj.SV', 'model.layers.2.mlp.upgate_proj.Wscale', 'model.layers.5.self_attn.qkv_proj.Qidxs', 'model.layers.9.mlp.upgate_proj.Wscale', 'model.layers.31.mlp.down_proj.SU', 'model.layers.27.self_attn.qkv_proj.SU', 'model.layers.9.self_attn.o_proj.SU', 'model.layers.29.self_attn.o_proj.Wscale', 'model.layers.14.self_attn.o_proj.Qidxs', 'model.layers.30.mlp.down_proj.Qidxs', 'model.layers.2.mlp.down_proj.Qidxs', 'model.layers.21.mlp.down_proj.SU', 'model.layers.19.mlp.upgate_proj.Wscale', 'model.layers.11.self_attn.o_proj.codebook_id', 'model.layers.28.self_attn.qkv_proj.Wscale', 'model.layers.24.self_attn.qkv_proj.SV', 'model.layers.4.mlp.upgate_proj.Qidxs', 'model.layers.1.mlp.down_proj.SU', 'model.layers.7.mlp.down_proj.Wscale', 'model.layers.12.mlp.down_proj.Wscale', 'model.layers.26.self_attn.o_proj.SV', 'model.layers.5.mlp.upgate_proj.fuse_scales', 'model.layers.0.mlp.down_proj.Qidxs', 'model.layers.8.self_attn.o_proj.codebook_id', 'model.layers.20.self_attn.o_proj.Wscale', 'model.layers.18.mlp.down_proj.SV', 'model.layers.6.self_attn.o_proj.codebook_id', 'model.layers.5.mlp.down_proj.SV', 'model.layers.17.mlp.down_proj.SU', 'model.layers.29.self_attn.qkv_proj.SV', 'model.layers.5.self_attn.o_proj.SV', 'model.layers.3.mlp.upgate_proj.codebook_id', 'model.layers.6.mlp.upgate_proj.fuse_scales', 'model.layers.30.self_attn.qkv_proj.Qidxs', 'model.layers.9.self_attn.qkv_proj.SV', 'model.layers.30.mlp.down_proj.codebook_id', 'model.layers.13.self_attn.o_proj.codebook_id', 'model.layers.21.mlp.upgate_proj.Wscale', 'model.layers.4.self_attn.o_proj.SV', 'model.layers.5.self_attn.o_proj.Wscale', 'model.layers.15.self_attn.o_proj.SV', 'model.layers.10.mlp.upgate_proj.fuse_scales', 'model.layers.29.self_attn.qkv_proj.SU', 'model.layers.4.self_attn.o_proj.SU', 'model.layers.31.self_attn.o_proj.Qidxs', 'model.layers.4.self_attn.qkv_proj.SU', 'model.layers.11.mlp.upgate_proj.SU', 'model.layers.20.mlp.down_proj.SU', 'model.layers.30.self_attn.qkv_proj.SV', 'model.layers.30.mlp.upgate_proj.Wscale', 'model.layers.10.mlp.upgate_proj.Qidxs', 'model.layers.21.self_attn.qkv_proj.SU', 'model.layers.8.self_attn.qkv_proj.SV', 'model.layers.12.mlp.upgate_proj.codebook_id', 'model.layers.19.self_attn.o_proj.Qidxs', 'model.layers.31.self_attn.qkv_proj.Qidxs', 'model.layers.23.mlp.down_proj.codebook_id', 'model.layers.5.self_attn.qkv_proj.SV', 'model.layers.0.mlp.down_proj.codebook_id', 'model.layers.10.self_attn.o_proj.Qidxs', 'model.layers.26.mlp.upgate_proj.SV', 'model.layers.19.self_attn.qkv_proj.SV', 'model.layers.9.self_attn.qkv_proj.fuse_scales', 'model.layers.26.self_attn.qkv_proj.SU', 'model.layers.20.mlp.upgate_proj.SV', 'model.layers.0.mlp.upgate_proj.Wscale', 'model.layers.5.self_attn.o_proj.Qidxs', 'model.layers.4.self_attn.qkv_proj.Qidxs', 
'model.layers.15.mlp.down_proj.SV', 'model.layers.16.mlp.upgate_proj.SU', 'model.layers.7.self_attn.qkv_proj.Qidxs', 'model.layers.14.mlp.upgate_proj.Wscale', 'model.layers.9.self_attn.o_proj.Qidxs', 'model.layers.18.self_attn.qkv_proj.SV', 'model.layers.6.mlp.down_proj.codebook_id', 'model.layers.3.mlp.upgate_proj.Wscale', 'model.layers.23.self_attn.qkv_proj.Qidxs', 'model.layers.0.self_attn.qkv_proj.fuse_scales', 'model.layers.14.self_attn.qkv_proj.Qidxs', 'model.layers.16.mlp.down_proj.Wscale', 'model.layers.21.mlp.down_proj.Qidxs', 'model.layers.14.mlp.down_proj.Qidxs', 'model.layers.6.self_attn.qkv_proj.SV', 'model.layers.16.self_attn.qkv_proj.fuse_scales', 'model.layers.12.mlp.upgate_proj.fuse_scales', 'model.layers.15.self_attn.qkv_proj.fuse_scales', 'model.layers.2.mlp.down_proj.codebook_id', 'model.layers.28.mlp.upgate_proj.SU', 'model.layers.27.mlp.upgate_proj.codebook_id', 'model.layers.20.mlp.down_proj.codebook_id', 'model.layers.27.self_attn.o_proj.Qidxs', 'model.layers.19.mlp.upgate_proj.fuse_scales', 'model.layers.2.self_attn.o_proj.SU', 'model.layers.21.self_attn.o_proj.Wscale', 'model.layers.12.mlp.down_proj.SU', 'model.layers.8.mlp.upgate_proj.codebook_id', 'model.layers.3.mlp.upgate_proj.Qidxs', 'model.layers.1.self_attn.qkv_proj.SV', 'model.layers.26.self_attn.qkv_proj.Wscale', 'model.layers.23.self_attn.qkv_proj.SU', 'model.layers.15.self_attn.o_proj.codebook_id', 'model.layers.24.mlp.down_proj.SV', 'model.layers.23.self_attn.qkv_proj.Wscale', 'model.layers.5.self_attn.qkv_proj.codebook_id', 'model.layers.21.self_attn.o_proj.codebook_id', 'model.layers.29.mlp.upgate_proj.SV', 'model.layers.24.self_attn.qkv_proj.fuse_scales', 'model.layers.16.self_attn.qkv_proj.SU', 'model.layers.1.self_attn.qkv_proj.codebook_id', 'model.layers.13.self_attn.o_proj.SU', 'model.layers.16.mlp.down_proj.Qidxs', 'model.layers.21.mlp.down_proj.codebook_id', 'model.layers.25.mlp.upgate_proj.Qidxs', 'model.layers.3.self_attn.qkv_proj.SV', 'model.layers.1.mlp.upgate_proj.codebook_id', 'model.layers.18.self_attn.qkv_proj.codebook_id', 'model.layers.26.mlp.upgate_proj.Wscale', 'model.layers.9.self_attn.qkv_proj.Wscale', 'model.layers.6.self_attn.o_proj.Qidxs', 'model.layers.22.self_attn.qkv_proj.Qidxs', 'model.layers.23.mlp.down_proj.Qidxs', 'model.layers.19.mlp.upgate_proj.codebook_id', 'model.layers.26.self_attn.o_proj.Wscale', 'model.layers.17.mlp.upgate_proj.codebook_id', 'model.layers.11.mlp.down_proj.SV', 'model.layers.6.mlp.down_proj.Qidxs', 'model.layers.30.mlp.upgate_proj.SU', 'model.layers.19.self_attn.qkv_proj.Wscale', 'model.layers.17.self_attn.o_proj.Qidxs', 'model.layers.7.self_attn.o_proj.Qidxs', 'model.layers.22.self_attn.qkv_proj.fuse_scales', 'model.layers.6.self_attn.qkv_proj.fuse_scales', 'model.layers.18.mlp.down_proj.Wscale', 'model.layers.4.self_attn.o_proj.codebook_id', 'model.layers.17.mlp.down_proj.Qidxs', 'model.layers.4.mlp.upgate_proj.codebook_id', 'model.layers.27.self_attn.o_proj.codebook_id', 'model.layers.30.mlp.upgate_proj.SV', 'model.layers.0.self_attn.qkv_proj.codebook_id', 'model.layers.15.mlp.upgate_proj.SU', 'model.layers.16.mlp.upgate_proj.Wscale', 'model.layers.13.self_attn.o_proj.Qidxs', 'model.layers.16.mlp.down_proj.SV', 'model.layers.30.self_attn.qkv_proj.Wscale', 'model.layers.7.mlp.upgate_proj.SU', 'model.layers.3.self_attn.qkv_proj.Wscale', 'model.layers.17.self_attn.qkv_proj.codebook_id', 'model.layers.2.mlp.down_proj.Wscale', 'model.layers.31.self_attn.o_proj.SV', 'model.layers.8.self_attn.qkv_proj.codebook_id', 
'model.layers.17.self_attn.qkv_proj.fuse_scales', 'model.layers.3.self_attn.qkv_proj.Qidxs', 'model.layers.0.mlp.upgate_proj.codebook_id', 'model.layers.8.mlp.upgate_proj.SV', 'model.layers.28.self_attn.qkv_proj.SU', 'model.layers.4.mlp.down_proj.SU', 'model.layers.27.mlp.upgate_proj.SU', 'model.layers.9.self_attn.o_proj.codebook_id', 'model.layers.10.self_attn.qkv_proj.SV', 'model.layers.15.mlp.down_proj.codebook_id', 'model.layers.12.self_attn.qkv_proj.SU', 'model.layers.18.self_attn.qkv_proj.Qidxs', 'model.layers.10.self_attn.qkv_proj.codebook_id', 'model.layers.14.self_attn.o_proj.SV', 'model.layers.31.self_attn.qkv_proj.Wscale', 'model.layers.16.self_attn.qkv_proj.codebook_id', 'model.layers.20.mlp.upgate_proj.Qidxs', 'model.layers.3.mlp.down_proj.codebook_id', 'model.layers.13.self_attn.qkv_proj.Wscale', 'model.layers.30.mlp.down_proj.Wscale', 'model.layers.1.mlp.upgate_proj.SV', 'model.layers.28.self_attn.qkv_proj.fuse_scales', 'model.layers.31.mlp.upgate_proj.fuse_scales', 'model.layers.28.mlp.down_proj.SV', 'model.layers.18.self_attn.o_proj.SV', 'model.layers.0.self_attn.o_proj.codebook_id', 'model.layers.22.mlp.down_proj.SU', 'model.layers.6.self_attn.qkv_proj.codebook_id', 'model.layers.28.self_attn.o_proj.codebook_id', 'model.layers.22.mlp.upgate_proj.Qidxs', 'model.layers.18.self_attn.qkv_proj.fuse_scales', 'model.layers.24.mlp.down_proj.Wscale', 'model.layers.23.self_attn.qkv_proj.fuse_scales', 'model.layers.26.mlp.down_proj.codebook_id', 'model.layers.21.self_attn.o_proj.SU', 'model.layers.28.self_attn.o_proj.SU', 'model.layers.10.self_attn.qkv_proj.Qidxs', 'model.layers.16.mlp.down_proj.codebook_id', 'model.layers.28.mlp.upgate_proj.Wscale', 'model.layers.30.self_attn.o_proj.codebook_id', 'model.layers.31.self_attn.qkv_proj.fuse_scales', 'model.layers.14.self_attn.qkv_proj.codebook_id', 'model.layers.2.self_attn.qkv_proj.fuse_scales', 'model.layers.29.self_attn.o_proj.Qidxs', 'model.layers.21.self_attn.qkv_proj.SV', 'model.layers.2.mlp.upgate_proj.SV', 'model.layers.24.self_attn.o_proj.Wscale', 'model.layers.21.mlp.down_proj.SV', 'model.layers.4.mlp.upgate_proj.fuse_scales', 'model.layers.1.mlp.down_proj.codebook_id', 'model.layers.28.mlp.down_proj.Qidxs', 'model.layers.12.mlp.upgate_proj.Qidxs', 'model.layers.29.mlp.upgate_proj.Qidxs', 'model.layers.2.self_attn.o_proj.SV', 'model.layers.29.mlp.upgate_proj.codebook_id', 'model.layers.10.mlp.upgate_proj.SU', 'model.layers.15.mlp.down_proj.Wscale', 'model.layers.30.self_attn.o_proj.Wscale', 'model.layers.5.self_attn.qkv_proj.fuse_scales', 'model.layers.15.mlp.upgate_proj.codebook_id', 'model.layers.28.mlp.down_proj.Wscale', 'model.layers.18.mlp.upgate_proj.SU', 'model.layers.18.mlp.upgate_proj.SV', 'model.layers.23.self_attn.o_proj.Qidxs', 'model.layers.27.mlp.upgate_proj.Wscale', 'model.layers.25.self_attn.o_proj.codebook_id', 'model.layers.19.self_attn.qkv_proj.SU', 'model.layers.1.mlp.down_proj.Qidxs', 'model.layers.1.mlp.upgate_proj.SU', 'model.layers.28.mlp.upgate_proj.SV', 'model.layers.9.mlp.down_proj.codebook_id', 'model.layers.24.mlp.upgate_proj.codebook_id', 'model.layers.0.self_attn.o_proj.SV', 'model.layers.9.mlp.down_proj.SV', 'model.layers.1.self_attn.o_proj.Qidxs', 'model.layers.22.self_attn.qkv_proj.codebook_id', 'model.layers.25.mlp.upgate_proj.codebook_id', 'model.layers.1.self_attn.o_proj.codebook_id', 'model.layers.27.mlp.down_proj.SV', 'model.layers.27.self_attn.qkv_proj.fuse_scales', 'model.layers.5.self_attn.qkv_proj.SU', 'model.layers.17.self_attn.qkv_proj.SV', 'model.layers.27.self_attn.qkv_proj.SV', 
'model.layers.24.self_attn.qkv_proj.Wscale', 'model.layers.15.self_attn.qkv_proj.SU', 'model.layers.22.mlp.down_proj.Wscale', 'model.layers.7.self_attn.o_proj.SV', 'model.layers.12.mlp.down_proj.Qidxs', 'model.layers.13.self_attn.o_proj.SV', 'model.layers.5.mlp.down_proj.Qidxs', 'model.layers.21.mlp.upgate_proj.fuse_scales', 'model.layers.12.mlp.down_proj.codebook_id', 'model.layers.20.mlp.upgate_proj.Wscale', 'model.layers.27.mlp.upgate_proj.fuse_scales', 'model.layers.1.self_attn.qkv_proj.Wscale', 'model.layers.0.mlp.upgate_proj.SU', 'model.layers.30.self_attn.o_proj.SU', 'model.layers.18.self_attn.o_proj.SU', 'model.layers.7.mlp.upgate_proj.Qidxs', 'model.layers.23.self_attn.qkv_proj.SV', 'model.layers.17.self_attn.o_proj.SU', 'model.layers.7.mlp.down_proj.SV', 'model.layers.30.self_attn.qkv_proj.codebook_id', 'model.layers.10.self_attn.qkv_proj.fuse_scales', 'model.layers.13.mlp.upgate_proj.SV', 'model.layers.16.self_attn.qkv_proj.Qidxs', 'model.layers.22.mlp.down_proj.codebook_id', 'model.layers.23.mlp.upgate_proj.Qidxs', 'model.layers.23.mlp.upgate_proj.fuse_scales', 'model.layers.22.self_attn.qkv_proj.SV', 'model.layers.13.self_attn.qkv_proj.SV', 'model.layers.5.mlp.upgate_proj.Wscale', 'model.layers.0.mlp.down_proj.SV', 'model.layers.20.self_attn.qkv_proj.Qidxs', 'model.layers.27.mlp.down_proj.SU', 'model.layers.19.self_attn.o_proj.codebook_id', 'model.layers.14.self_attn.o_proj.Wscale', 'model.layers.4.mlp.upgate_proj.SU', 'model.layers.20.self_attn.qkv_proj.fuse_scales', 'model.layers.10.mlp.upgate_proj.codebook_id', 'model.layers.13.mlp.upgate_proj.fuse_scales', 'model.layers.22.self_attn.o_proj.Qidxs', 'model.layers.5.self_attn.qkv_proj.Wscale', 'model.layers.22.mlp.upgate_proj.SV', 'model.layers.24.self_attn.o_proj.codebook_id', 'model.layers.25.self_attn.qkv_proj.SU', 'model.layers.6.self_attn.qkv_proj.Wscale', 'model.layers.7.mlp.upgate_proj.fuse_scales', 'model.layers.25.self_attn.qkv_proj.SV', 'model.layers.25.self_attn.qkv_proj.Wscale', 'model.layers.19.self_attn.o_proj.SV', 'model.layers.12.mlp.upgate_proj.SU', 'model.layers.10.mlp.upgate_proj.Wscale', 'model.layers.4.mlp.upgate_proj.SV', 'model.layers.4.mlp.upgate_proj.Wscale', 'model.layers.11.self_attn.o_proj.SU', 'model.layers.27.mlp.upgate_proj.Qidxs', 'model.layers.4.self_attn.qkv_proj.codebook_id', 'model.layers.2.self_attn.qkv_proj.codebook_id', 'model.layers.7.self_attn.o_proj.SU', 'model.layers.25.mlp.upgate_proj.fuse_scales', 'model.layers.12.self_attn.o_proj.SV', 'model.layers.20.self_attn.qkv_proj.SU', 'model.layers.22.self_attn.o_proj.SV', 'model.layers.16.mlp.upgate_proj.Qidxs', 'model.layers.5.mlp.upgate_proj.SU', 'model.layers.9.mlp.upgate_proj.SV', 'model.layers.4.self_attn.qkv_proj.SV', 'model.layers.4.mlp.down_proj.Qidxs', 'model.layers.26.mlp.upgate_proj.SU', 'model.layers.19.self_attn.qkv_proj.codebook_id', 'model.layers.20.self_attn.o_proj.SU', 'model.layers.29.mlp.upgate_proj.Wscale', 'model.layers.3.mlp.down_proj.SV', 'model.layers.14.self_attn.qkv_proj.fuse_scales', 'model.layers.21.self_attn.qkv_proj.Wscale', 'model.layers.26.mlp.upgate_proj.fuse_scales', 'model.layers.15.mlp.upgate_proj.fuse_scales', 'model.layers.25.self_attn.qkv_proj.codebook_id', 'model.layers.3.self_attn.qkv_proj.SU', 'model.layers.7.mlp.upgate_proj.codebook_id', 'model.layers.8.mlp.down_proj.Wscale', 'model.layers.25.mlp.down_proj.SV', 'model.layers.25.mlp.down_proj.SU', 'model.layers.22.mlp.upgate_proj.SU', 'model.layers.8.mlp.down_proj.codebook_id', 'model.layers.11.self_attn.qkv_proj.SV', 
'model.layers.30.mlp.upgate_proj.Qidxs', 'model.layers.23.self_attn.o_proj.codebook_id', 'model.layers.21.mlp.upgate_proj.Qidxs', 'model.layers.5.mlp.down_proj.SU', 'model.layers.8.self_attn.qkv_proj.Qidxs', 'model.layers.31.mlp.upgate_proj.SV', 'model.layers.8.mlp.down_proj.SV', 'model.layers.18.mlp.down_proj.SU', 'model.layers.24.self_attn.qkv_proj.SU', 'model.layers.14.mlp.upgate_proj.Qidxs', 'model.layers.8.mlp.upgate_proj.Qidxs', 'model.layers.3.mlp.upgate_proj.fuse_scales', 'model.layers.27.self_attn.o_proj.SU', 'model.layers.13.mlp.down_proj.SV', 'model.layers.4.self_attn.o_proj.Wscale', 'model.layers.14.mlp.down_proj.SU', 'model.layers.9.mlp.upgate_proj.Qidxs', 'model.layers.1.mlp.upgate_proj.fuse_scales', 'model.layers.24.self_attn.o_proj.SV', 'model.layers.30.mlp.down_proj.SV', 'model.layers.20.mlp.upgate_proj.fuse_scales', 'model.layers.2.mlp.down_proj.SV', 'model.layers.23.mlp.upgate_proj.SV', 'model.layers.13.mlp.upgate_proj.Wscale', 'model.layers.11.self_attn.qkv_proj.codebook_id', 'model.layers.3.self_attn.o_proj.codebook_id', 'model.layers.8.self_attn.qkv_proj.Wscale', 'model.layers.28.self_attn.o_proj.Wscale', 'model.layers.13.mlp.upgate_proj.Qidxs', 'model.layers.19.mlp.down_proj.SU', 'model.layers.21.mlp.upgate_proj.SV', 'model.layers.17.mlp.upgate_proj.Qidxs', 'model.layers.26.mlp.down_proj.SV', 'model.layers.24.mlp.upgate_proj.SU', 'model.layers.11.self_attn.o_proj.Wscale', 'model.layers.15.mlp.upgate_proj.Qidxs', 'model.layers.11.mlp.down_proj.Wscale', 'model.layers.0.self_attn.o_proj.Wscale', 'model.layers.13.self_attn.qkv_proj.codebook_id', 'model.layers.8.mlp.down_proj.SU', 'model.layers.9.self_attn.qkv_proj.codebook_id', 'model.layers.31.mlp.down_proj.SV', 'model.layers.15.self_attn.qkv_proj.Qidxs', 'model.layers.30.self_attn.qkv_proj.fuse_scales', 'model.layers.15.self_attn.o_proj.Qidxs', 'model.layers.29.mlp.down_proj.SV', 'model.layers.1.self_attn.qkv_proj.Qidxs', 'model.layers.26.self_attn.qkv_proj.Qidxs', 'model.layers.11.self_attn.qkv_proj.Wscale', 'model.layers.5.mlp.down_proj.Wscale', 'model.layers.11.mlp.upgate_proj.Qidxs', 'model.layers.31.mlp.down_proj.Wscale', 'model.layers.10.self_attn.o_proj.codebook_id', 'model.layers.2.mlp.upgate_proj.Qidxs', 'model.layers.1.self_attn.qkv_proj.fuse_scales', 'model.layers.21.mlp.upgate_proj.SU', 'model.layers.13.mlp.upgate_proj.SU', 'model.layers.2.self_attn.qkv_proj.Qidxs', 'model.layers.0.self_attn.o_proj.Qidxs', 'model.layers.18.self_attn.o_proj.Wscale', 'model.layers.6.mlp.down_proj.Wscale', 'model.layers.7.mlp.down_proj.codebook_id', 'model.layers.12.self_attn.o_proj.Wscale', 'model.layers.27.mlp.down_proj.Wscale', 'model.layers.11.mlp.upgate_proj.fuse_scales', 'model.layers.17.self_attn.o_proj.codebook_id', 'model.layers.17.mlp.down_proj.SV', 'model.layers.7.self_attn.qkv_proj.Wscale', 'model.layers.14.self_attn.qkv_proj.SU', 'model.layers.9.mlp.down_proj.Wscale', 'model.layers.27.mlp.upgate_proj.SV', 'model.layers.29.self_attn.qkv_proj.Qidxs', 'model.layers.25.mlp.down_proj.codebook_id', 'model.layers.18.mlp.upgate_proj.fuse_scales', 'model.layers.31.self_attn.o_proj.SU', 'model.layers.5.mlp.upgate_proj.Qidxs', 'model.layers.3.mlp.down_proj.Wscale', 'model.layers.25.self_attn.o_proj.SV', 'model.layers.8.self_attn.o_proj.Wscale', 'model.layers.23.mlp.down_proj.Wscale', 'model.layers.14.self_attn.o_proj.codebook_id', 'model.layers.2.self_attn.qkv_proj.Wscale', 'model.layers.6.self_attn.o_proj.SU', 'model.layers.11.mlp.upgate_proj.Wscale', 'model.layers.19.self_attn.o_proj.Wscale', 
'model.layers.6.mlp.down_proj.SV', 'model.layers.19.self_attn.qkv_proj.fuse_scales', 'model.layers.22.mlp.upgate_proj.Wscale', 'model.layers.27.mlp.down_proj.Qidxs', 'model.layers.31.self_attn.qkv_proj.codebook_id', 'model.layers.16.self_attn.o_proj.SV', 'model.layers.9.mlp.upgate_proj.codebook_id', 'model.layers.5.self_attn.o_proj.codebook_id', 'model.layers.7.mlp.upgate_proj.Wscale', 'model.layers.29.self_attn.o_proj.codebook_id', 'model.layers.24.mlp.down_proj.SU', 'model.layers.12.self_attn.o_proj.Qidxs', 'model.layers.21.self_attn.qkv_proj.fuse_scales', 'model.layers.13.mlp.upgate_proj.codebook_id', 'model.layers.14.mlp.down_proj.codebook_id', 'model.layers.16.self_attn.o_proj.codebook_id', 'model.layers.17.self_attn.o_proj.Wscale', 'model.layers.3.self_attn.qkv_proj.codebook_id', 'model.layers.8.self_attn.o_proj.SV', 'model.layers.25.self_attn.qkv_proj.fuse_scales', 'model.layers.8.mlp.upgate_proj.fuse_scales', 'model.layers.27.self_attn.qkv_proj.Qidxs', 'model.layers.23.self_attn.o_proj.Wscale', 'model.layers.12.self_attn.o_proj.SU', 'model.layers.20.mlp.upgate_proj.SU', 'model.layers.2.mlp.upgate_proj.codebook_id', 'model.layers.18.self_attn.o_proj.codebook_id', 'model.layers.29.self_attn.qkv_proj.codebook_id', 'model.layers.8.mlp.down_proj.Qidxs', 'model.layers.22.mlp.down_proj.SV', 'model.layers.8.self_attn.o_proj.SU', 'model.layers.13.self_attn.qkv_proj.SU', 'model.layers.10.self_attn.qkv_proj.Wscale', 'model.layers.4.mlp.down_proj.Wscale', 'model.layers.14.mlp.down_proj.SV', 'model.layers.1.self_attn.o_proj.SU', 'model.layers.17.self_attn.qkv_proj.Qidxs', 'model.layers.28.self_attn.qkv_proj.Qidxs', 'model.layers.16.self_attn.qkv_proj.SV', 'model.layers.3.mlp.upgate_proj.SV', 'model.layers.13.mlp.down_proj.SU', 'model.layers.10.mlp.down_proj.Qidxs', 'model.layers.27.self_attn.qkv_proj.Wscale', 'model.layers.12.self_attn.qkv_proj.fuse_scales', 'model.layers.24.self_attn.qkv_proj.codebook_id', 'model.layers.31.mlp.down_proj.Qidxs', 'model.layers.17.mlp.down_proj.codebook_id', 'model.layers.11.self_attn.qkv_proj.Qidxs', 'model.layers.22.self_attn.o_proj.SU', 'model.layers.6.mlp.upgate_proj.codebook_id', 'model.layers.6.self_attn.o_proj.Wscale', 'model.layers.5.mlp.upgate_proj.codebook_id', 'model.layers.23.mlp.upgate_proj.codebook_id', 'model.layers.11.mlp.down_proj.codebook_id', 'model.layers.26.mlp.down_proj.SU', 'model.layers.20.self_attn.o_proj.SV', 'model.layers.6.mlp.upgate_proj.SU', 'model.layers.10.mlp.down_proj.codebook_id', 'model.layers.3.mlp.down_proj.SU', 'model.layers.7.self_attn.qkv_proj.codebook_id', 'model.layers.12.self_attn.o_proj.codebook_id', 'model.layers.10.mlp.down_proj.Wscale', 'model.layers.7.self_attn.qkv_proj.SV', 'model.layers.6.self_attn.qkv_proj.SU', 'model.layers.22.self_attn.o_proj.codebook_id', 'model.layers.22.mlp.upgate_proj.fuse_scales', 'model.layers.3.mlp.down_proj.Qidxs', 'model.layers.15.self_attn.o_proj.SU', 'model.layers.21.self_attn.qkv_proj.codebook_id', 'model.layers.1.self_attn.qkv_proj.SU', 'model.layers.31.mlp.upgate_proj.Wscale', 'model.layers.8.mlp.upgate_proj.Wscale', 'model.layers.11.self_attn.qkv_proj.fuse_scales', 'model.layers.16.mlp.down_proj.SU', 'model.layers.31.mlp.upgate_proj.Qidxs', 'model.layers.12.self_attn.qkv_proj.Wscale', 'model.layers.15.mlp.down_proj.SU', 'model.layers.20.self_attn.qkv_proj.codebook_id', 'model.layers.18.self_attn.o_proj.Qidxs', 'model.layers.29.mlp.upgate_proj.fuse_scales', 'model.layers.26.self_attn.qkv_proj.SV', 'model.layers.9.self_attn.o_proj.Wscale', 
'model.layers.28.mlp.upgate_proj.fuse_scales', 'model.layers.5.mlp.upgate_proj.SV', 'model.layers.14.mlp.down_proj.Wscale', 'model.layers.16.mlp.upgate_proj.codebook_id', 'model.layers.17.mlp.upgate_proj.SV', 'model.layers.28.self_attn.qkv_proj.SV', 'model.layers.2.self_attn.qkv_proj.SU', 'model.layers.1.mlp.down_proj.Wscale', 'model.layers.7.self_attn.qkv_proj.SU', 'model.layers.26.self_attn.o_proj.codebook_id', 'model.layers.18.self_attn.qkv_proj.Wscale', 'model.layers.18.mlp.upgate_proj.Wscale', 'model.layers.22.mlp.upgate_proj.codebook_id', 'model.layers.28.self_attn.o_proj.SV', 'model.layers.7.mlp.down_proj.SU', 'model.layers.18.self_attn.qkv_proj.SU', 'model.layers.9.self_attn.qkv_proj.SU', 'model.layers.14.self_attn.qkv_proj.SV', 'model.layers.7.mlp.down_proj.Qidxs', 'model.layers.20.self_attn.qkv_proj.SV', 'model.layers.7.self_attn.o_proj.codebook_id', 'model.layers.13.self_attn.o_proj.Wscale', 'model.layers.7.self_attn.o_proj.Wscale', 'model.layers.4.mlp.down_proj.SV', 'model.layers.1.self_attn.o_proj.SV', 'model.layers.13.mlp.down_proj.Wscale', 'model.layers.25.mlp.upgate_proj.SU', 'model.layers.14.mlp.upgate_proj.SV', 'model.layers.0.self_attn.qkv_proj.SV', 'model.layers.25.mlp.upgate_proj.Wscale', 'model.layers.25.mlp.down_proj.Qidxs', 'model.layers.15.mlp.upgate_proj.Wscale', 'model.layers.1.self_attn.o_proj.Wscale', 'model.layers.26.mlp.upgate_proj.Qidxs', 'model.layers.31.self_attn.qkv_proj.SU', 'model.layers.6.mlp.upgate_proj.SV']
- This IS expected if you are initializing LlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at relaxml/Llama-2-7b-E8P-2Bit and are newly initialized: ['model.layers.15.self_attn.q_proj.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.16.mlp.up_proj.weight', 'model.layers.21.mlp.gate_proj.weight', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.3.mlp.up_proj.weight', 'model.layers.22.self_attn.o_proj.weight', 'model.layers.17.mlp.up_proj.weight', 'model.layers.26.self_attn.o_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.21.mlp.down_proj.weight', 'model.layers.3.self_attn.o_proj.weight', 'model.layers.4.mlp.up_proj.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'model.layers.15.self_attn.o_proj.weight', 'model.layers.10.self_attn.o_proj.weight', 'model.layers.18.self_attn.o_proj.weight', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.20.mlp.gate_proj.weight', 'model.layers.1.self_attn.o_proj.weight', 'model.layers.17.mlp.gate_proj.weight', 'model.layers.5.mlp.down_proj.weight', 'model.layers.29.mlp.down_proj.weight', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.24.mlp.up_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.2.self_attn.o_proj.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.10.mlp.gate_proj.weight', 'model.layers.28.mlp.up_proj.weight', 'model.layers.5.self_attn.o_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.8.self_attn.o_proj.weight', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.4.mlp.down_proj.weight', 'model.layers.12.self_attn.o_proj.weight', 'model.layers.12.mlp.down_proj.weight', 'model.layers.9.mlp.gate_proj.weight', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.31.mlp.up_proj.weight', 'model.layers.19.self_attn.o_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.6.mlp.up_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.14.self_attn.o_proj.weight', 'model.layers.30.mlp.gate_proj.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.3.mlp.gate_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.16.self_attn.o_proj.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.30.self_attn.q_proj.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.9.self_attn.v_proj.weight', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.18.mlp.up_proj.weight', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.16.mlp.gate_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 
'model.layers.26.mlp.down_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.7.mlp.down_proj.weight', 'model.layers.7.mlp.up_proj.weight', 'model.layers.8.mlp.up_proj.weight', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.25.self_attn.o_proj.weight', 'model.layers.7.mlp.gate_proj.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.21.self_attn.o_proj.weight', 'model.layers.21.mlp.up_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.4.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.30.self_attn.o_proj.weight', 'model.layers.8.mlp.gate_proj.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.4.self_attn.o_proj.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.24.self_attn.o_proj.weight', 'model.layers.2.mlp.down_proj.weight', 'model.layers.3.mlp.down_proj.weight', 'model.layers.19.mlp.gate_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.30.mlp.up_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.27.mlp.down_proj.weight', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.12.mlp.gate_proj.weight', 'model.layers.31.self_attn.q_proj.weight', 'model.layers.31.self_attn.o_proj.weight', 'model.layers.30.mlp.down_proj.weight', 'model.layers.31.mlp.gate_proj.weight', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.8.mlp.down_proj.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.5.mlp.gate_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.9.self_attn.o_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.30.self_attn.k_proj.weight', 'model.layers.28.mlp.down_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.9.mlp.up_proj.weight', 'model.layers.11.mlp.gate_proj.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.10.mlp.up_proj.weight', 'model.layers.2.mlp.up_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.29.mlp.up_proj.weight', 'model.layers.31.mlp.down_proj.weight', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.6.mlp.gate_proj.weight', 'model.layers.12.mlp.up_proj.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.18.mlp.gate_proj.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.20.mlp.down_proj.weight', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.2.self_attn.k_proj.weight', 
'model.layers.19.self_attn.q_proj.weight', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.20.mlp.up_proj.weight', 'model.layers.28.self_attn.o_proj.weight', 'model.layers.19.mlp.up_proj.weight', 'model.layers.31.self_attn.v_proj.weight', 'model.layers.14.mlp.up_proj.weight', 'model.layers.14.mlp.gate_proj.weight', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.13.mlp.gate_proj.weight', 'model.layers.7.self_attn.o_proj.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.29.mlp.gate_proj.weight', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.25.self_attn.q_proj.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.6.mlp.down_proj.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.13.self_attn.o_proj.weight', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.5.mlp.up_proj.weight', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.11.mlp.up_proj.weight', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.13.mlp.up_proj.weight', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.25.mlp.gate_proj.weight', 'model.layers.6.self_attn.o_proj.weight', 'model.layers.30.self_attn.v_proj.weight', 'model.layers.28.mlp.gate_proj.weight', 'model.layers.29.self_attn.o_proj.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.11.self_attn.o_proj.weight', 'model.layers.9.mlp.down_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.4.self_attn.k_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

I tested this model by doing some text generation:

(Pdb) model.generate(max_length=30)
tensor([[    1, 24279, 31730, 24945,  7771, 17308, 17698, 23431, 29339, 31895,
         10582, 21101, 31730, 21554, 17698, 23431, 17698, 11814, 21727, 21727,
         31730, 23431,  8950, 17698, 31730, 17698, 21727, 22823, 10582, 23431]],
       device='cuda:0')
(Pdb) generated_id = self.model.generate(max_length=30)
(Pdb) tokenizer.batch_decode(generated_id, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
'Bedeutovystru Bayer Browagi scales식reraagicolon scales식ksam富富agirebmathchar富ozzárefix scalesmathchar Bayer scales scales Bedeutksam'

It seems like this checkpoint is not being loaded properly; otherwise the generated text would be better.
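
For reference, the warnings above suggest the checkpoint is being loaded with the stock transformers LlamaForCausalLM, which does not know about the quantized Qidxs/SU/SV/Wscale tensors, so the dense projection weights end up freshly initialized and the output is random. Below is a minimal sketch of loading the checkpoint through the repo's own loader instead; the helper name and signature are assumptions based on the eval scripts (e.g. eval_ppl.py), so adjust to whatever the repo actually exposes.

# Hedged sketch: load the QuIP# checkpoint with the repo's loader so the quantized
# linear layers (Qidxs, SU, SV, Wscale, ...) are actually used for inference.
from transformers import AutoTokenizer
from lib.utils.unsafe_import import model_from_hf_path  # assumed module/helper name

# assumed signature: returns the quantized model plus the base model path for the tokenizer
model, model_str = model_from_hf_path("relaxml/Llama-2-7b-E8P-2Bit")
tokenizer = AutoTokenizer.from_pretrained(model_str)

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.to("cuda")
out = model.generate(ids, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))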

trouble building quiptools

do I need a specific compiler version?

I'm using gcc 13.2

gcc --version
gcc (Ubuntu 13.2.0-4ubuntu3) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I think it's picking up the CUDA 12.0 toolkit in /usr/bin, though the Python requirements target CUDA 12.1.

I also didn't see anything about the python version, so I tried with 3.10 and 3.11.
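
A quick way to confirm which toolchain the build is actually seeing (a small diagnostic sketch, not part of the repo):

import subprocess
import torch

# CUDA version PyTorch was built against (should be 12.1 for the pinned wheels)
print("torch built with CUDA:", torch.version.cuda)
# CUDA toolkit whose nvcc is first on PATH (here /usr/bin/nvcc reports 12.0)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)

If the two disagree, pointing CUDA_HOME and PATH at the 12.1 toolkit before rebuilding is one thing to try. The pybind11 cast.h errors further down are also commonly triggered by a host compiler (gcc 13 here) that is newer than what the CUDA toolkit supports, so pinning an older g++ for the extension build may help as well.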

cd quiptools && python setup.py install && cd ../
running install
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing quiptools_cuda.egg-info/PKG-INFO
writing dependency_links to quiptools_cuda.egg-info/dependency_links.txt
writing top-level names to quiptools_cuda.egg-info/top_level.txt
reading manifest file 'quiptools_cuda.egg-info/SOURCES.txt'
writing manifest file 'quiptools_cuda.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/utils/cpp_extension.py:414: UserWarning: The detected CUDA version (12.0) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no g++ version bounds defined for CUDA version 12.0
  warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'quiptools_cuda' extension
Emitting ninja build file /code/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/bin/nvcc  -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/TH -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/THC -I/home/mihai/.pyenv/versions/quip/include -I/home/mihai/.pyenv/versions/3.11.7/include/python3.11 -c -c /code/quip-sharp/quiptools/quiptools.cu -o /code/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -g -Xcompiler -rdynamic -lineinfo -allow-unsupported-compiler -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17
FAILED: /code/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools.o 
/usr/bin/nvcc  -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/TH -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/THC -I/home/mihai/.pyenv/versions/quip/include -I/home/mihai/.pyenv/versions/3.11.7/include/python3.11 -c -c /code/quip-sharp/quiptools/quiptools.cu -o /code/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -g -Xcompiler -rdynamic -lineinfo -allow-unsupported-compiler -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17
/code/quip-sharp/quiptools/quiptools.cu(34): warning #177-D: function "gpuAssert" was declared but never referenced

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&)’:
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected template-name before ‘<’ token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                        ^
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected identifier before ‘<’ token
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:45:123: error: expected primary-expression before ‘>’ token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                           ^
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:45:126: error: expected primary-expression before ‘)’ token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                              ^
[2/2] /usr/bin/nvcc  -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/TH -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/THC -I/home/mihai/.pyenv/versions/quip/include -I/home/mihai/.pyenv/versions/3.11.7/include/python3.11 -c -c /code/quip-sharp/quiptools/quiptools_e8p_gemv.cu -o /code/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools_e8p_gemv.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -g -Xcompiler -rdynamic -lineinfo -allow-unsupported-compiler -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17
FAILED: /code/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools_e8p_gemv.o 
/usr/bin/nvcc  -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/TH -I/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/THC -I/home/mihai/.pyenv/versions/quip/include -I/home/mihai/.pyenv/versions/3.11.7/include/python3.11 -c -c /code/quip-sharp/quiptools/quiptools_e8p_gemv.cu -o /code/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools_e8p_gemv.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -g -Xcompiler -rdynamic -lineinfo -allow-unsupported-compiler -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++17
/code/quip-sharp/quiptools/quiptools_e8p_gemv.cu(85): warning #177-D: variable "shared_weights" was declared but never referenced

Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"

/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&)’:
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected template-name before ‘<’ token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                        ^
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected identifier before ‘<’ token
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:45:123: error: expected primary-expression before ‘>’ token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                           ^
/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:45:126: error: expected primary-expression before ‘)’ token
   45 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                              ^
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "/home/mihai/.pyenv/versions/3.11.7/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/code/quip-sharp/quiptools/setup.py", line 4, in <module>
    setup(name='quiptools_cuda',
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/__init__.py", line 87, in setup
    return distutils.core.setup(**attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
    return run_commands(dist)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
    dist.run_commands()
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 968, in run_commands
    self.run_command(cmd)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
    cmd_obj.run()
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/command/install.py", line 74, in run
    self.do_egg_install()
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/command/install.py", line 123, in do_egg_install
    self.run_command('bdist_egg')
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
    self.distribution.run_command(command)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
    cmd_obj.run()
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/command/bdist_egg.py", line 165, in run
    cmd = self.call_command('install_lib', warn_dir=0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/command/bdist_egg.py", line 151, in call_command
    self.run_command(cmdname)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
    self.distribution.run_command(command)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
    cmd_obj.run()
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/command/install_lib.py", line 112, in build
    self.run_command('build_ext')
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 319, in run_command
    self.distribution.run_command(command)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/dist.py", line 1217, in run_command
    super().run_command(command)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 987, in run_command
    cmd_obj.run()
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 84, in run
    _build_ext.run(self)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
    self.build_extensions()
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 873, in build_extensions
    build_ext.build_extensions(self)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 466, in build_extensions
    self._build_extensions_serial()
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 492, in _build_extensions_serial
    self.build_extension(ext)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/command/build_ext.py", line 246, in build_extension
    _build_ext.build_extension(self, ext)
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/setuptools/_distutils/command/build_ext.py", line 547, in build_extension
    objects = self.compiler.compile(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 686, in unix_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "/home/mihai/.pyenv/versions/quip/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

PyTorch dequantization

Do you have a Python/PyTorch implementation of the dequantization (and scaling) that is done at inference time?
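
For reference, here is a rough PyTorch sketch of what the dequantization amounts to, inferred from the incoherence transform RHT_W in lib/algo/quip.py rather than taken from the repo's actual inference path (the quiptools CUDA kernels decode on the fly), so treat the codebook lookup API, the import path, and the exact order of operations as assumptions:

from lib.utils import matmul_had  # assumed import path; provides matmul_hadU / matmul_hadUt

def dequantize_weight(codebook, Qidxs, SU, SV, Wscale, out_features, in_features):
    # 1) Decode the stored indices back into weights in the "incoherent" domain and
    #    undo the per-layer scale; `by_idxs` is an assumed lookup method on the codebook.
    Wr = codebook.by_idxs(Qidxs).reshape(out_features, in_features).float() * Wscale
    # 2) Invert the randomized Hadamard transform applied at quantization time.
    #    Forward (RHT_W): Wr = hadUt( hadUt(W.T * SV).T * SU ), with SU/SV random signs.
    Wr = matmul_had.matmul_hadU(Wr) * SU           # undo the transform over the input dim
    W = (matmul_had.matmul_hadU(Wr.T) * SV).T      # undo the transform over the output dim
    return W  # approximate dense weight; fused qkv/upgate layers also carry fuse_scales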

Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

(train) root@WIN-IUEXHKJBZLU:/home/luhao/quip-sharp# CUDA_VISIBLE_DEVICES=0 python quantize_llama.py --base_model Yi-34B-Chat --codebook E8P12
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 15/15 [00:01<00:00, 10.49it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
I1204 12:02:37.810840 3560 quantize_llama.py:289] loaded model
W1204 12:02:40.726335 3560 warnings.py:109] /root/anaconda3/envs/train/lib/python3.9/site-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode='default'.
  table = cls._concat_blocks(blocks, axis=0)

8
18
26
33
38
51
63
64
I1204 12:02:50.217392 3560 quantize_llama.py:293] loaded dataset and devset
Traceback (most recent call last):
  File "/home/luhao/quip-sharp/quantize_llama.py", line 417, in <module>
    main(args)
  File "/home/luhao/quip-sharp/quantize_llama.py", line 342, in main
    model.model.layers[i](
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 635, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 389, in forward
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
  File "/root/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/functional.py", line 1858, in softmax
    ret = input.softmax(dim, dtype=dtype)
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

distribute the memory usage evenly across both cards?

Can quantization be parallelized across GPUs so that the memory usage is distributed over two graphics cards? Quantizing a 34B model requires approximately 30GB, but my two RTX 3090 cards have only 24GB each, so it is currently impossible to perform quantization on a single card. How can I distribute the memory usage evenly across both cards?
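
I don't know whether the scripts support this out of the box, but the usual way to shard the fp16 model across both cards with transformers/accelerate looks like the sketch below; the path and memory limits are placeholders, and quantize_llama.py may need a small edit where the model is loaded to accept it.

import torch
from transformers import AutoModelForCausalLM

# Shard the unquantized 34B model across both 3090s instead of loading it on one card.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/Yi-34B-Chat",                  # placeholder local path
    torch_dtype=torch.float16,
    device_map="auto",                      # let accelerate place layers on cuda:0 / cuda:1
    max_memory={0: "22GiB", 1: "22GiB"},    # leave some headroom on each 24GB card
)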

How to quantize a 1.3B model to 2 bits

Thanks for your wonderful work!
I am trying to use quip-sharp to quantize my 1.3B model, which uses the LLaMA architecture, to 2 bits. The config of my model is:

config = LlamaConfig(
    vocab_size=len(dictionary),
    hidden_size=2048,
    intermediate_size=5460,
    num_hidden_layers=24,
    num_attention_heads=32,
    num_key_value_heads=None,
    hidden_act="silu",
    max_position_embeddings=2048,
    initializer_range=0.02,
    rms_norm_eps=1e-6,
    use_cache=True,
    pad_token_id=dictionary.pad(),
    bos_token_id=dictionary.bos(),
    eos_token_id=dictionary.eos(),
    pretraining_tp=1,
    tie_word_embeddings=True,
    rope_theta=10000.0,
    rope_scaling=None,
    attention_bias=False,
)

However, I encounter the following error:

Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/bit/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/xxx/anaconda3/envs/bit/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/quantize_llama.py", line 261, in quantize_layer_queue
    quantize_layer(*next_item, cb, args, device, False)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/quantize_llama.py", line 246, in quantize_layer
    quantize_up(layer, idx, cb, args, device, check_only=not return_layer)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/quantize_llama.py", line 162, in quantize_up
    hatW, attr = quip.quantize(H, W_upgate, args.lora_rank, cb, args, device)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/algo/quip.py", line 331, in quantize
    incoh_out = incoherence_preprocess(H, W, args)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/algo/quip.py", line 48, in incoherence_preprocess
    Wr = RHT_W(Wr, SU, SV)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/algo/quip.py", line 13, in RHT_W
    return utils.matmul_hadUt(utils.matmul_hadUt(W.T * SV).T * SU)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/utils/matmul_had.py", line 96, in matmul_hadUt
    return matmul_hadU(X, transpose=True)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/utils/matmul_had.py", line 78, in matmul_hadU
    hadK, K = get_hadK(n, transpose)
  File "/home/xxx/torchscale-examples/torchscale-scaling2/quip_sharp/lib/utils/matmul_had.py", line 16, in get_hadK
    assert (is_pow2(n // 156))
AssertionError

Here n is 10920. Could you provide some suggestions to resolve the problem?
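
For context, the assertion fires because the randomized Hadamard transform only supports dimensions of the form 2^k * K for a small set of base sizes K, and 10920 (the fused up/gate dimension, 2 * 5460) does not factor that way. A small compatibility check follows; the exact list of base sizes is an assumption, and the authoritative set is whatever lib/utils/matmul_had.py defines.

def is_pow2(n):
    return n > 0 and (n & (n - 1)) == 0

# Base Hadamard sizes; this list is an assumption -- check lib/utils/matmul_had.py.
HAD_BASES = [172, 156, 140, 108, 60, 52, 40, 36, 28, 20, 12, 1]

def rht_compatible(n):
    return any(n % k == 0 and is_pow2(n // k) for k in HAD_BASES)

print(rht_compatible(10920))  # False: 10920 = 8 * 1365, and 1365 is not a supported base
print(rht_compatible(5460))   # also False, so the intermediate_size itself is the problem

So resolving this likely means choosing an intermediate_size (and hence a fused up/gate dimension) that satisfies this constraint, or padding the layer, rather than 5460.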

Thx~

Problem with namespace nvcuda

Hi! I am getting this error:

FAILED: /media/indrema/6372D35F27CA2C9A1/llama/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools.o
/usr/local/cuda/bin/nvcc -I/home/indrema/miniconda3/envs/quip/lib/python3.11/site-packages/torch/include -I/home/indrema/miniconda3/envs/quip/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/home/indrema/miniconda3/envs/quip/lib/python3.11/site-packages/torch/include/TH -I/home/indrema/miniconda3/envs/quip/lib/python3.11/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/indrema/miniconda3/envs/quip/include/python3.11 -c -c /media/indrema/6372D35F27CA2C9A1/llama/quip-sharp/quiptools/quiptools.cu -o /media/indrema/6372D35F27CA2C9A1/llama/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -g -Xcompiler -rdynamic -lineinfo -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17
/media/indrema/6372D35F27CA2C9A1/llama/quip-sharp/quiptools/quiptools.cu(23): error: name must be a namespace name
using namespace nvcuda;
^

Any idea? (CUDA 12.1 is installed.)
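
The "using namespace nvcuda;" failure usually means the tensor-core WMMA API is unavailable for one of the architectures being compiled: the nvcc command above includes -gencode arch=compute_61, and WMMA requires sm_70 or newer. A quick diagnostic sketch (not from the repo) to see what your GPUs report:

import torch

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    cap = torch.cuda.get_device_capability(i)  # e.g. (8, 9) for an Ada card
    print(i, name, cap)

If an older card is pulling compute_61 into the build, restricting the target architectures (for example via the TORCH_CUDA_ARCH_LIST environment variable) before rebuilding is one thing to try.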

3 bit quantization

Does the method also work well with smaller codebooks that result in about 3 bits per weight?
Might be worth trying.
