
omniquant's Introduction

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models



OmniQuant is a simple and powerful quantization technique for LLMs. The current release supports:

  • OmniQuant algorithm for accurate weight-only quantization (W4A16/W3A16/W2A16) and weight-activation quantization (W6A6, W4A4)
  • Pre-trained OmniQuant model zoo for LLMs (LLaMA-1&2, LLaMA-2-Chat, OPT, Falcon, Mixtral-8x7B); load these to generate quantized weights.
  • An out-of-the-box case that leverages MLC-LLM to run LLaMA-2-Chat (7B/13B) with W3A16g128 quantization on GPUs and mobile phones.

News

  • [2024/7] 🔥 We release a new quantization algorithm, EfficientQAT, which realizes quantization-aware training in a time- and memory-efficient manner. EfficientQAT is the current SoTA for uniform quantization.
  • [2024/1] 🌟 Our OmniQuant paper has been accepted for a Spotlight presentation at ICLR 2024 (only top 5% out of over 7200 submissions)! 🎉 Cheers!
  • [2023/12] 🔥 We provide support for Mixtral-8x7B. OmniQuant achieves near-lossless 4-bit quantization of Mixtral-8x7B-v0.1, reducing the memory requirement from 87GB to 23GB. Notably, the work-in-progress OmniQuant v2 is anticipated to outperform OmniQuant v1; stay tuned. You can access the model URL and the inference code for the quantized models at runing_quantized_mixtral_7bx8.
  • [2023/09] 🔥 We have expanded support to Falcon. OmniQuant efficiently compresses Falcon-180b from 335GB to 65GB with minimal performance loss, which allows Falcon-180b inference on a single A100 80GB GPU. For details, refer to runing_falcon180b_on_single_a100_80g.

Contents

Install

conda create -n omniquant python=3.10 -y
conda activate omniquant
git clone https://github.com/OpenGVLab/OmniQuant.git
cd OmniQuant
pip install --upgrade pip 
pip install -e .

We also leverage the kernel from AutoGPTQ to achieve real quantization, so you should also install the bug-fixed AutoGPTQ as follows:

git clone https://github.com/ChenMnZ/AutoGPTQ-bugfix
cd AutoGPTQ-bugfix
pip install -v .

OmniQuant Model Zoo

We provide a pre-trained OmniQuant model zoo for multiple model families, including LLaMA-1&2, LLaMA-2-Chat, and OPT.

You can download the pre-trained OmniQuant parameters you need from Hugging Face.

The detailed support list:

W2A16 / W2A16g128 / W2A16g64 / W3A16:

  • LLaMA: 7B/13B/30B/65B
  • LLaMA-2: 7B/13B/70B
  • OPT: 125m/1.3B/2.7B/6.7B/13B/30B/66B

W3A16g128 / W4A16 / W4A16g128 / W6A6 / W4A4:

  • LLaMA: 7B/13B/30B/65B
  • LLaMA-2: 7B/13B/70B
  • OPT: 125m/1.3B/2.7B/6.7B/13B/30B/66B
  • LLaMA-2-Chat: 7B/13B

Usage

We provide full scripts to run OmniQuant in ./scripts/. We use LLaMA-7B as an example here:

  1. Obtain the channel-wise scales and shifts required for initialization:
conda install git git-lfs
git lfs install
git clone https://huggingface.co/ChenMnZ/act_shifts
git clone https://huggingface.co/ChenMnZ/act_scales

Optionally, we also offer a script so that you can generate the channel-wise scales and shifts yourself (a conceptual sketch of what this computes is shown after the command below):

python generate_act_scale_shift.py --model /PATH/TO/LLaMA/llama-7b
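
For intuition, here is a minimal sketch of how such channel-wise activation statistics can be collected with forward hooks. It is not the actual generate_act_scale_shift.py: the model name, the tiny calibration text, and the shift definition (midpoint of the per-channel min/max) are illustrative assumptions.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"   # illustrative; any supported model path works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

act_scales, act_shifts = {}, {}

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach().flatten(0, -2).float()   # [tokens, in_channels]
        scale = x.abs().amax(dim=0)                     # per-channel |activation| max
        shift = (x.amax(dim=0) + x.amin(dim=0)) / 2     # per-channel midpoint
        act_scales[name] = torch.maximum(act_scales.get(name, scale), scale)
        act_shifts[name] = shift if name not in act_shifts else (act_shifts[name] + shift) / 2
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules() if isinstance(m, nn.Linear)]

# A single sentence stands in for a real calibration set (e.g. WikiText-2 samples).
with torch.no_grad():
    model(**tok("OmniQuant collects channel-wise statistics here.", return_tensors="pt"))

for h in handles:
    h.remove()
torch.save(act_scales, "act_scales_demo.pt")
torch.save(act_shifts, "act_shifts_demo.pt")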
  2. Weight-only quantization
# W3A16
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w3a16 \
--eval_ppl --wbits 3 --abits 16 --lwc

# W3A16g128
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w3a16g128 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc
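
For intuition, the commands above perform "fake" weight quantization with learnable weight clipping (--lwc). Below is a minimal sketch of group-wise asymmetric fake quantization with a learnable clipping factor in the spirit of LWC; it is not the repository's quantizer, the sigmoid parameterization and tensor shapes are assumptions, and a straight-through estimator (omitted here) would be needed for full gradient flow through the rounding.

import torch

def lwc_fake_quant(w: torch.Tensor, n_bits: int, group_size: int, clip_logit: torch.Tensor) -> torch.Tensor:
    """Quantize-dequantize w ([out, in]) group-wise; clip_logit holds one learnable value per group."""
    qmax = 2 ** n_bits - 1
    wg = w.reshape(-1, group_size)                       # each row is one quantization group
    gamma = torch.sigmoid(clip_logit)                    # learnable clipping strength in (0, 1)
    wmin = wg.amin(dim=1, keepdim=True) * gamma
    wmax = wg.amax(dim=1, keepdim=True) * gamma
    scale = (wmax - wmin).clamp(min=1e-5) / qmax
    zero_point = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(wg / scale) + zero_point, 0, qmax)
    return ((q - zero_point) * scale).reshape(w.shape)   # dequantized ("fake quant") weights

w = torch.randn(4096, 4096)
clip_logit = torch.full((w.numel() // 128, 1), 4.0, requires_grad=True)  # the LWC-style parameter
w_q = lwc_fake_quant(w, n_bits=3, group_size=128, clip_logit=clip_logit)
print("mean |w - w_q|:", (w - w_q).abs().mean().item())  # reconstruction error to be minimized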
  3. Weight-activation quantization
# W4A4
CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--eval_ppl --wbits 4 --abits 4 --lwc --let \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
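
For intuition, --let additionally learns channel-wise scales (and shifts) that are moved between activations and the following weights without changing the layer's output. Below is a minimal sketch of such an equivalent transformation; it is illustrative only, with a fixed scale where OmniQuant's is learnable, and the layer sizes are arbitrary.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
linear = nn.Linear(8, 4)
x = torch.randn(2, 8) * torch.tensor([10.0, 1, 1, 1, 1, 1, 1, 1])   # channel 0 is an outlier

s = x.abs().amax(dim=0).clamp(min=1e-5)   # per-channel scale (learnable in OmniQuant's LET)
x_smooth = x / s                          # outliers suppressed: easier to quantize
w_folded = linear.weight * s              # scale absorbed into the weight columns

y_ref = linear(x)
y_eq = F.linear(x_smooth, w_folded, linear.bias)
print(torch.allclose(y_ref, y_eq, atol=1e-5))   # True: output is mathematically unchanged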
  4. Reproduce the evaluation results of our paper

    1) Download the pre-trained OmniQuant parameters you want from Hugging Face.

    2) Set epochs to 0 and run inference with --resume; take LLaMA-7B with W3A16g128 quantization as an example:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/LLaMA/llama-7b  \
--epochs 0 --output_dir ./log/test \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc \
--resume /PATH/TO/Pretrained/Parameters 

More detailed and optional arguments:

  • --model: the local model path or Hugging Face model name.
  • --wbits: weight quantization bits.
  • --abits: activation quantization bits.
  • --group_size: group size for weight quantization. If not set, per-channel weight quantization is used by default.
  • --lwc: activate the Learnable Weight Clipping (LWC).
  • --let: activate the Learnable Equivalent Transformation (LET).
  • --lwc_lr: learning rate of LWC parameters, 1e-2 as default.
  • --let_lr: learning rate of LET parameters, 5e-3 as default.
  • --epochs: training epochs. You can set it as 0 to evaluate pre-trained OmniQuant checkpoints.
  • --nsamples: number of calibration samples, 128 as default.
  • --eval_ppl: evaluating the perplexity of quantized models.
  • --tasks: evaluating zero-shot tasks.
  • --resume: loading pre-trained OmniQuant parameters.
  • --multigpu: run inference for larger models on multiple GPUs.
  • --real_quant: real quantization, which reduces memory usage. Note that due to the limitations of the AutoGPTQ kernels, real weight-only quantization only yields a memory reduction, at the cost of slower inference.
  • --save_dir: directory in which to save the quantized model for further exploration.

Running Quantized Models with MLC-LLM

MLC-LLM offers a universal deployment solution suitable for various language models across a wide range of hardware backends, encompassing iPhones, Android phones, and GPUs from NVIDIA, AMD, and Intel.

We compile OmniQuant's quantized models through MLC-LLM and offer an out-of-the-box case here. You can observe lower GPU memory usage and an inference speedup. Detailed instructions can be found in runing_quantized_models_with_mlc_llm.ipynb.

Additionally, we deploy the aforementioned quantized models to mobile phones through MLC-LLM. You can download the Android app by simply clicking the button below:

This app includes three models: LLaMa-2-7B-Chat-Omniquant-W3A16g128asym, LLaMa-2-13B-Chat-Omniquant-W3A16g128asym, and LLaMa-2-13B-Chat-Omniquant-W2A16g128asym. They require at least 4.5G, 7.5G, and 6.0G of free RAM, respectively. Note that 2-bit quantization performs worse than 3-bit quantization, as shown in our paper; the 2-bit model is included only as an extreme exploration of deploying LLMs on mobile phones. Currently, this app is in its demo phase and may experience slower response times, so please wait patiently for the response to be generated. We have tested this app on a Redmi Note 12 Turbo (Snapdragon 7+ Gen 2, 16G RAM); some examples are provided below:

  • LLaMa-2-7B-Chat-Omniquant-W3A16g128asym
  • LLaMa-2-13B-Chat-Omniquant-W3A16g128asym
  • LLaMa-2-13B-Chat-Omniquant-W2A16g128asym

We have also tested this app on an iPhone 14 Pro (A16 Bionic, 6G RAM); some examples are provided below:

  • LLaMa-2-7B-Chat-Omniquant-W3A16g128asym

Results

  • OmniQuant achieves SoTA performance in weight-only quantization.
  • OmniQuant achieves SoTA performance in weight-activation quantization.
  • OmniQuant generalizes well, also obtaining excellent performance on instruction-tuned models under GPT-4 evaluation.
  • MLC-LLM delivers real speedups and memory savings for W4A16/W3A16/W2A16 quantization.

Related Projects

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

RPTQ: Reorder-Based Post-Training Quantization for Large Language Models

MLC LLM

AutoGPTQ

Citation

If you use our OmniQuant approach in your research, please cite our paper:

@article{OmniQuant,
  title={OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models},
  author={Shao, Wenqi and Chen, Mengzhao and Zhang, Zhaoyang and Xu, Peng and Zhao, Lirui and Li, Zhiqian and Zhang, Kaipeng and Gao, Peng and Qiao, Yu and Luo, Ping},
  journal={arXiv preprint arXiv:2308.13137},
  year={2023}
}

omniquant's People

Contributors

alvant, brisker, chenmnz, eltociear, jeethu, kaushikthedeveloper, mutichung, radi-cho, realaidanjoe, wqshao126, xingchensong


omniquant's Issues

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed).

I modified the code to support the Codellama-34b model, but when using lwc and let simultaneously, the following error occurred:

Traceback (most recent call last):
  File "main.py", line 380, in <module>
    main()
  File "main.py", line 345, in main
    omniquant(
  File "~/OmniQuant/OmniQuant-main/quantize/omniquant.py", line 358, in omniquant
    norm = loss_scaler(loss, optimizer,
  File "~/OmniQuant/OmniQuant-main/utils.py", line 34, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph, retain_graph=retain_graph)
  File "~/aconda/envs/omiquant/lib/python3.8/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "~/aconda/envs/omiquant/lib/python3.8/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

What could be the problem? Is there any good solution?

Loss is NAN, stopping training

Whenever quantization reaches layer 17, this error appears. I found that quant_out is NaN, yet the bug never occurs when I quantize any single layer on its own. Why is that?

Regarding the Initialization of `smooth_scale` for the Q*K Operation

Hello, and thank you for your outstanding work. I have a question about the initialization of the smooth-scale for the QK operation in the codebase. I've noticed that scales for other operations (e.g., out-proj, fc1) are initialized using the SmoothQuant method with an alpha value of 0.5, which utilizes statistics of weight and dumped activation. However, the scale for QK is initialized with 'torch.ones()'.

While I understand that SmoothQuant doesn't apply scaling to QK, I have a couple of questions:

  1. Could the performance potentially benefit from initializing the QK scale similarly to the SmoothQuant method?
  2. Is it feasible to apply the SmoothQuant approach to both qkv-scales and qkt-scales (both of which affect q-proj.weight and k-proj.weight)?
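
For reference, here is a minimal sketch of the SmoothQuant-style (alpha = 0.5) initialization referred to above, contrasted with a torch.ones() initialization. It is illustrative only; the shapes and names are assumptions, not OmniQuant's code.

import torch

def smoothquant_init_scale(act_absmax: torch.Tensor, weight: torch.Tensor,
                           alpha: float = 0.5, eps: float = 1e-5) -> torch.Tensor:
    """act_absmax: [in_features] per-channel max |activation| (dumped statistics);
    weight: [out_features, in_features]. Returns a per-channel smoothing scale."""
    w_absmax = weight.abs().amax(dim=0)   # per input channel
    return (act_absmax.clamp(min=eps) ** alpha) / (w_absmax.clamp(min=eps) ** (1 - alpha))

act_absmax = torch.rand(4096) * 20        # stand-in for dumped activation statistics
weight = torch.randn(4096, 4096)

s_data_dependent = smoothquant_init_scale(act_absmax, weight)   # SmoothQuant-style init
s_ones = torch.ones(4096)                                       # the torch.ones() init used for Q*K
print(s_data_dependent[:5], s_ones[:5])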

Quant script for large models like 180B and 70B models?

Hi, would you please provide the scripts and hardware specs needed to quantize these super large models?

Does OmniQuant load the model mostly into RAM and evaluate only certain layers in VRAM, so as to lower the GPU requirement for the average user?

Is evaluation on MMLU dataset supported?

Is evaluation on the MMLU dataset supported? I can find the corresponding code here:
https://github.com/OpenGVLab/OmniQuant/blob/main/categories.py
but cannot find any API that can be called.

Model File Formats: .pth, .bin vs. GGUF

Hello,

I've been exploring the OmniQuant repository and am impressed with the quantization techniques provided for Large Language Models (LLMs). I noticed that the pre-trained models are available in .pth and .bin file formats on Hugging Face.

I was wondering why these models are not available in the GGUF format, which is considered more efficient for handling large models. Is there a specific reason for this choice of file formats? Am I missing something here?

I am sure there is a reason for this; I am probably just missing something.

Slow decoding compared to AWQ

Hey @ChenMnZ,

Thanks for the great work. I was trying out AWQ (A16W4) and OmniQuant (A4W4) versions for model meta-llama/llama-2-70b-chat-hf and noticed that OmniQuant is much slower than AWQ. I used the following snippet of code to benchmark:

import time  # assumes `model` and the tokenizer `enc` are already loaded

model.eval()
model = model.cuda()
prompt = "Give me a list of the top 10 dive sites you would recommend around the world. \nThe list is:"
input_ids = enc(prompt, return_tensors='pt').input_ids.cuda()

start_time = time.time()
output = model.generate(inputs=input_ids, do_sample=True, top_k=10, max_new_tokens=128)
end_time = time.time()

speed = len(output[0]) / (end_time - start_time)  # tokens per second (prompt tokens included)
print(enc.decode(output[0]))
print(f"speed: {speed} token/s")

AWQ was able to generate around 36.13 token/s whereas OmniQuant could only generate 6.15 token/s. I ran these on RTX 3060 (12GiB of VRAM).

For AWQ, I ran the model abhinavkulkarni/meta-llama-Llama-2-7b-chat-hf-w4-g128-awq, instructions for how to run model are in the model card.

For OmniQuant, I downloaded Llama-2-7b-w4a4.pth and followed the notebook for Falcon (slightly modified it for Llama 2).

Thanks!

‼️Llama2-70b not working

There seems to be an issue with the OmniQuant Llama2-70b model. The problem arises from a shape mismatch between the scales and the weights under GQA (grouped-query attention).

Problems with memory usage and model loading

I used the "fake quant" code you provided to quantize LLaMA. During the model loading stage, I found that all the loaded parameters were fp16, and the memory usage was almost the same as for the original LLaMA model. Should I load with real-quant parameters instead? How do I get the extremely low memory usage reported in the paper? How do I correctly load a quantized model (taking 4-bit as an example)?

general question about LLM kv-cache quantization

  1. Is the kv-cache actually unused in all the LLM evaluation tasks, since those tasks usually take only a single attention pass, unlike the language generation process, which relies heavily on the kv-cache because tokens are generated one by one?

  2. If this is true, how can we evaluate quantization performance when the kv-cache needs to be quantized, e.g. if we want to accelerate LLMs with something like GPTQ (since the kv-cache is not exercised in normal evaluation tasks)?

In the OmniQuant code, there seem to be no flags to control kv-cache quantization (only quantization of the K and V matrices themselves, not of the cache).

Hoping to discuss this with the OmniQuant authors.

reproduce evaluation results

I use the LLaMA weights from "huggyllama/llama-7b".
I want to reproduce the "LLaMA-7B W4A4" result with script/llama/llama-7b/w4a4.sh:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model ${MODEL_PATH} --eval_ppl \
--epochs 20 --output_dir ./log/llama-7b-w4a4 \
--wbits 4 --abits 4 --lwc --let --aug_loss

but I got wikitext2: 11.583588600158691; c4: 14.935160636901855,
while your paper reports wikitext2: 11.23 (epoch 40), 11.26 (epoch 20); c4: 14.61 (Table A.4).

I want to check whether there are any details I have missed.

I think that if we use the same seed, we should get the same result?

CUDA extension not installed

When I try to run the scripts, "CUDA extension not installed" appears, and the running time is very long.
What should I do?

[Model Request] upstage/SOLAR-10.7B-v1.0

SOLAR-10.7B is a compact yet remarkably powerful large language model; it has demonstrated state-of-the-art performance among models under 30B parameters, rivaling models with up to 30B parameters.

SOLAR-10.7B was developed using Upstage's Depth Up-Scaling. It was built on the Llama 2 architecture with Mistral 7B weights integrated into its upscaled layers as part of its pre-training.

Upstage's Depth-Upscaled SOLAR-10.7B has remarkable performance. It outperforms models with up to 30B parameters, even surpassing the recent Mixtral 8x7B model. For detailed information, please refer to the experimental table. SOLAR-10.7B is an ideal choice for fine-tuning, offering robustness and adaptability. Our simple instruction fine-tuning using the SOLAR-10.7B pre-trained model yields significant performance improvements (SOLAR-10.7B-Instruct-v1.0).

https://huggingface.co/upstage/SOLAR-10.7B-v1.0

Reduce shape for per group weight calibration

Hello!

Am I right that when we quantize weights with some group size, we expect the calibration stats (min and max) to be the same within each group? If so, why is reduce_shape set to -1 here:

https://github.com/OpenGVLab/OmniQuant/blob/main/quantize/quantizer.py#L130C28-L130C28

x = x.reshape(-1, self.group_size)   # each row now holds one quantization group

# some code omitted

reduce_shape = [-1]
xmin = x.amin(reduce_shape, keepdim=True)   # per-group minimum
xmax = x.amax(reduce_shape, keepdim=True)   # per-group maximum

Shouldn't the reduce_shape param be equal to 0 if x is a weight matrix? If, on the other hand, x is an input tensor, then reduce_shape should indeed be -1?

P.S. If some kind of fix is really needed, I would be happy to try to make a pull request 🙂
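
For reference, a small standalone check of what the quoted snippet computes after the reshape (illustrative shapes only; this is not the repository's quantizer): reducing over the last dimension yields one min/max per group of group_size consecutive values.

import torch

group_size = 4
w = torch.arange(24, dtype=torch.float32).reshape(3, 8)   # toy "weight matrix" [out=3, in=8]

x = w.reshape(-1, group_size)               # rows are consecutive groups of 4 values
reduce_shape = [-1]
xmin = x.amin(reduce_shape, keepdim=True)   # one minimum per group
xmax = x.amax(reduce_shape, keepdim=True)   # one maximum per group
print(x.shape, xmin.shape, xmax.shape)      # torch.Size([6, 4]) torch.Size([6, 1]) torch.Size([6, 1])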

Quick Clarification Question on C4 PPL

Hi, first of all, thanks for the amazing paper and repo --- I've learned a lot from it. The comprehensive results table has also been useful as a service to the community.

I just want to ask a very quick clarification question. The repo has two versions of C4 evaluation ("c4" and "c4-new"). For the C4 perplexities in the paper (e.g., Table 10), I'd like to confirm which version of C4 evaluation the numbers correspond to (i.e., "c4" or "c4-new")?

Thanks in advance for your time!

aug_loss option in OmniQuant Scripts

Hi! Thanks for the awesome quantization work! I learned a lot.
I have a question regarding the aug_loss option. I noticed that this option is present in some of the LLaMA scripts.

if args.aug_loss:
    # adds a second reconstruction loss term, computed against fp_inps_2
    loss += loss_func(fp_inps_2[index:index+args.batch_size,], quant_out)

I'm curious to know what this additional loss specifically does, and the reasons behind its inclusion. I'd also like to understand why it's only included in certain models and with specific bit precisions.

MLC Android app is missing storage permission

The app does not work for me (Operation not permitted)

There is no option to give it permission to access the whole storage, currently it only allows it to access media.

It is missing the MANAGE_EXTERNAL_STORAGE permission in the manifest and explicit permission request on app start.

AutoGPTQ or AutoGPTQ-bugfix?

Some time ago, the README linked to the "fixed version" of AutoGPTQ: AutoGPTQ-bugfix. However, the current README links to the original repo: AutoGPTQ.

So, does this mean that everything is OK with AutoGPTQ real quantization now and we do not need the fixed repo?

I am asking such question, because, for example, the fix for qlinear triton was the following (link1, link2):

# qlinear_triton.py
# ...

qweight = qweight.astype(np.int32)
self.qweight = torch.from_numpy(qweight)

# zeros -= 1  # This line removed in the fix
zeros = zeros.numpy().astype(np.uint32)
qzeros = np.zeros((zeros.shape[0], zeros.shape[1] // 32 * self.bits), dtype=np.uint32)
i = 0

# ...

However, AutoGPTQ still contains this zeros modification (link). So it seems that the original AutoGPTQ might still have some problems?

about decode speed and gpu memory usage

I used your real_quant parameters to obtain the quantized llama7b model and tested the inference speed on an A10. However, for both w4a16 and w4a4, the inference speed was only 7 tokens/s, and memory usage exceeded 7GB. These results differ significantly from those reported in the paper (100+ tokens/s, 5.7GB). Is it possible for you to open-source your testing code?

How to quantize a llama structure model and run it with sampling process?

I trained some LLaMA-architecture models and used the following command to quantize my model:

CUDA_VISIBLE_DEVICES=0 python main.py --model /workdir/hf_models/aquila-chat-7b/ --eval_ppl --epochs 20 --output_dir ./log/aquila-chat-7b-w4a16 --wbits 4 --abits 16 --lwc --net llama-7b

It failed with the following output:

[2023-09-20 05:57:45 root](omniquant.py 233): INFO layer 25 iter 13 loss:0.8996272683143616 norm:0.0016859474126249552 max memory_allocated 14942.64794921875
[2023-09-20 05:58:02 root](omniquant.py 233): INFO layer 25 iter 14 loss:0.8993780016899109 norm:0.0017346511594951153 max memory_allocated 14942.64794921875
[2023-09-20 05:58:18 root](omniquant.py 233): INFO layer 25 iter 15 loss:0.899031937122345 norm:0.001769484020769596 max memory_allocated 14942.64794921875
[2023-09-20 05:58:35 root](omniquant.py 233): INFO layer 25 iter 16 loss:0.8987679481506348 norm:0.0017901259707286954 max memory_allocated 14942.64794921875
[2023-09-20 05:58:52 root](omniquant.py 233): INFO layer 25 iter 17 loss:0.8985449075698853 norm:0.0018109744414687157 max memory_allocated 14942.64794921875
[2023-09-20 05:59:08 root](omniquant.py 233): INFO layer 25 iter 18 loss:0.8984124660491943 norm:0.0018458825070410967 max memory_allocated 14942.64794921875
[2023-09-20 05:59:25 root](omniquant.py 233): INFO layer 25 iter 19 loss:0.8984500169754028 norm:0.0018331403844058514 max memory_allocated 14942.64794921875
[2023-09-20 05:59:28 root](omniquant.py 158): INFO === Start quantize layer 26 ===
[2023-09-20 05:59:37 root](omniquant.py 223): INFO Loss is NAN, stopping training
> /workdir/OmniQuant/quantize/omniquant.py(226)omniquant()
-> loss_list.append(loss.data)

In which cases would a NaN loss be produced? BTW, this command works well on llama-7b.

Besides PPL evaluation, I also need to do subjective/objective evaluation on some datasets. I know there is a runing_falcon180b_on_single_a100_80g.ipynb, which shows how to run a quantized Falcon-180b, but it seems I have to learn AutoGPTQ first before I can figure out how to load the quantized weights of a model?

I converted the quantized llama-7b-w4a16.pth to HF format with https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py.
I tried replacing FalconLinear with nn.Linear, but I am not sure whether that is correct. And I wonder what the group size should be; if I did not use a group size during the quantization phase, what value should I set in QuantLinear?

Are there any examples or docs that can help me run a LLaMA model quickly?

Failed to compile AutoGPTQ-bugfix

An error occurs when running pip install -v . in AutoGPTQ-bugfix; could someone help?

……
running build_ext
building 'autogptq_cuda_64' extension
creating /OmniQuants/AutoGPTQ-bugfix/build/temp.linux-x86_64-3.10
creating /OmniQuants/AutoGPTQ-bugfix/build/temp.linux-x86_64-3.10/autogptq_extension
creating /OmniQuants/AutoGPTQ-bugfix/build/temp.linux-x86_64-3.10/autogptq_extension/cuda_64
Emitting ninja build file /OmniQuants/AutoGPTQ-bugfix/build/temp.linux-x86_64-3.10/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] g++ -MMD -MF /OmniQuants/AutoGPTQ-bugfix/build/temp.linux-x86_64-3.10/autogptq_extension/cuda_64/autogptq_cuda_64.o.d -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/OmniQuants/AutoGPTQ-bugfix/autogptq_cuda -I/usr/include/python3.10 -c -c /OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_64.cpp -o /OmniQuants/AutoGPTQ-bugfix/build/temp.linux-x86_64-3.10/autogptq_extension/cuda_64/autogptq_cuda_64.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1016"' -DTORCH_EXTENSION_NAME=autogptq_cuda_64 -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
[2/2] /usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/OmniQuants/AutoGPTQ-bugfix/autogptq_cuda -I/usr/include/python3.10 -c -c /OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu -o /OmniQuants/AutoGPTQ-bugfix/build/temp.linux-x86_64-3.10/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS
_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1016"' -DTORCH_EXTENSION_NAME=autogptq_cuda_64 -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_72,code=sm_72 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_87,code=sm_87 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -ccbin g++ -std=c++17
FAILED: /OmniQuants/AutoGPTQ-bugfix/build/temp.linux-x86_64-3.10/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.o
/usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.10/dist-packages/torch/include -I/usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.10/dist-packages/torch/include/TH -I/usr/local/lib/python3.10/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/OmniQuants/AutoGPTQ-bugfix/autogptq_cuda -I/usr/include/python3.10 -c -c /OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu -o /OmniQuants/AutoGPTQ-bugfix/build/temp.linux-x86_64-3.10/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.o -D__CUDA_NO_HALF_OPERATORS
-D__CUDA_NO_HALF_CONVERSIONS
_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=autogptq_cuda_64 -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_72,code=sm_72 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_87,code=sm_87 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -ccbin g++ -std=c++17
/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(1167): error: identifier "__hfma2" is undefined
res2 = __hfma2(__hfma2(deq2[(tmp >> 0) & 0xf][off], scale, zero), blockvec[k + 0], res2);
^

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(1167): error: identifier "__hfma2" is undefined
res2 = __hfma2(__hfma2(deq2[(tmp >> 0) & 0xf][off], scale, zero), blockvec[k + 0], res2);
^

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(1301): error: identifier "__hfma2" is undefined
res2 = __hfma2(__hfma2(deq2[(tmp1 >> 0) & 0x3f][off], scale, zero), blockvec[k + 0], res2);
^

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(1301): error: identifier "__hfma2" is undefined
res2 = __hfma2(__hfma2(deq2[(tmp1 >> 0) & 0x3f][off], scale, zero), blockvec[k + 0], res2);
^

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(1419): error: identifier "__hfma2" is undefined
res2 = __hfma2(__hfma2(deq2[(tmp >> 0) & 0xff][off], scale, zero), blockvec[k + 0], res2);
^

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(1419): error: identifier "__hfma2" is undefined
res2 = __hfma2(__hfma2(deq2[(tmp >> 0) & 0xff][off], scale, zero), blockvec[k + 0], res2);
^

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(332): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
atomicAdd(&mul[b * width + w], res);
^
detected during instantiation of "void VecQuant2MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const int *, const int *, int, int, int, int, int) [with scalar_t=double]" at line 270

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(477): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
atomicAdd(&mul[b * width + w], res);
^
detected during instantiation of "void VecQuant3MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const int *, const int *, int, int, int, int, int) [with scalar_t=double]" at line 357

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(565): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
atomicAdd(&mul[b * width + w], res);
^
detected during instantiation of "void VecQuant4MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const int *, const int *, int, int, int, int, int) [with scalar_t=double]" at line 502

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(652): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
atomicAdd(&mul[b * width + w], res);
^
detected during instantiation of "void VecQuant8MatMulKernel(const scalar_t *, const int *, scalar_t *, const scalar_t *, const int *, const int *, int, int, int, int, int) [with scalar_t=double]" at line 590

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(750): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
atomicAdd(&mul[b * width + w], res);
^
detected during instantiation of "void VecQuant2MatMulKernel_old(const scalar_t *, const int *, scalar_t *, const scalar_t *, const int *, int, int, int, int, int, int) [with scalar_t=double]" at line 679

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(909): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
atomicAdd(&mul[b * width + w], res);
^
detected during instantiation of "void VecQuant3MatMulKernel_old(const scalar_t *, const int *, scalar_t *, const scalar_t *, const int *, int, int, int, int, int, int) [with scalar_t=double]" at line 774

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(996): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
atomicAdd(&mul[b * width + w], res);
^
detected during instantiation of "void VecQuant4MatMulKernel_old(const scalar_t *, const int *, scalar_t *, const scalar_t *, const int *, int, int, int, int, int, int) [with scalar_t=double]" at line 933

/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu(1079): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
atomicAdd(&mul[b * width + w], res);
^
detected during instantiation of "void VecQuant8MatMulKernel_old(const scalar_t *, const int *, scalar_t *, const scalar_t *, const int *, int, int, int, int, int, int) [with scalar_t=double]" at line 1020

14 errors detected in the compilation of "/OmniQuants/AutoGPTQ-bugfix/autogptq_extension/cuda_64/autogptq_cuda_kernel_64.cu".
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1917, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/OmniQuants/AutoGPTQ-bugfix/setup.py", line 167, in
setup(
File "/usr/local/lib/python3.10/dist-packages/setuptools/init.py", line 103, in setup
return distutils.core.setup(**attrs)
File "/usr/lib/python3.10/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.10/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.10/dist-packages/wheel/bdist_wheel.py", line 364, in run
self.run_command("build")
File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/lib/python3.10/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/lib/python3.10/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
super().run_command(command)
File "/usr/lib/python3.10/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 88, in run
_build_ext.run(self)
File "/usr/lib/python3.10/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 865, in build_extensions
build_ext.build_extensions(self)
File "/usr/lib/python3.10/distutils/command/build_ext.py", line 449, in build_extensions
self._build_extensions_serial()
File "/usr/lib/python3.10/distutils/command/build_ext.py", line 474, in _build_extensions_serial
self.build_extension(ext)
File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 249, in build_extension
_build_ext.build_extension(self, ext)
File "/usr/local/lib/python3.10/dist-packages/Cython/Distutils/build_ext.py", line 127, in build_extension
super(build_ext, self).build_extension(ext)
File "/usr/lib/python3.10/distutils/command/build_ext.py", line 529, in build_extension
objects = self.compiler.compile(sources,
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 678, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1590, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 1933, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
full command: /usr/bin/python -u -c '
exec(compile('"'"''"'"''"'"'

This is -- a caller that pip uses to run setup.py

- It imports setuptools before invoking setup.py, to enable projects that directly

import from distutils.core to work with newer packaging standards.

- It provides a clear error message when setuptools is not installed.

- It sets sys.argv[0] to the underlying setup.py, when invoking setup.py so

setuptools doesn'"'"'t think the script is -c. This avoids the following warning:

manifest_maker: standard file '"'"'-c'"'"' not found".

- It generates a shim setup.py, for handling setup.cfg-only projects.

import os, sys, tokenize

try:
import setuptools
except ImportError as error:
print(
"ERROR: Can not execute setup.py since setuptools is not available in "
"the build environment.",
file=sys.stderr,
)
sys.exit(1)

file = %r
sys.argv[0] = file

if os.path.exists(file):
filename = file
with tokenize.open(file) as f:
setup_py_code = f.read()
else:
filename = ""
setup_py_code = "from setuptools import setup; setup()"

exec(compile(setup_py_code, filename, "exec"))
'"'"''"'"''"'"' % ('"'"'/OmniQuants/AutoGPTQ-bugfix/setup.py'"'"',), "", "exec"))' bdist_wheel -d /tmp/pip-wheel-4jtubafy
cwd: /OmniQuants/AutoGPTQ-bugfix/
Building wheel for auto-gptq (setup.py) ... error
ERROR: Failed building wheel for auto-gptq


Build Env is in docker nvcr.io/nvidia/pytorch:23.09-py3
NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 12.2

How to properly evaluate W6A6 models using a checkpoint from the model zoo

Dear author, may I ask how to evaluate the W6A6 OPT model using the provided ChenMnZ/OmniQuant, act_shifts, and act_scales?

Here is my command

python main.py --model $PATH_TO_MY_OPT_CHECKPOINT \
    --epochs 0 \
    --output_dir $LOG_DIR \
    --wbits 6 --abits 6 --lwc --let \
    --tasks lambada_openai \
    --resume ./OmniQuant/opt-6.7b-w6a6.pth # this is the ckpt downloaded from the huggingface repo

I got the following error, raised in

smooth_ln_fcs_inplace(model.input_layernorm,[model.self_attn.q_proj, model.self_attn.k_proj, model.self_attn.v_proj],

AttributeError: 'QuantOPTDecoderLayer' object has no attribute 'input_layernorm'. Did you mean: 'final_layer_norm'?

If I evaluate W6A6 llama-7b using the following command, the lambada accuracy is 0, though the perplexity matches the value reported in the paper.

❓ May I ask if I missed out something in the following command line?

python main.py --model $PATH_TO_MY_LLAMA_CHECKPOINT \
    --epochs 0 \
    --output_dir $LOG_DIR \
    --wbits 6 --abits 6 --lwc --let \
    --tasks lambada_openai \
    --resume ./OmniQuant/llama-7b-w6a6.pth # this is the ckpt downloaded from the huggingface repo

Here are the metrics I copied from the output:

{'config': {'bootstrap_iters': 100000,
            'description_dict': None,
            'limit': None,
            'model': <models.LMClass.LMClass object at 0x7fc3624658a0>,
            'model_args': None,
            'num_fewshot': 0},
 'results': {'arc_challenge': {'acc': 0.38822525597269625,
                               'acc_norm': 0.4112627986348123,
                               'acc_norm_stderr': 0.01437944106852208,
                               'acc_stderr': 0.014241614207414037},
             'arc_easy': {'acc': 0.6637205387205387,
                          'acc_norm': 0.5197811447811448,
                          'acc_norm_stderr': 0.010251751199542736,
                          'acc_stderr': 0.009694178072725202},
             'boolq': {'acc': 0.728440366972477,
                       'acc_stderr': 0.00777897092960314},
             'lambada_openai': {'acc': 0.0,
                                'acc_stderr': 0.0,
                                'ppl': 2654605.7843538206,
                                'ppl_stderr': 129661.85146222409},
             'openbookqa': {'acc': 0.272,
                            'acc_norm': 0.418,
                            'acc_norm_stderr': 0.022080014812228134,
                            'acc_stderr': 0.01992048320956608},
             'piqa': {'acc': 0.7671381936887922,
                      'acc_norm': 0.764417845484222,
                      'acc_norm_stderr': 0.009901067586473888,
                      'acc_stderr': 0.009861236071080746}},
 'versions': {'arc_challenge': 0,
              'arc_easy': 0,
              'boolq': 1,
              'lambada_openai': 0,
              'openbookqa': 0,
              'piqa': 0}}

RuntimeError when quantizing BLOOM using your code

We are facing a problem when trying to support BLOOM models.
The error occurs when we try to train the LET.
We wrote int_bloom_layer.py following the other three models but ran into this problem.
The error lies in utils.py line 32:
[screenshot of the error]

License

Hi,
Great project! Thanks for releasing it! Would you mind adding a license (ie MIT) so it can be used in production?
Thank you!

attention_mask may appear None for newer versions of LLaMA?

Updated some libs recently, and today received an error in line https://github.com/OpenGVLab/OmniQuant/blob/main/quantize/omniquant.py#L164

attention_mask_batch = attention_mask.repeat(args.batch_size,1,1,1) if args.deactive_amp else attention_mask.repeat(args.batch_size,1,1,1).float()

which reported that attention_mask was None, and so it had no repeat method.

Seems that LLaMA may operate without attention_mask. At least in some cases. Here https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1032 is the code which does not use attention_mask. And exactly this code is executed if there is no info about attention in the model config.json file (seems that SDPA attention may be automatically selected instead of an "Eager" one: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L1336).

If there really could be some kind of a bug, I think there might be several possible things one can do about it. First, just set attention_mask_batch also None if attention_mask is None:

if attention_mask is not None:
    attention_mask_batch = attention_mask.repeat(args.batch_size,1,1,1) if args.deactive_amp else attention_mask.repeat(args.batch_size,1,1,1).float()
else:
    attention_mask_batch = None

However, this could change the experiment results for LLaMA models because previously attention_mask was in use. So, we can also make sure that eager attention is used if nothing is specified in the config.json (https://github.com/OpenGVLab/OmniQuant/blob/main/models/LMClass.py#L23):

config = AutoConfig.from_pretrained(args.model)

if getattr(config, '_attn_implementation_internal', None) is None:
    config._attn_implementation_internal = 'eager'

P.S. I would be ready to make a PR with a fix, if there is really a need for some 🙂

TypeError: QuantLlamaDecoderLayer.forward() got an unexpected keyword argument 'padding_mask'

Hi, I have a problem evaluating the quantized llama-2-7b model, can anyone help?

I quantized the llama-2-7b model with the command below:

CUDA_VISIBLE_DEVICES=6 python main.py --model ../llama_2-7b \
--epochs 0 \
--output_dir ./log/llama-2b-w4a16 \
--wbits 4 --abits 16 --lwc \
--eval_ppl

and got the following error:

Traceback (most recent call last):
File "/home/user/workspace/quantization/s24_quant/OmniQuant/main.py", line 376, in
main()
File "/home/user/workspace/quantization/s24_quant/OmniQuant/main.py", line 352, in main
evaluate(lm, args,logger)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/workspace/quantization/s24_quant/OmniQuant/main.py", line 124, in evaluate
outputs = lm.model.model(batch)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 925, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1505, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _call_impl
return forward_call(*args, **kwargs)
TypeError: QuantLlamaDecoderLayer.forward() got an unexpected keyword argument 'padding_mask'

my environment is configured as follows:

torch: 2.1.0
transformers: 4.34.0
python: 3.10.2

Difference between fake quant and real quant

Dear author.
Thanks for your amazing work. We are trying to apply it to our own model.
I want to ask what the difference is between fake quant and real quant.
The reason I ask is that the W3A16 llama2-7b-chat model fake-quantized by OmniQuant has slower inference than the fp16 model when run with transformers.
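
For context, a minimal sketch of the distinction (illustrative only, not OmniQuant's implementation): fake quantization rounds the weights but stores them back as full-size floating-point tensors, so the memory footprint and matmul cost match the fp16 model (quantization error is simulated, not exploited), whereas real quantization packs the rounded integers into low-bit storage and needs a dedicated kernel such as AutoGPTQ's to turn that into memory savings and speedups.

import torch

w = torch.randn(4096, 4096)                       # stands in for an fp16 weight matrix
scale = w.abs().amax() / 7                        # toy 4-bit symmetric scale
q = torch.clamp(torch.round(w / scale), -8, 7)

w_fake = (q * scale).to(torch.float16)            # fake quant: a full-size fp16 tensor again
q_int = (q + 8).to(torch.uint8)                   # real quant: keep only the 4-bit codes ...
packed = (q_int[:, 0::2] << 4) | q_int[:, 1::2]   # ... packed two per byte; a custom kernel consumes this

print(w_fake.nelement() * w_fake.element_size() / 2**20, "MiB  (fake quant: same storage as fp16)")
print(packed.nelement() * packed.element_size() / 2**20, "MiB  (real quant: packed int4)")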

potential bug about matmul quantization process?

The matmul between query_states and key_states here applies a transpose to key_states:

[screenshot of the referenced code]

so in this case, per-token quantization of the activations along the [-1] dimension fits well.

However, in the matmul between the attention weights and value_states here, there is no transpose applied to value_states after the per-token quantization function:

[screenshot of the referenced code]

This causes a matmul between (a,b) and (c,d) where the former is quantized along the a dimension but the latter is quantized along the c dimension, which seems wrong (I mean the latter should be quantized along the d dimension so that the quantized matmul can actually be accelerated on hardware).

Inquiry about Activation Quantization Strategy in Inference

Hi Team,

Great work, really interesting. I was wondering about an aspect of the paper. You say that you use per-token activation quantization. Is it dynamic quantization, or static at test time?

In your tests with MLC-LLM, you only benchmark weight-only quantization, and the speeds at INT2/INT3 are worse than at INT4. Is this because of constraints in MLC-LLM, and because you are only looking for memory reduction?

Thanks

Quantize LLAMA-2-7b-chat to W4A4

Hello, and thank you for your efforts! I encountered an issue while attempting to quantize the LLAMA-2-7b-chat model to W4A4. I utilized the command below.

CUDA_VISIBLE_DEVICES=0 python main.py \
--model meta-llama/Llama-2-7b-chat-hf --eval_ppl \
--epochs 20 --output_dir ./log/Llama-2-7b-chat-w4a4 \
--wbits 4 --abits 4 --lwc --let \
--let_lr 1e-3 --alpha 0.75

However, the outcome was not as expected. The perplexity (PPL) on the WikiText-2 dataset was 37, which is unsatisfactory. Additional results are provided below.

INFO load calibration from ./cache/testloader_Llama_wikitext2_all.cache
INFO wikitext2 : 37.00777053833008
INFO load calibration from ./cache/testloader_Llama_ptb_all.cache
INFO ptb : 150.4561767578125
INFO load calibration from ./cache/testloader_Llama_c4_all.cache
INFO c4 : 46.19054412841797
INFO load calibration from ./cache/testloader_Llama_ptb-new_all.cache
INFO ptb-new : 572.8397216796875
INFO load calibration from ./cache/testloader_Llama_c4-new_all.cache
INFO c4-new : 50.049354553222656

Could you please offer some guidance on adjusting the hyper-parameters for 'Llama-2-7b-chat' to achieve results comparable to your 'Llama-2-7b-w4a4' model? Your assistance would be greatly appreciated. Thank you.

Running quantized models with MLC-LLM error

this line:

cm = ChatModule(model="dist/Llama-2-7b-chat-omniquant-w3a16g128asym/params", lib_path="dist/Llama-2-7b-chat-omniquant-w3a16g128asym/Llama-2-7b-chat-omniquant-w3a16g128asym-cuda.so")

produces this error:

JSONDecodeError: Expecting property name enclosed in double quotes: line 18 column 1 (char 497)

Cannot compile with mlc-llm

I quantized a custom fine-tuned llama2 70b model like this.

$ python main.py \
  --model /data/finetuned_llama2_70b  \
  --epochs 20 \
  --output_dir /data/finetuned_llama2_70b_output \
  --wbits 4 \
  --abits 16 \
  --group_size 128 \
  --lwc \
  --net Llama-2-70b

$ python main.py \
  --model /data/finetuned_llama2_70b \
  --epochs 0 \
  --output_dir /data/finetuned_llama2_70b_output2 \
  --save_dir /data/finetuned_llama2_70b_omniquant \
  --resume /data/finetuned_llama2_70b_output/omni_parameters.pth \
  --wbits 4 \
  --abits 16 \
  --group_size 128 \
  --lwc \
  --net Llama-2-70b

Then I updated mlc_llm/quantization/__init__.py like this

"w4a16g128asym": QuantizationScheme(
    name="w4a16g128asym",
    linear_weight=GroupQuantizationSpec(
        dtype="float16",
        mode="int4",
        sym=False,
        storage_nbit=16,
        group_size=128,
        transpose=False,
    ),
    embedding_table=None,
    final_fc_weight=None,
)

When I try to compile the model with mlc-llm,

$ python -m mlc_llm.build \
  --model /data/finetuned_llama2_70b_omniquant \
  --target cuda \
  --quantization w4a16g128asym \
  --artifact-path /data/finetuned_llama2_70b_omniquant_mlc \
  --use-cache 0

I got this error.

Start computing and quantizing weights... This may take a while.
Traceback (most recent call last):
  File "~/mlc-llm/mlc_llm/build.py", line 42, in main
    core.build_model_from_args(parsed_args)
  File "~/mlc-llm/mlc_llm/core.py", line 619, in build_model_from_args
    new_params = utils.convert_weights(param_manager, params, args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/mlc-llm/mlc_llm/utils.py", line 258, in convert_weights
    vm["transform_params"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "~/mambaforge/envs/mlc/lib/python3.11/site-packages/tvm/_ffi/base.py", line 476, in raise_last_ffi_error
    raise py_err
  File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
  File "~/mlc-llm/mlc_llm/relax_model/param_manager.py", line 558, in get_item
    for torch_binname in [
                         ^
  File "~/mlc-llm/mlc_llm/relax_model/param_manager.py", line 559, in <listcomp>
    self.torch_pname2binname[torch_pname] for torch_pname in torch_pnames
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'model.layers.0.self_attn.q_proj.weight'

[Llama-2-7B-chat] ppl of w4a8 is nan

When I perform w4a4 quantization and w4a8 quantization separately on the Llama-2-7B-chat model, w4a8 yields significantly lower loss compared to w4a4. However, the PPL of w4a8 is "nan," while the PPL of w4a4 is 23.7.

please see the script and log I used to quantize the model:

  • w4a4
[2023-12-27 11:39:51 root] (main.py 251): INFO Namespace(
model='/jfs-hdfs/user/xingchen.song/share/LLM/Llama-2-7b-chat',
cache_dir='./cache', output_dir='./log/Llama-2-7b-chat-w4a4',
save_dir='exp/OmniQuant_Checkpoints/Llama-2-7b-chat-w4a4',
resume=None, real_quant=False, calib_dataset='wikitext2',
nsamples=128, batch_size=1, seed=2, tasks='', eval_ppl=True,
num_fewshot=0, wbits=4, abits=4, group_size=None, alpha=0.5,
let_lr=0.005, lwc_lr=0.01, wd=0, epochs=20, let=True, lwc=True,
aug_loss=True, symmetric=False, a_dynamic_method='per_token',
w_dynamic_method='per_channel', limit=-1, multigpu=False,
deactive_amp=False, net=None, act_scales=None, act_shifts=None)
[2023-12-27 11:39:53 root] (main.py 316): INFO === start quantization ===
[2023-12-27 11:39:53 root] (main.py 322): INFO load calibration from ./cache/dataloader_Llama_wikitext2_128.cache
[2023-12-27 11:39:53 root] (omniquant.py 47): INFO Starting ...
[2023-12-27 11:39:54 root] (omniquant.py 181): INFO === Start quantize layer 0 ===
[2023-12-27 11:40:23 root] (omniquant.py 262): INFO layer 0 iter 0 loss:0.00011710012040566653 norm:3.260538505855948e-05 max memory_allocated 14383.4755859375
[2023-12-27 11:40:44 root] (omniquant.py 262): INFO layer 0 iter 1 loss:8.657259604660794e-05 norm:9.548537491355091e-06 max memory_allocated 14383.4755859375
[2023-12-27 11:41:05 root] (omniquant.py 262): INFO layer 0 iter 2 loss:7.863873179303482e-05 norm:8.535305823897943e-06 max memory_allocated 14383.4755859375
[2023-12-27 11:41:26 root] (omniquant.py 262): INFO layer 0 iter 3 loss:7.489907875424251e-05 norm:8.39468884805683e-06 max memory_allocated 14383.4755859375
[2023-12-27 11:41:48 root] (omniquant.py 262): INFO layer 0 iter 4 loss:7.386491051875055e-05 norm:1.1096978596469853e-05 max memory_allocated 14383.4755859375
[2023-12-27 11:42:09 root] (omniquant.py 262): INFO layer 0 iter 5 loss:7.13972985977307e-05 norm:1.4607306184188928e-05 max memory_allocated 14383.4755859375
...
...
...
[2023-12-27 15:33:45 root] (omniquant.py 262): INFO layer 31 iter 15 loss:3.263472557067871 norm:0.043484702706336975 max memory_allocated 14410.3583984375
[2023-12-27 15:34:07 root] (omniquant.py 262): INFO layer 31 iter 16 loss:3.261472463607788 norm:0.04224986955523491 max memory_allocated 14410.3583984375
[2023-12-27 15:34:28 root] (omniquant.py 262): INFO layer 31 iter 17 loss:3.2586634159088135 norm:0.04047902300953865 max memory_allocated 14410.3583984375
[2023-12-27 15:34:50 root] (omniquant.py 262): INFO layer 31 iter 18 loss:3.256319522857666 norm:0.03761046379804611 max memory_allocated 14410.3583984375
[2023-12-27 15:35:11 root] (omniquant.py 262): INFO layer 31 iter 19 loss:3.2562286853790283 norm:0.03877091407775879 max memory_allocated 14410.3583984375
[2023-12-27 15:35:15 root] (main.py 345): INFO 14122.338710308075
[2023-12-27 15:35:58 root] (main.py 100): INFO load calibration from ./cache/testloader_Llama_wikitext2_all.cache
[2023-12-27 15:38:12 root] (main.py 144): INFO wikitext2 : 23.720489501953125
[2023-12-27 15:38:12 root] (main.py 100): INFO load calibration from ./cache/testloader_Llama_ptb_all.cache
[2023-12-27 15:38:52 root] (main.py 144): INFO ptb : 663.4659423828125
  • w4a8
[2024-01-02 09:57:31 root] (main.py 257): INFO Namespace(
model='/jfs-hdfs/user/xingchen.song/share/LLM/Llama-2-7b-chat',
cache_dir='./cache', output_dir='./log/Llama-2-7b-chat-w4a8',
save_dir='exp/OmniQuant_Checkpoints/Llama-2-7b-chat-w4a8',
resume=None, real_quant=False, calib_dataset='wikitext2',
nsamples=128, batch_size=1, seed=2, tasks='', eval_ppl=True,
num_fewshot=0, wbits=4, abits=8, group_size=None, alpha=0.5,
let_lr=0.005, lwc_lr=0.01, wd=0, epochs=50, let=True, lwc=True,
aug_loss=True, symmetric=False, a_dynamic_method='per_token',
w_dynamic_method='per_channel', limit=-1, multigpu=False,
deactive_amp=True, attn_implementation='eager', net=None, act_scales=None, act_shifts=None)
[2024-01-02 09:57:49 root] (main.py 322): INFO === start quantization ===
[2024-01-02 09:57:50 root] (main.py 328): INFO load calibration from ./cache/dataloader_Llama_wikitext2_128.cache
[2024-01-02 09:57:51 root] (omniquant.py 47): INFO Starting ...
[2024-01-02 09:58:09 root] (omniquant.py 190): INFO === Start quantize layer 0 ===
[2024-01-02 09:58:54 root] (omniquant.py 271): INFO layer 0 iter 0 loss:1.4345696399686858e-05 norm:4.120320227229968e-06 max memory_allocated 20445.7119140625
[2024-01-02 09:59:27 root] (omniquant.py 271): INFO layer 0 iter 1 loss:1.053816686180653e-05 norm:3.7067807170387823e-06 max memory_allocated 20445.7119140625
[2024-01-02 10:00:00 root] (omniquant.py 271): INFO layer 0 iter 2 loss:9.328913620265666e-06 norm:3.473989181657089e-06 max memory_allocated 20445.7119140625
[2024-01-02 10:00:33 root] (omniquant.py 271): INFO layer 0 iter 3 loss:8.979684935184196e-06 norm:3.6778537833015434e-06 max memory_allocated 20445.7119140625
[2024-01-02 10:01:06 root] (omniquant.py 271): INFO layer 0 iter 4 loss:8.662630534672644e-06 norm:2.7172736736247316e-06 max memory_allocated 20445.7119140625
[2024-01-02 10:01:39 root] (omniquant.py 271): INFO layer 0 iter 5 loss:8.493237146467436e-06 norm:2.8158544864709256e-06 max memory_allocated 20445.7119140625
...
...
...
[2024-01-03 00:54:08 root] (omniquant.py 271): INFO layer 31 iter 45 loss:1.0911047458648682 norm:0.016510091722011566 max memory_allocated 20473.8056640625
[2024-01-03 00:54:42 root] (omniquant.py 271): INFO layer 31 iter 46 loss:1.0910612344741821 norm:0.016708627343177795 max memory_allocated 20473.8056640625
[2024-01-03 00:55:16 root] (omniquant.py 271): INFO layer 31 iter 47 loss:1.0911448001861572 norm:0.01610805094242096 max memory_allocated 20473.8056640625
[2024-01-03 00:55:50 root] (omniquant.py 271): INFO layer 31 iter 48 loss:1.0910682678222656 norm:0.016231702640652657 max memory_allocated 20473.8056640625
[2024-01-03 00:56:24 root] (omniquant.py 271): INFO layer 31 iter 49 loss:1.0912127494812012 norm:0.016619287431240082 max memory_allocated 20473.8056640625
[2024-01-03 00:56:29 root] (main.py 351): INFO 53919.22867846489
[2024-01-03 00:57:30 root] (main.py 100): INFO load calibration from ./cache/testloader_Llama_wikitext2_all.cache
[2024-01-03 00:59:48 root] (main.py 144): INFO wikitext2 : nan
[2024-01-03 00:59:48 root] (main.py 100): INFO load calibration from ./cache/testloader_Llama_ptb_all.cache
[2024-01-03 01:00:29 root] (main.py 144): INFO ptb : nan

Lazy loading

Can you guys implement lazy loading in the MLCChat app, as llama.cpp does? On low-RAM devices it crashes instantly when trying to generate text.

OPT Model Reproduction Discrepancies

Dear authors,

Thank you for sharing your remarkable work.

I am currently focusing on replicating the evaluation results mentioned in your paper as part of our research efforts. While I've successfully matched results for llama-7b and llama-2-7b, I've encountered discrepancies in the OPT series results. Please see my attached pictures.

I'm using the OPT-1.3b model from https://huggingface.co/facebook/opt-1.3b with the following command line:

CUDA_VISIBLE_DEVICES=0 python main.py \
--model /PATH/TO/OPT-1.3b \
--epochs 0 --output_dir ./log/test \
--eval_ppl --wbits 4 --abits 16 --group_size 128 --lwc \
--resume /PATH/TO/opt-1.3b-w4a16g128.pth

[attached screenshots: issue, mismatch_c4, mismatch_wiki2]

Could you confirm if the OPT models' base models match the ones used in your pre-trained models? If different, please specify.

Also, for replicating only the paper's results, I can skip the first three steps in the README's Usage section, right?

Thank you for your time and assistance.

Best regards,
Chao

Trying to run models following docs; incomplete?

Hi, I'm just trying to follow along and try to test this out and have run into some issues with the instructions:

I follow the instructions

mkdir dist && cd dist

# test Llama-2-7b-chat with w3a16g128 quantization
git clone https://huggingface.co/ChenMnZ/Llama-2-7b-chat-omniquant-w3a16g128asym

Then, not in the instructions, I cd .. to the folder with your mlc_chat_cli and run:

./mlc_chat_cli --local-id Llama-2-7b-chat-omniquant-w3a16g128asym --device-name cuda

I get this error:

./mlc_chat_cli: error while loading shared libraries: libmlc_llm.so: cannot open shared object file: No such file or directory

I can install my own mlc_chat_cli (mamba install -c mlc-ai -c conda-forge mlc-chat-cli-nightly); however, it has very different flags (--model vs --local-id, and --device rather than --device-name), and it's not happy with the cuda .so for some reason:

Loading model...
mlc_chat_cli: symbol lookup error: /home/local/llm/omniquant/OmniQuant/dist/Llama-2-7b-chat-omniquant-w3a16g128asym/Llama-2-7b-chat-omniquant-w3a16g128asym-cuda.so: undefined symbol: __cudaRegisterFatBinary

So I decided that I would see if I could build my own model via the Usage docs https://github.com/OpenGVLab/OmniQuant#usage

I was able to generate the scales and shifts, and do weight-only quantization (it took about 1.8h for a W3A16g128 of a llama2-7b on a 4090) - does that seem right? If you have approximate times for how long quants take (on an A100 40G I suppose) that would be useful as well.

I'm at step 4 now. It appears to be going through the quantization process again (it can't reuse the existing logs?), so I'm letting it run, but after that, it's still unclear how I should get it working. Am I able to compile an MLC-LLM model in the default way from this "fake quantized" model? https://mlc.ai/mlc-llm/docs/compilation/compile_models.html - do I just skip --quantization entirely for mlc_llm.build?

Results Errors

Could you please tell me how you obtained your PPL results on WikiText-2 for the LLaMA models? I reused your checkpoint but found some disparities.

How to add a new model for OmniQuant?

Thanks for your brilliant work. After exploring the project for several days, I found that OmniQuant is portable to edge devices like Jetson or phones. I am wondering how I can add more models to OmniQuant; do you have any tutorials about this? Maybe we can start with CodeLlama, since it has a similar architecture to Llama-2, and Llama-2 is already supported.
Also, apologies in advance if this seems obvious; I'm new to the LLM field.
