
llm-awq's People

Contributors

casper-hansen, eltociear, isaac-vidas, kentang-mit, louym, sakits, songhan, tonylins, younesbelkada, ys-2020


llm-awq's Issues

SmoothQuant vs AWQ which one is faster?

Question

We are very interested in the two post-training quantization papers from the HAN Lab!

SmoothQuant uses W8A8 for efficient GPU computation.
AWQ uses W4/3A16 for lower memory requirements and higher memory throughput.

But which one is faster in actual production?
If you have any data about this, could you share it with us?

Which "activations" are assumed in this work?

Hello, thanks for your work.

We hypothesize that the input features with larger magnitudes are generally more important.

Do you mean "activations" = "input features" = the circled matrices in this picture (from the blog post http://jalammar.github.io/illustrated-transformer/)?
[image: attention diagram from the blog post]
And, if so, why did you decide to use the input features, rather than the computed Q, K, V (and maybe Z, as it is referred to in the picture) values, as the measure of weight saliency?
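
For concreteness, this is how I currently picture "input features" (my own paraphrase, not the repo's code): the calibration activations fed into each Linear layer, scored per input channel by average magnitude:

import torch

def channel_saliency(x: torch.Tensor) -> torch.Tensor:
    # x: calibration activations of shape (..., in_features), i.e. the inputs to a Linear
    # layer before the weight is applied, not the computed Q/K/V outputs
    return x.abs().view(-1, x.shape[-1]).mean(dim=0)  # one saliency score per weight input channel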

Does this work support compression and acceleration for mt0-xl or gpt model?

Hi, thanks for sharing the great work.
I'm wondering whether this work supports compression and acceleration for mt0-xl or GPT models.
If not currently, do you have plans to support these models? How can I adapt this work to support them? Could you please give some advice?
For example:
https://huggingface.co/bigscience/mt0-xl
https://huggingface.co/nvidia/nemo-megatron-mt5-3B
https://huggingface.co/nvidia/nemo-megatron-gpt-5B

[Question/Feature] Fused attention/mlp/norm for MPT

I have had the great pleasure of testing out TinyChat today - it's blazing fast.

In particular, I was able to get 102 tokens/s (9.8 ms/token) on a 4090 with the fused operations on LLaMA-2 7B, which is a 100% speed boost over the non-fused operations, which ran at about 45-50 tokens/s.

How can we extend these fusing operations to the MPT model series? I.e., fusing the torch implementation of multi-head attention plus their ALiBi implementation.

The main reason I want to use MPT models over LLaMa is licensing issues, but also that MPT has 7B models trained for 8k context.
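
For context, here is a small sketch of the ALiBi piece a fused MPT attention would need to fold into its score computation (my own paraphrase of standard ALiBi, not TinyChat or MPT code):

import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # standard ALiBi slopes for a power-of-two head count: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    start = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = -slope[h] * (i - j) for past positions j <= i; added to Q K^T / sqrt(d)
    slopes = alibi_slopes(n_heads)                                          # (H,)
    dist = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]  # (S, S): j - i
    return slopes[:, None, None] * dist.clamp(max=0).float()                # (H, S, S)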

error on setup.py in kernels folder

(awq) C:\Users\caleb\Desktop\AI stuff\llm-awq\awq\kernels>python -m setup.py install
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1'
running install
C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` directly.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
    ********************************************************************************

!!
self.initialize_options()
C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` and ``easy_install``.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://github.com/pypa/setuptools/issues/917 for details.
    ********************************************************************************

!!
self.initialize_options()
running bdist_egg
running egg_info
writing f16s4_gemm.egg-info\PKG-INFO
writing dependency_links to f16s4_gemm.egg-info\dependency_links.txt
writing requirements to f16s4_gemm.egg-info\requires.txt
writing top-level names to f16s4_gemm.egg-info\top_level.txt
C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\torch\utils\cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'f16s4_gemm.egg-info\SOURCES.txt'
writing manifest file 'f16s4_gemm.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\torch\utils\cpp_extension.py:359: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
Traceback (most recent call last):
File "C:\Users\caleb\miniconda3\envs\awq\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "C:\Users\caleb\miniconda3\envs\awq\lib\runpy.py", line 110, in get_module_details
import(pkg_name)
File "C:\Users\caleb\Desktop\AI stuff\llm-awq\awq\kernels\setup.py", line 9, in
setup(
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_init
.py", line 107, in setup
return distutils.core.setup(**attrs)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\core.py", line 185, in setup
return run_commands(dist)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\core.py", line 201, in run_commands
dist.run_commands()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 969, in run_commands
self.run_command(cmd)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\install.py", line 80, in run
self.do_egg_install()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\install.py", line 129, in do_egg_install
self.run_command('bdist_egg')
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\bdist_egg.py", line 164, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\bdist_egg.py", line 150, in call_command
self.run_command(cmdname)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\install_lib.py", line 11, in run
self.build()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\command\install_lib.py", line 111, in build
self.run_command('build_ext')
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\dist.py", line 1244, in run_command
super().run_command(command)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools\command\build_ext.py", line 84, in run
_build_ext.run(self)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\setuptools_distutils\command\build_ext.py", line 345, in run
self.build_extensions()
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\torch\utils\cpp_extension.py", line 499, in build_extensions
_check_cuda_version(compiler_name, compiler_version)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\torch\utils\cpp_extension.py", line 383, in _check_cuda_version
torch_cuda_version = packaging.version.parse(torch.version.cuda)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\pkg_resources_vendor\packaging\version.py", line 52, in parse
return Version(version)
File "C:\Users\caleb\miniconda3\envs\awq\lib\site-packages\pkg_resources_vendor\packaging\version.py", line 195, in init
match = self._regex.search(version)
TypeError: expected string or bytes-like object

I can't figure out why setup.py won't work, and I can't finish installing this repo because of this error.
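
A detail that may matter (my reading of the log, not a confirmed diagnosis): the first line says "No CUDA runtime is found", and the final TypeError comes from parsing torch.version.cuda, which is None on a CPU-only PyTorch build:

import torch

# on a CPU-only PyTorch build this prints None, and packaging.version.parse(None)
# raises "TypeError: expected string or bytes-like object" during the CUDA version check
print(torch.version.cuda)
print(torch.cuda.is_available())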

Bad result when running AWQ without GPU

Hi folks, I ran into a weird issue when reproducing the results shown in the paper. I can get the results below with the GPU visible, but cannot reproduce them with only the CPU. I set the dtype to torch.float to avoid losing precision from float16.

It's not an inference-device issue; the difference comes from the awq_results obtained with and without a GPU. Is there any workaround to handle it? Any suggestions would be helpful, thanks!

To disable GPU: export CUDA_VISIBLE_DEVICES=''

[screenshot of the results]

opt-125m (wikitext PPL)   group_size   INT4 RTN asym on CPU   AWQ on CPU   AWQ on GPU
FP32 baseline: 31.95      G32          33.83                  48.52        33.01
                          G128         35.96                  39.53        33.96

How to measure the speedup of W4A16 kernel like Figure 6?

Hi,

Thanks for your outstanding work. I have tested the quantized model using the W4A16 kernel on the WikiText2 dataset. Specifically, the WikiText2 validation set is split into non-overlapping segments of width 2048. I have observed that the W4A16 kernel significantly reduces memory usage. However, the actual speed is even slower than W16A16 in my setup.

For example, for LLaMA-30B, the test time of W16A16 on the WikiText2 validation set is 177 seconds, whereas the test time increases to 420 seconds when using the W4A16 kernel.

I would like to know how to accurately measure the speedup. Am I overlooking something?
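
For what it's worth, here is how I am now trying to time it (a sketch under my own assumptions, not the authors' benchmark script): measure the decode-style case (batch 1, one token at a time), where the workload is memory-bound and W4A16 should help, rather than perplexity evaluation over 2048-token segments, whose prefill is compute-bound and already well served by FP16 tensor cores.

import torch

@torch.no_grad()
def bench_ms(layer, x, iters=100, warmup=10):
    # time a single Linear / WQLinear call; decode latency is dominated by such GEMV-like ops
    for _ in range(warmup):
        layer(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        layer(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

# e.g. compare an FP16 nn.Linear with the corresponding WQLinear using
# x = torch.randn(1, 1, hidden_size).cuda().half()   # batch 1, one token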

Thank you.

[Bug] ValueError: OC is not multiple of cta_N = 128

Hi,

I encounter this error while using the quantized tiiuae/falcon-7b-instruct model:

python -m awq.entry --model_path tiiuae/falcon-7b-instruct \
  --max_memory 0:9GiB cpu:99GiB \
  --tasks wikitext \
  --w_bit 4 --q_group_size 64 \
  --load_quant quant_cache/falcon-7b-instruct-w4-g64-awq.pt
│ /home/user/llm-awq/awq/quantize/qmodule.py:92 in forward             │
│                                                                                                  │
│   89 │   @torch.no_grad()                                                                        │
│   90 │   def forward(self, x):                                                                   │
│   91 │   │   out_shape = x.shape[:-1] + (self.out_features, )                                    │
│ ❱ 92 │   │   out = f16s4_gemm.gemm_forward_cuda(x.reshape(-1, x.shape[-1]), self.qweight, sel    │
│   93 │   │   out = out + self.bias if self.bias is not None else out                             │
│   94 │   │   return out.reshape(out_shape)                                                       │
│   95                                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: OC is not multiple of cta_N = 128

Please note that group size 128 cannot be used for the tiiuae/falcon-7b-instruct model, as the input dimension of the Linear layers in the transformer block is not divisible by 128.
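
For anyone hitting the same error, here is a small diagnostic I used (a sketch under my own assumptions, not an official workaround) to list the Linear layers whose output dimension is not a multiple of the kernel's cta_N = 128:

import torch.nn as nn
from transformers import AutoModelForCausalLM

# note: this instantiates the full model just to inspect the layer shapes
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
for name, m in model.named_modules():
    if isinstance(m, nn.Linear) and m.out_features % 128 != 0:
        print(name, m.in_features, "->", m.out_features)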

Thanks!

awqlora

Excited about your great research. Can we combine QLoRA and AWQ, like GPTQLoRA? If it's possible, would you consider releasing a version of AWQLoRA?

Best regards.

[Feature Request] Support grouped-query attention

Hi,

The recent release of LLaMA 2 from Meta AI uses grouped-query attention (GQA) as opposed to multi-head attention (MHA) for the 70B model and the current AWQ search fails. Considering it is the best open-source model, please support GQA.
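
For reference, a minimal sketch (an assumption about one possible approach, not the repo's code) of how GQA key/value heads can be expanded to match the query heads so that existing MHA-style code paths still apply:

import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # kv: (batch, n_kv_heads, seq_len, head_dim) -> (batch, n_kv_heads * n_rep, seq_len, head_dim)
    if n_rep == 1:
        return kv
    return kv.repeat_interleave(n_rep, dim=1)

# e.g. LLaMA 2 70B uses 64 query heads and 8 KV heads, so n_rep = 8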

Thanks!

A question about the metrics in the paper

Hello, I'm reading AWQ and have a small question about the metrics. I found that the results for OPT on wikitext-2 in AWQ are different from those in GPTQ's paper.

[image]
(results from AWQ)

[image]
(results from GPTQ)

[image]
(results from SpQR, basically the same as GPTQ)

Would that be a problem? Is it due to a different experimental setting, or did I miss something?

Can AWQ be run on TPUs?

Hi,

Is it possible to run AWQ on cloud TPUs? Will CUDA kernels run correctly on those?

Thanks!

Error Occurs When Quantizing LLaMA2-70B

I really appreciate the authors' fantastic work here.

When I tried to apply AWQ to LLaMA2-70B, however, the error below popped up:

Running AWQ...:   0%|                                                                                                                               | 0/80 [00:16<?, ?it/s]
Traceback (most recent call last):
  File "/home/vma/.conda/envs/awq/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/vma/.conda/envs/awq/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/entry.py", line 214, in <module>
    main()
  File "/proj/Projects/FastChat/repo/llm-awq/awq/entry.py", line 189, in main
    model, enc = build_model_and_enc(args.model_path)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/entry.py", line 122, in build_model_and_enc
    awq_results = run_awq(
  File "/home/vma/.conda/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/quantize/pre_quant.py", line 149, in run_awq
    apply_scale(layers[i], scales_list, input_feat_dict=input_feat)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/quantize/auto_scale.py", line 355, in apply_scale
    scale_fc_fc(prev_op, layers[0], scales)
  File "/home/vma/.conda/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/proj/Projects/FastChat/repo/llm-awq/awq/quantize/auto_scale.py", line 68, in scale_fc_fc
    fc1.weight[-scales.size(0):].div_(scales.view(-1, 1))
RuntimeError: The size of tensor a (1024) must match the size of tensor b (8192) at non-singleton dimension 0

This error is really unexpected, since AWQ works fine on LLaMA2-7B and LLaMA2-13B.

I wonder if anyone could give me a hint on solving this problem.

Thanks in advance.

Quantization of larger models on smaller GPUs using CPU offloading

Hi,

I was trying to quantize facebook/opt-6.7b on RTX 3060 (12GB of VRAM) and was running into OOM errors.

I tried supplying my own device_map (instead of device_map="balanced"); the quantization progressed to the 3rd layer, and then I got this error in accelerate:

NotImplementedError: Cannot copy out of meta tensor; no data!

I think that by carefully using accelerate and CPU offloading, it should be possible to quantize larger models on smaller GPUs.
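
For reference, this is the kind of thing I had in mind (a sketch under my own assumptions, using accelerate's infer_auto_device_map with an explicit max_memory budget; not a verified fix for the meta-tensor error above):

import torch
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map

config = AutoConfig.from_pretrained("facebook/opt-6.7b")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "10GiB", "cpu": "64GiB"},       # leave headroom on the 12GB card
    no_split_module_classes=["OPTDecoderLayer"],   # keep each decoder block on one device
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", device_map=device_map, torch_dtype=torch.float16
)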

Can you please look into this? Or provide some guidance as to where to make changes in the code?

Thanks!

Question: Mismatched CUDA version (12.0)

When I was installing the efficient W4A16 (4-bit weight, 16-bit activation) CUDA kernel, I encountered the following error:
The detected CUDA version (12.0) mismatches the version that was used to compile
PyTorch (11.7). Please make sure to use the same CUDA versions.

There is no possibility of downgrading to another version. Does this mean I cannot use AWQ if I have to use CUDA 12? Thank you.

Guidance on CUDA driver and runtime versions

Hi,

I have two different setups where I wanted to run AWQ - I managed to run it successfully on one, but not on the other.

The setup where I was able to run AWQ successfully has the following:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
530.30.02

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Whereas the unsuccessful setup has the following:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
470.161.03

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

The second environment is a Kaggle notebook and the traceback is as follows:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/kaggle/working/llm-awq/awq/entry.py", line 9, in <module>
    from awq.quantize.pre_quant import run_awq, apply_awq
  File "/kaggle/working/llm-awq/awq/quantize/pre_quant.py", line 12, in <module>
    from .auto_scale import auto_scale_block, apply_scale
  File "/kaggle/working/llm-awq/awq/quantize/auto_scale.py", line 8, in <module>
    from .qmodule import ScaledActivation
  File "/kaggle/working/llm-awq/awq/quantize/qmodule.py", line 4, in <module>
    import f16s4_gemm  # with CUDA kernels
ModuleNotFoundError: No module named 'f16s4_gemm'

Thanks!

Can not install with 2080ti

It's an amazing package, and it may be the only CUDA 4-bit method for MPT at the moment. However, I cannot install it. My CUDA version is 11.2 (Cuda compilation tools, release 11.2, V11.2.152) and my graphics card is a 2080 Ti; maybe the problem is that the card is too old? On another machine with a 3090 and CUDA 11.3 (V11.3.109) everything works fine. Is there any way to install the package with a 2080 Ti? The log is below:

(whisper) vitualwht@DESKTOP-DSTBT14:~$ cd llm-awq
(whisper) vitualwht@DESKTOP-DSTBT14:~/llm-awq$ cd awq/kernels
(whisper) vitualwht@DESKTOP-DSTBT14:~/llm-awq/awq/kernels$ python setup.py install
running install
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/setuptools/_distutils/cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` directly.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
    ********************************************************************************

!!
self.initialize_options()
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/setuptools/_distutils/cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` and ``easy_install``.
    Instead, use pypa/build, pypa/installer, pypa/build or
    other standards-based tools.

    See https://github.com/pypa/setuptools/issues/917 for details.
    ********************************************************************************

!!
self.initialize_options()
running bdist_egg
running egg_info
writing f16s4_gemm.egg-info/PKG-INFO
writing dependency_links to f16s4_gemm.egg-info/dependency_links.txt
writing requirements to f16s4_gemm.egg-info/requires.txt
writing top-level names to f16s4_gemm.egg-info/top_level.txt
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'f16s4_gemm.egg-info/SOURCES.txt'
writing manifest file 'f16s4_gemm.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/utils/cpp_extension.py:388: UserWarning: The detected CUDA version (11.2) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'f16s4_gemm' extension
creating build/temp.linux-x86_64-cpython-39
/usr/local/cuda/bin/nvcc -I/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include -I/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/TH -I/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/vitualwht/anaconda3/envs/whisper/include/python3.9 -c gemm_cuda_gen.cu -o build/temp.linux-x86_64-cpython-39/gemm_cuda_gen.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -std=c++17 -keep -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=f16s4_gemm -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75
gemm_cuda_gen.cu(23): warning: variable "scaling_factors_shared" was declared but never referenced

gemm_cuda_gen.cu(24): warning: variable "zeros_shared" was declared but never referenced

gemm_cuda_gen.cu(28): warning: variable "blockIdx_x" was declared but never referenced

gemm_cuda_gen.cu(42): warning: variable "ld_zero_flag" was declared but never referenced

/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h(77): here

/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/c10/util/irange.h(54): warning: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/home/vitualwht/anaconda3/envs/whisper/lib/python3.9/site-packages/torch/include/ATen/core/qualified_name.h(73): here

gemm_cuda_gen.cu(10): warning: function "__pack_half2" was declared but never referenced

ptxas gemm_cuda_gen.ptx, line 911; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 915; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 919; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 923; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 927; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 931; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 935; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 939; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 983; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 987; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 991; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 995; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 999; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 1003; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 1007; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas gemm_cuda_gen.ptx, line 1011; error : Feature '.m16n8k16' requires .target sm_80 or higher
ptxas fatal : Ptx assembly aborted due to errors
error: command '/usr/local/cuda/bin/nvcc' failed with exit code 255

W4A16 kernel error when group_size is not 128

Hi,

Thanks for your interesting work and clear open-source code.

I have been trying to test the W4A16 kernel with different quantization group sizes, and I have found that this kernel only produces correct outputs when group_size is set to 128.

For example, I tested the W4A16 kernel with the following code:

import torch
from awq.quantize.quantizer import pseudo_quantize_tensor, pseudo_quantize_model_weight
from awq.quantize.qmodule import WQLinear

w_bit = 4
q_group_size = 128
inputs = torch.randn((1, 4096, 4096)).cuda().half()
module = torch.nn.Linear(4096, 4096, True).cuda().half()

# fake-quantize the weights and keep the per-group scales / zero points
module.weight.data, scales, zeros = pseudo_quantize_tensor(
    module.weight.data, n_bit=w_bit, get_scale_zp=True, q_group_size=q_group_size)
fake_outputs = module(inputs)

# build the real 4-bit layer from the same scales / zero points
scales = scales.t().contiguous()
zeros = zeros.t().contiguous()
q_linear = WQLinear.from_linear(module, w_bit, q_group_size, False, scales, zeros)
real_outputs = q_linear(inputs)

print(f"average dist:{(real_outputs - fake_outputs).abs().mean()}")

when q_group_size=128, the gap is negligible:

average dist:0.00014293193817138672

However, when q_group_size is set to another value, the gap becomes significant. Taking group_size=256 as an example, the output is:

average dist:0.32958984375

Is there anything I can do to resolve this?

Question about apply scales for fc2

Thanks for your great work, from SmoothQuant to AWQ.
I have a question about this line.

scales_list.append(_auto_get_scale(
    prev_op=module.fc1,
    layers=[module.fc2],
    inp=input_feat['fc2'],
))

Why can we transfer the scale from fc2 to fc1? There is a nonlinear activation function between the two fully connected layers.

3bit backward implementation

Thank you for your amazing work. Do you have any plans to implement a 3-bit backward pass (transposed matmul)?
I think this would make it possible to apply LoRA to the model in 3 bits, like QLoRA.

Version of Nvidia Jetson Orin used for TinyChat benchmarks

You report some benchmarking numbers for TinyChat running on an Nvidia Jetson Orin device, but it is not clear which version of the device you are using. Is it a Nano, NX or AGX and with how much memory? Please update the TinyChat benchmark with this information.

Question about Activation-aware Scaling and its implementation

Hi,

Thank you for your outstanding work and for making it accessible to the public.

I would like to inquire about the correlation between the activation distribution and the chosen alpha through grid search.

According to the paper, AWQ aims to reduce quantization error by preserving significant weights identified by the activation distribution. However, in the implementation, alpha is selected based on the mean square error (MSE) between the original output and the output of the quantized layer.

Paper insights path: activation distribution -> salient weight -> keep the salient weight with high precision -> reduce quantization error
Implementation path: layer-wise MSE -> alpha -> migrate activation outlier -> reduce quantization error
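
To make the implementation path concrete, here is my simplified reading of the grid search (a paraphrase under my own assumptions, not the actual auto_scale.py code; quantize_weight stands in for the repo's pseudo-quantization function):

import torch

@torch.no_grad()
def search_scale(linear, x, quantize_weight, n_grid=20):
    # per-input-channel activation magnitude, i.e. the summary of the activation distribution
    x_mean = x.abs().view(-1, x.shape[-1]).mean(dim=0)
    fp16_out = linear(x)
    best_err, best_scales = float("inf"), None
    for i in range(n_grid):
        alpha = i / n_grid
        scales = x_mean.pow(alpha).clamp(min=1e-4)
        scales = scales / (scales.max() * scales.min()).sqrt()      # normalize the scale range
        w_q = quantize_weight(linear.weight * scales) / scales      # scale up, quantize, fold back
        out = torch.nn.functional.linear(x, w_q, linear.bias)
        err = (fp16_out - out).float().pow(2).mean()                # layer-wise MSE
        if err < best_err:
            best_err, best_scales = err, scales
    return best_scales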

Is there a connection between the activation distribution and the lowest MSE? In other words, does the alpha value determined by MSE reflect the underlying activation distribution?

For example, if we find the best alpha, does it reflect the activation distribution and identify the truly salient weights?

Please let me know if I have missed anything or if there are any misunderstandings.

Your clarification would be greatly appreciated :).

Question about inference speed

I'm trying to compare the inference performance of GPTQ (with reordering) and AWQ on an A100-40G. The table below shows the results of preliminary tests.

LLaMA-13B   best (t/s)   worst (t/s)
Exllama     47.13        41.86
TinyChat    23.04        21.35

It seems that these results are inconsistent with those in the paper; AWQ is much slower than GPTQ.

4bit kernel error when large input size

import torch
import f16s4_gemm

in_features = 14336
out_features = 5376
# packed 4-bit weight, per-group (128) scales and zeros, as expected by the kernel
w = torch.Tensor(in_features, out_features // (32 // 4)).to(torch.int32).cuda()
scales = torch.Tensor(in_features // 128, out_features).half().cuda()
qzeros = torch.Tensor(in_features // 128, out_features // (32 // 4)).to(torch.int32).cuda()
x = torch.Tensor(2048 * 8, 14336).cuda().half()
f16s4_gemm.gemm_forward_cuda(x, w, scales, qzeros, 8)

When M of the GEMM is 2048 * 8, it runs perfectly. If we double it:

x = torch.Tensor(2048 * 16, 14336).cuda().half()

I get an error like RuntimeError: CUDA error: invalid configuration argument.
I think my GPU memory is sufficient, since I ran this on an A100.
Is there anything I can do to solve this?

[Bug] Memory leak in real_quantize_model_weight

Hi,

I have been trying to quantize bigger models (mpt-30b, falcon-40b) on a relatively smaller GPU (RTX 3060 with 12GB of VRAM) and have struggled with CUDA OOM errors.

First of all, I wrote a small utility function to get all tensors allocated on CUDA:

import gc

import torch


def get_cuda_tensors():
    # yield every tensor (or parameter) currently allocated on a CUDA device
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
                if obj.is_cuda:
                    yield obj
        except Exception:
            pass

It seems there are a lot of places where memory is being leaked.

For example, in real_quantize_model_weight, calling this function at the end of the for loop and printing the tensors makes it clear that the number of tensors allocated on CUDA keeps growing with every iteration of the loop, despite calling gc.collect(); torch.cuda.empty_cache() repeatedly.
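
For example, a quick way I count them between iterations (hypothetical usage):

print(sum(1 for _ in get_cuda_tensors()), "tensors currently on CUDA")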

Thanks!

TypeError: expected string or bytes-like object

(awq) C:\Users\Bhanu prakash\llm-awq\awq\kernels>python setup.py install
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0'
running install
E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` directly.
    Instead, use pypa/build, pypa/installer or other
    standards-based tools.

    See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
    ********************************************************************************

!!
self.initialize_options()
E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

    ********************************************************************************
    Please avoid running ``setup.py`` and ``easy_install``.
    Instead, use pypa/build, pypa/installer or other
    standards-based tools.

    See https://github.com/pypa/setuptools/issues/917 for details.
    ********************************************************************************

!!
self.initialize_options()
running bdist_egg
running egg_info
writing awq_inference_engine.egg-info\PKG-INFO
writing dependency_links to awq_inference_engine.egg-info\dependency_links.txt
writing requirements to awq_inference_engine.egg-info\requires.txt
writing top-level names to awq_inference_engine.egg-info\top_level.txt
reading manifest file 'awq_inference_engine.egg-info\SOURCES.txt'
writing manifest file 'awq_inference_engine.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
E:\Anaconda\envs\awq\lib\site-packages\torch\utils\cpp_extension.py:359: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
Traceback (most recent call last):
File "C:\Users\Bhanu prakash\llm-awq\awq\kernels\setup.py", line 9, in
setup(
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_init_.py", line 107, in setup
return distutils.core.setup(**attrs)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\core.py", line 185, in setup
return run_commands(dist)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\core.py", line 201, in run_commands
dist.run_commands()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 969, in run_commands
self.run_command(cmd)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\dist.py", line 1234, in run_command
super().run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\install.py", line 80, in run
self.do_egg_install()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\install.py", line 129, in do_egg_install
self.run_command('bdist_egg')
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\dist.py", line 1234, in run_command
super().run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\bdist_egg.py", line 164, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\bdist_egg.py", line 150, in call_command
self.run_command(cmdname)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\dist.py", line 1234, in run_command
super().run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\install_lib.py", line 11, in run
self.build()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\command\install_lib.py", line 111, in build
self.run_command('build_ext')
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\cmd.py", line 318, in run_command
self.distribution.run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\dist.py", line 1234, in run_command
super().run_command(command)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\dist.py", line 988, in run_command
cmd_obj.run()
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools\command\build_ext.py", line 84, in run
_build_ext.run(self)
File "E:\Anaconda\envs\awq\lib\site-packages\setuptools_distutils\command\build_ext.py", line 345, in run
self.build_extensions()
File "E:\Anaconda\envs\awq\lib\site-packages\torch\utils\cpp_extension.py", line 499, in build_extensions
_check_cuda_version(compiler_name, compiler_version)
File "E:\Anaconda\envs\awq\lib\site-packages\torch\utils\cpp_extension.py", line 383, in _check_cuda_version
torch_cuda_version = packaging.version.parse(torch.version.cuda)
File "E:\Anaconda\envs\awq\lib\site-packages\pkg_resources_vendor\packaging\version.py", line 52, in parse
return Version(version)
File "E:\Anaconda\envs\awq\lib\site-packages\pkg_resources_vendor\packaging\version.py", line 196, in init
match = self._regex.search(version)
TypeError: expected string or bytes-like object

Bug of Load and evaluate the real quantized model

I encountered an error while loading and evaluating the model for OPT-6.7b. Here is my implementation code:
[screenshot of the loading code]
The error is displayed as follows:
[screenshot of the error]
It looks like the weights given in the awq-model-zoo mismatch the model defined in the code.

need help! about auto_scale.scale_fc_fc function

Hello, I would like to apply AWQ to a GPTBigCodeForCausalLM object, which has an unusual attention structure, as shown in this picture:
[screenshot of the attention structure]

I added some necessary implementations and finally got this:
[screenshot of the error]

It was caused here:
[screenshot of the code]

It seems that fc2's scale was applied to both fc1 and fc2, and because of the different shapes of my_fc1 and my_fc2, the entire program broke here.

It seems to be dividing the weight of the previous layer by the scales of fc2, instead of dividing the input x of fc2 by the scales. Right?

How can I fix this error, or could you please tell me why we must apply the scale to both fc1 and fc2?

Thanks & Regards

Can I make a change like this?
[screenshot of the proposed change]

Open-Flamingo reference

In the paper you say the following. How can quantization be done for Open-Flamingo?

Thanks to better generalization, it also achieves good quantization
performance for instruction-tuned LMs (e.g., Vicuna) and, for the first time, multi-modal LMs (Open-Flamingo [2]). Thanks to our efficient kernels, AWQ achieves 1.45× and 2× speedup over GPTQ
and GPTQ with reordering on A100.

[Question/Feature] Skip initialization after quantization

Hi maintainers.

I have been developing models using your AWQ library, which has significantly increased the speed. I have noticed there is a challenge when loading the weights again after quantization because we need to run in init_only mode to load weights correctly and replace layers. This took me roughly 10-12 seconds on a 3090.

Have you thought of a way to 1) quantize weights, 2) save weights with blocks/layers replaced, and 3) load weights without needing to initialize again?

The rationale is that we can get loading times down since we would not need to re-initialize every time we need to load the model.
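
For context, the load path I am using today looks roughly like this (a simplified sketch of my own code; the model path, checkpoint name and q_config values are placeholders, and the real_quantize_model_weight call is the init_only step I would like to skip):

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM
from awq.quantize.quantizer import real_quantize_model_weight

config = AutoConfig.from_pretrained("/path/to/model")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

# the ~10-12 s step: walk the model and swap Linear layers for WQLinear before loading weights
real_quantize_model_weight(
    model, w_bit=4, q_config={"zero_point": True, "q_group_size": 128}, init_only=True
)
model = load_checkpoint_and_dispatch(model, "/path/to/model-w4-g128.pt", device_map="auto")

Ideally the saved checkpoint would already carry enough structure that this middle step is unnecessary.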

Can not load pre-computed AWQ results for Bloom7b

Thanks for the Bloom support added today! I tried it with Bloom-7b but failed.
After performing the AWQ search and saving the search results, I tried to reload and evaluate with the command
'python -m awq.entry --model_path /tmp/bloom_orig_0608/ --w_bit 4 --q_group_size 128 --run_awq --load_awq awq_cache/bloom-7b-w4-g128.pt --tasks wikitext --q_backend fake'
but it raises a NotImplementedError from awq.quantize.qmodule.ScaledActivation:
[screenshot of the error]

bloom-176b CUDA out of memory on 8* A100 80g

Thanks for your work on supporting the BLOOM model. I have already added the --parallel or --auto_parallel argument to my script, but I still can't compute AWQ on my 8x A100 80GB server.
python -m awq.entry_new_lambada --model_path $model_path/$MODEL \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/$MODEL-w4-g128.pt --parallel

How can I fix this problem?

http.client.RemoteDisconnected: Remote end closed connection without response

Hi,
Thanks for your inspiring work. I tried to reproduce it following the example in the README, but it seems to raise an error about the network connection. Does the optimization process need to make HTTP requests? Can I run it locally?

I ran the code with:
python -m awq.entry --model_path /path/to/llama-7b-hf --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/llama-7b-w4-g128.pt

And the error:

raise ConnectionError(err, request=request)

requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

awq use more GPU memory than gptq

We tested the LLaMA model using AWQ and GPTQ. AWQ does have higher accuracy than GPTQ.

But we found that when using AWQ code to infer the llama model, it uses more GPU memory than GPTQ.

The following are the relevant test results

For llama-7b w4 group_size=128, the quantized model size is 3.7G.

use A100 40GB and test on human-eval

GPTQ

  • use_cache=True Maximum memory used:9.505859375GB
  • use_cache=False Maximum memory used:9.115234375GB

AWQ

  • use_cache=True Maximum memory used:26.47265625GB
  • use_cache=False Maximum memory used:36.96484375GB

There are two points to note in the above results.

  1. In the inference stage, GPTQ can use less memory than AWQ
  2. For AWQ, use_cache=False uses more memory( usually use_cache=True requires more memory)

With use_cache=False, we use the GPTQ script to run 4-bit llama-65b inference, which fits on a single GPU. When using AWQ, an OOM occurs.

I would like to ask whether you ran into any of the above problems during testing. Could you please share your thoughts on these issues? Thank you so much.

I noticed that in the forward phase, the main difference between GPTQ and AWQ is that AWQ uses Tensor Cores (I am not familiar with their internals). Would Tensor Cores cause more memory usage?

Support for MPT models

Hi @kentang-mit, great work and research!

I would like to suggest implementing the MPT foundational models for AWQ.

MPT obstacles:

  • uses ALiBi
  • uses Triton with FlashAttention

What are your thoughts on supporting other architectures and foundational open-source models?
