
atom's Introduction

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

[paper] [slides] [poster]

[Overview figure]

Atom is an accurate low-bit weight-activation quantization algorithm that combines (1) mixed-precision, (2) fine-grained group quantization, (3) dynamic activation quantization, (4) KV-cache quantization, and (5) efficient CUDA kernels co-design.
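
For intuition, here is a minimal simulated (fake) quantization sketch of the fine-grained group quantization idea. This is not Atom's actual implementation; the group size, symmetric INT4 range, and tensor shapes are illustrative assumptions.

import torch

def fake_quant_int4_groups(x: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Simulated symmetric INT4 quantization applied per group along the last dimension."""
    orig_shape = x.shape
    xg = x.reshape(-1, group_size)                                    # split channels into groups
    scale = xg.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7   # symmetric INT4 range [-8, 7]
    q = torch.clamp(torch.round(xg / scale), -8, 7)                   # quantize each group
    return (q * scale).reshape(orig_shape)                            # dequantize (simulated quantization)

# Example: simulate 4-bit quantization of a weight matrix with 128-channel groups.
w = torch.randn(4096, 4096)
w_q = fake_quant_int4_groups(w, group_size=128)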

This codebase uses lm_eval to evaluate perplexity and zero-shot accuracy. Code segments from SmoothQuant, GPTQ, and SparseGPT are integrated to reproduce results. Our kernels are modified from a previous version of FlashInfer and tested with NVBench. The serving framework Punica is integrated to evaluate end-to-end throughput and latency. We also use BitsandBytes for new data-type evaluations (e.g., FP4). We thank the authors for their great work.

The current release features:

  • Simulated quantization for accuracy evaluation.
  • Perplexity and zero-shot accuracy evaluation.
  • Kernel benchmarks and end-to-end evaluation.

To do:

  • Release code for reproducing results.
  • Release code for end-to-end throughput evaluation.
  • Add FP4 accuracy evaluation for both weight and activation quantization.
  • Add support for Mixtral models.
  • Optimize kernel for different GPUs.
  • Full inference workflow in a real production scenario.

Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance.

To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to 7.73× compared to FP16 and by 2.53× compared to INT8 quantization, while maintaining the same latency target.

Installation

  1. Run in a container and mount your models:
docker pull nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
docker run -it --gpus all -v /PATH2MODEL:/model nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 /bin/bash
  2. Clone this repo (make sure Git and Conda are installed):
git clone --recurse-submodules https://github.com/efeslab/Atom
cd Atom
  3. Prepare the environment:
cd model
conda create -n atom python=3.10
conda activate atom
pip install -r requirements.txt
  4. Compile the kernel benchmarks (optional): install gcc-11 and CMake (>= 3.24)
apt install software-properties-common lsb-release
apt-get update

curl -s https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null
apt-add-repository "deb https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main"
apt update
apt install cmake

cd /PATH_TO_ATOM/kernels
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get update
apt install -y gcc-11 g++-11
mkdir build && cd build
cmake ..
make -j

Usage

Accuracy Evaluation

Before running these commands, please download the Llama model from the Hugging Face website first. We recommend downloading from Deca-Llama.

We provide several scripts to reproduce our results in the paper:

To run our W4A4 perplexity evaluation, please execute

bash scripts/run_atom_ppl.sh /Path/To/Llama/Model

To get our W4A4 zero-shot accuracy on common-sense tasks, please execute

bash scripts/run_atom_zeroshot_acc.sh /Path/To/Llama/Model

To run our ablation study on different quantization optimizations, please run

bash scripts/run_atom_ablation.sh /Path/To/Llama/Model

You can also customize your own quantization setup by modifying the parameters. Check model/main.py to see the description of each parameter.

python model/main.py /Path/To/Llama/Model wikitext2 \
    --wbits 4 --abits 4 --a_sym --w_sym \
    --act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
    --reorder --act_sort_metric hessian \
    --a_clip_ratio 0.9 --w_clip_ratio 0.85 \
    --keeper 128 --keeper_precision 3 --kv_cache --use_gptq \
    --eval_ppl --eval_common_sense

Efficiency Evaluation

We evaluate Atom on an RTX 4090 GPU. The results below were obtained in the cu113 Docker container. Note that the current kernels are only optimized for the RTX 4090.

To get INT4 GEMM kernel result, please execute:

cd kernels/build
./bench_gemm_i4_o16

Check the Elem/s column to see the computation throughput of the kernel (FLOP/s).
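
For a rough sanity check of what such a number means, the FLOP count of a GEMM is 2·M·N·K. A back-of-the-envelope conversion could look like the snippet below; the sizes and timing are placeholders, not the benchmark's actual configuration.

# Rough TFLOP/s estimate for a GEMM run (placeholder problem size and timing).
M, N, K = 4096, 4096, 4096      # assumed GEMM dimensions
elapsed_s = 1.0e-3              # assumed kernel time in seconds
flops = 2 * M * N * K           # one multiply and one add per inner-product element
print(f"{flops / elapsed_s / 1e12:.2f} TFLOP/s")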

Other Atom kernels can be evaluated similarly, e.g., ./bench_reorder. We also conduct kernel evaluations on the baselines; please check baselines/README.md to reproduce those results.

To reproduce end-to-end throughput and latency evaluation, please check e2e/README.md.

Key Results

Perplexity

We evaluate Atom's accuracy on several model families, including Llama, Llama-2, and Mixtral, with the INT4 and FP4 data types.

  • WikiText2, PTB, and C4 perplexity on the Llama family: [perplexity results figure]

  • WikiText2 perplexity on Llama-2 and Mixtral:

End-to-end throughput and latency

  • Atom achieves up to 7.7× higher throughput than FP16 at similar latency under a fixed GPU memory budget in the serving scenario. [end-to-end results figure]

Reference

If you find this project helpful to your research, please consider citing our paper:

@inproceedings{MLSYS2024_5edb57c0,
 author = {Zhao, Yilong and Lin, Chien-Yu and Zhu, Kan and Ye, Zihao and Chen, Lequn and Zheng, Size and Ceze, Luis and Krishnamurthy, Arvind and Chen, Tianqi and Kasikci, Baris},
 booktitle = {Proceedings of Machine Learning and Systems},
 editor = {P. Gibbons and G. Pekhimenko and C. De Sa},
 pages = {196--209},
 title = {Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving},
 url = {https://proceedings.mlsys.org/paper_files/paper/2024/file/5edb57c05c81d04beb716ef1d542fe9e-Paper-Conference.pdf},
 volume = {6},
 year = {2024}
}

atom's People

Contributors: cylinbao, eltociear, happierpig


atom's Issues

The question about calib data

Hi, I am wondering which dataset should be used as the calibration dataset. I want to evaluate the quantized model on my own dataset. Which dataset should I use to generate the calibration data: my own dataset, or a public dataset like WikiText?

RuntimeError when quant llama model

Hi, when I tried to quantize a Llama model, I encountered the following error:

Traceback (most recent call last):
  File "/workspace/code/atom-main/model/main.py", line 205, in <module>
    act_scales = get_act_stats_func(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/yangshangtong/code/atom-main/model/outlier.py", line 95, in get_act_stats_llama
    outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 750, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 681, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The expanded size of the tensor (2048) must match the existing size (32768) at non-singleton dimension 3.  Target sizes: [1, 52, 2048, 2048].  Tensor sizes: [1, 1, 32768, 32768]

The command is:

python model/main.py /workspace/model/llama2-7B wikitext2 \
	--act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
	--reorder --act_sort_metric hessian \
	--a_clip_ratio 0.9 --w_clip_ratio 0.85 \
	--keeper 128 --keeper_precision 3 --kv_cache \
	--eval_ppl

What should I do to solve this problem? Looking forward to your reply.

ppl on ptb

When I ran your code, I found that the perplexity of the Llama-7B model on the PTB dataset is greater than 20, whether quantized or not. In another paper, the score is also different from yours. Why?

[screenshot of the PPL results attached]

Question about the synchronization in the low-precision kernel

First of all, thank you for your wonderful work! I have a few small questions about the low-precision matrix multiplication code.

In kernels/include/GEMM/Dense_layer_gemm_i4_o16.cuh, in the kernel compute_gemm_imma, I noticed that the only cp.async synchronization statement used in the entire kernel is:

asm volatile("cp.async.wait_group %0;\n" ::"n"(STAGE - 2));


My question is: has the correctness of this kernel been verified? In my understanding, a complete kernel should also involve statements like

asm volatile("cp.async.wait_group %0;\n" ::"n"(STAGE - 3));

and asm volatile("cp.async.wait_group %0;\n" ::"n"(0));

In other words, we should see statements like "wait for the last two memory transactions to complete" somewhere in the code, right? Of course, it is possible that I have missed part of the code. If you can help explain the code, I would be very grateful!

Question regarding the efficiency evaluation

Hello, regarding the efficiency evaluation experiment, it seems there is only code for evaluating the throughput and latency of Atom and SmoothQuant. How were the throughput and latency results for FP16 and AWQ obtained?

issue with `c4` dataset for eval

Thanks for the great work, guys! Trying to run the W4A4 perplexity evaluation, HF datasets complains with "ValueError: BuilderConfig 'allenai--c4' not found". Removing 'allenai--c4' from [datautils.py](https://github.com/efeslab/Atom/blob/main/model/datautils.py#L49) and keeping only 'allenai/c4' lets the script complete the run. I wonder if it is okay to remove it, or whether I am missing something.

Traceback (most recent call last):
  File "/data/home/hamidnazeri/Atom/model/llama.py", line 232, in <module>
    dataloader, testloader = get_loaders(
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 175, in get_loaders
    return get_c4(nsamples, seed, seqlen, model, tokenizer)
  File "/opt/hpcaas/.mounts/fs-5c62ddab/home/hamidnazeri/Atom/model/datautils.py", line 51, in get_c4
    traindata = load_dataset(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/load.py", line 1852, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 373, in __init__
    self.config, self.config_id = self._create_builder_config(
  File "/data/home/hamidnazeri/miniconda/envs/atom/lib/python3.10/site-packages/datasets/builder.py", line 539, in _create_builder_config
    raise ValueError(
ValueError: BuilderConfig 'allenai--c4' not found. Available: ['en', 'en.noblocklist', 'en.noclean', 'realnewslike', 'multilingual', 'af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en-multi', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'und', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Latn', 'zu']
model,bit,wiki2,ptb,c4,ptb-new,c4-new
/data/home/hamidnazeri/PiPPy/
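
For reference, a minimal sketch of the workaround described above: loading C4 from the 'allenai/c4' repo without the removed 'allenai--c4' builder config. The shard path follows the GPTQ-style loaders and is an assumption here.

from datasets import load_dataset

# Load a single C4 training shard directly, without the legacy 'allenai--c4' config name.
traindata = load_dataset(
    "allenai/c4",
    data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
    split="train",
)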

AssertionError

I attempted the W4A4 operation on the OPT-350M model and was able to obtain the corresponding results. However, after switching to the 2.7B model, I encountered a mismatch error at line 238 in quant.py. Upon printing, I found the size to be ([32, 2048, 160]), whereas for the 350M model it was (16, 2048, 128). How should I resolve this error?

TypeError: QLlamaDecoderLayer.forward() got an unexpected keyword argument 'cache_position'

Hi there, I followed all the steps of this project until I encountered an issue while running the following command:

python model/main.py decapoda-research-llama-7b-hf wikitext2 \
    --wbits 4 --abits 4 --a_sym --w_sym \
    --act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
    --reorder --act_sort_metric hessian \
    --a_clip_ratio 0.9 --w_clip_ratio 0.85 \
    --keeper 128 --keeper_precision 3 --kv_cache --use_gptq \
    --eval_ppl --eval_common_sense

Env

  1. GPU: Nvidia RTX 4090
  2. Same as you, I used the nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 container
  3. The versions of Python and the dependency libraries are consistent with your requirements.txt
  4. For convenience, I temporarily skipped the kernel-benchmark compilation step
  5. By the way, for quick validation, I changed this line to only evaluate the wikitext2 dataset

Describe the issue

When running loglikelihood requests, the TypeError shown in the screenshot below occurred:
[screenshot of the TypeError attached]

I tried making changes based on this issue, setting cache_position=None in transformers/models/llama/modeling_llama.py, but that doesn't work either.

Any suggestions will be greatly appreciated!

Not including dynamic quantization when reproducing results, why?

I have reproduced the results of Llama-7B; the WikiText2 perplexity matches Table 3 (6.16).
But the code you provide does not include dynamic quantization for activations. As far as I know, dynamic quantization also causes quantization error. Why is dynamic quantization omitted in your code?

the ppl for llama-7b is very large

I have reproduced the results of Llama-7B.
The WikiText2 perplexity matches Table 3 (6.16).
The C4 perplexity also matches Table 3 (7.694).
But the PTB perplexity is very large (32.879), whereas it is 9.62 in Table 3. Why?

Is it possible to add support for other models?

Hi, this is great work. I found that /Atom/model/main.py seems to only support the Llama, OPT, and Mixtral models. If I want to add support for the Qwen model, which files should I change?

Error: same device

I would like to ask if there is a solution to this problem, as the error occurred without any changes to the code: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Question about the end-to-end efficiency evaluation of Atom

Thanks for your great work! I have a small question here.

Why is the matrix dimension (bs, (hidden_dim - group_size) // 2) rather than (bs, hidden_dim - group_size) here?
What does the "//2" mean? Is it some kind of hardware acceleration method? Could you elaborate? Thank you.

" a = torch.randint(\n",
" 16,\n",
" 128, (bs, (hidden_dim - group_size) // 2),\n",
" dtype=torch.uint8).cuda()\n",
" b = torch.randint(\n",
" 16,\n",
" 128, (hidden_dim, (hidden_dim - group_size) // 2),\n",
" dtype=torch.uint8).cuda()\n",

LLM model load hanging problem

Hello,
When I follow the guidance and try to reproduce the results, I encounter the problem shown in the screenshot below.

[screenshot of the model loading hang attached]
