GPTQ-for-LLaMa

4 bits quantization of LLaMa using GPTQ

GPTQ is SOTA one-shot weight quantization method

This code is based on GPTQ

Result

Model(LLaMa-7B)	Bits	group-size	Wikitext2	PTB	C4
FP16	16	-	5.67	8.79	7.05
RTN	4	-	6.28	9.68	7.70
GPTQ	4	-	6.79	10.67	8.28
GPTQ	4	64	6.16	9.66	7.52
RTN	3	-	25.66	61.25	28.19
GPTQ	3	-	20.86	37.54	22.19
GPTQ	3	64	12.24	16.77	9.55

Model(LLaMa-13B)	Bits	group-size	Wikitext2	PTB	C4
FP16	16	-	5.08	8.06	6.58
RTN	4	-	5.52	8.62	6.96
GPTQ	4	-	5.35	8.40	6.82
GPTQ	4	64	5.18	8.18	6.66
RTN	3	-	11.41	21.21	13.20
GPTQ	3	-	6.80	10.45	8.31
GPTQ	3	64	5.50	8.60	7.00

Quantizing the model requires a large amount of CPU memory. For example, quantizing a LLaMa-13b model requires 42gb, and LLaMa-33b requires more memory than 64gb.

Depending on the GPUs/drivers, there may be a difference in performance, which decreases as the model size increases.(IST-DASLab/gptq#1)

According to GPTQ paper, As the size of the model increases, the difference in performance between FP16 and GPTQ decreases.

Dependencies

torch: tested on v1.12.1+cu113
transformers: tested on v4.27.0.dev0(required)
datasets: tested on v2.10.1
(to run 4-bit kernels: setup for compiling PyTorch CUDA extensions, see also https://pytorch.org/tutorials/advanced/cpp_extension.html, tested on CUDA 11.3)

All experiments were run on a single NVIDIA RTX3090.

Language Generation

LLaMa

# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --groupsize 64

To run other LLaMa models replace llama-7b-hf with one of: llama-13b-hf, llama-30b-hf, llama-65b-hf.

ZeroShot

See zeroShot/ folder.

CUDA Kernels

# Install kernels
python setup_cuda.py install

# Benchmark performance for FC2 layer of LLaMa-7B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py

# Benchmark language generation with 4-bit LLaMa-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --save llama7b-4bit.pt
# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --load llama7b-4bit.pt --benchmark 2048 --check
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py decapoda-research/llama-7b-hf c4 --benchmark 2048 --check

# model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"

CUDA Kernels support 2,3,4,8 bits.

Basically, 4-bit quantization is recommended.

cuda kernel does not support group size.

Memory Usage

Model	Bits	memory(MiB)	benchmark(ppl)	Wikitext2	PTB	C4	checkpoint size(GB)
LLaMa-7B with FP16	16	13940	5.23	5.67	8.79	7.05	12.5
LLaMa-13B with FP16	16	OOM	-	5.08	8.06	6.58	24.2
LLaMa-7B with GPTQ	8	7748	5.39	5.67	8.81	7.08	6.5
LLaMa-13B with GPTQ	8	14570	5.00	5.09	8.06	6.61	12.4
LLaMa-7B with GPTQ	4	4740	6.23	6.79	10.67	8.28	3.5
LLaMa-13B with GPTQ	4	8410	5.14	5.35	8.40	6.82	6.5
LLaMa-7B with GPTQ	3	3852	11.43	17.94	31.44	19.65	2.75
LLaMa-13B with GPTQ	3	6870	5.58	6.77	10.29	8.34	5.06
LLaMa-7B with GPTQ	2	3076	4152	30749	45936	5045	2.0
LLaMa-13B with GPTQ	2	5275	6903	13203	1384	8.34	5.06

Acknowledgements

This code is based on GPTQ

Thanks to Meta AI for releasing LLaMa, a powerful LLM.

fabiorafaelcoutada / gptq-for-llama Goto Github PK

gptq-for-llama's Introduction

GPTQ-for-LLaMa

Result

Dependencies

Language Generation

LLaMa

ZeroShot

CUDA Kernels

Memory Usage

Acknowledgements

gptq-for-llama's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent