GPTQ-for-LLaMa

4-bit quantization of LLaMa using GPTQ.

GPTQ is a state-of-the-art (SOTA) one-shot weight quantization method.

This code is based on GPTQ

Result

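| Model (LLaMa-7B) | Bits | Group size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.67 | 8.79 | 7.05 |
| RTN | 4 | - | 6.28 | 9.68 | 7.70 |
| GPTQ | 4 | - | 6.79 | 10.67 | 8.28 |
| GPTQ | 4 | 64 | 6.16 | 9.66 | 7.52 |
| RTN | 3 | - | 25.66 | 61.25 | 28.19 |
| GPTQ | 3 | - | 20.86 | 37.54 | 22.19 |
| GPTQ | 3 | 64 | 12.24 | 16.77 | 9.55 |

| Model (LLaMa-13B) | Bits | Group size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.08 | 8.06 | 6.58 |
| RTN | 4 | - | 5.52 | 8.62 | 6.96 |
| GPTQ | 4 | - | 5.35 | 8.40 | 6.82 |
| GPTQ | 4 | 64 | 5.18 | 8.18 | 6.66 |
| RTN | 3 | - | 11.41 | 21.21 | 13.20 |
| GPTQ | 3 | - | 6.80 | 10.45 | 8.31 |
| GPTQ | 3 | 64 | 5.50 | 8.60 | 7.00 |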
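
To make the RTN rows and the group-size column more concrete, here is a minimal pure-Python sketch of round-to-nearest quantization with an optional group size. It is an illustration under assumptions, not the code in this repository: the function names `rtn_quantize` and `mean_squared_error` and the toy weight row are hypothetical, and the asymmetric min/max scheme is one common choice.

```python
def rtn_quantize(weights, bits=4, groupsize=None):
    """Round-to-nearest (RTN) asymmetric quantization of one weight row.

    Each group of `groupsize` consecutive weights gets its own scale and
    zero point; groupsize=None uses a single group for the whole row.
    Returns the dequantized weights so the rounding error can be inspected.
    """
    maxq = 2 ** bits - 1
    size = len(weights) if groupsize is None else groupsize
    out = []
    for start in range(0, len(weights), size):
        group = weights[start:start + size]
        lo, hi = min(min(group), 0.0), max(max(group), 0.0)
        scale = (hi - lo) / maxq or 1.0  # guard against an all-zero group
        zero = round(-lo / scale)
        for w in group:
            q = min(max(round(w / scale) + zero, 0), maxq)
            out.append(scale * (q - zero))
    return out


def mean_squared_error(a, b):
    """Average squared difference between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy row (hypothetical): four small weights followed by four large ones.
row = [0.01, -0.02, 0.03, -0.01, 0.9, -1.1, 1.2, -0.8]
err_full = mean_squared_error(row, rtn_quantize(row, bits=3))
err_grouped = mean_squared_error(row, rtn_quantize(row, bits=3, groupsize=4))
print(err_full, err_grouped)
```

For this toy row the grouped error comes out lower, because the small weights get their own, finer scale instead of sharing one with the large weights. GPTQ goes further than RTN by using calibration data to compensate for rounding error, which this sketch does not attempt.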

Quantizing the model requires a large amount of CPU memory. For example, quantizing LLaMa-13B requires 42 GB, and LLaMa-33B requires more than 64 GB.

Performance may differ depending on the GPU and driver; the difference shrinks as the model size increases (IST-DASLab/gptq#1).

According to the GPTQ paper, the difference in performance between FP16 and GPTQ decreases as the size of the model increases.
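
The Wikitext2 numbers in the result tables show the same trend. A short sketch of the arithmetic, using only values copied from those tables (the helper name `relative_gap` is hypothetical):

```python
# Wikitext2 perplexities copied from the result tables
# (4-bit GPTQ without group size vs. the FP16 baseline).
results = {
    "LLaMa-7B": {"fp16": 5.67, "gptq4": 6.79},
    "LLaMa-13B": {"fp16": 5.08, "gptq4": 5.35},
}


def relative_gap(fp16, quantized):
    """Perplexity increase of the quantized model relative to FP16, in percent."""
    return 100.0 * (quantized - fp16) / fp16


gaps = {name: relative_gap(r["fp16"], r["gptq4"]) for name, r in results.items()}
for name, gap in gaps.items():
    print(f"{name}: +{gap:.1f}% perplexity vs FP16")
```

This prints a gap of about 19.8% for LLaMa-7B and about 5.3% for LLaMa-13B.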

Dependencies

All experiments were run on a single NVIDIA RTX3090.

Language Generation

LLaMa

# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4
# Run RTN baseline and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --nearest
# Run GPTQ and compute results
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --groupsize 64

To run other LLaMa models, replace llama-7b-hf with one of: llama-13b-hf, llama-30b-hf, llama-65b-hf.
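
If you want to script that substitution, the following sketch builds the GPTQ command line for each model size. It only assembles and prints the commands; it uses no flags beyond those shown above, and the helper name `gptq_command` is hypothetical.

```python
import shlex

MODELS = ["llama-7b-hf", "llama-13b-hf", "llama-30b-hf", "llama-65b-hf"]


def gptq_command(model, wbits=4, groupsize=64, dataset="c4"):
    """Build the llama.py argument list for one model size."""
    return [
        "python", "llama.py", f"decapoda-research/{model}", dataset,
        "--wbits", str(wbits), "--groupsize", str(groupsize),
    ]


commands = [gptq_command(m) for m in MODELS]
for cmd in commands:
    # Printed only; pass `cmd` to subprocess.run to actually launch a run.
    print("CUDA_VISIBLE_DEVICES=0", shlex.join(cmd))
```

Keep the memory requirements in mind before launching the larger models.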

ZeroShot

See zeroShot/ folder.

CUDA Kernels

# Install kernels
python setup_cuda.py install

# Benchmark performance for FC2 layer of LLaMa-7B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py

# Benchmark language generation with 4-bit LLaMa-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --save llama7b-4bit.pt
# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --load llama7b-4bit.pt --benchmark 2048 --check
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py decapoda-research/llama-7b-hf c4 --benchmark 2048 --check

# Run inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --load llama7b-4bit.pt --text "this is llama"

The CUDA kernels support 2, 3, 4, and 8 bits.

In general, 4-bit quantization is recommended.

The CUDA kernels do not support group size.
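
Low-bit kernels usually read weights that have been packed several to a machine word. The sketch below shows one simple way to pack unsigned codes into 32-bit words in pure Python. It is an assumption for illustration and not necessarily the layout used by the kernels here; `pack` and `unpack` are hypothetical names.

```python
def pack(values, bits):
    """Pack unsigned `bits`-wide integers into 32-bit words, lowest bits first."""
    per_word = 32 // bits
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for i, v in enumerate(values[start:start + per_word]):
            if not 0 <= v < (1 << bits):
                raise ValueError(f"{v} does not fit in {bits} bits")
            word |= v << (i * bits)
        words.append(word)
    return words


def unpack(words, bits, count):
    """Inverse of pack(): recover `count` values from the packed words."""
    per_word = 32 // bits
    mask = (1 << bits) - 1
    values = [(w >> (i * bits)) & mask for w in words for i in range(per_word)]
    return values[:count]


codes = [3, 15, 0, 7, 8, 1, 12, 5, 9]   # nine 4-bit codes
packed = pack(codes, bits=4)            # eight codes fit in one 32-bit word
restored = unpack(packed, bits=4, count=len(codes))
```

With 2, 4, or 8 bits the codes fill each word exactly. With 3 bits this simple scheme stores ten codes per word and leaves two bits unused, so a real kernel may pack 3-bit values differently.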

Memory Usage

| Model | Bits | Memory (MiB) | Benchmark (ppl) | Wikitext2 | PTB | C4 | Checkpoint size (GB) |
|---|---|---|---|---|---|---|---|
| LLaMa-7B with FP16 | 16 | 13940 | 5.23 | 5.67 | 8.79 | 7.05 | 12.5 |
| LLaMa-13B with FP16 | 16 | OOM | - | 5.08 | 8.06 | 6.58 | 24.2 |
| LLaMa-7B with GPTQ | 8 | 7748 | 5.39 | 5.67 | 8.81 | 7.08 | 6.5 |
| LLaMa-13B with GPTQ | 8 | 14570 | 5.00 | 5.09 | 8.06 | 6.61 | 12.4 |
| LLaMa-7B with GPTQ | 4 | 4740 | 6.23 | 6.79 | 10.67 | 8.28 | 3.5 |
| LLaMa-13B with GPTQ | 4 | 8410 | 5.14 | 5.35 | 8.40 | 6.82 | 6.5 |
| LLaMa-7B with GPTQ | 3 | 3852 | 11.43 | 17.94 | 31.44 | 19.65 | 2.75 |
| LLaMa-13B with GPTQ | 3 | 6870 | 5.58 | 6.77 | 10.29 | 8.34 | 5.06 |
| LLaMa-7B with GPTQ | 2 | 3076 | 4152 | 30749 | 45936 | 5045 | 2.0 |
| LLaMa-13B with GPTQ | 2 | 5275 | 6903 | 13203 | 1384 | 8.34 | 5.06 |
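
As a rough cross-check of the checkpoint sizes, the weight storage can be estimated as parameters × bits / 8. The sketch below does that arithmetic. The parameter counts are approximate and are an assumption, not taken from this README, and the helper name `weight_size_gib` is hypothetical.

```python
def weight_size_gib(num_params, bits):
    """Size in GiB of `num_params` weights stored at `bits` bits each."""
    return num_params * bits / 8 / 2 ** 30


# Approximate parameter counts; an assumption, not taken from this README.
PARAMS = {"LLaMa-7B": 6.7e9, "LLaMa-13B": 13.0e9}

estimates = {
    (name, bits): weight_size_gib(n, bits)
    for name, n in PARAMS.items()
    for bits in (16, 8, 4, 3, 2)
}
for (name, bits), size in estimates.items():
    print(f"{name} at {bits} bits: about {size:.1f} GiB of weights")
```

The 16-bit estimates (about 12.5 and 24.2 GiB) line up with the FP16 checkpoint sizes in the table. The quantized checkpoints in the table are somewhat larger than this estimate, presumably because they also store scales, zero points, and any layers left unquantized.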

Acknowledgements

This code is based on GPTQ

Thanks to Meta AI for releasing LLaMa, a powerful LLM.
