
quip4llama's Introduction

QuIP: Quantization with Incoherence Processing

This repository contains code for the paper QuIP: 2-Bit Quantization of Large Language Models with Guarantees.

TLDR: Our proposed incoherence processing enables quantization of large language models down to 2 bits. Please see our paper for full details.

The code is built on top of OPTQ's repository and includes the components described in the sections below.

Update: QuIP# is our new and improved method! It includes a lattice codebook and an efficient CUDA implementation, and achieves near-FP16 quantization performance at 2 bits on Llama 1 and 2 models.

Language Generation

# Compute full precision (FP16) results
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4
# Run a quantization method with Incoherence Processing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --quant <quantmethod> --incoh_processing --save <savename>
# Run a quantization method with baseline processing
CUDA_VISIBLE_DEVICES=0 python opt.py facebook/opt-125m c4 --wbits 4 --quant gptq --pre_gptqH --save <savename>

Quantization methods include:

  • ldlq: runs the LDLQ rounding algorithm (we show its equivalence to OPTQ, providing a novel theoretical analysis)
  • ldlqRG: runs the LDLQ_RG algorithm, which adds Hessian-based reordering and further greedy updates; --npasses controls the number of passes over the weights
  • gptq: runs the OPTQ algorithm as implemented by its authors
  • allbal: runs the greedy updates by themselves; --npasses controls the number of passes over the weights
  • ldlbal_admm: an alternative algorithm that constrains the rounded weights to be sufficiently close to their original values, giving a better theoretical bound
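To make the ldlq entry concrete, here is a minimal NumPy sketch of LDL-based adaptive rounding as described in the paper. It is not the code in this repository. It assumes an unbounded integer grid (no --wbits clamping) and nearest rounding, and the function names udu_decompose and ldlq_round are hypothetical.

```python
import numpy as np


def udu_decompose(H):
    """Factor a symmetric positive-definite H as (U + I) @ diag(d) @ (U + I).T,
    where U is strictly upper triangular."""
    n = H.shape[0]
    # Cholesky of the index-reversed matrix, flipped back, yields an
    # upper-triangular M with H = M @ M.T.
    L = np.linalg.cholesky(H[::-1, ::-1])
    M = L[::-1, ::-1]
    diag = np.diag(M)
    # Divide column j by M[j, j] so the diagonal becomes 1, then drop it.
    U = M / diag - np.eye(n)
    return U, diag ** 2


def ldlq_round(W, H):
    """Round W one column at a time to the nearest integer, feeding the error
    of the already-rounded columns forward through U."""
    U, _ = udu_decompose(H)
    W_hat = np.zeros_like(W)
    for k in range(W.shape[1]):
        feedback = (W[:, :k] - W_hat[:, :k]) @ U[:k, k]
        W_hat[:, k] = np.rint(W[:, k] + feedback)
    return W_hat


# Small demo on synthetic data (shapes and scales chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
H_demo = X.T @ X / 64 + 1e-3 * np.eye(8)
W_demo = 3 * rng.standard_normal((4, 8))
W_hat_demo = ldlq_round(W_demo, H_demo)
```

When H is the identity, U is zero and the sketch reduces to plain nearest rounding. In general, every entry of (W_hat - W) @ (U + I) is a single rounding residual, so its magnitude is at most 0.5.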

The --incoh_processing argument is a meta argument that sets the following flags: --pre_gptqH --pre_rescale --pre_proj --qfn b. For finer control over the pre- and post-processing, these arguments can be set individually.
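As an illustration of how a meta argument of this kind can expand into individual flags, here is a minimal argparse sketch. It is not the repository's actual parser: the helper names are hypothetical, and the default "a" for --qfn is a placeholder assumed for this example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--incoh_processing", action="store_true")
    parser.add_argument("--pre_gptqH", action="store_true")
    parser.add_argument("--pre_rescale", action="store_true")
    parser.add_argument("--pre_proj", action="store_true")
    # Assumption: "a" is only a placeholder default for this sketch.
    parser.add_argument("--qfn", type=str, default="a")
    return parser


def expand_meta_flags(args):
    """If the meta flag is set, switch on the flags it stands for."""
    if args.incoh_processing:
        args.pre_gptqH = True
        args.pre_rescale = True
        args.pre_proj = True
        args.qfn = "b"
    return args


args = expand_meta_flags(build_parser().parse_args(["--incoh_processing"]))
```

Passing the individual flags instead, for example only --pre_proj, leaves the others at their defaults.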

To run other OPT models, replace opt-125m with one of: opt-350m, opt-1.3b, opt-2.7b, opt-6.7b, opt-13b, opt-30b, etc. On larger models, a low compute-to-memory-access ratio can slow down the quantization algorithms. We therefore implement a lazy batch update to the weight matrix, controlled by --lazy_batch. This argument works with the quantization methods {ldlq, ldlqRG, allbal}. Note that OPTQ already implements this technique, which is where we got the idea.
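The idea behind a lazy batch update can be sketched as follows. This is an illustrative NumPy version, not the repository's implementation. It assumes rounding with linear error feedback through a strictly upper-triangular matrix U (a hypothetical input here) and nearest-integer rounding. The eager variant touches every remaining column after each rounded column. The lazy variant updates only the columns inside the current batch and applies the accumulated correction to the remaining columns with a single matrix product per batch.

```python
import numpy as np


def round_eager(W, U):
    """After rounding column j, immediately update all later columns."""
    V = W.copy()                      # working copy that receives the feedback
    W_hat = np.zeros_like(W)
    n = W.shape[1]
    for j in range(n):
        W_hat[:, j] = np.rint(V[:, j])
        err = W[:, j] - W_hat[:, j]
        V[:, j + 1:] += np.outer(err, U[j, j + 1:])
    return W_hat


def round_lazy(W, U, batch):
    """Same arithmetic, but columns outside the current batch are updated
    once per batch with one matrix product."""
    V = W.copy()
    W_hat = np.zeros_like(W)
    n = W.shape[1]
    for start in range(0, n, batch):
        end = min(start + batch, n)
        for j in range(start, end):
            W_hat[:, j] = np.rint(V[:, j])
            err = W[:, j] - W_hat[:, j]
            # Only columns inside the batch are touched here.
            V[:, j + 1:end] += np.outer(err, U[j, j + 1:end])
        # Deferred update of everything to the right of the batch.
        block_err = W[:, start:end] - W_hat[:, start:end]
        V[:, end:] += block_err @ U[start:end, end:]
    return W_hat


# Small demo with a random strictly upper-triangular U.
rng = np.random.default_rng(0)
U_demo = np.triu(0.3 * rng.standard_normal((32, 32)), k=1)
W_demo = 3 * rng.standard_normal((8, 32))
lazy_result = round_lazy(W_demo, U_demo, batch=8)
```

Both variants compute the same feedback terms and differ only in when the updates are applied, so they agree up to floating-point summation order. The lazy form replaces many rank-one updates with fewer, larger matrix products.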

ZeroShot

# Compute full precision (FP16) results 
CUDA_VISIBLE_DEVICES=0 python main.py facebook/opt-125m c4 --wbits 16 --nsamples 0 --task <task>
# Evaluate saved model
CUDA_VISIBLE_DEVICES=0 python main.py facebook/opt-125m c4 --load <load_name> --nsamples 0 --task <task>

To evaluate a quantized model on zero-shot tasks, pass the saved quantized model weights to the script via --load. The evaluated tasks are {arc_easy, lambada, piqa, storycloze}.
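To evaluate one saved model on all four tasks, the command above can be generated in a loop. The following is a small helper sketch that uses only the flags shown above. The function name zeroshot_commands and the file name opt125m_quantized.pt are hypothetical.

```python
import shlex

TASKS = ["arc_easy", "lambada", "piqa", "storycloze"]


def zeroshot_commands(load_name, model="facebook/opt-125m", tasks=TASKS):
    """Build one main.py invocation per task for a saved quantized model."""
    return [
        ["python", "main.py", model, "c4",
         "--load", load_name, "--nsamples", "0", "--task", task]
        for task in tasks
    ]


# Hypothetical checkpoint name, used only for illustration.
for cmd in zeroshot_commands("opt125m_quantized.pt"):
    print(shlex.join(cmd))
```

Each command list can be passed to subprocess.run(cmd, check=True) to actually launch the evaluation, with CUDA_VISIBLE_DEVICES set in the environment as in the examples above.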

Benchmarking

Soon to come!

OPTQ and LDLQ Equivalence

Run the following script to empirically verify that the output of OPTQ's implementation and our implementation of LDLQ are identical: python optq_ldlq_equiv.py. Note OPTQ's implementation requires running on a GPU.

OPTQ/LDLQ Finite Grid Counterexample

Run python optq_counter.py to compute the proxy loss of our W,H counterexample.

Computing Proxy Loss

In a similar manner to opt.py, run opt_saveH.py to save the H matrices resulting from the specified model and quantization method. Then, run opt_proxy.py to compute the proxy loss for a specified quantization method.

CUDA_VISIBLE_DEVICES=0 python opt_proxy.py c4 --wbits 4 --quant <quant_method>
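For reference, the proxy loss in the paper is tr((W_hat - W) H (W_hat - W)^T), where H is the second-moment matrix of the layer inputs. The NumPy sketch below illustrates that formula on synthetic data. It is not the code in opt_proxy.py, which may differ in details such as normalization, and the function names are hypothetical.

```python
import numpy as np


def second_moment(X):
    """X has shape (num_samples, n); returns the average of x x^T."""
    return X.T @ X / X.shape[0]


def proxy_loss(W, W_hat, H):
    """tr((W_hat - W) H (W_hat - W)^T) for a weight matrix of shape (m, n)."""
    delta = W_hat - W
    return float(np.trace(delta @ H @ delta.T))


# Synthetic example: nearest rounding of random weights.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((128, 16))
W_demo = 3 * rng.standard_normal((4, 16))
loss_demo = proxy_loss(W_demo, np.rint(W_demo), second_moment(X_demo))
```

With H built from the same samples, this value equals the mean squared change in the layer output, that is, the average of ||(W_hat - W) x||^2 over the rows x of X.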

H Summary

Run the following script to compute summary statistics of a folder <dirname> of H matrices, output from running opt_saveH.py.

python compute_Hsummary.py --dirname <dirname> --savename <savename>
