
exl2-for-all's Introduction

EXL2 for all

EXL2 is a mixed-bits quantization method proposed in exllama v2: different layers are quantized at different bit-widths so that the model meets a target average bits per weight (bpw). This repo is derived from exllamav2 and adds support for more model architectures. Unlike repos such as AutoAWQ and AutoGPTQ, which include various kernel fusions, this repo contains only the minimal code needed for quantization and inference.

Installation

The exllama v2 kernels have to be installed first. See requirements.txt for the remaining dependencies.
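A minimal install sketch, assuming the kernels come from the prebuilt exllamav2 package on PyPI (building exllamav2 from source should work as well):

pip install exllamav2
pip install -r requirements.txt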

Examples

  • Quantization

exllamav2 changed the optimization algorithm in v0.0.11. By default this repo uses the new algorithm; if you want the old one, pass version="v1" to Exl2Quantizer.
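For example, to keep the old behavior (the bits value here is just illustrative):

quant = Exl2Quantizer(bits=4.0, version="v1")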

exllamav2 by default uses standard_cal_data, a calibration mix of c4, code, wiki, and other sources. To stay consistent with other quantization methods, this repo uses the redpajama dataset instead.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from exl2 import Exl2Quantizer

model_name = "meta-llama/Llama-2-7b-hf"
quant_dir = "llama-exl2-4bits"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Quantize to an average of 4.0 bits per weight, calibrating on redpajama
quant = Exl2Quantizer(bits=4.0, dataset="redpajama")
quant_model = quant.quantize_model(model, tokenizer)

# Save the quantized weights together with the tokenizer
quant.save(quant_model, quant_dir)
tokenizer.save_pretrained(quant_dir)
  • Inference
import torch
from transformers import AutoTokenizer
from model import load_quantized_model

# Download the 2.5 bpw quantized Llama-2 7B from the Hugging Face Hub
quant_model = load_quantized_model("turboderp/Llama2-7B-exl2", revision="2.5bpw")
tokenizer = AutoTokenizer.from_pretrained("turboderp/Llama2-7B-exl2", revision="2.5bpw")
input_ids = tokenizer.encode("The capital of France is", return_tensors="pt").cuda()
output_ids = quant_model.generate(input_ids, do_sample=True)
print(tokenizer.decode(output_ids[0]))

An additional parameter, modules_to_not_convert, is available because the Mixtral gate layer is usually left unquantized:

from model import Exl2ForCausalLM

quant_model = Exl2ForCausalLM.from_quantized("turboderp/Mixtral-8x7B-instruct-exl2",
                                             revision="3.0bpw",
                                             modules_to_not_convert=["gate"])
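As in the Llama example above, the returned model can then be used for generation via quant_model.generate.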

Perplexity

Perplexity of LLaMA-2 7B on wikitext:

bpw    perplexity
FP16   6.23
2.5    10.13
3.0    7.25
3.5    6.88
4.0    6.40
4.5    6.37
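Numbers like these can be obtained with a standard chunked perplexity evaluation. Below is a minimal sketch, not the exact script used for this table: the wikitext-2 split, the 2048-token chunk size, and the non-overlapping stride are assumptions, and it assumes the loaded model follows the standard transformers forward API (labels in, loss out).

import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from model import load_quantized_model

repo, rev = "turboderp/Llama2-7B-exl2", "4.0bpw"
model = load_quantized_model(repo, revision=rev)
tokenizer = AutoTokenizer.from_pretrained(repo, revision=rev)

# Tokenize the whole test split as one long sequence
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

# Score non-overlapping 2048-token chunks and average the cross-entropy losses
seq_len, nlls = 2048, []
for i in range(0, input_ids.size(1) - seq_len, seq_len):
    chunk = input_ids[:, i:i + seq_len]
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss)
print(f"perplexity: {torch.stack(nlls).mean().exp().item():.2f}")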

exl2-for-all's People

Contributors

chu-tianxiang


exl2-for-all's Issues

make_q_matrix(): incompatible function arguments

I copied and pasted the inference example, but I receive the following error message:

Traceback (most recent call last):
  File "/home/emoman/Work/rag/exl2-for-all/test_exl2forall.py", line 19, in <module>
    quant_model = Exl2ForCausalLM.from_quantized("turboderp/Llama2-7B-exl2", revision="2.5bpw")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emoman/Work/rag/exl2-for-all/model.py", line 151, in from_quantized
    model = post_init(model)
            ^^^^^^^^^^^^^^^^
  File "/home/emoman/Work/rag/exl2-for-all/utils.py", line 170, in post_init
    submodule.post_init(temp_dq=model.device_tensors[device])
  File "/home/emoman/Work/rag/exl2-for-all/qlinear.py", line 80, in post_init
    self.q_handle = ext_make_q_matrix(self.q_tensors, temp_dq)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emoman/Work/rag/exl2-for-all/qlinear.py", line 33, in ext_make_q_matrix
    return ext_c.make_q_matrix(w["q_weight"], w["q_perm"], w["q_invperm"],
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: make_q_matrix(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: torch.Tensor, arg6: torch.Tensor, arg7: torch.Tensor, arg8: torch.Tensor, arg9: torch.Tensor, arg10: torch.Tensor) -> int


Integration with Llama Index

Hello,

I am interested in integrating exl2-quantised models with llamaindex for RAG.

Do you think your library will work out of the box for this purpose?

Are you aware of any examples?

Best,

Ed

How to use 'Mixtral-8x7B-instruct-exl2'?

With this code:

from exl2forall.model import Exl2ForCausalLM

# https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2
model_name = 'turboderp/Mixtral-8x7B-instruct-exl2'
revision = '3.0bpw'

model = Exl2ForCausalLM.from_quantized(model_name, revision=revision)

I get the following error:

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-5-b754e793608b> in <cell line: 9>()
      7 revision = '3.0bpw'
      8 
----> 9 model = Exl2ForCausalLM.from_quantized(model_name, revision=revision)

2 frames

/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py in __getitem__(self, key)
    759             return self._extra_content[key]
    760         if key not in self._mapping:
--> 761             raise KeyError(key)
    762         value = self._mapping[key]
    763         module_name = model_type_to_module_name(key)

KeyError: 'mixtral'

whereas if I use bartowski/dolphin-2.6-mistral-7b-dpo-laser-exl2, there is no error.

Any idea why this error occurs?
