
exl2-for-all's Introduction

EXL2 for all

EXL2 is a mixed-bits quantization method proposed in exllama v2: different layers are quantized at different bit-widths so that the model meets a target average bits per weight (bpw). This repo is derived from exllamav2 and adds support for more model architectures. Unlike repos such as AutoAWQ and AutoGPTQ, which include various kernel fusions, this repo contains only the minimal code needed for quantization and inference.

Installation

The exllama v2 kernels have to be installed first. See requirements.txt for the remaining dependencies.
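A minimal install sketch, assuming the kernels come from the prebuilt exllamav2 package on PyPI (building exllamav2 from source should work as well):

pip install exllamav2
pip install -r requirements.txt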

Examples

  • Quantization

exllamav2 changed the optimization algorithm in v0.0.11. By default this repo uses the new algorithm; if you want the old one, pass version="v1" to Exl2Quantizer.
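For example, to keep the old behavior (the bits value here is just illustrative):

quant = Exl2Quantizer(bits=4.0, version="v1")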

exllamav2 by default uses standard_cal_data, a calibration mix of c4, code, wiki, and other sources. To stay consistent with other quantization methods, this repo uses the redpajama dataset instead.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from exl2 import Exl2Quantizer

model_name = "meta-llama/Llama-2-7b-hf"
quant_dir = "llama-exl2-4bits"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Quantize to an average of 4.0 bits per weight, calibrating on redpajama
quant = Exl2Quantizer(bits=4.0, dataset="redpajama")
quant_model = quant.quantize_model(model, tokenizer)

# Save the quantized weights together with the tokenizer
quant.save(quant_model, quant_dir)
tokenizer.save_pretrained(quant_dir)
  • Inference
import torch
from transformers import AutoTokenizer
from model import load_quantized_model

# Download the 2.5 bpw quantized Llama-2 7B from the Hugging Face Hub
quant_model = load_quantized_model("turboderp/Llama2-7B-exl2", revision="2.5bpw")
tokenizer = AutoTokenizer.from_pretrained("turboderp/Llama2-7B-exl2", revision="2.5bpw")
input_ids = tokenizer.encode("The capital of France is", return_tensors="pt").cuda()
output_ids = quant_model.generate(input_ids, do_sample=True)
print(tokenizer.decode(output_ids[0]))

An additional parameter, modules_to_not_convert, is available because the Mixtral gate layer is usually left unquantized:

from model import Exl2ForCausalLM

quant_model = Exl2ForCausalLM.from_quantized("turboderp/Mixtral-8x7B-instruct-exl2",
                                             revision="3.0bpw",
                                             modules_to_not_convert=["gate"])
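As in the Llama example above, the returned model can then be used for generation via quant_model.generate.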

Perplexity

Perplexity of LLaMA-2 7B on wikitext:

bpw    perplexity
FP16   6.23
2.5    10.13
3.0    7.25
3.5    6.88
4.0    6.40
4.5    6.37
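Numbers like these can be obtained with a standard chunked perplexity evaluation. Below is a minimal sketch, not the exact script used for this table: the wikitext-2 split, the 2048-token chunk size, and the non-overlapping stride are assumptions, and it assumes the loaded model follows the standard transformers forward API (labels in, loss out).

import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from model import load_quantized_model

repo, rev = "turboderp/Llama2-7B-exl2", "4.0bpw"
model = load_quantized_model(repo, revision=rev)
tokenizer = AutoTokenizer.from_pretrained(repo, revision=rev)

# Tokenize the whole test split as one long sequence
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()

# Score non-overlapping 2048-token chunks and average the cross-entropy losses
seq_len, nlls = 2048, []
for i in range(0, input_ids.size(1) - seq_len, seq_len):
    chunk = input_ids[:, i:i + seq_len]
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss)
print(f"perplexity: {torch.stack(nlls).mean().exp().item():.2f}")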

exl2-for-all's People

Contributors

chu-tianxiang


exl2-for-all's Issues

make_q_matrix(): incompatible function arguments

I copied and pasted the inference example, but I receive the following error message:

Traceback (most recent call last):
  File "/home/emoman/Work/rag/exl2-for-all/test_exl2forall.py", line 19, in <module>
    quant_model = Exl2ForCausalLM.from_quantized("turboderp/Llama2-7B-exl2", revision="2.5bpw")
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emoman/Work/rag/exl2-for-all/model.py", line 151, in from_quantized
    model = post_init(model)
            ^^^^^^^^^^^^^^^^
  File "/home/emoman/Work/rag/exl2-for-all/utils.py", line 170, in post_init
    submodule.post_init(temp_dq=model.device_tensors[device])
  File "/home/emoman/Work/rag/exl2-for-all/qlinear.py", line 80, in post_init
    self.q_handle = ext_make_q_matrix(self.q_tensors, temp_dq)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emoman/Work/rag/exl2-for-all/qlinear.py", line 33, in ext_make_q_matrix
    return ext_c.make_q_matrix(w["q_weight"], w["q_perm"], w["q_invperm"],
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: make_q_matrix(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: torch.Tensor, arg6: torch.Tensor, arg7: torch.Tensor, arg8: torch.Tensor, arg9: torch.Tensor, arg10: torch.Tensor) -> int


Integration with Llama Index

Hello,

I am interested in integrating exl2-quantised models with llamaindex for RAG.

Do you think your library will work out of the box for this purpose?

Are you aware of any examples?

Best,

Ed

How to use 'Mixtral-8x7B-instruct-exl2'?

With this code:

from exl2forall.model import Exl2ForCausalLM

# https://huggingface.co/turboderp/Mixtral-8x7B-instruct-exl2
model_name = 'turboderp/Mixtral-8x7B-instruct-exl2'
revision = '3.0bpw'

model = Exl2ForCausalLM.from_quantized(model_name, revision=revision)

I get the following error:

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

<ipython-input-5-b754e793608b> in <cell line: 9>()
      7 revision = '3.0bpw'
      8 
----> 9 model = Exl2ForCausalLM.from_quantized(model_name, revision=revision)

2 frames

/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py in __getitem__(self, key)
    759             return self._extra_content[key]
    760         if key not in self._mapping:
--> 761             raise KeyError(key)
    762         value = self._mapping[key]
    763         module_name = model_type_to_module_name(key)

KeyError: 'mixtral'

whereas if I use bartowski/dolphin-2.6-mistral-7b-dpo-laser-exl2, there is no error.

Any idea why this error occurs?
