turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

License: MIT License

Python 56.28% Shell 2.39% C++ 6.24% Cuda 21.19% JavaScript 8.24% CSS 4.12% HTML 1.15% C 0.08% Dockerfile 0.32%

exllama's Issues

Is it possible to do fine-tuning with exllama?

I have tested this on 4x 2080 Ti. It works well and reaches almost 10 t/s for LLaMA 65B, compared with only 0.65 t/s for bnb 4-bit. Really amazing. First of all, please accept my thanks and adoration.

If you need tests done on 20-series GPUs, maybe I can help.

Also, can this be used for fine-tuning with LoRA?

Question - possible to run starcoder with exllama?

A 15.5B-parameter model called StarCoder was released recently: https://huggingface.co/bigcode/starcoder

You should be able to run it with text-generation-webui using a fork of GPTQ-for-LLaMa called GPTQ-for-SantaCoder: https://github.com/mayank31398/GPTQ-for-SantaCoder

As far as I can tell, both use the same underlying library (transformers) and the same quantization method (GPTQ), so shouldn't it be possible to run this with exllama?

Any ideas on how I would go about doing this? @turboderp @disarmyouwitha

Multi-GPU

I see from your own testing that you have multi-GPU working.

Following the instructions, test_benchmark_inference.py and test_chatbot.py both worked on one of my RTX 3060s with a 13B model. My other two 3060s were detected but not used.

Attempting to load a 33B LLaMA model across all three cards led to a CUDA OOM error before the model finished loading: only a single card was used, and none of the other cards showed any VRAM usage.

Are there any tricks to multi-card setups, or parameters I should be passing?

I have noticed that while it massively increases inference speed, it also massively decreases the quality of the outputs: instruct models become very obstinate and give completely irrelevant responses, words are misspelled, lines are repeated over and over, and sometimes the output is spammed with Chinese characters.

"ValueError: Found group index but no groupsize. What do?"

Getting the exception in the title with a new model from here: https://huggingface.co/TheBloke/tulu-30B-GPTQ

Possible to add a pip package?

Heya, I'm writing a langchain binding for exllama. I'd love to be able to pip install exllama and access the libraries natively in Python. Right now I'm not sure how I'd ship the langchain module without publishing my own binding library on pip, which seems very awkward.

Streaming API

First and foremost, this is a terrific project.
I've been trying to integrate it with other apps, but the API is a little different from other implementations such as KoboldAI and its API, or text-generation-webui and its API examples.
With my limited knowledge I got it to work (while the web app is running) with the following script, although it's not the best:

import json
import requests

url = 'http://0.0.0.0:5005/api/userinput'
data = {'user_input': 'What time is it? Write a very looong essay about time.'}

# Send the POST request as JSON and stream the response.
response = requests.post(url, json=data, stream=True)
response.raise_for_status()

# Each non-empty line is a JSON object; print its 'text' field as it arrives.
for line in response.iter_lines():
    if not line:  # skip keep-alive blank lines, which json.loads would reject
        continue
    text_value = json.loads(line).get('text')
    if text_value is not None:
        print(text_value, end="", flush=True)

What do you think about adding a streaming API endpoint at /api/stream that is not tied to the backend's user handling and message saving, and is "stateless" so that it follows REST principles? Since this is one of the most performant backends, it would surely boost its popularity.

Pure C++ core instead of Python

I'm curious: is it possible to extract just the C++/CUDA core from the project to integrate into external systems? Basically, to have something like llama.cpp without Python at all, or at least as the initial step for compiling/building exllama.

API for batched input?

Thanks for this great project. The inference speed is exceptional. However, it seems the generator API only supports a single string as input. When serving concurrent requests, batching of inputs will be needed for better throughput.

Support for llama models with >2048 context?

Hi there! As always, thanks for the amazing project.

I was trying to load Minotaur-15B (8192 max context), quantised to 4-bit with GPTQ-for-LLaMa.
https://huggingface.co/TheBloke/minotaur-15B-GPTQ

At first I tried with the ooba text webui and got:

2023-06-19 15:07:54 INFO:Loading TheBloke_minotaur-15B-GPTQ...
2023-06-19 15:07:56 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "F:\ChatIAs\oobabooga\text-generation-webui\server.py", line 62, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "F:\ChatIAs\oobabooga\text-generation-webui\modules\models.py", line 65, in load_model
    output = load_func_map[loader](model_name)
  File "F:\ChatIAs\oobabooga\text-generation-webui\modules\models.py", line 277, in ExLlama_loader
    model, tokenizer = ExllamaModel.from_pretrained(model_name)
  File "F:\ChatIAs\oobabooga\text-generation-webui\modules\exllama.py", line 35, in from_pretrained
    config = ExLlamaConfig(str(model_config_path))
  File "F:\ChatIAs\oobabooga\text-generation-webui\repositories\exllama\model.py", line 39, in __init__
    self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'

Then I tried with exllama directly and got:

(venv) PS F:\ChatIAs\exllama> python webui/app.py -d "F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ"
 -- Tokenizer: F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ\tokenizer.model
 -- Model config: F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ\config.json
 -- Model: F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ\4bit-128g.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: []
Traceback (most recent call last):
  File "F:\ChatIAs\exllama\webui\app.py", line 133, in <module>
    config = model_init.make_config(args)
  File "F:\ChatIAs\exllama\model_init.py", line 97, in make_config
    config = ExLlamaConfig(args.config)
  File "F:\ChatIAs\exllama\model.py", line 39, in __init__
    self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'

Is there any setting that I'm missing? Or are these models not compatible yet? Thanks!

Error when trying to run Wizard-Vicuna-13B-Uncensored-GPTQ

/exllama$ python test_benchmark_inference.py -d Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
 -- Loading model
 -- Tokenizer: Wizard-Vicuna-13B-Uncensored-GPTQ/tokenizer.model
 -- Model config: Wizard-Vicuna-13B-Uncensored-GPTQ/config.json
 -- Model: Wizard-Vicuna-13B-Uncensored-GPTQ/Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: switched', 'matmul: switched', 'mlp: switched', 'perf', 'perplexity']
Traceback (most recent call last):
  File "/exllama/test_benchmark_inference.py", line 171, in <module>
    wrapper = timer("Load model", lambda: ModelWrapper(args))
  File "/exllama/test_benchmark_inference.py", line 73, in timer
    ret = func()
  File "/exllama/test_benchmark_inference.py", line 171, in <lambda>
    wrapper = timer("Load model", lambda: ModelWrapper(args))
  File "/exllama/test_benchmark_inference.py", line 51, in __init__
    self.model = ExLlama(config)
  File "/exllama/model.py", line 883, in __init__
    with safe_open(self.config.model_path, framework="pt", device="cpu") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
Exception ignored in: <function ExLlama.__del__ at 0x7fd43bfe1fc0>
Traceback (most recent call last):
  File "/exllama/model.py", line 1066, in __del__
    if torch_device is not None: cuda_ext.free_cuda_buffers(torch_device)
  File "/exllama/cuda_ext.py", line 57, in free_cuda_buffers
    free_buffers(device)
TypeError: free_buffers(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.device, arg1: int, arg2: int, arg3: int, arg4: int) -> None

Invoked with: device(type='cuda', index=0)
/exllama$ ^C

Has anyone seen an error like this before?
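
One cause of `HeaderTooLarge` worth ruling out, though nothing in this report confirms it, is that the `.safetensors` file on disk is a Git LFS pointer (a small text stub) rather than the real weights. A minimal check, assuming only the standard pointer prefix; `looks_like_lfs_pointer` is a made-up name for illustration:

```python
# Git LFS pointer files are small text files that begin with this line.
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_PREFIX))
    return head == LFS_PREFIX
```

If this returns True for the model file, re-downloading the actual weights would be the first thing to try.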

RuntimeError: CUDA error: an illegal memory access was encountered

RuntimeError                              Traceback (most recent call last)
Cell In[3], line 4
      2 config.model_path = model_path
      3 config.max_seq_len = 2048
----> 4 model = ExLlama(config)
      5 cache = ExLlamaCache(model)
      6 tokenizer = ExLlamaTokenizer(tokenizer_model_path)

File /workspace/exllama/model.py:759, in ExLlama.__init__(self, config)
    756     device = self.config.device_map.layers[i]
    757     sin, cos = self.sincos[device]
--> 759     layer = ExLlamaDecoderLayer(self.config, tensors, f"model.layers.{i}", i, sin, cos)
    761     modules.append(layer)
    763 self.layers = modules

File /workspace/exllama/model.py:345, in ExLlamaDecoderLayer.__init__(self, config, tensors, key, index, sin, cos)
    342 self.config = config
    343 self.index = index
--> 345 self.self_attn = ExLlamaAttention(self.config, tensors, key + ".self_attn", sin, cos, self.index)
    346 self.mlp = ExLlamaMLP(self.config, tensors, key + ".mlp")
    348 self.input_layernorm = ExLlamaRMSNorm(self.config, tensors, key + ".input_layernorm.weight")

File /workspace/exllama/model.py:260, in ExLlamaAttention.__init__(self, config, tensors, key, sin, cos, index)
    258 self.k_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".k_proj")
    259 self.v_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".v_proj")
--> 260 self.o_proj = Ex4bitLinear(config, self.config.num_attention_heads * self.config.head_dim, self.config.hidden_size, False, tensors, key + ".o_proj")

File /workspace/exllama/model.py:137, in Ex4bitLinear.__init__(self, config, in_features, out_features, has_bias, tensors, key)
    135 self.qzeros = tensors[key + ".qzeros"]
    136 self.scales = tensors[key + ".scales"]
--> 137 self.g_idx = tensors[key + ".g_idx"].cpu() if key + ".g_idx" in tensors else None
    138 self.bias = tensors[key + ".bias"] if has_bias else None
    140 self.device = self.qweight.device

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Running on RunPod, 2x 3090, with PyTorch 2.1.0.dev20230607+cu118.

Splitting model on multiple GPUs produces RuntimeError

When attempting to split the model on multiple GPUs, I get the following error:

> python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/ -gs 16,22 -p prompt_assistant.txt -un "John" -bn "Assistant" -temp 1.00 -topp 0.95 -beams 5 -beamlen 20 -mm quant_only
 -- Loading model
 -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/tokenizer.model
 -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/config.json
 -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/Guanaco-65B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Temperature: 1.00
 -- Top-K: 20
 -- Top-P: 0.95
 -- Min-P: 0.00
 -- Repetition penalty: 1.15
 -- Beams: 5 x 20
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: quant_only', 'gpu_split: 16,22']
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
This is a conversation between a person called John and an AI chatbot called Assistant. Assistant is a helpful and articulate virtual assistant possessing deep knowledge on a multitude of subjects. Assistant will never refuse to answer a question or comply with a request.
Assistant: Hello, John. I am Assistant, your virtual assistant. How may I help you?
John: Testing
Assistant:Traceback (most recent call last):
  File "/home/john/Projects/exllama/test_chatbot.py", line 213, in <module>
    gen_token = generator.beam_search()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/exllama/generator.py", line 385, in beam_search
    tokens, probs = self.sample(logits,
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/exllama/generator.py", line 94, in sample
    sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

This only happens if the model is split between GPUs using the -gs option.

Typo in model.py

AttributeError: 'ExLlamaConfig' object has no attribute 'sdp_thd'. Did you mean: 'stp_thd'?

I think that in line 78:

self.stp_thd = 8

should be

self.sdp_thd = 8

Thanks for all your hard work, great project!

65B working on multi-gpu

This is not an issue, just a report that it works great with Guanaco-65B-GPTQ-4bit.act-order.safetensors from TheBloke using 2x 3090. Speed is great, about 15 t/s.

Batch generation support

Great repo.

Are there any plans to add support for batched generation?

Any idea how much work this would be? I can potentially work on it if you can point me in the right direction.

ExLlama API spec / discussion

Opening a new thread to continue the conversation about the API. I think a dedicated discussion thread will be valuable as the project continues to scale.

Continuation from: #12

Support for StarCoder

Hello there,
upon loading StarCoder and its derivatives like WizardCoder the following error is thrown:

Traceback (most recent call last):
  File "/media/bkutasi/60824A4F824A29BC/Other_projects/exllama/test_benchmark_inference.py", line 114, in <module>
    config = model_init.make_config(args)
  File "/media/bkutasi/60824A4F824A29BC/Other_projects/exllama/model_init.py", line 97, in make_config
    config = ExLlamaConfig(args.config)
  File "/media/bkutasi/60824A4F824A29BC/Other_projects/exllama/model.py", line 40, in __init__
    self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'

Since it's a different model architecture (GPTBigCodeForCausalLM instead of LlamaForCausalLM), the config is quite different, so pad_token_id, hidden_size, and other parameters are missing. Am I loading it wrong, or are these model types not supported?
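
A quick way to see which architecture a checkpoint declares before trying to load it is to read the `architectures` field of its `config.json`. The sketch below assumes a Hugging Face style config; the function names are illustrative and not part of exllama:

```python
import json


def config_architectures(config_path):
    """Return the 'architectures' list from a config.json (empty if absent)."""
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = json.load(f)
    return cfg.get("architectures", [])


def is_llama_config(config_path):
    # Compare case-insensitively, since some older Llama conversions
    # spell the class name with different capitalisation.
    return any(a.lower() == "llamaforcausallm"
               for a in config_architectures(config_path))
```

A config that lists `GPTBigCodeForCausalLM` would return False here, which matches the missing-key errors described above.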

Inference speed of GPTQ 4-bit quantized models

Has anyone compared the inference speed of a 4-bit quantized model with the original FP16 model?
Is it faster than the original FP16 model?

Are you able to help?

I've been trying to set up exllama with my web server. The server creates an instance of LlamaModelRepo and calls loadModel. If I use the following code, I get text output fine:

class LlamaModelRepo:
    tokenizer: ExLlamaTokenizer = None
    generator: ExLlamaGenerator = None
    config: ExLlamaConfig = None
    model: ExLlama = None
    cache: ExLlamaCache = None
    def __init__(self):
        self.models: list = []
        self.modelsDir: str = './models'

    def loadModel(self, llamaModel: LlamaModel):
        errors = []
        configPath = llamaModel.path + "/config.json"
        if (not exists(configPath)):
            errors.append(f"{configPath} does not exist")
        
        modelPath = llamaModel.path + "/" + llamaModel.modelFile
        if (not exists(modelPath)):
            errors.append(f"{modelPath} does not exist")
            
        tokenizerModelPath = llamaModel.path + "/tokenizer.model"
        if (not exists(tokenizerModelPath)):
            errors.append(f"{tokenizerModelPath} does not exist")

        if errors:
            raise Exception("\n".join(errors))
        
        torch.set_grad_enabled(False)
        torch.cuda._lazy_init()
        self.config = ExLlamaConfig(configPath)
        self.config.model_path = modelPath
        self.config.max_seq_len = 2048
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)

        self.tokenizer = ExLlamaTokenizer(tokenizerModelPath)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)
        self.generator.settings.token_repetition_penalty_max = 1.2
        self.generator.settings.token_repetition_penalty_sustain = 20
        self.generator.settings.token_repetition_penalty_decay = 50
        gen_tokens = 200
        text = self.generator.generate_simple("test", max_new_tokens = 200)
        print(text)

Printed output

test Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a list of 5 adjectives to describe a car.

### Response:
1. Sleek
2. Powerful
3. Luxurious
4. Reliable
5. Sporty

If I move the last lines into a separate function, call loadModel, and then call chat through a separate request to the server, I get an exception. I'm very confused about why this is happening and wonder whether I'm doing something wrong.

class LlamaModelRepo:
    tokenizer: ExLlamaTokenizer = None
    generator: ExLlamaGenerator = None
    config: ExLlamaConfig = None
    model: ExLlama = None
    cache: ExLlamaCache = None
    def __init__(self):
        self.models: list = []
        self.modelsDir: str = './models'

    def loadModel(self, llamaModel: LlamaModel):
        errors = []
        configPath = llamaModel.path + "/config.json"
        if (not exists(configPath)):
            errors.append(f"{configPath} does not exist")
        
        modelPath = llamaModel.path + "/" + llamaModel.modelFile
        if (not exists(modelPath)):
            errors.append(f"{modelPath} does not exist")
            
        tokenizerModelPath = llamaModel.path + "/tokenizer.model"
        if (not exists(tokenizerModelPath)):
            errors.append(f"{tokenizerModelPath} does not exist")

        if errors:
            raise Exception("\n".join(errors))
        
        torch.set_grad_enabled(False)
        torch.cuda._lazy_init()
        self.config = ExLlamaConfig(configPath)
        self.config.model_path = modelPath
        self.config.max_seq_len = 2048
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)

        self.tokenizer = ExLlamaTokenizer(tokenizerModelPath)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)
        self.generator.settings.token_repetition_penalty_max = 1.2
        self.generator.settings.token_repetition_penalty_sustain = 20
        self.generator.settings.token_repetition_penalty_decay = 50
        gen_tokens = 200

    def chat(self, text: str, params:dict = {}):
        text = self.generator.generate_simple("test", max_new_tokens = 200)
        print(text)

exception

Traceback (most recent call last):
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 2213, in __call__
    return self.wsgi_app(environ, start_response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 2193, in wsgi_app
    response = self.handle_exception(e)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/server.py", line 110, in chat
    return modelRepo.chat(text="test")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/repos/model_repo.py", line 97, in chat
    text = self.generator.generate_simple(text, max_new_tokens = 200)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/generator.py", line 176, in generate_simple
    self.gen_begin(ids)
  File "/mnt/kanna/Documents/llm/exllama/generator.py", line 103, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True)
  File "/mnt/kanna/Documents/llm/exllama/model.py", line 1153, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device])
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/model.py", line 540, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/model.py", line 447, in forward
    query_states = self.q_proj.forward(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/model.py", line 314, in forward
    out = cuda_ext.ExAutogradMatmul4bitCuda.apply(x, self.qweight, self.scales, self.qzeros, self.groupsize, self.bits, self.maxq)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/cuda_ext.py", line 271, in forward
    raise ValueError("Not implemented yet")
ValueError: Not implemented yet
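
One plausible explanation, not confirmed in this report: `torch.set_grad_enabled(False)` only affects the thread that calls it, and a web server may handle the chat request on a different thread than the one that ran `loadModel`. Gradients would then be enabled again in the request thread, and the autograd code path seen in the traceback would be taken. If that is the cause, wrapping the generation call in `with torch.no_grad():` inside `chat` would be the usual remedy. The sketch below uses only the standard library to illustrate the thread-local effect; `_state`, `set_flag`, and `get_flag` are stand-ins, not PyTorch or exllama APIs.

```python
import threading

_state = threading.local()  # each thread sees its own attributes


def set_flag(value):
    _state.grad_enabled = value


def get_flag():
    # Defaults to True, mirroring autograd being on in a fresh thread.
    return getattr(_state, "grad_enabled", True)


def read_flag_in_new_thread():
    result = []
    t = threading.Thread(target=lambda: result.append(get_flag()))
    t.start()
    t.join()
    return result[0]


set_flag(False)                                  # the "loadModel" thread turns the flag off
main_thread_value = get_flag()                   # off in this thread
other_thread_value = read_flag_in_new_thread()   # a new thread still sees the default
```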

How to get the correct model type?

My original model is a .bin file, so I used the code below to convert the model format:

import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("path/to/llama_7b", torch_dtype=torch.float16, device_map='auto')
model.save_pretrained('path/to/exllama', safe_serialization=True, max_shard_size="200GB")

But when I run test_benchmark_inference.py, I get an error:

the model didn't have qzeros/scales

How can I deal with this? Could you help me?
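
The missing `qzeros`/`scales` tensors suggest the saved file holds plain FP16 weights, since the conversion code above loads and saves the model without any quantization step. A minimal sketch for listing the tensor names in a `.safetensors` file with only the standard library, assuming the published file layout (an 8-byte little-endian header length followed by a JSON header); the function names are illustrative:

```python
import json
import struct


def safetensors_tensor_names(path):
    """Read only the JSON header of a .safetensors file and return tensor names."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional entry that does not describe a tensor.
    return [name for name in header if name != "__metadata__"]


def has_gptq_tensors(path):
    names = safetensors_tensor_names(path)
    return (any(n.endswith(".qzeros") for n in names)
            and any(n.endswith(".scales") for n in names))
```

If `has_gptq_tensors` returns False, the file would first need to go through a GPTQ quantization tool before a GPTQ loader can use it.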

WebUI Multi-bot

Is there anything I need to do to make this work? Simply adding names doesn't change anything. I've also tried creating prompt text files matching the added names, but nothing changed.

Other than that it's working pretty well for talking to one bot.

Landmark Attention support

Any thoughts on how difficult it would be to support inference on a model trained with landmark attention, such as the Minotaur, Wizard, or base Llama landmark finetunes released recently? I suppose more will come out now that multiple repos support LoRA/QLoRA/GPTQ-LoRA training with landmark attention.

I haven’t compared results yet, but it sounds like landmark attention should be more effective with long contexts compared to the turboderp/alpaca_lora_4bit repo. Like the author, I found that that repo did “something”, and stopped generating gibberish beyond 2048 at least, but I’m not sure what the model learned. The landmark attention paper claims it can solve needle-haystack problems beyond the context length, which I couldn’t get the previous method to do.

Landmark attention apparently works with oobabooga with remote code enabled.

OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

Pop!_OS 20.04
Python 3.8.10
AMD 6800 XT GPU

Installed with:

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118

pip install safetensors sentencepiece ninja

git clone https://github.com/turboderp/exllama
cd exllama

When running python3 test_benchmark_inference.py -d /home/user1/models/ -p -ppl or python example_chatbot.py -d /home/user1/models/ -un "Jeff" -p prompt_chatbot.txt I get the following error:

Traceback (most recent call last):
  File "test_benchmark_inference.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "/home/user1/bin/exllama/model.py", line 5, in <module>
    import cuda_ext
  File "/home/user1/bin/exllama/cuda_ext.py", line 42, in <module>
    exllama_ext = load(
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1286, in load
    return _jit_compile(
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1511, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1603, in _write_ninja_file_and_build_library
    extra_ldflags = _prepare_ldflags(
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1702, in _prepare_ldflags
    if (not os.path.exists(_join_cuda_home(extra_lib_dir)) and
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2238, in _join_cuda_home
    raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

What am I doing wrong?
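
The traceback comes from PyTorch's extension builder looking for a CUDA toolkit, while the system listed above has an AMD GPU and a `cu118` build of PyTorch. A small diagnostic sketch, assuming the conventional `CUDA_HOME`/`CUDA_PATH` and `ROCM_HOME`/`ROCM_PATH` variables and the `nvcc`/`hipcc` compiler names; `toolkit_report` is a hypothetical helper that only reports what is visible and does not fix anything:

```python
import os
import shutil


def toolkit_report(env=None, which=shutil.which):
    """Collect the toolkit locations a build step would typically look for."""
    env = os.environ if env is None else env
    return {
        "CUDA_HOME": env.get("CUDA_HOME") or env.get("CUDA_PATH"),
        "ROCM_HOME": env.get("ROCM_HOME") or env.get("ROCM_PATH"),
        "nvcc": which("nvcc"),    # NVIDIA CUDA compiler, if on PATH
        "hipcc": which("hipcc"),  # AMD ROCm compiler, if on PATH
    }


if __name__ == "__main__":
    for key, value in toolkit_report().items():
        print(f"{key}: {value}")
```

If every entry is None, no GPU toolkit is visible to the build, regardless of which PyTorch wheel is installed.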

KeyError when loading GPTQ Model

When trying to run the GPTQ Model (https://huggingface.co/TheBloke/starchat-beta-GPTQ) which works fine with other GPTQ loaders, I get the following using ExLlama via oobabooga webui:

line 40, in __init__
self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'

exllama/model.py

Line 40 in dd63e07

self.pad_token_id = read_config["pad_token_id"]

Cuda 12.1 - Fails to Build Here

exllama/exllama_ext/q4v2_matmul.cu(116): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (half *, half)
      atomicAdd(out_.item_ptr(x_row, w_column), result);

Kernel wouldn't compile in my conda env

The kernel was unable to find the CUDA libs, and compilation failed, until I added a link from lib to lib64. The kernel test code is also out of date, as its paths are wrong.

Working with TheBloke/WizardLM-30B-Uncensored-GPTQ

Hi! I got this to work with TheBloke/WizardLM-30B-Uncensored-GPTQ.

Here's what worked:

1. This doesn't work on Windows, but it does work on WSL.
2. Download the model (and all files) from HF and place it somewhere inside the WSL Linux filesystem, not under /mnt/c/somewhere, otherwise model loading will be extremely slow regardless of your disk speed.
3. In model.py I added the following:

        # self.groupsize = (self.qweight.shape[0] * 8) // self.qzeros.shape[0]
        self.groupsize = None
        self.config.groupsize = None
        self.config.act_order = True

        # self.config.groupsize = self.groupsize
        # if self.config.groupsize is None:
        #     self.config.groupsize = self.groupsize
        # else:
        #     if self.config.groupsize != self.groupsize:
        #         raise ValueError("Irregular groupsize for matrix: " + key + ", " + str(self.config.groupsize) + ", "+ str(self.groupsize))

Note the commented-out code and the additions.
4. I had to use -mm pytorch_only and -a pytorch_matmul.
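
The commented-out line in the snippet infers the group size from tensor shapes. As a worked sketch, assuming the usual 4-bit GPTQ packing (eight 4-bit values per 32-bit word along the input dimension of `qweight`, and one row of `qzeros` per group); the sizes below are illustrative and not read from this model:

```python
def infer_groupsize(qweight_rows, qzeros_rows, bits=4):
    """Recover the group size from the row counts of qweight and qzeros."""
    values_per_word = 32 // bits            # 8 values per int32 for 4-bit
    in_features = qweight_rows * values_per_word
    return in_features // qzeros_rows


# in_features = 6656 with groups of 128: qweight has 832 rows, qzeros has 52.
grouped = infer_groupsize(832, 52)
# Same layer without grouping: qzeros has a single row, so the inferred
# "group" spans the whole input dimension.
ungrouped = infer_groupsize(832, 1)
print(grouped, ungrouped)
```

For a model quantized without groups, the formula therefore yields the full input width rather than a conventional group size, which is consistent with the manual override described in step 3.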

Performance degradation

I ran a test on the latest commit (77545c) and on bec6c9, on an H100 with a 30B model, and I see a consistent performance degradation.

Commit            Speed
Latest (77545c)   25 t/s
bec6c9            34 t/s

Thoughts?
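
For scale, the two figures reported above correspond to a relative slowdown of roughly a quarter. A short sketch of the arithmetic; the helper name is made up:

```python
def relative_slowdown(old_tps, new_tps):
    """Fraction of throughput lost going from old_tps to new_tps."""
    return (old_tps - new_tps) / old_tps


slowdown = relative_slowdown(34, 25)   # 9 / 34
print(f"{slowdown:.1%}")
```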

TransformerEngine FP8 support

Hello! Could this project use the new H100 TransformerEngine for a speedup? If so, I would be very interested, and I would also pay for your H100 cloud GPU access if you could estimate how long you would need it.

Thank you very much!

Tesla P40 only using 70W under load

My P40 only draws about 70W while generating responses. It's not limited in any way (e.g. power delivery or temperature).

Feature Request: length_penalty support

We are trying to port transformer-based generation code to exllama but could not find a configurable length_penalty control. Will this be on the roadmap? Thanks.

"fatal error LNK1104: cannot open file 'python310.lib'" + Solution (Windows)

I installed exllama on a secondary drive, then tried to install the dependencies both inside a venv and in the root folder. In every case I got a long error when I tried to run the test (python test_benchmark_inference.py -d <path_to_model_files> -p -ppl), including:

fatal error LNK1104: cannot open file 'python310.lib'

The solution was to copy the python310.lib file from Program Files\Python310\libs and paste it into \venv\Scripts\libs. Note I had to make that directory myself.
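
The manual copy described above can be scripted. This is a sketch under the assumption that the directory layout matches the description; `copy_python_lib` and its parameters are hypothetical names, and both directories must be supplied by the caller:

```python
import shutil
from pathlib import Path


def copy_python_lib(python_libs_dir, venv_dir, lib_name="python310.lib"):
    """Copy the import library into <venv>/Scripts/libs, creating the folder."""
    src = Path(python_libs_dir) / lib_name
    dest_dir = Path(venv_dir) / "Scripts" / "libs"
    dest_dir.mkdir(parents=True, exist_ok=True)   # the folder does not exist by default
    return Path(shutil.copy2(src, dest_dir / lib_name))
```

It would be called with the `libs` folder of the Python installation and the path of the venv, and it returns the path of the copied file.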

Using cache causes random behavior during generation

I'm currently testing the differences in generation behavior between exllama and AutoGPTQ, and I found that using the cache with exllama produces different results for the same prompt even with greedy decoding.

def exllama_greedy_gen_wo_cache(prompt, max_length):
    seq = tokenizer.encode(prompt) # Huggingface tokenizer
    for _ in range(max_length):
        temp_cache = ExLlamaCache(model_exllama)
        logits = model_exllama.forward(torch.tensor([seq], dtype=torch.long), temp_cache)[0][0]
        seq.append(torch.argmax(logits).item())
    return seq

for i in range(10):
    print(tokenizer.decode(exllama_greedy_gen_wo_cache("Hello,", 20)))

Generating without the cache is really slow, but I get consistent outputs.

But when I enable the cache to get much faster generation, I start seeing inconsistencies between generations:

def exllama_greedy_gen_wi_cache(prompt, max_length):
    seq = tokenizer.encode(prompt) # Huggingface tokenizer
    gen_cache = ExLlamaCache(model_exllama)
    model_exllama.forward(torch.tensor([seq[:-1]], dtype=torch.long), gen_cache, preprocess_only = True)
    for _ in range(max_length):
        logits = model_exllama.forward(torch.tensor([seq[-1:]], dtype=torch.long), gen_cache)[0][0]
        seq.append(torch.argmax(logits).item())
    return seq

for i in range(10):
    print(tokenizer.decode(exllama_greedy_gen_wi_cache("Hello,", 20)))

I wonder whether this is a bug in the cache implementation or whether I'm using the cache the wrong way.
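To narrow this down, it may help to compare the token ids from two runs directly rather than the decoded text, and to note the first position where they part ways. A minimal sketch; `first_divergence` is a hypothetical helper, not something exllama provides.

```python
def first_divergence(a, b):
    """Return the first index at which two token sequences differ, or None if they are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

Calling it on the output of the cached function and the uncached function for the same prompt shows whether the mismatch starts at the first generated token or only later in the sequence.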

Docker and ownership permissions

Hi, I store my models on a local Synology NAS, which does not allow me to change the ownership of the files.

I get the following error when starting up the Docker container with docker compose:

 ⠿ Network exllama_default  Created  0.0s
 ⠿ Container exllama-web-1  Created  0.1s
Attaching to exllama-web-1
exllama-web-1  | chown: changing ownership of '/app/model/minotaur-13b-GPTQ-4bit-128g.no-act.order.safetensors': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/tokenizer.model': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/tokenizer_config.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/tokenizer.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/special_tokens_map.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/quantize_config.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/generation_config.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/config.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/README.md': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model': Operation not permitted
exllama-web-1 exited with code 1

Unsure how to proceed
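The log shows chown failing on every file under /app/model. Whether the ownership change is needed at all may depend on whether the container user can already read those files. A small sketch for checking that from inside the container; the function name is hypothetical and the path to pass in is whatever your mount point is.

```python
import os


def unreadable_paths(root):
    """List entries under root that the current user cannot read."""
    blocked = [] if os.access(root, os.R_OK) else [root]
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK):
                blocked.append(path)
    return blocked
```

An empty result for the model directory would suggest the files are readable as mounted, so the failing chown is the only obstacle.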

Lora support

Congrats and thank you again for a project that changes everything. Can't use anything else and now I even prefer your Web UI to the std. text-web-ui...

In some instances it would be super useful to be able to load separate LoRAs on top of a GPTQ model loaded with exllama.

SqueezeLLM Support?

https://github.com/SqueezeAILab/SqueezeLLM: is this something exllama will support out of the box? What would integrating support look like?

ExLlamaDeviceMap's layers offload to CPU?

I would like to test-run a 7B model on my 3050 with 4 GB of VRAM. It looks like exllama does not support offloading the model to CPU yet?

Gradio error: "Not implemented yet"

I'm getting an error when attempting to use generate_simple inside of a Gradio UI. I can run test_inference.py just fine, however when I put that code into a Gradio UI and attempt to redirect the output to a Chatbot component, I get the below error:

Traceback (most recent call last):
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/routes.py", line 422, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1051, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/mmealman/src/exllama/webui/Chatbot.py", line 72, in bot
    bot_message = self.predict(history, user_message)
  File "/home/mmealman/src/exllama/webui/Chatbot.py", line 58, in predict
    return self.textgen.test_generate()
  File "/home/mmealman/src/exllama/TextGenerator.py", line 96, in test_generate
    text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
  File "/home/mmealman/src/exllama/generator.py", line 176, in generate_simple
    self.gen_begin(ids)
  File "/home/mmealman/src/exllama/generator.py", line 103, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True)
  File "/home/mmealman/src/exllama/model.py", line 1153, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device])
  File "/home/mmealman/src/exllama/model.py", line 540, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer)
  File "/home/mmealman/src/exllama/model.py", line 447, in forward
    query_states = self.q_proj.forward(hidden_states)
  File "/home/mmealman/src/exllama/model.py", line 314, in forward
    out = cuda_ext.ExAutogradMatmul4bitCuda.apply(x, self.qweight, self.scales, self.qzeros, self.groupsize, self.bits, self.maxq)
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/mmealman/src/exllama/cuda_ext.py", line 271, in forward
    raise ValueError("Not implemented yet")
ValueError: Not implemented yet

Below is the generation code I'm calling in the Chatbot:

    def test_generate(self):
        tokenizer_model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/tokenizer.model"
        model_config_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/config.json"
        model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors"
        config = ExLlamaConfig(model_config_path)
        config.model_path = model_path
        config.max_seq_len = 2048
        model = ExLlama(config)
        cache = ExLlamaCache(model)

        tokenizer = ExLlamaTokenizer(tokenizer_model_path)
        generator = ExLlamaGenerator(model, tokenizer, cache)
        generator.settings.token_repetition_penalty_max = 1.2
        generator.settings.token_repetition_penalty_sustain = 20
        generator.settings.token_repetition_penalty_decay = 50

        prompt = \
        "On 19 February 1952, Headlam became senior air staff officer (SASO) at Eastern Area Command in Penrith, New South " \
        "Wales. During his term as SASO, the RAAF began re-equipping with English Electric Canberra jet bombers and CAC " \
        "Sabre jet fighters. The Air Force also underwent a major organisational change, as it transitioned from a " \
        "geographically based command-and-control system to one based on function, resulting in the establishment of Home " \
        "(operational), Training, and Maintenance Commands. Eastern Area Command, considered a de facto operational " \
        "headquarters owing to the preponderance of combat units under its control, was reorganised as Home Command in " \
        "October 1953. Headlam was appointed an Officer of the Order of the British Empire (OBE) in the 1954 New Year " \
        "Honours for his \"exceptional ability and devotion to duty\". He was promoted to acting air commodore in May. His " \
        "appointment as aide-de-camp to Queen Elizabeth II was announced on 7 October 1954."

        gen_tokens = 200
        text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
        return text

ExLlama generation works fine in all other standalone Python scripts. The Gradio UI code has also worked fine in several other projects.

Crashing with act order and no act order since latest changes.

`python test_benchmark_inference.py -t /home/nap/llm_models/koala-13B-HF-4bit/tokenizer.model -c /home/nap/llm_models/koala-13B-HF-4bit/config.json -m /home/nap/llm_models/koala-13B-HF-4bit/koala13B-4bit-128g-no-act-order.safetensors -g 128 -p -ppl

Using /home/nap/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/nap/.cache/torch_extensions/py310_cu118/exllama_ext/build.ninja...
Building extension module exllama_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module exllama_ext...
-- Loading model
-- Tokenizer: /home/nap/llm_models/koala-13B-HF-4bit/tokenizer.model
-- Model config: /home/nap/llm_models/koala-13B-HF-4bit/config.json
-- Model: /home/nap/llm_models/koala-13B-HF-4bit/koala13B-4bit-128g-no-act-order.safetensors
-- Groupsize: 128
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'perf', 'ppl']
** Time, Load model: 1.50 seconds
** VRAM, Model: [cuda:0] 6,689.96 MB
-- Inference, first pass.
Traceback (most recent call last):
File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 153, in
logits = timer("Inference", lambda: wrapper.next_logits(ids))
File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 64, in timer
ret = func()
File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 153, in
logits = timer("Inference", lambda: wrapper.next_logits(ids))
File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 54, in next_logits
return self.model.forward(input_ids, self.cache, last_id_only)
File "/home/nap/Documents/exllama-api/model.py", line 523, in forward
hidden_states = decoder_layer(hidden_states, cache, attn_masks[device])
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nap/Documents/exllama-api/model.py", line 351, in forward
hidden_states = self.self_attn(hidden_states, cache, attention_mask)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nap/Documents/exllama-api/model.py", line 264, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nap/Documents/exllama-api/model.py", line 148, in forward
out = quant_util.matmul4bit(x,
File "/home/nap/Documents/exllama-api/quant_util.py", line 70, in matmul4bit
if switch: output = _q4v2_recons(x, qweight, scales, zeros, groupsize, g_idx)
File "/home/nap/Documents/exllama-api/quant_util.py", line 51, in _q4v2_recons
q4v2_recons(qweight, buffer, scales, zeros, groupsize, g_idx if g_idx is not None else none_tensor)
TypeError: q4v2_recons(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: int) -> None

Invoked with: tensor([[-1398026309, 1248440250, 1968657271, ..., 1648788836,
1771582072, 1432982596],
[-1129530164, -1402287736, 1970562646, ..., 2016756323,
900172105, -2007726747],
[ -876888900, -1735723655, 1717986149, ..., -1236974524,
1117231658, -1988663128],
...,
[ 2125380013, 729121940, -1516013256, ..., -1448441238,
1395411286, -910718291],
[ -609454181, -1721358701, 2071349639, ..., -1380296262,
842437924, -646359431],
[ 1518767014, -1668986954, -1201825385, ..., 1920967637,
1770408276, -932611670]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16), tensor([[0.0146179199, 0.0079727173, 0.0102233887, ..., 0.0184173584,
0.0107421875, 0.0127944946],
[0.0070610046, 0.0054702759, 0.0059432983, ..., 0.0166778564,
0.0122299194, 0.0089492798],
[0.0113754272, 0.0084228516, 0.0120010376, ..., 0.0273590088,
0.0140762329, 0.0107803345],
...,
[0.0120925903, 0.0055465698, 0.0123825073, ..., 0.0216217041,
0.0121994019, 0.0125427246],
[0.0135955811, 0.0066719055, 0.0160827637, ..., 0.0184020996,
0.0138168335, 0.0129928589],
[0.0071449280, 0.0064125061, 0.0047798157, ..., 0.0156250000,
0.0118713379, 0.0109329224]], device='cuda:0', dtype=torch.float16), tensor([[ 2023392650, 1010177125, -1289406317, ..., 1279628968,
-2103822205, 1447265365],
[ 2002999416, 1783731782, 1698252904, ..., 1971681173,
-2055768200, 1720019317],
[-2023401052, 678589785, -1808521094, ..., 1430677867,
-2089273206, 1750898759],
...,
[ 1703449704, 1770349877, -1807272091, ..., -2041219240,
1732671894, 1721131894],
[ 1732459378, 1197652024, 1950771288, ..., -1837668746,
1719236473, -2024245130],
[-2005375383, 1970881927, 1753777765, ..., 1971808872,
2003334805, 1970759287]], device='cuda:0', dtype=torch.int32), 128, tensor([ 0, 0, 0, ..., 39, 39, 39], device='cuda:0', dtype=torch.int32)`

Reverting to previous commit fixed the issue for me.

Perplexity Data Format/Testing Data Question

I was trying to do an apples-to-apples shootout of GPTQ vs. the new llama.cpp k-quants (memory usage, speed, etc.) but ran into a bump with perplexity. It looks like exllama loads a jsonl-formatted version of wikitext-2's wiki.valid.raw (not the wiki.test.raw that is typically used for testing)?

Just wondering if there's a preformatted jsonl of the rest of wikitext-2 already. Is the format just literally chunking every line into a "text" object?
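If the format really is one "text" object per line, which is only the guess made in the question above and not confirmed here, a conversion could look like the sketch below. The function name is hypothetical, and skipping blank lines is an additional assumption.

```python
import json


def lines_to_jsonl(raw_text):
    """Turn each non-empty line of raw_text into one JSON object with a "text" field."""
    records = []
    for line in raw_text.splitlines():
        if line.strip():  # assumption: blank lines are dropped
            records.append(json.dumps({"text": line}))
    return "\n".join(records)
```

Reading wiki.test.raw, passing its contents through this function and writing the result to a .jsonl file would then give a test-split counterpart, provided the guess about the format holds.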

Can't compile on Windows

Hi there, really amazing work that you're doing here.

I'm trying to run either the benchmark or the webui to test (I have 2x4090), but it seems it can't find the compiler or something similar?

The complete error is:

python .\webui\app.py
F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py:358: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
INFO: Could not find files for the given pattern(s)
Traceback (most recent call last):
  File "F:\ChatIAs\exllama\webui\app.py", line 9, in <module>
    import model_init
  File "F:\ChatIAs\exllama\model_init.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "F:\ChatIAs\exllama\model.py", line 5, in <module>
    import cuda_ext
  File "F:\ChatIAs\exllama\cuda_ext.py", line 14, in <module>
    exllama_ext = load(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1283, in load
    return _jit_compile(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1610, in _write_ninja_file_and_build_library
    _write_ninja_file_to_build_library(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2057, in _write_ninja_file_to_build_library
    _write_ninja_file(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2200, in _write_ninja_file
    cl_paths = subprocess.check_output(['where',
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['where', 'cl']' returned non-zero exit status 1.

I have CUDA 11.8 and CUDA 12.1 on my system. When building GPTQ, for example, I specify the version with $env:CUDA_PATH="CUDA_DIR", but here I'm not sure whether it uses that or a self-built one. Specifying the CUDA version doesn't work here either.

Maybe I'm missing something here?

Python 3.10.10
Windows 11 Pro
RTX 4090 x2
AMD Ryzen 7 7800X3D
VS2019

C:\Program Files (x86)\Microsoft Visual Studio\2019\Community>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
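The traceback ends with `where cl` returning a non-zero exit status, which means the MSVC compiler `cl` was not found on PATH in that shell. A quick sketch for checking which build tools are visible from the same environment; the function name and the default tool list are assumptions for illustration.

```python
import shutil


def missing_build_tools(tools=("cl", "nvcc", "ninja")):
    """Return the names from tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Running `print(missing_build_tools())` from the same terminal that launches webui\app.py shows whether `cl` is visible there; a Visual Studio developer prompt typically puts it on PATH.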

Error running. ArgTypes. Ninja: Build stopped: subcommand failed

When I try to run: python3 example_chatbot.py -d /home/xxxxx/models/based-7B-GPTQ -un "Jeff" -p prompt_chatbort.txt

The following error appears:

Traceback (most recent call last):
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/xxxxxx/exllama/example_chatbot.py", line 1, in <module>
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "/home/xxxxxx/exllama/model.py", line 5, in <module>
import cuda_ext
File "/home/xxxxxx/exllama/cuda_ext.py", line 42, in <module>
exllama_ext = load(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/5] /usr/bin/nvcc -DTOR CH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILE R_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1 011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx/miniconda3/envs/gpt q/lib/python3.9/site-packages/torch/include -isystem /home/xxxxxx/miniconda3/envs /gptq/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/TH -i system /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/includ e/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/python3.9 -D_GLIBCXX_USE CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUD A_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constex pr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/xxxxxx/exllama/exllama_e xt/cuda_func/half_matmul.cu -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11 BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx /miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/l ucas/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/a pi/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages /torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-p ackages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/pyth on3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_C ONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -- expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=co mpute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/luc as/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[2/5] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTE NSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYB IND11_BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home /xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem / home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/ csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-pa ckages/torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/ site-packages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/includ e/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATOR S --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=a rch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /ho me/xxxxxx/exllama/exllama_ext/cuda_func/q4_mlp.cu -o q4_mlp.cuda.o
FAILED: q4_mlp.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11 BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx /miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/l ucas/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/a pi/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages /torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-p ackages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/pyth on3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_C ONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -- expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=co mpute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/luc as/exllama/exllama_ext/cuda_func/q4_mlp.cu -o q4_mlp.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[3/5] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTE NSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYB IND11_BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home /xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem / home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/ csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-pa ckages/torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/ site-packages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/includ e/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATOR S --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=a rch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /ho me/xxxxxx/exllama/exllama_ext/cuda_func/q4_attn.cu -o q4_attn.cuda.o
FAILED: q4_attn.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11 BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx /miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/l ucas/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/a pi/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages /torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-p ackages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/pyth on3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_C ONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -- expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=co mpute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/luc as/exllama/exllama_ext/cuda_func/q4_attn.cu -o q4_attn.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[4/5] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTE NSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYB IND11_BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home /xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem / home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/ csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-pa ckages/torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/ site-packages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/includ e/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATOR S --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=a rch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /ho me/xxxxxx/exllama/exllama_ext/cuda_func/q4_matmul.cu -o q4_matmul.cuda.o
FAILED: q4_matmul.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11 BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx /miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/l ucas/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/a pi/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages /torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-p ackages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/pyth on3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_C ONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -- expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=co mpute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/luc as/exllama/exllama_ext/cuda_func/q4_matmul.cu -o q4_matmul.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
ninja: build stopped: subcommand failed.
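Every failure in this log is reported inside /usr/include/c++/11/bits/std_function.h, a GCC 11 standard library header, while it is being compiled by /usr/bin/nvcc. That may point to a mismatch between the nvcc release and the host compiler rather than to the exllama sources, so it is worth recording which nvcc release is in use. A sketch for pulling that out of `nvcc --version` output; the function name is hypothetical.

```python
import re


def parse_nvcc_release(version_output):
    """Extract (major, minor) from the 'release X.Y' part of nvcc --version output."""
    match = re.search(r"release (\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

The text to parse can be captured with `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.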

Using QLoRA?

I have been using QLoRA to finetune my model on my 3090, which previously could only run inference and not finetuning.

With the incredible improvements achieved with exllama, is it possible to combine QLoRA and exllama so that the requirements for finetuning are similar to those for inference?

Get error when compiling.

Hello! I am trying to run exllama on wizard-vicuna-13b-uncensored-gptq, and when I try to run any of the commands I get the following error. I am running it using the NVIDIA PyTorch image nvcr.io/nvidia/pytorch:23.05-py3. I am using the newest version of CUDA, 12.1.1, on a Google VM with an L4 on Ubuntu 18.04 LTS. I know the documentation says it's not compatible with all GPUs; is it compatible with the L4? Any help would be very much appreciated. Thank you!!

error.txt

2 x RTX A5000 performance

10 t/s vs. 6 t/s on text-generation-webui.

Great project.

Will it work with an Nvidia P40 24GB on Linux?

I'm developing an AI assistant for fiction writing. As the OpenAI API gets pretty expensive with all the inference tricks needed, I'm looking for a good local alternative for most of the inference, saving GPT-4 just for polishing final results.
exllama looks pretty interesting, but I'm getting a compilation error.
Although I'm a software developer as well as a fiction writer, I'm far from being an AI expert.
Would it be correct to assume from the lines below that the P40 is not currently supported?
-D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__

Maybe it was a silly try, but self.weight = tensors[key].half() did not work.

If the P40 will not work with exllama, could somebody advise whether oobabooga/GPTQ-for-LLaMa would work?
If not CUDA, maybe there are good options for an i9-13900K with 128 GB of DDR5?
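The four -D__CUDA_NO_* defines are flags that PyTorch passes when building CUDA extensions (they also show up in the compute_86 build log of another issue on this page), so on their own they do not say anything about the P40. The -gencode entries in the build log are more telling, since they show which architecture the build targeted. A sketch for reading them out; the function name is hypothetical.

```python
import re


def target_architectures(build_log):
    """Return the sorted, unique compute capabilities named in -gencode flags, e.g. '6.1'."""
    found = set(re.findall(r"arch=compute_(\d+)", build_log))
    # compute_61 means compute capability 6.1: the last digit is the minor version.
    return sorted(f"{digits[:-1]}.{digits[-1]}" for digits in found)
```

Applied to the log in this post it yields 6.1, the architecture the build was generated for.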

The full Traceback:
python test_benchmark_inference.py -d /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 1, in <module>
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/model.py", line 5, in <module>
import cuda_ext
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/cuda_ext.py", line 14, in <module>
exllama_ext = load(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1283, in load
return _jit_compile(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
FAILED: q4v2_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^

1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu".
[2/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^

1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu".
ninja: build stopped: subcommand failed.
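
When triaging a build log like the one above, it can help to confirm which GPU architecture the extension was actually compiled for. The sketch below only parses the `-gencode` flags out of an nvcc command line; the helper names `target_archs` and `as_capability` are made up for illustration and are not part of exllama or PyTorch. It does not diagnose the `atomicAdd` overload error itself.

```python
import re

# Matches nvcc flags such as -gencode=arch=compute_61,code=sm_61
GENCODE_RE = re.compile(r"-gencode=arch=compute_(\d+),code=(?:sm|compute)_(\d+)")


def target_archs(nvcc_command: str) -> list[str]:
    """Return the sorted, de-duplicated virtual architectures named in -gencode flags."""
    archs = {m.group(1) for m in GENCODE_RE.finditer(nvcc_command)}
    return sorted(archs, key=int)


def as_capability(arch: str) -> str:
    """Format an architecture number like '61' as '6.1' (last digit is the minor version)."""
    return f"{arch[:-1]}.{arch[-1]}"


# Shortened copy of the flags from the log above (hypothetical abbreviation).
cmd = (
    "nvcc --expt-relaxed-constexpr "
    "-gencode=arch=compute_61,code=compute_61 "
    "-gencode=arch=compute_61,code=sm_61 -std=c++17"
)
print([as_capability(a) for a in target_archs(cmd)])
```

For the command in this report, the only target is compute capability 6.1, which matches the P40.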

Multimodal support

Great work with this loader. I'm seeing 5x it/s improvements in Ooba and was hopeful that it would also bring some gains when using ooba's multimodal extension (confirmed working in my current setup: Windows 10, RTX 2080 Ti, 11 GB VRAM, 96 GB RAM, CUDA 11.8, Torch 2.0.1, with LLaVA or MiniGPT pipelines at both 7b and 13b).

When using exllama as the loader with any of the 4 multimodal setups, regular text chat and instruct mode work well and much faster, but as soon as I use the multimodal extension to include a photo, I get the error below.
Maybe you can point me in the right direction to try and resolve this?

  File "D:\00\text-generation-webui\modules\text_generation.py", line 300, in generate_reply_custom
    for reply in shared.model.generate_with_streaming(question, state):
  File "D:\00\text-generation-webui\modules\exllama.py", line 68, in generate_with_streaming
    self.generator.gen_begin_reuse(ids)
  File "D:\00\text-generation-webui\repositories\exllama\generator.py", line 191, in gen_begin_reuse
    if reuse < in_tokens.shape[-1]: self.gen_feed_tokens(in_tokens[:, reuse:])
  File "D:\00\text-generation-webui\repositories\exllama\generator.py", line 209, in gen_feed_tokens
    self.model.forward(self.sequence[:, start:-1], self.cache, preprocess_only = True, lora = self.lora)
  File "D:\00\text-generation-webui\repositories\exllama\model.py", line 841, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "D:\00\text-generation-webui\repositories\exllama\model.py", line 459, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "D:\00\text-generation-webui\repositories\exllama\model.py", line 381, in forward
    new_keys = cache.key_states[self.index].narrow(2, past_len, q_len)
RuntimeError: start (49) + length (13970) exceeds dimension size (2048).
Output generated in 1.51 seconds (0.00 tokens/s, 0 tokens, context 14020, seed 979644525)
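
The RuntimeError above says the cache holds 2048 positions while the prompt is reported as 14020 tokens of context. As a minimal sketch of the kind of guard that avoids this class of error, the function below trims a prompt so that prompt plus generation fits the cache. `fit_to_context` and its default values are assumptions for illustration, not part of exllama or text-generation-webui, and blindly trimming a multimodal prompt may cut into the image tokens, so this only illustrates the length arithmetic.

```python
def fit_to_context(token_ids, max_seq_len=2048, max_new_tokens=200):
    """Keep only the most recent tokens so that prompt + generation fits the cache."""
    budget = max_seq_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    if len(token_ids) <= budget:
        return list(token_ids)
    # Drop the oldest tokens and keep the tail of the prompt.
    return list(token_ids[-budget:])


prompt = list(range(14020))  # stand-in for a 14020-token prompt
trimmed = fit_to_context(prompt)
print(len(prompt), "->", len(trimmed))
```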

Problem with generation leading space.

I'm currently implementing HF-style decoding for exllama, but I find that the model sometimes does not generate the expected leading space. It happens rarely, but still from time to time. Since I can only trigger it when sampling, I currently cannot give a prompt that reproduces it with greedy decoding.
I checked the oobabooga/text-generation-webui implementation and found that it is fixed there in a strange way:

Since my implementation follows the HF interface, I have no access to the generation index "i" and therefore cannot check whether a forward call is for the first token or not.
So I'm wondering what the potential cause of this is, and whether there is any other way to fix it.
Update: This does not seem to be exllama's problem; it has something to do with the strange "add leading space" behavior of the HF tokenizer observed earlier.
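
One common workaround for tokenizers that drop the leading space when a token is decoded on its own is to decode the new token together with the preceding tokens and keep only the text that was added. The sketch below shows the idea with a toy decoder; `toy_decode`, `VOCAB`, and `decode_new_text` are hypothetical stand-ins, not the HF tokenizer or exllama API. It assumes the decoded prefix stays unchanged when more tokens are appended, which may not hold for partially decoded multi-byte characters.

```python
# Toy vocabulary in SentencePiece style: "▁" marks a word-initial space.
VOCAB = {0: "▁Hello", 1: "▁world", 2: "!"}


def toy_decode(ids):
    """Join pieces, turn "▁" into spaces, and drop the leading space of the result."""
    text = "".join(VOCAB[i] for i in ids).replace("▁", " ")
    return text[1:] if text.startswith(" ") else text


def decode_new_text(decode, prev_ids, new_ids):
    """Decode prev_ids + new_ids and return only the suffix contributed by new_ids."""
    if not prev_ids:
        return decode(new_ids)
    before = decode(prev_ids)
    after = decode(prev_ids + new_ids)
    return after[len(before):]


print(repr(toy_decode([1])))                        # per-token decode loses the space
print(repr(decode_new_text(toy_decode, [0], [1])))  # context-aware decode keeps it
```

This avoids needing to know whether a given forward call produced the first token, at the cost of decoding a short window of earlier tokens each step.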

RTX 3060 12GB Benchmarking

model: llama-13B-4bit-128g

exllama:

(exllama) user@debian:~/AI/exllama$ python test_benchmark_inference.py -d ~/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/ -p
 -- Loading model
 -- Tokenizer: /home/user/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/tokenizer.model
 -- Model config: /home/user/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/config.json
 -- Model: /home/user/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'perf']
 ** Time, Load model: 1.57 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 6,683.17 MB
 -- Inference, first pass.
 ** Time, Inference: 2.08 seconds
 ** Speed: 923.57 tokens/second
 -- Generating 128 tokens...
 ** Speed: 22.04 tokens/second
 ** VRAM, Inference: [cuda:0] 2,291.67 MB
 ** VRAM, Total: [cuda:0] 8,974.84 MB

ooba's webui:
streaming on:

(textgen) user@debian:~/AI/2oobabooga/text-generation-webui$ python3.10 server.py --wbits 4 --model llama-13b-4bit-128g --groups 128 --model_type llama
INFO:Gradio HTTP request redirected to localhost :)
bin /home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
INFO:Loading llama-13b-4bit-128g...
INFO:Found the following quantized model: models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
INFO:Loaded the model in 2.65 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 9.12 seconds (21.81 tokens/s, 199 tokens, context 4, seed 989197438)
Output generated in 8.57 seconds (23.22 tokens/s, 199 tokens, context 4, seed 26472177)

no stream:

(textgen) user@debian:~/AI/2oobabooga/text-generation-webui$ python3.10 server.py --wbits 4 --model llama-13b-4bit-128g --groups 128 --model_type llama --no-stream
INFO:Gradio HTTP request redirected to localhost :)
bin /home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
INFO:Loading llama-13b-4bit-128g...
INFO:Found the following quantized model: models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
INFO:Loaded the model in 2.48 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 5.17 seconds (24.74 tokens/s, 128 tokens, context 4, seed 250438644)
Output generated in 4.57 seconds (28.02 tokens/s, 128 tokens, context 4, seed 1203371762)
Output generated in 4.80 seconds (26.65 tokens/s, 128 tokens, context 4, seed 484445001)
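
To compare runs like these without eyeballing individual lines, the webui's "Output generated" lines can be aggregated. This is a small sketch that assumes the log format shown above; `summarize` is a made-up helper name. It reports both the mean of the per-run rates and the overall rate (total tokens divided by total seconds), which differ slightly.

```python
import re
from statistics import mean

# Captures seconds, reported tokens/s, and token count from each log line.
LINE_RE = re.compile(
    r"Output generated in ([\d.]+) seconds \(([\d.]+) tokens/s, (\d+) tokens"
)


def summarize(log_text):
    """Return (number of runs, mean reported tokens/s, overall tokens/s)."""
    runs = [(float(s), float(r), int(n)) for s, r, n in LINE_RE.findall(log_text)]
    if not runs:
        raise ValueError("no 'Output generated' lines found")
    total_seconds = sum(s for s, _, _ in runs)
    total_tokens = sum(n for _, _, n in runs)
    return len(runs), mean(r for _, r, _ in runs), total_tokens / total_seconds


# The three no-stream runs from the log above.
no_stream = """\
Output generated in 5.17 seconds (24.74 tokens/s, 128 tokens, context 4, seed 250438644)
Output generated in 4.57 seconds (28.02 tokens/s, 128 tokens, context 4, seed 1203371762)
Output generated in 4.80 seconds (26.65 tokens/s, 128 tokens, context 4, seed 484445001)
"""
count, avg_rate, overall_rate = summarize(no_stream)
print(f"{count} runs, mean {avg_rate:.2f} tokens/s, overall {overall_rate:.2f} tokens/s")
```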