turboderp / exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
License: MIT License
I have tested this on 4x 2080 Ti. It works well and reaches almost 10 t/s for LLaMA 65B, compared with only 0.65 t/s for bnb 4-bit, which is really amazing. First of all, please accept my thanks and admiration.
If you need to run tests on 20-series GPUs, maybe I can help.
Also, can this be used for tuning along with LoRA?
Recently a 15.5B-parameter model called StarCoder was released: https://huggingface.co/bigcode/starcoder
You should be able to run it with text-generation-webui using a fork of GPTQ-for-llama called
GPTQ-for-SantaCoder https://github.com/mayank31398/GPTQ-for-SantaCoder
As far as I can tell, these two libraries use the same underlying library (transformers) as well as the same quantization method (GPTQ), so shouldn't it be possible to run this with exllama?
Any ideas on how I would go about doing this? @turboderp @disarmyouwitha
I see from your own testing that you have multi-GPU working.
Following the instructions, test_benchmark_inference.py and test_chatbot.py both worked on one of my RTX 3060s with a 13B model, and my other two 3060s were detected (but not used).
Attempting to load a 33B LLaMA model across all three cards led to a CUDA OOM error before the model finished loading, as only a single card was used and none of the other cards showed any VRAM usage.
Are there any tricks for multi-card setups, or parameters I should be passing?
I have noticed that while it massively increases inference speed, it also massively decreases the quality of the outputs: instruct models become very obstinate and give completely irrelevant responses, words are misspelled, lines repeat over and over, and the output is sometimes spammed with Chinese characters.
Getting subj exception with new model from here: https://huggingface.co/TheBloke/tulu-30B-GPTQ
Heya, I'm writing a LangChain binding for exllama. I'd love to be able to pip install exllama and access the libraries natively in Python. Right now I'm not really sure how I'd ship the LangChain module without creating my own binding library on pip, which seems very awkward.
Foremost, this is a terrific project.
I've been trying to integrate it with other apps, but the API is a little different from other implementations such as KoboldAI and its API, or textgen-webui and its API examples.
With my limited knowledge I could get it to work (while the web app is running) using the following script, although it's not the best:
import requests
import json
import sys
url = 'http://0.0.0.0:5005/api/userinput'
data = {'user_input': 'What time is it? Write a very looong essay about time.'}
headers = {'Content-type': 'application/json'}
# send the POST request and stream the response
response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)
# extract the text values from the JSON response
text_values = (json.loads(line).get('text') for line in response.iter_lines())
for text_value in text_values:
    print(text_value, end="")
    sys.stdout.flush()  # flush the output buffer
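The script above passes every line from iter_lines() straight to json.loads, which would raise on an empty keep-alive line. A slightly more defensive version of the parsing step might look like the sketch below. The helper name iter_text_chunks is hypothetical, and the assumption that each non-empty line is a JSON object with an optional 'text' key is inferred from the script above, not from any documented API behaviour.

```python
import json
from typing import Iterable, Iterator


def iter_text_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the 'text' field of each JSON line, skipping anything else."""
    for raw in lines:
        if not raw.strip():
            continue  # blank keep-alive line
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON; ignore rather than abort the stream
        text = event.get("text") if isinstance(event, dict) else None
        if text is not None:
            yield text
```

With the requests call from the script, this would be used as: for chunk in iter_text_chunks(response.iter_lines()): print(chunk, end="", flush=True).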
What do you think about the possibility of making a streaming api endpoint on /api/stream that is not connected with the backend user handling and message saving, and is "stateless" so it follows the REST principles? Since it's one of the most performant backends this would surely boost its popularity.
I'm curious, is it possible to extract just C++ / CUDA core from the project to integrate into external systems? Basically to have something like llama.cpp without Python at all, or just the initial step for compiling / building exllama.
Thanks for this great project. The inference speed is exceptional. However, it seems the generator API only supports a single string input. When serving concurrent requests, batching of inputs will be needed for better throughput.
Hi there! As always, thanks for the amazing project.
I was trying to load Minotaur-15B (8192 max context), which is the result of quantising to 4-bit using GPTQ-for-LLaMa.
https://huggingface.co/TheBloke/minotaur-15B-GPTQ
At first, I was trying with ooba text webui, and got:
2023-06-19 15:07:54 INFO:Loading TheBloke_minotaur-15B-GPTQ...
2023-06-19 15:07:56 ERROR:Failed to load the model.
Traceback (most recent call last):
File "F:\ChatIAs\oobabooga\text-generation-webui\server.py", line 62, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "F:\ChatIAs\oobabooga\text-generation-webui\modules\models.py", line 65, in load_model
output = load_func_map[loader](model_name)
File "F:\ChatIAs\oobabooga\text-generation-webui\modules\models.py", line 277, in ExLlama_loader
model, tokenizer = ExllamaModel.from_pretrained(model_name)
File "F:\ChatIAs\oobabooga\text-generation-webui\modules\exllama.py", line 35, in from_pretrained
config = ExLlamaConfig(str(model_config_path))
File "F:\ChatIAs\oobabooga\text-generation-webui\repositories\exllama\model.py", line 39, in __init__
self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'
Then I tried with exllama directly, and got:
(venv) PS F:\ChatIAs\exllama> python webui/app.py -d "F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ"
-- Tokenizer: F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ\tokenizer.model
-- Model config: F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ\config.json
-- Model: F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ\4bit-128g.safetensors
-- Sequence length: 2048
-- Tuning:
-- --matmul_recons_thd: 8
-- --fused_mlp_thd: 2
-- --sdp_thd: 8
-- Options: []
Traceback (most recent call last):
File "F:\ChatIAs\exllama\webui\app.py", line 133, in <module>
config = model_init.make_config(args)
File "F:\ChatIAs\exllama\model_init.py", line 97, in make_config
config = ExLlamaConfig(args.config)
File "F:\ChatIAs\exllama\model.py", line 39, in __init__
self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'
Is there any setting that I'm missing? Or are these models not compatible yet? Thanks!
/exllama$ python test_benchmark_inference.py -d Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
-- Loading model
-- Tokenizer: Wizard-Vicuna-13B-Uncensored-GPTQ/tokenizer.model
-- Model config: Wizard-Vicuna-13B-Uncensored-GPTQ/config.json
-- Model: Wizard-Vicuna-13B-Uncensored-GPTQ/Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
-- Sequence length: 2048
-- Options: ['attention: switched', 'matmul: switched', 'mlp: switched', 'perf', 'perplexity']
Traceback (most recent call last):
File "/exllama/test_benchmark_inference.py", line 171, in <module>
wrapper = timer("Load model", lambda: ModelWrapper(args))
File "/exllama/test_benchmark_inference.py", line 73, in timer
ret = func()
File "/exllama/test_benchmark_inference.py", line 171, in <lambda>
wrapper = timer("Load model", lambda: ModelWrapper(args))
File "/exllama/test_benchmark_inference.py", line 51, in __init__
self.model = ExLlama(config)
File "/exllama/model.py", line 883, in __init__
with safe_open(self.config.model_path, framework="pt", device="cpu") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
Exception ignored in: <function ExLlama.__del__ at 0x7fd43bfe1fc0>
Traceback (most recent call last):
File "/exllama/model.py", line 1066, in __del__
if torch_device is not None: cuda_ext.free_cuda_buffers(torch_device)
File "/exllama/cuda_ext.py", line 57, in free_cuda_buffers
free_buffers(device)
TypeError: free_buffers(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.device, arg1: int, arg2: int, arg3: int, arg4: int) -> None
Invoked with: device(type='cuda', index=0)
/exllama$ ^C
Has anyone seen an error like this before?
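For what it's worth, a safetensors file starts with an 8-byte little-endian integer giving the length of a JSON header, so HeaderTooLarge usually means the first bytes are not a real safetensors header, for example because the file is a Git LFS pointer or an incomplete download. That diagnosis is a guess for this report; the log does not confirm it. A minimal sketch for checking the first bytes of a file, where the function names and the size threshold are assumptions made for illustration:

```python
import struct

# Assumed sanity threshold for this sketch, not taken from the safetensors source.
ASSUMED_MAX_HEADER = 100_000_000


def classify_prefix(prefix: bytes) -> str:
    """Classify the first bytes (at least 9) of a supposed safetensors file."""
    if len(prefix) < 8:
        return "truncated"
    if prefix[:8] == b"version ":
        # Git LFS pointer files are small text files that start with "version https://git-lfs..."
        return "git-lfs-pointer"
    (header_len,) = struct.unpack("<Q", prefix[:8])
    if header_len > ASSUMED_MAX_HEADER:
        return "header-too-large"
    if prefix[8:9] not in (b"", b"{"):
        return "not-json"  # the JSON header must begin with '{'
    return "plausible"


def inspect_safetensors(path: str) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as f:
        return classify_prefix(f.read(9))
```

Calling inspect_safetensors on the .safetensors path from the log would return "git-lfs-pointer" if only the pointer file was cloned, in which case the actual weights still need to be downloaded.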
RuntimeError Traceback (most recent call last)
Cell In[3], line 4
2 config.model_path = model_path
3 config.max_seq_len = 2048
----> 4 model = ExLlama(config)
5 cache = ExLlamaCache(model)
6 tokenizer = ExLlamaTokenizer(tokenizer_model_path)
File /workspace/exllama/model.py:759, in ExLlama.__init__(self, config)
756 device = self.config.device_map.layers[i]
757 sin, cos = self.sincos[device]
--> 759 layer = ExLlamaDecoderLayer(self.config, tensors, f"model.layers.{i}", i, sin, cos)
761 modules.append(layer)
763 self.layers = modules
File /workspace/exllama/model.py:345, in ExLlamaDecoderLayer.__init__(self, config, tensors, key, index, sin, cos)
342 self.config = config
343 self.index = index
--> 345 self.self_attn = ExLlamaAttention(self.config, tensors, key + ".self_attn", sin, cos, self.index)
346 self.mlp = ExLlamaMLP(self.config, tensors, key + ".mlp")
348 self.input_layernorm = ExLlamaRMSNorm(self.config, tensors, key + ".input_layernorm.weight")
File /workspace/exllama/model.py:260, in ExLlamaAttention.__init__(self, config, tensors, key, sin, cos, index)
258 self.k_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".k_proj")
259 self.v_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".v_proj")
--> 260 self.o_proj = Ex4bitLinear(config, self.config.num_attention_heads * self.config.head_dim, self.config.hidden_size, False, tensors, key + ".o_proj")
File /workspace/exllama/model.py:137, in Ex4bitLinear.__init__(self, config, in_features, out_features, has_bias, tensors, key)
135 self.qzeros = tensors[key + ".qzeros"]
136 self.scales = tensors[key + ".scales"]
--> 137 self.g_idx = tensors[key + ".g_idx"].cpu() if key + ".g_idx" in tensors else None
138 self.bias = tensors[key + ".bias"] if has_bias else None
140 self.device = self.qweight.device
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Running on RunPod, 2x 3090, with 2.1.0.dev20230607+cu118.
When attempting to split the model on multiple GPUs, I get the following error:
> python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/ -gs 16,22 -p prompt_assistant.txt -un "John" -bn "Assistant" -temp 1.00 -topp 0.95 -beams 5 -beamlen 20 -mm quant_only
-- Loading model
-- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/tokenizer.model
-- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/config.json
-- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/Guanaco-65B-GPTQ-4bit.act-order.safetensors
-- Sequence length: 2048
-- Temperature: 1.00
-- Top-K: 20
-- Top-P: 0.95
-- Min-P: 0.00
-- Repetition penalty: 1.15
-- Beams: 5 x 20
-- Options: ['attention: pytorch_scaled_dp', 'matmul: quant_only', 'gpu_split: 16,22']
-- Groupsize (inferred): None
-- Act-order (inferred): no
This is a conversation between a person called John and an AI chatbot called Assistant. Assistant is a helpful and articulate virtual assistant possessing deep knowledge on a multitude of subjects. Assistant will never refuse to answer a question or comply with a request.
Assistant: Hello, John. I am Assistant, your virtual assistant. How may I help you?
John: Testing
Assistant:Traceback (most recent call last):
File "/home/john/Projects/exllama/test_chatbot.py", line 213, in <module>
gen_token = generator.beam_search()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/exllama/generator.py", line 385, in beam_search
tokens, probs = self.sample(logits,
^^^^^^^^^^^^^^^^^^^
File "/home/john/Projects/exllama/generator.py", line 94, in sample
sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
This only happens if the model is split between GPUs using the -gs option.
AttributeError: 'ExLlamaConfig' object has no attribute 'sdp_thd'. Did you mean: 'stp_thd'?
I think that in line 78:
self.stp_thd = 8
should be
self.sdp_thd = 8
Thanks for all your hard work, great project!
This is not an issue, just a report that it works great with Guanaco-65B-GPTQ-4bit.act-order.safetensors from TheBloke on 2x 3090. Speed is great, about 15 t/s.
Great repo.
Are there any plans to add support for batched generation?
Any idea how much work this might be? I can potentially work on this if you can point me in the right direction.
Opening a new thread to continue the conversation about the API, as I think a dedicated thread for this discussion will be valuable as the project continues to scale.
Continuation from: #12
Hello there,
Upon loading StarCoder and its derivatives, such as WizardCoder, the following error is thrown:
Traceback (most recent call last):
File "/media/bkutasi/60824A4F824A29BC/Other_projects/exllama/test_benchmark_inference.py", line 114, in <module>
config = model_init.make_config(args)
File "/media/bkutasi/60824A4F824A29BC/Other_projects/exllama/model_init.py", line 97, in make_config
config = ExLlamaConfig(args.config)
File "/media/bkutasi/60824A4F824A29BC/Other_projects/exllama/model.py", line 40, in __init__
self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'
Since it's a different model architecture, GPTBigCodeForCausalLM instead of LlamaForCausalLM, the config is pretty different, so pad_token_id, hidden_size and other parameters are missing. Am I loading it wrong, or are these model types not supported?
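A quick way to tell this case apart from a genuinely broken config is to look at the architectures entry in config.json before handing the file to the loader. The sketch below is only an illustration: the function name is hypothetical, and treating LlamaForCausalLM as the only supported architecture is an assumption based on the class names mentioned in this report.

```python
import json
from typing import List, Sequence


def unsupported_architectures(
    config_text: str,
    supported: Sequence[str] = ("LlamaForCausalLM",),  # assumption for this sketch
) -> List[str]:
    """Return the architectures declared in a config.json that are not in `supported`."""
    config = json.loads(config_text)
    declared = config.get("architectures") or []
    return [name for name in declared if name not in supported]
```

A non-empty result for a StarCoder-style config would suggest the KeyError is a symptom of an unsupported architecture and not a missing setting.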
Has anyone compared the inference speed of the 4-bit quantized model with the original FP16 model?
Is it faster than the original FP16 model?
I've been trying to set up exllama with my web server. It creates an instance of LlamaModelRepo and calls loadModel. If I use the following code, I get text output fine:
class LlamaModelRepo:
    tokenizer: ExLlamaTokenizer = None
    generator: ExLlamaGenerator = None
    config: ExLlamaConfig = None
    model: ExLlama = None
    cache: ExLlamaCache = None
    def __init__(self):
        self.models: list = []
        self.modelsDir: str = './models'
    def loadModel(self, llamaModel: LlamaModel):
        errors = []
        configPath = llamaModel.path + "/config.json"
        if (not exists(configPath)):
            errors.append(f"{configPath} does not exist")
        modelPath = llamaModel.path + "/" + llamaModel.modelFile
        if (not exists(modelPath)):
            errors.append(f"{modelPath} does not exist")
        tokenizerModelPath = llamaModel.path + "/tokenizer.model"
        if (not exists(tokenizerModelPath)):
            errors.append(f"{tokenizerModelPath} does not exist")
        if errors:
            raise Exception("\n".join(errors))
        torch.set_grad_enabled(False)
        torch.cuda._lazy_init()
        self.config = ExLlamaConfig(configPath)
        self.config.model_path = modelPath
        self.config.max_seq_len = 2048
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)
        self.tokenizer = ExLlamaTokenizer(tokenizerModelPath)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)
        self.generator.settings.token_repetition_penalty_max = 1.2
        self.generator.settings.token_repetition_penalty_sustain = 20
        self.generator.settings.token_repetition_penalty_decay = 50
        gen_tokens = 200
        text = self.generator.generate_simple("test", max_new_tokens = 200)
        print(text)
Printed output
test Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Create a list of 5 adjectives to describe a car.
### Response:
1. Sleek
2. Powerful
3. Luxurious
4. Reliable
5. Sporty
If I move the last lines into a separate function, call loadModel, and then call chat through a separate request to the server, I get an exception. I'm very confused about why this happens and wonder whether I'm doing something wrong.
class LlamaModelRepo:
    tokenizer: ExLlamaTokenizer = None
    generator: ExLlamaGenerator = None
    config: ExLlamaConfig = None
    model: ExLlama = None
    cache: ExLlamaCache = None
    def __init__(self):
        self.models: list = []
        self.modelsDir: str = './models'
    def loadModel(self, llamaModel: LlamaModel):
        errors = []
        configPath = llamaModel.path + "/config.json"
        if (not exists(configPath)):
            errors.append(f"{configPath} does not exist")
        modelPath = llamaModel.path + "/" + llamaModel.modelFile
        if (not exists(modelPath)):
            errors.append(f"{modelPath} does not exist")
        tokenizerModelPath = llamaModel.path + "/tokenizer.model"
        if (not exists(tokenizerModelPath)):
            errors.append(f"{tokenizerModelPath} does not exist")
        if errors:
            raise Exception("\n".join(errors))
        torch.set_grad_enabled(False)
        torch.cuda._lazy_init()
        self.config = ExLlamaConfig(configPath)
        self.config.model_path = modelPath
        self.config.max_seq_len = 2048
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)
        self.tokenizer = ExLlamaTokenizer(tokenizerModelPath)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)
        self.generator.settings.token_repetition_penalty_max = 1.2
        self.generator.settings.token_repetition_penalty_sustain = 20
        self.generator.settings.token_repetition_penalty_decay = 50
        gen_tokens = 200
    def chat(self, text: str, params:dict = {}):
        text = self.generator.generate_simple("test", max_new_tokens = 200)
        print(text)
exception
Traceback (most recent call last):
File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 2213, in __call__
return self.wsgi_app(environ, start_response)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 2193, in wsgi_app
response = self.handle_exception(e)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 2190, in wsgi_app
response = self.full_dispatch_request()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 1486, in full_dispatch_request
rv = self.handle_user_exception(e)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
rv = self.dispatch_request()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/kanna/Documents/llm/exllama/server.py", line 110, in chat
return modelRepo.chat(text="test")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/kanna/Documents/llm/exllama/repos/model_repo.py", line 97, in chat
text = self.generator.generate_simple(text, max_new_tokens = 200)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/kanna/Documents/llm/exllama/generator.py", line 176, in generate_simple
self.gen_begin(ids)
File "/mnt/kanna/Documents/llm/exllama/generator.py", line 103, in gen_begin
self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True)
File "/mnt/kanna/Documents/llm/exllama/model.py", line 1153, in forward
hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/kanna/Documents/llm/exllama/model.py", line 540, in forward
hidden_states = self.self_attn.forward(hidden_states, cache, buffer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/kanna/Documents/llm/exllama/model.py", line 447, in forward
query_states = self.q_proj.forward(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/kanna/Documents/llm/exllama/model.py", line 314, in forward
out = cuda_ext.ExAutogradMatmul4bitCuda.apply(x, self.qweight, self.scales, self.qzeros, self.groupsize, self.bits, self.maxq)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kannalo/.local/lib/python3.11/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/kannalo/.local/lib/python3.11/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
return fwd(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "/mnt/kanna/Documents/llm/exllama/cuda_ext.py", line 271, in forward
raise ValueError("Not implemented yet")
ValueError: Not implemented yet
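A possible explanation, which nothing in this thread confirms: PyTorch's grad mode is thread-local, so torch.set_grad_enabled(False) called in loadModel only affects the thread that ran it. Flask may serve the chat request on a different thread, where grad mode is back at its default (enabled), and the traceback shows the call going through ExAutogradMatmul4bitCuda, whose forward raises "Not implemented yet". The sketch below illustrates the general pattern with the standard library only: a thread-local flag stands in for grad mode, and a decorator re-enters the desired context on whichever thread runs the call. All names here (_state, grad_enabled, grad_disabled, in_context, run_in_new_thread) are hypothetical stand-ins; with PyTorch, the equivalent would be wrapping the body of chat in a with torch.no_grad(): block.

```python
import contextlib
import functools
import threading

_state = threading.local()  # stand-in for PyTorch's per-thread grad mode


def grad_enabled() -> bool:
    # Every new thread starts with the default, which is "enabled".
    return getattr(_state, "enabled", True)


@contextlib.contextmanager
def grad_disabled():
    """Disable the flag for the current thread only, restoring it on exit."""
    previous = grad_enabled()
    _state.enabled = False
    try:
        yield
    finally:
        _state.enabled = previous


def in_context(make_context):
    """Decorator: enter a fresh context on whichever thread calls the function."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with make_context():
                return fn(*args, **kwargs)
        return wrapper
    return decorate


def run_in_new_thread(fn):
    """Run fn on a fresh thread and return its result."""
    result = []
    worker = threading.Thread(target=lambda: result.append(fn()))
    worker.start()
    worker.join()
    return result[0]


# Equivalent of calling set_grad_enabled(False) once, on the loading thread.
_state.enabled = False
# A request handled on another thread does not see that setting...
unprotected = run_in_new_thread(grad_enabled)
# ...unless the handler itself enters the context.
protected = run_in_new_thread(in_context(grad_disabled)(grad_enabled))
```

Here unprotected ends up True and protected ends up False, which mirrors why code that works when run inline in loadModel could fail when moved into a request handler.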
My original model is a .bin file, so I used the code below to convert the model format:
model = AutoModelForCausalLM.from_pretrained("path/to/llama_7b", torch_dtype=torch.float16, device_map='auto')
model.save_pretrained('path/to/exllama', safe_serialization=True, max_shard_size="200GB")
But when I run test_benchmark_inference.py, I get an error:
the model doesn't have qzeros/scales.
How can I deal with this? Could you help me?
Is there anything I need to do to make this work? Simply adding names doesn't change anything. I've also tried creating prompt text files to match the added name, but nothing changed.
Other than that, it's working pretty well for talking to one bot.
Any thoughts on how difficult it would be to support inference on a model trained with landmark attention? Like Minotaur, Wizard or the base Llama landmark finetunes released recently, and I suppose more will come out, now that multiple repos support lora/qlora/gptq-lora training with landmark attention.
I haven’t compared results yet, but it sounds like landmark attention should be more effective with long contexts compared to the turboderp/alpaca_lora_4bit repo. Like the author, I found that that repo did “something”, and stopped generating gibberish beyond 2048 at least, but I’m not sure what the model learned. The landmark attention paper claims it can solve needle-haystack problems beyond the context length, which I couldn’t get the previous method to do.
Landmark attention apparently works with Oobabooga when remote code is enabled.
Pop!_OS 20.04
Python 3.8.10
AMD 6800 XT GPU
Installed with:
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
pip install safetensors sentencepiece ninja
git clone https://github.com/turboderp/exllama
cd exllama
When running python3 test_benchmark_inference.py -d /home/user1/models/ -p -ppl
or python example_chatbot.py -d /home/user1/models/ -un "Jeff" -p prompt_chatbot.txt
I get the following errors:
Traceback (most recent call last):
File "test_benchmark_inference.py", line 1, in <module>
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "/home/user1/bin/exllama/model.py", line 5, in <module>
import cuda_ext
File "/home/user1/bin/exllama/cuda_ext.py", line 42, in <module>
exllama_ext = load(
File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1286, in load
return _jit_compile(
File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1511, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1603, in _write_ninja_file_and_build_library
extra_ldflags = _prepare_ldflags(
File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1702, in _prepare_ldflags
if (not os.path.exists(_join_cuda_home(extra_lib_dir)) and
File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2238, in _join_cuda_home
raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
What am I doing wrong?
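The error comes from PyTorch's extension builder, which needs a CUDA toolkit on disk to compile exllama's kernels; the pip wheel by itself does not include the compiler, and the system listed above has an AMD GPU, so a CUDA toolkit may simply not be installed. As a rough diagnostic sketch, the function below checks the usual places a toolkit is advertised. The function name and the lookup order (CUDA_HOME, CUDA_PATH, then nvcc on PATH) are assumptions for illustration and are not PyTorch's actual code.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def guess_cuda_home(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a likely CUDA install root, or None if nothing is found."""
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            return env[var]
    nvcc = which("nvcc")
    if nvcc:
        # <root>/bin/nvcc -> <root>
        return os.path.dirname(os.path.dirname(nvcc))
    return None
```

If guess_cuda_home() returns None, there is probably no CUDA toolkit for the build to use, and setting CUDA_HOME would only help once one is installed.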
When trying to run the GPTQ Model (https://huggingface.co/TheBloke/starchat-beta-GPTQ) which works fine with other GPTQ loaders, I get the following using ExLlama via oobabooga webui:
line 40, in __init__
self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'
Line 40 in dd63e07
exllama/exllama_ext/q4v2_matmul.cu(116): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (half *, half)
atomicAdd(out_.item_ptr(x_row, w_column), result);
Until I added a link from lib to lib64 it was unable to find the cuda libs. Compile would fail. Test kernel stuff is also out of date as the paths are wrong.
Hi! I got this to work with TheBloke/WizardLM-30B-Uncensored-GPTQ.
Here's what worked:
/mnt/c/somewhere
otherwise the model loading will be mega slow regardless of your disk speed
model.py
I added the following:
# self.groupsize = (self.qweight.shape[0] * 8) // self.qzeros.shape[0]
self.groupsize = None
self.config.groupsize = None
self.config.act_order = True
# self.config.groupsize = self.groupsize
# if self.config.groupsize is None:
# self.config.groupsize = self.groupsize
# else:
# if self.config.groupsize != self.groupsize:
# raise ValueError("Irregular groupsize for matrix: " + key + ", " + str(self.config.groupsize) + ", "+ str(self.groupsize))
Note the commented out code and the additions
4. I had to use -mm pytorch_only and -a pytorch_matmul
I ran a test on the latest commit (77545c) and on bec6c9, on an H100 with a 30B model, and I see a consistent performance degradation:
Latest (77545c): 25 t/s
bec6c9: 34 t/s
Thoughts?
Hello! Could this work utilize the new H100 TransformerEngine for a speedup? If so, I would be very interested in that and would also pay for your H100 cloud GPU access if you could estimate how long you would need it.
Thank you very much!
My P40 is only using about 70 W while generating responses. It's not limited in any way (i.e. by power delivery or temperature).
We are trying to port our transformers-based generation code to exllama but did not find a configurable length_penalty control. Will this be on the roadmap? Thanks.
I installed exllama on a secondary drive, then tried to install the dependencies both inside a venv and in the root folder. In every case I got a long error when I tried to run the test (python test_benchmark_inference.py -d <path_to_model_files> -p -ppl), including:
fatal error LNK1104: cannot open file 'python310.lib'
The solution was to copy the python310.lib file from Program Files\Python310\libs and paste it into \venv\Scripts\libs. Note that I had to create that directory myself.
I'm currently testing the different generation behavior between exllama and autogptq, and I found that using cache with exllama will generate different results for same prompt even when I'm using greedy decoding.
def exllama_greedy_gen_wo_cache(prompt, max_length):
    seq = tokenizer.encode(prompt) # Huggingface tokenizer
    for _ in range(max_length):
        temp_cache = ExLlamaCache(model_exllama)
        logits = model_exllama.forward(torch.tensor([seq], dtype=torch.long), temp_cache)[0][0]
        seq.append(torch.argmax(logits).item())
    return seq
for i in range(10):
    print(tokenizer.decode(exllama_greedy_gen_wo_cache("Hello,", 20)))
Generating without the cache is really slow, but I get consistent outputs.
When I enable the cache to get much faster generation, I start seeing inconsistency between generations:
def exllama_greedy_gen_wi_cache(prompt, max_length):
    seq = tokenizer.encode(prompt) # Huggingface tokenizer
    gen_cache = ExLlamaCache(model_exllama)
    model_exllama.forward(torch.tensor([seq[:-1]], dtype=torch.long), gen_cache, preprocess_only = True)
    for _ in range(max_length):
        logits = model_exllama.forward(torch.tensor([seq[-1:]], dtype=torch.long), gen_cache)[0][0]
        seq.append(torch.argmax(logits).item())
    return seq
for i in range(10):
    print(tokenizer.decode(exllama_greedy_gen_wi_cache("Hello,", 20)))
I wonder whether this is a bug in the cache implementation or whether I'm using the cache the wrong way.
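To narrow down where the cached and uncached runs part ways, it may help to compare the returned token ID lists directly instead of the decoded strings. A small helper for that, with a hypothetical name, might look like this:

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token sequences differ, or None if identical."""
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=None)):
        if x != y:
            return i
    return None
```

Running it on the outputs of the two functions above for the same prompt shows whether the sequences differ from the very first generated token or only after several steps.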
Hi, I store my models on a local Synology NAS, which does not allow me to change the ownership of the files.
I get the following error when starting the Docker container with docker compose:
⠿ Network exllama_default Created 0.0s
⠿ Container exllama-web-1 Created 0.1s
Attaching to exllama-web-1
exllama-web-1 | chown: changing ownership of '/app/model/minotaur-13b-GPTQ-4bit-128g.no-act.order.safetensors': Operation not permitted
exllama-web-1 | chown: changing ownership of '/app/model/tokenizer.model': Operation not permitted
exllama-web-1 | chown: changing ownership of '/app/model/tokenizer_config.json': Operation not permitted
exllama-web-1 | chown: changing ownership of '/app/model/tokenizer.json': Operation not permitted
exllama-web-1 | chown: changing ownership of '/app/model/special_tokens_map.json': Operation not permitted
exllama-web-1 | chown: changing ownership of '/app/model/quantize_config.json': Operation not permitted
exllama-web-1 | chown: changing ownership of '/app/model/generation_config.json': Operation not permitted
exllama-web-1 | chown: changing ownership of '/app/model/config.json': Operation not permitted
exllama-web-1 | chown: changing ownership of '/app/model/README.md': Operation not permitted
exllama-web-1 | chown: changing ownership of '/app/model': Operation not permitted
exllama-web-1 exited with code 1
Unsure how to proceed
Congrats and thank you again for a project that changes everything. I can't use anything else, and now I even prefer your web UI to the standard text-web-ui...
In some instances it would be super useful to be able to load separate LoRAs on top of a GPTQ model loaded with exllama.
https://github.com/SqueezeAILab/SqueezeLLM Is this something exllama will support out of the box? What would integrating support look like?
I would like to test-run a 7B model on my 3050 with 4 GB of VRAM. It looks like exllama does not support offloading the model to the CPU yet?
I'm getting an error when attempting to use generate_simple inside of a Gradio UI. I can run test_inference.py just fine, however when I put that code into a Gradio UI and attempt to redirect the output to a Chatbot component, I get the below error:
Traceback (most recent call last):
File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/routes.py", line 422, in run_predict
output = await app.get_blocks().process_api(
File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1323, in process_api
result = await self.call_function(
File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1051, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/mmealman/src/exllama/webui/Chatbot.py", line 72, in bot
bot_message = self.predict(history, user_message)
File "/home/mmealman/src/exllama/webui/Chatbot.py", line 58, in predict
return self.textgen.test_generate()
File "/home/mmealman/src/exllama/TextGenerator.py", line 96, in test_generate
text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
File "/home/mmealman/src/exllama/generator.py", line 176, in generate_simple
self.gen_begin(ids)
File "/home/mmealman/src/exllama/generator.py", line 103, in gen_begin
self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True)
File "/home/mmealman/src/exllama/model.py", line 1153, in forward
hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device])
File "/home/mmealman/src/exllama/model.py", line 540, in forward
hidden_states = self.self_attn.forward(hidden_states, cache, buffer)
File "/home/mmealman/src/exllama/model.py", line 447, in forward
query_states = self.q_proj.forward(hidden_states)
File "/home/mmealman/src/exllama/model.py", line 314, in forward
out = cuda_ext.ExAutogradMatmul4bitCuda.apply(x, self.qweight, self.scales, self.qzeros, self.groupsize, self.bits, self.maxq)
File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
return fwd(*args, **kwargs)
File "/home/mmealman/src/exllama/cuda_ext.py", line 271, in forward
raise ValueError("Not implemented yet")
ValueError: Not implemented yet
Below is the generation code I'm calling in the Chatbot:
def test_generate(self):
    tokenizer_model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/tokenizer.model"
    model_config_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/config.json"
    model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors"
    config = ExLlamaConfig(model_config_path)
    config.model_path = model_path
    config.max_seq_len = 2048
    model = ExLlama(config)
    cache = ExLlamaCache(model)
    tokenizer = ExLlamaTokenizer(tokenizer_model_path)
    generator = ExLlamaGenerator(model, tokenizer, cache)
    generator.settings.token_repetition_penalty_max = 1.2
    generator.settings.token_repetition_penalty_sustain = 20
    generator.settings.token_repetition_penalty_decay = 50
    prompt = \
        "On 19 February 1952, Headlam became senior air staff officer (SASO) at Eastern Area Command in Penrith, New South " \
        "Wales. During his term as SASO, the RAAF began re-equipping with English Electric Canberra jet bombers and CAC " \
        "Sabre jet fighters. The Air Force also underwent a major organisational change, as it transitioned from a " \
        "geographically based command-and-control system to one based on function, resulting in the establishment of Home " \
        "(operational), Training, and Maintenance Commands. Eastern Area Command, considered a de facto operational " \
        "headquarters owing to the preponderance of combat units under its control, was reorganised as Home Command in " \
        "October 1953. Headlam was appointed an Officer of the Order of the British Empire (OBE) in the 1954 New Year " \
        "Honours for his \"exceptional ability and devotion to duty\". He was promoted to acting air commodore in May. His " \
        "appointment as aide-de-camp to Queen Elizabeth II was announced on 7 October 1954."
    gen_tokens = 200
    text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
    return text
ExLLaMA generation works fine in all my other standalone Python scripts. The Gradio UI code has also worked fine in several other projects.
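One possible cause, offered as an assumption rather than a confirmed diagnosis: the traceback passes through ExAutogradMatmul4bitCuda.apply, which suggests the autograd path was taken, and Gradio runs the callback in a worker thread (see the anyio/to_thread.py frames above). PyTorch's gradient mode is thread-local, so a setting made on the main thread does not carry over to a worker. The sketch below reproduces that behaviour with the standard library only; `set_grad_enabled`, `is_grad_enabled`, and `generate` are hypothetical stand-ins, not the PyTorch or ExLlama functions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a per-thread "gradient mode" flag.
_state = threading.local()

def set_grad_enabled(flag):
    _state.grad = flag

def is_grad_enabled():
    # Threads that never set the flag see the default (enabled).
    return getattr(_state, "grad", True)

def generate():
    # Stand-in for a forward pass that branches on the gradient mode.
    return "autograd path" if is_grad_enabled() else "inference path"

def guarded_generate():
    # Disable the flag inside the thread that actually runs the work.
    set_grad_enabled(False)
    return generate()

def demo():
    set_grad_enabled(False)              # set on the calling thread only
    main_result = generate()
    with ThreadPoolExecutor(max_workers=1) as pool:
        worker_result = pool.submit(generate).result()
        fixed_result = pool.submit(guarded_generate).result()
    return main_result, worker_result, fixed_result

if __name__ == "__main__":
    print(demo())
```

If this is what is happening here, running the generation call inside `torch.no_grad()` within the Gradio callback itself would be the analogous fix; that has not been verified against this setup.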
Crashing with act order and no act order since latest changes.
`python test_benchmark_inference.py -t /home/nap/llm_models/koala-13B-HF-4bit/tokenizer.model -c /home/nap/llm_models/koala-13B-HF-4bit/config.json -m /home/nap/llm_models/koala-13B-HF-4bit/koala13B-4bit-128g-no-act-order.safetensors -g 128 -p -ppl
Using /home/nap/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/nap/.cache/torch_extensions/py310_cu118/exllama_ext/build.ninja...
Building extension module exllama_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module exllama_ext...
-- Loading model
-- Tokenizer: /home/nap/llm_models/koala-13B-HF-4bit/tokenizer.model
-- Model config: /home/nap/llm_models/koala-13B-HF-4bit/config.json
-- Model: /home/nap/llm_models/koala-13B-HF-4bit/koala13B-4bit-128g-no-act-order.safetensors
-- Groupsize: 128
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'perf', 'ppl']
** Time, Load model: 1.50 seconds
** VRAM, Model: [cuda:0] 6,689.96 MB
-- Inference, first pass.
Traceback (most recent call last):
File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 153, in <module>
logits = timer("Inference", lambda: wrapper.next_logits(ids))
File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 64, in timer
ret = func()
File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 153, in <lambda>
logits = timer("Inference", lambda: wrapper.next_logits(ids))
File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 54, in next_logits
return self.model.forward(input_ids, self.cache, last_id_only)
File "/home/nap/Documents/exllama-api/model.py", line 523, in forward
hidden_states = decoder_layer(hidden_states, cache, attn_masks[device])
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nap/Documents/exllama-api/model.py", line 351, in forward
hidden_states = self.self_attn(hidden_states, cache, attention_mask)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nap/Documents/exllama-api/model.py", line 264, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nap/Documents/exllama-api/model.py", line 148, in forward
out = quant_util.matmul4bit(x,
File "/home/nap/Documents/exllama-api/quant_util.py", line 70, in matmul4bit
if switch: output = _q4v2_recons(x, qweight, scales, zeros, groupsize, g_idx)
File "/home/nap/Documents/exllama-api/quant_util.py", line 51, in _q4v2_recons
q4v2_recons(qweight, buffer, scales, zeros, groupsize, g_idx if g_idx is not None else none_tensor)
TypeError: q4v2_recons(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: int) -> None
Invoked with: tensor([[-1398026309, 1248440250, 1968657271, ..., 1648788836,
1771582072, 1432982596],
[-1129530164, -1402287736, 1970562646, ..., 2016756323,
900172105, -2007726747],
[ -876888900, -1735723655, 1717986149, ..., -1236974524,
1117231658, -1988663128],
...,
[ 2125380013, 729121940, -1516013256, ..., -1448441238,
1395411286, -910718291],
[ -609454181, -1721358701, 2071349639, ..., -1380296262,
842437924, -646359431],
[ 1518767014, -1668986954, -1201825385, ..., 1920967637,
1770408276, -932611670]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16), tensor([[0.0146179199, 0.0079727173, 0.0102233887, ..., 0.0184173584,
0.0107421875, 0.0127944946],
[0.0070610046, 0.0054702759, 0.0059432983, ..., 0.0166778564,
0.0122299194, 0.0089492798],
[0.0113754272, 0.0084228516, 0.0120010376, ..., 0.0273590088,
0.0140762329, 0.0107803345],
...,
[0.0120925903, 0.0055465698, 0.0123825073, ..., 0.0216217041,
0.0121994019, 0.0125427246],
[0.0135955811, 0.0066719055, 0.0160827637, ..., 0.0184020996,
0.0138168335, 0.0129928589],
[0.0071449280, 0.0064125061, 0.0047798157, ..., 0.0156250000,
0.0118713379, 0.0109329224]], device='cuda:0', dtype=torch.float16), tensor([[ 2023392650, 1010177125, -1289406317, ..., 1279628968,
-2103822205, 1447265365],
[ 2002999416, 1783731782, 1698252904, ..., 1971681173,
-2055768200, 1720019317],
[-2023401052, 678589785, -1808521094, ..., 1430677867,
-2089273206, 1750898759],
...,
[ 1703449704, 1770349877, -1807272091, ..., -2041219240,
1732671894, 1721131894],
[ 1732459378, 1197652024, 1950771288, ..., -1837668746,
1719236473, -2024245130],
[-2005375383, 1970881927, 1753777765, ..., 1971808872,
2003334805, 1970759287]], device='cuda:0', dtype=torch.int32), 128, tensor([ 0, 0, 0, ..., 39, 39, 39], device='cuda:0', dtype=torch.int32)`
Reverting to previous commit fixed the issue for me.
I was trying to do an apples-to-apples shootout of GPTQ vs the new llama.cpp k-quants (memory usage, speed, etc.) but ran into a bump with perplexity. It looks like exllama loads a jsonl-formatted version of wikitext-2's wiki.valid.raw (not the wiki.test.raw that is typically used for testing)?
Just wondering if there's a preformatted jsonl of the rest of wikitext-2 already. Is the format just literally chunking every line into a "text" object?
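If the format really is one JSON object per line with a single "text" field (an assumption taken from the question above, not checked against the exllama source), a conversion could look like the sketch below. The function name and file paths are hypothetical.

```python
import json

def raw_to_jsonl(raw_path, jsonl_path, key="text"):
    """Write each non-blank line of raw_path as a {"text": line} JSON object.

    Returns the number of records written.
    """
    count = 0
    with open(raw_path, encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip the blank separator lines in the raw file
            dst.write(json.dumps({key: line}, ensure_ascii=False) + "\n")
            count += 1
    return count

if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever wikitext-2 is stored.
    print(raw_to_jsonl("wiki.test.raw", "wikitext2_test.jsonl"))
```

Whether blank lines and headings should be kept or dropped depends on how the existing validation file was built, which would need to be checked before comparing perplexity numbers.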
Hi there, really amazing work that you're doing here.
I'm trying to run either the benchmark or the webui to test (I have 2x4090), but it seems it can't find the compiler or something similar?
The complete error is:
python .\webui\app.py
F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py:358: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
INFO: Could not find files for the given pattern(s)
Traceback (most recent call last):
File "F:\ChatIAs\exllama\webui\app.py", line 9, in <module>
import model_init
File "F:\ChatIAs\exllama\model_init.py", line 1, in <module>
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "F:\ChatIAs\exllama\model.py", line 5, in <module>
import cuda_ext
File "F:\ChatIAs\exllama\cuda_ext.py", line 14, in <module>
exllama_ext = load(
File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1283, in load
return _jit_compile(
File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1610, in _write_ninja_file_and_build_library
_write_ninja_file_to_build_library(
File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2057, in _write_ninja_file_to_build_library
_write_ninja_file(
File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2200, in _write_ninja_file
cl_paths = subprocess.check_output(['where',
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 421, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['where', 'cl']' returned non-zero exit status 1.
I have CUDA 11.8 and CUDA 12.1 on my system. When building GPTQ, for example, I specify which one to use (with $env:CUDA_PATH="CUDA_DIR"), but here I'm not sure whether it uses those or something self-built. Specifying the CUDA version doesn't work either.
Maybe I'm missing something here?
Python 3.10.10
Windows 11 Pro
RTX 4090 x2
AMD Ryzen 7 7800X3D
VS2019
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
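The traceback ends with `where cl` returning a non-zero exit status, which means the MSVC compiler `cl.exe` was not found on PATH in that shell; the CUDA version is not what failed at this point. A minimal sketch for checking this before importing the extension, using only the standard library (the helper names `find_tool` and `report` are hypothetical):

```python
import shutil

def find_tool(name):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)

def report(tools=("cl", "nvcc", "ninja")):
    """Map each tool name to its resolved path (None when not found)."""
    return {tool: find_tool(tool) for tool in tools}

if __name__ == "__main__":
    for tool, path in report().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If `cl` is reported as missing, starting Python from a Visual Studio developer command prompt, which sets up the compiler environment, is the usual way to get it onto PATH.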
When I try to run: python3 example_chatbot.py -d /home/xxxxx/models/based-7B-GPTQ -un "Jeff" -p prompt_chatbort.txt
The following error appears:
Traceback (most recent call last):
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/xxxxxx/exllama/example_chatbot.py", line 1, in <module>
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "/home/xxxxxx/exllama/model.py", line 5, in <module>
import cuda_ext
File "/home/xxxxxx/exllama/cuda_ext.py", line 42, in <module>
exllama_ext = load(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/5] /usr/bin/nvcc -DTOR CH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILE R_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1 011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx/miniconda3/envs/gpt q/lib/python3.9/site-packages/torch/include -isystem /home/xxxxxx/miniconda3/envs /gptq/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/TH -i system /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/includ e/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/python3.9 -D_GLIBCXX_USE CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUD A_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constex pr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/xxxxxx/exllama/exllama_e xt/cuda_func/half_matmul.cu -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11 BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx /miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/l ucas/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/a pi/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages /torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-p ackages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/pyth on3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_C ONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -- expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=co mpute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/luc as/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[2/5] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTE NSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYB IND11_BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home /xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem / home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/ csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-pa ckages/torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/ site-packages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/includ e/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATOR S --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=a rch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /ho me/xxxxxx/exllama/exllama_ext/cuda_func/q4_mlp.cu -o q4_mlp.cuda.o
FAILED: q4_mlp.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11 BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx /miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/l ucas/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/a pi/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages /torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-p ackages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/pyth on3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_C ONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -- expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=co mpute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/luc as/exllama/exllama_ext/cuda_func/q4_mlp.cu -o q4_mlp.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[3/5] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTE NSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYB IND11_BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home /xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem / home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/ csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-pa ckages/torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/ site-packages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/includ e/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATOR S --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=a rch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /ho me/xxxxxx/exllama/exllama_ext/cuda_func/q4_attn.cu -o q4_attn.cuda.o
FAILED: q4_attn.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11 BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx /miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/l ucas/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/a pi/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages /torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-p ackages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/pyth on3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_C ONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -- expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=co mpute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/luc as/exllama/exllama_ext/cuda_func/q4_attn.cu -o q4_attn.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[4/5] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTE NSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYB IND11_BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home /xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem / home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/ csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-pa ckages/torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/ site-packages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/includ e/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATOR S --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=a rch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /ho me/xxxxxx/exllama/exllama_ext/cuda_func/q4_matmul.cu -o q4_matmul.cuda.o
FAILED: q4_matmul.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11 BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx /miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/l ucas/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/a pi/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages /torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-p ackages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/pyth on3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_C ONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -- expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=co mpute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/luc as/exllama/exllama_ext/cuda_func/q4_matmul.cu -o q4_matmul.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
435 | function(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
530 | operator=(_Functor&& __f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
ninja: build stopped: subcommand failed.
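The errors above are raised inside the GCC 11 standard library header (/usr/include/c++/11/bits/std_function.h) while it is being compiled by /usr/bin/nvcc, which points to a mismatch between the CUDA toolkit and the host compiler rather than to a problem in the exllama sources. That reading is an inference from the log, not something confirmed in this thread. A sketch for recording both versions when reporting such a failure follows; the helper names are hypothetical, and the regular expressions assume the usual `nvcc --version` and `gcc --version` output.

```python
import re
import shutil
import subprocess

def parse_nvcc_release(text):
    """Extract (major, minor) from a line like 'release 11.8, V11.8.89'."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None

def parse_gcc_major(text):
    """Extract the major version from the first x.y.z number in the text."""
    match = re.search(r"\b(\d+)\.\d+\.\d+\b", text)
    return int(match.group(1)) if match else None

def tool_output(command):
    """Run a command and return its stdout, or '' if the tool is not on PATH."""
    if shutil.which(command[0]) is None:
        return ""
    return subprocess.run(command, capture_output=True, text=True).stdout

if __name__ == "__main__":
    print("nvcc release:", parse_nvcc_release(tool_output(["nvcc", "--version"])))
    print("gcc major:", parse_gcc_major(tool_output(["gcc", "--version"])))
```

With both numbers in hand, the host-compiler table in the CUDA installation guide for that toolkit release shows whether the pairing is supported.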
I have been using QLoRA to finetune my model on my 3090, which previously could only perform inference and not finetuning.
With the incredible improvements achieved with exllama, is it possible to combine QLoRA and exllama so that the requirements for finetuning are similar to those for inference?
Hello! I am trying to run exllama on wizard-vicuna-13b-uncensored-gptq, and when I try to run any of the commands I get the following error. I am running it in the NVIDIA PyTorch image nvcr.io/nvidia/pytorch:23.05-py3, with the newest version of CUDA (12.1.1), on a Google VM with an L4 on Ubuntu 18.04 LTS. I know the documentation says it's not compatible with all GPUs; is it compatible with the L4? Any help would be very much appreciated. Thank you!!
10 t/s vs. 6 t/s on text-generation-webui.
Great project.
I'm developing an AI assistant for fiction writers. As the OpenAI API gets pretty expensive with all the inference tricks needed, I'm looking for a good local alternative for most of the inference, saving GPT-4 just for polishing the final results.
exllama looks pretty interesting, but I'm getting a compilation error.
Even though I'm a software developer as well as a fiction writer, I'm far from being an AI expert.
Would it be correct to assume from the lines below that the P40 is not currently supported?
-D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__
Maybe it was a silly try, but self.weight = tensors[key].half() did not work.
If P40 will not work with exllama, could somebody advise if oobabooga/GPTQ-for-LLaMa would work?
If not CUDA, maybe there are good options for i9-13900K with 128G DDR5?
The full Traceback:
python test_benchmark_inference.py -d /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 1, in <module>
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/model.py", line 5, in <module>
import cuda_ext
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/cuda_ext.py", line 14, in <module>
exllama_ext = load(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1283, in load
return _jit_compile(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS -D__CUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
FAILED: q4v2_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^
1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu".
[2/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^
1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu".
ninja: build stopped: subcommand failed.
Great work with this loader. I'm seeing 5x speed improvements in Ooba and was hopeful that it would also bring some gains with ooba's multimodal extension, which is confirmed working in my current setup (Windows 10, RTX 2080 Ti, 11 GB VRAM, 96 GB RAM, CUDA 11.8, Torch 2.0.1, with LLaVA or MiniGPT pipelines at both 7b and 13b).
When I use exllama as the loader with any of the 4 multimodal setups, regular text chat and instruct work well and much faster, but as soon as I use the multimodal extension to include a photo I get this error.
Maybe you can point me in the right direction to try and resolve this?
File "D:\00\text-generation-webui\modules\text_generation.py", line 300, in generate_reply_custom
for reply in shared.model.generate_with_streaming(question, state):
File "D:\00\text-generation-webui\modules\exllama.py", line 68, in generate_with_streaming
self.generator.gen_begin_reuse(ids)
File "D:\00\text-generation-webui\repositories\exllama\generator.py", line 191, in gen_begin_reuse
if reuse < in_tokens.shape[-1]: self.gen_feed_tokens(in_tokens[:, reuse:])
File "D:\00\text-generation-webui\repositories\exllama\generator.py", line 209, in gen_feed_tokens
self.model.forward(self.sequence[:, start:-1], self.cache, preprocess_only = True, lora = self.lora)
File "D:\00\text-generation-webui\repositories\exllama\model.py", line 841, in forward
hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
File "D:\00\text-generation-webui\repositories\exllama\model.py", line 459, in forward
hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
File "D:\00\text-generation-webui\repositories\exllama\model.py", line 381, in forward
new_keys = cache.key_states[self.index].narrow(2, past_len, q_len)
RuntimeError: start (49) + length (13970) exceeds dimension size (2048).
Output generated in 1.51 seconds (0.00 tokens/s, 0 tokens, context 14020, seed 979644525)
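The numbers in the traceback suggest that the prompt (49 cached positions plus 13970 new ones) is simply much longer than the 2048-position cache, so the `narrow` call has nowhere to write. Below is a minimal sketch of a pre-flight check that would surface this before the forward pass. The function names `cache_overflow` and `ensure_fits_cache` are hypothetical and not part of exllama or text-generation-webui; the three values are taken from the error message above.

```python
def cache_overflow(past_len: int, q_len: int, max_seq_len: int) -> int:
    """Return how many positions the new tokens would overrun the cache by.

    past_len    -- positions already stored in the cache
    q_len       -- number of new tokens about to be fed forward
    max_seq_len -- size of the cache's sequence dimension
    A result of 0 means the tokens fit.
    """
    return max(0, past_len + q_len - max_seq_len)


def ensure_fits_cache(past_len: int, q_len: int, max_seq_len: int) -> None:
    """Raise a readable error instead of failing deep inside the attention layer."""
    over = cache_overflow(past_len, q_len, max_seq_len)
    if over > 0:
        raise ValueError(
            f"prompt needs {past_len + q_len} cache positions but only "
            f"{max_seq_len} are available ({over} too many)"
        )


# Values from the RuntimeError above (hypothetical usage).
overflow = cache_overflow(past_len=49, q_len=13970, max_seq_len=2048)
print(f"over by {overflow} positions")
```

A check like this does not fix the underlying problem (the multimodal prompt expanding to roughly 14k positions), but it makes the cause obvious at the call site.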
I'm currently implementing HF decoding for exllama, but I find that the model sometimes does not generate the expected leading space. It happens rarely, but still from time to time. Since I can only trigger it when sampling, I currently cannot give a prompt that reproduces it with greedy decoding.
I checked the oobabooga/text-generation-webui implementation and found that it's fixed in a strange way:
Since my implementation follows the HF interface, I have no access to the generation index "i" and therefore cannot check whether a forward call is for the first token or not.
So I'm wondering what the potential cause of this is, and whether there is any other way to fix it?
Update: This seems not to be exllama's problem; it has something to do with the strange "add leading space" behavior of the HF tokenizer observed earlier.
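For anyone hitting the same thing, here is a rough sketch of one way to keep the leading space without knowing the generation index. It assumes SentencePiece-style pieces in which the character "▁" (U+2581) marks a word boundary, and it works on piece strings rather than on decoded text. The helper names are hypothetical and this is not the webui's fix; with an HF tokenizer the pieces would typically come from `convert_ids_to_tokens`. Byte-fallback pieces such as `<0x0A>` are not handled here.

```python
from typing import Iterable, Iterator

# SentencePiece marks the start of a word with this character.
SPIECE_UNDERLINE = "\u2581"


def piece_to_text(piece: str) -> str:
    """Turn one piece into text, keeping the word-boundary space."""
    return piece.replace(SPIECE_UNDERLINE, " ")


def stream_text(pieces: Iterable[str]) -> Iterator[str]:
    """Yield text for each piece independently of its position in the stream."""
    for piece in pieces:
        yield piece_to_text(piece)


# Toy pieces for illustration only (not output from a real tokenizer run).
pieces = ["\u2581Hello", ",", "\u2581world"]
print(repr("".join(stream_text(pieces))))
```

Because each piece is converted on its own, the first generated token keeps its space just like any later token; the caller can then decide whether to strip it.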
model: llama-13B-4bit-128g
exllama:
(exllama) user@debian:~/AI/exllama$ python test_benchmark_inference.py -d ~/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/ -p
-- Loading model
-- Tokenizer: /home/user/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/tokenizer.model
-- Model config: /home/user/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/config.json
-- Model: /home/user/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'perf']
** Time, Load model: 1.57 seconds
-- Groupsize (inferred): 128
-- Act-order (inferred): no
** VRAM, Model: [cuda:0] 6,683.17 MB
-- Inference, first pass.
** Time, Inference: 2.08 seconds
** Speed: 923.57 tokens/second
-- Generating 128 tokens...
** Speed: 22.04 tokens/second
** VRAM, Inference: [cuda:0] 2,291.67 MB
** VRAM, Total: [cuda:0] 8,974.84 MB
ooba's webui:
streaming on:
(textgen) user@debian:~/AI/2oobabooga/text-generation-webui$ python3.10 server.py --wbits 4 --model llama-13b-4bit-128g --groups 128 --model_type llama
INFO:Gradio HTTP request redirected to localhost :)
bin /home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
INFO:Loading llama-13b-4bit-128g...
INFO:Found the following quantized model: models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
INFO:Loaded the model in 2.65 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 9.12 seconds (21.81 tokens/s, 199 tokens, context 4, seed 989197438)
Output generated in 8.57 seconds (23.22 tokens/s, 199 tokens, context 4, seed 26472177)
no stream:
(textgen) user@debian:~/AI/2oobabooga/text-generation-webui$ python3.10 server.py --wbits 4 --model llama-13b-4bit-128g --groups 128 --model_type llama --no-stream
INFO:Gradio HTTP request redirected to localhost :)
bin /home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
INFO:Loading llama-13b-4bit-128g...
INFO:Found the following quantized model: models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
INFO:Loaded the model in 2.48 seconds.
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Output generated in 5.17 seconds (24.74 tokens/s, 128 tokens, context 4, seed 250438644)
Output generated in 4.57 seconds (28.02 tokens/s, 128 tokens, context 4, seed 1203371762)
Output generated in 4.80 seconds (26.65 tokens/s, 128 tokens, context 4, seed 484445001)
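To compare the runs above without eyeballing them, the webui's "Output generated" lines can be parsed and averaged. This is a small sketch using only the standard library; `summarize` is a hypothetical helper and the regular expression assumes the exact log format shown above.

```python
import re
from statistics import mean

# Matches: "Output generated in 5.17 seconds (24.74 tokens/s, 128 tokens, ..."
LINE_RE = re.compile(
    r"Output generated in ([\d.]+) seconds \(([\d.]+) tokens/s, (\d+) tokens"
)


def summarize(log_text: str) -> dict:
    """Summarize generation speed from webui log lines."""
    rows = [(float(s), float(r), int(n)) for s, r, n in LINE_RE.findall(log_text)]
    if not rows:
        raise ValueError("no 'Output generated' lines found")
    total_seconds = sum(s for s, _, _ in rows)
    total_tokens = sum(n for _, _, n in rows)
    return {
        "runs": len(rows),
        # Mean of the per-run speeds as reported in the log.
        "mean_reported_tps": mean(r for _, r, _ in rows),
        # Total tokens divided by total time across all runs.
        "aggregate_tps": total_tokens / total_seconds,
    }


# The three no-stream runs copied from the log above.
no_stream_log = """
Output generated in 5.17 seconds (24.74 tokens/s, 128 tokens, context 4, seed 250438644)
Output generated in 4.57 seconds (28.02 tokens/s, 128 tokens, context 4, seed 1203371762)
Output generated in 4.80 seconds (26.65 tokens/s, 128 tokens, context 4, seed 484445001)
"""
print(summarize(no_stream_log))
```

The aggregate figure weights longer runs more heavily, so it can differ slightly from the mean of the reported per-run speeds.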