
Can llm-api be used to run a model with GPU rather than CPU about llm-api HOT 7 CLOSED

niparis commented on May 23, 2024

Can llm-api be used to run a model on a GPU rather than the CPU?

from llm-api.

Comments (7)

1b5d commented on May 23, 2024 1

@niparis I just pushed an image to run on GPU using GPTQ-for-llama. Please test it out and let me know your thoughts! Here is the image

I also added a section for more details in the README.md


1b5d commented on May 23, 2024 1

I have pushed a new image with the requested fix; you can find it in the latest README.md file.


1b5d commented on May 23, 2024

I'm currently trying to work out a Docker image that can run GPTQ for Llama. Let me know if you have any tips.


cmhamiche commented on May 23, 2024

It is working great.
One suggestion though: could max_length and max_new_tokens be made adjustable as parameters, please? The default value is 50 for both.
In the meantime, I edited gptq_llama.py:
l154: max_length = params.get("max_length", 2048)
l157: max_new_tokens = params.get("max_new_tokens", 2048)
l168: max_new_tokens=max_new_tokens,

edit: I tried the following first, but it did not work, which is why I'm bothering you.

llm = LLMAPI(
    host_name="http://localhost:8000",
    params={"max_length": 2048, "max_new_tokens": 2048, "temp": 0.2},
    verbose=True
)
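
The edit to gptq_llama.py described above amounts to reading the generation limits from the request parameters and falling back to a default when a key is missing. A minimal sketch of that pattern, assuming `params` is a plain dict; the helper name `resolve_generation_limits` and the constant `DEFAULT_LIMIT` are illustrative and are not llm-api's actual code:

```python
from typing import Any, Dict, Optional

DEFAULT_LIMIT = 2048  # assumption: the fallback value used in the edit above


def resolve_generation_limits(params: Optional[Dict[str, Any]] = None) -> Dict[str, int]:
    """Return max_length and max_new_tokens, using the fallback for missing keys."""
    params = params or {}
    return {
        "max_length": int(params.get("max_length", DEFAULT_LIMIT)),
        "max_new_tokens": int(params.get("max_new_tokens", DEFAULT_LIMIT)),
    }


# Only max_new_tokens is supplied, so max_length falls back to the default.
limits = resolve_generation_limits({"max_new_tokens": 300, "temp": 0.2})
print(limits)
```

With this shape, a caller who sends no limits at all still gets usable values instead of the low default of 50 mentioned above.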


cmhamiche commented on May 23, 2024

Sorry, I'm actually dumb, I just had to edit the config.yaml.

edit: Well, I spoke too soon:
Input length of input_ids is 240, but max_length is set to 50. This can lead to unexpected behavior. You should consider increasing max_new_tokens.
I'll put the values in my script like this:

llm = LLMAPI(
    host_name="http://localhost:8000",
    params={"n_predict": 300, "temp": 0.2, "max_length": 2048, "max_new_tokens": 2048},
    verbose=True
)
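
The warning quoted above follows from how the two limits interact in Hugging Face transformers: `max_length` counts the prompt tokens plus the generated tokens, while `max_new_tokens` counts only generated tokens and, in recent versions, takes precedence when both are set. A rough sketch of that arithmetic, where the function name `generation_budget` is hypothetical and the logic is a simplification rather than the library's own code:

```python
from typing import Optional


def generation_budget(prompt_len: int, max_length: int,
                      max_new_tokens: Optional[int] = None) -> int:
    """Approximate number of tokens the model may still generate."""
    if max_new_tokens is not None:
        # max_new_tokens ignores the prompt length and overrides max_length.
        return max_new_tokens
    # max_length includes the prompt, so a long prompt can exhaust it.
    return max(max_length - prompt_len, 0)


print(generation_budget(240, 50))        # prompt alone already exceeds max_length
print(generation_budget(240, 50, 2048))  # max_new_tokens takes over
```

This is why a 240-token prompt with the default max_length of 50 leaves no room for output, and why raising max_new_tokens is the suggested fix.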

Also, it'd be great if each model were stored in its own folder instead of the shared gptq_llama folder, so tokenizer.model, config.json and tokenizer_config.json don't get overwritten each time we change models.
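
One way to picture the folder suggestion is a small helper that maps each model name to its own directory, so files such as config.json never collide. This is only a sketch under assumptions: `model_dir`, the `--` separator and the directory layout are hypothetical and do not describe how llm-api stores models.

```python
import tempfile
from pathlib import Path


def model_dir(models_root: Path, model_name: str) -> Path:
    """Return (and create) a dedicated folder for one model."""
    # Replace path separators so names like "org/model" map to a single folder.
    safe_name = model_name.replace("/", "--").replace("\\", "--")
    path = models_root / safe_name
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demo in a temporary directory: each model keeps its own config.json.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("model-a", "org/model-b"):
        (model_dir(root, name) / "config.json").write_text(name)
    print(sorted(p.name for p in root.iterdir()))
```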

Now on the bright side, this implementation is way better than using the textgen webui's API: it's simpler and faster. I'm using a French fine-tuned model with a script that outputs to a tty. Since I barely know what I'm doing, that saves me the headache of having to deal with Unicode, UTF-8 and special characters for the time being.


1b5d commented on May 23, 2024

Thanks for the feedback and for being patient, I usually have very little time to work on this project. But I will adjust the directory thing for sure!


1b5d commented on May 23, 2024

Closing this as outdated now that different GPU options are supported.
