llm-api's People

Contributors

1b5d

llm-api's Issues

GPTQ Models with safetensors? - 'Missing key(s) in state_dict'

Firstly, thank you so much for building this. I am really looking forward to using it with LangChain to get chat functions into Slack. Hopefully they integrate it soon! When my Python is a bit more up to scratch, I'll hopefully be able to get involved!

Secondly, I'm experiencing an issue when using the GPU containers with models that use safetensors. The output is long, but here is a snippet:

llm-api-app  |   File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
llm-api-app  |     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
llm-api-app  | RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
llm-api-app  |  Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.g_idx", "model.layers.0.self_attn.o_proj.g_idx", "model.layers.0.self_attn.q_proj.g_idx", "model.layers.0.self_attn.v_proj.g_idx", "model.layers.0.mlp.down_proj.g_idx", "model.layers.0.mlp.gate_proj.g_idx", "model.layers.0.mlp.up_proj.g_idx", "model.layers.1.self_attn.k_proj.g_idx", "model.layers.1.self_attn.o_proj.g_idx", "model.layers.1.self_attn.q_proj.g_idx", "model.layers.1.self_attn.v_proj.g_idx", "model.layers.1.mlp.down_proj.g_idx", "model.layers.1.mlp.gate_proj.g_idx", "model.layers.1.mlp.up_proj.g_idx", "model.layers.2.self_attn.k_proj.g_idx", "model.layers.2.self_attn.o_proj.g_idx", "model.layers.2.self_attn.q_proj.g_idx", "model.layers.2.self_attn.v_proj.g_i ... snip ...

I have tried a few models, which all exhibit this issue. If you want to test with one, I reliably get the error with the following model:
TheBloke/wizard-mega-13B-GPTQ

After running a safetensors model I then also cannot run other models, e.g. anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g.

I am using your upstream image for this (not building locally) via the provided compose file.

Is there anything I should be doing differently when using GPTQ models? Please let me know if you need more information.
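
For reference, the config I'm using follows the gptq_llama pattern from the example configs, roughly like this (the filename below is only a placeholder for the .safetensors file in the repo, and the rest is my best guess copied from those examples):

models_dir: /models
model_family: gptq_llama
setup_params:
  repo_id: TheBloke/wizard-mega-13B-GPTQ
  filename: model.safetensors        # placeholder for the actual .safetensors file in the repo
model_params:
  group_size: 128
  wbits: 4
  cuda_visible_devices: "0"
  device: "cuda:0"
  st_device: 0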

Is this quantized 7b ggml model too old?

First, thanks for sharing your repo. I don't understand why I can't use this model directly; I may be missing something since I'm just learning. I was hoping to avoid having to download and mess around with llama-cpp to get this working. My goal is to spin up a web server so I can generate and use embeddings with these models, and to use LangChain at some point too.


My config.yaml

models_dir: /models
model_family: alpaca
model_name: 7b
setup_params:
  repo_id: Sosaka/Alpaca-native-4bit-ggml
  filename: ggml-alpaca-7b-q4.bin
  convert: true
  migrate: false
model_params:
  ctx_size: 2000
  seed: -1
  n_threads: 8
  n_batch: 2048
  n_parts: -1
  last_n_tokens_size: 16

What am I doing wrong? I tried setting convert: true, thinking it would convert the old model; I noticed convert.py in the repo, but it doesn't seem to be used. I'm also confused about which is better, convert.py or llama-cpp's convert-unversioned-ggml-to-ggml.py, since the log says to use ggerganov's convert script.
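
If it helps, the variant I plan to try next is below; setting migrate: true as well is purely a guess on my part, based on that flag appearing in the other example configs:

models_dir: /models
model_family: alpaca
model_name: 7b
setup_params:
  repo_id: Sosaka/Alpaca-native-4bit-ggml
  filename: ggml-alpaca-7b-q4.bin
  convert: true
  migrate: true      # guess: maybe the old ggml format also needs migrating; I don't know for sure
model_params:
  ctx_size: 2000
  seed: -1
  n_threads: 8
  n_batch: 2048
  n_parts: -1
  last_n_tokens_size: 16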

Guidance?

No module named 'torch'

I am getting this issue running it in Docker with the default setup; torch isn't in the requirements or the default Dockerfile?

Traceback (most recent call last):
  File "/llm-api/./app/main.py", line 14, in <module>
    from app.llms import get_model_class
  File "/llm-api/app/llms/__init__.py", line 7, in <module>
    from .gptq_llama.gptq_llama import GPTQLlamaLLM
  File "/llm-api/app/llms/gptq_llama/__init__.py", line 4, in <module>
    from .gptq_llama import GPTQLlamaLLM
  File "/llm-api/app/llms/gptq_llama/gptq_llama.py", line 10, in <module>
    import torch  # pylint: disable=import-error
ModuleNotFoundError: No module named 'torch'

I have messed around trying to inject all the requirements, but I am wondering if I am missing something; I'd have thought the default packaged docker-compose file should just work out of the box CPU-wise, right?

I can bypass all of that by adding:

RUN pip install torch safetensors transformers

to the Dockerfile, but then I get this issue:

Traceback (most recent call last):
  File "/llm-api/app/llms/gptq_llama/gptq_llama.py", line 21, in <module>
    from .GPTQforLLaMa import quant
ImportError: cannot import name 'quant' from 'app.llms.gptq_llama.GPTQforLLaMa' (unknown location)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/llm-api/./app/main.py", line 14, in <module>
    from app.llms import get_model_class
  File "/llm-api/app/llms/__init__.py", line 7, in <module>
    from .gptq_llama.gptq_llama import GPTQLlamaLLM
  File "/llm-api/app/llms/gptq_llama/__init__.py", line 4, in <module>
    from .gptq_llama import GPTQLlamaLLM
  File "/llm-api/app/llms/gptq_llama/gptq_llama.py", line 24, in <module>
    raise ImportError(
ImportError: the GPTQ-for-LLaMa lib is missing, please install it first

I was hoping that, since it uses Docker, all the dependencies would be installed automatically; am I doing it wrong?
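
For reference, this is roughly the shape of compose file I expected to just work for the GPTQ path; the image tag, port and mount paths below are my guesses rather than anything taken from the repo:

# rough sketch only - the image tag, port and config path are assumptions on my part
services:
  llm-api:
    image: 1b5d/llm-api:latest-gpu            # assumption: a GPU image that already ships torch and GPTQ-for-LLaMa
    ports:
      - "8000:8000"                           # assumption: default API port
    volumes:
      - ./models:/models                      # models_dir inside the container
      - ./config.yaml:/llm-api/config.yaml    # assumption: where the app reads its config
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]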

Any advice for getting this running with `gpt-x-alpaca` models?

First of all, thanks for the repo - looks ideal.

I'm using gpt-x-alpaca-13b-native-4bit-128g-cuda.pt, which can be found in the repo anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g on HF.

The error I'm receiving is:

invalid model file (bad magic [got 0x4034b50 want 0x67676a74])

Is this something which should be compatible?
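
I did wonder whether a .pt GPTQ file like this needs the gptq_llama family rather than the ggml/llama path; going by the other example configs, that would look something like the following, though this is just my guess:

models_dir: /models
model_family: gptq_llama
setup_params:
  repo_id: anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
  filename: gpt-x-alpaca-13b-native-4bit-128g-cuda.pt
model_params:
  group_size: 128        # guess: matches the 128g in the model name
  wbits: 4
  cuda_visible_devices: "0"
  device: "cuda:0"
  st_device: 0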

Consider refactor for model types

At the moment the LLMs and their associated inference/embeddings are in class-specific implementations. Not sure this is necessary or DRY once you start catering for different architectures (e.g. Dolly v2).

Consider refactoring with interfaces using a config: AutoConfig = AutoConfig.from_pretrained(path_or_repo) type approach, as this might allow scaling to different model types without the need for heavy configuration on the user's side or massive amounts of boilerplate rewriting of specific implementations.

E.g. stub:

import logging

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)

# path_or_repo and model_kwargs are assumed to come from the existing config handling
config: AutoConfig = AutoConfig.from_pretrained(path_or_repo)

if config.model_type == "llama":
    from transformers import LlamaForCausalLM, LlamaTokenizer

    tokenizer: LlamaTokenizer = LlamaTokenizer.from_pretrained(path_or_repo)
    model: LlamaForCausalLM = LlamaForCausalLM.from_pretrained(
        path_or_repo, **model_kwargs
    )  # , load_in_8bit=True, device_map="auto")
elif config.model_type == "gpt_neox":
    from transformers import GPTNeoXForCausalLM, GPTNeoXTokenizerFast

    tokenizer: GPTNeoXTokenizerFast = GPTNeoXTokenizerFast.from_pretrained(path_or_repo)
    model: GPTNeoXForCausalLM = GPTNeoXForCausalLM.from_pretrained(
        path_or_repo, **model_kwargs
    )  # , load_in_8bit=True, device_map="auto")
else:
    logger.error(f"Unable to determine model type {config.model_type}. Attempting AutoModel")
    try:
        tokenizer: AutoTokenizer = AutoTokenizer.from_pretrained(path_or_repo)
        model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
            path_or_repo, **model_kwargs
        )  # , load_in_8bit=True, device_map="auto")
    except Exception as e:
        logger.exception(e)
        raise

Request: LoRA support

Love this! Being able to spin up a local LLM easily and interact with it over tried-and-true HTTP is a dream.

I frequently use vanilla LLaMA with custom LoRAs, and need the ability to load a model with a LoRA and, ideally, to switch LoRAs and even load multiple LoRAs in a given order.

My Python is not the strongest - any chance of getting a feature like this added? I'm fairly certain we can take inspiration from text-generation-webui, which allows loading a LoRA and changing LoRAs while the model is loaded. (AFAIK, TGWUI does not support loading more than one LoRA at a time yet.)
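
Something along these lines is what I have in mind for config.yaml; the loras key below is purely hypothetical and does not exist in llm-api today, it is only meant to illustrate the request:

models_dir: /models
model_family: llama
setup_params:
  repo_id: user/repo_id
  filename: ggml-model-q4_0.bin
model_params:
  ctx_size: 2000
  loras:                              # hypothetical key - LoRAs applied in the listed order
    - /models/loras/first-lora
    - /models/loras/second-lora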

Illegal instruction (core dumped)

I presume there is a minimum CPU requirement, like needing AVX2, AVX-512, F16C or something?
Could you document the minimum instruction set and extensions required?

root@1d1c4289f303:/llm-api# python app/main.py
2023-10-26 23:31:19,237 - INFO - llama - found an existing model /models/llama_601507219781/ggml-model-q4_0.bin
2023-10-26 23:31:19,237 - INFO - llama - setup done successfully for /models/llama_601507219781/ggml-model-q4_0.bin
Illegal instruction (core dumped)
root@1d1c4289f303:/llm-api#

--- modulename: llama, funcname: __init__
llama.py(289): self.verbose = verbose
llama.py(291): self.numa = numa
llama.py(292): if not Llama.__backend_initialized:
llama.py(293): if self.verbose:
llama.py(294): llama_cpp.llama_backend_init(self.numa)
--- modulename: llama_cpp, funcname: llama_backend_init
llama_cpp.py(475): return _lib.llama_backend_init(numa)
Illegal instruction (core dumped)

I assume this has CPU requirements:
ENV CMAKE_ARGS "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

OpenBLAS can be built for multiple targets with runtime detection of the target CPU by specifying DYNAMIC_ARCH=1 in Makefile.rule, on the gmake command line, or as -DDYNAMIC_ARCH=TRUE in cmake.

https://github.com/OpenMathLib/OpenBLAS/blob/develop/README.md

How to train the model with my data - can I use the API?

Maybe it is a silly question, but as a developer experimenting with tools I would like to have the option to upload my data (ebooks etc.) as plain text with metadata and train the model with it, to build a private knowledge base with an AI agent. Is that something I can count on?

Example config file

This repo really taught me why a running example is more important than the actual project.

Tried everything in the README but couldn't get this to work.

config.yml:

  # models_dir: /models
# model_family: gptq_llama
# setup_params:
#   repo_id: repo_id
#   filename: model.safetensors
# model_params:
#   group_size: 128
#   wbits: 4
#   cuda_visible_devices: "0"
#   device: "cuda:0"
#   st_device: 0

# file: config.yaml

# models_dir: /models
# model_family: vicuna
# setup_params:
#   repo_id: TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g
#   filename: vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order.pt
#   convert: false
#   migrate: false
# model_params:
#   group_size: 128
#   wbits: 4
#   cuda_visible_devices: "0"
#   device: "cuda:0"
#   st_device: 0
#   ctx_size: 2000

#----------------------- alpaca
models_dir: /models
model_family: alpaca
model_name: 7b
setup_params:
  repo_id: Sosaka/Alpaca-native-4bit-ggml
  filename: ggml-alpaca-7b-q4.bin
  convert: false
  migrate: false
model_params:
  ctx_size: 2000
  seed: -1
  n_threads: 8
  n_batch: 2048
  n_parts: 01
  last_n_tokens_size: 16
#-----------------------

# models_dir: /models     # dir inside the container
# model_family: alpaca
# model_name: 7b
# setup_params:
#   key: value
# model_params:
#   key: value

# models_dir: /models     # dir inside the container
# model_family: alpaca
# model_name: 7b
# setup_params:
#   repo_id: user/repo_id
#   filename: ggml-model-q4_0.bin
#   convert: false
#   migrate: false
# model_params:
#   ctx_size: 2000
#   seed: -1
#   n_threads: 8
#   n_batch: 2048
#   n_parts: -1
#   last_n_tokens_size: 16

Models directory: (screenshot of the downloaded model files)
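
For what it's worth, the next thing I plan to try is the variant below, based on the config from the quantized 7b ggml issue above (convert: true) and the commented example's n_parts: -1; I don't yet know whether it works:

models_dir: /models
model_family: alpaca
model_name: 7b
setup_params:
  repo_id: Sosaka/Alpaca-native-4bit-ggml
  filename: ggml-alpaca-7b-q4.bin
  convert: true      # as in the other issue that uses this model
  migrate: false
model_params:
  ctx_size: 2000
  seed: -1
  n_threads: 8
  n_batch: 2048
  n_parts: -1        # the commented example uses -1 rather than 01
  last_n_tokens_size: 16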

Question and stop sequence are included in LLM response

Expected response:
The capital of France is Paris.

Actual response:
</s> What is the capital of France?\n The capital of France is Paris.</s>

Code:

from langchain_llm_api import LLMAPI, APIEmbeddings

llm = LLMAPI(
    params={"temp": 0.2},
    verbose=True
)

print(llm("What is the capital of France?"))

Config:

models_dir: /models
model_family: gptq_llama
setup_params:
  repo_id: TheBloke/WizardLM-7B-uncensored-GPTQ
  filename: WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
model_params:
  group_size: 128
  wbits: 4
  cuda_visible_devices: "0"
  device: "cuda:0"
  st_device: 0
