
cria's Introduction

Cria - Local llama OpenAI-compatible API

The objective is to serve a local llama-2 model by mimicking an OpenAI API service. The llama-2 model runs on GPU using the ggml-sys crate with specific compilation flags.

Get started:

Using Docker (recommended way)

The easiest way of getting started is using the official Docker container. Make sure you have Docker and docker-compose installed on your machine (example install for Ubuntu 20.04).

cria provides two Docker images: one for CPU-only deployments and a second GPU-accelerated image. To use the GPU image, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.7 or higher.

To deploy the cria gpu version using docker-compose:

  1. Clone the repo:
git clone git@github.com:AmineDiro/cria.git
cd cria/docker
  2. The API will load the model located at /app/model.bin by default. You should change the docker-compose file with the ggml model path for Docker to bind mount. You can also change environment variables for your specific config. Alternatively, the easiest way is to set CRIA_MODEL_PATH in docker/.env :
# .env
CRIA_MODEL_PATH=/path/to/ggml/model

# Other environment variables to set
CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama
CRIA_USE_GPU=true
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans
  3. Run docker-compose to start up the cria API server and the zipkin server:
docker compose -f docker-compose-gpu.yaml up -d
  4. Enjoy using your local LLM API server 🤟 ! A quick sanity check is sketched below.
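Once the containers are up, you can verify the API is live by listing the models the server reports. This is a minimal sketch using urllib3, assuming the default CRIA_PORT=3000 from the .env above and assuming the /models route is exposed under the OpenAI-style /v1/models path:

import json

import urllib3

# Minimal sanity check: list models from the local cria server
# (assumes default port 3000 and an OpenAI-style /v1/models path)
http = urllib3.PoolManager()
resp = http.request("GET", "http://localhost:3000/v1/models")
print(json.dumps(json.loads(resp.data), indent=2))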

Local Install

  1. Git clone project

    git clone git@github.com:AmineDiro/cria.git
    cd cria/
  2. Build the project (I ❤️ cargo!).

    cargo b --release
    • For cuBLAS (NVIDIA GPU) acceleration use
      cargo b --release --features cublas
    • For metal acceleration use
      cargo b --release --features metal

      ❗ NOTE: If you have issues building for GPU, check out the building issues section

  3. Download a quantized GGML .bin LLaMA-2 model (for example llama-2-7b)

  4. Run the API; use the --use-gpu flag to offload model layers to your GPU

    ./target/release/cria -a llama --model {MODEL_BIN_PATH} --use-gpu --gpu-layers 32

Command line arguments reference

All the parameters can be passed as environment variables or command line arguments. Here is the reference for the command line arguments:

./target/release/cria --help

Usage: cria [OPTIONS]

Options:
  -a, --model-architecture <MODEL_ARCHITECTURE>      [default: llama]
      --model <MODEL_PATH>
  -v, --tokenizer-path <TOKENIZER_PATH>
  -r, --tokenizer-repository <TOKENIZER_REPOSITORY>
  -H, --host <HOST>                                  [default: 0.0.0.0]
  -p, --port <PORT>                                  [default: 3000]
  -m, --prefer-mmap
  -c, --context-size <CONTEXT_SIZE>                  [default: 2048]
  -l, --lora-adapters <LORA_ADAPTERS>
  -u, --use-gpu
  -g, --gpu-layers <GPU_LAYERS>
  --n-gqa <N_GQA>
      Grouped Query attention : Specify -gqa 8 for 70B models to work
  -z, --zipkin-endpoint <ZIPKIN_ENDPOINT>
  -h, --help                                         Print help

For environment variables, just prefix the argument with CRIA_ and use uppercase letters. For example, to set the model path, you can use the CRIA_MODEL environment variable.

There is an example docker/.env.sample file in the project root directory.

Prometheus Metrics

We are exporting Prometheus metrics via the /metrics endpoint.
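For instance, you can pull the raw exposition-format output to see which metrics are exported. A minimal sketch, assuming the server runs on the default host/port:

import urllib3

# Fetch the Prometheus text exposition output from the default host/port
http = urllib3.PoolManager()
resp = http.request("GET", "http://localhost:3000/metrics")
print(resp.data.decode("utf-8"))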

Tracing

We are tracing performance metrics using tracing and tracing-opentelemetry crates.

You can use the --zipkin-endpoint flag to export metrics to a zipkin endpoint.

There is a docker-compose file in the project root directory to run a local zipkin server on port 9411.


Completion Example

You can use the openai Python client or directly use the sseclient Python library and stream messages.

Here is an example using sseclient and urllib3:
import json
import sys
import time

import sseclient
import urllib3

url = "http://localhost:3000/v1/completions"


http = urllib3.PoolManager()
response = http.request(
    "POST",
    url,
    preload_content=False,
    headers={
        "Content-Type": "application/json",
    },
    body=json.dumps(
        {
            "prompt": "Morocco is a beautiful country situated in north africa.",
            "temperature": 0.1,
        }
    ),
)

client = sseclient.SSEClient(response)

s = time.perf_counter()
for event in client.events():
    chunk = json.loads(event.data)
    sys.stdout.write(chunk["choices"][0]["text"])
    sys.stdout.flush()
e = time.perf_counter()

print(f"\nGeneration from completion took {e-s:.2f} !")

You can clearly see generation using my M1 GPU.
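The /v1/chat/completions route can be exercised in the same way. Below is a minimal non-streaming sketch with urllib3; the request and response fields are assumed to follow the OpenAI chat schema (messages in, choices[0].message.content out):

import json

import urllib3

http = urllib3.PoolManager()
# Assumes the OpenAI-style chat schema is accepted by the local cria server
response = http.request(
    "POST",
    "http://localhost:3000/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    body=json.dumps(
        {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Give me three facts about Morocco."},
            ],
            "temperature": 0.1,
        }
    ),
)

data = json.loads(response.data)
print(data["choices"][0]["message"]["content"])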

TODO/ Roadmap:

  • Run Llama.cpp on CPU using llm-chain
  • Run Llama.cpp on GPU using llm-chain
  • Implement /models route
  • Implement basic /completions route
  • Implement streaming completions SSE
  • Cleanup cargo features with llm
  • Support MacOS Metal
  • Merge completions / completion_streaming routes in same endpoint
  • Implement /embeddings route
  • Implement route /chat/completions
  • Setup good tracing (Thanks to @aparo)
  • Docker deployment on CPU/GPU
  • Metrics : Prometheus (Thanks to @aparo)
  • Implement a global request queue
    • For each response put an entry in a queue
    • Spawn a model in separate task reading from ringbuffer, get entry and put each token in response
    • Construct stream from flume resp_rx chan and stream responses to user.
  • Implement streaming chat completions SSE
  • Setup CI/CD (thanks to @Benjamint22 )
  • BETTER ERRORS and http responses (deal with all the unwrapping)
  • Implement request batching
  • Implement request continuous batching
  • Maybe Support huggingface candle lib for a full rust integration 🤔 ?

API routes

Details on OpenAI API docs: https://platform.openai.com/docs/api-reference/
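Because the routes mirror the OpenAI API, you can also point the official openai Python client at cria by overriding its base URL. A minimal sketch, assuming openai>=1.0 and the default host/port; the model name and API key are placeholders, since cria serves the single model it was started with:

from openai import OpenAI

# Point the client at the local cria server; the key is a placeholder
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

completion = client.completions.create(
    model="llama-2",  # placeholder, cria serves the model it was started with
    prompt="Morocco is a beautiful country situated in north africa.",
    temperature=0.1,
)
print(completion.choices[0].text)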

cria's People

Contributors

aminediro, aparo, benjamint22, bringitup


cria's Issues

Can this project handle multiple requests at once?

Hello, I came across this project while searching for OpenAI API compatible servers for llama.cpp and I was wondering if this can handle multiple requests at once?

Loading another model into RAM for each concurrent user doesn't seem like a great idea, and I was wondering if this was even possible at all with this project.

Thank you for your work!

Response cutting off at around 256 tokens

The response cuts off at around 256 tokens.

# .env
CRIA_MODEL_PATH=/home/bala/Models/llama-2-13b-chat.ggmlv3.q8_0.bin


# Other environment variables to set
CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama
CRIA_USE_GPU=false
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans

Input

{
  "prompt":"[INST]<<SYS>>.<</SYS>>How do I get from UNSW to Central Station?[/INST]",
  "temperature":0.1
}

Output

console.log(response.choices[0].text)

There are several ways to get from the University of New South Wales (UNSW) to Central Station in Sydney. Here are some options: 1. Train: The easiest and most convenient way to get to Central Station from UNSW is by train. The UNSW campus is located near the Kensington station, which is on the Airport & South Line. You can take a train from Kensington station to Central Station. The journey takes around 20 minutes. 2. Bus: You can also take a bus from UNSW to Central Station. The UNSW campus is served by several bus routes, including the 395, 397, and 398. These buses run frequently throughout the day and the journey takes around 45-60 minutes, depending on traffic. 3. Light Rail: Another option is to take the light rail from UNSW to Central Station. The light rail runs along Anzac Parade and stops at Central Station. The journey takes around 30-40 minutes. 4. Taxi or Ride-sharing: You can also take a taxi or ride-sharing service such as Uber or Ly

Missing LICENSE

I see you have no LICENSE file for this project. The default is copyright.

I would suggest releasing the code under the GPL-3.0-or-later or AGPL-3.0-or-later license so that others are encouraged to contribute changes back to your project.

llama.cpp no longer supports .bin

thread 'main' panicked at 'Failed to load LLaMA model from "/home/cisco/git/llama.cpp/models/llama-2-13b-chat/ggml-model-q4_0.gguf": invalid magic number 46554747 (GGUF) for "/home/cisco/git/llama.cpp/models/llama-2-13b-chat/ggml-model-q4_0.gguf"', /home/cisco/git/cria/src/lib.rs:54:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

"git submodule update --init --recursive" requires ssh

I am trying to get this running in Docker

Here is my current build

from ubuntu:latest

RUN apt-get update && apt-get install -y \
    sudo 

RUN apt-get update && \
    apt-get install -y \
    curl \
    build-essential \
    libssl-dev \
    pkg-config


# Install Rust and Cargo
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"


RUN yes | sudo apt install git 

RUN git clone https://github.com/AmineDiro/cria.git
WORKDIR /cria
#RUN yes | git submodule update --init --recursive

# Build the project using Cargo in release mode
#RUN cargo build --release

#COPY ggml-model-q4_0.bin .

Unfortunately, it seems like the git submodule update --init --recursive command is somehow interacting with the repo via SSH

Have you considered maybe building this in a way that the dependencies can be installed without ssh?


Implement streaming for the /v1/chat/completions route

I think /v1/completions has streaming code but /v1/chat/completions does not; the former API is deprecated and I'm using the new one.

To me it also looks like the code could be combined somehow? i.e. after creating the prompt from the incoming JSON with reference to #18

Perhaps it's a case of passing it to the streaming code in routes/completions.rs

So you're aware, and for some context to my requests: we have integrated cria into https://github.com/purton-tech/bionicgpt

Thanks

llama-2 70B support

I'm getting this error when trying to run on MacOS:

error: invalid value 'llama-2' for '<MODEL_ARCHITECTURE>': llama-2 is not one of supported model architectures: [Bloom, Gpt2, GptJ, GptNeoX, Llama, Mpt]

If I use LLama instead, it crashes (as it probably should)

GGML_ASSERT: llama-cpp/ggml.c:6192: ggml_nelements(a) == ne0*ne1*ne2
fish: Job 1, 'target/release/cria Llama ../ll…' terminated by signal SIGABRT (Abort)

GGUF support

I use Docker for deployment. I downloaded a 7B file, and the environment configuration file looks like this.

CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama

# Used in docker-compose

CRIA_MODEL_PATH=/llama/llama-2-7b/consolidated.00.pth
CRIA_USE_GPU=true
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans

LLama2 Chat Prompting incorrect?

From this guide https://replicate.com/blog/how-to-prompt-llama

A prompt with history would look like

<s>[INST] <<SYS>>
You are a helpful... bla bla.. assistant
<</SYS>>

Hi there! [/INST] Hello! How can I help you today? </s><s>[INST] What is a neutron star? [/INST] A neutron star is a ... </s><s> [INST] Okay cool, thank you! [/INST]

It may even be that the newlines can be removed.

So I think this prompt technique should replace the one in https://github.com/AmineDiro/cria/blob/main/src/routes/chat.rs#L16
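For reference, here is a small sketch of a helper that assembles a prompt following that convention. This reflects the format from the guide, not cria's current implementation:

def build_llama2_prompt(system, history, user_msg):
    """Builds a LLaMA-2 chat prompt; history is a list of (user, assistant) pairs."""
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    for user, assistant in history:
        prompt += f"{user} [/INST] {assistant} </s><s>[INST] "
    prompt += f"{user_msg} [/INST]"
    return prompt


print(build_llama2_prompt(
    "You are a helpful assistant.",
    [("Hi there!", "Hello! How can I help you today?")],
    "What is a neutron star?",
))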

Failed to run custom build command for `ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)`

I was following the steps given in the README. I'm new to Rust, so I have no idea how to get through this one; I tried googling but found nothing.

   Compiling ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)
error: failed to run custom build command for `ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)`

Caused by:
  process didn't exit successfully: `D:\textgen\cria\target\release\build\ggml-sys-347ad8dfc92431e3\build-script-build` (exit code: 101)
  --- stdout
  cargo:rerun-if-changed=llama-cpp
  OPT_LEVEL = Some("3")
  TARGET = Some("x86_64-pc-windows-msvc")
  HOST = Some("x86_64-pc-windows-msvc")
  cargo:rerun-if-env-changed=CC_x86_64-pc-windows-msvc
  CC_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CC_x86_64_pc_windows_msvc
  CC_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-msvc
  CFLAGS_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_msvc
  CFLAGS_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None
  OPT_LEVEL = Some("3")
  TARGET = Some("x86_64-pc-windows-msvc")
  HOST = Some("x86_64-pc-windows-msvc")
  cargo:rerun-if-env-changed=CC_x86_64-pc-windows-msvc
  CC_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CC_x86_64_pc_windows_msvc
  CC_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-msvc
  CFLAGS_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_msvc
  CFLAGS_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None

  --- stderr
  thread 'main' panicked at 'Please make sure nvcc is executable and the paths are defined using CUDA_PATH, CUDA_INCLUDE_PATH and/or CUDA_LIB_PATH', llm\crates\ggml\sys\build.rs:344:33
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Unable to build on Windows

I attempted to build on Windows, specifically Windows 11; it failed with the following error.

error[E0308]: mismatched types
   --> llm\crates\llm-base\src\tokenizer\huggingface.rs:25:21
    |
25  |             .decode(vec![idx as u32], true)
    |              ------ ^^^^^^^^^^^^^^^^ expected `&[u32]`, found `Vec<u32>`
    |              |
    |              arguments to this method are incorrect
    |
    = note: expected reference `&[u32]`
                  found struct `Vec<u32>`
note: method defined here
   --> C:\Users\gabri\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokenizers-0.13.4\src\tokenizer\mod.rs:814:12
    |
814 |     pub fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> Result<String> {
    |            ^^^^^^

error[E0308]: mismatched types
   --> llm\crates\llm-base\src\tokenizer\huggingface.rs:70:21
    |
70  |             .decode(tokens, skip_special_tokens)
    |              ------ ^^^^^^ expected `&[u32]`, found `Vec<u32>`
    |              |
    |              arguments to this method are incorrect
    |
    = note: expected reference `&[u32]`
                  found struct `Vec<u32>`
note: method defined here
   --> C:\Users\gabri\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokenizers-0.13.4\src\tokenizer\mod.rs:814:12
    |
814 |     pub fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> Result<String> {
    |            ^^^^^^
help: consider borrowing here
    |
70  |             .decode(&tokens, skip_special_tokens)
    |                     +

For more information about this error, try `rustc --explain E0308`.
error: could not compile `llm-base` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...

If anyone knows how to fix it, I'd appreciate the help. I just followed the details on how to run cria from the README.

It seems like there is a problem with the llm submodule?
