
cria's Introduction

Cria - Local llama OpenAI-compatible API

The objective is to serve a local llama-2 model by mimicking an OpenAI API service. The llama-2 model runs on GPU using the ggml-sys crate with specific compilation flags.

Get started:

Using Docker (recommended way)

The easiest way of getting started is using the official Docker container. Make sure you have Docker and docker-compose installed on your machine (example install for Ubuntu 20.04).

cria provides two Docker images: one for CPU-only deployments and a second GPU-accelerated image. To use the GPU image, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.7 or higher.

To deploy the cria gpu version using docker-compose:

  1. Clone the repo:
git clone git@github.com:AmineDiro/cria.git
cd cria/docker
  2. The API will load the model located at /app/model.bin by default. You should change the docker-compose file with the ggml model path for Docker to bind mount. You can also change environment variables for your specific config. Alternatively, the easiest way is to set CRIA_MODEL_PATH in docker/.env :
# .env
CRIA_MODEL_PATH=/path/to/ggml/model

# Other environment variables to set
CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama
CRIA_USE_GPU=true
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans
  3. Run docker-compose to start up the cria API server and the zipkin server:
docker compose -f docker-compose-gpu.yaml up -d
  4. Enjoy using your local LLM API server 🤟 ! A quick sanity check is sketched below.
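Once the containers are up, you can verify the API is live by listing the models the server reports. This is a minimal sketch using urllib3, assuming the default CRIA_PORT=3000 from the .env above and assuming the /models route is exposed under the OpenAI-style /v1/models path:

import json

import urllib3

# Minimal sanity check: list models from the local cria server
# (assumes default port 3000 and an OpenAI-style /v1/models path)
http = urllib3.PoolManager()
resp = http.request("GET", "http://localhost:3000/v1/models")
print(json.dumps(json.loads(resp.data), indent=2))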

Local Install

  1. Git clone project

    git clone git@github.com:AmineDiro/cria.git
    cd cria/
  2. Build the project (I ❤️ cargo!).

    cargo b --release
    • For cuBLAS (NVIDIA GPU) acceleration use
      cargo b --release --features cublas
    • For metal acceleration use
      cargo b --release --features metal

      ❗ NOTE: If you have issues building for GPU, check out the building issues section

  3. Download a quantized GGML .bin LLaMA-2 model (for example llama-2-7b)

  4. Run the API; use the --use-gpu flag to offload model layers to your GPU

    ./target/release/cria -a llama --model {MODEL_BIN_PATH} --use-gpu --gpu-layers 32

Command line arguments reference

All the parameters can be passed as environment variables or command line arguments. Here is the reference for the command line arguments:

./target/release/cria --help

Usage: cria [OPTIONS]

Options:
  -a, --model-architecture <MODEL_ARCHITECTURE>      [default: llama]
      --model <MODEL_PATH>
  -v, --tokenizer-path <TOKENIZER_PATH>
  -r, --tokenizer-repository <TOKENIZER_REPOSITORY>
  -H, --host <HOST>                                  [default: 0.0.0.0]
  -p, --port <PORT>                                  [default: 3000]
  -m, --prefer-mmap
  -c, --context-size <CONTEXT_SIZE>                  [default: 2048]
  -l, --lora-adapters <LORA_ADAPTERS>
  -u, --use-gpu
  -g, --gpu-layers <GPU_LAYERS>
  --n-gqa <N_GQA>
      Grouped Query attention : Specify -gqa 8 for 70B models to work
  -z, --zipkin-endpoint <ZIPKIN_ENDPOINT>
  -h, --help                                         Print help

For environment variables, just prefix the argument with CRIA_ and use uppercase letters. For example, to set the model path, you can use the CRIA_MODEL environment variable.

There is an example docker/.env.sample file in the project root directory.

Prometheus Metrics

We are exporting Prometheus metrics via the /metrics endpoint.
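For instance, you can pull the raw exposition-format output to see which metrics are exported. A minimal sketch, assuming the server runs on the default host/port:

import urllib3

# Fetch the Prometheus text exposition output from the default host/port
http = urllib3.PoolManager()
resp = http.request("GET", "http://localhost:3000/metrics")
print(resp.data.decode("utf-8"))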

Tracing

We are tracing performance metrics using tracing and tracing-opentelemetry crates.

You can use the --zipkin-endpoint flag to export metrics to a zipkin endpoint.

There is a docker-compose file in the project root directory to run a local zipkin server on port 9411.


Completion Example

You can use the openai Python client or directly use the sseclient Python library and stream messages.

Here is an example using sseclient and urllib3:
import json
import sys
import time

import sseclient
import urllib3

url = "http://localhost:3000/v1/completions"


http = urllib3.PoolManager()
response = http.request(
    "POST",
    url,
    preload_content=False,
    headers={
        "Content-Type": "application/json",
    },
    body=json.dumps(
        {
            "prompt": "Morocco is a beautiful country situated in north africa.",
            "temperature": 0.1,
        }
    ),
)

client = sseclient.SSEClient(response)

s = time.perf_counter()
for event in client.events():
    chunk = json.loads(event.data)
    sys.stdout.write(chunk["choices"][0]["text"])
    sys.stdout.flush()
e = time.perf_counter()

print(f"\nGeneration from completion took {e-s:.2f} !")

You can clearly see generation using my M1 GPU.
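The /v1/chat/completions route can be exercised in the same way. Below is a minimal non-streaming sketch with urllib3; the request and response fields are assumed to follow the OpenAI chat schema (messages in, choices[0].message.content out):

import json

import urllib3

http = urllib3.PoolManager()
# Assumes the OpenAI-style chat schema is accepted by the local cria server
response = http.request(
    "POST",
    "http://localhost:3000/v1/chat/completions",
    headers={"Content-Type": "application/json"},
    body=json.dumps(
        {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Give me three facts about Morocco."},
            ],
            "temperature": 0.1,
        }
    ),
)

data = json.loads(response.data)
print(data["choices"][0]["message"]["content"])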

TODO/ Roadmap:

  • Run Llama.cpp on CPU using llm-chain
  • Run Llama.cpp on GPU using llm-chain
  • Implement /models route
  • Implement basic /completions route
  • Implement streaming completions SSE
  • Cleanup cargo features with llm
  • Support MacOS Metal
  • Merge completions / completion_streaming routes in same endpoint
  • Implement /embeddings route
  • Implement route /chat/completions
  • Setup good tracing (Thanks to @aparo)
  • Docker deployment on CPU/GPU
  • Metrics : Prometheus (Thanks to @aparo)
  • Implement a global request queue
    • For each response put an entry in a queue
    • Spawn a model in separate task reading from ringbuffer, get entry and put each token in response
    • Construct stream from flume resp_rx chan and stream responses to user.
  • Implement streaming chat completions SSE
  • Setup CI/CD (thanks to @Benjamint22 )
  • BETTER ERRORS and http responses (deal with all the unwrapping)
  • Implement request batching
  • Implement request continuous batching
  • Maybe Support huggingface candle lib for a full rust integration 🤔 ?

API routes

Details on OpenAI API docs: https://platform.openai.com/docs/api-reference/
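Because the routes mirror the OpenAI API, you can also point the official openai Python client at cria by overriding its base URL. A minimal sketch, assuming openai>=1.0 and the default host/port; the model name and API key are placeholders, since cria serves the single model it was started with:

from openai import OpenAI

# Point the client at the local cria server; the key is a placeholder
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

completion = client.completions.create(
    model="llama-2",  # placeholder, cria serves the model it was started with
    prompt="Morocco is a beautiful country situated in north africa.",
    temperature=0.1,
)
print(completion.choices[0].text)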

cria's People

Contributors

aminediro, aparo, benjamint22, bringitup


cria's Issues

Can this project handle multiple requests at once?

Hello, I came across this project while searching for OpenAI API compatible servers for llama.cpp and I was wondering if this can handle multiple requests at once?

Loading another model into RAM for each concurrent user doesn't seem like a great idea, and I was wondering if this was even possible at all with this project.

Thank you for your work!

Response cutting off at around 256 tokens

The response cuts off at around 256 tokens.

# .env
CRIA_MODEL_PATH=/home/bala/Models/llama-2-13b-chat.ggmlv3.q8_0.bin


# Other environment variables to set
CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama
CRIA_USE_GPU=false
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans

Input

{
  "prompt":"[INST]<<SYS>>.<</SYS>>How do I get from UNSW to Central Station?[/INST]",
  "temperature":0.1
}

Output

console.log(response.choices[0].text)

There are several ways to get from the University of New South Wales (UNSW) to Central Station in Sydney. Here are some options: 1. Train: The easiest and most convenient way to get to Central Station from UNSW is by train. The UNSW campus is located near the Kensington station, which is on the Airport & South Line. You can take a train from Kensington station to Central Station. The journey takes around 20 minutes. 2. Bus: You can also take a bus from UNSW to Central Station. The UNSW campus is served by several bus routes, including the 395, 397, and 398. These buses run frequently throughout the day and the journey takes around 45-60 minutes, depending on traffic. 3. Light Rail: Another option is to take the light rail from UNSW to Central Station. The light rail runs along Anzac Parade and stops at Central Station. The journey takes around 30-40 minutes. 4. Taxi or Ride-sharing: You can also take a taxi or ride-sharing service such as Uber or Ly

Missing LICENSE

I see you have no LICENSE file for this project. The default is copyright.

I would suggest releasing the code under the GPL-3.0-or-later or AGPL-3.0-or-later license so that others are encouraged to contribute changes back to your project.

llama.cpp no longer supports .bin

thread 'main' panicked at 'Failed to load LLaMA model from "/home/cisco/git/llama.cpp/models/llama-2-13b-chat/ggml-model-q4_0.gguf": invalid magic number 46554747 (GGUF) for "/home/cisco/git/llama.cpp/models/llama-2-13b-chat/ggml-model-q4_0.gguf"', /home/cisco/git/cria/src/lib.rs:54:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

"git submodule update --init --recursive" requires ssh

I am trying to get this running in Docker

Here is my current build

from ubuntu:latest

RUN apt-get update && apt-get install -y \
    sudo 

RUN apt-get update && \
    apt-get install -y \
    curl \
    build-essential \
    libssl-dev \
    pkg-config


# Install Rust and Cargo
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"


RUN yes | sudo apt install git 

RUN git clone https://github.com/AmineDiro/cria.git
WORKDIR /cria
#RUN yes | git submodule update --init --recursive

# Build the project using Cargo in release mode
#RUN cargo build --release

#COPY ggml-model-q4_0.bin .

Unfortunately, it seems like the git submodule update --init --recursive command is somehow interacting with the repo via SSH

Have you considered maybe building this in a way that the dependencies can be installed without ssh?


Implement streaming for the /v1/chat/completions route

I think /v1/completions has streaming code but /v1/chat/completions does not; the former API is deprecated and I'm using the new one.

To me it also looks like the code could be combined somehow? i.e. after creating the prompt from the incoming JSON with reference to #18

Perhaps it's a case of passing it to the streaming code in routes/completions.rs

So you're aware, and for some context to my requests: we have integrated cria into https://github.com/purton-tech/bionicgpt

Thanks

llama-2 70B support

I'm getting this error when trying to run on MacOS:

error: invalid value 'llama-2' for '<MODEL_ARCHITECTURE>': llama-2 is not one of supported model architectures: [Bloom, Gpt2, GptJ, GptNeoX, Llama, Mpt]

If I use LLama instead, it crashes (as it probably should)

GGML_ASSERT: llama-cpp/ggml.c:6192: ggml_nelements(a) == ne0*ne1*ne2
fish: Job 1, 'target/release/cria Llama ../ll…' terminated by signal SIGABRT (Abort)

GGUF support

I use Docker for deployment. I downloaded a 7B file, and the environment configuration file looks like this.

CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama

# Used in docker-compose

CRIA_MODEL_PATH=/llama/llama-2-7b/consolidated.00.pth
CRIA_USE_GPU=true
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans

LLama2 Chat Prompting incorrect?

From this guide https://replicate.com/blog/how-to-prompt-llama

A prompt with history would look like

<s>[INST] <<SYS>>
You are a helpful... bla bla.. assistant
<</SYS>>

Hi there! [/INST] Hello! How can I help you today? </s><s>[INST] What is a neutron star? [/INST] A neutron star is a ... </s><s> [INST] Okay cool, thank you! [/INST]

It may even be that the newlines can be removed.

So I think this prompt technique should replace the one in https://github.com/AmineDiro/cria/blob/main/src/routes/chat.rs#L16
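For reference, here is a small sketch of a helper that assembles a prompt following that convention. This reflects the format from the guide, not cria's current implementation:

def build_llama2_prompt(system, history, user_msg):
    """Builds a LLaMA-2 chat prompt; history is a list of (user, assistant) pairs."""
    prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
    for user, assistant in history:
        prompt += f"{user} [/INST] {assistant} </s><s>[INST] "
    prompt += f"{user_msg} [/INST]"
    return prompt


print(build_llama2_prompt(
    "You are a helpful assistant.",
    [("Hi there!", "Hello! How can I help you today?")],
    "What is a neutron star?",
))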

Failed to run custom build command for `ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)`

I was following the steps given in the README. I'm new to Rust, so I have no idea how to get through this one; I tried googling but found nothing.

   Compiling ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)
error: failed to run custom build command for `ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)`

Caused by:
  process didn't exit successfully: `D:\textgen\cria\target\release\build\ggml-sys-347ad8dfc92431e3\build-script-build` (exit code: 101)
  --- stdout
  cargo:rerun-if-changed=llama-cpp
  OPT_LEVEL = Some("3")
  TARGET = Some("x86_64-pc-windows-msvc")
  HOST = Some("x86_64-pc-windows-msvc")
  cargo:rerun-if-env-changed=CC_x86_64-pc-windows-msvc
  CC_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CC_x86_64_pc_windows_msvc
  CC_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-msvc
  CFLAGS_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_msvc
  CFLAGS_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None
  OPT_LEVEL = Some("3")
  TARGET = Some("x86_64-pc-windows-msvc")
  HOST = Some("x86_64-pc-windows-msvc")
  cargo:rerun-if-env-changed=CC_x86_64-pc-windows-msvc
  CC_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CC_x86_64_pc_windows_msvc
  CC_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-msvc
  CFLAGS_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_msvc
  CFLAGS_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None

  --- stderr
  thread 'main' panicked at 'Please make sure nvcc is executable and the paths are defined using CUDA_PATH, CUDA_INCLUDE_PATH and/or CUDA_LIB_PATH', llm\crates\ggml\sys\build.rs:344:33
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Unable to build on Windows

I attempted to build on Windows, specifically Windows 11; it failed with the following error.

error[E0308]: mismatched types
   --> llm\crates\llm-base\src\tokenizer\huggingface.rs:25:21
    |
25  |             .decode(vec![idx as u32], true)
    |              ------ ^^^^^^^^^^^^^^^^ expected `&[u32]`, found `Vec<u32>`
    |              |
    |              arguments to this method are incorrect
    |
    = note: expected reference `&[u32]`
                  found struct `Vec<u32>`
note: method defined here
   --> C:\Users\gabri\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokenizers-0.13.4\src\tokenizer\mod.rs:814:12
    |
814 |     pub fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> Result<String> {
    |            ^^^^^^

error[E0308]: mismatched types
   --> llm\crates\llm-base\src\tokenizer\huggingface.rs:70:21
    |
70  |             .decode(tokens, skip_special_tokens)
    |              ------ ^^^^^^ expected `&[u32]`, found `Vec<u32>`
    |              |
    |              arguments to this method are incorrect
    |
    = note: expected reference `&[u32]`
                  found struct `Vec<u32>`
note: method defined here
   --> C:\Users\gabri\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokenizers-0.13.4\src\tokenizer\mod.rs:814:12
    |
814 |     pub fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> Result<String> {
    |            ^^^^^^
help: consider borrowing here
    |
70  |             .decode(&tokens, skip_special_tokens)
    |                     +

For more information about this error, try `rustc --explain E0308`.
error: could not compile `llm-base` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...

If anyone knows how to fix it, I'd appreciate the help. I just followed the details on how to run cria from the README.

It seems like there is a problem with the llm submodule?
