aminediro / cria

OpenAI compatible API for serving LLAMA-2 model

License: MIT License

Rust 78.75% Python 21.25%

cria's Issues

Implement streaming for the /v1/chat/completions route

I think /v1/completions has streaming code but /v1/chat/completions does not. The former API is deprecated, and I'm using the new one.

To me it also looks like the two routes could share code, i.e. everything after the prompt has been created from the incoming JSON (with reference to #18).

Perhaps it's a case of passing it to the streaming code in routes/completions.rs

So that you're aware, and for some context to my requests: we have integrated cria into https://github.com/purton-tech/bionicgpt

Thanks
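For illustration, here is a minimal sketch of what a streamed chat chunk looks like on the wire. OpenAI's streaming responses are server-sent events: each token is sent as a `data:` line carrying a `chat.completion.chunk` object, and the stream ends with `data: [DONE]`. The function names `json_escape`, `chat_chunk_event` and `done_event` are hypothetical and are not taken from cria's code; a real route would also fill in fields such as `id`, `created` and `model`.

```rust
/// Escape a string so it can be embedded in a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters need a \uXXXX escape.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Format one generated token as a server-sent event (hypothetical helper).
/// Only the fields needed to show the shape are included.
fn chat_chunk_event(token: &str) -> String {
    format!(
        "data: {{\"object\":\"chat.completion.chunk\",\"choices\":[{{\"index\":0,\"delta\":{{\"content\":\"{}\"}},\"finish_reason\":null}}]}}\n\n",
        json_escape(token)
    )
}

/// The terminating event that tells the client the stream is finished.
fn done_event() -> &'static str {
    "data: [DONE]\n\n"
}
```

Because the event framing is the same for both routes, the chat route would only need to build the prompt and could then hand the token stream to the same code that emits these events.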

Can this project handle multiple requests at once?

Hello, I came across this project while searching for OpenAI API compatible servers for llama.cpp and I was wondering if this can handle multiple requests at once?

Loading another model into RAM for each concurrent user doesn't seem like a great idea, and I was wondering if this was even possible at all with this project.

Thank you for your work!
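As a general illustration of the concern (not a description of how cria works), one loaded model can be shared between request handlers behind an `Arc`, so serving several users does not require loading the weights several times. `Model`, `complete` and `serve_concurrently` are hypothetical stand-ins; real inference usually also needs per-request state such as a session or KV cache.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a loaded model; the weights are read-only.
struct Model {
    weights: Vec<f32>,
}

impl Model {
    /// Placeholder for inference: it only needs shared (&self) access.
    fn complete(&self, prompt: &str) -> String {
        format!("{} -> {} weights", prompt, self.weights.len())
    }
}

/// Serve every prompt on its own thread while sharing a single model.
/// Cloning the Arc copies a pointer, not the weights.
fn serve_concurrently(model: Arc<Model>, prompts: Vec<String>) -> Vec<String> {
    let handles: Vec<_> = prompts
        .into_iter()
        .map(|prompt| {
            let model = Arc::clone(&model);
            thread::spawn(move || model.complete(&prompt))
        })
        .collect();

    // Join in spawn order so the results line up with the prompts.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```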

llama.cpp no longer supports .bin

thread 'main' panicked at 'Failed to load LLaMA model from "/home/cisco/git/llama.cpp/models/llama-2-13b-chat/ggml-model-q4_0.gguf": invalid magic number 46554747 (GGUF) for "/home/cisco/git/llama.cpp/models/llama-2-13b-chat/ggml-model-q4_0.gguf"', /home/cisco/git/cria/src/lib.rs:54:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
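The number in the panic message is the file's first four bytes read as a little-endian integer: the ASCII bytes `GGUF` become 0x46554747. The sketch below shows one way to check a model file's magic before handing it to a loader. It assumes nothing about cria's or llm's internals, and `read_magic`, `magic_as_u32` and `is_gguf` are hypothetical helper names.

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

/// The four bytes a GGUF file starts with.
const GGUF_MAGIC: [u8; 4] = *b"GGUF";

/// Read the first four bytes of a model file.
fn read_magic(path: &Path) -> std::io::Result<[u8; 4]> {
    let mut buf = [0u8; 4];
    let mut file = File::open(path)?;
    file.read_exact(&mut buf)?;
    Ok(buf)
}

/// Interpret the magic as a little-endian u32, matching the value
/// printed in the panic message.
fn magic_as_u32(magic: [u8; 4]) -> u32 {
    u32::from_le_bytes(magic)
}

/// True when the file header identifies the file as GGUF.
fn is_gguf(magic: [u8; 4]) -> bool {
    magic == GGUF_MAGIC
}
```

A loader that only understands the older GGML-style .bin files could use a check like this to report "this is a GGUF file" instead of panicking on an unknown magic number.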

GGUF support

I use Docker for deployment. I downloaded a 7B file, and the environment configuration file looks like this.

CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama

# Used in docker-compose

CRIA_MODEL_PATH=/llama/llama-2-7b/consolidated.00.pth
CRIA_USE_GPU=true
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans

Missing LICENSE

I see you have no LICENSE file for this project. The default is copyright.

I would suggest releasing the code under the GPL-3.0-or-later or AGPL-3.0-or-later license so that others are encouraged to contribute changes back to your project.

Unable to build on Windows

I attempted to build on Windows (specifically Windows 11), and it failed with the following error.

error[E0308]: mismatched types
   --> llm\crates\llm-base\src\tokenizer\huggingface.rs:25:21
    |
25  |             .decode(vec![idx as u32], true)
    |              ------ ^^^^^^^^^^^^^^^^ expected `&[u32]`, found `Vec<u32>`
    |              |
    |              arguments to this method are incorrect
    |
    = note: expected reference `&[u32]`
                  found struct `Vec<u32>`
note: method defined here
   --> C:\Users\gabri\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokenizers-0.13.4\src\tokenizer\mod.rs:814:12
    |
814 |     pub fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> Result<String> {
    |            ^^^^^^

error[E0308]: mismatched types
   --> llm\crates\llm-base\src\tokenizer\huggingface.rs:70:21
    |
70  |             .decode(tokens, skip_special_tokens)
    |              ------ ^^^^^^ expected `&[u32]`, found `Vec<u32>`
    |              |
    |              arguments to this method are incorrect
    |
    = note: expected reference `&[u32]`
                  found struct `Vec<u32>`
note: method defined here
   --> C:\Users\gabri\.cargo\registry\src\index.crates.io-6f17d22bba15001f\tokenizers-0.13.4\src\tokenizer\mod.rs:814:12
    |
814 |     pub fn decode(&self, ids: &[u32], skip_special_tokens: bool) -> Result<String> {
    |            ^^^^^^
help: consider borrowing here
    |
70  |             .decode(&tokens, skip_special_tokens)
    |                     +

For more information about this error, try `rustc --explain E0308`.
error: could not compile `llm-base` (lib) due to 2 previous errors
warning: build failed, waiting for other jobs to finish...

If anyone knows how to fix it, I'd appreciate the help. I just followed the instructions in the README on how to run cria.

It seems like there is a problem with the llm submodule?
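Both errors are the same mismatch: `decode` in tokenizers 0.13.4 takes `&[u32]`, while the call sites pass an owned `Vec<u32>`. The sketch below reproduces the fix in isolation. `decode` here is a stand-in with the same parameter types as in the compiler note, not the real tokenizer, and `decode_one` and `decode_many` are hypothetical names for the two call sites.

```rust
/// Stand-in with the parameter types shown in the compiler note
/// (`ids: &[u32], skip_special_tokens: bool`). It just joins the ids.
fn decode(ids: &[u32], _skip_special_tokens: bool) -> String {
    ids.iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(" ")
}

/// First call site: borrow a one-element array instead of building a Vec.
fn decode_one(idx: usize) -> String {
    decode(&[idx as u32], true)
}

/// Second call site: borrow the Vec, as the compiler suggests.
/// `&Vec<u32>` coerces to `&[u32]` automatically.
fn decode_many(tokens: Vec<u32>, skip_special_tokens: bool) -> String {
    decode(&tokens, skip_special_tokens)
}
```

Applied to the submodule, this would mean changing the two calls in llm-base's huggingface.rs to pass a borrowed slice.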

"git submodule update --init --recursive" requires ssh

I am trying to get this running in Docker.

Here is my current build:

FROM ubuntu:latest

RUN apt-get update && apt-get install -y \
    sudo 

RUN apt-get update && \
    apt-get install -y \
    curl \
    build-essential \
    libssl-dev \
    pkg-config


# Install Rust and Cargo
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"


RUN yes | sudo apt install git 

RUN git clone https://github.com/AmineDiro/cria.git
WORKDIR /cria
#RUN yes | git submodule update --init --recursive

# Build the project using Cargo in release mode
#RUN cargo build --release

#COPY ggml-model-q4_0.bin .

Unfortunately, it seems like the git submodule update --init --recursive command is somehow interfacing with the repo via SSH.

Have you considered maybe building this in a way that the dependencies can be installed without ssh?

Response cutting off at around 256 tokens

The response cuts off at around 256 tokens.

# .env
CRIA_MODEL_PATH=/home/bala/Models/llama-2-13b-chat.ggmlv3.q8_0.bin


# Other environment variables to set
CRIA_SERVICE_NAME=cria
CRIA_HOST=0.0.0.0
CRIA_PORT=3000
CRIA_MODEL_ARCHITECTURE=llama
CRIA_USE_GPU=false
CRIA_GPU_LAYERS=32
CRIA_ZIPKIN_ENDPOINT=http://zipkin-server:9411/api/v2/spans

Input

{
  "prompt":"[INST]<<SYS>>.<</SYS>>How do I get from UNSW to Central Station?[/INST]",
  "temperature":0.1
}

Output

console.log(response.choices[0].text)

There are several ways to get from the University of New South Wales (UNSW) to Central Station in Sydney. Here are some options: 1. Train: The easiest and most convenient way to get to Central Station from UNSW is by train. The UNSW campus is located near the Kensington station, which is on the Airport & South Line. You can take a train from Kensington station to Central Station. The journey takes around 20 minutes. 2. Bus: You can also take a bus from UNSW to Central Station. The UNSW campus is served by several bus routes, including the 395, 397, and 398. These buses run frequently throughout the day and the journey takes around 45-60 minutes, depending on traffic. 3. Light Rail: Another option is to take the light rail from UNSW to Central Station. The light rail runs along Anzac Parade and stops at Central Station. The journey takes around 30-40 minutes. 4. Taxi or Ride-sharing: You can also take a taxi or ride-sharing service such as Uber or Ly

Failed to run custom build command for `ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)`

I was following the steps given in the README. I'm new to Rust, so I have no idea how to get past this one. I tried googling but found nothing.

   Compiling ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)
error: failed to run custom build command for `ggml-sys v0.2.0-dev (D:\textgen\cria\llm\crates\ggml\sys)`

Caused by:
  process didn't exit successfully: `D:\textgen\cria\target\release\build\ggml-sys-347ad8dfc92431e3\build-script-build` (exit code: 101)
  --- stdout
  cargo:rerun-if-changed=llama-cpp
  OPT_LEVEL = Some("3")
  TARGET = Some("x86_64-pc-windows-msvc")
  HOST = Some("x86_64-pc-windows-msvc")
  cargo:rerun-if-env-changed=CC_x86_64-pc-windows-msvc
  CC_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CC_x86_64_pc_windows_msvc
  CC_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-msvc
  CFLAGS_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_msvc
  CFLAGS_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None
  OPT_LEVEL = Some("3")
  TARGET = Some("x86_64-pc-windows-msvc")
  HOST = Some("x86_64-pc-windows-msvc")
  cargo:rerun-if-env-changed=CC_x86_64-pc-windows-msvc
  CC_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CC_x86_64_pc_windows_msvc
  CC_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CC
  HOST_CC = None
  cargo:rerun-if-env-changed=CC
  CC = None
  cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
  CRATE_CC_NO_DEFAULTS = None
  CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
  DEBUG = Some("false")
  cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-msvc
  CFLAGS_x86_64-pc-windows-msvc = None
  cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_msvc
  CFLAGS_x86_64_pc_windows_msvc = None
  cargo:rerun-if-env-changed=HOST_CFLAGS
  HOST_CFLAGS = None
  cargo:rerun-if-env-changed=CFLAGS
  CFLAGS = None

  --- stderr
  thread 'main' panicked at 'Please make sure nvcc is executable and the paths are defined using CUDA_PATH, CUDA_INCLUDE_PATH and/or CUDA_LIB_PATH', llm\crates\ggml\sys\build.rs:344:33
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

LLama2 Chat Prompting incorrect?

From this guide https://replicate.com/blog/how-to-prompt-llama

A prompt with history would look like

<s>[INST] <<SYS>>
You are a helpful... bla bla.. assistant
<</SYS>>

Hi there! [/INST] Hello! How can I help you today? </s><s>[INST] What is a neutron star? [/INST] A neutron star is a ... </s><s> [INST] Okay cool, thank you! [/INST]

It may even be that the newlines can be removed.

So I think this prompt technique should replace the one in https://github.com/AmineDiro/cria/blob/main/src/routes/chat.rs#L16
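A minimal sketch of a prompt builder that follows the format quoted above. The `Turn` struct and `build_llama2_prompt` function are hypothetical names, not the code currently in routes/chat.rs, and the whitespace follows the example in this issue rather than a verified reference implementation.

```rust
/// One user message and, if it has already been answered, the reply.
struct Turn<'a> {
    user: &'a str,
    assistant: Option<&'a str>,
}

/// Build a Llama 2 chat prompt from an optional system prompt and the
/// conversation so far. The last turn normally has no reply yet.
fn build_llama2_prompt(system: Option<&str>, turns: &[Turn<'_>]) -> String {
    let mut prompt = String::new();
    for (i, turn) in turns.iter().enumerate() {
        prompt.push_str("<s>[INST] ");
        // The system prompt is wrapped in <<SYS>> tags inside the first turn only.
        if i == 0 {
            if let Some(sys) = system {
                prompt.push_str("<<SYS>>\n");
                prompt.push_str(sys);
                prompt.push_str("\n<</SYS>>\n\n");
            }
        }
        prompt.push_str(turn.user);
        prompt.push_str(" [/INST]");
        // Completed turns are closed with </s>; the open turn is left
        // for the model to continue.
        if let Some(reply) = turn.assistant {
            prompt.push(' ');
            prompt.push_str(reply);
            prompt.push_str(" </s>");
        }
    }
    prompt
}
```

Mapping the incoming OpenAI-style `messages` array onto `Turn` values (pairing each user message with the assistant message that follows it) is left out of the sketch.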

llama-2 70B support

I'm getting this error when trying to run on macOS:

error: invalid value 'llama-2' for '<MODEL_ARCHITECTURE>': llama-2 is not one of supported model architectures: [Bloom, Gpt2, GptJ, GptNeoX, Llama, Mpt]

If I use Llama instead, it crashes (as it probably should):

GGML_ASSERT: llama-cpp/ggml.c:6192: ggml_nelements(a) == ne0*ne1*ne2
fish: Job 1, 'target/release/cria Llama ../ll…' terminated by signal SIGABRT (Abort)
