Rapid-LLaMA: A High-Performance LLM Inference Engine

Description

rapid-llama is a high-performance inference engine for LLMs such as LLaMA, written in C++. It can run an 8-bit quantized LLaMA2-7B model on a 56-core CPU at roughly 30 tokens/s, outperforming current open-source inference engines on CPU, with 2-3x the inference speed of the well-known llama.cpp.

Advantages

  • Fast
    • Extremely fast on CPU: up to 3x faster than llama.cpp, and faster than any other open-source engine on GitHub.
  • Simple
    • Only ~6k lines of well-organized C++ code, with no dependencies other than libnuma (needed only for multi-CPU machines).

โš ๏ธ Only CPU is supported currently. Support for GPU is coming soon.

Quick Start

Compile

Only Linux is supported currently. Support for other platforms (Windows, macOS) and for GPUs is coming soon.

Requirements

  • gcc 10.x or newer
  • libnuma-dev (on Debian/Ubuntu: apt install libnuma-dev)

Libraries such as MPI, OpenBLAS, and MKL are NOT currently needed.

Compilation

Method 1. Using the provided build script:

bash ./build.sh

Method 2. Using Make:

make -j 4

Run

To run the inference engine, execute the following command:

Only models in the gguf and llama2.c formats are currently supported. An independent format is coming soon.

./main -c ./models/cnllama-7b/ggml-model-f32.gguf -f gguf -j 56 -q int8 -n 200 -i 'That was a long long story happened in the ancient China.'

The command-line options are as follows:

  • -c: Path to the model file
  • -f: Model file format (e.g., gguf)
  • -j: Number of threads to use (e.g., 56)
  • -q: Quantization mode (e.g., int8)
  • -n: Number of tokens to generate (e.g., 200)
  • -i: Input text (e.g., 'That was a long long story happened in the ancient China.')
  • -h: Show usage information

Performance

rapid-llama achieves a generation speed of approximately 25-30 tokens/s for an 8-bit quantized 7B model running on the following CPU configuration:

Architecture:            x86_64
Model name:              Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
CPU(s):                  112
Thread(s) per core:      2
Core(s) per socket:      28
Socket(s):               2
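
As a rough back-of-envelope check (our own estimate, not a figure from the project): token generation for a dense 7B model is largely memory-bandwidth bound, because all weights are read once per generated token. With 8-bit weights the model occupies about 7 GB, so:

7 GB/token × 30 tokens/s ≈ 210 GB/s of sustained memory bandwidth

This is plausible for a dual-socket server of this class: Ice Lake Xeons support eight DDR4-3200 channels per socket, roughly 205 GB/s of theoretical peak bandwidth each.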

First-token latency will be optimized later.

Why

Why is it so fast?

  • Ultimate memory efficiency
    • Zero memory allocations and frees during inference (see the sketch after this list).
    • Maximization of memory locality.
  • Well-designed thread scheduling algorithm
  • Optimized operators
    • Fuses all operators that can be fused
    • Optimizes the computation of several operators
  • Proper quantization
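
As a concrete (and purely illustrative) sketch of the first and last points, here is a minimal C++ example; it is not taken from rapid-llama's source, and all names in it are hypothetical. Every working buffer is allocated once before generation begins, and int8 dequantization is fused into the matrix-vector product instead of materializing a float copy of the weights first.

#include <cstdint>
#include <cstdio>
#include <vector>

// int8 weights with one per-tensor scale (a simplification; real engines
// typically keep per-row or per-block scales).
struct QuantizedMatrix {
    int rows, cols;
    std::vector<int8_t> w;  // rows * cols quantized weights
    float scale;            // dequantized value = w[i] * scale
};

// Scratch buffers sized once at model-load time; the generation loop
// below never touches the allocator.
struct Scratch {
    std::vector<float> hidden, logits;
    Scratch(int dim, int vocab) : hidden(dim), logits(vocab) {}
};

// Fused dequantize + matvec: out = (W * scale) * x in a single pass,
// with no temporary float copy of W.
static void matvec_int8(const QuantizedMatrix& m, const float* x, float* out) {
    for (int r = 0; r < m.rows; ++r) {
        const int8_t* row = &m.w[(size_t)r * m.cols];
        float acc = 0.0f;
        for (int c = 0; c < m.cols; ++c)
            acc += (float)row[c] * x[c];  // integer weights used directly
        out[r] = acc * m.scale;           // scale applied once per row
    }
}

int main() {
    const int dim = 4, vocab = 8;
    QuantizedMatrix m{vocab, dim, std::vector<int8_t>((size_t)vocab * dim, 1), 0.05f};
    Scratch s(dim, vocab);            // all allocations happen here, up front
    s.hidden = {1.0f, 2.0f, 3.0f, 4.0f};
    matvec_int8(m, s.hidden.data(), s.logits.data());  // allocation-free hot path
    printf("logits[0] = %.2f\n", s.logits[0]);         // (1+2+3+4) * 0.05 = 0.50
    return 0;
}

Keeping allocations out of the token loop avoids allocator traffic and contention across many threads, and fusing dequantization into the matvec saves a full extra pass over the weights, which matters precisely because the workload is bandwidth-bound.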

License

rapid-llama is licensed under the Apache 2.0 License.

Acknowledgements

We would like to express our gratitude to all contributors and users of rapid-llama. Your support and feedback have been invaluable in making this project a success. If you encounter any issues or have any suggestions, please feel free to open an issue on the GitHub repository.

Contact

Email: 📩 [email protected]

Contact me if you have any questions.
