
InferLLM

Chinese README

InferLLM is a lightweight LLM inference framework that mainly references and borrows from the llama.cpp project. llama.cpp puts almost all of its core code and kernels in a single file and uses a large number of macros, which makes it difficult for developers to read and modify. InferLLM has the following features:

  • Simple structure that is easy to learn and get started with; the framework part is decoupled from the kernel part.
  • High efficiency: most of the kernels in llama.cpp have been ported.
  • A dedicated KVstorage type is defined for easy caching and management.
  • Compatible with multiple model formats (currently only alpaca Chinese and English int4 models are supported).
  • Currently supports CPU and GPU, with optimizations for Arm, x86, CUDA and riscv-vector. It can also be deployed on mobile phones with acceptable speed.

In short, InferLLM is a simple and efficient LLM CPU inference framework that can deploy quantized LLMs locally with good inference speed.

Latest News

  • 2023.08.16: Added support for the Llama-2-7B model.
  • 2023.08.08: Optimized performance on Arm; the int4 matmul kernel now uses Arm assembly and kernel packing.
  • Before: support for chatglm/chatglm2, baichuan, alpaca and ggml-llama models.

How to use

Download model

Currently, InferLLM uses the same models as llama.cpp, so models can be downloaded from the llama.cpp project. Models can also be downloaded directly from Hugging Face at kewin4933/InferLLM-Model. The alpaca, llama2, chatglm/chatglm2 and baichuan models uploaded there include a Chinese int4 model and an English int4 model.

Compile InferLLM

Download the build tools:

  • cmake
  • mingw w64 (thread)

Local compilation

mkdir build
cd build
cmake ..
make

GPU support is disabled by default. To enable it, configure with cmake -DENABLE_GPU=ON .. instead of cmake .. in the steps above. Only CUDA is currently supported, so install the CUDA toolkit before building.
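The choice between a CPU-only and a CUDA-enabled configure step can be wrapped in a small helper. This is only a sketch: configure_cmd and its argument are hypothetical names introduced here, and the only InferLLM option it uses is the ENABLE_GPU switch described above.

```shell
#!/bin/sh
# configure_cmd is a hypothetical helper, not part of InferLLM.
# It prints the cmake configure command for a CPU-only build (default)
# or for a CUDA-enabled build when its first argument is "1".
configure_cmd() {
    if [ "${1:-0}" = "1" ]; then
        echo "cmake -DENABLE_GPU=ON .."
    else
        echo "cmake .."
    fi
}

# Usage, from the repository root (assumes the CUDA toolkit is installed
# when GPU support is requested):
#   mkdir -p build && cd build
#   eval "$(configure_cmd 1)" && make
```

Printing the command rather than running it keeps the helper easy to inspect before anything is configured.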

Android cross compilation

For cross compilation, you can use the provided tools/android_build.sh script. Install the NDK in advance and set the NDK_ROOT environment variable to its path.

export NDK_ROOT=/path/to/ndk
./tools/android_build.sh
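Since the script depends on NDK_ROOT, it may help to validate the variable before starting the build. The following is a minimal sketch; check_ndk is a hypothetical helper, not something shipped with InferLLM, and it only checks that the path exists as a directory.

```shell
#!/bin/sh
# check_ndk is a hypothetical helper, not part of InferLLM.
# It succeeds only if its argument is a non-empty path to an existing directory.
check_ndk() {
    if [ -z "$1" ]; then
        echo "NDK_ROOT is not set" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "NDK_ROOT is not a directory: $1" >&2
        return 1
    fi
    return 0
}

# Usage, from the repository root:
#   export NDK_ROOT=/path/to/ndk
#   check_ndk "$NDK_ROOT" && ./tools/android_build.sh
```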

Run InferLLM

To run the ChatGLM model, please refer to the ChatGLM model documentation.

To run locally, execute ./chatglm -m chatglm-q4.bin -t 4 directly. To run on a mobile phone, use adb to copy the executable and the model file to the phone, and then execute adb shell ./chatglm -m chatglm-q4.bin -t 4.
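The phone workflow could be scripted roughly as follows. This is a sketch under assumptions: run_on_phone is a hypothetical helper, and /data/local/tmp is assumed to be a directory on the device that is writable over adb (it commonly is, but this is not stated in the InferLLM documentation). Only the standard adb push and adb shell subcommands are used.

```shell
#!/bin/sh
# run_on_phone is a hypothetical helper, not part of InferLLM.
# Arguments: executable, model file, thread count (defaults to 4).
# REMOTE_DIR is an assumption: a device directory writable via adb.
REMOTE_DIR=/data/local/tmp

run_on_phone() {
    exe=$1
    model=$2
    threads=${3:-4}
    exe_name=$(basename "$exe")
    model_name=$(basename "$model")
    # Copy the executable and the model file to the device.
    adb push "$exe" "$model" "$REMOTE_DIR/" || return 1
    # Mark the binary executable and run it next to the model file.
    adb shell "cd $REMOTE_DIR && chmod +x $exe_name && ./$exe_name -m $model_name -t $threads"
}

# Usage:
#   run_on_phone ./chatglm ./chatglm-q4.bin 4
```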

The default device is CPU. To run inference on the GPU, use ./chatglm -m chatglm-q4.bin -g GPU to select the GPU device.
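The two invocations can be generated from one place, which is convenient when switching devices in scripts. In this sketch chatglm_cmd is a hypothetical helper; it prints only the command lines shown in this section and uses no flags other than -m, -t and -g. The default of 4 threads is a choice made for this example.

```shell
#!/bin/sh
# chatglm_cmd is a hypothetical helper, not part of InferLLM.
# Arguments: model file, device (CPU or GPU, default CPU), thread count (default 4).
# It prints the command line instead of running it.
chatglm_cmd() {
    model=$1
    device=${2:-CPU}
    threads=${3:-4}
    if [ "$device" = "GPU" ]; then
        echo "./chatglm -m $model -g GPU"
    else
        echo "./chatglm -m $model -t $threads"
    fi
}

# Usage:
#   eval "$(chatglm_cmd chatglm-q4.bin GPU)"
```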

  • x86: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  • Android: Xiaomi 9, Qualcomm SM8150 Snapdragon 855
  • RISC-V: SG2042 CPU with riscv-vector 0.7, 64 threads

According to the x86 profiling results, we strongly advise using 4 threads.

Supported model

InferLLM currently supports the following models:

License

InferLLM is licensed under the Apache License, Version 2.0

