libLLM: Efficient inference of large language models.

Welcome to libLLM, an open-source project designed for efficient inference of large language models (LLMs) on ordinary personal computers and mobile devices. The core is implemented in C++14 without any third-party dependencies (such as BLAS or SentencePiece), so it can run on a wide variety of devices.

Key features:

Optimized for everyday devices: libLLM is optimized to run smoothly on common personal computers, making large language models accessible to a wider range of users.
C++ code: Written in standard C++14, it is simple and efficient.
No external dependencies: The core functionality does not require third-party dependencies (BLAS, SentencePiece, etc.); the necessary GEMM kernels are implemented internally (AVX2, AVX-512).
CUDA support: Supports accelerated inference using CUDA.

Build

libLLM CPU only

$ mkdir build && cd build
$ cmake ..
$ make -j

For macOS

Install OpenMP with Homebrew before running cmake. NOTE: libLLM on macOS is currently expected to be very slow because there is no aarch64 kernel for it.

% brew install libomp
% export OpenMP_ROOT=$(brew --prefix)/opt/libomp
% mkdir build && cd build
% cmake ..
% make -j

Build with CUDA

NOTE: specify -DCUDAToolkit_ROOT=<CUDA-DIR> if there are multiple CUDA versions installed on your system.

Recommended versions:

CUDA: 11.7

$ mkdir build && cd build
$ cmake -DWITH_CUDA=ON [-DCUDAToolkit_ROOT=<CUDA-DIR>] ..
$ make -j

Run libllm command line

$ ./src/llm/llm -config ../model/chatglm3-6b-libllm-q4/chatglm3.config 
INFO 2023-12-19T08:56:47Z lymath.cc:42] lymath: Use Avx512 backend.
INFO 2023-12-19T08:56:48Z cuda_operators.cc:46] cuda numDevices = 1
INFO 2023-12-19T08:56:48Z cuda_operators.cc:47] cuda:0 maxThreadsPerMultiProcessor = 2048
INFO 2023-12-19T08:56:48Z cuda_operators.cc:49] cuda:0 multiProcessorCount = 20
INFO 2023-12-19T08:56:48Z llm.cc:123] OMP max_threads = 20
INFO 2023-12-19T08:56:48Z bpe_model.cc:34] read tokenizer from ../model/chatglm3-6b-libllm-q4/chatglm3.tokenizer.bin
INFO 2023-12-19T08:56:48Z model_factory.cc:35] model_type = chatglm3
INFO 2023-12-19T08:56:48Z model_factory.cc:36] device = cuda
INFO 2023-12-19T08:56:48Z state_map.cc:58] read state map from ../model/chatglm3-6b-libllm-q4/chatglm3.q4.bin
INFO 2023-12-19T08:56:51Z state_map.cc:68] reading ... 100.0%
INFO 2023-12-19T08:56:51Z state_map.cc:69] 200 tensors read.
> 你好
 
 你好👋！我是人工智能助手 ChatGLM3-6B，很高兴见到你，欢迎问我任何问题。
(29 token, time=0.92s, 31.75ms per token)
>
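
The summary line printed after each reply can be turned into a throughput figure. The following is a minimal sketch that assumes the line keeps the exact format shown in the session above; the regular expression and the parse_stats helper are illustrative and not part of libLLM.

```python
import re

# Pattern for the summary line shown in the session above, e.g.
# "(29 token, time=0.92s, 31.75ms per token)". The format is assumed
# from that single sample and may differ between versions.
STATS_RE = re.compile(r"\((\d+) token, time=([\d.]+)s, ([\d.]+)ms per token\)")


def parse_stats(line):
    """Return (tokens, seconds, tokens_per_second), or None if the line does not match."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    tokens = int(match.group(1))
    seconds = float(match.group(2))
    # Guard against a reported time of zero.
    rate = tokens / seconds if seconds > 0 else 0.0
    return tokens, seconds, rate


stats = parse_stats("(29 token, time=0.92s, 31.75ms per token)")
if stats is not None:
    tokens, seconds, rate = stats
    print(f"{tokens} tokens in {seconds:.2f}s = {rate:.1f} tokens/s")
```

For the sample line this works out to roughly 31.5 tokens per second. The per-token time recomputed from the rounded values (about 31.72 ms) differs slightly from the printed 31.75 ms, presumably because the printed time is rounded.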

API Examples

Python

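from libllm import Model, ControlToken

# Load the model from its libLLM config file.
model = Model("model/chatglm3-6b-libllm-q4/chatglm3.config")
# Build a chat prompt: control tokens mark the user and assistant turns; "你好" means "Hello".
prompt = [ControlToken("<|user|>"), "\n", "你好", ControlToken("<|assistant|>")]

# Stream the completion and print each chunk as it arrives.
for chunk in model.complete(prompt):
    print(chunk.text, end="", flush=True)

print("\nDone!")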
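
When the full reply is needed as a single string rather than printed as it streams, the chunks can be joined. This is a minimal sketch, not part of the libLLM API: collect_text is a hypothetical helper that relies only on each chunk exposing a .text attribute, as in the example above, and it uses stand-in chunks so it runs without a model.

```python
from types import SimpleNamespace


def collect_text(chunks):
    """Join the .text of every streamed chunk into one string."""
    return "".join(chunk.text for chunk in chunks)


# Stand-in chunks so the sketch runs without a model. With libLLM this would be
# collect_text(model.complete(prompt)), using model and prompt from the example above.
fake_chunks = [
    SimpleNamespace(text="Hello"),
    SimpleNamespace(text=", "),
    SimpleNamespace(text="world"),
]
reply = collect_text(fake_chunks)
print(reply)
```

Because the helper accepts any iterable, it consumes the generator lazily and does not need the chunks to be materialized in a list first.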

Export Hugging Face models

Here is an example of exporting the ChatGLM3 model from Hugging Face.

$ cd tools
$ python chatglm_exporter.py

This exports three files: chatglm3.config, chatglm3.q4.bin, and chatglm3.tokenizer.bin.
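
Before pointing the command line or the Python API at an exported model, it can help to confirm that all three files are in place. The sketch below is only an illustration: missing_export_files is a hypothetical helper, and the directory name is an assumption borrowed from the paths used elsewhere in this README, since the exporter's actual output location is not stated here.

```python
from pathlib import Path

# File names taken from the export step above.
EXPECTED_FILES = ("chatglm3.config", "chatglm3.q4.bin", "chatglm3.tokenizer.bin")


def missing_export_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    directory = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]


# Assumed directory; adjust to wherever the exporter wrote its files.
missing = missing_export_files("model/chatglm3-6b-libllm-q4")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All exported files are present.")
```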
