libLLM: Efficient inference of large language models.

Supported platforms: Linux, Windows, macOS

Welcome to libLLM, an open-source project designed for efficient inference of large language models (LLMs) on ordinary personal computers and mobile devices. The core is implemented in C++14 without any third-party dependencies (such as BLAS or SentencePiece), so it runs on a wide variety of devices.

Key features:

  • Optimized for everyday devices: libLLM has been optimized to run smoothly on common personal computers, ensuring the powerful capabilities of large language models are accessible to a wider range of users.
  • C++ code: Written in standard C++14, it is simple and efficient.
  • No external dependencies: The core functionality does not require third-party dependencies (BLAS, SentencePiece, etc.); the necessary GEMM kernels are implemented internally (AVX2 and AVX-512).
  • CUDA support: Supports accelerated inference using CUDA.

Build

libLLM CPU only

$ mkdir build && cd build
$ cmake ..
$ make -j

For macOS

Install OpenMP with Homebrew before running cmake. NOTE: libLLM is currently expected to be very slow on macOS because it has no aarch64 kernels yet.

% brew install libomp
% export OpenMP_ROOT=$(brew --prefix)/opt/libomp
% mkdir build && cd build
% cmake ..
% make -j

Build with CUDA

NOTE: specify -DCUDAToolkit_ROOT=<CUDA-DIR> if multiple CUDA versions are installed on your system.

Recommended versions:

  • CUDA: 11.7

$ mkdir build && cd build
$ cmake -DWITH_CUDA=ON [-DCUDAToolkit_ROOT=<CUDA-DIR>] ..
$ make -j
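
When several toolkits are installed, it can help to list them before choosing a value for -DCUDAToolkit_ROOT. The sketch below assumes the common Linux layout in which each toolkit lives in a directory named cuda-<version> under /usr/local. That layout and the helper name find_cuda_toolkits are assumptions for illustration and are not part of libLLM.

```python
from pathlib import Path


def find_cuda_toolkits(root="/usr/local"):
    """Return candidate CUDA toolkit directories under root, sorted by name.

    Assumes the conventional layout root/cuda-<version>/bin/nvcc; adjust
    the pattern for other installations.
    """
    candidates = []
    for path in sorted(Path(root).glob("cuda-*")):
        # Keep only directories that actually contain the nvcc compiler.
        if (path / "bin" / "nvcc").is_file():
            candidates.append(str(path))
    return candidates


if __name__ == "__main__":
    # Print one cmake command per toolkit found, using the flags shown above.
    for toolkit in find_cuda_toolkits():
        print(f"cmake -DWITH_CUDA=ON -DCUDAToolkit_ROOT={toolkit} ..")
```

If the script prints nothing, no toolkit was found in the assumed location and the path has to be supplied by hand.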

Run the libllm command-line tool

$ ./src/llm/llm -config ../model/chatglm3-6b-libllm-q4/chatglm3.config 
INFO 2023-12-19T08:56:47Z lymath.cc:42] lymath: Use Avx512 backend.
INFO 2023-12-19T08:56:48Z cuda_operators.cc:46] cuda numDevices = 1
INFO 2023-12-19T08:56:48Z cuda_operators.cc:47] cuda:0 maxThreadsPerMultiProcessor = 2048
INFO 2023-12-19T08:56:48Z cuda_operators.cc:49] cuda:0 multiProcessorCount = 20
INFO 2023-12-19T08:56:48Z llm.cc:123] OMP max_threads = 20
INFO 2023-12-19T08:56:48Z bpe_model.cc:34] read tokenizer from ../model/chatglm3-6b-libllm-q4/chatglm3.tokenizer.bin
INFO 2023-12-19T08:56:48Z model_factory.cc:35] model_type = chatglm3
INFO 2023-12-19T08:56:48Z model_factory.cc:36] device = cuda
INFO 2023-12-19T08:56:48Z state_map.cc:58] read state map from ../model/chatglm3-6b-libllm-q4/chatglm3.q4.bin
INFO 2023-12-19T08:56:51Z state_map.cc:68] reading ... 100.0%
INFO 2023-12-19T08:56:51Z state_map.cc:69] 200 tensors read.
> 你好
 
 你好👋!我是人工智能助手 ChatGLM3-6B,很高兴见到你,欢迎问我任何问题。
(29 token, time=0.92s, 31.75ms per token)
> 
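
The timing line in the session above reports a token count, a total time, and a per-token latency. As a small worked example, the sketch below derives latency and throughput from the first two figures. The helper name per_token_stats is hypothetical and is not part of libLLM. With the rounded values from the log (29 tokens, 0.92 s) it gives roughly 31.7 ms per token, slightly below the logged 31.75 ms, presumably because the printed time is rounded to two decimals.

```python
def per_token_stats(num_tokens, seconds):
    """Return (milliseconds per token, tokens per second) for one reply."""
    if num_tokens <= 0 or seconds <= 0:
        raise ValueError("num_tokens and seconds must be positive")
    ms_per_token = seconds * 1000.0 / num_tokens
    tokens_per_second = num_tokens / seconds
    return ms_per_token, tokens_per_second


if __name__ == "__main__":
    # Figures taken from the sample session: 29 tokens in about 0.92 s.
    ms, tps = per_token_stats(29, 0.92)
    print(f"{ms:.2f} ms per token, {tps:.2f} tokens/s")
```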

API Examples

Python

from libllm import Model, ControlToken

model = Model("model/chatglm3-6b-libllm-q4/chatglm3.config")
prompt = [ControlToken("<|user|>"), "\n", "你好", ControlToken("<|assistant|>")]

for chunk in model.complete(prompt):
    print(chunk.text, end="", flush=True)

print("\nDone!")
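
The loop above prints each chunk as it arrives. If the full reply is needed as a single string, the chunks can be accumulated instead. The following is a minimal sketch that relies only on the .text attribute used in the example. The helper collect_reply is hypothetical and not part of the libllm package, and whether one chunk corresponds to exactly one token is not stated here, so the count it returns is a chunk count.

```python
import time


def collect_reply(chunks):
    """Concatenate the text of streamed chunks and time the stream.

    chunks is any iterable whose items have a .text attribute, such as
    the values yielded by model.complete(prompt) in the example above.
    Returns (reply_text, number_of_chunks, elapsed_seconds).
    """
    parts = []
    start = time.perf_counter()
    for chunk in chunks:
        parts.append(chunk.text)
    elapsed = time.perf_counter() - start
    return "".join(parts), len(parts), elapsed


# Usage with libllm (requires the library and an exported model):
#   reply, n_chunks, seconds = collect_reply(model.complete(prompt))
#   print(reply)
```

Because the helper accepts any iterable, it can be exercised with stand-in objects without loading a model.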

Export Hugging Face models

Here is an example of exporting the ChatGLM3 model from Hugging Face.

$ cd tools
$ python chatglm_exporter.py

Three files will then be exported: chatglm3.config, chatglm3.q4.bin, and chatglm3.tokenizer.bin.
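
A quick way to confirm that the export finished is to check that all three files exist. The sketch below uses the file names listed above. The directory the exporter writes to is not specified here, so it is passed in as an argument, and the helper name missing_export_files is hypothetical.

```python
import sys
from pathlib import Path

# File names listed above for the exported ChatGLM3 model.
EXPECTED_FILES = ("chatglm3.config", "chatglm3.q4.bin", "chatglm3.tokenizer.bin")


def missing_export_files(model_dir):
    """Return the expected export files that are absent from model_dir."""
    directory = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]


if __name__ == "__main__":
    # Check the directory given on the command line, or the current one.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_export_files(target)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All exported files are present.")
```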

Contributors

ling0322, zchrissirhcz
