
qwen2.cpp

中文版 (Chinese version)

This project is an independent C++ implementation of the Qwen2 family and Llama3.

Updates

  • 2024/03/26 Updated to Qwen1.5; basic functionality has been successfully ported.
  • 2024/03/28 Introduced a system prompt option for user input; added a CLI demo, a web demo, an OpenAI-compatible API server, and a LangChain API.
  • 2024/04/07 Added support for Qwen1.5-32B.
  • 2024/04/09 Added support for Qwen1.5-MoE-A2.7B.
  • 2024/04/11 Added Windows support; tested on Visual Studio 2022, with both CUDA and CPU functionality confirmed to work correctly.
  • 2024/04/18 Tested CodeQwen1.5-7B; the model architecture is verified to be correct, but it uses SentencePiece for tokenization, so test it with the Hugging Face tokenizer as in examples/codeqwen.py.
  • 2024/04/25 Added support for Llama3-8B; Llama3 also uses tiktoken, so it is supported.
  • 2024/06/07 Added support for Qwen2.

Features

Highlights:

  • Pure C++ implementation based on ggml, working in the same way as llama.cpp.
  • Pure C++ tiktoken implementation.
  • Streaming generation with typewriter effect.
  • Python binding.

Support Matrix:

  • Hardware: x86/ARM CPU, NVIDIA GPU, Apple Silicon GPU
  • Platforms: Linux, macOS, Windows
  • Models: Qwen2 family and Llama3

Test in Colab

Open In Colab

Getting Started

Preparation

Clone the qwen2.cpp repository to your local machine:

git clone --recursive https://github.com/yvonwin/qwen2.cpp && cd qwen2.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the qwen2.cpp folder:

git submodule update --init --recursive

Quantize Model

Use convert.py to transform Qwen2 models into the quantized GGML format. For example, to convert the fp16 original model to a q4_0 (4-bit quantized) GGML model, run:

python qwen_cpp/convert.py -i Qwen/Qwen2-1.5B-Instruct -t q4_0 -o Qwen2-1.5B-Instruct-ggml.bin

The original model (-i <model_name_or_path>) can be a HuggingFace model name or a local path to your pre-downloaded model. Currently supported models are:

  • Qwen1.5-0.5B: Qwen/Qwen1.5-0.5B-Chat
  • Qwen1.5-1.8B: Qwen/Qwen1.5-1.8B-Chat
  • Qwen1.5-7B: Qwen/Qwen1.5-7B-Chat
  • Qwen1.5-14B: Qwen/Qwen1.5-14B-Chat
  • Qwen1.5-32B: Qwen/Qwen1.5-32B-Chat
  • Qwen1.5-72B: Qwen/Qwen1.5-72B-Chat
  • Qwen1.5-MoE-A2.7B: Qwen/Qwen1.5-MoE-A2.7B-Chat
  • Llama-3-8B-Instruct: meta-llama/Meta-Llama-3-8B-Instruct
  • Llama3-8B-Chinese-Chat : shenzhi-wang/Llama3-8B-Chinese-Chat
  • Qwen2-7B-Instruct : Qwen/Qwen2-7B-Instruct

You are free to try any of the quantization types below by specifying -t <type> (a rough illustrative sketch of q4_0 follows the list):

  • q4_0: 4-bit integer quantization with fp16 scales.
  • q4_1: 4-bit integer quantization with fp16 scales and minimum values.
  • q5_0: 5-bit integer quantization with fp16 scales.
  • q5_1: 5-bit integer quantization with fp16 scales and minimum values.
  • q8_0: 8-bit integer quantization with fp16 scales.
  • f16: half precision floating point weights without quantization.
  • f32: single precision floating point weights without quantization.
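
For intuition, the sketch below shows roughly what a q4_0-style block quantizer does. It is illustrative only: ggml's actual kernel uses 32-weight blocks with one fp16 scale each, packs two 4-bit values per byte, and differs in rounding details.

import numpy as np

def q4_0_quantize(block):
    """Quantize one block of 32 float32 weights to 4-bit ints plus an fp16 scale."""
    amax = float(np.max(np.abs(block)))
    scale = amax / 7.0 if amax > 0 else 1.0  # simplified; ggml scales by amax / -8
    q = np.clip(np.round(block / scale), -8, 7).astype(np.int8)
    return np.float16(scale), q

def q4_0_dequantize(scale, q):
    """Recover approximate float32 weights from a quantized block."""
    return q.astype(np.float32) * np.float32(scale)

w = np.random.randn(32).astype(np.float32)
scale, q = q4_0_quantize(w)
print("max reconstruction error:", float(np.max(np.abs(w - q4_0_dequantize(scale, q)))))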

Build & Run

Compile the project using CMake:

cmake -B build && cmake --build build -j --config Release

Now you may chat with the quantized Qwen-Chat model by running:

./build/bin/main -m qwen2_32b-ggml.bin  -p 你想活出怎样的人生 -s "你是一个猫娘"
# As a catgirl, I want to live a life that is full of energy, free, and warmly happy.
# First, I hope to keep my feline nature, full of curiosity and vitality. I want to explore the world, whether the magnificent scenery of nature or the bustle of the city.
# Second, I hope to enjoy a free and easy life. Whether lazily napping in the warm sunshine or quietly exploring the mysteries of the night under the moonlight, I want to enjoy life as I please.
# Finally, I hope to have a warm and happy family and friends. Whether sharing good food with family or spending joyful times with friends, I want to feel mutual care and support and create beautiful memories together.
# All in all, I want to live a balanced and harmonious life, with both a cat's freedom and vitality and the happiness brought by a warm family and friends.

The default tiktoken file is qwen.tiktoken. For Llama3, download it from this link.

To run the model in interactive mode, add the -i flag. For example:

./build/bin/main -m Qwen2-1.5B-Instruct-ggml.bin  -i

In interactive mode, your chat history will serve as the context for the next-round conversation.

Run ./build/bin/main -h to explore more options!

Using BLAS

OpenBLAS

OpenBLAS provides acceleration on CPU. Add the CMake flag -DGGML_OPENBLAS=ON to enable it.

cmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j

cuBLAS

cuBLAS uses NVIDIA GPU to accelerate BLAS. Add the CMake flag -DGGML_CUBLAS=ON to enable it.

cmake -B build -DGGML_CUBLAS=ON && cmake --build build -j

Metal

MPS (Metal Performance Shaders) allows computation to run on Apple Silicon GPUs. Add the CMake flag -DGGML_METAL=ON to enable it.

cmake -B build -DGGML_METAL=ON && cmake --build build -j

Python Binding

The Python binding provides high-level chat and stream_chat interfaces, similar to the original Hugging Face Qwen-7B (see the usage sketch after the installation commands below).

Installation

You may install from source, adding the corresponding CMAKE_ARGS to enable hardware acceleration.

# set CMAKE_ARGS to match your hardware, e.g.
# CMAKE_ARGS="-DGGML_CUBLAS=ON"  for NVIDIA GPU (cuBLAS)
# CMAKE_ARGS="-DGGML_METAL=ON"   for Apple Silicon (Metal)
# install from the latest source hosted on GitHub
CMAKE_ARGS="-DGGML_CUBLAS=ON" pip install git+https://github.com/yvonwin/qwen2.cpp.git@master
# or install from your local source after git cloning the repo
pip install .
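
A minimal usage sketch follows. It assumes the binding exposes a Pipeline class whose constructor takes the GGML model path and the tiktoken file, as in the upstream qwen.cpp binding; the exact names may differ in your installed version.

import qwen_cpp  # assumed module name of this repo's Python binding

# hypothetical arguments: converted GGML model and the default tiktoken file
pipeline = qwen_cpp.Pipeline("./Qwen2-1.5B-Instruct-ggml.bin", "./qwen.tiktoken")

# one-shot chat
print(pipeline.chat(["你好"]))

# streaming chat with a typewriter effect
for chunk in pipeline.stream_chat(["你好"]):
    print(chunk, end="", flush=True)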

CLI Demo

To chat with streaming output, run the Python example below:

python examples/cli_demo.py -m qwen2_4b-ggml.bin -s 你是一个猫娘 -i
 ██████╗ ██╗    ██╗███████╗███╗   ██╗██████╗     ██████╗██████╗ ██████╗ 
██╔═══██╗██║    ██║██╔════╝████╗  ██║╚════██╗   ██╔════╝██╔══██╗██╔══██╗
██║   ██║██║ █╗ ██║█████╗  ██╔██╗ ██║ █████╔╝   ██║     ██████╔╝██████╔╝
██║▄▄ ██║██║███╗██║██╔══╝  ██║╚██╗██║██╔═══╝    ██║     ██╔═══╝ ██╔═══╝ 
╚██████╔╝╚███╔███╔╝███████╗██║ ╚████║███████╗██╗╚██████╗██║     ██║     
 ╚══▀▀═╝  ╚══╝╚══╝ ╚══════╝╚═╝  ╚═══╝╚══════╝╚═╝ ╚═════╝╚═╝     ╚═╝     
                                                                           

Welcome to Qwen.cpp! Ask whatever you want. Type 'clear' to clear context. Type 'stop' to exit.

System > 你是一个猫娘
Prompt > 你是谁
我是你们的朋友喵喵喵~

Web Demo

Launch a web demo to chat in your browser:

python examples/web_demo.py -m qwen2_1.8b-ggml.bin

[web_demo screenshot]

Web demo with system prompt setting:

python examples/web_demo2.py -m qwen2_1.8b-ggml.bin

[web_demo2 screenshot]

API Server

LangChain API

MODEL=./qwen2_1.8b-ggml.bin python -m uvicorn qwen_cpp.langchain_api:app --host 127.0.0.1 --port 8000

Test the API endpoint with curl:

curl http://127.0.0.1:8000 -H 'Content-Type: application/json' -d '{"prompt": "你好"}'

Run with LangChain:

python examples/langchain_client.py

OpenAI API

Start an API server compatible with OpenAI chat completions protocol:

MODEL=./qwen2_1.8b-ggml.bin python -m uvicorn qwen_cpp.openai_api:app --host 127.0.0.1 --port 8000

Test your endpoint with curl:

curl http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
    -d '{"messages": [{"role": "user", "content": "你好"}]}'

Use the OpenAI client to chat with your model:

>>> from openai import OpenAI
>>> client = OpenAI(base_url="http://127.0.0.1:8000/v1")
>>> response = client.chat.completions.create(model="default-model", messages=[{"role": "user", "content": "你好"}])
>>> response.choices[0].message.content
'你好!有什么我可以帮助你的吗?'

For stream response, check out the example client script:

OPENAI_BASE_URL=http://127.0.0.1:8000/v1 python examples/openai_client.py --stream --prompt 你想活出怎样的人生
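
For reference, here is a minimal streaming sketch using the official openai Python client against the same endpoint; the model name is a placeholder, as in the example above.

from openai import OpenAI

# point the client at the local server started above
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="default-model",  # placeholder model name, as above
    messages=[{"role": "user", "content": "你好"}],
    stream=True,
)
for chunk in stream:
    # each chunk carries an incremental piece of the assistant's reply
    print(chunk.choices[0].delta.content or "", end="", flush=True)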

With this API server as a backend, qwen2.cpp models can be seamlessly integrated into any frontend that speaks the OpenAI-style API, including mckaywrigley/chatbot-ui, fuergaosi233/wechat-chatgpt, Yidadaa/ChatGPT-Next-Web, and more.

tiktoken.cpp

We provide a pure C++ tiktoken implementation. After installation, usage is the same as OpenAI's tiktoken:

import tiktoken_cpp as tiktoken
enc = tiktoken.get_encoding("cl100k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"

The speed of tiktoken.cpp is on par with OpenAI's tiktoken. To benchmark:

cd tests
RAYON_NUM_THREADS=1 python benchmark.py
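
For a quick self-contained comparison, a rough timing sketch is below; it assumes both tiktoken and tiktoken_cpp are installed and expose the same get_encoding API, as shown above.

import time

import tiktoken
import tiktoken_cpp

text = "hello world, 你好世界 " * 500

for mod in (tiktoken, tiktoken_cpp):
    enc = mod.get_encoding("cl100k_base")
    start = time.perf_counter()
    for _ in range(100):
        enc.encode(text)
    print(f"{mod.__name__}: {time.perf_counter() - start:.3f}s")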

Model Quality

We measure model quality by evaluating the perplexity over the WikiText-2 test dataset, following the strided sliding window strategy in https://huggingface.co/docs/transformers/perplexity. Lower perplexity usually indicates a better model.
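
For intuition, here is a rough Python sketch of the strided sliding-window idea (illustrative only, not the C++ tool; presumably the -s and -l flags below correspond to the stride and window length):

import math

# Each window is scored with full left context, but only its final `stride`
# tokens contribute their negative log-likelihood, so every token past the
# first window's prefix is counted exactly once.
def strided_ppl(nll_per_token, window=2048, stride=512):
    total, count = 0.0, 0
    for start in range(0, len(nll_per_token) - window + 1, stride):
        tail = nll_per_token[start + window - stride : start + window]
        total += sum(tail)
        count += len(tail)
    return math.exp(total / count)

# hypothetical per-token NLLs standing in for real model outputs
print(strided_ppl([2.3] * 4096))  # exp(2.3) ≈ 9.97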

Download and unzip the dataset:

wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip
./build/bin/perplexity -m <model_path> -f wikitext-2-raw/wiki.test.raw -s 512 -l 2048

Development

Unit Test

Prepare the test data:

cd tests 
python test_convert.py

To perform unit tests, add the CMake flag -DQWEN_ENABLE_TESTING=ON, then recompile and run the unit tests (including benchmarks):

mkdir -p build && cd build
cmake .. -DQWEN_ENABLE_TESTING=ON && make -j
./bin/qwen_test

Lint

To format the code, run make lint inside the build folder. You should have clang-format, black and isort pre-installed.

TODO

  • Qwen1.5 32B
  • Qwen1.5-MoE-A2.7B: CPU only; GGML_MAX_SRC must be changed from 10 to 62 for proper operation.
  • CodeQwen
  • Sync ggml: the Metal and cuBLAS interfaces changed significantly in later versions, so we keep this ggml version for now.
  • Explore RAG.

Acknowledgements


