
ggllm.cpp's Introduction

ggllm.cpp is a ggml-based tool to run quantized Falcon models on CPU and GPU.

For detailed (growing) examples and help check the new Wiki:
https://github.com/cmp-nct/ggllm.cpp/wiki

Features that currently differentiate it from llama.cpp:

  • Support for Falcon 7B and 40B models (inference, quantization and perplexity tool)
  • Fully automated GPU offloading based on available and total VRAM
  • Higher efficiency in VRAM usage when using batched processing (more layers being offloaded)
  • 16 bit cuBLAS support (takes half the VRAM for those operations)
  • Improved loading screen and visualization
  • New tokenizer with regex emulation and BPE merge support
  • Finetune auto-detection and integrated syntax support (just load OpenAssistant 7B/40B and add -ins for a chat, or use -enc -p "Question" with an optional -sys "System prompt")
  • Stopwords support (-S)
  • Optimized RAM and VRAM calculation with batch processing support up to 8k
  • More command line parameter options (like disabling GPUs)
  • Current Falcon inference speed on a consumer GPU: up to 54+ tokens/sec for 7B at 4-5 bit and 18-25 tokens/sec for 40B at 3-6 bit, dropping to roughly 38 tokens/sec and 16 tokens/sec at 1000 tokens generated

What is missing/being worked on:

  • Full GPU offloading of Falcon
  • Optimized quantization versions for Falcon
  • A new instruct mode
  • Large context support (4k-64k in the works)

Old model support
If you use GGML type models (file versions 1-4), you need to place tokenizer.json into the model directory (example: https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226/blob/main/tokenizer.json).
Updated model binaries are file version 10+ and called "GGCC"; those do not need to load and convert that JSON file.
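
As a small illustration of the rule above, the following sketch checks whether tokenizer.json sits next to a model file before you launch anything. The helper name check_tokenizer and the example path are hypothetical and not part of ggllm.cpp; the check is only relevant for old GGML (v1-4) files.

```shell
#!/usr/bin/env bash
# Sketch: report whether tokenizer.json sits in the same directory as a model file.
# check_tokenizer is a hypothetical helper, not something shipped with ggllm.cpp.
check_tokenizer() {
  local model_path="$1"
  local model_dir
  model_dir="$(dirname -- "$model_path")"
  if [ -f "$model_dir/tokenizer.json" ]; then
    echo "ok: $model_dir/tokenizer.json found"
  else
    echo "missing: place tokenizer.json into $model_dir (needed for GGML v1-4 files)"
    return 1
  fi
}
```

Usage, with a hypothetical path: check_tokenizer ./models/falcon-40b/ggml-model.bin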

How to just run it?

  1. In most cases you will want to choose a good instruct model; currently the best tunes are from OpenAssistant.
  2. Falcon 40B is great even at Q2_K (2 bit) quantization, very good multilingual and reasoning quality.
  3. After downloading (and/or converting/quantizing) your model you launch falcon_main with -enc -p "Your question" or with -ins for multiple questions
  4. From there on you can dive into more options, there is a lot to change and optimize.
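
The launch step above can be sketched as a small wrapper. The flags (-m, -enc, -p, -sys, -ins) are the ones named in this README; the binary location and model path are assumptions to adjust for your setup, and the helper names ask_args and chat_args are hypothetical. The script only prints the command, so it is safe to run without a model.

```shell
#!/usr/bin/env bash
# Sketch only: FALCON_MAIN and MODEL are hypothetical paths, adjust them.
FALCON_MAIN="${FALCON_MAIN:-./bin/falcon_main}"
MODEL="${MODEL:-./models/falcon-40b-sft-mix-1226.bin}"

# Build the argument list for a single question (-enc -p),
# with an optional system prompt (-sys) as the second parameter.
ask_args() {
  local question="$1" system_prompt="${2:-}"
  ARGS=(-m "$MODEL" -enc -p "$question")
  if [ -n "$system_prompt" ]; then
    ARGS+=(-sys "$system_prompt")
  fi
}

# Build the argument list for an interactive session with multiple questions (-ins).
chat_args() {
  ARGS=(-m "$MODEL" -ins)
}

ask_args "What is a falcon?" "You are a concise assistant."
# Dry run: show the command (quoting is lost in this display).
echo "$FALCON_MAIN ${ARGS[*]}"
# To actually run it: "$FALCON_MAIN" "${ARGS[@]}"
```

Keeping the arguments in an array preserves prompts that contain spaces or quotes when the command is eventually executed.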

TheBloke provides fine-tuned weights in GGCC v10 with various quantization options:
https://huggingface.co/TheBloke/falcon-40b-sft-mix-1226-GGML (OpenAssistant 40B)
https://huggingface.co/TheBloke/falcon-40b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-GGML
https://huggingface.co/TheBloke/falcon-7b-instruct-GGML
https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-7B-GGML

The official HF models are here:
https://huggingface.co/tiiuae/falcon-40b/
https://huggingface.co/tiiuae/falcon-7b/
https://huggingface.co/tiiuae/falcon-40b-instruct
https://huggingface.co/tiiuae/falcon-7b-instruct

The OpenAssistant models are here:
https://huggingface.co/OpenAssistant
https://huggingface.co/OpenAssistant/falcon-7b-sft-mix-2000
https://huggingface.co/OpenAssistant/falcon-40b-sft-mix-1226
Download the 7B or 40B Falcon version, use falcon_convert.py (latest version) in 32 bit mode, then use falcon_quantize to convert it to GGCC v10.

Prompting finetuned models right: cmp-nct#36

Conversion of HF models and quantization:

  1. Use falcon_convert.py to produce a GGML v1 binary from HF. Using this binary directly is not recommended.
  2. Use examples/falcon_quantize to convert it into a memory-aligned GGCC v10 binary with the quantization of your choice; mmap is supported from there on.
    Note: the Falcon 7B model features tensor sizes which are not yet supported by K-type quantizers; use the traditional quantization types for it.

Status/Bugs:

  • nothing major

Windows application binary download
It's always recommended to compile ggllm.cpp fresh yourself so you benefit from the latest features.
However, full binary builds are published at regular intervals: https://github.com/cmp-nct/ggllm.cpp/releases/
The "master" files contain the executables.

How to compile ggllm.cpp:

  1. Recommended with cmake (set -DLLAMA_CUBLAS=0 to build without CUDA requirements and support):
git clone https://github.com/cmp-nct/ggllm.cpp
cd ggllm.cpp
rm -rf build; mkdir build; cd build
# if you do not have cuda in path:
export PATH="/usr/local/cuda/bin:$PATH"
# in case of problems, this sometimes helped
#export CPATH="/usr/local/cuda/targets/x86_64-linux/include:"
#export LD_LIBRARY_PATH="/usr/local/cuda/lib64:"
cmake -DLLAMA_CUBLAS=1 -DCUDAToolkit_ROOT=/usr/local/cuda/ ..  
cmake --build . --config Release
# find the binaries in ./bin
# falcon_main, falcon_quantize, falcon_perplexity
  2. Building with make (fallback):
export LLAMA_CUBLAS=1;
# if you do not have "nvcc" in your path:
# export PATH="/usr/local/cuda/bin:$PATH"
make falcon_main falcon_quantize falcon_perplexity

Windows and Demos
Note: these tutorials predate the latest performance patches.
Video tutorial for Windows compilation without WSL:
https://www.youtube.com/watch?v=BALw669Qeyw
Another demo of Falcon 40B at 5 bit quantization:
https://www.youtube.com/watch?v=YuTMFL1dKgQ&ab_channel=CmpNct
In the video the speed starts at 35 tokens/sec and gradually drops as the context grows; that slowdown has been solved in the meantime.

  1. Installing on WSL (Windows Subsystem for Linux)
# Use --no-mmap in WSL OR copy the model into a native directory (not /mnt/) or it will get stuck loading (thanks @nauful)
# Choose a current distro:
wsl.exe --list --online
wsl --install -d distro
# cmake 3.16 and the CUDA toolset are required
# If you run an old distro you can upgrade (like apt update; apt upgrade; apt full-upgrade; pico /etc/apt/sources.list; apt update; apt upgrade; apt full-upgrade; apt autoremove; lsb_release -a); then wsl --shutdown and restart it
# install cuda WSL toolkit
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update; apt-get -y install cuda
# you might need to add it to your path:
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda/bin:$PATH"
# now start with a fresh cmake and all should work 
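
Two of the notes above lend themselves to a quick preflight check: models under /mnt/ need --no-mmap in WSL, and cmake must be at least 3.16. The following is a minimal sketch, assuming a sort that supports -V (GNU coreutils does); the helper names mmap_flag_for and version_ge are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: preflight helpers for WSL. Both function names are hypothetical.

# Print --no-mmap when the model path is on a Windows mount (/mnt/...),
# where loading with mmap can get stuck; print nothing otherwise.
mmap_flag_for() {
  case "$1" in
    /mnt/*) echo "--no-mmap" ;;
    *)      echo "" ;;
  esac
}

# Succeed if version $1 is greater than or equal to version $2.
# Assumes sort supports -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Report on the installed cmake, if any.
if command -v cmake >/dev/null 2>&1; then
  cmake_version="$(cmake --version | awk 'NR==1 { print $3 }')"
  if version_ge "$cmake_version" 3.16; then
    echo "cmake $cmake_version is new enough"
  else
    echo "cmake $cmake_version is older than 3.16"
  fi
fi
```

The output of mmap_flag_for can be appended to a falcon_main command line; it is empty for models stored in a native Linux directory.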

CUDA Optimizing inference speed

  • Thread count will be optimal between 1 and 8. Start with -t 2
  • For huge prompts, a larger n_batch can speed up processing 10-20 times, but it requires an additional 500-1700 MB of VRAM. That's -b 512
  • Multi-GPU systems can benefit from single-GPU processing when the model is small enough. That's --override-max-gpu 1
  • Multi-GPU systems with different GPUs benefit from custom tensor splitting to load one GPU more heavily. To put more load on the 2nd GPU: --tensor-split 1,3 -mg 1
  • Need to squeeze a model into VRAM but 1-2 layers don't fit? Try --gpu-reserve-mb-main 1 to reduce reserved VRAM to 1 MB; you can use negative numbers to force VRAM swapping
  • Wish to reduce VRAM usage and offload fewer layers? Use -ngl 10 to only load 10 layers
  • Want to dive into details? Use --debug-timings <1,2,3> to get detailed statistics on the performance of each operation, how and where it was performed, and its total impact
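
The options above can be combined on one command line. The sketch below assembles a few of them from environment variables; the variable names and the helper tuning_args are hypothetical, while the flags themselves (-t, -b, -ngl, --override-max-gpu) are the ones listed above. It only builds the argument list and does not start falcon_main.

```shell
#!/usr/bin/env bash
# Sketch: assemble tuning flags from the list above. All variable names are hypothetical.
THREADS="${THREADS:-2}"        # -t: optimal is between 1 and 8, start with 2
BATCH="${BATCH:-}"             # -b: e.g. 512 for huge prompts (costs extra VRAM)
GPU_LAYERS="${GPU_LAYERS:-}"   # -ngl: number of layers to offload
SINGLE_GPU="${SINGLE_GPU:-0}"  # 1 adds --override-max-gpu 1

tuning_args() {
  TUNE=(-t "$THREADS")
  if [ -n "$BATCH" ]; then
    TUNE+=(-b "$BATCH")
  fi
  if [ -n "$GPU_LAYERS" ]; then
    TUNE+=(-ngl "$GPU_LAYERS")
  fi
  if [ "$SINGLE_GPU" = "1" ]; then
    TUNE+=(--override-max-gpu 1)
  fi
}

tuning_args
echo "tuning flags: ${TUNE[*]}"
```

The resulting array can be appended to a launch command as "${TUNE[@]}". Which combination is fastest depends on the hardware, so treat the defaults as starting points only.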

Inference speed
Only some tensors are currently GPU-supported, and only the mul_mat operation is supported.
Some of the examples below require two GPUs to run at the given speed; the settings were tailored for one environment, and a different GPU/CPU/DDR setup might require adaptations.

The examples below are a bit outdated; models are faster now.

Falcon 40B 6 bit K-type quantization:

falcon_main.exe -m Q:\models\falcon-40b-instruct\q6_k -n 512 -n 32 --debug-timings 0 -b 1 --ignore-eos -p "I am" # -ts 2,1
...
falcon_print_timings:        load time = 11554.93 ms
falcon_print_timings:      sample time =     7.54 ms /    32 runs   (    0.24 ms per token,  4244.59 tokens per second)
falcon_print_timings:        eval time =  1968.34 ms /    33 runs   (   59.65 ms per token,    16.77 tokens per second)
falcon_print_timings:       total time =  1980.28 ms

Falcon 40B 4 bit K-type quantization:

falcon_main.exe -m Q:\models\falcon-40b\q4_k -n 512 -n 128 --debug-timings 0 -b 1 --ignore-eos -p "I am" # -ts 2,1 # --override-max-gpu 1 --gpu-reserve-mb-main -500
...
falcon_print_timings:        load time =  8749.56 ms
falcon_print_timings:      sample time =    29.47 ms /   128 runs   (    0.23 ms per token,  4342.81 tokens per second)
falcon_print_timings:        eval time =  7046.11 ms /   129 runs   (   54.62 ms per token,    18.31 tokens per second)
falcon_print_timings:       total time =  7095.81 ms

Falcon 7B 8 bit quantization:

falcon_main.exe -m Q:\models\falcon-7b-instruct\q8_0 -n 512 -n 32 --debug-timings 0 -b 1 --ignore-eos --override-max-gpu 1 -p "I am"
...
falcon_print_timings:        load time =  2539.21 ms
falcon_print_timings:      sample time =     7.65 ms /    32 runs   (    0.24 ms per token,  4181.91 tokens per second)
falcon_print_timings:        eval time =   758.21 ms /    33 runs   (   22.98 ms per token,    43.52 tokens per second)
falcon_print_timings:       total time =   770.52 ms

Falcon 7B 4 bit quantization (large generation):

falcon_main.exe -t 2 -m Q:\models\falcon-7b\q4_1 -n 512 --debug-timings 0 -b 1 --ignore-eos --override-max-gpu 1 -p "I am"
...
falcon_print_timings:        load time =  2442.76 ms
falcon_print_timings:      sample time =   118.56 ms /   512 runs   (    0.23 ms per token,  4318.34 tokens per second)
falcon_print_timings:        eval time = 16719.48 ms /   769 runs   (   21.74 ms per token,    45.99 tokens per second)
falcon_print_timings:       total time = 16930.51 ms
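
The tokens-per-second figures in the logs above follow directly from the eval time and the run count (runs × 1000 / eval time in ms). A minimal sketch that recomputes the figure from an "eval time" line is shown below; eval_tps is a hypothetical helper that relies on the whitespace-separated field layout seen in these logs. If the timings are written to stderr, as llama.cpp does, redirect with 2>&1 before piping.

```shell
#!/usr/bin/env bash
# Sketch: recompute tokens per second from a falcon_print_timings "eval time" line.
# eval_tps is a hypothetical helper; it reads falcon_main output on stdin.
# Field 5 is the eval time in ms and field 8 is the number of runs.
eval_tps() {
  awk '/eval time/ { printf "%.2f\n", $8 * 1000 / $5 }'
}

# Example with the Falcon 40B 6 bit log line from above.
echo 'falcon_print_timings:        eval time =  1968.34 ms /    33 runs   (   59.65 ms per token,    16.77 tokens per second)' | eval_tps
# prints 16.77
```

This makes it easy to compare runs with different settings by piping each run's output through the same helper.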

CUDA sidenote:

  1. Try to use fewer threads than you have physical processor cores

