
Comments (2)

sgsdxzy commented on June 16, 2024

Please refer to https://github.com/PygmalionAI/aphrodite-engine/wiki/8.-Quantization#gguf-dev-branch for how to use Llama 3 GGUF models (the Llama 3 tokenizer isn't a LlamaTokenizer).
The performance issue, on the other hand, can't be solved easily, because most quants don't have an FP32 kernel.


Nero10578 commented on June 16, 2024

> Please refer to https://github.com/PygmalionAI/aphrodite-engine/wiki/8.-Quantization#gguf-dev-branch for how to use Llama 3 GGUF models (the Llama 3 tokenizer isn't a LlamaTokenizer).
> The performance issue, on the other hand, can't be solved easily, because most quants don't have an FP32 kernel.

Thanks for replying. I definitely missed that part because I was banging my head trying different things to make this not run so slowly. Does the GGUF kernel run in FP32? One does exist in llama.cpp, but as I understand it, aphrodite converts GGUF to safetensors first anyway?

EDIT: OK, so I finally figured it out. For anyone trying to run aphrodite on Pascal non-GP100 GPUs, here is how:

  1. Create a Miniconda environment and activate it: conda create -n aphrodite python=3.11, then conda activate aphrodite
  2. Install CUDA: conda install -y -c "nvidia/label/cuda-12.1.1" cuda
  3. Install PyTorch: pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
  4. Install aphrodite from the root of the cloned aphrodite-engine repository: pip install -e .
  5. Run aphrodite with either exl2 (slow as balls) or GGUF (fast)
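
Before going through the steps above, it can help to confirm the card really is a Pascal non-GP100 part. This is only a rough sketch and not something from aphrodite itself: it assumes that desktop Pascal GPUs other than GP100 report CUDA compute capability 6.1 (GP100 reports 6.0), and the helper names `is_pascal_non_gp100` and `detect_capability` are made up for illustration. It uses PyTorch's `torch.cuda.get_device_capability` if PyTorch is installed, and otherwise does nothing.

```python
from typing import Optional, Tuple


def is_pascal_non_gp100(capability: Tuple[int, int]) -> bool:
    """Return True for compute capability 6.1 (assumed: desktop Pascal except GP100).

    GP100 reports (6, 0), so it is deliberately excluded here.
    """
    return tuple(capability) == (6, 1)


def detect_capability(device_index: int = 0) -> Optional[Tuple[int, int]]:
    """Return (major, minor) for the given CUDA device, or None if unavailable."""
    try:
        import torch  # imported lazily so the sketch still runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device visible (or PyTorch is not installed).")
    elif is_pascal_non_gp100(cap):
        print(f"Compute capability {cap}: Pascal non-GP100, the steps above apply.")
    else:
        print(f"Compute capability {cap}: not a Pascal non-GP100 card.")
```

Run it inside the aphrodite conda environment after step 3, so that the PyTorch build it checks is the one aphrodite will use.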

On Llama 3 8B Q4_K_M, I can get about 40 t/s on a single request on a GTX Titan X Pascal 12GB, but performance tanks as soon as there are more than 4 parallel requests. The GPU core seems to be fully utilized at 100% and is choking on matrix multiplications without the help of Tensor cores. It's OK, but definitely not usable for multiple parallel requests.

Completed 16 prompts and produced 967 tokens in 30.533 seconds.
Average TPS across all 1 threads: 31.7 - Individual Threads: Min TPS: 31.7, Max TPS: 31.7

Completed 16 prompts and produced 4361 tokens in 61.385 seconds.
Average TPS across all 4 threads: 71.0 - Individual Threads: Min TPS: 15.5, Max TPS: 18.9

Completed 16 prompts and produced 9541 tokens in 484.179 seconds.
Average TPS across all 8 threads: 19.7 - Individual Threads: Min TPS: 2.2, Max TPS: 2.7

Completed 16 prompts and produced 14610 tokens in 544.334 seconds.
Average TPS across all 12 threads: 26.8 - Individual Threads: Min TPS: 1.9, Max TPS: 2.7
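
For anyone comparing their own runs, the "Average TPS" figure in these logs lines up with total tokens divided by wall-clock seconds. Here is a small sketch that recomputes it from the log text pasted above. The regex and the `summarize` helper are my own, not part of the benchmark tool, and the per-thread value is just the aggregate divided by the thread count, which is an approximation of what the tool reports as Min/Max.

```python
import re

# Log text copied from the benchmark runs above.
LOG = """\
Completed 16 prompts and produced 967 tokens in 30.533 seconds.
Average TPS across all 1 threads: 31.7 - Individual Threads: Min TPS: 31.7, Max TPS: 31.7

Completed 16 prompts and produced 4361 tokens in 61.385 seconds.
Average TPS across all 4 threads: 71.0 - Individual Threads: Min TPS: 15.5, Max TPS: 18.9

Completed 16 prompts and produced 9541 tokens in 484.179 seconds.
Average TPS across all 8 threads: 19.7 - Individual Threads: Min TPS: 2.2, Max TPS: 2.7

Completed 16 prompts and produced 14610 tokens in 544.334 seconds.
Average TPS across all 12 threads: 26.8 - Individual Threads: Min TPS: 1.9, Max TPS: 2.7
"""

# Matches one two-line run summary and captures the numbers we need.
PATTERN = re.compile(
    r"Completed (\d+) prompts and produced (\d+) tokens in ([\d.]+) seconds\.\s*"
    r"Average TPS across all (\d+) threads: ([\d.]+)"
)


def summarize(log: str) -> list:
    """Recompute aggregate and approximate per-thread TPS for each run."""
    rows = []
    for match in PATTERN.finditer(log):
        _prompts, tokens, seconds, threads, reported = match.groups()
        aggregate = int(tokens) / float(seconds)
        rows.append({
            "threads": int(threads),
            "reported_tps": float(reported),
            "aggregate_tps": aggregate,
            "per_thread_tps": aggregate / int(threads),
        })
    return rows


if __name__ == "__main__":
    for row in summarize(LOG):
        print(
            f"{row['threads']:>2} threads: "
            f"aggregate {row['aggregate_tps']:.1f} t/s, "
            f"~{row['per_thread_tps']:.1f} t/s per thread"
        )
```

Aggregate throughput peaks at 4 threads and then falls below even the single-request figure at 8 and 12 threads, which matches the behaviour described above.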
