LLM Inference

Table of Contents

  • Glossary and Illustration
  • Open Source Software
  • Paper List

Glossary and Illustration

  • Llama Architecture: a stack of N Transformer decoder layers; each layer consists of Grouped-Query Attention (GQA), Rotary Position Embedding (RoPE), Residual Add, Root Mean Square Layer Normalization (RMSNorm), and a Multi-Layer Perceptron (MLP).

[Figure: Llama 2 decoder architecture]
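Of the decoder components listed above, RMSNorm is compact enough to sketch directly. A minimal NumPy version follows; the shapes and values are illustrative assumptions, not taken from any particular checkpoint:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Root Mean Square Layer Normalization (RMSNorm), as used in Llama.

    Unlike LayerNorm, RMSNorm skips mean-centering and divides only by the
    root-mean-square of the activations, then applies a learned per-channel
    scale (`weight`).
    """
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

# Toy input: one token's hidden state of dimension 4 (an assumption).
hidden = np.array([[1.0, -2.0, 3.0, -4.0]])
gain = np.ones(4)  # learned scale, initialized to 1 here for illustration
out = rms_norm(hidden, gain)
```

After normalization the output's root-mean-square is (up to `eps`) exactly 1, regardless of the input's magnitude, which is what makes the layer cheap and stable.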

  • Prompt: the initial text or instruction given to the model.
  • Prompt Phase (Prefill Phase): the phase that generates the first token by processing the entire prompt.
  • Generation Phase (Decoding Phase): generates the next token based on the prompt and the previously generated tokens, in a token-by-token manner.
  • Autoregressive: predicting one token at a time, conditioned on the previously generated tokens.
  • KV (Key-Value) Cache: caching the attention Keys and Values during the Generation Phase, eliminating recomputation of the Keys and Values of previous tokens.
  • Weight: a learned parameter of the model, the $w$ in $y = w \cdot x + b$.
  • Activation: the output of a neuron, computed by an activation function: the $z$ in $z = f(y)$, where $f$ is an activation function such as ReLU.
  • GPU Kernel: a function executed across many GPU computing cores to perform parallel computations.
  • HBM (High Bandwidth Memory): an advanced memory technology that serves as the main memory of data-center GPUs.
  • Continuous Batching: as opposed to static batching (which groups requests together and starts processing only when all requests in the batch are ready), admits new requests into the running batch as soon as slots free up, maximizing memory utilization.
  • Offloading: transferring data between GPU memory and main memory or NVMe storage, since GPU memory is limited.
  • Post-Training Quantization: quantizing the weights and activations of the model after the model has been trained.
  • Quantization-Aware Training: incorporating quantization considerations during training.
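To make the Generation Phase and the KV cache concrete, here is a minimal single-head attention decode loop in NumPy. The head dimension, random weights, and random token embeddings are illustrative assumptions, not any real model; the point is that each step computes Keys and Values only for the new token and appends them to the cache:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension (an assumption for illustration)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query token."""
    scores = K @ q / np.sqrt(d)          # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # weighted sum of cached Values

# Generation phase with a KV cache: the Keys/Values of previous tokens
# are never recomputed, only read back from the cache.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(4):
    x = rng.standard_normal(d)           # embedding of the current token
    q, k, v = x @ Wq, x @ Wk, x @ Wv     # projections for the NEW token only
    K_cache = np.vstack([K_cache, k])    # append to the cache
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)    # attend over all cached tokens
```

Without the cache, every step would re-project all previous tokens through `Wk` and `Wv`, turning each decode step from O(1) new projections into O(t).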
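Post-training quantization at its simplest can be sketched with a symmetric per-tensor int8 scheme. This is a toy round-to-nearest illustration, not the LLM.int8() method from the paper list below:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization of a weight tensor to int8.

    The scale is chosen per-tensor so the largest-magnitude weight maps
    to 127; real schemes often quantize per-channel or per-group.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

# Toy weights (an assumption for illustration).
w = np.array([0.05, -0.42, 0.31, 1.27], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

The round-trip error is bounded by half the scale per weight; quantization-aware training, by contrast, simulates this rounding during training so the model learns to tolerate it.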

Open Source Software

| Name | Hardware | Org |
| --- | --- | --- |
| Transformers | CPU / NVIDIA GPU / TPU / AMD GPU | Hugging Face |
| Text Generation Inference | CPU / NVIDIA GPU / AMD GPU | Hugging Face |
| gpt-fast | CPU / NVIDIA GPU / AMD GPU | PyTorch |
| TensorRT-LLM | NVIDIA GPU | NVIDIA |
| vLLM | NVIDIA GPU | UC Berkeley |
| llama.cpp / ggml | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | ggml |
| ctransformers | CPU / Apple Silicon / NVIDIA GPU / AMD GPU | Ravindra Marella |
| DeepSpeed | CPU / NVIDIA GPU | Microsoft |
| FastChat | CPU / NVIDIA GPU / Apple Silicon | lmsys.org |
| MLC-LLM | CPU / NVIDIA GPU | MLC |
| LightLLM | CPU / NVIDIA GPU | SenseTime |
| LMDeploy | CPU / NVIDIA GPU | Shanghai AI Lab & SenseTime |
| OpenLLM | CPU / NVIDIA GPU / AMD GPU | BentoML |
| OpenPPL.nn / OpenPPL.nn.llm | CPU / NVIDIA GPU | OpenMMLab & SenseTime |
| ScaleLLM | NVIDIA GPU | Vectorch |
| RayLLM | CPU / NVIDIA GPU / AMD GPU | Anyscale |
| Xorbits Inference | CPU / NVIDIA GPU / AMD GPU | Xorbits |

Paper List

| Name | Paper Title | Paper Link | Artifact | Keywords | Recommend |
| --- | --- | --- | --- | --- | --- |
| LLaMA | LLaMA: Open and Efficient Foundation Language Models | arXiv 23 | Code | Pre-training | ⭐️⭐️⭐️⭐️⭐️ |
| Llama 2 | Llama 2: Open Foundation and Fine-Tuned Chat Models | arXiv 23 | Model | Pre-training / Fine-tuning / Safety | ⭐️⭐️⭐️⭐️ |
| Multi-Query | Fast Transformer Decoding: One Write-Head is All You Need | arXiv 19 | | Architecture | ⭐️⭐️⭐️ |
| Grouped-Query | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | arXiv 23 | | Architecture | ⭐️⭐️⭐️ |
| RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding | arXiv 21 | | Position Encoding | ⭐️⭐️⭐️⭐️ |
| Megatron-LM | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | SC 21 | Code | Tensor Parallel / Pipeline Parallel | ⭐️⭐️⭐️⭐️⭐️ |
| Alpa | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | OSDI 22 | Code | Automatic Parallel | ⭐️⭐️⭐️ |
| GPipe | GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | NeurIPS 19 | | Pipeline Parallel | ⭐️⭐️⭐️ |
| Google's Practice | Efficiently Scaling Transformer Inference | MLSys 23 | | Parallelism | ⭐️⭐️⭐️⭐️ |
| FlashAttention | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | NeurIPS 22 | Code | Efficient Attention / GPU | ⭐️⭐️⭐️⭐️⭐️ |
| Orca | Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 22 | Code | Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
| PagedAttention | Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 23 | Code | Efficient Attention / Continuous Batching | ⭐️⭐️⭐️⭐️⭐️ |
| FlexGen | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | ICML 23 | Code | Offloading | ⭐️⭐️⭐️ |
| Speculative Decoding | Fast Inference from Transformers via Speculative Decoding | ICML 23 | | Speculative Decoding | ⭐️⭐️⭐️⭐️ |
| LLM.int8() | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale | NeurIPS 22 | Code | Quantization | ⭐️⭐️⭐️⭐️ |

Contributors

  luweizheng, hainaweiben, iirissm
