Llama Architecture: a stack of N Transformer decoder layers; each layer consists of Grouped-Query Attention (GQA), Rotary Position Embedding (RoPE), Residual Adds, Root Mean Square Layer Normalization (RMSNorm), and a Multi-Layer Perceptron (MLP).
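A minimal PyTorch sketch of one such decoder layer. All dimensions are illustrative, the KV cache is omitted for brevity, and this is a simplified sketch rather than the reference Llama implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square only: no mean-centering, unlike LayerNorm.
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def rope(x, base=10000.0):
    # Rotary Position Embedding: rotate channel pairs by a position-dependent angle.
    # x: (batch, n_heads, seq_len, head_dim)
    b, h, t, d = x.shape
    pos = torch.arange(t, dtype=torch.float32)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.outer(pos, freqs)              # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class DecoderLayer(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, hidden=1408):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        # GQA: fewer Key/Value heads than Query heads; each KV head is shared
        # by a group of Query heads, shrinking the KV cache.
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        self.attn_norm = RMSNorm(dim)
        self.mlp_norm = RMSNorm(dim)
        # SwiGLU-style MLP, as used in Llama.
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.wq(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Expand KV heads so each group of Q heads attends to its shared KV head.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        x = x + self.wo(attn)                                        # Residual Add
        h = self.mlp_norm(x)
        x = x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))   # Residual Add
        return x
```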
Prompt: the initial text or instruction given to the model.
Prompt Phase (Prefill Phase): the phase that processes the entire prompt in one parallel pass and generates the first token.
Generation Phase (Decoding Phase): generates the next token based on the prompt and the previously generated tokens, one token at a time; see the sketch after the KV Cache entry.
Autoregressive: predicting one token at a time, conditioned on the previously generated tokens.
KV (Key-Value) Cache: caching the attention Keys and Values during the Generation Phase, eliminating the recomputation of Keys and Values for previous tokens.
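A minimal, self-contained sketch of the prefill phase, autoregressive decoding, and the KV cache together. It uses a toy single-layer attention "model" with random, untrained weights and greedy decoding; all dimensions and the `step` helper are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, n_heads = 100, 64, 4
head_dim = dim // n_heads
# Toy "model": token embeddings, attention projections, and an output head.
emb = torch.randn(vocab, dim)
wq = torch.randn(dim, dim) / dim**0.5
wk = torch.randn(dim, dim) / dim**0.5
wv = torch.randn(dim, dim) / dim**0.5
head = torch.randn(dim, vocab) / dim**0.5

def step(token_ids, kv_cache):
    """Run attention over the new tokens, reusing cached K/V of previous tokens."""
    x = emb[token_ids]                                   # (t_new, dim)
    q = (x @ wq).view(-1, n_heads, head_dim).transpose(0, 1)
    k = (x @ wk).view(-1, n_heads, head_dim).transpose(0, 1)
    v = (x @ wv).view(-1, n_heads, head_dim).transpose(0, 1)
    if kv_cache is not None:
        k = torch.cat([kv_cache[0], k], dim=1)           # prepend cached Keys
        v = torch.cat([kv_cache[1], v], dim=1)           # prepend cached Values
    out = F.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
    logits = out.transpose(0, 1).reshape(-1, dim) @ head
    return logits[-1], (k, v)                            # logits at the last position

prompt = torch.tensor([1, 5, 7, 42])

# Prompt (Prefill) Phase: process all prompt tokens in one parallel pass.
logits, kv_cache = step(prompt, None)
next_tok = logits.argmax().item()
generated = [next_tok]

# Generation (Decoding) Phase: autoregressive, one token per step. Only the new
# token is fed in; K/V of all previous tokens come from the cache instead of
# being recomputed.
for _ in range(5):
    logits, kv_cache = step(torch.tensor([next_tok]), kv_cache)
    next_tok = logits.argmax().item()
    generated.append(next_tok)
print(generated)
```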
Weight: a parameter of the model; the $w$ in $y = w \cdot x + b$.
Activation: the output of a neuron, computed by applying an activation function; the $z$ in $z = f(y)$, where $f$ is an activation function such as ReLU.
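To make the two terms concrete, a tiny PyTorch illustration (shapes are arbitrary):

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)    # layer.weight is the w, layer.bias is the b
x = torch.randn(1, 3)
y = layer(x)               # pre-activation: y = w @ x + b
z = torch.relu(y)          # activation: z = f(y), with f = ReLU
print(layer.weight.shape, y, z)
```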
GPU Kernel: a function executed across many GPU computing cores to perform parallel computations.
HBM (High Bandwidth Memory): a stacked, high-bandwidth memory technology that serves as the main memory of data-center GPUs.
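A minimal kernel sketch using Triton, which lets you write GPU kernels in Python; it assumes a CUDA-capable GPU and the `triton` package, and the block size of 1024 is an arbitrary choice:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized chunk; instances run in
    # parallel across the GPU's computing cores.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n                       # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)       # enough instances to cover the data
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```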
Continuous Batching: as opposed to static batching (which groups requests together and starts processing only when all requests in the batch are ready), admits requests into the running batch continuously, at iteration granularity, maximizing memory utilization.
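A toy, purely illustrative scheduler loop showing the idea; there is no real model here, and `remaining` stands in for the number of tokens each request still needs:

```python
from collections import deque

waiting = deque({"id": i, "remaining": n} for i, n in enumerate([3, 1, 4, 2]))
running, MAX_BATCH = [], 2

step = 0
while waiting or running:
    # Admit new requests whenever a slot frees up -- no waiting for the whole
    # batch to finish, unlike static batching.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    # One decoding iteration: every running request generates one token.
    for req in running:
        req["remaining"] -= 1
    finished = [r for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    step += 1
    for r in finished:
        print(f"step {step}: request {r['id']} finished")
```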
Offloading: transferring data between GPU memory and main memory or NVMe storage, since GPU memory is limited.
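A minimal sketch, assuming a CUDA GPU: keep a large tensor (e.g., a KV cache block) in main memory and copy it to the GPU only when it is needed for compute:

```python
import torch

big = torch.randn(1024, 1024)                  # lives in main (CPU) memory
gpu_view = big.to("cuda", non_blocking=True)   # offload in: copy to GPU to compute
result = gpu_view @ gpu_view
big_out = result.to("cpu")                     # offload out: move result back
del gpu_view, result
torch.cuda.empty_cache()                       # release the freed GPU memory
```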
Post-Training Quantization (PTQ): quantizing the weights and activations of the model after it has been trained.
Quantization-Aware Training (QAT): incorporating quantization effects into the training process so the model learns to tolerate them.
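A sketch of both ideas, assuming symmetric absmax int8 quantization (one common scheme among many); the function names are illustrative:

```python
import torch

def quantize_int8(w):
    """Symmetric absmax quantization of a weight tensor to int8."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

# PTQ: applied to a finished model, no gradients involved.
w = torch.randn(4, 4)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())   # quantization error

# QAT-style "fake quantization": simulate the rounding in the forward pass while
# letting gradients flow through unchanged (straight-through estimator).
def fake_quant(w):
    q, scale = quantize_int8(w.detach())
    return w + (dequantize(q, scale) - w).detach()
```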