Llama Architecture: a stack of N Transformer decoder layers; each layer consists of Grouped-Query Attention (GQA), Rotary Position Embedding (RoPE), Residual Adds, Root Mean Square Layer Normalization (RMSNorm), and a Multi-Layer Perceptron (MLP).
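A minimal PyTorch sketch of one such decoder layer. All dimensions are illustrative, the KV cache is omitted for brevity, and this is a simplified sketch rather than the reference Llama implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square only: no mean-centering, unlike LayerNorm.
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def rope(x, base=10000.0):
    # Rotary Position Embedding: rotate channel pairs by a position-dependent angle.
    # x: (batch, n_heads, seq_len, head_dim)
    b, h, t, d = x.shape
    pos = torch.arange(t, dtype=torch.float32)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = torch.outer(pos, freqs)              # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

class DecoderLayer(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, hidden=1408):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        # GQA: fewer Key/Value heads than Query heads; each KV head is shared
        # by a group of Query heads, shrinking the KV cache.
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        self.attn_norm = RMSNorm(dim)
        self.mlp_norm = RMSNorm(dim)
        # SwiGLU-style MLP, as used in Llama.
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        h = self.attn_norm(x)
        q = self.wq(h).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(h).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Expand KV heads so each group of Q heads attends to its shared KV head.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        x = x + self.wo(attn)                                        # Residual Add
        h = self.mlp_norm(x)
        x = x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))   # Residual Add
        return x
```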
Prompt: the initial text or instruction given to the model.
Prompt Phase (Prefill Phase): the phase that processes the entire prompt in one parallel pass and generates the first token.
Generation Phase (Decoding Phase): generates the next token based on the prompt and the previously generated tokens, one token at a time; see the sketch after the KV Cache entry.
Autoregressive: predicting one token at a time, conditioned on the previously generated tokens.
KV (Key-Value) Cache: caching the attention Keys and Values during the Generation Phase, eliminating the recomputation of Keys and Values for previous tokens.
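A minimal, self-contained sketch of the prefill phase, autoregressive decoding, and the KV cache together. It uses a toy single-layer attention "model" with random, untrained weights and greedy decoding; all dimensions and the `step` helper are illustrative:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, n_heads = 100, 64, 4
head_dim = dim // n_heads
# Toy "model": token embeddings, attention projections, and an output head.
emb = torch.randn(vocab, dim)
wq = torch.randn(dim, dim) / dim**0.5
wk = torch.randn(dim, dim) / dim**0.5
wv = torch.randn(dim, dim) / dim**0.5
head = torch.randn(dim, vocab) / dim**0.5

def step(token_ids, kv_cache):
    """Run attention over the new tokens, reusing cached K/V of previous tokens."""
    x = emb[token_ids]                                   # (t_new, dim)
    q = (x @ wq).view(-1, n_heads, head_dim).transpose(0, 1)
    k = (x @ wk).view(-1, n_heads, head_dim).transpose(0, 1)
    v = (x @ wv).view(-1, n_heads, head_dim).transpose(0, 1)
    if kv_cache is not None:
        k = torch.cat([kv_cache[0], k], dim=1)           # prepend cached Keys
        v = torch.cat([kv_cache[1], v], dim=1)           # prepend cached Values
    out = F.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
    logits = out.transpose(0, 1).reshape(-1, dim) @ head
    return logits[-1], (k, v)                            # logits at the last position

prompt = torch.tensor([1, 5, 7, 42])

# Prompt (Prefill) Phase: process all prompt tokens in one parallel pass.
logits, kv_cache = step(prompt, None)
next_tok = logits.argmax().item()
generated = [next_tok]

# Generation (Decoding) Phase: autoregressive, one token per step. Only the new
# token is fed in; K/V of all previous tokens come from the cache instead of
# being recomputed.
for _ in range(5):
    logits, kv_cache = step(torch.tensor([next_tok]), kv_cache)
    next_tok = logits.argmax().item()
    generated.append(next_tok)
print(generated)
```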
Weight: a parameter of the model; the $w$ in $y = w \cdot x + b$.
Activation: the output of a neuron, computed by applying an activation function; the $z$ in $z = f(y)$, where $f$ is an activation function such as ReLU.
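To make the two terms concrete, a tiny PyTorch illustration (shapes are arbitrary):

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 2)    # layer.weight is the w, layer.bias is the b
x = torch.randn(1, 3)
y = layer(x)               # pre-activation: y = w @ x + b
z = torch.relu(y)          # activation: z = f(y), with f = ReLU
print(layer.weight.shape, y, z)
```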
GPU Kernel: a function executed across many GPU computing cores to perform parallel computations.
HBM (High Bandwidth Memory): a stacked, high-bandwidth memory technology that serves as the main memory of data-center GPUs.
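A minimal kernel sketch using Triton, which lets you write GPU kernels in Python; it assumes a CUDA-capable GPU and the `triton` package, and the block size of 1024 is an arbitrary choice:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized chunk; instances run in
    # parallel across the GPU's computing cores.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n                       # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)       # enough instances to cover the data
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```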
Continuous Batching: as opposed to static batching (which groups requests together and starts processing only when all requests in the batch are ready), admits requests into the running batch continuously, at iteration granularity, maximizing memory utilization.
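A toy, purely illustrative scheduler loop showing the idea; there is no real model here, and `remaining` stands in for the number of tokens each request still needs:

```python
from collections import deque

waiting = deque({"id": i, "remaining": n} for i, n in enumerate([3, 1, 4, 2]))
running, MAX_BATCH = [], 2

step = 0
while waiting or running:
    # Admit new requests whenever a slot frees up -- no waiting for the whole
    # batch to finish, unlike static batching.
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    # One decoding iteration: every running request generates one token.
    for req in running:
        req["remaining"] -= 1
    finished = [r for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    step += 1
    for r in finished:
        print(f"step {step}: request {r['id']} finished")
```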
Offloading: transferring data between GPU memory and main memory or NVMe storage, since GPU memory is limited.
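A minimal sketch, assuming a CUDA GPU: keep a large tensor (e.g., a KV cache block) in main memory and copy it to the GPU only when it is needed for compute:

```python
import torch

big = torch.randn(1024, 1024)                  # lives in main (CPU) memory
gpu_view = big.to("cuda", non_blocking=True)   # offload in: copy to GPU to compute
result = gpu_view @ gpu_view
big_out = result.to("cpu")                     # offload out: move result back
del gpu_view, result
torch.cuda.empty_cache()                       # release the freed GPU memory
```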
Post-Training Quantization (PTQ): quantizing the weights and activations of the model after it has been trained.
Quantization-Aware Training (QAT): incorporating quantization effects into the training process so the model learns to tolerate them.
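A sketch of both ideas, assuming symmetric absmax int8 quantization (one common scheme among many); the function names are illustrative:

```python
import torch

def quantize_int8(w):
    """Symmetric absmax quantization of a weight tensor to int8."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.float() * scale

# PTQ: applied to a finished model, no gradients involved.
w = torch.randn(4, 4)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())   # quantization error

# QAT-style "fake quantization": simulate the rounding in the forward pass while
# letting gradients flow through unchanged (straight-through estimator).
def fake_quant(w):
    q, scale = quantize_int8(w.detach())
    return w + (dequantize(q, scale) - w).detach()
```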