cli99 / llm-analysis Goto Github PK

View Code? Open in Web Editor NEW

313.0 8.0 36.0 1.45 MB

Latency and Memory Analysis of Transformer Models for Training and Inference

License: Apache License 2.0

Python 99.44% Shell 0.56%

analysis deep-learning machine-learning nlp transformers language-model language-models

llm-analysis's Introduction

llm-analysis

Latency and Memory Analysis of Transformer Models for Training and Inference

llm-analysis

Overview

Many formulas or equations are floating around in papers, blogs, etc., about how to calculate training or inference latency and memory for Large Language Models (LLMs) or Transformers. Rather than doing math on papers or typing in Excel sheets, let's automate the boring stuff with llm-analysis ⚙️!

Given the specified model, GPU, data type, and parallelism configurations, llm-analysis estimates the latency and memory usage of LLMs for training or inference. With llm-analysis, one can easily try out different training/inference setups theoretically, and better understand the system performance for different scenarios.

llm-analysis helps answer questions such as:

what batch size, data type, parallelism scheme to use to get a feasible (not getting OOM) and optimal (maximizing throughput with a latency constraint) setup for training or inference
time it takes with the given setup to do training or inference and the cost (GPU-hours)
how the latency/memory changes if using a different model, GPU type, number of GPU, data type for weights and activations, parallelism configuration (suggesting the performance benefit of modeling change, hardware improvement, quantization, parallelism, etc.)

Examples

Check the example use cases. With llm-analysis, you can do such analysis in minutes 🚀!

Quick Start

To install llm-analysis from pypi:
```
pip install llm-analysis
```

To install the latest development build:

pip install --upgrade git+https://github.com/cli99/llm-analysis.git@main

To install from source, clone the repo and run pip install . or poetry install (install poetry by pip install poetry).

Using the `LLMAnalysis` class

To integrate llm-analysis in your code, use the LLMAnalysis class. Refer to doc LLMAnalysis for details.

LLMAnalysis is constructed with flops and memory efficiency numbers and the following configuration classes:

ModelConfig covers model information, i.e. max sequence length, number of transformer layers, number of attention heads, hidden dimension, vocabulary size
GPUConfig covers GPU compute and memory specifications
DtypeConfig covers the number of bits used for the model weight, activation, and embedding
ParallelismConfig covers Tensor Parallelism (tp), Pipeline Parallelism (pp), Sequence Parallelism (sp), Expert Parallelism (ep),and Data Parallelism (dp).

Then LLMAnalysis can be queried with different arguments through the training and inference methods.

Using the Entry Point Functions for Command Line

llm-analysis provides two entry functions, train and infer, for ease of use through the command line interface. Run

python -m llm_analysis.analysis train --help

python -m llm_analysis.analysis infer --help

to check the options or read the linked doc. Refer to the examples to see how they are used.

train and infer use the pre-defined name-to-configuration mappings (model_configs, gpu_configs, dtype_configs) and other user-input arguments to construct the LLMAnalysis and do the query.

The pre-defined mappings are populated at the runtime from the model, GPU, and data type configuration json files under model_configs, gpu_configs, and dtype_configs. To add a new model, GPU or data type to the mapping for query, just add a json description file to the corresponding folder.

llm-analysis also supports retrieving ModelConfig from a model config json file path or Hugging Face with the model name .

From a local model config json file, e.g., python -m llm_analysis.analysis train --model_name=local_example_model.json. Check the model configurations under the model_configs folder.
From Hugging Face, e.g., use EleutherAI/gpt-neox-20b as model_name when calling the train or infer entry functions. python -m llm_analysis.analysis train --model_name=EleutherAI/gpt-neox-20b --total_num_gpus 32 --ds_zero 3. With this method, llm-analysis relies on transformers to find the corresponding model configuration on huggingface.co/models, meaning information of newer models only exist after certain version of the transformers library. To access latest models through their names, update the installed transformers package.

A list of handy commands is provided to query against the pre-defined mappings as well as Hugging Face, or to dump configurations. Run python -m llm_analysis.config --help for details.

Some examples:

python -m llm_analysis.config get_model_config_by_name EleutherAI/gpt-neox-20b

gets the ModelConfig from the populated mapping by name, if not found, llm-analysis tries to get it from HuggingFace.

Note that LLaMA models need at least transformers-4.28.1 to retrieve, either update to a later transformers library, or use the predefined ModelConfig for LLaMA models (/ in model names are replaced with _).

python -m llm_analysis.config list_gpu_configs

lists the names of all predefined GPU configurations, then you can query with

python -m llm_analysis.config get_gpu_config_by_name a100-sxm-80gb

to show the corresponding GPUConfig.

How to Set FLOPS and Memory Efficiency

Setting flops and memory efficiency to 1 (default) gives the lower bound of training or inference latency, as it assumes the peak hardware performance (which is never the case). A close-to-reality flops or memory efficiency can be found by benchmarking and profiling using the input dimensions in the model.

If one has to make assumptions, for flops efficiency, literature reports up to 0.5 for large scale model training, and up to 0.7 for inference; 0.9 can be an aggressive target for memory efficiency.

Current Scope and Limitations

llm-analysis aims to provide a lower-bound estimation of memory usage and latency.

Parallelism Scheme

llm-analysis currently covers Tensor Parallelism (tp), Pipeline Parallelism (pp), Sequence Parallelism (sp), Expert Parallelism (ep), and Data Parallelism (dp).

tp, pp, and sp adopt the style of parallelization used in Megatron-LM for training and FasterTransformer for inference

In the training analysis, dp sharding assumes using DeepSpeed ZeRO or FSDP. ds_zero is used to specify the dp sharding strategy

ds_zero	DeepSpeed ZeRO	FSDP	Sharding
0	disabled	NO_SHARD	No sharding
1	Stage 1	N/A	Shard optimizer states
2	Stage 2	SHARD_GRAD_OP	Shard gradients and optimizer states
3	Stage 3	FULL_SHARD	Shard gradients, optimizer states, model parameters

ep parallelizes the number of MLP experts across ep_size devices, i.e. the number of experts per GPU is total number of experts / ep_size. Thus for the MLP module, the number of devices for other parallelization dimensions is divided by ep_size compared to other parts of the model.

Communication

tp communication is calculated as using ring allreduce. ep communication is calculated as using alltoall. dp communication time to unshard model weight when using FSDP or DeepSpeed ZeRO is estimated and compared against the compute latency, the larger value of the two is used for the overall latency. Other dp and pp communications are ignored for now, i.e. assuming perfect computation and communication overlapping, which is not true when communication cannot overlap with compute due to dependency, or when communication is too long to hide due to slow interconnect or large data volume.

Activation Recomputation

llm-analysis supports both full and selective activation recomputation.

activation_recomputation	what is checkpointed and recomputed
0	No activation recomputation; requires the most amount of memory
1	Checkpoints the attention computation (QK^T matrix multiply, softmax, softmax dropout, and attention over V.) in the attention module of a transformer layer; as described in Reducing Activation Recomputation in Large Transformer Models.
2	Checkpoints the input to the attention module in a transformer layer; requires an extra forward pass on attention.
3	Checkpoints the input to the sequence of modules (layernom-attention-layernom) in a transformer layer; requires an extra forward pass on (layernom-attention-layernom).
4	Full activation recomputation stores the input to the transformer layer; requires the least amount of memory; requires an extra forward pass of the entire layer.

Data Types

Data types are expressed with the number of bits, only 32 (FP32, TF32), 16 (FP16, BF16), 8 (INT8), and 4 (INT4) bits data types are modeled for now.

Fine-Tuning

Fine-tuning is modeled the same (controlled by total_num_tokens passed to the train entry function) as pre-training, thus assuming full (all model parameters) fine-tuning. Parameter-efficient fine-tuning (PEFT) is in future support.

Assumptions in Inference

Inference assumes perfect overlapping of compute and memory operations when calculating latency, and maximum memory reuse when calculating memory usage.

TODOs (stay tuned 📻)

Check the TODOs below for what's next and stay tuned 📻! Any contributions or feedback are highly welcome!

Add dp (across and within a node), ep (within a node), pp (across nodes) communication analysis
Support efficient fine-tuning methods such as LoRA or Adapters
Add FP8 datatype support
Support CPU offloading (weight, KV cache, etc.) analysis in training and inference
Support other hardware (e.g. CPU) for inference analysis

Citation

If you use llm-analysis in your work, please cite:

Cheng Li. (2023). LLM-Analysis: Latency and Memory Analysis of Transformer Models for Training and Inference. GitHub repository, https://github.com/cli99/llm-analysis.

@misc{llm-analysis-chengli,
  author = {Cheng Li},
  title = {LLM-Analysis: Latency and Memory Analysis of Transformer Models for Training and Inference},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cli99/llm-analysis}},
}

Contributing

Contributions and suggestions are welcome.

llm-analysis uses pre-commit to ensure code formatting is consistent. For pull requests with code contribution, please install the pre-commit (pip install pre-commit) as well as the used hooks (pip install in the repo), and format the code (runs automatically before each git commit) before submitting the PR.

Useful Links

llm-analysis's People

Contributors

Stargazers

Watchers

llm-analysis's Issues

latency [BUG]

The latency i am getting here and the actual time when i am inferencing are not same. And also there is a huge difference between these two. So could be the problem?

A question about layernorm activation memory.

Hi,

The function get_memory_activation_per_layer_layernorm() will return a value of seq_len * batch_size * hidden_dim / sp_size * dtype_bytes, which in fp16 will be 2sbh/s.

However, I find the paper Reducing Activation Recomputation in Large Transformer Models mentions that the activation memory of LayerNorm is 4sbh.

Unfortunately, I'm not so familiar with LLM memory consumption. Since other activation memory result fits the paper, I wonder if there exists a mistake inside the paper or there is a bug in this function?

Thanks,
Esar

mistral and mixtral inference[BUG]

Describe the bugMistral and Mixtral models not able to infer
When i give the name of the model as i do for other models in case of mistral there is a key error from the configuration_auto.py file in llm_analysis module. This is because there is no key with mistral in the config_map.

So could you also add all the models from hugging face which are not yet defined!!

[BUG] MLP intermediate dimension not used

model_config.ffn_embed_dim is not used. Instead it used 4x hidden_dim

[REQUEST] How to get other GPU config

Is your feature request related to a problem? Please describe.
I recently wanted to test on T4, but I don't know how to measure intra_node information.

Describe the solution you'd like
The following is the T4 information I checked, including intra_node_bandwidth_in_GB_per_sec intra_node_min_message_latency inter_node_bandwidth_in_GB_per_sec. I don’t know how to obtain it.

{
    "name": "T4-pcie-16gb",
    "mem_per_GPU_in_GB": 16,
    "hbm_bandwidth_in_GB_per_sec": 320,
    "intra_node_bandwidth_in_GB_per_sec": XXX,
    "intra_node_min_message_latency": XXX,
    "peak_fp16_TFLOPS": 65,
    "peak_i8_TFLOPS": 130,
    "peak_i4_TFLOPS": 260,
    "inter_node_bandwidth_in_GB_per_sec": XXX
}

[BUG]Is it possible that hbm_memory_efficiency is not working in the code?

Describe the bug
I tried with two unused hbm_memory_efficiency, 1 and 0.6 but ended up with the same value for (weight+op_state+grad+act)_memory_per_gpu. Is it possible that hbm_memory_efficiency is not working in the code?

To Reproduce
Steps to reproduce the behavior:

python -m llm_analysis.analysis train --model_name /hdd/echozhou/llm-analysis/examples/llama --gpu_name a100-pcie-40gb --activation_recomputation 1 --tp_size 1 --pp_size 3 --sp_size 1 --dp_size 1 --gradient_accumulation_steps 4 -b 16 --seq_len 1400 --total_num_gpus 3 --total_num_tokens 1e12 --activation_recomputation 2 --flops_efficiency 1 --hbm_memory_efficiency 0.6 --output_dir /hdd/echozhou/llm-analysis/examples/llama/test
python -m llm_analysis.analysis train --model_name /hdd/echozhou/llm-analysis/examples/llama --gpu_name a100-pcie-40gb --activation_recomputation 1 --tp_size 1 --pp_size 3 --sp_size 1 --dp_size 1 --gradient_accumulation_steps 4 -b 16 --seq_len 1400 --total_num_gpus 3 --total_num_tokens 1e12 --activation_recomputation 2 --flops_efficiency 1 --hbm_memory_efficiency 1 --output_dir /hdd/echozhou/llm-analysis/examples/llama/test

Expected behavior
The final memory consumption you get is all (weight+op_state+grad+act)_memory_per_gpu: 20.14 GB

Screenshots

question about the memory calculation

Hello @cli99, Thank you very much for open-sourcing your library for analyzing large language models. This is very helpful for us to understand various optimization algorithms and parallel configuration strategies. After going through the code, I have encountered a few questions.

In the paper "Reducing Activation Recomputation in Large Transformer Models", there are two LayerNorms in one transformer layer. But in the code:

        weight_memory_per_layer = (
            weight_memory_attn_per_layer
            + weight_memory_mlp_per_layer
            + weight_memory_layernorm_per_layer
        )

only one "weight_memory_layernorm_per_layer" is added.

Also in this paper, the blocks which can be parallelized using tensor parallelism are attention and mlp. But in the code below, LayerNorm can also be parallelized when tensor parallelism is applied.

        weight_memory_layernorm_per_layer = (
            self.get_num_params_per_layer_layernorm()
            * self.dtype_config.weight_bits
            / BITS_PER_BYTE
            / self.parallelism_config.tp_size
            / sharded_dp_size
        )

When I print the summary_dict in the provied python script in llama2 folder, it given me the following result:

{'batch_size_per_gpu': 1, 'seq_len': 512, 'tp_size': 2, 'ep_size': 1, 'pp_size': 1, 'num_tokens_to_generate': 32, 'flops_efficiency': 0.6, 'hbm_memory_efficiency': 0.6, 'layernorm_dtype_bytes': 2, 'use_kv_cache': True, 'kv_cache_latency': 0.00014570698054601933, 'kv_cache_memory_per_gpu': 89128960.0, 'weight_memory_per_gpu': 55292731392.0, 'weight_memory_embedding_per_gpu': 262144000.0, 'prefill_activation_memory_per_gpu': 16777216.0, 'prefill_max_batch_size_per_gpu': 1824, 'prefill_num_flops_fwd_total': 57305601146880.0, 'decode_activation_memory_per_gpu': 32768.0, 'decode_max_batch_size_per_gpu': 343, 'decode_num_flops_fwd_total': 110585446400.0, 'prefill_latency': 0.15908735621271564, 'prefill_latency_fwd_attn': 0.03487366607863248, 'prefill_latency_fwd_mlp': 0.11746919100170941, 'prefill_latency_fwd_layernorm': 0.001097087853522969, 'prefill_latency_fwd_tp_comm': 0.004473924266666667, 'prefill_latency_fwd_sharded_dp_comm': 0.0, 'prefill_latency_fwd_input_embedding': 0.00045651196944907646, 'prefill_latency_fwd_output_embedding_loss': 0.0007169750427350427, 'decode_latency': 0.04685015182136376, 'decode_latency_fwd_attn': 0.009875397743992155, 'decode_latency_fwd_mlp': 0.03510895406244892, 'decode_latency_fwd_layernorm': 2.1427497139120487e-06, 'decode_latency_fwd_tp_comm': 0.0012799999999999999, 'decode_latency_fwd_sharded_dp_comm': 0.0, 'decode_latency_fwd_input_embedding': 0.00043654994278240976, 'decode_latency_fwd_output_embedding_loss': 1.4003418803418803e-06, 'total_decode_latency': 1.4992048582836404, 'total_latency': 1.658292214496356, 'total_per_token_latency': 0.05182163170301113, 'prefill_tokens_per_sec': 3218.3575878613824, 'decode_tokens_per_sec': 21.344648013370964, 'total_tokens_per_sec': 19.296960885581193, 'prefill_cost_per_1k_tokens': 0.00038149203258474565, 'decode_cost_per_1k_tokens': 0.057521575291785504, 'total_cost_per_1k_tokens': 0.06362544781314144}

weight_memory_per_gpu is 55292731392.0, which is less 70B * 2 / 2 = 70B

So can you provide more information about this? Look forward to your response, thank you once again.

How to get the analysis of model Qwen1.5-0.5B

@mvpatel2000 @cli99 @weimingzha0 @digger-yu @BhAem I want to get the analysis info Time to first token (s) 、Time for completion (s) and Tokens/second about the model Qwen1.5-0.5B , so do I just need to run the following command :

HF_ENDPOINT=https://hf-mirror.com
gpu_name='a100-sxm-80gb'
dtype_name="w16a16e16"
output_dir='outputs_infer'
model_name=Qwen/Qwen1.5-0.5B
batch_size_per_gpu=1
tp_size=2
output_file_suffix="-bs${batch_size_per_gpu}"
cost_per_gpu_hour=2.21
seq_len=128
num_tokens_to_generate=242
flops_efficiency=0.7
hbm_memory_efficiency=0.9
achieved_tflops=200                # will overwrite the flops_efficiency above
achieved_memory_bandwidth_GBs=1200 # will overwrite the hbm_memory_efficiency above

if [[ ! -e $output_dir ]]; then
    mkdir $output_dir
elif [[ ! -d $output_dir ]]; then
    echo "$output_dir already exists but is not a directory" 1>&2
fi

HF_ENDPOINT=$HF_ENDPOINT CUDA_VISIBLE_DEVICES=3 python -m llm_analysis.analysis infer --model_name=${model_name} --gpu_name=${gpu_name} --dtype_name=${dtype_name} -output_dir=${output_dir} --output-file-suffix=${output_file_suffix} \
    --seq_len=${seq_len} --num_tokens_to_generate=${num_tokens_to_generate} --batch_size_per_gpu=${batch_size_per_gpu} \
    --tp_size=${tp_size} \
    --cost_per_gpu_hour=${cost_per_gpu_hour} \
    --flops_efficiency=${flops_efficiency} --hbm_memory_efficiency=${hbm_memory_efficiency} --log_level DEBUG

[REQUEST] Support for paged attention?

Hi,

Will this project support paged-attention? https://vllm.ai/?

Thanks,
Jason

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.