bigcode-inference-benchmark

bigcode-inference-benchmark's People

Contributors

harm-devries, jlamypoirier, mayank31398

bigcode-inference-benchmark's Issues

Inference tasks and milestones

[WIP]
We want to achieve and demonstrate state-of-the-art inference throughput and latency for our models. Here is a list of milestones and tasks. These are not necessarily in order; we can (and should) already look into the later milestones.

Milestone 1: Make a starter implementation of MQA and add it to BigCode transformers. Agreeing on a common implementation will be crucial for the next steps. (bigcode-project/transformers#4)

  • Task 1.1: Implement a GPT2 model with MHA and MQA within BigCode transformers. We should keep support for MHA so we can compare with an equally optimized implementation (@bigximik, @jlamypoirier, @mayank31398 ).
  • Task 1.2: Add basic profiling support to our benchmarking code (@jlamypoirier). #10
  • Task 1.3: Validate, profile and add simple optimizations for our model for a ~1B model such as SantaCoder (@jlamypoirier).
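The MHA/MQA comparison in Task 1.1 can be sketched in a few lines: the only structural difference is that MQA shares a single K/V head across all query heads. A minimal NumPy sketch (shapes and the `attention` function name are illustrative, not the BigCode transformers implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    n_kv_heads == n_heads is MHA; n_kv_heads == 1 is MQA."""
    if k.shape[0] == 1:
        # MQA: one K/V head shared by every query head (a view, no copy)
        k = np.broadcast_to(k, q.shape)
        v = np.broadcast_to(v, q.shape)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v
```

Because the shared head is only broadcast, an MQA call produces exactly the same output as an MHA call whose K/V heads are identical copies, which makes the two easy to validate against each other.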

Milestone 2: Turn our starter implementation into a strong baseline.

Milestone 3: Scaling up

  • Task 3.1: Look into alternative libraries (semi-optional)
    • Try inference with Megatron
    • Add MQA support to DeepSpeed
    • Other suggestions?
  • Task 3.2: Collaborate with the training team to determine our scaling needs and the target model configurations.
  • Task 3.3: Add support for tensor model parallelism. This will likely involve an alternative library. Tensor parallelism will be needed to reduce latency for bigger models, and possibly for memory, depending on the target model size and hardware (we can fit up to ~40B parameters in fp16 on a single A100).
  • Task 3.4: Optimize for bigger models.
  • Task 3.5: Benchmark the bigger models.
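The ~40B fp16 figure in Task 3.3 follows from simple arithmetic: fp16 stores 2 bytes per parameter, so weights alone approach the 80 GB limit of an A100-80GB well before activations and KV cache are counted. A quick sketch (the helper function name is ours):

```python
def fp16_weight_gib(n_params_billion):
    # fp16 stores 2 bytes per parameter; convert bytes to GiB
    return n_params_billion * 1e9 * 2 / 2**30

# a 40B-parameter model needs ~74.5 GiB for weights alone,
# leaving little headroom on an 80 GB A100
print(round(fp16_weight_gib(40), 1))
```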

Milestone 4: Deployment

  • Task 4.1: Optimize end-to-end model performance
    • Optimize tokenization
    • Optimize decoding
    • Run them asynchronously whenever possible (i.e., in parallel with GPU ops for other batches)
  • Task 4.2: Use a fast inference server (HF inference, BigScience inference, DeepSpeed inference, NVIDIA Triton?)
  • Task 4.3: Integrate our optimized model into HF transformers. huggingface/transformers#21253
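The asynchronous tokenization idea in Task 4.1 can be sketched with a thread pool that prepares batch i+1 on the CPU while batch i runs on the GPU; `tokenize` and `model_forward` below are stand-ins for the real pipeline stages, not the project's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(batch):
    # stand-in for CPU-bound tokenization
    return [text.split() for text in batch]

def model_forward(tokens):
    # stand-in for a GPU forward pass
    return [len(t) for t in tokens]

def pipeline(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(tokenize, batches[0])
        for i in range(len(batches)):
            tokens = future.result()
            if i + 1 < len(batches):
                # kick off tokenization of the next batch while
                # the current one occupies the GPU
                future = pool.submit(tokenize, batches[i + 1])
            results.append(model_forward(tokens))
    return results
```

With real workloads the same pattern applies to detokenization/decoding of finished batches, which can likewise run off the critical path.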

Improve inference speed of multi-query attention model

The multi-query attention paper reports up to 10x speed-ups compared to incremental decoding with a multi-head attention model. We've implemented multi-query attention but only observed up to 25% speed-ups when it's fully integrated into the Transformers model. We did observe up to 2x speed-ups for a simplified version of the attention layer (without softmax and layer normalization). See more details here.

Further inference gains are likely possible but require further investigation. For example, we would like to benchmark the difference in a more optimized inference environment such as DeepSpeed-Inference. We are also happy to discuss other solutions and directions in the #wg-inference channel.
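Much of the paper's theoretical speed-up comes from the smaller KV cache that incremental decoding must read at every step, since decoding is typically memory-bandwidth bound. A back-of-the-envelope comparison, using a hypothetical ~1B-parameter configuration (24 layers, 16 heads of dimension 128; not necessarily SantaCoder's actual shape):

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_el=2):
    # factor 2 for the K and V caches; bytes_per_el=2 assumes fp16
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_el

mha = kv_cache_bytes(batch=32, seq_len=2048, layers=24, kv_heads=16, head_dim=128)
mqa = kv_cache_bytes(batch=32, seq_len=2048, layers=24, kv_heads=1, head_dim=128)
# the MQA cache is 16x smaller (one K/V head instead of 16)
```

The gap between this 16x reduction in cache traffic and the 25% end-to-end speed-up we measured suggests the integrated model spends much of its time outside the attention memory reads, which is consistent with the 2x result for the simplified layer.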
