bigcode-inference-benchmark

bigcode-inference-benchmark's People

Contributors

harm-devries, jlamypoirier, mayank31398

bigcode-inference-benchmark's Issues

Inference tasks and milestones

[WIP]
We want to achieve and demonstrate state-of-the-art inference throughput and latency for our models. Here is a list of milestones and tasks. These are not necessarily in order; we can (and should) already look into the later milestones.

Milestone 1: Make a starter implementation of MQA and add it to BigCode transformers. Agreeing on a common implementation will be crucial for the next steps. (bigcode-project/transformers#4)

  • Task 1.1: Implement a GPT2 model with MHA and MQA within BigCode transformers. We should keep support for MHA so we can compare with an equally optimized implementation (@bigximik, @jlamypoirier, @mayank31398 ).
  • Task 1.2: Add basic profiling support to our benchmarking code (@jlamypoirier). #10
  • Task 1.3: Validate, profile and add simple optimizations for our model for a ~1B model such as SantaCoder (@jlamypoirier).
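The MHA/MQA comparison in Task 1.1 can be sketched in a few lines: the only structural difference is that MQA shares a single K/V head across all query heads. A minimal NumPy sketch (shapes and the `attention` function name are illustrative, not the BigCode transformers implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d).
    n_kv_heads == n_heads is MHA; n_kv_heads == 1 is MQA."""
    if k.shape[0] == 1:
        # MQA: one K/V head shared by every query head (a view, no copy)
        k = np.broadcast_to(k, q.shape)
        v = np.broadcast_to(v, q.shape)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v
```

Because the shared head is only broadcast, an MQA call produces exactly the same output as an MHA call whose K/V heads are identical copies, which makes the two easy to validate against each other.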

Milestone 2: Turn our starter implementation into a strong baseline.

Milestone 3: Scaling up

  • Task 3.1: Look into alternative libraries (semi-optional)
    • Try inference with Megatron
    • Add MQA support to DeepSpeed
    • Other suggestions?
  • Task 3.2: Collaborate with the training team to determine our scaling needs and the target model configurations.
  • Task 3.3: Add support for tensor model parallelism. This will likely involve an alternative library. Tensor parallelism will be needed to reduce latency for bigger models, and possibly for memory, depending on the target model size and hardware (we can fit up to ~40B parameters in fp16 on a single A100).
  • Task 3.4: Optimize for bigger models.
  • Task 3.5: Benchmark the bigger models.
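The ~40B fp16 figure in Task 3.3 follows from simple arithmetic: fp16 stores 2 bytes per parameter, so weights alone approach the 80 GB limit of an A100-80GB well before activations and KV cache are counted. A quick sketch (the helper function name is ours):

```python
def fp16_weight_gib(n_params_billion):
    # fp16 stores 2 bytes per parameter; convert bytes to GiB
    return n_params_billion * 1e9 * 2 / 2**30

# a 40B-parameter model needs ~74.5 GiB for weights alone,
# leaving little headroom on an 80 GB A100
print(round(fp16_weight_gib(40), 1))
```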

Milestone 4: Deployment

  • Task 4.1: Optimize end-to-end model performance
    • Optimize tokenization
    • Optimize decoding
    • Run them asynchronously whenever possible (i.e., in parallel with GPU ops for other batches)
  • Task 4.2: Use a fast inference server (HF inference, BigScience inference, DeepSpeed inference, NVIDIA Triton?)
  • Task 4.3: Integrate our optimized model into HF transformers. huggingface/transformers#21253
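The asynchronous tokenization idea in Task 4.1 can be sketched with a thread pool that prepares batch i+1 on the CPU while batch i runs on the GPU; `tokenize` and `model_forward` below are stand-ins for the real pipeline stages, not the project's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(batch):
    # stand-in for CPU-bound tokenization
    return [text.split() for text in batch]

def model_forward(tokens):
    # stand-in for a GPU forward pass
    return [len(t) for t in tokens]

def pipeline(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(tokenize, batches[0])
        for i in range(len(batches)):
            tokens = future.result()
            if i + 1 < len(batches):
                # kick off tokenization of the next batch while
                # the current one occupies the GPU
                future = pool.submit(tokenize, batches[i + 1])
            results.append(model_forward(tokens))
    return results
```

With real workloads the same pattern applies to detokenization/decoding of finished batches, which can likewise run off the critical path.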

Improve inference speed of multi-query attention model

The multi-query attention paper reports up to 10x speed-ups compared to incremental decoding with a multi-head attention model. We've implemented multi-query attention but only observed up to 25% speed-ups when it's fully integrated into the Transformers model. We did observe up to 2x speed-ups for a simplified version of the attention layer (without softmax and layer normalization). See more details here.

Further inference gains are likely possible but require further investigation. For example, we would like to benchmark the difference in a more optimized inference environment such as DeepSpeed-Inference. We are also happy to discuss other solutions and directions in the #wg-inference channel.
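Much of the paper's theoretical speed-up comes from the smaller KV cache that incremental decoding must read at every step, since decoding is typically memory-bandwidth bound. A back-of-the-envelope comparison, using a hypothetical ~1B-parameter configuration (24 layers, 16 heads of dimension 128; not necessarily SantaCoder's actual shape):

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_el=2):
    # factor 2 for the K and V caches; bytes_per_el=2 assumes fp16
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_el

mha = kv_cache_bytes(batch=32, seq_len=2048, layers=24, kv_heads=16, head_dim=128)
mqa = kv_cache_bytes(batch=32, seq_len=2048, layers=24, kv_heads=1, head_dim=128)
# the MQA cache is 16x smaller (one K/V head instead of 16)
```

The gap between this 16x reduction in cache traffic and the 25% end-to-end speed-up we measured suggests the integrated model spends much of its time outside the attention memory reads, which is consistent with the 2x result for the simplified layer.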
