
🚧 pipegoose: Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts


[pipeline diagram]

⚠️ The project is under active development and we're looking for collaborators. Come join us: [discord link] [roadmap] [good first issue]

⚠️ The APIs are still a work in progress and could change at any time. None of the public APIs are set in stone until we hit version 0.6.9.

⚠️ Currently only parallelizing bloom-560m is supported. Support for hybrid 3D parallelism and a distributed optimizer for 🤗 transformers will be available in the upcoming weeks (it's basically done, but it doesn't support 🤗 transformers yet).

⚠️ This library underperforms compared to Megatron-LM and DeepSpeed (and does not yet reach reasonable performance).

from torch.utils.data import DataLoader
+ from torch.utils.data.distributed import DistributedSampler
from torch.optim import Adam
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

+ from pipegoose.distributed import ParallelContext, ParallelMode
+ from pipegoose.nn import DataParallel, TensorParallel
+ from pipegoose.optim import DistributedOptimizer

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer.pad_token = tokenizer.eos_token

BATCH_SIZE = 4
+ DATA_PARALLEL_SIZE = 2
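+ # 2 tensor-parallel ranks x 2 data-parallel ranks (pipeline parallelism disabled) = 4 processes in total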
+ parallel_context = ParallelContext.from_torch(
+    tensor_parallel_size=2,
+    data_parallel_size=2,
+    pipeline_parallel_size=1
+ )
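+ # shard the model's weights across the tensor-parallel ranks, then replicate the sharded model across the data-parallel ranks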
+ model = TensorParallel(model, parallel_context).parallelize()
+ model = DataParallel(model, parallel_context).parallelize()
model.to("cuda")
+ device = next(model.parameters()).device

optim = Adam(model.parameters(), lr=1e-3)
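+ # ZeRO-1: partition the optimizer states across the data-parallel ranks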
+ optim = DistributedOptimizer(optim, parallel_context)

dataset = load_dataset("imdb", split="train")
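+ # give each data-parallel rank its own shard of the training data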
+ dp_rank = parallel_context.get_local_rank(ParallelMode.DATA)
+ sampler = DistributedSampler(dataset, num_replicas=DATA_PARALLEL_SIZE, rank=dp_rank, seed=42)
+ dataloader = DataLoader(dataset, batch_size=BATCH_SIZE // DATA_PARALLEL_SIZE, shuffle=False, sampler=sampler)

for epoch in range(100):
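+    # re-seed the sampler so every epoch uses a different shuffle order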
+    sampler.set_epoch(epoch)

    for batch in dataloader:
        inputs = tokenizer(batch["text"], padding=True, truncation=True, max_length=1024, return_tensors="pt")
        inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
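        # causal language modeling: the labels are the input ids; the model shifts them internally when computing the loss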
        labels = inputs["input_ids"]

        outputs = model(**inputs, labels=labels)

        optim.zero_grad()
        outputs.loss.backward()
        optim.step()
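
Note that this script must be launched with torchrun so that four processes (one per GPU) are created; see the next section for the exact command.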

Install and try it out

You can install the package with the following command:

pip install pipegoose

Then try out a hybrid tensor and data parallelism training script (you need at least 4 GPUs to run hybrid 2D parallelism, since the example uses 2 tensor-parallel x 2 data-parallel ranks):

cd pipegoose/examples
torchrun --standalone --nnodes=1 --nproc-per-node=4 hybrid_parallelism.py

We ran a small-scale correctness test by comparing the validation losses of a parallelized transformer against a non-parallelized one, starting from identical checkpoints and the same training data. We will conduct rigorous large-scale convergence and weak-scaling benchmarks against Megatron-LM and DeepSpeed in the near future, resources permitting. Results so far:

  • Data Parallelism [link]
  • Tensor Parallelism [link] (we've found a convergence bug and are fixing it)
  • Hybrid 2D Parallelism (TP+DP) [link]
  • Distributed Optimizer ZeRO-1 Convergence: [sgd link] [adam link]

Features

  • Megatron-style 3D parallelism (see the sketch after this list)
  • Sequence parallelism and Mixture of Experts that work with 3D parallelism
  • ZeRO-1: Distributed Optimizer
  • Highly optimized CUDA kernels ported from Megatron-LM and DeepSpeed
  • ...
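
As a rough illustration of how the three axes compose, here is a minimal sketch that reuses only the ParallelContext.from_torch call from the quick-start example above; the 2 x 2 x 2 sizes are illustrative, and pipeline parallelism for 🤗 transformers is not released yet (see the warning at the top).

from pipegoose.distributed import ParallelContext

# the number of processes launched by torchrun must equal the product of the
# three parallel sizes: 2 (tensor) x 2 (pipeline) x 2 (data) = 8 processes
parallel_context = ParallelContext.from_torch(
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    data_parallel_size=2
)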

Appreciation

  • Big thanks to 🤗 Hugging Face for sponsoring this project with GPUs for testing, and to Zach Schrier for the monthly Twitch donations!

  • The library's APIs are inspired by OSLO's and ColossalAI's APIs.

Citation

@software{pipegoose,
  title = {{pipegoose: Large-scale 4D parallelism pre-training for `transformers`}},
  author = {},
  url = {https://github.com/xrsrke/pipegoose},
  doi = {},
  month = {},
  year = {2024},
  version = {},
}

Contributors

  • abourramouss
  • danielgrittner
  • xrsrke
