Large-batch Training of Language Models

The original README of Megatron-LM is README_old.md.

Setup

Dataset

EC2

Download from S3 (check the M*EKS Tutorial for the setup).

# Wikipedia preprocessed for Megatron-LM. model: 4-layer BERT, T5
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/my-bert_text_sentence.bin ./
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/my-bert_text_sentence.idx ./
# Wikipedia + BookCorpus preprocessed for Megatron-LM. model: BERT large
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/bert_text_sentence.bin ./
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/bert_text_sentence.idx ./
# jsonl of BookCorpus before preprocess
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/bookcorpus.jsonl ./
# logs, plus read.py and plot.py for plotting figures. ${id} is in [2,3,4,5,6,78,9,10,11,12]; it is '78' because I was at ICML during weeks 7 and 8, so the logs of those two weeks are put together.
aws --profile gluonnlp s3 cp s3://mstar-eks-dev-us-east-2/annnxu/logs_week${id} ./logs_week${id}/ --recursive

The Wikipedia dataset is downloaded and preprocessed following the Megatron-LM README_old.md. BookCorpus is downloaded from the web, concatenated with Wikipedia, and then preprocessed with Megatron-LM in the same way.
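
For reference, a rough sketch of that preprocessing step (flag names may differ slightly between Megatron-LM versions; the input jsonl and the output prefix are placeholders chosen to match the file names above):

python tools/preprocess_data.py \
    --input corpus.jsonl \
    --output-prefix bert \
    --vocab bert-large-uncased-vocab.txt \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers 16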

Download the vocabulary files.

wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt

Remember to move all of the files downloaded above to the ~/data directory on each EC2 instance.
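
For example, assuming everything was downloaded to the current working directory:

mkdir -p ~/data
mv my-bert_text_sentence.* bert_text_sentence.* bookcorpus.jsonl *-vocab.txt ~/data/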

EKS

For EKS, specify the data path in the yaml file.
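
In practice this means pointing Megatron's usual data flags at the mounted dataset inside the training command of the yaml; the mount path below is only an assumption for illustration.

# hypothetical excerpt of the training command in the yaml
python pretrain_bert.py \
    --data-path /mnt/data/bert_text_sentence \
    --vocab-file /mnt/data/bert-large-uncased-vocab.txt \
    ...  # remaining arguments unchanged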

Environment

EC2

I followed the MIST Intern Onboarding Guide to create the EC2 instances.

On each EC2 instance, create a conda environment named "p37".

conda create -n p37 python=3.7 -y
conda activate p37
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch -y
conda install regex ninja nltk pybind11 -y

Install Apex.

cd ~
git clone https://github.com/anxuthu/apex.git
cd ~/apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Download the code from my Weekly Progress, unzip it, and move it to ~/.

EKS

Check the M*EKS Tutorial for the setup; Slack @zhenghuj with any questions regarding EKS.

I have already uploaded the Docker image (for BERT large), so it can be specified directly in the yaml file submitted to the EKS cluster as 747303060528.dkr.ecr.us-east-2.amazonaws.com/mstar-eks:annnxu.

To create a new Docker image, I use an EC2 instance to run the following commands after downloading the code (check the M*EKS Tutorial section "Build with your customized docker image" for the prior steps).

cd ~/megatron
sudo chmod 666 /var/run/docker.sock
DOCKER_BUILDKIT=1 docker build --no-cache -t mstar-eks -f Dockerfile .
docker tag mstar-eks:latest 747303060528.dkr.ecr.us-east-2.amazonaws.com/mstar-eks:annnxu # replace "annnxu"
docker push 747303060528.dkr.ecr.us-east-2.amazonaws.com/mstar-eks:annnxu # upload; replace "annnxu"

Node Configuration

For 4-layer BERT and T5, I use EC2 g4dn.12xlarge instances, each with 4 GPUs. The Amazon Machine Image (AMI) is "Deep Learning AMI (Ubuntu 18.04) Version 60.4".

For 4-layer BERT and T5 with tensor parallelism = 8 (larger than the 4 GPUs of a g4dn.12xlarge), I use EC2 g4dn.metal instances, each with 8 GPUs.

Note: for distributed training with EC2 instances,

  • locally run the script with NNODES=1 and NODE_RANK=0 first to create the dataset index map, then set NNODES and NODE_RANK on each instance according to the distributed setting (see the sketch after this list).
  • make sure micro_batch_size x #GPUs <= global_batch_size.
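
A minimal sketch of that two-step launch, using the $MASTER_ADDR $NNODES $NODE_RANK calling convention of the scripts described below (the script name is a placeholder):

# step 1: on a single node, run once to build the dataset index map
./bert4_scripts2/xxxx.sh $MASTER_ADDR 1 0 $lr
# step 2: after the index map exists, relaunch on every instance with the real
# distributed setting, e.g. 4 nodes with NODE_RANK 0, 1, 2, 3
./bert4_scripts2/xxxx.sh $MASTER_ADDR 4 $NODE_RANK $lr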

For BERT large (24 layers) pre-training, I use the EKS cluster.

4-layer BERT

Shorter Training Steps

Check ./bert4_scripts, where "lamb" denotes FusedLAMB from Apex, "mylamb" denotes my PyTorch implementation of LAMB, and "mylamb2" denotes our first proposed method (layer-wise noise). I use one g4dn.12xlarge per experiment, and each run should take 3-4 hours. Just run

./bert4_scripts/xxxx.sh

Longer Training Steps

Check ./bert4_scripts2, where "mylamb3" denotes our method of increasing the learning rate for the embedding weights. I use 4 g4dn.12xlarge nodes per experiment, and each run should take 2 hours. Run

#lr=0.01 for B=512, 1k, 2k; lr=0.01 * (2 ** 0.5) for B=4k; lr=0.02 for B=8k, 16k.
#for mylamb3, set "--alpha 1.0"
./bert4_scripts2/xxxx.sh $MASTER_ADDR $NNODES $NODE_RANK $lr
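
A small bash helper sketch for picking $lr from the global batch size, following the comment above (0.01 * 2 ** 0.5 is written out as 0.014142 since bash has no floating-point arithmetic):

B=4096  # global batch size
case $B in
  512|1024|2048) lr=0.01 ;;
  4096)          lr=0.014142 ;;  # 0.01 * sqrt(2)
  8192|16384)    lr=0.02 ;;
esac
./bert4_scripts2/xxxx.sh $MASTER_ADDR $NNODES $NODE_RANK $lr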

For the tensor parallel = 2 and 4 experiments, the training time scales inversely with the degree of data parallelism (each doubling of tensor parallelism halves the number of data-parallel replicas), so training takes about 4 and 8 hours respectively with 4 g4dn.12xlarge nodes. Run

./bert4_scripts2/xxxx_tp.sh $MASTER_ADDR $NNODES $NODE_RANK $TENSOR_PARALLELISM

For the tensor parallel = 8 experiments, remember to set "GPUS_PER_NODE=8" instead. I use 8 g4dn.metal nodes and it takes about 3-4 hours. Run the same script as above.

./bert4_scripts2/xxxx_tp.sh $MASTER_ADDR $NNODES $NODE_RANK $TENSOR_PARALLELISM

BERT large (24 layers)

Check ./bert24_yaml. First set up the cluster

mstarx --profile gluonnlp config --cluster mstar-eks --region us-east-2 # cluster us-east-2

Cluster usage can be found in CloudWatch -> Dashboards -> mstar-eks. The job DAG can be found in Airflow. The output is written to /mnt_out/annnxu/.

Submit the job to EKS via

mstarx --profile gluonnlp submit -f bert24_yaml/xxxx.yaml

Each experiment should take about 2 days with 8 p4 nodes. Remember to set node_num in the yaml file.
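
A hypothetical excerpt of such a yaml, showing only the two pieces mentioned in this README (the image and node_num); the rest of the schema comes from the M*EKS tooling and is omitted here.

image: 747303060528.dkr.ecr.us-east-2.amazonaws.com/mstar-eks:annnxu
node_num: 8  # number of p4 nodes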

For the tensor parallelism experiments, add the "--tensor-model-parallel-size" argument with value 1, 2, 4, or 8 after "pretrain_bert.py" in the yaml file. Tensor parallelism = 4 should take about 8 days.
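
For example, the training command in the yaml would then start with (remaining arguments unchanged):

python pretrain_bert.py --tensor-model-parallel-size 4 ...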

For 1-B BERT (85 layers), check ./bert85_yaml. Each experiment should take about 12 hours with 4 p4 nodes.

T5 small (6 layers)

Check ./t5_scripts. I use 4 g4dn.12xlarge nodes for each experiment, which should take 12 hours. Run

./t5_scripts/xxxx.sh $MASTER_ADDR $NNODES $NODE_RANK

For the tensor parallel = 2 and 4 experiments, the training time is 12 and 24 hours respectively with 8 g4dn.12xlarge nodes. Run

./t5_scripts/xxxx_tp.sh $MASTER_ADDR $NNODES $NODE_RANK $TENSOR_PARALLELISM

For the tensor parallel = 8 experiments, remember to set "GPUS_PER_NODE=8" instead. I use 16 g4dn.metal nodes for T5 and it takes about 11 hours. Run the same script as above.

./t5_scripts/xxxx_tp.sh $MASTER_ADDR $NNODES $NODE_RANK $TENSOR_PARALLELISM

1-B BERT (85 layers)

Check ./bert85_yaml.

Submit the job to EKS via

mstarx --profile gluonnlp submit -f bert85_yaml/xxxx.yaml

Each experiment should take about 12 hours with 4 p4 nodes. Remember to set node_num in the yaml file.
