
megatron-deepspeed's Introduction

Latest News

Megatron-DeepSpeed

DeepSpeed version of NVIDIA's Megatron-LM that adds additional support for several features such as MoE model training, Curriculum Learning, 3D Parallelism, and others. The examples_deepspeed/ folder includes example scripts for the features supported by DeepSpeed.

Recent sync with NVIDIA/Megatron-LM

In July 2023, we synced with the NVIDIA/Megatron-LM repo (from which this repo is forked) by git-merging 1100+ commits. Details can be found in the examples_deepspeed/rebase folder. Given the number of merged commits, bugs may surface in cases we haven't tested, and contributions (bug reports, bug-fix pull requests) are highly welcome. We also created a backup branch, which is the version before this sync. This backup branch is only for comparison tests and for temporary use when you need to debug the main branch. We do not plan to continue supporting the version before the sync.

Run on Azure and AzureML

To try out DeepSpeed on Azure, this fork of Megatron offers easy-to-use recipes and bash scripts. We strongly recommend starting with the AzureML recipe in the examples_deepspeed/azureml folder. If you have a custom infrastructure (e.g. HPC clusters) or an Azure VM-based environment, please refer to the bash scripts in the examples_deepspeed/azure folder.

Below is Megatron-LM's original README. Note that the examples mentioned below are from the original NVIDIA/Megatron-LM repo. None of them have DeepSpeed integrated, and some may not work due to changes in this Megatron-DeepSpeed repo. We therefore recommend the ../examples_deepspeed/ folder, which includes examples with DeepSpeed technologies integrated that are tested by the DeepSpeed team.

Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel (tensor, sequence, and pipeline), and multi-node pre-training of transformer based models such as GPT, BERT, and T5 using mixed precision.

Below are some of the projects where we have directly used Megatron:

Megatron is also used in NeMo Megatron, a framework to help enterprises overcome the challenges of building and training sophisticated natural language processing models with billions and trillions of parameters.

Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage NVIDIA's Selene supercomputer to perform scaling studies and use up to 3072 A100 GPUs for the largest model. Each cluster node has 8 NVIDIA 80GB A100 GPUs. The graph below shows that we scale nearly linearly up to 1 trillion parameter models running on 3072 GPUs. Note that these results are from benchmark runs and these models were not trained to convergence; however, the FLOPs are measured for end-to-end training, i.e., they include all operations including data loading, optimization, and even logging.

Scaling Graph

The following table shows both model (MFU) and hardware (HFU) FLOPs utilization for select configurations up to 1T parameters (see our paper for a description of how these are calculated). As the model size increases, we achieve better GPU utilization, and for the one trillion parameter model we reach an MFU and HFU of 56.3% and 57.0%, respectively. Note that these numbers are also measured on benchmark runs and in this case are measured using a data parallel size of one. Data parallelism introduces some overhead due to the gradient all-reduce required between the data parallel groups. However, for large transformer models this overhead is not large and can be almost entirely eliminated by overlapping the gradient all-reduce with backpropagation.

Model Size    Model FLOPs Utilization    Hardware FLOPs Utilization
22B           41.5%                      43.7%
175B          51.4%                      52.8%
530B          56.0%                      57.0%
1T            56.3%                      57.0%

Contents

Setup

We strongly recommend using the latest release of NGC's PyTorch container with DGX nodes. If you can't use this for some reason, use the latest pytorch, cuda, nccl, and NVIDIA APEX releases. Data preprocessing requires NLTK, though this is not required for training, evaluation, or downstream tasks.

You can launch an instance of the PyTorch container and mount Megatron, your dataset, and checkpoints with the following Docker commands:

docker pull nvcr.io/nvidia/pytorch:xx.xx-py3
docker run --gpus all -it --rm -v /path/to/megatron:/workspace/megatron -v /path/to/dataset:/workspace/dataset -v /path/to/checkpoints:/workspace/checkpoints nvcr.io/nvidia/pytorch:xx.xx-py3

Downloading Checkpoints

We have provided pretrained BERT-345M and GPT-345M checkpoints for evaluation or for finetuning on downstream tasks. To access these checkpoints, first sign up for and set up the NVIDIA GPU Cloud (NGC) Registry CLI. Further documentation for downloading models can be found in the NGC documentation.

Alternatively, you can directly download the checkpoints using:

BERT-345M-uncased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_uncased/zip -O megatron_bert_345m_v0.1_uncased.zip
BERT-345M-cased: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_bert_345m/versions/v0.1_cased/zip -O megatron_bert_345m_v0.1_cased.zip
GPT-345M: wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip

The models require vocabulary files to run. The BERT WordPiece vocab file can be extracted from Google's pretrained BERT models: uncased, cased. The GPT vocab file and merge table can be downloaded directly.
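If needed, the GPT vocab file and merge table mentioned above can also be fetched directly with wget. The URLs below are the ones historically linked from the Megatron-LM README; verify them against the current README if they have moved:

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt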

Additional notes for DeepSpeed: we have added helper scripts to download the checkpoints and make the example runnable.

Steps to follow:

  • bash dataset/download_ckpt.sh -- this will download and extract the checkpoint
  • bash dataset/download_vocab.sh -- this will download GPT merges and vocab files.
  • bash examples/generate_text.sh -- this will generate examples using the 345m GPT model.

Usage

After installation, there are several possible workflows. The most comprehensive is:

  1. Data preprocessing
  2. Pretraining
  3. Finetuning (Optional for zero-shot tasks)
  4. Downstream task evaluation or text generation

However, steps 1 and 2 can be replaced by using one of the pretrained models mentioned above.

We've provided several scripts for pretraining both BERT and GPT in the examples directory, as well as scripts for both zero-shot and fine-tuned downstream tasks including MNLI, RACE, WikiText103, and LAMBADA evaluation. There is also a script for GPT interactive text generation.

Training

Data Preprocessing

The training data requires preprocessing. First, place your training data in a loose json format, with one json containing a text sample per line. For example:

{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}

The name of the text field of the json can be changed by using the --json-key flag in preprocess_data.py. The other metadata are optional and are not used in training.
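If your corpus is plain text with one sample per line, a minimal sketch for wrapping it into this loose json format is shown below (it assumes the jq utility is available; the file names are placeholders):

jq -R -c '{"text": .}' my-corpus.txt > my-corpus.json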

The loose json is then processed into a binary format for training. To convert the json into mmap format use preprocess_data.py. An example script to prepare data for BERT training is:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-bert \
       --vocab-file bert-vocab.txt \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences \
       --workers 5

The output will be two files named, in this case, my-bert_text_sentence.bin and my-bert_text_sentence.idx. The --data-path specified in later BERT training is the full path and new filename, but without the file extension.

For T5 use the same preprocessing as BERT, perhaps renaming it to:

       --output-prefix my-t5 \
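A sketch of the full T5 preprocessing command implied above; every flag other than --output-prefix follows the BERT example:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-t5 \
       --vocab-file bert-vocab.txt \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences \
       --workers 5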

Some minor modifications are required for GPT data preprocessing, namely, the addition of a merge table, an end-of-document token, removal of sentence splitting, and a change to the tokenizer type:

python tools/preprocess_data.py \
       --input my-corpus.json \
       --output-prefix my-gpt2 \
       --vocab-file gpt2-vocab.json \
       --dataset-impl mmap \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file gpt2-merges.txt \
       --append-eod \
       --workers 5

Here the output files are named my-gpt2_text_document.bin and my-gpt2_text_document.idx. As before, in GPT training, use the longer name without the extension as --data-path.

Further command line arguments are described in the source file preprocess_data.py.

BERT Pretraining

The examples/pretrain_bert.sh script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations starting at --lr to a minimum set by --min-lr over --lr-decay-iters iterations. The fraction of training iterations used for warmup is set by --lr-warmup-fraction. While this is single GPU training, the batch size specified by --micro-batch-size is a single forward-backward pass batch size, and the code will perform gradient accumulation steps until it reaches global-batch-size, which is the batch size per iteration. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with --seed). We use --train-iters as the number of training iterations requested. Alternatively, one can provide --train-samples, which is the total number of samples to train on. If this option is present, then instead of providing --lr-decay-iters, one will need to provide --lr-decay-samples.

The logging, checkpoint-saving, and evaluation intervals are specified. Checkpointing the activations facilitates the training of larger models and/or batches. Note that the --data-path now includes the additional _text_sentence suffix added in preprocessing, but does not include the file extensions.
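A condensed sketch of the argument block described above is shown below. The values are illustrative, not the exact contents of examples/pretrain_bert.sh:

python pretrain_bert.py \
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --micro-batch-size 4 \
       --global-batch-size 8 \
       --seq-length 512 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --lr 0.0001 \
       --min-lr 0.00001 \
       --lr-decay-iters 990000 \
       --lr-decay-style linear \
       --lr-warmup-fraction 0.01 \
       --seed 1234 \
       --data-path $DATA_PATH \
       --vocab-file $VOCAB_FILE \
       --tokenizer-type BertWordPieceLowerCase \
       --save $CHECKPOINT_PATH \
       --load $CHECKPOINT_PATH \
       --save-interval 10000 \
       --eval-interval 1000 \
       --log-interval 100 \
       --fp16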

Further command line arguments are described in the source file arguments.py.

To run examples/pretrain_bert.sh, make any desired modifications including setting the environment variables for CHECKPOINT_PATH, VOCAB_FILE, and DATA_PATH. Make sure to set these variables to their paths in the container. Then launch the container with Megatron and necessary paths mounted (as explained in Setup) and run the example script.

GPT Pretraining

The examples/pretrain_gpt.sh script runs single GPU 345M parameter GPT pretraining. As mentioned above, single GPU training is primarily intended for debugging purposes, as the code is optimized for distributed training.

It follows largely the same format as the previous BERT script with a few notable differences: the tokenization scheme used is BPE (which requires a merge table and a json vocabulary file) instead of WordPiece, the model architecture allows for longer sequences (note that the max position embedding must be greater than or equal to the maximum sequence length), and the --lr-decay-style has been set to cosine decay. Note that the --data-path now includes the additional _text_document suffix added in preprocessing, but does not include the file extensions.
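The GPT-specific portion of the argument list differs roughly as follows (a sketch; values are illustrative):

       --vocab-file gpt2-vocab.json \
       --merge-file gpt2-merges.txt \
       --tokenizer-type GPT2BPETokenizer \
       --seq-length 1024 \
       --max-position-embeddings 1024 \
       --lr-decay-style cosine \
       --data-path my-gpt2_text_document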

Further command line arguments are described in the source file arguments.py.

examples/pretrain_gpt.sh can be launched the same way as described for BERT. Set the env vars and make any other modifications, launch the container with appropriate mounts, and run the script.

T5 Pretraining

Very similar to BERT and GPT, the examples/pretrain_t5.sh script runs single GPU "base" (~220M parameter) T5 pretraining. The primary difference from BERT and GPT is the addition of the following arguments to accommodate the T5 architecture:

  • --kv-channels sets the inner dimension of the "key" and "value" matrices of all attention mechanisms in the model. For BERT and GPT this defaults to the hidden size divided by the number of attention heads, but can be configured for T5.

  • --ffn-hidden-size sets the hidden size in the feed-forward networks within a transformer layer. For BERT and GPT this defaults to 4 times the transformer hidden size, but can be configured for T5.

  • --encoder-seq-length and --decoder-seq-length set the sequence length for the encoder and decoder separately.

All of the other arguments remain as they were for BERT and GPT pretraining. Run this example with the same steps described above for the other scripts.
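As an illustration, the T5-specific additions look roughly like the following; the values correspond to a ~220M "base" configuration and are not guaranteed to match examples/pretrain_t5.sh exactly:

       --kv-channels 64 \
       --ffn-hidden-size 3072 \
       --encoder-seq-length 512 \
       --decoder-seq-length 128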

Distributed Pretraining

The examples/pretrain_{bert,gpt,t5}_distributed.sh scripts use the PyTorch distributed launcher for distributed training. As such, multi-node training can be achieved by properly setting environment variables. See the official PyTorch documentation for further description of these environment variables. By default, multi-node training uses the nccl distributed backend. A simple set of additional arguments and the use of the PyTorch distributed module with the torchrun elastic launcher (equivalent to python -m torch.distributed.run) are the only additional requirements to adopt distributed training. See any of examples/pretrain_{bert,gpt,t5}_distributed.sh for more details.
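A minimal multi-node launch sketch with torchrun; the node count, rank, address, and port are placeholders for your own cluster:

torchrun --nproc_per_node 8 \
         --nnodes 2 \
         --node_rank 0 \
         --master_addr <master node hostname or IP> \
         --master_port 6000 \
         pretrain_gpt.py \
         <same model and data arguments as in single-GPU training>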

We use two types of parallelism: data and model parallelism. We facilitate two distributed data parallel implementations: a simple one of our own that performs gradient all-reduce at the end of the back propagation step, and Torch's distributed data parallel wrapper that overlaps gradient reduction with back propagation computation. To switch between these two options use --DDP-impl local or --DDP-impl torch, respectively. As expected, Torch distributed data parallelism is more efficient at larger model sizes. For example, for the 8.3 billion parameter model running on 512 GPUs, the scaling increases from 60% to 76% when Torch's distributed data parallel is used. However, the overlapping method requires more memory and for some configurations (e.g., 2.5 billion parameters using 2-way model parallelism and 1.2 billion parameters with no model parallelism) can make the overall training slower as a result. We empirically found that using a smaller model in those cases improves the training time.

Second, we developed a simple and efficient two-dimensional model-parallel approach. To use tensor model parallelism (splitting execution of a single transformer module over multiple GPUs, see Section 3 of our paper), add the --tensor-model-parallel-size flag to specify the number of GPUs among which to split the model, along with the arguments passed to the distributed launcher as mentioned above. To use sequence parallelism, specify --sequence-parallel, which requires tensor model parallelism because the sequence is split among the same GPUs (more details in Section 4.2.2 of our paper).

To use pipeline model parallelism (sharding the transformer modules into stages with an equal number of transformer modules on each stage, and then pipelining execution by breaking the batch into smaller microbatches, see Section 2.2 of our paper), use the --pipeline-model-parallel-size flag to specify the number of stages to split the model into (e.g., splitting a model with 24 transformer layers across 4 stages means each stage gets 6 transformer layers).
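For example, combining the flags above to split a model 8 ways with tensor parallelism and 4 ways with pipeline parallelism, with sequence parallelism enabled, would look roughly like this (added to the pretraining arguments):

       --tensor-model-parallel-size 8 \
       --pipeline-model-parallel-size 4 \
       --sequence-parallel \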

We have examples of how to use these two different forms of model parallelism in the example scripts ending in distributed_with_mp.sh.

Other than these minor changes, the distributed training is identical to the training on a single GPU.

The interleaved pipelining schedule (more details in Section 2.2.2 of our paper) can be enabled using the --num-layers-per-virtual-pipeline-stage argument, which controls the number of transformer layers in a virtual stage (by default with the non-interleaved schedule, each GPU will execute a single virtual stage with NUM_LAYERS / PIPELINE_MP_SIZE transformer layers). The total number of layers in the transformer model should be divisible by this argument value. Additionally, the number of microbatches in the pipeline (computed as GLOBAL_BATCH_SIZE / (DATA_PARALLEL_SIZE * MICRO_BATCH_SIZE)) should be divisible by the PIPELINE_MP_SIZE when using this schedule (this condition is checked in an assertion in the code). The interleaved schedule is not supported for pipelines with 2 stages (PIPELINE_MP_SIZE=2).
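As an illustrative example, a 24-layer model with PIPELINE_MP_SIZE=4 and two layers per virtual stage gives each GPU three virtual stages (24 / 4 / 2):

       --pipeline-model-parallel-size 4 \
       --num-layers 24 \
       --num-layers-per-virtual-pipeline-stage 2 \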

Activation Checkpointing and Recomputation

To reduce GPU memory usage when training a large model, we support activation checkpointing and recomputation. We support two levels of recompute granularity: selective and full. Selective recomputation is the default and is recommended in almost all cases. It saves the activations that take less space and are expensive to recompute, and recomputes the activations that take more space but are relatively cheap to recompute (see our paper for details). To enable selective activation recompute, simply use --recompute-activations.

For cases where memory is very tight, full checkpointing saves just the inputs to a transformer layer, or a block of transformer layers, and recomputes everything else. To turn on full activation recompute use --recompute-granularity full. When using full activation recomputation, there are two methods: uniform and block, chosen using the --recompute-method argument.

  • The uniform method uniformly divides the transformer layers into groups of layers and stores the input activations of each group in memory. The baseline group size is 1; in this case, the input activation of each transformer layer is checkpointed. When GPU memory is insufficient, increasing the number of layers per group reduces memory usage, enabling a bigger model to run. For example, with 4 layers per group, the input activation of each group of 4 transformer layers is checkpointed.

  • The block method checkpoints the input activations of a set number of individual transformer layers per pipeline stage and leaves the remaining layers without any checkpointing. It can be used to skip checkpointing some transformer layers until the GPU memory is fully used, which is applicable only when there is unused GPU memory. Checkpointing fewer transformer layers avoids unnecessary activation recomputation in the backward pass and thus improves training performance. For example, when we specify 5 layers to checkpoint out of 8 layers per pipeline stage, the input activations of only the first 5 transformer layers are checkpointed, and activation recomputation is not needed for the remaining 3 layers in the backward pass.
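A combined sketch of the full-recompute flags described above, checkpointing 5 layers per pipeline stage with the block method; the layer count is illustrative, and the --recompute-num-layers argument is assumed to be present in your Megatron version:

       --recompute-granularity full \
       --recompute-method block \
       --recompute-num-layers 5 \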

Distributed Optimizer

Usage: --use-distributed-optimizer. Compatible with all model and data types.

The distributed optimizer is a memory savings technique, whereby the optimizer state is evenly distributed across data parallel ranks (versus the traditional method of replicating the optimizer state across data parallel ranks). As described in ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, our implementation distributes all optimizer state that does not overlap with the model state. For example, when using fp16 model params, the distributed optimizer maintains its own separate copy of fp32 main params & grads, which are distributed across DP ranks. When using bf16 model params, however, the distributed optimizer's fp32 main grads are the same as the model's fp32 grads, and so the grads in this case are not distributed (although the fp32 main params are still distributed, as they are separate from the bf16 model params).

Theoretical memory savings vary depending on the combination of the model's param dtype and grad dtype. In our implementation, the theoretical number of bytes per parameter is (where 'd' is the data parallel size):

                          Non-distributed optim    Distributed optim
fp16 param, fp16 grads    20                       4 + 16/d
bf16 param, fp32 grads    18                       6 + 12/d
fp32 param, fp32 grads    16                       8 + 8/d
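For example, with bf16 params, fp32 grads, and a data parallel size of d = 8, the distributed optimizer needs 6 + 12/8 = 7.5 bytes per parameter, compared with 18 bytes for the non-distributed optimizer.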

FlashAttention

Usage: --use-flash-attn. Supports attention head dimensions of at most 128.

FlashAttention is a fast and memory-efficient algorithm to compute exact attention. It speeds up model training and reduces memory requirements.

To install FlashAttention:

pip install flash-attn

GPT-3 Example

In examples/pretrain_gpt3_175B.sh we have provided an example of how to configure Megatron to run GPT-3 with 175 billion parameters on 1024 GPUs. The script is designed for slurm with the pyxis plugin but can be easily adapted to any other scheduler. It uses 8-way tensor parallelism and 16-way pipeline parallelism. With the options global-batch-size 1536 and rampup-batch-size 16 16 5859375, the training will start with a global batch size of 16 and linearly increase it to 1536 over 5,859,375 samples in increments of 16. The training dataset can be either a single dataset or multiple datasets combined with a set of weights.

With full global batch size of 1536 on 1024 A100 GPUs, each iteration takes around 32 seconds resulting in 138 teraFLOPs per GPU which is 44% of the theoretical peak FLOPs.

Retro

See:

  • tools/retro/README.md for an overview.
  • tools/retro/examples/get_preprocess_cmd.sh for an example of common preprocessing arguments.
  • tools/retro/examples/preprocess_data.sh for an example of how to preprocess data.
  • tools/retro/examples/pretrain_model.sh for an example of how to pretrain a model.

Retro is a retrieval-enhanced model that is based on GPT. As described in Improving language models by retrieving from trillions of tokens, Retro retrieves from a database of document chunks by performing locality search using a sample's tokens. The retrieval database can be large -- often billions or even trillions of tokens -- and provides a more efficient storage mechanism of factual knowledge, when compared to storing factual knowledge implicitly within the network's parameters.

Using Retro requires two steps: 1) preprocessing the retrieval database and pretraining neighbors, and 2) pretraining a model using this data. Please see tools/retro/README.md for a detailed overview.

Evaluation and Tasks

We provide several command line arguments, detailed in the scripts listed below, to handle various zero-shot and fine-tuned downstream tasks. However, you can also finetune your model from a pretrained checkpoint on other corpora as desired. To do so, simply add the --finetune flag and adjust the input files and training parameters within the original training script. The iteration count will be reset to zero, and the optimizer and internal state will be reinitialized. If the fine-tuning is interrupted for any reason, be sure to remove the --finetune flag before continuing, otherwise the training will start again from the beginning.

Because evaluation requires substantially less memory than training, it may be advantageous to merge a model trained in parallel for use on fewer GPUs in downstream tasks. The following script accomplishes this. This example reads in a GPT model with 4-way tensor and 4-way pipeline model parallelism and writes out a model with 2-way tensor and 2-way pipeline model parallelism.

python tools/checkpoint_util.py \
        --model-type GPT \
        --load-dir checkpoints/gpt3_tp4_pp4 \
        --save-dir checkpoints/gpt3_tp2_pp2 \
        --target-tensor-parallel-size 2 \
        --target-pipeline-parallel-size 2

Several downstream tasks are described for both GPT and BERT models below. They can be run in distributed and model parallel modes with the same changes used in the training scripts.

GPT Text Generation

We have included a simple REST server to use for text generation in tools/run_text_generation_server.py. You run it much like you would start a pretraining job, specifying an appropriate pretrained checkpoint. There are also a few optional parameters: temperature, top-k, and top-p. See --help or the source file for more information. See examples/run_text_generation_server_345M.sh for an example of how to run the server.

Once the server is running you can use tools/text_generation_cli.py to query it; it takes one argument, which is the host the server is running on.

tools/text_generation_cli.py localhost:5000

You can also use curl or any other tool to query the server directly:

curl 'http://localhost:5000/api' -X 'PUT' -H 'Content-Type: application/json; charset=UTF-8'  -d '{"prompts":["Hello world"], "tokens_to_generate":1}'

See megatron/text_generation_server.py for more API options.

Detoxify GPT via Self-generation

We include an example in examples/detxoify_lm/ to detoxify language models by leveraging the generative power of language models.

See examples/detxoify_lm/README.md for step-by-step tutorials on how to perform domain-adaptive training and detoxify LM using self-generated corpus.

GPT Evaluation

We include example scripts for GPT evaluation on WikiText perplexity evaluation and LAMBADA Cloze accuracy.

WikiText Perplexity Evaluation

For even comparison with prior works, we evaluate perplexity on the word-level WikiText-103 test dataset, and appropriately compute perplexity given the change in tokens when using our subword tokenizer.

We use the following command to run WikiText-103 evaluation on a 345M parameter model.

TASK="WIKITEXT103"

VALID_DATA=<wikitext path>.txt
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m

COMMON_TASK_ARGS="--num-layers 24 \
                  --hidden-size 1024 \
                  --num-attention-heads 16 \
                  --seq-length 1024 \
                  --max-position-embeddings 1024 \
                  --fp16 \
                  --vocab-file $VOCAB_FILE"

python tasks/main.py \
       --task $TASK \
       $COMMON_TASK_ARGS \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --activations-checkpoint-method uniform \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng

LAMBADA Cloze Accuracy

To compute LAMBADA cloze accuracy (the accuracy of predicting the last token given the preceding tokens) we utilize a detokenized, processed version of the LAMBADA dataset.

We use the following command to run LAMBADA evaluation on a 345M parameter model. Note that the --strict-lambada flag should be used to require whole word matching. Make sure that lambada is part of the file path.

TASK="LAMBADA"

VALID_DATA=<lambada path>.json
VOCAB_FILE=gpt2-vocab.json
MERGE_FILE=gpt2-merges.txt
CHECKPOINT_PATH=checkpoints/gpt2_345m
COMMON_TASK_ARGS=<same as those in WikiText Perplexity Evaluation above>

python tasks/main.py \
       --task $TASK \
       $COMMON_TASK_ARGS \
       --valid-data $VALID_DATA \
       --tokenizer-type GPT2BPETokenizer \
       --strict-lambada \
       --merge-file $MERGE_FILE \
       --load $CHECKPOINT_PATH \
       --micro-batch-size 8 \
       --activations-checkpoint-method uniform \
       --log-interval 10 \
       --no-load-optim \
       --no-load-rng

Further command line arguments are described in the source file main.py.

BERT Task Evaluation

RACE Evaluation

The following script finetunes the BERT model for evaluation on the RACE dataset. The TRAIN_DATA and VALID_DATA directories contain the RACE dataset as separate .txt files. Note that for RACE, the batch size is the number of RACE queries to evaluate. Since each RACE query has four samples, the effective batch size passed through the model will be four times the batch size specified on the command line.

TRAIN_DATA="data/RACE/train/middle"
VALID_DATA="data/RACE/dev/middle \
            data/RACE/dev/high"
VOCAB_FILE=bert-vocab.txt
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
CHECKPOINT_PATH=checkpoints/bert_345m_race
COMMON_TASK_ARGS="--num-layers 24 \
                  --hidden-size 1024 \
                  --num-attention-heads 16 \
                  --seq-length 512 \
                  --max-position-embeddings 512 \
                  --fp16 \
                  --vocab-file $VOCAB_FILE"

COMMON_TASK_ARGS_EXT="--train-data $TRAIN_DATA \
                      --valid-data $VALID_DATA \
                      --pretrained-checkpoint $PRETRAINED_CHECKPOINT \
                      --activations-checkpoint-method uniform \
                      --save-interval 10000 \
                      --save $CHECKPOINT_PATH \
                      --log-interval 100 \
                      --eval-interval 1000 \
                      --eval-iters 10 \
                      --weight-decay 1.0e-1"

python tasks/main.py \
       --task RACE \
       $COMMON_TASK_ARGS \
       $COMMON_TASK_ARGS_EXT \
       --tokenizer-type BertWordPieceLowerCase \
       --epochs 3 \
       --micro-batch-size 4 \
       --lr 1.0e-5 \
       --lr-warmup-fraction 0.06

MNLI Evaluation

The following script finetunes the BERT model for evaluation with the MultiNLI sentence pair corpus. Because the matching tasks are quite similar, the script can be quickly tweaked to work with the Quora Question Pairs (QQP) dataset as well.

TRAIN_DATA="data/glue_data/MNLI/train.tsv"
VALID_DATA="data/glue_data/MNLI/dev_matched.tsv \
            data/glue_data/MNLI/dev_mismatched.tsv"
PRETRAINED_CHECKPOINT=checkpoints/bert_345m
VOCAB_FILE=bert-vocab.txt
CHECKPOINT_PATH=checkpoints/bert_345m_mnli
COMMON_TASK_ARGS=<same as those in RACE Evaluation above>
COMMON_TASK_ARGS_EXT=<same as those in RACE Evaluation above>

python tasks/main.py \
       --task MNLI \
       $COMMON_TASK_ARGS \
       $COMMON_TASK_ARGS_EXT \
       --tokenizer-type BertWordPieceLowerCase \
       --epochs 5 \
       --micro-batch-size 8 \
       --lr 5.0e-5 \
       --lr-warmup-fraction 0.065

Datasets

We do not host any datasets for GPT or BERT training; however, we detail their collection so that our results may be reproduced.

Collecting Wikipedia Training Data

We recommend following the Wikipedia data extraction process specified by Google research: "the recommended pre-processing is to download the latest dump, extract the text with WikiExtractor.py, and then apply any necessary cleanup to convert it into plain text."

We recommend using the --json argument when using WikiExtractor, which will dump the Wikipedia data into loose json format (one json per line), making it more manageable on the file system and also readily consumable by our codebase. We recommend further preprocessing this json dataset by nltk punctuation standardization. For BERT training, use the --split-sentences flag to preprocess_data.py as described above to include sentence breaks in the produced index. If you'd like to use Wikipedia data for GPT training you should still clean it with nltk/spacy/ftfy, but do not use the --split-sentences flag.
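A sketch of this pipeline is shown below. The dump URL is the standard location of the latest English Wikipedia dump; the WikiExtractor invocation and output layout may differ depending on the version you have installed:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
python WikiExtractor.py --json enwiki-latest-pages-articles.xml.bz2 -o extracted_wiki
cat extracted_wiki/*/wiki_* > wikipedia-loose.json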

Collecting GPT Webtext Data

We utilize the publicly available OpenWebText library from jcpeterson and eukaryote31's work to download URLs. We then filtered, cleaned, and deduplicated all downloaded content according to the procedure described in our openwebtext directory. For Reddit URLs corresponding to content up to October 2018, we arrived at approximately 37GB of content.

Reproducibility

Megatron training is intended to be bitwise reproducible. This means that the same training config run twice in the same HW and SW environment should produce identical model checkpoints, losses and accuracy metric values (iteration time metrics may vary).

There are currently three known Megatron optimizations that break reproducibility whilst still producing almost identical training runs. They are only applicable when using NGC containers >=22.05. The following workarounds should be applied in cases where reproducibility is required:

  1. When training using the --bf16 option the backward pass of torch.nn.functional.embedding is non-deterministic. If reproducibility is required you should also use the option --embedding-weights-in-fp32. The speed and memory impact of this change is negligible.
  2. Also when training using --bf16, reproducibility is only obtained when the checkpointing and resume schedule of training is identical. If the checkpointing schedule will change, i.e. checkpointing and resume will occur at different iterations, the option --no-bias-gelu-fusion should be used.
  3. Flash attention is non-deterministic. If reproducibility is required do not use --use-flash-attn.

These sources of non-determinism are under active investigation. If you observe non-determinism in Megatron training under other circumstances please open an issue.
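Put together, a reproducibility-safe bf16 configuration along the lines of the workarounds above would include the following flags (illustrative), while omitting --use-flash-attn:

       --bf16 \
       --embedding-weights-in-fp32 \
       --no-bias-gelu-fusion \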

megatron-deepspeed's People

Contributors

aklife97, awan-10, borisfom, clumsy, conglongli, deepakn94, ekmb, erhoo82, ericharper, hyunwoongko, jaredcasper, jeffra, jiemingz, jon-barker, kantneel, ksivaman, kvareddy, lekurile, lmcafee-nvidia, maanug-nv, maximumentropy, mosheisland, mpatwary, pxuab, satpalsr, tjruwase, yizhouz, ys950902, yuanwu2017, zliucr


megatron-deepspeed's Issues

Issue loading GPT2 checkpoint: "torch.nn.modules.module.ModuleAttributeError: 'ParallelTransformerLayer' object has no attribute 'self_attention'"

Steps to Reproduce

  1. Download model checkpoints and follow the set-up instructions in the main README.
  2. Run docker container.
  3. Run ./examples/generate_text.sh as instructed here.

Observed behavior

...
>>> done with compiling and loading fused kernels. Compilation time: 0.686 seconds
building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 354871296
 loading checkpoint from /workspace/checkpoints/gpt2_345m at iteration 0
could not find arguments in the checkpoint ...
 checkpoint version 0
Traceback (most recent call last):
  File "tools/generate_samples_gpt.py", line 168, in <module>
    main()
  File "tools/generate_samples_gpt.py", line 127, in main
    _ = load_checkpoint(model, None, None)
  File "/workspace/megatron_deepspeed/megatron/checkpointing.py", line 397, in load_checkpoint
    fix_query_key_value_ordering(model, checkpoint_version)
  File "/workspace/megatron_deepspeed/megatron/checkpointing.py", line 248, in fix_query_key_value_ordering
    fixed_param = _transpose_first_dim(param.data, 3, True, model)
  File "/workspace/megatron_deepspeed/megatron/checkpointing.py", line 205, in _transpose_first_dim
    attention_module = model.language_model.encoder.layers[0].self_attention  # self_attention
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 795, in __getattr__
    raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
torch.nn.modules.module.ModuleAttributeError: 'ParallelTransformerLayer' object has no attribute 'self_attention'

Solution found (please fix if acceptable)

Update megatron/checkpointing.py#L204 as shown below:

# Previous
attention_module = model.language_model.encoder.layers[0].self_attention
# Fixed
attention_module = model.language_model.encoder.layers[0].attention

Expected behavior

Able to load an available pre-trained checkpoint.

System info

Ubuntu 18.04, PyTorch docker (docker pull nvcr.io/nvidia/pytorch:20.12-py3), RTX 3090.

Broken url link

The URL link to the setup doc under the workspace setting is broken.

deepspeed to megatron - mismatch in function definition and call

In tools/convert_checkpoint/deepspeed_to_megatron.py, the function _create_rank_checkpoint is defined (line 81) with four positional arguments and one keyword argument. On line 146 in the same file, it is called with four arguments. However, the tp and pp indices are defined as the 3rd and 4th arguments but provided as the 2nd and 3rd in the call. The boolean args.for_release then gets interpreted as the pp_index, and the actual pp_index as the tp_index.

[Question] How to generate a merge file and a vocab file

I want to use the Megatron framework for Chinese NLP pre-training tasks. Currently, I have Chinese corpus resources and a vocab.txt file. However, for most frameworks, it seems that vocab.json and merge.txt are needed. Can I generate the above two files from Chinese corpus resources? If so, how can I generate them? Sorry, I haven't found a particularly suitable tutorial on Google.

Cannot run the pretrain_gpt example using moe branch

Hi,

I tried the examples (pretrain gpt and gpt with MoE) but failed to run both.

Running the pretrain gpt example shows an error like "Element 1 of tensors does not require grad and does not have a grad_fn"

Running the MoE examples always shows an error saying ep_size is not a valid argument when calling MoE in DeepSpeed (I tried DeepSpeed versions from 0.5.0 to 0.6.1; unfortunately, none works).

Could anyone kindly help me with the issues?

Thanks

How to run bert with deepspeed?

I run pretrain_bert.py with --deepspeed and the following error occurs:

Traceback (most recent call last):
  File "pretrain_bert.py", line 147, in <module>
    args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
  File "/root/Megatron-DeepSpeed/megatron/training.py", line 162, in pretrain
    train_data_iterator, valid_data_iterator)
  File "/root/Megatron-DeepSpeed/megatron/training.py", line 716, in train
    model[0].set_train_batch_size(global_batch_size)
  File "/root/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 948, in __getattr__
    type(self).__name__, name))
AttributeError: 'DeepSpeedEngine' object has no attribute 'set_train_batch_size'

If I just comment out this line, it gives another error:

Traceback (most recent call last):
  File "pretrain_bert.py", line 147, in <module>
    lr_scheduler)
  File "/root/Megatron-DeepSpeed/megatron/training.py", line 396, in train_step
    assert isinstance(model[0], deepspeed.PipelineEngine), model
AssertionError

It seems you modified the GPT code in GPTModelPipe class to make it work with DeepSpeed, should we do the same to the Bert model?

Load moe checkpoint in generate_text.sh

Hi,
I'm trying to reproduce a PR-MoE model. The training process seemed to go smoothly, but when it came to inference I encountered some problems:
In this commit, --load $CHECKPOINT_PATH has been deleted from examples/generate_text.sh
So the checkpoint will not be loaded at all? How do I load the weights to generate text? Thanks!

The FLOPS per GPU reported for the Megatron GPT model by the DeepSpeed Flops Profiler is much lower than that reported in the logs when we run pretrain_gpt.py

Sample output for the GPT 18.4B parameter model:

The profiler reports FLOPS per GPU as 13.36 TFLOPS, whereas the log prints FLOPS per GPU as 125.18 TFLOPs.
The profiler prints samples/s as 49.55 and the log prints 50.510, which roughly match.
Is the FLOPS per GPU computation by the profiler incorrect?

0: -------------------------- DeepSpeed Flops Profiler --------------------------
0: Profile Summary at step 2:
0: Notations:
0: data parallel size (dp_size), model parallel size(mp_size),
0: number of parameters (params), number of multiply-accumulate operations(MACs),
0: number of floating-point operations (flops), floating-point operations per second (FLOPS),
0: fwd latency (forward propagation latency), bwd latency (backward propagation latency),
0: step (weights update latency), iter latency (sum of fwd, bwd and step latency)
0:
0: world size: 128
0: data parallel size: 16
0: model parallel size: 8
0: batch size per GPU: 8
0: params per gpu: 2318.53 M
0: params of model = params per GPU * mp_size: 18548.24 M
0: fwd MACs per GPU: 45999.1 GMACs
0: fwd flops per GPU: 91998.2 G
0: fwd flops of model = fwd flops per GPU * mp_size: 735985.6 G
0: fwd latency: 5.7 s
0: fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 16.15 TFLOPS
0: bwd latency: 14.55 s
0: bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 12.64 TFLOPS
0: fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 13.63 TFLOPS
0: step latency: 413.55 ms
0: iter latency: 20.66 s
0: FLOPS per GPU = 3 * fwd flops per GPU / iter latency: 13.36 TFLOPS
0: samples/second: 49.55

0:

iteration 50/ 50 | consumed samples: 51200 | consumed tokens: 104857600 | elapsed time per iteration (ms): 20273.0 | learning rate: 9.524E-05 | global batch size: 1024 | lm loss: 7.945235E+00 | loss scale: 4096.0 | number of skipped iterations: 0 | number of nan iterations: 0 | samples per second: 50.510 | TFLOPs: 125.18 |
time (ms) | forward-compute: 5673.73 | backward-compute: 14185.10 | backward-embedding-all-reduce: 0.01 | optimizer: 412.54 | batch-generator: 10.72

Issue generating text with GPT: "KeyError: 50284"

Steps to Reproduce

  1. Download model checkpoints and follow the set-up instructions in the main README.
  2. Run docker container.
  3. Fix the bug as instructed here.
  4. Run ./examples/generate_text.sh as instructed here.

Observed behavior

...
Traceback (most recent call last):
  File "tools/generate_samples_gpt.py", line 168, in <module>
    main()
  File "tools/generate_samples_gpt.py", line 144, in main
    generate_and_write_samples_unconditional(model, latencies, single_token_latency, model_latencies)
  File "/workspace/megatron_deepspeed/megatron/text_generation_utils.py", line 378, in generate_and_write_samples_unconditional
    for datum in generate_samples_unconditional(model, latencies=latencies, model_latencies=model_latencies, single_token_latency=single_token_latency):
  File "/workspace/megatron_deepspeed/megatron/text_generation_utils.py", line 356, in generate_samples_unconditional
    text = tokenizer.detokenize(tokens)
  File "/workspace/megatron_deepspeed/megatron/tokenizer/tokenizer.py", line 287, in detokenize
    return self.tokenizer.decode(token_ids)
  File "/workspace/megatron_deepspeed/megatron/tokenizer/gpt2_tokenization.py", line 288, in decode
    text = ''.join([self.decoder[token] for token in tokens])
  File "/workspace/megatron_deepspeed/megatron/tokenizer/gpt2_tokenization.py", line 288, in <listcomp>
    text = ''.join([self.decoder[token] for token in tokens])
KeyError: 50284

Solution found (please fix if acceptable)

This error is due to a non-recognized token. A fix has been found by updating megatron/tokenizer/gpt2_tokenization.py#L284 as shown below:

# Previous
text = ''.join([self.decoder[token] for token in tokens])

# Fixed
underscore_token_idx = 62
tokens_processed = [token if token in self.decoder else underscore_token_idx for token in tokens]
text = ''.join([self.decoder[token] for token in tokens_processed])

Expected behavior

Able to generate unconditional text.

System info

Ubuntu 18.04, PyTorch docker (docker pull nvcr.io/nvidia/pytorch:20.12-py3), RTX 3090.

Question for usage of DeepSpeed transformer kernels

Hi, is there a plan to use DeepSpeed transformer kernels in the Megatron-DeepSpeed GPT model? We observe the following:

  1. There are transformer training kernels in DeepSpeed: https://github.com/microsoft/DeepSpeed/tree/master/csrc/transformer
  2. In DeepSpeed there is replace_transformer_layer, but it is called in the inference engine only, not the training engine.

Is there a plan to use DeepSpeed transformer kernels in this model? What is the recommended way to use DeepSpeed transformer kernels in the Megatron GPT model?

megatron-deepspeed layernorm has different output compared with megatron-lm?

I tested megatron-deepspeed and megatron-lm with the same dataset and set all hyperparameters the same, but got different convergence curves.
When I compared the two frameworks' inputs and outputs layer by layer, I found that before a certain step (for example, step 11) they have the same outputs. But at that step, in one transformer layer's layer_norm, they get the same input but produce different outputs.
My megatron-ds branch is moe and the commit is 74285aa.
The megatron-lm tag is v2.4.

RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7fbb8f4817f0>

I just use pretrain_gpt.py but receive the following problem.
Here are my script and library versions:
script:
#! /bin/bash
set -e

# Change for multinode config

logname=$(date +'%Y-%m-%d_%H:%M:%S')
if [ -n "$1" ]; then
logname=$1
fi

MP_SIZE=4
NUM_WORKERS=1
NUM_GPUS_PER_WORKER=4
HIDDEN_SIZE=3072
NUM_ATTN_HEADS=24
NUM_LAYERS=40
BATCHSIZE=16
DATA_PATH=$(cat data_augment.txt)
VOCAB_PATH=vocab.txt
MERGE_PATH=none.txt
CHECKPOINT_PATH=./checkpoints/${logname}

script_path=$(realpath $0)
script_dir=$(dirname $script_path)
config_json="$script_dir/new.json"
#config_json="wyp.json"

# offloads to NVMe

#config_json="$script_dir/ds_zero_stage_infinity_config.json"

#ZeRO Configs
stage=3
reduce_scatter=true
contigious_gradients=true
rbs=50000000
agbs=5000000000

#Activation Checkpointing and Contigious Memory
chkp_layers=1
PA=true
PA_CPU=true
CC=true
SYNCHRONIZE=true
PROFILE=false

# TiledLinear splits, 0 is disable

TILED_LINEAR="true"
TILE_DIM=1

# Megatron Model Parallelism

LOGDIR="tensorboard/failed/${logname}"

#--cpu-optimizer
#--save $CHECKPOINT_PATH \

# --load $CHECKPOINT_PATH \

gpt_options="
--cpu-optimizer
--save $CHECKPOINT_PATH
--load $CHECKPOINT_PATH
--tensor-model-parallel-size ${MP_SIZE}
--num-layers $NUM_LAYERS
--hidden-size $HIDDEN_SIZE
--num-attention-heads ${NUM_ATTN_HEADS}
--seq-length 2048
--max-position-embeddings 2048
--micro-batch-size $BATCHSIZE
--train-iters 1500
--train-tokens 1000000000
--data-path $DATA_PATH
--vocab-file $VOCAB_PATH
--merge-file $MERGE_PATH
--data-impl mmap
--split 98,0,2
--tokenizer-type EncDecTokenizer
--distributed-backend nccl
--lr 1.5e-4
--lr-decay-style cosine
--min-lr 1.0e-5
--weight-decay 1e-2
--clip-grad 1.0
--lr-warmup-fraction 0.01
--checkpoint-activations
--log-interval 1
--save-interval 10000
--eval-interval 2000
--eval-iters 10
--fp16
--scattered-embeddings
--split-transformers
--tensorboard-dir ${LOGDIR}
--no-pipeline-parallel
"
deepspeed_options="
--deepspeed
--deepspeed_config ${config_json}
--zero-stage ${stage}
--zero-reduce-bucket-size ${rbs}
--zero-allgather-bucket-size ${agbs}
--remote-device cpu
"

if [ "${contigious_gradients}" = "true" ]; then
deepspeed_options="${deepspeed_options}
--zero-contigious-gradients"
fi

if [ "${reduce_scatter}" = "true" ]; then
deepspeed_options="${deepspeed_options}
--zero-reduce-scatter"
fi

chkp_opt="
--deepspeed-activation-checkpointing
--checkpoint-num-layers ${chkp_layers}"

if [ "${PA}" = "true" ]; then
chkp_opt="${chkp_opt} --partition-activations"
fi

if [ "${PA_CPU}" = "true" ]; then
chkp_opt="${chkp_opt}
--checkpoint-in-cpu"
fi

if [ "${SYNCHRONIZE}" = "true" ]; then
chkp_opt="${chkp_opt}
--synchronize-each-layer"
fi

if [ "${CC}" = "true" ]; then
chkp_opt="${chkp_opt}
--contigious-checkpointing"
fi

if [ "${PROFILE}" = "true" ]; then
chkp_opt="${chkp_opt}
--profile-backward"
fi

if [ "${TILED_LINEAR}" = "true" ]; then
tile_opt="${tile_opt}
--memory-centric-tiled-linear
--tile-factor=${TILE_DIM}"
fi

full_options="${gpt_options} ${deepspeed_options} ${chkp_opt} ${tile_opt}"
run_cmd="deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} pretrain_gpt.py ${full_options} |tee log/failed${logname}.log"

#if [ $(cat wlog/$(log_date).log | tail -1) == "end" ]; then

rm wlog/$(log_date).log

#fi

echo ${run_cmd}
eval ${run_cmd}
mv log/failed${logname}.log log/${logname}.log
mv ${LOGDIR} tensorboard/${logname}
set +x

Then here are my library versions:
torch=1.12.1+cu113
deepspeed=0.9.1dev0 (built from GitHub)
apex=0.1

MoE Checkpoint size

Hi team,

I'm training an MoE model using one of your example scripts (ds_pretrain_gpt_125M_MoE64.sh). The checkpoints for the model, especially the expert checkpoints, are larger than I think they should be. In particular, the experts have shapes [3072, 768] and [768, 3072] with dtype=fp16, which should take about 3072*768*2*2 ≈ 9 MB to store. But the checkpoints are around 400 MB. Is there something that I'm missing?

Megatron code incompatibility with pytorch 2.0

It looks like the torch._six module has been removed in pytorch 2.0. To make Megatron compatible with pytorch 2.0, the references to this module will need to be removed from the Megatron optimizer code.

Here is the traceback:
Traceback (most recent call last):
  File "pretrain_gpt.py", line 28, in <module>
    from megatron.training import pretrain
  File "/mnt/azureml/cr/j/d81be8a5d3ae4306b72803cebea562af/exe/wd/megatron/training.py", line 43, in <module>
    from megatron.optimizer import get_megatron_optimizer
  File "/mnt/azureml/cr/j/d81be8a5d3ae4306b72803cebea562af/exe/wd/megatron/optimizer/__init__.py", line 23, in <module>
    from .optimizer import Float16OptimizerWithFloat16Params, FP32Optimizer
  File "/mnt/azureml/cr/j/d81be8a5d3ae4306b72803cebea562af/exe/wd/megatron/optimizer/optimizer.py", line 30, in <module>
    from .clip_grads import clip_grad_norm_fp32, count_zeros_fp32
  File "/mnt/azureml/cr/j/d81be8a5d3ae4306b72803cebea562af/exe/wd/megatron/optimizer/clip_grads.py", line 19, in <module>
    from torch._six import inf
ModuleNotFoundError: No module named 'torch._six'

Issues with DeepSpeed optimizer and tensor parallelism when changing topology between machines

Description:
While using DeepSpeed, we have observed that the optimizer records the topology of all GPUs and tensor parallelism records the parallelism of the cards. When we try to train on a machine with a different topology, such as modifying the TP, PP parameters, etc., we receive an error. We would like to know how to resolve this issue or if we need to manually implement a topology conversion method.

Steps to reproduce:
Use DeepSpeed optimizer with tensor parallelism on a machine with a certain topology.
Attempt to switch to a different machine with a different topology, such as modifying the TP, PP parameters.
Observe the error message.

Expected behavior:
The DeepSpeed optimizer should be able to adjust to changes in topology between machines without errors.

Actual behavior:
Errors occur when attempting to train on a machine with a different topology.

Possible solution:
We could manually implement a topology conversion method to avoid these errors. Alternatively, we could explore ways to modify the DeepSpeed optimizer to handle topology changes more seamlessly.

[BUG] the gpt model cannot run in specified container

On a V100 GPU, I generated the dataset with the provided method successfully, but running examples/pretrain_gpt.sh failed in the specified container (nvcr.io/nvidia/pytorch:20.12-py3). The following is the error message:
" RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn".

How can I solve this bug?

Website documentation is incoherent with the repository content

Hi,

This repository is pointed to by this doc page, yet the paths/scripts referenced in the docs are inconsistent with it. Instead, the doc references scripts and paths found in the old megatron example repository, which is marked as deprecated. Is there any plan to update the docs to make them consistent with this repository? It's a bit confusing to know which files were modified since the original modified files are nowhere to be found (there are variations, but it can still be confusing, especially when the folder they should be in doesn't exist anymore).

How to run use moe on T5?

When I use MoE in t5_model.py (one card)

        self.language_model, self._language_model_key = get_language_model(
            num_tokentypes=num_tokentypes,
            add_pooler=False,
            add_decoder=True,
            encoder_attn_mask_type=AttnMaskType.padding,
            init_method=init_method,
            scaled_init_method=scaled_init_method,
            num_experts=[2 for _ in range(len_exp)])

There's no gradient in the experts. The error occurs:

File "/Megatron-DeepSpeed/megatron/optimizer/optimizer.py", line 308, in _copy_model_grads_to_main_grads
    main_param.grad = model_param.main_grad.float()
AttributeError: 'Parameter' object has no attribute 'main_grad'

And for distributed fp16:

    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.

without fp16:

Traceback (most recent call last):
  File "./pretrain_t5.py", line 150, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/Megatron-DeepSpeed/megatron/training.py", line 170, in pretrain
    iteration = train(forward_step_func,
  File "/Megatron-DeepSpeed/megatron/training.py", line 945, in train
    train_step(forward_step_func,
  File "/Megatron-DeepSpeed/megatron/training.py", line 597, in train_step
    update_successful, grad_norm, num_zeros_in_grad = optimizer.step()
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/Megatron-DeepSpeed/megatron/optimizer/optimizer.py", line 505, in step
    grad_norm = self.clip_grad_norm(self.clip_grad)
  File "/Megatron-DeepSpeed/megatron/optimizer/optimizer.py", line 91, in clip_grad_norm
    return clip_grad_norm_fp32(params, clip_grad)
  File "/Megatron-DeepSpeed/megatron/optimizer/clip_grads.py", line 61, in clip_grad_norm_fp32
    grad = param.grad.detach()
AttributeError: 'NoneType' object has no attribute 'detach'

AttributeError: module 'transformer_inference' has no attribute 'layer_norm_fp16'

Hello,

I'm trying to execute Megatron-DeepSpeed/examples/generate_text.sh
But it shows the error below.

    new_module = transformer_inference.DeepSpeedMoEInference(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/moe_inference.py", line 308, in __init__
    self.ds_layernorm = inference_cuda_module.layer_norm_fp16 if self.config.fp16 or self.config.q_int8 else \
AttributeError: module 'transformer_inference' has no attribute 'layer_norm_fp16'

It seems the error is that layer_norm_fp16 is missing from the transformer_inference module.

Is there any way to solve this problem?
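
One way to check what the locally JIT-built op actually exports (a sketch; it assumes your DeepSpeed version still provides the InferenceBuilder op builder that moe_inference.py relies on):

from deepspeed.ops.op_builder import InferenceBuilder

# JIT-compile (if needed) and load the transformer_inference extension,
# then list the layer-norm symbols it exposes.
inference_cuda_module = InferenceBuilder().load()
print([name for name in dir(inference_cuda_module) if "layer_norm" in name])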

Invalid syntax error when unpacking *moe_losses in Python 3.7

I am trying to use the new MoE support from DeepSpeed 0.5.8 on a system with Python 3.7.11. However, I get "invalid syntax" errors for all of the statements like:

75: 3:   File "/path/to/megatron/model/language_model.py", line 408
75: 3:     return encoder_output, pooled_output, *moe_losses
75: 3:                                           ^
75: 3: SyntaxError: invalid syntax

It seems like I can work around those errors with a tuple(...) statement. I hit this problem in at least the following places.

diff --git a/megatron/model/gpt_model.py b/megatron/model/gpt_model.py
index 4eb983c..1efa1da 100644
--- a/megatron/model/gpt_model.py
+++ b/megatron/model/gpt_model.py
@@ -124,15 +124,23 @@ class GPTModel(MegatronModule):
             get_key_value=get_key_value)
 
         if self.post_process:
-            return post_language_model_processing(
+            #return post_language_model_processing(
+            #    lm_output, labels,
+            #    self.word_embeddings_weight(),
+            #    get_key_value,
+            #    self.parallel_output,
+            #    forward_method_parallel_output,
+            #    self.fp16_lm_cross_entropy), *moe_losses
+            return tuple(post_language_model_processing(
                 lm_output, labels,
                 self.word_embeddings_weight(),
                 get_key_value,
                 self.parallel_output,
                 forward_method_parallel_output,
-                self.fp16_lm_cross_entropy), *moe_losses
+                self.fp16_lm_cross_entropy), *moe_losses)
         else:
-            return lm_output, *moe_losses
+            #return lm_output, *moe_losses
+            return tuple(lm_output, *moe_losses)
 
     def state_dict_for_save_checkpoint(self, destination=None, prefix='',
                                        keep_vars=False):
diff --git a/megatron/model/language_model.py b/megatron/model/language_model.py
index cb27498..e7853e6 100644
--- a/megatron/model/language_model.py
+++ b/megatron/model/language_model.py
@@ -405,9 +405,11 @@ class TransformerLanguageModel(MegatronModule):
         # similarity between two sequences by average pooling
         if not self.add_decoder or output_enc_hidden:
             if self.add_pooler and self.post_process:
-                return encoder_output, pooled_output, *moe_losses
+                #return encoder_output, pooled_output, *moe_losses
+                return tuple(encoder_output, pooled_output, *moe_losses)
             else:
-                return encoder_output, *moe_losses
+                #return encoder_output, *moe_losses
+                return tuple(encoder_output, *moe_losses)
 
         # Decoder Embedding
         dec_embedding_output = self.embedding(dec_input_ids,
@@ -421,9 +423,11 @@ class TransformerLanguageModel(MegatronModule):
                                       enc_dec_attn_mask=enc_dec_attn_mask)
 
         if self.add_pooler and self.post_process:
-            return decoder_output, encoder_output, pooled_output, *moe_losses
+            #return decoder_output, encoder_output, pooled_output, *moe_losses
+            return tuple(decoder_output, encoder_output, pooled_output, *moe_losses)
         else:
-            return decoder_output, encoder_output, *moe_losses
+            #return decoder_output, encoder_output, *moe_losses
+            return tuple(decoder_output, encoder_output, *moe_losses)
 
     def state_dict_for_save_checkpoint(self, destination=None, prefix='',
                                        keep_vars=False):
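
For reference, the star-unpacking only trips the parser when it appears in a bare return statement; wrapping the returned values in parentheses already parses on Python 3.7, whereas tuple(a, b, *c) passes several positional arguments to tuple() and fails at runtime. A standalone sketch of the syntax difference (hypothetical function names, not a patch against the repo):

def forward_py37(encoder_output, pooled_output, moe_losses):
    # Unpacking inside an explicit tuple display parses on Python 3.5+.
    return (encoder_output, pooled_output, *moe_losses)

# def forward_py38_only(encoder_output, pooled_output, moe_losses):
#     return encoder_output, pooled_output, *moe_losses  # SyntaxError before Python 3.8

# Note: tuple(encoder_output, pooled_output, *moe_losses) raises
# "TypeError: tuple expected at most 1 argument" because tuple() takes one iterable.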

ModuleNotFoundError: No module named 'lm_eval.datasets.coqa'

I followed exactly what this guide said: https://github.com/microsoft/Megatron-DeepSpeed/blob/gpt3_with_pile_train_eval/examples/MoE/readme_evalharness.md#prerequisites

but it looks like there is a problem with the dataset:

Traceback (most recent call last):
  File "../../tasks/eval_harness/evaluate.py", line 9, in <module>
    from lm_eval import evaluator, tasks, utils
  File "/home/xiaoxiawu/.local/lib/python3.8/site-packages/lm_eval/evaluator.py", line 7, in <module>
    import lm_eval.tasks
                          File "/home/xiaoxiawu/.local/lib/python3.8/site-packages/lm_eval/tasks/__init__.py", line 10, in <module>
    from . import coqa
  File "/home/xiaoxiawu/.local/lib/python3.8/site-packages/lm_eval/tasks/coqa.py", line 14, in <module>
    import lm_eval.datasets.coqa.coqa
ModuleNotFoundError: No module named 'lm_eval.datasets.coqa'

If we comment out "from . import coqa" in "/home/xiaoxiawu/.local/lib/python3.8/site-packages/lm_eval/tasks/__init__.py", we run into the same problem with other datasets such as lambada and more. I wonder what the problem is.
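
A quick sanity check that may narrow this down (a sketch; it only assumes the lm-eval package is importable). If the installed release does not ship an lm_eval.datasets subpackage, which is what the error suggests, comparing the installed version and package contents against what the readme's prerequisites expect should confirm a version mismatch:

import importlib.util
import pkgutil

import lm_eval

# Installed version (if the package exposes one) and top-level submodules.
print(getattr(lm_eval, "__version__", "unknown"))
print(sorted(m.name for m in pkgutil.iter_modules(lm_eval.__path__)))
# None here means the lm_eval.datasets subpackage is genuinely absent.
print(importlib.util.find_spec("lm_eval.datasets"))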

how can I use the cpu_offload?

How can I use cpu_offload, especially on an MoE model? I want to try the MoE model with ZeRO-Offload.
Is there any ZeRO-Offload guide in the readme? I've never seen one... T-T
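
Below is a minimal sketch of a DeepSpeed config that turns on ZeRO-Offload by moving optimizer states to CPU (assuming a DeepSpeed version that accepts the offload_optimizer section; ZeRO stage 3 additionally supports offload_param). It only writes the JSON file that the training script's --deepspeed_config flag points to; how well this combines with MoE expert parallelism is not something verified here:

import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # ZeRO-Offload: keep optimizer states (and the optimizer step) on CPU.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

with open("ds_config_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)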

The process is stuck at this step: compiling and loading fused kernels ...

I'm running examples/run_deepspeed_example.sh, modified as follows:

#!/bin/bash
set -ex

CUDA_VISBLE_DEVICES=0,2,3,4
BASE_PATH=./
DATA_PATH=./
DS_CONFIG=ds_config.json

TP=1
PP=1
NLAYERS=24
HIDDEN=512

GLOBAL_BATCH=32
MICRO_BATCH=4

ZERO_STAGE=1

OUTPUT_DIR=ds_z${ZERO_STAGE}_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
#OUTPUT_DIR=baseline_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
mkdir -p $OUTPUT_DIR

cat <<EOT > $DS_CONFIG
{
  "train_batch_size" : $GLOBAL_BATCH,
  "train_micro_batch_size_per_gpu": $MICRO_BATCH,
  "steps_per_print": 1,

  "zero_optimization": {
    "stage": $ZERO_STAGE
  },

  "fp16": {
    "enabled": true,
    "initial_scale_power": 12
  },

  "wall_clock_breakdown" : true
}
EOT

export NCCL_DEBUG=warn

ds_args=""
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"


deepspeed pretrain_gpt.py \
    --tensor-model-parallel-size $TP \
    --pipeline-model-parallel-size $PP \
    --num-layers $NLAYERS \
    --hidden-size $HIDDEN \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --loss-scale 12 \
    --max-position-embeddings 1024 \
    --micro-batch-size 4 \
    --global-batch-size 32 \
    --train-iters 1000 \
    --lr 6.0e-5 \
    --min-lr 6.0e-6 \
    --lr-decay-style cosine \
    --log-interval 1 \
    --eval-iters 40 \
    --eval-interval 1000 \
    --data-path $DATA_PATH \
    --vocab-file $BASE_PATH/gpt2-vocab.json \
    --merge-file $BASE_PATH/gpt2-merges.txt \
    --save-interval 1000 \
    --split 98,2,0 \
    --clip-grad 1.0 \
    --weight-decay 0.1 \

The output of the program is as follows. It seems to be stuck, and there is no new output for a long time:

+ CUDA_VISBLE_DEVICES=0,2,3,4
+ BASE_PATH=./
+ DATA_PATH=./
+ DS_CONFIG=ds_config.json
+ TP=1
+ PP=1
+ NLAYERS=24
+ HIDDEN=512
+ GLOBAL_BATCH=32
+ MICRO_BATCH=4
+ ZERO_STAGE=1
+ OUTPUT_DIR=ds_z1_nl24_hs512_gb32_mb4
+ mkdir -p ds_z1_nl24_hs512_gb32_mb4
+ cat
+ export NCCL_DEBUG=warn
+ NCCL_DEBUG=warn
+ ds_args=
+ ds_args=' --deepspeed '
+ ds_args=' --no-pipeline-parallel  --deepspeed '
+ ds_args=' --deepspeed_config=ds_config.json  --no-pipeline-parallel  --deepspeed '
+ ds_args=' --zero-stage=1  --deepspeed_config=ds_config.json  --no-pipeline-parallel  --deepspeed '
+ ds_args=' --deepspeed-activation-checkpointing  --zero-stage=1  --deepspeed_config=ds_config.json  --no-pipeline-parallel  --deepspeed '
+ deepspeed pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 24 --hidden-size 512 --num-attention-heads 16 --seq-length 1024 --loss-scale 12 --max-position-embeddings 1024 --micro-batch-size 4 --global-batch-size 32 --train-iters 1000 --lr 6.0e-5 --min-lr 6.0e-6 --lr-decay-style cosine --log-interval 1 --eval-iters 40 --eval-interval 1000 --data-path ./ --vocab-file .//gpt2-vocab.json --merge-file .//gpt2-merges.txt --save-interval 1000 --split 98,2,0 --clip-grad 1.0 --weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.95 --init-method-std 0.006 --fp16 --checkpoint-activations --tensorboard-dir ds_z1_nl24_hs512_gb32_mb4 --deepspeed-activation-checkpointing --zero-stage=1 --deepspeed_config=ds_config.json --no-pipeline-parallel --deepspeed --exit-interval 1000
+ tee ds_z1_nl24_hs512_gb32_mb4/output.log
[2022-11-10 09:45:41,839] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2022-11-10 09:45:42,016] [INFO] [runner.py:457:main] cmd = /opt/conda/bin/python3.8 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 24 --hidden-size 512 --num-attention-heads 16 --seq-length 1024 --loss-scale 12 --max-position-embeddings 1024 --micro-batch-size 4 --global-batch-size 32 --train-iters 1000 --lr 6.0e-5 --min-lr 6.0e-6 --lr-decay-style cosine --log-interval 1 --eval-iters 40 --eval-interval 1000 --data-path ./ --vocab-file .//gpt2-vocab.json --merge-file .//gpt2-merges.txt --save-interval 1000 --split 98,2,0 --clip-grad 1.0 --weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.95 --init-method-std 0.006 --fp16 --checkpoint-activations --tensorboard-dir ds_z1_nl24_hs512_gb32_mb4 --deepspeed-activation-checkpointing --zero-stage=1 --deepspeed_config=ds_config.json --no-pipeline-parallel --deepspeed --exit-interval 1000
[2022-11-10 09:45:43,730] [INFO] [launch.py:96:main] 0 NCCL_VERSION=2.11.4
[2022-11-10 09:45:43,730] [INFO] [launch.py:96:main] 0 NCCL_DEBUG=warn
[2022-11-10 09:45:43,730] [INFO] [launch.py:103:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2022-11-10 09:45:43,730] [INFO] [launch.py:109:main] nnodes=1, num_local_procs=8, node_rank=0
[2022-11-10 09:45:43,730] [INFO] [launch.py:122:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2022-11-10 09:45:43,730] [INFO] [launch.py:123:main] dist_world_size=8
[2022-11-10 09:45:43,730] [INFO] [launch.py:125:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0a0+b6df043
torch cuda version ............... 11.5
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
fatal: not a git repository (or any of the parent directories): .git
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0a0+b6df043
torch cuda version ............... 11.5
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  aml_data_download_path .......................... None
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  bert_binary_head ................................ True
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  checkpoint_activations .......................... True
  checkpoint_in_cpu ............................... False
  checkpoint_num_layers ........................... 1
  clip_grad ....................................... 1.0
  compression_training ............................ False
  consumed_train_samples .......................... 0
  consumed_train_tokens ........................... 0
  consumed_valid_samples .......................... 0
  contigious_checkpointing ........................ False
  cpu_optimizer ................................... False
  cpu_torch_adam .................................. False
  create_moe_param_group .......................... False
  curriculum_learning ............................. False
  custom_token_counting ........................... False
  data_impl ....................................... infer
  data_parallel_size .............................. 8
  data_path ....................................... ['./']
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_seq_length .............................. None
  deepscale ....................................... False
  deepscale_config ................................ None
  deepspeed ....................................... True
  deepspeed_activation_checkpointing .............. True
  deepspeed_config ................................ ds_config.json
  deepspeed_mpi ................................... False
  distribute_checkpointed_activations ............. False
  distributed_backend ............................. nccl
  ds_inference .................................... False
  ds_pipeline_enabled ............................. False
  embedding_path .................................. None
  enable_expert_tensor_parallelism ................ False
  encoder_seq_length .............................. 1024
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 40
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... 1000
  expert_interval ................................. 2
  ffn_hidden_size ................................. 2048
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  global_batch_size ............................... 32
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 512
  hidden_size_teacher ............................. None
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_dim ......................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference ....................................... False
  init_method_std ................................. 0.006
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  kd .............................................. False
  kd_alpha_ce ..................................... 1
  kd_beta_ce ...................................... 1
  kd_temp ......................................... 1.0
  kv_channels ..................................... 32
  layernorm_epsilon ............................... 1e-05
  lazy_mpu_init ................................... None
  load ............................................ None
  load_teacher .................................... None
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 1
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_num_zeros_in_grad ........................... False
  log_optimizer_states_to_tensorboard ............. False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  loss_scale ...................................... 12.0
  loss_scale_window ............................... 1000
  lr .............................................. 6e-05
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_decay_tokens ................................. None
  lr_warmup_fraction .............................. None
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  lr_warmup_tokens ................................ None
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 1024
  memory_centric_tiled_linear ..................... False
  merge_file ...................................... .//gpt2-merges.txt
  micro_batch_size ................................ 4
  min_loss_scale .................................. 1.0
  min_lr .......................................... 6e-06
  mlp_type ........................................ standard
  mmap_warmup ..................................... False
  moe_eval_capacity_factor ........................ 1.0
  moe_expert_parallel_size ........................ 1
  moe_loss_coeff .................................. 0.1
  moe_min_capacity ................................ 4
  moe_token_dropping .............................. True
  moe_train_capacity_factor ....................... 1.0
  mos ............................................. False
  no_load_lr_state ................................ False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_pipeline_parallel ............................ True
  no_save_optim ................................... None
  no_save_rng ..................................... None
  num_attention_heads ............................. 16
  num_attention_heads_teacher ..................... None
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... [1]
  num_experts_teacher ............................. [1]
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_layers_teacher .............................. None
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  override_lr_scheduler ........................... False
  params_dtype .................................... torch.float16
  partition_activations ........................... False
  patch_dim ....................................... 16
  pipeline_model_parallel_size .................... 1
  profile_backward ................................ False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  remote_device ................................... none
  reset_attention_mask ............................ False
  reset_iteration ................................. False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... 1000
  scatter_gather_tensors_in_pipeline .............. True
  scattered_embeddings ............................ False
  seed ............................................ 1234
  seq_length ...................................... 1024
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  split ........................................... 98,2,0
  split_transformers .............................. False
  synchronize_each_layer .......................... False
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. ds_z1_nl24_hs512_gb32_mb4
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  tile_factor ..................................... 1
  titles_data_path ................................ None
  tokenizer_type .................................. GPT2BPETokenizer
  topk ............................................ 1
  train_iters ..................................... 1000
  train_samples ................................... None
  train_tokens .................................... None
  use_checkpoint_lr_scheduler ..................... False
  use_contiguous_buffers_in_ddp ................... False
  use_cpu_initialization .......................... None
  use_one_sent_docs ............................... False
  use_pin_memory .................................. False
  use_tutel ....................................... False
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_file ...................................... .//gpt2-vocab.json
  weight_decay .................................... 0.1
  world_size ...................................... 8
  zero_allgather_bucket_size ...................... 0.0
  zero_contigious_gradients ....................... False
  zero_reduce_bucket_size ......................... 0.0
  zero_reduce_scatter ............................. False
  zero_stage ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> building GPT2BPETokenizer tokenizer ...
fatal: not a git repository (or any of the parent directories): .git
**** Git info for Megatron: git_hash=unknown git_branch=unknown ****
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
[2022-11-10 09:45:46,590] [INFO] [distributed.py:48:init_distributed] Initializing torch distributed with backend: nccl
[... the same DeepSpeed C++/CUDA extension op report, environment info, and "fatal: not a git repository" lines are repeated by the remaining ranks ...]
> setting tensorboard ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
[2022-11-10 09:45:47,963] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling dataset index builder ...
make: Entering directory '/workspace/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/Megatron-DeepSpeed/megatron/data'
>>> done with dataset index builder. Compilation time: 0.091 seconds
> compiling and loading fused kernels ...

So I terminated the process. The error message on termination is as follows. It seems that the program is stuck initializing the C++ extensions:

^CTraceback (most recent call last):
  File "pretrain_gpt.py", line 294, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/workspace/Megatron-DeepSpeed/megatron/training.py", line 98, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/Megatron-DeepSpeed/megatron/initialize.py", line 89, in initialize_megatron
    _compile_dependencies()
  File "/workspace/Megatron-DeepSpeed/megatron/initialize.py", line 137, in _compile_dependencies
    fused_kernels.load(args)
  File "/workspace/Megatron-DeepSpeed/megatron/fused_kernels/__init__.py", line 88, in load
    scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
  File "/workspace/Megatron-DeepSpeed/megatron/fused_kernels/__init__.py", line 56, in _cpp_extention_load_helper
    return cpp_extension.load(
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1125, in load
--- Logging error ---
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/logging/__init__.py", line 1088, in emit
    stream.write(msg + self.terminator)
BrokenPipeError: [Errno 32] Broken pipe
Call stack:
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 218, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 214, in main
    time.sleep(1)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 178, in sigkill_handler
    logger.info(f"Killing subprocess {process.pid}")
Message: 'Killing subprocess 15854'
Arguments: ()
[... the same "--- Logging error ---" / BrokenPipeError block repeats for subprocesses 15855 through 15861 ...]
--- Logging error ---
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/logging/__init__.py", line 1088, in emit
    stream.write(msg + self.terminator)
BrokenPipeError: [Errno 32] Broken pipe
Call stack:
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 218, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 214, in main
    time.sleep(1)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 187, in sigkill_handler
    logger.info(f"Main process received {sig_names[signum]}, exiting")
Message: 'Main process received SIGINT, exiting'
Arguments: ()
Traceback (most recent call last):
  File "/opt/conda/bin/deepspeed", line 6, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 460, in main
    result.wait()
  File "/opt/conda/lib/python3.8/subprocess.py", line 1083, in wait
    return self._wait(timeout=timeout)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1806, in _wait
    (pid, sts) = self._try_wait(0)
  File "/opt/conda/lib/python3.8/subprocess.py", line 1764, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
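
One thing worth ruling out (an assumption, not a confirmed diagnosis for this hang): torch.utils.cpp_extension.load() serializes JIT builds through a lock file inside the build directory, so if an earlier run was killed mid-compile, a stale megatron/fused_kernels/build directory can make every later launch wait forever at "compiling and loading fused kernels". A minimal sketch that clears it before relaunching:

import shutil
from pathlib import Path

# Path assumed relative to the Megatron-DeepSpeed checkout.
build_dir = Path("megatron/fused_kernels/build")
if build_dir.exists():
    shutil.rmtree(build_dir)
    print(f"removed {build_dir}")
else:
    print("no fused-kernels build directory found")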

pretrain_gpt_125M_MoE freezes during compilation

Hi all,
I have been trying to run MoE models using the examples in examples/MoE/. The training freezes after

>>> done with dataset index builder. Compilation time: 0.063 seconds
> compiling and loading fused kernels .

with every script.
I am running the training jobs on Slurm with PyTorch 1.12.1, CUDA 11.3, Apex 0.1.0, and DeepSpeed 0.7.0. Below is the training log up until the job freezes. I would appreciate any help. Thanks!

[2022-08-15 01:20:45,499] [INFO] [launch.py:136:main] WORLD INFO DICT: {'localhost': [0]}
[2022-08-15 01:20:45,500] [INFO] [launch.py:142:main] nnodes=1, num_local_procs=1, node_rank=0
[2022-08-15 01:20:45,500] [INFO] [launch.py:155:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2022-08-15 01:20:45,500] [INFO] [launch.py:156:main] dist_world_size=1
[2022-08-15 01:20:45,501] [INFO] [launch.py:158:main] Setting CUDA_VISIBLE_DEVICES=0
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/n/holyscratch01/acc_lab/Users/yhjin0509/.conda/envs/moe7/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/n/holyscratch01/acc_lab/Users/yhjin0509/.conda/envs/moe7/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
**** Git info for Megatron: git_hash=b4d4a0e git_branch=main ****
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  aml_data_download_path .......................... None
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  bert_binary_head ................................ True
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  checkpoint_activations .......................... True
  checkpoint_in_cpu ............................... False
  checkpoint_num_layers ........................... 1
  clip_grad ....................................... 1.0
  compression_training ............................ False
  consumed_train_samples .......................... 0
  consumed_train_tokens ........................... 0
  consumed_valid_samples .......................... 0
  contigious_checkpointing ........................ False
  cpu_optimizer ................................... False
  cpu_torch_adam .................................. False
  create_moe_param_group .......................... True
  curriculum_learning ............................. False
  data_impl ....................................... mmap
  data_parallel_size .............................. 1
  data_path ....................................... ['/n/holylfs05/LABS/acc_lab/Users/yhjin0509/moe_data/mystic.the-eye.eu/public/AI/pile_neox/data/Books3Dataset_text_document']
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_seq_length .............................. None
  deepscale ....................................... False
  deepscale_config ................................ None
  deepspeed ....................................... True
  deepspeed_activation_checkpointing .............. True
  deepspeed_config ................................ ds_config_gpt_gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-1-mp-1-pp-1-ep-2-mlc-0.01-cap-1.0-drop-true.json
  deepspeed_mpi ................................... False
  distribute_checkpointed_activations ............. False
  distributed_backend ............................. nccl
  ds_inference .................................... False
  ds_pipeline_enabled ............................. False
  embedding_path .................................. None
  enable_expert_tensor_parallelism ................ False
  encoder_seq_length .............................. 2048
  eod_mask_loss ................................... False
  eval_interval ................................... 100
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... 30000000
  exit_interval ................................... None
  expert_interval ................................. 2
  ffn_hidden_size ................................. 3072
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  global_batch_size ............................... 256
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 768
  hidden_size_teacher ............................. None
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_dim ......................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference ....................................... False
  init_method_std ................................. 0.014
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  kd .............................................. False
  kd_alpha_ce ..................................... 1
  kd_beta_ce ...................................... 1
  kd_temp ......................................... 1.0
  kv_channels ..................................... 64
  layernorm_epsilon ............................... 1e-05
  lazy_mpu_init ................................... None
  load ............................................ /n/holylfs05/LABS/acc_lab/Users/yhjin0509/moe//checkpoint/gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-1-mp-1-pp-1-ep-2-mlc-0.01-cap-1.0-drop-true
  load_teacher .................................... None
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... True
  log_interval .................................... 10
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_num_zeros_in_grad ........................... False
  log_optimizer_states_to_tensorboard ............. False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... True
  log_validation_ppl_to_tensorboard ............... True
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.00045
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_decay_tokens ................................. 300000000000
  lr_warmup_fraction .............................. None
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  lr_warmup_tokens ................................ 375000000
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 2048
  memory_centric_tiled_linear ..................... False
  merge_file ...................................... /n/holylfs05/LABS/acc_lab/Users/yhjin0509/moe_data/gpt2-merges.txt
  micro_batch_size ................................ 4
  min_loss_scale .................................. 1.0
  min_lr .......................................... 4.5e-06
  mlp_type ........................................ standard
  mmap_warmup ..................................... False
  moe_eval_capacity_factor ........................ 1.0
  moe_expert_parallel_size ........................ 1
  moe_loss_coeff .................................. 0.01
  moe_min_capacity ................................ 4
  moe_token_dropping .............................. True
  moe_train_capacity_factor ....................... 1.0
  mos ............................................. False
  no_load_lr_state ................................ False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_pipeline_parallel ............................ True
  no_save_optim ................................... None
  no_save_rng ..................................... None
  num_attention_heads ............................. 12
  num_attention_heads_teacher ..................... None
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... [2]
  num_experts_teacher ............................. [1]
  num_layers ...................................... 12
  num_layers_per_virtual_pipeline_stage ........... None
  num_layers_teacher .............................. None
  num_workers ..................................... 0
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  override_lr_scheduler ........................... True
  params_dtype .................................... torch.float16
  partition_activations ........................... False
  patch_dim ....................................... 16
  pipeline_model_parallel_size .................... 1
  profile_backward ................................ False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  remote_device ................................... none
  reset_attention_mask ............................ False
  reset_iteration ................................. False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  sample_rate ..................................... 1.0
  save ............................................ /n/holylfs05/LABS/acc_lab/Users/yhjin0509/moe//checkpoint/gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-1-mp-1-pp-1-ep-2-mlc-0.01-cap-1.0-drop-true
  save_interval ................................... 10000
  scatter_gather_tensors_in_pipeline .............. True
  scattered_embeddings ............................ False
  seed ............................................ 1234
  seq_length ...................................... 2048
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  split ........................................... 98,2,0
  split_transformers .............................. False
  synchronize_each_layer .......................... False
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. /n/home08/yhjin0509/moe/Megatron-DeepSpeed/examples/MoE/output/tensorboard/gpt-0.125B-lr-4.5e-4-minlr-4.5e-06-bs-256-gpus-1-mp-1-pp-1-ep-2-mlc-0.01-cap-1.0-drop-true_holygpu2c0710.rc.fas.harvard.edu_2022.08.15-01.20.41
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1
  tile_factor ..................................... 1
  titles_data_path ................................ None
  tokenizer_type .................................. GPT2BPETokenizer
  topk ............................................ 1
  train_iters ..................................... 1716613
  train_samples ................................... None
  train_tokens .................................... 300000000000
  use_checkpoint_lr_scheduler ..................... False
  use_contiguous_buffers_in_ddp ................... False
  use_cpu_initialization .......................... None
  use_one_sent_docs ............................... False
  use_pin_memory .................................. False
  use_tutel ....................................... False
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_file ...................................... /n/holylfs05/LABS/acc_lab/Users/yhjin0509/moe_data/gpt2-vocab.json
  weight_decay .................................... 0.1
  world_size ...................................... 1
  zero_allgather_bucket_size ...................... 0.0
  zero_contigious_gradients ....................... False
  zero_reduce_bucket_size ......................... 0.0
  zero_reduce_scatter ............................. False
  zero_stage ...................................... 1.0
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 64
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
WARNING: TensorBoard writing requested but is not available (are you using PyTorch 1.1.0 or later?), no TensorBoard logs will be written.
> initializing torch distributed ...
[2022-08-15 01:20:48,772] [INFO] [comm.py:628:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
[2022-08-15 01:20:48,776] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling dataset index builder ...
make: Entering directory `/n/home08/yhjin0509/moe/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for `default'.
make: Leaving directory `/n/home08/yhjin0509/moe/Megatron-DeepSpeed/megatron/data'
>>> done with dataset index builder. Compilation time: 0.063 seconds
> compiling and loading fused kernels ...

GeLU approximation differs from paper, BERT

During careful debugging I noticed that in fused_bias_gelu the GeLU activation is approximated as:

x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))

The approximation proposed in the original paper and used by HuggingFace and BERT takes the cube of x in the inner expression, not the square. In the normal range of the function this affects the output/gradient by only ~1e-7, so it's not a huge discrepancy, but it is an odd one and makes accurate inference of Megatron-trained weights outside of the Megatron framework more difficult.

Is this an intentional change for some reason, or a simple bug?
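
Whatever the intended form, one quick way to gauge the numerical impact of such a change is to compare the quoted tanh approximation against the exact erf-based GeLU. A minimal sketch, assuming only PyTorch is installed (function names are illustrative, not from the repo):

# Compare the tanh form quoted above against the exact erf-based GeLU to
# gauge the size of any discrepancy; this does not assert which variant
# the fused kernel should implement.
import math
import torch

def gelu_exact(x):
    # Exact GeLU: x * Phi(x), using the error function.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh_quoted(x):
    # The tanh form quoted from fused_bias_gelu above.
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))

x = torch.linspace(-6.0, 6.0, steps=2001, dtype=torch.float64)
print((gelu_exact(x) - gelu_tanh_quoted(x)).abs().max().item())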

How efficient is the BERT and T5 code?

I'm looking into training RoBERTa and T5-style models, but I wasn't able to find much on their performance in Megatron-DeepSpeed. I know that at one point they were a WIP, but what's the latest on them?

gpt_6.7B_PR-MoE16: CUDA out of memory

Hi team, I'm curious how to achieve the performance described in the blog, i.e., the parameter counts and throughput.

When I train a GPT-3 6.7B model on 16 NVIDIA A100-SXM4-40GB GPUs and set the number of experts EP_SIZE to "16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16", a RuntimeError: CUDA out of memory occurs. When I train the same model on 16 NVIDIA A100-SXM4-80GB GPUs, GPU memory usage reaches nearly 67 GiB and the throughput (wps) is only about 5000 tokens/s. At the same time, according to the FLOPs profiler output, the total number of parameters is only 42B, which is far from the performance described in the chart in the blog, where 16 A100s can train a model with about 0.1 trillion parameters at a throughput of 70-80 thousand tokens/s.

[Image: PR-MoE chart from the blog]

So what should I do if I want to train a 100-billion-parameter model on 16 A100 GPUs? @conglongli @jeffra
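
For a rough sanity check on such parameter counts, a back-of-the-envelope estimate can help. A minimal sketch, assuming GPT-3 6.7B-like dimensions (hidden size 4096, 32 layers), counting only attention and FFN weight matrices, and assuming an expert layer every other transformer layer (expert-interval 2); all of these dimensions are assumptions for illustration:

# Back-of-the-envelope parameter count; dimensions below are assumptions
# (GPT-3 6.7B-like) and only attention/FFN weight matrices are counted.
hidden = 4096
layers = 32
ffn = 4 * hidden
experts = 16               # experts per MoE layer, matching EP_SIZE above
moe_layers = layers // 2   # assuming an expert layer every other layer

attn_params = 4 * hidden * hidden   # QKV + output projections, per layer
ffn_params = 2 * hidden * ffn       # two FFN matrices, per layer

dense_total = layers * (attn_params + ffn_params)
moe_extra = moe_layers * (experts - 1) * ffn_params  # extra experts duplicate the FFN

print(f"dense ~{dense_total / 1e9:.1f}B params, with MoE ~{(dense_total + moe_extra) / 1e9:.1f}B")

Under these assumptions the total lands in the high tens of billions, which is at least in the same ballpark as the ~42B reported by the profiler.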

"RuntimeError: trying to initialize the default process group twice!" error with pretrain_gpt example script

I am trying to run Megatron-LM GPT2. Previously, I ran it via https://github.com/microsoft/DeepSpeedExamples/blob/master/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_pretrain_gpt2-zero2.sh from the DeepSpeedExamples repo.

With the current updated repository, I am running into the error below with the script https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples/pretrain_gpt.sh:

 Traceback (most recent call last):
  File "pretrain_gpt.py", line 276, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/root/Megatron-DeepSpeed/megatron/training.py", line 130, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider, teacher=False)
  File "/root/Megatron-DeepSpeed/megatron/training.py", line 402, in setup_model_and_optimizer
    model = get_model(model_provider_func)
  File "/root/Megatron-DeepSpeed/megatron/training.py", line 265, in get_model
    model = model_provider_func(
  File "pretrain_gpt.py", line 46, in model_provider
    with deepspeed.zero.Init(data_parallel_group=mpu.get_data_parallel_group(),
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 655, in __init__
    init_distributed()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 427, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 35, in __init__
    self.init_process_group(backend, timeout, init_method)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 38, in init_process_group
    return torch.distributed.init_process_group(backend,
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 563, in init_process_group
    raise RuntimeError("trying to initialize the default process group " "twice!")
RuntimeError: trying to initialize the default process group twice!

Image used: pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel

Please advise how to run the GPT-2 model with DeepSpeed.
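
A hedged aside on the general pattern the traceback points at (not a confirmed fix for this script): torch.distributed's default process group can only be created once per process, so code that might run after Megatron has already initialized it typically guards the call, for example:

# Illustrative pattern only, not a confirmed fix for the script above:
# create the default process group only if it does not already exist.
import torch.distributed as dist

def ensure_process_group(backend: str = "nccl") -> None:
    if not dist.is_initialized():
        dist.init_process_group(backend=backend)

It may also be worth checking whether a newer DeepSpeed release already performs such a guard internally.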

Training stuck at Round robin gradient partitioning?

This is part of logs:

DeepSpeed general environment info:
torch install path ............... ['/home/user/miniconda/lib/python3.8/site-packages/torch']
torch version .................... 1.10.2+cu102
deepspeed install path ........... ['/home/user/miniconda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.2+4ae3a3da, 4ae3a3da, master
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed wheel compiled w. ...... torch 1.10, cuda 10.2
DeepSpeed general environment info:
torch install path ............... ['/home/user/miniconda/lib/python3.8/site-packages/torch']
torch version .................... 1.10.2+cu102
deepspeed install path ........... ['/home/user/miniconda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.2+4ae3a3da, 4ae3a3da, master
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed wheel compiled w. ...... torch 1.10, cuda 10.2
DeepSpeed general environment info:
torch install path ............... ['/home/user/miniconda/lib/python3.8/site-packages/torch']
torch version .................... 1.10.2+cu102
deepspeed install path ........... ['/home/user/miniconda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.2+4ae3a3da, 4ae3a3da, master
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed wheel compiled w. ...... torch 1.10, cuda 10.2
DeepSpeed general environment info:
torch install path ............... ['/home/user/miniconda/lib/python3.8/site-packages/torch']
torch version .................... 1.10.2+cu102
deepspeed install path ........... ['/home/user/miniconda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.2+4ae3a3da, 4ae3a3da, master
torch cuda version ............... 10.2
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed wheel compiled w. ...... torch 1.10, cuda 10.2
using world size: 4, data-parallel-size: 4, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:HFTokenizer
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  aml_data_download_path .......................... None
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  bert_binary_head ................................ True
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  checkpoint_activations .......................... True
  checkpoint_in_cpu ............................... False
  checkpoint_num_layers ........................... 1
  clip_grad ....................................... 1.0
  compression_training ............................ False
  consumed_train_samples .......................... 0
  consumed_train_tokens ........................... 0
  consumed_valid_samples .......................... 0
  contigious_checkpointing ........................ False
  cpu_optimizer ................................... False
  cpu_torch_adam .................................. False
  create_moe_param_group .......................... False
  curriculum_learning_legacy ...................... False
  custom_token_counting ........................... False
  data_efficiency_curriculum_learning ............. False
  data_impl ....................................... mmap
  data_parallel_size .............................. 4
  data_path ....................................... ['/data']
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_seq_length .............................. None
  deepscale ....................................... False
  deepscale_config ................................ None
  deepspeed ....................................... True
  deepspeed_activation_checkpointing .............. False
  deepspeed_config ................................ /workspace/Megatron-DeepSpeed/config/test.json
  deepspeed_mpi ................................... False
  distribute_checkpointed_activations ............. False
  distributed_backend ............................. nccl
  ds_inference .................................... False
  ds_pipeline_enabled ............................. True
  embedding_path .................................. None
  enable_expert_tensor_parallelism ................ False
  encoder_seq_length .............................. 1024
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... 5000
  expert_interval ................................. 2
  ffn_hidden_size ................................. 2048
  finetune ........................................ False
  fp16 ............................................ True
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  global_batch_size ............................... 1024
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 512
  hidden_size_teacher ............................. None
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_dim ......................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference ....................................... False
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  kd .............................................. False
  kd_alpha_ce ..................................... 1
  kd_beta_ce ...................................... 1
  kd_temp ......................................... 1.0
  kv_channels ..................................... 32
  layernorm_epsilon ............................... 1e-05
  lazy_mpu_init ................................... None
  load ............................................ /workspace/Megatron-DeepSpeed/checkpoints/ds_z2_nl24_hs512_gb1024_mb8
  load_teacher .................................... None
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_num_zeros_in_grad ........................... False
  log_optimizer_states_to_tensorboard ............. False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.00015
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_decay_tokens ................................. None
  lr_warmup_fraction .............................. 0.01
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  lr_warmup_tokens ................................ None
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 1024
  memory_centric_tiled_linear ..................... False
  merge_file ...................................... None
  micro_batch_size ................................ 8
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mlp_type ........................................ standard
  mmap_warmup ..................................... False
  moe_eval_capacity_factor ........................ 1.0
  moe_expert_parallel_size ........................ 1
  moe_loss_coeff .................................. 0.1
  moe_min_capacity ................................ 4
  moe_token_dropping .............................. True
  moe_train_capacity_factor ....................... 1.0
  mos ............................................. False
  no_load_lr_state ................................ False
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_pipeline_parallel ............................ False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  num_attention_heads ............................. 16
  num_attention_heads_teacher ..................... None
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... [1]
  num_experts_teacher ............................. [1]
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_layers_teacher .............................. None
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  override_lr_scheduler ........................... False
  params_dtype .................................... torch.float16
  partition_activations ........................... False
  patch_dim ....................................... 16
  pipeline_model_parallel_size .................... 1
  profile_backward ................................ False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  random_ltd ...................................... False
  rank ............................................ 0
  remote_device ................................... none
  reset_attention_mask ............................ False
  reset_iteration ................................. False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  return_data_index ............................... False
  sample_rate ..................................... 1.0
  save ............................................ /workspace/Megatron-DeepSpeed/checkpoints/ds_z2_nl24_hs512_gb1024_mb8
  save_interval ................................... 10000
  scatter_gather_tensors_in_pipeline .............. True
  scattered_embeddings ............................ False
  seed ............................................ 1234
  seq_length ...................................... 1024
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  split ........................................... 949,50,1
  split_transformers .............................. False
  synchronize_each_layer .......................... False
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  tile_factor ..................................... 1
  titles_data_path ................................ None
  tokenizer_type .................................. HFTokenizer
  topk ............................................ 1
  train_data_exact_num_epochs ..................... None
  train_doc_idx_path .............................. None
  train_idx_path .................................. None
  train_iters ..................................... 500000
  train_sample_idx_path ........................... None
  train_samples ................................... None
  train_shuffle_idx_path .......................... None
  train_tokens .................................... None
  use_checkpoint_lr_scheduler ..................... False
  use_contiguous_buffers_in_ddp ................... False
  use_cpu_initialization .......................... None
  use_one_sent_docs ............................... False
  use_pin_memory .................................. False
  use_tutel ....................................... False
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_file ...................................... /workspace/tokenizer.json
  weight_decay .................................... 0.01
  world_size ...................................... 4
  zero_allgather_bucket_size ...................... 0.0
  zero_contigious_gradients ....................... False
  zero_reduce_bucket_size ......................... 0.0
  zero_reduce_scatter ............................. False
  zero_stage ...................................... 1.0
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 32
> building HFTokenizer tokenizer ...
 > padded vocab (size: 250002) with 110 dummy tokens (new size: 250112)
> initializing torch distributed ...
[2023-03-06 11:49:18,399] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling dataset index builder ...
make: Entering directory '/workspace/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/Megatron-DeepSpeed/megatron/data'
>>> done with dataset index builder. Compilation time: 0.097 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
>>> done with compiling and loading fused kernels. Compilation time: 8.615 seconds
time to initialize megatron (seconds): 72.271
[after megatron is initialized] datetime: 2023-03-06 11:49:28 
building GPT model ...
[2023-03-06 11:49:28,344] [INFO] [utils.py:829:see_memory_usage] Before Building Model
[2023-03-06 11:49:28,345] [INFO] [utils.py:830:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2023-03-06 11:49:28,345] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 119.29 GB, percent = 31.8%
SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0, ProcessCoord(pipe=0, data=1, model=0): 1, ProcessCoord(pipe=0, data=2, model=0): 2, ProcessCoord(pipe=0, data=3, model=0): 3}
[2023-03-06 11:49:28,451] [INFO] [module.py:372:_partition_layers] Partitioning pipeline stages with method type:transformer
stage=0 layers=31
     0: _to_float16
     1: EmbeddingPipe
     2: <lambda>
     3: ParallelTransformerLayerPipe
     4: ParallelTransformerLayerPipe
     5: ParallelTransformerLayerPipe
     6: ParallelTransformerLayerPipe
     7: ParallelTransformerLayerPipe
     8: ParallelTransformerLayerPipe
     9: ParallelTransformerLayerPipe
    10: ParallelTransformerLayerPipe
    11: ParallelTransformerLayerPipe
    12: ParallelTransformerLayerPipe
    13: ParallelTransformerLayerPipe
    14: ParallelTransformerLayerPipe
    15: ParallelTransformerLayerPipe
    16: ParallelTransformerLayerPipe
    17: ParallelTransformerLayerPipe
    18: ParallelTransformerLayerPipe
    19: ParallelTransformerLayerPipe
    20: ParallelTransformerLayerPipe
    21: ParallelTransformerLayerPipe
    22: ParallelTransformerLayerPipe
    23: ParallelTransformerLayerPipe
    24: ParallelTransformerLayerPipe
    25: ParallelTransformerLayerPipe
    26: ParallelTransformerLayerPipe
    27: <lambda>
    28: MixedFusedLayerNorm
    29: EmbeddingPipe
    30: float16_to_fp32
  loss: CrossEntropy
[2023-03-06 11:49:28,582] [INFO] [utils.py:829:see_memory_usage] After Building Model
[2023-03-06 11:49:28,582] [INFO] [utils.py:830:see_memory_usage] MA 0.39 GB         Max_MA 0.39 GB         CA 0.41 GB         Max_CA 0 GB 
[2023-03-06 11:49:28,583] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 119.37 GB, percent = 31.8%
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 204239872
> learning rate decay style: cosine
DeepSpeed is enabled.
[2023-03-06 11:49:28,586] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed info: version=0.8.2+4ae3a3da, git-hash=4ae3a3da, git-branch=master
[2023-03-06 11:49:28,897] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2023-03-06 11:49:28,897] [INFO] [logging.py:77:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-03-06 11:49:28,897] [INFO] [logging.py:77:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-03-06 11:49:28,912] [INFO] [logging.py:77:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-03-06 11:49:28,912] [INFO] [utils.py:55:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'apex.optimizers.fused_adam.FusedAdam'>
[2023-03-06 11:49:28,912] [INFO] [logging.py:77:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2023-03-06 11:49:28,912] [INFO] [stage_1_and_2.py:144:__init__] Reduce bucket size 500,000,000
[2023-03-06 11:49:28,912] [INFO] [stage_1_and_2.py:145:__init__] Allgather bucket size 500,000,000
[2023-03-06 11:49:28,912] [INFO] [stage_1_and_2.py:146:__init__] CPU Offload: False
[2023-03-06 11:49:28,912] [INFO] [stage_1_and_2.py:147:__init__] Round robin gradient partitioning: False
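
For reference, the "Round robin gradient partitioning" line above corresponds to an optional ZeRO stage 1/2 setting in the DeepSpeed config. A minimal sketch of where it would live, with bucket sizes mirroring the log above; the round_robin_gradients key is assumed from DeepSpeed's ZeRO config schema:

# Hedged sketch of the ZeRO portion of a DeepSpeed config mirroring the values
# printed above; "round_robin_gradients" is assumed from DeepSpeed's ZeRO
# config schema and is what the last log line refers to.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "reduce_bucket_size": 500_000_000,
        "allgather_bucket_size": 500_000_000,
        "round_robin_gradients": False,
    },
}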

GPT-2 with pipeline parallel and bfloat16 doesn't work

Hi,
When using the script in examples/run_deepspeed_example.sh with ZeRO-1 and bfloat16 (the script works with fp16), I get the following error:
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 768, in _exec_backward_pass
self.optimizer.clear_lp_grads()
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'clear_lp_grads'

The run_deepspeed_example.sh script is attached:
#!/bin/bash
set -ex

BASE_PATH=/vc_data/Megatron-LM/data
DATA_PATH=${BASE_PATH}/indexed_datasets/megatron
DS_CONFIG=ds_config.json

TP=2
PP=2
NLAYERS=24
HIDDEN=512

GLOBAL_BATCH=64
MICRO_BATCH=4

ZERO_STAGE=1

OUTPUT_DIR=ds_z${ZERO_STAGE}_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
#OUTPUT_DIR=baseline_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
mkdir -p $OUTPUT_DIR

cat <<EOT > $DS_CONFIG
{
"train_batch_size" : $GLOBAL_BATCH,
"train_micro_batch_size_per_gpu": $MICRO_BATCH,
"steps_per_print": 1,

"zero_optimization": {
"stage": $ZERO_STAGE
},

"bf16": {"enabled": true},

"wall_clock_breakdown" : true
}
EOT

export NCCL_DEBUG=warn

ds_args=""
ds_args=" --deepspeed ${ds_args}"
#ds_args=" --no-pipeline-parallel ${ds_args}"
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"

deepspeed pretrain_gpt.py \
--tensor-model-parallel-size $TP \
--pipeline-model-parallel-size $PP \
--num-layers $NLAYERS \
--hidden-size $HIDDEN \
--num-attention-heads 16 \
--seq-length 256 \
--loss-scale 12 \
--max-position-embeddings 1024 \
--micro-batch-size 4 \
--global-batch-size 1024 \
--train-iters 1000 \
--lr 6.0e-5 \
--min-lr 6.0e-6 \
--lr-decay-style cosine \
--log-interval 1 \
--eval-iters 40 \
--eval-interval 1000 \
--data-path $DATA_PATH \
--vocab-file $BASE_PATH/gpt2-vocab.json \
--merge-file $BASE_PATH/gpt2-merges.txt \
--save-interval 1000 \
--split 98,2,0 \
--clip-grad 1.0 \
--weight-decay 0.1 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--init-method-std 0.006 \
--bf16 \
--checkpoint-activations \
--tensorboard-dir $OUTPUT_DIR \
$ds_args \
--exit-interval 5000 | tee ${OUTPUT_DIR}/output.log

Vocab size mismatch for T5

If I use the base example from examples/pretrain_t5_distributed_with_mp.sh, I get the following error:

Traceback (most recent call last):
  File "pretrain_t5.py", line 133, in <module>
    train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
  File "pretrain_t5.py", line 114, in train_valid_test_datasets_provider
    train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
  File "pretrain_t5.py", line 114, in train_valid_test_datasets_provider
    dataset = T5Dataset(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/t5_dataset.py", line 68, in __init__
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 150, in pretrain
    train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/dataset_utils.py", line 425, in build_train_valid_test_datasets
    train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/dataset_utils.py", line 425, in build_train_valid_test_datasets
    assert len(self.sentinel_tokens) > 0, "Provide the argument --vocab-extra-ids 100 to the script"
AssertionError: Provide the argument --vocab-extra-ids 100 to the script

However, if I add the extra vocab IDs, I get:

Traceback (most recent call last):
  File "pretrain_t5.py", line 133, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 170, in pretrain
    iteration = train(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 945, in train
    train_step(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 550, in train_step
    losses_reduced = forward_backward_func(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 147, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator, model,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 65, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "pretrain_t5.py", line 93, in forward_step
    = get_batch(data_iterator)
  File "pretrain_t5.py", line 53, in get_batch
    data = next(data_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 457, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/t5_dataset.py", line 82, in __getitem__
    return build_training_sample(sample, seq_length,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/t5_dataset.py", line 134, in build_training_sample
    (tokens, masked_positions, masked_labels, _, masked_spans) = create_masked_lm_predictions(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/dataset_utils.py", line 213, in create_masked_lm_predictions
    not is_start_piece(vocab_id_to_token_dict[token])):
KeyError: 50104

Traceback (most recent call last):
  File "pretrain_t5.py", line 133, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 170, in pretrain
    iteration = train(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 945, in train
    train_step(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 550, in train_step
    losses_reduced = forward_backward_func(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 147, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator, model,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 65, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "pretrain_t5.py", line 93, in forward_step
    = get_batch(data_iterator)
  File "pretrain_t5.py", line 53, in get_batch
    data = next(data_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 457, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/t5_dataset.py", line 82, in __getitem__
    return build_training_sample(sample, seq_length,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/t5_dataset.py", line 134, in build_training_sample
    (tokens, masked_positions, masked_labels, _, masked_spans) = create_masked_lm_predictions(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/dataset_utils.py", line 213, in create_masked_lm_predictions
    not is_start_piece(vocab_id_to_token_dict[token])):
KeyError: 50214

Traceback (most recent call last):
  File "pretrain_t5.py", line 133, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 170, in pretrain
    iteration = train(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 945, in train
    train_step(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 550, in train_step
    losses_reduced = forward_backward_func(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 147, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator, model,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 65, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "pretrain_t5.py", line 93, in forward_step
    = get_batch(data_iterator)
  File "pretrain_t5.py", line 53, in get_batch
    data = next(data_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 457, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/t5_dataset.py", line 82, in __getitem__
    return build_training_sample(sample, seq_length,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/t5_dataset.py", line 134, in build_training_sample
    (tokens, masked_positions, masked_labels, _, masked_spans) = create_masked_lm_predictions(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/data/dataset_utils.py", line 213, in create_masked_lm_predictions
    not is_start_piece(vocab_id_to_token_dict[token])):
KeyError: 50151

Traceback (most recent call last):
  File "pretrain_t5.py", line 133, in <module>
Traceback (most recent call last):
  File "pretrain_t5.py", line 133, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 170, in pretrain
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 170, in pretrain
    iteration = train(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 945, in train
    iteration = train(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 945, in train
    train_step(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 550, in train_step
    train_step(forward_step_func,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/training.py", line 550, in train_step
    losses_reduced = forward_backward_func(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 147, in forward_backward_no_pipelining
    losses_reduced = forward_backward_func(
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 147, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator, model,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 65, in forward_step
    output_tensor = forward_step(forward_step_func, data_iterator, model,
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/schedules.py", line 65, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "pretrain_t5.py", line 97, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "pretrain_t5.py", line 97, in forward_step
    output_tensor = model(tokens_enc,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/model/distributed.py", line 71, in forward
    output_tensor = model(tokens_enc,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/model/distributed.py", line 71, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/model/module.py", line 172, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/model/module.py", line 172, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/model/t5_model.py", line 137, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/fsx/shiv/Megatron-DeepSpeed/megatron/model/t5_model.py", line 137, in forward
    decoder_output, encoder_output = lm_output
ValueError: too many values to unpack (expected 2)
    decoder_output, encoder_output = lm_output
ValueError: too many values to unpack (expected 2)

What is curious is that the KeyError changes from 50104 to 50214 to 50151.

How to load huggingface pretrained model T5 and train further?

Hello all,
Is it possible to load a pretrained model released on Hugging Face, such as a T5 or GPT-2 model, and continue training on my own data?

If I set the pretrained path in the --load argument, the code can only load the model parameters when a tracker file is present. But for pretrained weights from Hugging Face, there is no such file.

Looking forward to your reply.
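
A hedged aside, not an official path: Hugging Face checkpoints can be loaded with the transformers library, but Megatron's --load expects a Megatron-format checkpoint directory with its own iteration tracker file, so the Hugging Face state dict would still have to be converted to Megatron's parameter layout. A minimal sketch of the first half of that, assuming transformers is installed:

# Illustrative only: load a Hugging Face T5 checkpoint and inspect parameter
# names; mapping them onto Megatron-DeepSpeed's checkpoint layout is a
# separate conversion step that is not shown here.
from transformers import T5ForConditionalGeneration

hf_model = T5ForConditionalGeneration.from_pretrained("t5-base")
for name, param in list(hf_model.named_parameters())[:5]:
    print(name, tuple(param.shape))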

Layer Norm kernel fails for ROCm

The test for the fused layer norm kernel seems to fail for ROCm

Here's a small reproduction script:
from megatron import fused_kernels
from megatron.model.fused_layer_norm import MixedFusedLayerNorm
from torch.nn import LayerNorm


from transformers import BertTokenizer
from transformers.models.bert.modeling_bert import BertModel
import transformers

transformers.logging.set_verbosity(
    transformers.logging.FATAL,
)

# Copied from https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/fused_kernels/tests/test_fused_kernels.py#L223
def test_layer_norm():
    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    bert = BertModel.from_pretrained("bert-base-cased").cuda().half()
    test_text = (
        "Hello. How are you? I am fine thank you and you? yes Good. "
        "hi hi hi hi hi hi hi hi hi hi hi hi hi"  # 32
    )

    tokens = tokenizer(
        [test_text] * 4,
        return_tensors="pt",
    )

    # [bsz, seq_len, d_model]
    embedding_output = (
        bert.embeddings(
            input_ids=tokens["input_ids"].cuda(),
            position_ids=None,
            token_type_ids=tokens["token_type_ids"].cuda(),
            inputs_embeds=None,
            past_key_values_length=0,
        )
        .cuda()
        .half()
    )

    fused_layernorm_layer = (
        MixedFusedLayerNorm(normalized_shape=embedding_output.size(-1)).cuda().half()
    )

    torch_layernorm_layer = (
        LayerNorm(normalized_shape=embedding_output.size(-1)).cuda().half()
    )

    fused_output = fused_layernorm_layer(embedding_output)
    torch_output = torch_layernorm_layer(embedding_output)
    test_result = (fused_output - torch_output).abs()

    while test_result.dim() != 1:
        test_result = test_result.mean(dim=-1)

    diff = test_result.mean(dim=-1)

    if diff <= 1e-3:
        print(
            f"\n[Success] test_layer_norm"
            f"\n > mean_difference={diff}"
            f"\n > fused_values={fused_output[-1][-1][:5].tolist()}"
            f"\n > torch_values={torch_output[-1][-1][:5].tolist()}"
        )
    else:
        print(
            f"\n[Fail] test_layer_norm"
            f"\n > mean_difference={diff}, "
            f"\n > fused_values={fused_output[-1][-1][:5].tolist()}, "
            f"\n > torch_values={torch_output[-1][-1][:5].tolist()}"
        )

        
if __name__ == "__main__":     
    # initialize args
    from megatron.global_vars import _parse_args
    args_defaults = {'tokenizer_type': 'GPT2BPETokenizer', 'micro_batch_size':4, 'num_layers':24, 'hidden_size':1024, 'num_attention_heads':16, 'seq_length':1024, 'max_position_embeddings':1024, 'vocab_file':'gpt2-vocab.json', 'merge_file':'gpt2-merges.txt', 'train_samples':1000000, 'lr':3.0e-4, 'loss_scale':0}
    args = _parse_args(extra_args_provider=None,
                        defaults=args_defaults,
                        ignore_unknown_args=False)
    # compile kernels   
    print("Compiling kernels...")
    fused_kernels.load(args)
    print("Done compiling kernels.\n")

    print("Running test...")
    test_layer_norm()

Which gives on a DGX (CUDA): ✅

[Success] test_layer_norm
 > mean_difference=0.0
 > fused_values=[-0.0704345703125, -0.391845703125, 1.068359375, -0.32470703125, 0.0677490234375]
 > torch_values=[-0.0704345703125, -0.391845703125, 1.068359375, -0.32470703125, 0.0677490234375]

But gives on an MI250X (ROCm): ❌

[Fail] test_layer_norm
 > mean_difference=0.100830078125, 
 > fused_values=[-0.01995849609375, -0.298583984375, 0.966796875, -0.240234375, 0.0997314453125], 
 > torch_values=[-0.0704345703125, -0.391845703125, 1.068359375, -0.32470703125, 0.0677490234375]

Any ideas on how to fix this?

Related to #68
cc @rraminen @jeffra @tjruwase @jithunnair-amd
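
One hedged way to narrow this down, assuming the Megatron args and fused kernels from the script above have already been initialized in the same session, is to repeat the comparison in float32 so half-precision effects are separated from the kernel itself:

# Hedged diagnostic (run after the setup in the script above has compiled the
# fused kernels): compare MixedFusedLayerNorm against torch.nn.LayerNorm in
# float32 to rule out half-precision effects. Shapes are illustrative.
import torch
from torch.nn import LayerNorm
from megatron.model.fused_layer_norm import MixedFusedLayerNorm

x = torch.randn(4, 32, 768, device="cuda", dtype=torch.float32)
fused = MixedFusedLayerNorm(normalized_shape=768).cuda().float()
ref = LayerNorm(normalized_shape=768).cuda().float()
print((fused(x) - ref(x)).abs().max().item())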

Encountered error when enabling ZeRO and CPU Activation Checkpointing at the same time.

When I was testing GPT training and enabled ZeRO-3 and CPU Activation Checkpointing at the same time, I got some errors.

At first, there are some warnings:

10.0.1.101: training ...
10.0.1.99: time (ms) | model-and-optimizer-setup: 37634.40 | train/valid/test-data-iterators-setup: 119428.93
10.0.1.101: [before the start of training step] datetime: 2023-03-05 09:12:02
10.0.1.101: [2023-03-05 09:12:02,597] [INFO] [checkpointing.py:552:forward] Activation Checkpointing Information
10.0.1.101: [2023-03-05 09:12:02,597] [INFO] [checkpointing.py:553:forward] ----Partition Activations True, CPU CHECKPOINTING True
10.0.1.101: [2023-03-05 09:12:02,597] [INFO] [checkpointing.py:556:forward] ----contiguous Memory Checkpointing True with 62 total layers
10.0.1.101: [2023-03-05 09:12:02,597] [INFO] [checkpointing.py:559:forward] ----Synchronization True
10.0.1.101: [2023-03-05 09:12:02,597] [INFO] [checkpointing.py:560:forward] ----Profiling time in checkpointing False
10.0.1.101: WARNING! The input of FusedLayerNorm should be on the GPU.This warning should only be triggered in the FusedLayerNorm unit tests.
10.0.1.99: WARNING! The input of FusedLayerNorm should be on the GPU.This warning should only be triggered in the FusedLayerNorm unit tests.
10.0.1.101: WARNING! The input of FusedLayerNorm should be on the GPU.This warning should only be triggered in the FusedLayerNorm unit tests.
...

After that, I got the following errors:


10.0.1.101: Traceback (most recent call last):
10.0.1.101:   File "../pretrain_gpt.py", line 326, in <module>
10.0.1.99: Traceback (most recent call last):
10.0.1.99:   File "../pretrain_gpt.py", line 326, in <module>
10.0.1.101:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
10.0.1.101:   File "/workdir/2023/Megatron-DeepSpeed/megatron/training.py", line 187, in pretrain
10.0.1.101:     iteration = train(forward_step_func,
All nodes (10.0.1.96, 10.0.1.97, 10.0.1.99, 10.0.1.101) fail with the same error; the traceback from a single rank:

Traceback (most recent call last):
  File "../pretrain_gpt.py", line 326, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/workdir/2023/Megatron-DeepSpeed/megatron/training.py", line 187, in pretrain
    iteration = train(forward_step_func,
  File "/workdir/2023/Megatron-DeepSpeed/megatron/training.py", line 1034, in train
    train_step(forward_step_func,
  File "/workdir/2023/Megatron-DeepSpeed/megatron/training.py", line 607, in train_step
    losses_reduced = forward_backward_func(
  File "/workdir/2023/Megatron-DeepSpeed/megatron/schedules.py", line 161, in forward_backward_no_pipelining
    backward_step(optimizer, input_tensor, output_tensor, output_tensor_grad, model)
  File "/workdir/2023/Megatron-DeepSpeed/megatron/schedules.py", line 100, in backward_step
    model.backward(output_tensor)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 9, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1968, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 9, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2089, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 51, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 402, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 191, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 698, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/workdir/2023/Megatron-DeepSpeed/megatron/model/transformer.py", line 710, in custom_forward
    x_, moe_loss = layer(x_, attention_mask=attention_mask, encoder_output=encoder_output, enc_dec_attn_mask=enc_dec_attn_mask)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1204, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/workdir/2023/Megatron-DeepSpeed/megatron/model/transformer.py", line 474, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1204, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/workdir/2023/Megatron-DeepSpeed/megatron/model/fused_layer_norm.py", line 92, in forward
    return F.layer_norm(input, self.normalized_shape, self.weight, self.bias, self.eps)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2509, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:4! (when checking argument for argument weight in method wrapper__native_layer_norm)
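The error indicates that F.layer_norm was invoked with its input on a CUDA device while the layernorm weight was still on the CPU. A minimal standalone sketch (not the repo's code) that triggers the same RuntimeError:

import torch
import torch.nn.functional as F

# layer_norm requires input, weight, and bias to live on the same device.
hidden_states = torch.randn(8, 1024, device="cuda")  # activation on the GPU
weight = torch.ones(1024)                             # layernorm weight left on the CPU
bias = torch.zeros(1024)                              # layernorm bias left on the CPU

# Raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cpu and cuda:0!
F.layer_norm(hidden_states, (1024,), weight, bias, 1e-5)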

Test script

#!/bin/bash
set -ex

BASE_PATH=XXXXX
DATA_PATH=XXXXX
DS_CONFIG=XXXXX

TP=1
PP=1

MODEL_SIZE=50
GLOBAL_BATCH=256
MICRO_BATCH=8

ZERO_STAGE=3
PA_CPU=true

# For larger models, offload to CPU by uncommenting the two lines below
# OFFLOAD_DEVICE="cpu"
# CPU_OPTIM=" --cpu-optimizer"

# No CPU offloading: leave OFFLOAD_DEVICE as "none" and CPU_OPTIM empty
OFFLOAD_DEVICE="none"
CPU_OPTIM=" "

if [ "$MODEL_SIZE" = "10" ]; then
NLAYERS=50
HIDDEN=4096
ATTEN_HEADS=16
SEQ_LENGTH=1024
fi

if [ "$MODEL_SIZE" = "50" ]; then
NLAYERS=62
HIDDEN=8192
ATTEN_HEADS=32
SEQ_LENGTH=1024
fi

if [ "$MODEL_SIZE" = "100" ]; then
NLAYERS=125
HIDDEN=8192
ATTEN_HEADS=32
SEQ_LENGTH=1024
fi

OUTPUT_DIR=./log/ds_z${ZERO_STAGE}_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
#OUTPUT_DIR=baseline_nl${NLAYERS}_hs${HIDDEN}_gb${GLOBAL_BATCH}_mb${MICRO_BATCH}
mkdir -p $OUTPUT_DIR

cat <<EOT > $DS_CONFIG
{
  "train_batch_size" : $GLOBAL_BATCH,
  "train_micro_batch_size_per_gpu": $MICRO_BATCH,
  "steps_per_print": 1,

  "zero_optimization": {
    "stage": $ZERO_STAGE,
    "overlap_comm": true
  },

  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": $PA_CPU,
    "contiguous_memory_optimization": true
  },

  "aio": {
    "block_size": 1048576,
    "queue_depth": 16,
    "single_submit": false,
    "overlap_events": true,
    "thread_count": 2
  },
  "fp16": {
    "enabled": true
  },
  "wall_clock_breakdown" : true
}
EOT

export NCCL_DEBUG=warn

ds_args=""
ds_args=" --deepspeed ${ds_args}"
ds_args=" --no-pipeline-parallel ${ds_args}" 
ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}"
ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"
ds_args=" --deepspeed-activation-checkpointing ${ds_args}"
ds_args=" --partition-activations ${ds_args}"

if [ "$PA_CPU" = "true" ]; then
ds_args=" --checkpoint-in-cpu ${ds_args}"
fi

ds_args=" --synchronize-each-layer ${ds_args}"
ds_args=" --contigious-checkpointing ${ds_args}"


deepspeed --hostfile=./hostfile ../pretrain_gpt.py \
    --tensor-model-parallel-size $TP \
    --pipeline-model-parallel-size $PP \
    --num-layers $NLAYERS \
    --hidden-size $HIDDEN \
    --num-attention-heads $ATTEN_HEADS \
    --seq-length $SEQ_LENGTH \
    --loss-scale 12 \
    --max-position-embeddings $SEQ_LENGTH \
    --micro-batch-size $MICRO_BATCH \
    --global-batch-size $GLOBAL_BATCH \
    --train-iters 15 \
    --lr 6.0e-5 \
    --min-lr 6.0e-6 \
    --lr-decay-style cosine \
    --log-interval 1 \
    --eval-iters 0 \
    --eval-interval 1000 \
    --data-path $DATA_PATH \
    --vocab-file $BASE_PATH/gpt2-vocab.json \
    --merge-file $BASE_PATH/gpt2-merges.txt \
    --save-interval 1000 \
    --split 98,2,0 \
    --clip-grad 1.0 \
    --weight-decay 0.1 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --init-method-std 0.006 \
    --fp16 \
    --checkpoint-activations \
    --tensorboard-dir $OUTPUT_DIR \
    $CPU_OPTIM $ds_args \
    --exit-interval 5000 | tee ${OUTPUT_DIR}/output.log


--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.13.0.dev20220719+cu113
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.8.1, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.3

[checkpoint conversion] meg-ds to meg-ds topology reshaping

Feature request

Similar to https://github.com/microsoft/Megatron-DeepSpeed/tree/main/tools/convert_checkpoint

deepspeed_to_megatron.py --target_tp TARGET_TP --target_pp TARGET_PP [...]

where the checkpoint can be reshaped to a different TP/PP target for Megatron-DeepSpeed to Megatron-LM conversion, we need the same for Megatron-DeepSpeed to Megatron-DeepSpeed, i.e. it is currently not possible to change the TP topology once training has started.

So the desired API is:

deepspeed_to_deepspeed.py --target_tp TARGET_TP --target_pp TARGET_PP [...]
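
A hypothetical invocation of the requested tool (the --input_folder/--output_folder flag names are assumptions, mirroring the existing deepspeed_to_megatron.py converter; paths are placeholders):

python tools/convert_checkpoint/deepspeed_to_deepspeed.py \
    --input_folder /path/to/checkpoint/global_step10000 \
    --output_folder /path/to/reshaped_checkpoint \
    --target_tp 4 \
    --target_pp 2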

Critical new need: the optimizer states need to be reshaped as well

Thank you!

@tjruwase
