bigcode-project / megatron-lm
This project is forked from nvidia/megatron-lm.
Ongoing research training transformer models at scale.
License: Other
Since we do not have much time to experiment with all these components, we have to make decisions based on the current literature.
I ran into a new error:
Traceback (most recent call last):
File "/Megatron-LM/pretrain_gpt.py", line 148, in <module>
pretrain(train_valid_test_datasets_provider,
File "/Megatron-LM/megatron/training.py", line 161, in pretrain
iteration = train(forward_step_func,
File "/Megatron-LM/megatron/training.py", line 740, in train
train_step(forward_step_func,
File "/Megatron-LM/megatron/training.py", line 434, in train_step
losses_reduced = forward_backward_func(
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 360, in forward_backward_no_pipelining
output_tensor = forward_step(forward_step_func, data_iterator,
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 218, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/Megatron-LM/pretrain_gpt.py", line 81, in forward_step
tokens, labels, loss_mask, attention_mask, position_ids = get_batch(
File "/Megatron-LM/pretrain_gpt.py", line 46, in get_batch
data_b = tensor_parallel.broadcast_data(keys, data, datatype)
File "/Megatron-LM/megatron/core/tensor_parallel/data.py", line 76, in broadcast_data
key_size, key_numel, total_numel = _build_key_size_numel_dictionaries(keys,
File "/Megatron-LM/megatron/core/tensor_parallel/data.py", line 31, in _build_key_size_numel_dictionaries
assert data[key].dim() < max_dim, 'you should increase MAX_DATA_DIM'
IndexError: too many indices for tensor of dimension 2
I checked `data`: it fails because `data` is a tensor, not a dictionary. I don't know at which point the keys are supposed to be added to the data.
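For context, a minimal sketch of what the data iterator is expected to yield here, assuming the usual `pretrain_gpt.py` convention where `get_batch` asks `broadcast_data` for the `'text'` key (shapes are illustrative):

```python
import torch

# Hedged sketch: tensor_parallel.broadcast_data(keys, data, datatype) indexes
# `data` by key, so each batch from the iterator must be a dict of tensors,
# e.g. {'text': LongTensor of shape [micro_batch_size, seq_length + 1]},
# not a bare 2-D tensor (which is what the IndexError above suggests).
def check_batch(batch, keys=("text",)):
    assert isinstance(batch, dict), f"expected a dict of tensors, got {type(batch)}"
    for key in keys:
        assert key in batch, f"missing key {key!r} in batch"
        assert torch.is_tensor(batch[key]), f"batch[{key!r}] is not a tensor"
```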
Thanks for publishing the model to Hugging Face. For using the Triton Inference Server in products like https://github.com/fauxpilot/fauxpilot:
Do you have a preferred way to convert it for the NVIDIA Triton Inference Server (e.g. https://github.com/triton-inference-server/fastertransformer_backend), starting for example from the Hugging Face checkpoint?
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/santacoder",
    revision="no-fim",  # name of branch or commit hash
    trust_remote_code=True
)
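Not an authoritative answer, but a common first step is to materialize the Hugging Face checkpoint in a local directory, which is the kind of input the conversion scripts around fastertransformer_backend typically start from. A minimal, self-contained sketch (the output path is arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: download the checkpoint once and write it to disk so that a
# converter (e.g. the FasterTransformer GPT conversion scripts) can read a
# local Hugging Face-format directory.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/santacoder", revision="no-fim", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
model.save_pretrained("./santacoder-hf")       # weights + config.json
tokenizer.save_pretrained("./santacoder-hf")   # tokenizer files
```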
Hi Team,
Great work!!
I would like to know the precision in which the model checkpoint is saved.
For example, if the model is trained with bf16 precision, what will the checkpoint precision be?
Thanks
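In case it helps, a minimal sketch for checking the stored precision directly on a saved checkpoint; the path and the top-level 'model' key follow the usual iter_XXXXXXX/mp_rank_00/model_optim_rng.pt layout, which may differ in your setup:

```python
import torch

# Hedged sketch: walk a saved Megatron checkpoint and report parameter dtypes.
ckpt = torch.load("iter_0010000/mp_rank_00/model_optim_rng.pt", map_location="cpu")

def print_dtypes(obj, prefix=""):
    if torch.is_tensor(obj):
        print(f"{prefix}: {obj.dtype}")
    elif isinstance(obj, dict):
        for key, value in obj.items():
            print_dtypes(value, f"{prefix}.{key}" if prefix else str(key))

# Fall back to the full dict if there is no 'model' entry.
print_dtypes(ckpt.get("model", ckpt))
```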
Why are we not multiplying the LM head FLOPs per iteration by the checkpoint_activations_factor?
Line 253 in bd0aaba
AFAIK the factor of 4 means 1 forward pass, 2 for the backward pass, and 1 extra forward pass needed to recompute the checkpointed activations. Don't we also need all 4 for the LM head? cc @RaymondLi0 @NouamaneTazi
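To make the question concrete, here is a hedged sketch of the FLOPs estimate being discussed; the symbol names are mine, and the exact expression at the referenced line may differ:

```python
# Hedged sketch of the per-iteration FLOPs estimate under discussion.
# B: global batch size, s: sequence length, l: layers, h: hidden size, V: vocab size.
def flops_per_iteration(B, s, l, h, V, ckpt_factor=4, ckpt_lm_head=False):
    # Transformer blocks: 24*B*s*l*h^2*(1 + s/(6h)) for forward + backward,
    # scaled by the recompute factor (4 = 1 fwd + 2 bwd + 1 recomputed fwd).
    transformer = ckpt_factor * 24 * B * s * l * h**2 * (1 + s / (6 * h))
    # LM head: 2*B*s*h*V per forward; 3x for fwd + bwd, or 4x if its forward
    # is also recomputed for activation checkpointing (the question here).
    lm_head = (ckpt_factor if ckpt_lm_head else 3) * 2 * B * s * h * V
    return transformer + lm_head
```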
How can I reproduce the HumanEval performance of this repo?
I tried to evaluate using the evaluation branch, but it seems to be very different from the default branch. Would it be possible to add a full evaluation process to the default branch?
Thanks a lot.
Feel free to add more things if required.
Checkpoint conversion to another model-parallel config does not support MQA for now.
Q and KV are handled separately for MQA:
https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/tools/checkpoint_loader_megatron.py#L206 — in this case the KV weights should not be concatenated (they are shared among TP ranks).
https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/tools/checkpoint_saver_megatron.py#L229 — same here: the KV weights should not be split across TP ranks, but copied instead.
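A hedged sketch of the distinction (tensor names and shapes are illustrative, not the actual ones in the conversion scripts): with multi-query attention only the Q projection is partitioned across tensor-parallel ranks, while the single shared KV projection is replicated, so merging should take KV from one rank instead of concatenating, and re-splitting should copy KV to every rank instead of chunking it.

```python
import torch

# Hedged sketch: reshaping attention weights to a new TP size with MQA.
def merge_tp_ranks(q_shards, kv_shards):
    # Q is partitioned across TP ranks -> concatenate along the output dim.
    q_full = torch.cat(q_shards, dim=0)
    # KV is shared (replicated) across ranks with MQA -> take any single copy.
    kv_full = kv_shards[0]
    return q_full, kv_full

def split_tp_ranks(q_full, kv_full, new_tp_size):
    # Q gets split across the new TP ranks; KV is copied to every rank.
    q_shards = list(torch.chunk(q_full, new_tp_size, dim=0))
    kv_shards = [kv_full.clone() for _ in range(new_tp_size)]
    return q_shards, kv_shards
```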
I use
python tools/preprocess_data.py \
--input /cobol/gpt2/data \
--output-prefix /cobol/data_preprocess \
--vocab-file /cobol/gpt2/vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file /cobol/gpt2/merges.txt \
--json-key content \
--workers 32 \
--chunk-size 25 \
--append-eod
to create the dataset, and use
#!/bin/bash
# Runs the "345M" parameter model
export CUDA_DEVICE_MAX_CONNECTIONS=1
CHECKPOINT_PATH=/cobol/gpt2/checkpoint
VOCAB_FILE=/cobol/vocab_file/vocab.json
MERGE_FILE=/cobol/gpt2/merges.txt
DATA_PATH=/cobol/gpt2/data_document/data_preprocess_content_document
GPT_ARGS="
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--attention-head-type multihead \
--seq-length 1024 \
--max-position-embeddings 1024 \
--micro-batch-size 4 \
--global-batch-size 8 \
--lr 0.00015 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--lr-warmup-fraction .01 \
--clip-grad 1.0 \
--fp16
"
DATA_ARGS="
--data-path $DATA_PATH \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--data-impl mmap \
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10
"
torchrun pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH
to pretrain. But I get this KeyError:
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_head_type ............................. multihead
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... mmap
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... ['/cobol/gpt2/data_document/data_preprocess_content_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 24
encoder_seq_length .............................. 1024
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 4096
fim_rate ........................................ 0.0
fim_split_sample ................................ None
fim_spm_rate .................................... 0.5
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
fragment_fim_rate ............................... 0.5
global_batch_size ............................... 8
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 1024
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 64
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ /cobol/gpt2/checkpoint
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.00015
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. 0.01
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
max_tokens_to_oom ............................... 12000
merge_file ...................................... /cobol/gpt2/merges.txt
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 24
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.float16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
rotary_percent .................................. 1.0
rotary_theta .................................... 10000
sample_rate ..................................... 1.0
sanity_check_dataloader_interval ................ None
save ............................................ /cobol/gpt2/checkpoint
save_interval ................................... 10000
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 1024
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 949,50,1
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
structured_logs ................................. False
structured_logs_dir ............................. None
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_weighted_split_names ....................... None
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
test_weighted_split_splits ...................... None
test_weighted_split_weights ..................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_file .................................. None
tokenizer_model ................................. None
tokenizer_type .................................. GPT2BPETokenizer
train_data_path ................................. None
train_iters ..................................... 500000
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
transformer_timers .............................. False
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
valid_num_workers ............................... 2
valid_weighted_split_names ...................... None
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
valid_weighted_split_splits ..................... None
valid_weighted_split_weights .................... None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... /cobol/vocab_file/vocab.json
vocab_size ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 2
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/Megatron-LM/megatron/data'
>>> done with dataset index builder. Compilation time: 0.055 seconds
[rank0]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
/Megatron-LM/megatron/training.py:104: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
start_time_tensor = torch.cuda.DoubleTensor([_TRAIN_START_TIME])
time to initialize megatron (seconds): 1.204
[after megatron is initialized] datetime: 2024-01-02 17:35:35
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 354871296
> learning rate decay style: cosine
loading release checkpoint from /cobol/gpt2/checkpoint
could not find arguments in the checkpoint ...
checkpoint version 0
succesfully fixed query-key-values ordering for checkpoint version 0
successfully loaded checkpoint from /cobol/gpt2/checkpoint at iteration 0
(min, max) time across ranks (ms):
load-checkpoint ................................: (540.41, 540.41)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-01-02 17:35:36
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 4000000
validation: 40080
test: 80
> building train, validation, and test datasets for GPT ...
Single data path provided for train, valid & test
> building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
> finished creating indexed dataset in 0.000152 seconds
number of documents: 4047
> dataset split:
train:
document indices in [0, 3841) total of 3841 documents
validation:
document indices in [3841, 4043) total of 202 documents
test:
document indices in [4043, 4047) total of 4 documents
> Tokens per epoch: 14533042
> loading doc-idx mapping from /cobol/gpt2/data_document/data_preprocess_content_document_train_indexmap_4000000ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from /cobol/gpt2/data_document/data_preprocess_content_document_train_indexmap_4000000ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from /cobol/gpt2/data_document/data_preprocess_content_document_train_indexmap_4000000ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 4002264
total number of epochs: 282
Traceback (most recent call last):
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 346, in __init__
self.suffix_tok_id, self.prefix_tok_id, self.middle_tok_id, self.pad_tok_id = (self.tokenizer.special_tokens[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD])
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 346, in <genexpr>
self.suffix_tok_id, self.prefix_tok_id, self.middle_tok_id, self.pad_tok_id = (self.tokenizer.special_tokens[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD])
KeyError: '<fim_suffix>'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Megatron-LM/pretrain_gpt.py", line 148, in <module>
pretrain(train_valid_test_datasets_provider,
File "/Megatron-LM/megatron/training.py", line 140, in pretrain
= build_train_valid_test_data_iterators(
File "/Megatron-LM/megatron/training.py", line 1047, in build_train_valid_test_data_iterators
build_train_valid_test_data_loaders(
File "/Megatron-LM/megatron/training.py", line 979, in build_train_valid_test_data_loaders
train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
File "/Megatron-LM/pretrain_gpt.py", line 100, in train_valid_test_datasets_provider
train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 33, in build_train_valid_test_datasets
all_train_datasets, all_valid_datasets, all_test_datasets = _build_train_valid_test_datasets(data_prefix[0],
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 234, in _build_train_valid_test_datasets
train_dataset = build_dataset(0, 'train')
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 227, in build_dataset
dataset = GPTDataset(name, data_prefix,
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 348, in __init__
self.suffix_tok_id, self.prefix_tok_id, self.middle_tok_id, self.pad_tok_id = (self.tokenizer.vocab[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD])
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 348, in <genexpr>
self.suffix_tok_id, self.prefix_tok_id, self.middle_tok_id, self.pad_tok_id = (self.tokenizer.vocab[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD])
KeyError: '<fim_suffix>'
It looks like the tokenizer doesn't include <fim_suffix>. What should I do?
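For what it's worth, even with --fim-rate 0.0 the traceback shows the GPT dataset constructor still looks up the FIM special tokens, so <fim_prefix>, <fim_middle>, <fim_suffix> and <fim_pad> must exist in the tokenizer vocabulary, and a plain GPT-2 vocab.json does not contain them. A hedged sketch for checking (and, as an assumption about your setup, appending) them to a GPT-2-style vocab.json; note that any appended IDs must still be covered by the model's padded vocabulary size:

```python
import json

FIM_TOKENS = ["<fim_prefix>", "<fim_middle>", "<fim_suffix>", "<fim_pad>"]

# Hedged sketch: verify the FIM special tokens exist in a GPT-2-style vocab.json
# and append them at the end if missing. Illustrative only.
with open("/cobol/gpt2/vocab.json") as f:
    vocab = json.load(f)

missing = [tok for tok in FIM_TOKENS if tok not in vocab]
for tok in missing:
    vocab[tok] = len(vocab)  # assign the next free token id

if missing:
    with open("/cobol/gpt2/vocab.json", "w") as f:
        json.dump(vocab, f)
print("added:", missing)
```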
As discussed today, let's train a 350M model with the following hyperparameters:
Let's see how it compares against previously trained models.
The --(train|valid|test)-weighted-split-paths-path arguments (added in #32) parse the data arguments from a file in a specific format.
Loading could be made simpler by reading a structured file (JSON or YAML). Such a file would also be more human-readable; a sketch of what it could look like is below.
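Purely as an illustration of the proposal (none of these field names exist in the current parser; they are invented here), a structured split file and a tiny loader might look like this:

```python
import json

# Hypothetical weighted-split config; field names are invented for illustration.
example = {
    "train": [
        {"name": "code", "weight": 0.7, "split": "0:0.95", "prefix": "/data/code_document"},
        {"name": "text", "weight": 0.3, "split": "0:0.95", "prefix": "/data/text_document"},
    ],
    "valid": [
        {"name": "code", "weight": 1.0, "split": "0.95:1.0", "prefix": "/data/code_document"},
    ],
}

with open("data_splits.json", "w") as f:
    json.dump(example, f, indent=2)

# A loader would read this back and rebuild the per-split
# (weight, range, prefix) tuples that the current string format encodes.
with open("data_splits.json") as f:
    cfg = json.load(f)
for split_name, datasets in cfg.items():
    print(split_name, [(d["weight"], d["split"], d["prefix"]) for d in datasets])
```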
The memory benchmark was run with the following configurations; the last column is the relative increase of the Adan peak over the Adam peak:
| Head | Layers | Emb. Dim | Model Size (MB) | Adam Peak (MB) | Adan Peak (MB) | Diff (%) |
|---|---|---|---|---|---|---|
| 6 | 6 | 768 | 81 | 4490 | 4490 | 0 |
| 12 | 6 | 768 | 81 | 5848 | 5848 | 0 |
| 16 | 6 | 768 | 81 | 6776 | 6776 | 0 |
| 6 | 12 | 768 | 124 | 7151 | 7153 | 0.03 |
| 12 | 12 | 768 | 124 | 9869 | 9871 | 0.02 |
| 16 | 12 | 768 | 124 | 11733 | 11735 | 0.02 |
| 16 | 6 | 1024 | 128 | 7302 | 7304 | 0.03 |
| 16 | 12 | 1024 | 203 | 12719 | 12721 | 0.02 |
| 6 | 24 | 768 | 209 | 12471 | 12475 | 0.03 |
| 12 | 24 | 768 | 209 | 17922 | 17922 | 0 |
| 16 | 24 | 768 | 209 | 21596 | 21600 | 0.02 |
| 6 | 6 | 1536 | 248 | 6905 | 8241 | 19.35 |
| 12 | 6 | 1536 | 248 | 8235 | 8539 | 3.69 |
| 16 | 6 | 1536 | 248 | 9141 | 9445 | 3.33 |
| 16 | 24 | 1024 | 354 | 23530 | 23534 | 0.02 |
| 16 | 6 | 2048 | 407 | 11098 | 12159 | 9.56 |
| 6 | 12 | 1536 | 418 | 11137 | 13778 | 23.71 |
| 12 | 12 | 1536 | 418 | 13390 | 14164 | 5.78 |
| 16 | 12 | 1536 | 418 | 15667 | 15976 | 1.97 |
| 16 | 6 | 2560 | 603 | 13967 | 18207 | 30.36 |
| 16 | 12 | 2048 | 709 | 18851 | 20954 | 11.16 |
| 6 | 24 | 1536 | 758 | 19660 | 24819 | 26.24 |
| 12 | 24 | 1536 | 758 | 25096 | 25406 | 1.24 |
| 16 | 24 | 1536 | 758 | 28720 | 29030 | 1.08 |
| 16 | 12 | 2560 | 1075 | 28475 | 32134 | 12.85 |
| 16 | 24 | 2048 | 1313 | 34357 | 38595 | 12.34 |
As the embedding dimension (Emb. Dim) increases, the probability that Adan's extra memory is larger also increases.
I am trying to reshape a fine-tuned checkpoint of StarCoder (https://huggingface.co/bigcode/starcoder-megatron/tree/main) from TP=4, PP=4 to TP=8, PP=1 using tools/checkpoint_util.py, but I ran into an out-of-memory (OOM) issue.
The machine I used has 512 GB of memory, which is enough to load the whole model. Is there any way to solve this issue?
Here's my log before the OOM (I added some debugging output to track the process); it seems like the checkpoint loader is sending network layers to the saver.
Loaded checkpoint_loader_megatron as the loader.
Loaded checkpoint_saver_megatron as the saver.
Starting saver...
Starting loader...
Wandb import failed
Wandb import failed
/opt/conda/lib/python3.9/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/opt/conda/lib/python3.9/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
Setting num_layers to 40 from checkpoint
Setting hidden_size to 6144 from checkpoint
Setting ffn_hidden_size to 24576 from checkpoint
Setting seq_length to 2048 from checkpoint
Setting num_attention_heads to 48 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 8192 from checkpoint
Checkpoint did not provide arguments add_position_embedding
Checkpoint did not provide arguments use_rotary_position_embeddings
Checkpoint did not provide arguments rotary_percent
Checkpoint did not provide arguments add_bias_linear
Checkpoint did not provide arguments swiglu
Checkpoint did not provide arguments untie_embeddings_and_output_weights
Checkpoint did not provide arguments apply_layernorm_1p
Setting tokenizer_type to TokenizerFromFile from checkpoint
Setting padded_vocab_size to 49152 from checkpoint
Setting attention_head_type to multiquery from checkpoint
Setting tensor_model_parallel_size to 4 from checkpoint
Setting pipeline_model_parallel_size to 4 from checkpoint
Checkpoint did not provide arguments virtual_pipeline_model_parallel_size
Checkpoint did not provide arguments num_layers_per_virtual_pipeline_stage
using world size: 16, data-parallel-size: 1, tensor-model-parallel size: 4, pipeline-model-parallel size: 4
setting global batch size to 1
using torch.float32 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.1
attention_head_type ............................. multiquery
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. False
bias_gelu_fusion ................................ False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 40
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 100
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 24576
fim_rate ........................................ 0.0
fim_spm_rate .................................... 0.5
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 1
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 6144
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
iteration ....................................... xxx
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ xxx
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. None
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 8192
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_warmup ..................................... False
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... True
no_save_rng ..................................... True
num_attention_heads ............................. 48
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 40
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
override_opt_param_scheduler .................... False
padded_vocab_size ............................... 49152
params_dtype .................................... torch.float32
patch_dim ....................................... 16
perform_initialization .......................... False
pipeline_model_parallel_size .................... 4
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
rotary_percent .................................. 1.0
sample_rate ..................................... 1.0
save ............................................ None
save_interval ................................... None
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... None
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
structured_logs ................................. False
structured_logs_dir ............................. None
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_file .................................. None
tokenizer_model ................................. None
tokenizer_type .................................. TokenizerFromFile
train_data_path ................................. None
train_iters ..................................... None
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 4
transformer_timers .............................. False
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... True
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
valid_num_workers ............................... 2
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 16
-------------------- end of arguments ---------------------
Wandb import failed
setting number of micro-batches to constant 1
running on CUDA devices
loading rank 0 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 1 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 2 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 3 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
Overwriting default ffn_hidden_size value None with value from checkpoint 24576.
Overwriting default kv_channels value None with value from checkpoint 128.
Overwriting default micro_batch_size value 1 with value from checkpoint 2.
Overwriting default global_batch_size value None with value from checkpoint 64.
Overwriting default log_interval value 100 with value from checkpoint 10.
Overwriting default tensorboard_dir value None with value from checkpoint xxx/tensorboard/.
Overwriting default dataloader_type value None with value from checkpoint single.
Overwriting default lr value None with value from checkpoint 1e-05.
Overwriting default lr_decay_style value linear with value from checkpoint cosine.
Overwriting default min_lr value 0.0 with value from checkpoint 1e-06.
Overwriting default load value None with value from checkpoint xxx/starcoder-megatron.
Checkpoint had argument load_step but new arguments does not have this.
Checkpoint had argument finetune_from but new arguments does not have this.
Overwriting default bf16 value False with value from checkpoint True.
Overwriting default accumulate_allreduce_grads_in_fp32 value False with value from checkpoint True.
Overwriting default local_rank value None with value from checkpoint 0.
Overwriting default eval_iters value 100 with value from checkpoint 10.
Overwriting default eval_interval value 1000 with value from checkpoint 5000.
Overwriting default data_path value None with value from checkpoint ['xx/data/sft_experiments/xxx'].
Overwriting default split value None with value from checkpoint 998,1,1.
Overwriting default merge_file value None with value from checkpoint ./experiments/cro/starcoder/merges.txt.
Overwriting default data_impl value infer with value from checkpoint mmap.
Overwriting default log_validation_ppl_to_tensorboard value False with value from checkpoint True.
Overwriting default world_size value 8 with value from checkpoint 16.
Checkpoint had argument transformer_pipeline_model_parallel_size but new arguments does not have this.
Checkpoint had argument data_parallel_size but new arguments does not have this.
Checkpoint had argument valid_weighted_split_names but new arguments does not have this.
Checkpoint had argument valid_weighted_split_weights but new arguments does not have this.
Checkpoint had argument valid_weighted_split_splits but new arguments does not have this.
Checkpoint had argument test_weighted_split_names but new arguments does not have this.
Checkpoint had argument test_weighted_split_weights but new arguments does not have this.
Checkpoint had argument test_weighted_split_splits but new arguments does not have this.
Checkpoint had argument consumed_train_samples but new arguments does not have this.
Checkpoint had argument consumed_valid_samples but new arguments does not have this.
Checkpoint had argument padded_vocab_size but new arguments does not have this.
Checkpoint had argument model_type but new arguments does not have this.
Checkpoint had argument iteration but new arguments does not have this.
Checkpoint had argument do_train but new arguments does not have this.
Checkpoint had argument do_valid but new arguments does not have this.
Checkpoint had argument do_test but new arguments does not have this.
Checkpoint had argument curr_iteration but new arguments does not have this.
using world size: 16, data-parallel-size: 2, tensor-model-parallel size: 4, pipeline-model-parallel size: 2
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.1
attention_head_type ............................. multiquery
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. False
bias_gelu_fusion ................................ False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... mmap
data_parallel_random_init ....................... False
data_parallel_size .............................. 2
data_path ....................................... ['xx/data/sft_experiments/xxx']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 40
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 5000
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 24576
fim_rate ........................................ 0.0
fim_spm_rate .................................... 0.5
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 64
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 6144
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ xxx/starcoder-megatron
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... True
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 1e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 8192
max_tokens_to_oom ............................... 12000
merge_file ...................................... ./experiments/cro/starcoder/merges.txt
micro_batch_size ................................ 2
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-06
mmap_warmup ..................................... False
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... True
no_save_rng ..................................... True
num_attention_heads ............................. 48
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 40
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... False
pipeline_model_parallel_size .................... 2
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
rotary_percent .................................. 1.0
sample_rate ..................................... 1.0
save ............................................ xxx
save_interval ................................... 1
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 998,1,1
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
structured_logs ................................. False
structured_logs_dir ............................. None
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. xxx/tensorboard/
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_weighted_split_names ....................... None
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
test_weighted_split_splits ...................... None
test_weighted_split_weights ..................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_file .................................. None
tokenizer_model ................................. None
tokenizer_type .................................. TokenizerFromFile
train_data_path ................................. None
train_iters ..................................... None
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 2
transformer_timers .............................. False
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... True
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
valid_num_workers ............................... 2
valid_weighted_split_names ...................... None
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
valid_weighted_split_splits ..................... None
valid_weighted_split_weights .................... None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 16
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 16
Setting consumed_train_samples to None and consumed_valid_samples to None
Wandb import failed
running on CUDA devices
sending embeddings
sending transformer layer 0
received embeddings
Original vocab size not specified, leaving embedding table as-is. If you've changed the tensor parallel size this could cause problems.
building GPT model ...
sending transformer layer 1
sending transformer layer 2
WARNING! Distributed processes aren't initialized, so word embeddings in the last layer are not initialized. If you are just manipulating a model this is fine, but this needs to be handled manually. If you are training something is definitely wrong.
sending transformer layer 3
sending transformer layer 4
sending transformer layer 5
sending transformer layer 6
sending transformer layer 7
sending transformer layer 8
sending transformer layer 9
loading rank 0 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
building GPT model ...
building GPT model ...
building GPT model ...
debugging in saver, get transformers layer from queue ...
received transformer layer 0
debugging in saver, get transformers layer from queue ...
received transformer layer 1
debugging in saver, get transformers layer from queue ...
received transformer layer 2
debugging in saver, get transformers layer from queue ...
received transformer layer 3
debugging in saver, get transformers layer from queue ...
received transformer layer 4
debugging in saver, get transformers layer from queue ...
received transformer layer 5
debugging in saver, get transformers layer from queue ...
received transformer layer 6
debugging in saver, get transformers layer from queue ...
received transformer layer 7
debugging in saver, get transformers layer from queue ...
received transformer layer 8
debugging in saver, get transformers layer from queue ...
received transformer layer 9
debugging in saver, get transformers layer from queue ...
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 1 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 2 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
Wandb used to work fine, but now there is an issue during initialization.
Workaround for now: disable wandb.
2023-02-17 22:39:17,767 (worker_0) : training ...
2023-02-17 22:39:24,279 (worker_7) : wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
2023-02-17 22:39:25,325 (worker_7) : wandb: - Waiting for wandb.init()...
2023-02-17 22:39:26,326 (worker_7) : wandb: \ Waiting for wandb.init()...
[more of the same log]
2023-02-17 22:40:24,365 (worker_7) : init_wandb
2023-02-17 22:40:24,365 (worker_7) : wandb: ERROR Error communicating with wandb process
wandb: ERROR For more info see: https://docs.wandb.ai/library/init#init-start-error
2023-02-17 22:40:24,367 (worker_7) : Traceback (most recent call last):
2023-02-17 22:40:24,367 (worker_7) : File "/app/toolkit_infiniband_example/run.py", line 34, in run
2023-02-17 22:40:24,367 (worker_7) : runnable.run()
2023-02-17 22:40:24,367 (worker_7) : File "/app/toolkit_infiniband_example/worker.py", line 89, in run
2023-02-17 22:40:24,367 (worker_7) : self._model.train()
2023-02-17 22:40:24,367 (worker_7) : File "/app/toolkit_infiniband_example/models/megatron_gpt.py", line 65, in train
2023-02-17 22:40:24,368 (worker_7) : pretrain(train_valid_test_datasets_provider, model_provider,
2023-02-17 22:40:24,368 (worker_7) : File "/app/megatron/training.py", line 155, in pretrain
2023-02-17 22:40:24,368 (worker_7) : iteration = train(forward_step_func,
2023-02-17 22:40:24,368 (worker_7) : File "/app/megatron/training.py", line 685, in train
2023-02-17 22:40:24,368 (worker_7) : init_wandb()
2023-02-17 22:40:24,368 (worker_7) : File "/app/megatron/initialize.py", line 244, in init_wandb
2023-02-17 22:40:24,368 (worker_7) : wandb.init(
2023-02-17 22:40:24,368 (worker_7) : File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1078, in init
2023-02-17 22:40:24,368 (worker_7) : run = wi.init()
2023-02-17 22:40:24,368 (worker_7) : File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 719, in init
2023-02-17 22:40:24,368 (worker_7) : raise UsageError(error_message)
2023-02-17 22:40:24,368 (worker_7) : wandb.errors.UsageError: Error communicating with wandb process
2023-02-17 22:40:24,368 (worker_7) : For more info see: https://docs.wandb.ai/library/init#init-start-error
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
2023-02-17 22:42:33,930 (worker_7) : Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fd1940af970>: Failed to establish a new connection: [Errno 110] Connection timed out')': /api/5288891/store/
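The simplest form of that workaround is to guard the W&B initialization so a failed connection doesn't kill the run. A minimal sketch, using a hypothetical helper rather than the repo's actual init_wandb code:

import os

def safe_init_wandb(project: str, entity: str):
    """Try to start a W&B run; fall back to disabled mode if the backend is unreachable."""
    try:
        import wandb
        return wandb.init(project=project, entity=entity)
    except Exception as exc:  # e.g. wandb.errors.UsageError, connection timeouts
        print(f"wandb init failed ({exc}); continuing without wandb")
        os.environ["WANDB_MODE"] = "disabled"  # any later wandb.init picks up disabled mode
        return None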
Add a feature to run validation on multiple datasets.
It would be useful to get validation loss on different programming languages, for example.
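A minimal sketch of what per-dataset validation could look like; the function name and the loader/loss interfaces are assumptions for illustration, not the repo's actual evaluation loop:

import torch

def evaluate_per_dataset(model, valid_loaders, loss_fn):
    """valid_loaders maps a name (e.g. a programming language) to a DataLoader."""
    model.eval()
    results = {}
    with torch.no_grad():
        for name, loader in valid_loaders.items():
            total, batches = 0.0, 0
            for batch in loader:
                logits = model(batch["tokens"])
                total += loss_fn(logits, batch["labels"]).item()
                batches += 1
            results[name] = total / max(batches, 1)  # average validation loss for this dataset
    return results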
Hi @RaymondLi0, in the evaluation branch I found the evaluation code related to HumanEval, but unfortunately it doesn't work.
Some specific mismatches are as follows: prefix_lm and sep_in_bidirectional_context are not arguments of generate_and_post_process.
Could you help me update it?
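For reference, a call that simply drops the two unsupported keyword arguments might look roughly like this; the import path and remaining parameters follow the upstream Megatron-LM API as I understand it, so treat them as assumptions:

from megatron.text_generation import generate_and_post_process  # assumed upstream location

# prefix_lm and sep_in_bidirectional_context are omitted; only arguments the
# upstream signature is believed to accept are passed here.
result = generate_and_post_process(
    model,
    prompts=["def fibonacci(n):"],
    tokens_to_generate=128,
)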
With micro-batch-size=2, global-batch-size=192, a 1B-model configuration, the UL2 training script gives:
forward-backward ...............................: (5164.84, 5177.15)
forward-compute ................................: (2584.92, 2735.66)
backward-compute ...............................: (2423.13, 2586.40)
batch-generator ................................: (408.98, 817.78) <---
data-iterator ..................................: (3.90, 43.10)
broadcast-data .................................: (395.41, 776.42) <---
layernorm-grads-all-reduce .....................: (0.02, 0.03)
embedding-grads-all-reduce .....................: (0.03, 0.04)
grads-all-reduce ...............................: (193.33, 193.63)
optimizer-copy-to-main-grad ....................: (10.61, 10.67)
optimizer-unscale-and-check-inf ................: (36.12, 36.37)
optimizer-clip-main-grad .......................: (2.90, 3.00)
optimizer-count-zeros ..........................: (0.00, 0.01)
optimizer-inner-step ...........................: (17.63, 17.73)
optimizer-copy-main-to-model-params ............: (4.63, 4.72)
optimizer ......................................: (72.92, 73.13)
whereas the GPT training script with the same configuration has roughly half the forward time, mostly because its broadcast-data time is close to zero. https://github.com/bigcode-project/Megatron-LM/blob/ul2-merge/pretrain_ul2.py#L90
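To confirm that the gap really comes from broadcast-data rather than the model itself, the raw broadcast can be timed in isolation. A minimal sketch, assuming a torch.distributed process group is already initialized and the tensor lives on the current CUDA device; time_broadcast_ms is a made-up helper name:

import time
import torch
import torch.distributed as dist

def time_broadcast_ms(tensor, src=0, iters=10):
    """Average wall-clock time in milliseconds to broadcast `tensor` from rank `src`."""
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dist.broadcast(tensor, src=src)
    torch.cuda.synchronize()
    return (time.time() - start) * 1000.0 / iters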
Add document meta-information at the beginning (filename, filepath, repo name, number of stars...)
There should also be a dropout for each of these fields.
Optional: at the beginning of the preprocessed document, prepend the number of tokens that contain the meta-information. That would allow detokenizing only the meta-information instead of the whole document during training.
One issue would be that preprocessed samples of length 2048 would get shorter if the meta-information is removed. Possible workaround:
Cleaner solution: tokenize document content and meta-information strings separately. Let N be the number of meta-information fields that can be dropped out. We create N+1 preprocessed files. When loading data at training time, read through these N+1 files in parallel and add the meta-info with dropout. This solution requires more implementation work.
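A minimal sketch of the simplest variant (prepending the meta-information with per-field dropout at preprocessing time); the field names and dropout rates are placeholders, not the actual configuration:

import random

META_DROPOUT = {"filename": 0.1, "repo_name": 0.1, "stars": 0.5}  # hypothetical rates

def build_document(content, meta, rng):
    """Prepend meta-information fields, each independently dropped with its own probability."""
    kept = [f"{key}: {meta[key]}" for key in meta
            if rng.random() >= META_DROPOUT.get(key, 0.0)]
    header = "\n".join(kept)
    return header + "\n" + content if kept else content

rng = random.Random(1234)
print(build_document("def foo(): pass", {"filename": "foo.py", "stars": 3}, rng))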
Opening this after we discussed it in Slack.
It might prove useful for evaluation to train an encoder model on the same dataset used to train the main model.
Things proposed:
Once The Stack v1.2 is done (#24), attribute a weight to each data source (PL and NL).
Some PLs like HTML and CSS should probably be down-sampled.
On the other hand, we might want to upsample some low-resource PLs.
For the final model, the goal would be to set a limit of, say, 5 epochs maximum, and to check that each data source stays under that limit. An estimated 600B training tokens can be used to check this.
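A quick way to sanity-check that limit against the 600B-token budget; the per-source token counts and sampling weights below are placeholders, not the real data mix:

TOTAL_TRAIN_TOKENS = 600e9
MAX_EPOCHS = 5

# source name -> (unique tokens available, sampling weight); values are illustrative only
sources = {"python": (60e9, 0.30), "java": (45e9, 0.20), "html": (50e9, 0.02), "markdown": (20e9, 0.05)}

for name, (tokens, weight) in sources.items():
    epochs = TOTAL_TRAIN_TOKENS * weight / tokens
    status = "OK" if epochs <= MAX_EPOCHS else "over the limit -> down-sample"
    print(f"{name}: {epochs:.2f} epochs ({status})")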
Proposed list of experiments to run:
https://docs.google.com/spreadsheets/d/1xOIYoExQP_haA80ArY09fAk49fNVTO8eJrYysw71FQs/edit?usp=sharing
This list was created using this notebook: https://github.com/bigcode-project/Megatron-LM/blob/raymond-notebooks/notebooks/transformer_parameter_count.ipynb
Still open questions:
Which languages to train on? We could afford to do each experiment on single-language and multi-language datasets, doubling the compute.
Which evaluations? HumanEval, MBPP, repo-level eval? Some downstream tasks with finetuning? https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/
For the model training it's important that we achieve a high throughput. We need to figure out the multi-node training configuration (i.e., what combination of data, tensor, and pipeline parallelism) for 192 V100 GPUs.
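As a starting point for that search, a small helper can enumerate the (tensor, pipeline, data)-parallel factorizations that evenly divide 192 GPUs; the candidate TP/PP values below are assumptions, and throughput still has to be measured for each:

NUM_GPUS = 192

# Enumerate tensor-parallel (TP) and pipeline-parallel (PP) sizes; whatever is
# left over becomes the data-parallel (DP) size.
for tp in (1, 2, 4, 8):
    for pp in (1, 2, 3, 4, 6, 8):
        if NUM_GPUS % (tp * pp) == 0:
            dp = NUM_GPUS // (tp * pp)
            print(f"TP={tp} PP={pp} DP={dp}")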
When launching very long training runs, building the index mappings can take more than 1 minute.
The consequence is that the other ranks will time out. https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/training.py#L962
However, the timeout passed to torch.distributed.initialize is 10 minutes. Why isn't this value used in torch.distributed.broadcast?
The workaround for now is to first create the index mappings on a single worker, as a preliminary run.
Get an idea of the different flavours of scaling-law work that is out there: any work that tries to estimate the optimal model and dataset size with regard to a certain metric (PPL or other).
Some references:
Looking again at this code: https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/optimizer/optimizer.py#L269
1 - Aren't the gradients of the biases missing from the reduction?
2 - The distributed optimizer misses this reduction:
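For context, the reduction being discussed has roughly this shape: all-reduce the gradients of parameters that are replicated across tensor-parallel ranks (layer-norm weights and, arguably, biases). This is a sketch only; the tensor_model_parallel attribute used to tell sharded from replicated parameters is an assumption, not necessarily what the repo relies on:

import torch.distributed as dist

def allreduce_replicated_grads(model, tp_group):
    """Sketch: all-reduce grads of parameters replicated across tensor-parallel ranks."""
    for param in model.parameters():
        # Assume sharded params carry a tensor_model_parallel marker; everything else
        # (including biases and layer-norm params) is replicated and must be reduced.
        if param.grad is not None and not getattr(param, "tensor_model_parallel", False):
            dist.all_reduce(param.grad, group=tp_group)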
It would be very useful to have a script to reproduce and/or finetune StarCoder, similar to the example script for santacoder.
Hugging Face's Trainer doesn't support tensor/model parallelism, which makes it difficult to fit the model on a single 8-GPU node, so this code would help users fine-tune the models.
Hi,
how can a checkpoint in Megatron format be converted to Hugging Face format for inference?
When processing a dataset of 55 GB / 31M samples, preprocessing runs out of memory on a machine with 1.5 TB of memory.
The error happens when saving the index. Other, larger datasets had no issue, but this dataset is the one with the most documents.
Traceback (most recent call last):
File "Megatron-LM/tools/preprocess_data.py", line 227, in <module>
main()
File "Megatron-LM/tools/preprocess_data.py", line 224, in main
builders[key].finalize(output_idx_files[key])
File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 576, in finalize
index.write(self._sizes, self._doc_idx)
File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 369, in write
pointers = self._get_pointers(sizes)
File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 363, in _get_pointers
pointers.append(address)
MemoryError
The workaround for now is to first shard the dataset and tokenize each shard independently. At training time, the shards can be blended together.
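A minimal sketch of that sharding step for a JSONL dataset (file names and shard count are illustrative); each shard can then be fed to tools/preprocess_data.py separately and the resulting files blended at training time:

def shard_jsonl(path, num_shards, out_prefix):
    """Split a JSONL file into num_shards round-robin shards."""
    outs = [open(f"{out_prefix}_{i:03d}.jsonl", "w") for i in range(num_shards)]
    with open(path) as f:
        for i, line in enumerate(f):
            outs[i % num_shards].write(line)
    for out in outs:
        out.close()

shard_jsonl("dataset.jsonl", 32, "dataset_shard")  # hypothetical paths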
Please tell me how to generate the following files:
I am trying to run the StarCoder pretraining code (examples/pretrain_bigcode_model.slurm). I created a custom pretrain_starcoder.sh file:
#!/bin/bash
GPUS_PER_NODE=2
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
# File path setup
CHECKPOINT_PATH=/home/jupyter/Satya/Megatron/Model_starcoder/
TOKENIZER_FILE=/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
#WEIGHTS_TRAIN=/fsx/loubna/code/bigcode-data-mix/data/train_data_paths.txt.tmp
#WEIGHTS_VALID=/fsx/loubna/code/bigcode-data-mix/data/valid_data_paths.txt.tmp
mkdir -p $CHECKPOINT_PATH/tensorboard
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
GPT_ARGS="\
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 6144 \
--num-attention-heads 48 \
--attention-head-type multiquery \
--init-method-std 0.01275 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--attention-dropout 0.1 \
--hidden-dropout 0.1 \
--micro-batch-size 1 \
--global-batch-size 512 \
--lr 0.0003 \
--min-lr 0.00003 \
--train-iters 250000 \
--lr-decay-iters 250000 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--weight-decay .1 \
--adam-beta2 .95 \
--clip-grad 1.0 \
--bf16 \
--use-flash-attn \
--fim-rate 0.5 \
--log-interval 10 \
--save-interval 2500 \
--eval-interval 2500 \
--eval-iters 2 \
--use-distributed-optimizer \
--valid-num-workers 0 \
"
TENSORBOARD_ARGS="--tensorboard-dir ${CHECKPOINT_PATH}/tensorboard"
export NCCL_DEBUG=INFO
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt.py \
$GPT_ARGS \
--tokenizer-type TokenizerFromFile \
--tokenizer-file $TOKENIZER_FILE \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
#--train-weighted-split-paths-path $WEIGHTS_TRAIN \
#--valid-weighted-split-paths-path $WEIGHTS_VALID \
--structured-logs \
--structured-logs-dir $CHECKPOINT_PATH/logs \
$TENSORBOARD_ARGS \
--wandb-entity-name loubnabnl \
--wandb-project-name bigcode-pretraining \
I didn't set the data path yet.
My current versions are:
CUDA - 11.0
PyTorch - 1.7.0 (I only found 1.7.1 and 1.7.0 for CUDA 11.0).
apex - 1.0
gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
2 AWS A100 GPUs.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:20:1C.0 Off | 0 |
| N/A 24C P0 53W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:A0:1D.0 Off | 0 |
| N/A 25C P0 50W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
When I run $ bash ./examples/pretrain_starcoder.sh, I get:
Wandb import failed
Wandb import failed
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:TokenizerFromFile
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
Persistent fused layer norm kernel is supported from pytorch v1.11 (nvidia pytorch container paired with v1.11). Defaulting to no_persist_layer_norm=True
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_head_type ............................. multiquery
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_seq_length .............................. 8192
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 2500
eval_iters ...................................... 2
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_signal_handler ............................. False
ffn_hidden_size ................................. 24576
fim_rate ........................................ 0.5
fim_spm_rate .................................... 0.5
finetune ........................................ False
finetune_from ................................... None
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 512
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 6144
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.01275
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0003
lr_decay_iters .................................. 250000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 2000
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 8192
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... True
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 48
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 40
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
sample_rate ..................................... 1.0
save ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
save_interval ................................... 2500
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 8192
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... None
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
structured_logs ................................. False
structured_logs_dir ............................. None
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
titles_data_path ................................ None
tokenizer_file .................................. /home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
tokenizer_type .................................. TokenizerFromFile
train_iters ..................................... 250000
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_pipeline_model_parallel_size ........ 1
transformer_timers .............................. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_one_sent_docs ............................... False
valid_num_workers ............................... 0
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 512
> building TokenizerFromFile tokenizer ...
> padded vocab (size: 49152) with 0 dummy tokens (new size: 49152)
05:15:56.69 >>> Call to _initialize_distributed in File "/tmp/Megatron/megatron/initialize.py", line 220
05:15:56.69 220 | def _initialize_distributed():
05:15:56.69 222 | args = get_args()
05:15:56.69 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.69 224 | device_count = torch.cuda.device_count()
05:15:56.69 .......... device_count = 2
05:15:56.69 225 | if torch.distributed.is_initialized():
05:15:56.69 235 | if args.rank == 0:
05:15:56.69 236 | print('> initializing torch distributed ...', flush=True)
> initializing torch distributed ...
05:15:56.69 238 | if device_count > 0:
05:15:56.69 239 | device = args.rank % device_count
05:15:56.69 .................. device = 0
05:15:56.69 240 | if args.local_rank is not None:
05:15:56.69 241 | assert args.local_rank == device, \
05:15:56.69 245 | torch.cuda.set_device(device)
05:15:56.70 249 | torch.distributed.init_process_group(
05:15:56.70 250 | backend="gloo",#args.distributed_backend,
05:15:56.70 251 | world_size=args.world_size, rank=args.rank,
05:15:56.70 252 | timeout=timedelta(seconds=args.distributed_timeout))
05:15:56.70 249 | torch.distributed.init_process_group(
05:15:56.70 256 | if device_count > 0:
05:15:56.70 257 | if mpu.model_parallel_is_initialized():
05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size,
05:15:56.70 261 | args.pipeline_model_parallel_size,
05:15:56.70 262 | args.virtual_pipeline_model_parallel_size,
05:15:56.70 263 | args.pipeline_model_parallel_split_rank)
05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size,
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
05:15:56.70 <<< Return value from _initialize_distributed: None
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
05:15:56.70 >>> Call to _compile_dependencies in File "/tmp/Megatron/megatron/initialize.py", line 160
05:15:56.70 160 | def _compile_dependencies():
05:15:56.70 162 | args = get_args()
05:15:56.73 >>> Call to get_args in File "/tmp/Megatron/megatron/global_vars.py", line 38
05:15:56.73 38 | def get_args():
05:15:56.73 40 | _ensure_var_is_initialized(_GLOBAL_ARGS, 'args')
05:15:56.73 41 | return _GLOBAL_ARGS
05:15:56.73 <<< Return value from get_args: Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.73 162 | args = get_args()
05:15:56.73 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.73 168 | if torch.distributed.get_rank() == 0:
05:15:56.84 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
05:15:56.84 ...... group = <object object at 0x7fe25503e6c0>
05:15:56.84 584 | def get_rank(group=group.WORLD):
05:15:56.84 600 | if _rank_not_in_group(group):
05:15:56.84 603 | _check_default_pg()
05:15:56.84 604 | if group == GroupMember.WORLD:
05:15:56.84 605 | return _default_pg.rank()
05:15:56.84 <<< Return value from get_rank: 0
05:15:56.84 168 | if torch.distributed.get_rank() == 0:
05:15:56.84 169 | start_time = time.time()
05:15:56.84 .............. start_time = 1686719756.846662
05:15:56.84 170 | print('> compiling dataset index builder ...')
> compiling dataset index builder ...
05:15:56.84 171 | from megatron.data.dataset_utils import compile_helper
05:15:56.84 .............. compile_helper = <function compile_helper at 0x7fe24b749280>
05:15:56.84 172 | compile_helper()
05:15:56.92 >>> Call to compile_helper in File "/tmp/Megatron/megatron/data/dataset_utils.py", line 81
05:15:56.92 81 | def compile_helper():
05:15:56.92 84 | import os
05:15:56.92 .......... os = <module 'os' from '/opt/conda/envs/starcoder/lib/python3.8/os.py'>
05:15:56.92 85 | import subprocess
05:15:56.92 .......... subprocess = <module 'subprocess' from '/opt/conda/envs/starcoder/lib/python3.8/subprocess.py'>
05:15:56.92 86 | path = os.path.abspath(os.path.dirname(__file__))
05:15:56.92 .......... path = '/tmp/Megatron/megatron/data'
05:15:56.92 87 | ret = subprocess.run(['make', '-C', path])
make: Entering directory '/tmp/Megatron/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/tmp/Megatron/megatron/data'
05:15:56.96 .......... ret = CompletedProcess(args=['make', '-C', '/tmp/Megatron/megatron/data'], returncode=0)
05:15:56.96 88 | if ret.returncode != 0:
05:15:56.96 <<< Return value from compile_helper: None
05:15:56.96 172 | compile_helper()
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True)
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True)
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
>>> done with dataset index builder. Compilation time: 0.114 seconds
05:15:56.96 181 | seq_len = args.seq_length
05:15:56.96 .......... seq_len = 8192
05:15:56.96 182 | attn_batch_size = \
05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \
05:15:56.96 184 | args.micro_batch_size
05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \
05:15:56.96 182 | attn_batch_size = \
05:15:56.96 .......... attn_batch_size = 48.0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 .......... custom_kernel_constraint = True
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 191 | custom_kernel_constraint and
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 192 | args.masked_softmax_fusion):
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 199 | if torch.distributed.get_rank() == 0:
05:15:56.96 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
05:15:56.96 ...... group = <object object at 0x7fe25503e6c0>
05:15:56.96 584 | def get_rank(group=group.WORLD):
05:15:56.96 600 | if _rank_not_in_group(group):
05:15:56.96 603 | _check_default_pg()
05:15:56.96 604 | if group == GroupMember.WORLD:
05:15:56.96 605 | return _default_pg.rank()
05:15:56.96 <<< Return value from get_rank: 0
05:15:56.96 199 | if torch.distributed.get_rank() == 0:
05:15:56.96 200 | start_time = time.time()
05:15:56.96 .............. start_time = 1686719756.9662645
05:15:56.96 201 | print('> compiling and loading fused kernels ...', flush=True)
> compiling and loading fused kernels ...
05:15:56.96 202 | fused_kernels.load(args)
05:15:56.96 >>> Call to load in File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 4
05:15:56.96 ...... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.96 4 | def load(args):
05:15:56.96 5 | if torch.version.hip is None:
05:15:56.96 6 | print("running on CUDA devices")
running on CUDA devices
05:15:56.96 7 | from megatron.fused_kernels.cuda import load as load_kernels
05:15:58.87 .............. load_kernels = <function load at 0x7fe2422201f0>
05:15:58.87 12 | load_kernels(args)
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
/usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (const char *const)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<const char *const &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(1375): here
instantiation of "__nv_bool pybind11::detail::object_api<Derived>::contains(T &&) const [with Derived=pybind11::handle, T=const char *const &]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/detail/internals.h(176): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(201): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::handle, pybind11::handle)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle &, pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::handle &, pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(755): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::handle, pybind11::handle, pybind11::none, pybind11::str)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle, pybind11::handle, pybind11::none, pybind11::str>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(971): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object, const pybind11::handle)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &, const pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::object &, const pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1401): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::cpp_function)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1407): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::cpp_function, pybind11::none, pybind11::none, const char [1])
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function, pybind11::none, pybind11::none, const char (&)[1]>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1418): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::tuple)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::tuple &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1812): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1830): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1831): here
10 errors detected in the compilation of "/tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu".
ninja: build stopped: subcommand failed.
05:16:05.35 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
05:16:05.35 !!! When calling: load_kernels(args)
05:16:05.35 !!! Call ended by exception
05:16:05.35 202 | fused_kernels.load(args)
05:16:05.39 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
05:16:05.39 !!! When calling: fused_kernels.load(args)
05:16:05.39 !!! Call ended by exception
Traceback (most recent call last):
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1516, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/starcoder/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "pretrain_gpt.py", line 158, in <module>
pretrain(train_valid_test_datasets_provider, model_provider,
File "/tmp/Megatron/megatron/training.py", line 107, in pretrain
initialize_megatron(extra_args_provider=extra_args_provider,
File "/tmp/Megatron/megatron/initialize.py", line 106, in initialize_megatron
_compile_dependencies()
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/snoop/tracer.py", line 173, in simple_wrapper
return function(*args, **kwargs)
File "/tmp/Megatron/megatron/initialize.py", line 202, in _compile_dependencies
fused_kernels.load(args)
File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 12, in load
load_kernels(args)
File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 70, in load
scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 42, in _cpp_extention_load_helper
return cpp_extension.load(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 969, in load
return _jit_compile(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1176, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1280, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1538, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
Traceback (most recent call last):
File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/envs/starcoder/bin/python', '-u', 'pretrain_gpt.py', '--local_rank=0', '--tensor-model-parallel-size', '1', '--pipeline-model-parallel-size', '1', '--num-layers', '40', '--hidden-size', '6144', '--num-attention-heads', '48', '--attention-head-type', 'multiquery', '--init-method-std', '0.01275', '--seq-length', '8192', '--max-position-embeddings', '8192', '--attention-dropout', '0.1', '--hidden-dropout', '0.1', '--micro-batch-size', '1', '--global-batch-size', '512', '--lr', '0.0003', '--min-lr', '0.00003', '--train-iters', '250000', '--lr-decay-iters', '250000', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '2000', '--weight-decay', '.1', '--adam-beta2', '.95', '--clip-grad', '1.0', '--bf16', '--use-flash-attn', '--fim-rate', '0.5', '--log-interval', '10', '--save-interval', '2500', '--eval-interval', '2500', '--eval-iters', '2', '--use-distributed-optimizer', '--valid-num-workers', '0', '--tokenizer-type', 'TokenizerFromFile', '--tokenizer-file', '/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json', '--save', '/home/jupyter/Satya/Megatron/Model_starcoder/', '--load', '/home/jupyter/Satya/Megatron/Model_starcoder/']' returned non-zero exit status 1.
examples/pretrain_starcoder.sh: line 75: --structured-logs: command not found
In the above code I also tried using snoop tracing. Below is the main error.
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
Awesome work. Is it possible to check in the pretraining and fine-tuning scripts, please?
I want to use my dataset to fine-tune StarCoderBase with FIM. Is there code for this that could help me?
Traceback (most recent call last):
File "/data/lee/Megatron-LM/pretrain_gpt.py", line 158, in
pretrain(train_valid_test_datasets_provider, model_provider,
File "/data/lee/Megatron-LM/megatron/training.py", line 129, in pretrain
model, optimizer, opt_param_scheduler = setup_model_and_optimizer(model_provider,
File "/data/lee/Megatron-LM/megatron/training.py", line 376, in setup_model_and_optimizer
model = get_model(model_provider_func, model_type)
File "/data/lee/Megatron-LM/megatron/training.py", line 262, in get_model
model = model_provider_func(
File "/data/lee/Megatron-LM/pretrain_gpt.py", line 35, in model_provider
model = GPTModel(
File "/data/lee/Megatron-LM/megatron/model/gpt_model.py", line 74, in init
self.language_model, self._language_model_key = get_language_model(
File "/data/lee/Megatron-LM/megatron/model/language_model.py", line 75, in get_language_model
language_model = TransformerLanguageModel(
File "/data/lee/Megatron-LM/megatron/model/language_model.py", line 373, in init
self.encoder = ParallelTransformer(
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 1182, in init
[build_layer(i + 1 + offset) for i in range(self.num_layers)])
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 1182, in
[build_layer(i + 1 + offset) for i in range(self.num_layers)])
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 1130, in build_layer
return ParallelTransformerLayer(
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 877, in init
self.self_attention = ParallelAttention(
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 590, in init
raise ValueError(f"Invalid attention arguments: {attention_type}, {self.attention_head_type}")
ValueError: Invalid attention arguments: AttnType.self_attn, None
The parameters and commands are as follows. May I ask what the problem is?
GPUS_PER_NODE=1
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT_PATH=/data/lee/Megatron-LM/experiments/0606
VOCAB_FILE=vocab.json
MERGE_FILE=merges.txt
DATA_PATH=starcoder-abap_content_document
GPT_ARGS="--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--micro-batch-size 12 \
--global-batch-size 192 \
--lr 0.0005 \
--train-iters 150000 \
--lr-decay-iters 150000 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--weight-decay .1 \
--adam-beta2 .999 \
--fp16 \
--log-interval 10 \
--save-interval 2000 \
--eval-interval 200 \
--eval-iters 10"
# --finetune
TENSORBOARD_ARGS="--tensorboard-dir experiments/tensorboard"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
$GPT_ARGS \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--save $CHECKPOINT_PATH \
--data-path $DATA_PATH \
$TENSORBOARD_ARGS
This project seems to pre-train decoder-only style LMs. I just wonder why not an encoder-decoder style, which is more powerful for text generation (translation, summarization, conditional text generation).
I was confused earlier; now I understand all the details of the implementation.
Closing issue...