bigcode-project / megatron-lm
This project is forked from nvidia/megatron-lm.
Ongoing research training transformer models at scale.
License: Other
Since we do not have much time to experiment with all these components, we have to make decisions based on the current literature.
I ran into a new error:
Traceback (most recent call last):
File "/Megatron-LM/pretrain_gpt.py", line 148, in <module>
pretrain(train_valid_test_datasets_provider,
File "/Megatron-LM/megatron/training.py", line 161, in pretrain
iteration = train(forward_step_func,
File "/Megatron-LM/megatron/training.py", line 740, in train
train_step(forward_step_func,
File "/Megatron-LM/megatron/training.py", line 434, in train_step
losses_reduced = forward_backward_func(
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 360, in forward_backward_no_pipelining
output_tensor = forward_step(forward_step_func, data_iterator,
File "/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 218, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/Megatron-LM/pretrain_gpt.py", line 81, in forward_step
tokens, labels, loss_mask, attention_mask, position_ids = get_batch(
File "/Megatron-LM/pretrain_gpt.py", line 46, in get_batch
data_b = tensor_parallel.broadcast_data(keys, data, datatype)
File "/Megatron-LM/megatron/core/tensor_parallel/data.py", line 76, in broadcast_data
key_size, key_numel, total_numel = _build_key_size_numel_dictionaries(keys,
File "/Megatron-LM/megatron/core/tensor_parallel/data.py", line 31, in _build_key_size_numel_dictionaries
assert data[key].dim() < max_dim, 'you should increase MAX_DATA_DIM'
IndexError: too many indices for tensor of dimension 2
I checked `data`: it fails because `data` is a tensor, not a dictionary. I don't know at which point the keys are supposed to be added to the data.
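For context, a minimal sketch of what the data iterator is expected to yield here, assuming the usual `pretrain_gpt.py` convention where `get_batch` asks `broadcast_data` for the `'text'` key (shapes are illustrative):

```python
import torch

# Hedged sketch: tensor_parallel.broadcast_data(keys, data, datatype) indexes
# `data` by key, so each batch from the iterator must be a dict of tensors,
# e.g. {'text': LongTensor of shape [micro_batch_size, seq_length + 1]},
# not a bare 2-D tensor (which is what the IndexError above suggests).
def check_batch(batch, keys=("text",)):
    assert isinstance(batch, dict), f"expected a dict of tensors, got {type(batch)}"
    for key in keys:
        assert key in batch, f"missing key {key!r} in batch"
        assert torch.is_tensor(batch[key]), f"batch[{key!r}] is not a tensor"
```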
Thanks for publishing the model to Hugging Face. For using the Triton Inference Server in products like https://github.com/fauxpilot/fauxpilot:
Do you have a preferred way to convert it for the NVIDIA Triton Inference Server (e.g. https://github.com/triton-inference-server/fastertransformer_backend), starting for example from the Hugging Face checkpoint?
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/santacoder",
    revision="no-fim",  # name of branch or commit hash
    trust_remote_code=True
)
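Not an authoritative answer, but a common first step is to materialize the Hugging Face checkpoint in a local directory, which is the kind of input the conversion scripts around fastertransformer_backend typically start from. A minimal, self-contained sketch (the output path is arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged sketch: download the checkpoint once and write it to disk so that a
# converter (e.g. the FasterTransformer GPT conversion scripts) can read a
# local Hugging Face-format directory.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/santacoder", revision="no-fim", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
model.save_pretrained("./santacoder-hf")       # weights + config.json
tokenizer.save_pretrained("./santacoder-hf")   # tokenizer files
```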
Hi Team,
Great work!!
I would like to know the precision in which the model checkpoint is saved.
For example, if the model is trained with bf16 precision, what will the checkpoint precision be?
Thanks
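In case it helps, a minimal sketch for checking the stored precision directly on a saved checkpoint; the path and the top-level 'model' key follow the usual iter_XXXXXXX/mp_rank_00/model_optim_rng.pt layout, which may differ in your setup:

```python
import torch

# Hedged sketch: walk a saved Megatron checkpoint and report parameter dtypes.
ckpt = torch.load("iter_0010000/mp_rank_00/model_optim_rng.pt", map_location="cpu")

def print_dtypes(obj, prefix=""):
    if torch.is_tensor(obj):
        print(f"{prefix}: {obj.dtype}")
    elif isinstance(obj, dict):
        for key, value in obj.items():
            print_dtypes(value, f"{prefix}.{key}" if prefix else str(key))

# Fall back to the full dict if there is no 'model' entry.
print_dtypes(ckpt.get("model", ckpt))
```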
Why are we not multiplying the LM head FLOPs per iteration by the checkpoint_activations_factor?
Line 253 in bd0aaba
AFAIK the factor of 4 means 1 forward pass, 2 for the backward pass, and 1 extra forward pass needed to recompute the checkpointed activations. Don't we also need all 4 for the LM head? cc @RaymondLi0 @NouamaneTazi
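To make the question concrete, here is a hedged sketch of the FLOPs estimate being discussed; the symbol names are mine, and the exact expression at the referenced line may differ:

```python
# Hedged sketch of the per-iteration FLOPs estimate under discussion.
# B: global batch size, s: sequence length, l: layers, h: hidden size, V: vocab size.
def flops_per_iteration(B, s, l, h, V, ckpt_factor=4, ckpt_lm_head=False):
    # Transformer blocks: 24*B*s*l*h^2*(1 + s/(6h)) for forward + backward,
    # scaled by the recompute factor (4 = 1 fwd + 2 bwd + 1 recomputed fwd).
    transformer = ckpt_factor * 24 * B * s * l * h**2 * (1 + s / (6 * h))
    # LM head: 2*B*s*h*V per forward; 3x for fwd + bwd, or 4x if its forward
    # is also recomputed for activation checkpointing (the question here).
    lm_head = (ckpt_factor if ckpt_lm_head else 3) * 2 * B * s * h * V
    return transformer + lm_head
```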
How can I reproduce the HumanEval performance of this repo?
I tried to evaluate using the evaluation branch, but it seems to be very different from the default branch. Would it be possible to add a full evaluation process to the default branch?
Thanks a lot.
Feel free to add more things if required.
Checkpoint conversion to another model-parallel config does not support MQA for now.
Q and KV are handled separately for MQA:
https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/tools/checkpoint_loader_megatron.py#L206 — in this case the KV weights should not be concatenated (they are shared among TP ranks).
https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/tools/checkpoint_saver_megatron.py#L229 — same here: the KV weights should not be split across TP ranks, but copied instead.
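A hedged sketch of the distinction (tensor names and shapes are illustrative, not the actual ones in the conversion scripts): with multi-query attention only the Q projection is partitioned across tensor-parallel ranks, while the single shared KV projection is replicated, so merging should take KV from one rank instead of concatenating, and re-splitting should copy KV to every rank instead of chunking it.

```python
import torch

# Hedged sketch: reshaping attention weights to a new TP size with MQA.
def merge_tp_ranks(q_shards, kv_shards):
    # Q is partitioned across TP ranks -> concatenate along the output dim.
    q_full = torch.cat(q_shards, dim=0)
    # KV is shared (replicated) across ranks with MQA -> take any single copy.
    kv_full = kv_shards[0]
    return q_full, kv_full

def split_tp_ranks(q_full, kv_full, new_tp_size):
    # Q gets split across the new TP ranks; KV is copied to every rank.
    q_shards = list(torch.chunk(q_full, new_tp_size, dim=0))
    kv_shards = [kv_full.clone() for _ in range(new_tp_size)]
    return q_shards, kv_shards
```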
I use
python tools/preprocess_data.py \
--input /cobol/gpt2/data \
--output-prefix /cobol/data_preprocess \
--vocab-file /cobol/gpt2/vocab.json \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--merge-file /cobol/gpt2/merges.txt \
--json-key content \
--workers 32 \
--chunk-size 25 \
--append-eod
to create the dataset, and use
#!/bin/bash
# Runs the "345M" parameter model
export CUDA_DEVICE_MAX_CONNECTIONS=1
CHECKPOINT_PATH=/cobol/gpt2/checkpoint
VOCAB_FILE=/cobol/vocab_file/vocab.json
MERGE_FILE=/cobol/gpt2/merges.txt
DATA_PATH=/cobol/gpt2/data_document/data_preprocess_content_document
GPT_ARGS="
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--attention-head-type multihead \
--seq-length 1024 \
--max-position-embeddings 1024 \
--micro-batch-size 4 \
--global-batch-size 8 \
--lr 0.00015 \
--train-iters 500000 \
--lr-decay-iters 320000 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--lr-warmup-fraction .01 \
--clip-grad 1.0 \
--fp16
"
DATA_ARGS="
--data-path $DATA_PATH \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--data-impl mmap \
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 100 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10
"
torchrun pretrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH
to pretrain. But I get this KeyError:
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_head_type ............................. multihead
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... mmap
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... ['/cobol/gpt2/data_document/data_preprocess_content_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 24
encoder_seq_length .............................. 1024
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 4096
fim_rate ........................................ 0.0
fim_split_sample ................................ None
fim_spm_rate .................................... 0.5
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
fragment_fim_rate ............................... 0.5
global_batch_size ............................... 8
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 1024
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 64
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ /cobol/gpt2/checkpoint
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.00015
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. 0.01
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
max_tokens_to_oom ............................... 12000
merge_file ...................................... /cobol/gpt2/merges.txt
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 16
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 24
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.float16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
rotary_percent .................................. 1.0
rotary_theta .................................... 10000
sample_rate ..................................... 1.0
sanity_check_dataloader_interval ................ None
save ............................................ /cobol/gpt2/checkpoint
save_interval ................................... 10000
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 1024
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 949,50,1
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
structured_logs ................................. False
structured_logs_dir ............................. None
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_weighted_split_names ....................... None
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
test_weighted_split_splits ...................... None
test_weighted_split_weights ..................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_file .................................. None
tokenizer_model ................................. None
tokenizer_type .................................. GPT2BPETokenizer
train_data_path ................................. None
train_iters ..................................... 500000
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
transformer_timers .............................. False
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
valid_num_workers ............................... 2
valid_weighted_split_names ...................... None
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
valid_weighted_split_splits ..................... None
valid_weighted_split_weights .................... None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... /cobol/vocab_file/vocab.json
vocab_size ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 2
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/Megatron-LM/megatron/data'
>>> done with dataset index builder. Compilation time: 0.055 seconds
[rank0]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
/Megatron-LM/megatron/training.py:104: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
start_time_tensor = torch.cuda.DoubleTensor([_TRAIN_START_TIME])
time to initialize megatron (seconds): 1.204
[after megatron is initialized] datetime: 2024-01-02 17:35:35
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 354871296
> learning rate decay style: cosine
loading release checkpoint from /cobol/gpt2/checkpoint
could not find arguments in the checkpoint ...
checkpoint version 0
succesfully fixed query-key-values ordering for checkpoint version 0
successfully loaded checkpoint from /cobol/gpt2/checkpoint at iteration 0
(min, max) time across ranks (ms):
load-checkpoint ................................: (540.41, 540.41)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-01-02 17:35:36
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 4000000
validation: 40080
test: 80
> building train, validation, and test datasets for GPT ...
Single data path provided for train, valid & test
> building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
> finished creating indexed dataset in 0.000152 seconds
number of documents: 4047
> dataset split:
train:
document indices in [0, 3841) total of 3841 documents
validation:
document indices in [3841, 4043) total of 202 documents
test:
document indices in [4043, 4047) total of 4 documents
> Tokens per epoch: 14533042
> loading doc-idx mapping from /cobol/gpt2/data_document/data_preprocess_content_document_train_indexmap_4000000ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from /cobol/gpt2/data_document/data_preprocess_content_document_train_indexmap_4000000ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from /cobol/gpt2/data_document/data_preprocess_content_document_train_indexmap_4000000ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 4002264
total number of epochs: 282
Traceback (most recent call last):
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 346, in __init__
self.suffix_tok_id, self.prefix_tok_id, self.middle_tok_id, self.pad_tok_id = (self.tokenizer.special_tokens[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD])
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 346, in <genexpr>
self.suffix_tok_id, self.prefix_tok_id, self.middle_tok_id, self.pad_tok_id = (self.tokenizer.special_tokens[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD])
KeyError: '<fim_suffix>'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Megatron-LM/pretrain_gpt.py", line 148, in <module>
pretrain(train_valid_test_datasets_provider,
File "/Megatron-LM/megatron/training.py", line 140, in pretrain
= build_train_valid_test_data_iterators(
File "/Megatron-LM/megatron/training.py", line 1047, in build_train_valid_test_data_iterators
build_train_valid_test_data_loaders(
File "/Megatron-LM/megatron/training.py", line 979, in build_train_valid_test_data_loaders
train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
File "/Megatron-LM/pretrain_gpt.py", line 100, in train_valid_test_datasets_provider
train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 33, in build_train_valid_test_datasets
all_train_datasets, all_valid_datasets, all_test_datasets = _build_train_valid_test_datasets(data_prefix[0],
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 234, in _build_train_valid_test_datasets
train_dataset = build_dataset(0, 'train')
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 227, in build_dataset
dataset = GPTDataset(name, data_prefix,
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 348, in __init__
self.suffix_tok_id, self.prefix_tok_id, self.middle_tok_id, self.pad_tok_id = (self.tokenizer.vocab[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD])
File "/Megatron-LM/megatron/data/gpt_dataset.py", line 348, in <genexpr>
self.suffix_tok_id, self.prefix_tok_id, self.middle_tok_id, self.pad_tok_id = (self.tokenizer.vocab[tok] for tok in [FIM_SUFFIX, FIM_PREFIX, FIM_MIDDLE, FIM_PAD])
KeyError: '<fim_suffix>'
It looks like the tokenizer doesn't include <fim_suffix>. What should I do?
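For what it's worth, even with --fim-rate 0.0 the traceback shows the GPT dataset constructor still looks up the FIM special tokens, so <fim_prefix>, <fim_middle>, <fim_suffix> and <fim_pad> must exist in the tokenizer vocabulary, and a plain GPT-2 vocab.json does not contain them. A hedged sketch for checking (and, as an assumption about your setup, appending) them to a GPT-2-style vocab.json; note that any appended IDs must still be covered by the model's padded vocabulary size:

```python
import json

FIM_TOKENS = ["<fim_prefix>", "<fim_middle>", "<fim_suffix>", "<fim_pad>"]

# Hedged sketch: verify the FIM special tokens exist in a GPT-2-style vocab.json
# and append them at the end if missing. Illustrative only.
with open("/cobol/gpt2/vocab.json") as f:
    vocab = json.load(f)

missing = [tok for tok in FIM_TOKENS if tok not in vocab]
for tok in missing:
    vocab[tok] = len(vocab)  # assign the next free token id

if missing:
    with open("/cobol/gpt2/vocab.json", "w") as f:
        json.dump(vocab, f)
print("added:", missing)
```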
As discussed today, let's train a 350M model with the following hyperparameters:
Let's see how it compares against previously trained models.
The --(train|valid|test)-weighted-split-paths-path arguments (added in #32) parse the data arguments from a file in a specific format.
Loading could be made simpler by reading a structured file (JSON or YAML). Such a file would also be more human-readable; a sketch of what it could look like is below.
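Purely as an illustration of the proposal (none of these field names exist in the current parser; they are invented here), a structured split file and a tiny loader might look like this:

```python
import json

# Hypothetical weighted-split config; field names are invented for illustration.
example = {
    "train": [
        {"name": "code", "weight": 0.7, "split": "0:0.95", "prefix": "/data/code_document"},
        {"name": "text", "weight": 0.3, "split": "0:0.95", "prefix": "/data/text_document"},
    ],
    "valid": [
        {"name": "code", "weight": 1.0, "split": "0.95:1.0", "prefix": "/data/code_document"},
    ],
}

with open("data_splits.json", "w") as f:
    json.dump(example, f, indent=2)

# A loader would read this back and rebuild the per-split
# (weight, range, prefix) tuples that the current string format encodes.
with open("data_splits.json") as f:
    cfg = json.load(f)
for split_name, datasets in cfg.items():
    print(split_name, [(d["weight"], d["split"], d["prefix"]) for d in datasets])
```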
The memory benchmark was run with the following configurations; the last column is the relative increase of the Adan peak over the Adam peak:
| Head | Layers | Emb. Dim | Model Size (MB) | Adam Peak (MB) | Adan Peak (MB) | Diff (%) |
|---|---|---|---|---|---|---|
| 6 | 6 | 768 | 81 | 4490 | 4490 | 0 |
| 12 | 6 | 768 | 81 | 5848 | 5848 | 0 |
| 16 | 6 | 768 | 81 | 6776 | 6776 | 0 |
| 6 | 12 | 768 | 124 | 7151 | 7153 | 0.03 |
| 12 | 12 | 768 | 124 | 9869 | 9871 | 0.02 |
| 16 | 12 | 768 | 124 | 11733 | 11735 | 0.02 |
| 16 | 6 | 1024 | 128 | 7302 | 7304 | 0.03 |
| 16 | 12 | 1024 | 203 | 12719 | 12721 | 0.02 |
| 6 | 24 | 768 | 209 | 12471 | 12475 | 0.03 |
| 12 | 24 | 768 | 209 | 17922 | 17922 | 0 |
| 16 | 24 | 768 | 209 | 21596 | 21600 | 0.02 |
| 6 | 6 | 1536 | 248 | 6905 | 8241 | 19.35 |
| 12 | 6 | 1536 | 248 | 8235 | 8539 | 3.69 |
| 16 | 6 | 1536 | 248 | 9141 | 9445 | 3.33 |
| 16 | 24 | 1024 | 354 | 23530 | 23534 | 0.02 |
| 16 | 6 | 2048 | 407 | 11098 | 12159 | 9.56 |
| 6 | 12 | 1536 | 418 | 11137 | 13778 | 23.71 |
| 12 | 12 | 1536 | 418 | 13390 | 14164 | 5.78 |
| 16 | 12 | 1536 | 418 | 15667 | 15976 | 1.97 |
| 16 | 6 | 2560 | 603 | 13967 | 18207 | 30.36 |
| 16 | 12 | 2048 | 709 | 18851 | 20954 | 11.16 |
| 6 | 24 | 1536 | 758 | 19660 | 24819 | 26.24 |
| 12 | 24 | 1536 | 758 | 25096 | 25406 | 1.24 |
| 16 | 24 | 1536 | 758 | 28720 | 29030 | 1.08 |
| 16 | 12 | 2560 | 1075 | 28475 | 32134 | 12.85 |
| 16 | 24 | 2048 | 1313 | 34357 | 38595 | 12.34 |
As the embedding dimension (Emb. Dim) increases, the probability that Adan's extra memory is larger also increases.
I am trying to reshape a fine-tuned checkpoint of StarCoder (https://huggingface.co/bigcode/starcoder-megatron/tree/main) from TP=4, PP=4 to TP=8, PP=1 using tools/checkpoint_util.py, but I ran into an out-of-memory (OOM) issue.
The machine I used has 512 GB of memory, which is enough to load the whole model. Is there any way to solve this issue?
Here's my log before the OOM (I added some debugging output to track the process); it seems like the checkpoint loader is sending network layers to the saver.
Loaded checkpoint_loader_megatron as the loader.
Loaded checkpoint_saver_megatron as the saver.
Starting saver...
Starting loader...
Wandb import failed
Wandb import failed
/opt/conda/lib/python3.9/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
/opt/conda/lib/python3.9/site-packages/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
Setting num_layers to 40 from checkpoint
Setting hidden_size to 6144 from checkpoint
Setting ffn_hidden_size to 24576 from checkpoint
Setting seq_length to 2048 from checkpoint
Setting num_attention_heads to 48 from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 8192 from checkpoint
Checkpoint did not provide arguments add_position_embedding
Checkpoint did not provide arguments use_rotary_position_embeddings
Checkpoint did not provide arguments rotary_percent
Checkpoint did not provide arguments add_bias_linear
Checkpoint did not provide arguments swiglu
Checkpoint did not provide arguments untie_embeddings_and_output_weights
Checkpoint did not provide arguments apply_layernorm_1p
Setting tokenizer_type to TokenizerFromFile from checkpoint
Setting padded_vocab_size to 49152 from checkpoint
Setting attention_head_type to multiquery from checkpoint
Setting tensor_model_parallel_size to 4 from checkpoint
Setting pipeline_model_parallel_size to 4 from checkpoint
Checkpoint did not provide arguments virtual_pipeline_model_parallel_size
Checkpoint did not provide arguments num_layers_per_virtual_pipeline_stage
using world size: 16, data-parallel-size: 1, tensor-model-parallel size: 4, pipeline-model-parallel size: 4
setting global batch size to 1
using torch.float32 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.1
attention_head_type ............................. multiquery
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. False
bias_gelu_fusion ................................ False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 40
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 100
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 24576
fim_rate ........................................ 0.0
fim_spm_rate .................................... 0.5
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 1
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 6144
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
iteration ....................................... xxx
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ xxx
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. None
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 8192
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_warmup ..................................... False
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... True
no_save_rng ..................................... True
num_attention_heads ............................. 48
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 40
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
override_opt_param_scheduler .................... False
padded_vocab_size ............................... 49152
params_dtype .................................... torch.float32
patch_dim ....................................... 16
perform_initialization .......................... False
pipeline_model_parallel_size .................... 4
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
rotary_percent .................................. 1.0
sample_rate ..................................... 1.0
save ............................................ None
save_interval ................................... None
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... None
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
structured_logs ................................. False
structured_logs_dir ............................. None
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_file .................................. None
tokenizer_model ................................. None
tokenizer_type .................................. TokenizerFromFile
train_data_path ................................. None
train_iters ..................................... None
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 4
transformer_timers .............................. False
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... True
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
valid_num_workers ............................... 2
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 16
-------------------- end of arguments ---------------------
Wandb import failed
setting number of micro-batches to constant 1
running on CUDA devices
loading rank 0 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 1 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 2 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 3 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
Overwriting default ffn_hidden_size value None with value from checkpoint 24576.
Overwriting default kv_channels value None with value from checkpoint 128.
Overwriting default micro_batch_size value 1 with value from checkpoint 2.
Overwriting default global_batch_size value None with value from checkpoint 64.
Overwriting default log_interval value 100 with value from checkpoint 10.
Overwriting default tensorboard_dir value None with value from checkpoint xxx/tensorboard/.
Overwriting default dataloader_type value None with value from checkpoint single.
Overwriting default lr value None with value from checkpoint 1e-05.
Overwriting default lr_decay_style value linear with value from checkpoint cosine.
Overwriting default min_lr value 0.0 with value from checkpoint 1e-06.
Overwriting default load value None with value from checkpoint xxx/starcoder-megatron.
Checkpoint had argument load_step but new arguments does not have this.
Checkpoint had argument finetune_from but new arguments does not have this.
Overwriting default bf16 value False with value from checkpoint True.
Overwriting default accumulate_allreduce_grads_in_fp32 value False with value from checkpoint True.
Overwriting default local_rank value None with value from checkpoint 0.
Overwriting default eval_iters value 100 with value from checkpoint 10.
Overwriting default eval_interval value 1000 with value from checkpoint 5000.
Overwriting default data_path value None with value from checkpoint ['xx/data/sft_experiments/xxx'].
Overwriting default split value None with value from checkpoint 998,1,1.
Overwriting default merge_file value None with value from checkpoint ./experiments/cro/starcoder/merges.txt.
Overwriting default data_impl value infer with value from checkpoint mmap.
Overwriting default log_validation_ppl_to_tensorboard value False with value from checkpoint True.
Overwriting default world_size value 8 with value from checkpoint 16.
Checkpoint had argument transformer_pipeline_model_parallel_size but new arguments does not have this.
Checkpoint had argument data_parallel_size but new arguments does not have this.
Checkpoint had argument valid_weighted_split_names but new arguments does not have this.
Checkpoint had argument valid_weighted_split_weights but new arguments does not have this.
Checkpoint had argument valid_weighted_split_splits but new arguments does not have this.
Checkpoint had argument test_weighted_split_names but new arguments does not have this.
Checkpoint had argument test_weighted_split_weights but new arguments does not have this.
Checkpoint had argument test_weighted_split_splits but new arguments does not have this.
Checkpoint had argument consumed_train_samples but new arguments does not have this.
Checkpoint had argument consumed_valid_samples but new arguments does not have this.
Checkpoint had argument padded_vocab_size but new arguments does not have this.
Checkpoint had argument model_type but new arguments does not have this.
Checkpoint had argument iteration but new arguments does not have this.
Checkpoint had argument do_train but new arguments does not have this.
Checkpoint had argument do_valid but new arguments does not have this.
Checkpoint had argument do_test but new arguments does not have this.
Checkpoint had argument curr_iteration but new arguments does not have this.
using world size: 16, data-parallel-size: 2, tensor-model-parallel size: 4, pipeline-model-parallel size: 2
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. True
add_position_embedding .......................... True
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.1
attention_head_type ............................. multiquery
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. False
bias_gelu_fusion ................................ False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... mmap
data_parallel_random_init ....................... False
data_parallel_size .............................. 2
data_path ....................................... ['xx/data/sft_experiments/xxx']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 40
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 5000
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
ffn_hidden_size ................................. 24576
fim_rate ........................................ 0.0
fim_spm_rate .................................... 0.5
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 64
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 6144
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ xxx/starcoder-megatron
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... True
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 1e-05
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 8192
max_tokens_to_oom ............................... 12000
merge_file ...................................... ./experiments/cro/starcoder/merges.txt
micro_batch_size ................................ 2
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-06
mmap_warmup ..................................... False
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... True
no_save_rng ..................................... True
num_attention_heads ............................. 48
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 40
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... False
pipeline_model_parallel_size .................... 2
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_return_doc_ids ............................ False
retro_workdir ................................... None
rotary_percent .................................. 1.0
sample_rate ..................................... 1.0
save ............................................ xxx
save_interval ................................... 1
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 998,1,1
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
structured_logs ................................. False
structured_logs_dir ............................. None
swiglu .......................................... False
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. xxx/tensorboard/
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_weighted_split_names ....................... None
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
test_weighted_split_splits ...................... None
test_weighted_split_weights ..................... None
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_file .................................. None
tokenizer_model ................................. None
tokenizer_type .................................. TokenizerFromFile
train_data_path ................................. None
train_iters ..................................... None
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 2
transformer_timers .............................. False
untie_embeddings_and_output_weights ............. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... True
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
valid_num_workers ............................... 2
valid_weighted_split_names ...................... None
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
valid_weighted_split_splits ..................... None
valid_weighted_split_weights .................... None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 16
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 16
Setting consumed_train_samples to None and consumed_valid_samples to None
Wandb import failed
running on CUDA devices
sending embeddings
sending transformer layer 0
received embeddings
Original vocab size not specified, leaving embedding table as-is. If you've changed the tensor parallel size this could cause problems.
building GPT model ...
sending transformer layer 1
sending transformer layer 2
WARNING! Distributed processes aren't initialized, so word embeddings in the last layer are not initialized. If you are just manipulating a model this is fine, but this needs to be handled manually. If you are training something is definitely wrong.
sending transformer layer 3
sending transformer layer 4
sending transformer layer 5
sending transformer layer 6
sending transformer layer 7
sending transformer layer 8
sending transformer layer 9
loading rank 0 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
building GPT model ...
building GPT model ...
building GPT model ...
debugging in saver, get transformers layer from queue ...
received transformer layer 0
debugging in saver, get transformers layer from queue ...
received transformer layer 1
debugging in saver, get transformers layer from queue ...
received transformer layer 2
debugging in saver, get transformers layer from queue ...
received transformer layer 3
debugging in saver, get transformers layer from queue ...
received transformer layer 4
debugging in saver, get transformers layer from queue ...
received transformer layer 5
debugging in saver, get transformers layer from queue ...
received transformer layer 6
debugging in saver, get transformers layer from queue ...
received transformer layer 7
debugging in saver, get transformers layer from queue ...
received transformer layer 8
debugging in saver, get transformers layer from queue ...
received transformer layer 9
debugging in saver, get transformers layer from queue ...
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 1 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
checkpoint version 3.0
successfully loaded checkpoint from xxx at iteration xxx
loading rank 2 / count 4
building GPT model ...
loading checkpoint from xxx at iteration xxx
Wandb used to work fine, but now there is an issue during initialization.
Workaround for now: disable wandb.
2023-02-17 22:39:17,767 (worker_0) : training ...
2023-02-17 22:39:24,279 (worker_7) : wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
2023-02-17 22:39:25,325 (worker_7) : wandb: - Waiting for wandb.init()...
2023-02-17 22:39:26,326 (worker_7) : wandb: \ Waiting for wandb.init()...
[more of the same log]
2023-02-17 22:40:24,365 (worker_7) : init_wandb
2023-02-17 22:40:24,365 (worker_7) : wandb: ERROR Error communicating with wandb process
wandb: ERROR For more info see: https://docs.wandb.ai/library/init#init-start-error
2023-02-17 22:40:24,367 (worker_7) : Traceback (most recent call last):
2023-02-17 22:40:24,367 (worker_7) : File "/app/toolkit_infiniband_example/run.py", line 34, in run
2023-02-17 22:40:24,367 (worker_7) : runnable.run()
2023-02-17 22:40:24,367 (worker_7) : File "/app/toolkit_infiniband_example/worker.py", line 89, in run
2023-02-17 22:40:24,367 (worker_7) : self._model.train()
2023-02-17 22:40:24,367 (worker_7) : File "/app/toolkit_infiniband_example/models/megatron_gpt.py", line 65, in train
2023-02-17 22:40:24,368 (worker_7) : pretrain(train_valid_test_datasets_provider, model_provider,
2023-02-17 22:40:24,368 (worker_7) : File "/app/megatron/training.py", line 155, in pretrain
2023-02-17 22:40:24,368 (worker_7) : iteration = train(forward_step_func,
2023-02-17 22:40:24,368 (worker_7) : File "/app/megatron/training.py", line 685, in train
2023-02-17 22:40:24,368 (worker_7) : init_wandb()
2023-02-17 22:40:24,368 (worker_7) : File "/app/megatron/initialize.py", line 244, in init_wandb
2023-02-17 22:40:24,368 (worker_7) : wandb.init(
2023-02-17 22:40:24,368 (worker_7) : File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 1078, in init
2023-02-17 22:40:24,368 (worker_7) : run = wi.init()
2023-02-17 22:40:24,368 (worker_7) : File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_init.py", line 719, in init
2023-02-17 22:40:24,368 (worker_7) : raise UsageError(error_message)
2023-02-17 22:40:24,368 (worker_7) : wandb.errors.UsageError: Error communicating with wandb process
2023-02-17 22:40:24,368 (worker_7) : For more info see: https://docs.wandb.ai/library/init#init-start-error
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
2023-02-17 22:42:33,930 (worker_7) : Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fd1940af970>: Failed to establish a new connection: [Errno 110] Connection timed out')': /api/5288891/store/
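The simplest form of that workaround is to guard the W&B initialization so a failed connection doesn't kill the run. A minimal sketch, using a hypothetical helper rather than the repo's actual init_wandb code:

import os

def safe_init_wandb(project: str, entity: str):
    """Try to start a W&B run; fall back to disabled mode if the backend is unreachable."""
    try:
        import wandb
        return wandb.init(project=project, entity=entity)
    except Exception as exc:  # e.g. wandb.errors.UsageError, connection timeouts
        print(f"wandb init failed ({exc}); continuing without wandb")
        os.environ["WANDB_MODE"] = "disabled"  # any later wandb.init picks up disabled mode
        return None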
Add a feature to run validation on multiple datasets.
It would be useful to get validation loss on different programming languages, for example.
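A minimal sketch of what per-dataset validation could look like; the function name and the loader/loss interfaces are assumptions for illustration, not the repo's actual evaluation loop:

import torch

def evaluate_per_dataset(model, valid_loaders, loss_fn):
    """valid_loaders maps a name (e.g. a programming language) to a DataLoader."""
    model.eval()
    results = {}
    with torch.no_grad():
        for name, loader in valid_loaders.items():
            total, batches = 0.0, 0
            for batch in loader:
                logits = model(batch["tokens"])
                total += loss_fn(logits, batch["labels"]).item()
                batches += 1
            results[name] = total / max(batches, 1)  # average validation loss for this dataset
    return results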
Hi @RaymondLi0, in the evaluation branch I found the evaluation code related to HumanEval, but unfortunately it doesn't work.
Some specific mismatches are as follows: prefix_lm and sep_in_bidirectional_context are not arguments of generate_and_post_process.
Could you help me update it?
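For reference, a call that simply drops the two unsupported keyword arguments might look roughly like this; the import path and remaining parameters follow the upstream Megatron-LM API as I understand it, so treat them as assumptions:

from megatron.text_generation import generate_and_post_process  # assumed upstream location

# prefix_lm and sep_in_bidirectional_context are omitted; only arguments the
# upstream signature is believed to accept are passed here.
result = generate_and_post_process(
    model,
    prompts=["def fibonacci(n):"],
    tokens_to_generate=128,
)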
With micro-batch-size=2, global-batch-size=192, a 1B-model configuration, the UL2 training script gives:
forward-backward ...............................: (5164.84, 5177.15)
forward-compute ................................: (2584.92, 2735.66)
backward-compute ...............................: (2423.13, 2586.40)
batch-generator ................................: (408.98, 817.78) <---
data-iterator ..................................: (3.90, 43.10)
broadcast-data .................................: (395.41, 776.42) <---
layernorm-grads-all-reduce .....................: (0.02, 0.03)
embedding-grads-all-reduce .....................: (0.03, 0.04)
grads-all-reduce ...............................: (193.33, 193.63)
optimizer-copy-to-main-grad ....................: (10.61, 10.67)
optimizer-unscale-and-check-inf ................: (36.12, 36.37)
optimizer-clip-main-grad .......................: (2.90, 3.00)
optimizer-count-zeros ..........................: (0.00, 0.01)
optimizer-inner-step ...........................: (17.63, 17.73)
optimizer-copy-main-to-model-params ............: (4.63, 4.72)
optimizer ......................................: (72.92, 73.13)
whereas the GPT training script with the same configuration has roughly half the forward time, mostly because its broadcast-data time is close to zero. https://github.com/bigcode-project/Megatron-LM/blob/ul2-merge/pretrain_ul2.py#L90
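To confirm that the gap really comes from broadcast-data rather than the model itself, the raw broadcast can be timed in isolation. A minimal sketch, assuming a torch.distributed process group is already initialized and the tensor lives on the current CUDA device; time_broadcast_ms is a made-up helper name:

import time
import torch
import torch.distributed as dist

def time_broadcast_ms(tensor, src=0, iters=10):
    """Average wall-clock time in milliseconds to broadcast `tensor` from rank `src`."""
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        dist.broadcast(tensor, src=src)
    torch.cuda.synchronize()
    return (time.time() - start) * 1000.0 / iters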
Add document meta-information at the beginning (filename, filepath, repo name, number of stars...)
There should also be a dropout for each of these fields.
Optional: at the beginning of the preprocessed document, prepend the number of tokens that contain the meta-information. That would allow detokenizing only the meta-information instead of the whole document during training.
One issue would be that preprocessed samples of length 2048 would get shorter if the meta-information is removed. Possible workaround:
Cleaner solution: tokenize document content and meta-information strings separately. Let N be the number of meta-information fields that can be dropped out. We create N+1 preprocessed files. When loading data at training time, read through these N+1 files in parallel and add the meta-info with dropout. This solution requires more implementation work.
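A minimal sketch of the simplest variant (prepending the meta-information with per-field dropout at preprocessing time); the field names and dropout rates are placeholders, not the actual configuration:

import random

META_DROPOUT = {"filename": 0.1, "repo_name": 0.1, "stars": 0.5}  # hypothetical rates

def build_document(content, meta, rng):
    """Prepend meta-information fields, each independently dropped with its own probability."""
    kept = [f"{key}: {meta[key]}" for key in meta
            if rng.random() >= META_DROPOUT.get(key, 0.0)]
    header = "\n".join(kept)
    return header + "\n" + content if kept else content

rng = random.Random(1234)
print(build_document("def foo(): pass", {"filename": "foo.py", "stars": 3}, rng))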
Opening this after we discussed it in Slack.
It might prove useful for evaluation to train an encoder model on the same dataset used to train the main model.
Things proposed:
Once The Stack v1.2 is done (#24), attribute a weight to each data source (PL and NL).
Some PLs like HTML and CSS should probably be down-sampled.
On the other hand, we might want to upsample some low-resource PLs.
For the final model, the goal would be to set a limit of, say, 5 epochs maximum, and to check that each data source stays under that limit. An estimated 600B training tokens can be used to check this.
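A quick way to sanity-check that limit against the 600B-token budget; the per-source token counts and sampling weights below are placeholders, not the real data mix:

TOTAL_TRAIN_TOKENS = 600e9
MAX_EPOCHS = 5

# source name -> (unique tokens available, sampling weight); values are illustrative only
sources = {"python": (60e9, 0.30), "java": (45e9, 0.20), "html": (50e9, 0.02), "markdown": (20e9, 0.05)}

for name, (tokens, weight) in sources.items():
    epochs = TOTAL_TRAIN_TOKENS * weight / tokens
    status = "OK" if epochs <= MAX_EPOCHS else "over the limit -> down-sample"
    print(f"{name}: {epochs:.2f} epochs ({status})")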
Proposed list of experiments to run:
https://docs.google.com/spreadsheets/d/1xOIYoExQP_haA80ArY09fAk49fNVTO8eJrYysw71FQs/edit?usp=sharing
This list was created using this notebook: https://github.com/bigcode-project/Megatron-LM/blob/raymond-notebooks/notebooks/transformer_parameter_count.ipynb
Still open questions:
Which languages to train on? We could afford to do each experiment on single-language and multi-language datasets, doubling the compute.
Which evaluations? HumanEval, MBPP, repo-level eval? Some downstream tasks with finetuning? https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/
For the model training it's important that we achieve a high throughput. We need to figure out the multi-node training configuration (i.e., what combination of data, tensor, and pipeline parallelism) for 192 V100 GPUs.
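As a starting point for that search, a small helper can enumerate the (tensor, pipeline, data)-parallel factorizations that evenly divide 192 GPUs; the candidate TP/PP values below are assumptions, and throughput still has to be measured for each:

NUM_GPUS = 192

# Enumerate tensor-parallel (TP) and pipeline-parallel (PP) sizes; whatever is
# left over becomes the data-parallel (DP) size.
for tp in (1, 2, 4, 8):
    for pp in (1, 2, 3, 4, 6, 8):
        if NUM_GPUS % (tp * pp) == 0:
            dp = NUM_GPUS // (tp * pp)
            print(f"TP={tp} PP={pp} DP={dp}")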
When launching very long training runs, building the index mappings can take more than 1 minute.
The consequence is that the other ranks will time out. https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/training.py#L962
However, the timeout passed to torch.distributed.initialize is 10 minutes. Why isn't this value used in torch.distributed.broadcast?
The workaround for now is to first create the index mappings on a single worker, as a preliminary run.
Get an idea of the different flavours of scaling-law work that is out there: any work that tries to estimate the optimal model and dataset size with regard to a certain metric (PPL or other).
Some references:
Looking again at this code: https://github.com/bigcode-project/Megatron-LM/blob/multi-query-attention/megatron/optimizer/optimizer.py#L269
1 - Aren't the gradients of the biases missing from the reduction?
2 - The distributed optimizer misses this reduction:
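For context, the reduction being discussed has roughly this shape: all-reduce the gradients of parameters that are replicated across tensor-parallel ranks (layer-norm weights and, arguably, biases). This is a sketch only; the tensor_model_parallel attribute used to tell sharded from replicated parameters is an assumption, not necessarily what the repo relies on:

import torch.distributed as dist

def allreduce_replicated_grads(model, tp_group):
    """Sketch: all-reduce grads of parameters replicated across tensor-parallel ranks."""
    for param in model.parameters():
        # Assume sharded params carry a tensor_model_parallel marker; everything else
        # (including biases and layer-norm params) is replicated and must be reduced.
        if param.grad is not None and not getattr(param, "tensor_model_parallel", False):
            dist.all_reduce(param.grad, group=tp_group)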
It would be very useful to have a script to reproduce and/or finetune StarCoder, similar to the example script for santacoder.
Hugging Face's Trainer doesn't support tensor/model parallelism, which makes it difficult to fit the model on a single 8-GPU node, so this code would help users fine-tune the models.
Hi,
how can a checkpoint in Megatron format be converted to Hugging Face format for inference?
When processing a dataset of 55 GB / 31M samples, preprocessing runs out of memory on a machine with 1.5 TB of memory.
The error happens when saving the index. Other, larger datasets had no issue, but this dataset is the one with the most documents.
Traceback (most recent call last):
File "Megatron-LM/tools/preprocess_data.py", line 227, in <module>
main()
File "Megatron-LM/tools/preprocess_data.py", line 224, in main
builders[key].finalize(output_idx_files[key])
File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 576, in finalize
index.write(self._sizes, self._doc_idx)
File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 369, in write
pointers = self._get_pointers(sizes)
File "/app/Megatron-LM/megatron/data/indexed_dataset.py", line 363, in _get_pointers
pointers.append(address)
MemoryError
The workaround for now is to first shard the dataset and tokenize each shard independently. At training time, the shards can be blended together.
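A minimal sketch of that sharding step for a JSONL dataset (file names and shard count are illustrative); each shard can then be fed to tools/preprocess_data.py separately and the resulting files blended at training time:

def shard_jsonl(path, num_shards, out_prefix):
    """Split a JSONL file into num_shards round-robin shards."""
    outs = [open(f"{out_prefix}_{i:03d}.jsonl", "w") for i in range(num_shards)]
    with open(path) as f:
        for i, line in enumerate(f):
            outs[i % num_shards].write(line)
    for out in outs:
        out.close()

shard_jsonl("dataset.jsonl", 32, "dataset_shard")  # hypothetical paths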
Please tell me how to generate the following files:
I am trying to run the StarCoder pretraining code (examples/pretrain_bigcode_model.slurm). I created a custom pretrain_starcoder.sh file:
#!/bin/bash
GPUS_PER_NODE=2
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
# File path setup
CHECKPOINT_PATH=/home/jupyter/Satya/Megatron/Model_starcoder/
TOKENIZER_FILE=/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
#WEIGHTS_TRAIN=/fsx/loubna/code/bigcode-data-mix/data/train_data_paths.txt.tmp
#WEIGHTS_VALID=/fsx/loubna/code/bigcode-data-mix/data/valid_data_paths.txt.tmp
mkdir -p $CHECKPOINT_PATH/tensorboard
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
GPT_ARGS="\
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--num-layers 40 \
--hidden-size 6144 \
--num-attention-heads 48 \
--attention-head-type multiquery \
--init-method-std 0.01275 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--attention-dropout 0.1 \
--hidden-dropout 0.1 \
--micro-batch-size 1 \
--global-batch-size 512 \
--lr 0.0003 \
--min-lr 0.00003 \
--train-iters 250000 \
--lr-decay-iters 250000 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--weight-decay .1 \
--adam-beta2 .95 \
--clip-grad 1.0 \
--bf16 \
--use-flash-attn \
--fim-rate 0.5 \
--log-interval 10 \
--save-interval 2500 \
--eval-interval 2500 \
--eval-iters 2 \
--use-distributed-optimizer \
--valid-num-workers 0 \
"
TENSORBOARD_ARGS="--tensorboard-dir ${CHECKPOINT_PATH}/tensorboard"
export NCCL_DEBUG=INFO
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt.py \
$GPT_ARGS \
--tokenizer-type TokenizerFromFile \
--tokenizer-file $TOKENIZER_FILE \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
#--train-weighted-split-paths-path $WEIGHTS_TRAIN \
#--valid-weighted-split-paths-path $WEIGHTS_VALID \
--structured-logs \
--structured-logs-dir $CHECKPOINT_PATH/logs \
$TENSORBOARD_ARGS \
--wandb-entity-name loubnabnl \
--wandb-project-name bigcode-pretraining \
I didn't set the data path yet.
My current versions are:
CUDA - 11.0
PyTorch - 1.7.0 (I only found 1.7.1 and 1.7.0 for CUDA 11.0).
apex - 1.0
gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~18.04) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
2 AWS A100 GPUs.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB On | 00000000:20:1C.0 Off | 0 |
| N/A 24C P0 53W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-SXM4-40GB On | 00000000:A0:1D.0 Off | 0 |
| N/A 25C P0 50W / 400W | 3MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
When I run $ bash ./examples/pretrain_starcoder.sh, I get:
Wandb import failed
Wandb import failed
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:TokenizerFromFile
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
Persistent fused layer norm kernel is supported from pytorch v1.11 (nvidia pytorch container paired with v1.11). Defaulting to no_persist_layer_norm=True
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.95
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_head_type ............................. multiquery
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout ............................. 600
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_seq_length .............................. 8192
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 2500
eval_iters ...................................... 2
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_signal_handler ............................. False
ffn_hidden_size ................................. 24576
fim_rate ........................................ 0.5
fim_spm_rate .................................... 0.5
finetune ........................................ False
finetune_from ................................... None
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 512
glu_activation .................................. None
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 6144
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.01275
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 10
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0003
lr_decay_iters .................................. 250000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 2000
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 8192
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 3e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_persist_layer_norm ........................... True
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 48
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... None
num_layers ...................................... 40
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.absolute
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
sample_rate ..................................... 1.0
save ............................................ /home/jupyter/Satya/Megatron/Model_starcoder/
save_interval ................................... 2500
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 8192
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... None
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
structured_logs ................................. False
structured_logs_dir ............................. None
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 1
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_weighted_split_paths ....................... None
test_weighted_split_paths_path .................. None
titles_data_path ................................ None
tokenizer_file .................................. /home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json
tokenizer_type .................................. TokenizerFromFile
train_iters ..................................... 250000
train_samples ................................... None
train_weighted_split_paths ...................... None
train_weighted_split_paths_path ................. None
transformer_pipeline_model_parallel_size ........ 1
transformer_timers .............................. False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... None
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_one_sent_docs ............................... False
valid_num_workers ............................... 0
valid_weighted_split_paths ...................... None
valid_weighted_split_paths_path ................. None
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
wandb_entity_name ............................... None
wandb_project_name .............................. None
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 512
> building TokenizerFromFile tokenizer ...
> padded vocab (size: 49152) with 0 dummy tokens (new size: 49152)
05:15:56.69 >>> Call to _initialize_distributed in File "/tmp/Megatron/megatron/initialize.py", line 220
05:15:56.69 220 | def _initialize_distributed():
05:15:56.69 222 | args = get_args()
05:15:56.69 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.69 224 | device_count = torch.cuda.device_count()
05:15:56.69 .......... device_count = 2
05:15:56.69 225 | if torch.distributed.is_initialized():
05:15:56.69 235 | if args.rank == 0:
05:15:56.69 236 | print('> initializing torch distributed ...', flush=True)
> initializing torch distributed ...
05:15:56.69 238 | if device_count > 0:
05:15:56.69 239 | device = args.rank % device_count
05:15:56.69 .................. device = 0
05:15:56.69 240 | if args.local_rank is not None:
05:15:56.69 241 | assert args.local_rank == device, \
05:15:56.69 245 | torch.cuda.set_device(device)
05:15:56.70 249 | torch.distributed.init_process_group(
05:15:56.70 250 | backend="gloo",#args.distributed_backend,
05:15:56.70 251 | world_size=args.world_size, rank=args.rank,
05:15:56.70 252 | timeout=timedelta(seconds=args.distributed_timeout))
05:15:56.70 249 | torch.distributed.init_process_group(
05:15:56.70 256 | if device_count > 0:
05:15:56.70 257 | if mpu.model_parallel_is_initialized():
05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size,
05:15:56.70 261 | args.pipeline_model_parallel_size,
05:15:56.70 262 | args.virtual_pipeline_model_parallel_size,
05:15:56.70 263 | args.pipeline_model_parallel_split_rank)
05:15:56.70 260 | mpu.initialize_model_parallel(args.tensor_model_parallel_size,
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
05:15:56.70 <<< Return value from _initialize_distributed: None
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
05:15:56.70 >>> Call to _compile_dependencies in File "/tmp/Megatron/megatron/initialize.py", line 160
05:15:56.70 160 | def _compile_dependencies():
05:15:56.70 162 | args = get_args()
05:15:56.73 >>> Call to get_args in File "/tmp/Megatron/megatron/global_vars.py", line 38
05:15:56.73 38 | def get_args():
05:15:56.73 40 | _ensure_var_is_initialized(_GLOBAL_ARGS, 'args')
05:15:56.73 41 | return _GLOBAL_ARGS
05:15:56.73 <<< Return value from get_args: Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.73 162 | args = get_args()
05:15:56.73 .......... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.73 168 | if torch.distributed.get_rank() == 0:
05:15:56.84 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
05:15:56.84 ...... group = <object object at 0x7fe25503e6c0>
05:15:56.84 584 | def get_rank(group=group.WORLD):
05:15:56.84 600 | if _rank_not_in_group(group):
05:15:56.84 603 | _check_default_pg()
05:15:56.84 604 | if group == GroupMember.WORLD:
05:15:56.84 605 | return _default_pg.rank()
05:15:56.84 <<< Return value from get_rank: 0
05:15:56.84 168 | if torch.distributed.get_rank() == 0:
05:15:56.84 169 | start_time = time.time()
05:15:56.84 .............. start_time = 1686719756.846662
05:15:56.84 170 | print('> compiling dataset index builder ...')
> compiling dataset index builder ...
05:15:56.84 171 | from megatron.data.dataset_utils import compile_helper
05:15:56.84 .............. compile_helper = <function compile_helper at 0x7fe24b749280>
05:15:56.84 172 | compile_helper()
05:15:56.92 >>> Call to compile_helper in File "/tmp/Megatron/megatron/data/dataset_utils.py", line 81
05:15:56.92 81 | def compile_helper():
05:15:56.92 84 | import os
05:15:56.92 .......... os = <module 'os' from '/opt/conda/envs/starcoder/lib/python3.8/os.py'>
05:15:56.92 85 | import subprocess
05:15:56.92 .......... subprocess = <module 'subprocess' from '/opt/conda/envs/starcoder/lib/python3.8/subprocess.py'>
05:15:56.92 86 | path = os.path.abspath(os.path.dirname(__file__))
05:15:56.92 .......... path = '/tmp/Megatron/megatron/data'
05:15:56.92 87 | ret = subprocess.run(['make', '-C', path])
make: Entering directory '/tmp/Megatron/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/tmp/Megatron/megatron/data'
05:15:56.96 .......... ret = CompletedProcess(args=['make', '-C', '/tmp/Megatron/megatron/data'], returncode=0)
05:15:56.96 88 | if ret.returncode != 0:
05:15:56.96 <<< Return value from compile_helper: None
05:15:56.96 172 | compile_helper()
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True)
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
05:15:56.96 174 | 'seconds'.format(time.time() - start_time), flush=True)
05:15:56.96 173 | print('>>> done with dataset index builder. Compilation time: {:.3f} '
>>> done with dataset index builder. Compilation time: 0.114 seconds
05:15:56.96 181 | seq_len = args.seq_length
05:15:56.96 .......... seq_len = 8192
05:15:56.96 182 | attn_batch_size = \
05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \
05:15:56.96 184 | args.micro_batch_size
05:15:56.96 183 | (args.num_attention_heads / args.tensor_model_parallel_size) * \
05:15:56.96 182 | attn_batch_size = \
05:15:56.96 .......... attn_batch_size = 48.0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 188 | seq_len % 4 == 0 and attn_batch_size % 4 == 0
05:15:56.96 187 | custom_kernel_constraint = seq_len > 16 and seq_len <=8192 and \
05:15:56.96 .......... custom_kernel_constraint = True
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 191 | custom_kernel_constraint and
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 192 | args.masked_softmax_fusion):
05:15:56.96 190 | if not ((args.fp16 or args.bf16) and
05:15:56.96 199 | if torch.distributed.get_rank() == 0:
05:15:56.96 >>> Call to get_rank in File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 584
05:15:56.96 ...... group = <object object at 0x7fe25503e6c0>
05:15:56.96 584 | def get_rank(group=group.WORLD):
05:15:56.96 600 | if _rank_not_in_group(group):
05:15:56.96 603 | _check_default_pg()
05:15:56.96 604 | if group == GroupMember.WORLD:
05:15:56.96 605 | return _default_pg.rank()
05:15:56.96 <<< Return value from get_rank: 0
05:15:56.96 199 | if torch.distributed.get_rank() == 0:
05:15:56.96 200 | start_time = time.time()
05:15:56.96 .............. start_time = 1686719756.9662645
05:15:56.96 201 | print('> compiling and loading fused kernels ...', flush=True)
> compiling and loading fused kernels ...
05:15:56.96 202 | fused_kernels.load(args)
05:15:56.96 >>> Call to load in File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 4
05:15:56.96 ...... args = Namespace(DDP_impl='local', accumulate_allreduce...weight_decay_incr_style='constant', world_size=1)
05:15:56.96 4 | def load(args):
05:15:56.96 5 | if torch.version.hip is None:
05:15:56.96 6 | print("running on CUDA devices")
running on CUDA devices
05:15:56.96 7 | from megatron.fused_kernels.cuda import load as load_kernels
05:15:58.87 .............. load_kernels = <function load at 0x7fe2422201f0>
05:15:58.87 12 | load_kernels(args)
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
/usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (const char *const)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<const char *const &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(1375): here
instantiation of "__nv_bool pybind11::detail::object_api<Derived>::contains(T &&) const [with Derived=pybind11::handle, T=const char *const &]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/detail/internals.h(176): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(201): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::handle, pybind11::handle)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle &, pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::handle &, pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(755): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::handle, pybind11::handle, pybind11::none, pybind11::str)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::handle, pybind11::handle, pybind11::none, pybind11::str>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(971): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object, const pybind11::handle)
detected during:
instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &, const pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pytypes.h(923): here
instantiation of "pybind11::str pybind11::str::format(Args &&...) const [with Args=<pybind11::object &, const pybind11::handle &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1401): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::cpp_function)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1407): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::cpp_function, pybind11::none, pybind11::none, const char [1])
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::handle, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::cpp_function, pybind11::none, pybind11::none, const char (&)[1]>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1418): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::tuple)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::tuple &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1812): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object &>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1830): here
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/cast.h(2108): error: no instance of overloaded function "pybind11::detail::collect_arguments" matches the argument list
argument types are: (pybind11::object)
detected during instantiation of "pybind11::object pybind11::detail::object_api<Derived>::operator()(Args &&...) const [with Derived=pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>, policy=pybind11::return_value_policy::automatic_reference, Args=<pybind11::object>]"
/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/pybind11/pybind11.h(1831): here
10 errors detected in the compilation of "/tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu".
ninja: build stopped: subcommand failed.
05:16:05.35 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
05:16:05.35 !!! When calling: load_kernels(args)
05:16:05.35 !!! Call ended by exception
05:16:05.35 202 | fused_kernels.load(args)
05:16:05.39 !!! RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
05:16:05.39 !!! When calling: fused_kernels.load(args)
05:16:05.39 !!! Call ended by exception
Traceback (most recent call last):
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1516, in _run_ninja_build
subprocess.run(
File "/opt/conda/envs/starcoder/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "pretrain_gpt.py", line 158, in <module>
pretrain(train_valid_test_datasets_provider, model_provider,
File "/tmp/Megatron/megatron/training.py", line 107, in pretrain
initialize_megatron(extra_args_provider=extra_args_provider,
File "/tmp/Megatron/megatron/initialize.py", line 106, in initialize_megatron
_compile_dependencies()
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/snoop/tracer.py", line 173, in simple_wrapper
return function(*args, **kwargs)
File "/tmp/Megatron/megatron/initialize.py", line 202, in _compile_dependencies
fused_kernels.load(args)
File "/tmp/Megatron/megatron/fused_kernels/__init__.py", line 12, in load
load_kernels(args)
File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 70, in load
scaled_upper_triang_masked_softmax_cuda = _cpp_extention_load_helper(
File "/tmp/Megatron/megatron/fused_kernels/cuda/__init__.py", line 42, in _cpp_extention_load_helper
return cpp_extension.load(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 969, in load
return _jit_compile(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1176, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1280, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1538, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'
Traceback (most recent call last):
File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/starcoder/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/envs/starcoder/bin/python', '-u', 'pretrain_gpt.py', '--local_rank=0', '--tensor-model-parallel-size', '1', '--pipeline-model-parallel-size', '1', '--num-layers', '40', '--hidden-size', '6144', '--num-attention-heads', '48', '--attention-head-type', 'multiquery', '--init-method-std', '0.01275', '--seq-length', '8192', '--max-position-embeddings', '8192', '--attention-dropout', '0.1', '--hidden-dropout', '0.1', '--micro-batch-size', '1', '--global-batch-size', '512', '--lr', '0.0003', '--min-lr', '0.00003', '--train-iters', '250000', '--lr-decay-iters', '250000', '--lr-decay-style', 'cosine', '--lr-warmup-iters', '2000', '--weight-decay', '.1', '--adam-beta2', '.95', '--clip-grad', '1.0', '--bf16', '--use-flash-attn', '--fim-rate', '0.5', '--log-interval', '10', '--save-interval', '2500', '--eval-interval', '2500', '--eval-iters', '2', '--use-distributed-optimizer', '--valid-num-workers', '0', '--tokenizer-type', 'TokenizerFromFile', '--tokenizer-file', '/home/jupyter/Satya/Megatron/tokenizer_starcoder/tokenizer.json', '--save', '/home/jupyter/Satya/Megatron/Model_starcoder/', '--load', '/home/jupyter/Satya/Megatron/Model_starcoder/']' returned non-zero exit status 1.
examples/pretrain_starcoder.sh: line 75: --structured-logs: command not found
In the above code I also tried using snoop tracing. Below is the main error.
Detected CUDA files, patching ldflags
Emitting ninja build file /tmp/Megatron/megatron/fused_kernels/cuda/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] /usr/local/cuda-11.0/bin/nvcc -DTORCH_EXTENSION_NAME=scaled_upper_triang_masked_softmax_cuda -DTORCH_API_INCLUDE_EXTENSION_H -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/envs/starcoder/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.0/include -isystem /opt/conda/envs/starcoder/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -std=c++17 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -c /tmp/Megatron/megatron/fused_kernels/cuda/scaled_upper_triang_masked_softmax_cuda.cu -o scaled_upper_triang_masked_softmax_cuda.cuda.o
FAILED: scaled_upper_triang_masked_softmax_cuda.cuda.o
Awesome work. Is it possible to check in the pretraining and fine-tuning scripts, please?
I want to use my dataset to fine-tune StarCoderBase with FIM. Is there code for this that could help me?
Traceback (most recent call last):
File "/data/lee/Megatron-LM/pretrain_gpt.py", line 158, in
pretrain(train_valid_test_datasets_provider, model_provider,
File "/data/lee/Megatron-LM/megatron/training.py", line 129, in pretrain
model, optimizer, opt_param_scheduler = setup_model_and_optimizer(model_provider,
File "/data/lee/Megatron-LM/megatron/training.py", line 376, in setup_model_and_optimizer
model = get_model(model_provider_func, model_type)
File "/data/lee/Megatron-LM/megatron/training.py", line 262, in get_model
model = model_provider_func(
File "/data/lee/Megatron-LM/pretrain_gpt.py", line 35, in model_provider
model = GPTModel(
File "/data/lee/Megatron-LM/megatron/model/gpt_model.py", line 74, in init
self.language_model, self._language_model_key = get_language_model(
File "/data/lee/Megatron-LM/megatron/model/language_model.py", line 75, in get_language_model
language_model = TransformerLanguageModel(
File "/data/lee/Megatron-LM/megatron/model/language_model.py", line 373, in init
self.encoder = ParallelTransformer(
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 1182, in init
[build_layer(i + 1 + offset) for i in range(self.num_layers)])
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 1182, in
[build_layer(i + 1 + offset) for i in range(self.num_layers)])
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 1130, in build_layer
return ParallelTransformerLayer(
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 877, in init
self.self_attention = ParallelAttention(
File "/data/lee/Megatron-LM/megatron/model/transformer.py", line 590, in init
raise ValueError(f"Invalid attention arguments: {attention_type}, {self.attention_head_type}")
ValueError: Invalid attention arguments: AttnType.self_attn, None
The parameters and commands are as follows. May I ask what the problem is?
GPUS_PER_NODE=1
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
CHECKPOINT_PATH=/data/lee/Megatron-LM/experiments/0606
VOCAB_FILE=vocab.json
MERGE_FILE=merges.txt
DATA_PATH=starcoder-abap_content_document
GPT_ARGS="--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--micro-batch-size 12 \
--global-batch-size 192 \
--lr 0.0005 \
--train-iters 150000 \
--lr-decay-iters 150000 \
--lr-decay-style cosine \
--lr-warmup-iters 2000 \
--weight-decay .1 \
--adam-beta2 .999 \
--fp16 \
--log-interval 10 \
--save-interval 2000 \
--eval-interval 200 \
--eval-iters 10"
# --finetune
TENSORBOARD_ARGS="--tensorboard-dir experiments/tensorboard"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_gpt.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
$GPT_ARGS \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--save $CHECKPOINT_PATH \
--data-path $DATA_PATH \
$TENSORBOARD_ARGS
This project seems to pre-train decoder-only style LMs. I just wonder why not an encoder-decoder style, which is more powerful for text generation (translation, summarization, conditional text generation).
I was confused earlier; now I understand all the details of the implementation.
Closing issue...