I am trying to train both the 8-billion and the 20-billion-parameter models on Summit, and both runs fail with CUDA out-of-memory errors.
Summit has 6 NVIDIA V100 16 GB GPUs per node.
I am testing on a single node, and the OOM persists even after reducing the train batch size to 1.
The logs from both attempts follow. Judging by the per-rank parameter counts, the first block appears to be the 20-billion run (~2.8 B parameters per model-parallel rank, ending in the OOM traceback) and the second the 8-billion run (~1.38 B per rank):
use_npy_data_loader .......... False
train_data_path ..............
val_data_path ................
test_data_path ...............
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 949,50,1
test_data .................... None
lazy_loader .................. False
loose_json ................... False
presplit_sentences ........... True
num_workers .................. 2
tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
tokenizer_path ............... tokenizer.model
tokenizer_type ............... BertWordPieceTokenizer
cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
use_tfrecords ................ True
seq_length ................... 512
max_preds_per_seq ............ 76
deepspeed .................... True
deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ True
sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
cuda ......................... True
rank ......................... 0
world_size ................... 6
dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 04:40:19.647170: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 04:40:22.566024 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
W0229 04:40:22.567073 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
W0229 04:40:22.567220 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.
2020-02-29 04:40:22.567455: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 04:40:22.570236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 04:40:22.572765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 04:40:22.575278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 04:40:22.577850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 04:40:22.580415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 04:40:22.582986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 04:40:22.583008: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 04:40:22.583068: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 04:40:22.583108: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 04:40:22.583146: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 04:40:22.585072: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 04:40:22.585118: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 04:40:22.585156: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 04:40:22.615387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 04:40:22.623295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 04:40:22.623314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]
W0229 04:40:22.646660 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 04:40:25.123421 35184372395936 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
W0229 04:40:25.123578 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 04:40:25.123658 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 04:40:25.149839: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 04:40:25.153336 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 04:40:25.153439 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 04:40:25.154995 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.
W0229 04:40:25.166115 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:125722:125722 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125722:125722 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125722 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:125724:125724 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125726:125726 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125727:125727 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125723:125723 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125971 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125725:125725 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125725:125725 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125725:125725 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125725:125992 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:125723:125993 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:125724:125994 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:125726:125995 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:125727:125996 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:125722:125971 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:125722:125971 [0] NCCL INFO Channel 00 : 0 1 2 3 4 5
h36n18:125722:125971 [0] NCCL INFO Channel 01 : 0 1 2 3 4 5
h36n18:125722:125971 [0] NCCL INFO Channel 02 : 0 1 2 3 4 5
h36n18:125722:125971 [0] NCCL INFO Channel 03 : 0 1 2 3 4 5
h36n18:125726:125995 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:125725:125992 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:125724:125994 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:125722:125971 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:125722:125971 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:125722:125722 [0] NCCL INFO Launch mode Parallel
building BERT model ...
h36n18:125726:125995 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:125723:125993 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
> number of parameters on model parallel rank 0: 2799983247
h36n18:125722:126579 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125722:126579 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:125722:126579 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
> number of parameters on model parallel rank 5: 2799983247
> number of parameters on model parallel rank 3: 2799983247
Traceback (most recent call last):
File "pretrain_bert_nccl.py", line 629, in <module>
main()
File "pretrain_bert_nccl.py", line 579, in main
model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
File "pretrain_bert_nccl.py", line 170, in setup_model_and_optimizer
optimizer = get_optimizer(model, args)
File "pretrain_bert_nccl.py", line 141, in get_optimizer
'delayed_shift': args.hysteresis})
File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 198, in __init__
master_param = param.detach().clone().float()
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.75 GiB total capacity; 14.50 GiB already allocated; 16.94 MiB free; 373.95 MiB cached; 0 bytes inactive)
> number of parameters on model parallel rank 2: 2799983247
> number of parameters on model parallel rank 1: 2799983247
> number of parameters on model parallel rank 4: 2799983247
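For context, a back-of-envelope estimate suggests neither run can fit on a 16 GB V100 regardless of batch size. Assuming the usual FP16 mixed-precision layout with Adam (as in Megatron-LM's FP16_Optimizer: fp16 weights and gradients, plus fp32 master weights and two fp32 Adam moments), each parameter needs roughly 16 bytes of persistent state before any activation memory:

```python
# Rough per-rank memory estimate for FP16 mixed-precision training with Adam.
# Per parameter: 2 B fp16 weight + 2 B fp16 gradient
#              + 4 B fp32 master weight + 4 B + 4 B fp32 Adam moments (m, v)
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4  # = 16 bytes, excluding activations

def min_state_gib(params_per_rank: int) -> float:
    """GiB of weight/gradient/optimizer state alone for one model-parallel rank."""
    return params_per_rank * BYTES_PER_PARAM / 2**30

# Per-rank counts taken from the logs above:
for label, n in [("first run ", 2_799_983_247),
                 ("second run", 1_381_032_967)]:
    print(f"{label}: ~{min_state_gib(n):.1f} GiB state vs 15.75 GiB on a V100")
# first run : ~41.7 GiB -- nearly 3x the card's capacity
# second run: ~20.6 GiB -- still over capacity
```

So even with train batch size 1, the optimizer state alone exceeds the 15.75 GiB the traceback reports, which matches the failure in `FP16_Optimizer.__init__` while cloning fp32 master copies. This is only a sketch of the standard accounting, but it suggests more model-parallel (or pipeline/ZeRO-style) partitioning is needed, not a smaller batch.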
use_npy_data_loader .......... False
train_data_path ..............
val_data_path ................
test_data_path ...............
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 949,50,1
test_data .................... None
lazy_loader .................. False
loose_json ................... False
presplit_sentences ........... True
num_workers .................. 2
tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
tokenizer_path ............... tokenizer.model
tokenizer_type ............... BertWordPieceTokenizer
cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
use_tfrecords ................ True
seq_length ................... 512
max_preds_per_seq ............ 76
deepspeed .................... True
deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ True
sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
cuda ......................... True
rank ......................... 0
world_size ................... 6
dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 05:07:35.425203: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 05:07:38.074505 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
W0229 05:07:38.074888 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.
W0229 05:07:38.075031 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.
2020-02-29 05:07:38.075261: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 05:07:38.078041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 05:07:38.080565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 05:07:38.083095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 05:07:38.085669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 05:07:38.088239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 05:07:38.090805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 05:07:38.090827: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 05:07:38.090887: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 05:07:38.090926: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 05:07:38.090965: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 05:07:38.092861: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 05:07:38.092907: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 05:07:38.092946: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 05:07:38.123406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 05:07:38.130912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 05:07:38.130926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]
W0229 05:07:38.154345 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 05:07:39.526942 35184372395936 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
W0229 05:07:39.527102 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 05:07:39.527187 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 05:07:39.553327: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 05:07:39.556849 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 05:07:39.556953 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 05:07:39.559207 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.
W0229 05:07:39.570396 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:127714:127714 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127714:127714 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127714 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:127718 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127719:127719 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127717:127717 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127963 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127716:127716 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127715:127715 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127984 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127715:127985 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127719:127986 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127717:127987 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127718:127988 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127714:127963 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:127963 [0] NCCL INFO Channel 00 : 0 1 2 3 4 5
h36n18:127714:127963 [0] NCCL INFO Channel 01 : 0 1 2 3 4 5
h36n18:127714:127963 [0] NCCL INFO Channel 02 : 0 1 2 3 4 5
h36n18:127714:127963 [0] NCCL INFO Channel 03 : 0 1 2 3 4 5
h36n18:127715:127985 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127717:127987 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127716:127984 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:127714:127963 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:127963 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127715:127985 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:127988 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
building BERT model ...
> number of parameters on model parallel rank 0: 1381032967
> number of parameters on model parallel rank 1: 1381032967
> number of parameters on model parallel rank 5: 1381032967
> number of parameters on model parallel rank 3: 1381032967
> number of parameters on model parallel rank 2: 1381032967
h36n18:127714:128267 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127714:128267 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127714:128267 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
> number of parameters on model parallel rank 4: 1381032967
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127715:128279 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127715:128279 [1] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127715:128279 [1] NCCL INFO comm 0x2001c8006620 rank 0 nranks 1 cudaDev 1 nvmlDev 1 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127719:128281 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128281 [5] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127719:128281 [5] NCCL INFO comm 0x2001ec006620 rank 0 nranks 1 cudaDev 5 nvmlDev 5 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127716:128283 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127716:128283 [2] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127716:128283 [2] NCCL INFO comm 0x200340006620 rank 0 nranks 1 cudaDev 2 nvmlDev 2 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127717:128286 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127717:128286 [3] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127717:128286 [3] NCCL INFO comm 0x200320006620 rank 0 nranks 1 cudaDev 3 nvmlDev 3 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:128288 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127718:128288 [4] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127718:128288 [4] NCCL INFO comm 0x2001f4006620 rank 0 nranks 1 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127718:128337 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128338 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127715:128339 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127717:128341 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127716:128340 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127714:128336 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:128336 [0] NCCL INFO Channel 00 : 0 1 2 3 4 5
h36n18:127714:128336 [0] NCCL INFO Channel 01 : 0 1 2 3 4 5
h36n18:127714:128336 [0] NCCL INFO Channel 02 : 0 1 2 3 4 5
h36n18:127714:128336 [0] NCCL INFO Channel 03 : 0 1 2 3 4 5
h36n18:127719:128338 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO comm 0x200408006620 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127715:128339 [1] NCCL INFO comm 0x200424006620 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:128337 [4] NCCL INFO comm 0x200410006620 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127717:128341 [3] NCCL INFO comm 0x20033c006620 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:128336 [0] NCCL INFO comm 0x200718006620 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127716:128340 [2] NCCL INFO comm 0x20035c006620 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
learning rate decaying linear
Partition Activations False and Correctness Check False
Traceback (most recent call last):
File "pretrain_bert_nccl.py", line 629, in <module>
main()
File "pretrain_bert_nccl.py", line 607, in main
timers, args)
File "pretrain_bert_nccl.py", line 338, in train
args, timers)
File "pretrain_bert_nccl.py", line 297, in train_step
nsp_loss, args)
File "pretrain_bert_nccl.py", line 272, in backward_step
optimizer.update_master_grads()
File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
self._model_grads_to_master_grads()
File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.75 GiB total capacity; 14.04 GiB already allocated; 580.94 MiB free; 200.72 MiB cached; 0 bytes inactive)
(The identical traceback repeats on each of the other five ranks; only the final OOM lines differ:)
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 2; 15.75 GiB total capacity; 14.13 GiB already allocated; 586.94 MiB free; 188.72 MiB cached; 0 bytes inactive)
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 1; 15.75 GiB total capacity; 14.13 GiB already allocated; 582.88 MiB free; 192.72 MiB cached; 0 bytes inactive)
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 5; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 4; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 3; 15.75 GiB total capacity; 14.16 GiB already allocated; 558.94 MiB free; 192.72 MiB cached; 0 bytes inactive)
From my understanding of Table 8 in the paper, you were able to train both the 8-billion and the 20-billion models on 4 x 16GB GPUs using 4-way model parallelism.
In my case I am using 6-way model parallelism with a batch size of 1, and it still doesn't work.
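For context, here is a rough per-rank memory estimate based on the parameter count printed in the log above (1,381,032,967 per model-parallel rank). This is only a back-of-envelope sketch, assuming standard fp16 mixed-precision training with Adam and full fp32 master copies (as in the Megatron-LM `fp16` optimizer that appears in the traceback), with no optimizer-state partitioning; it ignores activations entirely:

```python
# Back-of-envelope memory estimate for one model-parallel rank.
# Assumed per-parameter state (fp16 mixed precision + Adam, no ZeRO):
#   fp16 weights (2 B) + fp16 grads (2 B)
#   + fp32 master weights (4 B) + fp32 master grads (4 B)
#   + fp32 Adam exp_avg (4 B) + fp32 Adam exp_avg_sq (4 B) = 20 B/param
params_per_rank = 1_381_032_967  # from "number of parameters on model parallel rank 0"
bytes_per_param = 2 + 2 + 4 + 4 + 4 + 4  # = 20

state_gib = params_per_rank * bytes_per_param / 2**30
print(f"model + optimizer state per rank: {state_gib:.1f} GiB")
print(f"V100 capacity reported by PyTorch: 15.75 GiB")
```

Under these assumptions the state alone is roughly 25.7 GiB per rank, well above the 15.75 GiB a V100 16GB offers, which would explain the OOM at the `update_master_grads` step even with batch size 1. So either my assumed byte counts are off, or some form of state partitioning/offload is needed that my current config is not enabling.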