
Comments (8)

vince62s commented on May 23, 2024

This timeout is just a safeguard. You won't fully train a 2.2B-parameter model unless you use tensor_parallel on a bunch of GPUs.
What are you trying to do exactly?


Dagamies commented on May 23, 2024

I think this is true for OpenNMT-tf, where the practical maximum for our use cases is about 1B parameters. With flash attention and 8-bit BitsAndBytes optimizers, OpenNMT-py seems to be able to handle much bigger models, and I'm using 80 GB A100s. The key is to keep batch_size small enough that the updates fit in memory. This has the nasty side effect of decreasing GPU utilization, but you can't have everything. We pretrain LMs for complex entity extraction using highly curated data and custom objectives. So far I have been able to pretrain a 1.7B-parameter model with OpenNMT-py successfully. These are always encoder-decoder models with a width of 1024 and an FFN size of 16384. The 1.7B-parameter model has 2x18 layers; the one that causes the timeout issues has 2x24 layers.
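For reference, a rough back-of-the-envelope parameter count for this architecture. This is a minimal sketch assuming a standard Transformer encoder-decoder, ignoring biases and LayerNorm weights; the dimensions come from the description above and the vocab size from the log below:

```python
# Rough parameter-count estimate for the 2x24-layer encoder-decoder
# described above (d_model=1024, FFN=16384, vocab=131072).
# Biases and LayerNorm weights are ignored, so this only approximates
# the exact counts reported in the training log.

d_model, d_ff, vocab = 1024, 16384, 131072
enc_layers = dec_layers = 24

attn = 4 * d_model * d_model   # Q, K, V and output projections
ffn = 2 * d_model * d_ff       # two linear layers of the feed-forward block

enc_layer = attn + ffn             # self-attention + FFN
dec_layer = 2 * attn + ffn         # self-attention + cross-attention + FFN

encoder = enc_layers * enc_layer + vocab * d_model   # + source embeddings
decoder = dec_layers * dec_layer + vocab * d_model   # + target embeddings

print(f"encoder ~ {encoder:,}")           # ~1.04B, close to the logged 1,040,289,792
print(f"decoder ~ {decoder:,}")           # ~1.14B, close to the logged 1,141,133,312
print(f"total   ~ {encoder + decoder:,}") # ~2.18B
```

The same arithmetic with 2x18 layers lands at roughly 1.7B, which matches the earlier model mentioned above.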

This is the log from the failed run when it timed out (and it is actually only 2.18B parameters, not 2.2 :D):

[2023-11-12 22:26:22,155 INFO] encoder: 1040289792
[2023-11-12 22:26:22,155 INFO] decoder: 1141133312
[2023-11-12 22:26:22,155 INFO] * number of parameters: 2181423104
[2023-11-12 22:26:22,158 INFO] Trainable parameters = {'torch.float32': 2181423104, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2023-11-12 22:26:22,158 INFO] Non trainable parameters = {'torch.float32': 0, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2023-11-12 22:26:22,158 INFO]  * src vocab size = 131072
[2023-11-12 22:26:22,158 INFO]  * tgt vocab size = 131072
[2023-11-12 22:26:22,772 INFO] Starting training on GPU: [0, 1]
[2023-11-12 22:26:22,772 INFO] Start training loop and validate every 2000 steps...
[2023-11-12 22:26:22,772 INFO] Scoring with: TransformPipe()
[2023-11-12 22:26:28,030 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
[2023-11-12 22:26:29,180 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
[2023-11-12 22:26:34,184 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
[2023-11-12 22:26:35,471 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
/ai/onmt/onmt-py-3342/OpenNMT-py/onmt/utils/distributed.py:104: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
  all_gather_list._in_buffer = torch.cuda.ByteTensor(max_size)
/ai/onmt/onmt-py-3342/OpenNMT-py/onmt/utils/distributed.py:104: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:83.)
  all_gather_list._in_buffer = torch.cuda.ByteTensor(max_size)
[2023-11-12 22:31:19,138 INFO] Updated dropout/attn dropout to 0.000000 0.000000 at step 2
[2023-11-12 23:17:08,173 INFO] Step 100/300000; acc: 6.4; ppl: 79642.1; xent: 11.3; lr: 0.00001; sents:   71793; bsz: 1523/ 180/ 3; 12798/1515 tok/s;   3045 sec;
[2023-11-13 00:03:18,137 INFO] Step 200/300000; acc: 8.4; ppl: 17905.0; xent: 9.8; lr: 0.00002; sents:   73158; bsz: 1525/ 182/ 3; 14090/1678 tok/s;   5815 sec;
[2023-11-13 00:49:28,742 INFO] Step 300/300000; acc: 10.0; ppl: 3765.1; xent: 8.2; lr: 0.00004; sents:   72734; bsz: 1522/ 183/ 3; 14067/1690 tok/s;   8586 sec;
[2023-11-13 01:35:33,462 INFO] Step 400/300000; acc: 14.1; ppl: 1733.7; xent: 7.5; lr: 0.00005; sents:   72411; bsz: 1523/ 179/ 3; 14099/1662 tok/s;  11351 sec;
[2023-11-13 02:21:35,117 INFO] Step 500/300000; acc: 17.8; ppl: 1163.1; xent: 7.1; lr: 0.00006; sents:   70631; bsz: 1520/ 183/ 3; 14090/1694 tok/s;  14112 sec;
[2023-11-13 03:07:33,398 INFO] Step 600/300000; acc: 19.0; ppl: 969.9; xent: 6.9; lr: 0.00007; sents:   71566; bsz: 1523/ 178/ 3; 14135/1656 tok/s;  16871 sec;
[2023-11-13 03:53:29,326 INFO] Step 700/300000; acc: 19.8; ppl: 822.5; xent: 6.7; lr: 0.00009; sents:   71551; bsz: 1522/ 183/ 3; 14134/1698 tok/s;  19627 sec;
[2023-11-13 04:39:29,231 INFO] Step 800/300000; acc: 20.4; ppl: 732.2; xent: 6.6; lr: 0.00010; sents:   71949; bsz: 1523/ 181/ 3; 14131/1680 tok/s;  22386 sec;
[2023-11-13 05:25:18,799 INFO] Step 900/300000; acc: 20.9; ppl: 651.2; xent: 6.5; lr: 0.00011; sents:   72201; bsz: 1523/ 182/ 3; 14182/1698 tok/s;  25136 sec;
[2023-11-13 06:11:13,127 INFO] Step 1000/300000; acc: 21.6; ppl: 579.8; xent: 6.4; lr: 0.00012; sents:   70949; bsz: 1521/ 181/ 3; 14137/1681 tok/s;  27890 sec;
[2023-11-13 06:57:12,614 INFO] Step 1100/300000; acc: 22.8; ppl: 511.0; xent: 6.2; lr: 0.00014; sents:   72523; bsz: 1524/ 182/ 3; 14139/1688 tok/s;  30650 sec;
[2023-11-13 07:43:09,431 INFO] Step 1200/300000; acc: 25.2; ppl: 445.7; xent: 6.1; lr: 0.00015; sents:   72681; bsz: 1524/ 183/ 3; 14154/1698 tok/s;  33407 sec;
[2023-11-13 08:29:04,638 INFO] Step 1300/300000; acc: 29.0; ppl: 376.3; xent: 5.9; lr: 0.00016; sents:   70942; bsz: 1522/ 181/ 3; 14140/1683 tok/s;  36162 sec;
[2023-11-13 09:16:07,062 INFO] Step 1400/300000; acc: 34.1; ppl: 306.4; xent: 5.7; lr: 0.00017; sents:   71338; bsz: 1523/ 181/ 3; 13817/1638 tok/s;  38984 sec;
[E ProcessGroupNCCL.cpp:467] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=144159, OpType=ALLGATHER, NumelIn=4096, NumelOut=8192, Timeout(ms)=60000) ran for 60820 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:852] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=144159, OpType=ALLGATHER, NumelIn=4096, NumelOut=8192, Timeout(ms)=60000) ran for 60820 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=144159, OpType=ALLGATHER, NumelIn=4096, NumelOut=8192, Timeout(ms)=60000) ran for 60820 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:467] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=144160, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=60000) ran for 60485 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:481] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:487] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:852] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=144160, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=60000) ran for 60485 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=144160, OpType=ALLREDUCE, NumelIn=134217728, NumelOut=134217728, Timeout(ms)=60000) ran for 60485 milliseconds before timing out.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 28 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d 
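Incidentally, the UserWarning in the log points at the deprecated torch.cuda.ByteTensor constructor used in onmt/utils/distributed.py. A minimal sketch of the equivalent allocation the warning recommends, assuming the buffer only needs to be an uninitialized uint8 CUDA tensor of max_size:

```python
import torch

max_size = 4096  # placeholder value; in OpenNMT-py this is the gather buffer size

# Deprecated form that triggers the UserWarning above:
# buf = torch.cuda.ByteTensor(max_size)

# Recommended equivalent: an uninitialized uint8 tensor allocated on the GPU.
buf = torch.empty(max_size, dtype=torch.uint8, device="cuda")
```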


Dagamies commented on May 23, 2024

There is one issue I forgot to mention: when training these larger models, for some reason one random GPU sometimes (quite infrequently) idles for 20-30 seconds while the others keep working happily. I have no clue why this happens, but the timeout might be somehow related to this phenomenon. I have never seen it happen with OpenNMT-tf.


vince62s commented on May 23, 2024

If you get this timeout it's because one process gets killed (probably OOM) and the other has been waiting for more than 60 seconds to receive synced data.
Maybe use bnb_NF4 instead of 8-bit quantization.


Dagamies commented on May 23, 2024

[screenshot] This is the phenomenon I'm referring to. In this case I have the timeout set to 600 seconds. The "freeze" in training lasts about 90 seconds, and during the freeze there is no change in GPU memory usage. This model has 1,543,800,831 parameters, but the same happens with the 2B+ model. I'm running it with NCCL_DEBUG=WARN NCCL_BUFFSIZE=8388608 CUDA_VISIBLE_DEVICES=0,1,2,3. If the timeout is 60 seconds, the processes get killed.
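For context, the collective timeout being discussed here is the one the NCCL process group is created with. A minimal generic PyTorch sketch of raising it to the 600 seconds mentioned above (this is plain torch.distributed; OpenNMT-py's own launcher and config may expose this setting differently):

```python
import datetime
import torch.distributed as dist

# Raise the NCCL collective timeout to 600 s so a briefly idling rank does not
# trip the watchdog seen in the log above. Rank/world size are expected to come
# from the usual env:// variables set by the launcher.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(seconds=600),
)
```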


Dagamies commented on May 23, 2024

Forgot to mention: this happened after 5 hours of training with 4 A100s. It is somehow related to distributed training; all models train just fine when a single GPU is used. Training runs in a pod based on the latest NVIDIA PyTorch image (nvcr.io/nvidia/pytorch:23.11-py3). It does not matter whether I use 2 or 4 GPUs for multi-GPU training.


vince62s commented on May 23, 2024

Can you share the config file? Are the GPUs idle when this happens?


Dagamies commented on May 23, 2024

Hi, I can, but I got an OOM exception when it was saving the model:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 23.00 GiB. GPU 0 has a total capacty of 79.17 GiB of which 12.13 GiB is free. Process 2991037 has 67.04 GiB memory in use. Of the allocated memory 43.17 GiB is allocated by PyTorch, and 23.17 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

--> Changed the environment variables to:
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 NCCL_DEBUG=WARN NCCL_BUFFSIZE=8388608
--> It now allocates 80632 MiB of the 81920 MiB GPU memory very quickly, but training has been stable for over 24 hours, and the random GPU idling does not seem to occur at all. So I think we can conclude that this is a PyTorch memory management issue?
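The OOM message above distinguishes memory allocated by PyTorch from memory reserved but unallocated, which is the fragmentation that max_split_size_mb targets. A minimal sketch for watching those two numbers from inside a run (plain PyTorch; the tag names and call sites are just illustrative):

```python
import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    """Print allocated vs. reserved CUDA memory; a large gap between the two
    suggests fragmentation of the caching allocator."""
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB reserved={reserved:.2f} GiB")

# e.g. call log_cuda_memory("before_save") right before checkpointing,
# where the 23 GiB allocation in the traceback above failed.
```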

