Comments (11)
If there are no further questions from anyone, this issue can be closed.
from pytorch.
@kurman has an environment variable you can set that will keep the TCPStore alive past trainer exit. It should help avoid this crash.
After removing this `break` statement, the test case no longer errors out and exits when run repeatedly. Would it be more reasonable to delete the `break` here?
Hi @garfield1997, thanks for stepping up and root-causing it. Are you interested in sending us a PR so we can address the issue through PR review? It might speed up landing the fix.
After I updated my local repository, this problem no longer occurs. @c-p-i-o @kurman @weifengpy Thank you for taking the time to look into this issue.
My current environment is as follows:
Collecting environment information...
PyTorch version: 2.5.0a0+gitc35ffaf
Is debug build: False
CUDA used to build PyTorch: 12.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.28.1
Libc version: glibc-2.31
Python version: 3.10.13 (main, Sep 11 2023, 13:21:10) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.11.0-27-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 535.104.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
Stepping: 4
CPU MHz: 1000.000
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 1 MiB
L1i cache: 1 MiB
L2 cache: 32 MiB
L3 cache: 44 MiB
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d arch_capabilities
Versions of relevant libraries:
[pip3] flake8==6.1.0
[pip3] flake8-bugbear==23.3.23
[pip3] flake8-comprehensions==3.12.0
[pip3] flake8-executable==2.1.3
[pip3] flake8-logging-format==0.9.0
[pip3] flake8-pyi==23.3.1
[pip3] flake8-simplify==0.19.3
[pip3] mypy==1.7.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.0
[pip3] optree==0.10.0
[pip3] torch==2.5.0a0+gitc35ffaf
[conda] No relevant packages
Maybe related to #123969
https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1323

```cpp
try {
  checkExceptionDump =
      globalStore_->check({std::string(EXCEPTION_DUMP)});
} catch (const std::exception& e) {
  LOG(ERROR)
      << logPrefix()
      << "Failed to get exception dump flag from the global store."
      << e.what();
  break;
}
```
It appears that the thread was aborted after breaking out of the loop here. If that is indeed the case, I would like to know whether this behavior is expected, since the script above seems legitimate.
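To illustrate the behavior being described, here is a minimal plain-Python sketch (not the actual C++ implementation; all names are hypothetical) of a watchdog thread that polls a store and, like the snippet above, breaks out of its loop on the first connection error, at which point the monitoring thread simply exits:

```python
import threading

class StoreUnavailable(Exception):
    """Stands in for the connection error the TCPStore client raises."""

def watchdog(check_store, polls):
    """Mimic the monitor loop: poll the store; on an exception, log and break."""
    for _ in range(polls):
        try:
            check_store()  # like globalStore_->check({EXCEPTION_DUMP})
        except StoreUnavailable as e:
            print(f"Failed to get exception dump flag from the global store. {e}")
            break  # the loop ends here, so the monitoring thread exits
    # falling out of the loop ends the thread

calls = {"n": 0}

def flaky_check():
    # Succeeds twice, then fails, as when the rank hosting the store
    # exits before the other ranks' watchdog threads do.
    calls["n"] += 1
    if calls["n"] > 2:
        raise StoreUnavailable("Connection reset by peer")

t = threading.Thread(target=watchdog, args=(flaky_check, 10))
t.start()
t.join()
print(calls["n"])  # 3: two good polls, one failing poll, then the break
```

Even though the thread was asked to poll ten times, it stops after the third poll: the exception is caught, logged, and the `break` ends the loop for good.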
> It appears that after breaking out of the loop here

cc @c-p-i-o who just made the change to break out at https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1323
> It appears that after breaking out of the loop here
>
> cc @c-p-i-o who just made the change to break out at https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1323

Hello @weifengpy / @garfield1997 - the `break` was added to catch the case where the `globalStore` isn't available and/or it throws an exception. Not having the `try/catch` + `break` there causes the program to crash.

As we can see in the logs, we are getting a "Connection reset by peer", and that's what's triggering the failures:

```text
[rank1]:[E614 11:14:03.086670611 ProcessGroupNCCL.cpp:1319] [PG 0 Rank 1] Failed to get exception dump flag from the global store.Connection reset by peer
Exception raised from recvBytes at /projs/framework/xushuo/pytorch_gar/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first):
```
But another option could be to do a `continue` here to see if it resolves the issue. At the `break`, we would instead put:

```cpp
try {
  checkExceptionDump =
      globalStore_->check({std::string(EXCEPTION_DUMP)});
} catch (const std::exception& e) {
  LOG(ERROR)
      << logPrefix()
      << "Failed to get exception dump flag from the global store."
      << e.what();
  checkDumpSignal = false;
  continue;
}
```
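The difference between the two variants can be sketched in plain Python (hypothetical helper names; the real code is the C++ above): with `continue`, a transient "Connection reset by peer" costs one poll and the watchdog keeps running, whereas `break` would have ended the loop permanently:

```python
def poll_store(check, polls):
    """Sketch of the proposed variant: on error, skip this poll and retry."""
    results = []
    for i in range(polls):
        try:
            results.append(check(i))  # like globalStore_->check(...)
        except ConnectionError:
            # mirrors `checkDumpSignal = false; continue;` -- this iteration
            # reports no dump signal, but the loop itself survives
            results.append(None)
            continue
    return results

def transient(i):
    # Fails only on poll 1 (a transient reset), succeeds otherwise.
    if i == 1:
        raise ConnectionError("Connection reset by peer")
    return False  # no EXCEPTION_DUMP flag set

print(poll_store(transient, 4))  # [False, None, False, False]
```

The trade-off: if the store is permanently gone rather than transiently unreachable, this variant keeps logging an error on every poll instead of shutting the thread down once.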
FWIW, I tried the attached repro on my machine and haven't hit the failure so far. Do you know how long it usually takes?

```text
run iter 204 success!
run iter 205 success!
```

Will let it run overnight.
> ENV you can set that will keep tcpstore alive past trainer exit. It should help avoid this crash.

This is enabled by default now after #128096 landed about 5 days ago.