Comments (14)
FWIW, I just tested this with forge 22.0.1 and it worked OK for me. The PID after the --attach-mpi arg should be the same PID that you are attaching STAT to and this assumes that the PID is present on the node where STAT and DDT are being launched. Can you confirm with ps x
that the PID it is attempting to attach to is indeed the appropriate srun PID?
from stat.
The PID is correct. I use stat-gui --attach; please see below:
++ ssh aa3-2035 pgrep -o srun
+ /lus/h2resw01/hpcperm/atosla/Tools/spack/opt/spack/linux-rhel8-zen/gcc-8.4.1/stat-develop-lznpndx6cjf5opqo2wwv6hpsqf6jahzx/bin/stat-gui --attach aa3-2035:312897
<Jun 28 17:35:23> <Launchmon> (INFO): Just continued the RM process out of the first trap
fork exec DDT ['/usr/local/apps/forge/21.1.1/bin/ddt', '--attach-mpi', '312897', '--subset', '0,2,3,6,8,42,16,17', './t3.x']
The specified process was not found.
I have also checked on the compute node with ps.
If I try to attach ddt directly to srun on the compute node, I get the following:
$ ddt --attach-mpi=`pgrep -o srun ` ~/old-home/Apps/stat-tests/t3.x
The specified process was found, but no MPI process details were
contained in this process.
Is this an mpirun process, or is the mpirun binary stripped?
My forge version is older: 21.1.1.
from stat.
For the example you gave you attached to "aa3-2035:312897" and I'm guessing you ran that from a login node or launch node. Are you able to ssh to aa3-2035 and launch the STAT GUI there? Once on aa3-2035, you should be able to just attach to the PID, 312897, in your case. Either way, I would then suggest trying to copy the exact ddt command line and all its args and run it on aa3-2035 as a sanity check. Sometimes there are multiple srun processes and you need to be sure to attach to the appropriate one. The forge version should be fine as they have supported this for many years now.
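For the sanity check, the `fork exec DDT [...]` line that STAT prints is a Python list, so it can be converted back into a shell command to paste on the compute node. This is a minimal sketch that assumes the log line format shown earlier in this thread; the helper name `ddt_command_from_log` is made up for illustration and is not part of STAT.

```python
import ast
import shlex


def ddt_command_from_log(log_line):
    """Turn a 'fork exec DDT [...]' log line into a shell command string."""
    # Everything from the first '[' onward is the printed Python argument list.
    arg_list = ast.literal_eval(log_line[log_line.index("["):])
    # shlex.join (Python 3.8+) quotes arguments only where the shell needs it.
    return shlex.join(arg_list)


line = ("fork exec DDT ['/usr/local/apps/forge/21.1.1/bin/ddt', '--attach-mpi', "
        "'312897', '--subset', '0,2,3,6,8,42,16,17', './t3.x']")
print(ddt_command_from_log(line))
```

Running the printed command directly on the node separates problems in the ddt invocation from problems in how STAT launches it.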
from stat.
Sorry for the incorrect reference to this issue; I meant to post it to issue 44.
from stat.
I have launched the job with srun from the login node. In this case the top-level srun daemon (the job's Slurm daemon) runs on the login node and ddt detects the correct srun process to connect to (although on my system ddt fails with the error I described above).
When a job is submitted with sbatch, the job's Slurm daemon runs on the head node of the node allocation.
So I think that STAT forks on the login node instead of on the job head node, and therefore it cannot find the job's Slurm daemon.
from stat.
I see and this does not surprise me. The DDT attach-mpi option does not support the remote node case, where you are launching on a head node and pointing it to a compute node. I don't think there is an easy way for us to support this without requesting a new feature in DDT. Are you OK with only using this when you are on the compute node?
from stat.
ddt has a second way to attach via the option
--attach=[host1:]pid1,[host2:]pid2...
I have tried this from the login node and it works fine with an sbatch job.
If STAT had a class for an MPI rank that contains the node name and PID, it would be easy to build the list above.
It is not always practical to submit a job from the login node with srun. The batch scripts for production runs are quite complex, setting large sets of parameters before the parallel job starts.
from stat.
The hack below in STATGUI.py allows me to attach ddt on my system:
#arg_list.append("--attach-mpi")
#arg_list.append(str(self.proctab.launcher_pid))
#arg_list.append("--subset")
#rank_list_arg = ''
#for rank in subset_list:
#    if rank == subset_list[0]:
#        rank_list_arg += '%d' % rank
#    else:
#        rank_list_arg += ',%d' % rank
#arg_list.append(rank_list_arg)
arg_list.append("--attach")
hplist = self.proctab.process_list
attach_list_arg = ''
for rank in subset_list:
    attach_list_arg += hplist[rank][1]+':'+'%d' % hplist[rank][2]+','
arg_list.append(attach_list_arg[:-1])
Could this way of attaching be incorporated in the debugger menu?
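The same host:PID list can be built outside the GUI class, which makes it easier to test. This is a rough sketch, not STAT's actual code: the function name `build_attach_arg` is hypothetical, and the process-table layout (host name at index 1, PID at index 2) is assumed from the hack above. The hosts and PIDs in the example data are invented.

```python
def build_attach_arg(process_list, subset_list):
    """Build the value for ddt's --attach=[host1:]pid1,[host2:]pid2... option."""
    # Assumed layout (taken from the hack above): process_list[rank] is a
    # sequence whose index 1 is the host name and index 2 is the PID.
    return ",".join(
        "%s:%d" % (process_list[rank][1], process_list[rank][2])
        for rank in subset_list
    )


# Hypothetical process table; index 0 is a placeholder for whatever STAT stores there.
process_list = [
    ("./t3.x", "aa3-2035", 4001),
    ("./t3.x", "aa3-2035", 4002),
    ("./t3.x", "aa3-2036", 5001),
]
arg_list = ["--attach", build_attach_arg(process_list, [0, 2])]
print(arg_list)
```

Using `",".join(...)` avoids appending a trailing comma and slicing it off afterwards, and it returns an empty string for an empty subset.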
from stat.
I just pushed a change to the develop branch that adds a "DDT serial" button that implements your suggestion (f3827a3). It appears to work for me. Can you test it out?
from stat.
The new attachment works, but I don't understand which group of nodes it is attaching to.
In my list I have equivalence classes with 1 node. I have clicked on the middle button for that class, but ddt still attaches to a large subset of nodes.
How is this meant to work?
from stat.
By default, STAT will choose a single representative ("Rep", the lowest MPI rank) for every equivalence class listed in the Equivalence Classes window. If you don't want a representative for a given equivalence class, please click on the "None" radio button in the EQ Classes window. The "All" button should select all ranks listed on that line. Could you perhaps send screenshots of your EQ class window before clicking the subset attach, and then send the ddt subset attach command that STAT issues (it should be printed to the command line)?
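The selection rule described above can be sketched as follows. This is an illustration only and not STAT's implementation; the function name `select_subset` and the data layout (one list of ranks per equivalence class, with a parallel list of "Rep", "All", or "None" choices) are assumptions.

```python
def select_subset(eq_classes, choices):
    """Pick the ranks to attach to from each equivalence class.

    eq_classes: list of rank lists, one per equivalence class.
    choices: parallel list of "Rep", "All", or "None".
    """
    subset = []
    for ranks, choice in zip(eq_classes, choices):
        if choice == "Rep":
            subset.append(min(ranks))      # lowest MPI rank represents the class
        elif choice == "All":
            subset.extend(sorted(ranks))   # every rank on that line
        elif choice != "None":
            raise ValueError("unknown choice: %r" % choice)
    return subset


eq_classes = [[0, 1, 4], [2], [3, 5, 6, 7]]
print(select_subset(eq_classes, ["Rep", "Rep", "Rep"]))   # the default
print(select_subset(eq_classes, ["None", "Rep", "None"]))
```

With the default of "Rep" everywhere, every class contributes one rank, so changing the button on a single class still leaves the other classes' representatives in the attach list. To attach only to one class, the others have to be set to "None".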
from stat.
OK, I got you. I was confused about how the equivalence classes are used by ddt. It makes sense now! Thanks.
I think that "DDT serial" is not the best name because this is used for parallel apps and it might confuse some users. But I don't have a better idea. Essentially, the second attach method goes directly to the app processes. I don't see a short name for this.
from stat.
I just pushed another change to the STAT develop branch. I renamed both buttons, so now we have "DDT bulk attach" and "DDT host:PID attach". It's a bit wordy for buttons, but hopefully clarifies the difference. I also updated the documentation about this. Let me know if this looks good to you.
from stat.
It looks good. Thanks! Closing this issue here.
from stat.