
Comments (14)

lee218llnl avatar lee218llnl commented on September 28, 2024

FWIW, I just tested this with forge 22.0.1 and it worked OK for me. The PID after the --attach-mpi arg should be the same PID that you are attaching STAT to and this assumes that the PID is present on the node where STAT and DDT are being launched. Can you confirm with ps x that the PID it is attempting to attach to is indeed the appropriate srun PID?

from stat.

antonl321 avatar antonl321 commented on September 28, 2024

The PID is correct. I use stat-gui --attach; please see below:

++ ssh aa3-2035 pgrep -o srun
+ /lus/h2resw01/hpcperm/atosla/Tools/spack/opt/spack/linux-rhel8-zen/gcc-8.4.1/stat-develop-lznpndx6cjf5opqo2wwv6hpsqf6jahzx/bin/stat-gui --attach aa3-2035:312897
<Jun 28 17:35:23> <Launchmon> (INFO): Just continued the RM process out of the first trap
fork exec DDT ['/usr/local/apps/forge/21.1.1/bin/ddt', '--attach-mpi', '312897', '--subset', '0,2,3,6,8,42,16,17', './t3.x']
The specified process was not found.

I have also checked the PID on the compute node with ps.

If I try to attach ddt directly to srun on the compute node, I get the following:

$ ddt --attach-mpi=`pgrep -o srun ` ~/old-home/Apps/stat-tests/t3.x
The specified process was found, but no MPI process details were
 contained in this process.
Is this an mpirun process, or is the mpirun binary stripped?

My forge version is older: 21.1.1.


lee218llnl avatar lee218llnl commented on September 28, 2024

For the example you gave you attached to "aa3-2035:312897" and I'm guessing you ran that from a login node or launch node. Are you able to ssh to aa3-2035 and launch the STAT GUI there? Once on aa3-2035, you should be able to just attach to the PID, 312897, in your case. Either way, I would then suggest trying to copy the exact ddt command line and all its args and run it on aa3-2035 as a sanity check. Sometimes there are multiple srun processes and you need to be sure to attach to the appropriate one. The forge version should be fine as they have supported this for many years now.


lee218llnl avatar lee218llnl commented on September 28, 2024

Sorry for the incorrect reference to this issue; I meant to post it to issue 44.


antonl321 avatar antonl321 commented on September 28, 2024

I launched the job with srun from the login node. In that case the top-level srun daemon (the job's Slurm daemon) runs on the login node, and ddt detects the correct srun process to connect to (although on my system ddt fails with the error I described above).

When a job is submitted with sbatch, the job's Slurm daemon runs on the head node of the node allocation.
So I think STAT performs the fork on the login node instead of on the job's head node, and therefore it cannot find the job's Slurm daemon.


lee218llnl avatar lee218llnl commented on September 28, 2024

I see and this does not surprise me. The DDT attach-mpi option does not support the remote node case, where you are launching on a head node and pointing it to a compute node. I don't think there is an easy way for us to support this without requesting a new feature in DDT. Are you OK with only using this when you are on the compute node?


antonl321 avatar antonl321 commented on September 28, 2024

ddt has a second way to attach via the option
--attach=[host1:]pid1,[host2:]pid2...

I have tried this from the login node and it works fine with an sbatch job.
If STAT had a class for an MPI rank that contains the node name and PID, it would be easy to build the list above.

It is not always practical to submit a job from the login node with srun. The batch scripts for production runs are quite complex and set large numbers of parameters before the parallel job starts.


antonl321 avatar antonl321 commented on September 28, 2024

The hack below in STATGUI.py allows ddt to attach on my system:

            #arg_list.append("--attach-mpi")
            #arg_list.append(str(self.proctab.launcher_pid))
            #arg_list.append("--subset")
            #rank_list_arg = ''
            #for rank in subset_list:
            #    if rank == subset_list[0]:
            #        rank_list_arg += '%d' % rank
            #    else:
            #        rank_list_arg += ',%d' % rank
            #arg_list.append(rank_list_arg)

            arg_list.append("--attach")
            hplist = self.proctab.process_list
            attach_list_arg = ''
            for rank in subset_list:
                attach_list_arg += hplist[rank][1]+':'+'%d' % hplist[rank][2]+','
            arg_list.append(attach_list_arg[:-1])

Could this way of attaching be incorporated into the debugger menu?
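
For readers who want to try the idea outside of STATGUI.py, here is a minimal standalone sketch of the same host:PID list construction. It is not STAT's code: the function name build_attach_arg and the sample process table are hypothetical, and the only thing taken from the fragment above is the assumption that each per-rank entry holds the host name at index 1 and the PID at index 2.

```python
def build_attach_arg(process_list, subset_list):
    """Return a value for ddt's --attach option, e.g. 'node-a:4001,node-b:4003'.

    process_list is assumed to be indexable by MPI rank, with the host name
    at index 1 and the PID at index 2 (as in the STATGUI.py fragment above).
    """
    return ','.join(
        '%s:%d' % (process_list[rank][1], process_list[rank][2])
        for rank in subset_list
    )


# Hypothetical process table: index 0 is a placeholder, then host, then PID.
process_list = [
    ('./t3.x', 'node-a', 4001),
    ('./t3.x', 'node-a', 4002),
    ('./t3.x', 'node-b', 4003),
]

# Build the ddt argument list for ranks 0 and 2 only.
arg_list = ['ddt', '--attach', build_attach_arg(process_list, [0, 2])]
print(arg_list)
```

Using join instead of appending a trailing comma and slicing it off avoids the special case for an empty subset, where the result is simply an empty string.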


lee218llnl avatar lee218llnl commented on September 28, 2024

I just pushed a change to the develop branch that adds a "DDT serial" button that implements your suggestion (f3827a3). It appears to work for me. Can you test it out?


antonl321 avatar antonl321 commented on September 28, 2024

The new attach method works, but I don't understand which group of nodes it attaches to.
In my list I have equivalence classes with 1 node. I clicked on the middle button for that class, but ddt still attaches to a large subset of nodes.
How is this meant to work?


lee218llnl avatar lee218llnl commented on September 28, 2024

By default, STAT will choose a single representative (Rep, the lowest MPI rank) for every equivalence class listed in the Equivalence Classes window. If you don't want a representative for a given equivalence class, please click on the "None" radio button in the EQ Classes window. The "All" button should select all ranks listed on that line. Could you perhaps send screenshots of your EQ class window before clicking the subset attach, and then send the ddt subset attach command that STAT issues (it should be printed to the command line)?
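
To illustrate the selection rule described above, here is a small sketch of how the attach subset could be derived from the per-class choices. This is only an illustration, not STAT's implementation: the function name select_ranks and the 'rep' / 'all' / 'none' strings are hypothetical stand-ins for the three radio buttons.

```python
def select_ranks(eq_classes, choices):
    """Return the sorted list of MPI ranks to attach to.

    eq_classes: one list of MPI ranks per equivalence class.
    choices:    one of 'rep', 'all' or 'none' per class (hypothetical names
                standing in for the radio buttons in the EQ Classes window).
    """
    selected = set()
    for ranks, choice in zip(eq_classes, choices):
        if choice == 'rep':
            selected.add(min(ranks))   # representative = lowest MPI rank
        elif choice == 'all':
            selected.update(ranks)     # every rank in this class
        elif choice != 'none':
            raise ValueError('unknown choice: %r' % choice)
    return sorted(selected)


eq_classes = [[0, 1, 4, 5], [2, 3], [6]]

# Default behaviour: one representative per class.
print(select_ranks(eq_classes, ['rep', 'rep', 'rep']))    # [0, 2, 6]

# Mixed choices: Rep for the first class, All for the second, None for the third.
print(select_ranks(eq_classes, ['rep', 'all', 'none']))   # [0, 2, 3]
```

This also shows why a single-rank class left on "Rep" does not narrow the attach: every other class still contributes its own representative unless it is set to "None".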


antonl321 avatar antonl321 commented on September 28, 2024

OK, I understand now. I was confused about how the equivalence classes are used by ddt. It makes sense now! Thanks.

I think that "DDT serial" is not the best name because this is used for parallel apps and it might confuse some users. But I don't have a better idea. Essentially, the second attach method goes directly to the app processes. I don't see a short name for this.


lee218llnl avatar lee218llnl commented on September 28, 2024

I just pushed another change to the STAT develop branch. I renamed both buttons, so now we have "DDT bulk attach" and "DDT host:PID attach". It's a bit wordy for buttons, but hopefully clarifies the difference. I also updated the documentation about this. Let me know if this looks good to you.


antonl321 avatar antonl321 commented on September 28, 2024

It looks good. Thanks! Closing this issue here.
