
Comments (6)

gadorlhiac commented on July 17, 2024

On further thought, this may come down to user error or unhandled errors.

The second example runs without the call to indexamajig, possibly because the first call to indexamajig creates a stream file. If that file is present, indexamajig fails when called again because it refuses to overwrite the stream file. This may cause the job to finish without running the second line.

While testing, I ran the job many times and may have deleted the stream file between some attempts but not others.
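
If the stale stream file is indeed the cause, a small guard before the indexing step would turn the silent failure into an explicit one. The following is only a sketch under that assumption: the helper name `prepare_stream_path` and its `overwrite` flag are hypothetical and not part of btx.

```python
from pathlib import Path


def prepare_stream_path(stream_path, overwrite=False):
    """Check for a leftover stream file before the indexing step runs.

    Hypothetical helper: raises if a stream file from an earlier attempt
    exists, unless overwrite=True, in which case the file is deleted.
    """
    path = Path(stream_path)
    if path.exists():
        if not overwrite:
            # Fail loudly instead of letting the job end without indexing.
            raise FileExistsError(f"Stale stream file found: {path}")
        path.unlink()
    return path
```

Calling this just before the indexing command is written to the batch script would make repeated test runs behave the same way regardless of whether the previous output was cleaned up by hand.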

from btx.

gadorlhiac commented on July 17, 2024

Note: the same issue occurs even when mpirun is the first or only call in the secondary batch script.

For instance, run_analysis silently does nothing, since it uses two cores and MPI.


fredericpoitevin commented on July 17, 2024

Note - the following works:

(base) [fpoitevi@sdfiana001 launchpad]$ cat test1.sh
#!/bin/bash
#SBATCH -p milano
#SBATCH --job-name=multilevel
#SBATCH --output=./ml.out
#SBATCH --error=./ml.err
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --exclusive
#SBATCH -A lcls

python ./test1.py
(base) [fpoitevi@sdfiana001 launchpad]$ cat test1.py 
import os
os.system('sbatch test2.sh')
(base) [fpoitevi@sdfiana001 launchpad]$ cat test2.sh
#!/bin/bash
#SBATCH -p milano
#SBATCH --job-name=ml2
#SBATCH --output=./ml2.out
#SBATCH --error=./ml2.err
#SBATCH --ntasks=64
#SBATCH --time=2:00:00
#SBATCH --exclusive

#SBATCH -A lcls

export PATH=/sdf/group/lcls/ds/tools/crystfel/0.10.2/bin:$PATH
export PATH=/sdf/group/lcls/ds/tools/:$PATH

export SIT_PSDM_DATA=/sdf/data/lcls/ds/

/sdf/group/lcls/ds/ana/sw/conda1/inst/envs/ana-4.0.47-py3/bin/mpirun -n 64 python test2.py
(base) [fpoitevi@sdfiana001 launchpad]$ cat test2.py
from mpi4py import MPI

def main():
    print("testing testing...")
    comm = MPI.COMM_WORLD
    name=MPI.Get_processor_name()
    print(f"name: {name}, my rank is {comm.rank}")

if __name__ == '__main__':
    main()


fredericpoitevin commented on July 17, 2024

Changing the command in Indexer to a similar simple test fails:

        if not dont_report:
            #command +=f"\npython {self.script_path} -e {self.exp} -r {self.run} -d {self.det_type} --taskdir {self.taskdir} --report --tag {self.tag} "
            # debugging
            command =f"\npython {self.script_path} -e {self.exp} -r {self.run} -d {self.det_type} --taskdir {self.taskdir} --report --tag {self.tag} "
            if ( self.tag_cxi != '' ): command += f' --tag_cxi {self.tag_cxi}'
            command += "\n"
            # debugging
            command =f"\npython /sdf/data/lcls/ds/mfx/mfxp23120/scratch/fpoitevi/launchpad/test2.py"
        if addl_command is not None:
            command += f"\n{addl_command}"

        js = JobScheduler(self.tmp_exe, ncores=self.ncores, jobname=f'idx_r{self.run:04}', queue=self.queue, time=self.time)
        js.write_header()
        js.write_main(command, dependencies=['crystfel'] + self.methods.split(','))
        # debugging
        #js.clean_up()
        js.submit()
        logger.info(f"Indexing executable submitted: {self.tmp_exe}")

Here is the Slurm script written and submitted by iScheduler:

(base) [fpoitevi@sdfiana001 launchpad]$ cat /sdf/home/f/fpoitevi/.btx//task_0e02d8b8-1bfe-4bff-81a7-71117e00eb1f.sh
#!/bin/bash
#SBATCH -p milano
#SBATCH --job-name=idx_r0131
#SBATCH --output=./idx_r0131.out
#SBATCH --error=./idx_r0131.err
#SBATCH --ntasks=64
#SBATCH --time=2:00:00
#SBATCH --exclusive

#SBATCH -A lcls

export PATH=/sdf/group/lcls/ds/tools/crystfel/0.10.2/bin:$PATH
export PATH=/sdf/group/lcls/ds/tools/:$PATH

export SIT_PSDM_DATA=/sdf/data/lcls/ds/

/sdf/group/lcls/ds/ana/sw/conda1/inst/envs/ana-4.0.47-py3/bin/mpirun -n 64 /sdf/group/lcls/ds/ana/sw/conda1/inst/envs/ana-4.0.47-py3/bin/python /sdf/data/lcls/ds/mfx/mfxp23120/scratch/fpoitevi/launchpad/test2.py
(base) [fpoitevi@sdfiana001 launchpad]$ sacct -j 18567556
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
18567556      idx_r0131     milano       lcls        125     FAILED      1:0 
18567556.ba+      batch                  lcls        125     FAILED      1:0 
18567556.ex+     extern                  lcls        125  COMPLETED      0:0 
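
Since the job fails without any message in the job output, it can help to check step states programmatically after submission. The sketch below parses text in the pipe-delimited form produced by `sacct -j <jobid> --format=JobID,State,ExitCode --parsable2 --noheader`. The function name `non_completed_steps` is hypothetical, and `sample` is a hand-written rendering of the table above in that format, not captured output.

```python
def non_completed_steps(sacct_text):
    """Return (job_id, state, exit_code) for each step not in state COMPLETED.

    Expects one 'JobID|State|ExitCode' record per line. Note that steps
    still PENDING or RUNNING are reported too, not only failed ones.
    """
    steps = []
    for line in sacct_text.strip().splitlines():
        fields = line.split("|")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        job_id, state, exit_code = fields[0], fields[1], fields[2]
        if state != "COMPLETED":
            steps.append((job_id, state, exit_code))
    return steps


# Hand-written sample mirroring the sacct table above (assumption).
sample = """18567556|FAILED|1:0
18567556.batch|FAILED|1:0
18567556.extern|COMPLETED|0:0"""

for job_id, state, exit_code in non_completed_steps(sample):
    print(f"{job_id}: {state} (exit code {exit_code})")
```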


fredericpoitevin commented on July 17, 2024

Forcing submission on one core (and thus not using mpirun) works, as already tested above.
Commenting out the following lines in ischeduler:

        for ppath in possible_paths:
            if os.path.exists(ppath):
                pythonpath = ppath
                #if self.ncores > 1:
                #    pythonpath = f"{os.path.split(ppath)[0]}/mpirun -n {self.ncores} {ppath}"

yields this submission script:

#!/bin/bash
#SBATCH -p milano
#SBATCH --job-name=idx_r0131
#SBATCH --output=./idx_r0131.out
#SBATCH --error=./idx_r0131.err
#SBATCH --ntasks=64
#SBATCH --time=2:00:00
#SBATCH --exclusive

#SBATCH -A lcls

export PATH=/sdf/group/lcls/ds/tools/crystfel/0.10.2/bin:$PATH
export PATH=/sdf/group/lcls/ds/tools/:$PATH

export SIT_PSDM_DATA=/sdf/data/lcls/ds/

/sdf/group/lcls/ds/ana/sw/conda1/inst/envs/ana-4.0.47-py3/bin/python /sdf/data/lcls/ds/mfx/mfxp23120/scratch/fpoitevi/launchpad/test2.py

And the output:

(base) [fpoitevi@sdfiana001 launchpad]$ cat idx_r0131.out 
testing testing...
name: sdfmilan017, my rank is 0


gadorlhiac commented on July 17, 2024

It seems that this problem can be resolved through conditional import of mpi4py in both main and indexer. This can be accomplished by:

  • Passing an additional parameter to main via a command-line argument used in elog_submit (-n $CORES).
  • Passing an additional parameter to indexer during object initialization (mpi_init).

(Refer to linked PR #325 for the relevant changes)

For multi-step jobs, MPI should not be initialized during the first job submission, but it is needed for the second round. While ungainly, this method may serve as an immediate stop-gap.
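
The conditional import described above could look roughly like the following. This is a minimal sketch, not the code from the linked PR: the `-n` flag follows the comment above, while the long option `--ncores` and the helper names `parse_args`, `get_comm`, and `main` are assumptions made for illustration. It relies on the fact that importing `mpi4py.MPI` initializes MPI by default, so deferring the import defers initialization.

```python
import argparse


def parse_args(argv=None):
    """Parse the core count passed on the command line (e.g. -n $CORES)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-n", "--ncores", type=int, default=1)
    return parser.parse_args(argv)


def get_comm(ncores):
    """Import mpi4py only when more than one core is requested.

    Returns MPI.COMM_WORLD for multi-core runs, or None for single-core
    runs, in which case MPI is never imported or initialized.
    """
    if ncores > 1:
        from mpi4py import MPI  # importing this module initializes MPI
        return MPI.COMM_WORLD
    return None


def main(argv=None):
    args = parse_args(argv)
    comm = get_comm(args.ncores)
    rank = comm.Get_rank() if comm is not None else 0
    print(f"ncores: {args.ncores}, rank: {rank}")
    return rank
```

With this layout, the first job of a multi-step submission would call `main(["-n", "1"])` and never touch MPI, while the second job, launched under mpirun, would pass the real core count.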
