Comments (6)
Upon further thought, this may be related to user error / unhandled errors. The second example runs without the call to indexamajig, which may be because after the first call to indexamajig a stream file is created. If that file is present, indexamajig will fail when called again, because it refuses to overwrite the existing stream file, and that failure may cause the job to finish without running the second line. When testing I was running the job many times, and may have remembered to delete the stream file between some attempts and not others.
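One way to guard against this failure mode is to remove the stale stream file before re-invoking indexamajig. A minimal sketch, assuming a hypothetical stream filename (the real name depends on the run/tag):

```python
import os

# Hypothetical stream filename; the actual name depends on the run/tag.
stream = "r0131.stream"

# indexamajig refuses to overwrite an existing stream file, so delete any
# stale copy left over from a previous attempt before calling it again.
if os.path.exists(stream):
    os.remove(stream)

# ...then invoke indexamajig with `-o r0131.stream` as before.
```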
from btx.
Note: the same issue occurs even when mpirun is the first/only call in the secondary batch script. For instance, run_analysis will silently do nothing, since it makes use of two cores and MPI.
from btx.
Note: the following works:
(base) [fpoitevi@sdfiana001 launchpad]$ cat test1.sh
#!/bin/bash
#SBATCH -p milano
#SBATCH --job-name=multilevel
#SBATCH --output=./ml.out
#SBATCH --error=./ml.err
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --exclusive
#SBATCH -A lcls
python ./test1.py
(base) [fpoitevi@sdfiana001 launchpad]$ cat test1.py
import os
os.system('sbatch test2.sh')
(base) [fpoitevi@sdfiana001 launchpad]$ cat test2.sh
#!/bin/bash
#SBATCH -p milano
#SBATCH --job-name=ml2
#SBATCH --output=./ml2.out
#SBATCH --error=./ml2.err
#SBATCH --ntasks=64
#SBATCH --time=2:00:00
#SBATCH --exclusive
#SBATCH -A lcls
export PATH=/sdf/group/lcls/ds/tools/crystfel/0.10.2/bin:$PATH
export PATH=/sdf/group/lcls/ds/tools/:$PATH
export SIT_PSDM_DATA=/sdf/data/lcls/ds/
/sdf/group/lcls/ds/ana/sw/conda1/inst/envs/ana-4.0.47-py3/bin/mpirun -n 64 python test2.py
(base) [fpoitevi@sdfiana001 launchpad]$ cat test2.py
from mpi4py import MPI

def main():
    print("testing testing...")
    comm = MPI.COMM_WORLD
    name = MPI.Get_processor_name()
    print(f"name: {name}, my rank is {comm.rank}")

if __name__ == '__main__':
    main()
from btx.
Changing the command in Indexer to a similar simple test fails:
if not dont_report:
    #command +=f"\npython {self.script_path} -e {self.exp} -r {self.run} -d {self.det_type} --taskdir {self.taskdir} --report --tag {self.tag} "
    # debugging
    command =f"\npython {self.script_path} -e {self.exp} -r {self.run} -d {self.det_type} --taskdir {self.taskdir} --report --tag {self.tag} "
    if ( self.tag_cxi != '' ): command += f' --tag_cxi {self.tag_cxi}'
    command += "\n"
# debugging
command =f"\npython /sdf/data/lcls/ds/mfx/mfxp23120/scratch/fpoitevi/launchpad/test2.py"
if addl_command is not None:
    command += f"\n{addl_command}"
js = JobScheduler(self.tmp_exe, ncores=self.ncores, jobname=f'idx_r{self.run:04}', queue=self.queue, time=self.time)
js.write_header()
js.write_main(command, dependencies=['crystfel'] + self.methods.split(','))
# debugging
#js.clean_up()
js.submit()
logger.info(f"Indexing executable submitted: {self.tmp_exe}")
Here is the slurm script written and submitted by JobScheduler:
(base) [fpoitevi@sdfiana001 launchpad]$ cat /sdf/home/f/fpoitevi/.btx//task_0e02d8b8-1bfe-4bff-81a7-71117e00eb1f.sh
#!/bin/bash
#SBATCH -p milano
#SBATCH --job-name=idx_r0131
#SBATCH --output=./idx_r0131.out
#SBATCH --error=./idx_r0131.err
#SBATCH --ntasks=64
#SBATCH --time=2:00:00
#SBATCH --exclusive
#SBATCH -A lcls
export PATH=/sdf/group/lcls/ds/tools/crystfel/0.10.2/bin:$PATH
export PATH=/sdf/group/lcls/ds/tools/:$PATH
export SIT_PSDM_DATA=/sdf/data/lcls/ds/
/sdf/group/lcls/ds/ana/sw/conda1/inst/envs/ana-4.0.47-py3/bin/mpirun -n 64 /sdf/group/lcls/ds/ana/sw/conda1/inst/envs/ana-4.0.47-py3/bin/python /sdf/data/lcls/ds/mfx/mfxp23120/scratch/fpoitevi/launchpad/test2.py
(base) [fpoitevi@sdfiana001 launchpad]$ sacct -j 18567556
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
18567556 idx_r0131 milano lcls 125 FAILED 1:0
18567556.ba+ batch lcls 125 FAILED 1:0
18567556.ex+ extern lcls 125 COMPLETED 0:0
from btx.
Forcing submission on one core (and thus not using mpirun) works, as already tested above. Commenting out these lines in ischeduler:
for ppath in possible_paths:
    if os.path.exists(ppath):
        pythonpath = ppath
        #if self.ncores > 1:
        #    pythonpath = f"{os.path.split(ppath)[0]}/mpirun -n {self.ncores} {ppath}"
yields this submission script:
#!/bin/bash
#SBATCH -p milano
#SBATCH --job-name=idx_r0131
#SBATCH --output=./idx_r0131.out
#SBATCH --error=./idx_r0131.err
#SBATCH --ntasks=64
#SBATCH --time=2:00:00
#SBATCH --exclusive
#SBATCH -A lcls
export PATH=/sdf/group/lcls/ds/tools/crystfel/0.10.2/bin:$PATH
export PATH=/sdf/group/lcls/ds/tools/:$PATH
export SIT_PSDM_DATA=/sdf/data/lcls/ds/
/sdf/group/lcls/ds/ana/sw/conda1/inst/envs/ana-4.0.47-py3/bin/python /sdf/data/lcls/ds/mfx/mfxp23120/scratch/fpoitevi/launchpad/test2.py
And the output:
(base) [fpoitevi@sdfiana001 launchpad]$ cat idx_r0131.out
testing testing...
name: sdfmilan017, my rank is 0
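For reference, the logic commented out in ischeduler above can be illustrated in isolation. This sketch shows how the interpreter path gets wrapped with an mpirun from the same bin directory when more than one core is requested (the helper name and paths are illustrative, not btx's actual API):

```python
import os

def build_launcher(pythonpath, ncores):
    # Mirrors the commented-out ischeduler branch: prefix the interpreter
    # with an mpirun from the same bin directory when ncores > 1.
    if ncores > 1:
        return f"{os.path.split(pythonpath)[0]}/mpirun -n {ncores} {pythonpath}"
    return pythonpath

print(build_launcher("/opt/env/bin/python", 1))   # /opt/env/bin/python
print(build_launcher("/opt/env/bin/python", 64))  # /opt/env/bin/mpirun -n 64 /opt/env/bin/python
```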
from btx.
It seems that this problem can be resolved through a conditional import of mpi4py in both main and indexer. This can be accomplished by:
- passing an additional parameter to main via a command-line argument used in elog_submit (-n $CORES);
- passing an additional parameter to indexer during object initialization (mpi_init).
(Refer to linked PR #325 for the relevant changes.)
For the multi-step jobs, MPI should not be initialized during the first job submission but is needed for the second round, so, while ungainly, this method might serve as an immediate stop-gap.
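A minimal sketch of what such a conditional import could look like (the function and flag names are illustrative, not the actual PR #325 code):

```python
def get_rank(mpi_init=False):
    # Only touch mpi4py (and thus MPI_Init) when the caller asks for it;
    # a single-core first-stage job takes the plain path and acts as rank 0.
    if mpi_init:
        from mpi4py import MPI
        return MPI.COMM_WORLD.rank
    return 0

print(get_rank())  # single-core path: 0, without ever importing mpi4py
```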
from btx.