Comments (5)
Following the conversation in #20 (see for example #20 (comment)), I have some concerns about the SLURM script used for submission.
Googling around, I believe we should at least use the following options:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<dask_processes * dask_threads>
#SBATCH --mem=<dask_memory * dask_processes>
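To make the arithmetic behind these options concrete, here is a minimal sketch that builds the three header lines from the worker settings. The function name `slurm_header` and its parameters are hypothetical and are not part of dask-jobqueue; it also assumes memory is given as whole gigabytes per process and that the `G` suffix is acceptable for `--mem` on the target cluster.

```python
def slurm_header(processes: int, threads: int, memory_gb_per_process: int) -> str:
    """Build #SBATCH lines for a single task hosting all dask worker processes.

    Hypothetical helper for illustration only, not the dask-jobqueue API.
    """
    if processes < 1 or threads < 1 or memory_gb_per_process < 1:
        raise ValueError("processes, threads and memory must all be positive")

    # One Slurm task, so that all CPUs are reserved on the same node.
    cpus = processes * threads
    # Total memory for the job: per-process memory times number of processes.
    mem_gb = processes * memory_gb_per_process

    lines = [
        "#SBATCH --ntasks=1",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
    ]
    return "\n".join(lines)


# Example: 4 worker processes, 2 threads each, 8 GB per process.
print(slurm_header(processes=4, threads=2, memory_gb_per_process=8))
```

With these example values the job asks for one task with 8 CPUs and 32 GB in total, which is what the placeholders above are meant to express.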
Another solution may be to play along with Slurm's ntasks= option, and tweak the dask-worker command line to only use nprocs=1.
Finally, I have some concerns about using sbatch vs. srun.
Do we have some Slurm users that could answer that?
cc @bw4sz, @leej3, @bocklund, @luizirber.
from dask-jobqueue.
Also pinging @davidedelvento, @kmpaul, @jedwards4b as potentially interested users/developers.
On our system at UF, sbatch is preferred. srun is really meant for interactive nodes and short periods of time; I believe all jobs are wiped after 24 hours. Beyond that, we have not experienced any difference in environment between the two.
For resource allocation, here are some useful links:
- https://slurm.schedmd.com/cpu_management.html#Example4 shows that when only ntasks is used, the reservation may span across several nodes, as pointed out by @bw4sz.
- https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html#shared-memory-example-openmp shows an example for OpenMP, which is quite close to what we want to achieve here, even if we have several processes.
- See also https://slurm.schedmd.com/cpu_management.html#Example2 for how allocations are documented to work.
RE: Interactive vs. Batch Jobs
I would think that you would want to have an option to choose either one. The reason is that interactive jobs can be good for a Distributed Adaptive Cluster (i.e. dynamic scaling based on demand), whereas batch jobs would be good when not using a Distributed Adaptive Cluster.