Comments (11)
I believe that cluster.workers is only a convention at this point. It's managed by the cluster object and so represents workers that have been asked for, but not workers that have connected. For the latter you would want the following:
cluster.scheduler.workers
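For illustration, a minimal sketch of the difference (assuming a SLURM site; the cores/memory values are placeholders):

from dask_jobqueue import SLURMCluster

# Placeholder resources; adjust for your site.
cluster = SLURMCluster(cores=1, processes=1, memory="2GB")
cluster.scale(4)

requested = len(cluster.workers)            # jobs the cluster has asked for
connected = len(cluster.scheduler.workers)  # workers that have actually connected
print(f"{connected} of {requested} requested workers are connected")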
I think this would be useful when using an autoscaling cluster, particularly when jobs are waiting in the queue and the adaptive cluster is trying to decide whether it should scale up or down.
Is the following approach a dependable way of querying the cluster for the number of workers?
from time import sleep

min_workers = 10  # for example
while len(cluster.workers) < min_workers:
    sleep(1)
The next step would be to associate these workers with the job ids from pbs/slurm/etc.
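An aside on the polling loop above: distributed's Client.wait_for_workers blocks until a given number of workers have actually connected, which may be more dependable than polling cluster.workers (a sketch, assuming a jobqueue cluster object like the one discussed here):

from dask.distributed import Client

client = Client(cluster)     # attach a client to an existing jobqueue cluster
client.wait_for_workers(10)  # blocks until at least 10 workers have connected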
Having cluster.workers be a dictionary mapping job id to a status in {'pending', 'running', 'finished'} might be nice. For some operations we'll want a mapping between job-id and address (like tcp://...) so that we know which job id corresponds to which worker. I suspect that the cleanest way to do this is to send the job-id through something like an environment variable or the --name keyword (we may already do this sometimes?)
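As a sketch of the address half of that mapping: the scheduler already tracks each connected worker's name, so if the worker name were set to the batch job id (e.g. via the --name keyword), the mapping could be read back like this (illustrative only, not the library's current behavior):

# Illustrative: assumes each worker's --name was set to its batch job id.
# cluster.scheduler.workers maps worker address -> WorkerState.
job_id_to_address = {
    ws.name: address for address, ws in cluster.scheduler.workers.items()
}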
I am using dask-jobqueue today for the first time. I find myself wishing I could check the status of the worker jobs from the notebook (rather than flipping back to the shell and running qstat). So I agree this would be a great feature.
@rabernat - most of the hard work for this functionality is being done in #63. Stay tuned.
This appears to have been addressed:
In [15]: import dask_jobqueue
In [16]: cluster = dask_jobqueue.SLURMCluster(cores=1, processes=1)
In [17]: cluster.worker_spec
Out[17]: {}
In [18]: cluster.workers
Out[18]: {}
In [19]: cluster.scale(2)
In [20]: cluster.workers
Out[20]: {0: <SLURMJob: status=running>, 1: <SLURMJob: status=running>}
However, it appears that the reported information can be misleading sometimes. For instance, I am using a system which restricts the number of concurrent jobs to 35. When I submit 40 jobs, dask_jobqueue reports that all 40 workers are in running state even when some of the jobs are in pending state according to squeue.

dask-jobqueue reports all 40 workers to be in running state:
In [22]: cluster.scale(40)
In [23]: cluster.workers
Out[23]:
{0: <SLURMJob: status=running>,
1: <SLURMJob: status=running>,
2: <SLURMJob: status=running>,
3: <SLURMJob: status=running>,
4: <SLURMJob: status=running>,
5: <SLURMJob: status=running>,
6: <SLURMJob: status=running>,
7: <SLURMJob: status=running>,
8: <SLURMJob: status=running>,
9: <SLURMJob: status=running>,
10: <SLURMJob: status=running>,
11: <SLURMJob: status=running>,
12: <SLURMJob: status=running>,
13: <SLURMJob: status=running>,
14: <SLURMJob: status=running>,
15: <SLURMJob: status=running>,
16: <SLURMJob: status=running>,
17: <SLURMJob: status=running>,
18: <SLURMJob: status=running>,
19: <SLURMJob: status=running>,
20: <SLURMJob: status=running>,
21: <SLURMJob: status=running>,
22: <SLURMJob: status=running>,
23: <SLURMJob: status=running>,
24: <SLURMJob: status=running>,
25: <SLURMJob: status=running>,
26: <SLURMJob: status=running>,
27: <SLURMJob: status=running>,
28: <SLURMJob: status=running>,
29: <SLURMJob: status=running>,
30: <SLURMJob: status=running>,
31: <SLURMJob: status=running>,
32: <SLURMJob: status=running>,
33: <SLURMJob: status=running>,
34: <SLURMJob: status=running>,
35: <SLURMJob: status=running>,
36: <SLURMJob: status=running>,
37: <SLURMJob: status=running>,
38: <SLURMJob: status=running>,
39: <SLURMJob: status=running>}
squeue shows some of the jobs to be in pending state:
In [24]: !squeue -u abanihi
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5543014 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543015 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543016 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543017 dav dask-wor abanihi PD 0:00 1 (QOSMaxJobsPerUserLimit)
5543002 dav dask-wor abanihi R 0:31 1 casper17
5543003 dav dask-wor abanihi R 0:31 1 casper17
5543004 dav dask-wor abanihi R 0:31 1 casper17
5543005 dav dask-wor abanihi R 0:31 1 casper17
5543006 dav dask-wor abanihi R 0:31 1 casper17
5543007 dav dask-wor abanihi R 0:31 1 casper17
5543008 dav dask-wor abanihi R 0:31 1 casper17
5543009 dav dask-wor abanihi R 0:31 1 casper17
5543010 dav dask-wor abanihi R 0:31 1 casper17
5543011 dav dask-wor abanihi R 0:31 1 casper17
5543012 dav dask-wor abanihi R 0:31 1 casper17
5543013 dav dask-wor abanihi R 0:31 1 casper17
5542985 dav dask-wor abanihi R 0:34 1 casper12
5542986 dav dask-wor abanihi R 0:34 1 casper15
5542987 dav dask-wor abanihi R 0:34 1 casper15
5542988 dav dask-wor abanihi R 0:34 1 casper15
5542989 dav dask-wor abanihi R 0:34 1 casper22
5542990 dav dask-wor abanihi R 0:34 1 casper22
5542991 dav dask-wor abanihi R 0:34 1 casper22
5542992 dav dask-wor abanihi R 0:34 1 casper22
5542993 dav dask-wor abanihi R 0:34 1 casper22
5542994 dav dask-wor abanihi R 0:34 1 casper22
5542995 dav dask-wor abanihi R 0:34 1 casper22
5542996 dav dask-wor abanihi R 0:34 1 casper22
5542997 dav dask-wor abanihi R 0:34 1 casper22
5542998 dav dask-wor abanihi R 0:34 1 casper22
5542999 dav dask-wor abanihi R 0:34 1 casper22
5543000 dav dask-wor abanihi R 0:34 1 casper17
5543001 dav dask-wor abanihi R 0:34 1 casper17
5542980 dav dask-wor abanihi R 0:35 1 casper10
5542981 dav dask-wor abanihi R 0:35 1 casper19
5542982 dav dask-wor abanihi R 0:35 1 casper19
5542983 dav dask-wor abanihi R 0:35 1 casper12
5542984 dav dask-wor abanihi R 0:35 1 casper12
5542978 dav dask-wor abanihi R 1:08 1 casper10
5542979 dav dask-wor abanihi R 1:08 1 casper10
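For ground truth, one can ask the batch system directly and compare with what cluster.workers reports. A rough sketch (assumes SLURM; slurm_job_states is a hypothetical helper, and squeue's -h/-o flags suppress the header and select job id and state):

import subprocess

def slurm_job_states(user):
    """Return {job_id: state_code} for a user, e.g. {'5543014': 'PD'}."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i %t"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split() for line in out.splitlines() if line.strip())

pending = [j for j, st in slurm_job_states("abanihi").items() if st == "PD"]
print(f"{len(pending)} jobs still pending according to squeue")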
Never mind ;) I hadn't seen @mrocklin's comment:

> I believe that cluster.workers is only a convention at this point. It's managed by the cluster object and so represents workers that have been asked for, but not workers that have connected. For the latter you would want the following: cluster.scheduler.workers
In [26]: len(cluster.scheduler.workers)
Out[26]: 36
I am curious... What needs to be done for this issue to be considered "fixed"? I'd be happy to work on missing functionality.