Comments (9)
My best solution at the moment for forcing this to do what I want is to have a function like the following:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

def deploy_jobs(index):
    # Spin up a dedicated single-worker cluster (one Slurm job) per training run.
    cluster = SLURMCluster(
        cores=16,
        processes=1,
        memory="64GB",
        queue="single",
        walltime="03:00:00",
        death_timeout="15s",
        worker_extra_args=["--resources GPU=1"],
        log_directory='./ce-two-layer-10/dask-logs',
        job_script_prologue=["module load devel/cuda/12.1"],
        job_extra_directives=["--gres=gpu:1"],
    )
    cluster.scale(1)
    client = Client(cluster)
    return client.submit(
        train, index, resources={"GPU": 1}
    )
I then loop over my parameters, creating a cluster, scaling it, and submitting a job to it for each one, essentially wrapping Dask's submit function in a short cluster setup step. Not ideal, but it works perfectly.
The obvious downside is that I can't collect the resource information into a single dashboard to monitor.
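For completeness, the driving loop around deploy_jobs then looks roughly like the sketch below; indices is just a placeholder name for the actual parameter scan, so treat this as illustrative rather than exact:

# Hypothetical driving loop over the parameter scan; `indices` is a
# placeholder name for the parameters being swept.
futures = [deploy_jobs(index) for index in indices]

# Each future lives on its own single-worker cluster, so results are
# collected per future instead of through one shared client.
results = [future.result() for future in futures]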
That's a nice comment, thank you for bringing that up @SamTov.
I stumbled across the same issue and would be happy to learn how to solve it.
> however, is that four workers are submitted to the queue
This is definitely weird, and I don't see how it would happen. Do you have enough resources or quota on your cluster to book 5 GPUs?
> only one of them starts to take networks and train them sequentially
Maybe the resources mechanism is not working as expected. I will try to build a reproducer. But could you try not using resources, e.g. something like:
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=1,       # Force only one task per worker at a time
    processes=1,
    job_cpu=16,    # But still book 16 cores with Slurm
    memory="64GB",
    queue="Anonymised",
    walltime="01:00:00",
    death_timeout="15s",
    log_directory='./ce-perceptron/dask-logs',
    job_script_prologue=["module load devel/cuda/12.1"],
    job_extra_directives=["--gres=gpu:1"],
)
cluster.scale(20)  # placeholder worker count
client = Client(cluster)

results = [client.submit(train, index) for index in indices]
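If that setup works, everything runs through one client, so the results can be gathered in one place and all workers share a single dashboard (a small illustrative sketch, not part of the original suggestion):

# Single dashboard for all workers, which also addresses the monitoring
# downside mentioned earlier.
print(cluster.dashboard_link)

# Block until every training run has finished; assumes `train` returns
# something serialisable.
trained = client.gather(results)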
Hi @guillaumeeb, thanks for the answer. I have enough resources and permissions to access the nodes. I usually run ~30 jobs at a time with GPUs, but deploy them from a bash script. For this other study, though, we really need to change a lot of parameters, so we are turning to Dask. In addition, when I use my hacky solution of just creating many clusters, I can run 20 at a time.
It is also odd that when I use adapt, for example, on top of it just killing workers and resubmitting them, it only ever submits one at a time.
> It is also odd that when I use adapt, for example, on top of it just killing workers and resubmitting them, it only ever submits one at a time.
Adapt and Dask resources do not work well together if I remember correctly. So please try without resources and see if it solves some problems.
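Concretely, that would mean keeping a fixed pool of workers while debugging instead of adaptive scaling, and submitting without any resource annotations; a minimal sketch, assuming the cluster and client from the snippet above and a placeholder worker count:

# Instead of adaptive scaling such as:
#   cluster.adapt(minimum=0, maximum=20)
# pin the worker count while debugging (20 is a placeholder):
cluster.scale(20)

# Submit without resources={"GPU": 1}; with cores=1 each worker still
# only runs one task at a time.
futures = [client.submit(train, index) for index in indices]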
The configuration you put up earlier did not really work for me. Admittedly, the jobs stopped dying. However, it didn't take on any network training tasks.
> However, it didn't take on any network training tasks.
That sounds strange. Are you able to submit other tasks?
With my current study, I ask for 20 workers in the adapt command. It will start 20, but at the moment only 5 of them pick up jobs; the others fail. I removed the death_timeout="15s" part of the cluster configuration, and I now actually get an error from the worker:
2023-10-25 13:48:01,215 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.20.33.12:42733'. Reason: failure-to-start-<class 'OSError'>
2023-10-25 13:48:01,216 - distributed.dask_worker - INFO - End worker
OSError: [Errno 113] No route to host
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/comm/core.py", line 342, in connect
comm = await wait_for(
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/utils.py", line 1910, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/comm/tcp.py", line 503, in connect
convert_stream_closed_error(self, e)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/comm/tcp.py", line 141, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x148ba14b4a60>: OSError: [Errno 113] No route to host
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/core.py", line 616, in start
await wait_for(self.start_unsafe(), timeout=timeout)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/utils.py", line 1910, in wait_for
return await asyncio.wait_for(fut, timeout)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/nanny.py", line 351, in start_unsafe
comm = await self.rpc.connect(saddr)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/core.py", line 1626, in connect
return connect_attempt.result()
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/core.py", line 1516, in _connect
comm = await connect(
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
raise OSError(
OSError: Timed out trying to connect to tcp://129.206.9.242:40101 after 30 s
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/cli/dask_worker.py", line 544, in <module>
main() # pragma: no cover
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/cli/dask_worker.py", line 450, in main
asyncio_run(run(), loop_factory=get_loop_factory())
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
return loop.run_until_complete(main)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/cli/dask_worker.py", line 447, in run
[task.result() for task in done]
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/cli/dask_worker.py", line 447, in <listcomp>
[task.result() for task in done]
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/cli/dask_worker.py", line 420, in wait_for_nannies_to_finish
await asyncio.gather(*nannies)
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home/st/st_st/st_ac134186/miniconda3/envs/zincware/lib/python3.10/site-packages/distributed/core.py", line 624, in start
raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
RuntimeError: Nanny failed to start.
Either this is a network issue, or your scheduler is overloaded and cannot accept new Workers.
Does this also happen if you haven't submitted any tasks yet and are just scaling up the cluster?
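One way to check that in isolation (a minimal sketch, reusing the SLURMCluster configuration from above) is to scale the cluster up without submitting anything and see whether the workers manage to connect:

# Scale up with no tasks submitted and wait to see whether the workers
# actually reach the scheduler.
cluster.scale(20)
client = Client(cluster)
client.wait_for_workers(20, timeout=600)

# List the workers that made it; missing ones never connected.
print(list(client.scheduler_info()["workers"]))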