Comments (9)
from dask-jobqueue.
Here's what I'm seeing in some more detail:
In [1]: from dask_jobqueue import PBSCluster
...:
In [2]: cluster = PBSCluster(project='P48500028', processes=9, queue='premium', walltime='00:05:00',
...: resource_spec='select=1:ncpus=36', interface='ib0')
...:
In [3]: cluster.adapt(minimum=1, maximum=10)
Out[3]: <distributed.deploy.adaptive.Adaptive at 0x2aaac87b5c18>
In [4]: cluster._adaptive.log
Out[4]:
deque([(1522436688.2028198, 'up', {'n': 1}),
(1522436689.096044, 'up', {'n': 1}),
(1522436690.096046, 'up', {'n': 1}),
(1522436691.096431, 'up', {'n': 1}),
(1522436692.0965822, 'up', {'n': 1}),
(1522436693.0963206, 'up', {'n': 1}),
(1522436694.0966222, 'up', {'n': 1}),
(1522436695.0969365, 'up', {'n': 1}),
(1522436696.0968359, 'up', {'n': 1}),
(1522436697.0959227, 'up', {'n': 1}),
(1522436698.0963962, 'up', {'n': 1}),
(1522436699.0967467, 'up', {'n': 1}),
(1522436700.0968142, 'up', {'n': 1}),
(1522436701.0960631, 'up', {'n': 1}),
(1522436702.096041, 'up', {'n': 1}),
(1522436703.096142, 'up', {'n': 1}),
(1522436704.0963974, 'up', {'n': 1}),
(1522436705.0966284, 'up', {'n': 1}),
(1522436706.0966349, 'up', {'n': 1}),
(1522436707.0966036, 'up', {'n': 1}),
(1522436708.0960915, 'up', {'n': 1}),
(1522436709.0967355, 'up', {'n': 1}),
(1522436710.0959985, 'up', {'n': 1}),
(1522436711.0961752, 'up', {'n': 1}),
(1522436712.0966375, 'up', {'n': 1}),
(1522436713.096167, 'up', {'n': 1}),
(1522436714.0965333, 'up', {'n': 1}),
(1522436715.0961437, 'up', {'n': 1}),
(1522436719.0971153,
'down',
{'tcp://10.148.14.139:36779',
'tcp://10.148.14.139:39699',
'tcp://10.148.14.139:47350',
'tcp://10.148.14.139:53760',
'tcp://10.148.14.139:53913',
'tcp://10.148.14.139:57132',
'tcp://10.148.14.139:58287',
'tcp://10.148.14.139:58418'}),
(1522436720.0961378, 'up', {'n': 1}),
(1522436721.096705, 'up', {'n': 1}),
(1522436722.0962725, 'up', {'n': 1}),
(1522436723.0962276, 'up', {'n': 1}),
(1522436724.096429, 'up', {'n': 1}),
(1522436725.0959609, 'up', {'n': 1}),
(1522436726.095945, 'up', {'n': 1}),
(1522436727.096537, 'up', {'n': 1}),
(1522436728.0968175, 'up', {'n': 1}),
(1522436729.0963116, 'up', {'n': 1}),
(1522436730.096864, 'up', {'n': 1}),
(1522436731.0960515, 'up', {'n': 1}),
(1522436732.0965285, 'up', {'n': 1}),
(1522436733.0960526, 'up', {'n': 1}),
(1522436734.09688, 'up', {'n': 1}),
(1522436735.0968754, 'up', {'n': 1}),
(1522436736.0964487, 'up', {'n': 1}),
(1522436737.0960183, 'up', {'n': 1}),
(1522436738.096634, 'up', {'n': 1}),
(1522436739.0960908, 'up', {'n': 1}),
(1522436740.0961347, 'up', {'n': 1}),
(1522436741.0966475, 'up', {'n': 1}),
(1522436742.0964673, 'up', {'n': 1}),
(1522436743.096905, 'up', {'n': 1}),
(1522436744.0962358, 'up', {'n': 1}),
(1522436745.0969105, 'up', {'n': 1}),
(1522436746.0968463, 'up', {'n': 1}),
(1522436747.0967472, 'up', {'n': 1}),
(1522436748.0967875, 'up', {'n': 1}),
(1522436749.0962346, 'up', {'n': 1}),
(1522436750.0961862, 'up', {'n': 1}),
(1522436751.0968661, 'up', {'n': 1}),
(1522436752.0963519, 'up', {'n': 1}),
(1522436753.0961668, 'up', {'n': 1}),
(1522436754.096832, 'up', {'n': 1}),
(1522436755.0967185, 'up', {'n': 1}),
(1522436756.0967488, 'up', {'n': 1}),
(1522436757.0959885, 'up', {'n': 1}),
(1522436758.09631, 'up', {'n': 1}),
(1522436759.095931, 'up', {'n': 1}),
(1522436760.0963905, 'up', {'n': 1}),
(1522436761.096087, 'up', {'n': 1}),
(1522436762.096715, 'up', {'n': 1}),
(1522436763.0963166, 'up', {'n': 1}),
(1522436764.0968325, 'up', {'n': 1}),
(1522436765.0965812, 'up', {'n': 1}),
(1522436766.0960157, 'up', {'n': 1}),
(1522436767.0963984, 'up', {'n': 1}),
(1522436768.0960555, 'up', {'n': 1}),
(1522436769.0960166, 'up', {'n': 1}),
(1522436770.096787, 'up', {'n': 1}),
(1522436771.0960639, 'up', {'n': 1}),
(1522436772.0960958, 'up', {'n': 1}),
(1522436773.0966663, 'up', {'n': 1}),
(1522436774.0960972, 'up', {'n': 1}),
(1522436775.096504, 'up', {'n': 1}),
(1522436776.0963619, 'up', {'n': 1}),
(1522436777.0962903, 'up', {'n': 1}),
(1522436778.0968869, 'up', {'n': 1}),
(1522436779.0965302, 'up', {'n': 1}),
(1522436780.0961578, 'up', {'n': 1}),
(1522436781.0965698, 'up', {'n': 1}),
(1522436782.0966446, 'up', {'n': 1}),
(1522436783.0968, 'up', {'n': 1}),
(1522436784.096893, 'up', {'n': 1}),
(1522436785.0965545, 'up', {'n': 1}),
(1522436786.096395, 'up', {'n': 1}),
(1522436787.0963097, 'up', {'n': 1}),
(1522436788.0960252, 'up', {'n': 1}),
(1522436789.0964012, 'up', {'n': 1}),
(1522436790.0960677, 'up', {'n': 1}),
(1522436791.096888, 'up', {'n': 1}),
(1522436792.0964658, 'up', {'n': 1}),
(1522436793.0963855, 'up', {'n': 1}),
(1522436794.0964293, 'up', {'n': 1}),
(1522436795.0967596, 'up', {'n': 1}),
(1522436796.0966263, 'up', {'n': 1}),
(1522436797.0963128, 'up', {'n': 1}),
(1522436798.096786, 'up', {'n': 1}),
(1522436799.0965168, 'up', {'n': 1}),
(1522436800.096292, 'up', {'n': 1}),
(1522436801.096066, 'up', {'n': 1}),
(1522436802.0960214, 'up', {'n': 1}),
(1522436803.0964055, 'up', {'n': 1}),
(1522436804.0963562, 'up', {'n': 1}),
(1522436805.0965285, 'up', {'n': 1}),
(1522436806.0966988, 'up', {'n': 1}),
(1522436807.0962427, 'up', {'n': 1}),
(1522436808.0961447, 'up', {'n': 1}),
(1522436809.0966911, 'up', {'n': 1}),
(1522436810.0964615, 'up', {'n': 1})])
Watching the PBS status, a single worker job was launched but was quickly culled. After that, no additional workers were launched, despite the minimum number of workers being set to 1.
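As a side note, a quick way to make sense of a log like the one above is to tally the recommended actions. The `summarize_adaptive_log` helper below is hypothetical (not part of distributed); it just assumes entries shaped like `Adaptive.log` above:

```python
from collections import Counter, deque

def summarize_adaptive_log(log):
    # Each entry is (timestamp, action, payload); tally the actions.
    return Counter(action for _, action, _ in log)

# Abbreviated entries in the shape of the Adaptive.log output above.
log = deque([
    (1522436688.2, 'up', {'n': 1}),
    (1522436719.1, 'down', {'tcp://10.148.14.139:36779'}),
    (1522436720.1, 'up', {'n': 1}),
])

print(summarize_adaptive_log(log))  # Counter({'up': 2, 'down': 1})
```

This makes the pattern obvious: a steady stream of 'up' recommendations interrupted by a single 'down' that culled every worker process in the job.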
A few things I don't understand here:
- Why was the first job culled? It seems minimum was used to determine the initial number of workers, but it was not respected when autoscaling down.
- How does Adaptive use scale_factor=2 when the number of workers is 0?
- How does autoscaling work with grouped resources? I set the minimum number of workers to 1, but each job really has 9 or 18 processes (workers). Could that be causing a problem?
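On the scale_factor point: a purely multiplicative scale-up stalls at zero workers, since 0 * 2 == 0, so some floor (such as the adaptive minimum) has to kick in. A minimal sketch of that arithmetic (an illustration, not distributed's actual implementation):

```python
def scale_up_target(current_workers, scale_factor=2, minimum=1):
    # Multiplying alone never leaves zero: 0 * scale_factor == 0.
    # A floor such as the adaptive `minimum` is needed to bootstrap.
    return max(minimum, current_workers * scale_factor)

print(scale_up_target(0))  # 1: the minimum bootstraps from zero
print(scale_up_target(3))  # 6: doubled as expected
```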
> Why was the first job culled? It seems minimum was used to determine the initial number of workers but it was not respected when autoscaling down.

I don't know. Maybe your last question has to do with this?

> How does Adaptive use scale_factor=2 when the number of workers is 0?

> How does autoscaling work with grouped resources? I set the minimum number of workers to 1 but each job really has 9 or 18 processes (workers). Could that be causing a problem?

It's a good question. I'm not sure anyone has thought about this. Perhaps this is the problem?

In general, adaptive deployments are young and could use help with both experimentation and development.
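One way to frame the grouped-resources question: the scheduler reasons in workers (processes), while the queueing system reasons in jobs, and a job can only be started or culled whole. A sketch of the unit conversion (hypothetical helper, using the 9 processes per job from the report above):

```python
import math

def workers_to_jobs(target_workers, processes_per_job):
    # Jobs are the schedulable unit: round up so the cluster never
    # provides fewer workers than the adaptive target requests.
    if target_workers <= 0:
        return 0
    return math.ceil(target_workers / processes_per_job)

# With minimum=1 worker but 9 processes per job, one whole job
# must stay up; culling any job drops 9 workers at once.
print(workers_to_jobs(1, 9))   # 1
print(workers_to_jobs(10, 9))  # 2
```

If adaptive logic compares its per-worker minimum against whole jobs, culling the only job drops the cluster straight to zero workers, which matches the behavior in the log.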
Thanks @mrocklin - it seems like we could do a better job defining the terminology around grouped workers and maybe add a test or two to adaptive that targets this use case.
> It's a good question. I'm not sure anyone has thought about this. Perhaps this is the problem?
> In general adaptive deployments are young and could use help both with experimentation and development.
Do you think there is existing test coverage for this code path within distributed? If not, would you be open to me adding some there? I could add it here as well but this doesn't seem like the appropriate place.
@mrocklin - I'm looking at the adaptive tests now. Can you point me to the tests for the grouped workers (without adaptive)?
Closed via #63.