Comments (9)
from dask-jobqueue.
Here's what I'm seeing in some more detail:
In [1]: from dask_jobqueue import PBSCluster
...:
In [2]: cluster = PBSCluster(project='P48500028', processes=9, queue='premium', walltime='00:05:00',
...: resource_spec='select=1:ncpus=36', interface='ib0')
...:
In [3]: cluster.adapt(minimum=1, maximum=10)
Out[3]: <distributed.deploy.adaptive.Adaptive at 0x2aaac87b5c18>
In [4]: cluster._adaptive.log
Out[4]:
deque([(1522436688.2028198, 'up', {'n': 1}),
(1522436689.096044, 'up', {'n': 1}),
(1522436690.096046, 'up', {'n': 1}),
(1522436691.096431, 'up', {'n': 1}),
(1522436692.0965822, 'up', {'n': 1}),
(1522436693.0963206, 'up', {'n': 1}),
(1522436694.0966222, 'up', {'n': 1}),
(1522436695.0969365, 'up', {'n': 1}),
(1522436696.0968359, 'up', {'n': 1}),
(1522436697.0959227, 'up', {'n': 1}),
(1522436698.0963962, 'up', {'n': 1}),
(1522436699.0967467, 'up', {'n': 1}),
(1522436700.0968142, 'up', {'n': 1}),
(1522436701.0960631, 'up', {'n': 1}),
(1522436702.096041, 'up', {'n': 1}),
(1522436703.096142, 'up', {'n': 1}),
(1522436704.0963974, 'up', {'n': 1}),
(1522436705.0966284, 'up', {'n': 1}),
(1522436706.0966349, 'up', {'n': 1}),
(1522436707.0966036, 'up', {'n': 1}),
(1522436708.0960915, 'up', {'n': 1}),
(1522436709.0967355, 'up', {'n': 1}),
(1522436710.0959985, 'up', {'n': 1}),
(1522436711.0961752, 'up', {'n': 1}),
(1522436712.0966375, 'up', {'n': 1}),
(1522436713.096167, 'up', {'n': 1}),
(1522436714.0965333, 'up', {'n': 1}),
(1522436715.0961437, 'up', {'n': 1}),
(1522436719.0971153,
'down',
{'tcp://10.148.14.139:36779',
'tcp://10.148.14.139:39699',
'tcp://10.148.14.139:47350',
'tcp://10.148.14.139:53760',
'tcp://10.148.14.139:53913',
'tcp://10.148.14.139:57132',
'tcp://10.148.14.139:58287',
'tcp://10.148.14.139:58418'}),
(1522436720.0961378, 'up', {'n': 1}),
(1522436721.096705, 'up', {'n': 1}),
(1522436722.0962725, 'up', {'n': 1}),
(1522436723.0962276, 'up', {'n': 1}),
(1522436724.096429, 'up', {'n': 1}),
(1522436725.0959609, 'up', {'n': 1}),
(1522436726.095945, 'up', {'n': 1}),
(1522436727.096537, 'up', {'n': 1}),
(1522436728.0968175, 'up', {'n': 1}),
(1522436729.0963116, 'up', {'n': 1}),
(1522436730.096864, 'up', {'n': 1}),
(1522436731.0960515, 'up', {'n': 1}),
(1522436732.0965285, 'up', {'n': 1}),
(1522436733.0960526, 'up', {'n': 1}),
(1522436734.09688, 'up', {'n': 1}),
(1522436735.0968754, 'up', {'n': 1}),
(1522436736.0964487, 'up', {'n': 1}),
(1522436737.0960183, 'up', {'n': 1}),
(1522436738.096634, 'up', {'n': 1}),
(1522436739.0960908, 'up', {'n': 1}),
(1522436740.0961347, 'up', {'n': 1}),
(1522436741.0966475, 'up', {'n': 1}),
(1522436742.0964673, 'up', {'n': 1}),
(1522436743.096905, 'up', {'n': 1}),
(1522436744.0962358, 'up', {'n': 1}),
(1522436745.0969105, 'up', {'n': 1}),
(1522436746.0968463, 'up', {'n': 1}),
(1522436747.0967472, 'up', {'n': 1}),
(1522436748.0967875, 'up', {'n': 1}),
(1522436749.0962346, 'up', {'n': 1}),
(1522436750.0961862, 'up', {'n': 1}),
(1522436751.0968661, 'up', {'n': 1}),
(1522436752.0963519, 'up', {'n': 1}),
(1522436753.0961668, 'up', {'n': 1}),
(1522436754.096832, 'up', {'n': 1}),
(1522436755.0967185, 'up', {'n': 1}),
(1522436756.0967488, 'up', {'n': 1}),
(1522436757.0959885, 'up', {'n': 1}),
(1522436758.09631, 'up', {'n': 1}),
(1522436759.095931, 'up', {'n': 1}),
(1522436760.0963905, 'up', {'n': 1}),
(1522436761.096087, 'up', {'n': 1}),
(1522436762.096715, 'up', {'n': 1}),
(1522436763.0963166, 'up', {'n': 1}),
(1522436764.0968325, 'up', {'n': 1}),
(1522436765.0965812, 'up', {'n': 1}),
(1522436766.0960157, 'up', {'n': 1}),
(1522436767.0963984, 'up', {'n': 1}),
(1522436768.0960555, 'up', {'n': 1}),
(1522436769.0960166, 'up', {'n': 1}),
(1522436770.096787, 'up', {'n': 1}),
(1522436771.0960639, 'up', {'n': 1}),
(1522436772.0960958, 'up', {'n': 1}),
(1522436773.0966663, 'up', {'n': 1}),
(1522436774.0960972, 'up', {'n': 1}),
(1522436775.096504, 'up', {'n': 1}),
(1522436776.0963619, 'up', {'n': 1}),
(1522436777.0962903, 'up', {'n': 1}),
(1522436778.0968869, 'up', {'n': 1}),
(1522436779.0965302, 'up', {'n': 1}),
(1522436780.0961578, 'up', {'n': 1}),
(1522436781.0965698, 'up', {'n': 1}),
(1522436782.0966446, 'up', {'n': 1}),
(1522436783.0968, 'up', {'n': 1}),
(1522436784.096893, 'up', {'n': 1}),
(1522436785.0965545, 'up', {'n': 1}),
(1522436786.096395, 'up', {'n': 1}),
(1522436787.0963097, 'up', {'n': 1}),
(1522436788.0960252, 'up', {'n': 1}),
(1522436789.0964012, 'up', {'n': 1}),
(1522436790.0960677, 'up', {'n': 1}),
(1522436791.096888, 'up', {'n': 1}),
(1522436792.0964658, 'up', {'n': 1}),
(1522436793.0963855, 'up', {'n': 1}),
(1522436794.0964293, 'up', {'n': 1}),
(1522436795.0967596, 'up', {'n': 1}),
(1522436796.0966263, 'up', {'n': 1}),
(1522436797.0963128, 'up', {'n': 1}),
(1522436798.096786, 'up', {'n': 1}),
(1522436799.0965168, 'up', {'n': 1}),
(1522436800.096292, 'up', {'n': 1}),
(1522436801.096066, 'up', {'n': 1}),
(1522436802.0960214, 'up', {'n': 1}),
(1522436803.0964055, 'up', {'n': 1}),
(1522436804.0963562, 'up', {'n': 1}),
(1522436805.0965285, 'up', {'n': 1}),
(1522436806.0966988, 'up', {'n': 1}),
(1522436807.0962427, 'up', {'n': 1}),
(1522436808.0961447, 'up', {'n': 1}),
(1522436809.0966911, 'up', {'n': 1}),
(1522436810.0964615, 'up', {'n': 1})])
Watching the PBS status, a single worker job was launched but was quickly culled. After that, no additional workers were launched, despite the minimum number of workers being set to 1.
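As a side note, a quick way to make sense of a log like the one above is to tally the recommended actions. The `summarize_adaptive_log` helper below is hypothetical (not part of distributed); it just assumes entries shaped like `Adaptive.log` above:

```python
from collections import Counter, deque

def summarize_adaptive_log(log):
    # Each entry is (timestamp, action, payload); tally the actions.
    return Counter(action for _, action, _ in log)

# Abbreviated entries in the shape of the Adaptive.log output above.
log = deque([
    (1522436688.2, 'up', {'n': 1}),
    (1522436719.1, 'down', {'tcp://10.148.14.139:36779'}),
    (1522436720.1, 'up', {'n': 1}),
])

print(summarize_adaptive_log(log))  # Counter({'up': 2, 'down': 1})
```

This makes the pattern obvious: a steady stream of 'up' recommendations interrupted by a single 'down' that culled every worker process in the job.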
A few things I don't understand here:
- Why was the first job culled? It seems minimum was used to determine the initial number of workers, but it was not respected when autoscaling down.
- How does Adaptive use scale_factor=2 when the number of workers is 0?
- How does autoscaling work with grouped resources? I set the minimum number of workers to 1, but each job really has 9 or 18 processes (workers). Could that be causing a problem?
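On the scale_factor point: a purely multiplicative scale-up stalls at zero workers, since 0 * 2 == 0, so some floor (such as the adaptive minimum) has to kick in. A minimal sketch of that arithmetic (an illustration, not distributed's actual implementation):

```python
def scale_up_target(current_workers, scale_factor=2, minimum=1):
    # Multiplying alone never leaves zero: 0 * scale_factor == 0.
    # A floor such as the adaptive `minimum` is needed to bootstrap.
    return max(minimum, current_workers * scale_factor)

print(scale_up_target(0))  # 1: the minimum bootstraps from zero
print(scale_up_target(3))  # 6: doubled as expected
```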
> Why was the first job culled? It seems minimum was used to determine the initial number of workers but it was not respected when autoscaling down.

I don't know. Maybe your last question has to do with this?

> How does Adaptive use scale_factor=2 when the number of workers is 0?

> How does autoscaling work with grouped resources? I set the minimum number of workers to 1 but each job really has 9 or 18 processes (workers). Could that be causing a problem?

It's a good question. I'm not sure anyone has thought about this. Perhaps this is the problem?

In general, adaptive deployments are young and could use help with both experimentation and development.
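One way to frame the grouped-resources question: the scheduler reasons in workers (processes), while the queueing system reasons in jobs, and a job can only be started or culled whole. A sketch of the unit conversion (hypothetical helper, using the 9 processes per job from the report above):

```python
import math

def workers_to_jobs(target_workers, processes_per_job):
    # Jobs are the schedulable unit: round up so the cluster never
    # provides fewer workers than the adaptive target requests.
    if target_workers <= 0:
        return 0
    return math.ceil(target_workers / processes_per_job)

# With minimum=1 worker but 9 processes per job, one whole job
# must stay up; culling any job drops 9 workers at once.
print(workers_to_jobs(1, 9))   # 1
print(workers_to_jobs(10, 9))  # 2
```

If adaptive logic compares its per-worker minimum against whole jobs, culling the only job drops the cluster straight to zero workers, which matches the behavior in the log.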
Thanks @mrocklin - it seems like we could do a better job defining the terminology around grouped workers and maybe add a test or two to adaptive that targets this use case.
> It's a good question. I'm not sure anyone has thought about this. Perhaps this is the problem?
> In general adaptive deployments are young and could use help both with experimentation and development.
Do you think there is existing test coverage for this code path within distributed? If not, would you be open to me adding some there? I could add it here as well but this doesn't seem like the appropriate place.
@mrocklin - I'm looking at the adaptive tests now. Can you point me to the tests for the grouped workers (without adaptive)?
Closed via #63.