
Comments (9)

mrocklin avatar mrocklin commented on May 24, 2024 1

from dask-jobqueue.

jhamman avatar jhamman commented on May 24, 2024

Here's what I'm seeing in some more detail:

In [1]: from dask_jobqueue import PBSCluster
   ...:

In [2]: cluster = PBSCluster(project='P48500028', processes=9, queue='premium', walltime='00:05:00',
   ...:                      resource_spec='select=1:ncpus=36', interface='ib0')
   ...:
In [3]: cluster.adapt(minimum=1, maximum=10)
Out[3]: <distributed.deploy.adaptive.Adaptive at 0x2aaac87b5c18>

In [4]: cluster._adaptive.log
Out[4]:
deque([(1522436688.2028198, 'up', {'n': 1}),
       (1522436689.096044, 'up', {'n': 1}),
       (1522436690.096046, 'up', {'n': 1}),
       (1522436691.096431, 'up', {'n': 1}),
       (1522436692.0965822, 'up', {'n': 1}),
       (1522436693.0963206, 'up', {'n': 1}),
       (1522436694.0966222, 'up', {'n': 1}),
       (1522436695.0969365, 'up', {'n': 1}),
       (1522436696.0968359, 'up', {'n': 1}),
       (1522436697.0959227, 'up', {'n': 1}),
       (1522436698.0963962, 'up', {'n': 1}),
       (1522436699.0967467, 'up', {'n': 1}),
       (1522436700.0968142, 'up', {'n': 1}),
       (1522436701.0960631, 'up', {'n': 1}),
       (1522436702.096041, 'up', {'n': 1}),
       (1522436703.096142, 'up', {'n': 1}),
       (1522436704.0963974, 'up', {'n': 1}),
       (1522436705.0966284, 'up', {'n': 1}),
       (1522436706.0966349, 'up', {'n': 1}),
       (1522436707.0966036, 'up', {'n': 1}),
       (1522436708.0960915, 'up', {'n': 1}),
       (1522436709.0967355, 'up', {'n': 1}),
       (1522436710.0959985, 'up', {'n': 1}),
       (1522436711.0961752, 'up', {'n': 1}),
       (1522436712.0966375, 'up', {'n': 1}),
       (1522436713.096167, 'up', {'n': 1}),
       (1522436714.0965333, 'up', {'n': 1}),
       (1522436715.0961437, 'up', {'n': 1}),
       (1522436719.0971153,
        'down',
        {'tcp://10.148.14.139:36779',
         'tcp://10.148.14.139:39699',
         'tcp://10.148.14.139:47350',
         'tcp://10.148.14.139:53760',
         'tcp://10.148.14.139:53913',
         'tcp://10.148.14.139:57132',
         'tcp://10.148.14.139:58287',
         'tcp://10.148.14.139:58418'}),
       (1522436720.0961378, 'up', {'n': 1}),
       (1522436721.096705, 'up', {'n': 1}),
       (1522436722.0962725, 'up', {'n': 1}),
       (1522436723.0962276, 'up', {'n': 1}),
       (1522436724.096429, 'up', {'n': 1}),
       (1522436725.0959609, 'up', {'n': 1}),
       (1522436726.095945, 'up', {'n': 1}),
       (1522436727.096537, 'up', {'n': 1}),
       (1522436728.0968175, 'up', {'n': 1}),
       (1522436729.0963116, 'up', {'n': 1}),
       (1522436730.096864, 'up', {'n': 1}),
       (1522436731.0960515, 'up', {'n': 1}),
       (1522436732.0965285, 'up', {'n': 1}),
       (1522436733.0960526, 'up', {'n': 1}),
       (1522436734.09688, 'up', {'n': 1}),
       (1522436735.0968754, 'up', {'n': 1}),
       (1522436736.0964487, 'up', {'n': 1}),
       (1522436737.0960183, 'up', {'n': 1}),
       (1522436738.096634, 'up', {'n': 1}),
       (1522436739.0960908, 'up', {'n': 1}),
       (1522436740.0961347, 'up', {'n': 1}),
       (1522436741.0966475, 'up', {'n': 1}),
       (1522436742.0964673, 'up', {'n': 1}),
       (1522436743.096905, 'up', {'n': 1}),
       (1522436744.0962358, 'up', {'n': 1}),
       (1522436745.0969105, 'up', {'n': 1}),
       (1522436746.0968463, 'up', {'n': 1}),
       (1522436747.0967472, 'up', {'n': 1}),
       (1522436748.0967875, 'up', {'n': 1}),
       (1522436749.0962346, 'up', {'n': 1}),
       (1522436750.0961862, 'up', {'n': 1}),
       (1522436751.0968661, 'up', {'n': 1}),
       (1522436752.0963519, 'up', {'n': 1}),
       (1522436753.0961668, 'up', {'n': 1}),
       (1522436754.096832, 'up', {'n': 1}),
       (1522436755.0967185, 'up', {'n': 1}),
       (1522436756.0967488, 'up', {'n': 1}),
       (1522436757.0959885, 'up', {'n': 1}),
       (1522436758.09631, 'up', {'n': 1}),
       (1522436759.095931, 'up', {'n': 1}),
       (1522436760.0963905, 'up', {'n': 1}),
       (1522436761.096087, 'up', {'n': 1}),
       (1522436762.096715, 'up', {'n': 1}),
       (1522436763.0963166, 'up', {'n': 1}),
       (1522436764.0968325, 'up', {'n': 1}),
       (1522436765.0965812, 'up', {'n': 1}),
       (1522436766.0960157, 'up', {'n': 1}),
       (1522436767.0963984, 'up', {'n': 1}),
       (1522436768.0960555, 'up', {'n': 1}),
       (1522436769.0960166, 'up', {'n': 1}),
       (1522436770.096787, 'up', {'n': 1}),
       (1522436771.0960639, 'up', {'n': 1}),
       (1522436772.0960958, 'up', {'n': 1}),
       (1522436773.0966663, 'up', {'n': 1}),
       (1522436774.0960972, 'up', {'n': 1}),
       (1522436775.096504, 'up', {'n': 1}),
       (1522436776.0963619, 'up', {'n': 1}),
       (1522436777.0962903, 'up', {'n': 1}),
       (1522436778.0968869, 'up', {'n': 1}),
       (1522436779.0965302, 'up', {'n': 1}),
       (1522436780.0961578, 'up', {'n': 1}),
       (1522436781.0965698, 'up', {'n': 1}),
       (1522436782.0966446, 'up', {'n': 1}),
       (1522436783.0968, 'up', {'n': 1}),
       (1522436784.096893, 'up', {'n': 1}),
       (1522436785.0965545, 'up', {'n': 1}),
       (1522436786.096395, 'up', {'n': 1}),
       (1522436787.0963097, 'up', {'n': 1}),
       (1522436788.0960252, 'up', {'n': 1}),
       (1522436789.0964012, 'up', {'n': 1}),
       (1522436790.0960677, 'up', {'n': 1}),
       (1522436791.096888, 'up', {'n': 1}),
       (1522436792.0964658, 'up', {'n': 1}),
       (1522436793.0963855, 'up', {'n': 1}),
       (1522436794.0964293, 'up', {'n': 1}),
       (1522436795.0967596, 'up', {'n': 1}),
       (1522436796.0966263, 'up', {'n': 1}),
       (1522436797.0963128, 'up', {'n': 1}),
       (1522436798.096786, 'up', {'n': 1}),
       (1522436799.0965168, 'up', {'n': 1}),
       (1522436800.096292, 'up', {'n': 1}),
       (1522436801.096066, 'up', {'n': 1}),
       (1522436802.0960214, 'up', {'n': 1}),
       (1522436803.0964055, 'up', {'n': 1}),
       (1522436804.0963562, 'up', {'n': 1}),
       (1522436805.0965285, 'up', {'n': 1}),
       (1522436806.0966988, 'up', {'n': 1}),
       (1522436807.0962427, 'up', {'n': 1}),
       (1522436808.0961447, 'up', {'n': 1}),
       (1522436809.0966911, 'up', {'n': 1}),
       (1522436810.0964615, 'up', {'n': 1})])
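A log this long is easier to read once it is condensed. The sketch below counts events by kind and reports when each 'down' event happened and how many workers it removed. It runs on a few entries copied from the output above (with the 'down' set shortened to two addresses); `summarize` is a hypothetical helper, not part of distributed or dask-jobqueue.

```python
from collections import Counter, deque
from datetime import datetime, timezone

# A few entries copied from the log above; the full deque has the same shape:
# (timestamp, kind, payload). The 'down' payload is shortened here.
log = deque([
    (1522436688.2028198, 'up', {'n': 1}),
    (1522436689.096044, 'up', {'n': 1}),
    (1522436719.0971153, 'down', {'tcp://10.148.14.139:36779',
                                  'tcp://10.148.14.139:39699'}),
    (1522436720.0961378, 'up', {'n': 1}),
])


def summarize(log):
    """Count events by kind and list (time, workers removed) per 'down' event."""
    counts = Counter(kind for _, kind, _ in log)
    downs = [
        (datetime.fromtimestamp(ts, tz=timezone.utc), len(payload))
        for ts, kind, payload in log
        if kind == 'down'
    ]
    return counts, downs


counts, downs = summarize(log)
print(dict(counts))
for when, n_removed in downs:
    print(when.isoformat(), "removed", n_removed, "workers")
```

On a live cluster the same function could be called on `cluster._adaptive.log` directly.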

Watching the PBS status, I saw a single worker launch and then get culled quickly. After that, no additional workers were launched, despite the minimum number of workers being set to 1.

A few things I don't understand here:

  1. Why was the first job culled? It seems minimum was used to determine the initial number of workers but it was not respected when autoscaling down.
  2. How Adaptive uses the scale_factor=2 when the number of workers is 0.
  3. How does autoscaling work with grouped resources? I set the minimum number of workers to 1 but each job really has 9 or 18 processes (workers). Could that be causing a problem?
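To make question 3 concrete: with processes=9, one PBS job brings up nine workers, so a target expressed in workers has to be rounded up to whole jobs. The 'down' entry in the log above lists eight addresses on a single host, which would be consistent with (though it does not prove) nine workers starting and Adaptive retiring all but minimum=1. The sketch below shows only that arithmetic; `jobs_needed` and `surplus_workers` are hypothetical helpers, not dask-jobqueue or distributed APIs.

```python
import math


def jobs_needed(target_workers, processes_per_job):
    """Smallest number of whole jobs that provides at least target_workers."""
    return math.ceil(target_workers / processes_per_job)


def surplus_workers(target_workers, processes_per_job):
    """Workers started beyond the target because jobs come in fixed groups."""
    jobs = jobs_needed(target_workers, processes_per_job)
    return jobs * processes_per_job - target_workers


# One worker requested, nine processes per job (as in the session above).
print(jobs_needed(1, 9))      # 1
print(surplus_workers(1, 9))  # 8
```

If the scaling logic counts workers while the cluster launches jobs, those eight surplus workers look idle and become candidates for removal.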


mrocklin avatar mrocklin commented on May 24, 2024

> Why was the first job culled? It seems minimum was used to determine the initial number of workers but it was not respected when autoscaling down.

I don't know. Maybe your last question has to do with this?

> How Adaptive uses the scale_factor=2 when the number of workers is 0.

See logic in https://github.com/dask/distributed/blob/8acddcc017172e43f815be9cc5183593dd7f6b45/distributed/deploy/adaptive.py#L222-L248
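For readers who do not follow the link: the repeated `{'n': 1}` entries in the log would be consistent with a rule that multiplies the current worker count by scale_factor and never requests fewer than one worker, so that zero workers still yields a request for one. The sketch below illustrates that idea only. It is an assumption, not a copy of the linked code, and `scale_up_target` is a hypothetical name.

```python
def scale_up_target(current_workers, scale_factor=2, maximum=None):
    """Hypothetical scale-up rule: multiply, but never ask for fewer than one."""
    target = max(1, current_workers * scale_factor)
    if maximum is not None:
        # Respect an upper bound such as adapt(maximum=10).
        target = min(target, maximum)
    return target


print(scale_up_target(0))              # 1
print(scale_up_target(3))              # 6
print(scale_up_target(8, maximum=10))  # 10
```

Under this rule the multiplier has no effect while the worker count is 0; growth only starts once at least one worker stays registered.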

> How does autoscaling work with grouped resources? I set the minimum number of workers to 1 but each job really has 9 or 18 processes (workers). Could that be causing a problem?

It's a good question. I'm not sure anyone has thought about this. Perhaps this is the problem?

In general adaptive deployments are young and could use help both with experimentation and development.


jhamman avatar jhamman commented on May 24, 2024

Thanks @mrocklin - it seems like we could do a better job defining the terminology around grouped workers and maybe add a test or two to adaptive that targets this use case.


jhamman avatar jhamman commented on May 24, 2024

@mrocklin -

> It's a good question. I'm not sure anyone has thought about this. Perhaps this is the problem?
>
> In general adaptive deployments are young and could use help both with experimentation and development.

Do you think there is existing test coverage for this code path within distributed? If not, would you be open to me adding some there? I could add it here as well but this doesn't seem like the appropriate place.


mrocklin avatar mrocklin commented on May 24, 2024


jhamman avatar jhamman commented on May 24, 2024

@mrocklin - I'm looking at the adaptive tests now. Can you point me to the tests for the grouped workers (without adaptive)?


mrocklin avatar mrocklin commented on May 24, 2024


jhamman avatar jhamman commented on May 24, 2024

closed via #63

