Comments (20)

ian-r-rose commented on August 13, 2024

Yes, I can help with some of this. To my mind, the most difficult thing will be managing the lifecycle of the clusters. Your basic description of the setup sounds reasonable. For communicating between the server side and the client side: I think rather than a config json we should probably model it after the rest of the notebook REST API. In particular, the sessions API is probably a good place to start.
Major questions that I have:

  • How do we poll for clusters in a backend-agnostic way? Does dask-distributed have all the abstractions we need?
  • What are things that need to be configurable on the backend? How much of that should be exposed to the frontend vs configured at server launch time?
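The interface question above can be made concrete with a sketch. This is hypothetical code, not dask's actual API: the class and method names below are invented only to show the minimal surface a backend-agnostic extension would need from any cluster manager.

```python
# Hypothetical sketch of a backend-agnostic cluster interface; none of these
# names come from dask.distributed itself.

class ClusterBackend:
    """Minimal operations a server extension would need from any backend."""

    def start(self):
        raise NotImplementedError

    def scale(self, n_workers):
        raise NotImplementedError

    def close(self):
        raise NotImplementedError

    def info(self):
        """Return a JSON-serializable description suitable for polling."""
        raise NotImplementedError


class DummyBackend(ClusterBackend):
    """Stand-in implementation, used only to show the shape of info()."""

    def __init__(self):
        self.workers = 0
        self.running = False

    def start(self):
        self.running = True

    def scale(self, n_workers):
        self.workers = n_workers

    def close(self):
        self.running = False

    def info(self):
        return {"running": self.running, "workers": self.workers}
```

A poll endpoint could then call info() on every registered backend without caring whether it is local, Kubernetes, YARN, or a batch queue.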

from dask-labextension.

mrocklin commented on August 13, 2024

So whenever someone starts up a Client in a Python session we would optionally hit some address to see if it will respond? Presumably it responds with the address that we should check? That could work.

I can imagine doing this either over HTTP similar to how Jupyter does things, or using Dask comms instead.

>>> client = Client('http://localhost/api/dask') 
>>> client.scheduler.address
'tcp://some-other-address:####'
>>> client = Client('tcp://localhost:8786')  # actually the address of our nbserver extension
>>> client.scheduler.address
'tcp://some-other-address:####'

mrocklin commented on August 13, 2024

> What are things that need to be configurable on the backend? How much of that should be exposed to the frontend vs configured at server launch time?

Well, to start we'll need to decide what library we're using to construct clusters. Common choices today include dask-kubernetes, dask-jobqueue, dask-yarn, and the LocalCluster in the core dask.distributed library. This should probably be determined by configuration, and not by the user directly.

At runtime we'll want users to be able to start, stop, and restart their cluster. We'll also want them to have numerical or text inputs for number of cores and memory. They'll also want to be able to hit "Adapt" and have Dask take over the decision about cores and memory.

mrocklin commented on August 13, 2024

(screenshot attached; content not recoverable from text)

ian-r-rose commented on August 13, 2024

> So whenever someone starts up a Client in a Python session we would optionally hit some address to see if it will respond? Presumably it responds with the address that we should check? That could work.

Yes, we can have something like a dask/clients endpoint that returns a list of active clients, as well as ids for them. We can then hit dask/clients/clientid with GET, DELETE, etc. requests to manage them. We can poll the list every few seconds to keep it up to date. This is pretty close to how we handle kernel sessions in the application. I am currently thinking that our server extension would just proxy the client dashboard URLs and comms and such, but you have a better idea of the networking requirements there than I do.

> Well, to start we'll need to decide what library we're using to construct clusters. Common choices today include dask-kubernetes, dask-jobqueue, dask-yarn, and the LocalCluster in the core dask.distributed library. This should probably be determined by configuration, and not by the user directly.

Are the abstractions here sufficient that we could hit multiple (or all) of these use cases with a single extension, and allow their selection via a config option?

> At runtime we'll want users to be able to start, stop, and restart their cluster. We'll also want them to have numerical or text inputs for number of cores and memory. They'll also want to be able to hit "Adapt" and have Dask take over the decision about cores and memory.

I think this should be doable via a REST API.
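A minimal sketch of the registry such a REST API could sit on top of (plain Python, no tornado; the function names and the dask/clients routes here follow the wording above but are otherwise hypothetical):

```python
import uuid

# Hypothetical in-memory registry backing GET/DELETE on dask/clients/clientid;
# a real server extension would wrap these functions in tornado handlers.
_registry = {}

def create_client(description):
    """POST dask/clients: register a new entry and return its id."""
    client_id = uuid.uuid4().hex
    _registry[client_id] = description
    return client_id

def list_clients():
    """GET dask/clients: the list the frontend polls every few seconds."""
    return [{"id": cid, **desc} for cid, desc in _registry.items()]

def delete_client(client_id):
    """DELETE dask/clients/clientid: remove the entry if it exists."""
    return _registry.pop(client_id, None)
```

The polling loop on the frontend only ever sees the JSON from list_clients(), so the backend behind each entry can change without touching the API.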

mrocklin commented on August 13, 2024

> Are the abstractions here sufficient that we could hit multiple (or all) of these use cases with a single extension, and allow their selection via a config option?

I think so, yes.

> Yes, we can have something like a dask/clients endpoint that returns a list of active clients

I think that we'll want to switch out the term clients for clusters or schedulers. The client is an object that the user will need to interact with in their notebook/script/whatever. That object will need the address of the scheduler to connect to.

@ian-r-rose perhaps we should chat about this real-time? We might be able to bounce back and forth and come up with a plan more quickly. I'm around most of today and tomorrow if you're free.

ian-r-rose commented on August 13, 2024

Sure, I am around and pretty flexible today. Feel free to ping me on Gitter and we can set up a room.

mrocklin commented on August 13, 2024

@ian-r-rose and I had a quick chat and agreed that:

  • a server extension should probably manage a few clusters, not just one
  • a user might attach a particular cluster to a notebook kernel by clicking and dragging something into a notebook
  • on the server side this can probably be a fairly simple tornado web application

As an initial set of operations, the following probably works pretty well:

from dask.distributed import LocalCluster

cluster = LocalCluster(threads_per_worker=2, memory_limit='4GB')  # configure workers and start

cluster.scale(10)  # scale cluster to ten workers
cluster.scale(2)  # scale cluster down to two workers

cluster.adapt(minimum=0, maximum=10)  # adapt cluster between 0 and 10 workers

cluster.close()  # shut down cluster

We may at some point want to start these running on the same event loop as the Jupyter web server, I'm not sure. This will probably affect some discussions that we're thinking about for deployment now upstream.

ian-r-rose commented on August 13, 2024

It looks like you were right to be concerned about the tornado event loop @mrocklin. In my initial explorations, just importing dask.distributed in the same process that is running the notebook server seems to cause problems. Specifically, the notebook server no longer responds to HTTP requests. Any ideas about what might be causing the problem, or how we could get around it? Since it seems to happen at import time, I don't really know where to start looking.

https://github.com/ian-r-rose/dask-labextension/tree/serverextension

mrocklin commented on August 13, 2024

@ian-r-rose I'm happy to investigate. This may sound dumb, but what's the right way to install and test this?

ian-r-rose commented on August 13, 2024

Thanks @mrocklin. You can install it with

pip install -e .
jupyter serverextension enable --sys-prefix dask_labextension

This attempts to add an additional REST endpoint to the web server. However, I was able to reproduce the problem with a do-nothing extension that just imported dask.distributed.

ian-r-rose commented on August 13, 2024

My suspicion is that both dask.distributed and the notebook server are trying to wrest control of the default tornado IOLoop and stepping on each other's toes, but I don't know that for sure.
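As a toy illustration of that kind of conflict (plain asyncio, not dask or notebook code): when two components each create "the" event loop independently, work scheduled on one loop is invisible to the other.

```python
import asyncio

# Toy illustration only: two independently created loops do not share work.
loop_a = asyncio.new_event_loop()
loop_b = asyncio.new_event_loop()

results = []
loop_a.call_soon(results.append, "ran on A")

# Spinning loop B does not execute the callback queued on loop A.
loop_b.run_until_complete(asyncio.sleep(0))
print(results)  # []

loop_a.close()
loop_b.close()
```

If the notebook server is waiting on one loop while dask's machinery installed or captured a different one, HTTP handlers could easily stop being serviced.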

mrocklin commented on August 13, 2024

Some binary search of imports and code led to this diff on the Dask side, which seems to solve the immediate problem:

diff --git a/distributed/utils.py b/distributed/utils.py
index df7561aa..dcdd7f5e 100644
--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -1394,8 +1394,8 @@ def reset_logger_locks():
 
 
 # Only bother if asyncio has been loaded by Tornado
-if 'asyncio' in sys.modules:
-    fix_asyncio_event_loop_policy(sys.modules['asyncio'])
+# if 'asyncio' in sys.modules:
+#     fix_asyncio_event_loop_policy(sys.modules['asyncio'])
 
 
 def has_keyword(func, keyword):

I'll look into why we did this in the first place. In the meantime, though, applying this diff directly may allow us to move forward.

mrocklin commented on August 13, 2024

Seems to be a workaround for tornadoweb/tornado#2183

mrocklin commented on August 13, 2024

OK, after looking more at this I'm not sure that Dask is doing something wrong here. I've standardized things on the Dask side at dask/distributed#2326 .

If possible I think we should ask someone on the Jupyter side about why this might cause issues. Who is the right contact for this today?

mrocklin commented on August 13, 2024

@minrk do you have thoughts on why adding the following lines might break the Jupyter server?

    import asyncio
    import tornado.platform.asyncio
    asyncio.set_event_loop_policy(tornado.platform.asyncio.AnyThreadEventLoopPolicy())

ian-r-rose commented on August 13, 2024

Thanks for looking into this @mrocklin, I'll apply your fix while we work out a more permanent solution. @Carreau recently did a lot of work on the IPython event loop and may have some insights as well.

Carreau commented on August 13, 2024

I can have a look. I've been poking at async stuff, and in some places we are still doing things the old way, calling ensure_future instead of yielding the awaitable, which has led to some of my prototypes simply not running coroutines.

So far I'm working on deploying a JupyterHub on the merced cluster and once this is done, I'll likely start integrating dask, so happy to be a guinea pig and debug these things.

I just need to get things to work first :-)

ian-r-rose commented on August 13, 2024

Thanks for the info @Carreau.

ian-r-rose commented on August 13, 2024

Fixed by #31
