globus / globus-compute
Globus Compute: High Performance Function Serving for Science
Home Page: https://www.globus.org/compute
License: Apache License 2.0
Support specification of resource requirements (mem, cpu) for a given function. When a function is published we should be able to associate a walltime, number of cpus, and amount of memory required by the container and then need to use these when launching a pod/block.
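As a sketch, the registration record could carry these limits alongside the function id; the field names here (walltime_s, cpus, mem_mb) are assumptions for illustration, not the real schema:

```python
# Hypothetical sketch: attach resource requirements at function
# registration time so the endpoint can use them when launching a
# pod/block. Field names are assumptions, not the real funcX schema.
def build_registration_record(function_id, walltime_s, cpus, mem_mb):
    return {
        'function_id': function_id,
        'limits': {
            'walltime_s': walltime_s,  # max runtime for the container
            'cpus': cpus,              # number of CPUs requested
            'mem_mb': mem_mb,          # memory limit in MB
        },
    }
```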
Currently `funcx-endpoint list` only lists local endpoints; it would be useful to list all available/online endpoints, including remote ones.
From Whit.
We should add parsl to the requirements and remove modules such as IPP that are only indirect dependencies pulled in through parsl.
Reported by @aprilyw
We currently use the mdf_toolbox to perform Globus Auth logins in the SDK client. We should update this to use the Fair Research Login. When doing this we should disable the local browser so that logins will work on remote machines.
We should implement a `get_result` method for the client which returns the deserialized result. Currently one needs to instantiate their own deserializer and deserialize, something like:
from funcx.serialize import FuncXSerializer
serializer = FuncXSerializer()
result = serializer.deserialize(fxc.get_task_status(task_id)['result'])
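A minimal sketch of such a `get_result` helper, using stand-in serializer and client classes (both hypothetical) so the example is self-contained:

```python
class StubSerializer:
    """Stand-in for funcx.serialize.FuncXSerializer (hypothetical)."""
    def deserialize(self, payload):
        # Real deserialization would decode the wire format; we just echo.
        return payload

def get_result(client, task_id, serializer=None):
    """Fetch a task's status and return its deserialized result."""
    serializer = serializer or StubSerializer()
    status = client.get_task_status(task_id)
    if 'result' not in status:
        raise Exception("Task {} has no result yet".format(task_id))
    return serializer.deserialize(status['result'])
```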
It'd be nice to be able to resolve the uuid of the local endpoint to submit tasks to. It seems we could add a function to the SDK to check the ~/.funcx/funcx.cfg file for an endpoint_uuid and use that.
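A sketch of that lookup, assuming the config is INI-style with an `[endpoint]` section holding `endpoint_uuid` (the file format here is an assumption):

```python
import configparser
import os

def local_endpoint_uuid(cfg_path='~/.funcx/funcx.cfg'):
    """Read endpoint_uuid from the funcX config if present.

    The [endpoint] section and endpoint_uuid key are assumed, not the
    documented format. Returns None if the file or key is missing.
    """
    path = os.path.expanduser(cfg_path)
    if not os.path.exists(path):
        return None
    cfg = configparser.ConfigParser()
    cfg.read(path)
    return cfg.get('endpoint', 'endpoint_uuid', fallback=None)
```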
Somehow the second execution always seems to fail on a petrelkube node.
The output of `get_task_status` depends on the task status. If it is pending:
{'status': 'PENDING', 'task_id': '57776306-e86a-4870-a1d0-614598c0a7af'}
If it is complete:
{'completion_t': 1569447236.1054666,
'result': '01\ngANYHAAAAEhlbGxvLCBhd2Vzb21lIGF3ZXNvbWUgd29ybGRxAC4=\n',
'task_id': '57776306-e86a-4870-a1d0-614598c0a7af'}
This makes user-side programmatic handling of various task states a bit awkward. I think `get_task_status` should always have a `status` field in the returned dictionary, which says `PENDING`, `COMPLETE`, etc. Furthermore, I do not think the result should be returned when one calls `get_task_status` (for that users can use `get_result`, see #65).
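One way to sketch the proposed behavior is a small normalization step over the two response shapes shown above; the helper itself is hypothetical:

```python
def normalize_status(raw):
    """Normalize a get_task_status response so 'status' is always present.

    Field names are taken from the example responses above; the
    COMPLETE/UNKNOWN labels are assumptions. The raw result blob is
    deliberately dropped, per the proposal that get_result handles it.
    """
    out = {'task_id': raw['task_id']}
    if 'status' in raw:
        out['status'] = raw['status']
    elif 'completion_t' in raw:
        out['status'] = 'COMPLETE'
        out['completion_t'] = raw['completion_t']
    else:
        out['status'] = 'UNKNOWN'
    return out
```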
Have you considered breaking off the funcX client into a separate package?
I realize that's a maintenance burden but just getting the client installs like 30 libraries, which seems excessive. Maybe the tradeoff is worth it once the funcX API settles down?
Currently logging from parsl providers is muted. To debug provider issues, we need those to appear somewhere.
Currently the executor assumes that every worker is uniform, running the same container. However, for better resource management we want the manager to be able to start, replace, and otherwise manage containers based on the needs of the tasks it receives.
For this to work we'll need to update several components:
There's a bunch of caching/scavenging logic that needs to happen here. Might be worthwhile to read up on distributed caching.
Currently `client.get_result` raises a bare `Exception` if the task is pending, which isn't very useful for error handling. It would be nice if users could implement a try/except clause catching a dedicated exception.
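A minimal sketch of such a dedicated exception (the name `TaskPending` is an assumption, not the actual class funcX later adopted):

```python
class TaskPending(Exception):
    """Raised when a result is requested for a task that has not finished.

    Hypothetical exception class; the name and fields are assumptions.
    """
    def __init__(self, task_id):
        super().__init__("Task {} is still pending".format(task_id))
        self.task_id = task_id

def get_result(status_blob):
    """Return the result, or raise TaskPending if it is not ready."""
    if 'result' not in status_blob:
        raise TaskPending(status_blob['task_id'])
    return status_blob['result']
```

Users could then write `try: get_result(...) / except TaskPending: retry later` instead of catching a bare `Exception`.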
But data flows in cleartext.
We want to support functions that take numpy arrays and other complex data objects that are common in ML, in addition to supporting JSON-like inputs. This would require support for serializing complex data types for transport over the web.
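A sketch of one common approach: try JSON first and fall back to pickle plus base64 for objects JSON cannot handle, tagging each payload with the method used (the `json:`/`pkl:` prefixes are assumptions for illustration, not funcX's actual wire format):

```python
import base64
import json
import pickle

def serialize(obj):
    """Serialize obj, falling back to pickle+base64 for non-JSON types."""
    try:
        return 'json:' + json.dumps(obj)
    except TypeError:
        return 'pkl:' + base64.b64encode(pickle.dumps(obj)).decode()

def deserialize(payload):
    """Invert serialize() by dispatching on the method prefix."""
    method, _, body = payload.partition(':')
    if method == 'json':
        return json.loads(body)
    return pickle.loads(base64.b64decode(body))
```

Note pickle payloads are only portable between compatible Python versions, which ties into the version-matching concern discussed later in this list.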
There are 3 different transport paths that we need to consider:
Client Layer
| ^
| |
web web
| |
REST Service Layer
| |
redis / 0mq #<- We might have both for reliability+low latency
| |
Forwarder Layer
| |
0mq 0mq
| |
V |
Worker Layer
The transport package from the client could look like this :
( b'<function_id, endpoint_targets ...>',
b'<args> + <kwargs>'
)
The package that needs to be shipped from the hub service to the endpoint would be:
( b'<info block : task_id, container_id, cpu, mem limits ... >',
b'<env_variables>',
b'serialization_method' + b'<function_body>',
b'serialization_method' + b'<args>' + b'<kwargs>',
)
The return result package :
( b'info block: task_id, status' ,
b'serialization_method' + b'<results>' or b'<exception>'
)
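A minimal sketch of building the two-frame return package above; the frame layout and field names follow the description but are assumptions, not a finalized protocol:

```python
import json

def pack_result(task_id, status, method, body):
    """Build the two-frame result package: an info frame and a
    method-prefixed payload frame (result or exception bytes).

    Layout is the sketch from the issue text, not a finalized protocol.
    """
    info = json.dumps({'task_id': task_id, 'status': status}).encode()
    return (info, method + body)
```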
We need to support automatically pulling down ECR containers rather than assuming the user will do it themselves before running functions.
The current binder notebook is out of date: it has been renamed and our binder links don't work, it uses an endpoint that doesn't exist, and the function result is never retrieved.
We don't seem to be storing the result from the function invocation in the database's results table for async tasks.
We need to move all the provider+channel logic down to the interchange.
We need to set up a reliable on-node endpoint (minus the web service) to which we can send functions, and make this part of our CI workflow.
I sometimes see in `interchange.log` that the interesting manager count is greater than the total manager count. My guess is that some managers reach their walltime and leave, but the interchange does not remove those dead managers from the `interesting_manager` list.
I have not hit any issue with this so far, but I imagine thorough testing would surface one.
2019-10-14 17:12:20.363 interchange:565 [INFO] [MAIN] Managers count (total/interesting): 0/1
Extend the funcX endpoint to support and manage DLHub containers as different executors and route requests appropriately.
We want to give funcX the ability to select where a task is executed. In the future we'd like to be able to specify any endpoint (*) be used and have funcX use manager-provided information to select a resource. As a first step we can enable the specification of a list of endpoints and have funcX simply submit a task to each of them.
On initializing a new endpoint using the `CondorProvider`, the first job fails with:
Hold reason: Error from [email protected]: SHADOW at 129.74.85.10 failed to send file(s) to <10.32.86.95:44375>: error reading from /afs/crc.nd.edu/user/a/awoodard/.funcx/ndt3/worker_logs/submit_scripts/parsl.1.1569443103.6405063.script: (errno 2) No such file or directory; STARTER failed to receive file(s) from <129.74.85.10:9618>
This looks like a race condition where the output directory hasn't been created in time (subsequent jobs succeed).
We need to pull out the container type at the interchange and then use it to launch the pods.
Hi,
I registered a resource as a funcX executor and found that about a day later, `interchange.log` was 5.3 GB.
I see many lines telling me that I have no managers, going on to infinity. `tail -n 20 interchange.log` produces the following:
2019-10-15 20:42:14.939 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.949 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.960 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.970 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.981 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.991 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.002 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.012 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.023 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.033 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.044 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.054 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.065 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.075 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.086 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.096 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.107 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.117 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.128 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
It is clunky to have to run `funcx-endpoint start <endpoint>` twice. I propose we rename the first start to `funcx-endpoint configure <endpoint>`.
We need some public endpoints we can use for demonstrations. We should probably put some restrictions on what they can do and how often etc. but these are necessary for binder demonstrations etc.
Open questions: where to host it, is a random AWS instance sufficient? Also, do we want any restrictions on what they do?
Currently the executor components assume that a consistent Python3 version and environment is available across the submit side, the interchange, and the workers. This is difficult and most likely terrible to enforce from a user-experience standpoint.
One potential solution is to relax some of the constraints and checks on version matching.
Note: We need to confirm whether it is just pickled functions alone that don't port between python versions.
If the above is right, we can get away with only enforcing python versions at the manager level.
When `funcx-endpoint stop <endpoint_name>` is invoked on the endpoint rearchitect branch, the processes persist rather than terminating cleanly.
Currently the endpoint is pretty bulky: it deploys a Parsl DFK with providers and all of that, for functionality already supported by the remote interchange model. The current model also requires the user to put a config on a login machine, which gives the user distributed points of control and makes for a bad user experience. Switching to the remote interchange model for endpoint deployment means the deployment becomes a one-time hassle; once the endpoint is connected back, things are seamless, and configuration and control happen entirely from the web service.
This does sacrifice being able to directly connect to the interchange and send tasks to it for lower latency. (We can work on this later).
Here's how this could work.
The user ssh's onto Theta, installs funcX, and runs the command:
funcx-endpoint start
This starts a daemon process on Theta that dials back to `hub.funcx.org` (say) and requests a connection as `user_x` from `hostname_theta`. The hub then creates an executor client object, and the REST reply contains the ports to which the endpoint must connect. Once the endpoint connection is made, the user gets a success message. This registration mechanism means we do not have to tell the user any special info or magic string.
Once the endpoint is connected, the webservice would give the user the option to configure the system. Requests for scaling are handled by issuing command to the endpoint which is responsible for taking on the strategy+channel+provider roles in Parsl.
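The registration handshake described above could be sketched as follows; the payload fields and the `task_port`/`result_port` reply keys are assumptions for illustration, not the real hub API:

```python
import json

def build_registration(user, hostname):
    """Build the dial-back registration request (fields assumed)."""
    return json.dumps({'user': user, 'hostname': hostname})

def parse_registration_reply(reply):
    """Extract the ports the endpoint must connect to from the hub's
    REST reply. Key names are assumptions, not the real protocol."""
    data = json.loads(reply)
    return data['task_port'], data['result_port']
```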
We should add an argument to 'funcx-endpoint configure -X...' that allows us to point to an external config file. This will help with our debugging and perhaps any additional use cases that want to run a lot of endpoints (i.e., they won't need to change the default-template each time).
or some other auth practices. Need to change the web services
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/a/awoodard/.local/lib/python3.6/site-packages/funcx/serialize/facade.py", line 60, in serialize
serialized = method.serialize(data)
File "/afs/crc.nd.edu/user/a/awoodard/.local/lib/python3.6/site-packages/funcx/serialize/concretes.py", line 21, in serialize
x = json.dumps(data)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/encoder.py", line 180, in default
o.__class__.__name__)
TypeError: Object of type 'RemoteExceptionWrapper' is not JSON serializable
When I deployed the forwarder locally on Theta and registered an endpoint through it, I got the Traceback below after ~12 hours of running. I have not dug into the cause yet, but I want to log it here before I forget.
Process Forwarder-1:
Traceback (most recent call last):
File "/home/zzli/anaconda3/envs/dlhub-JPDC-theta/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/zzli/anaconda3/envs/dlhub-JPDC-theta/lib/python3.6/site-packages/forwarder/forwarder.py", line 233, in run
self.task_loop()
File "/home/zzli/anaconda3/envs/dlhub-JPDC-theta/lib/python3.6/site-packages/forwarder/forwarder.py", line 189, in task_loop
fu.add_done_callback(partial(self.handle_app_update, task_id))
UnboundLocalError: local variable 'fu' referenced before assignment
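The traceback suggests `fu` is only bound when task submission succeeds, so the callback line must be guarded. A minimal sketch of the fix pattern (the forwarder internals here are hypothetical stand-ins, not the actual `task_loop` code):

```python
def task_loop_step(submit, task, on_done):
    """One iteration of a task loop: submit a task and attach a callback
    only if submission actually produced a future.

    submit/on_done are hypothetical stand-ins for the forwarder's
    executor submit call and handle_app_update callback.
    """
    fu = None
    try:
        fu = submit(task)
    except Exception:
        pass  # submission failed; fu stays None instead of being unbound
    if fu is not None:
        fu.add_done_callback(on_done)
    return fu
```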
At the moment, if a function produces an exception, it is not returned to the user. Reported by @aprilyw
Collect resource usage data from the managers and have the forwarder report it to Redis for tracking, visualizations, and counter updating!
Currently we have https://funcx.readthedocs.io/en/latest/ pointed at nothing. We should get started with:
We should record the uuid assigned to an endpoint in a ~/.funcx/funcx.cfg file and retrieve it when we start the endpoint.
With the change to Redis and Elastic Beanstalk, all execution requests are now asynchronous and return an identifier. Thus, to support synchronous tasks, we need to poll for status updates.
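A sketch of the client-side polling loop (the helper and its parameters are assumptions; `get_status` stands in for the SDK's `get_task_status` call):

```python
import time

def wait_for_result(get_status, task_id, timeout=60, interval=0.5):
    """Poll task status until a result appears or timeout elapses.

    get_status is a stand-in for the SDK's get_task_status; the
    timeout/interval defaults are arbitrary choices for illustration.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status(task_id)
        if 'result' in status:
            return status['result']
        time.sleep(interval)
    raise TimeoutError("Task {} did not complete in {}s".format(task_id, timeout))
```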
Currently we use a key-value set for tasks and a separate one for results from Redis as the communication layer between the forwarder and the web-service. This makes it harder to track tasks. We should instead switch to a single tasks hashmap, that will be updated with the task block on registration, and the result blob on completion. In between, based on task info available, states could be set, which would allow for much greater visibility into task progress.
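The single-hashmap-per-task idea could be sketched like this; Redis HSET semantics are mimicked with a plain dict so the example is self-contained, and the key layout and state names are assumptions:

```python
# In production this dict would be a Redis hash per task
# (e.g. HSET task:<id> field value); a dict keeps the sketch runnable.
tasks = {}

def register_task(task_id, function_id, payload):
    """Create the task hash at registration with the task block."""
    tasks[task_id] = {'function_id': function_id,
                      'payload': payload,
                      'status': 'WAITING'}

def set_status(task_id, status):
    """Update intermediate state for visibility into task progress."""
    tasks[task_id]['status'] = status

def complete_task(task_id, result):
    """Attach the result blob to the same hash on completion."""
    tasks[task_id].update(status='COMPLETE', result=result)
```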
If a user published a function, found it wrong, changed it, and re-published it with the SAME function name, the web service still used the old version of the function.
With the manager spinning up and down many containers, we can get thousands or more `funcx_worker_X.log` files. If we're not in debug mode, we want all workers to write only a few info/error log lines to a single shared file.
Do AFTER PR merged for #5
We want the managers to advertise their available containers and slots to the interchange. This is required to allow the interchange to do smarter routing of tasks to managers.
This is a pre-req for #4
We need to update the examples, add more, and make them runnable in binder. Anyone want to have a crack at this? I believe ZZ has some experience with using Binder.