
funcx's Introduction

Globus Compute - Fast Function Serving


Globus Compute (formerly funcX) is a high-performance function-as-a-service (FaaS) platform that enables intuitive, flexible, efficient, and scalable remote function execution on existing infrastructure, including clouds, clusters, and supercomputers.


Website: https://www.globus.org/compute

Documentation: https://globus-compute.readthedocs.io/en/latest/

Quickstart

Globus Compute is currently available on PyPI.

To install Globus Compute, first ensure you have Python 3.7+:

$ python3 --version

Install using pip:

$ pip install globus-compute-sdk

To use our example notebooks you will need Jupyter:

$ pip install jupyter

Note

The Globus Compute client is supported on macOS, Linux, and Windows. The Globus Compute endpoint is only supported on Linux.

Documentation

Complete documentation for Globus Compute is available at https://globus-compute.readthedocs.io/en/latest/.

funcx's People

Contributors

ak2000, andrew-s-rosen, annawoodard, benclifford, bengalewsky, benhg, blaiszik, chris-janidlo, danielskatz, dependabot[bot], joshbryan-globus, jvsr1, khk-globus, knagaitsev, kurtmckee, kylechard, leiglobus, nickolausds, pratik-99, pre-commit-ci[bot], ravescovi, rjmello, ryanchard, sirosen, theodore-ando, tskluzac, tylern4, yadudoc, yongyanrao, zhuozhaoli


funcx's Issues

Support multiple containers per executor

Currently the executor assumes that every worker is uniform, running the same container. However, for better resource management, we want the manager to be able to start, replace, and otherwise manage containers based on the needs of the tasks it receives.

For this to work we'll need to update several components:

  • Update task serialization logic to add a header that indicates container info
  • Managers to support workers that can disconnect
  • Manager to report online containers and resources to the interchange
  • Interchange to parse task headers and direct tasks to Managers with matching containers
  • Interchange to detect when there are too few active containers but enough CPU/memory to load more, and send tasks to those managers accordingly
  • Interchange may need logic that sends tasks to a manager so the manager can swap containers in from its cache

There's a bunch of caching/scavenging logic that needs to happen here. Might be worthwhile to read up on distributed caching.
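
As a starting point, a container-aware task header might look something like the sketch below; every field name is hypothetical rather than a settled wire format.

# Hypothetical container-aware task header (illustrative fields only)
task_header = {
    "task_id": "b1946ac9-...",     # truncated example task id
    "container_id": "ml-py36",     # the container this task requires
    "cpu_cores": 2,                # placement hints for the interchange
    "mem_mb": 4096,
}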

Switch to redis hashmaps for tasks

Currently we use one Redis key-value set for tasks and a separate one for results as the communication layer between the forwarder and the web service. This makes it hard to track tasks. We should instead switch to a single task hashmap that is updated with the task blob on registration and the result blob on completion. In between, states could be set based on available task info, allowing much greater visibility into task progress.
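
A minimal sketch of the single-hashmap approach using redis-py (the key naming is illustrative, and the mapping argument assumes redis-py >= 3.5):

import redis

r = redis.Redis()

def register_task(task_id, task_blob):
    # One hash per task; fields accumulate as the task moves through states.
    r.hset(f"task_{task_id}", mapping={"status": "WAITING", "task": task_blob})

def complete_task(task_id, result_blob):
    r.hset(f"task_{task_id}", mapping={"status": "COMPLETE", "result": result_blob})

def task_state(task_id):
    # Everything known about the task comes back in one round trip.
    return r.hgetall(f"task_{task_id}")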

Change to fair research login for auth

We currently use the mdf_toolbox to perform Globus Auth logins in the SDK client. We should update this to use fair-research-login. When doing so, we should disable the local browser server so that logins work on remote machines.
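
A rough sketch of the flow with fair-research-login; the client ID and scopes are placeholders, and the exact keyword arguments should be checked against the library:

from fair_research_login import NativeClient

client = NativeClient(client_id="<native-app-client-id>")  # placeholder ID
# Disabling the local server and browser makes the copy/paste auth-code
# flow usable over SSH on remote machines.
tokens = client.login(
    requested_scopes=["openid", "profile"],
    no_local_server=True,
    no_browser=True,
)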

Proper function caching

If a user publishes a function, finds it is wrong, changes it, and re-publishes under the SAME function name, the web service still serves the old version of the function.
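
One possible fix, sketched here, is to key the cache on a hash of the function's source rather than its name, so a re-published function can never collide with a stale entry:

import hashlib
import inspect

def function_cache_key(func):
    # A changed body produces a different key even under the same name.
    source = inspect.getsource(func)
    return hashlib.sha256(source.encode()).hexdigest()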

Kubernetes support

We need to pull out the container type at the interchange and then use it to launch the pods.
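
A minimal sketch of the launch step using the official kubernetes Python client; the pod name, image, and namespace are placeholders, and a real interchange would take the image from the task's container type:

from kubernetes import client, config

def launch_worker_pod(container_image, pod_name, namespace="default"):
    # Load credentials from ~/.kube/config; inside a cluster this would be
    # config.load_incluster_config() instead.
    config.load_kube_config()
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=pod_name),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[client.V1Container(name="funcx-worker",
                                           image=container_image)],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)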

Fix race condition in directory creation

On initializing a new endpoint using the CondorProvider, the first job fails with:

Hold reason: Error from [email protected]: SHADOW at 129.74.85.10 failed to send file(s) to <10.32.86.95:44375>: error reading from /afs/crc.nd.edu/user/a/awoodard/.funcx/ndt3/worker_logs/submit_scripts/parsl.1.1569443103.6405063.script: (errno 2) No such file or directory; STARTER failed to receive file(s) from <129.74.85.10:9618>

This looks like a race condition where the output directory hasn't been created in time (subsequent jobs succeed).
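
The usual fix is to create the directory eagerly before writing the submit script; a minimal sketch:

import os

def write_submit_script(script_path, contents):
    # exist_ok=True makes creation idempotent, so concurrent jobs can race
    # on the same directory without any of them failing.
    os.makedirs(os.path.dirname(script_path), exist_ok=True)
    with open(script_path, "w") as f:
        f.write(contents)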

Add resource tracking

Collect resource usage data from the managers and have the forwarder report it to Redis for tracking, visualizations, and counter updating!
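
A minimal sketch of the collection side, assuming psutil for sampling and Redis as the sink (key and field names are illustrative):

import json

import psutil
import redis

def report_usage(r, manager_id):
    # Sample host-level usage; a real manager would also track per-worker use.
    usage = {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "mem_percent": psutil.virtual_memory().percent,
    }
    r.hset(f"manager_{manager_id}", mapping={"usage": json.dumps(usage)})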

Deduplicate requirements.txt

We should add parsl to the requirements and remove modules like IPP that are indirect requirements pulled in through parsl.

Serialization support

We want to support functions that take numpy arrays and the complex data objects common in ML, in addition to supporting JSON-like inputs. This requires serializing complex data types for transport over the web.

There are 3 different transport paths that we need to consider:

 Client Layer
  |        ^
  |        |
 web      web
  |        |
REST Service Layer
  |        |
 redis /  0mq     #<- We might have both for reliability+low latency
  |        |
Forwarder Layer
  |        |
 0mq      0mq 
  |        |
  V        |
Worker Layer

The transport package from the client could look like this :

( b'<function_id, endpoint_targets ...>',
  b'<args> + <kwargs>'
) 

The package that needs to be shipped from the hub service to the endpoint would be:

( b'<info block : task_id, container_id, cpu, mem limits ... >',
  b'<env_variables>', 
  b'serialization_method' + b'<function_body>',
  b'serialization_method' + b'<args>' + b'<kwargs>',
)

The return result package :

( b'info block: task_id, status' ,
  b'serialization_method' + b'<results>'  or b'<exception>'
)
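
A toy sketch of the method-prefixed packing, mirroring the packet layouts above (the b'02' method code is illustrative, not a real wire format):

import pickle

import numpy as np

METHOD_PICKLE = b"02"  # illustrative method code

def pack(method, payload):
    # Prefix each blob with its serialization method.
    return method + payload

def unpack(blob):
    method, payload = blob[:2], blob[2:]
    # Dispatch on the method header; only the pickle path is shown.
    assert method == METHOD_PICKLE
    return pickle.loads(payload)

args_blob = pack(METHOD_PICKLE, pickle.dumps((np.arange(3), {"alpha": 0.1})))
print(unpack(args_blob))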

Public endpoints for demonstrations

We need some public endpoints we can use for demonstrations. We should probably put some restrictions on what they can do and how often etc. but these are necessary for binder demonstrations etc.

Open questions: where to host it, is a random AWS instance sufficient? Also, do we want any restrictions on what they do?

Remove dead managers from the interesting managers list

I sometimes see in interchange.log that the interesting manager count is greater than the total manager count. My guess is that some managers reach their walltime and exit, but the interchange does not remove those dead managers from the interesting_managers list.

I have not hit an issue with this so far, but I imagine thorough testing would surface one.

2019-10-14 17:12:20.363 interchange:565 [INFO]  [MAIN] Managers count (total/interesting): 0/1
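
A sketch of the fix, assuming the interchange tracks a last-heartbeat timestamp per manager:

def prune_dead_managers(managers, interesting_managers, now, heartbeat_timeout):
    # Drop managers that have missed their heartbeat from all interchange
    # state, including the interesting_managers set.
    dead = {m for m, last_seen in managers.items()
            if now - last_seen > heartbeat_timeout}
    for m in dead:
        del managers[m]
        interesting_managers.discard(m)
    return dead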

Re-Architect endpoint

Currently the endpoint is pretty bulky: it deploys a parsl DFK with providers and all of that, for functionality already supported by the remote-interchange model. The current model also requires the user to put a config on a login machine, giving the user distributed points of control, which makes for a bad user experience. Switching the endpoint deployment to the remote-interchange model would make deployment a one-time hassle; once the endpoint is connected back, things are seamless, and configuration and control happen entirely from the web service.
This does sacrifice being able to connect directly to the interchange and send tasks to it for lower latency. (We can work on this later.)

Here's how this could work.

The user SSHes onto Theta, installs funcX, and runs the command:
funcx-endpoint start
This starts a daemon process on Theta that dials back to hub.funcx.org (say) and requests a connection as user_x from hostname_theta. The hub then creates an executor client object, and the REST reply contains the ports to which the endpoint must connect. Once the endpoint connection is made, the user gets a success message. This registration mechanism means we never have to hand the user any special info or magic string.

Once the endpoint is connected, the web service would give the user the option to configure the system. Requests for scaling are handled by issuing commands to the endpoint, which takes on the strategy, channel, and provider roles from Parsl.
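
A rough sketch of the dial-back registration, assuming a simple REST route on the hub (the URL path and response fields are placeholders):

import requests

def register_endpoint(hub_url, user, hostname):
    # Ask the hub for a connection; the reply carries the ports the endpoint
    # must connect back to, so the user never needs a magic string.
    resp = requests.post(f"{hub_url}/register",
                         json={"user": user, "hostname": hostname})
    resp.raise_for_status()
    return resp.json()  # e.g. {"task_port": 55001, "result_port": 55002}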

Submit tasks to multiple endpoints

We want to give funcX the ability to select where a task is executed. In the future we'd like to be able to specify that any endpoint (*) may be used and have funcX use manager-provided information to select a resource. As a first step, we can enable the specification of a list of endpoints and have funcX simply submit the task to each of them.
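
As a sketch of that first step, assuming a FuncXClient fxc and a registered func_id (the run() signature should be checked against the current SDK):

endpoints = ["<endpoint-uuid-a>", "<endpoint-uuid-b>"]

# Fan the same task out to every endpoint in the list.
task_ids = [fxc.run(10, endpoint_id=ep, function_id=func_id)
            for ep in endpoints]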

Managers to report warm containers

We want the managers to advertise their available containers and slots to the interchange. This is required for the interchange to do smarter routing of tasks to managers.

This is a pre-req for #4

Implement `get_result`

We should implement a get_result method for the client which returns the deserialized result. Currently one needs to instantiate their own deserializer and deserialize, something like:

from funcx.serialize import FuncXSerializer
serializer = FuncXSerializer()

result = serializer.deserialize(fxc.get_task_status(task_id)['result'])
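
A thin wrapper along those lines, building on the snippet above, might look like:

from funcx.serialize import FuncXSerializer

def get_result(fxc, task_id):
    # Return the deserialized result, or raise if the task has not finished.
    status = fxc.get_task_status(task_id)
    if "result" not in status:
        raise ValueError(f"Task {task_id} has no result yet: {status}")
    return FuncXSerializer().deserialize(status["result"])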

Condense non-debug worker logs.

With the manager spinning up and down many containers, we can get thousands or more funcx_worker_X.log files. If we're not in debug mode, then we want to have all workers write very few info/error log-lines to a single file.

Do AFTER PR merged for #5
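
A minimal sketch of the shared, sparse log, one file per node (in debug mode workers would keep their individual verbose files):

import logging

def make_worker_logger(log_dir, debug=False):
    # All workers on a node append to one file; only INFO and above are
    # kept unless debug mode is on.
    logger = logging.getLogger("funcx_worker")
    handler = logging.FileHandler(f"{log_dir}/workers.log")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(process)d %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    return logger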

Forwarder error message when deployed locally

When I deployed the forwarder locally on Theta and registered an endpoint through it, I got the traceback below after ~12 hours of running. I have not dug into the cause yet, but I want to log it here in case I forget.

Process Forwarder-1:
Traceback (most recent call last):
  File "/home/zzli/anaconda3/envs/dlhub-JPDC-theta/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/zzli/anaconda3/envs/dlhub-JPDC-theta/lib/python3.6/site-packages/forwarder/forwarder.py", line 233, in run
    self.task_loop()
  File "/home/zzli/anaconda3/envs/dlhub-JPDC-theta/lib/python3.6/site-packages/forwarder/forwarder.py", line 189, in task_loop
    fu.add_done_callback(partial(self.handle_app_update, task_id))
UnboundLocalError: local variable 'fu' referenced before assignment
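
The crash is the classic pattern where fu is only bound on one branch before add_done_callback runs. A self-contained illustration of the guard that avoids it:

from concurrent.futures import ThreadPoolExecutor
from functools import partial

def handle_update(task_id, future):
    print(task_id, future.result())

executor = ThreadPoolExecutor()
task_id = "demo"

# Initialize fu before the branch that may fail, so the callback line
# below can never hit an unbound name even if submission raises.
fu = None
try:
    fu = executor.submit(lambda: 42)
except Exception:
    print("submission failed; no callback to attach")
if fu is not None:
    fu.add_done_callback(partial(handle_update, task_id))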

DLHub+funcX

Extend the funcX endpoint to support and manage DLHub containers as different executors and route requests appropriately.

Update tutorial notebook

The current binder notebook is out of date: it has been renamed so our binder links don't work, it uses an endpoint that doesn't exist, and the function result is never retrieved.

Manage synchronous tasks

With the change to Redis and Elastic Beanstalk, all execution requests are now asynchronous and return an identifier. Thus, synchronous tasks must be supported by polling for status updates.
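
A minimal polling sketch against the client, assuming the pending-response shape shown elsewhere in these issues:

import time

def wait_for_task(fxc, task_id, poll_interval=1.0):
    # Block until the task leaves the PENDING state, then return its record.
    while True:
        status = fxc.get_task_status(task_id)
        if status.get("status") != "PENDING":
            return status
        time.sleep(poll_interval)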

Add "get_local_endpoint" function to sdk

It'd be nice to be able to resolve the UUID of the local endpoint to submit tasks to. We could add a function to the SDK that checks the ~/.funcx/funcx.cfg file for an endpoint_uuid and uses that.
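
A sketch of such a helper; the section name inside funcx.cfg is a guess:

import configparser
import os

def get_local_endpoint(cfg_path="~/.funcx/funcx.cfg"):
    # Return the endpoint_uuid recorded at registration time, if any.
    parser = configparser.ConfigParser()
    parser.read(os.path.expanduser(cfg_path))
    return parser.get("endpoint", "endpoint_uuid", fallback=None)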

Endpoints do not terminate cleanly

When funcx-endpoint stop <endpoint_name> is invoked on the endpoint rearchitect branch, the processes persist rather than terminating cleanly.

Matching python versions

Currently the executor components assume that a consistent Python 3 version and environment are available across the submit side, the interchange, and the workers. This is difficult to enforce and most likely terrible from a user-experience standpoint.

One potential solution is to relax some of the constraints and checks on version matching.

Note: We need to confirm whether it is just pickled functions alone that don't port between python versions.

If the above is right, we can get away with only enforcing python versions at the manager level.
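
A sketch of manager-level enforcement, comparing only major.minor on the assumption that pickled functions are the portability problem:

import sys

def python_version_tag():
    # Pickled functions generally survive patch releases but not minor ones.
    return f"{sys.version_info.major}.{sys.version_info.minor}"

def check_version(manager_tag, worker_tag):
    if manager_tag != worker_tag:
        raise RuntimeError(
            f"Python mismatch: manager {manager_tag} vs worker {worker_tag}")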

Add resource requirements to functions

Support specification of resource requirements (memory, CPU) for a given function. When a function is published we should be able to associate a walltime, number of CPUs, and amount of memory required by its container, and then use these when launching a pod/block.
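
Illustrative only: how such a resource spec (field names assumed) might map onto a Kubernetes resource request when launching the pod:

resource_spec = {"cpus": 4, "mem_mb": 8192, "walltime_s": 600}

# Translate the spec into the requests block of a pod's container definition.
k8s_resources = {
    "requests": {
        "cpu": str(resource_spec["cpus"]),
        "memory": f"{resource_spec['mem_mb']}Mi",
    }
}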

`get_task_status` should always return status

The output of get_task_status depends on the task status. If it is pending:

{'status': 'PENDING', 'task_id': '57776306-e86a-4870-a1d0-614598c0a7af'}

If it is complete:

{'completion_t': 1569447236.1054666,
 'result': '01\ngANYHAAAAEhlbGxvLCBhd2Vzb21lIGF3ZXNvbWUgd29ybGRxAC4=\n',
 'task_id': '57776306-e86a-4870-a1d0-614598c0a7af'}

This makes user-side programmatic handling of various task states a bit awkward. I think get_task_status should always have a field status in the returned dictionary, which says PENDING, COMPLETE, etc. Furthermore, I do not think the result should be returned when one calls get_task_status (for that users can use get_result, see #65).

Enormous Requirements List for Client

Have you considered breaking off the funcX client into a separate package?

I realize that's a maintenance burden, but just getting the client installs ~30 libraries, which seems excessive. Maybe the tradeoff is worth it once the funcX API settles down?

Excessive logging

Hi,

I registered a resource as a funcx executor and found that about a day later, interchange.log was 5.3 GB.

I see many lines telling me that I have no managers, going on to infinity:

tail -n 20 interchange.log produces the following

2019-10-15 20:42:14.939 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.949 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.960 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.970 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.981 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.991 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.002 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.012 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.023 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.033 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.044 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.054 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.065 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.075 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.086 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.096 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.107 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.117 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.128 interchange:563 [INFO]  [MAIN] Managers count (total/interesting): 0/0
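
One way to tame this, sketched below, is to log the counts at INFO only when they change and demote the steady-state line to DEBUG:

import logging

log = logging.getLogger("interchange")
_last_counts = None

def log_manager_counts(total, interesting):
    # INFO only on change; an idle endpoint no longer grows the log by
    # gigabytes of identical lines.
    global _last_counts
    level = logging.INFO if (total, interesting) != _last_counts else logging.DEBUG
    log.log(level, "[MAIN] Managers count (total/interesting): %s/%s",
            total, interesting)
    _last_counts = (total, interesting)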

Add file arg to 'funcx-endpoint configure'

We should add an argument to 'funcx-endpoint configure -X ...' that lets us point to an external config file. This will help with our debugging, and with any additional use cases that run a lot of endpoints (i.e., they won't need to change the default template each time).

Store endpoint config locally

We should record the uuid assigned to an endpoint in a ~/.funcx/funcx.cfg file and retrieve it when we start the endpoint.

Create binder examples

We need to update the examples, add more, and make them runnable in binder. Anyone want to have a crack at this? I believe ZZ has some experience with using Binder.

On-node endpoint

We need to set up a reliable on-node endpoint, minus the web service, to which we can send functions, and make it part of our CI workflow.

  • Add on-node endpoint config + setup scripts
  • Add travis yml file to start the endpoint
  • Ideally we want a couple of tests to make sure it works.

Use pickle to serialize remote exceptions

Traceback (most recent call last):
  File "/afs/crc.nd.edu/user/a/awoodard/.local/lib/python3.6/site-packages/funcx/serialize/facade.py", line 60, in serialize
    serialized = method.serialize(data)
  File "/afs/crc.nd.edu/user/a/awoodard/.local/lib/python3.6/site-packages/funcx/serialize/concretes.py", line 21, in serialize
    x = json.dumps(data)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/encoder.py", line 180, in default
    o.__class__.__name__)
TypeError: Object of type 'RemoteExceptionWrapper' is not JSON serializable
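
A minimal sketch of the fallback: try JSON first for portability and use pickle for objects, like RemoteExceptionWrapper, that JSON cannot handle (the codec tags are illustrative):

import json
import pickle

def serialize_result(obj):
    # Tag the payload with the codec used so the receiver knows how to decode.
    try:
        return ("json", json.dumps(obj))
    except TypeError:
        return ("pickle", pickle.dumps(obj))

def deserialize_result(tag, payload):
    return json.loads(payload) if tag == "json" else pickle.loads(payload)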
