globus / globus-compute
Globus Compute: High Performance Function Serving for Science
Home Page: https://www.globus.org/compute
License: Apache License 2.0
Support specification of resource requirements (mem, cpu) for a given function. When a function is published we should be able to associate a walltime, number of cpus, and amount of memory required by the container and then need to use these when launching a pod/block.
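As a sketch, the registration record could carry these limits alongside the function id; the field names here (walltime_s, cpus, mem_mb) are assumptions for illustration, not the real schema:

```python
# Hypothetical sketch: attach resource requirements at function
# registration time so the endpoint can use them when launching a
# pod/block. Field names are assumptions, not the real funcX schema.
def build_registration_record(function_id, walltime_s, cpus, mem_mb):
    return {
        'function_id': function_id,
        'limits': {
            'walltime_s': walltime_s,  # max runtime for the container
            'cpus': cpus,              # number of CPUs requested
            'mem_mb': mem_mb,          # memory limit in MB
        },
    }
```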
Currently `funcx-endpoint list` only lists local endpoints; it would be useful to list all available/online endpoints, including remote ones.
From Whit.
We should add parsl to the requirements and remove modules such as IPP that are only indirect dependencies pulled in through parsl.
Reported by @aprilyw
We currently use the mdf_toolbox to perform Globus Auth logins in the SDK client. We should update this to use the Fair Research Login. When doing this we should disable the local browser so that logins will work on remote machines.
We should implement a `get_result` method for the client which returns the deserialized result. Currently one needs to instantiate their own deserializer and deserialize, something like:
from funcx.serialize import FuncXSerializer
serializer = FuncXSerializer()
result = serializer.deserialize(fxc.get_task_status(task_id)['result'])
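A minimal sketch of such a `get_result` helper, using stand-in serializer and client classes (both hypothetical) so the example is self-contained:

```python
class StubSerializer:
    """Stand-in for funcx.serialize.FuncXSerializer (hypothetical)."""
    def deserialize(self, payload):
        # Real deserialization would decode the wire format; we just echo.
        return payload

def get_result(client, task_id, serializer=None):
    """Fetch a task's status and return its deserialized result."""
    serializer = serializer or StubSerializer()
    status = client.get_task_status(task_id)
    if 'result' not in status:
        raise Exception("Task {} has no result yet".format(task_id))
    return serializer.deserialize(status['result'])
```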
It'd be nice to be able to resolve the uuid of the local endpoint to submit tasks to. It seems we could add a function to the SDK to check the ~/.funcx/funcx.cfg file for an endpoint_uuid and use that.
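A sketch of that lookup, assuming the config is INI-style with an `[endpoint]` section holding `endpoint_uuid` (the file format here is an assumption):

```python
import configparser
import os

def local_endpoint_uuid(cfg_path='~/.funcx/funcx.cfg'):
    """Read endpoint_uuid from the funcX config if present.

    The [endpoint] section and endpoint_uuid key are assumed, not the
    documented format. Returns None if the file or key is missing.
    """
    path = os.path.expanduser(cfg_path)
    if not os.path.exists(path):
        return None
    cfg = configparser.ConfigParser()
    cfg.read(path)
    return cfg.get('endpoint', 'endpoint_uuid', fallback=None)
```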
Somehow the second execution always seems to fail on a petrelkube node.
The output of `get_task_status` depends on the task status. If it is pending:
{'status': 'PENDING', 'task_id': '57776306-e86a-4870-a1d0-614598c0a7af'}
If it is complete:
{'completion_t': 1569447236.1054666,
'result': '01\ngANYHAAAAEhlbGxvLCBhd2Vzb21lIGF3ZXNvbWUgd29ybGRxAC4=\n',
'task_id': '57776306-e86a-4870-a1d0-614598c0a7af'}
This makes user-side programmatic handling of various task states a bit awkward. I think `get_task_status` should always have a `status` field in the returned dictionary, which says `PENDING`, `COMPLETE`, etc. Furthermore, I do not think the result should be returned when one calls `get_task_status` (for that users can use `get_result`, see #65).
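One way to sketch the proposed behavior is a small normalization step over the two response shapes shown above; the helper itself is hypothetical:

```python
def normalize_status(raw):
    """Normalize a get_task_status response so 'status' is always present.

    Field names are taken from the example responses above; the
    COMPLETE/UNKNOWN labels are assumptions. The raw result blob is
    deliberately dropped, per the proposal that get_result handles it.
    """
    out = {'task_id': raw['task_id']}
    if 'status' in raw:
        out['status'] = raw['status']
    elif 'completion_t' in raw:
        out['status'] = 'COMPLETE'
        out['completion_t'] = raw['completion_t']
    else:
        out['status'] = 'UNKNOWN'
    return out
```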
Have you considered breaking off the funcX client into a separate package?
I realize that's a maintenance burden but just getting the client installs like 30 libraries, which seems excessive. Maybe the tradeoff is worth it once the funcX API settles down?
Currently logging from parsl providers is muted. To debug provider issues, we need those to appear somewhere.
Currently the executor assumes that every worker is uniform, running the same container. However, for better resource management we want the manager to be able to start, replace, and otherwise manage containers based on the needs of the tasks it receives.
For this to work we'll need to update several components:
There's a bunch of caching/scavenging logic that needs to happen here. Might be worthwhile to read up on distributed caching.
Currently `client.get_result` raises a bare `Exception` if the task is pending, which isn't very useful for error handling. It would be nice if users could implement a try/except clause catching a dedicated exception.
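A minimal sketch of such a dedicated exception (the name `TaskPending` is an assumption, not the actual class funcX later adopted):

```python
class TaskPending(Exception):
    """Raised when a result is requested for a task that has not finished.

    Hypothetical exception class; the name and fields are assumptions.
    """
    def __init__(self, task_id):
        super().__init__("Task {} is still pending".format(task_id))
        self.task_id = task_id

def get_result(status_blob):
    """Return the result, or raise TaskPending if it is not ready."""
    if 'result' not in status_blob:
        raise TaskPending(status_blob['task_id'])
    return status_blob['result']
```

Users could then write `try: get_result(...) / except TaskPending: retry later` instead of catching a bare `Exception`.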
But data flows in cleartext.
We want to support functions that take numpy arrays and other complex data objects that are common in ML, in addition to supporting JSON-like inputs. This would require support for serializing complex data types for transport over the web.
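A sketch of one common approach: try JSON first and fall back to pickle plus base64 for objects JSON cannot handle, tagging each payload with the method used (the `json:`/`pkl:` prefixes are assumptions for illustration, not funcX's actual wire format):

```python
import base64
import json
import pickle

def serialize(obj):
    """Serialize obj, falling back to pickle+base64 for non-JSON types."""
    try:
        return 'json:' + json.dumps(obj)
    except TypeError:
        return 'pkl:' + base64.b64encode(pickle.dumps(obj)).decode()

def deserialize(payload):
    """Invert serialize() by dispatching on the method prefix."""
    method, _, body = payload.partition(':')
    if method == 'json':
        return json.loads(body)
    return pickle.loads(base64.b64decode(body))
```

Note pickle payloads are only portable between compatible Python versions, which ties into the version-matching concern discussed later in this list.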
There are 3 different transport paths that we need to consider:
Client Layer
| ^
| |
web web
| |
REST Service Layer
| |
redis / 0mq #<- We might have both for reliability+low latency
| |
Forwarder Layer
| |
0mq 0mq
| |
V |
Worker Layer
The transport package from the client could look like this :
( b'<function_id, endpoint_targets ...>',
b'<args> + <kwargs>'
)
The package that needs to be shipped from the hub service to the endpoint would be:
( b'<info block : task_id, container_id, cpu, mem limits ... >',
b'<env_variables>',
b'serialization_method' + b'<function_body>',
b'serialization_method' + b'<args>' + b'<kwargs>',
)
The return result package :
( b'info block: task_id, status' ,
b'serialization_method' + b'<results>' or b'<exception>'
)
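A minimal sketch of building the two-frame return package above; the frame layout and field names follow the description but are assumptions, not a finalized protocol:

```python
import json

def pack_result(task_id, status, method, body):
    """Build the two-frame result package: an info frame and a
    method-prefixed payload frame (result or exception bytes).

    Layout is the sketch from the issue text, not a finalized protocol.
    """
    info = json.dumps({'task_id': task_id, 'status': status}).encode()
    return (info, method + body)
```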
We need to support automatically pulling down ECR containers rather than assuming the user will do it themselves before running functions.
The current binder notebook is out of date: it has been renamed and our binder links don't work, it uses an endpoint that doesn't exist, and the function result is never retrieved.
We don't seem to be storing the result from the function invocation in the database's results table for async tasks.
We need to move all the provider+channel logic down to the interchange.
We need to set up a reliable on-node endpoint (minus the web service) to which we can send functions, and make this part of our CI workflow.
I sometimes see in `interchange.log` that the interesting manager count is greater than the total manager count. My guess is that some managers reach their walltime and leave, but the interchange does not remove those dead managers from the `interesting_manager` list.
I have not hit any issue with this so far, but I imagine thorough testing would surface one.
2019-10-14 17:12:20.363 interchange:565 [INFO] [MAIN] Managers count (total/interesting): 0/1
Extend the funcX endpoint to support and manage DLHub containers as different executors and route requests appropriately.
We want to give funcX the ability to select where a task is executed. In the future we'd like to be able to specify any endpoint (*) be used and have funcX use manager-provided information to select a resource. As a first step we can enable the specification of a list of endpoints and have funcX simply submit a task to each of them.
On initializing a new endpoint using the `CondorProvider`, the first job fails with:
Hold reason: Error from [email protected]: SHADOW at 129.74.85.10 failed to send file(s) to <10.32.86.95:44375>: error reading from /afs/crc.nd.edu/user/a/awoodard/.funcx/ndt3/worker_logs/submit_scripts/parsl.1.1569443103.6405063.script: (errno 2) No such file or directory; STARTER failed to receive file(s) from <129.74.85.10:9618>
This looks like a race condition where the output directory hasn't been created in time (subsequent jobs succeed).
We need to pull out the container type at the interchange and then use it to launch the pods.
Hi,
I registered a resource as a funcX executor and found that about a day later, `interchange.log` was 5.3 GB.
I see many lines telling me that I have no managers, going on to infinity. `tail -n 20 interchange.log` produces the following:
2019-10-15 20:42:14.939 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.949 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.960 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.970 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.981 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:14.991 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.002 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.012 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.023 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.033 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.044 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.054 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.065 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.075 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.086 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.096 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.107 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.117 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
2019-10-15 20:42:15.128 interchange:563 [INFO] [MAIN] Managers count (total/interesting): 0/0
It is clunky to have to run `funcx-endpoint start <endpoint>` twice. I propose we rename the first start to `funcx-endpoint configure <endpoint>`.
We need some public endpoints we can use for demonstrations. We should probably put some restrictions on what they can do and how often etc. but these are necessary for binder demonstrations etc.
Open questions: where to host it, is a random AWS instance sufficient? Also, do we want any restrictions on what they do?
Currently the executor components assume that a consistent Python3 version and environment is available across the submit side, the interchange, and the workers. This is difficult and most likely terrible to enforce from a user-experience standpoint.
One potential solution is to relax some of the constraints and checks on version matching.
Note: We need to confirm whether it is just pickled functions alone that don't port between python versions.
If the above is right, we can get away with only enforcing python versions at the manager level.
When `funcx-endpoint stop <endpoint_name>` is invoked on the endpoint rearchitect branch, the processes persist rather than terminating cleanly.
Currently the endpoint is pretty bulky: it deploys a Parsl DFK with providers and all of that, for functionality already supported by the remote interchange model. The current model also requires the user to put a config on a login machine, which gives the user distributed points of control and makes for a bad user experience. Switching to the remote interchange model for endpoint deployment means the deployment becomes a one-time hassle; once the endpoint is connected back, things are seamless, and configuration and control happen entirely from the web service.
This does sacrifice being able to directly connect to the interchange and send tasks to it for lower latency. (We can work on this later).
Here's how this could work.
The user ssh's onto Theta, installs funcX, and runs the command:
funcx-endpoint start
This starts a daemon process on Theta that dials back to `hub.funcx.org` (say) and requests a connection as `user_x` from `hostname_theta`. The hub then creates an executor client object, and the REST reply contains the ports to which the endpoint must connect. Once the endpoint connection is made, the user gets a success message. This registration mechanism means we do not have to tell the user any special info or magic string.
Once the endpoint is connected, the webservice would give the user the option to configure the system. Requests for scaling are handled by issuing command to the endpoint which is responsible for taking on the strategy+channel+provider roles in Parsl.
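The registration handshake described above could be sketched as follows; the payload fields and the `task_port`/`result_port` reply keys are assumptions for illustration, not the real hub API:

```python
import json

def build_registration(user, hostname):
    """Build the dial-back registration request (fields assumed)."""
    return json.dumps({'user': user, 'hostname': hostname})

def parse_registration_reply(reply):
    """Extract the ports the endpoint must connect to from the hub's
    REST reply. Key names are assumptions, not the real protocol."""
    data = json.loads(reply)
    return data['task_port'], data['result_port']
```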
We should add an argument to 'funcx-endpoint configure -X...' that allows us to point to an external config file. This will help with our debugging and perhaps any additional use cases that want to run a lot of endpoints (i.e., they won't need to change the default-template each time).
or some other auth practices. Need to change the web services
Traceback (most recent call last):
File "/afs/crc.nd.edu/user/a/awoodard/.local/lib/python3.6/site-packages/funcx/serialize/facade.py", line 60, in serialize
serialized = method.serialize(data)
File "/afs/crc.nd.edu/user/a/awoodard/.local/lib/python3.6/site-packages/funcx/serialize/concretes.py", line 21, in serialize
x = json.dumps(data)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/cvmfs/sft.cern.ch/lcg/releases/Python/3.6.5-54b64/x86_64-centos7-gcc7-opt/lib/python3.6/json/encoder.py", line 180, in default
o.__class__.__name__)
TypeError: Object of type 'RemoteExceptionWrapper' is not JSON serializable
When I deployed the forwarder locally on Theta and registered an endpoint through it, I got the Traceback below after ~12 hours of running. I have not dug into the cause yet, but I want to log it here before I forget.
Process Forwarder-1:
Traceback (most recent call last):
File "/home/zzli/anaconda3/envs/dlhub-JPDC-theta/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/zzli/anaconda3/envs/dlhub-JPDC-theta/lib/python3.6/site-packages/forwarder/forwarder.py", line 233, in run
self.task_loop()
File "/home/zzli/anaconda3/envs/dlhub-JPDC-theta/lib/python3.6/site-packages/forwarder/forwarder.py", line 189, in task_loop
fu.add_done_callback(partial(self.handle_app_update, task_id))
UnboundLocalError: local variable 'fu' referenced before assignment
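The traceback suggests `fu` is only bound when task submission succeeds, so the callback line must be guarded. A minimal sketch of the fix pattern (the forwarder internals here are hypothetical stand-ins, not the actual `task_loop` code):

```python
def task_loop_step(submit, task, on_done):
    """One iteration of a task loop: submit a task and attach a callback
    only if submission actually produced a future.

    submit/on_done are hypothetical stand-ins for the forwarder's
    executor submit call and handle_app_update callback.
    """
    fu = None
    try:
        fu = submit(task)
    except Exception:
        pass  # submission failed; fu stays None instead of being unbound
    if fu is not None:
        fu.add_done_callback(on_done)
    return fu
```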
At the moment, if a function produces an exception, it is not returned to the user. Reported by @aprilyw
Collect resource usage data from the managers and have the forwarder report it to Redis for tracking, visualizations, and counter updating!
Currently we have https://funcx.readthedocs.io/en/latest/ pointed at nothing. We should get started with:
We should record the uuid assigned to an endpoint in a ~/.funcx/funcx.cfg file and retrieve it when we start the endpoint.
With the change to Redis and Elastic Beanstalk, all execution requests are now asynchronous and return an identifier. Thus, to support synchronous tasks, we need to poll for status updates.
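A sketch of the client-side polling loop (the helper and its parameters are assumptions; `get_status` stands in for the SDK's `get_task_status` call):

```python
import time

def wait_for_result(get_status, task_id, timeout=60, interval=0.5):
    """Poll task status until a result appears or timeout elapses.

    get_status is a stand-in for the SDK's get_task_status; the
    timeout/interval defaults are arbitrary choices for illustration.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status(task_id)
        if 'result' in status:
            return status['result']
        time.sleep(interval)
    raise TimeoutError("Task {} did not complete in {}s".format(task_id, timeout))
```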
Currently we use a key-value set for tasks and a separate one for results from Redis as the communication layer between the forwarder and the web-service. This makes it harder to track tasks. We should instead switch to a single tasks hashmap, that will be updated with the task block on registration, and the result blob on completion. In between, based on task info available, states could be set, which would allow for much greater visibility into task progress.
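The single-hashmap-per-task idea could be sketched like this; Redis HSET semantics are mimicked with a plain dict so the example is self-contained, and the key layout and state names are assumptions:

```python
# In production this dict would be a Redis hash per task
# (e.g. HSET task:<id> field value); a dict keeps the sketch runnable.
tasks = {}

def register_task(task_id, function_id, payload):
    """Create the task hash at registration with the task block."""
    tasks[task_id] = {'function_id': function_id,
                      'payload': payload,
                      'status': 'WAITING'}

def set_status(task_id, status):
    """Update intermediate state for visibility into task progress."""
    tasks[task_id]['status'] = status

def complete_task(task_id, result):
    """Attach the result blob to the same hash on completion."""
    tasks[task_id].update(status='COMPLETE', result=result)
```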
If a user published a function, found it wrong, changed it, and re-published it with the SAME function name, the web service still used the old version of the function.
With the manager spinning up and down many containers, we can get thousands or more `funcx_worker_X.log` files. If we're not in debug mode, we want all workers to write only a few info/error log lines to a single shared file.
Do AFTER PR merged for #5
We want the managers to advertise their available containers and slots to the interchange. This is required to allow the interchange to do smarter routing of tasks to managers.
This is a pre-req for #4
We need to update the examples, add more, and make them runnable in binder. Anyone want to have a crack at this? I believe ZZ has some experience with using Binder.