
Contributors

blazekadam, dependabot[bot], floopcz, gdynusa, ludakas, petrbel, teyras, zatkovav

shepherd's Issues

Stress tests: store large files into the storage

@Teyras in #55:

I would like to test 1) sending something like 100kB in the payload (anything more is a very bad idea) 2) uploading a ginormous file (I'm talking e.g. 2GB) directly to Minio and processing it.

and

sending big stuff in the payload is probably common, but it results in a payload file that is unreadable with common text editors. I would recommend storing large blobs as standalone files (and maybe add an endpoint for that).
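The threshold idea above could be sketched as a small helper. This is a hypothetical function (`payload_placement` and the 100 kB limit are illustrative, not part of shepherd's API) deciding whether a payload travels inline in the request body or as a standalone object in storage:

```python
# Hypothetical helper: decide whether a payload travels inline in the
# request body or as a standalone file in the job's storage bucket.
INLINE_LIMIT = 100 * 1024  # ~100 kB, per the discussion above


def payload_placement(payload: bytes) -> str:
    """Return 'inline' for small payloads, 'standalone' for large blobs."""
    return "inline" if len(payload) <= INLINE_LIMIT else "standalone"
```

Large blobs would then be uploaded to Minio directly (as in the 2 GB test case) instead of ending up in an unreadable payload file.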

Refactor shepherd

  • Shepherd should only expose start/stop, the rest should be private
  • Greenlets should not be started in init
  • Handles to all greenlets must be kept and correctly terminated in close()
  • We should rethink where sheep-related stuff (public BaseSheep attributes) is kept (in_progress, queue, socket, ...)
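A minimal sketch of the proposed surface, using threads in place of greenlets for illustration (the real code uses gevent; names here are assumptions, not shepherd's actual API):

```python
import threading


class Shepherd:
    """Sketch: only start() and stop() are public, everything else private."""

    def __init__(self):
        # Nothing is spawned here -- workers are started in start(), per the
        # second bullet above.
        self._workers = []
        self._stopping = threading.Event()

    def start(self):
        # Keep a handle to every worker so stop() can terminate them all.
        worker = threading.Thread(target=self._listen, daemon=True)
        self._workers.append(worker)
        worker.start()

    def stop(self):
        self._stopping.set()
        for worker in self._workers:
            worker.join(timeout=1.0)

    def _listen(self):
        # Private, as proposed; a real implementation would process jobs here.
        self._stopping.wait()
```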

ConnectionResetError while running load tests

Related to #55

Traceback (most recent call last):
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/web_protocol.py", line 447, in start
    await resp.prepare(request)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/web_response.py", line 353, in prepare
    return await self._start(request)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/web_response.py", line 667, in _start
    return await super()._start(request)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/web_response.py", line 410, in _start
    await writer.write_headers(status_line, headers)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/http_writer.py", line 112, in write_headers
    self._write(buf)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/http_writer.py", line 67, in _write
    raise ConnectionResetError('Cannot write to closing transport')
ConnectionResetError: Cannot write to closing transport

Rethink async architecture

I think that good things might happen if we ditch gevent and use asyncio from the Python (3.5+) standard library. Some semi-random thoughts on that follow:

  1. We could use an async s3 client (like this or this) to download payloads of queued jobs in advance while a sheep is eating something. This would be a good opportunity to abstract Minio stuff away to a service so that we can swap file storage implementations if needed.
  2. We might have to rethink how the shepherd API works. Currently, we use a WSGI server from gevent that takes care of everything for us. Off the top of my head, we could replace Flask with Sanic - a fast, async, pure-python HTTP server with an API similar to Flask. We would have to make some adjustments to apistrap though (and it's hard to tell how hard it would be). However, it seems that you can also call asyncio tasks from ordinary Flask routes too. That would mean we could run the API without modifications using e.g. aiohttp-wsgi.
  3. PyZMQ supports asyncio.
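Point 1 could be sketched with an `asyncio.Queue` as the prefetch buffer. The `fake_download` stand-in below is an assumption; a real implementation would call an async S3/Minio client:

```python
import asyncio


async def prefetch_payloads(job_ids, download, queue):
    """Download payloads of queued jobs ahead of time (point 1 above)."""
    for job_id in job_ids:
        payload = await download(job_id)  # would be an async S3/Minio call
        await queue.put((job_id, payload))


async def main():
    # Bounded queue: prefetch a little, not everything at once.
    queue = asyncio.Queue(maxsize=2)

    async def fake_download(job_id):  # stands in for the async s3 client
        await asyncio.sleep(0)
        return b"payload-" + job_id.encode()

    producer = asyncio.create_task(
        prefetch_payloads(["a", "b"], fake_download, queue))
    results = [await queue.get() for _ in range(2)]
    await producer
    return results
```

The sheep keeps "eating" its current job while the producer task fills the queue in the background.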

Debug mode: propagate exc. stacktrace through API

Similarly to apistrap, it would be nice to enable the worker's debug mode and let it pass the exception stack trace through the API response.

Example of out-of-worker response

code: 500
message: Couldn't obtain results of `123`
debug_data: my_runner.py line 68: invalid...
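A sketch of how the response above could be built (the `DEBUG` flag and `error_response` helper are assumptions for illustration, not existing shepherd code):

```python
import traceback

DEBUG = True  # would come from shepherd's configuration


def error_response(message: str, exc: Exception) -> dict:
    """Build an error body; attach the stack trace only in debug mode."""
    body = {"code": 500, "message": message}
    if DEBUG:
        body["debug_data"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__))
    return body
```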

Environmental variables preceded by $

Parsing environment variables in the config file exhibits unexpected behavior when the variable is preceded by a string containing the $ character. For instance:

$321/$MY_ENV_VAR
KEY-A29$56Y$12-${KEY_SUFFIX}
This does not have to be supported, but it would be nice to terminate gracefully (e.g., with an exception).
The regex r'([^$]*)\$([A-Z_][A-Z_0-9]*)' is to blame.

(By the way, support for escaping $ via \ would be a nice-to-have, but not necessary at the moment.)

As a sidenote, I would personally prefer to have two different regexes. One for matching the lines containing an env variable and a smaller one used for substitution. That is, roughly something like

re_env  = r'\$([A-Z_][A-Z_0-9]*)'
re_line = r'^.*' + re_env + r'.*$'
# The second one does not even need its own named variable,
# it will only be used once in the call to `add_implicit_resolver`.

With this, the issue would probably be solved immediately.
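A sketch of the substitution half under these assumptions (only the bare `$VAR` form is handled here, not `${VAR}`; `substitute` is an illustrative helper, not shepherd's actual parser). An unknown variable raises `KeyError` instead of silently mangling the string, which gives the graceful termination wished for above:

```python
import os
import re

# The "smaller" regex used for substitution.
RE_ENV = r'\$([A-Z_][A-Z_0-9]*)'


def substitute(value: str, env=os.environ) -> str:
    """Replace every $VAR occurrence; raises KeyError for unknown variables."""
    return re.sub(RE_ENV, lambda m: env[m.group(1)], value)
```

Note that `$321` is left untouched because `3` cannot start a variable name, which resolves the first example above.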

Check Minio connection more intelligently

Currently, we check whether Minio is accessible when shepherd is starting, i.e. at a time when it 1) doesn't really matter that much and 2) is quite likely that Minio is in fact not accessible. This leads to some horrible hacks such as putting sleep in our Dockerfiles. Also, if Minio breaks down later, we might find out only when users start complaining.

I propose we think of a better way of checking Minio connectivity and reporting failures.
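One possible shape for this, sketched with a generic `check` callable (in practice it would call something like the Minio client's bucket listing; the function and its parameters are assumptions, not existing code):

```python
import time


def wait_for_storage(check, attempts=5, delay=0.01):
    """Poll the storage until it responds, instead of a one-shot check at
    start-up (or a sleep in the Dockerfile)."""
    for attempt in range(attempts):
        try:
            check()  # e.g. a cheap Minio call such as listing buckets
            return True
        except ConnectionError:
            time.sleep(delay * (2 ** attempt))  # exponential back-off
    return False
```

The same polling could run periodically after start-up, so a later Minio outage is reported in the logs before users start complaining.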

Example setup cannot process jobs on a Mac

Request:

{
    "model": {
        "name": "cxflow-test",
        "version": "latest"
    },
    "job_id": "job-05",
    "payload": "{\"key\": [42]}"
}

Trace:

Traceback (most recent call last):
  File "/Users/veronika/PycharmProjects/cxworker/cxworker/runner/base_runner.py", line 128, in process_all
    self._process_job(input_path, output_path)
  File "/Users/veronika/PycharmProjects/cxworker/cxworker/runner/json_runner.py", line 80, in _process_job
    self._load_dataset()
  File "/Users/veronika/PycharmProjects/cxworker/cxworker/runner/base_runner.py", line 90, in _load_dataset
    self._dataset = create_dataset(self._config, None)
  File "/usr/local/lib/python3.7/site-packages/cxflow/cli/common.py", line 88, in create_dataset
    dataset = create_object(dataset_module, dataset_class, args=(yaml_to_str(dataset_config),))
  File "/usr/local/lib/python3.7/site-packages/cxflow/utils/reflection.py", line 61, in create_object
    return get_attribute(module_name, class_name)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/cxflow/utils/reflection.py", line 39, in get_attribute
    _module = importlib.import_module(module_name)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'examples.docker'

Make shepherd robust to io fails

A number of operations in the _listen coroutine may fail, e.g.:

shutil.rmtree(working_directory)

which effectively kills the coroutine, and shepherd is unable to recover from it.

Shepherd should be able to handle these cases (and, of course, report them in the logs).

This can be reproduced by running two shepherds with the same working directory.
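A minimal sketch of wrapping the clean-up so the coroutine survives (the helper name and return contract are assumptions; shepherd's actual _listen would call something like this instead of a bare rmtree):

```python
import logging
import shutil

logger = logging.getLogger("shepherd")


def remove_working_directory(working_directory: str) -> bool:
    """Clean up a job's directory without killing the caller on failure."""
    try:
        shutil.rmtree(working_directory)
        return True
    except OSError:
        # Another shepherd sharing the working directory (or a previous
        # crash) may have removed it already -- log and carry on.
        logger.exception("Failed to remove %s", working_directory)
        return False
```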

Remove emloop dependency from bare sheep

I think that shepherd is too tightly bound to emloop. For example, it's currently impossible to create a runner without a model config (even an empty one).

We should talk about and redesign the runner architecture.

Support one-shot model "deployments"

The intended use case is:

  1. train an awesome model
  2. run something like shepherd oneshot -p 6666 ./an-awesome-model
  3. either
    a. wrap the whole affair with Docker
    b. point demotool at :6666 and impress people.

Finalize docs

  • Generate the API reference into the docs
  • Separate BaseRunner and CXRunner and explain them better
  • Include example responses in the introduction
  • Include direct payload and result endpoints in the docs

Can't read/write from/to default input/output file

It happened just now, out of nowhere.

2018-10-19 11:48:09.588: ERROR   @base_runner : [Errno 2] No such file or directory: '/tmp/shepherd-data/bare_sheep/d15c23aa-39b8-4d1a-88ff-e8e09fbc8734/outputs/output'                                                                                                   
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/shepherd/runner/base_runner.py", line 128, in process_all
    self._process_job(input_path, output_path)
  File "/isletnet/model/isletnet_runner.py", line 177, in _process_job
    with open(path.join(output_path, DEFAULT_OUTPUT_FILE), 'w') as file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/shepherd-data/bare_sheep/d15c23aa-39b8-4d1a-88ff-e8e09fbc8734/outputs/output'                                                                                                                                
2018-10-19 11:48:09.588: ERROR   @base_runner : Sending ErrorMessage for job `d15c23aa-39b8-4d1a-88ff-e8e09fbc8734

After another payload, even reading the input file fails. Any ideas @Teyras @FloopCZ @blazekadam ?

What's even worse: even though the runner clearly fails, shepherd doesn't respond with an error (so you're waiting forever).

I wasn't able to fix it by recreating the containers, so I deleted all related volumes, killed and removed all related containers, and ran docker-compose up once again. This helped. No idea what's going on.

Re-consider sheep queue -> socket design

Follow-up of discussion on #21

Teyras 11 hours ago
Why can there be more than one job in progress?

@blazekadam
blazekadam 3 hours ago
Because a job is in in_progress as soon as you prepare its folder and send an InputMessage to the socket. There can be multiple messages (and jobs) waiting in the socket to be processed by the runner. If you kill or lose the runner, you need to resolve all the jobs already sent to the socket. In this case, the in_progress set of job ids comes in handy. (In fact, a sheep/runner is killed only when there are no jobs in_progress.)

It may not be the best strategy, though. I can imagine ensuring there is only one such job and sending the next one only after the current job is done. With this strategy, we would never lose more than one job when the runner crashes. Anyway, runner crashes should be quite rare, so I don't think we need to rework this now.

Btw, there are some other (imo bigger) issues. In case the runner gets stuck (i.e., it is still running but not responding or sending messages), we would never attempt to restart it. I think each job should have a certain (configurable?) timeout after which we give up, consider it an error, and restart the sheep.

@Teyras
Teyras 2 hours ago
OK, let's make an issue for this and move on.

Related to #40
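The timeout idea from the discussion above could look roughly like this, sketched in asyncio terms (shepherd currently uses gevent; `run_job`, `restart_sheep` and the timeout value are illustrative assumptions):

```python
import asyncio

JOB_TIMEOUT = 0.05  # seconds; would be configurable, per the discussion


async def run_job(job, restart_sheep):
    """Give up on a stuck runner after a timeout and restart the sheep."""
    try:
        return await asyncio.wait_for(job, timeout=JOB_TIMEOUT)
    except asyncio.TimeoutError:
        # The runner is still alive but not responding -- treat the job as
        # an error and recycle the sheep.
        await restart_sheep()
        return None
```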

Job gets stuck when BaseSheep can't write to stdout/err file

  1. Set shepherd config like this: sheep.bare_sheep.stdout_file: /var/local/stdout.txt (i.e. the file isn't writable, but the dir exists)
  2. Run shepherd
  3. Start a new job (id=123)
  4. The sheep raises Permission denied (and probably dies?)...
  5. ... however, the currently processed job (id=123) seems to be being computed (it is not!)
  6. so result keeps returning 202 and you keep waiting (forever)
  7. btw, starting a new job (id=456) will naturally succeed
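One way to fail fast instead of hanging forever: check the configured stdout/stderr paths before accepting jobs. The helper below is a sketch (its name and signature are assumptions, not shepherd's API):

```python
import os


def check_log_files_writable(*paths):
    """Raise at sheep start-up if a configured log file cannot be written,
    instead of letting the job hang at step 5 above."""
    for path in paths:
        directory = os.path.dirname(path) or "."
        if not os.access(directory, os.W_OK) or (
                os.path.exists(path) and not os.access(path, os.W_OK)):
            raise PermissionError(f"Cannot write to {path}")
```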

Refactor Messenger

Messenger should be either a regular class or a module. A regular class instance could contain a socket and we could use it to abstract our code from ZMQ communication (easier testing). On the other hand, using a module would convey that there is no hidden state.
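The regular-class variant could look like this; the socket is injected, so tests can pass a fake instead of a real ZMQ socket. The method names and message frame layout are illustrative assumptions:

```python
class Messenger:
    """Wraps a (ZMQ-like) socket, abstracting the rest of the code away
    from the wire protocol -- and making testing easier."""

    def __init__(self, socket):
        self._socket = socket

    def send_input(self, job_id: str, payload: bytes):
        # Hypothetical frame layout for an InputMessage.
        self._socket.send_multipart([b"input", job_id.encode(), payload])

    def receive(self):
        return self._socket.recv_multipart()
```

A module, by contrast, would keep the current stateless feel but make it harder to swap the transport out in tests.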

Don't start minio in the test subprocess

This is actually really ugly.

Minio is a well-separated project (from shepherd), and I can't see any reason to force a developer (me) to have minio installed on bare metal. On the contrary, it is completely fine to have minio run in a container on a specified port.

So I suggest:

  1. removing the minio execution from the test subprocess
  2. letting the user handle minio execution (docker example: docker run -it -p 9000:9000 minio/minio server --address=0.0.0.0:9000 /export)

Would love to see some comments on this @Teyras @blazekadam @FloopCZ @gdynusa
