
Contributors

blazekadam, dependabot[bot], floopcz, gdynusa, ludakas, petrbel, teyras, zatkovav

shepherd's Issues

Stress tests: store large files into the storage

@Teyras in #55:

I would like to test 1) sending something like 100kB in the payload (anything more is a very bad idea) 2) uploading a ginormous file (I'm talking e.g. 2GB) directly to Minio and processing it.

and

sending big stuff in the payload is probably common, but it results in a payload file that is unreadable with common text editors. I would recommend storing large blobs as standalone files (and maybe add an endpoint for that).
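The threshold idea above could be sketched as a small helper. This is a hypothetical function (`payload_placement` and the 100 kB limit are illustrative, not part of shepherd's API) deciding whether a payload travels inline in the request body or as a standalone object in storage:

```python
# Hypothetical helper: decide whether a payload travels inline in the
# request body or as a standalone file in the job's storage bucket.
INLINE_LIMIT = 100 * 1024  # ~100 kB, per the discussion above


def payload_placement(payload: bytes) -> str:
    """Return 'inline' for small payloads, 'standalone' for large blobs."""
    return "inline" if len(payload) <= INLINE_LIMIT else "standalone"
```

Large blobs would then be uploaded to Minio directly (as in the 2 GB test case) instead of ending up in an unreadable payload file.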

Refactor shepherd

  • Shepherd should only expose start/stop, the rest should be private
  • Greenlets should not be started in init
  • Handles to all greenlets must be kept and correctly terminated in close()
  • We should rethink where sheep-related stuff (public BaseSheep attributes) is kept (in_progress, queue, socket, ...)
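A minimal sketch of the proposed surface, using threads in place of greenlets for illustration (the real code uses gevent; names here are assumptions, not shepherd's actual API):

```python
import threading


class Shepherd:
    """Sketch: only start() and stop() are public, everything else private."""

    def __init__(self):
        # Nothing is spawned here -- workers are started in start(), per the
        # second bullet above.
        self._workers = []
        self._stopping = threading.Event()

    def start(self):
        # Keep a handle to every worker so stop() can terminate them all.
        worker = threading.Thread(target=self._listen, daemon=True)
        self._workers.append(worker)
        worker.start()

    def stop(self):
        self._stopping.set()
        for worker in self._workers:
            worker.join(timeout=1.0)

    def _listen(self):
        # Private, as proposed; a real implementation would process jobs here.
        self._stopping.wait()
```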

ConnectionResetError while running load tests

Related to #55

Traceback (most recent call last):
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/web_protocol.py", line 447, in start
    await resp.prepare(request)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/web_response.py", line 353, in prepare
    return await self._start(request)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/web_response.py", line 667, in _start
    return await super()._start(request)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/web_response.py", line 410, in _start
    await writer.write_headers(status_line, headers)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/http_writer.py", line 112, in write_headers
    self._write(buf)
  File "/home/gdynusa/.pyenv/versions/3.7.2/envs/shepherd/lib/python3.7/site-packages/aiohttp/http_writer.py", line 67, in _write
    raise ConnectionResetError('Cannot write to closing transport')
ConnectionResetError: Cannot write to closing transport

Rethink async architecture

I think that good things might happen if we ditch gevent and use asyncio from the Python (3.5+) standard library. Some semi-random thoughts on that follow:

  1. We could use an async s3 client (like this or this) to download payloads of queued jobs in advance while a sheep is eating something. This would be a good opportunity to abstract Minio stuff away to a service so that we can swap file storage implementations if needed.
  2. We might have to rethink how the shepherd API works. Currently, we use a WSGI server from gevent that takes care of everything for us. Off the top of my head, we could replace Flask with Sanic - a fast, async, pure-python HTTP server with an API similar to Flask. We would have to make some adjustments to apistrap though (and it's hard to tell how hard it would be). However, it seems that you can also call asyncio tasks from ordinary Flask routes too. That would mean we could run the API without modifications using e.g. aiohttp-wsgi.
  3. PyZMQ supports asyncio.
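Point 1 could be sketched with an `asyncio.Queue` as the prefetch buffer. The `fake_download` stand-in below is an assumption; a real implementation would call an async S3/Minio client:

```python
import asyncio


async def prefetch_payloads(job_ids, download, queue):
    """Download payloads of queued jobs ahead of time (point 1 above)."""
    for job_id in job_ids:
        payload = await download(job_id)  # would be an async S3/Minio call
        await queue.put((job_id, payload))


async def main():
    # Bounded queue: prefetch a little, not everything at once.
    queue = asyncio.Queue(maxsize=2)

    async def fake_download(job_id):  # stands in for the async s3 client
        await asyncio.sleep(0)
        return b"payload-" + job_id.encode()

    producer = asyncio.create_task(
        prefetch_payloads(["a", "b"], fake_download, queue))
    results = [await queue.get() for _ in range(2)]
    await producer
    return results
```

The sheep keeps "eating" its current job while the producer task fills the queue in the background.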

Debug mode: propagate exc. stacktrace through API

Similarly to apistrap, it would be nice to enable the worker's debug mode and let it pass the exception stack trace through the API response.

Example of out-of-worker response

code: 500
message: Couldn't obtain results of `123`
debug_data: my_runner.py line 68: invalid...
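A sketch of how the response above could be built (the `DEBUG` flag and `error_response` helper are assumptions for illustration, not existing shepherd code):

```python
import traceback

DEBUG = True  # would come from shepherd's configuration


def error_response(message: str, exc: Exception) -> dict:
    """Build an error body; attach the stack trace only in debug mode."""
    body = {"code": 500, "message": message}
    if DEBUG:
        body["debug_data"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__))
    return body
```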

Environmental variables preceded by $

Parsing environment variables in the config file exhibits unexpected behavior when the variable is preceded by a string containing the $ character. For instance:

$321/$MY_ENV_VAR
KEY-A29$56Y$12-${KEY_SUFFIX}
This does not have to be supported, but it would be nice to terminate gracefully (e.g., with an exception).
The regex r'([^$]*)\$([A-Z_][A-Z_0-9]*)' is to blame.

(By the way, support for escaping $ via \ would be a nice-to-have, but not necessary at the moment.)

As a sidenote, I would personally prefer to have two different regexes. One for matching the lines containing an env variable and a smaller one used for substitution. That is, roughly something like

re_env  = r'\$([A-Z_][A-Z_0-9]*)'
re_line = r'^.*' + re_env + r'.*$'
# The second one does not even need its own named variable,
# it will only be used once in the call to `add_implicit_resolver`.

With this, the issue would probably be solved immediately.
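A sketch of the substitution half under these assumptions (only the bare `$VAR` form is handled here, not `${VAR}`; `substitute` is an illustrative helper, not shepherd's actual parser). An unknown variable raises `KeyError` instead of silently mangling the string, which gives the graceful termination wished for above:

```python
import os
import re

# The "smaller" regex used for substitution.
RE_ENV = r'\$([A-Z_][A-Z_0-9]*)'


def substitute(value: str, env=os.environ) -> str:
    """Replace every $VAR occurrence; raises KeyError for unknown variables."""
    return re.sub(RE_ENV, lambda m: env[m.group(1)], value)
```

Note that `$321` is left untouched because `3` cannot start a variable name, which resolves the first example above.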

Check Minio connection more intelligently

Currently, we check whether Minio is accessible when shepherd is starting, i.e. at a time when it 1) doesn't really matter that much and 2) is quite likely that Minio is in fact not accessible. This leads to some horrible hacks such as putting sleep in our Dockerfiles. Also, if Minio breaks down later, we might find out only when users start complaining.

I propose we think of a better way of checking Minio connectivity and reporting failures.
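One possible shape for this, sketched with a generic `check` callable (in practice it would call something like the Minio client's bucket listing; the function and its parameters are assumptions, not existing code):

```python
import time


def wait_for_storage(check, attempts=5, delay=0.01):
    """Poll the storage until it responds, instead of a one-shot check at
    start-up (or a sleep in the Dockerfile)."""
    for attempt in range(attempts):
        try:
            check()  # e.g. a cheap Minio call such as listing buckets
            return True
        except ConnectionError:
            time.sleep(delay * (2 ** attempt))  # exponential back-off
    return False
```

The same polling could run periodically after start-up, so a later Minio outage is reported in the logs before users start complaining.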

Example setup cannot process jobs on a Mac

Request:

{
    "model": {
        "name": "cxflow-test",
        "version": "latest"
    },
    "job_id": "job-05",
    "payload": "{\"key\": [42]}"
}

Trace:

Traceback (most recent call last):
  File "/Users/veronika/PycharmProjects/cxworker/cxworker/runner/base_runner.py", line 128, in process_all
    self._process_job(input_path, output_path)
  File "/Users/veronika/PycharmProjects/cxworker/cxworker/runner/json_runner.py", line 80, in _process_job
    self._load_dataset()
  File "/Users/veronika/PycharmProjects/cxworker/cxworker/runner/base_runner.py", line 90, in _load_dataset
    self._dataset = create_dataset(self._config, None)
  File "/usr/local/lib/python3.7/site-packages/cxflow/cli/common.py", line 88, in create_dataset
    dataset = create_object(dataset_module, dataset_class, args=(yaml_to_str(dataset_config),))
  File "/usr/local/lib/python3.7/site-packages/cxflow/utils/reflection.py", line 61, in create_object
    return get_attribute(module_name, class_name)(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/cxflow/utils/reflection.py", line 39, in get_attribute
    _module = importlib.import_module(module_name)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'examples.docker'

Make shepherd robust to io fails

A number of operations in the _listen coroutine may fail, e.g.:

shutil.rmtree(working_directory)

which effectively kills the coroutine, and shepherd is unable to recover from it.

Shepherd should be able to handle these cases (and, of course, report them in the logs).

This can be reproduced by running two shepherds with the same working directory.
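A minimal sketch of wrapping the clean-up so the coroutine survives (the helper name and return contract are assumptions; shepherd's actual _listen would call something like this instead of a bare rmtree):

```python
import logging
import shutil

logger = logging.getLogger("shepherd")


def remove_working_directory(working_directory: str) -> bool:
    """Clean up a job's directory without killing the caller on failure."""
    try:
        shutil.rmtree(working_directory)
        return True
    except OSError:
        # Another shepherd sharing the working directory (or a previous
        # crash) may have removed it already -- log and carry on.
        logger.exception("Failed to remove %s", working_directory)
        return False
```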

Remove emloop dependency from bare sheep

I think that shepherd is too tightly bound to emloop. For example, it's currently impossible to create a runner without a model config (even an empty one).

We should talk about and redesign the runner architecture.

Support one-shot model "deployments"

The intended use case is:

  1. train an awesome model
  2. run something like shepherd oneshot -p 6666 ./an-awesome-model
  3. either
    a. wrap the whole affair with Docker
    b. point demotool at :6666 and impress people.

Finalize docs

  • Generate the API reference into the docs
  • Separate BaseRunner and CXRunner and explain them better
  • Include example responses in the introduction
  • Include direct payload and result endpoints in the docs

Can't read/write from/to default input/output file

It happened just now, out of nowhere.

2018-10-19 11:48:09.588: ERROR   @base_runner : [Errno 2] No such file or directory: '/tmp/shepherd-data/bare_sheep/d15c23aa-39b8-4d1a-88ff-e8e09fbc8734/outputs/output'                                                                                                   
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/shepherd/runner/base_runner.py", line 128, in process_all
    self._process_job(input_path, output_path)
  File "/isletnet/model/isletnet_runner.py", line 177, in _process_job
    with open(path.join(output_path, DEFAULT_OUTPUT_FILE), 'w') as file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/shepherd-data/bare_sheep/d15c23aa-39b8-4d1a-88ff-e8e09fbc8734/outputs/output'                                                                                                                                
2018-10-19 11:48:09.588: ERROR   @base_runner : Sending ErrorMessage for job `d15c23aa-39b8-4d1a-88ff-e8e09fbc8734

After another payload, even reading the input file fails. Any ideas @Teyras @FloopCZ @blazekadam ?

What's even worse: even though the runner clearly fails, shepherd doesn't respond with an error (so you're waiting forever).

I wasn't able to fix it by recreating the containers, so I deleted all related volumes, killed and removed all related containers, and ran docker-compose up once again. This helped. No idea what's going on.

Re-consider sheep queue -> socket design

Follow-up of discussion on #21

Teyras 11 hours ago
Why can there be more than one job in progress?

@blazekadam
blazekadam 3 hours ago
Because a job is in in_progress as soon as you prepare its folder and send an InputMessage to the socket. There can be multiple messages (and jobs) waiting in the socket to be processed by the runner. If you kill or lose the runner, you need to resolve all the jobs already sent to the socket. In this case, the in_progress set of job ids comes in handy. (In fact, a sheep/runner is killed only when there are no jobs in_progress.)

It may not be the best strategy, though. I can imagine ensuring there is only one such job and sending the next one only after the current job is done. With this strategy, we would never lose more than one job when the runner crashes. Anyway, runner crashes should be quite rare, so I don't think we need to rework this now.

Btw, there are some other (imo bigger) issues. In case the runner gets stuck (i.e., it is still running but not responding or sending messages), we would never attempt to restart it. I think each job should have a certain (configurable?) timeout after which we give up, consider it an error, and restart the sheep.

@Teyras
Teyras 2 hours ago
OK, let's make an issue for this and move on.

Related to #40
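The timeout idea from the discussion above could look roughly like this, sketched in asyncio terms (shepherd currently uses gevent; `run_job`, `restart_sheep` and the timeout value are illustrative assumptions):

```python
import asyncio

JOB_TIMEOUT = 0.05  # seconds; would be configurable, per the discussion


async def run_job(job, restart_sheep):
    """Give up on a stuck runner after a timeout and restart the sheep."""
    try:
        return await asyncio.wait_for(job, timeout=JOB_TIMEOUT)
    except asyncio.TimeoutError:
        # The runner is still alive but not responding -- treat the job as
        # an error and recycle the sheep.
        await restart_sheep()
        return None
```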

Job gets stuck when BaseSheep can't write to stdout/err file

  1. Set shepherd config like this: sheep.bare_sheep.stdout_file: /var/local/stdout.txt (i.e. the file isn't writable, but the dir exists)
  2. Run shepherd
  3. Start a new job (id=123)
  4. The sheep raises Permission denied (and probably dies?)...
  5. ... however, the currently processed job (id=123) seems to be being computed (it is not!)
  6. so result keeps returning 202 and you keep waiting (forever)
  7. btw, starting a new job (id=456) will naturally succeed
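One way to fail fast instead of hanging forever: check the configured stdout/stderr paths before accepting jobs. The helper below is a sketch (its name and signature are assumptions, not shepherd's API):

```python
import os


def check_log_files_writable(*paths):
    """Raise at sheep start-up if a configured log file cannot be written,
    instead of letting the job hang at step 5 above."""
    for path in paths:
        directory = os.path.dirname(path) or "."
        if not os.access(directory, os.W_OK) or (
                os.path.exists(path) and not os.access(path, os.W_OK)):
            raise PermissionError(f"Cannot write to {path}")
```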

Refactor Messenger

Messenger should be either a regular class or a module. A regular class instance could contain a socket and we could use it to abstract our code from ZMQ communication (easier testing). On the other hand, using a module would convey that there is no hidden state.
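The regular-class variant could look like this; the socket is injected, so tests can pass a fake instead of a real ZMQ socket. The method names and message frame layout are illustrative assumptions:

```python
class Messenger:
    """Wraps a (ZMQ-like) socket, abstracting the rest of the code away
    from the wire protocol -- and making testing easier."""

    def __init__(self, socket):
        self._socket = socket

    def send_input(self, job_id: str, payload: bytes):
        # Hypothetical frame layout for an InputMessage.
        self._socket.send_multipart([b"input", job_id.encode(), payload])

    def receive(self):
        return self._socket.recv_multipart()
```

A module, by contrast, would keep the current stateless feel but make it harder to swap the transport out in tests.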

Don't start minio in the test subprocess

This is actually really ugly.

Minio is a well-separated project (from shepherd), and I can't see any reason to force a developer (me) to have minio installed on bare metal. On the contrary, it is completely fine to have minio run in a container on a specified port.

So I suggest:

  1. removing the minio execution from the test subprocess
  2. letting the user handle minio execution (docker example: docker run -it -p 9000:9000 minio/minio server --address=0.0.0.0:9000 /export)

Would love to see some comments on this @Teyras @blazekadam @FloopCZ @gdynusa
