backend.ai-kernel-runner's Issues

NGC kernel runners

Let's add ai.backend.kernel subpackages for NGC images (a skeleton sketch follows the list below).

  • TensorFlow/Caffe/etc.: these mostly work like a normal Python kernel but without the query mode.
  • PyTorch: it uses Python 3.6.7, so we can support the query mode; it will be enabled via #5 and lablup/backend.ai-agent#93.
  • DIGITS: it runs a DIGITS server as the container entrypoint. Expose it as a service port.
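A hypothetical skeleton for such a subpackage, following the layout of the existing language runners; the module path and the BaseRunner attribute and method names below are assumptions for illustration, not verified against the current base class:

# ai/backend/kernel/python_ngc_pytorch/__init__.py (hypothetical path)
from ..base import BaseRunner

class Runner(BaseRunner):
    log_prefix = 'ngc-pytorch-kernel'

    async def init_with_loop(self):
        pass  # per-kernel initialization, if any

    async def build_heuristic(self):
        pass  # nothing to compile for Python-based NGC images

    async def execute_heuristic(self):
        # Delegate to the image's bundled Python interpreter.
        return await self.run_subproc('python main.py')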

Jupyter kernel protocol support

To utilize the ecosystem of Jupyter custom kernels such as IJulia and XEUS-cling, let's implement the client side of the Jupyter kernel protocol as a standalone kernel runner.
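For prototyping, the client side can be built on top of the jupyter_client package before writing a raw ZeroMQ implementation. A minimal sketch; the kernel name must match an installed kernelspec (e.g., "julia-1.0" for IJulia instead of "python3"):

from queue import Empty
from jupyter_client.manager import start_new_kernel

# Launch a kernel from an installed kernelspec and connect all channels.
km, kc = start_new_kernel(kernel_name='python3')
try:
    kc.execute('print(6 * 7)')
    while True:
        try:
            msg = kc.get_iopub_msg(timeout=5)
        except Empty:
            break
        if msg['msg_type'] == 'stream':
            print(msg['content']['text'], end='')  # prints "42"
        elif (msg['msg_type'] == 'status'
                and msg['content']['execution_state'] == 'idle'):
            break  # the kernel finished processing our request
finally:
    kc.stop_channels()
    km.shutdown_kernel()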

Customized bootstrapping

Users will want to install their own packages via pip.
We could use a simple config file that describes the package list to install on startup of a kernel session (see the installer sketch after the list below).

  • Define a simple format to list dependencies, probably as YAML
    • The requirements.txt format is a good fit for Python, including custom index support. We could let users embed its content verbatim as a YAML text block.
    • System packages (apt-get) or their white-listed aliases (e.g., "build-tools" installs "build-essential" and "python-dev" in Python containers, and "build-essential" and "php7-dev" in PHP containers)
    • Backend.AI-specific options such as various CUDA versions
  • Implement the installation process on startup
  • Provide a local pip cache for commonly used packages (e.g., numpy) as Docker volumes, like Travis CI does
  • Support storage destinations for the generated artifacts (e.g., virtual folder ID, external S3 bucket with credentials); idea by @serialx
  • Provide a way to show the config applied to the kernel session (lablup/backend.ai-manager#50)
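A minimal sketch of the startup installer under these assumptions; the bootstrap.yaml name, its pip key holding a verbatim requirements.txt block, and the /opt/pip-cache volume path are all hypothetical:

import subprocess
import sys
import tempfile

import yaml  # requires PyYAML

with open('bootstrap.yaml') as f:
    config = yaml.safe_load(f) or {}

# The "pip" key holds a requirements.txt block embedded as YAML text.
reqs = config.get('pip', '')
if reqs:
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
        tmp.write(reqs)
    subprocess.run(
        [sys.executable, '-m', 'pip', 'install',
         '--cache-dir', '/opt/pip-cache',  # shared cache volume (hypothetical)
         '-r', tmp.name],
        check=True)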

Make "docker logs" useful

Let's send captured stdout/stderr streams to the original stdout/stderr of the kernel runner so that the docker logs API also shows the streamed results, as sketched below.
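A minimal sketch of the tee behavior; the send_over_zmq call stands in for the runner's existing output path and is hypothetical:

import sys

def relay_output(stream_name: str, data: str) -> None:
    # Existing path: forward captured output to the agent (not shown).
    # send_over_zmq(stream_name, data)
    # New path: also mirror it to the runner's own stdout/stderr, which
    # the Docker logging driver (and thus "docker logs") captures.
    target = sys.stdout if stream_name == 'stdout' else sys.stderr
    target.write(data)
    target.flush()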

Distribute with agent

Historically, the kernel runner has been tightly coupled with the agent.
Updates to the kernel runner thus happen frequently, and each update has imposed the burden of rebuilding and redistributing new kernel images even when the images' platform versions (e.g., the Python version) have not changed.

So, let's install only a minimal, separate Python for the kernel runner into the kernel images and make the agent mount the kernel runner package and its dependencies at runtime (see the mount sketch after the checklist below). We may need to build separate sets of Python wheels for different distros, such as Ubuntu 16.04 / 18.04 and Alpine 3.8.

  • Make the agent's --debug-kernel option the default behavior. Add the kernel runner to the requirements of the agent.
  • Change kernel images not to install the kernel runner in their Dockerfiles.
  • Make a separate "kernel runner environment" Docker image which will be distributed with the agent. Its Python directories will be directly mounted into kernels at runtime. Update the agent to work with it.
    • Could we make the pip install process automatically pull or build the kernel runner environment Docker image?
  • Update the offline installers and the development setup script in the meta-repository.
  • Update the documentation.
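A minimal sketch of how the agent could mount a prebuilt runner environment at container creation time using the docker-py API; the volume name, mount path, and PYTHONPATH value are hypothetical:

import docker

client = docker.from_env()
container = client.containers.create(
    'lablup/python:3.6-ubuntu18.04',  # example kernel image
    volumes={
        # Named volume populated from the "kernel runner environment"
        # image distributed with the agent, mounted read-only.
        'backendai-krunner-env': {'bind': '/opt/backend.ai', 'mode': 'ro'},
    },
    environment={
        'PYTHONPATH': '/opt/backend.ai/lib/python3.6/site-packages',
    },
)
container.start()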

Set the default ThreadPoolExecutor to prevent janus from creating too many threads

We rely on janus to mediate communication between user-written code and the base runner.
It uses asyncio's run_in_executor() API to invoke asynchronous operations from synchronous code, and the default executor spawns 5 threads per detected CPU core.

We currently limit the number of CPU cores detected by Python via library-level hooking, but when this does not work or user code tries to spawn more threads, janus can fail with hard-to-debug errors on the client side (e.g., the execute API immediately reports a "finished" status for whatever code is requested, without actually executing it!).

Let's safely limit the number of threads spawned by the run_in_executor() API by having the base kernel runner set a default executor.
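A minimal sketch; the max_workers value is arbitrary for illustration:

import asyncio
from concurrent.futures import ThreadPoolExecutor

def install_default_executor(loop: asyncio.AbstractEventLoop,
                             max_workers: int = 8) -> None:
    # Every run_in_executor(None, ...) call, including those issued
    # internally by janus, will share this bounded pool instead of the
    # implicit executor sized from the detected CPU core count.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=max_workers))

loop = asyncio.get_event_loop()
install_default_executor(loop)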

Investigate and fix uvloop-specific terminal app error

The following error happens randomly (quite frequently) on startup of python -m ai.backend.kernel git:

shell-kernel: Fatal error on transport ReadUnixTransport (error status in uv_stream_t.read callback)
protocol: <ai.backend.kernel.terminal.StdoutProtocol object at 0x7f184ae94cf8>
transport: <ReadUnixTransport closed=False reading=False 0x7f184ae97298>
OSError: [Errno 5] Input/output error
opened shell pty: stdin at port 2002, stdout at port 2003
shell-kernel: Task exception was never retrieved
future: <Task finished coro=<Terminal.terminal_in() done, defined at /home/joongi/backend.ai-kernel-runner/ai/backend/kernel/terminal.py:136> exception=RuntimeError('read called while another coroutine is already waiting for incoming data',)>
Traceback (most recent call last):
  File "/home/joongi/backend.ai-kernel-runner/ai/backend/kernel/terminal.py", line 139, in terminal_in
    data = await self.sock_term_in.read()
  File "/home/joongi/.pyenv/versions/bai-kernel-runner-dev/lib/python3.6/site-packages/aiozmq/stream.py", line 278, in read
    raise RuntimeError('read called while another coroutine is '
RuntimeError: read called while another coroutine is already waiting for incoming data

and another error when terminating:

shell-kernel: Fatal error on transport ReadUnixTransport (error status in uv_stream_t.read callback)
protocol: <ai.backend.kernel.terminal.StdoutProtocol object at 0x7f88dccedcc0>
transport: <ReadUnixTransport closed=False reading=False 0x7f88dccf0298>
OSError: [Errno 5] Input/output error

If I don't use uvloop, then it works fine.

Cancel timed-out enqueued queries

If the waiting time for an enqueued task (= run) exceeds the execute API timeout, the client gets a timeout error but the task item still persists in the queue.
This causes a mismatch of run IDs afterwards, which may prevent the user from using the session until it restarts. We should cancel such expired runs, as sketched below.
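A minimal sketch of dropping expired runs at dequeue time; the queue layout and the EXECUTE_TIMEOUT constant are hypothetical:

import time
from collections import deque

EXECUTE_TIMEOUT = 30.0  # seconds; should match the execute API timeout

run_queue = deque()  # items: (enqueued_at, run_id, code)

def next_pending_run():
    # Discard runs the client has already given up on, so that stale
    # items cannot shift the run-ID sequence for subsequent queries.
    while run_queue:
        enqueued_at, run_id, code = run_queue[0]
        if time.monotonic() - enqueued_at > EXECUTE_TIMEOUT:
            run_queue.popleft()  # timed out while waiting: drop it
            continue
        return run_queue.popleft()
    return None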
