backend.ai-kernel-runner's Issues

NGC kernel runners

Let's add ai.backend.kernel subpackages for NGC images (a skeleton sketch follows the list below).

  • TensorFlow/Caffe/etc.: these mostly work like a normal Python kernel but without the query mode.
  • PyTorch: it uses Python 3.6.7, so we can support the query mode; it will be enabled via #5 and lablup/backend.ai-agent#93.
  • DIGITS: it runs a DIGITS server as the container entrypoint. Expose it as a service port.
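A hypothetical skeleton for such a subpackage, following the layout of the existing language runners; the module path and the BaseRunner attribute and method names below are assumptions for illustration, not verified against the current base class:

# ai/backend/kernel/python_ngc_pytorch/__init__.py (hypothetical path)
from ..base import BaseRunner

class Runner(BaseRunner):
    log_prefix = 'ngc-pytorch-kernel'

    async def init_with_loop(self):
        pass  # per-kernel initialization, if any

    async def build_heuristic(self):
        pass  # nothing to compile for Python-based NGC images

    async def execute_heuristic(self):
        # Delegate to the image's bundled Python interpreter.
        return await self.run_subproc('python main.py')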

Jupyter kernel protocol support

To utilize the ecosystem of Jupyter custom kernels such as IJulia and XEUS-cling, let's implement the client side of the Jupyter kernel protocol as a standalone kernel runner.
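For prototyping, the client side can be built on top of the jupyter_client package before writing a raw ZeroMQ implementation. A minimal sketch; the kernel name must match an installed kernelspec (e.g., "julia-1.0" for IJulia instead of "python3"):

from queue import Empty
from jupyter_client.manager import start_new_kernel

# Launch a kernel from an installed kernelspec and connect all channels.
km, kc = start_new_kernel(kernel_name='python3')
try:
    kc.execute('print(6 * 7)')
    while True:
        try:
            msg = kc.get_iopub_msg(timeout=5)
        except Empty:
            break
        if msg['msg_type'] == 'stream':
            print(msg['content']['text'], end='')  # prints "42"
        elif (msg['msg_type'] == 'status'
                and msg['content']['execution_state'] == 'idle'):
            break  # the kernel finished processing our request
finally:
    kc.stop_channels()
    km.shutdown_kernel()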

Customized bootstrapping

Users will want to install their own packages via pip.
We could use a simple config file that describes the package list to install on startup of a kernel session (see the installer sketch after the list below).

  • Define a simple format to list dependencies, probably as YAML
    • The requirements.txt format is a good fit for Python, including custom index support. We could let users embed its content verbatim as a YAML text block.
    • System packages (apt-get) or their white-listed aliases (e.g., "build-tools" installs "build-essential" and "python-dev" in Python containers, and "build-essential" and "php7-dev" in PHP containers)
    • Backend.AI-specific options such as various CUDA versions
  • Implement the installation process on startup
  • Provide a local pip cache for commonly used packages (e.g., numpy) as Docker volumes, like Travis CI does
  • Support storage destinations for the generated artifacts (e.g., virtual folder ID, external S3 bucket with credentials); idea by @serialx
  • Provide a way to show the config applied to the kernel session (lablup/backend.ai-manager#50)
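A minimal sketch of the startup installer under these assumptions; the bootstrap.yaml name, its pip key holding a verbatim requirements.txt block, and the /opt/pip-cache volume path are all hypothetical:

import subprocess
import sys
import tempfile

import yaml  # requires PyYAML

with open('bootstrap.yaml') as f:
    config = yaml.safe_load(f) or {}

# The "pip" key holds a requirements.txt block embedded as YAML text.
reqs = config.get('pip', '')
if reqs:
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
        tmp.write(reqs)
    subprocess.run(
        [sys.executable, '-m', 'pip', 'install',
         '--cache-dir', '/opt/pip-cache',  # shared cache volume (hypothetical)
         '-r', tmp.name],
        check=True)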

Make "docker logs" useful

Let's send captured stdout/stderr streams to the original stdout/stderr of the kernel runner so that the docker logs API also shows the streamed results, as sketched below.
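A minimal sketch of the tee behavior; the send_over_zmq call stands in for the runner's existing output path and is hypothetical:

import sys

def relay_output(stream_name: str, data: str) -> None:
    # Existing path: forward captured output to the agent (not shown).
    # send_over_zmq(stream_name, data)
    # New path: also mirror it to the runner's own stdout/stderr, which
    # the Docker logging driver (and thus "docker logs") captures.
    target = sys.stdout if stream_name == 'stdout' else sys.stderr
    target.write(data)
    target.flush()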

Distribute with agent

Historically, the kernel runner has been tightly coupled with the agent.
Updates to the kernel runner thus happen frequently, and each update has imposed the burden of rebuilding and redistributing new kernel images even when the images' platform versions (e.g., the Python version) have not changed.

So, let's install only a minimal, separate Python for the kernel runner into the kernel images and make the agent mount the kernel runner package and its dependencies at runtime (see the mount sketch after the checklist below). We may need to build separate sets of Python wheels for different distros, such as Ubuntu 16.04 / 18.04 and Alpine 3.8.

  • Make the agent's --debug-kernel option the default behavior. Add the kernel runner to the requirements of the agent.
  • Change kernel images not to install the kernel runner in their Dockerfiles.
  • Make a separate "kernel runner environment" Docker image which will be distributed with the agent. Its Python directories will be directly mounted into kernels at runtime. Update the agent to work with it.
    • Could we make the pip install process automatically pull or build the kernel runner environment Docker image?
  • Update the offline installers and the development setup script in the meta-repository.
  • Update the documentation.
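A minimal sketch of how the agent could mount a prebuilt runner environment at container creation time using the docker-py API; the volume name, mount path, and PYTHONPATH value are hypothetical:

import docker

client = docker.from_env()
container = client.containers.create(
    'lablup/python:3.6-ubuntu18.04',  # example kernel image
    volumes={
        # Named volume populated from the "kernel runner environment"
        # image distributed with the agent, mounted read-only.
        'backendai-krunner-env': {'bind': '/opt/backend.ai', 'mode': 'ro'},
    },
    environment={
        'PYTHONPATH': '/opt/backend.ai/lib/python3.6/site-packages',
    },
)
container.start()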

Set the default ThreadPoolExecutor to prevent janus from creating too many threads

We rely on janus to mediate communication between user-written code and the base runner.
It uses asyncio's run_in_executor() API to invoke asynchronous operations from synchronous code, and the default executor spawns 5 threads per detected CPU core.

We currently limit the number of CPU cores detected by Python via library-level hooking, but when this does not work or user code tries to spawn more threads, janus can fail with hard-to-debug errors on the client side (e.g., the execute API immediately reports a "finished" status for whatever code is requested, without actually executing it!).

Let's safely limit the number of threads spawned by the run_in_executor() API by having the base kernel runner set a default executor.
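A minimal sketch; the max_workers value is arbitrary for illustration:

import asyncio
from concurrent.futures import ThreadPoolExecutor

def install_default_executor(loop: asyncio.AbstractEventLoop,
                             max_workers: int = 8) -> None:
    # Every run_in_executor(None, ...) call, including those issued
    # internally by janus, will share this bounded pool instead of the
    # implicit executor sized from the detected CPU core count.
    loop.set_default_executor(ThreadPoolExecutor(max_workers=max_workers))

loop = asyncio.get_event_loop()
install_default_executor(loop)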

Investigate and fix uvloop-specific terminal app error

The following error happens randomly (quite frequently) on startup of python -m ai.backend.kernel git:

shell-kernel: Fatal error on transport ReadUnixTransport (error status in uv_stream_t.read callback)
protocol: <ai.backend.kernel.terminal.StdoutProtocol object at 0x7f184ae94cf8>
transport: <ReadUnixTransport closed=False reading=False 0x7f184ae97298>
OSError: [Errno 5] Input/output error
opened shell pty: stdin at port 2002, stdout at port 2003
shell-kernel: Task exception was never retrieved
future: <Task finished coro=<Terminal.terminal_in() done, defined at /home/joongi/backend.ai-kernel-runner/ai/backend/kernel/terminal.py:136> exception=RuntimeError('read called while another coroutine is already waiting for incoming data',)>
Traceback (most recent call last):
  File "/home/joongi/backend.ai-kernel-runner/ai/backend/kernel/terminal.py", line 139, in terminal_in
    data = await self.sock_term_in.read()
  File "/home/joongi/.pyenv/versions/bai-kernel-runner-dev/lib/python3.6/site-packages/aiozmq/stream.py", line 278, in read
    raise RuntimeError('read called while another coroutine is '
RuntimeError: read called while another coroutine is already waiting for incoming data

and another error when terminating:

shell-kernel: Fatal error on transport ReadUnixTransport (error status in uv_stream_t.read callback)
protocol: <ai.backend.kernel.terminal.StdoutProtocol object at 0x7f88dccedcc0>
transport: <ReadUnixTransport closed=False reading=False 0x7f88dccf0298>
OSError: [Errno 5] Input/output error

If I don't use uvloop, then it works fine.

Cancel timed-out enqueued queries

If the waiting time for an enqueued task (= run) exceeds the execute API timeout, the client gets a timeout error but the task item still persists in the queue.
This causes a mismatch of run IDs afterwards, which may prevent the user from using the session until it restarts. We should cancel such expired runs, as sketched below.
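A minimal sketch of dropping expired runs at dequeue time; the queue layout and the EXECUTE_TIMEOUT constant are hypothetical:

import time
from collections import deque

EXECUTE_TIMEOUT = 30.0  # seconds; should match the execute API timeout

run_queue = deque()  # items: (enqueued_at, run_id, code)

def next_pending_run():
    # Discard runs the client has already given up on, so that stale
    # items cannot shift the run-ID sequence for subsequent queries.
    while run_queue:
        enqueued_at, run_id, code = run_queue[0]
        if time.monotonic() - enqueued_at > EXECUTE_TIMEOUT:
            run_queue.popleft()  # timed out while waiting: drop it
            continue
        return run_queue.popleft()
    return None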
