lablup / backend.ai-kernel-runner

A common base runner for various programming languages

License: MIT License
It's considered a legacy version in the Python world, but we need to support it because current NGC images are shipped with it. Let's add ai.backend.kernel subpackages for NGC images.
To utilize the ecosystem of custom Jupyter kernels such as IJulia and xeus-cling, let's implement the client side of the Jupyter kernel protocol as a standalone kernel runner.
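To illustrate what "the client side of the Jupyter kernel protocol" involves, here is a minimal sketch of building and signing an `execute_request` message per the Jupyter wire protocol (an HMAC-SHA256 signature over the serialized header, parent_header, metadata, and content frames). The key and session values are illustrative placeholders, not the project's actual configuration.

```python
import hashlib
import hmac
import json
import uuid
from datetime import datetime, timezone

def sign_message(key: bytes, parts: list) -> str:
    # The Jupyter wire protocol signs the four JSON frames
    # (header, parent_header, metadata, content) in order.
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        mac.update(part)
    return mac.hexdigest()

def make_execute_request(code: str, session: str) -> list:
    header = {
        'msg_id': uuid.uuid4().hex,
        'session': session,
        'username': 'kernel-runner',
        'date': datetime.now(timezone.utc).isoformat(),
        'msg_type': 'execute_request',
        'version': '5.3',
    }
    content = {
        'code': code,
        'silent': False,
        'store_history': True,
        'user_expressions': {},
        'allow_stdin': False,
    }
    parts = [
        json.dumps(header).encode(),
        json.dumps({}).encode(),  # parent_header (empty for a fresh request)
        json.dumps({}).encode(),  # metadata
        json.dumps(content).encode(),
    ]
    # Hypothetical key; in practice it comes from the kernel connection file.
    signature = sign_message(b'secret-key-from-connection-file', parts)
    return [signature.encode()] + parts
```

The five returned frames (signature plus four JSON bodies) would then be sent over the kernel's ZeroMQ shell socket, preceded by the `<IDS|MSG>` delimiter frame.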
Users would want to install their own packages via pip. We could use a simple config file that describes the package list to install on startup of a kernel session. The requirements.txt format is perfect for Python, including custom index support. We could allow users to specify it as a whole using a YAML content block.

Keep a pip cache for commonly used packages (e.g., numpy) as Docker volumes, like Travis CI does.

Let's send captured stdout/stderr streams to the original stdout/stderr of the kernel runner so that the docker logs API also shows the streamed results.
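As a rough illustration of embedding a requirements.txt as a YAML content block, a startup config could look like the following. The file name and key names here are hypothetical, not an existing schema of this project.

```yaml
# Hypothetical session bootstrap config; keys are illustrative.
packages:
  pip:
    # The requirements.txt format is embedded verbatim as a block scalar,
    # so custom index options pass through unchanged.
    requirements: |
      --index-url https://pypi.example.com/simple
      numpy>=1.16
      pandas==0.24.2
```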
Let's suppress this warning until we have real problems with it.
inproc.py for query mode execution.

Historically, the kernel runner has been tightly coupled with the agent.
Updating the kernel runner therefore happens frequently, and this has imposed the burden of rebuilding and redistributing new kernel images every time, even when the image's platform versions (e.g., the Python version) have not changed. So, let's install only a minimal, separate Python for the kernel runner into the kernel images, and make the agent mount the kernel runner package and its dependencies at runtime. We may need to build separate sets of Python wheels for different distros, such as Ubuntu 16.04 / 18.04 and Alpine 3.8.
Change the --debug-kernel option to be the default behavior. Add the kernel runner to the requirements of the agent. Should the pip install process automatically pull or build the kernel runner environment Docker image?

We rely on janus to mediate communication between user-written code and the base runner.
It uses asyncio's run_in_executor() API to invoke asynchronous operations from synchronous code, and the default executor spawns 5 threads per detected CPU core. We are currently limiting the number of CPU cores detected by Python via library-level hooking, but when this does not work, or user code tries to spawn more threads, janus can fail with hard-to-debug errors on the client side (e.g., the execute API reports "finished" status immediately for whatever code is requested, without executing it!). Let's safely limit the number of threads spawned by the run_in_executor() API by setting a default executor in the base kernel runner.
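A minimal sketch of the proposed fix: install a bounded ThreadPoolExecutor as the loop's default executor, so run_in_executor(None, ...) never exceeds a fixed thread count. The cap of 4 here is an illustrative value, not the project's chosen limit.

```python
import asyncio
import concurrent.futures

def main():
    loop = asyncio.new_event_loop()
    # Replace the implicit 5-threads-per-core default executor with a
    # small fixed-size pool (hypothetical cap of 4 worker threads).
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)
    loop.set_default_executor(executor)

    def blocking_work(x):
        # Stand-in for synchronous user code bridged via janus.
        return x * 2

    # Passing None as the executor uses the default one we just set.
    result = loop.run_until_complete(
        loop.run_in_executor(None, blocking_work, 21)
    )
    executor.shutdown(wait=True)
    loop.close()
    return result

print(main())  # prints 42
```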
The following error happens randomly (quite frequently) on startup of python -m ai.backend.kernel git:
shell-kernel: Fatal error on transport ReadUnixTransport (error status in uv_stream_t.read callback)
protocol: <ai.backend.kernel.terminal.StdoutProtocol object at 0x7f184ae94cf8>
transport: <ReadUnixTransport closed=False reading=False 0x7f184ae97298>
OSError: [Errno 5] Input/output error
opened shell pty: stdin at port 2002, stdout at port 2003
shell-kernel: Task exception was never retrieved
future: <Task finished coro=<Terminal.terminal_in() done, defined at /home/joongi/backend.ai-kernel-runner/ai/backend/kernel/terminal.py:136> exception=RuntimeError('read called while another coroutine is already waiting for incoming data',)>
Traceback (most recent call last):
File "/home/joongi/backend.ai-kernel-runner/ai/backend/kernel/terminal.py", line 139, in terminal_in
data = await self.sock_term_in.read()
File "/home/joongi/.pyenv/versions/bai-kernel-runner-dev/lib/python3.6/site-packages/aiozmq/stream.py", line 278, in read
raise RuntimeError('read called while another coroutine is '
RuntimeError: read called while another coroutine is already waiting for incoming data
and another error occurs when terminating:
shell-kernel: Fatal error on transport ReadUnixTransport (error status in uv_stream_t.read callback)
protocol: <ai.backend.kernel.terminal.StdoutProtocol object at 0x7f88dccedcc0>
transport: <ReadUnixTransport closed=False reading=False 0x7f88dccf0298>
OSError: [Errno 5] Input/output error
If I don't use uvloop, then it works fine.
If the waiting time for an enqueued task (= run) exceeds the execute API timeout, the client gets a timeout error but the task item still persists. This causes a mismatch of run IDs afterwards, which may prevent the user from using the session until it restarts.