lablup / backend.ai-manager
Backend.AI Manager and API Gateway Daemon
License: GNU Lesser General Public License v3.0
Thanks to @gofeel for discovering this issue.
(Along with lablup/backend.ai-kernel-runner#1)
tcp://logger.lablup:2120
codeonweb-logs (with PutObject / DeleteObject permissions)

Different languages may use slightly different timezone offset formats when generating ISO 8601-style datetime strings. So, after checking that the timestamp is within ±15 minutes of the server time, we need to use the date header value as-is when canonicalizing the request.
Python 3.6:
>>> from datetime import datetime
>>> from dateutil.tz import tzutc
>>> datetime.now(tzutc()).isoformat()
'2017-01-10T06:15:09.189505+00:00'
>>> datetime.now().isoformat()
'2017-01-10T15:15:15.573437'
>>> datetime.utcnow().isoformat()
'2017-01-10T06:15:23.914917'
JavaScript:
var d = new Date;
console.log(d.toISOString());
// 2017-01-10T06:16:21.828Z
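A minimal sketch of such a check (a hypothetical helper, not the actual gateway code): validate clock skew on the parsed value, while the raw header string is what gets canonicalized.

```python
from datetime import datetime, timedelta, timezone

def check_date(date_header, skew=timedelta(minutes=15)):
    """Return True if the client timestamp is within +-skew of server time.

    Sketch only: the real gateway must keep the raw date header string
    verbatim when canonicalizing the signed request, regardless of the
    client's timezone offset format.
    """
    dt = datetime.fromisoformat(date_header)  # Python 3.7+; preserves the offset
    if dt.tzinfo is None:                     # treat naive stamps as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    now = datetime.now(timezone.utc)
    return abs(now - dt) <= skew
```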
Because the develop branch has diverged, we should postpone renaming the manager/agent codebases for now.
However, since the client will soon be released under the "Backend.AI" brand, the manager and agents need to accept the new API headers before they themselves are upgraded.
The inverted default of the check_shadow parameter in sorna.manager.registry.InstanceRegistry.enumerate_instances() caused the monitoring view to miss lost instances.
Let's make API handlers aware of client request versions.
Details in WIP.
Most things will work fine, but we need to test it out.
(The most breaking difference is the async generator protocol change, and this may impact 3rd-party asyncio libraries.)
Provide a virtual folder API. Users can create virtual folders with limited storage and mount them when launching kernel sessions by passing the list of virtual folders via options.
Virtual folders are mounted under /home/work inside kernel containers.

We need to run a GPU-capable high-end physical server for deep-learning kernels and EC2 instances for scaling other types of kernels side by side.
Let the manager be able to choose which instance to use depending on supported kernel types.
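The selection logic could look like the following sketch; the data shapes here (dicts with "kernels" and "load" fields) are hypothetical, not the actual registry schema.

```python
def pick_agent(agents, kernel_type):
    # Sketch: among agents that support the requested kernel type,
    # pick the least-loaded one; raise if none qualifies.
    candidates = [a for a in agents if kernel_type in a['kernels']]
    if not candidates:
        raise LookupError('no agent supports ' + kernel_type)
    return min(candidates, key=lambda a: a['load'])
```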
The list of available kernels and their respective metadata should be read from the Backend.AI manager instead of hard-coded in the client-side.
Use ai.backend.gateway.etcd.ConfigServer to retrieve image metadata from the images key.

Fetch the .py files of the given entry from S3 and add them to sys.path.
Extend the gateway entrypoint and unit test scripts to accept SSL context configuration.
Previously the clients had to generate and attach a unique run ID for all execute requests for each run.
Now we are going to improve this to reduce the burden on clients a little, by auto-assigning a run ID when the first execute request of a run carries a "null" run ID.
NOTE: Existing clients will still work without any change.
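The assignment rule can be sketched as follows (the helper name and ID format are illustrative, not the actual implementation):

```python
import secrets

def resolve_run_id(client_run_id):
    # Sketch: auto-assign a run ID when the first execute request of a
    # run carries a null run ID; otherwise keep the client-provided one,
    # so existing clients keep working unchanged.
    if client_run_id is None:
        return secrets.token_hex(8)
    return client_run_id
```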
Move most kernel-related logic into the sorna manager and registry.
For example, currently neumann.ingen tries to remember which kernel is associated with the current user session in the browser, and creates a new kernel if the matching kernel does not exist or cannot be connected.
GET_OR_CREATE_KERNEL(user_id, entry_id, spec)
To get detailed exception information without losing tracebacks and other rich contexts during error propagation, we need to support a 3rd-party event monitoring service.
Datadog is great for monitoring continuous values and event volumes, but not good for exception logging.
Add an is_admin flag column to keypairs.

Now we have multiple threads/processes for the manager. Hence, we need atomic rate-limiting enforcement using Redis scripts.
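Redis executes Lua scripts submitted via EVAL atomically, so concurrent manager processes cannot race between reading and updating a counter. A sketch of such a script (the key layout and window length are hypothetical, not the actual manager's):

```python
import textwrap

# Sketch: a fixed-window counter as a Redis Lua script. INCR and the
# conditional EXPIRE run as one atomic unit inside the Redis server.
RATE_LIMIT_SCRIPT = textwrap.dedent('''
    local count = redis.call('INCR', KEYS[1])
    if count == 1 then
        redis.call('EXPIRE', KEYS[1], ARGV[1])
    end
    return count
''')

# Usage with redis-py (requires a live Redis server):
#   count = conn.eval(RATE_LIMIT_SCRIPT, 1, 'rlim:' + access_key, 900)
#   if count > allowed_per_window: reject the request with 429
```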
Travis CI test failing...
Let's get notified when sorna has problems!
Though it does not seem to affect the service, I could see this in the production log:
2017-01-10 05:45:22 ERROR aiohttp.server Error handling request
Traceback (most recent call last):
File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/server.py", line 268, in start
self.transport.write_eof()
File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/asyncio/transports.py", line 117, in write_eof
raise NotImplementedError
NotImplementedError
Let it run on top of Docker, and recognize other containers such as the Redis server.
Let's add an API to Sorna so that external services can easily integrate with this code execution service.
When the gateway detects a new (or returning) agent, it should now query its information.
This information is only updated when the gateway and an agent make their first contact, not on every heartbeat.
In the future, this agent information will be used when choosing which agent instance to run specific kernel sessions as well as load balancing.
It will be useful to distinguish which session belongs to which "experiment" the user has in mind.
Python 3.6 RC1 is released. Let's have some fun.
2017-01-10 04:48:23 ERROR aiohttp.server Error handling request
Traceback (most recent call last):
File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web_server.py", line 61, in handle_request
resp = yield from self._handler(request)
File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web.py", line 249, in _handle
resp = yield from handler(request)
File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 105, in auth_middleware_handler
if not check_date(request):
File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 63, in check_date
if dt < min_time or dt > max_time:
TypeError: can't compare offset-naive and offset-aware datetimes
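The traceback above can be avoided by normalizing every parsed timestamp to a timezone-aware value before comparing it; a sketch (assuming naive stamps should be interpreted as UTC):

```python
from datetime import datetime, timezone

def as_aware_utc(dt):
    # Sketch: make every datetime timezone-aware so comparisons against
    # aware server times never raise TypeError.
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
```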
Convert the current tests/test_shell.py
into a more structured test suite.
The test suite will consist of:
sorna.agent tests (even if user code tampers with the sys module, it should not corrupt the agent)
sorna.manager tests
It is causing repeated agent-lost and revival loops, maybe due to synchronization issues.
In the execute_snippet() function of sorna/gateway/kernel.py:
try:
    with _timeout(2):
        params = await request.json()
    log.info(f'EXECUTE_SNIPPET (k:{kern_id})')
except (asyncio.TimeoutError, json.decoder.JSONDecodeError):
    log.warn('EXECUTE_SNIPPET: invalid/missing parameters')
    raise InvalidAPIParameters
Without catching it, JSONDecodeError bubbles up to aiohttp, causing a 500 Internal Server Error.
It should be converted into an InvalidAPIParameters exception instead.
2017-01-10 07:22:37 ERROR aiohttp.server Error handling request
Traceback (most recent call last):
File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web_server.py", line 61, in handle_request
resp = yield from self._handler(request)
File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web.py", line 249, in _handle
resp = yield from handler(request)
File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 146, in auth_middleware_handler
return (await handler(request))
File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/ratelimit.py", line 48, in rlim_middleware_handler
response = await handler(request)
File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 154, in wrapped
return (await handler(request))
File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/kernel.py", line 394, in execute_snippet
params = await request.json()
File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web_reqrep.py", line 384, in json
return loads(body)
File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Supervisord uses SIGTERM as its default stop signal, but sorna-manager and sorna-agent do not respond to SIGTERM, resulting in either a hang-up or a forced shutdown.
(In my test, SIGINT is automatically converted to KeyboardInterrupt, but SIGTERM is not converted to SystemExit.)
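A sketch of a fix (the helper name is illustrative, not the actual daemon code): install a SIGTERM handler that raises SystemExit, so supervisord's default stop signal follows the same cleanup path as Ctrl+C.

```python
import signal

def install_shutdown_handlers():
    # Sketch: translate SIGTERM into SystemExit; SIGINT already becomes
    # KeyboardInterrupt, so both signals now trigger graceful shutdown.
    def _handle_sigterm(signum, frame):
        raise SystemExit(0)
    signal.signal(signal.SIGTERM, _handle_sigterm)
```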
Users can reuse their client-side session token for different sessions.
This means that a token may be reused ONLY AFTER a previous session with the same token has completely terminated. Otherwise, the existing session should be "reused" (returned without creating a new one), and any new session creation requests with different language specs must be blocked (#56).
We need to enforce this constraint in the database level, if possible, to avoid potential logical and synchronization bugs in the manager.
In SQL, an index that applies to a subset of rows in a table is called a "partial index". So we could create a unique partial index over kernels.sess_id with the condition kernels.status != "TERMINATED".
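The constraint could be expressed with DDL like the following sketch (assumes PostgreSQL and the column names shown; the actual schema may differ):

```python
# Sketch: hypothetical DDL enforcing "at most one live session per
# token" at the database level via a unique partial index.
CREATE_ACTIVE_SESSION_INDEX = """
CREATE UNIQUE INDEX kernels_sess_id_active_uniq
ON kernels (sess_id)
WHERE status != 'TERMINATED';
"""
```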
Including #2, implement a redis-based manager state management. More details will follow.
If the user specifies the --docker-registry option when running the gateway server, use it by replacing the "lablup" Docker Hub account in kernel image names with the user-provided registry address (e.g., "lablup/kernel-python3" becomes "registry-cache.internal.example.com/kernel-python3").
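The rewrite rule can be sketched as a small helper (the function name is hypothetical, not the actual gateway code):

```python
def resolve_image(name, registry=None):
    # Sketch: swap the default "lablup" Docker Hub prefix for a
    # user-provided registry address; leave other image names untouched.
    if registry and name.startswith('lablup/'):
        return registry + '/' + name.split('/', 1)[1]
    return name
```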
For non-standard HTTP methods to pass through firewalls, it is often useful to provide a workaround such as an X-Method-Override header that specifies the intended HTTP method while the request itself is sent as a POST request.
Let's add a middleware to handle this, or patch aiohttp to support it.
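The core of such a middleware reduces to a small resolution rule; the helper below is illustrative (a real aiohttp middleware would wrap the request object rather than take plain arguments):

```python
def effective_method(method, headers):
    # Sketch: resolve the intended HTTP method when a client tunnels a
    # non-standard verb through POST via the X-Method-Override header.
    override = headers.get('X-Method-Override')
    if method == 'POST' and override:
        return override.upper()
    return method
```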
Add an execute_file API that accepts multiple files as an HTTP multipart upload. Uploaded files go under /home/work/_upload (excluded from S3 upload).

Write documentation both manually and via docstrings.
When running unit tests, we sometimes need to destroy the kernels deliberately, without waiting for timeouts.
This will be useful for client developers who implement fine-grained billing for their own individual users while the client-side service uses a single API keypair for Backend.AI.
Egoing has reported an issue that he could not see the result of the following code:
var 입력한비밀번호 = '1111';  // entered password
var 소금의크기 = 32;          // salt size
var 암호화반복횟수 = 10000;   // iteration count
var 암호의길이 = 32;          // key length
var crypto = require('crypto');
crypto.randomBytes(소금의크기, function (오류, 소금) {  // (err, salt)
    crypto.pbkdf2(입력한비밀번호, 소금, 암호화반복횟수, 암호의길이, 'sha512', function (오류, 생성된암호) {  // (err, derivedKey)
        console.log(생성된암호.toString('hex'));
    });
});
This happens because the current Node.js kernel runs straight through the synchronous part of the code, sends the execution result, and only then executes the callbacks registered by the user code.
We need a "blocking" mechanism that waits until all user callbacks finish, as well as a way to temporarily remove existing sorna-side callbacks from the event loop.
As a result, I found a small hacky open-source project that uses a C++ addon to access the uv_run() function, and patched it to implement a blocking call that waits until all callbacks finish:
abbr/deasync#53
Then, I added unref() / ref() support to the zeromq.node project:
JustinTulloss/zeromq.node#503
Now we can implement a proper blocking call for nodejs4 kernel.
Add an UPDATING state to AgentStatus. The update is triggered by two cases; the agent upgrades itself (e.g., via pip install -U) and then restarts automatically.
Should AgentStatus also include an UNHEALTHY state? (#46)