lablup / backend.ai-manager
Backend.AI Manager and API Gateway Daemon
License: GNU Lesser General Public License v3.0
Although it does not seem to affect the service, I saw this in the production log:
2017-01-10 05:45:22 ERROR aiohttp.server Error handling request
Traceback (most recent call last):
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/server.py", line 268, in start
    self.transport.write_eof()
  File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/asyncio/transports.py", line 117, in write_eof
    raise NotImplementedError
NotImplementedError
Details in WIP.
Provide a virtual folder API. Users can create virtual folders with limited storage and mount them when launching kernel sessions by passing the list of virtual folders via options.
/home/work of kernel containers
Python 3.6 RC1 is released. Let's have some fun.
Now we have multiple threads/processes for the manager.
Hence, we need atomic rate limiting enforcement, using Redis scripts.
Thanks to @gofeel for discovering this issue.
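A minimal sketch of what such an atomic check could look like, assuming a fixed-window counter per access key. The key prefix `rlim:`, the helper name `check_rate_limit`, and the window length are illustrative assumptions, not the manager's actual implementation. The client object is assumed to expose redis-py's `eval(script, numkeys, *keys_and_args)` method. Since Redis runs a script as one atomic unit, concurrent manager processes cannot interleave between the increment and the expiry.

```python
# Lua script executed atomically inside Redis: increment the counter and
# start the expiry window on the first hit of that window.
FIXED_WINDOW_SCRIPT = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""


def check_rate_limit(client, access_key, limit, window_secs=900):
    """Return True if the request is within the limit (hypothetical helper).

    `client` is assumed to offer redis-py's
    eval(script, numkeys, *keys_and_args) method.
    """
    key = f'rlim:{access_key}'  # key naming is an assumption
    count = client.eval(FIXED_WINDOW_SCRIPT, 1, key, window_secs)
    return int(count) <= limit
```

A middleware could call `check_rate_limit()` once per request and reject the request when it returns False.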
is_admin flag column of keypairs.
In the execute_snippet() function of sorna/gateway/kernel.py:
try:
    with _timeout(2):
        params = await request.json()
    log.info(f'EXECUTE_SNIPPET (k:{kern_id})')
except (asyncio.TimeoutError, json.decoder.JSONDecodeError):
    log.warn('EXECUTE_SNIPPET: invalid/missing parameters')
    raise InvalidAPIParameters
JSONDecodeError is not caught and bubbles up to aiohttp, causing a 500 Internal Server Error.
It should be converted into an InvalidAPIParameters exception instead.
2017-01-10 07:22:37 ERROR aiohttp.server Error handling request
Traceback (most recent call last):
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web_server.py", line 61, in handle_request
    resp = yield from self._handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web.py", line 249, in _handle
    resp = yield from handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 146, in auth_middleware_handler
    return (await handler(request))
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/ratelimit.py", line 48, in rlim_middleware_handler
    response = await handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 154, in wrapped
    return (await handler(request))
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/kernel.py", line 394, in execute_snippet
    params = await request.json()
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web_reqrep.py", line 384, in json
    return loads(body)
  File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
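The intended behavior can be sketched outside aiohttp as follows. This is a minimal illustration, assuming the body is already available as a string; `parse_json_body` is a hypothetical helper and the `InvalidAPIParameters` class here is only a stand-in for the gateway's exception of the same name.

```python
import json


class InvalidAPIParameters(Exception):
    """Stand-in for the gateway's exception of the same name."""


def parse_json_body(body):
    """Decode a request body, mapping decode failures to a client error."""
    try:
        return json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError) as e:
        # An empty or malformed body is the client's fault, so it should
        # surface as a parameter error rather than an unhandled exception.
        raise InvalidAPIParameters('invalid or missing JSON body') from e
```

An empty body, which produces the "Expecting value: line 1 column 1 (char 0)" error in the traceback above, then raises `InvalidAPIParameters` instead.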
When running unit tests, we sometimes need to destroy the kernels deliberately, without waiting for timeouts.
Extend the gateway entrypoint and unit test scripts to accept SSL context configuration.
Let's add an API to Sorna so that external services can easily integrate with this code execution service.
Let's make API handlers aware of client request versions.
.py files of the given entry from S3
.py files??
sys.path
Most things will work fine, but we need to test it out.
(The most breaking difference is the async generator protocol change, and this may impact 3rd-party asyncio libraries.)
It is causing repeated agent-lost and revival loops, maybe due to synchronization issues.
2017-01-10 04:48:23 ERROR aiohttp.server Error handling request
Traceback (most recent call last):
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web_server.py", line 61, in handle_request
    resp = yield from self._handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web.py", line 249, in _handle
    resp = yield from handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 105, in auth_middleware_handler
    if not check_date(request):
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 63, in check_date
    if dt < min_time or dt > max_time:
TypeError: can't compare offset-naive and offset-aware datetimes
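One way to avoid this TypeError is to make every datetime offset-aware before comparing. The sketch below is an illustration only: the helper names are hypothetical, treating naive values as UTC is an assumption, and the 15-minute default tolerance is likewise assumed.

```python
from datetime import datetime, timedelta, timezone


def ensure_aware(dt):
    """Attach UTC to naive datetimes (assumption: naive values mean UTC)."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt


def within_tolerance(dt, now=None, tolerance=timedelta(minutes=15)):
    """Compare a client timestamp with server time without a TypeError."""
    now = ensure_aware(now) if now is not None else datetime.now(timezone.utc)
    dt = ensure_aware(dt)
    return now - tolerance <= dt <= now + tolerance
```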
For non-standard HTTP methods to pass through firewalls, it is often useful to provide a workaround such as an X-Method-Override header that specifies the custom HTTP method while the request itself is sent as a POST request.
Let's add a middleware to handle this or patch aiohttp to support this.
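The core of such a middleware is a small decision function, sketched here independently of any framework. `resolve_method` and the `ALLOWED_OVERRIDES` set are hypothetical; the set of methods a deployment would accept is an assumption.

```python
# Methods that a POST request may be rewritten to (assumed set; custom
# methods would be added here).
ALLOWED_OVERRIDES = {'PUT', 'PATCH', 'DELETE'}


def resolve_method(method, headers):
    """Return the effective HTTP method for a request.

    Only POST requests are rewritten, so a header alone cannot turn a
    safe method such as GET into a state-changing one.
    """
    override = None
    for name, value in headers.items():
        if name.lower() == 'x-method-override':  # header names are case-insensitive
            override = value.strip().upper()
            break
    if method.upper() == 'POST' and override in ALLOWED_OVERRIDES:
        return override
    return method.upper()
```

A middleware would call this before routing and dispatch on the returned method.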
We need to run a GPU-capable high-end physical server for deep-learning kernels side by side with EC2 instances for scaling other types of kernels.
Let the manager be able to choose which instance to use depending on supported kernel types.
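As a rough illustration of the selection step, the sketch below filters instances by the kernel types they support and picks the least-loaded one. The record fields `kernel_types` and `num_kernels` and the function name are hypothetical; the real registry schema may differ.

```python
def choose_instance(instances, kernel_type):
    """Pick the least-loaded instance that supports kernel_type, or None."""
    candidates = [i for i in instances if kernel_type in i['kernel_types']]
    if not candidates:
        return None
    return min(candidates, key=lambda i: i['num_kernels'])
```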
Supervisord uses SIGTERM as its default stop signal, but sorna-manager and sorna-agent do not respond to SIGTERM, resulting in either a hang-up or a forced shutdown.
(In my test, SIGINT is automatically converted to KeyboardInterrupt, but SIGTERM is not converted to SystemExit.)
The list of available kernels and their respective metadata should be read from the Backend.AI manager instead of being hard-coded on the client side.
ai.backend.gateway.etcd.ConfigServer to retrieve image metadata.
images
Let it run on top of docker, and recognize other containers such as Redis server.
To get detailed exception information without losing tracebacks and other rich contexts during error propagation, we need to support a 3rd-party event monitoring service.
Datadog is great for monitoring continuous values and event volumes, but not good for exception logging.
Move most kernel-related logic into the sorna manager and registry.
For example, neumann.ingen currently tries to remember which kernel is associated with the current user session in the browser, and creates a new kernel if the matching kernel does not exist or cannot be connected.
GET_OR_CREATE_KERNEL(user_id, entry_id, spec)
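The proposed call could behave roughly like the following in-memory sketch. The `KernelRegistry` class, the `alive` flag, and the record layout are hypothetical stand-ins for the manager-side registry; a real implementation would also need to spawn the kernel on an agent.

```python
import uuid


class KernelRegistry:
    """In-memory stand-in for the manager-side registry (hypothetical)."""

    def __init__(self):
        self._kernels = {}  # (user_id, entry_id) -> kernel record

    def get_or_create_kernel(self, user_id, entry_id, spec):
        """Return (kernel, created) for the given user session."""
        key = (user_id, entry_id)
        kernel = self._kernels.get(key)
        if kernel is not None and kernel['alive']:
            return kernel, False  # reuse the matching live kernel
        # No matching kernel, or it can no longer be reached: create one.
        kernel = {'id': uuid.uuid4().hex, 'spec': spec, 'alive': True}
        self._kernels[key] = kernel
        return kernel, True
```

With this in the registry, clients such as neumann.ingen no longer need to track the kernel association themselves.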
Let's get notified when sorna has problems!
Write documentation both manually and by docstrings.
Egoing reported that he could not see the result of the following code:
var 입력한비밀번호 = '1111';    // entered password
var 소금의크기 = 32;            // salt size
var 암호화반복횟수 = 10000;     // number of hashing iterations
var 암호의길이 = 32;            // derived key length
var crypto = require('crypto');
crypto.randomBytes(소금의크기, function(오류, 소금){  // (error, salt)
  crypto.pbkdf2(입력한비밀번호, 소금, 암호화반복횟수, 암호의길이, 'sha512', function(오류, 생성된암호){  // (error, derived key)
    console.log(생성된암호.toString('hex'));
  });
});
This is because the current nodejs kernel only runs the synchronous part of the user code before it sends the execution result; callbacks generated by the user code are executed later.
We need a mechanism that blocks until all user callbacks finish, while temporarily removing the existing sorna-side callbacks from the event loop.
As a result, I found a small, hacky open source project that uses a C++ addon to access the uv_run() function, and patched it to implement a blocking call that waits until all callbacks finish:
abbr/deasync#53
Then I added unref() / ref() support to the zeromq.node project:
JustinTulloss/zeromq.node#503
Now we can implement a proper blocking call for nodejs4 kernel.
tcp://logger.lablup:2120)
codeonweb-logs with PutObject / DeleteObject permissions)
Including #2, implement Redis-based manager state management. More details will follow.
When the gateway detects a new (or returning) agent, it should now query its information including:
This information is only updated when the gateway and an agent make their first contact, not on every heartbeat.
In the future, this agent information will be used when choosing which agent instance to run specific kernel sessions as well as load balancing.
Previously the clients had to generate and attach a unique run ID for all execute requests for each run.
Now we are going to reduce the burden on clients a little by auto-assigning a run ID if the first execute request of a run has a "null" run ID.
NOTE: Existing clients will still work without any change.
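The server-side rule can be sketched in a few lines. `resolve_run_id` is a hypothetical helper, and generating the ID with `uuid4` is an assumption about the format.

```python
import uuid


def resolve_run_id(requested_run_id):
    """Return (run_id, assigned), generating an ID only for a null run ID."""
    if requested_run_id is None:
        # First request of a run without an ID: assign one on the server.
        return uuid.uuid4().hex, True
    # Clients that already attach their own run ID are left untouched.
    return requested_run_id, False
```

The assigned ID would be returned in the response so that the client can attach it to the remaining requests of the same run.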
UPDATING: the update is triggered by two cases:
pip install -U), and then restart automatically.
AgentStatus?
UNHEALTHY (#46)
Different languages may use slightly different timezone offset formats when generating ISO 8601-style datetime strings. So, after checking that the timestamp is within the allowed range (±15 minutes of the server time), we need to use the Date header value as-is when canonicalizing the request.
Python 3.6:
>>> from datetime import datetime
>>> from dateutil.tz import tzutc
>>> datetime.now(tzutc()).isoformat()
'2017-01-10T06:15:09.189505+00:00'
>>> datetime.now().isoformat()
'2017-01-10T15:15:15.573437'
>>> datetime.utcnow().isoformat()
'2017-01-10T06:15:23.914917'
JavaScript:
var d = new Date;
console.log(d.toISOString());
// 2017-01-10T06:16:21.828Z
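A minimal sketch of this two-step handling: parse the header only for the range check, then hand back the original string for canonicalization. The helper names are hypothetical, treating naive timestamps as UTC is an assumption, and `datetime.fromisoformat()` requires Python 3.7 or later.

```python
from datetime import datetime, timedelta, timezone


def parse_iso8601(raw):
    """Parse the ISO 8601 variants shown above into an aware datetime."""
    text = raw.strip()
    if text.endswith('Z'):  # suffix produced by JavaScript's toISOString()
        text = text[:-1] + '+00:00'
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt


def validate_and_keep_raw(raw, now, tolerance=timedelta(minutes=15)):
    """Range-check the parsed value but return the header string untouched."""
    dt = parse_iso8601(raw)
    if abs(now - dt) > tolerance:
        raise ValueError('request time is outside the allowed range')
    return raw  # use this exact string when canonicalizing the request
```

Because the returned string is byte-for-byte what the client sent, signatures match regardless of which offset format the client's language produces.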
The inverted default of the check_shadow parameter in sorna.manager.registry.InstanceRegistry.enumerate_instances() caused the monitoring view to miss lost instances.
Travis CI test failing...
(Along with lablup/backend.ai-kernel-runner#1)
If the user specifies the --docker-registry option when running the gateway server, let's use it by replacing the "lablup" Docker Hub account in the kernel image names with the user-provided registry address (e.g., "lablup/kernel-python3" becomes "registry-cache.internal.example.com/kernel-python3").
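The rewrite rule from the example above could be sketched like this. `rewrite_image_name` is a hypothetical helper, not the gateway's actual code.

```python
def rewrite_image_name(image, registry=None, account='lablup'):
    """Swap the Docker Hub account prefix for a user-provided registry."""
    prefix = account + '/'
    if registry and image.startswith(prefix):
        return registry.rstrip('/') + '/' + image[len(prefix):]
    # No registry given, or the image is not under the account: keep as-is.
    return image
```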
Convert the current tests/test_shell.py into a more structured test suite.
The test suite will consist of:
sorna.agent
sys, it should not corrupt the agent.)
sorna.manager
sorna.manager
It will be useful to distinguish which session belongs to which "experiment" as the user thinks of it.
Users can reuse their client-side session token for different sessions.
This means that a token may be reused ONLY AFTER a previous session with the same token has completely terminated. Otherwise, the session should be "reused" (returned without creating a new session) and any new session creation requests must be blocked if they have different language specs (#56).
We need to enforce this constraint at the database level, if possible, to avoid potential logical and synchronization bugs in the manager.
In SQL, an index that applies to only a subset of a table's rows is called a "partial index". So we could create a unique partial index over kernels.sess_id with the condition kernels.status != 'TERMINATED'.
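The constraint can be tried out with SQLite from the Python standard library, used here only as a stand-in for the manager's actual database. The table layout, index name, and `create_kernel` helper are assumptions for illustration; PostgreSQL accepts the same `CREATE UNIQUE INDEX ... WHERE` form.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE kernels ("
    " id INTEGER PRIMARY KEY,"
    " sess_id TEXT NOT NULL,"
    " status TEXT NOT NULL)"
)
# Uniqueness is enforced only among rows that are not yet terminated.
conn.execute(
    "CREATE UNIQUE INDEX ix_kernels_live_sess_id"
    " ON kernels (sess_id)"
    " WHERE status != 'TERMINATED'"
)


def create_kernel(sess_id, status='RUNNING'):
    """Insert a kernel row; raises sqlite3.IntegrityError on a live duplicate."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO kernels (sess_id, status) VALUES (?, ?)",
            (sess_id, status),
        )
```

A second `create_kernel()` call with the same token fails while the first row is still live, and succeeds again once that row's status is set to TERMINATED.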
This will be useful for client developers who implement fine-grained billing for their own individual users while the client-side service uses a single API keypair for Backend.AI.
execute_file that accepts multiple files as HTTP multipart upload.
/home/work/_upload (excluded from S3 upload)
Due to the diverged develop branch, we should postpone renaming the manager/agent code for now.
However, since the client will be released under the "Backend.AI" brand soon, we need to accept the new API headers before upgrading the manager and agents.