lablup / backend.ai-manager

Backend.AI Manager and API Gateway Daemon

License: GNU Lesser General Public License v3.0

Python 99.90% Mako 0.03% Dockerfile 0.02% Shell 0.04%

api-gateway asyncio http auto-scaling sorna python

backend.ai-manager's Issues

Rarely happening errors (report to aiohttp maintainers?)

Although it does not seem to affect the service, I saw this in the production log:

2017-01-10 05:45:22 ERROR aiohttp.server Error handling request
Traceback (most recent call last):
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/server.py", line 268, in start
    self.transport.write_eof()
  File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/asyncio/transports.py", line 117, in write_eof
    raise NotImplementedError
NotImplementedError

Database design and deploy for API user accounts and metering

Details in WIP.

Virtual folder API

Provide a virtual folder API. Users can create virtual folders with limited storage and mount them when launching kernel sessions by passing the list of virtual folders via options.

Write an initial master-agent negotiation step to get mount information
- This is done via etcd.
Test mounting mounted SMB volumes into docker containers
- This works like a charm with Azure FileShare storage.
Test mounting mounted NFS volumes into docker containers
- A quick search suggests that mounting from inside the container does not work without privileged mode. Mounting on the host seems to work.
- Have a look at docker volume NFS extension or NetApp docker volume plugin?
Configurations (separate NFS host or a local root directory)
CRUD of virtual folder instances with a database schema
- Each folder has a unique ID, filesystem alias, ownership and ACL, size/file limits.
- Filesystem alias cannot start with dots.
- sorna-agent should skip mounted virtual folders when uploading output files in /home/work of kernel containers
Deploy a backend NFS/SMB server - later, we would migrate to Amazon EFS or other similar NFS-compatible storage services for scalability
- How could we dynamically mount/unmount subtree of the NFS? Fuse?
- The NFS server itself may use the standard kernel-side NFS daemon, but should use a separate EBS volume for ease of management and backups.
- UPDATE: This is done using Azure File Share. It has some latency due to geographic distances, so later we will revisit for performance optimization using closer locations if available.
Implement mounting of virtual folders to agent instances and kernels
- We should keep each virtual folder exclusively mounted. (not shared among multiple kernels even for a single user) -- release this limitation later? (it's NFS's responsibility to keep filesystem consistency anyway)
- Filesystem-level synchronization should be handled by the NFS/SMB server.

Upgrade to Python 3.6

Python 3.6 RC1 is released. Let's have some fun.

Apply new image naming and tagging rules

Update sample alias map
Add explicit version tag argument to kernel creation API
Include explicit version tags when listing/querying the kernel sessions

Atomic rate limiting

Now we have multiple threads/processes for the manager.
Hence, we need atomic rate limiting enforcement, using Redis scripts.

Reject reuse of client-side session tokens with different settings explicitly

Thanks to @gofeel for discovering this issue.

Redesign database and add admin API

Migrate session mgmt from Redis to PostgreSQL. Use an in-memory (unlogged) table?
Migrate instance registry from Redis to PostgreSQL. Use an in-memory (unlogged) table?
Add history logging for instance/session events.
Move internal Django models (keypair and usage) to this project.
~~Add a basic permission model to keypairs.~~ Replaced with a simple is_admin flag column of keypairs.
Add admin-only APIs for sorna-cloud service. The admin privilege is given to a specific keypair.
- Add new keypairs, list keypairs for a given UID, show usage for a given keypair, etc.
Add management APIs for users.
- List and control currently running kernel sessions for a given keypair.
- Show the resource usage and limits for a given keypair.

Exception escapes try-except clause

In sorna/gateway/kernel.py execute_snippet() function:

    try:
        with _timeout(2):
            params = await request.json()
        log.info(f'EXECUTE_SNIPPET (k:{kern_id})')
    except (asyncio.TimeoutError, json.decoder.JSONDecodeError):
        log.warn('EXECUTE_SNIPPET: invalid/missing parameters')
        raise InvalidAPIParameters

JSONDecodeError is not caught and bubbles up to aiohttp, causing a 500 Internal Server Error.
It should have been replaced with an InvalidAPIParameters exception.

2017-01-10 07:22:37 ERROR aiohttp.server Error handling request
Traceback (most recent call last):
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web_server.py", line 61, in handle_request
    resp = yield from self._handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web.py", line 249, in _handle
    resp = yield from handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 146, in auth_middleware_handler
    return (await handler(request))
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/ratelimit.py", line 48, in rlim_middleware_handler
    response = await handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 154, in wrapped
    return (await handler(request))
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/kernel.py", line 394, in execute_snippet
    params = await request.json()
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web_reqrep.py", line 384, in json
    return loads(body)
  File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/sorna/.pyenv/versions/3.6.0/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
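
A minimal sketch of the intended conversion, written without aiohttp so it can run on its own. The parse_params() helper, its read_body argument, and the stand-in InvalidAPIParameters class are hypothetical and are not the gateway's actual code. It relies on json.JSONDecodeError being a subclass of ValueError, so catching ValueError covers empty and malformed bodies alike.

```python
import asyncio
import json


class InvalidAPIParameters(Exception):
    """Stand-in for the gateway's client-error exception (hypothetical)."""


async def parse_params(read_body, timeout=2.0):
    """Read and decode a JSON request body, or raise InvalidAPIParameters."""
    try:
        raw = await asyncio.wait_for(read_body(), timeout)
        return json.loads(raw)
    except (asyncio.TimeoutError, ValueError) as e:
        # json.JSONDecodeError subclasses ValueError, so an empty or
        # malformed body ends up here instead of reaching the server.
        raise InvalidAPIParameters('invalid/missing parameters') from e
```

With this shape, an empty body surfaces as a client error rather than a 500 response.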

Support deliberate kernel destroy

When running unit tests, we sometimes need to destroy the kernels deliberately, without waiting for timeouts.

Evaluate GraphQL as a new versionless API design

http://graphql.org/
https://github.com/graphql-python/graphql-core
https://github.com/apollographql/graphql-subscriptions
http://graphql.org/blog/subscriptions-in-graphql-and-relay/

SSL Support

Extend the gateway entrypoint and unit test scripts to accept SSL context configuration.

REST API and Documentation for APIs

Let's add an API to Sorna so that external services can easily integrate with this code execution service.

API Design
- Authentication like AWS API
- Standardized error response (candidate: RFC-7807)
- Kernel/session management
- Code execution and result returns in various forms (including WebSocket-based stdout streaming and customized display data)
Backend
- Usage metrics and statistics
- Billing
Documentation
- Create a readthedocs site.

Version-aware API handlers

Let's make API handlers aware of client request versions.

Importing and including external codes

Implement "import" functionality
- Fetch .py files of the given entry from S3
- Separate editor for .py files??
- Allow the code blocks in the entry to import them by manipulating sys.path
"include" functionality
- It's different from "import": it literally copy-and-pastes the designated code block (identified by the entry ID and the code block ID)
- We need to show the code block ID to the user.

Upgrade to Python 3.5.2

Most things will work fine, but we need to test it out.
(The most breaking difference is the async generator protocol change, and this may impact 3rd-party asyncio libraries.)

Move all "shared states" to Redis

It is causing repeated agent-lost and revival loops, maybe due to synchronization issues.

Internal server error when user-provided timestamp has no timezone offset

2017-01-10 04:48:23 ERROR aiohttp.server Error handling request
Traceback (most recent call last):
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web_server.py", line 61, in handle_request
    resp = yield from self._handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/aiohttp/web.py", line 249, in _handle
    resp = yield from handler(request)
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 105, in auth_middleware_handler
    if not check_date(request):
  File "/home/sorna/venv/lib/python3.6/site-packages/sorna/gateway/auth.py", line 63, in check_date
    if dt < min_time or dt > max_time:
TypeError: can't compare offset-naive and offset-aware datetimes
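
The TypeError comes from comparing an offset-naive datetime with offset-aware bounds. Below is a minimal sketch of a range check that avoids it. The is_within_window() name and signature are hypothetical (this is not the actual check_date() in auth.py), and treating naive timestamps as UTC is an assumption of this sketch; rejecting them as invalid parameters would be another option.

```python
from datetime import datetime, timedelta, timezone


def is_within_window(dt, now=None, tolerance=timedelta(minutes=15)):
    """Return True if dt lies within the tolerance window around now.

    Naive datetimes are interpreted as UTC (an assumption of this sketch).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if dt.tzinfo is None or dt.utcoffset() is None:
        # Attach UTC so the comparison below never mixes naive and aware values.
        dt = dt.replace(tzinfo=timezone.utc)
    return now - tolerance <= dt <= now + tolerance
```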

Support X-Method-Override header

To let non-standard HTTP methods pass through firewalls, it is often useful to provide a workaround such as an X-Method-Override header that specifies the custom HTTP method while the request itself is sent as a POST request.
Let's add a middleware to handle this or patch aiohttp to support this.
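
The core of such a middleware could look like the sketch below. It is a plain function over a method string and a header mapping, so it does not depend on aiohttp. The effective_method() name is hypothetical, and honouring the override only on POST requests is an assumption taken from the description above.

```python
def effective_method(method, headers):
    """Return the HTTP method a request should be dispatched as."""
    override = None
    # Header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == 'x-method-override':
            override = value.strip().upper()
            break
    # Honour the override only when the request was actually sent as POST.
    if method.upper() == 'POST' and override:
        return override
    return method.upper()
```

A middleware would call this once per request and route on the returned method instead of the raw one.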

Capability-based instance selection

We need to run a GPU-capable high-end physical server for deep-learning kernels and EC2 instances for scaling other types of kernels side-by-side.
Let the manager be able to choose which instance to use depending on supported kernel types.

Add supported kernel list to instance registry.
- The list also contains the min/max resource requirements (cpu/mem/gpu slots) of each kernel.
Let agents report their supported kernel list with heartbeats.
When creating kernels, choose an instance that can run the given kernel type.
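
The selection step could be sketched as follows. The Instance record, its field names, and the kernel type names in the usage note are hypothetical, and the per-kernel requirement is simplified to a single required amount per slot type rather than a min/max range.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    inst_id: str
    # kernel type -> required slots, e.g. {'cpu': 2, 'gpu': 1}
    supported_kernels: dict
    # slot type -> currently free amount
    free_slots: dict = field(default_factory=dict)


def choose_instance(instances, kernel_type):
    """Return the first instance that supports kernel_type and has room for it."""
    for inst in instances:
        required = inst.supported_kernels.get(kernel_type)
        if required is None:
            continue  # this instance cannot run the kernel type at all
        if all(inst.free_slots.get(k, 0) >= v for k, v in required.items()):
            return inst
    return None
```

For example, with a GPU server that reports only a "python3-tensorflow" kernel and an EC2 instance that reports only "python3", a request for either kernel type lands on the matching instance, and a request for an unreported type returns None.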

Handle SIGTERM correctly

Supervisord uses SIGTERM as its default stop signal, but sorna-manager and sorna-agent do not respond to SIGTERM, resulting in either a hang-up or a forced shutdown.
(In my test, SIGINT is automatically converted to KeyboardInterrupt, but SIGTERM is not converted to SystemExit.)

API New: User API for Querying Image Metadata

The list of available kernels and their respective metadata should be read from the Backend.AI manager instead of being hard-coded on the client side.

Write methods in ai.backend.gateway.etcd.ConfigServer to retrieve image metadata.
Add a new GraphQL schema type images

Containerization

Let it run on top of docker, and recognize other containers such as Redis server.

Multi-core support

For scalability, the gateway server must become multi-core enabled.

Adopt aiotools for server spinning up.
Redesign the event dispatcher mechanism.
Convert all Redis transactions to DB transactions (along with #31) for multi-process safety.

Clean up of old/hung kernels

Extend heartbeats to include the time (age) when the kernel last executed any user code. If that exceeds a certain threshold, destroy the kernel.
Clean up the tracking information of kernels that do not respond to heartbeats.
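
The two checks could be combined into one periodic sweep, sketched below. The find_stale_kernels() name, the dictionary layout, and the threshold values in the usage note are all hypothetical.

```python
def find_stale_kernels(kernels, now, idle_timeout, heartbeat_timeout):
    """Split kernel IDs into (idle, lost) lists.

    kernels maps kernel_id -> {'last_exec': float, 'last_heartbeat': float},
    with all timestamps in seconds on the same clock as now.
    """
    idle, lost = [], []
    for kern_id, info in kernels.items():
        if now - info['last_heartbeat'] > heartbeat_timeout:
            lost.append(kern_id)   # no heartbeat: drop its tracking information
        elif now - info['last_exec'] > idle_timeout:
            idle.append(kern_id)   # alive but unused for too long: destroy it
    return idle, lost
```

A periodic task could call this with, say, a 600-second idle timeout and a 30-second heartbeat timeout, then destroy the idle kernels and delete the tracking keys of the lost ones.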

Sentry monitoring support

To get detailed exception information without losing tracebacks and other rich contexts during error propagation, we need to support a 3rd-party event monitoring service.
Datadog is great for monitoring continuous values and event volumes, but not good for exception logging.

Implement per-user per-entry kernel session

Move most kernel-related logic into the sorna manager and registry.
For example, neumann.ingen currently tries to remember which kernel is associated with its current user session in the browser and creates a new kernel if the matching kernel does not exist or cannot be connected.

Shorten UUID key encoding to save key space (unpadded URL-safe base64: previous 32 bytes hex will be reduced to 22 bytes.)
Add new key-value sets to the registry: (user_id, entry_id, spec) -> set(kernel_id, ...)
- The value may include multiple kernel IDs for future extensions such as distributed computing support.
Add new API to the manager: GET_OR_CREATE_KERNEL(user_id, entry_id, spec)
Move neumann's kernel lookup logic to the new API and implement "create if not exists".
Implement clean-up of session tracking keys in Redis
Write unit tests
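
The shortened key encoding from the first item can be sketched with the standard library alone. A UUID is 16 bytes, which is 32 characters in hex and 24 in base64 including two padding characters, so stripping the padding leaves 22. The function names are hypothetical.

```python
import base64
import uuid


def encode_uuid(u):
    """Encode a UUID as 22 characters of unpadded URL-safe base64."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b'=').decode('ascii')


def decode_uuid(s):
    """Inverse of encode_uuid(): restore the two stripped '=' characters."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + '=='))
```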

Datadog monitoring support

Let's get notified when sorna has problems!

Documentation

Write documentation both manually and by docstrings.

Allow callbacks in nodejs4 kernel

Egoing has reported an issue that he could not see the result of the following code:

var 입력한비밀번호 = '1111';
var 소금의크기 = 32;
var 암호화반복횟수 = 10000;
var 암호의길이 = 32;
var crypto = require('crypto');
crypto.randomBytes(소금의크기, function(오류, 소금){
    crypto.pbkdf2(입력한비밀번호, 소금, 암호화반복횟수, 암호의길이, 'sha512', function(오류, 생성된암호){
        console.log(생성된암호.toString('hex'));
    });
});

This happens because the current nodejs kernel runs only the synchronous part of the code before sending the execution result, so callbacks registered by the user code run later.
We need a mechanism that blocks until all user callbacks finish, and a way to temporarily remove existing sorna-side callbacks from the event loop.

To that end, I found a small, hacky open source project that uses a C++ addon to access the uv_run() function, and patched it to implement a call that blocks until all callbacks finish:
abbr/deasync#53

Then I added unref() / ref() support to the zeromq.node project:
JustinTulloss/zeromq.node#503

Now we can implement a proper blocking call for nodejs4 kernel.

Logging System

Create an EC2 instance that runs logstash daemon.
- input: ZMQ PULL socket (tcp://logger.lablup:2120)
- output: S3 (bucket name: codeonweb-logs with PutObject / DeleteObject permissions)
Add logging handlers to neumann and sorna to generate properly formatted JSON messages delivered via a ZMQ PUSH socket.
- ref: http://blog.tpa.me.uk/2013/11/20/logstash-v1-1-v1-2-json-event-layout-format-change/
Build a log-search system (some simple script or ElasticSearch?)
Schedule a cron job (using AWS Lambda?) to summarize logs hourly and daily.
Periodically move old logs into yyyy/mm/dd subdirectories of the S3 bucket.
- This requirement is changed to storing logs with monthly/daily separated paths in the bucket.

null return after server restart

When the server is restarted, sorna does not return a result.
- Neumann spawns a uwsgi instance (to send a request to sorna), but sorna returns no answer.
- As a result, a new uwsgi instance is spawned on every user request.
The user needs to log out and log in again to get sorna working.
Alternatively, restarting nginx makes it work again.
Maybe a session-related issue?

Implement Redis-based manager state management

Including #2, implement a redis-based manager state management. More details will follow.

Basic inventory management

When the gateway detects a new (or returning) agent, it should now query its information including:

the instance type / specification of the agent
the list of available kernels
the resource requirements of each kernel (number of CPU cores, maximum memory, etc.)
the supported API version of each kernel

This information is only updated when the gateway and an agent make their first contact, not on every heartbeat.

In the future, this agent information will be used when choosing which agent instance should run a specific kernel session, as well as for load balancing.

API Update: Server-assigned run IDs

Previously, clients had to generate and attach a unique run ID to all execute requests of each run.
We are going to reduce this burden a little by auto-assigning a run ID when the first execute request of a run has a "null" run ID.

NOTE: Existing clients will still work without any change.
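
The server-side rule is small enough to sketch directly. The resolve_run_id() name and the 16-character hexadecimal ID format are hypothetical choices for illustration.

```python
import secrets


def resolve_run_id(requested_run_id):
    """Return the client's run ID, or assign a fresh one when it is None."""
    if requested_run_id is None:
        # JSON null arrives as None; generate a random ID for this run.
        return secrets.token_hex(8)
    return requested_run_id
```

Clients that already send their own run IDs get them back unchanged, which matches the compatibility note above.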

Implement docker kernel driver

Create the agent docker image.
Implement port mapping of created containers.
Test the functionalities implemented in the local driver.

Extend AgentStatus enum type to include updating and unhealthy statuses

Implement adding new values to the existing pgsql enum type using an alembic migration.
UPDATING: the update is triggered by two cases:
- (1) Kernel image update: the agent will automatically check and pull the latest Docker images when the image registry in etcd is updated. During such updates, the agent may not be responsive and the kernel creation may use stale images which result in unexpected/undefined behavior.
- (2) Agent version update: if the new version of Backend.AI agent is released, the agent should stop, upgrade itself (by calling pip install -U), and then restart automatically.
- The updating process for individual agents should be delayed until all existing kernel sessions managed by each agent finish. When an update is scheduled, the agent should NOT accept new kernel creation requests – the manager must check this!
  - Separate this "accepting" status as a discrete boolean flag or include as one of AgentStatus?
- To keep the service available, the agents should NOT begin the update at the same time – we could use etcd as a distributed lock coordinator to make it a rolling update. (#49)
- We need some coordination of agent version update and image update if there are any backward-incompatible changes on either or both sides.
UNHEALTHY (#46)
- Once an agent is checked as unhealthy, we need to terminate and replace it with a new instance.

Use user-provided timestamp string as-is when calculating server-side signature

Different languages may use slightly different timezone offset formats when generating ISO 8601-style datetime strings. So, after checking the timing range (within ±15 minutes of the server time), we need to use the date header value as-is when canonicalizing the request.

Python 3.6:

>>> from datetime import datetime
>>> from dateutil.tz import tzutc
>>> datetime.now(tzutc()).isoformat()
'2017-01-10T06:15:09.189505+00:00'
>>> datetime.now().isoformat()
'2017-01-10T15:15:15.573437'
>>> datetime.utcnow().isoformat()
'2017-01-10T06:15:23.914917'

JavaScript:

var d = new Date;
console.log(d.toISOString());
// 2017-01-10T06:16:21.828Z
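
A sketch of the idea: parse the header only for the range check, and feed the raw string into the signature. The function names and the layout of the canonical string are hypothetical and do not describe the actual signing scheme; treating naive timestamps as UTC is also an assumption.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone


def parse_date_header(value):
    """Parse an ISO 8601 string, accepting a trailing 'Z' as UTC."""
    if value.endswith('Z'):
        value = value[:-1] + '+00:00'
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return dt


def sign_request(secret, method, path, date_header, now):
    """Check the timestamp window, then sign using the raw header string."""
    dt = parse_date_header(date_header)
    if abs(now - dt) > timedelta(minutes=15):
        raise ValueError('timestamp outside the allowed window')
    # The canonical string uses date_header verbatim, not a re-serialized dt,
    # so the client's offset format ('Z' vs '+00:00') is preserved.
    canonical = f'{method}\n{path}\n{date_header}'
    return hmac.new(secret, canonical.encode(), hashlib.sha256).hexdigest()
```

Two headers that denote the same instant but differ in format produce different signatures, which is why the server must not re-serialize the parsed value.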

Missing instances not reported to datadog monitoring

The inverted default of the check_shadow parameter in sorna.manager.registry.InstanceRegistry.enumerate_instances() has caused the monitoring view to miss lost instances.

Travis CI test failing...

API New/Update: Per-session configuration APIs

(Along with lablup/backend.ai-kernel-runner#1)

Extend the session creation API to pass a user-config
- Support extra environment variables in kernels
- Support customization of resource requirements (throttled to the configured per-image limits)
Add a new API to retrieve the applied config for a session

Upgrade to aioredis v1.0

https://aioredis.readthedocs.io/en/v1.0.0/migration.html

Take version as argument when creating kernel sessions

Add kernel version argument (used as Docker image tags in the agents)
Validate version argument with the information retrieved from etcd

Local docker registry cache

If the user specifies the --docker-registry option when running the gateway server, let's use it by replacing the "lablup" Docker Hub account in the kernel image names with "user-provided-registry-address" (e.g., "lablup/kernel-python3" becomes "registry-cache.internal.example.com/kernel-python3").
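
The name rewriting could be sketched as a small pure function. The rewrite_image_name() name and its parameters are hypothetical.

```python
def rewrite_image_name(image, registry=None, account='lablup'):
    """Replace the Docker Hub account prefix with a registry address."""
    prefix = account + '/'
    if registry and image.startswith(prefix):
        return registry.rstrip('/') + '/' + image[len(prefix):]
    # No registry configured, or the image belongs to another account.
    return image
```

When no registry is given, image names pass through unchanged, so the default behavior stays the same.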

Write unit tests

Convert the current tests/test_shell.py into a more structured test suite.

The test suite will consist of:

Local kernel agent functionality test in sorna.agent
- basic code execution tests
- stdout/stderr publishing and redirection tests
- security tests such as scope intrusion (e.g., if the user code tries to replace sys, it should not corrupt the agent.)
- file creation/deletion tests
- custom package loading tests
Integration test with local kernel driver in sorna.manager
- basic code execution tests
- instance capacity tests
- timeout tests
Integration test with docker in sorna.manager
- (TODO)

User-defined alias for computation sessions

It will be useful for distinguishing which session belongs to which "experiment" the user has in mind.

DB schema update
GraphQL schema update
extension of the create kernel API

Apply partial unique index to kernel status and session token

Users can reuse their client-side session token for different sessions.
This means that a token may be reused ONLY AFTER a previous session with the same token has completely terminated. Otherwise, the session should be "reused" (returned without creating a new session) and any new session creation requests must be blocked if they have different language specs (#56).

We need to enforce this constraint at the database level, if possible, to avoid potential logical and synchronization bugs in the manager.

In SQL, an index that applies to only a subset of a table's rows is called a "partial index". So we could create a unique partial index on kernels.sess_id with the condition kernels.status != 'TERMINATED'.
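
The behavior of such an index can be tried out with the sqlite3 module, which also supports partial indexes. This is a sketch using SQLite as a stand-in for PostgreSQL. The table layout is reduced to three columns, and the index name, the 'RUNNING' status value, and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE kernels (id INTEGER PRIMARY KEY, sess_id TEXT, status TEXT)')
# Uniqueness is enforced only among rows that are not yet terminated.
conn.execute(
    "CREATE UNIQUE INDEX ix_kernels_live_sess_id "
    "ON kernels (sess_id) WHERE status != 'TERMINATED'"
)


def create_session(sess_id):
    """Insert a live session; return False if the token is still in use."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO kernels (sess_id, status) VALUES (?, 'RUNNING')",
                (sess_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def terminate_session(sess_id):
    """Mark the live session with this token as terminated."""
    with conn:
        conn.execute(
            "UPDATE kernels SET status = 'TERMINATED' "
            "WHERE sess_id = ? AND status != 'TERMINATED'",
            (sess_id,),
        )
```

A second create_session() call with the same token fails while the first session is live and succeeds again after terminate_session(), while any number of terminated rows may share the token.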

Return resource usage statistics when an execution request has finished

This will be useful for client developers who implement fine-grained billing for their own individual users while the client-side service uses a single API keypair for Backend.AI.

Multi-file upload and compilation support

Multi-file upload API
- Add new API execute_file that accepts multiple files as HTTP multipart upload.
- The kernel then should run the "main" program in the REPL while allowing access to other files.
- The kernel may have an optional build stage.
- The uploaded files go into /home/work/_upload (excluded from S3 upload)
- During the same kernel session, the uploaded files are preserved. If a new file with the same filename is uploaded again, the existing one is overwritten. (How to delete?)
Problems
- How to specify multiple files and dependencies in client-side? => manually selected via some UI / typed as front-matter comments / etc
- How to give progress feedback from build stages? => just wait-and-get-the-result in the first implementation, and include compilation logs on failure

Accept "BackendAI" like "Sorna" in API request headers

Because the develop branch has diverged, we should postpone renaming the manager/agent code for now.
However, since the client will soon be released under the "Backend.AI" brand, we need to accept the new API headers before upgrading the manager and agents.
