Ansible deployment for Conda Compute Cluster

Ansible-based playbooks for the deployment and orchestration of the Conda Compute Cluster.

Conda Compute Cluster features:

Conda Compute Cluster (CCC) has been developed by ViCoS UL, FRI to give deep learning researchers a seamless migration between different GPU servers when working on specific projects. The main features of Conda Compute Cluster are:

Running multiple Docker containers on different hosts simultaneously.
Seamless transition from one host to another.
SSH access to containers through reverse proxy (FRP proxy).
Designed for running Conda environments on NVIDIA GPUs for deep learning research.

Containers are based on Conda Compute Containers that enable seamless transition from one host to another due to:

Home folder mounted on common shared storage.
Forbidden modification of non-home files.
Users can modify certain properties from within the container:
- can modify container image (must be based on vicos/ccc-base:latest)
- can modify apt packages, repositories and sources installed at container boot
- can modify on which hosts to deploy containers
Pre-installed Miniconda on /home/USER/Conda.

Cluster management is done through a single Ansible script and enables deployment of the following features:

Automatic deployment of containers upon a change of config.
Using NFS with FS-cache for shared storage.
Management of local disk with ZFS.
Hardware monitoring and management:
- automatic management of system fans based on GPU temperature when using a Supermicro server (Superfans GPU Controller)
- monitoring of GPU and CPU reported as Prometheus metrics
- monitoring of GPU usage for automatic reservation using patroller

Deployment playbooks

Two playbooks are available that deploy Conda Compute Cluster and Containers:

cluster-deploy.yml: deployment of cluster infrastructure (network, docker, FRP client, ZFS, NFS, FS-Cache, HW monitoring, GPU fan controllers, etc.)
containers-deploy.yml: deployment of compute containers based on Conda Compute Container (CCC) images

Deploy cluster infrastructure:

Run the following command to deploy the infrastructure:

ansible-playbook cluster-deploy.yml -i <path-to-inventory> \
                 --vault-password-file <path-to-secret> -e vars_file=<path-to-secret-vars-dir> \
                 -e machines=<node-or-group-pattern> \
                 -e only_roles=<list of roles>

Inventory/nodes:

You can specify the cluster definition in the supplied inventory folder. See sample-inventory for an example. Tasks are deployed on the nodes selected by -e machines=<node-or-group-pattern>.

Roles:

By default, all roles are executed in the order specified below. Deployment can be limited to specific roles by supplying -e only_roles=<list of roles>, where the list is a comma-separated list of role names:

netplan: network interface definition using netplan
docker: docker with pre-defined docker networks, repository logins and Portainer agent for GUI management
frp-client: FRP client for access to containers through the proxy server
zfs: ZFS pools for local storage
cachefilesd: FS-Cache for caching of the NFS storage into local scratch disks
nfs-storage: NFS storage for shared storage (needed for shared /home/user over all compute nodes)
superfan-gpu: Superfans GPU controller for regulating system fans based on GPU temperature
monitoring-agent: HW monitoring for providing Prometheus metrics of CPU and GPUs
compute-container-nightwatch: CCC nightwatch for providing automatic updates of the compute containers upon changes to the Ansible config or user-supplied config
patroller: GPU Patroller for automatic GPU reservation system based on https://github.com/vicoslab/patroller
sshd-hostkey: not an actual role but a minor task to deploy ssh-daemon keys for CCC containers

Example of the cluster-wide config organization:

An example of how to provide cluster configurations is in the sample-inventory folder, which includes:

hosts definitions: your-cluster.yml with ccc-cluster as main group of your cluster nodes
cluster settings: group_vars/ccc-cluster/cluster-vars.yml
cluster secrets: vault_vars/cluster-secrets.yml (requires --vault-password-file to unlock)
host-specific settings: sample-inventory/host_vars
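
As a rough illustration of the hosts definition, a YAML inventory such as your-cluster.yml might look like the sketch below. Only the ccc-cluster group name comes from the sample inventory; the host names and addresses are hypothetical placeholders.

```yaml
# your-cluster.yml (sketch; host names and addresses are made up)
all:
  children:
    ccc-cluster:                      # main group of cluster nodes, as in sample-inventory
      hosts:
        gpu-node-01:
          ansible_host: 192.0.2.11    # hypothetical address
        gpu-node-02:
          ansible_host: 192.0.2.12    # hypothetical address
```

With an inventory like this, -e machines=ccc-cluster would presumably target both nodes, while -e machines=gpu-node-01 would target a single one.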

Cluster-wide settings contain principal configuration of the whole cluster and are sectioned into settings for individual roles. Settings are used both by the cluster-deploy.yml and containers-deploy.yml playbooks.

Cluster secrets

Cluster secrets are stored in a separate vault_vars folder and should not be present in group_vars, so that containers-deploy.yml can run without the vault secret. Secrets can instead be loaded for cluster deployment using -e vars_file=<path-to-secret-vars-dir>, which loads the vars only for the cluster-deploy.yml playbook.

Deploy compute containers

Run the following command to deploy compute containers:

ansible-playbook containers-deploy.yml -i <path-to-inventory> \
                 -e machines=<node-or-group-pattern> \
                 -e containers=<list of STACK_NAME> \
                 -e users=<list of USER_EMAIL>

Containers filtering

By default, all containers are deployed!

To limit the deployment to specific containers, two additional filters can be used. For both filters, the provided value must be a comma-separated list in string format:

-e containers=<list of STACK_NAME>: filters based on the containers' STACK_NAME value
-e users=<list of USER_EMAIL>: filters based on the containers' USER_EMAIL value

Container deployment config:

The list of containers for deployment and the list of users need to be set in the inventory configuration:

yaml variable deployment_containers: list of containers for deployment (e.g., see group_vars/ccc-cluster/user-containers.yml)
yaml variable deployment_users: list of users for deployment (e.g., see group_vars/ccc-cluster/user-list.yml)
yaml variable deployment_types: list of user types (e.g., see group_vars/ccc-cluster/user-list.yml)

Example of the container config organization:

An example of how to provide the container configuration is in the sample-inventory folder, which includes:

list of containers for deployment as deployment_containers var in group_vars/ccc-cluster/user-containers.yml
list of users for deployment as deployment_users var in group_vars/ccc-cluster/user-list.yml
list of user types as deployment_types var in group_vars/ccc-cluster/user-list.yml

User containers for deployment

Each container for deployment must be provided in the deployment_containers variable as a list of dictionaries with the following keys for each container:

STACK_NAME: name of the compute container
CONTAINER_IMAGE: container image that will be deployed (e.g., "registry.vicos.si/ccc-juypter:ubuntu18.04-cuda10.1")
USER_EMAIL: user's email
INSTALL_PACKAGES: additional apt packages that are installed at startup (registry.vicos.si/ccc-base:<...> images do not provide sudo access by default!)
INSTALL_REPOSITORY_KEYS: comma-separated list of links to fingerprint keys for installed repository sources (added using apt-key add)
INSTALL_REPOSITORY_SOURCES: comma-separated list of repository sources (deb ... sources or ppa links that can be added using add-apt-repository)
SHM_SIZE: shared memory settings
FRP_PORTS: dict() with TCP and HTTP keys with info of the forwarded ports to the FRP server
- TCP: a list of tcp ports as string values
- HTTP: a list of http ports as dict() objects with port, subdomain, pass (optional), health_check (optional) and subdomain_hostname_prefix (optional - bool) keys
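
Putting the keys above together, a single entry could be sketched roughly as follows. The key names come from the list above; every value is a placeholder, and the value formats of INSTALL_PACKAGES, SHM_SIZE and the HTTP port are assumptions rather than something the playbooks are confirmed to expect.

```yaml
# group_vars/ccc-cluster/user-containers.yml (sketch; all values are placeholders)
deployment_containers:
  - STACK_NAME: alice-experiments            # hypothetical stack name
    CONTAINER_IMAGE: "registry.vicos.si/ccc-juypter:ubuntu18.04-cuda10.1"
    USER_EMAIL: alice@example.org            # hypothetical user email
    INSTALL_PACKAGES: "htop tmux"            # assumed format: space-separated apt package names
    INSTALL_REPOSITORY_KEYS: "https://example.org/repo-key.gpg"   # hypothetical key link
    INSTALL_REPOSITORY_SOURCES: "ppa:example/ppa"                 # hypothetical ppa source
    SHM_SIZE: "8g"                           # assumed Docker-style size string
    FRP_PORTS:
      TCP: ["22"]                            # TCP ports given as string values
      HTTP:
        - port: 8888                         # hypothetical forwarded HTTP port
          subdomain: alice-notebook          # hypothetical subdomain
          # optional keys: pass, health_check, subdomain_hostname_prefix (bool)
```

An entry like this could then be selected on its own with -e containers=alice-experiments or -e users=alice@example.org.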

Centralized user information

User information can be centralized in a separate file for quick reuse. Containers and users are matched based on emails. The following user information must be present within the deployment_containers[<USER_EMAIL>] dictionary:

USER_FULLNAME: user's first and last name (from
USER_MENTOR: user's mentor (optional)
USER_NAME: username for the OS
USER_PUBKEY: SSH public key for access to the compute container
USER_TYPE: user group/type that restricts network, nodes and GPU devices (groups/types are defined in deployment_types key)
ADDITIONAL_DEVICE_GROUPS: allowed additional device groups besides ones defined by USER_TYPE

Feature list

TODO list:

enable redirection of container logging output to the user
