ccc-deployment's Introduction

Ansible deployment for Conda Compute Cluster

Ansible-based playbooks for the deployment and orchestration of the Conda Compute Cluster.

Conda Compute Cluster features:

Conda Compute Cluster (CCC) has been developed by ViCoS UL, FRI to give deep learning researchers seamless migration between different GPU servers when working on specific projects. The main features of Conda Compute Cluster are:

  • Running multiple Docker containers on different hosts simultaneously.
  • Seamless transition from one host to another.
  • SSH access to containers through reverse proxy (FRP proxy).
  • Designed for running Conda environments on NVIDIA GPUs for deep learning research.

Containers are based on Conda Compute Containers that enable seamless transition from one host to another due to:

  • Home folder mounted on common shared storage.
  • Forbidden modification of non-home files.
  • Users can modify certain properties from within the container:
    • can modify container image (must be based on vicos/ccc-base:latest)
    • can modify apt packages, repositories and sources installed at container boot
    • can modify on which hosts to deploy containers
  • Pre-installed Miniconda on /home/USER/Conda.

Cluster management is done through a single Ansible script and enables deployment of the following features:

  • Automatic deployment of containers upon change of config.
  • Using NFS with FS-cache for shared storage.
  • Management of local disk with ZFS.
  • Hardware monitoring and management:
    • automatic management of system fans on Supermicro servers based on GPU temperature (Superfans GPU Controller)
    • monitoring of GPU and CPU reported as Prometheus metrics
    • monitoring of GPU usage for automatic reservation using patroller

Deployment playbooks

Two playbooks are available that deploy Conda Compute Cluster and Containers:

  • cluster-deploy.yml: deployment of cluster infrastructure (network, docker, FRP client, ZFS, NFS, FS-Cache, HW monitoring, GPU fan controllers, etc.)
  • containers-deploy.yml: deployment of compute containers based on Conda Compute Container (CCC) images

Deploy cluster infrastructure:

Run the following command to deploy the infrastructure:

ansible-playbook cluster-deploy.yml -i <path-to-inventory> \
                 --vault-password-file <path-to-secret> -e vars_file=<path-to-secret-vars-dir> \
                 -e machines=<node-or-group-pattern> \
                 -e only_roles=<list of roles> 

Inventory/nodes:

You can specify the cluster definition in the supplied inventory folder. See sample-inventory for an example. Tasks are deployed on the nodes selected by -e machines=<node-or-group-pattern>.

Roles:

By default, all roles are executed in the order specified below. Deployment can be limited to specific roles by supplying -e only_roles=<list of roles>. The list of roles is a comma-separated list of role names:
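As a rough illustration of how a comma-separated only_roles value narrows the role list, here is a minimal Python sketch. It is not the playbook's implementation: the role names, the default order, and the rule that selected roles keep the default order are all assumptions made for the example.

```python
# Hypothetical role names in a hypothetical default order; the real role
# names are defined by the playbook, not by this sketch.
DEFAULT_ROLES = ["network", "docker", "zfs", "nfs"]


def select_roles(only_roles=None, default_roles=DEFAULT_ROLES):
    """Return the roles to run for a comma-separated only_roles string."""
    if not only_roles:
        # No filter supplied: run every role.
        return list(default_roles)
    wanted = {name.strip() for name in only_roles.split(",") if name.strip()}
    unknown = wanted - set(default_roles)
    if unknown:
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    # Assumption: selected roles keep the default execution order.
    return [role for role in default_roles if role in wanted]


print(select_roles("nfs, docker"))  # ['docker', 'nfs']
```

Rejecting unknown names early is a design choice of the sketch; it makes a mistyped role visible instead of silently running nothing.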

Example of the cluster-wide config organization:

An example of how to provide cluster configuration is in the sample-inventory folder, which includes:

Cluster-wide settings contain principal configuration of the whole cluster and are sectioned into settings for individual roles. Settings are used both by the cluster-deploy.yml and containers-deploy.yml playbooks.

Cluster secrets

Cluster secrets are stored in a separate vault_vars folder and should not be present in group_vars, so that containers-deploy.yml can run without the vault secret. Secrets can instead be loaded for cluster deployment using -e vars_file=<path-to-secret-vars-dir>, which loads the vars only for the cluster-deploy.yml playbook.

Deploy compute containers

Run the following command to deploy compute containers:

ansible-playbook containers-deploy.yml -i <path-to-inventory> \
                 -e machines=<node-or-group-pattern> \
                 -e containers=<list of STACK_NAME> \
                 -e users=<list of USER_EMAIL>

Containers filtering

By default, all containers are deployed!

To limit deployment to specific containers, two additional filters can be used. For both filters, the provided value must be a comma-separated list in string format:

  • -e containers=<list of STACK_NAME>: filters based on the containers' STACK_NAME value
  • -e users=<list of USER_EMAIL>: filters based on the containers' USER_EMAIL value
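The effect of the two filters can be sketched in Python as follows. This is only an illustration of the filtering idea, not the playbook's code: the helper names and sample data are hypothetical, and the sketch assumes that when both filters are given a container must match both of them.

```python
def split_csv(value):
    """Split a comma-separated string into a set of trimmed, non-empty items."""
    if not value:
        return set()
    return {item.strip() for item in value.split(",") if item.strip()}


def filter_containers(deployment_containers, containers=None, users=None):
    """Keep containers matching the optional STACK_NAME and USER_EMAIL filters."""
    stack_names = split_csv(containers)
    emails = split_csv(users)
    selected = []
    for container in deployment_containers:
        if stack_names and container["STACK_NAME"] not in stack_names:
            continue
        # Assumption: both filters must match when both are supplied.
        if emails and container["USER_EMAIL"] not in emails:
            continue
        selected.append(container)
    return selected


# Made-up sample data for the illustration.
sample = [
    {"STACK_NAME": "stack-a", "USER_EMAIL": "ana@example.com"},
    {"STACK_NAME": "stack-b", "USER_EMAIL": "ana@example.com"},
    {"STACK_NAME": "stack-c", "USER_EMAIL": "bor@example.com"},
]

by_user = filter_containers(sample, users="ana@example.com")
print([c["STACK_NAME"] for c in by_user])  # ['stack-a', 'stack-b']
```

With no filter supplied the function returns every container, which mirrors the default described above.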

Container deployment config:

The list of containers for deployment and the list of users need to be set in the inventory configuration:

Example of the container config organization:

An example of how to provide cluster configuration is in the sample-inventory folder, which includes:

User containers for deployment

Each container for deployment must be provided in the deployment_containers variable as a list of dictionaries with the following keys for each container:

  • STACK_NAME: name of the compute container
  • CONTAINER_IMAGE: container image that will be deployed (e.g., "registry.vicos.si/ccc-juypter:ubuntu18.04-cuda10.1")
  • USER_EMAIL: user's email
  • INSTALL_PACKAGES: additional apt packages that are installed at startup (registry.vicos.si/ccc-base:<...> images do not provide sudo access by default!)
  • INSTALL_REPOSITORY_KEYS: comma-separated list of links to fingerprint keys for installed repository sources (added using apt-key add)
  • INSTALL_REPOSITORY_SOURCES: comma-separated list of repository sources (deb ... sources or ppa links that can be added using add-apt-repository)
  • SHM_SIZE: shared memory settings
  • FRP_PORTS: dict() with TCP and HTTP keys with info of the forwarded ports to the FRP server
    • TCP: a list of tcp ports as string values
    • HTTP: a list of http ports as dict() objects with port, subdomain, pass (optional), health_check (optional) and subdomain_hostname_prefix (optional - bool) keys
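To make the shape of FRP_PORTS concrete, the sketch below writes one container entry as a Python dictionary and checks its FRP_PORTS value. Only the key names come from the list above; every value is a made-up placeholder, the remaining keys are omitted for brevity, and treating port and subdomain as required is an assumption based on them not being marked optional.

```python
HTTP_REQUIRED = {"port", "subdomain"}
HTTP_OPTIONAL = {"pass", "health_check", "subdomain_hostname_prefix"}


def check_frp_ports(frp_ports):
    """Return a list of problems found in an FRP_PORTS dictionary."""
    problems = []
    for port in frp_ports.get("TCP", []):
        # TCP ports are described as string values.
        if not isinstance(port, str):
            problems.append(f"TCP port {port!r} is not a string")
    for index, entry in enumerate(frp_ports.get("HTTP", [])):
        keys = set(entry)
        missing = HTTP_REQUIRED - keys
        unexpected = keys - HTTP_REQUIRED - HTTP_OPTIONAL
        if missing:
            problems.append(f"HTTP[{index}] is missing {sorted(missing)}")
        if unexpected:
            problems.append(f"HTTP[{index}] has unexpected keys {sorted(unexpected)}")
        prefix = entry.get("subdomain_hostname_prefix")
        if prefix is not None and not isinstance(prefix, bool):
            problems.append(f"HTTP[{index}] subdomain_hostname_prefix is not a bool")
    return problems


# Placeholder entry; values are illustrative only.
example_container = {
    "STACK_NAME": "example-stack",
    "CONTAINER_IMAGE": "registry.vicos.si/ccc-juypter:ubuntu18.04-cuda10.1",
    "USER_EMAIL": "user@example.com",
    "FRP_PORTS": {
        "TCP": ["22"],
        "HTTP": [{"port": "8888", "subdomain": "notebook"}],
    },
}

print(check_frp_ports(example_container["FRP_PORTS"]))  # []
```

In the actual inventory the same structure would be written in the inventory's own configuration format rather than as Python.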

Centralized user information

User information can be centralized in a separate file for quick reuse. Containers and users are matched based on emails. The following user information must be present within the deployment_containers[<USER_EMAIL>] dictionary:

  • USER_FULLNAME: user's first and last name (from
  • USER_MENTOR: user's mentor (optional)
  • USER_NAME: username for the OS
  • USER_PUBKEY: SSH public key for access to the compute container
  • USER_TYPE: user group/type that restricts network, nodes and GPU devices (groups/types are defined in deployment_types key)
  • ADDITIONAL_DEVICE_GROUPS: allowed additional device groups besides ones defined by USER_TYPE
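Matching containers to users by email can be pictured with the following Python sketch. It is an illustration only: the function name, the users_by_email mapping, and the sample values are hypothetical, and letting container-level keys override user-level keys is an assumption of the sketch rather than documented behavior.

```python
def attach_user_info(deployment_containers, users_by_email):
    """Return container dicts extended with the matching user's fields."""
    merged = []
    for container in deployment_containers:
        email = container["USER_EMAIL"]
        if email not in users_by_email:
            raise KeyError(f"no user information for {email}")
        # Assumption: container-level keys win if a key appears in both.
        merged.append({**users_by_email[email], **container})
    return merged


# Made-up sample data for the illustration.
users_by_email = {
    "user@example.com": {
        "USER_FULLNAME": "Example User",
        "USER_NAME": "example",
        "USER_PUBKEY": "ssh-ed25519 AAAA... example",
        "USER_TYPE": "student",
    }
}
containers = [{"STACK_NAME": "example-stack", "USER_EMAIL": "user@example.com"}]

result = attach_user_info(containers, users_by_email)
print(result[0]["USER_NAME"])  # example
```

Keeping user details in one mapping means a change such as a new public key is made once and picked up by every container that lists that email.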

Feature list

  • setting docker repository login from config
  • encrypted data for authentication settings
  • can deploy compute-container only to specific group nodes (student or lab nodes) or specific node
  • can control deploying compute-container through config
  • support for NVIDIA GPU driver installation
  • performance-tuned NFS mount settings with FS-cache
  • custom ZFS storage mounting
  • IPMI fan controller using NVIDIA GPU temperatures (designed for Supermicro servers)
  • centralized storage of users (with their names, email and PUBKEY) in a single file
  • loading of SSH pubkey from GitHub
  • Prometheus export for monitoring of the HW (for CPU and GPU: GPU utilization, temperature, etc.)
  • users can provide custom settings inside of the containers by editing ~/.containers/<STACK_NAME>.yml file
  • compute-container-nightwatch that monitors ~/.containers/<STACK_NAME>.yml files and redeploys them using ansible-pull
  • constraining to specific GPUs based on device groups and user group

TODO list:

  • enable redirection of container logging output to the user

ccc-deployment's People

Contributors

skokec, lukacu
