
Nebari Slurm

This project is being renamed from QHub HPC to Nebari Slurm.

Nebari Slurm is an opinionated open source deployment of JupyterHub based on an HPC job scheduler. Nebari Slurm is a "distribution" of these packages, much like Debian and Ubuntu are distributions of Linux. The high-level goal of this distribution is to form a cohesive set of tools that enable:

  • environment management via conda and conda-store
  • monitoring of compute infrastructure and services
  • scalable and efficient compute via JupyterLab and Dask
  • deployment of JupyterHub on-prem without requiring deep DevOps knowledge of the Slurm/HPC and Jupyter ecosystems

Features

  • Scalable compute environment based on the Slurm workload manager to take advantage of an entire fleet of nodes
  • Ansible-based provisioning of Ubuntu 18.04 and Ubuntu 20.04 nodes to deploy one master server and N workers. These workers can be pre-existing nodes in your compute environment
  • Customizable themes for JupyterHub

[Screenshot: jupyterhub-theme]

  • JupyterHub integration allowing users to select the memory, CPUs, and conda environment that their JupyterLab instances are launched with

[Screenshot: jupyterhub]

  • Dask Gateway integration allowing users to select the memory, CPUs, and environment that the Dask scheduler and workers use

[Screenshot: dask-gateway]

  • Monitoring of the entire cluster via Grafana, covering the nodes, JupyterHub, Slurm, and Traefik

[Screenshot: grafana]

  • Shared directories between all users for collaborative compute

Dependencies

Install the Ansible dependencies:

ansible-galaxy collection install -r requirements.yaml
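
The requirements.yaml file lists the Ansible collections the playbooks depend on. Based on the collections noted in the issues below, it likely contains something along these lines (a sketch, not the file verbatim):

collections:
  - community.general
  - community.mysql
  - ansible.posix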

Testing

There are tests for deploying Nebari Slurm on virtual machines (via Vagrant) and in the cloud.

Virtual Machines

Vagrant is a tool for creating and provisioning VMs. It integrates conveniently with Ansible, which allows easy and effective control over configuration. Currently the Vagrantfile only supports the libvirt and virtualbox providers.

cd tests/ubuntu1804
# cd tests/ubuntu2004
vagrant up --provider=<provider-name>
# vagrant up --provider=libvirt
# vagrant up --provider=virtualbox
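
After changing the playbooks you can usually re-apply them to the running machines rather than recreating them, and tear the cluster down when finished:

vagrant provision
vagrant destroy -f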

Notebook for testing functionality

  • tests/assets/notebook/test-dask-gateway.ipynb

Cloud

Services

The current testing environment spins up four nodes:

  • all nodes :: node_exporter for node metrics
  • master node :: slurm scheduler, munge, mysql, jupyterhub, grafana, prometheus
  • worker node :: slurm daemon, munge

Jupyterhub

JupyterHub is accessible at <master node ip>:8000

You may need to port-forward, e.g. over SSH:

vagrant ssh hpc01-test -- -N -L localhost:8000:localhost:8000

then access http://localhost:8000/ on the host.

Grafana

Grafana is accessible at <master node ip>:3000
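
As with JupyterHub, the Grafana port can be forwarded over SSH if the VM is not directly reachable:

vagrant ssh hpc01-test -- -N -L localhost:3000:localhost:3000

then access http://localhost:3000/ on the host.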

License

Nebari Slurm is BSD3 licensed.

Contributing

Contributions are welcome!


nebari-slurm's Issues

Install Hadoop

Hadoop will be installed on the cluster when the following flag is set in group_vars/all.yaml

bodo:
  enabled: true

This is the script provided by Bodo to install Hadoop:

      conda install --quiet --yes adlfs -c conda-forge
      apt-get install -y openjdk-11-jre-headless
      cd /opt
      wget https://mirrors.sonic.net/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
      tar xf hadoop-3.3.0.tar.gz
      rm hadoop-3.3.0.tar.gz
      echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> /etc/bash.bashrc
      echo "export HADOOP_HOME=/opt/hadoop-3.3.0" >> /etc/bash.bashrc
      echo "export HADOOP_INSTALL=\$HADOOP_HOME" >> /etc/bash.bashrc
      echo "export HADOOP_MAPRED_HOME=\$HADOOP_HOME" >> /etc/bash.bashrc
      echo "export HADOOP_COMMON_HOME=\$HADOOP_HOME" >> /etc/bash.bashrc
      echo "export HADOOP_HDFS_HOME=\$HADOOP_HOME" >> /etc/bash.bashrc
      echo "export YARN_HOME=\$HADOOP_HOME" >> /etc/bash.bashrc
      echo "export HADOOP_COMMON_LIB_NATIVE_DIR=\$HADOOP_HOME/lib/native" >> /etc/bash.bashrc
      echo "export PATH=\$PATH:\$HADOOP_HOME/sbin:\$HADOOP_HOME/bin" >> /etc/bash.bashrc
      echo "export HADOOP_OPTS='-Djava.library.path=/opt/hadoop-3.3.0/lib/native'" >> /etc/bash.bashrc
      echo "export HADOOP_OPTIONAL_TOOLS=hadoop-azure" >> /etc/bash.bashrc
      echo "export ARROW_LIBHDFS_DIR=\$HADOOP_COMMON_LIB_NATIVE_DIR" >> /etc/bash.bashrc
      echo "export CLASSPATH=\`\$HADOOP_HOME/bin/hdfs classpath --glob\`" >> /etc/bash.bashrc

Remove dashes from group names

The tests use Ansible group names such as hpc01-test. Dashes are considered invalid and trigger a warning. Change the test group names to use underscores instead of dashes.
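
For example, a hypothetical YAML inventory using only underscores (group and host names adapted from the existing test names) would avoid the warning:

all:
  children:
    hpc_master:
      hosts:
        hpc01_test:
    hpc_worker:
      hosts:
        hpc02_test: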

Conda installs a hidden .condarc file

Conda installs a default channel file at /opt/conda/.condarc, owned by root.

Our scripts install /etc/conda/condarc manually, but the file above is also present.

This creates a conflict between default channels, and the channel resolution becomes difficult to debug.
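
A minimal Ansible task along these lines (the playbook already contains a similarly named task) would remove the stray file so that only /etc/conda/condarc applies:

- name: Remove implicit .condarc file installed by miniforge
  ansible.builtin.file:
    path: /opt/conda/.condarc
    state: absent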

Firewall Configuration issues on Vagrant Test Setup

Trying to spin up the qhub-hpc test environment fails with the error below:

TASK [install conda environment] ***********************************************
failed: [hpc02-test] (item={'path': '/home', 'host': 'hpc01-test'}) => {"ansible_loop_var": "item", "changed": false, "item": {"host": "hpc01-test", "path": "/home"}, "msg": "Error mounting /home: mount.nfs: Connection timed out\n"}
changed: [hpc01-test]

I've included the rest of the output below:

Expand to see complete output [nix-shell:~/CodingProjects/qhub-hpc/tests/ubuntu2004]$ vagrant up --provision Bringing machine 'hpc01-test' up with 'libvirt' provider... Bringing machine 'hpc02-test' up with 'libvirt' provider... ==> hpc02-test: Checking if box 'generic/ubuntu2004' version '3.3.4' is up to date... ==> hpc01-test: Checking if box 'generic/ubuntu2004' version '3.3.4' is up to date... ==> hpc02-test: Creating image (snapshot of base box volume). ==> hpc02-test: Creating domain with the following settings... ==> hpc02-test: -- Name: ubuntu2004_hpc02-test ==> hpc02-test: -- Domain type: kvm ==> hpc02-test: -- Cpus: 4 ==> hpc02-test: -- Feature: acpi ==> hpc02-test: -- Feature: apic ==> hpc02-test: -- Feature: pae ==> hpc02-test: -- Memory: 8192M ==> hpc02-test: -- Management MAC: ==> hpc02-test: -- Loader: ==> hpc02-test: -- Nvram: ==> hpc02-test: -- Base box: generic/ubuntu2004 ==> hpc02-test: -- Storage pool: default ==> hpc02-test: -- Image: /var/lib/libvirt/images/ubuntu2004_hpc02-test.img (128G) ==> hpc02-test: -- Volume Cache: default ==> hpc02-test: -- Kernel: ==> hpc02-test: -- Initrd: ==> hpc02-test: -- Graphics Type: vnc ==> hpc02-test: -- Graphics Port: -1 ==> hpc02-test: -- Graphics IP: 127.0.0.1 ==> hpc02-test: -- Graphics Password: Not defined ==> hpc02-test: -- Video Type: cirrus ==> hpc02-test: -- Video VRAM: 256 ==> hpc02-test: -- Sound Type: ==> hpc02-test: -- Keymap: en-us ==> hpc02-test: -- TPM Path: ==> hpc02-test: -- INPUT: type=mouse, bus=ps2 ==> hpc01-test: Creating image (snapshot of base box volume). ==> hpc01-test: Creating domain with the following settings... ==> hpc01-test: -- Name: ubuntu2004_hpc01-test ==> hpc01-test: -- Domain type: kvm ==> hpc01-test: -- Cpus: 2 ==> hpc01-test: -- Feature: acpi ==> hpc01-test: -- Feature: apic ==> hpc01-test: -- Feature: pae ==> hpc01-test: -- Memory: 4096M ==> hpc01-test: -- Management MAC: ==> hpc01-test: -- Loader: ==> hpc01-test: -- Nvram: ==> hpc01-test: -- Base box: generic/ubuntu2004 ==> hpc01-test: -- Storage pool: default ==> hpc01-test: -- Image: /var/lib/libvirt/images/ubuntu2004_hpc01-test.img (128G) ==> hpc01-test: -- Volume Cache: default ==> hpc01-test: -- Kernel: ==> hpc01-test: -- Initrd: ==> hpc01-test: -- Graphics Type: vnc ==> hpc01-test: -- Graphics Port: -1 ==> hpc01-test: -- Graphics IP: 127.0.0.1 ==> hpc01-test: -- Graphics Password: Not defined ==> hpc01-test: -- Video Type: cirrus ==> hpc01-test: -- Video VRAM: 256 ==> hpc01-test: -- Sound Type: ==> hpc01-test: -- Keymap: en-us ==> hpc01-test: -- TPM Path: ==> hpc01-test: -- INPUT: type=mouse, bus=ps2 ==> hpc02-test: Creating shared folders metadata... ==> hpc01-test: Creating shared folders metadata... ==> hpc02-test: Starting domain. ==> hpc01-test: Starting domain. ==> hpc01-test: Waiting for domain to get an IP address... ==> hpc02-test: Waiting for domain to get an IP address... ==> hpc01-test: Waiting for SSH to become available... ==> hpc02-test: Waiting for SSH to become available... ==> hpc01-test: Setting hostname... ==> hpc02-test: Setting hostname... ==> hpc01-test: Configuring and enabling network interfaces... ==> hpc02-test: Configuring and enabling network interfaces... ==> hpc01-test: Running provisioner: shell... ==> hpc02-test: Running provisioner: shell... hpc01-test: Running: inline script hpc02-test: Running: inline script hpc01-test: DNSSEC=yes hpc02-test: DNSSEC=yes ==> hpc01-test: Running provisioner: ansible... ==> hpc02-test: Running provisioner: ansible... hpc02-test: Running ansible-playbook... 
hpc01-test: Running ansible-playbook... [WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details [WARNING]: Invalid characters were found in group names but not replaced, use -vvvv to see details

PLAY [Populate /etc/hosts with internal interface ip addresses] ****************

PLAY [Populate /etc/hosts with internal interface ip addresses] ****************

TASK [Gathering Facts] *********************************************************

TASK [Gathering Facts] *********************************************************
ok: [hpc01-test]

TASK [Gather facts from ALL hosts (regardless of limit or tags)] ***************
skipping: [hpc01-test] => (item=hpc01-test)
ok: [hpc02-test]

TASK [Gather facts from ALL hosts (regardless of limit or tags)] ***************
ok: [hpc01-test -> 192.168.121.114] => (item=hpc02-test)

TASK [Build hosts file on nodes] ***********************************************
changed: [hpc01-test] => (item=hpc01-test)
changed: [hpc01-test] => (item=hpc02-test)

PLAY [hpc-master] **************************************************************

TASK [Gathering Facts] *********************************************************
ok: [hpc02-test -> 192.168.121.190] => (item=hpc01-test)
skipping: [hpc02-test] => (item=hpc02-test)

TASK [Build hosts file on nodes] ***********************************************
changed: [hpc02-test] => (item=hpc01-test)
changed: [hpc02-test] => (item=hpc02-test)

PLAY [hpc-master] **************************************************************
skipping: no hosts matched

PLAY [hpc-worker] **************************************************************

TASK [Gathering Facts] *********************************************************
ok: [hpc02-test]

TASK [Install firewall] ********************************************************
ok: [hpc01-test]
included: /home/balast/CodingProjects/qhub-hpc/tasks/firewall.yaml for hpc02-test

TASK [Install firewall] ********************************************************

TASK [Always allow ssh traffic] ************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/firewall.yaml for hpc01-test

TASK [Always allow ssh traffic] ************************************************
changed: [hpc01-test]

TASK [By default deny all incoming network requests] ***************************
changed: [hpc02-test]

TASK [By default deny all incoming network requests] ***************************
changed: [hpc01-test]

TASK [Allow any network requests witin internal ip range] **********************
changed: [hpc01-test]

TASK [Install common packages] *************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/apt-packages.yaml for hpc01-test

TASK [Ensure apt packages are installed] ***************************************
changed: [hpc02-test]

TASK [Allow any network requests witin internal ip range] **********************
changed: [hpc02-test]

TASK [Install common packages] *************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/apt-packages.yaml for hpc02-test

TASK [Ensure apt packages are installed] ***************************************
changed: [hpc01-test]

TASK [Install users/groups] ****************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/accounts.yaml for hpc01-test

TASK [Ensure groups are present] ***********************************************
changed: [hpc01-test] => (item={'name': 'example-user', 'gid': 10000})

TASK [Ensure users are present] ************************************************
changed: [hpc01-test] => (item={'username': 'example-user', 'uid': 10000, 'fullname': 'Example User', 'email': '[email protected]', 'password': '$6$3aaf4gr8D$2T31r9/GtXM6rVY8oHOejn.sThwhBZehbPZC.ZkN0XJOZUuguR9VnRQRYmqYAt9eW3LgLR21q1kbqSYSEDm5U.', 'primary_group': 'example-user', 'groups': ['users', 'example-user']})

TASK [Ensure users are disabled] ***********************************************

TASK [Ensure groups are disabled] **********************************************

TASK [Install prometheus node_exporter] ****************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/node-exporter.yaml for hpc01-test

TASK [Check that the node exporter binary exists] ******************************
ok: [hpc01-test]

TASK [Download node_exporter binary to local folder] ***************************
changed: [hpc01-test]

TASK [Unpack node_exporter binary] *********************************************
changed: [hpc01-test]

TASK [Install node_exporter binary] ********************************************
changed: [hpc01-test]

TASK [Create node_exporter group] **********************************************
changed: [hpc01-test]

TASK [Create the node_exporter user] *******************************************
changed: [hpc01-test]

TASK [Copy the node_exporter systemd service file] *****************************
changed: [hpc01-test]

TASK [Ensure Node Exporter is enabled on boot] *********************************
changed: [hpc01-test]

TASK [Install prometheus] ******************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/prometheus.yaml for hpc01-test

TASK [Check that the node exporter binary exists] ******************************
ok: [hpc01-test]

TASK [Download prometheus binary to local folder] ******************************
changed: [hpc01-test]

TASK [Unpack prometheus binary] ************************************************
changed: [hpc01-test]

TASK [Install prometheus binary] ***********************************************
changed: [hpc01-test]

TASK [Create prometheus group] *************************************************
changed: [hpc01-test]

TASK [Create the prometheus user] **********************************************
changed: [hpc01-test]

TASK [Ensure that promethus configuration directory exists] ********************
changed: [hpc01-test]

TASK [Ensure that promethus data directory exists] *****************************
changed: [hpc01-test]

TASK [Copy prometheus configuration] *******************************************
changed: [hpc01-test]

TASK [Copy the prometheus systemd service file] ********************************
changed: [hpc01-test]

TASK [Ensure Prometheus is enabled on boot] ************************************
changed: [hpc01-test]

TASK [Install prometheus slurm-exporter] ***************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/prometheus-slurm-exporter.yaml for hpc01-test

TASK [Check that the slurm exporter binary exists] *****************************
ok: [hpc01-test]

TASK [Download prometheus-slurm-exporter tarball to local folder] **************
changed: [hpc01-test]

TASK [Unpack prometheus slurm exporter] ****************************************
changed: [hpc01-test]

TASK [Install go package] ******************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/golang.yaml for hpc01-test

TASK [Check that the go binary exists] *****************************************
ok: [hpc01-test]

TASK [Download go tarball to local folder] *************************************
changed: [hpc01-test]

TASK [Unpack golang] ***********************************************************
changed: [hpc01-test]

TASK [Set golang to user path] *************************************************
changed: [hpc01-test]

TASK [Build prometheus_slurm_exporter] *****************************************
changed: [hpc01-test]

TASK [Install prometheus_slurm_exporter binary] ********************************
changed: [hpc01-test]

TASK [Create prometheus_slurm_exporter group] **********************************
changed: [hpc01-test]

TASK [Create the prometheus_slurm_exporter user] *******************************
changed: [hpc01-test]

TASK [Copy the prometheus_slurm_exporter systemd service file] *****************
changed: [hpc02-test]

TASK [Install users/groups] ****************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/accounts.yaml for hpc02-test

TASK [Ensure groups are present] ***********************************************
changed: [hpc01-test]

TASK [Ensure prometheus_slurm_exporter is enabled on boot] *********************
changed: [hpc02-test] => (item={'name': 'example-user', 'gid': 10000})

TASK [Ensure users are present] ************************************************
changed: [hpc02-test] => (item={'username': 'example-user', 'uid': 10000, 'fullname': 'Example User', 'email': '[email protected]', 'password': '$6$3aaf4gr8D$2T31r9/GtXM6rVY8oHOejn.sThwhBZehbPZC.ZkN0XJOZUuguR9VnRQRYmqYAt9eW3LgLR21q1kbqSYSEDm5U.', 'primary_group': 'example-user', 'groups': ['users', 'example-user']})

TASK [Ensure users are disabled] ***********************************************

TASK [Ensure groups are disabled] **********************************************

TASK [Install prometheus node_exporter] ****************************************
changed: [hpc01-test]

TASK [Install grafana] *********************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/node-exporter.yaml for hpc02-test

TASK [Check that the node exporter binary exists] ******************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/grafana.yaml for hpc01-test

TASK [Add apt keys for grafana] ************************************************
ok: [hpc02-test]

TASK [Download node_exporter binary to local folder] ***************************
changed: [hpc01-test]

TASK [Add apt repository for grafana] ******************************************
changed: [hpc02-test]

TASK [Unpack node_exporter binary] *********************************************
changed: [hpc02-test]

TASK [Install node_exporter binary] ********************************************
changed: [hpc02-test]

TASK [Create node_exporter group] **********************************************
changed: [hpc02-test]

TASK [Create the node_exporter user] *******************************************
changed: [hpc02-test]

TASK [Copy the node_exporter systemd service file] *****************************
changed: [hpc02-test]

TASK [Ensure Node Exporter is enabled on boot] *********************************
changed: [hpc02-test]

TASK [Install slurm worker] ****************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/slurm-worker.yaml for hpc02-test

TASK [Install slurm common] ****************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/slurm/common.yaml for hpc02-test

TASK [Install munge] ***********************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/slurm/munge.yaml for hpc02-test

TASK [Ensure munge gid is fixed] ***********************************************
changed: [hpc02-test]

TASK [Ensure munge uid fixed] **************************************************
[WARNING]: File '/etc/apt/sources.list.d/packages_grafana_com_oss_deb.list'
created with default permissions '600'. The previous default was '666'. Specify
'mode' to avoid this warning.
changed: [hpc01-test]

TASK [Install grafana] *********************************************************
changed: [hpc02-test]

TASK [Check munge directory] ***************************************************
changed: [hpc02-test]

TASK [Install munge key] *******************************************************
changed: [hpc02-test]

TASK [Install munge controller packages] ***************************************
changed: [hpc02-test]

TASK [Ensure Munge is enabled and running] *************************************
ok: [hpc02-test]

TASK [Install slurm client packages] *******************************************
changed: [hpc01-test]

TASK [Copy grafana datasource provision file] **********************************
changed: [hpc01-test]

TASK [Copy grafana dashboard provision file] ***********************************
changed: [hpc01-test]

TASK [Copy grafana dashboards] *************************************************
changed: [hpc01-test] => (item=jupyterhub)
changed: [hpc01-test] => (item=node_exporter)
changed: [hpc01-test] => (item=slurm_exporter)
changed: [hpc01-test] => (item=traefik)

TASK [Copy Grafana Configuration] **********************************************
changed: [hpc01-test]

TASK [Ensure granfana is started] **********************************************
changed: [hpc01-test]

TASK [Install mysql] ***********************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/mysql.yaml for hpc01-test

TASK [Install mysql] ***********************************************************
changed: [hpc02-test]

TASK [ensure that slurm configuration directory exists] ************************
ok: [hpc02-test]

TASK [install slurm.conf] ******************************************************
changed: [hpc02-test]

TASK [Install extra execution host configs] ************************************
changed: [hpc02-test]

TASK [Install slurmd] **********************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/slurm/slurmd.yaml for hpc02-test

TASK [Create slurm spool directory] ********************************************
changed: [hpc02-test]

TASK [Create slurm log directory] **********************************************
changed: [hpc02-test]

TASK [Ensure slurm pid directory exists] ***************************************
changed: [hpc02-test]

TASK [Copy the slurmctl systemd service file] **********************************
changed: [hpc02-test]

TASK [Install Slurmd execution host packages] **********************************
changed: [hpc02-test]

TASK [Ensure slurmd is enabled on boot] ****************************************
changed: [hpc02-test]

TASK [Install nfs client] ******************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/nfs-client.yaml for hpc02-test

TASK [Install nfs] *************************************************************
changed: [hpc02-test]

TASK [Ensure nfs mounted directories exist] ************************************
ok: [hpc02-test] => (item={'path': '/home', 'host': 'hpc01-test'})
changed: [hpc02-test] => (item={'path': '/opt/conda', 'host': 'hpc01-test'})

TASK [Add fstab entries for nfs mounts] ****************************************
changed: [hpc01-test]

TASK [Create mysql database] ***************************************************
changed: [hpc01-test]

TASK [Create mysql users] ******************************************************
changed: [hpc01-test] => (item={'username': 'slurm', 'password': 'password1', 'privileges': '.:ALL'})

TASK [Install slurm master] ****************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/slurm-master.yaml for hpc01-test

TASK [Install slurm common] ****************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/slurm/common.yaml for hpc01-test

TASK [Install munge] ***********************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/slurm/munge.yaml for hpc01-test

TASK [Ensure munge gid is fixed] ***********************************************
changed: [hpc01-test]

TASK [Ensure munge uid fixed] **************************************************
changed: [hpc01-test]

TASK [Check munge directory] ***************************************************
changed: [hpc01-test]

TASK [Install munge key] *******************************************************
changed: [hpc01-test]

TASK [Install munge controller packages] ***************************************
changed: [hpc01-test]

TASK [Ensure Munge is enabled and running] *************************************
ok: [hpc01-test]

TASK [Install slurm client packages] *******************************************
changed: [hpc01-test]

TASK [ensure that slurm configuration directory exists] ************************
ok: [hpc01-test]

TASK [install slurm.conf] ******************************************************
changed: [hpc01-test]

TASK [Install extra execution host configs] ************************************
changed: [hpc01-test]

TASK [Install slurmdbd] ********************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/slurm/slurmdbd.yaml for hpc01-test

TASK [Ensure slurmdbd log directory exists] ************************************
changed: [hpc01-test]

TASK [Ensure slurm pid directory exists] ***************************************
changed: [hpc01-test]

TASK [install slurmdbd.conf] ***************************************************
changed: [hpc01-test]

TASK [Copy the slurmdbd systemd service file] **********************************
changed: [hpc01-test]

TASK [Install slurm controller packages] ***************************************
changed: [hpc01-test]

TASK [Ensure slurmdbd is enabled on boot] **************************************
changed: [hpc01-test]

TASK [Install slurmctld] *******************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/slurm/slurmctld.yaml for hpc01-test

TASK [Ensure slurm state directory exists] *************************************
changed: [hpc01-test]

TASK [Ensure slurm log directory exists] ***************************************
ok: [hpc01-test]

TASK [Ensure slurm pid directory exists] ***************************************
ok: [hpc01-test]

TASK [Copy the slurmctl systemd service file] **********************************
changed: [hpc01-test]

TASK [Install slurm controller packages] ***************************************
changed: [hpc01-test]

TASK [Ensure slurmctld is enabled on boot] *************************************
changed: [hpc01-test]

TASK [Install traefik] *********************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/traefik.yaml for hpc01-test

TASK [Check that the traefik binary exists] ************************************
ok: [hpc01-test]

TASK [Download traefik binary] *************************************************
changed: [hpc01-test]

TASK [Unpack traefik binary] ***************************************************
changed: [hpc01-test]

TASK [Install traefik binary] **************************************************
changed: [hpc01-test]

TASK [Create traefik group] ****************************************************
changed: [hpc01-test]

TASK [Create the traefik user] *************************************************
changed: [hpc01-test]

TASK [Ensure that traefik configuration directory exists] **********************
changed: [hpc01-test]

TASK [Ensure that traefik acme configuration directory exists] *****************
changed: [hpc01-test]

TASK [Copy traefik configuration] **********************************************
changed: [hpc01-test]

TASK [Copy traefik dynamic configuration] **************************************
changed: [hpc01-test]

TASK [Copy the traefik systemd service file] ***********************************
changed: [hpc01-test]

TASK [Ensure Traefik is enabled on boot] ***************************************
changed: [hpc01-test]

TASK [Allow traefik http through firewall] *************************************
changed: [hpc01-test]

TASK [Allow traefik https through firewall] ************************************
changed: [hpc01-test]

TASK [Install conda] ***********************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/conda.yaml for hpc01-test

TASK [Check that the conda binary exists] **************************************
ok: [hpc01-test]

TASK [download miniconda installer] ********************************************
changed: [hpc01-test]

TASK [install miniforge] *******************************************************
changed: [hpc01-test]

TASK [ensure conda.sh activated in shell] **************************************
changed: [hpc01-test]

TASK [Ensure conda activate directory exists] **********************************
changed: [hpc01-test]

TASK [create conda configuration directory] ************************************
changed: [hpc01-test]

TASK [Remove implicit .condarc file installed by miniforge] ********************
changed: [hpc01-test]

TASK [Create default condarc for users] ****************************************
changed: [hpc01-test]

TASK [Install conda environments] **********************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/conda/environment.yaml for hpc01-test => (item=jupyterhub)
included: /home/balast/CodingProjects/qhub-hpc/tasks/conda/environment.yaml for hpc01-test => (item=dask-gateway)
included: /home/balast/CodingProjects/qhub-hpc/tasks/conda/environment.yaml for hpc01-test => (item=jupyterlab)
included: /home/balast/CodingProjects/qhub-hpc/tasks/conda/environment.yaml for hpc01-test => (item=dashboards)

TASK [create environments directory] *******************************************
changed: [hpc01-test]

TASK [copy environments files] *************************************************
changed: [hpc01-test]

TASK [install conda environment] ***********************************************
failed: [hpc02-test] (item={'path': '/home', 'host': 'hpc01-test'}) => {"ansible_loop_var": "item", "changed": false, "item": {"host": "hpc01-test", "path": "/home"}, "msg": "Error mounting /home: mount.nfs: Connection timed out\n"}
changed: [hpc01-test]

TASK [create environments directory] *******************************************
ok: [hpc01-test]

TASK [copy environments files] *************************************************
changed: [hpc01-test]

TASK [install conda environment] ***********************************************
changed: [hpc01-test]

TASK [create environments directory] *******************************************
ok: [hpc01-test]

TASK [copy environments files] *************************************************
changed: [hpc01-test]

TASK [install conda environment] ***********************************************
failed: [hpc02-test] (item={'path': '/opt/conda', 'host': 'hpc01-test'}) => {"ansible_loop_var": "item", "changed": false, "item": {"host": "hpc01-test", "path": "/opt/conda"}, "msg": "Error mounting /opt/conda: mount.nfs: Connection timed out\n"}

PLAY RECAP *********************************************************************
hpc02-test : ok=45 changed=30 unreachable=0 failed=1 skipped=2 rescued=0 ignored=0

==> hpc02-test: An error occurred. The error will be shown after all tasks complete.
changed: [hpc01-test]

TASK [create environments directory] *******************************************
ok: [hpc01-test]

TASK [copy environments files] *************************************************
changed: [hpc01-test]

TASK [install conda environment] ***********************************************
changed: [hpc01-test]

TASK [Install nfs server] ******************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/nfs-server.yaml for hpc01-test

TASK [Install nfs] *************************************************************
changed: [hpc01-test]

TASK [Ensure nfs directory created] ********************************************
ok: [hpc01-test] => (item=/home)
ok: [hpc01-test] => (item=/opt/conda)

TASK [nfs configuration] *******************************************************
changed: [hpc01-test]

TASK [Ensure nfs server is started] ********************************************
changed: [hpc01-test]

TASK [Share directory] *********************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/share.yaml for hpc01-test

TASK [Ensure that share directory exists] **************************************
changed: [hpc01-test]

TASK [Copy example notebooks] **************************************************
changed: [hpc01-test]

TASK [Install jupyterhub] ******************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/jupyterhub.yaml for hpc01-test

TASK [Create hub config directory] *********************************************
changed: [hpc01-test]

TASK [Create hub state directory] **********************************************
changed: [hpc01-test]

TASK [Copy jupyterhub_config.py file] ******************************************
[WARNING]: File '/etc/jupyterhub/jupyterhub_config.py' created with default
permissions '600'. The previous default was '666'. Specify 'mode' to avoid this
warning.
changed: [hpc01-test]

TASK [Setup JupyterHub systemd unit] *******************************************
changed: [hpc01-test]

TASK [Restart JupyterHub] ******************************************************
changed: [hpc01-test]

TASK [Install jupyterhub ssh] **************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/jupyterhub-ssh.yaml for hpc01-test

TASK [Ensure that jupyterhub-ssh configuration directory exists] ***************
changed: [hpc01-test]

TASK [Copy the jupyterhub_ssh configuration] ***********************************
changed: [hpc01-test]

TASK [Setup JupyterHub-SSH systemd unit] ***************************************
changed: [hpc01-test]

TASK [Restart JupyterHub SSH] **************************************************
changed: [hpc01-test]

TASK [Allow jupyterhub ssh through firewall] ***********************************
changed: [hpc01-test]

TASK [Install dask-gateway] ****************************************************
included: /home/balast/CodingProjects/qhub-hpc/tasks/dask-gateway.yaml for hpc01-test

TASK [Create dask group] *******************************************************
changed: [hpc01-test]

TASK [Create the dask user] ****************************************************
changed: [hpc01-test]

TASK [Ensure that dask-gateway configuration directory exists] *****************
changed: [hpc01-test]

TASK [Ensure that dask-gateway runtime directory exists] ***********************
changed: [hpc01-test]

TASK [Copy the dask-gateway configuration] *************************************
changed: [hpc01-test]

TASK [Copy the dask-gateway systemd service file] ******************************
changed: [hpc01-test]

TASK [Ensure dask-gateway is enabled on boot] **********************************
changed: [hpc01-test]

TASK [Allow dask-gateway proxy through firewall] *******************************
changed: [hpc01-test]

TASK [Install bodo] ************************************************************
skipping: [hpc01-test]

TASK [Enable backups] **********************************************************
skipping: [hpc01-test]

PLAY [hpc-worker] **************************************************************
skipping: no hosts matched

PLAY RECAP *********************************************************************
hpc01-test : ok=158 changed=116 unreachable=0 failed=0 skipped=4 rescued=0 ignored=0

An error occurred while executing multiple actions in parallel.
Any errors that occurred are shown below.

An error occurred while executing the action on the 'hpc02-test'
machine. Please handle this error then try again:

Ansible failed to complete successfully. Any error output should be
visible above. Please fix these errors and try again.

Jupyterlab won't spin up

Running the ubuntu2004 tests, JupyterHub starts, but when I log in to JupyterHub I'm unable to spin up a JupyterLab session for example-user. The slurmspawner logs are given below:

vagrant@hpc01-test:/home/example-user$ cat .jupyterhub_slurmspawner_6.log
/opt/conda/envs/jupyterlab/bin/jupyterhub-singleuser
running command batchspawner-singleuser jupyterhub-singleuser --ip=0.0.0.0 --SingleUserNotebookApp.default_url=/lab
srun: error: Couldn't find the specified plugin name for mpi/pmi_v2 looking at all files
srun: error: cannot find mpi plugin for mpi/pmi_v2
srun: error: cannot create mpi context for mpi/pmi_v2
srun: error: invalid MPI type 'pmi_v2', --mpi=list for acceptable types

Refactor onprem CDS Dashboards code to use new useroptions branch

There is a branch in the cdsdashboards repo incorporating 'spawner options' into the core dashboard build process. This should allow you to simplify your code, and it fixes the problem where restarting the server breaks its dashboard behavior (launching JupyterLab instead).

https://github.com/ideonate/cdsdashboards/tree/useroptions

I'm very happy to do this refactoring, but I've assigned it to Adam so he can assign it back to me if he'd prefer not to do it himself, or add any notes, etc.

CDS limitations around shared dashboards

The current CDS solution has a design that could limit my use case. I present the following situation and proposed changes. I'd like to understand the complexity and level of effort of each change.

I lay out 4 changes to the workflow:

  1. git repo/project registration
  2. launch as user
  3. user resource management and the ability to kill processes, and
  4. weekly Jupyter server clean up.

Today's CDS workflow works well for creating a dashboard or Voilà notebook and sharing it casually within a small group of users. The publisher creates the dashboard and spins up a resource for any user to use. Because the publisher's credentials are embedded in the Jupyter server, the user of a dashboard will have access to the publisher's database and file privileges. This works for limited use cases, but when we consider this workflow for generic applications broadly, it could create a security risk and weaken the audit trail by losing track of who is interacting with data and files. There are additional issues: a publisher may have stopped the server, and a user wouldn't have the tools to start or access it. We also worry about resource planning. It's hard to know a priori whether a dashboard has 1 or 10 users, but we would need to provision for 10. Lastly, under this model dashboards are tied to a user and not a project or team, so if the dashboard is extended by another user the dashboard will move, and the old copy may live on until the original publisher disables it.

I would like Quansight to consider an alternative solution:

  1. Publishers would register a git repo with CDS. The repo would be a "project" or folder. CDS would look for a file called "cds.yaml" listing the files to be run as CDS dashboards along with their environment settings (a sketch is given after this list). To add a new dashboard, a user would include the files, add the instructions to the YAML file, and then do a git push and pull. Optional: CDS could respect the directory structure of the repo to create more logical groupings for users. To be determined: what pulls these changes, or whether a pull is part of the startup process.

  2. The dashboard would launch on a Jupyter server belonging to the end user. This server could either be the "default" named server or a server with a specific name, such as cds. I would propose the latter so that a dashboard would not affect a user's research in a different area.

  3. Similar to the Jupyter tree and Lab interfaces, we will probably also want to make memory usage observable and make it simple to kill a CDS dashboard on your server.

  4. I think we will want a tool that kills all CDS servers on a weekly basis (a weekend reset). This will free up resources, particularly for users who take a quick look and leave.
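
A hypothetical cds.yaml for change 1 might look roughly like the sketch below; the filename comes from the proposal above, while every key and value shown is illustrative only:

dashboards:
  - name: cluster-usage            # illustrative dashboard name
    file: notebooks/cluster_usage.ipynb
    framework: voila
    conda_env: dashboards          # matches the "dashboards" conda environment in the playbooks
  - name: sales-overview
    file: notebooks/sales_overview.ipynb
    framework: voila
    conda_env: dashboards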

What this fixes:

  • Dashboards are owned by a project/team, not a person.
  • We use git to manage code access and changes; reverting to a previous version is simple, and git has access controls.
  • User capacity is easier to manage.
  • It is easier to launch and manage more dashboards.
  • Users control which dashboard kernels they are running and can see their resource use.
  • The use cases for dashboards widen when users act as themselves: SQL inserts and data changes are better controlled with a clear audit trail, and file browsing and access is cleaner.
  • It avoids the use of a utility account that we would probably otherwise consider using.

Downsides: for the first dashboard, users will need to wait for the server to launch, and then launch the dashboard.

Conda env should be omitted from spawner options form for dashboards

The conda env is already selected on the dashboard edit form, so it is not needed on the spawner options form.

In the jupyterhub_config.py.j2 template (cdsb2 branch), there is an attempt at this, but the problem is that the database doesn't always seem to have the server/spawner user_options loaded in the options_form function.

Running a simple local setup of JupyterHub, I do see all the values I need at that point, so I'm not sure why they are missing in the HPC environment.

Deployment takes two ansible-playbook applications

The current playbook deploys the workers and the master at the same time. The deployment fails the first time because the workers try to mount the /opt/conda directory via NFS before the master tasks have provisioned the NFS server. So far I haven't found a clean way to make the NFS clients wait until the master has progressed far enough for the NFS server to be provisioned.

This will likely be a simple fix.
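
One possible approach (a sketch only; the group name matches the test inventory, but the exact task placement is an assumption) is to have the worker play block on the master's NFS port before attempting the mounts:

- name: Wait for the NFS server on the master before mounting
  ansible.builtin.wait_for:
    host: "{{ groups['hpc-master'][0] }}"
    port: 2049        # nfsd
    timeout: 600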

Dashboards as Viewer

These are based on the 'monorepo' idea - a shared folder (or folders) containing a tree of Voilà apps, maybe with a separate one for Bokeh.

25 hrs: Option to launch dashboards as viewer - assumes files are in a shared folder or come via git. This would be a global option affecting all dashboards, so we could no longer start 'run as owner' dashboards. This would include:

  • Build raw option into cdsdashboards
  • Regex or path to be specified in config so users can't create dashboards outside of shared folders
  • Integrate as an option into qhub
  • There is already a PR to serve multiple Bokeh files. We would finish that work, including expanding a folder to find all files

15 hrs: Testing, debugging, transition to client

Multiple primary machines in test Vagrantfiles

I believe I found a bug. Vagrant's documentation specifies that you can only have a single primary machine, but the qhub-onprem/tests/ubuntu*/Vagrantfile has multiple primary VMs defined.
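
For reference, a minimal Vagrantfile sketch of the intended shape (machine and box names taken from the test output above; only one machine carries primary: true):

Vagrant.configure("2") do |config|
  # Only one machine may be marked as primary
  config.vm.define "hpc01-test", primary: true do |node|
    node.vm.box = "generic/ubuntu2004"
  end
  config.vm.define "hpc02-test" do |node|
    node.vm.box = "generic/ubuntu2004"
  end
end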

Required ansible-galaxy collection installs

I needed to install a few ansible-galaxy collections for Ansible to provision the nodes properly. I happened to be running on VMs in tests/ubuntu2004, but I think this would be required regardless. Here is the list of ansible-galaxy collections I needed to install.

  • ansible-galaxy collection install community.general
  • ansible-galaxy collection install community.mysql
  • ansible-galaxy collection install ansible.posix

Transfer GCP deployment to using terraform CDK

The files for terraform are stored here:
https://github.com/Quansight/qhub-hpc/tree/main/tests/gcp

The goal is to translate these Terraform files into Terraform CDK.

The resources can be inspected in the Google Cloud console by searching for "compute engine".
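
A minimal CDKTF (Python) skeleton for the translation might look like the sketch below; the stack name is a placeholder, and the actual provider and compute engine resources still need to be ported from tests/gcp:

from constructs import Construct
from cdktf import App, TerraformStack

class QhubHpcGcpStack(TerraformStack):
    def __init__(self, scope: Construct, id: str):
        super().__init__(scope, id)
        # The Google provider and the master/worker compute engine instances
        # currently defined in tests/gcp would be declared here.

app = App()
QhubHpcGcpStack(app, "qhub-hpc-gcp")
app.synth()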

Testing existing terraform files

Clone and cd into the proper directory:

git clone https://github.com/Quansight/qhub-hpc.git
cd qhub-hpc/tests/gcp

The Terraform script requires a private key named id_rsa in the same directory as the script:

ssh-keygen

When prompted for a name, enter ./id_rsa

Change the permissions on the file:

chmod 400 id_rsa

Now we can initialize, plan, and apply the Terraform files:

terraform init
terraform plan
terraform apply --auto-approve

To ssh directly to any of the instances, run ssh -i id_rsa ubuntu@<public_ip>
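
When you are finished testing, tear the resources down with:

terraform destroy --auto-approve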

Ansible-playbook takes two applications

The current playbook deploys the workers and the master at the same time. The deployment fails the first time because the workers try to mount the /opt/conda directory via NFS before the master tasks have provisioned the NFS server. So far I haven't found a clean way to make the NFS clients wait until the master has progressed far enough for the NFS server to be provisioned.

This will likely be a simple fix.

Automate configuration of Slurm workers

Currently, a user has to manually run a notebook to create a Slurm configuration file. It would be nice to automate this so that users have this file created by default.

Could use some lessons learned from initialize.py here
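
For example, a hypothetical slurm.conf.j2 fragment could derive the worker node entries from gathered Ansible facts rather than a manually generated file (the group and fact names below are assumptions based on the playbook output above):

{% for host in groups['hpc-worker'] %}
NodeName={{ host }} CPUs={{ hostvars[host]['ansible_processor_vcpus'] }} RealMemory={{ hostvars[host]['ansible_memtotal_mb'] }} State=UNKNOWN
{% endfor %}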
