RAPS_PoC_deployment

This project creates everything you need to deploy HPC clusters on Azure and to begin testing ECMWF codes.

1: Install required programs

First, install the Azure CLI and Terraform. Terraform is a program which deploys cloud instances using code instead of a GUI, and it requires the Azure CLI to be able to deploy Azure instances.
The script "scripts/install_prereqs.sh" will try to install Terraform and the Azure CLI for you. If successful, you will be prompted to log in.
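
Before moving on, it can help to confirm that both tools actually ended up on your PATH. The following is a minimal sketch and not part of this repo; the helper name `missing_tools` is hypothetical. It only assumes that the Azure CLI is invoked as `az` and Terraform as `terraform`.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): print each named tool
# that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Report which of the two prerequisites, if any, are missing.
missing=$(missing_tools az terraform)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "az and terraform are both installed"
fi
```

If `az` is installed but you were not prompted to log in, running `az login` starts the login flow manually.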

The script "scripts/make_keys.sh" generates an SSH keypair and stores it in this repo under ".ssh". These keys are used to connect to your clusters.
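
For reference, generating such a keypair by hand would look roughly like the sketch below. This is not the contents of "scripts/make_keys.sh": the helper name `make_keypair`, the RSA key type, the 4096-bit size and the empty passphrase are all assumptions. The file name "cc_key" is taken from the connect command shown in section 4.

```shell
#!/bin/sh
# Sketch only: the real scripts/make_keys.sh may use different options.
# make_keypair DIR NAME creates DIR/NAME (private key) and DIR/NAME.pub (public key).
make_keypair() {
    key_dir=$1
    key_name=$2
    mkdir -p "$key_dir" && chmod 700 "$key_dir" || return 1
    # -t rsa -b 4096: assumed key type and size
    # -N "":          empty passphrase (an assumption, convenient for scripted use)
    # -q:             suppress progress output
    ssh-keygen -q -t rsa -b 4096 -N "" -f "$key_dir/$key_name"
}
```

From the repo root this would be called as `make_keypair .ssh cc_key`. If the key file already exists, `ssh-keygen` asks before overwriting it, so remove or rename an old key first.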

2: Deploy a CycleCloud host server

CycleCloud is an Azure tool for automating the creation and deployment of HPC clusters. It is a server with a web front-end which you can use to create clusters interactively.
"deploy-cyclecloud/" contains a Terraform project which deploys a CycleCloud host server. Create your CycleCloud host as outlined in "deploy-cyclecloud/README.md" or by using the script "scripts/deploy_cyclecloud_host.sh".

You need to provide Azure Service Principal (SP) details, as well as a cluster username and password, in config.env before executing the deploy_cyclecloud_host.sh script. To get SP credentials, you can use the script "scripts/create_sp.sh". The password likely needs to contain a special character, a number and an uppercase character.
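
Since a rejected password only shows up once deployment is under way, a quick local check can save a round trip. The sketch below is not part of this repo, and the function name `password_looks_ok` is hypothetical. The exact password policy is not stated here, so the three checks are assumptions based on the rule of thumb above.

```shell
#!/bin/sh
# Sketch only: assumed policy, not an official one.
# Returns 0 if the password contains an uppercase letter, a digit
# and at least one non-alphanumeric character; returns 1 otherwise.
password_looks_ok() {
    pw=$1
    case $pw in *[[:upper:]]*) ;; *) return 1 ;; esac
    case $pw in *[[:digit:]]*) ;; *) return 1 ;; esac
    case $pw in *[![:alnum:]]*) ;; *) return 1 ;; esac
    return 0
}
```

For example, `password_looks_ok 'Abc123!x' && echo ok` prints "ok", while a password with no uppercase letter makes the function return a non-zero status. Passing this check does not guarantee that the password will be accepted, as further rules such as a minimum length may apply.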

3: Configure the CycleCloud CLI

There is also a CLI for CycleCloud, which is used to upload custom cluster definition files and to interact with your clusters from your local machine.
The directory "cluster-configs/" contains a template file which defines a cluster for testing ECMWF workloads on Azure. The script "scripts/install_cyclecloud_cli.sh" installs and configures the CycleCloud CLI. It also uploads the aforementioned template.

The script ends by printing the web address of the CycleCloud GUI, as well as the username and password required to log in.

4: Launch a Cluster

Navigate to the URL printed by "scripts/install_cyclecloud_cli.sh" and log in. Click the "+" button in the bottom left corner to create a new cluster. Select the template called "alma_slurm_singleQ". Most options are preloaded, so you only need to fill in a name and select a subnet.

Alternatively, the script "scripts/create_cluster.sh" creates and launches a cluster of hbv3 instances from the command line, based on the ECMWF-specific VM image (VMI). The scheduler node takes around 5 minutes to launch. The script then configures the scheduler: it adds a public key so you can connect, uploads a private key which can be used to clone git repos, and clones raps and some other repos. Make sure you provide an absolute path to a GitHub key in config.env before running this script. You can then connect using "cyclecloud connect scheduler -c hbv3-cluster -k ~/RAPS_PoC_deployment/.ssh/cc_key".

The script waits until the cluster has been created. It then uploads your git key and clones some repos containing ECMWF-specific codes and Slurm scripts. Once all this is done, the script prints the cyclecloud command needed to connect to your new cluster.
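
Because the GitHub key path in config.env must be absolute, it may be worth checking the value before launching "scripts/create_cluster.sh". The sketch below is not part of this repo and the function name `is_usable_key_path` is hypothetical. It makes no assumption about what the variable is called in config.env; you pass the path in as an argument.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeeds only if the given path
# starts with "/" and refers to an existing regular file.
is_usable_key_path() {
    case $1 in
        /*) [ -f "$1" ] ;;
        *)  return 1 ;;
    esac
}
```

Note that a value such as '~/.ssh/my_key' fails this check when the tilde has not been expanded, which is typically the case for a quoted value in a config file. Writing out the full path starting with "/" avoids the problem.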

5: Prerequisites before running IFS

Once you log on to the scheduler node of your new cluster, you should see some existing directories in your home directory. "raps-poc" contains scripts to build the IFS and the IFS dwarves, as well as scripts to set up Lustre and link it to your new cluster. "raps", "dwarves" and "ifs-bundle-CY48R1" contain the source code for their respective projects.

To start running IFS, you first need to set up Lustre so you can access the IFS input data. Go to "~/raps-poc/lustre" and run the script "install_prereqs.sh" (which only installs the Azure CLI), followed by "create_lustre.sh". The second script deploys Lustre in Azure, configures firewalls so that Lustre can talk to the blob storage containing the input data, preloads the data into Lustre, mounts the Lustre filesystem (LFS) onto the scheduler and finally fixes broken symlinks and similar issues in the LFS.
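
The two Lustre scripts need to run in order, and there is little point in starting the second if the first fails. One way to enforce that is sketched below. The helper `run_steps` is hypothetical and not part of "raps-poc", and it assumes the scripts are executable.

```shell
#!/bin/sh
# Hypothetical helper: run the named scripts from DIR in order and
# stop at the first one that exits with a non-zero status.
run_steps() {
    step_dir=$1
    shift
    for step in "$@"; do
        echo "==> $step"
        # The subshell keeps the caller's working directory unchanged.
        ( cd "$step_dir" && "./$step" ) || {
            echo "$step failed; not running later steps" >&2
            return 1
        }
    done
}
```

On the scheduler node this would be called as `run_steps "$HOME/raps-poc/lustre" install_prereqs.sh create_lustre.sh`.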

Next, go to "~/raps" and source "initbm".

Finally, navigate to "~/raps/bin/SLURM/azure/hbv3", where you will find some example Slurm scripts. Currently only "io_ens_tco319.hb3.slurm" is known to work.

TODO

Move the az CLI install on the cluster from Lustre initialisation to the image build process.
