
1Click-HPC

This project aims to speed up the deployment of an HPC cluster on AWS. By following the instructions below, a fully functional, ready-to-use HPC cluster is created with just one click.

Get Started

Step 1

Click the link below corresponding to your preferred AWS Region. You will be asked a few questions about services like VPC, FSx, etc.; if you are unsure how to answer or what these services are, just leave the default values. 1Click-HPC will take care of creating everything needed for your HPC cluster to run.

US
  • N. Virginia (us-east-1): Launch
  • Ohio (us-east-2): Launch
  • N. California (us-west-1): Launch
  • Oregon (us-west-2): Launch

Canada
  • Central (ca-central-1): Launch

EU
  • Frankfurt (eu-central-1): Launch
  • Ireland (eu-west-1): Launch
  • Stockholm (eu-north-1): Launch
  • Milan (eu-south-1): Launch

APJ
  • Tokyo (ap-northeast-1): Launch
  • Seoul (ap-northeast-2): Launch
  • Hong Kong (ap-east-1): Launch
  • Mumbai (ap-south-1): Launch

Step 2

  1. Just change the "Stack Name" as you like.
  2. Enter the password for the admin user "ec2-user".
  3. Check the box to acknowledge the creation of IAM resources.
  4. Click the "Create Stack" button.


Step 3

  1. Click on the "Stack Name" to monitor the cluster creation steps.
  2. Wait until all the resources are created.


Step 4

  1. When the cluster creation is complete, go to the "Outputs" tab.
  2. Click the "EnginFrameURL" to access your HPC cluster through the EnginFrame portal.
  3. Alternatively, click the "Cloud9URL" if you wish to connect to your Cloud9 instance and then SSH into your cluster from there.


Step 5

You can log in to EnginFrame using "ec2-user" as the username and the password you chose.
Username: ec2-user
Password: *********


Step 6

After you log in, you are redirected to the "List Spoolers" page. Spoolers are scratch areas located in the /fsx file system; they are managed by EnginFrame and used as the execution directories for HPC jobs.


Step 7

We recommend immediately changing the password by using the service shown below.


Architecture

[Architecture diagram]

Additional Docs

https://github.com/aws-samples/1click-hpc/tree/main/docs

License

This software is licensed under the MIT-0 License. See the LICENSE file.

Contributors

amazon-auto, cmbrehm, nicolaven, robbymeda, rvencu, sean-smith, soham-g


1click-hpc's Issues

FSx for Lustre volume is readable by all users

We noticed that in a multi-user HPC cluster with FSx attached, all users are able to browse and read files belonging to other users, even though writing is possible only in their own folders.

Is there a way to restrict access, similar to how the /home folders work?
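A hedged sketch of one way to get /home-like privacy, assuming per-user directories already exist directly under /fsx (the layout is an assumption); run as root on the head node:

#!/bin/bash
# tighten each user's top-level directory on the FSx volume to owner-only access
for dir in /fsx/*/; do
    user=$(basename "$dir")
    # only chmod directories that actually belong to a known user
    if id "$user" >/dev/null 2>&1; then
        chmod 700 "$dir"
    fi
done

Note that newly created files would still need a restrictive umask (e.g. 077) to stay private in shared areas.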

Error when running the CloudFormation stack

The Lambda function created gave the error below:



26 Jan 2022 09:05:13,763 [INFO] (/var/runtime/bootstrap.py) main started at epoch 1643187913763
--
26 Jan 2022 09:05:13,961 [INFO] (/var/runtime/bootstrap.py) init complete at epoch 1643187913961
Traceback (most recent call last):
  File "/var/task/index.py", line 48, in lambda_handler
    responseData = {'Error': traceback.format_exc(e)}
  File "/var/lang/lib/python3.6/traceback.py", line 167, in format_exc
    return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
  File "/var/lang/lib/python3.6/traceback.py", line 121, in format_exception
    type(value), value, tb, limit=limit).format(chain=chain))
  File "/var/lang/lib/python3.6/traceback.py", line 509, in __init__
    capture_locals=capture_locals)
  File "/var/lang/lib/python3.6/traceback.py", line 338, in extract
    if limit >= 0:
TypeError: '>=' not supported between instances of 'ClientError' and 'int'
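For what it's worth, the traceback points at the handler's own error path rather than the root cause: in Python 3, traceback.format_exc() accepts only the limit and chain keyword arguments, so calling traceback.format_exc(e) passes the caught ClientError as limit and raises the TypeError shown above, masking the original error.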

DCV Jobs failing

Not sure if there's a setup step I'm missing here, but when I run the included Windows or Linux DCV job I get:

sbatch failed (parameters: -J Linux_Desktop -D /fsx/nice/enginframe/sessions/ec2-user/tmp4716553958834750820.session.ef -C dcv2, exit value: 1)
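The submission requests the node feature dcv2 (-C dcv2), so one hedged first check is whether any compute node actually advertises that feature:

# list each node with its Slurm features and look for "dcv2"
sinfo --Node -o "%N %f" | grep -i dcv2 || echo "no node exposes the dcv2 feature"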

Some heads-up needed to customize the code

Hi, I am adding customization to implement https://docs.aws.amazon.com/parallelcluster/latest/ug/launch-instances-odcr-v3.html

I am creating a uniquely named policy and attaching it to the HeadNode just fine.

I am also creating a resource group to hold all existing targeted capacity reservations. Should I use some query for that, or can I just attach an ARN containing a wildcard in the last section?

The second and harder problem: I need to create the JSON that overrides the Slurm compute node settings. I can retrieve the current zone ID and account ID from the head node itself, but I need some way to pass in the cluster name or the group name so I do not have to hardcode it in the file (see the sketch after the script). Currently that script looks like this: https://github.com/rvencu/1click-hpc/blob/main/modules/50.install.capacity.reservation.pool.sh

#!/bin/bash
set -e

ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
EC2_AVAIL_ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
EC2_REGION=$(echo "$EC2_AVAIL_ZONE" | sed 's/[a-z]$//')  # drop the AZ letter to get the region

# Override run_instance attributes
# Name of the group is still hardcoded, need a way to get variable from cloudformation here
cat > /opt/slurm/etc/pcluster/run_instances_overrides.json << EOF
{
    "compute-od-gpu": {
        "p4d-24xlarge": {
            "CapacityReservationSpecification": {
                "CapacityReservationTarget": {
                    "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:$EC2_REGION:$ACCOUNT_ID:group/EC2CRGroup"
                }
            }
        }
    }
}
EOF
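A hedged sketch of one way to avoid hardcoding the name: ParallelCluster 3 tags cluster instances with parallelcluster:cluster-name, so the head node can look up its own tag (assumes the instance role is allowed to call ec2:DescribeTags):

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
CLUSTER_NAME=$(aws ec2 describe-tags --region "$EC2_REGION" \
    --filters "Name=resource-id,Values=$INSTANCE_ID" \
              "Name=key,Values=parallelcluster:cluster-name" \
    --query 'Tags[0].Value' --output text)
# CLUSTER_NAME can then replace the hardcoded group name in the heredoc above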

ec2-user Password characters

Hi Nicola, Sean / Team,

Please add a note on the CloudFormation page that the ec2-user password must not contain the special character "@". I see CloudFormation rolling back at the SlurmDB creation stage with the status reason below if I use "@" in the password:

"The password for the master user. The password can include any printable ASCII character except "/", """, or "@"."

Capacity/Production Readiness

All,

I stumbled on this while struggling to get slurmrestd set up on ParallelCluster. It looks like this provides a lot of friendly wrappers for HPC-type problems. Is this code production-ready? Does it support GPU instances? Is there an API provided?

grafana monitoring not working for static resources

When we use a non-zero minimum for compute resources in the cluster config, the nodes come alive at cluster launch, so this job-related check never evaluates to true:

if [[ $job_comment == *"Key=Monitoring,Value=ON"* ]]; then

Because this must run in the root context, the only chance to do it is in the prolog script attached to a job, so the plan would basically be to:

  1. install the Docker container anyway in post-install, but do not start it
  2. use the prolog and epilog to start and stop the container, depending on the user's choice to monitor or not

The problem is how to signal the job's intent to the prolog and epilog, since custom user environment variables are not passed, and neither is the job comment. Per the Slurm manuals we should not run scontrol from the prolog; that would impair job scaling, similarly to the API calls (this is related to #34).

Looking at the variables available at prolog/epilog time, I only have two ideas so far (a sketch of the second follows this list):

  1. SLURM_PRIO_PROCESS, the scheduling priority (nice value) at the time of submission, available in SrunProlog, TaskProlog, SrunEpilog and TaskEpilog. We can set #SBATCH --nice=0 or some sensible value to uniquely identify the intention, then use the TaskProlog and TaskEpilog to start/stop the monitoring container.
  2. Use a crafted Slurm job name like "[GM] my job name", then pick up and interpret this from SLURM_JOB_NAME, available in PrologSlurmctld, SrunProlog, TaskProlog, EpilogSlurmctld, SrunEpilog and TaskEpilog. This also means using the TaskProlog and TaskEpilog to start/stop the monitoring container.
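A minimal sketch of idea 2, assuming the container is named "monitoring" and that the job user may control Docker via a sudo rule (TaskProlog runs as the job user, not root):

#!/bin/bash
# hypothetical TaskProlog: a "[GM]" prefix in the job name turns monitoring on
if [[ "$SLURM_JOB_NAME" == "[GM]"* ]]; then
    sudo docker start monitoring >/dev/null 2>&1 || true
fi

The matching TaskEpilog would run "sudo docker stop monitoring" under the same condition.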

Gateway timeout when using Job Submission

I think the title is self-explanatory.

I had to increase the Idle Timeout setting in the ALB to make it work.
You may want to adjust it in the CF template.

Regards.
PL
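For anyone hitting the same thing, a hedged sketch of the manual change (the 300-second value and $ALB_ARN are placeholders; the ALB default idle timeout is 60 seconds):

aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn "$ALB_ARN" \
    --attributes Key=idle_timeout.timeout_seconds,Value=300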

SSO Integration

Hi Nicola, Sean / Team,

Is there a way to integrate SSO into this stack?
@nicolaven I tried to integrate Okta but without success.
Could you please help me a bit with the integration?

DCV authentication issue with AD users at the creation of Linux DCV sessions in a 1Click-HPC cluster

Ciao Nicola!

As you know, in the context of an HPC POC on AWS for a French company, UCit (mainly myself) has slightly modified and used 1Click-HPC to run the POC's HPC environment.
I have faced a strange authentication issue while trying to start a DCV session on a g4dn instance with CentOS 7.9.2009 + DCV 2022.0 r12760 + EnginFrame (EF) 2021.0-r1592 + Slurm 21.08.8-2 on AWS. The issue is specific to the users stored in the AD attached to the cluster.

I have finally found a fix for the issue, but I think it's important to discuss it with you to understand the underlying behaviour here.

The symptom is the following:

  • launching a DCV session as the system user "centos" using a standard Linux Desktop service in EF works fine
  • launching a DCV session as a user created in the AD, using the exact same standard Linux Desktop service in EF, fails because of an authentication issue

The error message found in slurm-$JobID.out is the following:

[2022/06/09 14:40:15]  INFO  Starting DCV session...
[2022/06/09 14:40:15]  INFO  DCV version supports --gl-display parameter
[2022/06/09 14:40:15]  INFO  Creating DCV session "dcv create-session --type=virtual tmp7339021573904918669 "
Could not create session. Could not get the system bus: Exhausted all available authentication mechanisms (tried: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS) (available: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS)
[2022/06/09 14:40:15] ERROR  Failed to launch DCV session (exit code: 1)
[2022/06/09 14:40:15] FATAL  Error: DCV failed to create session
[2022/06/09 14:40:15] FATAL  Exiting with code 1

After a lot of tests, described below, I found a solution, which consists of adding the following line at the very beginning of the file $EF_ROOT/plugins/interactive/lib/remote/linux.jobscript.functions:

id "${USER}"

Indeed, this initializes a kind of "first connection" for the user trying to start a session on the targeted system, so that the user is known at the system level (the id lookup forces the name service to resolve, and thus cache, the user).

With the help of Benjamin Depardon, I tested the issue by issuing the following command on the head node of the cluster:

srun -N 1 -p dcv-gpu --exclusive -C "[g4dn.xlarge*1]" dcv create-session my_session

We then tried all of the following options:

  • restarting gdm only after dcvserver was restarted at the end of the installation process => NOT working

  • restarting dbus + dbus-org.+ gdm after dcvserver was restarted at the end of the installation process => NOT working

  • changing /etc/pam.d/dcv with the following contents

#%PAM-1.0
# Default NICE DCV PAM configuration.
# This file is auto-generated, user changes will be destroyed at
# installation/update time.
# To make changes, create a file named dcv.custom in this
# directory and set the 'pam-service-name' parameter in the
# [security] section of dcv.conf to 'dcv.custom'
#auth    include password-auth
#account include password-auth
auth    include password-auth
account     required                                     pam_access.so
account     required                                     pam_unix.so
account     sufficient                                   pam_localuser.so
account     sufficient                                   pam_usertype.so issystem
account     [default=bad success=ok user_unknown=ignore] pam_sss.so
account     required                                     pam_permit.so

=> NOT working

  • running on the remote system the commands:
    $> getent passwd | grep username
    or
    $> getent passwd -s sss | grep username
    or
    $> sssctl cache-upgrade
    => NOT working

  • adding the following command at the very beginning of Slurm's prolog.sh script:
    $> id "${SLURM_JOB_USER}"
    => NOT working

  • running the following command on the DCV node before the session was created:
    $> id username
    or
    $> sssctl user-checks username
    => SUCCESSFUL

  • connecting to the DCV node with SSH as the user username (or as any other user and then switching with the command: $> su - username) before the session was created

=> SUCCESSFUL

Our conclusion is that the user must be known by the system (and stored in some kind of cache) for the authentication process to allow the execution of the tasks required by the Slurm job.

Our questions are:

  • is it a known issue?
  • can you explain further how the internal authentication methods of DCV work, and why in our case DCV denied the authorization for the AD user to create a session?
  • is there a "better" way to solve it than to hack the EF code the way we did, to allow any user in the AD to launch a DCV session?

Please don't hesitate to ask for any complementary information and to let us know what you think.

Best regards,
Vincent.

ERROR 502: Bad Gateway

I have an error: "ERROR 502: Bad Gateway".
The CFT installation seems to have worked fine; I have no error message.

On the head node I get the same message:


[ec2-user@ip-XXXX ~]$ wget https://XXXX.eu-west-1.elb.amazonaws.com/ --no-check-certificate
--2022-01-18 17:05:38--  https://XXXX.eu-west-1.elb.amazonaws.com/
Resolving XXXXX.eu-west-1.elb.amazonaws.com (XXXXX.eu-west-1.elb.amazonaws.com) [SNIP] connected.
WARNING: cannot verify XXXXX.eu-west-1.elb.amazonaws.com's certificate, issued by ‘/C=US/ST=WA/L=Seattle/O=AWS WWSO/OU=HPC/CN=EnginFrame’:
Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 502 Bad Gateway
2022-01-18 17:05:38 ERROR 502: Bad Gateway.

Can't use my own Active Directory

I get the following after leaving everything on AUTO except Active Directory:

Template format error: Unresolved resource dependencies [ActiveDirectory] in the Resources block of the template

ParallelCluster 3.6 support

1click-hpc doesn't work with the latest version of ParallelCluster: the OnNodeConfigured scripts of the HeadNode are failing. This seems to be related to a change to /etc/parallelcluster/cfnconfig introduced in the newer version.

LBInit issues

I bumped into LBInit issues a while ago: when I delete a stack, LBInit usually fails to delete. The workaround is to wait a few more minutes and then retry the stack delete, which works.

But today I started having problems with its creation. In the CloudWatch log I find this:

{
    "Status": "FAILED",
    "Reason": "See the details in CloudWatch Log Stream: 2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "PhysicalResourceId": "2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "StackId": "arn:aws:cloudformation:us-east-1:842865360552:stack/origtest/0cdfe300-f1fa-11ec-b068-121de38a7e19",
    "RequestId": "10fc583d-c908-41c1-af07-751ba3a4b563",
    "LogicalResourceId": "LBInit",
    "NoEcho": false,
    "Data": {
        "ClientErrorCode": "NoSuchEntity",
        "ClientErrorMessage": "The Server Certificate with name origtest-981587795.us-east-1.elb.amazonaws.com cannot be found."
    }
}

I have another HPC cluster active with a different name; it should not interfere with the creation of another cluster in the account. The above error still appears with everything set to AUTO.

Is Active Directory integrated into EnginFrame?

EnginFrame is interesting as a workspace, but it seems to require defining users locally, while we are already using AD to manage users.

Is there any existing integration, or can I get some hints about such a potential integration so I can develop it myself?

`pcluster create-cluster --wait` is exiting early

You can see this in bootstrap.log:

+ /home/ec2-user/.local/bin/pcluster create-cluster --cluster-name hpc-1click-hpc365 --cluster-configuration config.us-east-1.yaml --rollback-on-failure false --wait
{
  "message": "The security token included in the request is expired"
}

It seems that if the command runs beyond 10-15 minutes, the EC2/Cloud9 credentials rotate and the pcluster CLI doesn't take this into account. The --wait option seems to be deprecated in pcluster, so we probably need to move to a polling approach that allows the credentials to refresh.

This causes the outer CloudFormation stack to fail initially, but it succeeds if it is retried.
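A minimal sketch of the polling approach, assuming the ParallelCluster 3 CLI and jq are available; each describe-cluster call is a fresh request, so rotated instance credentials are picked up:

# kick off creation without --wait, then poll for a terminal status
pcluster create-cluster --cluster-name "$NAME" \
    --cluster-configuration "$CONFIG" --rollback-on-failure false
while true; do
    STATUS=$(pcluster describe-cluster --cluster-name "$NAME" | jq -r .clusterStatus)
    case "$STATUS" in
        CREATE_COMPLETE) echo "cluster ready"; break ;;
        CREATE_FAILED)   echo "creation failed" >&2; exit 1 ;;
        *)               sleep 60 ;;
    esac
done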

scaling issues due to prolog tagging api

We ran into a scaling issue with the tagging in the prolog script.

I understand the prolog runs at every step, and when many nodes are involved the job fails with timeouts.

We need to find another place to do the tagging. I understand the comment is job-related, but some other tags could be applied just once, when the instances are created, either because of the min value in the configuration or when created by Slurm.

I am looking at places where this could be done.

Maybe it can be done on the head node instead, in the PrologSlurmctld: https://slurm.schedmd.com/prolog_epilog.html
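A hedged sketch of that idea: PrologSlurmctld runs once per job on the controller, so the API call count no longer scales with steps. The tag key is a placeholder, and the instance-ID lookup assumes node hostnames resolve to the EC2 private IP:

#!/bin/bash
# hypothetical PrologSlurmctld: tag every node of the job exactly once
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    ip=$(getent hosts "$host" | awk '{print $1}')
    id=$(aws ec2 describe-instances --region "$REGION" \
        --filters "Name=private-ip-address,Values=$ip" \
        --query 'Reservations[0].Instances[0].InstanceId' --output text)
    aws ec2 create-tags --region "$REGION" --resources "$id" \
        --tags "Key=SlurmJobId,Value=$SLURM_JOB_ID"
done

Note that "scontrol show hostnames" only expands the hostlist expression locally, so it avoids the controller RPCs the Slurm manuals warn about.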

add p4d.24xlarge for us-east-1

We have many p4d.24xlarge pods and we need them in the config.

But more than this, we need to be able to pull them from the capacity reservations we have. Without that there are not many p4d instances available on demand, and the cluster usually fails to build.

Head Node was created in the private subnet

I noticed the head node was created in the private subnet. I checked the parameters in the CloudFormation stack, where I supplied my custom VPC details:

PrivateSubnetAId: subnet-0112272390ac53c95
PublicSubnetAId: subnet-79775a34
PublicSubnetBId: subnet-2fd94770
VpcId: vpc-4678d63b

And the head node was created in the private subnet.


As a consequence, I cannot SSH to the Elastic IP of the head node.

I can see the code does this deliberately:

if [[ $PRIVATE_SUBNET_ID == "NONE" ]];then

If it is supposed to be there, then please explain how to SSH into the head node using the Elastic IP.
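Until that's clarified, a hedged workaround sketch for reaching a head node that has no public IP: go through the Cloud9 instance the stack creates, or use SSM if the agent and IAM permissions are in place (all addresses and the instance ID below are placeholders):

# option 1: use the Cloud9 instance as an SSH jump host
ssh -J ec2-user@CLOUD9_PUBLIC_IP ec2-user@HEADNODE_PRIVATE_IP

# option 2: open an SSM session directly to the head node
aws ssm start-session --target i-0123456789abcdef0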
