
1Click-HPC

This project aims to speed up the deployment of an HPC cluster on AWS. By following the instructions below, a fully functional, ready-to-use HPC cluster is created with just one click.

Get Started

Step 1

Click the link below corresponding to your preferred AWS Region. You will be asked a few questions about services like VPC, FSx, etc.; if you are unsure how to answer or what these services are, just leave the default values. 1Click-HPC will take care of creating everything needed for your HPC cluster to run.

US
  • N. Virginia (us-east-1): Launch
  • Ohio (us-east-2): Launch
  • N. California (us-west-1): Launch
  • Oregon (us-west-2): Launch

Canada
  • Central (ca-central-1): Launch

EU
  • Frankfurt (eu-central-1): Launch
  • Ireland (eu-west-1): Launch
  • Stockholm (eu-north-1): Launch
  • Milan (eu-south-1): Launch

APJ
  • Tokyo (ap-northeast-1): Launch
  • Seoul (ap-northeast-2): Launch
  • Hong Kong (ap-east-1): Launch
  • Mumbai (ap-south-1): Launch

Step 2

  1. Just change the "Stack Name" as you like.
  2. Enter the password for the admin user "ec2-user".
  3. Check the box to acknowledge the creation of IAM resources.
  4. Click the "Create Stack" button.


Step 3

  1. Click on the "Stack Name" to monitor the cluster creation steps.
  2. Wait until all the resources are created.


Step 4

  1. When the cluster creation is complete, go to the "Outputs" tab.
  2. Click the "EnginFrameURL" to access your HPC cluster through the EnginFrame portal.
  3. Alternatively, click the "Cloud9URL" if you wish to connect to your Cloud9 instance and then SSH into your cluster from there.


Step 5

You can log in to EnginFrame using "ec2-user" as the username and the password you chose.
Username: ec2-user
Password: *********


Step 6

After you log in, you are redirected to the "List Spoolers" page. Spoolers are scratch areas located in the /fsx file system; they are managed by EnginFrame and used as the execution directories for HPC jobs.


Step 7

We recommend immediately changing the password by using the service shown below.


Architecture

[Architecture diagram]

Additional Docs

https://github.com/aws-samples/1click-hpc/tree/main/docs

License

This software is licensed under the MIT-0 License. See the LICENSE file.

Contributors

amazon-auto, cmbrehm, nicolaven, robbymeda, rvencu, sean-smith, soham-g


1click-hpc's Issues

FSx for Lustre volume is readable by all users

We noticed that in a multi-user HPC cluster with FSx attached, all users are able to browse and read files belonging to other users, even though writing is possible only in their own folders.

Is there a way to restrict access, similar to how the /home folders work?
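A hedged sketch of one way to get /home-like privacy, assuming per-user directories already exist directly under /fsx (the layout is an assumption); run as root on the head node:

#!/bin/bash
# tighten each user's top-level directory on the FSx volume to owner-only access
for dir in /fsx/*/; do
    user=$(basename "$dir")
    # only chmod directories that actually belong to a known user
    if id "$user" >/dev/null 2>&1; then
        chmod 700 "$dir"
    fi
done

Note that newly created files would still need a restrictive umask (e.g. 077) to stay private in shared areas.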

Error when running the CloudFormation stack

The Lambda function created gave the error below:



26 Jan 2022 09:05:13,763 [INFO] (/var/runtime/bootstrap.py) main started at epoch 1643187913763
--
26 Jan 2022 09:05:13,961 [INFO] (/var/runtime/bootstrap.py) init complete at epoch 1643187913961
Traceback (most recent call last):
  File "/var/task/index.py", line 48, in lambda_handler
    responseData = {'Error': traceback.format_exc(e)}
  File "/var/lang/lib/python3.6/traceback.py", line 167, in format_exc
    return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
  File "/var/lang/lib/python3.6/traceback.py", line 121, in format_exception
    type(value), value, tb, limit=limit).format(chain=chain))
  File "/var/lang/lib/python3.6/traceback.py", line 509, in __init__
    capture_locals=capture_locals)
  File "/var/lang/lib/python3.6/traceback.py", line 338, in extract
    if limit >= 0:
TypeError: '>=' not supported between instances of 'ClientError' and 'int'
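For what it's worth, the traceback points at the handler's own error path rather than the root cause: in Python 3, traceback.format_exc() accepts only the limit and chain keyword arguments, so calling traceback.format_exc(e) passes the caught ClientError as limit and raises the TypeError shown above, masking the original error.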

DCV Jobs failing

Not sure if there's a setup step I'm missing here, but when I run the included Windows or Linux DCV job I get:

sbatch failed (parameters: -J Linux_Desktop -D /fsx/nice/enginframe/sessions/ec2-user/tmp4716553958834750820.session.ef -C dcv2, exit value: 1)
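The submission requests the node feature dcv2 (-C dcv2), so one hedged first check is whether any compute node actually advertises that feature:

# list each node with its Slurm features and look for "dcv2"
sinfo --Node -o "%N %f" | grep -i dcv2 || echo "no node exposes the dcv2 feature"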

Some heads-up needed to customize the code

Hi, I am adding customization to implement https://docs.aws.amazon.com/parallelcluster/latest/ug/launch-instances-odcr-v3.html

I am creating a uniquely named policy and attaching it to the HeadNode just fine.

I am also creating a resource group to hold all existing targeted capacity reservations. Should I use some query for that, or can I just attach an ARN containing a wildcard in the last section?

The second and harder problem: I need to create the JSON that overrides the Slurm compute node settings. I can retrieve the current zone ID and account ID from the head node itself, but I need some way to pass in the cluster name or the group name so I do not have to hardcode it in the file (see the sketch after the script). Currently that script looks like this: https://github.com/rvencu/1click-hpc/blob/main/modules/50.install.capacity.reservation.pool.sh

#!/bin/bash
set -e

ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
EC2_AVAIL_ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
EC2_REGION=$(echo "$EC2_AVAIL_ZONE" | sed 's/[a-z]$//')  # drop the AZ letter to get the region

# Override run_instance attributes
# Name of the group is still hardcoded, need a way to get variable from cloudformation here
cat > /opt/slurm/etc/pcluster/run_instances_overrides.json << EOF
{
    "compute-od-gpu": {
        "p4d-24xlarge": {
            "CapacityReservationSpecification": {
                "CapacityReservationTarget": {
                    "CapacityReservationResourceGroupArn": "arn:aws:resource-groups:$EC2_REGION:$ACCOUNT_ID:group/EC2CRGroup"
                }
            }
        }
    }
}
EOF
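A hedged sketch of one way to avoid hardcoding the name: ParallelCluster 3 tags cluster instances with parallelcluster:cluster-name, so the head node can look up its own tag (assumes the instance role is allowed to call ec2:DescribeTags):

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
CLUSTER_NAME=$(aws ec2 describe-tags --region "$EC2_REGION" \
    --filters "Name=resource-id,Values=$INSTANCE_ID" \
              "Name=key,Values=parallelcluster:cluster-name" \
    --query 'Tags[0].Value' --output text)
# CLUSTER_NAME can then replace the hardcoded group name in the heredoc above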

ec2-user Password characters

Hi Nicola, Sean / Team,

Please add a note on the CloudFormation page that the ec2-user password must not contain the special character "@". I see CloudFormation rolling back at the SlurmDB creation stage with the status reason below if I use "@" in the password:

"The password for the master user. The password can include any printable ASCII character except "/", """, or "@"."

Capacity/Production Readiness

All,

I stumbled on this while struggling to get slurmrestd set up on ParallelCluster. It looks like this provides a lot of friendly wrappers for HPC-type problems. Is this code production-ready? Does it support GPU instances? Is there an API provided?

grafana monitoring not working for static resources

When we use a non-zero minimum for compute resources in the cluster config, the nodes come alive at cluster launch, so this job-related check never evaluates to true:

if [[ $job_comment == *"Key=Monitoring,Value=ON"* ]]; then

Because this must run in the root context, the only chance to do it is in the prolog script attached to a job, so the plan would basically be to:

  1. install the Docker container anyway in post-install, but do not start it
  2. use the prolog and epilog to start and stop the container, depending on the user's choice to monitor or not

The problem is how to signal the job's intent to the prolog and epilog, since custom user environment variables are not passed, and neither is the job comment. Per the Slurm manuals we should not run scontrol from the prolog; that would impair job scaling, similarly to the API calls (this is related to #34).

Looking at the variables available at prolog/epilog time, I only have two ideas so far (a sketch of the second follows this list):

  1. SLURM_PRIO_PROCESS, the scheduling priority (nice value) at the time of submission, available in SrunProlog, TaskProlog, SrunEpilog and TaskEpilog. We can set #SBATCH --nice=0 or some sensible value to uniquely identify the intention, then use the TaskProlog and TaskEpilog to start/stop the monitoring container.
  2. Use a crafted Slurm job name like "[GM] my job name", then pick up and interpret this from SLURM_JOB_NAME, available in PrologSlurmctld, SrunProlog, TaskProlog, EpilogSlurmctld, SrunEpilog and TaskEpilog. This also means using the TaskProlog and TaskEpilog to start/stop the monitoring container.
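A minimal sketch of idea 2, assuming the container is named "monitoring" and that the job user may control Docker via a sudo rule (TaskProlog runs as the job user, not root):

#!/bin/bash
# hypothetical TaskProlog: a "[GM]" prefix in the job name turns monitoring on
if [[ "$SLURM_JOB_NAME" == "[GM]"* ]]; then
    sudo docker start monitoring >/dev/null 2>&1 || true
fi

The matching TaskEpilog would run "sudo docker stop monitoring" under the same condition.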

Gateway timeout when using Job Submission

I think the title is self-explanatory.

I had to increase the Idle Timeout setting in the ALB to make it work.
You may want to adjust it in the CF template.

Regards.
PL
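For anyone hitting the same thing, a hedged sketch of the manual change (the 300-second value and $ALB_ARN are placeholders; the ALB default idle timeout is 60 seconds):

aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn "$ALB_ARN" \
    --attributes Key=idle_timeout.timeout_seconds,Value=300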

SSO Integration

Hi Nicola, Sean / Team,

Is there a way to integrate SSO into this stack?
@nicolaven I tried to integrate Okta but without success.
Could you please help me a bit with the integration?

DCV authentication issue with AD users at the creation of Linux DCV sessions in a 1Click-HPC cluster

Ciao Nicola!

As you know, in the context of an HPC POC on AWS for a French company, UCit (mainly myself) has slightly modified and used 1Click-HPC to run the POC's HPC environment.
I have faced a strange authentication issue while trying to start a DCV session on a g4dn instance with CentOS 7.9.2009 + DCV 2022.0 r12760 + EnginFrame (EF) 2021.0-r1592 + Slurm 21.08.8-2 on AWS. The issue is specific to the users stored in the AD attached to the cluster.

I have finally found a fix for the issue, but I think it's important to discuss it with you to understand the underlying behaviour here.

The symptom is the following:

  • launching a DCV session as the system user "centos" using a standard Linux Desktop service in EF works fine
  • launching a DCV session as a user created in the AD, using the exact same standard Linux Desktop service in EF, fails because of an authentication issue

The error message found in slurm-$JobID.out is the following:

[2022/06/09 14:40:15]  INFO  Starting DCV session...
[2022/06/09 14:40:15]  INFO  DCV version supports --gl-display parameter
[2022/06/09 14:40:15]  INFO  Creating DCV session "dcv create-session --type=virtual tmp7339021573904918669 "
Could not create session. Could not get the system bus: Exhausted all available authentication mechanisms (tried: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS) (available: EXTERNAL, DBUS_COOKIE_SHA1, ANONYMOUS)
[2022/06/09 14:40:15] ERROR  Failed to launch DCV session (exit code: 1)
[2022/06/09 14:40:15] FATAL  Error: DCV failed to create session
[2022/06/09 14:40:15] FATAL  Exiting with code 1

After a lot of tests, described below, I found a solution, which consists of adding the following line at the very beginning of the file $EF_ROOT/plugins/interactive/lib/remote/linux.jobscript.functions:

id "${USER}"

Indeed, this initializes a kind of "first connection" for the user trying to start a session on the targeted system, so that the user is known at the system level (the id lookup forces the name service to resolve, and thus cache, the user).

With the help of Benjamin Depardon, I tested the issue by issuing the following command on the head node of the cluster:

srun -N 1 -p dcv-gpu --exclusive -C "[g4dn.xlarge*1]" dcv create-session my_session

We then tried all of the following options:

  • restarting gdm only after dcvserver was restarted at the end of the installation process => NOT working

  • restarting dbus + dbus-org.+ gdm after dcvserver was restarted at the end of the installation process => NOT working

  • changing /etc/pam.d/dcv with the following contents

#%PAM-1.0
# Default NICE DCV PAM configuration.
# This file is auto-generated, user changes will be destroyed at
# installation/update time.
# To make changes, create a file named dcv.custom in this
# directory and set the 'pam-service-name' parameter in the
# [security] section of dcv.conf to 'dcv.custom'
#auth    include password-auth
#account include password-auth
auth    include password-auth
account     required                                     pam_access.so
account     required                                     pam_unix.so
account     sufficient                                   pam_localuser.so
account     sufficient                                   pam_usertype.so issystem
account     [default=bad success=ok user_unknown=ignore] pam_sss.so
account     required                                     pam_permit.so

=> NOT working

  • running on the remote system the commands:
    $> getent passwd | grep username
    or
    $> getent passwd -s sss | grep username
    or
    $> sssctl cache-upgrade
    => NOT working

  • adding the following command at the very beginning of Slurm's prolog.sh script:
    $> id "${SLURM_JOB_USER}"
    => NOT working

  • running the following command on the DCV node before the session was created:
    $> id username
    or
    $> sssctl user-checks username
    => SUCCESSFUL

  • connecting to the DCV node with SSH as the user username (or as any other user and then switching with the command: $> su - username) before the session was created

=> SUCCESSFUL

Our conclusion is that the user must be known by the system (and stored in some kind of cache) for the authentication process to allow the execution of the tasks required by the Slurm job.

Our questions are:

  • is it a known issue?
  • can you explain further how the internal authentication methods of DCV work, and why in our case DCV denied the authorization for the AD user to create a session?
  • is there a "better" way to solve it than to hack the EF code the way we did, to allow any user in the AD to launch a DCV session?

Please don't hesitate to ask for any complementary information and to let us know what you think.

Best regards,
Vincent.

ERROR 502: Bad Gateway

I have an error: "ERROR 502: Bad Gateway".
The CFT installation seems to have worked fine; I have no error message.

On the head node I get the same message:


[ec2-user@ip-XXXX ~]$ wget https://XXXX.eu-west-1.elb.amazonaws.com/ --no-check-certificate
--2022-01-18 17:05:38--  https://XXXX.eu-west-1.elb.amazonaws.com/
Resolving XXXXX.eu-west-1.elb.amazonaws.com (XXXXX.eu-west-1.elb.amazonaws.com) [SNIP] connected.
WARNING: cannot verify XXXXX.eu-west-1.elb.amazonaws.com's certificate, issued by ‘/C=US/ST=WA/L=Seattle/O=AWS WWSO/OU=HPC/CN=EnginFrame’:
Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 502 Bad Gateway
2022-01-18 17:05:38 ERROR 502: Bad Gateway.

Can't use my own Active Directory

I get the following after leaving everything on AUTO except Active Directory:

Template format error: Unresolved resource dependencies [ActiveDirectory] in the Resources block of the template

ParallelCluster 3.6 support

1click-hpc doesn't work with the latest version of ParallelCluster: the OnNodeConfigured scripts of the HeadNode are failing. This seems to be related to a change to /etc/parallelcluster/cfnconfig introduced in the newer version.

LBInit issues

I bumped into LBInit issues a while ago: when I delete a stack, LBInit usually fails to delete. The workaround is to wait a few more minutes and then retry the stack delete, which works.

But today I started having problems with its creation. In the CloudWatch log I find this:

{
    "Status": "FAILED",
    "Reason": "See the details in CloudWatch Log Stream: 2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "PhysicalResourceId": "2022/06/22/[$LATEST]d883c58469a545d58f5fb61863e901ef",
    "StackId": "arn:aws:cloudformation:us-east-1:842865360552:stack/origtest/0cdfe300-f1fa-11ec-b068-121de38a7e19",
    "RequestId": "10fc583d-c908-41c1-af07-751ba3a4b563",
    "LogicalResourceId": "LBInit",
    "NoEcho": false,
    "Data": {
        "ClientErrorCode": "NoSuchEntity",
        "ClientErrorMessage": "The Server Certificate with name origtest-981587795.us-east-1.elb.amazonaws.com cannot be found."
    }
}

I have another HPC cluster active with a different name; it should not interfere with the creation of another cluster in the account. The above error still appears with everything set to AUTO.

Is Active Directory integrated into EnginFrame?

EnginFrame is interesting as a workspace, but it seems to require defining users locally, while we are already using AD to manage users.

Is there any existing integration, or can I get some hints about such a potential integration so I can develop it myself?

`pcluster create-cluster --wait` is exiting early

You can see this in bootstrap.log:

+ /home/ec2-user/.local/bin/pcluster create-cluster --cluster-name hpc-1click-hpc365 --cluster-configuration config.us-east-1.yaml --rollback-on-failure false --wait
{
  "message": "The security token included in the request is expired"
}

It seems that if the command runs beyond 10-15 minutes, the EC2/Cloud9 credentials rotate and the pcluster CLI doesn't take this into account. The --wait option seems to be deprecated in pcluster, so we probably need to move to a polling approach that allows the credentials to refresh.

This causes the outer CloudFormation stack to fail initially, but it succeeds if it is retried.
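A minimal sketch of the polling approach, assuming the ParallelCluster 3 CLI and jq are available; each describe-cluster call is a fresh request, so rotated instance credentials are picked up:

# kick off creation without --wait, then poll for a terminal status
pcluster create-cluster --cluster-name "$NAME" \
    --cluster-configuration "$CONFIG" --rollback-on-failure false
while true; do
    STATUS=$(pcluster describe-cluster --cluster-name "$NAME" | jq -r .clusterStatus)
    case "$STATUS" in
        CREATE_COMPLETE) echo "cluster ready"; break ;;
        CREATE_FAILED)   echo "creation failed" >&2; exit 1 ;;
        *)               sleep 60 ;;
    esac
done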

scaling issues due to prolog tagging api

We ran into a scaling issue with the tagging in the prolog script.

I understand the prolog runs at every step, and when many nodes are involved the job fails with timeouts.

We need to find another place to do the tagging. I understand the comment is job-related, but some other tags could be applied just once, when the instances are created, either because of the min value in the configuration or when created by Slurm.

I am looking at places where this could be done.

Maybe it can be done on the head node instead, in the PrologSlurmctld: https://slurm.schedmd.com/prolog_epilog.html
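A hedged sketch of that idea: PrologSlurmctld runs once per job on the controller, so the API call count no longer scales with steps. The tag key is a placeholder, and the instance-ID lookup assumes node hostnames resolve to the EC2 private IP:

#!/bin/bash
# hypothetical PrologSlurmctld: tag every node of the job exactly once
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
for host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    ip=$(getent hosts "$host" | awk '{print $1}')
    id=$(aws ec2 describe-instances --region "$REGION" \
        --filters "Name=private-ip-address,Values=$ip" \
        --query 'Reservations[0].Instances[0].InstanceId' --output text)
    aws ec2 create-tags --region "$REGION" --resources "$id" \
        --tags "Key=SlurmJobId,Value=$SLURM_JOB_ID"
done

Note that "scontrol show hostnames" only expands the hostlist expression locally, so it avoids the controller RPCs the Slurm manuals warn about.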

add p4d.24xlarge for us-east-1

We have many p4d.24xlarge pods and we need them in the config.

But more than this, we need to be able to pull them from the capacity reservations we have. Without that there are not many p4d instances available on demand, and the cluster usually fails to build.

Head Node was created in the private subnet

I noticed the head node was created in the private subnet. I checked the parameters in the CloudFormation stack, where I supplied my custom VPC details:

PrivateSubnetAId: subnet-0112272390ac53c95
PublicSubnetAId: subnet-79775a34
PublicSubnetBId: subnet-2fd94770
VpcId: vpc-4678d63b

And the head node was created in the private subnet.


As a consequence, I cannot SSH to the Elastic IP of the head node.

I can see the code does this deliberately:

if [[ $PRIVATE_SUBNET_ID == "NONE" ]];then

If it is supposed to be there, then please explain how to SSH into the head node using the Elastic IP.
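Until that's clarified, a hedged workaround sketch for reaching a head node that has no public IP: go through the Cloud9 instance the stack creates, or use SSM if the agent and IAM permissions are in place (all addresses and the instance ID below are placeholders):

# option 1: use the Cloud9 instance as an SSH jump host
ssh -J ec2-user@CLOUD9_PUBLIC_IP ec2-user@HEADNODE_PRIVATE_IP

# option 2: open an SSM session directly to the head node
aws ssm start-session --target i-0123456789abcdef0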
