- Create and activate an AWS Account
- Select your AWS Region. For the tutorial below, we assume the Region to be `us-west-2`.
- Manage your service limits for GPU enabled EC2 instances. We recommend service limits be set to at least 4 instances each for p3.16xlarge, p3dn.24xlarge, p4d.24xlarge, and g5.48xlarge for training, and 2 instances each for g4dn.xlarge, and g5.xlarge for testing.
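As a quick reference, the recommended minimum limits above can be summarized with a small shell loop. This is only a convenience sketch of the values stated in this tutorial; it does not query or change any AWS quota (quota increases are requested through the Service Quotas console or AWS CLI):

```shell
# Recommended minimum service limits from this tutorial (values come from
# the text above, not from an AWS API call).
limits="p3.16xlarge:4 p3dn.24xlarge:4 p4d.24xlarge:4 g5.48xlarge:4 g4dn.xlarge:2 g5.xlarge:2"
for spec in $limits; do
  # Split each entry on ':' into instance type and minimum limit.
  printf 'instance=%s min_limit=%s\n' "${spec%%:*}" "${spec##*:}"
done
```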
For the build machine, we need the AWS CLI and Docker installed. The AWS CLI must be configured for the Administrator job function. You may use your laptop as your build machine if it has the AWS CLI and Docker installed, or you may launch an EC2 instance for your build machine, as described below.
To launch an EC2 instance for the build machine, you will need Administrator job function access to the AWS Management Console. In the console, execute the following steps:
- Create an Amazon EC2 key pair in your selected AWS Region, if you do not already have one.
- Create an AWS service role for an EC2 instance, and add the AWS managed policy for Administrator access to this IAM role.
- Launch an m5.xlarge instance from the Amazon Linux 2 AMI using the IAM role created in the previous step. Use 100 GB for the `Root` volume size.
- After the instance state is `Running`, connect to your Linux instance as `ec2-user`. On the Linux instance, install the required software tools as described below:

sudo yum install -y docker git
sudo systemctl enable docker.service
sudo systemctl start docker.service
sudo usermod -aG docker ec2-user
exit
Now, reconnect to your linux instance. All steps described under Step by step section below must be executed on the build machine.
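Before proceeding, it can help to confirm the build machine has the required tools on PATH. This is a hedged convenience check, not part of the original setup steps:

```shell
# Check that each build-machine prerequisite is available on PATH.
status=""
for tool in aws docker git; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="$status $tool=ok"
  else
    status="$status $tool=missing"
  fi
done
echo "build machine tools:$status"
```

If any tool reports `missing`, revisit the installation steps above before continuing.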
While the solution described in this tutorial is general, and can be used to train and test any type of deep neural network (DNN) model, we will make the tutorial concrete by focusing on distributed TensorFlow training for the TensorPack Mask/Faster-RCNN and AWS Mask-RCNN models. The high-level outline of the solution is as follows:
- Setup build environment on the build machine
- Use Terraform to create infrastructure
- Stage training data on EFS, or FSx for Lustre shared file-system
- Use Helm charts to launch training jobs in the EKS cluster
- Use Jupyter notebook to test the trained model
For this tutorial, we assume the Region to be `us-west-2`. You may need to adjust the commands below if you use a different AWS Region.
Clone this git repository on the build machine using the following commands:
cd ~
git clone https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow.git
To install `kubectl` on Linux, execute the following commands:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
./eks-cluster/install-kubectl-linux.sh
For non-Linux platforms, install and configure kubectl for EKS, install aws-iam-authenticator, and make sure the command `aws-iam-authenticator help` works.
Install Terraform. The Terraform configuration files in this repository are consistent with Terraform v1.1.4 syntax, but may work with other Terraform versions as well.
Helm is a package manager for Kubernetes. It uses a package format named charts. A Helm chart is a collection of files that define Kubernetes resources. Install Helm.
We recommend the quick start option for first-time walk-through.
This option creates an Amazon EKS cluster, three managed node groups (`system`, `inference`, `training`), and Amazon EFS and Amazon FSx for Lustre shared file-systems. The `system` node group size is fixed, and it runs the pods in the `kube-system` namespace. The EKS cluster uses Cluster Autoscaler to automatically scale the `inference` and `training` node groups up and down as needed. To complete the quick start, execute the commands below:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster-and-nodegroup
terraform init
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'
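Since the same variable values must be passed again to `terraform destroy` when you clean up, it may help to capture them once in shell variables. This is a hedged sketch: it only builds and prints the command string, so you can inspect it before running it:

```shell
# Quick-start variables, captured once so the same values can be reused
# later with `terraform destroy`.
PROFILE=default
REGION=us-west-2
CLUSTER_NAME=my-eks-cluster
AZS='["us-west-2a","us-west-2b","us-west-2c"]'

# Build the apply command string; echo it for review instead of running it.
APPLY_CMD="terraform apply -var=\"profile=$PROFILE\" -var=\"region=$REGION\" -var=\"cluster_name=$CLUSTER_NAME\" -var='azs=$AZS'"
echo "$APPLY_CMD"
```

Remove the `echo` indirection (run the command directly) once the printed command matches your intent.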
The advanced option separates the creation of the EKS cluster from the EKS managed node group for `training`.
To create the EKS cluster, two managed node groups (`system`, `inference`), and the Amazon EFS and Amazon FSx for Lustre shared file-systems, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster
terraform init
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'
Save the output of this command for creating the EKS managed node group.
This step creates an EKS managed node group for `training`. Use the output of the previous command to specify `node_role_arn` and `subnet_ids` below. Specify a unique value for the `nodegroup_name` variable. To create the node group, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-nodegroup
terraform init
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var="node_role_arn=" -var="nodegroup_name=" -var="subnet_ids="
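The empty `-var` flags above must be filled in from the `aws-eks-cluster` output. The sketch below uses placeholder values (the ARN, node group name, and subnet ID are hypothetical, not from this tutorial) and only echoes the resulting command so you can review it before running:

```shell
# Placeholder values -- substitute the node_role_arn and subnet_ids reported
# by the previous `terraform apply` in the aws-eks-cluster directory.
NODE_ROLE_ARN="arn:aws:iam::123456789012:role/example-node-role"  # placeholder
NODEGROUP_NAME="training-ng-1"                                    # must be unique
SUBNET_IDS="subnet-0aaaaaaaaaaaaaaaa"                             # placeholder

# Echo the fully populated command for review; drop `echo` to execute it.
echo terraform apply -var="profile=default" -var="region=us-west-2" \
  -var="cluster_name=my-eks-cluster" -var="node_role_arn=$NODE_ROLE_ARN" \
  -var="nodegroup_name=$NODEGROUP_NAME" -var="subnet_ids=$SUBNET_IDS"
```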
The infrastructure created above includes an EFS file-system and an FSx for Lustre file-system.
Start the `attach-pvc` container for access to the EFS shared file-system by executing the following steps:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster
kubectl apply -f attach-pvc.yaml -n kubeflow
Start the `attach-pvc-fsx` container for access to the FSx for Lustre shared file-system by executing the following steps:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster
kubectl apply -f attach-pvc-fsx.yaml -n kubeflow
Below, we will build and push all the Docker images to Amazon ECR. Replace `aws-region` below, and execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
./build-ecr-images.sh aws-region
To download the COCO 2017 dataset to your build environment instance and upload it to an Amazon S3 bucket, customize the `eks-cluster/prepare-s3-bucket.sh` script to specify your S3 bucket in the `S3_BUCKET` variable, and execute `eks-cluster/prepare-s3-bucket.sh`.
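The `S3_BUCKET` substitution can be scripted. The sketch below demonstrates the `sed` expression on a sample line rather than on the real script; it assumes the script assigns `S3_BUCKET=` on a line of its own (open the script to confirm before applying the same edit to it), and the bucket name shown is hypothetical:

```shell
# Hypothetical bucket name; replace with your own bucket.
S3_BUCKET=my-sample-bucket

# Demonstrate the substitution on a sample line; in practice run the same
# sed expression with -i against eks-cluster/prepare-s3-bucket.sh.
result=$(echo 'S3_BUCKET=' | sed "s|^S3_BUCKET=.*|S3_BUCKET=${S3_BUCKET}|")
echo "$result"
```

Alternatively, simply open the script in a text editor and set the variable by hand.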
Next, we stage the data on EFS and FSx file-systems, so you have the option to use either one in training.
To stage data on EFS, customize the `S3_BUCKET` variable in `eks-cluster/stage-data.yaml` and execute:
kubectl apply -f stage-data.yaml -n kubeflow
To stage data on FSx for Lustre, customize the `S3_BUCKET` variable in `eks-cluster/stage-data-fsx.yaml` and execute:
kubectl apply -f stage-data-fsx.yaml -n kubeflow
Execute `kubectl get pods -n kubeflow` to check the status of the two pods. Once the status of both pods is `Completed`, the data is successfully staged.
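The status check can be filtered with `grep`. The sketch below runs the filter against sample `kubectl get pods` output so it is self-contained; the pod name prefixes are assumptions based on the manifest names, so check the real pod names in your cluster and pipe the real command output instead:

```shell
# Sample output standing in for `kubectl get pods -n kubeflow`; the pod
# names are assumed from the staging manifest names, not observed output.
sample='stage-data-xxxxx       0/1   Completed   0   5m
stage-data-fsx-xxxxx   0/1   Completed   0   5m'

# Count lines that have NOT yet reached Completed status.
remaining=$(printf '%s\n' "$sample" | grep -vc 'Completed')
echo "pods not yet Completed: $remaining"
```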
To deploy the Kubeflow MPIJob CustomResourceDefinition using the mpijob chart, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts
helm install --debug mpijob ./mpijob/
You have two Helm charts available for training Mask-RCNN models. Both Helm charts use the same Kubernetes namespace, namely `kubeflow`. Do not install both Helm charts at the same time.
To train the TensorPack Mask-RCNN model, customize `charts/maskrcnn/values.yaml`, as described below:
- Set `shared_fs` and `data_fs` to `efs`, or `fsx`, as applicable. Set `shared_pvc` to the name of the respective `persistent-volume-claim`, which is `tensorpack-efs-gp-bursting` for `efs`, and `tensorpack-fsx` for `fsx`.
- Use AWS check ip to get the public IP of your web browser client. Use this public IP to set `global.source_cidr` as a `/32` CIDR. This will restrict Internet access to the Jupyter notebook and TensorBoard services to your public IP.
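The public-IP step can be scripted against the AWS check-ip endpoint. This is a hedged sketch: it falls back to an RFC 5737 documentation placeholder address if the network call fails, so always verify the printed IP is really yours before using it:

```shell
# Fetch the client public IP from the AWS check-ip service.
MY_IP=$(curl -s --max-time 5 https://checkip.amazonaws.com 2>/dev/null || true)

# Fall back to a documentation placeholder address when offline.
MY_IP=${MY_IP:-203.0.113.10}

# Form the /32 CIDR for the global.source_cidr value.
SOURCE_CIDR="${MY_IP}/32"
echo "global.source_cidr: $SOURCE_CIDR"
```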
To password protect TensorBoard, you must set `htpasswd` in `charts/maskrcnn/charts/jupyter/values.yaml` to a quoted MD5 password hash.
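One way to generate an MD5-style password hash is with `openssl passwd -1`, which produces an MD5-crypt hash. This is a hedged sketch (the password shown is hypothetical, and you should confirm MD5-crypt is the format the chart expects before relying on it):

```shell
# Generate an MD5-crypt hash for the htpasswd chart value.
# "my-secret-password" is a hypothetical password; use your own.
hash=$(openssl passwd -1 "my-secret-password")

# Print the value quoted, as the chart requires a quoted hash.
echo "htpasswd: \"$hash\""
```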
To install the `maskrcnn` chart, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts
helm install --debug maskrcnn ./maskrcnn/
To train the AWS Mask-RCNN optimized model, customize `charts/maskrcnn-optimized/values.yaml`, as described below:
- Set `shared_fs` and `data_fs` to `efs`, or `fsx`, as applicable. Set `shared_pvc` to the name of the respective `persistent-volume-claim`, which is `tensorpack-efs-gp-bursting` for `efs`, and `tensorpack-fsx` for `fsx`.
- Set `global.source_cidr` to your public source CIDR.
To password protect TensorBoard, you must set `htpasswd` in `charts/maskrcnn-optimized/charts/jupyter/values.yaml` to a quoted MD5 password hash.
To install the `maskrcnn-optimized` chart, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts
helm install --debug maskrcnn-optimized ./maskrcnn-optimized/
Note, this solution uses EKS autoscaling to automatically scale up (from zero nodes) and scale down (to zero nodes) the size of the EKS managed node group used for training. So, if your training node group currently has zero nodes, it may take several minutes for the GPU nodes to be `Ready` and for the training pods to reach the `Running` state. During this time, the `maskrcnn-launcher-xxxxx` pod may crash and restart automatically several times, which is nominal behavior. Once the `maskrcnn-launcher-xxxxx` pod is in the `Running` state, replace `xxxxx` with your launcher pod suffix below and execute:
kubectl logs -f maskrcnn-launcher-xxxxx -n kubeflow
This will show the live training log from the launcher pod.
Model checkpoints and all training logs are also available on the `shared_fs` file-system set in `values.yaml`, i.e. `efs` or `fsx`.
If you configured your `shared_fs` file-system to be `efs`, you can access your training logs by going inside the `attach-pvc` container as follows:
kubectl exec --tty --stdin -n kubeflow attach-pvc bash
cd /efs
ls -ltr maskrcnn-*
Type `exit` to exit from the `attach-pvc` container.
If you configured your `shared_fs` file-system to be `fsx`, you can access your training logs by going inside the `attach-pvc-fsx` container as follows:
kubectl exec --tty --stdin -n kubeflow attach-pvc-fsx bash
cd /fsx
ls -ltr maskrcnn-*
Type `exit` to exit from the `attach-pvc-fsx` container.
Execute `kubectl logs -f jupyter-xxxxx -n kubeflow -c jupyter` to display the Jupyter log. At the beginning of the Jupyter log, note the security token required to access the Jupyter service in a browser.
Execute `kubectl get services -n kubeflow` to get the service DNS address. To test the trained model using a Jupyter notebook, access the service in a browser on port 443 using the service DNS address and the security token. Your URL to access the Jupyter service should look similar to the example below:
Because the service endpoint in this tutorial uses a self-signed certificate, accessing the Jupyter service in a browser will display a browser warning. If you deem it appropriate, proceed to access the service. Open the notebook, and run it to test the trained model. Note, there may not be a trained model checkpoint available at any given time while training is in progress.
To access TensorBoard via web, use the service DNS address noted above. Your URL to access the TensorBoard service should look similar to the example below:
https://xxxxxxxxxxxxxxxxxxxxxxxxx.elb.xx-xxxx-x.amazonaws.com:6443/
Accessing the TensorBoard service in a browser will display a browser warning, because the service endpoint uses a self-signed certificate. If you deem it appropriate, proceed to access the service. When prompted for authentication, use the default username `tensorboard`, and your password.
When training is complete, you may delete an installed chart by executing `helm delete chart-name`, for example `helm delete maskrcnn`. This will destroy all pods used in training and testing, including the TensorBoard and Jupyter service pods. However, the logs and trained models will be preserved on the shared file-system used for training. When you delete all the Helm charts, the Kubernetes Cluster Autoscaler may scale the `inference` and `training` node groups down to zero size.
When you are done with this tutorial, you can destroy all the infrastructure, including the shared EFS and FSx for Lustre file-systems. If you want to preserve the data on the shared file-systems, you may want to first upload it to Amazon S3.
If you used the quick start option above, execute the following commands:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster-and-nodegroup
terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'
If you used the advanced option above, execute the following commands:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-nodegroup
terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var="node_role_arn=" -var="nodegroup_name=" -var="subnet_ids="
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster
terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'