- Create and activate an AWS Account
- Select your AWS Region. For the tutorial below, we assume the region to be us-west-2.
- Manage your service limits for GPU-enabled EC2 instances. We recommend service limits be set to at least 4 instances each for p3.16xlarge, p3dn.24xlarge, p4d.24xlarge, and g5.48xlarge for training, and 2 instances each for g4dn.xlarge and g5.xlarge for testing.
For the build machine, we need the AWS CLI and Docker installed. The AWS CLI must be configured for the Administrator job function. You may use your laptop as your build machine if it has the AWS CLI and Docker installed, or you may launch an EC2 instance for your build machine, as described below.
To launch an EC2 instance for the build machine, you will need Administrator job function access to the AWS Management Console. In the console, execute the following steps:
- Create an Amazon EC2 key pair in your selected AWS Region, if you do not already have one.
- Create an AWS Service role for an EC2 instance, and add the AWS managed policy for Administrator access to this IAM Role.
- Launch an m5.xlarge instance from the Amazon Linux 2 AMI using the IAM Role created in the previous step. Use 100 GB for the Root volume size.
- After the instance state is Running, connect to your Linux instance as ec2-user. On the Linux instance, install the required software tools as described below:
sudo yum install -y docker git
sudo systemctl enable docker.service
sudo systemctl start docker.service
sudo usermod -aG docker ec2-user
exit
Now, reconnect to your Linux instance. All steps described under the Step by step section below must be executed on the build machine.
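Before proceeding, you can optionally verify the build machine tools. The quick checks below assume the IAM Role and Docker group membership described above are in place:
aws sts get-caller-identity
docker info
git --version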
While the solution described in this tutorial is general, and can be used to train and test any type of deep neural network (DNN) model, we will make the tutorial concrete by focusing on distributed TensorFlow training for the TensorPack Mask/Faster-RCNN and AWS Mask-RCNN models. The high-level outline of the solution is as follows:
- Setup build environment on the build machine
- Use Terraform to create infrastructure
- Stage training data on EFS, or FSx for Lustre shared file-system
- Use Helm charts to launch training jobs in the EKS cluster
- Use Jupyter notebook to test the trained model
For this tutorial, we assume the region to be us-west-2. You may need to adjust the commands below if you use a different AWS Region.
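Optionally, you can make us-west-2 the default Region for your AWS CLI profile so you do not have to pass it on every command. This sketch assumes you are using the default profile throughout the tutorial:
aws configure set region us-west-2
export AWS_DEFAULT_REGION=us-west-2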
Clone this git repository on the build machine using the following commands:
cd ~
git clone https://github.com/aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow.git
To install kubectl on Linux, execute the following commands:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
./eks-cluster/install-kubectl-linux.sh
For non-Linux, install and configure kubectl for EKS, install aws-iam-authenticator, and make sure the command aws-iam-authenticator help works.
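Regardless of platform, you can confirm that the required tools are on your PATH before moving on:
kubectl version --client
aws-iam-authenticator help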
Install Terraform. Terraform configuration files in this repository are consistent with Terraform v1.1.4 syntax, but may work with other Terraform versions, as well.
Helm is a package manager for Kubernetes. It uses a package format named charts. A Helm chart is a collection of files that define Kubernetes resources. Install Helm.
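To confirm the installed versions of Terraform and Helm, you may run:
terraform version
helm version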
We recommend the quick start option for first-time walk-through.
This option creates an Amazon EKS cluster, three managed node groups (system, inference, training), Amazon EFS and Amazon FSx for Lustre shared file-systems. The system node group size is fixed, and it runs the pods in the kube-system namespace. The EKS cluster uses Cluster Autoscaler to automatically scale the inference and training node groups up and down as needed. To complete quick-start, execute the commands below:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster-and-nodegroup
terraform init
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'
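After terraform apply completes, kubectl must point at the new cluster. If the Terraform configuration has not already written a kubeconfig entry for you, one way to add it is with the AWS CLI, using the same Region and cluster_name values passed to terraform apply:
aws eks update-kubeconfig --region us-west-2 --name my-eks-cluster
kubectl get nodes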
The advanced option separates the creation of the EKS cluster from EKS managed node group for training.
To create the EKS cluster, two managed node groups (system, inference), and the Amazon EFS and Amazon FSx for Lustre shared file-systems, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster
terraform init
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'
Save the output of this command for creating the EKS managed node group.
This step creates an EKS managed node group for training. Use the output of the previous command to specify node_role_arn and subnet_ids below. Specify a unique value for the nodegroup_name variable. To create the node group, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-nodegroup
terraform init
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var="node_role_arn=" -var="nodegroup_name=" -var="subnet_ids="
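For illustration only, a filled-in invocation might look like the sketch below. The role ARN, node group name, and subnet IDs are placeholders; take the real values from the output of the aws-eks-cluster step, and note that the expected format of subnet_ids (list versus comma-separated string) depends on how the Terraform module declares that variable:
terraform apply -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var="node_role_arn=arn:aws:iam::111122223333:role/example-eks-node-role" -var="nodegroup_name=example-training-nodegroup" -var="subnet_ids=subnet-0123456789abcdef0,subnet-0123456789abcdef1"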
Below, we build and push all the Docker images to Amazon ECR. Replace aws-region below with your AWS Region, and execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow
./build-ecr-images.sh aws-region
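For example, with the Region assumed in this tutorial:
./build-ecr-images.sh us-west-2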
To download the COCO 2017 dataset to your build environment instance and upload it to your Amazon S3 bucket, customize the eks-cluster/prepare-s3-bucket.sh script to specify your S3 bucket in the S3_BUCKET variable, and execute eks-cluster/prepare-s3-bucket.sh
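For example, assuming a hypothetical bucket named my-mask-rcnn-data, you would set S3_BUCKET=my-mask-rcnn-data in eks-cluster/prepare-s3-bucket.sh and then run, from the repository root:
./eks-cluster/prepare-s3-bucket.sh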
Next, we stage the data on the EFS and FSx for Lustre file-systems, so you have the option to use either one for training.
To stage data on EFS, customize the S3_BUCKET variable in eks-cluster/stage-data.yaml and execute:
kubectl apply -f stage-data.yaml -n kubeflow
To stage data on FSx for Lustre, customize the S3_BUCKET variable in eks-cluster/stage-data-fsx.yaml and execute:
kubectl apply -f stage-data-fsx.yaml -n kubeflow
Execute kubectl get pods -n kubeflow to check the status of the two Pods. Once the status of the two Pods is marked Completed, data is successfully staged.
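If you prefer to watch the Pod status change instead of polling, kubectl supports a watch flag:
kubectl get pods -n kubeflow -w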
To deploy the Kubeflow MPIJob CustomResourceDefinition using the mpijob chart, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts
helm install --debug mpijob ./mpijob/
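To confirm that the MPIJob CustomResourceDefinition was registered, you can list the cluster's CRDs and filter for it; the exact CRD name varies by MPI Operator version, so a case-insensitive match is used here:
kubectl get crd | grep -i mpijob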
You have two Helm charts available for training Mask-RCNN models. Both these Helm charts use the same Kubernetes namespace, namely kubeflow. Do not install both Helm charts at the same time.
To train the TensorPack Mask-RCNN model, customize charts/maskrcnn/values.yaml, as described below:
- Set shared_fs and data_fs to efs, or fsx, as applicable. Set shared_pvc to the name of the respective persistent-volume-claim, which is tensorpack-efs-gp-bursting for efs, and tensorpack-fsx for fsx.
- Set global.source_cidr to your public source CIDR.
To password protect TensorBoard, you must set htpasswd in charts/maskrcnn/charts/jupyter/values.yaml to a quoted MD5 password hash.
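One possible way to generate an MD5-based password hash is with openssl, as sketched below. Whether the chart expects this exact hash format is an assumption, so check the comments in the chart's values file:
openssl passwd -1 'your-password'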
To install the maskrcnn chart, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts
helm install --debug maskrcnn ./maskrcnn/
To train the AWS Mask-RCNN optimized model, customize charts/maskrcnn-optimized/values.yaml, as described below:
- Set shared_fs and data_fs to efs, or fsx, as applicable. Set shared_pvc to the name of the respective persistent-volume-claim, which is tensorpack-efs-gp-bursting for efs, and tensorpack-fsx for fsx.
- Set global.source_cidr to your public source CIDR.
To password protect TensorBoard, you must set htpasswd in charts/maskrcnn-optimized/charts/jupyter/values.yaml to a quoted MD5 password hash.
To install the maskrcnn-optimized chart, execute:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/charts
helm install --debug maskrcnn-optimized ./maskrcnn-optimized/
Execute kubectl get pods -n kubeflow to see the status of the pods. Execute kubectl logs -f maskrcnn-launcher-xxxxx -n kubeflow to see the live training log from the launcher (replace xxxxx with your specific pod suffix). Model checkpoints and logs will be placed on the shared_fs file-system set in values.yaml, i.e. efs or fsx.
Customize and apply eks-cluster/attach-pvc.yaml if you need to attach to the efs or fsx shared file-system using a Kubernetes Pod.
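As a sketch, after customizing the manifest you might apply it and open a shell in the resulting Pod to browse the shared file-system; the Pod name attach-pvc below is an assumption, so substitute the name defined in your customized manifest:
kubectl apply -f eks-cluster/attach-pvc.yaml -n kubeflow
kubectl exec -it attach-pvc -n kubeflow -- /bin/bash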
Execute kubectl logs -f jupyter-xxxxx -n kubeflow -c jupyter to display Jupyter log. At the beginning of the Jupyter log, note the security token required to access Jupyter service in a browser.
Execute kubectl get services -n kubeflow to get the service DNS address. To test the trained model using a Jupyter notebook, access the service in a browser on port 443 using the service DNS and the security token. Your URL to access the Jupyter service should look similar to the example below:
https://xxxxxxxxxxxxxxxxxxxxxxxxx.elb.xx-xxxx-x.amazonaws.com:443/?token=<security token>
Because the service endpoint in this tutorial uses a self-signed certificate, accessing the Jupyter service in a browser will display a browser warning. If you deem it appropriate, proceed to access the service. Open the notebook, and run it to test the trained model. Note that there may not be any trained model checkpoint available at a given time, while training is in progress.
To access TensorBoard via web, use the service DNS address noted above. Your URL to access the TensorBoard service should look similar to the example below:
https://xxxxxxxxxxxxxxxxxxxxxxxxx.elb.xx-xxxx-x.amazonaws.com:6443/
Accessing TensorBoard service in a browser will display a browser warning, because the service endpoint uses a self-signed certificate. If you deem it appropriate, proceed to access the service. When prompted for authentication, use the default username tensorboard, and your password.
When training is complete, you may delete an installed chart by executing helm delete chart-name, for example helm delete maskrcnn. This will destroy all pods used in training and testing, including the TensorBoard and Jupyter service pods. However, the logs and trained models will be preserved on the shared file-system used for training. When you delete all the Helm charts, the Kubernetes Cluster Autoscaler may scale down the inference and training node groups to zero size.
When you are done with this tutorial, you can destroy all the infrastructure, including the shared EFS and FSx for Lustre file-systems. If you want to preserve the data on the shared file-systems, you may want to first upload it to Amazon S3.
If you used the quick start option above, execute the following commands:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster-and-nodegroup
terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'
If you used the advanced option above, execute the following commands:
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-nodegroup
terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var="node_role_arn=" -var="nodegroup_name=" -var="subnet_ids="
cd ~/amazon-eks-machine-learning-with-terraform-and-kubeflow/eks-cluster/terraform/aws-eks-cluster
terraform destroy -var="profile=default" -var="region=us-west-2" -var="cluster_name=my-eks-cluster" -var='azs=["us-west-2a","us-west-2b","us-west-2c"]'