Part 3: Provision
What Will You Do
In this part of the self-paced exercise, you will provision an Amazon EKS cluster with a GPU node group based on a declarative cluster specification.
Step 1: Cluster Spec
- Open Terminal (on macOS/Linux) or Command Prompt (Windows) and navigate to the folder where you forked the Git repository
- Navigate to the folder "/getstarted/gpueks/cluster"
The "eks-gpu.yaml" file contains the declarative specification for our Amazon EKS Cluster.
Cluster Details
The spec uses the following values. Update them in the file if you changed any of these settings or used different names.
- cluster name: "demo-gpueks"
- cloud provider: "eks"
- project: "default project"
- AWS Region: us-west-1
- One regular node group (with t3.large instance type)
- One GPU node group (with g4dn.xlarge instance type)
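One quick way to confirm that the spec still matches the values listed above is to search the file for them before provisioning. The following Python sketch does this with a plain text search, so it makes no assumptions about the YAML key names used in "eks-gpu.yaml". The helper names (`find_values`, `missing_values`) are hypothetical and are not part of RCTL.

```python
from pathlib import Path

# Values listed under "Cluster Details"; edit these if you customized the spec.
EXPECTED_VALUES = ["demo-gpueks", "us-west-1", "t3.large", "g4dn.xlarge"]


def find_values(spec_text, values=EXPECTED_VALUES):
    """Map each expected value to the 1-based line numbers where it appears."""
    hits = {value: [] for value in values}
    for number, line in enumerate(spec_text.splitlines(), start=1):
        for value in values:
            if value in line:
                hits[value].append(number)
    return hits


def missing_values(spec_text, values=EXPECTED_VALUES):
    """Return the expected values that do not appear anywhere in the spec."""
    hits = find_values(spec_text, values)
    return [value for value in values if not hits[value]]


if __name__ == "__main__":
    spec_path = Path("eks-gpu.yaml")
    if spec_path.exists():
        for value, lines in find_values(spec_path.read_text()).items():
            print(f"{value}: {lines if lines else 'NOT FOUND'}")
    else:
        print("eks-gpu.yaml not found; run this from the cluster folder")
```

Run it from the "cluster" folder. Any value reported as NOT FOUND is either a setting you customized or one that needs attention before you apply the spec.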
Step 2: Provision Cluster
- Navigate to the "cluster" sub folder
- Type the command
rctl apply -f eks-gpu.yaml
If there are no errors, the command returns a "Task ID" (the "taskset_id" field in the response) that you can use to check progress and status. Note that this step creates infrastructure in your AWS account and can take approximately 20-30 minutes to complete.
{
  "taskset_id": "z24zlmy",
  "operations": [
    {
      "operation": "NodegroupCreation",
      "resource_name": "gpu-nodegroup",
      "status": "PROVISION_TASK_STATUS_PENDING"
    },
    {
      "operation": "NodegroupCreation",
      "resource_name": "t3-nodegroup",
      "status": "PROVISION_TASK_STATUS_PENDING"
    },
    {
      "operation": "ClusterCreation",
      "resource_name": "demo-gpueks",
      "status": "PROVISION_TASK_STATUS_PENDING"
    }
  ],
  "comments": "The status of the operations can be fetched using taskset_id",
  "status": "PROVISION_TASKSET_STATUS_PENDING"
}
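If you save this response (for example, by redirecting the command output to a file), a short script can summarize it. The sketch below is a minimal Python example that reads the JSON shown above and reports the taskset ID, the overall status, and which operations are still pending. The function name `summarize` is hypothetical, and the only status value assumed is the `..._PENDING` one that appears in the sample; other status names are not documented here.

```python
import json
from collections import Counter

# Sample response copied from the RCTL output shown in this step.
SAMPLE_RESPONSE = """
{
  "taskset_id": "z24zlmy",
  "operations": [
    {"operation": "NodegroupCreation", "resource_name": "gpu-nodegroup",
     "status": "PROVISION_TASK_STATUS_PENDING"},
    {"operation": "NodegroupCreation", "resource_name": "t3-nodegroup",
     "status": "PROVISION_TASK_STATUS_PENDING"},
    {"operation": "ClusterCreation", "resource_name": "demo-gpueks",
     "status": "PROVISION_TASK_STATUS_PENDING"}
  ],
  "comments": "The status of the operations can be fetched using taskset_id",
  "status": "PROVISION_TASKSET_STATUS_PENDING"
}
"""


def summarize(response_text):
    """Summarize a taskset response: ID, overall status, per-status counts."""
    data = json.loads(response_text)
    operations = data.get("operations", [])
    counts = Counter(op["status"] for op in operations)
    # Only the PENDING suffix is visible in the sample output, so that is
    # the only status this sketch singles out.
    pending = [
        op["resource_name"]
        for op in operations
        if op["status"].endswith("_PENDING")
    ]
    return {
        "taskset_id": data["taskset_id"],
        "taskset_status": data["status"],
        "counts": dict(counts),
        "pending": pending,
    }


if __name__ == "__main__":
    print(json.dumps(summarize(SAMPLE_RESPONSE), indent=2))
```

For the sample response, all three operations (two node groups and the cluster itself) are reported as pending, which is expected immediately after the apply command returns.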
- Navigate to the project in your Org
- Click on Infrastructure -> Clusters. You should see the cluster being provisioned in the list
- Click on the cluster name to monitor progress
Step 3: Verify Cluster
Once provisioning is complete, you should see a healthy cluster in the web console.
- Click on the kubectl link and type the following command
kubectl get nodes
You should see something like the following
NAME                                            STATUS   ROLES    AGE     VERSION
ip-192-168-139-38.us-west-1.compute.internal    Ready    <none>   5m28s   v1.21.5-eks-bc4871b
ip-192-168-168-241.us-west-1.compute.internal   Ready    <none>   5m28s   v1.21.5-eks-bc4871b
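To check this output in a script rather than by eye, you can parse the table and flag any node that is not Ready. The following Python sketch assumes the default `kubectl get nodes` column layout shown above; the function names are hypothetical.

```python
# Sample output copied from the "kubectl get nodes" step.
SAMPLE_NODES = """\
NAME STATUS ROLES AGE VERSION
ip-192-168-139-38.us-west-1.compute.internal Ready <none> 5m28s v1.21.5-eks-bc4871b
ip-192-168-168-241.us-west-1.compute.internal Ready <none> 5m28s v1.21.5-eks-bc4871b
"""


def parse_nodes(table_text):
    """Parse `kubectl get nodes` text output into one dict per node."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def not_ready(nodes):
    """Return names of nodes whose STATUS does not include "Ready"."""
    # STATUS can be a comma-separated list such as "Ready,SchedulingDisabled",
    # so compare against the individual entries instead of the whole string.
    return [
        node["NAME"]
        for node in nodes
        if "Ready" not in node.get("STATUS", "").split(",")
    ]


if __name__ == "__main__":
    nodes = parse_nodes(SAMPLE_NODES)
    print(f"{len(nodes)} nodes, not ready: {not_ready(nodes)}")
```

To use it against a live cluster, replace `SAMPLE_NODES` with the text captured from `kubectl get nodes`.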
Step 4: Verify GPU Operator
Now, let us verify whether the NVIDIA GPU Operator's resources are operational on the EKS cluster.
- Click on the kubectl link and type the following command
kubectl get po -n nvidia
You should see something like the following
NAME                                                          READY   STATUS    RESTARTS   AGE
gpu-operator-857dbb9945-jfnqk                                 1/1     Running   0          4h21m
gpu-operator-node-feature-discovery-master-6f698554df-jvht5   1/1     Running   0          4h21m
gpu-operator-node-feature-discovery-worker-9pw8c              1/1     Running   0          118m
gpu-operator-node-feature-discovery-worker-bzvt5              1/1     Running   0          9m56s
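A pod is healthy here when its STATUS is Running and the two numbers in the READY column match. The Python sketch below applies that rule to the output above. It is an illustrative helper with a hypothetical name, and it treats pods in the Completed state as acceptable, on the assumption that any one-shot pods in the namespace have simply finished.

```python
# Sample output copied from the "kubectl get po -n nvidia" step.
SAMPLE_PODS = """\
NAME READY STATUS RESTARTS AGE
gpu-operator-857dbb9945-jfnqk 1/1 Running 0 4h21m
gpu-operator-node-feature-discovery-master-6f698554df-jvht5 1/1 Running 0 4h21m
gpu-operator-node-feature-discovery-worker-9pw8c 1/1 Running 0 118m
gpu-operator-node-feature-discovery-worker-bzvt5 1/1 Running 0 9m56s
"""


def unhealthy_pods(table_text):
    """Return names of pods that are not Running with all containers ready."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    problems = []
    for line in lines[1:]:
        # Only the first three columns (NAME, READY, STATUS) are used, so
        # extra tokens in later columns do not affect the result.
        row = dict(zip(header, line.split()))
        if row["STATUS"] == "Completed":
            continue  # assumption: finished one-shot pods are not a problem
        ready, total = row["READY"].split("/")
        if row["STATUS"] != "Running" or ready != total:
            problems.append(row["NAME"])
    return problems


if __name__ == "__main__":
    print("unhealthy pods:", unhealthy_pods(SAMPLE_PODS))
```

An empty list means every pod in the captured output passed the check.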
The GPU Operator will automatically add "required labels" to the GPU enabled worker nodes.
- Click on nodes and expand the node that belongs to the "gpu" node group
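The same check can be done from the command line by inspecting node labels in the JSON form of the node list (`kubectl get nodes -o json`). The Python sketch below filters labels by prefix. It assumes that the labels added for GPU nodes start with `nvidia.com/`; this guide does not list the exact label names, so the node names and label values in the sample are purely illustrative.

```python
import json

# Illustrative stand-in for `kubectl get nodes -o json` output.
# Node names and label values are hypothetical, not taken from a real cluster.
SAMPLE_NODES_JSON = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-example",
                      "labels": {"nvidia.com/gpu.present": "true",
                                 "kubernetes.io/os": "linux"}}},
        {"metadata": {"name": "cpu-node-example",
                      "labels": {"kubernetes.io/os": "linux"}}},
    ]
})


def gpu_labels_by_node(nodes_json_text, prefix="nvidia.com/"):
    """Map node name to its labels that start with the given prefix."""
    data = json.loads(nodes_json_text)
    result = {}
    for item in data.get("items", []):
        metadata = item.get("metadata", {})
        labels = metadata.get("labels", {})
        matched = {k: v for k, v in labels.items() if k.startswith(prefix)}
        if matched:  # nodes without matching labels are omitted
            result[metadata["name"]] = matched
    return result


if __name__ == "__main__":
    print(json.dumps(gpu_labels_by_node(SAMPLE_NODES_JSON), indent=2))
```

With real cluster data, only the nodes in the "gpu" node group would be expected to appear in the result.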
Recap
Congratulations! At this point, you have successfully configured and provisioned an Amazon EKS cluster with a GPU node group in your AWS account using the RCTL CLI. You are now ready to move on to the next step, where you will deploy a "GPU Workload" and review the integrated "GPU Dashboards".