
Part 3: Provision

What Will You Do

In this part of the self-paced exercise, you will provision an Amazon EKS cluster with a GPU node group based on a declarative cluster specification.


Step 1: Cluster Spec

  • Open Terminal (on macOS/Linux) or Command Prompt (on Windows) and navigate to the folder where you cloned your fork of the Git repository
  • Navigate to the folder "/getstarted/gpueks/cluster"

The "eks-gpu.yaml" file contains the declarative specification for our Amazon EKS Cluster.

Cluster Details

The spec uses the following values. Update them in the file if you changed any of these or used alternate names.

  • cluster name: "demo-gpueks"
  • cloud provider: "eks"
  • project: "default project"
  • AWS Region: us-west-1
  • One regular node group (with t3.large instance type)
  • One GPU node group (with g4dn.xlarge instance type)
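
Before applying the spec, it can help to confirm that the values listed above actually appear in your copy of the file. The following is a minimal sketch in Python, not part of the exercise repository. It does a plain text search and assumes nothing about the YAML schema. The function name "missing_values" and the "EXPECTED" list are illustrative; the strings themselves are taken from the list above.

```python
from pathlib import Path

# Values from the "Cluster Details" list; edit these if you customized the spec.
EXPECTED = ["demo-gpueks", "us-west-1", "t3.large", "g4dn.xlarge"]


def missing_values(spec_text, expected=EXPECTED):
    """Return the expected strings that do not occur anywhere in spec_text."""
    return [value for value in expected if value not in spec_text]


if __name__ == "__main__":
    # Assumes the script is run from the folder that contains eks-gpu.yaml.
    spec_path = Path("eks-gpu.yaml")
    if spec_path.exists():
        missing = missing_values(spec_path.read_text())
        print("missing values:", missing if missing else "none")
    else:
        print("eks-gpu.yaml not found in the current folder")
```

An empty result only means the strings are present somewhere in the file; it does not validate the spec itself.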

Step 2: Provision Cluster

  • Make sure you are still in the "cluster" subfolder
  • Run the following command
rctl apply -f eks-gpu.yaml

If there are no errors, you will be presented with a "Task ID" (the "taskset_id" field in the output) that you can use to check progress and status. Note that this step creates infrastructure in your AWS account and can take approximately 20-30 minutes to complete.

{
  "taskset_id": "z24zlmy",
  "operations": [
    {
      "operation": "NodegroupCreation",
      "resource_name": "gpu-nodegroup",
      "status": "PROVISION_TASK_STATUS_PENDING"
    },
    {
      "operation": "NodegroupCreation",
      "resource_name": "t3-nodegroup",
      "status": "PROVISION_TASK_STATUS_PENDING"
    },
    {
      "operation": "ClusterCreation",
      "resource_name": "demo-gpueks",
      "status": "PROVISION_TASK_STATUS_PENDING"
    }
  ],
  "comments": "The status of the operations can be fetched using taskset_id",
  "status": "PROVISION_TASKSET_STATUS_PENDING"
}
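
If you save this response, a short script can summarize which operations are still pending. The following Python sketch is an illustration only: the function names are hypothetical, the sample is the response above trimmed to the fields used, and the only status values it knows about are the "..._PENDING" ones shown here.

```python
import json
from collections import Counter

# Sample response copied from the output above, trimmed to the fields used here.
SAMPLE = """
{
  "taskset_id": "z24zlmy",
  "operations": [
    {"operation": "NodegroupCreation", "resource_name": "gpu-nodegroup",
     "status": "PROVISION_TASK_STATUS_PENDING"},
    {"operation": "NodegroupCreation", "resource_name": "t3-nodegroup",
     "status": "PROVISION_TASK_STATUS_PENDING"},
    {"operation": "ClusterCreation", "resource_name": "demo-gpueks",
     "status": "PROVISION_TASK_STATUS_PENDING"}
  ],
  "status": "PROVISION_TASKSET_STATUS_PENDING"
}
"""


def summarize_taskset(raw):
    """Return (taskset_id, overall status, number of operations per status)."""
    data = json.loads(raw)
    per_status = Counter(op["status"] for op in data.get("operations", []))
    return data["taskset_id"], data["status"], dict(per_status)


def pending_resources(raw):
    """Return the resource names whose status still ends in _PENDING."""
    data = json.loads(raw)
    return [
        op["resource_name"]
        for op in data.get("operations", [])
        if op["status"].endswith("_PENDING")
    ]


if __name__ == "__main__":
    print(summarize_taskset(SAMPLE))
    print(pending_resources(SAMPLE))
```
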
  • Navigate to the project in your Org
  • Click on Infrastructure -> Clusters. You should see something like the following:

Provisioning in Process

  • Click on the cluster name to monitor progress

Provisioning in Process


Step 3: Verify Cluster

Once provisioning is complete, you should see a healthy cluster in the web console.

Provisioned Cluster

  • Click on the kubectl link and type the following command
kubectl get nodes

You should see something like the following

NAME                                            STATUS   ROLES    AGE     VERSION
ip-192-168-139-38.us-west-1.compute.internal    Ready    <none>   5m28s   v1.21.5-eks-bc4871b
ip-192-168-168-241.us-west-1.compute.internal   Ready    <none>   5m28s   v1.21.5-eks-bc4871b
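
The same check can be done programmatically by scanning the STATUS column of this output. The Python sketch below is a minimal illustration: the function name "nodes_not_ready" is hypothetical, and the sample text is the output shown above. It treats any status other than exactly "Ready" as a problem.

```python
# Output of "kubectl get nodes" as shown above.
SAMPLE = """\
NAME                                            STATUS   ROLES    AGE     VERSION
ip-192-168-139-38.us-west-1.compute.internal    Ready    <none>   5m28s   v1.21.5-eks-bc4871b
ip-192-168-168-241.us-west-1.compute.internal   Ready    <none>   5m28s   v1.21.5-eks-bc4871b
"""


def nodes_not_ready(table_text):
    """Return names of nodes whose STATUS column is anything other than 'Ready'."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    not_ready = []
    for line in lines[1:]:  # skip the header row
        columns = line.split()
        name, status = columns[0], columns[1]
        if status != "Ready":
            not_ready.append(name)
    return not_ready


if __name__ == "__main__":
    problems = nodes_not_ready(SAMPLE)
    print("all nodes Ready" if not problems else f"not ready: {problems}")
```

To use it against a live cluster, pass the captured text of "kubectl get nodes" in place of SAMPLE.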

Step 4: Verify GPU Operator

Now, let us verify that the NVIDIA GPU Operator's resources are operational on the EKS cluster.

  • Click on the kubectl link and type the following command
kubectl get po -n nvidia

You should see something like the following

NAME                                                          READY   STATUS    RESTARTS   AGE
gpu-operator-857dbb9945-jfnqk                                 1/1     Running   0          4h21m
gpu-operator-node-feature-discovery-master-6f698554df-jvht5   1/1     Running   0          4h21m
gpu-operator-node-feature-discovery-worker-9pw8c              1/1     Running   0          118m
gpu-operator-node-feature-discovery-worker-bzvt5              1/1     Running   0          9m56s
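
A quick way to confirm that every pod in this listing is healthy is to compare the READY and STATUS columns. The following Python sketch is illustrative only: "unhealthy_pods" is a hypothetical helper, and the sample is the output shown above. It flags any pod that is not "Running" or whose ready count is below its container total, so pods in other states (for example, completed jobs) would also be reported.

```python
# Output of "kubectl get po -n nvidia" as shown above.
SAMPLE = """\
NAME                                                          READY   STATUS    RESTARTS   AGE
gpu-operator-857dbb9945-jfnqk                                 1/1     Running   0          4h21m
gpu-operator-node-feature-discovery-master-6f698554df-jvht5   1/1     Running   0          4h21m
gpu-operator-node-feature-discovery-worker-9pw8c              1/1     Running   0          118m
gpu-operator-node-feature-discovery-worker-bzvt5              1/1     Running   0          9m56s
"""


def unhealthy_pods(table_text):
    """Return names of pods that are not Running or have containers not ready."""
    rows = [line.split() for line in table_text.splitlines() if line.strip()]
    unhealthy = []
    for name, ready, status, *_ in rows[1:]:  # skip the header row
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            unhealthy.append(name)
    return unhealthy


if __name__ == "__main__":
    problems = unhealthy_pods(SAMPLE)
    print("all pods healthy" if not problems else f"unhealthy: {problems}")
```
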

The GPU Operator will automatically add the required labels to the GPU-enabled worker nodes.

  • Click on nodes and expand the node that belongs to the "gpu" node group

GPU Node
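
The labels can also be inspected from the command line by filtering the JSON produced by "kubectl get nodes -o json". The Python sketch below is a hedged illustration: it assumes the GPU-related labels use the "nvidia.com/" prefix, and the sample data (node names and the "nvidia.com/gpu.present" label) is a hypothetical stand-in, not output from this cluster. The exact set of labels depends on the GPU Operator version.

```python
import json

# Hypothetical, trimmed-down stand-in for "kubectl get nodes -o json" output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node", "labels": {
            "kubernetes.io/os": "linux",
            "nvidia.com/gpu.present": "true",
        }}},
        {"metadata": {"name": "cpu-node", "labels": {
            "kubernetes.io/os": "linux",
        }}},
    ]
})


def gpu_labels_by_node(nodes_json, prefix="nvidia.com/"):
    """Map each node name to its labels starting with prefix; skip nodes with none."""
    result = {}
    for item in json.loads(nodes_json).get("items", []):
        meta = item.get("metadata", {})
        labels = {
            key: value
            for key, value in meta.get("labels", {}).items()
            if key.startswith(prefix)
        }
        if labels:
            result[meta.get("name")] = labels
    return result


if __name__ == "__main__":
    print(gpu_labels_by_node(SAMPLE))
```

Only nodes in the GPU node group are expected to appear in the result; a node from the regular node group should carry no labels with this prefix.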


Recap

Congratulations! At this point, you have successfully configured and provisioned an Amazon EKS cluster with a GPU node group in your AWS account using the RCTL CLI. You are now ready to move on to the next part, where you will deploy a "GPU Workload" and review the integrated "GPU Dashboards".