Cluster Add-Ons

In this step, you will configure critical software add-ons, package them into a Rafay cluster blueprint, and deploy the blueprint to the Kubernetes cluster that you provisioned or imported into your Rafay Org. With all the add-ons configured, the shared, multi-tenant host cluster will look like the following.

Cluster Add-ons

Software Add-On              Primary Purpose
Rafay Management Operator    Centralized management, policy enforcement, and zero-trust kubectl
Local Storage                Storage for compute instances
Ingress Nginx                Ingress Nginx Ingress Controller to expose web applications such as Jupyter notebooks
Nvidia GPU Operator          Software driver for Nvidia GPU hardware. We will use the GPU Simulator as an alternative if necessary
Rafay Monitoring             Centralized metrics and dashboards
Isolation                    Rafay's Admission Controller for Kata, providing microVM-based isolation

GPU Operator

The GPU Operator automates the deployment, configuration, and management of GPU drivers, runtime libraries, and monitoring components on GPU-enabled nodes. It simplifies GPU operations by ensuring that the necessary software stack (such as NVIDIA drivers, CUDA toolkit, and device plugins) is correctly set up within the cluster.

We will deploy this software add-on as part of a Rafay cluster blueprint. The steps differ depending on whether the cluster has a real GPU: follow the Real GPU section if it does, or the GPU Simulator section if it does not.


Real GPU

Important

Follow these steps ONLY if your cluster has real GPUs attached to it.

Step 1: Create Namespace for GPU Operator

In this step, you will create the namespace on the cluster in which the Nvidia GPU Operator will be installed.

  • Under Infrastructure, select Namespaces and create a new namespace with name gpu-operator-resources.

Create Namespace

  • Click Save and go to placement.

Placement

  • Select the target cluster from the list of available clusters and click Save and go to publish.

Select Cluster

  • Publish the namespace and make sure that it gets published successfully in the target cluster before moving to the next step.

Publish Namespace


Step 2: Create Nvidia GPU Operator Add-on

In this step, you will create an add-on from the system catalog for the GPU Operator.

  • Navigate to Infrastructure -> Add-Ons
  • Click New Add-On -> Create New Add-On from Catalog
  • Search for gpu-operator and select the card
  • Click Create Add-On
  • Enter a name for the add-on and select the previously created namespace
  • Click Create

addon

Info

Because we have only a single GPU and want to allocate it to multiple users, we will oversubscribe it using time slicing. As you can see from the YAML spec below, the single GPU will show up as 4 GPUs in our cluster.

Time Slicing

  • Enter a version name
  • Create a YAML file named values.yaml with the following content:
---
devicePlugin:
  config:
    create: true
    name: "time-slice-config"
    default: "time-slice-4"
    data:
      time-slice-4: |-
        version: v1
        flags:
          migStrategy: none
        sharing:
          timeSlicing:
            renameByDefault: true
            failRequestsGreaterThanOne: true
            resources:
            - name: nvidia.com/gpu
              replicas: 4
  • Upload the values.yaml file to the add-on
  • Click Save Changes

addon


Step 3: Create Custom Blueprint

In this step, you will create a custom cluster blueprint which contains the GPU Operator add-on.

  • Navigate to Infrastructure -> Blueprints
  • Click New Blueprint
  • Enter a name for the blueprint
  • Click Save

blueprint

  • Enter a version name
  • Select Minimal for the default blueprint
  • Click Configure Add-Ons
  • Add the previously created add-on
  • Click Save Changes
  • Enable Local Storage and Monitoring & Alerting under Managed System Add-ons
  • Click Save Changes

blueprint


Step 4: Apply Blueprint

In this step, you will apply the blueprint to the cluster.

  • Navigate to Infrastructure -> Clusters
  • Click on the gear icon on the cluster card and select Update Blueprint
  • Select the previously created blueprint and click Save and Publish

blueprint

After a few minutes, the blueprint will be applied to the cluster.

blueprint

By navigating to the cluster card, you will see the GPUs detected by the GPU Operator add-on.

blueprint

Note

Notice that the cluster is reporting 4 GPUs although we have only one real GPU. We can now allocate these dynamically to end users when they request compute instances.
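
To check that time slicing behaves as expected, you could schedule a small test pod that requests one GPU replica. The manifest below is a minimal sketch and not part of the Rafay workflow. The pod name and the container image tag are assumptions. The resource name nvidia.com/gpu.shared is also an assumption, based on the NVIDIA device plugin's documented behavior of advertising time-sliced resources with a .shared suffix when renameByDefault is true. If your node advertises plain nvidia.com/gpu instead, request that name.

```yaml
# Hypothetical test pod; name and image tag are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-slice-check
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag
      command: ["nvidia-smi"]                      # prints the GPU visible to the container
      resources:
        limits:
          # Assumes renameByDefault: true exposes the resource with a .shared suffix.
          nvidia.com/gpu.shared: 1
```

Because failRequestsGreaterThanOne is set to true in the values.yaml file, a container that asks for more than one replica is expected to be rejected rather than receive a larger share of the GPU.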


GPU Simulator

Important

Follow these steps ONLY if your cluster does not have real GPUs.

Step 1: Create Namespace for GPU Simulator

In this step, you will create the namespace on the cluster in which the GPU Simulator will be installed.

  • Under Infrastructure, select Namespaces and create a new namespace with name gpu-operator-resources.

Create Namespace

  • Click Save and go to placement.

Placement

  • Select the target cluster from the list of available clusters and click Save and go to publish.

Select Cluster

  • Publish the namespace and make sure that it gets published successfully in the target cluster before moving to the next step.

Publish Namespace


Step 2: Create GPU Simulator Add-on

In this step, you will create an add-on from the system catalog for the GPU simulator and configure it to present a large number of GPUs.

  • Navigate to Infrastructure -> Add-Ons
  • Click New Add-On -> Create New Add-On from Catalog
  • Search for fake-gpu-operator and select the card
  • Click Create Add-On
  • Enter a name for the add-on and select the previously created namespace
  • Click Create

addon

  • Enter a version name
  • Create a YAML file named values.yaml with the following content:
topology:
  # nodePools is a map of node pool name to node pool configuration.
  # Nodes are assigned to node pools based on the node pool label's value (key is configurable via nodePoolLabelKey).
  # 
  # For example, nodes that have the label "run.ai/simulated-gpu-node-pool: default"
  # will be assigned to the "default" node pool.
  nodePools:
    A100:
      gpuProduct: NVIDIA-A100
      gpuCount: 8
    H100:
      gpuProduct: NVIDIA-H100
      gpuCount: 8
    T400:
      gpuProduct: NVIDIA-T400
      gpuCount: 8  
  • Upload the values.yaml file to the add-on
  • Click Save Changes

addon


Step 3: Assign GPUs to Nodes

You can assign simulated GPUs to a node by labeling it with the name of a node pool. Run the following command, replacing the node name and the node pool name with your own values. Because a node holds only one value for a given label key, the single node in our cluster can belong to only one node pool at a time.

kubectl label node <node-name> run.ai/simulated-gpu-node-pool=<node-pool-name>

Note

The node pool names are defined in the values.yaml file used when creating the add-on. The options in the provided values.yaml file are A100, H100 and T400. Additional node pools can be added to the values.yaml file as needed.
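
As a sketch of what adding a node pool might look like, the fragment below extends the nodePools map with one more entry. The pool name L40S, its gpuProduct value, and the gpuCount of 4 are hypothetical and chosen only for illustration. The A100 entry is copied unchanged from the provided file.

```yaml
topology:
  nodePools:
    A100:
      gpuProduct: NVIDIA-A100
      gpuCount: 8
    # Hypothetical additional pool; name, product, and count are illustrative only.
    L40S:
      gpuProduct: NVIDIA-L40S
      gpuCount: 4
```

A node would join this pool when labeled with run.ai/simulated-gpu-node-pool=L40S. Note that kubectl label refuses to change a label that is already set unless you pass the --overwrite flag, so moving a node from one pool to another requires that flag.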


Step 4: Create Custom Blueprint

In this step, you will create a custom cluster blueprint which contains the GPU simulator add-on.

  • Navigate to Infrastructure -> Blueprints
  • Click New Blueprint
  • Enter a name for the blueprint
  • Click Save

blueprint

  • Enter a version name
  • Select Minimal for the default blueprint
  • Click Configure Add-Ons
  • Add the previously created add-on
  • Click Save Changes
  • Enable Local Storage and Monitoring & Alerting under Managed System Add-ons
  • Click Save Changes

blueprint


Step 5: Apply Blueprint

In this step, you will apply the blueprint to the cluster.

  • Navigate to Infrastructure -> Clusters
  • Click on the gear icon on the cluster card and select Update Blueprint
  • Select the previously created blueprint and click Save and Publish

blueprint

After a few minutes, the blueprint will be applied to the cluster.

blueprint

By navigating to the cluster card, you will see the simulated GPUs.

blueprint

Notice that the cluster is reporting 8 simulated GPUs. We can now allocate these dynamically to end users when they request compute instances.
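
To confirm that the scheduler can place workloads on the simulated GPUs, you could submit a test pod that requests one of them. This is a minimal sketch: the pod name, image, and command are assumptions, and it presumes the simulator advertises the standard nvidia.com/gpu resource on labeled nodes. Since no physical GPU is present, the container only sleeps and does not run GPU code.

```yaml
# Hypothetical test pod; name, image, and command are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: fake-gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: sleeper
      image: busybox:1.36
      command: ["sleep", "3600"]   # no GPU work; only holds the allocation
      resources:
        limits:
          # Assumes the simulator exposes the standard NVIDIA resource name.
          nvidia.com/gpu: 1
```

If the pod reaches the Running state, the node's simulated GPU capacity is being honored by the scheduler. A pod that stays Pending may indicate that the node has not been labeled with a node pool.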