
Compute Cluster

Overview

A compute cluster is a Kubernetes cluster that forms the data plane for GenAI and serverless inference workloads. Multiple models can be deployed and operated concurrently on a compute cluster.

The compute cluster is imported into the system using a generated bootstrap YAML configuration, which initializes all required GenAI and serverless inference components.

Lifecycle management of the underlying Kubernetes cluster and associated infrastructure is not handled by the GenAI or serverless inference solution and must be managed independently.

The data plane works on CNCF-conformant Kubernetes clusters and has been extensively validated with MKS Kubernetes clusters.

Compute cluster initialization for GenAI workloads is currently supported only on MKS clusters. Enablement for additional cluster types is planned in future releases.


Prerequisites

Before initializing a compute cluster, the following conditions must be met on the target Kubernetes cluster:

Cluster Connectivity

  • The cluster must be connected to the controller.

GPU Enablement

  • Worker nodes must be GPU-backed.
  • A GPU blueprint must be applied so that the NVIDIA GPU Operator is deployed.

The following pods must be in a Running state in the gpu-operator namespace:

  • gpu-feature-discovery-*
  • nvidia-container-toolkit-daemonset-*
  • nvidia-cuda-validator-*
  • nvidia-dcgm-exporter-*
  • nvidia-device-plugin-daemonset-*
  • nvidia-operator-validator-*
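The checks above can be run from the command line. This is a sketch against a live cluster; pod name suffixes vary per installation, and the `nvidia.com/gpu` resource name assumes the standard NVIDIA device plugin:

```shell
# List GPU Operator pods; the pods named above should report Running
# once the GPU blueprint has been applied.
kubectl get pods -n gpu-operator

# Confirm worker nodes advertise GPU capacity via the device plugin
# (the dot in nvidia.com/gpu must be escaped in custom-columns):
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```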

Storage Backend

At least one of the following storage solutions must be running on GPU-backed nodes:

  • Rook Ceph
  • OpenEBS
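The storage backend can be verified with a quick pod listing. The namespaces below assume default Rook Ceph and OpenEBS installations; adjust them if the backend was installed elsewhere:

```shell
# Check for a running storage backend (default namespaces assumed):
kubectl get pods -n rook-ceph
kubectl get pods -n openebs
```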

Once these prerequisites are met, the compute cluster can be initialized.


Add a Compute Cluster

  1. Navigate to Operations Console → GenAI → Compute Clusters
  2. Click New Compute Cluster
  3. Enter a Name and an optional Description
  4. Keep the Type set to Import
  5. Select Save Changes


Saving the configuration creates the compute cluster entry and sets the status to Waiting for initialization. A Download YAML Config option is displayed along with a kubectl apply command.

Example:

kubectl apply -f <compute-name>-compute-bootstrap.yaml



Bootstrap YAML Initialization

Applying the bootstrap YAML on the target Kubernetes cluster initializes the compute cluster and deploys all required GenAI and serverless inference components.

Initialization begins with the gaap-syncer, which sequentially brings up the remaining services.
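The startup sequence can be observed while initialization runs. This assumes the components land in the `gaap-controller` namespace described below:

```shell
# Watch GenAI components come up after applying the bootstrap YAML;
# gaap-syncer should appear first, followed by the remaining services.
kubectl get pods -n gaap-controller -w
```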

Deployed Namespaces and Components

gaap-controller

This namespace contains the core GenAI control-plane and gateway components, including:

  • gaap-syncer
  • gaap-operator
  • gaap-metrics
  • gaap-dbp
  • gaap-data-gateway
  • ai-gateway-controller-*
  • envoy-gateway-*
  • envoy-gap-controller-*
  • envoy-ratelimit-*


monitoring

This namespace provides monitoring and observability components, including:

  • gaapmon-blackbox-exporter-*
  • gaapmon-k8s-state-metrics-*
  • gaapmon-node-exporter-*
  • prometheus-gaapmon-prometheus-*


When all components in both namespaces are in a Running state, the compute cluster is fully initialized and ready for model deployment.
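Readiness of both namespaces can be checked in one pass. A minimal sketch: the field selector filters out Running pods, so empty output for a namespace means all of its pods are up:

```shell
# Print any pod in either namespace that is not yet Running;
# no output (besides the headers) means the cluster is fully initialized.
for ns in gaap-controller monitoring; do
  echo "== $ns =="
  kubectl get pods -n "$ns" --field-selector=status.phase!=Running
done
```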

Capacity from multiple compute clusters can be aggregated into a unified inventory and used across multiple model deployments.


List Compute Clusters

Navigate to Operations Console → GenAI → Compute Clusters to view all registered compute clusters available for model deployments.


Nodes

The Nodes tab provides visibility into node-level resource utilization within the compute cluster.

Each node entry displays current capacity and usage details for the following resources:

  • CPU: Total cores, used cores, usage percentage, and available capacity
  • Memory: Total memory, used memory, usage percentage, and available capacity
  • GPU: Total GPU units, used units, usage percentage, and available units

This view reflects real-time resource consumption across nodes that make up the compute cluster. The available capacity shown here represents the resources that can be allocated for GenAI and serverless inference model deployments.
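Comparable node-level figures are available from the command line. This assumes the Kubernetes Metrics Server is installed on the cluster; without it, `kubectl top` returns an error:

```shell
# Per-node CPU and memory usage (requires the Metrics Server):
kubectl top nodes

# Per-node allocated resources, including nvidia.com/gpu where requested:
kubectl describe nodes | grep -A 8 'Allocated resources'
```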

The Nodes view is read-only and is intended for monitoring resource distribution and utilization across the compute cluster.


K8s Resources

The K8s Resources tab provides an aggregated view of Kubernetes resource utilization for the compute cluster.

This view summarizes cluster-wide resource capacity and consumption across the following resource types:

CPU Resources

  • Total CPU: The total number of CPU cores available in the compute cluster
  • Allocated CPU: The number of CPU cores currently allocated to workloads
  • Available CPU: The remaining CPU capacity available for allocation
  • Utilization: Percentage indicating overall CPU usage across the cluster

Memory Resources

  • Total Memory: The total memory capacity available in the compute cluster
  • Allocated Memory: The amount of memory currently allocated to workloads
  • Available Memory: The remaining memory capacity available for allocation
  • Utilization: Percentage indicating overall memory usage across the cluster

The K8s Resources view reflects the combined resource consumption across all nodes in the compute cluster and provides a high-level snapshot of cluster capacity. This information helps assess overall resource availability for current and future model deployments.
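The underlying capacity figures can also be read directly from the node objects. A sketch using jsonpath, which prints allocatable CPU and memory per node (the values the cluster-wide totals are derived from):

```shell
# Allocatable CPU and memory per node, tab-separated:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\t"}{.status.allocatable.memory}{"\n"}{end}'
```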

The K8s Resources view is read-only and intended for monitoring cluster-level resource utilization.



Delete Compute Cluster

Deleting a compute cluster removes the GenAI and serverless inference operator resources from the cluster and prevents the cluster from being used for subsequent model deployments.


Deleting a compute cluster does not deprovision the underlying Kubernetes cluster or associated infrastructure, which must be handled separately.