Compute Cluster
Overview¶
A compute cluster is a Kubernetes cluster that forms the data plane for GenAI and serverless inference workloads. Multiple models can be deployed and operated concurrently on a compute cluster.
The compute cluster is imported into the system using a generated bootstrap YAML configuration, which initializes all required GenAI and serverless inference components.
Lifecycle management of the underlying Kubernetes cluster and associated infrastructure is not handled by the GenAI or serverless inference solution and must be managed independently.
The data plane works on CNCF-conformant Kubernetes clusters and has been extensively validated with MKS Kubernetes clusters.
Compute cluster initialization for GenAI workloads is currently supported only on MKS clusters. Enablement for additional cluster types is planned in future releases.
Prerequisites¶
Before initializing a compute cluster, the following conditions must be met on the target Kubernetes cluster:
Cluster Connectivity¶
- The cluster must be connected to the controller.
GPU Enablement¶
- Worker nodes must be GPU-backed.
- A GPU blueprint must be applied so that the NVIDIA GPU Operator is deployed.
The following pods must be in a Running state in the gpu-operator namespace:
- gpu-feature-discovery-*
- nvidia-container-toolkit-daemonset-*
- nvidia-cuda-validator-*
- nvidia-dcgm-exporter-*
- nvidia-device-plugin-daemonset-*
- nvidia-operator-validator-*
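Pod status in the GPU Operator namespace can be verified with kubectl. A minimal sketch, using the `gpu-operator` namespace named above:

```shell
# List GPU Operator pods and confirm their status
kubectl get pods -n gpu-operator

# Optional: print any pod that is not in the Running state and exit
# non-zero if one is found (column 3 of the default output is STATUS)
kubectl get pods -n gpu-operator --no-headers \
  | awk '$3 != "Running" {print; bad=1} END {exit bad}'
```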
Storage Backend¶
At least one of the following storage solutions must be running on GPU-backed nodes:
- Rook Ceph
- OpenEBS
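A quick way to confirm the storage backend is running is to list its pods and the available storage classes. The namespace names below are the common installation defaults, not something this product mandates, and may differ in your environment:

```shell
# Check whichever backend is installed (default namespaces shown)
kubectl get pods -n rook-ceph   # Rook Ceph
kubectl get pods -n openebs     # OpenEBS

# A StorageClass must also exist for workloads to claim volumes
kubectl get storageclass
```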
Once these prerequisites are met, the compute cluster can be initialized.
Add a Compute Cluster¶
- Navigate to Operations Console → GenAI → Compute Clusters
- Click New Compute Cluster
- Enter a Name and an optional Description
- Keep the Type set to Import
- Click Save Changes
Saving the configuration creates the compute cluster entry and sets the status to Waiting for initialization. A Download YAML Config option is displayed along with a kubectl apply command.
Example:
kubectl apply -f <compute-name>-compute-bootstrap.yaml
Bootstrap YAML Initialization¶
Applying the bootstrap YAML on the target Kubernetes cluster initializes the compute cluster and deploys all required GenAI and serverless inference components.
Initialization begins with the gaap-syncer, which sequentially brings up the remaining services.
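Initialization progress can be followed by watching pods in the control-plane namespace; the gaap-syncer pod should appear first, followed by the remaining services:

```shell
# Watch GenAI components come up in the gaap-controller namespace
kubectl get pods -n gaap-controller -w
```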
Deployed Namespaces and Components¶
gaap-controller¶
This namespace contains the core GenAI control-plane and gateway components, including:
- gaap-syncer
- gaap-operator
- gaap-metrics
- gaap-dbp
- gaap-data-gateway
- ai-gateway-controller-*
- envoy-gateway-*
- envoy-gap-controller-*
- envoy-ratelimit-*
monitoring¶
This namespace provides monitoring and observability components, including:
- gaapmon-blackbox-exporter-*
- gaapmon-k8s-state-metrics-*
- gaapmon-node-exporter-*
- prometheus-gaapmon-prometheus-*
When all components in both namespaces are in a Running state, the compute cluster is fully initialized and ready for model deployment.
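The readiness condition above can be checked across both namespaces in one pass. A minimal sketch:

```shell
# Report any pod in the two GenAI namespaces that is not yet Running
for ns in gaap-controller monitoring; do
  kubectl get pods -n "$ns" --no-headers \
    | awk -v ns="$ns" '$3 != "Running" {print ns ": " $0; bad=1} END {exit bad}' \
    || echo "namespace $ns is not ready yet"
done
```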
Capacity from multiple compute clusters can be aggregated into a unified inventory and used across multiple model deployments.
List Compute Clusters¶
Navigate to Operations Console → GenAI → Compute Clusters to view all registered compute clusters available for model deployments.
Nodes¶
The Nodes tab provides visibility into node-level resource utilization within the compute cluster.
Each node entry displays current capacity and usage details for the following resources:
- CPU: Total cores, used cores, usage percentage, and available capacity
- Memory: Total memory, used memory, usage percentage, and available capacity
- GPU: Total GPU units, used units, usage percentage, and available units
This view reflects real-time resource consumption across nodes that make up the compute cluster. The available capacity shown here represents the resources that can be allocated for GenAI and serverless inference model deployments.
The Nodes view is read-only and is intended for monitoring resource distribution and utilization across the compute cluster.
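The per-resource figures in this view follow the usual capacity arithmetic: usage percentage is used over total, and available capacity is total minus used. An illustrative sketch (the helper names are not part of the product):

```shell
# Illustrative helpers showing how the Nodes tab figures relate:
# usage percentage and available capacity derived from used/total units.
usage_pct() {  # $1 = used units, $2 = total units
  [ "$2" -gt 0 ] || { echo 0; return; }  # guard against empty capacity
  echo $(( 100 * $1 / $2 ))
}
available() {  # $1 = used units, $2 = total units
  echo $(( $2 - $1 ))
}

# Example: a node with 8 GPUs, 6 of them in use
usage_pct 6 8   # prints 75
available 6 8   # prints 2
```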
K8s Resources¶
The K8s Resources tab provides an aggregated view of Kubernetes resource utilization for the compute cluster.
This view summarizes cluster-wide resource capacity and consumption across the following resource types:
CPU Resources¶
- Total CPU: The total number of CPU cores available in the compute cluster
- Allocated CPU: The number of CPU cores currently allocated to workloads
- Available CPU: The remaining CPU capacity available for allocation
- Utilization percentage indicating overall CPU usage across the cluster
Memory Resources¶
- Total Memory: The total memory capacity available in the compute cluster
- Allocated Memory: The amount of memory currently allocated to workloads
- Available Memory: The remaining memory capacity available for allocation
- Utilization percentage indicating overall memory usage across the cluster
The K8s Resources view reflects the combined resource consumption across all nodes in the compute cluster and provides a high-level snapshot of cluster capacity. This information helps assess overall resource availability for current and future model deployments.
The K8s Resources view is read-only and intended for monitoring cluster-level resource utilization.
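The cluster-level figures can be cross-checked against what Kubernetes itself reports per node:

```shell
# Per-node requested vs allocatable resources, as summarized by Kubernetes
kubectl describe nodes | grep -A 10 "Allocated resources:"
```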
Delete Compute Cluster¶
Deleting a compute cluster removes the GenAI and serverless inference operator resources from the cluster and prevents the cluster from being used for subsequent model deployments.
Deleting a compute cluster does not deprovision the underlying Kubernetes cluster or associated infrastructure, which must be handled separately.