GPU Config
Graphics Processing Units (GPUs) accelerate certain types of workloads, particularly machine learning and data processing tasks, delivering significantly better performance than central processing units (CPUs) alone. GPU configuration can be managed via the UI, RCTL, Terraform, System Sync, and the Swagger API (V2 and V3).
GPU Configuration via UI
- Under Node Pools, click Add Node Pool
- Enable Configure GPU Settings
- Select a GPU Type and GPU Count
- Optionally, turn on Enable GPU Sharing. GPU accelerator sharing allows multiple workloads to use the same physical GPU, maximizing the utilization of expensive GPU hardware and reducing costs.
- Select the GPU Strategy and enter the Max Shared Clients
- GPU Driver Installation: configure how GPU drivers are installed on the virtual machine (VM) instances that back the GPU node pool. The two driver installation types are Google-managed and User-managed. By default, User-managed is selected.
    - User-managed allows users to manually install drivers on the node or provide a driver installer DaemonSet for the node pool (see the example after the note below).
    - On selecting Google-managed, drivers are fetched from a third-party location and installed automatically. Select a Driver version.
- Provide a GPU Partition Size
Note: When you create a GPU node pool with the DriverInstallationType set to user-managed, the GPU count does not appear in the cluster card of the console; it appears only after the drivers are installed manually.
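For user-managed installation on nodes using the COS_CONTAINERD image, Google publishes an NVIDIA driver installer DaemonSet that can be applied directly to the cluster. The sketch below is one common way to do this; the manifest URL is taken from the GKE documentation and should be verified against the current GKE docs before use.

```bash
# Deploy Google's NVIDIA driver installer DaemonSet (COS image nodes)
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```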
| Field Name | Description |
|---|---|
| **GKE Node Accelerator** | |
| GPU Type* | The GPU accelerator model (for example, nvidia-tesla-t4) attached to each node, allowing optimization for specific workload requirements and cost-effective use of the most suitable hardware accelerators available |
| GPU Count* | The number of GPUs attached to each node, facilitating workload optimization and resource allocation for GPU-accelerated tasks |
| **GPU Sharing** | |
| GPU Strategy* | Defines how GPUs are allocated and shared among pods within the cluster, with options like time-sharing allowing efficient utilization of GPU resources across multiple workloads based on predefined allocation policies |
| Max Shared Clients | The maximum number of clients permitted to concurrently share a single physical GPU within a GKE cluster |
| **GPU Driver Installation** | |
| Driver Version* | The specific version of the GPU driver installed on each node, ensuring compatibility with GPU-accelerated workloads |
| GPU Partition Size | The size of the partitions to be created on the GPU within a GKE cluster. Valid values are outlined in the NVIDIA documentation, specifying the granularity for allocating GPU resources based on workload requirements and resource availability |
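In the cluster spec, these UI settings map to the accelerators block under a node pool's machineConfig. The sketch below uses only field names that appear in the v3 example later on this page; GPU sharing and partition settings are additional accelerator fields, so consult the spec reference for their exact names.

```yaml
# Minimal sketch of a node pool accelerator block (v3 spec)
machineConfig:
  machineType: n1-standard-4
  accelerators:
    - type: nvidia-tesla-t4        # GPU Type
      count: 1                     # GPU Count per node
      gpuDriverInstallation:
        type: google-managed       # or user-managed to install drivers yourself
        config:
          version: "LATEST"        # Driver Version (Google-managed only)
```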
Note
To understand the limitations, please refer to this page.
GPU Configuration via RCTL
V3 Config Spec (Recommended)
Below is an example of a v3 spec for creating a GKE cluster with GPU configuration.
apiVersion: infra.k8smgmt.io/v3
kind: Cluster
metadata:
  name: new-gpu
  project: defaultproject
  modifiedAt: "2024-03-12T09:42:58.528168Z"
spec:
  cloudCredentials: cred-gke
  type: gke
  config:
    gcpProject: dev-12345
    location:
      type: zonal
      config:
        zone: us-central1-a
    controlPlaneVersion: "1.27"
    network:
      name: default
      subnetName: default
      access:
        type: public
        config: null
      enableVPCNativetraffic: true
      maxPodsPerNode: 110
    features:
      enableComputeEnginePersistentDiskCSIDriver: true
    nodePools:
      - name: default-nodepool
        nodeVersion: "1.27"
        size: 3
        machineConfig:
          imageType: COS_CONTAINERD
          machineType: n1-standard-4
          bootDiskType: pd-standard
          bootDiskSize: 100
          accelerators:
            - type: nvidia-tesla-t4
              count: 1
              gpuDriverInstallation:
                type: google-managed
                config:
                  version: "LATEST"
        upgradeSettings:
          strategy: SURGE
          config:
            maxSurge: 1
  blueprint:
    name: minimal
    version: latest
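The spec can then be applied with RCTL. A minimal example, assuming the YAML above is saved to a file named gke-gpu-cluster.yaml (a filename chosen for this example):

```bash
# Create or update the cluster from the v3 spec file
rctl apply -f gke-gpu-cluster.yaml
```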
V2 Config Spec
Below is an example of a v2 spec for creating a GKE cluster with GPU configuration.
apiVersion: infra.k8smgmt.io/v2
kind: Cluster
metadata:
  name: demogpu-test
  project: defaultproject
spec:
  blueprint:
    name: minimal
    version: latest
  cloudCredentials: cred-gke
  config:
    controlPlaneVersion: "1.27"
    feature:
      enableComputeEnginePersistentDiskCSIDriver: true
    location:
      type: zonal
      zone: us-central1-a
    name: demogpu-test
    network:
      enableVPCNativeTraffic: true
      maxPodsPerNode: 110
      name: default
      networkAccess:
        privacy: public
      nodeSubnetName: default
    nodePools:
      - machineConfig:
          accelerators:
            - count: 1
              driverInstallation:
                type: user-managed
              type: nvidia-tesla-t4
          bootDiskSize: 100
          bootDiskType: pd-standard
          imageType: COS_CONTAINERD
          machineType: n1-standard-4
        management: {}
        name: default-nodepool
        nodeVersion: "1.27"
        size: 3
        upgradeSettings:
          strategy: SURGE
          surgeSettings:
            maxSurge: 1
            maxUnavailable: 0
    project: dev-12345
  type: Gke
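Once the node pool is provisioned and the drivers are installed, the GPUs are exposed on the nodes as the nvidia.com/gpu allocatable resource. A quick way to verify this against the cluster's kubeconfig:

```bash
# Show each node name along with its nvidia.com/gpu capacity/allocatable lines
kubectl describe nodes | grep -E "^Name:|nvidia.com/gpu"
```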