
GPU Metrics

GPU Metrics - SM Clock

In the previous blog, we discussed why tracking and reporting GPU Memory Utilization metrics matters. In this blog, we will dive deeper into another critical GPU metric: the GPU SM clock. The GPU SM (Streaming Multiprocessor) clock metric is the clock speed at which the GPU's SMs are running.

The SM is the main processing unit of the GPU, responsible for executing compute tasks such as deep learning operations, simulations, and graphics rendering. Monitoring the SM clock speed helps you assess the performance and health of your GPU during workloads and detect bottlenecks caused by clock throttling.
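As a rough illustration of what detecting throttling can look like, the sketch below compares the current SM clock to the maximum SM clock. The function names and the 0.8 threshold are assumptions made for this example, not values from any vendor guidance. The two clock readings could come from a tool such as `nvidia-smi --query-gpu=clocks.sm,clocks.max.sm --format=csv,noheader,nounits`; here they are passed in as plain numbers.

```python
def sm_clock_ratio(current_mhz: float, max_mhz: float) -> float:
    """Return the current SM clock as a fraction of the maximum SM clock."""
    if max_mhz <= 0:
        raise ValueError("max_mhz must be positive")
    return current_mhz / max_mhz


def is_throttled(current_mhz: float, max_mhz: float, threshold: float = 0.8) -> bool:
    """Flag a GPU whose SM clock is below `threshold` of its maximum.

    The 0.8 default is an illustrative assumption. A low clock on an idle
    GPU is normal, so this check is only meaningful while a workload runs.
    """
    return sm_clock_ratio(current_mhz, max_mhz) < threshold
```

With these hypothetical helpers, `is_throttled(900, 1800)` returns `True` because the ratio is 0.5, while `is_throttled(1700, 1800)` returns `False`.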


Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

GPU Metrics - Memory Utilization

In the introductory blog on GPU metrics, we discussed the GPU metrics that matter and why they matter. In this blog, we will dive deeper into one of the critical GPU metrics: GPU Memory Utilization.

GPU memory utilization is the percentage of the GPU's dedicated memory (the framebuffer) that is currently in use. It measures how much of the available GPU memory is occupied by data such as models, textures, tensors, or intermediate results during computation.
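The definition above reduces to a simple ratio of used to total memory. Here is a minimal sketch, assuming the used and total figures are already available in MiB (for example from `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`); the function name is hypothetical.

```python
def memory_utilization_percent(used_mib: float, total_mib: float) -> float:
    """Return the percentage of GPU memory in use, given used and total MiB."""
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    if used_mib < 0 or used_mib > total_mib:
        raise ValueError("used_mib must be between 0 and total_mib")
    return 100.0 * used_mib / total_mib
```

For a GPU reporting 20480 MiB used out of 40960 MiB, `memory_utilization_percent(20480, 40960)` returns `50.0`.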



What GPU Metrics to Monitor and Why?

With the increasing reliance on GPUs for compute-intensive tasks such as machine learning, deep learning, data processing, and rendering, both infrastructure administrators and GPU users (data scientists, ML engineers, and GenAI app developers) need timely insight into the performance, efficiency, and overall health of their GPU resources.

To make data-driven decisions, these users need access to key metrics for their GPUs. This is the first blog in a series where we will describe the GPU metrics that you should track and monitor. In subsequent blogs, we will do a deep dive into each metric, why it matters, and how to use it effectively.

