Mohan Atreya

Using GPUs in Kubernetes

Unlike CPU and memory, GPUs are not natively supported in Kubernetes. Because Kubernetes manages CPU and memory natively, it can automatically schedule containers based on these resources, allocate them to Pods, and handle resource isolation and over-subscription.

GPUs are considered specialized hardware, so Kubernetes relies on device plugins to support them. Device plugins make Kubernetes GPU-aware, allowing it to discover, allocate, and schedule GPUs for containerized workloads. Without a device plugin, Kubernetes is unaware of the GPUs available on the nodes and cannot assign them to Pods. In this blog, we will discuss why GPUs are not natively supported and how device plugins help address this gap.
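As a rough illustration of what "allocating a GPU to a Pod" looks like from the workload side, the sketch below builds a Pod manifest that asks for a GPU through an extended resource. It assumes an NVIDIA-style device plugin that advertises the resource name `nvidia.com/gpu`; the helper `gpu_pod_manifest`, the Pod name, and the image are hypothetical and used only for illustration.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpu_count: int = 1,
                     resource_name: str = "nvidia.com/gpu") -> dict:
    """Build a Pod manifest that requests GPUs via an extended resource.

    resource_name is an assumption: it must match whatever name the
    device plugin installed on the cluster actually advertises.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                # Extended resources such as GPUs are requested in whole
                # units under `limits`.
                "resources": {"limits": {resource_name: gpu_count}},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical Pod name and image, for illustration only.
    manifest = gpu_pod_manifest("gpu-demo", "example.com/cuda-app:latest")
    print(json.dumps(manifest, indent=2))
```

If no device plugin is running, no node advertises this resource, so a Pod built this way would simply stay unschedulable.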

Device Plugin K8s

Rafay Newsletter-September 2024

Welcome to the September 2024 edition of the Rafay customer newsletter. This month, we’re excited to bring you the latest product enhancements and insightful content crafted to help you make the most of your AI/ML, Kubernetes, and cloud-native operations.

Every month, we ship new functionality and publish a number of incremental updates to our product documentation, YouTube channel, and tech blogs. Our users tell us it would be great if we summarized the month's updates in a newsletter that they can read or listen to in 10 minutes.

Newsletter Sep 2024

Why do we need Custom Schedulers for Kubernetes?

The Kubernetes scheduler is the brain that is responsible for assigning pods to nodes based on resource availability, constraints, and affinity/anti-affinity rules. For small to medium-sized clusters running simple stateless applications like web services or APIs, the default Kubernetes scheduler is a great fit. The default Kubernetes scheduler manages resource allocation, ensures even distribution of workloads across nodes, and supports features like node affinity, pod anti-affinity, and automatic rescheduling.

The default scheduler is extremely well-suited for long-running applications like web services, APIs, and microservices. Learn more about the scheduling framework.
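To make "assigning pods to nodes based on resource availability" concrete, here is a deliberately simplified filter-then-score sketch. It is a toy model, not the actual scheduler: the `Node` and `PodRequest` types, the scoring formula, and all numbers are assumptions chosen for illustration, and the real scheduler combines many more plugins and constraints.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_capacity_millis: int  # assumed to be > 0
    mem_capacity_mib: int     # assumed to be > 0
    cpu_free_millis: int
    mem_free_mib: int


@dataclass
class PodRequest:
    cpu_millis: int
    mem_mib: int


def filter_nodes(pod: PodRequest, nodes: List[Node]) -> List[Node]:
    """Filter step: keep only nodes with enough free CPU and memory."""
    return [n for n in nodes
            if n.cpu_free_millis >= pod.cpu_millis
            and n.mem_free_mib >= pod.mem_mib]


def score(pod: PodRequest, node: Node) -> float:
    """Score step: prefer nodes that would be left with the most headroom."""
    cpu_left = (node.cpu_free_millis - pod.cpu_millis) / node.cpu_capacity_millis
    mem_left = (node.mem_free_mib - pod.mem_mib) / node.mem_capacity_mib
    return (cpu_left + mem_left) / 2


def pick_node(pod: PodRequest, nodes: List[Node]) -> Optional[Node]:
    """Return the best-scoring feasible node, or None if nothing fits."""
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None
    return max(feasible, key=lambda n: score(pod, n))


if __name__ == "__main__":
    nodes = [
        Node("node-a", 4000, 8192, 1000, 2048),
        Node("node-b", 4000, 8192, 3000, 6144),
    ]
    chosen = pick_node(PodRequest(cpu_millis=500, mem_mib=1024), nodes)
    print(chosen.name if chosen else "unschedulable")
```

A spreading heuristic like this works well for independent, long-running services, because each Pod is placed on its own without regard to any other Pod.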

Unfortunately, AI/ML workloads have very different requirements that the default scheduler cannot satisfy!

k8s Scheduling Framework

Break Glass Workflows for Developer Access to Kubernetes Clusters - Introduction

In any large-scale, production-grade Kubernetes setup, maintaining the security and integrity of the clusters is critical. However, there are exceptional circumstances—such as production outages or critical bugs—where developers need emergency access to a Kubernetes cluster to resolve issues.

This is where a "Break Glass" process comes into play. It is a controlled procedure that grants temporary, elevated access to developers in critical situations, with the appropriate safeguards in place to minimize risks.

Break Glass

Pod Identity versus IRSA for Amazon EKS - Part 1

When managing containerized applications on Amazon Elastic Kubernetes Service (EKS), a critical concern is granting your applications the permissions they need to securely access AWS resources. Traditionally, AWS has provided mechanisms like IAM Roles for Service Accounts (IRSA) to enable fine-grained permissions management within EKS clusters. However, EKS Pod Identity, a newer feature, offers a more refined and efficient solution.

In this blog, we’ll explore how EKS Pod Identity differs from IRSA, and why it represents a significant improvement for identity management in Amazon EKS-based environments. Let's assume an application running in our EKS cluster needs to securely access data in an Amazon S3 bucket.

App Accessing AWS S3

Bringing DevOps and Automation to Machine Learning via MLOps

The vast majority of organizations are new to AI/ML. As a result, most in-house systems and processes supporting it are likely ad hoc. Industry analysts like Gartner forecast that organizations will need to quickly transition from pilots to production with AI/ML in order to make it across the chasm.

Most organizations already have reasonably mature DevOps processes and systems in place. So, going mainstream with AI should be a walk in the park. Correct? It turns out that this is not really true. As Chirag Dekate of Gartner puts it: “IT leaders responsible for AI are discovering the AI pilot paradox, where launching pilots is deceptively easy but deploying them into production is notoriously challenging.”

In this blog, we will try and answer the following question:

Why do we need a new process called MLOps when most organizations already have reasonably mature DevOps practices? How is MLOps different from DevOps?

DevOps vs MLOps

GPU Metrics - SM Clock

In the previous blog, we discussed why tracking and reporting GPU Memory Utilization metrics matters. In this blog, we will dive deeper into another critical GPU metric: the GPU SM clock. The GPU SM clock (Streaming Multiprocessor clock) metric refers to the clock speed at which the GPU's cores (SMs) are running.

The SM is the main processing unit of the GPU, responsible for executing compute tasks such as deep learning operations, simulations, and graphics rendering. Monitoring the SM clock speed can help users assess the performance and health of their GPUs during workloads and detect potential bottlenecks related to clock speed throttling.
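One way to turn this into a simple check is sketched below: flag samples where the GPU is busy but its SM clock sits well below the maximum. This is only an illustrative heuristic under stated assumptions. The `GpuSample` type, the 90% clock threshold, the 80% utilization cutoff, and the sample values are all hypothetical, and a low clock on an idle GPU is expected rather than a sign of throttling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuSample:
    sm_clock_mhz: float  # observed SM clock for this sample
    gpu_util_pct: float  # GPU utilization for the same sample


def possibly_throttled(samples: List[GpuSample], max_sm_clock_mhz: float,
                       clock_ratio: float = 0.9,
                       min_util_pct: float = 80.0) -> List[int]:
    """Return indices of samples that look like clock throttling.

    A sample is flagged only when the GPU is busy (utilization at or above
    min_util_pct) and its SM clock is below clock_ratio * max_sm_clock_mhz.
    Both thresholds are illustrative defaults, not recommended values.
    """
    floor_mhz = clock_ratio * max_sm_clock_mhz
    return [i for i, s in enumerate(samples)
            if s.gpu_util_pct >= min_util_pct and s.sm_clock_mhz < floor_mhz]


if __name__ == "__main__":
    # Made-up readings for illustration; not measurements from a real GPU.
    readings = [
        GpuSample(1980, 95),  # busy, running at full clock
        GpuSample(1400, 97),  # busy, clock well below maximum
        GpuSample(300, 2),    # idle, so a low clock is expected
        GpuSample(1200, 90),  # busy, clock well below maximum
    ]
    print(possibly_throttled(readings, max_sm_clock_mhz=1980))
```

Pairing the clock with utilization keeps idle periods from being reported as bottlenecks.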

GPU SM Clock

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.