In the previous blog, we introduced the concept of custom schedulers and why they are necessary for certain use cases. In this blog, we will compare and contrast three popular schedulers: Volcano, Kueue, and YuniKorn.
The Kubernetes scheduler is the brain responsible for assigning pods to nodes based on resource availability, constraints, and affinity/anti-affinity rules. For small to medium-sized clusters running long-running, stateless applications such as web services, APIs, and microservices, the default Kubernetes scheduler is a great fit: it manages resource allocation, ensures even distribution of workloads across nodes, and supports features like node affinity, pod anti-affinity, and automatic rescheduling. Learn more about the scheduling framework.
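To make those placement features concrete, here is a minimal sketch, using the Kubernetes Python client, of a pod that uses node affinity to target nodes carrying a hypothetical disktype=ssd label and pod anti-affinity to avoid co-locating two app=web replicas on the same node. The pod name, labels, and image are illustrative.

```python
# A minimal sketch of node affinity and pod anti-affinity, handled entirely by the
# default Kubernetes scheduler. Names, labels, and image are illustrative.
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig

pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web", "labels": {"app": "web"}},
    "spec": {
        "containers": [{"name": "web", "image": "nginx:1.27"}],
        "affinity": {
            # Node affinity: only schedule onto nodes labeled disktype=ssd (hypothetical label).
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [
                            {"key": "disktype", "operator": "In", "values": ["ssd"]}
                        ]
                    }]
                }
            },
            # Pod anti-affinity: never place two app=web pods on the same node.
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [{
                    "labelSelector": {"matchLabels": {"app": "web"}},
                    "topologyKey": "kubernetes.io/hostname",
                }]
            },
        },
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod_manifest)
```

All of these constraints are evaluated by the default scheduler out of the box, with no additional scheduling components.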
Unfortunately, AI/ML workloads have very different requirements that the default scheduler cannot satisfy!
In continuation of Part 1 of our blog introducing Workload Identity for Azure AKS, this is Part 2, where we will explore how to use Workload Identity with Rafay's GitOps approach, enabling your Kubernetes pods to securely access Azure resources.
In continuation of Part 1 of our blog introducing Pod Identity vs. IRSA for Amazon EKS, this is Part 2, where we will explore how to use Amazon EKS Pod Identity with the Rafay platform. This blog post will guide you through deploying the Amazon EKS Pod Identity Agent and configuring role associations, enabling your Kubernetes pods to securely access AWS services.
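The blog walks through driving this via the Rafay platform; as a preview, here is a minimal sketch of the two underlying EKS operations using boto3. The region, cluster name, namespace, service account, and role ARN are placeholders.

```python
# A minimal sketch (assumed region, cluster, namespace, service account, and role ARN)
# of the two setup steps: installing the EKS Pod Identity Agent as a managed add-on
# and associating an IAM role with a Kubernetes service account.
import boto3

eks = boto3.client("eks", region_name="us-west-2")  # assumed region

# Step 1: install the Pod Identity Agent add-on on the cluster.
eks.create_addon(
    clusterName="demo-cluster",            # hypothetical cluster name
    addonName="eks-pod-identity-agent",
)

# Step 2: map a namespace/service-account pair to an IAM role.
eks.create_pod_identity_association(
    clusterName="demo-cluster",
    namespace="payments",                  # hypothetical namespace
    serviceAccount="payments-sa",          # hypothetical service account
    roleArn="arn:aws:iam::111122223333:role/payments-s3-read",  # hypothetical role
)
```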
In any large-scale, production-grade Kubernetes setup, maintaining the security and integrity of the clusters is critical. However, there are exceptional circumstances—such as production outages or critical bugs—where developers need emergency access to a Kubernetes cluster to resolve issues.
This is where a "Break Glass" process comes into play. It is a controlled procedure that grants temporary, elevated access to developers in critical situations, with the appropriate safeguards in place to minimize risks.
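The exact implementation varies by platform, but at the Kubernetes level the core mechanic usually comes down to granting, and later revoking, an elevated RBAC binding. Below is a minimal, illustrative sketch of that mechanic using the Kubernetes Python client; the user, binding name, and role are hypothetical, and a real break-glass workflow would wrap this in approvals, time limits, and audit logging.

```python
# Illustrative only: grant a developer temporary elevated access by creating a
# ClusterRoleBinding, then revoke it once the incident is resolved. The user,
# binding name, and ClusterRole are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

BINDING_NAME = "break-glass-jane-incident-1234"  # hypothetical binding name

# Grant: bind the built-in "edit" ClusterRole to the developer for the emergency.
rbac.create_cluster_role_binding(body={
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRoleBinding",
    "metadata": {"name": BINDING_NAME},
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "ClusterRole", "name": "edit"},
    "subjects": [{"apiGroup": "rbac.authorization.k8s.io", "kind": "User", "name": "jane@example.com"}],
})

# Revoke: delete the binding as soon as the incident is closed.
rbac.delete_cluster_role_binding(name=BINDING_NAME)
```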
When managing containerized applications on Amazon Elastic Kubernetes Service (EKS), a critical concern is granting permissions to your applications so that they can securely access AWS resources. Traditionally, AWS has provided mechanisms like IAM Roles for Service Accounts (IRSA) to enable fine-grained permissions management within EKS clusters. However, EKS Pod Identity, a newer feature, offers a more refined and efficient solution.
In this blog, we’ll explore how EKS Pod Identity differs from IRSA, and why it represents a significant improvement for identity management in Amazon EKS-based environments. Let's assume an application running in our EKS cluster needs to securely access data in an Amazon S3 bucket.
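Whichever mechanism ultimately supplies the credentials, the application code can stay the same: with reasonably recent SDK versions, both IRSA and EKS Pod Identity inject temporary credentials through the AWS SDK's default credential provider chain. A minimal sketch of that S3 read, assuming a hypothetical bucket and object key and a role that permits s3:GetObject:

```python
# A minimal sketch of the S3 access described above. The bucket name and object key
# are hypothetical; no keys appear in code because boto3 picks up the injected
# credentials automatically, whether they come from IRSA or from an EKS Pod
# Identity association.
import boto3

s3 = boto3.client("s3")  # credentials resolved from the pod's identity
obj = s3.get_object(Bucket="example-data-bucket", Key="reports/latest.csv")
print(obj["Body"].read()[:200])
```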
The vast majority of organizations are new to AI/ML. As a result, most of the in-house systems and processes supporting it are likely ad hoc. Industry analysts like Gartner forecast that organizations will need to quickly transition from pilots to production with AI/ML in order to make it across the chasm.
Most organizations already have reasonably mature DevOps processes and systems in place. So, going mainstream with AI should be a walk in the park, correct? It turns out that this is not really true. As Chirag Dekate of Gartner puts it, “IT leaders responsible for AI are discovering the AI pilot paradox, where launching pilots is deceptively easy but deploying them into production is notoriously challenging.”
In this blog, we will try to answer the following questions:
Why do we need a new process called MLOps when most organizations already have reasonably mature DevOps practices? How is MLOps different from DevOps?
In the previous blog, we discussed why tracking and reporting GPU power usage matters. In this blog, we will dive deeper into another critical GPU metric: GPU framebuffer usage.
In the previous blog, we discussed why tracking and reporting GPU SM Clock metrics matters. In this blog, we will dive deeper into another critical GPU metric: GPU power.
In the previous blog, we discussed why tracking and reporting GPU Memory Utilization metrics matters. In this blog, we will dive deeper into another critical GPU metric: the GPU SM Clock. The GPU SM clock (Streaming Multiprocessor clock) metric refers to the clock speed at which the GPU's cores (SMs) are running.
The SM is the main processing unit of the GPU, responsible for executing compute tasks such as deep learning operations, simulations, and graphics rendering. Monitoring the SM clock speed can help you assess the performance and health of your GPUs during workloads and detect potential bottlenecks caused by clock-speed throttling.
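To see this metric directly on a node, here is a minimal sketch using NVML through the pynvml bindings (assumed to be installed, for example via pip install nvidia-ml-py). It reads each GPU's current and maximum SM clock along with the bitmask of reasons currently limiting the clocks; GPU exporters such as NVIDIA's DCGM expose the same clock as a time series for cluster-wide monitoring.

```python
# A minimal sketch (assumes an NVIDIA driver and the pynvml bindings) that reads the
# current and maximum SM clock per GPU, plus the bitmask of reasons currently limiting
# the clocks (a non-zero value other than the "idle" reason can indicate throttling).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)    # MHz
        sm_max = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)   # MHz
        reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)        # bitmask
        print(f"GPU {i}: SM clock {sm_clock}/{sm_max} MHz, clock-limit reasons: {hex(reasons)}")
finally:
    pynvml.nvmlShutdown()
```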