Index

Introducing "Schedules" on the Rafay Platform: Simplifying Cost Optimization and Compliance for Platform Teams

Platform teams today are increasingly tasked with balancing cost efficiency, compliance, and operational agility across complex cloud environments. Actions such as cost-optimization measures and compliance-related tasks are critical, yet executing these tasks consistently and effectively can be challenging.

With the recent introduction of the “Schedules” capability on the Rafay Platform, platform teams can now orchestrate one-time or recurring actions across environments in a standardized, centralized manner. This new feature enables teams to implement cost-saving policies, manage compliance actions, and ensure operational efficiency, all from a single interface. Here’s a closer look at how this feature can streamline your workflows and add value to your platform operations.

Schedules

Spatial Partitioning of GPUs using Nvidia MIG

In the prior blogs, we discussed why GPUs are managed differently in Kubernetes, how the GPU Operator helps streamline management, and various strategies for sharing GPUs on Kubernetes. In 2020, Nvidia introduced Multi-Instance GPU (MIG), which takes GPU sharing to a different level.

In this blog, we will start by reviewing some common industry use cases where MIG is used and then dive deeper into how MIG is configured and used.

Nvidia MIG

GPU Sharing Strategies in Kubernetes

In the previous blogs, we discussed why GPUs are managed differently in Kubernetes and how the GPU Operator can help streamline management. In Kubernetes, although you can request fractional CPU units for workloads, you cannot request fractional GPU units.

Pod manifests must request GPUs in whole numbers, so an entire physical GPU is allocated to one container even if that container needs only a fraction of its resources. In this blog, we will describe two commonly used strategies to share a GPU on Kubernetes.
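To make the contrast concrete, here is a minimal Python sketch that builds a Pod manifest with a fractional CPU limit and a whole-number GPU limit. The helper names (`parse_cpu`, `validate_gpu_request`, `build_pod`), the Pod name, and the container image are hypothetical and exist only for illustration. `nvidia.com/gpu` is the resource name advertised by the Nvidia device plugin.

```python
import json
from fractions import Fraction


def parse_cpu(quantity: str) -> Fraction:
    """Parse a Kubernetes CPU quantity such as "500m" or "2" into cores."""
    if quantity.endswith("m"):
        # The "m" suffix means millicores: 500m is half a core.
        return Fraction(int(quantity[:-1]), 1000)
    return Fraction(quantity)


def validate_gpu_request(quantity: str) -> int:
    """GPU requests must be whole numbers; fractions are rejected."""
    if not quantity.isdigit():
        raise ValueError(f"GPU request must be a whole number, got {quantity!r}")
    return int(quantity)


def build_pod(name: str, cpu: str, gpus: str) -> dict:
    """Return a Pod manifest as a dict (hypothetical helper, not a Rafay API)."""
    parse_cpu(cpu)              # fractional CPU is accepted
    validate_gpu_request(gpus)  # fractional GPU raises ValueError
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    # Placeholder image, assumed for illustration only.
                    "image": "example.com/inference:latest",
                    "resources": {"limits": {"cpu": cpu, "nvidia.com/gpu": gpus}},
                }
            ]
        },
    }


if __name__ == "__main__":
    # Half a CPU core is fine, but the container still gets one whole GPU.
    print(json.dumps(build_pod("inference", "500m", "1"), indent=2))
```

Calling `build_pod("inference", "500m", "0.5")` raises a `ValueError` in this sketch, mirroring the fact that a fractional GPU request is not valid in a Pod manifest.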

GPU Sharing in Kubernetes

Amazon EKS v1.31 using Rafay

Our recent October release to our production environment adds support for a number of new features and enhancements. We will write about the other new features in separate blogs. This blog is focused on our turnkey support for Amazon EKS v1.31.

Both new cluster provisioning and in-place upgrades of existing EKS clusters are supported. As with most Kubernetes releases, this version also deprecates and removes a number of features. To ensure there is zero impact to our customers, we have made sure that every feature in the Rafay Kubernetes Operations Platform has been validated on this Kubernetes version.

Kubernetes v1.31

Why do we need a GPU Operator for Kubernetes

This is a follow-up to the previous blog, where we discussed device plugins for GPUs in Kubernetes and reviewed why the Nvidia device plugin was necessary for GPU support. A GPU Operator is needed to automate and simplify the management of GPUs for workloads running on Kubernetes.

In this blog, we will look at how a GPU operator helps automate and streamline operations through the lens of a market-leading implementation by Nvidia.

Without and With GPU Operator

Using GPUs in Kubernetes

Unlike CPU and memory, GPUs are not natively supported in Kubernetes. Because Kubernetes manages CPU and memory natively, it can automatically schedule containers based on these resources, allocate them to Pods, and handle resource isolation and over-subscription.

GPUs are considered specialized hardware, so Kubernetes relies on device plugins to support them. Device plugins make Kubernetes GPU-aware, allowing it to discover, allocate, and schedule GPUs for containerized workloads. Without a device plugin, Kubernetes is unaware of the GPUs available on the nodes and cannot assign them to Pods. In this blog, we will discuss why GPUs are not natively supported and how device plugins help address this gap.
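The effect of a device plugin on scheduling can be sketched with a toy model in Python. This is a simplified illustration, not how the Kubernetes scheduler is actually implemented: the node names, the resource keys `cpu_millicores` and `memory_mib`, and the functions `fits` and `candidate_nodes` are all assumptions made for the example. Only `nvidia.com/gpu` is a real resource name, advertised by the Nvidia device plugin.

```python
from typing import Dict, List

Resources = Dict[str, int]


def fits(requests: Resources, allocatable: Resources) -> bool:
    """A Pod fits only if the node advertises enough of every requested resource."""
    return all(allocatable.get(name, 0) >= amount for name, amount in requests.items())


def candidate_nodes(requests: Resources, nodes: Dict[str, Resources]) -> List[str]:
    """Return the names of nodes that could host the Pod, sorted for stable output."""
    return sorted(name for name, alloc in nodes.items() if fits(requests, alloc))


# Both hypothetical nodes may physically contain GPUs, but only the node running
# a device plugin advertises them as an allocatable resource.
nodes = {
    "node-without-plugin": {"cpu_millicores": 16000, "memory_mib": 65536},
    "node-with-plugin": {
        "cpu_millicores": 16000,
        "memory_mib": 65536,
        "nvidia.com/gpu": 4,
    },
}

gpu_pod = {"cpu_millicores": 500, "memory_mib": 1024, "nvidia.com/gpu": 1}

if __name__ == "__main__":
    print(candidate_nodes(gpu_pod, nodes))  # prints ['node-with-plugin']
```

In this model the node without a device plugin is never a candidate for the GPU Pod, because the resource it requests is simply absent from that node's advertised capacity.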

Device Plugin K8s

Enhancing Security and Compliance in Break Glass Workflows with Rafay

Maintaining stringent security and compliance standards is more critical than ever today. Implementing break glass workflows for developers presents unique challenges that require careful consideration to prevent unauthorized access and ensure regulatory compliance.

In the previous blog, we introduced the concept of break glass workflows and why organizations require them. This blog post delves into how Rafay enables Platform teams to orchestrate secure and compliant break glass workflows within their organizations. Watch a video recording of this feature in Rafay.

Rafay Newsletter-September 2024

Welcome to the September 2024 edition of the Rafay customer newsletter. This month, we’re excited to bring you the latest product enhancements and insightful content crafted to help you make the most of your AI/ML, Kubernetes, and cloud-native operations.

Every month, we push out a number of incremental updates to our product documentation, new functionality, our YouTube channel, tech blogs, and more. Our users tell us that it would be great if we summarized all the updates for the month in the form of a newsletter that they can read or listen to in 10 minutes.

Newsletter Sep 2024