
Kubernetes

Self-Service Slurm Clusters on Kubernetes with Rafay GPU PaaS

In the previous blog, we discussed how Project Slinky bridges the gap between Slurm, the de facto job scheduler in HPC, and Kubernetes, the standard for modern container orchestration.

Project Slinky and Rafay’s GPU Platform-as-a-Service (PaaS) combine to give enterprises and cloud providers secure, multi-tenant, self-service access to Slurm-based HPC environments on shared Kubernetes clusters. Together, they allow cloud providers and enterprise platform teams to offer Slurm-as-a-Service on Kubernetes without compromising on performance, usability, or control.


Project Slinky: Bringing Slurm Scheduling to Kubernetes

As high-performance computing (HPC) environments evolve, there’s an increasing demand to bridge the gap between traditional HPC job schedulers and modern cloud-native infrastructure. Project Slinky is an open-source project that integrates Slurm, the industry-standard workload manager for HPC, with Kubernetes, the de facto orchestration platform for containers.

This enables organizations to deploy and operate Slurm-based workloads on Kubernetes clusters, allowing them to leverage the best of both worlds: Slurm’s mature, job-centric HPC scheduling model and Kubernetes’s scalable, cloud-native runtime environment.

Project Slinky

Get Started with Cilium as a Load Balancer for On-Premises Kubernetes Clusters

Organizations deploying Kubernetes in on-premises data centers or hybrid cloud environments often face challenges with exposing services externally. Unlike public cloud providers that offer managed load balancers out of the box, bare metal environments require custom solutions. This is where Cilium steps in as a powerful alternative, offering native load balancing capabilities using BGP (Border Gateway Protocol).

Cilium is more than just a CNI plugin. It enables advanced networking features, such as observability, security, and load balancing—all integrated deeply with the Kubernetes networking model. Specifically, Cilium can advertise Kubernetes LoadBalancer service IPs to external routers using BGP, making these services reachable directly from external networks without needing to rely on cloud-native load balancers or manual proxy setups. This is ideal for enterprises running bare metal Kubernetes clusters, air-gapped environments, or hybrid cloud setups.
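To make this concrete, here is a rough sketch of what that wiring can look like in Cilium: a pool of addresses for LoadBalancer Services plus a BGP peering policy that advertises them to an upstream router. The CRDs follow Cilium’s v2alpha1 APIs and field names can vary slightly between releases; the ASNs, CIDRs, and peer address below are placeholders.

```yaml
# Illustrative sketch, not a drop-in config: give Cilium a pool of IPs for
# Services of type LoadBalancer, then advertise those IPs to a BGP peer.
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lb-pool
spec:
  blocks:
    - cidr: 192.0.2.0/24              # placeholder range; use addresses routable in your network
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: bgp-peering
spec:
  nodeSelector:
    matchLabels:
      bgp: enabled                    # only labeled nodes peer with the router
  virtualRouters:
    - localASN: 64512                 # placeholder private ASN for the cluster
      exportPodCIDR: false
      serviceSelector:                # match-everything selector, i.e. advertise all LoadBalancer IPs
        matchExpressions:
          - { key: somekey, operator: NotIn, values: ["never-used-value"] }
      neighbors:
        - peerAddress: "10.0.0.1/32"  # placeholder upstream router
          peerASN: 64513              # placeholder router ASN
```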

Want to dive deeper? Check out our introductory blog on Cilium’s Kubernetes load balancing capabilities, and see the detailed step-by-step instructions for additional information.

Using Cilium as a Kubernetes Load Balancer: A Powerful Alternative to MetalLB

In Kubernetes, exposing services of type LoadBalancer in on-prem or bare-metal environments typically requires a dedicated "Layer 2" or "BGP-based" software load balancer—such as MetalLB. While MetalLB has been the go-to solution for this use case, recent advances in Cilium, a powerful eBPF-based Kubernetes networking stack, offer a modern and more integrated alternative.

Cilium isn’t just a fast, scalable Container Network Interface (CNI). It also includes a built-in, eBPF-powered load balancer that can replace MetalLB with a more performant, secure, and cloud-native approach.
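With an address pool and BGP advertisement handled by Cilium, exposing a workload needs nothing beyond a standard Service; the names and ports below are illustrative.

```yaml
# Illustrative Service: with Cilium's LB IPAM enabled, Cilium assigns an
# external IP from its pool and advertises it over BGP, so no MetalLB
# speaker or cloud load balancer is required.
apiVersion: v1
kind: Service
metadata:
  name: web                # placeholder workload name
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```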

Cilium based k8s Load Balancer

Introduction to Slurm: The Backbone of HPC

This is Part 1 in a blog series on Slurm, where we introduce some core concepts. We are not talking about the fictional soft drink from the world of Futurama. Instead, this blog is about Slurm (Simple Linux Utility for Resource Management), an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system used in high-performance computing (HPC) environments.

Slurm was originally conceptualized in 2002 at Lawrence Livermore National Laboratory (LLNL) and has since been actively developed and maintained, primarily by SchedMD. In that time, Slurm has become the de facto workload manager for HPC, with more than 50% of the TOP500 supercomputers using it.

Slurm Logo

Get Started with Auto Mode for Amazon EKS with Rafay

This is Part 3 in our series on Amazon EKS Auto Mode. In the previous posts, we explored:

  1. Part 1: An Introduction: Learn the core concepts and benefits of EKS Auto Mode.
  2. Part 2: Considerations: Understand the key considerations before configuring EKS Auto Mode.

In this post, we will dive into the steps required to build and manage an Amazon EKS Auto Mode cluster using a cluster template in the Rafay Platform. This exercise is particularly well suited for platform teams interested in providing their end users with a controlled self-service experience with centralized governance.
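For orientation, the underlying cluster definition is small because Auto Mode hands compute, networking, and storage management to EKS. Outside of Rafay, a roughly equivalent definition via eksctl might look like the sketch below; this assumes an eksctl release with autoModeConfig support, and the cluster name, region, and node pools are placeholders rather than values from the Rafay template.

```yaml
# Rough eksctl-based sketch of an EKS Auto Mode cluster (assumption: an
# eksctl version that supports autoModeConfig). No nodeGroups are declared;
# Auto Mode provisions and manages compute through its built-in node pools.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-auto-mode      # placeholder cluster name
  region: us-west-2         # placeholder region
autoModeConfig:
  enabled: true
  nodePools: ["system", "general-purpose"]   # built-in Auto Mode node pools
```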

EKS Auto Mode Cluster in Rafay

Deploying Custom CNI (Kube-OVN) in Rafay MKS Upstream Kubernetes Cluster Using the Blueprint Add-On Approach

In continuation of our Part 1 intro blog on the Kube-OVN CNI, this is Part 2, where we will cover how easy it is to manage CNI configurations using Rafay’s Blueprint Add-On approach.

In the evolving cloud-native landscape, networking requirements are becoming more complex, with platform teams needing enhanced control and customization over their Kubernetes clusters. Rafay’s support for custom, compatible CNIs allows organizations to select and deploy advanced networking solutions tailored to their needs. While there are several options available, this blog will focus specifically on deploying the Kube-OVN CNI. Using Rafay’s Blueprint Add-On approach, we will guide you through the steps to seamlessly integrate Kube-OVN into an upstream Kubernetes cluster managed by Rafay’s Managed Kubernetes Service.

Our upcoming release, scheduled to reach the production environment in December, introduces several new features and enhancements, each of which will be covered in a separate blog post. This particular blog focuses on the support and process for deploying Kube-OVN as the primary CNI on an upstream Kubernetes cluster.

Kube-OVN

Watch a video showcasing how users can customize and configure Kube-OVN as the primary CNI on Rafay MKS Kubernetes clusters.

The Kube-OVN CNI: A Powerful Networking Solution for Kubernetes

Kubernetes has become the de facto standard for orchestrating containerized applications, but efficient networking remains one of the biggest challenges. For Kubernetes networking, Container Network Interface (CNI) plugins handle the essential task of managing the network configuration between pods, nodes, and external systems. Among these CNI plugins, Kube-OVN stands out as a feature-rich and enterprise-ready solution, designed for cloud-native applications requiring robust networking features.

In this blog, we will discuss how Kube-OVN differs from popular CNI plugins such as Calico and Cilium, and the use cases where it is particularly useful.
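As a small preview of what sets it apart, Kube-OVN models networks as first-class objects: for example, its Subnet custom resource can pin a namespace to its own address range. The sketch below is illustrative; the subnet name, CIDR, gateway, and namespace are placeholders.

```yaml
# Illustrative Kube-OVN Subnet: pods in the "analytics" namespace draw IPs
# from a dedicated CIDR, with NAT applied to traffic leaving the overlay.
apiVersion: kubeovn.io/v1
kind: Subnet
metadata:
  name: analytics-subnet     # placeholder name (Subnets are cluster-scoped)
spec:
  protocol: IPv4
  cidrBlock: 10.80.0.0/16    # placeholder CIDR
  gateway: 10.80.0.1         # placeholder gateway
  natOutgoing: true
  namespaces:
    - analytics              # namespaces bound to this subnet
```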

Kube-OVN Logo

Introducing "Schedules" on the Rafay Platform: Simplifying Cost Optimization and Compliance for Platform Teams

Platform teams today are increasingly tasked with balancing cost efficiency, compliance, and operational agility across complex cloud environments. Cost-optimization measures and compliance-related tasks are critical, yet executing them consistently and effectively can be challenging.

With the recent introduction of the “Schedules” capability on the Rafay Platform, platform teams can now orchestrate one-time or recurring actions across environments in a standardized, centralized manner. This new feature enables teams to implement cost-saving policies, manage compliance actions, and ensure operational efficiency—all from a single interface. Here’s a closer look at how this feature can streamline your workflows and add value to your platform operations.

Schedules

Spatial Partitioning of GPUs using Nvidia MIG

In the prior blogs, we discussed why GPUs are managed differently in Kubernetes, how the GPU Operator helps streamline management, and the various strategies for sharing GPUs on Kubernetes. In 2020, Nvidia introduced Multi-Instance GPU (MIG), which takes GPU sharing to a different level.

In this blog, we will start by reviewing some common industry use cases for MIG and then dive deeper into how MIG is configured and used.
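As a preview of the Kubernetes side, once MIG is enabled and the NVIDIA device plugin (deployed by the GPU Operator) exposes the slices using the mixed MIG strategy, a pod simply requests a MIG profile as an extended resource. The profile below is one of the A100 profiles, and the image and pod name are placeholders.

```yaml
# Illustrative pod requesting a single 1g.5gb MIG slice (mixed strategy
# assumed, where each MIG profile appears as its own extended resource).
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder CUDA image
      command: ["nvidia-smi", "-L"]                # lists the visible GPU instance
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```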

Nvidia MIG