
Mohan Atreya

Self-Service Slurm Clusters on Kubernetes with Rafay GPU PaaS

In the previous blog, we discussed how Project Slinky bridges the gap between Slurm, the de facto job scheduler in HPC, and Kubernetes, the standard for modern container orchestration.

Project Slinky and Rafay’s GPU Platform-as-a-Service (PaaS) are a transformative combination for enterprises and cloud providers, enabling secure, multi-tenant, self-service access to Slurm-based HPC environments on shared Kubernetes clusters. Together, they allow cloud providers and enterprise platform teams to offer Slurm-as-a-Service on Kubernetes without compromising on performance, usability, or control.


Project Slinky: Bringing Slurm Scheduling to Kubernetes

As high-performance computing (HPC) environments evolve, there’s an increasing demand to bridge the gap between traditional HPC job schedulers and modern cloud-native infrastructure. Project Slinky is an open-source project that integrates Slurm, the industry-standard workload manager for HPC, with Kubernetes, the de facto orchestration platform for containers.

This enables organizations to deploy and operate Slurm-based workloads on Kubernetes clusters, letting them leverage the best of both worlds: Slurm’s mature, job-centric HPC scheduling model and Kubernetes’ scalable, cloud-native runtime environment.
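To make this concrete, here is a minimal sketch of how a user might submit a batch job to a Slinky-managed Slurm cluster from Python. It assumes the standard Slurm CLI (sbatch) is reachable from where the script runs, for example a login or controller pod; the job name, partition, and resource values are illustrative placeholders, not settings from this blog.

```python
import subprocess
import tempfile

# Illustrative batch script: a single-node, single-GPU job.
# Partition and resource names depend on how the Slurm cluster
# was configured; adjust them to match your environment.
BATCH_SCRIPT = """#!/bin/bash
#SBATCH --job-name=hello-slinky
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --time=00:05:00

srun hostname
"""

def submit_job() -> str:
    """Write the batch script to a temp file and submit it with sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(BATCH_SCRIPT)
        path = f.name

    # sbatch prints something like "Submitted batch job 1234"
    result = subprocess.run(
        ["sbatch", path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_job())
```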


Get Started with Cilium as a Load Balancer for On-Premises Kubernetes Clusters

Organizations deploying Kubernetes in on-premises data centers or hybrid cloud environments often face challenges with exposing services externally. Unlike public cloud providers that offer managed load balancers out of the box, bare metal environments require custom solutions. This is where Cilium steps in as a powerful alternative, offering native load balancing capabilities using BGP (Border Gateway Protocol).

Cilium is more than just a CNI plugin. It enables advanced networking features such as observability, security, and load balancing, all integrated deeply with the Kubernetes networking model. Specifically, Cilium can advertise Kubernetes LoadBalancer service IPs to external routers using BGP, making these services reachable directly from external networks without relying on cloud provider load balancers or manual proxy setups. This makes it ideal for enterprises running bare metal Kubernetes clusters, air-gapped environments, or hybrid cloud setups.
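As a rough illustration, the sketch below uses the official Kubernetes Python client to create a Service of type LoadBalancer. With Cilium’s BGP control plane and a LoadBalancer IP pool configured separately (not shown here), the address Cilium assigns to this Service would be advertised to your upstream routers. The service name, namespace, selector, and ports are placeholders.

```python
from kubernetes import client, config

def create_lb_service():
    # Load kubeconfig from the default location (~/.kube/config).
    config.load_kube_config()
    v1 = client.CoreV1Api()

    # A plain LoadBalancer Service; Cilium (with BGP and LB IPAM
    # configured separately) assigns and advertises its external IP.
    svc = client.V1Service(
        metadata=client.V1ObjectMeta(name="demo-web"),
        spec=client.V1ServiceSpec(
            type="LoadBalancer",
            selector={"app": "demo-web"},  # placeholder selector
            ports=[client.V1ServicePort(port=80, target_port=8080)],
        ),
    )
    return v1.create_namespaced_service(namespace="default", body=svc)

if __name__ == "__main__":
    created = create_lb_service()
    print(f"Created Service {created.metadata.name}")
```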

Want to dive deeper? Check out our introductory blog on Cilium’s Kubernetes load balancing capabilities, then navigate to the detailed step-by-step instructions for additional information.

Using Cilium as a Kubernetes Load Balancer: A Powerful Alternative to MetalLB

In Kubernetes, exposing services of type LoadBalancer in on-prem or bare-metal environments typically requires a dedicated "Layer 2" or "BGP-based" software load balancer—such as MetalLB. While MetalLB has been the go-to solution for this use case, recent advances in Cilium, a powerful eBPF-based Kubernetes networking stack, offer a modern and more integrated alternative.

Cilium isn’t just a fast, scalable Container Network Interface (CNI). It also includes a built-in, eBPF-powered load balancer that can replace MetalLB with a more performant, secure, and cloud-native approach.
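For a sense of what replacing MetalLB looks like in practice, here is a sketch that uses the Kubernetes Python client to create a Cilium LoadBalancer IP pool, the custom resource Cilium’s LB IPAM feature uses to assign Service IPs. The API version, field layout, and CIDR below follow the v2alpha1 schema and are assumptions that can vary across Cilium releases, so check the documentation for the version you run.

```python
from kubernetes import client, config

def create_ip_pool():
    config.load_kube_config()
    api = client.CustomObjectsApi()

    # CiliumLoadBalancerIPPool tells Cilium which addresses it may
    # hand out to Services of type LoadBalancer. The group/version
    # and spec layout follow the v2alpha1 schema; newer Cilium
    # releases may differ, so treat this as illustrative.
    pool = {
        "apiVersion": "cilium.io/v2alpha1",
        "kind": "CiliumLoadBalancerIPPool",
        "metadata": {"name": "demo-pool"},
        "spec": {"cidrs": [{"cidr": "192.168.100.0/28"}]},  # example range
    }
    return api.create_cluster_custom_object(
        group="cilium.io",
        version="v2alpha1",
        plural="ciliumloadbalancerippools",
        body=pool,
    )

if __name__ == "__main__":
    create_ip_pool()
    print("IP pool created")
```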


Cost Management for SageMaker AI: The Case for Strong Administrative Guardrails

Enterprises are increasingly leveraging Amazon SageMaker AI to empower their data science teams with scalable, managed machine learning (ML) infrastructure. However, without proper administrative controls, SageMaker AI usage can lead to unexpected cost overruns and significant waste.

In large organizations where dozens or hundreds of data scientists may be experimenting concurrently, this risk compounds quickly.
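As one example of a lightweight guardrail, the sketch below uses boto3 to list SageMaker endpoints and notebook instances and flags anything that has been running longer than a configurable threshold. The three-day threshold and the decision to only report (rather than stop) resources are assumptions for illustration; a production guardrail would typically also enforce tags, budgets, pagination, and automated cleanup.

```python
from datetime import datetime, timedelta, timezone
import boto3

MAX_AGE = timedelta(days=3)  # illustrative threshold

def flag_long_running_resources():
    """Report SageMaker endpoints and notebooks older than MAX_AGE."""
    sm = boto3.client("sagemaker")
    now = datetime.now(timezone.utc)

    # In-service endpoints accrue cost continuously.
    for ep in sm.list_endpoints(StatusEquals="InService")["Endpoints"]:
        age = now - ep["CreationTime"]
        if age > MAX_AGE:
            print(f"Endpoint {ep['EndpointName']} running for {age}")

    # Notebook instances left running are a common source of waste.
    for nb in sm.list_notebook_instances(StatusEquals="InService")["NotebookInstances"]:
        age = now - nb["CreationTime"]
        if age > MAX_AGE:
            print(f"Notebook {nb['NotebookInstanceName']} running for {age}")

if __name__ == "__main__":
    flag_long_running_resources()
```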


BioContainers: Streamlining Bioinformatics with the Power of Portability

In today's fast-paced world of bioinformatics, the constant evolution of tools, dependencies, and operating system environments presents a significant challenge. Researchers often spend countless hours grappling with software installation, configuration, and version conflicts, hindering their ability to focus on scientific discovery. Enter BioContainers – a revolutionary approach that leverages containerization technology to package bioinformatics software and its entire environment into self-contained, portable units.

Imagine a meticulously organized lab where every experiment, regardless of its complexity, can be instantly replicated with identical results.

This is the promise of BioContainers. Built upon established container platforms like Docker and Singularity, BioContainers encapsulate everything a bioinformatics tool needs to run: the application itself, its libraries, dependencies, and even specific operating system configurations.
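To illustrate, the sketch below shells out to Docker to run samtools from a BioContainers image hosted on quay.io. The specific image tag is an example and should be replaced with whichever pinned version your pipeline requires; Singularity or Apptainer users would swap the docker invocation accordingly.

```python
import subprocess

# Example BioContainers image; pin the exact tag your pipeline requires.
IMAGE = "quay.io/biocontainers/samtools:1.17--h00cdaf9_0"  # illustrative tag

def run_samtools_version() -> str:
    """Run `samtools --version` inside the container and return its output."""
    result = subprocess.run(
        ["docker", "run", "--rm", IMAGE, "samtools", "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(run_samtools_version())
```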


Why Inventory Management is Table Stakes for GPU Clouds

In the world of GPU clouds, where speed, scalability, and efficiency are paramount, it’s surprising how many “Neo cloud” providers still manage their infrastructure the old-fashioned way—through spreadsheets.

As laughable as it sounds, this is the harsh reality. Inventory management, one of the most foundational aspects of a reliable cloud platform, is often overlooked or underbuilt. And for modern GPU clouds, that’s a deal breaker.


Comparing HPA and KEDA: Choosing the Right Tool for Kubernetes Autoscaling

In Kubernetes, autoscaling is key to ensuring application performance while managing infrastructure costs. Two powerful tools that help achieve this are the Horizontal Pod Autoscaler (HPA) and Kubernetes Event-Driven Autoscaling (KEDA). While they share the goal of scaling workloads, their approaches and capabilities are quite different.

In this introductory blog, we will provide a bird's-eye view of how they compare and when you might choose one over the other.
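For orientation, here is a sketch that creates a basic CPU-based HPA with the Kubernetes Python client; a KEDA equivalent would instead be a ScaledObject custom resource pointing at an external event source. The deployment name, namespace, replica bounds, and utilization target are placeholders.

```python
from kubernetes import client, config

def create_cpu_hpa():
    config.load_kube_config()
    autoscaling = client.AutoscalingV2Api()

    # Scale the (placeholder) "demo-web" Deployment between 2 and 10
    # replicas, targeting 70% average CPU utilization.
    hpa = client.V2HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="demo-web-hpa"),
        spec=client.V2HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V2CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="demo-web"
            ),
            min_replicas=2,
            max_replicas=10,
            metrics=[
                client.V2MetricSpec(
                    type="Resource",
                    resource=client.V2ResourceMetricSource(
                        name="cpu",
                        target=client.V2MetricTarget(
                            type="Utilization", average_utilization=70
                        ),
                    ),
                )
            ],
        ),
    )
    return autoscaling.create_namespaced_horizontal_pod_autoscaler(
        namespace="default", body=hpa
    )

if __name__ == "__main__":
    created = create_cpu_hpa()
    print(f"Created HPA {created.metadata.name}")
```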


A Fresh New Look: Rafay Console Gets a UI/UX Makeover for Enhanced Usability

At Rafay, we believe that user experience is as critical as the powerful automation capabilities we deliver. With that commitment in mind, we’ve been working for the last few months on a revamp of the Rafay Console User Interface (UI). The changes are purposeful and designed to streamline navigation, increase operational clarity, and elevate your productivity. Whether you’re managing clusters, deploying workloads, or orchestrating environments, the new interface will put everything you need right at your fingertips.

The new UI will launch as part of our v3.5 release, scheduled to roll out at the end of May 2025. We understand change is hard, and it can take a few hours for users to get used to the new experience. Note that existing projects and configurations remain unchanged, and users can continue managing their infrastructure and applications without interruption.

In this blog, we provide a closer look at the most impactful improvements and how they will benefit our users.
