Self-Service Slurm Clusters on Kubernetes with Rafay GPU PaaS

In the previous blog, we discussed how Project Slinky bridges the gap between Slurm, the de facto job scheduler in HPC, and Kubernetes, the standard for modern container orchestration.

Project Slinky and Rafay’s GPU Platform-as-a-Service (PaaS) are a transformative combination that enables secure, multi-tenant, self-service access to Slurm-based HPC environments on shared Kubernetes clusters. Together, they allow cloud providers and enterprise platform teams to offer Slurm-as-a-Service on Kubernetes without compromising on performance, usability, or control.

Design

Project Slinky: Bringing Slurm Scheduling to Kubernetes

As high-performance computing (HPC) environments evolve, there’s an increasing demand to bridge the gap between traditional HPC job schedulers and modern cloud-native infrastructure. Project Slinky is an open-source project that integrates Slurm, the industry-standard workload manager for HPC, with Kubernetes, the de facto orchestration platform for containers.

This enables organizations to deploy and operate Slurm-based workloads on Kubernetes clusters, allowing them to leverage the best of both worlds: Slurm’s mature, job-centric HPC scheduling model and Kubernetes’s scalable, cloud-native runtime environment.
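To make this concrete, here is a minimal sketch of what interacting with a Slinky-managed cluster can look like: standard Slurm commands executed inside a login pod running on Kubernetes. The namespace and pod name used here (slurm, slurm-login-0) are assumptions for illustration and will differ depending on how Slinky is deployed.

```python
# Illustrative sketch: running Slurm CLI commands against a Slinky-managed
# cluster by exec'ing into its login pod. Namespace and pod name are assumed.
import subprocess

NAMESPACE = "slurm"          # assumed namespace for the Slinky deployment
LOGIN_POD = "slurm-login-0"  # assumed login pod that exposes the Slurm CLI

def run_in_slurm_pod(*cmd: str) -> str:
    """Run a Slurm CLI command inside the login pod via kubectl exec."""
    result = subprocess.run(
        ["kubectl", "exec", "-n", NAMESPACE, LOGIN_POD, "--", *cmd],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Inspect the partitions and nodes registered with the Slurm controller.
print(run_in_slurm_pod("sinfo"))

# Submit a simple batch job, exactly as you would on a traditional HPC cluster.
print(run_in_slurm_pod("sbatch", "--job-name=hello-slinky", "--wrap=hostname"))
```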

Project Slinky

Get Started with Cilium as a Load Balancer for On-Premises Kubernetes Clusters

Organizations deploying Kubernetes in on-premises data centers or hybrid cloud environments often face challenges with exposing services externally. Unlike public cloud providers that offer managed load balancers out of the box, bare metal environments require custom solutions. This is where Cilium steps in as a powerful alternative, offering native load balancing capabilities using BGP (Border Gateway Protocol).

Cilium is more than just a CNI plugin. It enables advanced networking features, such as observability, security, and load balancing—all integrated deeply with the Kubernetes networking model. Specifically, Cilium can advertise Kubernetes LoadBalancer service IPs to external routers using BGP, making these services reachable directly from external networks without needing to rely on cloud-native load balancers or manual proxy setups. This is ideal for enterprises running bare metal Kubernetes clusters, air-gapped environments, or hybrid cloud setups.
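As a rough illustration, here is how a LoadBalancer Service might be created with the Kubernetes Python client; once Cilium allocates an address and advertises it over BGP, the Service becomes reachable from the external network. The service name, namespace, selector, and ports below are illustrative assumptions.

```python
# Minimal sketch: create a Service of type LoadBalancer whose external IP
# Cilium can allocate and advertise to upstream routers over BGP.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod
v1 = client.CoreV1Api()

service = client.V1Service(
    metadata=client.V1ObjectMeta(name="demo-web", namespace="default"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "demo-web"},  # assumed pod label
        ports=[client.V1ServicePort(port=80, target_port=8080)],  # assumed container port
    ),
)
v1.create_namespaced_service(namespace="default", body=service)

# Once an address is assigned and advertised, it appears in the Service status.
svc = v1.read_namespaced_service("demo-web", "default")
print(svc.status.load_balancer.ingress)
```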

Want to dive deeper? Check out our introductory blog on Cilium’s Kubernetes load balancing capabilities, and see the detailed step-by-step instructions for additional information.

Using Cilium as a Kubernetes Load Balancer: A Powerful Alternative to MetalLB

In Kubernetes, exposing services of type LoadBalancer in on-prem or bare-metal environments typically requires a dedicated "Layer 2" or "BGP-based" software load balancer—such as MetalLB. While MetalLB has been the go-to solution for this use case, recent advances in Cilium, a powerful eBPF-based Kubernetes networking stack, offer a modern and more integrated alternative.

Cilium isn’t just a fast, scalable Container Network Interface (CNI). It also ships a built-in, eBPF-powered load balancer that can replace MetalLB with a more performant, secure, and cloud-native approach.
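For example, Cilium’s LB IPAM can hand out external IPs from a pool you define, taking over the role MetalLB usually plays. The sketch below creates such a pool through the Kubernetes CustomObjects API. The CRD group, version, and field layout follow the cilium.io/v2alpha1 API as we understand it and have changed across Cilium releases, so treat the field names as assumptions and confirm them against your version’s documentation.

```python
# Hedged sketch: define an address pool for Cilium LB IPAM so that
# LoadBalancer Services receive external IPs without MetalLB.
from kubernetes import client, config

config.load_kube_config()
custom = client.CustomObjectsApi()

ip_pool = {
    "apiVersion": "cilium.io/v2alpha1",
    "kind": "CiliumLoadBalancerIPPool",
    "metadata": {"name": "demo-pool"},
    "spec": {
        # Assumed field name; some Cilium releases use "blocks" instead of "cidrs".
        "cidrs": [{"cidr": "192.168.100.0/27"}],
    },
}

custom.create_cluster_custom_object(
    group="cilium.io",
    version="v2alpha1",
    plural="ciliumloadbalancerippools",
    body=ip_pool,
)
```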

Cilium based k8s Load Balancer

Cost Management for SageMaker AI: The Case for Strong Administrative Guardrails

Enterprises are increasingly leveraging Amazon SageMaker AI to empower their data science teams with scalable, managed machine learning (ML) infrastructure. However, without proper administrative controls, SageMaker AI usage can lead to unexpected cost overruns and significant waste.

In large organizations where dozens or hundreds of data scientists may be experimenting concurrently, this risk compounds quickly.
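One concrete shape such a guardrail can take is a periodic audit that reports, or stops, resources falling outside an approved instance-type policy. The sketch below is illustrative only: the approved list is a hypothetical policy, and the script is not a Rafay or AWS-provided control.

```python
# Illustrative cost-guardrail sketch: flag (and optionally stop) SageMaker
# notebook instances running on instance types outside an assumed approved list.
import boto3

APPROVED_INSTANCE_TYPES = {"ml.t3.medium", "ml.m5.xlarge"}  # hypothetical policy

sagemaker = boto3.client("sagemaker")

paginator = sagemaker.get_paginator("list_notebook_instances")
for page in paginator.paginate():
    for nb in page["NotebookInstances"]:
        name = nb["NotebookInstanceName"]
        instance_type = nb["InstanceType"]
        status = nb["NotebookInstanceStatus"]
        if status == "InService" and instance_type not in APPROVED_INSTANCE_TYPES:
            print(f"Out-of-policy notebook: {name} ({instance_type})")
            # Uncomment to enforce the guardrail rather than just report it.
            # sagemaker.stop_notebook_instance(NotebookInstanceName=name)
```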

Cost Overruns

BioContainers: Streamlining Bioinformatics with the Power of Portability

In today's fast-paced world of bioinformatics, the constant evolution of tools, dependencies, and operating system environments presents a significant challenge. Researchers often spend countless hours grappling with software installation, configuration, and version conflicts, hindering their ability to focus on scientific discovery. Enter biocontainers – a revolutionary approach that leverages containerization technology to package bioinformatics software and its entire environment into self-contained, portable units.

Imagine a meticulously organized lab where every experiment, regardless of its complexity, can be instantly replicated with identical results.

This is the promise of biocontainers. Built upon established container platforms like Docker and Singularity, biocontainers encapsulate everything a bioinformatics tool needs to run: the application itself, its libraries, dependencies, and even specific operating system configurations.
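As a quick illustration, running a BioContainers image takes little more than a container runtime and an image reference. The sketch below uses the Docker SDK for Python; the image and command are passed as arguments because BioContainers tags pin specific builds, and you should choose the exact tag you need.

```python
# Minimal sketch: run a BioContainers image with the Docker SDK for Python.
# Usage: python run_biocontainer.py <image> <command...>
# e.g.   python run_biocontainer.py quay.io/biocontainers/samtools:<pinned-tag> samtools --version
import sys
import docker

image, command = sys.argv[1], sys.argv[2:]

client = docker.from_env()

# Everything the tool needs (binaries, libraries, OS layer) ships inside the image,
# so the same command produces the same result on any machine with a container runtime.
output = client.containers.run(image, command, remove=True)
print(output.decode())
```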

BioContainers Logo

Why Inventory Management is Table Stakes for GPU Clouds

In the world of GPU clouds, where speed, scalability, and efficiency are paramount, it’s surprising how many “Neo cloud” providers still manage their infrastructure the old-fashioned way—through spreadsheets.

As laughable as it sounds, this is the harsh reality. Inventory management, one of the most foundational aspects of a reliable cloud platform, is often overlooked or underbuilt. And for modern GPU clouds, that’s a deal breaker.

Inventory Management

Introducing Platform Version with Rafay MKS Clusters

Our upcoming release introduces support for a number of new features and enhancements. One such enhancement is the introduction of Platform Versioning for Rafay MKS clusters, a major feature in our v3.5 release. This new capability is designed to simplify and standardize the upgrade lifecycle of critical components in upstream Kubernetes clusters managed by Rafay MKS.

Why Platform Version?

Upgrading Kubernetes clusters is essential, but core components such as etcd, CRI, and Salt Minion also require updates for:

  • Security patches
  • Compatibility with new Kubernetes features
  • Performance improvements

Platform Versioning introduces a structured, reliable, and repeatable upgrade path for these foundational components, reducing risk and operational overhead.

What is a Platform Version?

A Platform Version defines a tested and validated set of component versions that can be safely upgraded together. This ensures compatibility and stability across your clusters.

We are introducing v1.0.0 as the very first Platform Version for new clusters. This version includes:

  • CRI: v2.0.4
  • etcd: v3.5.21
  • Salt Minion: v3006.9

Note

For existing clusters created before Platform Versioning was introduced, the initial platform version is shown as v0.1.0 for reference. Please perform the upgrade to v1.0.0 during scheduled downtime, as it involves updates to core components such as etcd and CRI.
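Conceptually, a Platform Version is a named bundle of component versions that is validated and upgraded as a unit. The sketch below models that idea in a few lines of Python; it is purely illustrative, not Rafay’s internal format or API, and the "installed" versions in the example are hypothetical.

```python
# Purely illustrative model of the Platform Version concept: a named, validated
# bundle of component versions that are upgraded together.
from dataclasses import dataclass

@dataclass
class PlatformVersion:
    name: str
    components: dict[str, str]

# Versions taken from the v1.0.0 bundle described above.
V1_0_0 = PlatformVersion(
    name="v1.0.0",
    components={"cri": "v2.0.4", "etcd": "v3.5.21", "salt-minion": "v3006.9"},
)

def components_to_upgrade(installed: dict[str, str], target: PlatformVersion) -> dict[str, str]:
    """Return the components whose installed version differs from the target bundle."""
    return {
        name: wanted
        for name, wanted in target.components.items()
        if installed.get(name) != wanted
    }

# Hypothetical cluster created before Platform Versioning (reported as v0.1.0).
installed = {"cri": "v1.6.0", "etcd": "v3.5.9", "salt-minion": "v3006.9"}
print(components_to_upgrade(installed, V1_0_0))
# -> {'cri': 'v2.0.4', 'etcd': 'v3.5.21'}
```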

How Does Platform Versioning Work?

You can upgrade the Platform Version in two ways:

  • During a Kubernetes version upgrade
  • As a standalone platform upgrade

This flexibility allows you to keep your clusters secure and up to date, regardless of your Kubernetes upgrade schedule.

Platform Version

Controlled and Responsive Update Cadence

Platform Versions are not released frequently. New versions are published only when:

  • A high severity CVE or vulnerability is addressed
  • A major performance or compatibility feature is introduced
  • There are significant version changes in core components

This approach ensures that upgrades are meaningful and necessary, minimizing disruption.

Whenever a new Platform Version is released, existing clusters can seamlessly upgrade to the latest version, ensuring they benefit from the latest security patches and improvements without manual intervention.

Evolving Platform Versions and Expanding Coverage

We are committed to continuously improving Platform Versioning. For this initial release, we started with three foundational components, etcd, CRI, and Salt Minion, because of their critical importance to cluster stability. In future releases, we will expand the scope of Platform Versioning to cover additional critical components, ensuring your clusters remain robust, secure, and up to date.

Platform Version Documentation

For detailed documentation, see: Platform Version Docs

In Summary

Platform Versioning makes it easier than ever to keep your clusters current and secure by managing the upgrade lifecycle of foundational components like etcd, CRI, and Salt Minion.

Whether you apply it alongside a Kubernetes version bump or independently, Platform Versioning ensures your infrastructure remains stable, secure, and optimized now and in the future.

Comparing HPA and KEDA: Choosing the Right Tool for Kubernetes Autoscaling

In Kubernetes, autoscaling is key to ensuring application performance while managing infrastructure costs. Two powerful tools that help achieve this are the Horizontal Pod Autoscaler (HPA) and Kubernetes Event-Driven Autoscaling (KEDA). While they share the goal of scaling workloads, their approaches and capabilities differ significantly.

In this introductory blog, we will provide a bird's eye view of how they compare, and when you might choose one over the other.
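To ground the comparison, the sketch below creates the two objects side by side for the same hypothetical Deployment: an autoscaling/v2 HPA driven by CPU utilization, and a KEDA ScaledObject driven by a Prometheus query. The names, namespace, query, and thresholds are assumptions for illustration, and the KEDA trigger fields should be confirmed against your KEDA version’s documentation.

```python
# Side-by-side sketch: the same Deployment scaled by an HPA (resource metrics)
# versus a KEDA ScaledObject (external metric/event source).
from kubernetes import client, config

config.load_kube_config()

# --- HPA: scale on CPU utilization reported by the metrics server ---
hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="demo-web-hpa", namespace="default"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="demo-web"  # assumed Deployment
        ),
        min_replicas=2,
        max_replicas=10,
        metrics=[client.V2MetricSpec(
            type="Resource",
            resource=client.V2ResourceMetricSource(
                name="cpu",
                target=client.V2MetricTarget(type="Utilization", average_utilization=70),
            ),
        )],
    ),
)
client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler("default", hpa)

# --- KEDA: scale on an external event source, here a Prometheus query ---
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "demo-web-scaler", "namespace": "default"},
    "spec": {
        "scaleTargetRef": {"name": "demo-web"},
        "minReplicaCount": 0,   # KEDA can scale to zero; the stock HPA cannot (by default)
        "maxReplicaCount": 10,
        "triggers": [{
            "type": "prometheus",
            "metadata": {
                "serverAddress": "http://prometheus.monitoring:9090",  # assumed endpoint
                "query": "sum(rate(http_requests_total{app='demo-web'}[2m]))",
                "threshold": "100",
            },
        }],
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh", version="v1alpha1", namespace="default",
    plural="scaledobjects", body=scaled_object,
)
```

The practical difference is visible in minReplicaCount: KEDA can scale a workload to zero when its event source is idle, while the standard HPA keeps at least one replica running.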

HPA vs KEDA