

Stop Paying for Resources Your Pods Don't Need

If you manage Kubernetes infrastructure at scale, you already know the pattern. Development teams request CPU and memory "just to be safe." Nobody wants their app to OOM. Nobody wants to get paged at 2am because a pod got throttled. So requests get padded and they stay padded.

The result? Clusters are full of pods consuming far less than what they've been allocated. Nodes are running hot on paper but idle in practice. And the platform team responsible for cost governance across dozens of clusters, projects, and namespaces has no easy way to prove it.
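
The gap described above is straightforward to quantify once you have requests and live usage side by side. The sketch below, with entirely hypothetical pod names and numbers, shows the kind of calculation a cost-governance report performs; in practice the figures would come from the Kubernetes API and a metrics source such as metrics-server or Prometheus.

```python
# Hypothetical per-pod data: CPU requested vs. actually used (in cores).
# Real values would be pulled from the Kubernetes API and a metrics backend.
pods = {
    "checkout-api": {"requested": 2.0, "used": 0.3},
    "search-worker": {"requested": 4.0, "used": 0.5},
    "billing-cron": {"requested": 1.0, "used": 0.1},
}

def utilization_report(pods):
    """Return (total_requested, total_used, idle_fraction)."""
    total_requested = sum(p["requested"] for p in pods.values())
    total_used = sum(p["used"] for p in pods.values())
    idle = 1 - total_used / total_requested
    return total_requested, total_used, idle

requested, used, idle = utilization_report(pods)
print(f"Requested: {requested} cores, used: {used:.1f} cores, idle: {idle:.0%}")
```

With these illustrative numbers, roughly 87% of the requested CPU sits idle, which is the sort of evidence a platform team needs to push back on padded requests.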

Scaling Trust: The Fortanix and Rafay Integration for Enterprise Confidential AI

In the modern enterprise, Artificial Intelligence (AI) has moved from a "nice-to-have" experimental phase to a core business driver. However, for organizations in highly regulated sectors—such as banking, healthcare, and government—the path to AI adoption is fraught with security hurdles.

The primary concern is protecting sensitive data not just at rest or in transit, but in use. In the image below, the app uses a proprietary model which needs to be secured using confidential computing.

(Image: Confidential VM)

Traditional security measures often fall short when data must be decrypted to be processed by an AI model. This is where Confidential Computing changes the game, and why the joint integration between Fortanix and Rafay is a landmark development for the "AI Factory" of the future.

NVIDIA AICR Generates It. Rafay Runs It. Your GPU Clusters, Finally Under Control

Deploying GPU-accelerated Kubernetes infrastructure for AI workloads has never been simple. Administrators face a relentless compatibility matrix: matching GPU driver versions to CUDA releases, pinning Kubernetes versions to container runtimes, tuning configurations differently for NVIDIA H100s versus A100s, and doing all of it differently again for training versus inference.

One wrong version combination and workloads fail silently, or worse, perform far below hardware capability. For years, the answer was static documentation, tribal knowledge, and hoping that whoever wrote the runbook last week remembered to update it.

NVIDIA's AI Cluster Runtime (AICR) and the Rafay Platform represent a new approach — one where GPU infrastructure configuration is treated as code, generated deterministically, validated against real hardware, and enforced continuously across fleets of clusters.

Together, they cover the full lifecycle from first aicr snapshot to production-grade day-2 operations, with cluster blueprints as the critical bridge between the two.

(Image: Baton Pass)

From Slurm to Kubernetes: A Guide for HPC Users

If you've spent years submitting batch jobs with Slurm, moving to a Kubernetes-based cluster can feel like learning a new language. The concepts are familiar — resource requests, job queues, priorities — but the vocabulary and tooling are different. This guide bridges that gap, helping HPC veterans understand how Kubernetes handles workloads and what that means day-to-day.
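
To make the vocabulary gap concrete, here is a rough translation of a familiar Slurm resource request into a Kubernetes Job manifest. The image, job name, and values are hypothetical; this is a sketch of the mapping, not a drop-in configuration.

```yaml
# A rough Kubernetes equivalent of a Slurm batch submission such as:
#   sbatch --ntasks=1 --cpus-per-task=4 --mem=8G --gres=gpu:1 train.sh
# All names and values below are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-job              # analogous to the Slurm job name
spec:
  backoffLimit: 2              # retry count, loosely like requeue behavior
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: registry.example.com/train:latest   # hypothetical image
          command: ["bash", "train.sh"]
          resources:
            requests:
              cpu: "4"               # --cpus-per-task=4
              memory: 8Gi            # --mem=8G
            limits:
              nvidia.com/gpu: 1      # --gres=gpu:1
```

The key conceptual shift is that Slurm schedules a script against node resources, while Kubernetes schedules a container against pod-level resource requests and limits.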

(Image: Slurm to K8s)

Run nvidia-smi on Remote GPU Kubernetes Clusters Using Rafay Zero Trust Access

Infra operators managing GPU-enabled Kubernetes clusters often need a fast and secure way to validate GPU visibility, driver health, and runtime readiness without exposing the cluster directly or relying on bastion hosts, VPNs, or manually managed kubeconfigs.

With Rafay's zero trust kubectl, operators can securely access remote Kubernetes resources and execute commands inside running pods from the Rafay platform. A simple but powerful example is running nvidia-smi inside a GPU Operator pod to confirm that the NVIDIA driver stack, CUDA runtime, and GPU devices are functioning correctly on a remote cluster.

In this post, we walk through how infra operators can use Rafay's zero trust access workflow to run nvidia-smi on a remote GPU-based Kubernetes cluster.
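
As a preview of the workflow, the command sequence below sketches what this looks like once a ZTKA kubeconfig has been downloaded from the Rafay console. The pod and namespace names are illustrative and depend on how the GPU Operator was installed; this is an assumption-laden sketch, not an exact transcript.

```shell
# Assumes a ZTKA kubeconfig downloaded from the Rafay console;
# pod names, labels, and paths below are hypothetical.
export KUBECONFIG=~/Downloads/ztka-kubeconfig.yaml

# Find a GPU Operator driver pod on the target node (labels vary by install).
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o wide

# Run nvidia-smi inside it to verify the driver stack, CUDA runtime, and GPUs.
kubectl exec -n gpu-operator nvidia-driver-daemonset-abc12 -- nvidia-smi
```

All of this traffic flows through Rafay's secure relay, so no bastion host, VPN, or direct API-server exposure is involved.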

(Image: nvidia-smi over ZTKA)

How Rafay Helps GPU Clouds Run Complex Hackathons at Scale

Running a hackathon is hard. Running a GPU-powered hackathon for thousands of participants — where every developer needs a fully configured environment (notebooks, a developer pod, and so on) with dedicated GPU resources, ready to go the moment the event kicks off — is an entirely different class of problem. This is exactly where Rafay's platform has helped change the game for GPU Cloud providers.


Interact with Your Rafay Managed Kubernetes Clusters Using MCP-compatible AI clients

The Model Context Protocol (MCP) is an open standard that enables AI assistants to securely interact with external tools and systems. When used with Kubernetes, MCP allows an AI assistant to execute operations (for example, kubectl commands), retrieve live cluster state, and reason about results without requiring users to manually copy and paste output into a chat interface.

This blog uses Claude Desktop as an example AI assistant. The same approach applies to any MCP-compatible AI client.

For platform administrators, this capability enables controlled, auditable, and policy-driven AI-assisted cluster operations.


For production environments, the recommended approach is to run the MCP server locally and connect to your Kubernetes cluster using a Rafay Zero Trust Kubectl Access (ZTKA) kubeconfig.

In this model:

  • The MCP server runs on the administrator’s workstation
  • Cluster access is established through Rafay’s ZTKA secure relay
  • No inbound access to the cluster is required
  • No VPN tunnels or exposed Kubernetes API endpoints are needed

This architecture aligns with zero-trust security principles and enterprise compliance requirements.
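
For illustration, registering a local Kubernetes MCP server in Claude Desktop looks roughly like the `claude_desktop_config.json` fragment below. The server package name and the kubeconfig path are assumptions for the sketch; any MCP-compatible Kubernetes server pointed at a ZTKA kubeconfig would follow the same pattern.

```json
{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "mcp-server-kubernetes"],
      "env": {
        "KUBECONFIG": "/Users/admin/.rafay/ztka-kubeconfig.yaml"
      }
    }
  }
}
```

Because the `KUBECONFIG` points at a ZTKA kubeconfig, every command the AI assistant issues is brokered through Rafay's relay and subject to the same audit and policy controls as a human operator's kubectl session.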

Kubernetes v1.35 for Rafay MKS

As part of our continuous effort to bring the latest Kubernetes versions to our users, support for Kubernetes v1.35 will be added soon to the Rafay Operations Platform for MKS cluster types.

Both new cluster provisioning and in-place upgrades of existing clusters are supported. As with most Kubernetes releases, this version deprecates and removes a number of features. To ensure zero impact to our customers, we have validated every feature in the Rafay Kubernetes Operations Platform on this Kubernetes version. Support will be promoted from Preview to Production in a few days and made available to all customers.

Important: Platform Version 1.2.0 Required

Kubernetes v1.35 requires etcd version 3.5.24, which is delivered as part of Rafay Platform Version 1.2.0. When creating new clusters on Kubernetes v1.35, select Platform Version 1.2.0 as well. When upgrading existing clusters to Kubernetes v1.35, upgrade to Platform Version 1.2.0 first or together with the Kubernetes upgrade. Clusters cannot be provisioned on or upgraded to Kubernetes v1.35 without Platform Version 1.2.0.


Managing Environments at Scale with Fleet Plans

As organizations scale their cloud infrastructure, managing dozens or even hundreds of environments becomes increasingly complex. Whether you are rolling out security patches, updating configuration variables, or deploying new template versions, performing these operations manually on each environment is time-consuming, error-prone, and simply unsustainable.

Fleet Plans solve this challenge. This powerful feature eliminates the need to manage environments individually by enabling bulk operations across multiple environments in parallel.

(Image: Fleet Plans general flow)

Fleet Plans provide a streamlined workflow for managing multiple environments at scale, enabling bulk operations with precision and control.

Note: Fleet Plans currently support day 2 operations only, focusing on managing and updating existing environments rather than initial provisioning.