Ankur Pandita

No More SSH: Control Plane Overrides for Rafay MKS Clusters

Customizing a Kubernetes control plane has always been an uncomfortable exercise. You SSH into a master node, carefully edit a static pod manifest, and then hope nothing breaks. With our latest release, we are replacing that workflow entirely. Control Plane Overrides give you a safe, declarative way to customize the API Server, Controller Manager, and Scheduler for MKS (Managed Kubernetes Service) clusters — Rafay's upstream Kubernetes offering for bare metal and VMs — directly from the Rafay Console or cluster specification.
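As a rough illustration, a declarative override in a cluster spec might look like the sketch below. The `controlPlaneOverrides` block and its field names are assumptions made for illustration only, not the confirmed Rafay schema; consult the Rafay documentation for the exact syntax.

```yaml
# Hypothetical MKS cluster spec fragment showing control plane overrides.
# Field names under controlPlaneOverrides are illustrative, not the real schema.
apiVersion: infra.k8s.rafay.dev/v3
kind: Cluster
metadata:
  name: demo-mks-cluster
spec:
  type: mks
  config:
    controlPlaneOverrides:
      apiServer:
        extraArgs:
          audit-log-maxage: "30"            # pass extra kube-apiserver flags
      controllerManager:
        extraArgs:
          node-monitor-grace-period: "40s"  # tune node health detection
      scheduler:
        extraArgs:
          v: "2"                            # raise scheduler log verbosity
```

The point of the declarative model is that changes like these are versioned with the cluster spec and rolled out by the platform, rather than hand-edited in static pod manifests over SSH.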

Accelerating the AI Factory: Rafay & NVIDIA NCX Infra Controller (NICo)

Acquiring GPU hardware is the easy part. Turning it into a productive, multi-tenant AI service with proper isolation, self-service provisioning, and the governance to operate it at scale is where most get stuck. Custom integration work piles up, timelines slip, and the gap between racked hardware and revenue widens.

Rafay is closing that gap through a new integration with the NVIDIA NCX Infrastructure Controller (NICo), NVIDIA's open-source component for automated bare-metal lifecycle management. Together, Rafay and NICo give operators a unified platform to manage their GPU fleet and deliver cloud-like, self-service experiences to end users.

Interact with Your Rafay Managed Kubernetes Clusters Using MCP-compatible AI clients

The Model Context Protocol (MCP) is an open standard that enables AI assistants to securely interact with external tools and systems. When used with Kubernetes, MCP allows an AI assistant to execute operations (for example, kubectl commands), retrieve live cluster state, and reason about results without requiring users to manually copy and paste output into a chat interface.

This blog uses Claude Desktop as an example AI assistant. The same approach applies to any MCP-compatible AI client.

For platform administrators, this capability enables controlled, auditable, and policy-driven AI-assisted cluster operations.


For production environments, the recommended approach is to run the MCP server locally and connect to your Kubernetes cluster using a Rafay Zero Trust Kubectl Access (ZTKA) kubeconfig.

In this model:

  • The MCP server runs on the administrator’s workstation
  • Cluster access is established through Rafay’s ZTKA secure relay
  • No inbound access to the cluster is required
  • No VPN tunnels or exposed Kubernetes API endpoints are needed

This architecture aligns with zero-trust security principles and enterprise compliance requirements.
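For a concrete picture of the local-MCP-server model, the sketch below shows a Claude Desktop configuration entry. The `mcpServers` layout is Claude Desktop's documented config format; the server command name and the use of a ZTKA kubeconfig path are illustrative assumptions, since the exact Kubernetes MCP server binary will depend on your setup.

```json
{
  "mcpServers": {
    "rafay-k8s": {
      "command": "kubernetes-mcp-server",
      "env": {
        "KUBECONFIG": "/path/to/ztka-kubeconfig.yaml"
      }
    }
  }
}
```

Because the MCP server runs on the workstation and reaches the cluster through the ZTKA relay via the kubeconfig, every AI-initiated `kubectl` operation flows through the same audited, policy-controlled path as a human operator's.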

Kubernetes v1.35 for Rafay MKS

As part of our continuous effort to bring the latest Kubernetes versions to our users, support for Kubernetes v1.35 will be added soon to the Rafay Operations Platform for MKS cluster types.

Both new cluster provisioning and in-place upgrades of existing clusters are supported. As with most Kubernetes releases, this version deprecates and removes a number of features. To ensure zero impact to our customers, we have validated every feature in the Rafay Kubernetes Operations Platform on this Kubernetes version. Support will be promoted from Preview to Production in a few days and made available to all customers.

Important: Platform Version 1.2.0 Required

Kubernetes v1.35 requires etcd version 3.5.24 which is delivered as part of Rafay Platform Version 1.2.0. When creating new clusters based on Kubernetes v1.35, select Platform Version 1.2.0 along with it. For upgrading existing clusters to Kubernetes v1.35, upgrade to Platform Version 1.2.0 first or together with the Kubernetes upgrade. Clusters cannot be provisioned or upgraded to Kubernetes v1.35 without Platform Version 1.2.0.
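In cluster-spec terms, the version pairing described above looks roughly like the fragment below; the field names are illustrative assumptions, not the confirmed schema.

```yaml
# Illustrative spec fragment: Kubernetes v1.35 must be paired with
# Platform Version 1.2.0, which delivers etcd 3.5.24.
spec:
  config:
    kubernetesVersion: v1.35.0   # requires etcd 3.5.24
    platformVersion: "1.2.0"     # ships etcd 3.5.24
```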

Kubernetes v1.35 Release

Managing Environments at Scale with Fleet Plans

As organizations scale their cloud infrastructure, managing dozens or even hundreds of environments becomes increasingly complex. Whether you are rolling out security patches, updating configuration variables, or deploying new template versions, performing these operations manually on each environment is time-consuming, error-prone, and simply unsustainable.

Fleet Plans solve this challenge: a powerful feature that eliminates the need to manage environments individually by enabling bulk operations across multiple environments in parallel.

Fleet Plans General Flow

Fleet Plans provide a streamlined workflow for managing multiple environments at scale, enabling bulk operations with precision and control.

Note: Fleet Plans currently support day 2 operations only, focusing on managing and updating existing environments rather than initial provisioning.
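To make the workflow concrete, a Fleet Plan might be declared along the lines of the sketch below. This is an illustrative shape only; the kind name, selector syntax, and operation fields are assumptions, and the actual Rafay Fleet Plan schema may differ.

```yaml
# Hypothetical Fleet Plan sketch: roll a new template version out to
# every environment labeled env=dev, in parallel.
kind: FleetPlan
metadata:
  name: patch-dev-environments
spec:
  fleet:
    kind: environments
    labelSelector: env=dev            # select target environments by label
  operations:
    - name: bump-template-version
      action:
        type: update
        payload:
          templateVersion: v1.4.0     # Day 2 update applied to each match
```

The key idea is that one declarative object drives the same operation across the whole selected fleet, instead of repeating it per environment.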

Granular Control of Your EKS Auto Mode Managed Nodes with Custom Node Classes and Node Pools

A couple of releases back, we added EKS Auto Mode support to our platform, covering both quick and custom configuration. In this blog, we will explore how you can create an EKS cluster using quick configuration, and then dive deep into creating custom node classes and node pools using addons to deploy them on EKS Auto Mode enabled clusters.
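For reference, custom node classes and node pools on EKS Auto Mode follow the `eks.amazonaws.com/v1` NodeClass and `karpenter.sh/v1` NodePool CRDs. The sketch below shows their general shape, packaged as manifests you could deploy through a Rafay addon; exact fields may vary by EKS version, so treat the values as illustrative.

```yaml
# Minimal sketch of an EKS Auto Mode custom NodeClass plus a NodePool
# that references it. Values are examples, not recommendations.
apiVersion: eks.amazonaws.com/v1
kind: NodeClass
metadata:
  name: custom-nodeclass
spec:
  ephemeralStorage:
    size: 100Gi                       # larger node-local disk
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: custom-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com      # bind the pool to the class above
        kind: NodeClass
        name: custom-nodeclass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]       # restrict to on-demand capacity
```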

Dynamic Resource Allocation for GPU Allocation on Rafay's MKS (Kubernetes 1.34)

This blog demonstrates how to leverage Dynamic Resource Allocation (DRA) for efficient GPU allocation using Multi-Instance GPU (MIG) strategy on Rafay's Managed Kubernetes Service (MKS) running Kubernetes 1.34.

In our previous blog series, we covered various aspects of Dynamic Resource Allocation (DRA) in Kubernetes:

DRA is GA in Kubernetes 1.34

With Kubernetes 1.34, Dynamic Resource Allocation (DRA) is Generally Available (GA) and enabled by default on MKS clusters. This means you can immediately start using DRA features without additional configuration.
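With DRA GA in 1.34, GPU requests move from extended resources to the `resource.k8s.io/v1` API. The sketch below shows the general shape: a ResourceClaimTemplate requesting a device from a device class, and a pod consuming it. The `mig.nvidia.com` class name assumes the NVIDIA DRA driver is installed on the cluster; adjust to whatever device classes your driver exposes.

```yaml
# Illustrative DRA request for a MIG GPU slice (resource.k8s.io/v1, GA in 1.34).
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: mig-gpu-template
spec:
  spec:
    devices:
      requests:
        - name: mig-slice
          exactly:
            deviceClassName: mig.nvidia.com   # assumes NVIDIA DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu                 # consume the claim declared below
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: mig-gpu-template
```

Each pod instantiated from the template gets its own claim, and the scheduler places the pod only on a node where the driver can satisfy the requested device.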

Prerequisites

Before we begin, ensure you have:

  • A Rafay MKS cluster running Kubernetes 1.34 (see MKS v1.34 Blog)
  • GPU nodes with compatible NVIDIA GPUs (A100, H100, or similar MIG-capable GPUs)
  • Container Device Interface (CDI) enabled (automatically enabled in MKS for Kubernetes 1.34)
  • Basic understanding of Dynamic Resource Allocation concepts (covered in our previous blog series)
  • Active Rafay account with appropriate permissions to manage MKS clusters and addons

Kubernetes v1.34 for Rafay MKS

As part of our continuous effort to bring the latest Kubernetes versions to our users, support for Kubernetes v1.34 will be added soon to the Rafay Operations Platform for MKS cluster types.

Both new cluster provisioning and in-place upgrades of existing clusters are supported. As with most Kubernetes releases, this version deprecates and removes a number of features. To ensure zero impact to our customers, we have validated every feature in the Rafay Kubernetes Operations Platform on this Kubernetes version. Support will be promoted from Preview to Production in a few days and made available to all customers.

Kubernetes v1.34 Release

Upstream Kubernetes on RHEL 10 using Rafay

Our upcoming release update will add support for a number of new features and enhancements. This blog is focused on the upcoming support for Upstream Kubernetes on nodes based on Red Hat Enterprise Linux (RHEL) v10.0. Both new cluster provisioning and in-place upgrades of Kubernetes clusters will be supported for lifecycle management.
