Cloud providers offering GPU or Neo Cloud services need accurate and automated mechanisms to track resource consumption. Usage data becomes the foundation for billing, showback, or chargeback models that customers expect. The Rafay Platform provides usage metering APIs that can be easily integrated into a provider’s billing system.
In this blog, we’ll walk through how to use these APIs with a sample Python script to generate detailed usage reports.
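As a rough illustration of the kind of report such a script might produce, the sketch below rolls usage records up into billable GPU-hours per customer. The record fields (customer, sku, gpu_hours), the sample values, and the rate are hypothetical placeholders, not the actual shape of the Rafay metering API response.

```python
from collections import defaultdict

# Hypothetical usage records; the real metering API response will differ.
SAMPLE_RECORDS = [
    {"customer": "acme", "sku": "gpu-a", "gpu_hours": 10.0},
    {"customer": "acme", "sku": "gpu-a", "gpu_hours": 2.5},
    {"customer": "globex", "sku": "gpu-a", "gpu_hours": 4.0},
]


def summarize_usage(records, rate_per_gpu_hour):
    """Return total GPU-hours and cost per customer."""
    hours = defaultdict(float)
    for rec in records:
        hours[rec["customer"]] += rec["gpu_hours"]
    return {
        customer: {"gpu_hours": total, "cost": round(total * rate_per_gpu_hour, 2)}
        for customer, total in sorted(hours.items())
    }


if __name__ == "__main__":
    report = summarize_usage(SAMPLE_RECORDS, rate_per_gpu_hour=2.0)
    for customer, row in report.items():
        print(f"{customer}: {row['gpu_hours']} GPU-hours, ${row['cost']:.2f}")
```

Keeping the aggregation separate from the API call makes it easy to swap in real records once the response format is known.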
Our upcoming release update will add support for a number of new features and enhancements. This blog is focused on the upcoming support for Upstream Kubernetes on nodes based on Red Hat Enterprise Linux (RHEL) v10.0. Both new cluster provisioning and in-place upgrades of Kubernetes clusters will be supported for lifecycle management.
At Rafay, we are continuously evolving our platform to deliver powerful capabilities that streamline and accelerate the software delivery lifecycle. One such enhancement is the recent update to our GitOps pipeline engine, designed to optimize execution time and flexibility — enabling a better experience for platform teams and developers alike.
Rafay provides a tightly integrated pipeline framework that supports a range of common operational use cases, including:
System Synchronization: Use Git as the single source of truth to orchestrate controller configurations
Application Deployment: Define and automate your app deployment process directly from version-controlled pipelines
Approval Workflows: Insert optional approval gates to control when and how specific pipeline stages are triggered, offering an added layer of governance and compliance
This comprehensive design empowers platform teams to standardize delivery patterns while still accommodating organization-specific controls and policies.
Historically, Rafay’s GitOps pipeline executed all stages sequentially, regardless of interdependencies. While effective for simpler workflows, this model lengthened total run time for more complex pipelines, because stages that did not depend on each other still had to wait their turn.
With our latest update, the pipeline engine now supports Directed Acyclic Graphs (DAGs) — allowing stages to execute in parallel, wherever dependencies allow.
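To make the idea concrete, here is a minimal sketch of how a DAG of stages can be grouped into "waves" that are safe to run in parallel. It uses Python's standard graphlib module and is only an illustration of the general technique, not Rafay's pipeline engine; the stage names are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_waves(dependencies):
    """Group stages into waves; every stage in a wave has all dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages that can start right now
        waves.append(ready)
        sorter.done(*ready)  # mark them finished to unlock their dependents
    return waves


# Hypothetical pipeline: each stage maps to the set of stages it depends on.
STAGES = {
    "build": set(),
    "lint": set(),
    "deploy-staging": {"build", "lint"},
    "approval": {"deploy-staging"},
    "deploy-prod": {"approval"},
}

if __name__ == "__main__":
    for number, wave in enumerate(parallel_waves(STAGES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

In this example, build and lint share the first wave because neither depends on the other, while the remaining stages still run one after another.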
Bitnami recently announced significant changes to its container image distribution. As part of this update, the Bitnami public catalog (docker.io/bitnami) will be permanently deleted on September 29th.
All existing container images (including older or versioned tags such as 2.50.0, 10.6, etc.) will be moved from the public catalog (docker.io/bitnami) to a Bitnami Legacy repository (docker.io/bitnamilegacy).
The legacy catalog will no longer receive updates or support. It is intended only as a temporary migration solution to give users time to transition.
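As a stopgap during migration, image references can be rewritten to point at the legacy repository. The helper below is a small sketch of that rewrite; the function name and the example image tags are made up for illustration, and since the legacy catalog receives no updates, this should be treated as a temporary measure only.

```python
LEGACY_REPO = "bitnamilegacy/"


def to_legacy_image(image):
    """Rewrite a Bitnami public-catalog image reference to the legacy repository."""
    for prefix in ("docker.io/bitnami/", "bitnami/"):
        if image.startswith(prefix):
            # Preserve the explicit registry host if the reference had one.
            registry = "docker.io/" if prefix.startswith("docker.io/") else ""
            return registry + LEGACY_REPO + image[len(prefix):]
    return image  # not a Bitnami public-catalog image; leave it unchanged


if __name__ == "__main__":
    # Hypothetical image references.
    for ref in ("docker.io/bitnami/redis:7.2", "bitnami/nginx:1.25", "docker.io/library/nginx:1.25"):
        print(ref, "->", to_legacy_image(ref))
```

References that already point at docker.io/bitnamilegacy, or at any other repository, pass through untouched.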
Implementing Day-2 Operations such as agent replacement is cumbersome today because every configuration tied to a previous agent must be reconfigured manually. This makes tasks like scaling, retiring agents, or handling failures both error-prone and time-consuming.
To address this pain point, we are introducing the concept of an Agent Pool.
Instead of binding configurations directly to individual agents, customers can now attach multiple agents to a shared Agent Pool. Configurations such as Environment Templates and Resource Templates reference the pool, rather than a single agent.
This simple shift brings significant operational benefits:
Seamless Failover and Replacement: Add or remove agents from a pool without reconfiguring existing associations.
Simplified Day-2 Operations: Manage scaling, upgrades, and retirements without disruption.
Load Balancing: Distribute load across multiple agents within a pool for higher availability and performance.
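The indirection can be sketched in a few lines. In the toy model below, a template holds a reference to a pool rather than to an agent, so agents can be swapped without touching the template. The class names and the round-robin selection are assumptions made for illustration; the post does not say how the platform actually distributes work within a pool.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPool:
    name: str
    agents: list = field(default_factory=list)
    _cursor: int = 0  # position of the next agent to hand out

    def add(self, agent):
        if agent not in self.agents:
            self.agents.append(agent)

    def remove(self, agent):
        self.agents.remove(agent)

    def next_agent(self):
        """Pick agents in round-robin order (an assumed strategy)."""
        if not self.agents:
            raise RuntimeError(f"pool {self.name!r} has no agents")
        agent = self.agents[self._cursor % len(self.agents)]
        self._cursor += 1
        return agent


@dataclass
class EnvironmentTemplate:
    name: str
    pool: AgentPool  # the template references the pool, not a single agent


if __name__ == "__main__":
    pool = AgentPool("pool-1", ["agent-a", "agent-b"])
    template = EnvironmentTemplate("env-template", pool)
    pool.remove("agent-a")  # retire an agent
    pool.add("agent-c")     # bring in a replacement
    print([template.pool.next_agent() for _ in range(3)])
```

The template is never edited when agent-a is retired, which is the point of binding configuration to the pool.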
Artificial intelligence (AI) and high-performance computing (HPC) workloads are evolving at unprecedented speed. Enterprises today require infrastructure that can scale elastically, provide consistent performance, and ensure secure multi-tenant operation. NVIDIA’s Performance Reference Architecture (PRA), built on HGX platforms with Shared NVSwitch GPU Passthrough Virtualization, delivers precisely this capability.
This is the introductory blog in a multi-part series. In this blog, we explain why PRA is critical for modern enterprises and service providers, highlight the benefits of adoption, and outline the key steps required to successfully deploy and support the PRA design/architecture.
When it came to selecting an immutable operating system for Rafay's Kubernetes Distribution (Rafay MKS), we found ourselves evaluating two strong contenders: Talos and Flatcar Linux. Both offered immutability and a focus on running containers, but in the end, Flatcar Linux won out for our needs. In this blog, we provide a deeper look into why we made that choice, and how the pros and cons stacked up.
Whether you're training deep learning models, running simulations, or just curious about your GPU's performance, nvidia-smi is your go-to command-line tool. Short for NVIDIA System Management Interface, this utility provides essential real-time information about your NVIDIA GPU’s health, workload, and performance.
In this blog, we’ll explore what nvidia-smi is and how to use it, then walk through real output from a system with an NVIDIA T1000 8GB GPU.
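For scripting, nvidia-smi can also emit machine-readable output through its --query-gpu and --format=csv options. The sketch below parses that CSV form into dictionaries. The helper names are our own, and the sample line is illustrative only, not captured from a real system.

```python
import subprocess

QUERY_FIELDS = ["name", "utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        # Values stay as strings because nvidia-smi may report "[N/A]".
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi on this host; requires an installed NVIDIA driver."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    return parse_gpu_csv(subprocess.check_output(cmd, text=True))


# Illustrative line only; real values depend on the system.
SAMPLE = "NVIDIA T1000 8GB, 12, 1024, 8192, 45\n"

if __name__ == "__main__":
    print(parse_gpu_csv(SAMPLE))
```

On a machine with a GPU, calling query_gpus() instead of parsing SAMPLE returns the live readings in the same shape.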
In the world of FinOps, precise cost allocation is more than just a “nice to have”; it’s the foundation for accurate chargeback, accountability, and informed decision-making. With Rafay’s latest release, Chargeback Summary Reports aggregated by namespace now support custom label-based metadata enrichment.
This enhancement empowers FinOps teams to add business-relevant metadata (like team or cost_center) directly into their cost reports, making it easier to trace expenses to the right owners and justify resource consumption.
In large, multi-tenant Kubernetes environments, namespaces often represent workloads owned by different teams, applications, or business units. Without enriched metadata, a FinOps practitioner might see “Namespace A” incurring costs, but need extra steps to figure out which team or cost center is responsible.
Now, you can define specific label keys (e.g., team, cost_center) in the chargeback report configuration, and Rafay will automatically include them as additional columns in the report—populated with values from the namespace labels. This directly embeds organizational context into your cost visibility.
Note:
This enhancement applies only to chargeback reports aggregated by namespace, not to reports aggregated by namespace label. The reason: when a primary label value (e.g., cost_center) is shared by multiple namespaces whose secondary label values (e.g., team) differ, there is no single secondary value to show for the combined row, so the report cannot aggregate on the primary label.
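Conceptually, the enrichment is a lookup of each namespace's labels followed by adding one column per configured key. The sketch below shows that join on made-up data; the function name, row layout, and label values are hypothetical and do not reflect Rafay's report format.

```python
def enrich_report(cost_rows, namespace_labels, label_keys):
    """Append one column per configured label key to each namespace row."""
    enriched = []
    for row in cost_rows:
        labels = namespace_labels.get(row["namespace"], {})
        # Missing labels become empty strings so every row has the same columns.
        extra = {key: labels.get(key, "") for key in label_keys}
        enriched.append({**row, **extra})
    return enriched


# Hypothetical per-namespace costs and namespace labels.
COST_ROWS = [
    {"namespace": "payments", "cost": 120.0},
    {"namespace": "ledger", "cost": 80.0},
    {"namespace": "scratch", "cost": 5.0},
]
NAMESPACE_LABELS = {
    "payments": {"team": "checkout", "cost_center": "cc-100"},
    "ledger": {"team": "finance-eng", "cost_center": "cc-100"},
}

if __name__ == "__main__":
    for row in enrich_report(COST_ROWS, NAMESPACE_LABELS, ["team", "cost_center"]):
        print(row)
```

Here payments and ledger share a cost_center but belong to different teams, which is exactly the case where rows must stay per namespace rather than being merged by label.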
Modern enterprises rarely run applications in a single cluster. A production fleet might include on-prem clusters in Singapore and London, a regulated environment in AWS us-east-1, and a developer sandbox on someone’s laptop. GitOps with Argo CD is the natural way to keep all those clusters in the desired state, but the moment clusters live in different security domains (firewalled data centers, private VPCs, or even air-gapped networks), the simple argocd cluster add story breaks down:
Bespoke bastion hosts or VPN tunnels for every hop
Long-lived bearer-token Secrets stashed in Argo’s namespace
High latency between the GitOps engine and far-flung clusters, turning reconciliations into a slog
Rafay’s Zero-Trust Kubectl Access (ZTKA) solves all three problems in one stroke by front-loading the connection with a hardened Kube API Access Proxy and issuing just-in-time (JIT), short-lived ServiceAccounts inside every cluster.
In this blog, we will describe how Rafay’s Zero-Trust Kubectl Access proxy gives Argo CD a secure path to every cluster in the fleet, even when those clusters sit deep behind corporate firewalls.