
NVIDIA NIM Operator: Bringing AI Model Deployment to the Kubernetes Era

In the previous blog, we learned the basics of NIM (NVIDIA Inference Microservices). In this follow-on blog, we take a deep dive into the NIM Kubernetes Operator, a Kubernetes-native extension that automates the deployment and management of NVIDIA’s NIM containers. By combining the strengths of Kubernetes orchestration with NVIDIA’s optimized inference stack, the NIM Operator makes it dramatically easier to deliver production-grade generative AI at scale.

NIM Operator

NVIDIA NIM: Why It Matters—and How It Stacks Up

Generative AI is moving from experiments to production, and the bottleneck is no longer training—it’s serving: getting high-quality model inference running reliably, efficiently, and securely across clouds, data centers, and the edge.

NVIDIA’s answer is NIM (NVIDIA Inference Microservices). NIM is a set of prebuilt, performance-tuned containers that expose industry-standard APIs for popular model families (LLMs, vision, speech) and run anywhere there’s an NVIDIA GPU. Think of NIM as a “batteries-included” model-serving layer that blends TensorRT-LLM optimizations, Triton runtimes, security hardening, and OpenAI-compatible APIs into one deployable unit.
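To make the OpenAI-compatible part concrete, here is a minimal sketch of querying a NIM LLM container from Python using the standard OpenAI client. The endpoint URL, port, and model name are illustrative placeholders rather than values from any specific deployment; substitute whatever your own NIM instance exposes.

# Minimal sketch: talking to a NIM LLM container through its OpenAI-compatible API.
# The base_url, api_key handling, and model name below are assumed placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed address of a locally running NIM container
    api_key="not-used",                   # local deployments typically do not validate the key
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",   # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize what NIM provides in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)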

NIM Logo

NVIDIA Performance Reference Architecture: An Introduction

Artificial intelligence (AI) and high-performance computing (HPC) workloads are evolving at unprecedented speed. Enterprises today require infrastructure that can scale elastically, provide consistent performance, and ensure secure multi-tenant operation. NVIDIA’s Performance Reference Architecture (PRA), built on HGX platforms with Shared NVSwitch GPU Passthrough Virtualization, delivers precisely this capability.

This is the introductory blog in a multi-part series. In this blog, we explain why PRA is critical for modern enterprises and service providers, highlight the benefits of adoption, and outline the key steps required to successfully deploy and support the PRA design.

Deep Dive into nvidia-smi: Monitoring Your NVIDIA GPU with Real Examples

Whether you're training deep learning models, running simulations, or just curious about your GPU's performance, nvidia-smi is your go-to command-line tool. Short for NVIDIA System Management Interface, this utility provides essential real-time information about your NVIDIA GPU’s health, workload, and performance.

In this blog, we’ll explore what nvidia-smi is, how to use it, and walk through a real output from a system using an NVIDIA T1000 8GB GPU.


What is nvidia-smi?

nvidia-smi is a CLI utility bundled with the NVIDIA driver. It enables:

  • Real-time GPU monitoring
  • Driver and CUDA version discovery
  • Process visibility and control
  • GPU configuration and performance tuning

You can execute it using:

nvidia-smi
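
Beyond the default dashboard view, nvidia-smi also has a query mode that emits machine-readable CSV. The sketch below (in Python, with an assumed choice of fields) shells out to that query mode and parses a few standard fields; adjust the field list to whatever you want to track.

# Minimal sketch: polling GPU stats by calling nvidia-smi's query mode and parsing CSV.
# The query fields used here (name, utilization.gpu, memory.used, memory.total,
# temperature.gpu) are standard nvidia-smi query fields.
import subprocess

def gpu_stats():
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,utilization.gpu,memory.used,memory.total,temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    stats = []
    for line in out.strip().splitlines():
        name, util, mem_used, mem_total, temp = [field.strip() for field in line.split(",")]
        stats.append({
            "name": name,
            "utilization_pct": int(util),
            "memory_used_mib": int(mem_used),
            "memory_total_mib": int(mem_total),
            "temperature_c": int(temp),
        })
    return stats

if __name__ == "__main__":
    for gpu in gpu_stats():
        print(gpu)

The same fields are available directly on the command line via nvidia-smi --query-gpu=... --format=csv if you only need a quick one-off reading.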

Choosing Your Engine for LLM Inference: The Ultimate vLLM vs. TensorRT LLM Guide

This is the next blog in our series on LLMs and Generative AI. When deploying large language models (LLMs) for inference, it is critical to consider efficiency, scalability, and performance. Many users will already be familiar with two market-leading options: vLLM and Nvidia's TensorRT LLM.

In this blog, we dive into their pros and cons, helping users select the most appropriate option for their use case.

vLLM vs TensorRT LLM

Fractional GPUs using Nvidia's KAI Scheduler

At KubeCon Europe in April 2025, Nvidia announced and launched the Kubernetes AI (KAI) Scheduler, an open-source project maintained by Nvidia.

The KAI Scheduler is an advanced Kubernetes scheduler that allows administrators of Kubernetes clusters to dynamically allocate GPU resources to workloads. Users of the Rafay Platform can immediately leverage the KAI scheduler via the integrated Catalog.

KAI in Catalog

To help you understand the basics quickly, we have also created a brief video introducing the concepts and a live demonstration showcasing how you can allocate fractional GPU resources to workloads.

Spatial Partitioning of GPUs using Nvidia MIG

In the prior blogs, we discussed why GPUs are managed differently in Kubernetes, how the GPU Operator helps streamline management, and various strategies for sharing GPUs on Kubernetes. In 2020, Nvidia introduced Multi-Instance GPU (MIG), which takes GPU sharing to a different level.

In this blog, we will start by reviewing some common industry use cases where MIG is used and then dive deeper into how MIG is configured and used.

Nvidia MIG

GPU Sharing Strategies in Kubernetes

In the previous blogs, we discussed why GPUs are managed differently in Kubernetes and how the GPU Operator can help streamline management. In Kubernetes, although you can request fractional CPU units for workloads, you cannot request fractional GPU units.

Pod manifests must request GPU resources in integers, which results in an entire physical GPU being allocated to one container even if that container requires only a fraction of its resources. In this blog, we will describe two popular and commonly used strategies for sharing a GPU on Kubernetes.
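
As a concrete illustration of the integer-only constraint, here is a minimal sketch, using the Kubernetes Python client, of a pod spec that requests one whole GPU through the nvidia.com/gpu resource. The pod name, container name, and image tag are placeholders; requesting a fractional value such as "0.5" here is rejected by Kubernetes, which is exactly the limitation that GPU sharing strategies aim to work around.

# Minimal sketch (Kubernetes Python client): a pod that requests one whole GPU via
# the nvidia.com/gpu resource. Fractional values (e.g. "0.5") are not accepted, so
# the entire physical GPU is tied to this single container. Names and the image tag
# are illustrative placeholders.
from kubernetes import client, config

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="cuda-example"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda-container",
                image="nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",  # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # must be a whole number of GPUs
                ),
            )
        ],
    ),
)

# To actually create the pod against a cluster:
# config.load_kube_config()
# client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)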

GPU Sharing in Kubernetes