Mohan Atreya¶

May 12, 2025
in Product Blog, K8s 1.33, User Namespaces, Lateral Movement, Container Escape
5 min read

Introduction to User Namespaces in Kubernetes

In Kubernetes, some features arrive quietly, but leave a massive impact. Kubernetes v1.33 is turning out to be one such release where there are some features with massive impact. In the previous blog, my colleague described how you can provision and operate Kubernetes v1.33 clusters on bare metal and VM based environments using Rafay.

In this blog, we will discuss a new feature in v1.33 called User Namespaces. This feature is not a headline grabber such as a service mesh etc, but is a game changer for container security.

Container in a Jail

May 9, 2025
in Product Blog, Serverless Inference, LLMs, GPU PaaS
4 min read

Powering Multi-Tenant, Serverless AI Inference for Cloud Providers

The AI revolution is here, and Large Language Models (LLMs) are at its forefront. Cloud providers are uniquely positioned to offer powerful AI inference services to their enterprise and retail customers. However, delivering these services in a scalable, multi-tenant, and cost-effective serverless manner presents significant operational challenges.

Rafay enables cloud providers deliver Serverless Inference to 100s of users and enterprises.

Info

Earlier this week, we announced our Multi-Tenant Serverless Inference offering for GPU & Sovereign Cloud Providers. Learn more about this here.

Multi Tenant

April 30, 2025
in Product Blog, LLM, Inference, Family, Lineage
4 min read

Family vs. Lineage: Unpacking Two Often-Confused Ideas in the LLM World

LLMs have begun to resemble sprawling family trees. Folks that are relatively new to LLMs will notice two words appear constantly in technical blogs: "family" and "lineage".

They sound interchangeable and users frequently conflate them. But, they describe different slices of an LLM’s life story.

Important

Understanding the differences is more than trivia. This determines how you pick models, tune them, and keep inference predictable at scale.

LLM Family vs Lineage

April 29, 2025
in Product Blog, LLM, Inference, Family
3 min read

Why “Family” Matters in the World of LLMs

When GPU bills run into six digits and every millisecond of latency counts, platform teams learn that vocabulary choices and hidden-unit counts aren’t the only things that separate one model checkpoint from another.

LLMs travel in families—lineages of models that share a common architecture, tokenizer, and training recipe. Think of them the way you might think of Apple’s M-series chips or Toyota’s Prius line: the tuning changes, the size varies, but the underlying design stays stable enough that tools, drivers, and workflows remain interchangeable.

In this blog, we will learn about what we mean by a family for LLMs and why this matters for Inference.

LLM Family

April 28, 2025
in Product Blog, LLM, vLLM, Nvidia, TensorRT-LLM
3 min read

Choosing Your Engine for LLM Inference: The Ultimate vLLM vs. TensorRT LLM Guide

This is the next blog in the series of blogs on LLMs and Generative AI. When deploying large language models (LLMs) for inference, it is critical to consider: efficiency, scalability, and performance. Users will likely be very familiar with two market leading options: vLLM and Nvidia's TensorRT LLM.

In this blog, we dive into their pros and cons, helping users select the most appropriate option for their use case.

vLLM vs TensorRT LLM

April 25, 2025
in Product Blog, LLM, Quantization, Inference
3 min read

Demystifying Quantization: Why It Matters for LLMs and Inference Efficiency

As Large Language Models (LLMs) like GPT, LLaMA, and DeepSeek reach hundreds of billions of parameters, the demand for high-speed, low-cost inference has skyrocketed. Quantization is a technique that helps drastically reduces model size and computational requirements by using lower-precision numbers. In this blog, we will discuss quantization and why it is essential.

Quantization

April 24, 2025
in Product Blog, LLM, Compile, Inference
3 min read

Compiling a LLM for High Performance Inference

This is the next blog in the blog series on LLMs and Inference. In the previous blog on LLMs and Inference, we discussed about the safetensors format for LLMs. In this blog, we will walk through a critical step for LLM Inference.

Compiling a Large Language Model (LLM) generally refers to optimizing the model’s computational graph and kernel execution to improve inference or training performance on specific hardware (like GPUs or TPUs). Think of this as the next logical step that is performed after loading a model.

LLM Compilation

April 23, 2025
in Product Blog, LLM, Safetensors, Inference
3 min read

Safetensors: The Secure, Scalable Format Powering LLM Inference

As Large Language Models (LLMs) like LLaMA, Mistral, and DeepSeek continue to scale into the hundreds of billions of parameters, model efficiency becomes as important as model quality.

One often-overlooked bottleneck is the model loading format. This is one of the primary focus areas for safetensors.

April 18, 2025
in Product Blog, AWS SageMaker AI, Domains, User Profiles, Self Service
4 min read

End-User Self-Service for Automated User Profile Creation in SageMaker Domains

As organizations expand their use of Amazon SageMaker to empower data scientists and machine learning (ML) engineers, managing access to development environments becomes a critical concern. In the last blog, we discussed how SageMaker Domains can provide isolated, secure, and fully-featured environments for users.

However, manually creating user profiles for every user quickly becomes a bottleneck—especially in large or fast-growing organizations. Asking users to submit an IT ticket and wait for days before it can be fulfilled is unacceptable in today's fast paced environment.

In this blog, we will describe how organizations use Rafay's GPU PaaS to provide their users with a self-service experience to onboard themselves into SageMaker Domains without waiting on IT or platform teams. This not only improves efficiency and user experience but also ensures consistency and compliance across the organization.

SageMaker AI Self Service

April 17, 2025
in Product Blog, AWS SageMaker AI, Domains, User Profiles
4 min read

Why Enterprises Should Use Domains for SageMaker AI

As organizations continue to invest in artificial intelligence (AI) and machine learning (ML) to drive digital transformation, the demand for streamlined, secure, and scalable development environments has never been greater.

Many organizations that are standardized on Amazon AWS may use Amazon SageMaker AI to build, train, and deploy machine learning models at scale with minimal operational overhead. SageMaker AI provides a fully managed environment that streamlines the entire ML lifecycle, enabling faster innovation, stronger governance, and cost-effective AI development.

In this introductory blog, we will describe one of the most critical capabilities of SageMaker AI called Domains. In the next blog, we will describe how organizations can scale their AI/ML teams by providing their data scientists and ML engineers with a self service experience for access to SageMaker Domains.

SageMaker AI Logo