Rafay Newsletter-September 2024

Welcome to the September 2024 edition of the Rafay customer newsletter. This month, we’re excited to bring you the latest product enhancements and insightful content crafted to help you make the most of your AI/ML, Kubernetes, and cloud-native operations.

Every month, we publish a number of incremental updates: product documentation, new functionality, videos on our YouTube channel, tech blogs, and more. Our users told us it would be great if we summarized all of the month's updates in a newsletter that they can read or listen to in 10 minutes.



Podcast Format

Info

We have also published a 10-minute podcast version of the newsletter for users who would prefer to listen to it.


Updates for Sep 2024

Aggregate and Visualize GPU Metrics across Multiple Clusters
A video showcasing how the Rafay platform aggregates GPU metrics from hundreds of clusters into a central time series database.

What GPU Metrics to Monitor and Why
With the increasing reliance on GPUs for compute-intensive tasks such as machine learning and deep learning, it's important to monitor critical GPU metrics. This series of five blogs dives into key metrics: GPU utilization, memory usage, SM clock, power consumption and framebuffer usage. The series also explains how tracking these metrics helps prevent failures, optimize performance, and reduce operational costs.
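To make the idea of aggregating these metrics across clusters concrete, here is a minimal Python sketch. It is not Rafay's implementation: the `GpuSample` fields, the helper functions, the 20% threshold and the sample values are all hypothetical, chosen only to illustrate grouping per-GPU readings by cluster and flagging clusters with low average utilization.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GpuSample:
    """One reading for one GPU. Field names are illustrative assumptions."""
    cluster: str
    gpu_utilization_pct: float    # percent of time the GPU was busy
    framebuffer_used_mib: float   # GPU memory currently in use
    framebuffer_total_mib: float  # total GPU memory
    sm_clock_mhz: float
    power_watts: float


def summarize_by_cluster(samples):
    """Group samples by cluster and compute simple per-cluster aggregates."""
    grouped = {}
    for s in samples:
        grouped.setdefault(s.cluster, []).append(s)
    summary = {}
    for cluster, rows in grouped.items():
        summary[cluster] = {
            "avg_utilization_pct": mean(r.gpu_utilization_pct for r in rows),
            "avg_memory_used_pct": mean(
                100.0 * r.framebuffer_used_mib / r.framebuffer_total_mib
                for r in rows
            ),
            "avg_sm_clock_mhz": mean(r.sm_clock_mhz for r in rows),
            "total_power_watts": sum(r.power_watts for r in rows),
        }
    return summary


def underutilized(summary, threshold_pct=20.0):
    """Return clusters whose average GPU utilization is below the threshold."""
    return sorted(
        c for c, m in summary.items() if m["avg_utilization_pct"] < threshold_pct
    )


# Made-up sample values, purely for illustration.
samples = [
    GpuSample("cluster-a", 90.0, 30000.0, 40000.0, 1410.0, 300.0),
    GpuSample("cluster-a", 70.0, 20000.0, 40000.0, 1410.0, 250.0),
    GpuSample("cluster-b", 10.0, 4000.0, 40000.0, 1200.0, 80.0),
]
summary = summarize_by_cluster(samples)
print(underutilized(summary))
```

In practice the readings would come from a metrics exporter and be stored in a time series database rather than an in-memory list; the grouping and thresholding logic is the part this sketch is meant to show.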


Recent Product Releases

Our September 2024 release brings a host of new features designed to streamline and enhance your infrastructure management experience. Here are some major highlights from this release.

  • Support for Kubernetes 1.31
    Rafay’s MKS (i.e. Rafay’s Kubernetes distribution) for data centers, edge and private cloud environments now supports the latest Kubernetes version 1.31. Existing clusters can be upgraded in-place to this version as well.

  • User Access Reports for Kubernetes
    Addressing the growing need for access reviews, we’ve added compliance-focused user access reports. These reports help meet regulatory requirements, such as SOX and HIPAA, by providing detailed visibility into who has access to what across your Kubernetes clusters. We also wrote a detailed blog about this feature.

  • Workload Identity Support for AKS
    This feature enables secure access to Azure services without the need to manage secrets. We also wrote a detailed blog about this feature.

  • Custom Provider for Environment Manager
    Earlier this year, we added support for CNCF’s OpenTofu provider, which has quickly become heavily used by customers. With this release, users can now embed their custom code (written in Go, Python, etc.) into Environment Manager workflows using the custom provider.

  • Dashboards for Environment Manager
    Users now have a centralized view of all environments and resources across the organization. A common use case is to quickly identify versions that are out of date and bring them current.

Info

Navigate to our official roadmap if you are interested in learning about what is releasing in the next few months.


Product Documentation

To support our customers using Rafay’s Kubeflow-based MLOps platform, we have created a number of step-by-step Getting Started guides.

End-to-End MLOps Pipeline
This guide is based on the Iris dataset and shows how a data scientist/ML engineer can create a pipeline to train a model, validate it, register it in a model registry, serve the model on an endpoint (inference) and test it.
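The stages named above (train, validate, register, serve, test) can be sketched in plain Python. This is not the code from the guide and it does not use Kubeflow: the nearest-centroid classifier, the in-memory `REGISTRY`, the function names and the 0.9 accuracy gate are all assumptions made for illustration. The feature rows are a small subset of the Iris dataset.

```python
from statistics import mean

# A few rows from the Iris dataset:
# (sepal length, sepal width, petal length, petal width), species
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]
HOLDOUT = [
    ((4.7, 3.2, 1.3, 0.2), "setosa"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"),
]

REGISTRY = {}  # hypothetical in-memory stand-in for a model registry


def train(rows):
    """Train step: compute one centroid (mean feature vector) per species."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(col) for col in zip(*feats))
        for label, feats in by_label.items()
    }


def predict(model, features):
    """Return the species whose centroid is closest to the given features."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def validate(model, rows):
    """Validate step: accuracy on held-out rows."""
    correct = sum(predict(model, f) == label for f, label in rows)
    return correct / len(rows)


def register(name, model, accuracy, min_accuracy=0.9):
    """Register step: store the model only if it passes the accuracy gate."""
    if accuracy < min_accuracy:
        raise ValueError(f"accuracy {accuracy:.2f} is below {min_accuracy}")
    REGISTRY.setdefault(name, []).append(model)
    return len(REGISTRY[name])  # 1-based version number


def serve(name, version):
    """Serve step: return a callable standing in for an inference endpoint."""
    model = REGISTRY[name][version - 1]
    return lambda features: predict(model, features)


model = train(TRAIN)
accuracy = validate(model, HOLDOUT)
version = register("iris-centroid", model, accuracy)
endpoint = serve("iris-centroid", version)

# Test step: send one more Iris row (a setosa sample) to the endpoint.
print(endpoint((5.0, 3.4, 1.5, 0.2)))
```

In a real pipeline each function would typically run as its own pipeline step, with the registry and the endpoint provided by the platform rather than by a dictionary and a lambda.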

Deep Learning Pipeline
This guide is based on the Titanic dataset and shows how a data scientist/ML engineer can create a TensorFlow based deep learning pipeline to train a model, register it in a model registry and visualize the model and its parameters using the integrated TensorBoard console.

Training in Jupyter Notebook
In these two guides, we show how users can train a model in a Jupyter notebook using TensorFlow and PyTorch.


Product Videos

We created and posted a number of new videos on Rafay’s YouTube channel for the benefit of our customers. The ones from September are highlighted below.

End-to-End MLOps Pipeline in Rafay’s Kubeflow based MLOps Offering
A video showcasing an example of an end-to-end MLOps pipeline implemented using Rafay’s Kubeflow based MLOps offering.

Deep Learning Pipeline based on TensorFlow in Kubeflow
A video showcasing how you can implement a Deep Learning pipeline based on TensorFlow in Kubeflow and visualize the model in the integrated TensorBoard web application.

How to build Container Images inside Jupyter Notebooks
A video showcasing how you can build a container image inside a Jupyter notebook in a Kubeflow pipeline using Kaniko.

PyTorch vs TensorFlow in 2024
A video of a conversation discussing our blog on PyTorch vs TensorFlow. We published the blog right before the PyTorch 2024 conference in San Francisco.

Org-wide Cluster Add-On Standardization using Golden Blueprints
A video showcasing how users can use Golden Blueprints for cluster add-on standardization. Customers rely extensively on Golden Blueprints to implement org-wide standards for Kubernetes clusters.


We blog extensively every month. Here are some blogs we think you may enjoy.

  • Building an Extensible GenAI Copilot: What We Learned
    In this blog, we wrote about our experience and learnings in the development of our Copilot. Building an enterprise-grade GenAI application and operating it in production has many challenges, including finding the right LLM, managing costs, devising data access controls, deploying prompt guardrails and maintaining observability.

  • PyTorch vs. TensorFlow: A Deep Dive
    This blog compares PyTorch and TensorFlow, two of the most widely used deep learning frameworks. It covers ease of use, ecosystem support, and production-readiness, helping users decide which tool best suits their specific AI/ML workloads.

  • Secure Access to Azure Services Using Workload Identity for Azure AKS
    In this blog, we discuss how to securely access Azure services using workload identities in Azure Kubernetes Service (AKS). It covers the integration of AAD Pod Identity and Managed Identity, providing a scalable solution for managing access without the need for secrets.

  • User Access Reports for Kubernetes
    As security and compliance requirements evolve, regular access reviews have become mandatory for organizations. This blog highlights Rafay’s new Kubernetes user access reports feature, which helps platform teams stay on top of security and meet compliance needs efficiently.


Events & Conferences

Q4 is event season in the tech industry. Here are two events that we recommend you attend in October and November if possible. We would love to have you stop by our booth and chat with us.

  • NVIDIA AI Summit (October 7-9, 2024)
    Catch us at the upcoming NVIDIA AI Summit in Washington DC, where we will showcase Rafay’s latest innovations in GPU monitoring and AI/ML platform management. Visit our booth to see live demos and learn how Rafay can support your AI/ML initiatives.

  • KubeCon North America (November 12-15, 2024)
    Meet us at KubeCon NA in Salt Lake City. This is the Cloud Native Computing Foundation’s flagship conference, where adopters and technologists from leading open source and cloud native communities come together. In addition to advances in Rafay’s Kubernetes Management and Environment Management offerings, we will also show you how platform teams can extend into AI/ML initiatives in their organizations.


Summary

Stay tuned for more updates and resources as we continue to deliver new capabilities that enable your teams to accelerate AI/ML adoption and cloud-native operations. If you have any questions or need assistance, don’t hesitate to reach out to our support team. Also, keep an eye out for our October 2024 newsletter, which we expect to publish in a few weeks.

Our sincere thanks to our customers and users who spend time reading our product blogs and updates and watching our videos. Please contact the Rafay Product Team if you would like us to write about other topics.