Skip to content

Product Blog

Bringing DevOps and Automation to Machine Learning via MLOps

The vast majority of organizations are new to AI/ML. As a result, most in-house systems and processes supporting this is likely ad-hoc. Industry analysts like Gartner forecast that organizations will need to quickly transition from Pilots to Production with AI/ML in order to make it across the chasm.

Most organizations already have reasonably mature DevOps processes and systems in place. So, going mainstream with AI should be a walk in the park. Correct? Turns out that this is not really true “IT leaders responsible for AI are discovering the AI pilot paradox, where launching pilots is deceptively easy but deploying them into production is notoriously challenging.” by Chirag Dekate, Gartner

In this blog, we will try and answer the following question:

Why do we need a new process called MLOps when most organizations already have reasonably mature DevOps practices? How is MLOps different from DevOps?

DevOps vs MLOps

GPU Metrics - SM Clock

In the previous blog, we discussed why tracking and reporting GPU Memory Utilization metrics matters. In this blog, we will dive deeper into another critical GPU metric i.e. GPU SM Clock. The GPU SM clock (Streaming Multiprocessor clock) metric refers to the clock speed at which the GPU's cores (SMs) are running.

The SM is the main processing unit of the GPU, responsible for executing compute tasks such as deep learning operations, simulations, and graphics rendering. Monitoring the SM clock speed can help users assess the performance and health of your GPU during workloads and detect potential bottlenecks related to clock speed throttling.

GPU SM Clock

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

GPU Metrics - Memory Utilization

In the introductory blog on GPU metrics, we discussed about the GPU metrics that matter and why they matter. In this blog, we will dive deeper into one of the critical GPU metrics i.e. GPU Memory Utilization.

GPU memory utilization refers to the percentage of the GPU’s dedicated memory (i.e. framebuffer) that is currently in use. It measures how much of the available GPU memory is occupied by data such as models, textures, tensors, or intermediate results during computation.

GPU Memory Utilization

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

What GPU Metrics to Monitor and Why?

With the increasing reliance on GPUs for compute-intensive tasks such as machine learning, deep learning, data processing, and rendering, both infrastructure administrators and users of GPUs (i.e. data scientists, ML engineers and GenAI app developers) require timely access and insights into performance, efficiency, and overall health of their GPU resources.

In order to make data driven, logical decisions, it is critical for these users to have access to critical metrics for their GPUs. This is the first blog in a series where we will describe the GPU metrics that you should track and monitor. In subsequent blogs, we will do a deep dive into each metric, why it matters and how to use it effectively.

Intro to GPU Metrics

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

PyTorch vs. TensorFlow: A Comprehensive Comparison in 2024

Note

Listen to a conversation based on this blog post. Tell us what you think about it.

When it comes to deep learning frameworks, PyTorch and TensorFlow are two of the most prominent tools in the field. Both have been widely adopted by researchers and developers alike, and while they share many similarities, they also have key differences that make them suitable for different use cases.

We thought this blog would be timely especially with the PyTorch 2024 Conference right around the corner.

In this blog, we’ll explore the main differences between PyTorch and TensorFlow across several dimensions such as ease of use, dynamic vs. static computation, ecosystem, deployment, community, and industry adoption. In a follow-on blog, we will describe how Rafay’s customers use both PyTorch and TensorFlow for their AI/ML projects.

PyTorch vs TensorFlow

Secure Access to Azure Services using Workload Identity for Azure AKS

Although Azure Kubernetes Service (AKS) allows you to deploy containerized workloads in a managed Kubernetes environment, developers still need to deal with the challenge of securely managing access to Azure resources (e.g. Key Vault or Azure Storage). Traditionally, secrets like API keys or service account credentials are used to authenticate and authorize workloads, but this approach presents security risks and operational overhead.

In Azure for AKS clusters, developers have access to something similar called Workload Identity. It is a modern, secure, and scalable way to manage access without the hassle of managing secrets. In this blog post, we'll dive deep into what Workload Identity is, how it works in AKS, and why it's a game-changer for Kubernetes clusters on Azure.

App Accessing Azure Service

Note

In a related blog, we will see how users can achieve something similar in Amazon EKS clusters using EKS Pod Identity.

User Access Reports for Kubernetes

Access reviews are required and mandated by regulations such as SOX, HIPAA, GLBA, PCI, NYDFS, and SOC-2. Access reviews are critical to help organizations maintain a strong risk management posture and uphold compliance. These reviews are typically conducted on a periodic basis (e.g. monthly, quarterly or annually) depending on the organization's policies and tolerance to risk.

Providing auditors with periodic access to user access reports for Kubernetes is a critical task for any typical platform team. This becomes onerous and burdensome especially for organizations that operate 10s or 100s of Kubernetes clusters that are used by 100s of app developers and SREs. Doing this via manual processes is impractical.

General Process

In this blog, we will look at why user access reports are critical for organizations and how Rafay's customers implement this with very high levels of automation.

EC2 versus Fargate for Amazon EKS: A Cost Comparison

When it comes to running workloads on Amazon Web Services (AWS), two popular choices are Amazon Elastic Compute Cloud (EC2) and AWS Fargate. Both have their merits, but understanding their cost implications is crucial for making an informed decision.

In this blog, we'll dive into a cost comparison of EC2 and Fargate configurations within an Amazon Elastic Kubernetes Service (EKS) cluster.