Mohan Atreya

Solutions for Key Kubernetes Challenges for AI/ML in the Enterprise - Part 2

This is part-2 of our blog series on challenges and solutions for AI/ML in the enterprise. This blog is based on our learnings over the last two years as we worked very closely with our customers that make extensive use of Kubernetes for AI/ML use cases. In part-1, we looked at the following:

  • Why Kubernetes is particularly compelling for AI/ML.
  • Some of the key challenges that organizations will encounter with AI/ML and Kubernetes.

In this part, we will look at some innovative approaches by which organizations can address these challenges.

Key Kubernetes Challenges for AI/ML in the Enterprise - Part 1

This blog is based on our learnings over the last two years as we worked very closely with our customers that make extensive use of Kubernetes for AI/ML.

This is part-1 of a two-part series. In part-1, we will

  • Start by looking at why Kubernetes is particularly compelling for AI/ML.
  • Describe some of the key challenges that organizations will encounter with AI/ML and Kubernetes

In part-2, we will look at ways by which organizations can address these challenges.

Announcing our April 2023 (v1.24) Release

A few weeks back in early April 2023, we upgraded our Preview environment to v1.24 of the Rafay Kubernetes Operations Platform. Our sincere thanks to our customers and partners that have been actively testing the new functionality. We have received timely feedback that we have been able to incorporate into our product documentation and into the platform as well.

Today, we upgraded our Production environment to this release. As always, our customers will have seamless access to the new functionality with no interruptions to their applications or clusters. In this blog, I will describe some of the new features that are part of this release.

Goldilocks Zone for AKS

In this blog, we will look at the process used by Microsoft Azure to add support for new Kubernetes versions for their "Managed" Azure Kubernetes Service (AKS). We will also look at recommendations for customers on things they need to consider to operate their AKS clusters at scale without issues.

Azure's managed Kubernetes service, AKS, is supported globally in 60+ regions. As one can imagine, it is not practical to update software in all these regions in one fell swoop. The AKS team at Microsoft employs a Safe Deployment Practice (SDP) where new releases are rolled out gradually in phases. This means that at any given time, something new is being rolled out to some region.

Note

The AKS team maintains a Release Tracker that provides visibility to customers that require it.

Considerations for In-Place Upgrades to Amazon EKS v1.24

Recently, AWS added support for Kubernetes v1.24 for their Amazon EKS offering. One significant change with this version is the removal of Dockershim, the component that allowed Kubernetes to use Docker as a container runtime through the Container Runtime Interface (CRI). Amazon EKS clusters from v1.24 onwards are standardized on "containerd".

New Amazon EKS v1.24 clusters are provisioned with containerd. Watch a brief video showcasing how customers can use Rafay to configure and provision an Amazon EKS v1.24 cluster.

When EKS clusters are upgraded to v1.24, the nodes in the EKS cluster's data plane are seamlessly migrated from "Dockershim" to "containerd".

graph LR
  A[Dockershim] --> B[Containerd];

Although this transition is mostly "behind the scenes" for users, the transition from Dockershim -> Containerd can cause disruptions to deployed applications that may be dependent on Docker. In this blog, we will look at what Rafay has done to protect our customers during an in-place upgrade to EKS v1.24.
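To make the "dependent on Docker" risk concrete, here is a minimal sketch of one way to spot such workloads before an upgrade. It assumes that a common form of Docker dependency is a pod mounting the Docker socket from the host via a `hostPath` volume; the function name `find_docker_socket_mounts` and the example pod spec are hypothetical, and this is not a description of how Rafay performs its checks.

```python
from typing import Any

# Host paths where the Docker socket is conventionally exposed (assumption for this sketch).
DOCKER_SOCKET_PATHS = ("/var/run/docker.sock", "/run/docker.sock")


def find_docker_socket_mounts(pod_spec: dict[str, Any]) -> list[str]:
    """Return the names of volumes in a pod spec that mount the Docker socket from the host."""
    hits = []
    for volume in pod_spec.get("volumes", []):
        host_path = volume.get("hostPath") or {}
        if host_path.get("path") in DOCKER_SOCKET_PATHS:
            hits.append(volume.get("name", "<unnamed>"))
    return hits


# Hypothetical pod spec: a build container that talks to the node's Docker daemon.
example_spec = {
    "containers": [{"name": "builder", "image": "example/builder:latest"}],
    "volumes": [
        {"name": "docker-sock", "hostPath": {"path": "/var/run/docker.sock"}},
        {"name": "scratch", "emptyDir": {}},
    ],
}

print(find_docker_socket_mounts(example_spec))
```

A pod flagged this way would likely break on a containerd-only node, because the socket it expects is no longer present on the host.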

Considerations for Windows Containers on Kubernetes

With increasing adoption of Kubernetes in organizations, we are seeing interest from a number of customers that would like to deploy and operate their "legacy Windows applications" on Kubernetes as well.

In this blog, we have attempted to capture our learnings from working with customers that use the Rafay Kubernetes Operations Platform to deploy and operate Kubernetes clusters with Windows based containerized applications.

Kubernetes Cluster Insights for Platform Teams

Many customers of the Rafay Kubernetes Operations Platform are "Platform Teams". In many cases, the first priority for these platform teams is to "take over and standardize" existing Kubernetes clusters in active use by application teams.

However, one challenge they run into during the takeover is that nobody on the team has complete clarity into what resources already exist on the cluster and for what purpose. Compiling an accurate list manually can be extremely error-prone and time-consuming for both the platform team and the various application teams, which delays adoption and standardization efforts.
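As a rough illustration of what an automated inventory could look like, the sketch below counts resources per namespace and kind from a list of manifests that have already been exported from a cluster. The function name `summarize_resources` and the sample data are hypothetical, and this is only a simplified stand-in, not the platform's actual insights feature.

```python
from collections import Counter


def summarize_resources(resources):
    """Count resources by (namespace, kind) so a team can see what exists and where."""
    counts = Counter()
    for resource in resources:
        metadata = resource.get("metadata", {})
        # Resources without a namespace (e.g. ClusterRole) are grouped separately.
        namespace = metadata.get("namespace", "<cluster-scoped>")
        kind = resource.get("kind", "<unknown>")
        counts[(namespace, kind)] += 1
    return counts


# Hypothetical export of resources from an existing cluster.
exported = [
    {"kind": "Deployment", "metadata": {"name": "api", "namespace": "team-a"}},
    {"kind": "Deployment", "metadata": {"name": "worker", "namespace": "team-a"}},
    {"kind": "Service", "metadata": {"name": "api", "namespace": "team-a"}},
    {"kind": "ClusterRole", "metadata": {"name": "reader"}},
]

for (namespace, kind), count in sorted(summarize_resources(exported).items()):
    print(f"{namespace:18} {kind:12} {count}")
```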

Cluster Blueprints and Drift Detection

Around three years back, we noticed many of our customers struggling with enterprise-wide standardization of their Kubernetes clusters. Every cluster in their Organization was a snowflake, and they were looking for a way to enforce that every cluster had a "baseline set of add-ons". This prompted us to develop Cluster Blueprints, which has turned out to be one of the most heavily used features in our platform.

In this blog, we will describe a superpower setting in the cluster blueprints feature that we see customers use heavily for their production clusters to secure against unplanned drift.
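Conceptually, drift is the difference between the baseline a blueprint declares and what is actually running on a cluster. The sketch below shows that comparison in its simplest form, assuming both sides can be represented as a mapping of add-on name to version; the function name `detect_drift` and the add-on names are hypothetical, and the platform's own mechanism may work quite differently.

```python
def detect_drift(baseline, actual):
    """Compare a declared add-on baseline against the add-ons found on a cluster."""
    baseline_names = set(baseline)
    actual_names = set(actual)
    return {
        # Declared in the baseline but absent from the cluster.
        "missing": sorted(baseline_names - actual_names),
        # Present on the cluster but not declared in the baseline.
        "unexpected": sorted(actual_names - baseline_names),
        # Present on both sides but at a different version.
        "changed": sorted(
            name for name in baseline_names & actual_names
            if baseline[name] != actual[name]
        ),
    }


# Hypothetical baseline and cluster state.
baseline = {"ingress": "1.8.0", "monitoring": "2.3.1", "policy-agent": "0.9.0"}
actual = {"ingress": "1.8.0", "monitoring": "2.2.0", "debug-shell": "0.1.0"}

print(detect_drift(baseline, actual))
```

An empty result in all three lists would mean the cluster matches its baseline.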

Considerations for In-Place Upgrades to Amazon EKS v1.23

Earlier this year, AWS added support for Kubernetes v1.23 for their Amazon EKS offering. One significant change with this version involves the Container Storage Interface (CSI) driver used for working with Amazon Elastic Block Store (Amazon EBS) volumes.

Specifically, the updates to the CSI driver require customers to take action to ensure a seamless upgrade process for EKS clusters from previous versions. CSI was developed for Kubernetes to replace in-tree storage drivers. With CSI, there is now a simplified plug-in model that makes it easier for storage providers to decouple their releases from the Kubernetes release cycle.

graph LR
  A[In-Tree Storage Driver] --> B[CSI Plugin for EBS CSI];

In a nutshell, this transition is good for Amazon EKS users: with the "in-tree driver", they had to upgrade the Kubernetes version of their EKS clusters just to get additional functionality or bug fixes for EBS storage, and with the CSI driver they no longer have to.
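One practical pre-upgrade check is to find StorageClasses that still reference the in-tree EBS provisioner. The sketch below does this over StorageClass manifests that have already been loaded as dictionaries. The provisioner strings `kubernetes.io/aws-ebs` (in-tree) and `ebs.csi.aws.com` (EBS CSI driver) are the standard names; the function name `in_tree_storage_classes` and the sample manifests are hypothetical.

```python
IN_TREE_EBS_PROVISIONER = "kubernetes.io/aws-ebs"
EBS_CSI_PROVISIONER = "ebs.csi.aws.com"


def in_tree_storage_classes(storage_classes):
    """Return the names of StorageClasses that still use the in-tree EBS provisioner."""
    return [
        sc["metadata"]["name"]
        for sc in storage_classes
        if sc.get("provisioner") == IN_TREE_EBS_PROVISIONER
    ]


# Hypothetical StorageClass manifests from a cluster about to be upgraded.
storage_classes = [
    {"kind": "StorageClass", "metadata": {"name": "gp2"}, "provisioner": IN_TREE_EBS_PROVISIONER},
    {"kind": "StorageClass", "metadata": {"name": "gp3-csi"}, "provisioner": EBS_CSI_PROVISIONER},
]

print(in_tree_storage_classes(storage_classes))
```

Any StorageClass reported here is a candidate for review before the cluster is moved to v1.23.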