Unique Capabilities
Rafay has developed unique capabilities around Kubeflow and its associated components to address several challenges that users face today with upstream Kubeflow. The sections below capture the differentiated approaches we have implemented in our offering.
Complexity of Installation¶
Users have to develop and maintain tooling to provision and manage the underlying infrastructure (i.e. Kubernetes clusters, databases, object stores, etc.).
Rafay's Solution¶
Rafay has invested in the development of a pre-packaged, templatized solution. In just three simple steps, admins can be operational with the complete tech stack, i.e. the required infrastructure, Kubeflow, and its dependencies and integrations.
They can update an existing deployment or upgrade to a new version of the solution simply by updating the template in Git.
Wastage due to Idling Resources¶
As data scientists and ML engineers experiment, they often forget to shut down their notebooks. These notebooks consume large amounts of expensive resources and result in substantial waste from idling infrastructure.
Rafay's Solution¶
Rafay ensures that resources used by notebooks are automatically culled (i.e. released) once they are identified as idle.
- A 30-minute idle period is configured by default (i.e. idle notebooks are paused after 30 minutes of inactivity)
- Administrators can override this culling time period, or disable notebook culling entirely as needed to suit their organization's requirements.
Shown below is the default YAML for the notebook controller, where you can see that culling is enabled.
# Default values for notebook-controller.
notebookControllerDeployment:
  manager:
    image:
      repository: docker.io/kubeflownotebookswg/notebook-controller
      tag: v1.8.0
    enableCulling: true
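Administrators who want different behavior can override the same values when deploying the chart. The snippet below is a minimal sketch that disables culling by flipping the flag shown above; the nesting mirrors the default values file, and whether additional keys (for example, one mapping to the upstream controller's CULL_IDLE_TIME setting) are exposed depends on the chart, so treat any such key names as assumptions.

# Example override values (sketch): disable notebook culling entirely
notebookControllerDeployment:
  manager:
    enableCulling: false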
Complex Updates & Upgrades¶
The standard packaging and deployment approach used by upstream Kubeflow is based heavily on Kustomize, which creates a number of challenges for users. Some of these challenges are described below.
- Lack of Templating and Reusability: Kustomize generates final manifests through overlays, which makes deployments complex to build and operate and makes it difficult to keep multiple environments (e.g. dev, prod) in sync.
- Lack of Package Management: Kustomize has no concept of a package manager or repositories, so users have to maintain their own unique configurations.
- Release Management: Kustomize does not offer release management, so users have to manually track and manage the state of their deployments. As a result, patching or upgrading the software can be a nightmare for administrators.
- Deployment Complexity: Kustomize is fine for simple configurations, but a typical Kubeflow deployment and its related components require managing complex dependencies across services and applications.
Rafay's Solution¶
Rafay has taken upstream Kubeflow and packaged the manifests into Helm charts. Because Kubeflow is a complex application, the solution is composed of more than 28 separate Helm charts that are deployed in a specific sequence to satisfy their inter-chart dependencies.
Because the application is packaged as Helm charts, customization is straightforward via each chart's values.yaml file, and updates and deletions are equally simple. Shown below is an example listing the various Helm charts that make up the solution.
.\helm.exe list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
admission-webhook kubeflow 1 2024-09-13 20:51:02.750916313 +0000 UTC deployed admission-webhook-0.1.0 v1.7.0
central-dashboard kubeflow 1 2024-09-13 20:51:21.829477345 +0000 UTC deployed central-dashboard-0.1.0 v1.8.0
cluster-local-gateway cluster-local-gateway 1 2024-09-13 20:49:18.749390063 +0000 UTC deployed cluster-local-gateway-0.1.0 1.8.0
dex kubeflow 1 2024-09-13 20:49:19.156979828 +0000 UTC deployed dex-0.1.0 v2.35.3
feast feast 1 2024-09-13 20:48:04.085658554 +0000 UTC deployed feast-feature-server-0.40.1
istio default 1 2024-09-13 20:48:13.587579972 +0000 UTC deployed istio-0.2.1 1.16.1
jupyter-web-app kubeflow 1 2024-09-13 20:49:35.298031166 +0000 UTC deployed jupyter-web-app-0.1.0 v1.8.0
katib kubeflow 1 2024-09-13 20:50:23.431460747 +0000 UTC deployed katib-0.1.0 v0.15.0
kfp-poddefault-controller rafay-aiml 1 2024-09-13 20:51:21.793609281 +0000 UTC deployed kfp-poddefault-controller-0.1.0 0.1.0
knative-eventing kubeflow 1 2024-09-13 20:51:30.79415974 +0000 UTC deployed knative-eventing-0.1.0 1.8.1
knative-serving kubeflow 1 2024-09-13 20:52:54.982729581 +0000 UTC deployed knative-serving-0.1.0 1.8.1
kserve kubeflow 1 2024-09-13 20:52:55.25044638 +0000 UTC deployed kserve-0.1.0 v0.10.0
kubeflow-issuer default 1 2024-09-13 20:48:10.334303171 +0000 UTC deployed kubeflow-issuer-0.1.0 v1.6.1
kubeflow-namespace default 1 2024-09-13 20:47:59.990616124 +0000 UTC deployed kubeflow-namespace-0.1.0 v1.6.1
kubeflow-pipelines kubeflow 1 2024-09-13 20:52:38.116966465 +0000 UTC deployed kubeflow-pipelines-0.1.0 2.0.0-alpha.7
kubeflow-roles default 1 2024-09-13 20:48:10.210884689 +0000 UTC deployed kubeflow-roles-0.1.0 v1.6.1
mlflow mlflow 1 2024-09-13 20:58:23.022259285 +0000 UTC deployed mlflow-1.3.2 2.13.1
models-web-app kubeflow 1 2024-09-13 20:53:41.30290272 +0000 UTC deployed models-web-app-0.1.0 v0.10.0
namespace-2lrzxoq-app app 1 2024-09-24 15:50:06.7557329 +0000 UTC deployed namespace-0.1.0 1.16.0
notebook-controller kubeflow 1 2024-09-13 20:50:22.561442694 +0000 UTC deployed notebook-controller-0.1.0 v1.8.0
oidc-authservice kubeflow 1 2024-09-13 20:49:35.311363044 +0000 UTC deployed oidc-authservice-0.1.0 0.2.0
profiles-and-kfam kubeflow 1 2024-09-13 20:51:43.606551765 +0000 UTC deployed profiles-and-kfam-0.1.0 v1.8.0
pvcviewer-controller kubeflow 1 2024-09-13 20:50:54.532324424 +0000 UTC deployed pvcviewer-controller-0.1.0 v1.8.0
tensorboard-controller kubeflow 1 2024-09-13 20:53:59.119132629 +0000 UTC deployed tensorboard-controller-0.1.0 v1.8.0
tensorboards-web-app kubeflow 1 2024-09-13 20:54:39.232910481 +0000 UTC deployed tensorboards-web-app-0.1.0 v1.8.0
training-operator kubeflow 1 2024-09-13 20:49:50.053378394 +0000 UTC deployed training-operator-0.1.0 v1.8.0
v2-alertmanager rafay-infra 1 2024-09-13 20:39:49.837424713 +0000 UTC deployed v2-alertmanager-0.1.0 1.16.0
v2-edge-client rafay-system 1 2024-09-13 20:39:46.29011165 +0000 UTC deployed edge-client-0.1.0 0.1.0
v2-infra rafay-system 2 2024-09-13 20:41:29.471077257 +0000 UTC deployed v2-infra-v1.0 v1.0
volumes-web-app kubeflow 1 2024-09-13 20:50:21.737807047 +0000 UTC deployed volumes-web-app-0.1.0 v1.8.0
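Because each component is a standard Helm release, administrators can also inspect what is deployed with ordinary Helm tooling. The commands below are examples only, using the notebook-controller release and kubeflow namespace from the listing above.

# Show the values a release was deployed with
helm get values notebook-controller -n kubeflow

# Show release status and revision history
helm status notebook-controller -n kubeflow
helm history notebook-controller -n kubeflow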
Lack of Standardized Workspaces¶
In a standard Kubeflow deployment, administrators have to create PodDefaults manually for every namespace. This makes it impractical for IT/Ops teams to provide their data scientists and ML engineers with a self-service onboarding experience in a multi-tenant Kubeflow environment.
Rafay's Solution¶
KFP Pod Default Controller
The PodDefault CRD allows pods to be mutated at creation time: if you create a PodDefault custom resource in a namespace, all pods in that namespace that match the PodDefault's label selector are automatically mutated. Rafay's Kubeflow controller takes a list of PodDefaults from a ConfigMap and automatically injects them into every namespace created via Kubeflow, so each Kubeflow namespace receives the same standardized PodDefaults without manual work. A sketch of a typical PodDefault follows the list of use cases below.
Common use cases include:
- Automatically adding volumes and environment variables to pods
- Automatically setting image pull secrets
- Automatically adding sidecars or init containers
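Shown below is a minimal sketch of a PodDefault that injects an environment variable and a shared volume into matching pods. The label, namespace, volume, and PVC names are illustrative placeholders.

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: add-shared-data
  namespace: user-namespace          # illustrative namespace
spec:
  selector:
    matchLabels:
      add-shared-data: "true"        # pods carrying this label are mutated
  desc: Mount the shared datasets volume and set a default data path
  env:
    - name: DATA_DIR
      value: /mnt/shared
  volumeMounts:
    - name: shared-data
      mountPath: /mnt/shared
  volumes:
    - name: shared-data
      persistentVolumeClaim:
        claimName: shared-data-pvc   # illustrative PVC name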
Lack of Secure Access¶
The administrator needs to provide every user (data scientist/ML engineer) with secure access credentials for the underlying storage buckets. The organization then has to deal with rotating and revoking those credentials, for example when a user leaves the organization.
Rafay's Solution¶
Rafay configures the MLflow-based registry to act as a proxy for artifact access. This means that administrators do not have to give users credentials or connection details for the underlying storage buckets.
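For context, this corresponds to the MLflow tracking server's proxied-artifact mode, in which the server holds the bucket credentials and reads and writes artifacts on behalf of clients. The command below is a generic sketch of that mode; the backend store URI and bucket are placeholder assumptions, and in the Rafay solution this configuration is handled by the packaged Helm chart.

# Sketch: MLflow tracking server in proxied-artifact mode (placeholder values)
mlflow server \
  --backend-store-uri postgresql://mlflow:password@mlflow-db:5432/mlflow \
  --artifacts-destination s3://example-models-bucket/mlflow \
  --serve-artifacts \
  --host 0.0.0.0 --port 5000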
For model registration, the data scientist just needs to log the model using the special mlflow-artifacts:/ prefix. The MLflow registry proxies the request to the underlying storage without requiring a separate layer of authentication from the user. MLflow is configured with the underlying bucket's credentials and stores the artifacts (i.e. the model) in the bucket.
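The snippet below is a minimal sketch of what this looks like from the data scientist's side; the tracking URI is an assumed placeholder for the MLflow endpoint exposed by the deployment.

# Sketch: log a model to the MLflow registry proxy (tracking URI is a placeholder)
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # assumed endpoint

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(model, artifact_path="model")
    # In proxied-artifact mode the returned URI uses the mlflow-artifacts:/ scheme
    # rather than a direct bucket path.
    print(mlflow.get_artifact_uri("model"))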
The data scientist compares the current run with previous runs. If the current run outperforms the older ones, they note down the run's Artifact URI (in the mlflow-artifacts:/ format) from the MLflow console and pass this URI to the ML engineer.
The ML engineer creates a KServe InferenceService CR and uses this mlflow-artifacts:/ URI as the model's storage URI.
When KServe creates the pod, the proprietary storage initializer is invoked because the URI starts with mlflow-artifacts:/. The storage initializer communicates with MLflow and pulls the model artifacts. The inference container then starts up, loads the downloaded model, and begins serving HTTP prediction requests.
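Shown below is a minimal sketch of such an InferenceService; the name, namespace, model format, and the artifact path inside storageUri are illustrative placeholders that would come from the MLflow console.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                     # illustrative name
  namespace: user-namespace          # illustrative namespace
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                # depends on how the model was logged
      storageUri: "mlflow-artifacts:/<artifact-path-from-mlflow-console>"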
Important
KServe does not reach out to the object storage buckets directly. Instead, the requests are proxied via MLflow.