Unique Capabilities
Rafay has developed unique capabilities around Kubeflow and its associated components to address several challenges that users face today with upstream Kubeflow. The sections below capture the differentiated approaches we have implemented in our offering.
Complexity of Installation¶
Users have to develop and maintain tooling to provision and manage the underlying infrastructure (i.e. Kubernetes clusters, databases, object stores, etc.).
Rafay's Solution¶
Rafay has invested in the development of a pre-packaged, templatized solution. In just three simple steps, admins can be operational with the complete tech stack, i.e. the required infrastructure, Kubeflow, and its dependencies and integrations.
They can update an existing deployment or upgrade to a new version of the solution simply by updating the template in Git.
Wastage due to Idling Resources¶
As data scientists and ML engineers experiment, they often forget to shut down their notebooks. These notebooks consume large amounts of expensive resources and result in substantial waste from idling infrastructure.
Rafay's Solution¶
Rafay ensures that resources used by notebooks are automatically culled (i.e. released) once they are identified as idle.
- A 30-minute idle period is configured by default (i.e. idle notebooks are paused after 30 minutes of inactivity)
- Administrators can override this culling time period, or disable notebook culling entirely as needed to suit their organization's requirements.
Shown below is the default YAML for the notebook controller, where you can see that culling is enabled.
# Default values for notebook-controller.
notebookControllerDeployment:
  manager:
    image:
      repository: docker.io/kubeflownotebookswg/notebook-controller
      tag: v1.8.0
    enableCulling: true
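Administrators who want different behavior can override the same values when deploying the chart. The snippet below is a minimal sketch that disables culling by flipping the flag shown above; the nesting mirrors the default values file, and whether additional keys (for example, one mapping to the upstream controller's CULL_IDLE_TIME setting) are exposed depends on the chart, so treat any such key names as assumptions.

# Example override values (sketch): disable notebook culling entirely
notebookControllerDeployment:
  manager:
    enableCulling: false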
Complex Updates & Upgrades¶
The standard packaging and deployment approach used by upstream Kubeflow is based heavily on Kustomize, which creates a number of challenges for users. Some of these challenges are described below.
- Lack of Templating and Reusability: Kustomize generates final manifests through overlays, which makes deployments complex to build and operate and makes it difficult to keep multiple environments (e.g. dev, prod) in sync.
- Lack of Package Management: Kustomize has no concept of a package manager or repositories, so users have to maintain their own unique configurations.
- Release Management: Kustomize does not offer release management, so users have to manually track and manage the state of their deployments. As a result, patching or upgrading the software can be a nightmare for administrators.
- Deployment Complexity: Kustomize is fine for simple configurations, but a typical Kubeflow deployment and its related components require managing complex dependencies across services and applications.
Rafay's Solution¶
Rafay has taken upstream Kubeflow and packaged the manifests into Helm charts. Because Kubeflow is a complex application, the solution is composed of more than 28 separate Helm charts that are deployed in a specific sequence to satisfy their inter-chart dependencies.
Because the application is packaged as Helm charts, customization is straightforward via each chart's values.yaml file, and updates and deletions are equally simple. Shown below is an example listing the various Helm charts that make up the solution.
.\helm.exe list -A
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
admission-webhook kubeflow 1 2024-09-13 20:51:02.750916313 +0000 UTC deployed admission-webhook-0.1.0 v1.7.0
central-dashboard kubeflow 1 2024-09-13 20:51:21.829477345 +0000 UTC deployed central-dashboard-0.1.0 v1.8.0
cluster-local-gateway cluster-local-gateway 1 2024-09-13 20:49:18.749390063 +0000 UTC deployed cluster-local-gateway-0.1.0 1.8.0
dex kubeflow 1 2024-09-13 20:49:19.156979828 +0000 UTC deployed dex-0.1.0 v2.35.3
feast feast 1 2024-09-13 20:48:04.085658554 +0000 UTC deployed feast-feature-server-0.40.1
istio default 1 2024-09-13 20:48:13.587579972 +0000 UTC deployed istio-0.2.1 1.16.1
jupyter-web-app kubeflow 1 2024-09-13 20:49:35.298031166 +0000 UTC deployed jupyter-web-app-0.1.0 v1.8.0
katib kubeflow 1 2024-09-13 20:50:23.431460747 +0000 UTC deployed katib-0.1.0 v0.15.0
kfp-poddefault-controller rafay-aiml 1 2024-09-13 20:51:21.793609281 +0000 UTC deployed kfp-poddefault-controller-0.1.0 0.1.0
knative-eventing kubeflow 1 2024-09-13 20:51:30.79415974 +0000 UTC deployed knative-eventing-0.1.0 1.8.1
knative-serving kubeflow 1 2024-09-13 20:52:54.982729581 +0000 UTC deployed knative-serving-0.1.0 1.8.1
kserve kubeflow 1 2024-09-13 20:52:55.25044638 +0000 UTC deployed kserve-0.1.0 v0.10.0
kubeflow-issuer default 1 2024-09-13 20:48:10.334303171 +0000 UTC deployed kubeflow-issuer-0.1.0 v1.6.1
kubeflow-namespace default 1 2024-09-13 20:47:59.990616124 +0000 UTC deployed kubeflow-namespace-0.1.0 v1.6.1
kubeflow-pipelines kubeflow 1 2024-09-13 20:52:38.116966465 +0000 UTC deployed kubeflow-pipelines-0.1.0 2.0.0-alpha.7
kubeflow-roles default 1 2024-09-13 20:48:10.210884689 +0000 UTC deployed kubeflow-roles-0.1.0 v1.6.1
mlflow mlflow 1 2024-09-13 20:58:23.022259285 +0000 UTC deployed mlflow-1.3.2 2.13.1
models-web-app kubeflow 1 2024-09-13 20:53:41.30290272 +0000 UTC deployed models-web-app-0.1.0 v0.10.0
namespace-2lrzxoq-app app 1 2024-09-24 15:50:06.7557329 +0000 UTC deployed namespace-0.1.0 1.16.0
notebook-controller kubeflow 1 2024-09-13 20:50:22.561442694 +0000 UTC deployed notebook-controller-0.1.0 v1.8.0
oidc-authservice kubeflow 1 2024-09-13 20:49:35.311363044 +0000 UTC deployed oidc-authservice-0.1.0 0.2.0
profiles-and-kfam kubeflow 1 2024-09-13 20:51:43.606551765 +0000 UTC deployed profiles-and-kfam-0.1.0 v1.8.0
pvcviewer-controller kubeflow 1 2024-09-13 20:50:54.532324424 +0000 UTC deployed pvcviewer-controller-0.1.0 v1.8.0
tensorboard-controller kubeflow 1 2024-09-13 20:53:59.119132629 +0000 UTC deployed tensorboard-controller-0.1.0 v1.8.0
tensorboards-web-app kubeflow 1 2024-09-13 20:54:39.232910481 +0000 UTC deployed tensorboards-web-app-0.1.0 v1.8.0
training-operator kubeflow 1 2024-09-13 20:49:50.053378394 +0000 UTC deployed training-operator-0.1.0 v1.8.0
v2-alertmanager rafay-infra 1 2024-09-13 20:39:49.837424713 +0000 UTC deployed v2-alertmanager-0.1.0 1.16.0
v2-edge-client rafay-system 1 2024-09-13 20:39:46.29011165 +0000 UTC deployed edge-client-0.1.0 0.1.0
v2-infra rafay-system 2 2024-09-13 20:41:29.471077257 +0000 UTC deployed v2-infra-v1.0 v1.0
volumes-web-app kubeflow 1 2024-09-13 20:50:21.737807047 +0000 UTC deployed volumes-web-app-0.1.0 v1.8.0
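Because each component is a standard Helm release, administrators can also inspect what is deployed with ordinary Helm tooling. The commands below are examples only, using the notebook-controller release and kubeflow namespace from the listing above.

# Show the values a release was deployed with
helm get values notebook-controller -n kubeflow

# Show release status and revision history
helm status notebook-controller -n kubeflow
helm history notebook-controller -n kubeflow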
Lack of Standardized Workspaces¶
In a standard Kubeflow deployment, administrators have to create PodDefaults manually for every namespace. This makes it impractical for IT/Ops teams to provide their data scientists and ML engineers with a self-service onboarding experience in a multi-tenant Kubeflow environment.
Rafay's Solution¶
KFP Pod Default Controller
The PodDefault CRD allows pods to be mutated at creation time: if you create a PodDefault custom resource in a namespace, all pods in that namespace that match the PodDefault's label selector are automatically mutated. Rafay's Kubeflow controller takes a list of PodDefaults from a ConfigMap and automatically injects them into every namespace created via Kubeflow, so each Kubeflow namespace receives the same standardized PodDefaults without manual work. A sketch of a typical PodDefault follows the list of use cases below.
Common use cases include:
- Automatically adding volumes and environment variables to pods
- Automatically setting image pull secrets
- Automatically adding sidecars or init containers
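Shown below is a minimal sketch of a PodDefault that injects an environment variable and a shared volume into matching pods. The label, namespace, volume, and PVC names are illustrative placeholders.

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: add-shared-data
  namespace: user-namespace          # illustrative namespace
spec:
  selector:
    matchLabels:
      add-shared-data: "true"        # pods carrying this label are mutated
  desc: Mount the shared datasets volume and set a default data path
  env:
    - name: DATA_DIR
      value: /mnt/shared
  volumeMounts:
    - name: shared-data
      mountPath: /mnt/shared
  volumes:
    - name: shared-data
      persistentVolumeClaim:
        claimName: shared-data-pvc   # illustrative PVC name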
Lack of Secure Access¶
The administrator needs to provide every user (data scientist/ML engineer) with secure access credentials for the underlying storage buckets. The organization then has to deal with rotating and revoking those credentials, for example when a user leaves the organization.
Rafay's Solution¶
Rafay configures the MLflow-based registry to act as a proxy for artifact access. This means that administrators do not have to give users credentials or connection details for the underlying storage buckets.
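For context, this corresponds to the MLflow tracking server's proxied-artifact mode, in which the server holds the bucket credentials and reads and writes artifacts on behalf of clients. The command below is a generic sketch of that mode; the backend store URI and bucket are placeholder assumptions, and in the Rafay solution this configuration is handled by the packaged Helm chart.

# Sketch: MLflow tracking server in proxied-artifact mode (placeholder values)
mlflow server \
  --backend-store-uri postgresql://mlflow:password@mlflow-db:5432/mlflow \
  --artifacts-destination s3://example-models-bucket/mlflow \
  --serve-artifacts \
  --host 0.0.0.0 --port 5000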
For model registration, the data scientist just needs to log the model using the special mlflow-artifacts:/ prefix. The MLflow registry proxies the request to the underlying storage without requiring a separate layer of authentication from the user. MLflow is configured with the underlying bucket's credentials and stores the artifacts (i.e. the model) in the bucket.
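The snippet below is a minimal sketch of what this looks like from the data scientist's side; the tracking URI is an assumed placeholder for the MLflow endpoint exposed by the deployment.

# Sketch: log a model to the MLflow registry proxy (tracking URI is a placeholder)
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # assumed endpoint

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

with mlflow.start_run():
    mlflow.sklearn.log_model(model, artifact_path="model")
    # In proxied-artifact mode the returned URI uses the mlflow-artifacts:/ scheme
    # rather than a direct bucket path.
    print(mlflow.get_artifact_uri("model"))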
The data scientist compares the current run with previous runs. If the current run outperforms the older ones, they note down the run's Artifact URI (in the mlflow-artifacts:/ format) from the MLflow console and pass this URI to the ML engineer.
The ML engineer creates a KServe InferenceService CR and uses this mlflow-artifacts:/ URI as the model's storage URI.
When KServe creates the pod, the proprietary storage initializer is invoked because the URI starts with mlflow-artifacts:/. The storage initializer communicates with MLflow and pulls the model artifacts. The inference container then starts up, loads the downloaded model, and begins serving HTTP prediction requests.
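Shown below is a minimal sketch of such an InferenceService; the name, namespace, model format, and the artifact path inside storageUri are illustrative placeholders that would come from the MLflow console.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                     # illustrative name
  namespace: user-namespace          # illustrative namespace
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                # depends on how the model was logged
      storageUri: "mlflow-artifacts:/<artifact-path-from-mlflow-console>"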
Important
KServe does not reach out to the object storage buckets directly. Instead, the requests are proxied via MLflow.