Administrators

The Training Operator's control plane is installed automatically as part of the default deployment of Rafay's Kubeflow-based MLOps platform.


Troubleshooting

Administrators can verify the status of the Training Operator on their MLOps Cluster using the steps below.

  • Log in to the Rafay Console as an administrator.
  • Navigate to the project where the MLOps platform's cluster is deployed.
  • Click the Kubernetes Resources link on the web console.

Training Operator Pods

  • Ensure the "namespace" selector is enabled and select the "kubeflow" namespace from the dropdown.
  • Enter "training" into the search box.
  • You should see the Training Operator pod along with its status.

Administrators can use the integrated monitoring and troubleshooting facilities in the Rafay Platform for further diagnostics.

Operator Pod

Administrators who prefer the kubectl CLI can use Rafay's Zero Trust Kubectl to troubleshoot issues, either from the integrated kubectl web shell or by downloading the kubeconfig for use with the kubectl CLI utility.

$ kubectl get pods -n kubeflow

NAME                                             READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn               1/1     Running   0          90s
......
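
For scripted checks, the pod listing can be filtered rather than read by eye. The following is a minimal sketch, assuming the operator pod's name starts with `training-operator` as in the sample output; the function name `training_operator_ready` is hypothetical and is not part of kubectl or the Rafay Platform.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and succeeds
# only if at least one training-operator pod is listed and all are Running.
training_operator_ready() {
  awk '
    $1 ~ /^training-operator/ {
      found = 1
      if ($3 != "Running") bad = 1
    }
    END { exit (found && !bad) ? 0 : 1 }
  '
}

# Run the check only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  if kubectl get pods -n kubeflow | training_operator_ready; then
    echo "Training Operator is running"
  else
    echo "Training Operator is not ready" >&2
  fi
fi
```

The helper only inspects the STATUS column, so a pod that is Running but not yet ready (for example `0/1`) would still pass; the READY column would need a separate check if that matters.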

Training Operator CRDs

  • Ensure the "cluster" selector is enabled.
  • Select "Custom Resources Definition" in the menu on the left.
  • Enter "training" into the search box.
  • You should see the Training Operator CRDs deployed on the cluster.

Using this approach, admins can verify whether the CRD for their preferred ML framework is deployed on their cluster.

Operator CRDs

Administrators who prefer the kubectl CLI can use Rafay's Zero Trust Kubectl to troubleshoot issues, either from the integrated kubectl web shell or by downloading the kubeconfig for use with the kubectl CLI utility.

$ kubectl get crd

mpijobs.kubeflow.org                                     2024-09-09T00:31:07Z
mxjobs.kubeflow.org                                      2024-09-09T00:31:05Z
paddlejobs.kubeflow.org                                  2024-09-09T00:31:09Z
pytorchjobs.kubeflow.org                                 2024-09-09T00:31:06Z
tfjobs.kubeflow.org                                      2024-09-09T00:31:04Z
xgboostjobs.kubeflow.org                                 2024-09-09T00:31:04Z
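
To confirm that the CRD for a particular framework is present without scanning the full list, the output can be filtered by name. This is a minimal sketch, assuming the CRD names follow the `<kind>.kubeflow.org` pattern shown in the listing; the helper name `has_training_crd` and the choice of `pytorchjobs` and `tfjobs` are illustrative only.

```shell
# Hypothetical helper: reads `kubectl get crd` output on stdin and succeeds
# if the CRD for the given job kind (for example pytorchjobs) is listed.
has_training_crd() {
  grep -q "^$1\.kubeflow\.org[[:space:]]"
}

# Run the check only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  for kind in pytorchjobs tfjobs; do
    if kubectl get crd | has_training_crd "$kind"; then
      echo "$kind CRD is installed"
    else
      echo "$kind CRD is missing" >&2
    fi
  done
fi
```

A single CRD can also be queried directly by name, for example `kubectl get crd pytorchjobs.kubeflow.org`, which returns an error if the CRD does not exist.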