Administrators
The Training Operator's control plane is installed automatically as part of the default deployment of Rafay's Kubeflow-based MLOps platform.
Troubleshooting
Administrators can verify the status of the Training Operator on their MLOps Cluster using the steps below.
- Log in to the Rafay Console as an administrator
- Navigate to the project where the MLOps platform's cluster is deployed
- Click the Kubernetes Resources link in the web console
Training Operator Pods
- Ensure the "namespace" selector is enabled and select the "kubeflow" namespace from the dropdown
- Enter "training" into the search box
- You should see the Training Operator pod and its status
Administrators can use the integrated monitoring and troubleshooting facilities in the Rafay Platform for further diagnostics.
Administrators who prefer the kubectl CLI can troubleshoot with Rafay's Zero Trust Kubectl, either from the integrated kubectl web shell or by downloading the kubeconfig for use with the kubectl CLI utility.
$ kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE
training-operator-658c68d697-46zmn 1/1 Running 0 90s
......
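For scripted checks, the same pod listing can be evaluated automatically. The following is a minimal sketch, not part of the Rafay tooling: the helper name `training_operator_healthy` is hypothetical, and it assumes the pod name starts with `training-operator`, as in the sample output above. It reads the `kubectl get pods` output on standard input and succeeds only if every matching pod is Running and fully ready.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin.
# Succeeds (exit 0) only if at least one pod whose name starts with
# "training-operator" is listed and all such pods are Running with
# every container ready (READY column n/n).
training_operator_healthy() {
  awk '
    $1 ~ /^training-operator/ {
      found = 1
      split($2, ready, "/")            # READY column, e.g. "1/1"
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END { exit ((found && !bad) ? 0 : 1) }
  '
}

# Usage against a live cluster (requires a working kubeconfig):
#   kubectl get pods -n kubeflow | training_operator_healthy && echo "healthy"
```

Because the helper only parses text, it can be tried on saved output before pointing it at a cluster.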
Training Operator CRDs
- Ensure the "cluster" selector is enabled
- Select "Custom Resources Definition" in the menu on the left
- Enter "training" into the search box
- You should see the Training Operator CRDs deployed on the cluster
Using this approach, admins can verify whether the CRD for their preferred ML framework is deployed on their cluster.
Administrators who prefer the kubectl CLI can list the CRDs with Rafay's Zero Trust Kubectl, either from the integrated kubectl web shell or by downloading the kubeconfig for use with the kubectl CLI utility.
$ kubectl get crd | grep kubeflow.org
mpijobs.kubeflow.org 2024-09-09T00:31:07Z
mxjobs.kubeflow.org 2024-09-09T00:31:05Z
paddlejobs.kubeflow.org 2024-09-09T00:31:09Z
pytorchjobs.kubeflow.org 2024-09-09T00:31:06Z
tfjobs.kubeflow.org 2024-09-09T00:31:04Z
xgboostjobs.kubeflow.org 2024-09-09T00:31:04Z
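To confirm that the CRDs for specific frameworks are present without scanning the list by eye, the listing can be compared against the names you expect. This is a rough sketch under stated assumptions: the helper name `missing_training_crds` is hypothetical, and the CRD names passed to it are taken from the sample output above. It accepts either the default `kubectl get crd` listing or the `-o name` form on standard input.

```shell
# Hypothetical helper: reads a CRD listing on stdin and prints every
# expected CRD (given as arguments) that does not appear in it.
# Returns non-zero if at least one expected CRD is missing.
missing_training_crds() {
  # Keep only the CRD name: strip any "<resource>/" prefix produced by
  # `-o name`, then drop trailing columns such as the creation timestamp.
  present=$(sed 's|.*/||' | awk '{print $1}')
  status=0
  for crd in "$@"; do
    # -x: whole-line match, -F: treat the name as a fixed string
    if ! printf '%s\n' "$present" | grep -qxF "$crd"; then
      echo "missing: $crd"
      status=1
    fi
  done
  return "$status"
}

# Usage against a live cluster (requires a working kubeconfig):
#   kubectl get crd -o name | missing_training_crds \
#     pytorchjobs.kubeflow.org tfjobs.kubeflow.org
```

A silent, zero exit status means every CRD named on the command line was found.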