Troubleshoot

If you encounter an issue deploying or accessing the environment, use the troubleshooting steps below to resolve it.

  • Log in to the Rafay console, navigate to Environments -> Environments, and click the MLOps environment.
  • If there are any failures, expand the failed activity and review its error messages in detail.
  • If the error messages relate to variables that were entered, edit the environment variables at the top of the screen and click Save & Deploy to reprovision the environment with the updated variables.
  • If there are issues provisioning the cluster, go to the Rafay console and navigate to Infrastructure -> Clusters. Click the cluster and check the provisioning console for error messages.
  • If there are issues provisioning GCP infrastructure and you need more error detail than the Environment Manager logs provide, log in to GCP and locate the affected resource to check for errors.

Once the issue has been identified and corrected, go back to the environment and attempt to deploy the environment again.

  • If one of the application resources (MLflow, Kubeflow or Feast) fails to deploy, review the error message within Environment Manager. Correct the issue and deploy the environment again.
  • If additional details are needed, go to Infrastructure -> Clusters and open the Resources tab of the deployed cluster. Select Pods in the left-hand pane and find any pods that are not in a Running state.
  • For each pod that is not in a Running state, select the Actions button for that pod and select events to see whether any issues are reported.
  • For each pod that is not in a Running state, select the Actions button for that pod and select shell and logs -> logs. Review the logs to determine why the pod is not running.

Once the issue has been identified and corrected, go back to the environment and attempt to deploy the environment again.
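
The pod checks above can also be approximated from a terminal. The following is a minimal sketch, not part of the Rafay console workflow: it assumes you have kubectl installed and a kubeconfig that points at the deployed cluster, neither of which is described in this guide. The function names find_unhealthy_pods and fetch_pods are hypothetical helpers introduced for illustration.

```python
import json
import subprocess


def find_unhealthy_pods(pod_list):
    """Return (namespace, name, state) for pods that need a closer look.

    pod_list is the parsed JSON from `kubectl get pods -o json`.
    Succeeded pods (completed jobs) are treated as healthy. A pod whose
    phase is Running but whose containers are not all ready is reported
    too, because a crash-looping pod can still report phase Running.
    """
    unhealthy = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            continue
        if phase == "Running":
            containers = status.get("containerStatuses", [])
            if all(c.get("ready", False) for c in containers):
                continue
            phase = "Running (containers not ready)"
        unhealthy.append((meta.get("namespace", ""), meta.get("name", ""), phase))
    return unhealthy


def fetch_pods():
    """Assumption: kubectl is on PATH and configured for the cluster."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling find_unhealthy_pods(fetch_pods()) returns the pods to investigate. For each one, `kubectl describe pod <name> -n <namespace>` shows its events and `kubectl logs <name> -n <namespace>` shows its logs, which corresponds to the events and logs views in the console.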

  • If the environment deployed successfully but the MLOps URL cannot be accessed, make sure that the load balancer IP address has been registered in DNS for the URL domain.
  • If the environment deployed successfully but the MLOps URL cannot be accessed, make sure that the TLS certificates are valid for the URL domain.
  • If the Okta username is not working, attempt to log in using the local username and password provided during the deployment.
  • If the local account is working but Okta accounts are not, review the Okta account configuration to ensure that user accounts are configured properly.
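
The DNS and certificate checks above can be scripted. The following is a minimal sketch using only the Python standard library. The hostname mlops.example.com and the address 203.0.113.10 mentioned in the usage note are placeholders, not values from this guide, and the helper names are hypothetical.

```python
import socket
import ssl
import time


def resolved_addresses(hostname, port=443):
    """Return the set of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def fetch_cert_not_after(hostname, port=443, timeout=5.0):
    """Return the certificate's notAfter string, e.g. 'Jan  1 00:00:00 2030 GMT'.

    The default context verifies the chain and the hostname, so an
    untrusted, expired, or mismatched certificate raises
    ssl.SSLCertVerificationError, which is itself a useful diagnostic.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days between now and the certificate expiry (negative if expired)."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0
```

To use it, substitute your own values: check that your load balancer IP (for example 203.0.113.10) is in resolved_addresses("mlops.example.com"), then pass the result of fetch_cert_not_after("mlops.example.com") to days_until_expiry to see how long the certificate remains valid.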

Important

Please contact your assigned Rafay customer success representative if you need further assistance with troubleshooting.