Troubleshoot
Once end users have an operational "Ray as Service" tenant on the organization's shared host cluster, they simply use the available endpoints and do not have to deal with the service itself. For this reason, troubleshooting is organized into two categories:
- Launching Ray as Service endpoint
- Ongoing use of Ray as Service
Launch Tenant
If an end user encounters an issue when they create their "Ray as Service" tenant, it is most likely due to core infrastructure or configuration issues. The platform administrator will most likely need to intervene to address the issue. Admins and end users can leverage the integrated troubleshooting capabilities to diagnose and resolve potential issues.
For example, in the image below, clicking on "activities" expands it and displays the detailed status and real-time progress updates for the "Ray as Service Tenant Creation" workflow.
If one of the steps in the workflow has failed, the user can expand it to view detailed logs and status as the operation progresses. In the example below, for troubleshooting purposes, the user can see the various steps and checks being performed.
Use Tenant
Once the "Ray as Service" tenant has been successfully launched, the end user simply uses the provided endpoint on an ongoing basis. Here are some issues that can occur, along with ways to address them.
Cannot Reach Endpoint
The most common reason is that the user's laptop cannot resolve the tenant's endpoint URL. As a good security practice, the endpoints are typically available only from the organization's internal network. Users should make sure that they are connected to the organization's network (for example, via a VPN) and try again.
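A quick way to confirm a name resolution problem is to ask the local resolver directly. The following is a minimal sketch using only the Python standard library. The hostname in the usage comment is a hypothetical placeholder, not a real tenant endpoint, and port 443 is assumed only because the endpoint is expected to be served over HTTPS.

```python
import socket


def resolve_endpoint(hostname: str) -> list[str]:
    """Return the IP addresses the local resolver reports for hostname.

    An empty list means the name could not be resolved from this machine,
    which usually points to being off the internal network or VPN.
    """
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Example usage (hypothetical hostname, replace with your tenant's endpoint):
# addresses = resolve_endpoint("ray-tenant.internal.example.com")
# print(addresses or "Endpoint name does not resolve; check your VPN connection.")
```

If the function returns an empty list while you are off the VPN and a non-empty list once connected, the problem is name resolution rather than the tenant itself.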
Host Infrastructure Issues
The endpoint can become unreachable if there are host infrastructure issues. These can result from either planned maintenance or unplanned outages. For example:
- The administrator may have let the endpoint's SSL certificate expire, in which case the end user may be unable to establish a trusted connection to the service's endpoint.
- The underlying hardware backing the host cluster may be down or degraded.
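To tell an expired certificate apart from a host that is simply down, you can inspect the certificate the endpoint presents. The sketch below uses only the Python standard library. The function names are illustrative, and it assumes the endpoint serves TLS on port 443 with a certificate your machine's default trust store can verify.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter timestamp (negative if expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def check_endpoint_certificate(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and summarize the certificate status."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except ssl.SSLCertVerificationError as exc:
        # An expired certificate fails verification during the handshake.
        return f"certificate rejected: {exc.verify_message}"
    except OSError as exc:
        # DNS failures, refused connections, and timeouts all land here.
        return f"could not connect: {exc}"
    return f"certificate valid for {days_until_expiry(not_after):.1f} more days"


# Example usage (hypothetical hostname, replace with your tenant's endpoint):
# print(check_endpoint_certificate("ray-tenant.internal.example.com"))
```

A "certificate rejected" result is worth forwarding to your administrator, while "could not connect" suggests a network or host problem instead.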
We recommend that users follow the steps below to identify the underlying issue.
Step 1
Ensure that you did not miss any internal notifications for planned infrastructure maintenance and related downtime.
Step 2
Access your tenant's Ray Dashboard (if you are able to) and check whether any issues are reported there. For example, if your Ray Head is not operational, contact your administrator.
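If the dashboard does not load in a browser, a scripted probe can show whether it is responding at all. This is a minimal sketch using the Python standard library. The URL in the usage comment is a hypothetical placeholder, and 8265 is Ray's default dashboard port, which your tenant may not use.

```python
import urllib.error
import urllib.request


def probe_dashboard(url: str, timeout: float = 5.0) -> str:
    """Issue a GET request to url and describe the outcome in one line."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"reachable (HTTP {response.status})"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 401 or 503.
        return f"responded with HTTP {exc.code}"
    except OSError as exc:
        # URLError, refused connections, and timeouts are all OSError subclasses.
        return f"unreachable ({exc})"


# Example usage (hypothetical URL, replace with your tenant's dashboard address):
# print(probe_dashboard("https://ray-tenant.internal.example.com:8265/"))
```

An "unreachable" result points toward network or host infrastructure, whereas an HTTP error status means the dashboard is running but rejecting or failing the request. Include either message when you contact your administrator.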
Step 3
If your admin suspects that the issue is caused by the virtual cluster backing the Ray Tenant:
- Log in to the Rafay Org and navigate to the user's project
- Click on Infrastructure -> Clusters
If there are issues with the virtual cluster, they may be reported here.
You can use the integrated zero trust kubectl web shell to issue troubleshooting commands remotely. Click on kubectl to open a web-based kubectl shell.