Air-Gapped Controller Troubleshooting¶
This guide covers common issues and solutions for air-gapped Rafay controllers.
Controller Issues¶
TLS & Certificate Issues¶
Problem: UI access fails with certificate errors or container image pulls fail due to TLS validation.
Solution: Replace/renew expired TLS certificates using the following steps:
- Prepare Certificate Files (see the verification sketch after these steps)
    - Generate a new TLS certificate and private key for the wildcard controller domain (e.g., *.controller.example.com)
    - Save the certificate chain as tls.crt
    - Save the private key as tls.key
- Backup Existing TLS Secrets
    ```bash
    kubectl get secret admin-ingress-certs -n istio-system -o yaml > admin-ingress-certs.yaml
    kubectl get secret rafay-container-registry-tls-secret-opaque -n rafay-core -o yaml > rafay-container-registry-tls-secret-opaque.yaml
    ```

- Update TLS Secrets

    ```bash
    kubectl create secret generic admin-ingress-certs \
      --from-file=tls.crt=tls.crt \
      --from-file=tls.key=tls.key \
      -n istio-system -o yaml --dry-run=client | kubectl apply -f -

    kubectl create secret generic rafay-container-registry-tls-secret-opaque \
      --from-file=tls.crt=tls.crt \
      --from-file=tls.key=tls.key \
      -n rafay-core -o yaml --dry-run=client | kubectl apply -f -
    ```

- Restart Affected Deployments

    ```bash
    kubectl rollout restart deployment/istio-ingressgateway -n istio-system
    kubectl rollout restart deployment/admin-api -n rafay-core
    ```
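Before applying the new secrets, it can help to confirm that the certificate and key files actually belong together and cover the expected domain. A minimal sketch using standard OpenSSL commands (the `-ext` flag assumes OpenSSL 1.1.1 or newer); the file names match the tls.crt and tls.key prepared above:

```bash
# The public-key digests of the certificate and the private key must match;
# a mismatch means the ingress will fail to serve the new certificate.
openssl x509 -noout -pubkey -in tls.crt | openssl sha256
openssl pkey -pubout -in tls.key | openssl sha256

# Confirm the certificate covers the wildcard controller domain and that its
# expiry date is in the future.
openssl x509 -noout -subject -enddate -ext subjectAltName -in tls.crt
```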
Note
Check the certificate presented at the controller console URL before and after the replacement to confirm the new certificate is applied correctly and trusted by the browser.
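One way to inspect the certificate actually being served, before and after the rollout restarts, is with openssl s_client. This is a sketch; console.controller.example.com is a placeholder for your controller console FQDN:

```bash
# Print the issuer, subject, and validity window of the certificate served
# on port 443 of the controller console endpoint.
echo | openssl s_client -connect console.controller.example.com:443 \
  -servername console.controller.example.com 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates
```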
Controller UI Accessibility Issues¶
Problem: The controller console UI is inaccessible or unresponsive.
Troubleshooting Steps:
- SSH into the controller master node and check if all controller pods are running:

    ```bash
    kubectl get pods -A
    ```

- Identify problematic pods: look for pods in CrashLoopBackOff, Error, or Pending state (see the filter sketch after this list).

- Describe the problematic pod to get more details about the issue:

    ```bash
    kubectl describe pod <pod-name> -n <namespace>
    ```

- Check pod logs for error messages:

    ```bash
    kubectl logs <pod-name> -n <namespace>
    ```

- Scale down and scale up the deployment to restart the affected pods:

    ```bash
    # Scale down the deployment
    kubectl scale deployment <deployment-name> -n <namespace> --replicas=0

    # Wait a few seconds, then scale back up
    kubectl scale deployment <deployment-name> -n <namespace> --replicas=1
    ```

- Verify all pods are running after the restart:

    ```bash
    kubectl get pods -A
    ```

- Try accessing the controller UI again once all pods are in Running state.
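A convenience one-liner for the identification step (a sketch, not part of the product tooling) filters the pod listing down to anything that is not healthy:

```bash
# Show only pods whose STATUS is something other than Running or Completed,
# e.g. CrashLoopBackOff, Error, Pending, or ImagePullBackOff.
kubectl get pods -A --no-headers | grep -vE 'Running|Completed'
```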
Node Reboot Recovery Time
If the nodes that make up the air-gapped controller are rebooted, the services and pods take time to come back up and become fully operational, typically around 10 to 20 minutes. Allow sufficient time for the system to recover before attempting to access the console UI.
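A simple way to watch the recovery, assuming shell access to the controller master node, is to count the pods that are not yet healthy; the count should drop to zero as the controller settles:

```bash
# Re-check every 30 seconds how many pods are still not Running/Completed
# after the reboot (Ctrl-C to stop watching once the count reaches 0).
watch -n 30 "kubectl get pods -A --no-headers | grep -vcE 'Running|Completed'"
```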
Cluster Issues¶
Cluster Dashboard Not Loading¶
Problem: Cluster dashboards may not load due to TimescaleDB instability affecting metrics flow.
Troubleshooting Steps:
- Check if the required pods are running:

    ```bash
    kubectl get pods -A | grep timescale
    kubectl get pods -A | grep promscale
    ```

- If pods are unhealthy:

    - Review pod logs:

        ```bash
        kubectl logs -n <namespace> <timescale-pod-name>
        ```

    - Restart TimescaleDB pods:

        ```bash
        kubectl delete pod <timescale-pod-name> -n <namespace>
        ```
Tip
Look for connection timeouts or OOM (out of memory) errors in the logs.
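Besides scanning the logs, the pod status itself records whether a container was killed by the OOM killer. A sketch using the same placeholders as above:

```bash
# "OOMKilled" here means the container exceeded its memory limit and was
# terminated; a high restart count is another sign of instability.
kubectl get pod <timescale-pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

kubectl get pod <timescale-pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[*].restartCount}'
```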
MKS Upgrade Failures During Preflight Checks¶
Problem: MKS upgrades may fail intermittently with control channel status fluctuations:
```
Status of control channel for <Node_name>: Down
Status of control channel for <Node_name>: Up
Status of control channel for <Node_name>: Up
Status of control channel for <Node_name>: Up
```
Solution:
- SSH into the affected MKS node
- Restart the required services:
    ```bash
    sudo systemctl restart salt-minion
    sudo systemctl restart chisel.service
    ```
Note
Wait a few seconds after restarting services before re-attempting the upgrade.
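Before re-running the upgrade, it can also help to confirm that both services came back up cleanly (a sketch; the unit names are the ones restarted above):

```bash
# Both units should report "active"; anything else means the restart did
# not complete and the preflight check will likely fail again.
sudo systemctl is-active salt-minion chisel.service

# For more detail (recent log lines, PID, uptime) without paging:
sudo systemctl status salt-minion chisel.service --no-pager
```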
ZTKA Connection Issues¶
Problem: Unable to access the cluster through the ZTKA channel (kubectl hangs or times out).
Troubleshooting Steps:
- Verify all cluster pods are running
- Check relay-agent pod logs:
    ```bash
    kubectl logs -n rafay-system -l app=relay-agent
    kubectl logs -n rafay-system -l app=v2-relay-agent
    ```
Common Causes:

- Firewall rules blocking outbound connections to the controller
- DNS or networking issues in restricted environments
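A quick way to test both causes from a node in the cluster is to resolve and reach the controller endpoint directly. This is a sketch; console.controller.example.com is a placeholder for your controller FQDN:

```bash
# Check DNS resolution of the controller endpoint from the cluster node.
nslookup console.controller.example.com

# Check that outbound TLS connections to the controller are not blocked;
# an immediate failure or a hang until timeout points to firewall rules.
curl -vk --connect-timeout 10 https://console.controller.example.com
```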
Solution:
Restart the relay-agent pod:
```bash
kubectl delete pod -n rafay-system -l app=relay-agent
```
Tip
After the restart, verify kubectl connectivity through the ZTKA channel.
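A minimal post-restart check, assuming a kubeconfig downloaded for ZTKA access is saved locally (the path below is a placeholder):

```bash
# A quick request through the ZTKA channel; a prompt response from the API
# server indicates the relay path is healthy again.
kubectl --kubeconfig ~/ztka-kubeconfig.yaml get nodes
```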