Air-Gapped Controller Troubleshooting

This guide covers common issues and solutions for air-gapped Rafay controllers.

Controller Issues

TLS & Certificate Issues

Problem: UI access fails with certificate errors, or container image pulls fail due to TLS validation errors.

Solution: Replace/renew expired TLS certificates using the following steps:

  1. Prepare Certificate Files

     - Generate a new TLS certificate and private key for the wildcard controller domain (e.g., *.controller.example.com); see the openssl sketch after this list
     - Save the certificate chain as tls.crt
     - Save the private key as tls.key

  2. Backup Existing TLS Secrets

    kubectl get secret admin-ingress-certs -n istio-system -o yaml > admin-ingress-certs.yaml
    kubectl get secret rafay-container-registry-tls-secret-opaque -n rafay-core -o yaml > rafay-container-registry-tls-secret-opaque.yaml
    

  3. Update TLS Secrets

    kubectl create secret generic admin-ingress-certs \
    --from-file=tls.crt=tls.crt \
    --from-file=tls.key=tls.key \
    -n istio-system -o yaml --dry-run=client | kubectl apply -f -
    
    kubectl create secret generic rafay-container-registry-tls-secret-opaque \
    --from-file=tls.crt=tls.crt \
    --from-file=tls.key=tls.key \
    -n rafay-core -o yaml --dry-run=client | kubectl apply -f -
    

  4. Restart Affected Deployments

    kubectl rollout restart deployment/istio-ingressgateway -n istio-system
    kubectl rollout restart deployment/admin-api -n rafay-core
    
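For step 1, a minimal sketch of generating a certificate with openssl is shown below. It assumes OpenSSL 1.1.1 or later and produces a self-signed certificate, which is only suitable for lab or test controllers; for production, generate a CSR and have it signed by your certificate authority. The wildcard domain is an example and should be replaced with your controller domain.

    # Illustrative only: self-signed wildcard certificate for a test controller.
    # Replace *.controller.example.com with your actual controller wildcard domain.
    openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
      -keyout tls.key -out tls.crt \
      -subj "/CN=*.controller.example.com" \
      -addext "subjectAltName=DNS:*.controller.example.com"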

Note

Check the validity of the controller console URL before and after the certificate replacement to ensure the new certificate is applied correctly and trusted by the browser.
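
One way to check which certificate the console endpoint is serving, before and after the replacement, is with openssl (a sketch; console.controller.example.com is a placeholder for your console FQDN):

    # Print the subject, issuer, and validity dates of the certificate presented on port 443.
    echo | openssl s_client -connect console.controller.example.com:443 \
      -servername console.controller.example.com 2>/dev/null \
      | openssl x509 -noout -subject -issuer -dates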


Controller UI Accessibility Issues

Problem: The controller console UI is inaccessible or unresponsive.

Troubleshooting Steps:

  1. SSH into the controller master node and check if all controller pods are running:

    kubectl get pods -A
    

  2. Identify problematic pods: Look for pods in CrashLoopBackOff, Error, or Pending state (a quick filter is sketched after this list).

  3. Describe the problematic pod to get more details about the issue:

    kubectl describe pod <pod-name> -n <namespace>
    

  4. Check pod logs for error messages:

    kubectl logs <pod-name> -n <namespace>
    

  5. Scale down and scale up the deployment to restart the affected pods:

    # Scale down the deployment
    kubectl scale deployment <deployment-name> -n <namespace> --replicas=0
    
    # Wait a few seconds, then scale back up
    kubectl scale deployment <deployment-name> -n <namespace> --replicas=1
    

  6. Verify all pods are running after the restart:

    kubectl get pods -A
    

  7. Try accessing the controller UI again once all pods are in Running state.
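
For step 2, one quick way to surface problematic pods is to filter out the healthy ones (a sketch; the grep pattern simply hides Running and Completed pods):

    # Show only pods that are not Running or Completed (e.g., CrashLoopBackOff, Error, Pending).
    kubectl get pods -A | grep -Ev 'Running|Completed'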

Node Reboot Recovery Time

If the nodes that are part of the air-gapped controller are rebooted, the services and pods take some time to come back up and become fully operational, typically 10 to 20 minutes. Allow sufficient time for the system to recover before attempting to access the console UI.
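
To monitor recovery after a reboot, you can watch pod status until everything settles (a sketch; the kubectl wait example assumes a recent kubectl and may time out if the cluster contains completed Job pods, which never report Ready):

    # Watch pods across all namespaces come back up; interrupt with Ctrl-C once they are Running.
    kubectl get pods -A --watch

    # Alternatively, block for up to 20 minutes until all pods report Ready.
    kubectl wait --for=condition=Ready pod --all --all-namespaces --timeout=20m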


Cluster Issues

Cluster Dashboard Not Loading

Problem: Cluster dashboards may not load due to TimescaleDB instability affecting metrics flow.

Troubleshooting Steps:

  1. Check if required pods are running:

    kubectl get pods -A | grep timescale
    kubectl get pods -A | grep promscale
    

  2. If pods are unhealthy, review the pod logs:

    kubectl logs -n <namespace> <timescale-pod-name>
    
  3. Restart the unhealthy TimescaleDB pods:

    kubectl delete pod <timescale-pod-name> -n <namespace>
    

Tip

Look for connection timeouts or OOM (out of memory) errors in the logs.
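
A sketch of two ways to confirm an OOM kill (pod and namespace names are placeholders):

    # Show the termination reason of the last restart; "OOMKilled" indicates an out-of-memory kill.
    kubectl get pod <timescale-pod-name> -n <namespace> \
      -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

    # Or scan the pod description and recent events for OOM mentions.
    kubectl describe pod <timescale-pod-name> -n <namespace> | grep -i -A 2 oom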


MKS Upgrade Failures During Preflight Checks

Problem: MKS upgrades may fail intermittently with control channel status fluctuations:

Status of control channel for <Node_name>: Down  
Status of control channel for <Node_name>: Up  
Status of control channel for <Node_name>: Up  
Status of control channel for <Node_name>: Up

Solution:

  1. SSH into the affected MKS node
  2. Restart the required services:
    sudo systemctl restart salt-minion
    sudo systemctl restart chisel.service
    

Note

Wait a few seconds after restarting services before re-attempting the upgrade.
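
Before retrying, you can confirm both services came back up cleanly (a sketch using the service names from the steps above):

    # Check that salt-minion and chisel are active after the restart.
    sudo systemctl is-active salt-minion chisel.service

    # For more detail, including recent log lines:
    sudo systemctl status salt-minion chisel.service --no-pager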


ZTKA Connection Issues

Problem: Unable to access the cluster using the ZTKA channel (kubectl hangs or times out).

Troubleshooting Steps:

  1. Verify all cluster pods are running
  2. Check relay-agent pod logs:
    kubectl logs -n rafay-system -l app=relay-agent
    kubectl logs -n rafay-system -l app=v2-relay-agent
    

Common Causes:

  - Firewall rules blocking outbound controller connections
  - DNS or networking issues in restricted environments
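
To narrow down these causes, a basic reachability check from inside the cluster environment can help (a sketch; <controller-fqdn> is a placeholder for your controller's console or relay endpoint):

    # Confirm the controller hostname resolves from the cluster network.
    nslookup <controller-fqdn>

    # Confirm outbound HTTPS connectivity to the controller (certificate validation skipped with -k).
    curl -vk https://<controller-fqdn> --max-time 10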

Solution:

Restart the relay-agent pod:

kubectl delete pod -n rafay-system -l app=relay-agent

Tip

After restart, verify kubectl connectivity through ZTKA channel.
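
A minimal verification might look like the following (a sketch; the kubeconfig path is a placeholder for a kubeconfig downloaded from the controller for ZTKA access):

    # Confirm the relay-agent pods are back in Running state.
    kubectl get pods -n rafay-system -l app=relay-agent

    # Then issue a simple command through the ZTKA kubeconfig.
    kubectl --kubeconfig <downloaded-ztka-kubeconfig> get nodes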