Skip to content

Air-Gapped Controller Troubleshooting

This guide covers common issues and their solutions for air-gapped (self-hosted) Rafay controllers.

Cluster Issues

Cluster Dashboard Not Loading

Problem: Cluster dashboards may not load due to TimescaleDB instability affecting metrics flow.

Troubleshooting Steps:

  1. Check if required pods are running:

    kubectl get pods -A | grep timescale
    kubectl get pods -A | grep promscales
    

  2. If pods are unhealthy:

  3. Review pod logs:
    kubectl logs -n <namespace> <timescale-pod-name>
    
  4. Restart TimescaleDB pods:
    kubectl delete pod <timescale-pod-name> -n <namespace>
    

Tip

Look for connection timeouts or OOM (out of memory) errors in the logs.


MKS Upgrade Failures During Preflight Checks

Problem: MKS upgrades may fail intermittently with control channel status fluctuations:

Status of control channel for <Node_name>: Down  
Status of control channel for <Node_name>: Up  
Status of control channel for <Node_name>: Up  
Status of control channel for <Node_name>: Up

Solution:

  1. SSH into the affected MKS node
  2. Restart the required services:
    sudo systemctl restart salt-minion
    sudo systemctl restart chisel.service
    

Note

Wait a few seconds after restarting services before re-attempting the upgrade.


ZTKA Connection Issues

Problem: Unable to access cluster using ZTKA channel (kubectl hangs or times out)

Troubleshooting Steps:

  1. Verify all cluster pods are running
  2. Check relay-agent pod logs:
    kubectl logs -n rafay-system -l app=relay-agent
    kubectl logs -n rafay-system -l app=v2-relay-agent
    

Common Causes: - Firewall rules blocking outbound controller connections - DNS or networking issues in restricted environments

Solution:

Restart the relay-agent pod:

kubectl delete pod -n rafay-system -l app=relay-agent

Tip

After restart, verify kubectl connectivity through ZTKA channel.


Controller Issues

TLS & Certificate Issues

Problem: UI access fails with certificate errors or container image pulls fail due to TLS validation.

Solution: Replace/renew expired TLS certificates using the following steps:

  1. Prepare Certificate Files
  2. Generate new TLS certificate and private key for wildcard controller domain (e.g., *.controller.example.com)
  3. Save certificate chain as tls.crt
  4. Save private key as tls.key

  5. Backup Existing TLS Secrets

    kubectl get secret admin-ingress-certs -n istio-system -o yaml > admin-ingress-certs.yaml
    kubectl get secret rafay-container-registry-tls-secret-opaque -n rafay-core -o yaml > rafay-container-registry-tls-secret-opaque.yaml
    

  6. Update TLS Secrets

    kubectl create secret generic admin-ingress-certs \
    --from-file=tls.crt=tls.crt \
    --from-file=tls.key=tls.key \
    -n istio-system -o yaml --dry-run=client | kubectl apply -f -
    
    kubectl create secret generic rafay-container-registry-tls-secret-opaque \
    --from-file=tls.crt=tls.crt \
    --from-file=tls.key=tls.key \
    -n rafay-core -o yaml --dry-run=client | kubectl apply -f -
    

  7. Restart Affected Deployments

    kubectl rollout restart deployment/istio-ingressgateway -n istio-system
    kubectl rollout restart deployment/admin-api -n rafay-core
    

Note

Check the validity of the controller console URL before and after the certificate replacement to ensure the new certificate is applied correctly and trusted by the browser.