
Troubleshooting

Use this page when the console reports a problem whose cause usually lies in your data center: network paths, services on the management machine, or connectivity back to the Rafay controller.

1. Gateway shows as unhealthy (for example, “getting agent conn failed”)

In simple terms, the gateway is a small component of your setup that runs on a server or VM in your data center, often the machine you use as the head or management server for bare metal. The Rafay controller uses it to reach that environment securely, because the controller does not connect directly to your physical servers.

When the gateway is healthy, the Rafay controller and that server can talk to each other reliably. An unhealthy gateway almost always means that link is broken, most often due to network issues (no route, blocked traffic, DNS not resolving, or a proxy or firewall blocking outbound connections from that server to the Rafay controller).

What to do:

  1. On the server where the gateway is installed, have your administrator confirm the infra agent (the service that connects the gateway host to the Rafay controller) is running and review recent errors:
     - Check service status: systemctl status infraagent.service
     - Review the log file: /var/log/infra_agent.log
  2. From that same server, confirm it can reach the Rafay controller over the network using the URLs and connectivity your deployment expects (see the sketch after this list). Resolve any firewall, DNS, or outbound access issues until the agent stays up and the gateway shows Healthy again.
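
A minimal check sequence for both steps might look like the following, run on the gateway host. The controller hostname below is a placeholder, not a real endpoint; substitute the URL your deployment actually connects to.

```bash
# Run on the gateway host. CONTROLLER_HOST is a placeholder; replace it
# with the controller endpoint your deployment actually uses.
CONTROLLER_HOST="console.example.rafay.dev"

# Is the infra agent running?
systemctl status infraagent.service

# Recent agent errors
tail -n 100 /var/log/infra_agent.log

# Does DNS resolve the controller endpoint?
getent hosts "$CONTROLLER_HOST" || echo "DNS lookup failed"

# Can the host open an outbound HTTPS connection to the controller?
curl -v --connect-timeout 10 -o /dev/null "https://$CONTROLLER_HOST"
```

If DNS resolves but the HTTPS connection hangs or is refused, the problem is typically an egress firewall or proxy rule on that server rather than the agent itself.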

2. Bare metal provisioner shows as failed

In simple terms, the bare metal provisioner is the software stack, running on the Kubernetes cluster on your management (head) server, that orchestrates provisioning of your bare metal machines. If it is not running correctly, new bare metal provisioning will fail even when servers and profiles look fine in the UI.

What to do:

  1. Check the gateway first. If the gateway is unhealthy, fix that using section 1. Many downstream issues look like a failed provisioner when the real problem is that the Rafay controller cannot reach your environment reliably.
  2. Check the provisioning software on the cluster. Someone with access to the Kubernetes environment on that management server should confirm the bare metal and provisioning components (the operators and related workloads from your provisioner install) are running, not stuck restarting, and not reporting errors. That usually means reviewing workload status in the relevant namespace (for example, kubectl get pods across all namespaces, or scoped to the namespace your install uses) and checking recent failure events; see the sketch after this list.
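
A minimal review might look like the following, run against the Kubernetes cluster on the management (head) server. The namespace and pod name below are placeholders; use the ones your provisioner install actually created.

```bash
# NS is a placeholder; replace it with the namespace your provisioner
# install actually uses.
NS="metal-system"

# Any workloads on the cluster that are not Running or Completed
kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'

# Provisioner components in their namespace, including restart counts
kubectl get pods -n "$NS"

# Recent events in that namespace, newest last
kubectl get events -n "$NS" --sort-by=.metadata.creationTimestamp | tail -n 20

# Details for a pod that is stuck or restarting (substitute a real name)
POD="replace-with-failing-pod-name"
kubectl describe pod "$POD" -n "$NS"
```

Pods stuck in CrashLoopBackOff or with climbing restart counts point at the provisioner itself; if everything is Running and the gateway is Healthy, look instead at the specific provisioning job's logs and events.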