The Controller continuously monitors both clusters and workloads deployed on the managed clusters. When a critical issue with the cluster or the workload is detected, the Controller generates an "Alert".
Alerts are generated when observed events "persist" and do not resolve automatically after a number of retries. The entire history of "Alerts" is persisted on the Controller, and a reverse chronological history is available to Org Admins on the Console.
All Alerts start life as "Open Alerts". When the underlying issue is resolved (automatically or manually) and the issue does not manifest anymore, the alert is automatically "Closed".
Filters are provided to help sort and manage the alerts appropriately:
- Alerts Status (Open/Closed)
For every alert, the following data is presented to the user:
- Date: When the issue was first observed and therefore the alert was generated automatically
- Duration: How long the issue has persisted
- Type: See details below
- Cluster: The cluster in which the issue was observed
- Severity: How severe is this alert (Critical/Warning/Info)
- Summary: Brief description of the issue
- Description: Detailed description of the issue behind the alert
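As a rough illustration of the fields and the status filter described above, an alert record could be modeled as follows. This is only a sketch: the `Alert` class, its field types, and the `alert_history` helper are hypothetical names chosen for this example and are not part of the Controller's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Alert:
    # Field names mirror the data listed above; the types are assumptions.
    date: datetime          # when the issue was first observed
    duration: timedelta     # how long the issue has persisted
    alert_type: str         # e.g. "Node Not Ready"
    cluster: str            # cluster in which the issue was observed
    severity: str           # "Critical", "Warning" or "Info"
    summary: str
    description: str
    status: str = "Open"    # every alert starts life as Open


def alert_history(alerts: List[Alert], status: Optional[str] = None) -> List[Alert]:
    """Return alerts newest first, optionally filtered by status (Open/Closed)."""
    selected = [a for a in alerts if status is None or a.status == status]
    return sorted(selected, key=lambda a: a.date, reverse=True)
```

Calling `alert_history(alerts, status="Open")` would then yield only the open alerts in reverse chronological order.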
All alerts have an associated Severity. A Critical alert means the administrator needs to pay attention immediately to help address the underlying issue. A Warning alert means there is an underlying issue that is trending poorly and will need attention quickly. An Info alert is for informational purposes only.
For application and ops teams, SLA can be a critical measure of their effectiveness. The "duration" of the alert provides an excellent indication of SLA. Issues should be triaged and resolved as soon as possible, ideally within minutes.
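To make the relationship between alert duration and SLA concrete, the sketch below computes how long an alert has been (or was) open and compares it with a target. The function names and the 15-minute target used in the usage note are assumptions made for this example; the product does not prescribe a specific SLA value.

```python
from datetime import datetime, timedelta
from typing import Optional


def alert_duration(opened_at: datetime, closed_at: Optional[datetime], now: datetime) -> timedelta:
    """Duration of an alert: until it was closed, or until `now` if still open."""
    end = closed_at if closed_at is not None else now
    return end - opened_at


def within_sla(duration: timedelta, target: timedelta) -> bool:
    """True if the issue was (or so far has been) handled within the target time."""
    return duration <= target
```

For example, an alert opened at 10:00 and closed at 10:12 has a duration of 12 minutes, which would meet a hypothetical 15-minute target.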
Alerts Quick View
Cluster administrators are provided with a quick view of all open alerts associated with a cluster. In the Console, navigate to the cluster card to get a bird's eye view of open alerts.
The table below captures the list of scenarios that are actively monitored. Alerts are automatically generated when these scenarios occur.
| Category | Type | Severity |
|---|---|---|
| Cluster | Health of pods in Critical Monitored Namespaces | Critical |
| Cluster | Loss of Operator Connectivity to Controller | Critical |
| Cluster | Very Low Capacity | Critical |
Pods in Critical Namespaces
Checks whether pods in the critical, monitored namespaces (“kube-system”, “rafay-system” and “rafay-infra”) are healthy
Loss of Operator Connectivity to Controller
The k8s Operator is unable to reach the Controller over the network
Low Capacity
Less than 20% of overall cluster capacity (CPU and Memory) available for >5 minutes
Very Low Capacity
Less than 10% of overall cluster capacity (CPU and Memory) available for >5 minutes
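The two capacity thresholds above (20% and 10% available for more than 5 minutes) can be sketched as a simple classification. This is a minimal illustration under stated assumptions: the function name and the `"low"` / `"very_low"` labels are hypothetical, the caller is assumed to pass readings that span more than 5 minutes, and the sketch assumes that either CPU or memory alone falling below the threshold is enough to trigger, which the text does not confirm.

```python
from typing import List, Optional, Tuple


def classify_capacity(samples: List[Tuple[float, float]]) -> Optional[str]:
    """Classify cluster capacity from (cpu_available_pct, memory_available_pct)
    readings that together cover more than 5 minutes.

    Returns "very_low", "low", or None when capacity is sufficient.
    """
    if not samples:
        return None
    # Scarcest resource at each reading (assumption: either resource can trigger).
    worst = [min(cpu, mem) for cpu, mem in samples]
    # The condition must hold for the whole window, so use the best reading seen.
    best_in_window = max(worst)
    if best_in_window < 10:
        return "very_low"
    if best_in_window < 20:
        return "low"
    return None
```

Using the best reading in the window means a single healthy sample is enough to suppress the alert, which matches the idea that the shortage must persist rather than spike briefly.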
| Category | Type | Severity |
|---|---|---|
| Node | Node in Not Ready state | Critical |
| Node | Node powered down | Critical |
| Node | High CPU load | Critical |
| Node | High Memory Load | Critical |
| Node | Disk Usage Prediction | Warning |
Node Not Ready
Cluster Node in “Not Ready” state for >5 minutes (e.g. due to Disk, CPU or PID Pressure)
Node Powered Down
Node powered down for >5 minutes
High CPU Load
Greater than 90% sustained CPU utilization over 5 minutes. This can result in CPU throttling of pods
High Memory Load
Greater than 80% sustained Memory utilization over 5 minutes. This can result in pods being OOM killed
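The "sustained over 5 minutes" condition for the CPU and memory load alerts above can be illustrated with a small check over timestamped utilization samples. This is a sketch only: the function name, the sample format, and the constants are assumptions for this example, and the Controller's actual evaluation logic is not documented here.

```python
from typing import List, Tuple

CPU_THRESHOLD_PCT = 90.0      # High CPU Load threshold from the text
MEMORY_THRESHOLD_PCT = 80.0   # High Memory Load threshold from the text
WINDOW_SECONDS = 5 * 60       # "sustained over 5 minutes"


def sustained_above(
    samples: List[Tuple[float, float]],
    threshold_pct: float,
    window_seconds: float = WINDOW_SECONDS,
) -> bool:
    """samples: (timestamp_seconds, utilization_pct), sorted by timestamp ascending.

    True if the most recent unbroken run of samples above the threshold
    spans at least `window_seconds`.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until utilization drops to the threshold.
    for ts, pct in reversed(samples):
        if pct <= threshold_pct:
            break
        breach_start = ts
    return breach_start is not None and latest - breach_start >= window_seconds
```

A call such as `sustained_above(cpu_samples, CPU_THRESHOLD_PCT)` would return `False` if utilization dipped to or below 90% at any point in the last 5 minutes.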
Disk Usage Prediction
Prediction based on growth and usage
A k8s resource (e.g. ReplicaSet, DaemonSet) required by the workload is unavailable for >2 minutes
P95 CPU utilization of one or more Pods is at 90% of the limit for >15 minutes
| Category | Type | Severity |
|---|---|---|
| Pod | Frequent Pod Restart | Critical |
Processes in the Pod have used more than the memory limit
Pod pending for >5 minutes
Frequent Pod Restart
Pod restarted >3 times in 60 minutes
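The restart rule above (more than 3 restarts in 60 minutes) amounts to counting events in a sliding window. The sketch below shows one way to express it; the function name and the use of plain epoch-second timestamps are assumptions for illustration.

```python
from typing import List


def frequent_restarts(
    restart_times: List[float],
    now: float,
    window_seconds: float = 60 * 60,
    max_restarts: int = 3,
) -> bool:
    """True if the Pod restarted more than `max_restarts` times
    within the `window_seconds` leading up to `now`.

    restart_times and now are timestamps in seconds.
    """
    recent = [t for t in restart_times if now - window_seconds < t <= now]
    return len(recent) > max_restarts
```

Because the window slides with `now`, older restarts age out on their own and the condition clears once the Pod has been stable for long enough.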
PVC unbound for >5 minutes
PVC Usage Prediction
PVC projected to run out of capacity within 24 hours
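The text does not say how the projection is computed. As one possible illustration, the sketch below uses a simple linear extrapolation between the oldest and newest usage samples to estimate when a volume would fill up. The function names, the sample format, and the choice of linear extrapolation are all assumptions for this example, not the Controller's documented method.

```python
from typing import List, Optional, Tuple


def hours_until_full(
    samples: List[Tuple[float, float]], capacity_bytes: float
) -> Optional[float]:
    """samples: (timestamp_hours, used_bytes), sorted by timestamp ascending.

    Returns the projected hours from the last sample until the volume is full,
    or None if there is too little data or usage is not growing.
    """
    if len(samples) < 2:
        return None
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (u1 - u0) / (t1 - t0)  # bytes per hour (assumed constant growth)
    if rate <= 0:
        return None
    return max(capacity_bytes - u1, 0.0) / rate


def projected_to_fill(
    samples: List[Tuple[float, float]],
    capacity_bytes: float,
    horizon_hours: float = 24.0,
) -> bool:
    """True if the volume is projected to run out of capacity within the horizon."""
    eta = hours_until_full(samples, capacity_bytes)
    return eta is not None and eta <= horizon_hours
```

A two-point extrapolation is sensitive to noise in the first and last samples; a least-squares fit over all samples would be a natural refinement.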