Alerts

The Controller continuously monitors both the managed clusters and the workloads deployed on them. When a critical issue is detected with a cluster or a workload, the Controller generates an "Alert".

Alerts are generated when observed issues persist and cannot be resolved automatically after a number of retries. The entire history of alerts is persisted on the Controller, and a reverse-chronological view is available to Org Admins on the Console.


Alert Lifecycle

All Alerts start life as "Open Alerts". When the underlying issue is resolved (automatically or manually) and no longer manifests, the alert is automatically "Closed".

Filters are provided to help sort and manage the alerts appropriately:

  • Alert Status (Open/Closed)
  • Type
  • Cluster
  • Severity
  • Timeframe

For every alert, the following data is presented to the user:

  • Date: When the issue was first observed and the alert was automatically generated
  • Duration: How long the issue has persisted
  • Type: See details below
  • Cluster: The cluster in which the issue was observed
  • Severity: How severe the alert is (Critical/Warning/Info)
  • Summary: Brief description of the issue
  • Description: Detailed description of the issue behind the alert

Closed Alerts
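
Conceptually, each alert can be thought of as a small record carrying the fields listed above. The sketch below is a minimal, hypothetical Python representation (the field names and the filter helper are illustrative assumptions, not the Controller's API or schema) showing how the Console's status, type, cluster, severity, and timeframe filters map onto those fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Illustrative alert record mirroring the fields shown on the Console
# (Date, Duration, Type, Cluster, Severity, Summary, Description).
# Field names are assumptions for this sketch, not the Controller's schema.
@dataclass
class Alert:
    date: datetime                        # when the issue was first observed
    alert_type: str                       # e.g. "Node Not Ready", "Pod OOMKilled"
    cluster: str
    severity: str                         # "Critical" | "Warning" | "Info"
    summary: str
    description: str
    closed_at: Optional[datetime] = None  # None while the alert is still open

    @property
    def is_open(self) -> bool:
        return self.closed_at is None

    @property
    def duration(self) -> timedelta:
        # How long the issue has persisted (or persisted before it was closed).
        return (self.closed_at or datetime.utcnow()) - self.date

def filter_alerts(alerts: List[Alert], status: Optional[str] = None,
                  alert_type: Optional[str] = None, cluster: Optional[str] = None,
                  severity: Optional[str] = None,
                  since: Optional[datetime] = None) -> List[Alert]:
    """Apply the Console's filters: status, type, cluster, severity, timeframe."""
    selected = [
        a for a in alerts
        if (status is None or (status == "Open") == a.is_open)
        and (alert_type is None or a.alert_type == alert_type)
        and (cluster is None or a.cluster == cluster)
        and (severity is None or a.severity == severity)
        and (since is None or a.date >= since)
    ]
    # Reverse chronological order, as presented on the Console.
    return sorted(selected, key=lambda a: a.date, reverse=True)
```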


Alert Severity

All alerts have an associated Severity. A Critical alert means the administrator needs to pay attention immediately to address the underlying issue. A Warning alert indicates an underlying issue that is trending poorly and will need attention soon. An Info alert is for informational purposes only.


SLA

For application and ops teams, SLA can be a critical measure of their effectiveness. The duration of an alert provides a good indication of SLA. Ideally, issues should be triaged and resolved within minutes.
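
As a rough illustration of using alert duration as an SLA signal, the sketch below compares an alert's duration (reusing the illustrative Alert record above) against per-severity targets. The target values are assumptions for the example, not thresholds defined by the product.

```python
from datetime import timedelta

# Hypothetical per-severity SLA targets; these values are assumptions for the
# example, not thresholds defined by the product.
SLA_TARGETS = {
    "Critical": timedelta(minutes=15),
    "Warning": timedelta(hours=4),
    "Info": timedelta(days=1),
}

def sla_breached(alert) -> bool:
    # The alert's duration is the time-to-resolution proxy described above.
    target = SLA_TARGETS.get(alert.severity, timedelta(hours=1))
    return alert.duration > target
```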


Alerts Quick View

Cluster administrators are provided with a quick view of all open alerts associated with a cluster. In the Console, navigate to the cluster card to get a bird's eye view of open alerts.

Quick View of Alerts


Alert Scenarios

The tables below capture the scenarios that are actively monitored. Alerts are automatically generated when these scenarios occur.


Managed Clusters

Monitored Object | Description                                      | Severity
Cluster          | Health of pods in critical monitored namespaces | Critical
Cluster          | Loss of Operator connectivity to the Controller | Critical
Cluster          | Low Capacity                                     | Warning
Cluster          | Very Low Capacity                                | Critical

Pods in Critical Namespaces

Pods in the critical, monitored namespaces ("kube-system", "rafay-system" and "rafay-infra") are unhealthy
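
For reference, this condition can be approximated from outside the platform with the Kubernetes Python client. The sketch below, which assumes kubeconfig access to the managed cluster, lists pods in the critical namespaces that are not Running with all containers ready; it mirrors the described check and is not the Controller's own implementation.

```python
from kubernetes import client, config

# Namespaces called out above as critical and monitored.
CRITICAL_NAMESPACES = ["kube-system", "rafay-system", "rafay-infra"]

def unhealthy_critical_pods():
    config.load_kube_config()          # assumes kubeconfig access to the cluster
    v1 = client.CoreV1Api()
    unhealthy = []
    for ns in CRITICAL_NAMESPACES:
        for pod in v1.list_namespaced_pod(ns).items:
            if pod.status.phase == "Succeeded":
                continue               # completed jobs are not unhealthy
            all_ready = all(cs.ready for cs in (pod.status.container_statuses or []))
            if pod.status.phase != "Running" or not all_ready:
                unhealthy.append((ns, pod.metadata.name, pod.status.phase))
    return unhealthy

if __name__ == "__main__":
    for ns, name, phase in unhealthy_critical_pods():
        print(f"{ns}/{name}: {phase}")
```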

Network Connectivity

The k8s Operator is unable to reach the Controller over the network

Low Capacity

Less than 20% of overall cluster capacity (CPU and Memory) available for >5 minutes

Very Low Capacity

Less than 10% of overall cluster capacity (CPU and Memory) available for >5 minutes
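
A rough equivalent of these capacity checks can be computed by comparing aggregate node usage (from the metrics.k8s.io API, i.e. metrics-server) against aggregate allocatable capacity. The sketch below assumes metrics-server is installed and kubeconfig access; it evaluates a single point in time, whereas the alerts require the condition to hold for more than 5 minutes.

```python
from kubernetes import client, config
from kubernetes.utils import parse_quantity

def cluster_capacity_free():
    """Return the fraction of allocatable CPU and memory still free."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    node_metrics = client.CustomObjectsApi().list_cluster_custom_object(
        "metrics.k8s.io", "v1beta1", "nodes")

    alloc_cpu = alloc_mem = used_cpu = used_mem = 0
    for node in v1.list_node().items:
        alloc_cpu += parse_quantity(node.status.allocatable["cpu"])
        alloc_mem += parse_quantity(node.status.allocatable["memory"])
    for item in node_metrics["items"]:
        used_cpu += parse_quantity(item["usage"]["cpu"])
        used_mem += parse_quantity(item["usage"]["memory"])

    return 1 - float(used_cpu / alloc_cpu), 1 - float(used_mem / alloc_mem)

if __name__ == "__main__":
    # Flag the 20% (Warning) and 10% (Critical) thresholds described above.
    for resource, free in zip(("CPU", "Memory"), cluster_capacity_free()):
        if free < 0.10:
            print(f"{resource}: {free:.0%} free -> Very Low Capacity (Critical)")
        elif free < 0.20:
            print(f"{resource}: {free:.0%} free -> Low Capacity (Warning)")
        else:
            print(f"{resource}: {free:.0%} free")
```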


Cluster Nodes

Monitored Object | Description             | Severity
Node             | Node in Not Ready state | Critical
Node             | Node powered down       | Critical
Node             | High CPU load           | Critical
Node             | High Memory load        | Critical
Node             | Disk usage prediction   | Warning

Node Not Ready

Cluster node in "Not Ready" state for >5 minutes (e.g. due to Disk, Memory or PID pressure)
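
The same condition can be inspected manually with the Kubernetes Python client. The sketch below lists nodes whose Ready condition is not True, along with when the condition last changed; the alert additionally requires the state to persist for more than 5 minutes.

```python
from kubernetes import client, config

def not_ready_nodes():
    config.load_kube_config()
    flagged = []
    for node in client.CoreV1Api().list_node().items:
        # Find the node's Ready condition; anything other than "True" is a problem.
        ready = next((c for c in node.status.conditions if c.type == "Ready"), None)
        if ready is None or ready.status != "True":
            flagged.append((node.metadata.name,
                            ready.status if ready else "Unknown",
                            ready.last_transition_time if ready else None))
    return flagged

if __name__ == "__main__":
    for name, status, since in not_ready_nodes():
        print(f"{name}: Ready={status} (last transition: {since})")
```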

Node Powered Down

Node powered down for >5 minutes

High CPU Load

Greater than 90% sustained CPU utilization over 5 minutes. This can result in CPU throttling of pods

High Memory Load

Greater than 80% sustained Memory utilization over 5 minutes. This can result in pods being OOMKilled

Disk Usage Prediction

Node disk usage predicted to reach capacity, based on current growth and usage trends


Workloads

Monitored Object | Description | Severity
Workload         | Unhealthy   | Critical
Workload         | Degradation | Critical

Workload Unhealthy

A k8s resource (e.g. ReplicaSet, DaemonSet, etc.) required by the workload is unavailable for >2 minutes
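
As an illustration, the sketch below approximates this check for Deployments with the Kubernetes Python client: it flags any Deployment whose available replicas are below the desired count. The namespace name is a placeholder, and the 2-minute persistence requirement is not modeled.

```python
from kubernetes import client, config

def unavailable_deployments(namespace="demo-apps"):   # placeholder namespace
    config.load_kube_config()
    apps = client.AppsV1Api()
    flagged = []
    for d in apps.list_namespaced_deployment(namespace).items:
        desired = d.spec.replicas or 0
        available = d.status.available_replicas or 0
        if available < desired:
            flagged.append((d.metadata.name, available, desired))
    return flagged

if __name__ == "__main__":
    for name, available, desired in unavailable_deployments():
        print(f"{name}: {available}/{desired} replicas available")
```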

Workload Degradation

The P95 CPU utilization of one or more of the workload's pods is at 90% of its limit for >15 minutes


Pod Health

Monitored Object | Description          | Severity
Pod              | OOMKilled            | Critical
Pod              | Pod Pending          | Critical
Pod              | Frequent Pod Restart | Critical

Pod OOMKilled

Processes in the pod have used more memory than the pod's memory limit

Pod Pending

Pod pending for >5 minutes

Frequent Pod Restart

Pod restarted >3 times in 60 minutes
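
The three pod health conditions above can be spot-checked with the Kubernetes Python client. The sketch below reports Pending pods, containers whose last termination was OOMKilled, and containers with more than 3 restarts; note the restart count it reads is cumulative, whereas the alert is scoped to a 60-minute window.

```python
from kubernetes import client, config

def pod_health_report(namespace="default"):
    config.load_kube_config()
    findings = []
    for pod in client.CoreV1Api().list_namespaced_pod(namespace).items:
        if pod.status.phase == "Pending":
            findings.append((pod.metadata.name, "Pod Pending"))
        for cs in (pod.status.container_statuses or []):
            last = cs.last_state.terminated if cs.last_state else None
            if last and last.reason == "OOMKilled":
                findings.append((pod.metadata.name, f"container {cs.name} OOMKilled"))
            if cs.restart_count > 3:   # cumulative count, not a 60-minute window
                findings.append((pod.metadata.name,
                                 f"container {cs.name} restarted {cs.restart_count} times"))
    return findings

if __name__ == "__main__":
    for pod, issue in pod_health_report():
        print(f"{pod}: {issue}")
```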


PVC Health

Monitored Object | Description      | Severity
PVC              | PVC Unbound      | Critical
PVC              | Usage Prediction | Warning

PVC Unbound

PVC unbound for >5 minutes
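
This condition can be approximated with the Kubernetes Python client by listing claims that are not in the Bound phase, as in the sketch below; the 5-minute persistence requirement is not modeled.

```python
from kubernetes import client, config

def unbound_pvcs():
    config.load_kube_config()
    v1 = client.CoreV1Api()
    # Any PVC not in the "Bound" phase (e.g. Pending, Lost) is flagged.
    return [
        (pvc.metadata.namespace, pvc.metadata.name, pvc.status.phase)
        for pvc in v1.list_persistent_volume_claim_for_all_namespaces().items
        if pvc.status.phase != "Bound"
    ]

if __name__ == "__main__":
    for ns, name, phase in unbound_pvcs():
        print(f"{ns}/{name}: {phase}")
```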

PVC Usage Prediction

PVC projected to run out of capacity within 24 hours