Backup & Restore
The self-hosted controller gives organizations the flexibility to implement disaster recovery procedures tailored to their specific needs. This guide outlines the backup and restore procedures for self-hosted controllers so that organizations can recover quickly from unforeseen incidents and maintain operational continuity.
Failover Endpoints
The controller's normal operation depends on the following endpoints, which must be accounted for during failover:
- Cloud SQL database
- Load balancers
- AWS EKS service
- AWS EC2 service (worker nodes)
Self-Hosted Controller Disaster Recovery Procedure
Prerequisites
- Ensure that the database is backed up.
- Ensure that all clusters are provisioned and shown as healthy in the UI console when the backup is taken.
- Ensure that an S3 bucket is available to store the backup.
- Ensure that the IRSA role for the `rafay-velero-sa` service account in the `velero` namespace has permission to access that S3 bucket.
Important
The backup and restore procedure applies to both the cloud-based controller and the bare-metal/airgapped controller, and it requires access to an S3 bucket. Without S3 bucket access, backup and restore operations are not possible.
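The exact IAM permissions required are not listed in this guide, so as a hedged sketch: the backup service runs in the `velero` namespace and uses Velero resources, and Velero's standard S3 policy for AWS looks like the following. The bucket name matches the example `rafay-controller-backup-s3` used later in this guide; adjust the ARNs to your bucket and confirm the permission set against your organization's security policies.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["arn:aws:s3:::rafay-controller-backup-s3/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": ["arn:aws:s3:::rafay-controller-backup-s3"]
    }
  ]
}
```

Attach this policy to the IAM role referenced by the `role_arn` value in `config.yaml`.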
Backup for the controller
- Navigate to the directory where the controller packages were installed; update the `config.yaml` file and run the `radm` commands below to take a backup of the controller.
- Update `config.yaml` with the following values to enable backup for the controller:
backup_restore:
  enabled: true
  schedule: "0 * * * *" # cron schedule: back up every hour
  bucketName: "rafay-controller-backup-s3" # bucket name
  retentionPeriod: "30" # retention period for backups, in days
  restore: false
  restoreFolderName: ""
  eks:
    role_arn: "<IRSA_Role_ARN>" # IAM role ARN with permissions to perform actions on S3
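As a rough storage-planning estimate (not a figure from the product docs): with the hourly schedule and 30-day retention configured above, the bucket holds on the order of 24 × 30 backup snapshots at steady state:

```shell
# Steady-state snapshot count for schedule "0 * * * *" (hourly)
# and retentionPeriod "30" (days), as configured above.
backups_per_day=24
retention_days=30
total=$(( backups_per_day * retention_days ))
echo "$total snapshots retained"
```

Actual object counts in S3 will be higher, since each snapshot consists of multiple objects in Velero's bucket layout.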
- Update the Rafay dependencies in the controller to install the backup service.
$ radm dependency --config config.yaml --kubeconfig <kube config file>
- Update the Rafay application with the backup service.
$ radm application --config config.yaml --kubeconfig <kube config file>
- Check the status of the backup job in the `velero` namespace; a Completed status indicates the backup succeeded.
$ kubectl get backups -n velero
Output:
NAMESPACE   NAME                AGE
velero      controller-backup   13h
$ kubectl describe backup -n velero controller-backup
Output:
Status:
  Completion Timestamp:  2023-10-17T02:43:12Z
  Expiration:            2033-10-16T14:35:23Z
  Format Version:        1.1.0
  Phase:                 Completed
  Progress:
    Items Backed Up:  3241
    Total Items:      3241
  Start Timestamp:  2023-10-17T02:35:23Z
  Version:          1
Events:  <none>
Restore the Controller
- Create a new controller for disaster recovery by following the same installation instructions used for the original controller.
- When disaster recovery must be triggered, obtain the last known good backup snapshot of the old controller by running the command below on the old controller:
$ kubectl get backups -n velero
Output:
NAMESPACE NAME AGE
velero velero-rafay-core-backup-20230119100023 3h22m
velero velero-rafay-core-backup-20230119110023 142m
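Backup names end in a UTC timestamp, so the most recent snapshot can be picked by a lexicographic sort. A minimal sketch using the example listing above (in practice, feed it the real `kubectl get backups -n velero --no-headers` output instead of the hard-coded string):

```shell
# Example `kubectl get backups -n velero` listing (from this guide);
# in practice: listing=$(kubectl get backups -n velero --no-headers)
listing="velero velero-rafay-core-backup-20230119100023 3h22m
velero velero-rafay-core-backup-20230119110023 142m"

# Backup names end in a YYYYMMDDHHMMSS timestamp, so lexicographic
# order matches chronological order: the last sorted name is the newest.
latest=$(printf '%s\n' "$listing" | awk '{print $2}' | sort | tail -n 1)
echo "$latest"
```

Use the printed name as the `restoreFolderName` value in the next step.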
- Update the `config.yaml` of the new controller with the following values to enable disaster recovery:
backup_restore:
  restore: true
  restoreFolderName: "<backup snapshot name to restore from>"
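For illustration, a filled-in restore block using the newer snapshot from the example listing earlier (substitute your own snapshot name):

```yaml
backup_restore:
  restore: true
  # Snapshot name as reported by `kubectl get backups -n velero` on the old controller
  restoreFolderName: "velero-rafay-core-backup-20230119110023"
```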
- If the new controller uses the same database as the old controller, run the following command to connect the new controller to that database:
$ radm database --host '<database endpoint>' --kubeconfig <kube config file> --port 5432 --root-password '<db password>' --root-user '<db username>'
- Update the Rafay dependencies in the new controller to perform the restore process from the backup data of the old controller.
$ radm dependency --config config.yaml --kubeconfig <kube config file>
- Update the Rafay application in the new controller.
$ radm application --config config.yaml --kubeconfig <kube config file>
- Verify the status of the restore in the `velero` namespace; a Completed status confirms a successful restore from the old controller's backup data.
$ kubectl get restore -n velero
Output:
NAMESPACE NAME AGE
velero restore-rafay-core 13h
$ kubectl describe restore -n velero restore-rafay-core
Output:
Status:
  Completion Timestamp:  2023-10-17T02:43:12Z
  Expiration:            2033-10-16T14:35:23Z
  Format Version:        1.1.0
  Phase:                 Completed
  Progress:
    Items Backed Up:  2081
    Total Items:      2081
  Start Timestamp:  2023-10-17T02:35:23Z
  Version:          1
Events:  <none>
- If the restored database is served from a new endpoint rather than the old controller's database endpoint, update the `postgres-admin` service with the new endpoint and restart all pods in the rafay-core namespace.
$ kubectl patch svc -n rafay-core postgres-admin -p '{"spec":{"externalName":"<db-address>"}}'
$ kubectl delete po -n rafay-core --all
- Once the above steps are complete, verify that the DNS entries for the controller's FQDNs point to the new controller endpoints. Users can then log in to the controller console UI with the same credentials used on the old controller:
https://console.<controller-FQDN>
- Verify that the existing clusters return to a healthy state within 10 to 20 minutes.