Backup & Restore
The self-hosted controller gives organizations the flexibility to implement disaster recovery procedures tailored to their specific needs. This guide outlines the backup and restore procedures for self-hosted controllers so that organizations can recover quickly from unforeseen incidents and maintain operational continuity.
Failover Endpoints
The controller's normal operation depends on the following endpoints, which must be accounted for during failover:
- Cloud SQL database
- Load balancers
- AWS EKS service
- AWS EC2 service (worker nodes)
Self-Hosted Controller Disaster Recovery Procedure
Prerequisites
- Ensure that the database is backed up.
- Ensure that all clusters are provisioned and shown as healthy in the UI console when the backup is taken.
- Ensure that an S3 bucket is available to store the backup.
- Ensure that the IRSA role for the `rafay-velero-sa` service account in the `velero` namespace has permission to access that S3 bucket.
Important
The backup and restore procedure applies to both the cloud-based controller and the bare-metal/airgapped controller, and it requires access to an S3 bucket. Without S3 bucket access, backup and restore operations are not possible.
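The exact IAM permissions required are not listed in this guide, so as a hedged sketch: the backup service runs in the `velero` namespace and uses Velero resources, and Velero's standard S3 policy for AWS looks like the following. The bucket name matches the example `rafay-controller-backup-s3` used later in this guide; adjust the ARNs to your bucket and confirm the permission set against your organization's security policies.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["arn:aws:s3:::rafay-controller-backup-s3/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": ["arn:aws:s3:::rafay-controller-backup-s3"]
    }
  ]
}
```

Attach this policy to the IAM role referenced by the `role_arn` value in `config.yaml`.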
Backup for the controller
- Navigate to the directory where the controller packages were installed; update the `config.yaml` file and run the `radm` commands below to take a backup of the controller.
- Update `config.yaml` with the following values to enable backup for the controller:
backup_restore:
  enabled: true
  schedule: "0 * * * *" # cron schedule: back up every hour
  bucketName: "rafay-controller-backup-s3" # bucket name
  retentionPeriod: "30" # retention period for backups, in days
  restore: false
  restoreFolderName: ""
  eks:
    role_arn: "<IRSA_Role_ARN>" # IAM role ARN with permissions to perform actions on S3
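As a rough storage-planning estimate (not a figure from the product docs): with the hourly schedule and 30-day retention configured above, the bucket holds on the order of 24 × 30 backup snapshots at steady state:

```shell
# Steady-state snapshot count for schedule "0 * * * *" (hourly)
# and retentionPeriod "30" (days), as configured above.
backups_per_day=24
retention_days=30
total=$(( backups_per_day * retention_days ))
echo "$total snapshots retained"
```

Actual object counts in S3 will be higher, since each snapshot consists of multiple objects in Velero's bucket layout.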
- Update the Rafay dependencies in the controller to install the backup service.
$ radm dependency --config config.yaml --kubeconfig <kube config file>
- Update the Rafay application with the backup service.
$ radm application --config config.yaml --kubeconfig <kube config file>
- Check the status of the backup job in the `velero` namespace; a Completed status indicates the backup succeeded.
$ kubectl get backups -n velero
Output:
NAMESPACE   NAME                AGE
velero      controller-backup   13h
$ kubectl describe backup -n velero controller-backup
Output:
Status:
  Completion Timestamp:  2023-10-17T02:43:12Z
  Expiration:            2033-10-16T14:35:23Z
  Format Version:        1.1.0
  Phase:                 Completed
  Progress:
    Items Backed Up:  3241
    Total Items:      3241
  Start Timestamp:  2023-10-17T02:35:23Z
  Version:          1
Events:  <none>
Restore the Controller
- Create a new controller for disaster recovery by following the same installation instructions used for the original controller.
- When disaster recovery must be triggered, obtain the last known good backup snapshot of the old controller by running the command below on the old controller:
$ kubectl get backups -n velero
Output:
NAMESPACE NAME AGE
velero velero-rafay-core-backup-20230119100023 3h22m
velero velero-rafay-core-backup-20230119110023 142m
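Backup names end in a UTC timestamp, so the most recent snapshot can be picked by a lexicographic sort. A minimal sketch using the example listing above (in practice, feed it the real `kubectl get backups -n velero --no-headers` output instead of the hard-coded string):

```shell
# Example `kubectl get backups -n velero` listing (from this guide);
# in practice: listing=$(kubectl get backups -n velero --no-headers)
listing="velero velero-rafay-core-backup-20230119100023 3h22m
velero velero-rafay-core-backup-20230119110023 142m"

# Backup names end in a YYYYMMDDHHMMSS timestamp, so lexicographic
# order matches chronological order: the last sorted name is the newest.
latest=$(printf '%s\n' "$listing" | awk '{print $2}' | sort | tail -n 1)
echo "$latest"
```

Use the printed name as the `restoreFolderName` value in the next step.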
- Update the `config.yaml` of the new controller with the following values to enable disaster recovery:
backup_restore:
  restore: true
  restoreFolderName: "<backup snapshot name to restore from>"
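For illustration, a filled-in restore block using the newer snapshot from the example listing earlier (substitute your own snapshot name):

```yaml
backup_restore:
  restore: true
  # Snapshot name as reported by `kubectl get backups -n velero` on the old controller
  restoreFolderName: "velero-rafay-core-backup-20230119110023"
```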
- If the new controller uses the same database as the old controller, run the following command to connect the new controller to that database:
$ radm database --host '<database endpoint>' --kubeconfig <kube config file> --port 5432 --root-password '<db password>' --root-user '<db username>'
- Update the Rafay dependencies in the new controller to perform the restore process from the backup data of the old controller.
$ radm dependency --config config.yaml --kubeconfig <kube config file>
- Update the Rafay application in the new controller.
$ radm application --config config.yaml --kubeconfig <kube config file>
- Verify the status of the restore in the `velero` namespace; a Completed status confirms a successful restore from the old controller's backup data.
$ kubectl get restore -n velero
Output:
NAMESPACE NAME AGE
velero restore-rafay-core 13h
$ kubectl describe restore -n velero restore-rafay-core
Output:
Status:
  Completion Timestamp:  2023-10-17T02:43:12Z
  Expiration:            2033-10-16T14:35:23Z
  Format Version:        1.1.0
  Phase:                 Completed
  Progress:
    Items Backed Up:  2081
    Total Items:      2081
  Start Timestamp:  2023-10-17T02:35:23Z
  Version:          1
Events:  <none>
- If the restored database is served from a new endpoint rather than the old controller's database endpoint, update the `postgres-admin` service with the new endpoint and restart all pods in the rafay-core namespace.
$ kubectl patch svc -n rafay-core postgres-admin -p '{"spec":{"externalName":"<db-address>"}}'
$ kubectl delete po -n rafay-core --all
- Once the above steps are complete, verify that the DNS entries for the controller's FQDNs point to the new controller endpoints. Users can then log in to the controller console UI with the same credentials used on the old controller:
https://console.<controller-FQDN>
- Verify that the existing clusters return to a healthy state within 10 to 20 minutes.