Backup and Restore Guide for Air-Gapped Controllers¶
In highly secure environments where air-gapped Rafay controllers are deployed, it is critical to implement a reliable backup and restore strategy. This guide outlines how to:
- Configure scheduled backups of controller workloads and persistent data
- Store backups in an external AWS S3 object store
- Restore controller state in case of failure or disaster
Architecture Overview¶
The following diagram illustrates the backup and restore workflow for air-gapped Rafay controllers in a disaster recovery scenario:
- The primary (active) controller continuously backs up its application state and database volumes to an external S3 object store.
- In the event of failure, a new controller is provisioned and restored using the latest backup.
- After validation, DNS is updated to route traffic to the restored controller, minimizing downtime and ensuring service continuity.
Prerequisites¶
- Access to an AWS S3 object store
- IAM user or credentials with the required S3 access
- Rafay controller tarball and the radm CLI tool
- Kubernetes cluster access on the air-gapped controller
When to Configure¶
You can set up this configuration at:
- Day-0: During initial controller deployment
- Day-2: On an already running controller
Note
For Day-2 configuration, make sure to run the following command to install the required Velero charts on the controller:
sudo radm dependency --config config.yaml
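After the dependency step completes, a quick way to confirm the Velero components came up is to check their pods. This is a minimal sketch and assumes Velero is installed into the velero namespace, as used throughout this guide:
# All Velero pods should reach the Running state
kubectl get pods -n velero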
Backup Configuration Steps¶
We use Velero for backup and restore of the air-gapped controller. To allow Velero to authenticate with the external S3-compatible object store, you need to securely create and store AWS credentials as a Kubernetes secret.
1. Create an S3 Bucket¶
If not already available, create an S3 bucket to store the backups:
aws s3api create-bucket \
--bucket <YOUR_BUCKET> \
--region <REGION> \
--create-bucket-configuration LocationConstraint=<REGION>
Note
Skip this step if you're using an existing S3 bucket.
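If you want to confirm the bucket exists and is reachable with your credentials before continuing, you can issue a head-bucket call using the same <YOUR_BUCKET> placeholder as above. This is an optional check, not part of the Rafay procedure:
# Completes without error when the bucket exists and is accessible
aws s3api head-bucket --bucket <YOUR_BUCKET>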
2. Create IAM Policy and Attach to User¶
Prepare a policy with permissions required for Velero to perform backup and restore operations:
BUCKET=<YOUR_BUCKET>
cat > velero-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:DescribeVolumes",
"ec2:CreateVolume",
"ec2:CreateSnapshot",
"ec2:DeleteSnapshot",
"ec2:CreateTags",
"ec2:DescribeSnapshots"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket",
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts"
],
"Resource": [
"arn:aws:s3:::${BUCKET}",
"arn:aws:s3:::${BUCKET}/*"
]
}
]
}
EOF
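Because the here-document above is unquoted, the shell expands ${BUCKET} when the file is written. An optional sanity check is to look at the generated policy and confirm the S3 Resource ARNs contain your actual bucket name rather than the literal ${BUCKET}:
# The two S3 Resource ARNs should show your bucket name
cat velero-policy.json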
Attach the policy to your IAM user:
aws iam put-user-policy \
--user-name <USERNAME> \
--policy-name velero-access \
--policy-document file://velero-policy.json
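To verify the inline policy was attached, the corresponding get call echoes the policy document back (same <USERNAME> placeholder as above); this is an optional check:
# Prints the velero-access policy attached to the user
aws iam get-user-policy --user-name <USERNAME> --policy-name velero-access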
3. Create Secret for S3 Credentials¶
a. Generate the credentials file¶
On the node where you are preparing the controller, create a file named cloud with the following content:
cat <<EOF > ~/cloud
[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
EOF
Replace <AWS_ACCESS_KEY_ID> and <AWS_SECRET_ACCESS_KEY> with valid values for your S3 object store.
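Because this file holds live credentials, you may want to restrict its permissions until the Kubernetes secret has been created (and delete it afterwards). This is an optional hardening step, not part of the Rafay procedure:
# Make the credentials file readable only by the current user
chmod 600 ~/cloud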
b. Create the Kubernetes Secret¶
Use the generated file to create the secret in the velero namespace:
kubectl create ns velero
kubectl create secret generic velero \
--namespace velero \
--from-file ~/cloud
Info
The cloud file is not used directly by Velero. Instead, the contents are packaged into a Kubernetes secret that Velero uses to authenticate with your object store. This approach avoids embedding sensitive credentials in config.yaml, reducing security risks and aligning with best practices.
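To confirm the secret was created with the expected contents, describing it should show a single data key named cloud (describe does not print the credential values themselves):
# Lists the secret's data keys without exposing their values
kubectl -n velero describe secret velero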
4. Update Controller Configuration¶
Edit the config.yaml file and configure the backup_restore section as follows:
backup_restore:
  enabled: true
  restore: false
  schedule: "0 0 * * *" # Daily at midnight
  bucketName: "<YOUR_BUCKET>"
  retentionPeriod: "168h0m0s" # 7-day retention
  backupFolderName: "controller-backups"
enabled: true
Enables the backup system. If true, Velero will be deployed to handle backup and restore.
restore: false
Controls whether the system will attempt to restore from a previous backup at startup. Set this to true only when restoring.
schedule: "0 0 * * *"
CRON expression that defines when backups are created. If not set, no scheduled backups will occur.
Note: The minimum supported interval is 45 minutes, which ensures each backup completes before the next one starts.
bucketName: "<YOUR_BUCKET>"
S3-compatible bucket where Velero will store backup data.
retentionPeriod: "168h0m0s"
Duration to retain backups (for example, "168h" = 7 days). Backups older than this period are deleted.
Note: Only the first backup is a full backup; subsequent backups are incremental. If the full backup expires, the incremental backups that depend on it become unusable.
backupFolderName: "controller-backups"
Folder in the bucket where backups are stored. When restoring, this must match the folder of the last successful backup.
This enables automated backup scheduling to your designated S3 bucket.
5. Run Installation Commands¶
# Install dependencies for Rafay controller
sudo radm dependency --config config.yaml
# Install controller application
sudo radm application --config config.yaml
Once these steps are completed, your controller will automatically back up to the S3 object store as per the defined schedule.
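Once backups are running, you can monitor them from the controller itself. A minimal sketch, assuming Velero's custom resources live in the velero namespace and that backups are written under the configured backupFolderName prefix in your bucket:
# List the backup schedule and the backups it has produced
kubectl -n velero get schedules.velero.io
kubectl -n velero get backups.velero.io
# Inspect what has actually been uploaded to the bucket
aws s3 ls s3://<YOUR_BUCKET>/controller-backups/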
Restore Procedure¶
Note: Perform the following steps on the newly provisioned controller node, or on a passive controller if you already have one available. Ensure you have access to the required backup files and credentials before proceeding.
Follow these steps to restore your Rafay air-gapped controller from backup:
1. Prepare S3 Credentials

   On the controller node, create a file named cloud with your S3-compatible object store credentials:

   cat <<EOF > ~/cloud
   [default]
   aws_access_key_id=<AWS_ACCESS_KEY_ID>
   aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
   EOF

   Replace <AWS_ACCESS_KEY_ID> and <AWS_SECRET_ACCESS_KEY> with your actual credentials.

2. Create Kubernetes Secret for Velero

   Use the credentials file to create a Kubernetes secret in the velero namespace:

   kubectl create ns velero
   kubectl create secret generic velero \
     --namespace velero \
     --from-file ~/cloud

3. Download and Extract Controller Package

   Obtain the Rafay controller tarball and extract it on the controller node.

4. Update config.yaml for Restore

   Edit the config.yaml file and update the backup_restore section as follows:

   backup_restore:
     enabled: true
     restore: true
     schedule: "0 0 * * *" # Daily at midnight (can be adjusted)
     bucketName: "<YOUR_BUCKET>" # Name of your S3 bucket
     retentionPeriod: "168h0m0s" # 7-day retention
     backupFolderName: "<LATEST_BACKUP_FOLDER>" # Name of the latest backup folder

   Ensure restore: true is set and provide the correct bucket and backup folder names.

5. Install Dependencies and Restore Controller

   Run the following commands to install dependencies and restore the controller application:

   sudo radm dependency --config config.yaml
   sudo radm application --config config.yaml

6. Post-Restore Steps

   Once the restore is complete:

   - Update your DNS or routing configuration to point to the new controller's IP address.
   - Test and verify that all controller components and workloads have been restored successfully (see the verification sketch after this list).
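After radm finishes the restore step, you can check that Velero's restore completed and that workloads are healthy. This is a minimal verification sketch, assuming the Velero custom resources live in the velero namespace; the exact names of the restore objects depend on your backup:
# Each restore object should eventually report a Completed phase
kubectl -n velero get restores.velero.io
# Spot-check for pods that are not yet healthy across all namespaces
kubectl get pods -A | grep -vE 'Running|Completed'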
