Backup and Restore Guide for Air-Gapped Controllers¶
In highly secure environments where air-gapped Rafay controllers are deployed, it is critical to implement a reliable backup and restore strategy. This guide outlines how to:
- Configure scheduled backups of controller workloads and persistent data
- Store backups in an external AWS S3 object store
- Restore controller state in case of failure or disaster
Architecture Overview¶
The backup and restore workflow for air-gapped Rafay controllers in a disaster recovery scenario is as follows:
- The primary (active) controller continuously backs up its application state and database volumes to an external S3 object store.
- In the event of failure, a new controller is provisioned and restored using the latest backup.
- After validation, DNS is updated to route traffic to the restored controller, minimizing downtime and ensuring service continuity.
Prerequisites¶
- Access to an AWS S3 object store
- IAM user or credentials with required S3 access
- Rafay controller tarball and `radm` CLI tool
- Kubernetes cluster access on the air-gapped controller
When to Configure¶
You can set up this configuration at:
- Day-0: During initial controller deployment
- Day-2: On an already running controller
Note
For Day-2 configuration, make sure to run the following command to install the required Velero charts on the controller:
sudo radm dependency --config config.yaml
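After the dependency step has run, you can confirm that the Velero components came up (a quick check, assuming the charts deploy into the `velero` namespace used later in this guide):

# Verify that the Velero pods are running
kubectl get pods -n velero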
Backup Configuration Steps¶
We use Velero for backup and restore of the air-gapped controller. To allow Velero to authenticate with the external S3-compatible object store, you need to securely create and store AWS credentials as a Kubernetes secret.
1. Create an S3 Bucket¶
If not already available, create an S3 bucket to store the backups:
aws s3api create-bucket \
--bucket <YOUR_BUCKET> \
--region <REGION> \
--create-bucket-configuration LocationConstraint=<REGION>
Note
Skip this step if you're using an existing S3 bucket.
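If you want to double-check the bucket before proceeding, a quick existence and permissions check can be done with the AWS CLI (optional, using the same `<YOUR_BUCKET>` placeholder):

# Succeeds only if the bucket exists and your credentials can access it
aws s3api head-bucket --bucket <YOUR_BUCKET> --region <REGION>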
2. Create IAM Policy and Attach to User¶
Prepare a policy with permissions required for Velero to perform backup and restore operations:
BUCKET=<YOUR_BUCKET>
cat > velero-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot",
        "ec2:CreateTags",
        "ec2:DescribeSnapshots"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "arn:aws:s3:::${BUCKET}",
        "arn:aws:s3:::${BUCKET}/*"
      ]
    }
  ]
}
EOF
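If you do not already have a dedicated IAM user for Velero, you can create one and generate access keys first (optional; `<USERNAME>` is the same placeholder used below, and any existing user with these permissions works):

# Create a dedicated IAM user for Velero (skip if reusing an existing user)
aws iam create-user --user-name <USERNAME>
# Generate access keys to use in the credentials file later in this guide
aws iam create-access-key --user-name <USERNAME>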
Attach the policy to your IAM user:
aws iam put-user-policy \
--user-name <USERNAME> \
--policy-name velero-access \
--policy-document file://velero-policy.json
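You can verify that the inline policy was attached as expected (optional sanity check):

# Show the inline policy attached to the user
aws iam get-user-policy --user-name <USERNAME> --policy-name velero-access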
3. Create Secret for S3 Credentials¶
a. Generate the credentials file¶
On the node where you are preparing the controller, create a file named `cloud` with the following content:
cat <<EOF > ~/cloud
[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
EOF
Replace `<AWS_ACCESS_KEY_ID>` and `<AWS_SECRET_ACCESS_KEY>` with valid values for your S3 object store.
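Because this file contains long-lived credentials, it is reasonable to restrict its permissions before creating the secret (optional hardening step):

# Make the credentials file readable only by the current user
chmod 600 ~/cloud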
b. Create the Kubernetes Secret¶
Use the generated file to create the secret in the `velero` namespace:
kubectl create ns velero
kubectl create secret generic velero \
--namespace velero \
--from-file ~/cloud
Info
The `cloud` file is not used directly by Velero. Instead, its contents are packaged into a Kubernetes secret that Velero uses to authenticate with your object store. This approach avoids embedding sensitive credentials in `config.yaml`, reducing security risks and aligning with best practices.
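To confirm the secret was created correctly without printing the credentials, you can inspect its metadata (optional check):

# Confirms the secret exists and contains the "cloud" key
kubectl -n velero describe secret velero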
4. Update Controller Configuration¶
Edit the `config.yaml` file and add the following `backup_restore` section:
backup_restore:
  enabled: true
  restore: false
  schedule: "0 0 * * *"                  # Daily at midnight
  bucketName: "<YOUR_BUCKET>"
  retentionPeriod: "168h0m0s"            # 7-day retention
  backupFolderName: "controller-backups"
- `enabled`: Enables the backup system. If true, Velero is deployed to handle backup and restore.
- `restore`: Controls whether the controller attempts to restore from a previous backup at startup. Leave it false for normal backup operation; set it to true only when restoring (see the restore procedure below).
- `schedule`: Cron expression that defines when backups are created ("0 0 * * *" = daily at midnight). If not set, no scheduled backups occur. Note: the minimum supported interval is 45 minutes, to ensure each backup completes before the next one starts.
- `bucketName`: S3-compatible bucket where Velero stores backup data.
- `retentionPeriod`: Duration to retain backups ("168h0m0s" = 7 days). Backups older than this are deleted. Note: only the first backup is a full backup; subsequent backups are incremental. If the full backup expires, the incremental backups that depend on it become unusable.
- `backupFolderName`: Folder name used when restoring from backup. Must match the folder of the last successful backup.
This enables automated backup scheduling to your designated S3 bucket.
5. Run Installation Commands¶
# Install dependencies for Rafay controller
sudo radm dependency --config config.yaml
# Install controller application
sudo radm application --config config.yaml
Once these steps are completed, your controller will automatically back up to the S3 object store as per the defined schedule.
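Once the schedule has fired at least once, you can confirm that backups are being created. The commands below are a sketch of such a check; they assume the Velero CRDs were installed by the charts and that backup data lands under the configured folder in your bucket:

# List the Velero schedule and the backups it has produced
kubectl -n velero get schedules.velero.io
kubectl -n velero get backups.velero.io

# Optionally confirm that backup data is arriving in the S3 bucket
aws s3 ls s3://<YOUR_BUCKET>/controller-backups/ --recursive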
Restore Procedure¶
Note: Perform the following steps on the newly provisioned controller node, or on a passive controller if you already have one available. Ensure you have access to the required backup files and credentials before proceeding.
Follow these steps to restore your Rafay air-gapped controller from backup:
- Prepare S3 Credentials

  On the controller node, create a file named `cloud` with your S3-compatible object store credentials:

    cat <<EOF > ~/cloud
    [default]
    aws_access_key_id=<AWS_ACCESS_KEY_ID>
    aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
    EOF

  Replace `<AWS_ACCESS_KEY_ID>` and `<AWS_SECRET_ACCESS_KEY>` with your actual credentials.
- Create Kubernetes Secret for Velero

  Use the credentials file to create a Kubernetes secret in the `velero` namespace:

    kubectl create ns velero
    kubectl create secret generic velero \
      --namespace velero \
      --from-file ~/cloud
- Download and Extract Controller Package

  Obtain the Rafay controller tarball and extract it on the controller node.
- Update `config.yaml` for Restore

  Edit the `config.yaml` file and update the `backup_restore` section as follows:

    backup_restore:
      enabled: true
      restore: true
      schedule: "0 0 * * *"                       # Daily at midnight (can be adjusted)
      bucketName: "<YOUR_BUCKET>"                 # Name of your S3 bucket
      retentionPeriod: "168h0m0s"                 # 7-day retention
      backupFolderName: "<LATEST_BACKUP_FOLDER>"  # Name of the latest backup folder

  Ensure `restore: true` is set and provide the correct bucket and backup folder names.
- Install Dependencies and Restore Controller

  Run the following commands to install dependencies and restore the controller application:

    sudo radm dependency --config config.yaml
    sudo radm application --config config.yaml
- Post-Restore Steps

  Once the restore is complete:

  - Update your DNS or routing configuration to point to the new controller's IP address.
  - Test and verify that all controller components and workloads have been restored successfully.
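As part of that verification, you can check the Velero restore status and the controller workloads directly (a minimal sketch, assuming the Velero CRDs are present on the restored controller):

# Check that the Velero restore completed without errors
kubectl -n velero get restores.velero.io

# Confirm controller pods are running across namespaces
kubectl get pods -A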