Backup and Restore Guide for Air-Gapped Controllers

In highly secure environments where air-gapped Rafay controllers are deployed, it is critical to implement a reliable backup and restore strategy. This guide outlines how to:

  • Configure scheduled backups of controller workloads and persistent data
  • Store backups in an external AWS S3 object store
  • Restore controller state in case of failure or disaster

Architecture Overview

The following diagram illustrates the backup and restore workflow for air-gapped Rafay controllers in a disaster recovery scenario:

  • The primary (active) controller continuously backs up its application state and database volumes to an external S3 object store.
  • In the event of failure, a new controller is provisioned and restored using the latest backup.
  • After validation, DNS is updated to route traffic to the restored controller, minimizing downtime and ensuring service continuity.

[Diagram: backup and restore workflow for an air-gapped Rafay controller]


Prerequisites

  • Access to an AWS S3 object store
  • IAM user or credentials with the required S3 permissions
  • Rafay controller tarball and radm CLI tool
  • Kubernetes cluster access on the air-gapped controller

When to Configure

You can set up this configuration at:

  • Day-0: During initial controller deployment
  • Day-2: On an already running controller

Note

For Day-2 configuration, make sure to run the following command to install the required Velero charts on the controller:

sudo radm dependency --config config.yaml

Backup Configuration Steps

We use Velero for backup and restore of the air-gapped controller. To allow Velero to authenticate with the external S3-compatible object store, you need to securely create and store AWS credentials as a Kubernetes secret.

1. Create an S3 Bucket

If not already available, create an S3 bucket to store the backups:

aws s3api create-bucket \
  --bucket <YOUR_BUCKET> \
  --region <REGION> \
  --create-bucket-configuration LocationConstraint=<REGION>

Note

Skip this step if you're using an existing S3 bucket.
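
Whether the bucket is new or existing, you can optionally confirm that it is reachable with the credentials you plan to use (this assumes the AWS CLI on that machine is configured with the same credentials):

aws s3api head-bucket \
  --bucket <YOUR_BUCKET> \
  --region <REGION>

A successful call returns no output; an error indicates a missing bucket or insufficient permissions.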

2. Create IAM Policy and Attach to User

Prepare a policy with permissions required for Velero to perform backup and restore operations:

BUCKET=<YOUR_BUCKET>
cat > velero-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot",
        "ec2:CreateTags",
        "ec2:DescribeSnapshots"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "arn:aws:s3:::${BUCKET}",
        "arn:aws:s3:::${BUCKET}/*"
      ]
    }
  ]
}
EOF

Attach the policy to your IAM user:

aws iam put-user-policy \
  --user-name <USERNAME> \
  --policy-name velero-access \
  --policy-document file://velero-policy.json
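
To double-check that the inline policy is attached, you can retrieve it (this assumes your CLI session has IAM read permissions):

aws iam get-user-policy \
  --user-name <USERNAME> \
  --policy-name velero-access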

3. Create Secret for S3 Credentials

a. Generate the credentials file

On the node where you are preparing the controller, create a file named cloud with the following content:

cat <<EOF > ~/cloud
[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
EOF

Replace <AWS_ACCESS_KEY_ID> and <AWS_SECRET_ACCESS_KEY> with valid values for your S3 object store.
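
Because this file contains plain-text credentials, it is good practice to restrict its permissions before proceeding:

chmod 600 ~/cloud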

b. Create the Kubernetes Secret

Use the generated file to create the secret in the velero namespace:

kubectl create ns velero
kubectl create secret generic velero \
  --namespace velero \
  --from-file ~/cloud

Info

The cloud file is not used directly by Velero. Instead, the contents are packaged into a Kubernetes secret that Velero uses to authenticate with your object store. This approach avoids embedding sensitive credentials in config.yaml, reducing security risks and aligning with best practices.
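
To confirm that the secret was created as expected, you can list it in the velero namespace:

kubectl get secret velero -n velero

Once the secret exists in the cluster, you can remove the local ~/cloud file if you prefer not to keep credentials on disk.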

4. Update Controller Configuration

Edit the config.yaml file and add the following backup_restore section:

backup_restore:
  enabled: true
  restore: false
  schedule: "0 0 * * *"           # Daily at midnight
  bucketName: "<YOUR_BUCKET>"
  retentionPeriod: "168h0m0s"     # 7-day retention
  backupFolderName: "controller-backups"

The fields are as follows:

  • enabled: true
    Enables the backup system. If true, Velero is deployed to handle backup and restore.
  • restore: false
    Controls whether the controller attempts to restore from a previous backup at startup. Leave this set to false when configuring backups; set it to true only when performing a restore.
  • schedule: "0 0 * * *"
    CRON expression that defines when backups are created. If not set, no scheduled backups occur.
    Note: The minimum supported interval is 45 minutes, which ensures each backup completes before the next one starts.
  • bucketName: "<YOUR_BUCKET>"
    S3-compatible bucket where Velero stores backup data.
  • retentionPeriod: "168h0m0s"
    Duration to retain backups ("168h0m0s" = 7 days). Backups older than this period are deleted.
    Note: Only the first backup is a full backup; subsequent backups are incremental. If the full backup expires, the incremental backups that depend on it become unusable.
  • backupFolderName: "controller-backups"
    Folder in the bucket where backups are stored. When restoring, this must match the folder containing the last successful backup.

This enables automated backup scheduling to your designated S3 bucket.

5. Run Installation Commands

# Install dependencies for Rafay controller
sudo radm dependency --config config.yaml

# Install controller application
sudo radm application --config config.yaml

Once these steps are completed, your controller will automatically back up to the S3 object store as per the defined schedule.
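
To confirm that scheduled backups are being created, you can query the Velero resources on the controller cluster and list the bucket contents (the resource names below assume a standard Velero installation; the exact key layout in the bucket depends on how the controller configures Velero):

kubectl get schedules.velero.io -n velero
kubectl get backups.velero.io -n velero
aws s3 ls s3://<YOUR_BUCKET>/ --recursive | head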


Restore Procedure

Note: Perform the following steps on the newly provisioned controller node, or on a passive controller if you already have one available. Ensure you have access to the required backup files and credentials before proceeding.

Follow these steps to restore your Rafay air-gapped controller from backup:

  1. Prepare S3 Credentials

    On the controller node, create a file named cloud with your S3-compatible object store credentials:

    cat <<EOF > ~/cloud
    [default]
    aws_access_key_id=<AWS_ACCESS_KEY_ID>
    aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
    EOF
    
    Replace <AWS_ACCESS_KEY_ID> and <AWS_SECRET_ACCESS_KEY> with your actual credentials.

  2. Create Kubernetes Secret for Velero

    Use the credentials file to create a Kubernetes secret in the velero namespace:

    kubectl create ns velero
    kubectl create secret generic velero \
      --namespace velero \
      --from-file ~/cloud
    
  3. Download and Extract Controller Package

    Obtain the Rafay controller tarball and extract it on the controller node.
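
    A typical extraction might look like the following; the tarball filename below is a placeholder and will differ depending on the release you downloaded:

    tar -xzf <RAFAY_CONTROLLER_TARBALL>.tar.gz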

  4. Update config.yaml for Restore

    Edit the config.yaml file and update the backup_restore section as follows:

    backup_restore:
      enabled: true
      restore: true
      schedule: "0 0 * * *"           # Daily at midnight (can be adjusted)
      bucketName: "<YOUR_BUCKET>"     # Name of your S3 bucket
      retentionPeriod: "168h0m0s"     # 7-day retention
      backupFolderName: "<LATEST_BACKUP_FOLDER>"  # Name of the latest backup folder
    
    Ensure restore is set to true and provide the correct bucket and backup folder names; the listing example below shows one way to identify the latest backup folder.
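
    If you are unsure which folder contains the most recent backup, you can list the bucket contents (this assumes the AWS CLI on this node is configured with the same credentials):

    aws s3 ls s3://<YOUR_BUCKET>/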

  5. Install Dependencies and Restore Controller

    Run the following commands to install dependencies and restore the controller application:

    sudo radm dependency --config config.yaml
    sudo radm application --config config.yaml
    
  6. Post-Restore Steps

    Once the restore is complete:

    • Update your DNS or routing configuration to point to the new controller's IP address.
    • Test and verify that all controller components and workloads have been restored successfully.
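
    The commands below are one way to spot-check the restored controller; the resource names assume a standard Velero installation, and the namespaces to inspect depend on your deployment:

    # Check that the Velero restore completed successfully
    kubectl get restores.velero.io -n velero

    # List any pods that are not in the Running state
    kubectl get pods -A | grep -v Running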