Bare Metal Replication And Virtualization Environment (BRAVE)¶

BRAVE (Bare Metal Replication And Virtualization Environment) offers a virtual, cost-efficient, convenient, automated and on-demand tool for executing use cases requiring bare metal infrastructure.

The cost and complexity of bare metal deployments can be prohibitive for a number of non-production use cases, such as:

Creating on-demand labs for conducting quick proof of concepts, demonstrations or experiments
Creating testbed environments for development, debugging and automated testing
Performing comprehensive architectural and security assessments through construction of proof of concept deployments.

Overview¶

BRAVE simplifies and streamlines bare metal deployments (especially for non-production use cases) by:

Providing fully automated instantiation of a single cloud instance on one of the supported cloud providers (currently Oracle Cloud Infrastructure and Amazon Web Services)
Replicating the entire bare metal infrastructure within this single cloud instance by employing VirtualBox- and Vagrant-managed virtual machines and capitalizing on VirtualBox's networking capabilities.
Providing multiple provisioners that execute tailored workflows specifically designed for utilizing bare metal infrastructure in various use cases.

BRAVE facilitates a spectrum of automated workflows through its provisioners:

VM Deployment: Enables deployment of multiple virtual machines with configurable capacity and OS flavours on a single layer 2 network. These virtual machines are allocated static IPs and have outbound access to the Internet. This setup serves as a versatile deployment solution for interconnected bare metal servers. BRAVE also configures SSH access to these VMs from the hosting cloud instance. This capability is supported by the vms_only provisioner.
EKS Anywhere Bare Metal Cluster Creation: Facilitates the creation of an EKS Anywhere Bare Metal (EKSA-BM) Kubernetes cluster. VirtualBox-managed VMs emulate an EKSA admin machine along with all cluster machines. These EKSA machines are interconnected through a VirtualBox NAT network, which allows cluster machines to PXE boot from the admin machine and install software from the Internet. BRAVE also implements a power management algorithm that monitors the state of the cluster and automatically powers the VirtualBox VMs on and off without requiring BMC integration. This power management is mandatory for automated provisioning of an EKSA cluster. This functionality is supported by the eksabm_cluster provisioner, which uses the eksctl anywhere CLI for EKSA operations.
EKS Anywhere Bare Metal Cluster Creation using Rafay Controller: This functionality is supported by the rafay_eksabm_cluster provisioner, which uses the Rafay Systems Inc. Controller for EKSA operations. All other implementation details are the same as for the eksabm_cluster provisioner described above.

BRAVE is written to be extensible: its functionality can be extended to new bare metal use cases by adding new provisioners. Each provisioner implements a tailored workflow designed for utilizing bare metal infrastructure in a particular use case.

BRAVE is a Rafay Systems Inc. project and is publicly available on GitHub. For a comprehensive list of all open-source projects by Rafay, please refer to this link.

Providers and Provisioners¶

BRAVE supports virtualization of a number of bare metal deployment use cases by automatically creating a cloud instance and then executing workflows that spawn virtualized bare metal infrastructure within that instance. The configuration item infrastructure_provider determines which public cloud to use, and provisioner selects the workflow to execute. Any supported infrastructure_provider setting can be combined with any supported provisioner setting to fit a particular bare metal deployment use case.

Providers¶

The infrastructure_provider determines which public cloud is used to deploy the cloud instance to house the virtualized bare metal infrastructure. The options currently supported are:

aws: Cloud instance is automatically launched in AWS Public Cloud. A metal instance type is required.
oci: Cloud instance is automatically launched in OCI Public Cloud. All instance types are compatible.
infra_exists: No cloud instance is automatically launched. A pre-existing instance is assumed. SSH access is required to this instance.

Provisioner¶

The provisioner setting determines the workflow that is run on the cloud instance. Each workflow targets a specific use case, for instance creating VMs or creating a cluster. The currently supported settings for provisioner are:

vms_only: Supports automatic deployment of virtual machines on the cloud instance. VirtualBox-managed VMs are allocated static IPs on the same layer 2 network and have outbound access to the Internet.
eksabm_cluster: Supports automatic creation of an EKSA-BM cluster using VMs on the cloud instance. Leverages the eksctl anywhere CLI for cluster creation and VirtualBox for networking, VM management and power management.
rafay_eksabm_cluster: Uses the Rafay Systems controller for EKSA-BM cluster creation and VirtualBox for networking, VM management and power management.
none: Applies no provisioner, so no VMs or clusters are created. A possible use case is creating only the cloud instance.

Note: Refer to the discussion of the structure of BRAVE's configuration file for further details.
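
The provider and provisioner settings listed above can be checked before any cloud resources are created. The following is a minimal Python sketch of such a check, not BRAVE's actual implementation: the function name validate_selection and the error messages are hypothetical, and only the setting names and their allowed values come from this document.

```python
# Allowed values as listed in this document.
SUPPORTED_PROVIDERS = {"aws", "oci", "infra_exists"}
SUPPORTED_PROVISIONERS = {"vms_only", "eksabm_cluster", "rafay_eksabm_cluster", "none"}


def validate_selection(config: dict) -> tuple:
    """Return (provider, provisioner) from a parsed config, or raise ValueError.

    `config` is assumed to be the already-parsed content of the input file,
    i.e. a plain dict with top-level keys as shown in this document.
    """
    provider = config.get("infrastructure_provider")
    provisioner = config.get("provisioner")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"unsupported infrastructure_provider {provider!r}; "
            f"expected one of {sorted(SUPPORTED_PROVIDERS)}"
        )
    if provisioner not in SUPPORTED_PROVISIONERS:
        raise ValueError(
            f"unsupported provisioner {provisioner!r}; "
            f"expected one of {sorted(SUPPORTED_PROVISIONERS)}"
        )
    return provider, provisioner


if __name__ == "__main__":
    # Any provider can be paired with any provisioner.
    print(validate_selection({"infrastructure_provider": "oci", "provisioner": "vms_only"}))
```

Because the two settings are independent, the check validates each one separately rather than enumerating valid pairs.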

Supported Use Cases¶

BRAVE facilitates a spectrum of automated workflows through its provisioners. Below is a discussion of these provisioners and the use cases they support in detail.

Deploying VMs on a Cloud Instance¶

Most conventional bare metal deployments encompass a small network of interconnected servers capable of direct communication among themselves and with external networks through a designated gateway. However, setting up non-production replicas of such deployments for testing, evaluations, demonstrations, or development purposes can become cost-prohibitive, primarily due to hardware specifications.

BRAVE offers a solution to virtualize these deployments, substantially reducing costs and hardware requirements. It employs VirtualBox- and Vagrant-managed VMs to replicate bare metal servers and uses VirtualBox's networking capabilities to establish the necessary networking infrastructure within a single cloud instance on a supported public cloud. All essential software and packages are automatically installed on this cloud instance.

As an example, the excerpt of input.yaml below instructs BRAVE to:

- Provision a cloud instance named "brave-node" in the OCI public cloud (infrastructure_provider is set to oci)
- Deploy a total of 3 VirtualBox VMs on this instance (provisioner is set to vms_only):
  - 2x Ubuntu 20.04 VMs with name workers and capacity cpu=3 vCPUs and memory=16 GB. These VMs will be named workers-1 and workers-2.
  - 1x Ubuntu 20.04 VM with name storage and capacity cpu=2 vCPUs and memory=16 GB. This VM will be named storage-1.

infrastructure_provider: oci  
infrastructure_provider_config:
  oci:
    host_name: "brave-node"
    .....
    .....

provisioner: vms_only
provisioner_config:
# VM Provisioner configuration
  vms_only:
    - name: "workers"
      count: 2
      cpu: 3       # in vcpus
      mem: 16384   # in MB
      osfamily: ubuntu # currently only ubuntu is supported
      vagrant_box: "bento/ubuntu-20.04"
    - name: "storage"
      #count: 1             # (default value)
      #cpu: 2               # (default value)
      #mem: 16384           # (default value)
      #osfamily: ubuntu     # (default value)
      #vagrant_box: "bento/ubuntu-20.04"  # (default value)
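
To make the naming and defaulting rules in this example concrete, here is a minimal Python sketch that expands the vms_only groups into individual VM definitions. It is an illustration under stated assumptions, not BRAVE's code: the function expand_vm_groups and the output key mem_mb are hypothetical, while the default values and the name-index naming scheme (workers-1, storage-1) are taken from the excerpt and description above.

```python
# Defaults as annotated "(default value)" in the input.yaml excerpt.
DEFAULTS = {
    "count": 1,
    "cpu": 2,        # vCPUs
    "mem": 16384,    # MB
    "osfamily": "ubuntu",
    "vagrant_box": "bento/ubuntu-20.04",
}


def expand_vm_groups(groups: list) -> list:
    """Expand each VM group into `count` VMs named <name>-1 .. <name>-<count>."""
    vms = []
    for group in groups:
        spec = {**DEFAULTS, **group}  # explicit settings override defaults
        for index in range(1, spec["count"] + 1):
            vms.append({
                "name": f"{spec['name']}-{index}",
                "cpu": spec["cpu"],
                "mem_mb": spec["mem"],
                "osfamily": spec["osfamily"],
                "vagrant_box": spec["vagrant_box"],
            })
    return vms


if __name__ == "__main__":
    groups = [
        {"name": "workers", "count": 2, "cpu": 3, "mem": 16384},
        {"name": "storage"},  # relies entirely on defaults
    ]
    for vm in expand_vm_groups(groups):
        print(vm["name"], vm["cpu"], vm["mem_mb"])
```

Run against the two groups above, the sketch yields workers-1 and workers-2 with 3 vCPUs each and storage-1 with the default 2 vCPUs, matching the description of the excerpt.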

EKSA Bare Metal Cluster using VMs¶

Creating an EKS Anywhere Bare Metal (EKSA-BM) Kubernetes cluster can be non-trivial and cost-prohibitive for certain non-production use cases, as there are extensive hardware and networking requirements to meet.

BRAVE makes it possible to create non-production EKSA-BM clusters without access to specialized hardware or a dedicated networking setup. With BRAVE, the extensive hardware and networking requirements of an EKSA-BM cluster are reduced to a single requirement:

Having permission to launch a single cloud instance in a supported cloud provider (AWS and OCI are currently supported).

BRAVE can:

Launch an instance in a cloud provider.
Inside this cloud instance, create all infrastructure required for supporting an EKSA-BM cluster. This includes the VMs that emulate the machines, as well as the network.
Using this virtual infrastructure, create an EKSA-BM cluster fully automatically, end to end, without requiring BMC-based power management support.

Since the entire infrastructure is contained within a single cloud instance, it can be shut down by just stopping the cloud instance. This is not only convenient (no hardware required) but also cost-effective. Simply start the instance back up when you wish to restart the cluster.

Create an EKSA Bare Metal Kubernetes Cluster¶

BRAVE simplifies EKSA-BM cluster creation by emulating the entire networking and bare metal setup required for creating EKSA-BM clusters on a single cloud instance of a cloud provider. BRAVE achieves this by:

Creating a cloud instance on a supported cloud or infrastructure provider. Terraform is used to power this functionality (a pre-existing compute instance can also be used). Currently supported infrastructure providers are Oracle Cloud Infrastructure (OCI) and Amazon Web Services (AWS).
Leveraging VirtualBox and Vagrant to create the EKSA-BM cluster setup on the cloud instance using VMs and a NAT network. VirtualBox VMs emulate the cluster hardware and the admin machine, whereas VirtualBox's NAT network emulates the layer 2 network these machines are connected to. This way, EKSA-BM machines are connected to each other on a layer 2 network and are also able to reach the Internet.
Providing an automation engine that handles cluster lifecycle management operations for EKSA-BM clusters end to end without any manual intervention. The lifecycle of an EKSA-BM cluster can be managed by either of two supported provisioners:
- rafay_eksabm_cluster which uses Rafay Systems Inc. Controller
- eksabm_cluster which uses the eksctl anywhere CLI directly
Automatically handling power management of cluster machines WITHOUT a BMC by watching relevant cluster events and powering VMs on and off via the VBoxManage CLI. (See below)

Note

End-to-end creation of a cluster (including the time to create the cloud instance) can take between 30 and 50 minutes. Please be patient!

As an example, the excerpt of input.yaml below instructs BRAVE to:

- Provision a cloud instance named "brave-node" in the OCI public cloud (infrastructure_provider is set to oci)
- Create an EKSA-BM cluster (provisioner is set to eksabm_cluster):
  - The cluster will be named brave, use K8s version 1.27 and have 1 control plane node and 1 worker node.

BRAVE launches one VM to emulate the admin machine and 2 VMs to emulate the cluster nodes.

infrastructure_provider: oci  
infrastructure_provider_config:
  oci:
    host_name: "brave-node"
    ....
    ....

provisioner: eksabm_cluster
provisioner_config:
  eksabm_cluster:
    cluster_name: "brave"
    operation_type: "provision"
    k8s_version: "1.27"
    num_control_plane_nodes: 1
    num_worker_nodes: 1
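
The number of VMs that this configuration results in follows directly from the node counts: one admin machine plus one VM per cluster node. The Python sketch below works through that arithmetic. The function plan_cluster_vms is hypothetical, and the node naming pattern (cluster name followed by cp-n or dp-n and an index) is an assumption inferred from the node names that appear in the BRAVE output excerpt in the Power Management section; BRAVE's real naming may differ.

```python
def plan_cluster_vms(cluster_name: str,
                     num_control_plane_nodes: int,
                     num_worker_nodes: int) -> dict:
    """Summarize the VMs needed to emulate an EKSA-BM cluster.

    Assumption: nodes are named <cluster>-cp-n-<i> (control plane) and
    <cluster>-dp-n-<i> (workers), as suggested by the sample output.
    """
    control_plane = [f"{cluster_name}-cp-n-{i}"
                     for i in range(1, num_control_plane_nodes + 1)]
    workers = [f"{cluster_name}-dp-n-{i}"
               for i in range(1, num_worker_nodes + 1)]
    admin_vm_count = 1  # a single VM emulates the EKSA admin machine
    return {
        "admin_vm_count": admin_vm_count,
        "control_plane": control_plane,
        "workers": workers,
        "total_vms": admin_vm_count + len(control_plane) + len(workers),
    }


if __name__ == "__main__":
    plan = plan_cluster_vms("brave", num_control_plane_nodes=1, num_worker_nodes=1)
    print(plan["total_vms"], plan["control_plane"], plan["workers"])
```

For the configuration above (1 control plane node and 1 worker node), the sketch reports 3 VMs in total: the admin machine plus the two cluster nodes.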

It is also possible to customize the EKSA-BM cluster configuration extensively by providing an EKSA-BM configuration file to the provisioner. In the example below, BRAVE is instructed to create an EKSA-BM cluster and use a configuration file named "eksa-bm-config.yaml" to configure the cluster.

provisioner: eksabm_cluster
provisioner_config:
  eksabm_cluster:
    cluster_name: "brave-cluster"
    operation_type: "provision"
    config_file_name: "eksa-bm-config.yaml"

The rafay_eksabm_cluster provisioner can also be used to create an EKSA-BM cluster. It requires additional configuration related to the Rafay controller. Refer to the VM Management, Debugging and Advanced Usage doc for more details.

Power Management¶

Since VirtualBox does not support Baseboard Management Controller (BMC) integration, EKS Anywhere cannot power the emulated machines on and off automatically. Without BMC support, machines would have to be powered on and off manually at the correct time during provisioning, upgrading and scaling.

To address this issue, BRAVE implements a power management algorithm that monitors the state of the cluster and performs automatic power management of the VirtualBox VMs without requiring BMC integration. VMs are powered on and off using the VBoxManage tool. The algorithm is described below:

Start a loop to monitor cluster progress.
Check whether the cluster creation logs indicate that the cluster has reached the state where machines need to be powered on. This is indicated by the presence of the string "Creating new workload cluster" in the logs.
Collect Tinkerbell workflows and their status: Pending, Running, Failed, or Success.
If there are Pending or Failed Tinkerbell workflows, power cycle the respective VirtualBox VMs with boot order net to initiate a PXE boot of each machine and start these workflows. MAC addresses are used to correlate Tinkerbell workflows with VirtualBox VMs.
Collect machine status from the cluster and check whether any machine is in the Provisioned phase. If so, power cycle it with boot order disk so that it boots from the OS that the Tinkerbell workflow installed on the disk and enters the Running phase.
Repeat the loop until all machines are in the Running phase, signifying the completion of cluster creation.
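
The decision made in each iteration of this loop can be sketched in Python as follows. This is a simplified illustration, not BRAVE's implementation: the names PowerAction, plan_power_actions, cluster_ready and vboxmanage_commands are hypothetical, the inputs are assumed to be already keyed by VirtualBox VM name (that is, the MAC address correlation has been done), and the exact VBoxManage command sequence BRAVE issues is an assumption. The sketch only builds command lists and does not execute them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerAction:
    vm_name: str
    boot_order: str  # "net" for PXE boot, "disk" for the installed OS


def plan_power_actions(workflows: dict, machine_phases: dict) -> list:
    """Decide which VMs to power cycle in one loop iteration.

    workflows:      VM name -> Tinkerbell workflow status
                    (PENDING, RUNNING, FAILED or SUCCESS)
    machine_phases: VM name -> machine phase reported by the cluster
    """
    actions = []
    for vm_name, status in workflows.items():
        if status in ("PENDING", "FAILED"):
            # PXE boot so the Tinkerbell workflow can (re)start.
            actions.append(PowerAction(vm_name, "net"))
    for vm_name, phase in machine_phases.items():
        if phase == "Provisioned":
            # OS is installed; boot from disk to reach the Running phase.
            actions.append(PowerAction(vm_name, "disk"))
    return actions


def cluster_ready(machine_phases: dict) -> bool:
    """The loop ends once every machine is in the Running phase."""
    return bool(machine_phases) and all(
        phase == "Running" for phase in machine_phases.values()
    )


def vboxmanage_commands(action: PowerAction) -> list:
    """Build (but do not run) one possible VBoxManage power cycle sequence."""
    return [
        ["VBoxManage", "controlvm", action.vm_name, "poweroff"],
        ["VBoxManage", "modifyvm", action.vm_name, "--boot1", action.boot_order],
        ["VBoxManage", "startvm", action.vm_name, "--type", "headless"],
    ]


if __name__ == "__main__":
    workflows = {"brave-cp-n-1": "SUCCESS", "brave-dp-n-1": "PENDING"}
    phases = {"brave-cp-n-1": "Running", "brave-dp-n-1": "Provisioning"}
    for action in plan_power_actions(workflows, phases):
        for command in vboxmanage_commands(action):
            print(" ".join(command))
    print("ready:", cluster_ready(phases))
```

Keeping the decision logic separate from command execution makes each iteration easy to test against recorded workflow and machine states. A real implementation would also need to handle a VM that is already powered off, since powering off a stopped VM fails.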

Below is an excerpt from BRAVE's output demonstrating the above power management algorithm in action.

$ ./brave.py
....
....
[11.] Fetching Tinkerbell workflows from cluster : KUBECONFIG=/opt/rafay/native/brave/brave/generated/brave.kind.kubeconfig kubectl get workflows -A -o yaml
   [+] Fetching machine status from cluster to perform power management tasks: KUBECONFIG=/opt/rafay/native/brave/brave/generated/brave.kind.kubeconfig kubectl get machines.cluster.x-k8s.io -A
   [+] Execution of command to fetch machine status passed. stdout:
 NAMESPACE     NAME                               CLUSTER   NODENAME      PROVIDERID                              PHASE          AGE   VERSION
eksa-system   brave-fnlgp                        brave     brave-fnlgp   tinkerbell://eksa-system/brave-cp-n-1   Running        10m   v1.27.4-eks-1-27-10
eksa-system   brave-ng1-5899b4fd4bxjhh86-zqdgl   brave                                                           Provisioning   10m   v1.27.4-eks-1-27-10

   [+] Tinkerbell workflows detected: PENDING:['brave-dp-n-1'] RUNNING:[] FAILED:[] SUCCESS:['brave-cp-n-1']
   [+] Power cycling nodes with Tinkerbell workflows (boot order net) PENDING:['brave-dp-n-1'] OR FAILED:[]
   [+] Powering on cluster node brave-dp-n-1 with boot order net
   [+] Detected machines in phases ['Running', 'Provisioning']
   [+]  .... waiting for 5 minutes to recheck logs ....
...
...

Summary¶

BRAVE addresses diverse bare metal infrastructure needs in a virtualized, cost-efficient, and automated fashion. It simplifies such deployments by automating the creation of a cloud instance on a supported cloud provider and then replicating the entire bare metal infrastructure within this instance through VirtualBox- and Vagrant-managed virtual machines. Workflows that implement bare metal use cases are offered as provisioners, and new use cases can be supported by introducing new provisioners.

Overall, BRAVE is a versatile, extensible solution for virtualizing and simplifying bare metal deployments. Its range of provisioners caters to different use cases, making it an efficient and cost-effective tool for various non-production scenarios.

How can you help¶

Show your support by starring the repository if you find it useful.
Share your thoughts, feedback, or suggest improvements by opening an issue in the repository.
Help expand BRAVE by implementing a Provider or Provisioner for new cloud providers and workflows. Pull requests (PRs) are highly encouraged and appreciated.