Bare Metal Requirements

Bare metal infrastructure must include a combination of GPU-capable and CPU-only nodes. GPU nodes are used for high-performance AI/ML workloads, while CPU-only nodes support services such as orchestration layers and storage controllers (e.g., Ceph).


Bare Metal Servers

  • GPU nodes for training, inference, and LLM workloads
  • CPU-only nodes for control plane components, storage, and background jobs (see the inventory sketch below)
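
The role split above can be captured in a simple node inventory. The following is a minimal sketch in Python; the hostnames, roles, and GPU counts are hypothetical placeholders, not values required by Rafay:

```python
# Minimal node inventory sketch. Hostnames, roles, and GPU counts are
# hypothetical placeholders used only for illustration.
from dataclasses import dataclass

@dataclass
class Node:
    hostname: str
    role: str       # "gpu-worker" or "cpu-worker"
    gpus: int = 0   # 0 for CPU-only nodes

INVENTORY = [
    Node("gpu-node-01", role="gpu-worker", gpus=8),   # training / inference / LLM workloads
    Node("gpu-node-02", role="gpu-worker", gpus=8),
    Node("cpu-node-01", role="cpu-worker"),            # control plane, storage, background jobs
    Node("cpu-node-02", role="cpu-worker"),
]

gpu_nodes = [n for n in INVENTORY if n.gpus > 0]
cpu_nodes = [n for n in INVENTORY if n.gpus == 0]
print(f"{len(gpu_nodes)} GPU nodes, {len(cpu_nodes)} CPU-only nodes")
```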

Infrastructure Components

  • Top-of-Rack (ToR) switches for connecting GPU and storage nodes
  • Out-of-Band (OOB) switches for BMC/iDRAC access
  • Ceph or similar distributed storage setup
  • Optional BlueField DPU interfaces for enhanced isolation and performance

Operating System

  • A base Linux OS image (e.g., Ubuntu) should be accessible over the network for bootstrapping and provisioning
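
One way to confirm the image is reachable before provisioning starts is a simple HTTP check from the provisioning network. A minimal sketch follows; the image URL is a hypothetical placeholder for an internal mirror:

```python
# Check that a base OS image is downloadable from the provisioning network.
# The URL is a hypothetical placeholder for an internal image mirror.
import urllib.request

IMAGE_URL = "http://images.internal.example/ubuntu-22.04-live-server-amd64.iso"

req = urllib.request.Request(IMAGE_URL, method="HEAD")
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        size = resp.headers.get("Content-Length", "unknown")
        print(f"Image reachable (HTTP {resp.status}), size: {size} bytes")
except Exception as exc:
    print(f"Image not reachable: {exc}")
```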

Storage

  • Distributed storage (such as Ceph) must be reachable from all GPU and CPU nodes (see the reachability sketch after this list)
  • Storage VLANs must be configured and routable across node groups
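
Reachability of the storage backend can be verified from each node with plain TCP probes against the Ceph monitor ports. A minimal sketch, assuming hypothetical monitor addresses and Ceph's default monitor ports (3300 for msgr2, 6789 for the legacy messenger):

```python
# TCP reachability check from a node toward Ceph monitors.
# Monitor IPs are hypothetical placeholders; 3300/6789 are Ceph's
# default monitor ports (msgr2 and legacy messenger, respectively).
import socket

CEPH_MONITORS = ["10.20.30.11", "10.20.30.12", "10.20.30.13"]
MON_PORTS = [3300, 6789]

def tcp_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for mon in CEPH_MONITORS:
    for port in MON_PORTS:
        status = "open" if tcp_open(mon, port) else "unreachable"
        print(f"{mon}:{port} -> {status}")
```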

Network

Multiple VLANs must be provisioned to support different traffic types and access layers.

VLAN Type        Description
OOB VLAN         BMC/iDRAC management network
TAN VLAN         Tenant Access Network, typically carried over VXLAN
Storage VLAN     Network segment for Ceph or other storage traffic
iDRAC VLAN       Management VLAN for Dell iDRAC interfaces
DPU VLAN         VLAN for managing BlueField DPUs (if applicable)

VLAN Pool Configuration

  • VLAN pools must be preconfigured for tenant network creation
  • IP address ranges should be assigned and managed via IPAM or equivalent tooling (see the allocation sketch below)
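
The pool requirement can be illustrated with a small allocation sketch. This is not Rafay's IPAM; it only shows, under an assumed tenant VLAN ID range and a hypothetical supernet, what preconfigured pools look like in practice:

```python
# Illustrative VLAN/IP pool allocation. The VLAN ID range and supernet are
# hypothetical; a real deployment would use its IPAM tool of record.
import ipaddress

TENANT_VLAN_POOL = range(2000, 2100)                      # assumed tenant VLAN ID range
TENANT_SUPERNET = ipaddress.ip_network("10.100.0.0/16")   # assumed tenant supernet

# Carve one /24 per tenant VLAN out of the supernet; zip stops at the
# shorter sequence, so no more subnets are allocated than VLAN IDs.
allocations = dict(zip(TENANT_VLAN_POOL, TENANT_SUPERNET.subnets(new_prefix=24)))

for vlan_id in list(allocations)[:3]:
    subnet = allocations[vlan_id]
    print(f"VLAN {vlan_id}: {subnet} (gateway {subnet.network_address + 1})")
```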

SSH Access

  • SSH access to the bare metal nodes is required for provisioning, debugging, and manual intervention
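
A quick way to confirm SSH reachability across the fleet is a TCP probe to port 22 on each node. A minimal sketch with hypothetical node addresses (it checks only that the port answers, not that credentials work):

```python
# SSH reachability probe (TCP port 22 only; does not authenticate).
# Node addresses are hypothetical placeholders.
import socket

NODES = ["10.10.1.11", "10.10.1.12", "10.10.2.21"]

for node in NODES:
    try:
        with socket.create_connection((node, 22), timeout=3):
            print(f"{node}: SSH port open")
    except OSError as exc:
        print(f"{node}: SSH unreachable ({exc})")
```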

Rafay Controller Accessibility

  • Bare metal nodes and control interfaces must have outbound access to the Rafay Controller for cluster lifecycle operations, telemetry, and observability integrations
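
Outbound connectivity toward the controller can be verified from each node with an HTTPS probe. A minimal sketch follows; the controller hostname is a placeholder and should be replaced with the endpoint of your Rafay Controller (SaaS or self-hosted):

```python
# Verify outbound HTTPS connectivity from a node to the Rafay Controller.
# The hostname is a placeholder; substitute your actual controller endpoint.
import socket
import ssl

CONTROLLER_HOST = "console.rafay.example"   # placeholder endpoint
CONTROLLER_PORT = 443

try:
    ctx = ssl.create_default_context()
    with socket.create_connection((CONTROLLER_HOST, CONTROLLER_PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=CONTROLLER_HOST) as tls:
            print(f"Outbound TLS to {CONTROLLER_HOST}:{CONTROLLER_PORT} OK ({tls.version()})")
except OSError as exc:
    print(f"Cannot reach controller: {exc}")
```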