Bare Metal Requirements¶
Bare metal infrastructure must include a combination of GPU-capable and CPU-only nodes. GPU nodes are used for high-performance AI/ML workloads, while CPU-only nodes support services such as orchestration layers and storage controllers (e.g., Ceph).
Bare Metal Servers¶
- GPU nodes for training, inference, or LLM workloads
- CPU-only nodes for control plane components, storage, and background jobs
Infrastructure Components¶
- Top-of-Rack (ToR) switches for connecting GPU and storage nodes
- Out-of-Band (OOB) switches for BMC/iDRAC access
- Ceph or similar distributed storage setup
- Optional BlueField DPU interfaces for enhanced isolation and performance
Operating System¶
- Base Linux OS image (e.g., Ubuntu) should be accessible over the network for bootstrapping and provisioning
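The provisioning stack itself varies, but a quick way to confirm that the base image is actually serveable over the network is an HTTP HEAD request against the image URL. The sketch below is a minimal, illustrative check; the image URL is a hypothetical placeholder, not a real endpoint.

```python
import urllib.request
import urllib.error

# Hypothetical location of the base OS image on the provisioning network.
IMAGE_URL = "http://provisioner.example.internal/images/ubuntu-22.04.img"

def image_is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the image URL answers an HTTP HEAD request with 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

if __name__ == "__main__":
    state = "reachable" if image_is_reachable(IMAGE_URL) else "NOT reachable"
    print(f"{IMAGE_URL}: {state}")
```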
Storage¶
- Distributed storage (such as Ceph) must be reachable from all GPU and CPU nodes
- Storage VLANs must be configured and routable across node groups
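One way to verify the reachability requirement is a plain TCP check from each node against the Ceph monitor ports (3300 for msgr2, 6789 for the legacy messenger). The sketch below assumes placeholder monitor addresses on the storage VLAN; substitute the real ones for your cluster.

```python
import socket

# Placeholder Ceph monitor addresses on the storage VLAN.
CEPH_MONITORS = ["10.20.0.11", "10.20.0.12", "10.20.0.13"]
CEPH_PORTS = [3300, 6789]  # msgr2 and legacy messenger ports

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for mon in CEPH_MONITORS:
        for port in CEPH_PORTS:
            state = "open" if tcp_reachable(mon, port) else "unreachable"
            print(f"{mon}:{port} {state}")
```

Running this from both a GPU node and a CPU-only node confirms that the storage VLAN is routable across node groups, not just within one rack.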
Network¶
Multiple VLANs must be provisioned to support different traffic types and access layers.
| VLAN Type | Description |
|---|---|
| OOB VLAN | BMC/iDRAC management network |
| TAN VLAN | Tenant Access Network, typically carried over VXLAN |
| Storage VLAN | Network segment for Ceph or other storage traffic |
| iDRAC VLAN | Management VLAN for Dell iDRAC interfaces |
| DPU VLAN | VLAN for managing BlueField DPUs (if applicable) |
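To keep the VLAN layout auditable, it can help to express the plan as data and validate it before handing it to automation. The sketch below is illustrative only; the VLAN IDs and subnets are made-up examples, not recommended values.

```python
import ipaddress

# Example VLAN plan; IDs and subnets are illustrative placeholders.
VLAN_PLAN = {
    "oob":     {"vlan_id": 100, "subnet": "10.10.0.0/24"},
    "tan":     {"vlan_id": 200, "subnet": "10.20.0.0/22"},
    "storage": {"vlan_id": 300, "subnet": "10.30.0.0/24"},
    "idrac":   {"vlan_id": 400, "subnet": "10.40.0.0/24"},
    "dpu":     {"vlan_id": 500, "subnet": "10.50.0.0/24"},
}

def validate_plan(plan: dict) -> list[str]:
    """Return a list of problems: out-of-range or duplicate VLAN IDs, overlapping subnets."""
    problems = []
    seen_ids = {}
    networks = []
    for name, entry in plan.items():
        vlan_id = entry["vlan_id"]
        if not 1 <= vlan_id <= 4094:
            problems.append(f"{name}: VLAN ID {vlan_id} out of range")
        if vlan_id in seen_ids:
            problems.append(f"{name}: VLAN ID {vlan_id} already used by {seen_ids[vlan_id]}")
        seen_ids[vlan_id] = name
        networks.append((name, ipaddress.ip_network(entry["subnet"])))
    for i, (name_a, net_a) in enumerate(networks):
        for name_b, net_b in networks[i + 1:]:
            if net_a.overlaps(net_b):
                problems.append(f"{name_a} and {name_b}: subnets overlap")
    return problems

if __name__ == "__main__":
    issues = validate_plan(VLAN_PLAN)
    print("\n".join(issues) if issues else "VLAN plan looks consistent")
```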
VLAN Pool Configuration¶
- VLAN pools must be preconfigured for tenant network creation
- IP address ranges should be assigned and managed through an IPAM system or equivalent tooling
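If a dedicated IPAM system is not yet in place, the standard-library `ipaddress` module can at least illustrate how per-tenant ranges would be carved out of a pool. The pool and prefix length below are hypothetical examples.

```python
import ipaddress

# Hypothetical tenant pool; in practice this comes from your IPAM system.
TENANT_POOL = ipaddress.ip_network("10.100.0.0/16")
TENANT_PREFIX = 24  # one /24 per tenant network

def allocate_tenant_subnets(pool, prefix, count):
    """Yield the first `count` subnets of the requested prefix length from the pool."""
    subnets = pool.subnets(new_prefix=prefix)
    for _ in range(count):
        yield next(subnets)

if __name__ == "__main__":
    for index, subnet in enumerate(allocate_tenant_subnets(TENANT_POOL, TENANT_PREFIX, 4), start=1):
        print(f"tenant-{index}: {subnet}")
```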
SSH Access¶
- SSH access to the nodes is required for provisioning, debugging, and manual intervention
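A lightweight way to confirm SSH reachability ahead of provisioning is to open port 22 and read the server's version banner. The host list below is a placeholder; an actual client such as `ssh` or paramiko would be used for real access.

```python
import socket

# Placeholder management addresses of the bare metal nodes.
NODES = ["10.10.0.21", "10.10.0.22", "10.10.0.23"]

def ssh_banner(host: str, port: int = 22, timeout: float = 3.0) -> str | None:
    """Return the SSH version banner for host:port, or None if unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(256).decode(errors="replace").strip()
    except OSError:
        return None

if __name__ == "__main__":
    for node in NODES:
        banner = ssh_banner(node)
        print(f"{node}: {banner or 'no SSH response'}")
```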
Rafay Controller Accessibility¶
- Bare metal nodes and control interfaces must have outbound access to the Rafay Controller for cluster lifecycle operations, telemetry, and observability integrations
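Outbound access is typically HTTPS, so a basic preflight check is to confirm that a TLS connection can be established from each node to the controller endpoint. The hostname below is a placeholder, not an actual Rafay endpoint; consult the Rafay documentation for the exact hosts and ports your deployment must reach.

```python
import socket
import ssl

# Placeholder controller endpoint; substitute the actual Rafay Controller URL for your org.
CONTROLLER_HOST = "console.example.com"
CONTROLLER_PORT = 443

def can_reach_controller(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if an outbound TLS connection to host:port can be established."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False

if __name__ == "__main__":
    ok = can_reach_controller(CONTROLLER_HOST, CONTROLLER_PORT)
    print(f"Outbound access to {CONTROLLER_HOST}:{CONTROLLER_PORT}: {'OK' if ok else 'BLOCKED'}")
```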