Baremetal

The Bare Metal compute option in the Developer Hub enables deployment of dedicated physical servers that provide direct access to hardware resources. Administrators can provision Bare Metal instances for users by defining profiles with specific hardware configurations. End users can then deploy these instances with a single click, offering a streamlined and high-performance infrastructure experience.

Create Baremetal Compute Instance

To create a Baremetal compute instance:

Navigate to the Developer Hub and select the Baremetal type
Click New Bare Metal to create a new instance

Users can also reach the Baremetal page from the left navigation pane, under the Compute section.

Clicking on the Baremetal option from the left pane or selecting "New Bare Metal" from the home page opens a wizard that allows users to select from the available Baremetal Compute Profiles.

Once the profile is selected, provide the required details. If pricing for the selected profile is configured in Global Settings by the Org Admin, a monthly estimate will be displayed.

Name: Enter a unique name for the compute instance
Description: Provide a brief summary of the instance
Compute Profile: Proceed with the selected compute profile
Workspace: Select the workspace from the drop-down menu
Contract Term (In Months): Specify the duration of the contract. Pricing is dynamically calculated based on the selected term and is displayed in the estimate section on the right
Operating System: Choose the desired operating system for the instance (e.g., Ubuntu 22.04)
Public SSH Key: Paste your public SSH key to enable secure access to the instance
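
Before pasting a key into the Public SSH Key field, it can help to confirm that the text really is a public key (for example, the contents of a `.pub` file produced by `ssh-keygen`) and not a private key or a truncated copy. The following Python sketch checks only the general OpenSSH public key shape, `<type> <base64-blob> [comment]`, and that the type name embedded in the blob matches the declared type. The function name is illustrative, and the checks the platform itself applies to this field are not documented here.

```python
import base64
import binascii
import struct


def looks_like_openssh_public_key(line: str) -> bool:
    """Return True if `line` has the shape '<type> <base64-blob> [comment]'
    and the type name stored inside the blob matches the declared type."""
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        return False
    key_type, b64_blob = parts[0], parts[1]
    try:
        blob = base64.b64decode(b64_blob, validate=True)
    except (binascii.Error, ValueError):
        return False
    if len(blob) < 4:
        return False
    # The blob begins with a 4-byte big-endian length, then the type name.
    (name_len,) = struct.unpack(">I", blob[:4])
    name = key_type.encode()
    return name_len == len(name) and blob[4:4 + name_len] == name
```

A private key file, which starts with a `-----BEGIN ...` header, fails this check, as does a key whose base64 body was cut off during copy and paste.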

Once an instance deployment is initiated, the Status Tracker section displays real-time progress and estimated completion time.

Note: The deployment typically takes 15–20 minutes to complete.

Upon successful deployment, the status updates to Success.
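
For automation that has to wait for a deployment to finish, the behavior described above amounts to polling until the status becomes Success. The sketch below is a generic polling loop in Python, not a documented platform API: `get_status` is a hypothetical callable that the caller supplies, and the "Failed" terminal status is an assumption (only Success is named above). The default timeout of 30 minutes is chosen to sit above the typical 15 to 20 minute deployment time.

```python
import time
from typing import Callable


def wait_for_success(
    get_status: Callable[[], str],            # hypothetical status lookup
    timeout_s: float = 30 * 60,               # above the typical 15-20 minutes
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll `get_status` until it returns a terminal status or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        # "Failed" is an assumed terminal status; adjust to what the UI reports.
        if status in ("Success", "Failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real waiting.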

Important

The list of compute profiles presented to the end user is dynamic: if the administrator updates a profile or publishes a new one, it is immediately available to end users as an option.

View Compute Instances

When the user clicks on Baremetal, they are presented with the list of Baremetal compute instances in their workspace. The list includes the following columns:

Name
Workspace
Created At
Services
Publish Status
Actions

Post-Deployment Operations

Once an instance has been launched and is operational, users can perform a number of actions on it. This section describes those actions.

Remote Access

End users need access to remote instances that operate behind a firewall in a private data center or public cloud. These instances can be a Kubernetes namespace, a Virtual Cluster, or a Dedicated Kubernetes cluster. Secure remote access is powered by Rafay's Zero Trust Kubectl Access (ZTKA) feature.

The user can download the ZTKA "kubeconfig" file, configure the kubectl CLI to use it, and access the instance remotely. Alternatively, clicking the "kubectl" button opens a web shell for interacting with the instance securely.
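
As an illustration of using a downloaded kubeconfig, the Python sketch below builds a kubectl invocation that points at a specific file through kubectl's `--kubeconfig` flag. The file path in the usage note is a placeholder, not a name the platform guarantees, and the helper names are illustrative.

```python
import os
import shutil
import subprocess
from typing import List


def kubectl_command(kubeconfig_path: str, *args: str) -> List[str]:
    """Build a kubectl argument list that uses the given kubeconfig file."""
    return ["kubectl", "--kubeconfig", os.path.expanduser(kubeconfig_path), *args]


def run_kubectl(kubeconfig_path: str, *args: str) -> int:
    """Run kubectl with the given kubeconfig and return its exit code."""
    if shutil.which("kubectl") is None:
        raise FileNotFoundError("kubectl is not installed or not on PATH")
    return subprocess.run(kubectl_command(kubeconfig_path, *args)).returncode
```

For example, `run_kubectl("~/Downloads/ztka-kubeconfig.yaml", "get", "pods")` runs the equivalent of `kubectl --kubeconfig <path> get pods`, where the path is wherever the downloaded file was saved. Setting the `KUBECONFIG` environment variable to the same path is an alternative to passing the flag on every call.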

Collaborators

It is common for end users to work with both internal (i.e. employees) and external (i.e. outside the company) collaborators. Users can add other users to, or remove them from, a specific instance. Once the user enters the collaborator's email address, an invitation is emailed to the collaborator with details on how to access the instance. Once the collaborator logs in, they have the same level of access (i.e. role/privilege) to the instance.

Instance Actions

After a Bare Metal compute instance is deployed, various lifecycle operations can be performed from the Actions panel. The following options are available:

Start: Powers on the instance if it is currently in a stopped state
Stop: Initiates a graceful shutdown of the instance. This is typically used for maintenance or to reduce resource usage
Power Cycle: Performs a complete power cycle, turning the instance off and then back on. Useful for applying configuration changes or resolving issues
Power Reset: Executes a hardware-level reset, similar to pressing the reset button on a physical server. Intended for unresponsive system scenarios
Delete: Permanently removes the instance along with all related configurations. This action cannot be undone

⚠️ Back up any critical data before using the Power Reset or Delete actions. Deleting an instance is irreversible.

View Metrics

To access the metrics for a Bare Metal instance:

Navigate to the list of Bare Metal instances
Click the ellipsis (⋮) icon under the Actions column for the desired instance
Select View Metrics from the dropdown

The Metrics Overview page is displayed, providing insights into:

CPU Utilization: Displays current, peak, and average CPU usage and committed resources
Memory Utilization: Shows memory usage statistics, including current, peak, and average usage
Storage Utilization: Indicates disk usage in terms of current, peak, and average usage

Additionally, GPU information is displayed for each allocated GPU with details such as model and identifier (e.g., GPU #1, GPU #2).
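
To make the current, peak, and average figures concrete, the following Python sketch derives them from a series of utilization samples. It assumes that "current" is the most recent sample, "peak" is the maximum, and "average" is the arithmetic mean over the displayed window. How the Metrics Overview page actually samples and aggregates is not stated here, so treat this as an interpretation aid only.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class UtilizationSummary:
    current: float  # most recent sample (assumed meaning of "current")
    peak: float     # highest sample in the window
    average: float  # arithmetic mean over the window


def summarize(samples: Sequence[float]) -> UtilizationSummary:
    """Summarize utilization samples ordered from oldest to newest."""
    if not samples:
        raise ValueError("at least one sample is required")
    return UtilizationSummary(
        current=samples[-1], peak=max(samples), average=fmean(samples)
    )
```

With samples of 20, 80, and 50 percent, this gives a current value of 50, a peak of 80, and an average of 50.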

GPU Metrics Breakdown

To view GPU-specific metrics, expand the corresponding GPU section. The detailed metrics include:

GPU Utilization: Indicates the percentage of GPU core processing capacity currently in use. A high value suggests the GPU is actively engaged in compute tasks such as inference or training.
GPU Memory Copy Utilization: Reflects how actively the GPU is transferring data between its memory and compute units. Elevated values may indicate frequent data movement, which could impact performance if memory bandwidth becomes a bottleneck.
GPU Temperature: Displays the current temperature of the GPU in degrees Celsius. Continuous high temperatures may lead to thermal throttling or hardware issues, and could indicate inadequate cooling.
GPU SM Clocks: Shows the clock speed of the Streaming Multiprocessors (SMs), which are responsible for executing shader and compute workloads. This metric helps assess if the GPU is running at its full performance potential.
GPU Memory Clocks: Indicates the frequency at which the GPU memory operates. It determines the speed at which data is read from or written to the GPU memory, affecting overall memory throughput.
Framebuffer Memory Used: Displays the amount of memory currently consumed by the GPU for active workloads. This includes model weights, input data, and intermediate results stored in memory.
Framebuffer Memory Free: Indicates the remaining available memory on the GPU that can be used for additional tasks. Monitoring this helps ensure there is sufficient memory to handle new or growing workloads without failure.

These metrics provide a comprehensive view of the GPU's performance characteristics and help in identifying potential bottlenecks or hardware limitations during workload execution.
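
The same quantities can usually be cross-checked from inside the instance. The Python sketch below assumes NVIDIA GPUs with the driver's `nvidia-smi` tool installed, which this page does not state, and queries fields that correspond approximately to the metrics listed above. Treating `utilization.memory` as the counterpart of GPU Memory Copy Utilization is an assumption, and the dashboard's own data source may differ.

```python
import csv
import io
import shutil
import subprocess
from typing import Dict, List

# nvidia-smi --query-gpu field names, in the order they are requested.
FIELDS = [
    "index",
    "name",
    "utilization.gpu",     # GPU Utilization (%)
    "utilization.memory",  # assumed counterpart of Memory Copy Utilization (%)
    "temperature.gpu",     # GPU Temperature (degrees Celsius)
    "clocks.sm",           # GPU SM Clocks (MHz)
    "clocks.mem",          # GPU Memory Clocks (MHz)
    "memory.used",         # Framebuffer Memory Used (MiB)
    "memory.free",         # Framebuffer Memory Free (MiB)
]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in rows if row]


def query_gpus() -> List[Dict[str, str]]:
    """Return per-GPU metrics, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` returns one dictionary per GPU keyed by the field names above, with values left as strings so that entries such as `[N/A]` on unsupported hardware pass through unchanged.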