Slurm
Slurm compute instances are designed for High Performance Computing (HPC) workloads that require large-scale parallel processing and efficient scheduling of batch jobs. They provide an environment to run compute-intensive applications on bare metal CPU and GPU nodes, managed through a head node and optional login node. This compute option is well-suited for scientific simulations, AI/ML training, and workloads requiring advanced scheduling.
Create Slurm Compute Instance
To create a Slurm compute instance, navigate to the Developer Hub and select the Slurm type.
- Click New Slurm to create a new instance.
- Click View All to view and manage existing Slurm instances.
Users can also access Slurm from the left navigation pane under the Compute section.
Clicking on the Slurm option from the left pane or selecting "New Slurm Cluster" from the home page opens a wizard that allows users to select from the available Slurm Compute Profiles.
Once the profile is selected, provide the required details. If pricing for the selected profile is configured in Global Settings by the Org Admin, a monthly estimate will be displayed.
- Name: Enter a unique name for the compute instance
- Description: Provide a brief summary of the instance
- Compute Profile: Proceed with the selected profile
- Workspace: Select the workspace from the drop-down menu
- Configuration:
- Number of CPU nodes
- Number of GPU nodes (e.g., H100, L40S)
- Enable Login Node (optional)
- Click Deploy
The instance initially displays a status of In Progress.
Upon successful deployment, the status updates to Success.
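Once the status shows Success, you can optionally verify that every node has registered with the scheduler, for example by running `sinfo` through the Run Slurm Command action described below. This is a minimal sketch; partition names and node states depend on your deployment:

```bash
# Summarize partitions, node counts, and node states for the new cluster.
# Healthy, unallocated compute nodes normally report the "idle" state.
sinfo

# Show each node on its own line with detailed state information.
sinfo -N -l
```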
Important
The list of compute profiles presented to the end user is dynamic: if the administrator updates or publishes a new profile, it immediately becomes available for the end user to select.
View Compute Instances
When the user clicks on Slurm, they are presented with the list of Slurm compute instances in their workspace. The following details are shown for each instance:
- Name
- Workspace
- Created At
- Nodes
- Publish Status
- Actions
Actions
Once an instance has been launched and is operational, users can perform the actions described below.
Run Slurm Command
Users can execute Slurm commands on the deployed cluster without logging into the head node.
- Click Run Slurm Command from the Actions menu.
- Enter the command in the `slurm_command` field.
- Optionally configure the `timeout_seconds` value (default: 60).
- Click Apply to execute.
Example Commands:
- `sinfo`: View the state of the cluster
- `squeue`: List active jobs
- `sbatch`: Submit a batch job (via the login node)
- `scancel`: Cancel a job
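As a concrete illustration, a minimal batch script submitted from the login node with `sbatch` could look like the following. The job name, resource requests, and output path are placeholder values and should be adjusted to your cluster and workload:

```bash
#!/bin/bash
#SBATCH --job-name=hello-slurm     # placeholder job name
#SBATCH --nodes=1                  # request a single node
#SBATCH --ntasks=1                 # run one task
#SBATCH --time=00:05:00            # five-minute wall-clock limit
#SBATCH --output=hello_%j.out      # %j expands to the Slurm job ID

# Print the host that ran the job; replace this line with your real workload.
srun hostname
```

Save the script as, for example, `hello.sbatch`, submit it with `sbatch hello.sbatch`, monitor it with `squeue`, and cancel it with `scancel <job_id>` if required.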
Add Node
Use this action to scale your Slurm cluster by adding additional compute resources. You can choose the type and number of nodes to add based on your workload requirements.
When you select Add Node, you’ll be prompted to specify counts for the supported node types:
- `l40s_node_addon_count`: Number of additional GPU nodes with NVIDIA L40S to be provisioned.
- `h100_node_addon_count`: Number of additional GPU nodes with NVIDIA H100 to be provisioned.
- `cpu_node_addon_count`: Number of additional CPU-only nodes to be provisioned.
The platform automatically provisions the requested nodes and registers them with the Slurm cluster. Once added, these nodes become available for scheduling workloads.
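For example, to add two L40S GPU nodes and one CPU-only node (illustrative counts; choose values that match your workload), the Add Node inputs would be:

```
l40s_node_addon_count: 2
h100_node_addon_count: 0
cpu_node_addon_count: 1
```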
Decommission Node
Use this action to remove one or more nodes from the Slurm cluster. This is typically done when nodes are no longer needed, under maintenance, or being scaled down.
You must provide the node name(s) that should be removed. The specified nodes will be drained and decommissioned from the cluster.
Note: Make sure the nodes you enter exist in the current cluster.
Sample Input:
node-L40s-006,node-H100-002,node-CPU-010
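To confirm the exact node names before decommissioning, you can list them with standard Slurm commands via the Run Slurm Command action. This is a sketch; the node name shown is a placeholder and the output format depends on your cluster:

```bash
# Print each node's name and its current state.
sinfo -N -o "%N %T"

# Inspect a single node in detail (replace the name with a node from your cluster).
scontrol show node node-L40s-006
```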
Delete Compute Instance
If a compute instance is no longer required, the user can delete it by clicking the delete icon to the right of the instance.
Note
This action cannot be undone.