Slurm
Prerequisites
Before creating a Slurm Compute Profile, ensure the following are in place:
- The `system-slurm-bm` environment template is published and accessible in the project
- GitOps Agent and Agent Host details are configured
- Bare metal resources are available in the inventory:
  - CPU nodes for head node and worker node provisioning
  - GPU nodes (e.g., H100, L40S) if required
- API Key and Controller Endpoint are configured in the platform; these will be available as output variables for accessing the Slurm service
Create a Compute Profile for Slurm
Slurm compute profiles are created by PaaS Administrators to enable users to request and manage HPC clusters on tenant networks using BCM-based provisioning.
This profile simplifies the end-user experience by exposing only a minimal set of options (for example, number of nodes and login node toggle), while administrators configure advanced parameters such as SKU mappings, API endpoints, and GitOps agent details.
Slurm clusters are commonly used for job scheduling, resource allocation, and high-performance workloads in GPU/CPU-intensive environments.
Refer to the Compute Profile Overview for general information.
Steps to Create a Slurm Compute Profile
- In the Developer Console, select Compute Profiles from the left navigation pane
- Click the + New Compute Profile button
- In the Compute Profile form:
  - Name: Provide a unique identifier (e.g., `slurm-profile`)
  - Display Name (Optional): Enter a user-friendly name
  - Description (Optional): Add relevant information about this profile's purpose or configuration
  - Environment Template: Select the available template:
    - `system-slurm-bm`: Use this when provisioning and managing a complete Slurm cluster on bare metal. The template automates lifecycle operations such as PXE boot, IPMI configuration, and provisioning of the head node and worker nodes.
  - Environment Template Version: Select the required version (e.g., `v6`)
  - Compute Type: Select Slurm

    ⚠️ The Compute Type setting determines that compute instances launched using this profile are provisioned for Slurm-based workload management.
- Once all fields are configured, click Save & Continue.
Compute Profile Configuration
Once saved, the Compute Profile Configuration page appears.
General
Name | Default Value | Value Type | Description |
---|---|---|---|
Name | slurm-prod-profile | string | Internal identifier for the compute profile |
Display Name | SLURM Production | string | User-friendly label for UI display |
Description | Profile for SLURM production workloads | string | Notes describing the profile purpose or usage |
Allocation Type | Shared | string | Determines whether nodes are dedicated or shared |
Advanced Configuration
Name | Default Value | Value Type | Description |
---|---|---|---|
Labels | env=production, team=ai | key-value | Key-value pairs used for grouping or identifying resources |
Annotations | owner=platform-team | key-value | Non-identifying metadata for resource management or documentation purposes |
Extra Config | {"logLevel":"debug"} | key-value | Additional configuration in key-value or JSON format for advanced tuning |
Display Settings
Name | Default Value | Value Type | Description |
---|---|---|---|
Icon URL | https://example.com/icons/slurm.png | string | URL to a custom icon used to visually identify the compute profile |
Read Me | SLURM profile for AI/ML workloads | string | Short summary describing the purpose or characteristics of the profile |
Input Settings
Name | Value | Type | Description |
---|---|---|---|
API Key | Enter Value | envVars | Authentication key used for API access |
Controller Endpoint | console.stage.shakticloud.ai | envVars | Endpoint of the controller managing the cluster |
CPU Node SKU | bmass-cpu-slurm | json | SKU identifier for CPU-based worker nodes |
CPU Nodes | 0 | text | Number of CPU worker nodes |
Enable Login Node | false | text | Flag to enable or disable a dedicated login node |
GitOps Agent Host IP | 10.0.7.62.73 | text | IP address of the GitOps agent host |
GitOps Agent Host Password | ****** | text | Password for GitOps agent host authentication |
GitOps Agent Host TAN Interface | bond0 | text | Network interface used by the GitOps agent host |
GitOps Agent Host Username | rafayuser | text | Username for GitOps agent host login |
H100 Node SKU | bmass-h100-slurm | text | SKU identifier for GPU-based H100 nodes |
H100 Nodes | 0 | text | Number of GPU H100 nodes |
Head Node SKU | vm-bom-head-node | text | SKU identifier for the cluster head node |
L40S Node SKU | bmass-l40s-slurm | text | SKU identifier for GPU-based L40S nodes |
L40S Nodes | 2 | text | Number of GPU L40S nodes |
Login Node SKU | vm-slurm-login-node | text | SKU identifier for login node |
NCP Server API Key | ****** | text | API key for NCP server authentication |
Ops Console Endpoint | ops-console.stage.shakticloud.ai | text | Endpoint of the Ops console |
Partner API Key | ****** | text | API key for partner integrations |
PXE Subnet Name | default-pxe | text | Subnet name for PXE boot |
PXE VPC Name | default-pxe | text | VPC name for PXE boot |
TAN Subnet Name | default | text | Subnet name for TAN |
TAN VPC Name | default | text | VPC name for TAN |
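
Several of these inputs are entered as free text, so a quick sanity check before saving can catch typos. The following is a minimal sketch in Python and not part of the platform: `validate_inputs` is a hypothetical helper, and the dictionary keys simply mirror the input names in the table above. It checks that node counts are non-negative integers, that the login node flag is `true` or `false`, and that the GitOps Agent Host IP parses as an IP address.

```python
import ipaddress

# Hypothetical helper: the keys mirror the input names in the table above,
# but this function is illustrative and not a platform API.
COUNT_FIELDS = ("CPU Nodes", "H100 Nodes", "L40S Nodes")


def validate_inputs(inputs: dict[str, str]) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    # Node counts are entered as text, so confirm each is a non-negative integer.
    for name in COUNT_FIELDS:
        value = inputs.get(name, "")
        if not value.isdecimal():
            problems.append(f"{name}: expected a non-negative integer, got {value!r}")

    # The login node toggle is also text; accept only 'true' or 'false'.
    if inputs.get("Enable Login Node", "").lower() not in ("true", "false"):
        problems.append("Enable Login Node: expected 'true' or 'false'")

    # The agent host address must parse as an IPv4 or IPv6 address.
    ip = inputs.get("GitOps Agent Host IP", "")
    try:
        ipaddress.ip_address(ip)
    except ValueError:
        problems.append(f"GitOps Agent Host IP: {ip!r} is not a valid IP address")

    return problems
```

With this check, a five-part value such as `10.0.7.62.73` is reported as invalid, because an IPv4 address has exactly four octets.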
Input Configuration Controls (Slurm)
- Override (Checkbox): Enables environment-level overrides for a specific input parameter in the Slurm configuration. When selected, users can customize values such as partition size, node count, or job submission parameters in their environment-specific settings.
- Allow Override For All: A global toggle that enables override capability across all listed Slurm inputs in one click. This is useful when overrides need to be enabled for multiple cluster or job parameters simultaneously.
- Preview Input Form: Displays how the configured Slurm inputs appear to users. The preview includes field labels, tooltips, input types, validation, and grouping as defined in the configuration (for example, partition definitions or resource limits).
- Display Config (Edit): Opens a configuration panel for customizing how each Slurm input field appears in the environment form. It can be used to change the display name, add tooltips for guidance (e.g., how to set job limits), set default values, define input types, or group related parameters such as compute resources or scheduling options.
Example: Edited Input for Controller Endpoint
Field | Description |
---|---|
Alias | Internal reference name for the field (`endpoint`) |
Tooltip | Help text shown when hovering over the info icon (empty in this case) |
Disabled | Field is disabled and cannot be edited by end users |
Order/Weight | Position of the field in the form (`30`; lower values appear first) |
Type | Input type is set to File Upload (Text Only) |
Validation Type | Validation is based on Length |
Validation Pattern | Length validation allows a maximum of 20 characters |
Section | Field is grouped under the General section |
Section Description | Optional description for the section (currently empty) |
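
To make the effect of Order/Weight and length validation concrete, the sketch below models them in Python. It is an illustration under stated assumptions, not the platform's implementation: `FieldDisplay`, `order_fields`, and `check_length` are hypothetical names, and the `api_key` field with weight 10 is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldDisplay:
    """Hypothetical model of one input's display settings (not a platform API)."""
    alias: str
    weight: int                       # Order/Weight: lower values appear first
    max_length: Optional[int] = None  # Length validation limit, if any


def order_fields(fields: list[FieldDisplay]) -> list[FieldDisplay]:
    """Return fields in form order: ascending weight, ties keep their input order."""
    return sorted(fields, key=lambda field: field.weight)


def check_length(field: FieldDisplay, value: str) -> bool:
    """Return True when the value fits within the field's length limit."""
    return field.max_length is None or len(value) <= field.max_length


# The endpoint settings follow the example table; api_key is an assumed second field.
endpoint = FieldDisplay(alias="endpoint", weight=30, max_length=20)
api_key = FieldDisplay(alias="api_key", weight=10)
form_order = [field.alias for field in order_fields([endpoint, api_key])]
```

Under this model, a 20-character limit would reject the 28-character default endpoint shown in Input Settings, so the limit is worth choosing with the expected values in mind.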
Output Settings
PaaS Administrators can define outputs such as SSH keys, node details, or Slurm command results. These outputs are displayed to end users after they deploy the Slurm cluster, enabling them to access the login node and interact with the cluster using Slurm commands.
Name | Label | Description |
---|---|---|
private_key_pem | Login-node SSH Private Key | SSH private key to access the login node |
nodes | Slurm Nodes | Total number of nodes provisioned in the Slurm cluster |
login_node_hostname | Login-node Hostname | Hostname of the provisioned login node |
slurm_command_output | Slurm Command Output | Output of Slurm commands executed on the cluster |
slurm_command_validation_error | Slurm Command Error | Error messages from Slurm command execution |
login_node_ip | Login-node IP Address | IP address of the login node |
login_node_username | Login-node Username | Username used to access the login node |
login_node_password | Login-node Password | Password for accessing the login node (if enabled) |
Once all configurations are complete, click Save Changes to apply the updates.
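
As an illustration of how an end user might consume these outputs after deployment, the following Python sketch saves the private key with owner-only permissions and builds an `ssh` command that runs a Slurm command on the login node. It assumes the output values have been copied into a dictionary keyed by the output names above; `write_private_key` and `build_ssh_command` are hypothetical helpers, and the file name `slurm_login_key.pem` is an arbitrary choice.

```python
import os


def write_private_key(private_key_pem: str, directory: str) -> str:
    """Write the key to a new file readable only by its owner and return the path."""
    path = os.path.join(directory, "slurm_login_key.pem")  # assumed file name
    # O_EXCL refuses to overwrite an existing file; mode 0o600 keeps the key
    # private, since ssh rejects private keys that other users can read.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(private_key_pem)
    return path


def build_ssh_command(
    outputs: dict[str, str], key_path: str, remote_command: str = "sinfo"
) -> list[str]:
    """Build an ssh argument list that runs a command on the login node."""
    target = f"{outputs['login_node_username']}@{outputs['login_node_ip']}"
    return ["ssh", "-i", key_path, target, remote_command]
```

Passing the returned list to `subprocess.run` would execute the command. The default `sinfo` lists partitions and node states, and any other Slurm command can be supplied through `remote_command`.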