
Requirements

This section outlines the prerequisites that must be in place on the host Kubernetes cluster before users can deploy a SLURM cluster via self-service.

Host Kubernetes Cluster

Ensure you have provisioned a host Kubernetes cluster (based on Rafay MKS) in the "system-catalog" project. Ensure it is sized with sufficient capacity (i.e. enough worker nodes with GPUs and sufficient CPU and memory resources).

Storage

The SLURM cluster nodes (login and compute nodes) will provide users with access to a shared file system. Ensure a CSI driver is installed and configured on the host Kubernetes cluster, with a StorageClass that supports ReadWriteMany (RWX) access. For example, Rook-Ceph using a shared filesystem is a good option.
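One way to confirm that a StorageClass can provision RWX volumes is to create a small test claim against it. The following is a minimal sketch that only renders the manifest. The StorageClass name `rook-cephfs` and the claim name `rwx-check` are illustrative assumptions, not names required by the platform.

```shell
# Render a small PersistentVolumeClaim manifest that requests ReadWriteMany access.
# The default StorageClass name (rook-cephfs) and the claim name (rwx-check) are
# assumptions for illustration; substitute the names used on your cluster.
render_rwx_pvc() {
  local storage_class
  storage_class="${1:-rook-cephfs}"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-check
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ${storage_class}
  resources:
    requests:
      storage: 1Gi
EOF
}

# Usage (requires kubectl access to the host cluster):
#   render_rwx_pvc rook-cephfs | kubectl apply -f -
#   kubectl get pvc rwx-check
#   kubectl delete pvc rwx-check
```

If the claim reaches the Bound state, RWX provisioning is working. Note that a StorageClass using the WaitForFirstConsumer binding mode may leave the claim Pending until a pod mounts it.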


Ports

Ensure ports 80, 443 and 30000-32767 are open on the public IPs of the K8s cluster. We will use NodePort for users to log in to the SLURM cluster over SSH.
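The port requirements above can be spot-checked from a machine outside the cluster. This is a rough sketch that assumes `nc` (netcat) is installed. The address `203.0.113.10` and the port `31022` are placeholders, not values assigned by the platform.

```shell
# Succeed if the given port lies in the NodePort range listed above (30000-32767).
in_nodeport_range() {
  local port
  port="$1"
  [ "$port" -ge 30000 ] && [ "$port" -le 32767 ]
}

# Probe a TCP port with netcat, giving up after 3 seconds.
probe_port() {
  local host port
  host="$1"
  port="$2"
  nc -z -w 3 "$host" "$port"
}

# Usage (203.0.113.10 and 31022 are placeholders):
#   for p in 80 443; do probe_port 203.0.113.10 "$p" && echo "open: $p"; done
#   in_nodeport_range 31022 && ssh -p 31022 user@203.0.113.10
```

A NodePort only answers once a Service is listening on it, so a closed result in the 30000-32767 range before a SLURM cluster is deployed does not by itself indicate a firewall problem.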


Cluster Add-Ons

The following software add-ons need to be installed and configured on the host cluster. These are typically packaged as a Rafay Cluster Blueprint for consistency and repeatability.

Ingress Controller

Install an Ingress Controller (e.g. nginx) on the host cluster.
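As an illustration, the community ingress-nginx controller can be installed with Helm along the following lines. This is a sketch, not a required configuration: the release name and the `ingress-nginx` namespace are conventional choices, and any other Ingress Controller would work equally well.

```shell
# Install (or upgrade) the community ingress-nginx controller with Helm.
# The release name and namespace below are conventional choices, not requirements.
install_ingress_nginx() {
  helm upgrade --install ingress-nginx ingress-nginx \
    --repo https://kubernetes.github.io/ingress-nginx \
    --namespace ingress-nginx --create-namespace
}

# Usage (requires helm and access to the host cluster):
#   install_ingress_nginx
```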


Slurm Operator

Install the Slurm Operator on the host cluster.

helm install slurm-operator-crds oci://ghcr.io/slinkyproject/charts/slurm-operator-crds
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--namespace=slinky --create-namespace

Cert Manager

Install Cert Manager on the host cluster.

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--set 'crds.enabled=true' \
--namespace cert-manager --create-namespace

GPU Operator

Install a GPU Operator on the host cluster. For example, NVIDIA's GPU Operator.

helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator-resources \
--create-namespace
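Once the Slurm Operator, Cert Manager and GPU Operator charts are installed, a quick look at the pods in each namespace helps confirm that the add-ons came up. The sketch below uses the namespaces from the helm commands in this section; the function name `check_addons` is only an illustrative helper.

```shell
# Namespaces taken from the helm install commands in this section.
ADDON_NAMESPACES="slinky cert-manager gpu-operator-resources"

# List the pods in each add-on namespace so pending or failed pods stand out.
check_addons() {
  local ns
  for ns in $ADDON_NAMESPACES; do
    echo "== ${ns} =="
    kubectl get pods --namespace "$ns"
  done
}

# Usage (requires kubectl access to the host cluster):
#   check_addons
```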

DNS

Configure the Ingress Controller with TLS certificates for the domain that will be used to present users with a URL for the Grafana-based monitoring dashboard. Ensure that a wildcard DNS record for the domain points to the public IPs of the K8s cluster.
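A wildcard DNS record can be checked by resolving an arbitrary host name under the domain. The following sketch assumes `dig` is installed. The label `wildcard-check` and the domain `slurm.example.com` are placeholders.

```shell
# Resolve an arbitrary host name under the domain. With a wildcard record in place,
# any label should resolve; "wildcard-check" is just a placeholder label.
check_wildcard_dns() {
  local domain
  domain="$1"
  dig +short "wildcard-check.${domain}"
}

# Usage (slurm.example.com is a placeholder domain):
#   check_wildcard_dns slurm.example.com
```

The addresses printed should match the public IPs of the K8s cluster; empty output suggests the wildcard record is missing or has not propagated yet.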