Requirements
This section outlines the prerequisites that must be in place on the host Kubernetes cluster before users can deploy a SLURM cluster via self-service.
Host Kubernetes Cluster
Ensure you have provisioned a host Kubernetes cluster (based on Rafay MKS) in the "system-catalog" project. Ensure it is sized with sufficient capacity (i.e., enough worker nodes with GPUs and adequate CPU and memory resources).
Storage
The SLURM cluster nodes (login and compute nodes) will provide users with access to a shared file system. Ensure a CSI driver is installed and configured on the host Kubernetes cluster with a StorageClass that supports ReadWriteMany (RWX) access. For example, Rook-Ceph with a shared filesystem is a good option.
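One way to confirm that the StorageClass can provision RWX volumes is to create a small test claim. The following is a minimal sketch rather than a definitive procedure: the StorageClass name `rook-cephfs`, the claim name `rwx-test`, and the file name are assumptions for illustration, so substitute the names used on your cluster.

```shell
# Write a test PersistentVolumeClaim that requests ReadWriteMany (RWX) access.
# ASSUMPTION: "rook-cephfs" is a placeholder StorageClass name; use your own.
cat > rwx-test-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-test
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: rook-cephfs
  resources:
    requests:
      storage: 1Gi
EOF
```

Applying the file with `kubectl apply -f rwx-test-pvc.yaml` and then running `kubectl get pvc rwx-test` should show the claim as `Bound` if the class supports RWX. A class that uses `WaitForFirstConsumer` volume binding keeps the claim `Pending` until a pod mounts it. Delete the test claim afterwards.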
Ports
Ensure ports 80, 443, and 30000-32767 are open on the public IPs of the Kubernetes cluster. Users log in to the SLURM cluster over SSH through a NodePort service, so the NodePort range (30000-32767) must be reachable.
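A quick reachability check from a machine outside the cluster can catch firewall problems early. The sketch below assumes bash and the coreutils `timeout` command are available; `HOST_IP` is a hypothetical variable holding one of the cluster's public IPs, and `check_port` is a helper defined here, not part of any tool. It samples only the two ends of the NodePort range instead of scanning all of it.

```shell
# check_port HOST PORT: succeeds if a TCP connection opens within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash must be installed.
check_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# ASSUMPTION: HOST_IP is a placeholder for one public IP of the host cluster.
# Nothing is probed unless it is set.
if [ -n "${HOST_IP:-}" ]; then
  for port in 80 443 30000 32767; do
    if check_port "$HOST_IP" "$port"; then
      echo "$HOST_IP:$port reachable"
    else
      echo "$HOST_IP:$port not reachable"
    fi
  done
fi
```

Run it as, for example, `HOST_IP=<public-ip> sh check-ports.sh`. Note that a NodePort with no Service behind it will report "not reachable" even when the firewall allows the traffic, so treat results for the 30000-32767 range as indicative only.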
Cluster Add-Ons
The following software add-ons need to be installed and configured on the host cluster. These are typically packaged as a Rafay Cluster Blueprint for consistency and repeatability.
Ingress Controller
Install an Ingress Controller (e.g., NGINX) on the host cluster.
Slurm Operator
Install the Slurm Operator on the host cluster.
helm install slurm-operator-crds oci://ghcr.io/slinkyproject/charts/slurm-operator-crds
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--namespace=slinky --create-namespace
Cert Manager
Install Cert Manager on the host cluster.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--set 'crds.enabled=true' \
--namespace cert-manager --create-namespace
GPU Operator
Install a GPU operator on the host cluster, for example the NVIDIA GPU Operator.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator-resources \
--create-namespace
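After the add-ons are installed, it can help to confirm that each Helm release exists. The sketch below uses the release names and namespaces from the commands in this section. It assumes `slurm-operator-crds` landed in the `default` namespace because that command passes no `--namespace` flag; adjust this if your kubeconfig context sets a different namespace. `check_release` and `RUN_CHECKS` are hypothetical names introduced for illustration.

```shell
# check_release NAME NAMESPACE: report whether a Helm release exists.
# `helm status` exits non-zero when the release is not found.
check_release() {
  if helm status "$1" --namespace "$2" >/dev/null 2>&1; then
    echo "ok       $1 ($2)"
  else
    echo "MISSING  $1 ($2)"
    return 1
  fi
}

# Only contact the cluster when explicitly requested.
if [ "${RUN_CHECKS:-0}" = "1" ]; then
  # ASSUMPTION: the CRDs chart was installed into "default" (no --namespace given).
  check_release slurm-operator-crds default
  check_release slurm-operator slinky
  check_release cert-manager cert-manager
  check_release gpu-operator gpu-operator-resources
fi
```

Run it with `RUN_CHECKS=1` against the host cluster's kubeconfig. A release that exists can still have failing pods, so `kubectl get pods --namespace <namespace>` remains a useful follow-up.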
DNS
Configure the Ingress Controller with a TLS certificate for the domain that will serve the URL users open to access the Grafana-based monitoring dashboard. Ensure a wildcard DNS record for that domain points to the public IPs of the Kubernetes cluster.
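With a wildcard record in place, any subdomain of the chosen domain should resolve to the cluster's public IPs. The following sketch checks this through the system resolver using `getent`, which is available on most Linux systems. `BASE_DOMAIN` is a hypothetical variable (for example `slurm.example.com`), `resolve_host` is a helper defined here, and the subdomain labels `grafana` and `test-123` are arbitrary examples.

```shell
# resolve_host NAME: print the addresses NAME resolves to, one per line.
resolve_host() {
  getent hosts "$1" | awk '{ print $1 }'
}

# ASSUMPTION: BASE_DOMAIN is a placeholder for the domain mapped to the cluster.
# Nothing is looked up unless it is set.
if [ -n "${BASE_DOMAIN:-}" ]; then
  # With a wildcard record, arbitrary subdomains should all resolve.
  for name in grafana test-123; do
    echo "$name.$BASE_DOMAIN -> $(resolve_host "$name.$BASE_DOMAIN" | tr '\n' ' ')"
  done
fi
```

Each line of output should list the public IPs of the host cluster. An empty result after the arrow suggests the wildcard record is missing or has not propagated yet.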