GPU Benchmarks

The SLURM cluster comes preloaded with the NCCL tests. These tests check the networking performance between GPUs on a single node and across multiple nodes over NVLink and InfiniBand.


Run NCCL Tests

Follow these steps to run a sample NCCL test. Additional tests are available on the node and can be run as well.

  • SSH into the Login node
  • Create the following batch script and save it as nccl_test.sbatch
  • If you create the script locally instead, use SCP to copy it to the /mnt/data directory on the Login node
#!/bin/bash
#SBATCH --job-name=nccl_multi_node
#SBATCH --output=nccl-%j.out
#SBATCH --error=nccl-%j.error
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=slinky

# Use 2 InfiniBand queue pairs per connection between ranks
export NCCL_IB_QPS_PER_CONNECTION=2

# Use NVLink SHARP to offload all-reduce to NVSwitch
export NCCL_NVLS_ENABLE=1

# Double the buffer size for NCCL communications from 4 MiB to 8 MiB
export NCCL_BUFFSIZE=8388608

# Prevent MPI from using InfiniBand
export UCX_NET_DEVICES=eth0

# Optional: Enable detailed NCCL debug
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,NET

# Path to NCCL test binary
TEST_BIN=/opt/nccl-tests/build/all_reduce_perf

# Run multi-node NCCL test via Slurm
srun --mpi=pmix \
  ${TEST_BIN} -b 512M -e 8G -f 2 -g 1

Now, submit the NCCL test job by running the following command:

sbatch --nodes=4 nccl_test.sbatch
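
After submitting the job, you can check its state and follow its output file. The commands below are a minimal sketch; replace 12345 with the job ID that sbatch prints.

# Show your pending and running jobs
squeue -u $USER

# Follow the job output once it starts (the file name comes from --output=nccl-%j.out)
tail -f nccl-12345.out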

Results on NVIDIA H200 GPUs

The results below are from an example run on an NVIDIA H200 GPU-based SLURM cluster.

For example, we ran the following command to exercise the NCCL all-reduce test on a single GPU.

srun --partition=slinky --mpi=pmix /opt/nccl-tests/build/all_reduce_perf  -b 512M -e 8G -f 2 -g 1

Info

nranks 1 cudaDev 0 means a single process on a single GPU. With one rank, all-reduce degenerates to a local operation (no cross-GPU or network traffic).

The NCCL results were as follows:

slinky-0:46175:46220 [0] NCCL INFO comm 0x5581dcdfb880 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 18000 commId 0x5874f913ab1ed5af - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong                     
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)                             
   536870912     134217728     float     sum      -1   255.54  2100.9    0.00       0     0.07   8e+06    0.00       0
  1073741824     268435456     float     sum      -1   503.33  2133.3    0.00       0     0.07   2e+07    0.00       0
  2147483648     536870912     float     sum      -1  1000.38  2146.7    0.00       0     0.07   3e+07    0.00       0
  4294967296    1073741824     float     sum      -1  1994.22  2153.7    0.00       0     0.07   6e+07    0.00       0
  8589934592    2147483648     float     sum      -1  3982.20  2157.1    0.00       0     0.07   1e+08    0.00       0
slinky-0:46175:46175 [0] NCCL INFO comm 0x5581dcdfb880 rank 0 nranks 1 cudaDev 0 busId 18000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
# Collective test concluded: all_reduce_perf

Single Node, 8 GPUs

srun -N1 -n8 --gpus-per-task=1 ./all_reduce_perf -b 8M -e 8G -f 2 

With this command, the output should report nranks 8 and non-zero busbw values.
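
A quick way to check is to capture the output and search it. This is only a sketch; the log file name is an example.

# Capture the single-node run, then confirm the rank count reported during init
# (NCCL_DEBUG=INFO makes NCCL print it)
NCCL_DEBUG=INFO srun -N1 -n8 --gpus-per-task=1 ./all_reduce_perf -b 8M -e 8G -f 2 2>&1 | tee nccl_1node.log
grep "nranks 8" nccl_1node.log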


Two Nodes, 8 GPUs Each

srun -N2 -n16 --gpus-per-task=1 ./all_reduce_perf -b 8M -e 8G -f 2

This verifies that NCCL can communicate across nodes (correct NCCL environment, working IB/NVLink fabric, and so on). Look for a stable busbw value that matches your fabric’s expected throughput.
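
With NCCL_DEBUG=INFO set, the output also shows which transports NCCL selected. The sketch below assumes an interactive run with that variable exported; the exact log strings can vary between NCCL versions.

# Capture the two-node run and look for the transports NCCL chose
NCCL_DEBUG=INFO srun -N2 -n16 --gpus-per-task=1 ./all_reduce_perf -b 8M -e 8G -f 2 2>&1 | tee nccl_2node.log

# InfiniBand lines typically mention NET/IB; NVLink SHARP lines mention NVLS
grep -E "NET/IB|NVLS" nccl_2node.log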


NCCL Test Configuration

Worker Nodes

The number of worker nodes is set with the --nodes option when submitting the job (for example, --nodes=4). Update the number of nodes as needed.

If you run the test on multiple nodes, it uses NVLink for communications between GPUs on the same node and InfiniBand for GPUs on different nodes. To benchmark NVLink specifically, run the test on one node.
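
For example, the same batch script can be submitted against a single node to isolate NVLink, or against several nodes to also exercise InfiniBand (the node counts below are examples):

# NVLink/NVSwitch only: all GPUs on one node
sbatch --nodes=1 nccl_test.sbatch

# NVLink within each node plus InfiniBand between nodes
sbatch --nodes=4 nccl_test.sbatch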

The number of vCPUs for the test can be set by updating the following parameters in the script. For example, with the values below, the test uses 4 x 16 = 64 vCPUs on each worker node.

#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=16

Environment variables

The script uses the following NCCL and UCX environment variables; a sketch of how to override them at submit time follows the list:

  • NCCL_IB_QPS_PER_CONNECTION=2 makes each connection between two ranks (GPU processes) use two InfiniBand queue pairs.
  • NCCL_NVLS_ENABLE=1 explicitly enables NVLink SHARP (NVLS), which offloads the all-reduce operation to the NVSwitch domain.
  • NCCL_BUFFSIZE=8388608 increases the buffer size for NCCL communications between pairs of GPUs from 4 MiB (default) to 8 MiB.
  • UCX_NET_DEVICES=eth0 makes MPI use the eth0 network interface instead of InfiniBand.
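
If you want to experiment with different values without editing the script each time, one option is to give each variable a shell default so it can be overridden at submission. This is only a sketch; it relies on sbatch's default behavior of exporting the submission environment into the job.

# In nccl_test.sbatch, keep the current values as defaults that can be overridden:
export NCCL_NVLS_ENABLE=${NCCL_NVLS_ENABLE:-1}
export NCCL_BUFFSIZE=${NCCL_BUFFSIZE:-8388608}

# Then override a single setting for one run from the Login node:
NCCL_NVLS_ENABLE=0 sbatch --nodes=4 nccl_test.sbatch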

Test Parameters

The all_reduce_perf test uses the following parameters that you can customize:

  • -b, -f and -e: the start size, the increment factor, and the end size of the data that the test uses. For example, -b 512M -f 2 -e 8G means that the first iteration works with 512 MiB of data, which then doubles in size at each following iteration (1 GiB, 2 GiB, 4 GiB) until it reaches 8 GiB.
  • -g: the number of GPUs per task.
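
To illustrate how these flags combine, the command below sweeps from 8 MiB to 128 MiB, multiplying the message size by 4 at each step, with one GPU per task. The values are examples only.

# Message sizes tested: 8 MiB, 32 MiB, 128 MiB
srun --partition=slinky --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 8M -e 128M -f 4 -g 1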

For more parameters, see the NCCL tests documentation.