Introduction

SLURM is an open-source cluster management and job scheduling system for Linux clusters. Its main functions:

  • Allocates access to resources (compute nodes)
  • Provides a framework to run and monitor jobs on the allocated nodes
  • Manages a queue of pending jobs for competing resource requests

Overview of SLURM commands

  • sinfo : display compute partition and node information
  • sbatch : submit a job script for remote execution
  • srun : launch parallel tasks (job steps) for MPI jobs
  • salloc : allocate resources for an interactive job
  • squeue : display status for jobs and job steps
  • sprio : display job priority information
  • scancel : cancel pending or running jobs
  • sstat : display status information for running jobs
  • sacct : display accounting information for past jobs
  • seff : display job efficiency information for past jobs
  • scontrol : display or modify SLURM configuration and state
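
A few typical invocations of these commands (the job ID 123456 is just a placeholder):

sinfo                  # list partitions and node states
squeue -u $USER        # list your own pending and running jobs
sacct -j 123456        # accounting details for a finished job
seff 123456            # CPU/memory efficiency report for a finished job
scancel 123456         # cancel a pending or running job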

Common node state codes

  • Draining : the node is currently executing a job, but will not be allocated additional jobs
  • Maint : the node is currently in a reservation with a flag value of maintenance
  • Mixed : the node has some of its CPUs allocated while others are idle
  • Reserved : the node is in an advance reservation and is not generally available
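
To see which nodes are currently in one of these states, you can combine sinfo with a state filter, for example (accepted state names may vary slightly between SLURM versions):

sinfo -t draining,maint,mixed,reserved -o "%20N %.6D %10T"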

myaccount : display your default project account
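
If the myaccount helper is not available, the same information can usually be obtained directly from the SLURM accounting database (assuming sacctmgr is accessible to regular users):

sacctmgr show associations user=$USER format=Account,User,Partition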

View the status of all your running jobs (replace sagar with your own username)

squeue -u sagar -o "%.6i %9P %15j %.6u %.10T %.10M %.10l %.6D %10R %s %.10g"
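
Since the format string is long, one option is to wrap it in a shell alias (an illustrative convenience, not a requirement) in your ~/.bashrc:

alias myq='squeue -u $USER -o "%.6i %9P %15j %.6u %.10T %.10M %.10l %.6D %10R %s %.10g"'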

Interactive job

An interactive job allocates resources and attaches a terminal/console to them, so you can run your script by hand and watch all of the output it generates.

Interactive CPU job

srun --job-name "InteractiveJob" --cpus-per-task 8 --pty bash

where

  • InteractiveJob : the name of your job (when you run squeue -u sagar you can identify your job by this name)
  • --cpus-per-task 8 : you need 8 cores for this job (16 hardware threads if hyperthreading is enabled on the node)
  • --pty bash : this opens a bash terminal and lets YOU run the job (that’s why we call it an interactive job: you can interact with the terminal and control the arguments of your script)

You can also specify other parameters, for example (see the combined command after this list):

  • --mem-per-cpu=16gb : give each core 16 GB of RAM
  • --time=30:00:00 : run this task for no longer than 30 hours
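
Putting these options together, a complete interactive CPU request could look like this (the job name and limits are only examples):

srun --job-name "InteractiveJob" \
--cpus-per-task 8 \
--mem-per-cpu=16gb \
--time=30:00:00 \
--pty bash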

Interactive GPU job

srun \
-n 1 \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`",/home/$USER/:/home/$USER/ \
--container-image=/netscratch/sagar/docker_images/images/10.2.sqsh \
--container-workdir="`pwd`" \
--pty \
--cpus-per-gpu=8 \
--gpus-per-task=1 \
--partition=RTXA6000,A100,V100-32GB,RTX3090 \
/bin/bash

Change these parameters as per your requirements (a quick way to verify the resulting allocation is shown after the list):

  • container-image
  • cpus-per-gpu
  • gpus-per-task
  • partition
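
Once the interactive shell opens inside the container, you can quickly verify the allocation, e.g. (assuming nvidia-smi is present in the container image):

nvidia-smi                                 # shows the GPU(s) assigned to this job
echo "CPUs on node: $SLURM_CPUS_ON_NODE"   # CPU cores allocated on this node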

Job submission (GPU)

Choice 1 - short way

Create a script with a name, let’s say sbatch_train.sh

#!/bin/bash

#SBATCH --nodes=1               # Number of nodes or servers. See: http://koeln.kl.dfki.de:3000/d/slurm-resources/resources?orgId=1&refresh=15s
#SBATCH --ntasks-per-node=1     # Number of tasks per node; we want 1
#SBATCH --cpus-per-task=4       # We want 4 cores for this job.
#SBATCH --mem-per-cpu=16gb      # Each core gets 16 GB of RAM.
#SBATCH --gres=gpu:4            # We want 4 GPUs in each node for this job.
#SBATCH --time=30:00:00         # Run this task for no longer than 30 hrs.
#SBATCH --partition=RTXA6000,V100-32GB,RTX3090  # Run only on these partitions (GPU types). If you don't care which GPU type you get, remove this parameter.
#SBATCH --job-name=mimic_loss
#SBATCH --output=mimic_loss_%A.logs

srun -K \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`",/home/$USER/:/home/$USER/ \
--container-image=/netscratch/sagar/docker_images/images/10.2.sqsh \
--container-workdir="`pwd`" \
python train.py --output_dir output

where --container-image should point to your Docker image (converted to a .sqsh file) that has all required packages installed, e.g. a Python version of your choice plus PyTorch or TensorFlow.
Submit your job using

sbatch sbatch_train.sh
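
After submission you can follow the job with the commands from the overview above, for example (replace <jobid> with the ID printed by sbatch):

squeue -u $USER                  # is the job pending or already running?
tail -f mimic_loss_<jobid>.logs  # follow the job output while it runs
sacct -j <jobid>                 # accounting summary after the job has finished
seff <jobid>                     # efficiency report for the finished job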

Choice 2 - long way

I personally prefer using a Docker image with a minimal installation, i.e. just the CUDA and cuDNN versions of my choice, and keeping Python, PyTorch, TensorFlow, etc. (all required libraries/packages) installed locally in a conda virtual env or Python virtual env. (You can also download a Docker image of your choice: https://hub.docker.com/r/nvidia/cuda/tags)
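
A minimal sketch of preparing such a local environment (the paths and packages below are only examples, adjust them to your setup):

# create and activate a Python virtual env on shared storage
python3 -m venv /netscratch/$USER/envs/my_env
source /netscratch/$USER/envs/my_env/bin/activate

# install whatever your training code needs
pip install torch torchvision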

  1. Create a script with a name, let’s say sbatch_train.sh
#!/bin/bash

#SBATCH --nodes=1               # Number of nodes or servers. See: http://koeln.kl.dfki.de:3000/d/slurm-resources/resources?orgId=1&refresh=15s
#SBATCH --ntasks-per-node=1     # Number of tasks per node; we want 1
#SBATCH --cpus-per-task=4       # We want 4 cores for this job.
#SBATCH --mem-per-cpu=16gb      # Each core gets 16 GB of RAM.
#SBATCH --gres=gpu:4            # We want 4 GPUs in each node for this job.
#SBATCH --time=30:00:00         # Run this task for no longer than 30 hrs.
#SBATCH --partition=RTXA6000,V100-32GB,RTX3090  # Run only on these partitions (GPU types). If you don't care which GPU type you get, remove this parameter.
#SBATCH --job-name=mimic_loss
#SBATCH --output=mimic_loss_%A.logs

echo "#############################"
date
echo "Current dir: " ${SLURM_SUBMIT_DIR}
echo "Hostname: `hostname`"

# Print the task details.
echo "Job ID: ${SLURM_JOBID}"
echo "SLURM array task ID:  ${SLURM_ARRAY_TASK_ID}"
echo "Node list: ${SLURM_JOB_NODELIST}" 
echo "Cluster name: ${SLURM_CLUSTER_NAME}"
echo "Partition name: ${SLURM_JOB_PARTITION}" 
echo "Using: `which python`"
echo -e "#############################\n"

srun -K \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`",/home/$USER/:/home/$USER/ \
--container-image=/netscratch/sagar/docker_images/images/10.2.sqsh \
--container-workdir="`pwd`" \
bash run_task.sh

In this script, I use a basic Docker image with a minimal installation of just CUDA and cuDNN (versions of my choice). To use the locally installed libraries and packages, I activate the virtual environment. All of this is done in the run_task.sh script.

  2. Your run_task.sh script contains the actual training commands
#!/bin/bash

# We jump into the submission dir
cd ${SLURM_SUBMIT_DIR}

# Activate the python virtual env
source /netscratch/sagar/thesis/sb_env/bin/activate

# But if you are using conda (uncomment the lines below)
# . /netscratch/sagar/miniconda/etc/profile.d/conda.sh
# conda activate my_env_name

# If you need other packages that are not available in pip/conda, 
# you can install them this way
apt-get update
apt-get -y install libsndfile1


## Single-GPU training
# Execute this python script (it will use 1 GPU regardless of whether you requested 4 or 8 GPUs)
python train.py hparams/robust_asr.yaml

## Multi GPU training
# If you want your script to use all GPUs that you requested, follow this 
NUM_GPUS=4
python -m torch.distributed.launch \
--nproc_per_node=${NUM_GPUS} \
--nnodes=${SLURM_JOB_NUM_NODES} \
--node_rank=${SLURM_NODEID} \
train.py hparams/robust_asr.yaml \
--distributed_launch

## Note
# Multi-GPU jobs will succeed only if your script is adapted to perform DP or DDP training

Submit your job using

sbatch sbatch_train.sh

Multi-node multi-GPU DDP training

To utilize multiple nodes and all GPUs within those nodes, every worker process must know the address of a master (rank-0) node so that the processes can rendezvous; this allows the work to be distributed efficiently across all available resources. The master address can be derived from the SLURM node list:

MASTER=`echo $SLURM_JOB_NODELIST | cut -d"," -f1 | sed 's/[][]//g' | cut -d "-" -f 1,2`

and then, in the job submission script, modify the launch command this way

python -m torch.distributed.launch \
--nproc_per_node=${NUM_GPUS} \
--nnodes=${SLURM_JOB_NUM_NODES} \
--master_addr=${MASTER} \
--master_port=5557 \
--node_rank=${SLURM_NODEID} \
${training_command} \
--distributed_launch
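
The node-name parsing above depends on the cluster's hostname scheme; a more portable alternative (assuming scontrol is available inside the job) is to let SLURM expand the node list itself:

MASTER=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)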

Downloading a docker image

Prepare a script named download_docker_img.sh

#!/bin/bash
# Usage: ./download_docker_img.sh <docker_image> <output_name.sqsh>
docker_img=$1
docker_img_name=$2
srun enroot import -o /netscratch/$USER/docker_images/$docker_img_name docker://$docker_img

Download the Docker image and prepare the .sqsh file, e.g.

./download_docker_img.sh nvidia/pytorch:20.10-py3 pytorch:20.10-py3.sqsh
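
If the script is not executable yet, make it so once before the first run:

chmod +x download_docker_img.sh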