SLURM Tutorial
Introduction
SLURM is an open-source cluster management and job scheduling system for Linux clusters. Its main functions are to:
- allocate access to resources (compute nodes)
- provide a framework to run and monitor jobs on the allocated nodes
- manage a queue of competing resource requests
Overview of SLURM commands
- `sinfo`: display compute partition and node information
- `sbatch`: submit a job script for remote execution
- `srun`: launch parallel tasks (job steps) for MPI jobs
- `salloc`: allocate resources for an interactive job
- `squeue`: display status for jobs and job steps
- `sprio`: display job priority information
- `scancel`: cancel pending or running jobs
- `sstat`: display status information for running jobs
- `sacct`: display accounting information for past jobs
- `seff`: display job efficiency information for past jobs
- `scontrol`: display or modify SLURM configuration and state
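For example, a few of these commands as you would typically use them (a quick sketch; the job ID 123456 is just a placeholder):

```bash
sinfo                                                  # list partitions and node states
squeue -u $USER                                        # show your pending and running jobs
sacct -j 123456 --format=JobID,JobName,Elapsed,State   # accounting info for a past job
seff 123456                                            # CPU/memory efficiency of a finished job
scancel 123456                                         # cancel a pending or running job
```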
Codes for common node states
- `Draining`: the node is currently executing a job, but will not be allocated additional jobs
- `Maint`: the node is currently in a reservation with a flag value of "maintenance"
- `Mixed`: the node has some of its CPUs allocated while others are idle
- `Reserved`: the node is in an advance reservation and is not generally available

`myaccount`: shows your default project account
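You can see which state each node is currently in with `sinfo` (a small sketch using standard `sinfo` format options):

```bash
sinfo -N -o "%15N %12P %10T"   # node name, partition, and state (e.g. mixed, draining, reserved)
```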
View the status of all running jobs
squeue -u sagar -o "%.6i %9P %15j %.6u %.10T %.10M %.10l %.6D %10R %s %.10g"
Interactive job
This lets you run your script interactively: SLURM allocates a terminal/console for you, and you see all the output generated by your script as it runs.
Interactive CPU job
srun --job-name "InteractiveJob" --cpus-per-task 8 --pty bash
Here,
- `InteractiveJob`: the name of your job (when you run `squeue -u sagar`, you can identify your job by that name)
- `--cpus-per-task 8`: you need 8 cores for this job (16 threads)
- `--pty bash`: this opens a bash terminal and lets YOU run the job (that's why we call it an interactive job: you can interact with the terminal and control the arguments of your script)
You can also specify other parameters (see the combined example below), such as
- `--mem-per-cpu=16gb`: give each core 16 GB of RAM
- `--time=30:00:00`: run this task for no longer than 30 hrs
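Putting these together, an interactive CPU session requesting 8 cores, 16 GB RAM per core, and a 30-hour limit looks like this:

```bash
srun --job-name "InteractiveJob" \
     --cpus-per-task 8 \
     --mem-per-cpu=16gb \
     --time=30:00:00 \
     --pty bash
```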
Interactive GPU job
srun \
-n 1 \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`",/home/$USER/:/home/$USER/ \
--container-image=/netscratch/sagar/docker_images/images/10.2.sqsh \
--container-workdir="`pwd`" \
--pty \
--cpus-per-gpu=8 \
--gpus-per-task=1 \
--partition=RTXA6000,A100,V100-32GB,RTX3090 \
/bin/bash
Change these parameters as per your requirements:
- `--container-image`
- `--cpus-per-gpu`
- `--gpus-per-task`
- `--partition`
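Once the shell opens on the allocated node, a quick sanity check confirms that you got what you asked for (assuming `nvidia-smi` is available inside the container, which is typically the case for CUDA images):

```bash
nvidia-smi            # should list exactly the GPU(s) allocated to you
echo $SLURM_JOB_ID    # the ID of this interactive job
```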
Job submission (GPU)
Choice 1- short way
Create a script with a name, let’s say sbatch_train.sh
#!/bin/bash
#SBATCH --nodes=1 # Number of nodes or servers. See: http://koeln.kl.dfki.de:3000/d/slurm-resources/resources?orgId=1&refresh=15s
#SBATCH --ntasks-per-node=1 # Number of tasks per node; we want 1
#SBATCH --cpus-per-task=4 # We want 4 cores for this job.
#SBATCH --mem-per-cpu=16gb # each core to have 16 Gb RAM
#SBATCH --gres=gpu:4 # We want 4 GPUs in each node for this job.
#SBATCH --time=30:00:00 # Run this task no longer than 30 hrs.
#SBATCH --partition=RTXA6000,V100-32GB,RTX3090 # Run this only on these GPU partitions. If you don't have any preference over GPUs, remove this parameter.
#SBATCH --job-name=mimic_loss
#SBATCH --output=mimic_loss_%A.logs
srun -K \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`",/home/$USER/:/home/$USER/ \
--container-image=/netscratch/sagar/docker_images/images/10.2.sqsh \
--container-workdir="`pwd`" \
python train.py --output_dir output
where `--container-image` should point to a Docker image that has all required packages installed, e.g. a Python version of your choice, PyTorch, or TensorFlow.
Submit your job using
sbatch sbatch_train.sh
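`sbatch` prints the job ID on submission; you can then follow the job and its log file (the log name comes from the `--output` line above, with `<jobid>` as a placeholder):

```bash
squeue -u $USER                   # is the job pending or running?
tail -f mimic_loss_<jobid>.logs   # follow the output written via --output
```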
Choice 2- long way
I personally prefer using a Docker image with a minimal installation, i.e. just the CUDA and cuDNN versions of my choice, and keeping Python, PyTorch, TensorFlow, etc. (all required libraries/packages) installed locally in a conda or Python virtual environment. (You can also download a Docker image of your choice: https://hub.docker.com/r/nvidia/cuda/tags)
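If you go this route, the virtual environment only needs to be created once, e.g. somewhere on /netscratch (a sketch; the path and package list below are just examples, not part of this tutorial's setup):

```bash
# one-time setup (example path)
python3 -m venv /netscratch/$USER/envs/my_env
source /netscratch/$USER/envs/my_env/bin/activate
pip install torch                 # plus whatever else your project needs
```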
- Create a script with a name, let's say `sbatch_train.sh`
#!/bin/bash
#SBATCH --nodes=1 # Number of nodes or servers. See: http://koeln.kl.dfki.de:3000/d/slurm-resources/resources?orgId=1&refresh=15s
#SBATCH --ntasks-per-node=1 # Number of tasks per node; we want 1
#SBATCH --cpus-per-task=4 # We want 4 cores for this job.
#SBATCH --mem-per-cpu=16gb # each core to have 16 Gb RAM
#SBATCH --gres=gpu:4 # We want 4 GPUs in each node for this job.
#SBATCH --time=30:00:00 # Run this task no longer than 30 hrs.
#SBATCH --partition=RTXA6000,V100-32GB,RTX3090 # Run this only on these GPU partitions. If you don't have any preference over GPUs, remove this parameter.
#SBATCH --job-name=mimic_loss
#SBATCH --output=mimic_loss_%A.logs
echo "#############################"
date
echo "Current dir: " ${SLURM_SUBMIT_DIR}
echo "Hostname: `hostname`"
# Print the task details.
echo "Job ID: ${SLURM_JOBID}"
echo "SLURM array task ID: ${SLURM_ARRAY_TASK_ID}"
echo "Node list: ${SLURM_JOB_NODELIST}"
echo "Cluster name: ${SLURM_CLUSTER_NAME}"
echo "Partition name: ${SLURM_JOB_PARTITION}"
echo "Using: `which python`"
echo -e "#############################\n"
srun -K \
--container-mounts=/netscratch/$USER:/netscratch/$USER,/ds:/ds:ro,"`pwd`":"`pwd`",/home/$USER/:/home/$USER/ \
--container-image=/netscratch/sagar/docker_images/images/10.2.sqsh \
--container-workdir="`pwd`" \
bash run_task.sh
In my script, I use a basic Docker image with a minimal installation of just CUDA and cuDNN (versions of my choice). To use the locally installed libraries and packages, I activate the virtual environment. All of this is done in the `run_task.sh` script.
- Your `run_task.sh` script will have the actual training script/commands:
#!/bin/bash
# We jump into the submission dir
cd ${SLURM_SUBMIT_DIR}
# Activate the python virtual env
source /netscratch/sagar/thesis/sb_env/bin/activate
# But if you are using conda (uncomment the lines below)
# . /netscratch/sagar/miniconda/etc/profile.d/conda.sh
# conda activate my_env_name
# If you need other packages that are not available in pip/conda,
# you can install them this way
apt-get update
apt-get -y install libsndfile1
## Single-GPU training
# Execute this python script (this will use 1 GPU regardless of whether you requested 4 or 8 GPUs)
python train.py hparams/robust_asr.yaml
## Multi GPU training
# If you want your script to use all GPUs that you requested, follow this
NUM_GPUS=4
python -m torch.distributed.launch \
--nproc_per_node=${NUM_GPUS} \
--nnodes=${SLURM_JOB_NUM_NODES} \
--node_rank=${SLURM_NODEID} \
train.py hparams/robust_asr.yaml \
--distributed_launch
## Note
# Multi-gpu jobs will be successful only if your script is adapted to perform DP or DDP training
Submit your job using
sbatch sbatch_train.sh
Multi-node multi-GPU DDP training
To utilize multiple nodes and all GPUs within those nodes, every node must be able to reach a common master node, so that work can be distributed efficiently across all available resources. The master address can be derived from the node list:
MASTER=`echo $SLURM_JOB_NODELIST | cut -d"," -f1 | sed 's/[][]//g' | cut -d "-" -f 1,2`
and then, in the job submission script, modify the launch command this way:
python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} \
    --nnodes=${SLURM_JOB_NUM_NODES} \
    --master_addr=${MASTER} \
    --master_port=5557 \
    --node_rank=${SLURM_NODEID} \
    ${training_command} \
    --distributed_launch
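For completeness, the corresponding sbatch header for a two-node run could look like this (a sketch reusing the example values from above; adjust counts and partitions to your needs):

```bash
#SBATCH --nodes=2               # two servers
#SBATCH --ntasks-per-node=1     # one launcher task per node
#SBATCH --gres=gpu:4            # 4 GPUs on each node

NUM_GPUS=4
MASTER=`echo $SLURM_JOB_NODELIST | cut -d"," -f1 | sed 's/[][]//g' | cut -d "-" -f 1,2`
```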
Downloading a docker image
Prepare a script named `download_docker_img.sh`:
#!/bin/bash
# Usage: ./download_docker_img.sh <docker_image> <sqsh_file_name>
docker_img=$1
docker_img_name=$2
srun enroot import -o /netscratch/$USER/docker_images/$docker_img_name docker://$docker_img
Download the Docker image and prepare the sqsh file, e.g.
./download_docker_img.sh nvidia/pytorch:20.10-py3 pytorch:20.10-py3.sqsh
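The resulting .sqsh file can then be passed to `--container-image` in the job scripts above, e.g.

```bash
--container-image=/netscratch/$USER/docker_images/pytorch:20.10-py3.sqsh
```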