
Introduction to running AI tasks on VSC

VSC5 is a high performance cluster that consists of different kinds of nodes. If you are outside of the university network, you need to use the jump host vmos.vsc.ac.at to reach VSC (note: not all users are enabled to use the jump host). To do that, use ssh -J username@vmos.vsc.ac.at username@vsc5.vsc.ac.at. When you log in using SSH, you reach one of the login nodes, which can be used to prepare the software environment but should not be used for any resource-intensive calculations. When you have finished preparing the software and scripts that you want to run, you submit a job file to the queue; it is executed on compute nodes (some of which have GPUs) once the resources become available.
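
If you connect through the jump host regularly, an entry in the SSH client configuration on your local machine can save typing. The following is only a minimal sketch; the host alias and the username jdoe are placeholders that you have to replace with your own values:

# ~/.ssh/config on your local machine
Host vsc5
    HostName vsc5.vsc.ac.at
    User jdoe
    ProxyJump jdoe@vmos.vsc.ac.at

With such an entry, ssh vsc5 is equivalent to the ssh -J command above.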

At VSC5, SLURM is used as the scheduler for queuing jobs. You can find an introduction to SLURM in the course material of the VSC introduction course at https://events.vsc.ac.at/event/125/page/351-course-material. Additional information can be found in the VSC wiki: https://wiki.vsc.ac.at/doku.php?id=doku:slurm

To make things easier, a summary of the most important commands is given here:

Every SLURM job needs a job description file. These files are essentially shell scripts with some additional #SBATCH directives at the top. Here is an example file:

gpu_job_template.slurm
#!/bin/bash
 
## Specify job name:
#SBATCH --job-name=GPU_job
 
## Specify GPU:
## For Nvidia A40:
##SBATCH --partition=zen2_0256_a40x2
##SBATCH --qos=zen2_0256_a40x2
## For Nvidia A100:
#SBATCH --partition=zen3_0512_a100x2
#SBATCH --qos=zen3_0512_a100x2
 
## Specify run time limit in format days-hours:minutes:seconds (up to 3 days)
## Note: Job will be killed once the run time limit is reached.
## Shorter values might reduce queuing time.
#SBATCH --time=0-01:00:00
 
## Specify number of GPUs (1 or 2):
#SBATCH --gres=gpu:1  # Number of GPUs
 
## Optional: Get notified via mail when the job runs and finishes:
##SBATCH --mail-type=ALL    # BEGIN, END, FAIL, REQUEUE, ALL
##SBATCH --mail-user=user@example.com
 
# Start in a clean environment
module purge
 
# List available GPUs:
nvidia-smi
 
# Load conda:
module load miniconda3
eval "$(conda shell.bash hook)"
 
# Load a conda environment with Python 3.11.6, PyTorch 2.1.0, TensorFlow 2.13.1 and other packages.
# The XLA_FLAGS variable points TensorFlow/XLA to the CUDA installation used by this environment:
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/gpfs/opt/sw/cuda-zen/spack-0.19.0/opt/spack/linux-almalinux8-zen/gcc-12.2.0/cuda-11.8.0-knnuyxtpma52vhp6zhj72nbjfbrcvb7f
conda activate /opt/sw/jupyterhub/envs/conda/vsc5/jupyterhub-horovod-v1
 
# Run AI scripts:
python -c "import torch;print(torch.__version__);print(torch.cuda.get_device_properties(0))"

The commands at the bottom of this script are executed on the compute node once the job runs. In this example, nvidia-smi lists the available GPUs, then a Python/PyTorch/TensorFlow environment is loaded (see below), and Python uses the PyTorch library to print some information about the first CUDA device that it finds.
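
If you also want to verify that TensorFlow sees the GPU, a one-liner along these lines should work in the same job script (this is only a sketch, not part of the template above):

python -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"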

To submit this job, save the SLURM script as gpu_job_template.slurm and execute

sbatch gpu_job_template.slurm

at the shell prompt.

To view your jobs, enter this command:

squeue --user $USER

If you need to cancel a job, enter this command, but replace JOBID with the JOBID given by squeue:

scancel JOBID

If you want to get an idea of how many jobs are queued up for the A40 nodes, use this command:

squeue -p zen2_0256_a40x2 -o "%.10i %.16q %.16j %.10u %.10a %.2t %.10M %.10L %.3D %.6C %.6Q %R"

and for A100 nodes:

squeue -p zen3_0512_a100x2 -o "%.10i %.16q %.16j %.10u %.10a %.2t %.10M %.10L %.3D %.6C %.6Q %R"

In the right-most column, you see either the name of the node (e.g. n3071-016) on which the job is currently running, or the reason why it is not (yet) running.

Once a job has finished, you can find its output in a file called slurm-JOBID.out in the directory from which you submitted the job.
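
If you prefer a different name or location for the output file, you can add an output directive to the #SBATCH block at the top of the job script. The path below is only an example; note that the directory must already exist, because SLURM does not create it for you:

## Optional: write the job output to a custom file (%j is replaced by the job ID):
#SBATCH --output=logs/GPU_job-%j.out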

To prepare your Python deep learning environment, I recommend using conda. At VSC, a lot of HPC software is pre-installed and can be loaded using module commands. To load conda, enter these commands:

module load miniconda3
eval "$(conda shell.bash hook)"

The second line modifies the current shell so that conda environments can be activated using conda activate. You can prepare your own conda environment, but you can also use an environment prepared by us that contains many useful packages, including Python 3.11.6, PyTorch 2.1.0 and TensorFlow 2.13.1:

conda activate /opt/sw/jupyterhub/envs/conda/vsc5/jupyterhub-horovod-v1
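
If you would rather maintain your own environment, a sketch along the following lines should work. The environment path, the Python version and the packages are only examples; adjust them to your needs:

module load miniconda3
eval "$(conda shell.bash hook)"
# Create a new environment under a path of your choice (example path):
conda create -p $HOME/envs/my-ai-env python=3.11
conda activate $HOME/envs/my-ai-env
# Install the packages you need, for example via pip:
pip install torch tensorflow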

You can find some additional information about conda in the course material of the Python4HPC course at: https://gitlab.tuwien.ac.at/vsc-public/training/python4hpc/-/blob/main/D1_02_env_03_conda.ipynb

In addition to SSH access, there are a few JupyterHub nodes that can be accessed at https://jupyterhub.vsc.ac.at. To get the same environment as above, select the "VSC-5 A40 GPU (conda python env)" profile (or the corresponding A100 profile) and the conda env "Conda jupyterhub-horovod-v1 (Python 3.11.6)". JupyterHub is not a substitute for SLURM batch scripts, but it can sometimes be a useful tool for interactively executing Python commands.
