====== Introduction to running AI tasks on VSC ======

VSC5 is a high performance cluster that consists of different kinds of nodes. If you are outside of the university network, you need to use the jump host ''vmos.vsc.ac.at'' to reach VSC. (Note: not all users are enabled to use the jump host.) To do that, use ''ssh -J username@vmos.vsc.ac.at username@vsc5.vsc.ac.at''. When you log in using SSH, you reach one of the login nodes, which can be used to prepare the software environment but should not be used for any resource-intensive calculations. When you have finished preparing the software and scripts that you want to run, you submit a job file to the queue, which is executed on compute nodes (some of which have GPUs) once the resources become available.
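
If you regularly connect from outside the university network, you can store the jump host in your SSH configuration so that a plain ''ssh vsc5'' is enough. This is only a convenience sketch: the host alias ''vsc5'' is an arbitrary name chosen here, and ''username'' has to be replaced by your own VSC username.

<code>
# Excerpt from ~/.ssh/config: route connections to VSC5 through the jump host
Host vsc5
    HostName vsc5.vsc.ac.at
    User username
    ProxyJump username@vmos.vsc.ac.at
</code>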
  
At VSC5, SLURM is used as the scheduler for queuing jobs. You can find an introduction to SLURM in the course material of the VSC introduction course at:

[[https://wiki.vsc.ac.at/doku.php?id=doku:slurm]]
  
But to make things easier, a summary of the most important commands is given here:
  
Every SLURM job needs a job description file. These files are essentially shell scripts with some additional boilerplate at the top. Here is an example file:
<file bash gpu_job_template.slurm>
#!/bin/bash

## Specify job name:
#SBATCH --job-name=GPU_job

## Specify GPU:
## For Nvidia A40:
##SBATCH --partition=zen2_0256_a40x2
##SBATCH --qos=zen2_0256_a40x2
## For Nvidia A100:
#SBATCH --partition=zen3_0512_a100x2
#SBATCH --qos=zen3_0512_a100x2

## Specify run time limit in format days-hours:minutes:seconds (up to 3 days)
## Note: Job will be killed once the run time limit is reached.
## Shorter values might reduce queuing time.
#SBATCH --time=3-00:00:00

## Specify number of GPUs (1 or 2):
#SBATCH --gres=gpu:2

## Optional: Get notified via mail when the job runs and finishes:
##SBATCH --mail-type=ALL    # BEGIN, END, FAIL, REQUEUE, ALL
##SBATCH --mail-user=user@example.com

# Start in a clean environment
module purge

# List available GPUs:
nvidia-smi

# Load conda:
module load miniconda3
eval "$(conda shell.bash hook)"

# Load a conda environment with Python 3.11.6, PyTorch 2.1.0, TensorFlow 2.13.1 and other packages:
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/gpfs/opt/sw/cuda-zen/spack-0.19.0/opt/spack/linux-almalinux8-zen/gcc-12.2.0/cuda-11.8.0-knnuyxtpma52vhp6zhj72nbjfbrcvb7f
conda activate /opt/sw/jupyterhub/envs/conda/vsc5/jupyterhub-horovod-v1

# Run AI scripts:
python -c "import torch;print(torch.__version__);print(torch.cuda.get_device_properties(0))"
</file>
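
Once the job file is prepared, submit it to the queue from one of the login nodes. The commands below are the standard SLURM commands for this; ''gpu_job_template.slurm'' is the example file from above, and ''1234567'' stands for the job ID that ''sbatch'' prints when the job is accepted. By default, the output of the job is written to a file named ''slurm-<jobid>.out'' in the directory from which the job was submitted.

<code bash>
# Submit the job file to the queue:
sbatch gpu_job_template.slurm

# List your own queued and running jobs:
squeue --user=$USER

# Look at the output the job has written so far:
cat slurm-1234567.out

# Cancel a job that is no longer needed:
scancel 1234567
</code>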