====== AI on VSC ======

<note>
This page is still a work in progress.
</note>

VSC5 is a high performance cluster that consists of different kinds of nodes. When you log in using SSH, you land on a login node; the actual computations run on the compute nodes, which are allocated through the batch system.

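For example, a login could look like this (a minimal sketch: ''myuser'' is a placeholder for your VSC username, and ''vsc5.vsc.ac.at'' is assumed here as the VSC-5 login address):

<code bash>
# Log in to a VSC-5 login node (replace myuser with your VSC username)
ssh myuser@vsc5.vsc.ac.at
</code>
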
At VSC5, SLURM is used as the scheduler for queuing jobs. You can find an introduction to SLURM in the course material of the VSC introduction course at:
[[https://
Additional information can be found in the VSC wiki:
[[https://

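If you just want to check which GPU partitions exist and whether their nodes are busy, the standard SLURM command ''sinfo'' gives a quick overview (a generic SLURM example, shown here with the A40 partition name used below):

<code bash>
# Show node states (idle, mixed, allocated, ...) of the A40 partition
sinfo --partition=zen2_0256_a40x2
</code>
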
To make things easier, here is a summary of the most important commands:

Every SLURM job needs a job description file. These files are essentially shell scripts with some additional directives at the top. Here is an example file:
<file bash gpu_job_template.slurm>
#!/bin/bash
#
#SBATCH --job-name=GPU_job
## For Nvidia A40:
#SBATCH --partition=zen2_0256_a40x2
#SBATCH --qos=zen2_0256_a40x2
## For Nvidia A100:
##SBATCH --partition=zen3_0512_a100x2
##SBATCH --qos=zen3_0512_a100x2
#SBATCH --time=0-01:00:00               # run time limit (D-HH:MM:SS)
#SBATCH --gres=gpu:2                    # number of GPUs per node
## Optional: Get notified via mail when the job runs and finishes:
##SBATCH --mail-type=ALL
##SBATCH --mail-user=martin.pfister@tuwien.ac.at

module purge                            # Start in a clean environment
nvidia-smi                              # Show the GPU(s) assigned to this job
module load miniconda3                  # Make conda available
eval "$(conda shell.bash hook)"         # Enable 'conda activate' in this shell
conda activate /path/to/your/conda/env  # Activate a conda environment by its full path
python -c "import torch; print(torch.cuda.is_available())"  # Check that the GPU is usable from Python
</file>

The commands at the bottom of this script are executed on the compute node once the job runs. In this example script, ''nvidia-smi'' prints information about the GPU(s) assigned to the job, then the conda environment is activated and a short Python command checks that the GPU can be used.

To submit this script, save it as example_gpu_job.slurm and execute
<code>sbatch example_gpu_job.slurm</code>
at the shell prompt.

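If the job is accepted, ''sbatch'' answers with the ID of the new job, roughly like this (the number is only an illustration):

<code bash>
$ sbatch example_gpu_job.slurm
Submitted batch job 1234567
</code>
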
To view your jobs, enter this command:
<code>squeue --user $USER</code>

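A couple of standard ''squeue'' options that can be handy (generic SLURM, nothing VSC-specific):

<code bash>
# Refresh the listing of your jobs every 10 seconds (Ctrl-C to stop)
squeue --user $USER --iterate=10

# Show the expected start time of pending jobs
squeue --user $USER --start
</code>
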
If you need to cancel a job, enter this command, but replace JOBID with the job ID given by squeue:
<code>scancel JOBID</code>

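To get rid of several jobs at once, ''scancel'' can also filter by user or by job name (standard SLURM options, use with care):

<code bash>
# Cancel all of your own jobs
scancel --user=$USER

# Cancel all jobs with a specific job name
scancel --name=GPU_job
</code>
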
If you want to get an idea of how many jobs are queued up for the A40 nodes, use this command:
<code>squeue --partition=zen2_0256_a40x2</code>
and for the A100 nodes:
<code>squeue --partition=zen3_0512_a100x2</code>
In the right-most column you either see the name of the node (e.g. n3071-016) where the job is currently running, or the reason why it is not (yet) running.

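If you prefer a single number, the listing can be restricted to pending jobs and counted with standard shell tools, for example:

<code bash>
# Count the jobs that are still waiting in the A40 partition
squeue --partition=zen2_0256_a40x2 --state=PENDING --noheader | wc -l
</code>
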
Once a job has finished, you can find its output in a file called ''slurm-JOBID.out'' in the directory from which the job was submitted.

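The file already exists while the job is running, so you can follow the output live; the job ID below is only a placeholder:

<code bash>
# Follow the output of a running job
tail -f slurm-1234567.out
</code>
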
To prepare your Python deep learning environment, you can use conda.
At VSC, there is a lot of HPC software pre-installed that can be loaded using module commands. To load conda, enter this command:
<code>
module load miniconda3
eval "$(conda shell.bash hook)"
</code>
The second line modifies the current shell so that different conda environments can be activated using ''conda activate''. You can prepare your own conda environment, or activate an existing one (e.g. one of the pre-installed shared environments) by its full path:
<code>
conda activate /path/to/your/conda/env
</code>

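Creating your own environment could look roughly like this (a sketch: the location under ''$DATA'' and the installed packages are just examples, adjust them to your needs):

<code bash>
module load miniconda3
eval "$(conda shell.bash hook)"

# Create a new environment in your data directory and install PyTorch into it
conda create --prefix $DATA/envs/my-dl-env python=3.11
conda activate $DATA/envs/my-dl-env
conda install -c conda-forge pytorch
</code>

Keeping environments under ''$DATA'' instead of ''$HOME'' avoids filling up the usually much smaller home quota.
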
You can find some additional information about conda in the course material of the Python4HPC course at:
[[https://

In addition to SSH access, there are a few JupyterHub nodes that can be accessed at:
[[https://
To get the same environment as above, select the „VSC-5 A40 GPU (conda python env)“ profile (or the corresponding A100 profile) and the conda env „Conda jupyterhub-horovod-v1 (Python 3.11.6)“. JupyterHub is not a substitute for SLURM batch scripts, but it can sometimes be a useful tool for interactively executing Python commands.