====== Introduction to running AI jobs on VSC5 ======

VSC5 is a high-performance cluster that consists of different kinds of nodes. If you are outside of the university network, you need to use a jump host before you can reach the cluster via SSH.
+ | |||
+ | At VSC5, SLURM is used as scheduler for queuing jobs. You can find an introduction to SLURM in the course material of the VSC introduction course at: | ||
+ | [[https:// | ||
+ | Additional information can be found in the VSC wiki: | ||
+ | [[https:// | ||
+ | |||
+ | But to make things easier, a summary of the most important commands is given here: | ||
+ | |||
+ | Every SLURM job needs a job description file. These files are kind of shell scripts, but with some additional boilerplate at the top. Here is an example file: | ||
<file bash gpu_job_template.slurm>
#!/bin/bash

## Specify job name:
#SBATCH --job-name=GPU_job

## Specify GPU:
## For Nvidia A40:
##SBATCH --partition=zen2_0256_a40x2
##SBATCH --qos=zen2_0256_a40x2
## For Nvidia A100:
#SBATCH --partition=zen3_0512_a100x2
#SBATCH --qos=zen3_0512_a100x2

## Specify run time limit in format days-hours:minutes:seconds
## Note: Job will be killed once the run time limit is reached.
## Shorter values might reduce queuing time.
#SBATCH --time=3-00:00:00

## Specify number of GPUs (1 or 2):
#SBATCH --gres=gpu:2

## Optional: Get notified via mail when the job runs and finishes:
##SBATCH --mail-type=ALL
##SBATCH --mail-user=user@example.com

# Start in a clean environment:
module purge

# List available GPUs:
nvidia-smi

# Load conda:
module load miniconda3
eval "$(conda shell.bash hook)"

# Load a conda environment with Python 3.11.6, PyTorch 2.1.0, TensorFlow 2.13.1 and other packages:
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/...
conda activate /...

# Run AI scripts:
python -c "..."
</file>
+ | |||
+ | The commands at the bottom of this script are executed on the compute node once the job runs. In this example script, '' | ||
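
When the job runs, SLURM exports details of the allocation as environment variables on the compute node. As an illustration (our own sketch, not part of this page), here is a minimal Python script you could submit in place of the inline ''python -c'' command; the variable names are standard SLURM/CUDA ones, everything else is hypothetical:

```python
# check_job_env.py -- hypothetical sanity-check script for a SLURM job.
# SLURM sets these variables on the compute node; outside of a job they
# are missing, so we fall back to "not set".
import os

def job_info(env=os.environ):
    """Collect a few SLURM/CUDA variables that are useful for debugging."""
    keys = ("SLURM_JOB_ID", "SLURM_JOB_NODELIST", "CUDA_VISIBLE_DEVICES")
    return {key: env.get(key, "not set") for key in keys}

if __name__ == "__main__":
    for key, value in job_info().items():
        print(f"{key}: {value}")
```

Printing these values at the start of a job makes it much easier to debug jobs that land on an unexpected node or see no GPUs.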
+ | |||
+ | To submit this script, save the SLURM script as example_gpu_job.slurm and execute | ||
+ | < | ||
+ | at the shell prompt. | ||
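
''sbatch'' confirms a successful submission with a line of the form ''Submitted batch job <id>''. If you want to automate submissions, a small sketch like the following (our own illustration, not part of this page) can capture that job ID for later use with ''squeue'' or ''scancel'':

```python
# Sketch: extract the job ID from sbatch's confirmation line so that it
# can later be passed to squeue or scancel.
import re

def parse_job_id(sbatch_output):
    """Return the numeric job ID from sbatch's stdout, or None."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    return int(match.group(1)) if match else None

# On the login node you would run something like:
#   import subprocess
#   out = subprocess.run(["sbatch", "gpu_job_template.slurm"],
#                        capture_output=True, text=True, check=True).stdout
#   job_id = parse_job_id(out)

print(parse_job_id("Submitted batch job 123456"))  # -> 123456
```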
+ | |||
+ | To view your jobs, enter this command: | ||
+ | < | ||
+ | |||
+ | If you need to cancel a job, enter this command, but replace JOBID with the JOBID given by squeue: | ||
+ | < | ||
+ | |||
+ | If you want to get an idea of how many jobs there are queued up at the A40 nodes, use this command: | ||
+ | < | ||
+ | and for A100 nodes: | ||
+ | < | ||
+ | In the right-most column you either see the name of the node (e.g. n3071-016) where the job is currently running or the reason why it is not (yet) running. | ||
+ | |||
+ | Once a job finished, you can find the output in a file called '' | ||
+ | |||
+ | To prepare your python deep learning environment, | ||
+ | At VSC, there is a lot of HPC software pre-installed that can be loaded using module commands. To load conda, enter this command: | ||
+ | < | ||
+ | module load miniconda3 | ||
+ | eval " | ||
+ | </ | ||
+ | The second line modifies the current shell so that different conda environments can be activated using conda activate. You can prepare your own conda environment, | ||
+ | < | ||
+ | conda activate / | ||
+ | </ | ||
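
After ''conda activate'', conda exports the path of the active environment in the ''CONDA_PREFIX'' variable. As a quick illustration (our own sketch, not from this page), you can verify from inside Python that the intended environment is active:

```python
# Sketch: check which conda environment is active. "conda activate"
# exports CONDA_PREFIX; a missing variable means no environment is active.
import os
import sys

def active_env(env=os.environ):
    """Return the prefix of the active conda environment, or None."""
    return env.get("CONDA_PREFIX")

def python_at_least(minimum):
    """True if the running interpreter meets the given (major, minor) version."""
    return sys.version_info[:2] >= minimum

if __name__ == "__main__":
    print("Active conda env:", active_env() or "none")
    print("Python >= 3.11:", python_at_least((3, 11)))
```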
+ | |||
+ | You can find some additional information about conda in the course material of the Python4HPC course at: | ||
+ | [[https:// | ||
+ | |||
+ | In addition to SSH access, there are a few JupyterHub nodes that can be accessed at: | ||
+ | [[https:// | ||
+ | To get the same environment as above, select the „VSC-5 A40 GPU (conda python env)“ profile (or the corresponding A100 profile) and the conda env „Conda jupyterhub-horovod-v1 (Python 3.11.6)“. JupterHub is not a substitute for SLURM batch scripts, but can sometimes be a useful tool for interactively executing python commands. | ||