doku:ai_intro: revision 2024/03/18 12:11 by mpfister ("Add SSH jump host"); previous revision 2024/03/15 08:18 by alindner
====== AI on VSC ======

<box>
This page is still a work in progress.
</box>
VSC5 is a high-performance cluster that consists of different kinds of nodes.
On VSC5, SLURM is used as the scheduler for queuing jobs. You can find an introduction to SLURM in the course material of the VSC introduction course at:
[[https://
But to make things easier, the most important commands are summarized on this page.
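As a hedged cheat sheet, these are the standard SLURM client commands used in day-to-day work; the script name refers to the job template on this page, and ''<jobid>'' is a placeholder. Since the commands themselves only work on a cluster login node, the sketch below just prints them:

```shell
#!/bin/bash
# Hedged SLURM cheat sheet: standard SLURM client commands, but
# "gpu_job_template.slurm" and "<jobid>" are placeholders. On a VSC5 login
# node you would run the commands directly; this sketch only prints them.
slurm_cheatsheet() {
cat <<'EOF'
sbatch gpu_job_template.slurm   # submit a job script; prints the new job ID
squeue -u $USER                 # list your own pending and running jobs
scancel <jobid>                 # cancel a queued or running job
sinfo                           # show partitions and node states
EOF
}

slurm_cheatsheet
```

''sbatch'' prints the job ID that you later pass to ''scancel'' if you need to abort the job.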
Every SLURM job needs a job description file. These files are essentially shell scripts with some additional #SBATCH directives at the top. Here is an example file:
<file bash gpu_job_template.slurm>
#!/bin/bash

## Specify job name:
#SBATCH --job-name=GPU_job

## Specify GPU:
## For Nvidia A40:
##SBATCH --partition=zen2_0256_a40x2
##SBATCH --qos=zen2_0256_a40x2
## For Nvidia A100:
#SBATCH --partition=zen3_0512_a100x2
#SBATCH --qos=zen3_0512_a100x2

## Specify run time limit (days-hours:minutes:seconds):
## Note: Job will be killed once the run time limit is reached.
## Shorter values might reduce queuing time.
#SBATCH --time=0-01:00:00

## Specify number of GPUs (1 or 2):
#SBATCH --gres=gpu:1

## Optional: Get notified via mail when the job runs and finishes:
##SBATCH --mail-type=ALL
##SBATCH --mail-user=user@example.com

# Start in a clean environment:
module purge

# List available GPUs:
nvidia-smi

# Load conda:
module load miniconda3
eval "$(conda shell.bash hook)"

# Load a conda environment with Python 3.11.6, PyTorch 2.1.0, TensorFlow 2.13.1 and other packages:
export XLA_FLAGS=--xla_gpu_cuda_data_dir=/
conda activate /

# Run AI scripts:
python -c "
</file>
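The template switches GPU type by toggling between ''#SBATCH'' (an active directive) and ''##SBATCH'' (a plain comment that SLURM ignores). Before submitting, it can help to list which directives are actually live. A minimal sketch; the trimmed template is recreated inline so the example is self-contained, and its contents are illustrative:

```shell
#!/bin/bash
# Lines starting with exactly '#SBATCH' are live directives; '##SBATCH'
# lines are ordinary comments that SLURM skips. Recreate a trimmed copy of
# the template (illustrative content) and show what SLURM would parse.
cat > gpu_job_check_demo.slurm <<'EOF'
#!/bin/bash
## For Nvidia A40:
##SBATCH --partition=zen2_0256_a40x2
##SBATCH --qos=zen2_0256_a40x2
## For Nvidia A100:
#SBATCH --partition=zen3_0512_a100x2
#SBATCH --qos=zen3_0512_a100x2
EOF

# Print only the active directives:
grep '^#SBATCH' gpu_job_check_demo.slurm
```

Note that ''sbatch'' stops reading directives at the first non-comment line in the script, so all ''#SBATCH'' lines should stay at the top, before the first command.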