
Documentation for VSC-2

  1. Log in to your university's designated login server via SSH
    # Uni Wien
    ssh <username>@vsc2.univie.ac.at
    
    # TU Wien
    ssh <username>@vsc2.tuwien.ac.at
    
    # Boku Wien
    ssh <username>@vsc2.boku.ac.at
  2. Transfer your programs and data/input files to your home directory (a transfer and compile sketch follows after this list).
  3. (Re-)Compile your application. Choose your MPI environment as described in the MPI Environment section below.
  4. Write a job script for your application:
    #$ -N <job_name>
    #$ -pe mpich <slots>
    #$ -V
    mpirun -machinefile $TMPDIR/machines -np $NSLOTS <executable>

    where “<job_name>” is a freely chosen descriptive name and “<slots>” is the number of processor cores you want to use for the calculation. Note that the order of the options -machinefile and -np is essential! On VSC-2 only exclusive reservations of compute nodes are available, and each compute node provides 16 “<slots>”. Substitute the path to your MPI-enabled application for <executable> and you are ready to run.
    NOTE: when the option #$ -V is specified, the following error message will appear in one of the grid engine's output files:

    /bin/sh: module: line 1: syntax error: unexpected end of file
    /bin/sh: error importing function definition for `module'
    bash: module: line 1: syntax error: unexpected end of file
    bash: error importing function definition for `module'

    This message is due to a known bug in the grid engine, which cannot handle functions defined in the user environment, and can be safely ignored. You can avoid it by exporting only particular environment variables in your job script, e.g.:

    #$ -v PATH
    #$ -v LD_LIBRARY_PATH


    To receive E-Mail notifications concerning job events (b .. beginning, e .. end, a .. abort or reschedule, s .. suspend), use these lines in your job script:

    #$ -M <email address to notify of job events>
    #$ -m beas  # all job events sent via email

    It is often advisable to also specify the job's runtime as

    #$ -l h_rt=hh:mm:ss

    in particular when you know that your job will run for only a few hours or even minutes. This allows the scheduler to “backfill” your job into otherwise idle slots, for example while a highly parallel job is waiting for free resources, and can greatly reduce your waiting time.
    Here is an example job script that requests 32 processor cores (2 nodes), runs for a maximum of 3 hours, and sends e-mails at the beginning and at the end of the job:

    #$ -N hitchhiker
    #$ -pe mpich 32
    #$ -V
    #$ -M my.name@example.com
    #$ -m be
    #$ -l h_rt=03:00:00
    
    mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./myjob
  5. Submit your job:
    qsub <job_file>

    where “<job_file>” is the name of the file you just created.

  6. Check if and where your job has been scheduled:
    qstat
  7. Inspect the job output. Assuming your job was assigned the id “42” and your job's name was “hitchhiker”, you should be able to find the following files in the directory you started it from:
    $ ls -l
    hitchhiker.o42
    hitchhiker.e42
    hitchhiker.po42
    hitchhiker.pe42

    In this example hitchhiker.o42 contains the output of your job. hitchhiker.e42 contains possible error messages. In hitchhiker.po42 and hitchhiker.pe42 you might find additional information related to the parallel computing environment.

  8. Delete Jobs:
    $ qdel <job_id>
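
As a sketch for steps 2 and 3 above, a typical transfer and compile sequence could look like this; the project directory 'myproject', the source file 'myprog.c', and the 'mpicc' wrapper are placeholders for illustration, so adapt them to your own code and the compiler wrappers of your chosen MPI environment:

# on your own machine: copy the project into your VSC-2 home directory
scp -r myproject <username>@vsc2.univie.ac.at:~/

# on the login node: build with the MPI compiler wrapper
cd ~/myproject
mpicc -O2 -o myprog myprog.c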

Standard Queue (all.q)

The majority of jobs use the standard queue 'all.q' with a maximum run time of 3 days (72 hours).

Node types

All nodes are configured identically; only a few nodes have a different amount of main memory. VSC-2 has

  • 1334 nodes with 32GB main memory
  • 8 nodes with 64 GB main memory
  • 8 nodes with 128 GB main memory
  • 2 nodes with 256 GB main memory and 64 cores (not in 'all.q', but in 'highmem.q')

By default, jobs are scheduled onto nodes with at least 27 GB of free memory. To override this default, you may specify on the command line or in the job script:

  • to allow scheduling on the 32 GB, 64 GB and 128 GB nodes
    • this is the default
  • to allow scheduling on the 64 GB and 128 GB nodes
    • '-l mem_free=50G' (on the command line) or
    • '#$ -l mem_free=50G' (in the script),
  • to allow scheduling on 128 GB nodes only:
    • '-l mem_free=100G' (command line) or
    • '#$ -l mem_free=100G' (job script).
  • to allow scheduling on 256 GB nodes only:
    • see below: High Memory Queue

To keep jobs with low memory requirements off the nodes with 64 or 128 GB, priority adjustments are made in the queue.
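
For example, a job that needs roughly 50 GB of memory per node could combine the memory request with the usual directives. This is only a sketch; './myjob' is a placeholder for your executable:

#$ -N bigmem_job
#$ -pe mpich 16
#$ -V
#$ -l mem_free=50G

mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./myjob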

Long Queue

A queue allowing jobs to run for a maximum of 7 days is available on VSC-2. The limit on the number of slots per job is 128 and the maximum number of slots allocatable per user at any one time is 768. A total of 4096 slots are available for long jobs. All nodes of this queue have 32 GB main memory. Use this queue by specifying it explicitly in your job script:

#$ -q long.q

or submit your job with

 qsub -q long.q <job_file>

High Memory Queue

To accommodate jobs with higher memory requirements, two nodes with 256 GB memory and 64 cores are available in the queue 'highmem.q'. Each of these nodes contains four AMD Opteron 6274 processors at 2.2 GHz with 16 cores each. These nodes show a sustained performance of about 400 GFlop/s, comparable to about four standard VSC-2 nodes.

Due to the special memory requirements of jobs in this queue, jobs are granted exclusive access. 64 slots are accounted for, even if the job does not make efficient use of all 64 cores. Make sure to adapt your job script to pin processes to cores

export I_MPI_PIN_PROCESSOR_LIST=0-63

if applicable.

The run time limit is 3 days (72 hours).

Programs which work in the 'all.q' and the 'long.q' run without modifications on these nodes, too. Intel compilers and Intel MPI show good behaviour on the 'highmem.q' queue.

Please use these nodes only for jobs with memory requirements of more than 64 GB!
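
A 'highmem.q' job script might look like the following sketch, assuming that slots are requested via the 'mpich' parallel environment as in 'all.q'; './mybigjob' and the 12-hour run time are placeholders:

#$ -N highmem_job
#$ -q highmem.q
#$ -pe mpich 64
#$ -V
#$ -l h_rt=12:00:00

# pin the MPI processes to the 64 cores (Intel MPI)
export I_MPI_PIN_PROCESSOR_LIST=0-63

mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./mybigjob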

MPI Environment

On VSC-2 several versions of MPI are available. Choose one using 'mpi-selector' or 'mpi-selector-menu':

#list available MPI versions:
$ mpi-selector --list
impi_intel-4.1.0.024
impi_intel-4.1.1.036
intel_mpi_intel64-4.0.3.008
mvapich2_1.8_intel_limic
mvapich2_gcc-1.9a2
mvapich2_intel
openmpi-1.5.4_gcc
openmpi-1.5.4_intel
openmpi_gcc-1.6.4

#see the currently used MPI version:
$ mpi-selector --query
default:impi_intel-4.1.0.024
level:user

#set the MPI version:
$ mpi-selector --set impi_intel-4.1.0.024

Modifications will be active after logging in again.
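
After logging in again, you can check which MPI installation is actually picked up by your shell, e.g.:

$ mpi-selector --query
$ which mpicc mpirun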

In addition to $HOME, which is fine for standard jobs with rather few, small files (<1000 files, overall size <1 GB), there are a number of specialized scratch directories.

The Fraunhofer parallel cluster file system (FhGFS) is used in $GLOBAL and $SCRATCH.

Global Personal Scratch Directories $GLOBAL

Please use the environment variable $GLOBAL to access your personal scratch space. Access is available from the compute and login nodes. The variable expands to, for example:

$ echo $GLOBAL
/global/lv70999/username

The directory is writable by the user and readable by the group members. It is advisable to use these directories in particular for jobs with heavy I/O. This also reduces the load on the file server holding the $HOME directories.

The Fraunhofer parallel file system is shared by all users and all nodes. Single jobs producing a very heavy load (far more than 1000 requests per second) have been observed to reduce responsiveness for all jobs and all users.
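
A common pattern is to let each job work in its own subdirectory of $GLOBAL. The following is only a sketch; 'input.dat' and the executable '$HOME/myjob' are placeholders:

# inside the job script: create a job-specific working directory
WORKDIR=$GLOBAL/${JOB_NAME}.${JOB_ID}
mkdir -p $WORKDIR
cp $HOME/input.dat $WORKDIR/
cd $WORKDIR

mpirun -machinefile $TMPDIR/machines -np $NSLOTS $HOME/myjob

# the results remain in $WORKDIR after the job has finished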

Per-node Scratch Directories $SCRATCH

Local scratch directories on each node are provided as links into the Fraunhofer parallel file system and can thus also be viewed from the login nodes as '/fhgfs/rXXnXX/'. The parallel file system (and thus the performance) is identical for $SCRATCH and $GLOBAL. The variable $SCRATCH expands to:

$ echo $SCRATCH
/scratch

These directories are purged after job execution.
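
For data that is only needed during the job, the same pattern works with $SCRATCH; since the directory is purged afterwards, copy anything you want to keep back to $HOME. A sketch with placeholder file and program names:

cd $SCRATCH
cp $HOME/input.dat .
$HOME/myprog input.dat > result.out
cp result.out $HOME/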

Local temporary ram disk $TMPDIR

For smaller files and very fast access, restricted to a single node, the variables $TMP or $TMPDIR may be used; both expand to

$ echo $TMP -- $TMPDIR
/tmp/123456.789.queue.q -- /tmp/123456.789.queue.q

These directories are purged after job execution.

Please refrain from writing directly to '/tmp'!
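
Since access within a node is fast even for many small files, $TMPDIR is well suited e.g. for unpacking a working set of small input files; copy anything you want to keep back before the job ends. A sketch with placeholder names:

# unpack many small input files into the node-local ram disk
tar xzf $HOME/small_inputs.tgz -C $TMPDIR
cd $TMPDIR
$HOME/myprog > result.out
cp result.out $HOME/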

Joblocal scratch directory $JOBLOCAL

The newest, still experimental, scratch file system $JOBLOCAL provides temporary storage that is common to all nodes of a user job. The 'joblocal' file system may be requested with

-v JOBLOCAL_FILESYSTEM=TRUE

All nodes within a job access the same files under '/joblocal', which is purged after job execution.

This method scales very well up to several hundred similar jobs. Although the file system has a size of 32 GB, it is recommended to use only a few GB.

To save files at the job end, use, e.g.,

cd /joblocal; tar czf ${HOME}/joblocal_${JOB_NAME}_${JOB_ID}.tgz myfiles

in your user epilog script.

If there are many files (far more than 1000), please refrain from plainly copying them to $HOME or $GLOBAL at the end of the job.

Implementation details: $JOBLOCAL is implemented via SCSI RDMA Protocol (SRP) and NFS. Very high performance for small files is achieved by extensive caching on the jobs master node, which acts as (job internal) NFS server.
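
Putting this together, a job using the joblocal file system might look like the following sketch; here the results are archived at the end of the job script itself instead of in a separate epilog script, and the executable and the 'myfiles' output directory are placeholders:

#$ -N joblocal_test
#$ -pe mpich 16
#$ -V
#$ -v JOBLOCAL_FILESYSTEM=TRUE

# all nodes of the job see the same /joblocal
mkdir -p /joblocal/myfiles
cd /joblocal/myfiles

mpirun -machinefile $TMPDIR/machines -np $NSLOTS $HOME/myjob

# archive the results before /joblocal is purged
cd /joblocal; tar czf ${HOME}/joblocal_${JOB_NAME}_${JOB_ID}.tgz myfiles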

Comparison of scratch directories

  • $GLOBAL
    • Recommended file size: large
    • Lifetime: until file system failure
    • Size: x00 TB (for all users)
    • Scaling: troubles with very many small file accesses (from more than 100 nodes)
    • Visibility: global
    • Recommended usage: large files, available after job life
  • $SCRATCH
    • Recommended file size: large
    • Lifetime: job
    • Size: x00 TB (for all users)
    • Scaling: troubles with very many small file accesses (from more than 100 nodes)
    • Visibility: node (see above)
    • Recommended usage: large files
  • $TMPDIR
    • Recommended file size: small
    • Lifetime: job
    • Size: a few GB (within memory)
    • Scaling: no problem (local)
    • Visibility: node
    • Recommended usage: small files, or many seek operations within a file
  • $JOBLOCAL (experimental)
    • Recommended file size: small
    • Lifetime: job
    • Size: about 5 GB (hard limit: 32 GB)
    • Scaling: no problem (local)
    • Visibility: job
    • Recommended usage: many small files (>1000), or many seek operations within a file

To make sure that the MPI communication happens via the InfiniBand fabric, please use the following settings in your job script and/or in your .bashrc file:

export I_MPI_DAT_LIBRARY=/usr/lib64/libdat2.so.2
export OMP_NUM_THREADS=1
export I_MPI_FABRICS=shm:dapl
export I_MPI_FALLBACK=0
export I_MPI_CPUINFO=proc
export I_MPI_PIN_PROCESSOR_LIST=1,14,9,6,5,10,13,2,3,12,11,4,7,8,15,0
export I_MPI_JOB_FAST_STARTUP=0

Memory performance on VSC-2 depends strongly on how processes are placed onto the four NUMA nodes of each compute node. When using Intel MPI, the parameter

export I_MPI_PIN_PROCESSOR_LIST=1,14,9,6,5,10,13,2,3,12,11,4,7,8,15,0

as mentioned above should always be used to pin (up to) 16 processes to the 16 cores. For sequential jobs, we recommend using 'taskset' or 'numactl', e.g.

taskset -c 0 our_example_code param1 param2 >out1 &
taskset -c 8 our_example_code param1 param2 >out2 &
wait

Performance gains of up to 200% were observed for synthetic benchmarks. Note also the examples for sequential jobs.
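
With 'numactl' the same kind of placement can be expressed per NUMA node instead of per core; a sketch for two sequential runs, binding both the process and its memory:

numactl --cpunodebind=0 --membind=0 ./our_example_code param1 param2 >out1 &
numactl --cpunodebind=1 --membind=1 ./our_example_code param1 param2 >out2 &
wait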

No backup on VSC.

Backup is the responsibility of each user.

Data loss by hardware failure is prevented by using state-of-the-art technology like RAID-6.
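
One simple option is to pull important results from the cluster to your own machine via rsync; the directory names in this sketch are placeholders:

# run on your own machine
rsync -av <username>@vsc2.univie.ac.at:myproject/results/ ./vsc2-backup/results/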
