# Documentation for VSC-2

# Uni Wien

# TU Wien

# Boku Wien
```
ssh <username>@vsc2.boku.ac.at
```
4. Write a job script for your application:

```
#$ -N <job_name>
#$ -pe mpich <slots>
#$ -V
mpirun -machinefile $TMPDIR/machines -np $NSLOTS <executable>
```

Here “<job_name>” is a freely chosen descriptive name and “<slots>” is the number of processor cores that you want to use for the calculation; the order of the options -machinefile and -np is essential! On VSC-2 only exclusive reservations of compute nodes are available; each compute node provides 16 “<slots>”. Substitute the path to your MPI-enabled application for <executable> and you are ready to run! NOTE: when the option #$ -V is specified, this error message will be generated in one of the output files of the grid engine:

```
/bin/sh: module: line 1: syntax error: unexpected end of file
/bin/sh: error importing function definition for `module'
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
```

This message is due to a known bug in the grid engine, which cannot handle functions defined in the user environment. The message can be safely ignored. You can avoid it by exporting only particular environment variables in your job script, like:

```
#$ -v PATH
#$ -v LD_LIBRARY_PATH
```
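Since VSC-2 always reserves complete nodes with 16 cores each, the slot count given to '-pe mpich' should be a multiple of 16. A minimal sketch of the arithmetic (the desired core count is a hypothetical example):

```shell
# Round a desired core count up to whole VSC-2 nodes (16 cores per node),
# since compute nodes are always reserved exclusively.
cores_needed=20                              # hypothetical requirement
slots=$(( (cores_needed + 15) / 16 * 16 ))   # slots to request
nodes=$(( slots / 16 ))                      # nodes that will be reserved
echo "#\$ -pe mpich $slots   # reserves $nodes nodes"
```

For 20 cores this requests 32 slots, i.e. 2 exclusive nodes.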

To receive e-mail notifications concerning job events (b .. beginning, e .. end, a .. abort or reschedule, s .. suspend), use these lines in your job script:

```
#$ -M <email address to notify of job events>
#$ -m beas  # all job events sent via email
```

It is often advisable to also specify the job's runtime as

```
#$ -l h_rt=hh:mm:ss
```

in particular when you know that your job will run only for several hours or even minutes. That way the scheduler can “backfill” the queue, avoiding very long waiting times that may be caused by a highly parallel job waiting for free resources. Here is an example job script, requesting 32 processor cores (2 nodes), which will run for a maximum of 3 hours and sends e-mails at the beginning and at the end of the job:

```
#$ -N hitchhiker
#$ -pe mpich 32
#$ -V
#$ -M my.name@example.com
#$ -m be
#$ -l h_rt=03:00:00
mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./myjob
```

5. Submit your job:

```
qsub <job_file>
```

where “<job_file>” is the name of the file you just created.

6. Check if and where your job has been scheduled:

```
qstat
```

7. Inspect the job output. Assuming your job was assigned the id “42” and your job's name was “hitchhiker”, you should be able to find the following files in the directory you started it from:

```
$ ls -l
hitchhiker.o42
hitchhiker.e42
hitchhiker.po42
hitchhiker.pe42
```

In this example hitchhiker.o42 contains the output of your job. hitchhiker.e42 contains possible error messages. In hitchhiker.po42 and hitchhiker.pe42 you might find additional information related to the parallel computing environment.
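As a quick sketch (using the example job name and id from above, recreated here as dummy files for illustration), one can verify after the run that the error files are empty:

```shell
# Create a stand-in submit directory containing the four grid-engine
# files from the example above (dummy contents, for illustration only).
dir=$(mktemp -d) && cd "$dir"
echo "output" > hitchhiker.o42   # normal job output
: > hitchhiker.e42               # empty error file: all went well
: > hitchhiker.po42
: > hitchhiker.pe42

# Report any error file that is not empty.
status=ok
for f in hitchhiker.e42 hitchhiker.pe42; do
    [ -s "$f" ] && { echo "non-empty error file: $f"; status=check; }
done
echo "status: $status"
```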

8. Delete jobs:

```
qdel <job_id>
```

#### Standard Queue (all.q)

The majority of jobs use the standard queue 'all.q' with a maximum run time of 3 days (72 hours).

##### Node types

All nodes are configured equivalently; only a few nodes have different amounts of memory. VSC-2 has

- 1334 nodes with 32 GB main memory
- 8 nodes with 64 GB main memory
- 8 nodes with 128 GB main memory
- 2 nodes with 256 GB main memory and 64 cores (not in 'all.q', but in 'highmem.q')

By default, jobs are scheduled to nodes with at least 27 GB of free memory. To override this default on the command line or in the job script you may specify:

- to allow scheduling on the 32 GB, 64 GB and 128 GB nodes: this is the default
- to allow scheduling on the 64 GB and 128 GB nodes:
  - '-l mem_free=50G' (on the command line) or
  - '#$ -l mem_free=50G' (in the job script)
- to allow scheduling on 128 GB nodes only:
  - '-l mem_free=100G' (command line) or
  - '#$ -l mem_free=100G' (job script)
- to allow scheduling on 256 GB nodes: see below, High Memory Queue

To keep jobs with low memory requirements off the nodes with 64 or 128 GB, priority adjustments are made in the queue.

#### Long Queue

A queue in which jobs are allowed to run for a maximum of 7 days is available on VSC-2. The limit on the number of slots per job is 128, and the maximum number of allocatable slots per user at one time is 768. A total of 4096 slots are available for long jobs. All nodes of this queue have 32 GB main memory. Use this queue by specifying it explicitly in your job script:

```
#$ -q long.q
```

or when submitting:

```
qsub -q long.q <job_file>
```

#### High Memory Queue

Due to higher memory demands from some users, two nodes with 256 GB memory and 64 cores are available in the queue 'highmem.q'. Each of these nodes has four AMD Opteron 6274 processors (2.2 GHz, 16 cores each). These nodes show a sustained performance of about 400 GFlop/s, which compares to about four standard nodes of the VSC-2.

Due to the special memory requirements of jobs in this queue, jobs are granted exclusive access, and 64 slots are accounted for even if the job does not make efficient use of all 64 cores. If applicable, make sure to adapt your job script to pin processes to cores:

```
export I_MPI_PIN_PROCESSOR_LIST=0-63
```

The run time limit is 3 days (72 hours).

Programs which work in the 'all.q' and the 'long.q' run without modifications on these nodes, too. Intel compilers and Intel MPI show good behaviour on the 'highmem.q' queue.

Please use this node only for jobs with memory requirements of more than 64 GB!
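Putting the pieces together, a 'highmem.q' job script might look as follows (a sketch: the job name and the executable are placeholders; the pinning line is the one given above):

```shell
#$ -N bigmem_example             # placeholder job name
#$ -q highmem.q                  # request the high memory queue
#$ -pe mpich 64                  # all 64 slots are accounted for anyway
#$ -l h_rt=72:00:00              # maximum run time of this queue
export I_MPI_PIN_PROCESSOR_LIST=0-63
mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./my_memory_hungry_app
```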

On VSC-2 several versions of MPI are available. Choose one using 'mpi-selector' or 'mpi-selector-menu':

```
# list available MPI versions:
$ mpi-selector --list
impi_intel-4.1.0.024
impi_intel-4.1.1.036
intel_mpi_intel64-4.0.3.008
mvapich2_1.8_intel_limic
mvapich2_gcc-1.9a2
mvapich2_intel
openmpi-1.5.4_gcc
openmpi-1.5.4_intel
openmpi_gcc-1.6.4

# see the currently used MPI version:
$ mpi-selector --query
default:impi_intel-4.1.0.024
level:user

# set the MPI version:
$ mpi-selector --set impi_intel-4.1.0.024
```

Modifications will be active after logging in again.

In addition to $HOME, which is fine to use for standard jobs with rather few small files (<1000 files, overall size <1 GB), there are a number of specialized scratch directories.

The Fraunhofer parallel cluster file system (FhGFS) is used in $GLOBAL and $SCRATCH.

#### Global Personal Scratch Directories $GLOBAL

Please use the environment variable $GLOBAL to access your personal scratch space. Access is available from the compute and login nodes. The variable expands as, e.g.:

```
$ echo $GLOBAL
/global/lv70999/username
```

#### Local scratch directories $SCRATCH

Local scratch directories on each node are provided as a link to the Fraunhofer parallel file system and can thus also be viewed via the login nodes as '/fhgfs/rXXnXX/'. The parallel file system (and thus the performance) is identical between $SCRATCH and $GLOBAL. The variable $SCRATCH expands as:

```
$ echo $SCRATCH
/scratch
```

These directories are purged after job execution.

#### Local temporary ram disk $TMPDIR

For smaller files and very fast access, restricted to single nodes, the variables $TMP or $TMPDIR may be used which expand equally to

```
$ echo $TMP -- $TMPDIR
/tmp/123456.789.queue.q -- /tmp/123456.789.queue.q
```

These directories are purged after job execution. Please refrain from writing directly to '/tmp'!

#### Joblocal scratch directory $JOBLOCAL

The newest, still experimental, scratch file system $JOBLOCAL is a common temporary storage within a user job. The 'joblocal' file system may be requested with:

```
-v JOBLOCAL_FILESYSTEM=TRUE
```

All nodes within a job access the same files under '/joblocal', which is purged after job execution. This method scales very well up to several hundred similar jobs. Although the file system has 32 GB, it is recommended to use only a few GB. To save files at the end of the job, use, e.g.:

```
cd /joblocal; tar czf ${HOME}/joblocal_${JOB_NAME}_${JOB_ID}.tgz myfiles
```

If there are very many files (≫1000), please refrain from plainly copying them to $HOME or $GLOBAL at the end of the job.
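As a sketch of that advice (stand-in directories created via mktemp instead of the real '/joblocal' and $HOME, and a deliberately small file count), archive many small files into a single tarball rather than copying them individually:

```shell
# Stand-ins for /joblocal and a target directory in $HOME.
src=$(mktemp -d)
dst=$(mktemp -d)
for i in $(seq 1 5); do echo "data $i" > "$src/file$i"; done

nfiles=$(find "$src" -type f | wc -l)
if [ "$nfiles" -gt 1000 ]; then
    # many files: move a single archive instead of thousands of entries
    tar czf "$dst/results.tgz" -C "$src" .
else
    cp -r "$src/." "$dst/"
fi
echo "$nfiles files handled"
```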

Implementation details: $JOBLOCAL is implemented via the SCSI RDMA Protocol (SRP) and NFS. Very high performance for small files is achieved by extensive caching on the job's master node, which acts as a (job-internal) NFS server.

#### Comparison of scratch directories

|  | $GLOBAL | $SCRATCH | $TMPDIR | $JOBLOCAL (experimental) |
| --- | --- | --- | --- | --- |
| Recommended file size | large | large | small | small |
| Lifetime | until file system failure | job | job | job |
| Size | x00 TB (for all users) | x00 TB (for all users) | a few GB (within memory) | about 5 GB (hard limit: 32 GB) |
| Scaling | troubles with very many small file accesses (from more than 100 nodes) | troubles with very many small file accesses (from more than 100 nodes) | no problem (local) | no problem (local) |
| Visibility | global | node (see above) | node | job |
| Recommended usage | large files, available after job life | large files | small files, or many seek operations within a file | many small files (>1000), or many seek operations within a file |

To make sure that the MPI communication happens via the InfiniBand fabric, please use the following settings in your job script and/or in your .bashrc file:

```
export I_MPI_DAT_LIBRARY=/usr/lib64/libdat2.so.2
export I_MPI_FABRICS=shm:dapl
export I_MPI_FALLBACK=0
export I_MPI_CPUINFO=proc
export I_MPI_PIN_PROCESSOR_LIST=1,14,9,6,5,10,13,2,3,12,11,4,7,8,15,0
export I_MPI_JOB_FAST_STARTUP=0
```

Performance on the NUMA architecture of VSC-2 depends strongly on how processes are placed on the four NUMA nodes of each compute node. With Intel MPI, the parameter

```
export I_MPI_PIN_PROCESSOR_LIST=1,14,9,6,5,10,13,2,3,12,11,4,7,8,15,0
```

as mentioned above should always be used to pin (up to) 16 processes to the 16 cores. For sequential jobs, we recommend using 'taskset' or 'numactl', e.g.:

```
taskset -c 0 our_example_code param1 param2 >out1 &
taskset -c 8 our_example_code param1 param2 >out2 &
wait
```

Performance gains of up to 200% were observed for synthetic benchmarks. Note also the examples for sequential jobs.
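A 'numactl' variant of the same job-script fragment (a sketch; same placeholder program and parameters as in the 'taskset' example) additionally binds each process's memory allocations to its local NUMA node:

```shell
# Pin each sequential run to one core and keep its memory on the
# local NUMA node (placeholder program and parameters as above).
numactl --physcpubind=0 --localalloc our_example_code param1 param2 >out1 &
numactl --physcpubind=8 --localalloc our_example_code param1 param2 >out2 &
wait
```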

There is no backup on VSC; backup is the responsibility of each user. Data loss due to hardware failure is prevented by using state-of-the-art technology such as RAID-6.
