Differences

This shows you the differences between two versions of the page.

--- doku:vsc2 [2014/06/18 10:45] – removed ir
+++ doku:vsc2 [2022/02/01 23:10] (current) – ↷ Links adapted because of a move operation 114.119.138.70
@@ Line 1: / Line 1: @@
+====== Documentation for VSC-2 ======
+==== Quick Start ====
+  - Log in to your university's designated login server via SSH \\ <code># Uni Wien
+ssh <username>@vsc2.univie.ac.at
+# TU Wien
+ssh <username>@vsc2.tuwien.ac.at
+# Boku Wien
+ssh <username>@vsc2.boku.ac.at
+</code>
+  - Transfer your programs and data/input files to your home directory.
+  - (Re-)Compile your application. Please choose your MPI-Environment as described in [[doku:mpi|MPI Environment]].
+  - Write a job script for your application: \\ <code>#$ -N <job_name>
+#$ -pe mpich <slots>
+#$ -V
+mpirun -machinefile $TMPDIR/machines -np $NSLOTS <executable></code> where "<job_name>" is a freely chosen descriptive name and "<slots>" is the number of processor cores that you want to use for the calculation and the **order of the options** ''-machinefile'' and ''-np'' is essential! On VSC-2 only exclusive reservations of compute nodes are available. Each compute node provides 16 "<slots>". Substitute the path to your MPI-enabled application for <executable> and you are ready to run!\\ **NOTE**: when the option ''#$ -V'' is specified this error message will be generated in one of the output files of the grid engine:<code>
+/bin/sh: module: line 1: syntax error: unexpected end of file
+/bin/sh: error importing function definition for `module'
+bash: module: line 1: syntax error: unexpected end of file
+bash: error importing function definition for `module'</code> This message is due to a know bug in the grid engine which cannot handle functions defined in the user environment. This message can be safely ignored. You can avoid this error message by exporting only particular environment variables in your job-script, like:<code>#$ -v PATH
+#$ -v LD_LIBRARY_PATH
+</code>\\ To receive E-Mail notifications concerning job events (b .. beginning, e .. end, a .. abort or reschedule, s .. suspend), use these lines in your job script: \\ <code>#$ -M <email address to notify of job events>
+#$ -m beas  # all job events sent via email</code> It is often advisable to also specify the job's runtime as <code>#$ -l h_rt=hh:mm:ss</code> in particular when you know that your job will run only for several hours or even minutes. That way one can "backfill" the queue, thus avoiding very long waiting times, which may be due to a highly parallel job waiting for free resources. \\ Here is an example job-script, requesting 32 processor cores (2 nodes), which will run for a maximum of 3 hours and sends emails at the beginning and at the end of the job: <code>
+#$ -N hitchhiker
+#$ -pe mpich 32
+#$ -V
+#$ -M my.name@example.com
+#$ -m be
+#$ -l h_rt=03:00:00
+mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./myjob</code>
+  - Submit your job:\\ <code>qsub <job_file></code>where "<job_file>" is the name of the file you just created.
+  - Check if and where your job has been scheduled:\\ <code>qstat</code>
+  - Inspect the job output. Assuming your job was assigned the id "42" and your job's name was "hitchhiker", you should be able to find the following files in the directory you started it from: \\ <code>$ ls -l
+hitchhiker.o42
+hitchhiker.e42
+hitchhiker.po42
+hitchhiker.pe42</code> In this example hitchhiker.o42 contains the output of your job. hitchhiker.e42 contains possible error messages. In hitchhiker.po42 and hitchhiker.pe42 you might find additional information related to the parallel computing environment.
+  - Delete Jobs: <code>$ qdel <job_id></code>
+==== Queues ====
+=== Standard Queue (all.q) ===
+The majority of jobs use the standard queue '''all.q''' with a maximum run time of 3 days (72 hours).
+== Node types ==
+All nodes are configured equivalently, only a few nodes have different amounts of memory.
+VSC-2 has
+  * 1334 nodes with 32GB main memory
+  * 8 nodes with 64 GB main memory
+  * 8 nodes with 128 GB main memory
+  * 2 nodes with 256 GB main memory and 64 cores (not in '''all.q''', but in '''highmem.q''')
+Jobs are scheduled by default to request nodes with at least 27 GB free memory. To override this default on the command line or in the job script you may specify:
+  * to allow scheduling on the 32 GB, 64 GB and 128 GB nodes
+    * this is the default
+  * to allow scheduling on the 64 GB and 128 GB nodes
+    * '''-l mem_free=50G''' (on the command line) or
+    * '''#$ -l mem_free=50G''' (in the script),
+  * to allow scheduling on 128 GB nodes only:
+    * '''-l mem_free=100G''' (command line) or
+    * '''#$ -l mem_free=100G''' (job script).
+  * to allow scheduling on 256 GB nodes only:
+    * see below: High Memory Queue
+In order to avoid jobs with low memory requirements on nodes with 64 or 128 GB, priority adjustments are made in the queue.
+=== Long Queue ===
+A queue where jobs will be allowed to run a maximum of 7 days
+is available on VSC-2. The limit on the number of slots per job
+is 128 and the maximum number of allocatable slots per user
+at one time is 768. A total of 4096 slots are available for long jobs.
+All nodes of this queue have 32GB main memory.
+Use this queue by specifying it explicitly in your job script:
+<code>#$ -q long.q</code> or submit your job with <code> qsub -q long.q <job_file></code>
+=== High Memory Queue ===
+Due to higher memory requests from some users, two nodes with 256 GB memory and 64 cores
+are available in the queue '''highmem.q'''.
+The four processors utilized are AMD Opteron 6274 with 2.2GHz and 16 cores each.
+These nodes show a sustained performance of about 400 GFlop/s, which compares to about four standard nodes of the VSC-2.
+Due to the special memory requirements of jobs in this queue, jobs are granted exclusive access.
+slots are accounted for, even if the job does not make efficient use of all 64 cores.
+Make sure to adapt your job script to pin processes to cores
+<code>export I_MPI_PIN_PROCESSOR_LIST=0-63</code>
+if applicable.
+The run time limit is 3 days (72 hours).
+Programs which work in the '''all.q''' and the '''long.q''' run without modifications on these nodes, too.
+Intel compilers and Intel MPI show good behaviour on the '''highmem.q''' queue.
+Please use this node only for jobs with memory requirements of more than 64 GB!
+==== MPI Version ====
+On VSC-2 several versions of MPI are available.
+Choose one using 'mpi-selector' or 'mpi-selector-menu':
+<code>
+#list available MPI versions:
+$ mpi-selector --list
+impi_intel-4.1.0.024
+impi_intel-4.1.1.036
+intel_mpi_intel64-4.0.3.008
+mvapich2_1.8_intel_limic
+mvapich2_gcc-1.9a2
+mvapich2_intel
+openmpi-1.5.4_gcc
+openmpi-1.5.4_intel
+openmpi_gcc-1.6.4
+#see the currently used MPI version:
+$ mpi-selector --query
+default:impi_intel-4.1.0.024
+level:user
+#set the MPI version:
+$ mpi-selector --set impi_intel-4.1.0.024
+</code>
+Modifications will be active after logging in again.
+==== Scratch Directories ====
+In addition to $HOME, which is fine to use for standard jobs with rather few small files (<1000 files, overall size <1G), there are a number of specialized scratch directories.
+The [[http://www.fhgfs.com/cms/documentation|Fraunhofer parallel cluster file system (FhGFS)]] is used in $GLOBAL and $SCRATCH.
+=== Global Personal Scratch Directories $GLOBAL ===
+Please use the environment variable ''$GLOBAL'' to access your personal scratch space. Access is available from the compute and login nodes. The variable expands as e.g.:
+<code>
+$ echo $GLOBAL
+/global/lv70999/username
+</code>
+The directory is writeable as user and readable by the group members. It is advisable to make use of these directories in particular for jobs with heavy I/O operations. In addition it will reduce the load on the fileserver holding the $HOME directories.
+The Fraunhofer parallel file system is shared by all users and by all nodes.
+Single jobs producing heavy load (>>1000 requests per second) have been observed to reduce responsiveness for all jobs and all users.
+=== Per-node Scratch Directories $SCRATCH ===
+Local scratch directories on each node are provided as a link to the Fraunhofer parallel file system and can thus be viewed also via the login nodes as '''/fhgfs/rXXnXX/'''.
+The parallel file system (and thus the performance) is identical between $SCRATCH and $GLOBAL.
+The variable ''$SCRATCH'' expands as:
+<code>
+$ echo $SCRATCH
+/scratch
+</code>
+These directories are purged after job execution.
+=== Local temporary ram disk $TMPDIR ===
+For smaller files and very fast access, restricted to single nodes, the variables ''$TMP'' or ''$TMPDIR'' may be used which expand equally to
+<code>
+$ echo $TMP -- $TMPDIR
+/tmp/123456.789.queue.q -- /tmp/123456.789.queue.q
+</code>
+These directories are purged after job execution.
+Please refrain from writing directly to '''/tmp'''!
+=== Joblocal scratch directory $JOBLOCAL ===
+The newest, still experimental, scratch file system $JOBLOCAL is a common temporary storage within a user job.
+The '''joblocal''' file system may be requested with
+<code>
+-v JOBLOCAL_FILESYSTEM=TRUE
+</code>
+All nodes within a job access the same files under '''/joblocal''', which is purged after job execution.
+This method scales very well up to several hundred similar jobs.
+Although the file system has 32GB, it is recommended to use only a few GB.
+To save files at the job end, use, e.g.,
+<code>
+cd /joblocal; tar czf ${HOME}/joblocal_${JOB_NAME}_${JOB_ID}.tgz myfiles
+</code>
+in your [[prolog|user epilog]] script.
+If there are many files (>>1000), please refrain from plain copying to $HOME or $GLOBAL at the job end.
+Implementation details: ''$JOBLOCAL'' is implemented via [[http://en.wikipedia.org/wiki/SCSI_RDMA_Protocol| SCSI RDMA Protocol (SRP)]] and [[http://en.wikipedia.org/wiki/Network_File_System|NFS]].
+Very high performance for small files is achieved by extensive caching on the jobs master node, which acts as (job internal) NFS server.
+=== Comparison of scratch directories ===
+|                      || $GLOBAL                        || $SCRATCH               || $TMPDIR                 || $JOBLOCAL (experimental) ||
+| Recommended file size || large                         || large                  || small                   || small ||
+| Lifetime             || until file system failure      || job                    || job                     || job ||
+| Size                 || x00 TB (for all users)         || x00 TB (for all users) || a few GB (within memory) || about 5 GB (hard limit: 32GB) ||
+| Scaling              || troubles with very many small file accesses (from more than 100 nodes) || troubles with very many small file accesses (from more than 100 nodes) || no problem (local) || no problem (local) ||
+| Visibility           || global                         || node (see above)      || node                    || job ||
+| Recommended usage     || large files, available after job life || large files    || small files, or many seek-operations within a file || many small files (>1000), or many seek-operations within a file ||
+==== General recommendations ====
+To make sure that the MPI communication happens via the infiniband fabric, please use the following settings in your job-script and/or in your ''.bashrc'' file:
+<code>
+export I_MPI_DAT_LIBRARY=/usr/lib64/libdat2.so.2
+export OMP_NUM_THREADS=1
+export I_MPI_FABRICS=shm:dapl
+export I_MPI_FALLBACK=0
+export I_MPI_CPUINFO=proc
+export I_MPI_PIN_PROCESSOR_LIST=1,14,9,6,5,10,13,2,3,12,11,4,7,8,15,0
+export I_MPI_JOB_FAST_STARTUP=0
+</code>
+==== Recommendations for various codes ====
+  * [[vasp-vsc2|VASP]]
+  * [[antares|ANTARES]]
+  * [[wien2k|WIEN2k]]
+  * [[mpi-helium|MPI-Helium]]
+  * [[wrf|WRFV3]]
+  * [[gaussian09|Gaussian09]]
+  * [[sequential-codes|Sequential codes]]
+==== Recommendations for advanced users ====
+  * [[fft]] libraries
+  * [[large]] jobs with more than 1024 cores
+  * [[memory]] intensive jobs requiring more than 2 GB per core
+  * [[ScaLAPACK]] compile options
+  * [[nwchem-vsc2|NWChem]]
+  * [[blas|Linking to BLAS Libraries]]
+  * user defined [[prolog|prolog and epilog]] scripts
+==== Process pinning ====
+The NUMA memory of VSC-2 is highly depending on the positioning of processes to the four ''NUMA nodes'' on each compute node.
+Using Intel MPI the Parameter
+<code>export I_MPI_PIN_PROCESSOR_LIST=1,14,9,6,5,10,13,2,3,12,11,4,7,8,15,0
+</code>
+as mentioned above should always be used to pin (up to) 16 processes to the 16 cores.
+In the case of sequential jobs, we recommend to use 'taskset' or 'numactl', e.g.
+<code>
+taskset -c 0 our_example_code param1 param2 >out1 &
+taskset -c 8 our_example_code param1 param2 >out2 &
+wait
+</code>
+Performance gains of up to 200% were observed for synthetic benchmarks.
+Note also the examples for [[sequential-codes|sequential jobs]].
+==== Backup ====
+No backup on VSC.
+Backup is at the responsibility of each user.
+Data loss by hardware failure is prevented by using state-of-the-art technology like RAID-6.