====== Documentation for VSC-2 ======

==== Quick Start ====

  - Log in to your university's designated login server via SSH: <code>
# Uni Wien
ssh <username>@vsc2.univie.ac.at
# TU Wien
ssh <username>@vsc2.tuwien.ac.at
# Boku Wien
ssh <username>@vsc2.boku.ac.at
</code>
  - Transfer your programs and data/input files to your home directory.
  - (Re-)Compile your application. Please choose your MPI environment as described in [[doku:mpi|MPI Environment]].
  - Write a job script for your application: <code>
#$ -N <job_name>
#$ -pe mpich <slots>
#$ -V

mpirun -machinefile $TMPDIR/machines -np $NSLOTS <executable>
</code> Here ''<job_name>'' is a freely chosen descriptive name, ''<slots>'' is the number of processor cores that you want to use for the calculation, and the **order of the options** ''-machinefile'' and ''-np'' is essential! On VSC-2 only exclusive reservations of compute nodes are available; each compute node provides 16 slots. Substitute the path to your MPI-enabled application for ''<executable>'' and you are ready to run. \\ **NOTE**: when the option ''#$ -V'' is specified, the following error message will be generated in one of the output files of the grid engine: <code>
/bin/sh: module: line 1: syntax error: unexpected end of file
/bin/sh: error importing function definition for `module'
bash: module: line 1: syntax error: unexpected end of file
bash: error importing function definition for `module'
</code> This message is due to a known bug in the grid engine, which cannot handle functions defined in the user environment, and can be safely ignored. You can avoid it by exporting only particular environment variables in your job script, e.g.: <code>
#$ -v PATH
#$ -v LD_LIBRARY_PATH
</code> To receive e-mail notifications concerning job events (b .. beginning, e .. end, a .. abort or reschedule, s .. suspend), use these lines in your job script: <code>
#$ -M <email address>
#$ -m beas   # all job events sent via email
</code> It is often advisable to also specify the job's run time as <code>
#$ -l h_rt=hh:mm:ss
</code> in particular when you know that your job will run for only a few hours or even minutes. That way the scheduler can "backfill" the queue, avoiding very long waiting times which may be caused by a highly parallel job waiting for free resources. \\ Here is an example job script, requesting 32 processor cores (2 nodes), which will run for a maximum of 3 hours and sends e-mails at the beginning and at the end of the job: <code>
#$ -N hitchhiker
#$ -pe mpich 32
#$ -V
#$ -M my.name@example.com
#$ -m be
#$ -l h_rt=03:00:00

mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./myjob
</code>
  - Submit your job: <code>
qsub <job_file>
</code> where ''<job_file>'' is the name of the file you just created.
  - Check if and where your job has been scheduled: <code>
qstat
</code>
  - Inspect the job output. Assuming your job was assigned the id "42" and your job's name was "hitchhiker", you should be able to find the following files in the directory you started it from: <code>
$ ls -l
hitchhiker.o42
hitchhiker.e42
hitchhiker.po42
hitchhiker.pe42
</code> In this example, hitchhiker.o42 contains the output of your job and hitchhiker.e42 contains possible error messages. In hitchhiker.po42 and hitchhiker.pe42 you might find additional information related to the parallel computing environment.
  - Delete jobs: <code>
$ qdel <job_id>
</code>

==== Queues ====

=== Standard Queue (all.q) ===

The majority of jobs use the standard queue ''all.q'' with a maximum run time of 3 days (72 hours).
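For illustration, a minimal job script for ''all.q'' that requests the full 72-hour run time could look like the following sketch; the job name, core count, and executable are placeholders rather than prescribed values: <code>
#$ -N allq_example            # freely chosen job name (placeholder)
#$ -pe mpich 32               # 32 slots = 2 complete nodes
#$ -l h_rt=72:00:00           # maximum run time allowed in all.q
#$ -v PATH
#$ -v LD_LIBRARY_PATH

mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./my_application
</code> Submit it with ''qsub'' as shown in the Quick Start; specifying a shorter ''h_rt'' where possible makes the job eligible for backfilling.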
== Node types ==

All nodes are configured equivalently; only a few nodes have different amounts of memory. VSC-2 has
  * 1334 nodes with 32 GB main memory,
  * 8 nodes with 64 GB main memory,
  * 8 nodes with 128 GB main memory,
  * 2 nodes with 256 GB main memory and 64 cores (not in ''all.q'', but in ''highmem.q'').
By default, jobs are scheduled to nodes with at least 27 GB of free memory. To override this default on the command line or in the job script, you may specify:
  * to allow scheduling on the 32 GB, 64 GB and 128 GB nodes:
    * this is the default,
  * to allow scheduling on the 64 GB and 128 GB nodes:
    * ''-l mem_free=50G'' (on the command line) or
    * ''#$ -l mem_free=50G'' (in the job script),
  * to allow scheduling on 128 GB nodes only:
    * ''-l mem_free=100G'' (command line) or
    * ''#$ -l mem_free=100G'' (job script),
  * to allow scheduling on 256 GB nodes only:
    * see below: High Memory Queue.
In order to avoid that jobs with low memory requirements occupy nodes with 64 or 128 GB, priority adjustments are made in the queue.

=== Long Queue ===

A queue in which jobs are allowed to run for a maximum of 7 days is available on VSC-2. The limit on the number of slots per job is 128, and the maximum number of allocatable slots per user at one time is 768. A total of 4096 slots are available for long jobs. All nodes of this queue have 32 GB main memory. Use this queue by specifying it explicitly in your job script: <code>
#$ -q long.q
</code> or submit your job with <code>
qsub -q long.q <job_file>
</code>

=== High Memory Queue ===

Due to higher memory requests from some users, two nodes with 256 GB memory and 64 cores are available in the queue ''highmem.q''. Each of these nodes has four AMD Opteron 6274 processors with 2.2 GHz and 16 cores each. These nodes show a sustained performance of about 400 GFlop/s, which compares to about four standard nodes of VSC-2. Due to the special memory requirements of jobs in this queue, jobs are granted exclusive access, and 64 slots are accounted for even if the job does not make efficient use of all 64 cores. Make sure to adapt your job script to pin processes to cores, <code>
export I_MPI_PIN_PROCESSOR_LIST=0-63
</code> if applicable. The run time limit is 3 days (72 hours). Programs which work in ''all.q'' and ''long.q'' run without modifications on these nodes, too. The Intel compilers and Intel MPI show good behaviour in the ''highmem.q'' queue. Please use these nodes only for jobs with memory requirements of more than 64 GB!
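Putting these pieces together, a job script for the high memory queue might look like the following sketch. The job name and executable are placeholders, and requesting the queue with ''-q highmem.q'' and all 64 slots with ''-pe mpich 64'' is an assumption based on the pattern of the other queues, not an official template: <code>
#$ -N highmem_example          # freely chosen job name (placeholder)
#$ -q highmem.q                # assumed queue request for the 256 GB / 64-core nodes
#$ -pe mpich 64                # the node is assigned exclusively; 64 slots are accounted for
#$ -l h_rt=72:00:00            # run time limit of highmem.q
#$ -v PATH
#$ -v LD_LIBRARY_PATH

# pin the (Intel) MPI processes to the 64 cores, as recommended above
export I_MPI_PIN_PROCESSOR_LIST=0-63

mpirun -machinefile $TMPDIR/machines -np $NSLOTS ./my_memory_hungry_application
</code>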
==== MPI Version ====

On VSC-2 several versions of MPI are available. Choose one using ''mpi-selector'' or ''mpi-selector-menu'': <code>
# list available MPI versions:
$ mpi-selector --list
impi_intel-4.1.0.024
impi_intel-4.1.1.036
intel_mpi_intel64-4.0.3.008
mvapich2_1.8_intel_limic
mvapich2_gcc-1.9a2
mvapich2_intel
openmpi-1.5.4_gcc
openmpi-1.5.4_intel
openmpi_gcc-1.6.4

# see the currently used MPI version:
$ mpi-selector --query
default:impi_intel-4.1.0.024
level:user

# set the MPI version:
$ mpi-selector --set impi_intel-4.1.0.024
</code> Modifications will be active after logging in again.

==== Scratch Directories ====

In addition to $HOME, which is fine to use for standard jobs with rather few small files (<1000 files, overall size <1 GB), there are a number of specialized scratch directories. The [[http://www.fhgfs.com/cms/documentation|Fraunhofer parallel cluster file system (FhGFS)]] is used in $GLOBAL and $SCRATCH.

=== Global Personal Scratch Directories $GLOBAL ===

Please use the environment variable ''$GLOBAL'' to access your personal scratch space. Access is available from the compute and login nodes. The variable expands e.g. as: <code>
$ echo $GLOBAL
/global/lv70999/username
</code> The directory is writeable by the user and readable by the group members. It is advisable to use these directories in particular for jobs with heavy I/O operations. In addition, this reduces the load on the file server holding the $HOME directories. The Fraunhofer parallel file system is shared by all users and all nodes. Single jobs producing heavy load (>>1000 requests per second) have been observed to reduce responsiveness for all jobs and all users.

=== Per-node Scratch Directories $SCRATCH ===

Local scratch directories on each node are provided as a link to the Fraunhofer parallel file system and can thus also be viewed via the login nodes as ''/fhgfs/rXXnXX/''. The parallel file system (and thus the performance) is identical between $SCRATCH and $GLOBAL. The variable ''$SCRATCH'' expands as: <code>
$ echo $SCRATCH
/scratch
</code> These directories are purged after job execution.

=== Local temporary ram disk $TMPDIR ===

For smaller files and very fast access, restricted to single nodes, the variables ''$TMP'' or ''$TMPDIR'' may be used; both expand to the same path: <code>
$ echo $TMP -- $TMPDIR
/tmp/123456.789.queue.q -- /tmp/123456.789.queue.q
</code> These directories are purged after job execution. Please refrain from writing directly to ''/tmp''!

=== Joblocal scratch directory $JOBLOCAL ===

The newest, still experimental, scratch file system $JOBLOCAL is a common temporary storage within a user job. The ''joblocal'' file system may be requested with <code>
-v JOBLOCAL_FILESYSTEM=TRUE
</code> All nodes within a job access the same files under ''/joblocal'', which is purged after job execution. This method scales very well up to several hundred similar jobs. Although the file system has 32 GB, it is recommended to use only a few GB. To save files at the end of the job, use, e.g., <code>
cd /joblocal; tar czf ${HOME}/joblocal_${JOB_NAME}_${JOB_ID}.tgz myfiles
</code> in your [[prolog|user epilog]] script. If there are many files (>>1000), please refrain from plainly copying them to $HOME or $GLOBAL at the end of the job.

Implementation details: ''$JOBLOCAL'' is implemented via the [[http://en.wikipedia.org/wiki/SCSI_RDMA_Protocol|SCSI RDMA Protocol (SRP)]] and [[http://en.wikipedia.org/wiki/Network_File_System|NFS]]. Very high performance for small files is achieved by extensive caching on the job's master node, which acts as a (job-internal) NFS server.
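As an illustration, a job using the job-local file system could be sketched as follows. The job name, slot count, and executable are placeholders; only the ''-v JOBLOCAL_FILESYSTEM=TRUE'' request and the tar command are taken from the description above: <code>
#$ -N joblocal_example             # freely chosen job name (placeholder)
#$ -pe mpich 16                    # one complete node (16 slots)
#$ -v JOBLOCAL_FILESYSTEM=TRUE     # request the experimental /joblocal file system
#$ -v PATH
#$ -v LD_LIBRARY_PATH

# change into /joblocal, which is shared by all nodes of this job
cd /joblocal

# run the application ("my_application" is a placeholder); it is assumed
# to write its working files into ./myfiles under /joblocal
mpirun -machinefile $TMPDIR/machines -np $NSLOTS $HOME/my_application

# pack the results into a single archive in $HOME; alternatively, place
# this line in a user epilog script as recommended above
tar czf ${HOME}/joblocal_${JOB_NAME}_${JOB_ID}.tgz myfiles
</code>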
=== Comparison of scratch directories ===

^                       ^ $GLOBAL ^ $SCRATCH ^ $TMPDIR ^ $JOBLOCAL (experimental) ^
| Recommended file size | large | large | small | small |
| Lifetime              | until file system failure | job | job | job |
| Size                  | x00 TB (for all users) | x00 TB (for all users) | a few GB (within memory) | about 5 GB (hard limit: 32 GB) |
| Scaling               | troubles with very many small file accesses (from more than 100 nodes) | troubles with very many small file accesses (from more than 100 nodes) | no problem (local) | no problem (local) |
| Visibility            | global | node (see above) | node | job |
| Recommended usage     | large files, available after job life | large files | small files, or many seek operations within a file | many small files (>1000), or many seek operations within a file |

==== General recommendations ====

To make sure that the MPI communication happens via the InfiniBand fabric, please use the following settings in your job script and/or in your ''.bashrc'' file: <code>
export I_MPI_DAT_LIBRARY=/usr/lib64/libdat2.so.2
export OMP_NUM_THREADS=1
export I_MPI_FABRICS=shm:dapl
export I_MPI_FALLBACK=0
export I_MPI_CPUINFO=proc
export I_MPI_PIN_PROCESSOR_LIST=1,14,9,6,5,10,13,2,3,12,11,4,7,8,15,0
export I_MPI_JOB_FAST_STARTUP=0
</code>

==== Recommendations for various codes ====

  * [[vasp-vsc2|VASP]]
  * [[antares|ANTARES]]
  * [[wien2k|WIEN2k]]
  * [[mpi-helium|MPI-Helium]]
  * [[wrf|WRFV3]]
  * [[gaussian09|Gaussian09]]
  * [[sequential-codes|Sequential codes]]

==== Recommendations for advanced users ====

  * [[fft]] libraries
  * [[large]] jobs with more than 1024 cores
  * [[memory]] intensive jobs requiring more than 2 GB per core
  * [[ScaLAPACK]] compile options
  * [[nwchem-vsc2|NWChem]]
  * [[blas|Linking to BLAS Libraries]]
  * user defined [[prolog|prolog and epilog]] scripts

==== Process pinning ====

Performance on the NUMA memory architecture of VSC-2 depends strongly on how processes are placed on the four ''NUMA nodes'' of each compute node. With Intel MPI, the parameter <code>
export I_MPI_PIN_PROCESSOR_LIST=1,14,9,6,5,10,13,2,3,12,11,4,7,8,15,0
</code> mentioned above should always be used to pin (up to) 16 processes to the 16 cores. In the case of sequential jobs, we recommend using ''taskset'' or ''numactl'', e.g. <code>
taskset -c 0 our_example_code param1 param2 >out1 &
taskset -c 8 our_example_code param1 param2 >out2 &
wait
</code> Performance gains of up to 200% were observed in synthetic benchmarks. Note also the examples for [[sequential-codes|sequential jobs]].

==== Backup ====

There is no backup on VSC. Backups are the responsibility of each user. Data loss by hardware failure is prevented by using state-of-the-art technology such as RAID-6.
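Since there is no backup, it is advisable to copy important results to storage at your home institution from time to time, for example with ''rsync'' over SSH. In the following sketch the remote user name, host, and paths are placeholders: <code>
# copy new and changed files of a project directory from the VSC-2
# home directory to a machine at your home institution
# (user name, host and target path are placeholders)
rsync -av ${HOME}/my_project/ username@storage.example.com:/backup/my_project/
</code>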