====== Submitting batch jobs ======

===== Module environment =====

To set the environment variables needed for a specific application, the **module** environment may be used:
  * ''module avail''     lists the **available** application software, compilers, parallel environments, and libraries
  * ''module list''      shows the packages currently loaded in your session
  * ''module unload <xyz>'' unloads a particular package <xyz> from your session
  * ''module load <xyz>'' loads a particular package <xyz> into your session
  * ''module display <xyz>'' OR ''module show <xyz>'' shows module details such as the full path of the module file and all (or most) of the environment changes the modulefile will make if loaded
  * ''module purge'' unloads all loaded modulefiles
== Note: ==

  - The **<xyz>** format corresponds exactly to the output of ''module avail''. Thus, in order to load or unload a selected module, copy and paste the name exactly as listed by ''module avail''.
  - A list of ''module load/unload'' directives may also be included in the top part of a job submission script.

When all required/intended modules have been loaded, user packages may be compiled as usual.
===== SLURM (Simple Linux Utility for Resource Management) =====

In contrast to VSC-1 and VSC-2, which used SGE, the scheduler on VSC-3 is [[http://slurm.schedmd.com|SLURM]]. 
=== Basic SLURM commands: ===
  * ''[...]$ sinfo'' gives information on which 'queues'='partitions' are available for job submission. Note: what SGE calls a 'queue' is called a 'partition' in SLURM.
  * ''[...]$ scontrol'' is used to view the SLURM configuration including: job, job step, node, partition, reservation, and overall system configuration. Without a command entered on the execute line, scontrol operates in an interactive mode and prompts for input. With a command entered on the execute line, scontrol executes that command and terminates. 
  * ''[...]$ scontrol show job 567890'' shows information on the job with number 567890.
  * ''[...]$ scontrol show partition'' shows information on available partitions.
  * ''[...]$ squeue''    shows the current list of submitted jobs, their state and resource allocation. [[doku:slurm_job_reason_codes|Here]] is a description of the most important **job reason codes** returned by the squeue command.
==== Node configuration - hyperthreading ====

The compute nodes of VSC-3 are configured with the following parameters in SLURM:
<code>
CoresPerSocket=8
Sockets=2
ThreadsPerCore=2
</code>
This reflects the fact that <html> <font color=#cc3300> hyperthreading </font> </html> is activated on all compute nodes, so that <html> <font color=#cc3300> 32 cores </font> </html> (16 physical cores with 2 hardware threads each) may be utilized on each node. 
In the batch script hyperthreading is selected by adding the line
<code> 
#SBATCH --ntasks-per-core=2
</code>
which allows for 2 tasks per core.

Some codes may experience a performance gain from using all 32 virtual cores, e.g., GROMACS seems to profit. But note that using all virtual cores also leads to more communication and may degrade the performance of large MPI jobs.

**NOTE on accounting**: the project's core-h are always calculated as ''job_walltime * nnodes * 16'' (16 physical cores per node). SLURM's built-in tool ''sreport'' yields wrong accounting statistics because (depending on the job script) the multiplier is 32 instead of 16. You may instead use the accounting script introduced in this [[https://wiki.vsc.ac.at/doku.php?id=doku:slurm_sacct&#accounting_script|section]].
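
The accounting rule above can be checked with a small calculation. The following is only a minimal sketch: the function name ''core_hours'' and its two arguments (wall time in seconds, number of nodes) are assumptions made for illustration and are not part of any VSC or SLURM tool.

```shell
#!/bin/bash
# Core-hours charged to a project: walltime (in hours) * nodes * 16,
# where 16 is the number of physical cores per VSC-3 node.
# core_hours is a hypothetical helper name used for illustration only.
core_hours() {
    local walltime_seconds=$1
    local nnodes=$2
    awk -v s="$walltime_seconds" -v n="$nnodes" \
        'BEGIN { printf "%.2f\n", s / 3600 * n * 16 }'
}

# Example: a job that ran for 2 hours on 2 nodes
core_hours 7200 2
```

For the example call this prints ''64.00'', i.e. 2 hours on 2 nodes with 16 physical cores each, regardless of whether the job used hyperthreading.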

==== Node allocation policy ====
On VSC-3 (as on VSC-2)  <html> <font color=#cc3300> only complete compute nodes </font> </html>, i.e., integral multiples of 16 cores, can be allocated for user jobs. If you wish to run many single core jobs, they can be scheduled so that all 16 CPUs of one node are used; please see the [[doku:slurm&#scheduler_script_for_many_single_core_jobs|scheduler script for a series of single core jobs]].


===== Submit a batch job =====

==== Partition, quality of service and run time ====

Depending on the demands of a certain application, the  
[[doku:vsc3_queue|partition (grouping hardware according to its type) and 
quality of service (QOS; defining the run time etc.)]] can be selected.
Additionally, the run time of a job can be limited in the job script to a value lower than the run time limit of the selected QOS. This allows for a process called [[doku:vsc3_queue#backfilling|backfilling]], possibly leading to a <html><font color=#cc3300>shorter waiting time</font></html> in the queue.
==== The job submission script ====

It is recommended to write the job script using a [[doku:win2vsc&#the_job_filetext_editors_on_the_cluster|text editor]] on the VSC //Linux// cluster. 
Editors on //Windows// may add invisible characters (such as carriage returns) to the job file, which render it unreadable so that it is not executed. 

Assume a submission script ''check.slrm''
<code>
#!/bin/bash
#
#SBATCH -J chk
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
#SBATCH --ntasks-per-core=1
#SBATCH --mail-type=BEGIN    # first have to state the type of event to occur 
#SBATCH --mail-user=<email@address.at>   # and then your email address

# launch the program with srun:
srun -l -N2 -n32 a.out
# or, alternatively, with mpirun:
# mpirun -np 32 a.out
</code>
  * **-J**     job name
  * **-N**     number of nodes requested (16 cores per node available)
  * **-n, --ntasks=<number>** number of tasks to run
  * **--ntasks-per-node**     number of processes run in parallel on a single node
  * **--ntasks-per-core**     number of tasks a single core should work on
  * **srun** is an alternative command to **mpirun**. It provides direct access to SLURM's built-in variables and settings.
  * **-l** adds task-specific labels to the beginning of all output lines.
  * **--mail-type** sends an email at specific events. The SLURM documentation lists the following valid mail-type values: "BEGIN, END, FAIL, REQUEUE, ALL (equivalent to BEGIN, END, FAIL and REQUEUE), TIME_LIMIT, TIME_LIMIT_90 (reached 90 percent of time limit), TIME_LIMIT_80 (reached 80 percent of time limit), and TIME_LIMIT_50 (reached 50 percent of time limit). Multiple type values may be specified in a comma separated list." ([[http://slurm.schedmd.com|cited from the SLURM documentation]])
  * **--mail-user** email address to which the notifications are sent

In order to send the job to specific queues, see [[doku:vsc3_queue|Queue/Partition setup on VSC-3]].
==== Job submission ====
 
<code>
[username@l31 ~]$ sbatch check.slrm    # to submit the job             
[username@l31 ~]$ squeue -u `whoami`   # to check the status  of own jobs
[username@l31 ~]$ scancel  JOBID       # for premature removal, where JOBID
                                       # is obtained from the previous command   
</code>


==== A word on srun and mpirun: ====
Currently (27th March 2015), **srun** only works when the application uses **Intel MPI** and is compiled with the **Intel compiler**. We will provide compatible versions of MVAPICH2 and OpenMPI in the near future.
At the moment, it is recommended to use **mpirun** with MVAPICH2 and OpenMPI.


==== Hybrid MPI/OMP: ====

SLURM Script:
<code>
#SBATCH -N 3 
#SBATCH --ntasks-per-node=2
#SBATCH -c 8

export OMP_NUM_THREADS=8
srun myhybridcode.exe
</code>

**mpirun** pins processes to cores. 
At least in the case of pure MPI processes (without any threads), the best performance has been observed with our default pinning (pinning to the physical CPUs 0, 1, ..., 15).
If you need to use hybrid MPI/OpenMP, you may have to override our default pinning by unsetting ''I_MPI_PIN_PROCESSOR_LIST'' and then exporting the one list that matches your number of processes per node:
<code>
unset I_MPI_PIN_PROCESSOR_LIST
# then enable exactly ONE of the following lines:
export I_MPI_PIN_PROCESSOR_LIST=0,8    # configuration for 2 processes / node
#export I_MPI_PIN_PROCESSOR_LIST=0,4,8,12    #              4 processes / node
#export I_MPI_PIN_PROCESSOR_LIST=0,2,4,6,8,10,12,14    #    8 processes / node
</code>
or use the shell script:
<code>
if [ $PROC_PER_NODE -gt 1 ]
then
	unset I_MPI_PIN_PROCESSOR_LIST
	if [ $PROC_PER_NODE -eq 2 ]
	then
		export I_MPI_PIN_PROCESSOR_LIST=0,8    # configuration for 2 processes / node
	elif [ $PROC_PER_NODE -eq 4 ]
	then
		export I_MPI_PIN_PROCESSOR_LIST=0,4,8,12    #              4 processes / node
	elif [ $PROC_PER_NODE -eq 8 ]
	then
		export I_MPI_PIN_PROCESSOR_LIST=0,2,4,6,8,10,12,14    #    8 processes / node
	else
		export I_MPI_PIN=disable
	fi
fi
</code>
See also the [[https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/Environment_Variables_Process_Pinning.htm|Intel Environment Variables]].
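
The three pinning lists above follow one pattern: the 16 physical cores of a node are divided evenly among the processes, and each process is pinned to the first core of its share. The following minimal sketch reproduces that pattern. It assumes 16 physical cores per node and a process count that divides 16; the helper name ''pin_list'' is hypothetical and not provided on the cluster.

```shell
#!/bin/bash
# pin_list PROCS_PER_NODE
# Prints a comma separated core list with one evenly spaced core per process.
# Assumes 16 physical cores per node (hypothetical helper, illustration only).
pin_list() {
    local procs_per_node=$1
    local cores_per_node=16
    if (( procs_per_node < 1 )); then
        echo "pin_list: need at least 1 process per node" >&2
        return 1
    fi
    if (( cores_per_node % procs_per_node != 0 )); then
        echo "pin_list: $procs_per_node does not divide $cores_per_node" >&2
        return 1
    fi
    local stride=$(( cores_per_node / procs_per_node ))
    local list="" core
    for (( core = 0; core < cores_per_node; core += stride )); do
        list+="${list:+,}${core}"      # prepend a comma except for the first entry
    done
    echo "$list"
}

pin_list 4
```

In a job script the result could be used as ''export I_MPI_PIN_PROCESSOR_LIST=$(pin_list 4)'', which yields the same list as the hand-written line for 4 processes per node.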


==== Job chain ====

This example uses a set of 4 nodes to compute a series of jobs in two stages, each stage split into two separate subjobs. \\

vi check.slrm\\ 
<code>
#!/bin/bash
#
#SBATCH -J chk
#SBATCH -N 4
#SBATCH --ntasks-per-node=16
#SBATCH --ntasks-per-core=1


scontrol show hostnames $SLURM_NODELIST  > ./nodelist

srun -l -N2 -r0 -n32 job1.scrpt &
srun -l -N2 -r2 -n32 job2.scrpt &
wait

srun -l -N2 -r2 -n32 job3.scrpt &
srun -l -N2 -r0 -n32 job4.scrpt &
wait

</code>
== Note: ==
  - The file 'nodelist' is written for information only.
  - It is important to send the jobs into the background (&) and to insert a 'wait' at each synchronization point.
  - **-r** defines an offset in the node list: **-r2** selects nodes number 2 and 3 from the set of four (the list starts with node number 0). A combination of -N, -r and -n therefore allows full control over all involved cores and the tasks they are used for.

==== Scheduler script for many single core jobs ====

Given a large number of single core jobs, the aim is to schedule them so that the core hours consumed are minimized. The following script sends 64 tasks to a single node. The first 16 tasks are distributed to the 16 cores. Whenever a task terminates, the next task is started.


<code>
#!/bin/bash
#SBATCH -J TEST              ## name
#SBATCH -N 1                 ## 1 node
#SBATCH --ntasks=16          ## number of tasks per node
#SBATCH --time=08:00:00      

tasks_to_be_done=64          ## total number of tasks to be scheduled:
                             ## set this to your preferred number
                             
max_tasks=16                 ## number of tasks per node.

current_task=0               ## initialization
running_tasks=0              ## initialization

while (($current_task < $tasks_to_be_done))
do

   ## ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ##
   ## "test" MUST BE REPLACED by your application ~ * ~ * ##
   ## counts the number of tasks currently running  * ~ * ##
   ## ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ##
   running_tasks=`ps -C test --no-headers | wc -l`

   while (($running_tasks < $max_tasks && ${current_task} < ${tasks_to_be_done}))
   do
      ((current_task++))
      
      ## ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ ##
      ## run application (here named 'test_--.sh'): replace "test" accordingly ##
      ## ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ * ~ ##
      ./test_${current_task}.sh -${current_task} & ## run executables test_0, test_1, ...

      running_tasks=`ps -C test --no-headers | wc -l`
   done
   sleep 1   ## avoid busy-waiting while all slots are occupied
done
wait
</code>

In this example, the test_--.sh scripts had the following form:
<code>
#!/bin/bash
sleep 10
date
hostname
ps | grep test|grep -v grep | wc -l
</code>
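
The throttling done by the scheduler script above can also be written without polling ''ps''. The following is a minimal sketch of that idea, not the method used on VSC-3: it assumes bash 4.3 or later (for ''wait -n''), and the names ''run_throttled'' and ''demo_task'' are hypothetical.

```shell
#!/bin/bash
# run_throttled MAX CMD ARG...
# Starts "CMD ARG" once per ARG in the background, with at most MAX
# running at any time. Requires bash >= 4.3 for "wait -n".
run_throttled() {
    local max_tasks=$1 cmd=$2
    shift 2
    local running=0 arg
    for arg in "$@"; do
        if (( running >= max_tasks )); then
            wait -n                      # block until any one task finishes
            running=$(( running - 1 ))
        fi
        "$cmd" "$arg" &
        running=$(( running + 1 ))
    done
    wait                                 # wait for the remaining tasks
}

# Stand-in for a real single core application (hypothetical).
demo_task() {
    sleep 0.1
    echo "task $1 done"
}

# 64 tasks, at most 16 at a time, as in the script above
run_throttled 16 demo_task $(seq 1 64)
```

Because the shell itself reports when a background task ends, this variant does not depend on the name of the executable and does not need a sleep interval.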


Another example for single core jobs with the parallelization in the innermost loop:
<code>
#!/bin/bash
#
#SBATCH -J testjob
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --ntasks-per-core=1
##SBATCH --mail-type=BEGIN    # first have to state the type of event to occur 
##SBATCH --mail-user=user@example.com   # and then your email address
#SBATCH --time=60 # to use backfilling of the slurm queues

export max_num_tasks=16
export executable=dt24
#
dk=100
for n in {1027..20000..231}; do
  echo
  for k in $((n)) $((n-1)) $((n/2)) $((n/4)) $((n/10)) ; do
    for rdb in $(eval echo 1 {8..16..4}); do
            for blocksize in $(eval echo {64..$n..32}); do
                  ./$executable $n $k $rdb $blocksize $dk                               >dt2.${n}.${k}.${rdb}.${blocksize} 2>&1 &
                  ############################## Here starts the parallelization
                  while (("$(ps -C $executable --no-headers | wc -l)" == $max_num_tasks )); do
                    sleep 10   # replace by appropriate value
                  done
                  ############################## Here ends the parallelization
            done
    done
  done
done
wait # wait for final jobs to finish
</code>
(SLURM's built-in job array option //#SBATCH --array starting_value-end_value:stepwidth// does not provide this functionality, since each node is occupied exclusively by one job.)

[[doku:multimpi|Here are examples of how to run multiple MPI jobs on a single node]]

===== Generating a host machines file =====

In general, a host file is not needed. However, there may be situations where a host file is necessary, for example for [[doku:mathematica|Mathematica Batch Jobs]]. A host machines file may be generated with the following script:
<code>
#!/bin/bash
#
#SBATCH -J par                      # job name
#SBATCH -N 2                        # number of nodes=2
#SBATCH --ntasks-per-node=16        # uses all cpus of one node      
#SBATCH --ntasks-per-core=1
#SBATCH --threads-per-core=1

module load Mathematica/10.0.2 # load desired version

scontrol show hostnames $SLURM_NODELIST > ./nodelist

rm -f machines_tmp        # remove a stale file from a previous run, if any

tasks_per_node=16         # change number accordingly
nodes=2                   # change number accordingly
for ((line=1; line<=nodes; line++))
do
    for ((tN=1; tN<=tasks_per_node; tN++))
    do
        head -$line nodelist | tail -1 >> machines_tmp
    done
done

echo "this is the content of the host machines file"
cat machines_tmp
</code>
Note that ''tasks_per_node'' and ''nodes'' must be changed to match the values requested with ''--ntasks-per-node'' and ''-N''.
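
The nested loop in the script above repeats every host name once per task. A minimal sketch of the same step as a filter is given below. The function name ''make_machinefile'' is hypothetical, and the host names in the demonstration are made up.

```shell
#!/bin/bash
# make_machinefile TASKS_PER_NODE
# Reads one host name per line on stdin and prints each name
# TASKS_PER_NODE times (hypothetical helper, illustration only).
make_machinefile() {
    local tasks_per_node=$1
    awk -v n="$tasks_per_node" '{ for (i = 0; i < n; i++) print }'
}

# Inside a job one might write (SLURM sets SLURM_NTASKS_PER_NODE only
# when --ntasks-per-node is given):
#   scontrol show hostnames "$SLURM_JOB_NODELIST" \
#       | make_machinefile "$SLURM_NTASKS_PER_NODE" > machines_tmp

# Stand-alone demonstration with made-up host names:
printf 'n01\nn02\n' | make_machinefile 2
```

Taking the values from the SLURM environment variables avoids having to keep ''tasks_per_node'' and ''nodes'' in sync with the ''#SBATCH'' lines by hand.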

==== Restarting Failed Jobs ====

Slurm is __no longer__ configured to automatically requeue jobs which were aborted due to node failures. If you want such jobs to be requeued automatically, request it with the following option in your job script:
<code>#SBATCH --requeue</code>
==== Job Arrays ====

<code>
#!/bin/bash
#SBATCH -J array_job
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#run all tasks from index 1 to 1000 with stepwidth 3, i.e. 1,4,7, ...
#SBATCH --array=1-1000:3

#useful variables:
#$SLURM_ARRAY_JOB_ID
#$SLURM_ARRAY_TASK_ID
#$SLURM_JOB_NODELIST 

./start_my_task $SLURM_ARRAY_TASK_ID

</code>
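
To see which task indices a given ''--array'' specification produces, the range can be expanded with ''seq''. This is only an illustrative sketch; the helper name ''array_indices'' is hypothetical.

```shell
#!/bin/bash
# array_indices START END STEP
# Prints the task indices that "--array=START-END:STEP" would generate
# (hypothetical helper, illustration only).
array_indices() {
    seq "$1" "$3" "$2"
}

array_indices 1 1000 3 | head -n 3     # the first three indices
array_indices 1 1000 3 | wc -l         # the number of array tasks
```

For the specification used above this gives the indices 1, 4, 7, ... up to 1000, i.e. 334 array tasks. Since nodes are allocated exclusively, each of these tasks occupies a complete node.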
==== Job Dependency ====
Dependent jobs can be used to create a job chain that is executed one job after another. 


  - Submit the first job and note its <job_id>
  - Submit the dependent job (and note its <job_id> in turn):<code>
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 2
#SBATCH -d afterany:<job_id>

srun  ./my_program
</code>
  - Continue at step 2 for further dependent jobs
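
The steps above can be automated on the login node. The following is a minimal sketch under stated assumptions: it relies on ''sbatch'' reporting a successful submission as ''Submitted batch job <job_id>'', and the function names ''extract_job_id'' and ''submit_chain'' as well as the script file names are hypothetical.

```shell
#!/bin/bash
# extract_job_id LINE
# sbatch reports a successful submission as "Submitted batch job <id>";
# this helper prints the <id> part (the last word of the line).
extract_job_id() {
    awk '{ print $NF }' <<< "$1"
}

# submit_chain SCRIPT...
# Submits the scripts one after another, each depending on the previous one.
submit_chain() {
    local previous="" script output
    for script in "$@"; do
        if [ -z "$previous" ]; then
            output=$(sbatch "$script") || return 1
        else
            output=$(sbatch -d "afterany:$previous" "$script") || return 1
        fi
        previous=$(extract_job_id "$output")
        echo "$script submitted as job $previous"
    done
}

# Usage on a login node (file names are placeholders):
#   submit_chain stage1.slrm stage2.slrm stage3.slrm
```

With this approach the ''-d afterany:<job_id>'' option is passed on the command line, so the job scripts themselves do not need to contain a hard-coded job id.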

===== Licenses =====

Software that uses a license server has to be specified at job submission. The licensed software available to your user can be listed with the command:

<code>
slic
</code>

Within the job script add the flags as shown with 'slic', e.g. for using both Matlab and Mathematica:

<code>
#SBATCH -L matlab@vsc,mathematica@vsc
</code>

===== Prolog Error Codes =====

<code>
ERROR_MEMORY=200
ERROR_INFINIBAND_HW=201
ERROR_INFINIBAND_SW=202
ERROR_IPOIB=203
ERROR_BEEGFS_SERVICE=204
ERROR_BEEGFS_USER=205
ERROR_BEEGFS_SCRATCH=206
ERROR_NFS=207

ERROR_USER_GROUP=220
ERROR_USER_HOME=221

ERROR_GPFS_START=228
ERROR_GPFS_MOUNT=229
ERROR_GPFS_UNMOUNT=230

</code>