==== Specifying the maximum runtime (means quicker job execution) / resource reservation ====
=== The problem ===
The queueing system we use on the VSC enables for the **reservation of slots** to jobs that need many of them. To make sure that these jobs are at all able to run on the cluster (not blocked by the many smaller ones), we **automatically** turn on this feature **for any jobs that run with 32 or more slots**. Unfortunately, this will "waste" a certain amount of resources. In the meantime, there will be some **empty nodes which run no calculation**.
=== The solution (quicker scheduling available here) ===
But fear not! We will tell you how to make use of these idle resources while they are waiting for the large job to be scheduled!
This is called **backfilling** and is done so:
#### add this to the top of your job submission script:
#$ -l h_rt=01:00:00 # in this example, ONE HOUR (01:00:00) is the job's maximum run time
Here, h_rt means "hard runtime". The job will be permitted to run __at most__ this long. Please add a little **buffer of about 10-50%** to the runtime you've estimated/measured, so the job doesn't get killed before its result is available! For example, if your last job came in at 03:57:03 and you think that the next one will be pretty similar in nature, try to specify "-l h_rt=06:00:00".
We hope that you are able and **willing to utilize this feature** as it would **save a lot of electricity** and greatly **reduce the time your jobs spend in the queue**!\\
Please put our precious computing time to use as best as you can.
====== Job chains ======
Job chains are sets of consecutive **interdependent** jobs.
There are several ways how to create job chains and we will discuss three different solutions below:
=== Solution I ====
Create one single job script and list several different jobs inside it. E.g.:
user@l01 $ cat allInOne.sge
#$ -N allInOne
./doJob1
./doJob2
./doJob3
user@l01 $ qsub allInOne.sge
Your job 10411 ("allInOne") has been submitted.
=== Solution II ===
Job chains, using -hold jid .
user@l01 $ cat holdJob.sge
#$ -N holdJob
./doJob1
user@l01 $ cat Job2.sge
#$ -N Job2
#$ -hold_jid holdJob
./doJob2
...
user@l01 $ qsub holdJob.sge
Your job 10451 ("holdJob") has been submitted.
user@l01 $ qsub Job2.sge
Your job 10452 ("Job2") has been submitted.
user@l01 $ qsub Job3.sge
Your job 10453 ("Job3") has been submitted.
'Job2' waits and starts only after 'holdJob' has finished. If you have another 'Job3' it waits until 'Job2' has finished and so on ... .
*Advantage: accumulates ”waiting time”
*Note: -hold_jid can only be used to reference jobs of the same user (-hold_jid can be used to reference any job)
===== Job arrays =====
Job arrays are sets of similar but **independent** jobs. Submit sets of similar and independent “tasks”:
*''qsub -t 1-500:1 example_3.sge'' submits 500 instances of the same script
*each instance (“task”) is executed independently
*all instances subsumed with a single job ID
*variable ''$SGE_TASK_ID'' discriminates between instances
*task numbering scheme: ''-t -:''
*related: ''$SGE_TASK_FIRST,$SGE_TASK_LAST,$SGE_TASK_STEPSIZE''
Example:
#$ -cwd
#$ -N blastArray
#$ -t 1-500:1
QUERY=query_${SGE_TASK_ID}.fa
OUTPUT=blastout_${SGE_TASK_ID}.txt
echo "processing query $QUERY ..."
blastall -p blastn -d nt -i $QUERY -o $OUTPUT
echo "...done"
user@l01 $ qsub example_3.sge
Your job 10420.1-500:1 ("blastArray") has been submitted.
user@l01 $ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
10420 0.56000 blastArray mjr r 02/13/2007 15:05:56 all.q@r10n01 1 198
10420 0.56000 blastArray mjr r 02/13/2007 15:05:56 all.q@r10n01 1 199
10420 0.56000 blastArray mjr r 02/13/2007 15:07:11 all.q@r10n01 1 202
10420 0.56000 blastArray mjr r 02/13/2007 15:07:11 all.q@r10n01 1 203
10420 0.56000 blastArray mjr r 02/13/2007 15:05:41 all.q@r10n01 1 196
10420 0.56000 blastArray mjr r 02/13/2007 15:05:41 all.q@r10n01 1 197
10420 0.55241 blastArray mjr r 02/13/2007 15:08:41 all.q@r10n01 1 208
10420 0.55241 blastArray mjr r 02/13/2007 15:08:41 all.q@r10n01 1 209
10420 0.56000 blastArray mjr r 02/13/2007 15:08:11 all.q@r10n02 1 204
10420 0.56000 blastArray mjr r 02/13/2007 15:08:11 all.q@r10n02 1 206
10420 0.56000 blastArray mjr r 02/13/2007 15:02:11 all.q@r10n02 1 176
10420 0.56000 blastArray mjr r 02/13/2007 15:02:11 all.q@r10n02 1 177
10420 0.56000 blastArray mjr r 02/13/2007 15:03:26 all.q@r10n02 1 182
10420 0.56000 blastArray mjr r 02/13/2007 15:03:26 all.q@r10n02 1 183
10420 0.56000 blastArray mjr r 02/13/2007 15:07:11 all.q@r10n02 1 200
10420 0.56000 blastArray mjr r 02/13/2007 15:07:11 all.q@r10n02 1 201
10420 0.56000 blastArray mjr r 02/13/2007 15:05:11 all.q@r10n03 1 193
10420 0.56000 blastArray mjr r 02/13/2007 15:05:11 all.q@r10n03 1 194
10420 0.56000 blastArray mjr r 02/13/2007 15:04:41 all.q@r10n03 1 190
10420 0.56000 blastArray mjr r 02/13/2007 15:04:41 all.q@r10n03 1 191
10420 0.56000 blastArray mjr r 02/13/2007 15:03:41 all.q@r10n03 1 184
10420 0.56000 blastArray mjr r 02/13/2007 15:03:41 all.q@r10n03 1 185
10420 0.56000 blastArray mjr r 02/13/2007 15:08:11 all.q@r10n03 1 205
10420 0.56000 blastArray mjr r 02/13/2007 15:08:11 all.q@r10n03 1 207
10420 0.56000 blastArray mjr r 02/13/2007 15:05:11 all.q@r10n04 1 192
10420 0.56000 blastArray mjr r 02/13/2007 15:05:26 all.q@r10n04 1 195
10420 0.56000 blastArray mjr r 02/13/2007 15:04:26 all.q@r10n04 1 188
10420 0.56000 blastArray mjr r 02/13/2007 15:04:26 all.q@r10n04 1 189
10420 0.56000 blastArray mjr r 02/13/2007 15:03:56 all.q@r10n04 1 186
10420 0.56000 blastArray mjr r 02/13/2007 15:03:56 all.q@r10n04 1 187
10420 0.55242 blastArray mjr qw 02/13/2007 14:28:34 1 210-500:1
==== Job arrays with multiple single core jobs on one exclusive node ====
In some cases job arrays with single core tasks require more memory than the per core memory of the
compute nodes (3 GB on VSC-1, 2 GB on VSC-2). For such cases the jobscript below can be used. It starts several single core
tasks on one node within a job array. Note the definition of the job stepwidth.
#$ -N job_array_with_multilple single tasks on one node
###
### request single nodes, on vsc1 all nodes have 24 GB of memory:
### 8-core nodes:
#$ -pe mpich 8
### 12-core nodes:
### #$ -pe mpich 12
###
### set first and last task_id and stepwidth of array tasks. stepwidth should be identical with the
### number of jobs per node
#$ -t 1-18:3
#optimum order for using single cpus
cpus=(0 4 2 6 1 5 3 7)
for i in `seq 0 $( expr ${SGE_TASK_STEPSIZE} - 1 )`
do
TASK=`expr ${SGE_TASK_ID} + $i`
CMD="run file_$TASK"
taskset -c ${cpus[$i]} $CMD &
done
#wait for all tasks to be finished before exiting the script
wait
==== Job arrays with multiple task within one SGE task step ====
In some cases, where a huge number of Job task need to be started and the task's runtime is very short, the
following construction can be used. It starts several tasks, one after another, on
the specified nodes. Note the definition of the job stepwidth.
#$ -N job_array_with_multilple single tasks on one node
#$ -pe mpich
### set first and last task_id and stepwidth of array tasks.
### stepwidth should be identical with the
### number of jobs per node
#$ -t 1-18:3
for i in `seq 0 $( expr ${SGE_TASK_STEPSIZE} - 1 )`
do
TASK=`expr ${SGE_TASK_ID} + $i`
CMD="run file_$TASK &"
#or
#CMD="mpirun -np $SLOTS ./a.out $TASK &"
$CMD
done
wait
====== Specifying runtime limits ======
In SGE Jobs two runtime limits are available: soft (s_rt) and hard (h_rt) runtime limit
h_rt specifies the time after all parts of the job script have to be finished. Running processes are then killed by GridEngine. Grid Engine sends a SIGUSR2 signal.
s_rt specifies the soft runtime limit after that a SIGUSR1 signal is sent to the process. If s_rt is n times smaller than h_rt SIGUSR1 is sent n times:
#!/bin/bash
#$ -N notify_test
#$ -pe mpich 2
#$ -notify
#$ -V
#$ -l h_rt=0:20:00
#$ -l s_rt=0:02:00
echo $TMPDIR
function sigusr1handler()
{
date
echo "SIGUSR1 caught by shell script" 1>&2
}
function sigusr2handler()
{
date
echo "SIGUSR2 caught by shell script" 1>&2
}
trap sigusr1handler SIGUSR1
trap sigusr2handler SIGUSR2
echo "starting:"
date
# Start
# -q 0: disable "MPI progress Quiescence" error message
#mpirun -q 0 -m $TMPDIR/machines -np $NSLOTS sleep 200
for i in {1..900}
do
echo "waiting $i"
sleep 10
done
echo "finished:"
date
output in error (*.e*) file of this job example is:
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 2
SIGUSR2 caught by shell script
====== Modyfying the machines file on VSC-1 ======
In cases when not all CPUs of one node are required, the machines file can be modified to guarantee the right behaviour of mpirun. The $TMPDIR/machines file on VSC-1 consists of a number of machine/node names. Each name stands for one CPU on the given machine/node. For an exclusive job on 2 nodes the machine file looks like:
r10n01
r10n01
r10n01
r10n01
r10n01
r10n01
r10n01
r10n01
r12n10
r12n10
r12n10
r12n10
r12n10
r12n10
r12n10
r12n10
For running a job on less than eight cores the $TMPDIR/machines file has to be replaced within the job script:
#$ -N test
#$ -pe mpich 16
NSLOTS_PER_NODE_AVAILABLE=8
NSLOTS_PER_NODE_USED=4
NSLOTS_REDUCED=`echo "$NSLOTS / $NSLOTS_PER_NODE_AVAILABLE * $NSLOTS_PER_NODE_USED" | bc `
echo "starting run with $NSLOTS_REDUCED processes; $NSLOTS_PER_NODE_USED per node"
for i in `seq 1 $NSLOTS_PER_NODE_USED`
do
uniq $TMPDIR/machines >> $TMPDIR/tmp
done
sort $TMPDIR/tmp > $TMPDIR/myhosts
cat $TMPDIR/myhosts
mpirun -machinefile $TMPDIR/myhosts -np $NSLOTS_REDUCED sleep 2
The reduced form would look like:
r10n01
r10n01
r10n01
r10n01
r12n10
r12n10
r12n10
r12n10
===== Find out memory usage of running jobs =====
qstat -F mem_free |grep -B 2
==== Using the local scratch on VSC-1 ====
On VSC-1 a local scratch directory is available in /tmp. For using this directory in a prallel Job, in which processes
on each node need to access some data, the following script (not tested excessivly, bugs might be around) can be used for transferring the data
to and from the nodes:
#!/bin/bash
#$ -l h_rt=72:00:00
#$ -cwd
#$ -N jobname
#$ -pe mpich 64
#$ -V
#$ -j yes
################################################
##### start copying process ####
################################################
#temporary directory on nodes:
tmp_dir=$TMPDIR/data
#naming of the tar files that should be distributed to
#the nodes. Each tar file should contain 8 subdirs.
#file names are completed by appending a number without leading
#zeros + tar.gz
input_tar_file_base=processes_
#naming of the outputfiles; number and tar.gz are appended automatically
output_tar_base=output_
#define which directories and files should be put into the output tar.gz
# not tested yet
PACKING=data\*
start_dir=`pwd`
#save output data to this directory
data_save="${start_dir}/data_${JOB_ID}"
nodes_uniq=$(cat $TMPDIR/machines| uniq)
#copy files per node
#tar.gz files contain 8 subdirectories for each process
j=0
for i in $nodes_uniq
do
tar_file_name="${input_tar_file_base}${j}.tar.gz"
echo "creating ${tmp_dir} on $i"
ssh $i mkdir -p $tmp_dir
echo "copying $tar_file_name to node $i"
ssh $i cp ${start_dir}\/${tar_file_name} ${tmp_dir}
ssh $i cp -r ${start_dir}\/tmp_dictionaries\/* ${tmp_dir}
echo "extracting file"
ssh $i "cd ${tmp_dir} ;tar -zxf ${tmp_dir}\/${tar_file_name}"
j=$(echo "$j+1"|bc -l)
done
#command to run:
cd ${tmp_dir}
time mpirun -mca btl_openib_ib_timeout 20 -machinefile $TMPDIR/machines -np 64 $1 -parallel
#cp files per node back to start directory of job
echo "================================================="
j=0
for i in $nodes_uniq
do
tar_file_name="${input_tar_file_base}${j}.tar.gz"
output_tar_file="${output_tar_base}${j}.tar.gz"
echo "creating ${output_tar_file} on node $i"
echo "ssh $i \" cd ${tmp_dir} ;tar -zcf ${output_tar_file} $PACKING\""
ssh $i " cd ${tmp_dir} ;tar -zcf ${output_tar_file} $PACKING"
echo "copying file back"
mkdir -p ${data_save}
ssh $i cp ${tmp_dir}\/${output_tar_file} ${data_save}
j=$(echo "$j+1"|bc -l)
done
# ----------------------------------------------------------------- end-of-file
===== Tight integration =====
Some MPI implementations have a tight integration to SGE. In this case, providing a machinefile could/will be ignored by mpirun.
For disabling this behaviour one can
* unset the variable PE_HOSTFILE within the job script with:
unset PE_HOSTFILE
* recompile the MPI libraries with disabled tight integration support
* use the mpirun command with following options, '-machinefile' option has to be omitted:
mpirun -npernode -np
===== Using Resources of foreign Projects =====
Sometimes users from one project are allowed by the project manager of another project to use its resources, ie. run jobs within this project by using the '-P ' flag in the job script.
Grid Engine Jobs are alwayes executed with the primary group of the user. If the primary group of a user is not matching the project group given in the job, the job will be rejected.
For running a job in an other project than your primary group proceed as follows. Define the '-P' flag and if necessary the '-q' flag in your jobscript 'job.sh':
#$ -N foreingProjectJob
### specify queue if necessary:
###$ -q p80000
#$ -P p80000
#$ -pe mpich 6
sleep 60
Submit the job using a wrapper script:
qsub.py job.sh
The wrapper script will change your primary group to that given by the '-P' flag and submit the job.
===== Changing job parameters of already submitted jobs (qalter) =====
Jobs that have already been submitted to the queue can be modified using the 'qalter' command. Most common usage is changing queue or hard resources like runtime and free memory:
Change queue to 'long.q':
qalter -q long.q
Change hard resources, first get resources with qstat:
qstat -j |grep hard
Set new hard resources with modified string from above command:
qalter -l h_rt=60,
==== submitting a job from standard in (stdin) ====
echo "sleep 10" | qsub -N test
==== Redirecting output of jobs ====
=== Within Grid Engine ===
If nothing else is specified the output of GridEngine Jobs is written to four files:
* .o(.): STDOUT of the job
* .e(.): STDERR of the job
* .po(.): STDOUT of parallel environment of the job
* .pe(.): STDERR of parallel environment of the job
This behaviour can be modified in the following way:
* '#$ -j y' joins STDOUT and STDERR, i.e. there will be a '.o' and a '.po' file
* '#$ -o /path/filname' sets the STDOUT to a certain filename, i.e one gets the file specified with '-o' plus a '.e' and a '.pe' file
* '#$ -e /path/filname' sets the STDERR to a certain filename, three files in analogy to the '-o' option
* combination of '-j' with '-o' is possible, '-e' is ignored if '-j' is used.
When using the '-o' and '-e' options following pseudo variables can be used for specifying the path/filenames. NOTE: The
variables can not be used as in a bash environment, i.e. using ${HOME} will not work.
* $HOME home directory on execution machine
* $USER user ID of job owner
* $JOB_ID current job ID
* $JOB_NAME current job name
* $HOSTNAME name of the execution host
* $TASK_ID array job task index number
For further details see 'man qsub' on the login nodes of VSC.
=== Ordinary shell redirection ===
Since the job is executed in an shell environment, the output of any shell command can be redirected to any file:
echo hello > /path/file
==== Overcommitting memory of the compute nodes on VSC-2 ====
By default only 32 GB of (the physically available) memory can be allocated by applications on the compute nodes of VSC-2. Some applications however
have to allocate larger amounts of memory than the physical limit (but although allocated, it is never used). For such cases a kernel parameter of the
compute nodes can be changed in order to allow for overcommittment of the memory:
#$ -N test
#$ -pe mpich 32
#$ -l overcommit_mem=true