Job chains

This version is outdated by a newer approved version.

This version (2014/07/11 08:03) is a draft.
Approvals: 0/1

This is an old revision of the document!

The problem

The queueing system we use on the VSC enables for the reservation of slots to jobs that need many of them. To make sure that these jobs are at all able to run on the cluster (not blocked by the many smaller ones), we automatically turn on this feature for any jobs that run with 32 or more slots. Unfortunately, this will “waste” a certain amount of resources. In the meantime, there will be some empty nodes which run no calculation.

The solution (quicker scheduling available here)

But fear not! We will tell you how to make use of these idle resources while they are waiting for the large job to be scheduled!

This is called backfilling and is done so:

#### add this to the top of your job submission script:
#$ -l h_rt=01:00:00    # in this example, ONE HOUR (01:00:00) is the job's maximum run time

Here, h_rt means “hard runtime”. The job will be permitted to run at most this long. Please add a little buffer of about 10-50% to the runtime you've estimated/measured, so the job doesn't get killed before its result is available! For example, if your last job came in at 03:57:03 and you think that the next one will be pretty similar in nature, try to specify “-l h_rt=06:00:00”.

We hope that you are able and willing to utilize this feature as it would save a lot of electricity and greatly reduce the time your jobs spend in the queue!
Please put our precious computing time to use as best as you can.

Job chains are sets of consecutive interdependent jobs.

There are several ways how to create job chains and we will discuss three different solutions below:

Solution I

Create one single job script and list several different jobs inside it. E.g.:

user@l01 $ cat allInOne.sge
#$ -N allInOne

./doJob1
./doJob2
./doJob3

user@l01 $ qsub allInOne.sge
Your job 10411 ("allInOne") has been submitted.

Solution II

Job chains, using -hold jid <job id|job name>.

user@l01 $ cat holdJob.sge
#$ -N holdJob

./doJob1

user@l01 $ cat Job2.sge
#$ -N Job2
#$ -hold_jid holdJob

./doJob2

…

user@l01 $ qsub holdJob.sge
Your job 10451 ("holdJob") has been submitted.
user@l01 $ qsub Job2.sge
Your job 10452 ("Job2") has been submitted.
user@l01 $ qsub Job3.sge
Your job 10453 ("Job3") has been submitted.

'Job2' waits and starts only after 'holdJob' has finished. If you have another 'Job3' it waits until 'Job2' has finished and so on … .

Advantage: accumulates ”waiting time”
Note: -hold_jid <job_name> can only be used to reference jobs of the same user (-hold_jid <job_id> can be used to reference any job)

Job arrays are sets of similar but independent jobs. Submit sets of similar and independent “tasks”:

qsub -t 1-500:1 example_3.sge submits 500 instances of the same script
each instance (“task”) is executed independently
all instances subsumed with a single job ID
variable $SGE_TASK_ID discriminates between instances
task numbering scheme: -t <first>-<last>:<stepsize>
related: $SGE_TASK_FIRST,$SGE_TASK_LAST,$SGE_TASK_STEPSIZE

Example:

#$ -cwd
#$ -N blastArray
#$ -t 1-500:1

QUERY=query_${SGE_TASK_ID}.fa
OUTPUT=blastout_${SGE_TASK_ID}.txt
echo "processing query $QUERY ..."
blastall -p blastn -d nt -i $QUERY -o $OUTPUT
echo "...done"

user@l01 $ qsub example_3.sge
Your job 10420.1-500:1 ("blastArray") has been submitted.
user@l01 $ qstat
job-ID prior    name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  10420 0.56000 blastArray mjr           r     02/13/2007 15:05:56 all.q@r10n01                       1 198
  10420 0.56000 blastArray mjr           r     02/13/2007 15:05:56 all.q@r10n01                       1 199
  10420 0.56000 blastArray mjr           r     02/13/2007 15:07:11 all.q@r10n01                       1 202
  10420 0.56000 blastArray mjr           r     02/13/2007 15:07:11 all.q@r10n01                       1 203
  10420 0.56000 blastArray mjr           r     02/13/2007 15:05:41 all.q@r10n01                       1 196
  10420 0.56000 blastArray mjr           r     02/13/2007 15:05:41 all.q@r10n01                       1 197
  10420 0.55241 blastArray mjr           r     02/13/2007 15:08:41 all.q@r10n01                       1 208
  10420 0.55241 blastArray mjr           r     02/13/2007 15:08:41 all.q@r10n01                       1 209
  10420 0.56000 blastArray mjr           r     02/13/2007 15:08:11 all.q@r10n02                       1 204
  10420 0.56000 blastArray mjr           r     02/13/2007 15:08:11 all.q@r10n02                       1 206
  10420 0.56000 blastArray mjr           r     02/13/2007 15:02:11 all.q@r10n02                       1 176
  10420 0.56000 blastArray mjr           r     02/13/2007 15:02:11 all.q@r10n02                       1 177
  10420 0.56000 blastArray mjr           r     02/13/2007 15:03:26 all.q@r10n02                       1 182
  10420 0.56000 blastArray mjr           r     02/13/2007 15:03:26 all.q@r10n02                       1 183
  10420 0.56000 blastArray mjr           r     02/13/2007 15:07:11 all.q@r10n02                       1 200
  10420 0.56000 blastArray mjr           r     02/13/2007 15:07:11 all.q@r10n02                       1 201
  10420 0.56000 blastArray mjr           r     02/13/2007 15:05:11 all.q@r10n03                       1 193
  10420 0.56000 blastArray mjr           r     02/13/2007 15:05:11 all.q@r10n03                       1 194
  10420 0.56000 blastArray mjr           r     02/13/2007 15:04:41 all.q@r10n03                       1 190
  10420 0.56000 blastArray mjr           r     02/13/2007 15:04:41 all.q@r10n03                       1 191
  10420 0.56000 blastArray mjr           r     02/13/2007 15:03:41 all.q@r10n03                       1 184
  10420 0.56000 blastArray mjr           r     02/13/2007 15:03:41 all.q@r10n03                       1 185
  10420 0.56000 blastArray mjr           r     02/13/2007 15:08:11 all.q@r10n03                       1 205
  10420 0.56000 blastArray mjr           r     02/13/2007 15:08:11 all.q@r10n03                       1 207
  10420 0.56000 blastArray mjr           r     02/13/2007 15:05:11 all.q@r10n04                       1 192
  10420 0.56000 blastArray mjr           r     02/13/2007 15:05:26 all.q@r10n04                       1 195
  10420 0.56000 blastArray mjr           r     02/13/2007 15:04:26 all.q@r10n04                       1 188
  10420 0.56000 blastArray mjr           r     02/13/2007 15:04:26 all.q@r10n04                       1 189
  10420 0.56000 blastArray mjr           r     02/13/2007 15:03:56 all.q@r10n04                       1 186
  10420 0.56000 blastArray mjr           r     02/13/2007 15:03:56 all.q@r10n04                       1 187
  10420 0.55242 blastArray mjr           qw    02/13/2007 14:28:34                                    1 210-500:1

In some cases job arrays with single core tasks require more memory than the per core memory of the compute nodes (3 GB on VSC-1, 2 GB on VSC-2). For such cases the jobscript below can be used. It starts several single core tasks on one node within a job array. Note the definition of the job stepwidth.

#$ -N job_array_with_multilple single tasks on one node
###
### request single nodes, on vsc1 all nodes have 24 GB of memory:
### 8-core nodes:
#$ -pe mpich 8
### 12-core nodes:
### #$ -pe mpich 12
###
### set first and last task_id and stepwidth of array tasks. stepwidth should be identical with the
### number of jobs per node 
#$ -t 1-18:3

#optimum order for using single cpus
cpus=(0 4 2 6 1 5 3 7)

for i in `seq 0 $( expr ${SGE_TASK_STEPSIZE} - 1 )`
do
        TASK=`expr ${SGE_TASK_ID} + $i`
        CMD="run file_$TASK"
        taskset -c ${cpus[$i]} $CMD  & 
done
#wait for all tasks to be finished before exiting the script
wait

In some cases, where a huge number of Job task need to be started and the task's runtime is very short, the following construction can be used. It starts several tasks, one after another, on the specified nodes. Note the definition of the job stepwidth.

#$ -N job_array_with_multilple single tasks on one node
#$ -pe mpich <N>
### set first and last task_id and stepwidth of array tasks. stepwidth should be identical with the
### number of jobs per node 
#$ -t 1-18:3


for i in `seq 0 $( expr ${SGE_TASK_STEPSIZE} - 1 )`
do
        TASK=`expr ${SGE_TASK_ID} + $i`
        CMD="run file_$TASK &"
        #or 
        #CMD="mpirun -np $SLOTS ./a.out $TASK &"
        $CMD  
done
wait

In SGE Jobs two runtime limits are available: soft (s_rt) and hard (h_rt) runtime limit

h_rt specifies the time after all parts of the job script have to be finished. Running processes are then killed by GridEngine. Grid Engine sends a SIGUSR2 signal.

s_rt specifies the soft runtime limit after that a SIGUSR1 signal is sent to the process. If s_rt is n times smaller than h_rt SIGUSR1 is sent n times:

#!/bin/bash

#$ -N notify_test
#$ -pe mpich 2
#$ -notify
#$ -V
#$ -l h_rt=0:20:00
#$ -l s_rt=0:02:00

echo $TMPDIR

function sigusr1handler()
{
	date
        echo "SIGUSR1 caught by shell script" 1>&2
}
function sigusr2handler()
{
	date
        echo "SIGUSR2 caught by shell script" 1>&2
}
trap sigusr1handler SIGUSR1
trap sigusr2handler SIGUSR2

echo "starting:"
date
# Start
# -q 0: disable "MPI progress Quiescence" error message

#mpirun -q 0 -m $TMPDIR/machines -np $NSLOTS sleep 200
for i in {1..900}
do
	echo "waiting $i"
	sleep 10
done

echo "finished:"
date

output in error (*.e*) file of this job example is:

User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 1
SIGUSR1 caught by shell script
User defined signal 2
SIGUSR2 caught by shell script

In cases when not all CPUs of one node are required, the machines file can be modified to guarantee the right behaviour of mpirun. The $TMPDIR/machines file on VSC-1 consists of a number of machine/node names. Each name stands for one CPU on the given machine/node. For an exclusive job on 2 nodes the machine file looks like:

r10n01
r10n01
r10n01
r10n01
r10n01
r10n01
r10n01
r10n01
r12n10
r12n10
r12n10
r12n10
r12n10
r12n10
r12n10
r12n10

For running a job on less than eight cores the $TMPDIR/machines file has to be replaced within the job script:

#$ -N test
#$ -pe mpich 16

NSLOTS_PER_NODE_AVAILABLE=8
NSLOTS_PER_NODE_USED=4
NSLOTS_REDUCED=`echo "$NSLOTS / $NSLOTS_PER_NODE_AVAILABLE * $NSLOTS_PER_NODE_USED" | bc  `

echo "starting run with $NSLOTS_REDUCED processes; $NSLOTS_PER_NODE_USED per node"
for i in `seq 1 $NSLOTS_PER_NODE_USED`
do
	uniq $TMPDIR/machines >> $TMPDIR/tmp
done
sort $TMPDIR/tmp  > $TMPDIR/myhosts
cat $TMPDIR/myhosts


mpirun -machinefile $TMPDIR/myhosts -np $NSLOTS_REDUCED sleep 2

The reduced form would look like:

r10n01
r10n01
r10n01
r10n01
r12n10
r12n10
r12n10
r12n10

qstat -F mem_free  |grep -B 2 <job_id>

On VSC-1 a local scratch directory is available in /tmp. For using this directory in a prallel Job, in which processes on each node need to access some data, the following script (not tested excessivly, bugs might be around) can be used for transferring the data to and from the nodes:

#!/bin/bash
#$ -l h_rt=72:00:00
#$ -cwd
#$ -N jobname
#$ -pe mpich 64
#$ -V
#$ -j yes

################################################
#####         start copying process         ####
################################################



#temporary directory on nodes:
tmp_dir=$TMPDIR/data

#naming of the tar files that should be distributed to
#the nodes. Each tar file should contain 8 subdirs.

#file names are completed by appending a number without leading
#zeros + tar.gz
input_tar_file_base=processes_

#naming of the outputfiles; number and tar.gz are appended automatically
output_tar_base=output_
#define which directories and files should be put into the output tar.gz
# not tested yet
PACKING=data\*

start_dir=`pwd`
#save output data to this directory
data_save="${start_dir}/data_${JOB_ID}"



nodes_uniq=$(cat $TMPDIR/machines| uniq)


#copy files per node
#tar.gz files contain 8 subdirectories for each process

j=0

for i in $nodes_uniq
do
    tar_file_name="${input_tar_file_base}${j}.tar.gz"
    echo "creating ${tmp_dir} on $i"
    ssh $i mkdir -p $tmp_dir
    echo "copying $tar_file_name to node $i"
    ssh $i cp ${start_dir}\/${tar_file_name} ${tmp_dir}
    ssh $i cp -r ${start_dir}\/tmp_dictionaries\/* ${tmp_dir}
    echo "extracting file"
    ssh $i "cd ${tmp_dir} ;tar -zxf ${tmp_dir}\/${tar_file_name}"
    j=$(echo "$j+1"|bc -l)
done

#command to run:
cd ${tmp_dir}
time mpirun -mca btl_openib_ib_timeout 20 -machinefile $TMPDIR/machines -np 64 $1 -parallel

#cp files per node back to start directory of job
echo "================================================="
j=0
for i in $nodes_uniq
do
    tar_file_name="${input_tar_file_base}${j}.tar.gz"
    output_tar_file="${output_tar_base}${j}.tar.gz"
    echo "creating ${output_tar_file} on node $i"
    echo "ssh $i \" cd ${tmp_dir} ;tar -zcf ${output_tar_file} $PACKING\""
    ssh $i " cd ${tmp_dir} ;tar -zcf ${output_tar_file} $PACKING"

    echo "copying file back"
    mkdir -p ${data_save}
    ssh $i cp  ${tmp_dir}\/${output_tar_file} ${data_save}
    j=$(echo "$j+1"|bc -l)
done

# ----------------------------------------------------------------- end-of-file

Some MPI implementations have a tight integration to SGE. In this case, providing a machinefile could/will be ignored by mpirun. For disabling this behaviour one can

unset the variable PE_HOSTFILE within the job script with:

unset PE_HOSTFILE

recompile the MPI libraries with disabled tight integration support

use the mpirun command with following options, '-machinefile' option has to be omitted:

    mpirun -npernode <number_of_processes_on_one_node> -np <number_of_total_processes> <command>

Sometimes users from one project are allowed by the project manager of another project to use its resources, ie. run jobs within this project by using the '-P <project_name>' flag in the job script. Grid Engine Jobs are alwayes executed with the primary group of the user. If the primary group of a user is not matching the project group given in the job, the job will be rejected.

For running a job in an other project than your primary group proceed as follows. Define the '-P' flag and if necessary the '-q' flag in your jobscript 'job.sh':

#$ -N foreingProjectJob
### specify queue if necessary:
###$ -q p80000
#$ -P p80000

#$ -pe mpich 6
sleep 60

Submit the job using a wrapper script:

qsub.py  job.sh

The wrapper script will change your primary group to that given by the '-P' flag and submit the job.

Jobs that have already been submitted to the queue can be modified using the 'qalter' command. Most common usage is changing queue or hard resources like runtime and free memory:

Change queue to 'long.q':

qalter -q long.q <job_id>

Change hard resources, first get resources with qstat:

qstat -j <job_id> |grep hard

Set new hard resources with modified string from above command:

qalter -l h_rt=60,<rest_of_string_from_above>

echo "sleep 10" | qsub -N test

Within Grid Engine

If nothing else is specified the output of GridEngine Jobs is written to four files:

<job_name>.o<job_id>(.<task_id>): STDOUT of the job
<job_name>.e<job_id>(.<task_id>): STDERR of the job
<job_name>.po<job_id>(.<task_id>): STDOUT of parallel environment of the job
<job_name>.pe<job_id>(.<task_id>): STDERR of parallel environment of the job

This behaviour can be modified in the following way:

'#$ -j y' joins STDOUT and STDERR, i.e. there will be a '.o' and a '.po' file
'#$ -o /path/filname' sets the STDOUT to a certain filename, i.e one gets the file specified with '-o' plus a '.e' and a '.pe' file
'#$ -e /path/filname' sets the STDERR to a certain filename, three files in analogy to the '-o' option
combination of '-j' with '-o' is possible, '-e' is ignored if '-j' is used.

When using the '-o' and '-e' options following pseudo variables can be used for specifying the path/filenames. NOTE: The variables can not be used as in a bash environment, i.e. using ${HOME} will not work.

$HOME home directory on execution machine
$USER user ID of job owner
$JOB_ID current job ID
$JOB_NAME current job name
$HOSTNAME name of the execution host
$TASK_ID array job task index number

For further details see 'man qsub' on the login nodes of VSC.

Ordinary shell redirection

Since the job is executed in an shell environment, the output of any shell command can be redirected to any file:

echo hello > /path/file

By default only 32 GB of (the physically available) memory can be allocated by applications on the compute nodes of VSC-2. Some applications however have to allocate larger amounts of memory than the physical limit (but although allocated, it is never used). For such cases a kernel parameter of the compute nodes can be changed in order to allow for overcommittment of the memory:

#$ -N test
#$ -pe mpich 32
#$ -l overcommit_mem=true

Specifying the maximum runtime (means quicker job execution) / resource reservation

The problem

The solution (quicker scheduling available here)

Job chains

Solution I

Solution II

Job arrays

Job arrays with multiple single core jobs on one exclusive node

Job arrays with multiple task within one SGE task step

Specifying runtime limits

Modyfying the machines file on VSC-1

Find out memory usage of running jobs

Using the local scratch on VSC-1

Tight integration

Using Resources of foreign Projects

Changing job parameters of already submitted jobs (qalter)

submitting a job from standard in (stdin)

Redirecting output of jobs

Within Grid Engine

Ordinary shell redirection

Overcommitting memory of the compute nodes on VSC-2