====== SLURM ======

Contrary to SGE, which was previously employed on VSC-1 and VSC-2, the scheduler on VSC-3, VSC-4, and VSC-5 is [[http://slurm.schedmd.com|SLURM - Simple Linux Utility for Resource Management]].

==== Basic SLURM commands ====
  * ''[...]$ sinfo'' gives information on which partitions are available for job submission. Note: what SGE termed a 'queue' is called a 'partition' under SLURM.
  * ''[...]$ scontrol'' is used to view the SLURM configuration, including: job, job step, node, partition, reservation, and overall system configuration. Without a command entered on the command line, scontrol operates in an interactive mode and prompts for input. With a command entered on the command line, scontrol executes that command and terminates.
  * ''[...]$ scontrol show job 567890'' shows information on the job with number 567890.
  * ''[...]$ scontrol show partition'' shows information on available partitions.
  * ''[...]$ squeue'' shows the current list of submitted jobs, their state and resource allocation. [[doku:slurm_job_reason_codes|Here]] is a description of the most important **job reason codes** returned by the squeue command.
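
For example, a quick overview of the partitions and of your own jobs can be obtained with (the job ID 567890 is just a placeholder):
<code>
[...]$ sinfo -o "%P %a %l %D"     # partitions, availability, time limit, number of nodes
[...]$ squeue -u $USER            # your own pending and running jobs
[...]$ scontrol show job 567890   # full details of one specific job
</code>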


==== Software Installations and Modules ====

On VSC-4 and VSC-5, spack is used to install and provide modules, see [[doku:spack|SPACK - a package manager for HPC systems]]. The methods described in [[doku:modules]] can still be used for backwards compatibility, but we suggest using spack.
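
A typical workflow is to first look for an installed package and then load it; the package name ''gromacs'' below is only an example, any other installed package works the same way:
<code>
[...]$ spack find gromacs          # list installed variants of the package
[...]$ spack load gromacs          # load one of them into the environment
# or, via the classic module environment:
[...]$ module avail
[...]$ module load <module name>
</code>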

==== Node configuration - hyperthreading ====
  
The compute nodes of VSC-4 are configured with the following parameters in SLURM:
<code>
CoresPerSocket=24
Sockets=2
ThreadsPerCore=2
</code>
And the primary nodes of VSC-5 with:
<code>
CoresPerSocket=64
Sockets=2
ThreadsPerCore=2
</code>
This reflects the fact that <html> <font color=#cc3300> hyperthreading </font> </html> is activated on all compute nodes and <html> <font color=#cc3300> 96 cores on VSC-4 and 256 cores on VSC-5 </font> </html> may be utilized on each node.
In the batch script, hyperthreading is selected by adding the line
<code>
#SBATCH --ntasks-per-core=2
</code>
which allows for 2 tasks per core.
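
For example, a job that wants to use all 96 virtual cores of a single VSC-4 node could request the following (''./my_program'' is a placeholder for your executable):
<code>
#SBATCH -N 1
#SBATCH --ntasks-per-node=96
#SBATCH --ntasks-per-core=2

srun ./my_program
</code>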
  
Some codes may experience a performance gain from using all virtual cores, e.g., GROMACS seems to profit. But note that using all virtual cores also leads to more communication and may impact the performance of large MPI jobs.
  
**NOTE on accounting**: the project's core-h are always calculated as ''job_walltime * nnodes * ncpus'', where ''ncpus'' is the number of physical cores per node. SLURM's built-in function ''sreport'' yields wrong accounting statistics because (depending on the job script) the multiplier is the number of virtual cores instead of the number of physical cores. You may instead use the accounting script introduced in this [[doku:slurm_sacct|section]].
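
For example, a job that occupies 2 VSC-4 nodes (48 physical cores each) for 10 hours is charged ''10 * 2 * 48 = 960'' core-h, regardless of whether hyperthreading is used.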
  
==== Node allocation policy ====
On VSC-4 and VSC-5 there is a set of nodes that accept jobs which do not require entire exclusive nodes (anything from 1 core up to less than a full node). These nodes are set up to accommodate different jobs from different users until they are full and are used automatically for such jobs. All other nodes are assigned completely (and exclusively) to a job whenever the ''-N'' argument is used.
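
As an illustration, the first header below asks only for part of a node and may therefore share the node with other jobs, while the second claims two complete nodes exclusively (the values are placeholders):
<code>
# partial node: request only the cores (and, optionally, the memory) actually needed
#SBATCH --ntasks=4
#SBATCH --mem=8G

# full nodes: exclusive use of 2 complete VSC-4 nodes
#SBATCH -N 2
#SBATCH --ntasks-per-node=48
</code>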
  
  
===== Submit a batch job =====

==== Partition, quality of service and run time ====

Depending on the demands of a certain application, the [[doku:vsc3_queue|partition (grouping hardware according to its type) and quality of service (QOS; defining the run time etc.)]] can be selected.
Additionally, the run time of a job can be limited in the job script to a value lower than the run time limit of the selected QOS. This allows for a process called [[doku:vsc3_queue#backfilling|backfilling]], possibly leading to a <html><font color=#cc3300>shorter waiting time</font></html> in the queue.
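
In the job script this selection could look as follows; the partition and QOS names are placeholders that have to be replaced by valid values from the queue | partition setup pages:
<code>
#SBATCH --partition=<partition>   # hardware type, e.g. one of the VSC-4/VSC-5 partitions
#SBATCH --qos=<qos>               # QOS matching the chosen partition
#SBATCH --time=04:00:00           # run time limit below the QOS limit (enables backfilling)
</code>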

==== The job submission script ====

It is recommended to write the job script using a [[doku:win2vsc&#the_job_filetext_editors_on_the_cluster|text editor]] on the VSC //Linux// cluster or on any Linux/Mac system.
Editors in //Windows// may add invisible characters to the job file which render it unreadable and thus prevent it from being executed.
  
Assume a submission script ''check.slrm''
<code>
#!/bin/bash
#
#SBATCH -J chk
#SBATCH -N 2
#SBATCH --ntasks-per-node=48
#SBATCH --ntasks-per-core=1
#SBATCH --mail-type=BEGIN    # first have to state the type of event to occur
#SBATCH --mail-user=<email address>    # and then the email address to be notified
  
  
<srun -l -N2 -n96 a.out >
# or
<mpirun -np 96 a.out>
</code>
  * **-J**     job name,\\ 
  * **-N**     number of nodes requested,\\ 
  * **-n, --ntasks=<number>** specifies the number of tasks to run,
  * **--ntasks-per-node**     number of processes run in parallel on a single node,\\ 
  * **--mail-user** sends an email to this address
  
In order to send the job to specific queues, see [[doku:vsc4_queue|Queue | Partition setup on VSC-4]] or [[doku:vsc5_queue|Queue | Partition setup on VSC-5]].
==== Job submission ====
    
<code>
[username@l42 ~]$ sbatch check.slrm    # to submit the job
[username@l42 ~]$ squeue -u `whoami`   # to check the status of your own jobs
[username@l42 ~]$ scancel  JOBID       # for premature removal, where JOBID
                                       # is obtained from the previous command
</code>
  
  
  
==== Hybrid MPI/OMP ====
  
SLURM Script:
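
A minimal sketch of such a hybrid job, assuming two 48-core VSC-4 nodes with 4 MPI tasks per node and 12 OpenMP threads per task (''./my_hybrid_program'' is a placeholder):
<code>
#!/bin/bash
#SBATCH -J hybrid
#SBATCH -N 2
#SBATCH --ntasks-per-node=4      # 4 MPI tasks per node ...
#SBATCH --cpus-per-task=12       # ... with 12 OpenMP threads each (4*12 = 48 physical cores)

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_hybrid_program
</code>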
  
  
==== Job chain ====
  
This example is for using a set of 4 nodes to compute a series of jobs in two stages, each of them split into two separate subjobs. \\
<code>
#!/bin/bash
#
#SBATCH -J chk
#SBATCH -N 4
#SBATCH --ntasks-per-node=48
#SBATCH --ntasks-per-core=1

scontrol show hostnames $SLURM_NODELIST  > ./nodelist

srun -l -N2 -r0 -n96 job1.scrpt &
srun -l -N2 -r2 -n96 job2.scrpt &
wait

srun -l -N2 -r2 -n96 job3.scrpt &
srun -l -N2 -r0 -n96 job4.scrpt &
wait
  
</code>
(The SLURM-inherent command //#SBATCH --array=starting_value-end_value:stepwidth// does not provide this functionality, since nodes are exclusively occupied by one job.)

[[doku:multimpi|Here are examples of how to run multiple MPI jobs on a single node]].
  
===== Generating a host machines file =====
<code>
#!/bin/bash
#
#SBATCH -J par                      # job name
#SBATCH -N 2                        # number of nodes=2
#SBATCH --ntasks-per-node=48        # uses all cpus of one node
#SBATCH --ntasks-per-core=1
#SBATCH --threads-per-core=1
rm machines_tmp
  
tasks_per_node=48         # change number accordingly
nodes=2                   # change number accordingly
for ((line=1; line<=nodes; line++))
</code>
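
A compact, self-contained sketch of generating such a machines file (one line per MPI task, to be used e.g. with ''mpirun -machinefile machines''; the task and node counts are placeholders):
<code>
#!/bin/bash
#SBATCH -J machinefile
#SBATCH -N 2
#SBATCH --ntasks-per-node=48

tasks_per_node=48

# one line per node in the allocation
scontrol show hostnames $SLURM_NODELIST > machines_tmp

# repeat every hostname tasks_per_node times -> one line per task
rm -f machines
while read -r host; do
  for ((i=1; i<=tasks_per_node; i++)); do
    echo "$host" >> machines
  done
done < machines_tmp
rm machines_tmp
</code>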
==== Restarting Failed Jobs ====
  
Slurm is __no longer__ configured to automatically requeue jobs that were aborted due to node failures. If you want your job to be requeued automatically in such a case, ask for it with the following option in your job script:
<code>#SBATCH --requeue</code>
==== Job Arrays ====
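
A minimal sketch of a job array that runs the same program on 10 different input files (program and file names are placeholders):
<code>
#!/bin/bash
#SBATCH -J array
#SBATCH -N 1
#SBATCH --array=1-10

# SLURM_ARRAY_TASK_ID is set to the index of the current array element
./my_program input_${SLURM_ARRAY_TASK_ID}.dat
</code>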
  
For further dependent jobs, continue in the same way: submit each new job with a dependency on the job ID returned by the previous ''sbatch'' call (see the sketch below).
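
A minimal sketch, where ''first.slrm'' and ''second.slrm'' are placeholder script names and ''123456'' stands for the job ID printed by the first ''sbatch'' call:
<code>
[...]$ sbatch first.slrm
Submitted batch job 123456
[...]$ sbatch --dependency=afterany:123456 second.slrm
</code>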
  
===== Prolog Error Codes =====

<code>
ERROR_MEMORY=200
ERROR_INFINIBAND_HW=201
ERROR_INFINIBAND_SW=202
ERROR_IPOIB=203
ERROR_BEEGFS_SERVICE=204
ERROR_BEEGFS_USER=205
ERROR_BEEGFS_SCRATCH=206
ERROR_NFS=207

ERROR_USER_GROUP=220
ERROR_USER_HOME=221

ERROR_GPFS_START=228
ERROR_GPFS_MOUNT=229
ERROR_GPFS_UNMOUNT=230
</code>