Engineering

This version (2021/10/22 08:56) is a draft.

Users sometimes find that their jobs take longer to complete than the maximum runtime permitted by the scheduler. Provided that your model does not automatically re-mesh (for example, after a fracture), you may be able to make use of Abaqus’ built-in checkpointing function.

This will create a restart file (.res file extension) from which a job that is killed can be restarted.

Activate the restart feature by adding the line:

*restart, write

at the top of your input file and run your job as normal. It should produce a restart file with a .res file extension.

Run the restart analysis with

abaqus job=newJobName oldjob=oldJobName ...

where oldJobName is the initial input file and newJobName is a file which contains only the line:

*restart, read

Example:

INPUT: dynam.inp

JOB SCRIPT: job.sh

INPUT FOR RESTART: dynamr.inp
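
As a minimal sketch of the restart steps above, the following shell snippet writes a restart input file containing only the *restart, read line and assembles the restart command. The job names dynam and dynamr are taken from the example files; the scratch directory is an assumption for illustration, any further command options are omitted, and the command is only printed, not executed.

```shell
#!/bin/bash
# Sketch only: job names follow the example files dynam.inp and dynamr.inp.
OLD_JOB="dynam"     # job that wrote the restart file
NEW_JOB="dynamr"    # restart job

# Assumption: a scratch directory stands in for the real job directory.
workdir=$(mktemp -d)

# The restart input file contains only the line "*restart, read".
printf '*restart, read\n' > "${workdir}/${NEW_JOB}.inp"

# Assemble the restart command; it is printed here instead of being run.
restart_cmd="abaqus job=${NEW_JOB} oldjob=${OLD_JOB}"
echo "$restart_cmd"
```

On the cluster, the printed command would be placed in the job script in place of the original abaqus call.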

The following example case is provided here, including the directory structure
and the appropriate batch file: karman.rar

The available versions of COMSOL can be listed by executing the following line:

module avail 2>&1 | grep -i comsol

Currently these versions can be loaded:

Comsol/5.5
Comsol/5.6

module load *your preferred module*
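
If a script should always pick the newest listed version, one possible approach is sketched below. The version list is hard-coded from the two entries above so that the snippet runs without the module system; on the cluster it could instead be taken from the module avail line shown earlier. The variable names are illustrative, and the sketch assumes GNU sort, whose -V option sorts version numbers.

```shell
#!/bin/bash
# Sample list copied from the versions above; on the cluster it could instead
# come from: module avail 2>&1 | grep -i comsol
available="Comsol/5.5
Comsol/5.6"

# Version-aware sort (GNU sort -V), then keep the last, i.e. highest, entry.
latest=$(printf '%s\n' "$available" | grep -i '^comsol/' | sort -V | tail -n 1)

# Printed rather than executed, so the sketch runs without the module system.
echo "module load ${latest}"
```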

In general, you define your complete case on your local machine and save it as a *.mph file.
This file contains all the information necessary to run a successful calculation on the cluster.

An example of a job script is shown below.

#!/bin/bash
# slurmsubmit.sh

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --job-name="karman"
#SBATCH --partition=mem_0384
#SBATCH --qos=mem_0384

module purge
module load Comsol/5.6

MODELTOCOMPUTE="karman"
path=$(pwd)

INPUTFILE="${path}/${MODELTOCOMPUTE}.mph"
OUTPUTFILE="${path}/${MODELTOCOMPUTE}_result.mph"
BATCHLOG="${path}/${MODELTOCOMPUTE}.log"

echo "reading the inputfile"
echo "$INPUTFILE"
echo "writing the resultfile to"
echo "$OUTPUTFILE"
echo "COMSOL logs written to"
echo "$BATCHLOG"
echo "and the usual slurm...out"

# COMSOL's internal options for the number of nodes (-nn) and so on (-np, -nnhost, ...) are deduced from SLURM.
# The file paths are quoted so that directories containing spaces do not break the command.
comsol batch -mpibootstrap slurm -inputfile "${INPUTFILE}" -outputfile "${OUTPUTFILE}" -batchlog "${BATCHLOG}" -alivetime 15 -recover -mpidebug 10
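
Because the whole calculation depends on the *.mph file, it may be worth checking that the file exists before COMSOL is started. The following is a sketch under that assumption; check_input is a hypothetical helper and not part of the job script above.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original job script:
# succeeds if the given file exists, otherwise reports it on stderr.
check_input() {
    if [ -f "$1" ]; then
        echo "input file found: $1"
    else
        echo "input file missing: $1" >&2
        return 1
    fi
}

MODELTOCOMPUTE="karman"
path=$(pwd)
INPUTFILE="${path}/${MODELTOCOMPUTE}.mph"

# In a job script, one could stop early if the model is absent:
#   check_input "$INPUTFILE" || exit 1
if check_input "$INPUTFILE"; then
    status="ready"
else
    status="missing"
fi
echo "$status"
```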

COMSOL generates a large number of temporary files during a calculation. By default these files are saved in $HOME, which can lead to an I/O error. To avoid it, change the temporary directory to, e.g., /local, so that the temporary files are stored on the SSD storage local to the compute node. To do so, extend the comsol command in the job script with the following option:

-tmpdir "/local"
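
For illustration, the complete command line with this option appended might look as follows. This is a dry-run sketch: the variable COMSOL_TMPDIR is an illustrative name, and the command is assembled as a string and printed rather than executed, so it can be inspected without COMSOL being installed.

```shell
#!/bin/bash
MODELTOCOMPUTE="karman"
path=$(pwd)

INPUTFILE="${path}/${MODELTOCOMPUTE}.mph"
OUTPUTFILE="${path}/${MODELTOCOMPUTE}_result.mph"
BATCHLOG="${path}/${MODELTOCOMPUTE}.log"

# Node-local directory named in the text above (illustrative variable name).
COMSOL_TMPDIR="/local"

# Same options as in the job script, with -tmpdir added at the end.
comsol_cmd="comsol batch -mpibootstrap slurm -inputfile ${INPUTFILE} -outputfile ${OUTPUTFILE} -batchlog ${BATCHLOG} -alivetime 15 -recover -mpidebug 10 -tmpdir ${COMSOL_TMPDIR}"

# Dry run: print the command instead of running it.
echo "$comsol_cmd"
```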

Submit the job script, saved here as karman.job, with:

sbatch karman.job

If your case is not that demanding on hardware and you are interested in a fast solution, you can use one of the shared nodes. These nodes are non-exclusive, so more than one job can use the provided hardware at the same time. On these nodes you have to tell SLURM how much memory (RAM) your case needs. This value should be less than the maximum of 96 GB that these nodes provide; otherwise your job needs a whole node anyway. Here we use --mem=20G to request 20 GB of memory.

#!/bin/bash
# slurmsubmit.sh

#SBATCH -n 1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name="clustsw"
#SBATCH --qos=mem_0096
#SBATCH --mem=20G

hostname

module purge
module load Comsol/5.6
module list
.
.
.
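
The memory rule for shared nodes can be expressed as a small check. The sketch below uses the 96 GB node size and the 20 GB request from the text; the variable names are illustrative, and the resulting option is only printed.

```shell
#!/bin/bash
NODE_MEM_GB=96    # memory of a shared node, as stated above
REQUEST_GB=20     # memory the case needs (example value from above)

# A request below the node maximum fits on a shared node;
# anything else needs a whole node anyway.
if [ "$REQUEST_GB" -lt "$NODE_MEM_GB" ]; then
    mem_option="--mem=${REQUEST_GB}G"
    echo "shared node request: ${mem_option}"
else
    mem_option=""
    echo "request needs a whole node" >&2
fi
```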
