Engineering
ABAQUS 2016
Checkpointing and restart
Users sometimes find that their jobs need longer to complete than the maximum runtime permitted by the scheduler. Provided that your model does not automatically re-mesh (for example, after a fracture), you may be able to make use of Abaqus’ built-in checkpointing function.
This will create a restart file (.res file extension) from which a job that is killed can be restarted.
- Activate the restart feature by adding the line:
*restart, write
at the top of your input file and run your job as normal. It should produce a restart file with a .res file extension.
- Run the restart analysis with
abaqus job=newJobName oldjob=oldJobName ...
where oldJobName is the initial input file and newJobName is a file which contains only the line:
*restart, read
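The restart steps above could be scripted roughly as follows. This is only a sketch: the job names are the placeholders used above, the .inp extension for the restart input file is assumed, and any further abaqus options are left out. The command is executed only if abaqus is on the PATH and the old .res file exists; otherwise it is just printed.

```shell
#!/bin/bash
# Placeholder job names, used only for illustration.
OLD_JOB="oldJobName"   # the job that wrote the .res file
NEW_JOB="newJobName"   # the restart job

# Create the restart input file containing only the read directive.
printf '*restart, read\n' > "${NEW_JOB}.inp"

# Build the restart command as given above; further options are omitted here.
RESTART_CMD="abaqus job=${NEW_JOB} oldjob=${OLD_JOB}"

# Run only if abaqus is available and the old restart file exists,
# otherwise print what would be run.
if command -v abaqus >/dev/null 2>&1 && [ -f "${OLD_JOB}.res" ]; then
    ${RESTART_CMD}
else
    echo "would run: ${RESTART_CMD}"
fi
```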
COMSOL
The following example case is provided here, including the directory structure
and the matching batch file: karman.rar
Module
The available versions of COMSOL can be listed by executing the following line:
module avail 2>&1 | grep -i comsol
Currently these versions can be loaded:
- Comsol/5.5
- Comsol/5.6
module load *your preferred module*
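If you simply want the newest installed version, the lookup and the load could be combined as in the following sketch. The helper name pick_latest is made up for this example, and it assumes that the module names follow the Comsol/<version> pattern shown above and that GNU sort (with -V) is available.

```shell
#!/bin/bash
# Hypothetical helper: read "module avail" output on stdin and
# print the Comsol module with the highest version number.
pick_latest() {
    grep -io 'comsol/[0-9][0-9.]*' | sort -V | tail -n 1
}

# Only call "module" where it actually exists (e.g. on the cluster).
if type module >/dev/null 2>&1; then
    latest=$(module avail 2>&1 | pick_latest)
    if [ -n "${latest}" ]; then
        module load "${latest}"
    fi
else
    echo "module command not available on this machine"
fi
```

Pinning an explicit version, as in the job script, is the safer choice when results have to be reproducible.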
Workflow
In general, you define your complete case on your local machine and save it as a *.mph file.
This file contains all the information necessary to run a successful calculation on the cluster.
Job script
An example of a Job script is shown below.
#!/bin/bash
# slurmsubmit.sh

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24
#SBATCH --job-name="karman"
#SBATCH --partition=mem_0384
#SBATCH --qos=mem_0384

module purge
module load Comsol/5.6

MODELTOCOMPUTE="karman"
path=$(pwd)

INPUTFILE="${path}/${MODELTOCOMPUTE}.mph"
OUTPUTFILE="${path}/${MODELTOCOMPUTE}_result.mph"
BATCHLOG="${path}/${MODELTOCOMPUTE}.log"

echo "reading the inputfile"
echo $INPUTFILE
echo "writing the resultfile to"
echo $OUTPUTFILE
echo "COMSOL logs written to"
echo $BATCHLOG
echo "and the usual slurm...out"

# COMSOL's internal command for number of nodes -nn and so on (-np, -nnhost, ...) are deduced from SLURM
comsol batch -mpibootstrap slurm \
    -inputfile ${INPUTFILE} \
    -outputfile ${OUTPUTFILE} \
    -batchlog ${BATCHLOG} \
    -alivetime 15 -recover -mpidebug 10
Possible IO-Error
COMSOL generates a huge number of temporary files during the calculation. By default these files are saved in $HOME,
which can lead to an I/O error. To avoid it, change the path of $TMPDIR
to, e.g., /local, so that the temporary files are stored on the SSD storage local to the compute node.
To get rid of this error, extend the comsol command in the job script with the following option:
-tmpdir "/local"
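A slightly more defensive variant is sketched below. It uses /local when that directory is writable and otherwise falls back to $TMPDIR or /tmp; the fallback order and the per-job subdirectory name (COMSOL_TMP, comsol_<job id>) are assumptions made for this example, not site policy.

```shell
#!/bin/bash
# Preferred location from the text above; the fallback is an assumption.
SCRATCH="/local"
if [ ! -d "${SCRATCH}" ] || [ ! -w "${SCRATCH}" ]; then
    SCRATCH="${TMPDIR:-/tmp}"
fi

# Per-job subdirectory (hypothetical layout) so that concurrent jobs
# on the same node do not mix their temporary files.
COMSOL_TMP="${SCRATCH}/comsol_${SLURM_JOB_ID:-$$}"
mkdir -p "${COMSOL_TMP}"

echo "temporary files go to ${COMSOL_TMP}"

# The option would then be appended to the comsol call in the job script:
#   comsol batch ... -tmpdir "${COMSOL_TMP}"
```

Outside of a SLURM job the variable SLURM_JOB_ID is unset, so the sketch falls back to the shell's process ID for the directory name.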
Submit job
sbatch karman.job
Using a shared node
If your case isn't that demanding on hardware and you are interested in a fast solution, it is possible to use one of the shared nodes. These are non-exclusive nodes, so more than one job can use the provided hardware at the same time. On these nodes you have to tell SLURM how much memory (RAM) your case needs. This value should be less than the maximum of 96GB these nodes have; otherwise your job needs a whole node anyway. Here we use --mem=20G to request 20GB of memory.
#!/bin/bash
# slurmsubmit.sh

#SBATCH -n 1
#SBATCH --ntasks-per-node=1
#SBATCH --job-name="clustsw"
#SBATCH --qos=mem_0096
#SBATCH --mem=20G

hostname

module purge
module load Comsol/5.6
module list
. . .
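Before submitting, the memory request can be checked against the 96GB limit of the shared nodes. The function below is a hypothetical helper written for this page, not part of SLURM, and it only understands values of the form <number>G.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the request (e.g. "20G") is below
# the node limit in GB (default 96, as stated above).
mem_fits_shared_node() {
    request="$1"
    limit_gb="${2:-96}"
    # Only the "<number>G" form is handled in this sketch.
    case "${request}" in
        *G) gb="${request%G}" ;;
        *)  return 2 ;;
    esac
    case "${gb}" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "${gb}" -lt "${limit_gb}" ]
}

if mem_fits_shared_node "20G"; then
    echo "20G fits on a shared node"
else
    echo "20G needs a whole node"
fi
```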