doku:oom [VSC Wiki]

This version (2022/06/20 09:01) was approved by msiegel.

A node ran out of memory during the execution of a user job.

An email was sent to the user, containing a line similar to:

Sep 11 10:28:16 r01n08 (mpirun) Starting on r01n09: Sep 11 08:46:43 jz r01n09 254681 3

The parts of this line are:

'Sep 11 10:28:16' is the time of the 'out of memory event'.
'r01n08' is the node of the 'out of memory event'. This may be a slave node of the job.
'(mpirun)' is the executable which was killed. All children of this executable have died, too. (Only one message is sent per node, i.e., for the first executable killed.)
'Starting on r01n09:' names the master node of the job.
'Sep 11 08:46:43' is the time the job was started.
'jz' is the user id.
'r01n09' is again the master node.
'254681' is the job id.
'3' is the task id (if applicable).
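
The field breakdown above can be sketched as a small parser. This is only an illustration: the regular expression is inferred from the single sample line shown on this page, the function name parse_oom_line and the dictionary keys are hypothetical, and real notification lines may differ in detail.

```python
import re

# Timestamp such as "Sep 11 10:28:16" (layout assumed from the sample line).
_TIME = r"\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}"

# Field order follows the description above; the task id is optional.
OOM_LINE = re.compile(
    r"^(?P<event_time>" + _TIME + r") "
    r"(?P<event_node>\S+) "
    r"\((?P<executable>[^)]+)\) "
    r"Starting on (?P<master_node>\S+): "
    r"(?P<start_time>" + _TIME + r") "
    r"(?P<user>\S+) "
    r"(?P<master_node_repeat>\S+) "
    r"(?P<job_id>\d+)"
    r"(?: (?P<task_id>\d+))?$"
)


def parse_oom_line(line):
    """Return the fields of a notification line as a dict, or None if it does not match."""
    match = OOM_LINE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    sample = ("Sep 11 10:28:16 r01n08 (mpirun) Starting on r01n09: "
              "Sep 11 08:46:43 jz r01n09 254681 3")
    for key, value in parse_oom_line(sample).items():
        print(key, "=", value)
```

Comparing event_node with master_node shows whether the out of memory event happened on the master node or on another node of the job.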

If your job requires more memory per core, consider using only 8 or even 4 cores per node, so that each process gets a larger share of the node's memory.
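
The effect of using fewer cores per node can be estimated with simple arithmetic. The sketch below assumes that node memory is shared evenly among the processes on a node; the node memory of 32 GB and the full count of 16 cores are hypothetical placeholders, so please substitute the actual values of the nodes you use.

```python
def memory_per_process_gb(node_memory_gb, processes_per_node):
    """Memory available to each process if node memory is divided evenly."""
    if processes_per_node <= 0:
        raise ValueError("processes_per_node must be positive")
    return node_memory_gb / processes_per_node


if __name__ == "__main__":
    NODE_MEMORY_GB = 32.0  # hypothetical value, not taken from this page
    for cores in (16, 8, 4):  # 16 is an assumed full node; 8 and 4 are the reduced counts
        share = memory_per_process_gb(NODE_MEMORY_GB, cores)
        print(f"{cores} cores per node: {share} GB per process")
```

Halving the number of cores used per node doubles the memory available to each process, at the cost of needing more nodes for the same number of processes.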

Please direct further questions to the system administration.