This version (2022/06/20 09:01) was approved by msiegel.

A node ran out of memory during the execution of a user job.

An e-mail was sent to the user, containing a line similar to:

Sep 11 10:28:16 r01n08 (mpirun) Starting on r01n09: Sep 11 08:46:43 jz r01n09 254681 3

The parts of this line are:

  • 'Sep 11 10:28:16' is the time of the 'out of memory event'.
  • 'r01n08' is the node of the 'out of memory event'. This may be a slave node of the job.
  • '(mpirun)' is the executable which was killed. All children of this executable have died, too. (Only one message is sent per node, i.e., for the first executable killed.)
  • 'Starting on r01n09:' names the master node of the job.
  • 'Sep 11 08:46:43' is the time the job was started.
  • 'jz' is the user id.
  • 'r01n09' is again the master node.
  • '254681' is the job id.
  • '3' is the task id (if applicable).
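
For users who want to process such notifications automatically, the field layout described above can be sketched as a small Python parser. This is only an illustration based on the single sample line shown on this page: the function name `parse_oom_line` and the dictionary keys are hypothetical, and the exact format of real messages may differ.

```python
import re

# Pattern derived from the one sample line on this page (an assumption:
# real messages may vary in spacing or contain other fields).
_TIMESTAMP = r"\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}"
_OOM_LINE = re.compile(
    r"^(?P<oom_time>" + _TIMESTAMP + r")\s+"
    r"(?P<oom_node>\S+)\s+"
    r"\((?P<executable>[^)]*)\)\s+"
    r"Starting on (?P<master_node>[^\s:]+):\s+"
    r"(?P<start_time>" + _TIMESTAMP + r")\s+"
    r"(?P<user>\S+)\s+"
    r"(?P<master_node_repeat>\S+)\s+"
    r"(?P<job_id>\d+)"
    r"(?:\s+(?P<task_id>\d+))?$"  # the task id is only present if applicable
)


def parse_oom_line(line):
    """Return the fields of an out-of-memory notification line as a dict.

    Returns None if the line does not match the assumed layout.
    """
    match = _OOM_LINE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = ("Sep 11 10:28:16 r01n08 (mpirun) Starting on r01n09: "
              "Sep 11 08:46:43 jz r01n09 254681 3")
    fields = parse_oom_line(sample)
    for name, value in fields.items():
        print(f"{name}: {value}")
```

If the task id is absent, the `task_id` entry is `None`; all other values are returned as strings.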

If your job requires more memory per core, consider using only 8 or even 4 cores per node, so that each process gets a larger share of the node's memory.
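
The effect of running fewer processes per node can be estimated with a short calculation. The sketch below is purely illustrative: the node memory of 64 GiB and the process count of 16 are hypothetical values, not figures from this page, and the estimate ignores memory used by the operating system. Substitute the actual values for the nodes you use.

```python
def memory_per_process_gib(node_memory_gib, processes_per_node):
    """Rough upper bound on the memory available to each process on a node."""
    if processes_per_node <= 0:
        raise ValueError("processes_per_node must be positive")
    return node_memory_gib / processes_per_node


if __name__ == "__main__":
    NODE_MEMORY_GIB = 64.0  # hypothetical value, check your node type
    # 16 is a hypothetical full-node count; 8 and 4 are the reduced
    # counts suggested in the text.
    for processes in (16, 8, 4):
        share = memory_per_process_gib(NODE_MEMORY_GIB, processes)
        print(f"{processes:2d} processes per node: about {share:.1f} GiB each")
```

Halving the number of processes per node doubles the share of memory available to each one, at the cost of leaving cores idle.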

Please direct further questions to the system administration.

  • Last modified: 2014/10/02 13:27
  • by ir