==== Out of memory ====

A node ran out of memory during the execution of a user job.

An e-mail was sent to the user, containing a line similar to
<code>Sep 11 10:28:16 r01n08 (mpirun) Starting on r01n09: Sep 11 08:46:43 jz r01n09 254681 3</code>
The parts of this line are:
  * **Sep 11 10:28:16** is the time of the 'out of memory event'.
  * **r01n08** is the node on which the 'out of memory event' occurred. This may be a slave node of the job.
  * **(mpirun)** is the executable that was killed. All children of this executable have died, too. (Only one message is sent per node, i.e., for the first executable killed.)
  * **Starting on r01n09:** names the master node of the job.
  * **Sep 11 08:46:43** is the time the job was started.
  * **jz** is the user id.
  * **r01n09** is again the master node.
  * **254681** is the job id.
  * **3** is the task id (if applicable).
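
The field breakdown above can be sketched as a small parser, which may help when many such e-mails have to be sorted by job or node. This is a minimal sketch that assumes every notification follows the layout of the single sample line shown here. The function name ''parse_oom_line'' and the field names are hypothetical and not part of any cluster tool.

```python
import re
from typing import Optional

# Assumed layout, inferred only from the one sample line on this page:
# <event time> <event node> (<executable>) Starting on <master>: \
#     <start time> <user> <master> <job id> [<task id>]
TIMESTAMP = r"\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}"
OOM_LINE = re.compile(
    rf"^(?P<event_time>{TIMESTAMP}) "
    r"(?P<event_node>\S+) "
    r"\((?P<executable>[^)]*)\) "
    r"Starting on (?P<master_node>\S+): "
    rf"(?P<start_time>{TIMESTAMP}) "
    r"(?P<user>\S+) "
    r"(?P<master_node_repeated>\S+) "
    r"(?P<job_id>\d+)"
    r"(?: (?P<task_id>\d+))?$"  # the task id is only present if applicable
)


def parse_oom_line(line: str) -> Optional[dict]:
    """Split one notification line into named fields.

    Returns None if the line does not match the assumed layout.
    """
    match = OOM_LINE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


if __name__ == "__main__":
    sample = ("Sep 11 10:28:16 r01n08 (mpirun) Starting on r01n09: "
              "Sep 11 08:46:43 jz r01n09 254681 3")
    for name, value in parse_oom_line(sample).items():
        print(f"{name}: {value}")
```

All values are returned as strings. If the task id is absent, the ''task_id'' field is ''None''.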

If your job requires [[memory|more memory]] per core, consider using only 8 or even 4 cores per node.
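
The effect of using fewer cores per node can be estimated with simple arithmetic: the memory of a node is shared by the processes running on it, so halving the number of cores used roughly doubles the memory available to each process. The sketch below assumes a hypothetical node with 32 GB of memory and 16 cores. Neither figure comes from this page, so substitute the values of the actual nodes. It also ignores memory used by the operating system.

```python
def memory_per_core_gb(node_memory_gb: float, cores_used: int) -> float:
    """Return the memory share per core if the job uses `cores_used` cores."""
    if cores_used <= 0:
        raise ValueError("cores_used must be a positive integer")
    return node_memory_gb / cores_used


if __name__ == "__main__":
    NODE_MEMORY_GB = 32.0  # hypothetical value, not taken from this page
    for cores in (16, 8, 4):  # 16 is an assumed full node; 8 and 4 as suggested
        share = memory_per_core_gb(NODE_MEMORY_GB, cores)
        print(f"{cores:2d} cores per node: about {share:.1f} GB per core")
```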

Please direct further questions to the **[[doku:contact|system administration]]**.
  • Last modified: 2017/09/01 10:00
  • by ir