Differences

This shows you the differences between two versions of the page.

doku:oom [2012/09/12 14:10] – external edit 127.0.0.1
doku:oom [Unknown date] (current) – external edit (Unknown date) 127.0.0.1
Line 1: Line 1:
 -1397030115
 +==== Out of memory ====
 + 
 +A node ran out of memory during the execution of a user job. 
 + 
 +An email was sent to the user, containing a line similar to
 +<code>Sep 11 10:28:16 r01n08 (mpirun) Starting on r01n09: Sep 11 08:46:43 jz r01n09 254681 3</code>
 +The parts of this line are:
 +  * **Sep 11 10:28:16** is the time of the out-of-memory event.
 +  * **r01n08** is the node on which the out-of-memory event occurred. This may be a slave node of the job.
 +  * **(mpirun)** is the executable that was killed. All child processes of this executable were terminated as well. (Only one message is sent per node, namely for the first executable killed.)
 +  * **Starting on r01n09:** names the master node of the job.
 +  * **Sep 11 08:46:43** is the time the job was started.
 +  * **jz** is the user id.
 +  * **r01n09** is again the master node.
 +  * **254681** is the job id.
 +  * **3** is the task id (if applicable).
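
The field layout described in the list above can be sketched as a small parser. This is only an illustration: the regular expression is inferred from the single sample line shown on this page, and the names `OOM_LINE`, `parse_oom_line` and the dictionary keys are hypothetical, not part of any cluster tool.

```python
import re
from typing import Dict, Optional

# Assumed layout, inferred only from the one sample line on this page:
# <event time> <event node> (<executable>) Starting on <master node>:
# <job start time> <user id> <master node> <job id> [<task id>]
TIMESTAMP = r"\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}"
OOM_LINE = re.compile(
    rf"^(?P<event_time>{TIMESTAMP}) "
    r"(?P<event_node>\S+) "
    r"\((?P<executable>[^)]+)\) "
    r"Starting on (?P<master_node>\S+): "
    rf"(?P<start_time>{TIMESTAMP}) "
    r"(?P<user>\S+) "
    r"(?P<master_node_repeated>\S+) "
    r"(?P<job_id>\d+)"
    r"(?: (?P<task_id>\d+))?$"  # the task id is optional
)


def parse_oom_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Split an out-of-memory notification line into named fields.

    Returns None if the line does not match the assumed layout.
    """
    match = OOM_LINE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


sample = (
    "Sep 11 10:28:16 r01n08 (mpirun) Starting on r01n09: "
    "Sep 11 08:46:43 jz r01n09 254681 3"
)
fields = parse_oom_line(sample)
if fields is not None:
    # The event node differs from the master node here, so the memory
    # ran out on a slave node of the job.
    on_master = fields["event_node"] == fields["master_node"]
    print(fields["job_id"], fields["event_node"], "master" if on_master else "slave")
```

All values are returned as strings (or `None` for a missing task id), so the caller decides how to interpret them. The timestamps carry no year and are therefore left unparsed.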
 + 
 +If your job requires [[memory|more memory]] per core, you might consider using only 8 or even 4 cores per node.
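
The effect of using fewer cores per node is simple arithmetic: the node's memory is shared among fewer processes. The sketch below assumes one process per core and a node with 32 GiB of usable memory. Both are hypothetical; this page does not state the node size, so check the linked memory page for the real value.

```python
def memory_per_core_gib(node_memory_gib: float, cores_used: int) -> float:
    """Memory available to each process if the node's memory is shared evenly."""
    if cores_used <= 0:
        raise ValueError("cores_used must be positive")
    return node_memory_gib / cores_used


# Hypothetical node size, chosen only for illustration.
NODE_MEMORY_GIB = 32.0

for cores in (8, 4):
    share = memory_per_core_gib(NODE_MEMORY_GIB, cores)
    print(f"{cores} cores per node: {share:.1f} GiB per core")
```

Halving the number of cores used per node doubles the memory available to each process, at the cost of needing more nodes for the same number of processes.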
 + 
 +Please direct further questions to the **[[doku:contact|system administration]]**.
  • doku/oom.1347459015.txt.gz
  • Last modified: 2014/10/02 13:27
  • (external edit)