
On the current cluster (“smmpmech.unileoben.ac.at”):

  • X server (VNC) runs on head node
  • X clients (=applications, e.g. Fluent) run on compute nodes
  • X clients communicate with X server
    • over “physical” network (Infiniband)
    • with the inefficient X protocol
    • many clients with one server
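
For illustration, this is roughly what happens today (a simplified sketch; the display number :1 is an assumption and depends on the actual VNC session):

  # on the compute node, the job points its X clients at the VNC/X server
  # on the head node; every graphics update then travels over the network
  # using the X protocol
  export DISPLAY=smmpmech.unileoben.ac.at:1
  fluent   # all GUI traffic goes to the head node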

Problems with the current method:

  • X clients (=applications) die when connection is lost
    • therefore the head node cannot be rebooted without killing jobs
    • sometimes VNC crashes or gets stuck
  • many clients are displayed on one server
    • one misbehaving client can block the server
  • communication (X server ↔ X client)
    • takes a lot of CPU power on the head node
    • can slow down the application (we have experienced up to 60% performance loss with Fluent)
    • this is why minimizing the Fluent window helps (no graphic updates ⇒ no communication)

On the new cluster we have:

  • X servers (Xpra) run on compute nodes
    • one server per application
  • X clients (applications) run on compute nodes
  • client and server communicate directly on the same machine
    • each client with its own server
    • no “physical” network involved
  • to see the actual output you must attach to the Xpra server with an Xpra client
    • use sbatch+display to submit and display a job
    • use display-all to display graphical output of all your jobs
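
Under the hood this corresponds roughly to the following sketch (the sbatch+display wrapper takes care of these steps for you; the display number :100 and the use of Fluent are just examples):

  # inside the job, on the compute node: start a private Xpra server and
  # launch the application as its only X client
  xpra start :100 --start-child=fluent
  # client and server now talk to each other locally on the compute node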

Solved problems with this method:

  • X clients (=applications) no longer die when connection is lost
    • login nodes can be rebooted at any time
    • simply detach the Xpra connection while you are not watching, so that it does not slow down the application (e.g. Fluent)
  • misbehaving X clients can only block their own server
  • communication (X server ↔ X client) stays on the compute node ⇒ fast
  • communication Xpra server ↔ Xpra client
    • can be detached and reattached any time
    • uses efficient Xpra protocol
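
Attaching and detaching might look like this (a sketch; the node name node042, the display number 100 and the ssh transport are assumptions — on the cluster the sbatch+display and display-all helpers handle this for you):

  # from a login node: attach to the job's Xpra server on its compute node
  xpra attach ssh://node042/100

  # detach again at any time by interrupting the attach command (Ctrl+C)
  # or with xpra's detach subcommand; the application keeps running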

When users can allocate only whole nodes:

  • everything is easier for the admins
  • no jobs are getting in the way of each other on the same node
  • memory is given implicitly by number and kind of nodes
    • no need for the user to specify memory in the job script (see the example below)
  • no fragmentation and problems related to fragmentation
    • e.g. partial nodes are free but the user needs a whole node
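
For comparison, requesting a whole node in Slurm could look like this (a minimal sketch; the program name is a placeholder):

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --exclusive      # whole node: all cores and all memory belong to this job
  # memory is implied by the node type, so no --mem is needed in this model
  srun ./my_simulation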

But there are also disadvantages with whole nodes:

  • single-core jobs are more complicated for the user
    • users must manage the execution of many small jobs on one node themselves (see the sketch after this list)
  • if a cluster consists of relatively few nodes, its utilization will be worse
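
With whole-node allocation, packing many serial runs into one job is the user's responsibility, e.g. (a minimal sketch assuming 28 cores per node and a hypothetical serial program ./my_tool):

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --exclusive
  # start 28 independent serial runs and wait for all of them to finish
  for i in $(seq 1 28); do
      ./my_tool input_$i > output_$i &
  done
  wait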

Therefore we have decided to allow the use of partial nodes.


This means that you must specify the memory requirement of your jobs:

  • because of shared node usage (also referred to as “single core jobs”)
  • to avoid “killing” nodes by using too much memory
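
A job on a shared node should therefore state its memory requirement explicitly, e.g. (a sketch; the numbers and the program name are placeholders):

  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --mem=4G         # memory this job is allowed to use on the shared node
  srun ./my_tool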

How fair-share scheduling affects your jobs:

  • as long as there are free resources (e.g. cores): no effect can be seen
  • as soon as jobs have to compete for resources:
    • history of user / group comes into play
    • scheduler prefers jobs of users/groups which have not had their fair share yet
  • in the long run this allocates resources according to the predefined percentages
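
You can inspect your current fair-share standing with Slurm's sshare command, e.g.:

  # long listing of the fair-share factors for your own user
  sshare -l -u $USER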

Why memory must also be taken into account:

  • not only cores but also memory can “block” nodes
  • e.g. 1 core and 120 GB of RAM block an entire E5-2690v4 node and are in this way equivalent to 28 cores
  • therefore memory must also count
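
For example, such a high-memory request might look like this (a sketch; it assumes roughly 128 GB of RAM per E5-2690v4 node and uses a placeholder program name):

  #!/bin/bash
  #SBATCH --ntasks=1
  #SBATCH --mem=120G   # leaves almost no memory for other jobs on the node,
                       # so its remaining 27 cores are effectively blocked as well
  srun ./my_memory_hungry_tool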
