----
===== Xpra vs. VNC (1) =====
On the current cluster (“smmpmech.unileoben.ac.at”):
  * X server (VNC) runs on the head node
  * X clients (= applications, e.g. Fluent) run on the compute nodes
  * X clients communicate with the X server
    * over the “physical” network (InfiniBand)
    * with the inefficient X protocol
    * many clients with one server
----
===== Xpra vs. VNC (2) =====
Problems with the current method:
  * X clients (= applications) die when the connection is lost
    * therefore the head node cannot be rebooted without killing jobs
  * sometimes VNC crashes or gets stuck
  * many clients are displayed on one server
    * one misbehaving client can block the server
  * communication (X server <-> X client)
    * takes a lot of CPU power on the head node
    * can slow down the application (we have seen up to 60% performance loss with Fluent)
    * that is why minimizing the Fluent window helps (no graphic updates => no communication)
----
===== Xpra vs. VNC (3) =====
On the new cluster we have:
  * X servers (Xpra) run on the compute nodes
    * one server per application
  * X clients (applications) run on the compute nodes
  * client and server communicate directly on the same machine
    * each client with its own server
    * no “physical” network involved
  * to see the actual output you must attach to the Xpra server with an Xpra client
    * use ''%%sbatch+display%%'' to submit and display a job
    * use ''%%display-all%%'' to display the graphical output of all your jobs
----
===== Xpra vs. VNC (4) =====
Solved problems with this method:
  * X clients (= applications) no longer die when the connection is lost
    * login nodes can be rebooted at any time
  * simply detach the Xpra connection when you are not watching it, so that it does not slow down the application (e.g. Fluent); see the example on the next slide
  * misbehaving X clients can only block their own server
  * communication (X server <-> X client) stays on the compute node => fast
  * communication Xpra server <-> Xpra client
    * can be detached and reattached at any time
    * uses the efficient Xpra protocol
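----
===== Xpra: attach / detach by hand (example) =====
A minimal sketch of attaching to and detaching from an Xpra session manually; the node name and display number are placeholders, and on this cluster the ''%%sbatch+display%%'' / ''%%display-all%%'' wrappers normally take care of this for you:

<code bash>
# attach to the Xpra server of a job running on compute node n001, display :100
# (host name and display number are examples only)
xpra attach ssh://n001/100

# to detach without stopping the application, press Ctrl+C in the terminal
# where "xpra attach" is running; the job keeps running on the compute node
</code>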
----
===== whole nodes vs. partial nodes (1) =====
=== When users can allocate only whole nodes: ===
  * everything is easier for the admins
  * jobs on the same node do not get in each other's way
  * memory is given implicitly by the number and kind of nodes
    * no need for the user to specify memory in the job script
  * no fragmentation and no problems related to fragmentation
    * e.g. partial nodes are free but the user needs a whole node
----
===== whole nodes vs. partial nodes (2) =====
=== But there are also disadvantages with whole nodes: ===
  * single core jobs are more complicated for the user
    * the user must manage the execution of many small jobs on one node himself/herself
  * if a cluster consists of relatively few nodes, its utilization will be worse

Therefore we have decided to allow the use of partial nodes.
----
===== why must the memory be specified? =====
  * because of shared node usage (also referred to as “single core jobs”)
  * to avoid “killing” nodes by using too much memory
  * see the example job script at the end of this section
----
===== how does fair share scheduling work? =====
  * as long as there are free resources (e.g. cores): no effect can be seen
  * as soon as jobs have to compete for resources:
    * the history of the user / group comes into play
    * the scheduler prefers jobs of users/groups which have not yet had their fair share
  * in the long run this allocates resources according to the predefined percentages
----
===== why do we count memory for fair share? =====
  * not only cores but also memory can “block” nodes
    * e.g. 1 core and 120 GB of RAM block an entire E5-2690v4 node and are in this way equivalent to 28 cores
  * therefore memory must also count
----
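===== example: requesting memory in a job script =====
A minimal sketch of a Slurm job script that specifies memory explicitly; the resource values and the executable ''%%my_application%%'' are placeholders for illustration, not recommended settings for this cluster:

<code bash>
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4      # 4 cores on one node (placeholder value)
#SBATCH --mem=16G              # memory for the whole job (placeholder value)
#SBATCH --time=01:00:00

srun ./my_application          # placeholder executable
</code>
----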
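===== example: checking how much memory a job actually used =====
Assuming Slurm accounting is available on the cluster, ''%%sacct%%'' can show the peak memory usage of a finished job, which helps to choose a sensible ''%%--mem%%'' value; the job ID below is only a placeholder:

<code bash>
# requested vs. actually used memory of a finished job (job ID is an example)
sacct -j 123456 --format=JobID,JobName,ReqMem,MaxRSS,Elapsed,State
</code>
----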