====== Storage Infrastructure of VSC2 and VSC3 ======

  * Motivation
    * Know how an HPC storage stack can be integrated into a supercomputer (and gain some additional insights)
    * Know the hard- and software to harness its full potential

====== VSC2 - Basic Infos ======

  * VSC2
    * Procured in 2011
    * Installed by MEGWARE
  * 1314 Compute Nodes
    * 2 CPUs each (AMD Opteron 6132 HE, 2.2 GHz)
    * 32 GB RAM
    * 2x Gigabit Ethernet
    * 1x Infiniband QDR
  * Login Nodes, Head Nodes, Hypervisors, etc.

====== VSC2 - File Systems ======

  * File Systems on VSC2
    * User Homes (/home/lvXXXXX)
    * Global (/fhgfs)
    * Scratch (/fhgfs/nodeName)
    * TMP (/tmp)
  * Change filesystems with 'cd'
    * cd /global   # go into Global
    * cd ~         # go into Home
  * Use your application's settings to set input/output directories
  * Redirect the output into your home with './myApp.sh > ~/out.txt 2>&1' (captures stdout and stderr)

====== VSC - Fileserver ======

{{.:fileserver.jpg}}

====== VSC - Why different Filesystems ======

  * Home File Systems
    * BeeGFS (FhGFS) was very slow when it came to small-file I/O
    * This leads to severe slowdowns when compiling applications or working with small files in general
  * We do not recommend using small files; use big files when possible
  * However, users still want to
    * Compile programs
    * Do some testing
    * Write (small) log files
    * etc.

====== VSC - Why different Filesystems ======

  * Global
    * Parallel filesystem (BeeGFS)
    * Perfectly suited for large sequential transfers
  * TMP
    * Uses the main memory (RAM) of the server, up to 1/2 of the whole memory
    * Users can access it like a file system, but they get the speed of byte-wise addressable storage
    * Random I/O is blindingly fast
    * But it comes at the price of main memory
    * If you are not sure how to map your file from GLOBAL/HOME into memory, just copy it to TMP and you are ready to go

====== VSC - Why different Filesystems ======

  * Small I/Os
    * Random access --> TMP
    * Others --> HOME
  * Large I/Os
    * Random access --> TMP
    * Sequential access --> GLOBAL

====== VSC2 - Storage Homes ======

  * VSC2 - Homes
    * 6 File Servers for Home
    * 2 CPUs each (Intel Xeon E5620 @ 2.40 GHz - Westmere)
    * 48 GB RAM
    * 1x Infiniband QDR
    * LSI RAID Controller
  * Homes are stored on RAID6 volumes
    * 10+2 disks per array, up to 3 arrays per server (depends on usage)
  * Exported via NFS to the compute nodes (no RDMA)
  * Each project resides on 1 server

====== VSC2 - Storage Global ======

  * VSC2 - Global
    * 8 File Servers
    * 2 CPUs each (Intel Xeon E5620 @ 2.40 GHz - Westmere)
    * 192 GB RAM
    * 1x Infiniband QDR
    * LSI RAID Controller
  * OSTs consist of 24 disks (22+1p+1 hot spare, RAID5)
  * One metadata target per server
    * 4x Intel X25-E 64 GB SSDs
  * Up to 6000 MB/s throughput
  * ~160 TB capacity

====== VSC2 - Storage Summary ======

  * 14 Servers
  * ~400 spinning disks
  * ~25 SSDs
  * 2 Filesystems (Home + Global)
  * 1 Temporary Filesystem

====== VSC3 - Basic Infos ======

  * Procured in 2014
  * Installed by CLUSTERVISION
  * 2020 Nodes
    * 2 CPUs each (Intel Xeon E5-2650 v2 @ 2.60 GHz)
    * 64 GB RAM
    * 2x Infiniband QDR (dual rail)
    * 2x Gigabit Ethernet
  * Login Nodes, Head Nodes, Hypervisors, Accelerator Nodes, etc.

====== VSC3 - File Systems ======

  * User Homes (/home/lvXXXXX)
  * Global (/global)
  * Scratch (/scratch)
  * EODC-GPFS (/eodc)
  * BINFS (/binfs)
  * BINFL (/binfl)
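A quick way to see how much space is left on these mount points before placing a working set is plain ''df''. A minimal sketch, assuming the paths listed above; the project directory name is only a placeholder:

<code bash>
# Show size, usage and free space of the VSC3 filesystems (paths as listed above).
# Replace lv70999 with your own project directory (lvXXXXX).
df -h /home/lv70999 /global /scratch /eodc /binfl /binfs
</code>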
====== VSC3 - Home File System ======

  * VSC3 - Homes
    * 9 Servers (Intel Xeon E5620 @ 2.40 GHz)
    * 64 GB RAM
    * 1x Infiniband QDR
    * LSI RAID Controller
  * Homes are stored on RAID6 volumes
    * 10+2 disks per array, up to 3 arrays per server (depends on usage)
  * Exported via NFS to the compute nodes (no RDMA)
  * 1 server without RAID controller, running ZFS
  * Quotas are enforced

====== VSC3 - Global File System ======

  * VSC3 - Global
    * 8 Servers (Intel Xeon E5-1620 v2 @ 3.70 GHz)
    * 128 GB RAM
    * 1x Infiniband QDR
    * LSI RAID Controller
  * 4 OSTs per server
    * 48 disks: 4x (10+2p)
  * 1 metadata target per server
    * 2 SSDs each (RAID1 / mirrored)
  * Up to 20,000 MB/s throughput
  * ~600 TB capacity

====== VSC3 - GPFS ======

  * IBM GPFS - General Parallel File System / Elastic Storage / Spectrum Scale
  * Released in 1998 as a multimedia filesystem (mmfs)
  * Linux support since 1999
  * Has been in use on many supercomputers
  * Separate storage pools
  * CES - Cluster Export Services
  * Can be linked to a Hadoop cluster
  * Supports
    * Striping
    * n-way replication
    * RAID (Spectrum Scale RAID)
    * Fast rebuild
    * ...

====== VSC3 - EODC Filesystem ======

  * VSC3 <-> EODC
  * 4 IBM ESS Servers (~350 spinning disks each)
    * Running GPFS 4.2.3
  * 12 Infiniband QDR links
  * Up to 14 GB/s sequential write and 26 GB/s sequential read
  * Multi-tiered (IBM HSM)
    * Least recently used files are written to tape
    * A stub file stays in the file system
    * Transparent recall after access
    * Support for schedules, callbacks, migration control, etc.
  * Sentinel satellite data
    * Multiple petabytes
    * Needs backups at different locations

====== VSC3 - EODC Filesystem ======

  * VSC3 is a "remote cluster" for the EODC GPFS filesystem
  * 2 management servers
    * Tie-breaker disk for quorum
  * No local filesystems (only remote filesystems from EODC)
  * Up to 500 VSC clients can use GPFS in parallel

====== VSC3 - Bioinformatics ======

  * VSC3 got a "bioinformatics" upgrade in late 2016
  * 17 Nodes
    * 2x Intel Xeon E5-2690 v4 @ 2.60 GHz (14 cores each / 28 with hyperthreading)
    * Each node has at least 512 GB RAM
    * 1x Infiniband QDR (40 Gbit/s)
    * 1x Omni-Path (100 Gbit/s)
    * 12 spinning disks
    * 4 NVMe drives (Intel DC P3600)
  * These nodes export 2 filesystems to VSC (BINFL and BINFS)

====== VSC3 - Bioinformatics ======

{{.:binf.jpg}}

====== VSC3 - BINFL Filesystem ======

  * Use for I/O-intensive bioinformatics jobs
  * ~1 PB space (quotas enforced)
  * BeeGFS filesystem
  * Metadata servers
    * Metadata on datacenter SSDs (RAID10)
    * 8 metadata servers
  * Object storage
    * Disk storage configured as RAID6
    * 12 disks per target / 1 target per server / 16 servers total
  * Up to 40 GB/s write speed

====== VSC3 - BINFS Filesystem ======

  * Use for very I/O-intensive jobs
  * ~100 TB space (quotas enforced)
  * BeeGFS filesystem
  * Metadata servers
    * Metadata on datacenter SSDs (RAID10)
    * 8 metadata servers
  * Object storage
    * Datacenter SSDs are used instead of traditional disks
    * No redundancy: see it as (very) fast, low-latency scratch space; data may be lost after a hardware failure
    * 4x Intel P3600 2 TB datacenter SSDs per server
    * 16 storage servers
  * Up to 80 GB/s via the Omni-Path interconnect
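Because BINFS is fast but not redundant, the usual pattern is to stage data in, run the I/O-intensive part there, and copy the results back to a backed-up filesystem. A minimal sketch, assuming a per-user directory under /binfs and placeholder file and application names (not an official VSC recipe):

<code bash>
#!/bin/bash
# Staging sketch for the BINFS scratch space (directory layout, file and app names are placeholders).
# BINFS has no redundancy: anything left here may be lost after a hardware failure.
SCRATCH=/binfs/$USER/run_$$          # hypothetical per-user scratch directory
mkdir -p "$SCRATCH"

cp ~/input/data.in "$SCRATCH"/                    # stage input in from the backed-up home
./myApp "$SCRATCH"/data.in "$SCRATCH"/data.out    # do the I/O-intensive work on the SSDs

cp "$SCRATCH"/data.out ~/results/                 # copy results back to home (or /global)
rm -rf "$SCRATCH"                                 # clean up the scratch space
</code>

The same pattern works for the node-local TMP filesystem; only the scratch path changes.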
====== VSC3 - Storage Summary ======

  * 33 Servers
  * ~800 spinning disks
  * ~100 SSDs
  * 5 Filesystems (Home, Global, EODC, BINFS, BINFL)
  * 1 Temporary Filesystem

====== Storage Performance ======

{{.:vsc3_storage_performance.png}}

====== Temporary Filesystems ======

  * Use for
    * Random I/O
    * Many small files
  * Data gets deleted after the job
    * Write results to $HOME or $GLOBAL
  * Disadvantages
    * Disk space is consumed from main memory
    * Alternatively, the mmap() system call can be used
      * Keep in mind that mmap() uses lazy loading
    * Very small files waste main memory (memory-mapped files are aligned to the page size)

====== Addendum: Backup ======

  * We keep some backups, but only for disaster recovery
  * Systems (node images, head nodes, hypervisors, VMs, module environment)
    * via rsnapshot
  * Home
    * Home filesystems are backed up
    * via a self-written, parallelized rsync script
  * Global
    * Metadata is backed up, but if a total disaster happens this won't help
  * Although nothing has happened for some time, this is high performance computing and redundancy is minimal
  * Keep an eye on your data. If it's important, you should back it up yourself.
    * Use rsync (see the example at the end of these slides)

====== Addendum: Big Files ======

  * What does an administrator mean by 'don't use small files'?
  * It depends
    * On /global, fopen --> fclose takes ~100 microseconds
    * On the VSC2 home filesystems, more than 100 million files are stored
      * A check of which files have changed takes more than 12 hours, without even reading file contents or copying
  * What we mean is: use reasonable file sizes according to your working set and your throughput needs
    * Reading a 2 GB file from SSD takes less than one second. Files which are only a few MB in size will slow down your processing.
    * If you need high throughput with a working set >1 TB, a file size >= 10^8 bytes (100 MB) is reasonable
    * File sizes < 10^6 bytes (1 MB) are problematic if you plan to use many files
  * If you want to do random I/O, copy your files to TMP
  * Storage works well when the block size is reasonable for the storage system (on VSC3 a few megabytes are enough)
  * Do not create millions of files
    * If every user had millions of files, we would run into problems
  * Use 'tar' to archive unneeded files
    * tar -cvpf myArchive.tar myFolderWithFiles/
    * tar -cvjpf myArchive.tar.bz2 myFolderWithFiles/   # uses bzip2 compression
  * Extract with
    * tar xf myArchive.tar

====== The End ======

Thank you for your attention
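====== Addendum: rsync Example ======

The backup slide recommends rsync for backing up your own important data. A minimal sketch; the destination host and all paths are placeholders, not actual VSC services:

<code bash>
# Mirror a project directory to an external machine (host and paths are placeholders).
# -a preserves permissions and timestamps, -v is verbose, --delete also removes files deleted locally.
rsync -av --delete ~/myProject/ user@my-backup-host:/backups/myProject/

# For many small files, archive first (see the tar examples above) and transfer a single file instead:
tar -cvjpf myProject.tar.bz2 ~/myProject/
rsync -av myProject.tar.bz2 user@my-backup-host:/backups/
</code>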