====== Storage Infrastructure of VSC2 and VSC3 ======

  * Motivation
    * Know how an HPC storage stack can be integrated into a supercomputer
    * and gain some additional insights
    * Know the hardware and software needed to harness its full potential

====== VSC2 - Basic Infos ======

  * VSC2
    * Procured in 2011
    * Installed by MEGWARE
    * 1314 Compute Nodes
      * 2 CPUs each (AMD Opteron 6132 HE, 2.2 GHz)
      * 32 GB RAM
      * 2x Gigabit Ethernet
      * 1x Infiniband QDR
    * Login Nodes, Head Nodes, Hypervisors, …

====== VSC2 - File Systems ======

  * File Systems on VSC2
    * User Homes (/…)
    * Global (/fhgfs)
    * Scratch (/…)
    * TMP (/tmp)
  * Change filesystem with 'cd'
    * cd /global   # Go into Global
    * cd ~         # Go into Home
  * Use your application's settings to set in/out directories
  * Pipe the output into your home with './…' (see the sketch after this list)

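A minimal, hypothetical sketch of that idea: keep the large data on a fast filesystem and redirect the (small) textual output into your home. The program name ./my_solver and all paths are placeholders, not actual VSC tooling.

<code bash>
# Hypothetical example: run a program on data in /global, but redirect its
# (small) textual output into the home filesystem.
cd /global/my_project                          # large input/output data lives here
./my_solver input.dat > ~/my_solver.log 2>&1   # placeholder program and log name
</code>
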
====== VSC - Fileserver ======

(figure: VSC fileserver)

====== VSC - Why different Filesystems ======

  * Home File Systems
    * BeeGFS (FhGFS) was very slow when it came to small-file I/O
    * This leads to severe slowdowns when compiling applications and/or working with small files in general
    * We do not recommend using small files; use big files when possible. However, users still want to:
      * Compile Programs
      * Do some testing
      * Write (small) Log Files
      * etc.

====== VSC - Why different Filesystems ======

  * Global
    * Parallel filesystem (BeeGFS)
    * Perfectly suited for large sequential transfers
  * TMP
    * Uses the main memory (RAM) of the server, up to 1/2 of the whole memory
    * Users can access it like a file system, but they get the speed of byte-wise addressable storage
    * Random I/O is blindingly fast
    * But this comes at the price of main memory
    * If you are not sure how to map your file from GLOBAL/HOME into memory, just copy it to TMP and you are ready to go (see the sketch after this list)

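A minimal, hypothetical staging example for that case; the tool and file names are placeholders:

<code bash>
# Copy a file that needs random access from GLOBAL into the RAM-backed TMP,
# work on it there, then delete it to give the main memory back.
cp /global/my_project/random_access.db /tmp/
./my_tool --input /tmp/random_access.db     # placeholder program and option
rm /tmp/random_access.db
</code>
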
====== VSC - Why different Filesystems ======

  * Small I/Os
    * Random Access -> TMP
    * Others -> HOME
  * Large I/Os
    * Random Access -> TMP
    * Sequential Access -> GLOBAL

====== VSC2 - Storage Homes ======

  * VSC2 - Homes
    * 6 File Servers for Home
      * 2 CPUs each (Intel Xeon E5620 @ 2.40 GHz - Westmere)
      * 48 GB RAM
      * 1x Infiniband QDR
      * LSI RAID Controller
    * Homes are stored on RAID6 volumes
      * 10+2 disks per array, up to 3 arrays per server (depends on usage)
    * Exported via NFS to the compute nodes (no RDMA)
    * Each project resides on 1 server

====== VSC2 - Storage Global ======

  * VSC2 - Global
    * 8 File Servers
      * 2 CPUs each (Intel Xeon E5620 @ 2.40 GHz - Westmere)
      * 192 GB RAM
      * 1x Infiniband QDR
      * LSI RAID Controller
    * OSTs consist of 24 disks (22 + 1 parity + 1 hot spare, RAID5)
    * One Metadata Target per server
      * 4x Intel X25-E 64 GB SSDs
    * Up to 6000 MB/s throughput
    * ~ 160 TB capacity

====== VSC2 - Storage Summary ======

  * 14 Servers
  * ~ 400 spinning disks
  * ~ 25 SSDs
  * 2 Filesystems (Home + Global)
  * 1 Temporary Filesystem

====== VSC3 - Basic Infos ======

  * Procured in 2014
  * Installed by CLUSTERVISION
  * 2020 Nodes
    * 2 CPUs each (Intel Xeon E5-2650 v2 @ 2.60 GHz)
    * 64 GB RAM
    * 2x Infiniband QDR (Dual-Rail)
    * 2x Gigabit Ethernet
  * Login Nodes, Head Nodes, Hypervisors, …

====== VSC3 - File Systems ======

  * User Homes (/…)
  * Global (/global)
  * Scratch (/scratch)
  * EODC-GPFS (/eodc)
  * BINFS (/binfs)
  * BINFL (/binfl)

====== VSC3 - Home File System ======

  * VSC3 - Homes
    * 9 Servers (Intel Xeon E5620 @ 2.40 GHz)
      * 64 GB RAM
      * 1x Infiniband QDR
      * LSI RAID Controller
    * Homes are stored on RAID6 volumes
      * 10+2 disks per array, up to 3 arrays per server (depends on usage)
    * Exported via NFS to the compute nodes (no RDMA)
    * 1 server without RAID controller, running ZFS
    * Quotas are enforced

====== VSC3 - Global File System ======

  * VSC3 - Global
    * 8 Servers (Intel Xeon E5-1620 v2 @ 3.70 GHz)
      * 128 GB RAM
      * 1x Infiniband QDR
      * LSI RAID Controller
    * 4 OSTs per server
      * 48 disks, 4x (10+2p)
    * 1 Metadata Target per server
      * 2 SSDs each (RAID-1 / mirrored)
    * Up to 20,000 MB/s throughput
    * ~ 600 TB capacity

====== VSC3 - GPFS ======

  * IBM GPFS - General Parallel Filesystem / Elastic Storage / Spectrum Scale
    * Released in 1998 as a multimedia filesystem (mmfs)
    * Linux support since 1999
    * Has been in use on many supercomputers
  * Separate Storage Pools
  * CES - Cluster Export Services
  * Can be linked to a Hadoop cluster
  * Supports
    * Striping
    * n-way Replication
    * RAID (Spectrum Scale RAID)
    * Fast rebuild
    * …

====== VSC3 - EODC Filesystem ======

  * VSC3 <-> EODC
    * 4 IBM ESS Servers (~350 spinning disks each)
    * Running GPFS 4.2.3
    * 12 Infiniband QDR links
    * Up to 14 GB/s sequential write and 26 GB/s sequential read
  * Multi-tiered (IBM HSM)
    * Least recently used files are written to tape
    * A stub file stays in the file system
    * Transparent recall after access
    * Support for schedules, callbacks, migration control, etc.
  * Sentinel satellite data
    * Multiple petabytes
    * Needs backups at different locations

====== VSC3 - EODC Filesystem ======

  * VSC3 is a "remote cluster" for the EODC GPFS filesystem
    * 2 Management servers
    * Tie-breaker disk for quorum
    * No local filesystems (only remote filesystems from EODC)
  * Up to 500 VSC clients can use GPFS in parallel

====== VSC3 - Bioinformatics ======

  * VSC3 got a "bioinformatics" upgrade in late 2016
    * 17 Nodes
      * 2x Intel Xeon E5-2690 v4 @ 2.60 GHz (14 cores each / 28 with hyperthreading)
      * Each node has at least 512 GB RAM
      * 1x Infiniband QDR (40 Gbit/s)
      * 1x Omni-Path (100 Gbit/s)
      * 12 spinning disks
      * 4 NVMe SSDs (Intel DC P3600)
    * These nodes export 2 filesystems to VSC (BINFL and BINFS)

====== VSC3 - Bioinformatics ======

(figure: VSC3 bioinformatics nodes)

====== VSC3 - BINFL Filesystem ======

  * Use for I/O-intensive bioinformatics jobs
  * ~ 1 PB space (quotas enforced)
  * BeeGFS Filesystem
    * Metadata Servers
      * Metadata on Datacenter SSDs (RAID-10)
      * 8 Metadata Servers
    * Object Storage
      * Disk storage configured as RAID-6
      * 12 disks per target / 1 target per server / 16 servers total
  * Up to 40 Gigabyte/s throughput

====== VSC3 - BINFS Filesystem ======

  * Use for very I/O-intensive jobs
  * ~ 100 TB space (quotas enforced)
  * BeeGFS Filesystem
    * Metadata Servers
      * Metadata on Datacenter SSDs (RAID-10)
      * 8 Metadata Servers
    * Object Storage
      * Datacenter SSDs are used instead of traditional disks
      * No redundancy: see it as (very) fast, low-latency scratch space; data may be lost after a hardware failure
      * 4x Intel P3600 2 TB Datacenter SSDs per server
      * 16 Storage Servers
  * Up to 80 Gigabyte/s throughput

====== VSC3 - Storage Summary ======

  * 33 Servers
  * ~ 800 spinning disks
  * ~ 100 SSDs
  * 5 Filesystems
    * Home
    * Global
    * EODC
    * BINFS
    * BINFL
  * Temporary Filesystem

====== Storage Performance ======

(figure: storage performance comparison)

====== Temporary Filesystems ======

  * Use for
    * Random I/O
    * Many small files
  * Data gets deleted after the job
    * Write results to $HOME or $GLOBAL (see the sketch after this list)
  * Disadvantages
    * Disk space is consumed from main memory
  * Alternatively, the mmap() system call can be used
    * Keep in mind that mmap() uses lazy loading
    * Very small files waste main memory (memory-mapped files are aligned to page size)

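A sketch of that workflow as part of a generic batch job, assuming the data fits into the RAM-backed TMP; all directory and program names are placeholders:

<code bash>
# Hypothetical job body: do the small-file / random-I/O work in /tmp, then
# copy the final results to a persistent filesystem before the job ends
# (/tmp is wiped afterwards and consumes main memory while it exists).
WORKDIR=/tmp/${USER}_job
mkdir -p "$WORKDIR"
cp "$HOME"/input/*.cfg "$WORKDIR"/        # stage small input files
cd "$WORKDIR"
./my_analysis                             # placeholder; assume it writes results.out
cp results.out "$GLOBAL"/my_project/      # keep only the final result ($GLOBAL as on the slide)
rm -rf "$WORKDIR"                         # give the memory back
</code>
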
====== Addendum: Backup ======

  * We keep some backups, but only for disaster recovery
    * Systems (Node Images, Head Nodes, Hypervisors, …)
      * via rsnapshot
    * Home
      * Home filesystems are backed up
      * via a self-written, parallelized rsync script
    * Global
      * Metadata is backed up, but if a total disaster happens this won't help
  * Although nothing has happened for some time, this is high performance computing and redundancy is minimal
  * Keep an eye on your data. If it's important, you should back it up yourself.
    * Use rsync (see the example below)

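A hedged example of such a backup; the destination host and paths are placeholders for a machine at your home institution:

<code bash>
# Copy a project directory to an external machine.
# -a preserves permissions/timestamps, -v is verbose, -z compresses in transit.
rsync -avz --progress ~/my_project/ user@backup.example.org:/backups/my_project/
</code>
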
====== Addendum: Big Files ======

  * What does an administrator mean by 'don't use small files'?
    * It depends
    * On /global, fopen -> fclose takes ~100 microseconds
    * On the VSC2 Home filesystems, > 100 million files are stored
      * A check of which files have changed takes more than 12 hours, without even reading file contents or copying
  * What we mean is: use reasonable file sizes according to your working set and your throughput needs
    * Reading a 2 GB file from SSD takes less than one second; files that are only a few MB in size will slow down your processing
    * If you need high throughput with your > 1 TB working set, a file size >= 10^8 bytes is reasonable
    * File sizes < 10^6 bytes are problematic if you plan to use many files
    * If you want to do random I/O, copy your files to TMP
    * Storage works well when the blocksize is reasonable for the storage system (on VSC3 a few megabytes are enough; see the sketch after this list)
  * Do not create millions of files
    * If every user had millions of files, we'd run into problems
    * Use 'tar' to archive unneeded files
      * tar -cvpf myArchive.tar myFolderWithFiles/
      * tar -cvjpf myArchive.tar.bz2 myFolderWithFiles/   # with bzip2 compression
    * Extract with
      * tar xf myArchive.tar

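A small, hypothetical illustration of the blocksize point, using dd purely as a quick sequential-throughput test; the target path is a placeholder:

<code bash>
# Write one large file with a multi-megabyte blocksize (4 MiB blocks, ~8 GiB total)
dd if=/dev/zero of=/global/my_project/testfile bs=4M count=2048
# Read it back sequentially with the same blocksize
dd if=/global/my_project/testfile of=/dev/null bs=4M
</code>
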
====== The End ======

Thank you for your attention