====== Storage Infrastructure of VSC2 and VSC3 ======

  * Motivation
    * Know how an HPC storage stack is integrated into a supercomputer, and gain some additional insights
    * Know the hardware and software in order to harness its full potential

====== VSC2 - Basic Infos ======

  * VSC2
    * Procured in 2011
    * Installed by MEGWARE
    * 1314 Compute Nodes
      * 2 CPUs each (AMD Opteron 6132 HE, 2.2 GHz)
      * 32 GB RAM
      * 2x Gigabit Ethernet
      * 1x Infiniband QDR
    * Login Nodes, Head Nodes, Hypervisors, etc.

====== VSC2 - File Systems ======

  * File Systems on VSC2
    * User Homes (/home/lvXXXXX)
    * Global (/fhgfs)
    * Scratch (/fhgfs/nodeName)
    * TMP (/tmp)
  * Change file systems with 'cd'
    * cd /global # Go into Global
    * cd ~ # Go into Home
  * Use your application's settings to set input/output directories
  * Redirect the output into your home with './myApp.sh > ~/out.txt 2>&1' (see the sketch below)

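A minimal sketch of this pattern; the paths, the myApp.sh name, and its --input option are placeholders. It reads the large input from the parallel Global file system and keeps the small log file in your home:

<code bash>
# Hypothetical example - adjust paths and the application name to your project.
INPUT_DIR=/fhgfs/myProject/input     # large sequential input data on Global
LOG_DIR=~/logs                       # small log files belong in Home
mkdir -p "$LOG_DIR"

# stdout and stderr both end up in the log file
./myApp.sh --input "$INPUT_DIR" > "$LOG_DIR/out.txt" 2>&1
</code>
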
====== VSC - Fileserver ======

{{.:fileserver.jpg}}

====== VSC - Why different Filesystems ======

  * Home File Systems
    * BeeGFS (FhGFS) was very slow when it came to small-file I/O
      * This leads to severe slowdowns when compiling applications or working with small files in general
    * We do not recommend using small files; use big files whenever possible. However, users still want to
      * Compile programs
      * Do some testing
      * Write (small) log files
      * etc.

====== VSC - Why different Filesystems ======

  * Global
    * Parallel file system (BeeGFS)
    * Perfectly suited for large sequential transfers
  * TMP
    * Uses the main memory (RAM) of the server, up to 1/2 of the total memory
    * Users can access it like a file system but get the speed of byte-wise addressable storage
      * Random I/O is blindingly fast
      * But it comes at the price of main memory
    * If you are not sure how to map your file from GLOBAL/HOME into memory, just copy it to TMP and you are ready to go (see the sketch below)

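A minimal sketch of that workflow, with hypothetical file and application names: stage the file into TMP, do the random-access work there, and copy the results back before the job ends, because TMP is wiped afterwards.

<code bash>
# Hypothetical file and application names - adjust to your own data.
cp ~/data/lookup.db /tmp/lookup.db     # stage the file into RAM-backed TMP

./myRandomAccessApp /tmp/lookup.db     # random I/O now runs at memory speed

cp /tmp/results.out ~/                 # save results before the job ends
rm -f /tmp/lookup.db /tmp/results.out  # free the main memory again
</code>
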
====== VSC - Why different Filesystems ======

  * Small I/Os
    * Random Access -> TMP
    * Others -> HOME
  * Large I/Os
    * Random Access -> TMP
    * Sequential Access -> GLOBAL

====== VSC2 - Storage Homes ======

  * VSC2 - Homes
    * 6 File Servers for Home
      * 2 CPUs each (Intel Xeon E5620 @ 2.40 GHz - Westmere)
      * 48 GB RAM
      * 1x Infiniband QDR
      * LSI RAID Controller
    * Homes are stored on RAID-6 volumes
      * 10+2 disks per array, up to 3 arrays per server (depends on usage)
      * Exported via NFS to the compute nodes (no RDMA)
    * Each project resides on 1 server

====== VSC2 - Storage Global ======

  * VSC2 - Global
    * 8 File Servers
      * 2 CPUs each (Intel Xeon E5620 @ 2.40 GHz - Westmere)
      * 192 GB RAM
      * 1x Infiniband QDR
      * LSI RAID Controller
    * OSTs consist of 24 disks (22 + 1 parity + 1 hot spare, RAID-5)
    * One Metadata Target per server
      * 4x Intel X25-E 64 GB SSDs
    * Up to 6000 MB/s throughput
    * ~160 TB capacity

====== VSC2 - Storage Summary ======

  * 14 Servers
  * ~400 spinning disks
  * ~25 SSDs
  * 2 Filesystems (Home + Global)
  * 1 Temporary Filesystem

====== VSC3 - Basic Infos ======

  * Procured in 2014
  * Installed by CLUSTERVISION
  * 2020 Nodes
    * 2 CPUs each (Intel Xeon E5-2650 v2 @ 2.60 GHz)
    * 64 GB RAM
    * 2x Infiniband QDR (Dual-Rail)
    * 2x Gigabit Ethernet
  * Login Nodes, Head Nodes, Hypervisors, Accelerator Nodes, etc.

====== VSC3 - File Systems ======

  * User Homes (/home/lvXXXXX)
  * Global (/global)
  * Scratch (/scratch)
  * EODC-GPFS (/eodc)
  * BINFS (/binfs)
  * BINFL (/binfl)

====== VSC3 - Home File System ======

  * VSC3 - Homes
    * 9 Servers (Intel Xeon E5620 @ 2.40 GHz)
      * 64 GB RAM
      * 1x Infiniband QDR
      * LSI RAID Controller
    * Homes are stored on RAID-6 volumes
      * 10+2 disks per array, up to 3 arrays per server (depends on usage)
      * Exported via NFS to the compute nodes (no RDMA)
    * 1 server without RAID controller running ZFS
    * Quotas are enforced

====== VSC3 - Global File System ======

  * VSC3 - Global
    * 8 Servers (Intel Xeon E5-1620 v2 @ 3.70 GHz)
      * 128 GB RAM
      * 1x Infiniband QDR
      * LSI RAID Controller
    * 4 OSTs per server
      * 48 disks, 4x (10+2p)
    * 1 Metadata Target per server
      * 2 SSDs each (RAID-1 / mirrored)
    * Up to 20,000 MB/s throughput
    * ~600 TB capacity

====== VSC3 - GPFS ======

  * IBM GPFS - General Parallel File System / Elastic Storage / Spectrum Scale
    * Released in 1998 as a multimedia file system (mmfs)
    * Linux support since 1999
    * Has been in use on many supercomputers
    * Separate storage pools
    * CES - Cluster Export Services
    * Can be linked to a Hadoop cluster
  * Supports
    * Striping
    * n-way replication
    * RAID (Spectrum Scale RAID)
    * Fast rebuild
    * ...

====== VSC3 - EODC Filesystem ======

  * VSC3 <-> EODC
    * 4 IBM ESS Servers (~350 spinning disks each)
    * Running GPFS 4.2.3
    * 12 Infiniband QDR links
    * Up to 14 GB/s sequential write and 26 GB/s sequential read
  * Multi-tiered (IBM HSM)
    * Least recently used files are written to tape
    * A stub file stays in the file system
    * Transparent recall on access
    * Support for schedules, callbacks, migration control, etc.
  * Sentinel satellite data
    * Multiple petabytes
    * Needs backups at different locations

====== VSC3 - EODC Filesystem ======

  * VSC3 is a “remote cluster” for the EODC GPFS file system
    * 2 Management servers
      * Tie-breaker disk for quorum
    * No local file systems (only remote file systems from EODC)
    * Up to 500 VSC clients can use GPFS in parallel

====== VSC3 - Bioinformatics ======

  * VSC3 got a “bioinformatics” upgrade in late 2016
    * 17 Nodes
      * 2x Intel Xeon E5-2690 v4 @ 2.60 GHz (14 cores each / 28 with hyperthreading)
      * Each node has at least 512 GB RAM
      * 1x Infiniband QDR (40 Gbit/s)
      * 1x Omni-Path (100 Gbit/s)
      * 12 spinning disks
      * 4 NVMe SSDs (Intel DC P3600)
    * These nodes export 2 file systems to VSC (BINFL and BINFS)

====== VSC3 - Bioinformatics ======

{{.:binf.jpg}}

====== VSC3 - BINFL Filesystem ======

  * Use for I/O-intensive bioinformatics jobs
  * ~1 PB space (quotas enforced)
  * BeeGFS file system
  * Metadata Servers
    * Metadata on datacenter SSDs (RAID-10)
    * 8 Metadata Servers
  * Object Storage
    * Disk storage configured as RAID-6
    * 12 disks per target / 1 target per server / 16 servers total
  * Up to 40 GB/s write speed

====== VSC3 - BINFS Filesystem ======

  * Use for very I/O-intensive jobs
  * ~100 TB space (quotas enforced)
  * BeeGFS file system
  * Metadata Servers
    * Metadata on datacenter SSDs (RAID-10)
    * 8 Metadata Servers
  * Object Storage
    * Datacenter SSDs are used instead of traditional disks
      * No redundancy. Treat it as (very) fast, low-latency scratch space; data may be lost after a hardware failure.
    * 4x Intel P3600 2 TB datacenter SSDs per server
    * 16 Storage Servers
  * Up to 80 GB/s via the Omni-Path interconnect

====== VSC3 - Storage Summary ======

  * 33 Servers
  * ~800 spinning disks
  * ~100 SSDs
  * 5 Filesystems
    * Home
    * Global
    * EODC
    * BINFS
    * BINFL
  * 1 Temporary Filesystem

====== Storage Performance ======

{{.:vsc3_storage_performance.png}}

====== Temporary Filesystems ======

  * Use for
    * Random I/O
    * Many small files
  * Data gets deleted after the job
    * Write results to $HOME or $GLOBAL
  * Disadvantages
    * Disk space is consumed from main memory
  * Alternatively, the mmap() system call can be used
    * Keep in mind that mmap() loads file pages lazily
    * Very small files waste main memory (memory-mapped files are aligned to the page size)

====== Addendum: Backup ======

  * We keep some backups, but only for disaster recovery
    * Systems (node images, head nodes, hypervisors, VMs, module environment)
      * via rsnapshot
    * Home
      * Home file systems are backed up
      * via a self-written, parallelized rsync script
    * Global
      * Metadata is backed up, but if a total disaster happens this will not help
    * Although nothing has happened for some time, this is high-performance computing and redundancy is minimal
      * Keep an eye on your data; if it is important, back it up yourself
        * Use rsync (see the sketch below)

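A minimal sketch of such a user-side backup with rsync; the destination host and all paths are placeholders for wherever you keep your own copies:

<code bash>
# Hypothetical destination host and paths - adjust to your own setup.
rsync -av --progress \
    ~/important_results/ \
    user@backup-host.example.org:/backups/vsc/important_results/
</code>
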
====== Addendum: Big Files ======

  * What does an administrator mean by 'do not use small files'?
    * It depends
    * On /global, an fopen() -> fclose() cycle takes ~100 microseconds
    * On the VSC2 home file systems, more than 100 million files are stored
      * Just checking which files have changed takes more than 12 hours, without even reading file contents or copying anything
    * What we mean is: use reasonable file sizes according to your working set and your throughput needs
      * Reading a 2 GB file from SSD takes less than one second; files that are only a few MB in size will slow down your processing
      * If you need high throughput with a >1 TB working set, a file size of >= 10^8 bytes is reasonable
      * File sizes < 10^6 bytes are problematic if you plan to use many files
      * If you want to do random I/O, copy your files to TMP
    * Storage works well when the block size is reasonable for the storage system (on VSC3 a few megabytes are enough; see the sketch after this list)
    * Do not create millions of files
      * If every user had millions of files, we would run into problems
    * Use 'tar' to archive unneeded files
      * tar -cvpf myArchive.tar myFolderWithFiles/
      * tar -cvjpf myArchive.tar.bz2 myFolderWithFiles/ # uses bzip2 compression
      * Extract with: tar xf myArchive.tar

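As a rough illustration of what a reasonable block size means, a hedged sketch using dd with a placeholder project path: one large file written in multi-megabyte blocks is the access pattern the parallel file systems handle best.

<code bash>
# Hypothetical path - writes one 4 GiB file in 4 MiB blocks to GLOBAL.
dd if=/dev/zero of=/global/myProject/bigfile.dat bs=4M count=1024

# The same amount of data spread over thousands of tiny files
# (or written in tiny blocks) is much harder on the storage system.
</code>
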
====== The End ======

Thank you for your attention.
  