
Storage infrastructure

  • Article written by Siegfried Reinwald (VSC Team); last update 2019-01-15 by sh.

Storage hardware

  • Storage on VSC-3
    • 10 Servers for $HOME
    • 8 Servers for $GLOBAL
    • 16 Servers for $BINFS / $BINFL
    • ~ 800 spinning disks
    • ~ 100 SSDs

Storage targets

  • Several Storage Targets on VSC-3
    • $HOME
    • $TMPDIR
    • $SCRATCH
    • $GLOBAL
    • $BINFS
    • $BINFL
  • For different purposes
    • Random I/O
    • Small Files
    • Huge Files / Streaming Data

Storage performance

The HOME filesystem

  • Use for non I/O intensive jobs
  • Basically NFS exports over InfiniBand (no RDMA)
  • Targets with up to 24 Disks (RAID-6 on VSC-3)
  • Up to 2 GB/s write speed
  • Logical volumes of projects are distributed among the servers
    • Each logical volume belongs to 1 NFS server
  • Accessible with the $HOME environment variable
    • /home/lv70XXX/username
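
As a quick orientation, the following shell sketch shows how to check which NFS export backs your $HOME and how much space your files occupy. The commands are standard Linux tools, not VSC-specific; the exact device and server names in the output depend on your project.

echo $HOME        # expands to /home/lv70XXX/username as described above
df -h $HOME       # shows the NFS export (and thus the server) backing your logical volume
du -sh $HOME      # summarizes your space usage; can take a while for many files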

The GLOBAL and SCRATCH filesystem

  • Use for I/O intensive jobs
  • ~ 500 TB space (default quota is 500 GB/project)
    • Can be increased on request (subject to availability)
  • BeeGFS Filesystem
  • Metadata Servers
    • Metadata on SSDs (RAID-1)
    • 8 Metadata Targets for VSC-3
  • Object Storages
    • Disk Storages (RAID-6 on VSC-3)
    • VSC-3: 12 Disks per Target / 4 Targets per Server / 8 Servers total
  • Up to 20 GB/s write speed
  • Accessible via the $GLOBAL and $SCRATCH environment variables
    • $GLOBAL → /global/lv70XXX/username
    • $SCRATCH → /scratch
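
As a small illustration of how these variables are typically used (the directory name below is just an example, not a site convention), a job or interactive session can simply create a working directory under $GLOBAL:

echo $GLOBAL                                         # /global/lv70XXX/username for your project
echo $SCRATCH                                        # /scratch
mkdir -p $GLOBAL/my_io_run && cd $GLOBAL/my_io_run   # example working directory for an I/O-heavy job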

The BINFL filesystem

  • Specifically designed for Bioinformatics applications
  • Use for I/O intensive jobs
  • ~ 1 PB space (default quota is 10 GB/project)
    • Can be increased on request (subject to availability)
  • BeeGFS Filesystem
  • Metadata Servers
    • Metadata on Datacenter SSDs (RAID-10)
    • 8 Metadata Servers
  • Object Storages
    • Disk Storages configured as RAID-6
    • 12 Disks per Target / 1 Target per Server / 16 Servers total
  • Up to 40 GB/s write speed
  • Accessible via $BINFL environment variable
    • $BINFL → /binfl/lv70XXX/username
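
If the BeeGFS command-line tools are available to regular users on the login nodes (an assumption; they may be restricted to administrators, and the mount point /binfl used below is also assumed), the layout and quota of the filesystem can be inspected like this:

beegfs-ctl --mount=/binfl --listnodes --nodetype=meta        # list the metadata servers
beegfs-ctl --mount=/binfl --listtargets --nodetype=storage   # list the storage targets
beegfs-ctl --mount=/binfl --getquota --uid $USER             # show your quota, if quota tracking is enabled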

The BINFS filesystem

  • Specifically designed for Bioinformatics applications
  • Use for very I/O intensive jobs
  • ~ 100 TB space (default quota is 2 GB/project)
    • Can be increased on request (subject to availability)
  • BeeGFS Filesystem
  • Metadata Servers
    • Metadata on Datacenter SSDs (RAID-10)
    • 8 Metadata Servers
  • Object Storages
    • Datacenter SSDs are used instead of traditional disks.
      • No redundancy. See it as (very) fast and low-latency scratch space. Data may be lost after a hardware failure.
    • 4x Intel P3600 2TB Datacenter SSDs per Server
    • 16 Storage Servers
  • Up to 80 GB/s write speed via the Omni-Path interconnect
  • Accessible via $BINFS environment variable
    • $BINFS → /binfs/lv70XXX/username
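
Because $BINFS offers no redundancy, a common pattern is to stage data in, compute, and copy results back before the job ends. The following job-script fragment is only a sketch; the file names and the choice of $GLOBAL as the safe destination are placeholders:

#!/bin/bash
#SBATCH -J binfs_staging_demo
STAGE=$BINFS/run_$SLURM_JOB_ID        # per-job scratch directory on the SSD target
mkdir -p $STAGE
cp $HOME/input.dat $STAGE/            # stage input data onto the fast SSDs
cd $STAGE
# ... run the I/O-intensive part of the job here ...
cp results.dat $GLOBAL/               # copy results back to a redundant filesystem
rm -rf $STAGE                         # clean up the non-redundant scratch space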

The TMP filesystem

  • Use for
    • Random I/O
    • Many small files
  • Size is up to 50% of main memory
  • Data gets deleted after the job
    • Write results to $HOME or $GLOBAL
  • Disadvantages
    • Space is consumed from main memory
  • Alternatively, the mmap() system call can be used
    • Keep in mind that mmap() uses lazy loading
    • Very small files waste main memory (memory-mapped files are aligned to page size)
  • Accessible with the $TMPDIR environment variable
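
Since $TMPDIR lives in main memory and is deleted when the job ends, results have to be copied to $HOME or $GLOBAL as the last step of the job script. A minimal sketch, with a made-up archive name:

#!/bin/bash
#SBATCH -J tmpdir_demo
cd $TMPDIR                                              # in-memory, per-job working directory
# ... produce many small files or do random I/O here ...
tar czf $GLOBAL/tmpdir_results_$SLURM_JOB_ID.tar.gz .   # save everything worth keeping before the job ends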

Storage exercises

In these exercises we measure the performance of the different storage targets on VSC-3. For that we use the “IOR” application (https://github.com/LLNL/ior), a standard benchmark for distributed storage systems.

“IOR” has been built for these exercises with gcc 4.9 and openmpi 1.10.2, so load these two modules first:

module purge
module load gcc/4.9 openmpi/1.10.2

Now copy the storage exercises into a folder of your own:

mkdir my_directory_name
cd my_directory_name
cp -r ~training/examples/08_storage_infrastructure/*Benchmark ./

Keep in mind that the results will vary, because there are other users working on the storage targets.
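
The provided job scripts wrap IOR for you. Purely as an illustration of what such a sequential write test looks like (the flags and sizes below are assumptions, not the exact settings used in the course scripts):

# -w: write test, -t: transfer size, -b: amount of data written per process,
# -F: one file per process, -o: location of the test file(s)
mpirun -np 8 ./ior -w -t 1m -b 4g -F -o $GLOBAL/ior_testfile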

Exercise 1 - Sequential I/O

We will now measure the sequential performance of the different storage targets on VSC-3.

a) With one process

cd 01_SequentialStorageBenchmark
# Submit the job    
sbatch 01a_one_process_per_target.slrm
# Inspect corresponding slurm-*.out files

b) With 8 processes

# Submit the job
sbatch 01b_eight_processes_per_target.slrm
# Inspect corresponding slurm-*.out files

Take your time and compare the outputs of the two runs. What conclusions can be drawn for the storage targets on VSC-3?

Exercise 1 - Sequential I/O performance discussion

Discuss the following questions with your partner:

  • The performance of which storage targets improves with the number of processes? Why?
  • What could you do to further improve the performance of the sequential write throughput? What could be a problem with that?
  • Bonus Question: $TMPDIR seems to scale pretty well with the number of processes although it is an in-memory filesystem. Why is that happening?

Exercise 1 - Sequential I/O performance results

Exercise 1a:

HOME: Max Write: 237.37 MiB/sec (248.91 MB/sec)
GLOBAL: Max Write: 925.64 MiB/sec (970.60 MB/sec)
BINFL: Max Write: 1859.69 MiB/sec (1950.03 MB/sec)
BINFS: Max Write: 1065.61 MiB/sec (1117.37 MB/sec)
TMP: Max Write: 2414.70 MiB/sec (2531.99 MB/sec)

Exercise 1b:

HOME: Max Write: 371.76 MiB/sec (389.82 MB/sec)
GLOBAL: Max Write: 2195.28 MiB/sec (2301.91 MB/sec)
BINFL: Max Write: 2895.24 MiB/sec (3035.88 MB/sec)
BINFS: Max Write: 2950.23 MiB/sec (3093.54 MB/sec)
TMP: Max Write: 16764.76 MiB/sec (17579.12 MB/sec)

Exercise 2 - Random I/O

We will now measure the storage performance for tiny 4-kilobyte random writes.
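
For illustration only (again, not the exact flags of the course scripts), a random-write IOR run differs from the sequential one mainly in the -z flag, which randomizes the offsets, and the small 4k transfer size:

mpirun -np 8 ./ior -w -z -t 4k -b 256m -F -o $GLOBAL/ior_random_testfile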

a) With one process

cd 02_RandomioStorageBenchmark
# Submit the job
sbatch 02a_one_process_per_target.slrm
# Inspect corresponding slurm-*.out files

b) With 8 processes

# Submit the job
sbatch 02b_eight_processes_per_target.slrm
# Inspect corresponding slurm-*.out files

Take your time and compare the outputs of the two runs. Do additional processes speed up the I/O activity?

Now compare your results to the sequential run in Exercise 1. What can be concluded for random I/O versus sequential I/O on the VSC-3 storage targets?

Exercise 2 - Random I/O performance discussion

Discuss the following questions with your partner:

  • Which storage targets on VSC-3 are especially suited for random I/O?
  • Which storage targets should never be used for random I/O?
  • You have a program that needs 32 GB of RAM and does heavy random I/O on a 10 GB file which is stored on $GLOBAL. How could you speed up your application?
  • Bonus Question: Why are SSDs so much faster than traditional disks when it comes to random I/O? (A modern datacenter SSD can deliver ~1000 times more IOPS than a traditional disk.)

Exercise 2 - Random I/O performance results

Exercise 2a:

HOME: Max Write: 216.06 MiB/sec (226.56 MB/sec)
GLOBAL: Max Write: 56.13 MiB/sec (58.86 MB/sec)
BINFL: Max Write: 42.70 MiB/sec (44.77 MB/sec)
BINFS: Max Write: 41.39 MiB/sec (43.40 MB/sec)
TMP: Max Write: 1428.41 MiB/sec (1497.80 MB/sec)

Exercise 2b:

HOME: Max Write: 249.11 MiB/sec (261.21 MB/sec)
GLOBAL: Max Write: 235.51 MiB/sec (246.95 MB/sec)
BINFL: Max Write: 414.46 MiB/sec (434.59 MB/sec)
BINFS: Max Write: 431.59 MiB/sec (452.55 MB/sec)
TMP: Max Write: 10551.71 MiB/sec (11064.27 MB/sec)
