Reason for I/O

Input and output of calculations: always as files

* Size of I/O may vary: at least a flag, but maybe large data sets
* Intermediate files
* Checkpoints

VSC infrastructure

* No monitor
* No printer

Organizational

Questions: immediately

Coffee: immediately

Comments/Feedback: yes, please!

Limiting factors to high performance

CPU performance
- Integer
- Floating point
- Scheduler
- Caches
Memory performance
Network performance
- Locality
- (Partially) shared resource
Storage
- Shared resource
- I/O throughput
- IOPS: Input/Output Operations per Second

High Performance Computing

CPU: 10^10 operations per second

Memory: 10^8 operations per second

Network: 10^6 operations per second

SSD: 10^5 operations per second

HDD: 10^2 operations per second
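
To make the gap concrete, the following sketch converts the order-of-magnitude rates above into time per operation and into the number of CPU operations that fit into one operation of each slower component. The rates are the rough figures from this slide, not measurements; the names `RATES`, `seconds_per_operation` and `cpu_operations_lost` are only illustrative.

```python
# Order-of-magnitude rates from the list above (operations per second).
RATES = {
    "CPU": 1e10,
    "Memory": 1e8,
    "Network": 1e6,
    "SSD": 1e5,
    "HDD": 1e2,
}


def seconds_per_operation(rate):
    """Time a single operation takes at the given rate."""
    return 1.0 / rate


def cpu_operations_lost(component):
    """CPU operations that fit into one operation of `component`."""
    return RATES["CPU"] / RATES[component]


for name, rate in RATES.items():
    print(f"{name:8s} {seconds_per_operation(rate):.0e} s/op, "
          f"{cpu_operations_lost(name):.0e} CPU ops per operation")
```

With these figures, a CPU that waits for a single HDD operation could have executed on the order of 10^8 operations in the meantime.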

High Performance Storage

High performance: throughput and IOPS

Throughput
- Each node is limited by network bandwidth
- Each storage server is limited by network bandwidth
- Many nodes ⇒ high throughput
IOPS
- Network latency
- Block devices
- HDD (Hard Disk Drive) latency
  - Seek time
  - Rotational latency
- SSD (Solid State Drive) latency
  - Fetch time for a block
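
The two limits above can be estimated with simple arithmetic. The sketch below is a rough model, not a description of any particular system: the seek time, spindle speed, node counts and bandwidths are assumed example values, and the function names are hypothetical.

```python
def hdd_iops(seek_time_s, rpm):
    """Estimate random-access IOPS of one HDD from seek time and spindle speed."""
    # On average the platter has to turn half a revolution before the
    # requested sector passes under the head.
    rotational_latency_s = 0.5 * 60.0 / rpm
    return 1.0 / (seek_time_s + rotational_latency_s)


def aggregate_throughput(n_nodes, node_bw, n_servers, server_bw):
    """Upper bound on total throughput: the smaller of client and server side."""
    return min(n_nodes * node_bw, n_servers * server_bw)


# Assumed example values, not taken from any VSC system.
print(round(hdd_iops(seek_time_s=0.008, rpm=7200)))
print(aggregate_throughput(n_nodes=100, node_bw=1.0, n_servers=8, server_bw=5.0))
```

With the assumed 8 ms seek time and 7200 rpm, a single disk ends up on the order of 10^2 random operations per second. In the throughput example the storage servers, not the compute nodes, are the limit.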

High Performance Storage

Latency should not dominate

Combine I/O operations

Methods

Buffering
Avoid many small files
Combine data in few large files
Parallel I/O
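
The effect of buffering can be illustrated by counting how many write requests reach the storage layer. The sketch below uses a hypothetical in-memory class `CountingSink` instead of a real file, so the numbers show request counts only, not timings; the record size and buffer size are assumptions.

```python
import io


class CountingSink(io.RawIOBase):
    """In-memory stand-in for a storage device that counts write requests."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.data += b
        return len(b)


RECORD = b"0123456789abcdef"   # one 16-byte record
N_RECORDS = 10_000

# Unbuffered: every record becomes its own write request.
unbuffered = CountingSink()
for _ in range(N_RECORDS):
    unbuffered.write(RECORD)

# Buffered: records are collected in memory and passed on in large chunks.
buffered_sink = CountingSink()
writer = io.BufferedWriter(buffered_sink, buffer_size=64 * 1024)
for _ in range(N_RECORDS):
    writer.write(RECORD)
writer.flush()

print("unbuffered write requests:", unbuffered.calls)
print("buffered write requests:  ", buffered_sink.calls)
```

Both variants store the same bytes, but the buffered one needs only a handful of requests, so per-request latency is paid far less often.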

Topics of today

Introduction to I/O (this talk)
Storage technologies
VSC storage infrastructure
Application view to I/O
Performance hints and best practices for I/O
MPI I/O (overview)

Tomorrow: Parallel I/O and Portable Data Formats

NetCDF4: Network Common Data Form
PnetCDF: Parallel NetCDF
HDF5: Hierarchical Data Format

Introduction to I/O

Concepts
Technology names
Management view

User view to I/O

Storage size
File size
Highly available
Temporary
Backup
Shareable:
- Available in web browser worldwide?
- Mounted on your desktop?
Visibility:
- User, Group, All Users (=‘Other’)
- Access control lists (ACLs)
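
The User/Group/Other visibility classes correspond to the classic POSIX permission bits. The following sketch decodes them with Python's standard `stat` module; the mode `0o640` is an assumed example and the helper `who_can_read` is hypothetical. ACLs are not covered by these bits.

```python
import stat


def who_can_read(mode):
    """Return which permission classes may read a file with this mode."""
    classes = {
        "user": stat.S_IRUSR,
        "group": stat.S_IRGRP,
        "other": stat.S_IROTH,
    }
    return [name for name, bit in classes.items() if mode & bit]


# Assumed example: a regular file readable by owner and group only.
mode = stat.S_IFREG | 0o640
print(stat.filemode(mode))
print(who_can_read(mode))
```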

Performance

Number of files
Throughput
IOPS
Number of spindles

Usage

sequential access
random access
write once
append
modify
flush
locking (Deadlocks!)
byte range locking (Deadlocks!)
read once
read often
read never (e.g. log files, snapshots)
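
As an illustration of the append and flush patterns, the sketch below appends lines to a log file and explicitly pushes each one out of the application buffer and the operating system cache. The file name `run.log` and the helper `append_record` are made up for this example; how strictly `fsync` is honoured can depend on the file system.

```python
import os
import tempfile


def append_record(path, line):
    """Append one line and push it out of the application and OS buffers."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
        f.flush()              # Python buffer -> operating system
        os.fsync(f.fileno())   # operating system cache -> storage device


# Hypothetical log file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "run.log")
    append_record(log_path, "step 1 done")
    append_record(log_path, "step 2 done")
    with open(log_path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
```

Flushing after every record trades performance for safety: each call costs at least one storage operation, so it suits rare events such as checkpoints better than high-frequency output.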

Security/Safety

Redundancy
- RAID: Redundant Array of Independent Disks
  - RAID levels (0,1,5,6)
- Erasure coding
- How many copies
- Software RAID
Repair times
Buffer battery - supercapacitor
UPS: Uninterruptible Power Supply
Reliability: disk failures
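
The cost of redundancy can be expressed as usable capacity. The sketch below applies the standard formulas for the RAID levels named above to an array of identical disks; the function name and the example of 8 disks with 4 TB each are assumptions, and RAID 1 is modelled as one mirror set across all disks.

```python
def usable_capacity(level, n_disks, disk_size):
    """Usable capacity of an array of identical disks for common RAID levels."""
    if level == 0:      # striping, no redundancy
        min_disks, usable = 2, n_disks
    elif level == 1:    # mirroring: every disk holds the same data
        min_disks, usable = 2, 1
    elif level == 5:    # one disk's worth of parity
        min_disks, usable = 3, n_disks - 1
    elif level == 6:    # two disks' worth of parity
        min_disks, usable = 4, n_disks - 2
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    if n_disks < min_disks:
        raise ValueError(f"RAID {level} needs at least {min_disks} disks")
    return usable * disk_size


# Assumed example: 8 disks of 4 TB each.
for level in (0, 1, 5, 6):
    print(f"RAID {level}: {usable_capacity(level, 8, 4)} TB usable")
```

RAID 5 survives the failure of one disk and RAID 6 of two, which matters when repair times are long compared with the expected time between disk failures.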

Technology

HDFS: Hadoop Distributed File System
DAS: Direct Attached Storage (JBOD: Just a Bunch of Disks)
SAN: Storage Area Network
NAS: Network Attached Storage
Block storage
Disk partitions
LVM: Logical Volume Manager
- Physical Volume (PV)
- Volume Group (VG)
- Logical Volume (LV)
Journaling File Systems: XFS - ZFS - Ext4 - Btrfs
NFS - SMB (CIFS)
Inode
SCSI, SAS, SATA, NVMe, FC, iSCSI, SRP
3.5", 2.5" form factor
SSD: wear leveling
I/O scheduler
Object storage
Tiered storage
Tape storage

Technology used by VSC

NAS: Network Attached Storage
Block storage
Disk partitions
LVM: Logical Volume Manager
- Physical Volume (PV)
- Volume Group (VG)
- Logical Volume (LV)
Journaling File Systems: XFS - ZFS - Ext4
NFS
Inode
SAS, NVMe
3.5", 2.5" form factor

User - Performance

Storage size - parallel file system - number of spindles - throughput

Temporary - locality - IOPS

Highly available - throughput

File size - number of files - storage size

User - Usage

Backup - locking

Storage size - read never

Performance - Usage

Throughput - random access

IOPS - random access

Throughput - sequential access

User - Security/Safety

Storage size - redundancy

Highly available - RAID - erasure coding

Backup - redundancy

Highly available - UPS

Highly available - buffer battery

User - Technology

Storage size - DAS - NAS - SAN

Highly available - HDFS

Visibility - object storage

Storage size - tiered storage

Big Data

Technology available

Large disks
Fast computers
Lots of data

The result is called ‘Big Data’

Data growth

New data is generated digitally

Data creation increases exponentially

Internet / Social networks / Mobile Devices

Internet of Things

Sensors everywhere create data
Growing exceptionally fast

Medicine / Genome

Science

Types of data

Databases
Text
Video
Image
Sensor data

V3

Characterization of Big Data by

Volume
Velocity
Variety

Tools

NoSQL
Object Storage
HDFS
- Cheap building blocks
- Replication
Hadoop
Requirement: linear scaling
Cloud computing

Hadoop software tools

YARN: framework for job scheduling
MapReduce: parallel processing, scales very well
HBase: distributed database
Hive: data warehouse infrastructure with ad hoc querying
Pig: high-level data-flow language
Cassandra: distributed database
Flume: aggregate and move large amounts of data
Kafka: distributed streaming
Spark: compute engine for Hadoop data, more flexible than MapReduce
…
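
The MapReduce idea from the list above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases, not the Hadoop API; the function names and the input strings are made up.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Made-up input; in a real system each document would live on a different node.
documents = ["big data big storage", "fast storage"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many nodes, which is where the good scaling comes from.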

Applications

Advertising / Sales
Problem analysis
Microtrends
Genomics
Archaeology
Science
…

Table of Contents

Reason for I/O

Organizational

High Performance Computing

High Performance Storage

Topics of today

Tomorrow: Parallel I/O and Portable Data Formats

Introduction to I/O

User view to I/O

Performance

Usage

Security/Safety

Technology

Technology used by VSC

User - Performance

User - Usage

Performance - Usage

User - Security/Safety

User - Technology

Big Data

Data growth

Types of data

V3

Tools

Hadoop software tools

Applications