Storage Technologies - Welcome
“Computers are like Old Testament gods; lots of rules and no mercy.” (Joseph Campbell)
<!-- “Simple things should be simple. Complicated things should be possible.” (Alan Kay) -->
<!-- “Computer science is not about machines, in the same way that astronomy is not about telescopes. […] Science is not about tools, it is about how we use them and what we find out when we do.” (Michael R. Fellows) -->
Storage Technologies - Contents
Storage Technologies - Hardware Basics
CPU
Main Memory
Stores data and programs
Typically volatile
I/O Modules
Storage Technologies - Hardware Basics (II)
System Bus
Provides Communication among processor, memory and I/O modules
ISA, PCI, AGP, PCI-Express, …
External Devices
I/O Controllers (HBA / RAID)
Network Controllers (Ethernet, Fibre Channel, Infiniband)
Human Interface Devices (Keyboard, Mouse, Screen, etc.)
Storage Technologies - Direct Memory Access
Memory Hierarchy (I)
(Operating Systems, 7th Edition, W. Stallings, Chapter 1)
Memory Hierarchy (II)
Inboard Memory
Outboard Storage
Flash-Drives
Disks
Blu-Ray / DVD / CDRom
Off-Line Storage
Memory Hierarchy (III)
Prefetching
Need data locality
Memory Hierarchy (IV)
Memory Hierarchy (V)
Memory Access Times
Memory Access Times (II)
Memory Hierarchy (VI)
Caching
(Operating Systems, 7th Edition, W. Stallings, Chapter 1)
Storage Technologies - Cache Memory
Acts as a buffer between 2 memory tiers
Modern CPUs utilize 3 levels of caches
Level 1 split into instruction and data cache. Separate for each core.
Level 2 data and instruction cache. Separate for each core.
Level 3 data and instruction cache. Shared among all cores on the die.
Benefits both throughput and latency
Different Caching Strategies for different purposes
Caching Strategies
Caching Problems
Storage Technologies - Why Caching Works -> Locality of reference
Caching - Example
int sizeX = 2000;
int sizeY = 1000;
int array[sizeY][sizeX];
// Fill the array with some data
fill_buffer(&array);
// Now run through the array and do something with the elements
// This runs slow in C
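// (the inner loop strides over a whole row of sizeX ints per step, so
//  almost every access touches a different cache line)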
for (int x=0; x<sizeX; x++) {
for (int y=0; y<sizeY; y++) {
array[y][x] = x+2000*y;
}
}
// This runs fast in C
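// (the inner loop walks through contiguous memory, so cache lines and
//  the hardware prefetcher are used effectively)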
for (int y=0; y<sizeY; y++) {
for (int x=0; x<sizeX; x++) {
array[y][x] = x+2000*y;
}
}
Memory Hierarchy - Recap
Problem: A CPU waiting for data can’t do any work
Solution: Caching/Prefetching algorithms.
As one goes down the hierarchy, cost per bit decreases, capacity increases, access time increases, and the frequency of access by the processor decreases
Storage Technologies (I)
I/O Devices
Human Readable
Machine Readable
Communication
Differences in
Data Rate
Application
Complexity of Control
Storage Technologies (II)
Storage Technologies - Device Characteristics (I)
Storage Technologies - Device Characteristics (II)
Mutability
Accessibility
Random Access
Sequential Access
Addressability
Location addressable
File addressable
Content addressable
Sequential I/O vs. Random I/O
Sequential I/O
Writing / Reading large contiguous chunks of data (Chunk Size >= 10^6 Bytes)
Usually the fastest way to read data from storage devices
Not always easily applicable to a problem
Random I/O
Writing / Reading small chunks of data to / from random locations (Chunk Size <= 10^4 Bytes)
Slowest way to read data from storage devices
Magnitude of the slow-down depends on the underlying hard- and software (e.g. Tape-Drives vs. Flash-Drives)
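A minimal C sketch of the two access patterns, assuming a pre-existing test file; the file name, file size and chunk size below are placeholder values, not taken from these slides:

```c
#include <stdio.h>
#include <stdlib.h>

/* Placeholder values, chosen only for illustration. */
#define FILE_NAME "testfile.bin"
#define FILE_SIZE (64L * 1024 * 1024)   /* assumed size of the test file */
#define CHUNK     (4 * 1024)            /* small request size            */

int main(void) {
    static char buf[CHUNK];
    FILE *f = fopen(FILE_NAME, "rb");
    if (!f) { perror("fopen"); return 1; }

    /* Sequential I/O: read the file front to back in contiguous chunks. */
    while (fread(buf, 1, sizeof(buf), f) == sizeof(buf)) {
        /* ... process buf ... */
    }

    /* Random I/O: seek to an arbitrary offset before every small read.
       On an HDD every seek costs milliseconds; on tape it is prohibitive. */
    for (int i = 0; i < 1000; i++) {
        long off = (rand() % (FILE_SIZE / CHUNK)) * CHUNK;
        if (fseek(f, off, SEEK_SET) != 0 ||
            fread(buf, 1, sizeof(buf), f) != sizeof(buf))
            break;
        /* ... process buf ... */
    }

    fclose(f);
    return 0;
}
```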
Hard-Drives Overview (I)
Invented in the mid 50s by IBM
The first IBM drive stored 3.75 MB on a stack of 50 discs
Became cheap / mainstream in the late 80s
Today one 3.5" drive can hold up to 14 TB of data
Interface
Hard-Drives Overview (II)
Hard-Drives Characteristics
Hard-Drives Sequential Access Example
Hard-Drives Random Access Example
Given a Harddisk with
We read a file that is distributed randomly over 2500 sectors of the disk
Slowdown of nearly 3 orders of magnitude with the same average values
Do not rely on average values!
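A hedged back-of-the-envelope calculation with typical values (assumed here, not the figures from the original slide): 8 ms average seek time, 4 ms average rotational latency, 150 MB/s sequential transfer rate, 512-byte sectors:

$t_{random} \approx 2500 \cdot (8\,ms + 4\,ms) \approx 30\,s$

$t_{sequential} \approx 12\,ms + \frac{2500 \cdot 512\,B}{150\,MB/s} \approx 20\,ms$

That is roughly a factor of 1500, i.e. close to three orders of magnitude, as claimed above.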
Quo vadis Hard-Drives
(NVMe) Solid State Drives (I)
NAND Flash Memory
Lower cost compared to DRAM
No refresh cycles / external PSU needed to retain the data compared to DRAM
Uses less space compared to memory based on NOR gates
No byte-level random access (cells are connected in series and grouped into pages/sectors)
Key components
New Interfaces
(NVMe) Solid State Drives (II)
NAND Flash Memory
Single Level Cell (SLC)
Multi Level Cell (MLC)
SLC
MLC
Multiple bits per cell
Lower production cost
(NVMe) Solid State Drives - Memory (I)
NAND Flash Memory
Is composed of one or more chips
Chips are segmented into planes
Planes are segmented into thousands (e.g. 2048) of blocks
A Block usually contains 64 to 128 pages
Exact specification varies across different memory packages
(NVMe) Solid State Drives - Memory (II)
Read
Write
Is performed in units of pages
Pages in one block must be written sequentially
A page write takes approximately 100 µs (SLC) up to 900µs (MLC)
Block must be erased before being reprogrammed
(NVMe) Solid State Drives - Memory (III)
Erase
Must be performed in block granularity (hence the term erase-block)
Can take up to 5 ms
Limited number of erase cycles
Some flash memory is reserved to replace bad blocks
Controller takes care of wear leveling
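A rough, hedged estimate of why naive in-place updates are so expensive: assume a block of 128 pages, a page read of ~50 µs (assumed; not given on the slides), a page program of ~300 µs and a block erase of 5 ms. Rewriting a single page in place then costs roughly

$t \approx 127 \cdot 50\,\mu s + 5\,ms + 128 \cdot 300\,\mu s \approx 50\,ms$

for one logical page update. This is why the controller remaps pages and erases blocks in the background instead of updating in place.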
(NVMe) Solid State Drives - Controller (I)
(NVMe) Solid State Drives - Speed (I)
(NVMe) Solid State Drives - IOPS vs. Throughput
Traditional discs have been measured in terms of throughput
SSDs are sometimes measured in terms of IOPS (Input- Output- Operations per Second)
The following equation holds (if there is no controller overhead like erasing, garbage collection, etc.):
$Throughput\ [Byte/s] = IOPS \cdot Blocksize\ [Byte]$
So if we know the blocksize that was used in benchmarking…
We now even have a basis to calculate which blocksize is at least necessary to get full write throughput
(NVMe) Solid State Drives - Speed (II)
Given an Intel DC P3700 SSD with a capacity of 2 TB, specification says:
Sequential Read 450’000 IOPS with 2800MB/s of max throughput
$2\,800\,000\,000\ \text{Bytes} = 450\,000 \cdot x$
$x \approx 6222\ \text{Bytes} \approx 6\ \text{KByte}$
So a blocksize of 8 KByte is a good starting point to try to get full throughput
Random Reads with a page size of 8KByte should work well on this device and we should get the full throughput of 2800MB/s
(NVMe) Comparison SSD vs. HDD
On the previous slide we proposed that a block size of 8 KByte on our SSD will lead to a throughput of 2800 MB/s
Let’s try this with a HDD
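A hedged comparison with assumed HDD figures (a 7200 rpm disk sustains on the order of 100 - 200 random IOPS; this value is not taken from the slides):

$200\,IOPS \cdot 8\,KByte \approx 1.6\,MB/s$

versus the roughly 2800 MB/s the SSD reaches at the same request size, i.e. a gap of more than a factor of 1000 for random 8 KByte reads.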
(NVMe) Comparison SSD vs. HDD
(NVMe) Solid State Drives - Speed (II)
Quo vadis Solid State Drives
Current / Future Trends
Flash Memory Market surpassed DRAM Market in 2012
Density of Flash Memory surpassed Hard Drive density in 2016
In Q4 2016 45 million SSDs with a total capacity of 16 Exabyte were delivered to customers
Market for HDDs is significantly bigger than for SSDs
New memory technologies (e.g. Intel/Micron 3DXPoint)
Intel Optane Memories
Things to consider
Magnetic Tapes (I)
Have been in use for data storage since the 50’s
Main storage medium in some early computers
Capacity:
1950’s: ~ 1 MByte
1980’s: ~ 100 MByte - 1 GByte
1990’s: ~ 10 - 100 GByte
2000’s: ~ 100 GByte - 1 TByte
Now: >10 TByte
Future: Going up to 200 TByte per Tape seems possible
Advertised tape capacities usually assume compression; the real (native) capacity is typically only half of the quoted value
Magnetic Tapes - Characteristics (I)
Tape width
Recording Method
Linear
Linear Serpentine
Scanning (writing across the width of the tape)
Helical Scan (Short, dense, diagonal tracks)
Block Layout
Magnetic Tapes - Characteristics (II)
Access
Compression
Encryption
Magnetic Tapes - LTO
Linear Tape Open
LTO-8 State of the art
12 TByte raw capacity (uncompressed)
360 MByte/s max uncompressed speed
Compression ratio 2.5:1
Supports encryption
Supports WORM (Write once read many)
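For compressible data the commonly quoted “compressed” figures follow directly from the 2.5:1 ratio above: $12\,TByte \cdot 2.5 = 30\,TByte$ of capacity and $360\,MB/s \cdot 2.5 = 900\,MB/s$ of throughput; incompressible data only gets the raw values.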
Magnetic Tapes - No random I/O
Memory Hierarchy - cont’d
With the given storage technologies (Flash, HDD and Tape) we can refine our Memory hierarchy
Using Flash Memories for Caching and Metadata
Using HDDs as online data storage
Using Tapes as offline data storage
With the concept of locality and the right applications this will speed up our storage stack
Recently randomly accessed files, and metadata that fits on the SSDs, will be stored on the SSDs
Recently sequentially accessed files will be moved to the HDDs (they are fast enough when used in a sequential way)
Rarely accessed files go to the tape machine
Multi Tiered Storage Systems
Gaining speed and redundancy with RAID
RAID - Redundant Array of Independent (Inexpensive) Discs
Different Techniques / RAID Levels available
RAID 0, 1, 5, 6, 10, 01, …
Several Discs in a RAID are called a RAID array
RAID - Striping and Mirroring
RAID Level 0
Data is striped across drives without redundancy
Failure of one disc leads to loss of all data in the array
Speedup is proportional to the number of discs
RAID Level 1
Data is mirrored across discs (usually 2, but more are possible)
Failure of a disc involves no loss of data
Read calls can be parallelized across multiple discs
No speedup for write calls
RAID - Parity
RAID Level 5
Data is striped across drives, parity is calculated (XOR) and stored
Considered obsolete nowadays, because of long rebuild times
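A minimal sketch of the XOR parity idea behind RAID 5, using tiny strips and made-up byte values for illustration only: the parity strip is the XOR of the data strips, so any single lost strip can be rebuilt from the survivors.

```c
#include <stdio.h>
#include <stdint.h>

#define STRIP 4   /* bytes per strip; real arrays use KiB-sized strips */

int main(void) {
    /* Data strips on three drives plus one parity strip. */
    uint8_t d0[STRIP] = {0x11, 0x22, 0x33, 0x44};
    uint8_t d1[STRIP] = {0xAA, 0xBB, 0xCC, 0xDD};
    uint8_t d2[STRIP] = {0x01, 0x02, 0x03, 0x04};
    uint8_t p[STRIP], rebuilt[STRIP];

    /* Parity is the XOR over all data strips. */
    for (int i = 0; i < STRIP; i++)
        p[i] = d0[i] ^ d1[i] ^ d2[i];

    /* If the drive holding d1 fails, XOR the surviving strips to rebuild it. */
    for (int i = 0; i < STRIP; i++)
        rebuilt[i] = d0[i] ^ d2[i] ^ p[i];

    for (int i = 0; i < STRIP; i++)
        printf("%02x %s %02x\n", d1[i],
               d1[i] == rebuilt[i] ? "==" : "!=", rebuilt[i]);
    return 0;
}
```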
RAID Level 6
RAID - Other Levels
“Hybrid”-RAID
Nested-RAID
But where are the levels 2-4?
File Management
File Management - Inodes (I)
Besides the file content, filesystems rely on data structures that describe the files
Inodes store information about files
Type of File (File, Directory, etc)
Ownership
Access Mode / Permissions
Timestamps (Creation, Access, Modification)
Last state change of the inode itself (status, ctime)
Size of the file
Link-Counter
One or more references to the actual data blocks
File Management - Inodes (II)
Most filesystems use a fixed number of inodes, allocated when the filesystem is created
The ext2 filesystem has 12 direct block references per inode; larger files use indirect blocks
Creating new files or directories is not possible when the inodes run out
Depending on the amount of files/directories a filesystem can use up to 10% of its capacity for meta information
Show the inode number with ‘ls -i’
Show an inode’s content with ‘stat’
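A minimal C sketch that reads the same information programmatically via the POSIX stat() call (output formatting is simplified compared to the stat(1) example on the next slide):

```c
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    struct stat st;
    if (stat(argv[1], &st) != 0) { perror("stat"); return 1; }

    printf("Inode:  %lu\n",        (unsigned long)st.st_ino);
    printf("Size:   %lld bytes\n", (long long)st.st_size);
    printf("Blocks: %lld\n",       (long long)st.st_blocks);
    printf("Links:  %lu\n",        (unsigned long)st.st_nlink);
    printf("Mode:   %o\n",         st.st_mode & 07777);   /* permission bits */
    return 0;
}
```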
File Management - Inodes (III)
* sreinwal@rs ~/Work/vboxshared/sreinwal/pandoc/vsc-markdown/parallell-io/02_storage_technologies $ stat storage_technologies.md
* File: storage_technologies.md
* Size: 30837 Blocks: 64 IO Block: 4096 regular file
* Device: fd04h/64772d Inode: 25373850 Links: 1
* Access: (0644/-rw-r--r--) Uid: ( 1001/sreinwal) Gid: ( 1001/sreinwal)
* Access: 2017-11-28 14:25:11.191823770 +0100
* Modify: 2017-11-28 14:23:53.482827520 +0100
* Change: 2017-11-28 14:25:20.416823325 +0100
* Birth: -
File System - Organization (I)
(Operating Systems 7th Edition, W. Stallings, Chapter 12)
File System - Organization (II)
File System - Organization (III)
File System - Organization (IV)
(Operating Systems 7th Edition, W. Stallings, Chapter 12)
Unix File Management
Linux Virtual File System
Storage Networks - NAS vs. SAN
Distributed File Systems - NFSv4
NFSv4
Provides a standardized view of its local filesystem
NFS4 uses Remote File Service / Remote Access Model
As opposed to the Upload/Download Model
Client Side Caching - Asynchronous I/O
Synchronous or Blocking I/O
Asynchronous or Non-Blocking I/O
As we have seen, I/O Operations can be very slow
Client calls the operating system for a read/write
Continues processing
Data is read/written in the background
The operating system informs the process when the operation has completed
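A minimal sketch of asynchronous I/O using the POSIX AIO interface (aio_write / aio_error / aio_return); the file name and buffer size are placeholders, and on Linux the program typically needs to be linked with -lrt:

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("async_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    memset(buf, 'x', sizeof(buf));

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    /* Submit the write and return immediately. */
    if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

    /* The process can keep computing while the write happens in the background. */
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);   /* ... or do other useful work here ... */

    ssize_t done = aio_return(&cb);   /* final result of the request */
    printf("asynchronous write finished: %zd bytes\n", done);

    close(fd);
    return 0;
}
```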
Client side caching in NFS
Server Side Caching - NFSv4
Server Side Caching
Server replies to requests before they have been committed to stable storage
Usually less error-prone than client-side caching, because servers typically run in a more controlled environment than clients
Problems arise
Server crashes before data is transferred from client to server
Client crashes before it transfers data to server
Network connection breaks
Power outages
etc…
Parallel File Systems - BeeGFS
FhGFS / BeeGFS Parallel File System
Distributes data and metadata across several targets
Stripes data across several targets
Huge speedup compared to single server appliances
Scalability and Flexibility were key aspects for the design
BeeGFS - Management Server
Management Server
Exactly 1 MS per filesystem
Keeps track of metadata and storage targets
Keeps track of connected clients
Tags targets with labels
Not involved in filesystem operations
BeeGFS - Object Storage Server
Holds the file contents on Object Storage Targets (OSTs)
Underlying devices / filesystem can be chosen freely
File contents can be striped over multiple OSTs
One server can handle multiple OSTs
A typical OST consists of 6 - 12 drives running in RAID6
The number of threads influences the number of requests that can be put on disk
The chunksize sets the amount of data stored on an OST per stripe (see the sketch below)
OSTs can be added on demand, but a rebalancing of the file system might be needed to regain full performance
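A minimal sketch of how round-robin striping maps a file offset to a storage target; chunksize and target count are placeholder values, and the actual BeeGFS mapping may differ in detail:

```c
#include <stdio.h>

/* Placeholder striping parameters, for illustration only. */
#define CHUNKSIZE   (512LL * 1024)   /* bytes stored per target and stripe */
#define NUM_TARGETS 4                /* OSTs the file is striped over      */

/* Which target holds the byte at 'offset', and where inside that
   target's chunk file it lives (simple round-robin striping). */
static void locate(long long offset, int *target, long long *target_off) {
    long long chunk_idx = offset / CHUNKSIZE;        /* global chunk number */
    *target     = (int)(chunk_idx % NUM_TARGETS);
    *target_off = (chunk_idx / NUM_TARGETS) * CHUNKSIZE + offset % CHUNKSIZE;
}

int main(void) {
    long long offsets[] = {0, 524288, 1048576, 5000000};
    for (int i = 0; i < 4; i++) {
        int t; long long o;
        locate(offsets[i], &t, &o);
        printf("file offset %10lld -> target %d, offset %lld\n",
               offsets[i], t, o);
    }
    return 0;
}
```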
BeeGFS - Clients
There's much more
Storage Technologies - Bibliography
Understanding Intrinsic Characteristics and System Implications of Flash Memory Based Solid State Drives, Cheng et al., 2009
Operating Systems - Internals and Design Principles 7th Edition, Stallings, 2012
Distributed Systems - Principles and Paradigms 2nd Edition, Tanenbaum et al., 2007
Storage Technologies - The End
Remember
Random I/O is orders of magnitude slower than sequential I/O
Sequential I/O can even be done in parallel from multiple nodes to further improve the throughput
Highly parallelized random calls will result in degraded storage performance for ALL users and can even lead to an unresponsive storage system.
Thank you for your attention