====== Storage Technologies - Welcome ======
“Computers are like Old Testament gods; lots of rules and no mercy.” (Joseph Campbell)

====== Storage Technologies - Contents ======
* Hardware Basics
* Memory Hierarchy
* Storage Devices
* Flash Memories
* Hard Disks
* Magnetic Tapes
* Multi-Tiered Storage / Hierarchical Storage Systems
* File Systems
* Local
* Network
* Distributed

====== Storage Technologies - Hardware Basics ======
{{.:node.jpg}}

====== Storage Technologies - Hardware Basics ======
* CPU
* Central Processing Unit
* Controls operation and performs data processing
* Registers and caches for fast data access
* Main Memory
* Stores data and programs
* Typically volatile
* I/O Modules
* Move data between CPU/memory and external devices

====== Storage Technologies - Hardware Basics (II) ======
* System Bus
* Provides communication among processor, memory and I/O modules
* ISA, PCI, AGP, PCI-Express, …
* External Devices
* I/O Controllers (HBA / RAID)
* Network Controllers (Ethernet, Fibre Channel, InfiniBand)
* Human Interface Devices (Keyboard, Mouse, Screen, etc.)

====== Storage Technologies - Direct Memory Access ======
* Traditionally, copying data from/to devices was done with CPU registers
* Costs a lot of clock cycles that could be used for processing (up to 40 clock cycles per word)
* DMA - Direct Memory Access
* The DMA controller (DMAC) is integrated in modern chipsets
* Moves data from external devices to memory and vice versa
* Informs the system when copying is done (interrupt)
* Still some cycles needed for communication with the controller
* Performance gets better the more data is transferred. Byte-wise DMA is a bad idea.

====== Memory Hierarchy (I) ======
{{.:memoryhierarchypyramid.png}}
(Operating Systems, 7th Edition, W. Stallings, Chapter 1)

====== Memory Hierarchy (II) ======
* Inboard Memory
* Registers
* Caches
* RAM
* Outboard Storage
* Flash Drives
* Disks
* Blu-ray / DVD / CD-ROM
* Off-Line Storage
* Magnetic Tapes

====== Memory Hierarchy (III) ======
* Why?
* SRAM is expensive
* Trade-off among speed, cost, size, and power consumption
* Strategies
* Caching
* Prefetching
* Need data locality

====== Memory Hierarchy (IV) ======
* As one goes down the hierarchy…
* Decreasing cost per bit
* Increasing capacity
* Decreasing frequency of access
* Increasing access time

====== Memory Hierarchy (V) ======
* Memory access times on the Intel Haswell architecture
* 1 clock cycle = 250 picoseconds @ 4 GHz (movement at the speed of light = 7.5 cm per 250 picoseconds)
* Processor Registers
* 1 Clock Cycle (250 ps)
* Cache (L1)
* Access Time ~4-5 Clock Cycles (~1 ns)
* Cache (L2)
* Access Time ~12 Clock Cycles (~3 ns)
* Cache (L3)
* Access Time ~36-58 Clock Cycles (~12 ns)
* RAM
* Access Time ~30-60 Clock Cycles + 50-100 ns (~70-120 ns)

====== Memory Access Times ======
{{.:haswell_access_times.png}}

====== Memory Access Times (II) ======
{{.:haswell_access_times_with_ram.png}}

====== Memory Hierarchy (VI) ======
* Storage Devices - Typical Access Times
* NVMe Flash Memory ~6 µs (@ 150’000 IOPS)
* SAS Magnetic Disk ~2-3 ms (@ 350 IOPS)
* Magnetic Tape: milliseconds up to many seconds

====== Caching ======
{{.:memoryhierarchypyramid.png}}
(Operating Systems, 7th Edition, W. Stallings, Chapter 1)

====== Storage Technologies - Cache Memory ======
* Acts as a buffer between 2 memory tiers
* Modern CPUs utilize 3 levels of caches
* Level 1: split into instruction and data cache. Separate for each core.
* Level 2: data and instruction cache. Separate for each core.
* Level 3: data and instruction cache. Shared among all cores on the die.
* Benefits both throughput and latency
* Different caching strategies for different purposes
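====== Storage Technologies - Cache Memory (Sketch) ======
The cache levels described above can be inspected on a concrete machine. A minimal sketch using the glibc-specific sysconf() cache queries (these constants are not part of POSIX; other platforms may simply report 0 or -1):

<code c>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extension: query the cache hierarchy of the running machine */
    printf("L1d size: %ld bytes, line size: %ld bytes\n",
           sysconf(_SC_LEVEL1_DCACHE_SIZE), sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2  size: %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3  size: %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
</code>

On Linux the same information is also exposed under /sys/devices/system/cpu/cpu*/cache/.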
====== Caching Strategies ======
* Cache Hit vs. Cache Miss
* Cache Hit Ratio
* Replacement Strategies / Associativity
* Direct Mapped
* N-Way Set Associative (But: increased N means more potential locations need to be searched)
* Fully Associative
* Write Strategies
* Write Back
* Write Through
* Prefetching / Readahead

====== Caching Problems ======
* Larger caches have a higher hit rate but also higher latency
* Multi Level Caches
* Data may become incoherent, particularly in multiprocessor systems –> Cache Coherency Protocol
* Write Propagation
* Transaction Serialization (reads/writes to a memory location must be seen in the same order by all processors)

====== Storage Technologies - Why Caching Works –> Locality of Reference ======
* Memory references tend to cluster
* Program execution is sequential
* Instructions tend to be localized to a few procedures
* Iterative constructs consist of a small number of instructions
* Processing of data structures, like arrays or sequences
* Spatial locality
* Execution involves a number of memory locations that are clustered
* Accessing data locations sequentially
* Spatial locality can be exploited by using large cache blocks and prefetching
* Temporal locality
* Execution involves memory locations that have been accessed recently
* Temporal locality can be exploited by keeping recently used values in cache
* Locality of reference enables us to do caching and prefetching
* It also enables us to do branch prediction and other things, but this goes beyond the scope of this course

====== Caching - Example ======
<code c>
int sizeX = 2000;
int sizeY = 1000;
int array[sizeY][sizeX];

// Fill the array with some data
fill_buffer(&array);

// Now run through the array and do something with the elements
// This runs slow in C
for (int x = 0; x < sizeX; x++) {
    for (int y = 0; y < sizeY; y++) {
        array[y][x] += 1;   // consecutive accesses are sizeX ints apart -> cache-unfriendly
    }
}
// Swapping the two loops (y outside, x inside) walks the array in memory order
// and exploits spatial locality
</code>

====== Memory Hierarchy - Recap ======
* Problem: A CPU waiting for data can’t do any work
* Low memory access times are crucial
* Solution: Caching/Prefetching algorithms
* Works well with sequential data / local data access patterns
* Won’t work with totally random access patterns (Locality)
* As one goes down the hierarchy
* Increasing access time
* Decreasing throughput
* Also known as the “Von Neumann Bottleneck”

====== Storage Technologies (I) ======
* I/O Devices
* Human Readable
* Machine Readable
* Communication
* Differences in
* Data Rate
* Application
* Complexity of Control

====== Storage Technologies (II) ======
* Across the different storage technologies, these relationships seem to hold:
* Faster access time, greater cost per bit
* Greater capacity, smaller cost per bit
* Greater capacity, slower access speed

====== Storage Technologies - Device Characteristics (I) ======
* Performance
* Access Time / Latency
* Throughput
* Capacity
* Raw capacity
* Storage density
* Volatility / Differentiation
* Dynamic Memory (periodic refreshes required)
* Static Memory

====== Storage Technologies - Device Characteristics (II) ======
* Mutability
* Read/Write
* Read Only
* Slow Write, Fast Read (e.g. SMR disks)
* Accessibility
* Random Access
* Sequential Access
* Addressability
* Location addressable
* File addressable
* Content addressable

====== Sequential I/O vs. Random I/O ======
* Sequential I/O
* Writing / reading contiguous large chunks of data (chunk size >= 10E6 bytes)
* Usually the fastest way to read data from storage devices
* Not always easily applicable to a problem
* Random I/O
* Writing / reading small chunks of data to / from random locations (chunk size <= 10E4 bytes)
* Slowest way to read data from storage devices
* Magnitude of the slow-down depends on the underlying hard- and software (e.g. tape drives vs. flash drives)
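====== Sequential I/O vs. Random I/O - Sketch ======
To make the two access patterns concrete, a minimal sketch at the system-call level. The file name "testfile" and the sizes are placeholders; to measure the device rather than the page cache, the file would additionally have to be opened with O_DIRECT and the two loops timed.

<code c>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK   4096
#define NBLOCKS 10000L

int main(void) {
    int fd = open("testfile", O_RDONLY);   /* placeholder: any large existing file */
    if (fd < 0) { perror("open"); return 1; }

    off_t filesize = lseek(fd, 0, SEEK_END);
    if (filesize < (off_t)BLOCK * NBLOCKS) { fprintf(stderr, "file too small\n"); return 1; }

    char buf[BLOCK];

    /* Sequential: contiguous offsets, the device (and readahead) can stream */
    for (long i = 0; i < NBLOCKS; i++)
        pread(fd, buf, BLOCK, (off_t)i * BLOCK);

    /* Random: scattered offsets, every request pays the full positioning cost */
    for (long i = 0; i < NBLOCKS; i++) {
        off_t block = rand() % (filesize / BLOCK);
        pread(fd, buf, BLOCK, block * BLOCK);
    }

    close(fd);
    return 0;
}
</code>

On a hard disk the second loop is dominated by seek time and rotational delay; on an SSD the gap is much smaller, as the following slides show.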
====== Hard-Drives Overview (I) ======
* Invented in the mid-1950s by IBM
* The first IBM drive stored 3.75 MB on a stack of 50 disks
* Had the size of two refrigerators (1.9 m³)
* Became cheap / mainstream in the late 1980s
* Today one 3.5" drive can hold up to 14 TB of data
* Density ~1.5 Tbit / in²
* With future technologies even higher data densities may become possible
* Filling disks with helium
* Shingled Magnetic Recording
* Heat Assisted Magnetic Recording
* Interface
* Serial Attached SCSI (SAS)
* SATA
* NVMe

====== Hard-Drives Overview (II) ======
{{.:harddisk_description_wikipedia.png}}
* Video of an operating hard disk on Wikipedia: https:%%//%%upload.wikimedia.org/wikipedia/commons/c/c5/HardDisk1.ogv

====== Hard-Drives Characteristics ======
* Seek Time
* The time it takes to move the head to the correct track
* Rotational Delay
* The time it takes until the platter rotates to the correct position (~2 ms @ 15’000 rpm)
* r… rotation speed in revolutions per second
* $T_{rdelay} = \frac{1}{2r}$
* Transfer time calculation
* $T_{transfer}$… Transfer Time
* b… Number of bytes to be transferred
* N… Number of bytes per track
* r… rotation speed in revolutions per second
* $T_{transfer} = \frac{b}{rN}$
* Average Access Time
* $T_{access} = T_{seek} + T_{rdelay} + T_{transfer}$

====== Hard-Drives Sequential Access Example ======
* Do not rely on average values!
* Given a hard disk with
* Rotational speed of 7’500 rpm
* 512-byte sectors
* 500 sectors per track
* We read a file that is stored on 5 adjacent tracks in 2500 sectors (1.28 MBytes)
* Average seek… 4 ms
* Rotational delay… 4 ms
* Read 500 sectors… 8 ms
* We need to do this 5 times (once for each track), but because of the sequential organization we can skip the seek time for the consecutive tracks
* $T_{total} = 16 + (4 * 12) = 64 ms$

====== Hard-Drives Random Access Example ======
* Given a hard disk with
* Rotational speed of 7’500 rpm
* 512-byte sectors
* 500 sectors per track
* We read a file that is distributed randomly over 2500 sectors of the disk
* Average seek… 4 ms
* Rotational delay… 4 ms
* Read 1 sector… 0.016 ms
* We need to do this 2500 times, with a seek for each sector
* $T_{total} = 2'500 * 8.016 = 20'040 ms = 20.04 s$
* Slowdown of nearly 3 orders of magnitude with the same average values
* Do not rely on average values!
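====== Hard-Drives Access Time - Sketch ======
The arithmetic from the two examples above as a small sketch, so the numbers are easy to re-check or to adapt to a different drive (same assumed parameters: 7’500 rpm, 512-byte sectors, 500 sectors per track, 4 ms average seek):

<code c>
#include <stdio.h>

int main(void) {
    double seek_ms   = 4.0;
    double rpm       = 7500.0;
    double rot_ms    = 60000.0 / rpm;   /* one full revolution: 8 ms       */
    double rdelay_ms = rot_ms / 2.0;    /* average rotational delay: 4 ms  */
    double sector_ms = rot_ms / 500.0;  /* reading one sector: 0.016 ms    */

    /* Sequential: 2500 sectors on 5 adjacent tracks, only one seek */
    double t_seq = (seek_ms + rdelay_ms + 500 * sector_ms)   /* first track   */
                 + 4 * (rdelay_ms + 500 * sector_ms);        /* 4 more tracks */

    /* Random: 2500 single sectors, seek + rotational delay every time */
    double t_rand = 2500 * (seek_ms + rdelay_ms + sector_ms);

    printf("sequential: %.0f ms, random: %.0f ms\n", t_seq, t_rand);  /* 64 ms vs. 20040 ms */
    return 0;
}
</code>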
====== Quo vadis Hard-Drives ======
* Current / Future Trends
* Helium-filled drives
* Shingled Magnetic Recording
* Heat-assisted recording

====== (NVMe) Solid State Drives (I) ======
* NAND Flash Memory
* Lower cost compared to DRAM
* No refresh cycles / external PSU needed to retain the data, compared to DRAM
* Uses less space compared to memory based on NOR gates
* No random access (cells are connected in series to sectors)
* Key components
* Controller
* Memory
* New Interfaces
* SATA Express
* NVM Express

====== (NVMe) Solid State Drives (II) ======
* NAND Flash Memory
* Single Level Cell (SLC)
* Multi Level Cell (MLC)
* SLC
* 1 bit per cell
* ~10 times longer life than MLC
* Lower latency than MLC
* MLC
* Multiple bits per cell
* Lower production cost

====== (NVMe) Solid State Drives - Memory (I) ======
* NAND Flash Memory
* Is composed of one or more chips
* Chips are segmented into planes
* Planes are segmented into thousands (e.g. 2048) of blocks
* And 1 or 2 registers of the page size for buffering
* A block usually contains 64 to 128 pages
* Each page has a data part (a few KBytes)
* and a small metadata area (e.g. 128 bytes) for Error Correcting Codes
* Exact specification varies across different memory packages

====== (NVMe) Solid State Drives - Memory (II) ======
* Read
* Is performed in units of pages
* 1 page takes approx. 10 µs (SLC) up to 100 µs (MLC) to read
* Write
* Is performed in units of pages
* Some controllers support sub-page operations
* Pages in one block must be written sequentially
* A page write takes approximately 100 µs (SLC) up to 900 µs (MLC)
* A block must be erased before being reprogrammed

====== (NVMe) Solid State Drives - Memory (III) ======
* Erase
* Must be performed in block granularity (hence the term erase block)
* Can take up to 5 ms
* Limited number of erase cycles
* SLC: 100’000
* MLC: 10’000
* Some flash memory is reserved to replace bad blocks
* The controller takes care of wear leveling

====== (NVMe) Solid State Drives - Controller (I) ======
* Higher complexity of access compared to hard disks
* High-performance SSDs consist of hundreds of dies with parallel communication paths
* Data cells may wear out and become bad
* Erase size != page size
* We need a controller to handle this:
* Does the initial setup
* Formats the memory
* Marks bad blocks
* Moves data and picks blocks so that single cells don’t wear out
* Parallelizes read/write/erase operations on many dies
* Addresses in blocks/pages
* Flash Translation Layer and logical-to-physical mapping
* Garbage Collection

====== (NVMe) Solid State Drives - Speed (I) ======
* Despite the additional complexity, SSDs are much faster than traditional disks
* Read latency is on the order of a few microseconds (we had milliseconds with HDDs)
* No need to move a head around
* But because of the arrangement in blocks and pages, sequential access is still faster
* More IOPS because of the lower latency
* Needs more CPU utilization to get full throughput
* Needs some pressure (multiple outstanding calls) to fully parallelize read/write calls
* This is also called ‘Queue Depth’

====== (NVMe) Solid State Drives - IOPS vs. Throughput ======
* Traditional disks have been measured in terms of throughput
* SSDs are sometimes measured in terms of IOPS (Input/Output Operations per Second)
* The following equation holds (if there is no controller overhead like erasing, garbage collection, etc.):
* $Throughput = IOPS * Blocksize$
* Where the blocksize can be chosen freely
* So if we know the blocksize that was used in benchmarking…
* We can calculate IOPS from throughput and vice versa
* (If we assume that the disk was empty at the beginning of the benchmark and no additional controller overhead was involved)
* We now even have a basis to calculate which blocksize is at least necessary to get full write throughput

====== (NVMe) Solid State Drives - Speed (II) ======
* Given an Intel DC P3700 SSD with a capacity of 2 TB, the specification says:
* Read: 450’000 IOPS with 2800 MB/s of max throughput
* $2'800'000'000 Bytes = 450'000 * x$
* $x = 6222 Bytes \approx 6 KByte$
* So a blocksize of 8 KByte is a good starting point to try to get full throughput
* Random reads with a block size of 8 KByte should work well on this device and we should get the full throughput of 2800 MB/s
* How would a traditional HDD perform under these conditions?
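====== (NVMe) Solid State Drives - Blocksize Sketch ======
Before answering that question, the block-size arithmetic from the previous slides as a small sketch (datasheet numbers from the Intel DC P3700 example above):

<code c>
#include <stdio.h>

int main(void) {
    double throughput = 2800.0e6;   /* bytes per second (max read throughput) */
    double iops       = 450000.0;   /* read operations per second             */

    /* Throughput = IOPS * Blocksize  ->  minimum block size for full throughput */
    double blocksize = throughput / iops;
    printf("minimum block size: %.0f bytes (~6 KByte)\n", blocksize);

    /* With 8 KByte blocks the IOPS limit no longer caps us below the device limit */
    double ceiling = iops * 8192.0;   /* ~3686 MB/s > 2800 MB/s */
    printf("IOPS-imposed ceiling at 8 KByte blocks: %.0f MB/s\n", ceiling / 1e6);
    return 0;
}
</code>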
====== (NVMe) Comparison SSD vs. HDD ======
* On the previous slides we saw that a block size of 8 KByte on our SSD should lead to a throughput of 2800 MB/s
* Let’s try this with an HDD
* Rotational speed of 7’500 rpm
* 512-byte sectors
* 500 sectors per track

====== (NVMe) Comparison SSD vs. HDD ======
* We read a file that is distributed randomly over 2500 sectors of the disk in blocks of 8 KByte (blocks of 16 sectors)
* Average seek… 4 ms
* Rotational delay… 4 ms
* Read 16 sectors… 0.256 ms
* We need to do this 156.25 times, with a seek for every block of 16 sectors
* $T_{total} = 156.25 * 8.256 = 1290 ms = 1.290 s$
* We transferred 156.25 * 8 KBytes = 1250 KBytes in 1.290 s
* This equals ~1 MByte per second (SSD: 2800 MByte per second)
* With the same block size our benchmark will run ~2800 times faster on the SSD than on the HDD
* When utilized sequentially, HDDs are still 20-30 times slower than high-performance SSDs

====== (NVMe) Solid State Drives - Speed (II) ======
* $Transfertime = T_{controller} + T_{transfer}$

====== Quo vadis Solid State Drives ======
* Current / Future Trends
* The flash memory market surpassed the DRAM market in 2012
* The density of flash memory surpassed hard drive density in 2016
* In Q4 2016, 45 million SSDs with a total capacity of 16 Exabyte were delivered to customers
* The market for HDDs is significantly bigger than for SSDs
* New memory technologies (e.g. Intel/Micron 3D XPoint)
* Intel Optane Memories
* Things to consider
* Will SSDs replace all HDDs in the future?
* All-Flash SDS
* Storage as Memory
* Everything except archive
* What use cases remain for HDDs?
====== Magnetic Tapes (I) ======
* Have been in use for data storage since the 1950s
* Main storage medium in some early computers
* Capacity:
* 1950s: ~1 MByte
* 1980s: ~100 MByte - 1 GByte
* 1990s: ~10 - 100 GByte
* 2000s: ~100 GByte - 1 TByte
* Now: >10 TByte
* Future: going up to 200 TByte per tape seems possible
* (https:%%//%%www.sony.net/SonyInfo/News/Press/201404/14-044E/index.html)
* Real tape capacity is usually 1/2 of the quoted capacity
* A compression ratio of 2:1 is silently assumed

====== Magnetic Tapes - Characteristics (I) ======
* Tape width
* Half inch has been the most common width
* Recording Method
* Linear
* Linear Serpentine
* Scanning (writing across the width of the tape)
* Helical Scan (short, dense, diagonal tracks)
* Block Layout
* Data is written in blocks, each block is a single operation

====== Magnetic Tapes - Characteristics (II) ======
* Access
* Sequential Access
* Index database or Tape Marks
* Compression
* Data is usually compressed when written on tape
* Encryption
* Data encryption is possible and standardised
* Encryption needs to be done after compression
* Why?

====== Magnetic Tapes - Characteristics (II) ======
* Access
* Sequential Access
* Index database or Tape Marks
* Compression
* Data is usually compressed when written on tape
* Encryption
* Data encryption is possible and standardised
* Encryption needs to be done after compression
* Encrypted data should look like random noise
* If your encrypted data is compressible, then your encryption is flawed

====== Magnetic Tapes - LTO ======
* Linear Tape Open
* Open standard to store data on magnetic tapes
* Developed in the 1990s
* The eighth LTO generation was released in 2017
* LTO-8 (state of the art)
* 12 TByte raw capacity (uncompressed)
* 360 MByte/s max uncompressed speed
* Compression ratio 2.5:1
* Supports encryption
* Supports WORM (Write Once Read Many)

====== Magnetic Tapes - No random I/O ======
* The loading and winding of the tape takes a long time (up to many seconds)
* Tape moves at a constant speed
* Modern tape machines can move at 1/2 or 1/4 speed
* Shoe-shining happens when there is not enough data to write (the drive has to stop, rewind and restart)
* Wears out the tape
* Doing random reads wears out the tape
* Expected durability of a tape is about 10^4 end-to-end passes

====== Memory Hierarchy - cont’d ======
* With the given storage technologies (Flash, HDD and Tape) we can refine our memory hierarchy
* Using flash memories for caching and metadata
* Using HDDs as online data storage
* Using tapes as offline data storage
* With the concept of locality and the right applications this will speed up our storage stack
* Recently randomly accessed files, and metadata that fits on the SSDs, will be stored on the SSDs
* Recently sequentially accessed files will be moved to the HDDs (they are fast enough when used sequentially)
* Rarely accessed files go to the tape machine
* We could even link these files transparently into the file system, so the user doesn’t have to know anything about where their data is located
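====== Memory Hierarchy - Placement Sketch ======
A minimal sketch of such a placement rule. The tier names, the 180-day threshold and the "mostly random" flag are made up for illustration; real hierarchical storage managers base this decision on much richer access statistics.

<code c>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

enum tier { TIER_SSD, TIER_HDD, TIER_TAPE };

/* Hypothetical placement rule: hot random data on flash,
   warm sequential data on disk, cold data on tape. */
static enum tier choose_tier(time_t last_access, bool mostly_random, time_t now) {
    double idle_days = difftime(now, last_access) / 86400.0;

    if (idle_days > 180.0)   /* rarely accessed -> archive on tape          */
        return TIER_TAPE;
    if (mostly_random)       /* recently and randomly accessed -> flash     */
        return TIER_SSD;
    return TIER_HDD;         /* recently and sequentially accessed -> disk  */
}

int main(void) {
    time_t now = time(NULL);
    printf("%d\n", choose_tier(now - 10  * 86400, true,  now));  /* 0 = TIER_SSD  */
    printf("%d\n", choose_tier(now - 10  * 86400, false, now));  /* 1 = TIER_HDD  */
    printf("%d\n", choose_tier(now - 400 * 86400, false, now));  /* 2 = TIER_TAPE */
    return 0;
}
</code>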
====== Multi Tiered Storage Systems ======
* Different tiers of storage, e.g.
* Tier 1: SSDs
* Tier 2: SATA HDDs
* Tier 3: Tape Storage
* But there’s no free lunch
* We need to store data about where our data is stored (more on metadata later on) –> Memory Overhead
* If a user accesses a file, we need to check where the file is –> Processing Overhead
* In the worst case we have to copy it from tape to disk
* which means loading the right tape and winding to the correct position
* which also means that the user might think that the storage is very slow today

====== Gaining speed and redundancy with RAID ======
* RAID - Redundant Array of Independent (Inexpensive) Disks
* Distribute data across several drives
* Add parity calculation or mirroring for redundancy
* Different Techniques / RAID Levels available
* RAID 0, 1, 5, 6, 10, 01, …
* Several disks in a RAID are called a RAID array

====== RAID - Striping and Mirroring ======
* RAID Level 0
* Data is striped across drives without redundancy
* Failure of one disk leads to loss of all data in the array
* Speedup is proportional to the number of disks
* RAID Level 1
* Data is mirrored across disks (usually 2, but more are possible)
* Failure of a disk involves no loss of data
* Read calls can be parallelized across multiple disks
* No speedup for write calls

====== RAID - Parity ======
* RAID Level 5
* Data is striped across drives, parity is calculated (XOR) and stored (see the sketch below)
* Block-level striping
* Failure of one disk can be compensated
* Failure of more than one disk leads to data loss
* Considered obsolete nowadays, because of long rebuild times
* Use RAID-6
* RAID Level 6
* Data is striped across drives, parity is calculated and stored
* 2 different syndromes are computed to allow the loss of up to two drives
* Additional computation needed (no simple XOR for the second syndrome)

====== RAID - Other Levels ======
* “Hybrid”-RAID
* Mirroring between SSD and HDD (special controller needed)
* Nested RAID
* RAID-10
* Stripe of Mirrors
* RAID-01
* Mirror of Stripes
* RAID-10 is most probably the better choice compared to RAID-01
* Improved fault tolerance

====== But where are the levels 2-4? ======
* RAID 2 - RAID 4 are obsolete
* RAID 2
* Bit-level striping with a dedicated parity disk
* RAID 3
* Byte-level striping with a dedicated parity disk
* RAID 4
* Block-level striping with a dedicated parity disk
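====== RAID - Parity Sketch ======
Coming back to RAID 5: the parity is a plain XOR over the data blocks of a stripe, which is enough to rebuild any single lost block. A minimal sketch on one stripe of four tiny data blocks (the block contents are arbitrary placeholders):

<code c>
#include <stdio.h>
#include <string.h>

#define NDATA 4
#define BLOCK 8   /* tiny blocks, just for illustration */

int main(void) {
    unsigned char data[NDATA][BLOCK] = { "AAAAAAA", "BBBBBBB", "CCCCCCC", "DDDDDDD" };
    unsigned char parity[BLOCK] = { 0 };

    /* parity = d0 ^ d1 ^ d2 ^ d3 (this is what RAID 5 stores) */
    for (int d = 0; d < NDATA; d++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= data[d][i];

    /* "Lose" disk 2 and rebuild its block from parity and the surviving blocks */
    unsigned char rebuilt[BLOCK];
    memcpy(rebuilt, parity, BLOCK);
    for (int d = 0; d < NDATA; d++)
        if (d != 2)
            for (int i = 0; i < BLOCK; i++)
                rebuilt[i] ^= data[d][i];

    printf("rebuilt block 2: %s\n", (const char *)rebuilt);   /* prints CCCCCCC */
    return 0;
}
</code>

RAID 6 additionally stores a second, differently computed syndrome, which is why it survives two failures but needs more computation.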
====== File Management ======
* Abstraction of the secondary storage
* After all, we don’t want to cope with the internals of storage devices
* Abstraction to Files / Directories –> File System
* Files
* Long-term existence
* Sharable between processes
* Structure
* Files can have internal structure (e.g. databases)
* Operations
* Create
* Delete
* Open
* Close
* Read
* Write

====== File Management - Inodes (I) ======
* Besides the file content, filesystems rely on data structures about the files
* Also called Metadata
* Inodes store information about files
* Type of file (file, directory, etc.)
* Ownership
* Access Mode / Permissions
* Timestamps (Creation, Access, Modification)
* Last state change of the inode itself (status, ctime)
* Size of the file
* Link counter
* One or more references to the actual data blocks

====== File Management - Inodes (II) ======
* Most filesystems use a fixed number of inodes
* The ext2 filesystem can address up to 12 blocks per inode
* If more blocks are needed for a file, indirect blocks are allocated
* Ext2 allows up to triple-nested indirect blocks, which leads to a maximum file size of 16 GiB to 4 TiB depending on the block size
* Appending to files and writing new files is not possible when inodes run out
* Depending on the number of files/directories, a filesystem can use up to 10% of its capacity for meta information
* Show the inode number with ‘ls -i’
* Show the inode contents with ‘stat’ (a small stat() sketch follows after the VFS slide)

====== File Management - Inodes (III) ======
* sreinwal@rs ~/Work/vboxshared/sreinwal/pandoc/vsc-markdown/parallell-io/02_storage_technologies $ stat storage_technologies.md
* File: storage_technologies.md
* Size: 30837 Blocks: 64 IO Block: 4096 regular file
* Device: fd04h/64772d Inode: 25373850 Links: 1
* Access: (0644/-rw-r--r--) Uid: ( 1001/sreinwal) Gid: ( 1001/sreinwal)
* Access: 2017-11-28 14:25:11.191823770 +0100
* Modify: 2017-11-28 14:23:53.482827520 +0100
* Change: 2017-11-28 14:25:20.416823325 +0100
* Birth: -

====== File System - Organization (I) ======
{{.:fs1.png}}
(Operating Systems 7th Edition, W. Stallings, Chapter 12)

====== File System - Organization (II) ======
* Device Drivers / Hardware Layer
* Deals with I/O operations on the devices
* Usually considered part of the operating system
* Basic File System / Physical I/O Layer
* Deals with blocks of data that are exchanged with device drivers
* Manages placement of blocks and the buffering of these blocks in main memory

====== File System - Organization (III) ======
* Basic I/O Supervisor / Basic I/O Layer
* Responsible for file I/O and its termination
* Control structures are maintained to deal with device I/O, scheduling and file status
* Disk scheduling to optimize performance
* Logical I/O Layer
* Provides general-purpose file I/O
* Maintains basic data about files
* Enables users/applications to access files

====== File System - Organization (IV) ======
{{.:fs2.png}}
(Operating Systems 7th Edition, W. Stallings, Chapter 12)

====== Unix File Management ======
* 6 different types of files
* Regular
* Contains data. Simple stream of bytes.
* Directory
* Contains a list of file names plus pointers to their index nodes (inodes)
* Special
* Maps physical devices to files (e.g. /dev/sda for the first SAS/SATA disk)
* Named pipes
* Buffers received data
* Enables interprocess communication
* FIFO (First-In-First-Out)
* Links
* Alternative filenames for existing files
* Symbolic Links
* Basically an ordinary file that contains the name of the file it is linked to

====== Linux Virtual File System ======
* Different file systems under Linux
* ext2, ext3, ext4, reiserfs, xfs, nfs, …
* VFS provides a virtual file system
* Presents a single, uniform interface to user processes
* Mapping between VFS and the underlying file system
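====== Unix File Management - stat() Sketch ======
The inode fields from the previous slides are available to programs through the POSIX stat() call; it reports the same information the ‘stat’ command printed on the Inodes (III) slide. A minimal sketch, with the file name as a placeholder:

<code c>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int main(void) {
    struct stat st;
    if (stat("storage_technologies.md", &st) != 0) {   /* placeholder: any existing file */
        perror("stat");
        return 1;
    }
    printf("inode: %lu\n", (unsigned long)st.st_ino);
    printf("size:  %lld bytes in %lld blocks\n",
           (long long)st.st_size, (long long)st.st_blocks);
    printf("links: %lu\n", (unsigned long)st.st_nlink);
    printf("mode:  %04o\n", (unsigned int)(st.st_mode & 07777));
    printf("mtime: %s", ctime(&st.st_mtime));
    return 0;
}
</code>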
====== Storage Networks - NAS vs. SAN ======
* NAS - Network Attached Storage
* Centralized file storage
* Exports file systems for clients to mount
* Connected via network (Ethernet, InfiniBand, …)
* The file system layers of the server are used
* User/access management at the server
* SAN - Storage Area Network
* Centralized device array
* Exports block devices
* Connected via Fibre Channel
* iSCSI protocol, NVMe-oF protocol
* The client is responsible for handling file systems on the block device
* No user/access management at block level

====== Distributed File Systems - NFSv4 ======
* NFSv4
* Network File System Version 4
* Client-server architecture
* Provides a standardized view of its local filesystem
* Allowing a heterogeneous collection of processes to share a common filesystem
* Allows sharing of all local filesystems as well as other network filesystems
* The VFS layer comes in handy here
* NFSv4 uses the Remote File Service / Remote Access Model
* The file stays on the server
* Locking mechanisms (stateful model)
* NFSv2 and NFSv3 were stateless, which led to some problems (a separate locking daemon was needed)
* As opposed to the Upload/Download Model
* Client downloads the file, does operations locally, re-uploads the file

====== Client Side Caching - Asynchronous I/O ======
* Synchronous or Blocking I/O
* The process issues an I/O call to the operating system and waits for the operation to complete
* Leads to intervals where the CPU is completely idle, waiting for the operation to finish
* Asynchronous or Non-Blocking I/O
* As we have seen, I/O operations can be very slow
* The client calls the operating system for a read/write
* Continues processing
* Data is read/written in the background (see the sketch after the next slide)
* The operating system informs the process when the operation has completed
* Client side caching in NFS
* async mount option
* Data is cached on the client

====== Server Side Caching - NFSv4 ======
* Server Side Caching
* The server replies to requests before they have been committed to stable storage
* Usually less error-prone than client side caching, because servers are usually in a more controlled environment than clients
* e.g. Uninterruptible Power Supply, BBU units on RAID controllers, etc.
* Problems arise when
* The server crashes before the cached data has reached stable storage
* The client crashes before it transfers data to the server
* The network connection breaks
* Power outages
* etc…
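====== Client Side Caching - Asynchronous I/O (Sketch) ======
Coming back to the non-blocking model from the Asynchronous I/O slide: a minimal sketch using the POSIX AIO interface. The file name is a placeholder, and on older glibc versions the program has to be linked with -lrt.

<code c>
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);     /* placeholder input file */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }   /* request queued, returns immediately */

    /* ... the process is free to do useful work here while the read is in flight ... */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);              /* block only when the data is really needed */
    printf("read %zd bytes asynchronously\n", aio_return(&cb));

    close(fd);
    return 0;
}
</code>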
====== Parallel File Systems - BeeGFS ======
* FhGFS / BeeGFS Parallel File System
* Distributes data and metadata across several targets
* Stripes data across several targets
* Huge speedup compared to single-server appliances
* Scalability and flexibility were key aspects of the design

====== BeeGFS - Management Server ======
* Management Server
* Exactly 1 MS per filesystem
* Keeps track of metadata and storage targets
* Keeps track of connected clients
* Tags targets with labels
* Normal
* Low
* Critical
* Not involved in filesystem operations

====== BeeGFS - Metadata Server ======
* Holds the metadata information of the file system
* Should be stored on fast storage devices (e.g. flash memories)
* Metadata accesses are mostly tiny random accesses
* Ext4 seems to be the fastest filesystem to store BeeGFS metadata
* Additional servers can be added on demand
* Each server holds exactly one metadata target
* TargetChooser
* Chooses targets according to the configuration when a user writes a file
* Holds metadata information
* File / directory information
* Creation date
* Striping pattern
* Affected storage targets
* Permissions
* ACLs
* Extended attributes
* …

====== BeeGFS - Metadata Server ======
* The entry point of the file system ‘/’ is located on MDS #1
* Defined entry point in the directory tree
* Additional directories are assigned randomly to the other MDSs
* Leads to effective load balancing on the metadata servers
* As long as there are significantly more directories than metadata targets
* The client walks down the directory tree to find the responsible MDS for a directory
* Handling metadata requests creates some load on the CPU cores
* For a small number of requests, higher clock speeds lead to lower latency
* For a high number of requests, multiple CPU cores are needed
* Metadata handling will create processes which are waiting for data
* CPU overcommitting could increase performance

====== BeeGFS - Object Storage Server ======
* Holds the file contents on Object Storage Targets (OSTs)
* The underlying devices / filesystem can be chosen freely
* As long as the file system is POSIX compliant
* File contents can be striped over multiple OSTs
* One server can handle multiple OSTs
* A typical OST consists of 6 - 12 drives running in RAID 6
* The number of threads influences the number of requests that can be put on disk
* A higher number of threads will lead to randomization of sequential data
* We have to live with that to a certain extent in multi-user environments
* The chunk size sets the amount of data stored on an OST per stripe
* OSTs can be added on demand, but a rebalancing of the file system might be needed to gain full performance

====== BeeGFS - Clients ======
* Clients can be added on demand
* Thousands of clients are no problem as long as there are enough MDTs and OSTs available
* tuneFileCacheType - client caching is supported
* Buffered Mode
* Uses a pool of static buffers for write-back and read-ahead
* Caches up to a few hundred kilobytes of a file
* Default mode
* Native Mode
* Uses the Linux kernel page cache
* May cache up to multiple gigabytes of data (depending on the available RAM)
* Experimental / work in progress

====== There’s much more ======
* Different Storage Devices
* Optical
* Laserdisc / CD-ROM / DVD / Blu-ray
* Magnetic
* Floppy / ZIP drives
* Magneto-Optical
* Minidisc (Sony)
* Different Parallel File Systems
* GPFS
* GlusterFS
* Lustre
* Different use cases
* Storage Area Network
* Different architectures
* HP - The Machine

====== Storage Technologies - Bibliography ======
* Understanding Intrinsic Characteristics and System Implications of Flash Memory Based Solid State Drives, Chen et al., 2009
* Operating Systems - Internals and Design Principles, 7th Edition, Stallings, 2012
* Distributed Systems - Principles and Paradigms, 2nd Edition, Tanenbaum et al., 2007

====== Storage Technologies - The End ======
* Remember
* Random I/O is orders of magnitude slower than sequential I/O
* Sequential I/O can even be done in parallel from multiple nodes to further improve the throughput
* Highly parallelized random calls will result in degraded storage performance for ALL users and can even lead to an unresponsive storage system
* Don’t do random I/O on storage devices

Thank you for your attention