File systems over network
Hung-Wei Tseng
htseng/classes/cs202...

Recap: How your application reaches storage device

[Figure: the I/O stack from application to storage. In user space, applications call fread/fwrite on files like input.bin/output.bin through buffered I/O libraries, which hand data to the kernel. In the kernel, the file system translates these into read/write on block addresses (0, 512, 4096, …) and issues them through a device-independent I/O interface (e.g. ioctl) and a buffer to device drivers. In hardware, device controllers serve the block requests on an HDD or on an SSD with its FTL. Where does the network fit into this picture?]

Recap: Extent file systems — ext2, ext3, ext4
• Basically optimizations over FFS + extents + journaling (write-ahead logs)
• Extent — a run of consecutive disk blocks
  • A file in ext file systems is a list of extents
• Journal
  • Write-ahead logs — performs writes as in LFS
  • Applies the log to the target location when appropriate
• Block groups
  • Modern HDDs do not have the concept of "cylinders"; they label neighboring sectors with consecutive block addresses
  • Does not work for SSDs, given their internal log-structured management of block addresses

Recap: flash SSDs, NVM-based SSDs
• Asymmetric read/write behavior and performance
• Wear out faster than traditional magnetic disks
• Another layer of indirection is introduced
  • Intensifies log-on-log issues
• We need to revise the file system design

The introduction of virtual file system interface
[Figure: the same I/O stack, now with a Virtual File System layer in the kernel. Applications and user-space libraries issue open, close, read, write, … to the VFS, which dispatches them to file system #1 (e.g. ext4) or file system #2 (e.g. f2fs). Each file system issues read/write on block addresses through a device-independent I/O interface (e.g. ioctl) and device drivers to the HDD's device controller or to the SSD's controller and FTL.]
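The dispatch in the figure above can be sketched in a few lines. This is a hypothetical model, not kernel code: each file system registers an operations object, and the VFS routes a call to whichever mounted file system owns the path. The class and method names are invented for illustration.

```python
# Hypothetical sketch of VFS-style dispatch: each file system registers an
# operations table, and the VFS routes calls by mount point.
class Ext4Ops:
    def read(self, path):
        return f"ext4 read of {path}"

class NFSOps:
    def read(self, path):
        return f"NFS read of {path} forwarded over the network"

class VFS:
    def __init__(self):
        self.mounts = {}            # mount point -> file system operations

    def mount(self, point, ops):
        self.mounts[point] = ops

    def read(self, path):
        # Pick the longest mount point that is a prefix of the path.
        point = max((m for m in self.mounts if path.startswith(m)), key=len)
        return self.mounts[point].read(path)

vfs = VFS()
vfs.mount("/", Ext4Ops())
vfs.mount("/mnt/nfs", NFSOps())
```

The key design point: applications see one interface (open, close, read, write, …) no matter which file system — local or remote — actually serves the call.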

Outline
• NFS
• Google file system

Network File System

The introduction of virtual file system interface
[Figure: the same VFS stack, extended with file system #3 — NFS. The VFS dispatches open, close, read, write, … to NFS exactly as it does to ext4 or f2fs, but instead of issuing block reads/writes to a local disk, NFS forwards the calls through the network stack and a network device driver to the NIC.]

NFS Client/Server
[Figure: on the client, an application's open, close, read, write, … calls go through the Virtual File System to the NFS client, then down through the network stack and network device driver to the NIC. On the server, requests arrive at the NIC, pass up through the network stack to the NFS server, which re-issues them through its own Virtual File System to a disk file system; the disk file system performs read/write on block addresses through the device driver, the I/O interface, and the device controller to HDD #1.]

How does NFS handle a file?
• The client gives its file system a tuple to describe the data
  • Volume: identifies which server contains the file — represented by the mount point in UNIX
  • inode: where the file lives on the server
  • Generation number: the version number of the file
• The local file system forwards the requests to the server
• The server responds to the client with file system attributes, as a local disk would
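The tuple above can be sketched as a simple structure. This is illustrative only — real NFS packs these fields into an opaque file handle, and the field values here are invented:

```python
# Hypothetical sketch of the (volume, inode, generation) tuple an NFS client
# uses to name a file on the server.
from collections import namedtuple

FileHandle = namedtuple("FileHandle", ["volume", "inode", "generation"])

handle = FileHandle(volume="server1:/export/home", inode=1337, generation=2)

# If the file at inode 1337 is deleted and the inode number is reused, the
# server bumps the generation number, so stale handles can be rejected.
def is_stale(server_generation, h):
    return h.generation != server_generation
```

The generation number is what lets the server detect that a client is holding a handle to a file that no longer exists, even though the inode number has been recycled.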

How open works with NFS

open("/mnt/nfs/home/hungwei/foo.c", O_RDONLY);
  client: lookup home      →  server: return the inode of home
  client: read home        →  server: return the data of home
  client: lookup hungwei   →  server: return the inode of hungwei
  client: read hungwei     →  server: return the data of hungwei
  client: lookup foo.c     →  server: return the inode of foo.c
  client: read foo.c       →  server: return the data of foo.c
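The component-by-component traversal above can be sketched as a loop: every path component costs a lookup round-trip plus a directory-read round-trip. The server-side tables here are hypothetical stand-ins for the real server state:

```python
# Sketch of NFS open: one lookup RPC and one read RPC per path component.
# The inode numbers and directory contents below are invented.
server_inodes = {"home": 2, "hungwei": 17, "foo.c": 42}   # name -> inode
server_dirs = {2: ["hungwei"], 17: ["foo.c"]}             # inode -> entries

rpcs = []                                                 # round-trips issued

def lookup(name):
    rpcs.append(("lookup", name))
    return server_inodes[name]

def read_inode(inode):
    rpcs.append(("read", inode))
    return server_dirs.get(inode, [])

def nfs_open(path):
    inode = None
    for component in path.strip("/").split("/"):
        inode = lookup(component)        # one round-trip
        read_inode(inode)                # another round-trip
    return inode

fd = nfs_open("/home/hungwei/foo.c")     # 3 components -> 6 round-trips
```

Six network round-trips just to open one file three directories deep — which is exactly why the next slide turns to caching.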

Caching
• NFS operations are expensive
  • Lots of network round-trips
  • The NFS server is a user-space daemon
• With caching on the clients
  • Only the first reference needs network communication
  • Later requests can be satisfied from local memory
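A client-side cache of this kind can be sketched in a few lines. This is a toy model (the server function and counter are invented for illustration), but it shows the property the slide states: only the first reference pays for the network.

```python
# Sketch of a client-side read cache in front of a remote server.
cache = {}
network_calls = 0

def server_read(path):
    global network_calls
    network_calls += 1                    # simulate a network round-trip
    return f"contents of {path}"

def cached_read(path):
    if path not in cache:
        cache[path] = server_read(path)   # first reference: network
    return cache[path]                    # later references: local memory

cached_read("/home/hungwei/foo.c")
cached_read("/home/hungwei/foo.c")        # served from the cache
```

The catch, explored two slides below: nothing here tells this client when another client changes the file on the server.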

Idempotent operations
• Given the same input, an idempotent operation always produces the same result, regardless of how many times it is performed
• You only need to retry the same operation if it fails
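The distinction can be made concrete: a write to a fixed offset is idempotent (a client that never saw the reply can safely retry), while an append is not (retrying duplicates data). The byte values below are illustrative:

```python
# Idempotent vs. non-idempotent mutations on a file modeled as a bytearray.
data = bytearray(8)

def write_at(buf, offset, payload):
    """Idempotent: repeating it leaves the buffer in the same state."""
    buf[offset:offset + len(payload)] = payload

def append(buf, payload):
    """NOT idempotent: repeating it grows the buffer again."""
    buf.extend(payload)

write_at(data, 0, b"hi")
write_at(data, 0, b"hi")    # retry after a lost reply: same result
append(data, b"!")
append(data, b"!")          # retry after a lost reply: duplicated byte
```

This is why a stateless protocol built on idempotent operations can handle lost requests and lost replies with the same blunt strategy: just retry.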

Think about this
[Figure: a file server (file system, network stack, disk) and three clients A, B, and C, each running an application over a file system with a local cache, and each holding a cached copy of foo.txt. Client A updates foo.txt in its cache; Client C won't be aware of the change made in Client A.]

Solution
• Flush-on-close: flush all write buffer contents when closing the file
  • Later open operations will get the latest content
• Force-getattr:
  • Opening a file requires a getattr to the server to check timestamps
  • An attribute cache remedies the performance cost

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung — Google

Why we care about GFS
• Conventional file systems do not fit the demands of data centers
• Workloads in data centers are different from those on conventional computers:
  • Storage is built on inexpensive disks that fail frequently
  • Many large files, in contrast to the small files of personal data
  • Primarily reading streams of data
  • Sequential writes appending to the end of existing files
  • Must support multiple concurrent operations
  • Bandwidth is more critical than latency

Data-center workloads for GFS
• Google Search (Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, vol. 23, 2003)
• MapReduce (MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004)
  • Large-scale machine learning problems
  • Extraction of user data for popular queries
  • Extraction of properties of web pages for new experiments and products
  • Large-scale graph computations
• BigTable (Bigtable: A Distributed Storage System for Structured Data, OSDI 2006)
  • Google Analytics
  • Google Earth
  • Personalized search

What does GFS propose?
• Maintaining the same interface
  • The same function calls
  • The same hierarchical directories/files
• Files are decomposed into large chunks (e.g. 64MB), each with replicas
• Hierarchical namespace implemented with a flat structure
• Master/chunkservers/clients
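With fixed-size chunks, locating the chunk that holds a given byte offset is pure arithmetic. A minimal sketch, using the 64MB chunk size mentioned above (the helper names are invented):

```python
# Mapping a byte offset within a file to a chunk index, with 64MB chunks.
CHUNK_SIZE = 64 * 1024 * 1024    # 64 MB

def chunk_index(offset):
    return offset // CHUNK_SIZE

def chunk_range(index):
    """First and last byte offset covered by a chunk."""
    return (index * CHUNK_SIZE, (index + 1) * CHUNK_SIZE - 1)
```

Because the mapping is trivial, a client can compute the chunk index itself and ask the master only "where are the replicas of chunk i of this file?" — no per-byte bookkeeping on the master.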

Latency Numbers Every Programmer Should Know

Operation                            Latency (ns)    Latency (us)  Latency (ms)  Notes
L1 cache reference                   0.5 ns                                      ~1 CPU cycle
Branch mispredict                    5 ns
L2 cache reference                   7 ns                                        14x L1 cache
Mutex lock/unlock                    25 ns
Main memory reference                100 ns                                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy         3,000 ns        3 us
Send 1K bytes over 1 Gbps network    10,000 ns       10 us
Read 4K randomly from SSD*           150,000 ns      150 us                      ~1GB/sec SSD
Read 1 MB sequentially from memory   250,000 ns      250 us
Round trip within same datacenter    500,000 ns      500 us
Read 1 MB sequentially from SSD*     1,000,000 ns    1,000 us      1 ms          ~1GB/sec SSD, 4x memory
Read 512B from disk                  10,000,000 ns   10,000 us     10 ms         20x datacenter round trip
Read 1 MB sequentially from disk     20,000,000 ns   20,000 us     20 ms         80x memory, 20x SSD
Send packet CA-Netherlands-CA        150,000,000 ns  150,000 us    150 ms

Flat file system structure
• Directories are illusions
• The namespace is maintained like a hash table
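A flat namespace of this kind can be sketched with an ordinary hash table keyed by full pathname, where "directories" exist only as shared prefixes. The entries below are invented for illustration:

```python
# Sketch of a flat namespace: full pathnames as hash-table keys, directories
# reconstructed on demand from prefixes.
namespace = {
    "/home/hungwei/foo.c": ["chunk-0001", "chunk-0002"],
    "/home/hungwei/bar.c": ["chunk-0003"],
}

def lookup(path):
    return namespace[path]    # one hash lookup, no per-component traversal

def list_dir(prefix):
    # The "directory" is an illusion: scan for keys sharing the prefix.
    return [p for p in namespace if p.startswith(prefix.rstrip("/") + "/")]
```

Compare with the NFS open sequence earlier: here a full-path lookup is one table probe, not one round-trip per path component.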

How open works with NFS

open("/mnt/nfs/home/hungwei/foo.c", O_RDONLY);
  client: lookup home      →  server: return the inode of home
  client: read home        →  server: return the data of home
  client: lookup hungwei   →  server: return the inode of hungwei
  client: read hungwei     →  server: return the data of hungwei
  client: lookup foo.c     →  server: return the inode of foo.c
  client: read foo.c       →  server: return the data of foo.c

You only need these in GFS:
  client: lookup /home/hungwei/foo.c      →  server: return the list of locations of /home/hungwei/foo.c
  client: read from one data location of /home/hungwei/foo.c  →  server: return the data of /home/hungwei/foo.c

Distributed architecture
• Decoupled data and control paths — only the control path goes through the master
• Load balancing and replicas among chunkservers

Distributed architecture
• Single master
  • Maintains file system metadata, including the namespace, mappings, access control, and chunk locations
  • Controls system-wide activities, including garbage collection and chunk migration
• Chunkservers
  • Store data chunks
  • Chunks are replicated to improve reliability (3 replicas by default)
• Clients
  • Provide APIs for applications to interact with
  • Interact with the master for control operations
  • Interact with chunkservers for accessing data
  • Can run on chunkservers

Reading data in GFS
  application → GFS client: filename, size
  GFS client → master: filename, chunk index
  master → GFS client: chunk handle, chunk locations
  GFS client → chunk server: chunk handle, byte range
  chunk server → GFS client: data from file
  GFS client → application: data
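The read path above can be sketched end to end. This is a toy model under stated assumptions: the master and chunkserver tables are invented stand-ins, and a real client would cache the master's reply and fall back to other replicas on failure.

```python
# Sketch of the GFS read path: control path to the master, data path to a
# chunkserver. Tables and names below are hypothetical.
CHUNK_SIZE = 64 * 1024 * 1024

# master metadata: (filename, chunk index) -> (chunk handle, replica list)
master = {("/logs/web.log", 0): ("handle-7", ["cs1", "cs2", "cs3"])}

# chunkserver storage: (server, chunk handle) -> chunk contents
chunkservers = {("cs1", "handle-7"): b"log data..."}

def gfs_read(path, offset, size):
    index = offset // CHUNK_SIZE
    handle, locations = master[(path, index)]     # control path: master
    replica = locations[0]                        # pick any replica
    chunk = chunkservers[(replica, handle)]       # data path: chunkserver
    start = offset % CHUNK_SIZE
    return chunk[start:start + size]

out = gfs_read("/logs/web.log", 0, 8)
```

Note that the bytes themselves never pass through the master — that is the decoupling of data and control paths the architecture slide emphasized.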

Writing data in GFS
  application → GFS client: filename, data
  GFS client → master: filename, chunk index
  master → GFS client: chunk handle, primary and secondary replicas
  GFS client → chunk servers: data (pushed to all replicas)
  GFS client → primary: write command
  primary: defines the order of updates in the chunk servers
  chunk servers → GFS client → application: response
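The key step in the sequence above — the primary defining one order of updates that all replicas follow — can be sketched as follows. The replica names and staging scheme are invented for illustration:

```python
# Sketch of write ordering: data is staged at every replica first, then the
# primary picks a serial order and all replicas apply mutations in it.
replicas = {"primary": [], "secondary-1": [], "secondary-2": []}
staged = {}                      # data pushed to replicas but not yet applied

def push_data(write_id, data):
    staged[write_id] = data      # step 1: client pushes data to all replicas

def primary_commit(order):
    # step 2: the primary assigns the order; every replica applies it the same
    for write_id in order:
        for log in replicas.values():
            log.append(staged[write_id])

push_data("w1", "A")
push_data("w2", "B")
primary_commit(["w2", "w1"])     # primary decides: w2 before w1
```

Because every replica applies the same sequence, concurrent writers cannot leave the replicas with different orderings — though, as the consistency slide notes, the resulting region can still be "consistent but undefined."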

Real world, industry experience
• Linux problems (section 7)
  • Linux driver issues — disks do not report their capabilities honestly
  • The cost of fsync — proportional to the file size rather than to the size of the updated chunk
  • A single reader-writer lock for mmap
• Thanks to the open-source nature of Linux, they could fix these issues and contribute the fixes back to the rest of the community
• GFS itself is not open-sourced

Single master design
• GFS claims this will not be a bottleneck
  • In-memory data structures for fast access
  • The master is only involved in metadata operations — decoupled data/control paths
  • Client caching
• What if the master server fails?

The evolution of GFS
• Mentioned in "Spanner: Google's Globally-Distributed Database", OSDI 2012 — "tablet's state is stored in set of B-tree-like files and a write-ahead log, all on a distributed file system called Colossus (the successor to the Google File System)"
• Single master

The evolution of GFS
• Support for smaller chunk sizes — Gmail

Lots of other interesting topics
• Snapshots
• Namespace locking
• Replica placement
• Chunk creation, re-replication, re-balancing
• Garbage collection
• Stale replica detection
• Data integrity
• Diagnostic tools: logs are your friends

GFS: Relaxed Consistency model
• Distributed, simple, efficient
• Filename/metadata updates/creates are atomic
• Consistency modes
  • Consistent: all replicas have the same value
  • Defined: the replica reflects the mutation, and is consistent
• Applications need to deal with inconsistent cases themselves

                     Write (to a specific offset)   Append (to the end of a file)
Serial success       defined                        defined, interspersed with inconsistent
Concurrent success   consistent but undefined       defined, interspersed with inconsistent
Failure              inconsistent                   inconsistent

Why we care about GFS
• Conventional file systems do not fit the demands of data centers; workloads in data centers are different from those on conventional computers:
  • Storage based on inexpensive disks that fail frequently — MapReduce is fault tolerant
  • Many large files in contrast to small personal files — MapReduce reads chunks of large files
  • Primarily reading streams of data — MapReduce aims at processing a large amount of data once
  • Sequential writes appending to the end of existing files — output files keep growing as workers keep writing
  • Must support multiple concurrent operations — MapReduce has thousands of workers running simultaneously
  • Bandwidth is more critical than latency — MapReduce only wants to finish tasks within a "reasonable" amount of time

Announcement
• Reading quiz due next Thursday
• Project due 3/3