
Large Scale File Systems

Amir H. Payberah ([email protected])

KTH Royal Institute of Technology


What is the Problem?


What is the Problem?

I Crawl the whole web.

I Store it all on one big disk.

I Process users’ searches on one big CPU.

I Does not scale.


Reminder


What is a Filesystem?

I Controls how data is stored on and retrieved from disk.


Distributed Filesystems

I When data outgrows the storage capacity of a single machine: partition it across a number of separate machines.

I Distributed filesystems: manage the storage across a network of machines.


Google File System (GFS)


Motivation and Assumptions (1/3)

I Lots of cheap PCs, each with disk and CPU.
• How to share data among PCs?


Motivation and Assumptions (2/3)

I 100s to 1000s of PCs in a cluster.
• Each PC can fail.
• Monitoring, fault tolerance, and auto-recovery are essential.


Motivation and Assumptions (3/3)

I Large files: ≥ 100 MB in size.

I Large streaming reads and small random reads.

I Append to files rather than overwrite.


Files and Chunks (1/2)

I Files are split into chunks.

I Chunks:
• Single unit of storage.
• Transparent to user.
• Default size: either 64 MB or 128 MB.


Files and Chunks (2/2)

I Why is a chunk in GFS so large?

• To minimize the cost of seeks.

I Time to read a chunk = seek time + transfer time

I Keeping the ratio seek time / transfer time small.
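
To see why the ratio matters, here is a back-of-the-envelope sketch in C++. The 10 ms seek time and 100 MB/s transfer rate are assumed figures for illustration, not numbers from the slides.

#include <cstdio>

int main() {
    // Assumed disk figures for illustration (not from the slides):
    const double seek_ms = 10.0;             // one seek per chunk read
    const double transfer_mb_per_s = 100.0;  // sequential transfer rate

    const double chunk_sizes_mb[] = {1.0, 64.0, 128.0};
    for (double chunk_mb : chunk_sizes_mb) {
        double transfer_ms = chunk_mb / transfer_mb_per_s * 1000.0;
        double ratio = seek_ms / transfer_ms;   // seek time / transfer time
        std::printf("chunk %6.0f MB: read takes %7.1f ms, seek/transfer = %.3f\n",
                    chunk_mb, seek_ms + transfer_ms, ratio);
    }
    // 64 MB chunks: seek/transfer is about 0.016, so under 2% of the time is spent seeking.
    // 1 MB chunks:  seek/transfer is 1.0, so half of the read time is wasted on the seek.
    return 0;
}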


GFS Architecture

I Main components:
• GFS master
• GFS chunk server
• GFS client


GFS Master

I Manages file namespace operations.

I Manages file metadata (holds all metadata in memory).
• Access control information
• Mapping from files to chunks
• Locations of chunks

I Manages chunks in chunk servers.
• Creation/deletion
• Placement
• Load balancing
• Maintains replication
• Garbage collection
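
As a rough illustration of what "all metadata in memory" amounts to, the sketch below models the three mappings above with standard containers. The type and field names are hypothetical, not GFS's actual data structures.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using ChunkHandle = std::uint64_t;

struct ChunkInfo {
    int version;                          // chunk version number
    std::vector<std::string> locations;   // chunk servers holding a replica
};

struct MasterMetadata {
    std::map<std::string, std::vector<ChunkHandle>> fileToChunks;  // mapping from files to chunks
    std::map<ChunkHandle, ChunkInfo> chunkInfo;                    // locations (and versions) of chunks
    std::map<std::string, std::string> accessControl;              // access control information
};

int main() {
    MasterMetadata md;
    md.fileToChunks["/data/crawl-0001"] = {101, 102};              // hypothetical file with two chunks
    md.chunkInfo[101] = {3, {"cs-a", "cs-b", "cs-c"}};
    md.accessControl["/data/crawl-0001"] = "rw-r--r--";
    std::cout << md.fileToChunks["/data/crawl-0001"].size() << " chunks known for the file\n";
}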


GFS Chunk Server

I Manages chunks.

I Tells the master what chunks it has.

I Stores chunks as files.

I Maintains data consistency of chunks.


GFS Client

I Issues control (metadata) requests to master server.

I Issues data requests directly to chunk servers.

I Caches metadata.

I Does not cache data.


The Master Operations


The Master Operations

I Namespace management and locking

I Replica placement

I Creating, re-replicating and re-balancing replicas

I Garbage collection

I Stale replica detection


Namespace Management and Locking

I Represents its namespace as a lookup table mapping full pathnames to metadata.

I Each master operation acquires a set of locks before it runs.

I Read lock on internal nodes, and read/write lock on the leaf.

I A read lock on a directory prevents its deletion, renaming, or snapshot.

I Concurrent mutations in the same directory are allowed.
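
A minimal sketch of the locking rule, assuming a hypothetical pathname /home/user/file: read-lock every ancestor directory, then take a read or write lock on the leaf itself.

#include <iostream>
#include <string>
#include <vector>

// Collect the ancestor directories of a full pathname, e.g.
// "/home/user/file" -> {"/home", "/home/user"}.
std::vector<std::string> ancestors(const std::string& path) {
    std::vector<std::string> out;
    for (auto pos = path.find('/', 1); pos != std::string::npos; pos = path.find('/', pos + 1))
        out.push_back(path.substr(0, pos));
    return out;
}

int main() {
    std::string path = "/home/user/file";            // hypothetical pathname
    for (const std::string& dir : ancestors(path))
        std::cout << "read lock   " << dir << "\n";  // ancestors: read locks only
    std::cout << "write lock  " << path << "\n";     // leaf: write lock for a mutation

    // Because ancestors take only read locks, two mutations in the same directory
    // can run concurrently, while deleting, renaming, or snapshotting a directory
    // needs a write lock on it and is therefore blocked while files in it are in use.
}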


Replica Placement

I Maximize data reliability, availability and bandwidth utilization.

I Replicas are spread across machines and racks, for example:
• 1st replica on the local rack.
• 2nd replica on the local rack, but on a different machine.
• 3rd replica on a different rack.

I The master determines replica placement.


Creation, Re-replication and Re-balancing

I Creation
• Place new replicas on chunk servers with below-average disk usage.
• Limit the number of recent creations on each chunk server.

I Re-replication
• When the number of available replicas falls below a user-specified goal.

I Re-balancing
• Periodically, for better disk utilization and load balancing.
• The distribution of replicas is analyzed.


Garbage Collection

I File deletion logged by master.

I File renamed to a hidden name with deletion timestamp.

I Master regularly deletes hidden files more than 3 days after deletion (configurable).

I Until then, hidden file can be read and undeleted.

I When a hidden file is removed, its in-memory metadata is erased.


Stale Replica Detection

I Chunk replicas may become stale: if a chunk server fails and misses mutations to the chunk while it is down.

I Need to distinguish between up-to-date and stale replicas.

I Chunk version number:
• Increased when master grants new lease on the chunk.
• Not increased if replica is unavailable.

I Stale replicas deleted by master in regular garbage collection.
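
A minimal sketch of the version check, assuming the master remembers the latest version for which it granted a lease and compares it with the version each replica reports.

#include <iostream>

struct Replica {
    int version;        // version stored on the chunk server
};

int main() {
    int master_version = 7;                 // bumped when the last lease was granted
    Replica replicas[] = {{7}, {7}, {6}};   // the third replica missed the last mutation

    for (const Replica& r : replicas) {
        if (r.version < master_version)
            std::cout << "stale replica (v" << r.version
                      << "), schedule for garbage collection\n";
        else
            std::cout << "up-to-date replica (v" << r.version << ")\n";
    }
}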


System Interactions


Read Operation (1/2)

I 1. Application originates the read request.

I 2. GFS client translates request and sends it to the master.

I 3. The master responds with chunk handle and replica locations.
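
Part of the translation in step 2 is turning a (file name, byte offset) request into a (file name, chunk index) request. A minimal sketch, assuming the 64 MB default chunk size:

#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t kChunkSize = 64ull * 1024 * 1024;   // assumed 64 MB chunks

    std::uint64_t offset = 200ull * 1024 * 1024;             // application asks for byte 200 MB of the file
    std::uint64_t chunk_index  = offset / kChunkSize;        // which chunk to ask the master about
    std::uint64_t chunk_offset = offset % kChunkSize;        // where to read inside that chunk

    std::cout << "ask master for chunk index " << chunk_index
              << ", then read at offset " << chunk_offset
              << " bytes within that chunk\n";               // chunk 3, offset 8 MB (8388608 bytes)
}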


Read Operation (2/2)

I 4. The client picks a location and sends the request.

I 5. The chunk server sends requested data to the client.

I 6. The client forwards the data to the application.


Update Order (1/2)

I Update (mutation): an operation that changes the content or metadata of a chunk.

I For consistency, updates to each chunk must be ordered in the same way at the different chunk replicas.

I Consistency means that replicas will end up with the same version of the data and not diverge.


Update Order (2/2)

I For this reason, for each chunk, one replica is designated as the primary.

I The other replicas are designated as secondaries.

I The primary defines the update order.

I All secondaries follow this order.


Primary Leases (1/2)

I For correctness there needs to be one single primary for each chunk.

I At any time, at most one server is primary for each chunk.

I Master selects a chunk-server and grants it a lease for a chunk.


Primary Leases (2/2)

I The chunk-server holds the lease for a period T after it gets it, and behaves as the primary during this period.

I The chunk-server can refresh the lease endlessly, but if it cannot successfully refresh the lease from the master, it stops being the primary.

I If the master does not hear from the primary chunk-server for a period, it gives the lease to someone else.
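
A minimal sketch of the lease bookkeeping on the chunk-server side, assuming a lease period T of 60 seconds; the value, type, and field names are illustrative only.

#include <iostream>

// Minimal lease bookkeeping, with time modeled as plain seconds for clarity.
struct Lease {
    long expires_at = 0;   // absolute time (seconds) until which we act as primary

    bool is_primary(long now) const { return now < expires_at; }
    void refresh(long now, long period) { expires_at = now + period; }  // only after the master agreed
};

int main() {
    const long T = 60;      // assumed lease period T (seconds)

    Lease lease;
    lease.refresh(0, T);                                       // lease granted at t = 0
    std::cout << "t=30: " << lease.is_primary(30) << "\n";     // 1: still primary
    // Suppose refresh attempts start failing after t = 30: the server does nothing
    // special, it simply lets the lease run out and stops acting as primary.
    std::cout << "t=70: " << lease.is_primary(70) << "\n";     // 0: no longer primary
    // Meanwhile the master, having not heard from this server, may grant the
    // lease for the chunk to another chunk-server once T has safely elapsed.
}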


Write Operation (1/3)

I 1. Application originates the request.

I 2. The GFS client translates request and sends it to the master.

I 3. The master responds with chunk handle and replica locations.


Write Operation (2/3)

I 4. The client pushes write data to all locations. Data is stored in the chunk-servers' internal buffers.


Write Operation (3/3)

I 5. The client sends write command to the primary.

I 6. The primary determines serial order for data instances in its buffer and writes the instances in that order to the chunk.

I 7. The primary sends the serial order to the secondaries and tells them to perform the write.


Write Consistency

I The primary enforces one update order across all replicas for concurrent writes.

I It also waits until a write finishes at the other replicas before it replies.

I Therefore:
• We will have identical replicas.
• But a file region may end up containing mingled fragments from different clients: e.g., writes to different chunks may be ordered differently by their different primary chunk-servers.
• Thus, writes leave the file region in a consistent but undefined state in GFS.


Append Operation (1/2)

I 1. Application originates record append request.

I 2. The client translates request and sends it to the master.

I 3. The master responds with chunk handle and replica locations.

I 4. The client pushes write data to all locations.


Append Operation (2/2)

I 5. The primary checks if the record fits in the specified chunk.

I 6. If the record does not fit, then the primary:
• Pads the chunk,
• Tells the secondaries to do the same,
• And informs the client.
• The client then retries the append with the next chunk.

I 7. If the record fits, then the primary:
• Appends the record,
• Tells the secondaries to do the same,
• Receives responses from the secondaries,
• And sends the final response to the client.
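
A minimal sketch of the primary's decision in steps 5-7, assuming 64 MB chunks; the function name and return convention are illustrative, not GFS's API.

#include <cstdint>
#include <iostream>

const std::uint64_t kChunkSize = 64ull * 1024 * 1024;   // assumed 64 MB chunks

// Returns the offset the record was appended at, or -1 if the client must
// retry on the next chunk (after the primary padded this one).
std::int64_t record_append(std::uint64_t& chunk_used, std::uint64_t record_size) {
    if (chunk_used + record_size > kChunkSize) {
        chunk_used = kChunkSize;          // pad the rest of the chunk (secondaries do the same)
        return -1;                        // tell the client to retry on the next chunk
    }
    std::uint64_t offset = chunk_used;    // append at the current end of the chunk
    chunk_used += record_size;
    return static_cast<std::int64_t>(offset);
}

int main() {
    std::uint64_t used = kChunkSize - (1 << 20);          // only 1 MB left in this chunk
    std::cout << record_append(used, 4 << 20) << "\n";    // 4 MB record: -1, pad and retry
    std::uint64_t fresh = 0;
    std::cout << record_append(fresh, 4 << 20) << "\n";   // fits in the next chunk: offset 0
}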


Delete Operation

I A metadata operation.

I Renames the file to a special name.

I After a certain time, deletes the actual chunks.

I Supports undelete for a limited time.

I Actual deletion is done by lazy garbage collection.


Fault Tolerance


Fault Tolerance for Chunks

I Chunk replication (re-replication and re-balancing)

I Data integrity
• Each chunk is divided into 64 KB blocks, with a checksum for each block.
• The checksum is checked every time an application reads the data.
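
A minimal sketch of per-block checksumming, using the 64 KB block size from the slide; the toy checksum function stands in for whatever checksum GFS actually uses.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

const std::size_t kBlockSize = 64 * 1024;    // 64 KB checksum blocks

// Toy checksum; GFS's real checksum algorithm is not specified here.
std::uint32_t checksum(const std::vector<std::uint8_t>& data, std::size_t begin, std::size_t end) {
    std::uint32_t sum = 0;
    for (std::size_t i = begin; i < end && i < data.size(); ++i) sum = sum * 31 + data[i];
    return sum;
}

int main() {
    std::vector<std::uint8_t> chunk(256 * 1024, 0x5a);   // a 256 KB piece of a chunk
    std::vector<std::uint32_t> stored;                   // one checksum kept per 64 KB block
    for (std::size_t off = 0; off < chunk.size(); off += kBlockSize)
        stored.push_back(checksum(chunk, off, off + kBlockSize));

    // On every read, recompute the checksum of the blocks covered by the read
    // and compare against the stored values before returning data.
    std::size_t block = 2;
    bool ok = checksum(chunk, block * kBlockSize, (block + 1) * kBlockSize) == stored[block];
    std::cout << (ok ? "block ok" : "corruption detected, restore from another replica") << "\n";
}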


Fault Tolerance for Chunk Server

I All chunks are versioned.

I Version number updated when a new lease is granted.

I Chunks with old versions are not served and are deleted.


Fault Tolerance for Master

I Master state replicated for reliability on multiple machines.

I When the master fails:
• It can restart almost instantly.
• A new master process is started elsewhere.

I A shadow (not mirror) master provides read-only access to the file system when the primary master is down.


High Availability

I Fast recovery
• Master and chunk-servers have to restore their state and start in seconds, no matter how they terminated.

I Heartbeat messages:
• Checking liveness of chunk-servers
• Piggybacking garbage collection commands
• Lease renewal


Flat Datacenter Storage (FDS)


Motivation and Assumptions (1/5)

I Why move computation close to data?
• Because remote access is slow due to oversubscription.


Motivation and Assumptions (2/5)

I Locality adds complexity.

I Need to be aware of where the data is.
• Non-trivial scheduling algorithm.
• Moving computations around is not easy.

I Need a data-parallel programming model.


Motivation and Assumptions (3/5)

I Datacenter networks are getting faster.

I Consequences:
• The networks are not oversubscribed.
• Support full bisection bandwidth: no local vs. remote disk distinction.
• Simpler work schedulers and programming models.


Motivation and Assumptions (4/5)

I File systems like GFS manage metadata centrally.

I On every read or write, clients contact the master to get information about the location of blocks in the system.
• Good visibility and control.
• Bottleneck: mitigated by using a large block size.
• This makes it harder to do fine-grained load balancing like our ideal little-data computer does.


Motivation and Assumptions (5/5)

I Let’s make a digital socialism

I Flat Datacenter Storage


Blobs and Tracts

I Data is stored in logical blobs.
• Byte sequences, each named by a 128-bit Globally Unique Identifier (GUID).

I Blobs are divided into constant-sized units called tracts.
• Tracts are sized so that random and sequential accesses have the same throughput.

I Both tracts and blobs are mutable.


FDS API (1/2)

// create a blob with the specified GUID
CreateBlob(GUID, &blobHandle, doneCallbackFunction);

// write 8 MB from buf to tract 0 of the blob
blobHandle->WriteTract(0, buf, doneCallbackFunction);

// read tract 2 of the blob into buf
blobHandle->ReadTract(2, buf, doneCallbackFunction);


FDS API (2/2)

I Reads and writes are atomic.

I Reads and writes are not guaranteed to appear in the order they are issued.

I The API is non-blocking.
• This helps performance: many requests can be issued in parallel, and FDS can pipeline disk reads with network transfers.


FDS Architecture


Tractserver

I Every disk is managed by a process called a tractserver.

I Tractservers accept commands from the network, e.g., ReadTract and WriteTract.

I They do not use file systems.
• They lay out tracts directly to disk by using the raw disk interface.


Metadata Server

I The metadata server coordinates the cluster and helps clients rendezvous with tractservers.

I It collects a list of active tractservers and distributes it to clients.

I This list is called the tract locator table (TLT).

I Clients can retrieve the TLT from the metadata server once, then never contact the metadata server again.


Tract Locator Table (1/2)

I The TLT contains the address of the tractserver(s) responsible for each tract.

I Clients use the blob's GUID (g) and the tract number (i) to select an entry in the TLT, the tract locator:

TractLocator = (Hash(g) + i) mod TLT_Length
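
A minimal sketch of the locator computation; std::hash stands in for FDS's hash of the 128-bit GUID, and the TLT is reduced to a list of tractserver names.

#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Hypothetical TLT: each entry names the tractserver for that locator.
    std::vector<std::string> tlt = {"ts-01", "ts-02", "ts-03", "ts-04", "ts-05"};

    std::string guid = "6fa459ea-ee8a-3ca4-894e-db77e160355e";   // blob GUID (illustrative)
    std::uint64_t i = 7;                                         // tract number

    // TractLocator = (Hash(g) + i) mod TLT_Length
    std::uint64_t locator = (std::hash<std::string>{}(guid) + i) % tlt.size();
    std::cout << "tract " << i << " of this blob lives on " << tlt[locator] << "\n";
}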


Tract Locator Table (2/2)

I The only time the TLT changes is when a disk fails or is added.

I Reads and writes do not change the TLT.

I In a system with more than one replica, reads go to one replica at random, and writes go to all of them.


Per-Blob Metadata

I Per-blob metadata: blob’s length and permission bits.

I Stored in tract -1 of each blob.

I The tractserver is responsible for the blob metadata tract.

I Newly created blobs have a length of zero, and applications must extend a blob before writing. The extend operation is atomic.


Fault Tolerance


Replication

I Replicate data to improve durability and availability.

I When a disk fails, redundant copies of the lost data are used to restore the data to full replication.

I Writing a tract: the client sends the write to every tractserver in the tract's TLT entry.
• Applications are notified that their writes have completed only after the client library receives a write ack from all replicas.

I Reading a tract: the client selects a single tractserver at random.


Failure Recovery (1/2)

I Step 1: Tractservers send heartbeat messages to the metadata server. When the metadata server detects a tractserver timeout, it declares the tractserver dead.

I Step 2: It invalidates the current TLT by incrementing the version number of each row in which the failed tractserver appears.

I Step 3: It picks random tractservers to fill in the empty spaces in the TLT where the dead tractserver appeared.
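
A minimal sketch of steps 2 and 3, with a hypothetical two-row TLT; a real implementation would avoid re-picking a tractserver that is already in the row.

#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

struct TltRow {
    int version;                              // bumped whenever the row changes
    std::vector<std::string> tractservers;    // servers responsible for this locator
};

int main() {
    std::vector<std::string> alive = {"ts-02", "ts-03", "ts-04"};   // still-healthy tractservers
    std::vector<TltRow> tlt = {{1, {"ts-01", "ts-02"}},
                               {1, {"ts-03", "ts-01"}}};
    std::string failed = "ts-01";             // declared dead after missed heartbeats (step 1)

    for (TltRow& row : tlt) {
        for (std::string& ts : row.tractservers) {
            if (ts != failed) continue;
            row.version++;                               // step 2: invalidate the row
            ts = alive[std::rand() % alive.size()];      // step 3: fill the gap with a random live server
        }
    }

    for (const TltRow& row : tlt)
        std::cout << "row v" << row.version << ": " << row.tractservers[0]
                  << ", " << row.tractservers[1] << "\n";
}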


Failure Recovery (2/2)

I Step 4: It sends updated TLT assignments to every server affected by the changes.

I Step 5: It waits for each tractserver to ack the new TLT assignments, and then begins to give out the new TLT to clients when queried for it.


Summary


Summary

I Google File System (GFS)

I Files and chunks

I GFS architecture: master, chunk servers, client

I GFS interactions: read and update (write and record append)

I Master operations: metadata management, replica placement, and garbage collection


Summary

I Flat Datacenter Storage (FDS)

I Blobs and tracts

I FDS architecture: metadata server, tractservers, TLT

I FDS interactions: using the GUID and tract number

I Replication and failure recovery


Questions?
