The Google File System
Tut Chi Io (Modified by Fengchang)
WHAT IS GFS
• The Google File System (GFS) is a scalable distributed file system (DFS)
• Fault tolerance
• Reliability
• Scalability
• Availability and performance for large networks of connected nodes
WHAT IS GFS
• Built from low-cost commodity hardware components
• Optimized to accommodate Google's different data use and storage needs
• Capitalizes on the strengths of off-the-shelf servers while minimizing hardware weaknesses
Design Overview – Assumptions
• Inexpensive commodity hardware
• Large files: multi-GB
• Workloads
– Large streaming reads
– Small random reads
– Large, sequential appends
• Concurrent appends to the same file
• High throughput valued over low latency
Design Overview – Interface
• Create
• Delete
• Open
• Close
• Read
• Write
• Snapshot
• Record Append
What does it look like
(Architecture diagram: clients, a single master, and multiple chunk servers)
Design Overview – Architecture
• Single master, multiple chunk servers, multiple clients
– Each is a user-level process running on a commodity Linux machine
– GFS client code is linked into each client application for communication
• File -> 64 MB chunks -> Linux files
– Stored on the local disks of chunk servers
– Replicated on multiple chunk servers (3 replicas by default)
• Clients cache metadata but not chunk data
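Because chunks have a fixed 64 MB size, a client can translate any file byte offset into a chunk index with simple arithmetic before asking the master for the chunk's replica locations. A minimal sketch (the function name is illustrative, not a GFS API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def chunk_index(byte_offset: int) -> int:
    """Translate a file byte offset into a chunk index, as a GFS
    client does before requesting the chunk handle and replica
    locations from the master."""
    return byte_offset // CHUNK_SIZE

# The first byte of the third chunk lives in chunk index 2:
assert chunk_index(2 * CHUNK_SIZE) == 2
```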
Design Overview – Single Master
• Why centralization? Simplicity!
• Global knowledge is needed for
– Chunk placement
– Replication decisions
Design Overview – Chunk Size
• 64 MB, much larger than typical file system block sizes. Why?
– Advantages
• Reduces client-master interaction
• Reduces network overhead
• Reduces the size of the metadata
– Disadvantages
• Internal fragmentation
– Solution: lazy space allocation
• Hot spots: many clients accessing a one-chunk file, e.g. executables
– Solutions:
• Higher replication factor
• Stagger application start times
• Client-to-client communication
Design Overview – Metadata
• File & chunk namespaces
– In the master's memory
– In the master's and chunk servers' storage
• File-to-chunk mapping
– In the master's memory
– In the master's and chunk servers' storage
• Location of chunk replicas
– In the master's memory only
– The master asks the chunk servers when
• The master starts
• A chunk server joins the cluster
– If persisted, the master and chunk servers would have to be kept in sync
Design Overview – Metadata – In-Memory Data Structures
• Why an in-memory data structure for the master?
– Speed: it makes periodic scans for garbage collection and load balancing fast
• Does it limit the number of chunks, and hence total capacity?
– No: a 64 MB chunk needs less than 64 B of metadata, so 640 TB of data needs less than 640 MB
• Most chunks are full
• Prefix compression on file names
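The slide's capacity claim is easy to check: dividing total storage by the chunk size and multiplying by the per-chunk metadata bound reproduces the 640 TB -> 640 MB figure. A quick sketch (names are illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per chunk
METADATA_PER_CHUNK = 64         # < 64 bytes of master metadata per chunk

def master_memory_bound(total_storage_bytes: int) -> int:
    """Upper bound on master memory for chunk metadata, assuming
    most chunks are full (as the slide notes)."""
    num_chunks = total_storage_bytes // CHUNK_SIZE
    return num_chunks * METADATA_PER_CHUNK

# 640 TB of data needs at most about 640 MB of metadata:
TB = 1024 ** 4
assert master_memory_bound(640 * TB) == 640 * 1024 ** 2
```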
Design Overview – Metadata – Operation Log
• The only persistent record of metadata
• Defines the order of concurrent operations
• Critical
– Replicated on multiple remote machines
– Respond to the client only after the log record is flushed both locally and remotely
• Fast recovery by using checkpoints
– A compact B-tree-like form that maps directly into memory
– Switch to a new log file and create the new checkpoint in a separate thread
Design Overview – Consistency Model
• Consistent
– All clients will see the same data, regardless of which replicas they read from
• Defined
– Consistent, and clients will see what the mutation writes in its entirety
Design Overview – Consistency Model
• After a sequence of successful mutations, a region is guaranteed to be defined
– Mutations are applied in the same order on all replicas
– Chunk version numbers detect stale replicas
• Can clients cache stale chunk locations?
– The window is limited by the cache entry's timeout
– Most files are append-only, so a stale replica usually returns a premature end of chunk rather than stale data
System Interactions – Lease
• Minimizes management overhead
• Granted by the master to one of the replicas, which becomes the primary
• The primary picks a serial order for mutations, and all replicas follow it
• 60-second timeout, can be extended
• Can be revoked
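A minimal sketch of the master's lease bookkeeping. The 60-second timeout comes from the slide; the class and its methods are illustrative, not GFS code:

```python
LEASE_TIMEOUT = 60.0  # seconds, per the slide

class Lease:
    """Tracks which replica is the primary for a chunk and when its
    lease expires; the master may extend or revoke it."""
    def __init__(self, primary: str, now: float):
        self.primary = primary
        self.expires = now + LEASE_TIMEOUT

    def valid(self, now: float) -> bool:
        return now < self.expires

    def extend(self, now: float) -> None:
        # In GFS, extension requests piggyback on HeartBeat messages.
        self.expires = now + LEASE_TIMEOUT

lease = Lease("chunkserver-7", now=0.0)
assert lease.valid(now=59.0)
assert not lease.valid(now=61.0)   # expired unless extended
lease.extend(now=59.0)
assert lease.valid(now=61.0)       # extension pushed expiry out
```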
System Interactions – Mutation Order
The slide's diagram shows the write control flow:
1. The client asks the master which chunk server holds the current lease (the primary) and where the other replicas are.
2. The master replies with the identity of the primary and the locations of the replicas (cached by the client).
3. The client pushes the data to all replicas (3a, 3b, 3c).
4. The client sends the write request to the primary.
5. The primary assigns a serial number to the mutation, applies it, and forwards the write request to the secondaries.
6. The secondaries apply the mutation and report "operation completed" to the primary.
7. The primary replies to the client with "operation completed" or an error report.
System Interactions – Data Flow
• Decouple data flow and control flow
• Control flow
– Master -> Primary -> Secondaries
• Data flow
– A carefully picked chain of chunk servers
• Forward to the closest server first
• Distances estimated from IP addresses
– Linear (not a tree), to fully utilize each machine's outbound bandwidth (not divided among multiple recipients)
– Pipelined, to exploit full-duplex links
• Time to transfer B bytes to R replicas = B/T + RL
• T: network throughput, L: latency between machines
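The B/T + RL estimate can be checked numerically. With illustrative numbers (1 MB payload, 100 Mb/s links, 1 ms per-hop latency, all assumptions of this sketch), the latency term RL is tiny compared with the B/T term, which is why pipelining over a linear chain works so well:

```python
def pipelined_transfer_time(B: float, R: int, T: float, L: float) -> float:
    """Ideal time to push B bytes through a pipelined chain of R
    replicas, with per-link throughput T (bytes/s) and per-hop
    latency L (s): B/T + R*L, as on the slide."""
    return B / T + R * L

B = 1 * 1024 * 1024      # 1 MB of data (illustrative)
T = 100e6 / 8            # 100 Mb/s link, in bytes/s
L = 1e-3                 # 1 ms per hop
t = pipelined_transfer_time(B, R=3, T=T, L=L)
# The three hops add only 3 ms on top of the serialization time B/T:
assert abs(t - (B / T + 0.003)) < 1e-12
```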
System Interactions – Atomic Record Append
• Concurrent appends are serializable
– The client specifies only the data, not the offset
– GFS appends it at least once atomically
– The chosen offset is returned to the client
– Heavily used at Google for files that serve as
• multiple-producer/single-consumer queues
• merged results from many different clients
– On failure, the client retries the operation
– The appended data is defined; intervening regions are inconsistent
• A reader can identify and discard extra padding and record fragments using checksums
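The at-least-once contract means a failed append is simply retried, so a record may land in the file more than once and readers must tolerate duplicates. A sketch of the client-side retry loop, with a stand-in client object since the real GFS client library is not public (all names here are illustrative):

```python
class FakeClient:
    """Stand-in for a GFS client library: fails twice, then
    succeeds, mimicking transient replica errors."""
    def __init__(self):
        self.calls = 0
        self.log = []

    def append(self, path: str, data: bytes):
        self.calls += 1
        if self.calls <= 2:
            return False, None       # transient failure -> caller retries
        self.log.append(data)
        return True, len(self.log) - 1

def record_append(client, path: str, data: bytes, max_retries: int = 5):
    """Retry an append until GFS reports success; this is why the
    guarantee is 'at least once' rather than 'exactly once'."""
    for _ in range(max_retries):
        ok, offset = client.append(path, data)
        if ok:
            return offset            # GFS chose the offset, not the caller
    raise IOError("record append failed after retries")

client = FakeClient()
offset = record_append(client, "/queue/results", b"record-1")
assert offset == 0 and client.calls == 3   # two failures, then success
```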
System Interactions – Snapshot
• Makes a copy of a file or a directory tree almost instantaneously
• Uses copy-on-write
• Steps
– Revoke outstanding leases
– Log the operation to disk
– Duplicate the metadata, pointing at the same chunks
• On the first write after the snapshot, the real duplicate is created locally on each chunk server
– Disks are about 3 times as fast as 100 Mb Ethernet links, so copying locally avoids the network
Master Operation – Namespace Management
• No per-directory data structure
• No support for aliases
• Locks over regions of the namespace ensure serialization
• A lookup table maps full pathnames to metadata
– Prefix compression keeps it in memory
Master Operation – Namespace Locking
• Each node (file/directory) has a read-write lock
• Scenario: prevent /home/user/foo from being created while /home/user is being snapshotted to /save/user
– Snapshot acquires
• Read locks on /home and /save
• Write locks on /home/user and /save/user
– Create acquires
• Read locks on /home and /home/user
• Write lock on /home/user/foo
– The two conflict on /home/user, so the operations are serialized
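The locking rule above generalizes: an operation takes read locks on every ancestor directory and a read or write lock on the full pathname itself. A sketch of computing that lock set (the function is illustrative, not master code):

```python
def locks_needed(path: str):
    """Return (read_lock_paths, leaf_path) for an operation on
    `path`, following the slide's scheme: read locks on all
    ancestors, and a read or write lock (depending on the
    operation) on the full pathname itself."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    return ancestors, path

reads, leaf = locks_needed("/home/user/foo")
# Creating /home/user/foo needs read locks on /home and /home/user,
# which conflict with snapshot's write lock on /home/user:
assert reads == ["/home", "/home/user"]
assert leaf == "/home/user/foo"
```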
Master Operation – Policies
• Chunk creation policy
– Place new replicas on chunk servers with below-average disk utilization
– Limit the number of "recent" creations on each chunk server
– Spread replicas of a chunk across racks
• Re-replication priority
– Chunks furthest from their replication goal first
– Chunks that are blocking a client first
– Live files first (rather than deleted ones)
• Rebalance replicas periodically
Master Operation – Garbage Collection
• Lazy reclamation
– Deletion is logged immediately
– The file is renamed to a hidden name
• Removed 3 days later
• Undelete by renaming it back
• Regular scan for orphaned chunks
– Not garbage:
• Chunks referenced by a file-to-chunk mapping
• Chunk replicas that exist as Linux files under the designated directory on a chunk server
– Erase the metadata of orphaned chunks
– HeartBeat messages tell chunk servers which chunks to delete
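The lazy-deletion timeline can be sketched in a few lines. The 3-day grace period comes from the slide; the hidden-name format and the flat dict standing in for the namespace are illustrative assumptions:

```python
GRACE_SECONDS = 3 * 24 * 3600  # "removed 3 days later"

def hide(namespace: dict, path: str, now: float) -> str:
    """Deletion just renames the file to a hidden, timestamped name
    (the actual GFS naming scheme is not specified on the slide)."""
    hidden = f".deleted-{int(now)}-{path.strip('/').replace('/', '_')}"
    namespace[hidden] = namespace.pop(path)
    return hidden

def scan(namespace: dict, now: float) -> None:
    """Regular namespace scan: erase hidden files whose grace period
    has expired. Before that, undelete is just a rename back."""
    for name in list(namespace):
        if name.startswith(".deleted-"):
            deleted_at = int(name.split("-")[1])
            if now - deleted_at >= GRACE_SECONDS:
                del namespace[name]

ns = {"/home/user/foo": "chunk-handles"}
hidden = hide(ns, "/home/user/foo", now=0)
scan(ns, now=1000)             # still within the grace period
assert hidden in ns
scan(ns, now=GRACE_SECONDS)    # grace period elapsed -> reclaimed
assert hidden not in ns
```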
Master Operation – Garbage Collection
• Advantages
– Simple & reliable
• Chunk creation may fail
• Deletion messages may be lost
– A uniform and dependable way to clean up useless replicas
– Done in batches, so the cost is amortized
– Done when the master is relatively free
– A safety net against accidental, irreversible deletion
Master Operation – Garbage Collection
• Disadvantage
– Hard to fine-tune when storage is tight
• Solutions
– Deleting a file twice explicitly expedites storage reclamation
– Different policies for different parts of the namespace
• Stale replica detection
– The master maintains a version number for each chunk
Fault Tolerance – High Availability
• Fast recovery
– Restore state and start in seconds
– No distinction between normal and abnormal termination
• Chunk replication
– Different replication levels for different parts of the file namespace
– Keep each chunk fully replicated as chunk servers go offline or corrupted replicas are detected through checksum verification
Fault Tolerance – High Availability
• Master replication
– The log & checkpoints are replicated
– Master failures?
• Monitoring infrastructure outside GFS starts a new master process
– "Shadow" masters
• Provide read-only access to the file system even when the primary master is down
• Enhance read availability
• Each reads a replica of the growing operation log
Fault Tolerance – Data Integrity
• Checksums detect data corruption
• Each chunk (64 MB) is broken into 64 KB blocks, each with a 32-bit checksum
• A chunk server verifies the checksum before returning data, so corruption does not propagate
• Record append
– Incrementally update the checksum for the last partial block; any error will be detected when the block is next read
• Random write
– Read and verify the first and last blocks of the range being overwritten
– Perform the write, then compute and record the new checksums
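The per-block checksum scheme is easy to sketch. The 64 KB block size and 32-bit width come from the slide; using CRC32 is an illustrative choice, since the slide does not name the checksum function:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # each chunk is checksummed in 64 KB blocks

def block_checksums(chunk_data: bytes) -> list:
    """Compute a 32-bit checksum per 64 KB block, as a chunk server
    keeps for each chunk (CRC32 is an assumption of this sketch)."""
    return [
        zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
        for i in range(0, len(chunk_data), BLOCK_SIZE)
    ]

def verify_block(chunk_data: bytes, checksums: list, block_idx: int) -> bool:
    """Verify one block before returning it to a reader, so corrupt
    data is caught at the serving chunk server and never propagates."""
    block = chunk_data[block_idx * BLOCK_SIZE:(block_idx + 1) * BLOCK_SIZE]
    return zlib.crc32(block) == checksums[block_idx]

data = b"x" * (3 * BLOCK_SIZE)
sums = block_checksums(data)
assert all(verify_block(data, sums, i) for i in range(3))
corrupted = data[:10] + b"?" + data[11:]      # flip one byte in block 0
assert not verify_block(corrupted, sums, 0)   # detected on read
```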
Conclusion
• GFS supports large-scale data processing using commodity hardware
• It reexamines traditional file system assumptions based on the application workload and technological environment
– Treat component failures as the norm rather than the exception
– Optimize for huge files that are mostly appended to
– Relax the standard file system interface
Conclusion
• Fault tolerance
– Constant monitoring
– Replicating crucial data
– Fast and automatic recovery
– Checksumming to detect data corruption at the disk or IDE subsystem level
• High aggregate throughput
– Decouple control and data transfer
– Minimize master involvement through the large chunk size and chunk leases
Reference
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," SOSP 2003.