Univ. of Tehran, Distributed Operating Systems
Advanced Operating Systems
University of Tehran, Dept. of EE and Computer Engineering
By: Dr. Nasser Yazdani
Lecture 14: Distributed File System
How to design a system: how to share data in a DS.
References: Chapter 10 of the textbook; the "Google file system" paper.
Outline: What are files? General problems. Network File System (NFS). Andrew File System. Google File System.
What are Files?
A file is a collection of data organized by the user; it is not necessarily meaningful to the operating system.
The file system is responsible for managing files, typically on persistent storage: naming files in meaningful ways; accessing files (create, destroy, read, write, etc.); physical allocation; security and protection; resource administration (quotas, priorities).
What does a DFS do? The same things, in a distributed environment. Transparency is important here.
Why a Dist. File System?
More storage than can fit on a single system.
More fault tolerance than can be achieved if "all of the eggs are in one basket."
The user is "distributed" and needs to access the file system from many places.
How to build a DFS? Grafting a Name Space Into the Tree
Mounting via "/etc/vfstab". Solaris can bind remote directories to local mount points "on demand" through the "automounter". Mounting allows files to be spread out across multiple servers and/or replicated: AFS implements read-only replication; Coda supports read-write replication.
Implementing Operations: typically done via the virtual file system (VFS) interface.
How to build a DFS (2)? Unit of Transit
How much do we move? Whole files or blocks. The Andrew File System (AFS) and Coda versions 1 and 2 use whole-file semantics. NFS and AFS version 3 implement block-level semantics. None implements byte-level semantics. Whole-file transfer means long times on opening and closing a file, and less efficient caching.
Reads, Writes and Coherency: if multiple users write, the final result will be the file as seen from the perspective of one user, but not both. In UNIX, files are not protected from conflicting writes.
How to build a DFS (3)? Caching and Server State
High latency means we need caching. What happens if two different users want to read the same file? What happens if one of them writes to the file? How do the clients holding cached copies know?
Ostrich principle: live with the inconsistency if it occurs. Or periodically validate the cache by checking with the server: "checksum [blah]?" or "timestamp [blah]?" Or keep track of the users by issuing a callback promise to each client as it collects a file/block. The callback-based approach is optimistic and saves network overhead, but it complicates the server and reduces its robustness.
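The callback-promise scheme can be sketched as a toy model (class and method names are invented, not the NFS or AFS API): the server records which clients cache each file and invalidates their copies when another client writes.

```python
# Toy sketch of callback-based cache invalidation (invented names).
class FileServer:
    def __init__(self):
        self.files = {}         # name -> contents
        self.promises = {}      # name -> set of clients holding a cached copy

    def fetch(self, client, name):
        # Record a callback promise: we must notify this client on change.
        self.promises.setdefault(name, set()).add(client)
        return self.files.get(name, b"")

    def store(self, writer, name, data):
        self.files[name] = data
        # Break the promise for every other caching client.
        for client in self.promises.get(name, set()) - {writer}:
            client.invalidate(name)
        self.promises[name] = {writer}

class Client:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def read(self, name):
        if name not in self.cache:              # miss: fetch and cache
            self.cache[name] = self.server.fetch(self, name)
        return self.cache[name]

    def write(self, name, data):
        self.cache[name] = data
        self.server.store(self, name, data)

    def invalidate(self, name):                 # callback from the server
        self.cache.pop(name, None)
```

Note the cost this sketch makes visible: the server must remember every caching client, which is exactly the state that complicates the server and reduces its robustness.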
File Sharing Semantics
Unix semantics: a read after a write returns the value written. The system enforces absolute time ordering on all operations and always returns the most recent value; changes are immediately visible to all processes. Difficult to enforce in distributed file systems unless all accesses occur at the server (with no client caching).
Session semantics: local changes are visible only to the process that opened the file; a file close makes the changes visible to all processes. This allows local caching of the file at the client. Two nearly simultaneous file closes => one overwrites the other?
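The "two closes" question can be illustrated with a toy model of session semantics (names are invented): each open takes a private snapshot, changes propagate only at close, and the last close silently wins.

```python
# Toy sketch of session semantics (invented names).
class SessionFS:
    def __init__(self):
        self.files = {}

    def open(self, name):
        # Each open takes a private snapshot of the file's contents.
        return {"name": name, "data": self.files.get(name, "")}

    def close(self, session):
        # Changes become visible only at close; the last close wins.
        self.files[session["name"]] = session["data"]

fs = SessionFS()
s1, s2 = fs.open("f"), fs.open("f")
s1["data"], s2["data"] = "from s1", "from s2"
fs.close(s1)
fs.close(s2)   # silently overwrites s1's update
```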
Other File Sharing Semantics
Immutable files: create/delete only; no modifications allowed. What about deleting a file in use by another process?
Atomic transactions: access to files protected by transactions; serializable access, but costly to implement.
NFS
Networked File System: provides distributed filing by remote access, with a high degree of transparency. A method of providing highly transparent access to remote files. Developed by Sun.
NFS Characteristics
Volume-level access. RPC-based. Stateless remote file access. Uses XDR for transferring files. Location (not name) transparent. Implementations for many systems, all interoperating, even non-Unix ones. Currently based on VFS.
VFS/Vnode Review
VFS: Virtual File System. A common interface allowing multiple file system implementations on one system, plugged in below user level. Files are represented by vnodes.
NFS Diagram
[Figure: an NFS client tree (/, with /tmp and /mnt, where /mnt holds x and y) and an NFS server tree (/, with /home and /bin, where /bin holds foo and bar); a server directory is mounted at the client's /mnt.]
File Handles
On the client side, files are represented by vnodes. The client NFS implementation internally represents remote files as handles, which are opaque to the client but meaningful to the server. To name a remote file, the client provides the handle to the server.
NFS Architecture (1)
a) The remote access model.
b) The upload/download model.
NFS Architecture (2)
The basic NFS architecture for UNIX systems.
Communication
a) Reading data from a file in NFS version 3.
b) Reading data using a compound procedure in version 4.
NFS Handle Diagram
[Figure: on the client side, a user process holds a file descriptor that refers to a vnode at the VFS level; the NFS level maps that vnode to an opaque handle. On the server side, the NFS server maps the handle to a vnode at the VFS level, which refers to a UFS inode.]
How to make this work?
We could integrate it into the kernel, but that is non-portable and non-distributable. Instead, use existing features to do the work: VFS for the common interface, RPC for data transport.
Using RPC for NFS
There must be some process at the server that answers the RPC requests: a continuously running daemon process. Somehow, mounts must also be performed across machine boundaries: a second daemon process handles this.
NFS Processes
nfsd daemons: server daemons that accept RPC calls for NFS.
rpc.mountd daemons: server daemons that handle mount requests.
biod daemons: optional client daemons that can improve performance.
NFS from the Client's Side
The user issues a normal file operation, like read(). It passes through the vnode interface to the client-side NFS implementation, which formats and sends an RPC packet to perform the operation. The client blocks until the NFS RPC returns.
NFS RPC Procedures
16 RPC procedures implement NFS: some for files, some for file systems, including directory ops, link ops, read, write, etc. Lookup() is the key operation, because it fetches handles. Other NFS file operations use the handle.
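The role of Lookup() can be sketched with a toy server (invented names, not the actual NFS wire protocol): resolving a pathname costs one lookup per component, and the opaque handle returned is what later operations use.

```python
# Toy sketch of NFS-style pathname resolution: one LOOKUP per component.
# The server maps (directory handle, name) -> handle; the client walks the path.
class Server:
    def __init__(self, tree):
        self.handles = {0: tree}          # handle -> directory dict or file data
        self.next_handle = 1

    def lookup(self, dir_handle, name):
        entry = self.handles[dir_handle][name]
        h = self.next_handle
        self.next_handle += 1
        self.handles[h] = entry
        return h                          # opaque handle for later READ/WRITE

    def read(self, handle):
        return self.handles[handle]

def resolve(server, root_handle, path):
    h = root_handle
    for component in path.strip("/").split("/"):
        h = server.lookup(h, component)   # one RPC per path component
    return h

srv = Server({"home": {"alice": {"mbox": "hello"}}})
h = resolve(srv, 0, "/home/alice/mbox")
```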
Naming (1)
Mounting (part of) a remote file system in NFS.
Naming (2)
Mounting nested directories from multiple servers in NFS.
Automounting (1)
A simple automounter for NFS.
Automounting (2)
Using symbolic links with automounting.
File Attributes (1)
Some general mandatory file attributes in NFS. NFS is modeled on Unix-like file systems, which makes implementing NFS on other file systems (e.g., Windows) difficult. NFS v4 enhances compatibility by distinguishing mandatory and recommended attributes.

Attribute | Description
TYPE | The type of the file (regular, directory, symbolic link)
SIZE | The length of the file in bytes
CHANGE | Indicator for a client to see if and/or when the file has changed
FSID | Server-unique identifier of the file's file system
File Attributes (2)
Some general recommended file attributes.

Attribute | Description
ACL | An access control list associated with the file
FILEHANDLE | The server-provided file handle of this file
FILEID | A file-system unique identifier for this file
FS_LOCATIONS | Locations in the network where this file system may be found
OWNER | The character-string name of the file's owner
TIME_ACCESS | Time when the file data were last accessed
TIME_MODIFY | Time when the file data were last modified
TIME_CREATE | Time when the file was created
Semantics of File Sharing (1)
a) On a single processor, when a read follows a write, the value returned by the read is the value just written.
b) In a distributed system with caching, obsolete values may be returned.
Semantics of File Sharing (2)
Four ways of dealing with shared files in a distributed system. NFS implements session semantics. The remote access model can be used to provide UNIX semantics (expensive); most implementations use local caches for performance and provide session semantics.

Method | Comment
UNIX semantics | Every operation on a file is instantly visible to all processes
Session semantics | No changes are visible to other processes until the file is closed
Immutable files | No updates are possible; simplifies sharing and replication
Transactions | All changes occur atomically
Implications of Statelessness
NFS RPC requests must completely describe operations. NFS requests should be idempotent. NFS should use a stateless transport protocol (e.g., UDP). Servers don't worry about client crashes, and server crashes won't leave junk lying around.
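Why idempotency matters under retransmission can be shown with a toy example (illustrative only): a write carrying its absolute offset is safe to replay, while an append is not, which is why stateless NFS writes carry offsets.

```python
# Toy sketch: stateless, idempotent requests. A write that carries its
# absolute offset can be retransmitted safely; an append cannot.
def write_at(file, offset, data):
    file[offset:offset + len(data)] = data   # idempotent: same effect if retried

def append(file, data):
    file.extend(data)                        # NOT idempotent: retry duplicates data

f = bytearray(b"0000")
write_at(f, 2, b"AB")
write_at(f, 2, b"AB")      # duplicate retransmission: no harm
g = bytearray(b"0000")
append(g, b"AB")
append(g, b"AB")           # duplicate retransmission: data appended twice
```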
An Important Implication of Statelessness
Servers don't know what files clients think are open, unlike in UFS, LFS, and most local VFS file systems. This makes it much harder to provide certain semantics, but also scales nicely. NFS works hard to provide semantics identical to local UFS operations. Some of this is tricky, especially given the statelessness of the server. E.g., how do you avoid discarding pages of an unlinked file a client has open?
Sleazy NFS Tricks
Used to provide the desired semantics despite the statelessness of the server. E.g., if a client unlinks an open file, send a rename to the server rather than a remove, and perform the actual remove when the file is closed. This won't work if the file is removed on the server, and won't work with cooperating clients.
File Handles
The method clients use to identify files. Created by the server on file lookup. Must be unique mappings of a server file identifier to a universal identifier. File handles become invalid when the server frees or reuses the inode; an inode generation number in the handle shows when a handle is stale.
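The generation-number check can be sketched as follows (field and function names are invented): a handle embeds (inode number, generation), and when the server reuses the inode for a new file it bumps the generation, so old handles become visibly stale.

```python
# Toy sketch of inode generation numbers (invented names).
class Inode:
    def __init__(self):
        self.generation = 0

class StaleHandle(Exception):
    pass

def make_handle(inode_no, inode):
    # A handle embeds the inode number and its current generation.
    return (inode_no, inode.generation)

def check_handle(handle, inodes):
    inode_no, generation = handle
    if inodes[inode_no].generation != generation:
        raise StaleHandle(f"handle for inode {inode_no} is stale")
    return inodes[inode_no]

inodes = {7: Inode()}
h = make_handle(7, inodes[7])
check_handle(h, inodes)        # fine: generations match
inodes[7].generation += 1      # server frees and reuses inode 7
```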
rpc.lockd Daemon
The NFS server is stateless, so it does not handle file locking; rpc.lockd provides locking. It runs on both client and server: the client side catches the request and forwards it to the server daemon. rpc.lockd handles lock recovery when the server crashes.
rpc.statd Daemon
Also runs on both client and server; used to check the status of a machine. The server's rpc.lockd asks rpc.statd to store permanent lock information (in the file system) and to monitor the status of the locking machine. If a client crashes, its locks are cleared from the server.
Recovering Locks After Recovering Locks After a Crasha Crash
If server crashes and recovers, its rpc.lockd contacts clients to reestablish locks
If client crashes, rpc.statd contacts client when it becomes available again
Client has short grace period to revalidate locks Then they’re cleared
File Locking in NFS (1)
NFS version 4 operations related to file locking. Applications can use locks to ensure consistency. Locking was not part of the NFS protocol itself in earlier versions; NFS v4 supports locking as part of the protocol (see the table below).

Operation | Description
Lock | Creates a lock for a range of bytes
Lockt | Test whether a conflicting lock has been granted
Locku | Remove a lock from a range of bytes
Renew | Renew the lease on a specified lock
File Locking in NFS (2)File Locking in NFS (2)
The result of an open operation with share reservations in NFS.a) When the client requests shared access given the current denial state.b) When the client requests a denial state given the current file access state.
Current file denial stateNONE READ WRITE BOTH
READ Succeed Fail Succeed Succeed
WRITE Succeed Succeed Fail Succeed
BOTH Succeed Succeed Succeed Fail
(a)
Requested file denial stateNONE READ WRITE BOTH
READ Succeed Fail Succeed Succeed
WRITE Succeed Succeed Fail Succeed
BOTH Succeed Succeed Succeed Fail
(b)
Requestaccess
Currentaccessstate
Caching in NFS
What can you cache at NFS clients? How do you handle invalid client caches? Data blocks are read ahead by the biod daemon and cached in the normal file system cache area. File attributes are specially cached by NFS. Directory attributes are handled a little differently than file attributes. Attribute caching is especially important because many programs get and set attributes frequently.
Client Caching (1)
Client-side caching is left to the implementation (NFS does not prohibit it), and different implementations use different caching policies. Sun allows cached data to be stale for up to 30 seconds.
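A time-bounded cache in the spirit of that 30-second window can be sketched like this (a toy model, not Sun's implementation; names are invented). The fake clock in the usage makes the expiry testable.

```python
import time

# Toy sketch of a time-bounded attribute cache: entries are trusted without
# revalidation until their TTL expires, so reads may be up to `ttl` seconds stale.
class AttrCache:
    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl, self.clock, self.entries = ttl, clock, {}

    def put(self, name, attrs):
        self.entries[name] = (attrs, self.clock())

    def get(self, name):
        attrs, stamp = self.entries.get(name, (None, None))
        if attrs is None or self.clock() - stamp > self.ttl:
            return None          # expired or absent: caller must ask the server
        return attrs             # possibly up to `ttl` seconds stale

# Usage with a fake clock, to show the 30-second window:
now = [0.0]
cache = AttrCache(ttl=30.0, clock=lambda: now[0])
cache.put("f", {"size": 10})
now[0] = 29.0
fresh = cache.get("f")           # still within the window
now[0] = 31.0
stale = cache.get("f")           # window elapsed: must revalidate
```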
Client Caching (2)
NFS v4 supports open delegation: the server delegates local open and close requests to the NFS client, and uses a callback mechanism to recall the file delegation.
RPC Failures
Three situations for handling retransmissions:
a) The request is still in progress.
b) The reply has just been returned.
c) The reply was returned some time ago, but was lost.
Security
The NFS security architecture. Simplest case: user ID and group ID authentication only.
Secure RPCs
Secure RPC in NFS version 4.
Access Control
The classification of operations recognized by NFS with respect to access control.

Operation | Description
Read_data | Permission to read the data contained in a file
Write_data | Permission to modify a file's data
Append_data | Permission to append data to a file
Execute | Permission to execute a file
List_directory | Permission to list the contents of a directory
Add_file | Permission to add a new file to a directory
Add_subdirectory | Permission to create a subdirectory in a directory
Delete | Permission to delete a file
Delete_child | Permission to delete a file or directory within a directory
Read_acl | Permission to read the ACL
Write_acl | Permission to write the ACL
Read_attributes | The ability to read the other basic attributes of a file
Write_attributes | Permission to change the other basic attributes of a file
Read_named_attrs | Permission to read the named attributes of a file
Write_named_attrs | Permission to write the named attributes of a file
Write_owner | Permission to change the owner
Synchronize | Permission to access a file locally at the server with synchronous reads and writes
Andrew Model
Files are stored permanently at file server machines. Users work from workstation machines, each with its own private namespace. Andrew provides mechanisms to cache users' files from the shared namespace.
User Model of AFS Use
Sit down at any AFS workstation anywhere. Log in and authenticate who I am. Access all files without regard to which workstation I'm using.
The Local Namespace
Each workstation stores a few files, mostly system programs and configuration files. Workstations are treated as generic, interchangeable entities.
Virtue and Vice
Vice is the system run by the file servers: a distributed system. Virtue is the protocol client workstations use to communicate with Vice.
Overall Architecture
The system is viewed as a WAN composed of LANs. Each LAN has a Vice cluster server, which stores local files, but Vice makes all files available to all clients.
Andrew Architecture Diagram
[Figure: a WAN connecting multiple LANs, each LAN with its own Vice cluster server.]
Caching the User Files
The goal is to offload work from servers to clients. When must servers do work? To answer requests and to move data. Whole files are cached at clients.
Why Whole-File Caching?
It minimizes communication with the server; most files are used in their entirety anyway. It is an easier cache management problem. But it requires substantial free disk space on workstations, and it doesn't address huge files.
The Shared Namespace
An Andrew installation has a global shared namespace. All clients see the files in the namespace under the same names: a high degree of name and location transparency.
How do servers provide the namespace?
Files are organized into volumes, and volumes are grafted together into the overall namespace. Each file has a globally unique ID. Volumes are stored at individual servers, but a volume can be moved from server to server.
Finding a File
At a high level, files have names; a directory translates a name to a unique ID. If the client knows where the volume is, it simply sends the unique ID to the appropriate server.
Finding a Volume
What if you enter a new volume? How do you find which server stores the volume? A volume-location database is stored on each server. Once information on a volume is known, the client caches it.
Moving a Volume
When a volume moves from server to server, the database must be updated: a heavyweight distributed operation. What about clients with cached information? The old server maintains forwarding information, which also eases server updates.
Handling Cached Files
A client can cache all or part of a file; files are fetched transparently when needed. The file system traps opens and sends them to the local Venus process.
The Venus Daemon
Responsible for handling a single client's cache. It caches files on open and writes modified versions back on close; cached files are saved locally after close. Directory entry translations are cached, too.
Consistency for AFS
If my workstation has a locally cached copy of a file, what if someone else changes it? Callbacks are used to invalidate my copy. This requires servers to keep information on who caches files.
Write Consistency in AFS
What if I write to my cached copy of a file? I need to get write permission from the server, which invalidates anyone else's callback. Permission is obtained on open-for-write; new data must be obtained at this point.
Write Consistency in AFS, Con't
Initially, writes go only to the local copy. On close, Venus sends the update to the server, and the server invalidates callbacks for other copies. Extra mechanism handles failures.
Storage of Andrew Files
Files are stored in UNIX file systems. The client cache is a directory on the local machine; low-level names do not match Andrew names.
Venus Cache Management
Venus keeps two caches: status and data. The status cache is kept in virtual memory for fast attribute lookup; the data cache is kept on disk.
Coda
Coda is a descendant of the Andrew file system at CMU. Andrew was designed to serve a large (global) community. Salient features: support for disconnected operation (desirable for mobile users) and support for a large number of users.
Venus Process Architecture
Venus is a single user-level process, but multithreaded. It uses RPC to talk to the server; the RPC is built on a low-level datagram service.
AFS Security
Only the servers/Vice are trusted here; client machines might be corrupted. No client programs run on Vice machines. Clients must authenticate themselves to servers, and encryption is used to protect transmissions.
AFS File Protection
AFS supports access control lists: each file has a list of users who can access it and the permitted modes of access, maintained by Vice. ACLs are used to mimic UNIX access control.
AFS Read-Only Replication
For volumes containing files that are used frequently but not changed often (e.g., executables), AFS allows multiple servers to store read-only copies.
The Coda File System
The various kinds of users and processes distinguished by NFS with respect to access control.

Type of user | Description
Owner | The owner of a file
Group | The group of users associated with a file
Everyone | Any user or process
Interactive | Any process accessing the file from an interactive terminal
Network | Any process accessing the file via the network
Dialup | Any process accessing the file through a dialup connection to the server
Batch | Any process accessing the file as part of a batch job
Anonymous | Anyone accessing the file without authentication
Authenticated | Any authenticated user or process
Service | Any system-defined service process
Overview of Coda
Centrally administered Vice file servers; a large number of Virtue clients.
Virtue: Coda Clients
The internal organization of a Virtue workstation. Designed to allow access to files even if the server is unavailable. Uses VFS and appears like a traditional Unix file system.
Communication in Coda
Coda uses RPC2, a sophisticated reliable RPC system. It starts a new thread for each request, and the server periodically informs the client that it is still working on the request. RPC2 supports side effects: application-specific protocols, useful for video streaming (where plain RPCs are less useful). RPC2 also has multicast support.
Communication: Invalidations
a) Sending invalidation messages one at a time.
b) Sending invalidation messages in parallel.
Can use MultiRPCs (parallel RPCs) or multicast; fully transparent to the caller and callee (looks like a normal RPC).
Naming
Clients in Coda have access to a single shared name space. Files are grouped into volumes (a partial subtree in the directory structure); the volume is the basic unit of mounting. Namespace: /afs/filesrv.cs.umass.edu (the same namespace on all clients, unlike NFS). Name lookup can cross mount points: support for detecting crossings and automounts.
File Identifiers
Each file in Coda belongs to exactly one volume. A volume may be replicated across several servers: a logical (replicated) volume maps to multiple physical volumes. 96-bit file identifier = 32-bit RVID + 64-bit file handle.
Sharing Files in Coda
Transactional behavior for sharing files, similar to share reservations in NFS. On file open, the entire file is transferred to the client machine (similar to delegation). Coda uses session semantics: each session is like a transaction, and updates are sent back to the server only when the file is closed.
Transactional Semantics
A network partition leaves part of the network isolated from the rest. Coda allows conflicting operations on replicas across partitions and reconciles upon reconnection. Transactional semantics => operations must be serializable; Coda ensures that operations were serializable after they have executed. A conflict forces manual reconciliation.

File-associated data | Read? | Modified?
File identifier | Yes | No
Access rights | Yes | No
Last modification time | Yes | Yes
File length | Yes | Yes
File contents | Yes | Yes
Client Caching
Cache consistency is maintained using callbacks. The server tracks all clients that have a copy of the file (providing a callback promise); upon modification it sends invalidations to those clients.
Server Replication
Coda uses replicated writes: read-one, write-all (ROWA). Writes are sent to the AVSG (all accessible replicas). How to handle network partitions? Use an optimistic strategy for replication and detect conflicts using a Coda version vector. Example: [2,2,1] and [1,1,2] is a conflict => manual reconciliation.
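Version-vector conflict detection can be sketched directly (a generic implementation, not Coda's actual code): two vectors conflict exactly when neither dominates the other componentwise, as in the [2,2,1] vs. [1,1,2] example above.

```python
# Generic version-vector comparison: one element per replica site.
def compare(v1, v2):
    ge = all(a >= b for a, b in zip(v1, v2))
    le = all(a <= b for a, b in zip(v1, v2))
    if ge and le:
        return "equal"
    if ge:
        return "v1 dominates"   # v1 is strictly newer: propagate v1
    if le:
        return "v2 dominates"   # v2 is strictly newer: propagate v2
    return "conflict"           # concurrent updates: manual reconciliation
```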
Disconnected Operation
The state-transition diagram of a Coda client with respect to a volume. Coda uses hoarding to provide file access during disconnection: prefetch all files that may be accessed and cache (hoard) them locally. If AVSG = 0, go to emulation mode, and reintegrate upon reconnection.
Overview of xFS
Key idea: a fully distributed (serverless) file system [Berkeley, 1996]. The "x" in "xFS" => no server. Designed for high-speed LAN environments.
Processes in xFS
The principle of log-based striping in xFS: it combines striping and logging.
Reading a File Block
Reading a block of data in xFS.
xFS Naming
Main data structures used in xFS.

Data structure | Description
Manager map | Maps file ID to manager
Imap | Maps file ID to log address of file's inode
Inode | Maps block number (i.e., offset) to log address of block
File identifier | Reference used to index into manager map
File directory | Maps a file name to a file identifier
Log addresses | Triplet of stripe group ID, segment ID, and segment offset
Stripe group map | Maps stripe group ID to list of storage servers
Secure Channels (1)
Mutual authentication in RPC2.
Secure Channels (2)
Setting up a secure channel between a (Venus) client and a Vice server in Coda.
Access Control
Classification of file and directory operations recognized by Coda with respect to access control.

Operation | Description
Read | Read any file in the directory
Write | Modify any file in the directory
Lookup | Look up the status of any file
Insert | Add a new file to the directory
Delete | Delete an existing file
Administer | Modify the ACL of the directory
Plan 9: Resources Unified to Files
General organization of Plan 9.
Communication
Files associated with a single TCP connection in Plan 9.

File | Description
ctl | Used to write protocol-specific control commands
data | Used to read and write data
listen | Used to accept incoming connection setup requests
local | Provides information on the caller's side of the connection
remote | Provides information on the other side of the connection
status | Provides diagnostic information on the current status of the connection
Processes
The Plan 9 file server. WORM: Write-Once Read-Many (IBM, 1980).
Naming
A union directory in Plan 9: multiple mounts to one point.
Overview of xFS
A typical distribution of xFS processes across multiple machines.
Processes (1)
The principle of log-based striping in xFS.
Processes (2)
Reading a block of data in xFS.
Naming
Main data structures used in xFS.

Data structure | Description
Manager map | Maps file ID to manager
Imap | Maps file ID to log address of file's inode
Inode | Maps block number (i.e., offset) to log address of block
File identifier | Reference used to index into manager map
File directory | Maps a file name to a file identifier
Log addresses | Triplet of stripe group ID, segment ID, and segment offset
Stripe group map | Maps stripe group ID to list of storage servers
Overview of SFS
The organization of SFS. Design goal: scalable security.
Naming
A self-certifying pathname in SFS, of the form /sfs/LOC:HID/Pathname:
/sfs/sfs.vu.sc.nl:ag62hty4wior450hdh63u623i4f0kqere/home/steen/mbox
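Splitting such a pathname into its parts can be sketched with a toy parser (the LOC and HID field names follow the slide above; the function is invented for illustration):

```python
# Toy parser for an SFS self-certifying pathname: /sfs/LOC:HID/Pathname.
# LOC is the server's location (DNS name); HID is the hash of its public key.
def parse_sfs_path(path):
    assert path.startswith("/sfs/")
    rest = path[len("/sfs/"):]
    loc_hid, _, subpath = rest.partition("/")   # split off the server part
    loc, _, hid = loc_hid.partition(":")        # split location from host ID
    return loc, hid, "/" + subpath

loc, hid, sub = parse_sfs_path(
    "/sfs/sfs.vu.sc.nl:ag62hty4wior450hdh63u623i4f0kqere/home/steen/mbox")
```

Because HID is derived from the server's public key, the name itself certifies the server: no separate trusted directory of keys is needed.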
Summary
A comparison between NFS, Coda, Plan 9, xFS, and SFS. N/S indicates that nothing has been specified.

Issue | NFS | Coda | Plan 9 | xFS | SFS
Design goals | Access transparency | High availability | Uniformity | Serverless system | Scalable security
Access model | Remote | Up/Download | Remote | Log-based | Remote
Communication | RPC | RPC | Special | Active msgs | RPC
Client process | Thin/Fat | Fat | Thin | Fat | Medium
Server groups | No | Yes | No | Yes | No
Mount granularity | Directory | File system | File system | File system | Directory
Name space | Per client | Global | Per process | Global | Global
File ID scope | File server | Global | Server | Global | File system
Sharing sem. | Session | Transactional | UNIX | UNIX | N/S
Cache consist. | Write-back | Write-back | Write-through | Write-back | Write-back
Replication | Minimal | ROWA | None | Striping | None
Fault tolerance | Reliable comm. | Replication and caching | Reliable comm. | Striping | Reliable comm.
Recovery | Client-based | Reintegration | N/S | Checkpoint & write logs | N/S
Secure channels | Existing mechanisms | Needham-Schroeder | Needham-Schroeder | None | Self-certifying pathnames
Access control | Many operations | Directory operations | UNIX-based | UNIX-based | NFS-based
GFS Characteristics
1. First, component failures are the norm rather than the exception. Problems are caused by application bugs, operating system bugs, human errors, and failures of disks, memory, connectors, networking, and power supplies.
2. Second, files are huge by traditional standards; multi-GB files are common.
3. Third, most files are mutated by appending new data rather than overwriting existing data.
4. Fourth, co-designing the applications and the file system API benefits the overall system by increasing flexibility: GFS's consistency model was relaxed, and an atomic append operation was introduced.
DESIGN OVERVIEW
Assumptions:
1. The system is built from many inexpensive commodity components that often fail.
2. The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size.
3. The workloads primarily consist of two kinds of reads: large streaming reads and small random reads.
4. The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads.
5. The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file.
6. High sustained bandwidth is more important than low latency.
Interface
Files are organized hierarchically in directories and identified by pathnames. GFS supports the usual operations to create, delete, open, close, read, and write files. Moreover, GFS has snapshot and record append operations.
Architecture
Architecture (desc.)
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range.
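How a client turns a file byte offset into a chunk request can be sketched as follows (the 64 MB chunk size is the one GFS uses; the helper function is invented for illustration):

```python
# Toy sketch of the client-side offset-to-chunk translation in a GFS-like
# design. The client computes the chunk index, asks the master for that
# chunk's handle and replica locations, then reads the byte range directly
# from a chunkserver.
CHUNK_SIZE = 64 * 1024 * 1024    # GFS uses 64 MB chunks

def chunk_request(byte_offset):
    chunk_index = byte_offset // CHUNK_SIZE      # which chunk of the file
    offset_in_chunk = byte_offset % CHUNK_SIZE   # where inside that chunk
    return chunk_index, offset_in_chunk
```

This translation is why the master stays off the data path: it is consulted only for the (file, chunk index) -> chunk handle mapping, which clients can cache.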
Architecture (desc., contd.)
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers.
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application.
Neither the client nor the chunkserver caches file data.
Architecture (desc., contd.)
Single master: why? Chunk size: how much? A large chunk size has advantages:
1. First, it reduces clients' need to interact with the master because ...
2. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, ...
3. Third, it reduces the size of the metadata stored on the master.
Disadvantages:
1. Internal fragmentation (solution: lazy space allocation)
2. Hot spots for small files
Metadata
The master stores three major types of metadata:
1. The file and chunk namespaces
2. The mapping from files to chunks
3. The locations of each chunk's replicas
a) In-Memory Data Structures
b) Chunk Locations
c) Operation Log
Consistency Model
1. Guarantees by GFS
2. Implications for Applications: relying on appends rather than overwrites, checkpointing, writing self-validating, self-identifying records
SYSTEM INTERACTIONS
Minimize the master's involvement in all operations: Leases and Mutation Order; Data Flow; Atomic Record Appends; Snapshot.
MASTER OPERATIONS
Replica Placement: maximize data reliability and availability; maximize network bandwidth utilization. Creation, Re-replication, Rebalancing. Garbage Collection. Stale Replica Detection.
FAULT TOLERANCE AND DIAGNOSIS
High Availability: Fast Recovery, Master Replication. Data Integrity. Diagnostic Tools.