Univ. of Tehran, Distributed Operating Systems
Advanced Operating Systems
University of Tehran, Dept. of EE and Computer Engineering
By: Dr. Nasser Yazdani
Lecture 14: Distributed File System
How to design a system: how to share data in a DS.
References: Chapter 10 of the textbook; the "Google file system" paper.
Outline: What are files? General problems. Network File System (NFS). Andrew File System. Google File System.
What are Files?
A file is a collection of data organized by the user; it is not necessarily meaningful to the operating system.
The file system is responsible for managing files, typically on persistent storage: naming files in meaningful ways; accessing files (create, destroy, read, write, etc.); physical allocation; security and protection; resource administration (quotas, priorities).
What does a DFS do? The same things, in a distributed environment. Transparency is important here.
Why a Dist. File System?
More storage than can fit on a single system.
More fault tolerance than can be achieved if "all of the eggs are in one basket."
The user is "distributed" and needs to access the file system from many places.
How to build a DFS? Grafting a Name Space Into the Tree
Mounting via "/etc/vfstab". Solaris can bind remote directories to local mount points "on demand" through the "automounter". Mounting allows files to be spread out across multiple servers and/or replicated: AFS implements read-only replication; Coda supports read-write replication.
Implementing Operations: typically done via the virtual file system (VFS) interface.
How to build a DFS (2)? Unit of Transit
How much do we move? Whole files or blocks. The Andrew File System (AFS) and Coda versions 1 and 2 use whole-file semantics. NFS and AFS version 3 implement block-level semantics. None implements byte-level semantics. Whole-file transfer means long times on opening and closing a file, and less efficient caching.
Reads, Writes and Coherency: if multiple users write, the final result will be the file as seen from the perspective of one user, but not both. In UNIX, files are not protected from conflicting writes.
How to build a DFS (3)? Caching and Server State
High latency means we need caching. What happens if two different users want to read the same file? What happens if one of them writes to the file? How do the clients holding cached copies know?
Ostrich principle: live with the inconsistency if it occurs. Or periodically validate the cache by checking with the server: "checksum [blah]?" or "timestamp [blah]?" Or keep track of the users by issuing a callback promise to each client as it collects a file/block. The callback-based approach is optimistic and saves network overhead, but it complicates the server and reduces its robustness.
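The callback-promise scheme can be sketched as a toy model (class and method names are invented, not the NFS or AFS API): the server records which clients cache each file and invalidates their copies when another client writes.

```python
# Toy sketch of callback-based cache invalidation (invented names).
class FileServer:
    def __init__(self):
        self.files = {}         # name -> contents
        self.promises = {}      # name -> set of clients holding a cached copy

    def fetch(self, client, name):
        # Record a callback promise: we must notify this client on change.
        self.promises.setdefault(name, set()).add(client)
        return self.files.get(name, b"")

    def store(self, writer, name, data):
        self.files[name] = data
        # Break the promise for every other caching client.
        for client in self.promises.get(name, set()) - {writer}:
            client.invalidate(name)
        self.promises[name] = {writer}

class Client:
    def __init__(self, server):
        self.server, self.cache = server, {}

    def read(self, name):
        if name not in self.cache:              # miss: fetch and cache
            self.cache[name] = self.server.fetch(self, name)
        return self.cache[name]

    def write(self, name, data):
        self.cache[name] = data
        self.server.store(self, name, data)

    def invalidate(self, name):                 # callback from the server
        self.cache.pop(name, None)
```

Note the cost this sketch makes visible: the server must remember every caching client, which is exactly the state that complicates the server and reduces its robustness.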
File Sharing Semantics
Unix semantics: a read after a write returns the value written. The system enforces absolute time ordering on all operations and always returns the most recent value; changes are immediately visible to all processes. Difficult to enforce in distributed file systems unless all accesses occur at the server (with no client caching).
Session semantics: local changes are visible only to the process that opened the file; a file close makes the changes visible to all processes. This allows local caching of the file at the client. Two nearly simultaneous file closes => one overwrites the other?
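The "two closes" question can be illustrated with a toy model of session semantics (names are invented): each open takes a private snapshot, changes propagate only at close, and the last close silently wins.

```python
# Toy sketch of session semantics (invented names).
class SessionFS:
    def __init__(self):
        self.files = {}

    def open(self, name):
        # Each open takes a private snapshot of the file's contents.
        return {"name": name, "data": self.files.get(name, "")}

    def close(self, session):
        # Changes become visible only at close; the last close wins.
        self.files[session["name"]] = session["data"]

fs = SessionFS()
s1, s2 = fs.open("f"), fs.open("f")
s1["data"], s2["data"] = "from s1", "from s2"
fs.close(s1)
fs.close(s2)   # silently overwrites s1's update
```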
Other File Sharing Semantics
Immutable files: create/delete only; no modifications allowed. What about deleting a file in use by another process?
Atomic transactions: access to files protected by transactions; serializable access, but costly to implement.
NFS
Networked File System: provides distributed filing by remote access, with a high degree of transparency. A method of providing highly transparent access to remote files. Developed by Sun.
NFS Characteristics
Volume-level access. RPC-based. Stateless remote file access. Uses XDR for transferring files. Location (not name) transparent. Implementations for many systems, all interoperating, even non-Unix ones. Currently based on VFS.
VFS/Vnode Review
VFS: Virtual File System. A common interface allowing multiple file system implementations on one system, plugged in below user level. Files are represented by vnodes.
NFS Diagram
[Figure: an NFS client tree (/, with /tmp and /mnt, where /mnt holds x and y) and an NFS server tree (/, with /home and /bin, where /bin holds foo and bar); a server directory is mounted at the client's /mnt.]
File Handles
On the client side, files are represented by vnodes. The client NFS implementation internally represents remote files as handles, which are opaque to the client but meaningful to the server. To name a remote file, the client provides the handle to the server.
NFS Architecture (1)
a) The remote access model.
b) The upload/download model.
NFS Architecture (2)
The basic NFS architecture for UNIX systems.
Communication
a) Reading data from a file in NFS version 3.
b) Reading data using a compound procedure in version 4.
NFS Handle Diagram
[Figure: on the client side, a user process holds a file descriptor that refers to a vnode at the VFS level; the NFS level maps that vnode to an opaque handle. On the server side, the NFS server maps the handle to a vnode at the VFS level, which refers to a UFS inode.]
How to make this work?
We could integrate it into the kernel, but that is non-portable and non-distributable. Instead, use existing features to do the work: VFS for the common interface, RPC for data transport.
Using RPC for NFS
There must be some process at the server that answers the RPC requests: a continuously running daemon process. Somehow, mounts must also be performed across machine boundaries: a second daemon process handles this.
NFS Processes
nfsd daemons: server daemons that accept RPC calls for NFS.
rpc.mountd daemons: server daemons that handle mount requests.
biod daemons: optional client daemons that can improve performance.
NFS from the Client's Side
The user issues a normal file operation, like read(). It passes through the vnode interface to the client-side NFS implementation, which formats and sends an RPC packet to perform the operation. The client blocks until the NFS RPC returns.
NFS RPC Procedures
16 RPC procedures implement NFS: some for files, some for file systems, including directory ops, link ops, read, write, etc. Lookup() is the key operation, because it fetches handles. Other NFS file operations use the handle.
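The role of Lookup() can be sketched with a toy server (invented names, not the actual NFS wire protocol): resolving a pathname costs one lookup per component, and the opaque handle returned is what later operations use.

```python
# Toy sketch of NFS-style pathname resolution: one LOOKUP per component.
# The server maps (directory handle, name) -> handle; the client walks the path.
class Server:
    def __init__(self, tree):
        self.handles = {0: tree}          # handle -> directory dict or file data
        self.next_handle = 1

    def lookup(self, dir_handle, name):
        entry = self.handles[dir_handle][name]
        h = self.next_handle
        self.next_handle += 1
        self.handles[h] = entry
        return h                          # opaque handle for later READ/WRITE

    def read(self, handle):
        return self.handles[handle]

def resolve(server, root_handle, path):
    h = root_handle
    for component in path.strip("/").split("/"):
        h = server.lookup(h, component)   # one RPC per path component
    return h

srv = Server({"home": {"alice": {"mbox": "hello"}}})
h = resolve(srv, 0, "/home/alice/mbox")
```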
Naming (1)
Mounting (part of) a remote file system in NFS.
Naming (2)
Mounting nested directories from multiple servers in NFS.
Automounting (1)
A simple automounter for NFS.
Automounting (2)
Using symbolic links with automounting.
File Attributes (1)
Some general mandatory file attributes in NFS. NFS is modeled on Unix-like file systems, which makes implementing NFS on other file systems (e.g., Windows) difficult. NFS v4 enhances compatibility by distinguishing mandatory and recommended attributes.

Attribute | Description
TYPE | The type of the file (regular, directory, symbolic link)
SIZE | The length of the file in bytes
CHANGE | Indicator for a client to see if and/or when the file has changed
FSID | Server-unique identifier of the file's file system
File Attributes (2)
Some general recommended file attributes.

Attribute | Description
ACL | An access control list associated with the file
FILEHANDLE | The server-provided file handle of this file
FILEID | A file-system unique identifier for this file
FS_LOCATIONS | Locations in the network where this file system may be found
OWNER | The character-string name of the file's owner
TIME_ACCESS | Time when the file data were last accessed
TIME_MODIFY | Time when the file data were last modified
TIME_CREATE | Time when the file was created
Semantics of File Sharing (1)
a) On a single processor, when a read follows a write, the value returned by the read is the value just written.
b) In a distributed system with caching, obsolete values may be returned.
Semantics of File Sharing (2)
Four ways of dealing with shared files in a distributed system. NFS implements session semantics. The remote access model can be used to provide UNIX semantics (expensive); most implementations use local caches for performance and provide session semantics.

Method | Comment
UNIX semantics | Every operation on a file is instantly visible to all processes
Session semantics | No changes are visible to other processes until the file is closed
Immutable files | No updates are possible; simplifies sharing and replication
Transactions | All changes occur atomically
Implications of Statelessness
NFS RPC requests must completely describe operations. NFS requests should be idempotent. NFS should use a stateless transport protocol (e.g., UDP). Servers don't worry about client crashes, and server crashes won't leave junk lying around.
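Why idempotency matters under retransmission can be shown with a toy example (illustrative only): a write carrying its absolute offset is safe to replay, while an append is not, which is why stateless NFS writes carry offsets.

```python
# Toy sketch: stateless, idempotent requests. A write that carries its
# absolute offset can be retransmitted safely; an append cannot.
def write_at(file, offset, data):
    file[offset:offset + len(data)] = data   # idempotent: same effect if retried

def append(file, data):
    file.extend(data)                        # NOT idempotent: retry duplicates data

f = bytearray(b"0000")
write_at(f, 2, b"AB")
write_at(f, 2, b"AB")      # duplicate retransmission: no harm
g = bytearray(b"0000")
append(g, b"AB")
append(g, b"AB")           # duplicate retransmission: data appended twice
```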
An Important Implication of Statelessness
Servers don't know what files clients think are open, unlike in UFS, LFS, and most local VFS file systems. This makes it much harder to provide certain semantics, but also scales nicely. NFS works hard to provide semantics identical to local UFS operations. Some of this is tricky, especially given the statelessness of the server. E.g., how do you avoid discarding pages of an unlinked file a client has open?
Sleazy NFS Tricks
Used to provide the desired semantics despite the statelessness of the server. E.g., if a client unlinks an open file, send a rename to the server rather than a remove, and perform the actual remove when the file is closed. This won't work if the file is removed on the server, and won't work with cooperating clients.
File Handles
The method clients use to identify files. Created by the server on file lookup. Must be unique mappings of a server file identifier to a universal identifier. File handles become invalid when the server frees or reuses the inode; an inode generation number in the handle shows when a handle is stale.
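The generation-number check can be sketched as follows (field and function names are invented): a handle embeds (inode number, generation), and when the server reuses the inode for a new file it bumps the generation, so old handles become visibly stale.

```python
# Toy sketch of inode generation numbers (invented names).
class Inode:
    def __init__(self):
        self.generation = 0

class StaleHandle(Exception):
    pass

def make_handle(inode_no, inode):
    # A handle embeds the inode number and its current generation.
    return (inode_no, inode.generation)

def check_handle(handle, inodes):
    inode_no, generation = handle
    if inodes[inode_no].generation != generation:
        raise StaleHandle(f"handle for inode {inode_no} is stale")
    return inodes[inode_no]

inodes = {7: Inode()}
h = make_handle(7, inodes[7])
check_handle(h, inodes)        # fine: generations match
inodes[7].generation += 1      # server frees and reuses inode 7
```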
rpc.lockd Daemon
The NFS server is stateless, so it does not handle file locking; rpc.lockd provides locking. It runs on both client and server: the client side catches the request and forwards it to the server daemon. rpc.lockd handles lock recovery when the server crashes.
rpc.statd Daemon
Also runs on both client and server; used to check the status of a machine. The server's rpc.lockd asks rpc.statd to store permanent lock information (in the file system) and to monitor the status of the locking machine. If a client crashes, its locks are cleared from the server.
Recovering Locks After Recovering Locks After a Crasha Crash
If server crashes and recovers, its rpc.lockd contacts clients to reestablish locks
If client crashes, rpc.statd contacts client when it becomes available again
Client has short grace period to revalidate locks Then they’re cleared
File Locking in NFS (1)
NFS version 4 operations related to file locking. Applications can use locks to ensure consistency. Locking was not part of the NFS protocol itself in earlier versions; NFS v4 supports locking as part of the protocol (see the table below).

Operation | Description
Lock | Creates a lock for a range of bytes
Lockt | Test whether a conflicting lock has been granted
Locku | Remove a lock from a range of bytes
Renew | Renew the lease on a specified lock
File Locking in NFS (2)File Locking in NFS (2)
The result of an open operation with share reservations in NFS.a) When the client requests shared access given the current denial state.b) When the client requests a denial state given the current file access state.
Current file denial stateNONE READ WRITE BOTH
READ Succeed Fail Succeed Succeed
WRITE Succeed Succeed Fail Succeed
BOTH Succeed Succeed Succeed Fail
(a)
Requested file denial stateNONE READ WRITE BOTH
READ Succeed Fail Succeed Succeed
WRITE Succeed Succeed Fail Succeed
BOTH Succeed Succeed Succeed Fail
(b)
Requestaccess
Currentaccessstate
Caching in NFS
What can you cache at NFS clients? How do you handle invalid client caches? Data blocks are read ahead by the biod daemon and cached in the normal file system cache area. File attributes are specially cached by NFS. Directory attributes are handled a little differently than file attributes. Attribute caching is especially important because many programs get and set attributes frequently.
Client Caching (1)
Client-side caching is left to the implementation (NFS does not prohibit it), and different implementations use different caching policies. Sun allows cached data to be stale for up to 30 seconds.
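A time-bounded cache in the spirit of that 30-second window can be sketched like this (a toy model, not Sun's implementation; names are invented). The fake clock in the usage makes the expiry testable.

```python
import time

# Toy sketch of a time-bounded attribute cache: entries are trusted without
# revalidation until their TTL expires, so reads may be up to `ttl` seconds stale.
class AttrCache:
    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl, self.clock, self.entries = ttl, clock, {}

    def put(self, name, attrs):
        self.entries[name] = (attrs, self.clock())

    def get(self, name):
        attrs, stamp = self.entries.get(name, (None, None))
        if attrs is None or self.clock() - stamp > self.ttl:
            return None          # expired or absent: caller must ask the server
        return attrs             # possibly up to `ttl` seconds stale

# Usage with a fake clock, to show the 30-second window:
now = [0.0]
cache = AttrCache(ttl=30.0, clock=lambda: now[0])
cache.put("f", {"size": 10})
now[0] = 29.0
fresh = cache.get("f")           # still within the window
now[0] = 31.0
stale = cache.get("f")           # window elapsed: must revalidate
```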
Client Caching (2)
NFS v4 supports open delegation: the server delegates local open and close requests to the NFS client, and uses a callback mechanism to recall the file delegation.
RPC Failures
Three situations for handling retransmissions:
a) The request is still in progress.
b) The reply has just been returned.
c) The reply was returned some time ago, but was lost.
Security
The NFS security architecture. Simplest case: user ID and group ID authentication only.
Secure RPCs
Secure RPC in NFS version 4.
Access Control
The classification of operations recognized by NFS with respect to access control.

Operation | Description
Read_data | Permission to read the data contained in a file
Write_data | Permission to modify a file's data
Append_data | Permission to append data to a file
Execute | Permission to execute a file
List_directory | Permission to list the contents of a directory
Add_file | Permission to add a new file to a directory
Add_subdirectory | Permission to create a subdirectory in a directory
Delete | Permission to delete a file
Delete_child | Permission to delete a file or directory within a directory
Read_acl | Permission to read the ACL
Write_acl | Permission to write the ACL
Read_attributes | The ability to read the other basic attributes of a file
Write_attributes | Permission to change the other basic attributes of a file
Read_named_attrs | Permission to read the named attributes of a file
Write_named_attrs | Permission to write the named attributes of a file
Write_owner | Permission to change the owner
Synchronize | Permission to access a file locally at the server with synchronous reads and writes
Andrew Model
Files are stored permanently at file server machines. Users work from workstation machines, each with its own private namespace. Andrew provides mechanisms to cache users' files from the shared namespace.
User Model of AFS Use
Sit down at any AFS workstation anywhere. Log in and authenticate who I am. Access all files without regard to which workstation I'm using.
The Local Namespace
Each workstation stores a few files, mostly system programs and configuration files. Workstations are treated as generic, interchangeable entities.
Virtue and Vice
Vice is the system run by the file servers: a distributed system. Virtue is the protocol client workstations use to communicate with Vice.
Overall Architecture
The system is viewed as a WAN composed of LANs. Each LAN has a Vice cluster server, which stores local files, but Vice makes all files available to all clients.
Andrew Architecture Diagram
[Figure: a WAN connecting multiple LANs, each LAN with its own Vice cluster server.]
Caching the User Files
The goal is to offload work from servers to clients. When must servers do work? To answer requests and to move data. Whole files are cached at clients.
Why Whole-File Caching?
It minimizes communication with the server; most files are used in their entirety anyway. It is an easier cache management problem. But it requires substantial free disk space on workstations, and it doesn't address huge files.
The Shared Namespace
An Andrew installation has a global shared namespace. All clients see the files in the namespace under the same names: a high degree of name and location transparency.
How do servers provide the namespace?
Files are organized into volumes, and volumes are grafted together into the overall namespace. Each file has a globally unique ID. Volumes are stored at individual servers, but a volume can be moved from server to server.
Finding a File
At a high level, files have names; a directory translates a name to a unique ID. If the client knows where the volume is, it simply sends the unique ID to the appropriate server.
Finding a Volume
What if you enter a new volume? How do you find which server stores the volume? A volume-location database is stored on each server. Once information on a volume is known, the client caches it.
Moving a Volume
When a volume moves from server to server, the database must be updated: a heavyweight distributed operation. What about clients with cached information? The old server maintains forwarding information, which also eases server updates.
Handling Cached Files
A client can cache all or part of a file; files are fetched transparently when needed. The file system traps opens and sends them to the local Venus process.
The Venus Daemon
Responsible for handling a single client's cache. It caches files on open and writes modified versions back on close; cached files are saved locally after close. Directory entry translations are cached, too.
Consistency for AFS
If my workstation has a locally cached copy of a file, what if someone else changes it? Callbacks are used to invalidate my copy. This requires servers to keep information on who caches files.
Write Consistency in AFS
What if I write to my cached copy of a file? I need to get write permission from the server, which invalidates anyone else's callback. Permission is obtained on open-for-write; new data must be obtained at this point.
Write Consistency in AFS, Con't
Initially, writes go only to the local copy. On close, Venus sends the update to the server, and the server invalidates callbacks for other copies. Extra mechanism handles failures.
Storage of Andrew Files
Files are stored in UNIX file systems. The client cache is a directory on the local machine; low-level names do not match Andrew names.
Venus Cache Management
Venus keeps two caches: status and data. The status cache is kept in virtual memory for fast attribute lookup; the data cache is kept on disk.
Coda
Coda is a descendant of the Andrew file system at CMU. Andrew was designed to serve a large (global) community. Salient features: support for disconnected operation (desirable for mobile users) and support for a large number of users.
Venus Process Architecture
Venus is a single user-level process, but multithreaded. It uses RPC to talk to the server; the RPC is built on a low-level datagram service.
AFS Security
Only the servers/Vice are trusted here; client machines might be corrupted. No client programs run on Vice machines. Clients must authenticate themselves to servers, and encryption is used to protect transmissions.
AFS File Protection
AFS supports access control lists: each file has a list of users who can access it and the permitted modes of access, maintained by Vice. ACLs are used to mimic UNIX access control.
AFS Read-Only Replication
For volumes containing files that are used frequently but not changed often (e.g., executables), AFS allows multiple servers to store read-only copies.
The Coda File System
The various kinds of users and processes distinguished by NFS with respect to access control.

Type of user | Description
Owner | The owner of a file
Group | The group of users associated with a file
Everyone | Any user or process
Interactive | Any process accessing the file from an interactive terminal
Network | Any process accessing the file via the network
Dialup | Any process accessing the file through a dialup connection to the server
Batch | Any process accessing the file as part of a batch job
Anonymous | Anyone accessing the file without authentication
Authenticated | Any authenticated user or process
Service | Any system-defined service process
Overview of Coda
Centrally administered Vice file servers; a large number of Virtue clients.
Virtue: Coda Clients
The internal organization of a Virtue workstation. Designed to allow access to files even if the server is unavailable. Uses VFS and appears like a traditional Unix file system.
Communication in Coda
Coda uses RPC2, a sophisticated reliable RPC system. It starts a new thread for each request, and the server periodically informs the client that it is still working on the request. RPC2 supports side effects: application-specific protocols, useful for video streaming (where plain RPCs are less useful). RPC2 also has multicast support.
Communication: Invalidations
a) Sending invalidation messages one at a time.
b) Sending invalidation messages in parallel.
Can use MultiRPCs (parallel RPCs) or multicast; fully transparent to the caller and callee (looks like a normal RPC).
Naming
Clients in Coda have access to a single shared name space. Files are grouped into volumes (a partial subtree in the directory structure); the volume is the basic unit of mounting. Namespace: /afs/filesrv.cs.umass.edu (the same namespace on all clients, unlike NFS). Name lookup can cross mount points: support for detecting crossings and automounts.
File Identifiers
Each file in Coda belongs to exactly one volume. A volume may be replicated across several servers: a logical (replicated) volume maps to multiple physical volumes. 96-bit file identifier = 32-bit RVID + 64-bit file handle.
Sharing Files in Coda
Transactional behavior for sharing files, similar to share reservations in NFS. On file open, the entire file is transferred to the client machine (similar to delegation). Coda uses session semantics: each session is like a transaction, and updates are sent back to the server only when the file is closed.
Transactional Semantics
A network partition leaves part of the network isolated from the rest. Coda allows conflicting operations on replicas across partitions and reconciles upon reconnection. Transactional semantics => operations must be serializable; Coda ensures that operations were serializable after they have executed. A conflict forces manual reconciliation.

File-associated data | Read? | Modified?
File identifier | Yes | No
Access rights | Yes | No
Last modification time | Yes | Yes
File length | Yes | Yes
File contents | Yes | Yes
Client Caching
Cache consistency is maintained using callbacks. The server tracks all clients that have a copy of the file (providing a callback promise); upon modification it sends invalidations to those clients.
Server Replication
Coda uses replicated writes: read-one, write-all (ROWA). Writes are sent to the AVSG (all accessible replicas). How to handle network partitions? Use an optimistic strategy for replication and detect conflicts using a Coda version vector. Example: [2,2,1] and [1,1,2] is a conflict => manual reconciliation.
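Version-vector conflict detection can be sketched directly (a generic implementation, not Coda's actual code): two vectors conflict exactly when neither dominates the other componentwise, as in the [2,2,1] vs. [1,1,2] example above.

```python
# Generic version-vector comparison: one element per replica site.
def compare(v1, v2):
    ge = all(a >= b for a, b in zip(v1, v2))
    le = all(a <= b for a, b in zip(v1, v2))
    if ge and le:
        return "equal"
    if ge:
        return "v1 dominates"   # v1 is strictly newer: propagate v1
    if le:
        return "v2 dominates"   # v2 is strictly newer: propagate v2
    return "conflict"           # concurrent updates: manual reconciliation
```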
Disconnected Operation
The state-transition diagram of a Coda client with respect to a volume. Coda uses hoarding to provide file access during disconnection: prefetch all files that may be accessed and cache (hoard) them locally. If AVSG = 0, go to emulation mode, and reintegrate upon reconnection.
Overview of xFS
Key idea: a fully distributed (serverless) file system [Berkeley, 1996]. The "x" in "xFS" => no server. Designed for high-speed LAN environments.
Processes in xFS
The principle of log-based striping in xFS: it combines striping and logging.
Reading a File Block
Reading a block of data in xFS.
xFS Naming
Main data structures used in xFS.

Data structure | Description
Manager map | Maps file ID to manager
Imap | Maps file ID to log address of file's inode
Inode | Maps block number (i.e., offset) to log address of block
File identifier | Reference used to index into manager map
File directory | Maps a file name to a file identifier
Log addresses | Triplet of stripe group ID, segment ID, and segment offset
Stripe group map | Maps stripe group ID to list of storage servers
Secure Channels (1)
Mutual authentication in RPC2.
Secure Channels (2)
Setting up a secure channel between a (Venus) client and a Vice server in Coda.
Access Control
Classification of file and directory operations recognized by Coda with respect to access control.

Operation | Description
Read | Read any file in the directory
Write | Modify any file in the directory
Lookup | Look up the status of any file
Insert | Add a new file to the directory
Delete | Delete an existing file
Administer | Modify the ACL of the directory
Plan 9: Resources Unified to Files
General organization of Plan 9.
Communication
Files associated with a single TCP connection in Plan 9.

File | Description
ctl | Used to write protocol-specific control commands
data | Used to read and write data
listen | Used to accept incoming connection setup requests
local | Provides information on the caller's side of the connection
remote | Provides information on the other side of the connection
status | Provides diagnostic information on the current status of the connection
Processes
The Plan 9 file server. WORM: Write-Once Read-Many (IBM, 1980).
Naming
A union directory in Plan 9: multiple mounts to one point.
Overview of xFS
A typical distribution of xFS processes across multiple machines.
Processes (1)
The principle of log-based striping in xFS.
Processes (2)
Reading a block of data in xFS.
Naming
Main data structures used in xFS.

Data structure | Description
Manager map | Maps file ID to manager
Imap | Maps file ID to log address of file's inode
Inode | Maps block number (i.e., offset) to log address of block
File identifier | Reference used to index into manager map
File directory | Maps a file name to a file identifier
Log addresses | Triplet of stripe group ID, segment ID, and segment offset
Stripe group map | Maps stripe group ID to list of storage servers
Overview of SFS
The organization of SFS. Design goal: scalable security.
Naming
A self-certifying pathname in SFS, of the form /sfs/LOC:HID/Pathname:
/sfs/sfs.vu.sc.nl:ag62hty4wior450hdh63u623i4f0kqere/home/steen/mbox
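Splitting such a pathname into its parts can be sketched with a toy parser (the LOC and HID field names follow the slide above; the function is invented for illustration):

```python
# Toy parser for an SFS self-certifying pathname: /sfs/LOC:HID/Pathname.
# LOC is the server's location (DNS name); HID is the hash of its public key.
def parse_sfs_path(path):
    assert path.startswith("/sfs/")
    rest = path[len("/sfs/"):]
    loc_hid, _, subpath = rest.partition("/")   # split off the server part
    loc, _, hid = loc_hid.partition(":")        # split location from host ID
    return loc, hid, "/" + subpath

loc, hid, sub = parse_sfs_path(
    "/sfs/sfs.vu.sc.nl:ag62hty4wior450hdh63u623i4f0kqere/home/steen/mbox")
```

Because HID is derived from the server's public key, the name itself certifies the server: no separate trusted directory of keys is needed.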
Summary
A comparison between NFS, Coda, Plan 9, xFS, and SFS. N/S indicates that nothing has been specified.

Issue | NFS | Coda | Plan 9 | xFS | SFS
Design goals | Access transparency | High availability | Uniformity | Serverless system | Scalable security
Access model | Remote | Up/Download | Remote | Log-based | Remote
Communication | RPC | RPC | Special | Active msgs | RPC
Client process | Thin/Fat | Fat | Thin | Fat | Medium
Server groups | No | Yes | No | Yes | No
Mount granularity | Directory | File system | File system | File system | Directory
Name space | Per client | Global | Per process | Global | Global
File ID scope | File server | Global | Server | Global | File system
Sharing sem. | Session | Transactional | UNIX | UNIX | N/S
Cache consist. | Write-back | Write-back | Write-through | Write-back | Write-back
Replication | Minimal | ROWA | None | Striping | None
Fault tolerance | Reliable comm. | Replication and caching | Reliable comm. | Striping | Reliable comm.
Recovery | Client-based | Reintegration | N/S | Checkpoint & write logs | N/S
Secure channels | Existing mechanisms | Needham-Schroeder | Needham-Schroeder | None | Self-certifying pathnames
Access control | Many operations | Directory operations | UNIX-based | UNIX-based | NFS-based
GFS Characteristics
1. First, component failures are the norm rather than the exception. Problems are caused by application bugs, operating system bugs, human errors, and failures of disks, memory, connectors, networking, and power supplies.
2. Second, files are huge by traditional standards; multi-GB files are common.
3. Third, most files are mutated by appending new data rather than overwriting existing data.
4. Fourth, co-designing the applications and the file system API benefits the overall system by increasing flexibility: GFS's consistency model was relaxed, and an atomic append operation was introduced.
DESIGN OVERVIEW
Assumptions:
1. The system is built from many inexpensive commodity components that often fail.
2. The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size.
3. The workloads primarily consist of two kinds of reads: large streaming reads and small random reads.
4. The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads.
5. The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file.
6. High sustained bandwidth is more important than low latency.
Interface
Files are organized hierarchically in directories and identified by pathnames. GFS supports the usual operations to create, delete, open, close, read, and write files. Moreover, GFS has snapshot and record append operations.
Architecture
Architecture (desc.)
A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range.
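How a client turns a file byte offset into a chunk request can be sketched as follows (the 64 MB chunk size is the one GFS uses; the helper function is invented for illustration):

```python
# Toy sketch of the client-side offset-to-chunk translation in a GFS-like
# design. The client computes the chunk index, asks the master for that
# chunk's handle and replica locations, then reads the byte range directly
# from a chunkserver.
CHUNK_SIZE = 64 * 1024 * 1024    # GFS uses 64 MB chunks

def chunk_request(byte_offset):
    chunk_index = byte_offset // CHUNK_SIZE      # which chunk of the file
    offset_in_chunk = byte_offset % CHUNK_SIZE   # where inside that chunk
    return chunk_index, offset_in_chunk
```

This translation is why the master stays off the data path: it is consulted only for the (file, chunk index) -> chunk handle mapping, which clients can cache.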
Architecture (desc., contd.)
The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers.
GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application.
Neither the client nor the chunkserver caches file data.
Architecture (desc., contd.)
Single master: why? Chunk size: how much? A large chunk size has advantages:
1. First, it reduces clients' need to interact with the master because ...
2. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, ...
3. Third, it reduces the size of the metadata stored on the master.
Disadvantages:
1. Internal fragmentation (solution: lazy space allocation)
2. Hot spots for small files
Metadata
The master stores three major types of metadata:
1. The file and chunk namespaces
2. The mapping from files to chunks
3. The locations of each chunk's replicas
a) In-Memory Data Structures
b) Chunk Locations
c) Operation Log
Consistency Model
1. Guarantees by GFS
2. Implications for Applications: relying on appends rather than overwrites, checkpointing, writing self-validating, self-identifying records
SYSTEM INTERACTIONS
Minimize the master's involvement in all operations: Leases and Mutation Order; Data Flow; Atomic Record Appends; Snapshot.
MASTER OPERATIONS
Replica Placement: maximize data reliability and availability; maximize network bandwidth utilization. Creation, Re-replication, Rebalancing. Garbage Collection. Stale Replica Detection.
FAULT TOLERANCE AND DIAGNOSIS
High Availability: Fast Recovery, Master Replication. Data Integrity. Diagnostic Tools.