Distributed Systems:
General Services
Overview of chapters
• Introduction
• Co-ordination models and languages
• General services
  – Ch 11 Time Services, 11.1-11.3
  – Ch 9 Name Services
  – Ch 8 Distributed File Systems
  – Ch 10 Peer-to-peer Systems
• Distributed algorithms
• Shared data
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Introduction
• Objectives
  – Study existing subsystems
    • Services offered
    • Architecture
    • Limitations
  – Learn about non-functional requirements
    • Techniques used
    • Results
    • Limitations
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Time services: Overview
• What & Why
• Clock synchronization
• Logical clocks (see chapter on algorithms)
Time services
• Definition of time
  – International Atomic Time (TAI)
    A second is defined as 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of Caesium-133
  – Astronomical time
    Based on the rotation of the earth on its axis and its orbit around the sun. Due to tidal friction, the earth's rotation period is gradually getting longer
  – Co-ordinated Universal Time (UTC)
    International standard based on atomic time; a leap second is occasionally inserted or deleted to keep in step with astronomical time
Time services (cont.)
• Why is time important?
  – Accounting (e.g. logins, files, …)
  – Performance measurement
  – Resource management
  – Some algorithms depend on it:
    • Transactions using timestamps
    • Authentication (e.g. Kerberos)
    • Example: make
Time services (cont.)
• An example: the Unix tool "make"
  – large program:
    • multiple source files: graphics.c
    • multiple object files: graphics.o
    • change of one file → recompilation of dependent files
  – use of make to handle changes:
    • change of graphics.c at 3h45
    • make examines creation time of graphics.o
    • e.g. 3h43: recompiles graphics.c to create an up-to-date graphics.o
Time services (cont.)
• Make in a distributed environment:
  – source file on system A
  – object file on system B
  [Figure: clock B runs ahead of clock A. graphics.o is created at 3h47 on B's clock; graphics.c is then modified at 3h43 on A's clock. The later modification carries the smaller (or an equal) timestamp, so "make graphics.o" has no effect!]
Time services (cont.)
• Hardware clocks
  – each computer has a physical clock
    • H(t): value of hardware clock/counter
    • different clocks → different values
    • e.g. crystal-based clocks drift due to
      – different shapes, temperatures, …
    • typical drift rates:
      – 1 µsec/sec → 1 sec in 11.6 days
      – high-precision quartz clocks: 0.1-0.01 µsec/sec
• Software clocks
  – C(t) = α H(t) + β
  – e.g. number of nanoseconds elapsed since a reference time
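A minimal sketch of the C(t) = αH(t) + β relation; time.monotonic_ns() stands in for the hardware counter H(t), and α, β are the adjustable parameters:

```python
import time

class SoftwareClock:
    """Software clock C(t) = alpha * H(t) + beta (sketch)."""
    def __init__(self, alpha=1.0, beta=0.0):
        self.alpha = alpha   # scale factor: compensates a constant drift rate
        self.beta = beta     # offset: aligns the clock with a reference time

    def read(self):
        # time.monotonic_ns() plays the role of the hardware counter H(t)
        return self.alpha * time.monotonic_ns() + self.beta
```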
Time services (cont.)
• Changing local time
  – Hardware: influence H(t)
  – Software: change α and/or β
  – Forward:
    • set the clock forward
    • some clock ticks appear to have been missed
  – Backward:
    • set the clock backward? unacceptable
    • let the clock run slower for a short period of time
Time services (cont.)
• Properties
  – External synchronization to a UTC source S:
    | S(t) − Ci(t) | < D
  – Internal synchronization:
    | Ci(t) − Cj(t) | < D
  – Correctness:
    drift rate bounded by ρ
  – Monotonicity:
    t' > t ⇒ C(t') > C(t)
Time services: Overview
• What & Why
• Synchronising physical clocks
  – Synchronising with UTC
  – Cristian's algorithm
  – The Berkeley algorithm
  – Network Time Protocol
• Logical clocks (see chapter on algorithms)
Time services: synchronising with UTC
• Broadcast by radio stations
  – WWV (USA) (accuracy 0.1 - 10 msec)
• Broadcast by satellites
  – GEOS (0.1 msec)
  – GPS (1 msec)
• Receivers attached to workstations
Time services: Cristian's algorithm
• assumption: a time server process Ts on a computer with a UTC receiver
• process P: request/reply with the time server
  [Figure: P sends a T-request to Ts; Ts returns a reply containing its current time t]
Time services: Cristian's algorithm
• how to set the clock at P?
  – time t inserted in the reply by Ts
  – time for P: t + Ttrans
  – Ttrans = min + x
    • min = minimum transmission time
    • x: unknown; depends on
      – network delay
      – process scheduling
Time services: Cristian's algorithm
• In a synchronous system
  – upper bound on transmission delay: max
  – P sets its time to t + (min + max) / 2
  – accuracy: ±(max − min) / 2
• In an asynchronous system?
Time services: Cristian's algorithm
• practical approach
  – P measures the round-trip time: Tround
  – P sets its time to t + Tround/2
  – accuracy?
    • the time at Ts when the reply arrives at P is in the range [t + min, t + Tround − min]
    • accuracy: ±(Tround/2 − min)
  – problems:
    • single point of failure
    • impostor time server
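A runnable sketch of the client side; request_time() stands in for the RPC to Ts (here it simply reads the local UTC clock so the sketch executes):

```python
import time

def request_time():
    # Stand-in for the T-request RPC to the time server Ts.
    return time.time()

def cristian_sync(min_delay=0.0):
    t0 = time.monotonic()               # send T-request
    t = request_time()                  # server time t from the reply
    t_round = time.monotonic() - t0     # measured round-trip time Tround
    estimate = t + t_round / 2          # P's new clock value
    accuracy = t_round / 2 - min_delay  # +/- bound on the error
    return estimate, accuracy
```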
Time services: the Berkeley algorithm
• used on computers running Berkeley Unix
• an active master is elected among the computers whose clocks have to be synchronised
  – no central time server
  – upon failure of the master: re-election
Time services: the Berkeley algorithm
• algorithm of the master:
  – periodically polls the clock values of the other computers
  – estimates each local clock time (based on round-trip delays)
  – averages the obtained values
  – returns the amount by which each individual slave clock needs adjustment
  – a fault-tolerant average (outlying values excluded) may be used
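A sketch of one round at the master; poll_clock(node) is an assumed RPC returning a (round-trip time, reported clock) pair, and the tolerance bound is illustrative:

```python
def berkeley_round(master_clock, slaves, poll_clock, tolerance=1.0):
    """One synchronisation round of the Berkeley algorithm (sketch)."""
    estimates = {"master": master_clock}
    for node in slaves:
        t_round, reported = poll_clock(node)
        estimates[node] = reported + t_round / 2     # compensate for transit
    # Fault-tolerant average: drop values too far from the median.
    values = sorted(estimates.values())
    median = values[len(values) // 2]
    kept = [v for v in values if abs(v - median) <= tolerance]
    average = sum(kept) / len(kept)
    # Each node receives the adjustment to apply to its own clock.
    return {node: average - est for node, est in estimates.items()}
```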
Time services: Network Time Protocol
• standard for the Internet
• design aims and features:
  – accurate synchronization despite variable message delays
    • statistical techniques to filter timing data from different servers
  – a reliable service that can survive losses of connectivity
    • redundant servers
    • hierarchy of servers in synchronised subnets
Time services: Network Time Protocol
[Figure: hierarchy of NTP servers. A server's level is called its stratum; stratum-1 servers have a UTC receiver; each stratum synchronises with the stratum above it.]
Time services: Network Time Protocol
• three working modes for NTP servers
  – multicast mode
    • used on high-speed LANs
    • slave sets its clock assuming a small delay
  – procedure-call mode
    • similar to Cristian's algorithm
  – symmetric mode
    • used by master servers and the higher strata
    • an association is maintained between the servers
Time services: Network Time Protocol
• procedure-call and symmetric mode
  – each message bears timestamps of recent message events
  [Figure: message m from server A to server B leaves A at Ti−3 and arrives at B at Ti−2; reply m' leaves B at Ti−1 and arrives at A at Ti.]
Time services: Network Time Protocol
• compute approximations of offset and delay
  – o: true offset of B's clock relative to A's clock
  – t, t': actual transmission times of m and m'
  – the four timestamps satisfy:
    Ti−2 = Ti−3 + t + o
    Ti = Ti−1 + t' − o
  – hence:
    a = Ti−2 − Ti−3 = t + o
    b = Ti−1 − Ti = o − t'
  – estimates:
    di = a − b = t + t' (the total transmission time, i.e. the delay)
    oi = (a + b) / 2
    oi − di/2 ≤ o ≤ oi + di/2
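These estimates follow directly from the four timestamps; a minimal runnable sketch:

```python
def ntp_estimates(t3, t2, t1, t0):
    """Offset/delay estimates from one NTP exchange.

    t3 = Ti-3 (m sent by A),  t2 = Ti-2 (m received by B),
    t1 = Ti-1 (m' sent by B), t0 = Ti   (m' received by A).
    """
    a = t2 - t3             # = t + o
    b = t1 - t0             # = o - t'
    delay = a - b           # di = t + t', total transmission time
    offset = (a + b) / 2    # oi, estimate of the true offset o
    return offset, delay    # true o lies in [oi - di/2, oi + di/2]
```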
Time services: Network Time Protocol
• procedure-call and symmetric mode
  – data filtering applied to successive ⟨oi, di⟩ pairs
  – NTP has a complex peer-selection algorithm; favoured are peers with:
    • a lower stratum number
    • a lower synchronisation dispersion
  – accuracy:
    • 10 msec on the Internet
    • 1 msec on a LAN
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Name services
• Overview
  – Introduction
  – Basic service
  – Case study: Domain Name System
  – Directory services
Name services: introduction
• Types of name:
  – users
  – files, devices
  – service names (e.g. RPC)
  – port, process, group identifiers
• Attributes associated with names:
  – email address, login name, password
  – disk, block number
  – network address of server
  – …
Name services: introduction
• Goal:
  – binding between names and attributes
• History
  – originally:
    • quite a simple problem
    • scope: a single LAN
    • naïve solution: all names + attributes in a single file
  – now:
    • interconnected networks
    • different administrative domains in a single name space
Name services: introduction
• naming service separate from other services:
  – unification
    • general naming scheme for different objects
    • e.g. local + remote files in the same scheme
    • e.g. files + devices + pipes in the same scheme
    • e.g. URLs
  – integration
    • scope of sharing is difficult to predict
    • merge different administrative domains
    • requires an extensible name space
Name services: introduction
• General requirements:
– scalable
– long lifetime
– high availability
– fault isolation
– tolerance of mistrust
Name services
• Overview
  – Introduction
  – Basic service
  – Case study: Domain Name System
  – Directory services
Name services: basic service
• Operations:
  – lookup (name, context) → identifier
  – register (name, context, identifier)
  – delete (name, context)
• From a single server to multiple servers:
  – hierarchical names
  – efficiency
  – availability
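A minimal single-server sketch of these three operations, backed by a plain dictionary:

```python
class NameServer:
    """Single-server sketch of the basic name service operations."""
    def __init__(self):
        self.bindings = {}                      # (context, name) -> identifier

    def register(self, name, context, identifier):
        self.bindings[(context, name)] = identifier

    def lookup(self, name, context):
        return self.bindings[(context, name)]   # raises KeyError if unbound

    def delete(self, name, context):
        del self.bindings[(context, name)]
```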
Name services: basic service
• Hierarchical name space
  – introduction of directories or subdomains
    • /users/pv/edu/distri/services.ppt
    • cs.kuleuven.ac.be
  – easier administration
    • delegation of authority
    • fewer name conflicts: names are unique per directory
    • unit of distribution: a server per directory + a single root server
Name services: basic service
• Hierarchical name space
  – navigation
    • the process of locating naming data from more than one server
    • iterative or recursive
  – result of a lookup:
    • attributes, or
    • a reference to another name server
Name services: basic service
[Figure: iterative navigation, step 1. A directory tree whose root contains "users" and "src"; "users" contains "sys", "pv" and "ann". Process P sends lookup("/users/pv/edu/…", …) to the root server, which returns a reference to the server for "users".]
Name services: basic service
[Figure: iterative navigation, step 2. P sends lookup("pv/edu/…", …) to the "users" server, which returns a reference to the server for "pv".]
Name services: basic service
• Efficiency → caching
  – the hierarchy introduces multiple servers
    → more communication overhead, more processing capacity required
  – caching at the client side:
    • <name, attributes> pairs are cached
    • fewer requests → higher lookup performance
    • inconsistency?
      – data is rarely changed
      – limit on validity
Name services: basic service
• Availability → replication
  – at least two failure-independent servers
  – primary server: accepts updates
  – secondary server(s):
    • get a copy of the data of the primary
    • periodically check for updates at the primary
  – level of consistency?
Name services
• Overview
  – Introduction
  – Basic service
  – Case study: Domain Name System
  – Directory services
Name services: Domain Name System
• DNS = the Internet name service
• originally: an ftp-able file
  – not scalable
  – centralised administration
• now a general name service for computers and domains
  – partitioning
  – delegation
  – replication and caching
Name services: Domain Name System
• Domain name → hierarchy
  nix.cs.kuleuven.ac.be
  – be: country (Belgium)
  – ac: academic institutions
  – kuleuven: K.U.Leuven
  – cs: dept. of computer science
  – nix: name of the computer system
Name services: Domain Name System
• 3 kinds of Top-Level Domains (TLDs)
  – 2-letter country codes (ISO 3166)
  – generic names (similar organisations)
    • com: commercial organisations
    • org: non-profit organisations
    • int: international organisations (NATO, EU, …)
    • net: network providers
  – USA-oriented names
    • edu: universities
    • gov: American government
    • mil: American military
  – new generic names
    • biz, info, name, aero, museum, …
Name services: Domain Name System
• For each TLD:
  – an administrator (assigns names within the domain)
  – "be": DNS BE vzw (previously: the dept. of computer science)
• Each organisation with a name:
  – responsible for new (sub)names
  – e.g. cs.kuleuven.ac.be
• Hierarchical naming + delegation → a workable structure
Name services: Domain Name System
• Name servers
  – root name servers
  – a server per (group of) subdomains
  + replication → high availability
  + caching → acceptable performance
    – time-to-live to limit inconsistencies
Name services: Domain Name System
• Name server cs.kuleuven.ac.be

  System/subdomain   Type   IP address
  nix                A      134.58.42.36
  idefix             A      134.58.41.7
  droopy             A      134.58.41.10
  stevin             A      134.58.41.16
  …

  A = address record
Name services: Domain Name System
• Name server kuleuven.ac.be

  Machine/subdomain   Type   IP address
  cs                  NS     134.58.39.1
  esat                NS     …
  www                 A      …
  …

  NS = name server record
Name services: Domain Name System
• Example: resolving www.cs.vu.nl
  – the local NS (cs.kuleuven.ac.be) queries the root NS
    → referral to NS (nl)
    → referral to NS (vu.nl)
    → referral to NS (cs.vu.nl)
    → answer: 130.37.24.11
  – the local NS returns 130.37.24.11 to the client
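A sketch of this iterative resolution loop; query(server, name) is an assumed helper that returns either ("answer", ip_address) or ("referral", next_server):

```python
def resolve_iteratively(name, root_server, query):
    """Iterative DNS-style resolution (sketch)."""
    server = root_server
    while True:
        kind, result = query(server, name)
        if kind == "answer":
            return result      # e.g. "130.37.24.11"
        server = result        # follow the referral down the hierarchy
```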
Name services: Domain Name System
• Extensions:
  – mail host location
    • used by electronic mail software
    • requests where mail for a domain should be delivered
  – reverse resolution (IP address → domain name)
  – host information
• Weak points:
  – security
Name services: Domain Name System
• Good system design:
  – partitioning of data
    • multiple servers
  – replication of servers
    • high availability
    • limited inconsistencies
    • NO load balancing; NO server preference
  – caching
    • acceptable performance
    • limited inconsistencies
Name services
• Overview
  – Introduction
  – Basic service
  – Case study: Domain Name System
  – Directory services
Name services: directory services
• Name service:
  exact name → attributes
• Directory service:
  some attributes of an object → all attributes
• Example:
  – request: e-mail address of Janssen at KULeuven
  – result: list of names of all persons called Janssen
  – select a person and read the attributes
Name services: directory services
• Directory information base
  – list of entries
  – each entry:
    • a set of type/list-of-values pairs
    • optional and mandatory pairs
    • some pairs determine the name
  – distributed implementation
Name services: directory services
• X.500
  – CCITT standard
• LDAP
  – Lightweight Directory Access Protocol
  – Internet standard
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Distributed file systems
• Overview
  – File service architecture
  – Case study: NFS
  – Case study: AFS
  – Comparison: NFS vs. AFS
Distributed file systems: file service architecture
• Definitions:
  – file
  – directory
  – file system
  (cf. operating systems)
Distributed file systems: file service architecture
• Requirements:
  – addressed by most file systems:
    • access transparency
    • location transparency
    • failure transparency
    • performance transparency
  – related to distribution:
    • concurrent updates
    • hardware and operating system heterogeneity
    • scalability
Distributed file systems: file service architecture
• Requirements (cont.):
  – scalable to a very large number of nodes
    • replication transparency
    • migration transparency
  – future:
    • support for fine-grained distribution of data
    • tolerance to network partitioning
Distributed file systems: file service architecture
• File service components
  [Figure: user programs call a client module (API) on the client computer; across the network, the client module talks to the directory service (names → UFIDs) and to the flat file service (operations on UFIDs).]
Distributed file systems: file service architecture
• Flat file service
  – file = data + attributes
  – data = sequence of items
  – operations: simple read & write
  – attributes: split between flat file & directory service
  – UFID: unique file identifier
Distributed file systems: file service architecture
• Flat file service: operations
  – Read (FileID, Position, n) → Data
  – Write (FileID, Position, Data)
  – Create () → FileID
  – Delete (FileID)
  – GetAttributes (FileID) → Attr
  – SetAttributes (FileID, Attr)
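An in-memory sketch of this interface; the operations are idempotent because every read and write carries an absolute position (the uuid-based UFID is an illustrative stand-in):

```python
import uuid

class FlatFileService:
    """In-memory sketch of the flat file service (data operations only)."""
    def __init__(self):
        self.files = {}                       # UFID -> bytearray

    def create(self):
        ufid = uuid.uuid4().hex               # stand-in for a real UFID
        self.files[ufid] = bytearray()
        return ufid

    def read(self, ufid, position, n):
        return bytes(self.files[ufid][position:position + n])

    def write(self, ufid, position, data):
        f = self.files[ufid]
        if len(f) < position + len(data):     # grow the file if needed
            f.extend(b"\0" * (position + len(data) - len(f)))
        f[position:position + len(data)] = data

    def delete(self, ufid):
        del self.files[ufid]
```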
Distributed file systems: file service architecture
• Flat file service: fault tolerance
  – straightforward for simple servers
  – idempotent operations
  – stateless servers
Distributed file systems: file service architecture
• Directory service
  – translates a file name into a UFID
  – substitute for open
  – responsible for access control
Distributed file systems: file service architecture
• Directory service: operations
  – Lookup (Dir, Name, AccessMode, UserID) → UFID
  – AddName (Dir, Name, FileID, UserID)
  – UnName (Dir, Name)
  – ReName (Dir, OldName, NewName)
  – GetNames (Dir, Pattern) → list of names
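How the two services combine to emulate a conventional open-then-read; a sketch only, assuming a directory service with the Lookup signature above and the flat file sketch shown earlier:

```python
def read_by_name(dir_service, file_service, dir_id, name, user):
    """Emulate a conventional open + read with the two services (sketch)."""
    ufid = dir_service.lookup(dir_id, name, "r", user)   # name -> UFID
    return file_service.read(ufid, 0, 4096)              # first 4 KB
```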
Distributed file systems: file service architecture
• Implementation techniques:
  – known techniques from OS experience
  – remain important
  – a distributed file service should be comparable to a local one in
    • performance
    • reliability
Distributed file systems: file service architecture
• Implementation techniques: overview
  – file groups
  – space leaks
  – capabilities and access control
  – access modes
  – file representation
  – file location
  – group location
  – file addressing
  – caching
Distributed file systems: file service architecture
• Implementation techniques: file groups
  – (similar: file system, partition)
  – = a collection of files mounted on a server computer
  – unit of distribution over servers
  – transparent migration of file groups
  – once created, a file remains in its file group
  – UFID = file group identifier + file identifier
Distributed file systems: file service architecture
• Implementation techniques: space leaks
  – 2 steps to create a file:
    • create the (empty) file and get a new UFID
    • enter name + UFID in a directory
  – failure after step 1:
    • the file exists in the file server
    • but is unreachable: its UFID is not in any directory
    → lost space on disk
  – detection requires co-operation between
    • file server
    • directory server
Distributed file systems: file service architecture
• Implementation techniques: capabilities
  – = a digital key: access to a resource is granted on presentation of the capability
  – request to directory server: file name + user id + mode of access
    → UFID including the permitted access modes
  – construction of the UFID:
    • unique
    • encodes access rights
    • unforgeable
Distributed file systems: file service architecture
• Implementation techniques: capabilities
  [Figure: successive UFID designs and their weaknesses:
   1. file group id + file nr → reuse?
   2. file group id + file nr + random nr → access check?
   3. file group id + file nr + random nr + access bits → forgeable?
   4. file group id + file nr + access bits + encrypted(access bits + random nr)]
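A sketch of the last design, substituting an HMAC for the encryption named on the slide (a common way to realise the "encrypted (access bits + random number)" field); the per-file random number is stored at the server:

```python
import hashlib, hmac, os

SERVER_KEY = os.urandom(32)    # secret known only to the file server

def make_ufid(group_id, file_nr, access_bits, random_nr):
    """Unforgeable UFID (sketch): a keyed digest covers the access bits."""
    public = f"{group_id}:{file_nr}:{access_bits}"
    tag = hmac.new(SERVER_KEY, f"{public}:{random_nr}".encode(),
                   hashlib.sha256).hexdigest()
    return (public, tag)

def check_ufid(ufid, random_nr):
    """Server-side check; random_nr is kept with the file at the server."""
    public, tag = ufid
    expected = hmac.new(SERVER_KEY, f"{public}:{random_nr}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)
```

Tampering with the access bits invalidates the tag, and the random number prevents a client from reconstructing a capability after the file number is reused.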
Distributed file systems: file service architecture
• Implementation techniques: file location
  – from UFID → location of the file server
  – use of a replicated group location database: file group id → PortId
  – why replication?
  – why is the location not encoded in the UFID?
Distributed file systems: file service architecture
• Implementation techniques: caching
  – server cache: reduce delay for disk I/O
    • selecting blocks for release
    • coherence:
      – dirty flags
      – write-through caching
  – client cache: reduce network delay
    • always use write-through
    • synchronisation problems with multiple caches
Distributed file systems
• Overview
  – File service model
  – Case study: Network File System (NFS)
  – Case study: AFS
  – Comparison: NFS vs. AFS
Distributed file systems: NFS
• Background and aims
  – first file service product
  – emulates the UNIX file system interface
  – de facto standard
    • key interfaces placed in the public domain
    • source code available for a reference implementation
  – supports diskless workstations
    • no longer important
Distributed file systems: NFS
• Design characteristics
  – client and server modules can be in any node; large installations include a few dedicated servers
  – clients:
    • on Unix: emulation of the standard UNIX file system
    • also for MS-DOS, Windows, Apple, …
  – integrated file and directory service
  – integration of remote file systems into the local one: mount → remote mount
Distributed file systems: NFS configuration
• Unix mount system call
  – each disk partition contains a hierarchical file system
  – how to integrate them?
    • name the partitions: a:/usr/students/john
    • glue the partitions together (mount)
      – invisible for the user
      – partitions remain useful for system managers
Distributed file systems: NFS configuration
• Unix mount system call
  [Figure: the root partition (/, containing vmunix, usr, …; usr contains students and staff) with partition c (containing pierre, ann, …; ann contains network and proje) mounted at directory staff: staff becomes the root of c, so ann's files appear as /usr/staff/ann/network.]
Distributed file systems: NFS configuration
• Remote mount
  [Figure: on the client, the root partition contains /usr/students and /usr/staff; Server 1 exports a users directory with subdirectories ann and pierre. The server's users directory is remote-mounted at the client's staff directory, so the files appear as /usr/staff/ann/….]
Distributed file systems: NFS configuration
• Mount service on the server
  – enables clients to integrate (part of) a remote file system into the local name space
  – exported file systems listed in /etc/exports + an access list (= hosts permitted to mount; secure?)
• On the client side
  – file systems to mount are enumerated in /etc/rc
  – typically mounted at start-up time
Distributed file systems: NFS configuration
• Mounting semantics
  – hard:
    • the client retries until the request for a remote file succeeds
    • possibly forever
  – soft:
    • a failure is returned if the request does not succeed after n retries
    • breaks Unix failure semantics
Distributed file systems: NFS configuration
• Automounter
  – principle:
    • empty mount points on clients
    • mount on the first request for a remote file
  – acts as a server for a local client:
    • gets references to empty mount points
    • maps mount points to remote file systems
    • the referenced file system is mounted on the mount point via a symbolic link, to avoid redundant requests to the automounter
Distributed file systems: NFS implementation
• In UNIX: client and server modules implemented in the kernel
• virtual file system:
  – internal key interface, based on file handles for remote files
Distributed file systems: NFS implementation
[Figure: on the client computer, a user process issues system calls into the UNIX kernel; the virtual file system dispatches each call either to the local UNIX file system or to the NFS client module, which talks over the network (NFS protocol) to the NFS server module in the server's kernel, where the virtual file system hands the call to the server's UNIX file system.]
Distributed file systems: NFS implementation
• Virtual file system
  – added to the UNIX kernel to distinguish between
    • local files
    • remote files
  – file handles: the file IDs used in NFS
    • base: inode number (the file ID within a partition on a UNIX system)
    • extended with:
      – a file system identifier
      – an inode generation number (to enable reuse of inodes)
Distributed file systems: NFS implementation
• Client integration:
  – NFS client module integrated in the kernel:
    • offers the standard UNIX interface
    • no client recompilation/reloading
    • a single client module for all user-level processes
    • encryption can be done in the kernel
• Server integration:
  – in the kernel only for performance reasons
  – a user-level server achieves about 80% of the kernel-level performance
Distributed file systems: NFS implementation
• Directory service
  – name resolution co-ordinated by the client
  – step-by-step process for multi-part file names
  – mapping tables in the server: high overhead, reduced by caching
• Access control and authentication
  – based on UNIX user ID and group ID
  – included and checked for every NFS request
  – secure NFS 4.0 thanks to the use of DES encryption
Distributed file systems: NFS implementation
• Caching
  – Unix caching:
    • based on disk blocks
    • delayed write (why?)
    • read-ahead (why?)
    • periodic sync to flush dirty buffers in the cache
  – caching in NFS:
    • server caching
    • client caching
Distributed file systems: NFS implementation
• Server caching in NFS
  – based on standard UNIX caching; 2 modes:
  – write-through (instead of delayed write)
    • good failure semantics
    • poorer performance
  – delayed write
    • data is stored in the cache until a commit operation is received
      – close on client → commit operation on server
    • failure semantics?
    • performance?
Distributed file systems: NFS implementation
• Client caching
  – cached are the results of
    • read, write, getattr, lookup, readdir
  – problem: multiple copies of the same data at different NFS clients
  – NFS clients use read-ahead and delayed write
Distributed file systems: NFS implementation
• Client caching (cont.)
  – handling of writes:
    • a block of the file is fetched and updated
    • the changed block is marked dirty
    • dirty pages of files are flushed to the server asynchronously:
      – on close of the file
      – on a sync operation on the client
      – by a bio-daemon (when a block is filled)
    • dirty pages of directories are flushed to the server:
      – by the bio-daemon without further delay
Distributed file systems: NFS implementation
• Client caching (cont.)
  – consistency checks:
    • based on timestamps indicating the last modification of the file on the server
    • validation checks:
      – when the file is opened
      – when a new block is fetched
      – assumed to remain valid for a fixed time (3 sec for a file, 30 sec for a directory)
      – the next operation causes another check
    • a costly procedure
Distributed file systems: NFS implementation
• Caching
  – a cached entry is valid if
    (T − Tc) < t  or  (Tm_client = Tm_server)
    – T: current time
    – Tc: time when the cache entry was last validated
    – t: freshness interval (3 .. 30 secs in Solaris)
    – Tm: time when the block was last modified at client/server
  – consistency level:
    • acceptable
    • most UNIX applications do not depend critically on the synchronisation of file updates
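A sketch of this validity test; get_server_mtime stands in for the getattr RPC that returns Tm_server:

```python
import time
from dataclasses import dataclass

@dataclass
class CacheEntry:
    validated: float   # Tc: when the entry was last validated
    mtime: float       # Tm_client: last modification time seen by the client

def entry_valid(entry, freshness, get_server_mtime):
    """NFS-style cache freshness check (sketch)."""
    if time.time() - entry.validated < freshness:   # (T - Tc) < t
        return True
    if entry.mtime == get_server_mtime():           # Tm_client == Tm_server
        entry.validated = time.time()               # revalidated
        return True
    return False
```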
Distributed file systems: NFS implementation
• Performance
  – reasonable performance
    • remote files on a fast disk can beat local files on a slow disk
    • RPC packets are 9 KB, to contain 8 KB disk blocks
  – the lookup operation accounts for about 50% of server calls
  – drawbacks:
    • frequent getattr calls for timestamps (cache validation)
    • poor performance of (relatively infrequent) writes (because of write-through)
Distributed file systems
• Overview
  – File service model
  – Case study: NFS
  – Case study: Andrew File System (AFS)
  – Comparison: NFS vs. AFS
Distributed file systems: AFS
• Background and aims
  – base: observations of UNIX file systems:
    • files are small (< 10 KB)
    • read is more common than write
    • sequential access is common, random access is not
    • most files are not shared
    • shared files are often modified by one user
    • file references come in bursts
  – aim: combine the best of personal computers and time-sharing systems
Distributed file systems: AFS
• Background and aims (cont.)
  – assumptions about the environment:
    • secured file servers
    • public workstations
    • workstations with a local disk
    • no private files on the local disk
  – key targets: scalability and security
    • CMU 1991: 800 workstations, 40 servers
Distributed file systems: AFS
• Design characteristics
  – whole-file serving and whole-file caching: entire files are transmitted, not blocks
  – client cache realised on the (relatively large) local disk
    → lower number of open requests on the network
  – separation between file and directory service
Distributed file systems: AFS
• Configuration
  – single (global) name space
  – local files:
    • temporary files
    • system files for start-up
  – volume = unit of configuration and management
  – each server maintains a replicated location database (volume → server mappings)
Distributed file systems: AFS
• File name space
  [Figure: the workstation's local root (/) contains bin, tmp, afs, …; the shared file system, rooted at the directory afs, contains users, bin, …, with users containing ann and pierre. Symbolic links make selected shared files appear in the local space.]
Distributed file systems: AFS
• Implementation
  – Vice = the file server software
    • runs on secured systems, controlled by system management
    • understands only file identifiers
    • runs in user space
  – Venus = the user/client software
    • runs on the workstations
    • workstations keep their autonomy
    • implements the directory service
  – kernel modifications for open and close
Distributed file systems: AFS
[Figure: several workstations, each running user programs and a Venus process on top of the Unix kernel, communicate over the network with servers running Vice processes on top of their Unix kernels.]
Distributed file systems: AFS
[Figure: within a workstation, Unix file system calls from a user program are intercepted by the kernel; non-local file operations are passed to Venus, local ones to the Unix file system.]
Distributed file systems: AFS
• Implementation of file system calls
  – open, close: involve
    • the UNIX kernel of the workstation
    • Venus (on the workstation)
    • Vice (on the server)
  – read, write: handled entirely by
    • the UNIX kernel of the workstation
Distributed file systems: AFS
• Open system call: open(FileName, …)
  – kernel: if FileName is in the shared space /afs, pass the request to Venus
  – venus: if the file is not present in the local cache, OR present but with an invalid callback, pass the request to the Vice server
  – vice (via the network): transfer a copy of the file and a valid callback promise
  – venus: place the copy of the file in the local file system and record the FileName
  – kernel: open the local file and return the descriptor to the user process
Distributed file systems: AFS
• Read/write system calls: read(FileDescriptor, …), write(FileDescriptor, …)
  – kernel: perform a normal UNIX read/write operation on the local copy of the file
Distributed file systems: AFS
• Close system call: close(FileDescriptor, …)
  – kernel: close the local copy and inform Venus about the close
  – venus: if the local copy has changed, send the copy to the Vice server (via the network)
  – vice: store the copy of the file and send a callback cancellation to the other clients holding a callback promise
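A sketch of Venus's side of this open/close protocol; fetch and store stand in for the RPCs to a Vice server:

```python
class Venus:
    """Whole-file caching with callback promises (sketch)."""
    def __init__(self, fetch, store):
        self.fetch, self.store = fetch, store   # assumed Vice RPCs
        self.cache = {}    # path -> whole-file local copy (bytes)
        self.valid = {}    # path -> callback promise still valid?

    def open(self, path):
        if path not in self.cache or not self.valid.get(path):
            self.cache[path] = self.fetch(path)  # whole file + fresh promise
            self.valid[path] = True
        return self.cache[path]   # the kernel then works on the local copy

    def close(self, path, dirty):
        if dirty:
            # Vice stores the copy and cancels the other clients' promises.
            self.store(path, self.cache[path])

    def callback_cancelled(self, path):
        # Cancellation message from Vice: the file was updated elsewhere.
        self.valid[path] = False
```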
Distributed file systems: AFS
• Caching
  – callback principle:
    • the server supplies a callback promise at open
    • the promise is stored with the file in the client cache
  – a callback promise can be valid or cancelled:
    • initially valid
    • the server sends a message to cancel the callback promise
      – to all clients that cache the file
      – whenever the file is updated
Distributed file systems: AFS
• Caching maintenance
  – when a client workstation reboots:
    • cache validation is necessary because of possibly missed messages
    • a cache validation request is sent for each valid promise
  – valid callback promises are renewed:
    • on open
    • when no communication has occurred with the server during a period T
Distributed file systems: AFS
• Update semantics
  – guarantee after a successful open of file F on server S:
    latest(F, S, 0)
    or (lostCallback(S, T) and inCache(F) and latest(F, S, T))
  – no other concurrency control:
    • 2 copies can be updated at different workstations
    • all updates except those from the last close are (silently) lost
    • differs from normal UNIX operation
Distributed file systems: AFS
• Performance: impressive compared to NFS
  – benchmark: load on the server
    • 40% for AFS
    • 100% for NFS
  – whole-file caching:
    • reduces the load on the servers
    • minimises the effect of network latency
  – read-only volumes are replicated (a master copy for occasional updates)
  – Andrew is optimised for a specific pattern of use!
Distributed file systems
• Overview
  – File service model
  – Case study: NFS
  – Case study: AFS
  – Comparison: NFS vs. AFS
Distributed file systems: NFS vs. AFS
• Access transparency
  – offered by both NFS and AFS
  – the Unix file system interface is offered
• Location transparency
  – uniform view on shared files in AFS
  – in NFS:
    • mounting freedom
    • the same view is possible if everyone mounts identically; discipline!
Distributed file systems: NFS vs. AFS
• Failure transparency
  – NFS:
    • no client state stored in servers
    • idempotent operations
    • transparency limited by soft mounting
  – AFS:
    • state about clients stored in servers
    • the cache maintenance protocol handles server crashes
    • limitations?
Distributed file systems: NFS vs. AFS
• Performance transparency
  – NFS:
    • acceptable performance degradation
  – AFS:
    • only a delay for the first open operation on a file
    • better than NFS for small files
Distributed file systems: NFS vs. AFS
• Migration transparency
  – limited: an update of locations is required
  – NFS:
    • update the configuration files on all clients
  – AFS:
    • update the replicated location database
Distributed file systems: NFS vs. AFS
• Replication transparency
  – NFS:
    • not supported
  – AFS:
    • limited support, for read-only volumes
    • one master copy for exceptional updates; manual procedure for propagating changes to the other volumes
Distributed file systems: NFS vs. AFS
• Concurrency transparency
  – not supported (as in UNIX)
• Scalability transparency
  – AFS better than NFS
Distributed file systems
• Overview
  – File service architecture
  – Case study: NFS
  – Case study: AFS
  – Comparison: NFS vs. AFS
Distributed file systems
• Conclusions
  – standards ≠ quality
  – evolution towards standards and a common key interface
    • AFS-3 incorporates Kerberos, the vnode interface, large messages for file blocks (64 KB)
  – network (mainly access) transparency causes inheritance of weaknesses: e.g. no concurrency control
  – evolution mainly performance-driven
Distributed file systems
• Good system design:
  – partitioning
    • different storage volumes
  – replication of servers
    • high availability
    • limited inconsistencies
    • load balancing?
    • server preference?
  – caching
    • acceptable performance
    • limited inconsistencies
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Peer-to-peer systems
• Overview
  – Introduction
  – Napster
  – Middleware
  – Routing algorithms
  – OceanStore / Pond file store
Peer-to-peer systems: introduction
• Definition 1:
  – support useful distributed services
  – using the data and computing resources available in the PCs and workstations on the Internet
• Definition 2:
  – applications that exploit resources available at the edges of the Internet
Peer-to-peer systems: introduction
• Characteristics:
  – each user contributes resources to the system
  – all nodes have the same functional capabilities and responsibilities
  – correct operation does not depend on the existence of any centrally administered systems
  – can offer a limited degree of anonymity
  – efficient algorithms for the placement of data across many hosts and for subsequent access
  – efficient algorithms for load balancing and availability of data
    (note: the availability of participating computers is unpredictable)
Peer-to-peer systems: introduction
• 3 generations of peer-to-peer systems and application development:
  – Napster: music exchange service
  – file-sharing applications offering greater scalability, anonymity and fault tolerance: Freenet, Gnutella, Kazaa
  – middleware for the application-independent management of distributed resources on a global scale: Pastry, Tapestry, …
Peer-to-peer systems: Napster
• download digital music files
• architecture:
  – centralised indices
  – files stored and accessed on the peers' PCs
• method of operation (next slide)
Peer-to-peer systems: Napster
[Figure: method of operation. 1. A peer sends a file location request to a Napster index server. 2. The server returns a list of peers offering the file. 3. The requesting peer sends a file request to one of them. 4. The file is delivered. 5. The requester updates the index; sharing of the new copy is expected!]
Peer-to-peer systems: Napster
• Conclusion:
  – feasibility demonstrated
  – simple load-balancing techniques
  – replicated unified index of all available music files
  – no updates of files
  – availability not a hard concern
  – legal issues: anonymity
Peer-to-peer systems
• Overview
  – Introduction
  – Napster
  – Middleware
  – Routing algorithms
  – OceanStore / Pond file store
Peer-to-peer systems: middleware
• Goal:
  – automatic placement and subsequent location of distributed objects
• Functional requirements:
  – locate and communicate with any resource
  – add and remove resources
  – add and remove hosts
  – API independent of the type of resource
Peer-to-peer systems: middleware
• Non-functional requirements:
  – global scalability (millions of objects on hundreds of thousands of hosts)
  – optimization for local interactions between neighbouring peers
  – accommodation of highly dynamic host availability
  – security of data in an environment with heterogeneous trust
  – anonymity, deniability and resistance to censorship
Peer-to-peer systems: middleware
• Approach:
  – knowledge of the locations of objects is
    • partitioned
    • distributed
    • replicated (e.g. 16×)
    throughout the network
  = a routing overlay:
    • a distributed algorithm for locating objects and nodes
Peer-to-peer systems: middleware
[Figure: a routing overlay. Objects and nodes are scattered over the network; each of the nodes A, B, C and D holds partial routing knowledge, and together their knowledge suffices to route to any object.]
Peer-to-peer systems: middleware
• Resource identification:
  – GUID = globally unique identifier
  – a secure hash computed from
    • all of the state → self-certifying
      – for immutable objects
    • part of the state
      – e.g. name + …
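A sketch of a self-certifying GUID; Pastry GUIDs are 128-bit, and the choice of SHA-256 truncated to 128 bits is illustrative, not prescribed by the slides:

```python
import hashlib

def guid_from_state(state: bytes) -> int:
    """Self-certifying GUID for an immutable object (sketch)."""
    digest = hashlib.sha256(state).digest()   # illustrative hash choice
    return int.from_bytes(digest[:16], "big") # 128-bit identifier

# Anyone holding the object can recompute the hash to verify its GUID.
```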
Peer-to-peer systems: middleware
• API for a DHT (distributed hash table), as in Pastry:
  – put(GUID, data): the data is stored in replicas at all nodes responsible for the object identified by GUID
  – remove(GUID): deletes all references to GUID and the associated data
  – value = get(GUID): the data associated with GUID is retrieved from one of the nodes responsible for it
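To fix the shape of this interface, a single-process stand-in with no routing or replication:

```python
class LocalDHT:
    """Single-process stand-in for the DHT API (sketch)."""
    def __init__(self):
        self.store = {}

    def put(self, guid, data):
        # A real DHT would replicate at all nodes responsible for guid.
        self.store[guid] = data

    def get(self, guid):
        return self.store[guid]

    def remove(self, guid):
        del self.store[guid]
```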
Peer-to-peer systems: middleware
• API for a DOLR (distributed object location and routing), as in Tapestry:
  – publish(GUID): makes the node performing the publish operation the host for the object corresponding to GUID
  – unpublish(GUID): makes the object corresponding to GUID inaccessible
  – sendToObj(msg, GUID, [n]): an invocation message is sent to an object in order to access it; the optional parameter [n] requests delivery of the same message to n replicas of the object
Peer-to-peer systems
• Overview
  – Introduction
  – Napster
  – Middleware
  – Routing algorithms
  – OceanStore / Pond file store
Peer-to-peer systems: routing algorithms
• prefix routing
  – used in Pastry, Tapestry
• GUID: 128-bit
  – host: secure hash of the public key
  – object: secure hash of the name or of part of the state
• N participating nodes:
  – O(log N) steps to route a message to any GUID
  – O(log N) messages to integrate a new node
Peer-to-peer systems: routing algorithms
• Simplified algorithm: circular routing
  – a leaf set in each active node
  – element of the leaf set:
    GUID and IP address of the nodes whose GUIDs are numerically closest on either side of the node's own GUID
  – size L = 2·l
  – leaf sets are updated when nodes
    • join
    • leave
Peer-to-peer systems: routing algorithms
[Figure: the circular GUID space from 0 to FFFFF…F (2^128 − 1); each dot is a live node; leaf-set size 8. Shown: routing a message from node 65A1FC to D46A1C, hopping via D13DA3, D471F1 and D467C4.]
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry)
  – tree-structured routing table:
    • GUID-IP pairs spread throughout the entire range of GUIDs
    • increased density of coverage for GUIDs numerically close to the GUID of the local node
    • GUIDs viewed as hexadecimal values
    • the table classifies GUIDs based on their hexadecimal prefixes
    • as many rows as hexadecimal digits
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry)
  [Figure: a Pastry routing table; row p holds entries for GUIDs that share a prefix of length p with the local GUID, one column per possible next hexadecimal digit.]
Peer-to-peer systems: routing algorithms
[Figure: routing a message from 65A1FC to D46A1C with the full routing table: each hop (via D13DA3, D4213F, D462BA, D471F1, D467C4) shares a longer GUID prefix with the destination. With a well-populated table: ~log16(N) hops.]
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry)
  To handle a message M addressed to node D (where R[p,i] is the element at column i, row p of the routing table):
  1. If (L−l ≤ D ≤ Ll) {  // the destination is within the leaf set or is the current node
  2.   Forward M to the element Li of the leaf set with GUID closest to D, or to the current node A.
  3. } else {  // use the routing table to despatch M to a node with a closer GUID
  4.   Find p, the length of the longest common prefix of D and A, and i, the (p+1)th hexadecimal digit of D.
  5.   If (R[p,i] ≠ null) forward M to R[p,i]  // route M to a node with a longer common prefix
  6.   else {  // there is no entry in the routing table
  7.     Forward M to any node in L or R with a common prefix of length p, but a GUID that is numerically closer.
       }
     }
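An executable sketch of one routing step; the leaf set and routing table are plain Python structures, and GUIDs are 128-bit integers:

```python
def shared_prefix_len(a, b, digits=32):
    """Number of leading hexadecimal digits shared by two 128-bit GUIDs."""
    ha, hb = f"{a:032x}", f"{b:032x}"
    p = 0
    while p < digits and ha[p] == hb[p]:
        p += 1
    return p

def hex_digit(guid, p):
    """The (p+1)-th hexadecimal digit of a 128-bit GUID."""
    return int(f"{guid:032x}"[p], 16)

def next_hop(D, A, leaf_set, R):
    """One Pastry routing step (sketch of the pseudocode above).

    A: local GUID; leaf_set: list of leaf GUIDs; R: 32x16 routing table
    holding GUIDs or None."""
    if leaf_set and min(leaf_set) <= D <= max(leaf_set):
        # Destination within the leaf set: deliver to the numerically closest.
        return min(leaf_set + [A], key=lambda g: abs(g - D))
    p = shared_prefix_len(A, D)
    i = hex_digit(D, p)
    if R[p][i] is not None:
        return R[p][i]               # node sharing a longer prefix with D
    # Rare case: no table entry; fall back to any known node closer to D.
    known = [g for row in R for g in row if g is not None] + leaf_set
    return min(known, key=lambda g: abs(g - D))
```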
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): host integration
  – a new node X:
    • computes its GUID_X
    • should know at least one nearby Pastry node A with GUID_A
    • sends a special join message to A, with destination GUID_X
  – A despatches the join message via Pastry
    • route of the message: A, B, …, Z (GUID_Z numerically closest to GUID_X)
  – each node on the route sends the relevant part of its routing table and leaf set to X
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): host integration (cont.)
  – route of the join message: A, B, …, Z
  – routing table of X:
    • row 0 of A → row 0 of X
    • B and X share i hexadecimal digits: row i of B → row i of X
    • …
    • choice between candidate entries? proximity neighbour selection (metric: IP hops, measured latency, …)
  – leaf set of Z → leaf set of X
  – X sends its leaf set and routing table to all nodes in its routing table and leaf set
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): host failure
  – repair actions to update leaf sets: a leaf set is requested from node(s) close to the failed node
  – routing table: repairs on a "when discovered" basis
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): fault tolerance
  – are all nodes in the leaf set alive?
    • each node sends heartbeat messages to the nodes in its leaf set
  – malicious nodes?
    • clients can use an at-least-once delivery mechanism
    • use limited randomness in the node selection of the routing algorithm
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): dependability
  – include acks in the routing algorithm; no ack →
    • alternative route
    • node: suspected failure
  – heartbeat messages
  – suspected failed nodes in routing tables are
    • probed
    • on failure: replaced by alternative nodes
  – a simple gossip protocol to exchange routing table information between nodes
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): evaluation (100,000 messages)
  – message loss
  – performance: Relative Delay Penalty (RDP) = Pastry delivery time / IP (UDP) delivery time

  IP loss rate   Pastry loss   Pastry wrong node   RDP
  0%             1.5           0                   1.8
  5%             3.3           1.6                 2.2
Peer-to-peer systems
• Overview
  – Introduction
  – Napster
  – Middleware
  – Routing algorithms
  – OceanStore / Pond file store
Distributed file systems: OceanStore
• Objectives:
  – a very large scale, incrementally-scalable, persistent storage facility
  – mutable data objects with long-term persistence and reliability
  – an environment of constantly changing network and computing resources
• Intended use
• Mechanisms used
Distributed file systems: OceanStore
• Intended use:
  – NFS-like file system
  – electronic mail hosting
• Mechanisms needed:
  – consistency between replicas
    • tailored to application needs by a Bayou-like system
  – privacy and integrity
    • encryption of data
    • Byzantine agreement protocol for updates
Distributed file systems: OceanStore
• Storage organization
  [Figure: an object's AGUID refers to a certificate recording the VGUID of the current version; each version is a tree: a root block (VGUID) pointing via indirection blocks to data blocks d1…d5. Version i+1 has been updated in blocks d1, d2 and d3; unchanged blocks are shared with version i via BGUID pointers (copy-on-write). The certificate and the root blocks include some metadata not shown; all unlabelled arrows are BGUIDs.]
Distributed file systems: OceanStore
• Storage organization (cont.)

  Name    Meaning        Description
  BGUID   block GUID     secure hash of a data block
  VGUID   version GUID   BGUID of the root block of a version
  AGUID   active GUID    uniquely identifies all the versions of an object
Distributed file systems: OceanStore
• Storage organization (cont.)
  – new object → AGUID:
    • a small set of hosts acts as the inner ring
    • publish(AGUID)
    • the AGUID refers to a signed certificate recording the sequence of versions of the object
  – update of an object:
    • Byzantine agreement between the hosts in the inner ring (primary copy)
    • the result is disseminated to the secondary replicas
Distributed file systems: OceanStore
• Performance (Andrew benchmark, times per phase)

  Phase   LAN: Linux NFS   LAN: Pond   WAN: Linux NFS   WAN: Pond   Predominant operations
  1       0.0              1.9         0.9              2.8         read and write
  2       0.3              11.0        9.4              16.8        read and write
  3       1.1              1.8         8.3              1.8         read
  4       0.5              1.5         6.9              1.5         read
  5       2.6              21.0        21.5             32.0        read and write
  Total   4.5              37.2        47.0             54.9
Distributed file systems: OceanStore
• Performance (cont.)
  – benchmark phases:
    • create subdirectories recursively
    • copy a source tree
    • examine the status of all files in the tree
    • examine every byte of data in all files
    • compile and link the files
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems