Distributed Systems:
General Services
Overview of chapters
• Introduction
• Co-ordination models and languages
• General services
  – Ch 11 Time Services, 11.1-11.3
  – Ch 9 Name Services
  – Ch 8 Distributed File Systems
  – Ch 10 Peer-to-peer Systems
• Distributed algorithms
• Shared data
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Introduction
• Objectives
  – Study existing subsystems
    • Services offered
    • Architecture
    • Limitations
  – Learn about non-functional requirements
    • Techniques used
    • Results
    • Limitations
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Time services: Overview
• What & Why
• Clock synchronization
• Logical clocks (see chapter on algorithms)
Time services
• Definition of time
  – International Atomic Time (TAI)
    A second is defined as 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of Caesium-133
  – Astronomical time
    Based on the rotation of the earth on its axis and its orbit around the sun. Due to tidal friction, the earth's rotation period is gradually getting longer
  – Co-ordinated Universal Time (UTC)
    International standard based on atomic time; a leap second is occasionally inserted or deleted to keep in step with astronomical time
Time services (cont.)
• Why is time important?
  – Accounting (e.g. logins, files, …)
  – Performance measurement
  – Resource management
  – Some algorithms depend on it:
    • Transactions using timestamps
    • Authentication (e.g. Kerberos)
    • Example: make
Time services (cont.)
• An example: the Unix tool "make"
  – large program:
    • multiple source files: graphics.c
    • multiple object files: graphics.o
    • change of one file → recompilation of dependent files
  – use of make to handle changes:
    • change of graphics.c at 3h45
    • make examines creation time of graphics.o
    • e.g. 3h43: recompiles graphics.c to create an up-to-date graphics.o
Time services (cont.)
• Make in a distributed environment:
  – source file on system A
  – object file on system B
  [Figure: clock B runs ahead of clock A. graphics.o is created at 3h47 on B's clock; graphics.c is then modified at 3h43 on A's clock. The later modification carries the smaller (or an equal) timestamp, so "make graphics.o" has no effect!]
Time services (cont.)
• Hardware clocks
  – each computer has a physical clock
    • H(t): value of hardware clock/counter
    • different clocks → different values
    • e.g. crystal-based clocks drift due to
      – different shapes, temperatures, …
    • typical drift rates:
      – 1 µsec/sec → 1 sec in 11.6 days
      – high-precision quartz clocks: 0.1-0.01 µsec/sec
• Software clocks
  – C(t) = α H(t) + β
  – e.g. number of nanoseconds elapsed since a reference time
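A minimal sketch of the C(t) = αH(t) + β relation; time.monotonic_ns() stands in for the hardware counter H(t), and α, β are the adjustable parameters:

```python
import time

class SoftwareClock:
    """Software clock C(t) = alpha * H(t) + beta (sketch)."""
    def __init__(self, alpha=1.0, beta=0.0):
        self.alpha = alpha   # scale factor: compensates a constant drift rate
        self.beta = beta     # offset: aligns the clock with a reference time

    def read(self):
        # time.monotonic_ns() plays the role of the hardware counter H(t)
        return self.alpha * time.monotonic_ns() + self.beta
```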
Time services (cont.)
• Changing local time
  – Hardware: influence H(t)
  – Software: change α and/or β
  – Forward:
    • set the clock forward
    • some clock ticks appear to have been missed
  – Backward:
    • set the clock backward? unacceptable
    • let the clock run slower for a short period of time
Time services (cont.)
• Properties
  – External synchronization to a UTC source S:
    | S(t) − Ci(t) | < D
  – Internal synchronization:
    | Ci(t) − Cj(t) | < D
  – Correctness:
    drift rate bounded by ρ
  – Monotonicity:
    t' > t ⇒ C(t') > C(t)
Time services: Overview
• What & Why
• Synchronising physical clocks
  – Synchronising with UTC
  – Cristian's algorithm
  – The Berkeley algorithm
  – Network Time Protocol
• Logical clocks (see chapter on algorithms)
Time services: synchronising with UTC
• Broadcast by radio stations
  – WWV (USA) (accuracy 0.1 - 10 msec)
• Broadcast by satellites
  – GEOS (0.1 msec)
  – GPS (1 msec)
• Receivers attached to workstations
Time services: Cristian's algorithm
• assumption: a time server process Ts on a computer with a UTC receiver
• process P: request/reply with the time server
  [Figure: P sends a T-request to Ts; Ts returns a reply containing its current time t]
Time services: Cristian's algorithm
• how to set the clock at P?
  – time t inserted in the reply by Ts
  – time for P: t + Ttrans
  – Ttrans = min + x
    • min = minimum transmission time
    • x: unknown; depends on
      – network delay
      – process scheduling
Time services: Cristian's algorithm
• In a synchronous system
  – upper bound on transmission delay: max
  – P sets its time to t + (min + max) / 2
  – accuracy: ±(max − min) / 2
• In an asynchronous system?
Time services: Cristian's algorithm
• practical approach
  – P measures the round-trip time: Tround
  – P sets its time to t + Tround/2
  – accuracy?
    • the time at Ts when the reply arrives at P is in the range [t + min, t + Tround − min]
    • accuracy: ±(Tround/2 − min)
  – problems:
    • single point of failure
    • impostor time server
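A runnable sketch of the client side; request_time() stands in for the RPC to Ts (here it simply reads the local UTC clock so the sketch executes):

```python
import time

def request_time():
    # Stand-in for the T-request RPC to the time server Ts.
    return time.time()

def cristian_sync(min_delay=0.0):
    t0 = time.monotonic()               # send T-request
    t = request_time()                  # server time t from the reply
    t_round = time.monotonic() - t0     # measured round-trip time Tround
    estimate = t + t_round / 2          # P's new clock value
    accuracy = t_round / 2 - min_delay  # +/- bound on the error
    return estimate, accuracy
```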
Time services: the Berkeley algorithm
• used on computers running Berkeley Unix
• an active master is elected among the computers whose clocks have to be synchronised
  – no central time server
  – upon failure of the master: re-election
Time services: the Berkeley algorithm
• algorithm of the master:
  – periodically polls the clock values of the other computers
  – estimates each local clock time (based on round-trip delays)
  – averages the obtained values
  – returns the amount by which each individual slave clock needs adjustment
  – a fault-tolerant average (outlying values excluded) may be used
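A sketch of one round at the master; poll_clock(node) is an assumed RPC returning a (round-trip time, reported clock) pair, and the tolerance bound is illustrative:

```python
def berkeley_round(master_clock, slaves, poll_clock, tolerance=1.0):
    """One synchronisation round of the Berkeley algorithm (sketch)."""
    estimates = {"master": master_clock}
    for node in slaves:
        t_round, reported = poll_clock(node)
        estimates[node] = reported + t_round / 2     # compensate for transit
    # Fault-tolerant average: drop values too far from the median.
    values = sorted(estimates.values())
    median = values[len(values) // 2]
    kept = [v for v in values if abs(v - median) <= tolerance]
    average = sum(kept) / len(kept)
    # Each node receives the adjustment to apply to its own clock.
    return {node: average - est for node, est in estimates.items()}
```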
Time services: Network Time Protocol
• standard for the Internet
• design aims and features:
  – accurate synchronization despite variable message delays
    • statistical techniques to filter timing data from different servers
  – a reliable service that can survive losses of connectivity
    • redundant servers
    • hierarchy of servers in synchronised subnets
Time services: Network Time Protocol
[Figure: hierarchy of NTP servers. A server's level is called its stratum; stratum-1 servers have a UTC receiver; each stratum synchronises with the stratum above it.]
Time services: Network Time Protocol
• three working modes for NTP servers
  – multicast mode
    • used on high-speed LANs
    • slave sets its clock assuming a small delay
  – procedure-call mode
    • similar to Cristian's algorithm
  – symmetric mode
    • used by master servers and the higher strata
    • an association is maintained between the servers
Time services: Network Time Protocol
• procedure-call and symmetric mode
  – each message bears timestamps of recent message events
  [Figure: message m from server A to server B leaves A at Ti−3 and arrives at B at Ti−2; reply m' leaves B at Ti−1 and arrives at A at Ti.]
Time services: Network Time Protocol
• compute approximations of offset and delay
  – o: true offset of B's clock relative to A's clock
  – t, t': actual transmission times of m and m'
  – the four timestamps satisfy:
    Ti−2 = Ti−3 + t + o
    Ti = Ti−1 + t' − o
  – hence:
    a = Ti−2 − Ti−3 = t + o
    b = Ti−1 − Ti = o − t'
  – estimates:
    di = a − b = t + t' (the total transmission time, i.e. the delay)
    oi = (a + b) / 2
    oi − di/2 ≤ o ≤ oi + di/2
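These estimates follow directly from the four timestamps; a minimal runnable sketch:

```python
def ntp_estimates(t3, t2, t1, t0):
    """Offset/delay estimates from one NTP exchange.

    t3 = Ti-3 (m sent by A),  t2 = Ti-2 (m received by B),
    t1 = Ti-1 (m' sent by B), t0 = Ti   (m' received by A).
    """
    a = t2 - t3             # = t + o
    b = t1 - t0             # = o - t'
    delay = a - b           # di = t + t', total transmission time
    offset = (a + b) / 2    # oi, estimate of the true offset o
    return offset, delay    # true o lies in [oi - di/2, oi + di/2]
```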
Time services: Network Time Protocol
• procedure-call and symmetric mode
  – data filtering applied to successive ⟨oi, di⟩ pairs
  – NTP has a complex peer-selection algorithm; favoured are peers with:
    • a lower stratum number
    • a lower synchronisation dispersion
  – accuracy:
    • 10 msec on the Internet
    • 1 msec on a LAN
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Name services
• Overview
  – Introduction
  – Basic service
  – Case study: Domain Name System
  – Directory services
Name services: introduction
• Types of name:
  – users
  – files, devices
  – service names (e.g. RPC)
  – port, process, group identifiers
• Attributes associated with names:
  – email address, login name, password
  – disk, block number
  – network address of server
  – …
Name services: introduction
• Goal:
  – binding between names and attributes
• History
  – originally:
    • quite a simple problem
    • scope: a single LAN
    • naïve solution: all names + attributes in a single file
  – now:
    • interconnected networks
    • different administrative domains in a single name space
Name services: introduction
• naming service separate from other services:
  – unification
    • general naming scheme for different objects
    • e.g. local + remote files in the same scheme
    • e.g. files + devices + pipes in the same scheme
    • e.g. URLs
  – integration
    • scope of sharing is difficult to predict
    • merge different administrative domains
    • requires an extensible name space
Name services: introduction
• General requirements:
– scalable
– long lifetime
– high availability
– fault isolation
– tolerance of mistrust
Name services
• Overview
  – Introduction
  – Basic service
  – Case study: Domain Name System
  – Directory services
Name services: basic service
• Operations:
  – lookup (name, context) → identifier
  – register (name, context, identifier)
  – delete (name, context)
• From a single server to multiple servers:
  – hierarchical names
  – efficiency
  – availability
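A minimal single-server sketch of these three operations, backed by a plain dictionary:

```python
class NameServer:
    """Single-server sketch of the basic name service operations."""
    def __init__(self):
        self.bindings = {}                      # (context, name) -> identifier

    def register(self, name, context, identifier):
        self.bindings[(context, name)] = identifier

    def lookup(self, name, context):
        return self.bindings[(context, name)]   # raises KeyError if unbound

    def delete(self, name, context):
        del self.bindings[(context, name)]
```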
Name services: basic service
• Hierarchical name space
  – introduction of directories or subdomains
    • /users/pv/edu/distri/services.ppt
    • cs.kuleuven.ac.be
  – easier administration
    • delegation of authority
    • fewer name conflicts: names are unique per directory
    • unit of distribution: a server per directory + a single root server
Name services: basic service
• Hierarchical name space
  – navigation
    • the process of locating naming data from more than one server
    • iterative or recursive
  – result of a lookup:
    • attributes, or
    • a reference to another name server
Name services: basic service
[Figure: iterative navigation, step 1. A directory tree whose root contains "users" and "src"; "users" contains "sys", "pv" and "ann". Process P sends lookup("/users/pv/edu/…", …) to the root server, which returns a reference to the server for "users".]
Name services: basic service
[Figure: iterative navigation, step 2. P sends lookup("pv/edu/…", …) to the "users" server, which returns a reference to the server for "pv".]
Name services: basic service
• Efficiency → caching
  – the hierarchy introduces multiple servers
    → more communication overhead, more processing capacity required
  – caching at the client side:
    • <name, attributes> pairs are cached
    • fewer requests → higher lookup performance
    • inconsistency?
      – data is rarely changed
      – limit on validity
Name services: basic service
• Availability → replication
  – at least two failure-independent servers
  – primary server: accepts updates
  – secondary server(s):
    • get a copy of the data of the primary
    • periodically check for updates at the primary
  – level of consistency?
Name services
• Overview
  – Introduction
  – Basic service
  – Case study: Domain Name System
  – Directory services
Name services: Domain Name System
• DNS = the Internet name service
• originally: an ftp-able file
  – not scalable
  – centralised administration
• now a general name service for computers and domains
  – partitioning
  – delegation
  – replication and caching
Name services: Domain Name System
• Domain name → hierarchy
  nix.cs.kuleuven.ac.be
  – be: country (Belgium)
  – ac: academic institutions
  – kuleuven: K.U.Leuven
  – cs: dept. of computer science
  – nix: name of the computer system
Name services: Domain Name System
• 3 kinds of Top-Level Domains (TLDs)
  – 2-letter country codes (ISO 3166)
  – generic names (similar organisations)
    • com: commercial organisations
    • org: non-profit organisations
    • int: international organisations (NATO, EU, …)
    • net: network providers
  – USA-oriented names
    • edu: universities
    • gov: American government
    • mil: American military
  – new generic names
    • biz, info, name, aero, museum, …
Name services: Domain Name System
• For each TLD:
  – an administrator (assigns names within the domain)
  – "be": DNS BE vzw (previously: the dept. of computer science)
• Each organisation with a name:
  – responsible for new (sub)names
  – e.g. cs.kuleuven.ac.be
• Hierarchical naming + delegation → a workable structure
Name services: Domain Name System
• Name servers
  – root name servers
  – a server per (group of) subdomains
  + replication → high availability
  + caching → acceptable performance
    – time-to-live to limit inconsistencies
Name services: Domain Name System
• Name server cs.kuleuven.ac.be

  System/subdomain   Type   IP address
  nix                A      134.58.42.36
  idefix             A      134.58.41.7
  droopy             A      134.58.41.10
  stevin             A      134.58.41.16
  …

  A = address record
Name services: Domain Name System
• Name server kuleuven.ac.be

  Machine/subdomain   Type   IP address
  cs                  NS     134.58.39.1
  esat                NS     …
  www                 A      …
  …

  NS = name server record
Name services: Domain Name System
• Example: resolving www.cs.vu.nl
  – the local NS (cs.kuleuven.ac.be) queries the root NS
    → referral to NS (nl)
    → referral to NS (vu.nl)
    → referral to NS (cs.vu.nl)
    → answer: 130.37.24.11
  – the local NS returns 130.37.24.11 to the client
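A sketch of this iterative resolution loop; query(server, name) is an assumed helper that returns either ("answer", ip_address) or ("referral", next_server):

```python
def resolve_iteratively(name, root_server, query):
    """Iterative DNS-style resolution (sketch)."""
    server = root_server
    while True:
        kind, result = query(server, name)
        if kind == "answer":
            return result      # e.g. "130.37.24.11"
        server = result        # follow the referral down the hierarchy
```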
Name services: Domain Name System
• Extensions:
  – mail host location
    • used by electronic mail software
    • requests where mail for a domain should be delivered
  – reverse resolution (IP address → domain name)
  – host information
• Weak points:
  – security
Name services: Domain Name System
• Good system design:
  – partitioning of data
    • multiple servers
  – replication of servers
    • high availability
    • limited inconsistencies
    • NO load balancing; NO server preference
  – caching
    • acceptable performance
    • limited inconsistencies
Name services
• Overview
  – Introduction
  – Basic service
  – Case study: Domain Name System
  – Directory services
Name services: directory services
• Name service:
  exact name → attributes
• Directory service:
  some attributes of an object → all attributes
• Example:
  – request: e-mail address of Janssen at KULeuven
  – result: list of names of all persons called Janssen
  – select a person and read the attributes
Name services: directory services
• Directory information base
  – list of entries
  – each entry:
    • a set of type/list-of-values pairs
    • optional and mandatory pairs
    • some pairs determine the name
  – distributed implementation
Name services: directory services
• X.500
  – CCITT standard
• LDAP
  – Lightweight Directory Access Protocol
  – Internet standard
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Distributed file systems
• Overview
  – File service architecture
  – Case study: NFS
  – Case study: AFS
  – Comparison: NFS vs. AFS
Distributed file systems: file service architecture
• Definitions:
  – file
  – directory
  – file system
  (cf. operating systems)
Distributed file systems: file service architecture
• Requirements:
  – addressed by most file systems:
    • access transparency
    • location transparency
    • failure transparency
    • performance transparency
  – related to distribution:
    • concurrent updates
    • hardware and operating system heterogeneity
    • scalability
Distributed file systems: file service architecture
• Requirements (cont.):
  – scalable to a very large number of nodes
    • replication transparency
    • migration transparency
  – future:
    • support for fine-grained distribution of data
    • tolerance to network partitioning
Distributed file systems: file service architecture
• File service components
  [Figure: user programs call a client module (API) on the client computer; across the network, the client module talks to the directory service (names → UFIDs) and to the flat file service (operations on UFIDs).]
Distributed file systems: file service architecture
• Flat file service
  – file = data + attributes
  – data = sequence of items
  – operations: simple read & write
  – attributes: split between flat file & directory service
  – UFID: unique file identifier
Distributed file systems: file service architecture
• Flat file service: operations
  – Read (FileID, Position, n) → Data
  – Write (FileID, Position, Data)
  – Create () → FileID
  – Delete (FileID)
  – GetAttributes (FileID) → Attr
  – SetAttributes (FileID, Attr)
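An in-memory sketch of this interface; the operations are idempotent because every read and write carries an absolute position (the uuid-based UFID is an illustrative stand-in):

```python
import uuid

class FlatFileService:
    """In-memory sketch of the flat file service (data operations only)."""
    def __init__(self):
        self.files = {}                       # UFID -> bytearray

    def create(self):
        ufid = uuid.uuid4().hex               # stand-in for a real UFID
        self.files[ufid] = bytearray()
        return ufid

    def read(self, ufid, position, n):
        return bytes(self.files[ufid][position:position + n])

    def write(self, ufid, position, data):
        f = self.files[ufid]
        if len(f) < position + len(data):     # grow the file if needed
            f.extend(b"\0" * (position + len(data) - len(f)))
        f[position:position + len(data)] = data

    def delete(self, ufid):
        del self.files[ufid]
```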
Distributed file systems: file service architecture
• Flat file service: fault tolerance
  – straightforward for simple servers
  – idempotent operations
  – stateless servers
Distributed file systems: file service architecture
• Directory service
  – translates a file name into a UFID
  – substitute for open
  – responsible for access control
Distributed file systems: file service architecture
• Directory service: operations
  – Lookup (Dir, Name, AccessMode, UserID) → UFID
  – AddName (Dir, Name, FileID, UserID)
  – UnName (Dir, Name)
  – ReName (Dir, OldName, NewName)
  – GetNames (Dir, Pattern) → list of names
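How the two services combine to emulate a conventional open-then-read; a sketch only, assuming a directory service with the Lookup signature above and the flat file sketch shown earlier:

```python
def read_by_name(dir_service, file_service, dir_id, name, user):
    """Emulate a conventional open + read with the two services (sketch)."""
    ufid = dir_service.lookup(dir_id, name, "r", user)   # name -> UFID
    return file_service.read(ufid, 0, 4096)              # first 4 KB
```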
Distributed file systems: file service architecture
• Implementation techniques:
  – known techniques from OS experience
  – remain important
  – a distributed file service should be comparable to a local one in
    • performance
    • reliability
Distributed file systems: file service architecture
• Implementation techniques: overview
  – file groups
  – space leaks
  – capabilities and access control
  – access modes
  – file representation
  – file location
  – group location
  – file addressing
  – caching
Distributed file systems: file service architecture
• Implementation techniques: file groups
  – (similar: file system, partition)
  – = a collection of files mounted on a server computer
  – unit of distribution over servers
  – transparent migration of file groups
  – once created, a file remains in its file group
  – UFID = file group identifier + file identifier
Distributed file systems: file service architecture
• Implementation techniques: space leaks
  – 2 steps to create a file:
    • create the (empty) file and get a new UFID
    • enter name + UFID in a directory
  – failure after step 1:
    • the file exists in the file server
    • but is unreachable: its UFID is not in any directory
    → lost space on disk
  – detection requires co-operation between
    • file server
    • directory server
Distributed file systems: file service architecture
• Implementation techniques: capabilities
  – = a digital key: access to a resource is granted on presentation of the capability
  – request to directory server: file name + user id + mode of access
    → UFID including the permitted access modes
  – construction of the UFID:
    • unique
    • encodes access rights
    • unforgeable
Distributed file systems: file service architecture
• Implementation techniques: capabilities
  [Figure: successive UFID designs and their weaknesses:
   1. file group id + file nr → reuse?
   2. file group id + file nr + random nr → access check?
   3. file group id + file nr + random nr + access bits → forgeable?
   4. file group id + file nr + access bits + encrypted(access bits + random nr)]
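A sketch of the last design, substituting an HMAC for the encryption named on the slide (a common way to realise the "encrypted (access bits + random number)" field); the per-file random number is stored at the server:

```python
import hashlib, hmac, os

SERVER_KEY = os.urandom(32)    # secret known only to the file server

def make_ufid(group_id, file_nr, access_bits, random_nr):
    """Unforgeable UFID (sketch): a keyed digest covers the access bits."""
    public = f"{group_id}:{file_nr}:{access_bits}"
    tag = hmac.new(SERVER_KEY, f"{public}:{random_nr}".encode(),
                   hashlib.sha256).hexdigest()
    return (public, tag)

def check_ufid(ufid, random_nr):
    """Server-side check; random_nr is kept with the file at the server."""
    public, tag = ufid
    expected = hmac.new(SERVER_KEY, f"{public}:{random_nr}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(tag, expected)
```

Tampering with the access bits invalidates the tag, and the random number prevents a client from reconstructing a capability after the file number is reused.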
Distributed file systems: file service architecture
• Implementation techniques: file location
  – from UFID → location of the file server
  – use of a replicated group location database: file group id → PortId
  – why replication?
  – why is the location not encoded in the UFID?
Distributed file systems: file service architecture
• Implementation techniques: caching
  – server cache: reduce delay for disk I/O
    • selecting blocks for release
    • coherence:
      – dirty flags
      – write-through caching
  – client cache: reduce network delay
    • always use write-through
    • synchronisation problems with multiple caches
Distributed file systems
• Overview
  – File service model
  – Case study: Network File System (NFS)
  – Case study: AFS
  – Comparison: NFS vs. AFS
Distributed file systems: NFS
• Background and aims
  – first file service product
  – emulates the UNIX file system interface
  – de facto standard
    • key interfaces placed in the public domain
    • source code available for a reference implementation
  – supports diskless workstations
    • no longer important
Distributed file systems: NFS
• Design characteristics
  – client and server modules can be in any node; large installations include a few dedicated servers
  – clients:
    • on Unix: emulation of the standard UNIX file system
    • also for MS-DOS, Windows, Apple, …
  – integrated file and directory service
  – integration of remote file systems into the local one: mount → remote mount
Distributed file systems: NFS configuration
• Unix mount system call
  – each disk partition contains a hierarchical file system
  – how to integrate them?
    • name the partitions: a:/usr/students/john
    • glue the partitions together (mount)
      – invisible for the user
      – partitions remain useful for system managers
Distributed file systems: NFS configuration
• Unix mount system call
  [Figure: the root partition (/, containing vmunix, usr, …; usr contains students and staff) with partition c (containing pierre, ann, …; ann contains network and proje) mounted at directory staff: staff becomes the root of c, so ann's files appear as /usr/staff/ann/network.]
Distributed file systems: NFS configuration
• Remote mount
  [Figure: on the client, the root partition contains /usr/students and /usr/staff; Server 1 exports a users directory with subdirectories ann and pierre. The server's users directory is remote-mounted at the client's staff directory, so the files appear as /usr/staff/ann/….]
Distributed file systems: NFS configuration
• Mount service on the server
  – enables clients to integrate (part of) a remote file system into the local name space
  – exported file systems listed in /etc/exports + an access list (= hosts permitted to mount; secure?)
• On the client side
  – file systems to mount are enumerated in /etc/rc
  – typically mounted at start-up time
Distributed file systems: NFS configuration
• Mounting semantics
  – hard:
    • the client retries until the request for a remote file succeeds
    • possibly forever
  – soft:
    • a failure is returned if the request does not succeed after n retries
    • breaks Unix failure semantics
Distributed file systems: NFS configuration
• Automounter
  – principle:
    • empty mount points on clients
    • mount on the first request for a remote file
  – acts as a server for a local client:
    • gets references to empty mount points
    • maps mount points to remote file systems
    • the referenced file system is mounted on the mount point via a symbolic link, to avoid redundant requests to the automounter
Distributed file systems: NFS implementation
• In UNIX: client and server modules implemented in the kernel
• virtual file system:
  – internal key interface, based on file handles for remote files
Distributed file systems: NFS implementation
[Figure: on the client computer, a user process issues system calls into the UNIX kernel; the virtual file system dispatches each call either to the local UNIX file system or to the NFS client module, which talks over the network (NFS protocol) to the NFS server module in the server's kernel, where the virtual file system hands the call to the server's UNIX file system.]
Distributed file systems: NFS implementation
• Virtual file system
  – added to the UNIX kernel to distinguish between
    • local files
    • remote files
  – file handles: the file IDs used in NFS
    • base: inode number (the file ID within a partition on a UNIX system)
    • extended with:
      – a file system identifier
      – an inode generation number (to enable reuse of inodes)
Distributed file systems: NFS implementation
• Client integration:
  – NFS client module integrated in the kernel:
    • offers the standard UNIX interface
    • no client recompilation/reloading
    • a single client module for all user-level processes
    • encryption can be done in the kernel
• Server integration:
  – in the kernel only for performance reasons
  – a user-level server achieves about 80% of the kernel-level performance
Distributed file systems: NFS implementation
• Directory service
  – name resolution co-ordinated by the client
  – step-by-step process for multi-part file names
  – mapping tables in the server: high overhead, reduced by caching
• Access control and authentication
  – based on UNIX user ID and group ID
  – included and checked for every NFS request
  – secure NFS 4.0 thanks to the use of DES encryption
Distributed file systems: NFS implementation
• Caching
  – Unix caching:
    • based on disk blocks
    • delayed write (why?)
    • read-ahead (why?)
    • periodic sync to flush dirty buffers in the cache
  – caching in NFS:
    • server caching
    • client caching
Distributed file systems: NFS implementation
• Server caching in NFS
  – based on standard UNIX caching; 2 modes:
  – write-through (instead of delayed write)
    • good failure semantics
    • poorer performance
  – delayed write
    • data is stored in the cache until a commit operation is received
      – close on client → commit operation on server
    • failure semantics?
    • performance?
Distributed file systems: NFS implementation
• Client caching
  – cached are the results of
    • read, write, getattr, lookup, readdir
  – problem: multiple copies of the same data at different NFS clients
  – NFS clients use read-ahead and delayed write
Distributed file systems: NFS implementation
• Client caching (cont.)
  – handling of writes:
    • a block of the file is fetched and updated
    • the changed block is marked dirty
    • dirty pages of files are flushed to the server asynchronously:
      – on close of the file
      – on a sync operation on the client
      – by a bio-daemon (when a block is filled)
    • dirty pages of directories are flushed to the server:
      – by the bio-daemon without further delay
Distributed file systems: NFS implementation
• Client caching (cont.)
  – consistency checks:
    • based on timestamps indicating the last modification of the file on the server
    • validation checks:
      – when the file is opened
      – when a new block is fetched
      – assumed to remain valid for a fixed time (3 sec for a file, 30 sec for a directory)
      – the next operation causes another check
    • a costly procedure
Distributed file systems: NFS implementation
• Caching
  – a cached entry is valid if
    (T − Tc) < t  or  (Tm_client = Tm_server)
    – T: current time
    – Tc: time when the cache entry was last validated
    – t: freshness interval (3 .. 30 secs in Solaris)
    – Tm: time when the block was last modified at client/server
  – consistency level:
    • acceptable
    • most UNIX applications do not depend critically on the synchronisation of file updates
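A sketch of this validity test; get_server_mtime stands in for the getattr RPC that returns Tm_server:

```python
import time
from dataclasses import dataclass

@dataclass
class CacheEntry:
    validated: float   # Tc: when the entry was last validated
    mtime: float       # Tm_client: last modification time seen by the client

def entry_valid(entry, freshness, get_server_mtime):
    """NFS-style cache freshness check (sketch)."""
    if time.time() - entry.validated < freshness:   # (T - Tc) < t
        return True
    if entry.mtime == get_server_mtime():           # Tm_client == Tm_server
        entry.validated = time.time()               # revalidated
        return True
    return False
```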
Distributed file systems: NFS implementation
• Performance
  – reasonable performance
    • remote files on a fast disk can beat local files on a slow disk
    • RPC packets are 9 KB, to contain 8 KB disk blocks
  – the lookup operation accounts for about 50% of server calls
  – drawbacks:
    • frequent getattr calls for timestamps (cache validation)
    • poor performance of (relatively infrequent) writes (because of write-through)
Distributed file systems
• Overview
  – File service model
  – Case study: NFS
  – Case study: Andrew File System (AFS)
  – Comparison: NFS vs. AFS
Distributed file systems: AFS
• Background and aims
  – base: observations of UNIX file systems:
    • files are small (< 10 KB)
    • read is more common than write
    • sequential access is common, random access is not
    • most files are not shared
    • shared files are often modified by one user
    • file references come in bursts
  – aim: combine the best of personal computers and time-sharing systems
Distributed file systems: AFS
• Background and aims (cont.)
  – assumptions about the environment:
    • secured file servers
    • public workstations
    • workstations with a local disk
    • no private files on the local disk
  – key targets: scalability and security
    • CMU 1991: 800 workstations, 40 servers
Distributed file systems: AFS
• Design characteristics
  – whole-file serving and whole-file caching: entire files are transmitted, not blocks
  – client cache realised on the (relatively large) local disk
    → lower number of open requests on the network
  – separation between file and directory service
Distributed file systems: AFS
• Configuration
  – single (global) name space
  – local files:
    • temporary files
    • system files for start-up
  – volume = unit of configuration and management
  – each server maintains a replicated location database (volume → server mappings)
Distributed file systems: AFS
• File name space
  [Figure: the workstation's local root (/) contains bin, tmp, afs, …; the shared file system, rooted at the directory afs, contains users, bin, …, with users containing ann and pierre. Symbolic links make selected shared files appear in the local space.]
Distributed file systems: AFS
• Implementation
  – Vice = the file server software
    • runs on secured systems, controlled by system management
    • understands only file identifiers
    • runs in user space
  – Venus = the user/client software
    • runs on the workstations
    • workstations keep their autonomy
    • implements the directory service
  – kernel modifications for open and close
Distributed file systems: AFS
[Figure: several workstations, each running user programs and a Venus process on top of the Unix kernel, communicate over the network with servers running Vice processes on top of their Unix kernels.]
Distributed file systems: AFS
[Figure: within a workstation, Unix file system calls from a user program are intercepted by the kernel; non-local file operations are passed to Venus, local ones to the Unix file system.]
Distributed file systems: AFS
• Implementation of file system calls
  – open, close: involve
    • the UNIX kernel of the workstation
    • Venus (on the workstation)
    • Vice (on the server)
  – read, write: handled entirely by
    • the UNIX kernel of the workstation
Distributed file systems: AFS
• Open system call: open(FileName, …)
  – kernel: if FileName is in the shared space /afs, pass the request to Venus
  – venus: if the file is not present in the local cache, OR present but with an invalid callback, pass the request to the Vice server
  – vice (via the network): transfer a copy of the file and a valid callback promise
  – venus: place the copy of the file in the local file system and record the FileName
  – kernel: open the local file and return the descriptor to the user process
Distributed file systems: AFS
• Read/write system calls: read(FileDescriptor, …), write(FileDescriptor, …)
  – kernel: perform a normal UNIX read/write operation on the local copy of the file
Distributed file systems: AFS
• Close system call: close(FileDescriptor, …)
  – kernel: close the local copy and inform Venus about the close
  – venus: if the local copy has changed, send the copy to the Vice server (via the network)
  – vice: store the copy of the file and send a callback cancellation to the other clients holding a callback promise
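A sketch of Venus's side of this open/close protocol; fetch and store stand in for the RPCs to a Vice server:

```python
class Venus:
    """Whole-file caching with callback promises (sketch)."""
    def __init__(self, fetch, store):
        self.fetch, self.store = fetch, store   # assumed Vice RPCs
        self.cache = {}    # path -> whole-file local copy (bytes)
        self.valid = {}    # path -> callback promise still valid?

    def open(self, path):
        if path not in self.cache or not self.valid.get(path):
            self.cache[path] = self.fetch(path)  # whole file + fresh promise
            self.valid[path] = True
        return self.cache[path]   # the kernel then works on the local copy

    def close(self, path, dirty):
        if dirty:
            # Vice stores the copy and cancels the other clients' promises.
            self.store(path, self.cache[path])

    def callback_cancelled(self, path):
        # Cancellation message from Vice: the file was updated elsewhere.
        self.valid[path] = False
```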
Distributed file systems: AFS
• Caching
  – callback principle:
    • the server supplies a callback promise at open
    • the promise is stored with the file in the client cache
  – a callback promise can be valid or cancelled:
    • initially valid
    • the server sends a message to cancel the callback promise
      – to all clients that cache the file
      – whenever the file is updated
Distributed file systems: AFS
• Caching maintenance
  – when a client workstation reboots:
    • cache validation is necessary because of possibly missed messages
    • a cache validation request is sent for each valid promise
  – valid callback promises are renewed:
    • on open
    • when no communication has occurred with the server during a period T
Distributed file systems: AFS
• Update semantics
  – guarantee after a successful open of file F on server S:
    latest(F, S, 0)
    or (lostCallback(S, T) and inCache(F) and latest(F, S, T))
  – no other concurrency control:
    • 2 copies can be updated at different workstations
    • all updates except those from the last close are (silently) lost
    • differs from normal UNIX operation
Distributed file systems: AFS
• Performance: impressive compared to NFS
  – benchmark: load on the server
    • 40% for AFS
    • 100% for NFS
  – whole-file caching:
    • reduces the load on the servers
    • minimises the effect of network latency
  – read-only volumes are replicated (a master copy for occasional updates)
  – Andrew is optimised for a specific pattern of use!
Distributed file systems
• Overview
  – File service model
  – Case study: NFS
  – Case study: AFS
  – Comparison: NFS vs. AFS
Distributed file systems: NFS vs. AFS
• Access transparency
  – offered by both NFS and AFS
  – the Unix file system interface is offered
• Location transparency
  – uniform view on shared files in AFS
  – in NFS:
    • mounting freedom
    • the same view is possible if everyone mounts identically; discipline!
Distributed file systems: NFS vs. AFS
• Failure transparency
  – NFS:
    • no client state stored in servers
    • idempotent operations
    • transparency limited by soft mounting
  – AFS:
    • state about clients stored in servers
    • the cache maintenance protocol handles server crashes
    • limitations?
Distributed file systems: NFS vs. AFS
• Performance transparency
  – NFS:
    • acceptable performance degradation
  – AFS:
    • only a delay for the first open operation on a file
    • better than NFS for small files
Distributed file systems: NFS vs. AFS
• Migration transparency
  – limited: an update of locations is required
  – NFS:
    • update the configuration files on all clients
  – AFS:
    • update the replicated location database
Distributed file systems: NFS vs. AFS
• Replication transparency
  – NFS:
    • not supported
  – AFS:
    • limited support, for read-only volumes
    • one master copy for exceptional updates; manual procedure for propagating changes to the other volumes
Distributed file systems: NFS vs. AFS
• Concurrency transparency
  – not supported (as in UNIX)
• Scalability transparency
  – AFS better than NFS
Distributed file systems
• Overview
  – File service architecture
  – Case study: NFS
  – Case study: AFS
  – Comparison: NFS vs. AFS
Distributed file systems
• Conclusions
  – standards ≠ quality
  – evolution towards standards and a common key interface
    • AFS-3 incorporates Kerberos, the vnode interface, large messages for file blocks (64 KB)
  – network (mainly access) transparency causes inheritance of weaknesses: e.g. no concurrency control
  – evolution mainly performance-driven
Distributed file systems
• Good system design:
  – partitioning
    • different storage volumes
  – replication of servers
    • high availability
    • limited inconsistencies
    • load balancing?
    • server preference?
  – caching
    • acceptable performance
    • limited inconsistencies
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems
Peer-to-peer systems
• Overview
  – Introduction
  – Napster
  – Middleware
  – Routing algorithms
  – OceanStore / Pond file store
Peer-to-peer systems: introduction
• Definition 1:
  – support useful distributed services
  – using the data and computing resources available in the PCs and workstations on the Internet
• Definition 2:
  – applications that exploit resources available at the edges of the Internet
Peer-to-peer systems: introduction
• Characteristics:
  – each user contributes resources to the system
  – all nodes have the same functional capabilities and responsibilities
  – correct operation does not depend on the existence of any centrally administered systems
  – can offer a limited degree of anonymity
  – efficient algorithms for the placement of data across many hosts and for subsequent access
  – efficient algorithms for load balancing and availability of data
    (note: the availability of participating computers is unpredictable)
Peer-to-peer systems: introduction
• 3 generations of peer-to-peer systems and application development:
  – Napster: music exchange service
  – file-sharing applications offering greater scalability, anonymity and fault tolerance: Freenet, Gnutella, Kazaa
  – middleware for the application-independent management of distributed resources on a global scale: Pastry, Tapestry, …
Peer-to-peer systems: Napster
• download digital music files
• architecture:
  – centralised indices
  – files stored and accessed on the peers' PCs
• method of operation (next slide)
Peer-to-peer systems: Napster
[Figure: method of operation. 1. A peer sends a file location request to a Napster index server. 2. The server returns a list of peers offering the file. 3. The requesting peer sends a file request to one of them. 4. The file is delivered. 5. The requester updates the index; sharing of the new copy is expected!]
Peer-to-peer systems: Napster
• Conclusion:
  – feasibility demonstrated
  – simple load-balancing techniques
  – replicated unified index of all available music files
  – no updates of files
  – availability not a hard concern
  – legal issues: anonymity
Peer-to-peer systems
• Overview
  – Introduction
  – Napster
  – Middleware
  – Routing algorithms
  – OceanStore / Pond file store
Peer-to-peer systems: middleware
• Goal:
  – automatic placement and subsequent location of distributed objects
• Functional requirements:
  – locate and communicate with any resource
  – add and remove resources
  – add and remove hosts
  – API independent of the type of resource
Peer-to-peer systems: middleware
• Non-functional requirements:
  – global scalability (millions of objects on hundreds of thousands of hosts)
  – optimization for local interactions between neighbouring peers
  – accommodation of highly dynamic host availability
  – security of data in an environment with heterogeneous trust
  – anonymity, deniability and resistance to censorship
Peer-to-peer systems: middleware
• Approach:
  – knowledge of the locations of objects is
    • partitioned
    • distributed
    • replicated (e.g. 16×)
    throughout the network
  = a routing overlay:
    • a distributed algorithm for locating objects and nodes
Peer-to-peer systems: middleware
[Figure: a routing overlay. Objects and nodes are scattered over the network; each of the nodes A, B, C and D holds partial routing knowledge, and together their knowledge suffices to route to any object.]
Peer-to-peer systems: middleware
• Resource identification:
  – GUID = globally unique identifier
  – a secure hash computed from
    • all of the state → self-certifying
      – for immutable objects
    • part of the state
      – e.g. name + …
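A sketch of a self-certifying GUID; Pastry GUIDs are 128-bit, and the choice of SHA-256 truncated to 128 bits is illustrative, not prescribed by the slides:

```python
import hashlib

def guid_from_state(state: bytes) -> int:
    """Self-certifying GUID for an immutable object (sketch)."""
    digest = hashlib.sha256(state).digest()   # illustrative hash choice
    return int.from_bytes(digest[:16], "big") # 128-bit identifier

# Anyone holding the object can recompute the hash to verify its GUID.
```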
Peer-to-peer systems: middleware
• API for a DHT (distributed hash table), as in Pastry:
  – put(GUID, data): the data is stored in replicas at all nodes responsible for the object identified by GUID
  – remove(GUID): deletes all references to GUID and the associated data
  – value = get(GUID): the data associated with GUID is retrieved from one of the nodes responsible for it
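To fix the shape of this interface, a single-process stand-in with no routing or replication:

```python
class LocalDHT:
    """Single-process stand-in for the DHT API (sketch)."""
    def __init__(self):
        self.store = {}

    def put(self, guid, data):
        # A real DHT would replicate at all nodes responsible for guid.
        self.store[guid] = data

    def get(self, guid):
        return self.store[guid]

    def remove(self, guid):
        del self.store[guid]
```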
Peer-to-peer systems: middleware
• API for a DOLR (distributed object location and routing), as in Tapestry:
  – publish(GUID): makes the node performing the publish operation the host for the object corresponding to GUID
  – unpublish(GUID): makes the object corresponding to GUID inaccessible
  – sendToObj(msg, GUID, [n]): an invocation message is sent to an object in order to access it; the optional parameter [n] requests delivery of the same message to n replicas of the object
Peer-to-peer systems
• Overview
  – Introduction
  – Napster
  – Middleware
  – Routing algorithms
  – OceanStore / Pond file store
Peer-to-peer systems: routing algorithms
• prefix routing
  – used in Pastry, Tapestry
• GUID: 128-bit
  – host: secure hash of the public key
  – object: secure hash of the name or of part of the state
• N participating nodes:
  – O(log N) steps to route a message to any GUID
  – O(log N) messages to integrate a new node
Peer-to-peer systems: routing algorithms
• Simplified algorithm: circular routing
  – a leaf set in each active node
  – element of the leaf set:
    GUID and IP address of the nodes whose GUIDs are numerically closest on either side of the node's own GUID
  – size L = 2·l
  – leaf sets are updated when nodes
    • join
    • leave
Peer-to-peer systems: routing algorithms
[Figure: the circular GUID space from 0 to FFFFF…F (2^128 − 1); each dot is a live node; leaf-set size 8. Shown: routing a message from node 65A1FC to D46A1C, hopping via D13DA3, D471F1 and D467C4.]
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry)
  – tree-structured routing table:
    • GUID-IP pairs spread throughout the entire range of GUIDs
    • increased density of coverage for GUIDs numerically close to the GUID of the local node
    • GUIDs viewed as hexadecimal values
    • the table classifies GUIDs based on their hexadecimal prefixes
    • as many rows as hexadecimal digits
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry)
  [Figure: a Pastry routing table; row p holds entries for GUIDs that share a prefix of length p with the local GUID, one column per possible next hexadecimal digit.]
Peer-to-peer systems: routing algorithms
[Figure: routing a message from 65A1FC to D46A1C with the full routing table: each hop (via D13DA3, D4213F, D462BA, D471F1, D467C4) shares a longer GUID prefix with the destination. With a well-populated table: ~log16(N) hops.]
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry)
  To handle a message M addressed to node D (where R[p,i] is the element at column i, row p of the routing table):
  1. If (L−l ≤ D ≤ Ll) {  // the destination is within the leaf set or is the current node
  2.   Forward M to the element Li of the leaf set with GUID closest to D, or to the current node A.
  3. } else {  // use the routing table to despatch M to a node with a closer GUID
  4.   Find p, the length of the longest common prefix of D and A, and i, the (p+1)th hexadecimal digit of D.
  5.   If (R[p,i] ≠ null) forward M to R[p,i]  // route M to a node with a longer common prefix
  6.   else {  // there is no entry in the routing table
  7.     Forward M to any node in L or R with a common prefix of length p, but a GUID that is numerically closer.
       }
     }
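An executable sketch of one routing step; the leaf set and routing table are plain Python structures, and GUIDs are 128-bit integers:

```python
def shared_prefix_len(a, b, digits=32):
    """Number of leading hexadecimal digits shared by two 128-bit GUIDs."""
    ha, hb = f"{a:032x}", f"{b:032x}"
    p = 0
    while p < digits and ha[p] == hb[p]:
        p += 1
    return p

def hex_digit(guid, p):
    """The (p+1)-th hexadecimal digit of a 128-bit GUID."""
    return int(f"{guid:032x}"[p], 16)

def next_hop(D, A, leaf_set, R):
    """One Pastry routing step (sketch of the pseudocode above).

    A: local GUID; leaf_set: list of leaf GUIDs; R: 32x16 routing table
    holding GUIDs or None."""
    if leaf_set and min(leaf_set) <= D <= max(leaf_set):
        # Destination within the leaf set: deliver to the numerically closest.
        return min(leaf_set + [A], key=lambda g: abs(g - D))
    p = shared_prefix_len(A, D)
    i = hex_digit(D, p)
    if R[p][i] is not None:
        return R[p][i]               # node sharing a longer prefix with D
    # Rare case: no table entry; fall back to any known node closer to D.
    known = [g for row in R for g in row if g is not None] + leaf_set
    return min(known, key=lambda g: abs(g - D))
```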
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): host integration
  – a new node X:
    • computes its GUID_X
    • should know at least one nearby Pastry node A with GUID_A
    • sends a special join message to A, with destination GUID_X
  – A despatches the join message via Pastry
    • route of the message: A, B, …, Z (GUID_Z numerically closest to GUID_X)
  – each node on the route sends the relevant part of its routing table and leaf set to X
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): host integration (cont.)
  – route of the join message: A, B, …, Z
  – routing table of X:
    • row 0 of A → row 0 of X
    • B and X share i hexadecimal digits: row i of B → row i of X
    • …
    • choice between candidate entries? proximity neighbour selection (metric: IP hops, measured latency, …)
  – leaf set of Z → leaf set of X
  – X sends its leaf set and routing table to all nodes in its routing table and leaf set
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): host failure
  – repair actions to update leaf sets: a leaf set is requested from node(s) close to the failed node
  – routing table: repairs on a "when discovered" basis
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): fault tolerance
  – are all nodes in the leaf set alive?
    • each node sends heartbeat messages to the nodes in its leaf set
  – malicious nodes?
    • clients can use an at-least-once delivery mechanism
    • use limited randomness in the node selection of the routing algorithm
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): dependability
  – include acks in the routing algorithm; no ack →
    • alternative route
    • node: suspected failure
  – heartbeat messages
  – suspected failed nodes in routing tables are
    • probed
    • on failure: replaced by alternative nodes
  – a simple gossip protocol to exchange routing table information between nodes
Peer-to-peer systems: routing algorithms
• Full algorithm (Pastry): evaluation (100,000 messages)
  – message loss
  – performance: Relative Delay Penalty (RDP) = Pastry delivery time / IP (UDP) delivery time

  IP loss rate   Pastry loss   Pastry wrong node   RDP
  0%             1.5           0                   1.8
  5%             3.3           1.6                 2.2
Peer-to-peer systems
• Overview
  – Introduction
  – Napster
  – Middleware
  – Routing algorithms
  – OceanStore / Pond file store
Distributed file systems: OceanStore
• Objectives:
  – a very large scale, incrementally-scalable, persistent storage facility
  – mutable data objects with long-term persistence and reliability
  – an environment of constantly changing network and computing resources
• Intended use
• Mechanisms used
Distributed file systems: OceanStore
• Intended use:
  – NFS-like file system
  – electronic mail hosting
• Mechanisms needed:
  – consistency between replicas
    • tailored to application needs by a Bayou-like system
  – privacy and integrity
    • encryption of data
    • Byzantine agreement protocol for updates
Distributed file systems: OceanStore
• Storage organization
  [Figure: an object's AGUID refers to a certificate recording the VGUID of the current version; each version is a tree: a root block (VGUID) pointing via indirection blocks to data blocks d1…d5. Version i+1 has been updated in blocks d1, d2 and d3; unchanged blocks are shared with version i via BGUID pointers (copy-on-write). The certificate and the root blocks include some metadata not shown; all unlabelled arrows are BGUIDs.]
Distributed file systems: OceanStore
• Storage organization (cont.)

  Name    Meaning        Description
  BGUID   block GUID     secure hash of a data block
  VGUID   version GUID   BGUID of the root block of a version
  AGUID   active GUID    uniquely identifies all the versions of an object
Distributed file systems: OceanStore
• Storage organization (cont.)
  – new object → AGUID:
    • a small set of hosts acts as the inner ring
    • publish(AGUID)
    • the AGUID refers to a signed certificate recording the sequence of versions of the object
  – update of an object:
    • Byzantine agreement between the hosts in the inner ring (primary copy)
    • the result is disseminated to the secondary replicas
Distributed file systems: OceanStore
• Performance (Andrew benchmark, times per phase)

  Phase   LAN: Linux NFS   LAN: Pond   WAN: Linux NFS   WAN: Pond   Predominant operations
  1       0.0              1.9         0.9              2.8         read and write
  2       0.3              11.0        9.4              16.8        read and write
  3       1.1              1.8         8.3              1.8         read
  4       0.5              1.5         6.9              1.5         read
  5       2.6              21.0        21.5             32.0        read and write
  Total   4.5              37.2        47.0             54.9
Distributed file systems: OceanStore
• Performance (cont.)
  – benchmark phases:
    • create subdirectories recursively
    • copy a source tree
    • examine the status of all files in the tree
    • examine every byte of data in all files
    • compile and link the files
This chapter: overview
• Introduction
• Time services
• Name services
• Distributed file systems
• Peer-to-peer systems