Distributed Systems Principles and Paradigms - FCUPlblopes/aulas/sd/ch4.pdf · Distributed Systems...
Transcript of Distributed Systems Principles and Paradigms - FCUPlblopes/aulas/sd/ch4.pdf · Distributed Systems...
Distributed SystemsPrinciples and Paradigms
Maarten van Steen
VU Amsterdam, Dept. Computer ScienceRoom R4.20, [email protected]
Chapter 04: CommunicationVersion: November 5, 2009
Contents
Chapter01: Introduction02: Architectures03: Processes04: Communication05: Naming06: Synchronization07: Consistency & Replication08: Fault Tolerance09: Security10: Distributed Object-Based Systems11: Distributed File Systems12: Distributed Web-Based Systems13: Distributed Coordination-Based Systems
2 / 52
Communication 4.1 Layered Protocols
Layered Protocols
Low-level layersTransport layerApplication layerMiddleware layer
3 / 52
Communication 4.1 Layered Protocols
Basic networking model
Physical
Data link
Network
Transport
Session
Application
Presentation
Application protocol
Presentation protocol
Session protocol
Transport protocol
Network protocol
Data link protocol
Physical protocol
Network
1
2
3
4
5
7
6
DrawbacksFocus on message-passing onlyOften unneeded or unwanted functionalityViolates access transparency
4 / 52
Communication 4.1 Layered Protocols
Low-level layers
RecapPhysical layer: contains the specification and implementation ofbits, and their transmission between sender and receiverData link layer: prescribes the transmission of a series of bits intoa frame to allow for error and flow controlNetwork layer: describes how packets in a network of computersare to be routed.
ObservationFor many distributed systems, the lowest-level interface is that of thenetwork layer.
5 / 52
Communication 4.1 Layered Protocols
Transport Layer
ImportantThe transport layer provides the actual communication facilities formost distributed systems.
Standard Internet protocolsTCP: connection-oriented, reliable, stream-orientedcommunicationUDP: unreliable (best-effort) datagram communication
NoteIP multicasting is often considered a standard available service (whichmay be dangerous to assume).
6 / 52
Communication 4.1 Layered Protocols
Middleware Layer
ObservationMiddleware is invented to provide common services and protocolsthat can be used by many different applications
A rich set of communication protocols(Un)marshaling of data, necessary for integrated systemsNaming protocols, to allow easy sharing of resourcesSecurity protocols for secure communicationScaling mechanisms, such as for replication and caching
NoteWhat remains are truly application-specific protocols...such as?
7 / 52
Communication 4.1 Layered Protocols
Types of communication
Client
Server
Synchronize after processing by server
Synchronize at request delivery
Synchronize at request submission
Request
Reply
Storage facility
Transmission interrupt
Time
DistinguishTransient versus persistent communicationAsynchrounous versus synchronous communication
8 / 52
Communication 4.1 Layered Protocols
Types of communication
Client
Server
Synchronize after processing by server
Synchronize at request delivery
Synchronize at request submission
Request
Reply
Storage facility
Transmission interrupt
Time
Transient versus persistentTransient communication: Comm. server discards message when itcannot be delivered at the next server, or at the receiver.Persistent communication: A message is stored at a communicationserver as long as it takes to deliver it.
9 / 52
Communication 4.1 Layered Protocols
Types of communication
Client
Server
Synchronize after processing by server
Synchronize at request delivery
Synchronize at request submission
Request
Reply
Storage facility
Transmission interrupt
Time
Places for synchronization
At request submissionAt request deliveryAfter request processing
10 / 52
Communication 4.1 Layered Protocols
Client/Server
Some observationsClient/Server computing is generally based on a model of transientsynchronous communication:
Client and server have to be active at time of commun.Client issues request and blocks until it receives replyServer essentially waits only for incoming requests, andsubsequently processes them
Drawbacks synchronous communicationClient cannot do any other work while waiting for replyFailures have to be handled immediately: the client is waitingThe model may simply not be appropriate (mail, news)
11 / 52
Communication 4.1 Layered Protocols
Client/Server
Some observationsClient/Server computing is generally based on a model of transientsynchronous communication:
Client and server have to be active at time of commun.Client issues request and blocks until it receives replyServer essentially waits only for incoming requests, andsubsequently processes them
Drawbacks synchronous communicationClient cannot do any other work while waiting for replyFailures have to be handled immediately: the client is waitingThe model may simply not be appropriate (mail, news)
11 / 52
Communication 4.1 Layered Protocols
Messaging
Message-oriented middlewareAims at high-level persistent asynchronous communication:
Processes send each other messages, which are queuedSender need not wait for immediate reply, but can do other thingsMiddleware often ensures fault tolerance
12 / 52
Communication 4.2 Remote Procedure Call
Remote Procedure Call (RPC)
Basic RPC operationParameter passingVariations
13 / 52
Communication 4.2 Remote Procedure Call
Basic RPC operation
ObservationsApplication developers are familiar with simple procedure modelWell-engineered procedures operate in isolation (black box)There is no fundamental reason not to execute procedures onseparate machine
ConclusionCommunication between caller &callee can be hidden by usingprocedure-call mechanism.
Call local procedureand return results
Call remoteprocedure
Returnfrom call
Client
Request Reply
ServerTime
Wait for result
14 / 52
Communication 4.2 Remote Procedure Call
Basic RPC operation
Implementationof add
Client OS Server OS
Client machine Server machine
Client stub
Client process Server process1. Client call to
procedure
2. Stub buildsmessage
5. Stub unpacksmessage
6. Stub makeslocal call to "add"
3. Message is sentacross the network
4. Server OShands messageto server stub
Server stubk = add(i,j) k = add(i,j)
proc: "add"int: val(i)int: val(j)
proc: "add"int: val(i)int: val(j)
proc: "add"int: val(i)int: val(j)
1 Client procedure calls client stub.2 Stub builds message; calls local OS.3 OS sends message to remote OS.4 Remote OS gives message to stub.5 Stub unpacks parameters and calls
server.
6 Server returns result to stub.7 Stub builds message; calls OS.8 OS sends message to client’s OS.9 Client’s OS gives message to stub.
10 Client stub unpacks result and returns tothe client.
15 / 52
Communication 4.2 Remote Procedure Call
RPC: Parameter passing
Parameter marshalingThere’s more than just wrapping parameters into a message:
Client and server machines may have different datarepresentations (think of byte ordering)Wrapping a parameter means transforming a value into asequence of bytesClient and server have to agree on the same encoding:
How are basic data values represented (integers, floats, characters)How are complex data values represented (arrays, unions)
Client and server need to properly interpret messages,transforming them into machine-dependent representations.
16 / 52
Communication 4.2 Remote Procedure Call
RPC: Parameter passing
RPC parameter passing: some assumptionsCopy in/copy out semantics: while procedure is executed, nothing canbe assumed about parameter values.All data that is to be operated on is passed by parameters. Excludespassing references to (global) data.
ConclusionFull access transparency cannot be realized.
ObservationA remote reference mechanism enhances access transparency:
Remote reference offers unified access to remote dataRemote references can be passed as parameter in RPCs
17 / 52
Communication 4.2 Remote Procedure Call
Asynchronous RPCs
EssenceTry to get rid of the strict request-reply behavior, but let the clientcontinue without waiting for an answer from the server.
Call local procedure
Call remoteprocedure
Returnfrom call
Request Accept request
Wait for acceptance
Call local procedureand return results
Call remoteprocedure
Returnfrom call
Client Client
Request Reply
Server ServerTime Time
Wait for result
(a) (b)
18 / 52
Communication 4.2 Remote Procedure Call
Deferred synchronous RPCs
Call local procedure
Call remoteprocedure
Returnfrom call
Client
RequestAcceptrequest
ServerTime
Wait foracceptance
Interrupt client
Returnresults Acknowledge
Call client withone-way RPC
VariationClient can also do a (non)blocking poll at the server to see whetherresults are available.
19 / 52
Communication 4.2 Remote Procedure Call
RPC in practice
C compiler
Uuidgen
IDL compiler
C compiler C compiler
Linker Linker
C compiler
Server stubobject file
Serverobject file
Runtimelibrary
Serverbinary
Clientbinary
Runtimelibrary
Client stubobject file
Clientobject file
Client stubClient code Header Server stub
Interfacedefinition file
Server code
#include#include
20 / 52
Communication 4.2 Remote Procedure Call
Client-to-server binding (DCE)
Issues(1) Client must locate server machine, and (2) locate the server.
Endpointtable
Server
DCEdaemon
Client1. Register endpoint
2. Register service3. Look up server
4. Ask for endpoint
5. Do RPC
Directoryserver
Server machineClient machine
Directory machine
21 / 52
Communication 4.3 Message-Oriented Communication
Message-Oriented Communication
Transient MessagingMessage-Queuing SystemMessage BrokersExample: IBM Websphere
22 / 52
Communication 4.3 Message-Oriented Communication
Transient messaging: sockets
Berkeley socket interface
SOCKET Create a new communication endpointBIND Attach a local address to a socketLISTEN Announce willingness to accept N connectionsACCEPT Block until request to establish a connectionCONNECT Attempt to establish a connectionSEND Send data over a connectionRECEIVE Receive data over a connectionCLOSE Release the connection
23 / 52
Communication 4.3 Message-Oriented Communication
Transient messaging: sockets
connect
socket
socket
bind listen read
read
write
write
accept close
close
Server
Client
Synchronization point Communication
24 / 52
Communication 4.3 Message-Oriented Communication
Message-oriented middleware
EssenceAsynchronous persistent communication through support ofmiddleware-level queues. Queues correspond to buffers atcommunication servers.
PUT Append a message to a specified queueGET Block until the specified queue is nonempty, and re-
move the first messagePOLL Check a specified queue for messages, and remove
the first. Never blockNOTIFY Install a handler to be called when a message is put
into the specified queue
25 / 52
Communication 4.3 Message-Oriented Communication
Message broker
ObservationMessage queuing systems assume a common messaging protocol: allapplications agree on message format (i.e., structure and datarepresentation)
Message brokerCentralized component that takes care of application heterogeneity inan MQ system:
Transforms incoming messages to target formatVery often acts as an application gatewayMay provide subject-based routing capabilities⇒ EnterpriseApplication Integration
26 / 52
Communication 4.3 Message-Oriented Communication
Message broker
Queuinglayer
Brokerprogram
Repository with conversion rules and programsSource client Destination client
OS OSOS
Message broker
Network
27 / 52
Communication 4.3 Message-Oriented Communication
IBM’s WebSphere MQ
Basic conceptsApplication-specific messages are put into, and removed fromqueuesQueues reside under the regime of a queue managerProcesses can put messages only in local queues, or through anRPC mechanism
28 / 52
Communication 4.3 Message-Oriented Communication
IBM’s WebSphere MQ
Message transferMessages are transferred between queuesMessage transfer between queues at different processes, requiresa channelAt each endpoint of channel is a message channel agentMessage channel agents are responsible for:
Setting up channels using lower-level network communicationfacilities (e.g., TCP/IP)(Un)wrapping messages from/in transport-level packetsSending/receiving packets
29 / 52
Communication 4.3 Message-Oriented Communication
IBM’s WebSphere MQ
MCA MCAMCA MCA
MQ Interface
Stub StubServerstub
Serverstub
Send queue
Program ProgramQueuemanager
Queuemanager
Routing table
Enterprise networkRPC(synchronous)
Local network
Message passing(asynchronous)
To other remotequeue managers
Client's receivequeueSending client Receiving client
Channels are inherently unidirectionalAutomatically start MCAs when messages arriveAny network of queue managers can be createdRoutes are set up manually (system administration)
30 / 52
Communication 4.3 Message-Oriented Communication
IBM’s WebSphere MQ
Routing
By using logical names, in combination with name resolution to local queues,it is possible to put a message in a remote queue
SQ1SQ2
SQ1
SQ1SQ1SQ2
QMBQMCQMD
SQ1
SQ1SQ1
SQ1
SQ2SQ1
SQ1
SQ1SQ1
QMA
QMA QMA
QMC
QMCQMB
QMD
QMBQMD
Routing tableRouting table
Routing table Routing table
QMB
QMC
QMA
LA1LA1
LA1
LA2LA2
LA2
QMCQMA
QMA
QMDQMD
QMC
Alias tableAlias table
Alias table
QMD
SQ1
SQ2
SQ1
31 / 52
Communication 4.4 Stream-Oriented Communication
Stream-oriented communication
Support for continuous mediaStreams in distributed systemsStream management
32 / 52
Communication 4.4 Stream-Oriented Communication
Continuous media
ObservationAll communication facilities discussed so far are essentially based on adiscrete, that is time-independent exchange of information
Continuous mediaCharacterized by the fact that values are time dependent:
AudioVideoAnimationsSensor data (temperature, pressure, etc.)
33 / 52
Communication 4.4 Stream-Oriented Communication
Continuous media
Transmission modesDifferent timing guarantees with respect to data transfer:
Asynchronous: no restrictions with respect to when data is to bedeliveredSynchronous: define a maximum end-to-end delay for individualdata packetsIsochronous: define a maximum and minimum end-to-end delay(jitter is bounded)
34 / 52
Communication 4.4 Stream-Oriented Communication
Stream
DefinitionA (continuous) data stream is a connection-oriented communicationfacility that supports isochronous data transmission.
Some common stream characteristicsStreams are unidirectionalThere is generally a single source, and one or more sinksOften, either the sink and/or source is a wrapper around hardware(e.g., camera, CD device, TV monitor)Simple stream: a single flow of data, e.g., audio or videoComplex stream: multiple data flows, e.g., stereo audio orcombination audio/video
35 / 52
Communication 4.4 Stream-Oriented Communication
Streams and QoS
EssenceStreams are all about timely delivery of data. How do you specify thisQuality of Service (QoS)? Basics:
The required bit rate at which data should be transported.The maximum delay until a session has been set up (i.e., when anapplication can start sending data).The maximum end-to-end delay (i.e., how long it will take until adata unit makes it to a recipient).The maximum delay variance, or jitter.The maximum round-trip delay.
36 / 52
Communication 4.4 Stream-Oriented Communication
Enforcing QoS
ObservationThere are various network-level tools, such as differentiated servicesby which certain packets can be prioritized.
AlsoUse buffers to reduce jitter:
0 5
1 2 3 4 5 6 7 8
10Time (sec)
Time in buffer
15 20
Gap in playback
Packet removed from buffer
1 2 3 4 5 6 7 8Packet arrives at buffer
1 2 3 4 5 6 7 8Packet departs source
37 / 52
Communication 4.4 Stream-Oriented Communication
Enforcing QoS
ProblemHow to reduce the effects of packet loss (when multiple samples are ina single packet)?
38 / 52
Communication 4.4 Stream-Oriented Communication
Enforcing QoS
1 2 3 4 5 6 7 8 9 10 11 12
1 5 9 13 2 6 10 14 3 7 11 15
13 14 15 16
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
4 8 12 16
Lost packet
Lost packet
Gap of lost frames
Lost frames
(a)
(b)
Sent
Delivered
Sent
Delivered
39 / 52
Communication 4.4 Stream-Oriented Communication
Stream synchronization
ProblemGiven a complex stream, how do you keep the different substreams insynch?
ExampleThink of playing out two channels, that together form stereo sound.Difference should be less than 20–30 µsec!
40 / 52
Communication 4.4 Stream-Oriented Communication
Stream synchronization
Network
Incoming stream
Application
Receiver's machine
Procedure that readstwo audio data units foreach video data unit
OS
AlternativeMultiplex all substreams into a single stream, and demultiplex at thereceiver. Synchronization is handled at multiplexing/demultiplexingpoint (MPEG).
41 / 52
Communication 4.4 Stream-Oriented Communication
Stream synchronization
Network
Incoming stream
Application
Receiver's machine
Procedure that readstwo audio data units foreach video data unit
OS
AlternativeMultiplex all substreams into a single stream, and demultiplex at thereceiver. Synchronization is handled at multiplexing/demultiplexingpoint (MPEG).
41 / 52
Communication 4.5 Multicast Communication
Multicast communication
Application-level multicastingGossip-based data dissemination
42 / 52
Communication 4.5 Multicast Communication
Application-level multicasting
EssenceOrganize nodes of a distributed system into an overlay network and use thatnetwork to disseminate data.
Chord-based tree building1 Initiator generates a multicast identifier mid.2 Lookup succ(mid), the node responsible for mid.3 Request is routed to succ(mid), which will become the root.4 If P wants to join, it sends a join request to the root.5 When request arrives at Q:
Q has not seen a join request before⇒ it becomes forwarder; Pbecomes child of Q. Join request continues to be forwarded.Q knows about tree⇒ P becomes child of Q. No need to forwardjoin request anymore.
43 / 52
Communication 4.5 Multicast Communication
ALM: Some costs
A
BD
C
Ra
RbRd
Rc
Internet
RouterEnd host
Overlay network
75
1
1
1
1
30
40
Re 20
Link stress: How often does an ALM message cross the samephysical link? Example: message from A to D needs to cross〈Ra,Rb〉 twice.Stretch: Ratio in delay between ALM-level path and network-levelpath. Example: messages B to C follow path of length 71 at ALM,but 47 at network level⇒ stretch = 71/47.
44 / 52
Communication 4.5 Multicast Communication
Epidemic Algorithms
General backgroundUpdate modelsRemoving objects
45 / 52
Communication 4.5 Multicast Communication
Principles
Basic ideaAssume there are no write–write conflicts:
Update operations are performed at a single serverA replica passes updated state to only a few neighborsUpdate propagation is lazy, i.e., not immediateEventually, each update should reach every replica
Two forms of epidemics
Anti-entropy: Each replica regularly chooses another replica at random,and exchanges state differences, leading to identical states at bothafterwardsGossiping: A replica which has just been updated (i.e., has beencontaminated), tells a number of other replicas about its update(contaminating them as well).
46 / 52
Communication 4.5 Multicast Communication
Anti-entropy
Principle operationsA node P selects another node Q from the system at random.Push: P only sends its updates to QPull: P only retrieves updates from QPush-Pull: P and Q exchange mutual updates (after which theyhold the same information).
ObservationFor push-pull it takes O(log(N)) rounds to disseminate updates to allN nodes (round = when every node as taken the initiative to start anexchange).
47 / 52
Communication 4.5 Multicast Communication
Gossiping
Basic modelA server S having an update to report, contacts other servers. If aserver is contacted to which the update has already propagated, Sstops contacting other servers with probability 1/k .
ObservationIf s is the fraction of ignorant servers (i.e., which are unaware of theupdate), it can be shown that with many servers
s = e−(k+1)(1−s)
48 / 52
Communication 4.5 Multicast Communication
Gossiping
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-15.0
-12.5
-10.0
-7.5
-5.0
-2.5
k
ln(s)
Consider 10,000 nodesk s Ns1 0.203188 20322 0.059520 5953 0.019827 1984 0.006977 705 0.002516 256 0.000918 97 0.000336 3
NoteIf we really have to ensure that all servers are eventually updated,gossiping alone is not enough
49 / 52
Communication 4.5 Multicast Communication
Gossiping
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-15.0
-12.5
-10.0
-7.5
-5.0
-2.5
k
ln(s)
Consider 10,000 nodesk s Ns1 0.203188 20322 0.059520 5953 0.019827 1984 0.006977 705 0.002516 256 0.000918 97 0.000336 3
NoteIf we really have to ensure that all servers are eventually updated,gossiping alone is not enough
49 / 52
Communication 4.5 Multicast Communication
Deleting values
Fundamental problemWe cannot remove an old value from a server and expect the removalto propagate. Instead, mere removal will be undone in due time usingepidemic algorithms
SolutionRemoval has to be registered as a special update by inserting a deathcertificate
50 / 52
Communication 4.5 Multicast Communication
Deleting values
Next problem
When to remove a death certificate (it is not allowed to stay for ever):
Run a global algorithm to detect whether the removal is knowneverywhere, and then collect the death certificates (looks like garbagecollection)Assume death certificates propagate in finite time, and associate amaximum lifetime for a certificate (can be done at risk of not reaching allservers)
NoteIt is necessary that a removal actually reaches all servers.
QuestionWhat’s the scalability problem here?
51 / 52
Communication 4.5 Multicast Communication
Example applications
Typical appsData dissemination: Perhaps the most important one. Note thatthere are many variants of dissemination.Aggregation: Let every node i maintain a variable xi . When twonodes gossip, they each reset their variable to
xi ,xj ← (xi + xj)/2
Result: in the end each node will have computed the averagex̄ = ∑i xi/N.
QuestionWhat happens if initially xi = 1 and xj = 0, j 6= i?
52 / 52
Communication 4.5 Multicast Communication
Example applications
Typical appsData dissemination: Perhaps the most important one. Note thatthere are many variants of dissemination.Aggregation: Let every node i maintain a variable xi . When twonodes gossip, they each reset their variable to
xi ,xj ← (xi + xj)/2
Result: in the end each node will have computed the averagex̄ = ∑i xi/N.
QuestionWhat happens if initially xi = 1 and xj = 0, j 6= i?
52 / 52