Lecture XII: Replication
CMPT 431 2008
Dr. Alexandra Fedorova
Replication
Why Replicate? (I)
• Fault-tolerance / high availability
– As long as one replica is up, the service is available
– Assume each of the n replicas fails independently with probability p
– Availability = 1 - p^n (a quick numeric check follows below)
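A quick check of the availability formula, as a Python sketch (the failure probabilities are made up for illustration):

    def availability(p, n):
        # Probability that at least one of n independent replicas is up
        return 1 - p ** n

    print(availability(0.01, 1))   # 0.99
    print(availability(0.01, 3))   # 0.999999 -- two nines become six nines

Availability improves exponentially in the number of replicas, which is why even a small n pays off.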
[Diagram: fault-tolerance take-over]
Why Replicate? (II)
• Fast local access (WAN replication)
– A client can always send requests to the closest replica
– Goal: no communication with remote replicas necessary during request execution
– Goal: the client experiences location transparency, since all access is fast local access
[Diagram: fast local access, with replicas in Toronto, Montreal, and Rome]
Why Replicate? (III)
• Scalability and load distribution (LAN replication)
– Requests can be distributed among replicas
– Handle increasing load by adding new replicas to the system
– A cluster instead of a bigger server
Challenges: Data Consistency
• We will study systems that use data replication
• It is hard, because data must be kept consistent
• Users submit operations against the logical copies of data
• These operations must be translated into operations against one, some, or all physical copies of data
• Nearly all existing approaches follow a ROWA(A) approach (sketched below):
– Read-one-write-all-(available)
– An update has to be (eventually) executed at all replicas to keep them consistent
– A read can be performed at any replica
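A minimal sketch of the ROWA idea in Python (the class and method names are illustrative, not from the lecture): writes go to all physical copies, and a read can be served by any one of them.

    class RowaStore:
        def __init__(self, n):
            self.replicas = [{} for _ in range(n)]   # n physical copies

        def write(self, key, value):
            for copy in self.replicas:               # write-all
                copy[key] = value

        def read(self, key, i=0):
            return self.replicas[i].get(key)         # read-one: any replica

    store = RowaStore(3)
    store.write("x", 5)
    print(store.read("x", i=2))   # 5 -- every copy saw the write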
Challenges: Fault Tolerance
• The goal is to have data available despite failures
• If one site fails, others should continue providing service
• How many replicas should we have? It depends on:
– How many faults we want to tolerate
– The types of faults we expect
– How much we are willing to pay
Roadmap
• Replication architectures
– Active replication
– Primary-backup (passive, master-slave) replication
• Design considerations for replicated services
• Surviving failures
Active Replication
[Diagram: a client sends request A to replicated servers B and C]
Active Replication
1. The client sends the request to the servers using totally ordered reliable multicast (logical clocks or vector clocks)
2. Server coordination is given by the total-order property (assumption: synchronous system)
3. All replicas execute requests in the order they are delivered
4. No additional coordination is necessary (assumption: determinism); all replicas produce the same result
5. All replicas send the result to the client; the client waits for the first answer (see the sketch below)
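A minimal in-process sketch of these five steps in Python. A single client thread feeding every inbox in the same order stands in for totally ordered reliable multicast (real systems need a multicast protocol for this), and all names are illustrative.

    import queue
    import threading

    def start_replica(name, inbox, answers):
        state = {}                              # deterministic state machine
        def run():
            while True:
                op, key, value = inbox.get()    # step 3: apply in delivery order
                if op == "write":
                    state[key] = value
                    result = "ok"
                else:                           # "read"
                    result = state.get(key)
                answers.put((name, result))     # step 5: every replica answers
        threading.Thread(target=run, daemon=True).start()

    answers = queue.Queue()
    inboxes = [queue.Queue() for _ in range(3)]
    for i, inbox in enumerate(inboxes):
        start_replica("replica-%d" % i, inbox, answers)

    def invoke(request):
        for inbox in inboxes:               # step 1: "multicast" the request
            inbox.put(request)
        name, result = answers.get()        # keep the first answer...
        for _ in range(len(inboxes) - 1):   # ...and drain the duplicates
            answers.get()
        return result

    invoke(("write", "x", 5))
    print(invoke(("read", "x", None)))      # 5, from whichever replica was fastest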
Fault Tolerance: Failstop Failures
• As long as at least one replica survives, the client will continue receiving service
• Assuming there are no partitions!
• Suppose B and C are partitioned, so they cannot communicate
• Then they cannot agree on how to order the client's requests
Fault Tolerance: Byzantine Failures
• Can survive Byzantine failures (assuming no partitions)
• The system must have n ≥ 2f + 1 replicas (f is the number of failures)
• The client compares the results of all replicas and chooses the result returned by the majority, i.e., by the f + 1 or more non-faulty replicas (see the sketch below)
• This is the idea used in LOCKSS (Lots of Copies Keep Stuff Safe)
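A sketch of the client-side vote, assuming the client has already collected one reply per replica (the function name is illustrative):

    from collections import Counter

    def majority_result(replies, f):
        # With n >= 2f + 1 replicas and at most f Byzantine ones, any value
        # reported by f + 1 or more replicas came from a correct replica
        value, count = Counter(replies).most_common(1)[0]
        if count >= f + 1:
            return value
        raise RuntimeError("no value reached f + 1 matching replies")

    print(majority_result([5, 5, 7], f=1))   # 5 -- the faulty replica is outvoted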
Primary-Backup Replication (PB)
[Diagram: the client sends request A to the primary, which forwards it to backups B and C]
• Also known as passive replication and master-slave replication
• If the primary fails, a backup takes over and becomes the primary
System Requirements
• How do we want the system to behave? Just like a single-server system?
– Must ensure that there is only one primary at a time
• Data is kept consistent:
– If a client received an acknowledgement of an update operation, that update must survive system crashes
– Results of operations should be the same as they would be if executed on a single-server system
• Can we tolerate loose data consistency?
– The client eventually gets the consistent data, but not right away
Example of Data Inconsistency
• Client operations:
write(x = 5)
read(x) // should return 5 on a single-server system
• On a replicated system:
write(x = 5)
The primary responds to the client
The primary crashes before propagating the update to the other replicas
A new primary is selected
read(x) // may return x ≠ 5; the new primary does not know about the update to x
Design Considerations for Replicated Services
• Where to submit updates?
– To a designated server or to any server?
• When to propagate updates?
– Eager or lazy?
• How many replicas to install?
Where to Submit Updates?
• Primary Copy:
– Each object has a primary copy
– Often there is a designated primary; it holds the primary copies for all objects
– Updates on object x have to be submitted to the primary copy of x
– The primary propagates changes on x to the secondary copies
– Secondary copies are read-only
– Also called the master/slave approach
Where to Submit Updates
• Update Everywhere:
– Both read and write operations can be submitted to any server
– This server takes care of executing the operation and propagating the updates to the other copies
[Diagram: transactions T1: r(x) w(y) and T2: r(y) w(y) submitted to different servers]
When to Propagate Updates?
• Eager:
– Within the boundaries of the transaction
– Before the response is sent to the client
• Lazy:
– After the commit of the transaction
– After the response is sent to the client
PB Replication with Eager Updates
1. The client sends the request to the primary
2. There is no initial coordination
3. The primary executes the request
4. The primary coordinates with the other replicas by sending the update information to the backups
5. The primary (or another replica) sends the answer to the client (a sketch follows)
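A minimal sketch of the eager protocol, with in-process objects standing in for networked servers (Primary, Backup, and apply are illustrative names, not from the lecture):

    class Backup:
        def __init__(self):
            self.state = {}

        def apply(self, key, value):        # step 4: install the update
            self.state[key] = value
            return "ack"

    class Primary:
        def __init__(self, backups):
            self.state = {}
            self.backups = backups

        def write(self, key, value):
            self.state[key] = value                   # step 3: execute locally
            for b in self.backups:                    # step 4: eager propagation
                assert b.apply(key, value) == "ack"   # wait for every backup
            return "ok"                               # step 5: reply only now

    primary = Primary([Backup(), Backup()])
    print(primary.write("x", 5))   # "ok" only after both backups acknowledged

Because the reply is sent only after all backups acknowledge, an acknowledged update survives a crash of the primary.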
Eager Update Propagation
Eager Update Propagation For Transactional Services
When Can a Failure Occur?
• F1: Primary fails before replica coordination
– The client receives no response. It will retry, and will eventually get the data from the new primary (a retry sketch follows the diagram below).
• F2: Primary fails during replica coordination
– Replicas may or may not have reached agreement w.r.t. the client's transaction. The client may receive a response after the system recovers. The system may fail to recover (if the agreement protocol blocks).
• F3: Primary fails after replica coordination
– A new primary responds
[Diagram: timeline of Phase 1 (client request), Phase 3 (execution), Phase 4 (replica coordination), and Phase 5 (client response), with failure points F1, F2, F3 between the phases]
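A sketch of case F1 from the client's side. It assumes requests carry a stable unique id so a new primary can de-duplicate a retried update; neither the id scheme nor any of these names come from the lecture.

    import time
    import uuid

    def submit_with_retry(send, request, timeout=1.0, attempts=5):
        request_id = uuid.uuid4().hex       # stays the same across retries
        for _ in range(attempts):
            try:
                return send(request_id, request, timeout)
            except TimeoutError:
                time.sleep(timeout)         # fail-over may be in progress
        raise RuntimeError("no primary responded")

    calls = {"n": 0}
    def fake_send(request_id, request, timeout):
        calls["n"] += 1
        if calls["n"] == 1:
            raise TimeoutError              # old primary crashed (case F1)
        return "ok: " + request             # new primary answers the retry

    print(submit_with_retry(fake_send, "write x=5"))   # ok: write x=5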
Lazy Update Propagation (Transactional Services)
• Primary Copy:
– Upon read: read locally and return to the user
– Upon write: write locally and return to the user
– Upon commit/abort: terminate locally
– Sometime after commit: multicast the changed objects in a single message to the other sites (in FIFO order)
Lazy Update Propagation (Continued)
• Secondary copy:
– Upon read: read locally
– Upon message from the primary copy: install all changes (FIFO)
– Upon write from a client: refuse (writing clients must submit to the primary copy)
– Upon commit/abort request (only for a read-only txn): local commit (a sketch of both roles follows)
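A minimal sketch of the two roles, with an in-process FIFO queue standing in for the FIFO multicast (all names are illustrative). It also reproduces the stale read the next slide warns about.

    import queue

    class LazyPrimary:
        def __init__(self, feed):
            self.state, self.dirty, self.feed = {}, {}, feed

        def write(self, key, value):
            self.state[key] = value          # write locally...
            self.dirty[key] = value
            return "ok"                      # ...and reply immediately

        def after_commit(self):
            self.feed.put(dict(self.dirty))  # one message, after the response
            self.dirty.clear()

    class LazySecondary:
        def __init__(self, feed):
            self.state, self.feed = {}, feed

        def read(self, key):
            return self.state.get(key)       # may be stale

        def poll(self):
            while not self.feed.empty():     # install changes in FIFO order
                self.state.update(self.feed.get())

    feed = queue.Queue()
    p, s = LazyPrimary(feed), LazySecondary(feed)
    p.write("x", 5)
    print(s.read("x"))   # None -- the update has not propagated yet
    p.after_commit()
    s.poll()
    print(s.read("x"))   # 5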
Lazy Update Propagation
A client may end up with an inconsistent view of the system
Lazy Propagation: Discussion
• Lazy replication has no server/agreement coordination within the response time
– Faster
– Transactions might be lost in case of a primary crash
• Weak data consistency
– Simple to achieve
– Secondary copies only need to apply updates in FIFO order
– Data at secondary copies might be stale
• Multiple primaries possible (multi-master replication)
– More locality
Fault Handling
• Properties of a correct PB protocol:
– Property 1: There is at most one primary at any time
– Property 2: Each client maintains the identity of the primary, and sends its requests only to the primary
– Property 3: If a client update arrives at a backup, it is not processed
• When a primary fails, we must elect a new one
• Network partitions may cause election of more than one primary
• We can avoid network partitions by choosing the right number of replicas (under certain failure assumptions)
• How many replicas do we need to tolerate failures?
System Model
• Synchronous system (useful for deriving theoretical results)
• Fully connected network (exactly one FIFO link between any two processes)
• Failure model:
– Crash failures: also known as failstop failures
– Crash+link failures: a server may crash, or a link may lose messages (but links do not delay, duplicate, or corrupt messages)
– Receive-omission failures: a server may crash and also omit to receive some of the messages sent over a non-faulty link
– Send-omission failures: a server may fail not only by crashing but also by omitting to send some messages over a non-faulty link
– General-omission failures: a server may exhibit both send-omission and receive-omission failures
Lower Bounds on Replication
• How many replicas n do you need to tolerate f failures?
Failure Model        Degree of Replication
crash                n > f
crash+link           n > f + 1
receive-omission     n > 3f/2
send-omission        n > f
general-omission     n > 2f
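A worked reading of the table, as a small sketch (sources differ on floor vs. ceiling conventions for the receive-omission bound, so treat the 3f/2 boundary here as approximate):

    def min_replicas(model, f):
        # Smallest n satisfying the table's lower bound n > bound(f)
        bound = {
            "crash": f,
            "crash+link": f + 1,
            "receive-omission": (3 * f) // 2,
            "send-omission": f,
            "general-omission": 2 * f,
        }[model]
        return bound + 1

    print(min_replicas("crash", 2))              # 3
    print(min_replicas("general-omission", 2))   # 5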
Crash Failures, Send-Omission Failures: n > f Replicas
[Diagram: all replicas but one FAILED (crashed or failing to send); the one surviving replica becomes primary]
Other Failure Models
• The rest of the failure models may create partitions
• Partitions: servers are divided into mutually non-communicating groups
• A primary may emerge in each partition, so we'll have more than one primary, which is against the rules
• To avoid partitions, we use more replication
Crash+Link Failures: n > f+1 Replicas
• Scenario 1: f servers fail (FAILED); the one surviving server becomes primary
• Scenario 2: f links fail; one server is unreachable but alive, so a primary emerges on each side of the break. Problem! 2 primaries!!!
Crash+Link Failures: n > f+1 Replicas
[Diagram: two partitions, each electing its own primary; one server is unreachable but alive]
• We need another correct node that would serve as a link between the two partitions
• If the new node fails, we have f+1 failures.
• This is a contradiction, because we assume at most f failures
What About Hard Partitions?
• We showed how many replicas are needed to prevent partitions in the face of f failures
• However, partitions do happen, due to router failures for example
• Having extra replicas won't help there, because they will also be on one of the sides of the faulty router
• Next we'll talk about surviving failures despite network partitions
Surviving Network Partitions
• Most systems operate under the assumption that a partition will eventually be repaired
• Optimistic approach:
– Allow updates in all partitions
– When the partition is repaired, eventually synchronize the data
– OK for a distributed file system (think of your laptop in disconnected mode)
• Pessimistic approach:
– Allow updates only in a single partition; used where strong consistency is required (e.g., a flight reservation system)
– Which partition? This is usually decided by quorum consensus
– After the partition is repaired, update the copies of data in the other partition
Quorum Consensus
• A quorum is a sub-group of servers whose size gives it the right to carry out the operation
• Usually the majority gets the quorum (see the sketch below)
• Design/implementation challenges:
– Replicas must agree that they are behind a partition; they must rely on timeouts and failure detectors (special devices?)
– If the quorum set does not contain the primary, the replicas must elect a new primary
– Cost consideration: to tolerate one partition, we must have at least three servers. Implement one as a simple witness?
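A minimal sketch of the majority rule (illustrative names): a strict majority requires more than half of the servers, so at most one partition can hold a quorum at any time.

    def has_quorum(reachable, n):
        # Strict majority of the n servers
        return reachable > n // 2

    n = 3                      # three servers suffice to tolerate one partition
    print(has_quorum(2, n))    # True  -- this side may keep processing updates
    print(has_quorum(1, n))    # False -- this side must refuse updates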
Bringing Replicas Up-to-Date
• Version numbers:
– Each copy has a version number (or a timestamp)
– Only copies that are up-to-date have the current version number
– Operations should be applied only to copies with the current version number
• How does a failed server find out that it is not up-to-date?
– Periodically compare all version numbers?
• Log sequence numbers:
– Each operation is written to a log (like a transactional log)
– Each log record has a log sequence number (LSN)
– Replica managers compare LSNs to find out whether they are up-to-date (see the sketch below)
– Used by the Berkeley DB replication system
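A sketch of LSN-based catch-up, assuming each replica keeps an append-only log of (lsn, operation) records; this illustrates the comparison idea only, not Berkeley DB's actual API.

    def missing_records(last_applied_lsn, peer_log):
        # Everything with a higher LSN is what the recovering replica lacks
        return [rec for rec in peer_log if rec[0] > last_applied_lsn]

    peer_log = [(1, "write x=5"), (2, "write y=7"), (3, "write x=9")]
    print(missing_records(1, peer_log))   # [(2, 'write y=7'), (3, 'write x=9')]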
Summary
• Discussed replication
– Used for performance and high availability
• Active replication
– The client sends updates to all replicas
– Replicas coordinate amongst themselves and apply updates in order
• Passive replication (primary copy, primary-backup)
– Eager/lazy update propagation
– Number of replicas needed to prevent partitions
• Handling partitions
– Optimistic
– Pessimistic (quorum consensus)
• Next, let us look at real systems that use replication