Page 1:

Silvia Bonomi University of Rome “La Sapienza” www.dis.uniroma1.it/~bonomi

Great Ideas in Computer Science & Engineering A.A. 2012/2013

The CAP theorem and the design of large scale distributed systems: Part I

Page 2:

A little bit of History

Mainframe-based Information Systems
First Internet-based systems for military purpose
Client-server architectures
Web-services based Information Systems
Peer-to-peer Systems
Wireless and Mobile ad-hoc networks
Cloud Computing Platforms

Page 3:

Relational Databases History
}  Relational Databases – mainstay of business
}  Web-based applications caused load spikes
}  Especially true for public-facing e-Commerce sites
}  Developers began to front the RDBMS with memcached or to integrate other caching mechanisms within the application

Page 4:

Scaling Up
}  Issues with scaling up when the dataset is just too big
}  RDBMS were not designed to be distributed
}  Began to look at multi-node database solutions
}  Known as ‘scaling out’ or ‘horizontal scaling’
}  Different approaches include:
}  Master-slave
}  Sharding

Page 5:

Scaling RDBMS – Master/Slave
}  Master-Slave
}  All writes go to the master; all reads are performed against the replicated slave databases
}  Critical reads may be incorrect, as writes may not have been propagated down yet
}  Large data sets can pose problems, as the master needs to duplicate data to the slaves
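The read-staleness described above can be sketched in a few lines. The `Master`/`Slave` classes and the pull-style `replicate()` call are illustrative assumptions, not a real replication protocol:

```python
# Minimal sketch (hypothetical classes): all writes go to the master,
# reads go to a replica that applies the replication log with a lag.
class Master:
    def __init__(self):
        self.data = {}
        self.log = []          # replication log shipped to slaves

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Slave:
    def __init__(self, master):
        self.master = master
        self.data = {}
        self.applied = 0       # how much of the master's log has been applied

    def replicate(self):
        # Apply pending log entries (in reality this happens asynchronously).
        for key, value in self.master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.master.log)

    def read(self, key):
        return self.data.get(key)

master = Master()
slave = Slave(master)
master.write("x", 1)
stale = slave.read("x")   # None: the write has not propagated down yet
slave.replicate()
fresh = slave.read("x")   # 1 once replication catches up
```

A "critical read" issued between the write and the replication step observes the stale value, which is exactly the anomaly the slide warns about.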

Page 6:

Scaling RDBMS – Sharding
}  Partitioning or sharding
}  Scales well for both reads and writes
}  Not transparent: the application needs to be partition-aware
}  Can no longer have relationships/joins across partitions
}  Loss of referential integrity across shards
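A minimal sketch of the partition-awareness this requires; the shard names and the modulo-hash routing scheme are illustrative assumptions:

```python
# Hash-based sharding sketch: the application routes every key to the
# shard that owns it. Any key not on the local shard cannot be joined
# against locally, which is the loss of cross-partition joins above.
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]   # illustrative shard names

def shard_for(key: str) -> str:
    # Stable hash so every client routes a given key to the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

stores = {name: {} for name in SHARDS}       # one in-memory store per shard

def put(key, value):
    stores[shard_for(key)][key] = value

def get(key):
    return stores[shard_for(key)].get(key)

put("user:42", {"name": "Ada"})
```

Note that `shard_for` must be deterministic and shared by all clients; re-sharding (changing `SHARDS`) moves keys, which is why real systems use consistent hashing instead of a plain modulo.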

Page 7:

Other ways to scale RDBMS
}  Multi-Master replication
}  INSERT only, no UPDATEs/DELETEs
}  No JOINs, thereby reducing query time
}  This involves de-normalizing data
}  In-memory databases

Page 8:

Today…

Page 9:

Context
}  Networked Shared-data Systems

[Figure: three networked nodes, each holding a replica of data item A]

Page 10:

Fundamental Properties
}  Consistency
}  (informally) “every request receives the right response”
}  E.g., if I get my shopping list on Amazon I expect it to contain all the previously selected items
}  Availability
}  (informally) “each request eventually receives a response”
}  E.g., eventually I can access my shopping list
}  tolerance to network Partitions
}  (informally) “servers can be partitioned into multiple groups that cannot communicate with one another”

Page 11:

CAP Theorem
}  2000: Eric Brewer, PODC conference keynote
}  2002: Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2)

“Of three properties of shared-data systems (Consistency, Availability and tolerance to network Partitions) only two can be achieved at any given moment in time.”

Page 12:

Proof Intuition

[Figure: a networked shared-data system with three replicas of item A; during a partition, one client issues Write(v1, A) on one side while another client issues Read(A) on the other side]

Page 13:

Fox & Brewer “CAP Theorem”: C-A-P: choose two.

[Figure: triangle with vertices Consistency, Availability and Partition-resilience]

Claim: every distributed system is on one side of the triangle.

CA: available and consistent, unless there is a partition.

AP: a reachable replica provides service even in a partition, but may be inconsistent.

CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others (e.g., quorum).
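The CP/quorum behaviour can be sketched as follows; the five-node group and the partition sides are illustrative assumptions:

```python
# Majority-quorum sketch: a replica group serves a request only if it
# can assemble a strict majority, so the minority side of a partition
# denies service rather than risk inconsistency (CP, forfeiting A).
NODES = {1, 2, 3, 4, 5}   # illustrative node ids

def has_quorum(reachable: set) -> bool:
    # Strict majority of the full membership.
    return len(reachable) > len(NODES) // 2

def serve(reachable: set) -> str:
    return "serve" if has_quorum(reachable) else "deny"

# A partition splits the group into {1, 2, 3} and {4, 5}.
majority_side = serve({1, 2, 3})   # keeps serving: stays C and P
minority_side = serve({4, 5})      # denies service: A is forfeited here
```

Because any two majorities intersect, no two disjoint sides of a partition can both hold a quorum, which is what keeps the served state consistent.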

Page 14:

The CAP Theorem

[Figure: triangle with vertices Consistency, Availability and tolerance to network Partitions]

Theorem: you can have at most two of these invariants for any shared-data system.
Corollary: a consistency boundary must choose A or P.

Page 15:

Forfeit Partitions

Examples
}  Single-site databases
}  Cluster databases
}  LDAP
}  Fiefdoms

Traits
}  2-phase commit
}  cache validation protocols
}  The “inside”

Page 16:

Observations
}  CAP states that in case of failures you can have at most two of these three properties for any shared-data system
}  To scale out, you have to distribute resources.
}  P is not really an option but rather a need
}  The real choice is between consistency and availability
}  In almost all cases, you would choose availability over consistency

Page 17:

Forfeit Availability

Examples
}  Distributed databases
}  Distributed locking
}  Majority protocols

Traits
}  Pessimistic locking
}  Make minority partitions unavailable

Page 18:

Forfeit Consistency

Examples
}  Coda
}  Web caching
}  DNS
}  Emissaries

Traits
}  expirations/leases
}  conflict resolution
}  Optimistic
}  The “outside”

Page 19:

Consistency Boundary Summary
}  We can have consistency & availability within a cluster.
}  No partitions within the boundary!
}  OS/Networking is better at A than C
}  Databases are better at C than A
}  Wide-area databases can’t have both
}  Disconnected clients can’t have both

Page 20:

Page 21:

CAP, ACID and BASE
}  BASE stands for Basically Available, Soft State, Eventually Consistent system.
}  Basically Available: the system is available most of the time, though some subsystems may be temporarily unavailable
}  Soft State: data are “volatile” in the sense that their persistence is in the hands of the user, who must take care of refreshing them
}  Eventually Consistent: the system eventually converges to a consistent state


Page 22:

CAP, ACID and BASE
}  The relation between ACID and CAP is more complex
}  Atomicity: every operation is executed in an “all-or-nothing” fashion
}  Consistency: every transaction preserves the consistency constraints on data
}  Isolation: transactions do not interfere; every transaction is executed as if it were the only one in the system
}  Durability: after a commit, the updates made are permanent regardless of possible failures


Page 23:

CAP, ACID and BASE

CAP
}  C here refers to single-copy consistency
}  A here refers to service/data availability

ACID
}  C here refers to constraints on data and on the data model
}  A refers to atomicity of operations and is always ensured
}  I is deeply related to CAP: isolation can be ensured in at most one partition
}  D is independent of CAP

Page 24:

Warning!
}  What CAP says:
}  When you have a partition in the network you cannot have both C and A
}  What CAP does not say:
}  That there cannot exist time periods in which you have C, A and P all together

During normal periods (i.e., periods with no partitions) both C and A can be achieved

Page 25:

2 of 3 is misleading
}  Partitions are rare events
}  there is little reason to forfeit C or A by design
}  Systems evolve over time
}  Depending on the specific partition, service or data, the decision about which property to sacrifice can change
}  C, A and P are measured along a continuum
}  Several levels of Consistency (e.g. ACID vs BASE)
}  Several levels of Availability
}  Several degrees of partition severity

Page 26:

2 of 3 is misleading
}  In principle every system should be designed to ensure both C and A in normal situations
}  When a partition occurs, the decision between C and A can be taken
}  When the partition is resolved, the system takes corrective actions, coming back to normal operation

Page 27:

Consistency/Latency Trade-Off
}  CAP does not force designers to give up A or C, so why do so many systems trade off C?
}  CAP does not explicitly talk about latency…
}  … however latency is crucial to get the essence of CAP

Page 28:

Consistency/Latency Trade-Off

High Availability
•  High Availability is a strong requirement of modern shared-data systems

Replication
•  To achieve High Availability, data and services must be replicated

Consistency
•  Replication imposes consistency maintenance

Latency
•  Every form of consistency requires communication, and stronger consistency requires higher latency

Page 29:

PACELC
}  Abadi proposes to revise CAP as follows:

“PACELC (pronounced pass-elk): if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?”

Page 30:

Partitions Management

Detect the start of a partition, enter an explicit partition mode that may limit some operations, and initiate partition recovery when communication is restored.

The last step aims to restore consistency and compensate for mistakes the program made while the system was partitioned.

Figure 1 shows a partition’s evolution. Normal operation is a sequence of atomic operations, and thus partitions always start between operations. Once the system times out, it detects a partition, and the detecting side enters partition mode. If a partition does indeed exist, both sides enter this mode, but one-sided partitions are possible. In such cases, the other side communicates as needed and either this side responds correctly or no communication was required; either way, operations remain consistent. However, because the detecting side could have inconsistent operations, it must enter partition mode. Systems that use a quorum are an example of this one-sided partitioning. One side will have a quorum and can proceed, but the other cannot. Systems that support disconnected operation clearly have a notion of partition mode, as do some atomic multicast systems, such as Java’s JGroups.

Once the system enters partition mode, two strategies are possible. The first is to limit some operations, thereby reducing availability. The second is to record extra information about the operations that will be helpful during partition recovery. Continuing to attempt communication will enable the system to discern when the partition ends.

Which operations should proceed?

Deciding which operations to limit depends primarily on the invariants that the system must maintain. Given a set of invariants, the designer must decide whether or not to maintain a particular invariant during partition mode or risk violating it with the intent of restoring it during recovery. For example, for the invariant that keys in a table are unique, designers typically decide to risk that invariant and allow duplicate keys during a partition. Duplicate keys are easy to detect during recovery, and, assuming that they can be merged, the designer can easily restore the invariant.

For an invariant that must be maintained during a partition, however, the designer must prohibit or modify operations that might violate it. (In general, there is no way to tell if the operation will actually violate the invariant, since the state of the other side is not knowable.) Externalized events, such as charging a credit card, often work this way. In this case, the strategy is to record the intent and execute it after the recovery. Such transactions are typically part of a larger workflow that has an explicit order-processing state, and there is little downside to delaying the operation until the partition ends. The designer forfeits A in a way that users do not see. The users know only that they placed an order and that the system will execute it later.
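The record-the-intent strategy can be sketched as follows; all names are hypothetical, and a real system would persist the intent log durably:

```python
# During partition mode an externalized operation (charging a card) is
# recorded as an intent and executed only after recovery, so the user
# never sees the loss of availability.
intents = []        # deferred externalized operations
charges = []        # side effects actually executed
partitioned = True  # illustrative flag for partition mode

def charge_card(order_id, amount):
    if partitioned:
        intents.append((order_id, amount))  # record intent, defer side effect
        return "accepted"                   # user sees the order as placed
    charges.append((order_id, amount))      # normal mode: execute immediately
    return "charged"

def recover():
    # Partition healed: leave partition mode and replay deferred intents.
    global partitioned
    partitioned = False
    while intents:
        charge_card(*intents.pop(0))

status = charge_card("order-1", 30)  # issued during the partition
recover()
```

The key design point is that the operation's acknowledgement ("accepted") is decoupled from its external effect, which runs once, after the partition ends, as part of the larger order-processing workflow.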

More generally, partition mode gives rise to a fundamental user-interface challenge, which is to communicate that tasks are in progress but not complete. Researchers have explored this problem in some detail for disconnected operation, which is just a long partition. Bayou’s calendar application, for example, shows potentially inconsistent (tentative) entries in a different color.7 Such notifications are regularly visible both in workflow applications, such as commerce with e-mail notifications, and in cloud services with an offline mode, such as Google Docs.

One reason to focus on explicit atomic operations, rather than just reads and writes, is that it is vastly easier to analyze the impact of higher-level operations on invariants. Essentially, the designer must build a table that looks at the cross product of all operations and all invariants and decide for each entry if that operation could violate the invariant. If so, the designer must decide whether to prohibit, delay, or modify the operation. In practice, these decisions can also depend on the known state, on the arguments, or on both. For example, in systems with a home node for certain data,5 operations can typically proceed on the home node but not on other nodes.

The best way to track the history of operations on both sides is to use version vectors, which capture the causal dependencies among operations. The vector’s elements are a pair (node, logical time), with one entry for every node that has updated the object and the time of its last update. Given two versions of an object, A and B, A is newer than B if, for every node in common in their vectors, A’s times are greater than or equal to B’s and at least one of A’s times is greater.

Figure 1. The state starts out consistent and remains so until a partition starts. To stay available, both sides enter partition mode and continue to execute operations, creating concurrent states S1 and S2, which are inconsistent. When the partition ends, the truth becomes clear and partition recovery starts. During recovery, the system merges S1 and S2 into a consistent state S' and also compensates for any mistakes made during the partition.

Partition Detection

Activating Partition Mode

Partition Recovery

Page 31:

Partition Detection
}  CAP does not explicitly talk about latencies
}  However…
}  To keep the system live, time-outs must be set
}  When a time-out expires, the system must take a decision

Is a partition happening?
}  NO, continue to wait: possible Availability loss
}  YES, go on with the execution: possible Consistency loss

Page 32:

Partition Detection
}  Partition Detection is not global
}  One interacting part may detect the partition while the other does not.
}  Different processes may be in different states (partition mode vs normal mode)
}  When entering Partition Mode the system may
}  Decide to block risky operations to avoid consistency violations
}  Go on, limiting a subset of operations

Page 33:

Which Operations Should Proceed?
}  Live operation selection is a hard task
}  It requires knowledge of the severity of an invariant violation
}  Examples
}  every key in a DB must be unique
¨  Managing violations of unique keys is simple
¨  Merge elements with the same key, or update the keys
}  every passenger of an airplane must have an assigned seat
¨  Managing seat-reservation violations is harder
¨  Compensation is done with human intervention
}  Log every operation for possible future re-processing

Page 34:

Partition Recovery
}  When a partition is repaired, the partitions’ logs may be used to recover consistency
}  Strategy 1: roll back and re-execute the operations in the proper order (using version vectors)
}  Strategy 2: disable a subset of operations (Commutative Replicated Data Types – CRDTs)

Page 35:

Basic Techniques: Version Vector
}  In the version vector we have an entry for every node updating the state
}  Each node has an identifier
}  Each operation is stored in the log with an attached pair <nodeId, timeStamp>
}  Given two version vectors A and B, A is newer than B if
}  For every node in both A and B, ts(B) ≤ ts(A), and
}  There exists at least one entry where ts(B) < ts(A)


Page 36:

Version Vectors: example

ts(A) = <1, 0, 0>, ts(B) = <1, 1, 0>: ts(A) < ts(B), then A → B

ts(A) = <1, 0, 0>, ts(B) = <0, 0, 1>: ts(A) ≠ ts(B) and neither dominates, then A || B: POTENTIALLY INCONSISTENT!

Page 37:

Basic Techniques: Version Vector
}  Using version vectors it is always possible to determine whether two operations are causally related or concurrent (and thus dangerous)
}  Using the version vectors stored on both partitions it is possible to re-order operations, raising conflicts that may be resolved by hand
}  Recent works proved that this consistency is the best that can be obtained in systems focused on latency

Page 38:

Basic Techniques: CRDT
}  Commutative Replicated Data Types (CRDTs) are data structures that provably converge after a partition (e.g. sets).
}  Characteristics:
}  All the operations during a partition are commutative (e.g. add(a) and add(b) are commutative), or
}  Values are represented on a lattice and all operations during a partition are monotonically increasing wrt the lattice (giving an order among them)
}  Approach taken by Amazon with the shopping cart.
}  Allows designers to choose A while still ensuring convergence after partition recovery
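A minimal sketch of the first characteristic, using a grow-only set whose only operation (add) is commutative, so set union is the lattice join; the shopping-cart values are illustrative:

```python
# Grow-only set CRDT sketch: both sides of a partition keep accepting
# add() operations, and merging their states by union converges to the
# same result regardless of the order in which operations were applied.
def merge(s1: set, s2: set) -> set:
    return s1 | s2   # union is the join of the set-inclusion lattice

# Two replicas of a shopping cart diverge during a partition.
left = {"book"}
right = {"book"}
left.add("pen")      # applied only on the left side
right.add("mug")     # applied only on the right side

merged = merge(left, right)   # convergent state after recovery
```

Because union is commutative, associative, and idempotent, `merge(left, right)` equals `merge(right, left)`, which is exactly what lets the designer choose A and still converge. (Note a grow-only set cannot express removals; Amazon's cart famously resurrected deleted items for this reason, and richer CRDTs exist to handle removal.)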


Page 39:

Basic Techniques: Mistake Compensation
}  Selecting A and forfeiting C, mistakes may be made
}  Invariant violations
}  To fix mistakes the system can
}  Apply a deterministic rule (e.g. “last write wins”)
}  Merge operations
}  Escalate to a human
}  General Idea:
}  Define a specific operation managing the error
}  E.g. refund the credit card
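The “last write wins” rule can be sketched as follows; representing a version as a (timestamp, value) pair is an illustrative assumption:

```python
# Last-write-wins sketch: given two conflicting versions tagged with
# timestamps, the merge deterministically keeps the newest one, so both
# sides of a healed partition resolve the conflict the same way.
def last_write_wins(v1, v2):
    # Each version is (timestamp, value); tuple comparison orders by
    # timestamp first, breaking ties on the value for determinism.
    return max(v1, v2)

winner = last_write_wins((10, "draft"), (12, "final"))
```

The rule is simple and deterministic but silently discards the losing write, which is why the slide lists operation merging and human escalation as alternatives when the lost update would violate an important invariant.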


Page 40:

What is NoSQL?
}  Stands for Not Only SQL
}  Class of non-relational data storage systems
}  Usually do not require a fixed table schema, nor do they use the concept of joins
}  All NoSQL offerings relax one or more of the ACID properties (we will talk about the CAP theorem)

Page 41:

Why NoSQL?
}  For data storage, an RDBMS cannot be the be-all/end-all
}  Just as there are different programming languages, we need to have other data storage tools in the toolbox
}  A NoSQL solution is more acceptable to a client now than even a year ago

Page 42:

How did we get here?
}  Explosion of social media sites (Facebook, Twitter) with large data needs
}  Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
}  Just as with the move to dynamically-typed languages (Ruby/Groovy), a shift to dynamically-typed data with frequent schema changes
}  Open-source community

Page 43:

Dynamo and BigTable
}  Three major papers were the seeds of the NoSQL movement
}  BigTable (Google)
}  Dynamo (Amazon)
}  Gossip protocol (discovery and error detection)
}  Distributed key-value data store
}  Eventual consistency

Page 44:

Thank You!

Questions?!