Transcript of "Data Store Consistency" by Alan Fekete (ts.data61.csiro.au/Events/summer/14/fekete.pdf)

Page 1

Data Store Consistency
Alan Fekete

Page 2

Outline

›  Overview
›  Consistency Properties
   -  Research paper: Wada et al, CIDR'11
›  Transaction Techniques
   -  Research paper: Bailis et al, HotOS'13, VLDB'14

Page 3

The big ideas

› Each data item is stored in several places (replication)
   -  Can improve availability (access is possible even if some nodes have failed or are not reachable)
   -  Can improve performance (some replica may be close to the user)
› The location(s) of an item are determined from the key of the item (a placement sketch follows below)
   -  As more data must be stored, use more storage on more nodes
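
To make the placement idea concrete, here is a minimal sketch of Dynamo-style consistent hashing, offered as an illustration rather than any particular system's scheme: nodes and keys are hashed onto a ring, and an item lives on the first few distinct nodes clockwise from its key's position. The node names are hypothetical.

    # Minimal sketch of consistent hashing (an assumption for illustration):
    # hash each node onto a ring, then store an item on the first `replicas`
    # distinct nodes clockwise from the hash of its key. Adding nodes adds
    # storage, and only keys falling in a new node's arc move.

    import bisect
    import hashlib

    def ring_position(label: str) -> int:
        return int(hashlib.md5(label.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, replicas=3):
            self.replicas = replicas
            self.ring = sorted((ring_position(n), n) for n in nodes)

        def locations(self, key: str):
            """Return the nodes responsible for storing `key`."""
            idx = bisect.bisect(self.ring, (ring_position(key), ""))
            chosen = []
            for i in range(len(self.ring)):
                node = self.ring[(idx + i) % len(self.ring)][1]
                if node not in chosen:
                    chosen.append(node)
                if len(chosen) == self.replicas:
                    break
            return chosen

    ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
    print(ring.locations("user:42"))   # three replica locations for this key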

Page 4

Replica management summary

› Read any (nearby, up) replica
   -  Improved performance and availability for reading
› Update all replicas
   -  If replicas are always kept consistent, this damages performance and availability for updating
   -  So allow updates to propagate lazily, improving update QoS, but reads may not be consistent
   -  Unless some restrictions are placed, replicas may diverge permanently ("split-brain")
   -  So define efficient protocols that give some interesting property weaker than "always consistent" replicas

Page 5

Latencies

›  Message delays within a data center are usually under 1 ms
   -  But there is a long tail (e.g. 1% of messages take several ms)

›  Message delays across geography are often ~100 ms

›  See http://www.bailis.org/blog/communication-costs-in-real-world-networks/

Page 6

Internet-Scale Data Storage

› Many early systems* offered scalability and availability but missed functionality expected in traditional database management platforms (=> "NoSQL")
   -  Access by id/key [without content-based access, without joins]
   -  Operations may see stale data
   -  Lack all-or-nothing operations combining changes across items

*eg BigTable, PNUTS, S3, Dynamo, MongoDB, Cassandra, SimpleDB, Riak

Page 7

Quorums

›  Suppose each write is initially done to W replicas (out of N)
   -  Other replicas will be updated later, after the write has been acked to the client
›  How can we find the current value of the item when reading?
›  Traditional "quorum" approach is to look at R replicas
   -  Consider the timestamp of the value in each of these
   -  Choose the value with the highest timestamp as the result of the read
›  If W > N/2 and R + W > N, this works properly (sketched below)
›  Any read will see the most recent completed write
   -  There will be at least one replica that is BOTH among the W written and among the R read
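
A minimal sketch of the quorum logic just described, under simplifying assumptions: in-memory dicts stand in for replica nodes, and a global counter stands in for a real timestamping scheme.

    # Writes go to W of the N replicas; reads consult R replicas and take
    # the value with the highest timestamp. Because R + W > N and W > N/2,
    # the read set always overlaps the latest write set.

    import itertools
    import random

    N, W, R = 3, 2, 2
    replicas = [dict() for _ in range(N)]     # each maps key -> (ts, value)
    clock = itertools.count(1)                # simplistic timestamp source

    def write(key, value):
        ts = next(clock)
        for i in random.sample(range(N), W):  # ack after any W replicas
            replicas[i][key] = (ts, value)
        # the other N - W replicas would be brought up to date lazily

    def read(key):
        answers = [replicas[i][key]
                   for i in random.sample(range(N), R)
                   if key in replicas[i]]
        return max(answers)[1] if answers else None   # highest ts wins

    write("x", "v1")
    write("x", "v2")
    print(read("x"))    # always 'v2': some replica is in both quorums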

Page 8

Sloppy Quorums

›  Waiting for R replicas to return a value makes reads slow and perhaps unavailable
›  Some systems use R and W where R + W ≤ N
   -  Common is R=1, W=2, N=3
   -  Or even R=1, W=1, N=3
›  In this case, a read may not see the most recent value! We say the read can be "stale" (a small simulation follows)
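
A back-of-envelope simulation of why such settings go stale, assuming a read picks one replica uniformly at random before any lazy propagation has happened:

    # Sloppy-quorum staleness with R=1, W=1, N=3: only 1 of 3 replicas
    # holds the new value, so roughly 2 in 3 reads are stale (compare the
    # ~33% fresh-read rate measured for SimpleDB later in this talk).

    import random

    N, W = 3, 1
    TRIALS = 100_000

    stale = 0
    for _ in range(TRIALS):
        written = set(random.sample(range(N), W))   # replicas with the write
        if random.randrange(N) not in written:      # R = 1: read one replica
            stale += 1

    print(f"stale fraction ~ {stale / TRIALS:.2f}")  # prints ~0.67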

Page 9


H. Wada*, A. Fekete^, L. Zhao*, K. Lee*, A. Liu*
*NICTA (National ICT Australia) and U. of New South Wales; ^U. of Sydney

Data Consistency Properties and the Trade-offs in Commercial Cloud Storages: the Consumers’ Perspective [CIDR’11]

Page 10

Our Research Objective

›  Take the consumer view

-  What can be measured, controlled, planned from outside the cloud

-  Working through available management APIs

›  Investigate the consistency models provided by commercial NoSQL stores in the cloud

-  If weak, which extra properties are supported? How often, and in what circumstances, are inconsistent (stale) values observed?

-  Any differences between what is observed and what is announced by the vendor?

›  Investigate the benefits for the consumer of accepting weaker consistency models

-  Are the benefits significant enough to justify the consumers' effort?

-  When a vendor offers a choice of consistency models, how do they compare in practice?

Page 11

Platforms we observed

›  A variety of commercial cloud NoSQLs that are offered as a storage service

-  Amazon S3

-  Two options: Regular and Reduced redundancy (durability)

-  Amazon SimpleDB

-  Two options: Consistent Reads and Eventually Consistent Reads

-  Google App Engine datastore

-  Two options: Strong and Eventually Consistent Reads (added "High Replication" in 2011)

-  Windows Azure Table and Blob

-  No choice of data-consistency options offered

Page 12

Observing Stale Data

›  Experimental Setup
   -  A writer updates the data once every 3 secs, for 5 mins
      -  On GAE, it runs for 27 seconds
   -  One or more readers read it 100 times/sec (sketched below)
   -  Check if the data is stale by comparing the value seen to the most recent value written
   -  Plot against time since the most recent write occurred
   -  Execute the above once every hour
      -  On GAE, every 10 min
      -  For at least 10 days
      -  Done in Oct and Nov 2010
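
The shape of this probe loop can be sketched as follows; the in-memory DelayedStore is a hypothetical stand-in for a cloud store's client API, not the actual experiment code.

    # One writer updates an item every period_s while a reader polls it
    # rate_hz times per second, marking each read stale or fresh against
    # the latest value the writer acknowledged.

    import random
    import threading
    import time

    class DelayedStore:
        """Toy store whose reads may lag writes by up to lag_s seconds."""
        def __init__(self, lag_s=0.5):
            self.lag_s, self.history = lag_s, [(0.0, None)]
        def put(self, value):
            self.history.append((time.time(), value))
        def get(self):
            cutoff = time.time() - random.uniform(0, self.lag_s)
            return [v for t, v in self.history if t <= cutoff][-1]

    latest = {"value": None, "written_at": 0.0}
    samples = []                  # (seconds since last write, was_stale)

    def writer(store, duration_s=15, period_s=3.0):
        value, end = 0, time.time() + duration_s
        while time.time() < end:
            value += 1
            store.put(value)
            latest.update(value=value, written_at=time.time())
            time.sleep(period_s)

    def reader(store, duration_s=15, rate_hz=100):
        end = time.time() + duration_s
        while time.time() < end:
            seen = store.get()
            samples.append((time.time() - latest["written_at"],
                            seen != latest["value"]))
            time.sleep(1.0 / rate_hz)

    store = DelayedStore()
    threads = [threading.Thread(target=writer, args=(store,)),
               threading.Thread(target=reader, args=(store,))]
    for t in threads: t.start()
    for t in threads: t.join()
    print(f"{sum(s for _, s in samples) / len(samples):.0%} stale reads")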

Page 13

SimpleDB: Read and Write from a Single Thread

›  With eventually consistent reads, there is a 33% chance of reading the freshest value within 500 ms
   -  Perhaps one master and two other replicas, with each read taking its value from one of these at random?
•  The first time at which eventually consistent reads reach 99% "fresh" is stable at 500 ms
•  Outlier cases of stale reads after 500 ms, but no regular daily or weekly variation observed

Page 14

Stale Data in Other Cloud NoSQLs

For each cloud NoSQL and accessing source, what was observed:

›  SimpleDB (access from one thread, two threads, two processes, two VMs or two regions)
   -  The access source has no effect on the observable consistency. Eventually consistent reads have a 67% chance of seeing a stale value, until 500 ms after the write.
›  S3 (with five access configurations)
   -  No stale data was observed in ~4M reads per configuration. Provides better consistency than the SLA describes.
›  GAE datastore (access from a single app or two apps)
   -  Eventually consistent reads from different apps have a very low (3.3E-4 %) chance of observing values older than previous reads. Other reads never saw stale data.
›  Azure Table and Blob storage (with five access configurations)
   -  No stale data observed. Matches the SLA described by the vendor (all reads are consistent).

Page 15

Read-Your-Writes?

›  Read-your-writes: a read always sees a value at least as fresh as the latest write from the same thread/session

›  Our experiment: when reader and writer share one thread, all reads should be fresh

›  SimpleDB with eventually consistent reads: does not have this property

›  GAE with eventually consistent reads: may have this property
   -  No counterexample observed in ~3.7M reads over two weeks

Page 16

Monotonic Reads?

›  Monotonic reads: each read sees a value at least as fresh as that seen in any previous read from the same thread/session

›  Our experiment: check for a fresh read followed by a stale one (a checker for both session guarantees is sketched after the table below)

›  SimpleDB: monotonic read consistency is not supported
   -  In staleness, two successive eventually consistent reads are almost independent
   -  The correlation between staleness in two successive reads (up to 450 ms after a write) is 0.0218, which is very low

›  GAE with eventually consistent reads: not supported
   -  3.3E-4 % chance of reading values older than previous reads

Staleness of successive read pairs (SimpleDB):

                  2nd Stale          2nd Fresh
    1st Stale     39.94% (~1.9M)     21.08% (~1.0M)
    1st Fresh     23.36% (~1.1M)     15.63% (~0.7M)
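
Both session guarantees (read-your-writes, from the previous slide, and monotonic reads) can be checked mechanically from a session's log. A minimal sketch, assuming every value carries a version number that increases with each write:

    # A session log is a list of ("read", version) / ("write", version)
    # events in the order the client issued them. A read violates
    # read-your-writes if it is older than this session's latest write,
    # and violates monotonic reads if older than an earlier read.

    def check_session(log):
        last_written = 0      # freshest version this session has written
        last_read = 0         # freshest version this session has read
        violations = []
        for i, (op, version) in enumerate(log):
            if op == "write":
                last_written = max(last_written, version)
            else:  # read
                if version < last_written:
                    violations.append((i, "read-your-writes"))
                if version < last_read:
                    violations.append((i, "monotonic-reads"))
                last_read = max(last_read, version)
        return violations

    log = [("write", 5), ("read", 5), ("read", 4), ("read", 6)]
    print(check_session(log))
    # [(2, 'read-your-writes'), (2, 'monotonic-reads')]: the read of
    # version 4 is older than both the write of 5 and the earlier read of 5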

Page 17

SimpleDB’s Monotonic Write

›  A data item has value v0 before each run. Write value v1 and then v2 there, then read it repeatedly

•  When v1 != v2, writing v2 "pushes" v1 to the replicas immediately (the previous value v0 is not observed)
   •  Very different from the "only writing one value" case
•  When v1 = v2, the second write does not push (v0 is observed)
   •  Same as the "only writing one value" case

[Two plots: the v1 != v2 case and the v1 = v2 case]

Page 18

SimpleDB's Eventually Consistent Reads: Further Exploration -- Inter-Element Consistency

› Consistency between two values when writing and reading them through various combinations of APIs

[Diagram of SimpleDB's data model: a domain contains items, each with attributes holding values]

SimpleDB's API:
•  Write a value / Read a value
•  Write multiple values in an item / Read multiple values in an item
•  Write multiple values across items in a domain / Read multiple values across items in a domain

›  Reading two values independently
   -  Each read has a 33% chance of freshness; each read operation is independent
›  Writing two at once and reading two at once
   -  Both are stale or both are fresh; it seems a "batch write" and a "batch read" each access one replica
›  Writing two in the same domain independently
   -  The second write "pushes" the value of the first write (but only if the two values are different)

Page 19

Trade-Offs? SimpleDB

•  No significant difference was observed in RTT, throughput, or failure rate under various read-write ratios
•  If anything, performance favors consistent reads!
•  The financial cost is exactly the same

* Each client sends 1 rps. All results obtained under a 99:1 read-write ratio

Page 20

What Consumers Can Observe I

›  The SimpleDB platform showed frequent inconsistency

›  It offers an option for consistent reads. No extra costs for the consumer were observed in our experiments
   -  At least at the scale of our experiments (a few KB stored in a domain and ~2,500 rps)

?? Maybe the consumer should always program SimpleDB with consistent reads?

Page 21

What Consumers Can Observe II

›  Some platforms gave (mostly) better consistency than they promise

-  Consistent almost always (5-nines or better)

-  Perhaps consistency is violated only when a network or node failure occurs during execution of the operation

?? Maybe the chance of seeing stale data is so rare on these platforms that it need not be considered in programming?

-  There are other, more frequent, sources of data corruption such as data entry errors

-  The manual processes that fix these may also be used for rare errors from stale data

Page 22

"Eventual consistency" is an imprecise label

› Different algorithms/system designs that all give eventual consistency can differ widely in other properties that matter to the consumer

› Consumer needs to know more

-  Rate of inconsistent reads

-  Time to convergence

-  Extra properties that would be important for programming

-  Performance under variety of workloads

-  Availability

-  Costs

›  Just as in the SDLC, non-functional requirements are as crucial as functional ones

Page 23

Implications for Consumers?

› Can a consumer rely on our findings in decision-making? NO!
   -  Vendors might change any aspect of the implementation (including the presence of properties) without notice to consumers
      -  e.g., a shift from replication within a data center to geographic distribution
   -  Vendors might find a way to pass on to consumers the savings from eventually consistent reads (compared to consistent ones)

›  The lesson is that clear SLAs are needed, spelling out the properties that consumers can expect

Page 24

Outline

›  Overview
›  Consistency Properties
   -  Research paper: Wada et al, CIDR'11
›  Transaction Techniques
   -  Research paper: Bailis et al, HotOS'13, VLDB'14

Page 25

Transactions?

› DBMS concept: the user can define a grouping consisting of a sequence of operations (on perhaps multiple data items), so that they commit or abort as a group

-  ACID: atomic (all-or-nothing outcome), consistent, isolated, durable

›  Serializability: the strongest form of isolation, equivalent to transactions executing serially

-  A system can’t provide serializable transactions that are always available if the system can partition

›  This was known long before Brewer; see Davidson et al (ACM Computing Surveys 1985)

Page 26

Data Stores with Availability

› Restricted form of transactions
   -  Operate on a set of items that are colocated
      -  Eg Google Megastore entity groups, UCSB G-Store
   -  Multiple gets or multiple puts, but not a get with a put
      -  Eg Princeton COPS-GT, Eiger
› Restricted data types
   -  Only allow commutative operations
      -  Eg INRIA CRDTs, Berkeley BloomL
› Weak semantics
   -  Without isolation properties
      -  Eg ETH Consistency Rationing (some choices)

Page 27

Data Stores without Availability

›  Systems that support general (read committed, SI-like, or even serializable) transactions
   -  but use 2PC, Paxos, a master replica for ordering, etc
   -  Eg Google Megastore (across entity groups), ETH Consistency Rationing (some choices), Google Spanner, MSR Walter, UCSB Paxos-CP, Yale Calvin, Berkeley Planet (formerly MDCC)

Page 28


[HotOS’13]

Page 29

ACID Transactions with Weaker Isolation

›  Serializability is the ideal for transaction isolation, but most transactions on a (conventional, single-site) DBMS don't run serializably
   -  Read Committed is often the default level

Page 30

HAT: Highly Available Transactions

› We propose that a useful model for programmers on an internet-scale store is to
   -  offer txns that can be arbitrary collections of accesses to arbitrary sets of read/write objects,
   -  with semantics chosen to be as strong as feasible to implement with availability even when partitioned

Page 31

Available?

› Clearly not possible if the client is partitioned away from its data
   -  However, we should tolerate partitions between item replicas within the data store

›  So, we ask for:
   -  IF the client can get to one replica of each item it asks for, THEN the transaction can eventually commit (or it aborts voluntarily)

Page 32

Semantics for HAT

› We show you can offer available transactions that have
   -  All-or-nothing atomicity
   -  Isolation levels such as the ANSI-defined read committed and repeatable read*
   -  But where reads may not always see the most recent committed changes

*in the absence of predicate reads [which is not an issue for a key-value store]

Page 33

Defining semantics

› We follow Adya's MIT PhD thesis
   -  A graph with edges between operations
   -  Types of edges: wr, ww, rw
   -  plus happens-before
   -  Restrictions on the sorts of cycles that can occur (a toy cycle check is sketched below)
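
A minimal sketch of the Adya-style machinery, under the simplification that transactions are graph nodes and each isolation level is characterised by which edge types may participate in a cycle (e.g. Adya's G1c, for read committed, forbids cycles made only of ww/wr edges, while serializability forbids any cycle, including ones with rw edges):

    # Transactions are nodes; edges are labelled ww (write-depends),
    # wr (read-depends) or rw (anti-depends). We test for a cycle using
    # only the edge labels an isolation level cares about.

    from collections import defaultdict

    def has_cycle(edges, allowed_labels):
        """Is there a cycle using only edges whose label is allowed?"""
        graph = defaultdict(list)
        for src, dst, label in edges:
            if label in allowed_labels:
                graph[src].append(dst)

        WHITE, GREY, BLACK = 0, 1, 2
        colour = defaultdict(int)

        def dfs(node):
            colour[node] = GREY
            for nxt in graph[node]:
                if colour[nxt] == GREY:              # back edge: a cycle
                    return True
                if colour[nxt] == WHITE and dfs(nxt):
                    return True
            colour[node] = BLACK
            return False

        return any(colour[n] == WHITE and dfs(n) for n in list(graph))

    # T1 reads x from T2 (wr edge), but T2 overwrote an item T1 read (rw):
    edges = [("T2", "T1", "wr"), ("T1", "T2", "rw")]
    print(has_cycle(edges, {"ww", "wr"}))         # False: read committed OK
    print(has_cycle(edges, {"ww", "wr", "rw"}))   # True: not serializable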

Page 34

Implementation proposal

› We can sketch implementations
   -  For read committed: the server buffers operations, and applies them as a group when notified of the commit
   -  For repeatable read: the client caches the value read for each item, and reuses that (see the sketch below)
› This is mainly to prove the existence of an available implementation
   -  Lots of engineering will be needed to get decent performance
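
A minimal sketch of the repeatable-read idea, assuming a dict-like replica as the store; this illustrates the existence argument, not the paper's implementation:

    # Within a transaction, the client caches the first value it reads for
    # each item and serves later reads from that cache, so every read of an
    # item repeats the same value. Writes are buffered and applied as a
    # group at commit, needing only one reachable (hypothetical) replica.

    class HATClientTxn:
        def __init__(self, store):
            self.store = store          # one reachable replica, dict-like
            self.read_cache = {}        # item -> value seen first
            self.write_buffer = {}      # item -> pending value

        def read(self, key):
            if key in self.write_buffer:          # read-your-own-writes
                return self.write_buffer[key]
            if key not in self.read_cache:        # first read: go to replica
                self.read_cache[key] = self.store.get(key)
            return self.read_cache[key]           # repeat the cached value

        def write(self, key, value):
            self.write_buffer[key] = value        # buffer until commit

        def commit(self):
            # Apply buffered writes as a group; needs only the one replica,
            # so the transaction can commit despite partitions elsewhere.
            for key, value in self.write_buffer.items():
                self.store[key] = value

    store = {"x": 1}
    txn = HATClientTxn(store)
    a = txn.read("x")       # 1, now cached
    store["x"] = 99         # concurrent update elsewhere
    b = txn.read("x")       # still 1: repeatable read
    txn.write("y", a + 1)
    txn.commit()            # store is now {"x": 99, "y": 2}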

Page 35

Not all semantics are possible in HATs

› A HAT can't offer serializability or snapshot isolation
› More generally, a HAT can't prevent the "lost update" phenomenon (illustrated below)
   -  On one side of a partition, a client executes "read x, increment x by 10, return the result", while on the other side another executes "read x, double x, return the result"
› A HAT can't offer a recency guarantee, such as "a transaction sees any other transaction that completed within the past 10 minutes"
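
The lost-update example can be made concrete in a few lines; the values here are hypothetical:

    # Toy illustration of the lost-update anomaly above: two clients on
    # opposite sides of a partition each read x and write back a derived
    # value; any merge keeps one write and silently discards the other.

    replica_a = {"x": 100}                 # side A of the partition
    replica_b = {"x": 100}                 # side B of the partition

    replica_a["x"] += 10                   # side A: read x, add 10 -> 110
    replica_b["x"] *= 2                    # side B: read x, double -> 200

    # After the partition heals, last-writer-wins keeps one value:
    merged = replica_b["x"]                # 200: the +10 update is lost
    # A serial execution would give (100+10)*2 = 220 or 100*2+10 = 210,
    # never plain 200 or 110.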

Page 36

Session guarantees and stickiness

›  To offer session guarantees, such as "read your writes" (from one transaction in a session to a later one) or "respect causality", we must assume more of the platform

›  A sticky available system allows a transaction to commit whenever it can contact the same server used by previous transactions in the session
   -  But perhaps not if different transactions in the session use different servers

See http://www.bailis.org/blog/stickiness-and-client-server-session-guarantees/


Page 37

Conclusion

› We advocate that internet-scale data systems offer clients
   -  Unrestricted sets of operations, on arbitrary multiple items, as a transaction
   -  Semantics as strong as possible while remaining available during partitions
   -  We show this can achieve the traditional ANSI weak isolation levels
   -  Open questions: what further properties can one offer? Can one get good performance? How should an application be designed so it works properly with this isolation model?

Page 38

References

› Data Consistency Properties and the Trade-offs in Commercial Cloud Storages: the Consumers' Perspective. H. Wada, A. Fekete, L. Zhao, K. Lee and A. Liu. Proceedings of the Conference on Innovative Data Systems Research (CIDR'11), Asilomar, USA, January 2011, pp 134-143.
From http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper15.pdf

Page 39

References

› HAT, not CAP: Towards Highly Available Transactions. P. Bailis, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (HotOS'13), Santa Ana Pueblo, New Mexico, USA, May 2013.
From http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/11164-hotos13-final80.pdf

› Highly Available Transactions: Virtues and Limitations. P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Proceedings of the VLDB Endowment, Vol 7, No 3, pp 181-192.
From http://www.vldb.org/pvldb/vol7/p181-bailis.pdf

Page 40

Further reading

›  "Replication Theory and Practice", ed. by B. Charron-Bost, F. Pedone and A. Schiper; Springer LNCS 5959 (2010)
   -  Available online at http://opac.library.usyd.edu.au:80/record=b3851904~S4

›  "Data Management in the Cloud" by D. Agrawal, A. El Abbadi, and S. Das; Morgan & Claypool (2012)
   -  Available online at http://opac.library.usyd.edu.au:80/record=b4452001~S4