
Data Store Consistency

Alan Fekete

Outline

›  Overview
›  Consistency Properties
   -  Research paper: Wada et al, CIDR’11
›  Transaction Techniques
   -  Research paper: Bailis et al, HotOS’13, VLDB’14

The big ideas

›  Each data item is stored in several places (replication)
   -  Can improve availability (access is possible even if some nodes have failed or are not reachable)
   -  Can improve performance (some replica may be close to the user)
›  The locations of an item are determined from the key of the item
   -  As more data must be stored, use more storage on more nodes

Replica management summary

›  Read any (nearby, up) replica
   -  Improved performance and availability for reading
›  Update all replicas
   -  If replicas are always kept consistent, this damages performance and availability for updating
   -  So allow updates to propagate lazily, improving update QoS, but reads may not be consistent
   -  Unless some restrictions are placed, replicas may diverge permanently (“split-brain”)
   -  So define efficient protocols that give some interesting property weaker than “always consistent” replicas

Latencies

›  Message delays within a data center are usually under 1 ms
   -  But there is a long tail (e.g. 1% of messages take several ms)
›  Message delays across geographical distances are often ~100 ms

›  See http://www.bailis.org/blog/communication-costs-in-real-world-networks/

Internet-Scale Data Storage

›  Many early systems* offered scalability and availability but missed functionality expected in traditional database management platforms (=> “NoSQL”)
   -  Access by id/key [without content-based access, without joins]
   -  Operations may see stale data
   -  Lack all-or-nothing combining of operations across items

*eg BigTable, PNUTS, S3, Dynamo, MongoDB, Cassandra, SimpleDB, Riak


Quorums

›  Suppose each write is initially done to W replicas (out of N)
   -  Other replicas will be updated later, after the write has been acked to the client
›  How can we find the current value of the item when reading?
›  The traditional “quorum” approach is to look at R replicas
   -  Consider the timestamp of the value at each of these
   -  Choose the value with the highest timestamp as the result of the read
›  If W > N/2 and R + W > N, this works properly
›  Any read will see the most recent completed write
   -  There will be at least one replica that is BOTH among the W written and among the R read
   -  A minimal code sketch of this scheme follows below
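A minimal sketch (not any particular system's API) of the quorum scheme just described: a write is acknowledged once W replicas store it, and a read collects (timestamp, value) pairs from R replicas and keeps the newest. The replica class and names here are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    timestamp: int          # logical timestamp assigned by the writer
    value: str

class Replica:
    """A single replica holding the newest version it has seen so far."""
    def __init__(self):
        self.stored = Versioned(0, "initial")
    def put(self, v: Versioned):
        if v.timestamp > self.stored.timestamp:
            self.stored = v
    def get(self) -> Versioned:
        return self.stored

def quorum_write(replicas, W, timestamp, value):
    # Ack after W replicas hold the new version; the rest are updated lazily.
    for r in replicas[:W]:
        r.put(Versioned(timestamp, value))

def quorum_read(replicas, R):
    # Ask R replicas and choose the value with the highest timestamp.
    responses = [r.get() for r in replicas[:R]]
    return max(responses, key=lambda v: v.timestamp).value

# With N=3, W=2, R=2 we have W > N/2 and R + W > N, so any read quorum
# overlaps the write quorum and sees the most recent completed write.
replicas = [Replica() for _ in range(3)]
quorum_write(replicas, W=2, timestamp=1, value="new")
print(quorum_read(list(reversed(replicas)), R=2))   # -> "new", even from the "other" pair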

Sloppy Quorums

›  Waiting for R replicas to return the value makes reads slow and perhaps unavailable

›  Some systems use R and W where R + W ≤ N
   -  Common is R=1, W=2, N=3
   -  Or even R=1, W=1, N=3
›  In this case, a read may not see the most recent value! We say the read can be “stale” (a small simulation follows)
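A back-of-the-envelope simulation (an illustration, not a measurement from the talk) of how often such a sloppy configuration reads fresh data immediately after the write is acked, assuming the read contacts uniformly chosen replicas: with N=3 and W=1 only one replica holds the new value, so an R=1 read is fresh about 1/3 of the time; with W=2 it is fresh about 2/3 of the time.

```python
import random

def fresh_read_probability(N=3, W=1, R=1, trials=100_000):
    fresh = 0
    for _ in range(trials):
        written = set(random.sample(range(N), W))   # replicas already updated
        read = set(random.sample(range(N), R))      # replicas the read contacts
        if written & read:                          # any overlap => newest value seen
            fresh += 1
    return fresh / trials

print(fresh_read_probability(W=1, R=1))   # ~0.33
print(fresh_read_probability(W=2, R=1))   # ~0.67
```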


Data Consistency Properties and the Trade-offs in Commercial Cloud Storages: the Consumers’ Perspective [CIDR’11]

H. Wada*, A. Fekete^, L. Zhao*, K. Lee*, A. Liu*
*NICTA (National ICT Australia), U. of New South Wales   ^U. of Sydney

Our Research Objective

›  Take the consumer view
   -  What can be measured, controlled, planned from outside the cloud
   -  Working through the available management APIs
›  Investigate the consistency models provided by commercial NoSQLs in the cloud
   -  If weak, which extra properties are supported? How often and in what circumstances is inconsistency (stale values) observed?
   -  Any differences between what is observed and what is announced by the vendor?
›  Investigate the benefits for consumers of accepting weaker consistency models
   -  Are the benefits significant enough to justify the consumers’ effort?
   -  When a vendor offers a choice of consistency model, how do the models compare in practice?

Platforms we observed

›  A variety of commercial cloud NoSQLs that are offered as storage services
   -  Amazon S3
      -  Two options: Regular and Reduced redundancy (durability)
   -  Amazon SimpleDB
      -  Two options: Consistent Reads and Eventually Consistent Reads
   -  Google App Engine datastore
      -  Two options: Strong and Eventually Consistent Reads (added “High Replication” in 2011)
   -  Windows Azure Table and Blob
      -  No option available in data consistency

Observing Stale Data

›  Experimental Setup (a runnable sketch of this loop follows)
   -  A writer updates the data once every 3 seconds, for 5 minutes
      -  On GAE, it runs for 27 seconds
   -  One or more readers read the data 100 times/sec
   -  Check whether the data is stale by comparing the value seen to the most recent value written
   -  Plot against time since the most recent write occurred
   -  Execute the above once every hour
      -  On GAE, every 10 min
      -  For at least 10 days
      -  Done in Oct and Nov 2010
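A self-contained sketch of the measurement loop above. The real experiments drove SimpleDB, S3, GAE and Azure through their client APIs; here a toy in-memory store with a simulated propagation delay stands in for the cloud service, and the timings are scaled down so the script finishes quickly. All names and timings here are assumptions for illustration, not the paper's harness.

```python
import random
import threading
import time

class ToyEventualStore:
    """Single-item store whose writes become visible only after a random delay."""
    def __init__(self):
        self._versions = [(0.0, 0)]              # (visible_at, value)
        self._lock = threading.Lock()

    def put(self, value, max_delay=0.5):
        visible_at = time.time() + random.uniform(0, max_delay)
        with self._lock:
            self._versions.append((visible_at, value))

    def get(self):
        now = time.time()
        with self._lock:
            visible = [v for t, v in self._versions if t <= now]
        return visible[-1]

def run_experiment(store, writes=20, write_period=0.3, reads_per_sec=100):
    latest = {"value": 0, "written_at": time.time()}
    samples = []                                  # (age of latest write, stale?)

    def writer():
        for v in range(1, writes + 1):
            store.put(v)
            latest["value"], latest["written_at"] = v, time.time()
            time.sleep(write_period)

    def reader():
        end = time.time() + writes * write_period
        while time.time() < end:
            seen = store.get()
            stale = seen < latest["value"]        # older than the most recent write
            samples.append((time.time() - latest["written_at"], stale))
            time.sleep(1.0 / reads_per_sec)

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    stale_count = sum(1 for _, s in samples if s)
    print(f"{stale_count}/{len(samples)} reads were stale")

run_experiment(ToyEventualStore())
```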

SimpleDB: Read and Write from a Single Thread

›  With eventually consistent reads, there is a 33% chance of reading the freshest value within 500 ms
   -  Perhaps one master and two other replicas, with each read taking its value randomly from one of these?
›  The first time for eventually consistent reads to reach 99% “fresh” is stable at about 500 ms
›  There are outlier cases of stale reads after 500 ms, but no regular daily or weekly variation was observed

Stale Data in Other Cloud NoSQLs

›  SimpleDB (access from one thread, two threads, two processes, two VMs or two regions)
   -  The accessing source has no effect on the observable consistency. Eventually consistent reads have a 67% chance of seeing a stale value, until 500 ms after the write.
›  S3 (with five access configurations)
   -  No stale data was observed in ~4M reads per configuration. Provides better consistency than the SLA describes.
›  GAE datastore (access from a single app or two apps)
   -  Eventually consistent reads from different apps have a very low (3.3E-4 %) chance of observing values older than previous reads. Other reads never saw stale data.
›  Azure Storage (with five access configurations)
   -  No stale data observed. Matches the SLA described by the vendor (all reads are consistent).

Read-Your-Writes?

›  Read-your-writes: a read always sees a value at least as fresh as the latest write from the same thread/session
›  Our experiment: when reader and writer share one thread, all reads should be fresh
›  SimpleDB with eventually consistent reads: does not have this property
›  GAE with eventually consistent reads: may have this property
   -  No counterexample observed in ~3.7M reads over two weeks

Monotonic Reads?

›  Monotonic reads: each read sees a value at least as fresh as that seen in any previous read from the same thread/session
›  Our experiment: check for a fresh read followed by a stale one (a small checking sketch follows the table below)
›  SimpleDB: monotonic read consistency is not supported
   -  In staleness, two successive eventually consistent reads are almost independent
   -  The correlation between staleness in two successive reads (up to 450 ms after the write) is 0.0218, which is very low
›  GAE with eventually consistent reads: not supported
   -  3.3E-4 % chance of reading values older than previous reads

SimpleDB: staleness of two successive reads

              2nd Stale          2nd Fresh
1st Stale     39.94% (~1.9M)     21.08% (~1.0M)
1st Fresh     23.36% (~1.1M)     15.63% (~0.7M)
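A small sketch of the monotonic-reads check described above: scan one session's reads in order and flag any read that returns a value older than one already seen. Using integer version numbers as the freshness measure is an assumption of this sketch, not the paper's exact bookkeeping.

```python
def monotonic_read_violations(reads):
    """reads: version numbers in the order a single session observed them."""
    violations = []
    newest_seen = float("-inf")
    for i, version in enumerate(reads):
        if version < newest_seen:
            violations.append(i)       # this read went "backwards" in freshness
        newest_seen = max(newest_seen, version)
    return violations

# The third read returns version 1 after version 2 was already seen.
print(monotonic_read_violations([1, 2, 1, 3]))   # -> [2]
```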

SimpleDB’s Monotonic Write

›  The data item has value v0 before each run. Write value v1 and then v2 there, then read it repeatedly
›  When v1 != v2, writing v2 “pushes” v1 to the replicas immediately (the previous value v0 is not observed)
   -  Very different from the “only writing one value” case
›  When v1 = v2, the second write does not push (v0 is observed)
   -  Same as the “only writing one value” case


SimpleDB’s Eventually Consistent Read: Further Exploration -- Inter-Element Consistency

›  Consistency between two values when writing and reading them through various combinations of APIs
›  SimpleDB’s data model: a domain contains items, and each item holds attributes with values
›  The API calls combined in the experiments:
   -  Write a value / Read a value
   -  Write multiple values in an item / Read multiple values in an item
   -  Write multiple values across items in a domain / Read multiple values across items in a domain


›  Reading two values independently
   -  Each read has a 33% chance of freshness. Each read operation is independent.
›  Writing two at once and reading two at once
   -  Both are stale or both are fresh. It seems a “batch write” and a “batch read” each access a single replica.
›  Writing two in the same domain independently
   -  The second write “pushes” the value of the first write (but only if the two values are different)

Trade-Offs? SimpleDB

›  No significant difference was observed in RTT, throughput, or failure rate under various read-write ratios
   -  If anything, it favors consistent reads!
›  Financial cost is exactly the same

* Each client sends 1 rps. All results obtained under a 99:1 read-write ratio.

What Consumers Can Observe I

›  The SimpleDB platform showed frequent inconsistency
›  It offers an option for consistent reads. No extra costs for the consumer were observed in our experiments
   -  At least at the scale of our experiments (a few KB stored in a domain and ~2,500 rps)

?? Maybe the consumer should always program SimpleDB with consistent reads?

What Consumers Can Observe II

›  Some platforms gave (mostly) better consistency than they promise
   -  Consistent almost always (5-nines or better)
   -  Perhaps consistency is violated only when a network or node failure occurs during execution of the operation

?? Maybe the chance of seeing stale data is so rare on these platforms that it need not be considered in programming?
   -  There are other, more frequent, sources of data corruption, such as data-entry errors
   -  The manual processes that fix these may also be used for the rare errors from stale data

“Eventual consistency” is an imprecise label

›  Different algorithms/system designs that all give eventual consistency differ widely in other properties that matter to the consumer
›  The consumer needs to know more
   -  Rate of inconsistent reads
   -  Time to convergence
   -  Extra properties that would be important for programming
   -  Performance under a variety of workloads
   -  Availability
   -  Costs
›  Just as non-functional requirements are as crucial as functional ones in the SDLC

Implications for Consumers?

›  Can a consumer rely on our findings in decision-making? NO!
   -  Vendors might change any aspect of the implementation (including the presence of properties) without notice to consumers
      -  e.g., a shift from replication within a data center to geographical distribution
   -  Vendors might find a way to pass on to consumers the savings from eventually consistent reads (compared to consistent ones)
›  The lesson is that clear SLAs are needed, clarifying the properties that consumers can expect

Outline

›  Overview
›  Consistency Properties
   -  Research paper: Wada et al, CIDR’11
›  Transaction Techniques
   -  Research paper: Bailis et al, HotOS’13, VLDB’14

Transactions?

›  DBMS concept: the user can define a grouping that consists of a sequence of operations (on perhaps multiple data items), so they commit or abort as a group
   -  ACID: atomic (all-or-nothing outcome), consistent, isolated, durable
›  Serializability: the strongest form of isolation, equivalent to transactions executing serially
   -  A system can’t provide serializable transactions that are always available if the system can partition
   -  This was known long before Brewer; see Davidson et al (ACM Computing Surveys 1985)

Data Stores with Availability

›  Restricted form of transactions
   -  Operate on a set of items that are colocated
      -  Eg Google Megastore entity group, UCSB G-Store
   -  Multiple gets or multiple puts, but not a get with a put
      -  Eg Princeton COPS-GT, Eiger
›  Restricted data types
   -  Only allow commutative operations
      -  Eg INRIA CRDTs, Berkeley BloomL
›  Weak semantics
   -  Without isolation properties
      -  Eg ETH Consistency Rationing (some choices)

Data Stores without Availability

›  Systems that support general (read committed, SI-like, or even serializable) transactions
   -  but use 2PC, Paxos, a master replica for ordering, etc
   -  Eg Google Megastore (across entity groups), ETH Consistency Rationing (some choices), Google Spanner, MSR Walter, UCSB Paxos-CP, Yale Calvin, Berkeley Planet (formerly MDCC)


[HotOS’13]

ACID Transactions with weaker I

›  Serializability is the ideal for isolation of transactions, but most transactions on a (conventional, single-site) DBMS don’t run serializably
   -  Read Committed is often the default level

HAT

›  We propose that a useful model for programmers on an internet-scale store is to
   -  offer txns that can be an arbitrary collection of accesses to arbitrary sets of read/write objects,
   -  with semantics chosen to be as strong as feasible to implement with availability even when partitioned

Available?

›  Clearly not possible if the client is partitioned away from its data
   -  However, we should tolerate partitions between item replicas within the data store
›  So, we ask for:
   -  IF the client can get to one replica of each item it asks for, THEN the transaction can eventually commit (or it aborts voluntarily)

Semantics for HAT

›  We show you can offer available transactions that have
   -  All-or-nothing atomicity
   -  Isolation levels such as ANSI-defined read committed and repeatable read*
   -  But where reads may not always see the most recent committed changes

*in the absence of predicate reads [which is not an issue for a key-value store]

Defining semantics

›  We are following Adya’s MIT PhD thesis
   -  A graph showing edges between operations
   -  Types of edges: wr, ww, rw, and also happens-before
   -  Restrictions on the sorts of cycles that can occur (a toy illustration follows)
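A toy illustration of the style of these definitions, under the simplifying assumptions of this sketch (transaction-level edges only, no predicate reads): build a graph whose nodes are committed transactions and whose edges are labelled wr, ww or rw, then test for forbidden cycles. In Adya's formulation, read committed (PL-2) proscribes cycles made only of ww/wr dependency edges (G1c), while serializability (PL-3) also proscribes cycles that include rw anti-dependency edges.

```python
from collections import defaultdict

def has_cycle(edges, allowed_labels):
    """edges: (from_txn, to_txn, label) triples; label in {'wr', 'ww', 'rw'}."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst, label in edges:
        nodes.update((src, dst))
        if label in allowed_labels:
            graph[src].append(dst)

    state = {}                                 # node -> "visiting" or "done"
    def dfs(node):
        state[node] = "visiting"
        for nxt in graph.get(node, []):
            if state.get(nxt) == "visiting":
                return True                    # back edge found, i.e. a cycle
            if nxt not in state and dfs(nxt):
                return True
        state[node] = "done"
        return False

    return any(n not in state and dfs(n) for n in nodes)

# T1 --wr--> T2 (T2 read T1's write of x); T2 --rw--> T1 (T1 overwrote the
# version of y that T2 read). Allowed at read committed, not serializable.
history = [("T1", "T2", "wr"), ("T2", "T1", "rw")]
print(has_cycle(history, {"ww", "wr"}))          # False: no G1c cycle
print(has_cycle(history, {"ww", "wr", "rw"}))    # True: not serializable
```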

Implementation proposal

›  We can sketch implementations (a sketch of the repeatable-read cache follows)
   -  For read committed: the server buffers operations and applies them as a group when notified of commit
   -  For repeatable read: the client caches the value read for each item and reuses that
›  This is mainly to prove the existence of an available implementation
   -  Lots of engineering will be needed to get decent performance
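A minimal sketch of the client-side caching idea for repeatable read mentioned above. The single-key get/put store interface and class names are assumptions of this sketch, not any system's real API, and a full implementation would also need the server-side buffered commit and much more engineering.

```python
class RepeatableReadClient:
    """Per-transaction client: cache first-read values, buffer writes until commit."""
    def __init__(self, store):
        self._store = store
        self._read_cache = {}        # key -> value pinned for this transaction
        self._write_buffer = {}      # writes held back until commit

    def read(self, key):
        if key in self._write_buffer:            # read your own buffered write
            return self._write_buffer[key]
        if key not in self._read_cache:          # first read: fetch from any replica
            self._read_cache[key] = self._store.get(key)
        return self._read_cache[key]             # later reads reuse the same value

    def write(self, key, value):
        self._write_buffer[key] = value

    def commit(self):
        # Apply buffered writes; all-or-nothing atomicity across servers needs
        # more machinery (e.g. the server-side buffering sketched above).
        for key, value in self._write_buffer.items():
            self._store.put(key, value)
        self._read_cache.clear()
        self._write_buffer.clear()

# Example with a plain dict standing in for one reachable replica per item.
class DictStore(dict):
    def put(self, key, value):
        self[key] = value

store = DictStore(x=1)
txn = RepeatableReadClient(store)
print(txn.read("x"))      # 1
store.put("x", 2)         # another client changes x mid-transaction
print(txn.read("x"))      # still 1: the transaction's reads are repeatable
```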

Not all semantics are possible in HATs

›  A HAT can’t offer serializability, or snapshot isolation
›  More generally, a HAT can’t prevent the “lost update” phenomenon (a worked example follows)
   -  On one side of a partition, a client executes “read x, increment x by 10, return the result”, and on the other side, another client executes “read x, double x, return the result”
›  A HAT can’t offer a recency guarantee, such as “a transaction sees any other transaction that completed within the past 10 minutes”
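A concrete rendering of the lost-update example in the bullet above; the starting value and the reconciliation choice are illustrative assumptions.

```python
# Lost update: two clients on opposite sides of a partition each read x,
# compute locally, and write back. Whichever write survives reconciliation,
# the other client's update is lost; no serial order produces this outcome.
x = 100

read_a = x              # client A on one side of the partition
read_b = x              # client B on the other side

write_a = read_a + 10   # A: "read x, increment by 10"  -> wants x = 110
write_b = read_b * 2    # B: "read x, double it"        -> wants x = 200

# Serial executions would end with 220 (A then B) or 210 (B then A).
x = write_b             # suppose B's write wins after the partition heals
print(x)                # 200: A's increment has been silently lost
```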

Session guarantees and stickiness

›  To offer session guarantees, such as “read your writes” (from one transaction in a session to a later one) or “respect causality”, we must assume more of the platform
›  A sticky available system allows a transaction to commit whenever it can contact the same server used by previous transactions in the session
   -  But perhaps not if different transactions in the session use different servers

See http://www.bailis.org/blog/stickiness-and-client-server-session-guarantees/


Conclusion

›  We advocate that internet-scale data systems offer clients
   -  Unrestricted sets of operations on arbitrary multiple items as a transaction
   -  Semantics as strong as possible while remaining available during partitions
   -  We show this can achieve traditional ANSI weak isolation levels
   -  Open questions: what further properties can one offer? Can one get good performance? How does one design an application so it works properly with this isolation model?

References

›  Data Consistency and the Trade-Offs in Commercial Cloud Storage: The Consumers’ Perspective. H. Wada, A. Fekete, L. Zhao, K. Lee and A. Liu. Proceedings of the Conference on Innovative Data Systems Research (CIDR’11), Asilomar, USA, January 2011, pp 134-143.
   From http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper15.pdf


›  HAT, not CAP: Towards Highly Available Transactions. P. Bailis, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Proceedings of the USENIX Workshop on Hot Topics in Operating Systems (HotOS’13), Santa Ana Pueblo, New Mexico, USA, May 2013.
   From http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/11164-hotos13-final80.pdf

›  Highly Available Transactions: Virtues and Limitations. P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Proceedings of the VLDB Endowment, Vol 7, No 3, pp 181-192.
   From http://www.vldb.org/pvldb/vol7/p181-bailis.pdf


Further reading

›  “Replication Theory and Practice”, ed. B. Charron-Bost, F. Pedone and A. Schiper; Springer LNCS 5959 (2010)
   -  Available online at http://opac.library.usyd.edu.au:80/record=b3851904~S4
›  “Data Management in the Cloud” by D. Agrawal, A. El Abbadi, and S. Das; Morgan & Claypool (2012)
   -  Available online at http://opac.library.usyd.edu.au:80/record=b4452001~S4
