Full Stack High Availability In an Eventually Consistent World
Ben Coverston @bcoverston DataStax Inc
QCon Rio 2014
Who Am I?
• Ben Coverston
• Contributor: Apache Cassandra
• DSE Architect, DataStax Inc
Availability
“An analytics system based on Cassandra, for example, where there are no single points of failure might be considered fault tolerant, but if application-level data migrations, software upgrades, or configuration changes take an hour or more of downtime to complete, then the system is not highly available.”
Failure Before ~2010
• The website can fail
• We have a farm!
What About the Middleware?
• Make it stateless
• Spin up a bunch of them
• Who cares if one fails, we can recover.
How about the Database?
• Build a massive database server
• Scale up
How about the Database?
• We can back up to tape!
• MTTR: hours, possibly days.
• We can mirror!
• Possible loss of data
• Some loss of availability during recovery
• What if we have multiple Availability Zones?
• Geographical distribution of master-slave systems is not practical
No Good Option
• RDBMS Recovery is a Special Case
• RDBMS Was Not Built for Failure
• Once you shard it, you lose the benefits of an RDBMS anyway
The Problem
• Traditional databases don’t scale
• Because information is context sensitive
• When relationships matter, so does time
• When you guarantee consistency, something else has to give.
Eventual Consistency
“In an ideal world there would only be one consistency model . . . Many systems during this time took the approach that it was better to fail the complete system than to break this transparency.” [1]
— Werner Vogels, CTO, amazon.com
The CAP Theorem
• Dr. Eric Brewer [2]
• Consistency
• Availability
• Partition Tolerance
Tradeoffs!
• With Distribution we have to accept Partitioning
• If you want strong consistency, any failure of a master will result in a partial outage.
Banks Use BASE, not ACID
• Basically Available, Soft State, Eventually Consistent (BASE)
• Real time transactions are preferred, but ATMs fall back to partitioned mode when disconnected.
Why Does This Work?
• Templars and Hospitallers were some of the first modern bankers [4]
• ATM Networks try to be fully consistent
• Banks lose money when ATMs are not working
• Partitioned State Fallback
• Operations are commutative
• Risk is an actuarial problem
Building On Eventual Consistency
• Eventual Consistency means . . .
• Two queries at the same time could get different results
• If that’s bad for your application:
• Change your application logic
• Change your business model
• OR
• Don’t use eventual consistency
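If a given read cannot tolerate stale data, Cassandra lets you make that tradeoff per query rather than per system. A minimal sketch using the DataStax Java driver from Scala; the contact point and the id column are assumptions for illustration, and the keyspace and table are the ones used in the Spark example later in this talk.

  import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

  object ConsistencyDemo extends App {
    // Assumed contact point; adjust for your cluster.
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("newyork")

    // ConsistencyLevel.ONE: fast and highly available, but may return stale data.
    val relaxed = new SimpleStatement(
      "SELECT location FROM presidentlocations WHERE id = 1") // hypothetical id column
    relaxed.setConsistencyLevel(ConsistencyLevel.ONE)

    // ConsistencyLevel.QUORUM: a majority of replicas must respond; paired with
    // QUORUM writes (R + W > N), reads are strongly consistent, at the cost of
    // failing when a majority of replicas is unreachable.
    val strict = new SimpleStatement(
      "SELECT location FROM presidentlocations WHERE id = 1")
    strict.setConsistencyLevel(ConsistencyLevel.QUORUM)

    println(session.execute(relaxed).one())
    println(session.execute(strict).one())
    cluster.close()
  }

The second query is the one that turns a partition into a partial outage; that is exactly the tradeoff the slide describes.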
Seat Inventory
• Airlines are eventually consistent too!
• Aircraft are routinely oversold (because booking flights is a distributed systems problem)
• People fail to show up, the airline makes money
• Too many people show up, the airline compensates a few, the airline makes money
But What If?
• I Need Global Distribution
• Strong Consistency
• At Scale
Remember, Tradeoffs
Other Problems
• Real Time Analytics is a Challenge
• MapReduce Helps
• But it’s too slow for many things
Lambda Architecture
• Real Time should be Real Time
• Analytics is Batch
• Real time layer depends on processing incoming streams, or pre-aggregated data
• In a non-trivial system, CAP still affects the design
Spark
• Not limited to MapReduce
• Directed Acyclic Graph (DAG)
• Easy-To-Understand Paradigm
• Compared to Hadoop
What is Spark?
• Apache Project Since 2010
• Fast
• 10-100x faster than Hadoop MapReduce
• Easy
• Scala, Java, Python APIs
• A lot less code (for you to write)
• Interactive Shell
Hadoop Mapreduce WordCount
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
Spark MapReduce for WordCount
scala> sc.cassandraTable("newyork", "presidentlocations")
         .map( _.get[String]("location") )
         .flatMap( _.split(" ") )
         .map( (_, 1) )
         .reduceByKey( _ + _ )
         .toArray
res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))
[Diagram: the pipeline stages applied to one row, "1 white house"]
  cassandraTable       -> "1 white house"
  get[String]          -> "white house"
  _.split(" ")         -> "white", "house"
  (_, 1)               -> ("white", 1), ("house", 1)
  reduceByKey( _ + _ ) -> ("house", 2)
Just In Memory?
• Map Reduce Style Analytics
• Not Just In-Memory (though it is really good for ‘iteration’)
Cassandra + Spark
• Cassandra-Spark Driver
• Open source
• https://github.com/datastax/cassandra-driver-spark
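A minimal sketch of the driver in use, assuming Spark running locally and a Cassandra node at 127.0.0.1; the wordcounts table is hypothetical and would have to be created first.

  import org.apache.spark.{SparkConf, SparkContext}
  import com.datastax.spark.connector._ // adds cassandraTable and saveToCassandra

  object ConnectorDemo extends App {
    val conf = new SparkConf()
      .setMaster("local[2]") // assumed local master
      .setAppName("connector-demo")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read a Cassandra table as an RDD of rows.
    val rows = sc.cassandraTable("newyork", "presidentlocations")
    println(rows.count())

    // Write an RDD back to a (pre-created, hypothetical) table.
    sc.parallelize(Seq(("house", 2), ("white", 1)))
      .saveToCassandra("newyork", "wordcounts", SomeColumns("word", "count"))

    sc.stop()
  }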
Distributed Aggregation (A case study)
• Real time distributed counting is hard.
• At high volume, there are still open problems.
Goals
• Provide near-real time counts for increments
• Updates are non-monotonic (can come out of order)
• We want to be able to query on windows (how many events happened last week? the week before?)
• We also want to know what is happening right now
What about Streaming?
• Storm or Spark Streaming can help for some use cases
• But to do it right, you have to get acks from an external system (Spark), or block until the items are processed by something you have integrated (Storm).
• Blocking for stream processing could cause back pressure, and loss of availability.
Distributed Counting
• Aggregate over time
• Per shard
• Save deltas
• Timestamps
• Because timestamps can come out of order
• Arrival time is important
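A sketch of what one of these deltas might look like as a record, together with the merge rule that makes out-of-order arrival safe. The field names mirror the tables shown later in the talk; everything else is an assumption.

  // Hypothetical delta record: one increment observed by one shard.
  case class Delta(shard: Int, arrived: Long, timestamp: Long,
                   delta: Long, count: Long)

  object Delta {
    // Merging deltas for the same (shard, timestamp) is plain addition, which
    // is commutative and associative, so arrival order does not matter.
    def merge(a: Delta, b: Delta): Delta = {
      require(a.shard == b.shard && a.timestamp == b.timestamp,
              "only deltas for the same shard and timestamp merge")
      Delta(a.shard, math.min(a.arrived, b.arrived), a.timestamp,
            a.delta + b.delta, a.count + b.count)
    }
  }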
What About Commit Log Replay?
• We could mark commit log entries dirty if the stream hasn’t processed the data
• Block on discarding the segment until processing is complete.
• This will work
• But we would have to fork C* to do it.
Rule #1
• We do not fork Cassandra
What About Commit Log Replay (cont.)
• Alternatively, we would have to wait for Future Cassandra
• We could use Kafka (distributed commit log)
• But then we have two systems, and two commit logs
• ZooKeeper? . . . Just no.
Compromises
• Create a C* plugin to do aggregation
• Do it locally, on each node (on the coordinator).
• Create a separate API to query for aggregation
• Create real-time aggregates on the fly
• Store snapshotted data in C* for windowed aggregation (1s, 1h, 1d, 1w, 1m).
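A sketch of how a delta’s event timestamp might be bucketed into those snapshot windows. The bucketing function is an assumption for illustration; calendar months need real calendar arithmetic, so only the fixed-width windows are shown.

  // Hypothetical window rollup: truncate an event timestamp (ms) to the start
  // of each enclosing window, so every delta lands in one row per window size.
  val windowsMs: Map[String, Long] = Map(
    "1s" -> 1000L,
    "1h" -> 3600L * 1000,
    "1d" -> 24L * 3600 * 1000,
    "1w" -> 7L * 24 * 3600 * 1000)

  def buckets(timestampMs: Long): Map[String, Long] =
    windowsMs.map { case (name, width) =>
      name -> (timestampMs / width) * width // start of the enclosing window
    }

  // "How many events happened last week?" becomes one bucket lookup, not a scan.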
Deltas
• Deltas are stored
• Aggregates have to be composed of commutative operations (because we cannot recalculate everything, every time)
• Cumulative average is a good example of a compatible streaming operation (a sketch follows the tables below).
SHARD  ARRIVED  TIMESTAMP  DELTA  COUNT
  1       1         0        50     40
  1       2         0        40     30
  2       1         1        10     20
  2       2         2        30     40
After merging, shard 1’s two deltas for timestamp 0 collapse into a single row:

SHARD  ARRIVED  TIMESTAMP  DELTA  COUNT
  1       1         0        90     70
  2       1         1        10     20
  2       2         1        30     40
Aggregation Service: Average T(1:2) -> 0.923
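The cumulative average from the Deltas slide works as a streaming aggregate because it can be carried as a (sum, count) pair whose merge is commutative and associative; a minimal sketch:

  // Carry the average as (sum, count); merging partial aggregates from
  // different shards gives the same result in any order.
  case class Avg(sum: Double, count: Long) {
    def merge(other: Avg): Avg = Avg(sum + other.sum, count + other.count)
    def value: Double = if (count == 0L) 0.0 else sum / count
  }

  object Avg {
    def of(xs: Double*): Avg = Avg(xs.sum, xs.size.toLong)
  }

  // Avg.of(1, 2).merge(Avg.of(3)).value == Avg.of(3).merge(Avg.of(1, 2)).value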
But Nodes Can Fail
• Snapshotted data is stored with RF > 1 (similar to RDDs)
• Aggregation is done by a ‘fat client’ running on each node.
• If a network partition happens, the real-time counts may be inconsistent.
• In case of a node failure, the counts may need to be repaired
Network Partition
[Diagram: a four-node C* cluster split by a network partition; each side computes a different answer to the same query: Average T(1:2) -> 0.923 on one side, Average T(1:2) -> 2.345 on the other]
The Partition Decision [3]
• Cancel the operation, and decrease availability
• Proceed with the operation, and risk inconsistency
If You Accept Eventual Consistency
• Real time aggregates may be inaccurate
• Due to a network partition (may persist for hours)
• Due to latency (speed of light, network latency, small number of ms)
• In this system, historical aggregates are more reliable, because the deltas get written (and replicated) every second.
Designing for Eventual Consistency
• Partitions happen in the real world (not a myth)
• If you are building a distributed system, you have to account for possible failure.
• Define system behavior under failure conditions
• Make adjustments, set expectations
Call To Action
• When building distributed systems
• Reason about concurrency
• Avoid Locking (if at all possible)
• Learn about Commutative Replicated Data Types (CRDTs); a sketch follows this list
• Learn about MultiVersion Concurrency Control (MVCC)
• Learn Functional Programming
• Scala, Clojure (lisp), whatever
• Functional programming makes distributed programming better
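For the CRDT bullet above: a minimal sketch of the simplest state-based CRDT, a grow-only counter (G-Counter). Each replica increments only its own slot, and merge takes the per-replica maximum, so merging is commutative, associative, and idempotent; partitioned replicas converge as soon as they exchange state.

  // G-Counter: each replica owns one slot in the map.
  case class GCounter(counts: Map[String, Long] = Map.empty) {
    // A replica only ever increments its own entry.
    def increment(nodeId: String, by: Long = 1L): GCounter =
      GCounter(counts.updated(nodeId, counts.getOrElse(nodeId, 0L) + by))

    // Merge takes the maximum per replica: commutative, associative, idempotent.
    def merge(other: GCounter): GCounter =
      GCounter((counts.keySet ++ other.counts.keySet).map { k =>
        k -> math.max(counts.getOrElse(k, 0L), other.counts.getOrElse(k, 0L))
      }.toMap)

    def value: Long = counts.values.sum
  }

  // Two replicas that diverged during a partition converge after merging:
  // val a = GCounter().increment("node-a").increment("node-a")
  // val b = GCounter().increment("node-b")
  // a.merge(b).value == b.merge(a).value // both are 3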
Things to Look At
• Cassandra (fully distributed database)
• Actor Pattern
• akka-cluster (fully distributed compute platform)
References
[1] http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
[2] http://en.wikipedia.org/wiki/CAP_theorem
[3] http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
[4] http://en.wikipedia.org/wiki/History_of_banking