Full Stack High Availability In an Eventually Consistent World Ben Coverston @bcoverston DataStax Inc QConRio 2014


Page 1: Full Stack High Availability - qconrio.comqconrio.com/rio2014/system/files/presentation-slides/ben_coverston...“An analytics system based on Cassandra, for example, where there are

Full Stack High Availability In an Eventually Consistent World

Ben Coverston @bcoverston DataStax Inc

QConRio 2014


Who Am I?

• Ben Coverston

• Contributor: Apache Cassandra

• DSE Architect, DataStax Inc


Availability


“An analytics system based on Cassandra, for example, where there are no single points of failure might be considered fault tolerant, but if application-level data migrations, software upgrades, or configuration changes take an hour or more of downtime to complete, then the system is not highly available.”


Failure Before ~2010

• The website can fail

• We have a farm!


What About the Middleware?

• Make it stateless

• Spin up a bunch of them

• Who cares if one fails? We can recover.


How about the Database?

• Build a massive database server

• Scale up


How about the Database?

• We can back up to tape!

• MTTR: hours, possibly days.

• We can mirror!

• Possible loss of data

• Some loss of availability during recovery

• What if we have multiple Availability Zones?

• Geographical distribution of master-slave systems is not practical


No Good Option

• RDBMS Recovery is a Special Case

• RDBMS Was Not Built for Failure

• Once you shard it, you lose the benefits of an RDBMS anyway


The Problem

• Traditional databases don’t scale

• Because information is context sensitive

• When relationships matter, so does time

• When you guarantee consistency, something else has to give.


Eventual Consistency

“In an ideal world there would only be one consistency model . . . Many systems during this time took the approach that it was better to fail the complete system than to break this transparency” [1]

— Werner Vogels CTO amazon.com


The CAP Theorem

• Dr. Eric Brewer

• Consistency

• Availability

• Partition Tolerance


Tradeoffs!

• With Distribution we have to accept Partitioning

• If you want strong consistency, any failure of a master will result in a partial outage.


Banks Use BASE, not ACID


• Basically Available, Soft State, Eventually Consistent (BASE)

• Real time transactions are preferred, but ATMs fall back to partitioned mode when disconnected.


Why Does This Work?

• Templars and Hospitallers were some of the first modern bankers [4]

• ATM Networks try to be fully consistent

• Banks lose money when ATMs are not working

• Partitioned State Fallback

• Operations are commutative

• Risk is an actuarial problem
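The "operations are commutative" point can be illustrated with a minimal sketch. It assumes account operations are recorded as signed deltas in cents; the class name `AtmLedger`, the method `apply`, and all amounts are hypothetical and not taken from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: every name and amount here is illustrative.
final class AtmLedger {

    // Folds signed deltas (in cents) into a balance. Addition is commutative
    // and associative, so the order in which deltas arrive does not matter.
    static long apply(long openingBalance, List<Long> deltas) {
        long balance = openingBalance;
        for (long delta : deltas) {
            balance += delta;
        }
        return balance;
    }

    public static void main(String[] args) {
        // Deposits are positive, withdrawals negative.
        List<Long> orderSeenByBank = Arrays.asList(5_000L, -2_000L, -1_500L);
        List<Long> orderSeenByAtm = new ArrayList<>(orderSeenByBank);
        Collections.reverse(orderSeenByAtm); // same operations, replayed in another order

        System.out.println(apply(10_000L, orderSeenByBank));
        System.out.println(apply(10_000L, orderSeenByAtm));
    }
}
```

Both replay orders end at the same balance, which is what lets a disconnected ATM log operations and reconcile later. A check such as "reject if the balance would go negative" does not commute, which is presumably where the slide's "risk is an actuarial problem" comes in: the bank bounds the exposure instead of preventing it.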


Building On Eventual Consistency

• Eventual Consistency means . . .

• Two queries at the same time could get different results

• If that’s bad for your application:

• Change your application logic

• Change your business model

• OR

• Don’t use eventual consistency


Seat Inventory

• Airlines are eventually consistent too!

• Aircraft are routinely oversold (because booking flights is a distributed systems problem)

• People fail to show up, the airline makes money

• Too many people show up, the airline compensates a few, the airline makes money


But What If?

• I Need Global Distribution

• Strong Consistency

• At Scale


Remember, Tradeoffs


Other Problems

• Real Time Analytics is a Challenge

• MapReduce Helps

• But it’s too slow for many things


Lambda Architecture

• Real time should be real time

• Analytics is Batch

• Real time layer depends on processing incoming streams, or pre-aggregated data

• In a non-trivial system, CAP still affects the design


Spark

• Not limited to MapReduce

• Directed Acyclic Graph (DAG)

• Easy-To-Understand Paradigm

• Compared to Hadoop


What is Spark?

• Open source since 2010, an Apache project since 2013

• Fast

• 10-100x faster than Hadoop MapReduce

• Easy

• Scala, Java, Python APIs

• A lot less code (for you to write)

• Interactive Shell


Hadoop MapReduce WordCount

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Emits (word, 1) for every whitespace-separated token in an input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}


Spark MapReduce for WordCount

scala> sc.cassandraTable("newyork", "presidentlocations")
         .map(_.get[String]("location"))
         .flatMap(_.split(" "))
         .map((_, 1))
         .reduceByKey(_ + _)
         .toArray

res17: Array[(String, Int)] = Array((1,3), (House,4), (NYC,3), (Force,3), (White,4), (Air,3))

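[Diagram: data flow for the row "1 white house". cassandraTable and get[String] yield "white house"; _.split() yields white, house; (_,1) yields (white, 1), (house, 1); _ + _ combines (house, 1), (house, 1) into (house, 2).]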
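The same split, pair, and sum stages can be traced on a single machine with plain Java streams. This is a sketch for following the data flow only, not Spark code and not part of the talk; the class name `LocalWordCount` and the sample lines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-JVM sketch of the word-count stages; no cluster involved.
final class LocalWordCount {

    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                // corresponds to flatMap(_.split(" ")): one line becomes many words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toMap(
                        word -> word,   // the key of each pair
                        word -> 1,      // corresponds to map((_, 1))
                        Integer::sum,   // corresponds to reduceByKey(_ + _)
                        TreeMap::new)); // sorted map, only to make output stable
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("white house", "house")));
    }
}
```

The difference in Spark is that each stage runs per partition across the cluster and only `reduceByKey` moves data between nodes; the shape of the computation is the same.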


Just In Memory?

• Map Reduce Style Analytics

• Not Just In-Memory (though it is really good for ‘iteration’)


Cassandra + Spark

• Cassandra-Spark Driver

• Open source

• https://github.com/datastax/cassandra-driver-spark


Distributed Aggregation (A case study)

• Real time distributed counting is hard.

• At high volume, there are still open problems.


Goals

• Provide near-real time counts for increments

• Updates are non-monotonic (can come out of order)

• We want to be able to query on windows (how many events happened last week? The week before?)

• We also want to know what is happening right now


What about Streaming?

• Storm or Spark Streaming can help for some use cases

• But to do it right, you have to get acks from an external system (Spark), or block until the items get processed by something you might have integrated (Storm).

• Blocking for stream processing could cause back pressure, and loss of availability.
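The trade-off in the last bullet can be sketched with a bounded in-memory queue between the write path and a stream processor. This is an illustration under assumptions, not the design from the talk: `BoundedIngest` and its methods are hypothetical names, and the only library type used is `java.util.concurrent.ArrayBlockingQueue`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical hand-off between a write path and a slower stream processor.
final class BoundedIngest {
    private final BlockingQueue<String> pending;
    private long rejected = 0;

    BoundedIngest(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking hand-off: offer() returns false when the queue is full,
    // so the writer is never stalled by a slow consumer. Using put() here
    // instead would block the writer, which is the back pressure case.
    boolean submit(String event) {
        boolean accepted = pending.offer(event);
        if (!accepted) {
            rejected++; // the event skipped the stream and must be repaired later
        }
        return accepted;
    }

    long rejected() {
        return rejected;
    }

    int backlog() {
        return pending.size();
    }

    public static void main(String[] args) {
        BoundedIngest ingest = new BoundedIngest(2);
        for (String event : new String[] {"a", "b", "c"}) {
            System.out.println(event + " accepted: " + ingest.submit(event));
        }
        System.out.println("rejected so far: " + ingest.rejected());
    }
}
```

Neither choice is free: blocking keeps the stream complete but can make writes unavailable, while dropping keeps writes available but leaves the real-time view incomplete until something reconciles it.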


Distributed Counting

• Aggregate over time

• Per shard

• Save deltas

• Timestamps

• Because timestamps can come out of order

• Arrival time is important


What About Commit Log Replay?

• We could mark commit log entries dirty if the stream hasn’t processed the data

• Block on discarding the segment until processing is complete.

• This will work

• But we would have to fork C* to do it.


Rule #1

• We do not fork Cassandra


What About Commit Log Replay (cont.)

• Alternatively, we would have to wait for Future Cassandra

• We could use Kafka (distributed commit log)

• But then we have two systems, and two commit logs

• ZooKeeper? . . . Just no.


Compromises

• Create a C* plugin to do aggregation

• Do it locally, on each node (on the coordinator).

• Create a separate API to query for aggregation

• Create real-time aggregates on the fly

• Store snapshotted data in C* for windowed aggregation (1s, 1h, 1d, 1w, 1m).
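Windowed aggregation over per-second snapshots can be sketched as a roll-up keyed by window start. The code below is an assumption about how such a roll-up might look, not the plugin described in the talk; `WindowRollup`, `windowStart`, and `rollUp` are hypothetical names, and timestamps are taken to be epoch seconds.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical roll-up of per-second counts into coarser fixed-size windows.
final class WindowRollup {
    static final long HOUR_SECONDS = 3_600L;

    // Start of the fixed-size window that contains the given timestamp.
    // floorMod keeps the result correct for negative timestamps as well.
    static long windowStart(long epochSeconds, long windowSeconds) {
        return epochSeconds - Math.floorMod(epochSeconds, windowSeconds);
    }

    // Sums per-second snapshot values into windows of windowSeconds.
    static Map<Long, Long> rollUp(Map<Long, Long> perSecond, long windowSeconds) {
        Map<Long, Long> windows = new TreeMap<>();
        for (Map.Entry<Long, Long> entry : perSecond.entrySet()) {
            long start = windowStart(entry.getKey(), windowSeconds);
            windows.merge(start, entry.getValue(), Long::sum);
        }
        return windows;
    }

    public static void main(String[] args) {
        Map<Long, Long> perSecond = new TreeMap<>();
        perSecond.put(10L, 5L);
        perSecond.put(3_599L, 2L);
        perSecond.put(3_600L, 7L);
        System.out.println(rollUp(perSecond, HOUR_SECONDS));
    }
}
```

Fixed-size arithmetic like this covers second, hour, day, and week windows. Calendar months vary in length, so a monthly window would need calendar-aware bucketing instead.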


Deltas

• Deltas are stored

• Aggregates have to be composed of commutative operations (because we cannot recalculate everything, every time)

• Cumulative Average is a good example of a compatible streaming operation.
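A cumulative average fits this constraint when it is carried as a (sum, count) pair instead of as a finished number. The sketch below assumes that representation; `RunningAverage` is a hypothetical name and the values are illustrative.

```java
// Hypothetical mergeable average: partial results are (sum, count) pairs.
final class RunningAverage {
    final long sum;
    final long count;

    RunningAverage(long sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Component-wise addition, so merging is commutative and associative:
    // partial results can be combined in any order and any grouping.
    RunningAverage merge(RunningAverage other) {
        return new RunningAverage(sum + other.sum, count + other.count);
    }

    double value() {
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public static void main(String[] args) {
        RunningAverage a = new RunningAverage(50, 40);
        RunningAverage b = new RunningAverage(10, 20);
        // Merging the pairs gives the true average over all 60 events.
        System.out.println(a.merge(b).value());
        // Averaging the two averages weights both sides equally and differs.
        System.out.println((a.value() + b.value()) / 2.0);
    }
}
```

Storing the pair is what makes the delta approach workable: a late or out-of-order delta is merged in without recalculating anything that came before.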


SHARD  ARRIVED  TIMESTAMP  DELTA  COUNT
1      1        0          50     40
1      2        0          40     30
2      1        1          10     20
2      2        2          30     40


SHARD  ARRIVED  TIMESTAMP  DELTA  COUNT
1      1        0          50     40
1      1        0          40     30
2      1        1          10     20
2      2        1          30     40

SHARD  ARRIVED  TIME  DELTA  COUNT
1      1        0     90     70
2      1        1     10     20
2      2        1     30     40
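One reading of the two tables above is that rows sharing the same shard, arrival, and time values are merged by adding their DELTA and COUNT columns. A minimal sketch of that compaction follows; the grouping rule is inferred from the slide, and `DeltaRow` and `DeltaCompactor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical row type mirroring the slide's columns.
final class DeltaRow {
    final int shard;
    final int arrived;
    final int time;
    final long delta;
    final long count;

    DeltaRow(int shard, int arrived, int time, long delta, long count) {
        this.shard = shard;
        this.arrived = arrived;
        this.time = time;
        this.delta = delta;
        this.count = count;
    }

    // Rows with the same key are candidates for merging (assumed rule).
    String key() {
        return shard + ":" + arrived + ":" + time;
    }

    @Override
    public String toString() {
        return shard + " " + arrived + " " + time + " " + delta + " " + count;
    }
}

final class DeltaCompactor {

    // Merges rows with equal keys by summing delta and count.
    // LinkedHashMap preserves the order in which keys first appear.
    static List<DeltaRow> compact(List<DeltaRow> rows) {
        Map<String, DeltaRow> byKey = new LinkedHashMap<>();
        for (DeltaRow row : rows) {
            byKey.merge(row.key(), row, (a, b) ->
                    new DeltaRow(a.shard, a.arrived, a.time,
                            a.delta + b.delta, a.count + b.count));
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        List<DeltaRow> rows = Arrays.asList(
                new DeltaRow(1, 1, 0, 50, 40),
                new DeltaRow(1, 1, 0, 40, 30),
                new DeltaRow(2, 1, 1, 10, 20),
                new DeltaRow(2, 2, 1, 30, 40));
        for (DeltaRow row : compact(rows)) {
            System.out.println(row);
        }
    }
}
```

Fed the four rows of the first table, this yields the three rows of the second, with the first two rows combined into delta 90 and count 70.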



Aggregation Service Average T(1:2) -> 0.923


But Nodes Can Fail

• Snapshotted data is stored with RF > 1 (similar to RDDs)

• Aggregation is done by a ‘fat client’ running on each node.

• If a network partition happens, the real-time counts may be inconsistent.

• In case of a node failure, the counts may need to be repaired


Network Partition

[Diagram: four C* nodes split by a network partition. One side reports Average T(1:2) -> 0.923, the other reports Average T(1:2) -> 2.345.]


The Partition Decision [3]

• Cancel the operation, and decrease availability

• Proceed with the operation, and risk inconsistency


If You Accept Eventual Consistency

• Real time aggregates may be inaccurate

• Due to a network partition (may persist for hours)

• Due to latency (speed of light, network latency, small number of ms)

• In this system, historical aggregates are more reliable, because the deltas get written (and replicated) every second.


Designing for Eventual Consistency

• Partitions happen in the real world (not a myth)

• If you are building a distributed system, you have to account for possible failure.

• Define system behavior under failure conditions

• Make adjustments, set expectations


Call To Action

• When building distributed systems

• Reason about concurrency

• Avoid Locking (if at all possible)

• Learn about Commutative Replicated Data Types (CRDTs)

• Learn about MultiVersion Concurrency Control (MVCC)

• Learn Functional Programming

• Scala, Clojure (lisp), whatever

• Functional programming makes distributed programming better
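As a starting point for the CRDT bullet, here is a minimal sketch of one well-known example, a state-based grow-only counter (often called a G-Counter). The talk only names CRDTs in general; the choice of this example and the names `GCounter`, `increment`, and `merge` are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical grow-only counter: one slot per node, merged by per-slot max.
final class GCounter {
    private final Map<String, Long> perNode = new HashMap<>();

    // Each node only ever increases its own slot.
    void increment(String nodeId, long amount) {
        if (amount < 0) {
            throw new IllegalArgumentException("a grow-only counter cannot decrease");
        }
        perNode.merge(nodeId, amount, Long::sum);
    }

    long value() {
        long total = 0;
        for (long slot : perNode.values()) {
            total += slot;
        }
        return total;
    }

    // Taking the per-slot maximum is commutative, associative, and idempotent,
    // so replicas converge regardless of merge order or repeated merges.
    void merge(GCounter other) {
        for (Map.Entry<String, Long> entry : other.perNode.entrySet()) {
            perNode.merge(entry.getKey(), entry.getValue(), Long::max);
        }
    }

    public static void main(String[] args) {
        GCounter replicaA = new GCounter();
        GCounter replicaB = new GCounter();
        replicaA.increment("a", 3); // counted on node a during a partition
        replicaB.increment("b", 4); // counted on node b during the same partition

        replicaA.merge(replicaB);
        replicaB.merge(replicaA);
        System.out.println(replicaA.value() + " " + replicaB.value());
    }
}
```

No locks or coordination are needed on the write path, which ties this bullet back to "avoid locking": each replica accepts increments locally and agreement is reached whenever states are exchanged.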


Things to Look At

• Cassandra (fully distributed database)

• Actor Pattern

• akka-cluster (fully distributed compute platform)


References

[1] http://www.allthingsdistributed.com/2007/12/eventually_consistent.html

[2] http://en.wikipedia.org/wiki/CAP_theorem

[3] http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

[4] http://en.wikipedia.org/wiki/History_of_banking


Thank You!

@bcoverston

We Are Hiring!