Building Antifragile Applications with Apache Cassandra

Post on 06-May-2015


Even with the best infrastructure, failures occur without warning and are almost guaranteed. Building applications that can withstand this fact of life can be both art and science. In this talk, I'll try to eliminate the art portion and focus on the science. Starting with high-level architecture decisions, I will take you through each layer and finally down to actual application code. Using Cassandra as the back-end database, we can build layers of fault tolerance that leave end users completely unaware of the underlying chaos that could be occurring. With a little planning, we can say goodbye to the Fail Whale and the fragility of the traditional RDBMS. Topics include:

- Application strategies to utilize active-active, diverse data centers
- Replicating data with the highest integrity and maximum resilience
- Utilizing Cassandra's built-in fault tolerance
- Architecture of private, cloud, or hybrid based applications
- Application driver techniques when using Cassandra

Transcript of Building Antifragile Applications with Apache Cassandra

@PatrickMcFadin

Patrick McFadin, Solutions Architect, DataStax

Building Antifragile Applications with Apache Cassandra

Wednesday, August 21, 13

Who I am


• Patrick McFadin
• Solutions Architect at DataStax
• Cassandra MVP
• User for years
• Follow me for more:

I talk about Cassandra and building scalable, resilient apps ALL THE TIME!

@PatrickMcFadin


Background - Why are we doing this?

• We live in an always-on society
• Data-driven applications rule the day (for now)
• Failure as reality


• Mike Christian - Yahoo! Director of Engineering, Infrastructure Resilience
• Frying Squirrels and unspun gyros


Background - Antifragile as a practice

• Antifragile: Things That Gain From Disorder
• Nassim Nicholas Taleb
• Things that get better with a little chaos


• Jesse Robbins
• Master of Disaster at Amazon

• Bringer of “Game Day”


Background - Distributed in a global economy

• Closer to your users == happy users
• Latency is just physics
• Best case for light in fiber from US East to US West? ~20 ms


From            To                Latency (ms)
New York        London            75.07
New York        Rio de Janeiro    110.28
San Francisco   Tokyo             97.33
Singapore       Los Angeles       183.19
Tokyo           London            242.88
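The 20 ms figure above is easy to sanity-check: light in fiber travels at roughly two-thirds of its vacuum speed, so one-way latency is distance divided by ~200,000 km/s. A quick sketch (the ~4,100 km coast-to-coast fiber distance is an assumption):

```java
// Rough sanity check of the "latency is just physics" claim.
// Assumptions: light in fiber travels at ~2/3 the speed of light in
// vacuum (refractive index ~1.5), and US East to US West is ~4,100 km.
public class FiberLatency {
    static final double SPEED_OF_LIGHT_KM_S = 299_792.458;
    static final double FIBER_FACTOR = 0.66;

    // One-way latency in milliseconds for a given fiber distance.
    static double oneWayMs(double distanceKm) {
        return distanceKm / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR) * 1000.0;
    }

    public static void main(String[] args) {
        System.out.printf("US East -> US West: %.1f ms one way%n", oneWayMs(4100));
        System.out.printf("New York -> London: %.1f ms one way%n", oneWayMs(5600));
    }
}
```

For New York to London (~5,600 km of cable) the best case is about 28 ms one way, or ~57 ms round trip, which squares with the 75.07 ms observed in the table once routing and equipment overhead are added.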


Challenges - When it was just HTTP

• Roaring '90s: just static HTTP
• Some of it was data driven, but most was not
• One awesome web server


Challenges - More than one web server?

• A Pentium 2 web server wasn't going to do it
• More than one web server? Wow
• Distribute the HTML and then...?


Challenges - Spreading the load

• Clients have to find your now-distributed content
• Round-robin DNS to the rescue!
• Crazy router hacks
• Hardware-based load balancers
• Resilient to failure


Challenges - Content Distribution Networks

• Cat pictures suck bandwidth
• Cat videos suck even MORE bandwidth
• User experience depends on response time
• Content closer to users? Sweet

Source: http://www.paulund.co.uk/content-delivery-network-review


Challenges - What about data?

• Databases have been designed as singletons (ACID)
• Master-slave replication dominates
• Sharding - no more joins
• Distributed replication is hard, if not impossible


Panic!!


Challenges - Facebook and MySQL

• Heavily sharded, but in one DC
• Needed a second data center
• In 2007, opted for slave DBs
• Re-wrote the query parser and "hijacked the MySQL replication stream"
• Transparent to the application


Ref: https://www.facebook.com/note.php?note_id=23844338919&id=9445547199&index=0


Challenges - Finally some science!

• 2007 - Amazon and the Dynamo paper
• Attempted to answer:
  • How do we distribute our data?
  • How do we maximize uptime?
  • How can we keep our data safe?
• Consolidated distributed database science
• 24 research papers cited
• Almost 30 years of distributed computing thought


http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html


So now what?


Let’s put it all together

• More than one server - uptime and scaling
• Get closer to users - reduce latency
• Transparent - make it a natural part of the system


Techniques - Step one

• Control your traffic


Techniques - Hardware traffic control

• F5 GTM - Global Traffic Manager
• A10 Thunder
• Citrix NetScaler
• Cisco ACE
• Barracuda ADC


Source: http://www.f5.com


Techniques - Service based control

• Dyn - Active failover service
• Akamai - Global Traffic Management
• Amazon - Route 53


Techniques - DIY control

• Client-level awareness
• Client chooses which path to take


A good example: http://imaginethefutur.blogspot.com/2012/09/javascript-load-balancer-using-cookies.html


Techniques - Step 2

• Make your app OK with being part of a team
• Fail fast - short circuits
• Transactions are expensive
• Partial or eventual data is OK


Great discussion on this topic:

http://www.planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency-by-christos-kalantzis
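The "fail fast - short circuits" idea can be sketched as a minimal circuit breaker: after a few consecutive failures the app stops waiting on a sick dependency and serves a fallback (cached, partial, or eventual data). This is an illustrative sketch, not a library API; a real app would add timeouts and a half-open state, e.g. with Netflix's Hystrix.

```java
import java.util.function.Supplier;

// Minimal circuit-breaker sketch. Once `threshold` consecutive calls
// have failed, the breaker "opens" and returns the fallback immediately
// instead of attempting the call (fail fast).
public class Breaker {
    private final int threshold;
    private int consecutiveFailures = 0;

    Breaker(int threshold) { this.threshold = threshold; }

    boolean isOpen() { return consecutiveFailures >= threshold; }

    <T> T call(Supplier<T> op, T fallback) {
        if (isOpen()) return fallback;       // fail fast: skip the call
        try {
            T result = op.get();
            consecutiveFailures = 0;         // success resets the count
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            return fallback;                 // partial data is OK
        }
    }

    public static void main(String[] args) {
        Breaker b = new Breaker(3);
        for (int i = 0; i < 3; i++)
            b.call(() -> { throw new RuntimeException("db down"); }, "cached");
        System.out.println("open after 3 failures: " + b.isOpen());
    }
}
```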


Techniques - Application Architecture

• Embrace the pod (or application unit)
• Each stands alone


East Coast - 10,000 customers
West Coast - 10,000 customers
Next deployable unit - 10,000 customers


Techniques - State management

• State in the browser?
• State in the data layer?
• State in both?

No!


Techniques - Step 3

• Make your persistence layer resilient
• If your app layer can fail, why not the DB?
• Master-master - less complexity
• Of course, I'm talking about Cassandra


Same data. Fully replicated.


Techniques - Step 4

• Test!! Test!! Test!!


Know these guys?

• Constantly breaking things
• Chaos Monkey - shut down random services
• Always be failing so it's normal
• Read this: http://queue.acm.org/detail.cfm?id=2499552


Cassandra - Intro

• Based on the Amazon Dynamo and Google BigTable papers
• Shared nothing
• Data as safe as possible
• Predictable scaling




Cassandra - More than one server

• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
• More capacity? Add a server


Cassandra - Locally Distributed

• Client writes to any node
• Node coordinates with others
• Data replicated in parallel
• Replication factor: how many copies of your data?
• RF = 3 here
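A rough sketch of how replicas for one partition might be chosen with RF = 3: the node owning the token takes the first copy, and the next nodes clockwise on the ring take the rest. This simplifies Cassandra's SimpleStrategy; real placement depends on the partitioner, and NetworkTopologyStrategy also considers racks and data centers.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified ring-walk replica placement: nodes are listed in token
// (ring) order; the primary plus the next RF - 1 nodes clockwise
// hold the copies.
public class ReplicaPlacement {
    static List<String> replicasFor(int primaryIndex, List<String> ring, int rf) {
        List<String> replicas = new ArrayList<>();
        for (int i = 0; i < rf && i < ring.size(); i++)
            replicas.add(ring.get((primaryIndex + i) % ring.size()));
        return replicas;
    }

    public static void main(String[] args) {
        List<String> ring = List.of("n1", "n2", "n3", "n4", "n5");
        // RF = 3: a token owned by n4 is also stored on n5 and n1
        System.out.println(replicasFor(3, ring, 3)); // [n4, n5, n1]
    }
}
```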


Cassandra - Geographically Distributed

• Client writes locally
• Data syncs across the WAN
• Replication factor per DC


Cassandra - Consistency

• Consistency Level (CL)
• Client specifies per read or write


• ALL = all replicas ack
• QUORUM = a majority of replicas ack (floor(RF / 2) + 1)
• LOCAL_QUORUM = a majority of replicas in the local DC ack
• ONE = only one replica acks
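The quorum levels come down to simple majority arithmetic: a quorum is floor(RF / 2) + 1 replicas. A tiny sketch:

```java
// Quorum arithmetic: a quorum is a majority of replicas,
// i.e. floor(RF / 2) + 1 acknowledgements.
public class Quorum {
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    static boolean satisfied(int acks, int replicationFactor) {
        return acks >= quorum(replicationFactor);
    }

    public static void main(String[] args) {
        System.out.println("RF=3 needs " + quorum(3) + " acks"); // 2
        System.out.println("RF=5 needs " + quorum(5) + " acks"); // 3
    }
}
```

So with RF = 3, a QUORUM write succeeds with 2 acks, and one dead replica costs you nothing.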


Cassandra - Transparent to the application

• A single node failure shouldn't bring down the application
• Replication factor + consistency level = success
• This example:
  • RF = 3
  • CL = QUORUM


2 of 3 replicas acked (a majority), so we are good!


Cassandra Applications - Drivers

• DataStax drivers for Cassandra
• Java
• C#
• Python
• More on the way


Cassandra Applications - Connecting

• Create a pool of local servers
• Client just uses a session to interact with Cassandra


// Example configuration:
//   contactPoints = {"10.0.0.1", "10.0.0.2"}
//   keyspace = "videodb"

public VideoDbBasicImpl(List<String> contactPoints, String keyspace) {
    cluster = Cluster.builder()
        .addContactPoints(contactPoints.toArray(new String[contactPoints.size()]))
        .withLoadBalancingPolicy(Policies.defaultLoadBalancingPolicy())
        .withRetryPolicy(Policies.defaultRetryPolicy())
        .build();
    session = cluster.connect(keyspace);
}


Cassandra Applications - Load balancing

• Token aware - request sent to the primary node with the data
• Calls can be asynchronous and in parallel

[Diagram: multiple client threads issuing token-aware requests through the driver to nodes 1-6 in parallel]
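Asynchronous, parallel calls can be sketched with CompletableFuture. With the real driver you would use session.executeAsync(); here a hypothetical fetch() stands in so the control flow is runnable without a cluster:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Fire several requests at once and collect the results, instead of
// issuing them one after another. fetch() is a stand-in for an async
// driver call such as session.executeAsync().
public class ParallelReads {
    static CompletableFuture<String> fetch(String key) {
        return CompletableFuture.supplyAsync(() -> "value-for-" + key);
    }

    public static void main(String[] args) {
        List<CompletableFuture<String>> futures =
            List.of(fetch("a"), fetch("b"), fetch("c"));  // all in flight at once
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        futures.forEach(f -> System.out.println(f.join()));
    }
}
```

The win: total latency is roughly that of the slowest single request, not the sum of all of them.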


Cassandra Applications - Fault tolerance

• Try first with a consistency level of QUORUM
• If that fails, retry with consistency level ONE


[Diagram: client sends a request to a coordinator node; the data lives on three replica nodes]
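The retry-with-lower-consistency pattern above looks like this in outline: attempt at QUORUM, and if not enough replicas respond, retry the same request at ONE. The DataStax drivers ship a similar DowngradingConsistencyRetryPolicy; this standalone version just shows the control flow, with a simulated read standing in for a real query.

```java
import java.util.function.Function;

// Try a read at QUORUM; on failure (e.g. not enough replicas alive),
// fall back to ONE rather than failing the user request.
public class DowngradingRead {
    enum ConsistencyLevel { QUORUM, ONE }

    static <T> T readWithFallback(Function<ConsistencyLevel, T> read) {
        try {
            return read.apply(ConsistencyLevel.QUORUM);
        } catch (RuntimeException notEnoughReplicas) {
            return read.apply(ConsistencyLevel.ONE); // hopeful consistency
        }
    }

    public static void main(String[] args) {
        // Simulate a cluster where QUORUM is unavailable but ONE succeeds.
        String result = readWithFallback(cl -> {
            if (cl == ConsistencyLevel.QUORUM)
                throw new RuntimeException("not enough replicas");
            return "row read at " + cl;
        });
        System.out.println(result); // row read at ONE
    }
}
```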


Application Example - Layout

• Active-active
• Service-based DNS routing


Cassandra Replication


Application Example - Uptime


• Normal server maintenance
• Application is unaware

Cassandra Replication


Application Example - Failure


• Data center failure
• Data is safe. Route traffic.


Another happy user!


Conclusion


• Cassandra is THE BEST persistence tier for your application
• Plan for chaos. Inject your own.
• Now go write your app.

Data Modeling!

1. The Data Model is Dead, Long Live the Data Model
   http://www.youtube.com/watch?v=px6U2n74q3g
2. Become a Super Modeler
   http://www.youtube.com/watch?v=qphhxujn5Es
3. The World's Next Top Data Model
   http://www.youtube.com/watch?v=HdJlsOZVGwM

Example code: https://github.com/pmcfadin/cql3-videodb-example


Thank you!


CALL FOR PAPERS • SPONSORSHIP OPPORTUNITY • TWO DAYS • 30+ SESSIONS • TRAINING DAY

Oh yeah. We are hiring.

Q&A?
