Cassandra summit-2013

Real Time Big Data With Storm, Cassandra, and In-Memory Computing
DeWayne Filppi @dfilppi

Transcript of Cassandra summit-2013

Page 1: Cassandra summit-2013

Real Time Big Data With Storm, Cassandra, and In-Memory Computing

DeWayne Filppi @dfilppi

Page 2: Cassandra summit-2013

Big Data Predictions

“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”

Edd Dumbill, O’REILLY

® Copyright 2013 Gigaspaces Ltd. All Rights Reserved

Page 3: Cassandra summit-2013


The Two Vs of Big Data

• Velocity
• Volume

Page 4: Cassandra summit-2013

We’re Living in a Real Time World…

Homeland Security

Real Time Search

Social

eCommerce

User Tracking & Engagement

Financial Services


Page 5: Cassandra summit-2013

The Flavors of Big Data Analytics

• Counting
• Correlating
• Research

Page 6: Cassandra summit-2013

Analytics @ Twitter – Counting

How many signups, tweets, retweets for a topic?

What’s the average latency?

Demographics
• Countries and cities
• Gender
• Age groups
• Device types
• …

Page 7: Cassandra summit-2013

Analytics @ Twitter – Correlating

What devices fail at the same time?

What features get users hooked?

What places on the globe are “happening”?


Page 8: Cassandra summit-2013

Analytics @ Twitter – Research

• Sentiment analysis: “Obama is popular”
• Trends: “People like to tweet after watching American Idol”
• Spam patterns: How can you tell when a user spams?

Page 9: Cassandra summit-2013

It’s All about Timing

“Real time” (< few Seconds)

Reasonably Quick (seconds - minutes)

Batch (hours/days)


Page 10: Cassandra summit-2013

It’s All about Timing

• Event driven / stream processing
• High resolution – every tweet gets counted

• Ad-hoc querying
• Medium resolution (aggregations)

• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)

This is what we’re here to discuss

Page 11: Cassandra summit-2013

VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA


Page 12: Cassandra summit-2013

In Memory Data Grid Review

• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code collocated with data

Page 13: Cassandra summit-2013

Data Grid + Cassandra: A Complete Solution

• Data flows through the in-memory cluster async to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
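The flow described on this slide can be sketched as a write-behind store. This is a minimal illustration only, not the XAP implementation: a plain dict stands in for the in-memory cluster, another dict stands in for Cassandra, and the class name WriteBehindGrid and its methods are hypothetical.

```python
import queue
import threading


class WriteBehindGrid:
    """In-memory store that answers reads immediately and persists
    writes to a slower backing store on a background thread."""

    def __init__(self, backing_store):
        self._memory = {}                    # stands in for the in-memory grid
        self._pending = queue.Queue()        # writes waiting to be persisted
        self._backing_store = backing_store  # stands in for Cassandra
        self._listeners = []
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def add_listener(self, callback):
        self._listeners.append(callback)

    def write(self, key, value):
        self._memory[key] = value            # result is instantly available
        self._pending.put((key, value))      # persisted later, off the hot path
        for callback in self._listeners:     # event listeners are notified
            callback(key, value)

    def read(self, key):
        return self._memory.get(key)

    def flush(self):
        """Block until every queued write has reached the backing store."""
        self._pending.join()

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self._backing_store[key] = value
            self._pending.task_done()


backing = {}
seen = []
grid = WriteBehindGrid(backing)
grid.add_listener(lambda key, value: seen.append(key))
grid.write("tweet:1", "hello storm")
value_now = grid.read("tweet:1")   # served from memory, before persistence
grid.flush()
```

Filtering and enrichment would fit naturally as extra steps inside write(), before the value is queued for persistence.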

Page 14: Cassandra summit-2013


Simplified Event Flow

Page 15: Cassandra summit-2013

Grid – Cassandra Interface

• Hector and CQL based interface
• In memory data must be mapped to column families
  – Configurable class to column family mapping
• Must serialize individual fields
  – Fixed fields can use defined types
  – Variable fields (for schemaless in-memory mode) need serializers
• Object model flattening
  – By default, nested fields are flattened
  – Can be overridden by custom serializer
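To make the flattening and per-field serialization idea concrete, here is a small sketch. It is an illustration under assumptions, not the XAP mapping code: the dot-separated column naming, the flatten and to_row helpers, and the str default serializer are all hypothetical choices.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into a single level of column name -> value."""
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            columns.update(flatten(value, column))  # nested field: recurse
        else:
            columns[column] = value
    return columns


def to_row(obj, serializers=None):
    """Serialize each flattened field on its own; str() is the fallback."""
    serializers = serializers or {}
    return {
        column: serializers.get(column, str)(value)
        for column, value in flatten(obj).items()
    }


tweet = {"id": 42, "user": {"name": "dfilppi", "geo": {"lat": 37.77}}}
# A custom serializer overrides the default for one column.
row = to_row(tweet, serializers={"id": lambda v: v.to_bytes(8, "big")})
```

Each key of row would correspond to one column in the target column family.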

Page 16: Cassandra summit-2013

Virtues and Limitations

• Complete stack, not just two tools of many
• Fast
  – Microsecond latencies for in memory operations
  – Fast enough for almost anybody
• Highly available/self healing
• Elastic

BUT

• Could be faster: high availability has a cost
• Complex flows not easy to assemble or understand with simple event handlers

Page 17: Cassandra summit-2013

Storm Background

• Popular open source, real time, in-memory, streaming computation platform.
• Includes distributed runtime and intuitive API for defining distributed processing flows.
• Scalable and fault tolerant.
• Developed at BackType, and open sourced by Twitter.

Page 18: Cassandra summit-2013

Storm Abstractions

• Streams: unbounded sequence of tuples
• Spouts: source of streams (queues)
• Bolts: functions, filters, joins, aggregations
• Topologies

(Diagram labels: Spout, Bolt, Topologies)

Page 19: Cassandra summit-2013

Streaming word count with Storm

• Storm has a simple builder interface for creating stream processing topologies.
• Storm delegates persistence to external providers.
• Cassandra, because of its write performance, is commonly used.
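The shape of a streaming word count can be sketched with plain Python generators. This is not Storm's API (Storm topologies are built in Java and run distributed); the function names are hypothetical, and a Counter stands in for the external state store.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits one tuple per sentence from a (here finite) source."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream, state):
    """Bolt: updates a running count per word and emits (word, count)."""
    for (word,) in stream:
        state[word] += 1
        yield (word, state[word])


counts = Counter()  # stands in for the external state store, e.g. Cassandra
topology = count_bolt(
    split_bolt(sentence_spout(["storm is fast", "Storm is fun"])), counts
)
emitted = list(topology)  # pull every tuple through the pipeline
```

Chaining the three functions plays the role of wiring a spout to two bolts; in a real topology each stage would run in parallel across a cluster.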

Page 20: Cassandra summit-2013

Storm: Optimistic Processing

• Storm (quite rationally) assumes success is normal.
• Storm uses batching and pipelining for performance.
• Therefore the spout must be able to replay tuples on demand in case of error.
• Any kind of quasi-queue like data source can be fashioned into a spout.
• No persistence is ever required, and speed is attained by minimizing network hops during topology processing.
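The replay requirement can be illustrated with a spout that remembers each emitted tuple until it is acknowledged. This is a single-process sketch, not Storm code: the method names only echo Storm's spout callbacks (nextTuple, ack, fail), and the ReplayableSpout class and the driver loop are hypothetical.

```python
from collections import deque


class ReplayableSpout:
    """Keeps every emitted tuple until it is acked, so it can be replayed."""

    def __init__(self, source):
        self._queue = deque(enumerate(source))  # (tuple id, payload)
        self._pending = {}                      # emitted but not yet acked

    def next_tuple(self):
        if not self._queue:
            return None
        tuple_id, payload = self._queue.popleft()
        self._pending[tuple_id] = payload
        return tuple_id, payload

    def ack(self, tuple_id):
        self._pending.pop(tuple_id, None)       # success: forget the tuple

    def fail(self, tuple_id):
        payload = self._pending.pop(tuple_id)   # error: queue it for replay
        self._queue.appendleft((tuple_id, payload))

    def pending_count(self):
        return len(self._pending)


spout = ReplayableSpout(["a", "b"])
processed = []
failed_once = False
while (item := spout.next_tuple()) is not None:
    tuple_id, payload = item
    if payload == "a" and not failed_once:
        failed_once = True
        spout.fail(tuple_id)     # simulate a downstream error on first attempt
        continue
    processed.append(payload)
    spout.ack(tuple_id)
```

Because success is the normal case, the only extra cost on the happy path is one dictionary entry per in-flight tuple.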

Page 21: Cassandra summit-2013

Fast. Want to go faster?

• Eliminate non-memory components
• Replace the disk based queue with a reliable in-memory queue
• Replace disk based state persistence with in-memory persistence
• Asynchronously update disk based state (C*)

Page 22: Cassandra summit-2013


Sample Architecture

Page 23: Cassandra summit-2013

References

• Try the Cloudify recipe
  – Download Cloudify: http://www.cloudifysource.org/
  – Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details: http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github: https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  – Part 3 coming soon.

Page 24: Cassandra summit-2013


Page 25: Cassandra summit-2013


Twitter Storm With Cassandra

Page 26: Cassandra summit-2013


Storm Overview

Page 27: Cassandra summit-2013

Storm Concepts

• Streams: unbounded sequence of tuples
• Spouts: source of streams (queues)
• Bolts: functions, filters, joins, aggregations
• Topologies

(Diagram labels: Spouts, Bolt, Topologies)

Page 28: Cassandra summit-2013

Challenge – Word Count

Tweets

Count?

Word:Count

• Hottest topics
• URL mentions
• etc.

Page 29: Cassandra summit-2013


Streaming word count with Storm

Page 30: Cassandra summit-2013

Supercharging Storm

• Storm doesn’t supply persistence, but provides for it.
• Storm optimizes IO to slow persistence (e.g. databases) using batching.
• Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.

Tweets, events, whatever…
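Batching IO to a slow store can be sketched as a small write buffer. This is an assumption-laden illustration rather than Storm's mechanism: the BatchingWriter class is hypothetical, and a dict stands in for the database, with one update() call representing one round trip.

```python
class BatchingWriter:
    """Buffers updates and writes them to the store in batches."""

    def __init__(self, store, batch_size):
        self._store = store
        self._batch_size = batch_size
        self._buffer = {}
        self.round_trips = 0           # number of batched writes issued

    def put(self, key, value):
        self._buffer[key] = value      # a later update to a key replaces the earlier one
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self._store.update(self._buffer)  # one round trip for the whole batch
            self.round_trips += 1
            self._buffer = {}


store = {}
writer = BatchingWriter(store, batch_size=3)
for position, word in enumerate(["a", "b", "a", "c", "d", "e", "f"]):
    writer.put(word, position)
writer.flush()                         # push out anything still buffered
```

Here seven updates reach the store in two round trips, which is the saving batching is meant to buy.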

Page 31: Cassandra summit-2013

XAP Real Time Analytics


Page 32: Cassandra summit-2013

Two Layer Approach

• Advantage: minimal “impedance mismatch” between layers.
  – Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in memory cache for interactive requests.
• Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.

(Diagram labels: Raw Event Stream (x3), In Memory Compute Cluster, Raw And Derived Events, NoSQL Cluster, Reporting Engine, SCALE)
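The caching role of the grid layer can be sketched as a bounded read-through cache in front of a slower store. This is only an illustration: the ReadThroughCache class, its least-recently-used eviction, and the dict standing in for the NoSQL cluster are assumptions, not details from the slides.

```python
class ReadThroughCache:
    """Serves reads from memory and falls back to the backing store on a miss."""

    def __init__(self, backing_store, capacity):
        self._backing_store = backing_store   # stands in for the NoSQL cluster
        self._capacity = capacity             # memory allocated to the cache
        self._cache = {}                      # dicts keep insertion order
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            value = self._cache.pop(key)      # re-inserted below as most recent
        else:
            self.misses += 1
            value = self._backing_store[key]  # slow path
            if len(self._cache) >= self._capacity:
                oldest = next(iter(self._cache))
                del self._cache[oldest]       # evict the least recently used key
        self._cache[key] = value
        return value


nosql = {"a": 1, "b": 2, "c": 3}
cache = ReadThroughCache(nosql, capacity=2)
results = [cache.get(key) for key in ["a", "b", "a", "c", "b"]]
```

The capacity limit mirrors the "limited (to allocated memory)" caveat: anything that does not fit is still reachable, just at the slower layer's latency.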

Page 33: Cassandra summit-2013


Simplified Architecture

Page 34: Cassandra summit-2013

Key Concepts

• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
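The "raw events flushed, aggregations retained" idea can be sketched as follows. This is a hedged illustration: the AggregatingGrid class, its flush_every threshold, and the list standing in for the NoSQL store are all hypothetical.

```python
from collections import defaultdict


class AggregatingGrid:
    """Keeps per-key aggregates in memory and hands raw events off in bulk."""

    def __init__(self, archive, flush_every):
        self._archive = archive             # stands in for the NoSQL store
        self._flush_every = flush_every
        self._raw = []                      # raw events are held only briefly
        self.totals = defaultdict(int)      # derived data retained in memory

    def on_event(self, key, amount):
        self.totals[key] += amount          # side effect computed in memory
        self._raw.append((key, amount))
        if len(self._raw) >= self._flush_every:
            self._archive.extend(self._raw)  # raw events leave memory
            self._raw.clear()

    def raw_in_memory(self):
        return len(self._raw)


archive = []
grid = AggregatingGrid(archive, flush_every=2)
for key, amount in [("us", 1), ("uk", 2), ("us", 3)]:
    grid.on_event(key, amount)
```

Memory use then scales with the number of distinct aggregates rather than with the number of events seen.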

Page 35: Cassandra summit-2013

Keep Things In Memory

Facebook keeps 80% of its data in Memory (Stanford research)

RAM is 100-1000x faster than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms

Page 36: Cassandra summit-2013

Take Aways

A data grid can serve different needs for big data analytics:

• Supercharge a dedicated stream processing cluster like Storm.
  – Provide fast, reliable, transactional tuple streams and state
• Provide a general purpose analytics platform
  – Roll your own
• Simplify overall architecture while enhancing scalability
  – Ultra high performance/low latency
  – Dynamically scalable processing and in-memory storage
  – Eliminate messaging tier
  – Eliminate or minimize need for RDBMS

Page 37: Cassandra summit-2013

References

• Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/

Page 38: Cassandra summit-2013
