Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012

Realtime Analytics with Cassandra or: How I Learned to Stop Worrying and Love Counting


Slides from my tutorial at Denormalized London on 21 Sept 2012

Transcript of Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012

Page 1

Realtime Analytics with Cassandra

or: How I Learned to Stop Worrying and Love Counting

Page 2

Analytics

• Live & historical aggregates
• Trends
• Drill downs and roll ups

Combining “big” and “real-time” is hard

Page 3

What is Realtime Analytics?

e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”

A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data.

http://blog.twitter.com/2011/03/numbers.html
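The slide does not say how these figures were derived. One plausible reading, sketched below, multiplies an assumed average of about 140 million tweets per day by the number of days from May to November, and assumes 140 bytes of text per tweet. Both constants are assumptions for illustration, not values taken from the slides.

```python
# Back-of-envelope check of the batch workload (both constants are assumptions).
TWEETS_PER_DAY = 140_000_000   # assumed average daily tweet volume in 2011
BYTES_PER_TWEET = 140          # assumed: 140 characters at one byte each, text only

# Days from 1 May to 30 November inclusive.
days = 31 + 30 + 31 + 31 + 30 + 31 + 30

total_tweets = TWEETS_PER_DAY * days
total_bytes = total_tweets * BYTES_PER_TWEET

print(f"{days} days, {total_tweets / 1e9:.1f} billion tweets, {total_bytes / 1e12:.1f} TB")
# prints: 214 days, 30.0 billion tweets, 4.2 TB
```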

Page 4

Okay, so how are we going to do it?

• Push processing into ingest phase
• Make queries fast

[Diagram: Twitter → tweets → ? → counter updates]

Page 5

Okay, so how are we going to do it?

For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters

Page 6

Preparing the data

Step 1: Get a feed of the tweets

12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!

Step 2: Tokenise the tweet

Step 3: Increment counters in time buckets for each token

[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
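The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's code: the helper names (`tokenise`, `minute_bucket`, `count_tweet`) are hypothetical, a `Counter` stands in for Cassandra, and each token is counted once per tweet. The slide appears to stem "rocks" to "rock"; the sketch skips stemming, so it counts "rocks".

```python
import re
from collections import Counter

def tokenise(text):
    """Lowercase the tweet and return its distinct word tokens (no stemming)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def minute_bucket(timestamp):
    """Truncate an 'HH:MM:SS' timestamp to a minute bucket such as '1234'."""
    hours, minutes, _seconds = timestamp.split(":")
    return hours + minutes

# In-memory stand-in for the counters: (bucket, token) -> count.
counters = Counter()

def count_tweet(timestamp, text):
    """Increment one counter per distinct token in the tweet's minute bucket."""
    bucket = minute_bucket(timestamp)
    for token in tokenise(text):
        counters[(bucket, token)] += 1

count_tweet("12:34:04", "Man, @acunu rocks!")
count_tweet("12:34:30", "acunu again")
print(counters[("1234", "acunu")])  # prints 2
```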

Page 7

Querying

Step 1: Do a range query

start: [01/05/11, acunu]
end: [30/05/11, acunu]

Step 2: Result table

Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...

Step 3: Plot pretty graph

[Graph: #Mentions over time, x-axis May to Nov, y-axis ticks 0, 45, 90]

Page 8

Except it’s not that easy...

• Cassandra best practice is to use RandomPartitioner, so it is not possible to do range queries on rows
• Could manually work out each row in the range and do lots of point gets
• This would suck - each query would be 100s of random IOs on disk
• Need to use wide rows: the range query becomes a column slice, and each query is ~1 IO - Denormalisation

Page 9

So instead of this...

Key                       #Mentions
[01/05/11 00:01, acunu]   3
[01/05/11 00:02, acunu]   5
...                       ...

We do this

Key                 00:01   00:02   ...
[01/05/11, acunu]   3       5       ...
[02/05/11, acunu]   12      4       ...
...                 ...     ...     ...

Row key is ‘big’ time bucket

Column key is ‘small’ time bucket
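To make the wide-row layout concrete, here is a minimal in-memory sketch. Plain dictionaries stand in for Cassandra rows and columns, and the function names (`increment`, `column_slice`) are hypothetical. The point it illustrates is that a time-range query within one day touches a single row and reads a contiguous slice of its sorted columns.

```python
from collections import defaultdict

# rows[(day, term)][minute] -> count: one wide counter row per day and term.
rows = defaultdict(dict)

def increment(day, term, minute, amount=1):
    """Add `amount` to the counter column `minute` in row (day, term)."""
    row = rows[(day, term)]
    row[minute] = row.get(minute, 0) + amount

def column_slice(day, term, start, finish):
    """Return (minute, count) pairs with start <= minute <= finish, in column order."""
    row = rows.get((day, term), {})
    return [(m, row[m]) for m in sorted(row) if start <= m <= finish]

increment("01/05/11", "acunu", "00:01", 3)
increment("01/05/11", "acunu", "00:02", 5)
increment("02/05/11", "acunu", "00:01", 12)

print(column_slice("01/05/11", "acunu", "00:00", "00:59"))
# prints [('00:01', 3), ('00:02', 5)]
```

Zero-padded "HH:MM" column names sort lexicographically in time order, which is what lets the slice be a simple range over column names.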

Page 11

Now it's your turn...

Page 12

1. Get a Twitter account - http://twitter.com

2. Get some Cassandra VMs - http://goo.gl/Ruqlt

3. Cluster them up

4. Get the code - http://goo.gl/VxXKB

5. Implement the missing bits!

6. (Prizes for the ones that spot bugs!)

Page 13

Get some Cassandra VMs

http://goo.gl/O9hkv

Page 14

Cluster them up

• SSH in, set password (on both!)

• Check you can connect to the UI

• Use UI (click add host)

Page 16

Implement the “core”

• In core.py

• def insert_tweet(cassandra, tweet):

• def do_query(cassandra, term, start, finish):
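One possible shape for the two functions is sketched below. It is not the tutorial's solution. Only the two signatures come from the slide; everything else is an assumption made for illustration: `InMemoryCounters` is a hypothetical stand-in for the `cassandra` argument, a tweet is assumed to be a dict with `created_at` (a `datetime`) and `text`, `start` and `finish` are assumed to be `datetime` values, and the row key format `day:token` is invented here.

```python
import re
from datetime import datetime, timedelta

class InMemoryCounters:
    """Hypothetical stand-in for the `cassandra` argument: counter rows held in a dict."""

    def __init__(self):
        self.rows = {}

    def add(self, key, column, value=1):
        """Increment counter `column` in row `key`."""
        row = self.rows.setdefault(key, {})
        row[column] = row.get(column, 0) + value

    def get(self, key, column_start, column_finish):
        """Return the columns of row `key` within [column_start, column_finish], sorted."""
        row = self.rows.get(key, {})
        return {c: row[c] for c in sorted(row) if column_start <= c <= column_finish}

def insert_tweet(cassandra, tweet):
    """Increment one counter per distinct token: row = day + token, column = minute."""
    when = tweet["created_at"]
    day, minute = when.strftime("%Y-%m-%d"), when.strftime("%H:%M")
    for token in set(re.findall(r"[a-z0-9']+", tweet["text"].lower())):
        cassandra.add(f"{day}:{token}", minute)

def do_query(cassandra, term, start, finish):
    """Return [(datetime, count)] per minute for `term` between start and finish inclusive."""
    results = []
    day = start.date()
    while day <= finish.date():
        # Only the first and last day need a partial column slice.
        first = start.strftime("%H:%M") if day == start.date() else "00:00"
        last = finish.strftime("%H:%M") if day == finish.date() else "23:59"
        columns = cassandra.get(f"{day.isoformat()}:{term.lower()}", first, last)
        for minute, count in columns.items():
            stamp = datetime.strptime(f"{day.isoformat()} {minute}", "%Y-%m-%d %H:%M")
            results.append((stamp, count))
        day += timedelta(days=1)
    return results

store = InMemoryCounters()
insert_tweet(store, {"created_at": datetime(2011, 5, 1, 0, 1, 30), "text": "Man, @acunu rocks!"})
insert_tweet(store, {"created_at": datetime(2011, 5, 2, 9, 15, 0), "text": "acunu acunu"})
print(do_query(store, "acunu", datetime(2011, 5, 1), datetime(2011, 5, 2, 23, 59)))
```

The query issues one column slice per day in the range, which follows the row-per-day layout from the earlier slides. Against a real cluster, the stand-in class would be replaced by whatever client object the tutorial code passes in.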

Page 17

Check your data

-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)

Page 18

Extensions

Page 19

Extensions

UI

• Pretty graphs
• Automatically periodically update?
• Search multiple terms

Painbird

• Mentions of multiple terms
• Sentiment analysis - http://www.nltk.org/
• Filtering by multiple fields (geo + keyword)