Realtime Analytics on the Twitter Firehose with Apache Cassandra - Denormalized London 2012
Realtime Analytics with Cassandra
or: How I Learned to Stop Worrying and Love Counting
Analytics
Live & historical aggregates... trends... drill-downs and roll-ups
Combining “big” and “real-time” is hard
What is Realtime Analytics?
e.g. “show me the number of mentions of ‘Acunu’ per day, between May and November 2011, on Twitter”
A batch (Hadoop) approach would require processing ~30 billion tweets, or ~4.2 TB of data
http://blog.twitter.com/2011/03/numbers.html
• Push processing into the ingest phase
• Make queries fast

tweets → counter updates
Okay, so how are we going to do it?
For each tweet, increment a bunch of counters, such that answering a query is as easy as reading some counters.
Preparing the data
Step 1: Get a feed of the tweets
Step 2: Tokenise the tweet
Step 3: Increment counters in time buckets for each token
12:32:15 I like #trafficlights
12:33:43 Nobody expects...
12:33:49 I ate a #bee; woe is...
12:34:04 Man, @acunu rocks!

[1234, man] +1
[1234, acunu] +1
[1234, rock] +1
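The ingest steps above (tokenise, then increment a counter per token in a minute-sized time bucket) can be sketched in Python. This is a minimal sketch: a plain dict stands in for Cassandra's counter columns, and the tokeniser is deliberately simplistic.

```python
from collections import defaultdict
import re

# In-memory stand-in for Cassandra counter columns: (minute_bucket, token) -> count
counters = defaultdict(int)

def tokenise(text):
    """Lower-case the tweet and split it into word/hashtag/mention tokens."""
    return re.findall(r"[#@]?\w+", text.lower())

def ingest(minute_bucket, tweet_text):
    """Increment one counter per token, keyed by (time bucket, token)."""
    for token in tokenise(tweet_text):
        counters[(minute_bucket, token)] += 1

ingest(1234, "Man, @acunu rocks!")
print(counters[(1234, "@acunu")])  # 1
```

In the real system each `counters[...] += 1` would be a counter-column increment against Cassandra rather than a dict update.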
Querying
Step 1: Do a range query
Step 2: Build the result table
Step 3: Plot a pretty graph

start: [01/05/11, acunu]
end: [30/05/11, acunu]
Key #Mentions
[01/05/11 00:01, acunu] 3
[01/05/11 00:02, acunu] 5
... ...
[Graph: mentions per month, May–Nov 2011]
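The query path can be sketched the same way: scan the counters for a term between a start and end bucket and return the rows of the result table. Again an in-memory dict stands in for Cassandra, with some made-up counts.

```python
# In-memory stand-in for the (time bucket, term) counter columns,
# pre-populated with illustrative counts.
counters = {
    (100, "acunu"): 3,
    (101, "acunu"): 5,
    (102, "other"): 7,
    (103, "acunu"): 2,
}

def query(term, start, end):
    """Return sorted [(bucket, count)] rows for `term` in [start, end]."""
    return sorted(
        (bucket, count)
        for (bucket, token), count in counters.items()
        if token == term and start <= bucket <= end
    )

print(query("acunu", 100, 103))  # [(100, 3), (101, 5), (103, 2)]
```

Against a real cluster this scan over dict keys would be the range query of step 1, which, as the next slide explains, is exactly where RandomPartitioner gets in the way.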
Except it’s not that easy...
• Cassandra best practice is to use RandomPartitioner, so range queries over rows are not possible
• Could manually work out each row in the range and do lots of point gets
• This would suck: each query would be 100s of random IOs on disk
• Need to use wide rows, so a range query becomes a column slice and each query is ~1 IO: denormalisation
So instead of this...

Key                      #Mentions
[01/05/11 00:01, acunu]  3
[01/05/11 00:02, acunu]  5
...                      ...

...we do this:

Key                00:01  00:02  ...
[01/05/11, acunu]  3      5      ...
[02/05/11, acunu]  12     4      ...
...                ...    ...    ...
Row key is ‘big’ time bucket
Column key is ‘small’ time bucket
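The wide-row layout above can be sketched as two key-construction helpers: the row key combines the ‘big’ bucket (day) with the term, and the column key is the ‘small’ bucket (minute within the day). The exact formats here are illustrative, not the talk's actual schema.

```python
from datetime import datetime

def row_key(ts, term):
    """'Big' time bucket: one wide row per (day, term)."""
    return (ts.strftime("%d/%m/%y"), term)

def column_key(ts):
    """'Small' time bucket: one column per minute within the day."""
    return ts.strftime("%H:%M")

# A tweet mentioning 'acunu' at 00:01 on 1 May 2011 lands in
# row [01/05/11, acunu], column 00:01:
ts = datetime(2011, 5, 1, 0, 1)
print(row_key(ts, "acunu"), column_key(ts))  # ('01/05/11', 'acunu') 00:01
```

A month-long query for one term then touches ~30 rows, and each day's counts come back in a single column slice instead of ~1,440 point gets.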
Demo
# ./painbird.py -u tom_wilkie
http://ec2-176-34-212-226.eu-west-1.compute.amazonaws.com:8000
Now it’s your turn...
1. Get a twitter account - http://twitter.com
2. Get some Cassandra VMs - http://goo.gl/Ruqlt
3. Cluster them up
4. Get the code - http://goo.gl/VxXKB
5. Implement the missing bits!
6. (Prizes for the ones that spot bugs!)
Cluster them up
• SSH in, set password (on both!)
• Check you can connect to the UI
• Use UI (click add host)
Get the code
SSH into one of the VMs:
# curl https://acunu-oss.s3.amazonaws.com/painbird-2.tar.gz | tar zxf -
# cd release
# ./painbird.py -u tom_wilkie
Implement the “core”
• In core.py
• def insert_tweet(cassandra, tweet):
• def do_query(cassandra, term, start, finish):
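A minimal sketch of what the two missing functions might look like, matching the signatures above. Everything beyond those signatures is an assumption: a dict-of-dicts stands in for the `cassandra` client handle, and the key formats are the illustrative day/minute buckets from the earlier slides, not the exercise's actual schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta
import re

def insert_tweet(cassandra, tweet):
    """Tokenise the tweet and bump one counter per (day row, minute column, token)."""
    day = tweet["created_at"].strftime("%d/%m/%y")
    minute = tweet["created_at"].strftime("%H:%M")
    for token in re.findall(r"[#@]?\w+", tweet["text"].lower()):
        cassandra[(day, token)][minute] += 1  # counter increment in the real thing

def do_query(cassandra, term, start, finish):
    """Column-slice each day's row for `term` between two dates; return [(minute, count)]."""
    results = []
    day = start
    while day <= finish:
        row = cassandra.get((day.strftime("%d/%m/%y"), term), {})
        results.extend(sorted(row.items()))
        day += timedelta(days=1)
    return results

store = defaultdict(lambda: defaultdict(int))
insert_tweet(store, {"created_at": datetime(2011, 5, 1, 0, 1), "text": "@acunu rocks"})
print(do_query(store, "@acunu", datetime(2011, 5, 1), datetime(2011, 5, 1)))
# [('00:01', 1)]
```

In the exercise the dict lookups become counter-column increments and column slices against the `painbird` keyspace.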
Check your data
-bash-3.2$ cassandra-cli
Connected to: "Test Cluster" on localhost/9160
Welcome to Cassandra CLI version 1.0.8.acunu2
Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.
[default@unknown] use painbird;
Authenticated to keyspace: painbird
[default@painbird] list keywords;
Using default limit of 100
-------------------
RowKey: m-5-"woe
=> (counter=11, value=1)
Extensions
UI
• Pretty graphs
• Automatically update periodically?
• Search multiple terms
Painbird
• mentions of multiple terms
• sentiment analysis - http://www.nltk.org/
• filtering by multiple fields (geo + keyword)