Apache Cassandra at Videoplaza — Stockholm Cassandra Users — September 2013

Karbon Insight: Realtime Reporting


Josh Glover, Software Engineer at Videoplaza, will introduce you to the domain of video advertising and show how Videoplaza uses Apache Cassandra in a system that lets clients analyse the performance of their advertising campaigns in real time. Videoplaza needs to aggregate data for tens of thousands of combinations of dimensions and metrics, for hundreds of clients, from an incoming stream of thousands of requests per second, and to do it fast enough that clients can see trends as they happen.

Transcript of Apache Cassandra at Videoplaza — Stockholm Cassandra Users — September 2013

Page 1:

Karbon Insight: Realtime Reporting

Page 2:

Introduction to ad serving

Video player

Ad player

Distributor Tracker

Page 3:

Event tracking

• View (event ID 127)
• Click (event ID 128)
• and many more

Page 4:

What do our customers want?

• Any report they can dream up
• Right away!

Page 5:

Simple report: hour by ad and event

Page 6:

Realtime reporting

Multidimensional OLAP cube (axes: Ad, Event, Time)

Page 7:

ROLAP with star schema

Page 8:

Disadvantages of ROLAP

• Slow queries
• Lots of joins
• Expensive to scale
• SQL limitations

Page 9:

MOLAP to the rescue!

Page 10:

What is a counter?

Page 11:

You can’t always get what you want...

Page 12:

Possible report dimensions

• Time
• Event
• Ad
• Device
• Category
• Location
• Tag
• Demography

Page 13:

Many counters

8 dimensions, average size of 50

50^8 counters! (39 trillion)

Page 14:

Average campaign length:

21 days (504 hours)

Page 15:

Time flies like a banana

21 days = 39 trillion counters

42 days -> 78 trillion
84 days -> 156 trillion
365 days -> 677 trillion

Page 16:

5 years down the road

3.39 quadrillion
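
The arithmetic behind these counter estimates can be reproduced with a short sketch. It assumes, as the slides appear to, that the counter count grows linearly with elapsed time and that the rounded figure of 39 trillion per 21-day campaign is the base for the longer periods. The class and method names are illustrative only.

```java
import java.math.BigInteger;

final class CounterEstimate {
    /** Cells in a cube with `dimensions` axes of `avgSize` values each. */
    static BigInteger cubeSize(int avgSize, int dimensions) {
        return BigInteger.valueOf(avgSize).pow(dimensions);
    }

    /** Scales a per-campaign counter count linearly with elapsed days (assumption). */
    static BigInteger scale(BigInteger countersPerCampaign, int days, int campaignDays) {
        return countersPerCampaign
                .multiply(BigInteger.valueOf(days))
                .divide(BigInteger.valueOf(campaignDays));
    }

    public static void main(String[] args) {
        // Exact value of 50^8, which the slides round to 39 trillion.
        System.out.println("50^8 = " + cubeSize(50, 8));

        // Longer periods, scaled from the rounded 39 trillion per 21 days.
        BigInteger rounded = new BigInteger("39000000000000");
        for (int days : new int[] {42, 84, 365, 5 * 365}) {
            System.out.println(days + " days -> " + scale(rounded, days, 21));
        }
    }
}
```

The five-year result comes out at roughly 3.39 × 10^15, which matches the quadrillion figure quoted on this slide.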

Page 17:

3.39 quadrillion is a rather large number indeed

Number of stars in 7500 galaxies like the Milky Way.

15% of the surveyed universe!

Page 18:
Page 19:

But you might just get what you need!

Page 20:

Fake it till you can make it

Don’t aggregate anything until they ask for it!

Page 21:

• Time period
• By hour
• And ad
• Views
• Clicks
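
One way to picture "don't aggregate anything until they ask for it" is a registry of aggregate definitions: an incoming event only touches counters for definitions that some report has requested. The sketch below is a simplified in-memory illustration, not Videoplaza's implementation. `LazyAggregator`, `Definition`, and the counter key layout are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LazyAggregator {
    /** Hypothetical aggregate definition: group-by dimensions plus the events to count. */
    static final class Definition {
        final String id;
        final List<String> dimensions;
        final List<Integer> eventIds;

        Definition(String id, List<String> dimensions, List<Integer> eventIds) {
            this.id = id;
            this.dimensions = dimensions;
            this.eventIds = eventIds;
        }
    }

    private final List<Definition> definitions = new ArrayList<>();
    private final Map<String, Long> counters = new HashMap<>();

    /** Registers a definition, e.g. when a client creates a report. */
    void define(Definition definition) {
        definitions.add(definition);
    }

    /** Increments one counter per matching definition; events nobody asked about are ignored. */
    void onEvent(int eventId, Map<String, String> dimensionValues, String hour) {
        for (Definition d : definitions) {
            if (!d.eventIds.contains(eventId)) {
                continue;
            }
            StringBuilder key = new StringBuilder(d.id);
            for (String dimension : d.dimensions) {
                key.append('|').append(dimensionValues.get(dimension));
            }
            key.append('|').append(eventId).append('|').append(hour);
            counters.merge(key.toString(), 1L, Long::sum);
        }
    }

    long count(String key) {
        return counters.getOrDefault(key, 0L);
    }

    int counterCount() {
        return counters.size();
    }
}
```

With a single definition covering views (127) and clicks (128) by hour and ad, only the ad/event/hour combinations that actually occur get a counter, instead of the full cube.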

Page 22:

Counter Storage

Page 23:

Why Cassandra?

• Fast writes
• Linear scaling
• Battle-hardened
• (Relatively) simple operations
• Great community!

Page 24:

[Architecture diagram: Tracker, Aggregator, Flusher and Merger processes (two of each shown); RabbitMQ with live00 ... live31, flush00 ... flush31 and counter00 ... counter31; Cassandra]

Page 25:

Our setup

• DataStax CE 1.1.9
• 18-node cluster
• 1 datacentre

Page 26:

Data model

• 1 keyspace (RF: 3)
• 1 column family
• Leveled compaction

Page 27:

Row keys

• aggregate definition ID
• dimension values
• time granularity

Page 28:

Example row keys

adef1|(ad1:127)|hour
adef1|(ad1:128)|hour
adef1|(ad2:127)|hour
...
adef1|(ad5:128)|day
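
A row key of this shape can be assembled from its three parts. The sketch below infers the separators (`|` between parts, `:` between dimension values, parentheses around the values) from the example keys alone. The `RowKeys.build` helper is hypothetical, and the real encoding may differ for rows with more dimensions.

```java
import java.util.Arrays;
import java.util.List;

final class RowKeys {
    /** Joins aggregate definition ID, dimension values and time granularity into one key. */
    static String build(String aggregateDefinitionId, List<String> dimensionValues, String granularity) {
        return aggregateDefinitionId
                + "|(" + String.join(":", dimensionValues) + ")|"
                + granularity;
    }

    public static void main(String[] args) {
        // Views (event ID 127) of ad1, aggregated by hour.
        System.out.println(build("adef1", Arrays.asList("ad1", "127"), "hour"));
    }
}
```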

Page 29:

Columns

time value -> counter

transaction ID -> id

Page 30:

Example columns

2013-09-10.18 -> 6348
2013-09-10.19 -> 9784
txID -> 876219102

Page 31:

Columns for rows with no time aggregation

total -> 6348
txID -> 876219102

Page 32:

Reading counters

Page 33:

Build row key

adef1|(ad1:127)|hour

Page 34:

Prepare query

keyspace.prepareQuery(columnFamily).getKey(rowKey)

Page 35:

Column ranges

2013-09-10.17 ... 2013-09-10.23
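
Hourly column names in this format sort lexicographically in chronological order, so a time period maps onto a contiguous column range bounded by its first and last name. The sketch below generates the names for a period using `java.time`. The pattern `yyyy-MM-dd.HH` is inferred from the examples, and `HourColumns` is a hypothetical helper rather than part of any Cassandra client.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

final class HourColumns {
    // Pattern inferred from example column names such as 2013-09-10.18.
    static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyy-MM-dd.HH");

    static String name(LocalDateTime time) {
        return time.format(HOUR);
    }

    /** Column names for every hour from `first` to `last`, inclusive. */
    static List<String> range(LocalDateTime first, LocalDateTime last) {
        List<String> names = new ArrayList<>();
        LocalDateTime hour = first.truncatedTo(ChronoUnit.HOURS);
        while (!hour.isAfter(last)) {
            names.add(name(hour));
            hour = hour.plusHours(1);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> columns = range(
                LocalDateTime.of(2013, 9, 10, 17, 0),
                LocalDateTime.of(2013, 9, 10, 23, 0));
        // The first and last names are the bounds a column range query would use.
        System.out.println(columns.get(0) + " ... " + columns.get(columns.size() - 1));
    }
}
```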

Page 36:

Execute query asynchronously

Page 37:

Get column value

First byte is counter type (long, double, HyperLogLog)
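
A type-tagged column value of this kind can be decoded by reading one byte and dispatching on it. The slides do not give the actual tag values or payload layout, so the constants `0`, `1` and `2`, the big-endian 8-byte payloads, and the `CounterValues` class below are assumptions for illustration. The HyperLogLog payload is returned as raw bytes rather than interpreted.

```java
import java.nio.ByteBuffer;

final class CounterValues {
    // Hypothetical tag values; the real ones are not shown in the slides.
    static final byte TYPE_LONG = 0;
    static final byte TYPE_DOUBLE = 1;
    static final byte TYPE_HLL = 2;

    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(9).put(TYPE_LONG).putLong(value).array();
    }

    static byte[] encodeDouble(double value) {
        return ByteBuffer.allocate(9).put(TYPE_DOUBLE).putDouble(value).array();
    }

    /** Returns a Long, a Double, or the raw HyperLogLog bytes, depending on the first byte. */
    static Object decode(byte[] raw) {
        ByteBuffer buffer = ByteBuffer.wrap(raw);
        byte type = buffer.get();
        switch (type) {
            case TYPE_LONG:
                return buffer.getLong();
            case TYPE_DOUBLE:
                return buffer.getDouble();
            case TYPE_HLL: {
                byte[] sketch = new byte[buffer.remaining()];
                buffer.get(sketch);
                return sketch;
            }
            default:
                throw new IllegalArgumentException("unknown counter type " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(encodeLong(6348L)));
    }
}
```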

Page 38:

Writing counters

Page 39:

Flush shards

[Diagram: Flusher 1 (shards 00-08) ... Flusher 4 (shards 24-32), flushing to Cassandra]

Page 40:

Merge increment rows with read cache

Skip rows with the same transaction ID
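
Storing the transaction ID alongside the counters makes a merge safe to replay: if the stored ID equals the incoming one, the increments were already applied and the row can be skipped. The sketch below illustrates that idea with an in-memory map standing in for the read cache. `CounterMerger`, its `Row` type, and the use of `-1` for "no transaction yet" are assumptions, not details from the talk.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class CounterMerger {
    /** Cached row state: counter columns plus the last transaction applied. */
    static final class Row {
        final Map<String, Long> columns = new HashMap<>();
        long txId = -1; // assumption: -1 means no transaction applied yet
    }

    // Stands in for the read cache of current row values.
    private final Map<String, Row> cache = new HashMap<>();

    /** Adds the increments to the row unless this transaction was already applied. */
    boolean merge(String rowKey, long txId, Map<String, Long> increments) {
        Row row = cache.computeIfAbsent(rowKey, key -> new Row());
        if (row.txId == txId) {
            return false; // same transaction ID: skip the row
        }
        for (Map.Entry<String, Long> increment : increments.entrySet()) {
            row.columns.merge(increment.getKey(), increment.getValue(), Long::sum);
        }
        row.txId = txId;
        return true;
    }

    long value(String rowKey, String column) {
        Row row = cache.get(rowKey);
        return row == null ? 0L : row.columns.getOrDefault(column, 0L);
    }

    public static void main(String[] args) {
        CounterMerger merger = new CounterMerger();
        Map<String, Long> increments = Collections.singletonMap("2013-09-10.18", 6348L);
        merger.merge("adef1|(ad1:127)|hour", 876219102L, increments);
        merger.merge("adef1|(ad1:127)|hour", 876219102L, increments); // replay is ignored
        System.out.println(merger.value("adef1|(ad1:127)|hour", "2013-09-10.18"));
    }
}
```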

Page 41:

Write rows in mutation batches (of 400)
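
Splitting the rows to write into fixed-size groups is a plain list partition. The sketch below shows only that step, with a batch size of 400 as on the slide. It does not touch Cassandra, and `Batches.partition` is a hypothetical helper. Each resulting sublist would then be sent as one mutation batch.

```java
import java.util.ArrayList;
import java.util.List;

final class Batches {
    /** Splits `rows` into consecutive batches of at most `batchSize` elements. */
    static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            batches.add(new ArrayList<>(rows.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        for (List<Integer> batch : partition(rows, 400)) {
            System.out.println("batch of " + batch.size());
        }
    }
}
```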

Page 42:

Things we got wrong

Page 43:

FAIL #1

Too many column families

Each CF has 1 MB of heap overhead

Multi-tenancy FTW!

Page 44:

FAIL #2

Manual operations

CLI defaults to a replication factor of 1!

Tools and automation FTW!

Page 45:

FAIL #3

No snapshots

No way to undo data loading

Automated snapshots FTW!

Page 46:

FAIL #4

Timezones

Post-processing of queried data

Store data in customer timezone
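
Storing data in the customer's timezone means choosing the hour bucket at write time from the customer's local clock, so no shifting is needed when the data is queried. The sketch below shows that conversion with `java.time`. The `CustomerHour` helper and the `yyyy-MM-dd.HH` pattern (taken from the example column names) are assumptions about how this might be done.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

final class CustomerHour {
    // Pattern inferred from example column names such as 2013-09-10.18.
    static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("yyyy-MM-dd.HH");

    /** Hour column name for an event, expressed in the customer's local time. */
    static String column(Instant eventTime, ZoneId customerZone) {
        return eventTime.atZone(customerZone).format(HOUR);
    }

    public static void main(String[] args) {
        Instant event = Instant.parse("2013-09-10T16:30:00Z");
        // The same instant lands in different hour buckets per customer timezone.
        System.out.println(column(event, ZoneId.of("Europe/Stockholm")));
        System.out.println(column(event, ZoneId.of("America/New_York")));
    }
}
```

The trade-off is that the same event is bucketed differently for customers in different zones, so counters cannot be shared across timezones.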

Page 47:

• 10 TB of data
• 1500 wps
• 40,000 rps

Page 48:

Q&A