C* Summit EU 2013: Analytics On Top of Cassandra and Hadoop
#CASSANDRAEU
Analytics on top of Cassandra and Hadoop
Dmitry Mezhensky | Mirantis Inc
What we will discuss today
● Analytics on Cassandra using Hadoop
● Various types of statistics & implementation
● Scalability of approach
Problems
● Too many statistics (more than 100)
● Various types
  ○ Top N
  ○ Time series
  ○ Min/max/average/median
  ○ Extremum values on time interval
  ○ Fraud analysis
● Huge amount of data
● Scalability of approach
Statistics implementation on Hadoop
Top N
● Map phase generates <Key, Value> pairs; the top N is built by Value
● Reduce phase accumulates values; results are persisted to Cassandra via a custom output format
● A suitable comparator was used in Cassandra for the top N entities
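The map and reduce steps above can be sketched in plain Python, without Hadoop or Cassandra. This is only an illustration under assumptions: the input records, the "page" field, and the function names are hypothetical and do not come from the talk.

```python
from collections import defaultdict
import heapq


def map_hits(records):
    """Map phase: emit one <key, value> pair per record (one hit per page)."""
    for record in records:
        yield record["page"], 1


def reduce_hits(pairs):
    """Reduce phase: accumulate the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def top_n(totals, n):
    """Rank keys by their accumulated value, largest first."""
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])


# Hypothetical input: six page views across three pages.
records = [{"page": p} for p in ["a", "b", "a", "c", "a", "b"]]
totals = reduce_hits(map_hits(records))
top2 = top_n(totals, 2)
```

In the real pipeline the ranking step is not done in the job itself: the reducer output goes to Cassandra through the custom output format, and the comparator keeps it ordered.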
Top N
● On the write stage, sorting in Cassandra is done by value
● On the read stage, the first N records are the top N values
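A toy model of this idea, assuming a row whose columns are kept in descending order of value at write time. The `SortedRow` class is a hypothetical in-memory stand-in, not a Cassandra API; the actual comparator and schema are not given in the slides.

```python
import bisect


class SortedRow:
    """Toy stand-in for a wide row whose columns stay ordered by a comparator."""

    def __init__(self):
        self._columns = []  # kept sorted by (-value, entity)

    def write(self, entity, value):
        # Negating the value mimics a reversed (descending) comparator,
        # so the ordering cost is paid when the column is written.
        bisect.insort(self._columns, (-value, entity))

    def read_first(self, n):
        # A slice from the start of the row is already the top N.
        return [(entity, -neg) for neg, entity in self._columns[:n]]


row = SortedRow()
for entity, value in [("a", 3), ("c", 1), ("b", 2), ("d", 7)]:
    row.write(entity, value)
top2 = row.read_first(2)
```

The design choice is that reads need no sorting at all: the query is a simple slice of the first N columns.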
Time series
● Map phase generates <Time, Value> pairs
● Reduce phase accumulates values (behaviour varies by statistic)
● Results are persisted to Cassandra using a custom output format, with one row key per statistic and one column per date
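The time-series layout can be sketched as follows, assuming daily granularity and a sum as the accumulation. The statistic name "daily_total", the sample events, and the nested-dict row model are illustrative assumptions, not details from the talk.

```python
from collections import defaultdict
from datetime import datetime


def map_time_value(events):
    """Map phase: emit <time, value> pairs, with time truncated to the day."""
    for timestamp, value in events:
        yield timestamp.date().isoformat(), value


def reduce_by_day(pairs, accumulate=sum):
    """Reduce phase: accumulate per day; `accumulate` varies by statistic."""
    grouped = defaultdict(list)
    for day, value in pairs:
        grouped[day].append(value)
    return {day: accumulate(values) for day, values in grouped.items()}


def to_row(statistic_name, per_day):
    """Model the storage layout: one row key per statistic, one column per date."""
    return {statistic_name: dict(sorted(per_day.items()))}


# Hypothetical events: (timestamp, value).
events = [
    (datetime(2013, 10, 17, 9, 0), 5),
    (datetime(2013, 10, 17, 18, 30), 7),
    (datetime(2013, 10, 18, 11, 15), 4),
]
row = to_row("daily_total", reduce_by_day(map_time_value(events)))
peak_row = to_row("daily_peak", reduce_by_day(map_time_value(events), accumulate=max))
```

Swapping `accumulate` (for example `max` instead of `sum`) is one way to model the "behaviour varies by statistic" point without changing the map phase or the row layout.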
Maximum, minimum, extremum on interval
● Max/min values are simple to calculate
● Extremum on an interval is calculated similarly to time series
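A minimal sketch of both cases, assuming samples arrive as (epoch seconds, value) pairs and intervals are fixed-width buckets. The bucket width and sample data are hypothetical.

```python
from collections import defaultdict


def extrema_by_interval(samples, interval_seconds):
    """Return {interval_start: (min, max)} for (epoch_seconds, value) samples."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Key each sample by the start of its interval, like a time-series key.
        start = ts - ts % interval_seconds
        buckets[start].append(value)
    return {start: (min(v), max(v)) for start, v in sorted(buckets.items())}


# Hypothetical samples spanning two one-minute intervals.
samples = [(0, 4.0), (30, 9.0), (59, 1.0), (60, 5.0), (119, 2.0)]
per_minute = extrema_by_interval(samples, 60)

# Global min/max needs no bucketing at all.
values = [value for _, value in samples]
overall = (min(values), max(values))
```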
Fraud analysis
● Fraud analysis runs after all statistics are calculated
● Processed data is then filtered by fraud filters
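The slides do not say what the fraud filters check, so the following is only a sketch of the post-processing shape: already computed statistics pass through a list of filter predicates. Both filters, their thresholds, and the record fields are hypothetical.

```python
def exceeds_rate(limit):
    """Hypothetical filter: flag entities with more events than `limit`."""
    return lambda stat: stat["events"] > limit


def too_many_sources(limit):
    """Hypothetical filter: flag entities seen from more than `limit` sources."""
    return lambda stat: stat["sources"] > limit


def apply_fraud_filters(stats, filters):
    """Split computed statistics into (clean, flagged) lists."""
    clean, flagged = [], []
    for stat in stats:
        # A record is flagged as soon as any filter matches it.
        (flagged if any(f(stat) for f in filters) else clean).append(stat)
    return clean, flagged


# Hypothetical statistics produced by the earlier jobs.
stats = [
    {"entity": "u1", "events": 12, "sources": 1},
    {"entity": "u2", "events": 950, "sources": 2},
    {"entity": "u3", "events": 40, "sources": 9},
]
clean, flagged = apply_fraud_filters(stats, [exceeds_rate(500), too_many_sources(5)])
```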
Scalability of approach
● Data is read from and written to Cassandra only
● Hadoop is elastically scalable
● Cassandra is elastically scalable
● No bottleneck
Questions?
Thank you!