C* Summit EU 2013: Analytics On Top of Cassandra and Hadoop

#CASSANDRAEU Analytics on top of Cassandra and Hadoop Dmitry Mezhensky | Mirantis Inc


Speaker: Dmitry Mezhensky

Transcript of C* Summit EU 2013: Analytics On Top of Cassandra and Hadoop

Page 1: C* Summit EU 2013: Analytics On Top of Cassandra and Hadoop

#CASSANDRAEU

Analytics on top of Cassandra and Hadoop

Dmitry Mezhensky | Mirantis Inc

Page 2

What we will discuss today

● Analytics on Cassandra using Hadoop
● Various types of statistics & implementation
● Scalability of approach


Page 3

Problems

● Too many statistics (more than 100)
● Various types
  ○ Top N
  ○ Time series
  ○ Min/max/average/median
  ○ Extremum values on time interval
  ○ Fraud analysis
● Huge amount of data
● Scalability of approach


Page 4


Statistics implementation on Hadoop

Page 5

Top N

● Map phase generates <Key, Value> pairs; the top N is built by Value

● Reduce phase accumulates values; results are persisted to Cassandra via a custom output format

● A suitable comparator was used for the top N entities in Cassandra
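The map and reduce steps above can be sketched in plain Python. This is only an illustration of the data flow, not the Hadoop job from the talk: the record layout, the function names, and the use of a sum as the accumulation are all assumptions.

```python
from collections import defaultdict
import heapq

def map_phase(records):
    """Emit one (key, value) pair per input record (record layout is hypothetical)."""
    for record in records:
        yield record["key"], record["value"]

def reduce_phase(pairs):
    """Accumulate the values emitted for each key; summing is an assumed choice."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def top_n(totals, n):
    """Pick the n keys with the largest accumulated value."""
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

records = [
    {"key": "a", "value": 3},
    {"key": "b", "value": 5},
    {"key": "a", "value": 4},
    {"key": "c", "value": 1},
]
totals = reduce_phase(map_phase(records))
print(top_n(totals, 2))
```

In the approach described on the slide, the final ordering is left to Cassandra's comparator rather than computed in the job; `top_n` here only shows what the result should look like.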


Page 6

Top N

● On the write stage to Cassandra, sorting is done by value

● On the reading stage, the first N records are the top N values
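The idea of sorting on write and slicing on read can be illustrated with a small in-memory stand-in. `SortedRow` is a hypothetical class, not a Cassandra API; it only mimics a row whose entries are kept in descending order of value, as a reversed comparator would keep them.

```python
import bisect

class SortedRow:
    """In-memory stand-in for one row whose entries stay sorted by value, descending."""

    def __init__(self):
        self._columns = []  # list of (-value, name), kept in ascending order

    def write(self, name, value):
        # Negating the value makes ascending order on the tuple equal to
        # descending order on the value, which is what a reversed comparator gives.
        bisect.insort(self._columns, (-value, name))

    def read_first(self, n):
        # No sorting at read time: the first n entries are already the top n.
        return [(name, -neg_value) for neg_value, name in self._columns[:n]]

row = SortedRow()
for name, value in [("a", 7), ("b", 5), ("c", 1), ("d", 9)]:
    row.write(name, value)
print(row.read_first(2))
```

The design trade-off is that ordering cost is paid once per write, so a top N query becomes a cheap slice of the first N entries.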


Page 7

Time series

● Map phase generates <Time, Value> pairs
● Reduce phase accumulates values (behaviour varies between statistics)
● Results are persisted to Cassandra using a custom output format, with one row key per statistic and one column per date
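A minimal sketch of the time-series flow, assuming daily buckets and a plain dictionary in place of Cassandra. The event fields, the statistic names `daily_total` and `daily_max`, and the `persist` helper are hypothetical; the `accumulate` parameter stands in for the per-statistic reduce behaviour.

```python
from collections import defaultdict
from datetime import datetime

def map_phase(events):
    """Emit (day, value) pairs; the day is taken from an ISO 8601 timestamp."""
    for event in events:
        day = datetime.fromisoformat(event["timestamp"]).date().isoformat()
        yield day, event["value"]

def reduce_phase(pairs, accumulate=sum):
    """Group values by day and accumulate them; the function differs per statistic."""
    grouped = defaultdict(list)
    for day, value in pairs:
        grouped[day].append(value)
    return {day: accumulate(values) for day, values in grouped.items()}

def persist(store, statistic, per_day):
    """One row key per statistic, one column per date (a dict stands in for Cassandra)."""
    store.setdefault(statistic, {}).update(per_day)

events = [
    {"timestamp": "2013-10-17T09:00:00", "value": 2},
    {"timestamp": "2013-10-17T18:30:00", "value": 3},
    {"timestamp": "2013-10-18T08:15:00", "value": 10},
]
store = {}
persist(store, "daily_total", reduce_phase(map_phase(events)))
persist(store, "daily_max", reduce_phase(map_phase(events), accumulate=max))
print(store)
```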


Page 8

Maximum, minimum, extremum on interval

● Max/min values are simple to calculate
● Extremum on an interval is calculated similarly to time series
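One possible reading of "extremum on interval" is to take a per-date series and pick the largest or smallest value within a date range. The sketch below assumes ISO date strings as column names, which compare correctly as plain strings; the function name and sample data are hypothetical.

```python
def extremum_on_interval(series, start, end, pick=max):
    """Return (day, value) of the extremum within [start, end], or None if empty.

    `series` maps ISO date strings to values; `pick` is max or min.
    """
    in_range = {day: value for day, value in series.items() if start <= day <= end}
    if not in_range:
        return None
    day = pick(in_range, key=in_range.get)
    return day, in_range[day]

series = {
    "2013-10-16": 4,
    "2013-10-17": 5,
    "2013-10-18": 10,
    "2013-10-19": 7,
}
print(extremum_on_interval(series, "2013-10-16", "2013-10-17"))
print(extremum_on_interval(series, "2013-10-16", "2013-10-19", pick=min))
```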


Page 9

Fraud analysis

● Fraud analysis runs after all statistics are calculated

● Processed data is filtered by fraud filters
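The slides do not say what the fraud filters check, so the following is only a guess at the general shape: a set of named predicates applied to already computed statistics. The filter names, thresholds, and statistic fields are all invented for illustration.

```python
def apply_fraud_filters(stats, filters):
    """Split entities into clean ones and flagged ones (with the filters they hit)."""
    clean, flagged = {}, {}
    for entity, values in stats.items():
        hits = [name for name, is_suspicious in filters.items() if is_suspicious(values)]
        if hits:
            flagged[entity] = hits
        else:
            clean[entity] = values
    return clean, flagged

# Hypothetical filters; real rules and thresholds are not given in the talk.
filters = {
    "too_many_events": lambda s: s["events_per_day"] > 1000,
    "value_spike": lambda s: s["daily_max"] > 10 * s["daily_avg"],
}

# Hypothetical statistics produced by earlier jobs.
stats = {
    "user-1": {"events_per_day": 40, "daily_max": 12, "daily_avg": 6},
    "user-2": {"events_per_day": 5000, "daily_max": 900, "daily_avg": 30},
}

clean, flagged = apply_fraud_filters(stats, filters)
print(flagged)
```

Running the filters as a separate pass over computed statistics matches the ordering on the slide, where fraud analysis starts only after all statistics are available.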


Page 10

Scalability approach

● Data is read from and written to Cassandra only
● Hadoop is elastically scalable
● Cassandra is elastically scalable
● No bottleneck


Page 11

Questions?


Page 12


Thank you!