codecentric AG: Using Cassandra and Clojure for Data Crunching backends

@ifesdjeen

Transcript of codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Page 1: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

@ifesdjeen

Page 2: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Cassandra Monitoring

Page 3: codecentric AG: Using Cassandra and Clojure for Data Crunching backends
Page 4: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Precision

Page 5: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

is not the same as

Page 6: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Semantics

Page 7: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

is not the same as

Page 8: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Anomaly detection

Page 9: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Do you see the elephant being swallowed by the snake?

Page 10: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Agenda

Page 11: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Ad-hoc queries

Page 12: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Fast Aggregations

Page 13: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Machine Learning

Page 14: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Step 1: parallel queries

Page 15: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

+---------------+---------------+
| timestamp     | sequenceId    |
+---------------+---------------+

Page 16: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Sequence ID

Used to avoid timestamp resolution collisions
To ensure sub-resolution order
Snapshot the data on overflow or timeout
Ensures idempotence
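A minimal Python sketch of how a sequence ID might disambiguate events that share one timestamp tick. The `SequenceAssigner` class and the millisecond resolution are assumptions made for illustration; the slides do not show the actual implementation.

```python
class SequenceAssigner:
    """Hypothetical helper: hands out (timestamp, sequence_id) keys that stay
    unique and ordered even when several events share one timestamp tick."""

    def __init__(self):
        self.last_ts = None
        self.seq = 0

    def next_key(self, ts_millis):
        if ts_millis == self.last_ts:
            self.seq += 1            # same tick: bump the sequence
        else:
            self.last_ts = ts_millis
            self.seq = 0             # new tick: restart the sequence
        return (ts_millis, self.seq)


assigner = SequenceAssigner()
keys = [assigner.next_key(ts) for ts in (1000, 1000, 1000, 1001)]
```

Because the pair forms the whole key, replaying a write with the same pair would overwrite the earlier row rather than duplicate it, which is one way to read the "ensures idempotence" point.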

Page 17: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Fighting Dispersion

Page 18: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

ts1 ts2 ts3 ts4 ts5 ts6 ts7 ts8 ts9 ts10 ts11 ts12 ts13

Range Tables

Page 19: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Full Table Scan

ts1 ts2 ts3 ts4 ts5 ts6 ts7 ts8 ts9 ts10 ts11 ts12 ts13

Start End

Page 20: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

ts1 ts2 ts3 ts4 ts5 ts6 ts7 ts8 ts9 ts10 ts11 ts12 ts13

Page 21: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Open Range

Start End

ts1 ts2 ts3 ts4 ts5 ts6 ts7 ts8 ts9 ts10 ts11 ts12 ts13

Page 22: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

ts1 ts2 ts3 ts4 ts5 ts6 ts7 ts8 ts9 ts10 ts11 ts12 ts13

Page 23: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

“Between” Range

ts1 ts2 ts3 ts4 ts5 ts6 ts7 ts8 ts9 ts10 ts11 ts12 ts13

Start End

Page 24: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

ts1 ts2 ts3 ts4 ts5 ts6 ts7 ts8 ts9 ts10 ts11 ts12 ts13
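One possible reading of the range slides is that a time range is cut into fixed-width buckets so that each bucket can be queried in parallel. The sketch below illustrates that idea in Python; the bucket width, the function names, and the in-memory list standing in for the database are all assumptions, not details from the talk.

```python
from concurrent.futures import ThreadPoolExecutor

BUCKET = 4  # assumed number of timestamps covered by one range bucket


def split_range(start, end, bucket=BUCKET):
    """Split the half-open range [start, end) at bucket boundaries."""
    ranges = []
    lo = start
    while lo < end:
        hi = min((lo // bucket + 1) * bucket, end)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def query_bucket(data, lo, hi):
    # Stand-in for a per-bucket range query against the database.
    return [ts for ts in data if lo <= ts < hi]


def parallel_range_query(data, start, end):
    """Run one query per bucket in parallel and concatenate results in order."""
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: query_bucket(data, *r), split_range(start, end))
        return [ts for part in parts for ts in part]


timestamps = list(range(1, 14))            # ts1 .. ts13
result = parallel_range_query(timestamps, 3, 11)
```

A "between" range touches only the buckets it overlaps, while an open range or full scan fans out over every bucket from the start point onward.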

Page 25: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Step 2: add some algebra

(rich query API)

Page 26: codecentric AG: Using Cassandra and Clojure for Data Crunching backends
Page 27: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Stream Fusion for rich ad-hoc queries

Page 28: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

What even is Stream Fusion?

Page 29: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

map

filter

reduce

Page 30: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

single step mapFilterReduce

Page 31: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

data Step data cursor = Yield data !cursor | Skip !cursor | Done

data Stream data = ∃cursor. Stream (cursor → Step data cursor) cursor

Page 32: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Stream Beginning: reading from the DB

Page 33: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

map

maps :: (a → b) → Stream a → Stream b

Yield data cursor → Yield (f data) cursor
Skip cursor → Skip cursor
Done → Done

Page 34: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

filter

filters :: (a → Bool) → Stream a → Stream a

Yield data cursor | p data → Yield data cursor
                  | otherwise → Skip cursor
Skip cursor → Skip cursor
Done → Done

Page 35: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

reduce/fold

foldls :: (Monoid acc) => (acc → a → acc) → acc → Stream a → acc

-- loop starts with acc = z, the initial accumulator
Yield x cursor → loop (f acc x) cursor
Skip cursor → loop acc cursor
Done → acc
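The step-based definitions above can be sketched in Python as follows. This is an illustration rather than the talk's implementation: Python performs no compile-time fusion, so the point shown is only that map, filter and fold compose into one pass driven by a single step function, with no intermediate collections. The `from_rows` helper and the list standing in for a database read are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Yield:
    value: Any
    cursor: Any


@dataclass(frozen=True)
class Skip:
    cursor: Any


@dataclass(frozen=True)
class Done:
    pass


@dataclass(frozen=True)
class Stream:
    step: Callable[[Any], Any]   # cursor -> Yield | Skip | Done
    cursor: Any


def from_rows(rows):
    """Stream beginning: a list stands in for rows read from the DB."""
    def step(i):
        return Yield(rows[i], i + 1) if i < len(rows) else Done()
    return Stream(step, 0)


def maps(f, stream):
    def step(cursor):
        s = stream.step(cursor)
        if isinstance(s, Yield):
            return Yield(f(s.value), s.cursor)
        return s                 # Skip and Done pass through unchanged
    return Stream(step, stream.cursor)


def filters(p, stream):
    def step(cursor):
        s = stream.step(cursor)
        if isinstance(s, Yield) and not p(s.value):
            return Skip(s.cursor)    # rejected element becomes a Skip
        return s
    return Stream(step, stream.cursor)


def foldls(f, z, stream):
    acc, cursor = z, stream.cursor
    while True:
        s = stream.step(cursor)
        if isinstance(s, Done):
            return acc
        if isinstance(s, Yield):
            acc = f(acc, s.value)
        cursor = s.cursor


total = foldls(lambda acc, x: acc + x, 0,
               filters(lambda x: x % 2 == 0,
                       maps(lambda x: x * 3, from_rows([1, 2, 3, 4]))))
```

Each element is pulled from the source, mapped, tested and folded before the next one is read, which is the "single step mapFilterReduce" behaviour.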

Page 36: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Append

class Monoid a where
  mempty :: a               -- ^ Identity of 'mappend'
  mappend :: a -> a -> a    -- ^ An associative operation

Page 37: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Combine

class (Monoid intermediate) => Aggregate intermediate end where
  combine :: intermediate -> end

Page 38: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Count Example

data Count = Count Int

instance Monoid Count where
  mempty = Count 0
  mappend (Count a) (Count b) = Count $ a + b

instance Aggregate Count Int where
  combine (Count a) = a
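The same Count aggregate, sketched in Python to show why the monoid shape matters for parallel queries: partial counts computed over separate chunks can be appended in any grouping and then turned into the final result. The method names mirror the Haskell ones; the chunking and the `count_rows` helper are assumptions for illustration.

```python
from functools import reduce


class Count:
    """Intermediate aggregate: a monoid under addition."""

    def __init__(self, n=0):
        self.n = n

    @staticmethod
    def mempty():
        return Count(0)                    # identity of mappend

    def mappend(self, other):
        return Count(self.n + other.n)     # associative operation

    def combine(self):
        return self.n                      # intermediate -> end result


def count_rows(rows):
    """Fold one chunk of rows into a partial Count."""
    return reduce(lambda acc, _row: acc.mappend(Count(1)), rows, Count.mempty())


# Partial results, e.g. one per parallel range query.
partials = [count_rows(chunk) for chunk in ([1, 2, 3], [4, 5], [])]
total = reduce(Count.mappend, partials, Count.mempty()).combine()
```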

Page 39: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Step 3: add some ML

Page 40: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Storing Models

Page 41: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Support Vector Machines

Page 42: codecentric AG: Using Cassandra and Clojure for Data Crunching backends
Page 43: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Hyperplane: α·x - φ = 1

Page 44: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

[ α1 α2 α3 ... αn ] φ

Page 45: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Option 1: list<double>

Page 46: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

CREATE TABLE support_vectors(
  path varchar,
  alpha list<double>,
  phi int,
  PRIMARY KEY(path))

Page 47: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Problems

High deserialisation overhead
Need to add PK specifiers for multiple SVs

Page 48: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Alternative: blob & byte buffers

Page 49: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Vector Representation

Page 50: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

byte address   0    8    16   24   32   40        n*8
               +----+----+----+----+----+---------+----+
points         | α0 | α1 | α2 | α3 | α4 |   ...   | αn |
               +----+----+----+----+----+---------+----+
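A small Python sketch of the flat layout: each double occupies 8 bytes, so element i lives at byte offset i*8 and can be read without decoding the rest of the blob. Big-endian byte order is assumed here because that is the default for a JVM `ByteBuffer`; the helper names are hypothetical.

```python
import struct


def pack_vector(values):
    """Serialise a list of floats as consecutive 8-byte big-endian doubles."""
    return struct.pack(f">{len(values)}d", *values)


def read_at(blob, i):
    """Read element i directly from its byte offset, i * 8."""
    return struct.unpack_from(">d", blob, i * 8)[0]


blob = pack_vector([0.5, 1.5, -2.0])
second = read_at(blob, 1)
```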

Page 51: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Matrix Representation

Page 52: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

byte address            0     8     16    24    32    40         n*8
                        +-----+-----+-----+-----+-----+----------+-----+
                        | α00 | α01 | α02 | α03 | α04 |   ...    | α0n |
                        +-----+-----+-----+-----+-----+----------+-----+

byte address  n*8 +     0     8     16    24    32    40         n*8
                        +-----+-----+-----+-----+-----+----------+-----+
                        | α10 | α11 | α12 | α13 | α14 |   ...    | α1n |
                        +-----+-----+-----+-----+-----+----------+-----+

byte address  m*n*8 +   0     8     16    24    32    40         n*8
                        +-----+-----+-----+-----+-----+----------+-----+
                        | αm0 | αm1 | αm2 | αm3 | αm4 |   ...    | αmn |
                        +-----+-----+-----+-----+-----+----------+-----+
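The matrix layout is the vector layout repeated row after row, so an element address reduces to one multiplication and one addition. The sketch below assumes row-major order, 8-byte big-endian doubles, and a row stride of `cols * 8` bytes; `BlobMatrix` is a hypothetical name.

```python
import struct


class BlobMatrix:
    """Row-major matrix of doubles stored in one flat byte buffer."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.buf = bytearray(rows * cols * 8)   # zero bytes decode as 0.0

    def offset(self, i, j):
        # Row i starts at i * cols * 8; column j adds j * 8.
        return (i * self.cols + j) * 8

    def get(self, i, j):
        return struct.unpack_from(">d", self.buf, self.offset(i, j))[0]

    def set(self, i, j, value):
        struct.pack_into(">d", self.buf, self.offset(i, j), value)


m = BlobMatrix(2, 3)
m.set(1, 2, 4.25)
```

The buffer can be written to a blob column as is, and a single cell can be updated in memory without re-encoding its neighbours.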

Page 53: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Advantages

“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access
Easy to go multi-dimensional
Easy to implement atomic in-memory operations

Page 54: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Bayesian Classifiers

Page 55: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

P(X | blue) = (Number of Blue near X) / (Total number of Blue)

P(X | red) = (Number of Red near X) / (Total number of Red)

Page 56: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

[[Mean(x1), Var(x1)]
 [Mean(x2), Var(x2)]
 ...
 [Mean(xn), Var(xn)]]

Page 57: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

byte address   0          8          16
               +----------+----------+
payloads       | Mean(x0) | Var(x0)  |
               +----------+----------+

byte address   16         24         32
               +----------+----------+
payloads       | Mean(x1) | Var(x1)  |
               +----------+----------+

byte address   2n*8       (2n+1)*8
               +----------+----------+
payloads       | Mean(xn) | Var(xn)  |
               +----------+----------+
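A Python sketch of the interleaved layout, under the assumption that the classifier is a Gaussian naive Bayes model: each feature owns 16 bytes, mean first and variance second, so feature i starts at byte offset i*16 (that is, 2i*8). Function names are hypothetical.

```python
import math
import struct


def pack_params(params):
    """Serialise [(mean, var), ...] as interleaved big-endian doubles."""
    flat = [v for mean, var in params for v in (mean, var)]
    return struct.pack(f">{len(flat)}d", *flat)


def mean_var(blob, i):
    """Read the (mean, variance) pair of feature i from offset i * 16."""
    return struct.unpack_from(">2d", blob, i * 16)


def gaussian_likelihood(blob, i, x):
    """Density of value x under feature i's normal distribution."""
    mean, var = mean_var(blob, i)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


model = pack_params([(0.0, 1.0), (5.0, 4.0)])
pair = mean_var(model, 1)
```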

Page 58: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Advantages

“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access
Easy to implement atomic in-memory operations

Page 59: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Step 4: make it rocket-fast

Page 60: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Approximate Data Structures

Page 61: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Bloom Filters are basically long arrays / vectors

Page 62: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

BitSet

Page 63: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

bit address   0                               8
              +---+---+---+---+---+---+---+---+
              | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
              +---+---+---+---+---+---+---+---+
bit address   8                               16
              +---+---+---+---+---+---+---+---+
              | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
              +---+---+---+---+---+---+---+---+
bit address   16                              24
              +---+---+---+---+---+---+---+---+
              | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
              +---+---+---+---+---+---+---+---+
bit address   24                              32
              +---+---+---+---+---+---+---+---+
              | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
              +---+---+---+---+---+---+---+---+
              ...

Page 64: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Advantages

64 bits per 8-byte Long
Easy to represent as a long array using offsets, bit shifts and masks
Easy to implement atomic in-memory operations
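A Python sketch of a bit set backed by 64-bit words, with a Bloom filter built on top. Bit i lives in word `i >> 6` at position `i & 63`. The hashing scheme (SHA-256 with a seed prefix) and the default sizes are assumptions picked for the illustration, not the talk's choices.

```python
import hashlib


class BitSet:
    """Bits packed 64 per word, addressed with shifts and masks."""

    def __init__(self, nbits):
        self.words = [0] * ((nbits + 63) // 64)

    def set(self, i):
        self.words[i >> 6] |= 1 << (i & 63)

    def get(self, i):
        return ((self.words[i >> 6] >> (i & 63)) & 1) == 1


class BloomFilter:
    def __init__(self, nbits=1024, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = BitSet(nbits)

    def _positions(self, item):
        # One bit position per seed; the seeded SHA-256 is an assumed scheme.
        for seed in range(self.nhashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits.set(p)

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits.get(p) for p in self._positions(item))


bs = BitSet(128)
bs.set(70)
bf = BloomFilter()
bf.add("cassandra")
```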

Page 65: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Count-min sketches are basically int matrices

Page 66: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

byte address            0     8     16    24    32    40         n*8
                        +-----+-----+-----+-----+-----+----------+-----+
                        | α00 | α01 | α02 | α03 | α04 |   ...    | α0n |
                        +-----+-----+-----+-----+-----+----------+-----+

byte address  n*8 +     0     8     16    24    32    40         n*8
                        +-----+-----+-----+-----+-----+----------+-----+
                        | α10 | α11 | α12 | α13 | α14 |   ...    | α1n |
                        +-----+-----+-----+-----+-----+----------+-----+

byte address  m*n*8 +   0     8     16    24    32    40         n*8
                        +-----+-----+-----+-----+-----+----------+-----+
                        | αm0 | αm1 | αm2 | αm3 | αm4 |   ...    | αmn |
                        +-----+-----+-----+-----+-----+----------+-----+
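A Python sketch of a count-min sketch stored as one flat, row-major array of counters, in the spirit of the matrix layout. Each row uses its own hash; an estimate is the minimum over rows, so it can overcount but never undercount. The dimensions and the seeded SHA-256 hashing are assumptions for illustration.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=4, width=64):
        self.depth, self.width = depth, width
        self.cells = [0] * (depth * width)     # flat, row-major counters

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        col = int.from_bytes(digest[:8], "big") % self.width
        return row * self.width + col

    def add(self, item, count=1):
        for row in range(self.depth):
            self.cells[self._index(row, item)] += count

    def estimate(self, item):
        return min(self.cells[self._index(row, item)]
                   for row in range(self.depth))


cms = CountMinSketch()
for _ in range(3):
    cms.add("x")
cms.add("y")
```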

Page 67: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Histograms are basically long vectors

Page 68: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Longs (counts)

byte address   0     8     16    24    32    40         n*8
               +-----+-----+-----+-----+-----+----------+-----+
               | α00 | α01 | α02 | α03 | α04 |   ...    | α0n |
               +-----+-----+-----+-----+-----+----------+-----+

Doubles (bin start number)

byte address   0     8     16    24    32    40         n*8
               +-----+-----+-----+-----+-----+----------+-----+
               | α00 | α01 | α02 | α03 | α04 |   ...    | α0n |
               +-----+-----+-----+-----+-----+----------+-----+
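A Python sketch of the two-vector histogram: one sorted vector of bin start values and one parallel vector of counts. A binary search over the bin starts locates the bin. Dropping values below the first bin start is an assumption of this sketch, as is the class name.

```python
from bisect import bisect_right


class Histogram:
    def __init__(self, bin_starts):
        self.bin_starts = list(bin_starts)          # doubles, sorted ascending
        self.counts = [0] * len(self.bin_starts)    # longs, one per bin

    def add(self, x):
        # Index of the last bin whose start is <= x.
        i = bisect_right(self.bin_starts, x) - 1
        if i >= 0:                                  # below first bin: ignored
            self.counts[i] += 1


h = Histogram([0.0, 10.0, 20.0])
for value in (1, 10, 15, 25, -5):
    h.add(value)
```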

Page 69: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Conclusions

Ad-hoc queries
Parallelism
Lightweight data structure representations
Optimisations and good API fits

Page 70: codecentric AG: Using Cassandra and Clojure for Data Crunching backends

@ifesdjeen

http://bit.ly/cassandrasummit2015