Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Mining Correlations on Massive Bursty Time Series Collections Tomasz Kuśmierczyk and Kjetil Nørvåg

Transcript of Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Page 1: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Mining Correlations on Massive Bursty Time Series Collections

Tomasz Kuśmierczyk and Kjetil Nørvåg

Page 2: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Problem statement

bursty streams

one of many possible burst detection methods

Page 3: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Problem statement

bursty streams

streams of bursts

Page 4: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Problem statement

bursty streams

streams of bursts

correlated bursts

identify correlated bursty streams

Page 5: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Problem: Massive Collections

● identify pairs: correlation ≥ threshold
● N ~ millions of streams
● naive (all pairs) solution complexity ~ N^2

● pruning
● indexing

Page 6: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Motivation

● any source of a large number of streams:
  ○ social media
  ○ web page view counts
  ○ traffic monitoring sensors
  ○ smart grid (electricity consumption meters)
  ○ and more

Page 7: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Correlated bursts

● different lengths
● different heights
● slight shifts

but
● should overlap

Page 8: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Correlated bursty streams

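For two streams i and j:

J(E_i, E_j) = min(o_i^j, o_j^i) / (e_i + e_j - min(o_i^j, o_j^i))

● e_i, e_j: number of bursts per stream
● o_i^j: number of bursts from stream i overlapping with j
● o_j^i: number of bursts from stream j overlapping with i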

Page 9: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Correlated bursty streams

For two streams i and j:

● e_i, e_j: number of bursts per stream
● o_i^j: number of bursts from stream i overlapping with j
● o_j^i: number of bursts from stream j overlapping with i

[Figure: bursts of streams E_i and E_j along the time axis]

E_i: e_i = 4, o_i^j = 3
E_j: e_j = 3, o_j^i = 2

min(o_i^j, o_j^i) = min(3, 2) = 2

J(E_i, E_j) = 2 / (4 + 3 - 2) = 2/5
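The worked example on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the representation of a burst as a closed (start, end) interval, the function names, and the concrete burst positions are assumptions made for illustration; only the counts (4 and 3 bursts, 3 and 2 overlapping) come from the slide.

```python
from typing import List, Tuple

Burst = Tuple[float, float]  # assumed representation: closed interval (start, end)


def bursts_overlap(a: Burst, b: Burst) -> bool:
    """Two closed intervals overlap when neither ends before the other starts."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_count(src: List[Burst], other: List[Burst]) -> int:
    """Number of bursts in src that overlap at least one burst in other."""
    return sum(1 for a in src if any(bursts_overlap(a, b) for b in other))


def burst_correlation(ei: List[Burst], ej: List[Burst]) -> float:
    """J = min(o_i^j, o_j^i) / (e_i + e_j - min(o_i^j, o_j^i))."""
    m = min(overlap_count(ei, ej), overlap_count(ej, ei))
    denom = len(ei) + len(ej) - m
    return m / denom if denom else 0.0


# Hypothetical burst positions chosen so that e_i = 4, o_i^j = 3, e_j = 3, o_j^i = 2.
E_I = [(0, 2), (4, 5), (10, 12), (30, 31)]
E_J = [(1, 6), (11, 13), (20, 21)]

print(overlap_count(E_I, E_J), overlap_count(E_J, E_I))  # 3 2
print(burst_correlation(E_I, E_J))                       # 0.4, i.e. 2/5
```

Computing the overlap counts by brute force like this is quadratic in the number of bursts per pair, which is exactly the cost the indexes on the following slides try to avoid.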

Page 10: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Enumerating pairs

Order streams according to number of bursts:

❏ FOREACH base count b
    ❏ FOREACH b’ IN connected counts of b
        ❏ compare streams with b and b’ bursts

Page 11: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Pruning

● for each base count b we need to consider only connected counts b’ such that:

JT · b ≤ b’ ≤ b

where JT is the threshold, b is a particular base count, and b’ ranges over the possible connected counts
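The bound follows from the correlation measure: the smaller overlap count cannot exceed b’, so J is at most b’ / b, and a pair can only reach the threshold when b’ ≥ JT · b. The enumeration with this pruning could be sketched as follows. The function names, the dictionary input, and the small epsilon guarding against floating-point rounding are assumptions for illustration, not part of the slides.

```python
import math
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def connected_counts(b: int, jt: float) -> range:
    """Counts b' with jt * b <= b' <= b (epsilon guards against float rounding)."""
    lowest = max(1, math.ceil(jt * b - 1e-9))
    return range(lowest, b + 1)


def candidate_pairs(burst_counts: Dict[str, int], jt: float) -> Iterator[Tuple[str, str]]:
    """Yield each pair of stream ids whose burst counts survive the pruning."""
    by_count: Dict[int, List[str]] = defaultdict(list)
    for stream_id, count in burst_counts.items():
        by_count[count].append(stream_id)

    for b in sorted(by_count):                 # FOREACH base count b
        for b2 in connected_counts(b, jt):     # FOREACH connected count b'
            if b2 not in by_count:
                continue
            if b2 == b:
                group = by_count[b]
                for x in range(len(group)):    # each unordered pair once
                    for y in range(x + 1, len(group)):
                        yield group[x], group[y]
            else:
                for s in by_count[b]:
                    for t in by_count[b2]:
                        yield s, t


# Hypothetical burst counts: stream "c" is too small to pair with the others.
counts = {"a": 10, "b": 9, "c": 5, "d": 10}
print(list(connected_counts(10, 0.8)))  # [8, 9, 10]
print(sorted(candidate_pairs(counts, 0.8)))
```

The surviving pairs still have to be compared burst by burst; the pruning only limits which groups of streams are compared at all.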

Page 12: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Interval Boxes (IB) index

● k-subset of bursts = k-dim box
● k-dimensional R-trees

[Figure: example for k=2, the representation of stream E_i as 2-dimensional boxes]

Page 13: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Interval Boxes (IB) index

● k-subset of bursts = k-dim box
● k-dimensional R-trees
● k-dim boxes overlapping = at least k bursts overlap

[Figure: overlapping "Indexed" and "Query" boxes]

min(o_i^j, o_j^i) ≥ k
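The box idea can be illustrated without an R-tree. In the sketch below, every k-subset of a stream's time-ordered bursts becomes a k-dimensional box (one interval per dimension), and two boxes overlap when their intervals overlap in every dimension. The brute-force scan over all boxes stands in for the R-tree query and is an assumption made to keep the example self-contained; the names and burst positions are hypothetical as well.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

Burst = Tuple[float, float]   # assumed representation: closed interval (start, end)
Box = Tuple[Burst, ...]       # one interval per dimension


def interval_boxes(bursts: List[Burst], k: int) -> Iterator[Box]:
    """All k-subsets of the time-ordered bursts, each read as a k-dim box."""
    return combinations(sorted(bursts), k)


def boxes_overlap(a: Box, b: Box) -> bool:
    """Boxes overlap when the intervals overlap in every dimension."""
    return all(x[0] <= y[1] and y[0] <= x[1] for x, y in zip(a, b))


def is_candidate(ei: List[Burst], ej: List[Burst], k: int) -> bool:
    """True when some box of ei overlaps some box of ej.

    A hit implies that k distinct bursts of each stream overlap the other
    stream, i.e. min(o_i^j, o_j^i) >= k. The converse is not claimed here.
    """
    return any(
        boxes_overlap(a, b)
        for a in interval_boxes(ei, k)
        for b in interval_boxes(ej, k)
    )


# Hypothetical streams with min(o_i^j, o_j^i) = 2.
E_I = [(0, 2), (4, 5), (10, 12), (30, 31)]
E_J = [(1, 6), (11, 13), (20, 21)]

print(is_candidate(E_I, E_J, 2))  # True
print(is_candidate(E_I, E_J, 3))  # False
```

The number of boxes grows combinatorially with k and with the number of bursts per stream, which is why the choice of dimensionality matters for index size.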

Page 14: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Interval Boxes (IB) index: mining

● mining:
  ○ for each base count b maintain an IB (R-trees) index
  ○ query it with streams having connected counts b’

[Figure: one index per base count: b=1, b=2, b=3, b=4]

Page 15: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Interval Boxes (IB) index: mining

● mining:
  ○ for each base count b maintain an IB (R-trees) index
  ○ query it with streams having connected counts b’

[Figure: one index per base count: b=1, b=2, b=3, b=4]

candidate pairs of streams: min(o_i^j, o_j^i) ≥ k

correlated output pairs

Page 16: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

IB index: what dimensionality k?

● small k (IB Low Dimensional = IBLD)
  ○ small indexes
  ○ large number of candidate pairs

● high k (IB High Dimensional = IBHD)
  ○ large indexes
  ○ small number of candidate pairs
  ○ k_max = JT · b (correlation ≥ threshold guaranteed)

Page 17: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

IBHD index in practice

● to speed things up, some k-subsets are skipped
● some pairs may be missed in multiple-overlap cases
● efficiency-effectiveness tradeoff

Page 18: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

List-based (LS) index: bins

separate bin for each (not pruned) pair b, b’:

b=1: b’=2, b’=3, b’=4, b’=5
b=2: b’=3, b’=4, b’=5
b=3: b’=4, b’=5, b’=6
b=4: b’=5, b’=6, b’=7

Page 19: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

List-based (LS) index: single bin

[Figure: a single bin laid out along the time axis; annotation: time granularity]

Page 20: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

LS index: mining algorithm

● Returns o_i^j and o_j^i
● Only for those pairs E_i, E_j that have at least one overlap
● Immediate validation of each pair's correlation J

Page 21: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

LS index: mining algorithm

● For each set of burst pointers (time moment):

[Figure: time axis with the current time moment (set of pointers), bursts active in the current moment, and bursts active in the previous moment]

Page 22: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

LS index: mining algorithm

● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)

[Figure: time axis with the current time moment (set of pointers), bursts active in the current moment, and bursts active in the previous moment, labelled NEW, OLD, and ENDING]

Page 23: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

LS index: mining algorithm

● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)
  ○ maintain map OVERLAPS = burst → set of overlapping streams

[Figure: time axis with the current time moment (set of pointers), bursts active in the current moment, and bursts active in the previous moment, labelled NEW, OLD, and ENDING]

Page 24: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

LS index: mining algorithm

● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)
  ○ maintain map OVERLAPS = burst → set of overlapping streams
  ○ update counts o_i^j and o_j^i

[Figure: time axis with the current time moment (set of pointers), bursts active in the current moment, and bursts active in the previous moment, labelled NEW, OLD, and ENDING]

Page 25: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Hybrid index

● LS index works well when:
  ○ low number of overlaps
  ○ high number of bursts per stream

● IBHD index works well when:
  ○ low number of bursts per stream
  ○ high number of overlaps

Page 26: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Hybrid index

● LS index works well when:
  ○ low number of overlaps
  ○ high number of bursts per stream

● IBHD index works well when:
  ○ low number of bursts per stream
  ○ high number of overlaps

● Solution: Hybrid index: IBHD index for low and LS index for high base counts

Page 27: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Experimental evaluation

● Wikipedia page views from the years 2011-2013
● Kleinberg’s burst extraction
● streams having at least 5 bursts
● 2.1M streams and 43M bursts in total
● 10 bursts per stream on average
● mean burst length 28h

Page 28: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Mining & building

Threshold: JT = 0.95

Page 29: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Hybrid mining

Number of streams: N = 2.1M

Page 30: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Number of generated pairs

Threshold: JT = 0.95 (<10% pairs missing)

How_I_Met_Your_Mother_(season_7) / Two_and_a_Half_Men_(season_9)
Process_(computing) / Central_processing_unit
Endoplasmic_reticulum / Ribosome
Greatest_Hits,_Vol._2_(Ronnie_Milsap_album) / Greatest_Hits,_Vol._3_(Ronnie_Milsap_album)
DigiTech_JamMan / Lexicon_JamMan
Humanistic_psychology / Positive_psychology

Computational limits for Naive/LS index

Page 31: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

What’s more in the paper?

● formal definitions and proofs
● considerations of combinatorial aspects
● multiple overlap cases
● on-line maintenance of indexes

Page 32: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Questions?

Tomasz Kuśmierczyk [email protected]

Page 33: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

Thank you!

Tomasz Kuśmierczyk [email protected]

Page 34: Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)

LS index: mining

● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)
  ○ new overlapping bursts: NEW x OLD ∪ NEW x NEW
  ○ remove ENDING and add new overlapping bursts to the map OVERLAPS = burst → set of overlapping streams
  ○ update counts o_i^j and o_j^i for new overlapping bursts and with the help of OVERLAPS
● For each i and j in o: calculate min(o_i^j, o_j^i) and J

[Figure: time axis with the current time moment (set of pointers), bursts active in the current moment, and bursts active in the previous moment, labelled NEW, OLD, and ENDING]
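The sweep described on this slide could look roughly like the sketch below. It is a simplified illustration under several assumptions: time is discretized into integer slots, bursts are inclusive (start, end) slot ranges, all streams share a single bin, and the set of active bursts per slot is held in a plain dictionary rather than in the list-based index of the paper. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

BurstId = Tuple[str, int]  # (stream id, index of the burst within its stream)


def ls_mine(streams: Dict[str, List[Tuple[int, int]]]) -> Dict[Tuple[str, str], float]:
    """Sweep over time slots and return J for every pair with at least one overlap."""
    # Assumed stand-in for the index: time slot -> bursts active in that slot.
    active_at: Dict[int, Set[BurstId]] = defaultdict(set)
    for sid, bursts in streams.items():
        for idx, (start, end) in enumerate(bursts):
            for t in range(start, end + 1):
                active_at[t].add((sid, idx))
    if not active_at:
        return {}

    overlaps: Dict[BurstId, Set[str]] = {}            # OVERLAPS: burst -> streams
    o: Dict[Tuple[str, str], int] = defaultdict(int)  # o[(i, j)] = o_i^j

    def record(a: BurstId, b: BurstId) -> None:
        # Count burst a only once per overlapping stream.
        if b[0] not in overlaps[a]:
            overlaps[a].add(b[0])
            o[(a[0], b[0])] += 1

    previous: Set[BurstId] = set()
    for t in range(min(active_at), max(active_at) + 1):
        current = active_at.get(t, set())
        new, old, ending = current - previous, current & previous, previous - current
        for burst in ending:
            overlaps.pop(burst, None)
        for burst in new:
            overlaps[burst] = set()
        for a in new:                     # NEW x OLD  ∪  NEW x NEW
            for b in old | new:
                if a[0] != b[0]:
                    record(a, b)
                    record(b, a)
        previous = current

    result: Dict[Tuple[str, str], float] = {}
    for (i, j), o_ij in list(o.items()):
        if i < j:                         # report each unordered pair once
            m = min(o_ij, o.get((j, i), 0))
            result[(i, j)] = m / (len(streams[i]) + len(streams[j]) - m)
    return result


# Hypothetical streams with e_i = 4, e_j = 3, o_i^j = 3, o_j^i = 2.
print(ls_mine({"i": [(0, 2), (4, 5), (10, 12), (30, 31)],
               "j": [(1, 6), (11, 13), (20, 21)]}))  # {('i', 'j'): 0.4}
```

Because counts are only created when two bursts are active in the same slot, pairs of streams that never overlap are never touched, which matches the "at least one overlap" property stated for the LS index.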