Mining Correlations on Massive Bursty Time Series Collection (DASFAA2015)
Mining Correlations on Massive Bursty Time Series Collections
Tomasz Kuśmierczyk and Kjetil Nørvåg
Problem statement
● bursty streams
● streams of bursts (extracted with one of many available burst detection methods)
● correlated bursts
● goal: identify correlated bursty streams
Problem: Massive Collections
● identify pairs: correlation ≥ threshold
● N ~ millions of streams
● naive (all pairs) solution complexity ~ $N^2$
● pruning
● indexing
Motivation
● any source of a large number of streams:
  ○ social media
  ○ web page view counts
  ○ traffic monitoring sensors
  ○ smart grid (electricity consumption meters)
  ○ and more
Correlated bursts
● different lengths
● different heights
● slight shifts
but
● should overlap
Correlated bursty streams
For two streams i and j:
● $e_i$, $e_j$: number of bursts per stream
● $o_i^j$: number of bursts from stream i overlapping with j
● $o_j^i$: number of bursts from stream j overlapping with i
Example (bursts of $E_i$ and $E_j$ drawn on a common time axis):
● $e_i = 4$, $o_i^j = 3$
● $e_j = 3$, $o_j^i = 2$
● $\min(o_i^j, o_j^i) = \min(3, 2) = 2$
● $J(E_i, E_j) = 2 / (4 + 3 - 2) = 2/5$
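The worked example above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: bursts are modelled as closed `(start, end)` intervals, the function names are hypothetical, and the general form $J = m / (e_i + e_j - m)$ with $m = \min(o_i^j, o_j^i)$ is inferred from the example rather than quoted from the slides.

```python
def intervals_overlap(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def overlap_count(src, dst):
    """Number of bursts in src that overlap at least one burst in dst."""
    return sum(1 for a in src if any(intervals_overlap(a, b) for b in dst))


def burst_jaccard(e_i, e_j):
    """Jaccard-like correlation of two burst lists (assumed form, see lead-in).

    Assumes each stream has at least one burst.
    """
    m = min(overlap_count(e_i, e_j), overlap_count(e_j, e_i))
    return m / (len(e_i) + len(e_j) - m)


# Made-up intervals chosen to match the counts of the slide example:
# e_i = 4, o_i^j = 3, e_j = 3, o_j^i = 2.
E_i = [(0, 2), (5, 7), (10, 12), (20, 22)]
E_j = [(1, 6), (11, 13), (30, 31)]
print(burst_jaccard(E_i, E_j))
```

The quadratic overlap test is deliberately naive; it only serves to make the definitions concrete.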
Enumerating pairs
Order streams according to number of bursts:
❏ FOREACH base count b
    ❏ FOREACH b’ IN connected counts of b
        ❏ compare streams with b and b’ bursts
Pruning
● for each base count b we need to consider only connected counts b’ such that:
$J_T \cdot b \le b' \le b$
where $J_T$ is the threshold, b is a particular base count, and b’ ranges over the possible connected counts
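A rough sketch of the pruned enumeration, assuming streams are given as a mapping from a stream id to its burst count. The names `connected_counts` and `enumerate_count_pairs` are hypothetical. The bound is consistent with a Jaccard-like measure: if the shared overlap count cannot exceed b’, the correlation cannot exceed b’/b, so counts below $J_T \cdot b$ cannot reach the threshold.

```python
import math
from collections import defaultdict


def connected_counts(b, jt):
    """Counts b' with jt * b <= b' <= b (small epsilon guards float rounding)."""
    lowest = max(1, math.ceil(jt * b - 1e-9))
    return range(lowest, b + 1)


def enumerate_count_pairs(burst_counts, jt):
    """Yield stream pairs whose burst counts are not pruned by the bound.

    burst_counts: dict stream_id -> number of bursts (assumed input format).
    """
    by_count = defaultdict(list)
    for stream, count in burst_counts.items():
        by_count[count].append(stream)

    for b in sorted(by_count):                 # FOREACH base count b
        for b2 in connected_counts(b, jt):     # FOREACH connected count b'
            if b2 not in by_count:
                continue
            for x in by_count[b]:
                for y in by_count[b2]:
                    # Within the same group, emit each unordered pair once.
                    if b2 < b or x < y:
                        yield (x, y)


counts = {"a": 4, "b": 4, "c": 3, "d": 2, "e": 10}
print(list(enumerate_count_pairs(counts, 0.75)))
```

Each yielded pair would still have to be compared burst by burst; the sketch only shows which pairs survive the count-based pruning.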
Interval Boxes (IB) index
● k-subset of bursts = k-dim box
● k-dimensional R-trees
● k-dim boxes overlapping = at least k bursts overlap
● for an indexed stream and a query stream: $\min(o_i^j, o_j^i) \ge k$
For example (k=2): the representation of stream $E_i$ as 2-dimensional boxes
[figure: 2-dimensional boxes of $E_i$; both axes labelled with burst numbers 1 to 4]
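The box construction can be illustrated with a small sketch. It assumes bursts are closed `(start, end)` intervals and that each k-subset is taken in time order, so that dimension d of a box is the d-th burst of the subset. A brute-force scan stands in for the k-dimensional R-tree here, and all function names are hypothetical.

```python
from itertools import combinations


def k_boxes(bursts, k):
    """All k-subsets of a stream's bursts, each read as one k-dimensional box."""
    return list(combinations(sorted(bursts), k))


def boxes_overlap(box_a, box_b):
    """Two boxes overlap iff their intervals overlap in every dimension."""
    return all(a[0] <= b[1] and b[0] <= a[1] for a, b in zip(box_a, box_b))


def is_candidate(indexed, query, k):
    """True if some box of the indexed stream overlaps some box of the query.

    Brute force over all box pairs; an R-tree would answer the same question
    without scanning everything.
    """
    return any(
        boxes_overlap(x, y)
        for x in k_boxes(indexed, k)
        for y in k_boxes(query, k)
    )


E_i = [(0, 2), (5, 7), (10, 12), (20, 22)]
E_j = [(1, 3), (6, 8), (30, 31)]
print(len(k_boxes(E_i, 2)))       # number of 2-dimensional boxes for E_i
print(is_candidate(E_i, E_j, 2))  # two pairwise-overlapping bursts exist
print(is_candidate(E_i, E_j, 3))  # no three pairwise-overlapping bursts
```

If two boxes overlap, each of their k dimensions contributes one distinct overlapping burst on both sides, which is why an overlap implies that both overlap counts are at least k.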
Interval Boxes (IB) index: mining
● mining:
  ○ for each base count b maintain an IB (R-trees) index
  ○ query it with streams having connected counts b’
[figure: one index per base count b=1, b=2, b=3, b=4; queries yield candidate pairs of streams with $\min(o_i^j, o_j^i) \ge k$, leading to the correlated output pairs]
IB index: what dimensionality k?
● small k (IB Low Dimensional = IBLD)
  ○ small indexes
  ○ large number of candidate pairs
● high k (IB High Dimensional = IBHD)
  ○ large indexes
  ○ small number of candidate pairs
  ○ $k_{max} = J_T \cdot b$ (correlation ≥ threshold guaranteed)
IBHD index in practice
● to speed up mining, some k-subsets are skipped
● some pairs may be missed in multiple-overlap cases
● efficiency-effectiveness tradeoff
List-based (LS) index: bins
separate bin for each (not pruned) pair b, b’, for example:
● b=1: b’=2, 3, 4, 5
● b=2: b’=3, 4, 5
● b=3: b’=4, 5, 6
● b=4: b’=5, 6, 7
List-based (LS) index: single bin
[figure: a single bin laid out over the time axis; labels: time, time granularity]
LS index: mining algorithm
● returns $o_i^j$ and $o_j^i$
● only for pairs $E_i$, $E_j$ that have at least one overlap
● immediate validation of the pair correlation J
LS index: mining algorithm
● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)
  ○ maintain map OVERLAPS = burst → set of overlapping streams
  ○ update counts $o_i^j$ and $o_j^i$
[figure: time axis with the current time moment (set of pointers), the bursts active in the current moment, the bursts active in the previous moment, and the NEW, OLD and ENDING sets]
Hybrid index
● LS index works well when:
  ○ low number of overlaps
  ○ high number of bursts per stream
● IBHD index works well when:
  ○ low number of bursts per stream
  ○ high number of overlaps
● Solution: Hybrid index: IBHD index for low and LS for high base counts
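The hybrid rule amounts to a dispatch on the base count. The sketch below assumes a single cutoff separating "low" from "high" base counts; the `cutoff` parameter, its default value, and the function names are hypothetical and are not taken from the slides.

```python
def choose_index(base_count, cutoff=10):
    """Pick the index type for one base count (cutoff is a made-up default)."""
    return "IBHD" if base_count <= cutoff else "LS"


def plan_hybrid(base_counts, cutoff=10):
    """Map every base count to the index type that would serve it."""
    return {b: choose_index(b, cutoff) for b in sorted(base_counts)}


print(plan_hybrid([5, 12, 8], cutoff=8))
```

In practice the cutoff would presumably be tuned on the data, since it trades index size against the number of candidate pairs.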
Experimental evaluation
● Wikipedia page views from the years 2011-2013
● Kleinberg’s burst extraction
● streams having at least 5 bursts
● 2.1M streams and 43M bursts in total
● 10 bursts per stream on average
● mean burst length 28 h
Mining & building
Threshold: $J_T = 0.95$
Hybrid mining
Number of streams: N = 2.1M
Number of generated pairs
Threshold: $J_T = 0.95$ (<10% pairs missing)
● How_I_Met_Your_Mother_(season_7) and Two_and_a_Half_Men_(season_9)
● Process_(computing) and Central_processing_unit
● Endoplasmic_reticulum and Ribosome
● Greatest_Hits,_Vol._2_(Ronnie_Milsap_album) and Greatest_Hits,_Vol._3_(Ronnie_Milsap_album)
● DigiTech_JamMan and Lexicon_JamMan
● Humanistic_psychology and Positive_psychology
Computational limits for Naive/LS index
What’s more in the paper?
● formal definitions and proofs
● considerations of combinatorial aspects
● multiple overlap cases
● on-line maintenance of indexes
LS index: mining
● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)
  ○ new overlapping bursts: NEW × OLD ∪ NEW × NEW
  ○ remove ENDING and add new overlapping bursts to the map OVERLAPS = burst → set of overlapping streams
  ○ update counts $o_i^j$ and $o_j^i$ for new overlapping bursts, with the help of OVERLAPS
● For each i and j in o: calculate $\min(o_i^j, o_j^i)$ and J
[figure: time axis with the current time moment (set of pointers), the bursts active in the current moment, the bursts active in the previous moment, and the NEW, OLD and ENDING sets]
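The sweep above could be sketched roughly as follows. Several points are assumptions rather than details from the slides: time is discretized into integer moments, bursts are given as `(stream_id, start, end)` triples with inclusive ends, NEW is taken to mean "active now but not in the previous moment", OLD "active in both", and ENDING "active previously but not now", and J uses the form $m / (e_i + e_j - m)$ with $m = \min(o_i^j, o_j^i)$. The function name `ls_mine` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def ls_mine(bursts, jt):
    """Return {(i, j): J} for stream pairs (i < j) with J >= jt."""
    e = defaultdict(int)           # e[i]: number of bursts of stream i
    active_at = defaultdict(set)   # time moment -> ids of bursts active then
    for burst_id, (stream, start, end) in enumerate(bursts):
        e[stream] += 1
        for t in range(start, end + 1):
            active_at[t].add(burst_id)

    o = defaultdict(int)   # o[(i, j)]: bursts of stream i overlapping stream j
    overlaps = {}          # OVERLAPS: burst id -> streams it already overlaps
    previous = set()
    for t in sorted(active_at):
        current = active_at[t]
        new, old, ending = current - previous, current & previous, previous - current
        for burst_id in ending:                 # finished bursts leave the map
            overlaps.pop(burst_id, None)
        # New overlapping bursts: NEW x OLD united with NEW x NEW.
        pairs = set(product(new, old)) | set(combinations(sorted(new), 2))
        for a, b in pairs:
            i, j = bursts[a][0], bursts[b][0]
            if i == j:                          # assumes no self-overlap matters
                continue
            if j not in overlaps.setdefault(a, set()):
                overlaps[a].add(j)              # burst a counts once for stream j
                o[(i, j)] += 1
            if i not in overlaps.setdefault(b, set()):
                overlaps[b].add(i)              # burst b counts once for stream i
                o[(j, i)] += 1
        previous = current

    result = {}
    for (i, j) in list(o):
        if i < j:
            m = min(o[(i, j)], o.get((j, i), 0))
            jaccard = m / (e[i] + e[j] - m)
            if jaccard >= jt:
                result[(i, j)] = jaccard
    return result


# Made-up bursts matching the counts of the earlier example (J = 2/5).
demo = [("i", 0, 2), ("i", 5, 7), ("i", 10, 12), ("i", 20, 22),
        ("j", 1, 6), ("j", 11, 13), ("j", 30, 31)]
print(ls_mine(demo, 0.3))
```

Only pairs with at least one overlap ever appear in `o`, so streams that never co-occur in a time moment are never compared.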