Finding Time Series Motifs on Disk-Resident Data
Abdullah Mueen, Dr. Eamonn Keogh
UC Riverside
Nima Bigdely-Shamlo
Swartz Center for Computational Neuroscience, UCSD
Outline
• Motivation
– Time Series Motif
• DAME: Disk-Aware Motif Enumeration
• Performance Evaluation
– Speedup and Efficiency
• Case Studies
– Motifs in Brain-Computer Interfaces
– Motifs in Image Database
• Conclusion
Sequence Motif
• A repeated pattern in a sequence.
• A pattern can be approximately similar.
– Mismatch is allowed.
• A pattern can be overlapping.
GACATAATAACCAGCTATCTGCTCGCATCGCCGCGACATAGCT
[Figure: examples of a Structural Motif, a Motion Motif, and a Time Series Motif]
Time Series Motif
• A repeated pattern in a time series.
• Exact motif.
– The most similar pair under Euclidean distance.
• Non-overlapping.
• Euclidean distance (between normalized segments)
– Beats most similarity measures on large datasets.
– Early abandoning.
– Triangle inequality: d(P,Q) ≥ |d(P,R) - d(Q,R)|
[Figure: example time series plot]
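The three ingredients on this slide (normalized segments, early abandoning, and the triangle inequality) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function names are hypothetical.

```python
import math

def z_normalize(segment):
    """Rescale a segment to zero mean and unit standard deviation."""
    n = len(segment)
    mean = sum(segment) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in segment) / n)
    if std == 0:
        return [0.0] * n  # constant segment: nothing to rescale
    return [(v - mean) / std for v in segment]

def early_abandon_distance(p, q, best_so_far=math.inf):
    """Euclidean distance, or inf as soon as it must exceed best_so_far."""
    limit = best_so_far ** 2
    total = 0.0
    for a, b in zip(p, q):
        total += (a - b) ** 2
        if total > limit:
            return math.inf  # abandon: this pair cannot beat best_so_far
    return math.sqrt(total)

def triangle_lower_bound(d_pr, d_qr):
    """Lower bound on d(P,Q) from distances to a shared reference R."""
    return abs(d_pr - d_qr)

p = z_normalize([1, 2, 3, 4])
q = z_normalize([1, 3, 2, 4])
r = z_normalize([4, 3, 2, 1])
bound = triangle_lower_bound(early_abandon_distance(p, r),
                             early_abandon_distance(q, r))
print(bound <= early_abandon_distance(p, q))  # the bound never overestimates
```

If the lower bound already exceeds the best distance found so far, the pair can be skipped without computing d(P,Q) at all.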
Motif Discovery in Disk-Resident Datasets
• Large datasets
– Light curves of stars.
– Performance counters of data centers.
• Pseudo time series dataset
– “80 million Tiny Images”
• Database of normalized subsequences
– An hour-long trace of EEG generates over one million normalized subsequences.
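To see why a single trace turns into a large database: a series of n samples has n - m + 1 subsequences of length m, and each one is normalized separately. A minimal sketch follows; the function name and the window length in the example are illustrative, not taken from the slides.

```python
import math

def normalized_subsequences(series, m):
    """Yield every length-m window of the series, z-normalized on its own."""
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        mean = sum(window) / m
        std = math.sqrt(sum((v - mean) ** 2 for v in window) / m)
        # A constant window has no shape, so it maps to all zeros.
        yield [(v - mean) / std if std > 0 else 0.0 for v in window]

subsequences = list(normalized_subsequences([1, 2, 4, 7, 11], 3))
print(len(subsequences))  # 5 - 3 + 1 windows
```

Storing every window explicitly takes roughly m times the space of the raw trace, which is one reason the data ends up on disk.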
[Figure: Geometric View and Disk View. The Geometric View shows a set of 2D points numbered 1 to 24; the Disk View shows the points stored in blocks of three.]
[Figure: Geometric View, Projected View, and Disk View of the points, with blocks.]
• Linear representation in sorted order.
• 0 is the reference point.
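The projection step can be sketched as follows: compute each point's distance to the reference point, sort by that distance, and cut the sorted list into fixed-size blocks. This is an illustration under simplifying assumptions (everything fits in memory, points are 2D tuples); the function name is hypothetical and the on-disk sort DAME actually uses is not shown on the slides.

```python
import math

def order_by_reference(points, reference, block_size):
    """Sort points by distance to the reference and split them into blocks.

    Each block is a list of (reference_distance, point) pairs.
    """
    keyed = sorted((math.dist(p, reference), p) for p in points)
    return [keyed[i:i + block_size] for i in range(0, len(keyed), block_size)]

blocks = order_by_reference([(3, 4), (1, 0), (0, 2), (6, 8)], (0, 0), 2)
for block in blocks:
    print(block)
```

After this step every block covers a contiguous range of reference distances, which is what makes whole blocks comparable at a glance.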
[Figure: Geometric View, Projected View, and Disk View, with the labels Best 1 and Best 2.]
• Divide the point set into two partitions and solve each subproblem.
[Figure: Geometric View, Projected View, and Disk View, with the Blocks of Interest marked.]
• The inner ring is the region for blocks 5 and 6.
• The outer ring is the region for blocks 3 and 4.
[Figure: Projected View of the loaded blocks, with the best-so-far distance (bsf) marked.]
• Block 3 and block 6 do not overlap, so no comparison is made.
• Block pairs considered: (3,5), (3,6), (4,5), (4,6). One pair needs no comparison; the others need 1, 9, and 1 comparisons.
• 11 comparisons are made instead of 9*16 = 144.
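The saving in comparisons comes from the triangle inequality: two points whose reference distances differ by at least bsf cannot be closer than bsf, so they are never compared. A minimal sketch of comparing one block pair, assuming each block is a list of (reference_distance, point) pairs sorted by reference distance; the names and the exact overlap test are assumptions, not the authors' implementation.

```python
import math

def compare_blocks(block_a, block_b, bsf):
    """Return (bsf, best_pair, comparisons) after comparing two blocks."""
    lo_a, hi_a = block_a[0][0], block_a[-1][0]
    lo_b, hi_b = block_b[0][0], block_b[-1][0]
    # Blocks whose reference-distance ranges are at least bsf apart
    # do not overlap: skip the whole pair.
    if lo_b - hi_a >= bsf or lo_a - hi_b >= bsf:
        return bsf, None, 0
    best_pair, comparisons = None, 0
    for ref_a, a in block_a:
        for ref_b, b in block_b:
            if abs(ref_a - ref_b) >= bsf:
                continue  # lower bound already rules this pair out
            comparisons += 1
            d = math.dist(a, b)
            if d < bsf:
                bsf, best_pair = d, (a, b)
    return bsf, best_pair, comparisons

near = [(1.0, (1, 0)), (2.0, (0, 2))]
mid = [(2.5, (2.5, 0)), (9.0, (0, 9))]
far = [(5.0, (3, 4)), (10.0, (6, 8))]
print(compare_blocks(near, far, 1.5))  # no overlap, nothing compared
print(compare_blocks(near, mid, 2.0))  # only some pairs are compared
```

As bsf shrinks during the search, more block pairs fail the overlap test and more point pairs are skipped inside the pairs that remain.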
Speedup
Algorithm | Largest Dataset Tested (thousands) | Time for the Largest Dataset | Estimated Time for 4.0 million | Memory | Disk
CompletelyInMemory | 100 | 35 minutes | 37.8 days | √ | X
CompletelyInDisk | 200 | 1.50 days | 1.65 years | X | √
DAME | 4,000 | 1.35 days | 1.35 days | √ | √
NoAdditionalStorage (normalization done in memory) | 200 | 4.82 days | 5.28 years | √ | X
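The slide does not say how the estimates for 4.0 million time series were obtained. They look consistent with quadratic scaling in the number of time series, which is what a pairwise search would suggest; the sketch below checks that reading and is an assumption, not the authors' stated method.

```python
def extrapolate_quadratic(measured_time, measured_size, target_size):
    """Scale a measured running time assuming cost grows as size squared."""
    return measured_time * (target_size / measured_size) ** 2

# Sizes are in thousands of time series, times in days.
in_disk_days = extrapolate_quadratic(1.50, 200, 4000)
no_storage_days = extrapolate_quadratic(4.82, 200, 4000)
print(in_disk_days / 365, no_storage_days / 365)  # compare with 1.65 and 5.28 years
```

The same rule applied to the in-memory row (35 minutes at 100 thousand) gives about 38.9 days rather than the 37.8 days listed, so the measured time may have been rounded.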
Performance Evaluation
[Figures: seconds in DAME_Motif (x 10^3) plotted against # of time series, # of blocks, and motif length; the legend lists Total, CPU, and I/O.]
Case Study 1: Brain-Computer Interfaces
[Figure: Target and Non-Target examples (Biosemi, Inc.)]
[Figure: Motif 1, showing Segment 1 and Segment 2; normalized IC activity versus time (ms).]
[Figure: IC 17, Motif 1; latency over epochs, before and after target presentation.]
[Figure: distance to Motif 1 over time (ms) for target trials and non-target trials; spatial filter (ICA).]
Case Study 2: Image Motifs
• A concatenated color histogram is treated as a pseudo time series.
• Each time series has length 256*3 = 768.
• 80 million tiny images at 32×32 resolution.
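A sketch of how an image could be turned into such a pseudo time series: count the pixel values of each color channel in 256 bins and concatenate the three histograms. The 256 bins per channel follow from the 256*3 length on the slide; the channel order and the use of raw counts (before any normalization) are assumptions, and the function name is hypothetical.

```python
def color_histogram_series(pixels):
    """Build a length-768 series from (r, g, b) pixels with values 0..255."""
    series = [0] * (256 * 3)
    for r, g, b in pixels:
        series[r] += 1          # red histogram: positions 0..255
        series[256 + g] += 1    # green histogram: positions 256..511
        series[512 + b] += 1    # blue histogram: positions 512..767
    return series

# A synthetic 32x32 image: 1000 black pixels and 24 colored ones.
image = [(0, 0, 0)] * 1000 + [(255, 128, 7)] * 24
series = color_histogram_series(image)
print(len(series), sum(series))
```

Two images with identical pixel values produce identical series, so exact duplicates show up as pairs at distance zero.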
[Figure: example pseudo time series built from a concatenated color histogram]
80 million tiny images: collected by Antonio Torralba, Rob Fergus, and William T. Freeman at MIT.
[Figure: example image pairs, labeled by image index]
• DAME processed the first 40 million time series in ~6.5 days.
• DAME found 3,836,902 images that have at least one duplicate.
– 1,719,443 unique images.
• 542,603 images have near duplicates with distance less than 0.1.
[Figure: a Duplicate Image example and a Near Duplicate Image example]
Conclusion
• DAME: the first exact motif discovery algorithm that finds motifs in disk-resident data.
• DAME scales to massive datasets on the order of millions of time series.
• DAME successfully finds motifs in EEG traces and image databases.
BACKUP
Example of Multidimensional Motif
Top view of the dance floor and the trajectories of the dancers.
Motion-Motif
Dance Motions are taken from the CMU Motion Capture Database
Example of Worst Case Scenario
[Figure: points numbered 1 to 18 and the reference point r.]
Multiple References for Ordering
[Figure: points x and y with two reference points r1 and r2, showing the lower bound and the rotational axis.]
[Figure: planar bounds and linear bounds compared with actual distances; labels: Larger Gap, Smaller Gap.]
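One plausible reading of the two bound types, sketched under stated assumptions: a linear bound uses one reference at a time (the triangle inequality), while a planar bound uses both references by rotating x and y around the r1-r2 axis into a common plane and measuring their distance there. The construction and names below are an illustration, not taken from the slides.

```python
import math

def linear_bound(x_to_refs, y_to_refs):
    """Best single-reference bound: max over references of |d(x,r) - d(y,r)|."""
    return max(abs(a - b) for a, b in zip(x_to_refs, y_to_refs))

def planar_bound(x_r1, x_r2, y_r1, y_r2, r1_r2):
    """Lower bound on d(x, y) from distances to two references r1 and r2."""
    def to_plane(a, b):
        # Position along the r1-r2 axis and height above it.
        u = (a * a - b * b + r1_r2 * r1_r2) / (2 * r1_r2)
        h = math.sqrt(max(a * a - u * u, 0.0))
        return u, h
    ux, hx = to_plane(x_r1, x_r2)
    uy, hy = to_plane(y_r1, y_r2)
    return math.hypot(ux - uy, hx - hy)

r1, r2 = (0, 0, 0), (4, 0, 0)
x, y = (1, 2, 0), (3, 0, 2)
dx = [math.dist(x, r1), math.dist(x, r2)]
dy = [math.dist(y, r1), math.dist(y, r2)]
print(linear_bound(dx, dy),
      planar_bound(dx[0], dx[1], dy[0], dy[1], math.dist(r1, r2)),
      math.dist(x, y))
```

In this construction the planar bound is never smaller than either linear bound and never larger than the actual distance, which matches the smaller gap shown for planar bounds.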