Post on 29-Jun-2015
Searching and MiningTrillions of Time Series Subsequencesunder Dynamic Time Warping
Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen,
Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh
Hoan Nguyen – Trung Minh Nguyen
2
Abstract
Optimizationsto search and mine
large databasesvery fast
3
Outline
Problem
Related work
Definitions
Method
Results
Conclusion
4
Problem
Similarity search is an important part of most time series data mining algorithm.
Dynamic Time Warping is the best measure to use but slow.
5
DefinitionsTime series
Time series T is an ordered list:
T = t1, t2, … ,tm
6
DefinitionsSubsequence
Subsequence Ti,k of time series T is a time series of length k start at position i:
T = t1, t2, … ,tm
7
DefinitionsDynamic Time Warping
8
Related workKnown optimizations
Squared distance
√❑
9
Related workKnown optimizations
Lower bounding
LB_KimFL LB_Keogh
10
Related workKnown optimizations
Early abandon
11
MethodEarly abandon Z-Normalization
Q
TT3
T2
T1
Z-N
orm
aliz
atio
n
Q’
T3’T2’
T1’
…
Long Time series
SubsequencesNormalized
Subsequences
QueryNormalized
Query
Normal approach
12
MethodEarly abandon Z-Normalization Novel approach
Early abandon with Z-normalization
1. Query is Z-normalized
2. Z-normalization of each subsequence will be calculated on the fly with the distance calculation.
3. If distance > best_so_far then early abandon both calculation
13
MethodRe-ordering Early Abandoning
Ordering is created based on the query.
14
MethodCascading Lower Bounds
Lower bounds are used in a cascade to prune candidates.
15
Results
Comparison between:
Naïve
- Z-normalization from start
- full ED(DTW) calculation
State-of-the-art (SOTA)
- Z-normalization from start
- early abandoning
- LB_Keogh bounding for DTW
UCRSuite
16
ResultsBaseline Tests on Random Walk
Million Billion Trillion0
5000
10000
15000
20000
25000
30000
UCR-ED
SOTA-ED
UCR-DTW
SOTA-DTWmin
ute
s
|𝑄|=128
17
ResultsBaseline Tests on Random Walk
Million Billion0
500
1000
1500
2000
2500
UCR-ED
SOTA-ED
UCR-DTW
SOTA-DTWseco
nd
s
|𝑄|=128
18
ResultsBaseline Tests on Random Walk
|𝑇|=2×106
19
ResultsEEG
Series10
100
200
300
400
500
600
3.4
494.3
UCR-ED
SOTA-ED
ho
urs
20
Conclusion
- The approach is very simple yet so effective.
- These optimizations can be applied to most measures but may not work for some, like: Hamming distance