Similar search with trillions of time series

Post on 29-Jun-2015

Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping

Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen,

Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh

Hoan Nguyen – Trung Minh Nguyen

Abstract

Optimizations to search and mine large databases very fast

Outline

Problem

Related work

Definitions

Method

Results

Conclusion

Problem

Similarity search is an important part of most time series data mining algorithms.

Dynamic Time Warping (DTW) is the best measure to use, but it is slow.

Definitions: Time series

Time series T is an ordered list:

T = t1, t2, …, tm

Definitions: Subsequence

A subsequence Ti,k of time series T is a time series of length k that starts at position i:

Ti,k = ti, ti+1, …, ti+k-1, with 1 ≤ i ≤ m-k+1

Definitions: Dynamic Time Warping
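
The slide gives no formula for DTW, so the following is only a minimal sketch of the usual dynamic-programming formulation. It assumes equal-length sequences and a Sakoe-Chiba band; the function name `dtw_squared` and the parameter `window` are illustrative choices, not names from the slides. It returns the squared distance, without the final square root.

```python
import math

def dtw_squared(q, c, window):
    """Squared DTW distance between equal-length sequences q and c.

    `window` is the half-width of the warping band: cell (i, j) is
    only visited when |i - j| <= window. window=0 gives squared ED.
    """
    n = len(q)
    if len(c) != n:
        raise ValueError("sequences must have equal length")
    # cost[i][j] = cheapest cumulative cost of aligning q[:i] with c[:j]
    cost = [[math.inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(n, i + window) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            # extend the cheapest of: diagonal match, or a warp in either series
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][n]
```

This version is quadratic in time and memory when the band is wide, which is the cost the optimizations in the rest of the talk try to avoid paying for most candidates.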

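Related work: Known optimizations

Squared distance

Both Euclidean distance (ED) and DTW end with a square root. Omitting it does not change the relative ranking of candidates, so the search can compare squared distances.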

Related work: Known optimizations

Lower bounding

LB_KimFL LB_Keogh
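
As a rough illustration of the two bounds named above, the sketch below computes LB_KimFL from the first and last points only, and LB_Keogh from an envelope built around the query. The helper names (`lb_kim_fl`, `envelope`, `lb_keogh`) and the `window` parameter are assumptions made for this example. Both return squared values and assume equal-length sequences of at least two points.

```python
def lb_kim_fl(q, c):
    """LB_KimFL: squared differences of the first and last points only.

    Constant time. Valid because DTW always aligns first with first
    and last with last (sequences of length >= 2 assumed).
    """
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2

def envelope(q, window):
    """Lower and upper envelope of q within +/- window positions."""
    n = len(q)
    lower = [min(q[max(0, i - window): i + window + 1]) for i in range(n)]
    upper = [max(q[max(0, i - window): i + window + 1]) for i in range(n)]
    return lower, upper

def lb_keogh(lower, upper, c):
    """LB_Keogh: squared distance from c to the envelope of the query."""
    total = 0.0
    for lo, up, x in zip(lower, upper, c):
        if x > up:
            total += (x - up) ** 2
        elif x < lo:
            total += (lo - x) ** 2
        # points inside the envelope contribute nothing
    return total
```

The envelope depends only on the query, so it can be built once and reused for every candidate subsequence.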

Related work: Known optimizations

Early abandon
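
A minimal sketch of early abandoning for squared Euclidean distance: the running sum is checked against the best distance found so far, and the loop stops as soon as the candidate can no longer win. The function name `early_abandon_ed` and the use of `math.inf` as the "abandoned" signal are assumptions of this example.

```python
import math

def early_abandon_ed(q, c, best_so_far):
    """Squared ED between q and c, or math.inf if abandoned early."""
    total = 0.0
    for a, b in zip(q, c):
        total += (a - b) ** 2
        if total > best_so_far:
            # the partial sum already exceeds the best match: stop here
            return math.inf
    return total
```

For example, `early_abandon_ed([0, 0, 0], [1, 1, 1], 1.5)` stops after the second point, because the partial sum 2 already exceeds 1.5.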

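Method: Early abandon Z-Normalization

Normal approach

[Figure: the long time series T is cut into subsequences T1, T2, T3. Each subsequence and the query Q are Z-normalized up front, giving the normalized subsequences T1', T2', T3' and the normalized query Q', before any distance is computed.]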

Method: Early abandon Z-Normalization, novel approach

Early abandon with Z-normalization

1. Query is Z-normalized

2. Z-normalization of each subsequence will be calculated on the fly with the distance calculation.

3. If distance > best_so_far, then early abandon both calculations.
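
The three steps above can be sketched as follows for Euclidean distance. This is an illustrative sketch, not the authors' code: it assumes running sums are used to get each subsequence's mean and standard deviation in constant time, that the query is not constant, and that constant subsequences are simply skipped. The names `znorm` and `nearest_subsequence` are made up for the example.

```python
import math

def znorm(x):
    """Z-normalize a sequence (population standard deviation)."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum(v * v for v in x) / len(x) - mean * mean)
    return [(v - mean) / std for v in x]

def nearest_subsequence(t, q):
    """Return (position, squared ED) of the best Z-normalized match of q in t."""
    k = len(q)
    qn = znorm(q)                       # step 1: normalize the query once
    best, best_pos = math.inf, -1
    s = sum(t[:k])                      # running sum of the current window
    s2 = sum(v * v for v in t[:k])      # running sum of squares
    for i in range(len(t) - k + 1):
        if i > 0:                       # slide the window by one point
            s += t[i + k - 1] - t[i - 1]
            s2 += t[i + k - 1] ** 2 - t[i - 1] ** 2
        mean = s / k
        var = s2 / k - mean * mean
        if var <= 1e-12:
            continue                    # assumption: skip constant windows
        std = math.sqrt(var)
        dist, abandoned = 0.0, False
        for j in range(k):
            # step 2: normalize each point only when it is needed
            dist += ((t[i + j] - mean) / std - qn[j]) ** 2
            if dist > best:             # step 3: abandon both calculations
                abandoned = True
                break
        if not abandoned and dist < best:
            best, best_pos = dist, i
    return best_pos, best
```

When a candidate is abandoned after a few points, the remaining points of that subsequence are never normalized at all, which is where the saving over the normal approach comes from.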

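Method: Re-ordering Early Abandoning

The order in which points are compared is derived from the query: indices are sorted by decreasing absolute value of the Z-normalized query, so the terms likely to contribute most to the distance are summed first.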
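
A small sketch of a query-driven ordering, assuming the order is obtained by sorting indices by the absolute value of the Z-normalized query, largest first. The names `query_order` and `reordered_early_abandon_ed` are illustrative; the second function also returns how many points were examined so the effect of the ordering can be observed.

```python
import math

def query_order(qn):
    """Indices of the normalized query, largest absolute value first."""
    return sorted(range(len(qn)), key=lambda i: abs(qn[i]), reverse=True)

def reordered_early_abandon_ed(qn, cn, order, best_so_far):
    """Squared ED summed in the given index order, with early abandoning.

    Returns (distance, points examined); distance is math.inf if abandoned.
    """
    total = 0.0
    for steps, i in enumerate(order, start=1):
        total += (qn[i] - cn[i]) ** 2
        if total > best_so_far:
            return math.inf, steps
    return total, len(order)
```

The order is computed once per query and reused for every candidate. Passing `range(len(qn))` as `order` gives the plain left-to-right scan for comparison.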

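Method: Cascading Lower Bounds

Lower bounds are applied in a cascade, from the cheapest and loosest to the most expensive and tightest. A candidate is pruned as soon as any bound exceeds best_so_far, and the full DTW is computed only for the survivors.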
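
One way to picture the cascade is the sketch below, which tries LB_KimFL first, then LB_Keogh, and only then the full DTW. It is a simplified illustration under several assumptions: candidates are already Z-normalized and of equal length, all distances are squared, and the names `cascade_search`, `make_lb_keogh` and the band half-width `w` are invented for the example.

```python
import math

def lb_kim_fl(q, c):
    """Cheapest bound: first and last points only."""
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2

def make_lb_keogh(q, w):
    """Build the query envelope once; return a bound usable per candidate."""
    n = len(q)
    up = [max(q[max(0, i - w): i + w + 1]) for i in range(n)]
    lo = [min(q[max(0, i - w): i + w + 1]) for i in range(n)]
    def lb(_q, c):
        return sum((x - u) ** 2 if x > u else (l - x) ** 2 if x < l else 0.0
                   for l, u, x in zip(lo, up, c))
    return lb

def dtw_squared(q, c, w):
    """Squared DTW with a band of half-width w, keeping two rows in memory."""
    n = len(q)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (n + 1)
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cur[j] = (q[i - 1] - c[j - 1]) ** 2 + min(prev[j - 1], prev[j], cur[j - 1])
        prev = cur
    return prev[n]

def cascade_search(q, candidates, w):
    """Return (best index, best squared DTW, candidates pruned per stage)."""
    bounds = [lb_kim_fl, make_lb_keogh(q, w)]   # cheapest bound first
    pruned = [0] * len(bounds)
    best, best_idx = math.inf, -1
    for idx, c in enumerate(candidates):
        for stage, bound in enumerate(bounds):
            if bound(q, c) > best:
                pruned[stage] += 1              # cannot beat the best match
                break
        else:
            d = dtw_squared(q, c, w)            # survived every bound
            if d < best:
                best, best_idx = d, idx
    return best_idx, best, pruned
```

The per-stage counts make it easy to see how much work each bound saves; a candidate rejected by the constant-time bound never reaches the linear-time one, let alone the full DTW.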

Results

Comparison between:

Naïve

- Z-normalization from start

- full ED (DTW) calculation

State-of-the-art (SOTA)

- Z-normalization from start

- early abandoning

- LB_Keogh bounding for DTW

UCR Suite

Results: Baseline Tests on Random Walk

[Figure: search time in minutes (axis from 0 to 30,000) of UCR-ED, SOTA-ED, UCR-DTW and SOTA-DTW for data sizes Million, Billion and Trillion, with |Q| = 128.]

Results: Baseline Tests on Random Walk

[Figure: search time in seconds (axis from 0 to 2,500) of UCR-ED, SOTA-ED, UCR-DTW and SOTA-DTW for data sizes Million and Billion, with |Q| = 128.]

Results: Baseline Tests on Random Walk

|T| = 2 × 10^6

Results: EEG

[Figure: bar chart of search time in hours (axis from 0 to 600) on EEG data. UCR-ED: 3.4; SOTA-ED: 494.3.]

Conclusion

- The approach is simple yet very effective.

- These optimizations can be applied to most distance measures, but may not work for some, such as Hamming distance.