Finding Time Series Motifs on Disk-Resident Data

Abdullah Mueen, Dr. Eamonn Keogh (UC Riverside); Nima Bigdely-Shamlo (Swartz Center for Computational Neuroscience, UCSD)


Transcript of Finding Time Series Motifs on Disk-Resident Data

Page 1: Finding Time Series Motifs on Disk-Resident Data

Finding Time Series Motifs on Disk-Resident Data

Abdullah Mueen, Dr. Eamonn Keogh, UC Riverside

Nima Bigdely-Shamlo, Swartz Center for Computational Neuroscience, UCSD

Page 2: Finding Time Series Motifs on Disk-Resident Data

Outline

• Motivation
  – Time Series Motif
• DAME: Disk-Aware Motif Enumeration
• Performance Evaluation
  – Speedup and Efficiency
• Case Studies
  – Motifs in Brain-Computer Interfaces
  – Motifs in Image Database
• Conclusion

Page 3: Finding Time Series Motifs on Disk-Resident Data

Sequence Motif

• A repeated pattern in a sequence.
• Occurrences of a pattern can be approximately similar.
  – Mismatches are allowed.
• Occurrences of a pattern can overlap.

[Figure: an example DNA sequence, GACATAATAACCAGCTATCTGCTCGCATCGCCGCGACATAGCT, and a time series plot. Panel titles: Structural Motif, Motion Motif, Time Series Motif.]

Page 4: Finding Time Series Motifs on Disk-Resident Data

Time Series Motif

• A repeated pattern in a time series.
• Exact motif.
  – The most similar pair under Euclidean distance.
• Non-overlapping.
• Euclidean distance (between normalized segments)
  – Beats most similarity measures on large datasets.
  – Early abandoning.
  – Triangular inequality: d(P,Q) ≥ |d(P,R) - d(Q,R)|

[Figure: time series plot; horizontal axis 0 to 60, vertical axis -2 to 2.]
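The three ingredients listed above (normalized segments, early abandoning, and the triangular inequality) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the handling of constant segments are assumptions made for the example.

```python
import math

def z_normalize(x):
    """Rescale a segment to zero mean and unit standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        # Constant segment: returning zeros is an assumption of this sketch.
        return [0.0] * n
    return [(v - mean) / std for v in x]

def early_abandon_distance(p, q, best_so_far=math.inf):
    """Euclidean distance between p and q.

    Returns math.inf as soon as the running sum of squared differences
    exceeds best_so_far squared, since the pair can no longer win.
    """
    limit = best_so_far ** 2
    total = 0.0
    for a, b in zip(p, q):
        total += (a - b) ** 2
        if total > limit:
            return math.inf
    return math.sqrt(total)

def triangle_lower_bound(d_pr, d_qr):
    """Lower bound on d(P,Q) from distances to a shared reference R."""
    return abs(d_pr - d_qr)
```

If the lower bound for a pair is already at least the best distance found so far, the exact distance never needs to be computed.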

Page 5: Finding Time Series Motifs on Disk-Resident Data

Motif Discovery in Disk-Resident Datasets

• Large datasets
  – Light curves of stars.
  – Performance counters of data centers.
• Pseudo time series dataset
  – "80 million Tiny Images"
• Database of normalized subsequences
  – An hour-long trace of EEG generates over one million normalized subsequences.
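To see why a single long trace turns into a large database, note that a series of n samples contains n - m + 1 subsequences of length m. The sketch below, with hypothetical function names, counts and generates normalized subsequences with a sliding window; the slides do not state the sampling rate or the window length used for the EEG trace, so none is assumed here.

```python
import math

def count_subsequences(n, m):
    """Number of length-m sliding windows in a series of n samples."""
    return max(n - m + 1, 0)

def normalized_subsequences(series, m):
    """Yield every length-m window of series, z-normalized."""
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        mean = sum(window) / m
        std = math.sqrt(sum((v - mean) ** 2 for v in window) / m)
        if std == 0:
            # Constant window: emitting zeros is an assumption of this sketch.
            yield [0.0] * m
        else:
            yield [(v - mean) / std for v in window]
```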

Page 6: Finding Time Series Motifs on Disk-Resident Data

[Figure: Geometric View shows a set of 2D points numbered 1 to 24. Disk View shows the same points stored in blocks of three in numeric order (1-3, 4-6, ..., 22-24).]

Page 7: Finding Time Series Motifs on Disk-Resident Data

Linear representation in sorted order; 0 is the reference point.

[Figure: Geometric View, Projected View and Disk View (Blocks). The points are shown in 2D around the reference point 0, projected onto a line in sorted order, and stored in blocks in that order.]
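The sorted linear representation can be sketched as below, assuming (as the Projected View suggests) that points are ordered by their distance to the reference point and then cut into fixed-size blocks. The function name and the in-memory lists are illustrative only; on disk each block would be a contiguous run of records.

```python
import math

def order_by_reference(points, reference, block_size):
    """Sort point indices by distance to the reference and split into blocks.

    Returns (dist, blocks): dist[i] is the distance of points[i] to the
    reference, and blocks is a list of index lists in sorted order.
    """
    dist = [math.dist(p, reference) for p in points]
    order = sorted(range(len(points)), key=dist.__getitem__)
    blocks = [order[i:i + block_size] for i in range(0, len(order), block_size)]
    return dist, blocks
```

Because of the triangular inequality, two points whose reference distances differ by at least the best distance found so far cannot form a better pair, so blocks that are far apart in this order need not be compared.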

Page 8: Finding Time Series Motifs on Disk-Resident Data

Divide the point-set into two partitions and solve the subproblems.

[Figure: Geometric View, Projected View and Disk View (Blocks). The best pairs found in the two partitions are labeled Best 1 and Best 2.]

Page 9: Finding Time Series Motifs on Disk-Resident Data

Blocks of Interest

The inner ring is the region for blocks 5 and 6. The outer ring is the region for blocks 3 and 4.

[Figure: Geometric View, Projected View and Disk View (Blocks), with the bsf distance marked.]

Page 10: Finding Time Series Motifs on Disk-Resident Data

Loaded Blocks

Block 3 and block 6 do not overlap. No comparison.

Block-Pair (3,5): 1 comparison
Block-Pair (3,6): no comparison
Block-Pair (4,5): 9 comparisons
Block-Pair (4,6): 1 comparison

11 comparisons are made instead of 9*16=144.

[Figure: one panel per block pair, each showing items numbered 1 to 8, with the loaded blocks and the bsf distance marked.]
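The pruning illustrated above can be sketched in memory as follows. This is a simplified closest-pair search over blocks sorted by reference distance, not the full DAME procedure (it omits the recursive partitioning and all disk I/O). The function name and return values are assumptions; bsf stands for the best distance found so far.

```python
import math

def closest_pair_blocks(points, reference, block_size):
    """Exact closest pair using reference-distance pruning.

    Returns (bsf, best_pair, comparisons), where comparisons counts the
    exact distance computations that were actually performed.
    """
    ref = [math.dist(p, reference) for p in points]
    order = sorted(range(len(points)), key=ref.__getitem__)
    blocks = [order[i:i + block_size] for i in range(0, len(order), block_size)]
    bsf, best_pair, comparisons = math.inf, None, 0
    for a in range(len(blocks)):
        for b in range(a, len(blocks)):
            # Smallest possible gap in reference distance between the blocks.
            gap = ref[blocks[b][0]] - ref[blocks[a][-1]]
            if b > a and gap >= bsf:
                break  # this block and all later ones cannot hold a better pair
            for x, i in enumerate(blocks[a]):
                candidates = blocks[a][x + 1:] if a == b else blocks[b]
                for j in candidates:
                    if ref[j] - ref[i] >= bsf:
                        break  # candidates are sorted, so the rest are farther
                    comparisons += 1
                    d = math.dist(points[i], points[j])
                    if d < bsf:
                        bsf, best_pair = d, (i, j)
    return bsf, best_pair, comparisons
```

The result is exact because a pair is skipped only when its lower bound, the difference of the two reference distances, already reaches bsf.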

Page 11: Finding Time Series Motifs on Disk-Resident Data

Speedup

Algorithm                                           Largest Dataset Tested (thousands)   Time for the Largest Dataset   Estimated Time for 4.0 million   Memory   Disk
CompletelyInMemory                                  100                                  35 minutes                     37.8 days                        √        X
CompletelyInDisk                                    200                                  1.50 days                      1.65 years                       X        √
DAME                                                4,000                                1.35 days                      1.35 days                        √        √
NoAdditionalStorage (normalization done in memory)  200                                  4.82 days                      5.28 years                       √        X
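The slide does not say how the estimated times were obtained. As a rough check, assuming the cost of the baseline algorithms grows quadratically with the number of time series (an assumption of this sketch, not a statement from the slides), the measured times can be extrapolated to 4.0 million series:

```python
def extrapolate_quadratic(time_measured, n_measured, n_target):
    """Scale a measured running time assuming cost grows as n squared."""
    return time_measured * (n_target / n_measured) ** 2

# Measured on 200 thousand series, extrapolated to 4.0 million (times in days).
in_disk_years = extrapolate_quadratic(1.50, 200_000, 4_000_000) / 365
no_storage_years = extrapolate_quadratic(4.82, 200_000, 4_000_000) / 365
```

Under this assumption the two disk-based baselines come out at about 1.64 and 5.28 years, close to the 1.65 and 5.28 years reported on the slide.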

Page 12: Finding Time Series Motifs on Disk-Resident Data

Performance Evaluation

[Figure: seconds in DAME_Motif (x 10^3), with curves for Total, CPU and I/O, plotted against # of time series (10,000 to 50,000) and # of blocks (1,000, 500, 34, 25, 20).]

[Figure: seconds in DAME_Motif (x 10^3) plotted against motif length (0 to 1,200).]

Page 13: Finding Time Series Motifs on Disk-Resident Data

Case Study 1: Brain-Computer Interfaces

[Figure: image credited to Biosemi, Inc.; panels labeled Target and Non-Target.]

Page 14: Finding Time Series Motifs on Disk-Resident Data

Case Study 1: Brain-Computer Interfaces

[Figure: Motif 1, Segment 1 and Segment 2; normalized IC activity against time (ms), 0 to 600.]

[Figure: IC 17, Motif 1; epochs against latency, time (ms) from -1000 to 1000, before and after target presentation.]

[Figure: distance to Motif 1 for target trials and non-target trials; time (ms) from -1000 to 800.]

Spatial filter (ICA)

Page 15: Finding Time Series Motifs on Disk-Resident Data

Case Study 2: Image Motifs

• The concatenated color histogram of each image is treated as a pseudo time series.

• Each time series has length 256*3 = 768.

• 80 million tiny images at 32×32 resolution.
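A concatenated color histogram of this kind can be sketched as below. The channel order (red, then green, then blue), the pixel representation, and the function name are assumptions of the example; the slides only state that the three 256-bin histograms are concatenated into a series of length 768.

```python
def color_histogram_series(pixels):
    """Build a length-768 pseudo time series from 8-bit RGB pixels.

    pixels is an iterable of (r, g, b) tuples with values in 0..255.
    Bins 0-255 count red values, 256-511 green, 512-767 blue.
    """
    series = [0] * (256 * 3)
    for r, g, b in pixels:
        series[r] += 1
        series[256 + g] += 1
        series[512 + b] += 1
    return series
```

For a 32×32 image each of the three histograms sums to 1,024, and two pixel-identical images produce identical series, which is consistent with exact duplicates appearing as motifs.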

[Figure: example pseudo time series; horizontal axis 0 to 700.]

80 million tiny images: collected by Antonio Torralba, Rob Fergus and William T. Freeman at MIT.

Page 16: Finding Time Series Motifs on Disk-Resident Data

Case Study 2: Image Motifs

• DAME worked on the first 40 million time series in ~6.5 days.
• DAME found 3,836,902 images which have at least one duplicate.
  – 1,719,443 unique images.
• 542,603 images have near duplicates with distance less than 0.1.

[Figure: example pairs of images labeled with their image indices. Panel titles: Duplicate Image, Near Duplicate Image.]

Page 17: Finding Time Series Motifs on Disk-Resident Data

Conclusion

• DAME: the first exact motif discovery algorithm that finds motifs in disk-resident data.

• DAME scales to massive datasets on the order of millions of time series.

• DAME successfully finds motifs in EEG traces and image databases.

Page 18: Finding Time Series Motifs on Disk-Resident Data

BACKUP

Page 19: Finding Time Series Motifs on Disk-Resident Data

Example of Multidimensional Motif

Top view of the dance floor and the trajectories of the dancers.

Motion-Motif

Dance Motions are taken from the CMU Motion Capture Database

Page 20: Finding Time Series Motifs on Disk-Resident Data

Example of Worst Case Scenario

[Figure: points numbered 1 to 18 and a reference point r.]

Page 21: Finding Time Series Motifs on Disk-Resident Data

Multiple References for Ordering

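[Figure: two reference points r1 and r2 with points x and y, showing the lower bound and the rotational axis.]

[Figure: planar bounds and linear bounds plotted with the actual distances (axis ticks 0 to 40); annotations: Larger Gap, Smaller Gap.]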