Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos


Page 1: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Fast Subsequence Matching in Time-Series Databases.

C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Presented by George Liu / Luis L. Perez

Page 2: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Time series?

Definition

Applications
- Financial markets
- Weather forecasting
- Healthcare

Page 3: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

What kind of problem are we trying to solve?

Whole sequence matching: Given a database S with n sequences, all of them equally long, and a query sequence Q of the same length, find all sequences in S that match Q.

Subsequence matching: Given a database S with n sequences of potentially different lengths and a query sequence Q, find all sequences in S that contain a subsequence matching Q.

Page 4: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Useful notation

Given a sequence S:
- Len(S) denotes the length of the sequence
- S[i] denotes the ith element
- S[i:j] denotes the subsequence between S[i] and S[j]

Given two sequences S and Q, D(S,Q) denotes the distance between S and Q (Euclidean distance).

Distance bound e: the max. distance for two sequences to be considered “equal”.
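A minimal Python sketch of this notation. It assumes 1-based, inclusive indexing for S[i:j], which the slides do not state explicitly; `dist` and `subseq` are illustrative names introduced here.

```python
import math

def dist(s, q):
    """Euclidean distance D(S, Q) between two equal-length sequences."""
    if len(s) != len(q):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def subseq(s, i, j):
    """S[i:j] in the slide notation: S[i]..S[j], assumed 1-based and inclusive."""
    return s[i - 1:j]

S = [1.0, 2.0, 4.0, 7.0, 11.0]
Q = [2.0, 4.0, 8.0]
e = 1.5                            # distance bound

window = subseq(S, 2, 4)           # the elements S[2], S[3], S[4]
is_match = dist(window, Q) <= e    # "equal" means within distance e
```

Here the window differs from Q only in its last element, so the two are treated as "equal" under the bound e.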

Page 5: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Naïve approaches

Sequential scanning: clearly infeasible.

R-tree: might work, but the dimensionality is extremely high (proportional to sequence length), so performance is poor.

What can we do to improve performance?

Page 6: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Dimensionality reduction

Redundant data, lots of patterns

Feature extraction

Data transformation
- Cosine
- Wavelet
- Fourier <-- we'll focus on this.

Page 7: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Discrete Fourier Transform

Maps a sequence x in the time domain to a sequence X in the frequency domain

Reversible!

Fast and easy-to-implement algorithms

Energy preservation property: key concept in dimensionality reduction. For many real-world series most of the energy sits in the first few coefficients, so just keep the first 2 or 3 coefficients.
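The properties on this slide can be checked with a small sketch. It uses a naive O(n^2) DFT written with the standard library only so that it has no dependencies (an FFT routine would be used in practice), and it assumes the orthonormal 1/sqrt(n) scaling, which is what makes the energy identical in both domains. The names `dft`, `idft` and `features` are illustrative.

```python
import cmath
import math

def dft(x):
    """Naive orthonormal DFT (scaled by 1/sqrt(n))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
            for k in range(n)]

def idft(X):
    """Inverse of dft(): maps frequency-domain values back to the time domain."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / math.sqrt(n)
            for t in range(n)]

def features(x, f=3):
    """Keep the first f coefficients, flattened into 2*f real numbers."""
    out = []
    for c in dft(x)[:f]:
        out.extend((c.real, c.imag))
    return out

x = [3.0, 4.0, 6.0, 5.0, 7.0, 9.0, 8.0, 10.0]
X = dft(x)

energy_time = sum(v * v for v in x)          # energy in the time domain
energy_freq = sum(abs(c) ** 2 for c in X)    # energy in the frequency domain
x_back = [c.real for c in idft(X)]           # reversibility
point = features(x, f=3)                     # the reduced feature point
```

Each coefficient is complex, so keeping f coefficients yields a point with 2*f real coordinates.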

Page 8: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Parseval's theorem

Let S and Q be the original sequences, and S' and Q' their DFTs.

D(S,Q) = D(S',Q')

Why is this important? If we keep only the first few coefficients of S' and Q', the feature-space distance can only underestimate the true distance. Remember the bound e.

D(S,Q) < e ---> D(S',Q') < e, so we will get no false dismissals.
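A small numeric check of both statements, as a sketch under the same assumption of an orthonormal DFT: with all coefficients the distance is unchanged, and with only the first f coefficients it can only shrink. `feature_dist` is an illustrative helper, not a name from the paper.

```python
import cmath
import math

def dft(x):
    """Naive orthonormal DFT (scaled by 1/sqrt(n))."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
            for k in range(n)]

def dist(a, b):
    """Euclidean distance; abs() makes it work for complex values too."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

def feature_dist(s, q, f):
    """Distance between S' and Q' using only the first f coefficients."""
    return dist(dft(s)[:f], dft(q)[:f])

S = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0]
Q = [1.5, 2.5, 2.0, 4.0, 5.0, 6.5, 5.0, 7.0]

true_d = dist(S, Q)                    # D(S, Q)
full_d = feature_dist(S, Q, len(S))    # D(S', Q') with every coefficient
short_d = feature_dist(S, Q, 2)        # truncated: a lower bound on D(S, Q)
```

Because the truncated distance never exceeds the true one, a range query with radius e in feature space cannot miss a true match; it can only return extra candidates (false alarms).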

Page 9: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Subsequence Matching

The problem:
- You are given a collection of N sequences of real numbers (S1, S2, ..., SN), potentially of different lengths.
- The user specifies a query subsequence Q of length Len(Q) and the tolerance e, the max. acceptable dissimilarity.
- You want to return all the sequences, along with the correct offsets k, that match the query within tolerance e.

Solutions: many!

Page 10: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Possible Solutions

1) Brute Force method - Sequentially scan every possible subsequence of the data sequences for a match.

2) I-Naive - Transform all subsequences to points in feature space and store those points into an R-tree.

3) ST-Index - Transform all subsequences to points in feature space. Store MBRs of sub-trails into an R*-tree.

Note: I-Naive and ST-Index are similar in the initial steps.
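For reference, the brute-force method (1) can be sketched in a few lines of Python. Offsets are 0-based here, and `sequential_scan` is an illustrative name; the index-based methods must return exactly the same answers, only faster.

```python
import math

def dist(s, q):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def sequential_scan(database, query, e):
    """Return every (sequence id, offset k) whose window is within e of the query."""
    m = len(query)
    hits = []
    for seq_id, s in enumerate(database):
        for k in range(len(s) - m + 1):          # every possible subsequence
            if dist(s[k:k + m], query) <= e:
                hits.append((seq_id, k))
    return hits

database = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [0.0, 0.0, 2.0, 3.0, 4.5, 9.0],
]
hits = sequential_scan(database, [2.0, 3.0, 4.0], e=0.6)
```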

Page 11: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Possible Solutions I-naive

*Assume that the min. query length is w. w changes according to the application (e.g., stock-market analysts interested in weekly/monthly patterns use a larger w).

Procedure:
1) Use a "sliding window" of size w to find every subsequence in a sequence.
2) Apply the DFT to each subsequence of size w to map it to a point in feature space.
3) A trail of Len(S)-w+1 points is produced.
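Steps 1-3 can be sketched as follows, assuming the feature point is made of the first f DFT coefficients (the naive standard-library DFT below stands in for an FFT). `features` and `trail` are illustrative names.

```python
import cmath
import math

def features(x, f=2):
    """First f orthonormal DFT coefficients of x, as 2*f real numbers."""
    n = len(x)
    out = []
    for k in range(f):
        c = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        out.extend((c.real, c.imag))
    return out

def trail(s, w, f=2):
    """Slide a window of size w over s; each window becomes one feature point."""
    return [features(s[k:k + w], f) for k in range(len(s) - w + 1)]

S = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0]
points = trail(S, w=4, f=2)
```

Consecutive windows share w-1 values, so consecutive points tend to lie close together, which is why the result is called a trail.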

Page 12: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Possible Solutions I-naive

Procedure cont:
4) Store all the points of the trails in feature space in a spatial access method (R*-tree).
5) When presented with a query of length w and tolerance e, extract the features of the query and perform a range query with radius e on the spatial access method.
6) Discard false alarms by retrieving all those subsequences and calculating their actual distance from the query.

Note: Very, very slow approach. Worse than sequential scan. You have a large R*-tree (tall and slow).
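Steps 4-6 can be sketched as below. This is only an illustration of the filter-and-refine logic: a plain Python list stands in for the R*-tree (an assumption made to keep the sketch dependency-free), and all function names are illustrative.

```python
import cmath
import math

def features(x, f=2):
    """First f orthonormal DFT coefficients of x, as 2*f real numbers."""
    n = len(x)
    out = []
    for k in range(f):
        c = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        out.extend((c.real, c.imag))
    return out

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def build_point_index(database, w, f=2):
    """Step 4: one feature point per window (a list replaces the R*-tree here)."""
    return [(features(s[k:k + w], f), sid, k)
            for sid, s in enumerate(database)
            for k in range(len(s) - w + 1)]

def query_points(index, database, q, e, f=2):
    """Step 5: range query in feature space. Step 6: discard false alarms."""
    fq, w = features(q, f), len(q)
    candidates = [(sid, k) for p, sid, k in index if dist(p, fq) <= e]
    hits = [(sid, k) for sid, k in candidates
            if dist(database[sid][k:k + w], q) <= e]
    return candidates, hits

database = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 5.0, 6.0, 7.0, 8.0, 1.0, 0.0],
]
index = build_point_index(database, w=4)
candidates, hits = query_points(index, database, [3.1, 4.0, 5.0, 3.9], e=0.5)
```

The index holds one point per window, i.e. almost one point per data value, which is the reason the slide calls this approach slow.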

Page 13: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Possible Solutions ST-Index

*Assume that the min. query length is w. w changes according to the application (e.g., stock-market analysts interested in weekly/monthly patterns use a larger w).

Procedure:
1) Use a "sliding window" of size w to find every subsequence in a sequence.
2) Apply the DFT to each subsequence of size w to map it to a point in feature space.
3) A trail of Len(S)-w+1 points is produced.

Page 14: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Possible Solutions ST-Index

Procedure cont.
4) Divide the trail of points in feature space into sub-trails (algorithm mentioned later).
5) Represent each sub-trail by its MBR (minimum bounding rectangle).
6) Store the MBRs in a spatial access method (e.g., R*-tree).
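Steps 4 and 5 can be sketched as follows. The trail is cut into fixed-size chunks purely as a placeholder for the division algorithm; `mbr` and `fixed_subtrails` are illustrative names, and the trail points are made up.

```python
def mbr(points):
    """Minimum bounding rectangle of a set of points: (lower corner, upper corner)."""
    lows = [min(coords) for coords in zip(*points)]
    highs = [max(coords) for coords in zip(*points)]
    return lows, highs

def fixed_subtrails(trail, size):
    """Placeholder division: consecutive chunks of at most `size` points."""
    return [trail[i:i + size] for i in range(0, len(trail), size)]

# A made-up 2-dimensional trail of five feature points.
trail = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0), (5.0, 5.0), (6.0, 4.0)]
boxes = [mbr(sub) for sub in fixed_subtrails(trail, size=3)]
```

Only the two rectangles would be stored in the spatial access method, instead of the five points, which is where the space saving of the ST-index comes from.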

Pages 15-19: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

MBRs in F-Dimension (figure slides)

Page 20: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Insertions

Problem: How do we divide these trails into sub-trails? Two heuristics:

1) Every sub-trail has a predetermined, fixed number of points. (I-fixed)

2) Every sub-trail has a predetermined, fixed length. (I-fixed)

Solution: Use an "adaptive heuristic." (I-adaptive)

Page 21: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

I-adaptive Algorithm

- Based on the idea of the marginal cost of a point in terms of disk accesses.

Marginal cost (mc) = disk accesses for a given MBR / number of points k in that MBR

Algorithm
Assign the first point of the trail to a sub-trail.
FOR each successive point
  IF it increases the marginal cost of the current sub-trail
  THEN start another sub-trail
  ELSE include it in the current sub-trail
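A minimal Python sketch of the greedy loop above. The slide does not give the disk-access estimate, so the cost model used here, the product over all dimensions of (MBR side length + 0.5) with coordinates assumed to be normalized to the unit hypercube, is an assumption made for illustration. All names are illustrative and the trail is made up.

```python
def disk_accesses(points):
    """ASSUMED cost model: product over dimensions of (MBR side length + 0.5)."""
    cost = 1.0
    for coords in zip(*points):
        cost *= (max(coords) - min(coords)) + 0.5
    return cost

def marginal_cost(points):
    """mc = estimated disk accesses of the MBR / number of points k in it."""
    return disk_accesses(points) / len(points)

def adaptive_subtrails(trail):
    """Greedy division of a non-empty trail into sub-trails."""
    subtrails = [[trail[0]]]                 # first point opens the first sub-trail
    for p in trail[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [p]) > marginal_cost(current):
            subtrails.append([p])            # p would raise the cost: start anew
        else:
            current.append(p)                # p is cheap to absorb: include it
    return subtrails

trail = [(0.0, 0.0), (0.02, 0.0), (0.04, 0.02), (0.9, 0.9), (0.92, 0.9)]
parts = adaptive_subtrails(trail)
```

In this toy trail the first three points are close together and share one sub-trail; the jump to (0.9, 0.9) would inflate the MBR, so a new sub-trail starts there.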

Page 22: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

I-adaptive Algorithm

Page 23: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Searching

Consider the window length w and the distance bound e.

Let Q be the query sequence. If Len(Q) = w, it's all good.

Algorithm Search_Short:
- Use the DFT to map Q to a point q in feature space. Make it a sphere with radius e.
- Retrieve all the sub-trails whose MBRs intersect the query region using our index.
- Throw away false alarms.
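Search_Short can be sketched as below. Several simplifications are assumed: a list of MBRs stands in for the R*-tree, sub-trails are fixed-size chunks rather than adaptive ones, and the sphere/rectangle intersection test is done with the minimum distance from q to the rectangle. All names are illustrative.

```python
import cmath
import math

def features(x, f=2):
    """First f orthonormal DFT coefficients of x, as 2*f real numbers."""
    n = len(x)
    out = []
    for k in range(f):
        c = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        out.extend((c.real, c.imag))
    return out

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def mindist(point, box):
    """Smallest distance from a point to a rectangle (0 if the point is inside)."""
    lows, highs = box
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(point, lows, highs)))

def build_st_index(database, w, f=2, size=4):
    """One (MBR, sequence id, first offset, point count) entry per sub-trail."""
    index = []
    for sid, s in enumerate(database):
        pts = [features(s[k:k + w], f) for k in range(len(s) - w + 1)]
        for start in range(0, len(pts), size):
            chunk = pts[start:start + size]
            box = ([min(c) for c in zip(*chunk)], [max(c) for c in zip(*chunk)])
            index.append((box, sid, start, len(chunk)))
    return index

def search_short(index, database, q, e, f=2):
    fq, w = features(q, f), len(q)
    hits = []
    for box, sid, start, n in index:
        if mindist(fq, box) <= e:                    # MBR intersects the sphere
            for k in range(start, start + n):        # throw away false alarms
                if dist(database[sid][k:k + w], q) <= e:
                    hits.append((sid, k))
    return hits

database = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 5.0, 6.0, 7.0, 8.0, 1.0, 0.0],
]
index = build_st_index(database, w=4)
hits = search_short(index, database, [3.1, 4.0, 5.0, 3.9], e=0.5)
```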

Page 24: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Searching

Now, what if Len(Q) > w? Requires more analysis, but basically we assume that

Len(Q) = p*w

So we can split Q into p subsequences of length w.

What about the radius? r = e/sqrt(p)
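The radius can be justified with a short argument. In the sketch below, q_i and s_i are names introduced here for the i-th length-w pieces of Q and of a matching subsequence S' of the same length, and \epsilon is the bound written as e on the slides.

```latex
\[
D^2(Q, S') \;=\; \sum_{i=1}^{p} D^2(q_i, s_i) \;\le\; \epsilon^2
\quad\Longrightarrow\quad
\min_{1 \le i \le p} D^2(q_i, s_i) \;\le\; \frac{\epsilon^2}{p}
\quad\Longrightarrow\quad
\min_{1 \le i \le p} D(q_i, s_i) \;\le\; \frac{\epsilon}{\sqrt{p}}
\]
```

The middle step holds because the smallest of p non-negative terms cannot exceed their average. So every true match has at least one piece within e/sqrt(p) of the corresponding piece of Q, and searching each piece with that smaller radius loses no matches.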

Page 25: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Searching

So we have... Algorithm Search_Long:
- Break the sequence Q into p sub-queries, each with radius e/sqrt(p).
- Retrieve from the index all the sub-trails whose MBRs intersect at least one of the sub-query regions.
- Examine the corresponding subsequences and discard false alarms.
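A sketch of Search_Long under the same simplifying assumptions as before (a list of MBRs instead of an R*-tree, fixed-size sub-trails, 0-based offsets, illustrative names). The one detail added here is the alignment: if sub-query i matches the window at offset k, the candidate match for the whole query starts at k - i*w.

```python
import cmath
import math

def features(x, f=2):
    """First f orthonormal DFT coefficients of x, as 2*f real numbers."""
    n = len(x)
    out = []
    for k in range(f):
        c = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        out.extend((c.real, c.imag))
    return out

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def mindist(point, box):
    lows, highs = box
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(point, lows, highs)))

def build_st_index(database, w, f=2, size=4):
    index = []
    for sid, s in enumerate(database):
        pts = [features(s[k:k + w], f) for k in range(len(s) - w + 1)]
        for start in range(0, len(pts), size):
            chunk = pts[start:start + size]
            box = ([min(c) for c in zip(*chunk)], [max(c) for c in zip(*chunk)])
            index.append((box, sid, start, len(chunk)))
    return index

def search_long(index, database, q, e, w, f=2):
    p = len(q) // w                       # the slides assume Len(Q) = p*w
    r = e / math.sqrt(p)                  # reduced radius per sub-query
    hits = set()
    for i in range(p):
        fq = features(q[i * w:(i + 1) * w], f)
        for box, sid, start, n in index:
            if mindist(fq, box) > r:
                continue                  # MBR misses this sub-query region
            s = database[sid]
            for k in range(start, start + n):
                begin = k - i * w         # where the whole query would start
                if begin < 0 or begin + len(q) > len(s):
                    continue
                if dist(s[begin:begin + len(q)], q) <= e:   # discard false alarms
                    hits.add((sid, begin))
    return sorted(hits)

database = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 5.0, 6.0, 7.0, 8.0, 1.0, 0.0],
]
index = build_st_index(database, w=4)
long_q = [3.0, 4.0, 5.0, 4.0, 3.0, 2.0, 1.0, 2.1]
hits = search_long(index, database, long_q, e=0.5, w=4)
```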

Page 26: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Experimental results

Page 27: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Experimental results

Stock price database with ~300,000 points

1 number = 4 bytes

DFT keeping the first 3 coefficients (actually 6 numbers, since each coefficient is complex)

w = 512 bytes

R*-tree

Page 28: Fast Subsequence Matching in Time-Series Databases. C. Faloutsos, M. Ranganathan, Y. Manolopoulos

Experimental results

Space
- Naïve methods: 24 MB
- This method: 5 KB

Time - “short” queries (Len(Q) = w): 3 to 100 times better response times

Time - “long” queries (Len(Q) > w): 10 to 100 times better response times