Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)
-
date post
19-Dec-2015 -
Category
Documents
-
view
222 -
download
0
Transcript of Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)
![Page 1: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/1.jpg)
Indexing Time Series
Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)
![Page 2: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/2.jpg)
Outline
Find Similar/interesting objects in a database Similarity Retrieval
Multimedia
![Page 3: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/3.jpg)
Time Series Databases
A time series is a sequence of real numbers, representing the measurements of a real variable at equal time intervals Stock price movements Volume of sales over time Daily temperature readings ECG data all NYSE stocks
A time series database is a large collection of time series
![Page 4: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/4.jpg)
Time Series Problems (from a databases perspective)
The Similarity Problem
X = x1, x2, …, xn and Y = y1, y2, …, yn
Define and compute Sim(X, Y) E.g. do stocks X and Y have similar
movements? Retrieve efficiently similar time series
(Similarity Queries)
![Page 5: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/5.jpg)
Types of queries
whole match vs sub-pattern match range query vs nearest neighbors all-pairs query
![Page 6: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/6.jpg)
Examples
Find companies with similar stock prices over a time interval
Find products with similar sell cycles Cluster users with similar credit card utilization Find similar subsequences in DNA sequences Find scenes in video streams
![Page 7: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/7.jpg)
day
$price
1 365
day
$price
1 365
day
$price
1 365
distance function: by expert
(eg, Euclidean distance)
![Page 8: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/8.jpg)
Problems
Define the similarity (or distance) function Find an efficient algorithm to retrieve similar
time series form a database (Faster than sequential scan)
The Similarity function depends on the Application
![Page 9: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/9.jpg)
Euclidean Similarity Measure
View each sequence as a point in n-dimensional Euclidean space (n = length of each sequence)
Define (dis-)similarity between sequences X and Y as
n
i
ppiip yxL
1
/1)||(
p=1 Manhattan distance
p=2 Euclidean distance
![Page 10: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/10.jpg)
Easy to compute: O(n) Allows scalable solutions to other problems,
such as indexing clustering etc...
Advantages
![Page 11: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/11.jpg)
Does not allow for different baselines Stock X fluctuates at $100, stock Y at $30
Does not allow for different scales Stock X fluctuates between $95 and $105,
stock Y between $20 and $40
Disadvantages
![Page 12: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/12.jpg)
[Goldin and Kanellakis, 1995] Normalize the mean and variance for each
sequence Let µ(X) and (X) be the mean and variance of
sequence X
Replace sequence X by sequence X’, where
x’i = (xi - µ (X) )/ (X)
Normalization
![Page 13: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/13.jpg)
Similarity definition still too rigid Does not allow for noise or short-term fluctuations Does not allow for phase shifts in time Does not allow for acceleration-deceleration along
the time dimension etc ….
But..
![Page 14: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/14.jpg)
Example
![Page 15: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/15.jpg)
A general similarity framework involving a transformation rules language
[Jagadish, Mendelzon, Milo]
Each rule has an associated cost
![Page 16: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/16.jpg)
Examples of Transformation Rules
Collapse adjacent segments into one segment
new slope = weighted average of previous slopes
new length = sum of previous lengths
[l1, s1]
[l2, s2] [l1+l2, (l1s1+l2s2)/(l1+l2)]
![Page 17: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/17.jpg)
Combinations of Moving Averages, Scales, and Shifts [Rafiei and Mendelzon, 1998]
Moving averages are a well-known technique for smoothening time sequences
Example of a 3-day moving average
x’i = (xi–1 + xi + xi+1)/3
![Page 18: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/18.jpg)
Disadvantages of Transformation Rules
Subsequent computations (such as the indexing problem) become more complicated We will see later why!
![Page 19: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/19.jpg)
Dynamic Time Warping[Berndt, Clifford, 1994]
Allows acceleration-deceleration of signals along the time dimension
Basic idea Consider X = x1, x2, …, xn , and Y = y1, y2, …, yn
We are allowed to extend each sequence by repeating elements
Euclidean distance now calculated between the extended sequences X’ and Y’
![Page 20: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/20.jpg)
X
Y
warping path
j = i – w
j = i + w
Dynamic Time Warping[Berndt, Clifford, 1994]
![Page 21: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/21.jpg)
Restrictions on Warping Paths
Monotonicity Path should not go down or to the left
Continuity No elements may be skipped in a sequence
Warping Window
| i – j | <= w
![Page 22: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/22.jpg)
Formulation
Let D(i, j) refer to the dynamic time warping distance between the subsequences
x1, x2, …, xi
y1, y2, …, yj
D(i, j) = | xi – yj | + min { D(i – 1, j), D(i – 1, j – 1), D(i, j – 1) }
![Page 23: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/23.jpg)
Solution by Dynamic Programming
Basic implementation = O(n2) where n is the length of the sequences will have to solve the problem for each (i, j) pair
If warping window is specified, then O(nw) Only solve for the (i, j) pairs where | i – j | <= w
![Page 24: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/24.jpg)
Longest Common Subsequence Measures
(Allowing for Gaps in Sequences)
Gap skipped
![Page 25: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/25.jpg)
Basic LCS Idea
X = 3, 2, 5, 7, 4, 8, 10, 7
Y = 2, 5, 4, 7, 3, 10, 8, 6
LCS = 2, 5, 7, 10
Sim(X,Y) = |LCS| or Sim(X,Y) = |LCS| /n
Edit Distance is another possibility
![Page 26: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/26.jpg)
[Ge & Smyth, 2000]
Previous methods primarily “distance based”, this method “model based”
Basic ideas Given sequence Q, construct a model MQ(i.e. a
probability distribution on waveforms)
Given a new pattern Q’, measure similarity by computing p(Q’|MQ)
Probabilistic Generative Modeling Method
![Page 27: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/27.jpg)
The model MQ
a discrete-time finite-state Markov model each segment in data corresponds to a state
data in each state typically generated by a regression curve
a state to state transition matrix is provided
![Page 28: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/28.jpg)
On entering state i, a duration t is drawn from a state-duration distribution p(t)
the process remains in state i for time t
after this, the process transits to another state according to the state transition matrix
![Page 29: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/29.jpg)
Solid lines: the two states of the model
Dashed lines: the actual noisy observations
Example: output of Markov Model
![Page 30: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/30.jpg)
Landmarks[Perng et. al., 2000]
Similarity definition much closer to human perception (unlike Euclidean distance)
A point on the curve is a n-th order landmark if the n-th derivative is 0 Thus, local max and mins are first order
landmarks Landmark distances are tuples (e.g. in time and
amplitude) that satisfy the triangle inequality Several transformations are defined, such as
shifting, amplitude scaling, time warping, etc
![Page 31: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/31.jpg)
Similarity Models
Euclidean and Lp based Edit Distance and LCS based Probabilistic (using Markov Models) Landmarks Next, how to index a database for similarity
retrieval…
![Page 32: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/32.jpg)
Main idea
Seq. scanning works - how to do faster?
![Page 33: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/33.jpg)
Idea: ‘GEMINI’
(GEneric Multimedia INdexIng)
Extract a few numerical features, for a ‘quick and dirty’ test
![Page 34: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/34.jpg)
day1 365
day1 365
S1
Sn
F(S1)
F(Sn)
‘GEMINI’ - Pictorially
eg, avg
eg,. std
![Page 35: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/35.jpg)
GEMINI
Solution: Quick-and-dirty' filter: extract n features (numbers, eg., avg., etc.) map into a point in n-d feature space organize points with off-the-shelf spatial
access method (‘SAM’) discard false alarms
![Page 36: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/36.jpg)
GEMINI
Important: Q: how to guarantee no false dismissals?
A1: preserve distances (but: difficult/impossible)
A2: Lower-bounding lemma: if the mapping ‘makes things look closer’, then there are no false dismissals
![Page 37: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/37.jpg)
GEMINI
Important:
Q: how to extract features?
A: “if I have only one number to describe my object, what should this be?”
![Page 38: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/38.jpg)
Time sequences
Q: what features?
![Page 39: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/39.jpg)
Time sequences
Q: what features?
A: Fourier coefficients (we’ll see them in detail soon)
B: Moments (average, variance, etc)
…. (more next!)
Dfeature(F(x), F(y)) <= D(x, y)
![Page 40: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/40.jpg)
Indexing
Use SAM and index the objects in the feature space (ex. R-tree)
Fast retrieval and easy to implement
![Page 41: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/41.jpg)
Sub-pattern matching
Problem: find sub-sequences that match the given query pattern
![Page 42: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/42.jpg)
day
$price
1 300
day
$price
1 365
day
$price
1 400
30
![Page 43: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/43.jpg)
Sub-pattern matching
Q: how to proceed? Hint: try to turn it into a ‘whole-matching’ problem
(how?)
![Page 44: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/44.jpg)
Sub-pattern matching
Assume that queries have minimum duration w; (eg., w=7 days)
divide data sequences into windows of width w (overlapping, or not?)
![Page 45: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/45.jpg)
Sub-pattern matching
Assume that queries have minimum duration w; (eg., w=7 days)
divide data sequences into windows of width w (overlapping, or not?)
A: sliding, overlapping windows. Thus: trailsPictorially:
![Page 46: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/46.jpg)
Sub-pattern matching
![Page 47: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/47.jpg)
Sub-pattern matching
sequences -> trails -> MBRs in feature space
![Page 48: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/48.jpg)
Sub-pattern matching
Q: do we store all points? why not?
![Page 49: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/49.jpg)
Sub-pattern matching
Q: how to do range queries of duration w?
![Page 50: Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)](https://reader033.fdocuments.net/reader033/viewer/2022052701/56649d3e5503460f94a17728/html5/thumbnails/50.jpg)
Next: more on feature extraction and indexing