Time Series I · people.dsv.su.se/~panagiotis/DAMI2014/timeseries1.pdf · 2014-12-07
Time Series I

Syllabus
Nov 4 Introduction to data mining
Nov 5 Association Rules
Nov 10, 14 Clustering and Data Representation
Nov 17 Exercise session 1 (Homework 1 due)
Nov 19 Classification
Nov 24, 26 Similarity Matching and Model Evaluation
Dec 1 Exercise session 2 (Homework 2 due)
Dec 3 Combining Models
Dec 8, 10 Time Series Analysis
Dec 15 Exercise session 3 (Homework 3 due)
Dec 17 Ranking
Jan 13 Review
Jan 14 EXAM
Feb 23 Re-EXAM

Why deal with sequential data?
• Because all data is sequential :)
• All data items arrive in the data store in some order
• Examples
  – transaction data
  – documents and words
• In some (or many) cases the order does not matter
• In many cases the order is of interest

Time-series data: example
[Figure: financial time series]

Questions
• What is a time series?
• How do we compare time series data?
• What is the structure of time series data?
• Can we represent this structure compactly and accurately?

Time Series
• A sequence of observations:
  – X = (x1, x2, x3, x4, …, xn)
• Each xi is a real number
  – e.g., (2.0, 2.4, 4.8, 5.6, 6.3, 5.6, 4.4, 4.5, 5.8, 7.5)
[Figure: plot of the example series; axes: time, value]

Time Series Databases
• A time series is an ordered set of real numbers, representing the measurements of a real variable at equal time intervals
  – Stock prices
  – Volume of sales over time
  – Daily temperature readings
  – ECG data
• A time series database is a large collection of time series

Time Series Similarity
• Given two time series X = (x1, x2, …, xn) and Y = (y1, y2, …, yn)
• Define and compute D(X, Y)
• Or better…

Time Series Similarity Search
• Given a time series database and a query X
• Find the best match of X in the database
• Why is that useful?
[Figure: query X compared against the database series using D(X, Y); the 1-NN is returned]

Examples
• Find companies with similar stock prices over a time interval
• Find products with similar sell cycles
• Cluster users with similar credit card utilization
• Find similar subsequences in DNA sequences
• Find scenes in video streams

Types of queries
• whole match vs subsequence match
• range query vs nearest neighbor query

[Figure: three stock-price series, each plotted as $price against day (1 to 365)]
Distance function: chosen by an expert (e.g., Euclidean distance)

Problems
• Define the similarity (or distance) function
• Find an efficient algorithm to retrieve similar time series from a database
  – (Faster than sequential scan)
The similarity function depends on the application

Metric Distances
• What properties should a similarity distance have to allow (easy) indexing?
  – D(A,B) = D(B,A)  Symmetry
  – D(A,A) = 0  Constancy of Self-Similarity
  – D(A,B) ≥ 0  Positivity
  – D(A,B) ≤ D(A,C) + D(B,C)  Triangle Inequality
• Sometimes the distance function that best fits an application is not a metric
• Then indexing becomes interesting and challenging

Euclidean Distance
• Each time series: a point in the n-dim space
• Euclidean distance
  – pair-wise point distance
$L_2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
X = x1, x2, …, xn
Y = y1, y2, …, yn
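The formula above can be sketched in a few lines of Python. This is only an illustration; the function name `euclidean_distance` and the sample values are assumptions, not part of the slides.

```python
import math

def euclidean_distance(x, y):
    """Euclidean (L2) distance between two series of equal length."""
    if len(x) != len(y):
        raise ValueError("both series must have the same length")
    # Pair-wise point differences, squared, summed, then square-rooted.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical example series
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

Note that the length check mirrors the requirement that both series have n points.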

Euclidean model
• Query Q: n datapoints
• Database: each series has n datapoints
• Euclidean distance between two time series Q = {q1, q2, …, qn} and X = {x1, x2, …, xn}:
$D(Q, X) \equiv \sqrt{\sum_{i=1}^{n} (q_i - x_i)^2}$
• Example: distances from Q to four database series, and the resulting ranks

Distance  Rank
0.98      4
0.07      1
0.21      2
0.43      3

Advantages of Euclidean distance
• Easy to compute: O(n)
• Allows scalable solutions to other problems, such as
  – indexing
  – clustering
  – etc...

Disadvantages of Euclidean distance
• Query and target lengths should be equal!
• Cannot tolerate noise:
  – Time shifts
  – Sequences out of phase
  – Scaling in the y-axis

Limitations of Euclidean Distance
• Euclidean distance: sequences are aligned "one to one".
  $D(Q, X) \equiv \sqrt{\sum_{i=1}^{n} (q_i - x_i)^2}$
• "Warped" time axis: nonlinear alignments are possible.
[Figure: two series Q and C, aligned one to one and aligned with a warped time axis]

DTW: Dynamic time warping (1/2)
• Each cell c = (i, j) is a pair of indices whose corresponding values will be computed, $(x_i - q_j)^2$, and included in the sum for the distance.
• Euclidean path:
  – i = j always.
  – Ignores off-diagonal cells.
[Figure: matrix with X and Q on the axes; the diagonal cells accumulate $(x_1 - q_1)^2$, then $(x_2 - q_2)^2 + (x_1 - q_1)^2$, and so on]

DTW: Dynamic time warping (2/2)
• DTW allows any path.
• Examine all paths:
  – Standard dynamic programming to fill in the table.
  – The top-right cell contains the final result.
[Figure: cell (i, j) is reached from (i-1, j), (i-1, j-1), or (i, j-1); the off-diagonal moves shrink X / stretch Q or stretch X / shrink Q]

Computation
• DTW is computed by dynamic programming
• Given two sequences
  – Q = {q1, q2, …, qN}
  – X = {x1, x2, …, xM}
$D_{dtw}(Q, X) = f(N, M)$
$f(i, j) = |q_i - x_j| + \min \{ f(i, j-1),\ f(i-1, j),\ f(i-1, j-1) \}$
The three options inside the min cover a q-stretch, an x-stretch, and no stretch.
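The recurrence above can be sketched as a small dynamic program. The slides do not spell out the boundary values, so this sketch assumes the usual convention f(0, 0) = 0 and f(i, 0) = f(0, j) = ∞; the name `dtw_distance` is hypothetical.

```python
import math

def dtw_distance(q, x):
    """DTW distance f(N, M) with absolute-difference cell cost."""
    n, m = len(q), len(x)
    # f[i][j] = cost of the best warping path ending at cell (i, j).
    # Assumed boundary: f[0][0] = 0, the rest of row/column 0 is infinity.
    f = [[math.inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - x[j - 1])
            f[i][j] = cost + min(f[i][j - 1], f[i - 1][j], f[i - 1][j - 1])
    return f[n][m]

# Series of different lengths can still match perfectly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The table has (N+1)·(M+1) cells and each is filled in constant time.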

Properties of DTW
• Warping path W:
  – set of grid cells in the time warping matrix
• DTW finds the optimum warping path W (the best alignment):
  – the path with the smallest matching score
Properties of a DTW legal path
I. Boundary conditions: W1 = (1,1) and WK = (n,m)
II. Continuity: given Wk = (a, b), then Wk-1 = (c, d), where a-c ≤ 1, b-d ≤ 1
III. Monotonicity: given Wk = (a, b), then Wk-1 = (c, d), where a-c ≥ 0, b-d ≥ 0
[Figure: time warping matrix for X and Y with the optimum warping path W]

Properties of DTW
I. Boundary conditions: W1 = (1,1) and WK = (n,m)
  • Paths start at the bottom left cell and end at the top right cell
II. Continuity: given Wk = (a, b), then Wk-1 = (c, d), where a-c ≤ 1, b-d ≤ 1
  • There is always a point of the path in each row and column of the matrix
III. Monotonicity: given Wk = (a, b), then Wk-1 = (c, d), where a-c ≥ 0, b-d ≥ 0
  • Paths always go from left to right and from bottom to top

Advantages of DTW
• Query and target need not be of equal length :)
• Can tolerate noise:
  – time shifts
  – sequences out of phase
  – scaling in the y-axis

Disadvantages of DTW
• Computational complexity: O(nm)
• May not be able to handle some types of noise...
• DTW is not a metric (the triangle inequality does not hold)
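A small counterexample makes the last point concrete. The sketch below assumes an absolute-difference cell cost and the usual boundary f(0, 0) = 0; the three series A, B, C are made-up values chosen so that D(A, C) exceeds D(A, B) + D(B, C).

```python
import math

def dtw(q, x):
    """Plain DTW with absolute-difference cost (assumed boundary f(0,0)=0)."""
    f = [[math.inf] * (len(x) + 1) for _ in range(len(q) + 1)]
    f[0][0] = 0.0
    for i in range(1, len(q) + 1):
        for j in range(1, len(x) + 1):
            f[i][j] = abs(q[i - 1] - x[j - 1]) + min(
                f[i][j - 1], f[i - 1][j], f[i - 1][j - 1]
            )
    return f[len(q)][len(x)]

# Hypothetical series that break the triangle inequality under DTW.
A, B, C = [0], [1, 2], [2, 3, 3]
d_ab, d_bc, d_ac = dtw(A, B), dtw(B, C), dtw(A, C)
print(d_ab, d_bc, d_ac)          # 3.0 3.0 8.0
print(d_ac <= d_ab + d_bc)       # False: the triangle inequality fails
```

Because of this, index structures that rely on the triangle inequality cannot be applied to DTW directly.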

Global Constraints
• Slightly speed up the calculations and prevent pathological warpings
• A global constraint limits the indices of the warping path: wk = (i, j)k such that j-r ≤ i ≤ j+r
• Where r is a term defining the allowed range of warping for a given point in a sequence
[Figure: Sakoe-Chiba Band and Itakura Parallelogram, with the range r marked]

Complexity of DTW
• Basic implementation = O(n^2) where n is the length of the sequences
  – will have to solve the problem for each (i, j) pair
• If warping window is specified, then O(nr)
  – only solve for the (i, j) pairs where |i – j| ≤ r
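A warping window of this kind can be sketched by restricting the inner loop of the dynamic program to the cells with |i – j| ≤ r. The function name `dtw_banded`, the absolute-difference cost, and the boundary f(0, 0) = 0 are assumptions of this sketch. For brevity it still allocates the full table, so only the running time (not the memory) is O(nr).

```python
import math

def dtw_banded(q, x, r):
    """DTW restricted to cells with |i - j| <= r (Sakoe-Chiba style band)."""
    n, m = len(q), len(x)
    f = [[math.inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(1, n + 1):
        # Only visit the columns inside the band around the diagonal.
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = abs(q[i - 1] - x[j - 1])
            f[i][j] = cost + min(f[i][j - 1], f[i - 1][j], f[i - 1][j - 1])
    return f[n][m]  # stays infinite if the band cannot reach (n, m)

print(dtw_banded([1, 2, 3], [2, 2, 5], 0))     # r = 0: one-to-one alignment
print(dtw_banded([1, 2, 3], [1, 1, 2, 3], 3))  # wide band: unconstrained DTW
```

With r = 0 only the diagonal is visited, which matches the "Euclidean path" where i = j always.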

Longest Common Subsequence Measures (Allowing for Gaps in Sequences)
[Figure: two aligned sequences; the gap is skipped]

Longest Common Subsequence (LCSS)
Advantages of LCSS:
  A. Outlying values not matched
  B. Distance/Similarity distorted less
Disadvantages of DTW:
  A. All points are matched
  B. Outliers can distort distance
  C. One-to-many mapping
LCSS is more resilient to noise than DTW.
[Figure: two sequences with the matching regions marked "match"; LCSS ignores the majority of the noise]

Longest Common Subsequence
• Similar dynamic programming solution as DTW, but now we measure similarity, not distance.
• Can also be expressed as distance
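The slide does not give the LCSS recurrence, so the following is only a sketch of one common formulation for real-valued series: two points "match" when they differ by at most a threshold `eps`, and the similarity is turned into a distance as 1 − LCSS / min(n, m). Both the threshold parameter and this particular distance conversion are assumptions, as are the function names.

```python
def lcss_length(q, x, eps):
    """Length of the longest common subsequence, matching points within eps."""
    n, m = len(q), len(x)
    # table[i][j] = LCSS of the first i points of q and the first j points of x
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(q[i - 1] - x[j - 1]) <= eps:
                table[i][j] = table[i - 1][j - 1] + 1        # match
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])  # skip a point
    return table[n][m]

def lcss_distance(q, x, eps):
    """One possible distance form: 1 - LCSS / min(len(q), len(x))."""
    return 1.0 - lcss_length(q, x, eps) / min(len(q), len(x))

# The outlier 100 is simply skipped instead of being matched.
print(lcss_length([1, 2, 3, 4], [1, 2, 100, 3, 4], 0.5))
```

Unlike DTW, unmatched points contribute nothing, which is why outliers distort the result less.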

Similarity Retrieval
• Range Query
  – Find all time series X where D(Q, X) ≤ ε
• Nearest Neighbor query
  – Find the k most similar time series to Q
• A method to answer the above queries:
  – Linear scan
• A better approach
  – GEMINI [next time]
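Both query types can be answered by a linear scan, sketched below. The helper names (`range_query`, `knn_query`), the choice of Euclidean distance as the default, and the toy database are assumptions made for illustration.

```python
import heapq
import math

def euclidean(q, x):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, x)))

def range_query(db, q, eps, dist=euclidean):
    """Indices of all series X in db with D(Q, X) <= eps (linear scan)."""
    return [i for i, x in enumerate(db) if dist(q, x) <= eps]

def knn_query(db, q, k, dist=euclidean):
    """Indices of the k series closest to q, nearest first (linear scan)."""
    return heapq.nsmallest(k, range(len(db)), key=lambda i: dist(q, db[i]))

# Hypothetical toy database of length-2 series
db = [[0, 0], [3, 4], [1, 0], [10, 10]]
print(range_query(db, [0, 0], 1.0))
print(knn_query(db, [0, 0], 2))
```

Every query computes the full distance to every series, which is the cost that indexing approaches try to avoid.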

Lower Bounding – NN search
Intuition
• Try to use a cheap lower bounding calculation as often as possible
• Do the expensive, full calculations only when absolutely necessary
We can speed up similarity search by using a lower bounding function
• D: distance measure
• LB: lower bounding function s.t.: LB(Q, X) ≤ D(Q, X)
We assume a database of time series: DB = {X1, X2, …, XN}
1-NN Search Using LB
  Set best = ∞
  For each Xi:
    if LB(Xi, Q) < best
      if D(Xi, Q) < best
        best = D(Xi, Q)
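The 1-NN procedure with a lower bound can be sketched as follows. The distance and lower-bound functions are passed in as parameters; the concrete pair used in the example (Euclidean distance, bounded below by the absolute difference of the first points) is only an assumed toy choice, and the counter of full computations is added for illustration.

```python
import math

def nn_search_lb(db, q, dist, lb):
    """1-NN search that skips the full distance whenever LB >= best so far."""
    best, best_idx, full_computations = math.inf, None, 0
    for i, x in enumerate(db):
        if lb(x, q) < best:              # cheap test first
            full_computations += 1
            d = dist(x, q)               # expensive, only when necessary
            if d < best:
                best, best_idx = d, i
    return best_idx, best, full_computations

# Assumed toy distance and lower bound: |x1 - q1| <= Euclidean distance.
def euclidean(x, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, q)))

def first_point_lb(x, q):
    return abs(x[0] - q[0])

db = [[0, 0], [5, 5], [0, 1], [9, 0]]
print(nn_search_lb(db, [0, 0.9], euclidean, first_point_lb))
```

In this toy run two of the four candidates are pruned by the lower bound alone.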

Lower Bounding – Range Query
We assume a database of time series: DB = {X1, X2, …, XN}
Range Query Using LB
  For each Xi:
    if LB(Xi, Q) ≤ ε
      if D(Xi, Q) ≤ ε
        report Xi
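The range query with a lower bound follows the same pattern. As before, the distance and lower bound are parameters, and the Euclidean distance with a first-point lower bound is only an assumed toy pair.

```python
import math

def range_query_lb(db, q, eps, dist, lb):
    """Report every Xi with D(Xi, Q) <= eps, pruning with LB where possible."""
    result, full_computations = [], 0
    for i, x in enumerate(db):
        if lb(x, q) <= eps:              # otherwise D >= LB > eps: prune
            full_computations += 1
            if dist(x, q) <= eps:
                result.append(i)
    return result, full_computations

# Assumed toy distance and lower bound: |x1 - q1| <= Euclidean distance.
def euclidean(x, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, q)))

def first_point_lb(x, q):
    return abs(x[0] - q[0])

db = [[0, 0], [5, 5], [0, 1], [9, 0]]
print(range_query_lb(db, [0, 0], 1.0, euclidean, first_point_lb))
```

Pruning is safe because LB(Xi, Q) > ε implies D(Xi, Q) > ε.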

Problems
• How to define lower bounds for different distance measures?
• How to extract the features? How to define the feature space?
  – Fourier transform
  – Wavelets transform
  – Averages of segments (Histograms or APCA)
  – Chebyshev polynomials
  – .... your favorite curve approximation...

Some Lower Bounds on DTW
LB_Kim
• Each time series is represented by 4 features: <First, Last, Min, Max>
• LB_Kim = maximum squared difference of the corresponding features
LB_Yi
• LB_Yi = squared differences of the points of X that fall above max(Q) or below min(Q)
[Figure: X and Q for each bound, with max(Q) and min(Q) marked for LB_Yi]
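Read literally, the two descriptions above can be sketched as below. The function names are hypothetical, and LB_Yi is assumed to sum the squared differences over the qualifying points of X.

```python
def lb_kim(q, x):
    """Maximum squared difference of the features <First, Last, Min, Max>."""
    features_q = (q[0], q[-1], min(q), max(q))
    features_x = (x[0], x[-1], min(x), max(x))
    return max((a - b) ** 2 for a, b in zip(features_q, features_x))

def lb_yi(q, x):
    """Sum of squared differences of points of x outside [min(q), max(q)]."""
    lo, hi = min(q), max(q)
    total = 0
    for xi in x:
        if xi > hi:
            total += (xi - hi) ** 2
        elif xi < lo:
            total += (xi - lo) ** 2
    return total

# Hypothetical example: only the value 5 falls outside the range of Q.
Q, X = [1, 2, 3], [1, 5, 3]
print(lb_kim(Q, X), lb_yi(Q, X))
```

Both bounds need only a single pass over each series.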

LB_Keogh [Keogh 2004]
• Envelope of Q for a warping range r:
  $U_i = \max(q_{i-r} : q_{i+r})$
  $L_i = \min(q_{i-r} : q_{i+r})$
[Figure: upper envelope U and lower envelope L around Q, shown for the Sakoe-Chiba Band and the Itakura Parallelogram, with X overlaid]

LB_Keogh
$LB\_Keogh(Q, X) = \sqrt{\sum_{i=1}^{n} \begin{cases} (x_i - U_i)^2 & \text{if } x_i > U_i \\ (x_i - L_i)^2 & \text{if } x_i < L_i \\ 0 & \text{otherwise} \end{cases}}$
[Figure: X against the envelope (U, L) of Q, for the Sakoe-Chiba Band and the Itakura Parallelogram]
$LB\_Keogh(Q, X) \le DTW(Q, X)$
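The bound can be sketched for the Sakoe-Chiba case, where the envelope at position i covers q(i-r) … q(i+r). To compare like with like, the sketch pairs it with a band-constrained DTW that uses squared cell costs and a final square root; that DTW variant, the function names, and the sample series are assumptions of this illustration.

```python
import math

def envelope(q, r):
    """Upper and lower envelope of q over a window of +/- r positions."""
    upper = [max(q[max(0, i - r): i + r + 1]) for i in range(len(q))]
    lower = [min(q[max(0, i - r): i + r + 1]) for i in range(len(q))]
    return upper, lower

def lb_keogh(q, x, r):
    """LB_Keogh: penalize only the parts of x outside the envelope of q."""
    upper, lower = envelope(q, r)
    total = 0.0
    for xi, u, l in zip(x, upper, lower):
        if xi > u:
            total += (xi - u) ** 2
        elif xi < l:
            total += (xi - l) ** 2
    return math.sqrt(total)

def dtw_band_sq(q, x, r):
    """Band-constrained DTW, squared cell cost, square root at the end."""
    n = len(q)
    f = [[math.inf] * (n + 1) for _ in range(n + 1)]
    f[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - x[j - 1]) ** 2
            f[i][j] = cost + min(f[i][j - 1], f[i - 1][j], f[i - 1][j - 1])
    return math.sqrt(f[n][n])

Q, X, r = [0, 1, 2, 1, 0], [0, 0, 3, 3, 0], 1
print(lb_keogh(Q, X, r), dtw_band_sq(Q, X, r))
```

The lower bound costs O(n) once the envelope is known, compared with O(nr) for the constrained DTW.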

Tightness of LB
$T = \frac{\text{Lower Bound Estimate of Dynamic Time Warp Distance}}{\text{True Dynamic Time Warp Distance}}$
0 ≤ T ≤ 1; the larger the better
• Lower bounds compared: LB_Keogh (Sakoe-Chiba), LB_Keogh (Itakura), LB_Yi, LB_Kim
• …proportional to the length of gray lines used in the illustrations

Lower Bounding
• We want to find the 1-NN to our query data series, Q.
• We compute the distance to the first data series in our dataset, D(S1,Q). This becomes the best so far (BSF).
• We compute the distance LB(S2,Q) and it is greater than the BSF. We can safely prune S2, since D(S2,Q) ≥ LB(S2,Q).
• We compute the distance LB(S3,Q) and it is smaller than the BSF. We have to compute D(S3,Q) ≥ LB(S3,Q), since it may still be smaller than the BSF.
• It turns out that D(S3,Q) ≥ BSF, so we can safely prune S3.
• We compute the distance LB(S4,Q) and it is smaller than the BSF. We have to compute D(S4,Q) ≥ LB(S4,Q), since it may still be smaller than the BSF.
• It turns out that D(S4,Q) < BSF, so S4 becomes the new BSF.
• S1 cannot be the 1-NN, because S4 is closer to Q.
[Figure: distances from Q along one axis, showing the true distance of S1, the lower bound of S2, the true distances of S3 and S4, and the current BSF]
51
How about subsequence matching?
• DTW is defined for full-‐sequence matching: – All points of the query sequence are matched to all points of the target sequence
• Subsequence matching: – The query is matched to a part (subsequence) of the target sequence
Query sequence Data stream

Subsequence Matching
• X: long sequence
• Q: short sequence
• What subsequence of X is the best match for Q?

J-Position Subsequence Match
• What subsequence of X is the best match for Q, such that the match ends at position j?
• Naïve solution: DTW
  – Examine all possible subsequences
  – Too costly!

Why not 'naive'?
• Compute the time warping matrices starting from every database frame
  – Need O(n) matrices, O(nm) time per frame
[Figure: warping matrix of Q (length m) against X (length n, starting at x1); a matrix started at t = tstart captures the optimal subsequence from xtstart to xtend]

Key Idea
• Star-padding
  – Use only a single matrix (the naïve solution uses n matrices)
  – Prefix Q with '*', that always gives zero distance
  – Instead of Q = (q1, q2, …, qm), compute distances with Q':
    $Q' = (q_0, q_1, q_2, \ldots, q_m), \quad q_0 = (*)$
  – O(m) time and space (the naïve requires O(nm))

SPRING: dynamic programming
• Initialization
  – Insert a "dummy" state '*' at the beginning of the query
  – '*' matches every value in X with score 0
• Computation
  – Perform dynamic programming computation in a similar manner as standard DTW
• For each (i, j):
  – compute the j-position subsequence match of the first i items of Q to X[s:j], i.e., Q[1:i] is matched with X[s:j]
  – Top row: j-position subsequence match of Q for all j's
• Final answer: best among j-position matches
  – Look at answers stored at the top row of the table
[Figure: matrix with database sequence X on the horizontal axis and query Q on the vertical axis; the bottom '*' row is all zeros, and each cell (i, j) is computed from (i-1, j), (i-1, j-1), and (i, j-1)]
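The star-padded dynamic program can be sketched as below. This is an illustration of the idea rather than the SPRING authors' code: the function name, the absolute-difference cost, and the 0-based end index are assumptions. Only the previous and current columns are kept, so memory is O(m).

```python
import math

def best_subsequence_match(q, x):
    """Best DTW match of q against any subsequence of x.

    Returns (distance, end_index), with end_index 0-based into x.
    """
    m = len(q)
    # Column "before" x starts: only the '*' row (index 0) is reachable.
    prev = [0.0] + [math.inf] * m
    best, best_end = math.inf, None
    for j, xj in enumerate(x):
        cur = [0.0] * (m + 1)        # cur[0] is the '*' row: always score 0
        for i in range(1, m + 1):
            cost = abs(q[i - 1] - xj)
            cur[i] = cost + min(cur[i - 1], prev[i], prev[i - 1])
        # cur[m] is the top row: the best match of all of q ending at j.
        if cur[m] < best:
            best, best_end = cur[m], j
        prev = cur
    return best, best_end

# Hypothetical stream in which the query occurs exactly at positions 2..4.
print(best_subsequence_match([1, 2, 3], [5, 5, 1, 2, 3, 5, 5]))
```

Each incoming value of X costs O(m) work, so the whole scan is O(|Q| · |X|).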

Subsequence vs. full matching
[Figure: warping matrix with database sequence X (p1 … pi … pN) on the horizontal axis, query Q (q1 … qj … qM) on the vertical axis, and the '*' row of zeros at the bottom]

Computational complexity
• Assume that the database is one very long sequence
  – Concatenate all sequences into one sequence
• O(|Q| * |X|)
  – But can be computed more efficiently by keeping only two adjacent columns at a time

STWM (Subsequence Time Warping Matrix)
• Problem of the star-padding: we lose the information about the starting frame of the match
• After the scan, "which is the optimal subsequence?"
• Elements of STWM
  – Distance value of each subsequence
  – Starting position!
• Combination of star-padding and STWM
  – Efficiently identify the optimal subsequence in a stream fashion
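One way to sketch this combination is to store, next to each distance, the starting position of the match it belongs to, and to copy that position from whichever neighboring cell supplied the minimum. The code below is an assumed illustration of that idea (names, absolute-difference cost, and 0-based positions are not from the slides).

```python
import math

def best_subsequence_with_start(q, x):
    """Star-padded DTW that also tracks where each match starts.

    Returns (distance, start_index, end_index), indices 0-based into x.
    """
    m = len(q)
    d_prev = [0.0] + [math.inf] * m
    s_prev = [0] * (m + 1)
    best = (math.inf, None, None)
    for j, xj in enumerate(x):
        s_prev[0] = j                # leaving the '*' row diagonally starts at j
        d_cur = [0.0] * (m + 1)      # '*' row: distance 0 ...
        s_cur = [j] * (m + 1)        # ... and a match that would start at j
        for i in range(1, m + 1):
            cost = abs(q[i - 1] - xj)
            # Pick the cheapest neighbor and inherit its starting position.
            d_min, s_min = min(
                (d_cur[i - 1], s_cur[i - 1]),
                (d_prev[i], s_prev[i]),
                (d_prev[i - 1], s_prev[i - 1]),
            )
            d_cur[i], s_cur[i] = cost + d_min, s_min
        if d_cur[m] < best[0]:
            best = (d_cur[m], s_cur[m], j)
        d_prev, s_prev = d_cur, s_cur
    return best

print(best_subsequence_with_start([1, 2, 3], [5, 5, 1, 2, 3, 5, 5]))
```

Since each column depends only on the previous one, the scan can process X one value at a time, in a streaming fashion.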

Up next…
• Time series summarizations
  – Discrete Fourier Transform (DFT)
  – Piecewise Aggregate Approximation (PAA)
  – Symbolic Aggregate approXimation (SAX)
• Streams
  – Z-normalization
  – A fast algorithm for subsequence matching in streams
• Time series classification [briefly]
  – Lazy learners and Shapelets