Introduction to Data Mining: Time Series Analysis
ukang/courses/20S-DM/L13-timeseries-analysis.pdf

U Kang, Seoul National University

Page 1:

Introduction to Data Mining

Time Series Analysis

U Kang, Seoul National University

Page 2:

In This Lecture

Motivation of time series analysis

Similarity Search

Linear Forecasting

Page 3:

Outline

Motivation

Similarity Search

Linear Forecasting

Page 4:

Problem Definition

Given: one or more sequences

x1 , x2 , … , xt , …

(y1, y2, … , yt, …)

Task

Find similar sequences

Forecast future values

Classify sequences (e.g., fault or normal)

Page 5:

Applications

Financial, sales, economic series

Healthcare

ECG; blood pressure monitoring

Reaction to new drugs

Elderly care

Smart house

Monitor temperature, humidity, air quality, etc.

Fault detection and prediction

Video surveillance

Page 6:

Applications

Civil / automobile infrastructure

Bridge vibrations

Road conditions / traffic monitoring

Weather, environment / anti-pollution

Figure: vibration peaks caused by a heavily loaded truck at night on the Haldiram Bridge, Kolkata

https://www.signaguard.com/bridge-health-monitoring-system/

Page 7:

Outline

Motivation

Similarity Search

Linear Forecasting

Page 8:

Similarity Search

Given a time series A, find the most similar time series

Applications

Clustering: group time series

Classification: classify a time series into ‘normal’ or ‘abnormal’

Rule discovery: if we observe a time series A and a time series B, we are likely to observe a time series C

Query by content: given a song A, find the most similar songs
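The search task on this slide can be sketched as a linear scan over a collection of series. This is only an illustration: the function names, the toy data, and the choice of Euclidean distance as the default are assumptions, and any other distance function can be passed in.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def most_similar(query, candidates, dist=euclidean):
    """Return (index, distance) of the candidate closest to the query."""
    best_index, best_dist = None, math.inf
    for idx, series in enumerate(candidates):
        d = dist(query, series)
        if d < best_dist:
            best_index, best_dist = idx, d
    return best_index, best_dist


# Hypothetical toy collection and query (not from the lecture).
database = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 5]]
query = [1, 2, 3, 4.8]
best_index, best_dist = most_similar(query, database)
print(best_index, round(best_dist, 3))
```

A linear scan costs one distance computation per stored series, which is why the choice of distance function matters so much for this task.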

Page 9:

Importance of Distance Functions

Two major families

Euclidean and Lp norms

Time warping and variations (DTW etc.)

Euclidean: one-to-one alignment

Time warping: nonlinear alignments

Page 10:

Euclidean and Lp norm

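For two sequences x(t) and y(t) of length n:

$$D(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$L_p(\vec{x}, \vec{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$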

•L1: city-block = Manhattan

•L2 = Euclidean

•L
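A minimal sketch of the Lp family named on this slide, assuming two equal-length numeric sequences. The function name `lp_distance` and the sample values are hypothetical; p = 1 gives the city-block (Manhattan) distance and p = 2 the Euclidean distance.

```python
def lp_distance(x, y, p=2):
    """Lp distance between two equal-length sequences, for p >= 1."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)


# Hypothetical sample sequences.
x = [3, 5, 7]
y = [0, 1, 7]
l1 = lp_distance(x, y, p=1)  # |3| + |4| + |0|
l2 = lp_distance(x, y, p=2)  # sqrt(3^2 + 4^2 + 0^2)
print(l1, l2)
```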

Page 11:

Limitation of Euclidean and Lp

Cannot account for variability along the time axis (e.g., two similar shapes that are shifted or locally stretched in time)

https://www.cs.ucr.edu/~eamonn/KAIS_2004_warping.pdf

Page 12:

Dynamic Time Warping

Allow accelerations and decelerations along the time axis (with or without penalty)

Then compute the (Euclidean) distance (+ penalty)

Related to the string-editing distance

Page 13:

Dynamic Time Warping

[Figure: example alignment showing ‘stutters’]

Page 14:

Applications of DTW

Bioinformatics

Medicine

Robotics

Chemistry

Gesture Recognition

Page 15:

Dynamic Time Warping

Q: how to compute it?

A: dynamic programming

D(i, j) = cost to match the prefix of length i of the first sequence x with the prefix of length j of the second sequence y

Page 16:

Dynamic Time Warping

Thus, with no penalty for stutter, for sequences

x1, x2, …, xi; y1, y2, …, yj

$$D(i, j) = |x[i] - y[j]| + \min \begin{cases} D(i-1, j-1) & \text{no stutter} \\ D(i, j-1) & \text{x-stutter} \\ D(i-1, j) & \text{y-stutter} \end{cases}$$

Similar to string-edit distance

Page 17:

Dynamic Time Warping

Example: DTW(x, y) with x1 = 3, x2 = 5 and y1 = 3, y2 = 4, y3 = 5

          j = 1         j = 2         j = 3
i = 1     D(i,j) = 0    D(i,j) = 1    D(i,j) = 3
i = 2     D(i,j) = 2    D(i,j) = 1    D(i,j) = 1

$$D(i, j) = |x[i] - y[j]| + \min \begin{cases} D(i-1, j-1) & \text{no stutter} \\ D(i, j-1) & \text{x-stutter} \\ D(i-1, j) & \text{y-stutter} \end{cases}$$
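The dynamic program can be sketched as follows, assuming the absolute difference |x[i] - y[j]| as the per-step cost (this choice reproduces the table values in the example) and no stutter penalty. The function name `dtw` and the padded table layout are illustrative choices.

```python
def dtw(x, y):
    """DTW cost between sequences x and y with no stutter penalty.

    Returns (distance, table), where table[i-1][j-1] holds D(i, j).
    Both sequences are assumed to be non-empty.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # Row 0 and column 0 are padding: only D(0, 0) = 0 is reachable.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(
                D[i - 1][j - 1],  # no stutter
                D[i][j - 1],      # x-stutter: x[i] is matched again
                D[i - 1][j],      # y-stutter: y[j] is matched again
            )
    return D[n][m], [row[1:] for row in D[1:]]


# The example from the slide: x = (3, 5), y = (3, 4, 5).
distance, table = dtw([3, 5], [3, 4, 5])
for row in table:
    print(row)
print("DTW distance:", distance)
```

The final entry D(2, 3) = 1 is the DTW distance for the example, matching the bottom-right cell of the table.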

Page 18:

Dynamic Time Warping

Complexity: O(M*N) for sequences of lengths M and N (quadratic in the sequence length)

Many variations (penalty for stutters; limit on the number/percentage of stutters; …)
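As an illustration of the first variation, the sketch below adds a constant penalty to every stutter step. The slides do not define the penalty, so the additive form and the parameter name `stutter_penalty` are assumptions; the per-step cost is again taken to be the absolute difference.

```python
def dtw_with_penalty(x, y, stutter_penalty=0.0):
    """DTW cost where each stutter step pays an extra constant penalty."""
    n, m = len(x), len(y)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(
                D[i - 1][j - 1],                # no stutter, no penalty
                D[i][j - 1] + stutter_penalty,  # x-stutter
                D[i - 1][j] + stutter_penalty,  # y-stutter
            )
    return D[n][m]


# Same sequences as the earlier example; lengths differ, so one stutter is unavoidable.
no_penalty = dtw_with_penalty([3, 5], [3, 4, 5])
with_penalty = dtw_with_penalty([3, 5], [3, 4, 5], stutter_penalty=0.5)
print(no_penalty, with_penalty)
```

With a penalty of 0 the function reduces to plain DTW; a larger penalty pushes the alignment toward the one-to-one matching used by the Euclidean distance.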

Page 19:

Outline

Motivation

Similarity Search

Linear Forecasting

Page 20:

Forecasting

Example: given $x_{t-1}, x_{t-2}, \dots$, forecast $x_t$

[Figure: number of packets sent (y-axis, 0 to 90) vs. time tick (x-axis, 1 to 11); the value to forecast is marked ‘??’]

Page 21:

Forecasting

Solution: try to express $x_t$ as a linear function of the past data $x_{t-1}, x_{t-2}, \dots$ (up to a window of $w$)

$$x_t \approx a_1 x_{t-1} + \dots + a_w x_{t-w}$$

[Figure: plot over time ticks 1 to 11 (y-axis 0 to 90); the value to forecast is marked ‘??’]

Page 22:

(Related: Back-cast)

Solution - interpolation: try to express $x_t$ as a linear function of the past AND the future:

$x_{t+1}, x_{t+2}, \dots, x_{t+w_{future}}$; $x_{t-1}, \dots, x_{t-w_{past}}$

(up to windows of $w_{past}$ and $w_{future}$)

Exactly the same algorithm as in forecasting

[Figure: plot over time ticks 1 to 11 (y-axis 0 to 90); the unknown value is marked ‘??’]

Page 23:

Linear Regression: idea

[Figure: scatter plot of the data in the table; the axes are labeled blood pressure and height]

patient   blood pressure   height
1         27               43
2         43               54
3         54               72
…         …                …
N         25               ??

Express what we don’t know (dependent variable) as a linear function of what we know (independent variables)

Page 24:

Linear Auto Regression:

Time   Sensor value (t-1)   Sensor value (t)
1      -                    41
2      41                   54
3      54                   62
…      …                    …
N      25                   ??

Page 25:

Linear Auto Regression:

[Figure: ‘lag-plot’ of sensor value (t) (y-axis) against sensor value (t-1) (x-axis)]

Time   Sensor value (t-1)   Sensor value (t)
1      -                    41
2      41                   54
3      54                   62
…      …                    …
N      25                   ??

• lag window = 1

• Dependent variable = Sensor value at t

• Independent variable = Sensor value at t-1

Page 26:

Window size > 1

Q: How to handle window size > 1 ?

A: Fit a hyperplane

[Figure: 3-D plot with axes $x_{t-2}$, $x_{t-1}$, and $x_t$]

Page 27:

Window size > 1

Q: How to handle window size > 1 ?

A: Fit a hyperplane

$$X_{[N \times w]} \, a_{[w \times 1]} = y_{[N \times 1]}$$

Solving for a is an over-constrained problem (N > w)
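A minimal sketch of how X and y could be assembled from a single series for a lag window w. The function name `build_lag_matrix` is hypothetical, and only the first three sample values (41, 54, 62) come from the lecture's table; the rest are made up. Each row of X holds the w most recent past values and the matching entry of y holds the value to predict.

```python
def build_lag_matrix(series, w):
    """Return (X, y) for autoregression with lag window w.

    Row k of X is [x_{t-1}, x_{t-2}, ..., x_{t-w}] and y[k] is x_t,
    so a series of length T yields T - w rows.
    """
    if w < 1 or len(series) <= w:
        raise ValueError("need 1 <= w < len(series)")
    X, y = [], []
    for t in range(w, len(series)):
        X.append([series[t - k] for k in range(1, w + 1)])
        y.append(series[t])
    return X, y


# 41, 54, 62 are from the table in the slides; 70 and 65 are hypothetical.
series = [41, 54, 62, 70, 65]
X, y = build_lag_matrix(series, w=2)
print(X)
print(y)
```

Because the number of rows grows with the length of the series while w stays fixed, there are usually many more equations than unknowns, which is the over-constrained situation described above.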

Page 28:

Window size > 1

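$$X_{[N \times w]} \, a_{[w \times 1]} = y_{[N \times 1]}$$

$$\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1w} \\ X_{21} & X_{22} & \cdots & X_{2w} \\ \vdots & \vdots & & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{Nw} \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_w \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$

Columns of X: independent variables; rows of X: time; y: dependent variable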

Page 30:

Solution of AR

Q: What is the best solution a for the problem $X_{[N \times w]} \, a_{[w \times 1]} = y_{[N \times 1]}$?

A: Least squares fit

$$a = (X^T X)^{-1} (X^T y)$$

a is the vector that minimizes the RMSE between Xa and y
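A minimal sketch of the least squares fit using only the standard library. Instead of forming the inverse explicitly, it solves the equivalent normal equations (X^T X) a = X^T y by Gaussian elimination; the helper names and the toy data are hypothetical. The data are constructed so that y = 0.5 * (column 1) + 0.25 * (column 2) holds exactly, so the fit should recover those coefficients.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def least_squares(X, y):
    """Return a that solves the normal equations (X^T X) a = X^T y."""
    w = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(w)] for i in range(w)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(w)]
    return solve_linear_system(XtX, Xty)


# Hypothetical data: N = 4 rows, w = 2 columns.
X = [[8, 4], [4, 8], [12, 4], [4, 12]]
y = [5, 4, 7, 5]
a = least_squares(X, y)
print(a)
```

In practice a library routine based on QR or SVD is usually preferred for numerical stability; the normal equations are used here only because they mirror the formula on the slide.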

Page 31:

Incremental Computation for AR Solution

Q: Can we estimate a incrementally?

A: Yes, with the ‘Recursive Least Squares’ (RLS) method

Main idea

The key step is to compute $(X^T X)^{-1}$

We can update $(X^T X)^{-1}$ efficiently by exploiting the special structure of $X^T X$

Reference: Yi et al., Online data mining for co-evolving time sequences, ICDE 2000

Page 32:

Recursive Least Squares

At time N+1, a new row $x_{N+1}^T$ arrives and $X_N$ (N × w) grows to $X_{N+1}$ ((N+1) × w)

Let $G_N = (X_N^T X_N)^{-1}$ (the “gain matrix”)

$G_{N+1}$ can be computed recursively from $G_N$
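The slides do not spell out the recursion, so the sketch below uses the standard rank-one update from the matrix inversion lemma (Sherman-Morrison); treat it as one common way to realize the idea, not necessarily the exact formulation of the cited paper. The function name, the initial gain of 10^4 times the identity, and the toy stream are assumptions.

```python
def rls_update(G, a, x, y):
    """Fold one new sample (x, y) into the gain matrix G and coefficients a.

    Uses G_new = G - (G x)(G x)^T / (1 + x^T G x), which relies on G being
    symmetric, and a_new = a + G_new x * (y - x^T a).
    """
    w = len(x)
    Gx = [sum(G[i][k] * x[k] for k in range(w)) for i in range(w)]
    denom = 1.0 + sum(x[i] * Gx[i] for i in range(w))
    gain = [v / denom for v in Gx]  # equals G_new x
    error = y - sum(a[i] * x[i] for i in range(w))  # prediction error before the update
    a_new = [a[i] + gain[i] * error for i in range(w)]
    G_new = [[G[i][j] - gain[i] * Gx[j] for j in range(w)] for i in range(w)]
    return G_new, a_new


# Assumed start: large diagonal gain (weak prior) and zero coefficients.
w = 2
G = [[1e4 if i == j else 0.0 for j in range(w)] for i in range(w)]
a = [0.0] * w

# Hypothetical stream generated exactly by y = 0.5 * x[0] + 0.25 * x[1].
stream = [([8, 4], 5), ([4, 8], 4), ([12, 4], 7), ([4, 12], 5)]
for x, y in stream:
    G, a = rls_update(G, a, x, y)
print(a)
```

Each update costs O(w^2) and never revisits old samples, in contrast to recomputing the batch solution from all N rows.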

Page 33:

RLS Example

[Figure: scatter plot of the dependent variable against the independent variable, with a ‘new point’ highlighted]

Page 35:

Adaptability of RLS

[Figure: scatter plot of the dependent variable against the independent variable]

Page 36:

Adaptability of RLS

[Figure: scatter plot of the dependent variable against the independent variable with a trend change; the fit labeled ‘(R)LS with no forgetting’ is shown]

Page 37:

Adaptability - ‘forgetting’

[Figure: scatter plot of the dependent variable against the independent variable with a trend change; two fits are shown, ‘(R)LS with no forgetting’ and ‘(R)LS with forgetting’]

RLS can easily handle forgetting
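One common way to add forgetting is an exponential forgetting factor that down-weights older samples; the slides do not say which scheme is meant, so the factor `lam`, the scalar (w = 1) setting, and the synthetic stream below are all assumptions made for illustration.

```python
def rls_forgetting_step(g, a, x, y, lam=1.0):
    """One scalar RLS step (lag window w = 1) with forgetting factor lam.

    g is the scalar gain, a the current slope, (x, y) the new sample,
    and 0 < lam <= 1; lam = 1 means no forgetting.
    """
    denom = lam + x * g * x
    gain = g * x / denom
    a_new = a + gain * (y - a * x)
    g_new = g / denom
    return g_new, a_new


def fit_stream(samples, lam):
    """Run RLS over the whole stream and return the final slope."""
    g, a = 1e6, 0.0  # assumed start: large gain, zero slope
    for x, y in samples:
        g, a = rls_forgetting_step(g, a, x, y, lam)
    return a


# Hypothetical stream with a trend change: y = 2x for 50 samples, then y = -x.
xs = [1 + (k % 5) for k in range(100)]
samples = [(x, 2.0 * x if k < 50 else -1.0 * x) for k, x in enumerate(xs)]

slope_no_forgetting = fit_stream(samples, lam=1.0)
slope_with_forgetting = fit_stream(samples, lam=0.9)
print(slope_no_forgetting, slope_with_forgetting)
```

Without forgetting, the final slope is a compromise between the two regimes; with forgetting, it ends up close to the slope of the most recent regime.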

Page 38:

What You Need to Know

Time Series Analysis

Many applications in finance, healthcare, smart house, fraud detection, etc.

Similarity Search

Given a time series, find the most similar time series

Forecasting

Autoregression (AR)

(Recursive) least squares for AR

Page 39:

Questions?