PATTERN DISCOVERY IN TIME SERIES: A SURVEY

DENIS KHRYASHCHEV, GRADUATE CENTER, CUNY, OCTOBER 2018

MOTIVATION

Often datasets represent processes that take place over long periods of time. Their outputs are measured at regular time intervals, creating discrete time series. For example, consider CitiBike demand and Fisher river temperature data.

Data sources: 1. https://s3.amazonaws.com/tripdata/index.html, 2. https://datamarket.com/data/set/235d/mean-daily-temperature-fisher-river-near-dallas-jan-01-1988-to-dec-31-1991

[Figure: CitiBike demand and Fisher river temperature time series]


MOTIVATION. COMPLEXITY

Complexity quantifies the internal structure of the underlying process. EEG data can be classified [1] as interictal, preictal, or seizure using their complexity.

[1] Petrosian, Arthur. "Kolmogorov complexity of finite sequences and recognition of different preictal EEG patterns." Computer-based medical systems, 1995., Proceedings of the Eighth IEEE Symposium on. IEEE, 1995.

[Figure: EEG voltage (µV) traces for interictal, preictal, and seizure states]


MOTIVATION. PERIODICITY

Natural phenomena like the Sun's activity and the Earth's rotation and revolution drive periodic human activity on a large scale. E.g., New York City's human mobility is highly periodic, with clear peaks in ridership from 6 AM to 10 AM and from 3 PM to 7 PM.

Image source: http://web.mta.info/mta/news/books/docs/Ridership_Trends_FINAL_Jul2018.pdf


MOTIVATION. PREDICTABILITY

Predictability estimates the expected accuracy of forecasting a given time series. Often there is a trade-off between the desired accuracy and computation time [2].

[2] Zhao, Kai, et al. "Predicting taxi demand at high spatial resolution: approaching the limit of predictability." Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 2016.


MOTIVATION. CLUSTERING

Image source: Denis Khryashchev’s summer internship at Simulmedia (Jun – Aug 2018).

The task of grouping time series that are similar in some respect often arises in the domains of transportation, finance, medicine, and others. Time-sensitive modifications of standard techniques are applied, e.g. k-means on autocorrelation functions.

[Figure: autocorrelation functions and the time series clustered together]


MOTIVATION. FORECASTING

Perhaps the most well-known and widely applied task related to time series is forecasting. Understanding time series periodicity, complexity, and predictability helps in selecting better predictors and optimizing parameters. E.g., knowing the periodicity P = 5 of a series, one can forecast by averaging values with lag 5.
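As a minimal sketch of this idea (our own illustration, not from the survey; the function name and toy series are assumptions), one can average the past values that share the same phase of the known period:

```python
import numpy as np

def seasonal_average_forecast(x, period=5, horizon=1):
    # For each future step, average all past values whose index is
    # congruent to that step modulo the known period.
    x = np.asarray(x, dtype=float)
    preds = []
    for h in range(1, horizon + 1):
        phase = (len(x) + h - 1) % period
        preds.append(x[phase::period].mean())
    return np.array(preds)

# Toy series with period 5 plus noise
rng = np.random.default_rng(0)
x = np.tile([1., 3., 2., 5., 4.], 20) + rng.normal(0, 0.1, 100)
print(seasonal_average_forecast(x, period=5, horizon=5))  # ~[1, 3, 2, 5, 4]
```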

Video source: Denis Khryashchev’s summer internship at Simulmedia (Jun – Aug 2018).


NOTATION

Throughout the presentation we consider time series of real values and use the following notation:

$X = \{X_1, \dots, X_N\} = \{X_t\}_1^N, \quad X_t \in \Re$

Not to be confused with set notation, $\{\cdot\}$ is used here to denote sequences.

A subsequence of the series 𝑋 that starts at period 𝑖 and ends at period 𝑗 is written as

$X_i^j = \{X_t\}_i^j = \{X_i, \dots, X_j\}, \quad i \le j$


ORGANIZATION OF THE PRESENTATION


1. KOLMOGOROV COMPLEXITY

For a time series $X$ we define the Kolmogorov complexity as the length of the shortest description of its sequence of values, ordered in time, in some fixed universal description language:

$K(X) = |d(X)|$

where $K$ is the Kolmogorov complexity and $d(X)$ is the shortest description of the time series $X$. Smaller values of $K(X)$ indicate lower complexity.


1. KOLMOGOROV COMPLEXITY. EXAMPLE

Given two time series

$X = \{0, 1, 0, 1, 0, 1, 0, 1, 0, 1\}$ and $Y = \{1, 0, 0, 1, 1, 1, 0, 0, 1, 0\}$

and selecting Python as our description language, we have the shortest descriptions

$d_P(X)$ = "[0,1]*5" and $d_P(Y)$ = "[1,0,0,1,1,1,0,0,1,0]"

quantifying a smaller "Pythonic" complexity for $X$ compared to $Y$:

$K_P(X) = |d_P(X)| = 7, \quad K_P(Y) = |d_P(Y)| = 21$
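A quick check of these string lengths in Python (our own sketch; `eval` on the description plays the role of the universal machine here):

```python
# Shortest "Pythonic" descriptions of the two series
d_X = "[0,1]*5"                  # eval(d_X) reproduces X
d_Y = "[1,0,0,1,1,1,0,0,1,0]"    # eval(d_Y) reproduces Y

assert eval(d_X) == [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
assert eval(d_Y) == [1, 0, 0, 1, 1, 1, 0, 0, 1, 0]
print(len(d_X), len(d_Y))        # 7 21
```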


1. KOLMOGOROV COMPLEXITY. LIMITATIONS

However, as proven by Kolmogorov in [3], and Chaitin and Arslanov in [4], the complexity $K$ is not a computable function in general.

3. Kolmogorov, Andrei N. "On tables of random numbers." Sankhyā: The Indian Journal of Statistics, Series A (1963): 369-376.
4. Chaitin, G. J., A. Arslanov, and C. Calude. "Program-size complexity computes the halting problem." Department of Computer Science, The University of Auckland, New Zealand, Tech. Rep., 1995.


1. LEMPEL-ZIV COMPLEXITY

Lempel and Ziv [5] proposed a combinatorial approximation of the complexity of finite sequences based on their production history. For a time series $X$ it is

$H(X) = \{X_1^{h_1}, X_{h_1+1}^{h_2}, \dots, X_{h_{m-1}+1}^{h_m}\}$

For the series $X = \{0,0,0,1,1,0,1,0,0,1,0,0,0,1,0,1\}$ one of the production histories is

$H(X) = \{0\} \cup \{0,0,1\} \cup \{1,0\} \cup \{1,0,0\} \cup \{1,0,0,0\} \cup \{1,0,1\}$

The overall complexity is the size of the shortest possible production history:

$c(X) = \min_{H(X)} |H(X)|$

Disadvantage: the actual values $X_t$ are treated as symbols, e.g. $c(X = \{1, 2, 1, 5, 1, 2\}) = c(Y = \{8, 0.5, 8, 0.1, 8, 0.5\})$.

5. Lempel, Abraham, and Jacob Ziv. "On the complexity of finite sequences." IEEE Transactions on information theory 22.1 (1976): 75-81.
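For illustration, here is a minimal sketch of the classic phrase-counting scheme (the Kaspar-Schuster formulation of Lempel-Ziv complexity; the function name is ours, and this is one of several equivalent counting conventions):

```python
def lz_complexity(seq):
    # Count the phrases of the exhaustive production history (LZ76),
    # following the Kaspar-Schuster counting scheme.
    s, n = list(seq), len(seq)
    c, l, i, k, k_max = 1, 1, 0, 1, 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:        # reached the end while still copying
                c += 1
                break
        else:
            k_max = max(k, k_max)
            i += 1
            if i == l:           # no earlier match long enough: new phrase
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

X = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1]
print(lz_complexity(X))  # 6, matching the 6-block production history above
```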


2. ENTROPY

Shannon and Weaver introduced entropy [6] as a measure of the information transmitted by a signal in a communication channel:

$H(X) = -\mathbb{E}[\log_2 P(X)]$

Rényi [7] generalized the definition for an ordinary discrete finite distribution of $X$, $\mathcal{P} = \{p_1, \dots, p_M\}$, $\sum_k p_k = 1$, to an entropy of order $\alpha$ ($\alpha \to 1$ recovers Shannon entropy):

$H_\alpha(X) = H_\alpha(\mathcal{P}) = \frac{1}{1-\alpha} \log_2 \sum_k p_k^\alpha$

Disadvantage: neither definition takes the order of the values $X_t$ into account, e.g. $H(X = \{1, 2, 3, 1, 2, 3\}) = H(Y = \{1, 3, 2, 2, 3, 1\})$.

6. Cover, Thomas M., and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
7. Rényi, Alfréd. On measures of entropy and information. Hungarian Academy of Sciences, Budapest, Hungary, 1961.
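A small empirical sketch (ours) that estimates these entropies from value frequencies and demonstrates the order-invariance just noted:

```python
import numpy as np
from collections import Counter

def renyi_entropy(x, alpha):
    # Empirical distribution over observed values; temporal order ignored
    p = np.array(list(Counter(x).values()), dtype=float)
    p /= p.sum()
    if np.isclose(alpha, 1.0):                    # Shannon limit
        return float(-(p * np.log2(p)).sum())
    return float(np.log2((p ** alpha).sum()) / (1.0 - alpha))

X = [1, 2, 3, 1, 2, 3]
Y = [1, 3, 2, 2, 3, 1]   # same values, different temporal order
print(renyi_entropy(X, 1.0) == renyi_entropy(Y, 1.0))  # True
print(renyi_entropy(X, 2.0) == renyi_entropy(Y, 2.0))  # True
```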


2. KOLMOGOROV ENTROPY

Entropy is often used as an approximation of complexity. Among the most well-known approximations [8] of complexity is the Kolmogorov entropy, defined as

$K = -\lim_{\tau \to 0} \lim_{\epsilon \to 0} \lim_{d \to \infty} \frac{1}{d\tau} \sum_{i_1, \dots, i_d} p(i_1, \dots, i_d) \ln p(i_1, \dots, i_d)$

It describes the complexity of a dynamic system with $F$ degrees of freedom. The $F$-dimensional phase space is partitioned into boxes of size $\epsilon^F$, $\tau$ stands for the time interval between measurements, and $p(i_1, \dots, i_d)$ is the joint probability that the $F$-dimensional point representing the values $X_{t=k\tau}$ falls inside the box $(i_1, \dots, i_d)$.

Disadvantage: the approximation is computable for analytically defined models; however, it is hard to calculate given only the resulting series.

8. Grassberger, Peter, and Itamar Procaccia. "Estimation of the Kolmogorov entropy from a chaotic signal." Physical review A 28.4 (1983): 2591.


2. ENTROPY WITH TEMPORAL COMPONENT

Another definition [6] of entropy takes into account the temporal order of the values $X_t$:

$H_t(X) = -\sum_{i=1}^{N} \sum_{j=i}^{N} P(X_i^j) \log_2 P(X_i^j)$

where $P(X_i^j)$ is the probability of the subsequence $X_i^j$. Computing $H_t(X)$ is $O(2^N)$.

The Lempel-Ziv estimator [9] approximates $H_t(X)$ and converges rapidly:

$H_{LZ}(X) = \left(\frac{1}{N} \sum_t s_t'\right)^{-1} \ln N$

where $s_t'$ is the length of the shortest subsequence starting at period $t$ that is observed for the first time.

Disadvantage: the values $X_t$ are treated as symbols, e.g. $H_{LZ}(X = \{1, 2, 1, 5\}) = H_{LZ}(Y = \{2, 9, 2, 3\})$.

6. Cover, Thomas M., and Joy A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
9. Kontoyiannis, Ioannis, et al. "Nonparametric entropy estimation for stationary processes and random fields, with applications to English text." IEEE Transactions on Information Theory 44.3 (1998): 1319-1327.
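A direct (quadratic-time) sketch of this estimator, with values treated as symbols; the function names are ours:

```python
import numpy as np

def _occurs(past, pattern):
    # True if `pattern` appears as a contiguous block of `past`
    m = len(pattern)
    return any(past[i:i + m] == pattern for i in range(len(past) - m + 1))

def lz_entropy_estimate(seq):
    # s'_t: length of the shortest subsequence starting at t that has
    # not been seen in seq[:t]; H_LZ = (mean of s'_t)^-1 * ln N
    s, n = list(seq), len(seq)
    lengths = []
    for t in range(n):
        k = 1
        while t + k <= n and _occurs(s[:t], s[t:t + k]):
            k += 1
        lengths.append(k)
    return n * np.log(n) / sum(lengths)

# Symbol identity is all that matters: both series share the pattern a,b,a,c
print(lz_entropy_estimate([1, 2, 1, 5]) == lz_entropy_estimate([2, 9, 2, 3]))  # True
```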


2. PERMUTATION ENTROPY

Bandt and Pompe [10] proposed the permutation entropy of order $n$:

$H(n) = -\sum_\pi p(\pi) \log p(\pi)$

where

$p(\pi) = \frac{\#\{t \mid 0 \le t \le T - n, \ \mathrm{type}(X_{t+1}, \dots, X_{t+n}) = \pi\}}{T - n + 1}$

is the frequency of permutations of type $\pi$. E.g., for $X = \{4, 7, 9, 10, 6, 11, 3\}$ and $n = 3$ we have
$\pi(4, 7, 9) = \pi(7, 9, 10) = \pi_{012}$ ($X_t < X_{t+1} < X_{t+2}$),
$\pi(9, 10, 6) = \pi(6, 11, 3) = \pi_{210}$ ($X_{t+2} < X_t < X_{t+1}$),
$\pi(10, 6, 11) = \pi_{102}$ ($X_{t+1} < X_t < X_{t+2}$).

The entropy becomes $H(3) = -2 \cdot \frac{2}{5} \log \frac{2}{5} - \frac{1}{5} \log \frac{1}{5} \approx 1.52$.

Disadvantage: the definition requires $X_t \neq X_{t+1}$ and has a complexity of $O(n!)$.

10. Bandt, Christoph, and Bernd Pompe. "Permutation entropy: a natural complexity measure for time series." Physical review letters 88.17 (2002): 174102.
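A compact sketch (ours) of this computation; ties are ignored, consistent with the $X_t \neq X_{t+1}$ assumption:

```python
import math
from collections import Counter

def permutation_entropy(x, n=3):
    # Ordinal pattern of each window = argsort of its n values
    patterns = Counter(
        tuple(sorted(range(n), key=lambda k: x[t + k]))
        for t in range(len(x) - n + 1)
    )
    total = sum(patterns.values())
    return -sum(c / total * math.log2(c / total) for c in patterns.values())

X = [4, 7, 9, 10, 6, 11, 3]
print(round(permutation_entropy(X, 3), 2))  # 1.52, as in the example above
```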


3. PREDICTABILITY

Following the Fano inequality [11], the predictability of a series $X$ satisfies $\Pi(X) \le \Pi_{max}(H(X), N)$, where $N$ is the number of unique values $X_t$ and $0 \le \Pi_{max} \le 1$ is the maximum predictability of $X$, with 0 standing for a completely unpredictable, chaotic series.

In the work of Song et al. [12] the maximum predictability is shown to be the solution of

$H(X) = -\Pi_{max} \log_2 \Pi_{max} - (1 - \Pi_{max}) \log_2(1 - \Pi_{max}) + (1 - \Pi_{max}) \log_2(N - 1)$.

Often the Lempel-Ziv estimator $H_{LZ}(X)$ is used to calculate the entropy of the series.

Disadvantage: depends on the selected measure of $H(X)$; the equation has no closed-form solution.

11. Fano, Robert M. The transmission of information. Cambridge, Mass., USA: Massachusetts Institute of Technology, Research Laboratory of Electronics, 1949.
12. Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018-1021.
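Since there is no closed form, $\Pi_{max}$ is found numerically. A sketch (ours) using a root finder; it assumes $H < \log_2 N$ so that a root exists in $(1/N, 1)$:

```python
import numpy as np
from scipy.optimize import brentq

def max_predictability(H, N):
    # Root of the Fano equation above in Pi, bracketed inside (1/N, 1)
    def fano(p):
        return (-p * np.log2(p) - (1 - p) * np.log2(1 - p)
                + (1 - p) * np.log2(N - 1) - H)
    return brentq(fano, 1.0 / N + 1e-9, 1.0 - 1e-9)

print(max_predictability(H=2.0, N=10))  # ~0.66: at most ~66% of values guessable
```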


3. PREDICTABILITY. SHUFFLING

An alternative approach to measuring predictability was proposed by Kaboudan [13], who defined it as a ratio of forecasting errors before and after shuffling the series:

$\eta(X) = 1 - \frac{(X - f(X))^T (X - f(X))}{(X_s - f(X_s))^T (X_s - f(X_s))}$

where $X_s$ is a shuffled copy of the series and $f$ is the selected predictor.

Disadvantage: the shuffling is done only once, which can lead to inconsistent results when the measure is computed repeatedly for the same time series $X$. A significant improvement would be to calculate $\eta(X)$ many times (e.g. 1000), compute a $p$-value, and perform hypothesis testing.

13. Kaboudan, M. A. "A measure of time series' predictability using genetic programming applied to stock returns." Journal of Forecasting 18.5 (1999): 345-357.
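A sketch (ours) of $\eta$ with the repeated-shuffling improvement suggested above; `predictor` is any function returning in-sample fitted values:

```python
import numpy as np

def eta_predictability(x, predictor, n_shuffles=1000, seed=0):
    # eta = 1 - SSE(original) / SSE(shuffled), averaged over reshuffles
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    sse = lambda a, b: float((a - b) @ (a - b))
    err = sse(x, predictor(x))
    etas = [1.0 - err / sse(xs, predictor(xs))
            for xs in (rng.permutation(x) for _ in range(n_shuffles))]
    return np.mean(etas), np.std(etas)

# Naive lag-1 predictor as an example f
lag1 = lambda v: np.concatenate([[v[0]], v[:-1]])
x = np.sin(np.arange(200) / 3.0)
print(eta_predictability(x, lag1, n_shuffles=100))  # mean eta near 1: predictable
```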


3. PREDICTABILITY. REGRESSION MODELS

In econometrics, predictability is often quantified with linear regression models [14]

$X_{t+1} = \alpha + \beta X_t + \epsilon_{t+1}$

that are used to calculate $R^2$:

$R^2 = 1 - \frac{Var(\alpha + \beta X_t)}{Var(X_{t+1})}$

ranging from 0 to 1, with the latter representing the most unpredictable series.

Disadvantage: captures only linear relationships unless non-linear regression models are used.

14. Campbell, John Y., and Motohiro Yogo. "Efficient tests of stock return predictability." Journal of financial economics 81.1 (2006): 27-60.
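A least-squares sketch (ours) of this quantity, following the slide's convention that values near 1 mean unpredictable:

```python
import numpy as np

def regression_predictability(x):
    # Fit X_{t+1} = a + b * X_t, then R^2 = 1 - Var(fitted) / Var(target)
    x = np.asarray(x, dtype=float)
    b, a = np.polyfit(x[:-1], x[1:], deg=1)
    fitted = a + b * x[:-1]
    return 1.0 - fitted.var() / x[1:].var()

print(regression_predictability(np.arange(50.0)))  # ~0: highly predictable
print(regression_predictability(
    np.random.default_rng(0).normal(size=500)))    # ~1: white noise
```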


4. PERIODICITY. FOURIER

Numerous approaches to quantify time series periodicity rely on the Fourier transform

$M_k = \sum_{t=0}^{N-1} X_t e^{-i 2\pi k t / N}$

where $M_k$ is the "magnitude" of the $k$-th frequency, quantifying the "relative chance" of the value $X_t$ repeating after the corresponding period of time.

To decrease the number of spurious artifacts, a windowed or short-time Fourier transform is usually applied [15]:

$M_k = \sum_{t=0}^{N-1} X_t w(t - \tau) e^{-i 2\pi k t / N}$

where $w$ is a window function of effective size $2\tau$. Often Blackman, Hamming, and Bartlett windows are used.

Disadvantage: not linear in period.

15. Allen, Jont B., and Lawrence R. Rabiner. "A unified approach to short-time Fourier analysis and synthesis." Proceedings of the IEEE 65.11 (1977): 1558-1564.
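A sketch (ours) that applies a window before the DFT and reads off the dominant period $N/k$:

```python
import numpy as np

def dominant_period(x, window=np.hamming):
    # Window the demeaned series, take DFT magnitudes, pick the peak
    x = np.asarray(x, dtype=float)
    n = len(x)
    mags = np.abs(np.fft.rfft((x - x.mean()) * window(n)))
    k = 1 + int(np.argmax(mags[1:]))   # skip the DC component
    return n / k

t = np.arange(480)
x = np.sin(2 * np.pi * t / 24) + 0.3 * np.random.default_rng(1).normal(size=480)
print(dominant_period(x))              # 24.0
```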


4. PERIODICITY. FOURIER AND AUTOCORRELATION

The linear autocorrelation function can be used to evaluate periodicity due to the Wiener-Khinchin theorem [16][17], which states that

$S_{xx}(f) = \sum_{t=0}^{N-1} r_{xx}(t) \, e^{-i 2\pi f t / N}$

where $S_{xx}(f)$ is the power spectrum of $X$ and $r_{xx}$ is its autocorrelation function. In other words, larger values of the autocorrelation function of $X$ at lag $\tau$ can signify stronger periodicity of the series at lag $\tau$.

16. Wiener, Norbert. "Generalized harmonic analysis." Acta Mathematica 55.1 (1930): 117-258.
17. Cohen, Leon. "The generalization of the Wiener-Khinchin theorem." Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on. Vol. 3. IEEE, 1998.
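A sketch (ours) of the autocorrelation route to the period: compute $r_{xx}(\tau)$ and take the lag of its largest peak:

```python
import numpy as np

def autocorr(x, max_lag):
    # Sample autocorrelation r_xx(tau) for tau = 0..max_lag
    x = np.asarray(x, dtype=float) - np.mean(x)
    full = np.correlate(x, x, mode="full")[len(x) - 1:]
    return full[:max_lag + 1] / full[0]

r = autocorr(np.tile([0., 1., 2., 1.], 50), max_lag=10)
print(1 + int(np.argmax(r[1:])))  # 4: the period of the repeating block
```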


4. PERIODICITY. FISHER TEST

A similar test based on the notion of periodograms was proposed by Fisher [18]

$I(f) = \frac{2}{N}\left(\sum_{t=0}^{N-1} X_t \cos 2\pi f t\right)^2 + \frac{2}{N}\left(\sum_{t=0}^{N-1} X_t \sin 2\pi f t\right)^2$

$W = \max_f I(f)$

where the frequency $-\frac{1}{2} \le f \le \frac{1}{2}$. The test assumes that $X = \varsigma + a$, with $\varsigma$ being the real periodic signal and $a \sim N(0, \sigma^2)$ the unobserved noise. $W$ is measured to reject $H_0: \varsigma = 0$.

Disadvantage: like the original Fourier transform, it produces spurious artifacts. It can be improved with window functions.

18. Fisher, Ronald Aylmer. "Tests of significance in harmonic analysis." Proc. R. Soc. Lond. A 125.796 (1929): 54-59.


4. PERIODICITY TRANSFORM

Sethares and Staley [19] proposed a linear-in-period transformation of time series. They defined $X$ to be $p$-periodic if $X_t = X_{t+p}$, and $P_p$ to be the set of all $p$-periodic sequences.

They introduced non-orthogonal basis sequences that are 𝑝-periodic

$\delta_p^s(j) = \begin{cases} 1 & \text{if } (j - s) \bmod p = 0 \\ 0 & \text{otherwise} \end{cases}$

where $j$ is the time index of the series $X$ and $s$ is the time shift. The measure of periodicity is the projection

$\pi(X, P_p) = \sum_{s=0}^{p-1} \alpha_s \delta_p^s, \quad \alpha_s = \frac{p}{N} \sum_{t=0}^{N/p - 1} X_{s + tp}$

Disadvantage: the sampling interval of $X$ or its entire length must correspond to an integer factor of $p$.

19. Sethares, William A., and Thomas W. Staley. "Periodicity transforms." IEEE transactions on Signal Processing 47.11 (1999): 2953-2964.
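A sketch (ours) of the projection onto $P_p$ by residue-class averaging; it assumes len(x) is an integer multiple of p, matching the disadvantage noted above:

```python
import numpy as np

def project_periodic(x, p):
    # alpha_s = mean of x over residue class s (columns of the reshape);
    # the projection tiles those averages back out to the series length
    x = np.asarray(x, dtype=float)
    alphas = x.reshape(-1, p).mean(axis=0)
    return np.tile(alphas, len(x) // p)

x = np.array([1., 2., 3.] * 4)
print(project_periodic(x, 3))                # [1. 2. 3. 1. 2. 3. ...]
print(np.sum(project_periodic(x, 3) ** 2))   # energy captured by P_3
```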


5. SIMILARITY AND DEPENDENCE. LINEAR CORRELATION

Linear correlation is a standard and well-known dependence measure. For two time series $X$ and $Y$ it is defined as

$\rho_{X,Y} = \rho(X, Y) = \frac{\mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]}{\sigma_X \sigma_Y}$

Often it is more convenient to standardize the series, $X' = \frac{X - \mathbb{E}[X]}{\sigma_X}$ and $Y' = \frac{Y - \mathbb{E}[Y]}{\sigma_Y}$, so that

$\rho(X, Y) = \mathbb{E}[X' Y']$

Ranging over $-1 \le \rho(X, Y) \le 1$, it captures the linear dependence between $X$ and $Y$.

Limitation: does not capture non-linear relationships between $X$ and $Y$.


5. SIMILARITY AND DEPENDENCE. RENYI POSTULATES

Rényi [20] considered a general measure of dependence $\delta(X, Y)$ and postulated:

• $\delta(X, Y)$ is defined for any pair of random variables $X, Y$, neither of them being constant with probability 1

• It is symmetric, $\delta(X, Y) = \delta(Y, X)$

• $\delta(X, Y) = 0$ if and only if $X$ and $Y$ are independent

• $0 \le \delta(X, Y) \le 1$

• $\delta(X, Y) = 1$ only if there is a strict dependence between $X$ and $Y$

• $\delta(X, Y) = \delta(\mathrm{Id}(X), \mathrm{Id}(Y))$ where $\mathrm{Id}$ is a Borel-measurable identity function

• If the joint distribution of $X$ and $Y$ is normal then $\delta(X, Y) = |\rho(X, Y)|$.

20. Rényi, Alfréd. "On measures of dependence." Acta mathematica hungarica 10.3-4 (1959): 441-451.


5. SIMILARITY AND DEPENDENCE. MAXIMAL CORRELATION COEFFICIENT

The maximal correlation coefficient proposed by Gebelein [21] satisfies all of Rényi's postulates and is defined as

$\rho_{max}(X, Y) = \max_{f, g} \rho(f(X), g(Y))$

where $f: \Re \to \Re$ and $g: \Re \to \Re$ are Borel-measurable functions with $0 \le Var(f(X)) < B$ and $0 \le Var(g(Y)) < B$, $B \in \Re$.

$\rho_{max}(X, Y) = 0$ if $X$ and $Y$ are independent; $\rho_{max}(X, Y) = 1$ if either $X = f(Y)$ or $Y = g(X)$.

Limitation: the functions $f^*, g^*$ that maximize $\rho(f(X), g(Y))$ are not guaranteed to have inverses $(f^*)^{-1}, (g^*)^{-1}$.

21. Gebelein, Hans. "Das statistische Problem der Korrelation als Variations‐und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung." ZAMM‐Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte Mathematik und Mechanik 21.6 (1941): 364-379.


5. MAXIMAL VS LINEAR AUTOCORRELATION

CitiBike pickup data has a daily periodicity depicted by the linear autocorrelation function. However, the maximal autocorrelation function demonstrates the presence of non-linear dependencies, since $\rho_{max}(X_1^{N-\tau}, X_\tau^N) > \rho(X_1^{N-\tau}, X_\tau^N)$ for every lag $\tau$.


5. SOME PROPERTIES OF MAXIMAL CORRELATION

Dembo et al. [22] demonstrated that the maximal correlation coefficient equals

$\rho_{max}(S_m, S_n) = \sqrt{m/n}$

where $S_k = \sum_{i=1}^{k} Y_i$, and $Y_1, \dots, Y_N$ is a collection of independent identically distributed random variables with $Var(Y_i) < \infty$.

Lancaster [23] showed that if $(X, Y)$ is a bivariate Gaussian vector, the maximal correlation coefficient is equal to

$\rho_{max}(X, Y) = |\rho(X, Y)|$

22. Dembo, Amir, Abram Kagan, and Lawrence A. Shepp. "Remarks on the maximum correlation coefficient." Bernoulli 7.2 (2001): 343-350.
23. Lancaster, Henry Oliver. "Some properties of the bivariate normal distribution considered in the form of a contingency table." Biometrika 44.1/2 (1957): 289-292.


5. SOME PROPERTIES OF MAXIMAL CORRELATION

Witsenhausen [24] proposed a way to calculate the maximal correlation coefficient for discrete $X$ and $Y$:

• First, the ordered sets $\alpha$ and $\beta$ that contain the unique values of $X$ and $Y$ are built.

• Second, the probabilities $p_{ij}$ from the contingency table for $X$ and $Y$ are calculated, $p_{ij} = P(X = \alpha_i \mid Y = \beta_j)$.

• Third, the normalized joint-probability matrix $Q = (q_{ij})$ is computed:

$q_{ij} = \frac{p_{ij}}{\sqrt{p_{(i\cdot)} \, p_{(\cdot j)}}}$

Then the maximal correlation coefficient is equal to

$\rho_{max}(X, Y) = \lambda_2$

where $\lambda_2$ is the second singular value of $Q$.

24. Witsenhausen, Hans S. "On sequences of pairs of dependent random variables." SIAM Journal on Applied Mathematics 28.1 (1975): 100-113.
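A sketch (ours) of this computation for discrete samples, taking $p_{ij}$ as the joint contingency probabilities and the marginals as $p_{(i\cdot)}, p_{(\cdot j)}$:

```python
import numpy as np

def maximal_correlation(x, y):
    # Joint contingency table over the unique values of x and y
    xs, ys = np.unique(x), np.unique(y)
    P = np.zeros((len(xs), len(ys)))
    for a, b in zip(x, y):
        P[np.searchsorted(xs, a), np.searchsorted(ys, b)] += 1
    P /= P.sum()
    pi, pj = P.sum(axis=1), P.sum(axis=0)
    Q = P / np.sqrt(np.outer(pi, pj))            # q_ij = p_ij / sqrt(p_i. p_.j)
    return np.linalg.svd(Q, compute_uv=False)[1]  # second singular value

rng = np.random.default_rng(0)
x = rng.integers(0, 4, 2000)
y = (x - 2) ** 2 + rng.integers(0, 2, 2000)      # non-linear dependence on x
print(maximal_correlation(x, y))                  # high: strong non-linear link
```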


5. MONOTONE CORRELATION

The monotone correlation coefficient was proposed by Kimeldorf and Sampson [25] and is defined as follows: let $\mathcal{F} = \{f: \Re \to \Re \mid f \text{ is monotone}\}$; then the monotone correlation is

$\rho_{mono}(X, Y) = \max_{f, g \in \mathcal{F}} \rho(f(X), g(Y))$

Overall, we have the following relationship between the linear, monotone, and maximal correlation coefficients:

$0 \le |\rho(X, Y)| \le \rho_{mono}(X, Y) \le \rho_{max}(X, Y) \le 1$

Limitation: there is no known formula that computes the value of the monotone correlation coefficient; it is a maximization problem.

25. Kimeldorf, George, and Allan R. Sampson. "Monotone dependence." The Annals of Statistics (1978): 895-903.


6. CLUSTERING

We denote by $x$ a collection of time series

$x = \{X_1, \dots, X_N\}, \quad X_i = \{X_{i,1}, X_{i,2}, \dots, X_{i,M}\}, \quad X_{i,j} \in \Re$.

We define clustering as an unsupervised partitioning of $x$ into $k$ groups, assigning one label from the set of labels $\{C_1, \dots, C_k\}$ to every time series $X_i$ such that each label $C_i$ is assigned at least once.


6. CLUSTERING. NAÏVE APPROACH

Clustering discrete time series $X_1, X_2, X_3$ into two groups by naïvely applying k-means results in clusters $\{X_2, X_3\}$ and $\{X_1\}$. However, it seems more reasonable to group $X_1$ and $X_2$ together. Using the autocorrelation functions $R = \{r_{xx}(X_1), r_{xx}(X_2), r_{xx}(X_3)\}$ instead of the actual values groups $X_1$ and $X_2$ in the same cluster.

$r_{xx}(X) = \{\rho(X, L^\tau X) \mid 2 \le \tau \le N - 1\}, \quad L^\tau X_t = X_{t - \tau}$
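A sketch (ours) of k-means on autocorrelation features using scikit-learn; the toy series are assumptions chosen so that temporal shape, not amplitude, drives the grouping:

```python
import numpy as np
from sklearn.cluster import KMeans

def acf(x, max_lag):
    # Sample autocorrelations at lags 1..max_lag
    x = np.asarray(x, dtype=float) - np.mean(x)
    d = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / d for k in range(1, max_lag + 1)])

t = np.arange(64)
series = [np.sin(2 * np.pi * t / 4),          # period 4
          3 * np.sin(2 * np.pi * t / 4 + 1),  # period 4, larger amplitude
          np.sin(2 * np.pi * t / 16)]         # period 16
features = np.vstack([acf(s, 20) for s in series])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features))
# e.g. [0 0 1]: the two period-4 series end up in the same cluster
```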


6. TYPES OF CLUSTERING APPROACHES

• Raw data. Standard clustering techniques applied to raw data with modified distance or dissimilarity measures.

• Feature generation. New features are generated from raw data and clustered with standard techniques.

• Model assumption. Clustering based on model parameters, hypothesis testing.


6. RAW DATA CLUSTERING

Košmelj and Batagelj [26] modified relocation clustering, introducing a new similarity measure

$D(X, Y) = \sum_t \alpha_t d_t(X, Y), \quad \alpha_s \neq \alpha_t, \ \alpha_t \ge 0, \ \sum_t \alpha_t = 1$

Golay et al. [27] proposed the cross-correlation distances

$d_{cc}^1(X, Y) = \left(\frac{1 - \rho(X, Y)}{1 + \rho(X, Y)}\right)^\beta, \quad d_{cc}^2(X, Y) = \sqrt{2(1 - \rho(X, Y))}$

Limitations: most of the approaches do not take into account sequences of values $X_{t-k}, X_{t-k+1}, \dots, X_t$ but instead compare neighboring values.

26. Košmelj, Katarina, and Vladimir Batagelj. "Cross-sectional approach for clustering time varying data." Journal of Classification 7.1 (1990): 99-109.
27. Golay, Xavier, et al. "A new correlation-based fuzzy logic clustering algorithm for FMRI." Magnetic Resonance in Medicine 40.2 (1998): 249-260.


6. RAW DATA CLUSTERING

Möller-Levet et al. [28] described the short time series (STS) distance

$d_{STS}^2(X, Y) = \sum_k \left(\frac{\delta X_k}{\delta t_k} - \frac{\delta Y_k}{\delta t_k}\right)^2, \quad \delta X_k = X_k - X_{k-1}$

Batista et al. [29] showed that existing distance measures are not efficient for complex time series and proposed a complexity-invariant distance measure (CID):

$CID(X, Y) = \|X - Y\| \cdot \frac{\max(CE(X), CE(Y))}{\min(CE(X), CE(Y))}$

where $CE(X) = \|X_2^N - X_1^{N-1}\|$ estimates the distance between the time series and its lagged counterpart.

Limitations: these measures do not take subsequences into account but compare series pairwise.

28. Möller-Levet, Carla S., et al. "Fuzzy clustering of short time-series and unevenly distributed sampling points." International Symposium on Intelligent Data Analysis. Springer, Berlin, Heidelberg, 2003.
29. Batista, Gustavo EAPA, Xiaoyue Wang, and Eamonn J. Keogh. "A complexity-invariant distance measure for time series." Proceedings of the 2011 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics, 2011.
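A sketch (ours) of the CID computation:

```python
import numpy as np

def cid_distance(x, y):
    # Euclidean distance scaled by the ratio of complexity estimates;
    # CE(s) = ||s_2^N - s_1^{N-1}|| = sqrt(sum of squared first differences)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ce = lambda s: np.sqrt(np.sum(np.diff(s) ** 2))
    return np.linalg.norm(x - y) * max(ce(x), ce(y)) / min(ce(x), ce(y))

smooth = np.linspace(0, 1, 100)
wiggly = smooth + 0.1 * np.sin(40 * smooth)
print(cid_distance(smooth, wiggly))  # larger than the plain Euclidean distance
```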


6. FEATURE CLUSTERING

The approaches include clustering of autocorrelation functions, Fourier and wavelet transformations, dimensionality reduction, and other transformations of raw time series.

Vlachos et al. [30] proposed running the k-means clustering algorithm on level $i$ of the Haar wavelet representation of the data, projecting the final centers obtained for level $i$ from space $2^i$ to space $2^{i+1}$ for level $i + 1$; if any time series are swapped between clusters, the previous steps are repeated.

Fu et al. [31] proposed time series smoothing and dimensionality reduction prior to clustering with self-organizing maps. Only the Perceptually Important Points (PIP) that match predefined query points $Q$ are kept: $D(PIP, Q) = \mathbb{E}[(T(PIP) - T(Q))^2]$.

Limitations: the approaches are very specific and require data engineering.

30. Vlachos, Michail, et al. "A wavelet-based anytime algorithm for k-means clustering of time series." Proc. Workshop on Clustering High Dimensionality Data and Its Applications, 2003.
31. Fu, Tak-chung, et al. "Pattern discovery from stock time series using self-organizing maps." Workshop Notes of KDD2001 Workshop on Temporal Data Mining, 2001.


6. MODEL-BASED CLUSTERING

The main assumption for these approaches is that there exists an underlying generating process that can be modeled with certain models (e.g. ARMA and its modifications).

Maharaj [32] proposed a clustering approach based on the $\chi^2$ test statistic. They assumed that time series $X$ and $Y$ are generated by autoregressive processes of order $k$, AR($k$), with parameters $\theta^X = \{\theta_1^X, \dots, \theta_k^X\}$ and $\theta^Y = \{\theta_1^Y, \dots, \theta_k^Y\}$. Setting the null hypothesis $H_0: \theta^X = \theta^Y$, they cluster $X$ and $Y$ together if the $p$-value is greater than a predefined threshold.

Disadvantage: the main weakness of the approach is the simplicity and linearity of the AR($k$) model. In general, it may be hard to select an appropriate model.

32. Maharaj, Elizabeth Ann. "Cluster of time series." Journal of Classification 17.2 (2000): 297-314.


7. FORECASTING. LEMPEL-ZIV AND MARKOV CHAINS

The LZW algorithm partitions time series $X$ into a collection of subsequences $S_0, S_1, \dots, S_M$, where $S_k$ starts at time $k$ and represents the shortest previously unobserved subsequence. The prediction is made as

$P(X_{t+1} = \beta \mid X_1^t) = \frac{C(S_k \beta \mid X_1^t)}{C(S_k \mid X_1^t)}$

where $C(S_k \beta \mid X_1^t)$ stands for the number of times $S_k$ is followed by $\beta$ in $X_1^t$.

The Markov chain predictor assumes $P(X_{t+1} = \beta \mid X_1^t) = P(X_{t+k+1} = \beta \mid X_{k+1}^{t+k})$.

Denoting repeating subsequences $X_{t-k+1}^t = X_{t+1}^{t+k} = \{S_1, \dots, S_k\} = S$, we predict

$P(X_{t+1} = \beta \mid X_1^t) = \frac{C(S \beta \mid X_1^t)}{C(S \mid X_1^t)}$

Limitations: LZW and Markov chain-based predictors treat numeric values as symbols and cannot capture the actual magnitudes of the values $X_t$ of a time series.
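A sketch (ours) of an order-$k$ Markov chain predictor of this kind, with values treated as symbols:

```python
from collections import Counter, defaultdict

def markov_predict(x, order=2):
    # P(next = b | last `order` values = S) = C(S, b) / C(S)
    counts = defaultdict(Counter)
    for t in range(order, len(x)):
        counts[tuple(x[t - order:t])][x[t]] += 1
    context = tuple(x[-order:])
    if not counts[context]:
        return None                      # context never seen before
    return counts[context].most_common(1)[0][0]

print(markov_predict([1, 2, 3, 1, 2, 3, 1, 2], order=2))  # 3
```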


7. FORECASTING. MACHINE LEARNING. SVM AND NN

We partition time series $X$ into overlapping subseries of size $p$, creating feature vectors $x \in \Re^{(p-1) \times (N-p+1)}$ and label vectors $y \in \Re^{N-p+1}$:

$x = \{X_1^{p-1}, X_2^p, \dots, X_{N-p+1}^{N-1}\}, \quad y = X_p^N$

Standard models are fit on 𝑥 and 𝑦. Support Vector Machines [33] (SVM) minimize

$\frac{1}{N-p+1} \sum_{i=1}^{N-p+1} \max\left(0, 1 - y_i (w^T x_i - b)\right) + \lambda \|w\|^2$

Basic neural networks (NN) fit the model [34]

$f(x_i) = w_2^T \phi(w_1^T x_i + b_1) + b_2$

Limitations: this machine learning approach does not take into account the order within each feature vector $x_i$.

33. Wu, Qiang, and Ding-Xuan Zhou. "SVM soft margin classifiers: linear programming versus quadratic programming." Neural computation 17.5 (2005): 1160-1187.
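A sketch (ours) of the sliding-window setup with scikit-learn; SVR (the regression variant of SVM) is used here since forecasting targets are real-valued:

```python
import numpy as np
from sklearn.svm import SVR

def sliding_windows(x, p):
    # Feature windows of length p-1 paired with the next value as label
    x = np.asarray(x, dtype=float)
    X = np.array([x[i:i + p - 1] for i in range(len(x) - p + 1)])
    return X, x[p - 1:]

series = np.sin(np.arange(200) / 5.0)
X, y = sliding_windows(series, p=10)
model = SVR().fit(X[:-20], y[:-20])   # hold out the last 20 windows
print(model.predict(X[-20:])[:3])     # one-step-ahead forecasts
```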


7. FORECASTING. ARIMA MODELS

ARIMA models [34] seem to be the most natural predictors for time series. They take into account autoregression and an integrated moving average. ARIMA is defined as

$\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d X_t = \delta + \left(1 + \sum_{i=1}^{q} \theta_i L^i\right) \epsilon_t$

where $\theta, \phi$ are the moving-average and autoregression parameters, $p$ and $q$ define the lags of the autoregression and moving average, $d$ defines the order of integration, $\epsilon$ stands for a normally distributed error, $L^k X_t = X_{t-k}$, and $\delta$ regulates the drift of the model.

The parameters are often selected with the Akaike Information Criterion (AIC) [35]

$AIC(p, d, q) = -2 \log \mathcal{L} + 2(p + q + k + 1)$

where $\mathcal{L}$ is the maximized likelihood and $k = 1$ if there is a constant term.

Limitations: does not capture non-linear relations within time series.

34. Box, George E. P., et al. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.
35. Bozdogan, Hamparsum. "Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions." Psychometrika 52.3 (1987): 345-370.
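A sketch (ours) of AIC-based order selection using statsmodels; the library choice and the toy series are assumptions, as the survey does not prescribe an implementation:

```python
import numpy as np
from itertools import product
from statsmodels.tsa.arima.model import ARIMA

series = np.cumsum(np.random.default_rng(0).normal(size=300))  # toy I(1) series

# Grid-search small (p, d, q) orders and keep the lowest AIC
best = min(
    (ARIMA(series, order=(p, d, q)).fit().aic, (p, d, q))
    for p, d, q in product(range(3), range(2), range(3))
)
print(best)  # (lowest AIC, best (p, d, q))
```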


7. FORECASTING. RNN

Recurrent Neural Network (RNN) [36] models are among the first to capture temporal dependencies within the subsequences of time series. A fully interconnected model:

$X_t = \sum_{i=1}^{I} W_i g_i(t)$

$g_i(t) = f\left(\sum_{j=1}^{\max(p,q)} w_{ij} X_{t-j} + \sum_{k=1}^{q} \sum_{l=1}^{I} w_{ilk} \, g_l(t-k) + \theta_i\right)$

Limitations: the models are too complex and impractical for storing long subsequences.

36. Connor, Jerome T., R. Douglas Martin, and Les E. Atlas. "Recurrent neural networks and robust time series prediction." IEEE transactions on neural networks 5.2 (1994): 240-254.


7. FORECASTING. LSTM

Long Short-Term Memory (LSTM) [37] networks introduced neurons with internal structure. An LSTM cell contains 3 gates: input, output, and forget. There are 2 input gates:

$i_1 = \tanh(b^{i_1} + X_t W_1^{i_1} + h_{t-1} W_2^{i_1}), \quad i_2 = \sigma(b^{i_2} + X_t W_1^{i_2} + h_{t-1} W_2^{i_2})$

1 forget gate

$f = \sigma(b^f + X_t W_1^f + h_{t-1} W_2^f)$

and 1 output gate

$o = \sigma(b^o + X_t W_1^o + h_{t-1} W_2^o)$

Overall, the model is a combination

$h_t = \tanh(f + i_1 \otimes i_2) \otimes o$

37. Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural Computation 9.8 (1997): 1735-1780.


SUMMARY

1. Combinatorial estimators of complexity and entropy, as well as predictors including the Lempel-Ziv approximator, permutation entropy, Markov chains, and Lempel-Ziv predictors, treat time series as sequences of symbols and do not take the actual values into account.

2. Methods quantifying predictability rely on entropy or assumed models.

3. Dependency measures efficient at capturing non-linear patterns use transformations that are not guaranteed to have inverses.

4. The majority of the distance measures proposed for time series clustering compare values pairwise and do not take subsequences into account.

5. Machine learning approaches to clustering and time series forecasting do not take the order within features into account.


FUTURE DIRECTIONS

1. Creating new non-linear dependency measures with transformations that are guaranteed to have inverses (developing approximations for the maximal and monotone correlation).

2. Generalizing combinatorial methods for complexity and entropy estimation so they take the magnitudes of time series values into account.

3. Combining dependency measures with neural networks. Is there a link, and can one improve the other?


THANK YOU!

QUESTIONS?