Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter...

28
Gaussian Processes for Time Series Forecasting Dr. Juan Orduz PyCon DE & PyData Berlin 2019

Transcript of Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter...

Page 1: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Gaussian Processes for Time SeriesForecasting

Dr. Juan Orduz

PyCon DE & PyData Berlin 2019

Page 2: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

OverviewAim: Give a better intuition on the notion of Gaussian Process Regression

Introduction

Bayesian Linear Regression

The Kernel Trick

Gaussian Process Regression

Kernel Examples

Non-Linear Example (RBF)

The Kernel Space

Example: Time Series

Page 3: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Multivariate Normal Distribution[5]

X = (X1, · · · ,Xd ) has a multinormal distribution if every linearcombination is normally distributed. The joint density has the form

p(x |m,K0) =1√

(2π)d |K0|exp

(−1

2(x −m)T K−1

0 (x −m)

)where m ∈ Rd is the mean vector and K0 ∈ Md (R) is the (symmetric,positive definite) covariance matrix.

Figure: Left: Multivariate Normal Distribution, Right: Non-Multivariate NormalDistribution

Page 4: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Bayesian Linear Regression[2], [7, Chapter 2.1]

Let x1, · · · , xn ∈ Rd and y1, · · · , yn be a set of observations (data). Wewant to fit the linear model

f (x) = xT b and y = f (x) + ε, with ε ∼ N(0, σ2n)

where b ∈ Rd denotes the parameter vector. Let X ∈ Md×n denotethe observation matrix.

Figure: The true parameters in this example are b = (1, 3) and σn = 0.5

We want to compute p(b|X , y) using the Bayes theorem

p(b|X , y) =p(y |X ,b)p(b)

p(y |X )∝ likelihood× prior

Page 5: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Prior DistributionI Prior

b ∼ N(0,Σp), Σp ∈ Md (R)

Figure: Prior Distribution, for this example Σp =

(2 11 2

).

I Likelihood

p(y |X ,b) =n∏

i=1

p(yi |xi ,b) = N(X T b, σ2n I)

Page 6: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Posterior Distribution Sampling[9]

Figure: Posterior distribution of the b weight estimation using MCMCsampling (PyMC3). Here we use 3 chains.

Page 7: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Posterior Distribution - Analytical Solution[7, Chapter 2.1.1]

I Posterior

p(b|y ,X ) = N(

b̄ =1σ2

nA−1Xy ,A−1

), A = σ−2

n XX T + Σ−1p

Figure: Posterior Distribution.

Page 8: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Predictive Distribution - Analytical Solution[7, Chapter 2.1.1]

p(f∗|x∗,X , y) =

∫p(f∗|x∗,b)p(b|X , y)db = N

(1σ2

nxT∗ A−1Xy , xT

∗ A−1x∗

)

Figure: Prediction Interval.

Page 9: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

The Kernel Trick[7, Chapter 2.1.2]

Let us consider a map φ : Rd −→ RN and the model

f (x) = φ(x)T b and y = f (x) + ε, with ε ∼ N(0, σ2n).

It is easy to verify that the analysis for this model as analogous to thestandard linear model replacing X with Φ := φ(X ). Set φ∗ = φ(x∗),

p(f∗|x∗,X , y) =N

(1)︷ ︸︸ ︷

1σ2

nφT∗A−1Φy , φT

∗A−1φ∗︸ ︷︷ ︸(2)

(1) =φT∗ΣpΦ(ΦT ΣpΦ + σ2

n I)−1y

(2) =φT∗Σpφ∗ − φT

∗ΣpΦ(ΦT ΣpΦ + σ2n I)−1ΦT Σpφ∗

This motivates the definition of the covariance function or kernel

k(x , x ′) := φ(x)T Σpφ(x ′)

Page 10: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Gaussian Process[1, Chapter 21], [7, Chapter 2.2]

Main IdeaThe specification of a covariance function implies a distribution overfunctions.

Gaussian ProcessI A Gaussian Process is a collection of random variables, any

finite number of which have a joint multinormal distribution.I A Gaussian process f ∼ GP(m, k) is completely specified by its

mean function m(x) and covariance function k(x , x ′). Herex ∈ X denotes a point on the index set X .

m(x) = E [f (x)] and k(x , x ′) = E [(f (x)−m(x))(f (x ′)−m(x ′))]

ExampleThe map f (x) = φ(x)T b (with prior b ∼ N(0,Σp)) defines a Gaussianprocess with m(x) = 0 and k(x , x ′) = φ(x)T Σpφ(x ′).

NotationLet K (X ,X ) denote the matrix of the point-wise kernel images.

Page 11: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Gaussian Process[1, Chapter 21], [7, Chapter 2.2]

Main IdeaThe specification of a covariance function implies a distribution overfunctions.

Gaussian ProcessI A Gaussian Process is a collection of random variables, any

finite number of which have a joint multinormal distribution.I A Gaussian process f ∼ GP(m, k) is completely specified by its

mean function m(x) and covariance function k(x , x ′). Herex ∈ X denotes a point on the index set X .

m(x) = E [f (x)] and k(x , x ′) = E [(f (x)−m(x))(f (x ′)−m(x ′))]

ExampleThe map f (x) = φ(x)T b (with prior b ∼ N(0,Σp)) defines a Gaussianprocess with m(x) = 0 and k(x , x ′) = φ(x)T Σpφ(x ′).

NotationLet K (X ,X ) denote the matrix of the point-wise kernel images.

Page 12: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Linear Regression - Function Space View[7, Chapter 2.2]

I Let us consider input points X∗ (test set).I Prior

f∗ ∼ N(0,K (X∗,X∗))

Figure: We sample from the prior space of functions by plotting theirrealization on n∗ = 80 points.

Page 13: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Linear Regression - Function Space View[7, Chapter 2.2]

I Joint Distribution(yf∗

)∼ N

(0,(

K (X ,X ) + σ2n K (X ,X∗)

K (X∗,X ) K (X∗,X∗)

))I Conditional Distribution f∗|X , y ,X∗ ∼ N(f̄∗, cov(f∗))

f̄∗ =K (X∗,X )(K (X ,X ) + σ2n I)y

cov(f∗) =K (X∗,X∗)− K (X∗,X )(K (X ,X ) + σ2n I)−1K (X ,X∗)

Page 14: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Kernel Examples[8], [7, Chapter 4.2]

Symmetric and positive semi-definite functions k : X × X −→ R.I Dot Product

kDOT (x , x ′) = (σ20 + xT Σpx ′)m

I Squared Exponential

kSE (x , x ′) = exp

(− (x − x ′)2

2`2

)I Rational Quadratic

kRQ(x , x ′) =

(1 +

(x − x ′)2

2α`2

)−αI Exp-Sine-Squared

kESS(x , x ′) = exp

(−2(

sin(π(x − x ′)/T )

`

)2)

Page 15: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Example: Non-Linear Function[4]

Figure: Non-Linear Example with n = 500 training points.

Page 16: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Prior Distribution

f ∼ N(0,K (X∗,X∗))

Figure: We sample from the prior space of functions by plotting theirrealization on n∗ = 100 points.

Page 17: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Joint Distribution

(yf∗

)∼ N

(0,(

K (X ,X ) + σ2n K (X ,X∗)

K (X∗,X ) K (X∗,X∗)

))

Page 18: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Conditional Distribution

f∗|X , y ,X∗ ∼ N(f̄∗, cov(f∗))

Figure: Covariance matrix of the posterior (conditional) distribution.

f̄∗ =K (X∗,X )(K (X ,X ) + σ2n I)y

cov(f∗) =K (X∗,X∗)− K (X∗,X )(K (X ,X ) + σ2n I)−1K (X ,X∗)

Page 19: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Posterior Distribution

Figure: We sample from the posterior space of functions by plotting theirrealization on n∗ = 100 points.

Page 20: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Hyperparameter Estimation[7, Chapter 2.3, 5]

Figure: Samples from two gaussian processes with different scaleparameters (left: ` = 1 and right: ` = 0.001).

MethodsI Marginal Likelihood (θ = parameter vector)

log(p(y |X , θ)) = −12

yT (K + σ2n I)−1y − 1

2log |K + σ2

n I| − n2

log(2π)

I Cross Validation

Page 21: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

The Kernel Space[8], [7, Chapter 4]

Let k1, k2 : X × X −→ R be two kernels, then the following are alsokernelsI k1 + k2

I k1 × k2

I k1 ∗ k2 (convolution)If k : X1 ×X1 −→ R and h : X2 ×X2 −→ R are two kernels, then thefollowing are also kernels (on X1 ×X2)I k1 ⊕ k2

I k1 ⊗ k2

Remark ([7, Chapter 4.3])

There is a rich theory of spectral theory for kernels by considering theintegral operator Tk : L2(X , µ) −→ L2(X , µ) (where (X , µ) is a finitemeasure space and k ∈ L∞(X × X , µ× µ)).

(Tkφ)(x) =

∫X

k(x , x ′)φ(x ′)dµ(x ′)

Page 22: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Example: Periodic Component I ([3])[6, Section 1.7. Gaussian Processes]

Page 23: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Example: Add Linear Trend

Page 24: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Example: Add Periodic Component II

Page 25: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Computational Challenges

I Calculating the posterior mean and covariance matrix requiresO(n3) computations.

I A practical implementation of Gaussian process regression isdescribed in [7, Algorithm 2.1], where the Choleskydecomposition is used instead of inverting the matrices directly.

I There are remarkable approximation methods for Gaussianprocesses to speed up the computation ([1, Chapter 20.1])

Page 26: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

References I[1] A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B. Rubin.

Bayesian Data Analysis, Third Edition.Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2013.

[2] Juan Orduz.Bayesian regression as a gaussian process.https://juanitorduz.github.io/reg_bayesian_regression/, Apr2019.

[3] Juan Orduz.Gaussian processes for time series forecasting.https://juanitorduz.github.io/gaussian_process_time_series/,Oct 2019.

[4] Juan Orduz.An introduction to gaussian process regression.https://juanitorduz.github.io/gaussian_process_reg/, Apr 2019.

[5] Juan Orduz.Sampling from a multivariate normal distribution.https://juanitorduz.github.io/multivariate_normal/, Mar 2019.

Page 27: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

References II[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,

M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011.

[7] Carl Edward Rasmussen and Christopher K. I. Williams.Gaussian Processes for Machine Learning (Adaptive Computation and MachineLearning).The MIT Press, 2005.

[8] S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain.Gaussian processes for timeseries modelling.Philosophical Transactions of the Royal Society.

[9] John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck.Probabilistic programming in python using PyMC3.PeerJ Computer Science, 2:e55, apr 2016.

Page 28: Gaussian Processes for Time Series Forecasting · 2020. 8. 16. · Gaussian Process [1, Chapter 21], [7, Chapter 2.2] Main Idea The specification of a covariance function implies

Thank You!ContactI ÷ https://juanitorduz.github.ioI � github.com/juanitorduzI 7 juanitorduzI R [email protected]