
  • Bayesian Statistics and Data Assimilation

    Jonathan Stroud

    Department of Statistics

    The George Washington University

    1

  • Outline

    • Motivation

    • Bayesian Statistics

    • Parameter Estimation in Data Assimilation

    • Combined State and Parameter Estimation within EnKF

    – Physical Parameters

    – Observation Error Variance

    – Observation Error Covariance

    2

  • Motivation

    • Physical models and data assimilation systems often involve

    unknown parameters:

    – physical model parameters

    – error covariance parameters

    – covariance inflation factors

    – localization radius

    • Use data to estimate parameters, either off-line or sequentially.

    • Two common approaches to parameter estimation:

    – Maximum likelihood estimation

    – Bayesian estimation

    3

  • Statistical Methods for Parameter Estimation

    • Maximum Likelihood approach

    – Specify a likelihood function for the data.

    – Choose parameters to maximize this function.

    • Bayesian approach

    – Parameters are assigned a prior probability distribution.

    – Update the prior distribution using Bayes’ Theorem.

    – Summarize the parameters using the posterior distribution.

    4

  • Bayesian Parameter Estimation

    • A Bayesian model includes the following components

    (where d = data; α = unknown parameters):

    • Likelihood function: p(d |α)

    • Prior distribution: p(α)

    • Posterior distribution (via Bayes’ Theorem):

    p(α|d) = p(α) p(d|α) / p(d)

    • The parameters can be summarized using the posterior mean,

    standard deviation, or 95% posterior intervals.

    5

  • Example 1: Normal Data with Unknown Mean

    • Let d1, ... , dn be iid samples from a normal distribution with

    unknown mean θ and variance v . The likelihood function is

    p(d|θ) = (2πv)^(−n/2) exp{ −(d̄ − θ)² / (2v/n) }.

    • The standard prior distribution is a normal: θ ∼ N(θb, vb)

    p(θ) = (2πvb)^(−1/2) exp{ −(θ − θb)² / (2vb) }

    • Posterior distribution (likelihood × prior):

    p(θ|d) ∝ exp{ −(d̄ − θ)²/(2v/n) − (θ − θb)²/(2vb) }

    6

  • Example 1: Normal Data with Unknown Mean

    • The posterior distribution is normal

    θ|d ∼ N(θa, va)

    with mean

    θa = ((v/n)θb + vb d̄) / (vb + (v/n))

    and information

    va⁻¹ = vb⁻¹ + (v/n)⁻¹

    • The posterior mean is a weighted average of the prior mean and

    the sample mean. The posterior information is the sum of the

    prior and data information.

    7
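
    A minimal sketch of this conjugate update in Python (illustrative; the function name and interface are hypothetical, and the data variance v is assumed known):

    ```python
    import numpy as np

    def normal_mean_update(theta_b, v_b, d, v):
        """Conjugate update for a normal mean with known data variance v.

        Prior: theta ~ N(theta_b, v_b); data: d_1, ..., d_n iid N(theta, v).
        Returns the posterior mean and variance (theta_a, v_a).
        """
        n = len(d)
        d_bar = np.mean(d)
        # Posterior information (precision) = prior precision + data precision.
        v_a = 1.0 / (1.0 / v_b + n / v)
        # Posterior mean = precision-weighted average of theta_b and d_bar.
        theta_a = v_a * (theta_b / v_b + n * d_bar / v)
        return theta_a, v_a

    # Slide 8's numbers: prior N(0, 1) and a single observation d = 2 with
    # v = 4 give the posterior N(0.4, 0.8).
    print(normal_mean_update(0.0, 1.0, [2.0], 4.0))  # (0.4, 0.8)
    ```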

  • Example 1: Normal Data with Unknown Mean

    [Figure: three panels: Prior N(0,1); Likelihood N(2,4); Posterior N(0.4,0.8).]

    8

  • Example 2: Normal with Unknown Variance

    • Let d1, ... , dn be iid samples from a normal distribution with

    mean zero and unknown variance v . The likelihood function is

    p(d|v) = (2πv)^(−n/2) exp( −(1/(2v)) Σᵢ dᵢ² ).

    • The prior distribution is inverse gamma, v ∼ IG (m/2, s/2)

    p(v) = ((s/2)^(m/2) / Γ(m/2)) v^(−m/2−1) exp( −s/(2v) ).

    • The posterior distribution is

    p(v|d) ∝ v^(−(m+n)/2−1) exp( −(s + Σᵢ dᵢ²) / (2v) ).

    9

  • Example 2: Normal with Unknown Variance

    • The posterior is also inverse gamma

    v|d ∼ IG( (m + n)/2, (s + Σᵢ dᵢ²)/2 ).

    • The parameters are updated by adding the sample size and the

    data sum of squares, respectively.

    10
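
    A minimal sketch of this variance update (hypothetical name; in the slides' notation the prior IG(m/2, s/2) has shape a = m/2 and scale b = s/2):

    ```python
    import numpy as np

    def inv_gamma_update(a, b, d):
        """Conjugate update for v in d_i ~ N(0, v) with prior v ~ IG(a, b).

        Returns the posterior shape and scale: add half the sample size
        to the shape and half the data sum of squares to the scale.
        """
        d = np.asarray(d)
        return a + d.size / 2.0, b + 0.5 * np.sum(d**2)

    # Slide 11's numbers: prior IG(5, 10) with n = 20 and sum d_i^2 = 100
    # gives the posterior IG(15, 60), whose mean is 60 / (15 - 1) ≈ 4.3.
    ```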

  • Example 2: Normal with Unknown Variance

    [Figure: three panels: Prior IG(5,10), mean 2.5; Likelihood IG(10,50), mean 5.6; Posterior IG(15,60), mean 4.3.]

    11

  • Conjugate Priors

    • The normal and inverse gamma priors are called “conjugate

    priors”. These occur when the prior and posterior distributions

    belong to the same family.

    • These are convenient because the distribution can be updated

    by updating a set of “sufficient statistics.”

    • Other examples of conjugate priors:

    – Inverse-Wishart, for the covariance matrix of a normal.

    – Normal-Inverse Gamma, for the mean and variance of normal.

    – Dirichlet, for the probabilities of a discrete distribution.

    12

  • Sequential Bayesian Estimation

    • If the data are assimilated sequentially, we want to update the

    parameters α after each new observation d1, d2, ... , dn.

    • Under the Bayesian approach, this requires calculating the

    sequence of posterior distributions

    p(α|d1), p(α|d1, d2), ... , p(α|d1, d2, ... , dn)

    • This is done by applying Bayes’ theorem recursively after each

    new observation:

    p(α|d1, ... , dk) ∝ p(dk|α) p(α|d1, ... , dk−1), for each k.

    13
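
    A sketch of this recursion for a scalar parameter, tracking the posterior over a discrete grid (an illustrative toy model, assuming dk ∼ N(0, α)):

    ```python
    import numpy as np
    from scipy.stats import norm

    alpha_grid = np.linspace(0.1, 6.0, 200)                 # candidate values
    post = np.full(alpha_grid.size, 1.0 / alpha_grid.size)  # flat prior

    def bayes_step(post, d_k):
        """One recursion: p(a|d_1..d_k) ∝ p(d_k|a) p(a|d_1..d_{k-1})."""
        post = post * norm.pdf(d_k, loc=0.0, scale=np.sqrt(alpha_grid))
        return post / post.sum()

    rng = np.random.default_rng(1)
    for d_k in rng.normal(0.0, np.sqrt(2.0), size=200):     # true alpha = 2
        post = bayes_step(post, d_k)
    print(alpha_grid[np.argmax(post)])                      # mode, near 2
    ```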

  • Sequence of Posterior Distributions

    [Figure: six panels showing the posterior at t = 1, 2, 4, 25, 100, 200, each annotated with the single-observation MLE, the posterior mode, and the posterior mean. The MLE jumps erratically (2.9, 0.2, 11, 1, 9, 8.6) while the mode and mean settle near 2.]

    14

  • Sequential Bayesian Estimates

    [Figure: left, MLEs of alpha over 200 observations; right, sequential estimates showing the posterior mean, posterior mode, and 90% posterior CI.]

    15

  • Convergence of the Posterior Distribution

    1. As data accumulate, the posterior distribution concentrates around the true parameter value.

    2. If the model is correct, and certain regularity conditions hold,

    the posterior distribution converges to a normal distribution with

    mean equal to the true value and covariance equal to the

    asymptotic covariance matrix.

    16

  • Parameter Estimation in Data Assimilation

    • Many parameter estimation methods have been proposed for

    data assimilation systems.

    • Maximum likelihood estimation

    – Dee (1995) and Dee & Da Silva (1999): Error covariances

    – Mitchell & Houtekamer (2000): Error covariances (EnKF)

    – Li, Kalnay & Miyoshi (2007): Variance/Covariance inflation

    • Bayesian estimation

    – Anderson & Anderson (1999): State augmentation

    – Stroud & Bengtsson (2007): Observation error variance

    – Anderson (2007, 2009): Covariance inflation factors

    – Miyoshi (2011): Covariance inflation factors

    17

  • Estimation of Physical Parameters

    • State augmentation is used to estimate unknown parameters θ

    in the physical model M(xt , θ).

    • Define the augmented state vector zt = (xt , θt), and the

    augmented model as

    xt = M(xt−1, θt−1) + wt,    θt = θt−1.

    • Specify an initial prior distribution, θ0 ∼ p(θ0).

    • Then, standard data assimilation methods are applied to zt to

    estimate the posterior distribution, p(θ|d1, ... , dt), at each time t.

    18
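
    A minimal sketch of the augmented forecast step (hypothetical helper; `model` maps one member's state and parameters forward one step):

    ```python
    import numpy as np

    def augmented_forecast(z_ens, model, n_theta, q_std, rng):
        """Propagate an augmented ensemble z = (x, theta) one step.

        z_ens: (n_x + n_theta, m) array; each column is one member.
        The state block evolves as x_t = M(x_{t-1}, theta_{t-1}) + w_t,
        while the parameter block persists: theta_t = theta_{t-1}.
        """
        n_x = z_ens.shape[0] - n_theta
        out = z_ens.copy()
        for i in range(z_ens.shape[1]):
            x, theta = z_ens[:n_x, i], z_ens[n_x:, i]
            out[:n_x, i] = model(x, theta) + q_std * rng.standard_normal(n_x)
        return out
    ```

    The analysis step is then the usual update applied to the full augmented vector, so the parameters are corrected through their ensemble correlations with the observed state.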

  • Example: Lorenz 63 Model

    • Model equations:

    dx/dt = σ(y − x)

    dy/dt = ρx − y − xz

    dz/dt = xy − βz

    [Figure: the Lorenz 63 “butterfly” attractor trajectory in (x, y, z).]

    • The state vector is x = (x , y , z), and parameter is θ = (σ, ρ,β)

    • The parameters σ = 10, ρ = 28,β = 8/3 give the famous butterfly.

    • Generate data with time step dt = .01, and observation noise = 1.

    • ETKF/state augmentation on zt = (xt , θt) with ensemble size

    100 and variance inflation factor 1.04.

    19
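
    For reference, a minimal sketch of the model dynamics and one integration step (the slides do not specify the scheme; fourth-order Runge-Kutta with dt = 0.01 is one common choice):

    ```python
    import numpy as np

    def lorenz63(s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        """Lorenz 63 right-hand side at the classic parameter values."""
        x, y, z = s
        return np.array([sigma * (y - x), rho * x - y - x * z, x * y - beta * z])

    def rk4_step(f, s, dt=0.01):
        """One fourth-order Runge-Kutta step of ds/dt = f(s)."""
        k1 = f(s)
        k2 = f(s + 0.5 * dt * k1)
        k3 = f(s + 0.5 * dt * k2)
        k4 = f(s + dt * k3)
        return s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    ```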

  • Sequential Parameter Estimates: Lorenz 63

    [Figure: sequential posterior means and 90% posterior CIs for σ, ρ, and β over 100 cycles.]

    20

  • Estimation of Covariance Parameters

    • State augmentation does not work well for parameters in the

    background or error covariance matrices, P, Q, and R.

    • Dee (1995), D&D (1999) and Mitchell & Houtekamer (2000)

    proposed Maximum Likelihood estimation for these parameters.

    • Assuming the innovations d are normal with mean zero and

    covariance D(α), the likelihood function is

    p(d|α) ∝ |D(α)|^(−1/2) exp{ −(1/2) d′D(α)⁻¹d }

    and the maximum likelihood estimator is

    α̂ML = argmax_α p(d|α).

    21
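
    A sketch of the innovation log-likelihood evaluation (up to an additive constant), which is the objective maximized above:

    ```python
    import numpy as np

    def innovation_loglik(d, D):
        """log p(d | alpha) for d ~ N(0, D(alpha)), dropping constants."""
        sign, logdet = np.linalg.slogdet(D)
        return -0.5 * logdet - 0.5 * d @ np.linalg.solve(D, d)

    # The MLE can then be found by maximizing over a grid of candidate
    # values (D_of is a hypothetical map from alpha to D(alpha)):
    # alpha_hat = max(alpha_grid, key=lambda a: innovation_loglik(d, D_of(a)))
    ```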

  • Estimation of Covariance Parameters

    • Maximum Likelihood (ML) works well for large samples, but has

    problems for recursive estimation.

    • D95 and MH00 proposed the recursive ML estimator:

    α̂t = (1 − γt) α̂t−1 + γt ( argmax_α p(dt|α) ).

    • Setting γt = 1/t defines α̂t as the mean of the ML estimates.

    • They also defined α̂t as the median of the ML estimates.

    22

  • Simple Scalar Example

    • Mitchell & Houtekamer (2000) proposed the following example:

    • Generate data d ∼ N(0, 2 + α), with true value α∗ = .3.

    • Since α ≥ 0, the single-sample ML estimator is

    α̂ML = max(0, d² − 2).

    23
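
    A small simulation of this example (illustrative; it tracks the running mean of the single-sample ML estimates, i.e. the γt = 1/t recursion):

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    alpha_true = 0.3
    d = rng.normal(0.0, np.sqrt(2.0 + alpha_true), size=10_000)

    # Single-sample ML estimates and their running mean.
    ml = np.maximum(0.0, d**2 - 2.0)
    running_mean = np.cumsum(ml) / np.arange(1, ml.size + 1)
    print(running_mean[-1])  # settles well above the true value 0.3
    ```

    Because max(0, d² − 2) is a biased estimator of α, averaging it cannot remove the bias, which is the failure shown on the next slide.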

  • Mitchell & Houtekamer Example

    • Recursive estimates for α.

    [Figure: recursive estimates over 10,000 steps for the true value, the recursive ML mean and median, and the Bayes estimator.]

    • The recursive ML estimators do not converge to the true value.

    • The Bayes estimator does converge.

    24

  • Bayesian Parameter Estimation in the EnKF

    • We propose the following generic EnKF algorithm for combined

    estimation of states z and covariance parameters α.

    1. Assume a prior distribution for the parameters α ∼ p(α).

    2. Generate a forecast ensemble of parameters and states:

    αfi ∼ p(α),    zfi ∼ p(z|αfi)

    3. Update the prior distribution via Bayes’ Theorem:

    p(α|d) ∝ p(α)p(d|α)

    4. Generate an analysis ensemble of parameters and states:

    αi ∼ p(α|d),    zi ∼ p(z|αi, d)

    25

  • Model 1: Unknown Observation Variance

    • Stroud & Bengtsson (2007) considered the case where R = αR∗,

    Q = αQ∗ and D = αD∗.

    1. Assume an inverse gamma prior distribution: α ∼ IG (n/2, s/2).

    2. Generate the forecast ensemble:

    αfi ∼ IG (n/2, s/2),    zft,i ∼ M(zt−1,i) + N(0, αfi Q∗)

    3. Update the parameters of the inverse gamma distribution:

    n∗ = n + p,    s∗ = s + d′(D∗)⁻¹d

    4. Generate the analysis ensemble:

    αi ∼ IG (n∗/2, s∗/2),    zt,i ∼ zft,i + K(d + N(0, αi R∗))

    26
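
    A minimal sketch of steps 3-4 for the scale factor (hypothetical names; D∗ is the normalized innovation covariance, so cov(d) = αD∗):

    ```python
    import numpy as np

    def update_ig_params(n, s, d, D_star):
        """Step 3: update the IG(n/2, s/2) hyperparameters given innovation d."""
        return n + d.size, s + d @ np.linalg.solve(D_star, d)

    def sample_alpha(n, s, m, rng):
        """Step 4: draw alpha_i ~ IG(n/2, s/2) for each of the m members.

        If X ~ Gamma(shape=n/2, scale=2/s), then 1/X ~ IG(n/2, s/2).
        """
        return 1.0 / rng.gamma(shape=n / 2.0, scale=2.0 / s, size=m)
    ```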

  • Example: Lorenz 96 Model

    • The Lorenz 96 model mimics advection on a latitude circle. The

    model is highly nonlinear (chaotic), containing quadratic terms.

    ẋt,j = (xt,j+1 − xt,j−2) xt,j−1 − xt,j + F .

    • The state vector has 40

    variables x = (x1, ... , x40).

    • The model parameter is F ,

    the forcing variable.

    27
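
    A vectorized sketch of the right-hand side, using cyclic shifts for the periodic boundary (np.roll(x, −1) gives x_{j+1}, and so on):

    ```python
    import numpy as np

    def lorenz96(x, F=8.0):
        """Lorenz 96 tendency: (x_{j+1} - x_{j-2}) x_{j-1} - x_j + F."""
        return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F
    ```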

  • Model & Assimilation Settings

    • Time step dt = .05 or .25.

    • Forcing F = 8 (known or unknown).

    • Observations at every location.

    • Error covariances: Q = 0, R = αI; true α = 4.

    • EnKF with ensemble size m = 100.

    • Covariance localization with cutoff radius c = 10.

    • Covariance inflation factor 1.01.

    28

  • Sequential Bayesian Estimates of α (dt = .05)

    [Figure: sequential Bayesian estimates of α over 500 cycles under three priors: α|Y0 ∼ IG(15, 240), IG(1.5, 6), and IG(15, 15).]

    29

  • Sequential Bayesian Estimates of α (dt = .25)

    [Figure: the same experiment with dt = .25; priors α|Y0 ∼ IG(1.5, 6), IG(15, 240), and IG(15, 15).]

    30

  • Sequential Bayesian Estimates of (α,F )

    [Figure: joint sequential estimates of α (prior α|Y0 ∼ IG(15, 15)) and F (prior F|Y0 ∼ N(8, 1)) over 500 cycles.]

    31

  • Sequential Estimates of (α,F ): Sparse Network

    [Figure: joint sequential estimates of α (prior IG(15, 15)) and F (prior N(8, 1)) over 1000 cycles with a sparse observation network.]

    32

  • Spatially- and Temporally-Varying Scale Factors

    [Figure: sequential estimates of the scale factors α1 and α2 over 4000 cycles, with priors α1|Y0 ∼ IG(1.5, 6) and α2|Y0 ∼ IG(1.5, 3) or IG(1.5, 13.5).]

    33

  • Estimation of Spatial Correlation Parameters: Discrete Representation

    • Assume R is defined by a covariance model K (r ;α).

    1. Assume a discrete prior on a grid of parameter values α∗:

    α ∼ Mult(α∗,π)

    2. Generate the forecast ensemble.

    3. Estimate the innovation mean d and covariance, D(α).

    4. Update the posterior distribution:

    p(α|d) ∝ Mult(α|α∗, π) p(d|α) = Mult(α|α∗, π∗)

    5. Generate the analysis ensemble:

    αi ∼ Mult(α∗, π∗),    zt,i ∼ zft,i + K(αi)(d + N(0, R(αi)))

    34
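
    A sketch of the update in step 4 (hypothetical `D_of` maps a grid value α∗j to the innovation covariance D(α∗j); the prior weights π are assumed positive):

    ```python
    import numpy as np

    def discrete_update(alpha_grid, pi, d, D_of):
        """Reweight a discrete prior pi over alpha_grid by the innovation
        likelihood, returning the posterior weights pi_star."""
        log_w = np.empty(alpha_grid.size)
        for j, a in enumerate(alpha_grid):
            D = D_of(a)
            sign, logdet = np.linalg.slogdet(D)
            log_w[j] = np.log(pi[j]) - 0.5 * logdet - 0.5 * d @ np.linalg.solve(D, d)
        w = np.exp(log_w - log_w.max())  # stabilize before normalizing
        return w / w.sum()

    # Step 5 then draws the analysis parameters from the updated weights:
    # idx = rng.choice(alpha_grid.size, size=m, p=pi_star)
    # alpha_i = alpha_grid[idx]
    ```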

  • Estimation of Spatial Correlation Parameters: Gaussian Approximation

    • Assume R is defined by a covariance model K (r ;α).

    1. Assume a normal prior on the parameters:

    α ∼ N(m,C)

    2. Generate the forecast ensemble.

    3. Estimate the innovation mean d and covariance, D(α).

    4. Update the posterior distribution:

    p(α|d) ∝ N(α|m, C) p(d|α) ≈ N(α|m∗, C∗)

    5. Generate the analysis ensemble:

    αi ∼ N(m∗, C∗),    zt,i ∼ zft,i + K(αi)(d + N(0, R(αi)))

    35
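
    The slides leave the construction of N(m∗, C∗) unspecified; one simple possibility is to moment-match a gridded evaluation of the exact posterior:

    ```python
    import numpy as np

    def gaussian_approx(alpha_grid, pi_star):
        """Fit N(m_star, C_star) to a (scalar) gridded posterior by matching
        its mean and variance. One possible construction, not necessarily
        the one used in the slides."""
        m_star = np.sum(pi_star * alpha_grid)
        C_star = np.sum(pi_star * (alpha_grid - m_star) ** 2)
        return m_star, C_star
    ```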

  • Grid vs Normal Posteriors: Linear Model

    [Figure: posterior densities at t = 10, 40, and 100 for the linear model, comparing the true posterior with the grid and normal approximations.]

    36

  • Sequential Posterior Estimates: Linear Model

    [Figure: sequential posterior estimates over 100 cycles for the parameters γ1, γ2, γ3, σ2, θ2, and α.]

    37

  • Lorenz 96 Model & Assimilation Settings

    • Time step dt = .01.

    • Perfect model, F = 8 known.

    • Observations at 40 locations.

    • R defined by the Matérn correlation model:

    K(r) = α / (Γ(ν) 2^(ν−1)) · (r/λ)^ν Kν(r/λ);    α, λ, ν > 0.

    • EnKF with ensemble size m = 100

    • Covariance localization with radius r = 12.

    • No covariance inflation.

    38
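
    A sketch of the Matérn model above (scipy provides the modified Bessel function Kν as scipy.special.kv):

    ```python
    import numpy as np
    from scipy.special import gamma, kv

    def matern(r, alpha, lam, nu):
        """Matérn covariance K(r) = alpha/(Gamma(nu) 2^(nu-1)) (r/lam)^nu K_nu(r/lam)."""
        r = np.atleast_1d(np.asarray(r, dtype=float))
        u = r / lam
        out = np.full_like(r, alpha)  # K(0) = alpha, the marginal variance
        nz = u > 0
        out[nz] = alpha / (gamma(nu) * 2.0 ** (nu - 1)) * u[nz] ** nu * kv(nu, u[nz])
        return out
    ```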

  • Sequential Posterior Distributions: Discrete

    [Figure: discrete posterior distributions at t = 0, 1, 5, 10, 50, and 200.]

    39

  • Sequential Bayesian Estimates: Discrete

    [Figure: sequential posterior means and 95% intervals for the Matérn parameters α, λ, and ν over 250 cycles.]

    40

  • Conclusions

    • Bayesian methods are useful for parameter estimation in DA.

    • We presented two new algorithms for combined state and parameter

    estimation within the EnKF.

    • Easily combined with state augmentation.

    • Good convergence properties (unlike recursive ML).

    • Conjugate priors allow for easy updating.

    • Would love to collaborate with you on this topic!

    41

  • Computational Methods

    • Bayesian and ML methods rely heavily on calculation of the

    likelihood.

    • Several approximate methods have been proposed for computing

    the likelihood for large spatial data sets:

    – Spectral approximations (Whittle, 1953)

    – Approximate likelihood (Vecchia, 1988)

    – Covariance localization (Kaufman et al., 2008)

    • These methods can be applied in data assimilation systems.

    42