Transcript of: Bayesian Machine Learning for Signal Processing

Page 1:

Bayesian Machine Learning for Signal Processing

Hagai T. Attias Golden Metallic, Inc. San Francisco, CA

Tutorial

6th International Conference on Independent Component Analysis and Blind Source Separation, Charleston, SC, March 2006

Page 2:

ICA / BSS is 15 Years Old

First pair of papers: Comon, Jutten & Herault, Signal Processing, 1991

First papers on a statistical machine learning approach to ICA/BSS: Bell & Sejnowski 1995; Cardoso 1996; Pearlmutter & Parra 1997

First conference on ICA/BSS: Helsinki, 2000

Lesson drawn by many: ICA is a cool problem. Let’s find many approaches to it and many places where it’s useful.

Lesson drawn by some: statistical machine learning is a cool framework. Let’s use it to transform adaptive signal processing. ICA is a good start.

Page 3:

Noise Cancellation

[Diagram: a TV and background noise both reach a microphone; an ANC (adaptive noise cancellation) block processes the microphone signals to remove the background.]

Page 4:

From Noise Cancellation to ICA

[Diagram: the same TV-plus-background setup, but an ICA block now recovers both the TV signal and the background from the microphone signals.]

Page 5:

Noise Cancellation: Derivation

y = sensor, x = sources, n = time point

y1(n) = x1(n) + w(n) * x2(n)
y2(n) = x2(n)

Joint probability distribution of observed sensor data p(y) = px(x1 = y1 – w*y2, x2 = y2)

Assume the sources are independent, identically distributed Gaussians, with mean 0 and precisions v1, v2

Observed data likelihood L = log p(y) = -0.5 v1 (y1 – w*y2)^2 + const.

dL / dw = 0 → linear equation for w
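A minimal numerical sketch of this closed-form solution (an illustration with synthetic data, not from the slides), with the filter reduced to a single scalar coupling w, so that dL/dw = 0 gives w = Σ y1 y2 / Σ y2²:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10000
x1 = rng.normal(size=N)            # signal of interest (TV)
x2 = rng.normal(size=N)            # background
w_true = 0.7
y1 = x1 + w_true * x2              # primary sensor: signal + leaked background
y2 = x2                            # reference sensor: background only

# dL/dw = 0  =>  w = sum(y1*y2) / sum(y2*y2)  (ordinary least squares)
w_hat = np.sum(y1 * y2) / np.sum(y2 * y2)
x1_hat = y1 - w_hat * y2           # cancelled output, estimate of x1
print(w_hat)                       # close to 0.7
```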

Page 6:

Noise Cancellation → ICA: Derivation

y = sensor, x = sources, n = time point

y(n) = A x(n) , A = square mixing matrix

x(n) = G y(n) , G = square unmixing matrix

Probability distribution of observed sensor data p(y) = |G| px(G y)

Assume the sources are i.i.d. non-Gaussians

Observed data likelihood L = log p(y) = log |G| + log p(x1) + log p(x2)

dL / dG = 0 → non-linear equation for G
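A hedged NumPy sketch of one standard way to solve this: the natural-gradient update G ← G + ε(I − φ(x)x')G (which also appears later in this tutorial), with φ(x) = tanh(x) as a common score-function choice for super-Gaussian sources; this is an illustration, not necessarily the exact algorithm intended on this slide:

```python
import numpy as np

def natural_gradient_ica(y, n_iter=200, lr=0.01, seed=0):
    """Square, noiseless ICA.  y is (d, N) mixed data; returns the unmixing G.
    Update rule: G <- G + lr * (I - phi(x) x'/N) G, with phi(x) = tanh(x)."""
    rng = np.random.default_rng(seed)
    d, N = y.shape
    G = np.eye(d) + 0.01 * rng.normal(size=(d, d))   # start near the identity
    for _ in range(n_iter):
        x = G @ y                                    # current source estimates
        phi = np.tanh(x)                             # score-function surrogate
        G += lr * (np.eye(d) - phi @ x.T / N) @ G    # natural-gradient step
    return G
```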

Page 7:

Sensor Noise and Hidden Variables

y = sensor, x = sources, u = noise, n = time point

y(n) = A x(n) + u(n)

x are now hidden variables: even if A is known, one cannot obtain x exactly from y

However, one can compute the posterior probability of x conditioned on y p(x|y) = p(y|x) p(x) / p(y)

where p(y|x) = pu(y – A x)
To learn A from data, one must use an expectation maximization (EM) algorithm (and often approximate it)
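In the special case of Gaussian sources p(x) = N(0, I) and Gaussian noise with precision matrix Λ, this posterior is Gaussian and has a closed form; a minimal sketch (the notation here is an editorial assumption, matching the factor analysis model that appears later):

```python
import numpy as np

def posterior_x_given_y(y, A, Lam):
    """Gaussian posterior p(x|y) for y = A x + u, with p(x) = N(0, I) and
    noise precision matrix Lam, i.e. p(y|x) = N(Ax, Lam^-1).
    Note: the IFA model later in the tutorial uses MoG sources instead."""
    P = np.eye(A.shape[1]) + A.T @ Lam @ A      # posterior precision
    Sigma = np.linalg.inv(P)                    # posterior covariance
    mu = Sigma @ A.T @ Lam @ y                  # posterior mean
    return mu, Sigma
```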

Page 8:

Probabilistic Graphical Models

Model the distribution of observed data
Graph structure determines the probabilistic dependence between variables
We focus on DAGs = directed acyclic graphs
Node = variable; Arrow = probabilistic dependence

[Graphs: a lone node x with distribution p(x); and x → y with p(y,x) = p(y|x) p(x)]

Page 9:

Linear Classification

c = class label; discrete, multinomial
y = data; continuous, Gaussian
p(c) = πc , p(y|c) = N( y | μc, νc )
Training set: pairs {y,c}
Learn parameters by maximum likelihood: L = log p(y,c) = log p(y|c) + log p(c)
Test set: {y}, classify using p(c|y) = p(y,c) / p(y)

p(y,c) = p(y|c) p(c)

[Graph: c → y]
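A brief sketch of this classifier in NumPy (assuming, for brevity, a single isotropic variance per class rather than a full covariance):

```python
import numpy as np

def fit_gaussian_classifier(Y, c, n_classes):
    """ML fit of p(c) = pi_c and p(y|c) = N(y | mu_c, var_c I)."""
    pi = np.array([np.mean(c == k) for k in range(n_classes)])
    mu = np.array([Y[c == k].mean(axis=0) for k in range(n_classes)])
    var = np.array([Y[c == k].var() for k in range(n_classes)]) + 1e-9
    return pi, mu, var

def classify(Y, pi, mu, var):
    """Return argmax_c p(c|y), where p(c|y) is proportional to p(y|c) p(c)."""
    d = Y.shape[1]
    log_post = np.stack([
        np.log(pi[k]) - 0.5 * d * np.log(2 * np.pi * var[k])
        - 0.5 * np.sum((Y - mu[k]) ** 2, axis=1) / var[k]
        for k in range(len(pi))], axis=1)
    return np.argmax(log_post, axis=1)
```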

Page 10:

Linear Regression

x = predictor; continuous, Gaussian
y = dependent; continuous, Gaussian
p(x) = N(x | μ, ν ) , p(y|x) = N( y | Ax, Λ )
Training set: pairs {y,x}
Learn parameters by maximum likelihood: L = log p(y,x) = log p(y|x) + log p(x)
Test set: {x}, predict using p(y|x)

[Graph: x → y]

p(y,x) = p(y|x) p(x)
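The ML estimate of A in this model is the familiar least-squares solution; a small sketch (assuming zero-mean data and multivariate y for simplicity):

```python
import numpy as np

def fit_linear_regression(X, Y):
    """ML fit of A in p(y|x) = N(y | Ax, Lam):
    A = (sum_n y_n x_n') (sum_n x_n x_n')^{-1}.
    X is (N, d_x), Y is (N, d_y); also returns the ML noise precision."""
    A = (Y.T @ X) @ np.linalg.inv(X.T @ X)
    resid = Y - X @ A.T
    Lam = np.linalg.inv(np.atleast_2d(np.cov(resid.T, bias=True)))
    return A, Lam
```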

Page 11:

Clustering

c = class label; discrete, multinomial
y = data; continuous, Gaussian
p(c) = πc , p(y|c) = N( y | μc, νc )
Training set: {y}
p(y) is a mixture of Gaussians (MoG)
Learn parameters by expectation maximization (EM)
Test set: {y}, cluster using p(c|y) = p(y,c) / p(y)
Limit of zero variance: vector quantization (VQ)

p(y,c) = p(y|c) p(c)

But now c is hidden

[Graph: c → y, with c hidden]
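A compact EM sketch for a 1-D mixture of Gaussians (an illustration; the multivariate, full-covariance updates follow the same E-step/M-step pattern):

```python
import numpy as np

def mog_em(y, K=2, n_iter=100, seed=0):
    """EM for p(y) = sum_c pi_c N(y | mu_c, var_c), y a 1-D array."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(y, K, replace=False)        # initialize means at data points
    var = np.full(K, y.var())
    for _ in range(n_iter):
        # E-step: responsibilities p(c|y) for every data point
        logr = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (y[:, None] - mu) ** 2 / var)
        r = np.exp(logr - logr.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, var from the responsibilities
        Nk = r.sum(axis=0)
        pi = Nk / len(y)
        mu = (r * y[:, None]).sum(axis=0) / Nk
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-9
    return pi, mu, var
```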

Page 12:

Factor Analysis

x = factors; continuous, Gaussian
y = data; continuous, Gaussian
p(x) = N( x | 0, I ) , p(y|x) = N( y | Ax, Λ )
Training set: {y}
p(y) is Gaussian with covariance AA' + Λ^-1
Learn parameters by expectation maximization (EM)
Test set: {y}, obtain factors by p(x|y) = p(y,x) / p(y)
Limit of zero noise: principal component analysis (PCA)

p(y,x) = p(y|x) p(x)

But now x is hidden

[Graph: x → y, with x hidden]
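A minimal EM sketch for factor analysis with diagonal noise: the E-step evaluates the Gaussian posterior p(x|y), and the M-step has closed-form updates (an illustration under those standard assumptions):

```python
import numpy as np

def factor_analysis_em(Y, n_factors, n_iter=100, seed=0):
    """EM for p(x) = N(0, I), p(y|x) = N(Ax, Psi), Psi diagonal.  Y is (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    A = rng.normal(size=(d, n_factors))
    psi = Y.var(axis=0)                                  # diagonal noise variances
    for _ in range(n_iter):
        # E-step: p(x|y_n) = N(m_n, S), with a shared posterior covariance S
        S = np.linalg.inv(np.eye(n_factors) + A.T @ (A / psi[:, None]))
        M = Y @ (A / psi[:, None]) @ S                   # (N, n_factors) posterior means
        # M-step: closed-form updates for A and the noise variances
        Exx = N * S + M.T @ M                            # sum_n E[x_n x_n']
        A = (Y.T @ M) @ np.linalg.inv(Exx)
        psi = np.mean(Y ** 2 - Y * (M @ A.T), axis=0) + 1e-9
    return A, psi
```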

Page 13:

Dynamical Models

[Diagrams of three dynamical models:
Hidden Markov Model - discrete state chain s(1) → s(2) → s(3) emitting y(1), y(2), y(3); inference by Baum-Welch
State Space Model - continuous state chain x(1) → x(2) → x(3) emitting y(1), y(2), y(3); inference by Kalman smoothing
Switching State Space Model - a discrete chain s(n) switching a continuous chain x(n) that emits y(n); exact inference is intractable]

Page 14:

Probabilistic Inference

Nodes inside the frame: variables, vary in time
Nodes outside the frame: parameters, constant in time
Parameters have prior distributions p(A), p(Λ)

Bayesian Inference: compute full posterior distribution p(x,A,Λ|y) over all hidden nodes conditioned on observed nodes

Bayes’ rule: p(x,A,Λ|y)=p(y|x,A,Λ)p(x)p(A)p(Λ)/p(y)

In hidden variable models, the joint posterior generally cannot be computed exactly: the normalization factor p(y) is intractable

[Graph of the factor analysis model: parameter nodes A and Λ and the hidden node x feed the observed node y; p(x) = N(x|0,I), p(y|x,A,Λ) = N(y|Ax,Λ)]

Page 15:

MAP and Maximum Likelihood

MAP = maximum a posteriori; consider only the parameter values that maximize the posterior p(x,A,Λ|y)

With flat parameter priors this is the maximum likelihood method: compute A,Λ that maximize L = log p(y|A,Λ)
However, in hidden variable models L is a complicated function of the parameters; direct maximization would require gradient-based techniques, which are slow
Solution: the EM algorithm
An iterative algorithm; each iteration has an E-step and an M-step
E-step: compute the posterior over the hidden variables, p(x|y)
M-step: maximize the complete-data likelihood E log p(y,x,A,Λ) w.r.t. the parameters A,Λ; E = posterior average over x

Page 16:

Derivation of the EM Algorithm

Instead of the likelihood L = log p(y), consider
F(q) = E log p(y,x) – E log q(x|y)
where q(x|y) is a 'trial' posterior and E averages over x w.r.t. q

Can show: F(q) = L – KL [ q(x|y) || p(x|y) ] <= L
Hence F is upper bounded by L, and F = L when q equals the true posterior

EM performs an alternating maximization of F
The E-step maximizes F w.r.t. the posterior q
The M-step maximizes F w.r.t. the parameters A,Λ

Hence EM performs maximum likelihood
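The identity F(q) = L − KL[q || p(x|y)] ≤ L is easy to check numerically; here is a small sketch (an editorial illustration) for a single observation from a two-component mixture of Gaussians, where the hidden variable is the component label:

```python
import numpy as np

# Two-component MoG: p(s) = pi_s, p(y|s) = N(y | mu_s, var_s); s is hidden
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
var = np.array([1.0, 0.5])
y = 0.4

def log_norm(y, m, v):
    return -0.5 * np.log(2 * np.pi * v) - 0.5 * (y - m) ** 2 / v

log_joint = np.log(pi) + log_norm(y, mu, var)   # log p(y, s) for s = 0, 1
L = np.logaddexp.reduce(log_joint)              # L = log p(y)
post = np.exp(log_joint - L)                    # true posterior p(s|y)

q = np.array([0.5, 0.5])                        # an arbitrary 'trial' posterior
F = np.sum(q * (log_joint - np.log(q)))         # F(q) = E_q log p(y,s) - E_q log q(s)
KL = np.sum(q * (np.log(q) - np.log(post)))

print(F, L - KL, L)                             # F equals L - KL, and F <= L
```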

Page 17:

ICA by EM: MoG Sources

Each source distribution p(x) is a 1-dim mixture of Gaussians
The Gaussian labels s are hidden variables
The data y = A x, hence x = G y, are not hidden
Likelihood: L = log |G| + log p(x)
F(q) = log |G| + E log p(x,s) – E log q(s|y)
E-step: q(s|y) = p(x,s) / z
M-step: G ← G + ε(I - Φ(x)x')G (natural gradient)
Φ(x) is linear in x and q
Can also learn the source parameters MoG1, MoG2 at the M-step

[Graph: MoG source models MoG1, MoG2 with labels s1, s2 generate sources x1, x2, which mix through A into sensors y1, y2]

Page 18:

Noisy, Non-Square ICA: Independent Factor Analysis

The Gaussian labels s are hidden variables
The data y = A x + u, hence x are also hidden
p(y|x) = N( y | Ax, Λ )
Likelihood: L = log p(y) must marginalize over x,s
F(q) = E log p(y,x,s) – E log q(x,s|y)
E-step: q(x,s|y) = q(x|s,y) q(s|y)
M-step: linear equations for A, Λ
Can also learn the source parameters MoG1, MoG2 at the M-step
Convergence problem in low noise

[Graph: MoG source models MoG1, MoG2 with labels s1, s2 generate sources x1, x2, which mix through A, with noise precision Λ, into sensors y1, y2]

Page 19:

Intractability of Inference

In many models of interest the E-step is computationally intractable
Switching state space model: the posterior over the discrete states p(s|y) is exponential in time
Independent factor analysis: the posterior over the Gaussian labels is exponential in the number of sources
Approximations must be made
MAP approximation: consider only the most likely state configuration(s)
Markov Chain Monte Carlo: convergence may be quite slow and hard to determine

[Graph: switching state space model with discrete states s(1), s(2), s(3), continuous states x(1), x(2), x(3), and observations y(1), y(2), y(3)]

Page 20:

Variational EM

Idea: use an approximate posterior which has a factorized form
Example (switching state space model): factorize the continuous states from the discrete states
p(x,s|y) ≈ q(x,s|y) = q(x|y) q(s|y)
Make no other assumptions (e.g., about functional forms)
To derive, consider F(q) from the derivation of EM:
F(q) = E log p(y,x,s) - E log q(x|y) – E log q(s|y)
E performs posterior averaging w.r.t. q
Maximize F alternately w.r.t. q(x|y) and q(s|y):
q(x|y) = Es p(y,x,s) / zs
q(s|y) = Ex p(y,x,s) / zx
This adds an internal loop in the E-step; the M-step is unchanged
Convergence is guaranteed since F(q) is upper bounded by L

Page 21:

Switching Model: 2 Variational Approximations

Model and two variational approximations:

[Diagrams: the switching state space model over s(1..3), x(1..3), y(1..3), and two factorized approximate posteriors over s(1..3) and x(1..3)]

Approximation I: Baum-Welch + Kalman; Gaussian, smoothed
Approximation II: Baum-Welch + MoG; multimodal, not smoothed

Page 22:

IFA: 2 Variational Approximations

Model and two variational approximations:

[Diagrams: the IFA model over s1, s2, x1, x2, y1, y2, and two factorized approximate posteriors over s1, s2, x1, x2]

Approximation I: the source posterior is Gaussian, correlated
Approximation II: the source posterior is multimodal, independent

Page 23:

Model Order Selection

How does one determine the optimal number of factors in FA?
Maximum likelihood would always prefer more complex models, since they fit the data better; but they overfit
The probabilistic inference approach: place a prior p(A) over the model parameters, and consider the marginal likelihood L = log p(y) = E log p(y,A) – E log p(A|y)
Compute L for each number of factors; choose the number that maximizes L
An alternative approach: place a prior p(A) assuming a maximum number of factors. The prior has a hyperparameter for each column of A – its precision α. Optimize the precisions by maximizing L. Unnecessary columns will have α → ∞ (see the sketch below)
Both approaches require computing the parameter posterior p(A|y), which is usually intractable
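A sketch of the hyperparameter update implied by this second (ARD-style) approach: with a prior p(a_j) = N(0, α_j^-1 I) on column j of A, maximizing the bound w.r.t. α_j gives α_j = d / E[||a_j||^2], the expectation taken under the approximate posterior q(A|y). This is a standard result stated as an editorial illustration, not a formula quoted from the slides:

```python
import numpy as np

def update_ard_precisions(A_mean, A_var_cols):
    """ARD hyperparameter update for a d x K mixing matrix with prior
    p(a_j) = N(0, alpha_j^-1 I) on each column j.
    A_mean: (d, K) posterior mean of A.
    A_var_cols: (K,) total posterior variance of each column (summed over its d entries).
    Returns alpha_j = d / E[||a_j||^2]."""
    d = A_mean.shape[0]
    expected_sq_norm = np.sum(A_mean ** 2, axis=0) + A_var_cols
    return d / expected_sq_norm

# Columns whose expected squared norm is tiny get a huge precision alpha,
# which in the next VB iteration shrinks them toward zero (pruning that factor).
```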

Page 24:

Variational Bayesian EM

Idea: use an approximate posterior which factorizes the parameters from the hidden variables
Example: factor analysis
p(x,A|y) ≈ q(x,A|y) = q(x|y) q(A|y)
Make no other assumptions (e.g., about functional forms)
To derive, consider F(q) from the derivation of EM:
F(q) = E log p(y,x,A) - E log q(x|y) – E log q(A|y)
E performs posterior averaging w.r.t. q
Maximize F alternately w.r.t. q(x|y) and q(A|y):
E-step: q(x|y) = EA p(y,x,A) / zA
M-step: q(A|y) = Ex p(y,x,A) / zx
Plus, maximize F w.r.t. the noise precision Λ and the hyperparameters α (MAP approximation)

Page 25:

VB Approximation for IFA

[Diagrams: the IFA model (labels s1, s2, sources x1, x2, mixing matrix A, sensors y1, y2) and its VB approximation, in which the posterior over A factorizes from the posterior over the sources]

Page 26:

Conjugate Priors

Which form should one choose for prior distributions?
Conjugate prior idea: choose a prior such that the resulting posterior distribution has the same functional form as the prior
Single Gaussian, posterior over the mean: p(μ|y) = p(y|μ) p(μ) / p(y); the conjugate prior is Gaussian
Single Gaussian, posterior over mean + precision: p(μ,ν|y) ∝ p(y|μ,ν) p(μ,ν); the conjugate prior is Normal-Wishart
Factor analysis, VB posterior over the mixing matrix: q(A|y) = Ex p(y,x|A) p(A) / z; the conjugate prior is Gaussian
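As a concrete instance of the first case, here is the conjugate update for a Gaussian mean with known precision ν (a minimal sketch; the notation is editorial):

```python
import numpy as np

def gaussian_mean_posterior(y, mu0, nu0, nu):
    """Conjugate update for the mean of a Gaussian with known precision nu.
    Prior: p(mu) = N(mu0, nu0^-1).  Data: y_1..y_N ~ N(mu, nu^-1).
    The posterior is again Gaussian (conjugacy), with the parameters below."""
    N = len(y)
    nu_post = nu0 + N * nu                            # posterior precision
    mu_post = (nu0 * mu0 + nu * np.sum(y)) / nu_post  # posterior mean
    return mu_post, nu_post
```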

Page 27:

Separation of Convolutive Mixtures of Speech Sources

Blind separation methods use extremely simple models for source distributions
Speech signals have a rich structure. Models that capture aspects of it could result in improved separation, deconvolution, and noise robustness
One such model: work in the windowed FFT domain, x(n,k) = G(k) y(n,k), where n = frame index, k = frequency
Train a MoG model on the x(n,k) such that different components capture different speech spectra
Plug this model into IFA and use EM to obtain separation of convolutive mixtures
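A brief sketch of the windowed-FFT (STFT) domain operation x(n,k) = G(k) y(n,k), assuming the per-frequency unmixing matrices G(k) have already been learned (e.g. by the IFA/EM procedure above); fixing the per-frequency permutation and scaling ambiguities is not shown:

```python
import numpy as np
from scipy.signal import stft, istft

def separate_frequency_domain(y, G, fs=16000, nperseg=512):
    """Apply per-frequency unmixing in the STFT domain: x(n,k) = G(k) y(n,k).
    y: (n_mics, n_samples) time-domain mixtures.
    G: (n_freqs, d, d) unmixing matrix per frequency bin (n_freqs = nperseg//2 + 1)."""
    f, t, Y = stft(y, fs=fs, nperseg=nperseg)       # Y: (n_mics, n_freqs, n_frames)
    X = np.einsum('kij,jkn->ikn', G, Y)             # unmix each frequency bin k
    _, x = istft(X, fs=fs, nperseg=nperseg)         # back to the time domain
    return x
```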

Page 28:

Noise Suppression in Speech Signals

Existing methods based on, e.g., spectral subtraction and array processing often produce unsatisfactory noise suppression
Algorithms based on probabilistic models can (1) exploit rich speech models, (2) learn the noise from the noisy data (not just from silent segments), and (3) work with one or more microphones
Use the speech model in the windowed FFT domain
Λ(k) = noise precision per frequency (inverse spectrum)

[Graph: MoG speech models MoG1, MoG2 with labels s1, s2 generate sources x1, x2, which mix through A, with noise precision Λ, into sensors y1, y2]

Page 29:

Interference Suppression and Evoked Source Separation in MEG Data

y(n) = MEG sensor data, x(n) = evoked brain sources, u(n) = interference sources, v(n) = sensor noise
Evoked stimulus experimental paradigm: evoked sources are active only after the stimulus onset
Pre-stimulus: y(n) = B u(n) + v(n)
Post-stimulus: y(n) = A x(n) + B u(n) + v(n)
SEIFA is an extension of IFA to this case: model x by MoG, model u by Gaussians N(0,I), model v by a Gaussian N(0,Λ)
Use the pre-stimulus data to learn the interference mixing matrix B and the noise precision Λ; use the post-stimulus data to learn the evoked mixing matrix A
Use VB-EM to infer from the data the optimal number of interference factors u and of evoked factors x; this also protects against overfitting
Cleaned data: y = A x; contribution of factor j: yi = Aij xj
Advantages over ICA: no need to discard information by dimensionality reduction; can exploit stimulus-onset information; superior noise suppression

Page 30:

Stimulus Evoked Independent Factor Analysis

[Diagrams - pre-stimulus model: interference factors u mix through B into sensors y; post-stimulus model: MoG-modeled evoked factors x (with labels s) mix through A, and interference factors u mix through B, into sensors y]

Page 31:

Brain Source Localization using MEG

Problem: localize brain sources that respond to a stimulus
The response model is simple: y(n) = F s(n) + v(n), F = lead field (known), s = brain voxel activity
However, the number of voxels (~1000-3000) is much larger than the number of sensors (~100-300)
One approach: fit multiple dipole sources; the cost is exponential in the number of sources
Idea: loop over voxels; for each one, use VB-EM to learn a modified FA model y(n) = F' z(n) + A x(n) + v(n), where F' = lead field for that voxel, z = voxel activity, x = response from all other active voxels
Obtain a localization map by plotting <z(n)^2> per voxel
Superior results to existing (beamforming-based) methods; can handle correlated sources

Page 32:

MEG Localization Model

[Diagrams - pre-stimulus model: interference factors u mix through B into sensors y; post-stimulus model: the voxel activity z (through the lead field F'), the other active voxels x (through A), and the interference u (through B) all feed the sensors y]

Page 33:

Conclusion

Statistical machine learning provides a principled framework for formulating and solving adaptive signal processing problems
Process: (1) design a probabilistic model that corresponds to the problem; (2) use machinery for exact and approximate inference to learn the model from data, including model order; (3) extend the model, e.g. by incorporating rich signal models, to improve performance
Problems treated here: noise suppression, source separation, source localization
Domains: speech, audio, biomedical data
Domains outside this tutorial: image, video, text, coding, ...
Future: algorithms derived from probabilistic models take over and completely transform adaptive signal processing