Expectation propagation
Expectation Propagation: Theory and Application
Dong Guo, Research Workshop 2013, Hulu Internal
See more details in
http://dongguo.me/blog/2014/01/01/expectation-propagation/
http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/
Outline
• Overview
• Background
• Theory
• Applications
OVERVIEW
Bayesian Paradigm
• Infer the posterior distribution: Prior + Data → Posterior → Make decision
Note: the LDA figure is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'
Bayesian inference methods
• Exact inference
– Belief propagation
• Approximate inference
– Stochastic (sampling)
– Deterministic
  • Assumed density filtering
  • Expectation propagation
  • Variational Bayes
Message passing
• A form of communication used in multiple domains of computer science
– Parallel computing (MPI)
– Object-oriented programming
– Inter-process communication
– Bayesian inference
• A family of methods to infer posterior distributions
Expectation Propagation
• Belongs to the message passing family
• Approximate method (iteration is needed)
• Very popular in Bayesian inference, especially in graphical models
Researchers
• Thomas Minka – EP was proposed in his PhD thesis
• Kevin P. Murphy – Machine Learning: A Probabilistic Perspective
BACKGROUND
Background
• (Truncated) Gaussian
• Exponential family
• Graphical model
• Factor graph
• Belief propagation
• Moment matching
Gaussian and Truncated Gaussian
• Gaussian operations are the basis of EP inference (a sketch follows below)
– Gaussian +, ×, / Gaussian
– Gaussian integrals
• The truncated Gaussian is used in many EP applications
• See details here
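The slides include no code, but the Gaussian operations above are easy to make concrete. Below is a minimal Python sketch (all names are invented for illustration): in natural-parameter form, multiplying two Gaussian densities adds their parameters and dividing subtracts them, which is exactly the "+ × /" machinery EP relies on.

```python
# Gaussians in natural-parameter form: tau = 1/variance (precision) and
# nu = mean/variance. Multiplying two Gaussian densities adds these
# parameters; dividing subtracts them.

def to_natural(mean, var):
    return mean / var, 1.0 / var

def to_moments(nu, tau):
    return nu / tau, 1.0 / tau

def multiply(g1, g2):
    """(nu, tau) of the (unnormalized) product of two Gaussians."""
    return g1[0] + g2[0], g1[1] + g2[1]

def divide(g1, g2):
    """(nu, tau) of the quotient; the precision can come out negative,
    which EP tolerates in intermediate site approximations."""
    return g1[0] - g2[0], g1[1] - g2[1]

# Example: N(x | 1, 4) * N(x | 3, 1) is proportional to N(x | 2.6, 0.8).
print(to_moments(*multiply(to_natural(1.0, 4.0), to_natural(3.0, 1.0))))
```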
Exponential family distribution
• Very good summary in Wikipedia
• Sufficient statistics of the Gaussian distribution: (x, x^2)
• Typical form:
q(z) = h(z)\, g(\eta) \exp\{\eta^T u(z)\}
Note: the above four figures are from Wikipedia
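As a worked instance (not on the slide) of the form above, the univariate Gaussian can be written in exponential-family form with the sufficient statistics u(x) = (x, x^2) just mentioned:

```latex
% Univariate Gaussian in the form q(x) = h(x) g(\eta) \exp\{\eta^T u(x)\}
N(x \mid \mu, \sigma^2)
  = \underbrace{\frac{1}{\sqrt{2\pi}}}_{h(x)}
    \underbrace{\sqrt{-2\eta_2}\; e^{\eta_1^2 / (4\eta_2)}}_{g(\eta)}
    \exp\{\eta_1 x + \eta_2 x^2\},
\qquad
\eta = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix},
\quad
u(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}
```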
Graphical Models
• Directed graph (Bayesian Network)
• Undirected graph (Conditional Random Field)
P(x) = \prod_{k=1}^{K} p(x_k \mid pa_k)
[Figure: a directed and an undirected graph over variables x1–x4]
Factor graph
• Expresses relations between variable nodes explicitly
• A relation on an edge → a factor node
• Hides the difference between BN and CRF during inference
• Makes inference more intuitive
[Figure: the same graph over x1–x4 redrawn as a factor graph with explicit factor nodes (fa, fc, ...)]
BELIEF PROPAGATION
Belief Propagation Overview
• Exact Bayesian method to infer marginal distributions
– 'sum-product' message passing
• Key components
– Calculate the posterior distribution of a variable node
– Two kinds of messages
Posterior distribution of a variable node
• Factor graph:
p(X) = \prod_{s \in ne(x)} F_s(x, X_s), for any variable node x in the graph
p(x) = \sum_{X \setminus x} p(X) = \sum_{X \setminus x} \prod_{s \in ne(x)} F_s(x, X_s) = \prod_{s \in ne(x)} \sum_{X_s} F_s(x, X_s) = \prod_{s \in ne(x)} \mu_{f_s \to x}(x),
in which \mu_{f_s \to x}(x) = \sum_{X_s} F_s(x, X_s)
Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Message: factor → variable node
• Factor graph:
\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{x_m \in ne(f_s) \setminus x} \mu_{x_m \to f_s}(x_m),
in which \{x_1, \ldots, x_M\} is the set of variables on which the factor f_s depends
Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Message: variable → factor node
• Factor graph:
\mu_{x_m \to f_s}(x_m) = \prod_{l \in ne(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)
Summary: the posterior distribution is determined only by the factors!
Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Whole steps of BP
• Steps to calculate the posterior distribution of a given variable node
– Step 1: construct the factor graph
– Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
– Step 3: apply the message passing steps recursively until the root node receives messages from all of its neighbors
– Step 4: get the marginal distribution by multiplying all messages sent in
Note: the figures are from the book 'Pattern Recognition and Machine Learning'
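A minimal Python sketch of steps 1–4 (the chain, its binary variables, and the factor tables f_a, f_b are all invented for the demo):

```python
import numpy as np

# A sum-product run on a toy chain factor graph
#   x1 -- f_a(x1, x2) -- x2 -- f_b(x2, x3) -- x3
f_a = np.array([[1.0, 0.5],
                [0.5, 2.0]])   # factor over (x1, x2)
f_b = np.array([[1.5, 0.2],
                [0.3, 1.0]])   # factor over (x2, x3)

# Step 2: treat x3 as the root; the message from the leaf x1 is uniform.
mu_x1_to_fa = np.ones(2)
# Step 3: pass messages recursively toward the root.
mu_fa_to_x2 = f_a.T @ mu_x1_to_fa   # sum over x1 of f_a * incoming message
mu_x2_to_fb = mu_fa_to_x2           # x2 has no other factor neighbors
mu_fb_to_x3 = f_b.T @ mu_x2_to_fb   # sum over x2
# Step 4: the marginal is the normalized product of all incoming messages.
p_x3 = mu_fb_to_x3 / mu_fb_to_x3.sum()

# Sanity check against brute-force enumeration of the joint.
joint = np.einsum("ij,jk->ijk", f_a, f_b)
print(p_x3, joint.sum(axis=(0, 1)) / joint.sum())
```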
BP: example
• Infer the marginal distribution of x_3
• Infer the marginal distribution of every variable
Note: the figures are from the book 'Pattern Recognition and Machine Learning'
Posterior is sometimes intractable
• Example
– Infer the mean of a Gaussian distribution:
p(x \mid \theta) = (1 - w)\, N(x \mid \theta, I) + w\, N(x \mid 0, aI)
p(\theta) = N(\theta \mid 0, bI)
– Ad predictor
Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Distribution Approximation
Approximate p(x) with q(x), which belongs to the exponential family, such that q(x) = h(x)\, g(\eta) \exp\{\eta^T u(x)\}:
KL(p \| q) = -\int p(x) \ln \frac{q(x)}{p(x)} dx = -\int p(x) \ln q(x)\, dx + \int p(x) \ln p(x)\, dx
= -\int p(x) \ln g(\eta)\, dx - \int p(x)\, \eta^T u(x)\, dx + const = -\ln g(\eta) - \eta^T E_{p(x)}[u(x)] + const,
where the const terms are independent of the natural parameter \eta.
Minimize KL(p \| q) by setting the gradient with respect to \eta to zero:
\Rightarrow -\nabla \ln g(\eta) = E_{p(x)}[u(x)]
By leveraging formula (2.226) in PRML:
\Rightarrow E_{q(x)}[u(x)] = -\nabla \ln g(\eta) = E_{p(x)}[u(x)]
Moment matching
• Moments of a distribution: the k-th moment is M_k = \int_a^b x^k f(x)\, dx
• It is called moment matching when q(x) is a Gaussian distribution; then u(x) = (x, x^2)^T
\Rightarrow \int q(x)\, x\, dx = \int p(x)\, x\, dx, and \int q(x)\, x^2\, dx = \int p(x)\, x^2\, dx
\Rightarrow mean_{q(x)} = \int q(x)\, x\, dx = \int p(x)\, x\, dx = mean_{p(x)},
variance_{q(x)} = \int q(x)\, x^2\, dx - (mean_{q(x)})^2 = \int p(x)\, x^2\, dx - (mean_{p(x)})^2 = variance_{p(x)}
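A small numerical sketch of moment matching, assuming nothing beyond NumPy: given any (possibly unnormalized) density tabulated on a uniform grid, the matching Gaussian is read off from the first two moments. The mixture used in the example is arbitrary.

```python
import numpy as np

# Numerical moment matching: return the (mean, variance) of the Gaussian
# q(x) that shares the first two moments of a tabulated density.
def moment_match(grid, p_unnorm):
    dx = grid[1] - grid[0]                      # uniform grid spacing
    z = p_unnorm.sum() * dx                     # 0th moment (normalizer)
    mean = (grid * p_unnorm).sum() * dx / z     # 1st moment
    m2 = (grid ** 2 * p_unnorm).sum() * dx / z  # 2nd moment
    return mean, m2 - mean ** 2

# Example: match an arbitrary two-component Gaussian mixture.
grid = np.linspace(-12.0, 12.0, 4001)
p = 0.7 * np.exp(-0.5 * (grid - 1.0) ** 2) + \
    0.3 * np.exp(-0.5 * (grid + 2.0) ** 2 / 4.0)
print(moment_match(grid, p))
```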
EXPECTATION PROPAGATION = Belief Propagation + Moment matching?
Key Idea
• Approximate each factor with a Gaussian distribution
• Approximate corresponding factor pairs one by one?
• Approximate each factor in turn in the context of all remaining factors (proposed by Minka):
refine factor \tilde{f}_j(\theta) by ensuring q^{new}(\theta) \propto \tilde{f}_j(\theta)\, q^{\setminus j}(\theta) is close to f_j(\theta)\, q^{\setminus j}(\theta),
in which q^{\setminus j}(\theta) = q(\theta) / \tilde{f}_j(\theta)
EP: The detailed steps
1. Initialize all of the approximating factors \tilde{f}_i(\theta).
2. Initialize the posterior approximation by setting q(\theta) \propto \prod_i \tilde{f}_i(\theta).
3. Until convergence:
(a) Choose a factor \tilde{f}_j(\theta) to refine.
(b) Remove \tilde{f}_j(\theta) from the posterior by division: q^{\setminus j}(\theta) = q(\theta) / \tilde{f}_j(\theta).
(c) Get the new posterior by setting the sufficient statistics of q^{new}(\theta) equal to those of f_j(\theta)\, q^{\setminus j}(\theta) / Z_j
(i.e., minimize KL(f_j(\theta)\, q^{\setminus j}(\theta) / Z_j \| q^{new}(\theta))), in which Z_j = \int f_j(\theta)\, q^{\setminus j}(\theta)\, d\theta.
(d) Get the refined factor: \tilde{f}_j(\theta) = K\, q^{new}(\theta) / q^{\setminus j}(\theta).
Example: The clutter problem
• Infer the mean of a Gaussian distribution
• Want to try MLE, but:
p(x \mid \theta) = (1 - w)\, N(x \mid \theta, I) + w\, N(x \mid 0, aI)
p(\theta) = N(\theta \mid 0, bI)
• Approximate with:
– Approximate the mixture of Gaussians with a Gaussian
q(\theta) = N(\theta \mid m, vI), and each factor \tilde{f}_n(\theta) = N(\theta \mid m_n, v_n I)
Note: the figure is from the book 'Pattern Recognition and Machine Learning'
Example: The clutter problem (2)
• Approximate a complex factor (e.g., a Gaussian mixture) with a Gaussian
f_n(\theta) in blue, \tilde{f}_n(\theta) in red, and q^{\setminus n}(\theta) in green.
Remember that the variance of q^{\setminus n}(\theta) is usually very small, so \tilde{f}_n(\theta) only needs to approximate f_n(\theta) over a small range.
Note: the above two figures are from the book 'Pattern Recognition and Machine Learning'
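Putting the detailed steps and the clutter model together, here is a hedged, runnable 1-D sketch of the full EP loop. For brevity, the moments of the tilted distribution f_n(\theta)\, q^{\setminus n}(\theta) are computed numerically on a grid rather than with PRML's closed-form updates; all parameter values and variable names are made up.

```python
import numpy as np

# 1-D clutter problem: x_n ~ (1 - w) N(theta, 1) + w N(0, a), prior
# theta ~ N(0, b). Each site f_n gets an unnormalized Gaussian
# approximation stored in natural parameters (prec, prec * mean);
# normalizers s_n are dropped since they don't affect the posterior.
w, a, b = 0.3, 10.0, 100.0
rng = np.random.default_rng(0)
theta_true, n_obs = 2.0, 20
clutter = rng.random(n_obs) < w
x = np.where(clutter, rng.normal(0.0, np.sqrt(a), n_obs),
             rng.normal(theta_true, 1.0, n_obs))

grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]

def gauss(t, mean, var):
    return np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def site_likelihood(theta, xn):
    return (1 - w) * gauss(xn, theta, 1.0) + w * gauss(xn, 0.0, a)

prec = np.zeros(n_obs)  # site precisions, initially flat (step 1)
pm = np.zeros(n_obs)    # site precision-times-mean

for _ in range(10):                  # step 3: sweep until converged
    for n in range(n_obs):           # (a) choose a factor to refine
        post_prec = 1.0 / b + prec.sum()
        post_pm = pm.sum()
        cav_prec = post_prec - prec[n]   # (b) divide the site out
        cav_pm = post_pm - pm[n]
        if cav_prec <= 0:                # skip improper cavities
            continue
        cav_mean, cav_var = cav_pm / cav_prec, 1.0 / cav_prec
        # (c) moment-match the tilted distribution f_n * cavity on the grid
        tilted = site_likelihood(grid, x[n]) * gauss(grid, cav_mean, cav_var)
        z = tilted.sum() * dx
        mean = (grid * tilted).sum() * dx / z
        var = (grid ** 2 * tilted).sum() * dx / z - mean ** 2
        # (d) refined site = new posterior / cavity, in natural parameters
        prec[n] = 1.0 / var - cav_prec
        pm[n] = mean / var - cav_pm

post_prec = 1.0 / b + prec.sum()
print("EP posterior mean/var:", pm.sum() / post_prec, 1.0 / post_prec)
```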
Application: Bayesian CTR predictor for Bing
• See the details here
– Inference step by step
– Make predictions
• Some insights
– The variance of each feature decreases after every exposure
– Samples with more features will have bigger variance
• Independence assumption for the features
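For intuition about where those insights come from, here is a hedged reconstruction of the adPredictor-style Gaussian update (following the form of the update equations in the Bing paper, not its production code; beta, the feature layout, and all values below are assumptions):

```python
import numpy as np
from scipy.stats import norm

# Each weight is an independent N(mu_i, sigma2_i); beta is the noise
# scale of the probit likelihood.
beta = 1.0

def v(t):            # mean correction from the truncated Gaussian
    return norm.pdf(t) / norm.cdf(t)

def w(t):            # variance correction
    return v(t) * (v(t) + t)

def update(mu, sigma2, active, y):
    """One online update; active = indices of the features present in the
    impression, y = +1 for click, -1 for no click."""
    total_mean = mu[active].sum()
    total_var = beta ** 2 + sigma2[active].sum()  # grows with more features
    total_sd = np.sqrt(total_var)
    t = y * total_mean / total_sd
    mu[active] += y * (sigma2[active] / total_sd) * v(t)
    sigma2[active] *= 1.0 - (sigma2[active] / total_var) * w(t)

mu, sigma2 = np.zeros(5), np.ones(5)
update(mu, sigma2, np.array([0, 2]), y=+1)  # one clicked impression
print(mu, sigma2)  # per-weight variances shrink with each update
```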
Experimentation
• Dataset is very inhomogeneous
• Performance
– Other metrics
• Pros: speed, low parameter-tuning cost, online learning support, interpretable, supports adding more factors
• Cons: sparsity
• Code
Model          AUC
FTRL           0.638
OWLQN          0.641
Ad predictor   0.639
Application: XBOX skill rating system
• See details in pp. 793–798 of Machine Learning: A Probabilistic Perspective
Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'
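A similarly hedged sketch of a TrueSkill-style two-player update (no draws, no skill dynamics), using the moment-matching corrections v(t) and w(t) from the paper; the constants follow commonly cited defaults and are assumptions:

```python
import numpy as np
from scipy.stats import norm

BETA = 25.0 / 6.0  # performance noise scale (assumed default)

def v(t):
    return norm.pdf(t) / norm.cdf(t)

def w(t):
    return v(t) * (v(t) + t)

def update(winner, loser):
    """winner/loser are (mu, sigma) skill beliefs; returns updated pairs."""
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c = np.sqrt(2 * BETA ** 2 + s_w ** 2 + s_l ** 2)
    t = (mu_w - mu_l) / c
    mu_w += (s_w ** 2 / c) * v(t)       # winner's skill moves up
    mu_l -= (s_l ** 2 / c) * v(t)       # loser's skill moves down
    s_w *= np.sqrt(1.0 - (s_w ** 2 / c ** 2) * w(t))
    s_l *= np.sqrt(1.0 - (s_l ** 2 / c ** 2) * w(t))
    return (mu_w, s_w), (mu_l, s_l)

print(update((25.0, 25.0 / 3), (25.0, 25.0 / 3)))  # first match between equals
```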
Apply to all Bayesian models
• Infer.NET (Microsoft/Bishop)
– A framework for running Bayesian inference in graphical models
– Model-based machine learning
References
• Books
– Chapters 2/8/10 of Pattern Recognition and Machine Learning
– Chapter 22 of Machine Learning: A Probabilistic Perspective
• Papers
– A family of algorithms for approximate Bayesian inference
– From belief propagation to expectation propagation
– TrueSkill: A Bayesian Skill Rating System
– Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine
• Roadmap for EP