Extending Expectation Propagation on Graphical Models

Yuan (Alan) Qi ([email protected])
Motivation

• Graphical models are widely used in real-world applications, such as human behavior recognition and wireless digital communications.
• Inference on graphical models: infer hidden variables.
  – Previous approaches often sacrifice efficiency for accuracy, or sacrifice accuracy for efficiency.
  > We need methods that better balance the trade-off between accuracy and efficiency.
• Learning graphical models: learn model parameters.
  – Overfitting problem: maximum likelihood approaches overfit.
  > We need efficient Bayesian training methods.
Outline

• Background:
  – Graphical models and expectation propagation (EP)
• Inference on graphical models
  – Extending EP on dynamic Bayesian networks
    • Fixed-lag smoothing: wireless signal detection
    • Different approximation techniques: Poisson tracking
  – Combining EP with local propagation on loopy graphs
• Learning conditional graphical models
  – Extending EP classification to perform feature selection
    • Gene expression classification
  – Training Bayesian conditional random fields
    • FAQ labeling
    • Handwritten ink analysis
• Conclusions
Outline

• Background
  – 4 kinds of graphical models
  – Bayesian integration techniques
  – EP in a nutshell
• Inference on graphical models
• Learning conditional graphical models
• Conclusions
Graphical Models

[Figure: four model families, each drawn as a small graph over hidden variables $x_1, x_2$ with observations $y_1, y_2$, or inputs $x_1, x_2$ with labels $t_1, t_2$ and parameters $w$.]

• Bayesian networks: $p(\mathbf{x}, \mathbf{y}) = \prod_i p(x_i \mid \mathrm{pa}(x_i)) \prod_j p(y_j \mid \mathrm{pa}(y_j))$
• Markov networks: $p(\mathbf{x}, \mathbf{y}) = \frac{1}{Z} \prod_a \psi_a(\mathbf{x}_a, \mathbf{y}_a)$
• Conditional classification: $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \prod_i p(t_i \mid \mathbf{x}_i, \mathbf{w})$
• Conditional random fields: $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \prod_a g_a(\mathbf{t}_a, \mathbf{x}_a; \mathbf{w})$

The first two are used for INFERENCE (over hidden variables); the last two for LEARNING (of parameters $\mathbf{w}$).
Bayesian Integration Techniques (1)

• Monte Carlo
  – Markov chain Monte Carlo (MCMC), e.g., the Hastings algorithm and the Gibbs sampler
  – (Sequential) importance sampling, e.g., particle filters and smoothers
• Quadrature: numerical approximation
• Laplace's method: fitting a Gaussian at the mode
• Variational methods: converting inference into an optimization problem by lower- or upper-bounding techniques
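To make one of these techniques concrete, here is a minimal sketch (not from the talk) of Laplace's method in 1-D: find the mode of an unnormalized density and fit a Gaussian by matching the curvature of the log-density there. The target density and the finite-difference curvature estimate are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_1d(log_p, eps=1e-4):
    # Fit a Gaussian N(mode, var) at the mode of exp(log_p).
    mode = minimize_scalar(lambda x: -log_p(x)).x   # find the mode
    # Second derivative of log p at the mode via finite differences.
    d2 = (log_p(mode + eps) - 2 * log_p(mode) + log_p(mode - eps)) / eps**2
    return mode, -1.0 / d2                          # Gaussian mean, variance

log_target = lambda x: -0.5 * (x - 1.0) ** 2 + np.sin(x)   # some non-Gaussian density
print(laplace_1d(log_target))
```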
Bayesian Integration Techniques (2)

• Expectation propagation (Minka 2001): an efficient deterministic approximation technique.
Expectation Propagation in a Nutshell

• Approximate a probability distribution by a product of simpler parametric terms (Minka 2001):

  $p(\mathbf{x}) = \prod_a f_a(\mathbf{x}) \approx q(\mathbf{x}) = \prod_a \tilde{f}_a(\mathbf{x})$

  – For Bayesian networks: $f_a(\mathbf{x}) = p(x_i \mid \mathrm{pa}(x_i))$
  – For Markov networks: $f_a(\mathbf{x}) = \psi_a(\mathbf{x}_a, \mathbf{y}_a)$
  – For conditional classification: $f_a(\mathbf{w}) = p(t_a \mid \mathbf{x}_a, \mathbf{w})$, so that $p(\mathbf{w} \mid \mathbf{t}, \mathbf{x}) \propto \prod_a f_a(\mathbf{w}) \approx q(\mathbf{w}) = \prod_a \tilde{f}_a(\mathbf{w})$
  – For conditional random fields: $f_a(\mathbf{w}) = g_a(\mathbf{t}_a, \mathbf{x}_a; \mathbf{w})$

• Each approximation term $\tilde{f}_a(\mathbf{x})$ or $\tilde{f}_a(\mathbf{w})$ lives in an exponential family (such as Gaussian or multinomial).
EP in a Nutshell (2)

• The approximate term $\tilde{f}_a$ minimizes the following KL divergence by moment matching:

  $\tilde{f}_a(\mathbf{x}) = \arg\min_{\tilde{f}_a(\mathbf{x})} D\big(f_a(\mathbf{x})\, q^{\setminus a}(\mathbf{x}) \,\big\|\, \tilde{f}_a(\mathbf{x})\, q^{\setminus a}(\mathbf{x})\big)$

  where the leave-one-out approximation is

  $q^{\setminus a}(\mathbf{x}) = \prod_{b \neq a} \tilde{f}_b(\mathbf{x}) = \frac{q(\mathbf{x})}{\tilde{f}_a(\mathbf{x})}$
EP in a Nutshell (3)

Three key steps:

• Deletion step: approximate the "leave-one-out" predictive posterior for the $i$th point:

  $q^{\setminus i}(\mathbf{x}) = \frac{q(\mathbf{x})}{\tilde{f}_i(\mathbf{x})} \propto \prod_{j \neq i} \tilde{f}_j(\mathbf{x})$

• ADF step: minimize the following KL divergence by moment matching (assumed-density filtering):

  $\tilde{f}_i(\mathbf{x}) = \arg\min_{\tilde{f}_i(\mathbf{x})} \mathrm{KL}\big(f_i(\mathbf{x})\, q^{\setminus i}(\mathbf{x}) \,\big\|\, \tilde{f}_i(\mathbf{x})\, q^{\setminus i}(\mathbf{x})\big)$

• Inclusion step:

  $q(\mathbf{x}) \leftarrow \tilde{f}_i(\mathbf{x})\, q^{\setminus i}(\mathbf{x})$
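To make the three steps concrete, here is a minimal, self-contained sketch of the EP loop for a 1-D posterior, with the moment matching done by brute-force quadrature so it works for any scalar term. The probit-style example terms and all numerical choices (grid, prior) are illustrative assumptions, not the talk's implementation.

```python
import numpy as np
from scipy.stats import norm

def ep_1d(log_terms, prior_tau=1e-2, prior_nu=0.0, iters=10):
    # q(x) and terms f~_i stored as Gaussian natural parameters (tau, nu),
    # so multiplying/dividing terms is adding/subtracting parameters.
    n = len(log_terms)
    t_tau, t_nu = np.zeros(n), np.zeros(n)        # approximate terms f~_i
    q_tau, q_nu = prior_tau, prior_nu             # current posterior q
    grid = np.linspace(-20.0, 20.0, 4001)
    for _ in range(iters):
        for i in range(n):
            # Deletion: cavity q^\i = q / f~_i (subtract natural parameters).
            c_tau, c_nu = q_tau - t_tau[i], q_nu - t_nu[i]
            if c_tau <= 0:                        # skip invalid cavities
                continue
            # ADF: moment-match f_i(x) * q^\i(x) by quadrature.
            log_tilted = log_terms[i](grid) - 0.5 * c_tau * grid**2 + c_nu * grid
            p = np.exp(log_tilted - log_tilted.max())
            p /= p.sum()
            mean = (p * grid).sum()
            var = (p * (grid - mean) ** 2).sum()
            new_tau, new_nu = 1.0 / var, mean / var
            # Inclusion: f~_i = q_new / q^\i, then q <- f~_i * q^\i = q_new.
            t_tau[i], t_nu[i] = new_tau - c_tau, new_nu - c_nu
            q_tau, q_nu = new_tau, new_nu
    return q_nu / q_tau, 1.0 / q_tau              # posterior mean, variance

# Probit-style terms f_i(x) = Phi(t_i * x) for labels t_i = ±1 (illustrative).
labels = [1, 1, -1, 1]
terms = [lambda x, t=t: norm.logcdf(t * x) for t in labels]
print(ep_1d(terms))
```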
Limitations of Plain EP

• Batch processing of terms: not online.
• It can be difficult or expensive to compute the ADF step analytically.
• It can be expensive to compute and maintain a valid approximate distribution $q(\mathbf{x})$ that is coherent under marginalization, e.g., a tree-structured approximation

  $q(\mathbf{x}) = \prod_{j} \frac{q(x_j, x_{\mathrm{pa}(j)})}{q(x_{\mathrm{pa}(j)})}$

• EP classification degenerates in the presence of noisy features.
• It cannot incorporate denominators (such as a partition function).
Four Extensions on Four Types of Graphical Models

• Fixed-lag smoothing and embedding different approximation techniques for dynamic Bayesian networks.
• Allowing a structured approximation to be globally non-coherent, while maintaining only local consistency during inference on loopy graphs.
• Combining EP with ARD for classification with noisy features.
• Extending EP to train conditional random fields with a denominator (the partition function).
Inference on Dynamic Bayesian Networks

[Roadmap: the four-model overview from above, with Bayesian networks highlighted.]
Outline

• Background
• Inference on graphical models
  – Extending EP on dynamic Bayesian networks
    • Fixed-lag smoothing: wireless signal detection
    • Different approximation techniques: Poisson tracking
  – Combining EP with the junction tree algorithm on loopy graphs
• Learning conditional graphical models
• Conclusions
Object Tracking

Guess the position of an object given noisy observations.

[Figure: an object trajectory through states $x_1, x_2, x_3, x_4$ with noisy observations $y_1, y_2, y_3, y_4$.]
Bayesian Network

[Figure: chain-structured Bayesian network with hidden states $x_1, x_2, \ldots, x_T$ and observations $y_1, y_2, \ldots, y_T$.]

e.g., a random walk: $x_t = x_{t-1} + \nu_t$, with noisy observations $y_t = x_t + \text{noise}$.

We want the distribution of the $x$'s given the $y$'s.
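For the linear-Gaussian case of this random-walk model, exact filtering is the classical Kalman filter; a minimal sketch follows. The noise variances q and r and the initial belief are illustrative choices, not values from the talk.

```python
import numpy as np

def kalman_filter(y, q=0.1, r=1.0, m0=0.0, v0=10.0):
    # x_t = x_{t-1} + nu_t, nu_t ~ N(0, q);  y_t = x_t + e_t, e_t ~ N(0, r).
    m, v = m0, v0
    means, variances = [], []
    for obs in y:
        v_pred = v + q                     # predict: propagate the belief
        k = v_pred / (v_pred + r)          # Kalman gain
        m = m + k * (obs - m)              # update: condition on observation
        v = (1.0 - k) * v_pred
        means.append(m); variances.append(v)
    return np.array(means), np.array(variances)

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(0, 0.3, 100))     # latent random walk
y = x + rng.normal(0, 1.0, 100)            # noisy observations
m, v = kalman_filter(y, q=0.09, r=1.0)
```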
Deterministic Filtering on Nonlinear Non-Gaussian Dynamic Models

• Extended Kalman filter (EKF):
  – The process and measurement equations are linearized by a Taylor expansion about the current state estimate. The noise variance in the equations is not changed, i.e., the additional error due to linearization is not modeled. After linearization, the Kalman filter is applied.
• Linear-regression filter:
  – The nonlinear/non-Gaussian measurement equation is converted into a linear-Gaussian equation by regressing the observation onto the state (Minka 2001). The result is a Kalman filter whose Kalman gain is

    Cov(state, measurement) / Var(measurement).

  For example, the unscented Kalman filter (Wan and van der Merwe 2000) uses the unscented transformation to estimate the state-measurement covariance.
Deterministic Filtering on Nonlinear Non-Gaussian Dynamic Models

• Assumed-density filters:
  – Choose the Gaussian belief closest in KL divergence to the exact state posterior given the previous beliefs.
• Mixture of Kalman filters:
  – Prune the number of Gaussians in the exact posterior to avoid an explosion in the number of Gaussians.
Sequential Monte Carlo

• Particle filters:
  – Use sequential importance sampling with resampling steps.
  – The proposal distribution is very important.
  – Many samples are needed for accurate estimation.
• EKF- and UKF-particle filters:
  – Use the EKF or UKF to generate proposal distributions for the particle filters.
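Here is a minimal sketch of a bootstrap particle filter for the same random-walk model, illustrating the sequential importance sampling + resampling loop just described. The particle count and noise parameters are illustrative assumptions.

```python
import numpy as np

def particle_filter(y, n_particles=500, q=0.1, r=1.0):
    rng = np.random.default_rng(0)
    parts = rng.normal(0.0, np.sqrt(10.0), n_particles)      # prior samples
    means = []
    for obs in y:
        parts = parts + rng.normal(0, np.sqrt(q), n_particles)  # propagate
        logw = -0.5 * (obs - parts) ** 2 / r                     # weight by likelihood
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(n_particles, n_particles, p=w)          # resample
        parts = parts[idx]
        means.append(parts.mean())
    return np.array(means)
```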
Combining Sequential Monte Carlo with Deterministic Filters

• Rao-Blackwellised particle filters and stochastic mixtures of Kalman filters:
  – Draw samples only in a subspace, marginalizing over the variables for which we can do exact inference.
Smoothing on Nonlinear Non-Gaussian Dynamic Models
• Particle smoothing
• Markov chain Monte Carlo
• Extended Kalman smoothing
• Variational methods
• Expectation propagation
EP Approximation

Exact joint distribution:

$p(\mathbf{x}, \mathbf{y}) = p(x_1)\, p(y_1 \mid x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})\, p(y_t \mid x_t)$

Approximation, factorized and Gaussian in $\mathbf{x}$:

$q(\mathbf{x}) = \tilde{p}_1(x_1)\, \tilde{o}_1(x_1) \prod_{t=2}^{T} \tilde{p}_t(x_{t-1})\, \tilde{p}_t(x_t)\, \tilde{o}_t(x_t)$

where each transition term $p(x_t \mid x_{t-1})$ is approximated by the factorized form $\tilde{p}_t(x_{t-1})\, \tilde{p}_t(x_t)$ and each observation term $p(y_t \mid x_t)$ by $\tilde{o}_t(x_t)$.
Message Interpretation

$q(x_t) \propto \tilde{p}_t(x_t)\, \tilde{o}_t(x_t)\, \tilde{p}_{t+1}(x_t)$

= (forward msg) × (observation msg) × (backward msg)

[Figure: node $x_t$ with observation $y_t$, showing the forward, backward, and observation messages arriving at $x_t$.]
Extensions of EP

• Instead of batch iterations, use fixed-lag smoothing for online processing (see the sketch below).
• Instead of assumed-density filtering, use any method for approximate filtering.
  – Examples: EKF, unscented Kalman filter (UKF)
  – Use quadrature or Monte Carlo for the term approximations.

This turns any deterministic filtering method into a smoothing method! All of these methods can be interpreted as finding linear/Gaussian approximations to the original terms.
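As a concrete illustration of the windowed, online structure (for the linear-Gaussian random walk, where the exact messages are available in closed form), here is a sketch of a fixed-lag smoother: filter forward online, and after each new observation run a backward Rauch-Tung-Striebel pass over only the last L states. In extended EP the same window is instead re-iterated with EP messages; all parameter values here are illustrative.

```python
import numpy as np

def fixed_lag_smoother(y, L=5, q=0.1, r=1.0, m0=0.0, v0=10.0):
    fm, fv = [], []                      # filtered means / variances
    m, v = m0, v0
    out = []
    for t, obs in enumerate(y):
        # Forward (filtering) step, as in the Kalman filter sketch above.
        v_pred = v + q
        k = v_pred / (v_pred + r)
        m, v = m + k * (obs - m), (1 - k) * v_pred
        fm.append(m); fv.append(v)
        # Backward Rauch-Tung-Striebel pass over the lag window only.
        sm, sv = m, v
        for s in range(t - 1, max(t - L, -1), -1):
            g = fv[s] / (fv[s] + q)                  # smoother gain
            sm = fm[s] + g * (sm - fm[s])            # prediction mean = fm[s]
            sv = fv[s] + g**2 * (sv - (fv[s] + q))
        out.append((sm, sv))     # smoothed estimate at the oldest window state
    return out
```

Each new observation refines only the last L states, so the cost per time step is O(L); the output at time t is the smoothed belief at time max(t − L + 1, 0).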
Bayesian Network for Wireless Signal Detection

[Figure: dynamic Bayesian network with transmitted signals $s_1, s_2, \ldots, s_T$, channel coefficients $x_1, x_2, \ldots, x_T$, and received observations $y_1, y_2, \ldots, y_T$.]

• $s_i$: transmitted signals
• $x_i$: channel coefficients for digital wireless communication
• $y_i$: received noisy observations
Experimental Results

EP outperforms particle smoothers (Chen, Wang & Liu, 2000) in efficiency, with comparable accuracy.

[Figure: error rate vs. signal-to-noise ratio for the compared methods.]
Computational Complexity

  Algorithm                               Complexity
  Extended EP                             O(nLd^2)
  Stochastic mixture of Kalman filters    O(MLd^2)
  Rao-Blackwellised particle smoothers    O(MNLd^2)

• L: length of the fixed-lag smoothing window
• d: dimension of the parameter vector
• n: number of EP iterations (typically 4 or 5)
• M: number of samples in filtering (often larger than 500)
• N: number of samples in smoothing (larger than 50)

$\frac{MNLd^2}{nLd^2} = \frac{MN}{n} \approx \frac{500 \times 50}{5} = 5{,}000$

e.g., with M = 500, N = 50, and n = 5, extended EP is about 5,000 times faster than Rao-Blackwellised particle smoothers!
Example: Poisson Tracking

• $y_t$ is an integer-valued Poisson variate with mean $\exp(x_t)$.
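The Poisson likelihood is not conjugate to a Gaussian belief over $x_t$, so the ADF/EP observation update has no closed form; per the "quadrature or Monte Carlo" extension above, one can moment-match the tilted distribution numerically. A hedged sketch of a single observation update follows; the grid width and size are illustrative choices.

```python
import numpy as np
from scipy.special import gammaln

def poisson_moment_match(y, m, v, n_grid=2001, width=8.0):
    # Incoming Gaussian belief N(m, v) over x; observation y ~ Poisson(exp(x)).
    x = np.linspace(m - width * np.sqrt(v), m + width * np.sqrt(v), n_grid)
    # Log of the tilted distribution: Poisson(y | exp(x)) * N(x | m, v).
    log_t = y * x - np.exp(x) - gammaln(y + 1) - 0.5 * (x - m) ** 2 / v
    p = np.exp(log_t - log_t.max())
    p /= p.sum()
    new_m = (p * x).sum()
    new_v = (p * (x - new_m) ** 2).sum()
    return new_m, new_v        # Gaussian approximation after seeing y

print(poisson_moment_match(y=3, m=0.0, v=1.0))
```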
Accuracy/Efficiency Trade-off

[Figure: estimation accuracy vs. computation time for the compared methods.]
Inference on Markov Networks

[Roadmap: the four-model overview from above, with Markov networks highlighted.]
Outline

• Background
• Inference on graphical models
  – Extending EP on dynamic Bayesian networks
    • Fixed-lag smoothing: wireless signal detection
    • Different approximation techniques: Poisson tracking
  – Combining EP with the junction tree algorithm on loopy graphs
• Learning conditional graphical models
• Conclusions
Inference on Loopy Graphs

Problem: estimate the marginal distributions of the variables indexed by the nodes in a loopy graph, e.g., $p(x_i)$, $i = 1, \ldots, 16$.

[Figure: a 4×4 grid of nodes $x_1, \ldots, x_{16}$.]
4-Node Loopy Graph

[Figure: four nodes $x_1, x_2, x_3, x_4$ connected in a loop.]

The joint distribution is a product of pairwise potentials, one for each edge:

$p(\mathbf{x}) = \prod_a f_a(\mathbf{x})$

We want to approximate $p(\mathbf{x})$ by a simpler distribution.
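For graphs this small, the target quantity can be computed exactly, which is useful for checking any approximation. Here is a sketch that computes the exact marginals $p(x_i)$ by brute-force enumeration over binary states; the 4-node loop and the random potentials are illustrative.

```python
import numpy as np
from itertools import product

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]            # the 4-node loop
rng = np.random.default_rng(1)
pots = {e: rng.uniform(0.5, 2.0, size=(2, 2)) for e in edges}

def exact_marginals(n=4):
    # Accumulate p(x) ∝ Π_a f_a(x) over all 2^n binary assignments.
    marg = np.zeros((n, 2))
    for x in product([0, 1], repeat=n):
        p = np.prod([pots[i, j][x[i], x[j]] for (i, j) in edges])
        for i in range(n):
            marg[i, x[i]] += p
    return marg / marg.sum(axis=1, keepdims=True)   # normalize per node

print(exact_marginals())
```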
BP vs. TreeEP

[Figure: BP projects the 4-node loop onto a fully factorized approximation; TreeEP projects it onto a spanning tree over $x_1, x_2, x_3, x_4$.]
Junction Tree Representation

[Figure: a distribution $p(\mathbf{x})$ and its approximation $q(\mathbf{x})$, each represented by a junction tree.]
Two Kinds of Edges

• On-tree edges, e.g., $(x_1, x_4)$: incorporated exactly into the junction tree.
• Off-tree edges, e.g., $(x_1, x_2)$: approximated by projecting them onto the tree structure.
KL Minimization

• KL minimization ⇔ moment matching.
• Match the single-node and pairwise marginals of $p$ and $q$.
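The equivalence holds because $q$ lies in an exponential family: the inclusive KL divergence is minimized exactly when the expected sufficient statistics match, and for a tree-structured family those statistics are precisely the single-node and pairwise marginals. In symbols, for $q_{\boldsymbol\theta}(\mathbf{x}) = \exp\big(\boldsymbol\theta^\top \boldsymbol\phi(\mathbf{x}) - A(\boldsymbol\theta)\big)$,

$$\mathrm{KL}(p \,\|\, q_{\boldsymbol\theta}) = \mathbb{E}_p[\log p] - \boldsymbol\theta^\top \mathbb{E}_p[\boldsymbol\phi(\mathbf{x})] + A(\boldsymbol\theta), \qquad \nabla_{\boldsymbol\theta}\, \mathrm{KL} = \mathbb{E}_{q_{\boldsymbol\theta}}[\boldsymbol\phi(\mathbf{x})] - \mathbb{E}_p[\boldsymbol\phi(\mathbf{x})] = \mathbf{0},$$

using the standard identity $\nabla_{\boldsymbol\theta} A(\boldsymbol\theta) = \mathbb{E}_{q_{\boldsymbol\theta}}[\boldsymbol\phi(\mathbf{x})]$.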
Matching Marginals on the Graph

[Figure: two junction-tree update sequences over cliques such as $(x_1, x_2)$, $(x_1, x_3)$, $(x_1, x_4)$, $(x_3, x_5)$, $(x_3, x_6)$, $(x_5, x_7)$. (1) Incorporate edge $(x_3, x_4)$ and update the affected clique marginals. (2) Incorporate edge $(x_6, x_7)$ and update likewise.]
Drawbacks of Global Propagation by Regular EP

• It updates all the cliques even when incorporating only one off-tree edge: computationally expensive.
• It stores each off-tree data message as a whole tree: requires a large amount of memory.
Solution: Local Propagation

• Allow $q(\mathbf{x})$ to be non-coherent during the iterations; it only needs to be coherent at the end.
• Exploit the junction tree representation: propagate information only locally, within the minimal subtree directly connected to the off-tree edge.
  – Reduces computational complexity
  – Saves memory

[Figure: local propagation steps on the junction tree. (1) Incorporate edge $(x_3, x_4)$. (2) Propagate evidence within the affected subtree. (3) Incorporate edge $(x_6, x_7)$.]

On this simple graph, local propagation runs roughly 2 times faster and uses roughly half the memory for storing messages, compared with plain EP.
TreeEP

• Combines EP with the junction tree algorithm.
• Can perform efficiently over hypertrees and hypernodes.

Results on fully-connected graphs, averaged over 10 graphs with randomly generated potentials:

• TreeEP performs the same as or better than all other methods in both accuracy and efficiency!
Learning Conditional Classification Models

[Roadmap: the four-model overview from above, with conditional classification highlighted.]
Outline

• Background
• Inference on graphical models
• Learning conditional graphical models
  – Extending EP classification to perform feature selection
    • Gene expression classification
  – Training Bayesian conditional random fields
    • Handwritten ink analysis
• Conclusions
Motivation

Task 1: Classify high-dimensional datasets with many irrelevant features, e.g., normal vs. cancer microarray data.

Task 2: Train sparse Bayesian kernel classifiers for fast test performance.
Outline

• Background
  – Bayesian classification model
  – Automatic relevance determination (ARD)
• Risk of overfitting by optimizing hyperparameters
• Predictive ARD by expectation propagation (EP):
  – Approximate prediction error
  – EP approximation
• Experiments
• Conclusions
Bayesian Classification Model

Labels: $\mathbf{t}$; inputs: $\mathbf{X}$; parameters: $\mathbf{w}$.

Prior of the classifier: $p(\mathbf{w} \mid \boldsymbol{\alpha}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \mathrm{diag}(\boldsymbol{\alpha})^{-1})$

Likelihood for the data set: $p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}) = \prod_i \Phi(t_i\, \mathbf{w}^\top \mathbf{x}_i)$

where $\Phi(\cdot)$ is the cumulative distribution function of a standard Gaussian.
Evidence and Predictive Distribution

The evidence, i.e., the marginal likelihood of the hyperparameters $\boldsymbol{\alpha}$:

$p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha}) = \int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w}$

The predictive posterior distribution of the label $t_{N+1}$ for a new input $\mathbf{x}_{N+1}$:

$p(t_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{t}, \boldsymbol{\alpha}) = \int p(t_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\alpha})\, d\mathbf{w}$
Automatic Relevance Determination (ARD)

• Give the classifier weights independent Gaussian priors whose variances, $\alpha_i^{-1}$, control how far away from zero each weight is allowed to go.
• Maximize $p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha})$, the marginal likelihood of the model, with respect to $\boldsymbol{\alpha}$.
• Outcome: many elements of $\boldsymbol{\alpha}$ go to infinity, which naturally prunes irrelevant features in the data.
Two Types of Overfitting

• Classical maximum likelihood:
  – Optimizing the classifier weights $\mathbf{w}$ can directly fit noise in the data, resulting in a complicated model.
• Type II maximum likelihood (ARD):
  – Optimizing the hyperparameters corresponds to choosing which variables are irrelevant. Choosing one model out of exponentially many can also overfit if we maximize the model marginal likelihood.
Risk of Optimizing the Hyperparameters

[Figure: toy dataset (X: class 1 vs. O: class 2) with decision boundaries from two evidence-maximized ARD solutions (Evd-ARD-1, Evd-ARD-2) and from the Bayes point.]
Predictive-ARD

• Choose the model with the best estimated predictive performance instead of the most probable model.
• Expectation propagation (EP) estimates the leave-one-out predictive performance without performing any expensive cross-validation.
Estimate Predictive Performance

• Predictive posterior given a test data point $\mathbf{x}_{N+1}$:

  $p(t_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{t}) = \int p(t_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t})\, d\mathbf{w}$

• EP can estimate the predictive leave-one-out error probability:

  $\mathrm{LOO}_{\mathrm{prob}} = \frac{1}{N} \sum_{i=1}^{N} \Big(1 - p(t_i \mid \mathbf{x}_i, \mathbf{t}_{\setminus i})\Big), \quad p(t_i \mid \mathbf{x}_i, \mathbf{t}_{\setminus i}) \approx \int p(t_i \mid \mathbf{x}_i, \mathbf{w})\, q(\mathbf{w} \mid \mathbf{t}_{\setminus i})\, d\mathbf{w}$

  where $q(\mathbf{w} \mid \mathbf{t}_{\setminus i})$ is the approximate posterior with the $i$th label left out.

• EP can also estimate the predictive leave-one-out error count:

  $\mathrm{LOO}_{\mathrm{count}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{I}\Big(p(t_i \mid \mathbf{x}_i, \mathbf{t}_{\setminus i}) < \tfrac{1}{2}\Big)$
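A sketch of how these LOO estimates fall out of EP for the probit model: the deletion step already produces the cavity posterior $q(\mathbf{w} \mid \mathbf{t}_{\setminus i})$ as a Gaussian $\mathcal{N}(\mathbf{m}_i, \mathbf{V}_i)$, and the probit predictive integral against a Gaussian has a closed form. The helper below assumes those cavity moments have been collected; it is illustrative, not the talk's code.

```python
import numpy as np
from scipy.stats import norm

def loo_estimates(X, t, cavity_means, cavity_covs):
    # Uses the identity: ∫ Φ(t w'x) N(w | m, V) dw = Φ(t m'x / sqrt(1 + x'Vx)).
    probs = []
    for x_i, t_i, m, V in zip(X, t, cavity_means, cavity_covs):
        z = t_i * (m @ x_i) / np.sqrt(1.0 + x_i @ V @ x_i)
        probs.append(norm.cdf(z))               # p(t_i | x_i, t_{\i})
    probs = np.array(probs)
    return np.mean(1.0 - probs), np.mean(probs < 0.5)   # soft / hard LOO error
```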
Expectation Propagation in a Nutshell

• Approximate the posterior by simpler parametric terms:

  $p(\mathbf{t} \mid \mathbf{w}) = \prod_i f_i(\mathbf{w}) = \prod_i \Phi(t_i\, \mathbf{w}^\top \mathbf{x}_i) \approx q(\mathbf{w}) = \prod_i \tilde{f}_i(\mathbf{w})$

• Each approximation term $\tilde{f}_i(\mathbf{w})$ lives in an exponential family (e.g., Gaussian).
EP in a Nutshell

Three key steps:

• Deletion step: approximate the "leave-one-out" predictive posterior for the $i$th point:

  $q^{\setminus i}(\mathbf{w}) = \frac{q(\mathbf{w})}{\tilde{f}_i(\mathbf{w})} \propto \prod_{j \neq i} \tilde{f}_j(\mathbf{w})$

• ADF step: minimize the following KL divergence by moment matching:

  $\tilde{f}_i(\mathbf{w}) = \arg\min_{\tilde{f}_i(\mathbf{w})} \mathrm{KL}\big(f_i(\mathbf{w})\, q^{\setminus i}(\mathbf{w}) \,\big\|\, \tilde{f}_i(\mathbf{w})\, q^{\setminus i}(\mathbf{w})\big)$

• Inclusion step:

  $q(\mathbf{w}) \leftarrow \tilde{f}_i(\mathbf{w})\, q^{\setminus i}(\mathbf{w})$
The key observation: we can use the approximate predictive posterior, obtained in the deletion step, for model selection. No extra computation!
Comparison of Different Model Selection Criteria for ARD Training

[Figure, five rows: (1) test error; (2) estimated leave-one-out error probability; (3) estimated leave-one-out error counts; (4) evidence (model marginal likelihood); (5) fraction of selected features.]

The estimated leave-one-out error probabilities and counts correlate better with the test error than the evidence or the sparsity level does.
Gene Expression Classification

Task: classify gene expression datasets into different categories, e.g., normal vs. cancer.

Challenge: thousands of genes are measured in the microarray data, but only a small subset of genes is likely to be correlated with the classification task.
Classifying Leukemia Data

• Task: distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
• Dataset: 47 ALL and 25 AML samples, with 7,129 features per sample.
• The dataset was randomly split 100 times into 36 training and 36 testing samples.
Classifying Colon Cancer Data

• Task: distinguish normal from cancer samples.
• Dataset: 22 normal and 40 cancer samples, with 2,000 features per sample.
• The dataset was randomly split 100 times into 50 training and 12 testing samples.
• SVM results are from Li et al. (2002).
Bayesian Sparse Kernel Classifiers

• Use feature/kernel expansions defined on the training data points.
• Predictive-ARD-EP trains a classifier that depends on only a small subset of the training set.
• Fast test performance.
Test error rates and numbers of relevance/support vectors on the breast cancer dataset. 50 partitionings of the data were used. All methods use the same Gaussian kernel with kernel width 5; the trade-off parameter C in the SVM is chosen via 10-fold cross-validation for each partition.
Test error rates on the diabetes dataset. 100 partitionings of the data were used. Evidence-ARD-EP and Predictive-ARD-EP use a Gaussian kernel with kernel width 5.
Appendix: Sequential Updates

• EP approximates the true likelihood terms by Gaussian "virtual observations."
• Given the Gaussian virtual observations, the classification model becomes a regression model.
• We can then achieve efficient sequential updates without maintaining and updating a full covariance matrix (Faul & Tipping 2002).
Learning Conditional Random Fields

[Roadmap: the four-model overview from above, with conditional random fields highlighted.]
Outline

• Background
• Inference on graphical models
• Learning conditional graphical models
  – Extending EP classification to perform feature selection
    • Gene expression classification
  – Training Bayesian conditional random fields
    • FAQ labeling
    • Handwritten ink analysis
• Conclusions
Conditional Random Fields (CRFs)

[Figure: inputs $x_1, x_2$, labels $t_1, t_2$, and parameters $w$, with an edge between the labels.]

• CRFs generalize the traditional classification model by introducing correlations between labels, via potential functions on label cliques.
• They model structured, interdependent data, e.g., web pages, natural language, and multiple visual objects in a picture.
• Information at one location propagates to other locations.
Learning the Parameters w by ML/MAP

• Maximum likelihood (ML): maximize the data likelihood $p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \prod_a g_a(\mathbf{t}_a, \mathbf{x}_a; \mathbf{w})$, where $Z(\mathbf{w})$ is the partition function.
• Maximum a posteriori (MAP): place a Gaussian prior on $\mathbf{w}$ and maximize the posterior.
• Problem with ML/MAP: overfitting to the noise in the data.
Bayesian Conditional Random Fields

• Bayesian training to avoid overfitting.
• Need efficient training:
  – the exact posterior of $\mathbf{w}$ is intractable;
  – we use a Gaussian approximate posterior of $\mathbf{w}$.
Two Difficulties for Bayesian Training

• The partition function $Z(\mathbf{w})$ appears in the denominator:
  – regular variational methods and EP do not apply.
• The partition function is a complex function of $\mathbf{w}$.
Turn the Denominator into a Numerator (1)

• Transformed EP:
  – Deletion step
  – ADF step
  – Inclusion step
Turn the Denominator into a Numerator (2)

• Power EP:
  – Deletion step
  – ADF step
  – Inclusion step
Approximating the Partition Function

• The parameters $\mathbf{w}$ and the labels $\mathbf{t}$ are intertwined in $Z(\mathbf{w}) = \sum_{\mathbf{t}} \prod_k g_k(\mathbf{t}_k, \mathbf{x}_k; \mathbf{w})$, where $k = \{i, j\}$ indexes the edges.
• The joint distribution of $\mathbf{w}$ and $\mathbf{t}$ couples the two.
• Factorized approximation: approximate this joint by a product of separate terms over $\mathbf{w}$ and $\mathbf{t}$.
Flatten the Approximation Structure

• Remove the intermediate level of the approximation.
• Increased efficiency, stability, and accuracy!

[Figure: approximation quality vs. iterations for the two-level and flattened approximation structures.]
Classifying Synthetic Data

• Data generation: first randomly sample the input $\mathbf{x}$ and fix the true parameters $\mathbf{w}$; then sample the labels $\mathbf{t}$.
• Graphical structure: three nodes in a simple loop.
• We compare maximum-likelihood-trained CRFs with BCRFs.

Test error rates for ML- and MAP-trained CRFs and for BCRFs on synthetic datasets with different numbers of training loops. The results are averaged over 10 runs; each run has 1,000 test loops. The error bars are the standard errors multiplied by 1.64.
FAQ Labeling

• The dataset consists of 47 files belonging to 7 Usenet newsgroup FAQs. Each file has multiple lines, which can be the header (H), a question (Q), an answer (A), or the tail (T).
• Task: since identifying the header and the tail is relatively easy, we simplify the task to labeling only the lines that are questions or answers.
CRF Modeling

• Extract features from each line of a FAQ file.
• Model a FAQ file by a chain-structured CRF. The feature vector for each edge of the CRF is simply the concatenation of the feature vectors extracted at the two neighboring vertices (see the sketch below).
• Parse FAQ files with the BCRF.
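A sketch of the edge-feature construction just described: for a chain-structured CRF over the lines of a FAQ file, the feature vector for the edge between lines i and i+1 is the concatenation of the per-line feature vectors. The helper `extract_line_features` is hypothetical, standing in for the boolean line-based features of McCallum et al. (2000).

```python
import numpy as np

def edge_features(lines, extract_line_features):
    # Node features per line, then concatenated pairs for each chain edge.
    node_feats = [extract_line_features(line) for line in lines]
    return [np.concatenate([node_feats[i], node_feats[i + 1]])
            for i in range(len(node_feats) - 1)]

# Toy usage with two fake indicator features per line (illustrative only).
fake_extract = lambda line: np.array([line.startswith(">"), line.endswith("?")], float)
print(edge_features(["How do I foo?", "> Like this."], fake_extract))
```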
Excerpt from a labeled FAQ (McCallum et al. 2000).

Line-based features (McCallum et al. 2000).

Errors on the FAQ dataset for hidden Markov models (HMM), maximum entropy Markov models (MEMM), and CRFs.
FAQ Labeling Results

Test error rates of different algorithms on the FAQ dataset. The results are averaged over 10 random splits.
Ink Application: Analyzing Handwritten Organization Charts

• Parse a chart into its components: containers vs. connectors.
Building the CRF

We break the task into three steps:

• Subdivide the pen strokes into fragments.
• Construct a conditional random field on the fragments.
• Train and run inference on the network.

Some fragments are locally ambiguous (container or not? e.g., fragments 18, 21, 22, 29) but globally unambiguous.
Ink Application Results

[Table: error rates for Bayesian classifiers vs. BCRFs.]

• 10 trials, with 14 graphs for training and 9 graphs for testing in each trial.
• BCRFs significantly outperformed ML- and MAP-trained CRFs.
4 Types of Graphical Models

[Roadmap: the four-model overview from above.]
Outline

• Background
• Inference on graphical models
• Learning conditional graphical models
• Conclusions
  – 4 extensions to EP on 4 types of graphical models, and 3 real-world applications
  – Inference: a better trade-off between accuracy and efficiency
  – Learning: better generalization than the state of the art
Conclusion: 4 Extensions & 3 Applications

1. Extending EP on dynamic models by fixed-lag smoothing and by embedding different approximation techniques:
   – Wireless signal detection: much less computation, with comparable or superior accuracy to sequential Monte Carlo.
2. Combining EP with local propagation on loopy graphs:
   – Outperformed belief propagation, naïve mean-field, and structured variational methods.
3. Extending EP classification to perform feature selection:
   – Gene expression classification: outperformed traditional ARD and SVMs with feature selection.
4. Training Bayesian conditional random fields, handling the denominator and flattening the approximation structure:
   – FAQ labeling and ink analysis: beats ML- and MAP-trained CRFs.
Extended EP Algorithms for Inference and Learning

[Figure, left: inference error vs. computational time. Extended EP outperforms state-of-the-art inference techniques in the trade-off between accuracy and efficiency.]

[Figure, right: test error (with error bars) for learning techniques. Extended EP outperforms state-of-the-art techniques on sparse learning and CRFs in generalization performance.]
More questions?
Email: [email protected]