Sampling Techniques for Probabilistic Reasoning in Graphical Models

Post on 20-May-2020



Overview

1. Probabilistic Reasoning/Graphical models

2. Importance Sampling

3. Markov Chain Monte Carlo: Gibbs Sampling

4. Sampling in presence of Determinism

5. Rao-Blackwellisation

6. AND/OR importance sampling

Overview

1. Probabilistic Reasoning/Graphical models

2. Importance Sampling

3. Markov Chain Monte Carlo: Gibbs Sampling

4. Sampling in presence of Determinism

5. Cutset-based Variance Reduction

6. AND/OR importance sampling

Probabilistic Reasoning / Graphical models

• Graphical models:

– Bayesian network, constraint networks, mixed network

• Queries

• Exact algorithm

– using inference,

– search and hybrids

• Graph parameters:

– tree-width, cycle-cutset, w-cutset

Queries

• Probability of evidence (or partition function):

P(e) = Σ_{x ∈ D(X\E)} Π_{i=1}^n P(x_i | pa_i)|_{E=e}    (for a Markov network, Z = Σ_x Π_i C_i(x))

• Posterior marginal (beliefs):

P(x_i | e) = P(x_i, e) / P(e)

• Most Probable Explanation:

x* = argmax_x P(x, e)

5

Approximation

• Inference, search, and hybrids are too expensive when the graph is dense (high treewidth), so we resort to:

• Bounding inference:
– mini-bucket and mini-clustering
– belief propagation

• Bounding search:
– sampling

• Goal: an anytime scheme

Overview

1. Probabilistic Reasoning/Graphical models

2. Importance Sampling

3. Markov Chain Monte Carlo: Gibbs Sampling

4. Sampling in presence of Determinism

5. Rao-Blackwellisation

6. AND/OR importance sampling

Outline

• Definitions and Background on Statistics

• Theory of importance sampling

• Likelihood weighting

• State-of-the-art importance sampling techniques

7

A sample

• Given a set of variables X={X1,...,Xn}, a sample, denoted by St is an instantiation of all variables:

8

S^t = (x_1^t, x_2^t, ..., x_n^t)

How to draw a sample ?Univariate distribution

• Example: Given random variable X having domain {0, 1} and a distribution P(X) = (0.3, 0.7).

• Task: Generate samples of X from P.

• How?

– draw a random number r ∈ [0, 1]

– If (r < 0.3) then set X=0

– Else set X=1

9
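The inverse-CDF procedure on this slide can be sketched in a few lines of Python (a minimal illustration, not from the slides):

```python
import random

def sample_discrete(probs, rng=random):
    """Draw one value from a discrete distribution via the inverse-CDF trick."""
    r = rng.random()          # r uniform in [0, 1)
    cumulative = 0.0
    for value, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return value
    return len(probs) - 1     # guard against floating-point round-off

random.seed(0)
samples = [sample_discrete([0.3, 0.7]) for _ in range(10000)]
frac_ones = sum(samples) / len(samples)  # should be close to 0.7
```

With P(X) = (0.3, 0.7), the empirical fraction of ones approaches 0.7 as the number of samples grows.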

How to draw a sample?Multi-variate distribution

• Let X={X1,..,Xn} be a set of variables

• Express the distribution in product form

• Sample variables one by one from left to right, along the ordering dictated by the product form.

• Bayesian network literature: Logic sampling

10

P(X) = P(X1) · P(X2 | X1) · ... · P(Xn | X1, ..., X_{n-1})

Sampling for Prob. InferenceOutline

• Logic Sampling

• Importance Sampling

– Likelihood Sampling

– Choosing a Proposal Distribution

• Markov Chain Monte Carlo (MCMC)

– Metropolis-Hastings

– Gibbs sampling

• Variance Reduction

Logic Sampling:

No Evidence (Henrion 1988)

Input: Bayesian network

X= {X1,…,XN}, N- #nodes, T - # samples

Output: T samples

Process nodes in topological order – first process the

ancestors of a node, then the node itself:

1. For t = 1 to T

2. For i = 1 to N

3. X_i ← sample x_i^t from P(X_i | pa_i)

12
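A minimal Python sketch of logic sampling on a tiny two-node network A → B; the CPT numbers are illustrative, not from the slides:

```python
import random

# Logic sampling (Henrion 1988) on a toy Bayesian network A -> B.
P_A = {0: 0.6, 1: 0.4}                        # P(A), illustrative numbers
P_B_given_A = {0: {0: 0.9, 1: 0.1},            # P(B | A=0)
               1: {0: 0.2, 1: 0.8}}            # P(B | A=1)

def draw(dist, rng):
    """Inverse-CDF draw from a {value: prob} table."""
    r, c = rng.random(), 0.0
    for v, p in dist.items():
        c += p
        if r < c:
            return v
    return v

def logic_sample(rng):
    # Process nodes in topological order: parents before children.
    a = draw(P_A, rng)
    b = draw(P_B_given_A[a], rng)
    return a, b

rng = random.Random(1)
samples = [logic_sample(rng) for _ in range(20000)]
p_b1 = sum(b for _, b in samples) / len(samples)
# Exact P(B=1) = 0.6*0.1 + 0.4*0.8 = 0.38
```

The empirical frequency of B = 1 converges to the exact marginal 0.38.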

Logic sampling (example)

13

Network: X1 → X2, X1 → X3, {X2, X3} → X4, with

P(X1, X2, X3, X4) = P(X1) P(X2 | X1) P(X3 | X1) P(X4 | X2, X3)

No evidence. // generate sample k:

1. Sample x1 from P(X1)
2. Sample x2 from P(X2 | X1 = x1)
3. Sample x3 from P(X3 | X1 = x1)
4. Sample x4 from P(X4 | X2 = x2, X3 = x3)

),|( 324 XXXP

Logic Sampling w/ Evidence

Input: Bayesian network

X= {X1,…,XN}, N- #nodes

E – evidence, T - # samples

Output: T samples consistent with E

1. For t=1 to T

2. For i=1 to N

3. Xi sample xit from P(xi | pai)

4. If X_i ∈ E and x_i^t ≠ e_i, reject sample:

5. Goto Step 1.

14
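The rejection step can be added to the toy logic sampler (network A → B with illustrative CPTs, evidence B = 1):

```python
import random

# Logic sampling with evidence by rejection, on a toy network A -> B
# (CPT numbers illustrative, not from the slides). Evidence: B = 1.
P_A = [0.6, 0.4]
P_B_given_A = [[0.9, 0.1], [0.2, 0.8]]

def draw(probs, rng):
    r, c = rng.random(), 0.0
    for v, p in enumerate(probs):
        c += p
        if r < c:
            return v
    return len(probs) - 1

rng = random.Random(2)
accepted = []
for _ in range(50000):
    a = draw(P_A, rng)
    b = draw(P_B_given_A[a], rng)
    if b != 1:            # sample disagrees with the evidence: reject it
        continue
    accepted.append(a)

p_a1_given_b1 = sum(accepted) / len(accepted)
# Exact: P(A=1|B=1) = 0.4*0.8 / (0.6*0.1 + 0.4*0.8) ≈ 0.842
```

Note how samples inconsistent with the evidence are wasted; with unlikely evidence almost everything is rejected, which motivates likelihood weighting later.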

Logic Sampling (example)

15

Network: X1 → X2, X1 → X3, {X2, X3} → X4, with CPTs P(x1), P(x2 | x1), P(x3 | x1), P(x4 | x2, x3).

Evidence: X3 = 0. // generate sample k:

1. Sample x1 from P(x1)
2. Sample x2 from P(x2 | x1)
3. Sample x3 from P(x3 | x1)
4. If x3 ≠ 0, reject the sample and start from 1; otherwise
5. Sample x4 from P(x4 | x2, x3)

Expected value: Given a probability distribution P(X)

and a function g(X) defined over a set of variables X =

{X1, X2, … Xn}, the expected value of g w.r.t. P is

Variance: The variance of g w.r.t. P is:

Expected value and Variance

16

E_P[g(x)] = Σ_x g(x) P(x)

Var_P[g(x)] = Σ_x [g(x) − E_P[g(x)]]² P(x)

Monte Carlo Estimate

• Estimator:

– An estimator is a function of the samples.

– It produces an estimate of the unknown parameter of the sampling distribution.

17

Given i.i.d. samples S^1, S^2, ..., S^T drawn from P, the Monte Carlo estimate of E_P[g(x)] is given by:

ĝ = (1/T) Σ_{t=1}^T g(S^t)

Example: Monte Carlo estimate

• Given:
– A distribution P(X) = (0.3, 0.7)
– g(X) = 40 if X equals 0, 50 if X equals 1

• Exact: E_P[g(x)] = 40×0.3 + 50×0.7 = 47
• Generate k = 10 samples from P: 0, 1, 1, 1, 0, 1, 1, 0, 1, 0

18

ĝ = [40·#(samples X=0) + 50·#(samples X=1)] / #samples = (40·4 + 50·6)/10 = 460/10 = 46
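The arithmetic above, and its large-sample behavior, can be checked directly in Python:

```python
import random

# Monte Carlo estimate of E_P[g(X)] with P(X) = (0.3, 0.7),
# g(0) = 40, g(1) = 50; the exact expectation is 47.
g = {0: 40, 1: 50}

# The ten samples from the slide:
samples = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0]
estimate = sum(g[x] for x in samples) / len(samples)   # (40*4 + 50*6)/10 = 46

# With many samples drawn from P the estimate approaches 47:
rng = random.Random(3)
big = [0 if rng.random() < 0.3 else 1 for _ in range(100000)]
big_estimate = sum(g[x] for x in big) / len(big)
```

Ten samples give 46; a hundred thousand samples land close to the exact value 47.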

Outline

• Definitions and Background on Statistics

• Theory of importance sampling

• Likelihood weighting

• State-of-the-art importance sampling techniques

19

Importance sampling: Main idea

• Express the query as the expected value of a random variable w.r.t. a distribution Q.

• Generate random samples from Q.

• Estimate the expected value from the generated samples using a Monte Carlo estimator (average).

20

Importance sampling for P(e)

Let Z = X \ E. Let Q(Z) be a (proposal) distribution satisfying: P(z, e) > 0 ⇒ Q(z) > 0.

Then we can rewrite P(e) as:

P(e) = Σ_z P(z, e) = Σ_z Q(z) · P(z, e)/Q(z) = E_Q[ P(z, e)/Q(z) ] = E_Q[ w(z) ],  where w(z) = P(z, e)/Q(z)

Monte Carlo estimate:

P̂(e) = (1/T) Σ_{t=1}^T w(z^t),  z^t ~ Q(Z)
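A small Python sketch of this estimator on the toy network A → B with evidence B = 1 (CPTs illustrative); here Z = {A} and the proposal Q(A) is uniform:

```python
import random

# Importance-sampling estimate of P(e) on a toy network A -> B with
# evidence e: B = 1. Z = {A}; proposal Q(A) uniform. Numbers illustrative.
P_A = [0.6, 0.4]
P_B_given_A = [[0.9, 0.1], [0.2, 0.8]]
Q_A = [0.5, 0.5]

def joint(a):                       # P(a, B=1)
    return P_A[a] * P_B_given_A[a][1]

rng = random.Random(4)
T = 50000
weights = []
for _ in range(T):
    a = 0 if rng.random() < Q_A[0] else 1    # z^t ~ Q
    weights.append(joint(a) / Q_A[a])        # w(z^t) = P(z^t, e) / Q(z^t)

p_e_hat = sum(weights) / T
# Exact P(e) = 0.6*0.1 + 0.4*0.8 = 0.38
```

The average weight converges to the exact P(e) = 0.38, illustrating E_Q[w(z)] = P(e).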

Properties of IS estimate of P(e)

• Convergence: by law of large numbers

• Unbiased.

• Variance:

• Convergence: by the law of large numbers,

P̂(e) = (1/T) Σ_{t=1}^T w(z^t) → P(e)  a.s., as T → ∞

• Unbiased: E_Q[P̂(e)] = P(e)

• Variance:

Var_Q[P̂(e)] = Var_Q[ (1/T) Σ_{t=1}^T w(z^t) ] = Var_Q[w(z)] / T

Properties of IS estimate of P(e)

• Mean Squared Error of the estimator

MSE_Q[P̂(e)] = E_Q[(P̂(e) − P(e))²]
            = Var_Q[P̂(e)] + (E_Q[P̂(e)] − P(e))²
            = Var_Q[P̂(e)] = Var_Q[w(z)] / T

The squared-bias term is zero because the expected value of the estimator equals the quantity being estimated.

Estimating P(Xi|e)

24

Let δ_{x_i}(z) be a Dirac delta function: 1 if z contains x_i and 0 otherwise.

P(x_i | e) = P(x_i, e) / P(e)
           = [Σ_z δ_{x_i}(z) P(z, e)] / [Σ_z P(z, e)]
           = E_Q[ δ_{x_i}(z) P(z, e)/Q(z) ] / E_Q[ P(z, e)/Q(z) ]

Idea: estimate numerator and denominator by IS.

Ratio estimate:  P̂(x_i | e) = [Σ_{k=1}^T δ_{x_i}(z^k) w(z^k)] / [Σ_{k=1}^T w(z^k)]

Estimate is biased:  E[P̂(x_i | e)] ≠ P(x_i | e)

Properties of the IS estimator for P(Xi|e)

• Convergence: By Weak law of large numbers

• Asymptotically unbiased

• Variance

– Harder to analyze

– Liu suggests a measure called “Effective sample size”

25

P̂(x_i | e) → P(x_i | e) as T → ∞,   lim_{T→∞} E[P̂(x_i | e)] = P(x_i | e)

Generating samples from Q

• No restrictions on “how to”

• Typically, express Q in product form:

– Q(Z)=Q(Z1)xQ(Z2|Z1)x….xQ(Zn|Z1,..Zn-1)

• Sample along the order Z1,..,Zn

• Example:

– Z1 ~ Q(Z1) = (0.2, 0.8)

– Z2 ~ Q(Z2 | Z1) = (0.1, 0.9; 0.2, 0.8)

– Z3 ~ Q(Z3 | Z1, Z2) = Q(Z3) = (0.5, 0.5)
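Sampling from the product-form Q on this slide, left to right, looks as follows in Python (a minimal sketch of the example's three tables):

```python
import random

# Sampling Z = (Z1, Z2, Z3) from the product-form proposal on the slide:
# Q(Z1) = (0.2, 0.8); Q(Z2|Z1=0) = (0.1, 0.9), Q(Z2|Z1=1) = (0.2, 0.8);
# Q(Z3) = (0.5, 0.5), independent of Z1 and Z2.
Q1 = [0.2, 0.8]
Q2 = {0: [0.1, 0.9], 1: [0.2, 0.8]}
Q3 = [0.5, 0.5]

def draw(probs, rng):
    return 0 if rng.random() < probs[0] else 1

def sample_Q(rng):
    z1 = draw(Q1, rng)           # sample along the order Z1, Z2, Z3
    z2 = draw(Q2[z1], rng)
    z3 = draw(Q3, rng)
    return z1, z2, z3

rng = random.Random(5)
samples = [sample_Q(rng) for _ in range(40000)]
p_z1_1 = sum(s[0] for s in samples) / len(samples)   # ≈ 0.8
p_z2_1 = sum(s[1] for s in samples) / len(samples)   # ≈ 0.2*0.9 + 0.8*0.8 = 0.82
```

Each variable is drawn conditioned on the values already sampled, exactly the left-to-right order the product form dictates.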

Outline

• Definitions and Background on Statistics

• Theory of importance sampling

• Likelihood weighting

• State-of-the-art importance sampling techniques

27

Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)

28

Works well for likely evidence!

“Clamping” evidence+logic sampling+weighing samples by evidence likelihood

Is an instance of importance sampling!

Likelihood Weighting: Sampling

29

Sample in topological order over X!

Clamp the evidence; sample x_i ~ P(X_i | pa_i), which is a look-up in the CPT!

Likelihood Weighting: Proposal Distribution

30

Q(X \ E) = Π_{X_i ∈ X\E} P(X_i | pa_i, e)

Weights: given a sample x = (x_1, ..., x_n),

w = P(x, e) / Q(x \ e)
  = [Π_{X_i ∈ X\E} P(x_i | pa_i, e) · Π_{E_j ∈ E} P(e_j | pa_j)] / Π_{X_i ∈ X\E} P(x_i | pa_i, e)
  = Π_{E_j ∈ E} P(e_j | pa_j)

Example: Given a Bayesian network P(X1, X2, X3) = P(X1) P(X3 | X1) P(X2 | X1, X3) and evidence X2 = x2:

Q(X1, X3) = P(X1) P(X3 | X1)

Notice: Q is another Bayesian network

Likelihood Weighting: Estimates

31

Estimate P(e):

P̂(e) = (1/T) Σ_{t=1}^T w^(t)

Estimate posterior marginals:

P̂(x_i | e) = P̂(x_i, e) / P̂(e) = [Σ_{t=1}^T w^(t) δ_{x_i}(x^(t))] / [Σ_{t=1}^T w^(t)]

where δ_{x_i}(x^(t)) = 1 if x^(t) contains x_i and equals zero otherwise

Likelihood Weighting

• Converges to exact posterior marginals

• Generates Samples Fast

• Sampling distribution is close to the prior (especially if the evidence sits at leaf nodes)

• Increasing sampling variance:

⇒ convergence may be slow

⇒ many samples with w(x^(t)) = 0 are effectively rejected

32
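Likelihood weighting on the toy network A → B with evidence B = 1 (CPTs illustrative): the evidence is clamped, non-evidence variables are sampled from their CPTs, and each sample is weighted by the likelihood of the evidence given its parents:

```python
import random

# Likelihood weighting on a toy network A -> B, evidence B = 1
# (numbers illustrative, not from the slides).
P_A = [0.6, 0.4]
P_B_given_A = [[0.9, 0.1], [0.2, 0.8]]

rng = random.Random(6)
T = 50000
wsum = 0.0
wsum_a1 = 0.0
for _ in range(T):
    a = 0 if rng.random() < P_A[0] else 1   # sample A ~ P(A)
    w = P_B_given_A[a][1]                    # weight = P(B=1 | a); B is clamped
    wsum += w
    if a == 1:
        wsum_a1 += w

p_e_hat = wsum / T                  # estimate of P(B=1); exact 0.38
p_a1_hat = wsum_a1 / wsum           # ratio estimate of P(A=1|B=1); exact ≈ 0.842
```

Unlike rejection sampling, no sample is discarded; unlikely evidence simply produces small weights.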

Outline

• Definitions and Background on Statistics

• Theory of importance sampling

• Likelihood weighting

• Error estimation

• State-of-the-art importance sampling techniques

33


Outline

• Definitions and Background on Statistics

• Theory of importance sampling

• Likelihood weighting

• State-of-the-art importance sampling techniques

38

Proposal selection

• One should try to select a proposal that is as close as possible to the posterior distribution.

Var_Q[P̂(e)] = Var_Q[w(z)] / T, where

Var_Q[w(z)] = Σ_{z ∈ D(Z)} (P(z, e)/Q(z))² Q(z) − P(e)²

To have a zero-variance estimator, set:

Q(z) = P(z, e)/P(e) = P(z | e)

Perfect sampling using Bucket Elimination

• Algorithm:

– Run Bucket elimination on the problem along an ordering o=(XN,..,X1).

– Sample along the reverse ordering: (X1,..,XN)

– At each variable Xi, recover the probability P(Xi|x1,...,xi-1) by referring to the bucket.

41

Bucket Elimination

Query: P(a | e = 0) = P(a, e = 0) / P(e = 0), where

P(a, e = 0) = Σ_{c,b,e=0,d} P(a) P(b|a) P(c|a) P(e|b,c) P(d|a,b)
            = P(a) Σ_c P(c|a) Σ_b P(b|a) P(e=0|b,c) Σ_d P(d|a,b)

Elimination order: d, e, b, c.

D: P(d|a,b) → f_D(a,b) = Σ_d P(d|a,b)
E: P(e|b,c) → f_E(b,c) = P(e=0|b,c)
B: P(b|a), f_D(a,b), f_E(b,c) → f_B(a,c) = Σ_b P(b|a) f_D(a,b) f_E(b,c)
C: P(c|a), f_B(a,c) → f_C(a) = Σ_c P(c|a) f_B(a,c)
A: P(a), f_C(a) → P(a, e=0) = P(a) f_C(a)

Bucket tree: clusters {D,A,B}, {E,B,C}, {B,A,C}, {C,A}, {A}, holding the original functions and passing the messages f_D(a,b), f_E(b,c), f_B(a,c), f_C(a).

Time and space exp(w*)

42

Bucket elimination (BE): Algorithm elim-bel (Dechter 1996)

Elimination operator: Σ_b

bucket B: P(B|A), P(D|B,A), P(e|B,C)
bucket C: P(C|A), h^B(A,D,C,e)
bucket D: h^C(A,D,e)
bucket E: h^D(A,e)
bucket A: P(A), h^E(A)
→ P(e)

43

Sampling from the output of BE (Dechter 2002)

bucket B: P(B|A), P(D|B,A), P(e|B,C)
bucket C: P(C|A), h^B(A,D,C,e)
bucket D: h^C(A,D,e)
bucket E: h^D(A,e)   (evidence bucket: ignore)
bucket A: P(A), h^E(A)

Sample A: Q(A) ∝ P(A) h^E(A); a ~ Q(A)
Sample D: set A = a in the bucket; Q(D | a, e) ∝ h^C(a, D, e); d ~ Q(D | a, e)
Sample C: set A = a, D = d in the bucket; Q(C | a, d, e) ∝ P(C | a) h^B(a, d, C, e); c ~ Q(C | a, d, e)
Sample B: set A = a, D = d, C = c in the bucket; Q(B | a, d, c, e) ∝ P(B | a) P(d | B, a) P(e | B, c); b ~ Q(B | a, d, c, e)

Mini-buckets: “local inference”

• Computation in a bucket is time and space exponential in the number of variables involved

• Therefore, partition functions in a bucket into “mini-buckets” on smaller number of variables

• Can control the size of each “mini-bucket”, yielding polynomial complexity.

44

45

Mini-Bucket Elimination

Space and time constraints: the maximum scope size of any new function generated is bounded by 2. BE would generate a function of scope size 3, so it cannot be used.

bucket B: mini-buckets {P(B|A), P(D|B,A)} and {P(e|B,C)}; Σ_B over each mini-bucket → h^B(A,D) and h^B(C,e)
bucket C: P(C|A), h^B(C,e) → h^C(A,e)
bucket D: h^B(A,D) → h^D(A)
bucket E: h^C(A,e) → h^E(A)
bucket A: P(A), h^D(A), h^E(A) → approximation of P(e)

46

Sampling from the output of MBE

bucket B: P(B|A), P(D|B,A) | P(e|B,C)
bucket C: P(C|A), h^B(C,e)
bucket D: h^B(A,D)
bucket E: h^C(A,e)
bucket A: P(A), h^D(A), h^E(A)

Sampling is the same as in BE-sampling, except that now we construct Q from a randomly selected mini-bucket

IJGP-Sampling (Gogate and Dechter, 2005)

• Iterative Join Graph Propagation (IJGP)

– A Generalized Belief Propagation scheme (Yedidia et al., 2002)

• IJGP yields better approximations of P(X|E) than MBE

– (Dechter, Kask and Mateescu, 2002)

• Output of IJGP is same as mini-bucket “clusters”

• Currently the best performing IS scheme!

Current Research question

• Given a Bayesian network with evidence, or a Markov network, representing a function P, generate another Bayesian network representing a function Q (from a family of distributions restricted by structure) such that Q is closest to P.

• Current approaches:
– Mini-buckets
– IJGP
– Both

• Experimented with, but they need to be justified theoretically.

Algorithm: Approximate Sampling

1) Run IJGP or MBE

2) At each branch point, compute the edge probabilities by consulting the output of IJGP or MBE

• Rejection Problem:

– Some assignments generated are non-solutions

Adaptive Importance Sampling

Initial proposal: Q¹(Z) = Q¹(Z1) Q¹(Z2 | pa(Z2)) ... Q¹(Zn | pa(Zn))
P̂(E = e) = 0
For i = 1 to k do
  Generate samples z¹, ..., z^N from Q^i
  P̂(E = e) = P̂(E = e) + (1/N) Σ_{j=1}^N w(z^j)
  Update Q^{i+1} from Q^i
End
Return P̂(E = e) = P̂(E = e) / k

Adaptive Importance Sampling

• General case

• Given k proposal distributions

• Take N samples out of each distribution

• Approximate P(e)

P̂(e) = (1/k) Σ_{j=1}^k AvgWeight(j-th proposal)

Estimating Q'(z)

Q'(Z) = Q'(Z1) Q'(Z2 | pa(Z2)) ... Q'(Zn | pa(Zn))

where each Q'(Zi | Z1, ..., Z_{i-1}) is estimated by importance sampling

Overview

1. Probabilistic Reasoning/Graphical models

2. Importance Sampling

3. Markov Chain Monte Carlo: Gibbs Sampling

4. Sampling in presence of Determinism

5. Rao-Blackwellisation

6. AND/OR importance sampling

Markov Chain

• A Markov chain is a discrete random process with the property that the next state depends only on the

current state (Markov Property):

54

x1 x2 x3 x4

P(x_t | x_1, x_2, ..., x_{t-1}) = P(x_t | x_{t-1})

• If P(X_t | x_{t-1}) does not depend on t (time-homogeneous) and the state space is finite, then it is often expressed as a transition function (aka transition matrix) whose rows sum to 1: Σ_x P(x' → x) = 1

Example: Drunkard’s Walk• a random walk on the number line where, at

each step, the position may change by +1 or −1 with equal probability

55

D(X) = {0, 1, 2, ...}

Transition matrix P(X): P(n → n+1) = P(n → n−1) = 0.5

Example: Weather Model

56

D(X) = {rainy, sunny}

Transition matrix P(X):
P(rainy | rainy) = 0.9, P(sunny | rainy) = 0.1
P(rainy | sunny) = 0.5, P(sunny | sunny) = 0.5

Example chain: rain, rain, rain, rain, sun, ...
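Assuming the garbled matrix on this slide encodes P(rain|rain) = 0.9, P(sun|rain) = 0.1, P(rain|sun) = P(sun|sun) = 0.5, the stationary distribution of the weather chain can be found by power iteration:

```python
# Stationary distribution of the weather chain by power iteration.
# Rows of P are conditional distributions over the next state.
P = [[0.9, 0.1],   # from rainy
     [0.5, 0.5]]   # from sunny

pi = [0.5, 0.5]    # any initial distribution works: the limit is the same
for _ in range(200):
    # one step of pi <- pi P
    pi = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

# pi converges to the stationary distribution (5/6, 1/6) ≈ (0.833, 0.167)
```

This illustrates the point made two slides later: the initial state is not important in the limit.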

Multi-Variable System

• state is an assignment of values to all the variables

57

x^t = {x1^t, x2^t, ..., xn^t} → x^{t+1} = {x1^{t+1}, x2^{t+1}, ..., xn^{t+1}}

X = {X1, X2, X3, ...}, with D(Xi) discrete and finite

Bayesian Network System

• Bayesian Network is a representation of the joint probability distribution over 2 or more variables

X = {X1, X2, X3}

x^t = {x1^t, x2^t, x3^t} → x^{t+1} = {x1^{t+1}, x2^{t+1}, x3^{t+1}}

58

59

Stationary Distribution: Existence

• If the Markov chain is time-homogeneous, then the vector π(X) is a stationary distribution (aka invariant or equilibrium distribution, aka "fixed point") if its entries sum up to 1 and satisfy:

π(x_j) = Σ_{x_i ∈ D(X)} π(x_i) P(x_j | x_i)

• A finite state space Markov chain has a unique stationary distribution if and only if:
– The chain is irreducible
– All of its states are positive recurrent

60

Irreducible

• A state x is irreducible if under the transition rule one has nonzero probability of moving from x to any other state and then coming back in a finite number of steps

• If one state is irreducible, then all the states must be irreducible

(Liu, Ch. 12, pp. 249, Def. 12.1.1)

61

Recurrent

• A state x is recurrent if the chain returns to x with probability 1

• Let M(x) be the expected number of steps to return to state x

• State x is positive recurrent if M(x) is finite

• The recurrent states in a finite-state chain are positive recurrent

Stationary Distribution Convergence

• Consider an infinite Markov chain: π^(n)(x) = P(x_n = x | x_0), i.e., π^(n) = π^(0) P^n

• Initial state is not important in the limit: "The most useful feature of a "good" Markov chain is its fast forgetfulness of its past…" (Liu, Ch. 12.1)

62

• If the chain is both irreducible and aperiodic, then: π = lim_{n→∞} π^(n)

Aperiodic

• Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}, where g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic

• Positive recurrent, aperiodic states are ergodic

63

Markov Chain Monte Carlo

• How do we estimate P(X), e.g., P(X|e) ?

64

• Generate samples that form a Markov chain with stationary distribution π = P(X|e)

• Estimate π from the samples (observed states): visited states x^0, ..., x^T can be viewed as "samples" from distribution π:

π̂(x) = (1/T) Σ_{t=1}^T δ(x, x^t),   π(x) = lim_{T→∞} π̂(x)

MCMC Summary

• Convergence is guaranteed in the limit

• Samples are dependent, not i.i.d.

• Convergence (mixing rate) may be slow

• The stronger correlation between states, the slower convergence!

• Initial state is not important, but… typically, we throw away first K samples - “burn-in”

65

Gibbs Sampling (Geman&Geman,1984)

• Gibbs sampler is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables

• Sample new variable value one variable at a time from the variable’s conditional distribution:

• Samples form a Markov chain with stationary distribution P(X|e)

66

P(X_i^{t+1}) = P(X_i | x_1^{t+1}, ..., x_{i-1}^{t+1}, x_{i+1}^t, ..., x_n^t) = P(X_i | x^t \ x_i)

Gibbs Sampling: Illustration

The process of Gibbs sampling can be understood as a random walk in the space of all instantiations of X=x (remember drunkard’s walk):

In one step we can reach instantiations that differ from current one by value assignment to at most one variable (assume randomized choice of variables Xi).

Ordered Gibbs Sampler

Generate sample x^{t+1} from x^t:

X1 ← x1^{t+1}, sampled from P(X1 | x2^t, x3^t, ..., xN^t, e)
X2 ← x2^{t+1}, sampled from P(X2 | x1^{t+1}, x3^t, ..., xN^t, e)
X3 ← x3^{t+1}, sampled from P(X3 | x1^{t+1}, x2^{t+1}, x4^t, ..., xN^t, e)
...
XN ← xN^{t+1}, sampled from P(XN | x1^{t+1}, x2^{t+1}, ..., x_{N-1}^{t+1}, e)

In short, for i = 1 to N: x_i^{t+1} sampled from P(X_i | x^t \ x_i, e), processing all variables in some order.

68

Transition Probabilities in BN

Markov blanket:

P(X_i | x^t \ x_i) = P(X_i | markov_i^t), where

P(x_i | x^t \ x_i) ∝ P(x_i | pa_i) · Π_{X_j ∈ ch_i} P(x_j | pa_j)

markov_i = pa_i ∪ ch_i ∪ (∪_{X_j ∈ ch_i} pa_j)

Given its Markov blanket (parents, children, and children's parents), X_i is independent of all other nodes.

69

Computation is linear in the size of the Markov blanket!

Ordered Gibbs Sampling Algorithm(Pearl,1988)

Input: X, E=e

Output: T samples {xt }

Fix evidence E = e, initialize x^0 at random

1. For t = 1 to T (compute samples)

2. For i = 1 to N (loop through variables)

3. x_i^{t+1} ← P(X_i | markov_i^t)

4. End For

5. End For
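The loop above can be sketched as an ordered Gibbs sampler on a toy two-variable joint distribution (the table is illustrative, not from the slides; with only two variables, each variable's Markov blanket is just the other variable):

```python
import random

# Ordered Gibbs sampler on a toy two-variable binary joint P(X, Y).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def conditional(var, other_val):
    """P(var = 1 | the other variable fixed), for binary X, Y."""
    if var == 'X':
        p1, p0 = joint[(1, other_val)], joint[(0, other_val)]
    else:
        p1, p0 = joint[(other_val, 1)], joint[(other_val, 0)]
    return p1 / (p0 + p1)

rng = random.Random(7)
x, y = 0, 0                        # arbitrary initial state
burn_in, T = 1000, 50000
count_x1 = 0
for t in range(burn_in + T):
    x = 1 if rng.random() < conditional('X', y) else 0   # X ~ P(X | y)
    y = 1 if rng.random() < conditional('Y', x) else 0   # Y ~ P(Y | x)
    if t >= burn_in:
        count_x1 += x

p_x1 = count_x1 / T
# Exact marginal P(X=1) = 0.3 + 0.4 = 0.7
```

All table entries are positive, so the chain is irreducible and the sample average converges to the true marginal.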

Gibbs Sampling Example - BN

71

X1

X4

X8 X5 X2

X3

X9 X7

X6

X = {X1, ..., X9}, E = {X9}

Initialize: X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0

Gibbs Sampling Example - BN

72

X1

X4

X8 X5 X2

X3

X9 X7

X6

X = {X1, ..., X9}, E = {X9}

x1^1 ← sampled from P(X1 | x2^0, ..., x8^0, x9)

x2^1 ← sampled from P(X2 | x1^1, x3^0, ..., x8^0, x9)

Answering Queries: P(x_i | e) = ?

• Method 1: count # of samples where X_i = x_i (histogram estimator):

P̂(X_i = x_i | e) = (1/T) Σ_{t=1}^T δ(x_i, x^t)   (δ is the Dirac delta function)

• Method 2: average probability (mixture estimator):

P̂(X_i = x_i | e) = (1/T) Σ_{t=1}^T P(x_i | markov_i^t)

• Mixture estimator converges faster (consider estimates for the unobserved values of Xi; prove via Rao-Blackwell theorem)
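Both estimators can be run side by side on a toy two-variable joint (table illustrative, not from the slides); both converge to the same marginal, with the mixture estimator typically showing lower variance:

```python
import random

# Histogram vs. mixture estimator for P(X=1) under Gibbs sampling,
# on a toy two-variable binary joint.
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def p_x1_given_y(y):
    return joint[(1, y)] / (joint[(0, y)] + joint[(1, y)])

def p_y1_given_x(x):
    return joint[(x, 1)] / (joint[(x, 0)] + joint[(x, 1)])

rng = random.Random(8)
x, y = 0, 0
hist_sum = mix_sum = 0.0
T = 50000
for _ in range(T):
    x = 1 if rng.random() < p_x1_given_y(y) else 0
    y = 1 if rng.random() < p_y1_given_x(x) else 0
    hist_sum += x                    # Method 1: count samples with X = 1
    mix_sum += p_x1_given_y(y)       # Method 2: average P(X=1 | Markov blanket)

hist_est, mix_est = hist_sum / T, mix_sum / T   # both ≈ P(X=1) = 0.7
```

The mixture estimator averages a smooth conditional probability rather than a 0/1 indicator, which is where its variance advantage (via Rao-Blackwell) comes from.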

Rao-Blackwell Theorem

Rao-Blackwell Theorem: Let the random variable set X be composed of two groups of variables, R and L. Then, for the joint distribution π(R, L) and a function of interest g (e.g., the mean or covariance), the following result applies:

Var{E[g(R) | L]} ≤ Var{g(R)}

(Casella & Robert, 1996; Liu et al., 1995)

74

• The theorem makes a weak promise, but works well in practice!
• The improvement depends on the choice of R and L

Importance vs. Gibbs

Gibbs: samples x^t are drawn from P̂(X | e), which converges to P(X | e):

ĝ(X) = (1/T) Σ_{t=1}^T g(x^t)

Importance: samples x^t are drawn from Q and weighted:

ĝ(X) = (1/T) Σ_{t=1}^T g(x^t) · P(x^t)/Q(x^t) = (1/T) Σ_{t=1}^T g(x^t) w^t

Gibbs Sampling: Convergence

76

• Sample from P̂(X|e), which converges to P(X|e)

• Converges iff the chain is irreducible and ergodic

• Intuition: must be able to explore all states. If X_i and X_j are strongly correlated (X_i = 0 ⇔ X_j = 0), then we cannot explore states with X_i = 1 and X_j = 1

• All conditions are satisfied when all probabilities are positive

• Convergence rate can be characterized by the second eigen-value of transition matrix

Gibbs: Speeding Convergence

Reduce dependence between samples (autocorrelation)

• Skip samples

• Randomize Variable Sampling Order

• Employ blocking (grouping)

• Multiple chains

Reduce variance (cover in the next section)

77

Blocking Gibbs Sampler

• Sample several variables together, as a block

• Example: Given three variables X,Y,Z, with domains of size 2, group Y and Z together to form a variable W={Y,Z} with domain size 4. Then, given sample (xt,yt,zt), compute next sample:

+ Can improve convergence greatly when two variables are strongly correlated!

- Domain of the block variable grows exponentially with the #variables in a block!

78

x^{t+1} ← P(X | y^t, z^t),   w^{t+1} = (y^{t+1}, z^{t+1}) ← P(Y, Z | x^{t+1})

Gibbs: Multiple Chains

• Generate M chains of size K

• Each chain produces independent estimate Pm:

79

P̂_m(x_i | e) = (1/K) Σ_{t=1}^K P(x_i | x^t \ x_i, e)

P̂ = (1/M) Σ_{m=1}^M P̂_m

Treat Pm as independent random variables.

• Estimate P(xi|e) as average of Pm (xi|e) :

Gibbs Sampling Summary

• Markov Chain Monte Carlo method(Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994)

• Samples are dependent, form Markov Chain

• Sample from P̂(X|e), which converges to P(X|e)

• Guaranteed to converge when all P > 0

• Methods to improve convergence:
– Blocking

– Rao-Blackwellisation

80

Overview

1. Probabilistic Reasoning/Graphical models

2. Importance Sampling

3. Markov Chain Monte Carlo: Gibbs Sampling

4. Sampling in presence of Determinism

5. Rao-Blackwellisation

6. AND/OR importance sampling

Sampling: Performance

• Gibbs sampling

– Reduce dependence between samples

• Importance sampling

– Reduce variance

• Achieve both by sampling a subset of variables and integrating out the rest (reduce dimensionality), aka Rao-Blackwellisation

• Exploit graph structure to manage the extra cost

82

Smaller Subset State-Space

• Smaller state-space is easier to cover

83

X = {X1, X2, X3, X4}, |D(X)| = 64  vs.  X = {X1, X2}, |D(X)| = 16

Smoother Distribution

84

[Bar charts comparing the joint P(X1, X2, X3, X4), with probability mass in [0, 0.26] spread over 16 configurations, against the smoother marginal P(X1, X2) over 4 configurations.]

Speeding Up Convergence

• Mean Squared Error of the estimator:

MSE_Q[P̂] = Var_Q[P̂] + BIAS²,   where Var_Q[P̂] = E_Q[(P̂ − E_Q[P̂])²]

• Reduce variance speed up convergence !

• In case of unbiased estimator, BIAS=0

85

Rao-Blackwellisation

86

Let X = R ∪ L. Two estimators of E[g(x)]:

ĝ(x) = (1/T) Σ_{t=1}^T g(x^t)       g̃(x) = (1/T) Σ_{t=1}^T E[g(x) | l^t]

Since Var{g(x)} = Var{E[g(x) | l]} + E{Var[g(x) | l]} ≥ Var{E[g(x) | l]},

Var{g̃(x)} ≤ Var{ĝ(x)}

Liu, Ch. 2.3

Rao-Blackwellisation

• X = R ∪ L

• Importance Sampling:

Var_Q{ P(R, L) / Q(R, L) } ≥ Var_Q{ P(R) / Q(R) }   (Liu, Ch. 2.5.5)

• Gibbs Sampling:
– autocovariances are lower (less correlation between samples)
– if X_i and X_j are strongly correlated (X_i = 0 ⇔ X_j = 0), include only one of them in the sampling set

87

"Carry out analytical computation as much as possible" - Liu

Blocking Gibbs Sampler vs. Collapsed

• Standard Gibbs: P(x | y, z), P(y | x, z), P(z | x, y)   (1)

• Blocking: P(x | y, z), P(y, z | x)   (2)

• Collapsed: P(x | y), P(y | x)   (3)

Faster convergence: (1) → (2) → (3)

88

Collapsed Gibbs Sampling

Generating Samples

Generate sample ct+1 from ct :

89

C1 ← c1^{t+1}, sampled from P(C1 | c2^t, c3^t, ..., cK^t, e)
C2 ← c2^{t+1}, sampled from P(C2 | c1^{t+1}, c3^t, ..., cK^t, e)
C3 ← c3^{t+1}, sampled from P(C3 | c1^{t+1}, c2^{t+1}, c4^t, ..., cK^t, e)
...
CK ← cK^{t+1}, sampled from P(CK | c1^{t+1}, ..., c_{K-1}^{t+1}, e)

In short, for i = 1 to K: c_i^{t+1} sampled from P(C_i | c^t \ c_i, e)

89

Collapsed Gibbs Sampler

Input: C ⊆ X, E = e

Output: T samples {c^t}

Fix evidence E = e, initialize c^0 at random

1. For t = 1 to T (compute samples)

2. For i = 1 to K (loop through cutset variables)

3. c_i^{t+1} ← P(C_i | c^t \ c_i, e)

4. End For

5. End For

Calculation Time

• Computing P(ci| ct\ci,e) is more expensive (requires inference)

• Trading #samples for smaller variance:

– generate more samples with higher covariance

– generate fewer samples with lower covariance

• Must control the time spent computing sampling probabilities in order to be time-effective!

91

Exploiting Graph Properties

Recall: computation time is exponential in the adjusted induced width of the graph.

• A w-cutset is a subset of variables such that, when they are observed, the induced width of the graph is ≤ w

• When the sampled variables form a w-cutset, inference is exp(w) (e.g., using Bucket Tree Elimination)

• A cycle-cutset is a special case of a w-cutset

92

Sampling a w-cutset ⇒ w-cutset sampling!

What If C=Cycle-Cutset ?

93

c^0 = {x2^0, x5^0}, E = {X9}

Observing the cutset {X2, X5} (together with evidence X9) cuts all cycles: the remaining network over {X1, X3, X4, X6, X7, X8} is a tree.

P(x2, x5, x9) – can be computed using Bucket Elimination

P(x2, x5, x9) – computation complexity is O(N)

Computing Transition Probabilities

94

Compute joint probabilities:

BE: P(x2 = 0, x3, x9)
BE: P(x2 = 1, x3, x9)

Normalize:

α = P(x2 = 0, x3, x9) + P(x2 = 1, x3, x9)
P(x2 = 0 | x3, x9) = α⁻¹ P(x2 = 0, x3, x9)
P(x2 = 1 | x3, x9) = α⁻¹ P(x2 = 1, x3, x9)

Cutset Sampling-Answering Queries

• Query: c_i ∈ C, P(c_i | e) = ? Same as Gibbs:

P̂(c_i | e) = (1/T) Σ_{t=1}^T P(c_i | c^t \ c_i, e)
(computed while generating sample t using bucket tree elimination)

• Query: x_i ∈ X \ C, P(x_i | e) = ?

P̂(x_i | e) = (1/T) Σ_{t=1}^T P(x_i | c^t, e)
(computed after generating sample t using bucket tree elimination)

95

Cutset Sampling vs. Cutset Conditioning

96

• Cutset Conditioning:

P(x_i | e) = Σ_{c ∈ D(C)} P(x_i | c, e) P(c | e)

• Cutset Sampling:

P̂(x_i | e) = (1/T) Σ_{t=1}^T P(x_i | c^t, e) = Σ_{c ∈ D(C)} P(x_i | c, e) · (count(c)/T),

where count(c)/T → P(c | e)

Cutset Sampling Example

97

97

Estimating P(x2 | e) for sampling node X2:

Sample 1: x2^1 ← P(x2 | x5^0, x9)
Sample 2: x2^2 ← P(x2 | x5^1, x9)
Sample 3: x2^3 ← P(x2 | x5^2, x9)

P̂(x2 | x9) = (1/3) [ P(x2 | x5^0, x9) + P(x2 | x5^1, x9) + P(x2 | x5^2, x9) ]

Cutset Sampling Example

98

Estimating P(x3 | e) for non-sampled node X3:

c^1 = {x2^1, x5^1}: P(x3 | x2^1, x5^1, x9)
c^2 = {x2^2, x5^2}: P(x3 | x2^2, x5^2, x9)
c^3 = {x2^3, x5^3}: P(x3 | x2^3, x5^3, x9)

P̂(x3 | x9) = (1/3) [ P(x3 | x2^1, x5^1, x9) + P(x3 | x2^2, x5^2, x9) + P(x3 | x2^3, x5^3, x9) ]

CPCS54 Test Results

99

MSE vs. #samples (left) and time (right)

Ergodic, |X|=54, D(Xi)=2, |C|=15, |E|=3

Exact Time = 30 sec using Cutset Conditioning

[Plots: MSE vs. #samples and MSE vs. time for Cutset and Gibbs on CPCS54, n=54, |C|=15, |E|=3.]

CPCS179 Test Results

100

MSE vs. #samples (left) and time (right)

Non-Ergodic (1 deterministic CPT entry), |X| = 179, |C| = 8, 2 ≤ |D(Xi)| ≤ 4, |E| = 35

Exact Time = 122 sec using Cutset Conditioning

[Plots: MSE vs. #samples and MSE vs. time for Cutset and Gibbs on CPCS179, n=179, |C|=8, |E|=35.]

CPCS360b Test Results

101

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36

Exact Time > 60 min using Cutset Conditioning

Exact Values obtained via Bucket Elimination

[Plots: MSE vs. #samples and MSE vs. time for Cutset and Gibbs on CPCS360b, n=360, |C|=21, |E|=36.]

Random Networks

102

MSE vs. #samples (left) and time (right)

|X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20

Exact Time = 30 sec using Cutset Conditioning

[Plots: MSE vs. #samples and MSE vs. time for Cutset and Gibbs on RANDOM, n=100, |C|=13, |E|=15-20.]

Coding Networks: Cutset Transforms a Non-Ergodic Chain into an Ergodic One

103

MSE vs. time (right)

Non-Ergodic, |X| = 100, D(Xi)=2, |C| = 13-16, |E| = 50

Sample Ergodic Subspace U={U1, U2,…Uk}

Exact Time = 50 sec using Cutset Conditioning

[Figure: coding network with bits x1..x4, u1..u4, parities p1..p4, outputs y1..y4. Plot: MSE vs. time for IBP, Gibbs, and Cutset on coding networks, n=100, |C|=12-14.]

Non-Ergodic Hailfinder

104

MSE vs. #samples (left) and time (right)

Non-Ergodic, |X| = 56, |C| = 5, 2 <=D(Xi) <=11, |E| = 0

Exact Time = 2 sec using Loop-Cutset Conditioning

[Plots: MSE vs. #samples and MSE vs. time for Cutset and Gibbs on HailFinder, n=56, |C|=5, |E|=1.]

CPCS360b - MSE

105

cpcs360b, N=360, |E|=[20-34], w*=20

MSE vs. Time; Ergodic, |X| = 360, |C| = 26, D(Xi)=2

Exact Time = 50 min using BTE

[Plot: MSE vs. time for Gibbs, IBP, |C|=26 (fw=3), and |C|=48 (fw=2).]

Cutset Importance Sampling

• Apply Importance Sampling over cutset C

P̂(e) = (1/T) Σ_{t=1}^T P(c^t, e)/Q(c^t) = (1/T) Σ_{t=1}^T w^t

P̂(c_i | e) ∝ (1/T) Σ_{t=1}^T δ(c_i, c^t) w^t

P̂(x_i | e) ∝ (1/T) Σ_{t=1}^T P(x_i | c^t, e) w^t

where P(c^t, e) is computed using Bucket Elimination

(Gogate & Dechter, 2005) and (Bidyuk & Dechter, 2006)

Likelihood Cutset Weighting (LCS)

• Z = topological order over C ∪ E

• Generating sample t+1:

107

For Z_i ∈ Z do:
  If Z_i ∈ E: z_i^{t+1} = e_i
  Else: z_i^{t+1} ← P(Z_i | z_1^{t+1}, ..., z_{i-1}^{t+1})
  End If
End For

• computed while generating sample t using bucket tree elimination
• can be memoized for some number of instances K (based on memory available)

KL[P(C|e), Q(C)] ≤ KL[P(X|e), Q(X)]

Pathfinder 1

108

Pathfinder 2

109

Link

110

Summary

Importance Sampling:
• i.i.d. samples
• Unbiased estimator
• Generates samples fast
• Samples from Q
• Rejects samples with zero weight
• Improves on cutset

Gibbs Sampling:
• Dependent samples
• Biased estimator
• Generates samples slower
• Samples from P̂(X|e)
• Does not converge in presence of constraints
• Improves on cutset

111

CPCS360b

112

cpcs360b, N=360, |LC|=26, w*=21, |E|=15

[Plot: MSE vs. time for LW, AIS-BN, Gibbs, LCS, and IBP.]

LW – likelihood weighting; LCS – likelihood weighting on a cutset

CPCS422b

113

cpcs422b, N=422, |LC|=47, w*=22, |E|=28

[Plot: MSE vs. time for LW, AIS-BN, Gibbs, LCS, and IBP.]

LW – likelihood weighting; LCS – likelihood weighting on a cutset

Coding Networks

114

coding, N=200, P=3, |LC|=26, w*=21

[Plot: MSE vs. time for LW, AIS-BN, Gibbs, LCS, and IBP.]

LW – likelihood weighting; LCS – likelihood weighting on a cutset