CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Continuous Probability Distributions and Bayesian Networks with Continuous Variables
AGENDA
Continuous probability distributions
Common families: the Gaussian distribution
Linear Gaussian Bayesian networks
CONTINUOUS PROBABILITY DISTRIBUTIONS
Let X be a random variable in ℝ and P(X) a probability distribution over X: P(x) ≥ 0 for all x, and it "sums" (integrates) to 1.
Challenge: (most of the time) P(X=x) = 0 for any single value x.
CDF AND PDF
Probability density function (pdf) f(x):
f(x) ≥ 0, ∫ f(x) dx = 1
Cumulative distribution function (cdf) g(x):
g(x) = P(X ≤ x)
g(−∞) = 0, g(∞) = 1, g(x) = ∫(−∞,x] f(y) dy, monotonic
f(x) = g′(x)
[Figure: an example pdf f(x) and its cdf g(x)]
Both cdfs and pdfs are complete representations of the probability space over X, but usually pdfs are more intuitive to work with.
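As a sanity check of the relation f(x) = g′(x), here is a minimal sketch (illustrative, not from the slides) that differentiates the N(0,1) cdf numerically and recovers the pdf; `gaussian_pdf` and `gaussian_cdf` are helper names defined here, not library calls:

```python
import math

def gaussian_pdf(x):
    # pdf of the standard normal N(0, 1)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def gaussian_cdf(x):
    # cdf of the standard normal, expressed via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# f(x) = g'(x): a central finite difference of the cdf recovers the pdf
x, h = 0.7, 1e-5
numeric_pdf = (gaussian_cdf(x + h) - gaussian_cdf(x - h)) / (2 * h)
print(abs(numeric_pdf - gaussian_pdf(x)) < 1e-8)  # True
```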
CAVEATS
pdfs may exceed 1 Deterministic values, or ones taking on a few
discrete values, can be represented in terms of the Dirac delta function a(x) pdf (an improper function) a(x) = 0 if x a a(x) = if x = a a(x) dx = 1
COMMON DISTRIBUTIONS
Uniform distribution U(a,b):
p(x) = 1/(b−a) if x ∈ [a,b], 0 otherwise
P(X ≤ x) = 0 if x < a, (x−a)/(b−a) if x ∈ [a,b], 1 otherwise
Gaussian (normal) distribution N(μ,σ):
μ = mean, σ = standard deviation
P(X ≤ x) has no closed form; for N(0,1) it equals (1 + erf(x/√2))/2
[Figure: pdfs of two uniform distributions U(a1,b1) and U(a2,b2)]
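A quick empirical sketch of both cdfs (made-up parameters a=2, b=5 and a fixed random seed, so the numbers below are illustrative only):

```python
import math, random

random.seed(0)
a, b = 2.0, 5.0
n = 200_000

# Empirical check of U(a, b): P(X <= x) should equal (x - a) / (b - a)
samples = [random.uniform(a, b) for _ in range(n)]
x = 3.0
empirical_cdf = sum(s <= x for s in samples) / n
print(abs(empirical_cdf - (x - a) / (b - a)) < 0.01)  # True

# N(0,1) has no elementary closed-form cdf, but erf gives it exactly
z = 1.0
gauss_samples = [random.gauss(0, 1) for _ in range(n)]
empirical = sum(s <= z for s in gauss_samples) / n
exact = 0.5 * (1 + math.erf(z / math.sqrt(2)))
print(abs(empirical - exact) < 0.01)  # True
```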
MULTIVARIATE CONTINUOUS DISTRIBUTIONS
Consider the c.d.f. g(x,y) = P(X ≤ x, Y ≤ y):
g(−∞,y) = 0, g(x,−∞) = 0
g(∞,∞) = 1
g(x,∞) = P(X ≤ x), g(∞,y) = P(Y ≤ y)
g is monotonic
Its joint density is given by the p.d.f. f(x,y) iff g(p,q) = ∫(−∞,p] ∫(−∞,q] f(x,y) dy dx,
i.e., P(ax ≤ X ≤ bx, ay ≤ Y ≤ by) = ∫[ax,bx] ∫[ay,by] f(x,y) dy dx
MARGINALIZATION WORKS OVER PDFS
Marginalizing f(x,y) over y: if h(x) = ∫(−∞,∞) f(x,y) dy, then h(x) is a p.d.f. for X.
Proof:
P(X ≤ a) = P(X ≤ a, Y < ∞) = g(a,∞) = ∫(−∞,a] ∫(−∞,∞) f(x,y) dy dx
h(a) = d/da P(X ≤ a)
= d/da ∫(−∞,a] ∫(−∞,∞) f(x,y) dy dx (definition)
= ∫(−∞,∞) f(a,y) dy (fundamental theorem of calculus)
So the joint density contains all the information needed to reconstruct the density of each individual variable.
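A minimal numeric sketch of this, using the made-up joint density f(x,y) = x + y on the unit square (a valid pdf there), whose marginal over y is h(x) = x + 1/2:

```python
def f(x, y):
    # an illustrative joint density on [0,1]^2: f(x,y) = x + y
    return x + y

def marginal(x, steps=10_000):
    # h(x) = integral of f(x, y) dy, by a midpoint Riemann sum over [0, 1]
    dy = 1.0 / steps
    return sum(f(x, (j + 0.5) * dy) for j in range(steps)) * dy

x = 0.3
print(abs(marginal(x) - (x + 0.5)) < 1e-6)  # True
```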
CONDITIONAL DENSITIES
We might want to represent the density P(Y | X=x)... but how? Naively, P(a ≤ Y ≤ b | X=x) = P(a ≤ Y ≤ b, X=x) / P(X=x), but the denominator is 0!
Instead, given a joint pdf p(x,y), consider taking the limit of P(a ≤ Y ≤ b | x ≤ X ≤ x+ε) as ε → 0.
In the limit, p(x,y)/p(x) is the conditional density p(y|x).
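A sketch confirming that p(x,y)/p(x) behaves like a density: for the illustrative joint f(x,y) = x + y on the unit square, the ratio must integrate to 1 over y:

```python
def f(x, y):
    # illustrative joint density on the unit square
    return x + y

def marginal_x(x, steps=10_000):
    dy = 1.0 / steps
    return sum(f(x, (j + 0.5) * dy) for j in range(steps)) * dy

# conditional density p(y | x) = f(x, y) / p(x); it must integrate to 1 over y
x, steps = 0.3, 10_000
px = marginal_x(x)
dy = 1.0 / steps
total = sum(f(x, (j + 0.5) * dy) / px for j in range(steps)) * dy
print(abs(total - 1.0) < 1e-9)  # True
```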
TRANSFORMATIONS OF CONTINUOUS RANDOM VARIABLES
Suppose we want to compute the distribution of f(X), where X is a random variable with pdf pX(x). Assume f is monotonically increasing, hence invertible.
Consider the random variable Y = f(X):
P(Y ≤ y) = ∫ I[f(x) ≤ y] pX(x) dx = P(X ≤ f⁻¹(y))
pY(y) = d/dy P(Y ≤ y) = d/dy P(X ≤ f⁻¹(y))
= pX(f⁻¹(y)) · d/dy f⁻¹(y) (chain rule)
= pX(f⁻¹(y)) / f′(f⁻¹(y)) (inverse function derivative)
(For decreasing f, take the absolute value of the derivative.)
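A sketch of the change-of-variables rule on a hypothetical example, Y = exp(X) with X ~ N(0,1): the rule predicts pY(y) = pX(ln y)/y, which we can check against a numerical derivative of P(Y ≤ y) = Φ(ln y):

```python
import math

def std_normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Y = f(X) = exp(X), f monotonically increasing; the rule gives
# p_Y(y) = p_X(f^{-1}(y)) / f'(f^{-1}(y)) = p_X(ln y) / y
y, h = 2.0, 1e-5
formula = std_normal_pdf(math.log(y)) / y
numeric = (std_normal_cdf(math.log(y + h)) - std_normal_cdf(math.log(y - h))) / (2 * h)
print(abs(formula - numeric) < 1e-8)  # True
```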
NOTES:
In general, continuous multivariate distributions are hard to handle exactly.
But there are specific classes that lead to efficient exact inference techniques, in particular Gaussians.
Other distributions usually require resorting to Monte Carlo approaches.
MULTIVARIATE GAUSSIANS
Multivariate analog in N-dimensional space: X ~ N(μ, Σ), with mean (vector) μ and covariance (matrix) Σ. With the normalization factor written out, the density is
p(x) = (2π)^(−N/2) |Σ|^(−1/2) exp(−½ (x−μ)ᵀ Σ⁻¹ (x−μ))
INDEPENDENCE IN GAUSSIANS
If X ~ N(μX, ΣX) and Y ~ N(μY, ΣY) are independent, then the pair is jointly Gaussian with block-diagonal covariance: (X,Y) ~ N([μX; μY], [ΣX 0; 0 ΣY]).
Moreover, if X ~ N(μ, Σ), then Σij = 0 iff Xi and Xj are independent.
LINEAR TRANSFORMATIONS
Linear transformations of Gaussians: if X ~ N(μ, Σ) and Y = A X + b, then Y ~ N(Aμ + b, A Σ Aᵀ).
Consequence: if X ~ N(μX, ΣX) and Y ~ N(μY, ΣY) are independent, and Z = X + Y, then Z ~ N(μX + μY, ΣX + ΣY).
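A sampling sketch of the sum rule, with made-up parameters and a fixed seed (so the check is deterministic but the tolerances are statistical):

```python
import random, statistics

random.seed(1)
# made-up parameters: X ~ N(1, 2^2), Y ~ N(-3, 1.5^2), independent
mx, sx = 1.0, 2.0
my, sy = -3.0, 1.5
n = 200_000

# Z = X + Y: by the rules above, Z ~ N(mx + my, sx^2 + sy^2)
z = [random.gauss(mx, sx) + random.gauss(my, sy) for _ in range(n)]

print(abs(statistics.mean(z) - (mx + my)) < 0.05)           # True
print(abs(statistics.variance(z) - (sx**2 + sy**2)) < 0.1)  # True
```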
MARGINALIZATION AND CONDITIONING
If (X,Y) ~ N([μX; μY], [ΣXX ΣXY; ΣYX ΣYY]), then:
Marginalization: summing out Y gives X ~ N(μX, ΣXX)
Conditioning: on observing Y = y, we have X ~ N(μX + ΣXY ΣYY⁻¹ (y − μY), ΣXX − ΣXY ΣYY⁻¹ ΣYX)
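A numeric sketch of the conditioning rule in the bivariate case (all covariance numbers are made up): slice the joint density at Y = y0, normalize it, and compare its moments to the closed-form conditional mean and variance:

```python
import math

# illustrative bivariate Gaussian over (X, Y)
mX, mY = 0.0, 1.0
Sxx, Sxy, Syy = 2.0, 0.8, 1.0
det = Sxx * Syy - Sxy ** 2

def joint(x, y):
    # bivariate normal density written out explicitly
    dx, dy = x - mX, y - mY
    q = (Syy * dx * dx - 2 * Sxy * dx * dy + Sxx * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

# closed-form conditioning: X | Y=y0 ~ N(mX + Sxy/Syy (y0-mY), Sxx - Sxy^2/Syy)
y0 = 1.5
cond_mean = mX + (Sxy / Syy) * (y0 - mY)
cond_var = Sxx - Sxy ** 2 / Syy

# numeric check: normalize the slice p(x, y0) and take its first two moments
h, lo, hi = 0.001, -15.0, 15.0
n = int((hi - lo) / h)
grid = [lo + (i + 0.5) * h for i in range(n)]
w = [joint(x, y0) for x in grid]
z = sum(w) * h
mean = sum(x * wi for x, wi in zip(grid, w)) * h / z
var = sum((x - mean) ** 2 * wi for x, wi in zip(grid, w)) * h / z
print(abs(mean - cond_mean) < 1e-6)  # True
print(abs(var - cond_var) < 1e-6)    # True
```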
LINEAR GAUSSIAN MODELS
A conditional linear Gaussian model has P(Y | X=x) = N(β0 + A x, Σ0), with parameters β0, A, and Σ0.
If X ~ N(μX, ΣX), then the joint distribution over (X, Y) is Gaussian, with marginal Y ~ N(A μX + β0, Σ0 + A ΣX Aᵀ).
(Recall the linear transformation rule: if X ~ N(μ, Σ) and Y = A X + b, then Y ~ N(Aμ + b, A Σ Aᵀ).)
CLG BAYESIAN NETWORKS
If all variables in a Bayesian network have Gaussian or CLG CPDs, inference can be done efficiently!
[Figure: network with X1, X2 → Y and X1, Y → Z]
P(X1) = N(μ1, σ1), P(X2) = N(μ2, σ2)
P(Y | x1, x2) = N(a x1 + b x2, σy)
P(Z | x1, y) = N(c + d x1 + e y, σz)
CANONICAL REPRESENTATION
All factors in a CLG Bayes net can be represented as C(x; K, h, g) with
C(x; K, h, g) = exp(−½ xᵀ K x + hᵀ x + g)
Ex: if P(Y|x) = N(β0 + A x, Σ0), then
p(y|x) = 1/Z exp(−½ (y − A x − β0)ᵀ Σ0⁻¹ (y − A x − β0))
= 1/Z exp(−½ (y,x)ᵀ [I −A]ᵀ Σ0⁻¹ [I −A] (y,x) + β0ᵀ Σ0⁻¹ [I −A] (y,x) − ½ β0ᵀ Σ0⁻¹ β0)
This is of the form C((y,x); K, h, g) with
K = [I −A]ᵀ Σ0⁻¹ [I −A]
h = [I −A]ᵀ Σ0⁻¹ β0
g = log(1/Z) − ½ β0ᵀ Σ0⁻¹ β0
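A scalar instance of this construction (the parameter values are made up): build K, h, g for a one-dimensional CLG and check that the canonical form reproduces the Gaussian density exactly:

```python
import math

# scalar CLG: P(y | x) = N(beta0 + a*x, s0sq); illustrative parameters
beta0, a, s0sq = 0.5, 2.0, 1.5

def clg_density(y, x):
    mu = beta0 + a * x
    return math.exp(-(y - mu) ** 2 / (2 * s0sq)) / math.sqrt(2 * math.pi * s0sq)

# canonical form C((y,x); K, h, g) = exp(-1/2 v^T K v + h^T v + g)
K = [[1 / s0sq, -a / s0sq],
     [-a / s0sq, a * a / s0sq]]          # [1 -a]^T (1/s0sq) [1 -a]
h = [beta0 / s0sq, -a * beta0 / s0sq]    # [1 -a]^T (1/s0sq) beta0
g = -0.5 * beta0 ** 2 / s0sq - 0.5 * math.log(2 * math.pi * s0sq)  # log(1/Z) term

def canonical(y, x):
    v = [y, x]
    quad = sum(v[i] * K[i][j] * v[j] for i in range(2) for j in range(2))
    lin = h[0] * v[0] + h[1] * v[1]
    return math.exp(-0.5 * quad + lin + g)

print(abs(clg_density(1.2, -0.7) - canonical(1.2, -0.7)) < 1e-12)  # True
```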
PRODUCT OPERATIONS
C(x; K1, h1, g1) · C(x; K2, h2, g2) = C(x; K, h, g) with
K = K1 + K2
h = h1 + h2
g = g1 + g2
If the scopes of the two factors are not equivalent, just extend the K's with 0 rows and columns, and the h's with 0 rows, so that each row/column matches.
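The product rule follows from exp(u)·exp(v) = exp(u+v); a one-variable sketch with made-up parameters:

```python
import math

# canonical factor over one variable: C(x; K, h, g) = exp(-K x^2 / 2 + h x + g)
def C(x, K, h, g):
    return math.exp(-0.5 * K * x * x + h * x + g)

# product rule: the parameters simply add
K1, h1, g1 = 2.0, 1.0, -0.3
K2, h2, g2 = 0.5, -2.0, 0.1
x = 0.8
lhs = C(x, K1, h1, g1) * C(x, K2, h2, g2)
rhs = C(x, K1 + K2, h1 + h2, g1 + g2)
print(abs(lhs - rhs) < 1e-12)  # True
```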
SUM OPERATION
∫ C((x,y); K, h, g) dy = C(x; K′, h′, g′) with
K′ = KXX − KXY KYY⁻¹ KYX
h′ = hX − KXY KYY⁻¹ hY
g′ = g + ½ (log |2π KYY⁻¹| + hYᵀ KYY⁻¹ hY)
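A sketch checking the sum operation in the two-variable case (all factor parameters are made up): integrate a canonical factor over y by brute force and compare with the closed-form K′, h′, g′:

```python
import math

def C2(x, y, K, h, g):
    # canonical factor over (x, y): exp(-1/2 v^T K v + h^T v + g)
    quad = K[0][0] * x * x + 2 * K[0][1] * x * y + K[1][1] * y * y
    return math.exp(-0.5 * quad + h[0] * x + h[1] * y + g)

K = [[3.0, 0.7], [0.7, 2.0]]
h = [0.4, -1.0]
g = -0.2

# marginalization (sum) rule, scalar blocks: Kyy^{-1} = 1 / K[1][1]
Kp = K[0][0] - K[0][1] ** 2 / K[1][1]
hp = h[0] - K[0][1] * h[1] / K[1][1]
gp = g + 0.5 * (math.log(2 * math.pi / K[1][1]) + h[1] ** 2 / K[1][1])

# compare against brute-force numeric integration over y
x = 0.5
dy, lo, hi = 0.001, -20.0, 20.0
steps = int((hi - lo) / dy)
integral = sum(C2(x, lo + (j + 0.5) * dy, K, h, g) for j in range(steps)) * dy
closed_form = math.exp(-0.5 * Kp * x * x + hp * x + gp)
print(abs(integral - closed_form) < 1e-6)  # True
```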
Using these two operations, we can implement the inference algorithms developed for discrete Bayes nets: top-down inference and variable elimination (exact), and belief propagation (approximate).
MONTE CARLO WITH GAUSSIANS
Assume sampling X ~ N(0,1) is given as a primitive RandN().
To sample X ~ N(μ, σ²), simply set x ← μ + σ RandN().
How to generate a random multivariate Gaussian variable N(μ, Σ)?
Take the Cholesky decomposition Σ⁻¹ = L Lᵀ (L is invertible if Σ is positive definite).
Let y = Lᵀ(x − μ). Then P(y) ∝ exp(−½ (y1² + … + yN²)) is isotropic, and each yi is independent.
Sample each component of y at random, then set x ← L⁻ᵀ y + μ.
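A sketch of multivariate Gaussian sampling with made-up μ and Σ. For simplicity it factors Σ = L Lᵀ directly and sets x = μ + L z with z ~ N(0, I), which is equivalent to the precision-matrix version above (cov(x) = L I Lᵀ = Σ); the check on the empirical moments is statistical:

```python
import random

random.seed(3)
# target N(m, S), with made-up m = (1, -2) and S = [[2.0, 0.8], [0.8, 1.0]]
m = (1.0, -2.0)
Sxx, Sxy, Syy = 2.0, 0.8, 1.0

# Cholesky factor of S itself: S = L L^T, written out for the 2x2 case
L11 = Sxx ** 0.5
L21 = Sxy / L11
L22 = (Syy - L21 ** 2) ** 0.5

n = 200_000
xs, ys = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)  # isotropic draws
    xs.append(m[0] + L11 * z1)
    ys.append(m[1] + L21 * z1 + L22 * z2)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
print(abs(mean_x - m[0]) < 0.02)  # True
print(abs(cov_xy - Sxy) < 0.02)   # True
```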
MONTE CARLO WITH LIKELIHOOD WEIGHTING
Monte Carlo with rejection has probability 0 of hitting a continuous value given as evidence, so likelihood weighting must be used.
[Figure: network X → Y, with evidence Y = y]
P(X) = N(μX, ΣX), P(Y|x) = N(A x + μY, ΣY)
Step 1: sample x ~ N(μX, ΣX)
Step 2: weight the sample by P(y|x)
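A sketch of the two steps on a toy one-dimensional network (parameters chosen so the exact answer is known): X ~ N(0,1), Y | X ~ N(x, 1), evidence Y = 1. Gaussian conditioning gives the exact posterior X | Y=1 ~ N(0.5, 0.5), so the weighted estimate should land near 0.5:

```python
import math, random

random.seed(4)
# toy network: X ~ N(0, 1), Y | X ~ N(x, 1); evidence Y = 1.0
y_obs = 1.0
n = 200_000

def lik(y, mu, var):
    # Gaussian likelihood P(y | x) used as the sample weight
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

total_w, weighted_x = 0.0, 0.0
for _ in range(n):
    x = random.gauss(0, 1)   # step 1: sample the non-evidence variable
    w = lik(y_obs, x, 1.0)   # step 2: weight by P(y | x)
    total_w += w
    weighted_x += w * x

posterior_mean = weighted_x / total_w
# exact posterior mean by Gaussian conditioning is 0.5
print(abs(posterior_mean - 0.5) < 0.02)  # True
```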
HYBRID NETWORKS
Hybrid networks combine both discrete and continuous variables.
Exact inference techniques are hard to apply: they result in Gaussian mixtures, and inference is NP-hard even in polytree networks.
Monte Carlo techniques apply in a straightforward way.
Belief approximation can be applied (e.g., collapsing Gaussian mixtures to single Gaussians).
ISSUES
Non-Gaussian distributions and nonlinear dependencies: more in future lectures on particle filtering.