
Bayesian networks: Modeling

CS194-10 Fall 2011 Lecture 21


Outline

♦ Overview of Bayes nets

♦ Syntax and semantics

♦ Examples

♦ Compact conditional distributions


Learning with complex probability models

Learning cannot succeed without imposing some prior structure on the hypothesis space (by constraint or by preference)

Generative models P(X | θ) support MLE, MAP, and Bayesian learning for domains that (approximately) reflect the assumptions underlying the model:
♦ Naive Bayes: conditional independence of attributes given the class value
♦ Mixture models: the domain has a flat, discrete category structure
♦ All i.i.d. models: the model doesn't change over time
♦ Etc.

Would like to express arbitrarily complex and flexible prior knowledge:
♦ Some attributes depend on others
♦ Categories have hierarchical structure; objects may be mixtures of several categories
♦ Observations at time t may depend on earlier observations
♦ Etc.


Bayesian networks

A simple, graphical notation for conditional independence assertions among a predefined set of random variables X_j, j = 1, …, D, and hence for compact specification of arbitrary joint distributions

Syntax:
a set of nodes, one per variable
a directed, acyclic graph (link ≈ “directly influences”)
a set of parameters for each node given its parents:

θ(X_j | Parents(X_j))

In the simplest case, the parameters consist of a conditional probability table (CPT) giving the distribution over X_j for each combination of parent values
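To make this concrete, here is a minimal sketch of the representation in Python (not from the lecture; the dict layout and the name p_node are illustrative choices): each variable maps to its parent list and a CPT giving P(X = true | parent values), keyed by a tuple of Boolean parent values.

```python
# Minimal Bayes-net sketch: {var: (parents, cpt)}, where cpt maps a tuple of
# Boolean parent values (empty tuple for root nodes) to P(var = True | parents).

def p_node(net, var, value, assignment):
    """Return theta(var = value | parents), reading parent values from assignment."""
    parents, cpt = net[var]
    p_true = cpt[tuple(assignment[p] for p in parents)]
    return p_true if value else 1.0 - p_true
```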


Example

Topology of network encodes conditional independence assertions:

[Figure: four-node network. Weather stands alone; Cavity → Toothache and Cavity → Catch]

Weather is independent of the other variables

Toothache and Catch are conditionally independent given Cavity


Example

I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects “causal” knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call


Example contd.

[Figure: Burglary → Alarm ← Earthquake; Alarm → JohnCalls and Alarm → MaryCalls]

P(B) = .001        P(E) = .002

P(A | B, E):
  B  E    P(A)
  T  T    .95
  T  F    .94
  F  T    .29
  F  F    .001

P(J | A):
  A    P(J)
  T    .90
  F    .05

P(M | A):
  A    P(M)
  T    .70
  F    .01
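Transcribing these CPTs into the hypothetical dict layout sketched earlier, the whole figure fits in a few lines:

```python
burglary_net = {
    "B": ([], {(): 0.001}),                          # P(B)
    "E": ([], {(): 0.002}),                          # P(E)
    "A": (["B", "E"], {(True, True): 0.95,           # P(A | B, E)
                       (True, False): 0.94,
                       (False, True): 0.29,
                       (False, False): 0.001}),
    "J": (["A"], {(True,): 0.90, (False,): 0.05}),   # P(J | A)
    "M": (["A"], {(True,): 0.70, (False,): 0.01}),   # P(M | A)
}
```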


Compactness

A CPT for Boolean X_j with L Boolean parents has 2^L rows for the combinations of parent values

Each row requires one parameter p for X_j = true (the parameter for X_j = false is just 1 − p)

If each variable has no more than L parents, the complete network requires O(D · 2^L) parameters

I.e., grows linearly with D, vs. O(2^D) for the full joint distribution

For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 parameters (vs. 2^5 − 1 = 31)
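The parameter count can be checked mechanically against the hypothetical burglary_net sketched above:

```python
# One free parameter per CPT row (the false case is determined by 1 - p).
n_params = sum(len(cpt) for parents, cpt in burglary_net.values())
print(n_params)                    # 1 + 1 + 4 + 2 + 2 = 10
print(2 ** len(burglary_net) - 1)  # full joint over 5 Booleans: 2^5 - 1 = 31
```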



Global semantics

Global semantics defines the full joint distribution as the product of the local conditional distributions:

P(x_1, …, x_D) = ∏_{j=1}^{D} θ(x_j | parents(X_j))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= θ(j|a) θ(m|a) θ(a|¬b,¬e) θ(¬b) θ(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063

Theorem: θ(X_j | Parents(X_j)) = P(X_j | Parents(X_j))
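A quick numerical check of the worked example, reusing the hypothetical p_node and burglary_net sketches from earlier:

```python
# P(J=true, M=true, A=true, B=false, E=false) as a product of local parameters.
event = {"J": True, "M": True, "A": True, "B": False, "E": False}
p = 1.0
for var, value in event.items():
    p *= p_node(burglary_net, var, value, event)
print(p)  # 0.9 * 0.7 * 0.001 * 0.999 * 0.998 ≈ 0.000628
```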


Local semantics

Local semantics: each node is conditionally independent of its nondescendants given its parents

[Figure: node X with parents U_1, …, U_m, children Y_1, …, Y_n, and the children’s other parents Z_1j, …, Z_nj]

Theorem: Local semantics ⇔ global semantics


Markov blanket

Each node is conditionally independent of all others given its Markov blanket: parents + children + children’s parents

[Figure: the Markov blanket of X: parents U_1, …, U_m, children Y_1, …, Y_n, and children’s parents Z_1j, …, Z_nj]


Constructing Bayesian networks

Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics

1. Choose an ordering of variables X_1, …, X_D

2. For j = 1 to D
       add X_j to the network
       select parents from X_1, …, X_{j−1} such that
       P(X_j | Parents(X_j)) = P(X_j | X_1, …, X_{j−1}),
   i.e., X_j is conditionally independent of the other preceding variables given its parents

This choice of parents guarantees the global semantics:

P(X_1, …, X_D) = ∏_{j=1}^{D} P(X_j | X_1, …, X_{j−1})   (chain rule)
               = ∏_{j=1}^{D} P(X_j | Parents(X_j))       (by construction)
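This guarantee can be sanity-checked numerically. A sketch, again reusing the hypothetical helpers from earlier: enumerate the joint as the product of locals, confirm it normalizes, and confirm that under the ordering B, E, A, J, M the conditional P(J | B, E, A) depends only on A, its chosen parent:

```python
from itertools import product

order = ["B", "E", "A", "J", "M"]

def joint(event):
    """Global semantics: the product of the local parameters."""
    p = 1.0
    for var in order:
        p *= p_node(burglary_net, var, event[var], event)
    return p

# The product of locals is a proper distribution: it sums to 1.
total = sum(joint(dict(zip(order, vals)))
            for vals in product([True, False], repeat=5))
print(round(total, 10))  # 1.0

# Construction condition: P(J = true | b, e, a) equals the CPT entry theta(J | a).
for b, e, a in product([True, False], repeat=3):
    num = sum(joint({"B": b, "E": e, "A": a, "J": True, "M": m})
              for m in [True, False])
    den = sum(joint({"B": b, "E": e, "A": a, "J": j, "M": m})
              for j in [True, False] for m in [True, False])
    assert abs(num / den - burglary_net["J"][1][(a,)]) < 1e-12
```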


Example: Car diagnosis

Initial evidence: car won’t start
Testable variables (green), “broken, so fix it” variables (orange)
Hidden variables (gray) ensure sparse structure, reduce parameters

[Figure: car diagnosis network over the nodes battery age, alternator broken, fanbelt broken, battery dead, no charging, battery flat, no oil, no gas, fuel line blocked, starter broken, lights, oil light, gas gauge, battery meter, dipstick, car won’t start]


Example: Car insurance

[Figure: car insurance network over the nodes Age, SocioEcon, GoodStudent, ExtraCar, RiskAversion, SeniorTrain, Mileage, VehicleYear, DrivingSkill, MakeModel, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Theft, OwnDamage, Accident, Ruggedness, Cushioning, MedicalCost, LiabilityCost, PropertyCost, OtherCost, OwnCost]


Compact conditional distributions

CPT grows exponentially with the number of parents
CPT becomes infinite with a continuous-valued parent or child

Solution: canonical distributions that are defined compactly

Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f

E.g., Boolean functions: NorthAmerican ⇔ Canadian ∨ US ∨ Mexican

E.g., numerical relationships among continuous variables:

∂LakeLevel/∂t = inflow + precipitation − outflow − evaporation
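Deterministic nodes need no stored parameters at all; a brief sketch (the function names and the discrete time step are illustrative assumptions, not from the lecture):

```python
# Deterministic node: the "CPT" is just a function of the parent values.
def north_american(canadian, us, mexican):
    return canadian or us or mexican

# Discrete-time version of the lake-level relationship (hypothetical units, dt).
def lake_level_next(level, inflow, precipitation, outflow, evaporation, dt=1.0):
    return level + dt * (inflow + precipitation - outflow - evaporation)
```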


Compact conditional distributions contd.

Noisy-OR distributions model multiple noninteracting causes:
1) Parents U_1, …, U_L include all causes (can add a leak node)
2) Independent failure probability q_ℓ for each cause alone

⇒ P(X | U_1 … U_M, ¬U_{M+1} … ¬U_L) = 1 − ∏_{ℓ=1}^{M} q_ℓ

Cold  Flu  Malaria  P(Fever)  P(¬Fever)
 F     F     F       0.0       1.0
 F     F     T       0.9       0.1
 F     T     F       0.8       0.2
 F     T     T       0.98      0.02 = 0.2 × 0.1
 T     F     F       0.4       0.6
 T     F     T       0.94      0.06 = 0.6 × 0.1
 T     T     F       0.88      0.12 = 0.6 × 0.2
 T     T     T       0.988     0.012 = 0.6 × 0.2 × 0.1

Number of parameters linear in number of parents
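The whole table is generated by the three failure probabilities q_Cold = 0.6, q_Flu = 0.2, q_Malaria = 0.1 implicit in the slide; a short sketch:

```python
from itertools import product

# Failure probabilities: P(no fever | only that cause is active).
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}

def p_fever(active):
    """Noisy-OR: the effect fails only if every active cause fails independently."""
    p_no_fever = 1.0
    for cause in active:
        p_no_fever *= q[cause]
    return 1.0 - p_no_fever

for cold, flu, malaria in product([False, True], repeat=3):
    active = [c for c, on in zip(q, (cold, flu, malaria)) if on]
    print(cold, flu, malaria, round(p_fever(active), 3))  # reproduces the table
```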


Hybrid (discrete+continuous) networks

Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)

[Figure: Subsidy? → Cost ← Harvest; Cost → Buys?]

Option 1: discretization (possibly large errors, large CPTs)
Option 2: finitely parameterized canonical families

1) Continuous variable, discrete + continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)


Continuous child variables

Need one conditional density function for the child variable given continuous parents, for each possible assignment to the discrete parents

Most common is the linear Gaussian model, e.g.:

P(Cost = c | Harvest = h, Subsidy? = true)
  = N(a_t h + b_t, σ_t²)(c)
  = (1 / (σ_t √(2π))) exp(−½ ((c − (a_t h + b_t)) / σ_t)²)

Mean Cost varies linearly with Harvest, variance is fixed

Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow
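A sketch of evaluating this density in Python; the values of a_t, b_t, and σ_t below are made-up illustrative choices, not numbers from the lecture:

```python
import math

def cost_density(c, h, a_t=1.0, b_t=0.0, sigma_t=0.5):
    """Linear Gaussian: density N(a_t * h + b_t, sigma_t^2) evaluated at c."""
    mu = a_t * h + b_t
    z = (c - mu) / sigma_t
    return math.exp(-0.5 * z * z) / (sigma_t * math.sqrt(2.0 * math.pi))

print(cost_density(c=5.2, h=5.0))  # highest near c = a_t * h + b_t
```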


Continuous child variables

[Figure: surface plot of the density P(c | h, subsidy) over Cost c (0 to 10) and Harvest h (0 to 12)]

All-continuous network with LG distributions
⇒ the full joint distribution is a multivariate Gaussian

A discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values
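The multivariate-Gaussian claim is easy to check empirically on a two-node LG network X → Y; the parameters below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, sigma = 2.0, 1.0, 0.5

x = rng.normal(0.0, 1.0, size=200_000)  # X ~ N(0, 1)
y = rng.normal(a * x + b, sigma)        # Y | X ~ N(aX + b, sigma^2)

# Closed form for the joint Gaussian: Cov = [[1, a], [a, a^2 + sigma^2]]
print(np.cov(np.stack([x, y])))         # ~[[1.0, 2.0], [2.0, 4.25]]
```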


Discrete variable w/ continuous parents

Probability of Buys? given Cost should be a “soft” threshold:

[Figure: soft-threshold curve for P(Buys? = true | Cost = c), falling from ≈1 toward 0 as Cost c runs from 0 to 12]

Probit distribution uses the integral of the Gaussian:
Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt
P(Buys? = true | Cost = c) = Φ((−c + µ)/σ)
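Φ has no elementary closed form but is one call away via the error function; a minimal sketch with illustrative µ and σ (not values from the slide):

```python
import math

def phi(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_buys_probit(c, mu=6.0, sigma=1.0):
    return phi((-c + mu) / sigma)

print(p_buys_probit(4.0), p_buys_probit(6.0), p_buys_probit(8.0))  # ~0.977, 0.5, ~0.023
```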


Why the probit?

1. It’s sort of the right shape

2. Can view as hard threshold whose location is subject to noise

[Figure: Buys? as a hard threshold applied to Cost plus Gaussian Noise]


Discrete variable contd.

The sigmoid (or logit) distribution is also used in neural networks:

P(Buys? = true | Cost = c) = 1 / (1 + exp(−2 (−c + µ)/σ))

Sigmoid has similar shape to probit but much longer tails:

[Figure: logit and probit curves of P(buys | c) over Cost c from 0 to 12; similar shape, but the logit has longer tails]
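To see the tail difference numerically, a tiny comparison using the hypothetical p_buys_probit sketch above and its logit counterpart:

```python
import math

def p_buys_logit(c, mu=6.0, sigma=1.0):
    return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

for c in [2.0, 4.0, 6.0, 8.0, 10.0]:
    print(c, round(p_buys_probit(c), 5), round(p_buys_logit(c), 5))
# Both cross 0.5 at c = mu; the logit approaches 0 and 1 noticeably more slowly.
```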


Summary (representation)

Bayes nets provide a natural representation for (causally induced) conditional independence

Topology + CPTs = compact representation of joint distribution
⇒ fast learning from few examples

Generally easy for (non)experts to construct

Canonical distributions (e.g., noisy-OR, linear Gaussian)
⇒ compact representation of CPTs
⇒ faster learning from fewer examples
