Learning Bayesian Networks from Data
Nir Friedman (Hebrew University)   Daphne Koller (Stanford)
2
Overview
Introduction
Parameter Estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data
3
Bayesian Networks
Qualitative part: a directed acyclic graph (DAG)
  Nodes: random variables
  Edges: direct influence
Quantitative part: a set of conditional probability distributions, one per family (e.g. the family of Alarm)
[Figure: network over Earthquake, Burglary, Alarm, Radio, Call, with the CPT P(A | E,B) giving the probability of the alarm for each combination of Earthquake and Burglary (entries such as 0.9/0.1, 0.2/0.8, 0.01/0.99)]
Compact representation of probability distributions via conditional independence
Together, the two parts define a unique distribution in factored form:
P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
4
Example: the “ICU Alarm” network
Domain: monitoring intensive-care patients
37 variables, 509 parameters … instead of 2^54
[Figure: the ICU Alarm network, a DAG over the 37 monitoring variables (MINVOLSET, VENTMACH, INTUBATION, PULMEMBOLUS, SHUNT, SAO2, CATECHOL, HR, BP, CVP, PCWP, and others)]
5
Inference
Posterior probabilities: probability of any event given any evidence
Most likely explanation: the scenario that best explains the evidence
Rational decision making: maximize expected utility
Value of information
Effect of intervention
6
Why learning?
Knowledge acquisition bottleneck: knowledge acquisition is an expensive process, and often we don’t have an expert
Data is cheap: the amount of available information is growing rapidly, and learning allows us to construct models from raw data
7
Why Learn Bayesian Networks?
Conditional independencies & graphical language capture structure of many real-world distributions
Graph structure provides much insight into the domain and allows “knowledge discovery”
Learned model can be used for many tasks
Supports all the features of probabilistic learning: model selection criteria, dealing with missing data & hidden variables
8
Learning Bayesian networks
[Figure: data + prior information go into a learner, which outputs a Bayesian network: a structure over E, B, A, C, R together with CPTs such as P(A | E,B)]
9
Known Structure, Complete Data
Network structure is specified: the inducer needs to estimate parameters
Data does not contain missing values
[Figure: complete records <E,B,A> = <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y> are fed to the learner, which fills in the unknown entries of the CPT P(A | E,B) for the fixed structure]
10
Unknown Structure, Complete Data
Network structure is not specified: the inducer needs to select arcs & estimate parameters
Data does not contain missing values
[Figure: complete records <E,B,A> = <Y,N,N>, <Y,N,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y> are fed to the learner, which must choose the arcs as well as the CPT entries]
11
Known Structure, Incomplete Data
Network structure is specified
Data contains missing values: need to consider assignments to the missing values
[Figure: records with missing entries, e.g. <E,B,A> = <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y>, are fed to the learner, which estimates the CPT entries of the fixed structure]
12
Unknown Structure, Incomplete Data
Network structure is not specified
Data contains missing values: need to consider assignments to the missing values
[Figure: records with missing entries, e.g. <E,B,A> = <Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, are fed to the learner, which must select arcs and estimate parameters]
13
Overview
Introduction
Parameter Estimation
  Likelihood function
  Bayesian estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data
14
Learning Parameters
[Figure: network over E, B, A, C]
Training data has the form:
D = { <E[1], B[1], A[1], C[1]>, …, <E[M], B[M], A[M], C[M]> }
15
Likelihood Function
Assume i.i.d. samples; the likelihood function is
L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
16
Likelihood Function
By definition of the network, we get
L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
         = ∏_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ)
17
Likelihood Function
Rewriting terms, we get
L(Θ : D) = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | B[m], E[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)]
18
General Bayesian Networks
Generalizing for any Bayesian network:
Decomposition ⇒ independent estimation problems
L(Θ : D) = ∏_m P(x_1[m], …, x_n[m] : Θ) = ∏_i ∏_m P(x_i[m] | Pa_i[m] : Θ_i) = ∏_i L_i(Θ_i : D)
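A minimal sketch of how this factored form turns the log-likelihood into a sum of independent per-family terms. The data layout, variable names and toy numbers below are all illustrative, not part of the slides:

```python
from math import log

def log_likelihood(data, parents, theta):
    """Decomposed log-likelihood: sum over instances and families of log P(x_i[m] | pa_i[m])."""
    ll = 0.0
    for record in data:                      # i.i.d. samples
        for var, pa in parents.items():      # one term per family
            pa_vals = tuple(record[p] for p in pa)
            ll += log(theta[var][pa_vals][record[var]])
    return ll

# Illustrative structure E -> A <- B with made-up parameters
parents = {"E": (), "B": (), "A": ("E", "B")}
theta = {
    "E": {(): {1: 0.1, 0: 0.9}},
    "B": {(): {1: 0.2, 0: 0.8}},
    "A": {(0, 0): {1: 0.01, 0: 0.99}, (0, 1): {1: 0.8, 0: 0.2},
          (1, 0): {1: 0.7, 0: 0.3}, (1, 1): {1: 0.9, 0: 0.1}},
}
data = [{"E": 0, "B": 1, "A": 1}, {"E": 0, "B": 0, "A": 0}]
print(log_likelihood(data, parents, theta))
```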
19
Likelihood Function: Multinomials
The likelihood for the sequence H, T, T, H, H is
L(θ : D) = ∏_m P(x[m] | θ) = θ (1−θ)(1−θ) θ θ = θ^3 (1−θ)^2
[Plot: L(θ : D) as a function of θ over [0, 1]]
General case:
L(Θ : D) = ∏_{k=1}^K θ_k^{N_k},  where θ_k is the probability of the kth outcome and N_k is the count of the kth outcome in D
20
Bayesian Inference
Represent uncertainty about parameters using a probability distribution over parameters and data
Learning using Bayes rule:
P(θ | x[1], …, x[M]) = P(x[1], …, x[M] | θ) · P(θ) / P(x[1], …, x[M])
(posterior = likelihood × prior / probability of the data)
21
Bayesian Inference
Represent the Bayesian distribution as a Bayes net:
  The values of X are independent given θ
  P(x[m] | θ) = θ
  Bayesian prediction is inference in this network
[Figure: θ with children X[1], X[2], …, X[M] (the observed data)]
22
Bayesian Nets & Bayesian Prediction
Priors for each parameter group are independent
Data instances are independent given the unknown parameters
[Figure: unrolled network and plate notation: θ_X and θ_Y|X as parents of the observed X[1..M], Y[1..M] and of the query X[M+1], Y[M+1]]
23
Bayesian Nets & Bayesian Prediction
We can also “read” from the network:
Complete data ⇒ posteriors on parameters are independent
Can compute the posterior over each parameter group separately!
[Figure: plate model with θ_X, θ_Y|X and the observed data X[1..M], Y[1..M]]
24
Learning Parameters: Summary
Estimation relies on sufficient statistics
For multinomials: the counts N(x_i, pa_i)
Parameter estimation:
  MLE:                  θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
  Bayesian (Dirichlet): θ̃_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))
Both are asymptotically equivalent and consistent
Both can be implemented in an on-line manner by accumulating sufficient statistics
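A small sketch contrasting the two estimators for a single parent configuration; the dictionaries and names are illustrative, and the counts are assumed to have been accumulated already:

```python
def mle(counts):
    """counts: value -> N(x, pa) for one parent configuration."""
    n = sum(counts.values())
    return {x: c / n for x, c in counts.items()}

def bayesian(counts, alpha):
    """Dirichlet posterior mean: (alpha_x + N(x, pa)) / (alpha(pa) + N(pa))."""
    n = sum(counts.values()) + sum(alpha.values())
    return {x: (counts[x] + alpha[x]) / n for x in counts}

counts = {"a1": 4, "a0": 1}          # sufficient statistics for one family row
alpha  = {"a1": 1, "a0": 1}          # uniform Dirichlet prior
print(mle(counts))                    # {'a1': 0.8, 'a0': 0.2}
print(bayesian(counts, alpha))        # {'a1': 5/7, 'a0': 2/7}
```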
25
Overview
Introduction
Parameter Estimation
Model Selection
  Scoring function
  Structure search
Structure Discovery
Incomplete Data
Learning from Structured Data
26
Why Struggle for Accurate Structure?
Adding an arc: increases the number of parameters to be estimated; wrong assumptions about domain structure
Missing an arc: cannot be compensated for by fitting parameters; wrong assumptions about domain structure
[Figure: the true network over Earthquake, Alarm Set, Burglary, Sound, alongside versions with a missing arc and an added arc]
27
Score-based Learning
Define a scoring function that evaluates how well a structure matches the data
Search for a structure that maximizes the score
[Figure: training data <E,B,A> = <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y> and three candidate structures over E, B, A]
28
Likelihood Score for Structure
Larger dependence of X_i on Pa_i ⇒ higher score
Adding arcs always helps: I(X ; Y) ≤ I(X ; {Y, Z}), so the max score is attained by the fully connected network. Overfitting: a bad idea…
score_L(G : D) = log L(G : D) = M Σ_i [ I(X_i ; Pa_i^G) − H(X_i) ]
(I(X_i ; Pa_i^G) is the mutual information between X_i and its parents)
29
Bayesian Score
Likelihood score: L(G : D) = P(D | θ̂_G, G), using the maximum-likelihood parameters θ̂_G
Bayesian approach: deal with uncertainty by assigning probability to all possibilities
P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ
(the marginal likelihood: the likelihood P(D | G, Θ) averaged under the prior over parameters P(Θ | G))
P(G | D) = P(D | G) · P(G) / P(D)
30
Heuristic Search
Define a search space:
  search states are possible structures
  operators make small changes to structure
Traverse the space looking for high-scoring structures
Search techniques: greedy hill-climbing, best-first search, simulated annealing, …
31
Local Search
Start with a given network: the empty network, the best tree, or a random network
At each iteration: evaluate all possible changes and apply a change based on the score
Stop when no modification improves score
32
Heuristic Search
Typical operations:
add an arc (e.g. add C → D), delete an arc (e.g. delete C → E), reverse an arc (e.g. reverse C → E)
[Figure: a network over S, C, E, D and the neighbouring structures produced by these operations]
To update the score after a local change, only re-score the families that changed; e.g. for adding C → D:
Δscore = S({C,E} → D) − S({E} → D)
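A compact sketch of greedy hill-climbing with these operators. `family_score` stands in for any decomposable score, and all names are illustrative; thanks to decomposability, each candidate move only re-scores the families it changes:

```python
import itertools

def has_cycle(parents):
    """Detect a directed cycle in a parent-map representation of the graph (DFS)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in parents}
    def visit(v):
        color[v] = GREY
        for p in parents[v]:
            if color[p] == GREY or (color[p] == WHITE and visit(p)):
                return True
        color[v] = BLACK
        return False
    return any(visit(v) for v in parents if color[v] == WHITE)

def greedy_hill_climb(variables, family_score, max_iters=100):
    """Greedy search over arc additions, deletions and reversals."""
    parents = {v: set() for v in variables}          # start from the empty network
    for _ in range(max_iters):
        best_delta, best_parents = 0.0, None
        for x, y in itertools.permutations(variables, 2):
            moves = [("del", x, y), ("rev", x, y)] if x in parents[y] else [("add", x, y)]
            for op, _, _ in moves:
                new = {v: set(ps) for v, ps in parents.items()}
                if op in ("del", "rev"):
                    new[y].discard(x)                # remove x -> y
                if op == "add":
                    new[y].add(x)                    # add x -> y
                elif op == "rev":
                    new[x].add(y)                    # add y -> x
                if has_cycle(new):
                    continue
                changed = {x, y} if op == "rev" else {y}
                delta = sum(family_score(c, new[c]) - family_score(c, parents[c])
                            for c in changed)        # only changed families re-scored
                if delta > best_delta:
                    best_delta, best_parents = delta, new
        if best_parents is None:                     # no improving move: local optimum
            break
        parents = best_parents
    return parents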
33
Learning in Practice: Alarm domain
[Plot: KL divergence to the true distribution vs. number of samples, comparing “structure known, fit parameters” against “learn both structure & parameters”]
34
Local Search: Possible Pitfalls
Local search can get stuck in:
  Local maxima: all one-edge changes reduce the score
  Plateaux: some one-edge changes leave the score unchanged
Standard heuristics can escape both: random restarts, TABU search, simulated annealing
35
Improved Search: Weight Annealing
Standard annealing process:
  take bad steps with probability ∝ exp(Δscore / t)
  the probability increases with the temperature t
Weight annealing:
  take uphill steps relative to a perturbed score
  the perturbation increases with the temperature
[Figure: the score landscape Score(G | D) over structures G]
36
Perturbing the Score
Perturb the score by reweighting instances
Each weight is sampled from a distribution with mean = 1 and variance ∝ temperature
Instances are still sampled from the “original” distribution … but the perturbation changes their emphasis
Benefit: Allows global moves in the search space
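A sketch of the reweighting step. The slides only specify the mean and variance of the weight distribution; the Gamma distribution below is just one convenient choice with those moments, and the numbers are illustrative:

```python
import numpy as np

def sample_instance_weights(num_instances, temperature, seed=None):
    """One weight per training instance, with mean 1 and variance = temperature.
    Weighted counts then replace the ordinary counts when scoring structures."""
    rng = np.random.default_rng(seed)
    # Gamma(shape=1/t, scale=t) has mean 1 and variance t -- one convenient choice.
    return rng.gamma(shape=1.0 / temperature, scale=temperature, size=num_instances)

weights = sample_instance_weights(num_instances=500, temperature=0.5)
# e.g. perturbed sufficient statistic:
#   N_w(x, pa) = sum of weights[m] over instances m with X[m] = x, Pa[m] = pa
```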
37
Weight Annealing: ICU Alarm network
[Plot: cumulative performance of 100 runs of annealed structure search, compared with greedy hill-climbing and with the true structure with learned parameters]
38
Structure Search: Summary
Discrete optimization problem
In some cases the optimization problem is easy, e.g. learning trees
In general it is NP-hard, so we need to resort to heuristic search
In practice, search is relatively fast (~100 variables in ~2–5 minutes), thanks to decomposability and sufficient statistics
Adding randomness to search is critical
39
Overview
Introduction
Parameter Estimation
Model Selection
Structure Discovery
Incomplete Data
Learning from Structured Data
40
Structure Discovery
Task: discover structural properties
  Is there a direct connection between X & Y?
  Does X separate two “subsystems”?
  Does X causally affect Y?
Example: scientific data mining
  Disease properties and symptoms
  Interactions between the expression of genes
41
Discovering Structure
Current practice: model selection
  Pick a single high-scoring model
  Use that model to infer domain structure
[Figure: the posterior P(G | D) concentrated on a single network over E, R, B, A, C]
42
Discovering Structure
Problem:
  Small sample size ⇒ many high-scoring models
  An answer based on one model is often useless
  We want features common to many models
[Figure: the posterior P(G | D) spread over many candidate networks over E, R, B, A, C]
43
Bayesian Approach
Posterior distribution over structures ⇒ estimate the probability of features
  Edge X → Y
  Path X → … → Y
  …
P(f | D) = Σ_G f(G) P(G | D)
where f(G) is the indicator function for feature f (e.g. the edge X → Y) and P(G | D) is the Bayesian score of G
44
MCMC over Networks
Cannot enumerate structures, so sample structures
MCMC sampling:
  Define a Markov chain over BNs
  Run the chain to get samples G_1, …, G_n from the posterior P(G | D)
  Estimate  P(f(G) | D) ≈ (1/n) Σ_{i=1}^n f(G_i)
Possible pitfalls:
  Huge (superexponential) number of networks
  Time for the chain to converge to the posterior is unknown
  Islands of high posterior connected by low bridges
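A tiny sketch of turning such samples into a feature estimate, here for an edge feature; the parent-map representation of each sampled structure is an illustrative choice:

```python
def edge_probability(sampled_structures, x, y):
    """P(edge x -> y | D) estimated as the fraction of structures sampled
    from P(G | D) that contain the edge."""
    hits = sum(1 for parents in sampled_structures if x in parents[y])
    return hits / len(sampled_structures)

# sampled_structures: list of parent maps, e.g. [{"A": {"B", "E"}, "B": set(), ...}, ...]
```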
45
ICU Alarm BN: No Mixing
With 500 instances, the runs clearly do not mix
[Plot: score of the current sample vs. MCMC iteration for chains started from the empty network and from a greedy solution]
46
Effects of Non-Mixing
Two MCMC runs over the same 500 instances; probability estimates for edges compared across the two runs
Probability estimates are highly variable and non-robust
[Scatter plots: edge-probability estimates from runs started at the true BN vs. at a random structure]
47
Fixed Ordering
Suppose that we know an ordering of the variables, say X_1 ≻ X_2 ≻ X_3 ≻ … ≻ X_n, so the parents of X_i must come from X_1, …, X_{i−1}
Limit the number of parents per node to k   (about 2^(k·n·log n) networks consistent with the ordering)
Intuition: the order decouples the choice of parents; the choice of Pa(X_7) does not restrict the choice of Pa(X_12)
Upshot: can compute efficiently in closed form both the likelihood P(D | ≺) and the feature probability P(f | D, ≺)
48
Our Approach: Sample Orderings
We can write
P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)
Sample orderings and approximate
P(f | D) ≈ (1/n) Σ_{i=1}^n P(f | ≺_i, D)
MCMC sampling: define a Markov chain over orderings and run the chain to get samples ≺_1, …, ≺_n from the posterior P(≺ | D)
49
Mixing with MCMC-Orderings
4 runs on ICU-Alarm with 500 instances: fewer iterations than MCMC over networks, but approximately the same amount of computation
The process appears to be mixing!
[Plot: score of the current sample vs. MCMC iteration for chains started from random orderings and from a greedy solution]
50
Mixing of MCMC runs
Two MCMC runs over the same instances; probability estimates for edges compared across runs
[Scatter plots: 500 instances and 1000 instances]
Probability estimates are very robust
51
Application to Gene Array Analysis
52
Chips and Features
53
Application to Gene Array Analysis
See www.cs.tau.ac.il/~nin/BioInfo04
Bayesian inference: http://www.cs.huji.ac.il/~nir/Papers/FLNP1Full.pdf
54
Application: Gene expression
Input: measurements of gene expression under different conditions
  Thousands of genes
  Hundreds of experiments
Output: models of gene interaction
  Uncover pathways
55
Map of Feature Confidence
Yeast data [Hughes et al 2000]
600 genes 300 experiments
56
“Mating response” Substructure
Automatically constructed sub-network of high-confidence edges
Almost exact reconstruction of yeast mating pathway
[Figure: sub-network over genes including KAR4, AGA1, PRM1, TEC1, SST2, STE6, KSS1, NDJ1, FUS3, AGA2, YEL059W, TOM6, FIG1, YLR343W, YLR334C, MFA1, FUS1]
57
Summary of the course
Bayesian learning: the idea of treating model parameters as variables with a prior distribution
PAC learning: assigning confidence and accepted error to the learning problem and analyzing polynomial learning time
Boosting and bagging: use of the data for estimating multiple models and fusing them
58
Summary of the course
Hidden Markov models: the Markov property is widely assumed, and the hidden Markov model is very powerful & easily estimated
Model selection and validation: crucial for any type of modeling!
(Artificial) neural networks: the brain performs computations radically differently than modern computers, and often much better, so we need to learn how; ANNs are a powerful modeling tool (BP, RBF)
59
Summary of the course
Evolutionary learning: a very different learning rule, with its own merits
VC dimensionality: a powerful theoretical tool for defining solvable problems, but difficult to use in practice
Support vector machines: clean theory, different from classical statistics in that it looks for simple estimators in high dimension rather than reducing the dimension
Bayesian networks: compact way to represent conditional dependencies between variables
60
Final project
61
Probabilistic Relational Models
Key ideas:
  Universals: probabilistic patterns hold for all objects in a class
  Locality: represent direct probabilistic dependencies; links give us the potential interactions!
[Figure: classes Student (Intelligence), Course (Difficulty), Professor (Teaching-Ability) and Registration (Grade, Satisfaction), with a CPD for Grade (A/B/C) given each combination of Difficulty (easy/hard) and Intelligence (weak/smart)]
62
PRM Semantics
Instantiated PRM ⇒ BN:
  variables: the attributes of all objects
  dependencies: determined by the links & the PRM
[Figure: an instantiated network over professors (Prof. Smith, Prof. Jones), courses (CS101, Geo101) and students (Forrest Gump, Jane Doe), with attributes Teaching-ability, Difficulty, Intelligence, Grade and Satisfaction; all Grade CPDs share the parameters θ_{Grade|Intell,Diffic}]
63
The Web of Influence
Objects are all correlated, so we need to perform inference over the entire model
For large databases, use approximate inference: loopy belief propagation
[Figure: posterior probability bars (weak/smart for students, easy/hard for courses) showing how evidence about CS101 and Geo101 propagates through the instantiated network]
64
PRM Learning: Complete Data
[Figure: a fully observed instantiation: professors' teaching abilities, course difficulties, student intelligence, grades and satisfactions, all governed by the shared parameters θ_{Grade|Intell,Diffic}]
The entire database is a single “instance”; parameters are used many times within that instance
Introduce a prior over the parameters and update it with sufficient statistics, e.g.
Count(Reg.Grade=A, Reg.Course.Diff=lo, Reg.Student.Intel=hi)
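A sketch of collecting such a sufficient statistic by joining the relational tables and counting; the table and column names below are purely hypothetical stand-ins for the schema on the slide:

```python
import pandas as pd

# Hypothetical relational tables; names and values are illustrative only.
reg      = pd.DataFrame({"student": ["gump", "doe", "doe"],
                         "course":  ["cs101", "cs101", "geo101"],
                         "grade":   ["C", "A", "B"]})
students = pd.DataFrame({"student": ["gump", "doe"], "intel": ["weak", "smart"]})
courses  = pd.DataFrame({"course": ["cs101", "geo101"], "diff": ["hard", "easy"]})

# The whole database is one instance; the shared CPD P(Grade | Intel, Diff)
# is updated with counts pooled across every registration object.
joined = reg.merge(students, on="student").merge(courses, on="course")
counts = joined.groupby(["intel", "diff", "grade"]).size()
print(counts)   # e.g. Count(Grade=A, Diff=hard, Intel=smart) = 1
```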
65
PRM Learning: Incomplete Data
[Figure: the same instantiation with some attribute values unobserved (???)]
Use expected sufficient statistics
But everything is correlated: the E-step uses (approximate) inference over the entire model
66
Example: Binomial Data
Prior: uniform over θ in [0, 1], so the posterior P(θ | D) ∝ the likelihood L(θ : D):
P(θ | x[1], …, x[M]) ∝ P(x[1], …, x[M] | θ) · P(θ)
Data: (N_H, N_T) = (4, 1)
The MLE for P(X = H) is 4/5 = 0.8, while the Bayesian prediction is
P(x[M+1] = H | D) = ∫ θ P(θ | D) dθ = 5/7 ≈ 0.7142
[Plot: the posterior over θ on [0, 1]]
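A one-line check of this prediction; the function name is illustrative, but the formula is just the posterior mean under the uniform (Beta(1,1)) prior used on the slide:

```python
def predict_heads(n_heads, n_tails):
    """Posterior-predictive P(X[M+1]=H | D) under a uniform Beta(1,1) prior:
    the posterior is Beta(n_heads+1, n_tails+1) and its mean is the prediction."""
    return (n_heads + 1) / (n_heads + n_tails + 2)

print(predict_heads(4, 1))   # 5/7 ~ 0.714, vs. the MLE 4/5 = 0.8
```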
67
Dirichlet Priors
Recall that the likelihood function is
L(Θ : D) = ∏_{k=1}^K θ_k^{N_k}
A Dirichlet prior with hyperparameters α_1, …, α_K has the form
P(Θ) ∝ ∏_{k=1}^K θ_k^{α_k − 1}
Then the posterior has the same form, with hyperparameters α_1 + N_1, …, α_K + N_K:
P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏_{k=1}^K θ_k^{α_k − 1} · ∏_{k=1}^K θ_k^{N_k} = ∏_{k=1}^K θ_k^{α_k + N_k − 1}
68
Dirichlet Priors - Example
[Plot: Dirichlet(α_heads, α_tails) densities P(θ_heads) over θ_heads, for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2) and Dirichlet(5,5)]
69
Dirichlet Priors (cont.)
If P(Θ) is Dirichlet with hyperparameters α_1, …, α_K, then
P(X[1] = k) = ∫ θ_k P(Θ) dΘ = α_k / Σ_ℓ α_ℓ
Since the posterior is also Dirichlet, we get
P(X[M+1] = k | D) = ∫ θ_k P(Θ | D) dΘ = (α_k + N_k) / Σ_ℓ (α_ℓ + N_ℓ)
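A direct sketch of this posterior-predictive rule (the dictionary layout is an illustrative choice):

```python
def dirichlet_predict(alpha, counts):
    """P(X[M+1]=k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)
    when the prior is Dirichlet(alpha_1, ..., alpha_K)."""
    total = sum(alpha[k] + counts[k] for k in alpha)
    return {k: (alpha[k] + counts[k]) / total for k in alpha}

print(dirichlet_predict({"H": 1, "T": 1}, {"H": 4, "T": 1}))   # {'H': 5/7, 'T': 2/7}
```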
70
Learning Parameters: Case Study
[Plot: KL divergence to the true distribution vs. number of instances, for MLE and for Bayesian estimation with prior strengths M' = 5, 20, 50]
Instances sampled from the ICU Alarm network
M' is the strength of the prior
71
Marginal Likelihood: Multinomials
Fortunately, in many cases the integral has a closed form:
If P(Θ) is Dirichlet with hyperparameters α_1, …, α_K and D is a dataset with sufficient statistics N_1, …, N_K, then
P(D) = [ Γ(Σ_k α_k) / Γ(Σ_k α_k + Σ_k N_k) ] · ∏_k [ Γ(α_k + N_k) / Γ(α_k) ]
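A small sketch evaluating this closed form in log space (illustrative function name; `math.lgamma` is the log-Gamma function):

```python
from math import lgamma

def log_marginal_likelihood(alpha, counts):
    """log P(D) = log G(a) - log G(a+N) + sum_k [log G(alpha_k + N_k) - log G(alpha_k)],
    with a = sum_k alpha_k, N = sum_k N_k and G the Gamma function."""
    a, n = sum(alpha), sum(counts)
    out = lgamma(a) - lgamma(a + n)
    for ak, nk in zip(alpha, counts):
        out += lgamma(ak + nk) - lgamma(ak)
    return out

print(log_marginal_likelihood([1, 1], [4, 1]))   # log(1/30) ~ -3.40 for a uniform prior and counts (4, 1)
```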
72
Marginal Likelihood: Bayesian Networks
Network structure determines the form of the marginal likelihood
Data (instances 1–7):  X: H T T H T H H   Y: H T H H T T H
Network 1 (X and Y independent): two Dirichlet marginal likelihoods
P(D) = [integral over θ_X] · [integral over θ_Y]
73
Marginal Likelihood: Bayesian Networks
Network structure determines the form of the marginal likelihood
Data (instances 1–7):  X: H T T H T H H   Y: H T H H T T H
Network 2 (X → Y): three Dirichlet marginal likelihoods
P(D) = [integral over θ_X] · [integral over θ_Y|X=H] · [integral over θ_Y|X=T]
74
Marginal Likelihood for Networks
The marginal likelihood has the form:
P(D | G) = ∏_i ∏_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) ] · ∏_{x_i} [ Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]
i.e., a Dirichlet marginal likelihood for each multinomial P(X_i | pa_i)
N(..) are counts from the data; α(..) are hyperparameters for each family, given G
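Continuing the sketch from the multinomial case, the per-family term can be computed as below; the nested-dict layout (parent configuration → child value → count or hyperparameter) is an illustrative choice:

```python
from math import lgamma

def log_family_marginal(alpha, counts):
    """Dirichlet marginal likelihood for one family P(X_i | Pa_i): a sum of logs over
    parent configurations pa of
      G(alpha(pa)) / G(alpha(pa) + N(pa)) * prod_x G(alpha(x,pa) + N(x,pa)) / G(alpha(x,pa))."""
    out = 0.0
    for pa in alpha:
        a_pa = sum(alpha[pa].values())
        n_pa = sum(counts[pa].values())
        out += lgamma(a_pa) - lgamma(a_pa + n_pa)
        for x in alpha[pa]:
            out += lgamma(alpha[pa][x] + counts[pa][x]) - lgamma(alpha[pa][x])
    return out

# Summing log_family_marginal over all families of G gives log P(D | G).
```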
75
Bayesian Score: Asymptotic Behavior
log P(D | G) = ℓ(θ̂_G : D) − (log M / 2)·dim(G) + O(1)
             = M Σ_i [ I(X_i ; Pa_i^G) − H(X_i) ] − (log M / 2)·dim(G) + O(1)
(the first term fits dependencies in the empirical distribution; the second is a complexity penalty)
As M (the amount of data) grows:
  Increasing pressure to fit the dependencies in the distribution
  The complexity term avoids fitting noise
Asymptotic equivalence to the MDL score
The Bayesian score is consistent: the observed data eventually overrides the prior
76
Structure Search as Optimization
Input: training data, a scoring function, and a set of possible structures
Output: a network that maximizes the score
Key computational property, decomposability:
score(G) = Σ score( family of X in G )
77
Tree-Structured Networks
Trees: at most one parent per variable
Why trees?
  Elegant math: we can solve the optimization problem
  Sparse parameterization: avoids overfitting
[Figure: a tree-structured network over the ICU Alarm variables]
78
Learning Trees
Let p(i) denote the parent of X_i
We can write the Bayesian score as
Score(G : D) = Σ_i Score(X_i : Pa_i) = Σ_i [ Score(X_i : X_{p(i)}) − Score(X_i) ] + Σ_i Score(X_i)
Score = sum of edge scores (each X_i's improvement over the “empty” network) + a constant (the score of the “empty” network)
79
Learning Trees
Set w(j → i) = Score(X_j → X_i) − Score(X_i)
Find the tree (or forest) with maximal weight: a standard max spanning tree algorithm, O(n^2 log n)
Theorem: this procedure finds the tree with the maximum score
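A sketch of the likelihood-score instance of this procedure, where the edge weight w(j → i) reduces to M·I(X_i ; X_j) and is symmetric (the classic Chow-Liu construction); the Bayesian-score variant on the slide only changes how the weights are computed. Function names and the toy data are illustrative:

```python
import numpy as np
import networkx as nx
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information between two discrete columns."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * np.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def best_tree(data):
    """data: column-name -> list of values. The weight M*I(Xi;Xj) is the gain of the
    edge over the empty network under the likelihood score, so the maximum-weight
    spanning tree is the highest-scoring tree."""
    g = nx.Graph()
    cols = list(data)
    g.add_nodes_from(cols)
    m = len(next(iter(data.values())))
    for a in range(len(cols)):
        for b in range(a + 1, len(cols)):
            g.add_edge(cols[a], cols[b],
                       weight=m * mutual_information(data[cols[a]], data[cols[b]]))
    return nx.maximum_spanning_tree(g)

tree = best_tree({"E": [0, 0, 1, 1], "B": [0, 1, 0, 1], "A": [0, 1, 1, 1]})
print(tree.edges(data=True))
```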
80
Beyond Trees
When we consider more complex networks, the problem is not as easy
Suppose we allow at most two parents per node: a greedy algorithm is no longer guaranteed to find the optimal network
In fact, no efficient algorithm exists
Theorem: Finding maximal scoring structure with at most k parents per node is NP-hard for k > 1