An Introduction to Bayesian Networks: Concepts and Learning from Data
Transcript of the tutorial slides
Jeong-Ho Chang, Seoul National University
2005-04-16 (Sat.)
Outline
- Introduction
- Basic Concepts of Bayesian Networks
- Learning Bayesian Networks: Parameter Learning, Structural Learning
- Applications: DNA microarray (Classification, Dependency Analysis)
- Summary
Bayesian Networks
A compact representation of a joint probability distribution over variables, based on the concept of conditional independence.
- Qualitative part (graph theory): a directed acyclic graph; nodes are variables, edges represent the dependency or influence of one variable on another.
- Quantitative part (probability theory): a set of conditional probability distributions, one for each variable.
Bayesian networks naturally handle the problems of complexity and uncertainty.
Bayesian Networks (Cont'd)
A BN encodes probabilistic relationships among a set of objects or variables. It is useful in that it:
- encodes the dependencies among all variables: a modular representation of knowledge;
- can be used for learning causal relationships, which helps in understanding a problem domain;
- has both a causal and a probabilistic semantics;
- can naturally combine prior knowledge and data;
- provides an efficient and principled approach for avoiding overfitting, in conjunction with Bayesian statistical methods.
BN as a Probabilistic Graphical Model
Graphical models come in two kinds:
- directed graphs: Bayesian networks
- undirected graphs: Markov random fields
[Figure: two graphs over the nodes A, B, C, D, E; one directed (a Bayesian network) and one undirected (a Markov random field).]
Representation of Joint Probability
Markov random field (undirected): the joint probability is a normalized product of clique potentials,
  P(A, B, C, D, E) = (1/Z) ∏_i φ_i(Z_i) = (1/Z) φ_1(A, B) φ_2(B, C, D) φ_3(C, D, E),
where Z is the normalization constant and Z_1, Z_2, Z_3 are the cliques of the graph.
Bayesian network (directed): the joint probability factorizes into conditional probabilities,
  P(A, B, C, D, E) = P(A | Pa(A)) P(B | Pa(B)) P(C | Pa(C)) P(D | Pa(D)) P(E | Pa(E)).
Writing the joint probability as a product of conditional probabilities can dramatically reduce the number of parameters needed for data modeling in Bayesian networks.
[Figure: the ALARM network for patient monitoring, with nodes such as PCWP, CO, HRBP, CATECHOL, VENTLUNG, HYPOVOLEMIA, CVP, and BP. From the NIPS'01 tutorial by Friedman, N. and Koller, D.]
37 variables in total; 509 parameters instead of 2^54.
Real World Applications of BN
- Intelligent agents: the Microsoft Office assistant (Bayesian user modeling)
- Medical diagnosis: PATHFINDER (Heckerman, 1992), diagnosis of lymph node disease, commercialized as INTELLIPATH (http://www.intellipath.com/)
- Control decision support systems
- Speech recognition (HMMs)
- Genome data analysis: gene expression, DNA sequences, combined analysis of heterogeneous data
- Turbocodes (channel coding)
Outline
- Introduction
- Basic Concepts of Bayesian Networks
- Learning Bayesian Networks: Parameter Learning, Structural Learning
- Applications: DNA microarray (Classification, Dependency Analysis)
- Summary
Causal Networks
- Node: an event
- Arc: a causal relationship between two nodes; A → B means A causes B.
Causal network for the car-start problem [Jensen 01]:
[Figure: Fuel → Fuel Meter Standing; Fuel → Start; Clean Spark Plugs → Start.]
Reasoning with Causal Networks
- My car does not start: this increases the certainty of no fuel and of dirty spark plugs, and increases the certainty of the fuel meter standing on empty.
- The fuel meter stands on half: this decreases the certainty of no fuel and increases the certainty of dirty spark plugs.
[Figure: the car-start causal network.]
d-separation
Connections in causal networks: serial (A → C → B), diverging (A ← C → B), and converging (A → C ← B).
Definition [Jensen 01]: Two nodes in a causal network are d-separated if, for every path between them, there is an intermediate node V such that either
- the connection is serial or diverging and the state of V is known, or
- the connection is converging and neither V nor any of V's descendants has received evidence.
d-separation (Cont'd)
- Serial or diverging connection through C, C unobserved: A and B are marginally dependent.
- Serial or diverging connection through C, C observed: A and B are conditionally independent given C.
- Converging connection (A → C ← B), C unobserved: A and B are marginally independent.
- Converging connection, C observed: A and B are conditionally dependent given C.
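The converging case can be checked numerically by brute-force enumeration. A minimal sketch with hypothetical CPTs for A → C ← B (the numbers are made up for illustration):

```python
# Hypothetical CPTs for a converging connection A -> C <- B.
p_a = {0: 0.6, 1: 0.4}                      # P(A)
p_b = {0: 0.7, 1: 0.3}                      # P(B)
p_c1 = {(a, b): (0.9 if a or b else 0.1)    # P(C=1 | A, B)
        for a in (0, 1) for b in (0, 1)}

def joint(a, b, c):
    pc = p_c1[(a, b)]
    return p_a[a] * p_b[b] * (pc if c == 1 else 1 - pc)

# Marginal independence: P(A=1, B=1) equals P(A=1) P(B=1).
p_ab = sum(joint(1, 1, c) for c in (0, 1))
assert abs(p_ab - p_a[1] * p_b[1]) < 1e-12

# Conditional dependence given C = 1: P(A=1 | B, C=1) changes with B.
def p_a1_given(b, c):
    return joint(1, b, c) / sum(joint(a, b, c) for a in (0, 1))

print(round(p_a1_given(0, 1), 3), round(p_a1_given(1, 1), 3))  # 0.857 0.4
```

Once C = 1 is observed, learning that B = 1 lowers the probability of A = 1: the "explaining away" effect.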
d-separation: Car Start Problem
1. 'Start' and 'Fuel' are dependent on each other.
2. 'Start' and 'Clean Spark Plugs' are dependent on each other.
3. 'Fuel' and 'Fuel Meter Standing' are dependent on each other.
4. 'Fuel' and 'Clean Spark Plugs' are conditionally dependent given the value of 'Start'.
5. 'Fuel Meter Standing' and 'Start' are conditionally independent given the value of 'Fuel'.
[Figure: the car-start causal network.]
Bayesian Networks: Revisited
Definition: a graphical model for the probabilistic relationships among a set of variables; a compact representation of joint probability distributions on the basis of conditional probabilities. A Bayesian network consists of:
- Qualitative part: a set of n variables X = {X1, X2, ..., Xn} and a set of directed edges between variables; the nodes and edges form a directed acyclic graph (DAG) structure.
- Quantitative part: for each variable Xi with parents Pa(Xi), a conditional probability table for P(Xi | Pa(Xi)).
Modeling of continuous variables is also possible.
Quantitative Specification by Probability Calculus
Fundamentals:
- Conditional probability: P(B | A) = P(A, B) / P(A)
- Product rule: P(A, B) = P(B | A) P(A) = P(A | B) P(B)
- Chain rule, a successive application of the product rule:
    P(X1, ..., Xn) = P(X1, ..., X_{n-1}) P(Xn | X1, ..., X_{n-1})
                   = P(X1, ..., X_{n-2}) P(X_{n-1} | X1, ..., X_{n-2}) P(Xn | X1, ..., X_{n-1})
                   = ...
                   = P(X1) ∏_{i=2}^{n} P(Xi | X1, ..., X_{i-1})
Independence of Two Events
- Conditional independence: A ⊥ B | C  ⇔  P(A | B, C) = P(A | C) and P(B | A, C) = P(B | C)
- Marginal independence: A ⊥ B  ⇔  P(A | B) = P(A) and P(B | A) = P(B)
[Figure: serial, diverging, and converging connections through C.]
Let X = {X1, X2, ..., X10}, with X1, X2, ..., X10 a topological sort of X. For a node such as X5, Pa(X5) denotes its parents, Ch(X5) its children, and De(X5) its descendants.
[Figure: a 10-node DAG illustrating Pa(X5), Ch(X5), and De(X5).]
Applying the chain rule in reverse topological order, and using the conditional independence of each node given its parents:
  P(X) = P(X \ {X10}) P(X10 | X \ {X10}) = P(X \ {X10}) P(X10 | X8)
With X' = X \ {X10}:
  P(X') = P(X' \ {X9}) P(X9 | X' \ {X9}) = P(X' \ {X9}) P(X9 | X7)
With X'' = X' \ {X9}:
  P(X'') = P(X'' \ {X8}) P(X8 | X'' \ {X8}) = P(X'' \ {X8}) P(X8 | X5, X6)
Continuing in this way yields
  P(X) = P(X1, ..., X10) = ∏_i P(Xi | Pa(Xi)).
Bayesian Network for the Car Start Problem
[Figure: Fuel → Fuel Meter Standing; Fuel → Start; Clean Spark Plugs → Start.]
Priors: P(Fu = Yes) = 0.98, P(CSP = Yes) = 0.96.

P(FMS | Fu):
            FMS = Full   FMS = Half   FMS = Empty
  Fu = Yes  0.39         0.60         0.01
  Fu = No   0.001        0.001        0.998

P(St | Fu, CSP):
  (Fu, CSP)    Start = Yes   Start = No
  (Yes, Yes)   0.99          0.01
  (Yes, No)    0.01          0.99
  (No, Yes)    0             1
  (No, No)     0             1
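These tables suffice for exact inference by brute-force enumeration over the binary variables; a minimal sketch (FMS is independent of Start given Fuel and marginalizes out, so it can be omitted here):

```python
# CPTs from the car-start tables above.
P_fu  = {"yes": 0.98, "no": 0.02}                  # P(Fuel)
P_csp = {"yes": 0.96, "no": 0.04}                  # P(Clean Spark Plugs)
P_st1 = {("yes", "yes"): 0.99, ("yes", "no"): 0.01,
         ("no", "yes"): 0.0,   ("no", "no"): 0.0}  # P(Start=yes | Fu, CSP)

def joint(fu, csp, st):
    p = P_st1[(fu, csp)]
    return P_fu[fu] * P_csp[csp] * (p if st == "yes" else 1 - p)

# Prior probability that the car starts.
p_start = sum(joint(fu, csp, "yes") for fu in P_fu for csp in P_csp)

# Posterior of "no fuel" once we observe that the car does not start.
p_no_start = sum(joint(fu, csp, "no") for fu in P_fu for csp in P_csp)
p_nofuel_given_nostart = sum(joint("no", csp, "no") for csp in P_csp) / p_no_start

print(round(p_start, 4), round(p_nofuel_given_nostart, 4))  # 0.9318 0.2932
```

Observing "Start = No" raises the probability of no fuel from 0.02 to about 0.29, matching the qualitative reasoning on the earlier slide.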
An Illustration of Conditional Independence in BN
[Demonstration with Netica (http://www.norsys.com).]
Main Issues in BN
- Inference in Bayesian networks: given an assignment to a subset of variables (the evidence) in a BN, estimate the posterior distribution over another subset of unobserved variables of interest:
    P(x_unobs | x_obs) = P(x_unobs, x_obs) / P(x_obs)
- Learning Bayesian networks from data:
  - Parameter learning: given a data set, estimate the local probability distributions P(Xi | Pa(Xi)) for all variables (nodes) of the BN.
  - Structure learning: given a data set, search for a network structure G (dependency structure) that is best, or at least plausible.
This tutorial focuses on the learning issue.
Outline
- Introduction
- Basic Concepts of Bayesian Networks
- Learning Bayesian Networks: Parameter Learning, Structural Learning
- Applications: DNA microarray (Classification, Dependency Analysis)
- Summary
Learning Bayesian Networks
[Diagram: data acquisition and preprocessing produce a data set which, together with prior knowledge, feeds Bayesian network learning: structure search, score metric, and parameter learning.]
BN = structure + local probability distributions.
Learning Bayesian Networks (Cont'd)
Bayesian network learning consists of structure learning (the DAG structure) and parameter learning (the local probability distributions). Four situations arise:
- known structure, complete data
- unknown structure, complete data
- known structure, incomplete data
- unknown structure, incomplete data
Parameter Learning
Task: given a network structure, estimate the parameters of the model from data.
[Data: samples s1, ..., sM over the variables A, B, C, D, each taking values H or L. Network: A → C, B → C, B → D.]
Example of estimated local distributions:
  P(A):  H 0.99, L 0.01
  P(B):  H 0.93, L 0.07
  P(C = H | A, B):  (H, H) 0.4;  (H, L) 0.2;  (L, H) 0.3;  (L, L) 0.8
  P(D = H | B):  B = H: 0.9;  B = L: 0.1
Key point: independence of parameter estimation.
D = {s1, s2, ..., sM}, where si = (ai, bi, ci, di) is an instance of the random vector S = (A, B, C, D). Assumption: the samples si are independent and identically distributed (i.i.d.).
  L(Θ_G; D) = P(D | Θ_G) = ∏_{i=1}^{M} P(si | Θ_G)
            = ∏_{i=1}^{M} P(ai | Θ_G) P(bi | Θ_G) P(ci | ai, bi, Θ_G) P(di | bi, Θ_G)
            = (∏_{i=1}^{M} P(ai | Θ_G)) (∏_{i=1}^{M} P(bi | Θ_G)) (∏_{i=1}^{M} P(ci | ai, bi, Θ_G)) (∏_{i=1}^{M} P(di | bi, Θ_G))
The likelihood factorizes into one term per node: parameters can be estimated independently for each node (variable).
One can estimate the parameters for P(A), P(B), P(C | A, B), and P(D | B) in an independent manner:
  P(A, B, C, D) = P(A) × P(B) × P(C | A, B) × P(D | B)
If A, B, C, and D are all binary-valued, the number of parameters is reduced from 15 (2^4 − 1) to 8 (1 + 1 + 4 + 2).
[Figure: the data matrix split column-wise into (a1, ..., aM), (b1, ..., bM), (c1, ..., cM), (d1, ..., dM), each column feeding the estimate for its own node.]
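Per-node maximum-likelihood estimation then reduces to frequency counting. A minimal sketch on hypothetical H/L samples (the data values are made up for illustration):

```python
from collections import Counter

# Hypothetical H/L samples over (A, B, C, D); the network is
# A -> C <- B, B -> D as in the slides.
data = [
    ("H", "L", "L", "L"), ("H", "H", "H", "H"), ("L", "H", "H", "L"),
    ("H", "H", "L", "H"), ("H", "H", "H", "L"), ("L", "H", "H", "H"),
]

def mle_conditional(data, child, parents):
    """Estimate P(child | parents) by relative frequencies (MLE)."""
    joint, marg = Counter(), Counter()
    for row in data:
        pa = tuple(row[i] for i in parents)
        joint[(pa, row[child])] += 1
        marg[pa] += 1
    return {key: n / marg[key[0]] for key, n in joint.items()}

p_a = mle_conditional(data, 0, [])       # P(A): no parents
p_c = mle_conditional(data, 2, [0, 1])   # P(C | A, B)

print(p_a[((), "H")], p_c[(("H", "H"), "H")])  # both 2/3 here
```

Each call touches only its own node's column(s), mirroring the decomposition of the likelihood above.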
Methods for Parameter Estimation
- Maximum likelihood estimation: choose the value of Θ that maximizes the likelihood of the observed data D:
    Θ̂ = argmax_Θ L(Θ; D) = argmax_Θ P(D | Θ)
- Bayesian estimation: represent the uncertainty about the parameters by a probability distribution over Θ; Θ is itself a random variable rather than a point value:
    P(Θ | D) = P(Θ) P(D | Θ) / P(D) ∝ P(Θ) P(D | Θ)
  (posterior ∝ prior × likelihood)
Bayes Rule, MAP and ML
Bayes' rule, with h a hypothesis (model or parameters) and D the data:
  P(h | D) = P(D | h) P(h) / P(D)
- ML (maximum likelihood) estimation: h* = argmax_h P(D | h)
- MAP (maximum a posteriori) estimation: h* = argmax_h P(h | D)
- Bayesian learning: not a point estimate, but the full posterior distribution P(h | D)
(From the NIPS'99 tutorial by Ghahramani, Z.)
Bayesian Estimation (for a multinomial distribution)
Dirichlet prior, encoding prior knowledge as pseudo-counts:
  θ ~ Dir(α_1, α_2, ..., α_K),   P(θ) ∝ ∏_k θ_k^(α_k − 1)
Posterior, where N_k is the sufficient statistic (the count of outcome k in D = {s1, s2, ..., sM}):
  P(θ | D) ∝ P(θ) P(D | θ) ∝ ∏_k θ_k^(α_k + N_k − 1)
Predictive distribution:
  P(S_{M+1} = k | D) = ∫ P(k | θ) P(θ | D) dθ = E_{P(θ|D)}[θ_k] = (N_k + α_k) / Σ_l (N_l + α_l)
This is a smoothed version of the MLE.
An Example: Coin Toss
Data D = H H T H H H, so (N_H, N_T) = (5, 1).
Maximum likelihood estimation:
  P(D | θ) = θ_H θ_H θ_T θ_H θ_H θ_H = θ_H^5 θ_T^1
  θ̂ = argmax_θ P(D | θ) = 5 / (5 + 1) ≈ 0.833,   so P(S = H) = θ̂_H ≈ 0.833
Bayesian inference, with the uniform prior θ ~ Dir(1, 1), P(θ) ∝ θ_H^(1−1) θ_T^(1−1):
  P(θ | D) ∝ P(θ) × P(D | θ) = θ_H^5 θ_T^1
  P(H | D) = (5 + 1) / ((5 + 1) + (1 + 1)) = 6/8 = 0.75
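The two estimates in the coin-toss example can be reproduced in a few lines; a sketch:

```python
# The six tosses from the slide: H H T H H H.
data = list("HHTHHH")
n_h, n_t = data.count("H"), data.count("T")

# Maximum likelihood: 5/6.
theta_mle = n_h / (n_h + n_t)

# Bayesian predictive estimate under a uniform Dir(1, 1) prior:
# (N_H + alpha_H) / (N + alpha_H + alpha_T) = 6/8.
alpha_h = alpha_t = 1.0
theta_bayes = (n_h + alpha_h) / (n_h + n_t + alpha_h + alpha_t)

print(round(theta_mle, 3), theta_bayes)  # 0.833 0.75
```

Raising alpha_h and alpha_t pulls the Bayesian estimate further toward 0.5, as the prior-sensitivity table on the next slide shows.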
Effect of the Dirichlet prior: estimates of P(H) as the data grow.
  Data:        H     HH    HHT   HHTH  HHTHH HHTHHH HHTHHHTTHH (N_H, N_T) = (100, 50)
  MLE          1.00  1.00  0.67  0.75  0.80  0.83   0.70       0.67
  B(0.5, 0.5)  0.75  0.83  0.63  0.70  0.75  0.79   0.68       0.67
  B(1, 1)      0.67  0.75  0.60  0.67  0.71  0.75   0.67       0.66
  B(2, 2)      0.60  0.67  0.57  0.63  0.67  0.70   0.64       0.66
  B(5, 5)      0.55  0.58  0.54  0.57  0.60  0.63   0.60       0.65
With more data, all the Bayesian estimates converge toward the MLE.
[Figure: variation of the posterior distribution for the parameter under the priors B(0.5, 0.5), B(1, 1), B(2, 2), and B(5, 5).]
Outline
- Introduction
- Basic Concepts of Bayesian Networks
- Learning Bayesian Networks: Parameter Learning, Structural Learning (Scoring Metric, Search Strategy)
- Practical Applications: DNA microarray (Classification, Dependency Analysis)
- Summary
Structural Learning
Task: given a data set, search for the most plausible network structure underlying the generation of the data.
Metric (score)-based approach: use a scoring metric to measure how well a particular structure fits the observed set of cases, together with a search strategy over candidate structures.
[Figure: a data table of samples s1, ..., sM over A, B, C, D; several candidate structures, each assigned a score by the scoring metric.]
Scoring Metric
  P(G | D) = P(G) P(D | G) / P(D) ∝ P(G) P(D | G)
(prior for the network structure × marginal likelihood)
Likelihood score:
  Score(G; D) = log P(D | G, Θ_MLE) ∝ Σ_{i=1}^{N} I(Xi; Pa(Xi)) − Σ_{i=1}^{N} H(Xi)
Nodes with high mutual information (dependency) with their parents get a higher score. Since I(X; Y) ≤ I(X; {Y, Z}), the fully connected network is obtained in the unrestricted case: the likelihood score is prone to overfitting.
Likelihood Score in Relation with Information Theory
Let N_ijk be the number of cases with Xi in state k and Pa(Xi) in configuration j, N_ij = Σ_k N_ijk, and M the number of samples. Then
  log L(Θ̂; D) = Σ_{i=1}^{N} Σ_j Σ_{k=1}^{|Xi|} N_ijk log (N_ijk / N_ij)
              = M Σ_{i=1}^{N} Σ_j Σ_{k=1}^{|Xi|} (N_ijk / M) log (N_ijk / N_ij)
              = −M Σ_{i=1}^{N} H(Xi | Pa(Xi))
              = M Σ_{i=1}^{N} (H(Xi) − H(Xi | Pa(Xi))) − M Σ_{i=1}^{N} H(Xi)
              ∝ Σ_{i=1}^{N} I(Xi; Pa(Xi)) − Σ_{i=1}^{N} H(Xi)
[Figure: Venn diagram of H(X), H(Y), H(X|Y), H(Y|X), and I(X;Y); the empty network over A, B, C, D versus a connected one.]
Bayesian Score
Consider the uncertainty in parameter estimation in the Bayesian network:
  Score(G; D) = P(G) ∫ P(D | Θ, G) P(Θ | G) dΘ
Assuming complete data and parameter independence, the marginal likelihood can be rewritten as
  P(D | G) = ∏_{i=1}^{N} ∫ P(D(Xi; Pa(Xi)) | θ_i, G) P(θ_i | G) dθ_i
that is, a marginal likelihood for each pair (Xi, Pa(Xi)).
Bayesian Dirichlet Score
For the multinomial case, if we assume a Dirichlet prior θ_ij ~ Dir(α_ij1, α_ij2, ..., α_ij|Xi|) for each parameter vector (Heckerman, 1995):
  ∫ P(D(Xi; Pa(Xi)) | θ_i, G) P(θ_i | G) dθ_i = ∏_j [ Γ(α_ij) / Γ(α_ij + N_ij) ] ∏_{k=1}^{|Xi|} [ Γ(α_ijk + N_ijk) / Γ(α_ijk) ]
where α_ij = Σ_k α_ijk, N_ij = Σ_k N_ijk, and N_ijk = #{Xi = xi^k, Pa(Xi) = pa_i^j}.
The Gamma function: Γ(x) = ∫_0^∞ t^(x−1) e^(−t) dt, with Γ(n + 1) = n Γ(n) = n! and Γ(1) = 1.
Bayesian Dirichlet Score (Cont'd)
Example: data D = H H T H, with prior θ ~ Dir(α_H, α_T) and α = α_H + α_T. The marginal likelihood is a product of sequential predictive probabilities:
  P(H | ∅)   = α_H / α
  P(H | H)   = (α_H + 1) / (α + 1)
  P(T | HH)  = α_T / (α + 2)
  P(H | HHT) = (α_H + 2) / (α + 3)
so that
  P(D) = [α_H (α_H + 1) (α_H + 2) α_T] / [α (α + 1) (α + 2) (α + 3)]
       = [Γ(α) / Γ(α + 4)] × [Γ(α_H + 3) / Γ(α_H)] × [Γ(α_T + 1) / Γ(α_T)]
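The sequential-product and Gamma-function forms of this marginal likelihood agree; a sketch checking both for D = H H T H under a Dir(1, 1) prior:

```python
from math import exp, lgamma

def marginal_sequential(seq, a_h, a_t):
    """Product of sequential predictive probabilities."""
    p, n_h, n_t = 1.0, 0, 0
    for x in seq:
        if x == "H":
            p *= (a_h + n_h) / (a_h + a_t + n_h + n_t)
            n_h += 1
        else:
            p *= (a_t + n_t) / (a_h + a_t + n_h + n_t)
            n_t += 1
    return p

def marginal_gamma(n_h, n_t, a_h, a_t):
    """Closed form via Gamma functions (evaluated in log space)."""
    a = a_h + a_t
    return exp(lgamma(a) - lgamma(a + n_h + n_t)
               + lgamma(a_h + n_h) - lgamma(a_h)
               + lgamma(a_t + n_t) - lgamma(a_t))

p_seq = marginal_sequential("HHTH", 1.0, 1.0)
p_gam = marginal_gamma(3, 1, 1.0, 1.0)
print(round(p_seq, 6), round(p_gam, 6))  # 0.05 0.05
```

Working with log-Gamma (`lgamma`) avoids overflow when the counts N_ijk are large, which is how the score is computed in practice.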
Bayesian Dirichlet Score (Cont'd)
Combining the factors over all nodes and parent configurations:
  P(D | G) = ∏_{i=1}^{N} ∏_j [ Γ(α_ij) / Γ(α_ij + N_ij) ] ∏_{k=1}^{|Xi|} [ Γ(α_ijk + N_ijk) / Γ(α_ijk) ]
log P(D | G) in the Bayesian score is asymptotically equivalent to BIC (Schwarz, 1978) and to minus the MDL criterion (Rissanen, 1987):
  log P(D | G) ≈ BIC(G; D) = log P(D | Θ̂, G) − (dim(G) / 2) log M
where dim(G) is the number of parameters in G.
Structure Search
Given a data set, a score metric, and a set of possible structures, find the network structure with maximal score: a discrete optimization problem. One can exploit the decomposability of the score over the pairs (Xi, Pa(Xi)):
  P(D | G) = ∏_{i=1}^{N} P(D(Xi; Pa(Xi)) | G)
  Score(G; D) = log P(D | G) = Σ_{i=1}^{N} Score(Xi; Pa(Xi))
Tree-Structured Networks
Definition: each node has at most one parent. An effective search algorithm exists, since
  Score(G; D) = Σ_i Score(Xi | Pa_i) = Σ_i [Score(Xi | Pa_i) − Score(Xi)] + Σ_i Score(Xi)
(improvement over the empty network + score of the empty network).
Chow and Liu (1968):
1. Construct the complete undirected graph with edge weights E(Xi, Xj) = I(Xi; Xj).
2. Build a maximum-weight spanning tree.
3. Transform it into a directed tree by choosing an arbitrary root node.
[Example: data samples s1, ..., sM over A, B, C, D.]
Complete undirected graph with mutual-information weights:
  I(A;B) = 0.0652, I(A;C) = 0.0880, I(A;D) = 0.0834,
  I(B;C) = 0.1913, I(B;D) = 0.1967, I(C;D) = 0.1980
From these, build the maximum spanning tree, then direct it from an arbitrary root node.
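Steps 2 and 3 of the Chow-Liu procedure can be sketched directly on the mutual-information weights from this example (step 1, estimating the weights from data, is assumed done):

```python
# Mutual-information edge weights from the example.
mi = {("A", "B"): 0.0652, ("A", "C"): 0.0880, ("A", "D"): 0.0834,
      ("B", "C"): 0.1913, ("B", "D"): 0.1967, ("C", "D"): 0.1980}

# Step 2: maximum-weight spanning tree (Kruskal with a tiny union-find).
comp = {v: v for v in "ABCD"}
def find(v):
    while comp[v] != v:
        v = comp[v]
    return v

tree = []
for (u, v), w in sorted(mi.items(), key=lambda e: -e[1]):
    if find(u) != find(v):          # adding this edge creates no cycle
        comp[find(u)] = find(v)
        tree.append((u, v))

# Step 3: direct the tree away from an arbitrary root, here A.
adj = {v: set() for v in "ABCD"}
for u, v in tree:
    adj[u].add(v)
    adj[v].add(u)

directed, seen, frontier = [], {"A"}, ["A"]
while frontier:
    u = frontier.pop()
    for v in adj[u] - seen:
        directed.append((u, v))
        seen.add(v)
        frontier.append(v)

print(sorted(tree), sorted(directed))
```

On these weights, the spanning tree keeps C-D, B-D, and A-C; rooted at A it becomes A → C, C → D, D → B.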
Search Strategies for General Bayesian Networks
With more than one parent allowed per node, the problem is NP-hard (Chickering, 1996), so heuristic search methods are usually employed:
- greedy hill-climbing (local search)
- greedy hill-climbing with random restarts
- simulated annealing
- tabu search
- ...
Greedy Local Search Algorithm
INPUT: data set, scoring metric, set of possible structures.
1. Initialize the structure: empty network, random network, tree-structured network, etc.
2. Apply the local edge operation (INSERT, DELETE, or REVERSE an edge) resulting in the largest improvement in score among all possible operations, giving G_{t+1}.
3. If Score(G_{t+1}; D) > Score(G_t; D), go back to step 2; otherwise stop with G_final = G_t.
[Figure: the INSERT, DELETE, and REVERSE operations applied to a four-node network.]
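The loop can be sketched as follows. The scoring function here is a stand-in that rewards edges of a hypothetical "true" structure; a real implementation would plug in a decomposable metric such as BIC or the Bayesian Dirichlet score:

```python
from itertools import permutations

NODES = ["A", "B", "C", "D"]
TRUE_EDGES = {("A", "C"), ("B", "C"), ("B", "D")}   # hypothetical "truth"

def score(edges):
    # Stand-in decomposable score: reward true edges, penalize complexity.
    return sum(1 for e in edges if e in TRUE_EDGES) - 0.1 * len(edges)

def is_acyclic(edges):
    children = {v: [y for x, y in edges if x == v] for v in NODES}
    state = {}
    def visit(v):
        if state.get(v) == 1:
            return False            # back edge found: a cycle
        if state.get(v) == 2:
            return True
        state[v] = 1
        ok = all(visit(w) for w in children[v])
        state[v] = 2
        return ok
    return all(visit(v) for v in NODES)

def neighbors(edges):
    for u, v in permutations(NODES, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                 # DELETE
            yield (edges - {(u, v)}) | {(v, u)}    # REVERSE
        elif (v, u) not in edges:
            yield edges | {(u, v)}                 # INSERT

g = frozenset()                     # start from the empty network
while True:
    best = max((n for n in neighbors(g) if is_acyclic(n)), key=score)
    if score(best) <= score(g):
        break
    g = best

print(sorted(g))
```

With this toy score, the search recovers exactly the three "true" edges and then stops, since every further edge operation lowers the score.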
Enhanced Search
Greedy local search can get stuck in local maxima or on plateaux. Standard heuristics to escape them include search with random restarts, simulated annealing, and tabu search; genetic algorithms provide a population-based search.
[Figure: score landscapes illustrating plain greedy search stuck at a local maximum, greedy search with random restarts, simulated annealing, and population-based search (GA).]
Learning Bayesian Networks: Summary
- A joint (probability) distribution is a product of conditional probabilities, one per variable (node); a sum in the log representation.
- Parameter learning: estimation of the local probability distributions; MLE, Bayesian estimation.
- Structure learning:
  - Score metrics: likelihood, Bayesian score, BIC, MDL.
  - Search strategies: optimization in a discrete space to maximize the score; the key concept is the decomposability of the score; tree-structured and general Bayesian networks.
Outline
- Introduction
- Basic Concepts of Bayesian Networks
- Learning Bayesian Networks: Parameter Learning, Structural Learning
- Applications: DNA microarray (Classification with Naïve Bayes and TAN, Dependency Analysis)
- Summary
DNA Microarray Data Analysis
- Classification: gene expression data of 72 leukemia patients; task: classify samples as AML or ALL based on their expression patterns, using Bayesian networks.
- Combined analysis of gene expression data and drug activity data: data for 60 cancer samples; task: construct a dependency network of genes and drugs.
Tools:
- WEKA (http://www.cs.waikato.ac.nz/~ml/weka/): a collection of machine learning algorithms for data mining tasks (implemented in Java, open-source software issued under the GNU General Public License); Bayesian network learning algorithms are included.
- The BNJ software (http://bnj.sourceforge.net/) is used for visualization when needed.
DNA Microarrays
Recent developments in the technology for biological experiments have made it possible to produce massive biological data sets: microarrays monitor thousands of gene expression levels simultaneously (versus traditional one-gene experiments), giving a parallel view of the expression patterns of thousands of genes in a cell. Bayesian networks are a useful tool for identifying a variety of meaningful relationships among genes from such data.
Bayesian Networks in Microarray Data Analysis
Expression profiles → Bayesian network construction →
- sample classification: disease diagnosis
- gene-gene relation analysis: activation or inhibition between genes
- gene regulatory network analysis: a global view of the relations among genes
- combined analysis with other biological data: DNA sequence data, drug activity data, and so on
DNA Microarray
[Figure: a microarray image, processed by image analysis.]
Data Preparation for Data Mining
[Figure: microarray image samples (Sample 1, ..., Sample M) are converted by image analysis into a numerical gene-by-sample matrix for data mining.]
Example 1: Tumor Type Classification
Task: classify leukemia samples into two classes based on gene expression patterns, using Bayesian networks.
[Figure: DNA microarray data from two kinds of leukemia patients → selected genes and the target variable → learned Bayesian network over Gene A, Gene B, Gene C, Gene D, and the Target node.]
Data Sets
72 samples in total: 25 AML (acute myeloid leukemia) samples + 47 ALL (acute lymphoblastic leukemia) samples (Golub et al., 1999). A data sample consists of expression measurements for over 7,000 genes.
Preprocessing:
- 30 informative genes were selected by the P-metric,
    score(g) = (μ_AML − μ_ALL) / (σ_AML + σ_ALL)
- Expression values were discretized into 2 levels, High and Low.
The final data set is 72 samples, each consisting of 'High' or 'Low' expression values for the 30 genes.
Classification by Bayesian Networks
Classification as inference on a variable (node) in a Bayesian network. Bayes optimal classifier:
  c*(g) = argmax_{c∈C} P(c | g, D) = argmax_{c∈C} Σ_h P(c | g, h) P(h | D)
If P(ĥ | D) = 1 for a single learned network ĥ, then c*(g) = argmax_{c∈C} P(c | g, ĥ), with
  P(c_k | g) = P(c_k, g) / P(g) = P(c_k, g) / Σ_l P(c_l, g) ∝ P(c_k, g)
[Figure: a learned network over the 'type' node and gene nodes, with local factors such as P(type), P(pic | type), P(sltc4 | type), P(zyxin | type), P(mb-1 | type, fah), P(myb_c | type, pic), and P(fah | type, sltc4).]
Naïve Bayes Classifier
A very restricted form of Bayesian network: all variables (nodes) are conditionally independent given the value of the class variable. Though simple, it is still widely used in classification tasks.
  P(type, s) = P(type) P(s | type)
  P(s | type) = P(myb_c | type) P(zyxin | type) P(pic | type) P(sltc4 | type) P(fah | type) P(mb-1 | type)
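A naïve Bayes classifier over discretized expression values reduces to counting. A sketch on hypothetical H/L data with AML/ALL labels (the samples and the Laplace smoothing are illustrative, not taken from the study):

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical training data: (class, (gene1, gene2, gene3)) with H/L levels.
train = [("AML", ("H", "H", "L")), ("AML", ("H", "L", "L")),
         ("ALL", ("L", "H", "H")), ("ALL", ("L", "L", "H"))]

class_counts = Counter(c for c, _ in train)
value_counts = defaultdict(Counter)        # (class, gene index) -> level counts
for c, genes in train:
    for i, v in enumerate(genes):
        value_counts[(c, i)][v] += 1

def predict(genes, alpha=1.0):             # alpha: Laplace pseudo-count
    def log_posterior(c):
        s = log(class_counts[c] / len(train))
        for i, v in enumerate(genes):
            cnt = value_counts[(c, i)]
            # binary H/L levels, hence "+ 2 * alpha" in the denominator
            s += log((cnt[v] + alpha) / (sum(cnt.values()) + 2 * alpha))
        return s
    return max(class_counts, key=log_posterior)

print(predict(("H", "H", "L")), predict(("L", "L", "H")))  # AML ALL
```

Working in log space avoids underflow when the product runs over many genes, and the pseudo-counts keep unseen (class, level) pairs from zeroing out the posterior.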
Tree-Augmented Network (TAN) (Friedman et al., 1997)
Naïve Bayes plus a tree-structured network over the variables other than the class node, expressing the dependencies between variables in a restricted way: 'Class' is the root node, and each Xi can have at most one parent besides the 'Class' node.
Example: the naïve Bayes factor P(zyxin | type) becomes P(zyxin | type, fah) in the TAN.
Results (leave-one-out accuracy, correct out of 72):
                              6 genes   30 genes
  NB                          66/72     68/72
  TAN                         69/72     68/72
  General BN (max #pa = 2)    69/72     69/72
  General BN (max #pa = 3)    70/72     67/72
Leave-one-out cross-validation: learn the Bayesian network on 71 samples and test on the remaining one; 72 iterations.
Markov Blanket in BN
The Markov blanket of a node T is MB(T) = {Pa(T), Ch(T), Pa(Ch(T)) \ {T}}, and
  P(T | X \ {T}) = P(T | MB(T))
[Figure: a 10-node network highlighting the Markov blanket of T; the Markov blanket of the node 'type' in the learned TAN.]
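The Markov blanket follows directly from the edge set; a sketch on a hypothetical DAG:

```python
# Hypothetical DAG as a set of directed (parent, child) edges.
edges = {("X1", "T"), ("X2", "T"), ("T", "X5"), ("X4", "X5"), ("X6", "X7")}

def markov_blanket(t, edges):
    """MB(t) = parents of t, children of t, and parents of t's children."""
    parents = {u for u, v in edges if v == t}
    children = {v for u, v in edges if u == t}
    spouses = {u for u, v in edges if v in children}  # parents of children
    return (parents | children | spouses) - {t}

print(sorted(markov_blanket("T", edges)))  # ['X1', 'X2', 'X4', 'X5']
```

Note that X6 and X7 stay outside MB(T): given the blanket, T is independent of the rest of the network.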
Example 2: Gene-Drug Dependency Analysis
Task: construct a Bayesian network for the combined analysis of gene expression data and drug activity data.
[Figure: gene expression data and drug activity data → preprocessing (thresholding, discretization) → selected genes, drugs, and a cancer-type node → Bayesian network learning → learned network over Gene A, Gene B, Drug A, Drug B, and Cancer, used for dependency analysis and probabilistic inference.]
Data Sets
NCI60 data sets (Scherf et al., 2000): 9,703 genes and 1,400 chemical compounds from 60 human cancer samples.
Preprocessing:
- 12 genes and 4 drugs were selected based on the correlation analysis results in (Scherf et al., 2000), considering the learning time and the visualization of the Bayesian networks.
- Gene expression values and drug activity values were discretized into 3 levels: High, Mid, Low.
Nodes in the Bayesian network: 12 genes, 4 drugs, and the cancer type.
Results
[Visualization by the BNJ software (http://bnj.sourceforge.net/).]
Identification of a plausible dependency among 2 genes and 1 drug.
Inferring Gene Regulation Models
Reconstruction of the regulatory relations among genes from genome data. One of the challenges in reconstructing gene regulatory networks with Bayesian networks is statistical robustness, which arises from the sparse nature of gene expression data: the number of samples is usually much smaller than the number of genes (attributes), which can produce unstable, spurious results. Remedies include the bootstrap, model averaging, and the incorporation of prior biological knowledge.
Bootstrap-Based Approach
Learn multiple Bayesian networks from bootstrapped samples (Friedman, N. et al., 2000; Pe'er, D. et al., 2001):
1. Resample the expression data with replacement into bootstrap sets B1, B2, ..., BL.
2. Learn a network Gi from each Bi.
3. Estimate feature statistics (e.g., "gene Y is in the Markov blanket of X"):
     conf(f) = (1/L) Σ_{i=1}^{L} f(Gi)
4. Identify significant pairwise relations (Friedman, 2000) or subnetworks (Pe'er, 2001).
Applied to yeast cell-cycle data: 76 samples for 800 genes (Friedman et al., 2000).
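The resampling-and-averaging loop can be sketched as follows. The "learner" here is a stand-in that tests a single feature on synthetic data; the real procedure would run full structure learning on each bootstrap sample:

```python
import random

random.seed(0)

# Hypothetical data: 50 samples of a pair (X, Y) where Y usually copies X.
data = [(x, x if random.random() < 0.9 else 1 - x)
        for x in (random.randint(0, 1) for _ in range(50))]

def feature(sample):
    """Stand-in f(G_i): 1 if the 'learned network' on this sample would
    link X and Y (proxied here by X and Y agreeing in most cases)."""
    agree = sum(1 for x, y in sample if x == y)
    return 1 if agree > len(sample) / 2 else 0

L = 200
conf = sum(feature(random.choices(data, k=len(data)))  # resample with replacement
           for _ in range(L)) / L
print(conf)
```

Features that survive resampling in most of the L replicates get a confidence near 1 and are the ones reported as significant.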
Model Averaging-Based Approach
Estimate the confidence of features by averaging over the posterior distribution of models (Hartemink et al., 2002):
  conf(f) = Σ_{i=1}^{L} P(Gi | D) P(f | D, Gi)
Structure learning over the expression data by simulated annealing; location data serves as a prior.
Incorporating Prior Knowledge or Assumptions
Impose biologically motivated assumptions on the network structure; for example, constructing a gene regulation model from expression data by inferring a regulator set (Pe'er, D., et al., 2002).
Assumption: a relatively small set of genes is directly involved in transcriptional regulation. Hence:
- limit the number of candidate regulators;
- limit the number of parents for each node (gene) and the number of genes having outgoing edges.
This significantly reduces the number of possible models and improves the statistical robustness of the identified regulator sets.
(Pe'er et al., 2002): 358 samples from combined data sets for S. cerevisiae.
Pipeline: gene expression data → remove non-informative genes, discretization → 3,622 genes in total → candidate regulator set (456 genes) → learning with structure constraints → a 2-level network with a regulator layer (R1, R2, ..., RM) above the target genes, scored by Score(g; Pa_g) = I(g; Pa_g).
Summary
Bayesian networks provide an efficient and effective framework for organizing a body of knowledge by encoding the probabilistic relationships among variables of interest.
- Graph theory + probability theory: DAG + local probability distributions.
- Conditional independence and conditional probability are the keystones.
- A compact way to express complex systems by simpler probabilistic modules, and thus a natural framework for dealing with complexity and uncertainty.
Two problems in learning Bayesian networks from data:
- Parameter estimation: MLE, MAP, Bayesian estimation.
- Structural learning: tree-structured networks; heuristic search for general Bayesian networks.
Summary (Cont'd)
Important issues not covered in this tutorial:
- Probabilistic inference in general Bayesian networks: exact and approximate inference.
- Incomplete data (missing data, hidden variables), for both known and unknown structure; the Expectation-Maximization algorithm can be employed, as in latent variable models.
- Bayesian model averaging over structures as well as parameters.
- Dynamic Bayesian networks for temporal data.
- Many applications: text and web mining, medical diagnosis, intelligent agent systems, bioinformatics.
References
Bayesian networks (papers & books):
- Chow, C. and Liu, C., "Approximating discrete probability distributions with dependency trees", IEEE Transactions on Information Theory, 14, pp. 462-467, 1968.
- Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988.
- Neapolitan, E., Probabilistic Reasoning in Expert Systems, 1990.
- Charniak, E., "Bayesian networks without tears", AI Magazine, 12(4), pp. 50-63, 1991.
- Heckerman, D., "Learning Bayesian networks: the combination of knowledge and statistical data", Machine Learning, 20, pp. 197-243, 1995.
- Jensen, F. V., An Introduction to Bayesian Networks, Springer-Verlag, 1996.
- Friedman, N., Geiger, D., and Goldszmidt, M., "Bayesian network classifiers", Machine Learning, 29(2), pp. 131-163, 1997.
- Frey, B. J., Graphical Models for Machine Learning and Digital Communication, MIT Press, 1998.
- Jordan, M. I. (ed.), Learning in Graphical Models, Kluwer Academic Publishers, 1998.
- Friedman, N. and Goldszmidt, M., "Learning Bayesian networks with local structure", in Learning in Graphical Models (ed. Jordan, M. I.), pp. 421-460, MIT Press, 1999.
- Heckerman, D., "A tutorial on learning with Bayesian networks", in Learning in Graphical Models (ed. Jordan, M. I.), pp. 301-354, MIT Press, 1999.
- Pearl, J. and Russell, S., "Bayesian networks", UCLA Cognitive Systems Laboratory, Technical Report R-277, November 2000.
- Spirtes, P., Glymour, C., and Scheines, R., Causation, Prediction, and Search, 2nd edition, MIT Press, 2000.
- Jensen, F. V., Bayesian Networks and Decision Graphs, Springer, 2001.
- Pearl, J., "Bayesian networks, causal inference and knowledge discovery", UCLA Cognitive Systems Laboratory, Technical Report R-281, March 2001.
- Korb, K. B. and Nicholson, A. E., Bayesian Artificial Intelligence, CRC Press, 2004.
Graphical models and Bayesian networks (tutorials & lectures)
- Friedman, N. and Koller, D., Learning Bayesian Networks from Data (http://www.cs.huji.ac.il/~nir/Nips01-Tutorial/)
- Murphy, K., A Brief Introduction to Graphical Models and Bayesian Networks (http://www.cs.ubc.ca/~murphyk/Bayes/bayes.html)
- Ghahramani, Z., Probabilistic Models for Unsupervised Learning (http://www.gatsby.ucl.ac.uk/~zoubin/NIPStutorial.html)
- Ghahramani, Z., Bayesian Methods for Machine Learning (http://www.gatsby.ucl.ac.uk/~zoubin/ICML04-tutorial.html)
- Moser, A., Probabilistic Independence Networks (I-II) (http://www.eecis.udel.edu/~bloch/eleg867/notes/PIN_I_rev2/index.htm, http://www.eecis.udel.edu/~bloch/eleg867/notes/PIN_II_rev2/index.htm)
- Poet, M., Special Topics: Belief Networks (http://www.stat.duke.edu/courses/Spring99/sta294/)
Bayesian networks with application to bioinformatics

- Murphy, K. and Mian, S., "Modeling gene expression data using dynamic Bayesian networks", Technical report, Computer Science Division, University of California, Berkeley, CA, 1999.
- Friedman, N., Linial, M., Nachman, I., and Pe'er, D., "Using Bayesian networks to analyze expression data", Journal of Computational Biology, 7(3/4), pp. 601-620, 2000.
- Cai, D., Delcher, A., Kao, B., and Kasif, S., "Modeling splice sites with Bayes networks", Bioinformatics, 16(2), pp. 152-158, 2000.
- Pe'er, D., Regev, A., Elidan, G., and Friedman, N., "Inferring subnetworks from perturbed expression profiles", Bioinformatics, 17(suppl 1), pp. S215-S224, 2001.
- Pe'er, D., Regev, A., and Tanay, A., "Minreg: inferring an active regulator set", Bioinformatics, 18(suppl 1), pp. 258-267, 2002.
- Hartemink, A. J., Gifford, D. K., Jaakkola, T. S., and Young, R. A., "Combining location and expression data for principled discovery of genetic regulatory network models", Pacific Symposium on Biocomputing, 7, pp. 437-449, 2002.
- Imoto, S., Kim, S., Goto, T., Aburatani, S., Tashiro, K., Kuhara, S., and Miyano, S., "Bayesian network and nonparametric heteroscedastic regression for nonlinear modeling of genetic network", Journal of Bioinformatics and Computational Biology, 1(2), pp. 231-252, 2003.
- Jansen, R., et al., "A Bayesian networks approach for predicting protein-protein interactions from genomic data", Science, 302, pp. 449-453, 2003.
- Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B., and Botstein, D., "A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae)", PNAS, 100(14), pp. 8348-8353, 2003.
- Beer, M. A. and Tavazoie, S., "Predicting gene expression from sequence", Cell, 117(2), pp. 185-198, 2004.
- Friedman, N., "Inferring cellular networks using probabilistic graphical models", Science, 303(5659), pp. 799-805, 2004.
A comprehensive list is available at http://zlab.bu.edu/kasif/bayes-net.html
Bayesian network software packages

- Bayes Net Toolbox (Kevin Murphy), http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html: a variety of algorithms for learning and inference in graphical models (written in MATLAB).
- WEKA, http://www.cs.waikato.ac.nz/~ml/weka/: Bayesian network learning and classification modules are included among a collection of machine learning algorithms (written in Java).
- Detailed lists and comparisons are available at http://www.cs.ubc.ca/~murphyk/Bayes/bnsoft.html
- Bayesian network list in the Google directory: http://directory.google.com/Top/Computers/Artificial_Intelligence/Belief_Networks/Software/
- Korb, K. B. and Nicholson, A. E., Bayesian Artificial Intelligence, CRC Press, 2004, Appendix B: http://www.csse.monash.edu.au/bai/book/appendix_b.pdf
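To give a concrete sense of the factorization that these toolkits build on, here is a minimal plain-Python sketch of the joint-probability decomposition P(X1, ..., Xn) = ∏ P(Xi | Pa(Xi)) from the earlier slides. The five-variable network and all probability values are made up for illustration; real packages such as BNT store CPTs and perform inference far more efficiently.

```python
from itertools import product

# Hypothetical network (echoing the A-E example in the slides):
# A -> B, A -> C, {B, C} -> D, C -> E, all variables binary.
# Each conditional probability table (CPT) is a dict; numbers are made up.
P_A = {0: 0.6, 1: 0.4}                                        # P(A=a)
P_B = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}    # P(B=b | A=a), key (b, a)
P_C = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.9}    # P(C=c | A=a), key (c, a)
P_D = {(0, 0, 0): 0.9, (1, 0, 0): 0.1, (0, 0, 1): 0.4, (1, 0, 1): 0.6,
       (0, 1, 0): 0.3, (1, 1, 0): 0.7, (0, 1, 1): 0.2, (1, 1, 1): 0.8}  # P(D=d | B=b, C=c), key (d, b, c)
P_E = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}    # P(E=e | C=c), key (e, c)

def joint(a, b, c, d, e):
    """Joint probability as the product of each node's local conditional."""
    return P_A[a] * P_B[(b, a)] * P_C[(c, a)] * P_D[(d, b, c)] * P_E[(e, c)]

# Sanity check: because every CPT column sums to 1, the factorized joint
# sums to 1 over all 2^5 complete assignments.
total = sum(joint(*x) for x in product([0, 1], repeat=5))
print(total)
```

The point of the sketch is the modularity the slides emphasize: the full joint over five binary variables has 31 free parameters, while this factorization needs only 1 + 2 + 2 + 4 + 2 = 11 local parameters.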