Post on 05-Jan-2016
1
Chapter 14
Probabilistic Reasoning
2
Outline
• Representing Knowledge in an Uncertain Domain
• The Semantics of Bayesian Networks
• Efficient Representation of Conditional Distributions
• Exact Inference in Bayesian Networks
• Approximate Inference in Bayesian Networks
• Extending Probability to FOL Representations
• Other Approaches to Uncertain Reasoning
3
14-1 Representing Knowledge in an Uncertain Domain
• Full joint probability distribution
– can answer any question about the domain, but can become intractably large as the number of variables grows.
– Furthermore, specifying probabilities for atomic events is rather unnatural and can be very difficult unless a large amount of data is available from which to gather statistical estimates.
• We also saw that independence and conditional independence relationships among variables can greatly reduce the number of probabilities that need to be specified in order to define the full joint distribution.
• This section introduces a data structure called a Bayesian network to represent the dependencies among variables and to give a concise specification of any full joint probability distribution.
4
Definition
A Bayesian network is a directed acyclic graph (DAG) which consists of:
• A set of random variables which makes up the nodes of the network.
• A set of directed links (arrows) connecting pairs of nodes. If there is an arrow from node X to node Y, X is said to be a parent of Y.
• Each node Xi has a conditional probability distribution P(Xi|Parents(Xi)) that quantifies the effect of the parents on the node.
Intuitions:
• A Bayesian network models our incomplete understanding of the causal relationships in an application domain.
• A node represents some state of affairs or event.
• A link from X to Y means that X has a direct influence on Y.
5
What are Bayesian Networks?
• Graphical notation for conditional independence assertions
• Compact specification of full joint distributions
• What do they look like?
– Set of nodes, one per variable
– Directed, acyclic graph
– Conditional distribution for each node given its parents: P(Xi|Parents(Xi))
6
Example (Fig. 14.1)
• Weather is independent of the other variables
• Toothache and Catch are conditionally independent given Cavity
[Figure 14.1: Weather stands alone; Cavity is the cause, with Toothache and Catch as its effects]
7
Bayesian Network
Notice that Cavity is the "cause" of both Toothache and PCatch; the links represent these causal relationships explicitly.
Give the prior probability distribution of Cavity and the conditional probability tables of Toothache and PCatch:

P(Cavity) = 0.2

P(Toothache|Cavity) = 0.6    P(Toothache|¬Cavity) = 0.1
P(PCatch|Cavity)   = 0.9    P(PCatch|¬Cavity)   = 0.02

5 probabilities, instead of 7

P(c∧t∧pc) = P(t∧pc|c) P(c) = P(t|c) P(pc|c) P(c)
8
Another Example
Sample Domain:
• You have a burglar alarm installed in your home. It is fairly reliable at detecting a burglary, but also responds on occasion to minor earthquakes.
• You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm.
• John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm and calls then, too.
• Mary, on the other hand, likes rather loud music and sometimes misses the alarm altogether.
9
Example
• I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a burglar?
• What are the variables?
– Burglary
– Earthquake
– Alarm
– JohnCalls
– MaryCalls
10
Another Example (continued)
• Network topology reflects causal knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
• Assumptions:
– they do not perceive any burglaries directly,
– they do not notice the minor earthquakes, and
– they do not confer before calling.
11
Another example (Fig.14.2)
[Figure 14.2: Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls]
12
Conditional Probability Table (CPT)
• Each distribution is shown as a conditional probability table, or CPT.
• A CPT can be used for discrete variables; other representations exist, including ones suitable for continuous variables.
• Each row in a CPT contains the conditional probability of each node value for a conditioning case.
• A conditioning case is just a possible combination of values for the parent nodes—a miniature atomic event, if you like.
• Each row must sum to 1, because the entries represent an exhaustive set of cases for the variable.
• For Boolean variables, once you know that the probability of a true value is p, the probability of false must be 1 – p, so we often omit the second number, as in Figure 14.2.
• In general, a table for a Boolean variable with k Boolean parents contains 2^k independently specifiable probabilities.
• A node with no parents has only one row, representing the prior probabilities of each possible value of the variable.
13
Another example (Fig.14.2)
P(B) = .001        P(E) = .002

B  E  | P(A|B,E)
T  T  |   .95
T  F  |   .94
F  T  |   .29
F  F  |   .001

A | P(J|A)         A | P(M|A)
T |  .90           T |  .70
F |  .05           F |  .01

[Network: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls]
14
Compactness
• Conditional Probability Table (CPT): distribution over Xi for each combination of parent values
• Each row requires one number p for Xi = true (since the false case is just 1 – p)
• A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values
• Full network requires O(n · 2^k) numbers (instead of O(2^n))
15
14.2 Semantics of Bayesian Networks
• Global semantics: representing the full joint distribution; helpful in understanding how to construct networks
• Local semantics: representing conditional independence; helpful in designing inference procedures
16
Global Semantics
• Global semantics defines the full joint distribution as the product of the local conditional distributions:
  P(X1=x1, X2=x2, …, Xn=xn) = ∏_{i=1..n} P(xi | parents(Xi))
– where parents(Xi) denotes the specific values of the variables in Parents(Xi).
– Thus, each entry in the joint distribution is represented by the product of the appropriate elements of the conditional probability tables (CPTs) in the Bayesian network.
– The CPTs therefore provide a decomposed representation of the joint distribution.
• Example: What is P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)?
  = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
  = 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.00062
17
But does a BN represent a belief state? In other words, can we compute the full joint distribution of the propositions from it?
18
Calculation of Joint Probability

P(B) = 0.001       P(E) = 0.002

B  E  | P(A|B,E)
T  T  |   0.95
T  F  |   0.94
F  T  |   0.29
F  F  |   0.001

A | P(J|A)         A | P(M|A)
T |  0.90          T |  0.70
F |  0.05          F |  0.01

[Network: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls]

P(J∧M∧A∧¬B∧¬E) = ??
19
P(J∧M∧A∧¬B∧¬E)
= P(J∧M | A,¬B,¬E) P(A∧¬B∧¬E)
= P(J | A,¬B,¬E) P(M | A,¬B,¬E) P(A∧¬B∧¬E)   (J and M are independent given A)
P(J | A,¬B,¬E) = P(J|A)   (J and B, E are independent given A)
P(M | A,¬B,¬E) = P(M|A)
P(A∧¬B∧¬E) = P(A|¬B,¬E) P(¬B|¬E) P(¬E)
           = P(A|¬B,¬E) P(¬B) P(¬E)   (B and E are independent)
P(J∧M∧A∧¬B∧¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
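The product can be checked with a few lines of arithmetic, using the CPT values from Figure 14.2:

```python
# P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
p_j_given_a = 0.90
p_m_given_a = 0.70
p_a_given_not_b_not_e = 0.001
p_not_b = 1 - 0.001
p_not_e = 1 - 0.002

p = p_j_given_a * p_m_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(p)  # about 0.00063 before rounding down to 0.00062
```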
20
Calculation of Joint Probability (continued)

P(J∧M∧A∧¬B∧¬E)
= P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00062
21
In general:
P(x1∧x2∧…∧xn) = ∏_{i=1..n} P(xi | parents(Xi))
(the full joint distribution table)
22
Since a BN defines the full joint distribution of a set of propositions, it represents a belief state.
23
24
Chain Rule
25
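The chain rule behind the network construction expresses any joint distribution as a product of conditionals; the second equality holds exactly when each node's parents screen it off from its other predecessors:

```latex
P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{i-1}, \dots, x_1)
                   = \prod_{i=1}^{n} P(x_i \mid \mathit{parents}(X_i)),
\quad \text{provided } \mathit{Parents}(X_i) \subseteq \{X_{i-1}, \dots, X_1\}.
```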
• We need to choose parents for each node such that this property holds. Intuitively, the parents of node Xi should contain all those nodes in X1, ..., Xi−1 that directly influence Xi.
• For example, suppose we have completed the network in Figure 14.2 except for the choice of parents for MaryCalls. MaryCalls is certainly influenced by whether there is a Burglary or an Earthquake, but not directly. Intuitively, our knowledge of the domain tells us that these events influence Mary's calling behavior only through their effect on the alarm.
• Also, given the state of the alarm, whether John calls has no influence on Mary's calling. Formally speaking, we believe that the following conditional independence statement holds:
• P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm).
26
Constructing Bayesian Networks (cont.)
• Direct influencers should be added to the network first
• The correct order in which to add nodes is to add “root causes” first, then the variables they influence, and so on.
• We need to choose parents for each node such that this property holds. Intuitively, the parents of node Xi should contain all those nodes in X1, X2, ... , Xi-1 that directly influence Xi.
• If we don’t follow these rules, we can end up with a very complicated network.
27
Constructing Bayesian Networks
P(MaryCalls|JohnCalls,Alarm,Earthquake,Burglary) = P(MaryCalls|Alarm)
28
Constructing Bayesian Networks (cont.)
MaryCalls
P(J|M)=P(J)?
Chosen order: M,J,A,B,E
29
Constructing Bayesian Networks (cont.)
JohnCalls
MaryCalls
P(J|M) = P(J)?  No!
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)?
Chosen order: M,J,A,B,E
30
Constructing Bayesian Networks (cont.)
Alarm
JohnCalls
MaryCalls
P(J|M) = P(J)?  No!
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)?  No.
P(B|A,J,M) = P(B|A)?  P(B|A,J,M) = P(B)?
Chosen order: M,J,A,B,E
31
Constructing Bayesian Networks (cont.)
Burglary
Alarm
JohnCalls
MaryCalls
P(J|M) = P(J)?  No!
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)?  No.
P(B|A,J,M) = P(B|A)?  Yes!  P(B|A,J,M) = P(B)?  No!
P(E|B,A,J,M) = P(E|A)?  P(E|B,A,J,M) = P(E|A,B)?
Chosen order: M,J,A,B,E
32
Constructing Bayesian Networks (cont.)
Alarm
JohnCalls
Burglary
Earthquake
MaryCalls
P(J|M) = P(J)?  No!
P(A|J,M) = P(A|J)?  P(A|J,M) = P(A)?  No.
P(B|A,J,M) = P(B|A)?  Yes!  P(B|A,J,M) = P(B)?  No!
P(E|B,A,J,M) = P(E|A)?  No!  P(E|B,A,J,M) = P(E|A,B)?  Yes!
Chosen order: M,J,A,B,E
33
Bad example
34
Constructing Bayesian Networks (cont.)
[Figure: the network produced by a bad node ordering over Earthquake, JohnCalls, Burglary, Alarm, MaryCalls, with many extra links]
35
Local Semantics
• Local Semantics: Each node is conditionally independent of its nondescendants given its parents
36
Markov Blanket
• A node is conditionally independent of all other nodes in the network, given its parents, children, and children's parents—that is, given its Markov blanket.
37
14-3 Efficient Representation of Conditional Distributions
• Even if the maximum number of parents k is smallish, filling in the CPT for a node requires up to O(2^k) numbers and perhaps a great deal of experience with all the possible conditioning cases.
• In fact, this is a worst-case scenario in which the relationship between the parents and the child is completely arbitrary.
• Usually, such relationships are describable by a canonical distribution that fits some standard pattern.
• In such cases, the complete table can be specified by naming the pattern and perhaps supplying a few parameters—much easier than supplying an exponential number of parameters.
38
Deterministic nodes
• A deterministic node has its value specified exactly by the values of its parents, with no uncertainty.
– The relationship can be a logical one:
  • for example, the relationship between the parent nodes Canadian, US, Mexican and the child node NorthAmerican is simply that the child is the disjunction of the parents.
– The relationship can also be numerical:
  • for example, if the parent nodes are the prices of a particular model of car at several dealers, and the child node is the price that a bargain hunter ends up paying, then the child node is the minimum of the parent values; or
  • if the parent nodes are the inflows (rivers, runoff, precipitation) into a lake and the outflows (rivers, evaporation, seepage) from the lake and the child is the change in the water level of the lake, then the value of the child is the difference between the inflow parents and the outflow parents.
39
Efficient representation of PDs
41
Noisy-OR relation
• The standard example is the noisy-OR relation, which is a generalization of the logical OR.
• In propositional logic, we might say that Fever is true if and only if Cold, Flu( 流行性感冒 ), or Malaria( 瘧疾 ) is true.
• The noisy-OR model allows for uncertainty about the ability of each parent to cause the child to be true—the causal relationship between parent and child may be inhibited, and so a patient could have a cold, but not exhibit a fever.
42
Noisy-OR relation
• The model makes two assumptions. – First, it assumes that all the possible causes
are listed. (This is not as strict as it seems, because we can always add a so-called leak node that covers "miscellaneous causes.")
– Second, it assumes that inhibition of each parent is independent of inhibition of any other parents:
• for example, whatever inhibits Malaria from causing a fever is independent of whatever inhibits Flu from causing a fever.
• Fever is false if and only if all its true parents are inhibited, and the probability of this is the product of the inhibition probabilities for each parent.
43
Example• Let us suppose these individual inhibition probabilities
(or noisy paramaters) are as follows: – P(fever |cold, flu , malaria) = 0.6 , [P(fever |cold) = 0.4 ], – P( fever | cold , flu, malaria) = 0.2 , [P(fever|flu) = 0.8 ], – P( fever | cold , flu, malaria) = 0.1 . [P(fever|malaria) = 0.9 ],
• Then, from this information and the noisy-OR assumptions, the entire CPT can be built.
O(k)
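A minimal sketch of building the full Fever CPT from the three inhibition probabilities above, showing the O(k) parameter count in action (variable names are illustrative):

```python
from itertools import product

# Inhibition probabilities: P(no fever | only that cause is present)
inhibit = {"cold": 0.6, "flu": 0.2, "malaria": 0.1}

def p_fever(causes_present):
    """Noisy-OR: fever fails only if every present cause is inhibited."""
    q = 1.0
    for cause, present in causes_present.items():
        if present:
            q *= inhibit[cause]   # each active cause is independently inhibited
    return 1.0 - q

# Build the whole 2^3-row CPT from just k = 3 parameters
for values in product([True, False], repeat=3):
    row = dict(zip(inhibit, values))
    print(row, round(p_fever(row), 3))
```

With all three causes present this gives 1 − 0.6 × 0.2 × 0.1 = 0.988, and with only a cold it gives 1 − 0.6 = 0.4, matching the bracketed values on the slide.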
44
45
46
Bayesian nets with continuous variables
• Many real-world problems involve continuous quantities.
• Much of statistics deals with random variables whose domains are continuous. By definition, continuous variables have an infinite number of possible values, so it is impossible to specify conditional probabilities explicitly for each value.
• One way to handle continuous variables is to avoid them by using discretization, dividing up the possible values into a fixed set of intervals.
• Discretization is sometimes an adequate solution, but often results in a considerable loss of accuracy and very large CPTs.
47
cont. variables
• The other solution is to define standard families of probability density functions (see Appendix A) that are specified by a finite number of parameters.
– For example, a Gaussian (or normal) distribution N(µ, σ²)(x) has the mean µ and the variance σ² as parameters.
• A network with both discrete and continuous variables is called a hybrid Bayesian network. We need:
– the conditional distribution for a continuous variable given discrete or continuous parents: P(C|C) or P(C|D);
– the conditional distribution for a discrete variable given continuous parents: P(D|C).
48
49
Hybrid (discrete + continuous) networks
Discrete (Subsidy? and Buys?);Continuous (Harvest and Cost)
• How to deal with this?
50
Probability density functions
• Used instead of probability distributions for continuous variables
• Ex.: let X denote tomorrow's maximum temperature in the summer in Eindhoven. Belief that X is distributed uniformly between 18 and 26 degrees Celsius:
  P(X = x) = U[18,26](x)
  P(X = 20.5) = U[18,26](20.5) = 0.125/C
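The density value can be sanity-checked in one line; a sketch using the interval endpoints from the example:

```python
def uniform_pdf(x, a=18.0, b=26.0):
    # Density of U[a, b]: constant 1/(b - a) inside the interval, 0 outside
    return 1.0 / (b - a) if a <= x <= b else 0.0

print(uniform_pdf(20.5))  # 0.125 (per degree Celsius)
print(uniform_pdf(30.0))  # 0.0, outside [18, 26]
```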
51
52
CDF
53
Normal PDF
55
Hybrid (discrete + continuous) networks
Discrete (Subsidy? and Buys?);Continuous (Harvest and Cost)
• Option 1: discretization – possibly large errors, large CPTs
• Option 2: finitely parameterized canonical families
a) Continuous variable, discrete + continuous parents (e.g., Cost)
b) Discrete variable, continuous parents (e.g., Buys?)
56
a) Continuous child variables
• Need one conditional density function for child variable given continuous parents, for each possible assignment to discrete parents
• Most common is the linear Gaussian model, e.g.:
• Mean Cost varies linearly w. Harvest, variance is fixed
• Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow
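The linear Gaussian model referred to above can be written as follows; the subscript t distinguishes the discrete-parent case (here, Subsidy = true), and the coefficients a, b and variance σ² are parameters supplied per case:

```latex
P(c \mid h, \mathit{subsidy}) = N(a_t h + b_t, \sigma_t^2)(c)
= \frac{1}{\sigma_t \sqrt{2\pi}}
  \exp\!\left( -\frac{1}{2} \left( \frac{c - (a_t h + b_t)}{\sigma_t} \right)^{\!2} \right)
```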
57
Continuous child variables – ex.
• All-continuous network with LG distributions ⇒ full joint is a multivariate Gaussian
• Discrete + continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values
58
b) Discrete child, continuous parent
• P(buys | Cost = c) = Φ((−c + µ) / σ)
• with µ the cost threshold for buying
• Probit distribution:
– Φ is the integral (CDF) of the standard normal distribution:
  Φ(x) = ∫_{−∞}^{x} N(0,1)(t) dt
• Logit distribution:
– uses the sigmoid function:
  P(buys | Cost = c) = 1 / (1 + e^(−2x)), where x = (−c + µ) / σ
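The two response curves can be compared numerically; a sketch with an assumed, purely illustrative threshold µ = 600 and spread σ = 100:

```python
import math

MU, SIGMA = 600.0, 100.0  # hypothetical buying threshold and spread

def probit_buys(cost):
    # Phi((-c + mu) / sigma): standard normal CDF, via the error function
    x = (-cost + MU) / SIGMA
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logit_buys(cost):
    # Sigmoid 1 / (1 + e^(-2x)); has longer tails than the probit
    x = (-cost + MU) / SIGMA
    return 1.0 / (1.0 + math.exp(-2.0 * x))

for cost in (400.0, 600.0, 800.0):
    print(cost, round(probit_buys(cost), 3), round(logit_buys(cost), 3))
```

Both curves pass through 0.5 at the threshold; the logit falls off more slowly, which is the usual reason for preferring one or the other.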
59
Probit distribution
60
14-4 Exact inference in Bayesian Networks
• Let us use the following notation:
– X denotes the query variable
– E denotes the set of evidence variables E1, …, Em; e is a particular observed event
– Y denotes the set of nonevidence (hidden) variables Y1, …, Yl
– The complete set of variables is X = {X} ∪ E ∪ Y.
• A typical query asks for the posterior probability distribution P(X|e).
• P(Burglary | JohnCalls = true, MaryCalls = true) = <0.284, 0.716>
61
Inference by enumeration
• Any conditional probability can be computed by summing terms from the full joint distribution.
• A query P(X|e) can be answered using Equation (13.6).
• A Bayesian network gives a complete representation of the full joint distribution.
• The terms P(x, e, y) in the joint distribution can be written as products of conditional probabilities from the network.
• Therefore, a query can be answered using a Bayesian network by computing sums of products of conditional probabilities from the network.
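As a sketch, the burglary query from the previous slide can be answered by brute-force enumeration over the hidden variables Alarm and Earthquake, using the CPT values from Figure 14.2:

```python
# CPTs from Fig. 14.2: probabilities of each variable being true
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=true | A)

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as a product of CPT entries (global semantics)."""
    pa = P_A[(b, e)]
    return (P_B[b] * P_E[e]
            * (pa if a else 1 - pa)
            * (P_J[a] if j else 1 - P_J[a])
            * (P_M[a] if m else 1 - P_M[a]))

def query_burglary(j, m):
    """P(Burglary | JohnCalls=j, MaryCalls=m) by summing out E and A."""
    dist = {}
    for b in (True, False):
        dist[b] = sum(joint(b, e, a, j, m)
                      for e in (True, False) for a in (True, False))
    z = sum(dist.values())
    return {b: p / z for b, p in dist.items()}

print(query_burglary(True, True))  # P(Burglary=true | j, m) ≈ 0.284
```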
62
63
13-4 Enumerate-Joint-Ask
64
65
Computation
Hidden variables
66
Improvement
74
Variable elimination algorithm
The enumeration algorithm can be improved substantially by eliminating repeated calculations of the kind illustrated in Figure 14.8.
The idea is simple: do the calculation once and save the results for later use. This is a form of dynamic programming. There are several versions of this approach; we present the variable elimination algorithm, which is the simplest.
Variable elimination works by evaluating expressions such as Equation (14.3) in right-to-left order (that is, bottom-up in Figure 14.8). Intermediate results are stored, and summations over each variable are done only for those portions of the expression that depend on the variable.
75
76
Pointwise product
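The pointwise product combines two factors into a factor over the union of their variables; a minimal sketch with dict-based factors (the variable names and table values here are illustrative, not from the text):

```python
from itertools import product

def pointwise_product(f1, vars1, f2, vars2):
    """Multiply factors given as dicts mapping value tuples to numbers."""
    all_vars = list(dict.fromkeys(vars1 + vars2))  # union, order preserved
    result = {}
    for assignment in product([True, False], repeat=len(all_vars)):
        env = dict(zip(all_vars, assignment))
        k1 = tuple(env[v] for v in vars1)
        k2 = tuple(env[v] for v in vars2)
        result[assignment] = f1[k1] * f2[k2]
    return result, all_vars

# f(A,B) * g(B,C) -> h(A,B,C), e.g. h(T,T,T) = f(T,T) * g(T,T)
f = {(True, True): 0.3, (True, False): 0.7,
     (False, True): 0.9, (False, False): 0.1}
g = {(True, True): 0.2, (True, False): 0.8,
     (False, True): 0.6, (False, False): 0.4}
h, h_vars = pointwise_product(f, ["A", "B"], g, ["B", "C"])
print(h_vars, h[(True, True, True)])  # shared variable B appears once
```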
77
78
79
Elimination-ASK
80
New evidence E indicates that JohnCalls with some probability p.
We would like to know the posterior probability of the other beliefs, e.g. P(Burglary|E).
P(B|E) = P(B∧J|E) + P(B∧¬J|E)
       = P(B|J,E) P(J|E) + P(B|¬J,E) P(¬J|E)
       = P(B|J) P(J|E) + P(B|¬J) P(¬J|E)
       = p P(B|J) + (1−p) P(B|¬J)
We need to compute P(B|J) and P(B|¬J).
Querying the BN
81
The BN gives P(T|C). What about P(C|T)?
P(Cavity|t) = P(Cavity ∧ t) / P(t)
            = P(t|Cavity) P(Cavity) / P(t)   [Bayes' rule]
P(c|t) = α P(t|c) P(c)
Querying a BN is just applying the trivial Bayes' rule on a larger scale.
Querying the BN
[Figure: Cavity → Toothache]
P(C) = 0.1
C | P(T|C)
T |  0.4
F |  0.01111
82
P(b|j) = α P(b∧j) = α Σm Σa Σe P(b∧j∧m∧a∧e)   [marginalization]
       = α Σm Σa Σe P(b) P(e) P(a|b,e) P(j|a) P(m|a)   [BN]
       = α P(b) Σe P(e) Σa P(a|b,e) P(j|a) Σm P(m|a)   [reordering]
Depth-first evaluation of P(b|j) leads to computing each of the 4 following products twice:
P(j|a) P(m|a),  P(j|a) P(¬m|a),  P(j|¬a) P(m|¬a),  P(j|¬a) P(¬m|¬a)
Bottom-up (right-to-left) computation with caching – e.g., the variable elimination algorithm (see R&N) – avoids such repetition.
For singly connected BNs, the computation takes time linear in the total number of CPT entries (time linear in the number of propositions if each CPT's size is bounded).
Querying the BN
83
Singly Connected BN
A BN is singly connected (or polytree) if there is at most one undirected path between any two nodes
[Figure: the burglary network (Burglary, Earthquake → Alarm → JohnCalls, MaryCalls) is singly connected]
The time and space complexity of exact inference in polytrees is linear in the size of the network (the number of CPT entries).
84
Multiply Connected BN
A BN is multiply connected if there is more than one undirected path between a pair of nodes
[Figure: a multiply connected network, e.g., Cloudy → Sprinkler and Cloudy → Rain, with both Sprinkler and Rain feeding WetGrass]
Variable elimination can have exponential time and space complexity in the worst case, even when the number of parents per node is bounded. Since it includes inference in propositional logic as a special case, inference in Bayesian networks is NP-hard.
85
Querying a multiply connected BN takes time exponential in the total number of CPT entries in the worst case.
86
Clustering algorithm
• Join tree algorithm: O(n) inference once the network is clustered
• Widely used in commercial network tools.
• Join individual nodes of the network to form cluster nodes in such a way that the resulting network is a polytree.
• Once the network is in polytree form, a special-purpose inference algorithm is applied. Essentially, the algorithm is a form of constraint propagation (see Chapter 5) where the constraints ensure that neighboring clusters agree on the posterior probability of any variables that they have in common.
87
Clustering algorithm
88
14-5 Approximate inference in Bayesian Networks
• Given the intractability of exact inference in large, multiply connected networks, it is essential to consider approximate inference methods.
• This section describes randomized sampling algorithms, also called Monte Carlo algorithms, that provide approximate answers whose accuracy depends on the number of samples generated.
• In recent years, Monte Carlo algorithms have become widely used in computer science to estimate quantities that are difficult to calculate exactly. For example, the simulated annealing algorithm.
• We describe two families of algorithms:
– direct sampling and
– Markov chain sampling.
– (Variational methods and loopy propagation are skipped.)
89
Methods
i. Sampling from an empty network
ii. Rejection sampling: reject samples disagreeing with the evidence
iii. Likelihood weighting: use evidence to weight samples
iv. MCMC: sample from a stochastic process whose stationary distribution is the true posterior
90
Introduction
• The primitive element in any sampling algorithm is the generation of samples from a known probability distribution.
• For example, an unbiased coin can be thought of as a random variable Coin with values (heads, tails) and a prior distribution P(Coin) = (0.5, 0.5).
• Sampling from this distribution is exactly like flipping the coin: with probability 0.5 it will return heads, and with probability 0.5 it will return tails.
• Given a source of random numbers in the range [0, 1], it is a simple matter to sample any distribution on a single variable.
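A sketch of that single-variable sampler, using one uniform draw and a cumulative sum:

```python
import random

def sample_discrete(dist):
    """Sample a value from a {value: probability} dict with one uniform draw."""
    r = random.random()          # uniform in [0, 1)
    cumulative = 0.0
    for value, p in dist.items():
        cumulative += p
        if r < cumulative:
            return value
    return value                 # guard against floating-point round-off

counts = {"heads": 0, "tails": 0}
for _ in range(10000):
    counts[sample_discrete({"heads": 0.5, "tails": 0.5})] += 1
print(counts)  # roughly 5000 each
```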
91
Sampling on Bayesian Network
• The simplest kind of random sampling process for Bayesian networks generates events from a network that has no evidence associated with it.
• The idea is to sample each variable in turn, in topological order.
• The probability distribution from which the value is sampled is conditioned on the values already assigned to the variable's parents.
92
93
Prior-sample
• This algorithm is shown in Figure 14.12. We can illustrate its operation on the network in Figure 14.11(a), assuming an ordering [Cloudy, Sprinkler, Rain, WetGrass]:
• Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.
• Sample from P(Sprinkler | Cloudy = true) = <0.1, 0.9>; suppose this returns false.
• Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns true.
• Sample from P(WetGrass | Sprinkler = false, Rain = true) = <0.9, 0.1>; suppose this returns true.
• PRIOR-SAMPLE returns the event [true, false, true, true].
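A sketch of PRIOR-SAMPLE specialized to this network; the CPT entries not quoted on the slide (the other WetGrass rows and the ¬Cloudy cases) are the standard values for this example and should be checked against Figure 14.11(a):

```python
import random

def prior_sample():
    """One topologically ordered sample from the sprinkler network."""
    cloudy = random.random() < 0.5
    sprinkler = random.random() < (0.1 if cloudy else 0.5)
    rain = random.random() < (0.8 if cloudy else 0.2)
    p_wet = {(True, True): 0.99, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.0}
    wet = random.random() < p_wet[(sprinkler, rain)]
    return cloudy, sprinkler, rain, wet

# Estimate P(Rain) from many samples; the exact prior is 0.5*0.8 + 0.5*0.2 = 0.5
n = 20000
rainy = sum(prior_sample()[2] for _ in range(n))
print(rainy / n)  # close to 0.5
```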
94
i. Sampling from an empty network (cont.)
• Probability that PRIOR-SAMPLE generates a particular event:
  S_PS(x1, …, xn) = ∏_{i=1..n} P(xi | parents(Xi)) = P(x1, …, xn)
• Let N_PS(Y=y) be the number of samples generated for which Y=y, for any set of variables Y.
• Then the estimate is P'(Y=y) = N_PS(Y=y)/N, and
• lim N→∞ P'(Y=y) = Σh S_PS(Y=y, H=h) = Σh P(Y=y, H=h) = P(Y=y)
⇒ estimates derived from PRIOR-SAMPLE are consistent
95
ii. Rejection sampling
• Rejection sampling is a general method for producing samples from a hard-to-sample distribution given an easy-to-sample distribution.
• In its simplest form, it can be used to compute conditional probabilities—that is, to determine P(X|e).
• REJECTION-SAMPLING algorithm:
– First, generate samples from the prior distribution specified by the network.
– Then, reject all those that do not match the evidence.
– Finally, the estimate P'(X = x | e) is obtained by counting how often X = x occurs in the remaining samples.
96
Rejection-sampling algorithm
97
Example
• Assume that we wish to estimate P(Rain | Sprinkler = true), using 100 samples.
• Of the 100 that we generate, suppose that 73 have Sprinkler = false and are rejected, while 27 have Sprinkler = true; of the 27, 8 have Rain = true and 19 have Rain = false.
• P'(Rain | Sprinkler = true) ≈ NORMALIZE(<8, 19>) = <0.296, 0.704>.
• The true answer is <0.3, 0.7>.
• As more samples are collected, the estimate will converge to the true answer. The standard deviation of the error in each probability will be proportional to 1/sqrt(n), where n is the number of samples used in the estimate.
• The biggest problem with rejection sampling is that it rejects so many samples! The fraction of samples consistent with the evidence e drops exponentially as the number of evidence variables grows, so the procedure is simply unusable for complex problems.
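A sketch of the estimate above, scaled up to many samples; the network CPTs are those of the sprinkler example, and the true value is 0.3:

```python
import random

def prior_sample():
    # Cloudy -> Sprinkler, Rain (WetGrass is not needed for this query)
    cloudy = random.random() < 0.5
    sprinkler = random.random() < (0.1 if cloudy else 0.5)
    rain = random.random() < (0.8 if cloudy else 0.2)
    return cloudy, sprinkler, rain

def rejection_estimate(n=100000):
    """Estimate P(Rain=true | Sprinkler=true) by discarding mismatches."""
    kept = []
    for _ in range(n):
        _, sprinkler, rain = prior_sample()
        if sprinkler:            # keep only samples matching the evidence
            kept.append(rain)
    return sum(kept) / len(kept)

print(rejection_estimate())  # close to 0.3
```

Note that only about 30% of the samples survive here with a single evidence variable; with many evidence variables the surviving fraction collapses, which is exactly the weakness the slide describes.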
98
iii. Likelihood weighting
• Avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e.
• Generates consistent probability estimates.
• Fixes the values of the evidence variables E and samples only the remaining variables X and Y. This guarantees that each event generated is consistent with the evidence.
• Before tallying the counts in the distribution for the query variable, each event is weighted by the likelihood that the event accords with the evidence, as measured by the product of the conditional probabilities for each evidence variable, given its parents.
• Intuitively, events in which the actual evidence appears unlikely should be given less weight.
99
Example
• Query P(Rain | Sprinkler = true, WetGrass = true).
• First, the weight w is set to 1.0. Then an event is generated:
• Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.
• Sprinkler is an evidence variable with value true. Therefore, we set
  w ← w × P(Sprinkler = true | Cloudy = true) = 0.1.
• Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns true.
• WetGrass is an evidence variable with value true. Therefore, we set
  w ← w × P(WetGrass = true | Sprinkler = true, Rain = true) = 0.099.
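The per-event weighting above can be repeated and normalized to get the posterior; a sketch using the same sprinkler-network CPT values:

```python
import random

# Likelihood weighting for P(Rain | Sprinkler=true, WetGrass=true)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}

def weighted_sample():
    """Sample nonevidence variables; weight by the evidence likelihoods."""
    w = 1.0
    cloudy = random.random() < 0.5
    w *= 0.1 if cloudy else 0.5              # evidence: Sprinkler = true
    rain = random.random() < (0.8 if cloudy else 0.2)
    w *= P_WET[(True, rain)]                 # evidence: WetGrass = true
    return rain, w

def lw_estimate(n=100000):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        rain, w = weighted_sample()
        totals[rain] += w                    # tally weights, not counts
    return totals[True] / (totals[True] + totals[False])

print(lw_estimate())  # close to 0.32
```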
100
101
iii. Likelihood weighting analysis
• Sampling probability for WEIGHTED-SAMPLE is
  S_WS(y, e) = ∏_{i=1..l} P(yi | parents(Yi))
• Note: it pays attention to evidence in ancestors only, so S_WS lies somewhere "in between" the prior and the posterior distribution.
• The weight for a given sample y, e is
  w(y, e) = ∏_{i=1..m} P(ei | parents(Ei))
• The weighted sampling probability is
  S_WS(y, e) w(y, e) = ∏_{i=1..l} P(yi | parents(Yi)) × ∏_{i=1..m} P(ei | parents(Ei)) = P(y, e)
  (by the standard global semantics of the network)
• Hence, likelihood weighting is consistent.
• But performance still degrades with many evidence variables.
102
iv. MCMC Example
• Estimate P(Rain | Sprinkler = true, WetGrass = true)
• Sample Cloudy, then Rain; repeat.
– Markov blanket of Cloudy is Sprinkler and Rain.
– Markov blanket of Rain is Cloudy, Sprinkler and WetGrass.
103
iv. MCMC Example (cont.)
0. Random initial state: Cloudy = true and Rain = false
1. P(Cloudy | MB(Cloudy)) = P(Cloudy | Sprinkler, Rain); sample → false
2. P(Rain | MB(Rain)) = P(Rain | Cloudy, Sprinkler, WetGrass); sample → true
Visit 100 states: 31 have Rain = true, 69 have Rain = false.
P'(Rain | Sprinkler = true, WetGrass = true) = NORMALIZE(<31, 69>) = <0.31, 0.69>
104
Probability of x, given MB(X)
P(x | mb(X)) = α P(x | parents(X)) × ∏_{Yj ∈ Children(X)} P(yj | parents(Yj))
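A sketch of Gibbs sampling for this query: each nonevidence variable is resampled from its distribution conditioned on its Markov blanket, using the Markov blanket formula above with the sprinkler-network CPT values:

```python
import random

P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.0}

def p_cloudy_given_mb(sprinkler, rain):
    # P(c | s, r) is proportional to P(c) * P(s|c) * P(r|c)
    num = {}
    for c in (True, False):
        ps = 0.1 if c else 0.5
        pr = 0.8 if c else 0.2
        num[c] = 0.5 * (ps if sprinkler else 1 - ps) * (pr if rain else 1 - pr)
    return num[True] / (num[True] + num[False])

def p_rain_given_mb(cloudy, sprinkler, wet):
    # P(r | c, s, w) is proportional to P(r|c) * P(w|s,r)
    num = {}
    for r in (True, False):
        pr = 0.8 if cloudy else 0.2
        pw = P_WET[(sprinkler, r)]
        num[r] = (pr if r else 1 - pr) * (pw if wet else 1 - pw)
    return num[True] / (num[True] + num[False])

def gibbs_rain(n=50000):
    """P(Rain | Sprinkler=true, WetGrass=true) by counting visited states."""
    cloudy, rain = True, False          # arbitrary initial state
    count = 0
    for _ in range(n):
        cloudy = random.random() < p_cloudy_given_mb(True, rain)
        rain = random.random() < p_rain_given_mb(cloudy, True, True)
        count += rain
    return count / n

print(gibbs_rain())  # close to 0.32
```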
105
MCMC algorithm
106
Performance of statistical algorithms
• Polytime approximation• Stochastic approximation techniques
such as likelihood weighting and MCMC – can give reasonable estimates of true pos
terior probabilities in a network, and – can cope with much larger networks
107
14-6 Skip
108
14-7 Other approaches to uncertain reasoning
• Different generations of expert systems:
– Strict logic reasoning (ignores uncertainty)
– Probabilistic techniques using the full joint distribution
– Default reasoning: a conclusion is believed until a better reason is found to believe something else
– Rules with certainty factors
– Handling ignorance: Dempster–Shafer theory
– Vagueness: something is sort of true (fuzzy logic)
• Probability makes the same ontological commitment as logic: the event is true or false.
109
Default reasoning
• The four-wheel car conclusion is reached by default.
• New evidence can cause the conclusion to be retracted, whereas FOL is strictly monotonic.
• Representatives are default logic, nonmonotonic logic, circumscription
• There are problematic issues
110
Rule-based methods
• Logical reasoning systems have properties like:
– Monotonicity
– Locality
  • In logical systems, whenever we have a rule of the form A ⇒ B, we can conclude B, given evidence A, without worrying about any other rules.
– Detachment
  • Once a logical proof is found for a proposition B, the proposition can be used regardless of how it was derived. That is, it can be detached from its justification.
– Truth-functionality
  • In logic, the truth of complex sentences can be computed from the truth of the components.
111
Rule-based method
• These properties give obvious computational advantages;
• but they are inappropriate for uncertain reasoning.
112
Representing ignorance: Dempster–Shafer theory
• The Dempster–Shafer theory is designed to deal with the distinction between uncertainty and ignorance.
• Rather than computing the probability of a proposition, it computes the probability that the evidence supports the proposition.
• This measure of belief is called a belief function, written Bel(X).
113
Example
• Coin flipping as an example of belief functions.
• Suppose a shady character comes up to you and offers to bet you $10 that his coin will come up heads on the next flip.
• Given that the coin might or might not be fair, what belief should you ascribe to the event that it comes up heads?
• Dempster–Shafer theory says that because you have no evidence either way, you have to say that the belief Bel(Heads) = 0 and also that Bel(¬Heads) = 0.
• Now suppose you have an expert at your disposal who testifies with 90% certainty that the coin is fair (i.e., he is 90% sure that P(Heads) = 0.5).
• Then Dempster–Shafer theory gives Bel(Heads) = 0.9 × 0.5 = 0.45 and likewise Bel(¬Heads) = 0.45.
• There is still a 10 percentage point "gap" that is not accounted for by the evidence.
• "Dempster's rule" (Dempster, 1968) shows how to combine evidence to give new values for Bel, and Shafer's work extends this into a complete computational model.
114
Fuzzy set & fuzzy logic
• Fuzzy set theory is a means of specifying how well an object satisfies a vague description.
• For example, consider the proposition "Nate is tall." Is this true, if Nate is 5' 10"? Most people would hesitate to answer "true" or "false," preferring to say, "sort of."
• Note that this is not a question of uncertainty about the external world—we are sure of Nate's height. The issue is that the linguistic term "tall" does not refer to a sharp demarcation of objects into two classes—there are degrees of tallness.
• For this reason, fuzzy set theory is not a method for uncertain reasoning at all.
• Rather, fuzzy set theory treats Tall as a fuzzy predicate and says that the truth value of Tall(Nate) is a number between 0 and 1, rather than being just true or false. The name "fuzzy set" derives from the interpretation of the predicate as implicitly defining a set of its members—a set that does not have sharp boundaries.
115
Fuzzy logic
• Fuzzy logic is a method for reasoning with logical expressions describing membership in fuzzy sets.
• For example, the complex sentence Tall(Nate) ∧ Heavy(Nate) has a fuzzy truth value that is a function of the truth values of its components.
• The standard rules for evaluating the fuzzy truth, T, of a complex sentence are:
– T(A ∧ B) = min(T(A), T(B))
– T(A ∨ B) = max(T(A), T(B))
– T(¬A) = 1 − T(A)
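The three rules can be captured directly; the membership degrees assigned to Nate below are illustrative:

```python
def f_and(a, b):
    return min(a, b)   # T(A ∧ B) = min(T(A), T(B))

def f_or(a, b):
    return max(a, b)   # T(A ∨ B) = max(T(A), T(B))

def f_not(a):
    return 1.0 - a     # T(¬A) = 1 − T(A)

tall_nate, heavy_nate = 0.6, 0.8   # illustrative membership degrees
print(f_and(tall_nate, heavy_nate))         # T(Tall ∧ Heavy) = 0.6
print(f_not(f_and(tall_nate, heavy_nate)))  # T(¬(Tall ∧ Heavy)) = 0.4
```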
116
Summary
• Reasoning properly:
– In FOL, it means conclusions follow from premises.
– In probability, it means having beliefs that allow an agent to act rationally.
• Conditional independence information is vital.
• A Bayesian network is a complete representation of the JPD, but exponentially smaller in size.
• Bayesian networks can reason causally, diagnostically, intercausally, or by combining two or more of the three.
• For polytrees, the computational time is linear in the network size.