CS540 Introduction to Artificial Intelligence Lecture...

Probability Distributions Bayesian Network Naive Bayes CS540 Introduction to Artificial Intelligence Lecture 11 Young Wu Based on lecture slides by Jerry Zhu, Yingyu Liang, and Charles Dyer July 5, 2021


Page 1: CS540 Introduction to Artificial Intelligence Lecture 11pages.cs.wisc.edu/~yw/CS540/CS540_Lecture_11_P.pdf · 2020. 6. 28. · Lecture 11 Young Wu Based on lecture slides by Jerry

Probability Distributions Bayesian Network Naive Bayes

CS540 Introduction to Artificial Intelligence
Lecture 11

Young Wu
Based on lecture slides by Jerry Zhu, Yingyu Liang, and Charles Dyer

July 5, 2021

Joint Distribution (Motivation)

The joint distribution of $X_j$ and $X_{j'}$ gives the probability that $X_j = x_j$ and $X_{j'} = x_{j'}$ occur at the same time.

$$\mathbb{P}\{X_j = x_j, X_{j'} = x_{j'}\}$$

The marginal distribution of $X_j$ can be found by summing over all possible values of $X_{j'}$.

$$\mathbb{P}\{X_j = x_j\} = \sum_{x \in X_{j'}} \mathbb{P}\{X_j = x_j, X_{j'} = x\}$$
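
As a small illustration of the marginalization step, the following Python sketch sums a joint table over the second feature. The table values and the helper name `marginal_first` are assumptions made up for this example and do not come from the lecture.

```python
# Hypothetical joint table: keys are (x_j, x_j'), values are probabilities.
joint = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def marginal_first(joint, xj):
    """P{X_j = xj}: sum the joint over every value of X_j'."""
    return sum(p for (a, _), p in joint.items() if a == xj)

p0 = marginal_first(joint, 0)
p1 = marginal_first(joint, 1)
print(p0, p1)
```

The two marginal probabilities add up to one because the joint table itself does.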

Conditional Distribution (Motivation)

Suppose the joint distribution is given.

$$\mathbb{P}\{X_j = x_j, X_{j'} = x_{j'}\}$$

The conditional distribution of $X_j$ given $X_{j'} = x_{j'}$ is the ratio between the joint distribution and the marginal distribution.

$$\mathbb{P}\{X_j = x_j | X_{j'} = x_{j'}\} = \frac{\mathbb{P}\{X_j = x_j, X_{j'} = x_{j'}\}}{\mathbb{P}\{X_{j'} = x_{j'}\}}$$
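
The ratio above can be sketched in a few lines of Python. The joint table and the function name `conditional` are hypothetical, chosen only to make the formula concrete.

```python
# Hypothetical joint table: keys are (x_j, x_j'), values are probabilities.
joint = {
    (0, 0): 0.30, (0, 1): 0.20,
    (1, 0): 0.10, (1, 1): 0.40,
}

def conditional(joint, xj, xjp):
    """P{X_j = xj | X_j' = xjp} = joint probability / marginal of X_j'."""
    marginal = sum(p for (_, b), p in joint.items() if b == xjp)
    if marginal == 0:
        raise ValueError("conditioning event has probability zero")
    return joint[(xj, xjp)] / marginal

p_1_given_1 = conditional(joint, 1, 1)   # 0.40 / (0.20 + 0.40)
p_0_given_1 = conditional(joint, 0, 1)   # 0.20 / (0.20 + 0.40)
```

For a fixed conditioning value, the conditional probabilities over all values of $X_j$ sum to one.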

Notation (Motivation)

The notation for joint, marginal, and conditional distributions will be shortened as follows.

$$\mathbb{P}\{x_j, x_{j'}\}, \quad \mathbb{P}\{x_j\}, \quad \mathbb{P}\{x_j | x_{j'}\}$$

When the context is not clear, for example when $x_j = a$, $x_{j'} = b$ with specific constants $a, b$, subscripts will be used under the probability sign.

$$\mathbb{P}_{X_j, X_{j'}}\{a, b\}, \quad \mathbb{P}_{X_j}\{a\}, \quad \mathbb{P}_{X_j | X_{j'}}\{a | b\}$$

Bayesian Network (Definition)

A Bayesian network is a directed acyclic graph (DAG) and a set of conditional probability distributions.

Each vertex represents a feature $X_j$.

Each edge from $X_j$ to $X_{j'}$ represents that $X_j$ directly influences $X_{j'}$.

No edge between $X_j$ and $X_{j'}$ implies independence or conditional independence between the two features.

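Conditional Independence (Definition)

Recall two events $A, B$ are independent if:

$$\mathbb{P}\{A, B\} = \mathbb{P}\{A\}\,\mathbb{P}\{B\} \quad \text{or} \quad \mathbb{P}\{A | B\} = \mathbb{P}\{A\}$$

In general, two events $A, B$ are conditionally independent, conditional on event $C$, if:

$$\mathbb{P}\{A, B | C\} = \mathbb{P}\{A | C\}\,\mathbb{P}\{B | C\} \quad \text{or} \quad \mathbb{P}\{A | B, C\} = \mathbb{P}\{A | C\}$$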

Causal Chain (Definition)

For three events $A, B, C$, the configuration $A \to B \to C$ is called a causal chain.

In this configuration, $A$ is not independent of $C$, but $A$ is conditionally independent of $C$ given information about $B$.

Once $B$ is observed, $A$ and $C$ are independent.

Common Cause (Definition)

For three events $A, B, C$, the configuration $A \leftarrow B \to C$ is called a common cause.

In this configuration, $A$ is not independent of $C$, but $A$ is conditionally independent of $C$ given information about $B$.

Once $B$ is observed, $A$ and $C$ are independent.

Common Effect (Definition)

For three events $A, B, C$, the configuration $A \to B \leftarrow C$ is called a common effect.

In this configuration, $A$ is independent of $C$, but $A$ is not conditionally independent of $C$ given information about $B$.

Once $B$ is observed, $A$ and $C$ are not independent.
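
A numerical check can make the common effect case less abstract. The sketch below assumes fair coin flips for $A$ and $C$ and a deterministic OR gate for $B$; these choices, and the helper name `prob`, are illustrative assumptions rather than anything specified in the lecture.

```python
from itertools import product

# Assumed priors: A and C are independent fair coin flips.
p_a = {0: 0.5, 1: 0.5}
p_c = {0: 0.5, 1: 0.5}

def p_b_given(a, c, b):
    """Assumed CPT for B: a deterministic OR of A and C."""
    return 1.0 if b == (a or c) else 0.0

# Joint over (a, b, c) for the configuration A -> B <- C.
joint = {(a, b, c): p_a[a] * p_c[c] * p_b_given(a, c, b)
         for a, b, c in product((0, 1), repeat=3)}

def prob(**fixed):
    """Sum the joint over all entries that agree with the fixed values."""
    names = ("a", "b", "c")
    return sum(p for key, p in joint.items()
               if all(key[names.index(n)] == v for n, v in fixed.items()))

p_a1 = prob(a=1)                                          # P{A}
p_a1_given_c = prob(a=1, c=1) / prob(c=1)                 # P{A | C}
p_a1_given_b = prob(a=1, b=1) / prob(b=1)                 # P{A | B}
p_a1_given_bc = prob(a=1, b=1, c=1) / prob(b=1, c=1)      # P{A | B, C}
```

Under these assumptions $\mathbb{P}\{A | C\}$ equals $\mathbb{P}\{A\}$, while $\mathbb{P}\{A | B, C\}$ differs from $\mathbb{P}\{A | B\}$, so observing $B$ makes $A$ and $C$ dependent.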

Storing Distribution (Definition)

If there are $m$ binary variables with $k$ edges, there are $2^m$ joint probabilities to store.

There are significantly fewer conditional probabilities to store. For example, if each node has at most 2 parents, then there are fewer than $4m$ conditional probabilities to store.

Given the conditional probabilities, the joint probabilities can be recovered.
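
The storage comparison can be checked on a small example. The sketch below assumes that one probability, $\mathbb{P}\{X_j = 1 | p(X_j)\}$, is stored per parent configuration of each binary node; the DAG and the function names are hypothetical.

```python
def joint_table_size(m):
    """Number of joint probabilities for m binary variables."""
    return 2 ** m

def cpt_size(parents):
    """Number of stored conditional probabilities, assuming one value
    P{X_j = 1 | parents} per joint setting of each node's binary parents."""
    return sum(2 ** len(ps) for ps in parents.values())

# Hypothetical DAG with m = 5 binary nodes, each with at most 2 parents.
parents = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C", "D"]}

n_joint = joint_table_size(len(parents))   # 2^5
n_cpt = cpt_size(parents)                  # 1 + 1 + 4 + 2 + 4
```

Here the conditional tables need 12 numbers against 32 for the full joint table, and 12 is below the bound $4m = 20$.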

Conditional Probability Table Diagram (Definition)

Conditional Probability Table Example (Definition)

Conditional Probability Table Larger Example (Definition)

Training Bayes Net (Definition)

Training a Bayesian network given the DAG is estimating the conditional probabilities. Let $P(X_j)$ denote the parents of the vertex $X_j$, and $p(X_j)$ be realizations (possible values) of $P(X_j)$.

$$\mathbb{P}\{x_j | p(X_j)\}, \quad p(X_j) \in P(X_j)$$

It can be done by maximum likelihood estimation given a training set.

$$\mathbb{P}\{x_j | p(X_j)\} = \frac{c_{x_j, p(X_j)}}{c_{p(X_j)}}$$

Bayes Net Training Example, Part I (Definition)

Bayes Net Training Example, Part II (Definition)

Laplace Smoothing (Definition)

Recall that maximum likelihood estimation can incorporate Laplace smoothing.

$$\mathbb{P}\{x_j | p(X_j)\} = \frac{c_{x_j, p(X_j)} + 1}{c_{p(X_j)} + |X_j|}$$

Here, $|X_j|$ is the number of possible values (number of categories) of $X_j$.

Laplace smoothing is considered regularization for Bayesian networks because it avoids overfitting the training data.
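
A short worked example, with counts invented purely for illustration: suppose $X_j$ is binary, so $|X_j| = 2$, and a particular parent configuration appears $c_{p(X_j)} = 3$ times in the training set, always together with $x_j = 0$. Without smoothing, the estimate for $x_j = 1$ is exactly zero; with smoothing, it is small but positive.

$$\frac{c_{1, p(X_j)}}{c_{p(X_j)}} = \frac{0}{3} = 0, \qquad \frac{c_{1, p(X_j)} + 1}{c_{p(X_j)} + |X_j|} = \frac{0 + 1}{3 + 2} = \frac{1}{5}$$

$$\frac{c_{0, p(X_j)} + 1}{c_{p(X_j)} + |X_j|} = \frac{3 + 1}{3 + 2} = \frac{4}{5}, \qquad \frac{1}{5} + \frac{4}{5} = 1$$

The smoothed estimates still sum to one over the values of $X_j$.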

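Bayes Net Inference 1 (Definition)

Given the conditional probability table, the joint probabilities can be calculated using the chain rule and conditional independence. With the features indexed so that every parent comes before its children:

$$\mathbb{P}\{x_1, x_2, ..., x_m\} = \prod_{j=1}^{m} \mathbb{P}\{x_j | x_1, x_2, ..., x_{j-1}\} = \prod_{j=1}^{m} \mathbb{P}\{x_j | p(X_j)\}$$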
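
The product of conditional probabilities can be sketched as follows. The three-node network, its CPT values, and the function name `joint_probability` are all hypothetical; each CPT entry stores the probability that the node equals 1 given its parents.

```python
from itertools import product

# Hypothetical network X1 -> X3 <- X2 with binary nodes (made-up numbers).
parents = {"X1": (), "X2": (), "X3": ("X1", "X2")}
# cpt[name][parent values] = P{name = 1 | parents}
cpt = {
    "X1": {(): 0.2},
    "X2": {(): 0.5},
    "X3": {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9},
}

def joint_probability(assignment):
    """Multiply P{x_j | p(X_j)} over all nodes for one full assignment."""
    total = 1.0
    for name, ps in parents.items():
        p_one = cpt[name][tuple(assignment[p] for p in ps)]
        total *= p_one if assignment[name] == 1 else 1.0 - p_one
    return total

names = list(parents)
table = {values: joint_probability(dict(zip(names, values)))
         for values in product((0, 1), repeat=len(names))}
```

The eight entries of `table` sum to one, which is a quick sanity check that the CPTs define a valid joint distribution.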

Bayes Net Inference 2 (Definition)

Given the joint probabilities, all other marginal and conditional probabilities can be calculated using their definitions.

$$\mathbb{P}\{x_j | x_{j'}, x_{j''}, ...\} = \frac{\mathbb{P}\{x_j, x_{j'}, x_{j''}, ...\}}{\mathbb{P}\{x_{j'}, x_{j''}, ...\}}$$

$$\mathbb{P}\{x_j, x_{j'}, x_{j''}, ...\} = \sum_{X_k : k \neq j, j', j'', ...} \mathbb{P}\{x_1, x_2, ..., x_m\}$$

$$\mathbb{P}\{x_{j'}, x_{j''}, ...\} = \sum_{X_k : k \neq j', j'', ...} \mathbb{P}\{x_1, x_2, ..., x_m\}$$
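
These definitions amount to inference by enumeration, which the following sketch illustrates. The joint table is built from made-up factors, and the names `marginal` and `conditional` are assumptions for this example only. Enumeration is exponential in the number of variables, so it is practical only for small networks.

```python
from itertools import product

# Hypothetical joint table over (X1, X2, X3), built from made-up factors
# so that the eight entries sum to one.
names = ("X1", "X2", "X3")
p3_one = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9}
joint = {}
for x1, x2, x3 in product((0, 1), repeat=3):
    p1 = 0.2 if x1 else 0.8
    p2 = 0.5
    p3 = p3_one[(x1, x2)] if x3 else 1.0 - p3_one[(x1, x2)]
    joint[(x1, x2, x3)] = p1 * p2 * p3

def marginal(fixed):
    """Sum the joint over every variable not named in `fixed`."""
    return sum(p for values, p in joint.items()
               if all(values[names.index(n)] == v for n, v in fixed.items()))

def conditional(target, evidence):
    """P{target | evidence} as a ratio of two marginals."""
    return marginal({**target, **evidence}) / marginal(evidence)

posterior = conditional({"X1": 1}, {"X3": 1})   # 0.16 / 0.44
```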

Bayes Net Inference Example, Part I (Definition)

Bayes Net Inference Example, Part II (Definition)

Bayes Net Inference Example, Part III (Definition)

Bayesian Network (Algorithm)

Input: instances $\{x_i\}_{i=1}^{n}$ and a directed acyclic graph such that feature $X_j$ has parents $P(X_j)$.

Output: conditional probability tables (CPTs): $\mathbb{P}\{x_j | p(X_j)\}$ for $j = 1, 2, ..., m$.

Compute the conditional probabilities using counts and Laplace smoothing.

$$\mathbb{P}\{x_j | p(X_j)\} = \frac{c_{x_j, p(X_j)} + 1}{c_{p(X_j)} + |X_j|}$$
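
A minimal sketch of this training step is given below, assuming categorical features stored as dictionaries. The function name `train_cpts`, the data layout, and the toy data set are assumptions for illustration, not the course's reference implementation.

```python
from collections import Counter
from itertools import product

def train_cpts(instances, parents, values):
    """Estimate P{x_j | p(X_j)} with add-one (Laplace) smoothing.

    instances: list of dicts mapping feature name -> observed value
    parents:   dict mapping feature name -> tuple of parent names
    values:    dict mapping feature name -> list of possible values
    """
    cpts = {}
    for name, ps in parents.items():
        joint_counts = Counter()    # c_{x_j, p(X_j)}
        parent_counts = Counter()   # c_{p(X_j)}
        for row in instances:
            key = tuple(row[p] for p in ps)
            joint_counts[(row[name], key)] += 1
            parent_counts[key] += 1
        table = {}
        # Enumerate every parent configuration, including unseen ones.
        for key in product(*(values[p] for p in ps)):
            for x in values[name]:
                table[(x, key)] = (joint_counts[(x, key)] + 1) / (
                    parent_counts[key] + len(values[name]))
        cpts[name] = table
    return cpts

# Hypothetical toy data for the DAG A -> B.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}]
parents = {"A": (), "B": ("A",)}
values = {"A": [0, 1], "B": [0, 1]}
cpts = train_cpts(data, parents, values)
```

Each table is keyed by `(value, parent_values)`, so `cpts["B"][(1, (0,))]` is the smoothed estimate of $\mathbb{P}\{B = 1 | A = 0\}$.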

Network Structure (Discussion)

Selecting from all possible structures (DAGs) is too difficult.

Usually, a Bayesian network is learned with a tree structure.

Choose the tree that maximizes the likelihood of the training data.

Chow Liu Algorithm (Discussion)

Add an edge between features $X_j$ and $X_{j'}$ with edge weight equal to the information gain of $X_j$ given $X_{j'}$ for all pairs $j, j'$.

Find the maximum spanning tree given these edges. The spanning tree is used as the structure of the Bayesian network.
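
The edge weights can be estimated from data as in the sketch below. It treats the information gain of one feature given another as their mutual information (entropy minus conditional entropy), estimated from empirical counts. The function name `information_gain` and the sample columns are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def information_gain(xs, ys):
    """Mutual information I(X; Y) in bits, estimated from paired samples."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((cx[x] / n) * (cy[y] / n)))
               for (x, y), c in cxy.items())

# Hypothetical data: three binary feature columns of equal length.
columns = {
    "X1": [0, 0, 1, 1, 0, 1, 0, 1],
    "X2": [0, 0, 1, 1, 0, 1, 0, 1],   # exact copy of X1
    "X3": [0, 1, 0, 1, 0, 0, 1, 1],   # empirically independent of X1
}

# One weight per unordered pair of features.
weights = {(a, b): information_gain(columns[a], columns[b])
           for a, b in combinations(columns, 2)}
```

In this toy data the pair `("X1", "X2")` gets the largest weight, so a maximum spanning tree would keep that edge.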

Aside: Prim’s Algorithm (Discussion)

To find the maximum spanning tree, start with an arbitrary vertex, a vertex set containing only this vertex, $V$, and an empty edge set, $E$.

Choose an edge with the maximum weight from a vertex $v \in V$ to a vertex $v' \notin V$, add $v'$ to $V$, and add an edge from $v$ to $v'$ to $E$.

Repeat this process until all vertices are in $V$. The tree $(V, E)$ is the maximum spanning tree.
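
The steps above can be sketched directly in Python. This version assumes a complete graph whose weights are stored in a dictionary keyed by `frozenset({u, v})`; that representation, the function name, and the weights are choices made for this example. It scans all candidate edges at each step, which is simple but not the most efficient variant.

```python
def maximum_spanning_tree(vertices, weight):
    """Prim's algorithm, picking the maximum-weight crossing edge each step.

    vertices: list of vertex names
    weight:   dict mapping frozenset({u, v}) -> edge weight (complete graph)
    """
    in_tree = {vertices[0]}    # V, seeded with an arbitrary vertex
    edges = []                 # E
    while len(in_tree) < len(vertices):
        # Best edge from a vertex inside the tree to a vertex outside it.
        u, v = max(
            ((a, b) for a in in_tree for b in vertices if b not in in_tree),
            key=lambda e: weight[frozenset(e)],
        )
        in_tree.add(v)
        edges.append((u, v))
    return edges

# Hypothetical weights on a complete graph over four features.
w = {frozenset(pair): value for pair, value in {
    ("X1", "X2"): 0.9, ("X1", "X3"): 0.1, ("X1", "X4"): 0.3,
    ("X2", "X3"): 0.5, ("X2", "X4"): 0.2, ("X3", "X4"): 0.7,
}.items()}
tree = maximum_spanning_tree(["X1", "X2", "X3", "X4"], w)
```

With these weights the tree consists of the edges X1-X2, X2-X3, and X3-X4.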

Aside: Prim’s Algorithm Diagram (Discussion)

Classification Problem (Discussion)

Bayesian networks do not have a clear separation of the label $Y$ and the features $X_1, X_2, ..., X_m$.

The Bayesian network with a tree structure and $Y$ as the root and $X_1, X_2, ..., X_m$ as the leaves is called the Naive Bayes classifier.

Bayes rule is used to compute $\mathbb{P}\{Y = y | X = x\}$, and the prediction $\hat{y}$ is the $y$ that maximizes the conditional probability.

$$\hat{y}_i = \arg\max_y \mathbb{P}\{Y = y | X = x_i\}$$
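
A compact sketch of this classifier for categorical features follows. By Bayes rule, $\mathbb{P}\{Y = y | X = x\}$ is proportional to $\mathbb{P}\{y\} \prod_j \mathbb{P}\{x_j | y\}$, so the code compares log scores and skips the normalizing constant. The function names, the use of add-one smoothing for the feature probabilities, and the toy data are assumptions of this example.

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(X, y, n_values):
    """Collect the counts needed for P{y} and P{x_j | y}.

    X: list of feature tuples, y: list of labels,
    n_values: number of categories of each feature.
    """
    label_counts = Counter(y)
    feature_counts = defaultdict(Counter)   # (j, label) -> value counts
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            feature_counts[(j, label)][value] += 1
    return label_counts, feature_counts, n_values

def predict(model, row):
    """Return the label maximizing log P{y} + sum_j log P{x_j | y}."""
    label_counts, feature_counts, n_values = model
    n = sum(label_counts.values())

    def score(label):
        s = log(label_counts[label] / n)
        for j, value in enumerate(row):
            # Add-one smoothing keeps unseen values away from log(0).
            s += log((feature_counts[(j, label)][value] + 1)
                     / (label_counts[label] + n_values[j]))
        return s

    return max(label_counts, key=score)

# Hypothetical toy data with two binary features.
X = [(1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
y = ["spam", "spam", "ham", "ham", "ham"]
model = train_naive_bayes(X, y, n_values=[2, 2])
```

Working in log space avoids numerical underflow when the number of features is large.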

Naive Bayes Diagram (Discussion)

Multinomial Naive Bayes (Discussion)

The implicit assumption for using the counts as the maximum likelihood estimate is that the distribution of $X_j | Y = y$, or in general, $X_j | P(X_j) = p(X_j)$, has the multinomial distribution.

$$\mathbb{P}\{X_j = x | Y = y\} = p_x$$

$$p_x = \frac{c_{x,y}}{c_y}$$

Gaussian Naive Bayes (Discussion)

If the features are not categorical, continuous distributions can be estimated using MLE as the conditional distribution.

Gaussian Naive Bayes is used if $X_j | Y = y$ is assumed to have the normal distribution.

$$\lim_{\varepsilon \to 0} \frac{1}{\varepsilon} \mathbb{P}\{x < X_j \le x + \varepsilon | Y = y\} = \frac{1}{\sqrt{2\pi}\,\sigma_y^{(j)}} \exp\left(-\frac{\left(x - \mu_y^{(j)}\right)^2}{2\left(\sigma_y^{(j)}\right)^2}\right)$$

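Gaussian Naive Bayes Training (Discussion)

Training involves estimating $\mu_y^{(j)}$ and $\sigma_y^{(j)}$ since they completely determine the distribution of $X_j | Y = y$.

The maximum likelihood estimates of $\mu_y^{(j)}$ and $\left(\sigma_y^{(j)}\right)^2$ are the sample mean and variance of the feature $j$.

$$\mu_y^{(j)} = \frac{1}{n_y} \sum_{i=1}^{n} x_{ij} \mathbf{1}_{\{y_i = y\}}, \quad n_y = \sum_{i=1}^{n} \mathbf{1}_{\{y_i = y\}}$$

$$\left(\sigma_y^{(j)}\right)^2 = \frac{1}{n_y} \sum_{i=1}^{n} \left(x_{ij} - \mu_y^{(j)}\right)^2 \mathbf{1}_{\{y_i = y\}}$$

$$\text{sometimes} \quad \left(\sigma_y^{(j)}\right)^2 = \frac{1}{n_y - 1} \sum_{i=1}^{n} \left(x_{ij} - \mu_y^{(j)}\right)^2 \mathbf{1}_{\{y_i = y\}}$$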
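
The estimates above translate into a short training and prediction routine. The sketch below uses the $1/n_y$ variance and log densities; the function names and the toy data are assumptions for illustration. It does not guard against a zero variance, which a practical implementation would need to handle.

```python
from math import log, pi

def fit_gaussian_nb(X, y):
    """Per-class prior, feature means, and 1/n_y feature variances."""
    params = {}
    for label in sorted(set(y)):
        rows = [row for row, t in zip(X, y) if t == label]
        n_y = len(rows)
        means = [sum(col) / n_y for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n_y
                     for col, m in zip(zip(*rows), means)]
        params[label] = (n_y / len(y), means, variances)
    return params

def log_density(x, mean, variance):
    """Log of the normal density with the given mean and variance."""
    return -0.5 * log(2 * pi * variance) - (x - mean) ** 2 / (2 * variance)

def predict(params, row):
    """Return the label maximizing log prior plus summed log densities."""
    def score(label):
        prior, means, variances = params[label]
        return log(prior) + sum(log_density(x, m, v)
                                for x, m, v in zip(row, means, variances))
    return max(params, key=score)

# Hypothetical toy data: two continuous features, two classes.
X = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (3.0, 0.5), (3.2, 0.7), (2.8, 0.3)]
y = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_nb(X, y)
```

Replacing `/ n_y` with `/ (n_y - 1)` in the variance line gives the alternative estimate mentioned above.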

Gaussian Naive Bayes Diagram (Discussion)

Tree Augmented Network Algorithm (Discussion)

It is also possible to create a Bayesian network in which all features $X_1, X_2, ..., X_m$ are connected to $Y$ (Naive Bayes edges) and the features themselves form a network, usually a tree (MST edges).

Information gain is replaced by conditional information gain (conditional on $Y$) when finding the maximum spanning tree.

This algorithm is called TAN: Tree Augmented Network.
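
The conditional information gain used for the TAN edge weights can be estimated from counts, as in the sketch below. It computes the conditional mutual information $I(X; Z | Y)$ in bits; the function name and the sample columns are hypothetical.

```python
from collections import Counter
from math import log2

def conditional_information_gain(xs, zs, ys):
    """I(X; Z | Y) in bits, estimated from aligned samples.

    Uses p(x, z, y) * log2( p(x, z, y) p(y) / (p(x, y) p(z, y)) ),
    where the sample size cancels inside the logarithm.
    """
    n = len(ys)
    cy = Counter(ys)
    cxy, czy = Counter(zip(xs, ys)), Counter(zip(zs, ys))
    cxzy = Counter(zip(xs, zs, ys))
    return sum((c / n) * log2(c * cy[y] / (cxy[(x, y)] * czy[(z, y)]))
               for (x, z, y), c in cxzy.items())

# Hypothetical data: a label column and three binary feature columns.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 1, 1, 0, 0, 1, 1]
f2 = [0, 0, 1, 1, 0, 0, 1, 1]   # copy of f1
f3 = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of f1 within each label

w_12 = conditional_information_gain(f1, f2, labels)
w_13 = conditional_information_gain(f1, f3, labels)
```

These weights would then be passed to a maximum spanning tree routine over the features, with the Naive Bayes edges from $Y$ added afterwards.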

Tree Augmented Network Algorithm Diagram (Discussion)