CS540 Introduction to Artificial Intelligence Lecture...

Probability Distributions Bayesian Network Naive Bayes

CS540 Introduction to Artificial IntelligenceLecture 11

Young WuBased on lecture slides by Jerry Zhu, Yingyu Liang, and Charles

Dyer

July 5, 2021


Joint DistributionMotivation

The joint distribution of Xj and Xj 1 provides the probability ofXj � xj and Xj 1 � xj 1 occur at the same time.

P Xj � xj ,Xj 1 � xj 1

(

The marginal distribution of Xj can be found by summingover all possible values of Xj 1 .

P tXj � xju �¸xPXj1

P Xj � xj ,Xj 1 � x

(


Conditional DistributionMotivation

Suppose the joint distribution is given.

P Xj � xj ,Xj 1 � xj 1

(

The conditional distribution of Xj given Xj 1 � xj 1 is ratiobetween the joint distribution and the marginal distribution.

P Xj � xj |Xj 1 � xj 1

( � P Xj � xj ,Xj 1 � xj 1

(

P Xj 1 � xj 1

(


NotationMotivation

The notations for joint, marginal, and conditional distributionswill be shortened as the following.

P xj , xj 1

(,P txju ,P

xj |xj 1

(

When the context is not clear, for example whenxj � a, xj 1 � b with specific constants a, b, subscripts will beused under the probability sign.

PXj ,Xj1ta, bu ,PXj

tau ,PXj |Xj1ta|bu


Bayesian NetworkDefinition

A Bayesian network is a directed acyclic graph (DAG) and aset of conditional probability distributions.

Each vertex represents a feature Xj .

Each edge from Xj to Xj 1 represents that Xj directly influencesXj 1 .

No edge between Xj and Xj 1 implies independence orconditional independence between the two features.


Causal ChainDefinition

For three events A,B,C , the configuration AÑ B Ñ C iscalled causal chain.

In this configuration, A is not independent of C , but A isconditionally independent of C given information about B.

Once B is observed, A and C are independent.


Common CauseDefinition

For three events A,B,C , the configuration AÐ B Ñ C iscalled common cause.

In this configuration, A is not independent of C , but A isconditionally independent of C given information about B.

Once B is observed, A and C are independent.


Common EffectDefinition

For three events A,B,C , the configuration AÑ B Ð C iscalled common effect.

In this configuration, A is independent of C , but A is notconditionally independent of C given information about B.

Once B is observed, A and C are not independent.


Storing DistributionDefinition

If there are m binary variables with k edges, there are 2m jointprobabilities to store.

There are significantly less conditional probabilities to store.For example, if each node has at most 2 parents, then thereare less than 4m conditional probabilities to store.

Given the conditional probabilities, the joint probabilities canbe recovered.


Conditional Probability Table DiagramDefinition


Conditional Probability Table ExampleDefinition


Conditional Probability Table Larger ExampleDefinition


Training Bayes NetDefinition

Training a Bayesian network given the DAG is estimating theconditional probabilities. Let P pXjq denote the parents of thevertex Xj , and p pXjq be realizations (possible values) ofP pXjq.

P txj |p pXjqu , p pXjq P P pXjq

It can be done by maximum likelihood estimation given atraining set.

P txj |p pXjqu �cxj ,ppXjqcppXjq


Bayes Net Training Example, Part IDefinition


Bayes Net Training Example, Part IIDefinition


Laplace SmoothingDefinition

Recall that the MLE estimation can incorporate Laplacesmoothing.

P txj |p pXjqu �cxj ,ppXjq � 1

cppXjq � |Xj |

Here, |Xj | is the number of possible values (number ofcategories) of Xj .

Laplace smoothing is considered regularization for Bayesiannetworks because it avoids overfitting the training data.


Bayes Net Inference 1Definition

Given the conditional probability table, the joint probabilitiescan be calculated using conditional independence.

P tx1, x2, ..., xmu �m¹j�1

P txj |x1, x2, ..., xj�1, xj�1, ..., xmu

�m¹j�1

P txj |p pXjqu


Bayes Net Inference 2Definition

Given the joint probabilities, all other marginal and conditionalprobabilities can be calculated using their definitions.

P xj |xj 1 , xj2 , ...

( � P xj , xj 1 , xj2 , ...

(

P xj 1 , xj2 , ...

(

P xj , xj 1 , xj2 , ...

( �¸

Xk :k�j ,j 1,j2,...

P tx1, x2, ..., xmu

P xj 1 , xj2 , ...

( �¸

Xk :k�j 1,j2,...

P tx1, x2, ..., xmu


Bayes Net Inference Example, Part IDefinition


Bayes Net Inference Example, Part IIDefinition


Bayes Net Inference Example, Part IIIDefinition


Bayesian NetworkAlgorithm

Input: instances: txiuni�1 and a directed acyclic graph suchthat feature Xj has parents P pXjq.Output: conditional probability tables (CPTs): P txj |p pXjqufor j � 1, 2, ...,m.

Compute the transition probabilities using counts and Laplacesmoothing.

P txj |p pXjqu �cxj ,ppXjq � 1

cppXjq � |Xj |


Network StructureDiscussion

Selecting from all possible structures (DAGs) is too difficult.

Usually, a Bayesian network is learned with a tree structure.

Choose the tree that maximizes the likelihood of the trainingdata.


Chow Liu AlgorithmDiscussion

Add an edge between features Xj and Xj 1 with edge weightequal to the information gain of Xj given Xj 1 for all pairs j , j 1.

Find the maximum spanning tree given these edges. Thespanning tree is used as the structure of the Bayesian network.


Aside: Prim’s AlgorithmDiscussion

To find the maximum spanning tree, start with an arbitraryvertex, a vertex set containing only this vertex, V , and anempty edge set, E .

Choose an edge with the maximum weight from a vertexv P V to a vertex v 1 R V and add v 1 to V , add an edge fromv to v 1 to E

Repeat this process until all vertices are in V . The treepV ,E q is the maximum spanning tree.


Aside: Prim’s Algorithm DiagramDiscussion


Classification ProblemDiscussion

Bayesian networks do not have a clear separation of the labelY and the features X1,X2, ...,Xm.

The Bayesian network with a tree structure and Y as the rootand X1,X2, ...,Xm as the leaves is called the Naive Bayesclassifier.

Bayes rules is used to compute P tY � y |X � xu, and theprediction y is y that maximizes the conditional probability.

yi � arg maxy

P tY � y |X � xiu


Naive Bayes DiagramDiscussion


Multinomial Naive BayesDiscussion

The implicit assumption for using the counts as the maximumlikelihood estimate is that the distribution of Xj |Y � y , or ingeneral, Xj |P pXjq � p pXjq has the multinomial distribution.

P tXj � x |Y � yu � px

px � cx ,ycy


Gaussian Naive BayesDiscussion

If the features are not categorical, continuous distributionscan be estimated using MLE as the conditional distribution.

Gaussian Naive Bayes is used if Xj |Y � y is assumed to havethe normal distribution.

limεÑ0

1

εP tx Xj ¤ x � ε|Y � yu � 1?

2πσpjqy

exp

��

�x � µ

pjqy

2

2�σpjqy

2

��


Gaussian Naive Bayes TrainingDiscussion

Training involves estimating µpjqy and σ

pjqy since they

completely determine the distribution of Xj |Y � y .

The maximum likelihood estimates of µpjqy and

�σpjqy

2are the

sample mean and variance of the feature j .

µpjqy � 1

ny

n

i�1

xij1tyi�yu, ny �n

i�1

1tyi�yu

�σpjqy

2� 1

ny

n

i�1

�xij � µ

pjqy

21tyi�yu

sometimes�σpjqy

2� 1

ny � 1

n

i�1

�xij � µ

pjqy

21tyi�yu


Gaussian Naive Bayes DiagramDiscussion


Tree Augmented Network AlgorithmDiscussion

It is also possible to create a Bayesian network with allfeatures X1,X2, ...,Xm connected to Y (Naive Bayes edges)and the features themselves form a network, usually a tree(MST edges).

Information gain is replaced by conditional information gain(conditional on Y ) when finding the maximum spanning tree.

This algorithm is called TAN: Tree Augmented Network.


Tree Augmented Network Algorithm DiagramDiscussion

CS540 Introduction to Artificial Intelligence Lecture...

Documents

Transcript of CS540 Introduction to Artificial Intelligence Lecture...