Bayesian Network Classifier


Page 1: Bayesian Network Classifier

BAYESIAN NETWORK CLASSIFIER
Not so naïve any more, or: Bringing causality into the equation

Page 2: Bayesian Network Classifier

Review: Before covering Bayesian Belief Networks, a little review of the Naïve Bayesian Classifier.

Page 3: Bayesian Network Classifier

Approach: Look at the probability that an instance belongs to each class, given its value in each of its dimensions.

[Figure: a grid of per-dimension, per-class probabilities P]

Page 4: Bayesian Network Classifier

Example: Redness. If one of the dimensions were "redness": for a given redness value, which is the most probable fruit?

[Figure: "Distribution of Redness Values" - density vs. redness for apples, peaches, oranges, and lemons]

Page 5: Bayesian Network Classifier

Bayes Theorem

From the book, where h is a hypothesis and D is the training data:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

$$P(\text{apple} \mid \text{redness}=4.05) = \frac{P(\text{redness}=4.05 \mid \text{apple})\,P(\text{apple})}{P(\text{redness}=4.05)}$$

Page 6: Bayesian Network Classifier

If Non-Parametric…

[Figure: "Redness of Apples and Oranges" - histogram of redness values for the two fruits]

Say there are 2506 apples and 2486 oranges. The probability that redness would be 4.05, given an apple, is about 10/2506. P(apple)? 2506/(2506+2486). P(redness=4.05)? About (10+25)/(2506+2486). So:

$$P(\text{apple} \mid \text{redness}=4.05) = \frac{P(\text{redness}=4.05 \mid \text{apple})\,P(\text{apple})}{P(\text{redness}=4.05)} = \frac{\frac{10}{2506}\cdot\frac{2506}{2506+2486}}{\frac{10+25}{2506+2486}} = \frac{10}{10+25}$$
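A quick sanity check of that arithmetic (a minimal sketch; the counts 10, 25, 2506, and 2486 are the ones on the slide):

```python
# Nonparametric Bayes check for the apples/oranges example.
n_apples, n_oranges = 2506, 2486
apples_at_bin, oranges_at_bin = 10, 25   # counts in the redness = 4.05 bin

likelihood = apples_at_bin / n_apples                                 # P(redness | apple)
prior = n_apples / (n_apples + n_oranges)                             # P(apple)
evidence = (apples_at_bin + oranges_at_bin) / (n_apples + n_oranges)  # P(redness)

posterior = likelihood * prior / evidence                             # P(apple | redness)
print(posterior)                                          # 0.2857... = 10/35
print(apples_at_bin / (apples_at_bin + oranges_at_bin))   # same thing
```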

Page 7: Bayesian Network Classifier

Bayes

I think of the ratio of P(h) to P(D) as an adjustment to the easily determined P(D|h), in order to account for differences in sample size.

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

Here P(h) and P(D) are the prior probabilities, or "priors", and P(h|D) is the posterior probability.

Page 8: Bayesian Network Classifier

Naïve Bayes Classifier

The "naïve" term comes from the derivation below: each attribute $a_i$ is assumed to be conditionally independent of the others given the class $v_j$, which is what lets the likelihood factor into a simple product.

$$v_{NB} = \operatorname*{argmax}_{v_j \in V}\; P(v_j) \prod_i P(a_i \mid v_j)$$
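A minimal sketch of that decision rule over discrete attributes (hypothetical data layout and names; not the deck's code):

```python
from collections import Counter, defaultdict

def train(records, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    attr_counts = defaultdict(Counter)   # (class, dim) -> Counter of values
    for rec, cls in zip(records, labels):
        for dim, val in enumerate(rec):
            attr_counts[(cls, dim)][val] += 1
    return class_counts, attr_counts

def classify(instance, class_counts, attr_counts):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    total = sum(class_counts.values())
    best_cls, best_score = None, -1.0
    for cls, n_cls in class_counts.items():
        score = n_cls / total                                # prior P(v_j)
        for dim, val in enumerate(instance):
            score *= attr_counts[(cls, dim)][val] / n_cls    # P(a_i | v_j)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```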

Page 9: Bayesian Network Classifier

Can remove the naïveness: go with a covariance matrix instead of per-dimension standard deviations. The univariate normal density

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}$$

becomes the multivariate normal density

$$f(\vec{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\vec{x}-\vec{\mu})^T\,\Sigma^{-1}(\vec{x}-\vec{\mu})\right)$$
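A sketch of a class-conditional Gaussian classifier built on that density (a minimal illustration assuming NumPy/SciPy; not the deck's implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X, y):
    """Fit one multivariate Gaussian (mean + full covariance) per class."""
    models = {}
    for cls in np.unique(y):
        Xc = X[y == cls]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)      # dims x dims covariance matrix
        prior = len(Xc) / len(X)            # P(class)
        models[cls] = (mu, cov, prior)
    return models

def predict(x, models):
    """Pick the class maximizing f(x | class) * P(class)."""
    scores = {cls: multivariate_normal.pdf(x, mean=mu, cov=cov) * prior
              for cls, (mu, cov, prior) in models.items()}
    return max(scores, key=scores.get)
```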

Page 10: Bayesian Network Classifier

Solution

Counts in each dimension, per fruit:

         Red  Yellow  Mass  Vol
apples     0     235   106    3
peaches    0     262   176   57
oranges    9     263   143    7
lemons    22     239   239  184
Total     31     999   664  251

Per-dimension probabilities, with their product in the last column:

         Red   Yellow  Mass  Vol   Product
apples   0     0.24    0.16  0.01  0
peaches  0     0.26    0.27  0.23  0
oranges  0.29  0.26    0.22  0.28  0.0004
lemons   0.71  0.24    0.36  0.73  0.0044

Page 11: Bayesian Network Classifier

Lot of work:
• Need Bayes rule; instead of simply multiplying each dimensional probability…
• Must compute a multivariate covariance matrix (num dims × num dims)
• Must calculate the multivariate PDF and all priors, which includes getting the inverse of the covariance matrix
• Only useful if covariance is a strong and predictive component of the data

Page 12: Bayesian Network Classifier

Other ways of removing naïveté: sometimes it is useful to infer causal relationships.

[Diagram: Dim x → Dim y, edge labeled "Causes"]

Page 13: Bayesian Network Classifier

If we can figure out a causal relationship between dimensions ("Dim x causes Dim y"), they are no longer independent; they are conditional. In terms of bins?

$$P(y\ \text{is class}\ A \mid \text{caused by}\ x) = P(y\ \text{is class}\ A \mid x\ \text{is class}\ A)$$

Page 14: Bayesian Network Classifier

Problem:
• Determine a dependency network from the data
• Use the dependencies to determine the probabilities that an instance is a given class
• Use those probabilities to classify

[Diagram: dependency network over Dim 1 through Dim 5]

Page 15: Bayesian Network Classifier

DAG: Directed Acyclic Graph. Used to represent a dependency network; known as a Bayesian Belief Network structure.

[Diagram: DAG over Dim 1 through Dim 5]

Page 16: Bayesian Network Classifier

Algorithm: not in the book. It comes from a 39-page paper, "A Bayesian Method for the Induction of Probabilistic Networks from Data", and is known as the K2 algorithm.

Page 17: Bayesian Network Classifier

Similar to a decision tree: each node is a dimension, but instead of representing a decision it represents a conditional relationship. The algorithm for selecting nodes is greedy.

Quote from the paper: "The algorithm is named K2 because it evolved from a system named Kutató (Herskovits & Cooper, 1990) that applies the same greedy-search heuristics. As we discuss in section 6.1.3, Kutató uses entropy to score network structures."

[Diagram: DAG over Dim 1 through Dim 5]

Page 18: Bayesian Network Classifier

General Approach (greedy): find the parents of each node.
• Determine the no-parent score for a given node (dimension)
• For each remaining dimension (node), determine the probability (score) that that dimension is the parent of the given node (does a dependency appear to exist?)
• Compare the score of the best candidate to the no-parent score
• If better, keep it as a parent and repeat (see if another parent can be added)
• Otherwise, done

In short: find the best parent; if it improves the score, keep it; then find the next "best parent".

Page 19: Bayesian Network Classifier

How to score? Compute the probability that a given data "configuration" could belong to a given DAG: do records with a given value in one dimension tend to have a specific value in another dimension?

[Diagram, example from the book: network over Storm, BusTourGroup, Campfire, Lightning, Thunder, and ForestFire]

Page 20: Bayesian Network Classifier

Bayesian Belief Network. BS = belief network structure. How probable?

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Don't panic! We'll get through it.

Page 21: Bayesian Network Classifier

Proof: the last 5 pages of the (39-page) paper, "A Bayesian Method for the Induction of Probabilistic Networks from Data".

Page 22: Bayesian Network Classifier

Bayesian Belief Network. BS = belief network structure. How probable?

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

• n = number of dimensions (nodes)
• qi = number of unique instantiations of node i's parents
  • If one parent, q = the number of distinct values in the parent
  • If two parents, q = (num values in parent 1) × (num values in parent 2)
• ri = number of distinct values possible in the dimension
• Nijk = number of records with value k in the current dimension that match parental instantiation j
• Nij = number of records that match parental instantiation j (the sum of the Nijk's)

Page 23: Bayesian Network Classifier

Intuition: think of the $\frac{(r_i-1)!}{(N_{ij}+r_i-1)!}$ factor as a random-match probability: what are the chances that the values seen in a dimension (for the records that match the parental instantiation) could occur randomly?

Think of the $\prod_k N_{ijk}!$ factor as an adjustment upward (since it shows up in the numerator) indicating how the data is actually organized: how organized is the data in the child dimension? For example, 6!·0! is 720 while 3!·3! is 36. Sound familiar?

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 24: Bayesian Network Classifier

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Algorithm: greedy algorithm for finding parents. For a given dimension:
• Check the no-parent probability; store it in Pold
• Then choose the parent that maximizes g
• If that probability is greater than Pold, add the parent to the list of parents and update Pold
• Keep adding until the probability can't be increased

Page 25: Bayesian Network Classifier

No-Parent ("Orphan") Probability?

There is only one "instantiation". With no parent filtering, Nij is the entire set of training samples, and Nijk is the number of training records where the current dimension has value vk.

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 26: Bayesian Network Classifier

Example from the paper: three nodes.
• Two instantiations for parent X2 (of child X3): X2 has value absent; X2 has value present
• Two instantiations for parent X1 (of child X2): X1 has value absent; X1 has value present

Page 27: Bayesian Network Classifier

Some numbers. For the X2 instantiation with value absent:
• Number of X3 absents that were X2 absents: 4
• Number of X3 presents that were X2 absents: 1

For the X2 instantiation with value present:
• X3 absent | X2 present: 0
• X3 present | X2 present: 5

Page 28: Bayesian Network Classifier

X3 Calculations. For dimension (i) = 3, one factor of

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

per parental instantiation of X2:

$$\frac{(2-1)!\,4!\,1!}{(5+2-1)!} \quad\text{(X2 absent)} \qquad \frac{(2-1)!\,0!\,5!}{(5+2-1)!} \quad\text{(X2 present)}$$

Page 29: Bayesian Network Classifier

Some more numbers. For the X1 instantiation with value absent:
• Number of X2 absents that were X1 absents: 4
• Number of X2 presents that were X1 absents: 1

For the X1 instantiation with value present:
• X2 absent | X1 present: 1
• X2 present | X1 present: 4

Page 30: Bayesian Network Classifier

X2 Calculations. For dimension (i) = 2, one factor per parental instantiation of X1:

$$\frac{(2-1)!\,4!\,1!}{(5+2-1)!} \quad\text{(X1 absent)} \qquad \frac{(2-1)!\,1!\,4!}{(5+2-1)!} \quad\text{(X1 present)}$$

Page 31: Bayesian Network Classifier

Some more numbers. Dimension 1 has no parents:
• Number of X1 absents: 5
• Number of X1 presents: 5

Page 32: Bayesian Network Classifier

X1 Calculations. For dimension (i) = 1, which has no parents (a single instantiation covering all 10 records):

$$\frac{(2-1)!\,5!\,5!}{(10+2-1)!}$$

Page 33: Bayesian Network Classifier

Putting it all together: the whole enchilada. The article calls this structure BS1 (X1 → X2 → X3):

$$P(B_{S1}, D) = P(B_{S1}) \cdot \frac{(2-1)!\,5!\,5!}{(10+2-1)!} \cdot \frac{(2-1)!\,1!\,4!}{(5+2-1)!} \cdot \frac{(2-1)!\,4!\,1!}{(5+2-1)!} \cdot \frac{(2-1)!\,0!\,5!}{(5+2-1)!} \cdot \frac{(2-1)!\,4!\,1!}{(5+2-1)!}$$

$$P(B_{S1}, D) = P(B_{S1}) \cdot 2.23 \times 10^{-9}$$
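Multiplying those five factors out numerically (a quick check; it matches the paper's 2.23 × 10⁻⁹):

```python
from math import factorial as f

def term(counts):
    """One (r_i - 1)! / (N_ij + r_i - 1)! * prod(N_ijk!) factor, with r_i = 2."""
    n_ij = sum(counts)
    num = f(2 - 1)
    for n_ijk in counts:
        num *= f(n_ijk)
    return num / f(n_ij + 2 - 1)

score = (term([5, 5])                    # X1: no parents, all 10 records
         * term([1, 4]) * term([4, 1])   # X2: one factor per X1 instantiation
         * term([0, 5]) * term([4, 1]))  # X3: one factor per X2 instantiation
print(score)                             # ~2.23e-09
```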

Page 34: Bayesian Network Classifier

Comparing networks: the article compares BS1 to a second structure, BS2. Assuming that P(BS1) = P(BS2), BS1 comes out ten times more probable than BS2.

[Diagram: the alternative structure BS2 over X1, X2, X3]

Page 35: Bayesian Network Classifier

Remember: we are not calculating the whole tree, just a set of parents for a single node, so there is no need for the first product:

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 36: Bayesian Network Classifier

Result: we get a list of the most likely parents for each node.

[Diagram: network over Storm, BusTourGroup, Campfire, Lightning, Thunder, and ForestFire]

Page 37: Bayesian Network Classifier

K2 algorithm (more formally):

1. procedure K2;
2. {Input: A set of n nodes, an ordering on the nodes, an upper bound u on the
3.  number of parents a node may have, and a database D containing m cases.}
4. {Output: For each node, a printout of the parents of the node.}
5. for i := 1 to n do
6.   πi := ∅;
7.   Pold := g(i, πi); {This function is computed using equation (12).}
8.   OKToProceed := true;
9.   while OKToProceed and |πi| < u do
10.    let z be the node in Pred(xi) - πi that maximizes g(i, πi ∪ {z});
11.    Pnew := g(i, πi ∪ {z});
12.    if Pnew > Pold then
13.      Pold := Pnew;
14.      πi := πi ∪ {z};
15.    else OKToProceed := false;
16.  end {while};
17.  write('Node: ', xi, ' Parents of this node: ', πi);
18. end {for};
19. end {K2};

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 38: Bayesian Network Classifier

The "g" function:

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Function g(i, set of parents) {
  Set score = 1
  If the set of parents is empty
    There is a single "instantiation": Nij is the size of the entire training set, and Sv is the entire training set
    Score *= (ri − 1)! / (Nij + ri − 1)!
    For each child instantiation (e.g. 0 and 1)
      Get the count of matching records in Sv; that is Nijk
      (with two child values there are two Nijk's, and ri = 2)
      Score *= Nijk!
  Else
    Get the parental instantiations (e.g. 00, 01, 10, 11)
    For each parental instantiation
      Get the training records that match (Sv); the size of that set is Nij
      Score *= (ri − 1)! / (Nij + ri − 1)!
      For each child instantiation (e.g. 0 and 1)
        Get the count of matching records in Sv; that is Nijk
        Score *= Nijk!
  Return Score
}
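A runnable sketch of g and the K2 loop under those definitions, in log space to sidestep the huge factorials discussed later in the deck (the data layout, records as tuples of discrete values, and all names are my assumptions, not the paper's code):

```python
from itertools import product
from math import lgamma

def log_fact(n):
    return lgamma(n + 1)        # log(n!), without computing n! itself

def g_log(i, parents, records, values):
    """log g(i, parents). values[d] lists the legal values of dimension d."""
    r_i = len(values[i])
    total = 0.0
    # One pass per parental instantiation j (a single empty one if no parents).
    for inst in product(*(values[p] for p in parents)):
        matching = [rec for rec in records
                    if all(rec[p] == v for p, v in zip(parents, inst))]
        n_ij = len(matching)
        n_ijk = [sum(1 for rec in matching if rec[i] == val)
                 for val in values[i]]
        total += (log_fact(r_i - 1) - log_fact(n_ij + r_i - 1)
                  + sum(log_fact(n) for n in n_ijk))
    return total

def k2(records, values, order, u):
    """Greedy K2: returns {node: parents}, scanning nodes in `order`."""
    parents = {}
    for pos, i in enumerate(order):
        pi = []
        p_old = g_log(i, pi, records, values)
        while len(pi) < u:
            preds = [z for z in order[:pos] if z not in pi]  # Pred(x_i) - pi
            if not preds:
                break
            z = max(preds, key=lambda z: g_log(i, pi + [z], records, values))
            p_new = g_log(i, pi + [z], records, values)
            if p_new <= p_old:
                break               # OKToProceed := false
            p_old, pi = p_new, pi + [z]
        parents[i] = pi
    return parents
```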

Page 39: Bayesian Network Classifier

Implementation: straightforward.
• Pred(i): returns all nodes that come before i in the ordering
• g: our formula

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 40: Bayesian Network Classifier

How to get instantiations: the no-parent approach. With no parents, work with all records in the training set to accumulate counts for the current dimension.

$$\frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 41: Bayesian Network Classifier

How to get instantiations with parents. My first attempt (what's wrong with this?):

For each parent
  For each possible value in that parent's dimension
    Accumulate values
  End for
End for

Page 42: Bayesian Network Classifier

Instantiations: you have to know which values to use for every parent when accumulating counts, so you must get all the instantiations first.

parents: 0 2 3 4
instantiations:
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1

For each instantiation
  Compute the first portion of the numerator: log((ri − 1)!)
  For each possible value in the current dimension
    Get the counts that match the instantiation and the value in the current dimension
    Update a running sum of counts (for Nij)
    Update a running sum of log factorials (for the Nijk's)
  End for
  Add the sum of log factorials to the original numerator
  Compute the denominator: log((Nij + ri − 1)!)
  Subtract it from the numerator
End for

Page 43: Bayesian Network Classifier

But how to generate the instantiations (parents 0 2 3 4, 0000 through 1111, as on the previous slide)? What if each dimension had a different number of legal values? I generated an increment function.
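A sketch of such an increment function: mixed-radix counting, so each parent dimension can have its own number of legal values (in Python, itertools.product would do the same job; names here are mine):

```python
def increment(inst, bases):
    """Advance inst (one digit per parent) to the next instantiation.

    bases[d] is the number of legal values in parent dimension d.
    Returns False once every instantiation has been produced.
    """
    for d in range(len(inst) - 1, -1, -1):   # rightmost digit first
        inst[d] += 1
        if inst[d] < bases[d]:
            return True
        inst[d] = 0                          # carry into the next digit
    return False

# Usage: enumerate every instantiation of four binary parents.
inst, bases = [0, 0, 0, 0], [2, 2, 2, 2]
while True:
    print(inst)                              # 0000, 0001, ..., 1111
    if not increment(inst, bases):
        break
```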

Page 44: Bayesian Network Classifier

What's this ordering nonsense? "An ordering on the nodes…"

It's got to be a DAG (acyclic). The algorithm ensures this with an ordering: an assumed order, where Pred(i) returns all nodes that occur earlier in the ordering.

Page 45: Bayesian Network Classifier

How to get around the order issue? The paper gives a couple of suggestions ("random" and "backwards"):
• Randomly shuffle the order; do this several times and take the best-scoring network
• Or take a whole different approach to generating the network: start fully connected, remove the edge that increases P(BS) the most, and continue until no removal can increase it
• Use whichever result is better (random or reverse)

Page 46: Bayesian Network Classifier

Even with an ordering… the number of possible structures grows exponentially. The paper states that even with an ordering constraint there are $2^{n(n-1)/2}$ networks: once again binary membership (1 means the edge is part of the graph, 0 means it is not) over all unidirectional edges, a Cartesian product of per-edge choices (hence "from Descartes"); think of a distance matrix (from row to column).

NP-hard: the book states that exact inference of probabilities for an arbitrary Bayesian network is known to be NP-hard (guess who… Cooper, 1990).

Page 47: Bayesian Network Classifier

P(BS) is one over the count of all Bayesian belief structures (BS's). That's a lot of networks, and it is at the heart of the derivation.

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 48: Bayesian Network Classifier

How do we know the order? A priori, we have no knowledge of the network structure.

[Diagram: network over Storm, BusTourGroup, Campfire, Lightning, Thunder, and ForestFire]

Page 49: Bayesian Network Classifier

Bigger example: what if we had a thousand training records? When determining the first Pold (no parents), what will Nij be? How do we get around that? Perl presents 1000! as "inf".

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 50: Bayesian Network Classifier

The paper discusses this in its time-complexity section: switch to log values. Then you can add and subtract instead of multiplying and dividing (faster). They even pre-calculated every log-factorial value, up to the number of training values plus the maximum number of distinct values. The approach that helped time complexity also helps in managing extremely large numbers.

Page 51: Bayesian Network Classifier

Formula for the log factorial; easy enough to write your own function:

$$\log(n!) = \sum_{i=1}^{n} \log(i)$$
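A minimal version of that function, the direct sum from the formula (in practice Python's math.lgamma(n + 1) computes the same value in constant time):

```python
import math

def log_factorial(n):
    """log(n!) as a sum of logs; avoids ever forming the huge n! itself."""
    return sum(math.log(i) for i in range(1, n + 1))

print(log_factorial(1000))       # ~5912.13, where float(1000!) overflows
print(math.lgamma(1000 + 1))     # same value via the gamma function
```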

Page 52: Bayesian Network Classifier

How does this impact the implementation?

Page 53: Bayesian Network Classifier

Revisit the algorithm: greedy algorithm for finding parents. For a given dimension:
• Check the no-parent probability; store it in Pold
• Then choose the parent that maximizes g
• If that probability is greater than Pold, add the parent to the list of parents and update Pold
• Keep adding until the probability can't be increased

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 54: Bayesian Network Classifier

Have the network; now what? Training:
• For dimensions (nodes) with no parents, calculate counts as usual (they are considered independent, so the naïve process is appropriate)
• For dimensions with parents, calculate counts for all possible combinations of parent values
  • One parent: count records for each parent value (present; absent)
  • Two parents: count records for each combination of parent values (present, present; present, absent; absent, present; absent, absent)

Page 55: Bayesian Network Classifier

I didn't do this: I made my algorithm lazy and did my "with-parents" counts during the test process.

Total = 0
Num in class = 0
For each training record
  Count this record = true
  For each parent
    If the record's value (or bin) in that parent dimension does not equal the test instance's value (or bin)
      Count this record = false
  If count this record is true
    Total++
    If the training record's class equals the test class
      Num in class++
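The same lazy count as a small Python function (a sketch; the record layout and names are mine):

```python
def lazy_conditional_count(train, train_labels, parents, test_instance, test_class):
    """Count training records whose parent-dimension values match the test
    instance, and how many of those carry the candidate class."""
    total = in_class = 0
    for rec, cls in zip(train, train_labels):
        if all(rec[p] == test_instance[p] for p in parents):
            total += 1
            if cls == test_class:
                in_class += 1
    return in_class / total if total else 0.0
```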

Page 56: Bayesian Network Classifier

Whole Algorithm.

Train:
• Build the network
• Generate counts, with all possible parental value combinations

Test (just like naïve):
• For each dimension (node), calculate the probability for each class; note that those with parents will be conditional probabilities, i.e. just look at the training samples that match the parental values
• For each class, multiply together the probabilities that the test instance is that class across all nodes (dimensions)
• Choose the class with the maximum probability as your prediction (a sketch of this test step follows below)
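One reading of that test step as code (assumes discrete values; parents_of maps each dimension to its learned parent list; all helper names are mine, not the deck's):

```python
def dim_prob(train, train_labels, dim, parents, test_instance, cls):
    """P(test's value in dim | class = cls), estimated from the training
    records of that class which match the test instance's parent values."""
    matches = [rec for rec, lab in zip(train, train_labels)
               if lab == cls
               and all(rec[p] == test_instance[p] for p in parents)]
    if not matches:
        return 0.0
    return sum(rec[dim] == test_instance[dim] for rec in matches) / len(matches)

def classify(test_instance, classes, parents_of, train, train_labels):
    """Multiply per-dimension (conditional) probabilities; take the argmax."""
    def score(cls):
        s = 1.0
        for dim in range(len(test_instance)):
            s *= dim_prob(train, train_labels, dim,
                          parents_of[dim], test_instance, cls)
        return s
    return max(classes, key=score)
```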

Page 57: Bayesian Network Classifier

Lazy Algorithm.

Train:
• Build the network
• Generate counts

Test:
• For each dimension (node), calculate the probability for each class; note that those with parents will have to be calculated on the fly (lazy)
• For each class, multiply together the probabilities that the test instance is that class across all nodes (dimensions)
• Choose the class with the maximum probability as your prediction

Page 58: Bayesian Network Classifier

Lazy and Random Ordering Algorithm.

Train:
• Try multiple random orderings; keep the best network
• Generate counts

Test:
• For each dimension (node), calculate the probability for each class; note that those with parents will have to be calculated on the fly (lazy)
• For each class, multiply together the probabilities that the test instance is that class across all nodes (dimensions)
• Choose the class with the maximum probability as your prediction

Page 59: Bayesian Network Classifier

Formulaic Representation. Probability representation; note the comma. The book utilizes… For instance, if the parents were dimensions 1 and 2…

Page 60: Bayesian Network Classifier

Terminology: framed in terms of independence vs. dependence.

Quotes from the paper: "Cases occur independently given a belief network." "Nodes not connected represent variables which are conditionally independent of each other."

From the book: "A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities."

Page 61: Bayesian Network Classifier

Can be given a network: you don't necessarily have to "learn" it; it could be the result of domain knowledge.

[Diagram: network over Storm, BusTourGroup, Campfire, Lightning, Thunder, and ForestFire]

Page 62: Bayesian Network Classifier

Causality? Do the edges really represent causal relationships? Maybe not. A lack of conditional independence does not necessarily imply causality ("correlation does not imply causation"); both variables could be symptoms of some other cause. It is, though, necessary for causation.

[Diagram: Dim x → Dim y, edge labeled "Causes"]

Page 63: Bayesian Network Classifier

Famous example: studies showed that women who were taking combined hormone replacement therapy (HRT) also had a lower-than-average incidence of coronary heart disease (CHD), which led doctors to propose that HRT was protective against CHD. Randomized controlled trials, however, showed that HRT caused a small but statistically significant increase in the risk of CHD. The subjects in the original studies were more likely to be from higher socio-economic groups.

Page 64: Bayesian Network Classifier

Another third-factor example: sleeping with one's shoes on is strongly correlated with waking up with a headache. Therefore, sleeping with one's shoes on causes headaches? Maybe drunk people are simply more likely both to sleep with their shoes on and to wake up with a headache.

Page 65: Bayesian Network Classifier

Another third factor: as ice cream sales increase, the rate of drowning deaths increases sharply. Therefore, ice cream causes drowning? Ice cream sales increase greatly in summer, as do drownings.

Page 66: Bayesian Network Classifier

Directionality: it is also tough to determine the direction of a causal relationship. Example: the more firemen fighting a fire, the bigger the fire is observed to be. Therefore, firemen cause fire?

Page 67: Bayesian Network Classifier

Project note: the project has only 7 positives (993 non-forest-fires). Both Weka and my first version achieved 99.3% accuracy: the classifier learned that it could simply predict negative every time and still score 99.3%. Solution?

99.3% accuracy ain't bad. Or is it?

Page 68: Bayesian Network Classifier

Summary: a greedy algorithm.
• Learn the network (using g as the scoring mechanism)
• Learn the underlying probabilities from the training data given the network, or store just the counts and learn the probabilities lazily
• Classify new instances based upon these probabilities
• Must have an assumed order; can try several random orders and choose the best
• The network is consistent with the implied causal relationships, but…

[Diagram: dependency network over Dim 1 through Dim 5]
