Bayesian Network Classifier


Page 1: Bayesian Network Classifier

BAYESIAN NETWORK CLASSIFIER
Not so naïve any more, or: Bringing causality into the equation

Page 2: Bayesian Network Classifier

Review: Before covering Bayesian Belief Networks, a little review of the Naïve Bayesian Classifier.

Page 3: Bayesian Network Classifier

Approach: Look at the probability that an instance belongs to each class, given its value in each of its dimensions.

[Figure: a grid of per-dimension, per-class probabilities P]

Page 4: Bayesian Network Classifier

Example: Redness. If one of the dimensions were "redness": for a given redness value, which is the most probable fruit?

[Figure: "Distribution of Redness Values" - density vs. redness for apples, peaches, oranges, and lemons]

Page 5: Bayesian Network Classifier

Bayes Theorem

From the book, where h is a hypothesis and D is the training data:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

$$P(\text{apple} \mid \text{redness}=4.05) = \frac{P(\text{redness}=4.05 \mid \text{apple})\,P(\text{apple})}{P(\text{redness}=4.05)}$$

Page 6: Bayesian Network Classifier

If Non-Parametric…

[Figure: "Redness of Apples and Oranges" - histogram of redness values for the two fruits]

Say there are 2506 apples and 2486 oranges. The probability that redness would be 4.05, given an apple, is about 10/2506. P(apple)? 2506/(2506+2486). P(redness=4.05)? About (10+25)/(2506+2486). So:

$$P(\text{apple} \mid \text{redness}=4.05) = \frac{P(\text{redness}=4.05 \mid \text{apple})\,P(\text{apple})}{P(\text{redness}=4.05)} = \frac{\frac{10}{2506}\cdot\frac{2506}{2506+2486}}{\frac{10+25}{2506+2486}} = \frac{10}{10+25}$$
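A quick sanity check of that arithmetic (a minimal sketch; the counts 10, 25, 2506, and 2486 are the ones on the slide):

```python
# Nonparametric Bayes check for the apples/oranges example.
n_apples, n_oranges = 2506, 2486
apples_at_bin, oranges_at_bin = 10, 25   # counts in the redness = 4.05 bin

likelihood = apples_at_bin / n_apples                                 # P(redness | apple)
prior = n_apples / (n_apples + n_oranges)                             # P(apple)
evidence = (apples_at_bin + oranges_at_bin) / (n_apples + n_oranges)  # P(redness)

posterior = likelihood * prior / evidence                             # P(apple | redness)
print(posterior)                                          # 0.2857... = 10/35
print(apples_at_bin / (apples_at_bin + oranges_at_bin))   # same thing
```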

Page 7: Bayesian Network Classifier

Bayes

I think of the ratio of P(h) to P(D) as an adjustment to the easily determined P(D|h), in order to account for differences in sample size.

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

Here P(h) and P(D) are the prior probabilities, or "priors", and P(h|D) is the posterior probability.

Page 8: Bayesian Network Classifier

Naïve Bayes Classifier

The "naïve" term comes from the derivation below: each attribute $a_i$ is assumed to be conditionally independent of the others given the class $v_j$, which is what lets the likelihood factor into a simple product.

$$v_{NB} = \operatorname*{argmax}_{v_j \in V}\; P(v_j) \prod_i P(a_i \mid v_j)$$
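A minimal sketch of that decision rule over discrete attributes (hypothetical data layout and names; not the deck's code):

```python
from collections import Counter, defaultdict

def train(records, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    attr_counts = defaultdict(Counter)   # (class, dim) -> Counter of values
    for rec, cls in zip(records, labels):
        for dim, val in enumerate(rec):
            attr_counts[(cls, dim)][val] += 1
    return class_counts, attr_counts

def classify(instance, class_counts, attr_counts):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    total = sum(class_counts.values())
    best_cls, best_score = None, -1.0
    for cls, n_cls in class_counts.items():
        score = n_cls / total                                # prior P(v_j)
        for dim, val in enumerate(instance):
            score *= attr_counts[(cls, dim)][val] / n_cls    # P(a_i | v_j)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```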

Page 9: Bayesian Network Classifier

Can remove the naïveness: go with a covariance matrix instead of per-dimension standard deviations. The univariate normal density

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}$$

becomes the multivariate normal density

$$f(\vec{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\vec{x}-\vec{\mu})^T\,\Sigma^{-1}(\vec{x}-\vec{\mu})\right)$$
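A sketch of a class-conditional Gaussian classifier built on that density (a minimal illustration assuming NumPy/SciPy; not the deck's implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussians(X, y):
    """Fit one multivariate Gaussian (mean + full covariance) per class."""
    models = {}
    for cls in np.unique(y):
        Xc = X[y == cls]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False)      # dims x dims covariance matrix
        prior = len(Xc) / len(X)            # P(class)
        models[cls] = (mu, cov, prior)
    return models

def predict(x, models):
    """Pick the class maximizing f(x | class) * P(class)."""
    scores = {cls: multivariate_normal.pdf(x, mean=mu, cov=cov) * prior
              for cls, (mu, cov, prior) in models.items()}
    return max(scores, key=scores.get)
```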

Page 10: Bayesian Network Classifier

Solution

Counts in each dimension, per fruit:

         Red  Yellow  Mass  Vol
apples     0     235   106    3
peaches    0     262   176   57
oranges    9     263   143    7
lemons    22     239   239  184
Total     31     999   664  251

Per-dimension probabilities, with their product in the last column:

         Red   Yellow  Mass  Vol   Product
apples   0     0.24    0.16  0.01  0
peaches  0     0.26    0.27  0.23  0
oranges  0.29  0.26    0.22  0.28  0.0004
lemons   0.71  0.24    0.36  0.73  0.0044

Page 11: Bayesian Network Classifier

Lot of work:
• Need Bayes rule; instead of simply multiplying each dimensional probability…
• Must compute a multivariate covariance matrix (num dims × num dims)
• Must calculate the multivariate PDF and all priors, which includes getting the inverse of the covariance matrix
• Only useful if covariance is a strong and predictive component of the data

Page 12: Bayesian Network Classifier

Other ways of removing naïveté: sometimes it is useful to infer causal relationships.

[Diagram: Dim x → Dim y, edge labeled "Causes"]

Page 13: Bayesian Network Classifier

If we can figure out a causal relationship between dimensions ("Dim x causes Dim y"), they are no longer independent; they are conditional. In terms of bins?

$$P(y\ \text{is class}\ A \mid \text{caused by}\ x) = P(y\ \text{is class}\ A \mid x\ \text{is class}\ A)$$

Page 14: Bayesian Network Classifier

Problem:
• Determine a dependency network from the data
• Use the dependencies to determine the probabilities that an instance is a given class
• Use those probabilities to classify

[Diagram: dependency network over Dim 1 through Dim 5]

Page 15: Bayesian Network Classifier

DAG: Directed Acyclic Graph. Used to represent a dependency network; known as a Bayesian Belief Network structure.

[Diagram: DAG over Dim 1 through Dim 5]

Page 16: Bayesian Network Classifier

Algorithm: not in the book. It comes from a 39-page paper, "A Bayesian Method for the Induction of Probabilistic Networks from Data", and is known as the K2 algorithm.

Page 17: Bayesian Network Classifier

Similar to a decision tree: each node is a dimension, but instead of representing a decision it represents a conditional relationship. The algorithm for selecting nodes is greedy.

Quote from the paper: "The algorithm is named K2 because it evolved from a system named Kutató (Herskovits & Cooper, 1990) that applies the same greedy-search heuristics. As we discuss in section 6.1.3, Kutató uses entropy to score network structures."

[Diagram: DAG over Dim 1 through Dim 5]

Page 18: Bayesian Network Classifier

General Approach (greedy): find the parents of each node.
• Determine the no-parent score for a given node (dimension)
• For each remaining dimension (node), determine the probability (score) that that dimension is the parent of the given node (does a dependency appear to exist?)
• Compare the score of the best candidate to the no-parent score
• If better, keep it as a parent and repeat (see if another parent can be added)
• Otherwise, done

In short: find the best parent; if it improves the score, keep it; then find the next "best parent".

Page 19: Bayesian Network Classifier

How to score? Compute the probability that a given data "configuration" could belong to a given DAG: do records with a given value in one dimension tend to have a specific value in another dimension?

[Diagram, example from the book: network over Storm, BusTourGroup, Campfire, Lightning, Thunder, and ForestFire]

Page 20: Bayesian Network Classifier

Bayesian Belief Network. BS = belief network structure. How probable?

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Don't panic! We'll get through it.

Page 21: Bayesian Network Classifier

Proof: the last 5 pages of the (39-page) paper, "A Bayesian Method for the Induction of Probabilistic Networks from Data".

Page 22: Bayesian Network Classifier

Bayesian Belief Network. BS = belief network structure. How probable?

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

• n = number of dimensions (nodes)
• qi = number of unique instantiations of node i's parents
  • If one parent, q = the number of distinct values in the parent
  • If two parents, q = (num values in parent 1) × (num values in parent 2)
• ri = number of distinct values possible in the dimension
• Nijk = number of records with value k in the current dimension that match parental instantiation j
• Nij = number of records that match parental instantiation j (the sum of the Nijk's)

Page 23: Bayesian Network Classifier

Intuition: think of the $\frac{(r_i-1)!}{(N_{ij}+r_i-1)!}$ factor as a random-match probability: what are the chances that the values seen in a dimension (for the records that match the parental instantiation) could occur randomly?

Think of the $\prod_k N_{ijk}!$ factor as an adjustment upward (since it shows up in the numerator) indicating how the data is actually organized: how organized is the data in the child dimension? For example, 6!·0! is 720 while 3!·3! is 36. Sound familiar?

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 24: Bayesian Network Classifier

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Algorithm: greedy algorithm for finding parents. For a given dimension:
• Check the no-parent probability; store it in Pold
• Then choose the parent that maximizes g
• If that probability is greater than Pold, add the parent to the list of parents and update Pold
• Keep adding until the probability can't be increased

Page 25: Bayesian Network Classifier

No-Parent ("Orphan") Probability?

There is only one "instantiation". With no parent filtering, Nij is the entire set of training samples, and Nijk is the number of training records where the current dimension has value vk.

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 26: Bayesian Network Classifier

Example from the paper: three nodes.
• Two instantiations for parent X2 (of child X3): X2 has value absent; X2 has value present
• Two instantiations for parent X1 (of child X2): X1 has value absent; X1 has value present

Page 27: Bayesian Network Classifier

Some numbers. For the X2 instantiation with value absent:
• Number of X3 absents that were X2 absents: 4
• Number of X3 presents that were X2 absents: 1

For the X2 instantiation with value present:
• X3 absent | X2 present: 0
• X3 present | X2 present: 5

Page 28: Bayesian Network Classifier

X3 Calculations. For dimension (i) = 3, one factor of

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

per parental instantiation of X2:

$$\frac{(2-1)!\,4!\,1!}{(5+2-1)!} \quad\text{(X2 absent)} \qquad \frac{(2-1)!\,0!\,5!}{(5+2-1)!} \quad\text{(X2 present)}$$

Page 29: Bayesian Network Classifier

Some more numbers. For the X1 instantiation with value absent:
• Number of X2 absents that were X1 absents: 4
• Number of X2 presents that were X1 absents: 1

For the X1 instantiation with value present:
• X2 absent | X1 present: 1
• X2 present | X1 present: 4

Page 30: Bayesian Network Classifier

X2 Calculations. For dimension (i) = 2, one factor per parental instantiation of X1:

$$\frac{(2-1)!\,4!\,1!}{(5+2-1)!} \quad\text{(X1 absent)} \qquad \frac{(2-1)!\,1!\,4!}{(5+2-1)!} \quad\text{(X1 present)}$$

Page 31: Bayesian Network Classifier

Some more numbers. Dimension 1 has no parents:
• Number of X1 absents: 5
• Number of X1 presents: 5

Page 32: Bayesian Network Classifier

X1 Calculations. For dimension (i) = 1, which has no parents (a single instantiation covering all 10 records):

$$\frac{(2-1)!\,5!\,5!}{(10+2-1)!}$$

Page 33: Bayesian Network Classifier

Putting it all together: the whole enchilada. The article calls this structure BS1 (X1 → X2 → X3):

$$P(B_{S1}, D) = P(B_{S1}) \cdot \frac{(2-1)!\,5!\,5!}{(10+2-1)!} \cdot \frac{(2-1)!\,1!\,4!}{(5+2-1)!} \cdot \frac{(2-1)!\,4!\,1!}{(5+2-1)!} \cdot \frac{(2-1)!\,0!\,5!}{(5+2-1)!} \cdot \frac{(2-1)!\,4!\,1!}{(5+2-1)!}$$

$$P(B_{S1}, D) = P(B_{S1}) \cdot 2.23 \times 10^{-9}$$
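Multiplying those five factors out numerically (a quick check; it matches the paper's 2.23 × 10⁻⁹):

```python
from math import factorial as f

def term(counts):
    """One (r_i - 1)! / (N_ij + r_i - 1)! * prod(N_ijk!) factor, with r_i = 2."""
    n_ij = sum(counts)
    num = f(2 - 1)
    for n_ijk in counts:
        num *= f(n_ijk)
    return num / f(n_ij + 2 - 1)

score = (term([5, 5])                    # X1: no parents, all 10 records
         * term([1, 4]) * term([4, 1])   # X2: one factor per X1 instantiation
         * term([0, 5]) * term([4, 1]))  # X3: one factor per X2 instantiation
print(score)                             # ~2.23e-09
```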

Page 34: Bayesian Network Classifier

Comparing networks: the article compares BS1 to a second structure, BS2. Assuming that P(BS1) = P(BS2), BS1 comes out ten times more probable than BS2.

[Diagram: the alternative structure BS2 over X1, X2, X3]

Page 35: Bayesian Network Classifier

Remember: we are not calculating the whole tree, just a set of parents for a single node, so there is no need for the first product:

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 36: Bayesian Network Classifier

Result: we get a list of the most likely parents for each node.

[Diagram: network over Storm, BusTourGroup, Campfire, Lightning, Thunder, and ForestFire]

Page 37: Bayesian Network Classifier

K2 algorithm (more formally):

1. procedure K2;
2. {Input: A set of n nodes, an ordering on the nodes, an upper bound u on the
3.  number of parents a node may have, and a database D containing m cases.}
4. {Output: For each node, a printout of the parents of the node.}
5. for i := 1 to n do
6.   πi := ∅;
7.   Pold := g(i, πi); {This function is computed using equation (12).}
8.   OKToProceed := true;
9.   while OKToProceed and |πi| < u do
10.    let z be the node in Pred(xi) - πi that maximizes g(i, πi ∪ {z});
11.    Pnew := g(i, πi ∪ {z});
12.    if Pnew > Pold then
13.      Pold := Pnew;
14.      πi := πi ∪ {z};
15.    else OKToProceed := false;
16.  end {while};
17.  write('Node: ', xi, ' Parents of this node: ', πi);
18. end {for};
19. end {K2};

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 38: Bayesian Network Classifier

The "g" function:

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Function g(i, set of parents) {
  Set score = 1
  If the set of parents is empty
    There is a single "instantiation": Nij is the size of the entire training set, and Sv is the entire training set
    Score *= (ri − 1)! / (Nij + ri − 1)!
    For each child instantiation (e.g. 0 and 1)
      Get the count of matching records in Sv; that is Nijk
      (with two child values there are two Nijk's, and ri = 2)
      Score *= Nijk!
  Else
    Get the parental instantiations (e.g. 00, 01, 10, 11)
    For each parental instantiation
      Get the training records that match (Sv); the size of that set is Nij
      Score *= (ri − 1)! / (Nij + ri − 1)!
      For each child instantiation (e.g. 0 and 1)
        Get the count of matching records in Sv; that is Nijk
        Score *= Nijk!
  Return Score
}
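A runnable sketch of g and the K2 loop under those definitions, in log space to sidestep the huge factorials discussed later in the deck (the data layout, records as tuples of discrete values, and all names are my assumptions, not the paper's code):

```python
from itertools import product
from math import lgamma

def log_fact(n):
    return lgamma(n + 1)        # log(n!), without computing n! itself

def g_log(i, parents, records, values):
    """log g(i, parents). values[d] lists the legal values of dimension d."""
    r_i = len(values[i])
    total = 0.0
    # One pass per parental instantiation j (a single empty one if no parents).
    for inst in product(*(values[p] for p in parents)):
        matching = [rec for rec in records
                    if all(rec[p] == v for p, v in zip(parents, inst))]
        n_ij = len(matching)
        n_ijk = [sum(1 for rec in matching if rec[i] == val)
                 for val in values[i]]
        total += (log_fact(r_i - 1) - log_fact(n_ij + r_i - 1)
                  + sum(log_fact(n) for n in n_ijk))
    return total

def k2(records, values, order, u):
    """Greedy K2: returns {node: parents}, scanning nodes in `order`."""
    parents = {}
    for pos, i in enumerate(order):
        pi = []
        p_old = g_log(i, pi, records, values)
        while len(pi) < u:
            preds = [z for z in order[:pos] if z not in pi]  # Pred(x_i) - pi
            if not preds:
                break
            z = max(preds, key=lambda z: g_log(i, pi + [z], records, values))
            p_new = g_log(i, pi + [z], records, values)
            if p_new <= p_old:
                break               # OKToProceed := false
            p_old, pi = p_new, pi + [z]
        parents[i] = pi
    return parents
```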

Page 39: Bayesian Network Classifier

Implementation: straightforward.
• Pred(i): returns all nodes that come before i in the ordering
• g: our formula

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 40: Bayesian Network Classifier

How to get instantiations: the no-parent approach. With no parents, work with all records in the training set to accumulate counts for the current dimension.

$$\frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 41: Bayesian Network Classifier

How to get instantiations with parents. My first attempt (what's wrong with this?):

For each parent
  For each possible value in that parent's dimension
    Accumulate values
  End for
End for

Page 42: Bayesian Network Classifier

Instantiations: you have to know which values to use for every parent when accumulating counts, so you must get all the instantiations first.

parents: 0 2 3 4
instantiations:
0 0 0 0
0 0 0 1
0 0 1 0
0 0 1 1
0 1 0 0
0 1 0 1
0 1 1 0
0 1 1 1
1 0 0 0
1 0 0 1
1 0 1 0
1 0 1 1
1 1 0 0
1 1 0 1
1 1 1 0
1 1 1 1

For each instantiation
  Compute the first portion of the numerator: log((ri − 1)!)
  For each possible value in the current dimension
    Get the counts that match the instantiation and the value in the current dimension
    Update a running sum of counts (for Nij)
    Update a running sum of log factorials (for the Nijk's)
  End for
  Add the sum of log factorials to the original numerator
  Compute the denominator: log((Nij + ri − 1)!)
  Subtract it from the numerator
End for

Page 43: Bayesian Network Classifier

But how to generate the instantiations (parents 0 2 3 4, 0000 through 1111, as on the previous slide)? What if each dimension had a different number of legal values? I generated an increment function.
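A sketch of such an increment function: mixed-radix counting, so each parent dimension can have its own number of legal values (in Python, itertools.product would do the same job; names here are mine):

```python
def increment(inst, bases):
    """Advance inst (one digit per parent) to the next instantiation.

    bases[d] is the number of legal values in parent dimension d.
    Returns False once every instantiation has been produced.
    """
    for d in range(len(inst) - 1, -1, -1):   # rightmost digit first
        inst[d] += 1
        if inst[d] < bases[d]:
            return True
        inst[d] = 0                          # carry into the next digit
    return False

# Usage: enumerate every instantiation of four binary parents.
inst, bases = [0, 0, 0, 0], [2, 2, 2, 2]
while True:
    print(inst)                              # 0000, 0001, ..., 1111
    if not increment(inst, bases):
        break
```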

Page 44: Bayesian Network Classifier

What's this ordering nonsense? "An ordering on the nodes…"

It's got to be a DAG (acyclic). The algorithm ensures this with an ordering: an assumed order, where Pred(i) returns all nodes that occur earlier in the ordering.

Page 45: Bayesian Network Classifier

How to get around the order issue? The paper gives a couple of suggestions ("random" and "backwards"):
• Randomly shuffle the order; do this several times and take the best-scoring network
• Or take a whole different approach to generating the network: start fully connected, remove the edge that increases P(BS) the most, and continue until no removal can increase it
• Use whichever result is better (random or reverse)

Page 46: Bayesian Network Classifier

Even with an ordering… the number of possible structures grows exponentially. The paper states that even with an ordering constraint there are $2^{n(n-1)/2}$ networks: once again binary membership (1 means the edge is part of the graph, 0 means it is not) over all unidirectional edges, a Cartesian product of per-edge choices (hence "from Descartes"); think of a distance matrix (from row to column).

NP-hard: the book states that exact inference of probabilities for an arbitrary Bayesian network is known to be NP-hard (guess who… Cooper, 1990).

Page 47: Bayesian Network Classifier

P(BS) is one over the count of all Bayesian belief structures (BS's). That's a lot of networks, and it is at the heart of the derivation.

$$P(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 48: Bayesian Network Classifier

How do we know the order? A priori, we have no knowledge of the network structure.

[Diagram: network over Storm, BusTourGroup, Campfire, Lightning, Thunder, and ForestFire]

Page 49: Bayesian Network Classifier

Bigger example: what if we had a thousand training records? When determining the first Pold (no parents), what will Nij be? How do we get around that? Perl presents 1000! as "inf".

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 50: Bayesian Network Classifier

The paper discusses this in its time-complexity section: switch to log values. Then you can add and subtract instead of multiplying and dividing (faster). They even pre-calculated every log-factorial value, up to the number of training values plus the maximum number of distinct values. The approach that helped time complexity also helps in managing extremely large numbers.

Page 51: Bayesian Network Classifier

Formula for the log factorial; easy enough to write your own function:

$$\log(n!) = \sum_{i=1}^{n} \log(i)$$
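A minimal version of that function, the direct sum from the formula (in practice Python's math.lgamma(n + 1) computes the same value in constant time):

```python
import math

def log_factorial(n):
    """log(n!) as a sum of logs; avoids ever forming the huge n! itself."""
    return sum(math.log(i) for i in range(1, n + 1))

print(log_factorial(1000))       # ~5912.13, where float(1000!) overflows
print(math.lgamma(1000 + 1))     # same value via the gamma function
```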

Page 52: Bayesian Network Classifier

How does this impact the implementation?

Page 53: Bayesian Network Classifier

Revisit the algorithm: greedy algorithm for finding parents. For a given dimension:
• Check the no-parent probability; store it in Pold
• Then choose the parent that maximizes g
• If that probability is greater than Pold, add the parent to the list of parents and update Pold
• Keep adding until the probability can't be increased

$$g(i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i-1)!}{(N_{ij}+r_i-1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

Page 54: Bayesian Network Classifier

Have the network; now what? Training:
• For dimensions (nodes) with no parents, calculate counts as usual (they are considered independent, so the naïve process is appropriate)
• For dimensions with parents, calculate counts for all possible combinations of parent values
  • One parent: count records for each parent value (present; absent)
  • Two parents: count records for each combination of parent values (present, present; present, absent; absent, present; absent, absent)

Page 55: Bayesian Network Classifier

I didn't do this: I made my algorithm lazy and did my "with-parents" counts during the test process.

Total = 0
Num in class = 0
For each training record
  Count this record = true
  For each parent
    If the record's value (or bin) in that parent dimension does not equal the test instance's value (or bin)
      Count this record = false
  If count this record is true
    Total++
    If the training record's class equals the test class
      Num in class++
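The same lazy count as a small Python function (a sketch; the record layout and names are mine):

```python
def lazy_conditional_count(train, train_labels, parents, test_instance, test_class):
    """Count training records whose parent-dimension values match the test
    instance, and how many of those carry the candidate class."""
    total = in_class = 0
    for rec, cls in zip(train, train_labels):
        if all(rec[p] == test_instance[p] for p in parents):
            total += 1
            if cls == test_class:
                in_class += 1
    return in_class / total if total else 0.0
```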

Page 56: Bayesian Network Classifier

Whole Algorithm.

Train:
• Build the network
• Generate counts, with all possible parental value combinations

Test (just like naïve):
• For each dimension (node), calculate the probability for each class; note that those with parents will be conditional probabilities, i.e. just look at the training samples that match the parental values
• For each class, multiply together the probabilities that the test instance is that class across all nodes (dimensions)
• Choose the class with the maximum probability as your prediction (a sketch of this test step follows below)
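One reading of that test step as code (assumes discrete values; parents_of maps each dimension to its learned parent list; all helper names are mine, not the deck's):

```python
def dim_prob(train, train_labels, dim, parents, test_instance, cls):
    """P(test's value in dim | class = cls), estimated from the training
    records of that class which match the test instance's parent values."""
    matches = [rec for rec, lab in zip(train, train_labels)
               if lab == cls
               and all(rec[p] == test_instance[p] for p in parents)]
    if not matches:
        return 0.0
    return sum(rec[dim] == test_instance[dim] for rec in matches) / len(matches)

def classify(test_instance, classes, parents_of, train, train_labels):
    """Multiply per-dimension (conditional) probabilities; take the argmax."""
    def score(cls):
        s = 1.0
        for dim in range(len(test_instance)):
            s *= dim_prob(train, train_labels, dim,
                          parents_of[dim], test_instance, cls)
        return s
    return max(classes, key=score)
```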

Page 57: Bayesian Network Classifier

Lazy Algorithm.

Train:
• Build the network
• Generate counts

Test:
• For each dimension (node), calculate the probability for each class; note that those with parents will have to be calculated on the fly (lazy)
• For each class, multiply together the probabilities that the test instance is that class across all nodes (dimensions)
• Choose the class with the maximum probability as your prediction

Page 58: Bayesian Network Classifier

Lazy and Random Ordering Algorithm.

Train:
• Try multiple random orderings; keep the best network
• Generate counts

Test:
• For each dimension (node), calculate the probability for each class; note that those with parents will have to be calculated on the fly (lazy)
• For each class, multiply together the probabilities that the test instance is that class across all nodes (dimensions)
• Choose the class with the maximum probability as your prediction

Page 59: Bayesian Network Classifier

Formulaic Representation. Probability representation; note the comma. The book utilizes… For instance, if the parents were dimensions 1 and 2…

Page 60: Bayesian Network Classifier

Terminology: framed in terms of independence vs. dependence.

Quotes from the paper: "Cases occur independently given a belief network." "Nodes not connected represent variables which are conditionally independent of each other."

From the book: "A Bayesian belief network describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities."

Page 61: Bayesian Network Classifier

Can be given a network: you don't necessarily have to "learn" it; it could be the result of domain knowledge.

[Diagram: network over Storm, BusTourGroup, Campfire, Lightning, Thunder, and ForestFire]

Page 62: Bayesian Network Classifier

Causality? Do the edges really represent causal relationships? Maybe not. A lack of conditional independence does not necessarily imply causality ("correlation does not imply causation"); both variables could be symptoms of some other cause. It is, though, necessary for causation.

[Diagram: Dim x → Dim y, edge labeled "Causes"]

Page 63: Bayesian Network Classifier

Famous example: studies showed that women who were taking combined hormone replacement therapy (HRT) also had a lower-than-average incidence of coronary heart disease (CHD), which led doctors to propose that HRT was protective against CHD. Randomized controlled trials, however, showed that HRT caused a small but statistically significant increase in the risk of CHD. The subjects in the original studies were more likely to be from higher socio-economic groups.

Page 64: Bayesian Network Classifier

Another third-factor example: sleeping with one's shoes on is strongly correlated with waking up with a headache. Therefore, sleeping with one's shoes on causes headaches? Maybe drunk people are simply more likely both to sleep with their shoes on and to wake up with a headache.

Page 65: Bayesian Network Classifier

Another third factor: as ice cream sales increase, the rate of drowning deaths increases sharply. Therefore, ice cream causes drowning? Ice cream sales increase greatly in summer, as do drownings.

Page 66: Bayesian Network Classifier

Directionality: it is also tough to determine the direction of a causal relationship. Example: the more firemen fighting a fire, the bigger the fire is observed to be. Therefore, firemen cause fire?

Page 67: Bayesian Network Classifier

Project note: the project has only 7 positives (993 non-forest-fires). Both Weka and my first version achieved 99.3% accuracy: the classifier learned that it could simply predict negative every time and still score 99.3%. Solution?

99.3% accuracy ain't bad. Or is it?

Page 68: Bayesian Network Classifier

Summary: a greedy algorithm.
• Learn the network (using g as the scoring mechanism)
• Learn the underlying probabilities from the training data given the network, or store just the counts and learn the probabilities lazily
• Classify new instances based upon these probabilities
• Must have an assumed order; can try several random orders and choose the best
• The network is consistent with the implied causal relationships, but…

[Diagram: dependency network over Dim 1 through Dim 5]
