Slide 1 Aug 25th, 2001 Copyright © 2001, Andrew W. Moore Probabilistic Machine Learning Brigham S....
Slide 1
Probabilistic Machine Learning
Brigham S. Anderson
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~brigham
2
ML: Some Successful Applications
• Learning to recognize spoken words (speech recognition);
• Text categorization (SPAM, newsgroups);
• Learning to play world-class chess, backgammon and checkers;
• Handwriting recognition;
• Learning to classify new astronomical data;
• Learning to detect cancerous tissues (e.g. colon polyp detection).
3
Machine Learning Application Areas
• Science: astronomy, bioinformatics, drug discovery, …
• Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, …
• Web: search engines, bots, …
• Government: law enforcement, profiling tax cheaters, anti-terror(?)
4
Classification Application: Assessing Credit Risk
• Situation: Person applies for a loan.
• Task: Should a bank approve the loan?
• Banks develop credit models using a variety of machine learning methods.
• Mortgage and credit card proliferation are the result of being able to successfully predict whether a person is likely to default on a loan.
• Widely deployed in many countries.
5
Prob. Table Anomaly Detector
• Suppose we have the following model:

P(Mpg, Horse):
P(good, low) = 0.36   P(good, high) = 0.04
P( bad, low) = 0.12   P( bad, high) = 0.48

• We’re trying to detect anomalous cars.
• If the next example we see is <good, high>, how anomalous is it?
6
Prob. Table Anomaly Detector

How likely is <good, high>?

likelihood(good, high) = P(good, high) = 0.04

P(Mpg, Horse):
P(good, low) = 0.36   P(good, high) = 0.04
P( bad, low) = 0.12   P( bad, high) = 0.48
7
How likely is a <good, high, fast> example?

P(good, high, fast) = P(good) P(high | good) P(fast | high)
                    = (0.4)(0.11)(0.89)
                    ≈ 0.039

Mpg → Horse → Accel

P(Mpg):          P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):  P( low | good) = 0.89, P( low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89

Bayes Net Anomaly Detector
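As a sanity check, the chain-rule computation above can be sketched in a few lines of Python. The dictionary layout and the `likelihood` helper are mine, not the slides'; the CPT numbers are the ones on the slide.

```python
# CPTs from the slide's Mpg -> Horse -> Accel Bayes net.
P_MPG = {"good": 0.4, "bad": 0.6}
P_HORSE_GIVEN_MPG = {
    ("low", "good"): 0.89, ("low", "bad"): 0.21,
    ("high", "good"): 0.11, ("high", "bad"): 0.79,
}
P_ACCEL_GIVEN_HORSE = {
    ("slow", "low"): 0.95, ("slow", "high"): 0.11,
    ("fast", "low"): 0.05, ("fast", "high"): 0.89,
}

def likelihood(mpg, horse, accel):
    """Chain rule: P(Mpg) * P(Horse | Mpg) * P(Accel | Horse)."""
    return (P_MPG[mpg]
            * P_HORSE_GIVEN_MPG[(horse, mpg)]
            * P_ACCEL_GIVEN_HORSE[(accel, horse)])

# The slide's example: how likely is <good, high, fast>?
print(round(likelihood("good", "high", "fast"), 3))  # 0.039
```

A low likelihood like this is exactly what the anomaly detector flags: the example is legal, just improbable under the model.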
8
Probability Model Uses
• Classifier: Data point x → P(C | x)
• Anomaly Detector: Data point x → P(x)
• Inference Engine: Evidence e1 → P(E2 | e1) for missing variables E2
9
Bayes Classifiers
• A formidable and sworn enemy of decision trees
Classifier: Data point x → P(C | x)
10
Dead-Simple Bayes Classifier Example
• Suppose we have the following model:

P(Mpg, Horse):
P(good, low) = 0.36   P(good, high) = 0.04
P( bad, low) = 0.12   P( bad, high) = 0.48

• We’re trying to classify cars as Mpg = “good” or “bad”.
• If the next example we see is Horse = “low”, how do we classify it?
11
Dead-Simple Bayes Classifier Example

How do we classify <Horse = low>?

P(good | low) = P(good, low) / P(low)
              = P(good, low) / [P(good, low) + P( bad, low)]
              = 0.36 / (0.36 + 0.12)
              = 0.75

P(Mpg, Horse):
P(good, low) = 0.36   P(good, high) = 0.04
P( bad, low) = 0.12   P( bad, high) = 0.48

P(good | low) = 0.75, so we classify the example as “good”.
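A hedged Python sketch of this joint-table classifier (the table layout and helper names are mine): classifying on `Horse = low` is just normalizing the `low` column of the joint table.

```python
# Joint probability table P(Mpg, Horse) from the slide.
P_JOINT = {("good", "low"): 0.36, ("good", "high"): 0.04,
           ("bad", "low"): 0.12, ("bad", "high"): 0.48}

def posterior(mpg, horse):
    """P(Mpg = mpg | Horse = horse): pick out the entries matching
    `horse` in the joint table and normalize."""
    p_horse = sum(p for (_, h), p in P_JOINT.items() if h == horse)
    return P_JOINT[(mpg, horse)] / p_horse

def classify(horse):
    """Return the Mpg value with the highest posterior."""
    return max(("good", "bad"), key=lambda m: posterior(m, horse))

# P(good | low) = 0.36 / (0.36 + 0.12) = 0.75, so:
print(classify("low"))  # good
```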
12
Bayes Classifiers
• That was just inference!
• In fact, virtually all machine learning tasks are a form of inference
• Anomaly detection: P(x)
• Classification: P(Class | x)
• Regression: P(Y | x)
• Model Learning: P(Model | dataset)
• Feature Selection: P(Model | dataset)
13
Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) P(low | good) P(fast | low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast)
                    = 0.0178 / [P(good, low, fast) + P( bad, low, fast)]
                    ≈ 0.75

Mpg → Horse → Accel

P(Mpg):          P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):  P( low | good) = 0.89, P( low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…
14
Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) P(low | good) P(fast | low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast)
                    = 0.0178 / [P(good, low, fast) + P( bad, low, fast)]
                    ≈ 0.75

Mpg → Horse → Accel

P(Mpg):          P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):  P( low | good) = 0.89, P( low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…
P(good | low, fast) = 0.75, so we classify the example as “good”.
…but that seems somehow familiar…
Wasn’t that the same answer asP(good | low)?
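It is the same answer, and for a structural reason: Accel is downstream of Horse in the net, so once Horse is observed, the P(fast | low) factor multiplies numerator and denominator alike and cancels. A small sketch (CPTs copied from the slide; the helper names are mine) confirms the two posteriors agree:

```python
# CPTs from the slide's Mpg -> Horse -> Accel network.
P_MPG = {"good": 0.4, "bad": 0.6}
P_HORSE = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}
P_ACCEL = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
           ("fast", "low"): 0.05, ("fast", "high"): 0.89}

def joint(mpg, horse, accel):
    """Chain rule: P(Mpg) * P(Horse | Mpg) * P(Accel | Horse)."""
    return P_MPG[mpg] * P_HORSE[(horse, mpg)] * P_ACCEL[(accel, horse)]

def post_given_low_fast(mpg):
    """P(Mpg = mpg | Horse = low, Accel = fast)."""
    den = joint("good", "low", "fast") + joint("bad", "low", "fast")
    return joint(mpg, "low", "fast") / den

def post_given_low(mpg):
    """P(Mpg = mpg | Horse = low), ignoring Accel entirely."""
    den = sum(P_MPG[m] * P_HORSE[("low", m)] for m in ("good", "bad"))
    return P_MPG[mpg] * P_HORSE[("low", mpg)] / den

# Both posteriors come out around 0.74 (the slide's "not exactly 0.75"),
# and they are identical because P(fast | low) cancels.
```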
15
Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)

Mpg → Horse → Accel

P(Mpg):          P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):  P( low | good) = 0.89, P( low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89
16
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
17
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, … Xm | Y = vi).
18
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, … Xm | Y = vi).
• Idea: When a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, … Xm | Y = vi) most likely:

Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v)

Is this a good idea?
19
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, … Xm | Y = vi).
• Idea: When a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, … Xm | Y = vi) most likely:

Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v)

Is this a good idea?
This is a Maximum Likelihood classifier.
It can get silly if some Ys are very unlikely.
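A toy numeric illustration of that silliness (the disease scenario and its numbers are invented for this sketch, not from the slides): a very rare class can have the higher likelihood for an observation, so maximum likelihood picks it, while weighting by the prior flips the decision.

```python
# Invented numbers: a symptom is explained slightly better by a rare
# disease, but the disease's prior is tiny.
prior = {"healthy": 0.999, "rare_disease": 0.001}  # P(class)
lik = {"healthy": 0.05, "rare_disease": 0.90}      # P(symptom | class)

# Maximum likelihood ignores the prior...
mle_pick = max(prior, key=lambda c: lik[c])
# ...while the MAP rule weights the likelihood by P(class).
map_pick = max(prior, key=lambda c: lik[c] * prior[c])

print(mle_pick, map_pick)  # rare_disease healthy
```

Here MLE diagnoses nearly everyone with the rare disease, which is exactly the failure mode the slide warns about; the next slide's MAP rule fixes it.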
20
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, … Xm | Y = vi).
• Idea: When a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(Y = vi | X1, X2, … Xm) most likely:

Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)

Is this a good idea?
Much Better Idea
21
Terminology
• MLE (Maximum Likelihood Estimator):

  Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v)

• MAP (Maximum A-Posteriori Estimator):

  Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)
22
Getting what we need
Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)
23
Getting a posterior probability
P(Y = v | X1 = u1 … Xm = um)
  = P(X1 = u1 … Xm = um | Y = v) P(Y = v) / P(X1 = u1 … Xm = um)
  = P(X1 = u1 … Xm = um | Y = v) P(Y = v) / Σ_{j=1..nY} P(X1 = u1 … Xm = um | Y = vj) P(Y = vj)
24
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, … Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction:

Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)
         = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)
25
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, … Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction:

Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)
         = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)

We can use our favorite Density Estimator here. Right now we have three options:
• Probability Table
• Naïve Density
• Bayes Net
26
Joint Density Bayes Classifier
Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)

In the case of the joint Bayes Classifier this degenerates to a very simple rule:

Ypredict = the most common value of Y among records in which X1 = u1, X2 = u2, …, Xm = um.

Note that if no records have the exact set of inputs X1 = u1, X2 = u2, …, Xm = um, then P(X1, X2, … Xm | Y = vi) = 0 for all values of Y.
In that case we just have to guess Y’s value
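The recipe can be sketched as a tiny joint Bayes classifier in Python. This is a minimal sketch, not the slides' code: `fit_joint_bc`, `predict`, and the toy 100-record dataset are mine, with the dataset chosen so its frequencies reproduce the slide's P(Mpg, Horse) table.

```python
from collections import Counter, defaultdict

def fit_joint_bc(records):
    """records: list of (x_tuple, y) pairs. Returns class priors P(Y)
    and one table of P(x | Y=y) per class y."""
    counts = Counter(y for _, y in records)
    tables = defaultdict(Counter)
    for x, y in records:
        tables[y][x] += 1
    n = len(records)
    priors = {y: c / n for y, c in counts.items()}
    cond = {y: {x: c / counts[y] for x, c in t.items()}
            for y, t in tables.items()}
    return priors, cond

def predict(priors, cond, x):
    """MAP rule: argmax_v P(x | Y=v) P(Y=v). Unseen x gives 0 for every
    class, so the result is effectively a guess (the slide's caveat)."""
    return max(priors, key=lambda y: cond[y].get(x, 0.0) * priors[y])

# Toy 100-record dataset matching P(Mpg, Horse) = 0.36/0.04/0.12/0.48:
data = ([(("low",), "good")] * 36 + [(("high",), "good")] * 4
        + [(("low",), "bad")] * 12 + [(("high",), "bad")] * 48)
priors, cond = fit_joint_bc(data)
print(predict(priors, cond, ("low",)))   # good
```

On this data the MAP score for "good" given `low` is 0.9 × 0.4 = 0.36 versus 0.2 × 0.6 = 0.12 for "bad", matching the joint-table entries from the earlier slides.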
27
Joint BC Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The Classifier learned by “Joint BC”
28
Joint BC Results: “All Irrelevant”
The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, … o, where a, b, c, … o are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.
29
30
BC Results: “MPG”: 392 records
The Classifier learned by “Naive BC”
31
Joint Distribution
(Joint distribution over Mpg, Horsepower, Acceleration, and Maker)
32
Joint Distribution
Recall: a joint distribution can be decomposed via the chain rule…

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

Note that this takes the same amount of information to create. We “gain” nothing from this decomposition.
33
Naive Distribution
Mpg is the sole parent of every other attribute:
P(Mpg)
P(Cylinders | Mpg)
P(Horsepower | Mpg)
P(Weight | Mpg)
P(Modelyear | Mpg)
P(Maker | Mpg)
P(Acceleration | Mpg)
34
Naïve Bayes Classifier
Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)

In the case of the naive Bayes Classifier this can be simplified:

Ypredict = argmax_v P(Y = v) Π_{j=1..m} P(Xj = uj | Y = v)
35
Naïve Bayes Classifier
Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)

In the case of the naive Bayes Classifier this can be simplified:

Ypredict = argmax_v P(Y = v) Π_{j=1..m} P(Xj = uj | Y = v)

Technical Hint: If you have 10,000 input attributes that product will underflow in floating point math. You should use logs:

Ypredict = argmax_v [ log P(Y = v) + Σ_{j=1..m} log P(Xj = uj | Y = v) ]
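The hint is easy to demonstrate. This sketch (the function name is mine) compares the raw naive-Bayes product with the log-space sum for 10,000 attributes:

```python
import math

def naive_log_score(prior, cond_probs):
    """log P(Y=v) + sum_j log P(Xj=uj | Y=v) for one candidate class v."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

# 10,000 attributes, each contributing probability 0.5:
product = 0.5 ** 10_000                           # underflows to exactly 0.0
log_score = naive_log_score(0.5, [0.5] * 10_000)  # finite and usable
print(product, log_score)
```

Every class's product underflows to 0.0, so the argmax becomes an arbitrary tie-break; the log scores stay finite and still rank classes correctly, since log is monotonic.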
36
BC Results: “XOR”
The “XOR” dataset consists of 40,000 records and 2 boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b.
The Classifier learned by “Naive BC”
The Classifier learned by “Joint BC”
37
Naive BC Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The Classifier learned by “Naive BC”
38
Naive BC Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The Classifier learned by “Joint BC”
This result surprised Andrew until he had thought about it a little
39
Naïve BC Results: “All Irrelevant”
The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, … o, where a, b, c, … o are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.
The Classifier learned by “Naive BC”
40
BC Results: “MPG”: 392 records
The Classifier learned by “Naive BC”
41
BC Results: “MPG”: 40 records
42
More Facts About Bayes Classifiers
• Many other density estimators can be slotted in*.
• Density estimation can be performed with real-valued inputs*.
• Bayes Classifiers can be built with real-valued inputs*.
• Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on*.
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help*.
• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!
*See future Andrew Lectures
43
What you should know
• Probability
  • Fundamentals of Probability and Bayes Rule
  • What’s a Joint Distribution
  • How to do inference (i.e. P(E1 | E2)) once you have a JD
• Density Estimation
  • What is DE and what is it good for
  • How to learn a Joint DE
  • How to learn a naïve DE
44
What you should know
• Bayes Classifiers
  • How to build one
  • How to predict with a BC
  • Contrast between naïve and joint BCs
45
Interesting Questions
• Suppose you were evaluating NaiveBC, JointBC, and Decision Trees
• Invent a problem where only NaiveBC would do well
• Invent a problem where only Dtree would do well
• Invent a problem where only JointBC would do well
• Invent a problem where only NaiveBC would do poorly
• Invent a problem where only Dtree would do poorly
• Invent a problem where only JointBC would do poorly
46
Venn Diagram
47
For more information
• Two nice books:
  • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
  • J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.
• Dozens of nice papers, including:
  • Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol 2, pages 63-73.
  • Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.
• Dozens of software implementations available on the web, free and commercial, for prices ranging between $50 - $300,000.
48
Probability Model Uses
• Classifier: Input Attributes → P(C | E)
• Anomaly Detector: Data point x → P(x | M)
• Inference Engine: Subset Evidence E1 → P(E2 | e1) for variables E2
• Clusterer: Data set → clusters of points
49
How to Build a Bayes Classifier
Data Set → P(I, A, R, C)
This function simulates a four-dimensional lookup table of the probability of each possible Industry/Analyte/Result/Class.
Each record has a class of either “normal” or “outbreak”.
50
How to Build a Bayes Classifier
Split the Data Set into Data Set Outbreaks and Data Set Normals, then learn a density estimator from each:
P(I, A, R | outbreak) and P(I, A, R | normal)
51
How to Build a Bayes Classifier
Suppose that a new test result arrives…
<meat, salmonella, negative>
P(meat, salmonella, negative, normal) = 0.19
P(meat, salmonella, negative, outbreak) = 0.005
0.19 / 0.005 = 38.0
Class = “normal”!
52
How to Build a Bayes Classifier
Next test:
<Seafood, Vibrio, Positive>
P(seafood, vibrio, positive, normal) = 0.02
P(seafood, vibrio, positive, outbreak) = 0.07
0.02 / 0.07 = 0.29
Class = “outbreak”!
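The two decisions above are a likelihood-ratio comparison. A minimal sketch (the function name is mine; the probabilities are the ones quoted on the slides):

```python
def classify(p_normal, p_outbreak):
    """Pick the class whose joint probability with the evidence is
    larger, i.e. check whether the ratio p_normal / p_outbreak > 1."""
    return "normal" if p_normal > p_outbreak else "outbreak"

# <meat, salmonella, negative>: ratio 0.19 / 0.005 = 38.0
print(classify(0.19, 0.005))   # normal
# <seafood, vibrio, positive>: ratio 0.02 / 0.07 = 0.29
print(classify(0.02, 0.07))    # outbreak
```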