
Page 1:

Aug 25th, 2001. Copyright © 2001, Andrew W. Moore.

Probabilistic Machine Learning

Brigham S. Anderson

School of Computer Science

Carnegie Mellon University

www.cs.cmu.edu/~brigham

[email protected]

Page 2:

ML: Some Successful Applications

• Learning to recognize spoken words (speech recognition);

• Text categorization (SPAM, newsgroups);

• Learning to play world-class chess, backgammon and checkers;

• Handwriting recognition;

• Learning to classify new astronomical data;

• Learning to detect cancerous tissues (e.g. colon polyp detection).

Page 3:

Machine Learning Application Areas

• Science: astronomy, bioinformatics, drug discovery, …

• Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-Commerce, targeted marketing, health care, …

• Web: search engines, bots, …

• Government: law enforcement, profiling tax cheaters, anti-terror(?)

Page 4:

Classification Application: Assessing Credit Risk

• Situation: a person applies for a loan.

• Task: should the bank approve the loan?

• Banks develop credit models using a variety of machine learning methods.

• Mortgage and credit card proliferation are the result of being able to successfully predict whether a person is likely to default on a loan.

• Widely deployed in many countries.

Page 5:

Prob. Table Anomaly Detector

• Suppose we have the following model:

P(Mpg, Horse) =
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

• We’re trying to detect anomalous cars.

• If the next example we see is <good,high>, how anomalous is it?

Page 6:

Prob. Table Anomaly Detector

P(Mpg, Horse) =
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

How likely is <good,high>?

likelihood(good, high) = P(good, high) = 0.04
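For concreteness, here is a minimal Python sketch (not from the slides) of this table-lookup anomaly scoring. The dictionary layout and the `likelihood` helper name are illustrative assumptions; the numbers are the joint-table values above.

```python
# Joint probability table P(Mpg, Horse) from the slide.
joint = {
    ("good", "low"): 0.36,
    ("good", "high"): 0.04,
    ("bad", "low"): 0.12,
    ("bad", "high"): 0.48,
}

def likelihood(mpg, horse):
    """The anomaly score of a record is simply its probability under the model."""
    return joint[(mpg, horse)]

print(likelihood("good", "high"))  # 0.04 -> fairly unlikely, hence fairly anomalous
```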

Page 7:

Bayes Net Anomaly Detector

How likely is a <good,high,fast> example?

Bayes net: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse|Mpg):
  P(low|good)  = 0.89    P(low|bad)  = 0.21
  P(high|good) = 0.11    P(high|bad) = 0.79

P(Accel|Horse):
  P(slow|low) = 0.95    P(slow|high) = 0.11
  P(fast|low) = 0.05    P(fast|high) = 0.89

P(good, high, fast) = P(good) P(high|good) P(fast|high)
                    = (0.4)(0.11)(0.89)
                    = 0.039
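A hedged Python sketch of the same computation: the joint probability is read off the chain Mpg → Horse → Accel by multiplying the three CPT entries. The dictionary encoding and the `joint` helper name are assumptions made for illustration.

```python
# CPTs from the slide, for the chain Mpg -> Horse -> Accel.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_given_mpg = {("low", "good"): 0.89, ("low", "bad"): 0.21,
                     ("high", "good"): 0.11, ("high", "bad"): 0.79}
p_accel_given_horse = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
                       ("fast", "low"): 0.05, ("fast", "high"): 0.89}

def joint(mpg, horse, accel):
    """P(Mpg, Horse, Accel) factored along the chain Mpg -> Horse -> Accel."""
    return p_mpg[mpg] * p_horse_given_mpg[(horse, mpg)] * p_accel_given_horse[(accel, horse)]

print(round(joint("good", "high", "fast"), 3))  # 0.039
```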

Page 8:

Probability Model Uses

• Anomaly Detector: data point x → P(x)
• Classifier: data point x → P(C | x)
• Inference Engine: evidence e1 → P(E2 | e1) for missing variables E2

Page 9:

Bayes Classifiers

• A formidable and sworn enemy of decision trees

[Cartoon: decision tree (DT) vs. Bayes classifier (BC)]

Classifier: data point x → P(C | x)

Page 10:

Dead-Simple Bayes Classifier Example

• Suppose we have the following model:

P(Mpg, Horse) =
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

• We’re trying to classify cars as Mpg = “good” or “bad”

• If the next example we see is Horse = “low”, how do we classify it?

Page 11:

Dead-Simple Bayes Classifier Example

P(Mpg, Horse) =
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

How do we classify <Horse=low>?

P(good | low) = P(good, low) / P(low)
              = P(good, low) / (P(good, low) + P(bad, low))
              = 0.36 / (0.36 + 0.12)
              = 0.75

Since P(good | low) = 0.75, we classify the example as "good".
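A small Python check of this classification rule, assuming the joint-table values above; `p_mpg_given_horse` is a hypothetical helper name, not something from the slides.

```python
# Joint table P(Mpg, Horse) from the slide.
joint = {("good", "low"): 0.36, ("good", "high"): 0.04,
         ("bad", "low"): 0.12, ("bad", "high"): 0.48}

def p_mpg_given_horse(mpg, horse):
    """P(Mpg = mpg | Horse = horse) by normalizing the joint over the Mpg values."""
    p_horse = sum(p for (m, h), p in joint.items() if h == horse)
    return joint[(mpg, horse)] / p_horse

print(p_mpg_given_horse("good", "low"))  # 0.75 (up to floating-point rounding)
```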

Page 12:

Bayes Classifiers

• That was just inference!

• In fact, virtually all machine learning tasks are a form of inference

• Anomaly detection: P(x)
• Classification: P(Class | x)
• Regression: P(Y | x)
• Model Learning: P(Model | dataset)
• Feature Selection: P(Model | dataset)

Page 13:

Suppose we get a <Horse=low, Accel=fast> example?

Bayes net: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse|Mpg):
  P(low|good)  = 0.89    P(low|bad)  = 0.21
  P(high|good) = 0.11    P(high|bad) = 0.79

P(Accel|Horse):
  P(slow|low) = 0.95    P(slow|high) = 0.11
  P(fast|low) = 0.05    P(fast|high) = 0.89

P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) P(low|good) P(fast|low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast)
                    = 0.0178 / P(low, fast)
                    = 0.0178 / (P(good, low, fast) + P(bad, low, fast))
                    = 0.75

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…

Page 14:

Suppose we get a <Horse=low, Accel=fast> example?

Bayes net: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse|Mpg):
  P(low|good)  = 0.89    P(low|bad)  = 0.21
  P(high|good) = 0.11    P(high|bad) = 0.79

P(Accel|Horse):
  P(slow|low) = 0.95    P(slow|high) = 0.11
  P(fast|low) = 0.05    P(fast|high) = 0.89

P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) P(low|good) P(fast|low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast)
                    = 0.0178 / P(low, fast)
                    = 0.0178 / (P(good, low, fast) + P(bad, low, fast))
                    = 0.75

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…

Since P(good | low, fast) = 0.75, we classify the example as "good".

…but that seems somehow familiar…

Wasn't that the same answer as P(good | low)?
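A hedged Python sketch of the full Bayes-net posterior for this query. Because P(Accel | Horse) does not depend on Mpg, the factor P(fast | low) cancels when we normalize, which is exactly why the answer matches P(good | low). The `posterior` helper name and the dictionary layout are illustrative assumptions; with these (rounded) CPTs the value comes out near 0.74, consistent with the rounding note above.

```python
# Posterior P(Mpg | Horse, Accel) from the Bayes net CPTs on this slide.
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}
p_accel = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
           ("fast", "low"): 0.05, ("fast", "high"): 0.89}

def posterior(horse, accel):
    # Unnormalized joint for each Mpg value. Note that P(Accel|Horse) is the same
    # factor for every Mpg value, so it cancels in the normalization below.
    scores = {m: p_mpg[m] * p_horse[(horse, m)] * p_accel[(accel, horse)] for m in p_mpg}
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()}

print(posterior("low", "fast"))  # approximately {'good': 0.74, 'bad': 0.26}
```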

Page 15:

Suppose we get a <Horse=low, Accel=fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)

Bayes net: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse|Mpg):
  P(low|good)  = 0.89    P(low|bad)  = 0.21
  P(high|good) = 0.11    P(high|bad) = 0.79

P(Accel|Horse):
  P(slow|low) = 0.95    P(slow|high) = 0.11
  P(fast|low) = 0.05    P(fast|high) = 0.89

Page 16:

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.
• Assume there are m input attributes called X1, X2, …, Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.
• Define DSi = records in which Y = vi.
• For each DSi, learn a Density Estimator Mi to model the input distribution among the Y = vi records.

Page 17:

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.
• Assume there are m input attributes called X1, X2, …, Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.
• Define DSi = records in which Y = vi.
• For each DSi, learn a Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, …, Xm | Y = vi) (see the sketch below).
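As an illustration of the recipe above, here is a minimal Python sketch that splits the data by class and learns the simplest density estimator, a probability table, for each class. The function name `fit_class_densities` is a hypothetical choice, not part of the slides.

```python
from collections import Counter, defaultdict

def fit_class_densities(records, labels):
    """Split records by Y = v_i and learn a probability-table density M_i for each class."""
    per_class = defaultdict(list)
    for x, y in zip(records, labels):
        per_class[y].append(tuple(x))
    densities = {}
    for y, xs in per_class.items():
        counts = Counter(xs)
        n = len(xs)
        # M_i estimates P(X1, ..., Xm | Y = y) as the empirical frequency of each input tuple.
        densities[y] = {x: c / n for x, c in counts.items()}
    return densities
```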

Page 18:

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.
• Assume there are m input attributes called X1, X2, …, Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.
• Define DSi = records in which Y = vi.
• For each DSi, learn a Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, …, Xm | Y = vi).
• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, …, Xm | Y = vi) most likely:

  Y^predict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

Is this a good idea?

Page 19:

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.
• Assume there are m input attributes called X1, X2, …, Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.
• Define DSi = records in which Y = vi.
• For each DSi, learn a Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, …, Xm | Y = vi).
• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, …, Xm | Y = vi) most likely:

  Y^predict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

Is this a good idea?

This is a Maximum Likelihood classifier. It can get silly if some Ys are very unlikely.

Page 20:

How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.
• Assume there are m input attributes called X1, X2, …, Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.
• Define DSi = records in which Y = vi.
• For each DSi, learn a Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, …, Xm | Y = vi).
• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(Y = vi | X1, X2, …, Xm) most likely:

  Y^predict = argmax_v P(Y = v | X1 = u1, …, Xm = um)

Is this a good idea?

Much better idea!

Page 21:

Terminology

• MLE (Maximum Likelihood Estimator):

  Y^predict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

• MAP (Maximum A-Posteriori Estimator):

  Y^predict = argmax_v P(Y = v | X1 = u1, …, Xm = um)

Page 22:

Getting what we need

Y^predict = argmax_v P(Y = v | X1 = u1, …, Xm = um)

Page 23:

Getting a posterior probability

P(Y = v | X1 = u1, …, Xm = um)
  = P(X1 = u1, …, Xm = um | Y = v) P(Y = v) / P(X1 = u1, …, Xm = um)
  = P(X1 = u1, …, Xm = um | Y = v) P(Y = v) / Σ_{j=1..nY} P(X1 = u1, …, Xm = um | Y = vj) P(Y = vj)
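A minimal Python sketch of this normalization, assuming the class-conditional densities and priors are stored as dictionaries; the `posterior` helper and its argument layout are illustrative assumptions.

```python
def posterior(likelihoods, priors, x):
    """P(Y=v | x) = P(x | Y=v) P(Y=v) / sum_j P(x | Y=v_j) P(Y=v_j).

    likelihoods[v] is a dict mapping an input tuple x to P(x | Y=v); priors[v] is P(Y=v).
    """
    unnorm = {v: likelihoods[v].get(x, 0.0) * priors[v] for v in priors}
    z = sum(unnorm.values())
    # If every class assigns probability zero to x, the posterior is undefined.
    return {v: p / z for v, p in unnorm.items()} if z > 0 else None
```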

Page 24:

Bayes Classifiers in a nutshell

1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, …, Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction:

  Y^predict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
            = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

Page 25:

Bayes Classifiers in a nutshell

1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, …, Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction (sketched in code below):

  Y^predict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
            = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

We can use our favorite Density Estimator here. Right now we have three options:
• Probability Table
• Naïve Density
• Bayes Net
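A hedged Python sketch of steps 3 and 4, meant to compose with the per-class probability-table densities sketched earlier; the names `fit_priors` and `bayes_classify` are hypothetical.

```python
from collections import Counter

def fit_priors(labels):
    """Step 3: estimate P(Y=v) as the fraction of records with Y=v."""
    counts = Counter(labels)
    n = len(labels)
    return {v: c / n for v, c in counts.items()}

def bayes_classify(x, densities, priors):
    """Step 4: Y_predict = argmax_v P(X1..Xm = x | Y=v) P(Y=v).

    densities[v] maps an input tuple to P(x | Y=v), e.g. the per-class tables above.
    """
    return max(priors, key=lambda v: densities[v].get(tuple(x), 0.0) * priors[v])
```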

Page 26:

Joint Density Bayes Classifier

Y^predict = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

In the case of the joint Bayes Classifier this degenerates to a very simple rule:

Ypredict = the most common value of Y among records in which X1 = u1, X2 = u2, …. Xm = um.

Note that if no records have the exact set of inputs X1 = u1, X2 = u2, …. Xm = um, then P(X1, X2, … Xm | Y=vi ) = 0 for all values of Y.

In that case we just have to guess Y’s value
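A minimal Python sketch of that degenerate rule, under the assumption that the joint density is just the empirical frequency table; `joint_bc_predict` is a hypothetical helper name.

```python
from collections import Counter

def joint_bc_predict(x, records, labels):
    """Joint BC rule: the most common Y among records whose inputs exactly match x.

    If no record matches, P(x | Y=v) = 0 for every v and we just have to guess.
    """
    matches = [y for r, y in zip(records, labels) if tuple(r) == tuple(x)]
    if not matches:
        return None  # no information: guess (e.g., the overall most common class)
    return Counter(matches).most_common(1)[0][0]
```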

Page 27:

Joint BC Results: "Logical"

The "logical" dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A ^ ~C, except that in 10% of records it is flipped.

[Figure: the classifier learned by "Joint BC"]

Page 28:

Joint BC Results: "All Irrelevant"

The "all irrelevant" dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, …, o, where a, b, c are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.

Page 29:

Page 30:

BC Results: "MPG": 392 records

[Figure: the classifier learned by "Naive BC"]

Page 31:

Joint Distribution

[Diagram: a single joint distribution over Mpg, Horsepower, Acceleration, and Maker]

Page 32:

Joint Distribution

Recall: a joint distribution can be decomposed via the chain rule…

P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg)

Note that this takes the same amount of information to create. We "gain" nothing from this decomposition.

Page 33:

Naive Distribution

[Diagram: Mpg is the sole parent of every other attribute]

P(Mpg)
P(Cylinders|Mpg)
P(Horsepower|Mpg)
P(Weight|Mpg)
P(Modelyear|Mpg)
P(Maker|Mpg)
P(Acceleration|Mpg)

Page 34:

Naïve Bayes Classifier

Y^predict = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

In the case of the naive Bayes Classifier this can be simplified:

Y^predict = argmax_v P(Y = v) Π_{j=1..m} P(Xj = uj | Y = v)

Page 35:

Naïve Bayes Classifier

Y^predict = argmax_v P(X1 = u1, …, Xm = um | Y = v) P(Y = v)

In the case of the naive Bayes Classifier this can be simplified:

Y^predict = argmax_v P(Y = v) Π_{j=1..m} P(Xj = uj | Y = v)

Technical Hint: if you have 10,000 input attributes, that product will underflow in floating-point math. You should use logs, as in the sketch below:

Y^predict = argmax_v [ log P(Y = v) + Σ_{j=1..m} log P(Xj = uj | Y = v) ]
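A minimal Python sketch of the log-space naive Bayes prediction; the argument layout (`priors[v]` = P(Y=v), `cond[v][j]` = a dict giving P(Xj = uj | Y = v)) and the function name are illustrative assumptions.

```python
import math

def naive_bayes_predict(x, priors, cond):
    """Y_predict = argmax_v [ log P(Y=v) + sum_j log P(X_j = u_j | Y=v) ]."""
    best_v, best_score = None, -math.inf
    for v, prior in priors.items():
        score = math.log(prior)
        for j, u in enumerate(x):
            p = cond[v][j].get(u, 0.0)
            if p == 0.0:
                score = -math.inf  # a zero probability kills this class (see the smoothing hack later)
                break
            score += math.log(p)
        if score > best_score:
            best_v, best_score = v, score
    return best_v
```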

Page 36:

BC Results: "XOR"

The "XOR" dataset consists of 40,000 records and 2 boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b.

[Figure: the classifier learned by "Naive BC"]
[Figure: the classifier learned by "Joint BC"]

Page 37:

Naive BC Results: "Logical"

The "logical" dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A ^ ~C, except that in 10% of records it is flipped.

[Figure: the classifier learned by "Naive BC"]

Page 38:

Naive BC Results: "Logical"

The "logical" dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A ^ ~C, except that in 10% of records it is flipped.

[Figure: the classifier learned by "Joint BC"]

This result surprised Andrew until he had thought about it a little.

Page 39:

Naïve BC Results: "All Irrelevant"

The "all irrelevant" dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, …, o, where a, b, c are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.

[Figure: the classifier learned by "Naive BC"]

Page 40:

BC Results: "MPG": 392 records

[Figure: the classifier learned by "Naive BC"]

Page 41:

BC Results: "MPG": 40 records

Page 42:

More Facts About Bayes Classifiers

• Many other density estimators can be slotted in*.
• Density estimation can be performed with real-valued inputs*.
• Bayes Classifiers can be built with real-valued inputs*.
• Rather Technical Complaint: Bayes Classifiers don't try to be maximally discriminative---they merely try to honestly model what's going on*.
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words "Dirichlet Prior") can help*; a sketch follows below.
• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!

*See future Andrew Lectures
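The "Dirichlet Prior" hack mentioned above is, in its simplest form, add-one (Laplace) smoothing of the counts, which corresponds to a symmetric Dirichlet prior. A minimal Python sketch; `smoothed_conditional` is a hypothetical helper name, not part of the slides.

```python
from collections import Counter

def smoothed_conditional(values, domain, alpha=1.0):
    """Add-alpha smoothing for one attribute within one class:
    P(X = u | Y = v) = (count(u) + alpha) / (n + alpha * |domain|).
    alpha = 1 corresponds to a symmetric Dirichlet prior with pseudo-count 1 per value.
    """
    counts = Counter(values)
    n = len(values)
    k = len(domain)
    return {u: (counts[u] + alpha) / (n + alpha * k) for u in domain}

print(smoothed_conditional(["low", "low", "high"], ["low", "high"]))
# {'low': 0.6, 'high': 0.4} -- no value gets probability zero
```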

Page 43:

What you should know

• Probability
  • Fundamentals of Probability and Bayes Rule
  • What's a Joint Distribution
  • How to do inference (i.e. P(E1|E2)) once you have a JD

• Density Estimation
  • What is DE and what is it good for
  • How to learn a Joint DE
  • How to learn a naïve DE

Page 44:

What you should know

• Bayes Classifiers
  • How to build one
  • How to predict with a BC
  • Contrast between naïve and joint BCs

Page 45:

Interesting Questions

• Suppose you were evaluating NaiveBC, JointBC, and Decision Trees

• Invent a problem where only NaiveBC would do well
• Invent a problem where only Dtree would do well
• Invent a problem where only JointBC would do well
• Invent a problem where only NaiveBC would do poorly
• Invent a problem where only Dtree would do poorly
• Invent a problem where only JointBC would do poorly

Page 46:

Venn Diagram

Page 47:

For more information

• Two nice books:
  • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
  • J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.

• Dozens of nice papers, including:
  • Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol 2, pages 63-73.
  • Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.

• Dozens of software implementations available on the web, for free and commercially, for prices ranging between $50 and $300,000.

Page 48:

Probability Model Uses

• Classifier: input attributes E → P(C | E)
• Anomaly Detector: data point x → P(x | M)
• Inference Engine: subset of evidence E1 = e1 → P(E2 | e1) for variables E2
• Clusterer: data set → clusters of points

Page 49:

How to Build a Bayes Classifier

Data Set → P(I, A, R, C)

This function simulates a four-dimensional lookup table of the probability of each possible Industry/Analyte/Result/Class combination.

Each record has a class of either "normal" or "outbreak".

Page 50:

How to Build a Bayes Classifier

Data Set, P(I, A, R, C), split by class into:
  • Outbreaks data set → P(I, A, R | outbreak)
  • Normals data set → P(I, A, R | normal)

Page 51:

How to Build a Bayes Classifier

Suppose that a new test result arrives…

<meat, salmonella, negative>

P(meat, salmonella, negative, normal) = 0.19

P(meat, salmonella, negative, outbreak) = 0.005

0.19 / 0.005 = 38.0

Class = “normal”!
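A minimal Python sketch of this decision, assuming the two joint probabilities are read off the lookup table above; the variable names are illustrative.

```python
# Joint probabilities for the test result <meat, salmonella, negative>, from the slide.
p_normal = 0.19
p_outbreak = 0.005

ratio = p_normal / p_outbreak
print(round(ratio, 1))                            # 38.0
print("normal" if ratio > 1.0 else "outbreak")    # "normal": the larger joint probability wins
```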

Page 52:

How to Build a Bayes Classifier

Next test:

<Seafood, Vibrio, Positive>

P(seafood, vibrio, positive, normal) = 0.02

P(seafood, vibrio, positive, outbreak) = 0.07

0.02 / 0.07 = 0.29

Class = “outbreak”!