Slide 1 Aug 25th, 2001 Copyright © 2001, Andrew W. Moore Probabilistic Machine Learning Brigham S....
Slide 1
Probabilistic Machine Learning
Brigham S. Anderson
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~brigham
2
ML: Some Successful Applications
• Learning to recognize spoken words (speech recognition);
• Text categorization (SPAM, newsgroups);
• Learning to play world-class chess, backgammon and checkers;
• Handwriting recognition;
• Learning to classify new astronomical data;
• Learning to detect cancerous tissues (e.g. colon polyp detection).
3
Machine Learning Application Areas
• Science: astronomy, bioinformatics, drug discovery, …
• Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, …
• Web: search engines, bots, …
• Government: law enforcement, profiling tax cheaters, anti-terror(?)
4
Classification Application: Assessing Credit Risk
• Situation: Person applies for a loan.
• Task: Should a bank approve the loan?
• Banks develop credit models using a variety of machine learning methods.
• Mortgage and credit card proliferation are the result of being able to successfully predict whether a person is likely to default on a loan.
• Widely deployed in many countries.
5
Prob. Table Anomaly Detector
• Suppose we have the following model:

P(Mpg, Horse):
P(good, low) = 0.36   P(good, high) = 0.04
P( bad, low) = 0.12   P( bad, high) = 0.48

• We’re trying to detect anomalous cars.
• If the next example we see is <good, high>, how anomalous is it?
6
Prob. Table Anomaly Detector

How likely is <good, high>?

likelihood(good, high) = P(good, high) = 0.04

P(Mpg, Horse):
P(good, low) = 0.36   P(good, high) = 0.04
P( bad, low) = 0.12   P( bad, high) = 0.48
7
How likely is a <good, high, fast> example?

P(good, high, fast) = P(good) P(high | good) P(fast | high)
                    = (0.4)(0.11)(0.89)
                    ≈ 0.039

Mpg → Horse → Accel

P(Mpg):          P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):  P( low | good) = 0.89, P( low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89

Bayes Net Anomaly Detector
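As a sanity check, the chain-rule computation above can be sketched in a few lines of Python. The dictionary layout and the `likelihood` helper are mine, not the slides'; the CPT numbers are the ones on the slide.

```python
# CPTs from the slide's Mpg -> Horse -> Accel Bayes net.
P_MPG = {"good": 0.4, "bad": 0.6}
P_HORSE_GIVEN_MPG = {
    ("low", "good"): 0.89, ("low", "bad"): 0.21,
    ("high", "good"): 0.11, ("high", "bad"): 0.79,
}
P_ACCEL_GIVEN_HORSE = {
    ("slow", "low"): 0.95, ("slow", "high"): 0.11,
    ("fast", "low"): 0.05, ("fast", "high"): 0.89,
}

def likelihood(mpg, horse, accel):
    """Chain rule: P(Mpg) * P(Horse | Mpg) * P(Accel | Horse)."""
    return (P_MPG[mpg]
            * P_HORSE_GIVEN_MPG[(horse, mpg)]
            * P_ACCEL_GIVEN_HORSE[(accel, horse)])

# The slide's example: how likely is <good, high, fast>?
print(round(likelihood("good", "high", "fast"), 3))  # 0.039
```

A low likelihood like this is exactly what the anomaly detector flags: the example is legal, just improbable under the model.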
8
Probability Model Uses
• Classifier: Data point x → P(C | x)
• Anomaly Detector: Data point x → P(x)
• Inference Engine: Evidence e1 → P(E2 | e1) for missing variables E2
9
Bayes Classifiers
• A formidable and sworn enemy of decision trees
Classifier: Data point x → P(C | x)
10
Dead-Simple Bayes Classifier Example
• Suppose we have the following model:

P(Mpg, Horse):
P(good, low) = 0.36   P(good, high) = 0.04
P( bad, low) = 0.12   P( bad, high) = 0.48

• We’re trying to classify cars as Mpg = “good” or “bad”.
• If the next example we see is Horse = “low”, how do we classify it?
11
Dead-Simple Bayes Classifier Example

How do we classify <Horse = low>?

P(good | low) = P(good, low) / P(low)
              = P(good, low) / [P(good, low) + P( bad, low)]
              = 0.36 / (0.36 + 0.12)
              = 0.75

P(Mpg, Horse):
P(good, low) = 0.36   P(good, high) = 0.04
P( bad, low) = 0.12   P( bad, high) = 0.48

P(good | low) = 0.75, so we classify the example as “good”.
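A hedged Python sketch of this joint-table classifier (the table layout and helper names are mine): classifying on `Horse = low` is just normalizing the `low` column of the joint table.

```python
# Joint probability table P(Mpg, Horse) from the slide.
P_JOINT = {("good", "low"): 0.36, ("good", "high"): 0.04,
           ("bad", "low"): 0.12, ("bad", "high"): 0.48}

def posterior(mpg, horse):
    """P(Mpg = mpg | Horse = horse): pick out the entries matching
    `horse` in the joint table and normalize."""
    p_horse = sum(p for (_, h), p in P_JOINT.items() if h == horse)
    return P_JOINT[(mpg, horse)] / p_horse

def classify(horse):
    """Return the Mpg value with the highest posterior."""
    return max(("good", "bad"), key=lambda m: posterior(m, horse))

# P(good | low) = 0.36 / (0.36 + 0.12) = 0.75, so:
print(classify("low"))  # good
```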
12
Bayes Classifiers
• That was just inference!
• In fact, virtually all machine learning tasks are a form of inference
• Anomaly detection: P(x)
• Classification: P(Class | x)
• Regression: P(Y | x)
• Model Learning: P(Model | dataset)
• Feature Selection: P(Model | dataset)
13
Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) P(low | good) P(fast | low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast)
                    = 0.0178 / [P(good, low, fast) + P( bad, low, fast)]
                    ≈ 0.75

Mpg → Horse → Accel

P(Mpg):          P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):  P( low | good) = 0.89, P( low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…
14
Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) P(low | good) P(fast | low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast)
                    = 0.0178 / [P(good, low, fast) + P( bad, low, fast)]
                    ≈ 0.75

Mpg → Horse → Accel

P(Mpg):          P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):  P( low | good) = 0.89, P( low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…
P(good | low, fast) = 0.75, so we classify the example as “good”.
…but that seems somehow familiar…
Wasn’t that the same answer asP(good | low)?
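It is the same answer, and for a structural reason: Accel is downstream of Horse in the net, so once Horse is observed, the P(fast | low) factor multiplies numerator and denominator alike and cancels. A small sketch (CPTs copied from the slide; the helper names are mine) confirms the two posteriors agree:

```python
# CPTs from the slide's Mpg -> Horse -> Accel network.
P_MPG = {"good": 0.4, "bad": 0.6}
P_HORSE = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}
P_ACCEL = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
           ("fast", "low"): 0.05, ("fast", "high"): 0.89}

def joint(mpg, horse, accel):
    """Chain rule: P(Mpg) * P(Horse | Mpg) * P(Accel | Horse)."""
    return P_MPG[mpg] * P_HORSE[(horse, mpg)] * P_ACCEL[(accel, horse)]

def post_given_low_fast(mpg):
    """P(Mpg = mpg | Horse = low, Accel = fast)."""
    den = joint("good", "low", "fast") + joint("bad", "low", "fast")
    return joint(mpg, "low", "fast") / den

def post_given_low(mpg):
    """P(Mpg = mpg | Horse = low), ignoring Accel entirely."""
    den = sum(P_MPG[m] * P_HORSE[("low", m)] for m in ("good", "bad"))
    return P_MPG[mpg] * P_HORSE[("low", mpg)] / den

# Both posteriors come out around 0.74 (the slide's "not exactly 0.75"),
# and they are identical because P(fast | low) cancels.
```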
15
Suppose we get a <Horse = low, Accel = fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)

Mpg → Horse → Accel

P(Mpg):          P(good) = 0.4, P( bad) = 0.6
P(Horse | Mpg):  P( low | good) = 0.89, P( low | bad) = 0.21, P(high | good) = 0.11, P(high | bad) = 0.79
P(Accel | Horse): P(slow | low) = 0.95, P(slow | high) = 0.11, P(fast | low) = 0.05, P(fast | high) = 0.89
16
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
17
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, … Xm | Y = vi).
18
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, … Xm | Y = vi).
• Idea: When a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, … Xm | Y = vi) most likely:

Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v)

Is this a good idea?
19
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, … Xm | Y = vi).
• Idea: When a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, … Xm | Y = vi) most likely:

Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v)

Is this a good idea?
This is a Maximum Likelihood classifier.
It can get silly if some Ys are very unlikely.
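A toy numeric illustration of that silliness (the disease scenario and its numbers are invented for this sketch, not from the slides): a very rare class can have the higher likelihood for an observation, so maximum likelihood picks it, while weighting by the prior flips the decision.

```python
# Invented numbers: a symptom is explained slightly better by a rare
# disease, but the disease's prior is tiny.
prior = {"healthy": 0.999, "rare_disease": 0.001}  # P(class)
lik = {"healthy": 0.05, "rare_disease": 0.90}      # P(symptom | class)

# Maximum likelihood ignores the prior...
mle_pick = max(prior, key=lambda c: lik[c])
# ...while the MAP rule weights the likelihood by P(class).
map_pick = max(prior, key=lambda c: lik[c] * prior[c])

print(mle_pick, map_pick)  # rare_disease healthy
```

Here MLE diagnoses nearly everyone with the rare disease, which is exactly the failure mode the slide warns about; the next slide's MAP rule fixes it.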
20
How to build a Bayes Classifier
• Assume you want to predict output Y which has arity nY and values v1, v2, … vnY.
• Assume there are m input attributes called X1, X2, … Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, … DSnY.
• Define DSi = Records in which Y = vi.
• For each DSi, learn Density Estimator Mi to model the input distribution among the Y = vi records.
• Mi estimates P(X1, X2, … Xm | Y = vi).
• Idea: When a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(Y = vi | X1, X2, … Xm) most likely:

Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)

Is this a good idea?
Much Better Idea
21
Terminology
• MLE (Maximum Likelihood Estimator):

  Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v)

• MAP (Maximum A-Posteriori Estimator):

  Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)
22
Getting what we need
Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)
23
Getting a posterior probability
P(Y = v | X1 = u1 … Xm = um)
  = P(X1 = u1 … Xm = um | Y = v) P(Y = v) / P(X1 = u1 … Xm = um)
  = P(X1 = u1 … Xm = um | Y = v) P(Y = v) / Σ_{j=1..nY} P(X1 = u1 … Xm = um | Y = vj) P(Y = vj)
24
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, … Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction:

Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)
         = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)
25
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, … Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction:

Ypredict = argmax_v P(Y = v | X1 = u1 … Xm = um)
         = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)

We can use our favorite Density Estimator here. Right now we have three options:
• Probability Table
• Naïve Density
• Bayes Net
26
Joint Density Bayes Classifier
Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)

In the case of the joint Bayes Classifier this degenerates to a very simple rule:

Ypredict = the most common value of Y among records in which X1 = u1, X2 = u2, …, Xm = um.

Note that if no records have the exact set of inputs X1 = u1, X2 = u2, …, Xm = um, then P(X1, X2, … Xm | Y = vi) = 0 for all values of Y.
In that case we just have to guess Y’s value
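The recipe can be sketched as a tiny joint Bayes classifier in Python. This is a minimal sketch, not the slides' code: `fit_joint_bc`, `predict`, and the toy 100-record dataset are mine, with the dataset chosen so its frequencies reproduce the slide's P(Mpg, Horse) table.

```python
from collections import Counter, defaultdict

def fit_joint_bc(records):
    """records: list of (x_tuple, y) pairs. Returns class priors P(Y)
    and one table of P(x | Y=y) per class y."""
    counts = Counter(y for _, y in records)
    tables = defaultdict(Counter)
    for x, y in records:
        tables[y][x] += 1
    n = len(records)
    priors = {y: c / n for y, c in counts.items()}
    cond = {y: {x: c / counts[y] for x, c in t.items()}
            for y, t in tables.items()}
    return priors, cond

def predict(priors, cond, x):
    """MAP rule: argmax_v P(x | Y=v) P(Y=v). Unseen x gives 0 for every
    class, so the result is effectively a guess (the slide's caveat)."""
    return max(priors, key=lambda y: cond[y].get(x, 0.0) * priors[y])

# Toy 100-record dataset matching P(Mpg, Horse) = 0.36/0.04/0.12/0.48:
data = ([(("low",), "good")] * 36 + [(("high",), "good")] * 4
        + [(("low",), "bad")] * 12 + [(("high",), "bad")] * 48)
priors, cond = fit_joint_bc(data)
print(predict(priors, cond, ("low",)))   # good
```

On this data the MAP score for "good" given `low` is 0.9 × 0.4 = 0.36 versus 0.2 × 0.6 = 0.12 for "bad", matching the joint-table entries from the earlier slides.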
27
Joint BC Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The Classifier learned by “Joint BC”
28
Joint BC Results: “All Irrelevant”
The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, … o, where a, b, c, … o are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.
29
30
BC Results: “MPG”: 392 records
The Classifier learned by “Naive BC”
31
Joint Distribution
(Joint distribution over Mpg, Horsepower, Acceleration, and Maker)
32
Joint Distribution
Recall: a joint distribution can be decomposed via the chain rule…

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

Note that this takes the same amount of information to create. We “gain” nothing from this decomposition.
33
Naive Distribution
Mpg is the sole parent of every other attribute:
P(Mpg)
P(Cylinders | Mpg)
P(Horsepower | Mpg)
P(Weight | Mpg)
P(Modelyear | Mpg)
P(Maker | Mpg)
P(Acceleration | Mpg)
34
Naïve Bayes Classifier
Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)

In the case of the naive Bayes Classifier this can be simplified:

Ypredict = argmax_v P(Y = v) Π_{j=1..m} P(Xj = uj | Y = v)
35
Naïve Bayes Classifier
Ypredict = argmax_v P(X1 = u1 … Xm = um | Y = v) P(Y = v)

In the case of the naive Bayes Classifier this can be simplified:

Ypredict = argmax_v P(Y = v) Π_{j=1..m} P(Xj = uj | Y = v)

Technical Hint: If you have 10,000 input attributes that product will underflow in floating point math. You should use logs:

Ypredict = argmax_v [ log P(Y = v) + Σ_{j=1..m} log P(Xj = uj | Y = v) ]
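The hint is easy to demonstrate. This sketch (the function name is mine) compares the raw naive-Bayes product with the log-space sum for 10,000 attributes:

```python
import math

def naive_log_score(prior, cond_probs):
    """log P(Y=v) + sum_j log P(Xj=uj | Y=v) for one candidate class v."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

# 10,000 attributes, each contributing probability 0.5:
product = 0.5 ** 10_000                           # underflows to exactly 0.0
log_score = naive_log_score(0.5, [0.5] * 10_000)  # finite and usable
print(product, log_score)
```

Every class's product underflows to 0.0, so the argmax becomes an arbitrary tie-break; the log scores stay finite and still rank classes correctly, since log is monotonic.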
36
BC Results: “XOR”
The “XOR” dataset consists of 40,000 records and 2 boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b.
The Classifier learned by “Naive BC”
The Classifier learned by “Joint BC”
37
Naive BC Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The Classifier learned by “Naive BC”
38
Naive BC Results: “Logical”
The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.
The Classifier learned by “Joint BC”
This result surprised Andrew until he had thought about it a little
39
Naïve BC Results: “All Irrelevant”
The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, … o, where a, b, c, … o are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.
The Classifier learned by “Naive BC”
40
BC Results: “MPG”: 392 records
The Classifier learned by “Naive BC”
41
BC Results: “MPG”: 40 records
42
More Facts About Bayes Classifiers
• Many other density estimators can be slotted in*.
• Density estimation can be performed with real-valued inputs*.
• Bayes Classifiers can be built with real-valued inputs*.
• Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on*.
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help*.
• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!
*See future Andrew Lectures
43
What you should know
• Probability
  • Fundamentals of Probability and Bayes Rule
  • What’s a Joint Distribution
  • How to do inference (i.e. P(E1 | E2)) once you have a JD
• Density Estimation
  • What is DE and what is it good for
  • How to learn a Joint DE
  • How to learn a naïve DE
44
What you should know
• Bayes Classifiers
  • How to build one
  • How to predict with a BC
  • Contrast between naïve and joint BCs
45
Interesting Questions
• Suppose you were evaluating NaiveBC, JointBC, and Decision Trees
• Invent a problem where only NaiveBC would do well
• Invent a problem where only Dtree would do well
• Invent a problem where only JointBC would do well
• Invent a problem where only NaiveBC would do poorly
• Invent a problem where only Dtree would do poorly
• Invent a problem where only JointBC would do poorly
46
Venn Diagram
47
For more information
• Two nice books:
  • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
  • J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.
• Dozens of nice papers, including:
  • Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol 2, pages 63-73.
  • Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.
• Dozens of software implementations available on the web, free and commercial, for prices ranging between $50 - $300,000.
48
Probability Model Uses
• Classifier: Input Attributes → P(C | E)
• Anomaly Detector: Data point x → P(x | M)
• Inference Engine: Subset Evidence E1 → P(E2 | e1) for variables E2
• Clusterer: Data set → clusters of points
49
How to Build a Bayes Classifier
Data Set → P(I, A, R, C)
This function simulates a four-dimensional lookup table of the probability of each possible Industry/Analyte/Result/Class.
Each record has a class of either “normal” or “outbreak”.
50
How to Build a Bayes Classifier
Split the Data Set into Data Set Outbreaks and Data Set Normals, then learn a density estimator from each:
P(I, A, R | outbreak) and P(I, A, R | normal)
51
How to Build a Bayes Classifier
Suppose that a new test result arrives…
<meat, salmonella, negative>
P(meat, salmonella, negative, normal) = 0.19
P(meat, salmonella, negative, outbreak) = 0.005
0.19 / 0.005 = 38.0
Class = “normal”!
52
How to Build a Bayes Classifier
Next test:
<Seafood, Vibrio, Positive>
P(seafood, vibrio, positive, normal) = 0.02
P(seafood, vibrio, positive, outbreak) = 0.07
0.02 / 0.07 = 0.29
Class = “outbreak”!
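The two decisions above are a likelihood-ratio comparison. A minimal sketch (the function name is mine; the probabilities are the ones quoted on the slides):

```python
def classify(p_normal, p_outbreak):
    """Pick the class whose joint probability with the evidence is
    larger, i.e. check whether the ratio p_normal / p_outbreak > 1."""
    return "normal" if p_normal > p_outbreak else "outbreak"

# <meat, salmonella, negative>: ratio 0.19 / 0.005 = 38.0
print(classify(0.19, 0.005))   # normal
# <seafood, vibrio, positive>: ratio 0.02 / 0.07 = 0.29
print(classify(0.02, 0.07))    # outbreak
```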