Learning from Observations
Transcript of Learning from Observations
![Page 1: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/1.jpg)
Learning from Observations
Chapter 18
Section 1 – 4
![Page 2: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/2.jpg)
Outline
• Learning agents
• Inductive learning
• Decision tree learning
• Boosting
This is arguably the best “off-the-shelf” classification algorithm (and you’ll get to know it)!
![Page 3: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/3.jpg)
Learning agents
Sometimes we want to invest time and effort to observe the feedback from the environment to our actions, and use it to improve those actions, so we can optimize our utility more effectively in the future.
![Page 4: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/4.jpg)
Learning element
• Design of a learning element is affected by:
– Which components of the performance element are to be learned (e.g. learn to stop for a traffic light)
– What feedback is available to learn these components (e.g. visual feedback from a camera)
– What representation is used for the components (e.g. logic, probabilistic descriptions, attributes, ...)
• Type of feedback:
– Supervised learning: correct answers (labels) are given for each example.
– Unsupervised learning: correct answers are not given.
– Reinforcement learning: occasional rewards.
![Page 5: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/5.jpg)
Two Examples of Learning Object Categories. Here is your training set (2 classes):
![Page 6: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/6.jpg)
Here is your test set: does it belong to one of the above classes?
![Page 7: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/7.jpg)
S. Savarese, 2003. Copied from P. Perona talk slides.
Learning from 1 Example
![Page 8: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/8.jpg)
P. Bruegel, 1562
![Page 9: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/9.jpg)
Inductive learning
• Simplest form: learn a function from examples
f is the target function
An example is a pair (x, f(x))
Problem: find a hypothesis h such that h ≈ f, given a training set of examples
![Page 10: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/10.jpg)
Inductive learning method
• Construct/adjust h to agree with f on training set
• (h is consistent if it agrees with f on all examples)
• E.g., curve fitting:
![Page 11: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/11.jpg)
Inductive learning method
• Construct/adjust h to agree with f on training set
• (h is consistent if it agrees with f on all examples)
• E.g., curve fitting:
![Page 12: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/12.jpg)
Inductive learning method
![Page 13: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/13.jpg)
Inductive learning method
![Page 14: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/14.jpg)
Inductive learning method
which curve is best?
![Page 15: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/15.jpg)
• Ockham’s razor: prefer the simplest hypothesis consistent with data
Inductive learning method
![Page 16: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/16.jpg)
Decision Trees
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
![Page 17: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/17.jpg)
Attribute-based representations
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table:
• Classification of examples is positive (T) or negative (F)
![Page 18: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/18.jpg)
Decision trees
• One possible representation for hypotheses
• We imagine someone making a sequence of decisions.
• E.g., here is the “true” tree for deciding whether to wait:
![Page 19: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/19.jpg)
Expressiveness
• Decision trees can express any function of the input attributes.
• E.g., for Boolean functions, truth table row → path to leaf:
• Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
• Prefer to find more compact decision trees: we don’t want to memorize the data, we want to find structure in the data!
![Page 20: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/20.jpg)
Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows = 2^(2^n)
• E.g., with 6 Boolean attributes, there are 2^(2^6) = 18,446,744,073,709,551,616 trees
E.g. n = 2: 2^2 = 4 rows; for each row we can choose T or F, giving 2^4 = 16 functions.
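This count is easy to verify directly; a quick sketch (Python, function name hypothetical):

```python
# Counting hypotheses: a Boolean function of n attributes is a truth table
# with 2**n rows, each row independently T or F.
def num_boolean_functions(n):
    return 2 ** (2 ** n)

print(num_boolean_functions(2))  # 16 functions of 2 attributes
print(num_boolean_functions(6))  # the 18,446,744,073,709,551,616 from the slide
```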
![Page 21: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/21.jpg)
Decision tree learning
• If there are so many possible trees, can we actually search this space? (Solution: greedy search.)
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose the "most significant" attribute as root of each (sub)tree.
![Page 22: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/22.jpg)
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
• Patrons or Type? After splitting on Type, to wait or not to wait is still at 50%.
![Page 23: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/23.jpg)
Using information theory
• Entropy measures the amount of uncertainty in a probability distribution:

Consider tossing a biased coin. If you toss the coin very often, the frequency of heads is, say, p, and hence the frequency of tails is 1 − p (a fair coin has p = 0.5).

The uncertainty in any actual outcome is given by the entropy:

  Entropy(p) = −p log p − (1 − p) log(1 − p)

Note that the uncertainty is zero if p = 0 or 1, and maximal if p = 0.5.
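The binary entropy just described can be sketched as a small helper (Python; `binary_entropy` is a hypothetical name, logs base 2 so the result is in bits):

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a coin with P(heads) = p."""
    if p == 0 or p == 1:
        return 0.0                      # no uncertainty at the extremes
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

For a fair coin `binary_entropy(0.5)` is exactly 1 bit, and the value falls off toward 0 as the coin becomes more biased.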
![Page 24: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/24.jpg)
Using information theory
• If there are more than two states s = 1, 2, ..., n (e.g. a die), we have:

  Entropy(p) = −[ p(s=1) log p(s=1) + p(s=2) log p(s=2) + ... + p(s=n) log p(s=n) ]

  where the probabilities sum to one: Σ_{s=1}^{n} p(s) = 1.
![Page 25: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/25.jpg)
Using information theory
• Imagine we have p examples which are true (positive) and n examples which are false (negative).
• Our best estimate of true or false is given by:

  P(true) ≈ p / (p + n)
  P(false) ≈ n / (p + n)

• Hence the entropy is given by:

  Entropy( p/(p+n), n/(p+n) ) = −( p/(p+n) ) log( p/(p+n) ) − ( n/(p+n) ) log( n/(p+n) )
![Page 26: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/26.jpg)
Using Information Theory
• How much information do we gain if we disclose the value of some attribute?
• Answer:
uncertainty before – uncertainty after
![Page 27: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/27.jpg)
Example
Before: Entropy = −½ log(½) − ½ log(½) = log 2 = 1 bit: there is “1 bit of information to be discovered”.
After, for Type: if we go into branch “French” we have 1 bit, and similarly for the others. French: 1 bit, Italian: 1 bit, Thai: 1 bit, Burger: 1 bit. On average: 1 bit! We gained nothing!
After, for Patrons: in branches “None” and “Some” the entropy is 0! In “Full”, entropy = −(1/3) log(1/3) − (2/3) log(2/3).
So Patrons gains more information!
![Page 28: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/28.jpg)
Information Gain
• How do we combine branches?
1/6 of the time we enter “None”, so we weight “None” with 1/6. Similarly, “Some” has weight 1/3 and “Full” has weight 1/2.
  Entropy(A) = Σ_{i=1}^{n} ( (p_i + n_i)/(p + n) ) · Entropy( p_i/(p_i+n_i), n_i/(p_i+n_i) )

  where (p_i + n_i)/(p + n) is the weight for each branch, and the Entropy term is the entropy of each branch.
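A minimal sketch of this weighted average (Python; `entropy2` and `remainder` are hypothetical helper names, using the Patrons branch counts from the restaurant data):

```python
import math

def entropy2(p, n):
    """Entropy (bits) of a node with p positive and n negative examples."""
    if p == 0 or n == 0:
        return 0.0
    q = p / (p + n)
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def remainder(branches):
    """Weighted average entropy after a split; branches = [(p_i, n_i), ...]."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * entropy2(p, n) for p, n in branches)

# Patrons split from the restaurant data: None=(0,2), Some=(4,0), Full=(2,4)
after_patrons = remainder([(0, 2), (4, 0), (2, 4)])   # ≈ 0.459 bits
```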
![Page 29: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/29.jpg)
Information gain
• Information Gain (IG) or reduction in entropy from the attribute test:
• Choose the attribute with the largest IG
  IG(A) = Entropy(before) − Entropy(after)
![Page 30: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/30.jpg)
Information gain
For the training set, p = n = 6, I(6/12, 6/12) = 1 bit
Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root
  IG(Patrons) = 1 − [ (2/12) I(0,1) + (4/12) I(1,0) + (6/12) I(2/6, 4/6) ] ≈ 0.541 bits

  IG(Type) = 1 − [ (2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4) ] = 0 bits
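The two gains above can be reproduced with a short sketch (Python; `info_gain` is a hypothetical helper; branch counts (p_i, n_i) taken from the 12 restaurant examples):

```python
import math

def entropy2(p, n):
    """Entropy (bits) of a node with p positive and n negative examples."""
    if p == 0 or n == 0:
        return 0.0
    q = p / (p + n)
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def info_gain(parent, branches):
    """IG = entropy before the split minus weighted entropy after it."""
    p, n = parent
    after = sum((pi + ni) / (p + n) * entropy2(pi, ni) for pi, ni in branches)
    return entropy2(p, n) - after

# branch counts from the 12 restaurant examples
ig_patrons = info_gain((6, 6), [(0, 2), (4, 0), (2, 4)])          # ≈ 0.541 bits
ig_type = info_gain((6, 6), [(1, 1), (1, 1), (2, 2), (2, 2)])     # = 0 bits
```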
![Page 31: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/31.jpg)
Example contd.
• Decision tree learned from the 12 examples:
• Substantially simpler than the “true” tree: a more complex hypothesis isn’t justified by the small amount of data.
![Page 32: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/32.jpg)
What to Do if...
• Some leaf has no examples: choose True or False according to the number of positive/negative examples at its parent.
• There are no attributes left, but two or more examples have the same attributes and different labels: we have an error/noise. Stop and use a majority vote.
Demo:http://www.cs.ubc.ca/labs/lci/CIspace/Version4/dTree/
![Page 33: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/33.jpg)
Boosting
• Main idea:
– train classifiers (e.g. decision trees) in a sequence.
– a new classifier should focus on those cases which were incorrectly classified in the last round.
– combine the classifiers by letting them vote on the final prediction.
– each classifier could be (should be) very “weak”, e.g. a DT with only one node: a decision stump.
![Page 34: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/34.jpg)
Example
this line is one simple classifier, saying that everything to the left is + and everything to the right is −
![Page 35: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/35.jpg)
Decision Stump
• Data: {(X1,Y1), (X2,Y2), ..., (Xn,Yn)}
  X_i: attributes (e.g. temperature outside)
  Y_i: label (e.g. True or False; 0 or 1; −1 or +1)

  f(x) = +1 if X_i > threshold, −1 if X_i ≤ threshold
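A decision stump is a one-line rule; a minimal sketch (Python; the temperature attribute and threshold are hypothetical examples):

```python
def stump(x_i, threshold):
    """Decision stump: predict +1 if the attribute exceeds the threshold, else -1."""
    return 1 if x_i > threshold else -1

# e.g. "is the temperature outside above 20 degrees?"
labels = [stump(t, 20) for t in (15, 25, 20, 31)]
```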
![Page 36: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/36.jpg)
The Algorithm in Detail
(Z,H) = AdaBoost[Xtrain, Ytrain, Rounds]
weights = 1/N (N = # data cases)
• For r = 1 to Rounds do
  • H[r] = LearnWeakClassifier[Xtrain, Ytrain, weights]
  • error = 0
  • For i = 1 to N do
    • If H[r, Xtrain(i)] ≠ Ytrain(i): error = error + weight(i)
  • For i = 1 to N do
    • If H[r, Xtrain(i)] = Ytrain(i): weight(i) = weight(i) · error/(1 − error)
  • Normalize weights
  • Z[r] = log[(1 − error)/error]
Final classifier:

  F(x) = sign( Σ_{r=1}^{R} Z(r) H(r, x) )

where each H(r, x) is one of the weak classifiers (a decision stump on a single attribute, returning ±1).
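The loop above can be sketched end to end for 1-D data with threshold stumps (Python; all names are hypothetical; this follows the slide's weighting scheme rather than any particular library):

```python
import math

def fit_stump(xs, ys, ws):
    """Weighted-error-minimizing threshold stump on 1-D data; ys in {-1, +1}."""
    best = (float("inf"), 0.0, 1)
    for t in sorted(set(xs)):
        for s in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, ws)
                      if (s if x > t else -s) != y)
            if err < best[0]:
                best = (err, t, s)
    return best[1], best[2]

def adaboost(xs, ys, rounds):
    n = len(xs)
    ws = [1.0 / n] * n
    model = []                                      # list of (Z, threshold, sign)
    for _ in range(rounds):
        t, s = fit_stump(xs, ys, ws)
        preds = [s if x > t else -s for x in xs]
        error = sum(w for w, p, y in zip(ws, preds, ys) if p != y)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard the log
        # down-weight the correctly classified cases, as on the slide
        ws = [w * error / (1 - error) if p == y else w
              for w, p, y in zip(ws, preds, ys)]
        total = sum(ws)
        ws = [w / total for w in ws]
        model.append((math.log((1 - error) / error), t, s))
    return model

def predict(model, x):
    vote = sum(z * (s if x > t else -s) for z, t, s in model)
    return 1 if vote >= 0 else -1
```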
![Page 37: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/37.jpg)
And in a Picture
training case correctly classified
training case has large weight in this round
this DT has a strong vote
![Page 38: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/38.jpg)
And in animation
Original Training set : Equal Weights to all training samples
Taken from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire
![Page 39: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/39.jpg)
AdaBoost(Example)
ROUND 1
![Page 40: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/40.jpg)
AdaBoost(Example)
ROUND 2
![Page 41: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/41.jpg)
AdaBoost(Example)
ROUND 3
![Page 42: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/42.jpg)
AdaBoost(Example)
![Page 43: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/43.jpg)
k-Nearest Neighbors
• Another simple classification algorithm
• Idea: look around you to see how your neighbors classify data.
• Classify a new data-point according to a majority vote of your k-nearest neighbors.
![Page 44: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/44.jpg)
k-NN decision line
![Page 45: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/45.jpg)
Distance Metric
• How do we measure what it means to be “close”?
• Depending on the problem we should choose an appropriate distance metric.
  D(x_n, x_m) = Σ_i | x_{n,i} − x_{m,i} |   {x discrete}: Hamming distance
  D(x_n, x_m) = sqrt( (x_n − x_m)^T A (x_n − x_m) )   {x continuous}: scaled Euclidean distance
![Page 46: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/46.jpg)
Algorithm in Detail
• For a new test example:
• Find the k closest neighbors
• Each neighbor predicts the class to be its own class
• Take majority vote.
Demo: http://www.comp.lancs.ac.uk/~kristof/research/notes/nearb/cluster.html
Demo: MATLAB
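These steps can be sketched for 1-D data (Python; `knn_classify` is a hypothetical helper using absolute distance as the metric):

```python
from collections import Counter

def knn_classify(train, x, k, dist=lambda a, b: abs(a - b)):
    """train: list of (attribute, label) pairs; 1-D for simplicity.
    Find the k closest neighbors, then take a majority vote."""
    nearest = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Swapping in a different `dist` (e.g. Hamming for discrete attributes) changes the notion of “close” without touching the voting logic.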
![Page 47: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/47.jpg)
Logistic Regression / Perceptron
• Fits a soft decision boundary between the classes.
1 dimension 2 dimensions
![Page 48: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/48.jpg)
The logit / sigmoid
  f(x) = 1 / ( 1 + exp(−(Wx + b)) )

W determines the angle and the steepness; b determines the offset.
![Page 49: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/49.jpg)
Objective
• We interpret f(x) as the probability of classifying a data case as positive.
• We want to maximize the total probability of the data-vectors:
  O = Σ_{positive examples (y_n = 1)} log f(x_n) + Σ_{negative examples (y_n = 0)} log( 1 − f(x_n) )
![Page 50: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/50.jpg)
Algorithm in detail
• Repeat until convergence (gradient ascent):

  W ← W + η ∂O/∂W
  b ← b + η ∂O/∂b

  ∂O/∂W = Σ_{positive examples} ( 1 − f(x_n) ) x_n − Σ_{negative examples} f(x_n) x_n
  ∂O/∂b = Σ_{positive examples} ( 1 − f(x_n) ) − Σ_{negative examples} f(x_n)

(Or better: Newton steps)
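A minimal sketch of this gradient ascent for 1-D inputs (Python; names hypothetical; note the positive and negative sums combine into the single term (y_n − f(x_n)) when y is in {0, 1}):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, steps=2000):
    """Gradient ascent on the log-likelihood O for 1-D inputs; ys in {0, 1}."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # (y - f(x)) is (1 - f) on positives and -f on negatives
        dw = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
        db = sum(y - sigmoid(w * x + b) for x, y in zip(xs, ys))
        w += lr * dw
        b += lr * db
    return w, b
```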
![Page 51: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/51.jpg)
Cross-validation
• You are ultimately interested in good performance on new (unseen) test data.
• To estimate that, split off a (smallish) subset of the training data (called the validation set).
• Train without validation data and “test” on validation data.
• This will give an indication of how long to run boosting.
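The hold-out step above can be sketched as (Python; `split_validation` is a hypothetical helper):

```python
import random

def split_validation(data, frac=0.2, seed=0):
    """Hold out a fraction of the training data as a validation set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[cut:], shuffled[:cut]   # (train, validation)
```

Training uses only the first list; the held-out second list stands in for unseen test data, e.g. to pick the number of boosting rounds.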
![Page 52: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/52.jpg)
Project
• Demo
• For this project I will obtain data from a company that wants to predict the activity of a chemical compound.
• Training data will be provided
• Test data will not!
• May the best team win this competition.
![Page 53: Learning from Observations](https://reader035.fdocuments.net/reader035/viewer/2022070402/568138f1550346895da0a6f4/html5/thumbnails/53.jpg)
Summary
• Learning agent = performance element + learning element
• For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples
• Decision tree learning + boosting
• Learning performance = prediction accuracy measured on test set