CMPU 365 · Artificial Intelligence · cs365/lectures/2020-11-04.pdf
Perceptrons
4 November 2020
CMPU 365 · Artificial Intelligence
https://twitter.com/Adequate_Scott/status/1323978722731544577
Where are we?
Hello,
Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
# free : 2 YOUR_NAME : 0 MISSPELLED : 2 FROM_FRIEND : 0 ...
SPAM or +
PIXEL-7,12 : 1 PIXEL-7,13 : 0 ... NUM_LOOPS : 1 ...
“2”
Inputs x → Features f(x) → Outputs y
Data: Labeled instances – inputs and outputs (x, y)
Features: attribute–value pairs that characterize each input x.
Parameters: Numeric parts of the model that determine how each x is mapped to an output y.
Experimentation cycle: Learn (train) parameters on training set
(Tune hyperparameters on held-out set)
Evaluate on test set
E.g., accuracy: fraction of instances predicted correctly
Training data
Held-out data
Test data
Aside: Alternative ways of splitting data
What if you choose an unlucky split into training and testing, where the data sets are very different?
Aren’t we wasting data?
Alternative: k-fold cross validation! Repeat k times:
Partition into train (n − n/k) and test (n/k) data sets
Train on training set; test on test set
Average results across k choices of test set
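The loop above can be sketched in plain Python. This is a minimal illustration; the data, the majority-class "learner", and the accuracy computation are made-up stand-ins for a real classifier:

```python
# k-fold cross-validation: each of the k folds serves as the test set once.
def k_fold_splits(data, k):
    n = len(data)
    fold_size = n // k
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test

# Toy "learner": always predict the most common label in the training set.
def majority_label(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

data = [(x, "spam" if x % 2 else "ham") for x in range(12)]
accuracies = []
for train, test in k_fold_splits(data, k=4):
    guess = majority_label(train)
    accuracies.append(sum(1 for _, y in test if y == guess) / len(test))
avg = sum(accuracies) / len(accuracies)
```

Averaging over the k test folds gives a less split-dependent estimate than a single lucky (or unlucky) train/test split.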
Naïve Bayes
Bayes’ Rule lets us do diagnostic queries with causal probabilities.
The Naïve Bayes assumption takes all features Fi to be independent given the class label, Y.
We can build classifiers out of a Naïve Bayes model using training data.
Y
F1 F2 … Fn
P(Y ∣ f1, …, fn) ∝ P(Y) ∏i P(fi ∣ Y)
Generalization and overfitting
We want a classifier which does well on test data.
Dangers:
Underfitting: fitting the training set poorly.
Overfitting: fitting the training data very closely, but not generalizing well.
[Plot: a degree-15 polynomial fit to training data points, illustrating overfitting]
Overfitting
Example: Overfitting
“2” wins!
Example: Overfitting
Posteriors are determined by relative probabilities (odds ratios):
P(W ∣ ham)/P(W ∣ spam): "south-west": inf, "nation": inf, "morally": inf, "nicely": inf, "extent": inf, "seriously": inf, ...
P(W ∣ spam)/P(W ∣ ham): "screens": inf, "minute": inf, "guaranteed": inf, "$205.00": inf, "delivery": inf, "signature": inf, ...
What went wrong here?
Generalization and overfitting
Relative frequency parameters will overfit the training data!
Just because we never saw a 3 with pixel (15, 15) on during training doesn’t mean we won’t see it at test time.
Unlikely that every occurrence of “minute” is 100% spam.
Unlikely that every occurrence of “seriously” is 100% ham.
What about all the words that don’t occur in the training set at all?
In general, we can’t go around giving unseen events zero probability.
Generalization and overfitting
As an extreme case, imagine using the entire email as the only feature.
Would get the training data perfect (if deterministic labeling).
Wouldn’t generalize at all.
Just making the bag-of-words assumption gives us some generalization, but isn’t enough.
To generalize better, we need to smooth or regularize the estimates.
Parameter estimation
Maximum likelihood estimation (MLE) infers the parameter value (θ) that maximizes the probability of the observed data.
Maximum likelihood estimation typically assumes: Each sample is drawn from the same distribution. That is, each instance is identically distributed.
Each sample is conditionally independent of the others given the parameters for our distribution.
This is a strong assumption, but it helps simplify the problem of MLE – and generally works in practice.
All possible values of θ are equally likely before we’ve seen any data (uniform prior).
independent, identically distributed (i.i.d.)
Imagine you have a bag of blue and red balls. You draw samples by taking a ball out, recording its color, and then putting the ball back.
If you sample red, red, blue, that suggests that 2/3 of the balls in the bag are red and 1/3 are blue.
r r b
PML(r) = 2/3
Consider our spam classifier.
The maximum likelihood estimate for the parameter P(Wi = 1 | Y = ham) corresponds to just counting the number of legitimate emails that word i appears in and dividing it by the total number of legitimate emails.
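That counting estimate is a one-liner. A toy sketch (the `ham_emails` data, represented as sets of words, is invented for illustration):

```python
# MLE for P(W_i = 1 | Y = ham): the fraction of ham emails containing word i.
ham_emails = [
    {"hello", "meeting", "tomorrow"},
    {"hello", "lunch"},
    {"report", "meeting"},
]

def p_word_given_ham(word):
    return sum(1 for email in ham_emails if word in email) / len(ham_emails)
```

Note that a word never seen in ham (say, "pizza" here) gets probability exactly 0, which is the problem smoothing addresses next.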
Smoothing
Laplace smoothing
Pretend you saw every outcome once more than you actually did.
r r b
Laplace smoothing
Laplace’s estimate (extended): pretend you saw every outcome k extra times.
What’s Laplace with k = 0?
k is the strength of the prior.
Laplace for conditionals: smooth each condition independently.
r r b
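Laplace's extended estimate fits in a small function. A minimal sketch (the function name and count-dictionary representation are our own; k = 0 recovers the maximum-likelihood estimate):

```python
# Laplace smoothing: add k pseudo-counts to every possible outcome.
def laplace(counts, outcome, k=1):
    total = sum(counts.values())
    num_outcomes = len(counts)
    return (counts[outcome] + k) / (total + k * num_outcomes)

# The r r b sample from the slides: 2 red, 1 blue.
counts = {"r": 2, "b": 1}
# laplace(counts, "r", k=0) -> 2/3 (the MLE)
# laplace(counts, "r", k=1) -> 3/5 (pulled toward uniform)
```

As k grows, the estimate is pulled further toward the uniform distribution, which is what "k is the strength of the prior" means.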
Real Naïve Bayes: Smoothing
For real classification problems, smoothing is critical
New odds ratios:
P(W ∣ ham)/P(W ∣ spam): "helvetica": 11.4, "seems": 10.8, "group": 10.2, "ago": 8.4, "areas": 8.3, ...
P(W ∣ spam)/P(W ∣ ham): "verdana": 28.8, "Credit": 28.4, "ORDER": 27.2, "<FONT>": 26.9, "money": 26.5, ...
Do these make more sense?
Now we’ve got two kinds of unknowns.
Parameters: the probabilities P(X ∣ Y), P(Y)
Hyperparameters: e.g., the amount and type of smoothing to do, k
What should we learn where?
We learn parameters from the training data.
We tune the hyperparameters on a different set: the held-out data.
Choose the best value and do a final test on the test data.
Tuning on held-out data
Error-driven classification
Examples of errors
Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . .
. . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . .
What to do about errors?
Problem: There’s still spam in your inbox.
Need more features – words aren’t enough!
Have you emailed the sender before?
Have a million other people just gotten the same email?
Is the sending information consistent?
Is the email in ALL CAPS?
Do inline URLs point where they say they point?
Does the email address you by name?
Naïve Bayes models can incorporate a variety of features, but they tend to do best in homogeneous cases, e.g., all features are word occurrences.
Linear classifiers
Feature vectors
Hello,
Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
# free : 2 YOUR_NAME : 0 MISSPELLED : 2 FROM_FRIEND : 0 ...
SPAM or +
PIXEL-7,12 : 1 PIXEL-7,13 : 0 ... NUM_LOOPS : 1 ...
“2”
x f(x) y
Loose inspiration
[Diagram: a human neuron (dendrites, cell body or soma, nucleus, axon, axonal arborization, synapses, axon from another cell) alongside artificial neurons]
Linear classifiers
Inputs are feature values, fi
Each feature has a weight, wi
Weighted sum is the activation:
If the activation is > 0, output label “+1”; if it is < 0, output “−1”.
[Diagram: features f1, f2, f3 feed a weighted sum Σ with weights w1, w2, w3, followed by a > 0? test]
activation_w(x) = ∑i wi ⋅ fi(x) = w ⋅ f(x)
Recall: Dot product
Given two vectors of equal length, the dot product (or scalar product) produces a scalar (a single number) by multiplying the corresponding entries in the vectors: a = [a1, a2, …, an]
b = [b1, b2, …, bn]
a · b = a1b1 + a2b2 + ⋯ + anbn
a = [1, 2, 3]
b = [4, 5, 6]
# Version 1
sum(map(lambda x: x[0] * x[1], zip(a, b)))  # 32
# Version 2
sum([x * y for x, y in zip(a, b)])  # 32
Weights
Binary case: Compare features to a weight vector.
Learning: Figure out the weight vector from examples.
w: # free : 4, YOUR_NAME : -1, MISSPELLED : 1, FROM_FRIEND : -3, ...
f(x1): # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ...
f(x2): # free : 0, YOUR_NAME : 1, MISSPELLED : 1, FROM_FRIEND : 1, ...
w · f being positive means the positive class/label.
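With the weight vector above, the two example emails can be scored directly. A sketch using dictionaries as sparse vectors (the numbers are the ones on the slide):

```python
# Score = w · f(x); a positive activation means the positive class (spam).
w = {"# free": 4, "YOUR_NAME": -1, "MISSPELLED": 1, "FROM_FRIEND": -3}
f_x1 = {"# free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}
f_x2 = {"# free": 0, "YOUR_NAME": 1, "MISSPELLED": 1, "FROM_FRIEND": 1}

def dot(w, f):
    return sum(w[key] * value for key, value in f.items())

# dot(w, f_x1) = 4*2 + 1*2 = 10       -> positive: spam
# dot(w, f_x2) = -1 + 1 - 3 = -3      -> negative: not spam
```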
Decision rules
Key idea: the decision boundary, the boundary at which the label changes.
[Plot: positive (+) and negative (−) examples in the space (f(x1), f(x2)), with the weight vector w and the boundary at which the label changes]
Angle between f(x) and w is < 90 degrees: x is classified as positive for the label.
Angle between f(x) and w is > 90 degrees: x is classified as negative for the label.
Binary decision rule
In the space of feature vectors:
Examples are points.
Any weight vector defines a hyperplane decision boundary.
One side corresponds to Y = +1; the other corresponds to Y = −1.
Add a bias feature (value always 1) to allow shifted decision boundaries.
w: BIAS : -3, free : 4, money : 2, ...
[Plot: axes free (0 to 1) and money (0 to 2); the line f ⋅ w = 0 separates +1 = spam from −1 = ham]
Weight updates
Learning: Binary perceptron
Start with weights = 0.
Cycle through training examples repeatedly.
For each example, classify with current weights:
y = +1 if w ⋅ f(x) ≥ 0; y = −1 if w ⋅ f(x) < 0
If correct (i.e., y = y*), no change!
If wrong, adjust the weight vector by adding or subtracting the feature vector (subtract if y* is −1):
w = w + y* ⋅ f(x)
[Diagram: w shifted by y* ⋅ f(x)]
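The whole update rule fits in a few lines. A sketch on invented toy data (list positions play the role of features, with a bias feature in position 0):

```python
# Binary perceptron: on every mistake, w += y* f(x).
def predict(w, f):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) >= 0 else -1

def train_perceptron(examples, num_features, passes=10):
    w = [0.0] * num_features
    for _ in range(passes):
        for f, y_star in examples:
            if predict(w, f) != y_star:  # mistake: move w toward y* f(x)
                w = [wi + y_star * fi for wi, fi in zip(w, f)]
    return w

# Toy data: [bias, feature] pairs, separable (positive iff feature > 2).
examples = [([1, 1], -1), ([1, 2], -1), ([1, 3], +1), ([1, 4], +1)]
w = train_perceptron(examples, num_features=2)
# After training, w classifies every example correctly.
```

Because the toy data is separable, the loop stops making mistakes after a few passes, exactly as the convergence property later in the lecture promises.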
Examples: Perceptron
Multi-class decision rule
If we have more than two classes, keep a weight vector wy for each class.
Score (activation) of a class y: wy ⋅ f(x)
Prediction: the highest score wins: y = arg maxy wy ⋅ f(x)
Binary is multi-class where the negative class has weight zero.
[Diagram: three regions, one where each of w1 ⋅ f(x), w2 ⋅ f(x), w3 ⋅ f(x) is biggest]
Learning: Multi-class perceptron
Start with all weights = 0.
Pick up training examples one by one. Predict with current weights:
y = arg maxy wy ⋅ f(x)
If correct, no change!
If wrong, lower the score of the wrong label (y) and raise the score of the right label (y*):
wy = wy − f(x)
wy* = wy* + f(x)
Example: Multiclass perceptron
Start: wSPORTS = [BIAS: 1, win: 0, game: 0, vote: 0, the: 0]; wPOLITICS and wTECH are all zeros.

“win the vote”: f(x) = [BIAS: 1, win: 1, game: 0, vote: 1, the: 1]; y* = POLITICS, but the prediction is y = SPORTS. Wrong, so update:
wSPORTS = [BIAS: 0, win: -1, game: 0, vote: -1, the: -1]
wPOLITICS = [BIAS: 1, win: 1, game: 0, vote: 1, the: 1]

“win the election”: f(x) = [BIAS: 1, win: 1, game: 0, vote: 0, the: 1]; y* = POLITICS and y = POLITICS. Correct: no change.

“win the game”: f(x) = [BIAS: 1, win: 1, game: 1, vote: 0, the: 1]; y* = SPORTS, but y = POLITICS. Wrong, so update:
wSPORTS = [BIAS: 1, win: 0, game: 1, vote: -1, the: 0]
wPOLITICS = [BIAS: 0, win: 0, game: -1, vote: 1, the: 0]
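The worked trace above can be reproduced in code. A sketch (the dictionary representation and tie-breaking toward the first label are implementation choices of ours):

```python
from collections import defaultdict

# Multi-class perceptron: predict with argmax, adjust two vectors on a mistake.
def score(w, f):
    return sum(w[key] * value for key, value in f.items())

def update(weights, f, y_star):
    y = max(weights, key=lambda label: score(weights[label], f))
    if y != y_star:
        for key, value in f.items():
            weights[y][key] -= value       # lower the wrong label's score
            weights[y_star][key] += value  # raise the right label's score
    return y

weights = {
    "SPORTS": defaultdict(float, {"BIAS": 1}),
    "POLITICS": defaultdict(float),
    "TECH": defaultdict(float),
}
f1 = {"BIAS": 1, "win": 1, "vote": 1, "the": 1}  # "win the vote"
f2 = {"BIAS": 1, "win": 1, "the": 1}             # "win the election"
f3 = {"BIAS": 1, "win": 1, "game": 1, "the": 1}  # "win the game"

y1 = update(weights, f1, "POLITICS")  # predicts SPORTS: wrong, weights shift
y2 = update(weights, f2, "POLITICS")  # predicts POLITICS: correct, no change
y3 = update(weights, f3, "SPORTS")    # predicts POLITICS: wrong again
```

After the three examples, the weights match the final column of the trace, e.g. wSPORTS[game] = 1 and wPOLITICS[game] = −1.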
Properties of perceptrons
Separability: true if some parameters get the training set perfectly correct.
Convergence: if the training set is separable, the perceptron will eventually converge (binary case).
Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability:
mistakes < k/δ²
where k is the number of features and δ is the margin (the thickness of the thickest separating line).
[Plots: separable vs. non-separable data]
Classification: Comparison
Naïve Bayes:
Builds a model of the training data
Gives prediction probabilities
Strong assumptions about feature independence
One pass through data (counting)
Perceptrons:
Make fewer assumptions about the data
Error-driven learning
Multiple passes through data (prediction)
Often more accurate
Next time: Improving the perceptron
Acknowledgments
The lecture incorporates material from: Dan Klein and Pieter Abbeel, University of California, Berkeley; ai.berkeley.edu
George Konidaris, Brown University
Peter Norvig and Stuart Russell, Artificial Intelligence: A Modern Approach
Ketrina Yim (illustrations)