Transcript of CMPU 365 · Artificial Intelligence lecture slides: cs365/lectures/2020-11-04.pdf

Page 1:

Perceptrons

4 November 2020

CMPU 365 · Artificial Intelligence

Page 2:

https://twitter.com/Adequate_Scott/status/1323978722731544577

Page 3:

Page 4:

Where are we?

Page 5:

Hello,

Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just

# free : 2 YOUR_NAME : 0 MISSPELLED : 2 FROM_FRIEND : 0 ...

SPAM or +

PIXEL-7,12 : 1 PIXEL-7,13 : 0 ... NUM_LOOPS : 1 ...

“2”

x → f(x) → y   (inputs → features → outputs)

Page 6:

Data: Labeled instances – inputs and outputs (x, y)

Features: attribute–value pairs that characterize each input x.

Parameters: Numeric parts of the model that determine how each x is mapped to an output y.

Page 7:

Experimentation cycle: Learn (train) parameters on training set

(Tune hyperparameters on held-out set)

Evaluate on test set

E.g., accuracy: fraction of instances predicted correctly

[Diagram: the data split into training data, held-out data, and test data]

Page 8:

Aside: Alternative ways of splitting data

What if you choose an unlucky split into training and testing, where the data sets are very different?

Aren’t we wasting data?

Alternative: k-fold cross validation! Repeat k times:

Partition into train (n − n/k) and test (n/k) data sets

Train on training set; test on test set

Average results across k choices of test set
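As a concrete sketch of this procedure in Python (illustrative only – the train_and_test callback stands in for whatever learner is being evaluated):

import random

def k_fold_cross_validation(data, train_and_test, k=10, seed=0):
    """Average test accuracy over k different train/test splits."""
    data = data[:]                          # copy, so we don't reorder the caller's list
    random.Random(seed).shuffle(data)
    fold_size = len(data) // k
    accuracies = []
    for i in range(k):
        # Fold i (n/k instances) is the test set; the other n - n/k train.
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        accuracies.append(train_and_test(train, test))
    return sum(accuracies) / k              # average across the k choices of test set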

Page 9:

Naïve Bayes

Bayes’ Rule lets us do diagnostic queries with causal probabilities.

The Naïve Bayes assumption takes all features Fi to be independent given the class label, Y.

We can build classifiers out of a Naïve Bayes model using training data.

[Diagram: Naïve Bayes model – class node Y with feature nodes F1, F2, …, Fn]

P(Y ∣ f1, …, fn) ∝ P(Y) ∏i P(fi ∣ Y)
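A minimal sketch of this decision rule (the dictionary representation is our own; we work in log space, as real implementations do, so the product of many small probabilities doesn’t underflow):

import math

def nb_classify(priors, cond_probs, observed):
    """Return the label y maximizing P(y) * prod_i P(f_i | y).

    priors:     {label: P(Y = label)}
    cond_probs: {label: {feature: P(feature present | Y = label)}}
    observed:   the features f_1, ..., f_n present in this instance
    """
    def log_score(y):
        return math.log(priors[y]) + sum(math.log(cond_probs[y][f])
                                         for f in observed)
    return max(priors, key=log_score)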

Page 10:

Generalization and overfitting

Page 11:

We want a classifier that does well on test data.

Dangers:

Underfitting: fits the training set poorly.

Overfitting: fits the training data very closely but doesn’t generalize well to unseen data.

Page 12:

[Plot: “Overfitting” – a fitted curve through data points; x axis 0–20, y axis −15–30]

Page 13:

[Plot: “Overfitting” – a degree 15 polynomial fit to the same points; x axis 0–20, y axis −15–30]

Pages 14–20:

Example: Overfitting

“2” wins!

Page 21:

Example: Overfitting

Posteriors determined by relative probabilities (odds ratios):

P(W ∣ ham) / P(W ∣ spam)   and   P(W ∣ spam) / P(W ∣ ham)

Page 22:

Example: Overfitting

Posteriors determined by relative probabilities (odds ratios):

"south-west": inf "nation": inf "morally": inf "nicely": inf "extent": inf "seriously": inf ...

What went wrong here?

"screens": inf "minute": inf "guaranteed": inf "$205.00": inf "delivery": inf "signature": inf ...

P(W ∣ ham)P(W ∣ spam)

P(W ∣ spam)P(W ∣ ham)

Page 23:

Generalization and overfitting

Relative frequency parameters will overfit the training data!

Just because we never saw a 3 with pixel (15, 15) on during training doesn’t mean we won’t see it at test time.

Unlikely that every occurrence of “minute” is 100% spam.

Unlikely that every occurrence of “seriously” is 100% ham.

What about all the words that don’t occur in the training set at all?

In general, we can’t go around giving unseen events zero probability.

Page 24:

Generalization and overfitting

As an extreme case, imagine using the entire email as the only feature.

Would get the training data perfect (if deterministic labeling).

Wouldn’t generalize at all.

Just making the bag-of-words assumption gives us some generalization, but isn’t enough.

To generalize better, we need to smooth or regularize the estimates.

Page 25:

Parameter estimation

Page 26:

Maximum likelihood estimation (MLE) infers the parameter value θ that maximizes the probability of the observed data.

Page 27:

Maximum likelihood estimation typically assumes:

Each sample is drawn from the same distribution; that is, each instance is identically distributed.

Each sample is conditionally independent of the others, given the parameters for our distribution.

Together, the samples are independent, identically distributed (i.i.d.). This is a strong assumption, but it helps simplify the problem of MLE – and generally works in practice.

All possible values of θ are equally likely before we’ve seen any data (uniform prior).

Page 28:

Imagine you have a bag of blue and red balls. You draw samples by taking a ball out, recording its color, and then putting the ball back.

If you sample red, red, blue, that suggests that 2/3 of the balls in the bag are red and 1/3 are blue.

r r b

PML(r) = 2/3

Page 29:

Consider our spam classifier.

The maximum likelihood estimate for the parameter P(Wi = 1 | Y = ham) corresponds to just counting the number of legitimate emails that word i appears in and dividing it by the total number of legitimate emails.
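In code, that count-and-divide estimate might look like this (a sketch; the (words, label) pair representation is our own):

def mle_word_given_ham(word, training_emails):
    """MLE of P(W_i = 1 | Y = ham): count the ham emails containing
    word i, then divide by the total number of ham emails."""
    ham = [words for words, label in training_emails if label == "ham"]
    return sum(1 for words in ham if word in words) / len(ham)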

Page 30:

Smoothing

Pages 31–33:

Laplace smoothing

Pretend you saw every outcome once more than you actually did.

r r b

Pages 34–37:

Laplace smoothing

Laplace’s estimate (extended): Pretend you saw every outcome k extra times.

What’s Laplace with k = 0?

k is the strength of the prior.

Laplace for conditionals: Smooth each condition independently.

r r b
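A minimal sketch of Laplace’s extended estimate, assuming the standard formula P(x) = (count(x) + k) / (N + k·|X|); for conditionals, you would apply the same function once per condition:

def laplace_estimate(counts, outcomes, k=1):
    """Pretend every outcome was seen k extra times.
    k = 0 recovers the maximum likelihood (relative frequency) estimate."""
    n = sum(counts.get(x, 0) for x in outcomes)
    denom = n + k * len(outcomes)
    return {x: (counts.get(x, 0) + k) / denom for x in outcomes}

# The running example: we sampled r, r, b from the bag.
print(laplace_estimate({"r": 2, "b": 1}, ["r", "b"], k=0))  # MLE (k = 0): r -> 2/3, b -> 1/3
print(laplace_estimate({"r": 2, "b": 1}, ["r", "b"], k=1))  # Laplace (k = 1): r -> 3/5, b -> 2/5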

Page 38:

Real Naïve Bayes: Smoothing

For real classification problems, smoothing is critical

New odds ratios:

P(W ∣ ham) / P(W ∣ spam):
"helvetica": 11.4
"seems": 10.8
"group": 10.2
"ago": 8.4
"areas": 8.3
...

P(W ∣ spam) / P(W ∣ ham):
"verdana": 28.8
"Credit": 28.4
"ORDER": 27.2
"<FONT>": 26.9
"money": 26.5
...

Do these make more sense?

Page 39:

Tuning on held-out data

Now we’ve got two kinds of unknowns:

Parameters: the probabilities P(X ∣ Y), P(Y)

Hyperparameters: e.g., the amount/type of smoothing to do, k

What should we learn where?

We learn parameters from the training data.

We need to tune the hyperparameters using a different set – the held-out data.

Choose the best value and do a final test on the test data.

Page 40:

Error-driven classification

Page 41:

Examples of errors

Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . .

. . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . .

Page 42:

What to do about errors?

Problem: There’s still spam in your inbox.

Need more features – words aren’t enough!

Have you emailed the sender before?

Have a million other people just gotten the same email?

Is the sending information consistent?

Is the email in ALL CAPS?

Do inline URLs point where they say they point?

Does the email address you by name?

Naïve Bayes models can incorporate a variety of features, but they tend to do best in homogeneous cases, e.g., all features are word occurrences.

Page 43:

Linear classifiers

Pages 44–45:

Feature vectors

Hello,

Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just

# free : 2 YOUR_NAME : 0 MISSPELLED : 2 FROM_FRIEND : 0 ...

SPAM or +

PIXEL-7,12 : 1 PIXEL-7,13 : 0 ... NUM_LOOPS : 1 ...

“2”

x → f(x) → y

Pages 46–47:

Loose inspiration

[Diagram: a human neuron – dendrites, cell body or soma, nucleus, axon, axonal arborization, synapses, and an axon from another cell – next to artificial neurons]

Page 48:

Linear classifiers

Inputs are feature values, fi

Each feature has a weight, wi

Weighted sum is the activation:

activationw(x) = ∑i wi ⋅ fi(x) = w ⋅ f(x)

If the activation is > 0, output label “+1”; if < 0, output “−1”.

[Diagram: inputs f1, f2, f3 scaled by weights w1, w2, w3, fed into a sum Σ, then the test “> 0?”]

Pages 49–50:

Recall: Dot product

Given two vectors of equal length, the dot product (or scalar product) produces a scalar (a single number) by multiplying the corresponding entries in the vectors:

a = [a1, a2, …, an]
b = [b1, b2, …, bn]
a · b = a1b1 + a2b2 + ⋯ + anbn

In Python:

a = [1, 2, 3]
b = [4, 5, 6]

# Version 1
sum(map(lambda x: x[0] * x[1], zip(a, b)))

# Version 2
sum([x * y for x, y in zip(a, b)])

Pages 51–53:

Weights

Binary case: Compare features to a weight vector.

Learning: Figure out the weight vector from examples.

w ⋅ f being positive means the positive class/label.

w     = # free : 4   YOUR_NAME : -1   MISSPELLED : 1   FROM_FRIEND : -3   ...
f(x1) = # free : 2   YOUR_NAME : 0    MISSPELLED : 2   FROM_FRIEND : 0    ...
f(x2) = # free : 0   YOUR_NAME : 1    MISSPELLED : 1   FROM_FRIEND : 1    ...
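Plugging the slide’s numbers into a dot product over the four listed features (truncated; the real vectors continue):

w  = {"free": 4, "YOUR_NAME": -1, "MISSPELLED": 1, "FROM_FRIEND": -3}
f1 = {"free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}   # f(x1)
f2 = {"free": 0, "YOUR_NAME": 1, "MISSPELLED": 1, "FROM_FRIEND": 1}   # f(x2)

def dot(u, v):
    """Dot product of two feature dictionaries."""
    return sum(u[key] * v[key] for key in v)

print(dot(w, f1))   # 4*2 + (-1)*0 + 1*2 + (-3)*0 = 10 > 0: positive class (spam)
print(dot(w, f2))   # 4*0 + (-1)*1 + 1*1 + (-3)*1 = -3 < 0: negative class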

Page 54:

Decision rules

Page 55:

Key idea: decision boundary

The boundary at which the label changes.

[Diagram: + and − examples scattered in feature space, separated by the decision boundary]

Page 56:

[Diagram: weight vector w and example feature vectors f(x1), f(x2), with + and − regions on either side of the boundary]

Angle between f(x) and w is < 90 degrees: x is classified as positive for the label.

Angle between f(x) and w is > 90 degrees: x is classified as negative for the label.

Pages 57–59:

Binary decision rule

In the space of feature vectors:

Examples are points.

Any weight vector defines a hyperplane decision boundary: f ⋅ w = 0.

One side corresponds to Y = +1 (here, +1 = spam).

The other corresponds to Y = −1 (here, −1 = ham).

Add a bias feature (value always 1) to allow shifted decision boundaries.

w = BIAS : -3   free : 4   money : 2   ...

[Plot: axes “free” (0 to 1) and “money” (0 to 2); the line f ⋅ w = 0 separates the +1 = spam region from the −1 = ham region]
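A quick sketch of this rule with the slide’s weight vector (the two test points are hypothetical):

w = {"BIAS": -3, "free": 4, "money": 2}

def classify(f):
    """+1 on one side of the hyperplane f . w = 0, -1 on the other."""
    activation = sum(w[key] * value for key, value in f.items())
    return +1 if activation >= 0 else -1

# With the BIAS feature fixed at 1, the boundary is the line 4*free + 2*money = 3.
print(classify({"BIAS": 1, "free": 1, "money": 1}))  # -3 + 4 + 2 = 3 >= 0: +1 (spam)
print(classify({"BIAS": 1, "free": 0, "money": 1}))  # -3 + 0 + 2 = -1 < 0: -1 (ham)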

Page 60:

Weight updates

Pages 61–68:

Learning: Binary perceptron

Start with weights = 0.

Cycle through training examples repeatedly.

For each example:

Classify with current weights:

y = +1 if w ⋅ f(x) ≥ 0; y = −1 if w ⋅ f(x) < 0

If correct (i.e., y = y*), no change!

If wrong, adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is −1:

w = w + y* ⋅ f(x)

[Diagram: vectors w and f(x); the update adds y* ⋅ f(x) to w]
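Putting the whole loop together, a minimal sketch in Python (illustrative names; a fixed number of passes stands in for “cycle through repeatedly”):

def train_binary_perceptron(examples, num_features, passes=10):
    """examples: list of (f, y_star), where f is a list of feature
    values (include a bias feature fixed at 1) and y_star is +1 or -1."""
    w = [0.0] * num_features                       # start with weights = 0
    for _ in range(passes):                        # cycle through the data
        for f, y_star in examples:
            activation = sum(wi * fi for wi, fi in zip(w, f))
            y = 1 if activation >= 0 else -1       # classify with current weights
            if y != y_star:                        # if wrong, adjust:
                w = [wi + y_star * fi for wi, fi in zip(w, f)]  # w = w + y* . f(x)
    return w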

Pages 69–75:

Examples: Perceptron

[Figures only]

Pages 76–78:

Multi-class decision rule

If we have more than two classes:

A weight vector for each class: wy

Score (activation) of a class y: wy ⋅ f(x)

Prediction: the highest score wins:

y = arg maxy wy ⋅ f(x)

Binary is multi-class where the negative class has weight zero.

[Diagram: feature space divided into regions where w1 ⋅ f(x), w2 ⋅ f(x), or w3 ⋅ f(x) is biggest]

Pages 79–82:

Learning: Multi-class perceptron

Start with all weights = 0.

Pick up training examples one by one.

Predict with current weights:

y = arg maxy wy ⋅ f(x)

If correct, no change!

If wrong, we use the feature vector to lower the score of the wrong label (y) and to raise the score of the right label (y*):

wy = wy − f(x)
wy* = wy* + f(x)

[Diagram: weight vectors wy, wy*, wy′ and the feature vector f(x)]
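The same loop for the multi-class rule, as a sketch (dictionaries for sparse feature vectors are our own choice; ties in the argmax go to whichever label comes first):

from collections import defaultdict

def train_multiclass_perceptron(examples, labels, passes=10):
    """examples: list of (f, y_star), where f is a {feature: value} dict
    (include BIAS : 1) and y_star is one of `labels`."""
    w = {y: defaultdict(float) for y in labels}    # start with all weights = 0
    def score(y, f):
        return sum(w[y][k] * v for k, v in f.items())  # w_y . f(x)
    for _ in range(passes):
        for f, y_star in examples:                 # pick up examples one by one
            y = max(labels, key=lambda lbl: score(lbl, f))  # predict: argmax_y w_y . f(x)
            if y != y_star:                        # if wrong:
                for k, v in f.items():
                    w[y][k] -= v                   # lower the score of the wrong label
                    w[y_star][k] += v              # raise the score of the right label
    return w

Run on the three “win the …” sentences from the worked example below, this produces the same sequence of updates (the slides start wSPORTS with BIAS : 1, so the absolute BIAS weights differ slightly).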

Pages 83–88:

Example: Multiclass perceptron

Initial weights:

wSPORTS   = BIAS : 1   win : 0   game : 0   vote : 0   the : 0   ...
wPOLITICS = BIAS : 0   win : 0   game : 0   vote : 0   the : 0   ...
wTECH     = BIAS : 0   win : 0   game : 0   vote : 0   the : 0   ...

x = “win the vote”, y* (true) = POLITICS
f(x) = BIAS : 1   win : 1   game : 0   vote : 1   the : 1   ...
y (prediction) = SPORTS – wrong, so update:
wSPORTS   = wSPORTS − f(x)   = BIAS : 0   win : -1   game : 0   vote : -1   the : -1   ...
wPOLITICS = wPOLITICS + f(x) = BIAS : 1   win : 1    game : 0   vote : 1    the : 1    ...
(wTECH unchanged)

x = “win the election”, y* (true) = POLITICS
f(x) = BIAS : 1   win : 1   game : 0   vote : 0   the : 1   ...
y (prediction) = POLITICS – correct, no change.

x = “win the game”, y* (true) = SPORTS
f(x) = BIAS : 1   win : 1   game : 1   vote : 0   the : 1   ...
y (prediction) = POLITICS – wrong, so update:
wPOLITICS = wPOLITICS − f(x) = BIAS : 0   win : 0   game : -1   vote : 1    the : 0   ...
wSPORTS   = wSPORTS + f(x)   = BIAS : 1   win : 0   game : 1    vote : -1   the : 0   ...
(wTECH unchanged)

Page 89:

Properties of perceptrons

Separability: True if some parameters get the training set perfectly correct.

Convergence: If the training data are separable, the perceptron will eventually converge (binary case).

Mistake bound: The maximum number of mistakes (binary case) is related to the margin, or degree of separability:

mistakes < k / δ²

where k is the number of features and δ is the margin (the width of the thickest separating line).

[Diagrams: separable vs. non-separable data]

Page 90:

Classification: Comparison

Naïve Bayes:
- Builds a model of the training data
- Gives prediction probabilities
- Strong assumptions about feature independence
- One pass through the data (counting)

Perceptrons:
- Makes fewer assumptions about the data
- Error-driven learning
- Multiple passes through the data (prediction)
- Often more accurate

Page 91:

Next time: Improving the perceptron

Page 92:

Acknowledgments

The lecture incorporates material from: Dan Klein and Pieter Abbeel, University of California, Berkeley; ai.berkeley.edu

George Konidaris, Brown University

Peter Norvig and Stuart Russell, Artificial Intelligence: A Modern Approach

Ketrina Yim (illustrations)