CMPU 365 · Artificial Intelligence · cs365/lectures/2020-11-04.pdf
Perceptrons
4 November 2020
CMPU 365 · Artificial Intelligence
https://twitter.com/Adequate_Scott/status/1323978722731544577
Where are we?
Hello,
Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
# free : 2 YOUR_NAME : 0 MISSPELLED : 2 FROM_FRIEND : 0 ...
SPAM or +
PIXEL-7,12 : 1 PIXEL-7,13 : 0 ... NUM_LOOPS : 1 ...
“2”
Inputs x → Features f(x) → Outputs y
Data: Labeled instances – inputs and outputs (x, y)
Features: attribute–value pairs that characterize each input x.
Parameters: Numeric parts of the model that determine how each x is mapped to an output y.
Experimentation cycle: Learn (train) parameters on training set
(Tune hyperparameters on held-out set)
Evaluate on test set
E.g., accuracy: fraction of instances predicted correctly
Training data
Held-out data
Test data
Aside: Alternative ways of splitting data
What if you choose an unlucky split into training and testing, where the data sets are very different?
Aren’t we wasting data?
Alternative: k-fold cross validation! Repeat k times:
Partition into train (n − n/k) and test (n/k) data sets
Train on training set; test on test set
Average results across k choices of test set
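The loop above can be sketched in plain Python. This is a minimal illustration; the data, the majority-class "learner", and the accuracy computation are made-up stand-ins for a real classifier:

```python
# k-fold cross-validation: each of the k folds serves as the test set once.
def k_fold_splits(data, k):
    n = len(data)
    fold_size = n // k
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test

# Toy "learner": always predict the most common label in the training set.
def majority_label(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

data = [(x, "spam" if x % 2 else "ham") for x in range(12)]
accuracies = []
for train, test in k_fold_splits(data, k=4):
    guess = majority_label(train)
    accuracies.append(sum(1 for _, y in test if y == guess) / len(test))
avg = sum(accuracies) / len(accuracies)
```

Averaging over the k test folds gives a less split-dependent estimate than a single lucky (or unlucky) train/test split.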
Naïve Bayes
Bayes’ Rule lets us do diagnostic queries with causal probabilities.
The Naïve Bayes assumption takes all features Fi to be independent given the class label, Y.
We can build classifiers out of a Naïve Bayes model using training data.
Y
F1 F2 … Fn
P(Y ∣ f1, …, fn) ∝ P(Y) ∏i P(fi ∣ Y)
Generalization and overfitting
We want a classifier which does well on test data.
Dangers:
Underfitting: fitting the training set poorly.
Overfitting: fitting the training data very closely, but not generalizing well.
[Plot: a degree-15 polynomial fit to training data points, illustrating overfitting]
Overfitting
Example: Overfitting
“2” wins!
Example: Overfitting
Posteriors are determined by relative probabilities (odds ratios):
P(W ∣ ham)/P(W ∣ spam): "south-west": inf, "nation": inf, "morally": inf, "nicely": inf, "extent": inf, "seriously": inf, ...
P(W ∣ spam)/P(W ∣ ham): "screens": inf, "minute": inf, "guaranteed": inf, "$205.00": inf, "delivery": inf, "signature": inf, ...
What went wrong here?
Generalization and overfitting
Relative frequency parameters will overfit the training data!
Just because we never saw a 3 with pixel (15, 15) on during training doesn’t mean we won’t see it at test time.
Unlikely that every occurrence of “minute” is 100% spam.
Unlikely that every occurrence of “seriously” is 100% ham.
What about all the words that don’t occur in the training set at all?
In general, we can’t go around giving unseen events zero probability.
Generalization and overfitting
As an extreme case, imagine using the entire email as the only feature.
Would get the training data perfect (if deterministic labeling).
Wouldn’t generalize at all.
Just making the bag-of-words assumption gives us some generalization, but isn’t enough.
To generalize better, we need to smooth or regularize the estimates.
Parameter estimation
Maximum likelihood estimation (MLE) infers the parameter value (θ) that maximizes the probability of the observed data.
Maximum likelihood estimation typically assumes: Each sample is drawn from the same distribution. That is, each instance is identically distributed.
Each sample is conditionally independent of the others given the parameters for our distribution.
This is a strong assumption, but it helps simplify the problem of MLE – and generally works in practice.
All possible values of θ are equally likely before we’ve seen any data (uniform prior).
independent, identically distributed (i.i.d.)
Imagine you have a bag of blue and red balls. You draw samples by taking a ball out, recording its color, and then putting the ball back.
If you sample red, red, blue, that suggests that 2/3 of the balls in the bag are red and 1/3 are blue.
r r b
PML(r) = 2/3
Consider our spam classifier.
The maximum likelihood estimate for the parameter P(Wi = 1 | Y = ham) corresponds to just counting the number of legitimate emails that word i appears in and dividing it by the total number of legitimate emails.
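That counting estimate is a one-liner. A toy sketch (the `ham_emails` data, represented as sets of words, is invented for illustration):

```python
# MLE for P(W_i = 1 | Y = ham): the fraction of ham emails containing word i.
ham_emails = [
    {"hello", "meeting", "tomorrow"},
    {"hello", "lunch"},
    {"report", "meeting"},
]

def p_word_given_ham(word):
    return sum(1 for email in ham_emails if word in email) / len(ham_emails)
```

Note that a word never seen in ham (say, "pizza" here) gets probability exactly 0, which is the problem smoothing addresses next.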
Smoothing
Laplace smoothing
Pretend you saw every outcome once more than you actually did.
r r b
Laplace smoothing
Laplace’s estimate (extended): pretend you saw every outcome k extra times.
What’s Laplace with k = 0?
k is the strength of the prior.
Laplace for conditionals: smooth each condition independently.
r r b
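Laplace's extended estimate fits in a small function. A minimal sketch (the function name and count-dictionary representation are our own; k = 0 recovers the maximum-likelihood estimate):

```python
# Laplace smoothing: add k pseudo-counts to every possible outcome.
def laplace(counts, outcome, k=1):
    total = sum(counts.values())
    num_outcomes = len(counts)
    return (counts[outcome] + k) / (total + k * num_outcomes)

# The r r b sample from the slides: 2 red, 1 blue.
counts = {"r": 2, "b": 1}
# laplace(counts, "r", k=0) -> 2/3 (the MLE)
# laplace(counts, "r", k=1) -> 3/5 (pulled toward uniform)
```

As k grows, the estimate is pulled further toward the uniform distribution, which is what "k is the strength of the prior" means.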
Real Naïve Bayes: Smoothing
For real classification problems, smoothing is critical
New odds ratios:
P(W ∣ ham)/P(W ∣ spam): "helvetica": 11.4, "seems": 10.8, "group": 10.2, "ago": 8.4, "areas": 8.3, ...
P(W ∣ spam)/P(W ∣ ham): "verdana": 28.8, "Credit": 28.4, "ORDER": 27.2, "<FONT>": 26.9, "money": 26.5, ...
Do these make more sense?
Now we’ve got two kinds of unknowns.
Parameters: the probabilities P(X ∣ Y), P(Y)
Hyperparameters: e.g., the amount and type of smoothing to do, k
What should we learn where?
We learn parameters from the training data.
We tune the hyperparameters on a different set: the held-out data.
Choose the best value and do a final test on the test data.
Tuning on held-out data
Error-driven classification
Examples of errors
Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . .
. . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . .
What to do about errors?
Problem: There’s still spam in your inbox.
Need more features – words aren’t enough!
Have you emailed the sender before?
Have a million other people just gotten the same email?
Is the sending information consistent?
Is the email in ALL CAPS?
Do inline URLs point where they say they point?
Does the email address you by name?
Naïve Bayes models can incorporate a variety of features, but they tend to do best in homogeneous cases, e.g., all features are word occurrences.
Linear classifiers
Feature vectors
Hello,
Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just
# free : 2 YOUR_NAME : 0 MISSPELLED : 2 FROM_FRIEND : 0 ...
SPAM or +
PIXEL-7,12 : 1 PIXEL-7,13 : 0 ... NUM_LOOPS : 1 ...
“2”
x f(x) y
Loose inspiration
[Diagram: a human neuron (dendrites, cell body or soma, nucleus, axon, axonal arborization, synapses, axon from another cell) alongside artificial neurons]
Linear classifiers
Inputs are feature values, fi
Each feature has a weight, wi
Weighted sum is the activation:
If the activation is > 0, output label “+1”; if it is < 0, output “−1”.
[Diagram: features f1, f2, f3 feed a weighted sum Σ with weights w1, w2, w3, followed by a > 0? test]
activation_w(x) = ∑i wi ⋅ fi(x) = w ⋅ f(x)
Recall: Dot product
Given two vectors of equal length, the dot product (or scalar product) produces a scalar (a single number) by multiplying the corresponding entries in the vectors: a = [a1, a2, …, an]
b = [b1, b2, …, bn]
a · b = a1b1 + a2b2 + ⋯ + anbn
a = [1, 2, 3]
b = [4, 5, 6]
# Version 1
sum(map(lambda x: x[0] * x[1], zip(a, b)))  # 32
# Version 2
sum([x * y for x, y in zip(a, b)])  # 32
Weights
Binary case: Compare features to a weight vector.
Learning: Figure out the weight vector from examples.
w: # free : 4, YOUR_NAME : -1, MISSPELLED : 1, FROM_FRIEND : -3, ...
f(x1): # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ...
f(x2): # free : 0, YOUR_NAME : 1, MISSPELLED : 1, FROM_FRIEND : 1, ...
w · f being positive means the positive class/label.
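With the weight vector above, the two example emails can be scored directly. A sketch using dictionaries as sparse vectors (the numbers are the ones on the slide):

```python
# Score = w · f(x); a positive activation means the positive class (spam).
w = {"# free": 4, "YOUR_NAME": -1, "MISSPELLED": 1, "FROM_FRIEND": -3}
f_x1 = {"# free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}
f_x2 = {"# free": 0, "YOUR_NAME": 1, "MISSPELLED": 1, "FROM_FRIEND": 1}

def dot(w, f):
    return sum(w[key] * value for key, value in f.items())

# dot(w, f_x1) = 4*2 + 1*2 = 10       -> positive: spam
# dot(w, f_x2) = -1 + 1 - 3 = -3      -> negative: not spam
```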
Decision rules
Key idea: the decision boundary, the boundary at which the label changes.
[Plot: positive (+) and negative (−) examples in the space (f(x1), f(x2)), with the weight vector w and the boundary at which the label changes]
Angle between f(x) and w is < 90 degrees: x is classified as positive for the label.
Angle between f(x) and w is > 90 degrees: x is classified as negative for the label.
Binary decision rule
In the space of feature vectors:
Examples are points.
Any weight vector defines a hyperplane decision boundary.
One side corresponds to Y = +1; the other corresponds to Y = −1.
Add a bias feature (value always 1) to allow shifted decision boundaries.
w: BIAS : -3, free : 4, money : 2, ...
[Plot: axes free (0 to 1) and money (0 to 2); the line f ⋅ w = 0 separates +1 = spam from −1 = ham]
Weight updates
Learning: Binary perceptron
Start with weights = 0.
Cycle through training examples repeatedly.
For each example, classify with current weights:
y = +1 if w ⋅ f(x) ≥ 0; y = −1 if w ⋅ f(x) < 0
If correct (i.e., y = y*), no change!
If wrong, adjust the weight vector by adding or subtracting the feature vector (subtract if y* is −1):
w = w + y* ⋅ f(x)
[Diagram: w shifted by y* ⋅ f(x)]
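The whole update rule fits in a few lines. A sketch on invented toy data (list positions play the role of features, with a bias feature in position 0):

```python
# Binary perceptron: on every mistake, w += y* f(x).
def predict(w, f):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) >= 0 else -1

def train_perceptron(examples, num_features, passes=10):
    w = [0.0] * num_features
    for _ in range(passes):
        for f, y_star in examples:
            if predict(w, f) != y_star:  # mistake: move w toward y* f(x)
                w = [wi + y_star * fi for wi, fi in zip(w, f)]
    return w

# Toy data: [bias, feature] pairs, separable (positive iff feature > 2).
examples = [([1, 1], -1), ([1, 2], -1), ([1, 3], +1), ([1, 4], +1)]
w = train_perceptron(examples, num_features=2)
# After training, w classifies every example correctly.
```

Because the toy data is separable, the loop stops making mistakes after a few passes, exactly as the convergence property later in the lecture promises.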
Examples: Perceptron
Multi-class decision rule
If we have more than two classes, keep a weight vector wy for each class.
Score (activation) of a class y: wy ⋅ f(x)
Prediction: the highest score wins: y = arg maxy wy ⋅ f(x)
Binary is multi-class where the negative class has weight zero.
[Diagram: three regions, one where each of w1 ⋅ f(x), w2 ⋅ f(x), w3 ⋅ f(x) is biggest]
Learning: Multi-class perceptron
Start with all weights = 0.
Pick up training examples one by one. Predict with current weights:
y = arg maxy wy ⋅ f(x)
If correct, no change!
If wrong, lower the score of the wrong label (y) and raise the score of the right label (y*):
wy = wy − f(x)
wy* = wy* + f(x)
Example: Multiclass perceptron
Start: wSPORTS = [BIAS: 1, win: 0, game: 0, vote: 0, the: 0]; wPOLITICS and wTECH are all zeros.

“win the vote”: f(x) = [BIAS: 1, win: 1, game: 0, vote: 1, the: 1]; y* = POLITICS, but the prediction is y = SPORTS. Wrong, so update:
wSPORTS = [BIAS: 0, win: -1, game: 0, vote: -1, the: -1]
wPOLITICS = [BIAS: 1, win: 1, game: 0, vote: 1, the: 1]

“win the election”: f(x) = [BIAS: 1, win: 1, game: 0, vote: 0, the: 1]; y* = POLITICS and y = POLITICS. Correct: no change.

“win the game”: f(x) = [BIAS: 1, win: 1, game: 1, vote: 0, the: 1]; y* = SPORTS, but y = POLITICS. Wrong, so update:
wSPORTS = [BIAS: 1, win: 0, game: 1, vote: -1, the: 0]
wPOLITICS = [BIAS: 0, win: 0, game: -1, vote: 1, the: 0]
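The worked trace above can be reproduced in code. A sketch (the dictionary representation and tie-breaking toward the first label are implementation choices of ours):

```python
from collections import defaultdict

# Multi-class perceptron: predict with argmax, adjust two vectors on a mistake.
def score(w, f):
    return sum(w[key] * value for key, value in f.items())

def update(weights, f, y_star):
    y = max(weights, key=lambda label: score(weights[label], f))
    if y != y_star:
        for key, value in f.items():
            weights[y][key] -= value       # lower the wrong label's score
            weights[y_star][key] += value  # raise the right label's score
    return y

weights = {
    "SPORTS": defaultdict(float, {"BIAS": 1}),
    "POLITICS": defaultdict(float),
    "TECH": defaultdict(float),
}
f1 = {"BIAS": 1, "win": 1, "vote": 1, "the": 1}  # "win the vote"
f2 = {"BIAS": 1, "win": 1, "the": 1}             # "win the election"
f3 = {"BIAS": 1, "win": 1, "game": 1, "the": 1}  # "win the game"

y1 = update(weights, f1, "POLITICS")  # predicts SPORTS: wrong, weights shift
y2 = update(weights, f2, "POLITICS")  # predicts POLITICS: correct, no change
y3 = update(weights, f3, "SPORTS")    # predicts POLITICS: wrong again
```

After the three examples, the weights match the final column of the trace, e.g. wSPORTS[game] = 1 and wPOLITICS[game] = −1.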
Properties of perceptrons
Separability: true if some parameters get the training set perfectly correct.
Convergence: if the training set is separable, the perceptron will eventually converge (binary case).
Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability:
mistakes < k/δ²
where k is the number of features and δ is the margin (the thickness of the thickest separating line).
[Plots: separable vs. non-separable data]
Classification: Comparison
Naïve Bayes:
Builds a model of the training data
Gives prediction probabilities
Strong assumptions about feature independence
One pass through data (counting)
Perceptrons:
Make fewer assumptions about the data
Error-driven learning
Multiple passes through data (prediction)
Often more accurate
Next time: Improving the perceptron
Acknowledgments
The lecture incorporates material from: Dan Klein and Pieter Abbeel, University of California, Berkeley; ai.berkeley.edu
George Konidaris, Brown University
Peter Norvig and Stuart Russell, Artificial Intelligence: A Modern Approach
Ketrina Yim (illustrations)