MACHINE LEARNING 3. Supervised Learning: Learning a Class from Examples
Based on E. Alpaydın (2004), Introduction to Machine Learning, © The MIT Press (V1.1)


Page 1: MACHINE LEARNING 3. Supervised Learning

Page 2: Learning a Class from Examples

Class C of a "family car".
Prediction: Is car x a family car?
Knowledge extraction: What do people expect from a family car?
Output: positive (+) and negative (−) examples.
Input representation (expert suggestions): x1 = price, x2 = engine power; ignore other attributes.

Page 3: Training Set X

$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}, \qquad r = \begin{cases} 1 & \text{if } \mathbf{x} \text{ is positive} \\ 0 & \text{if } \mathbf{x} \text{ is negative} \end{cases}, \qquad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Page 4: Class C

• Assume a class model: an axis-aligned rectangle.
• (p1 ≤ price ≤ p2) & (e1 ≤ engine power ≤ e2)
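A minimal sketch of this rectangle hypothesis in Python (the thresholds and the toy price/engine-power data below are made-up values, not from the slides):

```python
import numpy as np

# Toy training set X = {(x^t, r^t)}: columns are x1 = price, x2 = engine power.
# All values here are made up for illustration.
X = np.array([[12.0, 150.0],
              [18.0, 210.0],
              [35.0, 320.0],
              [ 9.0,  95.0]])
r = np.array([1, 1, 0, 0])  # 1 = family car (positive), 0 = negative

def h(x, p1, p2, e1, e2):
    """Axis-aligned rectangle hypothesis:
    1 if (p1 <= price <= p2) and (e1 <= engine power <= e2), else 0."""
    price, power = x
    return int(p1 <= price <= p2 and e1 <= power <= e2)

# One candidate hypothesis (assumed thresholds)
predictions = [h(x, p1=10, p2=20, e1=100, e2=250) for x in X]
print(predictions)  # [1, 1, 0, 0] -> consistent with the labels above
```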

Page 5: S, G, and the Version Space

S is the most specific hypothesis; G is the most general hypothesis.
Any h ∈ H between S and G is consistent with the training set, and together they make up the version space (Mitchell, 1997).
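As a small sketch, for the rectangle class the most specific hypothesis S is the tightest axis-aligned rectangle around the positive examples (same assumed toy data as above):

```python
import numpy as np

X = np.array([[12.0, 150.0], [18.0, 210.0], [35.0, 320.0], [9.0, 95.0]])
r = np.array([1, 1, 0, 0])

pos = X[r == 1]              # positive examples only
p1, e1 = pos.min(axis=0)     # S: the tightest rectangle covering all positives
p2, e2 = pos.max(axis=0)
print((p1, p2), (e1, e2))    # ((12.0, 18.0), (150.0, 210.0))
```

G, by contrast, would be the largest rectangle that still excludes every negative example.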

Page 6: Hypothesis Class H

$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ classifies } \mathbf{x} \text{ as positive} \\ 0 & \text{if } h \text{ classifies } \mathbf{x} \text{ as negative} \end{cases}$$

Error of h on X, the number of training examples that h misclassifies:

$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\big(h(\mathbf{x}^t) \neq r^t\big)$$
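This error count can be computed directly; a hedged sketch reusing the assumed toy data and rectangle hypothesis from above:

```python
import numpy as np

X = np.array([[12.0, 150.0], [18.0, 210.0], [35.0, 320.0], [9.0, 95.0]])
r = np.array([1, 1, 0, 0])

def h(x, p1=10, p2=20, e1=100, e2=250):
    price, power = x
    return int(p1 <= price <= p2 and e1 <= power <= e2)

# E(h | X): count of training examples that h misclassifies
E = sum(h(x) != label for x, label in zip(X, r))
print(E)  # 0 for this consistent hypothesis
```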

Page 7: Generalization

Problem of generalization: how well will our hypothesis classify future examples?
In our example, a hypothesis is characterized by four numbers: (p1, p2, e1, e2).
Choose the best one: include all positive examples and no negative ones. There are infinitely many hypotheses for real-valued parameters.

Page 8: Doubt

In some applications a wrong decision is very costly.
We may reject an instance if it falls between S (most specific) and G (most general).

Page 9: VC Dimension

So far we assumed that the hypothesis space H includes the true class C.
H should be flexible enough, i.e. have enough capacity, to include C.
We need some measure of the hypothesis space's flexibility (its complexity).
We can then try to increase the complexity of the hypothesis space.

Page 10: VC Dimension

N points can be labeled in 2^N ways as +/−.
H shatters N points if, for every one of these labelings, there exists an h ∈ H consistent with it; VC(H) is the largest such N.
An axis-aligned rectangle shatters only 4 points, so VC(H) = 4 for this class!
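A brute-force check of this claim, assuming four points in a diamond configuration; it suffices to test the bounding box of the positively labeled points, since any separating rectangle must contain that box:

```python
import itertools
import numpy as np

# Four points in a "diamond" layout (an assumed shatterable configuration).
pts = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])

def rectangle_consistent(points, labels):
    """True if some axis-aligned rectangle contains exactly the positives."""
    pos = points[labels == 1]
    if len(pos) == 0:
        return True  # an empty rectangle handles the all-negative labeling
    lo, hi = pos.min(axis=0), pos.max(axis=0)
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    return np.array_equal(inside.astype(int), labels)

# Try all 2^4 = 16 labelings
shattered = all(
    rectangle_consistent(pts, np.array(lab))
    for lab in itertools.product([0, 1], repeat=4)
)
print(shattered)  # True -> these 4 points are shattered
```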

Page 11: Probably Approximately Correct (PAC) Learning

Fix a target probability of classification error (the planned future).
The actual error depends on the training sample (the past).
We want the actual error probability (the actual future) to be less than the target, with high probability.

Page 12: Probably Approximately Correct (PAC) Learning

How many training examples N should we have such that, with probability at least 1 − δ, h has error at most ε? (Blumer et al., 1989)

Let's calculate how many samples we need for S (the tightest rectangle). Each of the four strips between S and C has probability mass at most ε/4.
Pr that one instance misses a strip: ≤ 1 − ε/4
Pr that N instances miss a strip: ≤ (1 − ε/4)^N
Pr that N instances miss any of the 4 strips: ≤ 4(1 − ε/4)^N
We require 4(1 − ε/4)^N ≤ δ; using (1 − x) ≤ exp(−x), it suffices that 4 exp(−εN/4) ≤ δ, i.e. N ≥ (4/ε) ln(4/δ).
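A quick numeric check of the bound (the values of ε and δ are arbitrary illustrations):

```python
import math

def pac_sample_bound(eps, delta):
    """N >= (4/eps) * ln(4/delta) for the tightest-rectangle hypothesis S."""
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_sample_bound(0.1, 0.05))  # 176 -> with N >= 176 samples, error <= 0.1
                                    # with probability >= 0.95
```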

Page 13: Noise

Imprecision in recording the input attributes.
Errors in labeling the data points (teacher noise).
Additional attributes not taken into account (hidden or latent): the same price/engine power may carry different labels due to, say, color; the effect of such attributes is modeled as noise.
The class boundary might then not be simple, so we may need a more complicated hypothesis space/model.

Page 14: Noise and Model Complexity

Use the simpler model because it is:
simpler to use (lower computational complexity),
easier to train (lower space complexity),
easier to explain (more interpretable),
and generalizes better (lower variance; Occam's razor).

Page 15: Occam's Razor

If the actual class is simple and there is mislabeling or noise, the simpler model will generalize better.
A simpler model makes more errors on the training set, but it will generalize better because it does not try to explain the noise in the training sample.
Simple explanations are more plausible.

Page 16: Multiple Classes

General case: K classes, e.g. family, sports, and luxury cars.
Classes can overlap.
We can use a different or the same hypothesis class for each class.
What if an instance falls into two classes, or none? It is sometimes worth rejecting it.

Page 17: Multiple Classes, C_i, i = 1, ..., K

$$\mathcal{X} = \{\mathbf{x}^t, \mathbf{r}^t\}_{t=1}^{N}, \qquad r_i^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \neq i \end{cases}$$

Train K hypotheses h_i(x), i = 1, ..., K:

$$h_i(\mathbf{x}^t) = \begin{cases} 1 & \text{if } \mathbf{x}^t \in C_i \\ 0 & \text{if } \mathbf{x}^t \in C_j,\ j \neq i \end{cases}$$
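A minimal one-vs-all sketch under the rectangle model (the K = 3 classes and all data values are made up; each h_i is the most specific rectangle S_i for its class):

```python
import numpy as np

# Made-up 2-D data for K = 3 car classes: 0 = family, 1 = sports,
# 2 = luxury. Columns: x1 = price, x2 = engine power.
X = np.array([[12, 150], [15, 180], [20, 300], [22, 340],
              [60, 250], [70, 280]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
K = 3

# One rectangle hypothesis h_i per class: the bounding box of the
# examples of class i (its most specific hypothesis S_i).
boxes = [(X[y == i].min(axis=0), X[y == i].max(axis=0)) for i in range(K)]

def h(i, x):
    lo, hi = boxes[i]
    return int(np.all((x >= lo) & (x <= hi)))

x_new = np.array([14.0, 160.0])
votes = [h(i, x_new) for i in range(K)]
print(votes)  # [1, 0, 0] -> family car; if all zeros or several ones,
              # the instance could be rejected
```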

Page 18: Regression

The output is not a Boolean (yes/no) or a class label but a numeric value.
Training set of examples {x^t, r^t}.
Interpolation: fit a function (e.g. a polynomial) to the points.
Extrapolation: predict the output for an x outside the range of the training examples.
Regression: there is added noise; the assumption is that it reflects hidden variables.
Approximate the output by a model g(x).

Page 19: Regression

Empirical error on the training set:

$$E(g \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big[r^t - g(\mathbf{x}^t)\big]^2$$

The hypothesis space is the set of linear functions, g(x) = w1 x + w0.
Calculate the best parameters, w1* and w0*, that minimize the error by setting the partial derivatives ∂E/∂w1 and ∂E/∂w0 to zero.
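Setting those derivatives to zero yields the familiar closed-form least-squares solution; a sketch on assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
r = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)  # noisy line

# Closed-form solution for g(x) = w1*x + w0 from the zeroed derivatives
w1 = ((x * r).mean() - x.mean() * r.mean()) / ((x**2).mean() - x.mean()**2)
w0 = r.mean() - w1 * x.mean()
print(w1, w0)  # close to the true 2.0 and 1.0
```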

Page 20: Higher-Order Polynomials
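A hedged illustration with numpy's polyfit on assumed noisy data; the training error keeps shrinking as the order grows, which by itself says nothing about generalization:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
r = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# Fit polynomials of increasing order; training error keeps dropping,
# but high orders start fitting the noise (overfitting).
for order in (1, 3, 9):
    coeffs = np.polyfit(x, r, deg=order)
    err = np.mean((r - np.polyval(coeffs, x)) ** 2)
    print(order, round(err, 4))
```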

Page 21: Model Selection & Generalization

Learning is an ill-posed problem: the data alone are not sufficient to find a unique solution; each sample removes the hypotheses inconsistent with it.
Hence the need for inductive bias: assumptions about H, e.g. rectangles in our example.
Generalization: how well a model performs on new data.
Overfitting: H is more complex than C or f.
Underfitting: H is less complex than C or f.

Page 22: Triple Trade-Off

There is a trade-off between three factors (Dietterich, 2003):
1. the complexity of H, c(H),
2. the training set size, N,
3. the generalization error, E, on new data.
As N↑, E↓. As c(H)↑, E first ↓ and then ↑.

Page 23: Cross-Validation

To estimate the generalization error, we need data unseen during training. We split the data into:
Training set (50%): to train models.
Validation set (25%): to select a model (e.g. the degree of the polynomial).
Test (publication) set (25%): to estimate the error of the final model.
Use resampling when there is little data.
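A minimal sketch of the 50/25/25 split (array sizes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
r = rng.integers(0, 2, size=100)

idx = rng.permutation(len(X))          # shuffle before splitting
n_train, n_val = 50, 25                # 50% / 25% / 25%
train, val, test = np.split(idx, [n_train, n_train + n_val])

X_train, r_train = X[train], r[train]  # fit candidate models here
X_val, r_val = X[val], r[val]          # pick the best model here
X_test, r_test = X[test], r[test]      # report the error once, at the end
```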

Page 24: Dimensions of a Supervised Learner

1. Model: $g(\mathbf{x} \mid \theta)$
2. Loss function: $E(\theta \mid \mathcal{X}) = \sum_{t} L\big(r^t, g(\mathbf{x}^t \mid \theta)\big)$
3. Optimization procedure: $\theta^* = \arg\min_{\theta} E(\theta \mid \mathcal{X})$
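As a closing sketch, the three dimensions instantiated for the earlier line-fitting example; the model, squared-error loss, and gradient-descent optimizer are all illustrative choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
r = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=x.size)

def g(x, theta):               # 1. model: g(x | theta) = w1*x + w0
    w1, w0 = theta
    return w1 * x + w0

def E(theta):                  # 2. loss: sum of squared errors L(r, g)
    return np.sum((r - g(x, theta)) ** 2)

# 3. optimization: simple gradient descent on theta = (w1, w0)
theta = np.zeros(2)
for _ in range(2000):
    resid = r - g(x, theta)
    grad = np.array([-2 * np.sum(resid * x), -2 * np.sum(resid)])
    theta -= 1e-3 * grad
print(theta, E(theta))  # theta* close to (2.0, 1.0)
```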