
Introduction to Machine Learning

Course 67577, Fall 2007

Lecturer: Amnon Shashua
Teaching Assistant: Yevgeny Seldin

School of Computer Science and Engineering
Hebrew University

What is Machine Learning?

• An inference engine (computer program) that, when given sufficient data (examples), computes a function that matches as closely as possible the process generating the data.

• Make accurate predictions based on observed data

• Algorithms to optimize a performance criterion based on observed data

• Learning to do better in the future based on what was experienced in the past

• Programming by examples: instead of writing a program to solve a task directly, machine learning seeks methods by which the computer will come up with its own program based on training examples.

Why Machine Learning?

• Data-driven algorithms are able to examine large amounts of data. A human expert, on the other hand, is likely to be guided by subjective impressions or by examining a relatively small number of examples.

• Humans often have trouble expressing what they know but have no difficulty in labeling data

• Machine learning is effective in domains where declarative (rule based) knowledge is difficult to obtain yet generating training data is easy

Typical Examples

• Visual recognition (say, detecting faces in an image): the amount of variability in appearance introduces challenges that are beyond the capacity of direct programming

• Spam filtering: data-driven programming can adapt to changing tactics by spammers

• Extract topics from documents: categorize news articles by whether they are about politics, sports, science, etc.

• Natural language understanding: from spoken words to text; categorize the meaning of spoken sentences

• Optical character recognition (OCR)

• Medical diagnosis: from symptoms to diagnosis

• Credit card transaction fraud detection

• Wealth prediction

Fundamental Issues

• Over-fitting: doing well on a training set does not guarantee accuracy on new examples

• What is the resource we wish to optimize? For a given accuracy, use the smallest size training set

• Examples are drawn from some (fixed) distribution D over X × Y (instance space × output space). Does the learner actually need to recover D during the learning process?

• How does the learning process depend on the complexity of the family of learning functions (concept class C)? How does one define complexity of C?

• When the goal is to learn the joint distribution D then the problem is computationally unwieldy because the joint distribution table is exponentially large. What assumptions can be made to simplify the task?
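To see the scale of the problem in the last bullet, a back-of-the-envelope count (not on the slide): with n binary attributes and a binary label, the full joint distribution table has

$$|X \times Y| = 2^n \cdot 2 = 2^{n+1} \text{ entries}, \qquad \text{e.g. } n = 30 \;\Rightarrow\; 2^{31} \approx 2.1 \times 10^9.$$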

Supervised vs. Un-supervised

Supervised Learning Models: learn a function

$$f : X \to Y$$

where X is the instance (data) space and Y is the output space.

• $Y = \{1, \dots, k\}$: multiclass classification. k = 2 is normally of most interest.

• $Y = \mathbb{R}$: regression. Predict the price of a used car given brand, year, mileage; kinematics of a robot arm; navigate by determining the steering angle from image input.

Un-supervised Learning Models: find regularities in the input data, assuming there is some structure in the input space.

• Density estimation
• Clustering (non-parametric density estimation): divide customers into groups which have similar attributes
• Latent class models: extract topics from documents
• Compression: represent the input space with fewer parameters; projection to lower-dimensional spaces
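As a minimal concrete sketch of a supervised learner $f : X \to Y$ (illustrative only; the slides do not prescribe a particular learner), a 1-nearest-neighbor rule in Python:

```python
# Minimal supervised learning sketch: a 1-nearest-neighbor classifier.
# Assumes X = R^2 and Y = {1, ..., k}; all names here are illustrative.

def nn_predict(Z, x):
    """Predict a label for x from the training set Z = [(x_i, y_i)]."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # The hypothesis: copy the label of the closest training instance.
    nearest_x, nearest_y = min(Z, key=lambda pair: sq_dist(pair[0], x))
    return nearest_y

Z = [((0.0, 0.0), 1), ((0.1, 0.2), 1), ((1.0, 1.0), 2), ((0.9, 1.1), 2)]
print(nn_predict(Z, (0.2, 0.1)))  # -> 1 (the closest examples have label 1)
```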

Notations

X is the instance space: the space from which observations are drawn. Examples:

$$X = \{0,1\}^n, \qquad X = \mathbb{R}^n, \qquad X = \Sigma^*$$

$x \in X$ is an input instance, a single observation. Examples:

$$x = (0, 1, 1, 1, 0, 0), \qquad x = (0.5, 2.3, 0, 7.2), \qquad x = \text{``text''}$$

Y is the output space: the set of possible outcomes that can be associated with a measurement. Examples: $Y = \{-1, 1\}$, $Y = \mathbb{R}$, or a finite set of labels.

An example is an instance-label pair (x, y). If |Y| = 2 one typically uses {0,1} or {-1,1}. We say that an example (x, y) is positive if y = 1, and otherwise we call it a negative example.

A training set Z consists of m instance-label pairs:

$$Z = \{(x_1, y_1), \dots, (x_m, y_m)\}$$

In some cases we refer to the training set without labels:

$$S = \{x_1, \dots, x_m\}$$
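In code, this notation might look like the following minimal sketch (assuming $X = \mathbb{R}^4$ and $Y = \{-1, 1\}$; the values are purely illustrative):

```python
# A training set Z of m instance-label pairs (x_i, y_i),
# with X = R^4 and Y = {-1, 1} (illustrative values).
Z = [
    ((0.5, 2.3, 0.0, 7.2), 1),   # a positive example (y = 1)
    ((1.0, 0.0, 1.0, 0.0), -1),  # a negative example (y = -1)
]

m = len(Z)               # number of examples
S = [x for (x, _) in Z]  # the training set without labels: S = {x_1, ..., x_m}
```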

Notations

A concept (hypothesis) class C is a set (not necessarily finite) of functions of the form:

$$C = \{h \mid h : X \to Y\}$$

Each $h \in C$ is called a concept or hypothesis or classifier. Example: if $X = \{0,1\}^n$ and $Y = \{0,1\}$, then C might be:

$$C = \{h_i \mid h_i(x) = x_i\}$$

Other examples:

Conjunction learning: a conjunction is a special case of a Boolean formula. A literal is a variable or its negation, and a term is a conjunction of literals, e.g. $x_1 \wedge \bar{x}_2 \wedge x_3$. A target function is a term which consists of a subset of the literals. In this case $X = \{0,1\}^n$, $Y = \{\text{true}, \text{false}\}$, and $|X| = 2^n$, $|C| = 3^n$. (A learning sketch follows after this slide.)

Decision trees: when $X = \{0,1\}^n$, $Y = \{0,1\}$, then any Boolean function can be described by a binary tree. Thus C consists of decision trees ($|C| = 2^{2^n}$, the number of Boolean functions on n variables).

Separating hyperplanes: here $X = \mathbb{R}^n$, and a concept h(x) is specified by a vector $w \in \mathbb{R}^n$ and a scalar b such that:

$$h(x) = \begin{cases} 1 & w^\top x \ge b \\ -1 & \text{otherwise} \end{cases}$$
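The slides define the conjunction class but not a learning procedure. A standard choice for the realizable case is the elimination algorithm, sketched below (the encoding and names are mine): start from all 2n literals and drop every literal violated by a positive example.

```python
# Elimination algorithm for learning a conjunction (realizable case).
# X = {0,1}^n, Y = {True, False}. A hypothesis is a set of literals;
# the literal (i, True) stands for x_i, and (i, False) for its negation.

def learn_conjunction(Z, n):
    """Z is a list of (x, y) pairs, with x a 0/1 tuple of length n."""
    # Start with all 2n literals: x_i and not-x_i for every i.
    literals = {(i, pos) for i in range(n) for pos in (True, False)}
    for x, y in Z:
        if y:
            # Keep only literals that this positive example satisfies:
            # (i, True) needs x[i] == 1, (i, False) needs x[i] == 0.
            literals = {(i, pos) for (i, pos) in literals
                        if (x[i] == 1) == pos}
    return literals

def h(literals, x):
    """The learned term: true iff x satisfies every remaining literal."""
    return all((x[i] == 1) == pos for (i, pos) in literals)

# Examples labeled by the target term x_1 AND not-x_3:
Z = [((1, 0, 0), True), ((1, 1, 0), True), ((0, 1, 0), False)]
print(h(learn_conjunction(Z, 3), (1, 0, 1)))  # -> False (x_3 is set)
```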

The Formal Learning Model
Probably Approximately Correct (PAC)

• Distribution invariant: the learner does not need to estimate the joint distribution D over X × Y. The assumptions are that examples arrive i.i.d. and that D exists and is fixed.

• The training sample complexity (size of the training set Z) depends only on the desired accuracy and confidence parameters; it does not depend on D.

• Not all concept classes C are PAC-learnable. But some interesting classes are.

PAC Model Definitions

Unrealizable case: when no target concept in C generates the labels. The training set is

$$Z = \{(x_i, y_i)\}_{i=1}^{m}$$

and D is over X × Y.

Realizable case: when a target concept $c_t \in C$ is known to lie inside C. In this case the training set is

$$Z = \{(x_i, c_t(x_i))\}_{i=1}^{m}$$

where $S = \{x_1, \dots, x_m\}$ is sampled randomly and independently (i.i.d.) according to some (unknown) distribution D, i.e., S is distributed according to the product distribution $D^m = D \times \cdots \times D$.

Given a concept function $c_t \in C$ and a hypothesis $h \in C$, define

$$err(h) = \mathrm{prob}_D\big[x : c_t(x) \neq h(x)\big] = \int_{x \in X} \mathrm{ind}\big(c_t(x) \neq h(x)\big)\, D(x)\, dx$$

where

$$\mathrm{ind}(x) = \begin{cases} 1 & x = \text{true} \\ 0 & x = \text{false} \end{cases}$$

err(h) is the probability that an instance x sampled according to D will be labeled incorrectly by h(x).
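A Monte Carlo sketch of $err(h)$, assuming (for illustration only; not from the slide) that D is uniform over $X = \{0,1\}^3$, with a particular $c_t$ and $h$:

```python
# Estimate err(h) = prob_D[c_t(x) != h(x)] by i.i.d. sampling from D.
# Here D is uniform over X = {0,1}^3; c_t and h are illustrative.
import random

def c_t(x):  # target concept: x_1 AND not-x_3
    return x[0] == 1 and x[2] == 0

def h(x):    # candidate hypothesis: x_1 alone
    return x[0] == 1

def estimate_err(m=100_000):
    errors = 0
    for _ in range(m):
        x = tuple(random.randint(0, 1) for _ in range(3))
        errors += (c_t(x) != h(x))
    return errors / m

# h disagrees with c_t exactly when x_1 = 1 and x_3 = 1, so err(h) = 1/4.
print(estimate_err())  # -> approximately 0.25
```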

PAC Model Definitions

$\epsilon > 0$, given to the learner, specifies the desired accuracy, i.e.

$$err(h) - opt(C) \le \epsilon, \qquad \text{where } opt(C) = \min_{h \in C} err(h)$$

Note: in the realizable case $opt(C) = 0$ because $err(c_t) = 0$.

$0 < \delta < 1$, given to the learner, specifies the desired confidence, i.e.

$$\mathrm{prob}\big[err(h) - opt(C) \le \epsilon\big] \ge 1 - \delta$$

The learner is allowed to deviate occasionally from the desired accuracy, but only rarely so..

PAC Model Definitions

We will say that an algorithm L learns C if for every $\epsilon, \delta$ and for every distribution D over X × Y, L generates a concept function $h \in C$ such that the probability that $err(h) \le opt(C) + \epsilon$ is at least $1 - \delta$.

Formal Definition of PAC Learning

A learning algorithm L is a function

$$L : \{(x_i, y_i)\}_{i=1}^{m} \mapsto C$$

from the set of all training examples to C with the following property: given any $\epsilon, \delta \in (0, 1)$ there is an integer $m_0(\epsilon, \delta)$ such that if $m \ge m_0$ then, for any probability distribution D on X × Y, if Z is a training set of length m drawn randomly according to $D^m$, then with probability of at least $1 - \delta$ the hypothesis $h = L(Z) \in C$ is such that $err(h) - opt(C) \le \epsilon$.

We say that C is learnable (or PAC-learnable) if there is a learning algorithm for C.

Formal Definition of PAC Learning

Notes:

• $m_0(\epsilon, \delta)$ does not depend on D, i.e., the PAC model is distribution invariant.

• The class C determines the sample complexity: for “simple” classes $m_0(\epsilon, \delta)$ would be small compared to more “complex” classes.

Course Syllabus

3 x PAC: sample-size bounds for a finite concept class,

$$m \ge \frac{1}{\epsilon} \ln \frac{|C|}{\delta},$$

and for an infinite class via the VC dimension,

$$m \ge \frac{1}{\epsilon} \left( vcd(C) \ln \frac{1}{\epsilon} + \ln \frac{1}{\delta} \right)$$

2 x Separating Hyperplanes: Support Vector Machine, Kernels, Linear Discriminant Analysis

3 x Unsupervised Learning: Dimensionality Reduction (PCA), Density Estimation, Non-parametric Clustering (spectral methods)

5 x Statistical Inference: Maximum Likelihood, Conditional Independence, Latent Class Models, Expectation-Maximization Algorithm, Graphical Models
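A quick numeric reading of the finite-class bound above (a sketch; the function name is mine), e.g. for conjunctions over n = 10 variables, where $|C| = 3^{10}$:

```python
# Sample size sufficient under the finite-class bound from the slide:
# m >= (1/eps) * ln(|C| / delta).
import math

def m0_finite(C_size, eps, delta):
    return math.ceil((1.0 / eps) * math.log(C_size / delta))

# Conjunctions over n = 10 Boolean variables: |C| = 3**10 = 59049.
print(m0_finite(3 ** 10, eps=0.1, delta=0.05))  # -> 140 examples
```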