Introduction to Machine Learning
Bert Kappen
SNN, Donders Institute, Radboud University, Nijmegen
Gatsby Unit, UCL London
September 1, 2014
Based on the book
Machine Learning: A Probabilistic Perspective
Kevin P. Murphy
Bert Kappen Inl. ML
Big data
• Science: neuroscience, astronomy, medical
• Internet (facebook, yahoo, google, IBM)
• Advertisement
• ...
Machine Learning: "a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!)".
Types of machine learning:
• Predictive or Supervised: classification, regression, function fitting
• Descriptive or Unsupervised: clustering, pattern recognition, visualization, 'discovery'
• Reinforcement learning (not this course)
Supervised Learning/Classification
Learn a mapping y = f(x) from inputs x to outputs y, with y ∈ {1, . . . ,C}; x are 'features'.
Predict y for previously unseen x (generalization).
We require probabilities for this: p(y|x,D) with D the previous data. This is a vector of length C such that
∑y p(y|x,D) = 1.
Use the MAP estimate ŷ = argmaxy p(y|x,D) as the most likely prediction.
Use the probabilities to quantify uncertainty. Examples: Jeopardy and ad placement.
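A minimal sketch of the MAP prediction, using made-up posterior values for C = 3 classes:

```python
import numpy as np

# Illustrative posterior p(y|x, D) over C = 3 classes (values are made up).
p = np.array([0.2, 0.5, 0.3])

# MAP estimate: the most probable class label.
y_map = int(np.argmax(p))

# The probability of the chosen class quantifies how certain the prediction is.
confidence = float(p[y_map])
```

Here y_map is 1 and the confidence is 0.5, so the prediction is far from certain even though it is the best guess.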
Document classification
The aim is to classify documents into a number of classes. Compute p(y|x,D) with x some representation of the text. Examples: spam filtering, classify as spam yes/no (ch. 8); news feeds, classify topics.
Text is often represented as bag of words: xi j = 1 iff word j appears in document i.
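A minimal sketch of building such a binary bag-of-words matrix, using a made-up three-document corpus:

```python
import numpy as np

# Hypothetical mini-corpus; the vocabulary is built from the documents themselves.
docs = ["cheap pills buy now", "meeting agenda attached", "buy cheap meeting"]
vocab = sorted({w for d in docs for w in d.split()})

# Binary bag-of-words: X[i, j] = 1 iff word j appears in document i.
X = np.array([[1 if w in d.split() else 0 for w in vocab] for d in docs])
```

Each row of X is the bit-vector representation of one document; word order and word counts are discarded.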
Subset of 1000 documents and 100 words from USENET. Each row is a document (represented as a bag-of-words bit vector), each column is a word. The red lines separate the 4 classes, which are comp, rec, sci, talk.
Flower classification
Feature extraction is a non-trivial first step. Features extracted: sepal length and width, and petal length and width.
Flower classification
[Figure: pairwise scatter plots of the four iris features: sepal length, sepal width, petal length, petal width]
We see that it is easy to distinguish setosas (red circles) from the other two classes by just checking their petal length. However, distinguishing versicolor from virginica is slightly harder; any decision will need to be based on at least two features.
No features: Handwritten character recognition
[Figure: example handwritten digits with true class labels 7, 2, 1, 0, 4, 1, 4, 9, 5]
Data from the National Institute of Standards (the MNIST data set) consists of 60,000 training and 10,000 test images.
Most methods are insensitive to the spatial layout of the pixels: a classifier trainedon permuted data works equally well (ex. 1.1).
Face detection and face recognition
Face detection: classify areas that likely contain a face.
Face recognition: classify face areas to identify persons
Regression
Is like classification, but with y continuous.
Examples:
• tomorrow's stock value based on today's stock values and other attributes
• the location of a robot based on sensor data
• the survival probability of a patient with certain characteristics
• ...
[Figure: polynomial fits of degree 1 and degree 2 to the same data]
(Over)fitting the data: find the best model to describe a given data set.
Generalization: find a model that best predicts on new data.
Unsupervised learning
Goal is to find ’interesting structure’ in the data.
[Figure: scatter plot of weight versus height]
Build a model of the form p(x|θ) with θ a (set of) parameters and x (high-dimensional) data.
Unsupervised learning is more like human or animal learning because it does not require labeled data:
When we're learning to see, nobody's telling us what the right answers are - we just look. Every so often, your mother says "that's a dog", but that's very little information. You'd be lucky if you got a few bits of information - even one bit per second - that way. The brain's visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it's no use learning one bit per second. You need more like 10^5 bits per second. And there's only one place you can get that much information: from the input itself. (Hinton 1996).
Discovering clusters
Height and weight of 210 persons. a) Unsupervised: only the data points. b) Supervised: also the class labels (male and female).
Choose the number of clusters K and compute p(K|D). The model with K* = argmaxK p(K|D) is the best model.
Assign a label to each data point: z*i = argmaxz=1,2 p(z|xi,D) when K = 2 (right figure).
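As a simplified sketch of the assignment step: given two cluster centers (all numbers made up for the example), each point is assigned to its nearest center, a hard-assignment stand-in for argmax over z:

```python
import numpy as np

# Two illustrative cluster centers, e.g. the mean (height, weight) per cluster.
centers = np.array([[65.0, 130.0],
                    [72.0, 190.0]])

# Three made-up data points.
X = np.array([[64.0, 125.0],
              [73.0, 200.0],
              [66.0, 140.0]])

# Distance of every point to every center, then pick the nearest:
# a simplified analogue of z*_i = argmax_z p(z | x_i, D) with K = 2.
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
z = dists.argmin(axis=1)
```

The probabilistic version in the slides additionally weighs how plausible each cluster is, rather than using hard Euclidean distance alone.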
Examples of clustering problems
Examples (refs in book):
• The autoclass system discovered a new type of star based on clustering astro-physical measurements
• In e-commerce, users are clustered into groups based on their purchasing orother behavior and then selectively targeted for advertisement
• In biology, flow cytometry data is clustered into groups to discover different subpopulations of cells.
Dimension reduction
Often high dimensional data can be reduced to a lower dimension that capturesthe ’essence’ of the problem.
A simple method is called principal component analysis (PCA). In the example, the 3-dimensional data lie in 2 dimensions and their variability can approximately be characterized by 1 number (latent factor).
PCA is a linear dimension reduction method. Non-linear methods also exist.
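A minimal PCA sketch via the SVD of the centered data matrix, on synthetic 3-d data that is essentially 1-dimensional (the data and seed are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-d data generated from a single latent factor plus small noise.
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, -1.0, 0.5]]) + 0.01 * rng.normal(size=(200, 3))

# PCA: center the data, then take the top singular directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of variance captured by each principal component.
explained = s**2 / np.sum(s**2)

# 1-d representation: projection onto the first principal direction.
scores = Xc @ Vt[0]
```

Because the data was generated from one latent factor, the first component captures nearly all of the variance.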
Dimension reduction
a) 25 randomly chosen 64 × 64 pixel images from the Olivetti face database. b) The mean and the first three principal component basis vectors (eigenfaces).
Advantages of dimension reduction: higher classification accuracy, better generalization, visualization.
Discovering graph structure
Represent the dependencies between variables in a graph G: nodes represent variables, and edges represent direct dependence between variables.
We learn a distribution over graphs from data, p(G|D), and choose the best one: argmaxG p(G|D).
Biology: measure the phosphorylation status of some proteins in a cell (Sachs et al. 2005) and construct the interdependencies between the different proteins.
[Figure: learned protein interaction graph, lambda = 7.00, 18 edges]
Finance: model the covariance between a large number of different stocks, using sparse priors.
Matrix completion
A questionnaire where people have not filled in all questions. Image inpainting based on pairwise Markov random fields.
Collaborative filtering: xi j denotes how much user i likes movie j.
[Figure: users × movies rating matrix with many missing entries]
Netflix 2006 prize of 1 M$: 18k movies and 500k users, with 1% of entries observed. See section 27.6.2.
Basic concepts of machine learning
Parametric versus non-parametric models: does the model have a fixed set of parameters, or does the number of parameters grow with the amount of data?
Example: K nearest neighbor classifier (KNN).
p(y = c|x,D) = (1/K) ∑i∈NK(x,D) I(yi = c)
NK(x,D) is the set of K nearest neighbors of x in the total data set D, and I is the indicator function. The distance is typically Euclidean, but other distances can be used.
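The formula above can be sketched directly (the tiny data set is made up for illustration):

```python
import numpy as np

def knn_predict_proba(x, X, y, K, C):
    """p(y = c|x, D): fraction of the K nearest neighbors of x with label c."""
    dists = np.linalg.norm(X - x, axis=1)   # Euclidean distance to all points
    nearest = np.argsort(dists)[:K]         # indices of the K nearest neighbors
    return np.bincount(y[nearest], minlength=C) / K

# Tiny made-up data set with C = 2 classes.
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]])
y = np.array([0, 0, 1, 1])

p = knn_predict_proba(np.array([0.05, 0.05]), X, y, K=3, C=2)
```

With K = 3, two of the three nearest neighbors have label 0, so p = [2/3, 1/3]; the argmax gives the predicted label 0.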
Two-dimensional example with 3 classes and K = 10. a) data. b) p(y = 1|x,D). c) p(y = 2|x,D). d) argmaxy p(y|x,D).
Curse of dimensionality
The KNN model performs poorly in high dimensions.
[Figure: edge length of the sub-cube versus fraction of data in the neighborhood, for d = 1, 3, 5, 7, 10]
The reasoning:
• Estimating p(y|x,D) requires a fixed fraction f of the data in the neighborhood of x; otherwise there are no statistics.
• Assume the data is uniformly distributed in the N-dimensional unit hypercube.
• The fraction of data in a smaller hypercube with edge length ε is f = ε^N, so εN(f) = f^(1/N).
• When N is large, the neighborhood almost coincides with the unit hypercube, even for small f.
• Thus, KNN looks at neighbors very far away from x, which are not good predictors for the class membership of x.
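The edge-length formula makes the effect concrete:

```python
def edge_length(f, N):
    """Edge length eps_N(f) = f**(1/N) of the sub-cube that contains a
    fraction f of uniformly distributed data in the N-dim unit hypercube."""
    return f ** (1.0 / N)

# Even a tiny fraction f = 0.01 requires a neighborhood spanning most of
# the cube once N is large.
e1 = edge_length(0.01, 1)     # 1% of each axis in 1 dimension
e10 = edge_length(0.01, 10)   # already ~63% of each axis
e100 = edge_length(0.01, 100) # ~96% of each axis
```

So in 100 dimensions, the "neighborhood" holding only 1% of the data stretches across almost the entire cube.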
Parametric models
Models with a fixed number of parameters tend to suffer less from the curse of dimensionality.
Linear regression:
y(x) = ∑j wj xj + ε
ε is the error between the predicted output wT x and the actual output y. ε is often modelled as a Gaussian random variable, ε ∼ N(µ, σ^2) (see ch. 2).
The probability is
p(y|x, θ) = N(y|µ(x), σ^2(x))
In the simplest case µ(x) = wT x and the noise is fixed, σ^2(x) = σ^2. Then θ = (w, σ^2) are the parameters of the model.
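Under this Gaussian noise model, maximizing the likelihood of w is equivalent to least squares. A minimal sketch on synthetic data (the true weights, noise level, and seed are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data y = w^T x + eps with eps ~ N(0, sigma^2).
w_true = np.array([2.0, -1.0])
sigma = 0.1
X = rng.normal(size=(100, 2))
y = X @ w_true + sigma * rng.normal(size=100)

# Maximum-likelihood estimate of w under Gaussian noise = least squares.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Maximum-likelihood estimate of the noise variance from the residuals.
sigma2_hat = float(np.mean((y - X @ w_hat) ** 2))
```

With 100 samples, w_hat lands close to the generating weights and sigma2_hat close to sigma^2 = 0.01.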
Linear regression with basis functions
We can model non-linear relations by using (fixed) non-linear functions φ j(x). Then
p(y|x, θ) = N(y|wTφ(x), σ^2)
φ(x) = (1, x, x^2, . . . , x^d). a) d = 14. b) d = 20.
Many popular methods (support vector machines, neural networks, tree methods)extend this basic idea (ch. 14 and 16).
When d is too large, the model overfits on the data: poor generalization.
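The overfitting symptom is easy to see numerically: training error keeps dropping as the degree d grows, even though generalization eventually gets worse. A sketch on a made-up target function:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples of a smooth target function (made up for the example).
x = np.linspace(-1.0, 1.0, 30)
y = 5.0 * np.sin(3.0 * x) + rng.normal(scale=1.0, size=x.size)

def train_mse(x, y, d):
    """Fit phi(x) = (1, x, ..., x^d) by least squares; return the mean
    squared error on the training data."""
    Phi = np.vander(x, d + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return float(np.mean((Phi @ w - y) ** 2))

# Training error is non-increasing in d, because higher-degree models
# contain the lower-degree ones.
errors = [train_mse(x, y, d) for d in (1, 2, 5, 14)]
```

A low training error therefore says nothing by itself; only performance on held-out data reveals overfitting.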
Logistic regression
In the case of classification, the output y is binary, y ∈ {0, 1}. We construct a classifier in two steps:
• construct µ(x) = σ(wT x) with
σ(x) = 1 / (1 + e^(−x))
0 ≤ µ(x) ≤ 1 can be interpreted as a probability.
• Define p(y|x,w) = Ber(y|µ(x)), with Ber(y|p) = p^y (1 − p)^(1−y) the Bernoulli distribution (ch. 2)
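The two steps above in a minimal sketch (the weight vector and input are illustrative values, not fitted parameters):

```python
import numpy as np

def sigmoid(a):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli(y, mu):
    """Ber(y|mu) = mu^y * (1 - mu)^(1 - y) for y in {0, 1}."""
    return mu**y * (1.0 - mu)**(1 - y)

# Step 1: mu(x) = sigma(w^T x), interpreted as p(y = 1 | x, w).
w = np.array([0.5, -1.0])   # illustrative weights
x = np.array([2.0, 1.0])    # illustrative input
mu = float(sigmoid(w @ x))  # w^T x = 0 here, so mu = 0.5

# Step 2: the Bernoulli likelihood of each possible label.
p1 = bernoulli(1, mu)
p0 = bernoulli(0, mu)
```

Note mu and 1 − mu always sum to 1, so the Bernoulli output is a proper distribution over the two labels.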
Example: SAT scores
Classify whether students pass or fail a class (y) based on their SAT scores (x).
p(yi = 1|xi,w) = σ(w0 + w1xi)
Black dots show the training data. The red circles plot p(yi = 1|xi, w) with w the parameters estimated from the training data (ch. 8). Green dots: two students with identical SAT scores perform differently in the class.
The decision rule y(x) = 1 ⇔ p(y = 1|x) > 0.5 is not perfect: the data is not linearly separable.
Model selection
When varying K we obtain different KNN classifiers: large K averages more data points and gives smoother models.
[Figure: predicted labels for K = 1 and K = 5, classes c1, c2, c3]
Misclassification rate err(f,D) = (1/N) ∑i I(f(xi) ≠ yi) on training data (blue) and on test data using cross validation (red), as a function of K.
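The misclassification rate is just the fraction of wrong predictions. A minimal sketch with a made-up data set and a deliberately bad constant classifier:

```python
import numpy as np

def misclassification_rate(f, X, y):
    """err(f, D) = (1/N) * sum_i I(f(x_i) != y_i)."""
    return float(np.mean([f(xi) != yi for xi, yi in zip(X, y)]))

# Made-up check: a classifier that always predicts class 0 gets half of
# these four labels wrong.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
err = misclassification_rate(lambda xi: 0, X, y)
```

For model selection, this quantity is evaluated on held-out data (cross validation), since the training value alone rewards overfitting.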
No free lunch
All models are wrong, but some models are useful (George Box 1987).
We can empirically choose the best model for our problem.
There is no universally best model (no free lunch theorem).
Outline of course (tentative)
1. Introduction
2. Probability
(a) review of probability theory
(b) discrete distributions
(c) continuous distributions
(d) joint distributions (Gaussian, Dirichlet)
(e) transformation of random variables
(f) Monte Carlo sampling
(g) Information theory
3. Generative models for discrete data
(a) Bayesian concept learning
(b) beta-binomial model
(c) Dirichlet model
(d) Naive Bayes classifiers
4. Gaussian models
(a) Maximum likelihood models
(b) Discriminant analysis
(c) Inference in jointly Gaussian distributions
5. Bayesian statistics
(a) Summarizing posterior distributions
(b) Bayesian model selection
(c) (Priors)
(d) Hierarchical Bayes
(e) Empirical Bayes
6. ...