
DATA11002

Introduction to Machine Learning

Lecturer: Antti Ukkonen
TAs: Saska Dönges and Janne Leppä-aho

Department of Computer Science, University of Helsinki

(based in part on material by Patrik Hoyer, Jyrki Kivinen, and Teemu Roos)

November 1st–December 14th 2018


Recap and what to expect in the exam


General ideas

- We considered the main concepts of the machine learning process, such as
  - the task
  - the model
  - the machine learning algorithm as a method of creating a model using data
- This leaves out many work phases that are not usually considered part of machine learning as such but have crucial importance for the overall result, such as
  - defining the problem
  - getting and wrangling the data
  - interpreting and utilising the results
- In practice these work phases often tend to take place in an iterative loop.


Statistical learning: assumptions & terminology

- Let X = (X_1, X_2, ..., X_p) denote a set of input variables, aka features, predictors, covariates, or independent variables.
- Let Y denote an output variable, aka response, dependent variable, or (class) label (in classification).
- For now, we focus on the following setting:

  Y = f(X) + ε,

  where f is some unknown (possibly insanely complicated) function, and ε is an error term.
- All systematic information that X provides about Y is contained in f.
- Our objective is to learn an estimate f̂ of f.


Overfitting

- Overfitting means creating models that follow the specifics of the training data too closely, resulting in poor performance on unseen data
- Overfitting often results from using too complex a model with too little data
  - complex models allow high accuracy but require lots of data to train
  - simple models require less training data but are incapable of modelling complex phenomena accurately
- Choosing the right model complexity is a difficult problem for which there are many methods (incl. cross-validation; Exercise 1.3); see the sketch below
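As an illustration, here is a minimal sketch in plain NumPy: the sine data source, the noise level, and the chosen polynomial degrees are all made up for the example, not taken from the course material. Typically the most flexible fit reaches the lowest training error but not the lowest test error.

```python
# A minimal overfitting sketch (illustrative data): fit polynomials of
# increasing flexibility and compare training vs. test squared error.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                               # the unknown "true" function
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + rng.normal(scale=0.2, size=x_train.size)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(scale=0.2, size=x_test.size)

for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)        # least-squares fit
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```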


Error vs flexibility (train and test)

- Left: data source (black line), data (circles), and three regression models of increasing complexity; Right: training and test errors (squared error) of the three models

(Figure 2.9 from the course textbook)

[Left panel axes: X (horizontal), Y (vertical). Right panel axes: flexibility (horizontal), mean squared error (vertical).]

Classification: Probabilistic Methods


Some definitions/notation

- P(X = x_i, Y = y_i) denotes the probability of the pair (x_i, y_i). (Sometimes we write P(x_i, y_i) for short.)
- P(Y = y_i | X = x_i) denotes the conditional probability of observing y_i given x_i. (We can also write P(Y = y_i | x_i) or P(y_i | x_i) for short.)
- If we knew P (which we in practice never do!), we could implement an “optimal” classifier by assigning x_i to the class y_i that maximises P(y_i | x_i).
- In the following, we are particularly interested in models for P(Y = y_i | x_i). These models are almost always “wrong”, but perhaps they give good predictions anyway!


Logistic regression

- Logistic regression models are linear models for probabilistic binary classification (so, not really regression, where the response is continuous)
- Given an input (vector) x, the output is a probability that Y = 1. Let's denote it by p(Y = 1 | x).
- However, instead of using a linear model directly, as in

  p(Y = 1 | x) = β · x,

  we let

  log [ p(Y = 1 | x) / p(Y = 0 | x) ] = β · x

- This amounts to the same as

  p(Y = 1 | x) = exp(β · x) / (1 + exp(β · x)) = 1 / (1 + exp(−β · x))
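A minimal numeric sketch of the model above, with made-up coefficients β and input x (purely illustrative): the predicted probability is the sigmoid of the linear score, and the log-odds recovers β · x exactly.

```python
# Illustrative only: the logistic model maps the linear score beta . x
# through the sigmoid to obtain p(Y = 1 | x).
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

beta = np.array([0.5, -1.2, 2.0])   # hypothetical learned coefficients
x = np.array([1.0, 0.3, 0.8])       # input vector (first entry acts as a bias term)

score = beta @ x                     # beta . x, i.e. the log-odds
p_y1 = sigmoid(score)                # p(Y = 1 | x)
print(p_y1, np.log(p_y1 / (1 - p_y1)))   # the printed log-odds equals the score
```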


Generative vs discriminative learning

- Logistic regression was an example of a discriminative and probabilistic classifier that directly models the class distribution P(y | x)
- Another probabilistic way to approach the problem is to use generative learning, which builds a model for the whole joint distribution P(x, y), often using the decomposition P(y)P(x | y)
- Both approaches have their pros and cons:
  - Discriminative learning: only solve the task that you need to solve; may provide better accuracy since it focuses on the specific learning task; optimization tends to be harder
  - Generative learning: often more natural to build models for P(x | y) than for P(y | x); handles missing data more naturally; optimization often easier


Classification: Gaussian classifiers and NB


Classification with Bayes

- Given an instance x = (x_1, ..., x_p), and any class value c ∈ {1, ..., k}, Bayes' theorem gives us

  P(Y = c | X = x) = P(X = x | Y = c) P(Y = c) / Σ_{c′=1}^{k} P(X = x | Y = c′) P(Y = c′)

- The basic version of naive Bayes then predicts the class c with maximum posterior probability (MAP):

  ĉ(x) = arg max_c P(Y = c | X = x)

- Probabilistic predictions are obtained directly from P(Y = c | X = x)
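A tiny worked example of the Bayes rule above, for a two-class spam/ham setting; the priors and class-conditional likelihoods are invented for illustration.

```python
# Made-up numbers: Bayes' theorem turns class-conditional likelihoods and
# priors into posterior class probabilities, and MAP picks the largest one.
priors = {"spam": 0.4, "ham": 0.6}               # P(Y = c)
likelihoods = {"spam": 0.02, "ham": 0.0005}      # P(X = x | Y = c) for one fixed x

evidence = sum(likelihoods[c] * priors[c] for c in priors)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}

print(posteriors)                                 # spam ~ 0.96, ham ~ 0.04
print(max(posteriors, key=posteriors.get))        # MAP prediction: 'spam'
```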


Gaussian Naive Bayes

- Assume that we have p input features, X_1, ..., X_p.
- The naive Bayes assumption is that the input features are conditionally independent given the class:

  P(X_1, ..., X_p | Y) = P(X_1 | Y) ⋯ P(X_p | Y)

- Thus for a feature vector x, we have

  p(x | Y = +1) = p(X_1 = x_1 | Y = +1) ⋯ p(X_p = x_p | Y = +1)
  p(x | Y = −1) = p(X_1 = x_1 | Y = −1) ⋯ p(X_p = x_p | Y = −1)

- In the Gaussian naive Bayes model, we let p(X_j = x | Y = c) be independent univariate Gaussians for each feature X_j and class c


Learning a naive Bayes model

- Assume there are k classes 1, ..., k and p input features, where for j = 1, ..., p feature X_j has range {1, ..., q_j}
- We model P(X | Y = c) separately for each class c and feature X ∈ {X_1, ..., X_p}:
  - For each c ≤ k, j ≤ p, and x ≤ q_j, let n_{c,j,x} be the number of examples in the training data in class c with feature value X_j = x, and n_c = Σ_{x=1}^{q_j} n_{c,j,x}
  - We estimate

    P(X_j = x | Y = c) = (n_{c,j,x} + m_{c,j,x}) / (n_c + m_{c,j}),

    where m_{c,j,x} is a prior pseudocount and m_{c,j} = Σ_{x=1}^{q_j} m_{c,j,x}
- Usual choices for the pseudocounts are m_{c,j,x} = 0 (maximum likelihood), m_{c,j,x} = 1 (Laplace smoothing), m_{c,j,x} = 1/2 (Krichevsky–Trofimov), and m_{c,j,x} = 1/3 (the “Jaasaari method”)
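A small sketch of the smoothed estimate above for one feature and one class, in the common special case where the pseudocount m_{c,j,x} is a constant m for every value (so m_{c,j} = m · q_j). The observed values are made up.

```python
# Estimate P(X_j = x | Y = c) with a constant pseudocount m per value;
# m = 1 corresponds to Laplace smoothing.
from collections import Counter

def estimate_feature_probs(values, q_j, m=1.0):
    """values: observed values (in 1..q_j) of feature j among class-c examples."""
    counts = Counter(values)
    n_c = len(values)
    return {x: (counts.get(x, 0) + m) / (n_c + m * q_j) for x in range(1, q_j + 1)}

# feature j takes values in {1, 2, 3}; these are its values within class c
observed = [1, 1, 2, 1, 3, 1]
print(estimate_feature_probs(observed, q_j=3, m=1.0))
# -> approximately {1: 0.556, 2: 0.222, 3: 0.222}, i.e. 5/9, 2/9, 2/9;
#    no estimate is ever exactly zero thanks to the smoothing
```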


Classification: k-NN and Decision trees


Nearest neighbour classifier

- The nearest-neighbour classifier is a simple geometric model based on distances:
  - store all the training data
  - to classify a new data point, find the closest one in the training set and use its class
- More generally, the k-nearest-neighbour classifier finds the k nearest points in the training set and uses the majority class (ties are broken arbitrarily)
- Different notions of distance can be used, but Euclidean is the most obvious
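A minimal k-nearest-neighbour sketch with Euclidean distance and a majority vote. The toy points are invented, and ties are simply resolved by Counter's ordering rather than truly arbitrarily.

```python
# Classify a point by the majority class among its k nearest training points.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances to x
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy usage with made-up data
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["blue", "blue", "red", "red"])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))   # -> 'red'
```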


Nearest neighbour classifier (3)

(Figure from James et al., 2013.)

Decision trees: An example

- Idea: ask a sequence of questions to infer the class

[Figure: a spam-filtering decision tree; internal nodes test whether the message includes ‘viagra’, ‘netflix prize’, ‘millions’, or ‘meds’ (yes/no branches), and the leaves predict spam or not spam.]


Properties of decision trees

- Nonparametric approach:
  - If the tree size is unlimited, it can approximate any decision boundary to arbitrary precision (like k-NN)
  - So with infinite data it could in principle always learn the optimal classifier
  - But with finite training data it needs some way of avoiding overfitting (pruning): the bias–variance tradeoff!

[Figure: example in the (x1, x2) feature space; x1 ranges over 0–5 and x2 over 0–3.]

Properties of decision trees (contd.)

- Local, greedy learning to find a reasonable solution in reasonable time (as opposed to finding a globally optimal solution)
- Relatively easy to interpret (by experts or regular users of the system)
- Classification is generally very fast (linear in the depth of the tree)
- Usually competitive only when combined with ensemble methods: bagging, random forests, boosting
- The textbook may give the impression that decision trees and ensemble methods belong together, but we will consider ensemble methods more generally (in a week or two)


Support Vector Machines


Margin

- Given a data set (x_i, y_i), i = 1, ..., n, and M > 0, we say that a coefficient vector β separates the data with margin M if for all i we have

  y_i (β · x_i) / ‖β‖₂ ≥ M

- Explanation:
  - β · x_i / ‖β‖₂ is the scalar projection of x_i onto the vector β
  - y_i (β · x_i) ≥ 0 means we predict the correct class
  - |β · x_i| / ‖β‖₂ is the Euclidean distance between the point x_i and the decision boundary, i.e., the hyperplane β · x = 0

[Figure: a separating hyperplane with coefficient vector β and margin M.]
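A small numeric check of these quantities, with an invented coefficient vector β and three labelled points: each entry of `margins` is y_i (β · x_i) / ‖β‖₂, and the smallest entry is the margin of the data set.

```python
# Made-up beta and points: compute the geometric margin of each training point.
import numpy as np

beta = np.array([2.0, -1.0])                     # hypothetical coefficient vector
X = np.array([[1.0, -1.0], [2.0, 0.5], [-1.0, 1.0]])
y = np.array([+1, +1, -1])

margins = y * (X @ beta) / np.linalg.norm(beta)
print(margins)          # all positive => every point is correctly classified
print(margins.min())    # the (geometric) margin of the whole data set
```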

Observations on max margin classifiers

- Consider the linearly separable case ε_i ≡ 0.
- The maximal margin touches a set of training data points x_i, which are called support vectors

Fig. 9.3 in (James et al., 2013)


Observations on max margin classifiers (2)

- Given a set of support vectors, the coefficients defining the hyperplane can be written as

  β = Σ_{i=1}^{n} c_i y_i x_i,

  with some c_i ≥ 0, where c_i > 0 only if the i-th data point touches the margin
- In other words, the classifier is defined by a few data points!
- A similar property holds for the soft margin: the more the i-th point violates the margin, the larger c_i, and for points that do not violate the margin, c_i = 0


Kernel trick

- Since the data only appear through inner products ⟨x_i, x_j⟩, we can use the following kernel trick
- Imagine that we want to introduce non-linearity by mapping the original data into a higher-dimensional representation
  - remember the polynomial example x_i ↦ (1, x_i, x_i², x_i³, ...)
  - interaction terms are another example: (x_i, x_j) ↦ (x_i, x_j, x_i x_j)
- Denote this mapping by φ: R^p → R^q, q > p
- Define the kernel function as K(x_i, x) = ⟨φ(x_i), φ(x)⟩
- The trick is to evaluate K(x_i, x) without actually computing the mappings φ(x_i) and φ(x)
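A minimal sketch of the trick for the quadratic kernel K(u, v) = (u · v)² on R², which is my own choice of kernel for illustration: it agrees with the inner product of the explicit feature map φ(u) = (u₁², u₂², √2 u₁u₂) without ever computing φ.

```python
# The kernel evaluated in the original 2-dimensional space equals the inner
# product of the explicit 3-dimensional feature maps.
import numpy as np

def phi(u):                                  # explicit feature map
    return np.array([u[0] ** 2, u[1] ** 2, np.sqrt(2) * u[0] * u[1]])

def K(u, v):                                 # kernel in the original space
    return (u @ v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
print(K(u, v), phi(u) @ phi(v))              # both print 1.0
```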


Clustering


Flat clustering: basic idea

- Each data vector x_i is assigned to one of K clusters
- Typically K and a similarity measure are selected by the user, while the chosen algorithm then learns the actual partition
- In the example below, K = 3 and the partition is shown using colour (red, green, blue)

[Figure: the example data X shown before and after partitioning into K = 3 colour-coded clusters.]

Hierarchical clustering: basic idea

[Figure: the data X arranged into a dendrogram; the leaves shown include x1, x6, x14, x19 and x25.]

- In this approach, data vectors are arranged in a tree, where nearby (similar) vectors x_i and x_j should be placed close to each other: e.g., x6 and x25 end up being siblings while x14 is a distant cousin
- Any horizontal cut corresponds to a flat clustering
- In the example above, the 3 colours have been added manually for emphasis (they are not produced by the algorithm)


K-means

- The most popular distance-based clustering method is K-means
- We specifically assume that X = R^p and use squared Euclidean distance as the dissimilarity measure
- Ideally, we would wish to find the partition and exemplars that minimise the total distance of the data points from their assigned exemplars:

  Σ_{j=1}^{K} Σ_{x ∈ D_j} ‖x − μ_j‖² = Σ_{i=1}^{n} ‖x_i − μ_{j(i)}‖²

- However, minimising this exactly is computationally difficult (NP-hard), so in practice we usually use heuristic algorithms


K-means algorithm

- We start by picking K initial cluster exemplars (for example randomly from our data set)
- We then alternate between the following two steps, until nothing changes any more:
  - Keeping the exemplars fixed, assign each data point to the closest exemplar
  - Keeping the assignments fixed, move each exemplar to the center of its assigned data points
- In this context we call the exemplars cluster means. Notice that generally they are not sample points in our data set, but can be arbitrary vectors in R^p
- This is also known as Lloyd's algorithm; see Algorithm 10.1 in the textbook
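A minimal sketch of Lloyd's algorithm as described above. It is illustrative only: initial exemplars are drawn at random from the data, there is no handling of empty clusters, and the toy data are made up.

```python
# Alternate assignment and mean-update steps until the means stop moving.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]   # initial exemplars
    for _ in range(n_iter):
        # assignment step: index of the closest mean for every point
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each mean to the centre of its assigned points
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

# toy usage with made-up data
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, means = kmeans(X, K=2)
print(labels, means)
```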


Hierarchical clustering

- Dendrogram representation:
  - Nested cluster structure
  - Binary tree with datapoints (objects) as leaves
  - Cutting the tree at any height produces a partitional clustering
- Example 1 (Figure 8.13 of the textbook excerpt shown on the slide): a hierarchical clustering of four two-dimensional points p1–p4, drawn both as a dendrogram and as a nested cluster diagram; the points were clustered using the single-link technique.
- The excerpt also states the common pattern behind agglomerative methods: starting with individual points as clusters, successively merge the two closest clusters until only one cluster remains (Algorithm 8.3, basic agglomerative hierarchical clustering):
  1. Compute the proximity matrix, if necessary.
  2. repeat
  3.   Merge the closest two clusters.
  4.   Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
  5. until only one cluster remains.


Hierarchical clustering (3)

General approaches to hierarchical clustering:

- Divisive approach:
  1. Start with one cluster containing all the datapoints.
  2. Repeat for all non-singleton clusters:
     - Split the cluster in two using some partitional clustering approach (e.g. K-means)
- Agglomerative approach:
  1. Start with each datapoint being its own cluster.
  2. Repeat until there is just one cluster left:
     - Select the pair of clusters which are most similar and join them into a single cluster

(The agglomerative approach is much more common, and we will exclusively focus on it in what follows; a small sketch is given below.)
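A minimal single-link agglomerative sketch following the steps above, written as a naive O(n³) loop over cluster pairs; the one-dimensional toy data are invented.

```python
# Repeatedly merge the two clusters whose closest members are nearest.
import numpy as np

def single_link(X, num_clusters):
    clusters = [[i] for i in range(len(X))]          # start: every point is its own cluster
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-link distance: closest pair of members across the two clusters
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]      # merge the two closest clusters
        del clusters[b]
    return clusters

X = np.array([[0.0], [0.1], [1.0], [1.1], [5.0]])
print(single_link(X, num_clusters=2))                # -> [[0, 1, 2, 3], [4]]
```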


Principal Component Analysis


Dimensionality Reduction

- When the dimension p of the data is large, it is often hard to visualize and process the data
- Classification and regression models easily overfit
- However, in many cases the data can be approximately summarized by a lower-dimensional representation
- For example, large questionnaire data sets can often be reduced to a few dimensions (cf. psychological scales such as Myers–Briggs, Keirsey)
- Perhaps the data actually resides on a rather low-dimensional manifold in the high-dimensional input feature space
- Visualization requires one-, two- or three-dimensional representations


Principal Component Analysis

- Principal Component Analysis (PCA) is a common (linear!) dimensionality reduction technique
- The basic idea is to project the data onto a lower-dimensional subspace so that as much variance as possible is retained
- Assume from now on that the data is zero-centered
  - If the original instances are z_1, ..., z_n, we replace them by x_i = z_i − z̄, where z̄ = (1/n) Σ_{i=1}^{n} z_i denotes the overall mean
  - Then Σ_i x_i = 0 and therefore the mean of each feature j is also zero: x̄_j = 0 for j = 1, ..., p


Principal Component Analysis (3)

- Example figure; source: (James et al., 2013), p. 230

Resampling and Ensemble Methods


Cross-validation

Recall that cross-validation gives us K (e.g., K = 10) train–test splits:

1. Divide the data into K equal-sized subsets (the slide illustrates the available data split into 5 blocks).
2. For j = 1 to K:
   2.1 Train the model(s) using all data except that of subset j.
   2.2 Compute the resulting validation error on subset j.
3. Average the K results.

When K = N (i.e. each datapoint is a separate subset) this is known as leave-one-out cross-validation.
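A minimal K-fold cross-validation sketch of the recipe above; the fit/predict/loss functions are placeholders (a trivial mean predictor and squared error) that you would swap for a real model and loss.

```python
# Split the data into K folds, hold out one fold at a time, and average the
# K validation errors.
import numpy as np

def k_fold_cv(X, y, K, fit, predict, loss, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, K)                # K roughly equal-sized subsets
    errors = []
    for j in range(K):
        val = folds[j]                            # held-out subset j
        train = np.concatenate([folds[i] for i in range(K) if i != j])
        model = fit(X[train], y[train])
        errors.append(loss(y[val], predict(model, X[val])))
    return np.mean(errors)                        # average of the K results

# toy usage: the "model" is just the training-set mean of y
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * X[:, 0] + np.random.default_rng(1).normal(size=20)
cv_error = k_fold_cv(
    X, y, K=5,
    fit=lambda Xtr, ytr: ytr.mean(),
    predict=lambda m, Xval: np.full(len(Xval), m),
    loss=lambda yt, yp: np.mean((yt - yp) ** 2),
)
print(cv_error)
```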


Bootstrap

- Another popular resampling method is the bootstrap
- The idea is to reuse the “training” set to obtain multiple data sets
- Not restricted to supervised learning (hence the quotes around “training”)
- These bootstrap samples can be used to estimate the variability of an estimate of a parameter θ
- Bootstrap:
  1. Let D = (x_1, ..., x_n) be the actual data.
  2. Repeat for j = 1, ..., K:
     2.1 Create D*_j by drawing n objects from D with replacement.
     2.2 Obtain the estimate θ*_j from D*_j.
  3. Use the bootstrap estimates θ*_1, ..., θ*_K to estimate variability.
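A minimal bootstrap sketch of the procedure above, estimating the variability (standard error) of the sample mean of invented data.

```python
# Resample the data with replacement K times and look at the spread of the
# resulting estimates.
import numpy as np

rng = np.random.default_rng(0)
D = rng.exponential(scale=2.0, size=50)          # the "actual" data (made up)

K = 1000
boot_estimates = np.empty(K)
for j in range(K):
    D_star = rng.choice(D, size=len(D), replace=True)   # bootstrap sample D*_j
    boot_estimates[j] = D_star.mean()                    # estimate theta*_j

print(D.mean(), boot_estimates.std(ddof=1))      # the estimate and its bootstrap SE
```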


The Exam


Exam structure 1/2

- The exam consists of four problems, each divided into a number of subtasks.
- Problem 1: Briefly (in a few sentences + a drawing if needed) explain some (about 6) core concepts introduced in the course. (~20% of exam points)
- Problems 2–4: Each of these concerns a method/concept discussed in the course. (~25–30% of exam points each)
- Total number of points: 60. You need at least 30 points to pass. According to historical data, it may or may not be possible to get to 30 points by perfectly completing only two problems.


Exam structure 2/2

In problems 2–4 you will be asked to do some of the following things:

- Explain what the method can be used for, and discuss how it works at a high level.
- Explain/show how the method works on given example data.
- Calculate some simple things given example data.
- Define core concepts (e.g. loss function etc.) related to the method.
- Discuss how you would apply the method in practice.


What is a good answer?

- Correct. The questions almost without exception have a single correct answer.
- To the point. Don't write long essays; these will in general not give you any more points than a brief but correct answer.
- Comprehensive. Try to capture all relevant aspects of the question in your answer.
- Easy to read. If your handwriting is totally unintelligible, we might not spend a lot of time trying to read it.
- Not “copy-paste” from the book/lecture slides. Use your own words.


Example of Problem 1

“Explain briefly the following terms ...

a) bias of a learning algorithm (as in the bias–variance trade-off)

Answer: “The bias of a learning algorithm refers to its inherent inability to correctly estimate the target variable due to erroneous assumptions. For example, when approximating a nonlinear function with a linear function, the resulting model will in general be biased. More formally, let f_S(x) denote the prediction of the model f_S obtained with training data S, and denote by y the correct value of the target variable associated with predictors x. The bias of f_S at x is the difference (E_S[f_S(x)] − y), where the expected value is taken over all possible training datasets S.”


Example of problems 2–4

“Here we will study linear regression with a simple dataset with one predictor variable and one response variable, as shown in the figure below. [Imagine a scatterplot somewhere here.]

a) Make a copy of the figure in your answer sheet, and sketch the linear regression line in the figure. Explain in general terms (without mathematics!) how the model is fitted.

b) Give an equation that formally defines a multivariate linear regression model, and describe its parts. Indicate in the figure how different parts of the equation relate to the visualisation in the case of a single-predictor model.

c) Define and explain briefly the loss function used to fit a linear regression model. How do you find the optimal model in the general case of p predictors?

d) Briefly explain how you can use the machinery of linear regression to also fit non-linear models.”


Example of problems 2–4, the answer

a) “A linear regression model is a linear function (a line) fitted so that for every value of the predictor variable x, the line is as close as possible to the corresponding value of the output variable y.”

b) “A linear regression model with p predictor variables and an intercept term is defined as

y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_p x_p + ε,

where the β_i are coefficients, and ε is an error term not captured by the model. The coefficient β_0 is called the intercept, while the other β_i, i > 0, are the coefficients of the corresponding predictor variables x_i. In practice the model predicts that the value of y will increase by β_i if the value of x_i increases by 1.”


Example of problems 2–4, the answer, pt2

c) “Linear regression models are most commonly fit using the so-called least-squares criterion, meaning that for a given training dataset D = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^p and y_i ∈ R, the learning algorithm finds the coefficients β_i so that the following cost is minimised:

Σ_{(x,y)∈D} (β_0 + Σ_{i=1}^{p} β_i x_i − y)².

This can also be expressed in terms of matrices:

(y − Xβ)^T (y − Xβ),

where X is an n × (p + 1) matrix with the feature vectors x_i as rows, appended by a column of all 1s, and β is the (p + 1)-dimensional vector of model coefficients. The solution is given by β = X^+ y, where X^+ = (X^T X)^{-1} X^T is the pseudoinverse of X.”
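A minimal sketch of the least-squares fit in part c), on made-up data; it uses NumPy's lstsq, which minimises ‖y − Xβ‖² and is numerically preferable to forming the pseudoinverse explicitly.

```python
# Build the design matrix with a column of 1s and solve the least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X_raw = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X_raw[:, 0] - 3.0 * X_raw[:, 1] + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), X_raw])          # intercept column of 1s
beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # minimises ||y - X beta||^2
print(beta)                                        # approximately [1.0, 2.0, -3.0]
```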


Example of problems 2–4, the answer, pt2

d) “In nonlinear models the dependency between the predictor and output variables is a nonlinear function, e.g. a polynomial of some degree d. Or, it may contain interaction terms defined as the product of two or more predictor variables. An example of a nonlinear function that combines these two aspects is

y = β_0 + β_1 x_1 + β_2 x_1 x_2 + β_3 x_3².

Now we can precompute the values of x_1 x_2 and x_3², and treat these as predictors in a regular linear regression model that is fit exactly in the same way as described above in part c).”
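A minimal sketch of part d) on invented data: precompute x_1·x_2 and x_3², append them to the design matrix, and fit ordinary least squares exactly as in part c).

```python
# Nonlinear terms become ordinary columns of the (expanded) design matrix.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))
y = 0.5 + 1.0 * x1 + 2.0 * x1 * x2 - 1.5 * x3 ** 2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x1 * x2, x3 ** 2])   # precomputed features
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                                                # approx. [0.5, 1.0, 2.0, -1.5]
```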


About the exam in general

- Main difficulty (in Antti's opinion): The course has a fairly wide scope, and there are a lot of different methods. You should learn the core ideas / assumptions / pitfalls of all of these!
- No need to give mathematical proofs.
- However, you should probably learn and understand some of the mathematics, such as the loss functions associated with different methods.
- Don't forget: You can prepare a two-sided, handwritten A4 sheet of whatever notes you think you might need.
- Calculators are allowed.


Where to go from here?

- Advanced Course in Machine Learning
- Computational Statistics
- Statistical Data Science
- Deep Learning
- Probabilistic Graphical Models
- Network Analysis
- Many, many seminars!


Thank you!
