
Machine Learning: Ensemble Learning

Hamid Beigy

Sharif University of Technology

Fall 1395




Table of contents

1 Introduction

2 Diversity measures

3 Design of ensemble systems

4 Building ensemble based systems


Introduction

1 In our daily life:

1 Asking different doctors' opinions before undergoing a major surgery.
2 Reading user reviews before purchasing a product.
3 There are countless examples where we rely on the decision of a mixture of experts.

2 Ensemble systems follow exactly the same approach to data analysis.

Problem (Ensemble learning)

Given a training data set S = {(x1, t1), (x2, t2), . . . , (xN, tN)} drawn from a common instance space X, and

a collection of inductive learning algorithms,

return a new classification algorithm for x ∈ X that combines the outputs of the classifiers produced by the collection.

3 Desired property: guarantees on the performance of the combined prediction.
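As a minimal illustration of what a combined classifier can look like (a sketch, not from the slides; plurality voting is only one of the combination rules discussed later):

```python
import numpy as np

def majority_vote(predictions):
    """Combine label predictions from several classifiers by plurality vote.

    predictions: (T, N) array -- row t holds classifier t's predicted labels
    (small non-negative integers) for all N instances.
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # For each instance (column), count the votes each class received.
    votes = np.apply_along_axis(np.bincount, 0, predictions,
                                minlength=n_classes)
    return votes.argmax(axis=0)

# Three hypothetical classifiers labelling four instances.
preds = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 1, 1, 0]]
print(majority_vote(preds))  # -> [0 1 1 0]
```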


Why do we combine classifiers?

Reasons for using ensemble-based systems

1 Statistical reasons

1 A set of classifiers trained on similar data may have different generalization performance.
2 Classifiers with similar training performance may perform differently in the field (depending on the test data).
3 In this case, averaging (combining) may reduce the overall risk of the decision.
4 In this case, averaging (combining) may or may not beat the performance of the best single classifier.

2 Large volumes of data

1 Training a single classifier on a very large volume of data is usually impractical.
2 A more efficient approach is to partition the data into smaller subsets, train different classifiers on the different partitions, and combine their outputs using an intelligent combination rule.

3 Too little data

1 We can use resampling techniques to draw overlapping random subsets of the training data.
2 Each subset can be used to train a different classifier.


Why do we combine classifiers? (cont.)

Reasons for using ensemble-based systems

1 Data fusion

1 Multiple sources of data (sensors, domain experts, etc.).
2 The data need to be combined systematically; for example, a neurologist may order several tests: an MRI scan, an EEG recording, blood tests.
3 A single classifier cannot be used to classify data from different sources (heterogeneous features).

2 Divide and conquer

1 Regardless of the amount of data, certain problems are too difficult for a single classifier to solve.
2 Complex decision boundaries can be implemented using ensemble learning.

Partitioning large datasets, training a classifier on each partition, and merging the results with an intelligent combination rule often proves to be a more efficient approach.

Too little data: Ensemble systems can also be used to address the exact opposite problem of having too little data. Availability of an adequate and representative set of training data is of paramount importance for a classification algorithm to successfully learn the underlying data distribution. In the absence of adequate training data, resampling techniques can be used for drawing overlapping random subsets of the available data, each of which can be used to train a different classifier, creating the ensemble. Such approaches have also proven to be very effective.

Divide and conquer: Regardless of the amount of available data, certain problems are just too difficult for a given classifier to solve. More specifically, the decision boundary that separates data from different classes may be too complex, or lie outside the space of functions that can be implemented by the chosen classifier model. Consider the two-dimensional, two-class problem with a complex decision boundary depicted in Figure 1. A linear classifier, one that is capable of learning linear boundaries, cannot learn this complex non-linear boundary. However, an appropriate combination of an ensemble of such linear classifiers can learn this (or any other, for that matter) non-linear boundary.

As an example, assume that we have access to a classifier model that can generate elliptic or circular boundaries. Such a classifier cannot learn the boundary shown in Figure 1. Now consider a collection of circular decision boundaries generated by an ensemble of such classifiers, as shown in Figure 2, where each classifier labels the data as class 1 (O) or class 2 (X) based on whether the instances fall within or outside of its boundary. A decision based on the majority vote of a sufficient number of such classifiers can easily learn this complex non-circular boundary. In a sense, the classification system follows a divide-and-conquer approach: it divides the data space into smaller and easier-to-learn partitions, where each classifier learns only one of the simpler partitions. The underlying complex decision boundary can then be approximated by an appropriate combination of the different classifiers.

Data fusion: If we have several sets of data obtained from various sources, where the natures of the features are different (heterogeneous features), a single classifier cannot be used to learn the information contained in all of the data. In diagnosing a neurological disorder, for example, the neurologist may order several tests, such as an MRI scan, an EEG recording, blood tests, etc. Each test generates data with a different number and type of features, which cannot be used collectively to train a single classifier. In such cases, data from each testing modality can be used to train a different classifier, whose outputs can then be combined. Applications in which data from different sources are combined to make a more informed decision are referred to as data fusion applications, and ensemble-based approaches have successfully been used for them.

There are many other scenarios in which ensemble-based systems can be beneficial, but discussing them requires a deeper understanding of how, why, and when ensemble systems work. All ensemble systems must have two key components: an algorithm to generate the individual classifiers, and a rule to combine their outputs.

[Figure 1. Complex decision boundary that cannot be learned by linear or circular classifiers. The plot shows training data examples for class 1 (O) and class 2 (X) over Observation/Measurement/Feature 1 and 2, together with the complex decision boundary to be learned.]

[Figure 2. Ensemble of classifiers spanning the decision space, showing the decision boundaries generated by the individual classifiers.]


Diversity

1 Strategy of ensemble systems: create many classifiers and combine their outputs in such a way that the combination improves upon the performance of a single classifier.

2 Requirement: the individual classifiers must make errors on different inputs.

3 If the errors are different, then a strategic combination of the classifiers can reduce the total error.

4 Solution: we need classifiers whose decision boundaries are adequately different from those of the others. Such a set of classifiers is said to be diverse.

5 Classifier diversity can be obtained by

using different training datasets for different classifiers,
using unstable classifiers,
using different training parameters (such as different topologies for neural networks), or
using different feature sets (as in the random subspace method).

6 Reference
G. Brown, J. Wyatt, R. Harris, and X. Yao, "Diversity creation methods: a survey and categorisation", Information Fusion, Vol. 6, pp. 5-20, 2005.


Classifier diversity using different training sets

The overarching principle in ensemble systems is to make each classifier as unique as possible, particularly with respect to misclassified instances. Specifically, we need classifiers whose decision boundaries are adequately different from those of others. Such a set of classifiers is said to be diverse.

Classifier diversity can be achieved in several ways. The most popular method is to use different training datasets to train individual classifiers. Such datasets are often obtained through resampling techniques, such as bootstrapping or bagging, where training data subsets are drawn randomly, usually with replacement, from the entire training data. This is illustrated in Figure 3, where random and overlapping training data subsets are selected to train three classifiers, which then form three different decision boundaries. These boundaries are combined to obtain a more accurate classification.

To ensure that the individual boundaries are adequately different, despite using substantially similar training data, unstable classifiers are used as base models, since they can generate sufficiently different decision boundaries even for small perturbations in their training parameters. If the training data subsets are drawn without replacement, the procedure is also called jackknife or k-fold data split: the entire dataset is split into k blocks, and each classifier is trained on only k-1 of them. A different subset of k blocks is selected for each classifier, as shown in Figure 4.

Another approach to achieving diversity is to use different training parameters for different classifiers. For example, a series of multilayer perceptron (MLP) neural networks can be trained using different weight initializations, numbers of layers/nodes, error goals, etc. Adjusting such parameters allows one to control the instability of the individual classifiers, and hence contributes to their diversity. The ability to control the instability of neural network and decision tree classifiers makes them suitable candidates for use in an ensemble.

[Figure 3. Combining classifiers that are trained on different subsets of the training data: Classifier 1 → Decision Boundary 1, Classifier 2 → Decision Boundary 2, Classifier 3 → Decision Boundary 3, combined into the ensemble decision boundary (axes: Feature 1, Feature 2).]


Diversity measures

1 Pairwise measures (assuming that we have T classifiers)

We can calculate T(T − 1)/2 pairwise diversity measures.

                  | hj is correct | hj is incorrect
  hi is correct   |      a        |       b
  hi is incorrect |      c        |       d

For a team of T classifiers, the pairwise diversity measures d_ij are averaged over all pairs:

\bar{D} = \frac{2}{T(T-1)} \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} d_{ij}

2 Pairwise diversity measures

1 Correlation: diversity is measured as the correlation between the two classifiers' outputs,

\rho_{ij} = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}

Maximum diversity is obtained when the classifiers are uncorrelated, ρ = 0.

2 Q-statistic, defined as

Q_{ij} = \frac{ad - bc}{ad + bc}

Q is positive when the same instances are correctly classified by both classifiers, and negative otherwise. Maximum diversity is, once again, obtained for Q = 0.
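Both statistics follow directly from the contingency-table fractions; a sketch of computing them from the classifiers' correctness indicators (the data here are made up for illustration):

```python
import numpy as np

def pairwise_stats(correct_i, correct_j):
    """Contingency-table fractions (a, b, c, d), correlation rho, and
    Q-statistic for a pair of classifiers. correct_i[k] is True iff
    classifier h_i labels instance x_k correctly."""
    ci = np.asarray(correct_i, dtype=bool)
    cj = np.asarray(correct_j, dtype=bool)
    n = len(ci)
    a = np.sum(ci & cj) / n    # both correct
    b = np.sum(ci & ~cj) / n   # h_i correct, h_j incorrect
    c = np.sum(~ci & cj) / n   # h_i incorrect, h_j correct
    d = np.sum(~ci & ~cj) / n  # both incorrect
    rho = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    q = (a * d - b * c) / (a * d + b * c)
    return a, b, c, d, float(rho), float(q)

a, b, c, d, rho, q = pairwise_stats([1, 1, 1, 0, 0, 1, 0, 1],
                                    [1, 0, 1, 1, 0, 1, 1, 0])
print(round(rho, 3), round(q, 3))  # -> -0.067 -0.143
```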


Diversity measures

1 Pairwise measures (assuming that we have T classifiers)

We can calculate T(T − 1)/2 pairwise diversity measures and average them.

                  | hj is correct | hj is incorrect
  hi is correct   |      a        |       b
  hi is incorrect |      c        |       d

2 Pairwise diversity measures

1 Disagreement measure: the probability that the two classifiers disagree,

D_{ij} = b + c

The diversity increases with the disagreement value.

2 Double-fault measure: the probability that both classifiers are incorrect,

DF_{ij} = d

The diversity decreases as the double-fault value increases.
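Both quantities come straight from the contingency table; a small sketch on toy data:

```python
import numpy as np

def disagreement_and_double_fault(correct_i, correct_j):
    """Disagreement D_ij = b + c and double fault DF_ij = d, computed as
    fractions over the N instances."""
    ci = np.asarray(correct_i, dtype=bool)
    cj = np.asarray(correct_j, dtype=bool)
    n = len(ci)
    disagreement = np.sum(ci != cj) / n   # b + c: exactly one is correct
    double_fault = np.sum(~ci & ~cj) / n  # d: both are incorrect
    return float(disagreement), float(double_fault)

print(disagreement_and_double_fault([1, 1, 0, 0, 1],
                                    [1, 0, 1, 0, 0]))  # -> (0.6, 0.2)
```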


Diversity measures

1 Non-pairwise measures (assuming that we have T classifiers)

1 Entropy measure: assumes that diversity is highest if half of the classifiers are correct and the remaining ones are incorrect,

E = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T - \lceil T/2 \rceil} \min\{\xi_i, T - \xi_i\}

where ξ_i is the number of classifiers that misclassify instance x_i. Entropy varies between 0 and 1, where 1 indicates the highest diversity.

2 Kohavi–Wolpert variance,

KW = \frac{1}{N T^2} \sum_{i=1}^{N} \xi_i (T - \xi_i)

The Kohavi–Wolpert variance follows a similar approach to the disagreement measure.

3 Measure of difficulty,

\theta = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^2

where z_i ∈ {0, 1/T, 2/T, . . . , 1} is the fraction of classifiers that misclassify x_i and z̄ is the mean of the z_i.
How does the measure of difficulty show diversity?
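The non-pairwise measures can all be computed from a single matrix of correctness indicators; a sketch (the matrix is a toy example, not from the slides):

```python
import numpy as np

def nonpairwise_measures(correct):
    """Entropy E and Kohavi-Wolpert variance KW from a (T, N) boolean
    matrix: correct[t, i] is True iff classifier t labels x_i correctly."""
    correct = np.asarray(correct, dtype=bool)
    T, N = correct.shape
    xi = np.sum(~correct, axis=0)  # xi[i]: number of classifiers wrong on x_i
    E = np.mean(np.minimum(xi, T - xi) / (T - np.ceil(T / 2)))
    KW = np.sum(xi * (T - xi)) / (N * T ** 2)
    return float(E), float(KW)

# T = 3 classifiers, N = 4 instances.
E, KW = nonpairwise_measures([[1, 1, 1, 0],
                              [1, 0, 1, 0],
                              [1, 0, 0, 0]])
print(E)  # -> 0.5
```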


Diversity measures

Comparison of different diversity measures


Reference
L. I. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy", Machine Learning, Vol. 51, pp. 181-207, 2003.


Design of ensemble systems

Two key components of an ensemble system

1 Creating an ensemble of weak learners.

1 Bagging
2 Boosting
3 Stacked generalization
4 Mixture of experts

2 Combining the classifiers' outputs (trainable vs. fixed rule).

1 Majority voting
2 Weighted majority voting
3 Averaging
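Weighted majority voting generalizes plain voting: each classifier's vote counts with a weight, typically reflecting its estimated accuracy. A sketch with made-up weights:

```python
import numpy as np

def weighted_majority_vote(predictions, weights):
    """predictions: (T, N) integer labels; weights: length-T nonnegative
    weights, one per classifier. Returns, for each instance, the label
    that received the largest total weight."""
    predictions = np.asarray(predictions)
    weights = np.asarray(weights, dtype=float)
    T, N = predictions.shape
    scores = np.zeros((predictions.max() + 1, N))
    for t in range(T):
        # Add classifier t's weight to the score of the class it voted for.
        scores[predictions[t], np.arange(N)] += weights[t]
    return scores.argmax(axis=0)

preds = [[0, 1], [1, 1], [1, 0]]
print(weighted_majority_vote(preds, [0.6, 0.3, 0.2]))  # -> [0 1]
```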

What is a weak learner?

Definition (Weak learner)

A weak learner is a learner that is only guaranteed to perform slightly better than random guessing.


Design of ensemble systems

In ensemble learning, a rule is needed to combine the outputs of the classifiers.

1 Classifier selection

1 Each classifier is trained to become an expert in some local area of the feature space.
2 The combination of the classifiers is based on the given feature vector.
3 The classifier that was trained on the data closest to the vicinity of the feature vector is given the highest credit.
4 One or more local classifiers can be nominated to make the decision.

2 Classifier fusion

1 Each classifier is trained over the entire feature space.
2 Classifier combination involves merging the individual weak classifiers to obtain a single strong classifier.


Bagging

Bootstrap aggregating (bagging)

1 Create T bootstrap samples S[1], S[2], . . . , S[T].
2 Train a distinct inducer on each S[t] to produce T classifiers.
3 Classify a new instance by classifier vote (majority vote).

Application of bootstrap sampling

1 Given a set S containing N training examples.
2 Create S[t] by drawing N examples at random with replacement from S.
3 S[t] has size N and is expected to leave out about 36.8% (≈ 1/e) of the distinct examples of S. (Show it.)
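For the "(show it)": the probability that a given example is never drawn in N draws with replacement is (1 − 1/N)^N → e^{−1} ≈ 0.368, so each bootstrap sample is expected to contain about 63.2% of the distinct examples. A quick simulation (the constants are arbitrary) agrees:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1000, 200
left_out = []
for _ in range(trials):
    sample = rng.integers(0, N, size=N)  # bootstrap: N draws with replacement
    left_out.append(1 - len(np.unique(sample)) / N)

print(round(np.mean(left_out), 2))  # -> 0.37, close to 1/e ≈ 0.368
```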

Variations

1 Random forests
Can be created from decision trees whose certain parameters vary randomly.

2 Pasting small votes (for large datasets)
1 RVotes: creates the datasets randomly.
2 IVotes: creates the datasets based on the importance of instances, from easy to hard.


Boosting

Schapire proved that a weak learner can be turned into a strong learner that generates a classifier that can correctly classify all but an arbitrarily small fraction of the instances.

In boosting, the training data are ordered from easy to hard: easy samples are classified first, and hard samples are classified later.

Boosting algorithm

1 Create the first classifier in the same way as in bagging.
2 Train the second classifier on a training set only half of which is correctly classified by the first classifier, the other half being misclassified.
3 Train the third classifier on the data on which the first two classifiers disagree.
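The AdaBoost family replaces this filtering scheme with a weight distribution over the training examples. A minimal two-class sketch (labels in {−1, +1}, decision stumps as weak learners; all names here are illustrative):

```python
import numpy as np

def train_stump(X, y, w):
    """Exhaustively pick the decision stump (feature, threshold, polarity)
    with the smallest weighted error. y is in {-1, +1}."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - thr) > 0, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(X, f, thr, pol):
    return np.where(pol * (X[:, f] - thr) > 0, 1, -1)

def adaboost(X, y, T=10):
    n = len(y)
    w = np.full(n, 1 / n)        # uniform initial weights
    ensemble = []
    for _ in range(T):
        err, f, thr, pol = train_stump(X, y, w)
        if err >= 0.5:           # weak-learning assumption violated
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        pred = stump_predict(X, f, thr, pol)
        w *= np.exp(-alpha * y * pred)  # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, f, thr, pol))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(alpha * stump_predict(X, f, thr, pol)
                for alpha, f, thr, pol in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
ensemble = adaboost(X, y, T=5)
print(adaboost_predict(ensemble, X))  # reproduces y on the training set
```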

Variations

1 AdaBoost.M1
2 AdaBoost.R

Reference
Robert E. Schapire, "The strength of weak learnability", Machine Learning, Vol. 5, pp. 197-227, 1990.


Stacked Generalization
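In stacked generalization, the outputs of the first-tier classifiers are used as the input features of a second-tier (meta) learner, which learns how to combine them. A minimal sketch (the linear least-squares meta-learner here is an illustrative choice, not prescribed by the slides):

```python
import numpy as np

def stacked_fit(tier1_preds, y):
    """Fit a linear meta-learner on tier-1 outputs via least squares.

    tier1_preds: (N, T) matrix -- column t holds classifier t's output
    (e.g. its predicted label in {-1, +1}) for each training instance,
    ideally produced out-of-fold so the meta-learner does not overfit.
    """
    A = np.column_stack([tier1_preds, np.ones(len(y))])  # add a bias column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def stacked_predict(coef, tier1_preds):
    A = np.column_stack([tier1_preds, np.ones(len(tier1_preds))])
    return np.sign(A @ coef)

# One accurate and one noisy tier-1 classifier (toy data).
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
tier1 = np.column_stack([y, [1.0, 1.0, -1.0, -1.0, 1.0, 1.0]])
coef = stacked_fit(tier1, y)
print(stacked_predict(coef, tier1))  # recovers y
```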


Mixture Models

1 Train multiple learners

1 Each uses a subsample of S.
2 Each may be an ANN, a decision tree, etc.

2 The gating network is usually a neural network.

[Figure: mixture-of-experts architecture. The input x feeds a gating network, which outputs weights g1 and g2, and two expert networks, which output y1 and y2; the expert outputs are combined by a weighted sum Σ.]
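In a mixture of experts, the gating network produces input-dependent weights g_i(x) (typically via a softmax), and the ensemble output is the weighted sum of the expert outputs. A minimal sketch of the combination step (the experts and gating weights are toy stand-ins):

```python
import numpy as np

def moe_output(x, experts, gating_weights):
    """Mixture-of-experts output y(x) = sum_i g_i(x) * y_i(x), where the
    gating outputs g_i(x) form a softmax distribution over the experts."""
    logits = np.array([w @ x for w in gating_weights])
    g = np.exp(logits - logits.max())
    g /= g.sum()                                # softmax gate
    outputs = np.array([expert(x) for expert in experts])
    return g @ outputs

# Two constant toy experts and a linear gating function.
experts = [lambda x: 1.0, lambda x: -1.0]
gating_weights = [np.array([2.0]), np.array([0.0])]
print(moe_output(np.array([1.0]), experts, gating_weights))
```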


Mixture Models

Cascading: learners are applied in order of increasing complexity.


References

1 Robi Polikar, "Ensemble based systems in decision making", IEEE Circuits and Systems Magazine, Vol. 6, No. 3, pp. 21-45, 2006.

2 T. G. Dietterich, "Machine Learning Research: Four Current Directions", AI Magazine, Vol. 18, No. 4, pp. 97-136, 1997.

3 T. G. Dietterich, "Ensemble Methods in Machine Learning", Lecture Notes in Computer Science, Vol. 1857, pp. 1-15, 2000.

4 Ron Meir and Gunnar Ratsch, "An Introduction to Boosting and Leveraging", Lecture Notes in Computer Science, Vol. 2600, pp. 118-183, 2003.

5 David Opitz and Richard Maclin, "Popular Ensemble Methods: An Empirical Study", Journal of Artificial Intelligence Research, pp. 169-198, 1999.

6 L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, New York, NY, 2005.
