Lecture 6: Ensemble Methods


Description

What is an "ensemble learner"? How can we combine different base learners into an ensemble in order to improve the overall classification performance? In this lecture, we provide some answers to these questions.

Transcript of Lecture 6: Ensemble Methods

Page 1: Lecture 6: Ensemble Methods

Lecture 6: Ensemble Methods

October 2013

Machine Learning for Language Technology

Marina Santini, Uppsala University

Department of Linguistics and Philology

Page 2: Lecture 6: Ensemble Methods


Where we are…

Previous lectures covered various learning methods: decision trees, nearest neighbors, linear classifiers, structured prediction.

This lecture: how to combine classifiers.

Page 3: Lecture 6: Ensemble Methods


Combining Multiple Learners

Thanks to E. Alpaydin and Oscar Täckström

Page 4: Lecture 6: Ensemble Methods


Wisdom of the Crowd: guess the weight of an ox. The average of people's guesses is close to the true weight, and better than most individual members' guesses and the cattle experts' guesses. Intuitively, this is the law of large numbers…

Page 5: Lecture 6: Ensemble Methods


Definition: An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples (Dietterich, 2000).

Page 6: Lecture 6: Ensemble Methods


Diversity vs accuracy

For an ensemble of classifiers to be more accurate than any of its individual members, the individual classifiers composing it must be accurate and diverse:

An accurate classifier is one whose error rate on new examples is better than random guessing.

Two classifiers are diverse if they make different errors on new data points.

Page 7: Lecture 6: Ensemble Methods


Why it can be a good idea to build an ensemble

It is possible to build good ensembles for three fundamental reasons (Dietterich, 2000):

1. Statistical: when there is little data.

2. Computational: when there is enough data, but local search produces local optima.

3. Representational: when the true function f cannot be represented by any single hypothesis in H (weighted sums of hypotheses drawn from H might expand the space of representable functions).

Page 8: Lecture 6: Ensemble Methods


Distinctions

Base learner: an arbitrary learning algorithm which could be used on its own.

Ensemble: a learning algorithm composed of a set of base learners. The base learners may be organized in some structure.

However, the distinction is not completely clear-cut: e.g. a linear classifier is a combination of multiple simple learners, in the sense that each dimension is a simple predictor…

Page 9: Lecture 6: Ensemble Methods


The main purpose of an ensemble: maximising individual accuracy and diversity.

Different learners use different: algorithms, hyperparameters, representations (modalities/views), training sets, subproblems.
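As a concrete illustration of combining base learners that differ in the algorithm used, here is a minimal sketch with scikit-learn's VotingClassifier; the particular base learners and the synthetic dataset are assumptions made for the example, not part of the lecture.

```python
# Minimal sketch: an ensemble whose base learners differ in the algorithm used.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3)),   # decision tree
        ("knn", KNeighborsClassifier(n_neighbors=5)),    # nearest neighbors
        ("linear", LogisticRegression(max_iter=1000)),   # linear classifier
    ],
    voting="hard",  # majority vote over the base learners' predictions
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```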

Page 10: Lecture 6: Ensemble Methods


Practical Example

Page 11: Lecture 6: Ensemble Methods


Rationale


No Free Lunch Theorem: There is no algorithm that is always the most accurate in all situations.

Generate a group of base learners which, when combined, achieves higher accuracy.

Page 12: Lecture 6: Ensemble Methods


Methods for Constructing Ensembles

Page 13: Lecture 6: Ensemble Methods


Approaches…

How do we generate base learners that complement each other?

How do we combine the outputs of the base learners for maximum accuracy?

Examples: Voting, Bootstrap Resampling, Bagging, Boosting, AdaBoost, Stacking, Cascading.

Page 14: Lecture 6: Ensemble Methods


Voting

Linear combination
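The slide originally shows the linear combination used in voting; a minimal reconstruction in the standard notation (an assumption), where d_ji is base learner j's output for class i and the weights w_j are non-negative and normalized:

```latex
y_i = \sum_{j=1}^{L} w_j \, d_{ji},
\qquad w_j \ge 0, \quad \sum_{j=1}^{L} w_j = 1
```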


Page 15: Lecture 6: Ensemble Methods


Fixed Combination Rules
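The table of fixed rules on this slide did not survive extraction; as a hedged reconstruction, the standard set of fixed rules for combining the outputs d_ji (following Alpaydin's textbook treatment) is:

```latex
\text{Sum: } y_i = \tfrac{1}{L}\sum_{j} d_{ji} \qquad
\text{Median: } y_i = \operatorname{median}_j d_{ji} \qquad
\text{Minimum: } y_i = \min_j d_{ji}

\text{Maximum: } y_i = \max_j d_{ji} \qquad
\text{Product: } y_i = \prod_j d_{ji}
```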


Page 16: Lecture 6: Ensemble Methods


Bootstrap Resampling (Daumé, 2012: 150)
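A minimal sketch of bootstrap resampling, i.e. drawing a training set of size N with replacement from the original N instances; the NumPy-based implementation is an assumption, not code from the lecture or from Daumé.

```python
# Bootstrap resampling: sample N indices with replacement from a dataset of size N.
import numpy as np

def bootstrap_sample(X, y, rng=np.random.default_rng(0)):
    idx = rng.integers(0, len(X), size=len(X))  # indices drawn with replacement
    return X[idx], y[idx]                       # roughly 63% unique instances on average
```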

Page 17: Lecture 6: Ensemble Methods


Bagging (bootstrap+aggregating)


Use bootstrapping to generate L training sets. Train L base learners using an unstable learning procedure. During testing, take the average.

In bagging, generating complementary base learners is left to chance and to the instability of the learning method.

**Unstable algorithm: an algorithm is unstable when a small change in the training set causes a large difference in the base learners.
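A minimal sketch of bagging as described above, assuming decision trees as the unstable base learner and non-negative integer class labels; this is an illustration, not the lecture's own code.

```python
# Bagging: train L base learners on L bootstrap samples, then take a majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, L=10, rng=np.random.default_rng(0)):
    learners = []
    for _ in range(L):
        idx = rng.integers(0, len(X), size=len(X))           # one bootstrap sample
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    votes = np.stack([clf.predict(X) for clf in learners])   # shape (L, n_samples)
    # For classification, "averaging" amounts to a majority vote per instance
    # (assumes non-negative integer class labels).
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```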

Page 18: Lecture 6: Ensemble Methods


Boosting: Weak learner vs Strong learner

In boosting, we actively try to generate complementary base-learners by training the next learner on the mistakes of the previous learners.

The original boosting algorithm (Schapire 1990) combines three weak learners to generate a strong learner.

A weak learner has error probability less than 1/2, which makes it better than random guessing on a two-class problem.

A strong learner has arbitrarily small error probability.

Page 19: Lecture 6: Ensemble Methods


Boosting (ii) [Alpaydin, 2012: 431]

Given a large training set, we randomly divide it into three sets X1, X2, X3. We use X1 to train d1.

We then take X2 and feed it to d1. All the instances misclassified by d1, together with as many instances from X2 on which d1 is correct, form the training set of d2.

We then take X3 and feed it to d1 and d2. The instances on which d1 and d2 disagree form the training set of d3.

During testing, given an instance, we give it to d1 and d2; if they agree, that is the response; otherwise the response of d3 is taken as the output.
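A minimal sketch of the three-learner scheme just described, with decision stumps as the weak learners (an assumption); edge cases such as empty splits are not handled.

```python
# Original boosting (Schapire, 1990) as described above: three learners d1, d2, d3.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosting_train(X, y, rng=np.random.default_rng(0)):
    # Randomly divide the training set into three parts X1, X2, X3.
    parts = np.array_split(rng.permutation(len(X)), 3)
    (X1, y1), (X2, y2), (X3, y3) = [(X[p], y[p]) for p in parts]

    d1 = DecisionTreeClassifier(max_depth=1).fit(X1, y1)

    # d2 is trained on the instances of X2 misclassified by d1,
    # plus an equal number of instances that d1 classified correctly.
    correct = d1.predict(X2) == y2
    wrong = np.where(~correct)[0]
    right = np.where(correct)[0][:len(wrong)]
    sel = np.concatenate([wrong, right])
    d2 = DecisionTreeClassifier(max_depth=1).fit(X2[sel], y2[sel])

    # d3 is trained on the instances of X3 on which d1 and d2 disagree.
    disagree = d1.predict(X3) != d2.predict(X3)
    d3 = DecisionTreeClassifier(max_depth=1).fit(X3[disagree], y3[disagree])
    return d1, d2, d3

def boosting_predict(d1, d2, d3, X):
    # If d1 and d2 agree, that is the response; otherwise d3 decides.
    p1, p2, p3 = d1.predict(X), d2.predict(X), d3.predict(X)
    return np.where(p1 == p2, p1, p3)
```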

Page 20: Lecture 6: Ensemble Methods


Boosting: drawback

Though it is quite successful, the disadvantage of the original boosting method is that it requires a very large training sample.

Page 21: Lecture 6: Ensemble Methods


AdaBoost (adaptive boosting)

Uses the same training set over and over, and thus the training set need not be large.

Classifiers must be simple so they do not overfit.

Can combine an arbitrary number of base learners, not only three.

Page 22: Lecture 6: Ensemble Methods


AdaBoost


Generate a sequence of base learners, each focusing on the previous one's errors.

The probability of a correctly classified instance is decreased, and the probability of a misclassified instance is increased. This has the effect that the next classifier focuses more on instances misclassified by the previous classifier.

[Alpaydin, 2012: 432-433]
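A minimal sketch of the discrete AdaBoost weight-update loop described above, using decision stumps as the simple base learners and labels in {-1, +1}; both choices are assumptions made for the illustration.

```python
# AdaBoost: reweight instances after each round so that the next learner
# focuses on the instances the previous one misclassified.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_rounds=10):
    n = len(X)
    w = np.full(n, 1.0 / n)                       # instance probabilities (uniform)
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])                # weighted training error
        if err >= 0.5:                            # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        # Decrease the weight of correctly classified instances,
        # increase the weight of misclassified ones, then renormalize.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # Weighted vote: the weights reflect each learner's accuracy on the training set.
    votes = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(votes)
```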

Page 23: Lecture 6: Ensemble Methods


AdaBoost: Testing

Given an instance, all the classifiers decide and a weighted vote is taken. The weights are proportional to the base learners' accuracies on the training set.

Improved accuracy: the success of AdaBoost is due to its property of increasing the margin. If the margin increases, the training instances are better separated and errors are less likely. (This aim is similar to SVMs.)

Page 24: Lecture 6: Ensemble Methods


Stacking (i)


In stacked generalization, the combiner f( ) is another learner and is not restricted to being a linear combination as in voting.

Page 25: Lecture 6: Ensemble Methods


Stacking (ii)

The combiner system should learn how the base learners make errors. Stacking is a means of estimating and correcting for the biases of the base learners.

Therefore, the combiner should be trained on data unused in training the base learners.
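A minimal sketch of stacking with scikit-learn's StackingClassifier, whose internal cross-validation trains the combiner on predictions for data not used to fit the base learners; the particular learners and the synthetic dataset are assumptions.

```python
# Stacking: the combiner f() is itself a trained learner, fitted on out-of-fold
# predictions of the base learners (i.e. on data unused in their training).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),  # the combiner f()
    cv=5,  # 5-fold CV: the combiner sees predictions on held-out folds only
)
stack.fit(X, y)
```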

Page 26: Lecture 6: Ensemble Methods


Cascading


Use dj only if preceding ones are not confident

Cascade learners in order of complexity

Page 27: Lecture 6: Ensemble Methods


Cascading

Cascading is a multistage method: we use dj only if all preceding learners are not confident. Associated with each learner is a confidence wj such that we say dj is confident of its output, and can be used, if wj > θj (the threshold).

Not confident: misclassifications, as well as instances for which the posterior is not high enough, are passed on to the next learner.

Important: the idea is that an early simple classifier handles the majority of instances, and a more complex classifier is used only for a small percentage, so it does not significantly increase the overall complexity.
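A minimal sketch of a two-stage cascade along the lines just described; the particular classifiers, the 0.9 confidence threshold, and the use of posterior probabilities as the confidence wj are illustrative assumptions.

```python
# Cascading: a cheap classifier handles the instances it is confident about;
# only the remaining instances are passed to a more complex classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
simple = LogisticRegression(max_iter=1000).fit(X, y)             # d1: simple, cheap
complex_clf = RandomForestClassifier(random_state=0).fit(X, y)   # d2: more complex

def cascade_predict(X_new, threshold=0.9):
    proba = simple.predict_proba(X_new)
    confident = proba.max(axis=1) > threshold      # w1 > theta1
    pred = simple.classes_[proba.argmax(axis=1)]
    if (~confident).any():
        # Instances the simple classifier is not confident about go to d2.
        pred[~confident] = complex_clf.predict(X_new[~confident])
    return pred

print(cascade_predict(X[:10]))
```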

Page 28: Lecture 6: Ensemble Methods


Summary

It is often a good idea to combine several learning methods.

We want diverse classifiers, so that their errors cancel out.

However, remember: ensemble methods do not get a free lunch…

Page 29: Lecture 6: Ensemble Methods


Example

In the case of arc-factored graph-based parsing, we relied on finding a spanning tree over a dense graph over the input.

A dense graph is a graph that contains all possible arcs (between wordforms).

A spanning tree is a tree that has an incoming arc for each word.

Page 30: Lecture 6: Ensemble Methods


Example: Ensemble MST Dependency Parsing

Page 31: Lecture 6: Ensemble Methods


Conclusions

Combining multiple learners has been a popular topic in machine learning since the early 1990s, and research has been going on ever since.

Recently, it has been noticed that ensembles do not always improve accuracy, and research has started to focus on the criteria that a good ensemble should satisfy and on how to form a good one.

Page 32: Lecture 6: Ensemble Methods


Reading: Dietterich (2000); Alpaydin (2010): Ch. 17; Daumé (2012): Ch. 11.

Page 33: Lecture 6: Ensemble Methods


Thanks for your attention!