Using Meta Learning to Initialize Bayesian Optimization · !Castuse experience intoanalgorithm. ......

Using Meta Learning to InitializeBayesian Optimization

Albert-Ludwigs-Universität Freiburg

Matthias Feurer1 Jost Tobias Springenberg2 Frank Hutter11Research Group on Learning, Optimization, and Automated Algorithm Design2Machine Learning LabDepartment of Computer Science, University of Freiburg, GermanyECAI-2014 Workshop on Meta-learning & Algorithm Selection, 19 August 2014

Your task: Build an Iris classification system

Choose an algorithmbased on datasetcharacteristics, e.g. forthe Iris dataset this couldbe an SVMManual tuning-> fiddling withhyperparameters.Better: Use automatedmethods like PSO, GA orSMBOBest: AutoWeka

MetaSel ’14 Feurer, Springenberg and Hutter – MI-SMBO 2 / 18

The iris pictures on this slide are from wikimedia commons and used under the following licenses: Top left: IrisVersicolor is public domain; Bottom left: Iris setosa is licensed by Radomil under CC BY-SA 3.0; Top right: IrisVirginica is licensed by C T Johansson under CC BY 3.0.


Choose an algorithmbased on datasetcharacteristics, e.g. forthe Iris dataset this couldbe an SVM

Manual tuning-> fiddling withhyperparameters.Better: Use automatedmethods like PSO, GA orSMBOBest: AutoWeka




Choose an algorithmbased on datasetcharacteristics, e.g. forthe Iris dataset this couldbe an SVMManual tuning-> fiddling withhyperparameters.

Better: Use automatedmethods like PSO, GA orSMBOBest: AutoWeka




Choose an algorithmbased on datasetcharacteristics, e.g. forthe Iris dataset this couldbe an SVMManual tuning-> fiddling withhyperparameters.Better: Use automatedmethods like PSO, GA orSMBO

Best: AutoWeka




Choose an algorithmbased on datasetcharacteristics, e.g. forthe Iris dataset this couldbe an SVMManual tuning-> fiddling withhyperparameters.Better: Use automatedmethods like PSO, GA orSMBOBest: AutoWeka



Adding the Iris Japonica to the dataset

Manual tuning:Use experience and startfrom the parametersfound on the Iris datasetAutomated methods-> start from scratch→ Cast use experienceinto an algorithm.


The iris pictures on this slide are from wikimedia commons and used under the following licenses: Top left: IrisVersicolor is public domain; Bottom left: Iris setosa is licensed by Radomil under CC BY-SA 3.0; Top right: IrisVirginica is licensed by C T Johansson under CC BY 3.0; Bottom right: Iris Japonica is licensed by KENPEI underCC BY-SA 3.0


Manual tuning:Use experience and startfrom the parametersfound on the Iris dataset

Automated methods-> start from scratch→ Cast use experienceinto an algorithm.




Manual tuning:Use experience and startfrom the parametersfound on the Iris datasetAutomated methods-> start from scratch

→ Cast use experienceinto an algorithm.




Manual tuning:Use experience and startfrom the parametersfound on the Iris datasetAutomated methods-> start from scratch→ Cast use experienceinto an algorithm.



Sequential Model-based Bayesian Optimization(SMBO)

ML Algorithm A

ConfigurationSpace Λ of A

Dataset D



Configuration Task

ML Algorithm A


Dataset D

Configuration λ ∗



Configuration Task

ML Algorithm A


Dataset D


Fit regressionmodel on

(λ ,Aλ (D)) pairs

Select promisingconfiguration

λ ∈ Λ

Evaluate Aλ (D)


Metalearning-Initialized SMBO (MI-SMBO)

Configuration Task

ML Algorithm A


Dataset Dnew

Fit regressionmodel on pairs of

(λ ,Aλ (Dnew))


λ ∈ Λ

EvaluateAλ (Dnew)



Metalearning-Initialized SMBO (MI-SMBO)

Configuration Task

ML Algorithm A


Dataset Dnew

Fit regressionmodel on pairs of

(λ ,Aλ (Dnew))


λ ∈ Λ

EvaluateAλ (Dnew)


Find Datasets Disimilar to Dnew

Initialize Searchwith λ ∗Di


Metafeatures

# training examples: 150# classes: 3# features: 4# numerical features: 4# categorical features: 0missing values? No



Metalearning-Initialized Bayesian Optimization

For a new dataset Dnew :Sort known datasets D1:N by distance to Dnew .For each of these datasets, extract the best knownhyperparameter configuration λ ∗Di

.Initialize SMBO with the first k hyperparameterconfigurations from the sorted list.


Similarity of Datasets

wavef

orm

-500

0iri

s

pend

igits

yeas

t

satim

age

balanc

e-sc

ale

ecoli

cmc

page

-block

sta

egl

ass

nurs

ery

spam

base

vehi

cle

segm

ent

car

optd

igits

prim

ary-

tum

or

mfe

at-fo

urier

mfe

at-k

arhu

nen

mfe

at-zer

nike

abalon

e

vowel

mfe

at-fa

ctor

s

derm

atolog

y

lette

r

mfe

at-m

orph

olog

ical

mfe

at-p

ixel

euca

lypt

us

arrh

ythm

ia

zoo

audi

olog

y

auto

s

soyb

ean

cylin

der-b

ands

vote

iono

sphe

re

tic-ta

c-to

e

braz

iltou

rism

hepa

titisla

bor

hear

t-sta

tlog

brea

st-w

mus

hroo

m

hear

t-c

hear

t-h

post

oper

ative-

patie

nt-d

ata

lym

ph

brea

st-c

ance

r

diab

etesliv

er-d

isord

ers

kr-v

s-kp

cred

it-a

anne

al.O

RIG

cred

it-g

sona

r

habe

rman


Finding the nearest datasets (1)

waveform-5000

iris

pendigits

yeast

satimage

balance-scale

ecoli

cmc

page-blocks

tae

glass

nursery

spambase

vehicle

segment

car

optdigits

primary-tumor

mfeat-fourier

mfeat-karhunen

mfeat-zernike

abalone

vowel

mfeat-factors

dermatology

letter

mfeat-morphological

mfeat-pixel

eucalyptus

arrhythmia

zoo

audiology

autos

soybean

cylinder-bands

vote

ionosphere

tic-tac-toe

braziltourism

hepatitisla

bor

heart-statlog

breast-w

mushroom

heart-c

heart-h

postoperative-patient-data

lymph

breast-cancer

diabetesliv

er-disorders

kr-vs-kp

credit-a

anneal.ORIG

credit-g

sonar

haberman



waveform-5000

iris

pendigits

yeast

satimage

balance-scale

ecoli

cmc

page-blocks

tae

glass

nursery

spambase

vehicle

segment

car

optdigits

primary-tumor

mfeat-fourier

mfeat-karhunen

mfeat-zernike

abalone

vowel

mfeat-factors

dermatology

letter

mfeat-morphological

mfeat-pixel

eucalyptus

arrhythmia

zoo

audiology

autos

soybean

cylinder-bands

vote

ionosphere

tic-tac-toe

braziltourism

hepatitisla

bor

heart-statlog

breast-w

mushroom

heart-c

heart-h


lymph

breast-cancer

diabetesliv

er-disorders

kr-vs-kp

credit-a

anneal.ORIG

credit-g

sonar

haberman


Distance metric (1)

Commonly used in literature, the L1 norm:

d(Dnew,Dj) = ∑i|mnew

i −mji | (1)


Experimental Setup

57 datasets from the OpenML repository

46 metafeatures from the literature:Split into five different subsets, including landmarking[Pfahringer et al. 2000]

Two case studiesSupport Vector Machine with MI-Spearmint [Snoek et al. 2012]AutoSklearn with MI-SMAC [Hutter et al. 2011]

Tried 5, 10, 20 and 25 initial configurationsran each instantiation 10 times on each dataset→ 26220 optimization runstherefore, precomputed a dense grid for every dataset


Experimental Setup

57 datasets from the OpenML repository46 metafeatures from the literature:

Split into five different subsets, including landmarking[Pfahringer et al. 2000]




Experimental Setup




Tried 5, 10, 20 and 25 initial configurations

ran each instantiation 10 times on each dataset→ 26220 optimization runstherefore, precomputed a dense grid for every dataset


Experimental Setup




Tried 5, 10, 20 and 25 initial configurationsran each instantiation 10 times on each dataset→ 26220 optimization runs

therefore, precomputed a dense grid for every dataset


Experimental Setup






Combined Algorithm Selection andHyperparameter Optimization problem (CASH)

Max Features

SVM

gamma C(SVM)

Classifier

Random Forest LinearSVM

Criterion Min Samples Split loss C(LinearSVM)

[Auto-WEKA, Thornton et al. 2013]


AutoSklearn: Hyperparameters

Component Hyperparameter # Values

Main λclassifier 3Main preprocessing 2SVM log2(C) 21SVM log2(γ) 19LinearSVM log2(C) 21LinearSVM penalty 2RF min splits 5RF max features 10RF criterion 2PCA variance to keep 2

1623 hyperparameter configurations


AutoSklearn: Results (1)

0 10 20 30 40 50#Function evaluations

0.00

0.01

0.02

0.03

0.04

0.05

0.06

0.07

Diffe

rence

to m

in f

unct

ion v

alu

e

SMAC




0.00

0.02

0.04

0.06

0.08

0.10

Diffe

rence

to m

in f

unct

ion v

alu

e

SMACrandomTPEMI-SMAC(10,L1 ,landmarking)



0 10 20 30 40 501.8

2.0

2.2

2.4

2.6

2.8

3.0

SMACrandom

TPE MI-SMAC(10,L1 ,landmarking)



0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7


0.7

0.6

0.5

0.4

0.3

0.2

0.1

0.0

MI-SMAC(10,L1 ,landmarking) vs SMAC



0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7


0.7

0.6

0.5

0.4

0.3

0.2

0.1

0.0

MI-SMAC(10,L1 ,landmarking) vs SMAC MI-SMAC(10,L1 ,landmarking) vs TPE



0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7


0.7

0.6

0.5

0.4

0.3

0.2

0.1

0.0

MI-SMAC(10,L1 ,landmarking) vs SMAC

MI-SMAC(10,L1 ,landmarking) vs TPE

MI-SMAC(10,L1 ,landmarking) vs random


Open questions

Does MI-SMBO scale to larger configuration spaces?What if gridsearch is too expensive?Can the metalearning component be added directly intothe SMBO procedure?


Take home messages

SMBO can be substantially improved by providing goodinitial configurations.Metalearning provides a sound framework to find theseconfigurations.MI-SMAC improves on state-of-the-art methods on a largeconfiguration space, namely AutoSklearn.


The end

Thank you for your attention.

Further questions: [email protected]

This presentation was partially supported by an ECCAI TravelAward and the ECCAI sponsors.


[email protected]



0.00

0.02

0.04

0.06

0.08

0.10

Min

funct

ion v

alu

e

SMACrandomTPEMI-SMAC(10,L1 ,landmarking)



0.0

0.1

0.2

0.3

0.4

0.5

0.6

0 10 20 30 40 50

Function evaluations

−0.6

−0.5

−0.4

−0.3

−0.2

−0.1

0.0

MI-SMAC(10,L1,all) vs MI-SMAC(10,L1,landmarking)MI-SMAC(10,L1,all) vs SMAC

MI-SMAC(10,L1,all) vs TPEMI-SMAC(10,L1,all) vs random



0.0

0.1

0.2

0.3

0.4

0.5

0.6


0.6

0.5

0.4

0.3

0.2

0.1

0.0

SMAC vs MI-SMAC(10,L1 ,landmarking)

SMAC vs TPE

SMAC vs random



0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0 10 20 30 40 50

Function evaluations

−0.7

−0.6

−0.5

−0.4

−0.3

−0.2

−0.1

0.0

MI-SMAC(25,L1,landmarking) vs MI-SMAC(10,L1,landmarking)MI-SMAC(25,L1,landmarking) vs MI-SMAC(20,L1,landmarking)MI-SMAC(25,L1,landmarking) vs MI-SMAC(5,L1,landmarking)

MI-SMAC(25,L1,landmarking) vs SMACMI-SMAC(25,L1,landmarking) vs TPEMI-SMAC(25,L1,landmarking) vs random


Using Meta Learning to Initialize Bayesian Optimization · !Castuse experience intoanalgorithm. ......

Documents

Transcript of Using Meta Learning to Initialize Bayesian Optimization · !Castuse experience intoanalgorithm. ......