
Exposé: An ontology for machine learning experimentation

Joaquin Vanschoren, K.U.Leuven (Belgium), U. Leiden (The Netherlands); Larisa Soldatova, University of Aberystwyth (UK)

DM Ontology Jamboree 2010

Overview

• Ontology lessons
• Exposé ontology
• Use cases

Ontology lessons: What did we learn from other ontologies?

Ontology design

• Start from accepted classes & properties (top-level ontologies, e.g. OBI, RO)

• If possible, reuse prior ontologies to build on their knowledge/consensus

• Use ontology design patterns: reusable patterns for recurrent problems

• http://ontologydesignpatterns.org

• Check clarity, consistency, extensibility, minimal commitment

Ontology recap: OntoDM (Panov et al., ’09, ’10)

• Aim: unified framework for DM research; builds on BFO

[Diagram: DM algorithm, with relations: achieves task (classification, pattern mining, ...); has part component (kernel, distance function, ...); has input dataset (data types, feature types, ...); has output generalization (model, pattern, clustering, ...); constraints; realizes; algo impl / algo appl (plan, process, plan specification)]


Ontology recap: DMOP (Hilario et al., ’09)

• Model the internal structure of learning algorithms

[Diagram: classification algorithm (a DM object); specifies input type: dataset; specifies output type: model; has optimization problem; has optimization strategy; has model complexity control strategy; assumes: algorithm assumption; has constraint: constraint; has objective function: induction cost function / loss function (with regularization parameter); implements: operator; model structure and model parameters (e.g. a decision boundary); has hyperparameter: hyperparameter]


Ontology recap: DMWF (Kietz et al., ’09)

• Reason about KD operators: inputs/outputs, conditions/effects (SWRL rules)

[Diagram: operator (modeling, model evaluation, data table processing, model processing, reader/writer) uses and produces IO objects (data, model, model report, meta-data; attribute_value_table, categorical, ...)]


"RapidMiner.ID3":

Superclass:
  ClassificationLearning and (uses exactly 1 AttributeValueDataTable)
  and (produces exactly 1 Model)
  and (simpleParameter1(name="minimal size for split") exactly 1 integer)
  and (simpleParameter2(name="minimal leaf size") exactly 1 integer) ...

Condition:
  (AttributeValueDataTable and MissingValueFreeData
   and (inputAttribute only (hasAttributeType only Categorial))
   and (targetAttribute exactly 1 (hasAttributeType only Categorial)))(?D),
  noOfRecords(?D, ?Size), ?P1 is ?Size / 100
  → uses(this, ?D), simpleParameter2(this, ?P1)

Effect:
  uses(this, ?D), hasFormat(?D, ?F), inputAttribute(?D, ?IA), targetAttribute(?D, ?TA)
  → new(?M, ?D), DecisionTree(?M), produces(this, ?M), hasFormat(?M, ?F),
    inputAttribute(?M, ?IA), predictedAttribute(?M, ?TA)


Ontology recap: EXPO (Soldatova and King, ’06)

• Make the goal and structure of scientific experiments more explicit

[Diagram: experiment, with has-part relations to: experiment goal (compare, confirm hypothesis, explain, ...); experiment design; experiment model; experiment design strategy (factorial, orthogonal, ...); experimental variable ((un)controlled, (in)dependent, ...); admin info (author, biblio_reference, ...); experiment hypothesis; experimental result]


Exposé: an ontology for data mining experimentation

Context

• Giant, public database(s) of data mining experiments

• We need:

• Common language to share experiments (through DM tools)

• Intuitive ways to store and query experimental results

• We want:

• Interoperable ontology: OntoDM for top-level, DMOP for detailed properties of learning algorithms

• Driven by actual experiments submitted to database

• New algorithms -> ideally, described by author

• Instances automatically extracted from database

Problem 1: Experiments
What is a machine learning experiment? What do we need to know about it?

Exposé: Experiments

[Diagram, building on EXPO: a composite experiment has as participants singular experiments, an experimental design, and experimental variables. A singular experiment has as participant an experiment workflow (a KD workflow) and is executed on a machine. A learner evaluation has as participants a learning algorithm and a model evaluation function, has input a dataset, and has output a model evaluation result, a prediction result, and a performance estimation. Workflows have inputs, outputs, and operators (participants). hp = has participant; hd = has description]

Problem 2: Algorithms
When we talk about an algorithm, what is meant? The general algorithm? A specific implementation? Which version? When run: with which parameters and components?

Exposé: Algorithms: specification, implementation, application (similar to OntoDM)

[Diagram: an algorithm implementation is a concretization of an algorithm specification; it has a description (name, version, url, ...), qualities (algo quality), and parts (parameter implementations). An algorithm application (an operator) has as participants the implementation, parameter settings, and function applications. The same specification / implementation / application pattern holds for functions (function, function impl., function appl.) and parameters (parameter, param impl, param setting). ico = is concretization of; hp = has participant]

Problem 3: Algorithm composition
Algorithms plug in functions, kernels, and other algorithms. Such components play different roles → agent-role pattern.
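The agent-role pattern can be sketched in code: the same implementation object participates in different applications under different roles. A minimal Python sketch; the class names, WEKA identifiers, and parameter values here are illustrative, not taken from the ontology's formal definitions:

```python
from dataclasses import dataclass, field

@dataclass
class Implementation:
    """A concrete, versioned implementation (of an algorithm or a function)."""
    name: str
    version: str

@dataclass
class Application:
    """One run of an implementation; components participate under a role."""
    impl: Implementation
    parameters: dict = field(default_factory=dict)
    components: dict = field(default_factory=dict)  # role -> Application

    def add_component(self, role: str, component: "Application") -> None:
        self.components[role] = component

# The same RBF-kernel implementation plays the 'kernel' role inside an SVM run,
# and could equally play it inside any other kernelized algorithm.
rbf = Application(Implementation("weka.RBFKernel", "1.0"), {"gamma": 0.1})
svm = Application(Implementation("weka.SMO", "3.6"), {"C": 1.0})
svm.add_component("kernel", rbf)

# A whole algorithm application can itself fill the 'base-learner' role.
bagging = Application(Implementation("weka.Bagging", "3.6"))
bagging.add_component("base-learner", svm)
```

The point of the pattern: "kernel" and "base-learner" are roles realized by the participant, not subclasses of the participant itself.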

Exposé: Algorithms: component roles

[Diagram: the function and algorithm applications participating in an algorithm application each realize a component role: algorithm roles (base learner, search, data preprocessor) or function roles (kernel, distance function). E.g. a kernelized algorithm is a learning algorithm whose participant function application realizes the kernel role]

Problem 4: Workflows
Inputs, outputs, operators. Hierarchical: workflows within workflows. Reuse and parameterize common workflows, e.g. k-fold CV.
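The idea of reusable, parameterized sub-workflows can be sketched as a small data model. This is a hypothetical Python sketch (the names and step labels are invented for illustration): a step in a workflow may be a plain operator or a whole sub-workflow, and a k-fold CV wrapper is built once and parameterized by k:

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str

@dataclass
class Workflow:
    """A workflow has inputs, outputs, and steps; a step may itself be a
    whole (sub-)workflow, which gives the hierarchical structure."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    steps: list = field(default_factory=list)      # Operator or Workflow
    parameters: dict = field(default_factory=dict)

def cross_validation(inner: Workflow, k: int = 10) -> Workflow:
    """A reusable, parameterized CV workflow wrapping any train/test sub-workflow."""
    return Workflow(
        name=f"{k}-fold-cv",
        inputs=["dataset"],
        outputs=["learner evaluation"],
        steps=[Operator("fold splitter"), inner, Operator("performance estimation")],
        parameters={"k": k},
    )

train_test = Workflow(
    "train-test",
    inputs=["training set", "test set"],
    outputs=["model evaluation"],
    steps=[Operator("learner application"), Operator("model evaluation function")],
)
cv = cross_validation(train_test, k=10)
```

Because `cross_validation` takes the inner workflow as an argument, the same CV definition is reused across experiments with only `k` and the wrapped workflow varying.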

Exposé: workflows

[Diagram: a data processing (sub)workflow chains data processing applications, each consuming and producing datasets; a train/test evaluation workflow contains a learner application (outputting a model), a model evaluation function application, and a performance estimation application, together producing a learner evaluation / model evaluation result. Generic structure: d1 → op1 → d2 → op2 → ..., with has input / has output / has participant links. Similar to RapidMiner?]

Problem 5: Reuse
How can we make maximal use of existing ontologies?

• BFO: accepted top-level classes
• OBI: top level
• OntoDM: top-level DM concepts
• DMOP: operators, learning mechanisms

Exposé: top-level classes

[Diagram, built up in layers:
• BFO (accepted top-level classes): thing; quality (data quality, algorithm quality); realizable entity (role); material entity (machine); planned process (workflow, executed on a machine); digital entity (operator implementation); information content entity (dataset, model)
• OntoDM (top-level DM concepts): objective; dataset spec and model spec (descriptions of dataset and model); algorithm specification, of which an algo impl is a concretization, with the algo appl as participant; the same pattern for functions (function, function impl., function appl.)
• DMOP (operators, learning mechanisms): KD workflow; parameters (parameter, param impl, param setting, as parts of the specification / implementation / application respectively)
ico = is concretization of; hp = has participant]

Other aspects

Within a KD workflow, a data processing workflow has as participants data processing applications, each taking a dataset as input and producing a dataset as output.

Datasets

[Diagram: a dataset (attribute-value table, time series, item sequence, relational database, graph; with parts such as a sequence, set of instances, or set of tuples) realizes a data role (data mining data roles: training set, test set, optimization set, bootstrap bag). It is a concretization of a data specification and has as parts data instances and data features (incl. target features: numeric target feature, class feature). Qualities (data properties): dataset properties (simple: # features, # instances, # missing values; statistical: e.g. target skewness; information-theoretic; landmarkers), feature properties (qualitative: feature datatype, nominal value set; quantitative: feature entropy, feature kurtosis), instance properties (labeling: labeled / unlabeled). Descriptions (data items): name, version, url, identifier; part of a data repository]

Evaluation

[Diagram: a learner evaluation has as participant an evaluation function application, which is a concretization of an evaluation function implementation of an evaluation function (with name/version descriptions). Evaluation measure taxonomy:
• predictive model evaluation measures: class prediction (predictive accuracy, class RMSE, kappa statistic, averaged binary prediction measures), binary prediction (confusion matrix; derived measures: precision, recall, specificity, F-measure, AUROC, single-point AUROC, AUPRC), multi-class prediction, numeric prediction (error-based: RMSE, MAD, MAPE, RRSE, RSS; correlation coefficient), graphical (ROC curve, precision-recall curve / PR graph point, cost curve, lift chart)
• association evaluation measures: support, confidence, lift, leverage, conviction, frequency
• clustering evaluation measures: density-based; distance-based (intra-cluster variance, inter-cluster similarity)
• probabilistic distribution evaluation measures: distribution likelihood and log-likelihood, integrated (average) squared error, probability distribution scoring functions, information criteria (AIC, BIC), Kullback-Leibler divergence, likelihood ratio
• computational evaluation measures: build CPU time, build memory consumption]

Experiment context

[Diagram: a composite experiment has as participants singular experiments, an experimental design, and experimental variables (controlled, uncontrolled, dependent). Experimental designs (exploration designs): factorial (full factorial, fractional factorial, one factor at a time), orthogonal (latin square, Taguchi matrix, Plackett-Burman), random sampling (latin hypercube, monte carlo), active learning. An experiment has as parts an experimental goal, experimental hypothesis, and experiment conclusion; qualities (experiment properties): execution status, experiment error, experiment priority; descriptions: experiment context (task), admin info (author name, bibliographic reference, experiment id)]

Exposé: final notes

• In total 860 classes, 32 properties (from RO + DMOP)

• Individuals: all algorithms, preprocessors, and evaluation measures from WEKA

• actually stored in experiment database

• should be programmatically added (and updated)

• Written in OWL-DL, using Protégé 4.0

• Can be browsed at:

• http://expdb.cs.kuleuven.be/expdb/expose.owl

• http://www.e-lico.eu/OWLBrowser2/manage/

Use Cases

Goal: Collaborative experimentation
Now: small-scale, not repeatable, not reusable

[Build-up: a new algorithm is tried against a handful of datasets, preprocessing workflows, and evaluation procedures]

• A lot of work, which limits depth
• Results cannot be reused by others (they have to be repeated)
• Hard to repeat experiments from the descriptions in papers!

"The journal system is perhaps the most open system for the transmission of knowledge that could be built ... with 17th century media." Nielsen (APS Physics, 2008)

Data mining as an e-science
Ontologies let experiments be shared and run automatically.

• Share experiments: the Internet becomes a large, collaborative workspace
• Store them in experiment databases: ensure reproducibility; reuse millions of prior experiments; use all info on algorithms and datasets; make results universally accessible and useful

e-Sciences
• Astrophysics: virtual observatories
• Bio-informatics: micro-array databases

Collaborative experimentation: why?

• Reproducibility: good science
• Quick, easy analysis: querying to answer questions and test hypotheses
• Reuse: save time and energy (e.g. benchmarking)
• Generalizability: plug into prior results for larger studies
• Integration: import/export with data mining tools
• Reference: a ‘map’ of known approaches; compare to the state of the art; includes negative results

Use Case 1
Describe experiments in a common language → sharing or running experiments on a grid.

Use Exposé to define a common language: ExpML.

[Diagram: a learner evaluation with an input dataset and output results; participants: a learner application (its implementation, parameter settings, and component operators), a performance estimation, and an evaluation function application]

ExpML: a markup language for DM experiments

• Share DM experiments; XML-based

<expml>
  <dataset id='d1'>
  <learner_evaluation id='e1' input_data='d1'>
    <learner_appl>
      <learner_impl name=... version=...>
      <parameter_setting name='P' value='100'/>
      <learner_appl role='base-learner'>
        ...
      </learner_appl>
      <performance_estimation_appl>...
      <model_evaluation_function_appl>...
  </learner_evaluation>
  <model_evaluation_result output_of='e1'>
    <evaluation name='accuracy' value='0.99'>
    ...

Ontology → XML mapping:

• has-part, has-participant → XML subelement (with a role attribute)
• has-description → (required) attribute
• has-quality → ‘property’ subelement
• is-concretization-of → implementation_of attribute
• part-of → attributes
• has-specified-input → input_data attribute
• has-specified-output → output_of attribute
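To illustrate the mapping, the fragment below parses a small hand-written ExpML-like document with Python's standard library. The element and attribute names follow the slide's example, but the document, its values, and the WEKA identifier are hypothetical; this is a sketch, not an official ExpML parser:

```python
import xml.etree.ElementTree as ET

# A minimal ExpML-like fragment following the mapping above (invented values).
expml = """
<expml>
  <dataset id="d1" name="iris"/>
  <learner_evaluation id="e1" input_data="d1">
    <learner_appl>
      <learner_impl name="weka.SMO" version="3.6">
        <parameter_setting name="P" value="100"/>
      </learner_impl>
    </learner_appl>
  </learner_evaluation>
  <model_evaluation_result output_of="e1">
    <evaluation name="accuracy" value="0.99"/>
  </model_evaluation_result>
</expml>
"""

root = ET.fromstring(expml)
# has-participant -> subelement: the implementation sits inside the application.
impl = root.find(".//learner_impl")
# has-specified-output -> output_of attribute links the result back to the run.
result = root.find(".//model_evaluation_result[@output_of='e1']")
accuracy = float(result.find("evaluation").get("value"))
print(impl.get("name"), accuracy)   # weka.SMO 0.99
```

Because every relation maps to a plain element or attribute, a generic XML toolchain is enough to read, validate, or transform shared experiments.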

Use Case 2
Collect experiments in a database to query all empirical results.

ExpDB: a database to share experiments

[Diagram: the learner evaluation structure from ExpML, mapped to database tables: a learner application with its implementation, parameter implementations/settings, and component operators]

Experiment Database: >650,000 experiments, 54 algorithms, >87 datasets, 45 evaluation measures, 2 data processors, bias-variance analysis.

[Relational schema (excerpt): experiment(eid, laid, did, evaid, sid); learner(lid, name, description, version, url, path, library); learner_implementation(liid, lid, name, version, url, path, library); learner_application(laid, liid, is_default); learner_parameter(lpid, liid, name, alias, data_type, description, default, min, max, suggested_range); learner_parameter_setting(laid, lpid, value); learner_property(lpid, name, description, formula, min, max, unit); learner_property_value(liid, lpid, value); learner_component(caid, laid, role); kernels and functions follow the same specification / implementation / application pattern: kernel(kid, name, description), kernel_implementation(kiid, kid, name, version, url, path, library), kernel_application(kiid, caid, is_default), and likewise function, function_implementation, function_application]

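To show the kind of querying this schema enables, here is a toy sketch using SQLite. Table and column names loosely follow the schema above, but the flat `evaluation` table is a simplification and all data are invented:

```python
import sqlite3

# A toy cut of the ExpDB schema; data are invented for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE learner(lid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE learner_implementation(liid INTEGER PRIMARY KEY, lid INTEGER, version TEXT);
CREATE TABLE learner_application(laid INTEGER PRIMARY KEY, liid INTEGER, is_default INTEGER);
CREATE TABLE experiment(eid INTEGER PRIMARY KEY, laid INTEGER, did INTEGER);
CREATE TABLE evaluation(eid INTEGER, name TEXT, value REAL);

INSERT INTO learner VALUES (1, 'SMO'), (2, 'J48');
INSERT INTO learner_implementation VALUES (1, 1, '3.6'), (2, 2, '3.6');
INSERT INTO learner_application VALUES (1, 1, 1), (2, 2, 1);
INSERT INTO experiment VALUES (1, 1, 7), (2, 2, 7);
INSERT INTO evaluation VALUES (1, 'predictive_accuracy', 0.99),
                              (2, 'predictive_accuracy', 0.95);
""")

# 'Query all empirical results': rank learners by accuracy on dataset 7.
rows = db.execute("""
    SELECT l.name, ev.value
    FROM experiment e
    JOIN learner_application la    ON la.laid = e.laid
    JOIN learner_implementation li ON li.liid = la.liid
    JOIN learner l                 ON l.lid  = li.lid
    JOIN evaluation ev             ON ev.eid = e.eid
    WHERE e.did = 7 AND ev.name = 'predictive_accuracy'
    ORDER BY ev.value DESC
""").fetchall()
print(rows)   # [('SMO', 0.99), ('J48', 0.95)]
```

One declarative query replaces re-running the benchmark, which is exactly the reuse argument made earlier.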

Use Case 3
Intuitive querying.

Query interface (see the YouTube demo, "experiment database"): http://expdb.cs.kuleuven.be

The way ahead

• A 3rd generation of tools could turn data mining into an e-science

• Experiments shared, reused, run worldwide

• Repeatable, generalizable, reusable

• Cooperation on a standardized ontology for data mining?

• Automatic ontology extraction: DM paper -> ontology extension

• RDF experiment databases?

• Open problems:

• Queriable models, auto-population (active meta-learning), quality control


http://expdb.cs.kuleuven.be

Thanks! (Gracias, Xie Xie, Danke, Dank U, Merci, Efharisto, Dhanyavaad, Grazie, Spasiba, Obrigado, Teşekkürler, Diolch, Köszönöm, Arigato, Hvala, Toda)