
Exposé: An ontology for machine learning experimentation

Joaquin Vanschoren, K.U.Leuven (Belgium), U. Leiden (The Netherlands); Larisa Soldatova, University of Aberystwyth (UK)

DM Ontology Jamboree 2010

Overview

• Ontology lessons
• Exposé ontology
• Use cases

Ontology lessons: What did we learn from other ontologies?

Ontology design

• Start from accepted classes & properties (top-level ontologies, e.g. OBI, RO)

• If possible, reuse prior ontologies to build on their knowledge/consensus

• Use ontology design patterns: reusable patterns for recurrent problems

• http://ontologydesignpatterns.org

• Check clarity, consistency, extensibility, minimal commitment

Ontology recap: OntoDM (Panov et al., ’09, ’10)

• Aim: unified framework for DM research; builds on BFO

[Diagram: DM algorithm, with relations: achieves task (classification, pattern mining, ...); has part component (kernel, distance function, ...); has input dataset (data types, feature types, ...); has output generalization (model, pattern, clustering, ...); constraints; realizes; algo impl / algo appl (plan, process, plan specification)]


Ontology recap: DMOP (Hilario et al., ’09)

• Model the internal structure of learning algorithms

[Diagram: classification algorithm (a DM object); specifies input type: dataset; specifies output type: model; has optimization problem; has optimization strategy; has model complexity control strategy; assumes: algorithm assumption; has constraint: constraint; has objective function: induction cost function / loss function (with regularization parameter); implements: operator; model structure and model parameters (e.g. a decision boundary); has hyperparameter: hyperparameter]


Ontology recap: DMWF (Kietz et al., ’09)

• Reason about KD operators: inputs/outputs, conditions/effects (SWRL rules)

[Diagram: operator (modeling, model evaluation, data table processing, model processing, reader/writer) uses and produces IO objects (data, model, model report, meta-data; attribute_value_table, categorical, ...)]


"RapidMiner.ID3":

Superclass:
  ClassificationLearning and (uses exactly 1 AttributeValueDataTable)
  and (produces exactly 1 Model)
  and (simpleParameter1(name="minimal size for split") exactly 1 integer)
  and (simpleParameter2(name="minimal leaf size") exactly 1 integer) ...

Condition:
  (AttributeValueDataTable and MissingValueFreeData
   and (inputAttribute only (hasAttributeType only Categorial))
   and (targetAttribute exactly 1 (hasAttributeType only Categorial)))(?D),
  noOfRecords(?D, ?Size), ?P1 is ?Size / 100
  → uses(this, ?D), simpleParameter2(this, ?P1)

Effect:
  uses(this, ?D), hasFormat(?D, ?F), inputAttribute(?D, ?IA), targetAttribute(?D, ?TA)
  → new(?M, ?D), DecisionTree(?M), produces(this, ?M), hasFormat(?M, ?F),
    inputAttribute(?M, ?IA), predictedAttribute(?M, ?TA)


Ontology recap: EXPO (Soldatova and King, ’06)

• Make the goal and structure of scientific experiments more explicit

[Diagram: experiment, with has-part relations to: experiment goal (compare, confirm hypothesis, explain, ...); experiment design; experiment model; experiment design strategy (factorial, orthogonal, ...); experimental variable ((un)controlled, (in)dependent, ...); admin info (author, biblio_reference, ...); experiment hypothesis; experimental result]


Exposé: an ontology for data mining experimentation

Context

• Giant, public database(s) of data mining experiments

• We need:

• Common language to share experiments (through DM tools)

• Intuitive ways to store and query experimental results

• We want:

• Interoperable ontology: OntoDM for top-level, DMOP for detailed properties of learning algorithms

• Driven by actual experiments submitted to database

• New algorithms -> ideally, described by author

• Instances automatically extracted from database

Problem 1: Experiments
What is a machine learning experiment? What do we need to know about it?

Exposé: Experiments

[Diagram, building on EXPO: a composite experiment has as participants singular experiments, an experimental design, and experimental variables. A singular experiment has as participant an experiment workflow (a KD workflow) and is executed on a machine. A learner evaluation has as participants a learning algorithm and a model evaluation function, has input a dataset, and has output a model evaluation result, a prediction result, and a performance estimation. Workflows have inputs, outputs, and operators (participants). hp = has participant; hd = has description]

Problem 2: Algorithms
When we talk about an algorithm, what is meant? The general algorithm? A specific implementation? Which version? When run: with which parameters and components?

Exposé: Algorithms: specification, implementation, application (similar to OntoDM)

[Diagram: an algorithm implementation is a concretization of an algorithm specification; it has a description (name, version, url, ...), qualities (algo quality), and parts (parameter implementations). An algorithm application (an operator) has as participants the implementation, parameter settings, and function applications. The same specification / implementation / application pattern holds for functions (function, function impl., function appl.) and parameters (parameter, param impl, param setting). ico = is concretization of; hp = has participant]

Problem 3: Algorithm composition
Algorithms plug in functions, kernels, and other algorithms. Such components play different roles → agent-role pattern.
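The agent-role pattern can be sketched in code: the same implementation object participates in different applications under different roles. A minimal Python sketch; the class names, WEKA identifiers, and parameter values here are illustrative, not taken from the ontology's formal definitions:

```python
from dataclasses import dataclass, field

@dataclass
class Implementation:
    """A concrete, versioned implementation (of an algorithm or a function)."""
    name: str
    version: str

@dataclass
class Application:
    """One run of an implementation; components participate under a role."""
    impl: Implementation
    parameters: dict = field(default_factory=dict)
    components: dict = field(default_factory=dict)  # role -> Application

    def add_component(self, role: str, component: "Application") -> None:
        self.components[role] = component

# The same RBF-kernel implementation plays the 'kernel' role inside an SVM run,
# and could equally play it inside any other kernelized algorithm.
rbf = Application(Implementation("weka.RBFKernel", "1.0"), {"gamma": 0.1})
svm = Application(Implementation("weka.SMO", "3.6"), {"C": 1.0})
svm.add_component("kernel", rbf)

# A whole algorithm application can itself fill the 'base-learner' role.
bagging = Application(Implementation("weka.Bagging", "3.6"))
bagging.add_component("base-learner", svm)
```

The point of the pattern: "kernel" and "base-learner" are roles realized by the participant, not subclasses of the participant itself.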

Exposé: Algorithms: component roles

[Diagram: the function and algorithm applications participating in an algorithm application each realize a component role: algorithm roles (base learner, search, data preprocessor) or function roles (kernel, distance function). E.g. a kernelized algorithm is a learning algorithm whose participant function application realizes the kernel role]

Problem 4: Workflows
Inputs, outputs, operators. Hierarchical: workflows within workflows. Reuse and parameterize common workflows, e.g. k-fold CV.
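The idea of reusable, parameterized sub-workflows can be sketched as a small data model. This is a hypothetical Python sketch (the names and step labels are invented for illustration): a step in a workflow may be a plain operator or a whole sub-workflow, and a k-fold CV wrapper is built once and parameterized by k:

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str

@dataclass
class Workflow:
    """A workflow has inputs, outputs, and steps; a step may itself be a
    whole (sub-)workflow, which gives the hierarchical structure."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    steps: list = field(default_factory=list)      # Operator or Workflow
    parameters: dict = field(default_factory=dict)

def cross_validation(inner: Workflow, k: int = 10) -> Workflow:
    """A reusable, parameterized CV workflow wrapping any train/test sub-workflow."""
    return Workflow(
        name=f"{k}-fold-cv",
        inputs=["dataset"],
        outputs=["learner evaluation"],
        steps=[Operator("fold splitter"), inner, Operator("performance estimation")],
        parameters={"k": k},
    )

train_test = Workflow(
    "train-test",
    inputs=["training set", "test set"],
    outputs=["model evaluation"],
    steps=[Operator("learner application"), Operator("model evaluation function")],
)
cv = cross_validation(train_test, k=10)
```

Because `cross_validation` takes the inner workflow as an argument, the same CV definition is reused across experiments with only `k` and the wrapped workflow varying.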

Exposé: workflows

[Diagram: a data processing (sub)workflow chains data processing applications, each consuming and producing datasets; a train/test evaluation workflow contains a learner application (outputting a model), a model evaluation function application, and a performance estimation application, together producing a learner evaluation / model evaluation result. Generic structure: d1 → op1 → d2 → op2 → ..., with has input / has output / has participant links. Similar to RapidMiner?]

Problem 5: Reuse
How can we make maximal use of existing ontologies?

• BFO: accepted top-level classes
• OBI: top level
• OntoDM: top-level DM concepts
• DMOP: operators, learning mechanisms

Exposé: top-level classes

[Diagram, built up in layers:
• BFO (accepted top-level classes): thing; quality (data quality, algorithm quality); realizable entity (role); material entity (machine); planned process (workflow, executed on a machine); digital entity (operator implementation); information content entity (dataset, model)
• OntoDM (top-level DM concepts): objective; dataset spec and model spec (descriptions of dataset and model); algorithm specification, of which an algo impl is a concretization, with the algo appl as participant; the same pattern for functions (function, function impl., function appl.)
• DMOP (operators, learning mechanisms): KD workflow; parameters (parameter, param impl, param setting, as parts of the specification / implementation / application respectively)
ico = is concretization of; hp = has participant]

Other aspects

Within a KD workflow, a data processing workflow has as participants data processing applications, each taking a dataset as input and producing a dataset as output.

Datasets

[Diagram: a dataset (attribute-value table, time series, item sequence, relational database, graph; with parts such as a sequence, set of instances, or set of tuples) realizes a data role (data mining data roles: training set, test set, optimization set, bootstrap bag). It is a concretization of a data specification and has as parts data instances and data features (incl. target features: numeric target feature, class feature). Qualities (data properties): dataset properties (simple: # features, # instances, # missing values; statistical: e.g. target skewness; information-theoretic; landmarkers), feature properties (qualitative: feature datatype, nominal value set; quantitative: feature entropy, feature kurtosis), instance properties (labeling: labeled / unlabeled). Descriptions (data items): name, version, url, identifier; part of a data repository]

Evaluation

[Diagram: a learner evaluation has as participant an evaluation function application, which is a concretization of an evaluation function implementation of an evaluation function (with name/version descriptions). Evaluation measure taxonomy:
• predictive model evaluation measures: class prediction (predictive accuracy, class RMSE, kappa statistic, averaged binary prediction measures), binary prediction (confusion matrix; derived measures: precision, recall, specificity, F-measure, AUROC, single-point AUROC, AUPRC), multi-class prediction, numeric prediction (error-based: RMSE, MAD, MAPE, RRSE, RSS; correlation coefficient), graphical (ROC curve, precision-recall curve / PR graph point, cost curve, lift chart)
• association evaluation measures: support, confidence, lift, leverage, conviction, frequency
• clustering evaluation measures: density-based; distance-based (intra-cluster variance, inter-cluster similarity)
• probabilistic distribution evaluation measures: distribution likelihood and log-likelihood, integrated (average) squared error, probability distribution scoring functions, information criteria (AIC, BIC), Kullback-Leibler divergence, likelihood ratio
• computational evaluation measures: build CPU time, build memory consumption]

Experiment context

[Diagram: a composite experiment has as participants singular experiments, an experimental design, and experimental variables (controlled, uncontrolled, dependent). Experimental designs (exploration designs): factorial (full factorial, fractional factorial, one factor at a time), orthogonal (latin square, Taguchi matrix, Plackett-Burman), random sampling (latin hypercube, monte carlo), active learning. An experiment has as parts an experimental goal, experimental hypothesis, and experiment conclusion; qualities (experiment properties): execution status, experiment error, experiment priority; descriptions: experiment context (task), admin info (author name, bibliographic reference, experiment id)]

Exposé: final notes

• In total 860 classes, 32 properties (from RO + DMOP)

• Individuals: all algorithms, preprocessors, and evaluation measures from WEKA

• actually stored in experiment database

• should be programmatically added (and updated)

• Written in OWL-DL, using Protégé 4.0

• Can be browsed at:

• http://expdb.cs.kuleuven.be/expdb/expose.owl

• http://www.e-lico.eu/OWLBrowser2/manage/

Use Cases

Goal: Collaborative experimentation
Now: small-scale, not repeatable, not reusable

[Build-up: a new algorithm is tried against a handful of datasets, preprocessing workflows, and evaluation procedures]

• A lot of work, which limits depth
• Results cannot be reused by others (they have to be repeated)
• Hard to repeat experiments from the descriptions in papers!

"The journal system is perhaps the most open system for the transmission of knowledge that could be built ... with 17th century media." Nielsen (APS Physics, 2008)

Data mining as an e-science
Ontologies let experiments be shared and run automatically.

• Share experiments: the Internet becomes a large, collaborative workspace
• Store them in experiment databases: ensure reproducibility; reuse millions of prior experiments; use all info on algorithms and datasets; make results universally accessible and useful

e-Sciences
• Astrophysics: virtual observatories
• Bio-informatics: micro-array databases

Collaborative experimentation: why?

• Reproducibility: good science
• Quick, easy analysis: querying to answer questions and test hypotheses
• Reuse: save time and energy (e.g. benchmarking)
• Generalizability: plug into prior results for larger studies
• Integration: import/export with data mining tools
• Reference: a ‘map’ of known approaches; compare to the state of the art; includes negative results

Use Case 1
Describe experiments in a common language → sharing or running experiments on a grid.

Use Exposé to define a common language: ExpML.

[Diagram: a learner evaluation with an input dataset and output results; participants: a learner application (its implementation, parameter settings, and component operators), a performance estimation, and an evaluation function application]

ExpML: a markup language for DM experiments

• Share DM experiments; XML-based

<expml>
  <dataset id='d1'>
  <learner_evaluation id='e1' input_data='d1'>
    <learner_appl>
      <learner_impl name=... version=...>
      <parameter_setting name='P' value='100'/>
      <learner_appl role='base-learner'>
        ...
      </learner_appl>
      <performance_estimation_appl>...
      <model_evaluation_function_appl>...
  </learner_evaluation>
  <model_evaluation_result output_of='e1'>
    <evaluation name='accuracy' value='0.99'>
    ...

Ontology → XML mapping:

• has-part, has-participant → XML subelement (with a role attribute)
• has-description → (required) attribute
• has-quality → ‘property’ subelement
• is-concretization-of → implementation_of attribute
• part-of → attributes
• has-specified-input → input_data attribute
• has-specified-output → output_of attribute
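To illustrate the mapping, the fragment below parses a small hand-written ExpML-like document with Python's standard library. The element and attribute names follow the slide's example, but the document, its values, and the WEKA identifier are hypothetical; this is a sketch, not an official ExpML parser:

```python
import xml.etree.ElementTree as ET

# A minimal ExpML-like fragment following the mapping above (invented values).
expml = """
<expml>
  <dataset id="d1" name="iris"/>
  <learner_evaluation id="e1" input_data="d1">
    <learner_appl>
      <learner_impl name="weka.SMO" version="3.6">
        <parameter_setting name="P" value="100"/>
      </learner_impl>
    </learner_appl>
  </learner_evaluation>
  <model_evaluation_result output_of="e1">
    <evaluation name="accuracy" value="0.99"/>
  </model_evaluation_result>
</expml>
"""

root = ET.fromstring(expml)
# has-participant -> subelement: the implementation sits inside the application.
impl = root.find(".//learner_impl")
# has-specified-output -> output_of attribute links the result back to the run.
result = root.find(".//model_evaluation_result[@output_of='e1']")
accuracy = float(result.find("evaluation").get("value"))
print(impl.get("name"), accuracy)   # weka.SMO 0.99
```

Because every relation maps to a plain element or attribute, a generic XML toolchain is enough to read, validate, or transform shared experiments.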

Use Case 2
Collect experiments in a database to query all empirical results.

ExpDB: a database to share experiments

[Diagram: the learner evaluation structure from ExpML, mapped to database tables: a learner application with its implementation, parameter implementations/settings, and component operators]

Experiment Database: >650,000 experiments, 54 algorithms, >87 datasets, 45 evaluation measures, 2 data processors, bias-variance analysis.

[Relational schema (excerpt): experiment(eid, laid, did, evaid, sid); learner(lid, name, description, version, url, path, library); learner_implementation(liid, lid, name, version, url, path, library); learner_application(laid, liid, is_default); learner_parameter(lpid, liid, name, alias, data_type, description, default, min, max, suggested_range); learner_parameter_setting(laid, lpid, value); learner_property(lpid, name, description, formula, min, max, unit); learner_property_value(liid, lpid, value); learner_component(caid, laid, role); kernels and functions follow the same specification / implementation / application pattern: kernel(kid, name, description), kernel_implementation(kiid, kid, name, version, url, path, library), kernel_application(kiid, caid, is_default), and likewise function, function_implementation, function_application]

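To show the kind of querying this schema enables, here is a toy sketch using SQLite. Table and column names loosely follow the schema above, but the flat `evaluation` table is a simplification and all data are invented:

```python
import sqlite3

# A toy cut of the ExpDB schema; data are invented for illustration.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE learner(lid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE learner_implementation(liid INTEGER PRIMARY KEY, lid INTEGER, version TEXT);
CREATE TABLE learner_application(laid INTEGER PRIMARY KEY, liid INTEGER, is_default INTEGER);
CREATE TABLE experiment(eid INTEGER PRIMARY KEY, laid INTEGER, did INTEGER);
CREATE TABLE evaluation(eid INTEGER, name TEXT, value REAL);

INSERT INTO learner VALUES (1, 'SMO'), (2, 'J48');
INSERT INTO learner_implementation VALUES (1, 1, '3.6'), (2, 2, '3.6');
INSERT INTO learner_application VALUES (1, 1, 1), (2, 2, 1);
INSERT INTO experiment VALUES (1, 1, 7), (2, 2, 7);
INSERT INTO evaluation VALUES (1, 'predictive_accuracy', 0.99),
                              (2, 'predictive_accuracy', 0.95);
""")

# 'Query all empirical results': rank learners by accuracy on dataset 7.
rows = db.execute("""
    SELECT l.name, ev.value
    FROM experiment e
    JOIN learner_application la    ON la.laid = e.laid
    JOIN learner_implementation li ON li.liid = la.liid
    JOIN learner l                 ON l.lid  = li.lid
    JOIN evaluation ev             ON ev.eid = e.eid
    WHERE e.did = 7 AND ev.name = 'predictive_accuracy'
    ORDER BY ev.value DESC
""").fetchall()
print(rows)   # [('SMO', 0.99), ('J48', 0.95)]
```

One declarative query replaces re-running the benchmark, which is exactly the reuse argument made earlier.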

Use Case 3
Intuitive querying.

Query interface (see the YouTube demo, "experiment database"): http://expdb.cs.kuleuven.be

The way ahead

• A 3rd generation of tools could turn data mining into an e-science

• Experiments shared, reused, run worldwide

• Repeatable, generalizable, reusable

• Cooperation on a standardized ontology for data mining?

• Automatic ontology extraction: DM paper -> ontology extension

• RDF experiment databases?

• Open problems:

• Queriable models, auto-population (active meta-learning), quality control


http://expdb.cs.kuleuven.be

Thanks! (Gracias, Xie Xie, Danke, Dank U, Merci, Efharisto, Dhanyavaad, Grazie, Spasiba, Obrigado, Teşekkürler, Diolch, Köszönöm, Arigato, Hvala, Toda)