Exposé: An ontology for machine learning experimentation
Joaquin Vanschoren, K.U.Leuven (Belgium), U. Leiden (The Netherlands)
Larisa Soldatova, University of Aberystwyth (UK)
DM Ontology Jamboree 2010
Overview
• Ontology lessons
• Exposé ontology
• Use cases
Ontology lessons: What did we learn from other ontologies?
Ontology design
• Start from accepted classes & properties (top-level ontologies, e.g. OBI, RO)
• If possible, reuse prior ontologies to build on their knowledge/consensus
• Use ontology design patterns: reusable patterns for recurrent problems
• http://ontologydesignpatterns.org
• Check clarity, consistency, extensibility, minimal commitment
Ontology recap: OntoDM (Panov et al., ’09, ’10)
• Aim: unified framework for DM research, builds on BFO
[Diagram: a DM algorithm achieves a task (classification, pattern mining, ...), has components (kernel, distance function, ...) as parts, has a dataset (data types, feature types, ...) as input and a generalization (model, pattern, clustering, ...) as output, subject to constraints; the algorithm specification is realized by an algorithm implementation and algorithm application (plan specification / plan process).]
Ontology recap: DMOP (Hilario et al., ’09)
• Model internal structure of learning algorithms
[Diagram: a classification algorithm specifies a dataset as input type and a model as output type; it has an optimization problem (with an objective function: induction cost function, loss function, regularization parameter), an optimization strategy, a model complexity control strategy, algorithm assumptions, and constraints; it is implemented by operators; the resulting model has a model structure, model parameters, and a decision boundary; algorithms have hyperparameters.]
Ontology recap: DMWF (Kietz et al., ’09)
• Reason about KD operators: in/outputs, conditions/effects (SWRL rules)
[Diagram: operators (readers, writers, data table processing, model processing, modeling, model evaluation) use and produce IO objects: data (attribute_value_table, categorical, ...), meta-data, models, and model reports.]
"RapidMiner.ID3":
Superclass:
  ClassificationLearning and (uses exactly 1 AttributeValueDataTable)
  and (produces exactly 1 Model)
  and (simpleParameter1(name="minimal size for split") exactly 1 integer)
  and (simpleParameter2(name="minimal leaf size") exactly 1 integer) ...
Condition:
  (AttributeValueDataTable and MissingValueFreeData
   and (inputAttribute only (hasAttributeType only Categorial))
   and (targetAttribute exactly 1 (hasAttributeType only Categorial)))(?D),
  noOfRecords(?D,?Size), ?P1 is ?Size / 100
  → uses(this,?D), simpleParameter2(this,?P1)
Effect:
  uses(this,?D), hasFormat(?D,?F), inputAttribute(?D,?IA), targetAttribute(?D,?TA)
  → new(?M,?D), DecisionTree(?M), produces(this,?M), hasFormat(?M,?F),
    inputAttribute(?M,?IA), predictedAttribute(?M,?TA)
Ontology recap: EXPO (Soldatova and King, ’06)
• Make goal and structure of scientific experiments more explicit
[Diagram: an experiment has as parts an experiment goal (compare, confirm hypothesis, explain, ...), an experiment hypothesis, an experiment design (with a design strategy such as factorial or orthogonal, and experimental variables: (un)controlled, (in)dependent, ...), an experiment model, experimental results, and admin info (author, biblio_reference, ...).]
Exposé: an ontology for data mining experimentation
Context
• Giant, public database(s) of data mining experiments
• We need:
• Common language to share experiments (through DM tools)
• Intuitive ways to store and query experimental results
• We want:
• Interoperable ontology: OntoDM for top-level, DMOP for detailed properties of learning algorithms
• Driven by actual experiments submitted to database
• New algorithms -> ideally, described by author
• Instances automatically extracted from database
Problem 1: Experiments
What is a machine learning experiment? What do we need to know about it?
Exposé: Experiments
[Diagram: a singular experiment has as participants an experiment workflow (a KD workflow with inputs, outputs, and operators as participants) and is executed on a machine; the workflow contains a learner evaluation with a learning algorithm, dataset, and model evaluation (performance estimation, model evaluation function) as participants, outputting a model, a prediction result dataset, and a model evaluation result; composite experiments add an experimental design and experimental variables, following EXPO.]
Problem 2: Algorithms
When talking about an algorithm, what is meant? The general algorithm? A specific implementation, and which version? When run, with which parameters and components?
Exposé: Algorithms
Specification, implementation, application (similar to OntoDM)
[Diagram: an algorithm specification has qualities (algorithm qualities) and descriptions (name, version, url, ...); an algorithm implementation is a concretization of the specification; an algorithm application (an operator) has the implementation, parameter settings (via parameter implementations), and function applications as participants.]
Same for functions and parameters
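The specification / implementation / application split above can be sketched in code. A minimal illustration with hypothetical Python classes (not part of Exposé itself): an implementation is a concretization of a versionless specification, and an application is one run of an implementation with concrete parameter settings.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AlgorithmSpecification:
    """The abstract algorithm, e.g. 'C4.5' as described in the literature."""
    name: str

@dataclass(frozen=True)
class AlgorithmImplementation:
    """Concrete code: is-concretization-of a specification."""
    specification: AlgorithmSpecification
    name: str       # e.g. 'weka.J48'
    version: str    # implementations are versioned; specifications are not

@dataclass
class AlgorithmApplication:
    """One run: an implementation plus concrete parameter settings."""
    implementation: AlgorithmImplementation
    parameter_settings: dict = field(default_factory=dict)

# One specification can be concretized by many implementations,
# and each implementation can be applied with many parameter settings.
c45 = AlgorithmSpecification("C4.5")
j48 = AlgorithmImplementation(c45, "weka.J48", "3.6")
run = AlgorithmApplication(j48, {"C": 0.25, "M": 2})
```

The same triple applies to functions and parameters, as the slide notes.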
Problem 3: Algorithm composition
Algorithms plug in functions, kernels, and other algorithms. Such components play different roles -> agent-role pattern.
Exposé: Algorithms
[Diagram: a component (an algorithm or function application) participates in an algorithm application by realizing a component role: algorithm roles such as base learner, search, or data preprocessor, and function roles such as kernel or distance function; e.g. a kernelized algorithm is a learning algorithm that has a kernel as part.]
Problem 4: Workflows
Inputs, outputs, operators. Hierarchical: workflows within workflows. Reuse and parameterize common workflows, e.g. k-fold cross-validation.
Exposé: workflows
[Diagram: a train-test evaluation workflow: a data processing workflow chains data processing applications (dataset -> op1 -> dataset -> op2 -> dataset); the resulting datasets are inputs to a learner application and a performance estimation application, whose model and outputs feed a model evaluation function application within a learner evaluation, producing a model evaluation result.]
Similar to RapidMiner?
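The hierarchical idea can be sketched briefly: because a workflow is itself an operator, common sub-workflows (e.g. a preprocessing chain) can be reused and nested inside larger workflows. A toy Python sketch (hypothetical classes, not the Exposé formalism):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Operator:
    """A workflow operator: consumes inputs, produces outputs."""
    name: str
    run: Callable[[list], list]

@dataclass
class Workflow:
    """A workflow exposes the same run() interface as an operator, so workflows nest."""
    name: str
    steps: List[Operator] = field(default_factory=list)

    def run(self, inputs: list) -> list:
        data = inputs
        for op in self.steps:
            data = op.run(data)  # the output of each step feeds the next
        return data

# A reusable sub-workflow (e.g. a preprocessing chain) ...
prep = Workflow("preprocess", [Operator("normalize", lambda d: [x / max(d) for x in d])])
# ... nested in an outer workflow as an ordinary step
outer = Workflow("train-test", [Operator("prep", prep.run),
                                Operator("double", lambda d: [2 * x for x in d])])
print(outer.run([1, 2, 4]))  # → [0.5, 1.0, 2.0]
```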
Problem 5: Reuse
How can we make maximal use of existing ontologies? OBI: top level. OntoDM: top-level DM concepts. DMOP: operators, learning mechanisms.
BFO: accepted top-level classes
OntoDM: top-level DM concepts
DMOP: operators, learning mechanisms
Exposé: top-level classes
[Diagram, built up in layers over BFO, OntoDM, and DMOP: thing subsumes quality (data quality, algorithm quality), realizable entity (role, objective), material entity (machine), planned process (KD workflow, algorithm application, function application), and digital entity / information content entity (dataset, model, dataset spec, model spec, algorithm specification, operator implementation, function implementation, parameter, parameter implementation, parameter setting). Implementations are concretizations of specifications, applications have implementations as participants, specs describe datasets and models, and workflows are executed on machines.]
Other aspects
Datasets
[Diagram: datasets (attribute-value table, time series, item sequence, relational database, graph; structured as sequences, sets of instances, sets of tuples) realize data mining data roles (training set, test set, optimization set, bootstrap bag). A dataset has parts (data instances and data features, incl. target and class features, with feature datatypes such as numeric datatypes or nominal value sets), qualities (dataset, feature, and instance properties: simple, e.g. #features, #instances, #missing values; statistical, e.g. feature entropy, feature kurtosis, target skewness; information-theoretic; landmarkers; labeled/unlabeled), and descriptions (name, version, url, identifier, data repository). In a KD workflow, a data processing application has datasets as input and output.]
Evaluation
[Diagram: a learner evaluation has an evaluation function application as participant, which in turn has an evaluation function implementation (a concretization of an evaluation function, with name and version) as participant. The measure taxonomy includes: class prediction measures (predictive accuracy, class RMSE, kappa statistic; binary prediction measures derived from the confusion matrix: precision, recall, F-measure, specificity, AUROC, AUPRC, single-point AUROC, PR-graph points; averaged binary and multi-class prediction measures), numeric prediction and error-based measures (RMSE, MAD, MAPE, RRSE, RSS, correlation coefficient), graphical measures (ROC curve, precision-recall curve, lift chart, cost curve), association measures (support, confidence, lift, leverage, conviction, frequency), clustering measures (intra-cluster variance, inter-cluster similarity, density- and distance-based measures, integrated (average) squared error), probabilistic measures (distribution likelihood and log-likelihood, probability distribution scoring functions, AIC, BIC, Kullback-Leibler divergence, likelihood ratio), and computational measures (build CPU time, build memory consumption).]
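Several of the binary prediction measures in this taxonomy derive from the confusion matrix. As a minimal illustration (standard textbook formulas, not Exposé code):

```python
def binary_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive binary prediction measures from a 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "accuracy": accuracy,
            "f-measure": f_measure}

m = binary_measures(tp=40, fp=10, fn=10, tn=40)
print(m)  # each measure ≈ 0.8 for this symmetric matrix
```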
Experiment context
[Diagram: a composite experiment has singular experiments, an experimental design, and experimental variables (controlled, uncontrolled, dependent) as participants. Designs include factorial designs (full, fractional, Taguchi matrix, Plackett-Burman), one-factor-at-a-time, orthogonal designs (Latin square, Latin hypercube), random sampling / Monte Carlo designs, exploration designs, and active learning. An experiment has parts such as its goal, hypothesis, and conclusion; descriptions such as the task, author name, bibliographic reference, and experiment id (admin info); and qualities such as execution status, error, and priority (experiment context).]
Exposé: final notes
• In total 860 classes, 32 properties (from RO + DMOP)
• Individuals: all algorithms, preprocessors, and evaluation measures from WEKA
• actually stored in experiment database
• should be programmatically added (and updated)
• Written in OWL-DL, using Protégé 4.0
• Can be browsed at:
• http://expdb.cs.kuleuven.be/expdb/expose.owl
• http://www.e-lico.eu/OWLBrowser2/manage/
Use Cases
Goal: Collaborative experimentation
Now: small-scale, not repeatable, not reusable. A new algorithm is tried on a few datasets, preprocessing workflows, and evaluation procedures:
• A lot of work, limits depth
• Results cannot be reused by others (have to be repeated)
• Hard to repeat experiments from descriptions in papers!

"The journal system is perhaps the most open system for the transmission of knowledge that could be built ... with 17th century media." (Nielsen, APS Physics 2008)
Data mining as an e-science
Ontologies: experiments shared, run automatically
• Share experiments: the Internet becomes a large, collaborative workspace
• Store them in experiment databases
• Ensure reproducibility
• Reuse millions of prior experiments
• Use all info on algorithms, datasets
• Results universally accessible and useful
e-Sciences
• Astrophysics: Virtual Observatories
• Bio-informatics: Micro-array Databases
Collaborative Experimentation: Why?
• Reproducibility: good science
• Quick, easy analysis: querying answers questions, tests hypotheses
• Reuse: save time & energy (e.g. benchmarking)
• Generalizability: plug into prior results, run larger studies
• Integration: data mining tools import/export experiments
• Reference: a ‘map’ of known approaches, compare to the state of the art, includes negative results
Use Case 1: Describe experiments in a common language
-> sharing or running experiments on a grid
Use Exposé to define the common language: ExpML
[Diagram: a learner evaluation with input and output data; its participants (learner application, learner implementation, parameter settings, operator components, performance estimation application, evaluation function application) map onto ExpML elements and attributes.]
ExpML: a markup language for DM experiments
• Share DM experiments, XML-based

<expml>
  <dataset id='d1'/>
  <learner_evaluation id='e1' input_data='d1'>
    <learner_appl>
      <learner_impl name=... version=...>
        <parameter_setting name='P' value='100'/>
      </learner_impl>
      <learner_appl role='base-learner'>
        ...
      </learner_appl>
    </learner_appl>
    <performance_estimation_appl>...</performance_estimation_appl>
    <model_evaluation_function_appl>...</model_evaluation_function_appl>
  </learner_evaluation>
  <model_evaluation_result output_of='e1'>
    <evaluation name='accuracy' value='0.99'/>
    ...
  </model_evaluation_result>
</expml>
ontology -> XML:
• has-part, has-participant -> XML subelement (with role attribute)
• has-description -> (required) attribute
• has-quality -> ‘property’ subelement
• is-concretization-of -> implementation_of attribute
• part-of -> attributes
• has-specified-input -> input_data attribute
• has-specified-output -> output_of attribute
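Following this mapping, an ExpML fragment can be emitted programmatically. A minimal sketch using Python's standard xml.etree; the element and attribute names mirror the example above, but this is an illustrative serializer, not an official ExpML tool:

```python
import xml.etree.ElementTree as ET

def build_expml(dataset_id: str, eval_id: str, learner: str,
                version: str, params: dict, accuracy: float) -> str:
    """Serialize one learner evaluation as an ExpML-style XML fragment."""
    root = ET.Element("expml")
    ET.SubElement(root, "dataset", id=dataset_id)
    ev = ET.SubElement(root, "learner_evaluation",
                       id=eval_id, input_data=dataset_id)   # has-specified-input
    appl = ET.SubElement(ev, "learner_appl")                # has-participant
    impl = ET.SubElement(appl, "learner_impl", name=learner, version=version)
    for name, value in params.items():                      # parameter settings
        ET.SubElement(impl, "parameter_setting", name=name, value=str(value))
    res = ET.SubElement(root, "model_evaluation_result",
                        output_of=eval_id)                  # has-specified-output
    ET.SubElement(res, "evaluation", name="accuracy", value=str(accuracy))
    return ET.tostring(root, encoding="unicode")

expml = build_expml("d1", "e1", "weka.J48", "3.6", {"P": 100}, 0.99)
print(expml)
```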
Use Case 2: Collect experiments in a database to query all empirical results
ExpDB: a database to share experiments
[Diagram: the learner evaluation structure (inputs, outputs, and participants: implementations, parameter settings and parameter implementations, operator components) maps onto database tables.]
Experiment Database
>650,000 experiments, 54 algorithms, >87 datasets, 45 evaluation measures, 2 data processors, bias-variance analysis
[Database schema diagram: an experiment table links learner applications, datasets, and evaluations; learner, learner_implementation, learner_application, learner_parameter, learner_parameter_setting, learner_property, learner_property_value, and learner_component tables (keyed by lid, liid, laid, lpid, caid) describe algorithms, with analogous function and kernel tables (fid, fiid, kid, kiid). Implementations carry name, version, url, path, and library; parameters carry data_type, description, default, min, max, and suggested_range.]
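A query over such a schema might look as follows. This is a toy sketch against a simplified three-table subset (sqlite3 in memory; the table and column names are modeled on the schema diagram but simplified, not the actual ExpDB schema):

```python
import sqlite3

# Simplified subset of the ExpDB schema: implementations, runs, parameter settings
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE learner_implementation (liid INTEGER PRIMARY KEY, name TEXT, version TEXT);
CREATE TABLE learner_application (laid INTEGER PRIMARY KEY, liid INTEGER);
CREATE TABLE learner_parameter_setting (laid INTEGER, name TEXT, value TEXT);
""")
con.execute("INSERT INTO learner_implementation VALUES (1, 'weka.J48', '3.6')")
con.execute("INSERT INTO learner_application VALUES (10, 1)")
con.execute("INSERT INTO learner_parameter_setting VALUES (10, 'C', '0.25')")

# Which parameter settings were used in runs of a given implementation?
rows = con.execute("""
    SELECT li.name, ps.name, ps.value
    FROM learner_application la
    JOIN learner_implementation li ON la.liid = li.liid
    JOIN learner_parameter_setting ps ON ps.laid = la.laid
    WHERE li.name = 'weka.J48'
""").fetchall()
print(rows)  # → [('weka.J48', 'C', '0.25')]
```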
Use Case 3: Intuitive querying
Query interface (see “experiment database” on YouTube): http://expdb.cs.kuleuven.be
The way ahead
• A third generation of tools could turn data mining into an e-science
• Experiments shared, reused, run worldwide
• Repeatable, generalizable, reusable
• Cooperation on a standardized ontology for data mining?
• Automatic ontology extraction: DM paper -> ontology extension
• RDF experiment databases?
• Open problems:
• Queryable models, auto-population (active meta-learning), quality control
http://expdb.cs.kuleuven.be
Thank you!