Post on 05-Jun-2018

Modeling and Storing Scientific Protocols

Natalia Kwasnikowska, Hasselt University, Belgium

Yi Chen and Zoé Lacroix, Arizona State University, AZ, USA

KSinBIT, October 29, 2006

Overview

• Our motivation

• Protocol Model

• Example

• ProtocolDB

• Future Work

Scientific portfolio

• Reproducibility
• Archived experiment
– input and output data
– intermediate data
– detailed description of the process
• Poorly recorded
– paper
– only implementation

Scientific protocol

• Complex process composed of interconnected tasks
– data-analysis pipeline
– workflow, dataflow
• Workflow Management Systems
– Taverna
– Kepler
– Pipeline Pilot

Scientific Protocol

• Germination proportions were analyzed by using program genmod.sas, with the file maxmingermshoots.xls as input and two output files as result: maxshoots diffs.xls and maxshootslsmeans.xls.

• Preprocessing of data necessary for determination of base and optimal temperatures for germination was achieved in two sub-steps. First, Observations.xls was used as input to sample numbers for DAPest.sas resulting in file DAPest sample numbers.xls, was subsequently used as input to DAPest.sas which produced DAPestData.xls. Also, graphing to print.sas was run with Observations.xls as input and produced five bitmaps.

• Base temperature (TB) for germination was determined by two separate methods, only one…

Problems

• Mix the conceptual design level with actual implementation
• Often lack detailed information about used resources
– which version?
– required parameters?
– what data formats?

Problems

• Difficult to track data provenance
– data lineage or data pedigree
– resources may be updated
• Difficult retrieval and comparison of protocols
– limited querying possibilities

Our contribution

• High-level abstract protocol model
• Independent of execution model
• Clear distinction between design and possible implementations
• Explicit mapping between them
• Suitable for storage in database systems

Design and Implementation

Find all available information about proteins involved in thelatent stage of multiple sclerosis.

[Figure: resources shown for this task: OMIM, PubMed, Medline, EntrezGene, HGNC, RefSeq, IPI, SwissProt, InterPro]

Protocol Design Model

• Set of design tasks
• Design task D:
– name N
– input type i
– output type o
• Set of conceptual types (ontology)
• Each task is a base protocol D, with input i and output o

[Figure: a generic design task D: N with input i and output o, and an example task T: Codes_For with input Gene and output Protein]
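The design-task definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SUPERTYPE` table, the `Sequence` type, and the function and class names are assumptions made for illustration. Only the task Codes_For with input Gene and output Protein comes from the slide.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: each conceptual type points to its direct
# supertype (None at the root). "Sequence" is an assumed type.
SUPERTYPE = {"Gene": "Sequence", "Protein": "Sequence", "Sequence": None}


def is_subtype(a, b):
    """Return True if conceptual type a equals b or lies below b (a ≤ b)."""
    while a is not None:
        if a == b:
            return True
        a = SUPERTYPE.get(a)
    return False


@dataclass(frozen=True)
class DesignTask:
    name: str         # task name N
    input_type: str   # conceptual input type i
    output_type: str  # conceptual output type o


# The example task from the slide: Codes_For takes a Gene and yields a Protein.
codes_for = DesignTask("Codes_For", "Gene", "Protein")
```

Keeping the ontology separate from the task records means the same subtype check can be reused when tasks are composed.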

Protocol Design Model

• successor: P = P’ · P”
– P’ has input i’ and output o’; P” has input i” and output o”; P has input i and output o
– i ≤ i’, o’ ≤ i”, and o” ≤ o
• split-merge: P = P’ ⊕ P”
– i ≤ i’⊕i” and o ≤ o’⊕o”
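A hedged sketch of how the two composition operators and their typing conditions might be checked in Python. Several choices here are assumptions, not part of the slides: the ontology, the task names `Transcribe` and `Translate`, the representation of ⊕ on types as tuple concatenation, and the decision to give the composed protocol exactly the types of its parts, which satisfies the input and output conditions with equality.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical ontology: conceptual type -> direct supertype.
SUPERTYPE = {"Gene": "Sequence", "mRNA": "Sequence",
             "Protein": "Sequence", "Sequence": None}


def is_subtype(a, b):
    """a ≤ b: a equals b or is a descendant of b in the ontology."""
    while a is not None:
        if a == b:
            return True
        a = SUPERTYPE.get(a)
    return False


@dataclass(frozen=True)
class Protocol:
    expr: str                 # textual form of the protocol expression
    inputs: Tuple[str, ...]   # input types (a tuple models i’⊕i”)
    outputs: Tuple[str, ...]  # output types


def task(name, i, o):
    """A base protocol: a single design task."""
    return Protocol(name, (i,), (o,))


def successor(p1, p2):
    """P = P’ · P”: requires o’ ≤ i”, checked component by component."""
    if len(p1.outputs) != len(p2.inputs) or not all(
            is_subtype(o, i) for o, i in zip(p1.outputs, p2.inputs)):
        raise TypeError(f"output of {p1.expr} does not fit input of {p2.expr}")
    return Protocol(f"({p1.expr} · {p2.expr})", p1.inputs, p2.outputs)


def split_merge(p1, p2):
    """P = P’ ⊕ P”: inputs and outputs are combined side by side."""
    return Protocol(f"({p1.expr} ⊕ {p2.expr})",
                    p1.inputs + p2.inputs, p1.outputs + p2.outputs)


# Hypothetical tasks used only to exercise the operators.
transcribe = task("Transcribe", "Gene", "mRNA")
translate = task("Translate", "mRNA", "Protein")
pipeline = successor(transcribe, translate)
both = split_merge(transcribe, translate)
```

Composing the tasks in the wrong order, `successor(translate, transcribe)`, raises `TypeError` because Protein is not a subtype of Gene in this toy ontology.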

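Protocol Design Model

• k-recursion: P = (P’)^k
– P’ has input i’ and output o’; P has input i and output o
– i ≤ i’ and o’ ≤ o
• star-recursion: P = (P’)*
– P’ has input i’ and output o’; P has input i and output o
– i ≤ i’ and o’ ≤ o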
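The recursion operators can be sketched in the same style. The slides give only the typing conditions, so this sketch records the operator in the expression and keeps the types of P’ unchanged. The `unroll` helper reads k-recursion as k successive applications of P’; that reading is an assumption, not something the slides state, and the `Refine` task is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    expr: str
    input_type: str
    output_type: str


def k_recursion(p, k):
    """P = (P’)^k, typed like P’ (i = i’ and o = o’)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return Protocol(f"({p.expr})^{k}", p.input_type, p.output_type)


def star_recursion(p):
    """P = (P’)*, typed like P’; the number of repetitions is left open."""
    return Protocol(f"({p.expr})*", p.input_type, p.output_type)


def unroll(p, k):
    """Assumed reading: (P’)^k as P’ · P’ · ... · P’ (k copies).

    Under this reading the output must be usable as the next input,
    approximated here by requiring identical type names.
    """
    if p.output_type != p.input_type:
        raise TypeError("output type cannot be fed back as input")
    return " · ".join([p.expr] * k)


# Hypothetical task whose output has the same conceptual type as its input.
refine = Protocol("Refine", "SeedData", "SeedData")
three_times = k_recursion(refine, 3)
any_times = star_recursion(refine)
```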

Protocol Implementation Model

• Similar to protocol design model
• Set of application names
– instead of design task names
• Set of format names
– instead of conceptual type names
• Imposes equality of format names
– instead of subtyping
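A small sketch of the difference at the implementation level: two applications can be chained only when the format names match exactly. The application names are taken from the germination text earlier in the talk; the format labels `Excel` and `Bitmap` and the class and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImplTask:
    expr: str           # application name, or a composed expression
    input_format: str   # format name of the input
    output_format: str  # format name of the output


def successor(first, second):
    """Chain two implementation protocols; formats must be equal, not subtypes."""
    if first.output_format != second.input_format:
        raise TypeError(
            f"{first.expr} produces {first.output_format}, "
            f"but {second.expr} expects {second.input_format}")
    return ImplTask(f"({first.expr} · {second.expr})",
                    first.input_format, second.output_format)


# Applications named in the germination text; format labels are assumed.
sample = ImplTask("sample numbers for DAPest.sas", "Excel", "Excel")
dapest = ImplTask("DAPest.sas", "Excel", "Excel")
graphing = ImplTask("graphing to print.sas", "Excel", "Bitmap")

preprocessing = successor(sample, dapest)
```

Trying `successor(graphing, dapest)` raises `TypeError`, since a bitmap cannot be passed to an application that expects a spreadsheet.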

Mapping Design to Implementation

• Conceptual type mapping
– Gene → Genbank format
– Gene → FASTA format
– SeedData → Excel spreadsheet
• Protocol design task mapping
– each design task is mapped to an implementation protocol
– consistent with the conceptual type mapping
• Protocol design mapping
– homomorphic extension
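The homomorphic extension can be pictured as a recursive walk over the design expression: each design task is replaced by its implementation protocol and the operators are kept as they are. The sketch below is illustrative only. The nested-tuple encoding, the task names D1 to D3 and I1 to I3, and the helper names are assumptions; the type mapping entries are the ones listed on the slide.

```python
# Conceptual type mapping from the slide: one type may map to several formats.
TYPE_MAP = {"Gene": {"Genbank", "FASTA"}, "SeedData": {"Excel"}}


def consistent(design_type, fmt):
    """True if format fmt is an allowed implementation of design_type."""
    return fmt in TYPE_MAP.get(design_type, set())


def map_design(expr, task_map):
    """Extend a task mapping to whole design expressions.

    Expressions are nested tuples: ("task", name), ("succ", left, right)
    or ("split", left, right). Operators are preserved; only the leaves
    are replaced by their implementation protocols.
    """
    op = expr[0]
    if op == "task":
        return task_map[expr[1]]
    return (op,) + tuple(map_design(sub, task_map) for sub in expr[1:])


# Hypothetical design D1 · (D2 ⊕ D3) and a hypothetical task mapping in
# which D2 is implemented by two applications run in succession.
design = ("succ", ("task", "D1"),
          ("split", ("task", "D2"), ("task", "D3")))
task_map = {
    "D1": ("task", "I1"),
    "D2": ("succ", ("task", "I2a"), ("task", "I2b")),
    "D3": ("task", "I3"),
}
implementation = map_design(design, task_map)
```

Because a design task may map to a composed implementation protocol, the resulting implementation expression can be larger than the design expression while keeping the same overall shape.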

Scientific Protocol

A pair of
• a protocol design
• a set of protocol implementations together with
– conceptual type mapping
– protocol design task mapping

Germination Protocol

[Figure: design tasks D1: MaxGermination, D2: Proportions, D3: Preprocessing, D4: BaseTemp, D5: BaseOptTemp; the type label SeedData appears twice in the diagram]

PD = (D1 · D2) ⊕ (D3 · (D4 ⊕ D5))
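The design expression PD can be built as a small expression tree. The following Python sketch is one possible encoding, not the representation used by ProtocolDB: the `Node` class and the choice of `>>` for successor and `+` for split-merge are assumptions. The task names are the ones on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    op: str                        # "task", "·" (successor) or "⊕" (split-merge)
    name: str = ""                 # task name, used only when op == "task"
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def __rshift__(self, other):   # p >> q builds the successor p · q
        return Node("·", left=self, right=other)

    def __add__(self, other):      # p + q builds the split-merge p ⊕ q
        return Node("⊕", left=self, right=other)

    def render(self, top=True):
        """Print the expression, parenthesizing every nested operator."""
        if self.op == "task":
            return self.name
        text = f"{self.left.render(False)} {self.op} {self.right.render(False)}"
        return text if top else f"({text})"

    def tasks(self):
        """List the design tasks in left-to-right order."""
        if self.op == "task":
            return [self.name]
        return self.left.tasks() + self.right.tasks()


D1, D2, D3, D4, D5 = (Node("task", f"D{n}") for n in range(1, 6))

# Python gives + higher precedence than >>, so the grouping is written
# out explicitly to match the slide.
PD = (D1 >> D2) + (D3 >> (D4 + D5))
```

Calling `PD.render()` reproduces the expression on the slide, and `PD.tasks()` lists the five design tasks, which is the kind of traversal that storage and querying of protocols would need.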

Germination Protocol

[Figure: design tasks D1, D2, D3, D4, D5 shown with implementation tasks I10: Pho341.sas, I11: Pho341.sas, I12: Broken.sas, repeated k times]

(I10 · I11 · I12)^k

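Germination Protocol

[Figure: design tasks D1, D2, D3, D4, D5 shown with implementation tasks I7: Reg.sas (repeated m times), I8: Merge.sas, I9: Mixed.sas, and I10: Pho341.sas, I11: Pho341.sas, I12: Broken.sas (repeated k times)]

(I7^m · I8 · I9) ⊕ (I10 · I11 · I12)^k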

Benefits of our approach

• Scientific protocols are modeled at two levels
– design
– implementation
• One design may have different implementations
– easier to compare results
– facilitates integration

ProtocolDB

http://bioinformatics.eas.asu.edu/protocoleDatabase.htm

Future work

• Operator semantics
• Extending model with data provenance
• Querying data provenance
• Querying of protocols
– retrieval of similar protocols
• Further development of ProtocolDB

http://bioinformatics.eas.asu.edu/protocoleDatabase.htm