Presentation to the J. Craig Venter Institute, Dec. 2014

Post on 14-Jul-2015



“Shopping for data should be as easy as

shopping for shoes!”

Dr. Carole Goble

Professor, Dept. of Computer Science

University of Manchester

“A little bit of semantics goes a long way”

Dr. James Hendler

Artificial Intelligence Researcher

Rensselaer Polytechnic Institute

One of the originators of the Semantic Web

…but a lot of semantics goes a long, long way!

Mark Wilkinson

Isaac Peral Distinguished Researcher; Director, Fundación BBVA Chair in Biological Informatics

Center for Plant Biotechnology and Genomics, Technical University of Madrid

Making the Web a

biomedical research platform

from hypothesis through to publication

Publication

Discourse

Hypothesis

Experiment

Interpretation


Motivation:

3 intersecting trends in the Life Sciences

that are now, or soon will be,

extremely problematic

NON-REPRODUCIBLE SCIENCE & THE FAILURE OF PEER REVIEW

TREND #1

Trend #1

Multiple recent surveys of high-throughput biology

reveal that upwards of 50% of published studies

are not reproducible

- Baggerly, 2009

- Ioannidis, 2009

Similar (if not worse!) in clinical studies

- Begley & Ellis, Nature, 2012

- Booth, Forbes, 2012

- Huang & Gottardo, Briefings in Bioinformatics, 2012

Trend #1

“the most common errors are simple,

the most simple errors are common”

At least partially because the

analytical methodology was inappropriate

and/or not sufficiently described

- Baggerly, 2009

Trend #1

These errors pass peer review

The researcher is (sometimes) unaware of the error

The process that led to the error is not recorded

Therefore it cannot be detected during peer-review

Agencies have Noticed!

In March 2012, the US Institute of Medicine said, in effect:

“Enough is enough!”

Agencies have Noticed!

Institute of Medicine Recommendations

For Conduct of High-Throughput Research:

Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies, Report Brief, March 2012.

1. Rigorously-described, -annotated, and -followed data

management and manipulation procedures

2. “Lock down” the computational analysis pipeline once it

has been selected

3. Publish the analytical workflow in a formal manner,

together with the full starting and result datasets

BIGGER, CHEAPER DATA

TREND #2

Trend #2

High-throughput technologies are becoming

cheaper and easier to use

Trend #2

High-throughput technologies are becoming

cheaper and easier to use

But there are still very few experts trained in

statistical analysis of high-throughput data

Trend #2

The number of job postings for data scientist

positions increased by 15,000% between the

summers of 2011 and 2012

-- Indeed.com job trends data reported by

http://blogs.nature.com/naturejobs/2013/03/18/so-you-want-to-be-a-data-scientist

Trend #2

Therefore

Even small, moderately-funded laboratories

can now afford to produce more data

than they can manage or interpret

Trend #2

Therefore

Even small, moderately-funded laboratories

can now afford to produce more data

than they can manage or interpret

These labs will likely never be able to afford

a qualified data scientist

“THE SINGULARITY”

TREND #3


Trend #3

The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam, et al., in The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey (Editor), 2009

Slide Borrowed with Permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA

June 22, 2012.

“The Singularity”

The X-intercept is the point where, the moment a discovery is made,

it is immediately put into practice

Scientific research would have to be

conducted within a medium that

immediately interpreted

and disseminated the results...

You Are Here

...in a form that immediately (actively!) affected the

results of other researchers...


...without requiring them to be aware

of these new discoveries.


3 intersecting and problematic trends

Non-reproducible science that passes peer-review

Cheaper production of larger and more complex datasets

that require specialized expertise to analyze properly

Need to more rapidly disseminate and use new discoveries

We Want More!

I don’t just want to reproduce

your experiment...

I want to re-use your experiment

In my own laboratory... On MY DATA!

When I do my analysis

I want to draw on the knowledge

of global domain-experts like

statisticians and pathologists...

...as if they were mentors sitting

in the chair beside me.

Image from: Mark Smiciklas

Intersection Consulting, cc-nca

Please don’t make me find

all of the data and knowledge

that I require to do my experiment

...it simply isn’t possible anymore...

Image from AJ Cann

cc-by-a license

I want to support peer review(ers)

so that I do better science.

How do we get there from here?

To overcome these intersecting problems

and to achieve the goals of transparent

reproducible research

We must learn how to

do research IN the Web

Not OVER the Web

How we use

The Web today

The Web is not a pigeon!

Semantic Web Technologies

The Web

The Semantic Web

causally related to

This is the critical bit!

causally related to

The link is explicitly labeled!

???

http://semanticscience.org/resource/SIO_000243

SIO_000243:

<owl:ObjectProperty rdf:about="&resource;SIO_000243">
  <rdfs:label xml:lang="en">is causally related with</rdfs:label>
  <rdf:type rdf:resource="&owl;SymmetricProperty"/>
  <rdf:type rdf:resource="&owl;TransitiveProperty"/>
  <dc:description xml:lang="en">A transitive, symmetric, temporal relation in which one entity is causally related with another non-identical entity.</dc:description>
  <rdfs:subPropertyOf rdf:resource="&resource;SIO_000322"/>
</owl:ObjectProperty>

causally related with


Semantic Web Technologies

“deep semantics”

Deep Semantics?

Ontology Spectrum (lightweight to heavyweight semantics):

Catalog/ID → Terms/glossary → Thesauri ("narrower term" relation) → Informal is-a → Formal is-a → Formal instance → Frames (Properties) → Value Restrictions → Selected Logical Constraints (disjointness, inverse, …) → General Logical constraints

Originally from the AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html

Ontology Spectrum (repeated): most biomedical ontologies, e.g. the Gene Ontology, sit toward the lightweight end of the spectrum.

Ontology Spectrum (repeated): the ontologies being used in today's talk sit further toward the logically-constrained end of the spectrum; most biomedical ontologies, e.g. the Gene Ontology, toward the lightweight end.

Ontology Spectrum (repeated): the lightweight end of the spectrum gives Categorization Systems (like library shelves, inflexible); the richer end gives Discovery & Interpretation systems (flexible!)

Remember, this is the critical bit!

http://semanticscience.org/resource/SIO_000243

causally related with

It’s relationships that make

the Semantic Web “Semantic”

Semantic Web Technologies

“deep semantics”

Even with “deep semantics”

a lot of important information cannot be represented

on the Semantic Web

For example, all of the data that results from

analytical algorithms and statistical analyses

Varying estimates

put the size of the

Deep Web between

500 and 800 times

larger than the

surface Web

On the WWW

“automation” of

access to Deep Web

data happens through

“Web Services”


There are many suggestions for how to bring the Deep Web

into the Semantic Web using Semantic Web Services (SWS)

Describe input data

Describe output data

Describe how the system manipulates the data

Describe how the world changes as a result

None, so far, has proven to be wildly successful

(in my opinion)

…because describing what a Service does is HARD!

Lord, Phillip, et al. The Semantic Web – ISWC 2004 (2004): 350-364.

Scientific Web Services are DIFFERENT!

“The service interfaces within bioinformatics are relatively simple. An extensible or constrained interoperability framework is likely to suffice for current demands: a fully generic framework is currently not necessary.”

They’re simpler!

So perhaps we can solve the Semantic Web Service problem

as it pertains to this (important!) domain

With respect to the Semantic Web

What is missing from this list?

Describe input data

Describe output data

Describe how the system manipulates the data

Describe how the world changes as a result

http://semanticscience.org/resource/SIO_000243

causally related with

The Semantic Web gets its semantics from relationships

In 2008 I published a set of design-patterns for scientific Semantic Web Services that focuses on the biological relationship that the Service “exposes”

Design Pattern for

Web Services on the Semantic Web

[Diagram: a BLAST Web Service consumes the sequence “AACTCTTCGTAGTG...”. With SADI, the output decorates the input node (sequence has_seq_string “AACTCTTCGTAGTG...”) with a “has homology to” link to Terminal Flower (type: gene; species: A. thal.)]

SADI requires you to explicitly declare, as part of your analytical output, the biological relationship that your algorithm “exposed”.
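The pattern is easy to sketch in code. Here is a toy illustration (in Python, since the pattern itself is language-neutral) of the key idea: the service's output decorates the very input node it received. All predicate names, URIs, and the blast_service function are hypothetical, not the real SADI API:

```python
# Toy sketch of the SADI design pattern (all names and URIs hypothetical):
# a SADI service does not return an opaque result document; it returns new
# triples whose SUBJECT is the input node it received, explicitly attaching
# the biological relationship ("has homology to") the algorithm exposed.

def blast_service(input_uri, triples):
    """Consume a node carrying has_seq_string; decorate it with new triples."""
    seq = triples[(input_uri, "has_seq_string")]
    assert seq  # the sequence would be sent to BLAST here (omitted)
    hit = "gene:TerminalFlower"
    return {
        (input_uri, "has_homology_to"): hit,  # the exposed relationship
        (hit, "type"): "gene",
        (hit, "species"): "taxon:3702",       # A. thaliana
    }

# The service's output is merged back into the caller's graph,
# decorating (not replacing) the original input node.
graph = {("seq:1", "has_seq_string"): "AACTCTTCGTAGTG..."}
graph.update(blast_service("seq:1", graph))
```

Because the output is triples about the input node, the caller can simply merge the result into its existing graph; nothing needs to be re-identified or re-linked.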

I want to share several stories that demonstrate

the cool things that happen when you use

SADI + deep semantics

The Semantic Health

and Research Environment

Story #1: SHARE

A proof-of-concept workflow orchestrator

+ SADI Semantic Web Service registry

Objective: answer biologists’ questions

The SHARE registry

indexes all of the input/output/relationship

triples that can be generated by all known services

This is how SHARE discovers services

SHARE demonstrations

with increasing semantic complexity

What is the phenotype of every allele of the

Antirrhinum majus DEFICIENS gene

SELECT ?allele ?image ?desc
WHERE {
  locus:DEF genetics:hasVariant ?allele .
  ?allele info:visualizedByImage ?image .
  ?image info:hasDescription ?desc
}


The query language here is SPARQL

The W3C-approved, standard query language for the Semantic Web


Note that there is no “FROM” clause!

We don’t tell it where it should get the information,

The machine has to figure that out by itself...


Starting data: the locus “DEF” (Deficiens)


Query: a series of relationships vis-à-vis DEF

Enter that query into

SHARE

Click “Submit”...

...and in a few seconds you get your answer.

Based on the relationships in your query, SHARE queried its registry

to automatically discover SADI Services capable of generating those triples

Because it is the Semantic Web

The query results are live hyperlinks

to the respective Database or images

(The answer is IN the Web!)

What pathways does UniProt protein P47989 belong to?

PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>

SELECT ?gene ?pathway
WHERE {
  uniprot:P47989 pred:isEncodedBy ?gene .
  ?gene ont:isParticipantIn ?pathway .
}


Note again that there is no “From” clause…

I have not told SHARE where to look for the

answer, I am simply asking my question

Enter that query into

SHARE

Two different providers of gene information (KEGG & NCBI) were found & accessed

Two different providers of pathway information (KEGG and GO) were found & accessed

The results are all links to the original data (The answer is IN the Web!)

Show me the latest Blood Urea Nitrogen (BUN) and Creatinine levels of patients who appear to be rejecting their transplants

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX patient: <http://sadiframework.org/ontologies/patients.owl#>
PREFIX l: <http://sadiframework.org/ontologies/predicates.owl#>

SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
  ?patient rdf:type patient:LikelyRejecter .
  ?patient l:latestBUN ?bun .
  ?patient l:latestCreatinine ?creat .
}

Likely Rejecter:

A patient who has creatinine levels

that are increasing over time

-- Mark D Wilkinson’s definition

Likely Rejecter:

…but there is no “likely rejecter”

column or table in our database…

only blood chemistry measurements

at various time-points

Likely Rejecter:

So the data required to answer this question

DOESN’T EXIST!

My definition of a Likely Rejecter is encoded in

a machine-readable document written in the OWL Ontology language

Basically:

“the regression line over creatinine measurements should have an increasing slope”

Our ontology refers to other ontologies (possibly published by other people)

to learn about what the properties of “regression models” are

e.g. that regression models have slopes and intercepts

and that slopes and intercepts have decimal values
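As a rough illustration of what that definition amounts to operationally, here is the "increasing slope" rule reduced to plain Python. This is a sketch under my own reading of the slides, not the actual SHARE code:

```python
# Sketch (not the actual SHARE/SADI code) of the "Likely Rejecter" rule:
# "the regression line over creatinine measurements should have an
# increasing slope", reduced to plain Python over (time, value) pairs.

def regression_slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

def is_likely_rejecter(creatinine_series):
    """creatinine_series: [(days_since_transplant, creatinine_level), ...]"""
    return regression_slope(creatinine_series) > 0
```

The point of the OWL encoding is that this logic lives in a shared, machine-readable model on the Web rather than in anyone's private script.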

?

Enter that query into

SHARE

SHARE examines the query

Burrows around the Web reading the various ontologies

then uses the discovered Class definitions as a template to map a path from what it has, to what it needs, using

SADI services

Based on the Class definition

SHARE decides that it needs to do a

Linear Regression analysis

on the blood creatinine measurements

?

The conversation between SHARE and the registry

reveals the use of “Deep Semantics”

Q: Is there a SADI service that will consume instances of Patient and give

me instances of LikelyRejector

A: No

Q: Okay... So LikelyRejectors need a regression model of increasing slope

over their BloodCreatinine, so... Is there a SADI service that will consume

BloodCreatinine over time and give me its linear regression model?

A: No

Q: Okay... Blood Creatinine over time is a subclass of data of type

X/Y coordinate, so is there a service that consumes X/Y data and

returns its regression model?

A: Yes here’s the URL.
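The back-and-forth above can be sketched as a registry lookup keyed on (input class, relationship, output class) triples, with a fallback walk up the input-class hierarchy. All class names, predicates, and the service URL below are hypothetical:

```python
# Sketch of SADI-style service discovery (all class names, predicates, and
# the service URL are hypothetical). The registry is keyed on
# (input class, relationship, output class) triples; when no service matches
# the exact input class, we generalize up the class hierarchy, mirroring the
# "BloodCreatinine over time IS-A X/Y coordinate data" step above.

SUPERCLASS = {
    "Patient": None,
    "BloodCreatinineSeries": "XYData",  # creatinine-over-time IS-A x/y data
    "XYData": None,
}

REGISTRY = {
    ("XYData", "hasRegressionModel", "LinearRegressionModel"):
        "http://example.org/services/linear-regression",
}

def find_service(input_class, relationship, output_class):
    """Walk up the input-class hierarchy until a matching service is found."""
    cls = input_class
    while cls is not None:
        url = REGISTRY.get((cls, relationship, output_class))
        if url is not None:
            return url
        cls = SUPERCLASS.get(cls)  # generalize and ask again
    return None
```

In the real system the "hierarchy walk" is OWL reasoning over published ontologies rather than a hard-coded dictionary, but the shape of the search is the same.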

The SHARE system utilizes SADI to discover

analytical services on the Web that do linear regression analysis

and sends the data to be analyzed

This happens iteratively (e.g. SHARE also has to examine the slope of the regression line using another service, find the “latest” in a series of time measurements, etc.)

There is reasoning after every Service invocation

(i.e. after every clause in the query)

Once it is able to find instances (OWL Individuals)

of the LikelyRejector class, it continues with the

rest of the query

VOILA!

The way SHARE “interprets” data varies

depending on the context of the query

(i.e. which ontologies it reads – Mine? Yours?)

and on what part of the query

it is trying to answer at any given moment

(which ontological concept is relevant to that clause)

Example?

Blood Creatinine measurements

were not dictated to be

Blood Creatinine measurements

Example?

The data had the ‘qualities/properties’ that

allowed one machine to interpret

that they were Blood Creatinine measurements

(e.g. to determine which patients were rejecting)

Example?

But the data also had the ‘qualities/properties’ that

allowed another machine to interpret them as

Simple X/Y coordinate data

(e.g. the Linear Regression calculation tool)

Benefit

of Deep Semantics

Data is amenable to

constant re-interpretation

http://www.flickr.com/people/faernworks/

One example of the “little ways”

that Semantics will help researchers

day-by-day

Story #2: Measurement Units

Units must be harmonized

Don’t leave this up to the researcher (it’s fiddly, time-consuming, and error-prone)

NASA Mars Climate Orbiter

Oops!

ID   HEIGHT  WEIGHT  SBP   CHOL  HDL  BMI GR  SBP GR  CHOL GR  HDL GR
pt1  1.82    177     128   227   55   0       0       1        0
pt2  179     196     13.4  5.9   1.7  1       0       1        0

The Reality of Clinical Datasets

(this is a small snapshot of a dataset we worked on,

courtesy of Dr. Bruce McManus & Janet McManus, from the PROOF COE)

Height in m and cm; Chol in mmol/l and mg/l

...and other delicious weirdness

The clinical analyses described here

were supported in part by the

PROOF Center of Excellence

for the Prevention of Organ Failure

GOAL: reduce the likelihood of errors by

getting the clinical researcher

“out of the loop”

(as per the Institute of Medicine Recommendations)

Experiment:

Reproduce a clinical study

(from >10 years ago)

by logically encoding

the clinical diagnosis guidelines

of the American Heart Association

then ask SHARE to automatically

analyse the patient clinical data

Semantically defining globally-accepted clinical phenotypes;

Building on the expertise of others

SystolicBloodPressure =
  GALEN:SystolicBloodPressure and
  ("sio:has measurement value" some
    ("sio:measurement" and
      ("sio:has unit" some
        ("om:unit of measure" and ("om:dimension" value "om:pressure or stress dimension"))) and
      ("sio:has value" some rdfs:Literal)))

GALEN is a popular biomedical ontology

but it is largely, like GO, a series of

named but undefined Classes


So we use OWL to extend the GALEN Classes with rich, logical descriptors that take advantage of semantic relationships like “has measurement value”, “dimension”, and “has unit”


Very general definition

“some kind of pressure unit”

(so that others can build on this as they wish!)

HighRiskSystolicBloodPressure (as defined by Framingham) =
  SystolicBloodPressure and
  (sio:hasMeasurement some
    (sio:Measurement and
      (sio:hasUnit value om:kilopascal) and
      (sio:hasValue some double[>= "18.7"^^double])))

Now we are specific to our clinical study (Framingham definitions):

MUST be in kilopascals and must be >= 18.7

Semantically defining globally-accepted clinical phenotypes;

Building on the expertise of others

SELECT ?record ?Pressure
FROM <./patient.rdf>
WHERE {
  ?record rdf:type measure:HighRiskSystolicBloodPressure .
  ?record sio:hasMeasurement ?measurement .
  ?measurement sio:hasValue ?Pressure .
}

RecordID Start Val Start Unit Pressure End Unit

Pt1 15 cmHg 19.998 KiloPascal

Pt2 14.6 cmHg 19.465 KiloPascal

Pt1 148 mmHg 19.731 KiloPascal

Pt2 146 mmHg 19.465 KiloPascal

Running the Clinical Analysis

“Select the patients who are at-risk”

All measurements have now been automatically

harmonized to KiloPascal, because we encoded the

semantics in the model
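Mechanically, the harmonization step is trivial once every value carries a machine-readable unit; the hard part the semantic model solves is knowing which unit each value is in. A minimal sketch (conversion factors are standard physics, not taken from the talk's ontology):

```python
# Sketch of the unit-harmonization step: once each blood-pressure value
# carries a machine-readable unit (as in the semantic model above), the
# conversion itself is mechanical. Conversion factors are standard
# (1 mmHg = 0.133322 kPa; 1 cmHg = 10 mmHg), not taken from the ontology.

TO_KPA = {
    "mmHg": 0.133322,
    "cmHg": 1.33322,
    "kPa": 1.0,
}

def to_kilopascal(value, unit):
    """Convert a pressure reading to kilopascals, rounded to 3 decimals."""
    return round(value * TO_KPA[unit], 3)

readings = [("Pt1", 15, "cmHg"), ("Pt2", 14.6, "cmHg"), ("Pt2", 146, "mmHg")]
harmonized = [(pid, to_kilopascal(v, u), "kPa") for pid, v, u in readings]
```

This reproduces the kind of cmHg/mmHg-to-KiloPascal conversions shown in the results table above.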

While doing this experiment, we noticed

some interesting anomalies…

Visual inspection of our output data and the AHA guidelines

showed that in many cases the clinician

“tweaked” the guidelines when doing their analysis

------------------

AHA BMI risk threshold: BMI=25

In our dataset the clinical researcher used BMI=26

------------------

AHA HDL guideline HDL<=1.03mmol/l

The dataset from our researcher: HDL<=0.89mmol/l

-------------------

Visual inspection of our output data and the AHA guidelines

showed that in many cases the clinician

“tweaked” the guidelines when doing their analysis

These Alterations Were Not Recorded

in Their Study Notes!

Adjusting our Semantic definitions and re-running the analysis

resulted in nearly 100% correspondence with the clinical researcher

HighRiskCholesterolRecord=

PatientRecord and

(sio:hasAttribute some

(cardio:SerumCholesterolConcentration and

sio:hasMeasurement some ( sio:Measurement and

(sio:hasUnit value cardio:mili-mole-per-liter) and

(sio:hasValue some double[>= 5.0]))))

HighRiskCholesterolRecord=

PatientRecord and

(sio:hasAttribute some

(cardio:SerumCholesterolConcentration and

sio:hasMeasurement some ( sio:Measurement and

(sio:hasUnit value cardio:mili-mole-per-liter) and

(sio:hasValue some double[>= 5.2]))))

Reflect on this for a second... Because this is important!

1. We semantically encoded clinical guidelines

2. We found that clinical researchers did not follow the official guidelines

3. Their “personalization” of the guidelines was unreported

4. Nevertheless, we were able to create “personalized” Semantic Models

5. These models reflect the opinion of an individual domain-expert

6. These models are shared on the Web

7. Can be automatically re-used by others to interpret their own data using

that clinical expert’s viewpoint

AHA:HighRiskCholesterolRecord

PatientRecord and

(sio:hasAttribute some

(cardio:SerumCholesterolConcentration and

sio:hasMeasurement some ( sio:Measurement and

(sio:hasUnit value cardio:mili-mole-per-liter) and

(sio:hasValue some double[>= 5.0]))))

McManus:HighRiskCholesterolRecord

PatientRecord and

(sio:hasAttribute some

(cardio:SerumCholesterolConcentration and

sio:hasMeasurement some ( sio:Measurement and

(sio:hasUnit value cardio:mili-mole-per-liter) and

(sio:hasValue some double[>= 5.2]))))

PREFIX AHA: <http://americanheart.org/measurements/>
PREFIX McManus: <http://stpaulshospital.org/researchers/mcmanus/>

To do the analysis using AHA guidelines

SELECT ?patient ?risk

WHERE {

?patient rdf:type AHA:HighRiskCholesterolRecord .

?patient ex:hasCholesterolProfile ?risk

}

To do the analysis using McManus’ expert-opinion

SELECT ?patient ?risk

WHERE {

?patient rdf:type McManus:HighRiskCholesterolRecord .

?patient ex:hasCholesterolProfile ?risk

}

Flexibility Transparency

Reproducibility Shareability Comparability

Simplicity Automation

Personalization

(I’m going to return to this point several times)

Reproduce a peer-reviewed

scientific publication

by semantically modelling

the problem

Story #3: in silico Science

The Publication: Discovering Protein Partners of a Human Tumor Suppressor Protein

Original Study Simplified

Using what is known about protein interactions

in fly & yeast

predict new interactions with this

Human Tumor Suppressor

Semantic Model of the Experiment

OWL

Note that every word in this

diagram is, in reality, a URL

(it’s a Semantic Web model)

i.e. It refers to the expertise of

other researchers, distributed

around the world on the Web

Semantic Model of the Experiment

In a local data-file

provide the protein we are interested in

and the two species we wish to use in our comparison

taxon:9606 a i:OrganismOfInterest . # human

uniprot:Q9UK53 a i:ProteinOfInterest . # ING1

taxon:4932 a i:ModelOrganism1 . # yeast

taxon:7227 a i:ModelOrganism2 . # fly

Set-up the Experimental Conditions

SELECT ?protein

FROM <file:/local/workflow.input.n3>

WHERE {

?protein a i:ProbableInteractor .

}

Run the Experiment

This is the URL that leads our computer

to the Semantic model of the problem

SHARE examines the semantic model of

Probable Interactors

Retrieves third-party expertise from the Web

Discusses with SADI

what analytical tools are necessary

Chooses the right tools for the problem

Solves the problem!

SHARE derives (and executes) the following analysis automatically

SHARE is aware of the context of the specific question being asked

There are five very cool things about what you just saw...

1. SHARE was able to create a workflow based on a semantic model

2. SHARE was able to create a COMPUTATIONAL workflow based on a BIOLOGICAL model

(this is important because we want this system to be used by clinicians and biologists who don’t speak computerese!)

3. The workflow it created, and services selected, differed depending on the context of the question

taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly


The machine was contextually “aware of”

BOTH the biological model

AND the data it was analysing

(...remember this... It will be important later!)

4. The ontological model was abstract (and shareable!), but the workflow generated from that model was explicit and concrete

This matters because…

Remember Trend #1

“the most common errors are simple, the most simple errors are common”

At least partially because the analytical methodology was inappropriate and/or not sufficiently described

Here, the methodology leading to a result is explicit

and automatically constructed from an abstract template

so this is (at least in part) a

Solved Problem

5. The choice of tool-selection was guided by the knowledge of worldwide domain-experts encoded in globally-distributed ontologies

(e.g. expert high-throughput statisticians, etc...)

And this matters because…

Remember Trend #2

Even small, moderately-funded laboratories can now afford to produce more data than they can manage or interpret

These labs will likely never be able to afford a qualified data scientist

But if the expert knowledge of data scientists is

encoded in ontologies, and can be discovered

in a contextually-aware manner… then this is a

SOLVED PROBLEM
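The "contextually-aware" discovery described above can be sketched in a few lines of plain Python: a registry maps each service to the OWL class of input it consumes, and services are selected by matching the semantic type of the data. All class URIs and service names below are invented for illustration; a real SADI registry resolves this via the services' published OWL interface definitions.

```python
# Minimal sketch of ontology-guided service selection.
# All URIs and service names here are hypothetical.

# A registry mapping each service to the OWL class of input it consumes.
SERVICE_REGISTRY = {
    "getGeneExpressionStats": "http://example.org/onto#MicroarrayDataset",
    "normalizeExpression":    "http://example.org/onto#MicroarrayDataset",
    "annotateVariants":       "http://example.org/onto#VariantCallSet",
}

def discover_services(data_classes):
    """Return the services whose declared input class matches
    one of the rdf:type classes asserted on the data."""
    return sorted(
        name for name, input_class in SERVICE_REGISTRY.items()
        if input_class in data_classes
    )

# The data advertises its semantic type(s), and matching services follow:
my_data_types = {"http://example.org/onto#MicroarrayDataset"}
print(discover_services(my_data_types))
```

The point of the sketch: the lab never chooses an analysis tool by hand; the match between data type and expert-curated input class does the choosing.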

Story #4: Personalized Health Info

Can we make the health information

on the Web

more “personal”?

Remember when I said...

The machine was contextually “aware of”

BOTH the biological model

AND the data it was analysing

This “dual-awareness” provides some

very interesting opportunities

for personalizing a patient’s Health Research activity

PROBLEM:

Patients are self-educating:

both about their personal medical situation

(e.g. getting themselves sequenced)

and surfing the Web, getting dubious advice

from sites of dubious authority

and joining social-health groups

to exchange (often anecdotal)

medical “advice” with other patients

PROBLEM:

Patients are self-educating

The information on any given site

may or may not

be relevant to THAT patient

Information on the Web is, by nature, not personalized

PROBLEM:

Clinicians often have patients

(especially chronically-ill patients)

on a “trajectory” of treatment

Medicine is complicated!

e.g. the treatment trajectory of the patient can be

multi-step, and a specific sign/symptom might be

perfectly normal at a particular phase in their

“flow” of treatment

PROBLEM SUMMARY

Patients are reading non-personalized medical text

of dubious quality and relevance

Clinicians have no way to intervene

in this self-education process,

explaining to patients how the information they read

relates to their personal “health trajectory”

Now you might see why this is so relevant!

The machine was contextually “aware of”

BOTH the biological model

AND the data it was analysing

This is an early prototype of a

Patient-driven Personalized Medicine

Web interface

Basically, it is a set of SHARE queries

Attached to a local database

of patient information

Running behind a Web bookmarklet

The queries text-mine a Web page

then compare the concepts in the page

to the patient’s personal data

using a SHARE query

(that could contain ontologies...

...ontologies designed by their clinician!!)

Matching based on official name, compound name, brand name, trade name,

or “common name”
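That multi-name matching step amounts to normalizing every surface form to one canonical concept before comparison. The synonym table below is invented for illustration; a real system would draw on a drug vocabulary such as RxNorm.

```python
# Minimal sketch of multi-name drug matching.
# The synonym table is hypothetical, for illustration only.
SYNONYMS = {
    "acetylsalicylic acid": "aspirin",    # compound name
    "asa":                  "aspirin",    # common abbreviation
    "glucophage":           "metformin",  # brand name
}

def canonical_drug_name(mention):
    """Map any surface form found in page text to a canonical name."""
    m = mention.strip().lower()
    return SYNONYMS.get(m, m)

print(canonical_drug_name("Glucophage"))            # metformin
print(canonical_drug_name("Acetylsalicylic Acid"))  # aspirin
```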

Still needs some work...

??!?!?

Link out to PubMed

Why the alert?

The SADI+SHARE workflow and reasoning was

personalized to YOUR medical data

In future iterations, we will enable the workflow

to be further customized through “personalized”

OWL Classes (e.g. Provided by your Clinician!!)

These OWL Classes might include information about the

current trajectory of your treatment for a chronic disease,

for example, such that what you read on the Web is

placed in the context of your expert Clinical care...

Frankly, I think it’s quite cool that

patients

are creating and running

“personal health-research” workflows

at the touch of a button!

Almost the end…

Three brief final points....

Publication

Discourse

Hypothesis

Experiment

Interpretation

??

The Semantic Model represents

a possible solution to a problem

By my definition, that is a hypothesis

That hypothesis is tested by automatically converting it into a workflow;

the workflow, and the results of the workflow are intimately tied to the hypothesis

i.e. You (or anyone!) can determine exactly which aspect

of the hypothesis led to which output data element, why, and how

“Exquisite Provenance”

a perfect record not only of what was done, when, and how

but also WHY

And this is important because...

“Exquisite Provenance”

is required

for the output data and knowledge

to be published as...

Richly annotated, citable, and queryable snippets of

scientific knowledge encoded in Linked Data/OWL

i.e. a way to publish data and knowledge on the Semantic Web

Publication

Discourse

Hypothesis

Experiment

Interpretation

A “modest” vision for

pure in silico Science

Last point… perhaps this is not yet obvious…

SADI services consume Linked Data on the Web

The ontologies provided to SHARE are

written in OWL, and are therefore

inherently part of the Web

SADI services create novel semantic links

between existing data-points on the Web, or

between existing data and new data

The output of the automatically-generated workflow

is therefore Linked Data

and is therefore inherently part of the Web

The concluding NanoPublications are a combination

of Linked Data and OWL, and are published directly to the Web

The Life Science “Singularity”

The Semantic Web is a cradle-to-grave

biomedical research platform

that can, and will, dramatically improve

how biomedical research is done

We Are

Here!

The important people

Luke McCarthy

(SADI/SHARE)

Benjamin Vandervalk

(SHARE)

Dr. Soroush Samadian

(clinical experiments)

Ian Wood

(Experiment-replication experiment)

Microsoft Research