Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies


Mark Wilkinson

World Research & Innovation Congress, Brussels, 2013

Isaac Peral Senior Researcher in Biological Informatics, Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain

Adjunct Professor of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.

Making the Web a biomedical research platform

from hypothesis through to publication

[Figure: the research cycle, labelled Publication, Discourse, Hypothesis, Experiment, Interpretation]

Motivation:

3 intersecting trends in the Life Sciences

that are now, or soon will be, extremely problematic

TREND #1: NON-REPRODUCIBLE SCIENCE & THE FAILURE OF PEER REVIEW

Multiple recent surveys of high-throughput biology

reveal that upwards of 50% of published studies

are not reproducible

- Baggerly, 2009
- Ioannidis, 2009

Similar (if not worse!) in clinical studies

- Begley & Ellis, Nature, 2012
- Booth, Forbes, 2012

- Huang & Gottardo, Briefings in Bioinformatics, 2012


“the most common errors are simple, the most simple errors are common”

At least partially because the analytical methodology was inappropriate

and/or not sufficiently described

- Baggerly, 2009


These errors pass peer review

The researcher is (sometimes) unaware of the error

The process that led to the error is not recorded

Therefore it cannot be detected during peer-review

Agencies have Noticed!

In March 2012, the US Institute of Medicine said, in effect:

“Enough is enough!”


Institute of Medicine Recommendations for Conduct of High-Throughput Research:

Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies, Report Brief, March 2012.

1. Rigorously-described, -annotated, and -followed data management and manipulation procedures

2. “Lock down” the computational analysis pipeline once it has been selected

3. Publish the analytical workflow in a formal manner, together with the full starting and result datasets
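
One way to picture recommendations 2 and 3 is to fingerprint the pipeline description together with its input data, so that any later change is detectable. The following is a minimal sketch only: the `pipeline` dictionary, its step names and parameters, and the `fingerprint` helper are hypothetical and are not taken from the Institute of Medicine report.

```python
import hashlib
import json


def fingerprint(pipeline: dict, input_data: bytes) -> str:
    """Return a SHA-256 digest covering the pipeline description and its input."""
    # Serialise with sorted keys so the same pipeline always yields the same bytes.
    canonical = json.dumps(pipeline, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256()
    digest.update(canonical)
    digest.update(input_data)
    return digest.hexdigest()


# Hypothetical pipeline description; step names and parameters are made up.
pipeline = {
    "steps": [
        {"name": "normalise", "method": "quantile"},
        {"name": "test", "method": "t-test", "alpha": 0.05},
    ]
}
data = b"sample_id,value\ns1,0.42\ns2,0.57\n"

# The fingerprint recorded at the moment the pipeline is "locked down".
locked = fingerprint(pipeline, data)

# Changing any parameter afterwards produces a different fingerprint.
tweaked = json.loads(json.dumps(pipeline))  # deep copy via JSON round-trip
tweaked["steps"][1]["alpha"] = 0.01
changed = fingerprint(tweaked, data)
```

Publishing `locked` alongside the workflow and datasets would let a reviewer confirm that the analysis they re-run is the one that was reported.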

TREND #2: BIGGER, CHEAPER DATA

High-throughput technologies are becoming cheaper and easier to use


But there are still very few experts trained in statistical analysis of high-throughput data


Therefore

Even small, moderately-funded laboratories can now afford to produce more data

than they can manage or interpret

TREND #3: “THE SINGULARITY”

The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam et al., in The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey (Editor), 2009

Slide adapted with permission from Joanne Luciano, presentation at Health Web Science Workshop 2012, Evanston IL, USA, June 22, 2012.


“The Singularity”

The X-intercept is where, the moment a discovery is made, it is immediately put into practice




Scientific research would have to be conducted within a medium that

immediately interpreted and disseminated the results...

You Are Here

...in a form that immediately (actively!) affected the results of other researchers...


...without requiring them to be aware of these new discoveries.


3 intersecting and problematic trends

Non-reproducible science that passes peer-review

Cheaper production of larger and more complex datasets that require specialized expertise to analyze properly

Need to more rapidly disseminate and use new discoveries

We Want More!

I don’t just want to reproduce your experiment...

I want to re-use your experiment

In my own laboratory... On MY DATA!

When I do my analysis

I want to draw on the knowledge

of global domain-experts like

statisticians and pathologists...

...as if they were mentors sitting

in the chair beside me.

Image from: Mark Smiciklas, Intersection Consulting, cc-nca

Please don’t make me find

all of the data and knowledge

that I require to do my experiment

...it simply isn’t possible anymore...

Image from AJ Cann, cc-by-a license

I want to support peer review(ers) so that I do better science.

How do we get there from here?

To overcome these intersecting problems

and to achieve the goals of transparent, reproducible research

We must learn how to do research IN the Web

Not OVER the Web

How we use The Web today

The Web is not a pigeon!

Semantic Web Technologies

Design Pattern for Publishing Analytical Tools on the Semantic Web

Application that uses SADI to interpret globally-distributed expert knowledge in order to discover and execute the right tool, at the right time, for the right analysis

CHALLENGE:

Reproduce a peer-reviewed scientific publication by semantically modelling the problem

The Publication: Discovering Protein Partners of a Human Tumor Suppressor Protein

Original Study Simplified

Using what is known about protein interactions

in fly & yeast

predict new interactions with this Human Tumor Suppressor

Semantic Model of the Experiment

OWL

Web Ontology Language (OWL) is the language approved by the W3C

for representing knowledge in the Web

Note that every word in this diagram is, in reality, a URL (it’s a Semantic Web model)

i.e. it refers to the expertise of other researchers, distributed around the world on the Web (i.e. NanoPubs***)

*** remember this word!! It will be important later!!


Set-up the Experimental Conditions

In a local data-file, provide the protein we are interested in and the two species we wish to use in our comparison

taxon:9606 a i:OrganismOfInterest .      # human
uniprot:Q9UK53 a i:ProteinOfInterest .   # ING1
taxon:4932 a i:ModelOrganism1 .          # yeast
taxon:7227 a i:ModelOrganism2 .          # fly

Run the Experiment

SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
  ?protein a i:ProbableInteractor .
}

This is the URL that leads our computer to the Semantic model of the problem

SHARE examines the semantic model of Probable Interactors

Retrieves third-party expertise from the Web

Discusses with SADI what analytical tools are necessary

Chooses the right tools for the problem

Solves the problem!
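
The steps above can be pictured as a backward-chaining search: start from the property the query asks for, find a tool that produces it, then recurse over whatever that tool needs as input. The sketch below illustrates only that general idea. It is not SHARE's or SADI's actual algorithm or API; the registry, the service names, and the property names are all hypothetical, and real SADI services are described with OWL classes rather than plain strings.

```python
from typing import List, NamedTuple, Set


class Service(NamedTuple):
    name: str
    consumes: frozenset  # properties the input must already have
    produces: str        # property the service attaches to its input


# Hypothetical registry of analytical tools (names invented for illustration).
REGISTRY: List[Service] = [
    Service("findOrthologs", frozenset({"protein"}), "ortholog"),
    Service("findInteractions", frozenset({"ortholog"}), "orthologInteractor"),
    Service("mapToHuman", frozenset({"orthologInteractor"}), "probableInteractor"),
]


def plan(goal: str, known: Set[str], registry: List[Service]) -> List[str]:
    """Return service names, in execution order, that derive `goal` from `known`."""
    available = set(known)  # copy, so the caller's set is not modified
    steps: List[str] = []

    def resolve(prop: str, seen: Set[str]) -> None:
        if prop in available:
            return
        if prop in seen:
            raise ValueError(f"circular dependency on {prop!r}")
        for service in registry:
            if service.produces == prop:
                # First make sure every input of this service can be derived.
                for needed in sorted(service.consumes):
                    resolve(needed, seen | {prop})
                if service.name not in steps:
                    steps.append(service.name)
                available.add(prop)
                return
        raise LookupError(f"no registered service produces {prop!r}")

    resolve(goal, set())
    return steps


workflow = plan("probableInteractor", {"protein"}, REGISTRY)
print(workflow)
```

The point of the sketch is that the workflow is never written down by the user: it falls out of the description of the goal and the descriptions of the available tools.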

SHARE derives (and executes) the following analysis automatically

SHARE is aware of the context of the specific question being asked

There are four very cool things about what you just saw...


1. SHARE was able to create a workflow based on a semantic model

2. SHARE was able to create a COMPUTATIONAL workflow based on a BIOLOGICAL model (this is important because we want this system to be used by clinicians and biologists who don’t speak computerese!)

3. The workflow it created, and the services chosen, differed depending on the context of the specific question being asked

taxon:4932 a i:ModelOrganism1 .   # yeast
taxon:7227 a i:ModelOrganism2 .   # fly

4. The choice of tools was guided by the knowledge of worldwide domain-experts encoded in globally-distributed ontologies (e.g. expert high-throughput statisticians, etc.)

We have not over-trivialized the problem of interpreting clinical data...

Measurement Units

One example of the “little ways” that Semantics will help clinical researchers

day-by-day

Units must be harmonized

Don’t leave this up to the researcher (it’s fiddly, time-consuming, and error-prone)

NASA Mars Climate Orbiter

ID   HEIGHT  WEIGHT  SBP   CHOL  HDL  BMI GR  SBP GR  CHOL GR  HDL GR
pt1  1.82    177     128   227   55   0       0       1        0
pt2  179     196     13.4  5.9   1.7  1       0       1        0

The Chaos of Real-world Clinical Datasets (this is a snapshot of an actual dataset we worked on)

Height in m and cm; Chol in mmol/l and mg/dl

...and other delicious weirdness
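
To make the problem concrete, here is a minimal sketch of what harmonizing those two rows by hand involves. It rests on loudly stated assumptions that are not from the talk: units are guessed from magnitude (heights of 3 or more are taken to be centimetres, cholesterol values of 20 or more are taken to be mg/dl), and the function names are hypothetical. The factor of about 38.67 mg/dl per mmol/l is the usual conversion for cholesterol.

```python
MG_DL_PER_MMOL_L = 38.67  # cholesterol: 1 mmol/l is about 38.67 mg/dl


def height_in_metres(value: float) -> float:
    # Assumed heuristic: nobody is 3 m tall, so larger values must be centimetres.
    return value if value < 3.0 else value / 100.0


def cholesterol_in_mmol_l(value: float) -> float:
    # Assumed heuristic: values of 20 or more are taken to be mg/dl.
    return value if value < 20.0 else value / MG_DL_PER_MMOL_L


# Height and cholesterol columns of the two rows shown above.
records = [
    {"id": "pt1", "height": 1.82, "chol": 227.0},
    {"id": "pt2", "height": 179.0, "chol": 5.9},
]

harmonized = [
    {
        "id": r["id"],
        "height_m": round(height_in_metres(r["height"]), 2),
        "chol_mmol_l": round(cholesterol_in_mmol_l(r["chol"]), 2),
    }
    for r in records
]

for row in harmonized:
    print(row)
```

Guessing units from magnitudes is exactly the fragile, unrecorded step that should not be left to the researcher; an explicit unit annotation on each measurement removes the guess.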

GOAL: get the clinical researcher “out of the loop” once the data is collected

(as per the Institute of Medicine Recommendations)

Semantically defining clinical phenotypes; Building on the expertise of others

SystolicBloodPressure =

GALEN:SystolicBloodPressure
  and ("sio:has measurement value" some
    ("sio:measurement"
      and ("sio:has unit" some "om:unit of measure")
      and ("om:dimension" value "om:pressure or stress dimension")
      and ("sio:has value" some rdfs:Literal)))

Very general definition: “some kind of pressure unit”

(so that others can build on this as they wish!)

HighRiskSystolicBloodPressure (as defined by Framingham)

SystolicBloodPressure
  and sio:hasMeasurement some
    (sio:Measurement
      and ("sio:has unit" value om:kilopascal)
      and (sio:hasValue some double[>= "18.7"^^double]))

Now we are specific to our clinical study: MUST be in kilopascal and must be >= 18.7


Running the Clinical Analysis

SELECT ?record ?Pressure ?Unit
FROM <./patient.rdf>
WHERE {
  ?record rdf:type measure:HighRiskSystolicBloodPressure .
  ?record sio:hasMeasurement ?measurement .
  ?measurement sio:hasValue ?Pressure .
  ?measurement sio:hasUnit ?Unit .
}

RecordID  Start Val  Start Unit  Pressure  End Unit
Pt1       15         cmHg        19.998    KiloPascal
Pt2       14.6       cmHg        19.465    KiloPascal
Pt1       148        mmHg        19.731    KiloPascal
Pt2       146        mmHg        19.465    KiloPascal

All measurements have now been automatically harmonized to KiloPascal, because we encoded the semantics in the model
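
The arithmetic behind that harmonization can be checked by hand. The sketch below is a plain Python stand-in, not the SADI unit-conversion service itself: the function names and the lookup table are hypothetical, the factor 1 mmHg ≈ 0.133322 kPa is the standard conversion, and the 18.7 kPa threshold is the one in the class definition above.

```python
KPA_PER_MMHG = 0.133322  # 1 mmHg is approximately 0.133322 kPa

# Conversion factors from each source unit to kilopascal.
TO_KPA = {
    "mmHg": KPA_PER_MMHG,
    "cmHg": KPA_PER_MMHG * 10,
    "kPa": 1.0,
}

HIGH_RISK_SBP_KPA = 18.7  # threshold from the HighRiskSystolicBloodPressure definition


def to_kpa(value: float, unit: str) -> float:
    """Convert a pressure reading to kilopascal, rejecting unknown units."""
    try:
        return value * TO_KPA[unit]
    except KeyError:
        raise ValueError(f"unsupported pressure unit: {unit!r}") from None


def is_high_risk(value: float, unit: str) -> bool:
    """Apply the threshold only after the reading has been harmonized."""
    return to_kpa(value, unit) >= HIGH_RISK_SBP_KPA


# The four readings from the results table.
readings = [("Pt1", 15, "cmHg"), ("Pt2", 14.6, "cmHg"), ("Pt1", 148, "mmHg"), ("Pt2", 146, "mmHg")]
for record_id, value, unit in readings:
    print(record_id, value, unit, round(to_kpa(value, unit), 3), is_high_risk(value, unit))
```

All four readings land above 18.7 kPa whichever unit they started in, which is why they all appear in the query result.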

Visual inspection of our output data and the AHA guidelines

showed that in many cases the clinician

“tweaked” the guidelines when doing their own analysis

AHA BMI risk threshold: BMI = 25
In our dataset the clinical researcher used: BMI = 26

AHA HDL guideline: HDL <= 1.03 mmol/l
The dataset from our researcher: HDL <= 0.89 mmol/l


These Alterations Were Not Recorded in Their Study Notes!

Adjusting our Semantic definitions and re-running the analysis resulted in nearly 100% correspondence with the clinical researcher

HighRiskCholesterolRecord =

PatientRecord
  and (sio:hasAttribute some
    (cardio:SerumCholesterolConcentration
      and sio:hasMeasurement some
        (sio:Measurement
          and (sio:hasUnit value cardio:mili-mole-per-liter)
          and (sio:hasValue some double[>= 5.0]))))



Reflect on this for a second... Because this is important!

1. We semantically encoded clinical guidelines

2. We found that clinical researchers did not follow the official guidelines

3. Their “personalization” of the guidelines was unreported

4. Nevertheless, we were able to create “personalized” Semantic Models

5. These reflect the opinion of an individual domain-expert

6. These models are shared on the Web

7. These models can be automatically re-used by others to interpret their own data using that clinical expert’s viewpoint

AHA:HighRiskCholesterolRecord


McManus:HighRiskCholesterolRecord


PREFIX AHA: <http://americanheart.org/measurements/>
PREFIX McManus: <http://stpaulshospital.org/researchers/mcmanus/>

To do the analysis using AHA guidelines

SELECT ?patient ?risk
WHERE {
  ?patient rdf:type AHA:HighRiskCholesterolRecord .
  ?patient ex:hasCholesterolProfile ?risk .
}

To do the analysis using McManus’ expert-opinion

SELECT ?patient ?risk
WHERE {
  ?patient rdf:type McManus:HighRiskCholesterolRecord .
  ?patient ex:hasCholesterolProfile ?risk .
}
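
The two queries differ only in the namespace of the class they ask for. A minimal sketch of the same idea in plain Python: each viewpoint is a set of thresholds keyed by the namespace that publishes it. The BMI and HDL thresholds are the ones quoted earlier in the talk; the dictionary layout, the function name, the comparison directions (BMI at or above the threshold, HDL at or below it), and the patient record are assumptions made for illustration.

```python
from typing import Dict

AHA = "http://americanheart.org/measurements/"
MCMANUS = "http://stpaulshospital.org/researchers/mcmanus/"

# Thresholds quoted in the talk; the layout and the comparison directions
# are assumptions of this sketch.
VIEWPOINTS: Dict[str, Dict[str, float]] = {
    AHA: {"bmi_min": 25.0, "hdl_max": 1.03},
    MCMANUS: {"bmi_min": 26.0, "hdl_max": 0.89},
}


def high_risk_flags(patient: Dict[str, float], namespace: str) -> Dict[str, bool]:
    """Classify one patient record under the viewpoint published at `namespace`."""
    limits = VIEWPOINTS[namespace]
    return {
        "bmi": patient["bmi"] >= limits["bmi_min"],
        "hdl": patient["hdl_mmol_l"] <= limits["hdl_max"],
    }


# Invented patient record, for illustration only.
patient = {"bmi": 25.4, "hdl_mmol_l": 0.95}

aha_view = high_risk_flags(patient, AHA)
mcmanus_view = high_risk_flags(patient, MCMANUS)
print("AHA:", aha_view)
print("McManus:", mcmanus_view)
```

The same record is high-risk under one viewpoint and not under the other, and the only thing that changed is which published definition was named.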

Flexibility Transparency

Reproducibility Shareability Comparability

Simplicity Automation

Two final points....

[Figure: the research cycle again (Publication, Discourse, Hypothesis, Experiment, Interpretation), now marked “??”]

The Semantic Model represents a possible solution to a problem

By my definition, that is a hypothesis


That hypothesis is tested by automatically converting it into a workflow; the results of the workflow are intimately tied to the hypothesis


i.e. You (or anyone!) can determine exactly which aspect of the hypothesis led to which output data element, why, and how
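
A minimal sketch of what such a trail could look like as a data structure: every derived value keeps the rule that produced it and the items it was produced from, so the chain can be walked back to the raw data. The `Derived` class, the `explain` helper, and the record identifier are hypothetical; the talk does not describe how SHARE stores provenance.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Derived:
    value: object
    rule: str                                   # which part of the model fired
    sources: Tuple[Union["Derived", str], ...]  # raw identifiers or earlier results


def explain(item: Union[Derived, str], depth: int = 0) -> List[str]:
    """Walk back from a result to the raw data, one indented line per step."""
    indent = "  " * depth
    if isinstance(item, str):
        return [indent + item]
    lines = [f"{indent}{item.value!r} <- {item.rule}"]
    for source in item.sources:
        lines.extend(explain(source, depth + 1))
    return lines


# Hypothetical identifier for one raw measurement.
raw = "patient.rdf#Pt1/sbp (15 cmHg)"
kpa = Derived(19.998, "convert cmHg to kPa", (raw,))
risk = Derived(True, "value >= 18.7 kPa", (kpa,))

trail = explain(risk)
print("\n".join(trail))
```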


“Exquisite Provenance”

And this is important because...

Note that every word in this diagram is, in reality, a URL (it’s a Semantic Web model)

i.e. it refers to the expertise of other researchers, distributed around the world on the Web (i.e. NanoPubs***)

*** remember this word!! It will be important later!!

Remember when I said this...?

“Exquisite Provenance”

is required

for the output data and knowledge to be published as...

Semantic Web-based, richly annotated, citable, and queryable snippets of scientific knowledge

(that can be used to construct novel SHARE hypotheses!)

Life Web Science:

The Semantic Web is a cradle-to-grave biomedical research platform

that can, and will, dramatically improve how biomedical research is done

We Are Here!

Microsoft Research
