Improving Transparency and Reproducibility of Biomedical Research Using Semantic Technologies
Mark Wilkinson
World Research & Innovation Congress, Brussels, 2013
Isaac Peral Senior Researcher in Biological Informatics, Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain
Adjunct Professor of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
Making the Web a biomedical research platform
from hypothesis through to publication
[Figure: research cycle with the stages Publication, Discourse, Hypothesis, Experiment, Interpretation]
Motivation:
3 intersecting trends in the Life Sciences
that are now, or soon will be, extremely problematic
NON-REPRODUCIBLE SCIENCE & THE FAILURE OF PEER REVIEW
TREND #1
Multiple recent surveys of high-throughput biology
reveal that upwards of 50% of published studies
are not reproducible
- Baggerly, 2009
- Ioannidis, 2009
Similar (if not worse!) in clinical studies
- Begley & Ellis, Nature, 2012
- Booth, Forbes, 2012
- Huang & Gottardo, Briefings in Bioinformatics, 2012
“the most common errors are simple, the most simple errors are common”
At least partially because the analytical methodology was inappropriate
and/or not sufficiently described
- Baggerly, 2009
These errors pass peer review
The researcher is (sometimes) unaware of the error
The process that led to the error is not recorded
Therefore it cannot be detected during peer-review
Agencies have Noticed!
In March 2012, the US Institute of Medicine said, in effect,
“Enough is enough!”
Institute of Medicine Recommendations for Conduct of High-Throughput Research:
Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies, Report Brief, March 2012.
1. Rigorously-described, -annotated, and -followed data management and manipulation procedures
2. “Lock down” the computational analysis pipeline once it has been selected
3. Publish the analytical workflow in a formal manner, together with the full starting and result datasets
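One way to picture recommendations 2 and 3 is a small manifest that fingerprints the chosen pipeline and its datasets, so that any later change to the pipeline is detectable. This is a minimal sketch and not something the Institute of Medicine report prescribes: the function names, the manifest layout, and the two example steps are all assumptions made for illustration.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def pipeline_digest(pipeline_steps) -> str:
    """Digest of the pipeline definition, serialised with sorted keys so the
    same definition always yields the same digest."""
    encoded = json.dumps(pipeline_steps, sort_keys=True).encode("utf-8")
    return sha256_of(encoded)


def build_manifest(pipeline_steps, input_data: bytes, result_data: bytes) -> dict:
    """Record the pipeline together with digests of its start and result data."""
    return {
        "pipeline": pipeline_steps,
        "pipeline_sha256": pipeline_digest(pipeline_steps),
        "input_sha256": sha256_of(input_data),
        "result_sha256": sha256_of(result_data),
    }


def is_locked(manifest: dict, pipeline_steps) -> bool:
    """True if the pipeline in use still matches the one that was locked down."""
    return pipeline_digest(pipeline_steps) == manifest["pipeline_sha256"]


# Hypothetical two-step pipeline; step names and parameters are made up.
steps = [
    {"step": "normalise", "params": {"method": "quantile"}},
    {"step": "classify", "params": {"threshold": 0.5}},
]
manifest = build_manifest(steps, b"raw input bytes", b"result bytes")
```

Publishing such a manifest next to the datasets would let a reviewer check that the analysis they re-run is the one that was reported.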
BIGGER, CHEAPER DATA
TREND #2
High-throughput technologies are becoming cheaper and easier to use
But there are still very few experts trained in statistical analysis of high-throughput data
Therefore
Even small, moderately-funded laboratories can now afford to produce more data
than they can manage or interpret
“THE SINGULARITY”
TREND #3
The Healthcare Singularity and the Age of Semantic Medicine, Michael Gillam et al., in The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey (Editor), 2009
Slide adapted with permission from Joanne Luciano, Presentation at Health Web Science Workshop 2012, Evanston IL, USA, June 22, 2012.
“The Singularity”
The X-intercept is where, the moment a discovery is made, it is immediately put into practice
Scientific research would have to be conducted within a medium that
immediately interpreted and disseminated the results...
You Are Here
...in a form that immediately (actively!) affected the results of other researchers...
...without requiring them to be aware of these new discoveries.
3 intersecting and problematic trends
Non-reproducible science that passes peer-review
Cheaper production of larger and more complex datasets that require specialized expertise to analyze properly
Need to more rapidly disseminate and use new discoveries
We Want More!
I don’t just want to reproduce your experiment...
I want to re-use your experiment
In my own laboratory... On MY DATA!
When I do my analysis
I want to draw on the knowledge
of global domain-experts like
statisticians and pathologists...
...as if they were mentors sitting
in the chair beside me.
Image from: Mark Smiciklas Intersection Consulting, cc-nca
Please don’t make me find
all of the data and knowledge
that I require to do my experiment
...it simply isn’t possible anymore...
Image from AJ Cann, cc-by-a license
I want to support peer review(ers) so that I do better science.
How do we get there from here?
To overcome these intersecting problems
and to achieve the goals of transparent, reproducible research
We must learn how to do research IN the Web
Not OVER the Web
How we use The Web today
The Web is not a pigeon!
Semantic Web Technologies
Design Pattern for Publishing Analytical Tools on the Semantic Web
Application that uses SADI to interpret globally-distributed
expert knowledge
in order to discover and execute the right tool, at the right time, for the right analysis
CHALLENGE:
Reproduce a peer-reviewed scientific publication
by semantically modelling the problem
The Publication: Discovering Protein Partners of a Human Tumor Suppressor Protein
Original Study Simplified
Using what is known about protein interactions
in fly & yeast
predict new interactions with this Human Tumor Suppressor
Semantic Model of the Experiment
OWL
Web Ontology Language (OWL) is the language approved by the W3C
for representing knowledge in the Web
Note that every word in this diagram is, in reality, a URL (it’s a Semantic Web model)
i.e. It refers to the expertise of other researchers, distributed around the world on the Web (i.e. NanoPubs***)
*** remember this word!! It will be important later!!
In a local data-file
provide the protein we are interested in
and the two species we wish to use in our comparison
taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
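To make the shape of that input file concrete, here is a minimal Python sketch that reads the four statements into a role-to-identifier mapping. It is an illustration only: `parse_simple_n3` is a hypothetical helper that handles just the one-statement-per-line form shown on the slide, not real N3/Turtle (a real parser would also need the prefix declarations, which the slide does not show).

```python
def parse_simple_n3(text: str) -> dict:
    """Parse lines of the form 'subject a class . # comment' into {class: subject}.

    Handles only the one-triple-per-line shape shown on the slide; it is not
    a general N3/Turtle parser and assumes identifiers contain no '#'.
    """
    roles = {}
    for line in text.splitlines():
        statement = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not statement:
            continue
        if statement.endswith("."):
            statement = statement[:-1].strip()
        subject, predicate, obj = statement.split()
        if predicate != "a":
            raise ValueError(f"unsupported predicate: {predicate}")
        roles[obj] = subject
    return roles


INPUT = """\
taxon:9606 a i:OrganismOfInterest . # human
uniprot:Q9UK53 a i:ProteinOfInterest . # ING1
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
"""

roles = parse_simple_n3(INPUT)
```

The point of the data-file is visible in the resulting mapping: the experiment is parameterised entirely by which identifiers are assigned to which roles.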
Set-up the Experimental Conditions
SELECT ?protein
FROM <file:/local/workflow.input.n3>
WHERE {
?protein a i:ProbableInteractor .
}
Run the Experiment
i:ProbableInteractor is the URL that leads our computer to the Semantic model of the problem
SHARE examines the semantic model of Probable Interactors
Retrieves third-party expertise from the Web
Discusses with SADI what analytical tools are necessary
Chooses the right tools for the problem
Solves the problem!
SHARE derives (and executes) the following analysis automatically
SHARE is aware of the context of the specific question being asked
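The idea of deriving a workflow from a description of the desired result can be pictured with a toy planner. This is only a sketch under loud assumptions: it does not reflect how SHARE or SADI are actually implemented, and the `Service` tuple, the `plan` function, and every service and property name below are invented for illustration. Each hypothetical service declares what it needs and what it adds, and the planner works backwards from the requested property.

```python
from typing import NamedTuple


class Service(NamedTuple):
    name: str
    needs: frozenset  # properties that must already be known
    adds: str         # property this service attaches to its input


def plan(goal_properties, known, registry):
    """Return an ordered list of service names that produces every goal property.

    Backward chaining: for each missing property, take the first registered
    service that adds it, and satisfy that service's own needs first.
    """
    known = set(known)
    workflow = []

    def satisfy(prop, in_progress):
        if prop in known:
            return
        if prop in in_progress:
            raise LookupError(f"circular dependency on {prop}")
        for service in registry:
            if service.adds == prop:
                for need in sorted(service.needs):
                    satisfy(need, in_progress | {prop})
                workflow.append(service.name)
                known.add(prop)
                return
        raise LookupError(f"no registered service provides {prop}")

    for prop in goal_properties:
        satisfy(prop, frozenset())
    return workflow


# Hypothetical registry; none of these are real SADI services.
REGISTRY = [
    Service("find_orthologs", frozenset({"protein"}), "ortholog"),
    Service("find_interactors", frozenset({"ortholog"}), "ortholog_interactor"),
    Service("map_back_to_human", frozenset({"ortholog_interactor"}), "probable_interactor"),
    Service("unrelated_tool", frozenset({"protein"}), "structure"),
]

workflow = plan(["probable_interactor"], {"protein"}, REGISTRY)
```

Only the services that contribute to the requested property end up in the workflow, which is the behaviour the slides attribute to the real system at a far larger scale.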
There are four very cool things about what you just saw...
1. SHARE was able to create a workflow based on a semantic model
2. SHARE was able to create a COMPUTATIONAL workflow based on a BIOLOGICAL model
(this is important because we want this system to be used by clinicians and biologists who don’t speak computerese!)
3. The workflow it created, and the services chosen, differed depending on the context of the specific question being asked
taxon:4932 a i:ModelOrganism1 . # yeast
taxon:7227 a i:ModelOrganism2 . # fly
4. The choice of tools was guided by the knowledge of worldwide domain-experts encoded in globally-distributed ontologies
(e.g. expert high-throughput statisticians, etc...)
We have not over-trivialized the problem of interpreting clinical data...
Measurement Units
One example of the “little ways” that Semantics will help clinical researchers
day-by-day
Units must be harmonized
Don’t leave this up to the researcher (it’s fiddly, time-consuming, and error-prone)
NASA Mars Climate Orbiter
Oops!
ID   HEIGHT  WEIGHT  SBP   CHOL  HDL  BMI GR  SBP GR  CHOL GR  HDL GR
pt1  1.82    177     128   227   55   0       0       1        0
pt2  179     196     13.4  5.9   1.7  1       0       1        0
The Chaos of Real-world Clinical Datasets (this is a snapshot of an actual dataset we worked on)
Height in m and cm
Chol in mmol/l and mg/l
...and other delicious weirdness
GOAL: get the clinical researcher “out of the loop” once the data is collected
(as per the Institute of Medicine Recommendations)
Semantically defining clinical phenotypes; Building on the expertise of others
SystolicBloodPressure =
GALEN:SystolicBloodPressure and ("sio:has measurement value" some ("sio:measurement" and ("sio:has unit" some "om:unit of measure") and ("om:dimension" value "om:pressure or stress dimension") and ("sio:has value" some rdfs:Literal)))
Very general definition: “some kind of pressure unit”
(so that others can build on this as they wish!)
HighRiskSystolicBloodPressure (as defined by Framingham)
SystolicBloodPressure and sio:hasMeasurement some (sio:Measurement and ("sio:has unit" value om:kilopascal) and (sio:hasValue some double[>= "18.7"^^double]))
Now we are specific to our clinical study: the value MUST be in kilopascal and must be >= 18.7
SELECT ?record ?convertedvalue ?convertedunit
FROM <./patient.rdf>
WHERE {
?record rdf:type measure:HighRiskSystolicBloodPressure .
?record sio:hasMeasurement ?measurement .
?measurement sio:hasValue ?Pressure .
}
RecordID  Start Val  Start Unit  Pressure  End Unit
Pt1       15         cmHg        19.998    KiloPascal
Pt2       14.6       cmHg        19.465    KiloPascal
Pt1       148        mmHg        19.731    KiloPascal
Pt2       146        mmHg        19.465    KiloPascal
Running the Clinical Analysis
All measurements have now been automatically harmonized to KiloPascal, because we encoded the semantics in the model
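The arithmetic behind that harmonization can be checked with a short Python sketch. In the system described here the conversion is driven by the unit ontology; the lookup table, function names, and rounded conversion factor below are assumptions standing in for that machinery. The 18.7 kPa cut-off is the one quoted in the HighRiskSystolicBloodPressure definition.

```python
# Conversion factors to kilopascal. 1 mmHg is roughly 0.133322 kPa, and
# 1 cmHg is ten times that. The rounded factor is an assumption; the slides
# do not say which factor was used.
KPA_PER_UNIT = {
    "kPa": 1.0,
    "mmHg": 0.133322,
    "cmHg": 1.33322,
}

HIGH_RISK_SBP_KPA = 18.7  # threshold quoted in the slide's definition


def to_kilopascal(value: float, unit: str) -> float:
    """Convert a pressure reading to kilopascal, refusing unknown units."""
    try:
        return value * KPA_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"no conversion known for unit {unit!r}") from None


def is_high_risk_sbp(value: float, unit: str) -> bool:
    """Apply the kilopascal threshold only after harmonizing the unit."""
    return to_kilopascal(value, unit) >= HIGH_RISK_SBP_KPA


# The four readings from the table above.
readings = [("Pt1", 15, "cmHg"), ("Pt2", 14.6, "cmHg"),
            ("Pt1", 148, "mmHg"), ("Pt2", 146, "mmHg")]
harmonized = [(rid, to_kilopascal(value, unit)) for rid, value, unit in readings]
```

Refusing unknown units, rather than guessing, is the design choice that matters: a value with no declared unit is exactly the Mars Climate Orbiter situation.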
Visual inspection of our output data and the AHA guidelines
showed that in many cases the clinician
“tweaked” the guidelines when doing their own analysis
AHA BMI risk threshold: BMI=25
In our dataset the clinical researcher used BMI=26
AHA HDL guideline: HDL<=1.03 mmol/l
The dataset from our researcher: HDL<=0.89 mmol/l
These Alterations Were Not Recorded in Their Study Notes!
Adjusting our Semantic definitions and re-running the analysis resulted in nearly 100% correspondence with the clinical researcher
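What “correspondence” means here can be sketched as the fraction of records on which the model's risk label matches the label the researcher recorded. The BMI values and labels below are made up purely for illustration (the real dataset is not reproduced in these slides), and treating the BMI cut-off as “at or above” is an assumption.

```python
def classify(values, threshold, at_or_above=True):
    """Label each value 1 (at risk) or 0 using a single cut-off."""
    if at_or_above:
        return [1 if v >= threshold else 0 for v in values]
    return [1 if v <= threshold else 0 for v in values]


def agreement(predicted, recorded):
    """Fraction of records where the model's label matches the researcher's."""
    if len(predicted) != len(recorded):
        raise ValueError("label lists must be the same length")
    matches = sum(p == r for p, r in zip(predicted, recorded))
    return matches / len(recorded)


# Made-up BMI values and researcher labels, for illustration only.
bmi_values = [24.0, 25.5, 26.0, 27.3, 25.9]
researcher_labels = [0, 0, 1, 1, 0]  # consistent with a cut-off of 26

guideline_agreement = agreement(classify(bmi_values, 25), researcher_labels)
adjusted_agreement = agreement(classify(bmi_values, 26), researcher_labels)
```

With the guideline cut-off the labels disagree on the records that fall between 25 and 26; moving the cut-off to the value the researcher actually used removes the disagreement, which is the effect reported above.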
HighRiskCholesterolRecord=
PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.0]))))
HighRiskCholesterolRecord=
PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.2]))))
Reflect on this for a second... Because this is important!
1. We semantically encoded clinical guidelines
2. We found that clinical researchers did not follow the official guidelines
3. Their “personalization” of the guidelines was unreported
4. Nevertheless, we were able to create “personalized” Semantic Models
5. These reflect the opinion of an individual domain-expert
6. These models are shared on the Web
7. These models can be automatically re-used by others to interpret their own data using that clinical expert’s viewpoint
AHA:HighRiskCholesterolRecord
PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.0]))))
McManus:HighRiskCholesterolRecord
PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.2]))))
PREFIX AHA: <http://americanheart.org/measurements/>
PREFIX McManus: <http://stpaulshospital.org/researchers/mcmanus/>
To do the analysis using AHA guidelines
SELECT ?patient ?risk
WHERE {
?patient rdf:type AHA:HighRiskCholesterolRecord .
?patient ex:hasCholesterolProfile ?risk
}
To do the analysis using McManus’ expert-opinion
SELECT ?patient ?risk
WHERE {
?patient rdf:type McManus:HighRiskCholesterolRecord .
?patient ex:hasCholesterolProfile ?risk
}
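The effect of swapping one prefix for another can be mimicked in a few lines of Python. This is a sketch only: a plain dictionary stands in for ontologies that would really be fetched from the Web, `select_high_risk` is a hypothetical helper, and the patient records are made up. The two thresholds are the ones in the AHA and McManus definitions above.

```python
# Cholesterol cut-offs in mmol/l, taken from the two class definitions above.
# The keys mimic the prefixed class names used in the queries.
DEFINITIONS = {
    "AHA:HighRiskCholesterolRecord": 5.0,
    "McManus:HighRiskCholesterolRecord": 5.2,
}


def select_high_risk(records, definition):
    """Return the IDs of records whose cholesterol meets the named definition."""
    threshold = DEFINITIONS[definition]
    return [rid for rid, chol_mmol_l in records if chol_mmol_l >= threshold]


# Made-up patient records: (id, serum cholesterol in mmol/l).
patients = [("pt1", 4.8), ("pt2", 5.1), ("pt3", 5.9)]

aha_hits = select_high_risk(patients, "AHA:HighRiskCholesterolRecord")
mcmanus_hits = select_high_risk(patients, "McManus:HighRiskCholesterolRecord")
```

The data and the selection logic are identical in both calls; only the name of the definition changes, just as only the prefix changes between the two queries.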
Flexibility Transparency
Reproducibility Shareability Comparability
Simplicity Automation
Two final points....
[Figure: research cycle with the stages Publication, Discourse, Hypothesis, Experiment, Interpretation, with one stage marked “??”]
The Semantic Model represents a possible solution to a problem
By my definition, that is a hypothesis
That hypothesis is tested by automatically converting it into a workflow; the results of the workflow are intimately tied to the hypothesis
i.e. You (or anyone!) can determine exactly which aspect of the hypothesis led to which output data element, why, and how
“Exquisite Provenance”
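A rough way to picture this kind of provenance is to have every derived value carry the rule that produced it and the values it was computed from. The `Derived` record and `trace` helper below are hypothetical illustrations, not the representation SHARE uses; the example values reuse the blood-pressure conversion shown earlier in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derived:
    value: object
    rule: str      # which part of the model produced this value
    inputs: tuple  # raw values or Derived objects it was computed from


def trace(item, depth=0):
    """Return indented lines describing how a derived value came about."""
    indent = "  " * depth
    if not isinstance(item, Derived):
        return [f"{indent}input {item!r}"]
    lines = [f"{indent}{item.value!r} via {item.rule}"]
    for source in item.inputs:
        lines.extend(trace(source, depth + 1))
    return lines


# A raw reading of 15 cmHg, converted, then classified.
reading = Derived(19.998, "unit conversion cmHg to kPa", (15,))
flag = Derived(True, "HighRiskSystolicBloodPressure (>= 18.7 kPa)", (reading,))
report = trace(flag)
```

Walking the chain from the final flag back to the raw reading answers the "which aspect of the hypothesis led to which output, why, and how" question for that one data element.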
And this is important because...
Note that every word in this diagram is, in reality, a URL (it’s a Semantic Web model)
i.e. It refers to the expertise of other researchers, distributed around the world on the Web (i.e. NanoPubs***)
*** remember this word!! It will be important later!!
Remember when I said this...?
“Exquisite Provenance”
is required
for the output data and knowledge to be published as...
Semantic Web-based, richly annotated, citable, and queryable snippets of scientific knowledge
(that can be used to construct novel SHARE hypotheses!)
Life Web Science:
The Semantic Web is a cradle-to-grave biomedical research platform
that can, and will, dramatically improve how biomedical research is done
We Are Here!
Microsoft Research