Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Life Web Science 2013, Paris Improving Transparency and Reproducibility of Clinical Research Using Semantic Technologies Soroush Samadian & Mark Wilkinson Isaac Peral Senior Researcher in Biological Informatics Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain Adjunct Professor of Medical Genetics, University of British Columbia Vancouver, BC, Canada.

description

We were interested in whether we could model well-established clinical risk guidelines in OWL, and use these to automatically classify patient data vis-à-vis "risk" (e.g. using the Framingham risk categories). What we ended up doing, however, was wandering down a very interesting path of attempting to model clinical intuition! This reports the first phase of the experiment. A subsequent SlideShare will give part II of this investigation. This is the work of Soroush Samadian, Ph.D. Candidate at the University of British Columbia Bioinformatics Graduate Programme.

Transcript of Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Page 1: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Life Web Science 2013, Paris

Improving Transparency and Reproducibility

of Clinical Research Using Semantic Technologies

Soroush Samadian & Mark Wilkinson

Isaac Peral Senior Researcher in Biological Informatics, Centro de Biotecnología y Genómica de Plantas, UPM, Madrid, Spain

Adjunct Professor of Medical Genetics, University of British Columbia, Vancouver, BC, Canada.

Page 2: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Can we make the Web a scientific research platform

from hypothesis right through to publication?

Page 3: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Focus on publishing citable snippets of scientific knowledge using SemWeb standards

Page 4: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

That’s a good start vis-à-vis academic publishing, but...

What about the rest of the scientific process?

Page 5: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Publication

Discourse

Hypothesis

Experiment

Interpretation

Page 6: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Houston, we have a problem...

Page 7: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Context

Multiple recent surveys of high-throughput biology

reveal that upwards of 50% of published studies

are not reproducible

- Baggerly, 2009

- Ioannidis, 2009

Page 8: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Context

Similar (if not worse!) in clinical studies

- Begley & Ellis, Nature, 2012

- Booth, Forbes, 2012

- Huang & Gottardo, Briefings in Bioinformatics, 2012

Page 9: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Context

“the most common errors are simple, the most simple errors are common”

At least partially because the analytical methodology was inappropriate

and/or not sufficiently described

- Baggerly, 2009

Page 10: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Context

These errors pass peer review

The researcher is unaware of the error

The process that led to the error is not recorded

Therefore it cannot be detected during peer-review

Page 11: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Context

Discovery of such errors has resulted in retractions

and even shut-down clinical trials

- Ioannidis, 2009

Page 12: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Context

In March, 2012, the US Institute of Medicine ~said

“Enough is enough!”

Page 13: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Context

Institute of Medicine Recommendations For Conduct of High-Throughput Research:

Evolution of Translational Omics: Lessons Learned and the Path Forward. The Institute of Medicine of the National Academies, Report Brief, March 2012.

1. Rigorously-described, -annotated, and -followed data management and manipulation procedures

2. “Lock down” the computational analysis pipeline once it has been selected

3. Publish the analytical workflow in a formal manner, together with the full starting and result datasets

Page 14: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

How do we get there from here?

Page 15: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Our early attempts at

supporting clinical research

through SemWeb

Page 16: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

MEASUREMENT UNITS

Problem #1: Integrating Clinical Data

Page 17: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Units must be expressed

Units must be harmonized

Don’t leave this up to the researcher (it’s fiddly, time-consuming, and error-prone)

Page 18: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

NASA Mars Climate Orbiter

Page 19: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Oops!

Page 20: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

QUDT: Quantities, Units, Dimensions and Types Ontology

The conversion offset and conversion multiplier enable conversion between a non-SI-based unit and its SI equivalent.
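As a rough illustration of the offset-and-multiplier idea, here is a minimal Python sketch. The `UNITS` table is hand-entered and stands in for values that would be read from a unit ontology; the names `UNITS`, `to_si`, `from_si` and `convert`, and the convention `si_value = value * multiplier + offset`, are assumptions made for this example (ontology releases differ in how the offset is applied, so the convention should be checked against the release in use).

```python
# Hand-entered stand-in for values a unit ontology would supply.
# Each entry is (multiplier, offset).
# Assumed convention for this sketch: si_value = value * multiplier + offset
UNITS = {
    "kelvin": (1.0, 0.0),
    "degree_celsius": (1.0, 273.15),
    "pascal": (1.0, 0.0),
    "kilopascal": (1000.0, 0.0),
    "mmHg": (133.322, 0.0),
}


def to_si(value, unit):
    """Convert a value in `unit` to the SI base unit of its dimension."""
    multiplier, offset = UNITS[unit]
    return value * multiplier + offset


def from_si(si_value, unit):
    """Invert to_si: express an SI value in `unit`."""
    multiplier, offset = UNITS[unit]
    return (si_value - offset) / multiplier


def convert(value, source_unit, target_unit):
    """Convert by going through the shared SI unit."""
    return from_si(to_si(value, source_unit), target_unit)
```

For example, `convert(146, "mmHg", "kilopascal")` goes through pascals and comes out at roughly 19.465. The sketch does not check that the two units share a dimension, so asking it to turn degrees Celsius into kilopascals would silently return nonsense.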

Page 21: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

OM: Ontology of Units of Measurement

Useful for inventing new units that are commonly used in clinical research but not (currently) in any Unit ontology

Page 22: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

ID    HEIGHT  WEIGHT  SBP   CHOL  HDL  BMI GR  SBP GR  CHOL GR  HDL GR
pt1   1.82    177     128   227   55   0       0       1        0
pt2   179     196     13.4  5.9   1.7  1       0       1        0

Legacy clinical dataset used in our studies

Height in m and cm; Chol in mmol/l and mg/l

...and other delicious weirdness

The GR columns hold the expert decision on “risk” (e.g. BMI GR=1 means “at health-risk with this BMI”)

Page 23: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

GOAL: get the clinical researcher “out of the loop” once the data is collected

Complete transparency in analysis with NO PEEKING & NO TWEAKING!

(see U.S. IOM Recommendations)

Page 24: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Extending the GALEN ontology with richer logic, including measurement values and units

measure:SystolicBloodPressure =

galen:SystolicBloodPressure and ("sio:has measurement value" some ("sio:measurement" and ("sio:has unit" some "om:unit of measure") and ("om:dimension" value "om:pressure or stress dimension") and "sio:has value" some rdfs:Literal))

Very general definition: “some kind of pressure unit”

Page 25: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Now Galen classes can be used to “carry” rich data

Move beyond use of ontologies for simple keyword curation (keyword hierarchies are SO last-decade!)

Page 26: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Now, what do we do with the unit-soup that is in our legacy dataset?

Page 27: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

SADI Semantic Web Service for automated Unit conversion

• Send it a dataset with mixed units

• (optional) tell it the harmonized unit you want back

• Returns you a dataset with harmonized units

Automatic semantic detection of the “nature” of the incoming unit type (e.g. “unit of pressure”)

Automatic conversion based on dimensionality and/or offset & multiplier
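The behaviour described on this slide can be approximated locally in a few lines of Python. This is only a sketch of the idea, not the SADI service itself: the `REGISTRY` and `DEFAULT_TARGET` tables, the `Unit` type and the `harmonize` function are hypothetical names, and the unit factors are hand-entered rather than drawn from an ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    dimension: str      # the "nature" of the unit, e.g. "pressure"
    multiplier: float   # factor to the SI base unit of that dimension
    offset: float = 0.0


# Hypothetical hand-entered registry standing in for a unit ontology.
REGISTRY = {
    "mmHg": Unit("pressure", 133.322),
    "cmHg": Unit("pressure", 1333.22),
    "kPa": Unit("pressure", 1000.0),
    "cm": Unit("length", 0.01),
    "m": Unit("length", 1.0),
}

# Assumed defaults used when the caller does not name a target unit.
DEFAULT_TARGET = {"pressure": "kPa", "length": "m"}


def harmonize(records, target=None):
    """Convert (record_id, value, unit) rows to one common unit.

    The dimension is detected from the incoming units; mixing dimensions,
    or asking for a target of another dimension, raises ValueError.
    """
    records = list(records)
    if not records:
        return []
    dimensions = {REGISTRY[unit].dimension for _, _, unit in records}
    if len(dimensions) != 1:
        raise ValueError(f"mixed dimensions: {sorted(dimensions)}")
    dimension = dimensions.pop()
    target = target or DEFAULT_TARGET[dimension]
    goal = REGISTRY[target]
    if goal.dimension != dimension:
        raise ValueError(f"{target} is not a unit of {dimension}")
    harmonized = []
    for record_id, value, unit in records:
        source = REGISTRY[unit]
        si_value = value * source.multiplier + source.offset
        harmonized.append(
            (record_id, (si_value - goal.offset) / goal.multiplier, target)
        )
    return harmonized
```

Calling `harmonize([("cm_hg1", 15, "cmHg"), ("mm_hg2", 146, "mmHg")])` returns both rows in kPa, while a dataset that mixes heights with pressures is rejected instead of being silently converted.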

Page 28: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Create additional ontological classes representing clinical features of interest based on clinical guidelines

measure:HighRiskSystolicBloodPressure =

measure:SystolicBloodPressure and sio:hasMeasurement some (sio:Measurement and ("sio:has unit" value om:kilopascal) and (sio:hasValue some double[>= "18.7"^^double]))

Remember that this is from our extension of Galen

Extend, Reuse, Recycle!

Now we’re being specific: MUST be in kilopascals and must be >= 18.7
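To make the two-part restriction concrete, here is a small Python analogue of the class definition. It is a sketch only: the `Measurement` type and `is_high_risk_sbp` function are hypothetical names, and a real reasoner works quite differently from a boolean function. What it does show is that both conditions must hold, the unit and the value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str


def is_high_risk_sbp(measurement, threshold_kpa=18.7):
    """Mirror the class restriction: unit must be kilopascal AND value >= 18.7."""
    return measurement.unit == "kilopascal" and measurement.value >= threshold_kpa
```

A reading of 146 expressed in mmHg does not satisfy this test even though it is clinically the same pressure as about 19.465 kPa; it only matches once it has been converted, which is why unit harmonization has to happen before classification.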

Page 29: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

SELECT ?record ?convertedvalue ?convertedunit

FROM <./patient.rdf>

WHERE {

?record rdf:type measure:HighSystolicBloodPressure .

?record sio:hasMeasurement ?measurement .

?measurement sio:hasValue ?convertedvalue .

?record cardio:ExpertClassification ?riskgrade .

}

RecordID Start Val Start Unit End Val End Unit

cm_hg1 15 cmHg 19.998 KiloPascal

cm_hg2 14.6 cmHg 19.465 KiloPascal

mm_hg1 14.8 mmHg 19.731 KiloPascal

mm_hg2 146 mmHg 19.465 KiloPascal

SHARE query (SHARE is a SADI-enhanced SPARQL query engine)

Because the OWL definition of HighSBP required kilopascals, SHARE used SADI to auto-convert everything into kilopascals

Page 30: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

ASSESSING RISK

Problem #2: Automatic Interpretation of Clinical Data

Page 31: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Framingham risk measurements:

Age, Gender, Height, Weight, Body Mass Index (BMI), Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), Glucose, Cholesterol, Low Density Lipoprotein (LDL), High Density Lipoprotein (HDL), Triglyceride (TG)

All modeled as OWL Classes, much the same as described before

Page 32: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Measurements like BMI are derived from calculations over other “core” measurements

Again, we use SADI and semantics to achieve this automatically

(and of course, any unit conflicts in the input data are also automatically detected and resolved by the previous SADI service we discussed)
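A derived measurement such as BMI is just weight in kilograms divided by the square of height in metres. The sketch below shows that calculation with the unit handling made explicit; the function name `bmi` and the two lookup tables are assumptions for illustration, and the slides do not say which weight units the legacy dataset used.

```python
# Conversion factors to the units the BMI formula expects.
LENGTH_TO_M = {"m": 1.0, "cm": 0.01}
MASS_TO_KG = {"kg": 1.0, "lb": 0.45359237}


def bmi(height, height_unit, weight, weight_unit):
    """Body Mass Index in kg/m^2 from explicitly labelled inputs."""
    height_m = height * LENGTH_TO_M[height_unit]
    weight_kg = weight * MASS_TO_KG[weight_unit]
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / (height_m * height_m)
```

Because the units travel with the values, `bmi(1.82, "m", 80, "kg")` and `bmi(182, "cm", 80, "kg")` give the same answer, which is the property the legacy dataset needs.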

Page 33: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Semantic Modeling of the American Heart Association Risk grades

HighRiskBMI =

PatientRecord and (sio:hasAttribute some (cardio:BodyMassIndex and sio:hasMeasurement some (sio:Measurement and (sio:hasUnit value cardio:kilogram-per-meter-squared) and (sio:hasValue some double[>= 25.0]))))

Limit taken directly from clinical guidelines

Similarly for SBP, Cholesterol, etc....

Page 34: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

SHARE Query for High Risk SBP (SHARE is a SADI-enhanced SPARQL query engine)

Page 35: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Now the interesting question...

Page 36: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

How does our automated risk evaluation compare to a clinician’s expert risk classification?

Page 37: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Measure  True positive rate, “at risk” (%)  False positive rate, “at risk” (%)
SBP      100                                0
DBP      100                                0
CHOL     92.6                               0
HDL      100                                56.5
TG       100                                8.5
BMI      100                                18.8
LDL      100                                0

How does our automated risk evaluation compare to a clinician’s expert risk classification?

Yuck!
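For readers who want to reproduce this kind of comparison, the two rates can be computed as follows. This is a sketch assuming the standard definitions (true positive rate = TP / (TP + FN), false positive rate = FP / (FP + TN)) with the expert's grade treated as the reference; the slides do not spell out the formula used, and `risk_rates` is a hypothetical name.

```python
def risk_rates(predicted, expert):
    """Return (true positive rate %, false positive rate %).

    `predicted` and `expert` are equal-length sequences of 0/1 grades,
    with the expert's grade taken as the reference.
    """
    pairs = list(zip(predicted, expert))
    tp = sum(1 for p, e in pairs if p == 1 and e == 1)
    fn = sum(1 for p, e in pairs if p == 0 and e == 1)
    fp = sum(1 for p, e in pairs if p == 1 and e == 0)
    tn = sum(1 for p, e in pairs if p == 0 and e == 0)
    tpr = 100.0 * tp / (tp + fn) if (tp + fn) else None
    fpr = 100.0 * fp / (fp + tn) if (fp + tn) else None
    return tpr, fpr
```

A pattern of 100% true positives with a non-zero false positive rate is what a threshold stricter than the expert's would produce: every patient the expert flags is caught, plus some the expert does not flag.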

Page 38: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

But... We encoded and were following the guidelines!

We double-checked and our definitions were definitely correct

How could we possibly be wrong??

Page 39: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Visual inspection of data and guidelines showed

in many cases the clinician had “tweaked” the guideline

AHA BMI risk threshold: BMI=25

In our dataset the clinical researcher used BMI=26

HDL “official” guideline: HDL=1.03 mmol/l

The dataset from our researcher: HDL=0.89 mmol/l

Page 40: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Adjusting our OWL class definitions and re-running the analysis resulted in nearly 100% correspondence with the clinical researcher

(at least for binary risk assessment on simple measurements)

HighRiskCholesterolRecord=

PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.0]))))

HighRiskCholesterolRecord=

PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.2]))))

Page 41: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Reflect on this for a second... Because this is important!

1. We automated data cleansing and analysis using Semantic Web Services

2. We encoded clinical guidelines in OWL (first time this has been done AFAIK)

3. We found that clinical researchers did not follow the official guidelines

• This is fine! They’re the experts! But...

4. Their “personalization” of the guidelines was unreported

5. Nevertheless, we were able to create “personalized” OWL Classes representing the viewpoint of that clinical researcher

6. These personalized viewpoints, in OWL, were published on the Web

7. These published, personalized OWL classes can be automatically re-used by others to interpret their own data using that clinician’s viewpoint

Page 42: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

AHA:HighRiskCholesterolRecord

PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.0]))))

McManus:HighRiskCholesterolRecord

PatientRecord and (sio:hasAttribute some (cardio:SerumCholesterolConcentration and sio:hasMeasurement some ( sio:Measurement and (sio:hasUnit value cardio:mili-mole-per-liter) and (sio:hasValue some double[>= 5.2]))))

PREFIX AHA: <http://americanheart.org/measurements/>

PREFIX McManus: <http://stpaulshospital.org/researchers/mcmanus/>

Page 43: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

To do the “experiment” using AHA guidelines

SELECT ?patient ?risk

WHERE {

?patient rdf:type AHA:HighRiskCholesterolRecord .

?patient ex:hasCholesterolProfile ?risk

}

Page 44: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

To do the “experiment” using McManus guidelines

SELECT ?patient ?risk

WHERE {

?patient rdf:type McManus:HighRiskCholesterolRecord .

?patient ex:hasCholesterolProfile ?risk

}
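The effect of swapping one namespace for another can be mimicked outside a triple store. In this Python sketch the two thresholds (5.0 and 5.2 mmol/l) come from the class definitions in the slides, while the `VIEWPOINTS` table, the `high_risk_cholesterol` function and the patient values are made up for illustration.

```python
# Thresholds in mmol/l, taken from the two class definitions.
VIEWPOINTS = {
    "AHA": 5.0,
    "McManus": 5.2,
}


def high_risk_cholesterol(records, viewpoint):
    """Return the IDs of patients at or above the chosen viewpoint's threshold.

    `records` maps a patient ID to serum cholesterol in mmol/l.
    """
    threshold = VIEWPOINTS[viewpoint]
    return [pid for pid, chol in records.items() if chol >= threshold]


# Made-up example values, not data from the study.
patients = {"pt_a": 4.8, "pt_b": 5.1, "pt_c": 5.9}
aha_view = high_risk_cholesterol(patients, "AHA")
mcmanus_view = high_risk_cholesterol(patients, "McManus")
```

Here `pt_b` is at risk under one viewpoint and not under the other, and the only thing that changed is the name of the viewpoint passed in. That is the same move as changing the prefix in the query: the analysis stays fixed and the interpretation is an explicit, named input.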

Page 45: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Transparency!

Reproducibility! Sharability! Comparability!

Simplicity!

Automation!

Expert “tweaking” is allowed (the expert retains their expert authority)

but these tweaks are explicit and transparent

Page 46: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

CAN WE INTERPRET COMPLEX CLINICAL PHENOTYPES?

Problem #3: Moving Beyond Simple Binary Risk

Page 47: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

The next step was to attempt to model the Framingham Risk Scores

e.g. 10-year Cardiovascular Disease Risk

This takes a large number of variables (SBP, BMI, and disease states such as diabetes)

and calculates a patient’s risk

Page 48: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

How do we do with these non-trivial cases?

...not very well LOL!

Page 49: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

OWL Modeling of Framingham 10-year risk for general CVD

UGH! Awful!

Page 50: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Discussions with the clinical researcher revealed the problem...

The patients were on drugs that affected their clinical measurements

(effectively, the drugs made them more “normal”)

however the expert continued to classify them as having the clinical problem based on their implicit knowledge

regardless of the clinical measurement value

Page 51: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Can we compensate for that level of expert intuition?

We believe so

and the required knowledge is already encoded for us!

Page 52: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

NDF-RT from the U.S. Veterans Health Administration

Page 53: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

The resource to automate interpretation of a patient’s prescriptions exists

and would allow us to (more) properly interpret their phenotype

IF

We could accurately get this information out of their clinical record

Page 54: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Patient1 Patient2

DRUG 1 ASPRIN* ASCRIPTIN

DRUG 1 DOSAGE 1 DLY 1DLY,10MG AS NEEDED

DRUG 2 PROCARDIA PERSANTINE

DRUG 2 DOSAGE 10MG 1 3X DLY 75MG TID

DRUG 3 BUFFERIN Lopid

DRUG 3 DOSAGE 1DLY 4X300MG DLY

DRUG 4 VASOTEC DICUMAROL

DRUG 4 DOSAGE 2 DLY

DRUG 5 XSD ASCRIPTIN TRANRENE

DRUG 5 DOSAGE

DRUG 6 DIPYRIDAMOLE 100MG PERSANTINE

DRUG 6 DOSAGE 1 75MG, 3X DLY

DRUG 7 VASOTEC

DRUG 7 DOSAGE

Treated for HBP 1 1

Treated for Diabetes 1 1

Treated for High Cholesterol 0 1

Page 56: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

RxNav and ChemSpider have APIs for canonicalization of drug names

Use SADI service workflow to migrate legacy data into canonicalized form

Page 57: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

This workflow has a ~4% failure rate (my small trials with Google Suggest looked promising at improving this!)
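To give a feel for what canonicalization involves, and why some names fail, here is a small local stand-in written with Python's standard `difflib` module. It is not the RxNav or ChemSpider API and not the SADI workflow: the `VOCABULARY` list, the `canonicalize` function and the 0.8 similarity cutoff are all assumptions chosen for this sketch.

```python
import difflib
import re

# Hypothetical reference vocabulary; a real workflow would query a drug
# naming service instead of a hard-coded list.
VOCABULARY = ["ASPIRIN", "DIPYRIDAMOLE", "PERSANTINE", "PROCARDIA", "VASOTEC"]


def canonicalize(raw_name, cutoff=0.8):
    """Return the closest vocabulary entry, or None if nothing is close enough."""
    # Upper-case and drop anything that is not a letter or space, e.g. "*".
    cleaned = re.sub(r"[^A-Z ]", "", raw_name.upper()).strip()
    matches = difflib.get_close_matches(cleaned, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With this vocabulary, a misspelt entry such as `ASPRIN*` resolves to `ASPIRIN`, while a string that resembles nothing in the list returns `None` and would count towards the failure rate.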

Page 58: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

UnderHypertensionTreatment=

galen:Patient and cardio:isPrescribed some (cardio:CanonicalizedDrugCollection and sio:has_member some (cardio:HypertensionTreatmentMedication))

cardio:HypertensionTreatmentMedication=

cardio:CanonicalDrugRecord and ( ndf:may_treat some ndf:Hypertension )

Adding prescription information into our OWL Framingham Risk models

Note how easy it is to connect semantic data into your system - Just refer to it in your definition!! Also note that we’re not listing a bunch of drugs,

we’re including any drug defined as a Hypertension treatment by NDF-RT. The Semantic Definition, not an explicit drug list!
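The difference between a semantic definition and an explicit drug list can be shown with a toy lookup. In the Python sketch below, `MAY_TREAT` is a tiny hand-written table standing in for the drug knowledge base (it is not extracted from NDF-RT), and `under_treatment_for` is a hypothetical helper name.

```python
# Hand-written illustration of "may treat" relationships; a real system
# would obtain these from the drug knowledge base, not from a literal dict.
MAY_TREAT = {
    "NIFEDIPINE": {"Hypertension", "Angina"},
    "ENALAPRIL": {"Hypertension", "Heart failure"},
    "GEMFIBROZIL": {"Hyperlipidemia"},
    "ASPIRIN": {"Pain", "Fever"},
}


def under_treatment_for(prescriptions, condition):
    """True if any prescribed (canonicalized) drug may treat `condition`."""
    return any(condition in MAY_TREAT.get(drug, set()) for drug in prescriptions)
```

The function never names a specific antihypertensive; it only asks whether some prescribed drug has the relationship. Adding a new drug to the knowledge base changes the classification without touching the definition. Note also that a drug with several indications matches every one of them, whatever the prescriber intended.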

Page 59: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Now how are we doing?

The answer is a bit surprising...

Page 60: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Patient ID Automatic Risk Grade (based on drugs prescribed)

Expert-assigned Grade (BP_TREATMENT_STATUS)

Uri4627 1 1

Uri4275 1 1

Uri822 1 0

Uri893 1 1

For “treated for cholesterol” and “treated for diabetes” we achieved detection specificities of 96% to 99%

But for blood pressure it was a bit more of a mess... Only 44% specificity! What went wrong?

Page 61: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

A large number of drugs are used to treat blood pressure

but these drugs can also be used to treat other things

From the perspective of the treating clinician, the purpose for which they prescribed the drug

is the purpose that they record in the chart

Page 62: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

But the drug has other effects that the clinical researcher

might not (does not) account for in their expert evaluation

How do we define “correct” in this scenario?

????

(i.e. Is this a bug, or a feature??)

Page 63: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Risk class     Accuracy       Precision      Recall
High Risk      0.82 / 0.89    0.71 / 0.84    0.30 / 0.61
Moderate Risk  0.68 / 0.73    0.68 / 0.72    0.71 / 0.74
Low Risk       0.76 / 0.83    0.55 / 0.65    0.80 / 0.81

Overall: our ability to classify raw clinical data (with spelling mistakes and all) into the Framingham Risk evaluation, compared to the expert clinical assessment

In each pair, the first value (white on the slide) is before including prescription information; the second (grey on the slide) includes the NDF-RT drug knowledgebase
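Per-class figures like these can be computed by treating each risk class in turn as "positive" and the other two as "negative". The sketch below assumes that one-vs-rest reading, which the slides do not confirm; `per_class_metrics` is a hypothetical name and the expert's label is taken as the reference.

```python
def per_class_metrics(predicted, expert, label):
    """One-vs-rest accuracy, precision and recall for a single class label."""
    pairs = list(zip(predicted, expert))
    tp = sum(1 for p, e in pairs if p == label and e == label)
    fp = sum(1 for p, e in pairs if p == label and e != label)
    fn = sum(1 for p, e in pairs if p != label and e == label)
    tn = sum(1 for p, e in pairs if p != label and e != label)
    accuracy = (tp + tn) / len(pairs) if pairs else None
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall
```

Under this reading, low recall for a class means the automated classifier misses many patients the expert placed in that class, which is the kind of gap that extra knowledge about prescriptions would be expected to narrow.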

Page 64: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

We’re looking for other “intuitive” decisions made by the clinician

that will account for our remaining inaccuracies

We are optimistic that we can record at least some of these in OWL and/or as features within the SPARQL query

Remember: the objective is transparency, not necessarily 100% semantic encoding

Page 65: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Interestingly, we were also able to create a simple OWL Class that allowed classification of patients based on being

prescribed contra-indicated drugs

~4.2% of patients were taking dangerous drug combinations

We “got this for free” by connecting a bunch of semantic resources together!
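The check itself is simple once prescriptions are canonicalized and a source of contra-indication pairs is available. In this Python sketch the drug names are deliberate placeholders and the `CONTRAINDICATED` set is hypothetical; no real interaction data is implied.

```python
from itertools import combinations

# Placeholder pairs; a real system would take these from a drug
# knowledge base rather than a hard-coded set.
CONTRAINDICATED = {
    frozenset({"DRUG_A", "DRUG_B"}),
    frozenset({"DRUG_C", "DRUG_D"}),
}


def dangerous_pairs(prescriptions):
    """Return the contra-indicated pairs present in one patient's drug list."""
    unique = sorted(set(prescriptions))
    return [pair for pair in combinations(unique, 2)
            if frozenset(pair) in CONTRAINDICATED]


def has_dangerous_combination(prescriptions):
    """True if the patient takes at least one contra-indicated pair."""
    return bool(dangerous_pairs(prescriptions))
```

Running `has_dangerous_combination` over every patient record and counting the `True` results gives the kind of percentage reported on this slide.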

Page 66: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Take-home messages - repeated

1. We automated analysis using Semantic Web Services

2. We encoded clinical guidelines in OWL

3. We found that clinical researchers did not follow the official

guidelines

• This is fine! They’re the experts! But...

4. Their “personalization” of the guidelines was unreported

5. We were able to create “personalized” OWL Classes

representing the viewpoint of that clinical researcher

6. These personalized viewpoints were published on the Web

7. The OWL classes can be automatically re-used by others

Page 67: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Publication

Discourse

Hypothesis

Experiment

Interpretation

??

Page 68: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

The OWL Classes we constructed represent a particular clinician’s view of “reality”

By my definition, that IS a hypothesis!

Other work in our lab has demonstrated* that we can duplicate an entire published research paper

“simply” by creating an OWL class representing the hypothetical view of that researcher

(and note that these hypotheses are explicit, shared on the Web, and re-usable by others!!)

* Wood et al, Proc. ISoLA, 2012

Page 69: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Life Web Science:

The Web is a cradle-to-grave biomedical research platform.

Page 70: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

This is the work of Soroush Samadian

Ph.D. Candidate

Bioinformatics Programme, UBC, Vancouver, BC, Canada

Page 71: Enhancing Reproducibility and Transparency in Clinical Research through Semantic Technologies

Microsoft Research