Using ontologies to make sense of unstructured medical data
description
Transcript of Using ontologies to make sense of unstructured medical data
Using ontologies to make sense of unstructured medical data
Nigam Shah, MBBS, [email protected]
NCBO: Key activities
• We create and maintain a library of biomedical ontologies.
• We build tools and Web services to enable the use of ontologies and their derivatives.
• We collaborate with scientific communities that develop and use ontologies.
http:
//re
st.b
ioon
tolo
gy.o
rghtt
p://
rest
.bio
onto
logy
.org
Ontology ServicesOntology Services
• Download• Traverse• Search• Comment
• Download• Traverse• Search• Comment
WidgetsWidgets• Tree-view• Auto-complete• Graph-view
• Tree-view• Auto-complete• Graph-view
AnnotationAnnotation
Data AccessData Access
Mapping ServicesMapping Services
• Create• Download• Upload
• Create• Download• Upload
ViewsViews
Term recognitionTerm recognition
Fetch “data” annotated with a given term
Fetch “data” annotated with a given term
http://bioportal.bioontology.orghttp://bioportal.bioontology.org
Annotation service
Process textual metadata to automatically tag text with as many ontology terms as possible.
90 million calls, ~700 GB of data90 million calls, ~700 GB of data
Resource index
Pubmed AbstractsPubmed Abstracts
Adverse Events (AERS)Adverse Events (AERS)
GEOGEO
::
Clinical TrialsClinical Trials
Drug BankDrug Bank
Won 1st prize at the 2010 Semantic Web Challenge @ ISWC
Won 1st prize at the 2010 Semantic Web Challenge @ ISWC
Creating Lexicons
Term – 1:::Term – n
Sentence in Clinical Note – 1:::Sentence in Clinical Note – m
tf df NN JJ … VP …
ID Term-1 150,879 90,000 0.90 0.05 … 0.03 …
ID :
ID Term-n
Syntactic typesSyntactic typesFrequencyFrequency
Frequency counter
Annotation AnalyticsAnnotation Analytics
Analyzing tagged data for hypothesis generation in bioinformatics
Genome
Generic GO based analysis routineGeneric GO based analysis routine
Reference set
Study Set• Get annotations for each
gene in a set
• Count the occurrence of each annotation term in the study set
• Count the occurrence of that term in some reference set (whole genome?)
• P-value for how surprising their overlap is.
Annotation Analytics Landscape
SNOMED-CT
Gene Ontology
Gene Sets
NCIT
ICD-9
Human Disease
Cell Type
MeSH
Drugs, Chemicals
Grant Sets
Paper Sets
Patient Sets
Drug Sets
: ??
Health Indicator Warehouse datasets
Open questions
1. Can we use something other than the GO?
2. Lack of annotations—even today, roughly 20% of genes lack any GO annotation.
3. Annotation bias—annotation with certain ontology terms is not independent of each other.
4. Lack of a systematic mechanism to define a level of abstraction.
Profiling a set of Aging genes
Disease Ontology
~ 30% of genome
261 Age-related genes
Genome
Using ontologies other than GO
ERCC6 nucleoplasmPARP1 protein N-terminus bindingERCC6 nucleoplasmPARP1 protein N-terminus binding
ERCC6 <disease term?> PARP1 <disease term?>ERCC6 <disease term?> PARP1 <disease term?>
ERCC6 GO:0005654 PMID:16107709ERCC6 GO:0008094 PMID:16107709PARP1 GO:0047485 PMID:16107709ERCC6 GO:0005730 PMID:16107709PARP1 GO:0003950 PMID:16107709
http://www.geneontology.org/GO.downloads.annotations.shtmlhttp://www.geneontology.org/GO.downloads.annotations.shtml
Enrichment Analysis with the DO
www.ncbi.nlm.nih.gov/pubmed/16107709www.ncbi.nlm.nih.gov/pubmed/16107709
NCBO Annotator:http://bioportal.bioontology.orgNCBO Annotator:http://bioportal.bioontology.org
{ERCC6, PARP1} PMID:16107709{ERCC6, PARP1} PMID:16107709
{ERCC6, PARP1} {Cockayne syndrome, DNA damage}{ERCC6, PARP1} {Cockayne syndrome, DNA damage}
Annotation Analytics on EMR dataAnnotation Analytics on EMR data
Analysis of tagged data from electronic health records
Profiling patient setsProfiling patient sets
86k patient Reports
ICD9 789.00 (Abdominal pain, unspecified site)
Patient records processed from U. Pittsburgh NLP Repository with IRB approval.
Annotation (Clinical Text)
Term – 1:::Term – nSyntactic typesSyntactic types
FrequencyFrequency
Term recognition tool NCBO Annotator
Term recognition tool NCBO Annotator
NegEx Patterns
NegEx Patterns
NegEx Rules – Negation detection
NegEx Rules – Negation detection
P1 ICD9 ICD9 ICD9 ICD9 ICD9 ICD9
P1 T1, T2, no T4
… T5, T4, T3
… T4, T3, T1
T8, T9, T4
… T6, T8, T10
T1, T2, no T4
P2
P2
P3
P3
:
:
Pn
PnTerms form a temporal series of tags
Coh
ort
of
Inte
rest
DiseasesDiseases
ProceduresProcedures
DrugsDrugs
BioPortal – knowledge graph
Creating clean lexicons
Annotation Workflow
Furt
her A
naly
sis
Text clinical note
Terms Recognized
Negation detection
Generation of tagged dataGeneration of tagged data
ROR of 2.058, CI of [1.804, 2.349]The X2 statistic has p-value < 10-7
ROR=1.524, CI=[0.872, 2.666] X2 p-value = 0.06816.
Detecting the Vioxx Risk SignalDetecting the Vioxx Risk Signal
Vioxx Patients (1,560)
RA Patients (14,079)
MI Patients (1,827)
VioxxMI (339)
p-value < 1.3x10-24
Detecting Adverse Events
Detecting Adverse Events
Linear Space Features Logarithmic Space Features
Drug frequency Drug frequency
Disease frequency Disease frequency
Observed drug-first fraction Observed co-mention count
Drug-first fraction z-score (fixed drug)
Co-mention count z-score (fixed drug)
Drug-first fraction z-score (fixed disease)
Co-mention count z-score (fixed disease)
Detecting Adverse Events
Detecting Off-label useDetecting Off-label use
Annotation Analytics Landscape
SNOMED-CT
Gene Ontology
Gene Sets
NCIT
ICD-9
Human Disease
Cell Type
MeSH
Drugs, Chemicals
Grant Sets
Paper Sets
AgingAging
Patient Sets
Drug Sets
:
EMRsEMRs
What questions
can we ask?
What questions
can we ask?
Health Indicator Warehouse datasets
Associations and outcomes
Gene Disease Drug Device Procedure Environment
Gene
Disease
Drug
Device
Procedure
Environment
Side effects
Off-label Indications
Enrichment
What questions
can we ask?
What questions
can we ask?
Acknowledgements
• Paea LePendu• Yi Liu• Srinivasan Iyer• Steve Racunas• Anna Bauer-Mehren• Clement Jonquet• Rong Xu
• Mark Musen• NIH – NCBO funding• Mayo Team
• Hongfang Liu• Stephen Wu
• Sylvia Holland• Alex Skrenchuk
Mining Annotations of Grants, Publications
Grants from 1972 to 2007 30 funding agencies
Publications from MedlineOnly “Journal articles”
Sponsorship and AllocationSponsorship and Allocation
Who funds whatWho funds what