Medical Information Retrieval Workshop Keynote (MedIR@SIGIR2014)
Practice-based Evidence in Medicine: Where Information Retrieval Meets Data Mining
Karin M. Verspoor
Department of Computing and Information Systems
Health and Biomedical Informatics Centre
The University of Melbourne
08 July 2014
Myriad problems, Myriad solutions
genomics
clinical decision support
furthering biological knowledge
empowering patients
novel treatments
Evidence-based medicine
“Best available clinical evidence” = randomized clinical trials
How physicians are taught EBM
• Practice / Other triggers
• Formulate a question
• Select key article(s)
• Critically appraise the key article(s)
• Make a decision
• Integrate decision into practice
• Evaluate impact of decision
From EBM curriculum, U. of Ottawa http://www.med.uottawa.ca/sim/data/EBM_Intro_slides.ppt
Formulate a question
Select key article(s)
That sounds familiar, right?
How to use the evidence base
• Critical appraisal
– What are the study results?
– Are the study results valid?
• Will the results help in caring for my patient?
– Were all clinically important outcomes reported?
– Are likely treatment benefits greater than potential harms?
– Can the results help my patient?
IR in Evidence-based medicine
• Identify articles relevant to a clinical question.
• Identify clinical elements of the literature.
– PICO: Population/Patient, Intervention/Indicator, Comparator/Control, Outcome
• Support a systematic review of the clinical literature.
Lots of opportunities for IR here. But I won’t say much more about literature mining.
Limitations of EBM
• Clinical variability
• Biological variability
• Randomized controlled trials
– Undertaken under controlled conditions
– Applicability to patient not always clear
• Clinical judgement about how the evidence fits the patient
EVIDENCE FITTING PATIENT?
EBM + Clinical Judgement
• Do the results of the study apply to my patient?
• What if my patient mostly but not completely satisfies the inclusion criteria?
http://commons.wikimedia.org/wiki/File%3APregnant_woman.jpg
Natural Experiments
• Randomized Clinical Trials are limited
– By design (specific inclusion, exclusion criteria)
– By resources (limited patient cohorts recruited)
• How do the results generalize?
http://commons.wikimedia.org/wiki/File%3ABig_Day_Out_(8392285402).jpg
Natural Experiments
• What we would really like to do is to study large populations of people
– to identify side-effects or interactions that appear when a treatment is provided to 1000s or 10,000s of people rather than 100s
– to explore what characteristics of individuals are ultimately responsible for a positive/negative outcome
Evidence deriving from clinical practice “in the wild” rather than controlled studies
practice-based evidence
Whence the Evidence?
Mining electronic health records: towards better research applications and clinical care. Peter B. Jensen, Lars J. Jensen & Søren Brunak, Nature Reviews Genetics 13, 395-405 (2012). doi:10.1038/nrg3208
Electronic Health Records
EHRs facilitate better care
• More complete picture of the patient and ongoing patient history
– understand trends in vital signs
– track allergies
• Integrate billing, pharmacy, radiology, laboratory
• Streamlined clinical workflow: Share lab results, imaging, specialist assessments, etc. directly
• Better tracking of prescription/test orders
• Fewer medical errors / possibility for error checking
• Clinical decision support
EHRs enable analysis
• Analyse trends in the effectiveness of treatments
• Investigate the efficacy of medications in patients with co-morbidities
• Are we seeing evidence of an emerging epidemic?
• Outcomes research:
– why are patients in one geographic region having higher rates of cancer recurrence than in another region?
– do patients with diabetes experience higher rates of hearing loss?
Clinical Text
About 80% of clinical information is in textual form
– ED triage notes
– Clinical progress notes
– Radiology and Pathology reports
– GP and specialist letters
– Discharge summaries
– Medicare claims
– Registry data
– Literature: Medical articles
WORKING WITH EHR
“This system is designed for physicians to point and click their way through an entire exam quickly and effortlessly.” (EMR product review)
Electronic Health Records
Variations in unstructured text
1 Tablet(s) PO Daily
1 tab by mouth or orally daily
1 tab orally every 24 hours.
1 tab(s) PO (oral) qDay
1 tab(s) orally once a day.
1 tabs QD
1.0 tab po qd
ONE TABLET; ORAL QD
One orally daily
One tablet po daily
TAKE 1 TABLET DAILY
TAKE ONE PO QD
Take 1 Tab by mouth daily.
Take 1 tab daily daily orally
Take 1 tab daily orally
Take 1 tab po qday
Take 1 tab qd po
Take 1 tab qday PO
Take 1 tab(s) daily orally
Take 1 tablet by mouth daily.
Take 1 tablet orally Daily
Take 1 tablet orally every day
Take one orally daily
Take one orally daily as discussede
Take one tablet by mouth daily
Take one tablet by mouth every day
Take one tablet daily
Take one tablet once per day orally
Take one tablet po qd
by mouth one po qd
one orally once a day
one orally per day
one tablet by mouth daily
one tablet daily
one tablet once a day
take 1 tab po daily
take 1 tab po qd
take one orally each day
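The variants above all express the same instruction. As a minimal sketch of how such strings might be reduced to one canonical form, the Python below uses three small regular expressions. The pattern lists and the function name `normalise_sig` are assumptions written for this illustration; they cover only the phrasings shown on the slide and are not taken from any system described in the talk.

```python
import re

# Hypothetical lookup patterns, covering only the variants listed on the slide.
NUMBER_WORDS = {"one": 1.0, "two": 2.0}
AMOUNT_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?|one|two)\b")
ORAL_PATTERN = re.compile(r"\b(po|oral|orally|by mouth)\b")
DAILY_PATTERN = re.compile(
    r"\b(qd|qday|daily|once a day|once per day|every day|each day|per day|every 24 hours)\b"
)


def normalise_sig(text):
    """Map a free-text dosing instruction to a small canonical record."""
    lowered = text.lower()
    amount = None
    amount_match = AMOUNT_PATTERN.search(lowered)
    if amount_match:
        token = amount_match.group(1)
        amount = NUMBER_WORDS[token] if token in NUMBER_WORDS else float(token)
    return {
        "amount": amount,
        "route": "oral" if ORAL_PATTERN.search(lowered) else None,
        "frequency": "daily" if DAILY_PATTERN.search(lowered) else None,
    }
```

Variants that omit the route (for example "one tablet daily") come back with `route` set to `None` rather than a guessed value.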
Structuring knowledge
• Information Extraction
– Of Entities
– Of Concepts
– Of Relations
Extract structured content from unstructured text
Take ONE to TWO tablets a day when required
Rx_Dosage: AmtMin: 1 AmtMax: 2 AmtUnit: tablet
Rx_Frequency: PerWdwDays: 1 DosesPerWdw: 1
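A rough sketch of the dosage part of this extraction, assuming a single regular expression is enough for the example sentence. The field names `AmtMin`, `AmtMax` and `AmtUnit` come from the slide; the pattern, the word list and the function names are illustrative assumptions, and the `Rx_Frequency` fields are not attempted here.

```python
import re

# Hypothetical, deliberately small vocabulary for this sketch.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4}
AMOUNT = r"(\d+|one|two|three|four)"
DOSAGE_PATTERN = re.compile(
    r"\b" + AMOUNT + r"(?:\s+to\s+" + AMOUNT + r")?\s+(tablet|capsule)s?\b",
    re.IGNORECASE,
)


def to_number(token):
    """Convert a digit string or a number word to an int."""
    token = token.lower()
    return int(token) if token.isdigit() else WORD_TO_NUMBER[token]


def extract_dosage(text):
    """Return the dosage range found in text, or None if nothing matches."""
    match = DOSAGE_PATTERN.search(text)
    if match is None:
        return None
    low = to_number(match.group(1))
    # A single amount ("1 tablet") is treated as a range with equal bounds.
    high = to_number(match.group(2)) if match.group(2) else low
    return {"AmtMin": low, "AmtMax": high, "AmtUnit": match.group(3).lower()}
```

Called on "Take ONE to TWO tablets a day when required", the function returns the three `Rx_Dosage` values shown on the slide.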
Structured vocabularies
• Play a strategic role in providing access to computerized health information because clinicians use a variety of terms for the same concept.
– either “leukopenia” or “low white cell count” might be written in a patient record; usually these are synonyms.
– Without a structured vocabulary, an automated system will not recognize these terms as being equivalent.
• Encode data for exchange, comparison, aggregation
• SNOMED CT (Systematized Nomenclature Of Medicine Clinical Terms): core general terminology for patient data
• ICD (International Classification of Disease): used for diagnosis and procedure data
One EHR fits all?
• EHRs are used in complex clinical environments.
• Features and interfaces appropriate for one medical specialty (such as pediatrics) may be frustratingly unusable in another (such as the intensive care unit).
• The data presented, the format, the level of detail, the order of presentation may need to be different.
Remember the user
“Clinical IT projects are complex social endeavors in unforgiving clinical settings
that happen to involve computers, as opposed to IT projects that happen to
involve doctors.” -- Scot M. Silverstein, MD, Drexel University
The Clinical Narrative
“...In years past, a well-written history and physical, or progress note, would unfold like a story, giving a vivid description of the patient’s symptoms and physical exam at the point of the encounter, as well as the synthesis of the data and the plan of care."
“EMRs: Finding a balance between billing efficiency and patient care", Henry F. Smith, Jr., MD, Commentary, The Times Leader, Wilkes-Barre, PA, June 12, 2011.
A typical clinical narrative
April 14, 2007 CHIEF COMPLAINT: Shortness of breath. HISTORY OF PRESENT ILLNESS: This 68-year-old female presents to the emergency department with shortness of breath that has gone on for 4-5 days, progressively getting worse. It comes on with any kind of activity whatsoever. She has had a nonproductive cough. She has not had any chest pain. She has had chills but no fever. EMERGENCY DEPARTMENT COURSE: The patient was admitted. She has had intermittent episodes of severe dyspnea. Lungs were clear. These would mildly respond to breathing treatments and morphine. Her D‐dimer was positive. We cannot scan her chest; therefore, a nuclear V/Q scan has been ordered. However, after consultation with Dr. C, it is felt that she is potentially too unstable to go for this. Given the positive D‐dimer and her severe dyspnea, we have weighed the risks and benefits of anticoagulation with her heme-positive stools. She states that she has been constipated lately and doing a lot of straining. Given the possibility of a PE, it was felt like anticoagulation was very important at this time period; therefore, she was anticoagulated. The patient will be admitted to the hospital under Dr. C.
Identifying clinical terms
details
reasoning
EMERGENCY DEPARTMENT COURSE: The patient was admitted and nontoxic in appearance. Blood pressure was brought down aggressively. With this combined with BiPAP, she has reversed her respiratory distress promptly. She has improved significantly. She will not require intubation at this time period. Her family has elected to go back to M, Dr. W. I did discuss this case with Dr. G who is on call for L Cardiology. She has accepted him in transfer; however, there are no PCU or ICU beds at this time period. Will admit here for a brief period until a bed is available at M. I discussed this case with Dr. R who will admit.
• Clinicians were trying to determine whether the shortness of breath was due exclusively to her failing heart, or whether she has pneumonia.
• Prompt response indicates that pneumonia is not the issue.
Unlocking information in the text
• (Semantic) Information retrieval
– Finding relevant documents, paragraphs related to specified concepts
• Entity recognition
– Identifying relevant and important entities
• Relationship identification
– Understanding underlying language to determine relationships between entities of interest
unexpected associations
new insights
new knowledge
MEDICAL CONCEPT RECOGNITION
MetaMap: UMLS concept annotation
http://www.cvast.tuwien.ac.at/projects/iUMLS
Abstracting linguistic variation
• Terminology mapping tools generalise language variation
• e.g. UMLS Concept C0027497
– nausea
– nauseated
– feels sick
– feeling sick
– queasy
– felt sick
– nauseous
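To illustrate the idea of abstracting over surface variation, here is a minimal dictionary lookup that maps the phrases listed above to the concept identifier from the slide. This is only a sketch: it is not MetaMap and does not reproduce how MetaMap works, and the function name `find_concepts` is an assumption made for the example.

```python
import re

NAUSEA_CUI = "C0027497"  # concept identifier given on the slide

# Surface forms from the slide, all mapped to the same concept.
SURFACE_FORMS = {
    "nausea": NAUSEA_CUI,
    "nauseated": NAUSEA_CUI,
    "feels sick": NAUSEA_CUI,
    "feeling sick": NAUSEA_CUI,
    "queasy": NAUSEA_CUI,
    "felt sick": NAUSEA_CUI,
    "nauseous": NAUSEA_CUI,
}


def find_concepts(text):
    """Return the set of concept identifiers whose surface forms occur in text."""
    lowered = text.lower()
    found = set()
    for phrase, cui in SURFACE_FORMS.items():
        # Word boundaries stop a phrase matching inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(cui)
    return found
```

Two notes that share no words, such as "NAUSEA" and "felt sick", end up with the same concept and can then be counted or retrieved together.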
ICD coding
http://www.zydoc.com/zydoc-extracts-icd-10-codes-from-unstructured-text-with-nlp-driven-cac/
NegEx: identify negated concepts
http://healthinformatics.wikispaces.com/NegEx+Algorithm
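The core idea behind NegEx can be sketched as a search for a negation trigger in a short window before a concept mention, stopping at words that end the negation's scope. The version below is heavily simplified: the trigger list, the terminator list and the window of five tokens are assumptions chosen for this illustration, and the real algorithm uses much larger phrase lists and also handles triggers that follow the concept.

```python
import re

# Simplified, hypothetical word lists; NegEx itself uses far larger ones.
NEGATION_TRIGGERS = {"no", "not", "denies", "without"}
SCOPE_TERMINATORS = {"but", "however"}
WINDOW = 5  # how many tokens before the concept are inspected


def is_negated(sentence, concept):
    """Return True if a negation trigger precedes the concept within the window."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = concept.lower().split()
    size = len(concept_tokens)
    for start in range(len(tokens) - size + 1):
        if tokens[start:start + size] != concept_tokens:
            continue
        # Walk backwards from the concept; a terminator ends the search.
        for back in range(start - 1, max(start - WINDOW, 0) - 1, -1):
            if tokens[back] in SCOPE_TERMINATORS:
                break
            if tokens[back] in NEGATION_TRIGGERS:
                return True
    return False
```

On the sentence "She has had chills but no fever." from the narrative above, "fever" is flagged as negated and "chills" is not.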
Classification framework
• Training set: notes + labels for classes of interest (e.g. ICD-10 codes)
• Language processing
• Words, phrases, linguistic categories; names of entities; domain concepts; document features
• Biomedical knowledge sources: UMLS (SNOMED CT, ICD)
• Machine learning algorithm
• Model: relating features of the text to classes of interest
SYNDROMIC SURVEILLANCE FROM CLINICAL NOTES
SynSurv
• SynSurv
– Victorian Department of Health pilot syndromic surveillance program
– Detection of outbreaks based on ICD-10 diagnostic codes and presenting complaints as captured in free text notes
Our focus: Extracting information from unstructured free text to enable subsequent analysis and monitoring
Objectives of our project
• To enable surveillance directly from notes; integration into natural workflow of ED
• To support higher sensitivity and higher precision than keyword-based methods
Emergency Department triage notes
• Free text notes
– written by triage nurse upon assessment in the Emergency Department
– captures presenting symptoms and complaints of a patient
CENTRAL CHEST DISCOMFORT WHILE EATING, RADIATING TO ARMS. PPM INSERTED 2/52 AGO. PAIN FREE O/A. HR72, BP160
FEBRILE ILLNESS FLU LIKE SYMPTOMS NAUSEA
L BASAL GANGLIAN BLEED POST COLLAPSE, NON VERBAL, EYES SPON OPENED, HYPERTENSIVE, P 70REG, PEARL, PMX CEREBRAL BLEED
SynSurv data characteristics
• 918,330 records
• 730,054 records with ICD-10 diagnosis
• 456,213 records with note text
• 316,362 records with ICD-10 diagnosis and note text
Two sets of Experiments
• Given a free text note,
– Predict the ICD-10 code(s) for the note
– Predict a syndromic group, based on pre-defined sets of ICD-10 codes of interest
Predicting ICD-10 codes
• Approaches
– Baseline strategy
• direct detection of ICD-10 terms in triage notes
– Augmented baseline
• direct detection of SNOMED-CT terms in notes
• map to ICD-10 codes via reference mapping
– Machine learning
• Build a set of binary classifiers; one yes/no classifier per ICD-10 code
• Experiment with different features and different learning algorithms
Predicting ICD-10 codes (Results)
• Direct term matching strategy outperformed by machine learning
– Performance difference between micro-average and macro-average indicates that some ICD-10 codes are underrepresented in the data, and cannot be modeled well
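The gap between micro- and macro-averaged scores can be made concrete with a small calculation. In the sketch below the per-code counts are invented purely for illustration (they are not SynSurv results), and the labels `frequent_code` and `rare_code` are placeholders rather than real ICD-10 codes.

```python
def f1(tp, fp, fn):
    """F1 score from true positive, false positive and false negative counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(counts):
    """counts maps each class label to a (tp, fp, fn) tuple."""
    # Macro: average the per-class scores, so every class weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent classes dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Invented counts: one well-represented code and one rare code.
EXAMPLE_COUNTS = {
    "frequent_code": (90, 10, 10),
    "rare_code": (1, 3, 5),
}
```

With these invented counts the micro-average stays high because the frequent code dominates the pooled totals, while the macro-average is pulled down by the poorly modelled rare code.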
Predicting Syndromic Groups
• Task
– Syndromic groups are defined by sets of ICD-10 codes, e.g. Flu like group
Syndrome distribution
• Data
– 6 groups with a reasonable number of examples
– Large imbalance between yes/no classes
Predicting Syndromic Groups (Approach)
– Machine learning
• Build a set of binary classifiers; one yes/no classifier per Syndromic Group
• Experiment with different features and different learning algorithms
• Incorporate ICD-10 and SNOMED term recognition in pre-processing (to generalise over linguistic variation)
Predicting Syndromic Groups (Detailed Results)
Syndromic Group Expansion Results
• Improve syndromic group definitions by adding related ICD-10 codes to the provided definitions
• Done using a data-driven strategy
– Look for ICD-10 codes with similar records
– Compare groups of records based on cosine similarity
– Select ICD-10 codes from the most similar ones with relevant records
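The comparison step can be sketched as follows, assuming each group of records is collapsed into a single bag-of-words vector before cosine similarity is computed. The slides do not say how records were represented, so the representation, the function names and the placeholder codes in the usage note are all assumptions.

```python
import math
from collections import Counter


def bag_of_words(notes):
    """Collapse a list of note strings into one term-count vector."""
    counts = Counter()
    for note in notes:
        counts.update(note.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidate_codes(group_notes, notes_by_code):
    """Rank candidate codes by similarity of their notes to the group's notes."""
    group_vector = bag_of_words(group_notes)
    scored = [
        (cosine_similarity(group_vector, bag_of_words(notes)), code)
        for code, notes in notes_by_code.items()
    ]
    return sorted(scored, reverse=True)
```

For example, `rank_candidate_codes(["febrile illness flu like symptoms nausea"], {"CODE_A": ["flu like symptoms with cough"], "CODE_B": ["fish bone in throat"]})` places the placeholder `CODE_A` first, since only its notes share terms with the group.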
Syndromic Group Expansion Results (Aggregate results)
• Results for SynSurv_Acute_respiratory, SynSurv_Diarrhoea and SynSurv_Flu_like_illness
Issues for low performance
• Inconsistency in ICD-10 annotation
– ? FISH BONE IN THROAT J03
– ? FISH BONE IN THROAT T18
– ? FISH BONE IN THROAT T18
– ? FISH BONE IN THROAT S10.9
– ? FISH BONE IN THROAT J02.0
• Notes not related to the patient's visit
– DIRECT ADMISSION FROM BAIRNSDALE TO 3S BED 25
• Typos in the notes text
– ? FIH BONE IN THROAT
Integrating with Syndromic Surveillance framework
• Input to the BioSurv system
– Trained machine learning models used as input to BioSurv (e.g., C2 algorithm)
– Prediction probability > 0.5
Model → Predicted classification (label) for flu-like illness: Yes / No
Yes → BioSurv count +1
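The thresholding step can be sketched in a few lines. The sketch assumes each note has already been scored by a trained model and arrives as a (day, probability) pair; that input format and the function name are assumptions, while the 0.5 threshold is the one stated on the slide.

```python
from collections import Counter

THRESHOLD = 0.5  # prediction probability cut-off from the slide


def daily_counts(predictions, threshold=THRESHOLD):
    """Count, per day, the notes whose predicted probability exceeds the threshold.

    predictions is an iterable of (day, probability) pairs.
    """
    counts = Counter()
    for day, probability in predictions:
        if probability > threshold:
            counts[day] += 1
    return counts
```

The resulting per-day counts are the kind of time series an aberration detection method consumes.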
Example: Flu like syndrome NLP notes annotation
• Records with no ICD-10 codes in the database are now available to SynSurv
• 730,054 out of 918,330 records with ICD-10 codes
C2 algorithm: ICD-10 vs NLP
• Earlier alert time using NLP methods
ICD-10 NLP
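For readers unfamiliar with C2, the sketch below follows the method as it is commonly described for the EARS family of detectors: the count for a given day is compared with the mean and standard deviation of a 7-day baseline that ends 2 days earlier, and an alert is raised when the standardised difference exceeds 3. The BioSurv configuration is not given in the slides, so every parameter here, including the floor on the standard deviation, is an assumption.

```python
import statistics

# Assumed parameters, following common descriptions of EARS C2.
BASELINE_DAYS = 7
GUARD_DAYS = 2
ALERT_THRESHOLD = 3.0
MIN_SD = 0.5  # hypothetical floor to avoid dividing by a zero deviation


def c2_statistic(counts, day_index):
    """Standardised difference between a day's count and its lagged baseline.

    Returns None when there is not enough history for a full baseline.
    """
    baseline_end = day_index - GUARD_DAYS
    baseline_start = baseline_end - BASELINE_DAYS
    if baseline_start < 0:
        return None
    baseline = counts[baseline_start:baseline_end]
    mean = statistics.mean(baseline)
    sd = max(statistics.stdev(baseline), MIN_SD)
    return (counts[day_index] - mean) / sd


def is_alert(counts, day_index):
    """True when the C2 statistic for the day exceeds the alert threshold."""
    statistic = c2_statistic(counts, day_index)
    return statistic is not None and statistic > ALERT_THRESHOLD
```

Because the detector only sees daily counts, it can run unchanged on counts derived from ICD-10 codes or on counts derived from NLP predictions, which is what makes the alert-time comparison on this slide possible.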
RETRIEVE DISEASE-RELEVANT CLINICAL RECORDS
Disease Recognition from Clinical Reports
• Task: classify records according to specified disease
– Enables retrieval of specific cases
– Detect patterns of disease occurrence
– Support creation of patient cohorts
– Prelude to automated ICD-encoding
• Disease: Lung Cancer
– Identified by ICD-10 code
• C34: Malignant neoplasm of bronchus and lung
Text mining for lung cancer cases over large patient admission data (2014) Martinez D, Cavedon L, Alam Z, Bain C, Verspoor K. HISA Big Data 2014, CEUR vol. 1149. http://ceur-ws.org/Vol-1149/bd2014_cavedon.pdf
Alfred Health (Melbourne)
Method
• Data: radiology reports for 2 (financial) years (2011-2013) extracted from REASON platform
– 756,502 reports, plus associated metadata
• Each report linked to an admission record
• Metadata: ICD-10 (manually assigned) used as ground truth; demographics, reason for admission, etc.
– Data pre-processed to remove ICD-10 codes and extract features
• Challenge: real distribution highly skewed
– only 0.8% of data are positive for lung cancer
Method
• Features:
– Bags-of-Words from report text
– Bags-of-Phrases identified by MetaMap
– Negative context identified by NegEx
– Metadata from admission record
• Name, Dob, Sex, MaritalStatus, Religion
• AdmissionReason, AdmissionUnit, AdmissionType
• Allergies, DrugCode, DrugDesc
• ...
Method
• Machine learning algorithms
– Support Vector Machines
– Correlation-based feature selection filter
• Baseline: keyword-based approach
“lung cancer”, “lung malignancy”, “lung malignant”, “lung neoplasm”, “lung tumour”, “lung carcinoma”
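The keyword baseline is simple enough to write out. The sketch below assumes case-insensitive substring matching over the report text; the keyword list is the one on the slide, while the matching details and the function name are assumptions.

```python
# Keyword list from the slide.
KEYWORDS = [
    "lung cancer",
    "lung malignancy",
    "lung malignant",
    "lung neoplasm",
    "lung tumour",
    "lung carcinoma",
]


def keyword_baseline(report_text):
    """Label a report positive if any keyword occurs in it, ignoring case."""
    lowered = report_text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)
```

Plain matching of this kind also fires on negated mentions such as "no evidence of lung cancer", which is one reason a baseline like this can be expected to lose precision.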
Results
Evaluation: stratified 10-fold cross-validation
Classifier                               Precision   Recall   F-score
Text features only                       0.855       0.800    0.825
Full feature set (including metadata)    0.871       0.820    0.843
Term-matching baseline                   0.643       0.742    0.689
* Results not using feature selection, which reduced performance
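Stratified cross-validation matters here because so few reports are positive: a purely random split could leave some folds with almost no lung cancer cases. The sketch below shows one simple way to assign records to folds so that each fold keeps roughly the overall class ratio. It is an illustration of the general technique, not the procedure used in the study, and the function name and seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split record indices into k folds with similar class proportions."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds
```

Each fold is then held out once for evaluation while the remaining folds are used for training.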
Temporal variation
Related Applications
Fungal infection surveillance by classifying CT scan reports
Extracting information from pathology reports
Work by Lawrence Cavedon, David Martinez, and others at NICTA in recent years
Martinez et al (2014) Cross-hospital portability of information extraction of cancer staging information. AI in Medicine.
DATA MINING
Making Sense of clinical data
Data analysis
Once the data is structured, anything is possible
• Association rule mining
• Clustering
• Machine learning
• Hypothesis testing
• Statistical analysis
• Etc.
Clustering patients
• Patients represented as sets of features
• Features could include any aspect of their profile
– demographic
– clinical
– treatment
• drugs
• devices
– genomic
– environmental
– nutritional
– etc.
Roque et al. (2011) Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts. PLoS Comput Biol 7(8): e1002141. doi:10.1371/journal.pcbi.1002141
Aggregating resources
Electronic Patient-Reported Data Capture as a Foundation of Rapid Learning Cancer Care
Abernethy, Amy P.; Ahmad, Asif; Zafar, S. Yousuf; Wheeler, Jane L.; Reese, Jennifer Barsky; Lyerly, H. Kim. Medical Care. 48(6):S32-S38, June 2010. doi: 10.1097/MLR.0b013e3181db53a4
Pharmacovigilance
• Mining of clinical records to identify adverse drug events
– Estimated >90% of adverse events do not appear in coded data
– Transform patient records into patient-feature matrix encoded using clinical terminologies
LePendu et al. (2013) “Pharmacovigilance Using Clinical Notes” Clinical Pharmacology & Therapeutics 93(6), 547–555; doi: 10.1038/clpt.2013.47
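A patient-feature matrix of the kind mentioned above can be sketched as a binary table with one row per patient and one column per concept. The sketch assumes concept recognition has already produced a set of identifiers for each patient; the function name and the identifiers in the usage note are placeholders, and this is not a description of the pipeline in the cited paper.

```python
def build_patient_feature_matrix(concepts_by_patient):
    """Turn {patient_id: set of concept ids} into a binary matrix.

    Returns (patients, features, matrix), where matrix[i][j] is 1 if
    patient i has feature j recorded, and 0 otherwise.
    """
    patients = sorted(concepts_by_patient)
    features = sorted(
        {concept for concepts in concepts_by_patient.values() for concept in concepts}
    )
    matrix = [
        [1 if feature in concepts_by_patient[patient] else 0 for feature in features]
        for patient in patients
    ]
    return patients, features, matrix
```

For instance, `build_patient_feature_matrix({"p1": {"DRUG_X", "EVENT_Y"}, "p2": {"DRUG_X"}})` yields rows `[1, 1]` and `[1, 0]` over the columns `["DRUG_X", "EVENT_Y"]`, after which co-occurrence of a drug and an event can be counted column against column.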
A “Phenotypic code” for complex disease
• Simple and complex diseases appear to share a genetic architecture
• Mining of co-morbidities of complex diseases and Mendelian diseases with known genetic cause identifies a ‘code’ for each complex disease in terms of Mendelian genetic loci.
• Evidence of epistasis among the Mendelian variants (superlinear complex disease risk)
Blair et al. Cell (2013); 155 (1); 70-80. http://dx.doi.org/10.1016/j.cell.2013.08.030
Personalised Medicine
• Look for evidence in text for:
– (classification) Which patients have not responded or had a toxicity event?
– (prediction) Which patients are likely to respond to the drug?
– (interpretation) Why did some patients respond well?
Rapid Learning Healthcare
Electronic Patient-Reported Data Capture as a Foundation of Rapid Learning Cancer Care
Abernethy, Amy P.; Ahmad, Asif; Zafar, S. Yousuf; Wheeler, Jane L.; Reese, Jennifer Barsky; Lyerly, H. Kim. Medical Care. 48(6):S32-S38, June 2010. doi: 10.1097/MLR.0b013e3181db53a4
Conclusions
• We are at the beginning of a transition from evidence-based medicine to practice-based evidence
– Prediction: factors in disease and effective treatment
– Detection: observables indicating disease
– Prevention: what factors circumvent those related to prediction
• Enabled by increasing roll-out of EHR and HI systems
• Linked hospital data allows multiple sources to be leveraged for complex analytic tasks
• Text is a major and important part of the clinical record
Many data structuring and mining problems in the clinical context can be treated as retrieval problems.
Impact of Informatics for Biomedicine
Advancing the science of medicine
Improving the effectiveness of
healthcare
© Copyright The University of Melbourne 2014