Leveraging Text Classification Strategies for Clinical and Public Health Applications

Karin M. Verspoor, @karinv, The University of Melbourne, Melbourne, Victoria, Australia. January 2016, Qatar Computing Research Institute


Page 1: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Leveraging Text Classification Strategies for Clinical and Public Health Applications

Karin M. Verspoor

@karinv, The University of Melbourne, Melbourne, Victoria, Australia

January 2016, Qatar Computing Research Institute

Page 2: Leveraging Text Classification Strategies for Clinical and Public Health Applications

(clinical) Data everywhere

• Electronic health records
– Patient demographics and biometrics
– Laboratory test results
– Clinical notes

• Radiology and pathology
– Images: X-ray, MRI and PET scans
– (Synoptic) reports

• Databases
– Health Service reporting
– National Prescribing Service
– Registry data, Births and Deaths
– Medicare/insurance claim data, etc.

Page 3: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Don’t forget unstructured data!

• About 80% of clinical information is in textual form
– ED triage notes
– Clinical progress notes
– Radiology and Pathology reports
– GP and specialist letters
– Discharge summaries

• Published Literature
– Clinical Trials
– Molecular-level studies

• and … social media text!

Page 4: Leveraging Text Classification Strategies for Clinical and Public Health Applications

How is text used in medicine?

• Direct analysis of clinical records
– Information retrieval for clinical trials
– Syndromic surveillance
– Hospital Services Research
– Clinical Decision Support
– Pharmacovigilance

• Literature mining
– Evidence-based medicine
– Systematic Reviews

Page 5: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Evidence from EHRs

Mining electronic health records: towards better research applications and clinical care. Peter B. Jensen, Lars J. Jensen & Søren Brunak, Nature Reviews Genetics 13, 395-405 (2012). doi:10.1038/nrg3208

Page 6: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Pharmacovigilance from EHRs

Mining of clinical records to identify adverse drug events

Estimated >90% of adverse events do not appear in coded data

LePendu et al. (2013) “Pharmacovigilance Using Clinical Notes” Clinical Pharmacology & Therapeutics 93(6), 547–555; doi: 10.1038/clpt.2013.47

Page 7: Leveraging Text Classification Strategies for Clinical and Public Health Applications

… from social media

Pacific Symposium on Biocomputing Shared Task on Social Media Mining

Classification of tweets: mention an Adverse Drug Reaction?

ADR classified: @NAME Q makes me hungry. Olanzapine made me want to eat my own arm!

Non-ADR classified: I couldnt be a chef without nicotine and caffeine

Page 8: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Outline: Problem setting; Approach and Results

EHR disease classification

Kocbek et al (2015) Evaluating classification power of linked admission data sources with text mining; Proceedings of the Scientific Stream at Big Data in Health Analytics 2015 (BigData 2015).

Page 9: Leveraging Text Classification Strategies for Clinical and Public Health Applications

ICD classification of EHR data

• We address the task of detecting clinical records in a large record system corresponding to a given diagnosis of interest, based on text analysis

• We focus on lung cancer records for a pilot study

• We developed a system that classifies each admission as positive or negative for lung cancer

• Not as simple as looking for “lung cancer” or synonyms in the EHRs!

Kocbek et al (2015) HISA Big Data conference. http://ceur-ws.org/Vol-1468/bd2015_kocbek.pdf

Page 10: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Alfred REASON platform

Kocbek et al. Big Data 2015, Sydney

• 15+ years of data from.
• 171,000+ updates each day.
• 62.4 million updates per annum.

Page 11: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Radiology question

50yo complaining of left shoulder pain. Tender generally. Difficulty abducting the shoulder past 45 degrees. Home on HITH tomorrow - either inpatient or outpatient please

Task

Radiology report

Mobile Chest performed on 02-JUN-2012 at 08:27 AM: The nasogastric tube has its tip in the stomach. The tracheostomy is seen at T2 level. ….

Pathology report

Urine Culture Acc No: 12-183-0731 Source: Urine
------------ URINE MICROSCOPY (PHASE CONTRAST) -------------
Leucocytes x10^6/L (Ref <10).... <10
Erythrocytes x10^6/L (Ref <10).. <10.......

Additional data

Age: 50; Date of admission: Jun/12; Gender: F; Country: …

Admission

ICD-10 code

Page 12: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Data Characteristics

• Extracted data for 2 financial years, 2012-2014:
– 150,521 admissions,
– 40,800 radiology reports with associated question,
– 20,872 pathology reports,
– 121,700 additional data entries (demographics, hospital admission info).

• Admissions are associated with ICD-10 codes:
– Used as ground truth
– ICD-10 code C34.*: positive cases for lung cancer
– 496 such positive admissions
– an additional 496 non-lung cancer admissions randomly subsampled as negatives
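The balanced sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `(admission_id, icd10_code)` record layout, the function names and the fixed seed are all assumptions made for the example.

```python
import random


def is_lung_cancer(icd10_code):
    """Return True for any code in the C34.* family (e.g. 'C34.1')."""
    return icd10_code.upper().startswith("C34")


def balanced_sample(admissions, seed=0):
    """Keep every positive admission and an equal-sized random sample of negatives.

    `admissions` is a list of (admission_id, icd10_code) pairs; this layout
    is hypothetical and only serves the example.
    """
    positives = [a for a in admissions if is_lung_cancer(a[1])]
    negatives = [a for a in admissions if not is_lung_cancer(a[1])]
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives, sampled
```

With 496 positive admissions, this would return 496 positives and 496 randomly chosen negatives, matching the counts on the slide.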

Page 13: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Outline: Problem setting; Approach and Results

EHR disease classification

Page 14: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Research Question

• Most previous TM applications use a single textual data source from the EHR despite a diversity of potential data

• What is the impact of using more than one textual data source for the EHR classification task?
– Considering different text sources;
– and including patient (structured) meta-data?

Page 15: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Methods

[Diagram: the REASON sources (radiology reports, radiology questions, pathology reports, additional data) pass through language processing, supported by biomedical knowledge sources, to give textual and other features; a machine learning algorithm (SVM) uses these features to build the classification model]

Page 16: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Text Processing

• Medical terminology recognition and normalisation using MetaMap

• NegEx to detect negation and negation scope

The nasogastric tube has its tip in the stomach.
Meta Candidates (Total=9; Excluded=0; Pruned=0; Remaining=9)
1000 C0085678:Nasogastric tube [Medical Device]
1000 C0812428:Nasogastric tube (Nasogastric tube procedures) [Therapeutic Procedure]
861 C0175730:Tube (biomedical tube device) [Medical Device]
861 C0694637:Nasogastric (Nasogastric Route of Drug Administration) [Functional Concept]
861 C1547937:Tube NOS (Specimen Source Codes - Tube) [Intellectual Product]
861 C1561954:tube [Conceptual Entity]
861 C1704730:TUBE (Packaging Tube) [Medical Device]
861 C1704731:Tube (Tube Device Component) [Medical Device]
861 C3282907:Nasogastric [Body Location or Region]

Meta Mapping (1000):
1000 C0085678:Nasogastric tube [Medical Device]
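Candidate lines of the form shown above (score, concept identifier, name, semantic type) could be turned into structured records roughly as follows. This is a sketch that assumes the human-readable layout shown on this slide; it does not call MetaMap itself, and the regular expression and function name are illustrative only.

```python
import re

# One candidate line: "<score> <CUI>:<name> [<semantic type>]"
CANDIDATE = re.compile(
    r"^\s*(?P<score>\d+)\s+(?P<cui>C\d{7}):(?P<name>.+?)\s+\[(?P<semtype>[^\]]+)\]\s*$"
)


def parse_candidate(line):
    """Parse one candidate line into a dict, or return None if it does not match."""
    match = CANDIDATE.match(line)
    if match is None:
        return None
    return {
        "score": int(match.group("score")),
        "cui": match.group("cui"),
        "name": match.group("name"),
        "semtype": match.group("semtype"),
    }
```

For example, `parse_candidate("1000 C0085678:Nasogastric tube [Medical Device]")` yields the concept identifier `C0085678` with score 1000, while header lines such as `Meta Candidates (...)` return `None`.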

Page 17: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Features

Texts
• bag of (MetaMap) phrases
– separate feature for Positive/Negative context
– experimented with keeping phrases separated according to source, or merging across sources

Patient meta-data
• demographic data (gender, age, ethnic origin, country, language, marital status, religion, and death date)
• hospital-related admission data (hospital code, admission date and time, discharge date and time, length of stay, reason for admission, admission unit, discharge unit, admission type, source, destination and criteria)
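One way to picture the text features is as counts keyed by concept, polarity and (optionally) source. The sketch below is an assumption about how such features might be encoded, not the system's actual representation; the `(source, cui, negated)` triples and the feature-name format are hypothetical.

```python
from collections import Counter


def phrase_features(concepts, merge_sources=False):
    """Build a bag-of-phrases feature vector from recognised concepts.

    `concepts` is an iterable of (source, cui, negated) triples
    (hypothetical layout). Each concept gets a separate feature for
    positive and negated context; when `merge_sources` is False the
    feature name also carries the data source it came from.
    """
    features = Counter()
    for source, cui, negated in concepts:
        polarity = "NEG" if negated else "POS"
        prefix = "" if merge_sources else source + ":"
        features[prefix + polarity + ":" + cui] += 1
    return features
```

With `merge_sources=True`, the same concept seen in a radiology report and a pathology report contributes to one aggregate feature; with `merge_sources=False` it produces two source-specific features.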

Page 18: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Experimental setting

• Heavily skewed data: undersampling of negatives

• 10-fold cross validation
• Support Vector Machine (Weka)
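The evaluation protocol can be illustrated with a small pure-Python sketch: a stratified assignment of examples to 10 folds and an F-score function. It deliberately leaves out the classifier (the talk used Weka's SVM); stratifying the folds, the function names and the seed are assumptions of the example.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign example indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def f_score(gold, predicted, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In each round, one fold is held out for testing and the remaining nine are used for training; the reported score would then be aggregated over the ten held-out folds.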

Page 19: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Results: Lung Cancer

[Bar chart: F-score for radiology reports + 1 data source (radiology question, pathology report, additional data); axis ticks 1 to 4 and 0.870 to 0.930; labelled values 0.873 and 0.901]

Page 20: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Results: Lung Cancer

[Bar chart: F-score for radiology reports + 2 additional data sources (radiology question, pathology reports, additional data); axis ticks 1 to 4 and 0.870 to 0.930; labelled values 0.873, 0.901 and 0.917]

Page 21: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Results: Lung Cancer

[Bar chart: F-score using 4 data sources (radiology reports, radiology question, pathology reports, additional data); axis ticks 1 to 4 and 0.870 to 0.930; labelled values 0.873, 0.901, 0.917 and 0.93]

Page 22: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Discussion

• More data sources lead to better performance

• The classifier with the highest performance was built using features from all four data sources

• Merging sources into aggregate features works better than keeping them separate

• Not all improvements are significant:
– Radiology question and metadata add clear value
– Pathology reports do not

• Not all admissions had a pathology report associated with them.

Page 23: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Case study 1: Conclusions

• We built a text mining system for detecting lung cancer admissions using machine learning methods.

• Our results show more effective systems can generally be built by including multiple linked data sources.

• Work in progress:
– Other diseases
– Imbalanced datasets
– Feature engineering and selection

[Chart: Breast cancer; axis ticks 1 to 4 and 0.820 to 0.920; labelled value 0.893]

Page 24: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Outline: DOD with Twitter; Emotion classification; DOD signal 1: Tweet emotion shift; DOD signal 2: Tweet lexical shift

Disease Outbreak Detection

Ofoghi et al (2016) Towards early discovery of salient health threats: A social media emotion classification technique; Pacific Symposium on Biocomputing.

Page 25: Leveraging Text Classification Strategies for Clinical and Public Health Applications


Page 26: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Twitter for Outbreak Detection

Assumptions
• People tweet about diseases in the context of emerging outbreaks
• Twitter can provide an “early warning” of an outbreak

“Tweets started to rise in Nigeria 3-7 days prior to the official announcement of the first probable Ebola case. The topics discussed in tweets include risk factors, prevention education, disease trends, and compassion.”

Amer J Infection Control (2015)

Page 27: Leveraging Text Classification Strategies for Clinical and Public Health Applications

“Early warning” tweets

Page 28: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Ebola on Twitter


Page 29: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Twitter for Outbreak Detection

Strategy
• Trends: counting of (hashtag, term) frequencies
• Coupled with geographic origin of tweets
• Sentiment or content analysis

Challenges
• High volume of (mostly irrelevant) tweets
• Hashtags alone may not be adequate
• A mention of a disease does not necessarily indicate an active case

Page 30: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Many reasons to mention Ebola

Page 31: Leveraging Text Classification Strategies for Clinical and Public Health Applications

DOD with Twitter | Previous Work


Page 32: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Is there a local emergent threat?

Can we use shifts in emotional and lexical content of tweets to detect a disease outbreak?

Page 33: Leveraging Text Classification Strategies for Clinical and Public Health Applications

A sliding window model

Page 34: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Ebola event/background data

Dataset            Date (±7)    pre-corpus             post-corpus
                                #tweets    |vocab|     #tweets    |vocab|
ebola-event-1      29-Dec-14    73         204         337        906
ebola-event-2      31-Jan-15    165        700         90         417
ebola-background   16-Dec-14    429        1453        340        1208

Page 35: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Outline: DOD with Twitter; Emotion classification; DOD signal 1: Tweet emotion shift; DOD signal 2: Tweet lexical shift

Page 36: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Emotion classes

• ECs: Ekman’s six basic emotions plus …
– News-related
– Criticism
– Sarcasm

https://www.behance.net/gallery/6-Basic-Emotions/930168

Sarcastic: atsign atsign think I got Ebola there two minutes ago

News-related: atsign Another 4 American Ebola workers flown back to USA for monitoring..

Page 37: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Emotion classifier data

• Data: collection
– Twitter API
– Second half of March 2015
– Total of 12,101 tweets
– Contained “ebola” or “#ebola”
– 4,405 tweets remained after some filtering…
– Amazon’s Mechanical Turk was used to label tweets

Page 38: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Lexicon-Based Classification

• Created an emotion vocabulary
– Profile of Mood States (POMS)
– FrameNet
– Existing “feelings list”
– Wikipedia

• Vector space model
– Binary vector per emotion
– Binary vector per tweet
– Cosine Similarity emotion vs tweet

[Figure: emotion vocabulary index with entries numbered 1 to 499 and beyond; labelled examples: 3 anxious, 7 affronted, 499 :-|]

https://bitbucket.org/readbiomed/socialsurveillance
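The vector space model on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the code in the repository above: the tiny vocabulary, the assignment of terms to emotions and the function names are all made up for the example.

```python
import math


def binary_vector(tokens, vocabulary):
    """1 where the vocabulary term occurs among the tokens, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(tweet_tokens, emotion_terms, vocabulary):
    """Return (best emotion or None, per-emotion similarity scores)."""
    tweet_vec = binary_vector(tweet_tokens, vocabulary)
    scores = {
        emotion: cosine(tweet_vec, binary_vector(terms, vocabulary))
        for emotion, terms in emotion_terms.items()
    }
    best = max(scores, key=scores.get)
    # No vocabulary overlap at all: report no emotion rather than guess.
    return (best if scores[best] > 0 else None), scores
```

A tweet tokenised as `["so", "anxious", "about", "ebola"]` would score highest against whichever emotion's term list contains "anxious"; a tweet with no vocabulary terms gets no label.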

Page 39: Leveraging Text Classification Strategies for Clinical and Public Health Applications


Page 40: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Outline: DOD with Twitter; Emotion classification; DOD signal 1: Tweet emotion shift; DOD signal 2: Tweet lexical shift

Page 41: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Emotion class distribution

Classes                 Dataset          p-value
6 emotions              ebola-event-1    0.004*
                        ebola-event-2    0.002*
                        ebola-backgr.    0.259
6 emotions + 3 add’l    ebola-event-1    0.009*
                        ebola-event-2    0.007*
                        ebola-backgr.    0.079

paired t-test, pre- and post-event windows; * Statistically significant at 5% level
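For readers unfamiliar with the test, a paired t statistic can be computed as below. The slide does not say exactly which quantities were paired; the sketch simply takes two equal-length lists of per-class measurements (one from the pre-event window, one from the post-event window) as an assumption. Turning the statistic into a p-value additionally needs the t-distribution, which is left out here.

```python
import math


def paired_t(pre, post):
    """Return (t statistic, degrees of freedom) for paired samples.

    `pre` and `post` hold one measurement per emotion class for the two
    windows; what exactly is measured is an assumption of this sketch.
    """
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(variance / n), n - 1
```

With six emotion classes the test has 5 degrees of freedom; with the three additional classes it has 8.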

Page 42: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Jensen-Shannon divergence

Class        ebola-event-1    ebola-event-2    ebola-bg.
Sarcasm      0.0227           0.0032           0.1365
News-rel.    0.0226           0.0001           0.0074
Anger        0.0572           0.0382           0.0169
Criticism    0.0180           0.0056           0.0060
Surprise     0.1161           0.0220           0.0023
Fear         0.0768           0.0813           0.0913
Happiness    0.0444           0.0415           0.0064
Disgust      0.0604           0.0025           0.0044
Sadness      0.0023           0.0322           0.0060
AVERAGE      0.0467           0.0252           0.0308

Big differences compared with background, in both e1 and e2
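As a reminder of what the measure computes, here is a minimal sketch of the standard Jensen-Shannon divergence between two discrete distributions. How the per-class figures in the table were derived is not stated on the slide, so this shows only the textbook definition; base-2 logarithms (giving values between 0 and 1) are an assumption.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    `p` and `q` map outcomes to probabilities; outcomes missing from one
    mapping are treated as having probability 0 there.
    """
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total
```

Unlike KL-divergence, the measure is symmetric and stays finite when a class is absent from one of the two windows.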

Page 43: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Outline: DOD with Twitter; Emotion classification; DOD signal 1: Tweet emotion shift; DOD signal 2: Tweet lexical shift

Page 44: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Lexical shift analysis

Within-corpus analysis:

Cross-corpus analysis:

Page 45: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Term freq changes: Event 1

Page 46: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Term freq changes: Background

Page 47: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Case study 2: Conclusions

• We introduced an Ebola tweet-based emotion classifier.

• There are statistically significant differences in the distribution of emotion classes and lexical items in tweets preceding and following a salient emergent health threat.

• This effect does not occur in a neutral background collection.

Proposal:
• Disease outbreak detection can be supported with monitoring of tweets using a sliding window model that tests for such distributional changes
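The proposal could be prototyped along the following lines. This is a sketch only: it scores each day by the Jensen-Shannon divergence between the emotion-label distributions of the preceding and following windows, which is one possible distributional test and not necessarily the one used in the study. The 7-day window mirrors the ±7 days in the dataset table; the input layout and function names are assumptions.

```python
import math
from collections import Counter


def distribution(labels):
    """Relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    total = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


def shift_scores(daily_labels, window=7):
    """Score each candidate day by the pre/post-window emotion shift.

    `daily_labels` has one entry per day, each a list of the emotion
    labels assigned to that day's tweets (hypothetical layout).
    Returns (day_index, divergence) pairs.
    """
    scores = []
    for day in range(window, len(daily_labels) - window + 1):
        pre = [l for d in daily_labels[day - window:day] for l in d]
        post = [l for d in daily_labels[day:day + window] for l in d]
        if pre and post:
            scores.append((day, jsd(distribution(pre), distribution(post))))
    return scores
```

Days whose score stands out against scores from a background collection would then be candidates for closer inspection; how to set that threshold is left open here.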

Page 48: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Conclusions

• There are myriad problems in the clinical context where unstructured data can be leveraged to good effect

• Text classification is one tool that can be drawn on to make use of this unstructured data

• Heterogeneous data integration is also important

• Challenges exist in
– Terminology
– Skewed data
– Missing data

Page 49: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Acknowledgements

• Amazon Mechanical Turkers
• James McCaw, Melbourne School of Population and Global Health

Bahador Ofoghi

Lawrence Cavedon

Simon Kocbek

Page 50: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Thank you!

Page 51: Leveraging Text Classification Strategies for Clinical and Public Health Applications

ML-Based Classification

• MALLET Naïve Bayes
• Features
– bag of words [+lem, -lem]
– Lexicon-based similarity
– emotion vocabulary
– emoticons
– punctuation
– (Stanford) sentiment
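To make the approach concrete, here is a small multinomial Naïve Bayes classifier with add-one smoothing over a few of the feature types listed above. It is a stand-in written for illustration, not MALLET and not the system's feature extractor; the emoticon set, the feature names and the whitespace tokenisation are assumptions.

```python
import math
from collections import Counter, defaultdict

EMOTICONS = {":)", ":(", ":-|"}  # illustrative subset only


def features(tweet):
    """Bag of lower-cased words plus simple emoticon and punctuation flags."""
    tokens = tweet.lower().split()
    feats = list(tokens)
    feats += ["HAS_EMOTICON" for t in tokens if t in EMOTICONS]
    if "!" in tweet:
        feats.append("HAS_EXCLAMATION")
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(docs, labels):
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        n_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.feature_counts[label].values())
            score = math.log(count / n_docs)
            for f in feats:
                if f in self.vocab:  # ignore features never seen in training
                    seen = self.feature_counts[label][f]
                    score += math.log((seen + 1) / (total + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best
```

Training takes one feature list per labelled tweet, for example `NaiveBayes().fit([features(t) for t in tweets], labels)`, after which `predict(features(new_tweet))` returns the most probable emotion class.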

Page 52: Leveraging Text Classification Strategies for Clinical and Public Health Applications

KL-Divergence, full vocabulary

Page 53: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Emotion-level distribution

KL-divergence (pre- vs. post-event, post- vs. pre-)

P(x) and Q(x) represent probability of positive and negative emotion classes in the respective corpora
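For reference, the standard definition of KL-divergence, using P and Q as on the slide, can be written as follows (the choice of logarithm base is not given on the slide):

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \, \log \frac{P(x)}{Q(x)}
```

The measure is not symmetric, so D(P || Q) and D(Q || P) generally differ, which is presumably why both directions (pre- vs. post-event and post- vs. pre-) are reported.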

Page 54: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Top lexically distinct items

Page 55: Leveraging Text Classification Strategies for Clinical and Public Health Applications

Log Likelihood analysis
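A common way to run this kind of analysis is the two-corpus log-likelihood (G²) statistic for a single term, sketched below. The slide does not say which variant was used, so this formulation, the function name and the argument order are assumptions.

```python
import math


def log_likelihood(freq_1, freq_2, size_1, size_2):
    """Two-corpus log-likelihood (G2) for one term.

    freq_1, freq_2: occurrences of the term in corpus 1 and corpus 2
    size_1, size_2: total token counts of the two corpora
    """
    combined = freq_1 + freq_2
    expected_1 = size_1 * combined / (size_1 + size_2)
    expected_2 = size_2 * combined / (size_1 + size_2)
    total = 0.0
    if freq_1 > 0:
        total += freq_1 * math.log(freq_1 / expected_1)
    if freq_2 > 0:
        total += freq_2 * math.log(freq_2 / expected_2)
    return 2 * total
```

Terms are then ranked by this score; a high value means the term's frequency differs between the pre-event and post-event corpora by more than their sizes would explain.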