Identifying adverse drug reactions by analyzing twitter messages


1

Identifying Adverse Drug Reactions by Analyzing Twitter Messages

Presented by - Parinda Rajapaksha

15th International Conference on Advances in ICT for Emerging Regions ICTer2015

Authors - Parinda Rajapaksha, Ruvwan Weerasinghe

2

“ The person who takes medicine must recover twice, once from the disease and once from the medicine ”

- William Osler, M.D.


3

ROAD MAP

• Introduction, Motivation & Related works

• Proposed solution, Research Question & limitations

• Design, Implementation & Evaluation

• Discussion

• Future works

4

INTRODUCTION

• What is an Adverse Drug Reaction (ADR)?
- Harm associated with a normal dosage during normal use
- Unintended, harmful reaction
- Examples: nausea, insomnia, hallucination, headache, depression

• Becoming a dire global problem
- Over 770 000 people are injured or die each year
- Prescription drugs have become the 4th leading medical cause of death in Canada and the US

5

INTRODUCTION Traditional Solutions

• Some regulatory bodies have begun programs
- Surveillance systems
- Reporting systems
- Clinical trials

• BUT
- Reporting systems are voluntary in most countries
- Spontaneous self-reports do not uncover all aspects of drug safety
- Clinical trials are very cumbersome

6

MOTIVATION

• The recent explosion of social media platforms presents a valuable information source

• People share personal medical experiences with each other through online communities

7

RELATED WORKS Medical Case Reports

• Gurulingappa et al. used MEDLINE case reports
- 5 000 drugs were extracted from nearly 3 000 case reports
- Ontology-driven methodology

• Eiji et al. extracted clinical information from electronic health records
- 3 000 discharge summaries accumulated in one month at Tokyo hospitals

8

RELATED WORKS Online Health Forums

• Robert et al. collected comments from health-related web sites
- DailyStrength web site
- Limited to North America
- Did not consider demographics
- Beneficial and adverse effects are unclear

• Brant et al. investigated 'withdrawn' and 'watchlist' drugs
- Yahoo! Groups
- Number of messages for each drug was not evenly distributed
- Did not have adequate data to support the analysis

9

RELATED WORKS Twitter Related

• Jiang et al. analyzed textual and semantic features of Twitter
- 2 billion tweets, 5 cancer drugs
- Used a topic modeling approach
- Performance was limited by data sparseness and a high level of noise

• Clark et al. extracted 7 million tweets in digital drug safety surveillance research
- Data sample tends to be noisy
- Internet speech and writing patterns differ from standardized clinical data


11

OUR PROPOSED SOLUTION

• Analyze user experiences through Twitter messages

• Twitter as a microblogging platform

• Why Twitter?
- Public availability, update frequency and message volume

Twitter statistics (Statistic Brain, 2014/01/01, http://www.statisticbrain.com/twitter-statistics/)
- Total number of active Twitter users: 645 million
- Average number of tweets per day: 58 million
- Number of tweets every second: 9,100

12

RESEARCH QUESTIONS

“How can drug-related tweets be identified by removing noise from Twitter messages, and then automatically classified into adverse effects and other effects?”

13

SCOPE AND LIMITATION

• Limited to one pharmaceutical name
- The number of drugs in the world is very large and keeps growing

• Only works for Twitter messages with English text
- Language processing becomes very hard without knowing the other languages

14

DESIGN

1. Data Acquisition & Filtering
2. Tweet Preprocessing
3. Text Processing
4. Feature Extraction
5. Manual Annotation
6. Classification (Adverse Effects / Other Effects)

15

IMPLEMENTATION Data Acquisition

• Ethical concerns?
- In accordance with the Terms & Conditions of the Twitter API
- NOT from private accounts

• Xanax as the test case
- Used for panic disorders and anxiety disorders

16

IMPLEMENTATION Data Filtering

• Why a filtering method?
- Capture more useful data while downloading

[Diagram: the Twitter data stream is filtered by generic name, trade names and misspelled drug names (taken from the misspelled words dictionary), producing filtered drug-related tweets]

17

IMPLEMENTATION Misspelled Word Dictionary

• Assumption 1
- These are the only categories that can produce possible misspelled words
- Xxnaa for Xanax is NOT possible (hardly ever misspelled that way)

Reason for Misspelling   Examples
Skip Letters             Xnax, Xanx
Double Letters           Xaanax, Xannax
Reversed Letters         Xnaax, Xaanx
Missed Key               Xabax, Xahax, Xajax
Inserted Key             Xabnax, Xamnax
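The five misspelling categories can be turned into a small generator. This is only a minimal sketch, not the online typo generator the authors used: the function name `misspellings` and the partial `NEIGHBOURS` keyboard map are assumptions made for illustration.

```python
# Hypothetical, partial QWERTY neighbour map; a full map would cover every key.
NEIGHBOURS = {"a": "qwsz", "n": "bhjm", "x": "zsdc"}


def misspellings(name):
    """Return single-error variants of a drug name (lower-cased)."""
    word = name.lower()
    variants = set()
    for i, ch in enumerate(word):
        variants.add(word[:i] + word[i + 1:])        # skipped letter
        variants.add(word[:i] + ch + word[i:])       # doubled letter
        if i + 1 < len(word):                        # reversed (swapped) letters
            variants.add(word[:i] + word[i + 1] + ch + word[i + 2:])
        for key in NEIGHBOURS.get(ch, ""):
            variants.add(word[:i] + key + word[i + 1:])      # missed key
            variants.add(word[:i] + key + word[i:])          # inserted key (before)
            variants.add(word[:i + 1] + key + word[i + 1:])  # inserted key (after)
    variants.discard(word)  # swapping identical letters reproduces the name
    return variants


typos = misspellings("Xanax")
```

For "Xanax" the generated set contains every example listed on the slide (in lower case), while a multi-error form such as "xxnaa" is not produced, in line with Assumption 1.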

18

IMPLEMENTATION Data Collection

• 54 774 messages collected within 7 weeks (14 Aug 2014 - 1 Oct 2014)
- Generic name: 1 829 (3%)
- Trade names: 51 467 (94%)
- Misspelled names: 1 477 (3%)

• 1 477 misspelled Twitter messages were captured

19

IMPLEMENTATION Pre-processing

• Identifying Twitter-specific noisy information
- Retweets (RT)
- User mentions (@)
- Hashtags (#)

20

IMPLEMENTATION Pre-processing

doc said being of the xanax was giving my heart major issues and causing problems that weren't even there in the 1st place
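The Twitter-specific noise named in the pre-processing slides (retweets, user mentions, hashtags) could be handled along the following lines. This is a sketch under assumptions: the slides do not give the exact rules, so dropping retweets entirely, deleting mentions, and keeping the hashtag word without its "#" are illustrative choices, and the name `preprocess` is hypothetical.

```python
import re

MENTION = re.compile(r"@\w+")      # user mentions such as @someone
RETWEET = re.compile(r"^\s*RT\b")  # retweets start with the marker "RT"


def preprocess(tweet):
    """Return the cleaned tweet, or None when the tweet is a retweet."""
    if RETWEET.match(tweet):
        return None                 # assumption: retweets are discarded
    text = MENTION.sub(" ", tweet)  # remove user mentions
    text = text.replace("#", " ")   # keep the hashtag word, drop the symbol
    return " ".join(text.split())   # collapse repeated whitespace


cleaned = preprocess("@OMGImBoss i WANT MY XANAX #vamplife")
```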


21

IMPLEMENTATION Text Processing

• Remove advertisements, news and forum posts

• Assumption 2
- The probability that a legitimate Twitter message contains a link is considerably low
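Under Assumption 2, advertisements, news and forum posts can be approximated as "messages that contain a link". The sketch below illustrates that idea only; the URL pattern and the function names `has_link` and `drop_linked` are assumptions, not the authors' code.

```python
import re

# Assumed pattern: anything starting with http://, https:// or www.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def has_link(tweet):
    """True when the message contains something that looks like a URL."""
    return URL.search(tweet) is not None


def drop_linked(tweets):
    """Keep only the messages without links (treated as personal messages)."""
    return [t for t in tweets if not has_link(t)]


kept = drop_linked([
    "Buy xanax online http://t.co/abc",
    "this xanax got me feeling good",
])
```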

22

IMPLEMENTATION Text Processing

• Replace slang words, emoticons and abbreviations

Slang word dictionary (5 242 entries)

Slang Word   Intended Meaning
abt          About
w/o          without
smh          somehow
idk          I don’t know
n2g          Not too good
lol          Laugh out loud
…            …

Emoticons dictionary (80 entries)

Emoticon     Intended Meaning
:-D          Big grin
:(((         Sad
:'(          Crying
:-@          Screaming
O.o          Confused
B-)          Cool
…            …

23

IMPLEMENTATION Text Processing

being doing good on my diet giving up soda and iced coffee I do not think so laugh out loud i would need xanax smile
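The dictionary replacement step can be sketched as a token-by-token lookup. The two small dictionaries below hold only a few entries that appear in the slides' examples; they stand in for the real 5 242-entry slang dictionary and 80-entry emoticon dictionary, and the function name `normalise` is hypothetical.

```python
import string

# Tiny illustrative samples, not the authors' dictionaries.
SLANG = {"abt": "about", "w/o": "without", "w/": "with",
         "idk": "i do not know", "lol": "laugh out loud", "u": "you"}
EMOTICONS = {":)": "smile", ":(": "sad", "O.o": "confused", ":-D": "big grin"}


def normalise(tweet):
    """Lower-case a tweet and expand slang words and emoticons."""
    words = []
    for token in tweet.split():
        if token in EMOTICONS:              # emoticons are matched case-sensitively
            words.append(EMOTICONS[token])
            continue
        key = token.lower()
        if key in SLANG:                    # slang that contains punctuation, e.g. w/
            words.append(SLANG[key])
            continue
        stripped = key.strip(string.punctuation)
        if stripped:                        # drop tokens that were only punctuation
            words.append(SLANG.get(stripped, stripped))
    return " ".join(words)


result = normalise("I would need Xanax lol :)")
```

Contractions such as "don't" are not expanded here; that would need one more lookup table.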


24

IMPLEMENTATION Text Processing

• Some Twitter messages are NOT related to the context
- “I need some Xanax”
- “Xanax is expensive but I'm worth it”
- “@yung i need to buy xanax but the site won't let me ship to canada?”

25

IMPLEMENTATION Medical Terminologies

• Use of medical terminologies
- MedDRA (Medical Dictionary for Regulatory Activities)
- SIDER (Side Effect Resource)
- Collected 15 205 medical terms

• Data set reduced to 3 334 messages after checking 54 774 messages
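Filtering by medical terminology amounts to keeping only the messages that mention at least one term from the collected list. The sketch below assumes messages have already been text-processed (lower-case, no punctuation); the three-term `SAMPLE_TERMS` set is a made-up stand-in for the 15 205 collected terms, and `mentions_medical_term` is a hypothetical name.

```python
# Illustrative stand-in for the term list built from MedDRA and SIDER.
SAMPLE_TERMS = {"insomnia", "headache", "suicidal thoughts"}


def mentions_medical_term(message, terms):
    """True when any term occurs in the message as a whole word or phrase."""
    padded = " " + " ".join(message.lower().split()) + " "
    return any(" " + term + " " in padded for term in terms)


messages = [
    "this xanax medicine causes suicidal thoughts when first taking it",
    "i need some xanax",
]
relevant = [m for m in messages if mentions_medical_term(m, SAMPLE_TERMS)]
```

Padding with spaces keeps a term from matching inside a longer word while still allowing multi-word terms.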


26

IMPLEMENTATION Feature Extraction

• Used bag-of-words model
- Only considers occurrence (presence or absence)

• Stop words NOT removed
- A Twitter message can include very few words
- Tweets are limited to 140 characters

• Stemming
- Ex: Takes, Taking → Take
- Used the Porter stemmer available in Python NLTK
- 4 572 features
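A presence/absence bag-of-words representation can be sketched in a few lines. The function names and the `stem` parameter are assumptions for illustration; the default stemmer here is the identity function so that the example runs without extra libraries.

```python
def build_vocabulary(messages, stem=lambda word: word):
    """Sorted list of distinct (stemmed) words over all messages."""
    return sorted({stem(word) for message in messages for word in message.split()})


def to_binary_vector(message, vocabulary, stem=lambda word: word):
    """1 if the vocabulary term occurs in the message, else 0 (counts ignored)."""
    present = {stem(word) for word in message.split()}
    return [1 if term in present else 0 for term in vocabulary]


corpus = ["xanax makes me sleepy", "i need xanax"]
vocabulary = build_vocabulary(corpus)
vector = to_binary_vector("i need xanax xanax", vocabulary)
```

To follow the slides more closely, `nltk.stem.PorterStemmer().stem` can be passed as `stem`, so that forms such as "takes" and "taking" share one feature.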

27

IMPLEMENTATION Manual Annotation

• Condition 1:
- To label a message as Adverse, there should be a person or a group of people involved with the drug

• Condition 2:
- Beneficial effects, conditions or indications are labelled as Other

• Condition 3:
- The sentence should be affirmative (not interrogative, not subjunctive)

28

IMPLEMENTATION Annotated Messages

• Adverse Effects

- “ People take xanax so lightly like it's nothing. people get addicted and die from that shit. i blacked out driving while high on it once ”

- “This xanax medicine causes suicidal thoughts when first taking it. what the f*** ”

- “ I should stop taking all the drugs man they are obviously ruining your brain and making you bipolar. xanax is the main contributor ”


29

IMPLEMENTATION Annotated Messages

• Other Effects

- “ I am going to sleep now because this xanax got me feeling good ”

- “ I need a prescription to xanax or valium or anything that will help me chill out and sleep for once ”

- “ I am not sure if it is the xanax or lack of sleep but f*** i do not feel real ”

Beneficial Effect

Suspicious feeling


30

IMPLEMENTATION Data Distribution

• Data distribution is highly unbalanced

• 93% of the data comes from the Other category

[Bar chart: number of observations by class label. Adverse Effects: 221; Other Effects: 3 113]

31

IMPLEMENTATION Sampling

• Why undersampling?
- Reduces observations from the Other category
- The number of Adverse effects does not change
- Does not add synthetic observations to the Adverse effect class

[Diagram: sizes of the Adverse (A) and Other (O) classes under initial behavior, undersampling and oversampling]
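Random undersampling keeps every minority-class observation and draws an equally sized random subset of the majority class. The slides state that sub-sampling was done in WEKA, so the following is only a plain-Python sketch of the idea; `undersample` and its `seed` parameter are hypothetical.

```python
import random


def undersample(majority, minority, seed=0):
    """Randomly keep as many majority items as there are minority items."""
    rng = random.Random(seed)                       # fixed seed for repeatability
    kept = rng.sample(majority, len(minority))      # sampling without replacement
    return kept, list(minority)


# Class sizes taken from the slides: 3 113 Other, 221 Adverse.
other = ["other-%d" % i for i in range(3113)]
adverse = ["adverse-%d" % i for i in range(221)]
balanced_other, balanced_adverse = undersample(other, adverse)
```

No observation is invented, which is the property the slide contrasts with oversampling.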


32

IMPLEMENTATION Classification

• Used the Naïve Bayes algorithm
- Simple ("naïve") algorithm reported in the literature as among the best for text classification
- Achieved the highest classifier performance in related works

• Compared with the Decision Tree algorithm

• 10-fold cross-validation

• Used the WEKA toolbox
- Supports data pre-processing, regression, classification, clustering and data visualization
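To show what a Naïve Bayes classifier does with presence/absence features, here is a minimal Bernoulli variant with Laplace smoothing. It is a teaching sketch only: the study used WEKA, whose Naïve Bayes implementation may differ, and the function names and toy data are assumptions.

```python
import math


def train_bernoulli_nb(vectors, labels):
    """Estimate a log prior and per-feature presence probabilities per class."""
    model = {}
    n_features = len(vectors[0])
    for cls in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == cls]
        log_prior = math.log(len(rows) / len(vectors))
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[cls] = (log_prior, probs)
    return model


def predict(model, vector):
    """Return the class with the highest log posterior score."""
    best_cls, best_score = None, -math.inf
    for cls, (log_prior, probs) in model.items():
        score = log_prior
        for x, p in zip(vector, probs):
            score += math.log(p if x else 1.0 - p)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


# Toy data: feature 0 is typical of class "A", feature 1 of class "O".
model = train_bernoulli_nb([[1, 0], [1, 0], [0, 1], [0, 1]], ["A", "A", "O", "O"])
```

In 10-fold cross-validation the same training and prediction steps are repeated ten times, each time holding out a different tenth of the data for testing.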

33

EVALUATION Balanced vs. Unbalanced Data Set

• The purpose is to identify as many Adverse effects as possible

Balanced Data Set (A-221, O-221)

                 Precision  Recall  F-Measure  AUC
Adverse Effect   0.67       0.71    0.69       0.71
Other            0.69       0.65    0.67       0.71

Unbalanced Data Set (A-221, O-3113)

                 Precision  Recall  F-Measure  AUC
Adverse Effect   0.18       0.19    0.18       0.71
Other            0.94       0.93    0.94       0.71

34

EVALUATION Balanced vs. Unbalanced Data Set

• Proposed solution (balanced data set) performs better on the Adverse Effect class (recall 0.71 vs. 0.19) despite its lower overall accuracy

                      Unbalanced  Balanced
No. of Observations   3 334       442
Training Time (sec)   20.8        1.6
Accuracy              89%         68%

- Unbalanced accuracy: contributed from the Other category
- Balanced: reduced data set

35

EVALUATION NB vs. DT

Naïve Bayes (accuracy 68%)

                Classified as AE  Classified as O
True class AE   157               64
True class O    78                143

Decision Tree (accuracy 62%)

                Classified as AE  Classified as O
True class AE   137               84
True class O    83                138
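The accuracy figures, and the precision and recall reported for the balanced data set, can be re-derived from the four counts of each matrix. The sketch assumes each row of counts holds one true class (157 + 64 = 221 Adverse Effect messages), which is the reading consistent with the class sizes; the function name `metrics` is hypothetical.

```python
def metrics(tp, fn, fp, tn):
    """Precision, recall, F-measure and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, f_measure, accuracy


# Adverse Effect taken as the positive class.
naive_bayes = metrics(tp=157, fn=64, fp=78, tn=143)
decision_tree = metrics(tp=137, fn=84, fp=83, tn=138)

nb_rounded = tuple(round(value, 2) for value in naive_bayes)
dt_rounded = tuple(round(value, 2) for value in decision_tree)
```

Rounded to two decimals this gives 0.67, 0.71, 0.69 and 0.68 for Naïve Bayes and 0.62 for all four Decision Tree values, matching the reported results.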

36

EVALUATION NB vs. DT

                   Precision  Recall  F-Measure  AUC
Naïve Bayes
  Adverse Effect   0.67       0.71    0.69       0.71
  Other            0.69       0.65    0.67       0.71
Decision Tree
  Adverse Effect   0.62       0.62    0.62       0.72
  Other            0.62       0.62    0.62       0.72

• Proposed solution (Naïve Bayes) performs better than the Decision Tree on precision, recall and F-measure

37

EVALUATION ROC Curve

[ROC curve: TP rate against FP rate; AUC = 0.7]

• Better than a random guess (AUC = 0.5)

38

EVALUATION Error Analysis

• Curse of dimensionality
- The performance of the classifier decreases when the dimensionality of the problem becomes too large
- 4 572 features

39

CONCLUSION

• Proposed a method to identify ADRs from Twitter data

• The proposed filtering method captured 1 477 (3%) additional messages

• All performance measurements lie around 70%; training time is 1.6 seconds

• The 'curse of dimensionality' has reduced the performance of the classifier

• Results suggest the potential for extracting ADR-related information from Twitter

40

FUTURE WORKS

1) Degree or level of ADR
- Effects can be categorized as high, normal or low
- Useful for prioritizing effects in pharmacovigilance

EX:
- “This medicine causes suicidal thoughts when first taking it” - Extremely negative
- “I'm stressed I can't even sleep after using this pills” - High
- “Its 2:20 a.m. and I am yawning and shaking my head vamplife” - Less harm

41

FUTURE WORKS

2) Identifying the geographic distribution of drug users
- Twitter provides geolocations of Twitter messages
- Weather conditions and habits in each country may affect drugs and their effects

EX:
- “My mom asks me to get beer while picking up a Xanax prescription. So that looks good” (Beer + Xanax = ??)

42

Thank You

NLP could save a life!

43

PILOT STUDY

2014/04/08 8.30 AM - 11.30 AM


44

EVALUATION

                      Naïve Bayes (NB)  Decision Tree (DT)
No. of Observations   442               442
Training Time (sec)   1.6               14.7
Accuracy              68%               62%

45

SIGNIFICANCE OF THIS RESEARCH

• Filtering method
- Identified misspelled drug-related messages
- Captured more useful data

• Removing advertisements, forum posts and news-related Twitter messages

• Building a medical corpus to remove unwanted Twitter messages
- SIDER
- MedDRA

46

TOOLS & TECHNOLOGIES

• Data acquisition & filtering
- Twitter Streaming API
- Tweepy library in Python
- Keyword typo generator online tool

• Text processing
- Python NLTK (Natural Language Toolkit)
- Porter stemmer

• Classification, sub-sampling, ensemble learning and data visualization
- WEKA toolbox

47

DATA DISTRIBUTION

[Pie chart with legend "Pre-Processed Messages" and "Retweets (RT)": slices of 25 289 (46%) and 29 485 (54%). Breakdown shown alongside: ads, news and forum posts 2 113; duplicate messages 23 176]

48

CLASSIFICATION PROCESS


49

TEXT PROCESSING

Raw message: @utemim @goddess1207 @LadyZ_712 @misscolor63 @Cozyrosy1 There isn't enough Xanax to make me spend an hour w/ a room full of 5yr olds!
After text processing: there is not enough xanax to make me spend an hour with a room full of 5 year olds

Raw message: Being doing good on my diet, giving up soda and iced coffee I don't think so lol I would need Xanax :)
After text processing: being doing good on my diet giving up soda and iced coffee i do not think so laugh out loud i would need xanax smile

Raw message: @OMGImBoss i WANT MY XANAX BITCH :( AND im asking u with who you or they want papakush there?? O.o
After text processing: i want my xanax bitch sad and i am asking you with who you or they want papakush there confused

50

JUST ASSUME…

The Ebola virus has affected millions of people around the world

Doctors found a cure

Tested using monkeys

51

JUST ASSUME…

Medicine (vaccination) distributed to people

Clinical trials not possible

Ask them to tweet: ADR or No ADR