
A Unified Multi-task Adversarial Learning Framework for Pharmacovigilance Mining

Shweta Yadav, Asif Ekbal, Sriparna Saha, Pushpak Bhattacharyya
Indian Institute of Technology Patna

Patna, India
{shweta.pcs14,asif,sriparna,pb}@iitp.ac.in

Abstract

The mining of adverse drug reactions (ADRs) plays a crucial role in pharmacovigilance. The traditional ways of identifying ADRs are reliable but time-consuming, non-scalable, and offer only a very limited amount of ADR-relevant information. With the unprecedented growth of information sources in the form of social media texts (Twitter, blogs, reviews, etc.), biomedical literature, and Electronic Medical Records (EMRs), it has become crucial to extract the most pertinent ADR-related information from these free-form texts. In this paper, we propose a neural-network-based multi-task learning framework that can simultaneously extract ADRs from various sources. We adopt a novel adversarial learning-based approach to learn features across multiple ADR information sources. Unlike existing techniques, our approach is capable of extracting fine-grained information (such as ‘Indications’, ‘Symptoms’, ‘Finding’, ‘Disease’, ‘Drug’) that provides important cues in pharmacovigilance. We evaluate our proposed approach on three publicly available real-world benchmark pharmacovigilance datasets: a Twitter dataset from the PSB 2016 Social Media Shared Task, the CADEC corpus, and the MEDLINE ADR corpus. Experiments show that our unified framework achieves state-of-the-art performance on the individual tasks associated with the different benchmark datasets. This establishes that our proposed approach is generic, enabling it to achieve high performance on diverse datasets. The source code is available here.¹

1 Introduction

Early detection and monitoring of adverse drug reactions (ADRs) can minimize the deleterious impact on patients and health-care systems (Hakkarainen et al., 2012; Sultana et al., 2013).

¹ https://bit.ly/2EMln36

To prevent this, drug safety organizations known as pharmacovigilance agencies conduct post-market surveillance to identify a drug's side effects after its release. However, the majority of existing ADE surveillance systems utilize passive spontaneous reporting databases, such as the Food and Drug Administration's Adverse Event Reporting System (FAERS) (Li et al., 2014). These systems are often under-reported, biased, and delayed. To overcome the limitations of passive reporting systems, active approaches to ADR monitoring continuously explore frequently updated ADR data sources (Behrman et al., 2011).

The quantity and near-instantaneous nature of social media provide potential opportunities for real-time monitoring of adverse drug reactions (ADRs). The fact that these data are up-to-date and generated by patients overcomes the weaknesses of traditional ADR surveillance techniques (Leaman et al., 2010). Thus, social media could complement traditional information sources for more effective pharmacovigilance studies, as well as potentially serve as an early warning system for unknown ADRs, which may be important for clinical decisions. Additionally, the highly statistically significant correlation (p < 0.001, ρ = 0.75) between FAERS and ADRs extracted from Twitter data shows that Twitter is a viable pharmacovigilance data source (Freifeld et al., 2014).

With the enormous amount of data generated every day, it is desirable to have an automated ADR extraction system that can ease the work of domain experts, allowing them to quickly investigate vast amounts of unstructured text and identify emerging trends. These trends may correspond to mapping a previously undiscovered adverse effect to a given drug, or discovering an unforeseen impact of a change in the manufacturing process. However, extracting this information from unstructured text poses several challenges:


Figure 1: Sample sentences from the CADEC (Text 1), MEDLINE (Text 2), and Twitter (Texts 3, 4) datasets. Tokens in red represent ADR, purple denotes Finding, blue represents Drug name, and brown represents Indication.

• Multiple contexts: Context plays an essential role in determining the semantic labels of medical concepts. For example, consider the following tweets:
Tweet 1: “Advil cured my horrific pain, but made my stomach upset”
Tweet 2: “Advil cured my upset stomach but gave me a horrific pain”
Although the above tweets contain similar medical concepts, their contexts determine the associated class types. In Tweet 1, ‘pain’ refers to the class type Symptom, while in Tweet 2 it refers to ADR.

• Multiple word forms: Social media text poses some inherently distinct challenges, such as short word-forms (e.g., “need to sleep 24/7”), misspelled word-forms (e.g., “fluoxetine, it just make me so tiered”), abbreviated words (e.g., CIT for Citalopram), slang (e.g., “seroquel knocked me out”), implicit sense (e.g., “hard time getting some Z’s”), symbols (such as emoticons), and figurative language (e.g., “quetiapine zombie”). This arbitrariness increases the difficulty of capturing the semantic relationships between the different types.

To overcome these limitations, several machine learning and deep learning models have been introduced for ADR mining. However, these models are very task-specific and often fail to show reasonable accuracy when evaluated on other domains or other annotation schemes.

In this paper, we propose a unified multi-task learning (MTL) framework that builds on the concept of adversarial learning. Our model is capable of collectively learning several tasks associated with ADR monitoring under different levels of supervision. The proposed approach differs from previous studies in two aspects.

Firstly, most existing multi-task learning methods attempt to divide the features of different tasks into task-specific and task-invariant feature spaces, considering only component-wise parameters. The major drawback of this mechanism is that the common feature space often incorporates task-specific features, leading to feature redundancy. Given this issue in MTL, our framework employs adversarial learning (Goodfellow et al., 2014), which helps eliminate redundant features from the feature space and prevents contamination between shared and task-specific features. Secondly, we also employ highway and residual connections wherever necessary to avoid the vanishing gradient problem and improve the performance of our deep neural model (a multi-headed attention based stacked recurrent and convolutional neural network).

Contributions: The contributions of our work can be summarized as follows:
(1) We propose a unified multi-task learning (MTL) framework for pharmacovigilance mining that exploits adversarial learning to learn shared complementary features across multiple ADR datasets. To the best of our knowledge, this is the first attempt to study the effect of adversarial learning in an MTL setting for pharmacovigilance mining.
(2) Our proposed model is capable of automatically identifying various kinds of information (such as Symptom, Finding, Disease, and Drug) in addition to ADRs.
(3) We validate our proposed framework on three popular benchmark pharmacovigilance datasets with different annotation schemes, namely Twitter (Sarker et al., 2016), CADEC (Karimi et al., 2015), and MEDLINE (Gurulingappa et al., 2012a). We extract ADR, Drug, and Indication tags from the Twitter dataset; ADR, Disease, Drug, Finding, and Symptom tags from the CADEC dataset; and Drug and ADR mentions from the MEDLINE dataset. Figure 1 shows example sentences from each dataset.
(4) Our unified multi-task model achieves state-of-the-art performance on ADR labeling and outperforms strong baseline models for all the other pharmacovigilance labels.


Figure 2: Proposed model architecture for pharmacovigilance mining (all neuron representations are hypothetical). The right part of the image describes Component 1 and Component 2.

2 Related Work

Depending upon the source of data, we categorize the previous works as follows:
(i) Biomedical text and Electronic Medical Records: Several Natural Language Processing (NLP) techniques have been proposed to extract ADRs from Electronic Medical Records (Wang et al., 2009; Friedman, 2009; Aramaki et al., 2010) and medical case reports (Gurulingappa et al., 2011). Gurulingappa et al. (2012a) adapted machine learning techniques for the identification and extraction of potential adverse drug event relations from MEDLINE case reports. Unlike spontaneous data sources such as social media, both EMRs and medical case reports offer the advantage of complete records of patients' medical history, treatment, conditions, and possible risk factors, and are not restricted to patients experiencing ADRs (Harpaz et al., 2012b). Recently, Sarker and Gonzalez (2015) utilized data from MEDLINE case reports and Twitter. They proposed several textual features and investigated how combining different datasets would increase the performance of ADR identification. With the advancement of neural network techniques, Huynh et al. (2016) investigated multiple neural network (NN) frameworks for ADR classification on both medical case reports and a Twitter dataset.
(ii) Social media: Social media offers a very rich and viable source of information for identifying potential ADRs in real time. Leaman et al. (2010) conducted the very first study utilizing user comments from social media posts. In total, their dataset contains 6,890 user comments. The research shows that user comments are highly beneficial in uncovering ADRs. Further works (Gurulingappa et al., 2012b; Benton et al., 2011; Harpaz et al., 2012a) utilized lexicon-based approaches to extract ADRs. However, these approaches are restricted to a limited number of target ADRs. Nikfarjam and Gonzalez (2011) exploited a rule-based technique over the naive lexicon-based approach on the same dataset, which was capable of detecting ADRs not included in lexicons.

With the emergence of annotated data, several research works have employed supervised machine learning techniques such as Support Vector Machines (SVM) (Sarker and Gonzalez, 2015), Conditional Random Fields (CRF) (Nikfarjam et al., 2015), and Random Forests (Zhang et al., 2016).

In recent years, with the introduction of deep learning techniques, most studies have utilized deep learning models to predict ADRs. Lee et al. (2017) developed a semi-supervised deep learning model on a Twitter corpus; in particular, they used a Convolutional Neural Network (CNN) for classification. Stanovsky et al. (2017) used a Recurrent Neural Network integrated with knowledge graph embeddings on the CADEC corpus; their study shows that this integration can make the model more accurate. Tutubalina and Nikolenko (2017) explored the combination of CRF and Recurrent Neural Networks (RNN); their results show that a CRF can assist an RNN model in capturing context well. The work most relevant to this study is that of Chowdhury et al. (2018), who jointly learned three tasks, binary classification, ADR labeling, and indication labeling, using an RNN-attention-coverage model.

3 Methodology

With our adversarial multi-task framework, we jointly learn to label ADR events from multiple ADR datasets. ADR labeling is a sequence labeling problem: for a given input sequence X, the model learns to find the optimal tag sequence y*. Mathematically,

y^* = \arg\max_{y} P(y \mid X)    (1)

Our proposed adversarial multi-task framework is depicted in Figure 2. Our model comprises five components:
(1) Embedding Layer: captures the meaning of, and the semantic associations between, the pharmacovigilance words that appear in the text.
(2) Encoder/Feature Extractor Layer: generates both task-specific and task-shared features. Each feature generator module consists of a Convolutional Neural Network (CNN) followed by a stacked Bi-directional Gated Recurrent Unit (Bi-GRU). The task-specific feature generator is responsible for capturing features specific to the task. The task-shared feature generator has an additional adversarial learning component, in which the feature extractor (Generator) operates adversarially towards a learnable multi-layer perceptron (Discriminator), preventing it from making an accurate prediction about which task the features were generated from.
(3) Concatenation Layer: concatenates the feature representations obtained by the two feature extractor modules.
(4) Multi-head Attention Layer: learns to better encode a given word by looking at the other words in the text.
(5) CRF Layer: predicts the most probable tag sequence.

3.1 Input Text
The input to our model is a sequence of words X = (x_1, x_2, \ldots, x_n) corresponding to a social media post or medical case report comprising n words.

3.2 Embedding Layer
This layer generates two forms of representations:
Word embedding: maps each word x_i to a low-dimensional vector w_i ∈ R^{d_e}. We use pre-trained word embeddings of dimension d_e.
Character embedding: captures morphological features. Character embeddings help in representing out-of-vocabulary (OOV) words, misspelt words, and variations in noun or verb phrases, an issue that is even more crucial to resolve for social media text. They allow the model to learn lexical patterns (e.g., suffixes or prefixes), which eventually helps in capturing OOV words and other information that is difficult to capture through word embeddings.

We employ a CNN for character embedding. Let {c_1, c_2, \ldots, c_k} be the character sequence of word x_i, having length k. Each character c_j is represented as a one-hot vector of length |C|, where C is the set of unique characters in the dataset. The resulting one-hot representations of all the characters in the word are stacked to form a matrix M ∈ R^{k×|C|}. Thereafter, we apply several filters of different widths to this matrix. The width of these filters varies from 1 to k, i.e., these filters look at 1- to k-gram character sequences. A max-pooling operation is performed after the convolution operation to pick the most relevant features. We call the resulting character embedding feature c_i.

Finally, the embedding of the i-th word is the concatenation of its word embedding w_i and its character embedding c_i. For each x_i ∈ X, the embedding layer generates the embedding as follows:

e_i = w_i \oplus c_i    (2)
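To make the embedding layer concrete, the following is a minimal sketch of a character-level CNN with multi-width filters and max-pooling whose output is concatenated with a pre-trained word embedding, as in Equation 2. PyTorch is assumed (the paper does not specify an implementation framework), and all class, dimension, and parameter names are illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class CharWordEmbedding(nn.Module):
    """Word embedding concatenated with a CNN-based character embedding (Eq. 2)."""

    def __init__(self, vocab_size, char_vocab_size, word_dim=200, char_dim=100,
                 char_filter_widths=(1, 2, 3)):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)   # pre-trained in the paper
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        # one convolution per filter width; each is max-pooled over the character axis
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, char_dim, kernel_size=w, padding=w // 2)
             for w in char_filter_widths])
        self.proj = nn.Linear(char_dim * len(char_filter_widths), char_dim)

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, max_word_len)
        b, n, l = chars.shape
        w = self.word_emb(words)                                     # (b, n, word_dim)
        c = self.char_emb(chars).view(b * n, l, -1).transpose(1, 2)  # (b*n, char_dim, l)
        pooled = [conv(c).max(dim=2).values for conv in self.convs]  # max over positions
        c = self.proj(torch.cat(pooled, dim=-1)).view(b, n, -1)      # c_i per word
        return torch.cat([w, c], dim=-1)                             # e_i = w_i ⊕ c_i
```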

3.3 Feature Extractor
Our feature extractor utilizes a CNN and a stacked Bi-GRU to encode the output of the embedding layer. The CNN and the stacked Bi-GRU take the embedding layer output as input and generate features that further encode the sequence information. Since we employ a stacked Bi-GRU, a vanishing gradient problem may arise. To tackle this, we employ a highway layer (Srivastava et al., 2015), which has shown a significant impact in reducing the vanishing gradient problem in various NLP tasks (Kim et al., 2016; Costa-jussa and Fonollosa, 2016).

Let us consider the input sequence to this layer to be E = {e_1, e_2, \ldots, e_n}. A convolution operation is performed over the zero-padded sequence E_p. Similar to the character embedding, a set of k filters of size m is applied to the sequence. We obtain the convolved feature c_t at a given time t, for t = 1, 2, \ldots, n:

c_t = \mathrm{relu}(F[e_{t-\frac{m-1}{2}}, \ldots, e_t, \ldots, e_{t+\frac{m-1}{2}}])    (3)

Then, we generate the feature vectors C' = [c'_1, c'_2, \ldots, c'_n] by applying max pooling on C.

Inspired by the success of stacked attentive RNNs in solving other NLP tasks (Wu et al., 2016; Graves et al., 2013; Dyer et al., 2015; Prakash et al., 2016), we use a stacked GRU to encode the input text. The stacked GRU is an extension of the GRU model that has multiple hidden GRU layers. The purpose of using multiple GRU layers is to learn more sophisticated conditional distributions from the data (Bahdanau et al., 2015). In this work, we employ a vertical stacking strategy, where the output of the previous GRU layer is fed to the highway layer and the corresponding output is passed as input to the next GRU layer. Let the number of layers in the stacked GRU be L; then the GRU computes the hidden state for each layer l ∈ {1, \ldots, L} as follows:

h^l_k = \mathrm{GRU}(h^{l-1}_k, h^l_{k-1})    (4)

where h^l_k is the hidden state representation at the l-th layer. The input h^0_k to the first layer (l = 1) of the GRU is initialized randomly. The first-layer GRU unit at the k-th position takes as input the embedding layer output e_k of the k-th word. We compute the forward (\overrightarrow{h_k}) and backward (\overleftarrow{h_k}) hidden states for each word k in the sentence. The final hidden state at layer l ∈ {1, \ldots, L} is computed by concatenating both hidden states: z^l_k = [\overrightarrow{h^l_k} \oplus \overleftarrow{h^l_k}].

The final input text representation from the stacked Bi-GRU layer is obtained by taking the hidden states of the last layer (L) of the GRU as follows:

h_1, h_2, \ldots, h_n = [\overrightarrow{h^L_1} \oplus \overleftarrow{h^L_1}], [\overrightarrow{h^L_2} \oplus \overleftarrow{h^L_2}], \ldots, [\overrightarrow{h^L_n} \oplus \overleftarrow{h^L_n}]    (5)

We compute the overall input text representation by concatenating the output of the CNN layer C' and the stacked Bi-GRU output (Eq. 5) as follows:

z_1, z_2, \ldots, z_n = [c'_1 \oplus h_1], [c'_2 \oplus h_2], \ldots, [c'_n \oplus h_n]    (6)

The above approach to generating task-specific features is applied to each task separately. In order to capture the common features across tasks, we utilize the same feature extractor framework, which serves as a Generator model, together with a feed-forward neural network serving as a Discriminator.
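As a concrete illustration of one such feature extractor (task-specific or shared), the sketch below combines token-level CNN features with the top layer of a stacked Bi-GRU, inserting a highway layer between GRU layers as described above. PyTorch is assumed; the layer sizes follow the hyper-parameters reported later, and the max over filter widths stands in for the max-pooling step, which the paper does not fully specify. Names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """Highway layer: gated mix of a transformed input and the input itself."""

    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))                      # carry/transform gate
        return t * torch.relu(self.transform(x)) + (1 - t) * x

class FeatureExtractor(nn.Module):
    """CNN features concatenated with a stacked Bi-GRU (Eqs. 3-6), z_i = c'_i ⊕ h_i."""

    def __init__(self, emb_dim, hidden=100, num_layers=4, filter_widths=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, hidden, w, padding=w // 2) for w in filter_widths])
        self.grus, self.highways = nn.ModuleList(), nn.ModuleList()
        in_dim = emb_dim
        for _ in range(num_layers):
            self.grus.append(nn.GRU(in_dim, hidden, bidirectional=True, batch_first=True))
            self.highways.append(Highway(2 * hidden))
            in_dim = 2 * hidden

    def forward(self, e):
        # e: (batch, seq_len, emb_dim)
        n = e.size(1)
        # CNN branch: max over filter widths, truncated back to seq_len positions
        conv_out = torch.stack(
            [conv(e.transpose(1, 2))[:, :, :n] for conv in self.convs]).max(dim=0).values
        conv_out = conv_out.transpose(1, 2)                  # (batch, seq_len, hidden)
        # stacked Bi-GRU branch with a highway layer between consecutive GRU layers
        h = e
        for gru, highway in zip(self.grus, self.highways):
            h, _ = gru(h)
            h = highway(h)
        return torch.cat([conv_out, h], dim=-1)              # (batch, seq_len, 3*hidden)
```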

3.4 Task Discriminator Layer
Our feature extractor layer generates two types of features: shared and task-specific. Ideally, the two feature spaces should be mutually exclusive. To ensure that task-specific features of a given task do not leak into the shared space, we incorporate adversarial training (Goodfellow et al., 2014) into the shared feature space. We follow the method introduced by Liu et al. (2017) to keep the shared feature space uncontaminated by task-specific features.

To achieve this, a Task Discriminator D maps the attention-prioritized shared features to an estimate of the task they originate from. In our case, the Task Discriminator is a fully connected layer with a softmax that produces a probability distribution over the tasks to which the shared features may belong. The shared feature extractor (cf. Section 3.3) works as the Generator (G) that produces the shared features. The shared feature extractor is trained adversarially, preventing the discriminator from predicting the task and hence preventing contamination of the shared space. The adversarial loss is used to train the model. Let the shared features (cf. Equation 6) be {z^s_1, z^s_2, \ldots, z^s_n}. The discriminator output can be written as:

D(z^s) = \mathrm{softmax}(z^s_n W_d + b_d)    (7)

where W_d and b_d are the weight matrix and bias, respectively.
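A minimal sketch of the task discriminator in Equation 7 (PyTorch assumed; names illustrative): a single fully connected layer applied to the shared feature of the final time step. The softmax is applied later, inside the training loss.

```python
import torch.nn as nn

class TaskDiscriminator(nn.Module):
    """Fully connected layer mapping the shared feature z^s_n to task logits (Eq. 7)."""

    def __init__(self, feature_dim, num_tasks):
        super().__init__()
        self.fc = nn.Linear(feature_dim, num_tasks)   # W_d, b_d

    def forward(self, z_shared):
        # z_shared: (batch, seq_len, feature_dim); use the final position z^s_n
        return self.fc(z_shared[:, -1, :])            # softmax applied in the loss
```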

3.5 Concatenation Layer
Let the shared and task-specific features for the input text be z^s = {z^s_1, z^s_2, \ldots, z^s_n} and z^t = {z^t_1, z^t_2, \ldots, z^t_n}, respectively. The output of the feature extractor layer is computed as the concatenation of the shared and task-specific features as follows:

S = z^s_1 \oplus z^t_1, z^s_2 \oplus z^t_2, \ldots, z^s_n \oplus z^t_n = s_1, s_2, \ldots, s_n    (8)

3.6 Multi-head Attention Layer
Multi-head attention is used to learn the dependencies between any pair of words in the input text. We apply multi-head attention to the final representation of the input text S computed in Equation 8. Multi-head attention (Vaswani et al., 2017) can be described as follows:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V    (9)

where Q, K, and V are the query, key, and value matrices. In our experiment, all of these are derived from S (through multiplication with the respective learned weights), and d is the dimension of the feature extraction units. Multi-head attention first linearly projects the queries, keys, and values to the given number of heads (t) using different linear projections. These projections then perform scaled dot-product attention in parallel. Finally, the attention results are concatenated and projected once again to obtain the new representation. Formally, the multi-head attention at head i is computed as:

\mathrm{head}_i = \mathrm{Attention}(SW^Q_i, SW^K_i, SW^V_i)
S' = W(\mathrm{head}_1 \oplus \mathrm{head}_2 \oplus \ldots \oplus \mathrm{head}_t)    (10)

where W^Q_i, W^K_i, and W^V_i are the weight matrices.
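The multi-head self-attention of Equations 9-10 can be sketched as follows (PyTorch assumed; torch.nn.MultiheadAttention would serve equally well, this version simply mirrors the notation above, and all names are illustrative).

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention with t heads over the concatenated features S."""

    def __init__(self, dim, heads=4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d = heads, dim // heads
        self.wq = nn.Linear(dim, dim)   # W^Q for all heads packed together
        self.wk = nn.Linear(dim, dim)   # W^K
        self.wv = nn.Linear(dim, dim)   # W^V
        self.out = nn.Linear(dim, dim)  # W in Equation 10

    def forward(self, s):
        # s: (batch, seq_len, dim)
        b, n, _ = s.shape
        split = lambda x: x.view(b, n, self.heads, self.d).transpose(1, 2)
        q, k, v = split(self.wq(s)), split(self.wk(s)), split(self.wv(s))
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)  # Eq. 9
        heads = (att @ v).transpose(1, 2).reshape(b, n, -1)   # concatenate the heads
        return self.out(heads)                                # S' in Eq. 10
```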

3.7 Conditional Random Field Layer

In sequence labeling problems there is often a dependency between successive labels. Instead of predicting the current label independently through a softmax layer, we employ a CRF (Lafferty et al., 2001) layer, which takes the previous label into account when predicting the current label. First, the attentive feature at a given time step t is projected into a space whose dimension equals the number of output tags. Mathematically, this can be formulated as:

o_t = W_S S'_t + b_S    (11)

Thereafter, we calculate the score of a given label sequence y as follows:

\mathrm{score}(y \mid X) = \sum_{t=1}^{n} (A_{y_{t-1}, y_t} + o_{t, y_t})    (12)

where A is the transition score matrix. Finally, we select the tag sequence with the highest score:

y = \arg\max_{y \in Y} \mathrm{score}(y \mid X)    (13)

In the decoding stage, we use the Viterbi algorithm to compute the optimal tag sequence.
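For reference, a minimal sketch of Viterbi decoding over the CRF scores of Equations 12-13 (plain Python with NumPy; the emission scores o_t and the transition matrix A are assumed to be given, and function names are illustrative):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (n, num_tags) scores o_t; transitions: (num_tags, num_tags) matrix A,
    where transitions[i, j] is the score of moving from tag i to tag j."""
    n, t = emissions.shape
    score = emissions[0].copy()                  # best score ending in each tag at step 0
    backpointers = np.zeros((n, t), dtype=int)
    for step in range(1, n):
        # candidate[i, j] = best path ending in tag i, then transitioning i -> j at `step`
        candidate = score[:, None] + transitions + emissions[step][None, :]
        backpointers[step] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best_last = int(score.argmax())
    path = [best_last]
    for step in range(n - 1, 0, -1):             # follow backpointers to recover y*
        path.append(int(backpointers[step, path[-1]]))
    return path[::-1]
```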

4 Experimental Details

4.1 Network Training
We optimize two different losses to train our multi-task model. The first is the task-specific loss L_task, which is specific to each task. Apart from the task-specific loss, we also optimize an adversarial loss, which trains the network so that the task cannot be predicted correctly from the shared features.

For the task-specific loss, we use the negative log-likelihood objective as the loss function for each task. Given the total number of tasks T and N training samples (x_i, y_i) from task t ∈ T, the task loss L_task is computed as:

L_{task} = -\sum_{t=1}^{T} \sum_{i=1}^{N} \log p(y^t_i \mid x^t_i)    (14)

The likelihood p(y^t_i \mid x^t_i) is computed as:

p(y^t_i \mid x^t_i) = \frac{e^{\mathrm{score}(y^t_i \mid x^t_i)}}{\sum_{y \in Y} e^{\mathrm{score}(y \mid x^t_i)}}    (15)

The score(·) function is computed by Equation 12. The adversarial loss trains the shared feature extractor to generate shared features such that the task discriminator layer cannot reliably recognize which task the input text comes from. The adversarial loss L_adv is computed as follows:

L_{adv} = \min_{G} \left( \max_{D} \left( \sum_{t=1}^{T} \sum_{i=1}^{N} d^t_i \log[D(G(x^t_i))] \right) \right)    (16)

where d^t_i is the gold label indicating the type of the current task and x^t_i is the i-th example of task t. The min-max optimization problem is addressed by the gradient reversal layer (Ganin and Lempitsky, 2015). The final loss of the model is defined as:

L = \alpha \times L_{task} + \beta \times L_{adv}    (17)

where \alpha and \beta are scalar weighting parameters.
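A minimal sketch (PyTorch assumed; names illustrative) of how a gradient reversal layer in the style of Ganin and Lempitsky (2015) realizes the min-max objective of Equation 16 and how it is combined with the task loss in Equation 17: the discriminator tries to identify the source task, while the reversed gradient pushes the shared extractor to hide it.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output   # the generator receives the reversed discriminator gradient

def total_loss(task_nll, shared_features, discriminator, task_labels,
               alpha=0.8, beta=0.2):
    # Discriminator predicts the source task from the (gradient-reversed) shared features.
    task_logits = discriminator(GradReverse.apply(shared_features))
    adv_loss = F.cross_entropy(task_logits, task_labels)
    return alpha * task_nll + beta * adv_loss   # L = alpha * L_task + beta * L_adv (Eq. 17)
```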

4.2 Hyper-parameters
We use pre-trained word embeddings² from Pyysalo et al. (2013) of dimension 200, trained on a combination of PubMed and PMC biomedical texts together with text extracted from a recent English Wikipedia dump. We set the maximum length of the input text to 44 and the maximum character length to 10. A CNN-based character embedding of length 100 is used in this experiment. The optimal hidden state dimension of the GRU is set to 100, and we use 4 GRU layers to form the stacked GRU. The CNN layer uses the filter set {2, 3, 4}. In the multi-head attention layer, we use a total of 4 heads to compute the attentive representation. We set the dropout rate to 0.5. The batch size is set to 16, and the loss weights α and β are set to 0.8 and 0.2, respectively. The Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.01 is used during training to optimize the network weights. The optimal values of the hyper-parameters are obtained through 10-fold cross-validation.

² http://evexdb.org/pmresources/vec-space-models/
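For convenience, the hyper-parameters listed above can be collected into a single configuration (the values are taken from the text; the dictionary and its key names are only illustrative):

```python
CONFIG = {
    "word_emb_dim": 200,          # Pyysalo et al. (2013) PubMed+PMC+Wikipedia vectors
    "max_seq_len": 44,
    "max_char_len": 10,
    "char_emb_dim": 100,
    "gru_hidden": 100,
    "gru_layers": 4,
    "cnn_filter_widths": [2, 3, 4],
    "attention_heads": 4,
    "dropout": 0.5,
    "batch_size": 16,
    "loss_alpha": 0.8,            # weight on the task loss
    "loss_beta": 0.2,             # weight on the adversarial loss
    "optimizer": "Adam",
    "learning_rate": 0.01,
}
```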

4.3 Datasets
We use three different ADR labeling datasets to evaluate our multi-task model: the PSB 2016 Social Media Shared Task for ADR Extraction dataset (Twitter), CADEC, and MEDLINE. Note that our model is trained simultaneously on the different ADR datasets. The datasets used in the experiments are as follows:

1. Twitter dataset: The first dataset is the Twitter dataset from the PSB 2016 Social Media Shared Task for ADR Extraction. It contains 572 tweets that are fully annotated with mentions of ADRs, tweet ID, start and end offsets, UMLS ID, annotated text span, and the related drugs. We extract the following three tags from this dataset: ADR, Drug, and Indication.

2. CADEC adverse drug events dataset: The second dataset is the CADEC adverse drug events corpus. It contains a total of 1,248 sentences with different tags. Our model extracts the following tags from the CADEC corpus: ADR, Disease, Drug, Finding, and Symptom.

3. MEDLINE ADR dataset: This ADR corpus was released by Gurulingappa et al. (2012b). It was derived from MEDLINE case reports,³ which provide information about the symptoms, signs, diagnosis, treatment, and follow-up of individual patients. The corpus contains 2,972 documents with 20,967 sentences, out of which 4,272 sentences are annotated with the names of, and relationships between, drugs, adverse effects, and dosages. Our model extracts the Drug and ADR mentions in the sentences.

³ https://www.nlm.nih.gov/bsd/indexing/training/PUB_050

5 Results and Analysis

We evaluate the pharmacovigilance labeling tasks in terms of precision, recall, and F1-score. Unlike existing systems, we evaluate the performance of our model using an exact matching scheme, in which a predicted sequence is counted as correct only if all of its labels are predicted correctly (a minimal scoring sketch is given below). We begin by describing the baseline models, then report the results obtained by the proposed model, and finally present an analysis of the results.
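A minimal sketch of the exact-match scoring described above, assuming BIO-style tags (e.g., B-ADR/I-ADR): a predicted entity counts only if its full span and type match the gold annotation. The function names and the BIO assumption are illustrative, not the authors' evaluation script.

```python
def extract_spans(tags):
    """Collect (start, end, type) spans from a BIO-tagged sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):                # sentinel closes the last span
        if start is not None and not tag.startswith("I-"):
            spans.add((start, i, tags[start][2:]))        # exact span boundaries and type
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

def exact_match_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 where only exact span-and-type matches count as correct."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```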

5.1 Baselines

We compare our adversarial multi-task model with the following state-of-the-art baselines. Note that these baselines are re-implementations of state-of-the-art methods for ADR extraction.
(1) ST-BLSTM: A single-task model for ADR labeling with a Bi-LSTM as the sentence encoder. In our experiments, we build an individual model for each dataset.
(2) ST-CNN: Similar to ST-BLSTM, but instead of a Bi-LSTM the sentence encoder is a CNN with filters {2, 3, 4}.
(3) CRNN: In this model, a CNN and an LSTM are used together as the sentence encoder (Huynh et al., 2016). We adopt the same architecture for ADR extraction by classifying each token of the sentence into a pre-defined set of tags.
(4) RCNN: Similar to the third baseline, but here we extract the LSTM features first and then pass these features as input to the CNN network.
(5) MT-BLSTM: A multi-task model (Chowdhury et al., 2018) with a Bi-LSTM layer shared across tasks as the sentence encoder and a task-specific Bi-LSTM for each task. The final representation is obtained by concatenating the shared and task-specific Bi-LSTM outputs.
(6) MT-Atten-BLSTM: This baseline (Chowdhury et al., 2018) is similar to MT-BLSTM, but its sentence encoder is also equipped with a word-level attention mechanism.


Models                                   | Twitter (P / R / F1)   | CADEC (P / R / F1)     | MEDLINE (P / R / F1)
ST-BLSTM                                 | 57.7 / 56.8 / 57.3     | 52.9 / 49.4 / 51.1     | 71.65 / 72.19 / 71.91
ST-CNN                                   | 63.8 / 65.8 / 67.1     | 39.7 / 42.7 / 42.0     | 66.88 / 73.81 / 70.17
CRNN (Huynh et al., 2016)                | 61.1 / 62.4 / 64.9     | 49.5 / 46.9 / 48.2     | 71.0 / 77.3 / 75.5
RCNN (Huynh et al., 2016)                | 57.6 / 58.7 / 63.6     | 42.4 / 44.9 / 43.6     | 73.5 / 72.0 / 74.0
MT-BLSTM (Chowdhury et al., 2018)        | 65.57 / 61.02 / 63.19  | 60.50 / 55.16 / 57.62  | 72.72 / 75.49 / 74.0
MT-Atten-BLSTM (Chowdhury et al., 2018)  | 62.26 / 69.62 / 65.73  | 56.63 / 60.0 / 58.27   | 75.08 / 81.06 / 77.95
Proposed Model                           | 68.78 / 70.81 / 69.69  | 64.33 / 67.03 / 65.58  | 81.97 / 82.61 / 82.18

Table 1: Comparison of the proposed method with the state-of-the-art baseline methods. Here, 'P', 'R', and 'F1' denote precision, recall, and F1-score. The results on CADEC and MEDLINE are from 10-fold cross-validation; for the Twitter dataset, we use the train and test sets provided by the PSB 2016 shared task.

Model Components        | Twitter        | CADEC          | MEDLINE
Proposed Model          | 69.69          | 65.58          | 82.18
– Character Embedding   | 67.63 (2.06 ↓) | 56.10 (9.48 ↓) | 76.34 (5.84 ↓)
– Multi-head Attention  | 68.65 (1.04 ↓) | 60.51 (5.07 ↓) | 77.71 (4.47 ↓)
– Adversarial Learning  | 68.11 (1.58 ↓) | 58.57 (7.01 ↓) | 71.21 (10.97 ↓)

Table 2: Ablation study on all the datasets. The values in brackets show the absolute decrease (↓) in F1-score when the respective component is removed from the proposed model, i.e., the contribution of that component to the model's performance.

5.2 Results

The results of our proposed model, compared with the state-of-the-art baseline techniques, are reported in Table 1. Our proposed model outperforms the state-of-the-art baselines by fair margins in terms of precision, recall, and F1-score on all the datasets. In our first experiment, we train two models (single-task BLSTM and multi-task BLSTM) to analyze the effect of the multi-task model (MT-BLSTM) over a single-task model (ST-BLSTM). On all three datasets, Table 1 shows that the multi-task framework, with its sharing scheme, helps boost the performance of the system: we observe performance improvements of 5.89, 6.52, and 2.09 F1-score points on the Twitter, CADEC, and MEDLINE datasets, respectively. Similar improvements are also observed in terms of precision and recall.

In comparison with the fifth baseline (MT-BLSTM), our proposed method achieves improvements of 6.5, 7.96, and 8.18 F1-score points on the Twitter, CADEC, and MEDLINE datasets, respectively. This shows the robustness of our proposed multi-task method. We also compare our proposed system with the MT-Atten-BLSTM model; the results show performance improvements of 3.96, 7.31, and 4.23 F1-score points on the Twitter, CADEC, and MEDLINE datasets, respectively. The improvements over all the baseline methods are statistically significant (p < 0.05).

5.3 Ablation Study
To analyze the impact of the various components of our model, we perform an ablation study (cf. Table 2) by removing one component at a time from the proposed model and evaluating the performance on all three datasets. Character embedding is found to be the most crucial component on the Twitter and CADEC datasets, as both come from social media text and carry its characteristic short forms and out-of-vocabulary words.

To test our hypothesis that introducing adversarial learning into the multi-task framework can make the shared space independent of task-specific features, we exclude the adversarial loss from our proposed framework and observe a significant decline in performance. This shows that keeping the task-shared space free from contamination by task-specific features can significantly improve the performance of the system. Removing the multi-head attention also leads to an average drop of about 4 F1-score points across all the datasets.

6 Analysis

To get a deeper insight into how our multi-task model performs compared with the state-of-the-art multi-task baseline models, we sample a few sentences from the three datasets. Table 3 demonstrates the capability of our model to correctly predict all the labels where MT-LSTM and MT-LSTM-Atten make incorrect predictions. In Sentence 1, thanks to the sharing scheme, 'bipolar' is correctly labeled as Indication.


Sentence 1:         fluoxetine and quet combo zombified me ahh the med merrygoround bipolar
Actual Labels:      B-Drug O B-Drug O B-ADR O O O O O B-Indication
MT-LSTM:            B-Drug O B-Drug O O O O O O O O
MT-LSTM-Atten:      B-Drug O B-Drug O B-ADR O O O O O O
Proposed Approach:  B-Drug O B-Drug O B-ADR O O O O O B-Indication

Sentence 2:         clozapine-induced tonic-clonic seizure managed with valproate implication for clinical care
Actual Labels:      B-Drug B-ADR I-ADR O O O O O O O
MT-LSTM:            B-Drug O O O O O O O O O
MT-LSTM-Atten:      B-Drug O B-ADR O O B-ADR O O O O
Proposed Approach:  B-Drug B-ADR I-ADR O O O O O O O

Table 3: Comparison of the predictions of the proposed approach with the baseline models.

Type-1
Sentence 1:  too much zoloft and seroquel to get the horn my life is lie
Actual:      O O B-Drug O B-Drug O O O B-ADR I-ADR I-ADR O O
Predicted:   O O B-Drug O B-Drug O O O O O O O O

Type-2
Sentence 2:  pain in upper right arm could not sleep on it or move it behind my back
Actual:      B-ADR I-ADR I-ADR I-ADR I-ADR O O O O O O O O O O O
Predicted:   B-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR I-ADR

Type-3
Sentence 3:  terrible joint pain could not move shoulder hip hurt
Actual:      O B-ADR B-ADR B-ADR I-ADR I-ADR I-ADR B-ADR I-ADR
Predicted:   B-ADR I-ADR I-ADR B-ADR I-ADR I-ADR I-ADR I-ADR I-ADR

Table 4: Examples of the various types of error. Type-1 represents errors due to the presence of implicit mentions, Type-2 represents errors due to annotation issues, and Type-3 represents boundary detection errors.

In Sentence 2, we observe that the MT-LSTM-Atten model is only able to predict a partial ADR (i.e., 'seizure' instead of 'tonic-clonic seizure'), while our model predicts the full ADR phrase correctly.

6.1 Error Analysis

In this subsection, we analyze the different sources of errors that lead to misclassification. We closely study the false positive and false negative instances and make the following observations:
(1) Presence of implicit mentions: We observe that in the Twitter dataset users often use very implicit and creative language to describe their adverse drug reactions. For example, in Sentence 1 of Table 4, the user describes the ADR caused by the drugs (zoloft and seroquel) as 'horn my life'.
(2) Annotation issues: For the CADEC dataset, we observe that some sentences are not completely tagged. For example, in Sentence 2 of Table 4, 'could not sleep' and 'move it behind my back' are also ADRs, in addition to 'pain in upper right arm'; however, the first two ADRs are not labeled in the dataset.
(3) Boundary detection problems: We also observe that our system sometimes fails to detect the proper boundary. This may be because the task-shared features learn feature distributions across datasets that are not always correct for a given dataset, as shown in Sentence 3 of Table 4.

7 Conclusion

In this paper, we have proposed an end-to-end multi-task framework that provides a unified solution for pharmacovigilance mining. We have utilized an adversarial training based multi-task framework, which ensures that task-specific and task-shared features are not contaminated. We evaluated this framework on three benchmark pharmacovigilance datasets, and our results demonstrate the capability of our model across all of them. In future work, we would like to help the model handle the various linguistic aspects of social media text, such as figurative language.

Acknowledgement

Sriparna Saha and Asif Ekbal gratefully acknowledge the Young Faculty Research Fellowship (YFRF) Award, supported by the Visvesvaraya PhD Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, implemented by Digital India Corporation (formerly Media Lab Asia), for carrying out this research. The authors acknowledge "Shusrut: ezDI Research Lab on Health Informatics", Department of Computer Science and Engineering, IIT Patna, India.

References

Eiji Aramaki, Yasuhide Miura, Masatsugu Tonoike, Tomoko Ohkuma, Hiroshi Masuichi, Kayo Waki, and Kazuhiko Ohe. 2010. Extraction of adverse drug effects from clinical records. Studies in Health Technology and Informatics, 160 Pt 1:739–743.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Rachel E. Behrman, Joshua S. Benner, Jeffrey S. Brown, Mark McClellan, Janet Woodcock, and Richard Platt. 2011. Developing the Sentinel System - a national resource for evidence development. New England Journal of Medicine, 364(6):498–499.

Adrian Benton, Lyle Ungar, Shawndra Hill, Sean Hennessy, Jun Mao, Annie Chung, Charles E. Leonard, and John H. Holmes. 2011. Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. Journal of Biomedical Informatics, 44(6):989–996.

Shaika Chowdhury, Chenwei Zhang, and Philip S. Yu. 2018. Multi-task pharmacovigilance mining from social media posts. In Proceedings of the 2018 World Wide Web Conference, WWW '18, pages 117–126, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

Marta R. Costa-jussa and Jose A. R. Fonollosa. 2016. Character-based neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 357–361, Berlin, Germany. Association for Computational Linguistics.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China. Association for Computational Linguistics.

Clark C. Freifeld, John S. Brownstein, Christopher M. Menone, Wenjie Bao, Ross Filice, Taha Kass-Hout, and Nabarun Dasgupta. 2014. Digital drug safety surveillance: monitoring pharmaceutical products in Twitter. Drug Safety, 37(5):343–350.

Carol Friedman. 2009. Discovering novel adverse drug events using natural language processing and mining of the electronic health record. In Conference on Artificial Intelligence in Medicine in Europe, pages 1–5. Springer.

Yaroslav Ganin and Victor S. Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1180–1189.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE.

Harsha Gurulingappa, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2011. Identification of adverse drug event assertive sentences in medical case reports. In First International Workshop on Knowledge Discovery and Health Care Management (KD-HCM), European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 16–27.

Harsha Gurulingappa, Abdul Mateen-Rajpu, and Luca Toldo. 2012a. Extraction of potential adverse drug events from medical case reports. Journal of Biomedical Semantics, 3(1):15.

Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012b. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, 45(5):885–892.

Katja M. Hakkarainen, Khadidja Hedna, Max Petzold, and Staffan Hagg. 2012. Percentage of patients with preventable adverse drug reactions and preventability of adverse drug reactions - a meta-analysis. PLoS ONE, 7(3):e33236.

Rave Harpaz, William DuMouchel, Nigam H. Shah, David Madigan, Patrick Ryan, and Carol Friedman. 2012a. Novel data-mining methodologies for adverse drug event discovery and analysis. Clinical Pharmacology & Therapeutics, 91(6):1010–1021.

Rave Harpaz, Santiago Vilar, William DuMouchel, Hojjat Salmasian, Krystl Haerian, Nigam H. Shah, Herbert S. Chase, and Carol Friedman. 2012b. Combing signals from spontaneous reports and electronic health records for detection of adverse drug reactions. Journal of the American Medical Informatics Association, 20(3):413–419.

Trung Huynh, Yulan He, Alistair Willis, and Stefan Rueger. 2016. Adverse drug reaction classification with deep neural networks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 877–887, Osaka, Japan. The COLING 2016 Organizing Committee.

Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of Biomedical Informatics, 55:73–81.


Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2741–2749. AAAI Press.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Robert Leaman, Laura Wojtulewicz, Ryan Sullivan, Annie Skariah, Jian Yang, and Graciela Gonzalez. 2010. Towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, pages 117–125. Association for Computational Linguistics.

Kathy Lee, Ashequl Qadir, Sadid A. Hasan, Vivek Datla, Aaditya Prakash, Joey Liu, and Oladimeji Farri. 2017. Adverse drug event detection in tweets with semi-supervised convolutional neural networks. In Proceedings of the 26th International Conference on World Wide Web, WWW '17, pages 705–714, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.

Hui Li, Xiao-Jing Guo, Xiao-Fei Ye, Hong Jiang, Wen-Min Du, Jin-Fang Xu, Xin-Ji Zhang, and Jia He. 2014. Adverse drug reactions of spontaneous reports in Shanghai pediatric population. PLoS ONE, 9(2):e89829.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–10, Vancouver, Canada. Association for Computational Linguistics.

Azadeh Nikfarjam and Graciela H. Gonzalez. 2011. Pattern mining for extraction of mentions of adverse drug reactions from user comments. In AMIA Annual Symposium Proceedings, volume 2011, page 1019. American Medical Informatics Association.

Azadeh Nikfarjam, Abeed Sarker, Karen O'Connor, Rachel Ginn, and Graciela Gonzalez. 2015. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association, 22(3):671–681.

Aaditya Prakash, Sadid A. Hasan, Kathy Lee, Vivek V. Datla, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2016. Neural paraphrase generation with stacked residual LSTM networks. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 2923–2934.

S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, and S. Ananiadou. 2013. Distributional semantics resources for biomedical text processing. In Proceedings of LBM 2013, pages 39–44.

Abeed Sarker and Graciela Gonzalez. 2015. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of Biomedical Informatics, 53:196–207.

Abeed Sarker, Azadeh Nikfarjam, and Graciela Gonzalez. 2016. Social media mining shared task workshop. In Biocomputing 2016: Proceedings of the Pacific Symposium, pages 581–592. World Scientific.

Rupesh K. Srivastava, Klaus Greff, and Jurgen Schmidhuber. 2015. Training very deep networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2377–2385. Curran Associates, Inc.

Gabriel Stanovsky, Daniel Gruhl, and Pablo Mendes. 2017. Recognizing mentions of adverse drug reaction in social media using knowledge-infused recurrent models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 142–151.

Janet Sultana, Paola Cutroneo, and Gianluca Trifiro. 2013. Clinical and economic burden of adverse drug reactions. Journal of Pharmacology & Pharmacotherapeutics, 4(Suppl 1):S73.

Elena Tutubalina and Sergey Nikolenko. 2017. Combination of deep recurrent neural networks and conditional random fields for extracting adverse drug reactions from user reviews. Journal of Healthcare Engineering, 2017.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Xiaoyan Wang, George Hripcsak, Marianthi Markatou, and Carol Friedman. 2009. Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. Journal of the American Medical Informatics Association, 16(3):328–337.


Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Zhifei Zhang, J.-Y. Nie, and Xuyao Zhang. 2016. An ensemble method for binary classification of adverse drug reactions from social media. In Proceedings of the Social Media Mining Shared Task Workshop at the Pacific Symposium on Biocomputing.