Multi-Emotion Detection in Dutch Political Tweets

Vincent S. Erich
10384081

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904

1098 XH Amsterdam

Supervisors
dhr. drs. Isaac Sijaranamual
dhr. dr. Evangelos Kanoulas

Information and Language Processing Systems
Faculty of Science

University of Amsterdam
Science Park 904

1098 XH Amsterdam

June 26th, 2015


Abstract

This work describes the development of an emotion classification system (i.e., a classifier) for ThemeStreams that can detect and classify multiple emotions in Dutch political tweets. Using data from different installations of ThemeStreams, a hand-labelled dataset of 399 tweets covering eight emotion categories has been realised. Two problem reduction methods have been tested for performing multi-label classification: the binary relevance method and the random k-labelset (RAKEL) method. Using the binary relevance method as problem reduction method, a multi-label classifier has been developed that achieves an overall F1 score of 0.209 on the developed dataset. This F1 score has been achieved by testing different combinations of features and classifier parameters, of which the combination that uses unigrams, bigrams, and Part-of-Speech information as features results in the highest overall F1 score.


Contents

1 Introduction

2 Related Work

3 Method and Approach
  3.1 Data and Data Reduction
  3.2 Labelling the Dataset
  3.3 Data Preprocessing
  3.4 Feature Set
  3.5 Classification Algorithms
    3.5.1 The Binary Relevance Method
    3.5.2 The Random k-Labelset Method

4 Results

5 Conclusion

6 Discussion and Future Research

1 Introduction

Emotions play an important role in everyday life: they influence decision-making, social interaction, perception, memory, and more. Emotions are not only communicated verbally and visually, but also textually, in the form of microblog posts, blog posts, and forum discussions. Due to the increasing amount of such emotion-rich textual content, recognition of emotion in written text has received a great amount of attention in recent years. Much of this attention has focused on sentiment analysis and opinion mining, which deal with the automatic identification and extraction of a writer's attitude with respect to some topic (i.e., whether the expressed opinion is positive, negative, or neutral).

This work focuses on the automatic detection and classification of more fine-grained emotional states in the context of tweets. It follows from the development of ThemeStreams (de Rooij et al., 2013), an online web interface that visualizes the stream of themes discussed in politics. The system maps political discussions to themes and influencers and visualizes this mapping. The current system provides an overview of the topics discussed in Dutch politics by analysing the tweets from people from four influencer groups.

The goal of this work is to develop an emotion classification system (i.e., a classifier) for ThemeStreams that can be used to examine whether the emotions in the tweets affect the political discussions. The classifier must be able to detect multiple emotions in the tweets and classify these emotions. If a tweet expresses multiple emotions, the classifier must assign multiple emotion labels to the tweet (i.e., it is a multi-label classification task). Thus, the following research question is addressed in this work: How to develop a classifier that can detect multiple emotions in Dutch political tweets?

To answer this question, data dumps from different installations of ThemeStreams are used. Every data dump consists of a couple of hundred thousand tweets in JSON format. The label set that is used is the same as the one that has been used by Buitinck et al. (2015) and consists of the seven basic emotions identified by Shaver et al. (1987), with the emotion 'interest' added. This work explores how to develop a labelled dataset, what features to use, and what supervised machine learning algorithm to use.

This work proceeds in five parts. First, related work is reviewed in Section 2. Section 3 then describes the approach that has been taken, followed by the results in Section 4. Section 5 summarises the work, and Section 6 concludes with the discussion and plans for future research.


2 Related Work

Recognition of emotion in written text has been the focus of much research in recent years; there has been a great deal of research on developing models that can automatically classify texts as expressing certain emotions. Much of the initial work on emotion recognition in written text has focused on polarity classification, that is, whether a text expresses positive, negative, or neutral emotions. Alm et al. (2005), for example, have explored the automatic classification of sentences in children's fairy tales as expressing positive, negative, or no emotions (neutral).

However, the recent trend has been to go beyond polarity classification and detect a broader spectrum of emotions. This raises the question of what kinds of emotions can be detected in written text. Most label sets that are used in emotion recognition tasks seem to be based on the basic emotions identified in the hierarchical cluster analysis done by Shaver et al. (1987), or on the six basic emotions identified by Ekman (1992).

Danisman and Alpkocak (2008), for example, use a Vector Space Model (VSM) for the automatic classification of anger, disgust, fear, joy, and sad emotions at the sentence/snippet level. Their classifier achieves an overall F-measure of 32.22% in a five-way multiclass prediction problem on the SemEval dataset. In a more recent study, Buitinck et al. (2015) describe and experimentally validate two algorithms that can detect multiple emotions in sentences of movie reviews. They do so by reducing the multi-label problem to either binary or multiclass learning.

More closely related to this work is that of Wang et al. (2012) and Balabantaray et al. (2012), who both perform emotion recognition on tweets. Wang et al. (2012) demonstrate that a combination of unigrams, bigrams, sentiment/emotion-bearing words, and Part-of-Speech information seems to be the most effective for extracting emotions from tweets. They achieve an accuracy of 65.57% in a seven-way multiclass prediction problem using training data consisting of about 2.5 million tweets (automatically collected and labelled). Balabantaray et al. (2012) use a hand-labelled dataset of 8150 tweets and a classifier based on multiclass SVM kernels to classify tweets into six categories of emotion. Their classifier achieves an accuracy of 73.24%.

This work differs from the work described above in two important ways. First, it is not assumed that emotions are mutually exclusive (i.e., a tweet can express multiple emotions). Second, the dataset that is used in this work consists of a few hundred hand-labelled Dutch tweets, whereas earlier studies have used datasets consisting of thousands of English tweets.


Table 1: Comparison of different approaches. *: anger, disgust, fear, sadness, joy. **: anger, fear, interest, joy, love, sadness, surprise. ***: happiness, sadness, anger, disgust, surprise, fear + neutral class. ****: joy, sadness, anger, love, fear, thankfulness, surprise.

Study                          Dataset                                                     Classifier          Performance
Danisman and Alpkocak (2008)   Sentences/text snippets; 7666 text snippets;                VSM                 F1: 32.22%
                               five emotion categories*
Buitinck et al. (2015)         Sentences of English movie reviews; 629 sentences;          One-vs.-Rest        Acc: 0.841, F1: 0.432
                               seven emotion categories**; multiple labels per sentence;   Random k-labelset   Acc: 0.854, F1: 0.456
                               manually annotated
Balabantaray et al. (2012)     English tweets; 8150 tweets; six emotion categories***;     SVMs                Acc: 73.24%
                               one label per tweet; manually annotated
Wang et al. (2012)             English tweets; ±2.5M tweets; seven emotion categories****; LIBLINEAR           Acc: 65.57%
                               one label per tweet; automatically annotated

3 Method and Approach

As described in the introduction, the goal of this work is to develop an emotion classification system (i.e., a classifier) for ThemeStreams that can detect and classify multiple emotions in Dutch political tweets. This section describes the approach that has been taken to achieve this goal and proceeds in five parts. First, Section 3.1 describes the data and the steps that have been taken to reduce the data. The development of a labelled dataset is described in Section 3.2, followed by an overview of the data preprocessing in Section 3.3. Section 3.4 then describes the feature set. Finally, Section 3.5 describes the learning algorithms.

3.1 Data and Data Reduction

The data that has been used in this work was obtained from four data dumps from different installations of ThemeStreams. Every data dump consists of a couple of hundred thousand tweets in JSON format (JSON strings). Every JSON string in a data dump contains a tweet and metadata about both the writer of the tweet (i.e., the influencer group to which the writer belongs, profile information) and the tweet itself (whether the tweet is a retweet, and whether there are links, hashtags, and mentions in the tweet). Every JSON string also has a unique identifier (id).
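As an illustration of the data handling, a dump could be read and de-duplicated roughly as follows. This is a minimal sketch: line-delimited JSON and the field name 'id' are assumptions, since the exact layout of the ThemeStreams dumps is not documented here.

    import json

    def load_unique_tweets(paths):
        """Read line-delimited JSON dumps, keeping one record per tweet id."""
        unique = {}
        for path in paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)
                    unique[record["id"]] = record  # 'id' is an assumed field name
        return list(unique.values())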

Combining the data from the four data dumps resulted in a dataset of 1100306 tweets (JSON strings). However, since two of the data dumps came from two different installations of ThemeStreams that simultaneously collected data, the resulting dataset contained a lot of duplicate tweets. Removing duplicate tweets from the dataset resulted in a new dataset of 368460 tweets; a reduction of 66.51%. A set of filtering heuristics was developed to filter out irrelevant tweets (i.e., tweets that do not express any emotion). For this, a number of filtering heuristics described in Wang et al. (2012, pp. 588-589) were adopted and implemented. The filtering heuristics are as follows (a sketch implementing them appears after this list):

• Discard retweets (i.e., tweets starting with ‘RT’).

• Discard tweets which contain URLs.

• Discard tweets which contain quotations.

• Discard tweets which have fewer than five words; user mentions (i.e., '@person') are not counted as words.

• Discard tweets having more than three hashtags.
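The following is a minimal Python sketch of these heuristics; the word and quotation checks are illustrative simplifications of the rules described by Wang et al. (2012), not the exact implementation used in this work.

    import re

    MENTION = re.compile(r"@\w+")

    def keep_tweet(text):
        """Return True if the tweet passes all filtering heuristics."""
        if text.startswith("RT"):                    # discard retweets
            return False
        if "http://" in text or "https://" in text:  # discard tweets containing URLs
            return False
        if '"' in text:                              # discard tweets containing quotations
            return False
        words = [w for w in text.split() if not MENTION.fullmatch(w)]
        if len(words) < 5:                           # fewer than five words (mentions excluded)
            return False
        if text.count("#") > 3:                      # more than three hashtags
            return False
        return True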

A random sample of 300 tweets was selected from the aforementioned dataset of 368460 tweets. For every tweet in this sample it was determined whether it was relevant or not, and, if the tweet was irrelevant, whether one of the filtering heuristics would discard it. Based on this analysis it was assumed that the filtering heuristics are effective.

Applying the filtering heuristics to the aforementioned dataset of 368460 tweets resulted in a new dataset of 188399 tweets; a reduction of 48.87%. However, since this new dataset still contained too many tweets to label manually, a random sample of 1500 tweets was selected as the final dataset. Table 2 summarizes the data reduction process.

Table 2: An overview of the data reduction process. Combining the data from the four data dumps resulted in a dataset of 1100306 tweets. The final dataset contains 1500 tweets.

Step                                              Number of tweets in the dataset
1. Combining the data from the four data dumps    1100306
2. Removing duplicate tweets                      368460
3. Applying filtering heuristics                  188399
4. Random sample selection                        1500


Table 3: The eight emotions (emotion categories) that are included in the label set (in English and in Dutch).

   English             Dutch
1. Anger               Boosheid
2. Fear                Angst
3. Interest            Interesse
4. Joy                 Vreugde
5. Sadness             Verdriet
6. Surprise            Verrassing
7. Disgust/contempt    Afschuw
8. Love                Liefde

3.2 Labelling the Dataset

Since the tweets in the final dataset did not have emotion labels associated with them (i.e., it was an unlabelled dataset), the tweets had to be labelled. The label set that is used in this work is the same as the one that has been used by Buitinck et al. (2015) and consists of the seven basic emotions identified by Shaver et al. (1987), with the emotion 'interest' added. Thus, the label set consists of the following eight emotions (emotion categories): anger, fear, interest, joy, sadness, surprise, disgust/contempt, and love.

CrowdFlower [1] has been used to obtain a labelled dataset. CrowdFlower is an online crowdsourcing service that, among other things, allows customers to put a labelling task online (tasks are called 'jobs' on the CrowdFlower platform) and let contributors label the data for a small fee.

For every emotion category, a CrowdFlower job has been set up, which resulted in eight different jobs. Every job is a binary classification task. Another option would have been to set up a single CrowdFlower job and ask contributors to select all the emotion categories that are present in a tweet (from the eight emotion categories that are included in the label set). However, selecting all the emotion categories that are present in a tweet from a predefined set of emotion categories is more difficult than deciding whether a single emotion category is present in a tweet. It was therefore decided to set up one CrowdFlower job per emotion category. It is also assumed that this results in a higher-quality labelled dataset.

Every job used the same final dataset of 1500 tweets. However, since CrowdFlower only allows spreadsheets to be uploaded (.csv, .tsv, .xlsx, or .ods), the final dataset (with JSON strings) was converted to a .csv file with three columns: 'tweet id' (the id of the tweet), 'name' (the name of the author of the tweet), and 'content' (the tweet itself).

[1] http://www.crowdflower.com


Figure 1: A question from the job for the emotion anger (Dutch: boosheid). (A) The title of the job. (B) The 'Instructions' button. If clicked, the instructions are shown to the contributor. (C) The tweet in question. (D) The question and the choices/answer options.

In every job, contributors were asked whether the author of a tweet was upset about something (in case of the emotion anger), anxious about something (in case of the emotion fear), interested in something (in case of the emotion interest), etc. For the emotion love, contributors were asked whether the author of a tweet had affection/devotion for someone/something. Contributors always had three choices: 'Yes', 'No', or 'No idea'. Figure 1 shows a question from the job for the emotion anger.

Every job was provided with the same set of instructions. The instructions gave an overview of the job, a description of the procedure, and a short summary. Below are some important points regarding the instructions:

• Since the tweets are in Dutch, it was explicitly stated that the jobs were only intended for contributors who can read (and understand) Dutch.

• Contributors were asked to determine whether the author of a tweet was upset about something (in case of the emotion anger), anxious about something (in case of the emotion fear), interested in something (in case of the emotion interest), etc.; not whether the person for whom the tweet is intended feels that emotion.

• It was explicitly stated that, for a contributor to answer 'Yes', it did not matter to what extent the emotion in question occurs in a tweet.

• Since emotional states are generally not mutually exclusive, contributors were asked to focus solely on the emotion in question.

• Contributors were asked not to overuse the choice 'No idea' and to select it only if they were in doubt or really had no idea.

Before the quality settings of the jobs are discussed, it is important to address the following two points. (1) Before launching a job, it is required to create a number of test questions. Test questions are used to train contributors and to remove contributors who do not perform well enough. When creating test questions, CrowdFlower randomly selects a row (tweet) from the uploaded dataset and allows you to give the correct answer (it is also possible to skip a row, whereupon CrowdFlower simply selects another random row from the uploaded dataset). It is also possible to upload test questions. (2) When a job is launched, the uploaded dataset is divided into a number of pages. Each page shows a variable number of rows (tweets) that contributors can label. For example: if the uploaded dataset consists of 100 rows, and the number of rows per page is set to ten, then the dataset will be divided into ten pages.

In order to obtain a high-quality labelled dataset, the number of contributors who answer a given question (i.e., label a given tweet) was set to three throughout all the jobs. The number of rows (tweets) displayed on each page was set to ten, and only contributors from Dutch-speaking countries (i.e., Belgium, the Netherlands, and Suriname) were allowed to work on the jobs. Furthermore, contributors had to maintain a minimum accuracy of 70% on the test questions, and the minimum time it should take a contributor to complete a page of work was set to 30 seconds (i.e., three seconds per tweet on average). Contributors who did not maintain a minimum accuracy of 70% on the test questions, or who completed a page of work in less than 30 seconds, were removed from the jobs (including the data they had labelled).

When a contributor has worked on a job, he/she can voluntarily take part in a contributor exit survey. In this exit survey, the job is assessed on five points: 'Overall', 'Instructions Clear', 'Test Questions Fair', 'Ease Of Job', and 'Pay'. In order to gain insight into the quality of the created jobs, a test run was done for every job in which the first 100 rows of the uploaded dataset were launched. Since the results of the contributor exit surveys showed that every job scored low on 'Test Questions Fair' (it turned out that in every job too many test questions were being contested by contributors), changes were made to the test questions: every test question that was being contested by contributors was reconsidered, and if the contributors' contentions were deemed justified, the test question was removed from the job. Furthermore, since CrowdFlower indicated that every job contained too few test questions, extra test questions were created as follows: for every job, the aggregated results were downloaded (which show the single, top answer for every row), and every row that was not a test question and where the confidence of the answer was 1.0 (i.e., all three contributors agreed upon the answer) was transformed into a test question.

Figure 2: The results of the contributor exit surveys of the jobs (overall, results of the test runs are included). In every exit survey, a job is assessed on five points: 'Overall', 'Instructions Clear', 'Test Questions Fair', 'Ease Of Job', and 'Pay'. The maximum score for each category is five.

Due to the limited resources that were available, it was only possible to launch 300 additional rows for every job (i.e., rows 101 to 400 of the uploaded dataset). Thus, only 400 rows of the uploaded dataset have been labelled in every job.

Figure 2 shows the results of the contributor exit surveys of the jobs. Since the overall score for every job is ≥ 4, it is assumed that the jobs are of good quality. Although every job was provided with the same set of instructions, the score for 'Instructions Clear' varies over the jobs, with the job for the emotion joy having the lowest score (4.1) and the job for the emotion love having the highest score (5). Since the score for 'Instructions Clear' is ≥ 4.1 for every job, it can be assumed that contributors knew what was expected of them.


Table 4: The label distribution in the annotated final dataset (399 tweets).

Label               Absolute label frequency    Percentage of total number of labels (%)
Anger               68                          14.91
Fear                9                           1.97
Interest            130                         28.51
Joy                 51                          11.18
Sadness             14                          3.07
Surprise            41                          8.99
Disgust/contempt    113                         24.78
Love                30                          6.58
Total               456                         100.00

Even though test questions were created by selecting tweets where three contributors agreed upon the answer (see above), the scores for 'Test Questions Fair' are not as high as was expected: the job for the emotion interest has the lowest score (2.6) and the job for the emotion love has the highest score (3.8). The low scores for 'Test Questions Fair' might be due to the subjective character of emotions or the difficulty of detecting an emotion. Nevertheless, the created test questions were assumed to be fair and to represent what was expected of contributors.

The score for 'Ease Of Job' also varies over the jobs, with the job for the emotion joy having the lowest score (3.8) and the job for the emotion love having the highest score (4.8). Since the score for 'Ease Of Job' is ≥ 3.8 for every job, it is assumed that contributors found the jobs easy (keeping the jobs easy was one reason for setting up eight separate jobs). Further analysis of the results in Figure 2 is beyond the scope of this work.

The final dataset of 1500 tweets (JSON strings) was labelled using the aggregated results from every job. The aggregated results show the single, top answer for every row (i.e., the answer that most contributors agree upon). Only rows (tweets) that have been labelled in all eight jobs are included in the labelled final dataset. However, this resulted in 399 labelled tweets instead of 400 labelled tweets [2]. The label distribution is given in Table 4.

Of the 399 tweets, 127 have no label(s), showing that the expression of emotion is not prevalent in the tweets. Table 5 maps the number of labels per tweet to the number of tweets, showing that the maximum number of labels per tweet is five (the combination 'Anger-Fear-Interest-Sadness-Disgust/contempt', which occurs once). Table 6 shows the text of the tweet that has five labels associated with it, together with four other examples from the annotated final dataset.

[2] The reason for this is unknown. It is assumed that something went wrong in one of the CrowdFlower jobs.

Table 5: Mapping of the number of labels per tweet to the number of tweets.

Number of labels per tweet    Number of tweets    Percentage of total number of tweets (%)
0                             127                 31.83
1                             140                 35.09
2                             89                  22.31
3                             35                  8.77
4                             7                   1.75
5                             1                   0.25
6                             0                   0.00
7                             0                   0.00
8                             0                   0.00
Total                         399                 100.00

Table 6: Five tweets from the annotated final dataset with their associated label(s).

Tweet: "@cornaldm @freekvonk @DWDDUniversity Infotainment wint van coverpulp hoera!"
Label(s): Joy

Tweet: "@peterkwint nou dat is niet best. Al kun je dit verwachten van de Prive. Daar kun je nauwelijks intelligentie verwachten? Alleen gossip....."
Label(s): Anger, Disgust/contempt

Tweet: "@jorisluyendijk ligt ie al in de winkel? Ik heb zondag een boekenbon gevonden! :D"
Label(s): Interest, Joy

Tweet: "@Bertine83 Ja, he? En wat een leuke, frisse CU-dame! Stak gunstig af tegen de tikje onsympathiek debatterende De Rouwe."
Label(s): Joy, Surprise, Love

Tweet: "@JohnKerstens Het is maar wat je leest, feit is dat steeds meer mensen in mijn omgeving hun baan kwijt raken of salaris in moeten leveren"
Label(s): Anger, Fear, Interest, Sadness, Disgust/contempt

3.3 Data Preprocessing

Before features were extracted to train and test the classifier, the tweets in the annotated final dataset had to be preprocessed. The data preprocessing involves four steps that are adopted from Wang et al. (2012, p. 589). First, all the words were lower-cased. Second, user mentions (e.g., '@person') were replaced with '@user'. Third, letters and punctuation marks that are repeated more than twice were replaced with the same two letters and punctuation marks. Fourth, and finally, some predefined informal expressions were normalized. Table 7 gives an overview of the predefined informal expressions and their normalized form. Wang et al. also performed a fifth preprocessing step: they stripped hash symbols (since they used the hashtags as the source for their labels). However, since hashtags are used as a feature for the classifier (see Section 3.4), this preprocessing step was not implemented.

Table 7: An overview of the predefined informal expressions and their normalized form.

Informal expression(s)    Normalized form
mn, m'n                   mijn
zn, z'n                   zijn
zo'n                      zo een
n, 'n                     een
m, 'm                     hem
t, 't                     het
idd                       inderdaad

As an example, Figure 3 shows the result of applying the preprocessing steps described above to the (example) tweet "@persoon JAAA ik heb zo'n zin in de vakantie!!!!".

Tweet before preprocessing: "@persoon JAAA ik heb zo'n zin in de vakantie!!!!"
Tweet after preprocessing:  "@user ja ik heb zo een zin in de vakantie!!"

Figure 3: An example of a tweet before and after preprocessing.
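A minimal sketch of these four steps in Python is given below. The patterns are illustrative simplifications: a literal reading of the third step maps 'JAAA' to 'jaa', whereas Figure 3 shows 'ja', so the exact repeated-character rule used in the original implementation may differ slightly.

    import re

    # Normalizations from Table 7 (applied to whole tokens only; a simplification).
    NORMALIZATIONS = {
        "mn": "mijn", "m'n": "mijn", "zn": "zijn", "z'n": "zijn",
        "zo'n": "zo een", "n": "een", "'n": "een", "m": "hem",
        "'m": "hem", "t": "het", "'t": "het", "idd": "inderdaad",
    }

    def preprocess(tweet):
        tweet = tweet.lower()                          # step 1: lower-case all words
        tweet = re.sub(r"@\w+", "@user", tweet)        # step 2: replace user mentions
        tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)   # step 3: collapse 3+ repeats to two
        # step 4: normalize predefined informal expressions
        return " ".join(NORMALIZATIONS.get(w, w) for w in tweet.split())

    print(preprocess("@persoon JAAA ik heb zo'n zin in de vakantie!!!!"))
    # -> "@user jaa ik heb zo een zin in de vakantie!!"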

3.4 Feature Set

Based on the related work described in Section 2, a feature set was realised that includes the following six features: unigrams, bigrams, hashtags, emoticons, unigrams over the Part-of-Speech (POS), and bigrams over the POS.

Unigrams and bigrams: N-gram features are used in the studies of Wang et al. (2012) and Balabantaray et al. (2012). We have worked with unigrams (N = 1) and bigrams (N = 2), which were extracted by looping over all the tweets in the annotated final dataset. Punctuation and emoticons were also included, and neither stemming nor stop word removal was applied (as in Wang et al. (2012, p. 591)). A tf-idf weighted feature was used for each unigram and bigram, using a sublinear term frequency (tf):

tf(t, d) = 1 + log(raw frequency of t in d) (1)


and a smoothed inverse document frequency (idf):

idf(t, D) = 1 + log( (N + 1) / (1 + |{d ∈ D : t ∈ d}|) )    (2)

In these equations, t is a unigram or bigram, d is a tweet in the annotated final dataset, D is the set of all tweets in the annotated final dataset, N is the total number of tweets in the annotated final dataset, and |{d ∈ D : t ∈ d}| is the number of tweets in the annotated final dataset that contain unigram or bigram t.

Hashtags: Hashtags were extracted by looping over all the tweets in the annotated final dataset. A tf-idf weighted feature was used for each hashtag.

Emoticons: Since emoticons are often used to express emotions, it was decided to include them in the feature set. A predefined set of emoticons was constructed, and a tf-idf weighted feature was used for each emoticon in this set. The predefined set of emoticons is:

:)  (:  :(  ):  :))  ((:  :((  )):  :-)  (-:  :-(  )-:  :-))  ((-:  :-((  ))-:  :/  :\  /:  \:  ;)  ;-)  ;))  ;-))  :p  ;-p

Unigrams and bigrams over the Part-of-Speech (POS): POS features have proven effective in the studies of Wang et al. (2012) and Balabantaray et al. (2012). Frog [3] (Bosch et al., 2007) was used for POS tagging. Though the tagger that is used in Frog is not trained on Dutch tweets (it is trained on a broad selection of manually annotated Part-of-Speech tagged corpora for Dutch), it is assumed that the tagger is effective for Dutch tweets. The unigrams and bigrams over the POS tags were extracted by looping over all the tagged tweets in the annotated final dataset. Again, a tf-idf weighted feature was used for each POS-unigram and POS-bigram.

All the features in the feature set have a variable threshold t (except for the emoticons). The value of this threshold determines in how many different tweets an extracted feature must appear. For example: if the threshold for unigrams is set to three, then all the extracted unigrams must appear in at least three different tweets; extracted unigrams that appear in fewer than three different tweets are discarded. The thresholds were used to improve the performance of the classifier (more on this in Section 3.5).
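Equations (1) and (2) and the document-frequency threshold correspond to options that scikit-learn's TfidfVectorizer exposes directly (sublinear_tf, smooth_idf, and min_df), so the n-gram features could be built roughly as follows; the exact implementation used in this work may differ.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Unigrams and bigrams, tf-idf weighted with sublinear tf (eq. 1) and
    # smoothed idf (eq. 2); min_df plays the role of the threshold t.
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),   # unigrams and bigrams
        sublinear_tf=True,    # tf(t, d) = 1 + log(raw frequency of t in d)
        smooth_idf=True,      # idf(t, D) = 1 + log((N + 1) / (1 + df(t)))
        min_df=3,             # keep n-grams occurring in >= 3 different tweets
    )
    # preprocessed_tweets: an assumed list of preprocessed tweet strings.
    X = vectorizer.fit_transform(preprocessed_tweets)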

3.5 Classification Algorithms

There exist multiple problem transformation methods for multi-label classification. Two problem transformation methods were tested: the binary relevance method (Tsoumakas and Katakis, 2007) and the random k-labelset (RAKEL) method (Tsoumakas and Vlahavas, 2007). The binary relevance method was implemented and tested using scikit-learn (Pedregosa et al., 2011), and the RAKEL method was tested using MEKA [4].

[3] http://ilk.uvt.nl/frog/
[4] http://meka.sourceforge.net/


Table 8: The initial search grid for testing different combinations of features and classifier parameters for each linear Support Vector Machine (SVM).

Feature                           Value
Unigrams                          [yes, no]
Threshold for the unigrams        [2, 3, 4, 5]
Bigrams                           [yes, no]
Threshold for the bigrams         [2, 3, 4, 5]
Hashtags                          [yes, no]
Threshold for the hashtags        [2, 3, 4, 5]
Emoticons                         [yes, no]
POS-unigrams                      [yes, no]
Threshold for the POS-unigrams    [2, 3, 4, 5]
POS-bigrams                       [yes, no]
Threshold for the POS-bigrams     [2, 3, 4, 5]

Classifier parameter              Value
Regularization/penalty            [L1, L2]
α                                 [0.0001, 0.001, 0.01]

3.5.1 The Binary Relevance Method

The binary relevance method reduces the multi-label classification problem to a single binary classifier per emotion category (i.e., each classifier is trained to distinguish one emotion category from all others). A linear Support Vector Machine (SVM) with stochastic gradient descent (SGD) learning was trained for every emotion category.

Different combinations of features and classifier parameters were tested for each SVM. Table 8 shows the initial search grid. Since it was not possible to test all the combinations defined by this search grid for each SVM [5], a different approach was taken, which involved testing seven sets of combinations.

The first set of combinations that were tested for each SVM included all the unigrams and bigrams as features (i.e., the threshold for unigrams and bigrams was set to one), regularization ∈ {L1, L2}, and α ∈ {0.0001, 0.001, 0.01}. For every combination, an overall F1 score was computed by averaging over the F1 scores of the SVMs, which in turn were computed by averaging over ten repeats of five-fold cross-validation on the annotated final dataset (a sketch of this setup follows below).
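A minimal sketch of this evaluation with scikit-learn is given below; training one SGD linear SVM per emotion column mirrors the binary relevance design, while the function name and data variables are illustrative.

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import KFold, cross_val_score

    def overall_f1(X, Y, penalty="l2", alpha=0.0001, repeats=10, folds=5):
        """Average positive-class F1 over emotions and repeated five-fold CV.

        X: feature matrix; Y: binary indicator matrix (tweets x 8 emotions).
        """
        scores = []
        for emotion in range(Y.shape[1]):
            # One binary linear SVM (hinge loss, SGD training) per emotion.
            svm = SGDClassifier(loss="hinge", penalty=penalty, alpha=alpha)
            for repeat in range(repeats):
                cv = KFold(n_splits=folds, shuffle=True, random_state=repeat)
                scores.extend(cross_val_score(svm, X, Y[:, emotion], cv=cv, scoring="f1"))
        return float(np.mean(scores))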

[5] 2 × 5 × 2 × 5 × 2 × 5 × 2 × 2 × 5 × 2 × 5 × 2 × 3 = 1.2e6 combinations for each linear Support Vector Machine (SVM). Of course, there are lots of redundant combinations. For example: when a combination with 'Unigrams' = 'no' has been tested, any combination that differs only in 'Threshold for the unigrams' is redundant, since unigrams are excluded from the feature set when 'Unigrams' = 'no'.

The second set of combinations that were tested for each SVM included all the unigrams, bigrams, and hashtags as features, the best regularization parameter from the first set of combinations (i.e., the regularization parameter used in the combination from set 1 with the highest overall F1 score), and α ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}.

The third set of combinations included all the unigrams, bigrams, hashtags, and emoticons as features, the best regularization parameter from the first set of combinations, and α ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}.

The fourth set of combinations included all the unigrams, bigrams, POS-unigrams, and POS-bigrams as features, the best regularization parameter from the first set of combinations, and α ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}.

The fifth set of combinations included all the features (i.e., unigrams, bigrams, hashtags, emoticons, POS-unigrams, and POS-bigrams), the best regularization parameter from the first set of combinations, and α ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}.

For the sixth set of combinations, the best combination from the fourth set was chosen (i.e., the one with the highest overall F1 score), and the threshold for all the features in this combination was ranged over {2, 3, 4, 5} (i.e., the same threshold for every feature).

Finally, for the seventh set of combinations, the best combination from the fifth set was chosen, and the threshold for all the features in this combination was ranged over {2, 3, 4, 5}.

Table 9 gives an overview of the seven sets of combinations that were tested for each SVM. The results are presented in Section 4.

Table 9: An overview of the seven sets of combinations that were tested for each linear Support Vector Machine in the binary relevance method. *: is (transitively) the best from set 1.

Set  Features                                       Threshold for all the features  Regularization     α
1.   unigrams, bigrams                              {1}                             {L1, L2}           {0.0001, 0.001, 0.01}
2.   unigrams, bigrams, hashtags                    {1}                             best from set 1    {0.00001, 0.0001, 0.001, 0.01, 0.1}
3.   unigrams, bigrams, hashtags, emoticons         {1}                             best from set 1    {0.00001, 0.0001, 0.001, 0.01, 0.1}
4.   unigrams, bigrams, POS-unigrams, POS-bigrams   {1}                             best from set 1    {0.00001, 0.0001, 0.001, 0.01, 0.1}
5.   all six features                               {1}                             best from set 1    {0.00001, 0.0001, 0.001, 0.01, 0.1}
6.   unigrams, bigrams, POS-unigrams, POS-bigrams   {2, 3, 4, 5}                    best from set 4*   best from set 4
7.   all six features                               {2, 3, 4, 5}                    best from set 5*   best from set 5

3.5.2 The Random k-Labelset Method

The random k-labelset (RAKEL) method is a problem reduction method that allows for learning dependencies/correlations between labels. Given a set of labels L, RAKEL first creates an ensemble of m k-labelsets. A k-labelset is a subset of L with cardinality k. Given the label set that is used in this work, the subset {Joy, Love} is a k-labelset with k = 2. The m k-labelsets are randomly selected from all the k-labelsets on L without replacement. m and k are user-specified parameters.

For every selected k-labelset, a label powerset classifier is realised. A label powerset classifier is a binary classifier that is trained on data where instances that contain the labels in the k-labelset are positive, and all the other instances are negative. RAKEL thus creates an ensemble of m label powerset classifiers.

A new instance is classified using a voting scheme. Every label powerset classifier provides a binary decision for the labels in its corresponding k-labelset. For example: if the label powerset classifier associated with the k-labelset {Joy, Love} classifies the new instance as belonging to that k-labelset, the labels 'Joy' and 'Love' each get a vote. RAKEL then computes the average decision for each label in L. If the average decision for a label is above some threshold t, the label is predicted for the new instance, thus allowing the prediction of multiple labels. t is a user-specified parameter.
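The voting scheme can be made concrete with a short sketch; the per-classifier decisions are taken as given here, and representing each classifier's output as a single 0/1 decision per k-labelset is an illustrative simplification of RAKEL.

    from collections import defaultdict

    def rakel_vote(labelsets, decisions, all_labels, t=0.5):
        """Aggregate the binary decisions of m label powerset classifiers.

        labelsets: list of k-labelsets, e.g. [{"Joy", "Love"}, {"Anger", "Fear"}, ...]
        decisions: list of 0/1 decisions, one per k-labelset, for one new instance.
        """
        votes, counts = defaultdict(int), defaultdict(int)
        for labelset, decision in zip(labelsets, decisions):
            for label in labelset:
                votes[label] += decision   # a positive decision votes for every label in the set
                counts[label] += 1         # how often the label could have been voted on
        # Predict a label when its average decision exceeds the threshold t.
        return {lab for lab in all_labels if counts[lab] and votes[lab] / counts[lab] > t}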

Before the random k-labelset method was tested, the correlations between the labels in the annotated final dataset were computed; these are shown in Table 10. Since there are labels that positively correlate with one another, it was decided to test the random k-labelset method with k = 2. Furthermore, it was decided to set m to fourteen: the number of 2-labelsets in the annotated final dataset. It is not possible to set the value of the t parameter in MEKA; the threshold is automatically calibrated (default option), with the possibility to automatically calibrate a threshold per labelset. The default option was chosen. Since it was not possible to use a linear Support Vector Machine with stochastic gradient descent learning as RAKEL's base learner, it was decided to use a linear Support Vector Machine with sequential minimal optimization, with C = 1.0.

Table 10: The correlations between the labels in the annotated final dataset.

Label                   (1)    (2)     (3)     (4)      (5)      (6)      (7)      (8)
(1) Disgust/contempt    -      0.017   0.514   -0.033   -0.158   0.001    -0.030   -0.224
(2) Fear                       -       0.111   0.002    -0.043   0.338    0.060    -0.058
(3) Anger                              -       -0.016   -0.129   0.131    0.044    -0.174
(4) Interest                                   -        -0.016   0.071    0.152    -0.010
(5) Love                                                -        -0.054   0.029    0.460
(6) Sadness                                                      -        0.115    -0.073
(7) Surprise                                                              -        0.068
(8) Joy                                                                            -

The best combination from the binary relevance method was selected (i.e., the one with the highest overall F1 score), and the features that are included in this combination are the ones that were used by RAKEL. Due to time constraints, it was not possible to test different combinations of features and classifier parameters (as with the binary relevance method). RAKEL's F1 score was computed according to five-fold cross-validation on the annotated final dataset. The results are presented in Section 4.

4 Results

The goal of this work is to develop an emotion classification system (i.e., a classifier) for ThemeStreams that can detect and classify multiple emotions in Dutch political tweets. The previous section described the approach that has been taken to achieve this goal. This section describes the results of applying the two algorithms described in Section 3.5 to the annotated final dataset described in Section 3.2.

The F1 scores that are reported in this section are traditional F1 scores (unless stated otherwise), that is, they are computed as the harmonic mean of precision and recall:

F1 = 2 · (precision · recall) / (precision + recall),

where precision = true positives / (true positives + false positives) and recall = true positives / (true positives + false negatives). A tweet belongs to the positive class if it has the emotion label in question associated with it.

Seven sets of combinations of features and classifier parameters were tested for the binary relevance method; these are shown in Table 9. An overall accuracy and F1 score was computed for every combination by averaging over the accuracies and F1 scores of the Support Vector Machines (SVMs), which in turn were computed by averaging over ten repeats of five-fold cross-validation on the annotated final dataset. The best combination from every set is reported in Tables 11-13. The best combination from every set is the one with the highest overall F1 score. Accuracy scores are not considered the main evaluation metric, since the annotated final dataset is (highly) unbalanced, resulting in accuracy scores that tend to overestimate performance.

Tables 11-13 show that set 6 includes the best combination of features and classifier parameters. This combination includes unigrams, bigrams, POS-unigrams, and POS-bigrams, all occurring in at least five different tweets, as features, L1 regularization, and an α of 0.01. The combination results in an overall F1 score of 0.209, with a standard deviation of 0.147. The standard deviation is rather high because the F1 score per emotion category varies a lot.

Tables 11-13 also show that the binary relevance method does not learn the 'Fear' label at all. The problem with the 'Fear' label is that it only occurs nine times in the annotated final dataset (399 tweets), which is too little for the corresponding SVM to learn the label. As a result, the SVM almost always predicts the absence of the label (with very few exceptions), resulting in a high accuracy score that overestimates the performance of the SVM.

Figure 4 shows a scatter plot of the F1 score per emotion category of the best combination from the binary relevance method against the percentage label frequency of that emotion category (i.e., (absolute label frequency of the emotion category / total number of labels) × 100; see Table 4). The figure shows that the F1 score increases with the percentage label frequency.

Figure 4: A scatter plot of the F1 score per emotion category of the best combination from the binary relevance method against the percentage label frequency of that emotion category.

Table 11: The (overall) accuracies and F1 scores for the best combinations from sets 1-3. *: regularization = L1, α = 0.01. **: regularization = L1, α = 0.001. ***: regularization = L1, α = 0.001. See Table 9 for the features that are included in each combination.

                    Best combination set 1.*          Best combination set 2.**         Best combination set 3.***
                    Accuracy        F1                Accuracy        F1                Accuracy        F1
Anger               0.734 ± 0.012   0.194 ± 0.033     0.746 ± 0.008   0.194 ± 0.037     0.749 ± 0.010   0.180 ± 0.044
Fear                0.959 ± 0.006   0.000 ± 0.000     0.945 ± 0.007   0.008 ± 0.024     0.947 ± 0.009   0.000 ± 0.000
Interest            0.634 ± 0.016   0.419 ± 0.029     0.652 ± 0.012   0.439 ± 0.026     0.645 ± 0.026   0.427 ± 0.033
Joy                 0.822 ± 0.012   0.220 ± 0.053     0.818 ± 0.011   0.214 ± 0.038     0.816 ± 0.013   0.224 ± 0.026
Sadness             0.941 ± 0.008   0.122 ± 0.067     0.936 ± 0.007   0.147 ± 0.080     0.936 ± 0.008   0.131 ± 0.082
Surprise            0.832 ± 0.011   0.139 ± 0.043     0.835 ± 0.012   0.116 ± 0.031     0.841 ± 0.012   0.127 ± 0.025
Disgust/contempt    0.634 ± 0.017   0.358 ± 0.032     0.621 ± 0.023   0.351 ± 0.035     0.630 ± 0.015   0.361 ± 0.033
Love                0.880 ± 0.012   0.089 ± 0.030     0.877 ± 0.008   0.084 ± 0.047     0.874 ± 0.005   0.087 ± 0.056
Overall             0.804 ± 0.119   0.193 ± 0.130     0.804 ± 0.114   0.194 ± 0.132     0.805 ± 0.114   0.192 ± 0.133


Table 12: The (overall) accuracies and F1 scores for the best combinations from sets 4-6. *: regularization = L1, α = 0.01. **: regularization = L1, α = 0.01. ***: threshold for all the features = 5, regularization = L1, α = 0.01. See Table 9 for the features that are included in each combination.

                    Best combination set 4.*          Best combination set 5.**         Best combination set 6.***
                    Accuracy        F1                Accuracy        F1                Accuracy        F1
Anger               0.713 ± 0.011   0.214 ± 0.038     0.710 ± 0.012   0.220 ± 0.033     0.711 ± 0.019   0.224 ± 0.033
Fear                0.945 ± 0.009   0.000 ± 0.000     0.954 ± 0.012   0.000 ± 0.000     0.946 ± 0.008   0.000 ± 0.000
Interest            0.632 ± 0.020   0.441 ± 0.024     0.629 ± 0.018   0.441 ± 0.033     0.652 ± 0.015   0.478 ± 0.025
Joy                 0.805 ± 0.013   0.242 ± 0.041     0.812 ± 0.011   0.276 ± 0.045     0.809 ± 0.016   0.296 ± 0.039
Sadness             0.930 ± 0.013   0.140 ± 0.057     0.928 ± 0.008   0.094 ± 0.035     0.924 ± 0.014   0.069 ± 0.074
Surprise            0.810 ± 0.018   0.136 ± 0.027     0.813 ± 0.014   0.142 ± 0.042     0.817 ± 0.014   0.151 ± 0.050
Disgust/contempt    0.597 ± 0.020   0.321 ± 0.026     0.604 ± 0.020   0.327 ± 0.037     0.588 ± 0.012   0.339 ± 0.017
Love                0.854 ± 0.010   0.069 ± 0.030     0.857 ± 0.010   0.102 ± 0.054     0.857 ± 0.010   0.111 ± 0.047
Overall             0.786 ± 0.121   0.195 ± 0.132     0.788 ± 0.122   0.200 ± 0.134     0.788 ± 0.119   0.209 ± 0.147

Table 13: The (overall) accuracy and F1 score for the best combination from set 7. *: threshold for all the features = 2, regularization = L1, α = 0.01. See Table 9 for the features that are included in the combination.

                    Best combination set 7.*
                    Accuracy        F1
Anger               0.692 ± 0.012   0.195 ± 0.031
Fear                0.942 ± 0.010   0.000 ± 0.000
Interest            0.635 ± 0.022   0.456 ± 0.023
Joy                 0.796 ± 0.016   0.250 ± 0.049
Sadness             0.922 ± 0.012   0.097 ± 0.073
Surprise            0.823 ± 0.015   0.157 ± 0.070
Disgust/contempt    0.593 ± 0.019   0.332 ± 0.036
Love                0.852 ± 0.012   0.117 ± 0.063
Overall             0.782 ± 0.121   0.200 ± 0.134

The random k-labelset (RAKEL) method was tested using the features that are included in the best combination from the binary relevance method. k was set to two and m was set to fourteen. A linear SVM with sequential minimal optimization was used as RAKEL's base learner, with C = 1.0. RAKEL's F1 score was computed according to five-fold cross-validation on the annotated final dataset.

There are two important points regarding RAKEL's F1 score. First, MEKA outputs micro- and macro-averaged F1 scores. For the micro-averaged F1 score, precision and recall are calculated by counting the total true positives, false negatives, and false positives (the micro-averaged F1 score is the harmonic mean of these two figures). For the macro-averaged F1 score, precision and recall are calculated per class and then averaged over the classes (the macro-averaged F1 score is the harmonic mean of the average precision and recall). However, for the binary relevance method, the F1 score was computed over the positive class only (i.e., the emotion label is present), since incorporating the negative class (i.e., the emotion label is not present) results in F1 scores that tend to overestimate performance (since the annotated final dataset is (highly) unbalanced).
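The difference between the two averaging schemes can be illustrated with scikit-learn's f1_score; the tiny label matrices below are made-up examples, and note that sklearn's 'macro' averages per-label F1 scores, which differs slightly from the definition quoted above (harmonic mean of averaged precision and recall).

    import numpy as np
    from sklearn.metrics import f1_score

    # Made-up multi-label ground truth and predictions (4 instances, 3 labels).
    y_true = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0]])
    y_pred = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]])

    # Micro: pool TP/FP/FN over all labels; macro: average the per-label F1 scores.
    print(f1_score(y_true, y_pred, average="micro"))  # 0.75
    print(f1_score(y_true, y_pred, average="macro"))  # ~0.67 (label 2 is never predicted)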

Second, MEKA does not output the F1 score per emotion category. It does output precision and recall per emotion category, but since it is not known whether these are computed according to the micro- or macro-averaged method, it is not possible to reliably compute the F1 score per emotion category as the harmonic mean of precision and recall.

In order to compare the binary relevance method to RAKEL, an overall micro-averaged F1 score was computed for the best combination from the binary relevance method. This micro-averaged F1 score was computed by averaging over the micro-averaged F1 scores of the SVMs, which in turn were computed by averaging over ten repeats of five-fold cross-validation on the annotated final dataset. Table 14 compares the binary relevance method to RAKEL.


Table 14: The (overall) micro-averaged F1 score for the binary relevance method and RAKEL. The micro-averaged F1 score per emotion category is missing for RAKEL.

                    Binary relevance method       RAKEL
                    Micro-averaged F1 score       Micro-averaged F1 score
Anger               0.701 ± 0.022                 -
Fear                0.947 ± 0.001                 -
Interest            0.634 ± 0.026                 -
Joy                 0.794 ± 0.011                 -
Sadness             0.921 ± 0.007                 -
Surprise            0.808 ± 0.019                 -
Disgust/contempt    0.571 ± 0.016                 -
Love                0.856 ± 0.009                 -
Overall             0.779 ± 0.125                 0.912 ± 0.004

5 Conclusion

This work has described the development of an emotion classification system (i.e., a classifier) for ThemeStreams that can detect and classify multiple emotions in Dutch political tweets. Using data from different installations of ThemeStreams, a hand-labelled dataset of 399 tweets covering eight emotion categories has been realised. Using the binary relevance method as problem reduction method, the developed multi-label classifier achieves an overall F1 score of 0.209 on this dataset. This F1 score has been achieved by testing different combinations of features and classifier parameters, of which the combination that uses unigrams, bigrams, POS-unigrams, and POS-bigrams, all occurring in at least five different tweets, as features, L1 regularization, and an α of 0.01, results in the highest overall F1 score.

Although Table 14 suggests that the random k-labelset (RAKEL) method outperforms the binary relevance method, too many settings/options in MEKA remain unknown to reliably conclude this. Furthermore, it is not known how the overall micro-averaged F1 score of RAKEL is computed (it has a significantly smaller standard deviation than the overall micro-averaged F1 score of the binary relevance method).

The number of positive instances per emotion category tends to affect the performance of the developed multi-label classifier. Figure 4 shows that the F1 score per emotion category increases with the percentage label frequency (i.e., the higher the number of positive instances for an emotion category, the higher the F1 score for that emotion category). However, this likely holds only up to a certain value, beyond which the dominant presence of the emotion category would prevail.


6 Discussion and Future Research

There are a number of ways in which the results of this work could be improved. First, the distribution of emotions in the annotated final dataset is (highly) unbalanced; e.g., the percentage label frequency of the emotion fear is 1.97% (nine positive instances) and that of the emotion sadness is 3.07% (fourteen positive instances). As a result, the classifier does not perform well on these emotions. The performance could be improved by oversampling tweets from minority emotions.

Second, for the extraction of unigrams and bigrams over the Part-of-Speech (i.e., POS-unigrams and POS-bigrams), a tagger has been used that is trained on a broad selection of manually annotated Part-of-Speech tagged corpora for Dutch. The use of a tagger that is trained on an annotated (Dutch) tweet corpus could result in higher-quality features that can be used to better predict emotions in the tweets.

Third, for the binary relevance method, seven sets of combinations of features and classifier parameters have been tested. Testing more combinations of features and classifier parameters could result in a combination that improves the performance of the classifier. Furthermore, incorporating more/other features in the feature set could also improve performance (e.g., lexicon-based features, sentiment features).

Finally, it was not possible to reliably compare the results of the random k-labelset (RAKEL) method to the results of the binary relevance method, since too many settings/options in MEKA were unknown. However, the results of RAKEL look promising, and more research into MEKA could result in a classifier that outperforms the one that uses the binary relevance method. In addition, a custom implementation of RAKEL might provide an alternative approach.

As future work, we intend to use the developed emotion classification system (i.e., the classifier) to examine how the emotions in the tweets affect the political discussions that are visualized by ThemeStreams, e.g., whether the emotions in the tweets propagate through the political discussions and whether certain topics are more emotionally loaded than others. This should provide further insight into the political discourse and the Dutch political landscape.

References

Alm, C. O., Roth, D., and Sproat, R. (2005). Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 579-586, Stroudsburg, PA, USA. Association for Computational Linguistics.

Balabantaray, R., Mohammad, M., and Sharma, N. (2012). Multi-class twitter emotion classification: A new approach. International Journal of Applied Information Systems, 4(1):48-53.

Bosch, A. v. d., Busser, B., Canisius, S., and Daelemans, W. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. LOT Occasional Series, 7:191-206.

Buitinck, L., van Amerongen, J., Tan, E., and de Rijke, M. (2015). Multi-emotion detection in user-generated reviews. In Hanbury, A., Kazai, G., Rauber, A., and Fuhr, N., editors, Advances in Information Retrieval, volume 9022 of Lecture Notes in Computer Science, pages 43-48. Springer International Publishing.

Danisman, T. and Alpkocak, A. (2008). Feeler: Emotion classification of text using vector space model. In AISB 2008 Convention: Communication, Interaction and Social Intelligence, volume 1, page 53.

de Rooij, O., Odijk, D., and de Rijke, M. (2013). ThemeStreams: Visualizing the stream of themes discussed in politics. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, pages 1077-1078, New York, NY, USA. ACM.

Ekman, P. (1992). An argument for basic emotions. Cognition & Emotion, 6(3-4):169-200.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825-2830.

Shaver, P., Schwartz, J., Kirson, D., and O'Connor, C. (1987). Emotion knowledge: Further exploration of a prototype approach. Journal of Personality and Social Psychology, 52(6):1061.

Tsoumakas, G. and Katakis, I. (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1-13.

Tsoumakas, G. and Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Machine Learning: ECML 2007, pages 406-417. Springer.

Wang, W., Chen, L., Thirunarayan, K., and Sheth, A. P. (2012). Harnessing Twitter "big data" for automatic emotion identification. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 587-592. IEEE.