Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 6203–6212, Marseille, 11–16 May 2020. © European Language Resources Association (ELRA), licensed under CC-BY-NC.

A Multi-Platform Arabic News Comment Dataset for Offensive Language Detection

Shammur A Chowdhury, Hamdy Mubarak, Ahmed Abdelali, Soon-gyo Jung, Bernard J Jansen, Joni Salminen
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
{shchowdhury, hmubarak, aabdelali, sjung, bjansen, jsalminen}@hbku.edu.qa

Abstract
Access to social media often enables users to engage in conversation with limited accountability. This allows users to share their opinions and ideology, especially regarding public content, occasionally adopting offensive language. Such language may encourage hate crimes or cause mental harm to targeted individuals or groups. Hence, it is important to detect offensive comments on social media platforms. Typically, studies focus on offensive commenting on a single platform, even though the problem of offensive language is observed across multiple platforms. In this paper, we therefore introduce and make publicly available a new dialectal Arabic news comment dataset, collected from multiple social media platforms: Twitter, Facebook, and YouTube. We follow a two-step crowd-annotator selection procedure for a low-representation language annotation task on a crowdsourcing platform. Furthermore, we analyze the distinctive lexical content along with the use of emojis in offensive comments. We train and evaluate classifiers using the annotated multi-platform dataset along with other publicly available data. Our results highlight the importance of a multi-platform dataset for (a) cross-platform, (b) cross-domain, and (c) cross-dialect generalization of classifier performance.

Keywords: Offensive Language, Arabic, Text Classification, Social Media Platforms, Twitter, Facebook, YouTube

1. Introduction
Social media platforms provide access for users from all over the world to connect with each other, share their knowledge, and express their opinions (Intapong et al., 2017). Within these platforms, people not only voice their opinions and concerns but also look for social connection, belongingness, and assurance of being accepted by society. Thus, any negatively perceived image of a group or an individual, held by even a small fraction of the community, can impact users' psychological well-being (Gülaçtı, 2010; Waldron, 2012), e.g., by creating propaganda and giving rise to hate crimes.

Often, this type of abusive propaganda spreads either through the content posted on the platforms or through the toxic behavior of users on social media. Users may post offensive posts or comments containing vulgar, pornographic, and/or hateful language to spread such verbal hostility. The increased risk and effect of such hostility on social media have attracted many multidisciplinary researchers and motivated the need for automatic detection of offensive posts and comments.

Various classification techniques have been studied and used to detect offensive language (Davidson et al., 2017; Mubarak and Darwish, 2019), along with related categories such as hate speech (Badjatiya et al., 2017; Salminen et al., 2018), cyberbullying (Dadvar et al., 2013; Hosseinmardi et al., 2015), and aggression (Kumar et al., 2018), among others. Modeling techniques ranging from keyword-based search and traditional machine learning to deep learning have been explored.

However, most previous studies are limited to Indo-European languages due to the availability of resources in these languages. Unlike these resource-rich languages, studies and resources for detecting offensive language in dialectal or Modern Standard Arabic (MSA) are still very limited. As with the Indo-European languages, most of the publicly available datasets for Arabic (Mubarak and Darwish, 2019; Mulki et al., 2019; Alakrot et al., 2018; Al-Ajlan and Ykhlef, 2018; Mubarak et al., 2017) originate mainly from a single social media platform: either Twitter (TW) or YouTube (YT).

However, the challenges of detecting offensive language are not constrained to one or two platforms but have a cross-platform nature (Salminen et al., 2020). Therefore, research efforts are needed to develop rich resources that can be used to design and evaluate cross-platform offensive language classifiers.

In this study, we introduce one of the first Dialectal Arabic (DA) offensive language datasets, extracted from three different social media platforms: TW, YT, and Facebook (FB). Our annotated dataset comprises a collection of 4,000 comments posted on social media platforms. The domain of the dataset is news comments from an international news organization targeting audiences from all over the world. The rationale for this choice of domain is that news content posted on social media platforms attracts active interaction between platform users, and the content itself is not directly offensive; thus, any offensiveness in the comments is the product of users' opinions and beliefs.

In addition, the study addresses the difficulties of, and possible approaches to, building such a versatile DA resource. Building a reliable dataset with high-quality annotation is necessary for designing accurate classifiers. Specifically, in this study, we highlight the need for, and quantify the criteria of, annotator and annotation selection when using crowdsourcing platforms for a less-represented language (Difallah et al., 2018; Ross et al., 2009). We thoroughly evaluated the annotated data using (a) inter-annotator agreement between the accepted annotators, and



(b) measuring accuracy of the crowdsourced annotation against expert annotation.

To understand and explore the generalizability of the introduced dataset, we present a series of classification experiments using Support Vector Machines (SVM). These experiments include studying the performance of the trained classifier on (a) a large number of comments (both in- and out-of-domain data), (b) cross-platform data, and (c) particular DA (Egyptian and Levantine) data. We also investigate lexical cues and the use of emojis in the offensive instances present in the dataset.

Overall, the key contributions of the paper can be summarised as follows:

• Introduction of a new Dialectal Arabic (DA) offensive language dataset of comments on news posts, extracted from multiple social media platforms (TW, YT, and FB) of an international news agency's social media accounts. To the best of our knowledge, this is one of the first multi-platform datasets for Arabic social media comments.

• Designing and quantifying reliable two-step crowd-annotator selection criteria for a low-representation language (such as DA).

• Analyzing lexical content and emoji usage in offensive language.

• Designing and evaluating classification models for:

  – in-domain and cross-domain social media data

  – cross-platform performance

  – cross-dialect (Egyptian and Levantine) data

• Publicly releasing the dataset and listing all the resources of the study.¹

The rest of the paper is structured as follows. We provide a brief overview of the existing work and datasets for Arabic and other languages in Section 2. We then discuss the data and the annotation collection procedures in Section 3. We present a detailed analysis of the lexical content and emojis in Section 4, and Section 5 provides the results of experiments. Finally, we conclude our work in Section 6.

2. Related Studies
Recent years have witnessed rising use of offensive language on social media platforms. This has forced many websites and organizations to remove such offensive content using either manual or automatic filtering processes. The current trend of offensive language use and its effect on particular individuals or groups has also attracted many multi-disciplinary studies.

Current conceptualization categorizes offensive language usage on social media mostly as hate speech, including remarks attacking a particular race, religion, or nationality, among others, and as vulgar, obscene, and pornographic comments, including explicit and rude sexual references (Jay

¹ Data download link: https://github.com/shammur/Arabic-Offensive-Multi-Platform-SocialMedia-Comment-Dataset

and Janschewitz, 2008). Most studies are conducted in English (Davidson et al., 2017; Silva et al., 2016; Mondal et al., 2017; Badjatiya et al., 2017; Chatzakou et al., 2017a; Chatzakou et al., 2017b; Chatzakou et al., 2017c; ElSherief et al., 2018; Unsvag and Gamback, 2018; Agarwal and Sureka, 2014) and German (Wiegand et al., 2018), using supervised deep learning architectures like variants of Recurrent Neural Networks (RNN) (Pitsilis et al., 2018; Pavlopoulos et al., 2017) and Convolutional Neural Networks (CNN) (Zhang et al., 2018), as well as Naive Bayes (NB) (Rish and others, 2001), SVM (Platt, 1998), and others.

Unlike for Indo-European languages, only a handful of studies have been conducted for Arabic. Even though Arabic is spoken by a large portion of the world's population, in practice its dialects can be mutually unintelligible to speakers without knowledge of them. This makes modelling offensive language for Arabic a notable challenge, mostly due to the lack of resource availability.

The authors in (Alakrot et al., 2018) used a YT dataset of 16k comments from Egypt, Iraq, and Libya, annotated by native speakers using a ternary scheme (offensive, inoffensive, and neutral) with an inter-annotator agreement of 71%. Using SVM, the authors obtained an F-measure of 82%. Unfortunately, the aforementioned dataset was not available to us while carrying out this study.

In (Mubarak et al., 2017), the authors first presented a TW collection of 1.1k items, including 100 Egyptian tweets from 10 controversial user accounts and their corresponding comments. The data was annotated by 3 annotators from Egypt with an agreement of 85%. This dataset was also used in (Mubarak and Darwish, 2019) to test a method for generating large-scale training data using a seed list of offensive words.

In addition to the tweet data, the authors in (Mubarak et al., 2017) also publicly released a large dataset of ≈32k deleted comments from Aljazeera.net, an online news portal. The comments include varieties of Arabic dialects and MSA, and were annotated by 3 annotators with an agreement of 87%. For both datasets, the annotation scheme included obscene, offensive (but not obscene), and clean.

In (Albadi et al., 2018), the authors explored hate speech detection targeting different religious groups, such as Muslims, Jews, and others. For the study, the authors introduced a multi-dialectal dataset of 6.6k tweets. In addition, the authors created a lexicon of terms commonly used in religious discussion, with scores representing polarity and strength for (non-)hate, derived using techniques like Pointwise Mutual Information (PMI). The authors also suggested that annotator agreement varies based on which religious group is being targeted. Moreover, the study presented classification results using lexicon-based, SVM, and GRU classifiers with pre-trained word embeddings, showing a performance of 77% F-measure.

A Levantine (Syrian, Lebanese, Palestinian, and Jordanian) DA dataset was presented in a recent study by the authors in (Mulki et al., 2019). In this work, the authors created a manually annotated political Twitter dataset of size 5.8k for classifying hate speech, abusive, and normal content, with the help of native speakers. The authors also presented lexical analysis in addition to classification results using SVM and NB algorithms.

Figure 1: Platform-wise distribution of the comments in the dataset.

Unlike most of the previous work on Arabic offensive language detection, we introduce a multi-dialectal Arabic comment dataset for offensive comment detection drawn from multiple social media platforms, including TW, YT, and FB. To the best of our knowledge, this is one of the first such studies for Arabic, and also one of the few offensive content detection datasets to include data from more than one platform. In addition to our dataset, we utilize three other datasets (Mubarak et al., 2017; Mulki et al., 2019) to study how such multi-platform data can help generalize offensive language detection models. We also study the use of emojis and their discriminating characteristics for offensive language.

3. Data

3.1. Data Collection
To study how offensive language is used in online interaction with news posts, and to design a classifier, we collected over 100k comments from different social media platforms, including Facebook, Twitter, and YouTube, for a well-reputed international news agency. We first collected all news content posted from 2011 to 2019 by the news agency on their social media accounts. We collected the content from each platform through its own API (YouTube, Facebook, and Twitter). Then, using each content ID, the comments for that content were collected. As Twitter does not provide an API for retrieving comments directly, we used the Standard Search API of Twitter, which only provides comments for the past 7 days. To overcome this limitation, we collected the comments periodically, every 6 hours, over a certain period, in order to extract the complete comment history of each news post.

3.2. Data Preparation and Selection
For data selection, we retained comments that include 5 or more Arabic tokens, apart from emojis, anonymizing them by replacing any user mentions with a 'USERIDX' tag and URLs with a 'URL' tag. In addition, we removed duplicate comments based on textual content. We then selected a random subset of 4,000 comments in total from the three major social media platforms (comment distribution shown in Figure 1) for manual annotation.
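The selection rules above (minimum Arabic token count, mention/URL anonymization, exact-duplicate removal) can be sketched as follows. This is our own minimal illustration, not the authors' code; the regexes and the function names are assumptions, while the 'USERIDX'/'URL' tags follow the paper's description.

```python
import re

# Matches runs of Arabic letters (basic Arabic Unicode block).
ARABIC_TOKEN = re.compile(r'[\u0600-\u06FF]+')

def anonymize(comment: str) -> str:
    comment = re.sub(r'@\w+', 'USERIDX', comment)      # mask user mentions
    comment = re.sub(r'https?://\S+', 'URL', comment)  # mask URLs
    return comment

def select_comments(comments, min_arabic_tokens=5):
    """Keep anonymized comments with >= min_arabic_tokens Arabic tokens, deduplicated."""
    seen, kept = set(), []
    for c in comments:
        c = anonymize(c)
        if len(ARABIC_TOKEN.findall(c)) < min_arabic_tokens:
            continue  # too few Arabic tokens (emojis and tags do not count)
        if c in seen:
            continue  # drop exact textual duplicates
        seen.add(c)
        kept.append(c)
    return kept

sample = ['@user شكرا',                               # too short: filtered out
          'هذا الخبر مهم جدا في رأيي المتواضع',        # kept
          'هذا الخبر مهم جدا في رأيي المتواضع']        # duplicate: dropped
filtered = select_comments(sample)
```

A real pipeline would likely also normalize whitespace and handle near-duplicates, but the paper only specifies exact textual duplicates.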

3.3. Annotation Guideline
The annotation task presented in this paper is straightforward: "Given a comment, categorize whether the comment is offensive or not." Even though the task seems benign, deciding whether a comment is offensive is not simple. The annotators were instructed to rely on their instinct and to set aside personal biases such as political or religious beliefs or their cultural background. To make the decision-making process easier, we provided elaborate and detailed instructions to the annotators, with examples.

For the task, we asked the annotators to consider an instance as offensive if the comment contains (a) abusive words; (b) explicit or implicit hostile intention that threatens or projects violence; or (c) contempt, humiliation, or belittling of groups or individuals, e.g., through animal analogies, name calling, attacks on their political ideologies or disabilities, cursing, insults to religious beliefs, incitement to racial or ethnic hatred, or other forms of insult and profanity. Details of the annotation guideline and the examples presented to the annotators in each case are made publicly available.²

In addition to the detailed examples, we apologized for the presence of any offensive words that might hurt the annotators' personal beliefs, and thanked them for attending to the accuracy of the task and for their participation.

3.4. Annotations via Crowdsourcing
Given the specified task, we used Amazon Mechanical Turk (AMT)³, a well-known crowdsourcing platform, to obtain manual annotations of these 4,000 news comments (mentioned in Section 3.2). For each comment, we collected 3 judgments at a cost of 1 cent per judgment. Our pilot study showed that 1 cent per comment is a reasonable amount to pay for such a small (binary) task, and that higher rewards brought no further improvement in time or quality. Details of the parameters used for the annotation task are shown in Table 1.

AMT is an efficient and cost-effective way to acquire manual annotations; however, it can pose several limitations for a task. The main difficulties faced for the annotation are judging annotators' language proficiency and their understanding of the task. To ensure annotation quality and language proficiency, we used two different evaluation criteria for the annotators.

The first criterion for annotators to qualify for the task is answering a series of multiple-choice questions (Example 1), designed by experts, that reflect the annotators' language proficiency and understanding of the questions.

Example 1 (Arabic original not recoverable from the extraction): "What do we call the father of the father?" The options are:

• (The uncle)
• (The father)
• (The maternal uncle)
• (The grandfather)

² https://github.com/shammur/Arabic-Offensive-Multi-Platform-SocialMedia-Comment-Dataset/annotation_guideline/annotation_guideline.pdf

³ http://mturk.com


No. of Comments: 4000
Cost per Comment: 1 cent
Comments/HIT: 25
Gold/HIT: 5
Gold Review Policy: dynamic
Assignments/HIT: 3
Task duration: 7 days
Assignment duration: 1 hour
Avg. Annotators/Comment: 3.23
Max. Annotators/Comment: 5
No. of Comments with more than 3 annotations: 787

Table 1: Crowdsourcing annotation task details. In some cases, one or two annotators did not pass the quality evaluation; their annotations were rejected and the task was extended to new annotators.

An annotator must answer 80% of the questions correctly (i.e., 8 out of 10) to pass the qualification test. Once annotators pass the test, they can attempt the main classification task. A total of 26 annotators qualified in this test; among them, only ≈54% scored full marks, while the rest scored the minimum 80% needed to pass.

To evaluate an annotator's assignment for task accuracy, we used gold standard instances hidden in the designed Human Intelligence Tasks (HITs). For each HIT, we assigned 25 comments for annotation, out of which 5 were randomly assigned as test questions. A threshold of 80% is again set as the selection criterion, implying that 4 out of 5 gold instances must be correct for an annotator's assignment to be accepted. These test questions (comments) are selected randomly from a pool of 60 comments annotated by the domain experts. The selected gold comments include 55 (out of 60) offensive comments and have an average agreement of ≈95.63%.

A closer look into the agreement per class indicates that the annotators were more confident on non-offensive comments, with an average agreement of 97.91% (minimum 84.76%), compared to offensive comments (average 95.63%, minimum 61.7%).
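The gold-question acceptance rule described above (4 of 5 hidden gold answers must match the expert labels) can be sketched as follows. This is an illustrative sketch under our own naming; the function, label strings, and comment IDs are hypothetical, while the 80% threshold comes from the paper.

```python
def accept_assignment(annotator_labels, gold_labels, threshold=0.8):
    """Accept an assignment if the annotator matched the expert label on
    at least `threshold` of the hidden gold items.

    annotator_labels, gold_labels: dicts mapping comment_id -> 'OFF'/'NOT'.
    """
    correct = sum(annotator_labels[cid] == lab for cid, lab in gold_labels.items())
    return correct / len(gold_labels) >= threshold

# Hypothetical gold labels for the 5 hidden test questions in one HIT.
gold = {'c1': 'OFF', 'c2': 'NOT', 'c3': 'OFF', 'c4': 'NOT', 'c5': 'OFF'}

good = {'c1': 'OFF', 'c2': 'NOT', 'c3': 'OFF', 'c4': 'NOT', 'c5': 'NOT'}  # 4/5 correct
bad  = {'c1': 'NOT', 'c2': 'OFF', 'c3': 'OFF', 'c4': 'NOT', 'c5': 'OFF'}  # 3/5 correct

ok_good = accept_assignment(good, gold)  # accepted
ok_bad = accept_assignment(bad, gold)    # rejected; task re-assigned
```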

Figure 2: Platform-wise annotation label distribution. "All" represents the total instances of offensive and not-offensive comments in the annotated dataset.

Figure 3: Platform-wise annotation label distribution for the types of offensive comments present in the annotated dataset.

3.5. Annotation Results
Annotation Agreement and Evaluation: As each comment has 3 annotations (judgments), we assigned the label agreed on by the majority (i.e., 2/3) of the annotators. We observed that for offensive comments, only 54.2% have unanimous agreement, whereas for non-offensive comments, the annotators are in complete agreement for 83.7%. This shows the complexity of identifying offensiveness, especially in an Arabic-language context, due to the ambiguity of lexical variations in Arabic dialects compared to languages such as English and German. For the crowd-annotated dataset, we obtain a Fleiss's kappa (Falotico and Quatto, 2015) of κ = 0.72. Although κ theoretically applies to a fixed set of annotators, in our case we randomly assigned the 3 judgments to the hypothetical annotators A1-3.

To assess the reliability of the crowd annotations, we randomly selected 500 comments to be annotated by a domain expert. Using the expert annotation as the reference, we observed that the accuracy of our crowdsourced dataset is 94%. From further error analysis, we observed that the comments mislabeled by the crowd annotators include offensive content written using sarcasm and irony, along with some long comments mixed with statements rather than opinions.
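The majority vote and the Fleiss's kappa computation described above can be sketched as follows, assuming 3 judgments per comment. This is a minimal illustration with our own variable names, not the authors' code; the toy rating matrix is invented (perfect agreement, so κ = 1.0).

```python
from collections import Counter

def majority_label(judgments):
    """Return the label chosen by the majority of the judgments."""
    return Counter(judgments).most_common(1)[0][0]

def fleiss_kappa(rating_matrix):
    """Fleiss's kappa for a list of per-item category counts,
    e.g. [[3, 0], [2, 1], ...] for 3 raters and 2 categories."""
    n_items = len(rating_matrix)
    n_raters = sum(rating_matrix[0])
    n_cats = len(rating_matrix[0])
    # Marginal proportion of each category over all ratings.
    p_j = [sum(row[j] for row in rating_matrix) / (n_items * n_raters)
           for j in range(n_cats)]
    # Per-item agreement.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in rating_matrix]
    P_bar = sum(P_i) / n_items      # mean observed agreement
    P_e = sum(p * p for p in p_j)   # chance agreement
    return (P_bar - P_e) / (1 - P_e)

label = majority_label(['OFF', 'OFF', 'NOT'])      # -> 'OFF'
kappa = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])  # perfect agreement
```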

Annotation Distribution: Based on the majority voting, we present the distribution of the final annotated dataset in Figure 2. We observed that the percentage of offensive comments varies per platform, as shown in Figure 2, with TW comments having the most offensive instances, followed by FB and then YT.

Types of Offensive Comments: To further understand the annotated data and to discriminate between different types of offensive comments, we manually annotated the 675 offensive comments in our dataset (16.88%; see "All/Offensive" in Figure 2) for the hate speech and vulgar (but not hate) categories. The obtained distribution of each class (vulgar and hate speech) is presented in Figure 3. From the class distribution, we can observe that on all three platforms, the "hate" category is more prominent than "vulgar" alone. Our results indicate that FB, followed by TW, has significantly more hate comments than YT channel comments.


Figure 4: Word cloud of highly non-offensive (NOT) tokens, with valence score ϑ(·) = -1, in comments. Some of the most frequent words include 'thank you' and 'important', along with other conversation fillers and reactions like 'hahah'.

Figure 5: Word cloud of highly offensive tokens, with valence score ϑ(·) = 1, in comments. Corresponding English examples can be found in Table 3.

4. Data Analysis
4.1. Lexical Analysis
To understand how distinctive the comments are with respect to the offensive and non-offensive classes, we analyzed the lexical content of the dataset. We compared the vocabularies of the two classes using the valence score (Conover et al., 2011; Mubarak and Darwish, 2019) ϑ for every token x, following Equation 1:

    ϑ(x) = 2 · (C(x|OFF)/T_OFF) / (C(x|OFF)/T_OFF + C(x|NOT)/T_NOT) - 1    (1)

where C(·) is the frequency of the token x for a given class (OFF or NOT), and T_OFF and T_NOT are the total numbers of tokens present in each class. ϑ(x) ∈ [-1, +1], with +1 indicating that the token is used significantly more in offensive content than in non-offensive content, and vice versa.

We identified the most distinctive unigrams in each class using the valence score and present them as word clouds in Figure 4 and Figure 5. We also present the top 10 highly offensive tokens in Table 3, with their frequencies. Furthermore, using the same score, we present the most distinctive bi- and tri-grams in Table 2.

From the lexical analysis, we observed that words like 'dogs', 'bark', 'dirty', and 'envy' are more common in offensive instances than in non-offensive instances. Similarly, looking into the bi- and tri-grams, we observed that words like 'rage' and 'curse' are very frequent in offensive comments.
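The valence score of Equation 1 can be sketched as follows. This is our own minimal implementation; the toy counts are invented purely to illustrate the score's behavior (a token appearing only in OFF scores +1, a mostly-NOT token scores negative).

```python
from collections import Counter

def valence(token, C_off, C_not, T_off, T_not):
    """Equation 1: ranges over [-1, +1]; +1 means the token occurs
    (relatively) only in offensive comments."""
    r_off = C_off[token] / T_off  # per-class relative frequency
    r_not = C_not[token] / T_not
    return 2 * r_off / (r_off + r_not) - 1

# Invented per-class token counts for illustration.
C_off = Counter({'dog': 8, 'news': 2})
C_not = Counter({'news': 18, 'thanks': 2})
T_off, T_not = sum(C_off.values()), sum(C_not.values())

v_dog = valence('dog', C_off, C_not, T_off, T_not)    # only in OFF -> +1
v_news = valence('news', C_off, C_not, T_off, T_not)  # mostly NOT -> negative
```

Note that tokens absent from both classes would divide by zero; in practice one would restrict the score to tokens observed in the corpus.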

4.2. Use of Emojis in Offensive Language
Analyzing the comments reveals traits that are specific to each of the two types, offensive and non-offensive. Further analysis shows that emoji usage also differs between the two types.

Figure 6: Analysis of emoji use in offensive comments. Panel (a) lists the 10 most highly offensive emojis, with valence score ϑ = 1.0; panel (b) shows the top 3 groups of emojis with ϑ = 1.0. For each group, ω is reported as a percentage.

To understand the distinctive usage of emojis, we also calculated the valence score ϑ(·) for all the emojis. Figure 6 shows the top 10 emojis present in the offensive comments of the dataset with valence score ϑ(·) = 1.

Furthermore, we grouped all the emojis based on their categories, including "people, face and smileys" and "animals and nature", among others, as defined by the Unicode Consortium⁴. We then calculated the weight of each group, ω_g, using Equation 2:

    ω_g = Σ_{e_i ∈ g} C(e_i | ϑ = 1.0) / Σ_{e ∈ L} C(e | ϑ = 1.0)    (2)

where e_i represents the i-th emoji present in group g given the valence score ϑ = 1.0, L is the list of emojis present in the offensive dataset, and C(·) is the usage frequency of the given emoji. The ω values for the top 3 frequent groups are shown in Figure 6.

Our analysis of offensive emojis also included studying the most frequent emoji bi-grams with high valence score ϑ = 1.0. We observed that the top 2 most frequent bi-grams (each with frequency 3) are also from the animals group.
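The group weight of Equation 2 can be sketched as follows: for one emoji group, it is the share of all valence-1.0 emoji occurrences that falls inside that group. The counts and group membership below are invented for illustration; only the formula follows the paper.

```python
def group_weight(group_emojis, usage_counts):
    """Equation 2: share of valence-1.0 emoji occurrences belonging to one group,
    reported as a percentage as in Figure 6.

    usage_counts: frequency of each emoji with valence 1.0 in offensive comments.
    """
    total = sum(usage_counts.values())
    in_group = sum(c for e, c in usage_counts.items() if e in group_emojis)
    return 100.0 * in_group / total

# Hypothetical frequencies of valence-1.0 emojis in offensive comments.
counts = {'🐷': 5, '🐶': 3, '😡': 2}
animals = {'🐷', '🐶'}  # the "animals and nature" group, per Unicode categories

w_animals = group_weight(animals, counts)  # 8 of 10 occurrences -> 80%
```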

⁴ https://unicode.org/emoji/charts-12.1/emoji-ordering.html


Highly OFF n-gram (gloss)      | Freq | Highly NOT n-gram (gloss)   | Freq
(piglet channel)               | 5    | (Messenger of God)          | 27
(die with your rage)           | 4    | (many many)                 | 17
(was downed)                   | 4    | (will power)                | 17
(downing all)                  | 4    | (Arab countries)            | 15
(with all plots)               | 4    | (God's willing)             | 14
(God's curse on you all)       | 4    | (power except with God)     | 17
(Donkey's milk)                | 4    | (Gulf countries)            | 10
(and who)                      | 2    | (I was)                     | 9
(who is the one)               | 2    | (except God)                | 9

Table 2: List of the most frequent n-grams (n = 2, 3) for both offensive (OFF) and non-offensive (NOT) comments by valence score. [Arabic n-grams not recoverable from the extraction.] The parenthesized glosses are the translations in English; Freq is the frequency of the n-gram in the corresponding class.

Token (gloss)          | Frequency
(Piglet)               | 21
(Dog)                  | 8
(Dirty)                | 8
(Mercenaries)          | 7
(Bark)                 | 6
(With your envy/rage)  | 5
(Filthy)               | 5
(O Dogs)               | 5
(Nose)                 | 4
(Plots)                | 4

Table 3: List of the top 10 most highly offensive unigrams, with ϑ(word) = 1.0. [Arabic tokens not recoverable from the extraction.] The parenthesized glosses are the translations in English.

Consequently, our findings suggest that emojis representing animals appear uniquely in offensive comments, indicating their use by analogy for offensive purposes. This observation is in line with our findings on distinctive lexical tokens (such as "dog") used in offensive comments.

5. Experiments and Results

This section describes the details of the feature extraction and machine learning algorithm used to test the performance of the dataset. The motivation for this experimental exploration is to evaluate the released dataset and to provide baseline results in order to facilitate future studies. In addition, we also utilize the presence of comments from different platforms to test the generalizability of a platform-dependent model (trained on data from a specific platform) across (a) other social media/news platforms and (b) out-of-domain short content and comment datasets.

5.1. Datasets

In addition to our dataset, referred to as the Multi-Platform Offensive Language Dataset (MPOLD), we also utilized three previously published datasets, detailed in Table 4.

For evaluating performance on out-of-domain5 datasets, we used Egyptian and Levantine Arabic datasets. The Egyptian Arabic dialect data includes 100 TW posts and their 10 corresponding comments6 per tweet. The tweets are extracted from the 10 most controversial Egyptian TW users (Mubarak et al., 2017). This dataset is referred to in this paper as the Egyptian Tweets and corresponding Comment Dataset (ETCD). The Levantine dataset (referred to as L-HSAB) (Mulki et al., 2019) is collected from the timelines of politicians, social/political activists, and TV anchors, and does not contain any comments.

Similarly, we also used the Deleted Comments7 Dataset (further referred to as DCD) from Aljazeera.net (Mubarak et al., 2017) to study how an offensive language detection model performs on a different platform.

Preprocessing: To prepare the data for classification, we first tokenized the input sentences, removing punctuation, URLs, stop words, and diacritics present in the text. We inserted whitespace in cases where there was no separation between adjacent words and emojis. For our study, we kept the emojis and hashtags due to the contextual information they carry.
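A minimal sketch of this preprocessing step (the authors' exact tokenizer, stop-word list, and emoji ranges are not specified; the regexes and the tiny stop-word set below are illustrative assumptions):

```python
import re

URL_RE = re.compile(r"https?://\S+")
DIACRITICS_RE = re.compile(r"[\u064B-\u0652]")                   # Arabic tashkeel
EMOJI_RE = re.compile(r"([\U0001F300-\U0001FAFF\u2600-\u27BF])")
# punctuation to strip; '#' is deliberately excluded so hashtags survive
PUNCT_RE = re.compile(r"[!\"$%&'()*+,\-./:;<=>?@\[\]^_`{|}~\u060C\u061B\u061F]")

# tiny illustrative stop-word set; the authors' list is not published here
STOP_WORDS = {"في", "من", "على", "و"}

def preprocess(text: str) -> list[str]:
    text = URL_RE.sub(" ", text)          # remove URLs
    text = DIACRITICS_RE.sub("", text)    # remove diacritics
    text = EMOJI_RE.sub(r" \1 ", text)    # pad emojis lacking surrounding spaces
    text = PUNCT_RE.sub(" ", text)        # remove punctuation
    return [t for t in text.split() if t not in STOP_WORDS]

print(preprocess("رائع😀 http://t.co/x في البيت!"))
```

Note how the emoji attached to the first word is split into its own token, matching the whitespace-insertion step described above.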

5.2. Classification Design

As a baseline for the MPOLD dataset, we conducted experiments using a classical machine learning algorithm: SVM. The SVM algorithm (Platt, 1998) is based on the Structural Risk Minimization principle from computational learning theory. The algorithm is a proven universal learner and is well known for its ability to learn independently of the dimensionality of the feature space and the class distribution of the training dataset. These properties make it one of the most popular supervised classification

5 Here, representing that the contents are not news posts.
6 http://alt.qcri.org/~hmubarak/offensive/TweetClassification-Summary.xlsx
7 http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx


Abbr     Dataset description                              Labels                       #I (%OFF)       avg. len   Used for
MPOLD    News comments from TW, FB, and YT                OFF, NOT; V, HS, Oth         4000 (16.9%)    22.8       train/test
                                                          (180/185/417)
ETCD     100 Egyptian tweets and
         corresponding comments                           OBs, OFF-OBs, NOT            1100 (≈59.0%)   13.5       test
L-HSAB   Levantine tweets                                 Abs, HS, NOT                 5846 (≈37.6%)   12.0       test
DCD      Deleted comments from a news website             OBs, OFF-OBs, NOT            31692 (≈8.2%)   17.4       train/test

Table 4: Details of the datasets used for the study; ETCD and L-HSAB are the out-of-domain datasets, and the datasets come from different platforms. #I represents the number of instances, and %OFF the percentage of offensive labels used in this study. For ETCD and DCD, we merged the offensive comments, obscene (OBs) and other offensive comments (OFF-OBs), into OFF to match our labels. Similarly, for the L-HSAB dataset we merged HS and Abs (abusive) into OFF. avg. len represents the average length of the corresponding content in number of words. V: Vulgar, HS: Hate speech, Oth: Other offensive categories.

methods, and thus our choice for the baseline and all the experiments conducted in this study.

To train the classifier, we transform the online comments into bag-of-character-n-grams vectors weighted with logarithmic term frequencies (tf) multiplied by inverse document frequencies (idf). For this, we created character n-grams using characters inside word boundaries only and padded the edge characters with whitespace. To utilize contextual information, i.e., n-grams, we extracted features using n = 1 to n = 5.
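This feature pipeline maps naturally onto scikit-learn, whose `char_wb` analyzer builds character n-grams inside word boundaries and pads edge characters with whitespace, as described above. The SVM hyper-parameters are not reported in the paper, so the defaults below are assumptions, and the four labelled comments are toy examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# sublinear_tf=True applies the logarithmic tf weighting (1 + log tf);
# char_wb restricts n-grams to word interiors and pads edges with spaces
features = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 5), sublinear_tf=True)
clf = make_pipeline(features, LinearSVC())

# toy comments with hypothetical labels (OFF = offensive, NOT = not offensive)
train_texts = ["يا كلب", "شكرا جزيلا", "انت حقير", "مقال رائع"]
train_labels = ["OFF", "NOT", "OFF", "NOT"]
clf.fit(train_texts, train_labels)
print(clf.predict(["يا كلب حقير"]))
```

Character n-grams are a common choice for dialectal Arabic because they are robust to spelling variation across dialects.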

5.3. Performance Evaluation

To measure classification performance, we report the macro-averaged F1-measure (Fm), calculated using macro-averaged precision and recall, i.e., the average of P and R over both classes, respectively. The motivation behind this choice is the imbalanced class distribution, which makes well-known measures such as accuracy and micro-averaged F1 not representative of the performance. Since the performance on both classes is of interest, we also report the F-measure of the individual classes.
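The gap between macro- and micro-averaged F1 on imbalanced labels is easy to demonstrate with made-up predictions (not the paper's outputs):

```python
from sklearn.metrics import f1_score

y_true = ["NOT"] * 8 + ["OFF"] * 2
y_pred = ["NOT"] * 8 + ["OFF", "NOT"]     # one offensive comment missed

f_not, f_off = f1_score(y_true, y_pred, labels=["NOT", "OFF"], average=None)
macro = f1_score(y_true, y_pred, average="macro")
micro = f1_score(y_true, y_pred, average="micro")
# the minority-class error drags the macro score down, while micro
# (equal to accuracy in the single-label case) stays high
print(round(f_not, 2), round(f_off, 2), round(macro, 2), round(micro, 2))
```

Here a single missed OFF instance costs far more under macro-averaging, which is exactly why it is the headline metric above.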

         Fm     NOT    OFF
Chance   0.49   0.83   0.14
Lex      0.53   0.84   0.21
SVM      0.74   0.92   0.56

Table 5: Reported performance on the MPOLD dataset using 5-fold cross-validation (SVM against the lexicon and chance baselines).

5.4. Results

MPOLD performance: To evaluate the classification performance, we performed five-fold8 cross-validation on the dataset, maintaining the natural distribution of the classes in each fold. The performance on all instances is shown in Table 5.

8 We chose 5 instead of 10 as the number of folds to ensure a good sample size and distribution of both labels in each test fold.
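Stratified folds preserve the natural class distribution described above; a minimal sketch with a hypothetical 20/80 label imbalance:

```python
from sklearn.model_selection import StratifiedKFold

texts = [f"comment {i}" for i in range(100)]     # placeholder comments
labels = ["OFF"] * 20 + ["NOT"] * 80             # hypothetical 20/80 imbalance

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, test_idx in skf.split(texts, labels):
    # every test fold keeps the natural ratio: 4 OFF among 20 instances
    n_off = sum(labels[i] == "OFF" for i in test_idx)
    print(len(test_idx), n_off)                  # → 20 4 (for each fold)
```

With only 5 folds, each test fold still holds a workable number of minority-class instances, which is the rationale given in the footnote.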

Testset   Trained on    Fm     NOT    OFF
FB        DCD           0.34   0.43   0.25
          TW            0.62   0.90   0.34
          YT            0.62   0.92   0.31
          TW+YT         0.68   0.93   0.44
          TW+YT+DCD     0.78   0.94   0.63
TW        DCD           0.52   0.51   0.53
          FB            0.51   0.83   0.19
          YT            0.54   0.82   0.25
          FB+YT         0.59   0.84   0.35
          FB+YT+DCD     0.84   0.90   0.78
YT        DCD           0.44   0.67   0.22
          FB            0.53   0.96   0.10
          TW            0.60   0.94   0.27
          TW+FB         0.63   0.95   0.31
          TW+FB+DCD     0.82   0.97   0.67
DCD       TW            0.37   0.39   0.34
          FB            0.21   0.31   0.10
          YT            0.29   0.26   0.32
          FB+YT+TW      0.40   0.35   0.45

Table 6: F-measure performances based on platform-wise datasets. The test sets presented here are subsets of the MPOLD data; for instance, TW ∈ MPOLD is the data extracted from Twitter. Similarly, YT represents YouTube data and FB represents comments from Facebook. DCD: deleted comments dataset. NOT = not-offensive, OFF = offensive instances. The rows whose training combination includes DCD present the results when the platform-wise data in our dataset is modelled along with the online-news-platform data (DCD).

To compare the obtained results, we report a chance-level baseline based on the prior distribution of the labels in each fold. In addition, we present a simple lexicon-based baseline (Lex) using a list of possible offensive words and phrases. This word list combines the obscene words presented in (Mubarak et al., 2017) with filtered entries from the Hatebase (Hatebase.org) lexicon. From Table 5, we observe that the SVM performs significantly better than the lexicon and chance-level baselines.
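A lexicon baseline of this kind can be as simple as flagging any comment that contains a listed term; the two-entry list below is a stand-in for the merged Mubarak et al. (2017) / Hatebase word list, which is not reproduced here:

```python
# stand-in lexicon; the real baseline merges the obscene-word list of
# Mubarak et al. (2017) with filtered Hatebase entries
OFFENSIVE_LEXICON = {"كلب", "حقير"}

def lex_baseline(comment: str) -> str:
    """Predict OFF when any token matches a lexicon entry, else NOT."""
    return "OFF" if set(comment.split()) & OFFENSIVE_LEXICON else "NOT"

print(lex_baseline("يا كلب"))      # → OFF
print(lex_baseline("مقال رائع"))   # → NOT
```

Such exact-match lookups miss spelling variants and implicit offense, which is consistent with Lex trailing the SVM in Table 5.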

Platform-specific performance: To assess the generalization capabilities of the designed model, we evaluated


Testset   Model    Fm     NOT    OFF
ETCD      Lex      0.51   0.58   0.44
          DCD      0.64   0.76   0.53
          MPOLD    0.61   0.66   0.56
          MPOLD+   0.68   0.61   0.75
L-HSAB    Lex      0.46   0.74   0.18
          DCD      0.59   0.60   0.58
          MPOLD    0.64   0.77   0.51
          MPOLD+   0.62   0.64   0.59

Table 7: Reported F-measures of the models trained on DCD, MPOLD (MPOLD), and MPOLD + DCD (MPOLD+), tested on the out-of-domain test sets: (a) ETCD, the controversial Egyptian tweets and comments dataset, and (b) L-HSAB, the Levantine tweets dataset. Lex is the lexicon-based baseline for the test sets.

the models using leave-one-platform-dataset-out (LOPO) cross-validation. The performance of the models on each social media platform's data is presented in Table 6. Besides the three platforms in our dataset, for this experiment we also included DCD, the deleted comments from an online news portal, as a fourth platform.
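LOPO evaluation maps directly onto a leave-one-group-out split with the source platform as the group label; a sketch with placeholder comments and labels:

```python
from sklearn.model_selection import LeaveOneGroupOut

texts = ["c1", "c2", "c3", "c4", "c5", "c6"]            # placeholder comments
labels = ["OFF", "NOT", "NOT", "OFF", "NOT", "OFF"]
groups = ["TW", "TW", "FB", "FB", "YT", "YT"]           # source platform per comment

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(texts, labels, groups=groups):
    held_out = {groups[i] for i in test_idx}            # one platform per split
    train_platforms = sorted({groups[i] for i in train_idx})
    print(held_out, "tested against model trained on", train_platforms)
```

Each split trains on two platforms and tests on the held-out third, mirroring the rows of Table 6.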

We observed that, in most cases, models trained on one particular platform perform poorly compared to models trained on multiple-platform datasets. These multiple-platform models include both (a) combinations of 2 platforms from the MPOLD data and (b) 2 MPOLD platforms plus the DCD dataset.

For instance, the model trained on TW+YT performs significantly better on the FB test set than the individual YT and TW models. The same pattern is observed when DCD is added to the (YT+TW) combination; in this case, a significant jump for the OFF class is noticed, increasing the overall macro-averaged F-measure (Fm). This indicates the importance of multiple-platform datasets for the generalization of offensive language detection models.

In addition to being different-platform data from a similar domain, the DCD dataset brings two further advantages for training: (a) the addition of comments written in MSA alongside DA, and (b) an increase in the number of offensive examples available for training. The MPOLD comments are mostly dialectal Arabic; adding DCD therefore provides MSA comments for the model to recognize and learn from, while at the same time increasing the number of (not-)offensive training examples. Thus, the results of the combined-platform models (e.g., TW+YT+DCD) indicate the impact of both the variety of examples and the size of the training data.

However, by comparing the results of the individual models, trained on only DCD, TW, YT, or FB, we observed that even though DCD has a large number of training instances, this model cannot outperform TW/YT/FB alone in most cases. This again highlights the inadequacy of model generalization when trained on one particular platform and tested on another.

Cross-domain performance: The performance of the models trained on MPOLD and MPOLD+DCD (MPOLD+) across domains is presented in Table 7. The results indicate that, for the ETCD dataset, both our models (MPOLD and MPOLD+) outperform the baseline and the DCD model in OFF-class F-measure. The overall best performance is obtained using MPOLD+, with improvements of 7% and 4% over the MPOLD and DCD models, respectively. As for the L-HSAB data, the MPOLD model outperforms all other models in terms of overall F-measure; however, for the offensive class only (OFF), the best performance is obtained by MPOLD+. The performance of the MPOLD and MPOLD+ models on both the ETCD and L-HSAB datasets indicates the positive impact of a multiple-platform training dataset and implies cross-domain generalizability.

6. Conclusion

In this paper, we introduced one of the first dialectal Arabic comment datasets extracted from multiple social media platforms. Utilizing this dataset, we designed classification experiments to study the impact of multi-platform data on platform-wise model generalization, evaluating the models using platform-wise cross-validation. In addition to our dataset, we also exploited existing publicly available datasets, including a deleted-comments dataset from a news website. We observed that combined-platform models perform significantly better than most individual models, indicating the importance of such multi-platform datasets.

Using our designed models, we also evaluated other out-of-domain data (ETCD and L-HSAB) extracted from TW. Besides being out-of-domain, another advantage of evaluating these datasets is to see how our models perform on particular DA content, i.e., ETCD (Egyptian dialect) and L-HSAB (Levantine dialect). Our findings indicate that the multi-platform news comment model has the potential to capture diversity across dialects and domains. In addition to evaluating the generalization power of the models, we also presented an in-depth analysis of emoji usage in offensive comments. Our findings suggest that emojis in the animal category are exploited in offensive comments, in line with our lexical observations.

To the best of our knowledge, this is one of the first studies to explore the effect of multiple-platform datasets for offensive language detection across platforms, across domains, and across the dialectal varieties present in Arabic. In future work, we plan to extend the study with a larger dataset, further fine-grained classification, and content analysis.

7. Bibliographical References

Agarwal, S. and Sureka, A. (2014). A focused crawler for mining hate and extremism promoting videos on YouTube. In Proceedings of the 25th ACM Conference on Hypertext and Social Media, pages 294–296. ACM.
Al-Ajlan, M. A. and Ykhlef, M. (2018). Optimized Twitter cyberbullying detection based on deep learning. In 2018 21st Saudi Computer Society National Computer Conference (NCC), pages 1–5. IEEE.
Alakrot, A., Murray, L., and Nikolov, N. S. (2018). Dataset construction for the detection of anti-social behaviour in online communication in Arabic. Procedia Computer Science, 142:174–181.
Albadi, N., Kurdi, M., and Mishra, S. (2018). Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76. IEEE.
Badjatiya, P., Gupta, S., Gupta, M., and Varma, V. (2017). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.
Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017a). Hate is not binary: Studying abusive behavior of #gamergate on Twitter. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 65–74. ACM.
Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017b). Mean birds: Detecting aggression and bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, pages 13–22. ACM.
Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017c). Measuring #gamergate: A tale of hate, sexism, and bullying. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 1285–1290. International World Wide Web Conferences Steering Committee.
Conover, M. D., Ratkiewicz, J., Francisco, M., Goncalves, B., Menczer, F., and Flammini, A. (2011). Political polarization on Twitter. In Fifth International AAAI Conference on Weblogs and Social Media.
Dadvar, M., Trieschnigg, D., Ordelman, R., and de Jong, F. (2013). Improving cyberbullying detection with user context. In European Conference on Information Retrieval, pages 693–696. Springer.
Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.
Difallah, D., Filatova, E., and Ipeirotis, P. (2018). Demographics and dynamics of Mechanical Turk workers. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 135–143. ACM.
ElSherief, M., Nilizadeh, S., Nguyen, D., Vigna, G., and Belding, E. (2018). Peer to peer hate: Hate speech instigators and their targets. In Twelfth International AAAI Conference on Web and Social Media.
Falotico, R. and Quatto, P. (2015). Fleiss' kappa statistic without paradoxes. Quality & Quantity, 49(2):463–470.
Gulactı, F. (2010). The effect of perceived social support on subjective well-being. Procedia - Social and Behavioral Sciences, 2(2):3844–3849.
Hosseinmardi, H., Mattson, S. A., Rafiq, R. I., Han, R., Lv, Q., and Mishra, S. (2015). Analyzing labeled cyberbullying incidents on the Instagram social network. In International Conference on Social Informatics, pages 49–66. Springer.
Intapong, P., Charoenpit, S., Achalakul, T., and Ohkura, M. (2017). Assessing symptoms of excessive SNS usage based on user behavior and emotion: Analysis of data obtained by SNS APIs. In 9th International Conference on Social Computing and Social Media, SCSM 2017, held as part of the 19th International Conference on Human-Computer Interaction, HCI International 2017, pages 71–83. Springer Verlag.
Jay, T. and Janschewitz, K. (2008). The pragmatics of swearing. Journal of Politeness Research: Language, Behaviour, Culture, 4(2):267–288.
Kumar, S., Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2018). Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933–943. International World Wide Web Conferences Steering Committee.
Mondal, M., Silva, L. A., and Benevenuto, F. (2017). A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 85–94. ACM.
Mubarak, H. and Darwish, K. (2019). Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269–276. Springer.
Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.
Mulki, H., Haddad, H., Ali, C. B., and Alshabani, H. (2019). L-HSAB: A Levantine Twitter dataset for hate speech and abusive language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118.
Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I. (2017). Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1125–1135.
Pitsilis, G. K., Ramampiaro, H., and Langseth, H. (2018). Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433.
Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines.
Rish, I. et al. (2001). An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, pages 41–46.
Ross, J., Zaldivar, A., Irani, L., and Tomlinson, B. (2009). Who are the Turkers? Worker demographics in Amazon Mechanical Turk. Department of Informatics, University of California, Irvine, USA, Tech. Rep.
Salminen, J., Almerekhi, H., Milenkovic, M., Jung, S.-g., An, J., Kwak, H., and Jansen, B. J. (2018). Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM 2018), San Francisco, California, USA, June.
Salminen, J., Hopf, M., Chowdhury, S. A., Jung, S.-g., Almerekhi, H., and Jansen, B. J. (2020). Developing an online hate classifier for multiple social media platforms. Human-centric Computing and Information Sciences, 10(1).
Silva, L., Mondal, M., Correa, D., Benevenuto, F., and Weber, I. (2016). Analyzing the targets of hate in online social media. In Tenth International AAAI Conference on Web and Social Media.
Unsvag, E. F. and Gamback, B. (2018). The effects of user features on Twitter hate speech detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 75–85.
Waldron, J. (2012). The Harm in Hate Speech. Harvard University Press.
Wiegand, M., Siegel, M., and Ruppenhofer, J. (2018). Overview of the GermEval 2018 shared task on the identification of offensive language.
Zhang, Z., Robinson, D., and Tepper, J. (2018). Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In European Semantic Web Conference, pages 745–760. Springer.


(b) measuring the accuracy of the crowdsourced annotation against expert annotation.

To understand and explore the generalizability of the introduced dataset, we present a series of classification experiments using Support Vector Machines (SVM). These experiments include studying the performance of the trained classifier (a) on a large number of comments (both in- and out-of-domain data), (b) on cross-platform data, and (c) on particular DA (Egyptian and Levantine) data. We also investigate lexical cues and the use of emojis in the offensive instances present in the dataset.

Overall, the key contributions of the paper can be summarised as follows:

• Introduction of a new Dialectal Arabic (DA) offensive language dataset from the comments on news posts extracted from multiple social media platforms, TW, YT, and FB, of an international news agency's social media accounts. To the best of our knowledge, this is one of the first multi-platform datasets for Arabic social media comments.

• Designing and quantifying reliable two-step crowd-annotator selection criteria for a low-representation language (such as DA).

• Analyzing lexical content and emoji usage in offensive language.

• Designing and evaluating classification models for:
  – in-domain and cross-domain social media data;
  – cross-platform performance;
  – a cross-dialect (Egyptian and Levantine) dataset.

• Publicly releasing the dataset and listing all the resources of the study.1

The rest of the paper is structured as follows. We provide a brief overview of the existing work and datasets for Arabic and other languages in Section 2. We then discuss the data and the annotation collection procedures in Section 3. We present a detailed analysis of the lexical items and emojis in Section 4, and Section 5 provides the results of the experiments. Finally, we conclude our work in Section 6.

2. Related Studies

Recent years have witnessed a rising use of offensive language on social media platforms. This has forced many websites and organizations to remove such offensive content using either manual or automatic filtering processes. The current trend of offensive language use and its effect on particular individuals or groups has also attracted many multi-disciplinary studies.

Current conceptualizations categorize offensive language usage on social media mostly as hate speech, including remarks attacking a particular race, religion, or nationality, among others, and as vulgar or obscene and pornographic comments, including explicit and rude sexual references (Jay

1 Data download link: https://github.com/shammur/Arabic-Offensive-Multi-Platform-SocialMedia-Comment-Dataset

and Janschewitz, 2008). Most studies are conducted in English (Davidson et al., 2017; Silva et al., 2016; Mondal et al., 2017; Badjatiya et al., 2017; Chatzakou et al., 2017a; Chatzakou et al., 2017b; Chatzakou et al., 2017c; ElSherief et al., 2018; Unsvag and Gamback, 2018; Agarwal and Sureka, 2014) and German (Wiegand et al., 2018), using supervised deep learning architectures such as variants of Recurrent Neural Networks (RNN) (Pitsilis et al., 2018; Pavlopoulos et al., 2017) and Convolutional Neural Networks (CNN) (Zhang et al., 2018), as well as Naive Bayes (NB) (Rish et al., 2001), SVM (Platt, 1998), and others. Unlike for Indo-European languages, only a handful of studies have been conducted for Arabic. Even though Arabic is spoken by a large portion of the world's population, in practice its dialects can be mutually unintelligible. This makes modelling offensive language for Arabic a notable challenge, mostly due to the lack of resource availability.

The authors in (Alakrot et al., 2018) used a YT dataset of 16k comments from Egypt, Iraq, and Libya, annotated by native speakers using a ternary scheme (offensive, inoffensive, and neutral) with an inter-annotator agreement of 71%. Using SVM, the authors obtained an F-measure of 82%. Unfortunately, this dataset was not available to us while carrying out this study.

In (Mubarak et al., 2017), the authors first presented a TW collection of 1.1k contents, including 100 Egyptian tweets from 10 controversial user accounts and their corresponding comments. The data was annotated by 3 annotators from Egypt with an agreement of 85%. This dataset was also used in (Mubarak and Darwish, 2019) to test a method for generating large-scale training data from a seed list of offensive words.

In addition to the tweet data, the authors in (Mubarak et al., 2017) also publicly released a large dataset of ≈32k deleted comments from Aljazeera.net, an online news portal. The comments include varieties of Arabic dialects as well as MSA, and were annotated by 3 annotators with an agreement of 87%. For both datasets, the annotation scheme included obscene, offensive (but not obscene), and clean.

In (Albadi et al., 2018), the authors explored hate speech detection for different religious groups, such as Muslims, Jews, and others. For the study, the authors introduced a multi-dialectal dataset of 6.6k tweets. In addition, the authors created a lexicon of terms commonly used in religious discussion, with scores representing the polarity and strength of (non-)hate, obtained using techniques like Pointwise Mutual Information (PMI). The authors also found that annotator agreement varies based on which religious group is being targeted. Moreover, the study presented classification results using lexicon-based, SVM, and GRU classifiers with pre-trained word embeddings, showing a performance of 77% F-measure.

A Levantine (Syrian, Lebanese, Palestinian, and Jordanian) DA dataset was presented in a recent study by the authors in (Mulki et al., 2019). In this work, the authors created a manually annotated political Twitter dataset of size 5.8k for classifying hate speech, abusive, and normal content, with the help of native speakers. The authors also presented lexical analysis in addition to classification results using SVM and NB algorithms.

Unlike most of the previous work on Arabic offensive language detection, we introduce a multi-dialectal Arabic comment dataset for offensive comment detection from multiple social media platforms, including TW, YT, and FB. To the best of our knowledge, this is one of the first such studies for Arabic, and also one of the few datasets for offensive content detection research to include data from more than one platform. In addition to our dataset, we utilize three other datasets (Mubarak et al., 2017; Mulki et al., 2019) to study how such multi-platform data can help generalize offensive language detection models. We also study the use of emojis and their discriminating characteristics for offensive language.

Figure 1: Platform-wise distribution of the comments in the dataset.

3. Data

3.1. Data Collection

To study how offensive language is used in online interactions around news posts, and to design a classifier, we collected over 100k comments from different social media platforms, including Facebook, Twitter, and YouTube, for a well-reputed international news agency. We first collected all news content posted from 2011 to 2019 by the news agency on their social media accounts. We collected the content from each platform through its own API (YouTube, Facebook, and Twitter). Then, using each content ID, the comments for that content were collected. As Twitter does not provide an API for retrieving comments directly, we used Twitter's Standard Search API, which only provides comments from the past 7 days. To overcome this limitation, we periodically collected the comments every 6 hours over a certain period to extract the complete comment history of each news post.

3.2. Data Preparation and Selection

For data selection, we retained comments that include 5 or more Arabic tokens (apart from emojis), anonymizing them by replacing any user mentions with a 'USERIDX' tag and URLs with 'URL' tags. In addition, we removed duplicate comments based on textual content. We then selected a random subset of 4000 comments in total from the three major social media platforms (comment distribution shown in Figure 1) for manual annotation.
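A minimal sketch of this selection and anonymization step (the 'USERIDX' and 'URL' tags follow the description above; the mention/URL regexes and the Arabic-token test are illustrative assumptions):

```python
import re

MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")
ARABIC_TOKEN_RE = re.compile(r"^[\u0600-\u06FF]+$")   # token made of Arabic letters

def prepare(comments: list[str], min_arabic_tokens: int = 5) -> list[str]:
    """Anonymize, then keep unique comments with enough Arabic tokens."""
    seen, kept = set(), []
    for c in comments:
        c = MENTION_RE.sub("USERIDX", URL_RE.sub("URL", c))
        n_arabic = sum(bool(ARABIC_TOKEN_RE.match(t)) for t in c.split())
        if n_arabic >= min_arabic_tokens and c not in seen:  # length filter + dedup
            seen.add(c)
            kept.append(c)
    return kept

sample = ["@user هذا خبر مهم جدا للغاية http://t.co/x", "قصير"]
print(prepare(sample))   # the short comment is filtered out
```

Deduplicating after anonymization (rather than before) also catches comments that differ only in the mentioned user or the shared link.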

3.3. Annotation Guideline

The annotation task presented in this paper is straightforward: "Given a comment, categorize whether the comment is offensive or not." Even though the task seems benign, deciding whether a comment is offensive is not simple. The annotators were instructed to rely on their instincts and to set aside personal biases such as political or religious beliefs or their cultural background. To ease the decision-making process, we provided elaborate and detailed instructions to the annotators, with examples. For the task, we asked the annotators to consider instances as offensive if a comment contains (a) abusive words; (b) explicit or implicit hostile intention that threatens or projects violence; or (c) contempt, humiliation, or underestimation of groups or individuals, using animal analogies, name calling, attacks on their political ideologies or disabilities, cursing, insults to religious beliefs, incitement of racial and ethnic hatred, or other forms of insult/profanity. Details of the annotation guideline and the examples presented to the annotators in each case are made publicly available.2

In addition to the detailed examples, we apologized for the presence of any offensive words that might hurt the annotators' personal beliefs and thanked them for taking into account the accuracy of the task and for their participation.

3.4. Annotations via Crowdsourcing

Given the specified task, we used Amazon Mechanical Turk (AMT),3 a well-known crowdsourcing platform, to obtain manual annotations of these 4000 news comments (mentioned in Section 3.2). For each comment, we collected 3 judgments at a cost of 1 cent per judgment. Our pilot study showed that 1 cent per comment is a reasonable amount to pay for such a small (binary) task, and that higher rewards brought no further improvement with respect to time or quality. Details of the parameters used for the annotation task are shown in Table 1.

AMT is an efficient and cost-effective way to acquire manual annotations; however, it poses several limitations for a task like ours. The main difficulties are judging annotators' language proficiency and their understanding of the task. To ensure annotation quality and language proficiency, we applied two different evaluation criteria to the annotators.

The first criterion for annotators to qualify for the task is answering a series of multiple-choice questions (Example 1), designed by experts, that reflect the annotators' language proficiency and understanding of the questions.

Example 1: ماذا نسمي والد الأب؟ (What do we call the father of the father?)

Options:

• العم (The uncle)          • الوالد (The father)
• الخال (The maternal uncle)          • الجد (The grandfather)

2 https://github.com/shammur/Arabic-Offensive-Multi-Platform-SocialMedia-Comment-Dataset/annotation_guideline/annotation_guideline.pdf
3 http://mturk.com


No. of comments                                 4000
Cost per comment                                1 cent
Comments/HIT                                    25
Gold/HIT                                        5
Gold review policy                              dynamic
Assignments/HIT                                 3
Task duration                                   7 days
Assignment duration                             1 hour
Avg. annotators/comment                         3.23
Max. annotators/comment                         5
No. of comments with more than 3 annotations    787

Table 1: Crowdsourcing annotation task details. In some cases, one or two annotators could not pass the quality evaluation; their annotations were rejected and the task was extended to a new annotator.

The annotator must provide correct answers to 80% of the questions (i.e., 8 out of 10) to pass the qualification test. Once annotators pass the test, they can attempt the main classification task. A total of 26 annotators qualified; among them, only ≈54% scored full marks, and the rest scored the minimum 80% needed to pass.

To evaluate an annotator's assignments for task accuracy, we used gold-standard instances hidden in the designed Human Intelligence Tasks (HITs). For each HIT, we assigned 25 comments for annotation, and out of them, 5 were randomly assigned as test questions. A threshold of 80% is again set as the selection criterion, implying that 4 out of 5 gold instances must be correct for an annotator's assignment to be accepted. These test questions (comments) are selected randomly from a pool of 60 comments annotated by the domain experts. The selected gold comments included 55 (out of 60) offensive comments and have an average agreement of ≈95.63%. A closer look at the agreement per class indicates that the annotators were more confident for non-offensive comments, with an average agreement of 97.91% (minimum 84.76%), compared to offensive comments (average 95.63%, minimum 61.7%).

Figure 2: Platform-wise annotation label distribution. "All" represents the total of instances for offensive and not-offensive comments in the annotated dataset.

Figure 3: Platform-wise annotation label distribution for the types of offensive comments present in the annotated dataset.

3.5. Annotation Results
Annotation Agreement and Evaluation: As each comment has 3 annotations (judgments), we assigned the label agreed by the majority (i.e., 2/3) of the annotators. We observed that for offensive comments, only 54.2% of the comments have unanimous agreement, whereas for non-offensive comments, 83.7% of the annotators are in complete agreement. This shows the complexity of identifying offensive content, especially in an Arabic language context, due to the ambiguity of lexical variations in Arabic dialects compared to other languages such as English and German.
For the crowd-annotated dataset, we obtain a Fleiss's kappa (Falotico and Quatto, 2015) of κ = 0.72. Although κ theoretically applies to a fixed set of annotators, in our case we randomly assigned the 3 judgments to the hypothetical annotators A1–A3.
To assess the reliability of the crowd annotations, we randomly selected 500 comments to be annotated by a domain expert. Using the expert annotation as the reference, we observed that the accuracy of our crowdsourced dataset is 94%. From further error analysis, we observed that the comments mislabeled by the crowd annotators include offensive content written using sarcasm and irony, along with some long comments mixed with statements rather than opinions.
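The majority-vote labeling and the agreement statistic described above can be sketched as follows (a minimal illustration, not the authors' code; the per-comment judgment counts are invented, with each row holding the [NOT, OFF] counts of the 3 judgments):

```python
import numpy as np

def majority_label(counts, labels=("NOT", "OFF")):
    """Label agreed by the majority (2/3) of the three judgments."""
    return labels[int(np.argmax(counts))]

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of judgment counts;
    every row sums to the same number of raters (here 3)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                          # raters per item
    p_cat = counts.sum(axis=0) / counts.sum()          # overall category shares
    p_item = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    p_bar, p_e = p_item.mean(), np.square(p_cat).sum()
    return (p_bar - p_e) / (1 - p_e)

judgments = [[3, 0], [0, 3], [1, 2], [3, 0]]           # hypothetical comments
labels = [majority_label(c) for c in judgments]
```

With perfect agreement on every item, `fleiss_kappa` returns 1.0; partial disagreement (as in the third row) pulls it down.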

Annotation Distribution: Based on the majority voting, we present the distribution of the final annotated dataset in Figure 2. From the dataset, we observed that the percentage of offensive comments varies by platform, as shown in Figure 2, with TW comments having the most offensive instances, followed by FB and then YT.

Types of Offensive Comments: To further understand the annotated data and to discriminate between the different types of offensive comments, we manually annotated the obtained 675 (16.88%, see 'All/Offensive' in Figure 2) offensive comments in our dataset for the hate speech and vulgar (but not hate) categories. The obtained distribution of each class (vulgar and hate speech) is presented in Figure 3. From the class distribution, we can observe that in all three platforms the "hate" category is more prominent than "vulgar" alone. Our results indicate that FB, followed by TW, has significantly more hate comments than the YT channel comments.


Figure 4: Word cloud of highly non-offensive (NOT) tokens, with valence score ϑ(·) = −1, in the comments. Some of the most frequent words include 'thank you' and 'important', along with other conversation fillers and reactions like 'hahah'.

Figure 5: Word cloud of highly offensive tokens, with valence score ϑ(·) = 1, in the comments. Corresponding English examples can be found in Table 3.

4. Data Analysis

4.1. Lexical Analysis
In order to understand how distinctive the comments are with respect to the offensive and non-offensive classes, we analyzed the lexical content of the dataset. We compared the vocabularies of the two classes using the valence score (Conover et al., 2011; Mubarak and Darwish, 2019), ϑ, computed for every token x following Equation 1:

    ϑ(x) = 2 · (C(x|OFF)/T_OFF) / (C(x|OFF)/T_OFF + C(x|NOT)/T_NOT) − 1    (1)

where C(·) is the frequency of token x in a given class (OFF or NOT), and T_OFF and T_NOT are the total numbers of tokens present in each class. ϑ(x) ∈ [−1, +1], with +1 indicating that the token is used significantly more in offensive content than in non-offensive content, and vice versa.
We identified the most distinctive unigrams in each class using the valence score and present them as word clouds in Figure 4 and Figure 5. We also present the top 10 highly offensive tokens in Table 3 with their frequencies. Furthermore, using the same score, we present the top highly distinctive bi- and tri-grams in Table 2.
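Equation 1 translates directly into code; a minimal sketch with hypothetical token lists:

```python
from collections import Counter

def valence_scores(off_tokens, not_tokens):
    """Valence score per Equation 1, in [-1, +1]: +1 means the token's usage
    rate is entirely concentrated in offensive comments, -1 in non-offensive."""
    c_off, c_not = Counter(off_tokens), Counter(not_tokens)
    t_off, t_not = len(off_tokens), len(not_tokens)
    scores = {}
    for x in set(c_off) | set(c_not):
        p_off, p_not = c_off[x] / t_off, c_not[x] / t_not
        scores[x] = 2 * p_off / (p_off + p_not) - 1
    return scores

scores = valence_scores(["dogs", "bark", "the"], ["thanks", "the", "the"])
```

Here "dogs" gets ϑ = 1 (offensive-only), "thanks" gets ϑ = −1, and the shared token "the" lands in between.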

From the lexical analysis, we observed that words like 'dogs', 'bark', 'dirty', and 'envy' are more common in offensive instances than in non-offensive instances. Similarly, when looking into the bi- and tri-grams, we observed that words like 'rage' and 'curse' occur very frequently in offensive comments.

4.2. Use of Emojis in Offensive Language
Analyzing the comments reveals traits that are specific to each of the two types, offensive and non-offensive. Further analysis shows that emoji usage also differs between the two.

Figure 6: Reported analysis of the use of emojis in offensive comments. Panel (a) lists the 10 most highly offensive emojis, with valence score ϑ = 1.0. Panel (b) shows the top 3 groups of emojis with ϑ = 1.0; for each group, ω is reported as a percentage.

To understand the distinctive usage of emojis, we also calculated the valence score ϑ(·) for all the emojis. Figure 6 shows the top 10 emojis present in the offensive comments of the dataset with valence score ϑ(·) = 1. Furthermore, we grouped all the emojis based on their categories – including "people, face and smileys" and "animals and nature", among others – as defined by the Unicode Consortium⁴. We then calculated the weight of each group, ω_g, using Equation 2:

    ω_g = Σ_{e_i ∈ g} C(e_i | ϑ = 1.0) / Σ_{e ∈ L} C(e | ϑ = 1.0)    (2)

where e_i represents the i-th emoji present in group g, given the valence score ϑ = 1.0; L is the list of emojis present in the offensive dataset; and C(·) is the usage frequency of the given emoji. The ω values for the top 3 frequent groups are shown in Figure 6. Our analysis of offensive emojis also included studying the most frequent bi-grams with high valence score ϑ = 1.0; we observed that the top 2 most frequent bi-grams are also from the animals group, each with a frequency of 3.

4 https://unicode.org/emoji/charts-12.1/emoji-ordering.html
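Equation 2 can likewise be sketched as follows (the emoji-to-category mapping is an invented stand-in for the Unicode Consortium grouping):

```python
from collections import Counter

def group_weights(offensive_emoji_uses, emoji_group):
    """Weight per Equation 2: the share of all uses of emojis with valence 1.0
    that falls into each Unicode category (the mapping is assumed given)."""
    counts = Counter(offensive_emoji_uses)
    total = sum(counts.values())
    weights = Counter()
    for emoji, c in counts.items():
        weights[emoji_group[emoji]] += c / total
    return dict(weights)

weights = group_weights(["🐶", "🐶", "🐷", "😠"],
                        {"🐶": "animals & nature", "🐷": "animals & nature",
                         "😠": "smileys & people"})
```

With these toy counts, "animals & nature" carries 75% of the offensive-emoji usage and "smileys & people" the remaining 25%.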


          Highly OFF                              Highly NOT
N-gram (English gloss)       Freq     N-gram (English gloss)       Freq
(piglet channel)               5      (Messenger of God)             27
(die with your rage)           4      (many many)                    17
(was downed)                   4      (will power)                   17
(downing all)                  4      (Arab countries)               15
(with all plots)               4      (God willing)                  14
(God's curse on you all)       4      (power except with God)        17
(Donkey's milk)                4      (Gulf countries)               10
(and who)                      2      (I was)                         9
(who is the one)               2      (except God)                    9

Table 2: List of the most frequent n-grams (n = 2, 3) for both offensive (OFF) and non-offensive (NOT) comments, selected using the valence score. (·) gives the English translation of the Arabic n-gram; Freq represents the frequency of the n-gram in the corresponding class.

Token (English gloss)       Frequency
(Piglet)                        21
(Dog)                            8
(Dirty)                          8
(Mercenaries)                    7
(Bark)                           6
(With your envy/rage)            5
(Filthy)                         5
(O Dogs)                         5
(Nose)                           4
(Plots)                          4

Table 3: List of the top 10 most highly offensive unigrams with ϑ(word) = 1.0. (·) gives the English translation of the Arabic token.

Consequently, our findings suggest that emojis representing animals appear uniquely in offensive comments, indicating their use by analogy for offensive purposes. This observation is in line with our findings on the usage of distinctive lexical tokens (such as "dog") in offensive comments.

5. Experiments and Results
This section describes the details of the feature extraction and the machine learning algorithm used to test the performance of the dataset. The motivation for this experimental exploration is to benchmark the released dataset and to provide baseline results that facilitate future studies. In addition, we utilize the presence of comments from different platforms to test the generalizability of a platform-dependent model (trained on data from a specific platform) across (a) other social media/news platforms and (b) out-of-domain short-content and comment datasets.

5.1. Datasets
In addition to our dataset, referred to as the Multi-Platform Offensive Language Dataset (MPOLD), we also utilized three previously published datasets, detailed in Table 4.
For evaluating performance on out-of-domain⁵ datasets, we used Egyptian and Levantine Arabic datasets. The Egyptian Arabic dialect data includes 100 TW posts and their 10 corresponding comments⁶ per tweet. The tweets are extracted from the 10 most controversial Egyptian TW users (Mubarak et al., 2017). This dataset is referred to in this paper as the Egyptian Tweets and corresponding Comment Dataset (ETCD). The Levantine dataset (referred to as L-HSAB) (Mulki et al., 2019) was collected from the timelines of politicians, social/political activists, and TV anchors, and does not contain any comments.
Similarly, we also used the Deleted Comments⁷ Dataset (referred to as DCD) from Aljazeera.net (Mubarak et al., 2017) to study how an offensive detection model performs on a different platform.

Preprocessing: To prepare the data for classification, we first tokenized the input sentences, removing punctuation, URLs, stop words, and diacritics present in the text. We inserted whitespace in cases where there was no separation between adjacent words and emojis. For our study, we kept the emojis and hashtags due to the contextual information they carry.
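The preprocessing steps above can be sketched as follows (a minimal version; the Unicode ranges and the exact order of steps are our assumptions, and stop-word removal is omitted for brevity):

```python
import re

ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")           # tanween/harakat
EMOJI = re.compile(r"([\U0001F300-\U0001FAFF\u2600-\u27BF])")

def preprocess(comment):
    """Tokenize a comment: drop URLs, diacritics, and punctuation, while
    keeping emojis and hashtags; pad emojis with whitespace so they split."""
    text = re.sub(r"https?://\S+", " ", comment)             # remove URLs
    text = ARABIC_DIACRITICS.sub("", text)                   # strip diacritics
    text = EMOJI.sub(r" \1 ", text)                          # separate emojis
    # keep word chars, '#', Arabic letters, and emojis; drop everything else
    text = re.sub(r"[^\w#\u0600-\u06FF\U0001F300-\U0001FAFF\u2600-\u27BF]+",
                  " ", text)
    return text.split()
```

For example, a comment mixing Arabic text, an emoji, a URL, and a hashtag is reduced to clean tokens with the emoji and hashtag preserved.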

5.2. Classification Design
As a baseline for the MPOLD dataset, we conducted experiments using a classical machine learning algorithm: SVM. The SVM algorithm (Platt, 1998) is based on the Structural Risk Minimization principle from computational learning theory. The algorithm is a proven universal learner and is well known for its ability to learn independently of the dimensionality of the feature space and of the class distribution of the training dataset. These properties make it one of the most popular supervised classification

5 Here representing that the contents are not news posts.
6 http://alt.qcri.org/~hmubarak/offensive/TweetClassification-Summary.xlsx
7 http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx


Abbr     Dataset Description                     Labels               #I (%OFF)        avg len   Used for
MPOLD    News comments from TW†, FB†, YT†        OFF, NOT             4000 (16.9%)     22.8      train/test
                                                 (V, HS, Oth: 180/185/417)
ETCD*    100 Egyptian tweets and their
         corresponding comments                  OBs, OFF-OBs, NOT    1100 (≈59.0%)    13.5      test
L-HSAB*  Levantine tweets                        Abs, HS, NOT         5846 (≈37.6%)    12.0      test
DCD      Deleted comments from a news website    OBs, OFF-OBs, NOT    31692 (≈8.2%)    17.4      train/test

Table 4: Details of the datasets used for the study. * marks the out-of-domain datasets and † marks the different platforms. #I represents the number of instances, whereas %OFF represents the percentage of offensive labels used in this study. In the case of ETCD and DCD, we merged the offensive comments – obscene (OBs) and other offensive comments (OFF-OBs) – into OFF to match our labels; similarly, we merged HS and Abs (abusive) in the L-HSAB dataset into OFF. avg len represents the average length of the corresponding content in number of words. V = Vulgar, HS = Hate speech, Oth = other offensive categories.

methods, and thus our choice for the baseline and all the experiments conducted in this study.
To train the classifier, we transform the online comments into bag-of-character-n-grams vectors weighted with logarithmic term frequencies (tf) multiplied by inverse document frequencies (idf). For this, we created character n-grams using characters inside word boundaries only and padded the edge characters with whitespace. To utilize contextual information, i.e., n-grams, we extracted features using n = 1 to n = 5.
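A sketch of the described feature extraction and classifier using scikit-learn, under the assumption that its `char_wb` analyzer and `sublinear_tf` option match the word-boundary padding and logarithmic tf weighting described above (the toy comments and labels are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# char_wb builds character n-grams only inside word boundaries and pads the
# edges with whitespace; sublinear_tf applies the logarithmic term frequency.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 5), sublinear_tf=True),
    LinearSVC(),
)

comments = ["يا كلب قذر", "شكرا جزيلا موضوع مهم", "تنبح يا كلاب", "خبر مهم وجميل"]
labels = ["OFF", "NOT", "OFF", "NOT"]
clf.fit(comments, labels)
```

On this tiny separable set the pipeline fits the training labels exactly; the real experiments of course use the full 4000-comment dataset.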

5.3. Performance Evaluation
To measure the performance of the classification, we report the macro-averaged F1-measure (Fm), calculated using macro-averaged precision and recall, i.e., the average of P and R over both classes, respectively.
The motivation behind this choice is the imbalanced class distribution, which makes well-known measures such as accuracy and the micro-averaged F1-measure not well-representative of the performance. Since the performance on both classes is of interest, we also report the F-measure of the individual classes.
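For illustration, scikit-learn's macro-averaged F1 is the unweighted mean of the per-class F-measures, so the minority OFF class counts as much as the majority NOT class (the labels below are invented):

```python
from sklearn.metrics import f1_score

y_true = ["NOT", "NOT", "NOT", "OFF", "NOT", "OFF"]
y_pred = ["NOT", "NOT", "OFF", "OFF", "NOT", "NOT"]

# Per-class F1, then their unweighted mean: with this imbalance, accuracy
# (4/6) would hide that OFF is detected much worse than NOT.
per_class = f1_score(y_true, y_pred, average=None, labels=["NOT", "OFF"])
fm = f1_score(y_true, y_pred, average="macro")   # (0.75 + 0.5) / 2 = 0.625
```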

         Fm     NOT    OFF
Chance   0.49   0.83   0.14
Lex      0.53   0.84   0.21
SVM      0.74   0.92   0.56

Table 5: Reported performance on the MPOLD dataset using 5-fold cross-validation (SVM against the chance and lexicon baselines).

5.4. Results
MPOLD performance: To evaluate the classification performance, we performed five-fold⁸ cross-validation on the dataset, maintaining the natural distribution of the classes in each fold. The performance on all the instances is shown in Table 5.

8 We chose 5 instead of 10 as the number of folds to ensure a good sample size and distribution of both labels in each fold for test purposes.
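The fold construction ("maintaining the natural distribution of the classes in each fold") corresponds to stratified sampling, illustrated here with the dataset's actual class counts (4000 comments, 675 of them offensive):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 4000 comments at the dataset's ~16.9% offensive rate (675 OFF, 3325 NOT).
y = np.array(["OFF"] * 675 + ["NOT"] * 3325)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Every test fold keeps the natural OFF ratio (135 OFF / 800 = 0.16875).
off_rate = [float(np.mean(y[test_idx] == "OFF"))
            for _, test_idx in cv.split(np.zeros((len(y), 1)), y)]
```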

Testset   Trained on        Fm     NOT    OFF
FB        DCD               0.34   0.43   0.25
          TW                0.62   0.90   0.34
          YT                0.62   0.92   0.31
          TW + YT           0.68   0.93   0.44
          TW + YT + DCD     0.78   0.94   0.63
TW        DCD               0.52   0.51   0.53
          FB                0.51   0.83   0.19
          YT                0.54   0.82   0.25
          FB + YT           0.59   0.84   0.35
          FB + YT + DCD     0.84   0.90   0.78
YT        DCD               0.44   0.67   0.22
          FB                0.53   0.96   0.10
          TW                0.60   0.94   0.27
          TW + FB           0.63   0.95   0.31
          TW + FB + DCD     0.82   0.97   0.67
DCD       TW                0.37   0.39   0.34
          FB                0.21   0.31   0.10
          YT                0.29   0.26   0.32
          FB + YT + TW      0.40   0.35   0.45

Table 6: F-measure performance of the platform-wise models. The test sets presented here are subsets of the MPOLD data; for instance, TW ∈ MPOLD is the data extracted from Twitter. Similarly, YT represents the YouTube data and FB the comments from Facebook. DCD = Deleted Comments Dataset; NOT = not-offensive; OFF = offensive instances. The blue rows present the results when the platform-wise data present in our dataset are modelled along with the online-news-platform data (DCD).

To compare the obtained results, we report a chance-level baseline based on the prior distribution of the labels in each fold. In addition, we also present a simple lexicon-based baseline (Lex) using a list of possible offensive words and phrases. This word list is the combination of the obscene words presented in (Mubarak et al., 2017) and filtered entries from the Hatebase (Hatebase.org) lexicon. From Table 5, we observe that the SVM performs significantly better than the lexicon and chance-level baselines.

Platform-specific performance: To assess the generalization capabilities of the designed model, we evaluated


Testset   Model    Fm     NOT    OFF
ETCD      Lex      0.51   0.58   0.44
          DCD      0.64   0.76   0.53
          MPOLD    0.61   0.66   0.56
          MPOLD+   0.68   0.61   0.75
L-HSAB    Lex      0.46   0.74   0.18
          DCD      0.59   0.60   0.58
          MPOLD    0.64   0.77   0.51
          MPOLD+   0.62   0.64   0.59

Table 7: Reported F-measures of the models trained using DCD, MPOLD, and MPOLD + DCD (MPOLD+) and tested on the out-of-domain test sets: (a) ETCD, the controversial Egyptian tweets and comments dataset, and (b) L-HSAB, the Levantine tweets dataset. Lex is the lexicon-based baseline for the test sets.

the models using leave-one-platform-dataset-out (LOPO) cross-validation. The performance of the models on each social media platform's data is presented in Table 6. Besides the three platforms in our dataset, we also included DCD – the deleted comments from an online news portal – as a fourth platform in this experiment.
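The LOPO protocol can be sketched as a simple split generator (the platform names and toy data below are placeholders for the real TW/FB/YT/DCD splits):

```python
def lopo_splits(platform_data):
    """Leave-one-platform-dataset-out: for each platform, train on all the
    remaining platforms' comments and test on the held-out platform."""
    for held_out in platform_data:
        X_train, y_train = [], []
        for name, (X, y) in platform_data.items():
            if name != held_out:
                X_train += list(X)
                y_train += list(y)
        yield held_out, (X_train, y_train), platform_data[held_out]

# Toy per-platform (comments, labels) pairs.
data = {"TW": (["t1", "t2"], ["OFF", "NOT"]),
        "FB": (["f1"], ["NOT"]),
        "YT": (["y1"], ["OFF"])}
splits = {held: (train, test) for held, train, test in lopo_splits(data)}
```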

We observed that, in most cases, models trained on one particular platform perform poorly compared to models trained using multiple-platform datasets. These multiple-platform models include combinations of (a) 2 platforms from the MPOLD data or (b) 2 platforms from MPOLD plus the DCD dataset.

For instance, the model trained on TW + YT performs significantly better on the FB test set than the individual YT and TW models. The same pattern is also observed when DCD is added to the (YT + TW) combination; in this case, a significant jump for the OFF class is noticed, increasing the overall macro-averaged F-measure (Fm). This indicates the importance of multiple-platform datasets for the generalization of the offensive detection model.

In addition to being data from a different platform in a similar domain, the DCD dataset brings two further advantages for training the model: (a) the addition of comments written in MSA alongside DA, and (b) an increase in the number of offensive examples for model training. The MPOLD comments are mostly dialectal Arabic; adding DCD therefore provides examples of comments in MSA for the model to recognize and learn from, and at the same time increases the number of (not-)offensive training examples. Thus, the results from the combined-platform models (e.g., TW + YT + DCD) indicate the impact of both variety in the examples and the size of the training data.

However, by comparing the results of the individual models – trained on only DCD, TW, YT, or FB – we observed that even though DCD has a large number of training instances, this model cannot outperform TW/YT/FB alone in most cases. This again highlights the inadequacy of model generalization when trained on one particular platform and tested on another.

Cross-domain performance: The performance of the models trained on MPOLD and MPOLD + DCD (MPOLD+) across domains is presented in Table 7. The results indicate that for the ETCD dataset, both our models (MPOLD and MPOLD+) outperform the baseline and the DCD model in OFF-class F-measure. The overall best performance is obtained using MPOLD+, with improvements of 7% and 4% over the MPOLD and DCD models, respectively.
As for the L-HSAB data, the MPOLD model outperforms all other models in terms of overall F-measure; however, for the offensive class (OFF), the best performance is obtained by MPOLD+.
The performance of the MPOLD and MPOLD+ models on both the ETCD and L-HSAB datasets indicates the positive impact of a multiple-platform training dataset and implies cross-domain generalizability.

6. Conclusion
In this paper, we introduced one of the first dialectal Arabic comment datasets extracted from multiple social media platforms. Utilizing this dataset, we designed classification experiments to study the impact of multi-platform data on platform-wise model generalization; for this, we evaluated the models using platform-wise cross-validation. In addition to our dataset, we also exploited existing publicly available datasets, including a deleted-comment dataset from a news website. We observed that combined-platform models perform significantly better than most individual models, indicating the importance of such multi-platform datasets.
Using our designed models, we also evaluated other out-of-domain data (ETCD and L-HSAB) extracted from TW. Besides being out-of-domain, another advantage of evaluating these datasets is to see how our models perform on DA content in particular, i.e., ETCD (Egyptian dialect) and L-HSAB (Levantine dialect). Our findings indicate that the multi-platform news comment dataset model has the potential to capture diversity across different dialects and domains. In addition to evaluating the generalization power of the models, we also presented an in-depth analysis of emoji usage in offensive comments. Our findings suggest that emojis in the animal category are exploited in offensive comments, in line with our lexical observations.
To the best of our knowledge, this is one of the first studies to explore the effect of multiple-platform datasets on offensive language detection across platforms, across domains, and for the dialectal varieties present in Arabic. In the future, we plan to extend the study by introducing a larger dataset with further fine-grained classification and content analysis.

7. Bibliographical References

Agarwal, S. and Sureka, A. (2014). A focused crawler for mining hate and extremism promoting videos on YouTube. In Proceedings of the 25th ACM Conference on Hypertext and Social Media, pages 294–296. ACM.

Al-Ajlan, M. A. and Ykhlef, M. (2018). Optimized Twitter cyberbullying detection based on deep learning. In 2018 21st Saudi Computer Society National Computer Conference (NCC), pages 1–5. IEEE.

Alakrot, A., Murray, L., and Nikolov, N. S. (2018). Dataset construction for the detection of anti-social behaviour in online communication in Arabic. Procedia Computer Science, 142:174–181.

Albadi, N., Kurdi, M., and Mishra, S. (2018). Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76. IEEE.

Badjatiya, P., Gupta, S., Gupta, M., and Varma, V. (2017). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017a). Hate is not binary: Studying abusive behavior of #gamergate on Twitter. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 65–74. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017b). Mean birds: Detecting aggression and bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, pages 13–22. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017c). Measuring #gamergate: A tale of hate, sexism, and bullying. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 1285–1290. International World Wide Web Conferences Steering Committee.

Conover, M. D., Ratkiewicz, J., Francisco, M., Goncalves, B., Menczer, F., and Flammini, A. (2011). Political polarization on Twitter. In Fifth International AAAI Conference on Weblogs and Social Media.

Dadvar, M., Trieschnigg, D., Ordelman, R., and de Jong, F. (2013). Improving cyberbullying detection with user context. In European Conference on Information Retrieval, pages 693–696. Springer.

Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.

Difallah, D., Filatova, E., and Ipeirotis, P. (2018). Demographics and dynamics of Mechanical Turk workers. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 135–143. ACM.

ElSherief, M., Nilizadeh, S., Nguyen, D., Vigna, G., and Belding, E. (2018). Peer to peer hate: Hate speech instigators and their targets. In Twelfth International AAAI Conference on Web and Social Media.

Falotico, R. and Quatto, P. (2015). Fleiss' kappa statistic without paradoxes. Quality & Quantity, 49(2):463–470.

Gulactı, F. (2010). The effect of perceived social support on subjective well-being. Procedia - Social and Behavioral Sciences, 2(2):3844–3849.

Hosseinmardi, H., Mattson, S. A., Rafiq, R. I., Han, R., Lv, Q., and Mishra, S. (2015). Analyzing labeled cyberbullying incidents on the Instagram social network. In International Conference on Social Informatics, pages 49–66. Springer.

Intapong, P., Charoenpit, S., Achalakul, T., and Ohkura, M. (2017). Assessing symptoms of excessive SNS usage based on user behavior and emotion: Analysis of data obtained by SNS APIs. In 9th International Conference on Social Computing and Social Media, SCSM 2017, held as part of the 19th International Conference on Human-Computer Interaction, HCI International 2017, pages 71–83. Springer Verlag.

Jay, T. and Janschewitz, K. (2008). The pragmatics of swearing. Journal of Politeness Research: Language, Behaviour, Culture, 4(2):267–288.

Kumar, S., Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2018). Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933–943. International World Wide Web Conferences Steering Committee.

Mondal, M., Silva, L. A., and Benevenuto, F. (2017). A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 85–94. ACM.

Mubarak, H. and Darwish, K. (2019). Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269–276. Springer.

Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Mulki, H., Haddad, H., Ali, C. B., and Alshabani, H. (2019). L-HSAB: A Levantine Twitter dataset for hate speech and abusive language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118.

Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I. (2017). Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1125–1135.

Pitsilis, G. K., Ramampiaro, H., and Langseth, H. (2018). Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433.

Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines.

Rish, I. et al. (2001). An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, pages 41–46.

Ross, J., Zaldivar, A., Irani, L., and Tomlinson, B. (2009). Who are the Turkers? Worker demographics in Amazon Mechanical Turk. Department of Informatics, University of California, Irvine, USA, Tech. Rep.

Salminen, J., Almerekhi, H., Milenkovic, M., Jung, S.-g., An, J., Kwak, H., and Jansen, B. J. (2018). Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM 2018), San Francisco, California, USA, June.

Salminen, J., Hopf, M., Chowdhury, S. A., Jung, S.-g., Almerekhi, H., and Jansen, B. J. (2020). Developing an online hate classifier for multiple social media platforms. Human-centric Computing and Information Sciences, 10(1).

Silva, L., Mondal, M., Correa, D., Benevenuto, F., and Weber, I. (2016). Analyzing the targets of hate in online social media. In Tenth International AAAI Conference on Web and Social Media.

Unsvag, E. F. and Gamback, B. (2018). The effects of user features on Twitter hate speech detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 75–85.

Waldron, J. (2012). The Harm in Hate Speech. Harvard University Press.

Wiegand, M., Siegel, M., and Ruppenhofer, J. (2018). Overview of the GermEval 2018 shared task on the identification of offensive language.

Zhang, Z., Robinson, D., and Tepper, J. (2018). Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In European Semantic Web Conference, pages 745–760. Springer.

  • Introduction
  • Related Studies
  • Data
    • Data Collection
    • Data Preparation and Selection
    • Annotation Guideline
    • Annotations via Crowdsourcing
    • Annotation Results
      • Data Analysis
        • Lexical Analysis
        • Use of Emojis in Offensive Language
          • Experiments and Results
            • Datasets
            • Classification Design
            • Performance Evaluation
            • Results
              • Conclusion
              • Bibliographical References
Page 3: A Multi-Platform Arabic News Comment Dataset for Offensive … · 2020. 5. 22. · Therefore, in this paper, we introduce and make publicly available a new dialectal Arabic news comment

6205

Figure 1 Platform wise distribution of the comments in thedataset

sented lexical analysis in addition to classification resultsusing SVM and NB algorithmsUnlike most of the previous work for Arabic offensivelanguage detection we introduce a multi-dialectal Arabiccomment dataset for offensive comment detection frommultiple social media platforms including TW YT andFB To the best of our knowledge this is one of the firststudies for Arabic and also one of the few datasets for offen-sive content detection research to include data from morethan one platform In addition to our dataset we utilizethree other datasets (Mubarak et al 2017 Mulki et al2019) to study how such multi-platform data could helpgeneralize offensive detection models We also study theuse of emojis and their discriminating characteristics foroffensive language

3 Data

31 Data CollectionTo study how offensive language is used in online inter-action for news posts and to design a classifier we col-lected over 100k comments from different social mediaplatforms - including Facebook Twitter and YouTube fora well-reputed international news agency We first col-lected all news content posted from 2011 to 2019 by thenews agency in their social media accounts We collectedthe contents from each platform through their own API(YouTube Facebook and Twitter) Then using each con-tent ID the comments for the content are collected AsTwitter does not provide the API for retrieving commentsdirectly we used the Standard Search API of Twitter whichonly provides comments for past 7 days only To overcomethis challenge we periodically collected the comments inevery 6 hours for a certain period to extract the completecomment history of the news post

32 Data Preparation and SelectionFor the data selection we retain comments that in-cludes 5 or more Arabic tokens apart from emojis whileanonymizing them by replacing any user mentions with alsquoUSERIDXrsquo tag and urls with lsquoURLrsquo tags In addition weremoved duplicate comments based on textual content Wethen selected a random subset of 4000 comments in totalfrom the three major social media platforms (comment dis-tribution shown in Figure 1) for manual annotation

33 Annotation GuidelineThe annotation task presented in this paper is straightfor-ward ndash ldquoGiven a comment categorize if the comment isoffensive or notrdquo Even though the task seems benign de-ciding whether the comment is offensive or not is not sim-ple The annotators were instructed to rely on their instinctand suggested to ignore their personal biases such as polit-ical religious belief or their cultural background To makethe decision-making process easier we provided elaboratedand detailed instructions to the annotators with examplesFor the task we asked the annotators to consider instancesas offensive if comment contain (a) abusive words (b) ex-plicit or implicit hostile intention that threats or project vi-olence (c) contempt humiliate or to underestimate groupsor individuals using ndash animal analogy name calling at-tacking their political ideologies or their disabilities curs-ing insulting religious beliefs incitement racial and ethnichatred or other forms of insultingprofanity Details of theannotation guideline and examples presented to the annota-tor in each cases are made publicly available 2In addition to the detailed examples we apologize for thepresence of any offensive words that might hurt the anno-tatorsrsquo personal beliefs and thanked them for taking intoaccount the accuracy of the task and for their participation

34 Annotations via CrowdsourcingGiven the specified task we used Amazon Mechanical Turk(AMT)3 a well-known crowdsourcing platform to obtainmanual annotations of these 4000 news comments (men-tioned in Section 32) For each comment we collected 3judgments with a cost of 1 cent per judgment Our pilotstudy for the task shows that 1 cent per comment is a rea-sonable amount to pay for such a small (binary) task andwith higher rewards no further performance improvementwith respect to time or quality is found Details of parame-ters used for the annotation task is shown in Table 1AMT is an efficient and cost-effective way to acquire man-ual annotations however it can possess several limitationsfor a task Some of the main limitations faced for the an-notation is to judge annotators language proficiency and thetask understanding capabilities To ensure the quality of theannotation and language proficiency we utilized two differ-ent evaluation criteria of the annotatorThe first criterion for the annotators to qualify for attempt-ing the task includes answering a series of multiple-choicequestions (Example 1) - designed by experts - that reflectsthe annotatorsrsquo language proficiency and understanding ofthe questions

Example 1: ماذا نسمي أب الأب؟ (What do we call the father of the father?) The options are:

• العم (The uncle)    • الوالد (The father)
• الخال (The maternal uncle)    • الجد (The grandfather)

²https://github.com/shammur/Arabic-Offensive-Multi-Platform-SocialMedia-Comment-Dataset (annotation guideline PDF)

³http://mturk.com


No. of Comments: 4,000
Cost per Comment: 1 cent
Comments/HIT: 25
Gold/HIT: 5
Gold Review Policy: dynamic
Assignments/HIT: 3
Task Duration: 7 days
Assignment Duration: 1 hour
Avg. Annotators/Comment: 3.23
Max. Annotators/Comment: 5
No. of Comments with more than 3 annotations: 787

Table 1: Crowdsourcing annotation task details. In the cases with more than 3 annotations, one or two annotators did not pass the quality evaluation; their annotations were rejected, and the task was extended to a new annotator.

The annotator must answer 80% of the questions correctly (i.e., 8 out of 10) to pass the qualification test. Once annotators pass the test, they can attempt the main classification task. A total of 26 annotators qualified in this test; among them, only ≈54% scored full marks, and the rest scored the minimum 80% needed to pass. To evaluate an annotator's assignment for task accuracy, we used gold-standard instances hidden in the designed Human Intelligence Tasks (HITs). For each HIT, we assigned 25 comments for annotation, of which 5 were randomly assigned as test questions. A threshold of 80% is again set as the selection criterion, implying that 4 out of 5 gold instances must be correct for an annotator's assignment to be accepted. These test questions (comments) are selected randomly from a pool of 60 comments annotated by the domain experts. The selected gold comments included 55 (out of 60) offensive comments and have an average agreement of ≈95.63%. A closer look at the agreement per class indicates that the annotators were more confident in the case of non-offensive comments, with an average agreement of 97.91% (minimum agreement of 84.76%), compared to offensive comments (average = 95.63%, with a minimum agreement of 61.7%).

Figure 2: Platform-wise annotation label distribution. "All" represents the total of instances for offensive and not-offensive comments in the annotated dataset.

Figure 3: Platform-wise annotation label distribution for the types of offensive comments present in the annotated dataset.

3.5 Annotation Results

Annotation Agreement and Evaluation: As each comment has 3 annotations (judgments), we assigned the label agreed upon by the majority (i.e., 2 out of 3) of the annotators. We observed that for offensive comments, only 54.2% of comments have unanimous agreement, whereas for non-offensive comments, the annotators are in complete agreement for 83.7%. This shows the complexity of identifying offensiveness, especially in an Arabic language context, due to the ambiguity of lexical variations in Arabic dialects compared to other languages such as English and German. For the crowd-annotated dataset, we obtain a Fleiss's kappa (Falotico and Quatto, 2015) of κ = 0.72. Although κ theoretically applies to a fixed set of annotators, in our case we randomly assigned the 3 judgments to hypothetical annotators A1-3. To assess the reliability of the crowd annotations, we randomly selected 500 comments to be annotated by a domain expert. Using the expert annotation as the reference, we observed that the accuracy of our crowdsourced dataset is 94%. From further error analysis, we observed that the comments mislabeled by the crowd annotators include offensive content written using sarcasm or irony, along with some long comments mixed with statements rather than opinions.

Annotation Distribution: Based on the majority voting, we present the distribution of the final annotated dataset in Figure 2. We observed that the percentage of offensive comments varies across platforms, as shown in Figure 2, with TW comments having the most offensive instances, followed by FB and then YT.

Types of Offensive Comments: To further understand the annotated data and to discriminate between the different types of offensive comments, we manually annotated the obtained 675 (16.88%, see Figure 2, 'All/Offensive') offensive comments in our dataset for hate speech and vulgar (but not hate) categories. The obtained distribution of each class (vulgar and hate speech) is presented in Figure 3. From the class distribution, we can observe that in all three platforms, the "hate" category is more prominent than "vulgar" alone. Our results indicate that FB, followed by TW, has significantly more hate comments than YT channel comments.


Figure 4: Word cloud for highly non-offensive (NOT) tokens with valence score ϑ = 1 in comments. Some of the most frequent words include 'thank you' and 'important', along with other conversation fillers and reactions like 'hahah'.

Figure 5: Word cloud for highly offensive tokens with valence score ϑ = 1 in comments. Corresponding English examples can be found in Table 3.

4. Data Analysis

4.1 Lexical Analysis

To understand how distinctive the comments are with respect to the offensive and non-offensive classes, we analyzed the lexical content of the dataset. We compared the vocabularies of the two classes using the valence score (Conover et al., 2011; Mubarak and Darwish, 2019) ϑ for every token x, following Equation 1:

\vartheta(x) = 2 \cdot \frac{C(x \mid OFF)/T_{OFF}}{C(x \mid OFF)/T_{OFF} + C(x \mid NOT)/T_{NOT}} - 1 \qquad (1)

where C(·) is the frequency of token x in a given class (OFF or NOT), and T_OFF and T_NOT are the total numbers of tokens present in each class. ϑ(x) ∈ [−1, +1], with +1 indicating that the token is used significantly more in offensive content than in non-offensive content, and vice versa. We identified the most distinctive unigrams in each class using the valence score and present them as word clouds in Figure 4 and Figure 5. We also present the top 10 highly offensive tokens in Table 3 with their frequencies. Furthermore, using the same score, we also present the top highly distinctive bi- and tri-grams in Table 2.
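As a concrete illustration, Equation 1 can be sketched in a few lines of Python (a minimal sketch with our own naming; `off_tokens` and `not_tokens` stand for the concatenated token streams of the two classes):

```python
from collections import Counter

def valence_scores(off_tokens, not_tokens):
    """Valence score from Equation (1): +1 means the token appears
    only in offensive comments, -1 means only in non-offensive ones."""
    c_off, c_not = Counter(off_tokens), Counter(not_tokens)
    t_off, t_not = len(off_tokens), len(not_tokens)
    scores = {}
    for tok in set(c_off) | set(c_not):
        p_off = c_off[tok] / t_off          # C(x|OFF) / T_OFF
        p_not = c_not[tok] / t_not          # C(x|NOT) / T_NOT
        scores[tok] = 2 * p_off / (p_off + p_not) - 1
    return scores
```

A token appearing exclusively in one class is pushed to the corresponding extreme, which is why the word clouds in Figures 4 and 5 are built from tokens with |ϑ| = 1.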

From the lexical analysis, we observed that words like 'dogs', 'bark', 'dirty', and 'envy' are more common in offensive instances than in non-offensive instances. Similarly, looking into the bi- and tri-grams, we observed that words like 'rage' and 'curse' are very frequent in offensive comments.

4.2 Use of Emojis in Offensive Language

Analyzing the comments reveals traits that are specific to each of the two types, offensive and non-offensive. Further analysis shows that emoji usage also differs between the two types.

Figure 6: Reported analysis of emoji use in offensive comments. Figure (a) lists the 10 most highly offensive emojis, with valence score ϑ = 1.0. Figure (b) shows the top 3 groups of emojis with ϑ = 1.0; for each group, ω is reported as a percentage.

To understand the distinctive usage of emojis, we also calculated the valence score ϑ for all the emojis. Figure 6 shows the top 10 emojis present in the offensive comments of the dataset with valence score ϑ = 1. Furthermore, we grouped all the emojis based on their categories, including "people, face and smileys" and "animals and nature" among others, as defined by the Unicode Consortium.⁴ We then calculated the weight of each group, ω_g, using Equation 2:

\omega_g = \frac{\sum_{e_i \in g} C(e_i \mid \vartheta = 1.0)}{\sum_{e \in L} C(e \mid \vartheta = 1.0)} \qquad (2)

where e_i represents the i-th emoji present in group g, given the valence score ϑ = 1.0; L is the list of emojis present in the offensive dataset; and C(·) is the usage frequency of the given emoji. The ω values for the top 3 frequent groups are shown in Figure 6. Our analysis of offensive emojis also included studying the most frequent bi-grams with high valence score ϑ = 1.0. We observed that the top 2 most frequent bi-grams are also from the animals group, each with a frequency of 3.
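The group-weight computation of Equation 2 can be sketched as follows (our own helper; `emoji_to_group` is a hypothetical mapping onto the Unicode category names, and `offensive_only` is the set of emojis with ϑ = 1.0):

```python
from collections import defaultdict

def group_weights(emoji_counts, emoji_to_group, offensive_only):
    """Compute omega_g (Eq. 2): each group's share of the total usage
    of the emojis that occur exclusively in offensive comments."""
    kept = {e: c for e, c in emoji_counts.items() if e in offensive_only}
    total = sum(kept.values())  # denominator: sum over all e in L
    weights = defaultdict(float)
    for e, c in kept.items():
        weights[emoji_to_group.get(e, "other")] += c / total
    return dict(weights)
```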

⁴https://unicode.org/emoji/charts-12.1/emoji-ordering.html


Highly OFF n-gram | Freq | Highly NOT n-gram | Freq
(piglet channel) | 5 | (Messenger of God) | 27
(die with your rage) | 4 | (many many) | 17
(was downed) | 4 | (will power) | 17
(downing all) | 4 | (Arab countries) | 15
(with all plots) | 4 | (God willing) | 14
(God's curse on you all) | 4 | (no power except with God) | 17
(Donkey's milk) | 4 | (Gulf countries) | 10
(and who) | 2 | (I was) | 9
(who is the one) | 2 | (except God) | 9

Table 2: List of the most frequent n-grams (n = 2, 3) for both offensive (OFF) and non-offensive (NOT) comments by valence score, shown here by the English translations of the Arabic n-grams. Freq represents the frequency of the n-gram in the corresponding class.

Token | Frequency
(Piglet) | 21
(Dog) | 8
(Dirty) | 8
(Mercenaries) | 7
(Bark) | 6
(With your envy/rage) | 5
(Filthy) | 5
(O Dogs) | 5
(Nose) | 4
(Plots) | 4

Table 3: List of the top 10 most highly offensive unigrams with ϑ(word) = 1.0, shown here by the English translations of the Arabic tokens.

Consequently, our findings suggest that emojis representing animals appear uniquely in offensive comments, indicating their use by analogy for offensive purposes. This observation is in line with our findings on the distinctive lexical tokens (such as 'dog') used in offensive comments.

5. Experiments and Results

This section describes the details of the feature extraction and the machine learning algorithm used to test the performance of the dataset. The motivation for this experimental exploration is to evaluate the released dataset and to provide baseline results that facilitate future studies. In addition, we also exploit the presence of comments from different platforms to test the generalizability of a platform-dependent model (trained on data from a specific platform) across (a) other social media/news platforms and (b) out-of-domain short-content and comment datasets.

5.1 Datasets

In addition to our dataset, referred to as the Multi-Platform Offensive Language Dataset (MPOLD), we also utilized three previously published datasets, detailed in Table 4. For evaluating performance on out-of-domain⁵ datasets, we used Egyptian and Levantine Arabic datasets. The Egyptian Arabic dialect data includes 100 TW posts and their 10 corresponding comments⁶ per tweet. The tweets are extracted from the 10 most controversial Egyptian TW users (Mubarak et al., 2017). This dataset is referred to in this paper as the Egyptian Tweets and corresponding Comment Dataset (ETCD). The Levantine dataset (referred to as L-HSAB) (Mulki et al., 2019) was collected from the timelines of politicians, social/political activists, and TV anchors, and does not contain any comments. Similarly, we also used the Deleted Comments Dataset⁷ (henceforth DCD) from Aljazeera.net (Mubarak et al., 2017) to study how an offensive-language detection model performs on a different platform.

Preprocessing: To prepare the data for classification, we first tokenized the input sentences, removing punctuation, URLs, stop words, and diacritics present in the text. We inserted whitespace where there was no separation between adjacent words and emojis. For our study, we kept the emojis and hashtags due to the contextual information they carry.
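A minimal sketch of this preprocessing step is shown below; the regex character ranges and the stop-word handling are our assumptions, not the authors' exact rules:

```python
import re

URL = re.compile(r'https?://\S+')
DIACRITICS = re.compile(r'[\u064B-\u0652]')              # Arabic tashkeel marks
EMOJI = re.compile(r'([\U0001F300-\U0001FAFF\u2600-\u27BF])')
# Punctuation to strip; '#' is intentionally absent so hashtags survive.
PUNCT = re.compile(r'[!"$%&\'()*+,\-./:;<=>?@\[\]^_`{|}~؟،]')

def preprocess(text, stop_words=frozenset()):
    text = URL.sub(' ', text)          # drop URLs
    text = DIACRITICS.sub('', text)    # drop diacritics
    text = EMOJI.sub(r' \1 ', text)    # whitespace around emojis
    text = PUNCT.sub(' ', text)        # drop punctuation
    return [tok for tok in text.split() if tok not in stop_words]
```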

5.2 Classification Design

As a baseline for the MPOLD dataset, we conducted experiments using a classical machine learning algorithm, SVM. The SVM algorithm (Platt, 1998) is based on the Structural Risk Minimization principle from computational learning theory. The algorithm is a proven universal learner, well known for its ability to learn independently of the dimensionality of the feature space and of the class distribution of the training dataset. These properties make it one of the most popular supervised classification

⁵Here representing that the contents are not news posts.
⁶http://alt.qcri.org/~hmubarak/offensive/TweetClassification-Summary.xlsx
⁷http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx


Abbr. | Dataset Description | Labels | #I (%OFF) | avg. len | Used for
MPOLD | News comments from TW, FB, and YT | OFF, NOT (V/HS/Oth: 180/185/417) | 4,000 (16.9%) | 22.8 | train/test
ETCD | 100 Egyptian tweets and corresponding comments | OBs, OFF-OBs, NOT | 1,100 (≈59.0%) | 13.5 | test
L-HSAB | Levantine tweets | Abs, HS, NOT | 5,846 (≈37.6%) | 12.0 | test
DCD | Deleted comments from a news website | OBs, OFF-OBs, NOT | 31,692 (≈8.2%) | 17.4 | train/test

Table 4: Details of the datasets used for the study. #I represents the number of instances, and %OFF the percentage of offensive labels used in this study. In the case of ETCD and DCD, we merged the offensive comments, obscene (OBs) and other offensive (OFF-OBs), into OFF to match our labels. Similarly, we merged HS and Abs (abusive) into OFF for the L-HSAB dataset. avg. len represents the average length of the corresponding content in number of words. V = vulgar, HS = hate speech, Oth = other offensive categories.

methods, and thus our choice for the baseline and for all the experiments conducted in this study. To train the classifier, we transform the online comments into bag-of-character-n-grams vectors weighted with logarithmic term frequencies (tf) multiplied by inverse document frequencies (idf). For this, we created character n-grams using characters inside word boundaries only, and padded the edge characters with whitespace. To utilize contextual information, we extracted n-gram features using n = 1 to n = 5.
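This feature and classifier setup can be approximated with scikit-learn; the following is a sketch under our parameter choices (the paper does not specify the exact SVM hyperparameters):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# analyzer='char_wb' builds character n-grams only inside word
# boundaries and pads edge characters with whitespace, matching the
# description above; sublinear_tf=True applies log term-frequency
# weighting (1 + log tf) before the idf multiplication.
model = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 5),
                    sublinear_tf=True),
    LinearSVC(),
)
```

Training then reduces to `model.fit(train_comments, train_labels)` followed by `model.predict(test_comments)`.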

5.3 Performance Evaluation

To measure classification performance, we report the macro-averaged F1-measure (Fm), calculated using macro-averaged precision and recall, i.e., the average of P and R over both classes. The motivation behind this choice is the imbalanced class distribution, which makes well-known measures such as accuracy and the micro-averaged F1-measure not well representative of performance. Since the performance on both classes is of interest, we also report the F-measure of the individual classes.
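A small worked example of why macro-averaging matters under class imbalance (the numbers are illustrative only, not from the paper):

```python
from sklearn.metrics import f1_score

# 8 non-offensive vs 2 offensive comments; the classifier misses one
# of the two offensive instances.
y_true = ["NOT"] * 8 + ["OFF", "OFF"]
y_pred = ["NOT"] * 8 + ["NOT", "OFF"]

micro = f1_score(y_true, y_pred, average="micro")  # equals accuracy: 0.9
macro = f1_score(y_true, y_pred, average="macro")  # (0.941 + 0.667)/2 ≈ 0.804
per_class = f1_score(y_true, y_pred, average=None, labels=["NOT", "OFF"])
```

The micro average stays high because the majority class dominates, while the macro average exposes the weak F-measure on the minority OFF class, which is exactly the behavior of interest here.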

Model | Fm | NOT | OFF
Chance | 0.49 | 0.83 | 0.14
Lex | 0.53 | 0.84 | 0.21
SVM | 0.74 | 0.92 | 0.56

Table 5: Reported performance on the MPOLD dataset using 5-fold cross-validation (with SVM, lexicon (Lex), and chance baselines).

5.4 Results

MPOLD performance: To evaluate the classification performance, we performed five-fold⁸ cross-validation on the dataset, maintaining the natural distribution of the classes in each fold. The performance on all the instances is shown in Table 5.

⁸We chose 5 instead of 10 as the number of folds to ensure a good sample size and distribution of both labels in each test fold.

Test set | Trained on | Fm | NOT | OFF
FB | DCD | 0.34 | 0.43 | 0.25
FB | TW | 0.62 | 0.90 | 0.34
FB | YT | 0.62 | 0.92 | 0.31
FB | TW + YT | 0.68 | 0.93 | 0.44
FB | TW + YT + DCD | 0.78 | 0.94 | 0.63
TW | DCD | 0.52 | 0.51 | 0.53
TW | FB | 0.51 | 0.83 | 0.19
TW | YT | 0.54 | 0.82 | 0.25
TW | FB + YT | 0.59 | 0.84 | 0.35
TW | FB + YT + DCD | 0.84 | 0.90 | 0.78
YT | DCD | 0.44 | 0.67 | 0.22
YT | FB | 0.53 | 0.96 | 0.10
YT | TW | 0.60 | 0.94 | 0.27
YT | TW + FB | 0.63 | 0.95 | 0.31
YT | TW + FB + DCD | 0.82 | 0.97 | 0.67
DCD | TW | 0.37 | 0.39 | 0.34
DCD | FB | 0.21 | 0.31 | 0.10
DCD | YT | 0.29 | 0.26 | 0.32
DCD | FB + YT + TW | 0.40 | 0.35 | 0.45

Table 6: F-measure performance by platform-wise dataset. The test sets presented here are subsets of the MPOLD data; for instance, TW ∈ MPOLD is the data extracted from Twitter. Similarly, YT represents YouTube data and FB represents comments from Facebook. DCD = deleted comments dataset; NOT = not-offensive; OFF = offensive instances. Rows including DCD in the training data present the results when the platform-wise data in our dataset is modelled together with the online-news-platform data (DCD).

To contextualize the obtained results, we report a chance-level baseline based on the prior distribution of the labels in each fold. In addition, we also present a simple lexicon-based baseline (Lex) using a list of possible offensive words and phrases. This word list is the combination of the obscene words presented in (Mubarak et al., 2017) and filtered entries from the Hatebase (Hatebase.org) lexicon. From Table 5, we observe that the SVM performs significantly better than the lexicon and chance-level baselines.

Platform-specific performance: To assess the generalization capabilities of the designed model, we evaluated


Test set | Model | Fm | NOT | OFF
ETCD | Lex | 0.51 | 0.58 | 0.44
ETCD | DCD | 0.64 | 0.76 | 0.53
ETCD | MPOLD | 0.61 | 0.66 | 0.56
ETCD | MPOLD+ | 0.68 | 0.61 | 0.75
L-HSAB | Lex | 0.46 | 0.74 | 0.18
L-HSAB | DCD | 0.59 | 0.60 | 0.58
L-HSAB | MPOLD | 0.64 | 0.77 | 0.51
L-HSAB | MPOLD+ | 0.62 | 0.64 | 0.59

Table 7: Reported F-measures of the models trained using DCD, MPOLD, and MPOLD + DCD (MPOLD+), tested on out-of-domain test sets: (a) ETCD, the controversial Egyptian tweets and comments dataset, and (b) L-HSAB, the Levantine tweets dataset. Lex is the lexicon-based baseline for the test sets.

the models using leave-one-platform-dataset-out (LOPO) cross-validation. The performance of the models on each social media platform's data is presented in Table 6. Besides the three platforms in our dataset, we also included DCD, deleted comments from an online news portal, as a fourth platform in the experiment.
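The LOPO protocol can be sketched as follows (our own helper, not the authors' code; `make_model` is any factory returning a fresh classifier with fit/predict, and `score` is the chosen metric, e.g. macro F1):

```python
from itertools import combinations

def lopo(datasets, make_model, score):
    """Leave-one-platform-dataset-out: hold each platform out as the
    test set and train on every non-empty combination of the rest."""
    results = {}
    for held_out, (X_test, y_test) in datasets.items():
        rest = [p for p in datasets if p != held_out]
        for r in range(1, len(rest) + 1):
            for combo in combinations(rest, r):
                X_train, y_train = [], []
                for p in combo:                      # pool training platforms
                    X_train += list(datasets[p][0])
                    y_train += list(datasets[p][1])
                model = make_model()
                model.fit(X_train, y_train)
                results[(held_out, "+".join(combo))] = score(
                    y_test, model.predict(X_test))
    return results
```

With the four sources (TW, FB, YT, DCD), this loop reproduces the train/test combinations reported in Table 6.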

We observed that in most cases, models trained on one particular platform perform poorly compared to models trained on multi-platform datasets. These multi-platform models include combinations of (a) 2 platforms from the MPOLD data or (b) 2 platforms from MPOLD plus the DCD dataset.

For instance, the model trained on TW + YT performs significantly better on the FB test set than the individual YT and TW models. The same pattern is observed when DCD performance is compared with the combined (YT + TW) model. In this case, a significant jump in the OFF class is noticed, increasing the overall macro-averaged F-measure (Fm). This indicates the importance of multi-platform datasets for generalizing offensive-language detection models.

In addition to being data from a different platform of a similar domain, the DCD dataset brings two added advantages for training the model: (a) it adds comments written in MSA alongside DA, and (b) it increases the number of offensive examples for model training. The MPOLD comments are mostly dialectal Arabic; adding DCD therefore gives the model MSA comments to recognize and learn from, while also increasing the number of (not-)offensive training examples. Thus, the results of the combined platform models (e.g., TW + YT + DCD) indicate the impact of both the variety of examples and the size of the training data.

However, comparing the results of the individual models, trained on only DCD, TW, YT, or FB, we observed that even though DCD has a large number of training instances, this model cannot outperform TW/YT/FB alone in most cases. This again highlights the inadequacy of model generalization when trained on one platform and tested on another.

Cross-domain performance: The performance of the models trained on MPOLD and MPOLD + DCD (MPOLD+) across domains is presented in Table 7. The results indicate that for the ETCD dataset, both our models (MPOLD and MPOLD+) outperform the baseline and the DCD model on OFF-class F-measure. The overall best performance is obtained using MPOLD+, with improvements of 7% and 4% over the MPOLD and DCD models, respectively. As for the L-HSAB data, the MPOLD model outperforms all other models in terms of overall F-measure; however, for the offensive class (OFF) alone, the best performance is obtained by MPOLD+. The performance of the MPOLD and MPOLD+ models on both the ETCD and L-HSAB datasets indicates the positive impact of a multi-platform training dataset and implies cross-domain generalizability.

6. Conclusion

In this paper, we introduced one of the first dialectal Arabic comment datasets extracted from multiple social media platforms. Utilizing this dataset, we designed classification experiments to study the impact of multi-platform data on platform-wise model generalization. For this, we evaluated the models using platform-wise cross-validation. In addition to our dataset, we also exploited existing publicly available datasets, including a deleted-comments dataset from a news website. We observed that combined-platform models perform significantly better than most individual models, indicating the importance of such multi-platform datasets. Using our designed models, we also evaluated other out-of-domain data (ETCD and L-HSAB) extracted from TW. Besides being out-of-domain, these datasets also allow us to see how our models perform on particular DA content, i.e., ETCD (Egyptian dialect) and L-HSAB (Levantine dialect). Our findings indicate that the multi-platform news comment dataset model has the potential to capture diversity across dialects and domains. In addition to evaluating the generalization power of the models, we also presented an in-depth analysis of emoji usage in offensive comments. Our findings suggest that emojis in the animal category are exploited in offensive comments, mirroring our lexical observations. To the best of our knowledge, this is one of the first studies to explore the effect of multi-platform datasets on offensive language detection across platforms, across domains, and across the dialectal varieties present in Arabic. In the future, we plan to extend the study by introducing a larger dataset with finer-grained classification and content analysis.

7. Bibliographical References

Agarwal, S. and Sureka, A. (2014). A focused crawler for mining hate and extremism promoting videos on YouTube. In Proceedings of the 25th ACM Conference on Hypertext and Social Media, pages 294–296. ACM.

Al-Ajlan, M. A. and Ykhlef, M. (2018). Optimized Twitter cyberbullying detection based on deep learning. In 2018 21st Saudi Computer Society National Computer Conference (NCC), pages 1–5. IEEE.

Alakrot, A., Murray, L., and Nikolov, N. S. (2018). Dataset construction for the detection of anti-social behaviour in online communication in Arabic. Procedia Computer Science, 142:174–181.

Albadi, N., Kurdi, M., and Mishra, S. (2018). Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76. IEEE.

Badjatiya, P., Gupta, S., Gupta, M., and Varma, V. (2017). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017a). Hate is not binary: Studying abusive behavior of #GamerGate on Twitter. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 65–74. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017b). Mean birds: Detecting aggression and bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, pages 13–22. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017c). Measuring #GamerGate: A tale of hate, sexism, and bullying. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 1285–1290. International World Wide Web Conferences Steering Committee.

Conover, M. D., Ratkiewicz, J., Francisco, M., Gonçalves, B., Menczer, F., and Flammini, A. (2011). Political polarization on Twitter. In Fifth International AAAI Conference on Weblogs and Social Media.

Dadvar, M., Trieschnigg, D., Ordelman, R., and de Jong, F. (2013). Improving cyberbullying detection with user context. In European Conference on Information Retrieval, pages 693–696. Springer.

Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.

Difallah, D., Filatova, E., and Ipeirotis, P. (2018). Demographics and dynamics of Mechanical Turk workers. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 135–143. ACM.

ElSherief, M., Nilizadeh, S., Nguyen, D., Vigna, G., and Belding, E. (2018). Peer to peer hate: Hate speech instigators and their targets. In Twelfth International AAAI Conference on Web and Social Media.

Falotico, R. and Quatto, P. (2015). Fleiss' kappa statistic without paradoxes. Quality & Quantity, 49(2):463–470.

Gulactı, F. (2010). The effect of perceived social support on subjective well-being. Procedia - Social and Behavioral Sciences, 2(2):3844–3849.

Hosseinmardi, H., Mattson, S. A., Rafiq, R. I., Han, R., Lv, Q., and Mishra, S. (2015). Analyzing labeled cyberbullying incidents on the Instagram social network. In International Conference on Social Informatics, pages 49–66. Springer.

Intapong, P., Charoenpit, S., Achalakul, T., and Ohkura, M. (2017). Assessing symptoms of excessive SNS usage based on user behavior and emotion: Analysis of data obtained by SNS APIs. In 9th International Conference on Social Computing and Social Media (SCSM 2017), held as part of the 19th International Conference on Human-Computer Interaction (HCI International 2017), pages 71–83. Springer Verlag.

Jay, T. and Janschewitz, K. (2008). The pragmatics of swearing. Journal of Politeness Research: Language, Behaviour, Culture, 4(2):267–288.

Kumar, S., Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2018). Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933–943. International World Wide Web Conferences Steering Committee.

Mondal, M., Silva, L. A., and Benevenuto, F. (2017). A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 85–94. ACM.

Mubarak, H. and Darwish, K. (2019). Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269–276. Springer.

Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Mulki, H., Haddad, H., Ali, C. B., and Alshabani, H. (2019). L-HSAB: A Levantine Twitter dataset for hate speech and abusive language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118.

Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I. (2017). Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1125–1135.

Pitsilis, G. K., Ramampiaro, H., and Langseth, H. (2018). Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433.

Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines.

Rish, I., et al. (2001). An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, pages 41–46.

Ross, J., Zaldivar, A., Irani, L., and Tomlinson, B. (2009). Who are the Turkers? Worker demographics in Amazon Mechanical Turk. Department of Informatics, University of California, Irvine, USA, Tech. Rep.

Salminen, J., Almerekhi, H., Milenković, M., Jung, S.-g., An, J., Kwak, H., and Jansen, B. J. (2018). Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. In Proceedings of The International AAAI Conference on Web and Social Media (ICWSM 2018), San Francisco, California, USA, June.

Salminen, J., Hopf, M., Chowdhury, S. A., Jung, S.-g., Almerekhi, H., and Jansen, B. J. (2020). Developing an online hate classifier for multiple social media platforms. Human-centric Computing and Information Sciences, 10(1).

Silva, L., Mondal, M., Correa, D., Benevenuto, F., and Weber, I. (2016). Analyzing the targets of hate in online social media. In Tenth International AAAI Conference on Web and Social Media.

Unsvåg, E. F. and Gambäck, B. (2018). The effects of user features on Twitter hate speech detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 75–85.

Waldron, J. (2012). The Harm in Hate Speech. Harvard University Press.

Wiegand, M., Siegel, M., and Ruppenhofer, J. (2018). Overview of the GermEval 2018 shared task on the identification of offensive language.

Zhang, Z., Robinson, D., and Tepper, J. (2018). Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In European Semantic Web Conference, pages 745–760. Springer.

  • Introduction
  • Related Studies
  • Data
    • Data Collection
    • Data Preparation and Selection
    • Annotation Guideline
    • Annotations via Crowdsourcing
    • Annotation Results
      • Data Analysis
        • Lexical Analysis
        • Use of Emojis in Offensive Language
          • Experiments and Results
            • Datasets
            • Classification Design
            • Performance Evaluation
            • Results
              • Conclusion
              • Bibliographical References
Page 4: A Multi-Platform Arabic News Comment Dataset for Offensive … · 2020. 5. 22. · Therefore, in this paper, we introduce and make publicly available a new dialectal Arabic news comment

6206

No of Comments 4000Cost per Comment 1 centCommentsHIT 25GoldHIT 5Gold Review Policy dynamicAssignmentsHIT 3Task duration 7 daysAssignment Duration 1 hourAvg AnnotatorComment 323Max AnnotatorComment 5No Comment with more than 3 annotation 787

Table 1 Crowdsourcing annotation task details In thiscase one or two annotators could not pass the quality eval-uation Thus their annotations were rejected and the taskwas extended to a new annotator

The annotator must provide correct answers to 80 of thequestions (ie 8 out of 10) to pass the qualification testOnce the annotators pass the test they can attempt the mainclassification task A total of 26 annotators qualified thistest among them only asymp 54 scored full and the rest ofthe annotators scored the minimum 80 to pass the testIn order to evaluate an annotatorrsquos assignment for task ac-curacy we used gold standard instances hidden in the de-signed Human Intelligence Tasks (HITs) For each HIT weassigned 25 comments for the annotation and out of them 5were randomly assigned as the test questions A thresholdof 80 is again set for the selection criteria implying that 4out of 5 gold instances must be correct in order for the as-signment by an annotator to be accepted These test ques-tions (comments) are selected randomly from a pool of 60comments and are annotated by the domain experts The se-lected gold comments included 55 (out of 60) of offensivecomments and have an average agreement of asymp 9563A closer look into the agreement per class indicates thatthe annotators were more confident in case of non-offensivecomments with an average agreement of 9791 (mini-mum agreement of 8476) in compare to offensive com-ments (average = 9563 with minimum of agreementis 617)

Figure 2 Platform wise annotation label distribution ldquoAllrdquorepresents the total of instances for offensive and not-offensive comments in the annotated dataset

Figure 3 Platform wise annotation label distribution fortypes of offensive comments present in the annotateddataset

35 Annotation ResultsAnnotation Agreement and Evaluation As each com-ment has 3 annotations (judgments) we assigned the la-bel agreed by the majority (ie 2

3 ) of the annotator Wehave observed that for offensive comments only 542 ofcomments have unanimous agreement whereas for non-offensive comments ndash 837 of the annotators are in com-plete agreement This shows the complexity of identifyingthe offensive task especially in an Arabic language contextdue to the ambiguity of lexical variations in Arabic dialectscompared to other languages such as English and GermanFor the crowd-annotated dataset we obtain a Fleissrsquos kappa(Falotico and Quatto 2015) κ = 072 Although κ theo-retically applies to a fixed set of annotators but for our casewe randomly assigned 3 judgments to the hypothetical an-notator A1-3To assess the reliability of the crowd-annotations we ran-domly selected 500 comments to be annotated by a domainexpert Using the expert annotation as the reference weobserved that the accuracy of our crowdsourced dataset is94 From further error analysis we observed that the mis-labeled comments by the crowd annotators include offen-sive contents written using sarcasm irony along with somelong comments mixed with statements rather than opinion

Annotation Distribution: Based on the majority voting, we present the distribution of the final annotated dataset in Figure 2. We observed that the percentage of offensive comments varies across platforms, as shown in Figure 2, with TW comments containing the most offensive instances, followed by FB and then YT.

Types of Offensive Comments: To further understand the annotated data and to discriminate between the different types of offensive comments, we manually annotated the 675 offensive comments obtained (16.88%, see "All/Offensive" in Figure 2) in our dataset for the hate speech and vulgar (but not hate) categories. The obtained distribution of each class (vulgar and hate speech) is presented in Figure 3. From the class distribution, we can observe that in all three platforms the "hate" category is more prominent than "vulgar" only. Our results indicate that FB, followed by TW, has significantly more hate comments than YT channel comments.


Figure 4: Word cloud of highly non-offensive (NOT) tokens in comments with valence score ϑ(·) = −1. Some of the most frequent words include 'thank you' and 'important', along with other conversation fillers and reactions like 'hahah'.

Figure 5: Word cloud of highly offensive tokens in comments with valence score ϑ(·) = 1. Corresponding English examples can be found in Table 3.

4. Data Analysis

4.1. Lexical Analysis

In order to understand how distinctive the comments are with respect to the offensive and non-offensive classes, we analyzed the lexical content of the dataset. We compared the vocabularies of the two classes using the valence score (Conover et al., 2011; Mubarak and Darwish, 2019) ϑ, computed for every token x following Equation 1:

\[
\vartheta(x) = 2 \cdot \frac{C(x \mid OFF)/T_{OFF}}{C(x \mid OFF)/T_{OFF} + C(x \mid NOT)/T_{NOT}} - 1 \qquad (1)
\]

where C(·) is the frequency of the token x for a given class (OFF or NOT), and T_OFF and T_NOT are the total numbers of tokens present in each class. ϑ(x) ∈ [−1, +1], with +1 indicating that the token is used significantly more in offensive content than in non-offensive content, and vice versa. We identified the most distinctive unigrams in each class using the valence score and present them as word clouds in Figure 4 and Figure 5. We also present the top 10 highly offensive tokens in Table 3 with their frequencies. Furthermore, using the same score, we present the most distinctive bi- and tri-grams in Table 2.
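Equation 1 can be computed directly from the two token streams. The sketch below uses illustrative names; `valence_scores` is not from the paper.

```python
from collections import Counter

def valence_scores(off_tokens, not_tokens):
    """Valence score of Equation 1 for every token seen in either class."""
    c_off, c_not = Counter(off_tokens), Counter(not_tokens)
    t_off, t_not = len(off_tokens), len(not_tokens)
    scores = {}
    for tok in set(c_off) | set(c_not):
        r_off = c_off[tok] / t_off          # C(x|OFF) / T_OFF
        r_not = c_not[tok] / t_not          # C(x|NOT) / T_NOT
        scores[tok] = 2 * r_off / (r_off + r_not) - 1
    return scores
```

A token appearing only in the offensive stream scores +1; only in the non-offensive stream, −1.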

From the lexical analysis, we observed that words like 'dogs', 'bark', 'dirty', and 'envy' are more common in offensive instances than in non-offensive instances. Similarly, looking into the bi- and tri-grams, we observed that words like 'rage' and 'curse' are strongly associated with offensiveness.

4.2. Use of Emojis in Offensive Language

Analyzing the comments reveals traits that are specific to each of the two types, offensive and non-offensive. Further analysis shows that the usage of emojis also differs between the two types.

Figure 6: Analysis of emoji use in offensive comments. (a) the 10 most highly offensive emojis, with valence score ϑ = 1.0; (b) the top 3 groups of emojis with ϑ = 1.0. For each group, ω is reported as a percentage.

To understand the distinctive usage of emojis, we also calculated the valence score ϑ(·) for all emojis. Figure 6 shows the top 10 emojis present in the offensive comments with valence score ϑ(·) = 1. Furthermore, we grouped all emojis based on their categories – including "smileys & people" and "animals & nature", among others – as defined by the Unicode Consortium.4 We then calculated the weight ω_g of each group using Equation 2:

\[
\omega_g = \frac{\sum_{e_i \in g} C(e_i \mid \vartheta = 1.0)}{\sum_{e \in L} C(e \mid \vartheta = 1.0)} \qquad (2)
\]

where e_i represents the i-th emoji present in group g, given the valence score ϑ = 1.0, L is the list of emojis present in the offensive dataset, and C(·) is the usage frequency of the given emoji. The ω values for the top 3 frequent groups are shown in Figure 6. Our analysis of offensive emojis also covered the most frequent bi-grams with a high valence score (ϑ = 1.0). We observed that the top 2 most frequent bi-grams are also from the animals group, each with a frequency of 3.
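Equation 2 reduces to a normalized sum of emoji frequencies per Unicode category. In the sketch below, the `group_of` and `valence` mappings are assumed inputs (in the paper they come from the Unicode Consortium categories and Equation 1, respectively).

```python
from collections import Counter

def group_weights(emoji_counts, group_of, valence):
    """Equation 2: weight of each emoji group among the emojis whose
    valence score is 1.0. emoji_counts: usage frequency in offensive
    comments; group_of: emoji -> Unicode category; valence: emoji -> score."""
    kept = {e: c for e, c in emoji_counts.items() if valence.get(e) == 1.0}
    total = sum(kept.values())
    weights = Counter()
    for e, c in kept.items():
        weights[group_of[e]] += c / total
    return dict(weights)
```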

4 https://unicode.org/emoji/charts-12.1/emoji-ordering.html


Highly OFF (Freq)                 Highly NOT (Freq)
(piglet channel) 5                (Messenger of God) 27
(die with your rage) 4            (many many) 17
(was downed) 4                    (will power) 17
(downing all) 4                   (Arab countries) 15
(with all plots) 4                (God's willing) 14
(God's curse on you all) 4        (power except with God) 17
(Donkey's milk) 4                 (Gulf countries) 10
(and who) 2                       (I was) 9
(who is the one) 2                (except God) 9

Table 2: Most frequent distinctive n-grams (n = 2, 3) for offensive (OFF) and non-offensive (NOT) comments according to the valence score, shown by the English translations of the Arabic n-grams. Freq represents the frequency of the n-gram in the corresponding class.

Token (English translation)    Frequency
(Piglet)                       21
(Dog)                          8
(Dirty)                        8
(Mercenaries)                  7
(Bark)                         6
(With your envy/rage)          5
(Filthy)                       5
(O Dogs)                       5
(Nose)                         4
(Plots)                        4

Table 3: List of the top 10 most highly offensive unigrams with ϑ(word) = 1.0, shown by the English translations of the Arabic tokens.

Consequently, our findings suggest that emojis representing animals appear uniquely in offensive comments, indicating their use by analogy for offensive purposes. This observation is in line with our findings on the distinctive lexical tokens (such as 'dog') used in offensive comments.

5. Experiments and Results

This section describes the feature extraction and the machine learning algorithm used to evaluate the dataset. The motivation for this experimental exploration is to measure the performance obtainable on the released dataset and to provide baseline results that facilitate future studies. In addition, we utilize the presence of comments from different platforms to test the generalizability of a platform-dependent model (trained on data from a specific platform) across (a) other social media/news platforms and (b) out-of-domain short-content and comment datasets.

5.1. Datasets

In addition to our dataset, referred to as the Multi-Platform Offensive Language Dataset (MPOLD), we also utilized three previously published datasets, detailed in Table 4. For evaluating performance on out-of-domain5 datasets, we used Egyptian and Levantine Arabic datasets. The Egyptian Arabic dialect data includes 100 TW posts and 10 corresponding comments6 per tweet. The tweets are extracted from the 10 most controversial Egyptian TW users (Mubarak et al., 2017). This dataset is referred to in this paper as the Egyptian Tweets and corresponding Comment Dataset (ETCD). The Levantine dataset (referred to as L-HSAB) (Mulki et al., 2019) is collected from the timelines of politicians, social/political activists, and TV anchors, and does not contain any comments. Similarly, we also used the Deleted Comments Dataset7 (referred to as DCD) from Aljazeera.net (Mubarak et al., 2017) to study how an offensive language detection model performs on a different platform.

Preprocessing: To prepare the data for classification, we first tokenized the input sentences, removing punctuation, URLs, stop words, and diacritics present in the text. We inserted whitespace in cases where there was no separation between adjacent words and emojis. For our study, we kept the emojis and hashtags due to the contextual information they represent.
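A rough sketch of this preprocessing pipeline is shown below. The regular expressions (URL pattern, Arabic diacritic range U+064B–U+0652 plus tatweel, and a partial emoji range) are our own simplifications, not the authors' exact rules; note that '#' is deliberately left out of the punctuation set so hashtags survive.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
AR_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")          # harakat + tatweel
EMOJI = re.compile(r"([\u2600-\u27BF\U0001F300-\U0001FAFF])")  # partial coverage
PUNCT = re.compile("[" + re.escape("!\"$%&'()*+,-./:;<=>?@[\\]^_`{|}~؟،؛") + "]")

def preprocess(text, stopwords=frozenset()):
    """Tokenize a comment: drop URLs, diacritics, punctuation, and stop
    words; detach emojis glued to words; keep emojis and hashtags."""
    text = URL.sub(" ", text)
    text = AR_DIACRITICS.sub("", text)
    text = EMOJI.sub(r" \1 ", text)          # whitespace around each emoji
    text = PUNCT.sub(" ", text)              # '#' kept, so hashtags survive
    return [t for t in text.split() if t not in stopwords]
```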

5.2. Classification Design

As a baseline for the MPOLD dataset, we conducted experiments using a classical machine learning algorithm – SVM. The SVM algorithm (Platt, 1998) is based on the Structural Risk Minimization principle from computational learning theory. SVMs are proven universal learners and are well known for their ability to learn independently of the dimensionality of the feature space and the class distribution of the training dataset. These properties make the algorithm one of the most popular supervised classification

5 Here representing that the contents are not news posts.
6 http://alt.qcri.org/~hmubarak/offensive/TweetClassification-Summary.xlsx
7 http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx


Abbr     Description                                    Labels                               #I (%OFF)       avg. len   Used for
MPOLD    News comments from TW, FB, and YT              OFF, NOT (V, HS, Oth: 180/185/417)   4000 (16.9%)    22.8       train/test
ETCD     100 Egyptian tweets and their comments         OBs, OFF-OBs, NOT                    1100 (≈59.0%)   13.5       test
L-HSAB   Levantine tweets                               Abs, HS, NOT                         5846 (≈37.6%)   12.0       test
DCD      Deleted comments from a news website           OBs, OFF-OBs, NOT                    31692 (≈8.2%)   17.4       train/test

Table 4: Details of the datasets used in this study; ETCD and L-HSAB are the out-of-domain datasets. #I represents the number of instances, and %OFF the percentage of offensive labels used in this study. In the case of ETCD and DCD, we merged the offensive comments – obscene (OBs) and other offensive comments (OFF-OBs) – as OFF to match our labels. Similarly, we merged HS and Abs (abusive) for the L-HSAB dataset into OFF. avg. len represents the average length of the corresponding content in number of words. V = Vulgar, HS = Hate speech, Oth = Other offensive categories.

methods, and thus our choice for the baseline and all the experiments conducted in this study.

To train the classifier, we transform the online comments into bag-of-character-n-grams vectors weighted with logarithmic term frequencies (tf) multiplied by inverse document frequencies (idf). For this, we created character n-grams using characters inside word boundaries only and padded the edge characters with whitespace. To utilize contextual information, we extracted n-gram features with n = 1 to n = 5.
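The feature scheme can be illustrated in pure Python: character n-grams taken inside word boundaries with whitespace padding, weighted by sublinear tf (1 + log tf) times idf. The paper does not name its implementation; in practice this corresponds closely to scikit-learn's TfidfVectorizer with analyzer='char_wb', ngram_range=(1, 5), and sublinear_tf=True, feeding a linear SVM.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=1, n_max=5):
    """Character n-grams taken inside word boundaries only, with the
    edge characters padded by whitespace (akin to sklearn's 'char_wb')."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(n_min, n_max + 1):
            source = word if n == 1 else padded
            grams.extend(source[i:i + n] for i in range(len(source) - n + 1))
    return grams

def tfidf(docs, n_min=1, n_max=5):
    """Log-tf * idf weights over character n-grams, one dict per document."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    df = Counter(g for c in counts for g in c)        # document frequencies
    n_docs = len(docs)
    return [{g: (1 + math.log(tf)) * math.log(n_docs / df[g])
             for g, tf in c.items()} for c in counts]
```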

5.3. Performance Evaluation

To measure classification performance, we report the macro-averaged F1-measure (Fm), calculated from macro-averaged precision and recall – the averages of P and R over both classes, respectively. The motivation behind this choice is the imbalanced class distribution, which makes well-known measures such as accuracy and the micro-averaged F1-measure less representative of the performance. Since the performance on both classes is of interest, we also report the F-measure of the individual classes.
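Macro-averaged F1 as used here — the unweighted mean of per-class F1 — can be computed as follows (illustrative implementation):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

Unlike micro-averaging, each class contributes equally, so the rare OFF class is not drowned out by NOT.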

Model    Fm     NOT    OFF
Chance   0.49   0.83   0.14
Lex      0.53   0.84   0.21
SVM      0.74   0.92   0.56

Table 5: Performance on the MPOLD dataset using 5-fold cross-validation (SVM against the chance and lexicon baselines).

5.4. Results

MPOLD performance: To evaluate classification performance, we performed five-fold8 cross-validation on the dataset, maintaining the natural distribution of the classes in each fold. The performance on all instances is shown in Table 5.

8 We chose 5 instead of 10 as the number of folds to ensure a good sample size and distribution of both labels in each fold for test purposes.

Testset   Trained on        Fm     NOT    OFF
FB        DCD               0.34   0.43   0.25
FB        TW                0.62   0.90   0.34
FB        YT                0.62   0.92   0.31
FB        TW + YT           0.68   0.93   0.44
FB        TW + YT + DCD     0.78   0.94   0.63
TW        DCD               0.52   0.51   0.53
TW        FB                0.51   0.83   0.19
TW        YT                0.54   0.82   0.25
TW        FB + YT           0.59   0.84   0.35
TW        FB + YT + DCD     0.84   0.90   0.78
YT        DCD               0.44   0.67   0.22
YT        FB                0.53   0.96   0.10
YT        TW                0.60   0.94   0.27
YT        TW + FB           0.63   0.95   0.31
YT        TW + FB + DCD     0.82   0.97   0.67
DCD       TW                0.37   0.39   0.34
DCD       FB                0.21   0.31   0.10
DCD       YT                0.29   0.26   0.32
DCD       FB + YT + TW      0.40   0.35   0.45

Table 6: F-measure performance of the platform-wise models. The test sets presented here are subsets of the MPOLD data; for instance, TW ∈ MPOLD is the subset extracted from Twitter, YT represents YouTube data, and FB represents comments from Facebook. DCD = deleted comments dataset; NOT = non-offensive; OFF = offensive instances. The rows that include DCD present the results when the platform-wise data in our dataset is modelled together with the online-news-platform data (DCD).

To contextualize the obtained results, we report a chance-level baseline based on the prior distribution of the labels in each fold. In addition, we present a simple lexicon-based baseline (Lex) using a list of possible offensive words and phrases. This word list combines the obscene words presented in (Mubarak et al., 2017) and filtered entries from the Hatebase (Hatebase.org) lexicon. From Table 5, we observe that the SVM performs significantly better than the lexicon and chance-level baselines.

Platform-specific performance: To assess the generalization capabilities of the designed model, we evaluated


Testset   Model     Fm     NOT    OFF
ETCD      Lex       0.51   0.58   0.44
ETCD      DCD       0.64   0.76   0.53
ETCD      MPOLD     0.61   0.66   0.56
ETCD      MPOLD+    0.68   0.61   0.75
L-HSAB    Lex       0.46   0.74   0.18
L-HSAB    DCD       0.59   0.60   0.58
L-HSAB    MPOLD     0.64   0.77   0.51
L-HSAB    MPOLD+    0.62   0.64   0.59

Table 7: F-measures of the models trained on DCD, MPOLD, and MPOLD + DCD (MPOLD+), tested on the out-of-domain test sets: (a) ETCD, the controversial Egyptian tweets and comments dataset, and (b) L-HSAB, the Levantine tweets dataset. Lex is the lexicon-based baseline for the test sets.

the models using leave-one-platform-dataset-out (LOPO) cross-validation. The performance of the models on the data of each social media platform is presented in Table 6. Besides the three platforms in our dataset, for this experiment we also included DCD – deleted comments from an online news portal – as a fourth platform.
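LOPO cross-validation simply holds out all data from one platform at a time. A minimal sketch, assuming samples are (platform, text, label) triples:

```python
def lopo_splits(samples):
    """Leave-one-platform-out splits over (platform, text, label) triples;
    yields (held_out_platform, train, test) once per platform."""
    for held_out in sorted({p for p, _, _ in samples}):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test
```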

We observed that in most cases, models trained on one particular platform perform poorly compared to models trained on multi-platform datasets. These multi-platform models include combinations of (a) 2 platforms from the MPOLD data, or (b) 2 platforms from MPOLD plus the DCD dataset.

For instance, the model trained on TW + YT performs significantly better on the FB test set than the individual YT and TW models. The same pattern is observed when the performance of DCD on the combined (YT + TW) model is compared. In this case, a significant jump for the OFF class is noticed, increasing the overall macro-averaged F-measure (Fm). This indicates the importance of multi-platform datasets for the generalization of offensive language detection models.

In addition to being data from a different platform in a similar domain, the DCD dataset brings two added advantages for training the model: (a) it adds comments written in MSA along with DA, and (b) it increases the number of offensive examples available for training. The MPOLD comments are mostly dialectal Arabic; adding DCD therefore provides MSA examples for the model to recognize and learn from, while also increasing the number of (non-)offensive training examples. Thus, the results of the combined-platform models (e.g., TW + YT + DCD) indicate the impact of both the variety of examples and the size of the training data.

However, comparing the results of the individual models – trained on only DCD, TW, YT, or FB – we observed that even though DCD has a large number of training instances, this model cannot outperform TW/YT/FB alone in most cases. This again highlights the inadequacy of model generalization when training on one platform and testing on another.

Cross-domain performance: The performance of the models trained on MPOLD and MPOLD + DCD (MPOLD+) across domains is presented in Table 7. The results indicate that for the ETCD dataset, both our models (MPOLD and MPOLD+) outperform the baseline and the DCD model on the OFF-class F-measure. The overall best performance is obtained using MPOLD+, with improvements of 7% and 4% over the MPOLD and DCD models. As for the L-HSAB data, the MPOLD model outperforms all other models in terms of overall F-measure; however, for the offensive class (OFF) alone, the best performance is obtained by MPOLD+. The performance of the MPOLD and MPOLD+ models on both the ETCD and L-HSAB datasets indicates the positive impact of a multi-platform training dataset and implies cross-domain generalizability.

6. Conclusion

In this paper, we introduced one of the first dialectal Arabic comment datasets extracted from multiple social media platforms. Utilizing this dataset, we designed classification experiments to study the impact of multi-platform data on platform-wise model generalization, evaluating the models using platform-wise cross-validation. In addition to our dataset, we also exploited existing publicly available datasets, including a deleted comments dataset from a news website. We observed that combined-platform models perform significantly better than most individual models, indicating the importance of such multi-platform datasets. Using our designed models, we also evaluated out-of-domain data (ETCD and L-HSAB) extracted from TW. Besides being out-of-domain, another advantage of evaluating these datasets is to see how our models perform on particular DA content, i.e., ETCD (Egyptian dialect) and L-HSAB (Levantine dialect). Our findings indicate that a model built on the multi-platform news comment dataset has the potential to capture diversity across dialects and domains. In addition to evaluating the generalization power of the models, we presented an in-depth analysis of emoji usage in offensive comments; our findings suggest that emojis in the animal category are exploited in offensive comments, in line with our lexical observations. To the best of our knowledge, this is one of the first studies to explore the effect of multi-platform datasets on offensive language detection across platforms, across domains, and across the dialectal varieties present in Arabic. In future work, we plan to extend the study by introducing a larger dataset with further fine-grained classification and content analysis.

7. Bibliographical References

Agarwal, S. and Sureka, A. (2014). A focused crawler for mining hate and extremism promoting videos on YouTube. In Proceedings of the 25th ACM Conference on Hypertext and Social Media, pages 294–296. ACM.

Al-Ajlan, M. A. and Ykhlef, M. (2018). Optimized Twitter cyberbullying detection based on deep learning. In 2018 21st Saudi Computer Society National Computer Conference (NCC), pages 1–5. IEEE.

Alakrot, A., Murray, L., and Nikolov, N. S. (2018). Dataset construction for the detection of anti-social behaviour in online communication in Arabic. Procedia Computer Science, 142:174–181.

Albadi, N., Kurdi, M., and Mishra, S. (2018). Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76. IEEE.

Badjatiya, P., Gupta, S., Gupta, M., and Varma, V. (2017). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017a). Hate is not binary: Studying abusive behavior of #gamergate on Twitter. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 65–74. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017b). Mean birds: Detecting aggression and bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, pages 13–22. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017c). Measuring #gamergate: A tale of hate, sexism, and bullying. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 1285–1290. International World Wide Web Conferences Steering Committee.

Conover, M. D., Ratkiewicz, J., Francisco, M., Gonçalves, B., Menczer, F., and Flammini, A. (2011). Political polarization on Twitter. In Fifth International AAAI Conference on Weblogs and Social Media.

Dadvar, M., Trieschnigg, D., Ordelman, R., and de Jong, F. (2013). Improving cyberbullying detection with user context. In European Conference on Information Retrieval, pages 693–696. Springer.

Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.

Difallah, D., Filatova, E., and Ipeirotis, P. (2018). Demographics and dynamics of Mechanical Turk workers. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 135–143. ACM.

ElSherief, M., Nilizadeh, S., Nguyen, D., Vigna, G., and Belding, E. (2018). Peer to peer hate: Hate speech instigators and their targets. In Twelfth International AAAI Conference on Web and Social Media.

Falotico, R. and Quatto, P. (2015). Fleiss' kappa statistic without paradoxes. Quality & Quantity, 49(2):463–470.

Gülaçtı, F. (2010). The effect of perceived social support on subjective well-being. Procedia - Social and Behavioral Sciences, 2(2):3844–3849.

Hosseinmardi, H., Mattson, S. A., Rafiq, R. I., Han, R., Lv, Q., and Mishra, S. (2015). Analyzing labeled cyberbullying incidents on the Instagram social network. In International Conference on Social Informatics, pages 49–66. Springer.

Intapong, P., Charoenpit, S., Achalakul, T., and Ohkura, M. (2017). Assessing symptoms of excessive SNS usage based on user behavior and emotion: Analysis of data obtained by SNS APIs. In 9th International Conference on Social Computing and Social Media (SCSM 2017), held as part of the 19th International Conference on Human-Computer Interaction (HCI International 2017), pages 71–83. Springer Verlag.

Jay, T. and Janschewitz, K. (2008). The pragmatics of swearing. Journal of Politeness Research: Language, Behaviour, Culture, 4(2):267–288.

Kumar, S., Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2018). Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933–943. International World Wide Web Conferences Steering Committee.

Mondal, M., Silva, L. A., and Benevenuto, F. (2017). A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 85–94. ACM.

Mubarak, H. and Darwish, K. (2019). Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269–276. Springer.

Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Mulki, H., Haddad, H., Ali, C. B., and Alshabani, H. (2019). L-HSAB: A Levantine Twitter dataset for hate speech and abusive language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118.

Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I. (2017). Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1125–1135.

Pitsilis, G. K., Ramampiaro, H., and Langseth, H. (2018). Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433.

Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines.

Rish, I. et al. (2001). An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, pages 41–46.

Ross, J., Zaldivar, A., Irani, L., and Tomlinson, B. (2009). Who are the Turkers? Worker demographics in Amazon Mechanical Turk. Department of Informatics, University of California, Irvine, USA, Tech. Rep.

Salminen, J., Almerekhi, H., Milenkovic, M., Jung, S.-g., An, J., Kwak, H., and Jansen, B. J. (2018). Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM 2018), San Francisco, California, USA, June.

Salminen, J., Hopf, M., Chowdhury, S. A., Jung, S.-g., Almerekhi, H., and Jansen, B. J. (2020). Developing an online hate classifier for multiple social media platforms. Human-centric Computing and Information Sciences, 10(1).

Silva, L., Mondal, M., Correa, D., Benevenuto, F., and Weber, I. (2016). Analyzing the targets of hate in online social media. In Tenth International AAAI Conference on Web and Social Media.

Unsvåg, E. F. and Gambäck, B. (2018). The effects of user features on Twitter hate speech detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 75–85.

Waldron, J. (2012). The Harm in Hate Speech. Harvard University Press.

Wiegand, M., Siegel, M., and Ruppenhofer, J. (2018). Overview of the GermEval 2018 shared task on the identification of offensive language.

Zhang, Z., Robinson, D., and Tepper, J. (2018). Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In European Semantic Web Conference, pages 745–760. Springer.

  • Introduction
  • Related Studies
  • Data
    • Data Collection
    • Data Preparation and Selection
    • Annotation Guideline
    • Annotations via Crowdsourcing
    • Annotation Results
      • Data Analysis
        • Lexical Analysis
        • Use of Emojis in Offensive Language
          • Experiments and Results
            • Datasets
            • Classification Design
            • Performance Evaluation
            • Results
              • Conclusion
              • Bibliographical References
Page 5: A Multi-Platform Arabic News Comment Dataset for Offensive … · 2020. 5. 22. · Therefore, in this paper, we introduce and make publicly available a new dialectal Arabic news comment

6207

Figure 4 Word-cloud for highly non offensive (NOT) to-kens with valence score ϑ() = 1 in comments Some ofthe most frequent words include lsquothank yoursquo lsquoimportantrsquoalong with other conversation fillers and reaction like lsquoha-hahrsquo

Figure 5 Word-cloud for highly offensive tokens with va-lence score ϑ() = 1 in comments Corresponding En-glish examples can be found in Table 3

4 Data Analysis41 Lexical AnalysisIn order to understand how distinctive the comments arewith respect to the offensive and non-offensive classes weanalyzed the lexical content of the datasetWe compared the vocabularies of the two classes using thevalence score (Conover et al 2011 Mubarak and Darwish2019) ϑ for every token x following the Equation 1

ϑ(x) = 2 lowastC(x|OFF )

TOFF

C(x|OFF )TOFF

+ C(x|NOT )TNOT

minus 1 (1)

where C() is the frequency of the token x for a given class(OFF or NOT) TOFF and TNOT are the total number oftokens present in each class The ϑ(x) isin [minus1+1] with +1indicating that the use of the token is significantly highly inoffensive content than non-offensive and vice-versaWe identified the most distinctive unigrams in each classesusing valence score and presented using word-clouds inFigure 4 and Figure 5 We also presented top 10 highly of-fensive tokens in Table 3 with its frequency Furthermoreusing the same score we also presented top highly distinc-tive bi- and tri-grams in Table 2

From the lexical analysis we observed that words likelsquodogsrsquo lsquobarkrsquo lsquodirtyrsquo lsquoenvyrsquo are more common in offen-sive instances than not-offensive instances Similarly whenlooking into the bi- and tri-grams we observed that use ofwords like lsquoragersquo lsquocursersquo are very frequent to offensive-ness

42 Use of Emojis in Offensive LanguageAnalyzing the comment reveals treats that are specific toeach of the types offensive and non-offensive Furtheranalysis of the usage of emojis shows that there are somespecifics for each type Their usage of emojis is different

Figure 6 Reported analysis on use of emojis in offensivecomments Figure (a) ndash lists the 10 most highly offensiveemojis with valence score ϑ = 10 Figure (b) ndash Top 3groups of emojis with ϑ = 10 For each group ω is re-ported in percentage

To understand the distinctive usage of emoji we also cal-culated the valence ϑ() score for all the emojis Figure6 shows top 10 emojis present in the offensive commentsdataset with valence score ϑ() = 1Furthermore we grouped all the emojis based on their cate-gories ndash including ldquopeople face and smilieysrdquo ldquoanimal andnaturesrdquo among others ndash as defined by Unicode Consor-tium4 We then calculated the weight of each group ωg

using the following Equation 2

ωg =

sumeiising C(ei|ϑ = 10)sumeisinL C(e|ϑ = 10))

(2)

where for ei represent the ith emoji present in group ggiven the valence score ϑ = 10 L is the list of emojispresent in the offensive dataset and C() is the usage fre-quency of the given emoji The ω for top 3 frequent groupsare shown in Figure 6Our analysis of offensive emoji also contained studying themost frequent bi-grams with high valence score ϑ = 10We observed that the top 2 most frequent bi-grams are also

from the animals group ndash ldquo rdquo (frequency = 3) and

ldquo rdquo (3)

4httpsunicodeorgemojicharts-121emoji-orderinghtml

6208

Highly OFF Highly NOTNgrams Freq Ngrams Freq

egraveQKQ mIgrave

egraveA

Jmacr (piglet channel) 5 eacuteltEuml EgraventildeP (Messenger of God) 27

Otildeordm

centJordfK ntilde

KntildeOacute (die with your rage) 4 Yg Yg (many many) 17

eacutekApoundB Otildeccedil

(was downed) 4

egraventildemacr Egraventildek (will power) 17

copyJOgravem

eacutekApoundB (downing all) 4

eacuteJK QordfEuml EgraveethYEuml (Arab countries) 15

HAcentcentmtimes copyJOgravem

(with all plots) 4 eacuteltEuml Z A

AOacute (Godrsquos willing) 14

OtildeordmJEcirclaquo eacuteltEuml eacuteJordfEuml (Godrsquo curse on you all) 4 eacuteltEumlAK B

egraventilde

macr (power except with God) 17

QOgravemIgrave I JEcircg (Donkeyrsquos milk) 4 i JEcircmIgrave EgraveethX (Gulf countries) 10

aacuteOacute eth (and who) 2 I

Jraquo A

K (I was) 9

uacuteIcircEuml

aacuteOacute (who is the one) 2 eacuteltEuml B (except God) 9

Table 2 List of top most frequent n-grams (n=23) for both offensive (OFF) and non-offensive (NOT) comments usingvalence score are as follows () are the translation in English Freq represents frequency of the ngram in the correspondingclass

Tokens FrequencyegraveQK

Q mIgrave (Piglet) 21

I Ecircfrac34Euml (Dog) 8

PY

macr (Dirty) 8

eacutemacr Q

KQOumlIuml (Mercenaries) 7

iJK (Bark) 6

Otildeordm

centJordfK (With your envyrage) 5

m

(filthy) 5

H Ciquest AK (O Dogs) 5

shyK13 (Nose) 4

HAcentcent

mtimes (Plots) 4

Table 3 List of top 10 most highly offensive unigrams withϑ(word) = 10 () are the translation in English

Consequently our findings suggest that emojis representinganimals appear uniquely in offensive comments indicatingtheir use by analogy for offensive purpose This observa-tion is in line with our findings for distinctive lexical tokens(such as I Ecircfrac34Euml ndash ldquodogrdquo) usage in offensive comments

5 Experiments and ResultsThis section describes the details of feature extraction andmachine learning algorithm used to test the performanceof the dataset The motivation for such experimental explo-ration is to test the performance of the released dataset andto have baseline results in order to facilitate future studiesIn addition we also utilize the presence of comments froma different platform to test generalizability of a platform-dependent model (trained on a specific platform data)across (a) other social medianews platforms and (b) out-of-domain short content and comment datasets

51 DatasetsIn addition to our dataset refereed as Multi PlatformsOffensive Language Dataset (MPOLD) we also utilized

three previously published dataset details in Table 4For evaluating the performance on out-of-domain5 datasetswe used Egyptian and Levantine Arabic datasetsThe Egyptian Arabic dialect data includes 100 TW postsand their 10 corresponding comments6 per tweet Thetweets are extracted from 10 most controversial EgyptianTW users (Mubarak et al 2017) This dataset is refer-eed in this paper as Egyptian Tweets and correspondingComment Dataset (ETCD) for further mentions The Lev-antine dataset (referred to as L-HSAB)(Mulki et al 2019)are collected from the timelines of politicians socialpo-litical activists and TV anchors and does not contain anycommentsSimilarly we also used the Deleted Comments7 Dataset(further refereed as DCD) from Aljazeeranet(Mubarak et al 2017) to study how a offensive detectionmodel perform in a different platforms

Preprocessing: To prepare the data for classification, we first tokenized the input sentences, removing punctuation, URLs, stop words, and diacritics present in the text. We inserted whitespace in cases where there was no separation between adjacent words and emojis. For our study, we kept the emojis and hashtags because of the contextual information they carry.
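A minimal sketch of this preprocessing step; the regular expressions, the diacritics (tashkeel) range, and the stop-word handling are our assumptions for illustration, not the authors' exact implementation:

```python
import re

# Assumed patterns; not the paper's exact pipeline.
DIACRITICS = re.compile(r"[\u064B-\u0652]")          # Arabic tashkeel range
URLS = re.compile(r"https?://\S+|www\.\S+")
EMOJI = re.compile(r"([\U0001F300-\U0001FAFF])")
# Keep word characters (incl. Arabic letters), whitespace, '#', and emojis.
PUNCT = re.compile(r"[^\w\s#\U0001F300-\U0001FAFF]")

def preprocess(comment: str, stop_words=frozenset()) -> str:
    """Remove URLs, diacritics, and punctuation; insert whitespace around
    emojis glued to words; drop stop words; keep emojis and hashtags."""
    text = URLS.sub(" ", comment)
    text = DIACRITICS.sub("", text)
    text = EMOJI.sub(r" \1 ", text)      # separate emoji from adjacent words
    text = PUNCT.sub(" ", text)
    return " ".join(t for t in text.split() if t not in stop_words)
```

For example, `preprocess("Check https://t.co/abc you 😀dog!!")` yields `"Check you 😀 dog"`, with the URL and punctuation stripped but the emoji retained as its own token.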

5.2. Classification Design

As a baseline for the MPOLD dataset, we conducted experiments using a classical machine learning algorithm, the SVM. The SVM algorithm (Platt, 1998) is based on the Structural Risk Minimization principle from computational learning theory. The algorithm is a proven universal learner, well known for its ability to learn independently of the dimensionality of the feature space and of the class distribution of the training dataset. These properties make it one of the most popular supervised classification

[5] Here representing that the contents are not news posts.
[6] http://alt.qcri.org/~hmubarak/offensive/TweetClassification-Summary.xlsx
[7] http://alt.qcri.org/~hmubarak/offensive/AJCommentsClassification-CF.xlsx


Abbr     Dataset Description                              Labels                  #I (%OFF)        avg. len   Used for
MPOLD    News comments from TW, FB, and YT                OFF, NOT
                                                          (V/HS/Oth: 180/185/417)  4,000 (16.9%)   22.8       train/test
ETCD     100 Egyptian tweets and corresponding comments   OBs, OFF-OBs, NOT        1,100 (≈59.0%)  13.5       test
L-HSAB   Levantine tweets                                 Abs, HS, NOT             5,846 (≈37.6%)  12.0       test
DCD      Deleted comments from a news website             OBs, OFF-OBs, NOT       31,692 (≈8.2%)   17.4       train/test

Table 4: Details of the datasets used for the study; ETCD and L-HSAB are the out-of-domain datasets. #I represents the number of instances, and %OFF the share of offensive labels used in this study. In the case of ETCD and DCD, we merged the offensive comments, obscene (OBs) and other offensive (OFF-OBs), into OFF to match our labels. Similarly, we merged HS and Abs (abusive) in the L-HSAB dataset into OFF. avg. len represents the average length of the corresponding content in number of words. V: Vulgar, HS: Hate speech, Oth: Other offensive categories.

methods, and thus our choice as the baseline for all the experiments conducted in this study.

To train the classifier, we transformed the online comments into bag-of-character-n-grams vectors weighted with logarithmic term frequencies (tf) multiplied by inverse document frequencies (idf). For this, we created character n-grams using characters inside word boundaries only and padded the edge characters with whitespace. To utilize contextual information, we extracted n-gram features with n = 1 to n = 5.
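This feature design maps closely onto scikit-learn's `char_wb` analyzer, which builds character n-grams only inside word boundaries and pads edges with whitespace; `sublinear_tf` gives the logarithmic tf, and idf weighting is on by default. A sketch, with `LinearSVC` standing in for the SVM and invented toy data for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# char_wb: character n-grams built only inside word boundaries, edges
# padded with whitespace -- matching the feature design described above.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 5),
                             sublinear_tf=True)
clf = make_pipeline(vectorizer, LinearSVC())

# Hypothetical toy comments and labels, purely for illustration.
X = ["you are a dog", "great article thanks", "shut up idiot", "well said"]
y = ["OFF", "NOT", "OFF", "NOT"]
clf.fit(X, y)
pred = clf.predict(["thanks a lot"])
```

Character n-grams are a common choice for dialectal Arabic, since they are robust to the spelling variation and creative obfuscation found in user comments.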

5.3. Performance Evaluation

To measure classification performance, we report the macro-averaged F1-measure (Fm), calculated from macro-averaged precision and recall, i.e., the average of P and R over both classes. The motivation behind this choice is the imbalanced class distribution, which makes well-known measures such as accuracy and micro-averaged F1 unrepresentative of performance. Since the performance on both classes is of interest, we also report the F-measure of each individual class.
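For illustration, the macro-averaged F1 described above is the unweighted mean of the per-class F1 scores, which is why it is not dominated by the majority (NOT) class; the labels below are invented for the example:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions for a two-class task.
y_true = ["NOT", "NOT", "OFF", "OFF", "NOT", "OFF"]
y_pred = ["NOT", "OFF", "OFF", "NOT", "NOT", "OFF"]

# Per-class F1, then their unweighted mean (macro-averaging):
f_not, f_off = f1_score(y_true, y_pred, average=None, labels=["NOT", "OFF"])
fm = f1_score(y_true, y_pred, average="macro")
assert abs(fm - (f_not + f_off) / 2) < 1e-9
```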

          Fm    NOT   OFF
Chance    0.49  0.83  0.14
Lex       0.53  0.84  0.21
SVM       0.74  0.92  0.56

Table 5: Reported performance on the MPOLD dataset using 5-fold cross-validation (SVM, with chance and lexicon (Lex) baselines).

5.4. Results

MPOLD performance: To evaluate classification performance, we performed five-fold[8] cross-validation on the dataset, maintaining the natural distribution of the classes in each fold. The performance on all instances is shown in Table 5.

[8] We chose 5 instead of 10 folds to ensure a good sample size and distribution of both labels in each test fold.
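Folds that preserve the natural class distribution correspond to stratified k-fold splitting; a sketch with an invented 25%-offensive toy label set:

```python
from sklearn.model_selection import StratifiedKFold

# Toy label set: 5 OFF and 15 NOT instances (25% offensive), invented
# for illustration of distribution-preserving folds.
y = ["OFF"] * 5 + ["NOT"] * 15
X = list(range(len(y)))

# StratifiedKFold keeps the OFF/NOT ratio (approximately) equal per fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_ratios = []
for _, test_idx in skf.split(X, y):
    labels = [y[i] for i in test_idx]
    fold_ratios.append(labels.count("OFF") / len(labels))
```

With class counts divisible by the number of folds, as here, every test fold contains exactly one OFF and three NOT instances.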

Testset   Trained on       Fm    NOT   OFF
FB        DCD              0.34  0.43  0.25
          TW               0.62  0.90  0.34
          YT               0.62  0.92  0.31
          TW + YT          0.68  0.93  0.44
          TW + YT + DCD    0.78  0.94  0.63
TW        DCD              0.52  0.51  0.53
          FB               0.51  0.83  0.19
          YT               0.54  0.82  0.25
          FB + YT          0.59  0.84  0.35
          FB + YT + DCD    0.84  0.90  0.78
YT        DCD              0.44  0.67  0.22
          FB               0.53  0.96  0.10
          TW               0.60  0.94  0.27
          TW + FB          0.63  0.95  0.31
          TW + FB + DCD    0.82  0.97  0.67
DCD       TW               0.37  0.39  0.34
          FB               0.21  0.31  0.10
          YT               0.29  0.26  0.32
          FB + YT + TW     0.40  0.35  0.45

Table 6: F-measure performance per platform-wise dataset. The test sets presented here are subsets of the MPOLD data; for instance, TW ∈ MPOLD is the data extracted from Twitter. Similarly, YT represents YouTube data and FB represents comments from Facebook. DCD: Deleted Comments Dataset; NOT: not-offensive and OFF: offensive instances. Rows whose training data includes DCD show the results when the platform-wise data in our dataset is modelled together with the online-news-platform data (DCD).

To contextualize the obtained results, we report a chance-level baseline based on the prior distribution of the labels in each fold. In addition, we present a simple lexicon-based baseline (Lex) using a list of possibly offensive words and phrases. This word list combines the obscene words presented in (Mubarak et al., 2017) with filtered entries from the Hatebase (Hatebase.org) lexicon. From Table 5, we observe that the SVM performs significantly better than both the lexicon and chance-level baselines.
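A minimal version of such a lexicon baseline might look as follows; the word list here is an invented stand-in for the merged Mubarak et al. (2017) and Hatebase lists:

```python
# Hypothetical mini-lexicon; the paper's list merges obscene words from
# Mubarak et al. (2017) with filtered Hatebase entries.
OFFENSIVE_LEXICON = {"dog", "idiot", "filthy"}

def lex_baseline(comment: str) -> str:
    """Flag a comment as OFF if any of its tokens matches the lexicon."""
    tokens = comment.lower().split()
    return "OFF" if any(t in OFFENSIVE_LEXICON for t in tokens) else "NOT"
```

Such a baseline has high precision on explicit insults but, as the low OFF F-measure in Table 5 suggests, misses offense expressed without lexicon words.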

Platform-specific performance: To assess the generalization capabilities of the designed model, we evaluated


Testset   Model    Fm    NOT   OFF
ETCD      Lex      0.51  0.58  0.44
          DCD      0.64  0.76  0.53
          MPOLD    0.61  0.66  0.56
          MPOLD+   0.68  0.61  0.75
L-HSAB    Lex      0.46  0.74  0.18
          DCD      0.59  0.60  0.58
          MPOLD    0.64  0.77  0.51
          MPOLD+   0.62  0.64  0.59

Table 7: Reported F-measures of the models trained on DCD, MPOLD, and MPOLD + DCD (MPOLD+), tested on the out-of-domain test sets: (a) ETCD, the controversial Egyptian tweets and comments dataset, and (b) L-HSAB, the Levantine tweets dataset. Lex is the lexicon-based baseline for the test sets.

the models using leave-one-platform-dataset-out (LOPO) cross-validation. The performance of the models on data from each social media platform is presented in Table 6. Besides the three platforms in our dataset, we also included DCD, deleted comments from an online news portal, as a fourth platform in this experiment.
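The LOPO protocol can be sketched as a grouping step over (platform, text, label) records; the record layout is an assumption for illustration:

```python
from collections import defaultdict

def lopo_splits(records):
    """Yield (held_out_platform, train, test) triples: the model is
    trained on all platforms except one and tested on the held-out one."""
    by_platform = defaultdict(list)
    for rec in records:                 # rec = (platform, text, label)
        by_platform[rec[0]].append(rec)
    for held_out in by_platform:
        test = by_platform[held_out]
        train = [r for p, rs in by_platform.items()
                 if p != held_out for r in rs]
        yield held_out, train, test
```

Each yielded split simulates deploying a model on a platform it never saw during training, which is exactly the generalization question Table 6 answers.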

We observed that, in most cases, models trained on one particular platform perform poorly compared to models trained on multi-platform datasets. These multi-platform models include combinations of (a) two platforms from the MPOLD data, or (b) two platforms from MPOLD plus the DCD dataset.

For instance, the model trained on TW + YT performs significantly better on the FB test set than the individual YT and TW models. The same pattern is observed when the performance of DCD is compared with that of the combined (YT + TW) model; in this case, a significant jump in the OFF class is noticed, which increases the overall macro-averaged F-measure (Fm). This indicates the importance of multi-platform datasets for the generalization of offensive language detection models.

Beyond being a different platform in a similar domain, the DCD dataset brings two further advantages for training the model: (a) it adds comments written in MSA alongside DA, and (b) it increases the number of offensive examples available for training. The MPOLD comments are mostly dialectal Arabic; adding DCD therefore provides MSA examples for the model to recognise and learn from, while at the same time increasing the number of (not-)offensive training examples. The results of the combined-platform models (e.g., TW + YT + DCD) thus indicate the impact of both the variety of examples and the size of the training data.

However, comparing the results of the individual models, trained only on DCD, TW, YT, or FB, we observed that even though DCD has a large number of training instances, this model cannot outperform TW/YT/FB alone in most cases. This again highlights the inadequacy of model generalization when training on one platform and testing on another.

Cross-domain performance: The performance of the models trained on MPOLD and MPOLD + DCD (MPOLD+) across domains is presented in Table 7. The results indicate that for the ETCD dataset, both our models (MPOLD and MPOLD+) outperform the baseline and the DCD model on OFF-class F-measure. The overall best performance is obtained with MPOLD+, an improvement of 7 and 4 absolute points over the MPOLD and DCD models, respectively. As for the L-HSAB data, the MPOLD model outperforms all other models in terms of overall F-measure; however, for the offensive class (OFF) alone, the best performance is obtained by MPOLD+. The performance of the MPOLD and MPOLD+ models on both the ETCD and L-HSAB datasets indicates the positive impact of a multi-platform training dataset and its cross-domain generalizability.

6. Conclusion

In this paper, we introduced one of the first dialectal Arabic comment datasets extracted from multiple social media platforms. Utilizing this dataset, we designed classification experiments to study the impact of multi-platform data on platform-wise model generalization, evaluating the model with platform-wise cross-validation. In addition to our dataset, we exploited existing publicly available datasets, including a deleted-comments dataset from a news website. We observed that combined-platform models perform significantly better than most individual models, indicating the importance of such multi-platform datasets.

Using our designed models, we also evaluated other out-of-domain data (ETCD and L-HSAB) extracted from TW. Besides being out-of-domain, these datasets let us examine how our models perform on specifically dialectal Arabic content, i.e., ETCD (Egyptian dialect) and L-HSAB (Levantine dialect). Our findings indicate that a model trained on the multi-platform news comment dataset has the potential to capture diversity across dialects and domains. In addition to evaluating the generalization power of the models, we presented an in-depth analysis of emoji usage in offensive comments; our findings suggest that emojis in the animal category are exploited in offensive comments, mirroring our lexical observations.

To the best of our knowledge, this is one of the first studies to explore the effect of multi-platform datasets on offensive language detection across platforms, across domains, and across the dialectal varieties present in Arabic. In future work, we plan to extend the study by introducing a larger dataset with further fine-grained classification and content analysis.

7. Bibliographical References

Agarwal, S. and Sureka, A. (2014). A focused crawler for mining hate and extremism promoting videos on YouTube. In Proceedings of the 25th ACM Conference on Hypertext and Social Media, pages 294-296. ACM.

Al-Ajlan, M. A. and Ykhlef, M. (2018). Optimized Twitter cyberbullying detection based on deep learning. In 2018 21st Saudi Computer Society National Computer Conference (NCC), pages 1-5. IEEE.


Alakrot, A., Murray, L., and Nikolov, N. S. (2018). Dataset construction for the detection of anti-social behaviour in online communication in Arabic. Procedia Computer Science, 142:174-181.

Albadi, N., Kurdi, M., and Mishra, S. (2018). Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69-76. IEEE.

Badjatiya, P., Gupta, S., Gupta, M., and Varma, V. (2017). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759-760. International World Wide Web Conferences Steering Committee.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017a). Hate is not binary: Studying abusive behavior of gamergate on Twitter. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 65-74. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017b). Mean birds: Detecting aggression and bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, pages 13-22. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017c). Measuring gamergate: A tale of hate, sexism, and bullying. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 1285-1290. International World Wide Web Conferences Steering Committee.

Conover, M. D., Ratkiewicz, J., Francisco, M., Goncalves, B., Menczer, F., and Flammini, A. (2011). Political polarization on Twitter. In Fifth International AAAI Conference on Weblogs and Social Media.

Dadvar, M., Trieschnigg, D., Ordelman, R., and de Jong, F. (2013). Improving cyberbullying detection with user context. In European Conference on Information Retrieval, pages 693-696. Springer.

Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.

Difallah, D., Filatova, E., and Ipeirotis, P. (2018). Demographics and dynamics of Mechanical Turk workers. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 135-143. ACM.

ElSherief, M., Nilizadeh, S., Nguyen, D., Vigna, G., and Belding, E. (2018). Peer to peer hate: Hate speech instigators and their targets. In Twelfth International AAAI Conference on Web and Social Media.

Falotico, R. and Quatto, P. (2015). Fleiss' kappa statistic without paradoxes. Quality & Quantity, 49(2):463-470.

Gulacti, F. (2010). The effect of perceived social support on subjective well-being. Procedia - Social and Behavioral Sciences, 2(2):3844-3849.

Hosseinmardi, H., Mattson, S. A., Rafiq, R. I., Han, R., Lv, Q., and Mishra, S. (2015). Analyzing labeled cyberbullying incidents on the Instagram social network. In International Conference on Social Informatics, pages 49-66. Springer.

Intapong, P., Charoenpit, S., Achalakul, T., and Ohkura, M. (2017). Assessing symptoms of excessive SNS usage based on user behavior and emotion: Analysis of data obtained by SNS APIs. In 9th International Conference on Social Computing and Social Media, SCSM 2017, held as part of the 19th International Conference on Human-Computer Interaction, HCI International 2017, pages 71-83. Springer Verlag.

Jay, T. and Janschewitz, K. (2008). The pragmatics of swearing. Journal of Politeness Research. Language, Behaviour, Culture, 4(2):267-288.

Kumar, S., Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2018). Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933-943. International World Wide Web Conferences Steering Committee.

Mondal, M., Silva, L. A., and Benevenuto, F. (2017). A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 85-94. ACM.

Mubarak, H. and Darwish, K. (2019). Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269-276. Springer.

Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52-56.

Mulki, H., Haddad, H., Ali, C. B., and Alshabani, H. (2019). L-HSAB: A Levantine Twitter dataset for hate speech and abusive language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111-118.

Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I. (2017). Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1125-1135.

Pitsilis, G. K., Ramampiaro, H., and Langseth, H. (2018). Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433.

Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines.

Rish, I. et al. (2001). An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, pages 41-46.

Ross, J., Zaldivar, A., Irani, L., and Tomlinson, B. (2009). Who are the Turkers? Worker demographics in Amazon Mechanical Turk. Department of Informatics, University of California, Irvine, USA, Tech. Rep.

Salminen, J., Almerekhi, H., Milenkovic, M., Jung, S.-g., An, J., Kwak, H., and Jansen, B. J. (2018). Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. In Proceedings of The International AAAI Conference on Web and Social Media (ICWSM 2018), San Francisco, California, USA, June.

Salminen, J., Hopf, M., Chowdhury, S. A., Jung, S.-g., Almerekhi, H., and Jansen, B. J. (2020). Developing an online hate classifier for multiple social media platforms. Human-centric Computing and Information Sciences, 10(1).

Silva, L., Mondal, M., Correa, D., Benevenuto, F., and Weber, I. (2016). Analyzing the targets of hate in online social media. In Tenth International AAAI Conference on Web and Social Media.

Unsvag, E. F. and Gamback, B. (2018). The effects of user features on Twitter hate speech detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 75-85.

Waldron, J. (2012). The harm in hate speech. Harvard University Press.

Wiegand, M., Siegel, M., and Ruppenhofer, J. (2018). Overview of the GermEval 2018 shared task on the identification of offensive language.

Zhang, Z., Robinson, D., and Tepper, J. (2018). Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In European Semantic Web Conference, pages 745-760. Springer.




Gulactı F (2010) The effect of perceived social supporton subjective well-being Procedia-Social and Behav-ioral Sciences 2(2)3844ndash3849

Hosseinmardi H Mattson S A Rafiq R I Han R LvQ and Mishra S (2015) Analyzing labeled cyberbul-

lying incidents on the instagram social network In Inter-national conference on social informatics pages 49ndash66Springer

Intapong P Charoenpit S Achalakul T and OhkuraM (2017) Assessing symptoms of excessive sns usagebased on user behavior and emotion analysis of data ob-tained by sns apis In 9th International Conference onSocial Computing and Social Media SCSM 2017 heldas part of the 19th International Conference on Human-Computer Interaction HCI International 2017 pages71ndash83 Springer Verlag

Jay T and Janschewitz K (2008) The pragmatics ofswearing Journal of Politeness Research LanguageBehaviour Culture 4(2)267ndash288

Kumar S Hamilton W L Leskovec J and Jurafsky D(2018) Community interaction and conflict on the webIn Proceedings of the 2018 World Wide Web Conferencepages 933ndash943 International World Wide Web Confer-ences Steering Committee

Mondal M Silva L A and Benevenuto F (2017) Ameasurement study of hate speech in social media InProceedings of the 28th ACM Conference on Hypertextand Social Media pages 85ndash94 ACM

Mubarak H and Darwish K (2019) Arabic offensivelanguage classification on twitter In International Con-ference on Social Informatics pages 269ndash276 Springer

Mubarak H Darwish K and Magdy W (2017) Abu-sive language detection on arabic social media In Pro-ceedings of the First Workshop on Abusive Language On-line pages 52ndash56

Mulki H Haddad H Ali C B and Alshabani H(2019) L-hsab A levantine twitter dataset for hatespeech and abusive language In Proceedings of theThird Workshop on Abusive Language Online pages111ndash118

Pavlopoulos J Malakasiotis P and Androutsopoulos I(2017) Deeper attention to abusive user content mod-eration In Proceedings of the 2017 Conference on Em-pirical Methods in Natural Language Processing pages1125ndash1135

Pitsilis G K Ramampiaro H and Langseth H (2018)Detecting offensive language in tweets using deep learn-ing arXiv preprint arXiv180104433

Platt J (1998) Sequential minimal optimization A fastalgorithm for training support vector machines

Rish I et al (2001) An empirical study of the naive bayesclassifier In IJCAI 2001 workshop on empirical methodsin artificial intelligence volume 3 pages 41ndash46

Ross J Zaldivar A Irani L and Tomlinson B (2009)Who are the turkers worker demographics in amazonmechanical turk Department of Informatics Universityof California Irvine USA Tech Rep

Salminen J Almerekhi H Milenkovic M Jung S-gAn J Kwak H and Jansen B J (2018) Anatomyof Online Hate Developing a Taxonomy and MachineLearning Models for Identifying and Classifying Hatein Online News Media In Proceedings of The Inter-national AAAI Conference on Web and Social Media(ICWSM 2018) San Francisco California USA June

6212

Salminen J Hopf M Chowdhury S A Jung S-gAlmerekhi H and Jansen B J (2020) Developingan online hate classifier for multiple social media plat-forms Human-centric Computing and Information Sci-ences 10(1)

Silva L Mondal M Correa D Benevenuto F and We-ber I (2016) Analyzing the targets of hate in onlinesocial media In Tenth International AAAI Conferenceon Web and Social Media

Unsvag E F and Gamback B (2018) The effects of userfeatures on twitter hate speech detection In Proceed-ings of the 2nd Workshop on Abusive Language Online(ALW2) pages 75ndash85

Waldron J (2012) The harm in hate speech HarvardUniversity Press

Wiegand M Siegel M and Ruppenhofer J (2018)Overview of the germeval 2018 shared task on the iden-tification of offensive language

Zhang Z Robinson D and Tepper J (2018) Detect-ing hate speech on twitter using a convolution-gru baseddeep neural network In European Semantic Web Con-ference pages 745ndash760 Springer

  • Introduction
  • Related Studies
  • Data
    • Data Collection
    • Data Preparation and Selection
    • Annotation Guideline
    • Annotations via Crowdsourcing
    • Annotation Results
      • Data Analysis
        • Lexical Analysis
        • Use of Emojis in Offensive Language
          • Experiments and Results
            • Datasets
            • Classification Design
            • Performance Evaluation
            • Results
              • Conclusion
              • Bibliographical References
Page 7: A Multi-Platform Arabic News Comment Dataset for Offensive … · 2020. 5. 22. · Therefore, in this paper, we introduce and make publicly available a new dialectal Arabic news comment

6209

Abbr     Dataset Description                                Labels                I (%OFF)         avg len   Used for
MPOLD    News comments from TW, FB, and YT                  OFF, NOT              4,000 (16.9)     22.8      train/test
         (fine-grained: V 180, HS 185, Oth 417)
ETCD     100 Egyptian tweets and corresponding comments     OBs, OFF-OBs, NOT     1,100 (≈59.0)    13.5      test
L-HSAB   Levantine tweets                                   Abs, HS, NOT          5,846 (≈37.6)    12.0      test
DCD      Deleted comments from a news website               OBs, OFF-OBs, NOT    31,692 (≈8.2)     17.4      train/test

Table 4: Details of the datasets used in the study; ETCD and L-HSAB are the out-of-domain datasets. I represents the number of instances, and %OFF the percentage of offensive labels used in this study. In the case of ETCD and DCD, we merged the offensive comments, obscene (OBs) and other offensive comments (OFF-OBs), into OFF to match our labels. Similarly, we merged HS and Abs (abusive) into OFF for the L-HSAB dataset. avg len represents the average length of the corresponding content in number of words. V = vulgar; HS = hate speech; Oth = other offensive categories.

methods, and thus our choice for the baseline and all the experiments conducted in this study.
To train the classifier, we transform the online comments into bag-of-character-n-grams vectors weighted with logarithmic term frequencies (tf) multiplied by inverse document frequencies (idf). For this, we created character n-grams using characters inside word boundaries only and padded the edge characters with whitespace. To utilize contextual information in the n-grams, we extracted features using n = 1 to n = 5.
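The n-gram step described above (character n-grams restricted to word boundaries, edge characters padded with whitespace, n = 1..5) can be sketched as follows; the function name is ours, and scikit-learn's TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 5), sublinear_tf=True) offers an equivalent off-the-shelf pipeline with the logarithmic tf weighting.

```python
def char_wb_ngrams(text, n_min=1, n_max=5):
    """Character n-grams taken inside word boundaries only: each word is
    padded with a single whitespace on both edges before extraction, so
    no n-gram spans two words."""
    grams = []
    for word in text.split():
        padded = " " + word + " "
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams
```

The resulting n-gram counts would then be tf-idf weighted and fed to the classifier.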

5.3. Performance Evaluation
To measure classification performance, we report the macro-averaged F1-measure (Fm), calculated from macro-averaged precision and recall, i.e., the average of P and R over both classes, respectively. The motivation behind this choice is the imbalanced class distribution, which makes well-known measures such as accuracy and micro-averaged F1 not representative of the performance. Since the performance on both classes is of interest, we also report the F-measure of the individual classes.
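A minimal sketch of this measure as described, Fm computed from macro-averaged precision and recall over the two classes (function and label names are illustrative):

```python
def macro_f1(y_true, y_pred, labels=("NOT", "OFF")):
    """Fm: harmonic mean of macro-averaged precision and recall,
    where P and R are first averaged over both classes."""
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    P = sum(precisions) / len(labels)  # macro-averaged precision
    R = sum(recalls) / len(labels)     # macro-averaged recall
    return 2 * P * R / (P + R) if P + R else 0.0
```

Note that this differs slightly from averaging the per-class F1 scores; P and R are averaged first, then combined.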

Model     Fm     NOT    OFF
Chance    0.49   0.83   0.14
Lex       0.53   0.84   0.21
SVM       0.74   0.92   0.56

Table 5: Reported performance on the MPOLD dataset using 5-fold cross-validation (SVM, with the chance and lexicon baselines).

5.4. Results
MPOLD performance: To evaluate classification performance, we performed five-fold cross-validation on the dataset, maintaining the natural distribution of the classes in each fold. The performance on all instances is shown in Table 5.

Footnote: We chose 5 rather than 10 as the number of folds to ensure a good sample size and distribution of both labels in each test fold.
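A stratified split that maintains the natural class distribution in each fold can be sketched as a simple per-class round-robin (scikit-learn's StratifiedKFold is the standard alternative; this helper is illustrative):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign instance indices to k folds so that each fold keeps
    (approximately) the natural distribution of the labels."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # deal each class's instances round-robin across the folds
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds
```

With an imbalanced label set, every fold then contains roughly the same offensive ratio as the full dataset.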

Test set: FB
  Trained on        Fm     NOT    OFF
  DCD               0.34   0.43   0.25
  TW                0.62   0.90   0.34
  YT                0.62   0.92   0.31
  TW + YT           0.68   0.93   0.44
  TW + YT + DCD     0.78   0.94   0.63

Test set: TW
  Trained on        Fm     NOT    OFF
  DCD               0.52   0.51   0.53
  FB                0.51   0.83   0.19
  YT                0.54   0.82   0.25
  FB + YT           0.59   0.84   0.35
  FB + YT + DCD     0.84   0.90   0.78

Test set: YT
  Trained on        Fm     NOT    OFF
  DCD               0.44   0.67   0.22
  FB                0.53   0.96   0.10
  TW                0.60   0.94   0.27
  TW + FB           0.63   0.95   0.31
  TW + FB + DCD     0.82   0.97   0.67

Test set: DCD
  Trained on        Fm     NOT    OFF
  TW                0.37   0.39   0.34
  FB                0.21   0.31   0.10
  YT                0.29   0.26   0.32
  FB + YT + TW      0.40   0.35   0.45

Table 6: F-measure performance for platform-wise datasets. The test sets presented here are subsets of the MPOLD data; for instance, TW ∈ MPOLD is the subset extracted from Twitter. Similarly, YT represents YouTube data and FB comments from Facebook. DCD = deleted comments dataset; NOT = not-offensive; OFF = offensive instances. The rows whose training set includes DCD (e.g., TW + YT + DCD) present the results when the platform-wise data in our dataset is modelled along with the online-news-platform data.

To contextualize the obtained results, we report a chance-level baseline based on the prior distribution of the labels in each fold. In addition, we present a simple lexicon-based baseline (Lex) using a list of possible offensive words and phrases. This word list combines the obscene words presented in (Mubarak et al., 2017) and filtered entries from the Hatebase (Hatebase.org) lexicon. From Table 5, we observe that the SVM performs significantly better than both the lexicon and chance-level baselines.
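The lexicon baseline amounts to flagging a comment as offensive whenever it contains an entry from the word/phrase list. A minimal sketch with a toy English lexicon (the actual list combines Arabic terms from Mubarak et al. (2017) and Hatebase entries):

```python
def lexicon_baseline(comment, lexicon):
    """Label a comment OFF if any lexicon word or phrase occurs in it
    as a whole-word match; otherwise NOT."""
    # normalize whitespace and pad so every word is space-delimited
    padded = " " + " ".join(comment.split()) + " "
    return "OFF" if any(" " + term + " " in padded for term in lexicon) else "NOT"
```

Whole-word matching avoids flagging substrings inside otherwise benign words.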

Test set: ETCD
  Model     Fm     NOT    OFF
  Lex       0.51   0.58   0.44
  DCD       0.64   0.76   0.53
  MPOLD     0.61   0.66   0.56
  MPOLD+    0.68   0.61   0.75

Test set: L-HSAB
  Model     Fm     NOT    OFF
  Lex       0.46   0.74   0.18
  DCD       0.59   0.60   0.58
  MPOLD     0.64   0.77   0.51
  MPOLD+    0.62   0.64   0.59

Table 7: Reported F-measures of the models trained using DCD, MPOLD, and MPOLD + DCD (MPOLD+), tested on the out-of-domain test sets: (a) ETCD, the controversial Egyptian tweets-and-comments dataset, and (b) L-HSAB, the Levantine tweets dataset. Lex is the lexicon-based baseline for the test sets.

Platform-specific performance: To assess the generalization capabilities of the designed models, we evaluated them using leave-one-platform-dataset-out (LOPO) cross-validation. The performance of the models on each social media platform's data is presented in Table 6. Besides the three platforms in our dataset, for this experiment we also included DCD, the deleted comments from an online news portal, as a fourth platform.

We observed that, in most cases, models trained on one particular platform perform poorly compared to models trained on multi-platform datasets. These multi-platform models include combinations of (a) two platforms from the MPOLD data, or (b) two platforms from MPOLD plus the DCD dataset.

For instance, the model trained on TW + YT performs significantly better on the FB test set than the individual YT and TW models. The same pattern is observed when DCD is added to the (YT + TW) combination: a significant jump for the OFF class is noticed, increasing the overall macro-averaged F-measure (Fm). This indicates the importance of multi-platform datasets for the generalization of the offensive language detection model.

In addition to being data from a different platform of a similar domain, the DCD dataset brings two further advantages for training the model: (a) the addition of comments written in MSA along with DA, and (b) an increase in the number of offensive examples for model training. The MPOLD comments are mostly dialectal Arabic; adding DCD therefore provides examples of MSA comments for the model to recognize and learn, while at the same time increasing the number of (not-)offensive training examples. Thus, the results of the combined-platform models (e.g., TW + YT + DCD) indicate the impact of both the variety of examples and the size of the training data.

However, by comparing the results of the individual models, trained on only DCD, TW, YT, or FB, we observed that even though DCD has a large number of training instances, this model cannot outperform TW/YT/FB alone in most cases. This again highlights the inadequacy of model generalization when training on one platform and testing on another.

Cross-domain performance: The performance of the models trained on MPOLD and on MPOLD + DCD (MPOLD+) across domains is presented in Table 7. The results indicate that, for the ETCD dataset, both our models (MPOLD and MPOLD+) outperform the baseline and the DCD model on OFF-class F-measure. The overall best performance is obtained using MPOLD+, with an improvement of 7 and 4 points over the MPOLD and DCD models, respectively. As for the L-HSAB data, the MPOLD model outperforms all other models in terms of overall F-measure; however, for the offensive class (OFF), the best performance is obtained by MPOLD+. The performance of the MPOLD and MPOLD+ models on both the ETCD and L-HSAB datasets indicates the positive impact of a multi-platform training dataset and implies cross-domain generalizability.

6. Conclusion
In this paper, we introduced one of the first dialectal Arabic comment datasets extracted from multiple social media platforms. Utilizing this dataset, we designed classification experiments to study the impact of multi-platform data on platform-wise model generalization, evaluating the models using platform-wise cross-validation. In addition to our dataset, we also exploited existing publicly available datasets, including a dataset of deleted comments from a news website. We observed that combined-platform models perform significantly better than most individual models, indicating the importance of such a multi-platform dataset.

Using our designed models, we also evaluated out-of-domain data (ETCD and L-HSAB) extracted from TW. Besides being out-of-domain, another advantage of evaluating these datasets is to see how our models perform on particular DA content, i.e., ETCD (Egyptian dialect) and L-HSAB (Levantine dialect). Our findings indicate that the model trained on the multi-platform news comment dataset has the potential to capture diversity across dialects and domains. In addition to evaluating the generalization power of the models, we also presented an in-depth analysis of emoji usage in offensive comments. Our findings suggest that emojis in the animal category are exploited in offensive comments, in line with our lexical observations.

To the best of our knowledge, this is one of the first studies to explore the effect of multi-platform datasets for offensive language detection across platforms, across domains, and across the dialectal varieties present in Arabic. In future work, we plan to extend the study by introducing a larger dataset with further fine-grained classification and content analysis.

7. Bibliographical References

Agarwal, S. and Sureka, A. (2014). A focused crawler for mining hate and extremism promoting videos on YouTube. In Proceedings of the 25th ACM Conference on Hypertext and Social Media, pages 294–296. ACM.

Al-Ajlan, M. A. and Ykhlef, M. (2018). Optimized Twitter cyberbullying detection based on deep learning. In 2018 21st Saudi Computer Society National Computer Conference (NCC), pages 1–5. IEEE.

Alakrot, A., Murray, L., and Nikolov, N. S. (2018). Dataset construction for the detection of anti-social behaviour in online communication in Arabic. Procedia Computer Science, 142:174–181.

Albadi, N., Kurdi, M., and Mishra, S. (2018). Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76. IEEE.

Badjatiya, P., Gupta, S., Gupta, M., and Varma, V. (2017). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017a). Hate is not binary: Studying abusive behavior of #GamerGate on Twitter. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 65–74. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017b). Mean birds: Detecting aggression and bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, pages 13–22. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017c). Measuring #GamerGate: A tale of hate, sexism, and bullying. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 1285–1290. International World Wide Web Conferences Steering Committee.

Conover, M. D., Ratkiewicz, J., Francisco, M., Gonçalves, B., Menczer, F., and Flammini, A. (2011). Political polarization on Twitter. In Fifth International AAAI Conference on Weblogs and Social Media.

Dadvar, M., Trieschnigg, D., Ordelman, R., and de Jong, F. (2013). Improving cyberbullying detection with user context. In European Conference on Information Retrieval, pages 693–696. Springer.

Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.

Difallah, D., Filatova, E., and Ipeirotis, P. (2018). Demographics and dynamics of Mechanical Turk workers. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 135–143. ACM.

ElSherief, M., Nilizadeh, S., Nguyen, D., Vigna, G., and Belding, E. (2018). Peer to peer hate: Hate speech instigators and their targets. In Twelfth International AAAI Conference on Web and Social Media.

Falotico, R. and Quatto, P. (2015). Fleiss' kappa statistic without paradoxes. Quality & Quantity, 49(2):463–470.

Gülaçtı, F. (2010). The effect of perceived social support on subjective well-being. Procedia - Social and Behavioral Sciences, 2(2):3844–3849.

Hosseinmardi, H., Mattson, S. A., Rafiq, R. I., Han, R., Lv, Q., and Mishra, S. (2015). Analyzing labeled cyberbullying incidents on the Instagram social network. In International Conference on Social Informatics, pages 49–66. Springer.

Intapong, P., Charoenpit, S., Achalakul, T., and Ohkura, M. (2017). Assessing symptoms of excessive SNS usage based on user behavior and emotion: Analysis of data obtained by SNS APIs. In 9th International Conference on Social Computing and Social Media, SCSM 2017, held as part of the 19th International Conference on Human-Computer Interaction, HCI International 2017, pages 71–83. Springer Verlag.

Jay, T. and Janschewitz, K. (2008). The pragmatics of swearing. Journal of Politeness Research: Language, Behaviour, Culture, 4(2):267–288.

Kumar, S., Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2018). Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933–943. International World Wide Web Conferences Steering Committee.

Mondal, M., Silva, L. A., and Benevenuto, F. (2017). A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 85–94. ACM.

Mubarak, H. and Darwish, K. (2019). Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269–276. Springer.

Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Mulki, H., Haddad, H., Ali, C. B., and Alshabani, H. (2019). L-HSAB: A Levantine Twitter dataset for hate speech and abusive language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118.

Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I. (2017). Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1125–1135.

Pitsilis, G. K., Ramampiaro, H., and Langseth, H. (2018). Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433.

Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines.

Rish, I. et al. (2001). An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, pages 41–46.

Ross, J., Zaldivar, A., Irani, L., and Tomlinson, B. (2009). Who are the Turkers? Worker demographics in Amazon Mechanical Turk. Department of Informatics, University of California, Irvine, USA, Tech. Rep.

Salminen, J., Almerekhi, H., Milenković, M., Jung, S.-g., An, J., Kwak, H., and Jansen, B. J. (2018). Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM 2018), San Francisco, California, USA, June.

Salminen, J., Hopf, M., Chowdhury, S. A., Jung, S.-g., Almerekhi, H., and Jansen, B. J. (2020). Developing an online hate classifier for multiple social media platforms. Human-centric Computing and Information Sciences, 10(1).

Silva, L., Mondal, M., Correa, D., Benevenuto, F., and Weber, I. (2016). Analyzing the targets of hate in online social media. In Tenth International AAAI Conference on Web and Social Media.

Unsvåg, E. F. and Gambäck, B. (2018). The effects of user features on Twitter hate speech detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 75–85.

Waldron, J. (2012). The Harm in Hate Speech. Harvard University Press.

Wiegand, M., Siegel, M., and Ruppenhofer, J. (2018). Overview of the GermEval 2018 shared task on the identification of offensive language.

Zhang, Z., Robinson, D., and Tepper, J. (2018). Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In European Semantic Web Conference, pages 745–760. Springer.


6212

Salminen J Hopf M Chowdhury S A Jung S-gAlmerekhi H and Jansen B J (2020) Developingan online hate classifier for multiple social media plat-forms Human-centric Computing and Information Sci-ences 10(1)

Silva L Mondal M Correa D Benevenuto F and We-ber I (2016) Analyzing the targets of hate in onlinesocial media In Tenth International AAAI Conferenceon Web and Social Media

Unsvag E F and Gamback B (2018) The effects of userfeatures on twitter hate speech detection In Proceed-ings of the 2nd Workshop on Abusive Language Online(ALW2) pages 75ndash85

Waldron J (2012) The harm in hate speech HarvardUniversity Press

Wiegand M Siegel M and Ruppenhofer J (2018)Overview of the germeval 2018 shared task on the iden-tification of offensive language

Zhang Z Robinson D and Tepper J (2018) Detect-ing hate speech on twitter using a convolution-gru baseddeep neural network In European Semantic Web Con-ference pages 745ndash760 Springer

Alakrot, A., Murray, L., and Nikolov, N. S. (2018). Dataset construction for the detection of anti-social behaviour in online communication in Arabic. Procedia Computer Science, 142:174–181.

Albadi, N., Kurdi, M., and Mishra, S. (2018). Are they our brothers? Analysis and detection of religious hate speech in the Arabic Twittersphere. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 69–76. IEEE.

Badjatiya, P., Gupta, S., Gupta, M., and Varma, V. (2017). Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017a). Hate is not binary: Studying abusive behavior of gamergate on Twitter. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 65–74. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017b). Mean birds: Detecting aggression and bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, pages 13–22. ACM.

Chatzakou, D., Kourtellis, N., Blackburn, J., De Cristofaro, E., Stringhini, G., and Vakali, A. (2017c). Measuring gamergate: A tale of hate, sexism and bullying. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 1285–1290. International World Wide Web Conferences Steering Committee.

Conover, M. D., Ratkiewicz, J., Francisco, M., Gonçalves, B., Menczer, F., and Flammini, A. (2011). Political polarization on Twitter. In Fifth International AAAI Conference on Weblogs and Social Media.

Dadvar, M., Trieschnigg, D., Ordelman, R., and de Jong, F. (2013). Improving cyberbullying detection with user context. In European Conference on Information Retrieval, pages 693–696. Springer.

Davidson, T., Warmsley, D., Macy, M., and Weber, I. (2017). Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.

Difallah, D., Filatova, E., and Ipeirotis, P. (2018). Demographics and dynamics of Mechanical Turk workers. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 135–143. ACM.

ElSherief, M., Nilizadeh, S., Nguyen, D., Vigna, G., and Belding, E. (2018). Peer to peer hate: Hate speech instigators and their targets. In Twelfth International AAAI Conference on Web and Social Media.

Falotico, R. and Quatto, P. (2015). Fleiss' kappa statistic without paradoxes. Quality & Quantity, 49(2):463–470.

Gülaçtı, F. (2010). The effect of perceived social support on subjective well-being. Procedia - Social and Behavioral Sciences, 2(2):3844–3849.

Hosseinmardi, H., Mattson, S. A., Rafiq, R. I., Han, R., Lv, Q., and Mishra, S. (2015). Analyzing labeled cyberbullying incidents on the Instagram social network. In International Conference on Social Informatics, pages 49–66. Springer.

Intapong, P., Charoenpit, S., Achalakul, T., and Ohkura, M. (2017). Assessing symptoms of excessive SNS usage based on user behavior and emotion: Analysis of data obtained by SNS APIs. In 9th International Conference on Social Computing and Social Media, SCSM 2017, held as part of the 19th International Conference on Human-Computer Interaction, HCI International 2017, pages 71–83. Springer Verlag.

Jay, T. and Janschewitz, K. (2008). The pragmatics of swearing. Journal of Politeness Research. Language, Behaviour, Culture, 4(2):267–288.

Kumar, S., Hamilton, W. L., Leskovec, J., and Jurafsky, D. (2018). Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933–943. International World Wide Web Conferences Steering Committee.

Mondal, M., Silva, L. A., and Benevenuto, F. (2017). A measurement study of hate speech in social media. In Proceedings of the 28th ACM Conference on Hypertext and Social Media, pages 85–94. ACM.

Mubarak, H. and Darwish, K. (2019). Arabic offensive language classification on Twitter. In International Conference on Social Informatics, pages 269–276. Springer.

Mubarak, H., Darwish, K., and Magdy, W. (2017). Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Mulki, H., Haddad, H., Ali, C. B., and Alshabani, H. (2019). L-HSAB: A Levantine Twitter dataset for hate speech and abusive language. In Proceedings of the Third Workshop on Abusive Language Online, pages 111–118.

Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I. (2017). Deeper attention to abusive user content moderation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1125–1135.

Pitsilis, G. K., Ramampiaro, H., and Langseth, H. (2018). Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433.

Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines.

Rish, I. et al. (2001). An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, pages 41–46.

Ross, J., Zaldivar, A., Irani, L., and Tomlinson, B. (2009). Who are the Turkers? Worker demographics in Amazon Mechanical Turk. Department of Informatics, University of California, Irvine, USA, Tech. Rep.

Salminen, J., Almerekhi, H., Milenkovic, M., Jung, S.-g., An, J., Kwak, H., and Jansen, B. J. (2018). Anatomy of online hate: Developing a taxonomy and machine learning models for identifying and classifying hate in online news media. In Proceedings of The International AAAI Conference on Web and Social Media (ICWSM 2018), San Francisco, California, USA, June.

Salminen, J., Hopf, M., Chowdhury, S. A., Jung, S.-g., Almerekhi, H., and Jansen, B. J. (2020). Developing an online hate classifier for multiple social media platforms. Human-centric Computing and Information Sciences, 10(1).

Silva, L., Mondal, M., Correa, D., Benevenuto, F., and Weber, I. (2016). Analyzing the targets of hate in online social media. In Tenth International AAAI Conference on Web and Social Media.

Unsvåg, E. F. and Gambäck, B. (2018). The effects of user features on Twitter hate speech detection. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 75–85.

Waldron, J. (2012). The Harm in Hate Speech. Harvard University Press.

Wiegand, M., Siegel, M., and Ruppenhofer, J. (2018). Overview of the GermEval 2018 shared task on the identification of offensive language.

Zhang, Z., Robinson, D., and Tepper, J. (2018). Detecting hate speech on Twitter using a convolution-GRU based deep neural network. In European Semantic Web Conference, pages 745–760. Springer.
