
Transcript of arXiv:2007.05304v5 [stat.AP] 25 Sep 2020

Noname manuscript No. (will be inserted by the editor)

To BAN or Not to BAN: Bayesian Attention Networks for Reliable Hate Speech Detection

Kristian Miok · Blaž Škrlj · Daniela Zaharie · Marko Robnik-Šikonja

Accepted for the ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning.

Kristian Miok (B) · Daniela Zaharie
West University of Timisoara, Computer Science Department
Bulevardul Vasile Pârvan 4, 300223 Timisoara, Romania
E-mail: {kristian.miok,daniela.zaharie}@e-uvt.ro

Blaž Škrlj
Jožef Stefan International Postgraduate School, and Jožef Stefan Institute,
Jamova 39, 1000 Ljubljana, Slovenia
E-mail: [email protected]

Marko Robnik-Šikonja
University of Ljubljana, Faculty of Computer and Information Science,
Večna pot 113, 1000 Ljubljana, Slovenia
E-mail: [email protected]

Abstract Hate speech is an important problem in the management of user-generated content. To remove offensive content or ban misbehaving users, content moderators need reliable hate speech detectors. Recently, deep neural networks based on the transformer architecture, such as the (multilingual) BERT model, have achieved superior performance in many natural language classification tasks, including hate speech detection. So far, these methods have not been able to quantify their output in terms of reliability.

We propose a Bayesian method that uses Monte Carlo dropout within the attention layers of transformer models to provide well-calibrated reliability estimates. We evaluate and visualize the proposed approach on hate speech detection problems in several languages. Additionally, we test whether sentic variables can enhance the information extracted by the BERT model in hate speech classification.

Our experiments show that Monte Carlo dropout provides a viable mechanism for reliability estimation in transformer networks. Injected into the BERT model, it offers state-of-the-art classification performance and can detect less trusted predictions. Although the use of sentic computing information does not improve the predictive performance, it can provide an additional interpretation of the emotions involved in hate speech.

Our approach not only improves the classification performance of the state-of-the-art multilingual BERT model, but the computed reliability scores also significantly reduce the workload in the inspection of offending cases and re-annotation campaigns. The provided visualization helps to understand the borderline outcomes.

Keywords prediction uncertainty · reliability estimation · Monte Carlo dropout · transformer neural networks · Bayesian BERT · Sentic Computing · model calibration

1 Introduction

With the rise of social network popularity, hate speech phenomena have significantly increased [14]. Hate speech not only harms both minority groups and society as a whole, but it can lead to actual crimes [3]. Thus, automated hate speech detection mechanisms are urgently needed. On the other hand, falsely accusing people of hate speech is also a problem. Many content providers rely on human moderators to reliably decide whether a given text is offensive, but this is a mundane and stressful job which can even cause post-traumatic stress disorder1. There have been many attempts to automate the detection of hate speech in social media using machine learning, but existing models lack the quantification of reliability for their decisions.

In the last few years, recurrent neural networks (RNNs) were the most popular choice for text classification. LSTM networks, the most successful RNN architecture, have already been successfully adapted for the assessment of predictive reliability in hate speech classification [41]. Recently, a neural network architecture with attention layers, called the ‘transformer architecture’ [59], has shown even better performance on almost all language processing tasks. Using transformer networks for masked language modeling produced breakthrough pretrained models, such as BERT [15]. The attention mechanism is at the forefront of natural language understanding, with a significant impact on language applications. We aim to investigate the behavior of the attention mechanism concerning the reliability of predictions. We focus on the hate speech recognition task.

In hate speech detection, reliable predictions are needed to remove harmful content and possibly ban malicious users without harming the freedom of speech [41]. Standard neural networks are inadequate for the assessment of predictive uncertainty, and the best solution is to use the Bayesian inference framework. However, classical Bayesian inference techniques do not scale well to neural networks with high-dimensional parameter spaces [25]. Various methods were proposed to overcome this problem [44]. One of the most efficient methods is called ‘Monte Carlo Dropout’ (MCD) [20]. Its idea is to use dropout in neural networks as a regularization technique [54] and interpret it as a Bayesian optimization approach that takes samples from the approximate posterior distribution.

1 https://www.bbc.com/news/technology-51245616

Several authors have shown that emotional information extracted from text can improve the performance of lexical and standard machine learning approaches [35, 1, 51, 2]. The role and utility of emotional information [5] in deep learning have not yet been established; besides, we still have only a limited understanding of emotions in text. A series of computational models that bridge this gap between the human emotional perspective and machines are called ‘Sentic Computing’ [6]. The computational initiative, named SenticNet, combines knowledge from psycholinguists, neuroscientists, and computer scientists to better understand emotions in text. We use sentic features from SenticNet together with the outputs of the state-of-the-art contextual language model BERT [15], enhanced with reliability information obtained with MCD. We follow two goals: i) to test the predictive performance of emotion-enhanced BERT models in hate speech detection, and ii) to better understand the role of emotions in hate speech.

Our main contributions are:

1. Methodology for the assessment of prediction uncertainty in attention networks (AN) and the BERT model.

2. Empirical analysis of the proposed Bayesian Attention Networks (BAN) and BERT models regarding their calibration within the multilingual hate speech detection tasks.

3. Combination of contextual and reliability information obtained from Bayesian BERT with the sentiment information from SenticNet.

4. Visualization of a prediction uncertainty level for individual instances, as well as for their groups.

The paper consists of six more sections. In Section 2, we present related work on prediction uncertainty and hate speech detection. In Section 3, we propose the methodology for uncertainty assessment using attention layers and MCD, while in Section 4, we describe the calibration of predictions. Section 5 presents the data sets and the evaluation scenario. The obtained results are presented in Section 6, followed by conclusions and ideas for further work in Section 7.

2 Related Work

We present four areas of related work concerning the four pillars of the paper. First, we introduce work done on hate speech detection, followed by the related research on transformer architectures for text classification. In the third subsection we describe existing approaches for the assessment of uncertainty in text classification. The last subsection relates hate speech detection with the particularities of sentic computation.


2.1 Hate Speech Detection

Analyzing sentiments and extracting emotions from text are some of the most useful natural language processing applications [57]. Among the wide range of applications where machines aim to understand human sentiments, hate speech detection is becoming more and more needed with the rise of social media popularity. Hate speech is defined as written or oral communication that abuses or threatens a specific group or target [62].

Detecting abusive language in low-resource languages is a complex task, which is why multilingual and cross-lingual methods are used to improve the results [55]. This is especially the case when the languages have morphological and geographical similarities [46]. In our work, we investigate and compare the hate speech detection results for English, Croatian, and Slovene. As the most widely used language, English has the most studies [33, 14, 63]; however, very recently similar studies were done for the Croatian [28, 34, 30] and Slovene [17, 31, 60] languages.

Hate speech detection is a case of a binary text classification problem and is usually approached with supervised learning methods. In the past, the most frequently used classifier was the Support Vector Machine (SVM) [53]. However, deep neural networks are now the dominant technique, first through recurrent neural network language models [38], and recently using pre-trained transformer networks [43, 64]. In this work, we used the recent state-of-the-art pre-trained (multilingual) BERT model.

2.2 Attention Networks for Text Classification

Attention networks and the transformer model were proposed by [59] and soon replaced LSTM networks for many NLP tasks. Recently, several works investigated large pre-trained transformer models that use the attention mechanism for text classification. Kant et al. [26] trained both multiplicative LSTM (mLSTM) and transformer language models on a large 40 GB text dataset [36] and transferred those models to binary and multi-class text classification problems. They concluded that the transformer model outperforms the mLSTM model, especially when fine-tuned for multidimensional emotion classification.

The BERT model [15] uses the transformer architecture to train a masked language model on large text corpora. BERT and its follow-ups learn many language characteristics (both syntactic and semantic) and are useful for many language classification tasks. Despite the short time since its conception, BERT has already attracted enormous attention from the NLP community. Hundreds of research groups research it extensively; see a recent overview by Rogers et al. [52]. Practical guidelines on how to fine-tune the BERT model for text classification were compiled by Sun et al. [56].

A multilingual hierarchical attention mechanism for document classification was investigated by several authors [47, 16, 66]. However, different attention layers from large pre-trained models were not tested separately or in the context of prediction reliability. Also, to the best of our knowledge, predictive reliability for results obtained by BERT has not been investigated yet.

2.3 Prediction Uncertainty for Text Classification

While recent works on classification reliability mostly investigate deep neural networks, many other probabilistic classifiers were analyzed in the past [45, 69, 24, 11]. The well-known work of Platt [48] explores the probabilistic properties of SVM model predictions.

Prediction uncertainty is an important issue for black-box models like neural networks, as they do not provide any interpretability or reliability information about their predictions. Most existing reliability scores for deep neural networks are based on Bayesian inference. The exception is the work of Lakshminarayanan et al. [29], who proposed using deep ensembles to estimate the prediction uncertainty.

One of the most popular approaches is to mimic Bayesian inference using Monte Carlo dropout [20]. The dropout technique was first introduced to RNNs in 2013 [61], but further research revealed a negative impact of dropout in RNNs [4]. Later, dropout was successfully applied to language modeling by Zaremba et al. [68], who applied it only to fully connected layers. Gal and Ghahramani [21] implemented variational inference based dropout, which can also regularize recurrent layers. Additionally, they provide a solution for dropout within word embeddings. The method mimics Bayesian inference by combining probabilistic parameter interpretation and deep RNNs. The authors introduce the idea of augmenting probabilistic RNN models with prediction uncertainty estimation. Several other works investigate how to estimate prediction uncertainty within different data frameworks using RNNs [70], e.g., Bayes by Backpropagation (BBB) was applied to RNNs [18].

Recently, a fast and scalable method called ‘SWAG’ was proposed by Maddox et al. [32]. The main idea of this method is to randomize the learning rate and interpret this as sampling from a Gaussian distribution. SWAG fits the Gaussian distribution by capturing the Stochastic Weight Averaging (SWA) mean and the covariance matrix, representing the first two moments of the SGD iterates. In contrast to SWAG, we use the Gaussian distribution as a posterior over neural network weights and then perform Bayesian model averaging for uncertainty representation and calibration.

MCD was recently used within various models and architectures to obtain prediction uncertainty and improve classification results [39, 42, 40]. Transformer networks have not been analyzed yet.

2.4 Sentic Computation

Sentiment and emotions play an essential role in hate and offensive speech and have been used successfully in their automatic detection. Martins et al. [35] have used the eight basic emotions from Plutchik’s model [49], the positive and negative sentiment polarities, an indicator of the presence of a word in the Hatebase lexicon2, and the intensity of the emotion anger. Their combination of the lexicon-based and machine learning approaches successfully predicted hate speech and showed a high utility of emotional features. Alorainy et al. [1] used emotional analysis on Twitter suspended accounts and discovered that they contain more disgust, negative sentiment, fear, and sadness than active accounts. Using this information for hate speech detection, their machine learning models showed improved performance. Rodríguez et al. [51] also used eight basic emotions in their emotional analysis and showed that emotions could improve the clustering of Facebook posts. Finally, Bauwelinck and Lefever [2] use several different groups of features (linguistic, sentiment, and Twitter-specific features such as hashtags and a profanity lexicon) to predict hate speech. Interestingly, their results show that the Twitter-specific features are the most successful, and additional sentiment features do not improve predictive performance.

To advance over approaches based on lexical keywords and frequency statistics, a framework for emotional computing called ‘SenticNet’ [6] captures semantic and emotional latent information by relying on the implicit meaning associated with commonsense concepts. The latest version of the SenticNet method was recently published by Cambria et al. [10]. The Hourglass of Emotions model [58] from the Sentic family introduces a biologically-inspired and psychologically-motivated emotion categorization. It computes four independent but concomitant affective dimensions, which can be combined to build more complex emotions. Based on this, the Hourglass of Emotions model can describe and explain the full range of emotional experiences by disassembling text into its underlying sentiments.

All the methods mentioned above use classical machine learning approaches such as SVM, naive Bayes, logistic regression, or random forests. In our work, we use more advanced deep learning models that can extract more information from text. Concerning emotions, the question we investigate is whether adding emotional information to deep neural network models improves the performance of hate speech detection. This question is particularly relevant for the current state-of-the-art BERT model [15], which captures a wide range of language information, such as POS tags, dependency structure, and sentiment. In this work, we investigate the emotional information from SenticNet and BERT.

3 Bayesian Attention Networks

The BERT model [15] is a representative of transformer networks and has achieved state-of-the-art results in many NLP tasks, including text classification [65, 23, 13]. In this work, we introduce Monte Carlo dropout to transformer networks and BERT to construct their Bayesian variants. An analysis of different amounts of dropout, different variants of BERT modifications, and their hyper-parameters would require several rounds of retraining the BERT model, which would demand substantial computational resources. For example, training a single BERT model on four TPUs requires more than a month of computational time. Due to that, we first explore reliability on only the encoder part of the BERT architecture, called Attention Network (AN), and then on the entire pre-trained BERT model. We believe this is a reasonable setting which still allows the analysis of reliability in transformer networks.

In the following subsections, we first formally define the Attention Network architecture and then make it Bayesian by introducing MCD. We also describe how we can inject the MCD principle into the already pre-trained BERT model.

3.1 Attention Networks

Fig. 1: A scheme of Attention Networks. In the layers colored blue we introduce dropout.

The basic architecture of the Attention Network follows the architecture of transformer networks [59] and is shown in Figure 1. The architecture is similar to the encoder part of the transformer architecture. The difference is in the output part, where a single output head is added to perform binary classification using the sigmoid activation function. By applying only the encoder part of the transformer architecture, orders of magnitude fewer parameters are needed to learn a particular classification task; e.g., in this work, we used a maximum of 3 million parameters. The architecture can contain many attention heads, where a single attention head is computed as:

o_h = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) \cdot V,

The attention matrices are commonly known as the query Q, the key K, andthe value matrix V . The oh represents the output. The attention functioncan be described as mapping a query and a set of key-value pairs to anoutput, where the query, keys, values, and output are all vectors. The output iscomputed as a weighted sum of the values. The weight assigned to each valueis computed by a compatibility function of the query with the correspondingkey. The dk represents the dimensionality of the keys. The positional encoding,as discussed in [59], represents a matrix that encodes individual positionsin a matrix of the same dimensionality as the one holding the informationon sequences (input embedding). Intuitively, the multiplication of query andkey vectors with subsequent values can be understood as the extraction ofrelations. The softmax activation enables that each pair of considered inputtokens is represented with a single real value. It effectively introduces sparsenessinto the weight space – only certain token pairs emerge with high weightsand are relevant for the remaining part of the considered neural networkarchitecture. In practice, multiple such heads can be concatenated and fedinto the succeeding feed-forward layer. The application of softmax has beenshown to emphasize only particular parts of the parameter space, therebymaking the neural network more focused. The positional encoding, being thepart of the transformer architecture, was introduced to account for word order.Here, relative distances between different tokens are taken into account byincorporating the position-related signal into a given token representation.

While there are, in principle, many different ways in which attention networks can be extended with the Bayesian approach, we propose to use the well-established Monte Carlo dropout.

3.2 Monte Carlo Dropout for Attention Networks

In our proposal, called ‘Monte Carlo Dropout Attention Networks’ (MCD AN, or BAN), contrary to the original dropout setting, the dropout layers are also active during the prediction phase. In this way, predictions are not deterministic and are sampled from the learned distribution, thereby forming an ensemble of predictions. The obtained distribution can, for example, be inspected for higher moment properties and can offer additional information on the certainty of a given prediction. During the prediction phase, all layers except the dropout layers are deactivated. A forward pass on such a partially activated architecture is repeated for a fixed number of samples, which can be combined to obtain the final probability, or further inspected as a distribution underlying the probability.
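
A minimal sketch of this prediction procedure is given below, assuming `model` is any PyTorch binary classifier that contains `nn.Dropout` layers; the function name and the number of samples are illustrative choices, not part of our released code.

    import torch
    import torch.nn as nn

    def mc_dropout_predict(model, x, n_samples=100):
        # put the network in evaluation mode, then re-activate only the dropout layers
        model.eval()
        for module in model.modules():
            if isinstance(module, nn.Dropout):
                module.train()
        # repeat the stochastic forward pass to obtain an ensemble of predictions
        with torch.no_grad():
            samples = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
        # mean as the probability estimate, variance as a measure of (un)certainty
        return samples.mean(dim=0), samples.var(dim=0)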

3.3 Monte Carlo Dropout for BERT

Monte Carlo dropout was used for the BERT predictions in the same way as in BAN. MCD can provide multiple predictions at test time essentially for free, as long as dropout was used during training [19]. Training neural networks with dropout distributes the information contained in the neurons throughout the network. Hence, during prediction, such a trained neural network is robust; using the dropout principle, a new prediction is possible in each forward pass. A sufficiently large set of such predictions can be used to estimate the reliability. The BERT model is trained with 10% dropout in all of its layers and thus allows multiple predictions using the described principle. We call this model ‘MCD BERT’. A possible limitation of this approach is that a single dropout rate of 10% is used during training, while other dropout probabilities might be more suitable for reliability estimation.
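
For illustration, the sketch below applies the same principle to a HuggingFace BERT classifier; it assumes a recent version of the transformers library and uses the multilingual base checkpoint as a placeholder (in practice the model would first be fine-tuned on the hate speech data).

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)
    # the model is assumed to be fine-tuned on the hate speech training data first

    def mcd_bert_predict(text, n_samples=1000):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        model.train()  # keep the 10% dropout used during (pre-)training active
        with torch.no_grad():
            probs = torch.stack([
                torch.softmax(model(**inputs).logits, dim=-1)[0, 1]
                for _ in range(n_samples)
            ])
        # mean probability of the hate speech class and its variance over samples
        return probs.mean().item(), probs.var().item()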

4 Calibration of Probabilistic Classifiers

The quality of reliability scores returned by classifiers (such as MCD AN and MCD BERT) is typically assessed with calibration measures. A classifier is calibrated if its output scores are actual probabilities, in the sense that a class predicted with score p is correct with actual probability p, i.e., in p · 100 percent of cases. Most neural networks, without any measures taken to ensure calibration, will most likely be somewhat overconfident and overestimate their probabilities. A model’s calibration can be visualized using a calibration plot, where the model’s prediction accuracy (true probabilities) is plotted against the predicted output (i.e., a softmax score). Perfect calibration manifests itself as a diagonal in the calibration plot (see an example of a calibration plot in Fig. 5).
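
As an illustration, a calibration plot of this kind can be produced with scikit-learn and matplotlib; the labels and scores below are synthetic placeholders standing in for a held-out test set.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.calibration import calibration_curve

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)                                   # placeholder labels
    y_prob = np.clip(y_true * 0.7 + rng.normal(0.2, 0.2, 1000), 0, 1)        # placeholder scores

    # fraction of positives per bin vs. mean predicted score per bin
    frac_positives, mean_predicted = calibration_curve(y_true, y_prob, n_bins=10)

    plt.plot(mean_predicted, frac_positives, marker="o", label="model")
    plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
    plt.xlabel("Mean prediction value")
    plt.ylabel("Fraction of positives")
    plt.legend()
    plt.show()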

Since classifiers are typically not perfectly calibrated, we investigated different methods to improve the calibration of predictions for a binary model. We compared various existing methods for the calibration of neural networks with a novel approach that combines the existing techniques with an adaptive prediction threshold. We first describe the existing calibration methods, followed by the proposed adaptive threshold.

4.1 Existing Calibration Methods

Below, we formally describe how to obtain calibrated predictions from the reliability scores. Let (X, Y) be the input space, where X represents the set of predictive variables, and Y is the class variable with possible values 0 and 1. Let f be the predictor (e.g., a neural network) with f(X) = (\hat{Y}, \hat{P}), where \hat{Y} is the binary class prediction, and \hat{P} is its associated confidence score or probability score of correct prediction. The calibration of the model f is expressed with the formula:

P(\hat{Y} = Y \mid \hat{P} = \hat{p}) = \hat{p},   (1)

where \hat{p} is the prediction score from the [0, 1] interval, obtained from the predictor f. We interpret this score as the probability of a specific outcome assigned by the model f. Probability p is the model’s confidence, or the true probability, that model f predicts correctly. If a model predicts a certain outcome with high probability, it is desirable that the confidence of this prediction being correct is also high. In the ideal case of perfect calibration, \hat{p} = p.

Based on Equation (1), there are two ways to reduce the calibration error: either to obtain calibrated prediction scores \hat{p} or to manipulate the prediction threshold in such a way that the predicted outcome \hat{Y} is better calibrated.

To assess the quality of the produced reliability scores, we compare them to the results of the two most popular calibration methods, Platt’s method and isotonic regression.

Platt’s method [48] learns two scalar parameters a, b ∈ R in such a way that the prediction \hat{q} = σ(a\hat{p} + b) presents a calibrated probability of the predicted score \hat{p}, where σ is the sigmoid function. To find good values of a and b, a separate calibration data set is typically used. Isotonic regression is a non-parametric form of regression in which we assume that the function is chosen from the class of all isotonic (i.e., non-decreasing) functions [67]. Given the predictions \hat{p} from our classifier and the true target y, the calibrated prediction returned by isotonic regression is:

\hat{q} = m(\hat{p}) + \varepsilon

where m is a non-decreasing function.
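
Both methods are straightforward to apply with scikit-learn, as the following sketch shows; the calibration-set scores and labels below are synthetic placeholders.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    y_cal = rng.integers(0, 2, size=500)                                      # placeholder labels
    p_cal = np.clip(y_cal * 0.6 + rng.normal(0.2, 0.25, size=500), 0, 1)      # placeholder raw scores

    # Platt scaling: fit sigma(a*p + b) via a one-feature logistic regression
    platt = LogisticRegression().fit(p_cal.reshape(-1, 1), y_cal)
    q_platt = platt.predict_proba(p_cal.reshape(-1, 1))[:, 1]

    # Isotonic regression: fit a non-decreasing mapping from raw scores to probabilities
    iso = IsotonicRegression(out_of_bounds="clip").fit(p_cal, y_cal)
    q_iso = iso.predict(p_cal)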

4.2 Adaptive Threshold

As a part of this work, we explored the notion of an adaptive threshold (AT), which we apply to classification with MCD ANs. During learning, after each weight update phase, we assess the performance of the AN. For each instance in the validation set, we do multiple forward passes with unfixed dropout layers and store the average of the returned scores as the probability estimate. Once the probability estimates for the validation set are collected, we apply several decision thresholds to them and determine the final predictions of the instances. The best-performing threshold w.r.t. a given performance metric, in our case classification accuracy, is stored together with its performance and weights. This performance estimate can also be used for early stopping of the learning phase. When we apply the model to new instances, we use the best threshold from the training phase (instead of 0.5). The purpose of the adaptive threshold is to automatically find a threshold with better performance than the default value of 0.5. To summarize, we employ the following procedure (a code sketch of the threshold search follows the list):


1. During training, after each weight update, a probability distribution is generated with MCD. The mean of the distribution is considered the probability of a given instance being assigned to the positive class.

2. Using the validation set, we test a range of possible thresholds that determine the instances’ labels. In this work, we tested the threshold range between 0.1 and 0.9 in increments of 0.001.

3. If the accuracy obtained by the default threshold (0.5) was exceeded by any other threshold, the MCD AN network stored both the current parameter set and the threshold value that was used to obtain this superior performance on the validation set.

4. After the end of the training, the weights of the best performing architecture and the matching threshold value are used as the final trained model.
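
The threshold search itself (step 2 above) can be sketched as follows; the function name is illustrative and the inputs are the per-instance means of the MCD prediction samples on the validation set.

    import numpy as np

    def best_threshold(mean_probs, y_true, lo=0.1, hi=0.9, step=0.001):
        # candidate decision thresholds between 0.1 and 0.9 in increments of 0.001
        thresholds = np.arange(lo, hi + step, step)
        # validation accuracy obtained by each candidate threshold
        accuracies = [np.mean((mean_probs >= t).astype(int) == y_true) for t in thresholds]
        best = int(np.argmax(accuracies))
        return thresholds[best], accuracies[best]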

5 Evaluation Settings

We evaluate the proposed novelties concerning three main aspects: i) the calibration of returned probabilities, ii) prediction performance, and iii) visualization of prediction uncertainty on the hate speech problem. In this section, we present the evaluation settings, and in Section 6, we report the results for the three aspects. Here, we start by describing the used hate speech datasets, followed by the implementation details of the used prediction models, evaluation measures for prediction performance, and evaluation measures for calibration.

5.1 Hate Speech Datasets

To test the proposed methodology in the multilingual context, we applied the previously presented classification models to three different datasets.

1. The English dataset3 is extracted from the hate speech and offensive language detection study of Davidson et al. [14]. The subset of data we used consists of 5,000 tweets. We took 1,430 tweets labeled as hate speech and randomly sampled 3,670 tweets from the remaining 23,353 tweets.

2. The Croatian dataset was collected by the Styria media company within the EU Horizon 2020 project EMBEDDIA4. The texts were extracted from the database of user comments on the website of the Večernji list5 news portal. The original data set consists of 9,646,634 comments described with 11 attributes, from which we selected 8,422 comments, of which 50% are labeled as hate speech by human moderators, and the other half was randomly chosen from the non-problematic comments.

3. The Slovene data set is a result of the Slovenian national project FRENK6. The text dataset used in the experiment is a combination of two different studies of Facebook comments [31]. The first group of comments was collected on LGBT homophobia topics, while the second on anti-migrant posts. In our final data set, we used all of the 2,182 hate speech comments, and the same number of non-hate speech comments was randomly sampled.

3 https://github.com/t-davidson/hate-speech-and-offensive-language
4 http://embeddia.eu
5 https://www.vecernji.hr

The summary of data set sizes is available in Table 1.

5.2 Sentic Features

To test if additional emotional data can complement the information extracted from the text by BERT models, we obtained additional sentiment information for the English tweets dataset. We used the SenticPhrase interface based on SenticNet 4 [9] and extracted four sentic features for each sentence. Within SenticNet 4, both verb and noun concepts are linked to primitives, and in this way, most concept inflections can be captured by the knowledge base verb concepts. The API implementation for Python is freely available within the sentic Python package7.

To gain a better understanding of the four sentic variables (pleasantness, attention, sensitivity, and aptitude), Cambria et al. [7] presented the following example:

1. the user is happy with the service provided (pleasantness)
2. the user is interested in the information supplied (attention)
3. the user is comfortable with the interface (sensitivity)
4. the user is disposed to use the application (aptitude)

Hate speech language usually expresses unhappiness with the current situation and unwillingness to hear or consider different opinions. Hence, the nature of hate speech is opposite to the nature of pleasantness and aptitude, while it can go in the same direction as attention.

The distributions of the sentic variables’ values obtained for English tweets, separately for non-hate speech and hate speech, are shown in Fig. 2. While the distributions differ among the variables, the differences between hate speech and non-hate speech are not particularly pronounced. This indicates that each variable by itself is not a strong indicator of hate speech, but it might still be useful in combination with other sentic variables or text features extracted by the neural network.

Fig. 2: Distribution of the four emotional variables obtained from SenticNet for English tweets. The left-hand side shows non-hate speech tweets and the right-hand side shows hate speech tweets.

5.3 Prediction Models

We used three types of prediction models. As a baseline, we used MCD LSTM networks [41], which include reliability information obtained with MCD. We compared that model with MCD AN and MCD BERT. As the input to MCD LSTM, we used pre-trained word embeddings: a sentence encoder for English [12] and fastText embeddings8 for Slovene and Croatian. For MCD AN, we used the Keras tokenizer9, and for MCD BERT, we used BERT’s tokenizer. The characteristics of the models are summarized in Table 1.

6 http://nl.ijs.si/frenk/ FRENK - Raziskave Elektronske Nespodobne Komunikacije (English: Research on Inappropriate Electronic Communication)

7 https://pypi.org/project/sentic/

Table 1: Summary of characteristics of the datasets used with the three types of prediction models.

Dataset type                 Size   Hate   Non-hate   LSTM embeddings
English tweets               5000   1430   3670       sentence
Croatian news comments       8422   4211   4211       fastText
Slovene Facebook comments    4364   2182   2182       fastText

5.4 Implementation Details

We implemented the proposed MCD ANs in the PyTorch library10. The main hyper-parameters of the architecture are the number of attention heads and the number of attention layers. The adaptive classification threshold (described in Section 4.2) is computed every time we evaluate the performance on the validation set.

8 https://fasttext.cc
9 https://keras.io/preprocessing/text/
10 https://gitlab.com/skblaz/bayesianattention


The Monte Carlo dropout in AN is also used in the prediction phase. When a network makes a prediction, we apply a deactivation operation across the architecture, freezing everything but the dropout layers. In this way, we maintain the variance of predictions. Each final prediction thus consists of a set of results obtained by several forward passes. The complete source code of our approaches will be available with the final version of the paper.

Other parameters are set as follows. We use the Adamax optimizer [27], a variant of Adam based on the infinity norm. Binary cross-entropy loss guides the training. To automatically stop training, we use a stopping step of 10 – if after 10 optimization steps the performance on the validation set has not improved, the training stops.

We explored the following hyperparameter tuning space: the validation percentage (size of the validation set) was varied between 5% and 10%. The rationale for testing different validation set sizes is that the considered datasets are small, so too high a validation percentage could prevent the classifier from seeing crucial instances and thus reduce its final performance. However, given enough data, the percentage should be as high as possible. The number of epochs was either 30 or 100, and the number of hidden layers and attention heads was 1 or 2. The maximum padding of the input sequences was either 48, 32, or 64. The learning rate was either 0.001 or 0.0005, and the adaptive threshold was either enabled or disabled.

MCD LSTM networks consist of an embedding layer, an LSTM layer, and a fully connected layer, with the word2vec and ELMo embeddings. To obtain the best architectures for the LSTM and MCD LSTM models, various numbers of units, batch sizes, dropout rates, etc. were tested. For BERT, we used the BERT base model in the English and multilingual variants, using the HuggingFace code11.

To combine the information from MCD BERT and SenticNet, we generated 1000 MCD BERT predictions for each instance. We merged them with the four sentic variables described in Section 5.2, thus obtaining 1004 variables. This data was passed as input to an SVM model. The process used 5-fold cross-validation. A sketch of this procedure is given below.
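
The sketch below illustrates this combination with scikit-learn; the prediction matrix, sentic features, and labels are synthetic placeholders standing in for the real data.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n = 200
    y = rng.integers(0, 2, size=n)                                                   # placeholder labels
    mcd_preds = np.clip(y[:, None] * 0.8 + rng.normal(0.1, 0.15, (n, 1000)), 0, 1)   # placeholder MCD BERT samples
    sentic = rng.normal(0.0, 0.3, size=(n, 4))                                       # placeholder sentic variables

    # 1000 MCD BERT predictions + 4 sentic variables = 1004 features per instance
    X = np.hstack([mcd_preds, sentic])
    scores = cross_val_score(SVC(), X, y, cv=5, scoring="accuracy")
    print(scores.mean(), scores.std())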

5.5 Prediction Performance Evaluation Measures

Depending on the purpose of the prediction model, we might want to maximize different evaluation measures, such as classification accuracy, precision, recall, or F1 score. In hate speech detection, we want to avoid false accusations of hate speech; therefore, we try to maximize precision by measuring it on the validation set during training. However, this could affect other measures. With that aim, we alter the decision threshold to achieve a good precision vs. accuracy balance. The accuracy-precision trade-off is presented in Figure 3.

11 https://huggingface.co/transformers/model_doc/bert.html


Fig. 3: Trade-off between precision and accuracy across various hyper-parameter settings. Each curve shows one set of hyper-parameters, and each color depicts one decision threshold (0, 0.25, 0.5, 0.75, or 1.0). Hyper-parameters contain: number of heads – max padding – number of layers – number of epochs – validation percentage.

5.6 Calibration Quality Measures

To measure the quality of computed calibration scores, we use the expected calibration error (ECE) [22]. ECE splits all n predictions into M equally spaced bins B1, B2, ..., BM that contain instances with prediction scores in the given bin. It sums the weighted differences between actual prediction accuracies and predicted scores over all the bins and normalizes the result with the number of instances n.

\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{accuracy}(B_m) - \mathrm{score}(B_m) \right|   (2)

This measure produces lower scores for better calibrated models (lower calibration error).
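
A short sketch of Equation (2) is given below; it uses one common variant in which accuracy(Bm) is the fraction of positive instances in the bin, consistent with the calibration plots in this paper. The function name is illustrative.

    import numpy as np

    def expected_calibration_error(y_true, scores, n_bins=10):
        y_true, scores = np.asarray(y_true), np.asarray(scores)
        # assign every prediction score to one of M equally spaced bins
        bin_ids = np.minimum((scores * n_bins).astype(int), n_bins - 1)
        ece, n = 0.0, len(scores)
        for m in range(n_bins):
            mask = bin_ids == m
            if not mask.any():
                continue
            accuracy = y_true[mask].mean()    # empirical accuracy of the bin
            score = scores[mask].mean()       # mean predicted score of the bin
            ece += mask.sum() / n * abs(accuracy - score)
        return ece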

6 Results

In this section, we present results related to several aspects: calibration of different prediction models, their prediction performance, reliability of BERT and MCD BERT, the relationship between MCD BERT results and sentic features, and a visualization of uncertainty.


6.1 Calibration of MCD AN and BERT

Figure 4 shows how the calibration of prediction scores changes during the training of MCD AN. The red line represents the performance of the fully trained network. It is apparent that additional calibration is necessary – the dotted line represents perfect calibration. Surprisingly, initial training iterations show better calibrated scores. This can be due to the definition of the ECE measure: in case both accuracy and predicted scores are low, this leads to a low ECE value.


Fig. 4: Calibration plot for MCD AN after each epoch (green), based on the validation set of the best performing architecture. The more transparent the calibration curves, the earlier the training stage (fewer epochs). The final calibration is in red.

In Tables 2, 3, and 4, the calibration results for different calibration approaches to MCD AN are presented: no calibration, isotonic regression, and Platt’s method, each combined with or without the adaptive threshold. It can be observed that for all three languages, both calibration methods improve the ECE score, and Platt’s method seems to produce the best calibration scores. The adaptive threshold slightly improves the ECE score for the uncalibrated (raw) results. This is especially true for the Slovene comments, where the ECE score was reduced from 0.794 to 0.621. Nevertheless, we can conclude that calibration using the adaptive threshold heuristic is beneficial, but it cannot be compared with the improvements brought by proper calibration techniques.

To compare the calibration results for MCD BERT with different MCD AN architectures, we plotted their ECE scores in Figure 5. It can be observed that calibration methods substantially improve the MCD AN calibration; however, the MCD BERT model is even better calibrated.


Table 2: The calibration scores of MCD AN with different calibration approaches on English tweets.

Calibration   Adaptive threshold   Accuracy      F1            ECE
Raw           False                0.83 (0.02)   0.82 (0.03)   0.547
Raw           True                 0.83 (0.01)   0.83 (0.04)   0.539
Isotonic      False                0.84 (0.01)   0.82 (0.01)   0.230
Isotonic      True                 0.83 (0.01)   0.82 (0.02)   0.234
Platt's       False                0.84 (0.02)   0.82 (0.02)   0.225
Platt's       True                 0.83 (0.01)   0.82 (0.01)   0.232

Table 3: The calibration scores of MCD AN with different calibration approaches on Croatian user news comments.

Calibration   Adaptive threshold   Accuracy      F1            ECE
Raw           False                0.61 (0.02)   0.47 (0.03)   0.681
Raw           True                 0.62 (0.02)   0.50 (0.04)   0.663
Isotonic      False                0.60 (0.01)   0.49 (0.04)   0.206
Isotonic      True                 0.61 (0.01)   0.50 (0.03)   0.206
Platt's       False                0.61 (0.02)   0.48 (0.02)   0.198
Platt's       True                 0.62 (0.02)   0.49 (0.02)   0.197

Table 4: The calibration scores of MCD AN with different calibration approaches on Slovene Facebook comments.

Calibration   Adaptive threshold   Accuracy      F1             ECE
Raw           False                0.59 (0.01)   0.33 (0.05)    0.794
Raw           True                 0.59 (0.02)   0.48 (0.05)    0.621
Isotonic      False                0.58 (0.02)   0.48 (0.03)    0.212
Isotonic      True                 0.58 (0.02)   0.49 (0.03)    0.213
Platt's       False                0.58 (0.03)   0.475 (0.02)   0.206
Platt's       True                 0.59 (0.02)   0.47 (0.04)    0.204


Fig. 5: Calibration plots based on English test set performance for MCD BERT and the MCD AN architecture using different calibration algorithms (raw, Platt's, and isotonic).


6.2 Prediction Performance

The comparison of four different models is presented in Table 5. MCD BERT provides the best results for all three languages. BERT is pre-trained on a large amount of text, which makes a significant difference compared to LSTM and MCD AN. As MCD BERT is slightly better than BERT, we can conclude that for the instances where BERT is unsure, multiple predictions reduce the variance and influence decisions in the right direction. LSTM seems to be more consistent than MCD AN (see the F1 scores). We attribute this to the larger number of parameters in MCD AN and an insufficient number of instances. BERT models compensate for this with pre-training.

Table 5: Comparison of predictive models. We present average classification accuracy and F1 score with their standard deviations, computed using 5-fold cross-validation. All the results are expressed in percentages, and the best accuracy for each language is typeset in bold.

           English Tweets            Croatian Comments          Slovene Comments
Model      Accuracy     F1           Accuracy     F1            Accuracy      F1
MCD LSTM   81.0 [1.2]   81.9 [1.3]   63.7 [1.0]   51.0 [3.3]    55.3 [0.69]   43.13 [0.8]
MCD AN     83.3 [1.7]   81.6 [3.4]   61.4 [2.0]   38.1 [8.6]    57.4 [1.7]    35.1 [6.3]
BERT       90.9 [0.7]   90.0 [0.7]   70.8 [1.0]   61.2 [1.5]    66.4 [5.0]    67.8 [2.5]
MCD BERT   91.4 [0.7]   90.4 [0.8]   71.5 [1.2]   62.9 [1.7]    68.4 [1.9]    68.6 [1.6]

6.3 Reliability of BERT and MCD BERT

As BERT is already well-calibrated, we want to test if the proposed MCD extension is useful beyond the advantage in predictive performance. To test if MCD BERT can recognize problematic predictions, we investigated instances from the test sets. For each classifier (BERT and MCD BERT), we split the tested cases into two groups, confused and certain, based on the scores provided by the classifiers. As BERT and MCD BERT generally provide predictions close to 0 or 1, we used the following simple criterion for MCD BERT: the tested instance is declared confused if the variance computed from the 1000 predictions is greater than 0.1; otherwise, it is declared certain. As BERT returns a single prediction, we chose the same number of confused instances as for MCD BERT, with the criterion that the predicted score is farthest away from 0 or 1, i.e., the least certain.
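
A minimal sketch of the MCD BERT criterion is shown below; the function name is illustrative and the input is the matrix of 1000 MCD prediction samples per instance.

    import numpy as np

    def split_confused(mcd_samples, var_threshold=0.1):
        # mcd_samples: (n_instances, n_samples) array of MCD BERT prediction samples;
        # an instance is "confused" if the variance of its samples exceeds the threshold
        mcd_samples = np.asarray(mcd_samples)
        return mcd_samples.var(axis=1) > var_threshold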

In Table 6, we show results for each of the three languages, separately for correctly and incorrectly classified instances. The ratio of incorrectly to correctly classified examples is significantly different between the certain and confused groups, and much larger for MCD BERT than for BERT on the English and Croatian datasets, while being similar on the Slovene dataset.


To statistically test whether the change from the BERT to the MCD BERT model influences the number of correctly detected confused tweets, we applied the Chi-square statistical test. The Chi-square test for English MCD BERT returns a very significant p-value of 2.2e-16, while the test for the BERT model was significant with a p-value of 1.384e-11. For BERT on Croatian, the Chi-square test was not significant, with a p-value of 1. Thus, it is clear that for Croatian, we cannot classify tweets as certain and confused based on just a single prediction. In contrast, MCD BERT provides a much better split for Croatian, so that the Chi-square test becomes significant with a p-value of 8.348e-16. The p-values for BERT and MCD BERT on Slovene are 0.0037 and 0.0002, respectively. This indicates that MCD BERT is much better at detecting unreliable classifications. For example, if we are faced with a re-annotation task to improve the quality of predictions, MCD BERT would choose much better borderline instances than BERT. Thus, the results indicate that MCD BERT provides a better understanding of how much we can trust the predictions than BERT.
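
For illustration, such a test can be run on a 2x2 contingency table with scipy; the counts below are the English MCD BERT entries from Table 6, and the resulting p-value may differ slightly from the reported one depending on the continuity correction used.

    from scipy.stats import chi2_contingency

    # rows: correct / incorrect predictions, columns: certain / confused instances
    table = [[891, 24],
             [62, 23]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(p_value)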

Table 6: The ratio of predictions where the classifiers are correct and incorrect is very different for instances where BERT is certain vs. confused, compared to MCD BERT.

                       BERT                              MCD BERT
Language   Correct     Acc    Certain   Confused         Acc    Certain   Confused
ENG        Yes         90.4   880       31               92.1   891       24
           No                 71        18                      62        23
           Ratio              0.08      0.58                    0.06      0.95
CRO        Yes         71.4   1176      35               72.2   1053      152
           No                 461       14                      336       139
           Ratio              0.39      0.4                     0.31      0.91
SLO        Yes         65.9   576       28               67.5   537       55
           No                 241       27                      229       51
           Ratio              0.42      0.96                    0.42      0.92

6.4 Combining Emotional Information with MCD BERT

In this section, we test the hypothesis that additional information about the emotions obtained from SenticNet can complement the information about hate speech extracted by the BERT model. To this end, we combine sentic features with the outputs of MCD BERT, as described in Section 5.4, and study how this affects the performance of fine-tuned BERT models. Additionally, we investigate if emotional information can help in the interpretation of trained hate speech detectors.


Using 5-fold cross-validation, we combine the predictions from MCD BERT (1000 of them) with the four sentic variables (pleasantness, aptitude, sensitivity, and attention). On the data sets obtained in this way, we train an SVM model. According to the results in Table 7, the additional sentic information does not significantly improve hate speech detection.

The same conclusions can be drawn from Figure 6, where we used the feature evaluation algorithm ReliefF [50], which can detect feature dependencies that affect the prediction variable. The algorithm evaluated 1004 variables (1000 MCD predictions and 4 sentic variables). The results show that the four sentic variables have significantly less impact on the final hate speech predictions than the individual MCD BERT outputs.

Table 7: Comparison of the prediction performance of MCD BERT and the SVM model trained on the outputs of MCD BERT and SenticNet for English tweets.

Model                           Accuracy     F1
MCD BERT                        91.4 [0.7]   90.4 [0.8]
SVM on MCD BERT + SenticNet     91.4 [0.4]   90.5 [0.7]

Fig. 6: Feature importance according to the ReliefF algorithm for the four sentic variables together with five typical output variables generated by MCD BERT (we show five variables from the end of the variable list, but the results for other selections would be similar).

In a further analysis, we investigated the correlation between the sentic variables and the hate speech prediction probability, separately for the non-hate speech and hate speech English tweets. The top parts of the graphs in Figure 7 show that the linear regression lines (in orange) for hate speech are essentially horizontal, so there is no significant correlation between the predicted probability of hate speech obtained with MCD BERT and the emotions. In contrast, there seems to be a certain correlation between probability and emotions for the non-hate speech tweets, as the blue regression lines in the bottom parts of the graphs show. Both attention and sensitivity have a positive correlation with the hate speech prediction probability. This confirms the conclusions of the Hourglass of Emotions model that high attention and sensitivity lead to aggressiveness (Figure 5 in [8]).

Fig. 7: Relationship between prediction probability and the four sentic variables.

To evaluate the sentic variables in the context of the hate speech probability assigned by MCD BERT (currently the best model for hate speech), we fitted a regression model, shown in Table 8. According to this model, pleasantness exhibits a strong significant effect on the hate speech prediction, while aptitude is borderline significant. The same effect, of pleasantness and aptitude having a negative effect on the probability that a specific offensive tweet is edited by a troll, was observed in [7]. The fact that the coefficients of both pleasantness and aptitude are negative confirms the results of the Hourglass of Emotions model. For instance, according to the Hourglass compound emotions of the second level (Figure 5 in [8]), simultaneous high values of pleasantness and aptitude lead to love, which is the complete opposite of hate speech. Thus, sentiments extracted using SenticNet can be used to interpret, from the emotional perspective, why a specific discussion leads to hate speech.

Table 8: The linear regression model of sentic variables and hate speech prediction probability for the hate speech English tweets.

               Estimate   Std. Error   t-value   P-value
Intercept       0.540     0.021        25.303    2e-16 *
Pleasantness   -0.271     0.065        -4.133    3.8e-05 *
Attention       0.069     0.059         1.172    0.2414
Sensitivity    -0.084     0.091        -0.918    0.358
Aptitude       -0.126     0.061        -2.086    0.037 *


6.5 Visualization of Uncertainty

Obtaining multiple predictions for a specific instance can improve understanding of the final prediction. The mean of the distribution is used to estimate the probability. The variance informs us about the spread and certainty of the prediction. We can inspect the actual distribution of prediction scores with histogram plots, as illustrated in Figure 8 for a few correctly classified instances from the English dataset and in Figure 9 for a few misclassified cases.

The histograms presented in Figures 8 and 9 visually display the prediction certainty for specific instances. MCD BERT’s predictions are always close to 0 or 1, especially when the model seems sure. MCD AN with 10% dropout provides a similar spread of values as BERT. This is expected, as the BERT architecture is pre-trained with 10% dropout. However, 30% dropout in MCD AN results in a much larger spread of predictions for instances where MCD AN is uncertain. Note that the results of MCD BERT are concentrated in a much narrower interval compared to MCD LSTM and MCD AN.

While visualizations of prediction histograms for each instance separately, as in Figures 8 and 9, are useful in the assessment of individual prediction reliability, we also want to show aggregated results for multiple instances to understand dependencies with other instances. Following [41], we visualize the embeddings of the prediction distributions. The key idea of the visualization can be summarized as follows. First, we obtain many predictions (1,000 in our experiments) for each instance. The space of such prediction distributions across individual instances is embedded into two dimensions using the Uniform Manifold Approximation and Projection (UMAP) method [37]. In this way, we obtain a two-dimensional space corresponding to the initial 1,000-dimensional space of prediction distributions. Next, we use Gaussian kernel density estimation to identify equivalent regions connected with closed curves. Finally, the shapes and sizes of individual predictions are chosen based on their classification error and the certainty of predictions. The goal of this visualization is to discover potentially larger structures within the space of probability distributions, possibly offering insights into the drawbacks and limitations of the analyzed classifier. The results of this visualization are shown in Figures 10 and 11. In Figure 10, the plot displays the position of certain and uncertain test instances in the latent space, while Figure 11 shows the differences based on the mean probability.
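
A rough sketch of this visualization pipeline, assuming the umap-learn package and synthetic placeholder prediction samples, is given below; the exact styling of markers in our figures is not reproduced here.

    import matplotlib.pyplot as plt
    import numpy as np
    import umap                          # from the umap-learn package
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(0)
    centres = rng.uniform(0, 1, size=(300, 1))
    preds = np.clip(rng.normal(centres, 0.05, size=(300, 1000)), 0, 1)   # placeholder MCD samples

    # embed the 1000-dimensional prediction distributions into two dimensions
    embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(preds)

    # Gaussian kernel density estimate over the embedded space, drawn as closed contours
    kde = gaussian_kde(embedding.T)
    x_min, x_max = embedding[:, 0].min(), embedding[:, 0].max()
    y_min, y_max = embedding[:, 1].min(), embedding[:, 1].max()
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

    plt.contour(xx, yy, density, levels=5)
    # color the instances by whether their prediction variance marks them as uncertain
    plt.scatter(embedding[:, 0], embedding[:, 1],
                c=preds.var(axis=1) > 0.01, cmap="coolwarm", s=10)
    plt.show()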

In both Figures 10 and 11, the probability space is distinctly separated into two components, indicating that there are predictions for which the neural network is certain (and which were correctly classified); however, for some predictions, especially of instances that are not hate speech, the network is less certain (albeit still correct). The two examples demonstrate how the space of probabilities is split into distinct components once the neural network is trained. The visualizations also indicate that some of the instances are more problematic than others, allowing their identification and potentially facilitating the debugging process for a developer (e.g., an inspection of convergence).


tweet1: @user you might be a libtard if... #libtard #sjw #liberal #politics

tweet2: #paid #kkk to #fabricate #stories to push the producers #narrative #cancel

tweet3: carl paladino, trump ally, wishes obama dead of mad cow disease

tweet4: thoughtfactory: left-right polarisation! #trump #uselections2016 #leadership #politics #brexit #blm

Fig. 8: Prediction histograms on the English dataset for different models where the hate speech was correctly predicted. The models shown are MCD LSTM (first row), MCD AN with 30% dropout (second row), MCD AN with 10% dropout (third row), and MCD BERT (fourth row). Each tweet is shown in a separate column, and the x axes showing the predicted probability distributions differ between graphs.

7 Conclusions and Future Work

In practical settings, the automatic detection of hate speech requires highprecision and prediction uncertainty estimates. In times when social networkssuffer from a high amount of offensive messages, wrong classifications candamage the minorities, lower the level of democratic debate but also damagethe freedom of speech. In technological terms, natural language approachesare witnessing a switch from recurrent neural networks with pre-trained wordembeddings (such as LSTM with fastText) to large pre-trained transformermodels (such as BERT). Monte Carlo dropout in the attention layers of

24 Kristian Miok, Blaz Skrlj, Daniela Zaharie, Marko Robnik-Sikonja

tweet5: whatover-rode jour-nalists’ integritywas greed andambition, alongwith a total ab-sence of courage(true label = hatespeech)

tweet6: there isa future nina andright now youare pretty muchthrowing yoursaway. (true label =non-hate speech)

tweet7: @user@user someonereally respects euguidelines!!! (truelabel = non-hatespeech)

tweet8: the scarsleft by us waimecamps mt @user(true label = hatespeech)

Fig. 9: Prediction histograms on the English dataset for different models wherethe hate speech was incorrectly predicted. Models shown are MCD LSTM(first row), MCD AN with 30% dropout (second row), MCD AN with 10%dropout (third row), and MCD BERT (fourth row). Each tweet is shown in aseparate column, and the x axis showing predicted probability distributions isdifferent for each graph.

Monte Carlo dropout in the attention layers of transformer neural networks turned out to be a useful tool for prediction uncertainty estimation in the hate speech detection task.
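A minimal sketch of how such test-time sampling can be obtained with a fine-tuned BERT classifier is given below; it assumes the HuggingFace transformers library, uses a placeholder model name, and re-activates all dropout modules (not only those in the attention layers), so it is a simplification of the procedure described in this paper.

# Sketch of Monte Carlo dropout sampling at prediction time, assuming a
# fine-tuned HuggingFace BertForSequenceClassification model. All nn.Dropout
# modules (including those inside the attention blocks) are kept active.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

def mc_dropout_predict(model, tokenizer, texts, n_samples=100, device="cpu"):
    model.to(device).eval()
    # Re-enable dropout layers while keeping, e.g., LayerNorm in eval mode.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt").to(device)
    samples = []
    with torch.no_grad():
        for _ in range(n_samples):
            logits = model(**batch).logits
            samples.append(torch.softmax(logits, dim=-1)[:, 1])  # P(hate speech)
    samples = torch.stack(samples)                # (n_samples, batch_size)
    return samples.mean(dim=0), samples.std(dim=0)

# Usage (placeholder model name):
# model = BertForSequenceClassification.from_pretrained(
#     "bert-base-multilingual-cased", num_labels=2)
# tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
# mean_p, std_p = mc_dropout_predict(model, tokenizer, ["example tweet"],
#                                    n_samples=1000)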

We investigated the possibility of obtaining uncertainty estimates for predictions with transformer neural networks. To better understand the uncertainty obtained for the BERT model, we examined the part of its architecture called 'attention layers' and proposed the use of Monte Carlo dropout in these layers. We empirically investigated the calibration and predictive performance of MCD AN and MCD BERT. The results show that MCD BERT is much better calibrated than MCD AN.


Fig. 10: Visualization of the 100 test tweets in two dimensions. Tweets with certain predictions are colored blue (0), while tweets with confused predictions are colored orange (1). It can be observed that the uncertain tweets form clusters.

Fig. 11: Visualization of the outcome probability space for 100 tweets from the test set. The test tweets are colored green, yellow, and red, depending on the interval of the mean probability of the 1,000 predictions. The predictions with very high confidence form an isolated part of the probability space.

The pre-training of MCD BERT gathers a large amount of information about the language that can be successfully exploited when fine-tuning to the specific problem. MCD ANs, trained from scratch, are not competitive with this.


Multiple predictions obtained from the dropout layers of BERT in MCD BERT turn out to be very useful. They produce better predictive performance compared to BERT and provide a better separation between trusted and dubious predictions. Moreover, the reliability-enhanced classifications and the proposed visualizations support the detection of less certain automatic decisions and help moderators or annotators focus on questionable cases. The visualizations can show either the uncertainty in the classification of individual instances or relationships among them.
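As an illustration of how this separation between trusted and dubious predictions could be operationalized for moderators, the sketch below ranks instances by a simple dubiousness score; mean_p and std_p are assumed to be the mean and standard deviation of the sampled probabilities (e.g., from the sampling sketch above), and the scoring rule is illustrative rather than the one used in our experiments.

# Sketch of a moderator triage list based on MC-dropout statistics
# (assumed inputs: texts, mean_p, std_p from a sampling routine).
import numpy as np

def triage_for_review(texts, mean_p, std_p, top_k=20):
    mean_p, std_p = np.asarray(mean_p), np.asarray(std_p)
    # Small margin to the decision boundary and large sampling spread
    # both signal an unreliable automatic decision.
    margin = np.abs(mean_p - 0.5)
    dubiousness = std_p - margin
    order = np.argsort(-dubiousness)[:top_k]
    return [(texts[i], float(mean_p[i]), float(std_p[i])) for i in order]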

The information extracted by BERT models seems to provide a robust signal about hate speech, which is not enhanced by sentic features. Nevertheless, the predictions of MCD BERT confirm the findings of the Hourglass of Emotions. We show that sentic variables are correlated with the non-hate speech probability and can potentially explain decisions of the BERT model. Breaking down complex offensive language into fundamental emotions can bring interesting insights into the roots of the hate speech problem.

Our further work will adapt other Bayesian approaches to transformer networks. In hate speech detection, we will investigate other languages. Besides hate speech, we plan to apply reliability-enhanced classification to other domains, such as medical imaging and machine translation. One of the tasks where Bayesian text classification can be particularly useful is semi-supervised learning, where we iteratively expand an initial small set of manually labeled instances with the most reliably classified instances (see the sketch below). Data re-annotation is another example where reliability scores can be of significant help. An initial pilot study of Croatian comment filtering showed that human annotators decide mostly based on the observed keywords and lack the time to find more subtle expressions of offensive content. Such decisions result in low quality of the resulting datasets and demand their re-annotation. Using the proposed MCD BERT reliability scores can significantly reduce the amount of re-annotation and focus it on genuinely borderline cases where the prediction models may err.
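A possible shape of this semi-supervised loop is sketched below; train_model and mc_dropout_predict are placeholders for the fine-tuning and sampling routines, and the reliability thresholds are purely illustrative.

# Sketch of reliability-based self-training: iteratively move the most
# reliably classified unlabeled instances into the training set.
import numpy as np

def self_training(labeled_texts, labels, unlabeled_texts,
                  train_model, mc_dropout_predict,
                  conf_threshold=0.95, std_threshold=0.05, max_rounds=5):
    labeled_texts, labels = list(labeled_texts), list(labels)
    unlabeled = list(unlabeled_texts)
    model = train_model(labeled_texts, labels)
    for _ in range(max_rounds):
        if not unlabeled:
            break
        mean_p, std_p = mc_dropout_predict(model, unlabeled)
        mean_p, std_p = np.asarray(mean_p), np.asarray(std_p)
        # Keep only instances whose sampled predictions are both
        # confident and stable.
        reliable = (np.maximum(mean_p, 1 - mean_p) >= conf_threshold) \
                   & (std_p <= std_threshold)
        if not reliable.any():
            break
        for i in np.where(reliable)[0]:
            labeled_texts.append(unlabeled[i])
            labels.append(int(mean_p[i] >= 0.5))
        unlabeled = [t for i, t in enumerate(unlabeled) if not reliable[i]]
        model = train_model(labeled_texts, labels)
    return model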

Funding Information Marko Robnik-Sikonja received financial support from the Slovenian Research Agency through core research programme P6-0411. All the authors except Daniela Zaharie have received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 825153 (EMBEDDIA).

Compliance with Ethical Standards

Conflict of Interest The authors declare that they have no conflict of interest.

Ethical Approval This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent Informed consent was not required as no humans or animals were involved.


References

1. Alorainy W, Burnap P, Liu H, Javed A, Williams ML (2018) Suspended accounts: A source of tweets with disgust and anger emotions for augmenting hate speech data sample. In: International Conference on Machine Learning and Cybernetics (ICMLC), vol 2, pp 581–586

2. Bauwelinck N, Lefever E (2019) Measuring the impact of sentiment for hate speech detection on Twitter. In: The Fifth International Conference on Human and Social Analytics (HUSO), pp 17–22

3. Bleich E (2011) The rise of hate speech and hate crime laws in liberal democracies. Journal of Ethnic and Migration Studies 37(6):917–934

4. Bluche T, Kermorvant C, Louradour J (2015) Where to apply dropout in recurrent neural networks for handwriting recognition? In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR), IEEE, pp 681–685

5. Cambria E (2016) Affective computing and sentiment analysis. IEEE Intelligent Systems 31(2):102–107

6. Cambria E, Hussain A (2015) Sentic Computing: A Common-Sense-Based Framework for Concept-Level Sentiment Analysis. Springer, Cham, Switzerland

7. Cambria E, Chandra P, Sharma A, Hussain A (2010) Do not feel the trolls. ISWC, Shanghai

8. Cambria E, Livingstone A, Hussain A (2012) The hourglass of emotions. In: Cognitive Behavioural Systems, Springer, pp 144–157

9. Cambria E, Poria S, Bajpai R, Schuller B (2016) SenticNet 4: A semantic resource for sentiment analysis based on conceptual primitives. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp 2666–2677

10. Cambria E, Li Y, Xing FZ, Poria S, Kwok K (2020) SenticNet 6: Ensemble application of symbolic and subsymbolic AI for sentiment analysis. CIKM20, Oct 20–24

11. Cao K, Wang G, Han D, Ning J, Zhang X (2015) Classification of uncertain data streams based on extreme learning machine. Cognitive Computation 7(1):150–160

12. Cer D, Yang Y, Kong Sy, Hua N, Limtiaco N, John RS, Constant N, Guajardo-Cespedes M, Yuan S, Tar C, et al. (2018) Universal sentence encoder for English. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp 169–174

13. Chang WC, Yu HF, Zhong K, Yang Y, Dhillon I (2019) X-BERT: eXtreme multi-label text classification with BERT. arXiv preprint arXiv:1905.02331

14. Davidson T, Warmsley D, Macy M, Weber I (2017) Automated hate speech detection and the problem of offensive language. In: Eleventh International AAAI Conference on Web and Social Media

15. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp 4171–4186

16. Du J, Xu R, He Y, Gui L (2017) Stance classification with target-specific neural attention networks. In: Proceedings of International Joint Conferences on Artificial Intelligence, IJCAI 2017

17. Fiser D, Erjavec T, Ljubesic N (2017) Legal framework, dataset and annotation schema for socially unacceptable online discourse practices in Slovene. In: Proceedings of the First Workshop on Abusive Language Online, pp 46–51

18. Fortunato M, Blundell C, Vinyals O (2017) Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798

19. Gal Y (2016) Uncertainty in deep learning. University of Cambridge 1:3

20. Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: International Conference on Machine Learning, pp 1050–1059

21. Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. In: Advances in Neural Information Processing Systems, pp 1019–1027

22. Guo C, Pleiss G, Sun Y, Weinberger KQ (2017) On calibration of modern neural networks. In: Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, pp 1321–1330

23. Gururangan S, Dang T, Card D, Smith NA (2019) Variational pretraining for semi-supervised text classification. arXiv preprint arXiv:1906.02242

24. He L, Liu B, Li G, Sheng Y, Wang Y, Xu Z (2018) Knowledge base completion by variational Bayesian neural tensor decomposition. Cognitive Computation 10(6):1075–1084

25. Izmailov P, Maddox WJ, Kirichenko P, Garipov T, Vetrov D, Wilson AG (2019) Subspace inference for Bayesian deep learning. arXiv preprint arXiv:1907.07504

26. Kant N, Puri R, Yakovenko N, Catanzaro B (2018) Practical text classification with large pre-trained language models. arXiv preprint arXiv:1812.01207

27. Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

28. Kocijan K, Koskovic L, Bajac P (2019) Detecting hate speech online: A case of Croatian. In: International Conference on Automatic Processing of Natural-Language Electronic Texts with NooJ, Springer, pp 185–197

29. Lakshminarayanan B, Pritzel A, Blundell C (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In: Advances in Neural Information Processing Systems, pp 6402–6413

30. Ljubesic N, Erjavec T, Fiser D (2018) Datasets of Slovene and Croatian moderated news comments. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pp 124–131

31. Ljubesic N, Fiser D, Erjavec T (2019) The FRENK datasets of socially unacceptable discourse in Slovene and English. In: International Conference on Text, Speech, and Dialogue, Springer, pp 103–114

32. Maddox WJ, Izmailov P, Garipov T, Vetrov DP, Wilson AG (2019) A simple baseline for Bayesian uncertainty in deep learning. In: Advances in Neural Information Processing Systems, pp 13132–13143

33. Malmasi S, Zampieri M (2017) Detecting hate speech in social media. arXiv preprint arXiv:1712.06427

34. Marinsek R (2019) Cross-lingual embeddings for hate speech detection in comments. University of Ljubljana, Faculty of Computer and Information Science

35. Martins R, Gomes M, Almeida JJ, Novais P, Henriques P (2018) Hate speech classification in social media using emotional analysis. In: 7th Brazilian Conference on Intelligent Systems (BRACIS), pp 61–66

36. McAuley J, Targett C, Shi Q, Van Den Hengel A (2015) Image-based recommendations on styles and substitutes. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 43–52

37. McInnes L, Healy J, Melville J (2018) UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426

38. Mehdad Y, Tetreault J (2016) Do characters abuse more than words? In: Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp 299–303

39. Miok K (2018) Estimation of prediction intervals in neural network-based regression models. In: 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp 463–468

40. Miok K, Nguyen-Doan D, Robnik-Sikonja M, Zaharie D (2019) Multiple imputation for biomedical data using Monte Carlo dropout autoencoders. In: 2019 E-Health and Bioengineering Conference (EHB), pp 1–4

41. Miok K, Nguyen-Doan D, Skrlj B, Zaharie D, Robnik-Sikonja M (2019) Prediction uncertainty estimation for hate speech classification. In: International Conference on Statistical Language and Speech Processing, Springer, pp 286–298

42. Miok K, Nguyen-Doan D, Zaharie D, Robnik-Sikonja M (2019) Generating data using Monte Carlo dropout. In: 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP), IEEE, pp 509–515

43. Mozafari M, Farahbakhsh R, Crespi N (2019) A BERT-based transfer learning approach for hate speech detection in online social media. In: International Conference on Complex Networks and Their Applications, Springer, pp 928–940

44. Myshkov P, Julier S (2016) Posterior distribution analysis for Bayesian inference in neural networks. Advances in Neural Information Processing Systems (NIPS)

45. Niculescu-Mizil A, Caruana R (2005) Predicting good probabilities with supervised learning. In: De Raedt L, Wrobel S (eds) Proceedings of the 22nd International Machine Learning Conference, ACM Press


46. Pamungkas EW, Patti V (2019) Cross-domain and cross-lingual abusive language detection: A hybrid approach with deep learning and a multilingual lexicon. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pp 363–370

47. Pappas N, Popescu-Belis A (2017) Multilingual hierarchical attention networks for document classification. arXiv preprint arXiv:1707.00896

48. Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers, MIT Press, pp 61–74

49. Plutchik R (1980) Emotion: A psychoevolutionary synthesis. Harper and Row, New York

50. Robnik-Sikonja M, Kononenko I (2003) Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning Journal 53:23–69

51. Rodriguez A, Argueta C, Chen YL (2019) Automatic detection of hate speech on Facebook using sentiment and emotion analysis. In: 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp 169–174

52. Rogers A, Kovaleva O, Rumshisky A (2020) A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327

53. Schmidt A, Wiegand M (2017) A survey on hate speech detection using natural language processing. In: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pp 1–10

54. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958

55. Stappen L, Brunn F, Schuller B (2020) Cross-lingual zero- and few-shot hate speech detection utilising frozen transformer language models and AXEL. arXiv preprint arXiv:2004.13850

56. Sun C, Qiu X, Xu Y, Huang X (2019) How to fine-tune BERT for text classification? In: China National Conference on Chinese Computational Linguistics, Springer, pp 194–206

57. Sun X, Peng X, Ding S (2018) Emotional human-machine conversation generation based on long short-term memory. Cognitive Computation 10(3):389–397

58. Susanto Y, Livingstone A, Ng BC, Cambria E (2020) The hourglass model revisited. IEEE Intelligent Systems 35(5)

59. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems, pp 5998–6008

60. Vezjak B (2018) Radical hate speech: The fascination with Hitler and fascism on the Slovenian webosphere. Solsko Polje 29

61. Wang S, Manning C (2013) Fast dropout training. In: International Conference on Machine Learning, pp 118–126

62. Warner W, Hirschberg J (2012) Detecting hate speech on the world wide web. In: Proceedings of the Second Workshop on Language in Social Media, Association for Computational Linguistics, pp 19–26


63. Waseem Z, Hovy D (2016) Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In: Proceedings of the NAACL Student Research Workshop, pp 88–93

64. Wiedemann G, Yimam SM, Biemann C (2020) UHH-LT & LT2 at SemEval-2020 Task 12: Fine-tuning of pre-trained transformer networks for offensive language detection. arXiv preprint arXiv:2004.11493

65. Xu Y, Qiu X, Zhou L, Huang X (2020) Improving BERT fine-tuning via self-ensemble and self-distillation. arXiv preprint arXiv:2002.10345

66. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp 1480–1489

67. Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 694–699

68. Zaremba W, Sutskever I, Vinyals O (2014) Recurrent neural network regularization. arXiv preprint arXiv:1409.2329

69. Zhang X, Song S, Wu C (2013) Robust Bayesian classification with incomplete data. Cognitive Computation 5(2):170–187

70. Zhu L, Laptev N (2017) Deep and confident prediction for time series at Uber. In: IEEE International Conference on Data Mining Workshops (ICDMW), pp 103–110