Buzz Prediction in Social Media via Random Forest Approach


Duygu Çelik Ertuğrul1 and Mohammad Ali Haji Hasan Khonsari2

Authors’ Affiliations:

(a) Associate Professor, Department of Computer Engineering, Engineering Faculty, Eastern Mediterranean University, Famagusta, North Cyprus; [email protected]

(c) Department of Computer Engineering, Engineering Faculty, Eastern Mediterranean University, Famagusta, North Cyprus; [email protected]

Author responsible for correspondence:

Assoc. Prof. Dr. Duygu Çelik Ertuğrul

e-mail: [email protected], [email protected]

Tel: +90 392 6301178

Fax: +90 392 3650240

Highlights:

• A buzz is a notable shared comment or post (e.g. text, a video, or a picture), a newly opened topic, or a behavior that first emerges on social media and attracts exceptional attention from the masses.

• Social media posts that go viral as buzzes evoke intensive interaction over a short period of time and sometimes cause serious consequences for individuals and companies in the physical world.

• The majority of research studies on buzz prediction are built on machine learning techniques such as the Radial Basis Function network (RBF network), Artificial Bee Colony (ABC), Support Vector Machines (SVM), and Adaptive Boosting (AB).

• In this study, the Random Forest (RF) algorithm is used to predict buzzes on 140,707 instances (almost 140K) of Twitter data, each with 77 features.




• The results of comparative experiments demonstrate that Random Forest predicts buzzes better than the compared methods.

• Additionally, a variable rank ordering approach is used to improve buzz prediction, since the most significant features among all features may have a greater effect on system performance.

• The significant features are the attributes that matter most for identifying whether a created topic is “Buzz” or “Not Buzz”.

• Random Forest achieved the best overall accuracy (99%) and the fastest training time. The Radial Basis Function method proved very competitive, whereas the Support Vector Machines approach achieved only average accuracy.


Buzz Prediction in Social Media via Random Forest Approach

1, *Duygu Çelik Ertuğrul, 2Mohammad Ali Haji Hasan Khonsari

1, 2Department of Computer Engineering, Engineering Faculty, Eastern Mediterranean University, Famagusta, North

Cyprus via Mersin-10, Turkey.

[email protected], [email protected]

Abstract

Purpose – Buzz marketing is a viral marketing technique that emphasizes maximizing the word-of-mouth potential of a campaign or product, through discussions among consumers' family and friends or through larger discussions on social media platforms. Buzz prediction makes it possible to analyze the standing of brands through the posts shared by their potential customers or users on social networks. This paper examines buzz prediction in social media posts using machine learning techniques, determines the most significant features, and reviews research studies in this field to evaluate competing predictions. In addition, to make buzz prediction more effective and system performance more efficient, a variable ranking approach is applied to infer the most important features among all features.

Design/methodology/approach – A buzz is a notable shared comment or post (e.g. text, a video, or a picture), a newly opened topic, or a behavior that first emerges on social media and attracts exceptional attention from the masses. Social media posts that go viral as buzzes evoke intensive interaction over a short period of time and sometimes cause serious consequences for individuals and companies in the physical world. In this study, 140,707 posts generated by Twitter users over a 7-day period were used in the experimental studies. The instances are stored in a standard comma-separated values (CSV) format. Each instance is described by 77 features that trace the evolution of the primary features through time. These 77 features are grouped into 11 categories. For each instance, each feature in a category holds the data (min, max, mean, std values) gathered over the 7 days.

Findings – Buzz prediction on social media channels such as Twitter is a challenging task that involves identifying, from consumer posts, the features that reveal a buzz. The study shows that buzz prediction can be made more efficient by filtering the most significant features and analyzing them with machine learning approaches to predict whether a subject is a "Buzz" or "Not Buzz" in real-time data. The Random Forest approach achieved the best overall accuracy (99%) and the fastest training time. The Radial Basis Function approach proved very competitive, whereas the Support Vector Machines approach achieved only average accuracy.

Research limitations/implications – In buzz prediction on social media channels such as Twitter, the key features for predicting a buzz should be examined and inferred from real-time data. These features vary with time, topic, and users. This study focuses on a group of Twitter users, addresses a long-researched theoretical puzzle, and offers interesting research opportunities for future studies.

Practical implications – In the comparative experimental studies, Random Forest proved to be a good approach to the buzz prediction problem because it creates multiple decision trees and then combines them to obtain a more accurate prediction. In addition, a variable ranking approach was implemented, which allows the most significant features to be inferred so as to achieve good classification accuracy.

Originality/value – The majority of research studies on buzz prediction are built on machine learning techniques such as the Radial Basis Function network, Artificial Bee Colony, Support Vector Machines, and Adaptive Boosting. In this study, the Random Forest algorithm is used to predict buzzes on 140,707 instances of Twitter data with 77 features. The results of comparative experiments demonstrate that Random Forest predicts buzzes better than the compared methods. In addition, to the best of our knowledge, this is the first article to examine a variable rank ordering approach for improving buzz prediction, which has a greater impact on system performance. Variable rank ordering is treated as a filtering task that finds the most significant features in posts. These features are the attributes that matter most for identifying whether a created topic is “Buzz” or “Not Buzz”.

Keywords Buzz prediction, Random Forest, Support Vector Machine, Twitter, Neural Network, Machine Learning

1. Introduction

Social media is a huge market in which millions of people interact in a rapidly growing world. As mobile and networking technologies evolve, the use of social media is increasing rapidly. Currently, the most popular social networking sites in the world include Facebook, Twitter, LinkedIn, Google+, and Instagram. Facebook accounts for more than 20% of the world's total internet traffic [1]. On Twitter, around 6,000 tweets are posted every second on average, which corresponds to about 500 million tweets per day and around 200 billion tweets per year [2-3]. Through Twitter services, users can share posts and interact through messages called "tweets". Tweets are limited to approximately 140 characters. Over 100 million users sent about 340 million tweets each day, and the service handled an average of 1.6 billion search queries per day [2]. A vast number of events and topics are discussed on Twitter, covering areas such as current trends, marketing strategies, and personal matters.

A buzz is a notable shared comment or post (e.g. text, a video, or a picture), a newly opened topic, or a behavior that first emerges on social media and attracts exceptional attention from the masses. A buzz can lead to user reactions such as liking, sharing, or commenting on posts, participating in events, or purchasing products. In most cases this exceptional attention lasts a few days or weeks. In rare cases it lasts several months [4].

Besides this, social media is a huge market, and brands are an important force in it for reaching the masses. Brands aim to communicate effectively with the masses and, like individuals, try to maintain a lasting presence on social media. Each brand uses certain marketing methods to increase its awareness by creating trends, running campaigns, and organizing events. Buzz, viral, grassroots, user-generated content, and social media marketing methods are all defined as different marketing concepts under one heading: Word of Mouth Marketing [5]. Buzz marketing is an action, activity, or event designed to make people curious, to create a big effect, and to excite people about a product or service. Viral marketing differs slightly from buzz marketing: it shares a message via e-mail or video over the Internet in a way that lets it spread quickly among social media users or potential customers [5]. The goal of both is to spread the marketing messages of brands rapidly using the power of the Web.

The vast majority of posts or conversations on social media cause little to no discussion, while some start trends and become viral. Posts that go viral as buzzes evoke intensive interaction over a short period of time and sometimes cause serious consequences for individuals and companies in the physical world [4]. Thus, detecting buzzes early can help alleviate or prevent the negative consequences of social media outbreaks for companies or individuals by giving them a chance to react early. Moreover, it is important to follow the results of campaigns or digital advertisements and to recognize the positive or negative impact of products or services on social media.

The majority of research studies on buzz prediction are built on machine learning techniques such as the Radial Basis Function network (RBF network) [6], Artificial Bee Colony (ABC) [7], Support Vector Machines (SVM) [8], and Adaptive Boosting (AB) [9]. In this study, the Random Forest (RF) [10,11] algorithm is used to predict buzzes on 140,707 instances (almost 140K) of Twitter data. Accuracy and error rate are used as performance metrics. Additionally, RF is applied together with a variable ranking approach, which finds the most significant features (attributes) of posts and thus helps identify whether a created topic is “Buzz” or “Not Buzz”. RF is preferred in this study because it builds multiple decision trees and merges them to obtain a more accurate and stable prediction.

Because few research studies on buzz prediction exist in the literature, and similar studies that used the same data set are difficult to find, the performance of RF (in terms of accuracy and error rate) is compared against SVM (with three different kernels: Polynomial, Radial-Linear, and Sigmoid) and RBF. The results of the comparative analysis show that RF has the best overall results in accuracy, error rate, and training time. Finally, similar research studies on this topic that apply the aforementioned machine learning methods to different data sets are also assessed in the empirical findings section according to the performance metrics used.


The rest of the paper is organized as follows. Section 2 discusses various machine learning approaches used in buzz prediction. Section 3 discusses similar research studies found in the literature. Section 4 presents the data set used and its description. Section 5 discusses the methodology and the empirical findings of this study. Finally, Section 6 presents the summary and conclusions.

2. Background

This section discusses machine learning approaches commonly used for buzz prediction in the literature. In addition, it explains why RF was preferred and what RF-based prediction analysis contributes. Well-known machine learning algorithms used in buzz prediction include the RBF network, SVM, ABC, and AB, which are described below.

The RBF network is an adaptive network model formulated by Broomhead and Lowe in 1988 [6]. RBF networks are widely used for function approximation, curve fitting, and time-series prediction [12]. One of the vital factors in an RBF network is the choice of the number of neurons in the hidden layer, where every neuron has a specific activation function, because this choice affects both the complexity of the network and its generalization capability. The most commonly used activation function is the Gaussian function [12], which has a spread parameter that controls the function's shape and behavior.
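As a minimal sketch of the Gaussian activation and its spread parameter, the Python snippet below evaluates a single hidden neuron. The exact form exp(-||x - c||^2 / (2 * spread^2)) is a common convention and is an assumption here, since the text does not state the formula; the names `gaussian_rbf`, `center`, and `spread` are illustrative only.

```python
import math

def gaussian_rbf(x, center, spread):
    """Gaussian activation of one hidden neuron (assumed form, not taken from the paper)."""
    if spread <= 0:
        raise ValueError("spread must be positive")
    # Squared Euclidean distance between the input and the neuron's center.
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * spread ** 2))

# The activation equals 1 at the center and decays with distance;
# a larger spread makes the decay slower.
at_center = gaussian_rbf([1.0, 2.0], [1.0, 2.0], spread=1.0)
narrow = gaussian_rbf([2.0, 2.0], [1.0, 2.0], spread=0.5)
wide = gaussian_rbf([2.0, 2.0], [1.0, 2.0], spread=2.0)
```

With the same input and center, the neuron with the smaller spread responds more weakly, which is how the spread controls how local each neuron's response is.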

SVM, first proposed by Vapnik [13], is a supervised learning model with associated learning algorithms that analyze data for classification and regression. An SVM separates the data points of different classes by a gap (margin) that is as wide as possible. The data points that lie on the boundary of this gap are the support vectors. The method is based on the theory of decision planes, which define decision boundaries that separate sets of objects belonging to different classes.

The ABC algorithm is an efficient population-based algorithm that can scan a wide range of solutions by spreading to many different points at the same time, and it performs well in local search. In ABC, the search is carried out by artificial bees that assess food sources [14]. The approach has been modified and investigated over the years for different application domains, and good results have been reported in a variety of domains [15,16]. ABC is based on how bees in a colony intelligently communicate with each other in order to detect and reach food sources.

AdaBoost [9] is another robust machine learning algorithm. It is commonly used for classification and can be applied in numerous fields such as biology, computer vision, and speech processing [17]. Unlike other powerful classifiers such as SVM, AdaBoost can attain similar classification results with much less tuning of parameters or settings [18].

RF, or random decision forests (RDFs), developed by Leo Breiman [10,11], is an ensemble learning method for classification, regression, and other tasks. The approach operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
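The aggregation step described above can be sketched in a few lines of Python. This illustrates only how the outputs of already-trained trees are combined; it does not build the trees, and the function names and sample votes are hypothetical.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the mode of the trees' class votes.

    Ties are resolved in favor of the label that was counted first.
    """
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_outputs):
    """Return the mean of the trees' numeric predictions."""
    return mean(tree_outputs)

# Hypothetical outputs of five classification trees and three regression trees.
votes = ["Buzz", "Not Buzz", "Buzz", "Buzz", "Not Buzz"]
label = aggregate_classification(votes)
estimate = aggregate_regression([0.5, 1.5, 1.0])
```

Here three of the five trees vote "Buzz", so the forest outputs "Buzz"; for regression, the forest outputs the average of the three tree predictions.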

Many studies show that RFs have impressive predictive performance in regression and classification problems in various fields, including ecology, financial forecasting, remote sensing, and genetic and biomedical analysis [19-25]. RF is considered an appropriate model for handling data sets with a large number of inputs and imbalanced data sets, and it provides an empirical way to trace variable interactions [26,27]. In the literature, RF has been compared with other machine learning methods such as Partial Least Squares regression (PLS regression) [28, 29], SVM, and neural networks (NNs) [30, 31]. In these comparisons, RFs characteristically show comparable or even better prediction performance. Besides this appealing prediction performance, the availability of the randomForest package in R [32] is another reason for the widespread use of RF. For these reasons, RF was preferred in this study. Research studies that address buzz prediction with the machine learning algorithms mentioned above are discussed in the next section.

3. Literature Review

This section discusses comparable research studies in the literature that address buzz prediction using the aforementioned machine learning methods with different data sets.

Rastogi M and Bist AS used machine learning methods to classify features of time windows on Twitter, a micro-blogging social media platform, and to determine whether or not these time windows are followed by buzz events [33]. The researchers used Naïve Bayes [34] and SVM and compared the results of the two techniques. They applied SVM with different kernels, namely Linear, Polynomial, and RBF. Their performance results and ours are compared in terms of accuracy and error rate in the empirical findings section.


Mayuri M et al. used an RBF network for buzz prediction on real discussion data from Twitter by considering particular attributes of the discussions [35]. The researchers compared the performance of the RBF network with that of multiple regression and found that multiple regression, being a statistical method, had a higher error rate. They therefore argued that the RBF network is the more suitable option for buzz prediction in social media, since it is more accurate, faster to train, and more reliable than multiple regression. In addition, the RBF network can work with non-linear values, which allows researchers to deal with complex data. Since Twitter contains a large amount of non-linear data, the RBF network is regarded as well suited to its analysis. It is a feed-forward network trained with a supervised training algorithm and was found to perform faster than back-propagation networks.

Morchid M, Linares G, and Dufour R proposed a new method to forecast bursty events using content-related indices [36]. The indices concern “popularity”, “expressivity”, and “singularity”. The popularity index is based on the analysis of RSS streams. Expressivity uses a dictionary of words annotated with their expressivity load. Singularity represents outlying topic associations estimated via a Latent Dirichlet Allocation (LDA) model [37]. Bursty events are predicted with a neural network that uses the three features to predict the number of retweets of a tweet on Twitter. Their approach maps source documents into a reduced semantic space in which some features can be determined by the LDA analysis. LDA models a document (represented as a bag of words) as a mixture of latent topics. Each latent topic is characterized by a distribution of word probabilities. After the LDA analysis, a set of topics is obtained, each with a set of words and their emission probabilities.

A retweet is a reposted or forwarded message on Twitter. Through retweeting, Twitter allows its users to share, on a massive scale, tweets that are potentially interesting to others. Morchid M et al. [38] analyzed specific tweet features through Principal Component Analysis (PCA) [39] to better understand the behavior of highly forwarded tweets as opposed to those retweeted only a few times. They then aimed to detect massively retweeted posts automatically. They used machine learning techniques such as Naïve Bayes [34] and SVM to classify posts and reported recall, precision, and F1 values based on an analysis of selected features. Based on the results obtained with both approaches, the researchers concluded that SVM yields better performance than the Naïve Bayes method.

Aswani R, Ghrera SP, Kar AK, and Chandra S [40] proposed a hybrid model that combines ABC and k-nearest neighbors. In their model, ABC is integrated with k-nearest neighbors to identify outliers and segregate buzz on Twitter. The researchers used social media data from Twitter; the data set consists of parameters describing discussions. It is a regression/classification data set that is publicly accessible from the UCI Machine Learning Repository [41] and is used for detecting outliers in the form of “buzz” in discussions. The data set groups the 77 features of the Twitter instances into 11 categories: “number of created discussions”, “author increase”, “attention paid”, “burstiness level”, “atomic containers tweet”, “attention level tweet”, “contribution sparseness”, “author interaction”, “number of authors”, “mean discussions length in time”, and “average number of discussions”. Their hybrid model uses these 11 categories to extract the outliers and so separate buzz discussions from the remaining topic discussions. The researchers argued that their hybrid model is especially useful in domains such as e-commerce and digital and viral marketing, and that it separates buzz discussions, as outliers deviating from normal discussions, with an accuracy of 98.37%.

Deusser C, Jansen N, Reubold J, Schiller B, Hinz O, and Strufe T [42] also discussed early detection of buzzes as a way to mitigate or prevent the negative effects of large-scale social media outbreaks on companies and individuals by giving them a chance to react at an early stage. Their data set contains over 100,000 posts, which the researchers collected from 76 Facebook pages and annotated. They applied logistic regression to the metadata of the posts to identify the features that characterize posts as buzzes. The features fall into two groups: general metadata and temporal features. The general metadata features are “page likes”, “post likes”, “post shares”, “comments”, “first level”, “replies”, “reply ratio”, “replied to ratio”, “users”, “old users”, “new users”, “new users ratio”, “repeat users”, “comment likes”, “author comments”, and “content length”. The temporal features considered during buzz prediction are “first hour interval”, “first hour comments”, “last hour”, “max hour comments”, “max hour”, “max derivative”, “min derivative”, “completeness index”, and “extremas”. The direct general metadata features of a post are its numbers of likes, shares, and comments.

In addition, the researchers took the popularity of pages into account by adding features in which these counts are divided by the average number of likes, reshares, and comments for the particular page. For instance, “postlikes” is the number of likes of a post divided by the average number of post likes for the page. The researchers first identified the properties that define a buzz using logistic regression. They then trained several classifiers and applied machine learning classification procedures to demonstrate the potential of automated prediction, applying RF, SVM, and AB to the data set. They concluded that logistic regression is useful for interpreting the results and can be used to derive recommendations for human decision makers directly, whereas the ML techniques yield superior classification results.

In this study, we used a similar data set from the UCI repository, titled Buzz prediction on Twitter (Absolute Labeling folder, with the threshold sigma equal to 500) [43], which is classified with RF and SVM (with three different kernels: Polynomial, Radial-Linear, and Sigmoid). The prediction performances of the two techniques are compared. Parameters were adjusted, and parameter selection was performed, to obtain optimal classification results. Accuracy is first obtained using SVM with the linear, polynomial, and RBF kernels, and then RF is applied. The comparison results are discussed in Section 5. A more detailed description of the data set is given in the next section.

4. Description of Dataset Used

A buzz is often characterized by many people discussing a certain topic and commenting on a given post. The data set used in this study is taken from the UC Irvine Machine Learning Repository website [43], under the topic “Buzz Prediction in Social Media (Twitter Data set)”, which concerns the binary classification “Buzz / no Buzz” in the Twitter domain. It is a publicly available regression/classification data set from the UCI repository created by Kawala et al. [44], and it is used for detecting outliers in the form of “buzz”. The data set contains 140,707 instances (almost 140K) and 77 features. The instances are stored in a standard comma-separated values (CSV) format. Each instance is described by 77 features that trace the evolution of the primary features through time. These 77 features are grouped into 11 categories (Table 1). For each instance, each feature in a category holds the data (min, max, mean, std values) gathered over 7 days.

Hence, the name of each feature is suffixed with the relative time of observation. For instance, the value of the feature “Nb_Active_Discussion” at time t is given in “Nb_Active_Discussion_t”. Each instance covers seven days of observation for a particular topic (e.g. overclocking). Considering the two days following this initial observation, if there are at least 500 additional active discussions per day (on average, with respect to the initial observation), then the predicted attribute “Buzz” is True. Observations are independent and identically distributed.
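The labeling rule can be illustrated with a short Python sketch. The text leaves the exact computation open, so this shows one possible reading, stated as an assumption: the mean daily number of active discussions over the following days is compared with the mean over the seven observed days, and the difference is tested against the threshold of 500. The function name and the sample values are hypothetical.

```python
from statistics import mean

BUZZ_THRESHOLD = 500  # "at least 500 additional active discussions per day"

def is_buzz(nad_observed, nad_following, threshold=BUZZ_THRESHOLD):
    """Label a topic under an assumed reading of the rule.

    nad_observed:  daily active-discussion counts during the initial observation
    nad_following: daily active-discussion counts on the days that follow
    """
    increase = mean(nad_following) - mean(nad_observed)
    return increase >= threshold

# Hypothetical counts: seven observed days, then two following days.
observed = [100, 120, 110, 130, 125, 140, 150]   # mean is 125
quiet = is_buzz(observed, [160, 170])            # increase of 40 per day
bursty = is_buzz(observed, [700, 900])           # increase of 675 per day
```

Under this reading only the second topic is labeled “Buzz”; a different definition of the baseline would change the labels near the threshold.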

Table 1. Attribute descriptions of the data set used [43].

| # | Category | Type | Explanation | 77 attributes over the 7 observed days (columns in data set) |
|---|---|---|---|---|
| 1 | Number of Created Discussions (NCD) | Numeric, integers only | Number of discussions created at time step t that involve the instance's topic. | [0, 6]: NCD_0, NCD_1, NCD_2, NCD_3, NCD_4, NCD_5, NCD_6 |
| 2 | Author Increase (AI) | Numeric, integers only | Number of new authors interacting on the instance's topic at time t (i.e. its popularity). | [7, 13]: AI_0, AI_1, AI_2, AI_3, AI_4, AI_5, AI_6 |
| 3 | Attention Level, measured with Number of Authors (AS(NA)) | Numeric, real in [0,1] | The attention paid to the instance's topic on a social media. | [14, 20]: AS(NA)_0, AS(NA)_1, AS(NA)_2, AS(NA)_3, AS(NA)_4, AS(NA)_5, AS(NA)_6 |
| 4 | Burstiness Level (BL)* | Numeric, defined on [0,1] | The burstiness level for a topic z at a time t, defined as the ratio of the NCD and NAD features. | [21, 27]: BL_0, BL_1, BL_2, BL_3, BL_4, BL_5, BL_6 |
| 5 | Number of Atomic Containers (NAC) | Numeric, integer | Total number of atomic containers generated through the whole social media on the instance's topic until time t. | [28, 34]: NAC_0, NAC_1, NAC_2, NAC_3, NAC_4, NAC_5, NAC_6 |
| 6 | Attention Level, measured with Number of Contributions (AS(NAC)) | | A measure of the attention paid to the instance's topic on a social media. | [35, 41]: AS(NAC)_0, AS(NAC)_1, AS(NAC)_2, AS(NAC)_3, AS(NAC)_4, AS(NAC)_5, AS(NAC)_6 |
| 7 | Contribution Sparseness (CS) | | A measure of the spreading of contributions over discussions for the instance's topic at time t. | [42, 48]: CS_0, CS_1, CS_2, CS_3, CS_4, CS_5, CS_6 |
| 8 | Author Interaction (AT) | Numeric, integer | Average number of authors interacting on the instance's topic within a discussion. | [49, 55]: AT_0, AT_1, AT_2, AT_3, AT_4, AT_5, AT_6 |
| 9 | Number of Authors (NA) | Numeric, integer | Number of authors interacting on the instance's topic at time t. | [56, 62]: NA_0, NA_1, NA_2, NA_3, NA_4, NA_5, NA_6 |
| 10 | Average Discussions Length (ADL) | Numeric, real | Average length of a discussion belonging to the instance's topic. | [63, 69]: ADL_0, ADL_1, ADL_2, ADL_3, ADL_4, ADL_5, ADL_6 |
| 11 | Number of Active Discussions (NAD) | Numeric, integer | Number of discussions involving the instance's topic until time t. | [70, 76]: NAD_0, NAD_1, NAD_2, NAD_3, NAD_4, NAD_5, NAD_6 |

*In statistics, burstiness refers to intermittent increases and decreases in the activity or frequency of an event.
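The column layout in Table 1 follows a regular pattern: 11 categories, each observed at 7 time steps, stored in category order. The Python sketch below reproduces the 77 column names and the column range of a category from that pattern. The abbreviations come from Table 1, while the helper names `feature_names` and `column_range` are illustrative and not part of the data set.

```python
# Category abbreviations in the order used in Table 1.
CATEGORIES = ["NCD", "AI", "AS(NA)", "BL", "NAC", "AS(NAC)",
              "CS", "AT", "NA", "ADL", "NAD"]
DAYS = 7  # each category is observed at time steps 0..6

def feature_names(categories=CATEGORIES, days=DAYS):
    """Return the column names in data-set order, e.g. NCD_0 ... NAD_6."""
    return [f"{cat}_{t}" for cat in categories for t in range(days)]

def column_range(category, categories=CATEGORIES, days=DAYS):
    """Return the zero-based (first, last) column indices of one category."""
    start = categories.index(category) * days
    return start, start + days - 1

names = feature_names()
bl_columns = column_range("BL")
```

For example, `column_range("BL")` gives the range listed for the burstiness level in Table 1, and `names[56]` is the first Number of Authors column.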

5. Methodology Used and Empirical Findings

As mentioned before, this study evaluates and compares the performance of RF against SVM on the data set [43] for predicting buzzes. SVM is considered with different kernels: linear, polynomial, and RBF. In addition, RF is used to perform variable rank ordering, which is an efficient strategy for detecting the features that are most significant in identifying whether a created topic is a buzz or not. R [32] and suitable packages were used in the experimental studies. We applied the RF and SVM methods to the same data set without dimensionality reduction. The user interface of RF is consistent with that of other classification functions such as nnet [45] and svm (in the e1071 package) [46]. It offers a formula interface, and predictors can also be specified as a matrix or data frame via the x argument, with responses as a vector via the y argument. RF performs classification if the response is a factor, that is, if the response is not continuous. If the response is not a factor, RF performs regression. If the response is not specified, RF performs unsupervised learning. Currently, RF does not handle ordinal categorical responses. Categorical predictor variables also need to be specified as factors so that they are not wrongly treated as continuous. The RF function returns an object of class "randomForest". In the rest of this section, RF, SVM with three different kernels, and RBF are compared in terms of error and accuracy. Accuracy is the proportion of correctly classified instances among all instances. Error is one minus accuracy.
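The two metrics can be written out as a small Python sketch. It assumes the usual definition of accuracy, namely the fraction of instances whose predicted label matches the true label, with error as its complement; the label vectors are made-up examples, not results from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def error_rate(y_true, y_pred):
    """Error is one minus accuracy."""
    return 1.0 - accuracy(y_true, y_pred)

# Made-up labels: 1 = Buzz, 0 = Not Buzz.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 1, 1]
acc = accuracy(y_true, y_pred)    # 6 of 8 predictions match
err = error_rate(y_true, y_pred)
```

In this toy case six of the eight predictions are correct, so the accuracy is 0.75 and the error is 0.25.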

Rastogi M and Bist AS [33] used SVM with different kernels (Linear, RBF, Polynomial 3, and Polynomial 4) to predict buzzes on the same data set that we used in our experiments. However, the researchers used only 2,000 instances with 77 features. Their results are given in Table 2.

Table 2. Accuracy obtained by the researchers [33] using SVM with different kernels, including RBF.

| Kernel | Accuracy | Training set | Testing set |
|---|---|---|---|
| Linear | 0.927 | 1000 | 1000 |
| RBF | 0.958 | 1000 | 1000 |
| Polynomial 3 | 0.92 | 1000 | 1000 |
| Polynomial 4 | 0.923 | 1000 | 1000 |

Table 3. Error and accuracy obtained using RF, SVM with three different kernels, and RBF.

| Type | Error | Accuracy | Training set | Testing set |
|---|---|---|---|---|
| SVM-RBF | 0.341 | 0.66 | 93805 | 46902 |
| SVM-Linear | 0.361 | 0.64 | 93805 | 46902 |
| SVM-Polynomial | 0.762 | 0.14 | 93805 | 46902 |
| RBF | 0.061 | 0.94 | 93805 | 46902 |
| RF | 0.001 | 0.99 | 93805 | 46902 |
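The training and testing set sizes in Table 3 add up to the 140,707 instances of the data set and match a split of roughly two thirds to one third. The Python sketch below reproduces those sizes. The source does not say how the split was drawn, so the two-thirds fraction is inferred from the table, and the random shuffling, the seed, and the function names are assumptions made for illustration.

```python
import random

def split_sizes(n_instances, train_fraction=2 / 3):
    """Return (n_train, n_test) for the given training fraction."""
    n_train = round(n_instances * train_fraction)
    return n_train, n_instances - n_train

def split_indices(n_instances, train_fraction=2 / 3, seed=0):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_train, _ = split_sizes(n_instances, train_fraction)
    return indices[:n_train], indices[n_train:]

n_train, n_test = split_sizes(140707)
train_idx, test_idx = split_indices(140707)
```

Keeping the two index lists disjoint ensures that no instance used for training is also used to measure accuracy.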


In our study, RF and SVM with three different kernels are used. The kernel types considered are Polynomial, Radial-Linear, and Sigmoid. In addition, the same data set is used with all of its instances (140,707) to improve accuracy. There are two class labels, “0” and “1”: “0” represents a non-buzz event and “1” represents a buzz event.

In the experimental studies, the overall dataset is divided into two sets: a training set and a test set. The training
set is used to train the system, and the test set is then used to measure the accuracy of the Buzz prediction. Together,
the training and test sets used in this study contain all 140707 instances over 77 attributes. Table 3 presents the
resulting error and accuracy values. For RF, the error is 0.001 and the corresponding accuracy is 0.99. The error and
accuracy values are also shown in Figure 1:

Figure 1. (a) RMSE Error and (b) Accuracy

As shown in Figure 1, RF outperformed the other approaches, as it has a lower Root Mean Square Error
(RMSE) value than any of them.
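Figure 1 labels the error axis as RMSE. The paper does not state how its error values were computed, so the following is only a sketch of how RMSE relates to the misclassification rate under the assumption of hard 0/1 predictions: each squared difference is then either 0 or 1, so the mean squared error equals the misclassification rate and the RMSE is its square root. The function names and toy labels are invented for illustration.

```python
import math


def error_rate(y_true, y_pred):
    """Fraction of instances whose predicted label differs from the true one."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty sequences of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Toy labels: 1 = Buzz, 0 = Non-Buzz (invented); two of eight are wrong.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 0, 1, 1]
print(error_rate(y_true, y_pred), rmse(y_true, y_pred))  # 0.25 0.5
```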

Figure 2. Top ten significant features determined from the 77 features.

Additionally, variable rank ordering may lead to better Buzz prediction, since the top-ranked variables may
contribute more to an accurate prediction than the remaining variables. In our experimental studies, RF is
therefore also used to obtain a variable rank ordering, which is an efficient strategy for detecting the features that
matter most in deciding whether a topic is a Buzz or not. Figure 2 illustrates the 10 most significant of the 77
features. These crucial features are x29(BL_6), x57(NA_0), x50(AT_0), x71(NAD_0), x15(AS(NA)_0), x64(ADL_0),
y01(NCD_1), x22(BL_0),


x08(AI_0), and x36(AS(NAC)_0). “BL_6” is the burstiness level for a topic z at a time t, defined as the ratio of
NCD and NAD. “NA_0” is the number of authors interacting on the instance's topic at time t. “AT_0” is the
average number of authors interacting on the instance's topic within a discussion. “NAD_0” is the number of
discussions involving the instance's topic until time t. “AS(NA)_0” is a measure of the attention paid to the
instance's topic on a social media platform. “ADL_0” directly measures the average length of a discussion
belonging to the instance's topic. “NCD_1” measures the number of discussions created at time step t and
involving the instance's topic. "BL_0" is the statistic observed on the first day, whereas "BL_6" is the statistic
observed on the seventh day; both measure the burstiness level for a topic z at a time t. The 11 categories of the
77 attributes in the data set, with their descriptions, are available in [43].
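The paper does not state which importance measure produced the ranking in Figure 2. As one possible illustration of variable rank ordering, the following Python sketch uses permutation importance: a feature's column is shuffled and the resulting drop in accuracy is taken as its importance. This is a swapped-in, model-agnostic technique and not necessarily the authors' procedure; the synthetic data, the stand-in toy_model, and the feature names f0 to f2 are all invented.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def rank_features(names, importances, top_k=10):
    """Return the top_k (name, importance) pairs, most important first."""
    order = sorted(range(len(names)), key=lambda j: importances[j], reverse=True)
    return [(names[j], importances[j]) for j in order[:top_k]]


# Synthetic data (invented): only feature "f0" determines the label.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]


def toy_model(row):
    """Stand-in for a trained classifier; it looks at feature 0 only."""
    return 1 if row[0] > 0.5 else 0


names = ["f0", "f1", "f2"]
importances = permutation_importance(toy_model, rows, labels)
ranking = rank_features(names, importances, top_k=2)
print(ranking)
```

Because the stand-in model ignores f1 and f2, shuffling them changes nothing and their importance is zero, while f0 is ranked first. With a trained forest in place of toy_model, the same loop would yield a ranking over all 77 attributes.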

Because many different features describe the buzz case, the dataset may be seriously imbalanced. Hence, RF has
been preferred in order to capture all inherent phenomena in the dataset. Since the data set contains various
attributes that describe the buzz case, a ranking analysis was necessary to find the most significant attributes for
identifying whether a created topic is a buzz or not. To the best of our knowledge, this is the first study that
applies a feature ranking methodology to find the most significant features indicating a buzz in this social
media posts dataset.

The results of the comparative experimental studies demonstrate that RF is the most accurate method for
predicting buzz, and that it also yields a ranked list of significant features that can be used to obtain better
accuracy. The RF approach gives the best overall accuracy, 99%, with the fastest training time. The RBF method
proved to be very competitive, while the SVM approach performed only moderately. In this study, it is also
observed that Buzz is a dynamic event in which the basis of prediction could be modelled on the content as well as
the forest. Finally, this study extends the use of faster and more reliable algorithms for buzz prediction, especially
on comments about products and brands in social media.

6. Conclusion

In this paper, a method for predicting buzz cases on a social media platform, Twitter, through the use of the
Random Forest (RF) method is proposed. The performance of this method was evaluated and compared with the
performance of two other methods: Support Vector Machine (SVM) with three different kernels, and Radial Basis
Function (RBF). The results of the experimental studies showed that RF gives better results in terms of accuracy
and training time than the other methods.

Additionally, in this study, RF is used to obtain a variable rank ordering, which is an efficient strategy for
detecting the most significant features for identifying whether a topic is a Buzz or not. The variable ordering
obtained suggests that research in this area should be more comprehensive; that is, the importance of the features
should be taken into account in order to obtain better performance in identifying buzz events. Random Forest
achieved an accuracy of 0.99, which shows that it is a fast and reliable approach for buzz prediction in social
media. It is also inferred that Buzz is a dynamic event in which the basis of prediction could be modelled on the
content as well as the forest.

Future work is expected to look at Buzz activity on social media platforms other than Twitter. In addition,
tuning the Random Forest algorithm is another option that may yield better performance. Further areas, such as
the economic, social, and political domains, are also candidates for future research in this field.

References

[1]. Statista, Most popular social networks worldwide as of January 2019, ranked by number of active users,

https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/, Last visited: 11-02-2019.

[2]. Twitter Usage Statistics, http://www.internetlivestats.com/twitter-statistics/, Last visited: 11-02-2019.

[3]. Kwak, H., Lee, C., Park, H., & Moon, S. (2010, April). What is Twitter, a social network or a news media?

In Proceedings of the 19th international conference on World wide web (pp. 591-600). ACM.

[4]. Deusser C, Jansen N, Reubold J, Schiller B, Hinz O, Strufe T (2018, April) Buzz in Social Media: Detection

of Short-lived Viral Phenomena. In Companion of The Web Conference 2018, 1443-1449. International World Wide

Web Conferences Steering Committee.

[5]. Douglas K (2015, April) Buzz, Viral or Word of Mouth Marketing: What’s the difference?,

https://martech.zone/whats-the-difference-buzz-viral-word-of-mouth-marketing/, Last visited: 11-02-2019.

[6]. Broomhead DS, Lowe D. Radial basis functions, multi-variable functional interpolation and adaptive

networks. Royal Signals and Radar Establishment Malvern (United Kingdom); 1988 Mar 28.


[7]. Karaboga, D. (2005). An idea based on honey bee swarm for numerical optimization (Vol. 200). Technical

report-tr06, Erciyes university, engineering faculty, computer engineering department.

[8]. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intelligent

Systems and their applications. Jul;13(4):18-28.

[9]. Freund Y, Schapire R, Abe N (1999, Sep) A short introduction to boosting. Journal-Japanese Society For

Artificial Intelligence. 1;14(771-780):1612.

[10]. Breiman L (2001, Oct) Random Forests. Machine learning. 1;45(1):5-32.

[11]. Breiman L, Cutler A (2003) Manual for Setting Up, Using, and Understanding Random Forest, 4.

[12]. Hutchinson J M (1994) A radial basis function approach to financial time series analysis, PhD Thesis,

https://pdfs.semanticscholar.org/19f3/528f7e485bc5b075b2bcf0c5cc7960c10148.pdf, Last visited: 11-02-2019.

[13]. Vapnik V, Mukherjee S (2000) Support vector method for multivariate density estimation. In Advances in

neural information processing systems, 659-665.

[14]. Karaboga D, Basturk B (2007, Nov) A powerful and efficient algorithm for numerical function optimization:

artificial bee colony (ABC) algorithm. Journal of global optimization. 1;39(3):459-71.

[15]. Karaboga D, Akay B (2009, Aug) A comparative study of artificial bee colony algorithm. Applied

mathematics and computation, 1;214(1):108-32.

[16]. Karaboga D, Gorkemli B (2014, Oct) A quick artificial bee colony (qABC) algorithm and its performance

on optimization problems. Applied Soft Computing, 1;23:227-38.

[17]. Kaur R, Chopra V (2015) Implementing Adaboost and Enhanced Adaboost Algorithm in Web

Mining. International Journal of Advanced Research in Computer and Communication Engineering, 4(7).

[18]. Von Kirby P, Gerardo BD, Medina RP (2017, Dec) Implementing Enhanced AdaBoost Algorithm for Sales

Classification and Prediction. International Journal of Trade, Economics and Finance, 8(6).

[19]. Siroky DS (2009) Navigating random forests and related advances in algorithmic modeling. Statistics

Surveys, 3, 147-163.

[20]. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP (2003 Nov) Random forest: a

classification and regression tool for compound classification and QSAR modeling. Journal of chemical information

and computer sciences, 24;43(6):1947-58.

[21]. Jiang P, Wu H, Wang W, Ma W, Sun X, Lu Z (2007, Jul) MiPred: classification of real and pseudo microRNA

precursors using random forest prediction model with combined features. Nucleic acids research, 1;35(suppl_2):

W339-44.

[22]. Kumar M, Thenmozhi M (2006) Forecasting stock index movement: A comparison of support vector

machines and random forest.

[23]. Diaz-Uriarte R (2007, Jul) GeneSrF and varSelRF: a web-based tool and R package for gene selection and

classification using random forest. BMC bioinformatics, 8(1):328.

[24]. Pal M (2005, Jan) Random forest classifier for remote sensing classification. International Journal of Remote

Sensing, 1;26(1):217-22.

[25]. Cutler DR, Edwards Jr TC, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007, Nov) Random forests

for classification in ecology. Ecology, 88(11):2783-92.

[26]. Livingston F (2005) Implementation of Breiman’s random forest machine learning algorithm. ECE591Q

Machine Learning Journal Paper.

[27]. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News, 2 (3): 18–22. R package version 4.6-10.

[28]. Mevik BH, Segtnan VH, Næs T (2004, Nov) Ensemble methods and partial least squares regression. Journal

of Chemometrics: A Journal of the Chemometrics Society,18(11):498-507.

[29]. Xu R (2013) Improvements to random forest methodology.

[30]. Statnikov A, Wang L, Aliferis CF (2008, Dec) A comprehensive comparison of random forests and support

vector machines for microarray-based cancer classification. BMC bioinformatics, 9(1):319.

[31]. Liu M, Wang M, Wang J, Li D (2013, Feb) Comparison of random forest, support vector machine and back

propagation neural network for electronic tongue data classification: Application to the recognition of orange beverage

and Chinese vinegar. Sensors and Actuators B: Chemical,177:970-80.

[32]. Dessau, R. B., & Pipper, C. B. (2008). "R" project for statistical computing. Ugeskrift for laeger, 170(5),

328-330.

[33]. Rastogi M, Bist AS (2016) Analysis of Twitter Data With Machine Learning Techniques, International

Journal of Engineering Sciences & Research Technology, ISSN: 2277-9655

http://www.ijesrt.com/issues%20pdf%20file/Archive-2016/July-2016/117.pdf, Last visited: 11-02-2019.


[34]. Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on

empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46).

[35]. Mayuri M, Sneha M, Kamatchi LP (2015) Prediction of buzz in Social-media using Radial Basis Function

Neural Networks, International Conference on Interdisciplinary Engineering and Sustainable Management Sciences.

[36]. Morchid M, Linares G, Dufour R (2014, May) Characterizing and Predicting Bursty Events: The Buzz Case

Study on Twitter. In LREC, 26, 2766-2771.

[37]. Blei DM, Ng AY, Jordan MI (2003, Jan) Latent Dirichlet Allocation. Journal of Machine Learning Research,

3:993-1022.

[38]. Morchid M, Dufour R, Bousquet PM, Linares G, Torres-Moreno JM (2014 Nov) Feature selection using

principal component analysis for massive retweet detection. Pattern Recognition Letters, 1;49:33-9.

[39]. Chen, X., Chen, C. and Jin, L. (2011). Principal Component Analyses in Anthropological Genetics. Advances

in Anthropology, 01(02), pp.9-14.

[40]. Aswani R, Ghrera SP, Kar AK, Chandra S (2017 Dec) Identifying buzz in social media: a hybrid approach

using artificial bee colony and k-nearest neighbors for outlier detection. Social Network Analysis and Mining,

1;7(1):38.

[41]. UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Buzz+in+social+media+, Last

visited: 11-07-2020.

[42]. Deusser C, Jansen N, Reubold J, Schiller B, Hinz O, Strufe T (2018, Apr) Buzz in Social Media: Detection

of Short-lived Viral Phenomena. In Companion of The Web Conference 2018,

International World Wide Web Conferences Steering Committee, 1443-1449.

[43]. Buzz prediction on Twitter - Absolute Labeling - Threshold Sigma equals 500 in UCI repository,

http://ama.liglab.fr/data/buzz/classification/Twitter/Absolute_labeling/Twitter-Absolute-Sigma-500.names, Last

visited: 11-07-2020.

[44]. Kawala, F., Douzal-Chouakria, A., Gaussier, E., & Dimert, E. (2013, October). Prédictions d'activité dans

les réseaux sociaux en ligne. In 4ième Conférence sur les Modèles et l’Analyse des Réseaux: Approches

Mathématiques et Informatiques, p 16.

[45]. Ripley, B., Venables, W., & Ripley, M. B. (2016). Package ‘nnet’. R package version, 7-3

[46]. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., & Weingessel, A. The e1071 package, 2005. Software

available at http://cran.r-project.org/src/contrib/Descriptions/e1071.html.


AUTHORS

Duygu Çelik Ertuğrul was born in Turkey. She is an associate professor in

the computer engineering department at the Eastern Mediterranean

University, Famagusta, North Cyprus via Mersin-10, Turkey. Her research

topics are related to the Web and Semantics; Composition and Discovery of

Semantic Web Services, Semantic Search Agents, Rule-Based Expert

Systems, m-Health and Healthcare Knowledgebase Expert Systems. She is

one of the organizers of two international workshops and one international

symposium: “IEEE International Workshop on ESAS: E-Health Systems and

Semantic Web” since 2006, “Security of Information and Networks (SIN)

between 2007-2011” and “IEEE COMPSAC Symposium on Web Technologies & Data Analytics

(WEDA) in 2016”. She also supervises several research and development projects supported by

the university, government, and industrial companies. She has organized a special issue for Expert

Systems: The Journal of Knowledge Engineering and served as guest editor. She has published

numerous articles/book chapters/book in several international/national journals and conferences

on the topic of Web Semantics and Mobile Medical Healthcare Services and Systems. She is also

the author of various books or book chapters about ‘Composition or Discovery of Semantic Web

Services’ and ‘Semantic Web based e-Health Services’.

Mohammad Ali Haji Hasan Khonsari was born in Iran. He is a graduate

student in the Computer Engineering Department at the Eastern

Mediterranean University, Famagusta, North Cyprus via Mersin-10, Turkey.

His research topics are related to Web technologies, artificial intelligence,

social network research and applications, data mining applications and

methods.
