Buzz Prediction in Social Media via Random Forest Approach
Duygu Çelik Ertuğrul1 and Mohammad Ali Haji Hasan Khonsari2
Authors’ Affiliations:
aAssociate Professor, Department of Computer Engineering, Engineering Faculty, Eastern
Mediterranean University, Famagusta, North Cyprus; [email protected]
cDepartment of Computer Engineering, Engineering Faculty, Eastern Mediterranean
University, Famagusta, North Cyprus; [email protected]
Author responsible for correspondence:
Assoc. Prof. Dr. Duygu Çelik Ertuğrul
e-mail: [email protected], [email protected]
Tel: +90 392 6301178
Fax: +90 392 3650240
Highlights:
• A buzz is a notable shared comment or post (e.g. a text, a video, or a picture), a newly opened subject, or a behavior that first emerges on social media and attracts exceptional attention from the masses.
• Posts that make buzzes go viral evoke intensive interaction over a short period of time and sometimes have serious consequences for individuals and companies in the physical world.
• The majority of research studies on buzz prediction are based on machine learning techniques such as the Radial Basis Function network (RBF network), Artificial Bee Colony (ABC), Support Vector Machines (SVM), and Adaptive Boosting (AB).
• In this study, the Random Forest (RF) algorithm is used to predict buzzes on 140,707 instances (roughly 140K) of Twitter posts, each with 77 features.
• The results of comparative experiments demonstrate that Random Forest predicts buzzes better than the other methods compared.
• Additionally, a variable rank ordering approach is used to improve buzz prediction, since the most significant of all the features may have the greatest effect on system performance.
• The significant features are the attributes that matter most in identifying whether a created topic is “Buzz” or “Not Buzz”.
• Random Forest showed the best overall accuracy, 99%, together with the fastest training time. The Radial Basis Function method proved very competitive, whereas the Support Vector Machines approach achieved only average accuracy.
1, *Duygu Çelik Ertuğrul, 2Mohammad Ali Haji Hasan Khonsari
1, 2Department of Computer Engineering, Engineering Faculty, Eastern Mediterranean University, Famagusta, North
Cyprus via Mersin-10, Turkey.
[email protected], [email protected]
Abstract
Purpose – Buzz marketing is a viral marketing technique that emphasizes maximizing the word-of-mouth potential of a campaign or product, through discussions among the consumer's family and friends or through larger discussions on social media platforms. Buzz prediction makes it possible to analyze the standing of brands through the posts that their potential customers or users share on social networks. This paper aims to examine buzz prediction in social media posts using machine learning techniques, to determine the most significant features, and to survey research studies in this field in order to evaluate competing predictions. In addition, to make buzz prediction more effective and system performance more efficient, a variable ranking approach is applied to infer the most important of all the features.
Design/methodology/approach – A buzz is a notable shared comment or post (e.g. a text, a video, or a picture), a newly opened subject, or a behavior that first emerges on social media and attracts exceptional attention from the masses. Posts that make buzzes go viral evoke intensive interaction over a short period of time and sometimes have serious consequences for individuals and companies in the physical world. In this study, 140,707 posts generated by Twitter users over a 7-day period were used in the experiments. The instances are stored in a standard comma-separated values (CSV) format. Each instance is described by 77 features, which describe the evolution of the primary features through time. The 77 features are grouped into 11 categories. For each instance, each feature in a category holds the data (min, max, mean, and std values) gathered over the 7 days.
Findings – Buzz prediction on social media channels such as Twitter is a challenging task in which consumer posts are analyzed through different features in order to discover a buzz. The study has shown that buzz prediction can be made more efficient by filtering the most significant features and analyzing them with machine learning approaches to predict whether a subject is "Buzz" or "Not Buzz" in real-time data. The Random Forest approach showed the best overall accuracy, 99%, together with the fastest training time. The Radial Basis Function approach proved very competitive, whereas the Support Vector Machines approach achieved only average accuracy.
Research limitations/implications – For buzz prediction on social media channels such as Twitter, the key features should be examined and inferred from real-time data. These features vary with time, topic, and users. This study focuses on a group of Twitter users, addresses a long-researched theoretical puzzle, and offers interesting opportunities for future research.
Practical implications – In the comparative experiments conducted, Random Forest proved to be a good approach to the buzz prediction problem, because it creates multiple decision trees and then combines them to obtain a more accurate prediction. In addition, a variable ranking approach has been implemented that infers the most significant features in order to achieve good classification accuracy.
Originality/value – The majority of research studies on buzz prediction are based on machine learning techniques such as the Radial Basis Function network, Artificial Bee Colony, Support Vector Machines, and Adaptive Boosting. In this study, the Random Forest algorithm is used to predict buzzes on 140,707 instances of Twitter posts, each with 77 features. The results of comparative experiments demonstrate that Random Forest predicts buzzes better than the other methods compared. In addition, this is the first article to examine a variable rank ordering approach for improving buzz prediction, which has a considerable impact on system performance. Variable rank ordering is treated as a filtering task that finds the most significant features in posts. These features are the attributes that matter most in identifying whether a created topic is “Buzz” or “Not Buzz”.
Keywords Buzz prediction, Random Forest, Support Vector Machine, Twitter, Neural Network, Machine Learning
1. Introduction
Social media is a huge and rapidly growing market in which millions of people interact. As mobile and networking technologies evolve, our use of social media is increasing rapidly. Currently, the most popular social networking sites in the world include Facebook, Twitter, LinkedIn, Google+, and Instagram. Facebook accounts for more than 20% of the world's total internet traffic [1]. On Twitter, around 6,000 tweets are posted every second on average, which corresponds to 500 million tweets per day and around 200 billion tweets per year [2-3]. Through Twitter's services, users can share posts and interact through messages called “tweets”. Tweets are restricted to approximately 140 characters. Over 100 million users sent about 340 million tweets each day, and the service handled an average of 1.6 billion search queries per day [2]. A vast number of events and topics from different fields, such as current trends, marketing strategies, and personal matters, are discussed on Twitter.
A buzz is a notable shared comment or post (e.g. a text, a video, or a picture), a newly opened subject, or a behavior that first emerges on social media and attracts exceptional attention from the masses. A buzz can lead to user reactions such as liking, sharing, or commenting on posts, participating in events, or purchasing products. In most cases this exceptional attention lasts a few days or weeks. In rare cases it lasts several months [4].
Besides this, social media is a huge market, and brands are an important force in reaching the masses in it. Brands aim to communicate with the masses in the right way and, like individuals, try to establish a lasting presence on social media. Each brand uses certain marketing methods to increase its awareness by creating trends, running campaigns, and organizing events. Buzz, viral, grassroots, user-generated content, and social media marketing methods are all defined as different marketing concepts under one heading: Word of Mouth Marketing [5]. Buzz marketing is an action, activity, or event intended to make people curious, to create a big effect, and to excite people about a product or service. Viral marketing is slightly different from buzz marketing: it shares a message via e-mail or video over the Internet so that it spreads quickly among social media users or potential customers [5]. The goal of both is to spread the marketing messages of brands rapidly using the power of the Web.
The vast majority of posts or conversations on social media cause little to no discussion, while some start trends and become viral. Posts that make buzzes go viral evoke intensive interaction over a short period of time and sometimes have serious consequences for individuals and companies in the physical world [4]. Detecting buzzes early can therefore help alleviate or prevent the negative consequences of social media outbreaks against companies or individuals by giving them a chance to react in time. Moreover, it is important to follow the results of campaigns or digital advertisements and to recognize the positive or negative impact of products or services on social media.
The majority of research studies on buzz prediction are based on machine learning techniques such as the Radial Basis Function network (RBF network) [6], Artificial Bee Colony (ABC) [7], Support Vector Machines (SVM) [8], and Adaptive Boosting (AB) [9]. In this study, the Random Forest (RF) [10,11] algorithm is used to predict buzzes on 140,707 instances (roughly 140K) of Twitter posts. Accuracy and error rate are used as performance metrics. Additionally, RF is applied together with a variable ranking approach. This approach finds the most significant features (attributes) of posts and thus helps identify whether a created topic is “Buzz” or “Not Buzz”. RF is preferred in this study because it builds multiple decision trees and merges them to obtain a more accurate and stable prediction.
Because few research studies on buzz prediction exist in the literature, and studies that use the same data set are difficult to find, the performance of RF (in terms of accuracy and error rate) is compared against SVM (with three different kernels: Polynomial, Radial-Linear, and Sigmoid) and RBF. The results of the comparative analysis show that RF achieves the best overall accuracy, error rate, and training time. Finally, similar research studies that apply the aforementioned machine learning methods to different data sets are reviewed critically in the empirical findings section according to the same performance metrics.
The rest of the paper is organized as follows. Section 2 discusses various machine learning approaches used in
buzz prediction. Section 3 discusses similar research studies found in literature. Section 4 presents the data set used
and related descriptions. Section 5 discusses the methodology used and empirical findings of this research study.
Finally, Section 6 discusses the summary and conclusions.
2. Background
This section discusses the machine learning approaches commonly used for buzz prediction in the literature. It also explains why RF is preferred and what RF-based prediction analysis contributes. The well-known machine learning algorithms used in buzz prediction, namely the RBF network, SVM, ABC, and AB, are described below.
The RBF network is an adaptive network model formulated by Broomhead and Lowe in 1988 [6]. RBF networks are widely used for function approximation, curve fitting, and time series prediction [12]. One of the vital factors in an RBF network is the choice of the number of neurons in the hidden layer, where every neuron has a specific activation function, because this choice affects both the complexity of the network and its generalization capability. The most commonly used activation function is the Gaussian function [12], which has a spread parameter that controls the function's characteristics and behavior.
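The role of the spread parameter can be illustrated with a minimal sketch of a Gaussian hidden unit. The exact form exp(-||x - c||^2 / (2 * spread^2)) and the names `gaussian_rbf` and `rbf_network_output` are assumptions made for illustration; the text above does not specify the parameterization used in [12].

```python
import math


def gaussian_rbf(x, center, spread):
    """Gaussian activation of one hidden neuron.

    Assumed form: exp(-||x - center||^2 / (2 * spread^2)).
    A larger spread makes the neuron respond to points farther from its center.
    """
    if spread <= 0:
        raise ValueError("spread must be positive")
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * spread ** 2))


def rbf_network_output(x, centers, spreads, weights, bias=0.0):
    """Output of a single linear unit on top of the Gaussian hidden layer."""
    hidden = [gaussian_rbf(x, c, s) for c, s in zip(centers, spreads)]
    return bias + sum(w * h for w, h in zip(weights, hidden))
```

With this form, the activation is 1 at the center and decays with distance; the number of hidden neurons is simply the length of `centers`.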
SVM, first proposed by Vapnik [13], is a supervised learning model with associated learning algorithms that analyze data for classification and regression. An SVM separates the data points of different classes with a gap (margin) that is as wide as possible. The data points that lie on the boundary of this gap are the support vectors. The method builds on the theory of decision planes, which define decision boundaries that separate sets of objects belonging to different classes.
The ABC algorithm is an efficient population-based algorithm that can scan a wide range of solutions by spreading to many different points at the same time, and it also performs well in local search. ABC is based on how bees in a colony communicate intelligently with each other in order to detect and reach food sources: the search is carried out by artificial bees that assess food sources [14]. The approach has been modified and investigated over the years for different application domains, and it has produced reliable results in a variety of them [15,16].
AdaBoost [9] is another robust machine learning algorithm. It is commonly used for classification and can be applied in numerous fields such as biology, computer vision, and speech processing [17]. Unlike other powerful classifiers, such as SVM, AdaBoost can attain similar classification results with much less tuning of parameters or settings [18].
RF, or Random Decision Forests (RDFs), developed by Leo Breiman [10,11], is an ensemble learning method for classification, regression, and other tasks. The approach constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression).
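The aggregation step described above can be sketched in a few lines. This shows only how per-tree outputs are combined, not how the trees are grown; the function names `forest_classify` and `forest_regress` are hypothetical.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the mode of the class labels predicted by the individual trees."""
    if not tree_votes:
        raise ValueError("at least one tree vote is required")
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Return the mean of the numeric predictions of the individual trees."""
    return mean(tree_predictions)
```

For example, three trees voting "Buzz", "Not Buzz", and "Buzz" yield "Buzz" as the forest's prediction.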
Many studies show that RFs achieve impressive predictive performance in regression and classification problems in various fields, including ecology, financial forecasting, remote sensing, and genetic and biomedical analysis [19-25]. RF is considered an appropriate model for handling datasets with a large number of inputs and imbalanced datasets, and it provides an empirical way to trace variable interactions [26,27]. In the literature, the RF method has been compared with other machine learning methods such as Partial Least Squares Regression (PLS regression) [28, 29], SVM, and NNs [30, 31]. In these comparisons, RFs typically show comparable or even better prediction performance. Besides this appealing prediction performance, the availability of the randomForest package in the R tool [32] is another reason for the widespread use of RF. RF has therefore been preferred in this study for the reasons mentioned above. Similar research studies that apply the machine learning algorithms mentioned above to buzz prediction are discussed in the next section.
3. Literature Review
This section discusses comparable research studies in the literature that apply the aforementioned machine learning methods to buzz prediction on different data sets.
Rastogi M and Bist AS used machine learning methods to classify time windows on Twitter, a micro-blogging social media platform, based on their features, and to determine whether or not these time windows are followed by buzz events [33]. The researchers used Naïve Bayes [34] and SVM and compared the results of the two techniques. They applied SVM with different kernels: Linear, Polynomial, and RBF. Their performance results and ours are compared in terms of accuracy and error rate and discussed in the empirical findings section.
Mayuri M et al. used an RBF network for buzz prediction on real discussion data from Twitter, considering particular attributes of the discussions [35]. The researchers compared the performance of the RBF network with that of multiple regression and found that multiple regression had the higher error rate, which they attributed to its being a statistical method. They therefore argued that the RBF network is the better choice, since it produced more accurate results with the fastest training time. Compared with multiple regression, the RBF network can thus be regarded as the more suitable option for buzz prediction in social media because it is much faster and more reliable. In addition, an RBF network can work with non-linear values, which allows researchers to deal with complex data. Since Twitter data contains a large amount of non-linear data, the RBF network is regarded as a good fit for its analysis. It is a feed-forward network trained with a supervised training algorithm and was found to perform faster than back-propagation networks.
Morchid M, Linares G, and Dufour R proposed a new method for forecasting bursty events using content-related indices [36]. The indices are “popularity”, “expressivity”, and “singularity”. The popularity index is based on the analysis of RSS streams. Expressivity uses a dictionary of words annotated in terms of expressivity load. Singularity represents outlying topic associations estimated via a Latent Dirichlet Allocation (LDA) model [37]. Bursty events are predicted by a neural network that uses the three features to predict the number of retweets of a tweet on Twitter. Their approach maps source documents into a reduced semantic space in which some features can be determined by the LDA analysis. LDA treats a document (represented as a bag of words) as a mixture of latent topics, each occurring at a certain rate. Latent topics are identified by the distribution of word probabilities linked to them. After the LDA analysis, a set of topics is obtained, each with a set of words and their emission probabilities.
A retweet is a reposted or forwarded message on Twitter. Through retweeting, Twitter allows its users to share on a massive scale the tweets that are potentially interesting to others. Morchid M et al. [38] analyzed specific tweet features through Principal Component Analysis (PCA) [39] to better understand the behavior of highly forwarded tweets as opposed to those retweeted only a few times. Their aim was a system that automatically detects massively retweeted posts. They used the Naïve Bayes [34] and SVM machine learning techniques to classify posts and reported recall, precision, and F1 values based on an analysis of selected features. Based on the results obtained from both approaches, the researchers concluded that the SVM approach yields better performance than the Naïve Bayes method.
Aswani R, Ghrera SP, Kar AK, and Chandra S [40] proposed a hybrid model that combines ABC with the k-nearest neighbors approach in order to identify outliers and segregate buzz on Twitter. The researchers used social media data from Twitter. The data set consists of certain parameters describing discussions. It is a regression/classification dataset that is publicly accessible from the UCI Machine Learning Repository [41] and is used for detecting outliers in the form of “buzz” in discussions. The 77 features of the Twitter post instances in the dataset fall into 11 categories: “number of created discussions”, “author increase”, “attention paid”, “burstiness level”, “atomic containers tweet”, “attention level tweet”, “contribution sparseness”, “author interaction”, “number of authors”, “mean discussions length in time”, and “average number of discussions”. Their hybrid model uses these 11 categories to identify the outliers and thereby separate buzz discussions from the remaining topic discussions. The researchers argued that their hybrid model is especially useful in domains such as e-commerce and digital and viral marketing, and reported that it separates buzz discussions, as outliers deviating from normal discussions, with an accuracy of 98.37%.
Deusser C, Jansen N, Reubold J, Schiller B, Hinz O, and Strufe T [42] also discussed the early detection of a buzz as a way of mitigating or preventing the negative effects of large-scale social media events on companies and individuals and of giving them a chance to react at an early stage. Their data is a large dataset of over 100,000 posts that the researchers collected from 76 Facebook pages and annotated. The researchers applied logistic regression to the metadata of the posts to identify the features that define posts as buzzes. The features used in their solution fall into two groups: general metadata and temporal features. The general metadata features are “page likes”, “post likes”, “post shares”, “comments”, “first level”, “replies”, “reply ratio”, “replied to ratio”, “users”, “old users”, “new users”, “new users ratio”, “repeat users”, “comment likes”, “author comments”, and “content length”. The temporal features considered during buzz prediction are “first hour interval”, “first hour comments”, “last hour”, “max hour comments”, “max hour”, “max derivative”, “min derivative”, “completeness index”, and “extremas”. The direct general metadata features of a post are its numbers of likes, shares, and comments.
In addition, the researchers took the popularity level of pages into account by adding features in which these counts are divided by the average number of likes, reshares, and comments for the particular page. For instance, “postlikes” is the number of likes of a post divided by the average number of post likes on that page. The researchers first discovered the properties that define a buzz using logistic regression. They then trained several classifiers and applied machine learning classification procedures to demonstrate the potential of automated prediction. They applied the RF, SVM, and AB approaches to the data set. They concluded that logistic regression is useful for interpreting the results and can be used directly to derive recommendations for human decision makers, whereas the ML techniques yield superior classification results.
In this study, we used another, similar data set from the UCI repository, titled Buzz prediction on Twitter (Absolute Labeling folder with Threshold Sigma equal to 500) [43], and classified it with RF and SVM (with three different kernels: Polynomial, Radial-Linear, and Sigmoid). The prediction performances of the two techniques are compared. Parameters were adjusted and selected to obtain optimal classification results. Accuracy is first obtained using SVM with the linear, polynomial, and RBF kernels, and then RF is applied. The comparison results are discussed in Section 5. A more detailed description of the dataset used is given in the next section.
4. Description of Dataset Used
A buzz is often characterized by many people discussing a certain topic and commenting on a given post. The data set used in this study is taken from the UC Irvine Machine Learning Repository website [43], where it is listed under the topic “Buzz Prediction in Social Media (Twitter Data set)” and covers the binary classification “Buzz / no Buzz” for the Twitter domain. It is a publicly available regression/classification dataset from the UCI repository, created by Kawala et al. [44], that is used for detecting outliers in the form of “buzz”. The data set contains 140,707 instances (roughly 140K) and 77 features. The instances are stored in a standard comma-separated values (CSV) format. Each instance is described by 77 features, which describe the evolution of the primary features through time. The 77 features are grouped into 11 categories (Table 1). For each instance, each feature in a category holds the data (min, max, mean, and std values) gathered over 7 days.
Hence, the name of each feature is postfixed with the relative time of observation. For instance, the value of the feature “Nb_Active_Discussion” at time t is given in “Nb_Active_Discussion_t”. Each instance covers seven days of observation for a particular topic (e.g. overclocking). Considering the couple of days following this initial observation, if there are at least 500 additional active discussions per day (on average, with respect to the initial observation), then the predicted attribute “Buzz” is True. Observations are independent and identically distributed.
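The labeling rule can be made concrete with a small sketch. It rests on an assumed reading of "with respect to the initial observation": the mean daily number of active discussions in the following days is compared with the mean over the seven observed days. The dataset documentation may define the increase differently, and `is_buzz` is a hypothetical helper, not part of the dataset.

```python
def is_buzz(nad_observed, nad_following, threshold=500.0):
    """Label a topic as Buzz under an assumed reading of the rule.

    nad_observed:  daily active-discussion counts for the observed days
    nad_following: daily active-discussion counts for the following days
    threshold:     minimum average increase per day (500 in this data set)
    """
    if not nad_observed or not nad_following:
        raise ValueError("both observation windows must be non-empty")
    baseline = sum(nad_observed) / len(nad_observed)
    following = sum(nad_following) / len(nad_following)
    return (following - baseline) >= threshold
```

Under this reading, a topic averaging 100 active discussions per day that jumps to an average of 675 in the following days is labeled Buzz, while a jump to 350 is not.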
Table 1. Attribute descriptions of the dataset used [43]: 77 attributes observed over 7 days.
# | Category | Type | Explanation | Columns in data set
1 | Number of Created Discussions (NCD) | Numeric, integers only | Number of discussions created at time step t and involving the instance's topic. | [0, 6]: NCD_0, NCD_1, NCD_2, NCD_3, NCD_4, NCD_5, NCD_6
2 | Author Increase (AI) | Numeric, integers only | Number of new authors interacting on the instance's topic at time t (i.e. its popularity). | [7, 13]: AI_0, AI_1, AI_2, AI_3, AI_4, AI_5, AI_6
3 | Attention Level, measured with Number of Authors (AS(NA)) | Numeric, real in [0,1] | The attention paid to the instance's topic on a social media. | [14, 20]: AS(NA)_0, AS(NA)_1, AS(NA)_2, AS(NA)_3, AS(NA)_4, AS(NA)_5, AS(NA)_6
4 | Burstiness Level (BL)* | Numeric, defined on [0,1] | The burstiness level for a topic z at a time t, defined as the ratio of the NCD and NAD features. | [21, 27]: BL_0, BL_1, BL_2, BL_3, BL_4, BL_5, BL_6
5 | Number of Atomic Containers (NAC) | Numeric, integer | Total number of atomic containers generated through the whole social media on the instance's topic until time t. | [28, 34]: NAC_0, NAC_1, NAC_2, NAC_3, NAC_4, NAC_5, NAC_6
6 | Attention Level, measured with Number of Contributions (AS(NAC)) | Numeric, real in [0,1] | A measure of the attention paid to the instance's topic on a social media. | [35, 41]: AS(NAC)_0, AS(NAC)_1, AS(NAC)_2, AS(NAC)_3, AS(NAC)_4, AS(NAC)_5, AS(NAC)_6
7 | Contribution Sparseness (CS) | Numeric, real in [0,1] | A measure of the spreading of contributions over discussions for the instance's topic at time t. | [42, 48]: CS_0, CS_1, CS_2, CS_3, CS_4, CS_5, CS_6
8 | Author Interaction (AT) | Numeric, integer | Average number of authors interacting on the instance's topic within a discussion. | [49, 55]: AT_0, AT_1, AT_2, AT_3, AT_4, AT_5, AT_6
9 | Number of Authors (NA) | Numeric, integer | Number of authors interacting on the instance's topic at time t. | [56, 62]: NA_0, NA_1, NA_2, NA_3, NA_4, NA_5, NA_6
10 | Average Discussions Length (ADL) | Numeric, real | Average length of a discussion belonging to the instance's topic. | [63, 69]: ADL_0, ADL_1, ADL_2, ADL_3, ADL_4, ADL_5, ADL_6
11 | Number of Active Discussions (NAD) | Numeric, integer | Number of discussions involving the instance's topic until time t. | [70, 76]: NAD_0, NAD_1, NAD_2, NAD_3, NAD_4, NAD_5, NAD_6
*In statistics, burstiness refers to intermittent increases and decreases in the activity or frequency of an event.
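The column layout in Table 1 is regular: each of the 11 categories occupies 7 consecutive columns, one per observed day. The following sketch maps between zero-based column indices and feature names under that layout; the helper names `feature_name` and `category_columns` are hypothetical.

```python
# Category abbreviations in the order in which they appear in Table 1.
CATEGORIES = ["NCD", "AI", "AS(NA)", "BL", "NAC", "AS(NAC)",
              "CS", "AT", "NA", "ADL", "NAD"]
DAYS = 7  # each category is observed at time steps 0..6


def feature_name(column):
    """Return the feature name for a zero-based column index (0..76)."""
    if not 0 <= column < len(CATEGORIES) * DAYS:
        raise IndexError("column index out of range")
    category, day = divmod(column, DAYS)
    return f"{CATEGORIES[category]}_{day}"


def category_columns(abbreviation):
    """Return the seven zero-based column indices of one category."""
    start = CATEGORIES.index(abbreviation) * DAYS
    return list(range(start, start + DAYS))
```

For instance, column 27 is the last burstiness column, and the NAC category spans columns 28 through 34, matching the ranges listed in the table.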
5. Methodology Used and Empirical Findings
As mentioned before, this study evaluates and compares the performance of RF against SVM on the data set [43] for predicting buzzes. SVM is considered with different kernels: linear, polynomial, and RBF. In addition, RF is used to perform variable rank ordering, an efficient strategy for detecting the features that matter most in identifying whether a created topic is a buzz or not. The R tool [32] and suitable packages were used in the experiments. We applied the RF and SVM methods to the same data set without dimensionality reduction. The user interface of RF is consistent with that of other classification functions such as NNET [45] and SVM (in the e1071 package) [46]. It offers a formula interface, and predictors can be specified as a matrix or data frame via the x argument, with responses as a vector via the y argument. RF performs classification if the response is a factor, that is, if the response is not continuous. If the response is not a factor, RF performs regression. If the response is not specified, RF performs unsupervised learning. RF does not handle ordinal categorical responses. Categorical predictor variables also need to be specified as factors so that they are not wrongly treated as continuous. The RF function returns an object of class "randomForest". The rest of this section compares the RF, SVM (with three different kernels), and RBF methods in terms of error and accuracy. Accuracy is the proportion of instances that are classified correctly, whether as “Buzz” or as “Not Buzz”, among all instances. Error is one minus accuracy.
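Reading accuracy in the usual way, as the share of correctly classified instances, the two metrics can be computed as in the sketch below. This is an illustration in Python rather than the R code used in the study, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def error_rate(y_true, y_pred):
    """Error is defined as one minus accuracy."""
    return 1.0 - accuracy(y_true, y_pred)
```

With labels coded as 1 for Buzz and 0 for Not Buzz, three correct predictions out of four give an accuracy of 0.75 and an error of 0.25.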
M. Rastogi and A. S. Bist [33] used SVM with different kernels (Linear, Polynomial 3, and Polynomial 4) and RBF to predict buzzes on the same data set that we used in our experiments. However, they used only 2,000 instances with 77 features. Their results are given in Table 2.
Table 2. Accuracy results obtained by the researchers [33] using SVM with different kernels and RBF.
Kernel Accuracy Training set Testing set
Linear 0.927 1000 1000
RBF 0.958 1000 1000
Polynomial 3 0.92 1000 1000
Polynomial 4 0.923 1000 1000
Table 3. The results obtained on error and accuracy using RF, SVM with three different kernels, and RBF methods.
Type Error Accuracy Training set Testing set
SVM-RBF 0.341 0.66 93805 46902
SVM-Linear 0.361 0.64 93805 46902
SVM-Polynomial 0.762 0.14 93805 46902
RBF 0.061 0.94 93805 46902
RF 0.001 0.99 93805 46902
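The set sizes reported in Table 3 (93,805 training and 46,902 testing instances) are consistent with roughly a two-thirds / one-third split of the 140,707 instances. The text does not say how the split was drawn; the sketch below assumes a seeded random shuffle, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_instances, train_fraction=2 / 3, seed=0):
    """Split instance indices into training and testing sets.

    Assumption: instances are shuffled at random before splitting;
    the seed only makes the sketch reproducible.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_train = round(n_instances * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(140707)
```

With the default fraction, the two index lists have the same sizes as the training and testing sets in Table 3.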
In our study, RF and SVM with three different kernels are used. The kernel types considered are Polynomial, Radial-Linear, and Sigmoid. In addition, the same dataset is used in full (all 140,707 instances) to improve accuracy. Two class labels are used, represented by “0” and “1”: “0” represents a non-buzz event and “1” represents a buzz event.
In the experiments, the overall dataset is divided into two sets, a training set and a test set, which are used to train the system and then to test the accuracy of the buzz prediction. Together, the training and test sets contain all 140,707 instances over 77 attributes. Table 3 reports the error and accuracy results obtained. For RF, the error is 0.001 and the corresponding accuracy is 0.99. The error and accuracy values are also shown in Figure 1.
Figure 1. (a) RMSE Error and (b) Accuracy
As shown in Figure 1, RF outperformed the other approaches: it has the lowest Root Mean Square Error
(RMSE) and the highest accuracy.
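For reference, RMSE is the square root of the mean squared difference between true and predicted values. The Python sketch below is illustrative only: the function name and the toy values are hypothetical, and the text does not state whether RMSE was computed on class probabilities or on hard 0/1 labels.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length numeric sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predicted Buzz probabilities against true 0/1 labels.
y_true = [1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.1, 0.6]
print(round(rmse(y_true, y_prob), 4))

# With hard 0/1 predictions each squared error is 0 or 1, so RMSE equals
# the square root of the misclassification rate: one error in four gives 0.5.
print(rmse([1, 0, 0, 1], [1, 0, 1, 1]))  # 0.5
```

A lower RMSE means that predictions lie closer to the true labels on average.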
Figure 2. Top ten significant features determined from the 77 features.
Additionally, variable rank ordering may lead to better Buzz prediction, since the top-ranked variables may
contribute more to accurate prediction than the remaining variables. In our experimental studies, RF is also
used to obtain the variable rank ordering, which is an efficient strategy for detecting the features that are most
significant in identifying whether a topic is a Buzz or not. Figure 2 illustrates the 10 most significant of the 77
features, which are crucial for identifying whether a created topic is a Buzz or not. These features are x29(BL_6),
x57(NA_0), x50(AT_0), x71(NAD_0), x15(AS(NA)_0), x64(ADL_0), y01(NCD_1), x22(BL_0),
x08(AI_0), and x36(AS(NAC)_0). “BL_6” is the burstiness level for a topic z at a time t, defined as the ratio of
NCD and NAD. “NA_0” is the number of authors interacting on the instance's topic at time t. “AT_0” is the
average number of authors interacting on the instance's topic within a discussion. “NAD_0” is the number of
discussions involving the instance's topic until time t. “AS(NA)_0” is a measure of the attention paid to the
instance's topic on a social media platform. “ADL_0” directly measures the average length of a discussion belonging
to the instance's topic. “NCD_1” measures the number of discussions created at time step t and involving the
instance's topic. While "BL_0" is the statistic observed on the first day, "BL_6" is the statistic observed on the
seventh day; both measure the burstiness level for a topic z at a time t. The 11 categories of the 77 attributes,
with their descriptions, are available in [43].
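Variable rank ordering of this kind can be illustrated with permutation importance, the idea behind the mean-decrease-in-accuracy measure of RF: a feature is important if shuffling its values lowers accuracy. The Python sketch below is a toy under stated assumptions, not the procedure used in the experiments. It applies the idea to a hypothetical threshold rule instead of a trained forest, and the feature names f0, f1, f2 and the data are invented for illustration.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)


def permutation_importance(predict, rows, labels, seed=0):
    """Accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return drops


def rank_features(names, scores, top=10):
    """Return the top (name, score) pairs, highest importance first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return [(names[j], scores[j]) for j in order[:top]]


# Hypothetical toy data: only the first feature determines the label.
rng = random.Random(1)
rows = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]


def toy_model(row):
    """Stand-in for a trained classifier: thresholds the first feature."""
    return 1 if row[0] > 0.5 else 0


scores = permutation_importance(toy_model, rows, labels)
ranking = rank_features(["f0", "f1", "f2"], scores)
print(ranking[0][0])  # f0
```

Shuffling f1 or f2 leaves the toy model's accuracy unchanged, so their importance is zero, and f0 is ranked first.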
The different features that represent the buzz case may lead to a serious imbalance in the dataset. Hence, RF
was preferred in order to capture all inherent phenomena in the dataset. Because the dataset contains various
attributes that define the buzz case, a ranking analysis was necessary to find the most significant attributes for
identifying whether a created topic is a buzz or not. To the best of our knowledge, this is the first study
to apply a feature ranking methodology to find the most significant features indicating a buzz in the social
media posts dataset.
The results of the comparative experimental studies demonstrate that RF is the most accurate method for
predicting buzz and that it also yields a ranking of significant features that can be used to obtain better accuracy.
RF gives the best overall accuracy (99%) with the fastest training time. The RBF method also proved to be very
competitive (94%), while the SVM variants performed markedly worse (14% to 66%). In this study, it is also observed
that the Buzz is a dynamic event whose prediction could be modelled on the content as well as on the forest. Finally,
this study extends the use of faster and more reliable algorithms for buzz prediction, especially on comments about
products and brands in social media.
6. Conclusion
In this paper, a method for predicting buzz cases on the social media platform Twitter using the Random
Forest (RF) classifier is proposed. The performance of this method was evaluated and compared with that of
two other classifiers: Support Vector Machine (SVM) with three different kernels, and Radial Basis Function
(RBF). The experimental results showed that RF gives better results in terms of accuracy
and training time than the other classifiers.
Additionally, in this study, RF is used to obtain the variable rank ordering, which is an efficient strategy
for detecting the features that are most significant in identifying whether a topic is a Buzz or not. The variable
rank ordering obtained indicates that research in this area should be more comprehensive; that is, the importance
of the features should be taken into account in order to obtain better performance in identifying
buzz events. Random Forest achieved an accuracy of 0.99, which shows that it is a fast and
reliable approach for buzz prediction in social media. It is also inferred that the Buzz is a dynamic event whose
prediction could be modelled on the content as well as on the forest.
Future work is expected to look at Buzz activities on social media platforms other than
Twitter. In addition, tuning the Random Forest algorithm is another alternative that may yield better
performance. Further areas, such as the economic, social and political domains, are also candidates for future
research in this field.
References
[1]. Statista, Most popular social networks worldwide as of January 2019, ranked by number of active users,
https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/, Last visited: 11-02-2019.
[2]. Twitter Usage Statistics, http://www.internetlivestats.com/twitter-statistics/, Last visited: 11-02-2019.
[3]. Kwak, H., Lee, C., Park, H., & Moon, S. (2010, April). What is Twitter, a social network or a news media?
In Proceedings of the 19th international conference on World wide web (pp. 591-600). ACM.
[4]. Deusser C, Jansen N, Reubold J, Schiller B, Hinz O, Strufe T (2018, April) Buzz in Social Media: Detection
of Short-lived Viral Phenomena. In Companion of The Web Conference 2018, 1443-1449. International World Wide
Web Conferences Steering Committee.
[5]. Douglas K (2015, April) Buzz, Viral or Word of Mouth Marketing: What’s the difference?,
https://martech.zone/whats-the-difference-buzz-viral-word-of-mouth-marketing/, Last visited: 11-02-2019.
[6]. Broomhead DS, Lowe D. Radial basis functions, multi-variable functional interpolation and adaptive
networks. Royal Signals and Radar Establishment Malvern (United Kingdom); 1988 Mar 28.
[7]. Karaboga, D. (2005). An idea based on honey bee swarm for numerical optimization (Vol. 200). Technical
report-tr06, Erciyes university, engineering faculty, computer engineering department.
[8]. Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intelligent
Systems and their applications. Jul;13(4):18-28.
[9]. Freund Y, Schapire R, Abe N (1999, Sep) A short introduction to boosting. Journal-Japanese Society For
Artificial Intelligence. 1;14(771-780):1612.
[10]. Breiman L (2001, Oct) Random Forests. Machine learning. 1;45(1):5-32.
[11]. Breiman L, Cutler A (2003) Manual for Setting Up, Using, and Understanding Random Forest, 4.
[12]. Hutchinson J M (1994) A radial basis function approach to financial time series analysis, PhD Thesis,
https://pdfs.semanticscholar.org/19f3/528f7e485bc5b075b2bcf0c5cc7960c10148.pdf, Last visited: 11-02-2019.
[13]. Vapnik V, Mukherjee S (2000) Support vector method for multivariate density estimation. In Advances in
neural information processing systems, 659-665.
[14]. Karaboga D, Basturk B (2007, Nov) A powerful and efficient algorithm for numerical function optimization:
artificial bee colony (ABC) algorithm. Journal of global optimization. 1;39(3):459-71.
[15]. Karaboga D, Akay B (2009, Aug) A comparative study of artificial bee colony algorithm. Applied
mathematics and computation, 1;214(1):108-32.
[16]. Karaboga D, Gorkemli B (2014, Oct) A quick artificial bee colony (qABC) algorithm and its performance
on optimization problems. Applied Soft Computing, 1;23:227-38.
[17]. Kaur R, Chopra V (2015) Implementing Adaboost and Enhanced Adaboost Algorithm in Web
Mining. International Journal of Advanced Research in Computer and Communication Engineering, 4(7).
[18]. Von Kirby P, Gerardo BD, Medina RP (2017, Dec) Implementing Enhanced AdaBoost Algorithm for Sales
Classification and Prediction. International Journal of Trade, Economics and Finance, 8(6).
[19]. Siroky DS (2009) Navigating random forests and related advances in algorithmic modeling. Statistics
Surveys, 3, 147-163.
[20]. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP (2003 Nov) Random forest: a
classification and regression tool for compound classification and QSAR modeling. Journal of chemical information
and computer sciences, 24;43(6):1947-58.
[21]. Jiang P, Wu H, Wang W, Ma W, Sun X, Lu Z (2007, Jul) MiPred: classification of real and pseudo microRNA
precursors using random forest prediction model with combined features. Nucleic acids research, 1;35(suppl_2):
W339-44.
[22]. Kumar M, Thenmozhi M (2006) Forecasting stock index movement: A comparison of support vector
machines and random forest.
[23]. Diaz-Uriarte R (2007, Jul) GeneSrF and varSelRF: a web-based tool and R package for gene selection and
classification using random forest. BMC bioinformatics, 8(1):328.
[24]. Pal M (2005, Jan) Random forest classifier for remote sensing classification. International Journal of Remote
Sensing, 1;26(1):217-22.
[25]. Cutler DR, Edwards Jr TC, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007, Nov) Random forests
for classification in ecology. Ecology, 88(11):2783-92.
[26]. Livingston F (2005) Implementation of Breiman’s random forest machine learning algorithm. ECE591Q
Machine Learning Journal Paper.
[27]. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News, 2 (3): 18–22. R package
version 4.6. 10.
[28]. Mevik BH, Segtnan VH, Næs T (2004, Nov) Ensemble methods and partial least squares regression. Journal
of Chemometrics: A Journal of the Chemometrics Society,18(11):498-507.
[29]. Xu R (2013) Improvements to random forest methodology.
[30]. Statnikov A, Wang L, Aliferis CF (2008, Dec) A comprehensive comparison of random forests and support
vector machines for microarray-based cancer classification. BMC bioinformatics, 9(1):319.
[31]. Liu M, Wang M, Wang J, Li D (2013, Feb) Comparison of random forest, support vector machine and back
propagation neural network for electronic tongue data classification: Application to the recognition of orange beverage
and Chinese vinegar. Sensors and Actuators B: Chemical,177:970-80.
[32]. Dessau, R. B., & Pipper, C. B. (2008). ''R"--project for statistical computing. Ugeskrift for laeger, 170(5),
328-330.
[33]. Rastogi M, Bist AS (2016) Analysis of Twitter Data With Machine Learning Techniques, International
Journal of Engineering Sciences & Research Technology, ISSN: 2277-9655
http://www.ijesrt.com/issues%20pdf%20file/Archive-2016/July-2016/117.pdf, Last visited: 11-02-2019.
[34]. Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on
empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46).
[35]. Mayuri M, Sneha M, Kamatchi LP (2015) Prediction of buzz in Social-media using Radial Basis Function
Neural Networks, International Conference on Interdisciplinary Engineering and Sustainable Management Sciences.
[36]. Morchid M, Linares G, Dufour R (2014, May) Characterizing and Predicting Bursty Events: The Buzz Case
Study on Twitter. In LREC, 26, 2766-2771.
[37]. Blei DM, Ng AY, Jordan MI (2003, Jan) Latent Dirichlet Allocation. Journal of Machine Learning Research,
3:993-1022.
[38]. Morchid M, Dufour R, Bousquet PM, Linares G, Torres-Moreno JM (2014 Nov) Feature selection using
principal component analysis for massive retweet detection. Pattern Recognition Letters, 1;49:33-9.
[39]. Chen, X., Chen, C. and Jin, L. (2011). Principal Component Analyses in Anthropological Genetics. Advances
in Anthropology, 01(02), pp.9-14.
[40]. Aswani R, Ghrera SP, Kar AK, Chandra S (2017 Dec) Identifying buzz in social media: a hybrid approach
using artificial bee colony and k-nearest neighbors for outlier detection. Social Network Analysis and Mining,
1;7(1):38.
[41]. UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets/Buzz+in+social+media+, Last
visited: 11-07-2020.
[42]. Deusser C, Jansen N, Reubold J, Schiller B, Hinz O, Strufe T (2018, Apr) Buzz in Social Media: Detection
of Short-lived Viral Phenomena. In Companion of The Web Conference 2018,
International World Wide Web Conferences Steering Committee, 1443-1449.
[43]. Buzz prediction on Twitter - Absolute Labeling - Threshold Sigma equals 500 in UCI repository,
http://ama.liglab.fr/data/buzz/classification/Twitter/Absolute_labeling/Twitter-Absolute-Sigma-500.names, Last
visited: 11-07-2020.
[44]. Kawala, F., Douzal-Chouakria, A., Gaussier, E., & Dimert, E. (2013, October). Prédictions d'activité dans
les réseaux sociaux en ligne. In 4ième Conférence sur les Modèles et l’Analyse des Réseaux: Approches
Mathématiques et Informatiques, p 16.
[45]. Ripley, B., Venables, W., & Ripley, M. B. (2016). Package ‘nnet’. R package version, 7-3
[46]. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., & Weingessel, A. The e1071 package, 2005. Software
available at http://cran.r-project.org/src/contrib/Descriptions/e1071.html.
AUTHORS
Duygu Çelik Ertuğrul was born in Turkey. She is an associate professor in
the computer engineering department at the Eastern Mediterranean
University, Famagusta, North Cyprus via Mersin-10, Turkey. Her research
topics are related to the Web and Semantics; Composition and Discovery of
Semantic Web Services, Semantic Search Agents, Rule-Based Expert
Systems, m-Health and Healthcare Knowledgebase Expert Systems. She is
one of the organizers of two international workshops and one international
symposium: “IEEE International Workshop on ESAS: E-Health Systems and
Semantic Web” since 2006, “Security of Information and Networks (SIN)
between 2007-2011” and “IEEE COMPSAC Symposium on Web Technologies & Data Analytics
(WEDA) in 2016”. She also supervises several research and development projects supported by
the university, government, and industrial companies. She has organized a special issue for Expert
Systems: The Journal of Knowledge Engineering and served as guest editor. She has published
numerous articles/book chapters/book in several international/national journals and conferences
on the topic of Web Semantics and Mobile Medical Healthcare Services and Systems. She is also
the author of various books or book chapters about ‘Composition or Discovery of Semantic Web
Services’ and ‘Semantic Web based e-Health Services’.
Mohammad Ali Haji Hasan Khonsari was born in Iran. He is a graduate
student in the Computer Engineering Department at the Eastern
Mediterranean University, Famagusta, North Cyprus via Mersin-10, Turkey.
His research topics are related to Web technologies, artificial intelligence,
social network research and applications, data mining applications and
methods.