Analysing Zombie accounts in Weibo
Final Year Dissertation
Yibo Liang BSc (Hons) Computer Science
H00194508
Supervised by
Dr Yun-Heh Jessica Chen-Burger
Second Reader
Dr Hamish Taylor
Heriot-Watt University, Edinburgh
School of Mathematical and Computer Sciences
2016
Declaration

I, Yibo Liang confirm that this work submitted for assessment is my own and is expressed in
my own words. Any uses made within it of the works of other authors in any form (e.g.
ideas, equations, figures, text, tables, programs) are properly acknowledged at any point of
their use. A list of the references employed is included.
Signed:
Date: 24 April 2016
Abstract

Information published on social networks has proven valuable for research in fields such as economics, politics, culture and social science. Weibo.com is one of the largest social media platforms in China, and many researchers have mined its data for various purposes, e.g. to understand citizens' opinions. Unfortunately, a large number of accounts on such websites are controlled by computer programs (robots) driven by specific agendas or malicious purposes such as advertising, spamming and public opinion manipulation. Such accounts are often referred to as zombie accounts.
Many articles have already discussed how to distinguish zombie accounts from genuine ones and how their behaviour differs, mostly on Twitter and rarely on the Chinese Weibo. However, these studies are rarely based on ground truth data or large datasets.
I therefore propose a study on detecting zombie accounts using machine learning algorithms based on ground truth data, testing and evaluating on a relatively large dataset, and thereby providing a better estimate of the number of zombie accounts on Weibo.com. Given the ground truth data, account features are studied and extracted, and further features are derived from the timestamps of microblog posts. Moreover, the behaviour of zombie accounts is analysed by visualising these features. My study shows that it is possible to detect even relatively well-programmed, human-like zombie accounts with relatively high accuracy if enough ground truth data is available.
Acknowledgements

First, I would like to thank my supervisor, Jessica Chen-Burger, who guided me into the field of research. Having started this project as a student and finished it as a researcher, I thank her for generously sharing her invaluable knowledge and experience.

In addition, I would like to thank my parents for constantly supporting my studies. Their trust in my determination and ability helped me get through this demanding year. I must also thank my parents-in-law, who worked as researchers, for their guidance on research and on attitude towards study.

Finally, I want to thank my wife, Xuecong. Her smile and comfort helped me through the hard times, and her cooking made the good times even better.
Table of Contents
1 Introduction ....................................................................................................................... 7
1.1 The problem .............................................................................................................. 7
1.2 The approach and objectives of this project ............................................................. 7
1.3 Introduction to Literature review and relative works ............................................... 8
1.4 Research Gap ............................................................................................................. 9
1.4.1 Lack of research on Weibo ................................................................................ 9
1.4.2 No large scaled evaluation ................................................................................ 9
1.4.3 Evaluation with no ground truth ..................................................................... 10
2 Background ...................................................................................................................... 10
2.1 Current Situation of China’s Internet ...................................................................... 10
2.2 What is Weibo? ....................................................................................................... 11
2.3 Preliminary research on Zombie Accounts .............................................................. 14
2.3.1 Problem with Weibo ........................................................................................ 14
2.3.2 Investigation on Zombie account market ........................................................ 15
2.4 Literature Review .................................................................................................... 22
2.4.1 Existing Researches ......................................................................................... 22
2.4.2 Data Mining & Machine Learning methods and algorithms ........................... 29
2.4.3 Research gaps and my approach ..................................................................... 31
3 Methodology ................................................................................................................... 32
3.1 Obtaining Ground Truth data .................................................................................. 32
3.1.1 Zombie accounts ............................................................................................. 33
3.1.2 Real user accounts ........................................................................................... 35
3.1.3 Accounts data for Evaluation .......................................................................... 37
3.2 Implementation ....................................................................................................... 38
3.3 Actual System Design .............................................................................................. 38
3.3.1 Data Gathering ................................................................................................ 38
3.3.2 Data Pre-processing ......................................................................................... 51
4 Data Mining & Classifier Evaluating ................................................................................ 59
4.1 Initial analysis of features with visualisation ........................................................... 60
4.2 Base Line .................................................................................................................. 64
4.3 Single Classifier Experiments ................................................................................... 64
4.3.1 Naive Bayes ..................................................................................................... 64
4.3.2 Decision Tree Classifier .................................................................................... 66
4.3.3 Support Vector Machine ................................................................................. 67
4.3.4 Multi-Layered Perceptron ............................................................................... 68
4.4 Meta Classifiers ....................................................................................................... 69
4.4.1 Boosting ........................................................................................................... 69
4.4.2 Bagging ............................................................................................................ 69
4.4.3 Voting .............................................................................................................. 69
4.4.4 Results ............................................................................................................. 69
4.5 Evaluating using incomplete data ........................................................................... 70
4.6 The composition of Weibo Users ............................................................................ 71
5 Conclusion & Discussion .................................................................................................. 72
5.1 Achievements .......................................................................................................... 72
5.2 Limitation ................................................................................................................. 74
5.3 Future Work ............................................................................................................ 75
6 Reference ........................................................................................................................ 77
7 Appendix A ...................................................................................................................... 78
8 Appendix B ....................................................................................................................... 79
9 Appendix C ....................................................................................................................... 80
1 Introduction
1.1 The problem
China's online social networks have been developing rapidly for a decade. Their influence now reaches all walks of life in China and many domains, including the economy, law, politics and, in particular, public opinion. The freedom of the Internet allows everyone to express and exchange opinions. Yet this freedom also allows people with ulterior motives to manipulate public opinion by spreading spam and fake stories built on fabricated information through unchecked social media websites.
Such information is often spread by fake online accounts controlled by computer programs (robots), known for short as zombie accounts. These zombie accounts have flooded every social network website; driven by profit, they create and spread false information to the public, manipulating public opinion.
Therefore, if we can develop an efficient way of understanding and classifying zombie accounts, genuine public opinion can be recovered by filtering out the false voices of zombie accounts. Moreover, classifying these accounts would support further research, including studies of the effect of zombie accounts, of real public opinion, and of any other topic that makes use of online social networks.
1.2 The approach and objectives of this project
Researchers have already proposed many approaches to the problem of zombie accounts. My objective is to find a good classifier, or a combination of existing approaches, that maximises the ability to classify zombie accounts on social networks.

Weibo and Twitter are the two most studied microblogging platforms. In this project, I first implemented a distributed crawling system with a proper framework (described in Section 3.3.1.2) in order to obtain a large dataset. In addition to the crawling system, ground truth data is obtained using several methods (described in Section 3.1). I then conduct large-scale experiments to evaluate the composition of Weibo accounts using the crawled data and the ground truth data. By applying different methodologies to classify Weibo zombie accounts at scale, I aim to obtain an optimal set of classifiers for identifying zombie accounts, trained on the ground truth data. With that, I may also obtain a clearer picture of the current situation of the Weibo social network: approximately how many zombie accounts exist in the network, what the significant patterns and behaviours of these accounts are, and possibly how Weibo and its users are influenced by them.
1.3 Introduction to Literature review and relative works
I have researched various methods and algorithms for classifying zombie accounts. One paper (Zhang & Vern, 2011) offers an efficient way of classifying accounts using only the timestamps of posts. A similar approach (Tavares & Faisal, 2013) uses the time intervals between posts together with probabilistic classifiers and has produced very good results. Another approach (Amit A. Amleshwaram, et al., 2013) analyses different features of zombie accounts and, by selecting the best feature set, also classifies zombie accounts with high accuracy. Furthermore, location information has been found useful (Deng, et al., 2015); although the accuracy reported in that paper is doubtful, the approach is inspiring. Last but not least, I found the research (Sun, et al., 2014) on user interaction and inter-user influence very helpful, because of its assumption that zombie accounts interact with and influence real users differently.

In this project, I aim to evaluate the methods above, but will not be limited to them. I hope to discover further algorithms and methods for distinguishing zombie accounts in the course of this project.
1.4 Research Gap
According to my study, there are noticeable research gaps in the topic of identifying zombie accounts on Weibo, as follows:
1.4.1 Lack of research on Weibo
There is much successful research on zombie classification for Twitter, but far less for Weibo. This is largely because Twitter is a worldwide social network with stronger influence. By contrast, Weibo is much younger than Twitter and its influence is still growing. The problem of zombie accounts on Weibo has only recently been noticed by researchers in China, and related work has begun to appear only in the last few years. This project aims to fill this gap.
1.4.2 No large scaled evaluation
Among the existing studies on Weibo, most data is obtained through the public Weibo API, which limits the number of requests within a fixed time period and therefore limits the size of the data. Considering the total number of registered users on Weibo, the sampling rate of these studies is very low. After studying the HTTP communication of Weibo.com, I devised a solution that does not require the public Weibo API and needs only regular HTTP requests. With the help of proxies and pipelining, I managed to obtain data from Weibo at an unprecedented speed. The large dataset obtained in a short time enabled me to conduct relatively large-scale experiments to evaluate classification algorithms and methods.
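The proxy-and-pipelining approach above amounts to rotating each request through a pool of exit addresses so that no single IP exceeds the rate limit. A minimal round-robin sketch is given below; the host:port entries are placeholders of mine, not real proxy addresses, and a real pool would be loaded from a provider's list.

```java
import java.util.List;

// A minimal round-robin proxy pool, sketching how plain HTTP requests
// can be spread across many exit IPs. The addresses here are
// placeholders for illustration only.
public class ProxyPool {
    private final List<String> proxies;
    private int next = 0;

    public ProxyPool(List<String> proxies) {
        this.proxies = proxies;
    }

    // Returns the next proxy address, cycling through the pool.
    public synchronized String nextProxy() {
        String p = proxies.get(next);
        next = (next + 1) % proxies.size();
        return p;
    }
}
```

A crawler thread would fetch the next address before each request and route the connection through it, so the per-IP request rate stays far below any blocking threshold.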
1.4.3 Evaluation with no ground truth
I have not seen any zombie account analysis that evaluates its algorithm against ground truth, that is, against accounts guaranteed to be zombies or real users. In most studies, the zombie accounts are classified manually. Although the human annotators are presumed to be intelligent and well informed, such identification is unsound and does not guarantee correctness or accuracy. By looking into the black market for zombie accounts, as described in the following pages, I was able to acquire various types of zombie accounts, created and trained for different purposes and guaranteed to be zombies. In addition, I manually identified real, human-controlled accounts as a control group. With these accounts at hand, classifier experiments were carried out, allowing me to obtain more authentic and well-grounded results when evaluating the related classification algorithms and methods.
2 Background
2.1 Current Situation of China’s Internet
The Chinese Internet industry has been one of the fastest-growing sectors for more than 10 years. According to statistics,1 by the end of 2014 China had 649 million Internet users, an increase of 31 million over the previous year. About 86% of the netizen population accessed the Internet through mobile devices, compared to only 81% in 2013. Moreover, netizens' average weekly time online increased from 25 hours to 26.1 hours.
1 Statistical Report on Internet Development in China (January 2015)
2.2 What is Weibo?
Composed of two Chinese characters, Wei (微, meaning "micro") and Bo (博, meaning "blog"), "Weibo" is a literal translation of "microblog". Weibo was established on 14 August 2009, only a month after the Chinese government closed most domestic microblogging websites, such as Fanfou, and banned international social media services including Facebook, Twitter and Plurk. It is a service provided by Sina Corporation with basic functionalities such as messaging, private messaging, commenting and reposting. "Sina Weibo", a compatible API platform, was then opened to the public on 28 July 2010.2
The number of registered users on Weibo reached 100 million before March 2011.3 Weibo has since become very popular among young people for the diversity and completeness of its social media functions and applications. In fact, about 24.8% of Internet users use microblogs,1 and Weibo.com is one of the largest Chinese microblogging services. Just as Twitter plays a significant role in Western countries, Weibo plays a significant role in Chinese social media. According to an official report from Weibo,4 by the end of September 2014 there were 76.6 million daily active users and 160 million monthly active users.
2 "Special: Micro blog's macro impact". Michelle and Uking (China Daily). 2 March 2011. Retrieved 26 October
2011.
3 2010 Sina Annual Financial report, Accessed on 10 Nov 2015, HTTP://tech.sina.com.cn/i/2011-03-
02/06005233783.shtml
4 2014 Weibo User Development Report, Weibo Data Centre
According to market research,5 about 54% of users on Weibo.com are male and 46% female. Moreover, approximately 70% of microbloggers are under the age of 30, and 80% of that group hold a Bachelor's degree. These young people, with their posts and comments on Weibo, together form a large part of public opinion on China's Internet.
Many of Twitter's features and functionalities are implemented in Weibo. As a basic constraint, any single post is limited to 140 characters, Chinese or English. Referring to other users in a post is done with the '@Username' format, and users can add hashtags with the '#tagname#' format. In addition, the '//@Username' format re-posts another user's post, just as 'RT @Username' does on Twitter.
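These formatting conventions can be extracted mechanically. The sketch below pulls mentions and hashtags out of a post text; the assumption that usernames consist of word characters and CJK ideographs is mine, not a documented Weibo rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the post conventions described above.
// The allowed username character classes are an assumption
// (ASCII word characters plus CJK ideographs).
public class PostParser {
    private static final Pattern MENTION =
        Pattern.compile("@([\\w\\u4e00-\\u9fff]+)");
    private static final Pattern HASHTAG =
        Pattern.compile("#([^#]+)#");

    // Collects every @Username (including //@Username reposts).
    public static List<String> mentions(String post) {
        List<String> out = new ArrayList<>();
        Matcher m = MENTION.matcher(post);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    // Collects every #tagname# hashtag.
    public static List<String> hashtags(String post) {
        List<String> out = new ArrayList<>();
        Matcher m = HASHTAG.matcher(post);
        while (m.find()) out.add(m.group(1));
        return out;
    }
}
```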
The structure of Weibo's social network is built on two concepts: fans and followers. A user is free to 'follow' any other user, thereby becoming a 'fan' of the one being followed. Once user A becomes a follower/fan of user B, all of B's posts are pushed to A's main page.
Besides text, users may post images, for example to carry text that exceeds the 140-character limit. Videos, music and uploaded files are also allowed, provided they do not violate copyright. Additionally, most posts and all comments, re-posts and likes are visible only to logged-in users; this is why Weibo is called a semi-public social network. Unregistered visitors can only see the latest post of a registered user.
5 IResearch (2011) China microblog industry and user research report 2010 (Chinese). Beijing: iResearch.
Accessed on: 11 December, 2012. HTTP://www.iresearch.com.cn.
Weibo has a verification policy, similar to Twitter's verified accounts, which requires a user to submit real-world identification documents: an ID card or passport for an individual, or an official letter with legal proof for a company or organisation. Identities are examined and verified manually (or not, as I will explain later). Successfully verified users are given a large 'V' badge after their name: an orange 'V' for individuals and a blue 'V' for organisations and companies. There are different classes of verification for different organisations, e.g. educational institutes, public organisations, and local or international government departments.
Weibo offers a levelling system for users based on account activity and online time. Each account has a 'next level experience' threshold: as in video games, a certain amount of experience is required to level up the account. Users can speed up their levelling either by earning experience through tasks such as logging in every day or making one new post each day for 5 days, or by paying the website for a boost. Each level grants the user certain advantages on Weibo, such as free recommendation of the account to new users or a free cash lottery. In addition, users may apply for the title of 'Weibo Master' if their account achieves a certain popularity and influence.
Sina has developed a Weibo app for multiple platforms including Android, iOS, Blackberry
OS, Windows Mobile and even Symbian S60. There is also a desktop version client that runs
on Windows PCs.6
6 Weibo Desktop Client home page. Accessed on 11 Nov 2015, HTTP://desktop.weibo.com
2.3 Preliminary research on Zombie Accounts
2.3.1 Problem with Weibo
Many news reports have indicated that websites such as Twitter and Weibo are flooded with program-controlled accounts. For example, it has been reported that the Twitter account of the American President Obama, which had 36.9 million followers, had at least 19.5 million fake followers.7 Similarly, the Weibo followers of Xie Na, a famous Chinese TV hostess, singer and actress,8 dropped dramatically from 3 million to 2 million in a single day after Sina Weibo began its first purge of fake accounts.9 Accounts in such quantities cannot easily be controlled manually by humans; there is no doubt that they are managed by computer programs. Such computer-managed, automated accounts are called zombie accounts. It was reported back in 2009 that 24% of Twitter accounts were bot-controlled,10 and it is possible that Weibo has even more.
7 Barack Obama is the political king of the fake Twitter followers, with more than 19.5 MILLION online fans who
don't really exist. Accessed on 13 Nov 2015. HTTP://www.dailymail.co.uk/news/article-2430875/Barack-Obama-
19-5m-fake-Twitter-followers.html
8 Xie Na’s homepage of Hunan TV station, China. Accessed on 13 Nov, 2015,
HTTP://ent.hunantv.com/v/mxgw/hnzc/xnzy/index_3244.html
9 Xie Na’s Weibo Post complained about the decrease of followers on 23 Nov, 2011. Accessed on 13 Nov 2015.
HTTP://www.weibo.com/1192329374/zF0tcUr1cj?type=comment
10 An In-depth look at the most active twitter user data. Accessed on 13 Nov 2015. HTTP://sysomos.com/inside-
twitter/most-active-twitter-user-data
Although Weibo officially launched another zombie-cleaning operation earlier this year,11 its effectiveness has been doubted by many.12 No available report confirms that the problem of zombie accounts on Sina Weibo has been solved.
2.3.2 Investigation on Zombie account market
The zombie account phenomenon is obviously profit-driven. According to Carlo De Micheli and Andrea Stroppa's research,13 a conservative estimate puts fake Twitter followers at a potential business of $40 million to $360 million. Given that Weibo has at least half as many active users as Twitter,14 the potential zombie account business on Weibo might reach a similar scale. I therefore investigated the zombie account market on Weibo.

Although there is no easy way for me to estimate the size of such an underground business, it is still possible to glance at its surface. According to my own small investigation, there are at least five parties involved in this business, most of them legal (as there is no written law in China covering it), and each possibly operating separately from the others.
11 “Weibo launched the plan to clean fake fans, building an optimum Weibo environment” (微博启动垃圾粉丝清
理计划 打造良性微博生态). Sina Official Website. Accessed on 13 Nov 2015. HTTP://tech.sina.com.cn/i/2015-
02-10/doc-iavxeafs1026932.shtml
12 “Users are losing their real fans, what is Sina’s ‘conspiracy’?” (用户大量掉真粉儿,新浪微博有何“阴
谋”?). Accessed on 13 Nov 2015. HTTP://www.ibailve.com/show/6-579-1.html.
13 Twitter and the underground market, C De Micheli., A Stroppa.
14 Number of monthly active Twitter users worldwide from 1st quarter 2010 to 3rd quarter 2015, Accessed on 13
Nov 2015. HTTP://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
2.3.2.1 Proxy
The first party is the web proxy server provider. A proxy is a server that acts as an intermediary for requests from clients seeking resources from other servers. A proxy with high anonymity, an 'elite proxy', can hide the true geographical position of an Internet user, which is vital for this kind of industry. Websites such as Weibo will block an IP address if the HTTP requests from that IP arrive at more than 2 requests per second for over a minute; I verified this with a small Java program that simply sends HTTP GET requests for a Weibo page. Clearly, proxy technology is essential for controlling thousands or even millions of accounts. Proxy servers are unexpectedly cheap. For example, a seller named TKDaili15 offers different monthly bundles, from 350 thousand IPs for ¥50 (about £5.15) to 50 million different IPs for ¥500 (about £51.50). That means that for only 51 pounds, one can send at least 50 million HTTP requests to Weibo from 50 million different IPs. Moreover, if the program avoids sending requests too frequently, each proxy remains usable for connecting to Weibo for between 10 minutes and 10 hours, according to my tests. The same advantages of proxies also allowed me to perform the large-scale crawling for this research, as I will explain later.
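The blocking rule observed above implies a simple client-side throttle: keep each exit IP under roughly two requests per second. A minimal sketch follows; the 2-requests-per-second budget comes from my informal test, not from any documented Weibo limit.

```java
// A minimal client-side throttle derived from the observed blocking
// rule (an IP sending more than ~2 requests/second for over a minute
// gets blocked). The rate budget is an assumption from informal tests.
public class Throttle {
    private final long minIntervalMillis;
    private long lastRequestAt = Long.MIN_VALUE;

    public Throttle(double maxRequestsPerSecond) {
        this.minIntervalMillis = (long) Math.ceil(1000.0 / maxRequestsPerSecond);
    }

    // Returns how long the caller should sleep before sending the
    // next request at time `nowMillis`, and reserves that slot.
    public synchronized long delayBefore(long nowMillis) {
        if (lastRequestAt == Long.MIN_VALUE) {
            lastRequestAt = nowMillis;
            return 0;
        }
        long earliest = lastRequestAt + minIntervalMillis;
        long delay = Math.max(0, earliest - nowMillis);
        lastRequestAt = Math.max(nowMillis, earliest);
        return delay;
    }
}
```

A crawler would call `delayBefore(System.currentTimeMillis())` and sleep for the returned duration before each request; with a proxy pool, one throttle per exit IP keeps every address under the limit.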
2.3.2.2 Verification Code Typing Service
Verification codes are the characters that one is required to recognise and type when logging in, performing operations or posting messages on certain websites, in order to verify that the user is human. These codes are normally printed on screen twisted, randomly coloured and with varying contrast, making them very hard for machines to recognise. There are small companies, however, that offer the service of recognising and typing verification codes for various websites. A platform called Damatu is famous in this industry; it integrates 4 code-typing companies and offers price comparison between their services.16 As of 13 Nov 2015, the cheapest price for a thousand verification code recognitions was 1.4 RMB (0.14 GBP).16 Be aware that this service is not performed by any algorithm or program: the codes are processed manually by cheaply paid workers. The Damatu platform also offers jobs for such code recognition, and it is easy to register as an online worker on the website. The workflow of this business is as follows:
1. The client sends a picture of a verification code to the recognition server.
2. The server distributes the code to a worker.
3. The worker recognises the letters or characters in the picture and submits the result as text.
4. The server responds to the client with the text.
5. The client verifies the text and sends back the verification result.
6. The worker is rewarded if the code was correct.
According to my own test with a small Java program and 100 thousand samples, the accuracy of this manual service is higher than that of any known algorithm: about 99% of verification codes are correctly recognised. This service evidently serves no good purpose; it is at the heart of all spam systems as well as the zombie accounts on Weibo. The registration process of Weibo requires new users to type in verification codes, and a user logging in from a different place is also required to type one in. With this service, the obstacle of verification codes is overcome at very low cost, which may lead to an underestimate of the quantity of zombie accounts. Common sense suggests that a website should have more real users than zombie users, but this might be an illusion. My study will try to determine the proportion of zombie accounts on Weibo as accurately as possible, without assuming that human users are the majority.

15 TK 代理, TKDaili. Accessed on 13 Nov 2015. HTTP://www.tkdaili.com/charge-HTTP.aspx
16 Damatu 打码兔 (Code Typing Rabbit). Accessed on 13 Nov 2015. HTTP://www.dama2.com/
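The six-step workflow described above can be modelled as a small loop between client, server and worker. The sketch below is a toy simulation of that protocol, not the real Damatu API (whose interface is not documented here); the names and the pay-per-correct-code bookkeeping are my own illustration.

```java
import java.util.function.Function;

// A toy simulation of the six-step workflow: the client submits a
// captcha, the "server" hands it to a worker function, the client
// checks the answer, and the worker is credited only on success.
// All names here are hypothetical, for illustration only.
public class CodeTypingFlow {
    private int workerCredits = 0;

    // `worker` stands in for steps 2-4 (human recognition);
    // `verify` stands in for step 5 (client-side check).
    public String solve(String captchaImage,
                        Function<String, String> worker,
                        Function<String, Boolean> verify) {
        String text = worker.apply(captchaImage);
        if (verify.apply(text)) workerCredits++;   // step 6: reward
        return text;
    }

    public int credits() {
        return workerCredits;
    }
}
```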
2.3.2.3 Account Trainer and Sellers
In the following text, I define account trainers as people who use automated programs to manage Sina Weibo accounts and who, by doing so, maintain the activity, quality and human-likeness of massive numbers of Weibo accounts.

Sellers of zombie accounts appear to be a different group of people in this industry. It is easy to obtain a so-called zombie account from various websites. In order to make an initial study of zombie accounts and this industry, I bought some zombie accounts from the most famous shopping website17 in China, "taobao.com" (a Chinese equivalent of eBay18 that allows sellers to sell both virtual and physical products). The accounts were bought from different sellers, and the usernames of their products usually follow one of two formats:

1. Letters + numbers. The letter prefix is usually the seller's Taobao shop name, and the number suffix is usually sequential. It is quite obvious that these accounts were created by computer programs.

2. A random phone number. In China, a mobile phone number consists of 11 digits and requires a SIM card to be registered. On Weibo.com, users are encouraged to register with their phone numbers as usernames for service and security reasons, but each phone number may register only one account. It remains a mystery how the zombie account sellers obtain such a massive quantity of active mobile numbers and use them to register accounts. I would suggest that this is another black market, one that I am not able to dig into in this paper.
17 "Taobao.com Site Info". Alexa Internet. Retrieved 2015-08-13. HTTP://www.alexa.com/siteinfo/taobao.com
18 “Taobao= eBay+ Rakuten+ Amazon”, ("淘宝网=eBay+乐天+亚马逊). International financial paper (国际金融
报). 9 Jul 2009. Retrieved 14 Nov 2015. HTTP://paper.people.com.cn/gjjrb/html/2009-
07/09/content_292098.htm
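The two username formats above suggest simple pattern checks. The sketch below encodes them as regular expressions; the minimum lengths and the leading '1' for Chinese mobile numbers are my own assumptions for illustration, not rules used later in the thesis.

```java
import java.util.regex.Pattern;

// Heuristic checks for the two username formats observed among the
// purchased accounts. The exact lengths and prefixes are assumptions
// for illustration only.
public class UsernameHeuristics {
    // Format 1: a letter prefix (e.g. a shop name) followed by digits.
    private static final Pattern LETTERS_THEN_NUMBERS =
        Pattern.compile("[A-Za-z]{2,}\\d{2,}");
    // Format 2: an 11-digit Chinese mobile number (starting with 1).
    private static final Pattern CN_PHONE =
        Pattern.compile("1\\d{10}");

    public static boolean looksLikeBatchName(String username) {
        return LETTERS_THEN_NUMBERS.matcher(username).matches();
    }

    public static boolean looksLikePhoneNumber(String username) {
        return CN_PHONE.matcher(username).matches();
    }
}
```

On their own such heuristics prove nothing, but they give a cheap first filter before the behavioural features studied later.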
In order to attract buyers, many sellers offer a trial discount, that is, a very inexpensive price for the first ten or hundred accounts. The normal price is usually between ¥0.1 and ¥0.2 (0.01 to 0.02 GBP) each.19 I spent less than 10 GBP and bought about a thousand Weibo accounts from different sellers. I then discovered that some of the sellers offered accounts with usernames in the same pattern, letters + numbers, with very similar letter prefixes. It appears that different sellers share the same source for their "products". I have two hypotheses in this regard:
1. The same seller operates multiple zombie account shops, gaining an advantage over competing sellers. This strategy works because:
a. On each search page of Taobao.com, only 48 shops are displayed;
b. It is very common for users to sort search results by price;
c. This appears to be a very competitive business: among the first 5 pages of results, all sellers offer very low prices, about ¥1 for 5 to 10 accounts;
d. The authenticity of transactions in such virtual products is in doubt,20 so the reviews of such shops are usually ignored.
All these factors encourage sellers to register more shops to increase their probability of being seen by customers on each search page.
2. The sellers are different groups of people from account trainers. They purchase their
products from sources such as the account trainer and sell them to whoever in need. Apart
19 Search page of Weibo account sellers showing a competitive price list. Accessed 13 Nov 2015.
https://s.taobao.com/search?q=%E5%BE%AE%E5%8D%9A+%E6%89%B9%E5%8F%91&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.7724922.8452-taobao-item.2&ie=utf8&initiative_id=tbindexz_20151118&bcoffset=1&ntoffset=1&p4plefttype=3%2C1&p4pleftnum=1%2C3&s=0
20 "Good reviews on Taobao can be bought easily" ("淘宝付费刷信誉泛滥"). People.cn, 13 Aug 2014. Accessed
13 Nov 2015. http://finance.people.com.cn/n/2014/0813/c1004-25455454.html
from the similarity of the format of the usernames, there is another piece of evidence
that supports this hypothesis: the descriptions of many shops state that buyers who need
a large quantity should contact the seller and order in advance.21 This indicates that the
seller does not hold a large stock of accounts, but if contacted in advance, they will
manage to get enough accounts.
There are quite a few free and paid tools that help users register Weibo accounts en
masse.22 This kind of tool would enable anyone with the appropriate resources to create a
large quantity of Weibo accounts. However, none of the public tools has the capability of
maintaining the accounts, such as logging in regularly or making posts and comments
regularly. The account trainers must have other tools that are not publicly sold, for many
of the Taobao-sold accounts are more than merely registered accounts with default empty
settings: they come with selfie photos, fans, followers, and regular posts. Moreover, it can
safely be assumed that the majority of the followers of such accounts are zombies as well.
Account trainers use programs to make zombie accounts follow each other, imitating
human-like behaviour. These more human-like accounts are called "advanced accounts"
and are usually more expensive. Account prices vary from ¥1 to ¥1000 or even more,
based on account quality, which is usually defined by the account level, number of fans
and registration date.
21 A sample shop of a Weibo accounts seller. All descriptions on the page are presented as images; one line in
red reads "Please contact us for bulk orders" ("量大请联系") before the bold black attention text. Accessed 14
Nov 2015.
https://item.taobao.com/item.htm?spm=a230r.1.14.38.tQNW0p&id=523123002035&ns=1&abbucket=15#detail
22 A download site page that offers this kind of tool. The page describes a Weibo registration tool with a
verification code recognition service integrated into it. Accessed 14 Nov 2015.
http://www.boyuansoft.com/html/cn/product/read_370.html
Rather than speculating groundlessly about what zombie accounts are like, obtaining
various types of zombie account from these sellers is a better way to study the zombie
accounts and their features closely. Therefore, I have tried to buy as many different
zombie accounts as possible within a limited budget. The prices and other details of the
accounts are described in the following sections.
2.3.2.4 Clients
Whoever utilises zombie accounts for their own benefit is a zombie account client.
Celebrities,9 commercial and non-commercial corporations and organisations, government
departments, individuals who want fame, and people from all walks of life are potential
buyers of zombie accounts. Based on the large number of account shops on Taobao.com,19 it
is safe to say that this product has many buyers.
Proxies, verification code recognition, account trainers, sellers, and buyers are the five
parties that affect the analysis in this research and are therefore worth investigating. For
instance, location-based analysis of zombie accounts might be significantly influenced by
the use of proxies. Moreover, well-disguised verification codes have conventionally been
viewed as an effective way of reducing script-controlled registration, but with the
recognition services of providers such as 'Damatu', the verification code is no longer as
strong an obstacle as assumed, and therefore both Sina officially and researchers may
have underestimated the quantity of zombie accounts.
2.4 Literature Review
2.4.1 Existing Research
There is much research related to this topic, but most of it concerns zombie accounts on
Twitter, with far less looking at Weibo. Considering that Weibo has at least half as many
users as Twitter, the quantity and quality of studies on Weibo is disproportionately low. In
this section, I introduce several approaches, on either Twitter or Weibo, classified into 4
types.
2.4.1.1 Automation detection based on timestamp information
This is a study on Twitter (Zhang & Paxson, 2011) in which only the publicly available
timestamp of each tweet23 is used.
The paper uses minutes-of-the-hour and seconds-of-the-minute as two axes to plot the
activities of different accounts. Comparing the graphs of different users, the authors found
that the graphs of some users passed a "χ2 test for expected uniformity, presumably
reflecting organic behaviour" (Zhang & Paxson, 2011), while other accounts exhibited
detectable non-uniformity or hyper-uniformity. By analysing 6 different types of post-time
distribution (see Figure 1), the paper concludes that automated bot accounts can be
recognised by the uniformity of their posting times: if a user's posts are non-uniform or
overly uniform, the user is assumed to be automated by a program. "We can conclude the
presence of automation if we find tweet times either not uniform enough, or too uniform."
(Zhang & Paxson, 2011)
23 A tweet is a post or microblog on Twitter, the length of which is no longer than 140 characters.
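The uniformity test described above can be sketched in a few lines. This is my own illustrative reconstruction, not the paper's code, and the decision thresholds below are placeholders rather than the paper's values:

```python
# Sketch of a chi-square uniformity test on posting times: bin the
# seconds-of-the-minute of each post and compare the bin counts with a
# uniform expectation. The thresholds in looks_automated are illustrative.
from collections import Counter

def chi_square_uniformity(seconds_of_minute, bins=60):
    """Pearson's chi-square statistic of timestamps vs. a uniform spread."""
    n = len(seconds_of_minute)
    expected = n / bins
    counts = Counter(s % bins for s in seconds_of_minute)
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(bins))

def looks_automated(seconds_of_minute, low=30.0, high=100.0):
    """Flag hyper-uniform (statistic too low) or non-uniform (too high) posting."""
    stat = chi_square_uniformity(seconds_of_minute)
    return stat < low or stat > high
```

A perfectly even spread of posting seconds scores 0 (hyper-uniform, flagged), while posts all landing on the same second score very high (non-uniform, also flagged); organic posting falls in between.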
Figure 1 Different uniformity
19463 accounts were tested using their public timelines. The authors found that 16% of
active accounts demonstrate a high likelihood of automation; moreover, 11% of accounts
that post solely through the browser are automated.
As the paper acknowledges, it is quite plausible for an automated account to evade their
test, but there was no evidence at the time of publication that any automated account was
intentionally exhibiting uniformity in order to do so.
The approach of Zhang & Paxson is easy to understand and was effective in their
small-scale test. Although uniformity is a sign of automation, it is not definite proof.
Unlike what the paper reports about Twitter, there is evidence that on Weibo,
organisation-controlled accounts intentionally post at fixed time intervals because they
believe this will increase their popularity.24 Yet these famous Weibo accounts
24 "The controllers of grass-roots popular Weibo accounts: decrypting Weibo influence and relations" ("草根牛
博操控者 解密微博势力关系谱"). Tencent News (May 2011). Accessed 15 Nov 2015.
http://finance.qq.com/a/20110505/003612.htm. This article describes how several very popular (more than 10
million fans) Weibo accounts are managed: the managers hired employees to post routinely, so as to maintain
the freshness and activity of the accounts.
are only a small fraction of all 500 million users. I would expect good results from a
larger-scale test on Weibo if I am able to filter out this type of account.
Another paper (Tavares & Faisal, 2013) describes a similar approach. In it, Twitter
accounts are divided into three classes: personal, organisational and bot-controlled. The
investigation is likewise independent of tweet content. Probabilistic inference algorithms
are used for classification, including a naïve Bayes classifier and a prediction algorithm
that tries to predict the distribution of the time intervals between a user's tweets.
The prediction results are slightly worse than those of other related work. With a naive
Bayes classifier distinguishing between account categories, an accuracy of 84.6% was
achieved when classifying only between individual and organisational accounts, and 75.8%
when classifying between all three types. The accuracy may seem a little low, but the
authors explain that because they used no a priori assumptions about account features,
all classification is based purely on tweeting behaviour, rather than on parsing tweet
contents or analysing user profiles as others did.
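To make the timing-only idea concrete, the sketch below derives simple features from the gaps between consecutive post timestamps. Summarising the gaps by their mean and coefficient of variation is my own simplification, not the paper's exact method:

```python
# Illustrative timing features: a bot posting on a fixed schedule has
# near-zero variation in its inter-post gaps, while organic posting is
# much burstier. Feature names here are my own, not the paper's.
def interval_features(timestamps):
    """Mean and coefficient of variation of inter-post gaps (seconds)."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    cv = (var ** 0.5) / mean if mean else 0.0
    return {"mean_gap": mean, "gap_cv": cv}
```

Such per-account features could then feed any of the classifiers discussed later, with no access to post content at all.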
In this project, these two methods are used as starting points. I have managed to obtain
enough information and carried out simplified versions of experiments based on their
studies.
2.4.1.2 Classifier by using supervised learning on different features
In this paper (Amleshwaram, et al., 2013), researchers develop a set of 15 new features
and use them together with 3 previously proposed features to detect Twitter-based
spammers. Spammers are known to use zombie accounts for all kinds of purposes,
including commercial advertising and spreading malicious URLs. Through experiments and
tests, the paper identifies the subset of features that contributes most to spam detection,
enabling them to detect more than 90% of spammers from only 5 tweets. 600 thousand
accounts were used to evaluate their approach. In detail, they achieve a 96% detection
rate with only a 0.8% false positive rate using different supervised learning algorithms.
The researchers propose 5 categories of detection features. The first focuses on
bait-oriented features, used to identify spammers who lure victims by posting fake tweets
or by mentioning victims in random tweets, expecting the victims to click on the
accompanying URLs. The second set captures behavioural aspects of spammers, including
repeated text and URLs and the domains of the URLs posted in tweets. The next set is
drawn from tweet content that shows signs of an automated program. In addition, the
similarity between tweets of the same account is analysed as a feature. Finally, the user
profile is taken as a feature: the paper assumes that a well-organised profile is less likely
to be malicious.
This feature-based spammer classifier produces promising results. The authors highlight
that they were able to identify more than half of the spammers from only a single tweet.
Very little computation is therefore needed, and yet very good accuracy (96%) is obtained.
Although zombie accounts are defined differently from spammer accounts, and the
bait-oriented features are not available on Weibo.com,25 the features analysed in this
paper are very inspiring for my work, and the fact that only a little computational power is
needed for classification is very helpful for a large-scale approach to zombie
classification.
25 Any URL posted by users on Weibo is converted into an intermediate short URL, whose content is hard to
differentiate.
2.4.1.3 Location-based zombie user detection
A new approach was published recently (Deng, et al., 2015), in which the locations of
accounts and of their fans and followers on Weibo.com are used to classify their zombie
identity. The locations of the followers and fans are compared and recorded: whenever
two of the followers, or two of the fans, share the same registration city or province, two
variables named SAMEC and SAMEP are incremented respectively. The numbers of
followers and fans are also defined as two variables, FER and FING. Thresholds SAMEC_TH,
SAMEP_TH, FER_TH and FING_TH are defined for the above four variables respectively.
Based on these thresholds and variables, an intuitive classification rule with four
conditions is created; together with a logic expression, they form a rule-based classifier,
as shown in Figure 2.
Figure 2 Rule based on 4 conditions
10000 Weibo accounts were used to test the scheme, with zombie accounts identified
manually. Using trial and error, the authors found the configuration of thresholds that
grants this scheme an accuracy of approximately 80%.
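The four-variable rule can be sketched as follows. The pair-counting definitions follow the description above, but the threshold values and the exact boolean combination are illustrative placeholders; the paper's actual rule is the one shown in Figure 2:

```python
# SAMEC/SAMEP count pairs of an account's fans sharing a registration city
# or province; FER/FING are the follow and fan counts. The thresholds and
# the conjunction below are illustrative, not the paper's tuned values.
from itertools import combinations

def location_counts(fan_locations):
    """fan_locations: list of (city, province) pairs for an account's fans."""
    samec = sum(1 for a, b in combinations(fan_locations, 2) if a[0] == b[0])
    samep = sum(1 for a, b in combinations(fan_locations, 2) if a[1] == b[1])
    return samec, samep

def rule_classify(samec, samep, fer, fing,
                  samec_th=10, samep_th=20, fer_th=500, fing_th=10):
    """True = flagged as zombie under this illustrative rule: many fans from
    the same place, many follows made, few fans following back."""
    return samec >= samec_th and samep >= samep_th and fer >= fer_th and fing <= fing_th
```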
This paper does provide a creative method of classifying zombie accounts, which could be
helpful to other types of zombie classifier. However, it also has obvious problems. Firstly,
the accuracy of their test is questionable because they do not explain how zombie
accounts among the 10000 were identified manually, so the soundness of that manual
identification is in doubt. Secondly, the 4 conditions are not explained thoroughly. There is
no evidence to support assumptions such as "At least in the early days, zombie accounts
made large number of follows so that they might get followed back." or "It is hard to find
real user accounts to follow them back" (Deng, et al., 2015). What is more, the thresholds
are found by brute force, tuned to obtain the best classification result on those unreliable
10000 accounts; it is safe to predict that their classifier is overfitted. Most importantly,
the use of proxies can potentially defeat their assumptions, depending on how the
automated registration program uses proxies and how the account-managing program
maintains the accounts. If each proxy IP is used for only a limited number of registrations,
the locations would appear more random than the paper expects.
To sum up, although it falls short of being a sound piece of research, this paper is
inspiring in its use of location for zombie account classification. In my project, account
location is considered a nominal feature, and its usefulness is studied in a later section.
2.4.1.4 User interaction based research
The paper (Sun, et al., 2014) analyses user interactions based on the influence transfer
effect in social networks. Sun and his colleagues present a regional user interaction
model, which describes a dynamic process in an online social network, in order to study
the interaction processes of different users.
Figure 3 Schema of Sun's regional user interaction model
Direct influence and indirect influence are calculated according to how users retweet
others' posts and the distance between them. Consider user interaction as a graph, where
the nodes are users and the edges are direct retweets from one user to another; the
distance between users is then defined as the number of edges on the shortest path
between the two user nodes. The greater the distance, the less influence one user has on
another.
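The distance used in this model is ordinary shortest-path length on the retweet graph, which a plain breadth-first search computes. The sketch below is a minimal illustration, not code from the paper:

```python
# BFS shortest-path distance on a directed retweet graph.
from collections import deque

def shortest_distance(graph, src, dst):
    """graph: dict user -> set of users they directly retweet.
    Returns the edge count on the shortest path, or None if unreachable."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None  # no retweet chain links src to dst
```

This also makes the cost of the method visible: computing influence for one account means traversing its whole follower/fan neighbourhood.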
Figure 4 The distribution of transferred influence of each user node
By calculating the influence of each node on 500 test accounts retrieved from Weibo,
clear patterns were found: users separate easily into 3 groups: key users who have
significant influence among their followers, regular users who have some influence among
others, and isolated users for whom no interaction and no influence are found. The
authors then compare results based on PageRank (Chang, et al., 2013) with their own
interaction model. The Complementary Cumulative Distribution Function (CCDF) is used
as the y value, with follower count, effective follower count, individual post count and
average post count as the x axes, creating 4 scatter plots of the top 50 influential
accounts respectively. Sun's model shows superior results to Chang's on the same
features, with a larger-scale interaction model visible on all plots.
In further tests, where all the test data are evaluated with their interactions, the 3 groups
of users show significant differences in the contrasted distributions against both account
reputation and tweet count.
In conclusion, although their assumption that zombie users should have less influence
than real users is not entirely sound, Sun and his colleagues provide a valuable way of
identifying zombie accounts based on their regional user interaction model. Although the
paper only demonstrates the difference in CCDF between types of users, and does not test
the accuracy of zombie classification, its value in aiding classification should not be
neglected. The method requires more computational power because the program needs to
traverse a user's follower and fan trees to determine the user's influence and thereby
classify the account type. In this project, this method is not tested, as obtaining the user
relation graph is difficult with undergraduate-level time and space resources. However, it
inspired me to implement a scalable web crawling framework that will be helpful for
extended research in the future.
2.4.2 Data Mining & Machine Learning methods and algorithms
In the reviewed literature, different classifiers and clustering methods are used. In this
project, I have selected several supervised classifiers, for the following reasons.
2.4.2.1 Naive Bayes
Naive Bayes is a popular conditional probability model: given a problem instance to be
classified, represented as a vector of feature values, it computes the probability of each
possible class that the instance could belong to. Training a naive Bayes classifier requires
a set of ground truth data for the different classes, as large as possible; the performance
of the classifier improves as the training data set grows. In addition, the naive Bayes
classifier assumes that the features of the data set are conditionally independent of each
other given the class.
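A toy version over binary features makes the model concrete: class priors multiplied by per-feature likelihoods, with Laplace smoothing to avoid zero probabilities on unseen feature values. This is a minimal sketch for intuition, not the WEKA implementation used later in this project:

```python
# Minimal categorical naive Bayes over binary features.
from collections import defaultdict

def train_nb(rows):
    """rows: list of (feature_tuple, label). Returns counts used at prediction."""
    class_count = defaultdict(int)
    feat_count = defaultdict(int)  # (label, feature_index, value) -> count
    for feats, label in rows:
        class_count[label] += 1
        for i, v in enumerate(feats):
            feat_count[(label, i, v)] += 1
    return class_count, feat_count

def predict_nb(model, feats):
    class_count, feat_count = model
    total = sum(class_count.values())
    best, best_p = None, -1.0
    for label, c in class_count.items():
        p = c / total  # class prior
        for i, v in enumerate(feats):
            # Laplace smoothing for a binary feature (2 possible values)
            p *= (feat_count[(label, i, v)] + 1) / (c + 2)
        if p > best_p:
            best, best_p = label, p
    return best
```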
2.4.2.2 Decision Tree Learning
Decision tree learning is a predictive modelling approach used in data mining.
Classification trees, in which the target instance can take values from a finite set of
features, are used to predict the class of the target instance. Leaves are the class labels
and branches represent conjunctions of features leading to those labels. A decision tree
offers a visual representation of how each instance is classified, which I find very useful
for understanding the different features and their significance in classifying zombie
accounts.
2.4.2.3 Multi-layered Perceptron
A multilayer perceptron (MLP) is a feedforward artificial neural network model. It maps
sets of input data onto a set of outputs using a multi-layered neural network, trained with
the backpropagation algorithm. The MLP has been a popular learning model in various
fields, such as machine translation and speech recognition, since the 1980s.
2.4.2.4 Support Vector Machine
A support vector machine (SVM) is a supervised learning model that analyses data for
classification and regression. Given a set of training examples, each tagged with one of
two classes, the training algorithm builds a model that assigns new examples to one of
the classes by constructing a hyperplane from the training data and comparing each new
instance against it. The SVM is a non-probabilistic binary classifier: because it depends
only on the separating hyperplane, and the support vectors that define it, rather than on
estimated class probabilities, it can generalise from a small amount of training data whose
class proportions are unknown, and it is comparatively resilient to overfitting. Unlike a
naive Bayes classifier, the zombie account classification result will not be strongly
affected by the proportion of zombie accounts in the training set, which is a preferable
property since there is no way to determine the prior probability of zombie accounts.
2.4.2.5 Combining multiple classifiers
Research has shown that some classification tasks become more reliable if more than one
classifier is used. For example, one study (Kittler, et al., 1998) shows that a combination
algorithm such as the sum rule applied to multiple classifiers outperforms many individual
classifiers on pattern recognition problems. Another study (Benmokhtar & Huet, 2006)
shows that combining a multilayer neural network with Gaussian mixture models gives an
overall improvement in the performance of video shot recognition. The paper (Opitz &
Maclin, 1999) likewise shows that bagging and boosting increase classifier performance
even given only a few different classifier models. It is quite plausible that none of the
above classifiers alone can distinguish zombie accounts from real ones; therefore,
classifier fusion methods will be considered in this project.
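The sum rule mentioned above is simple to state: sum (or average) the per-class probability estimates of the individual classifiers and take the class with the highest combined score. A minimal sketch, assuming each classifier can emit a probability per class:

```python
# Sum-rule fusion: add up each classifier's per-class probability
# estimates and return the class with the highest combined score.
def sum_rule(prob_lists):
    """prob_lists: one dict {class: probability} per classifier."""
    combined = {}
    for probs in prob_lists:
        for cls, p in probs.items():
            combined[cls] = combined.get(cls, 0.0) + p
    return max(combined, key=combined.get)
```

For example, two weak "zombie" votes can outweigh one confident "real" vote, which is exactly the averaging-out of individual classifier errors that motivates fusion.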
2.4.3 Research gaps and my approach
Some of the above methods have the gaps mentioned in the introduction section, and I
plan to fill these gaps in this project.
Firstly, the sampling rate is too low considering the total number of users on either Weibo
or Twitter, and this reduces the value of the research results. The paper (Sun, et al., 2014)
collects only 500 users' data, whereas there may be a much larger number of zombie
sellers on the market. Each of them might have a different way of producing zombie
accounts, and thus there might be more possible types of user than the 3 found in the
paper. This gap can be addressed by evaluating a larger data set.
Secondly, after reading these papers, I find that they have all used unsound data to
evaluate their classifiers: they either label zombie accounts in the test data according to
naive assumptions or thresholds, or classify the zombie accounts manually, neither of
which is sufficiently justified. In the paper (Amleshwaram, et al., 2013), for example, the
work is evaluated by observing the suspension state of the identified spammer accounts.
This justification rests on the efficiency and correctness of Twitter's anti-spammer
mechanism, which is sounder than the other approaches but still imperfect: the
mechanism is itself an algorithm and can give false positives and negatives. It is a good
source of reference, but not an evaluation standard. What I can do in this project is
acquire a large number of real zombie accounts from different sellers, the details of which
I explain in a later section.
3 Methodology
3.1 Obtaining Ground Truth data
Before obtaining any data, I define zombie accounts and real user accounts according to
the following rules, in order to clarify the scope of the study:
Zombie accounts
Weibo accounts whose activity is either partially or wholly controlled by computer
programs, regardless of their purpose. Even an account of a real authority that uses
software to manage its Weibo activity is considered a zombie account in this project.
Real accounts
Weibo accounts whose activity is controlled only by human beings, regardless of their
behaviour. An account whose user posts no microblogs at all, and an account whose
human user posts punctually every half hour, are both considered real accounts.
In order to clear up any ambiguity caused by the problematic naming conventions of
Sina's programmers, a recap of the definitions of followers and fans follows:
Followers of an account A: a field of an account; the set of accounts that account A
actively chooses to follow. Account A receives updates of its followers' posts.
Fans of an account A: the set of accounts that choose to follow account A. Whenever
account A posts a microblog, every fan receives an update.
Ground truth data for real human users and for zombie accounts are obtained differently.
3.1.1 Zombie accounts
Getting zombie accounts on Weibo is relatively simple and straightforward. The most
budget-friendly method is to buy fake fans26: simply type "微博 粉丝" ("Weibo fans" in
English) into the search box of www.taobao.com, as introduced in the background section.
Various fans sellers are found immediately.
Visiting most of the shops found in the search, the priced accounts can generally be
divided into 4 levels:
Basic level: accounts usually have no fans themselves and only follow other people.
Top level (顶级 in Chinese, as described by sellers): accounts have limited activities,
including following other accounts and posting simple sentences. They differ from
basic-level accounts in having relatively more fans; that is, these zombie accounts are
usually followed by other zombie accounts, making them more human-like.
26 Fake fans are zombie accounts that are programmed to follow target accounts, mainly for the purpose of
popularity. However, on close inspection of the fake-fan samples provided by the sellers, these "fans" post
microblogs with varied content, such as advertisements, sensitive topics, human-mimicking talk, jokes, etc. I
assume that these sellers not only sell the service of adding fans using these accounts, but also other
potentially "beneficial" services, making such fan accounts the most accessible zombie accounts on a limited
budget.
Exquisite level: accounts are like top-level accounts, except that they post various types
of microblog, including retweets of others' blogs, jokes, advertisements, and
pictures/videos; some of their posts are even "liked" by others.
Expert level, or Daren level (达人): accounts built on the exquisite level, but with a longer
registration time, a higher Weibo level, and more frequent posting (one post per day or
more), making them count as daily active users for Sina.
4 sellers were chosen randomly from the search results for unbiased sampling. The
following table summarises the accounts bought from each seller.
Seller ID27    Fans type           Price28
120006216      15000 Top           ¥528
               10000 Exquisite
142313259      2000 Top            ¥220
               2000 Exquisite
               1000 Expert
149080854      1000 Top            ¥215
               3000 Exquisite
               500 Expert
121117246      10000 Basics        ¥90
Sum            34500               ¥1053
Sellers were asked to use their zombie accounts to follow the account owned by me (that
is, to become my fans). All these fans were added gradually to my account within
27 The ID number of the seller's shop. The shop can be accessed at the URL https://shop[ID].taobao.com
28 In Chinese Yuan
3 days of purchase. A simple C# program was then implemented to traverse my fans list,
obtaining the IDs of most of these accounts. However, only about 21,740 accounts were
actually obtained, due to the limitation that even when viewing one's own fans, only the
first 250 pages of fans can be viewed.
A simple manual analysis was carried out on these zombie accounts. Depending on the
level of the zombie account, the data exhibit different patterns. More expensive account
levels have usually made more posts and have more fans/follows as well. Some basic-level
accounts have no avatar (thumbnail) image. Many of their names follow a distinctive
pattern, appearing to be a mixture of numbers and letters, or a mixture of Chinese
characters and numbers. Moreover, some names contain very rare Chinese characters that
are unlikely to be seen or used by anyone in a lifetime. These properties can be used as
references for further implementation; however, there will be no hardcoded thresholds or
classifiers, since this project aims to create a generalised method of identifying zombie
accounts.
3.1.2 Real user accounts
There is no simple method of obtaining a large number of real user accounts directly.
However, there is a workaround based on a simple assumption: human users who take
their account activity seriously, and who have only limited areas of interest, are very
unlikely to follow zombie accounts, whose posts generally fall into random fields that are
not strongly attractive to any particular person, may seem contradictory to a human
reader, and receive no responses to human comments. Especially since the existence of
zombie accounts was exposed, the public has been more careful about whom they follow.
Based on the above assumption, only a small number of real users need to be identified
manually. These identified accounts can then be used as a starting point for crawling:
using breadth-first search over the follower lists of these accounts, many more accounts
belonging to real users can be found.
Weibo.com has a channel called "Discovery" ("发现" in Chinese)29, where posts by popular
users in each field are randomly chosen and listed. This channel is used as the starting
point for finding real users, who are obtained by the following rules:
1. From each field, 7-10 posts were randomly chosen, and users who had made obviously
sensible comments and discussion on a post were initially selected.
2. The more diverse the better: if the first person in a field was commenting on a post
about basketball, then the second is preferably someone commenting on another sport,
such as football.
3. After the initial users have been selected, the blog page of each user is manually visited
and verified by a simple Turing test: whether its activity (microblogs, photos, comments,
etc.) shows any sign of being programmed. If the test passes, the user is added to the real
user group.
NOTE: There are many marginal users (users who do not actively write posts on Weibo
and only read and comment on others' posts). There is no easy way to distinguish these
users from zombie accounts, since it is very hard to search for their comments on
microblogs written by others30. As a result, initial users selected in the previous steps with
very few posts (<2) are not added to the real user group unless there are other obvious
factors that clarify their identity.
Consequently, the ground truth data of real users is kept sound but incomplete, for many
marginal but real users are ruled out by this. It is foreseeable that this is going to affect
the performance of the classifiers. However, if it is the public opinion or posts
29 http://d.weibo.com/102803_ctg1_1199_-_ctg1_1199# ; translations of the categories can be found in Appendix A
30 Weibo.com does not offer any functionality for searching a user's comments.
data that we care about, then real users who make no posts being misclassified as zombie
accounts because of this incompleteness of the training set will not have a serious
influence. I expect some inaccuracy in the classification results because of the way the
initial users were chosen, but there is no better way of doing it, since inactive users and
many zombie accounts show no human-observable difference. The only hope is left to our
classifiers.
300 users commenting on 300 posts across 48 topic fields were found by the previous
steps. A breadth-first search of depth 1 was used to obtain their followers: 26313 users31
were obtained and added to the real user group. Even under the worst-case assumption
that each of the 300 users followed 5% zombie accounts (though this is very unlikely), we
still have about 25000 real users as ground truth data.
3.1.3 Account data for evaluation
One of the purposes of this project is to find out the proportion of zombie accounts on
Weibo.com; it is therefore preferable to obtain as much account information as possible
for evaluation.
Obtaining a large amount of account data for evaluation is not a simple job. Weibo.com
prevents any single IP from accessing it too frequently: the highest frequency at which an
IP address can constantly access the website without being banned is 1 request per 4
seconds in the long run.32 This means that if information on millions of users is to be
obtained, we need a good framework that can access Weibo.com efficiently. Details of this
framework are described in the next section.
31 The 300 users should have more followers in total; however, Weibo.com only allows any user to view the
first 20 pages (400 followers) of a particular user, thus limiting our results.
32 Tested with a simple Java program that uses a proxy and accesses Weibo.com for user information at
different frequencies.
In total, 24 million non-repetitive user IDs33 were obtained. Details of 1,583,135 users
with little or no blog information were obtained. What is more, details of 900,000 users
together with their 100 most recent posts were obtained. All this information will be used
in testing our classifiers.
3.2 Implementation
3.3 Actual System Design
This project is divided into 3 stages: Data Gathering, Data Pre-processing and Data Mining. Software systems were implemented for Data Gathering and Data Pre-processing because of the relatively large datasets involved. An open-source software package called WEKA is used for Data Mining.
Ground truth data of real users and zombie accounts, as well as a set of random accounts from Weibo.com, are obtained in the Data Gathering stage. The fields and features are then analysed and pre-processed in the Data Pre-processing stage. Finally, the data are used in the Data Mining stage.
3.3.1 Data Gathering
3.3.1.1 Requirements
As described above, gathering only about 50K rows of user IDs for the ground truth data is relatively simple: a sequential program on a 12 Mbps internet connection took less than 12 hours for this task. However, getting more detailed information for 50K users, such as their microblogs, takes much longer.
33 Each user ID corresponds to a Weibo user; the main page of the user can be accessed at HTTP://weibo.com/u/[ID] .
In fact, Weibo.com displays 20 user IDs per page when browsing the follower/fans list of an account, so each HTTP request yields 20 more user IDs for a sequential program. By contrast, Weibo.com displays only 10 microblogs per page for one user. If we want blog information, say 100 microblogs, we need 10 HTTP requests to Weibo.com, plus one HTTP request for the account details that are public to everyone: 11 HTTP requests per account. In the worst case of 1 million users, that is 11 million HTTP requests to Weibo.com. A single machine running a sequential program would take far too long to finish this task, and its IP address would likely be treated by Weibo as a DDoS attacker for making so many requests. Last but not least, the limited internet connection in my rented flat (maximum inbound speed 1.5 MB/s) becomes the bottleneck, since each HTTP request retrieves not only the simple information needed for this project but also an HTML-formatted webpage containing many bytes of unnecessary information.
To sum up, the software system for this project should satisfy the following requirements for data gathering:
1. The system can make a large number of HTTP requests in parallel, or even across distributed machines.
2. The speed at which the system gathers information should not be bounded by the internet speed in my flat.
3. The system should avoid being marked as a DDoS attacker for making too many HTTP requests.
4. In addition, the system should be as fast as possible in obtaining, processing and storing the data.
5. Data gathered by the system should be easily accessible: any program in the following stages should be able to read the millions of rows of data at high speed.
After analysing the above requirements and planning thoroughly, I designed and implemented a distributed information crawling system.
3.3.1.2 Software System Design
3.3.1.2.1 Choice of Database
MariaDB
MariaDB is a community fork of MySQL, an open-source relational database management system. It is easy to use and can handle simply structured datasets very efficiently.
MongoDB
MongoDB is an open-source document database designed for ease of development and scaling.34 It is used because nested data structures are stored in this project, such as the microblogs of a user stored together with the other account details.
3.3.1.2.2 Choice of programming language
Considering that I need a software system that can run in parallel and on distributed machines, a cross-platform language is preferable. Thanks to its design goal of portability, a Java program can be executed on most platforms on which the Java Runtime Environment (JRE) is installed. What is more, I have mainly programmed in Java over the past few years, so Java is the best choice for implementing the desired system.
3.3.1.2.3 Use of Proxy
In order to avoid being treated as a DDoS attacker and banned from accessing Weibo.com, multiple IP addresses are required. During the preliminary research on zombie accounts, I
34 https://docs.mongodb.org/manual/
found that the website TKDaili.com offers highly anonymous proxies35 at inexpensive prices. The website also offers an API through which programmers can retrieve the IP addresses of proxies easily.
3.3.1.2.3.1 Proxy Anonymity Check
In order to check the anonymity and availability of the proxies from TKdaili.com, a website or host is needed. There are many online proxy-anonymity check websites, but they are very slow to connect to: the index pages of these websites contain unnecessary information; most of the proxies are based in China (TKdaili.com is a Chinese provider), yet no good China-based proxy check website could be found; and, last but not least, these websites are very unstable, with no guarantee of uptime or connection speed. An Amazon EC2 instance is therefore used as the host, running a PHP server that holds the proxy checker. In testing, it takes only 200 ms to 500 ms to check a single proxy, even one based in China. Moreover, the Amazon EC2 instance is very stable and guaranteed to be available.
3.3.1.2.4 JPipe Framework
This framework was implemented to handle concurrent work in a task-parallel pattern. When I started this project there was no known object-oriented pipelining implementation in Java, apart from the default stream pipelining package java.nio.pipe, which only handles data as bytes or strings and offers very little functionality for task management and parallel thread control. I developed JPipe for the parallelism and pipelining needed to retrieve and process data at scale.
35 Proxy anonymity has three levels: High Anonymous/Elite, Anonymous and Transparent. If an HTTP request is made through a high-anonymity proxy to a target host, that host will only know that a user at the proxy address is visiting, which protects your identity. More information can be found here: http://www.proxynova.com/proxy-articles/proxy-anonymity-levels-explained/
JPipe is a producer-consumer based, object-oriented pipelining framework whose purpose is to make pipelined work easy to create in Java. The basic idea and structure of the framework are shown below:
Figure 5 Basic Pipeline structure using JPipe
The basic structure of a pipeline consists of multiple "Pipe Sections". Each Pipe Section has a number of workers doing the same work. Objects produced by the workers are saved into a buffer, from which they can be polled by the workers of another Pipe Section. A Pipe Section keeps a record of the status of its child workers, such as the number of successful/failed jobs, the throughput and latency of each worker, and the number of consecutive successful jobs. A Pipe Section object can output these states as a JSON string for further adjustment or monitoring purposes. Moreover, if enabled, a Pipe Section can dynamically change the number of its child workers; if the states of all Pipe Sections are used properly, programmers can build a pipeline that dynamically adjusts its workers according to its bottleneck.
What is more, a plain feed-forward pipeline is not always sufficient, depending on the complexity of the problem. Therefore a Buffer Store that manages all buffers is implemented; workers from any section can access any buffer, allowing the programmer to build more complicated pipelines, for example:
Figure 6 A simple Pipeline structure
PipeSection A produces objects of type X and saves them into buffer B1; PipeSection C produces objects of type Y and saves them into buffer B2. Workers in PipeSection B need both an Object X and an Object Y to produce an Object Z, so they poll from buffers B1 and B2 and save each Object Z into B3; finally, workers from PipeSection D poll the results from B3 for further processing.
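JPipe's code is not reproduced in this text, so the following is only a rough sketch of the idea behind a Pipe Section (class and method names are mine, not the real JPipe API): N identical workers take from an input buffer and put results into an output buffer, with `BlockingQueue` playing the role of the buffers.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Rough sketch of a JPipe-style Pipe Section: a fixed pool of identical
// workers that consume from one buffer and produce into another.
final class PipeSection<I, O> {
    private final ExecutorService pool;

    PipeSection(int workers, BlockingQueue<I> in, BlockingQueue<O> out,
                Function<I, O> work) {
        pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        I item = in.take();        // consume from the upstream buffer
                        out.put(work.apply(item)); // produce into the downstream buffer
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

A Buffer Store would then simply map buffer names to such queues, so a section like PipeSection B in Figure 6 can take from two input buffers.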
The Buffer Store also monitors the progress of the pipeline work. The framework is thread-safe, and all read/write operations on buffers are locked. The Buffer Store creates a detailed JSON string describing the state of every thread of each Pipe Section, including the lock state of each worker and the number of objects consumed from and produced to each buffer. JPipe not only lets programmers build complex pipeline programs easily, since it offers parallelism out of the box, but also lets them debug the parallel program with ease, because the states of workers and buffers are monitored in detail.
3.3.1.2.5 PipeCrawler
PipeCrawler is a Java project built on JPipe for making HTTP requests, obtaining webpage data, pre-processing the data and storing it. Using socket connections and JPipe pipelining techniques, I created a distributed master-slave web crawler that can gather information from Weibo.com quickly and with minimal resources.
This crawler is a compound of multiple pipeline programs. Depending on the execution arguments, the program forks into different instances, all implemented as JPipe pipelines. There are 4 different types of instance:
Server Instance
The server instance is the master of all the other instances. It hands out the information and resources needed for the different jobs (such as proxies from TKdaili.com and raw users) to the slave instances, collects the data they retrieve, and saves these data to the databases accordingly. It also monitors the status of all slave instances, such as their response times and hostnames. Figure 7 gives a brief view of the server instance's pipeline; the details of the workflow are discussed in a later section.
Figure 7 The pipeline structure of the server instance
Slave Instances
Shared code between slave instances
All slave instances have 2 pipe sections and 3 buffers in common: a proxy validator section and a socket connector section; a raw proxy buffer, a valid proxy buffer and a message buffer.
The socket connector section has only one worker, which connects to the server's socket receiver whenever there is a message object in the message buffer. A message can be a request for more raw proxies, a set of product objects from this instance, and so on.
A proxy is raw when its validity is unknown. After retrieving raw proxies from the server, the workers in the proxy validator section try to validate the usability of each proxy from the raw proxy buffer by connecting through it to the proxy anonymity test website; if the proxy works, it is saved into the valid proxy buffer for further use. Proxies must be validated because the raw proxies offered by TKdaili.com have a very short life span, from a few minutes to a few hours, and may already have expired by the time the slave gets them from the server. In addition, not all of these proxies are highly anonymous (about 90% are, by my tests).
All threads that make HTTP requests to Weibo.com use validated proxies from this buffer, and no proxy is shared between any two threads. In this way, Weibo.com is deceived into seeing each of our crawling threads as a single normal user from a different IP address.
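The validation step can be sketched as follows (the class name and checker URL are illustrative, and the anonymity check itself is omitted): a worker tries to reach the checker host through the proxy within a timeout, and a proxy that answers with HTTP 200 is considered usable.

```java
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

// Sketch of a proxy validation worker: an expired or dead proxy either
// times out or throws, and is simply reported as unusable.
final class ProxyValidator {
    static boolean isUsable(String proxyHost, int proxyPort,
                            String checkerUrl, int timeoutMs) {
        try {
            Proxy proxy = new Proxy(Proxy.Type.HTTP,
                                    new InetSocketAddress(proxyHost, proxyPort));
            HttpURLConnection conn =
                (HttpURLConnection) new URL(checkerUrl).openConnection(proxy);
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            int code = conn.getResponseCode(); // performs the request through the proxy
            conn.disconnect();
            return code == 200;
        } catch (Exception e) {
            return false; // unreachable, expired or misbehaving proxy
        }
    }
}
```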
Raw User Crawler Instance
The job of the raw user crawler instances is to obtain as many user IDs as possible. A user is raw when only the user ID of that Weibo user is known. This instance retrieves a list of raw users and a list of raw proxies from the host and uses them to get more raw users. Weibo.com allows a visitor to view the first 20 pages of a user's follower list and the first 50 pages of a user's fans list, which contain only the names and user IDs of those users. Together, a maximum of 20*20 + 50*20 = 1400 raw users can be obtained from one user ID. Duplicate users will no doubt be found, but they are ignored by the MariaDB database: inserting a duplicated ID into the table raises an error that is caught and handled by the server instance.
By modifying part of the code, this instance is also used to get the real users for the ground truth data. Given the 300 real raw users, and with the code that crawls fans disabled, the instance gets all followers of the 300 real users, returning to the server the list of about 25K real users described previously.
Account Detail Crawler Instance
The job of this instance is simpler. It takes the user ID of each raw user and obtains further user information such as registration time, fans/followers counts, gender, and so on. The result is called a detailed account; it is sent back to the server and stored in the MariaDB database. These user attributes will be selected and further processed.
Microblog Crawler Instance
This type of instance takes a detailed account and obtains the first 10 pages36, with 10 microblogs per page, for that account. A nested object containing all the account details plus the details of up to 100 microblogs is then sent back to the server and stored in MongoDB.
3.3.1.3 Hardware Environment
A maximum of 34 computers and 1 Amazon EC2 instance were used to gather the evaluation data.
DELL XPS15 L502x laptop
Quantity 1
Specification 4 Core CPU
8 GB RAM
NVidia GeForce 460M graphic card
36 Due to time and space limits, 10 pages were chosen.
120 GB SSD
Usage Running Server Instance
Pre-processing collected data
Note The original specification of this laptop did not include an SSD; an SSD was bought solely for this project's data storage, to gain faster read/write speeds for the databases.
Beowulf Cluster Nodes in MACS, Heriot-Watt University
Quantity 33
Specification 8 Core CPU
12 GB RAM
NVidia GeForce 520 graphic card
Usage Running different Slave Instances
Amazon EC2 instance
Quantity 1
Specification 1 Core CPU
1 GB RAM
EBS37 only
Usage Running Proxy Validating Website, implemented in PHP
37 Amazon Elastic Block Store, https://aws.amazon.com/ebs/
3.3.1.4 Overall Design & Implementation
Figure 8 explains the overall structure of the data crawling system for Weibo.com. The master instance running on the home server first retrieves raw proxies from TKdaili.com and distributes them to the slave instances. The slave instances validate the proxies against the PHP server I set up on the Amazon EC2 VPS and save the validated proxies into a buffer for further use. Depending on its type, each slave instance requests the corresponding task information from the master instance, connects through the proxy servers to retrieve data from Weibo.com, and returns the results to the master once enough completed task results have accumulated in its buffer. Finally, the master instance saves the results into the databases.
Figure 8 The overall implementation of the data crawling system
Each slave instance runs about 80 threads concurrently: 20 threads for proxy validation against the Amazon EC2 server and 60 threads for obtaining data from Weibo.com through proxies. These numbers were tuned manually for maximum performance. With fewer proxy-validation threads, the 60 crawling threads run out of valid proxies to connect to Weibo, because the proxies have a very short life span; with more threads connecting to Weibo, the instance becomes unstable, because too many socket connections are open.38
Considering that the instances request data from the mobile host of Weibo.com, the above setup is equivalent to 33*60 = 1980 sequential programs connecting to Weibo simultaneously, which is why I was able to collect a huge dataset for this project in a short time.
Figure 9 shows the steps in which account information is gathered. First, for the ground truth data, the set of real user accounts was obtained by running a raw user crawler instance on the 300 initial real raw users, and the set of zombie accounts was simply collected by a sequential C# program. The evaluation data, whose account identities are unknown, were obtained by a raw user crawler instance given 9 random raw users and 1 selected user.39 Afterwards, all raw users were fed to the account detail crawler instances, which obtained the detailed account information, followed by the microblog crawler instances, which collected the most recent 100 microblogs of each user.
Figure 9 The data flow of account gathering: raw users (from manual gathering with a simple C# program and from the raw user crawler instance) are saved to MariaDB; the account detail crawler instance enriches them into account details (MariaDB); the microblog crawler instance then produces account details with microblog details (MongoDB).
38 The Java HttpClient package may open multiple socket connections for each HTTP request, and with many threads doing this the process eventually runs out of file descriptors, since it fails to close these connections immediately: under Linux, each socket connection has to pass through two states (TIME_WAIT and CLOSE_WAIT) before closing. Although I implemented several mitigations, I did not find a perfect solution to this.
39 This selected user is called Sina Weibo Helper; every newly registered account on Weibo has it as a fan, and this fan cannot be deleted. Using this account as an initial raw user and crawling its follower list therefore always yields the most recently registered users.
3.3.1.5 Conclusion of Data Gathering
3.3.1.5.1 Gathered Data
It took about 40 days to gather all the data needed for this project. In total, 35.58 GB of data was collected: 17.2 GB stored in the MariaDB database and 18.4 GB stored in the MongoDB database.
In detail, 26,313 real accounts with their details and most recent 100 microblogs were collected, along with 21,740 zombie accounts with their details and most recent 100 microblogs. Moreover, 897,343 accounts with their details and most recent 100 microblogs were collected for evaluation purposes. I will call these datasets "Good" because they have all the information needed for this project.
What is more, 26,240,000 raw accounts were collected, account details were obtained for 24,543,720 of them, and the set of 897,343 accounts randomly selected from these was crawled for blogs.
Nevertheless, due to a software bug40, 1,536,193 accounts were collected with details but only 0 to 100 recent microblogs. In addition, due to the same bug, 20,536 rows of data, consisting of 4,262 real-account details and 16,274 zombie-account details with incomplete blog information, were collected; this is essentially the same set of users as the "Good" dataset but with improperly crawled microblog information. Although these data seem useless, since the information is insufficient, I would still like to see how the classifiers perform on them. In later sections these data are referred to as "Incomplete".
40 Because I did not handle the case where Weibo.com actually treated a proxy as a DDoS attacker and blocked its IP, the microblog crawler instances sometimes returned their results to the server earlier than expected. The problem was found at a late stage of the project, which is why there is less valid data than bugged data.
3.3.1.5.2 Cost
Gathering a dataset with millions of rows requires more than time and a good program. In order to obtain the data efficiently and store it reliably, money was spent in the areas listed below.
Spent on  Amount (in Chinese Yuan)  In GBP
25,000,000 Proxies ¥800 £87.21
34,500 Zombie accounts ¥1053 £114.80
40 days of Amazon EC2 Instance £8.82
Sum £210.83
3.3.2 Data Pre-processing
Before feeding the datasets into classifiers and tuning for the best results, it is important to study and understand the collected data, and to select or create useful features for further learning.
3.3.2.1 Feature study & extraction of Collected Data
Fields of Account Details
Table 1 below lists each field of the account detail data and its data type, and states whether it is selected for further data mining. These fields are crawled directly from Weibo.com. The selection and extraction are based on 3 rules: 1. whether the attribute provides information about the activity of the account; 2. whether the attribute is easily obtained and analysed; 3. whether the attribute can provide information for studying zombie accounts.
Feature name  Data Type  Detail  Selected  Select reason
uid  Long  A unique large integer representing the user ID of the account on Weibo.com. This field is possibly incremental.  Yes  It is possible that zombie accounts are registered sequentially in large numbers by a program, so uid can potentially be a good source for detecting them.
Gender  Nominal  0 if the account is female, 1 otherwise
Yes It is possible that zombie
accounts are more likely to be
male, since it is the default
option when registering and may
be kept.
Name  String  The unique string that represents the user  Processed  Will be pre-processed before use
att_num  Integer  The number of followers of the user  Yes  Basically all previous research used this attribute, because it gives information about the activity of the account.
fans_num  Integer  The number of fans of the user  Yes  Same as att_num
avatar_img  String  The URL string of the user's thumbnail icon  Processed  Will be converted into a Boolean value: true if it is the default image, false if customised by the user. This also gives information about user activity.
background  String  The URL string of the background image of the user's homepage  Processed  Same as avatar_img
blog_num  Integer  The number of microblogs this user has posted  Yes  User activity
create_time  Integer  Linux timestamp of when this user registered on Weibo.com  Yes  The registration time can potentially be useful given enough training data, if zombie accounts are registered massively by a program.
description  String  The self-description of the user  No  This field could be very useful for distinguishing the user's identity; however, since most descriptions are in Chinese, there is no simple way to process them by machine, which is out of the scope of this project.
member_type  Integer  A number indicating the type of the account; can be one of [0, 2, 11, 12, 13, 14]  Yes  Although the meaning of this field is unclear, it is a potentially distinctive feature. To be analysed in the next section.
native_place  String  The name, in Chinese, of the city where the user is located.  Yes  Considering the research of (Deng, et al., 2015), location-based classification can potentially be useful, regardless of its unsound methodology.
verified  Boolean  Whether this user is verified for the name it uses. E.g. if Obama opens a Weibo.com account and is verified to be him according to some procedure, then this field is true.  Yes  A verified user is more likely to be a real user. However, only a very small fraction of users are verified by Weibo.com.
v_type  Integer  A number representing the type of verification: a personal account, an organisation or an authority. Can be one of [-1, 0, 2, 3, 4, 5, 6, 7, 10, 200, 220]
Yes Same as verified.
Table 1
Fields of Microblogs
As described in the previous section, the most recent 100 microblogs of each selected account were obtained as well; Table 2 below exhibits the fields of a microblog in detail.
Feature name  Data Type  Detail  Selected  Select reason
postid  Long  A unique large integer representing the blog ID of this microblog on Weibo.com.  No  This field does not give any useful information.
timestamp  Integer  Linux timestamp of when this microblog was posted  Processed  According to (Zhang & Vern, 2011), blog timestamps are very useful for detecting automation. However, the timestamp of a single post does not give much information, so it is processed together with the other posts made by the same user for further analysis.
repost_count  Integer  The number of times this microblog was forwarded or reposted by other users  Processed  This figure gives information about user interaction, because the posts of real users are more likely to be reposted.
comments_count  Integer  The count of comments on this microblog by any users  Processed  Same as repost_count
att_count  Integer  The count of "Likes" of this microblog  Processed  Same as repost_count
picture_count  Integer  The number of pictures in this microblog  No  This field should be useful, because zombie accounts may behave differently when posting, e.g. making fewer image posts. However, this was not considered when implementing the gathering system, and the data was not saved.
Is_retweet  Boolean  Whether this microblog is a repost of another microblog from another user  Processed  The repost probability is calculated for every account; a zombie account may repost at a certain random rate.
text  String  The text content of this microblog  No  Since the majority of posts are in Chinese, there is no easy way of analysing them.
Table 2
The 3 fields marked "Processed" in Table 1 and all selected fields from Table 2 are processed further before being used in the data mining stage; more information can be extracted from them with proper algorithms or analysis. Any time-consuming method for analysing the name field is not an option, since we are dealing with large-scale data, so I created a simple processing method that generates the following new Boolean fields from the name:
Field Name  Detail
name_has_character  If the name contains Chinese characters
name_has_letter  If the name contains English letters
name_has_number  If the name contains digits
name_has_rare_char  If the name contains rare Chinese characters outside the range of the 3500 common Chinese characters
name_has_symbol  If the name contains anything other than digits, letters or Chinese characters
name_is_mixture  If the name is a combination of more than one of: Chinese characters, digits, letters or symbols
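A sketch of these checks in Java is shown below (method names are mine; the CJK Unified Ideographs block approximates "Chinese character", and the 3500-common-character test is omitted because it needs the actual character list):

```java
// Sketch of the name-derived Boolean features from the table above.
final class NameFeatures {
    static boolean hasChinese(String name) {   // name_has_character (approximate)
        return name.codePoints().anyMatch(c -> c >= 0x4E00 && c <= 0x9FFF);
    }
    static boolean hasLetter(String name) {    // name_has_letter
        return name.chars().anyMatch(c -> (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
    }
    static boolean hasNumber(String name) {    // name_has_number
        return name.chars().anyMatch(c -> c >= '0' && c <= '9');
    }
    static boolean hasSymbol(String name) {    // name_has_symbol
        return name.codePoints().anyMatch(c ->
            !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'z')
            && !(c >= 'A' && c <= 'Z') && !(c >= 0x4E00 && c <= 0x9FFF));
    }
    static boolean isMixture(String name) {    // name_is_mixture
        int kinds = (hasChinese(name) ? 1 : 0) + (hasLetter(name) ? 1 : 0)
                  + (hasNumber(name) ? 1 : 0) + (hasSymbol(name) ? 1 : 0);
        return kinds > 1;
    }
}
```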
Pre-processing on selected fields
As described in Table 1, fields such as avatar_img and background are processed into Booleans indicating whether they hold the default value. Two new fields, is_default_avatar_img and is_default_background, take their place and are used in further analysis.
The pre-processing of the microblog fields is more statistical. For each account, the repost_count, comments_count, att_count and is_retweet fields of all of the most recent 100 microblogs are summed and their mean values computed, generating avr_blog_repost, avr_blog_comment, avr_blog_att and avr_blog_is_retweet for each account.
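As a sketch (method and class names are mine), the averaging step amounts to:

```java
// Sketch of the per-account averages over the most recent (up to 100) microblogs.
// mean() serves avr_blog_repost / avr_blog_comment / avr_blog_att;
// retweetRate() serves avr_blog_is_retweet (the fraction of posts that are reposts).
final class BlogAverages {
    static double mean(int[] counts) {
        if (counts.length == 0) return 0;
        double sum = 0;
        for (int c : counts) sum += c;
        return sum / counts.length;
    }
    static double retweetRate(boolean[] isRetweet) {
        if (isRetweet.length == 0) return 0;
        int reposts = 0;
        for (boolean b : isRetweet) if (b) reposts++;
        return (double) reposts / isRetweet.length;
    }
}
```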
The automation detection method of (Zhang & Vern, 2011) transforms each timestamp into the second of the minute and the minute of the hour, and uses Pearson's χ2 test to detect the uniformity of the blogs' timestamps. Adapting this idea in a simplified form, the timestamps of the microblogs are converted into the minute of the day (for example, a microblog posted at 3:13 am is posted in the 3*60+13 = 193rd minute of that day). With all these minutes of the day, the test statistic is computed and Pearson's χ2 tests with a bin count of 240 (24 if the user has fewer than 24 microblogs) are carried out at significance levels of 0.1, 0.05 and 0.025, generating 3 new fields respectively (BT = blog time): BT_chisquretest_010, BT_chisquretest_005 and BT_chisquretest_0025. These fields are intended to show how uniformly the timestamps are distributed through the day at different significance levels. The expectation is that, in the long run, automated zombie accounts will have a more uniform distribution of posting times, whereas human beings, who need to rest and work, post within a limited range of times. In addition, the mean, median and variance of the minute of the day are also calculated, for the same reason, generating 3 new fields: BT_mean, BT_median and BT_variance.
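The χ2 test can be sketched as follows. For simplicity the sketch uses 24 bins (the small-sample case from the text) and hard-codes the 0.05 critical value for 23 degrees of freedom, about 35.17; the 240-bin case and the other significance levels would use their own critical values. Class and method names are mine.

```java
// Sketch of the minute-of-day uniformity check: a Pearson chi-square
// goodness-of-fit test of the posting minutes against a uniform distribution.
final class BlogTimeTest {
    static final int BINS = 24;
    static final double CHI2_CRITICAL_005_DF23 = 35.17; // 0.95 quantile, df = 23

    /** minuteOfDay values in [0, 1440); true if uniformity is rejected at 0.05. */
    static boolean rejectsUniformity(int[] minuteOfDay) {
        int[] observed = new int[BINS];
        for (int m : minuteOfDay) {
            observed[m * BINS / 1440]++;   // 60-minute bins
        }
        double expected = (double) minuteOfDay.length / BINS;
        double chi2 = 0;
        for (int o : observed) {
            chi2 += (o - expected) * (o - expected) / expected;
        }
        return chi2 > CHI2_CRITICAL_005_DF23;
    }
}
```

An account that posts at the same minute every day is rejected immediately, while posts spread evenly over the day are not; the BT_chisquretest_* fields record this kind of outcome.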
What is more, as an addition to (Zhang & Vern, 2011), I consider that the interval between posts should also provide valuable information: I expect a program to control these accounts in a loop, and if that is the case, the post time interval can show it clearly. As a result, I created 3 more fields: BT_I_mean, BT_I_median and BT_I_variance. This idea is analysed in the data mining section.
Last but not least, a single field, ff_ratio, is computed for each account as its followers/fans ratio, simply because the initial analysis of zombie accounts suggested that these accounts tend to have similar followers/fans ratios.
In conclusion, pre-processing created new fields and discarded unnecessary ones; together, the final dataset for experiments and evaluation has 35 fields including the class field, as listed in Appendix C. The 14 original fields are shown in Table 1 and the 22 analytical fields created by pre-processing are listed in Appendix B.
3.3.2.2 Creating different data set
After pre-processing the data according to the above methods, two MySQL tables of ground truth data, one for zombie accounts and one for real accounts, were created, each with an extra field called accountClass for classification, whose value is 0 (real user) or 1 (zombie). The two tables were then merged and randomised, creating the final training table.
On the other hand, 897,343 rows of random account details with recent-100-post information were processed in the same way, and 1,583,135 random rows without properly crawled post details were also processed for testing purposes. A brief summary of the processed data is shown in the following table.
Dataset Number of rows
Good training set 31705
Good testing set 16333
Good evaluation set 897343
Incomplete testing set 50956
Incomplete evaluation set 1583135
Table 3
4 Data Mining & Classifier Evaluating
In this section the machine learning tool WEKA is used. WEKA is open-source software: a collection of machine learning algorithms for data mining tasks, with many useful tools for data pre-processing, classification, regression, clustering and visualisation41.
In this section, the training dataset is first visualised and patterns in the fields are analysed. Characteristics of zombie accounts are concluded from these patterns, which hint at the manner in which the zombie accounts are created. Moreover, the different purposes of zombie accounts may also be discernible from the data.
41 http://www.cs.waikato.ac.nz/ml/weka/
After the initial analysis, classifiers using different learning algorithms, such as naive Bayes, SVM and decision trees, are trained on the data. These classifiers are then tested on the training, testing and evaluation datasets respectively.
Finally, a combination of classifiers using a voting algorithm is tested and evaluated in pursuit of better performance.
4.1 Initial analysis of features with visualisation
Using the visualisation functionality of WEKA, 35 histograms (one per field) plus 35*35 = 1225 plots of field pairs were created. The following graphs are particularly interesting.
Figure 10 The distribution of the field of native_place. X axis is the different place in Chinese, and Y axis indicates the class (0=real user, 1=zombie)
It can be seen from Figure 10 that the counts of real users from different places are not balanced, which is expected, since the user population differs from place to place. The distribution of zombie accounts, by contrast, is nearly uniform; it is a safe guess that the programs that registered the zombie accounts did not consider the distribution of the user population and used a uniformly distributed random place. This picture indicates that the native place is useful to some extent, as claimed in (Deng, et al., 2015). Deng et al. use a hard-coded threshold on the number of followers sharing the native_place field with the user being classified; in this project, however, location information is obtained for that user only, since their methodology is still in doubt.
Figure 11 The stacked histogram of the field create_time; the X-axis is in Unix timestamps. The blue part is real users, the red part is zombie accounts, and the height is the total number of users registered during that period of time.
As shown in Figure 11, the create_time field of the training data ranges from Friday 14 August 2009, 20:49:13 GMT+8 to Friday 8 April 2016, 20:15:44. The number of registered real users had its first peak at the opening of Weibo.com, followed by another peak one year later, with new registrations decreasing ever since. The zombie accounts, on the other hand, started in very small numbers and kept growing. Moreover, their registrations reached two abnormal peaks in July 2012 and September 2012, when Weibo was considered the most popular website in China. These two peaks support the hypothesis that many zombie accounts are registered in batches by programs within short periods of time.
Figure 12 The histogram of BT_mean, the mean blogging time as a minute of the day. Note that times in the database are saved as Unix timestamps in the GMT time zone, so non-zero values should have 8*60 = 480 minutes added to obtain the correct figure for Chinese users.
Figure 12 shows the distribution of the mean time at which users write new microblogs each day. Apart from the users who post no microblogs (whose BT_mean is 0, shown as the little peak at the origin), the distribution of the mean time of real users exhibits a healthy normal distribution, whereas that of zombie accounts shows an unexpected single peak. Hypothetically, this peak is caused by massive numbers of zombie accounts that post at exactly the same time, generating the same mean time in this graph.
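The BT_mean computation can be sketched as follows. This is an illustrative reconstruction, not the project's actual pre-processing code; the function names and the stdlib-only style are my own assumptions:

```python
# Illustrative sketch (not the project's actual code): computing BT_mean, the
# mean posting time as a minute of the day, from Unix timestamps stored in GMT.
# The 480-minute shift (8 * 60) converts GMT to Chinese time (GMT+8).

def minute_of_day(ts_gmt, tz_offset_min=480):
    """Convert a Unix timestamp (seconds, GMT) to the minute of the day in GMT+8."""
    return (ts_gmt // 60 + tz_offset_min) % (24 * 60)

def bt_mean(timestamps):
    """Mean posting minute over a user's microblogs; 0 for users who never post."""
    if not timestamps:
        return 0
    minutes = [minute_of_day(t) for t in timestamps]
    return sum(minutes) / len(minutes)
```

Users with no microblogs fall back to 0, which is exactly why the small peak at the origin appears in the histogram.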
Figure 13 The partial plot (points with ff_ratio above 200 are not shown) of create_time as the X-axis and ff_ratio (follower/fan ratio) as the Y-axis.
Previous work such as (Deng, et al., 2015) and (Jiang, et al., 2015) uses manually hard-coded thresholds on the follower/fan ratio, and Figure 13 to some extent justifies their methodology. Zombie accounts show a wide range of follower/fan ratios, from 0 to more than 200, whereas real users are mostly below 5 to 10. Their problem was the lack of a sound set of ground truth data, so their manually chosen threshold values are less sound. The better practice is to let the classification algorithms decide the best line separating zombie accounts from real users.
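The idea of letting an algorithm choose the split, rather than hand-picking a ratio, can be illustrated with a one-feature "decision stump". This is a sketch with made-up data, not the project's code or WEKA's implementation:

```python
# Illustrative decision stump: pick the ff_ratio threshold that minimises
# misclassifications on labelled data, instead of hard-coding one by hand.
# Data and names here are hypothetical.

def best_threshold(ratios, labels):
    """Return (threshold, accuracy): classify ff_ratio > threshold as zombie (1)."""
    best = (0.0, 0.0)
    for t in sorted(set(ratios)):
        correct = sum(1 for r, y in zip(ratios, labels) if (r > t) == (y == 1))
        acc = correct / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best
```

On toy data such as ratios [1, 2, 3, 50, 100, 200] with labels [0, 0, 0, 1, 1, 1], the stump finds a perfectly separating threshold on its own.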
In addition to Figure 11, Figure 13 reveals the following behaviour of the zombie accounts. There are several periods of time on the graph during which the registered zombie accounts have significantly higher ff_ratio, and most zombie accounts have a higher ff_ratio than real users. This implies that the purpose of these accounts is exactly the service offered by the sellers on Taobao.com: zombie fans that follow targeted accounts.
Figure 14 The plot using avr_blog_is_retweet as the X-axis and ff_ratio as the Y-axis (a few points with ff_ratio above 125 are not shown).
One of the plots giving the most separable data is Figure 14. This graph not only indicates a distinct difference between zombie accounts and normal accounts, but also exhibits the basic behaviour pattern of zombie accounts. To generalise from this graph: as shown on its left side, where most red points are plotted, accounts with a low avr_blog_is_retweet (the percentage of microblogs that are simply reposts of other users) and an ff_ratio above a certain amount have a very high probability of being zombies. Moreover, there is an obvious pattern of vertical straight lines on the left side of the plot of zombie accounts, formed by many different accounts sharing exactly the same retweet ratio, which implies that these accounts are controlled by programs. Real users, by contrast, have randomly distributed retweet ratios. I randomly visited a few of the zombie accounts shown on the left side of the graph and, unsurprisingly, found that they had posted many microblogs such as advertisements and chicken-soup-for-the-soul-styled text, trying to mimic human behaviour. This method of imitation, however, can be one of the biggest flaws in hiding from machine learning.
In short, the initial analysis of the data shows some significant differences between zombie accounts and real users, indicating that the classes of the ground truth data obtained are potentially separable.
4.2 Base Line
The baseline correct rate is 54.7%, because the ground truth data contains 54.7% real users. The training set and testing set are both subsets of the ground truth data and so share the same baseline. This does not apply to the evaluation data, as all of its instances are randomly crawled from Weibo.com.
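The baseline is simply the majority-class rate, which any learned classifier must beat. A minimal sketch (illustrative, not the project's code):

```python
# Majority-class baseline: always predicting the most common class yields
# an accuracy equal to that class's share of the data.
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)
```

With 547 real users and 453 zombies per 1000 labelled accounts, this gives the 54.7% figure above.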
4.3 Single Classifier Experiments
In this subsection, the 4 types of classifiers described in the literature review are trained on the training data set using 10-fold cross validation: the training data is split into 10 parts, and each part is evaluated against the classifier trained on the other 9 parts. The generated classifier is then tested on the testing data set. If the accuracy is good, the classifier is used on the evaluation set, where the class of each account is unknown.
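The 10-fold split can be sketched as a pure-Python index partition (illustrative; WEKA performs this internally, and shuffling/stratification are omitted here):

```python
# k-fold cross-validation index split: each fold serves once as the held-out
# test part while the remaining folds form the training part.

def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs covering all n instances exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Every instance appears in exactly one test fold, so the k accuracy figures can be averaged into a single cross-validated estimate.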
4.3.1 Naive Bayes
4.3.1.1 Using all fields
Firstly, the naive Bayes classifier is trained using all available fields of the training set and tested on the training set itself. The confusion matrix shows an unpromising result:
                    Classified
                    Real      Zombie
Expected  Real      6364      10976
          Zombie    80        14285
Table 4
Many real users are classified as zombie accounts, giving a high number of false positives, while very few zombie accounts are classified as real accounts. Overall, this classifier correctly classifies 65.1% of instances. Considering this is only about 10% above the baseline, and that the result comes from evaluating the classifier on the training set itself, more tuning has to be done for the naive Bayes classifier.
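The 65.1% figure can be reproduced from the counts in Table 4 (a small illustrative calculation, not part of the project's code):

```python
# Accuracy from a 2x2 confusion matrix:
# cm = [[real->real, real->zombie], [zombie->real, zombie->zombie]]

def accuracy(cm):
    """Fraction of correctly classified instances (diagonal over total)."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

table4 = [[6364, 10976], [80, 14285]]
round(accuracy(table4) * 100, 1)  # 65.1
```

The same helper reproduces the accuracy figures quoted for the later confusion matrices as well.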
4.3.1.2 Feature Selection
Using all fields of the training set is unsuitable for the naive Bayes classifier if not all of them provide useful information; therefore a forward feature selection algorithm is used. In this project, forward selection starts with 0 fields and adds fields one by one according to the increase in accuracy. A best-first search strategy is used: the field giving the best increase in accuracy is added first, until there is no further increase in correctness.
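The greedy loop just described can be sketched as follows. This is illustrative, not WEKA's AttributeSelection code; `score` stands for any evaluator (e.g. cross-validated accuracy of the classifier on the candidate subset), and all names are hypothetical:

```python
# Greedy forward feature selection: repeatedly add the single feature that
# most improves the score, stopping when no remaining feature helps.

def forward_select(features, score):
    """Return the selected feature subset, in the order features were added."""
    selected = []
    best = score([])
    while True:
        gains = [(score(selected + [f]), f) for f in features if f not in selected]
        if not gains:
            break
        top_score, top_feat = max(gains)
        if top_score <= best:
            break  # no further increase in correctness
        selected.append(top_feat)
        best = top_score
    return selected
```

Keeping `score` as a parameter makes the sketch classifier-agnostic, mirroring how WEKA wraps any base classifier in its attribute-selection search.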
The forward feature selection yields 9 fields plus the class field:
BT_I_median, BT_median, ff_ratio, is_default_background, member_type,
name_has_charactor, name_has_letter, uid, v_type, accountClass
4.3.1.3 Naive Bayes with selected features

Training set:
                    Classified
                    Real      Zombie
Expected  Real      15831     1509
          Zombie    3317      11048

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      8152      821
          Zombie    1686      5674

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    505874    391469

Table 5
Using these 9 fields, the naive Bayes classifier evaluated on the training set itself generates the first confusion matrix in Table 5. An overall accuracy of 84.78% is achieved, though more zombie accounts are now classified as real users, leading to a higher false negative rate.
The performance of the trained classifier on the testing set is very similar, with an almost identical accuracy of 84.65%, as shown in the second confusion matrix of Table 5. Since the testing set is half the size of the training set, these figures are as expected, implying that the trained classifier is not overfitting. The problem with the testing set result is the many false negatives: 1686 of 7360 (23%) zombie accounts are classified as real accounts.
Finally, I apply this trained classifier to the evaluation dataset, where the identities of the accounts are unknown and all instances are therefore labelled as zombies by default before classification. With this classifier model, 56.4% of the 897343 accounts are classified as real users and 43.6% as zombie accounts.
4.3.2 Decision Tree Classifier
WEKA provides a decision tree implementation named J48, which uses the selected attributes to build a decision tree classifier model from the given data. With the same training, testing and evaluation data as naive Bayes, the following confusion matrices are generated:
Training set:
                    Classified
                    Real      Zombie
Expected  Real      15911     1429
          Zombie    1629      12736

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      8173      800
          Zombie    815       6454

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    519708    377635

Table 6
As shown in Table 6, the decision tree classifier performs better on the same data sets. It has an accuracy of 90.35% in predicting account identity on the training set and 90.11% on the testing set; 58% of the evaluation data are classified as real users and 42% as zombie accounts.
The decision tree generated using all 35 attributes has 428 leaves and a total size of 855 nodes, which is highly likely to be overfitting. Although it is commonly said that decision tree algorithms perform better with more features, I still wondered how J48 performs with fewer. Using the feature selection algorithm with J48, the following subset of features is selected, excluding accountClass:
BT_I_Variance, avr_blog_comment, avr_blog_is_retweet, avr_blog_like, blog_num,
create_time, fans_num, ff_ratio, member_type, v_type
Using these selected features, the same algorithm is executed again, and the following confusion matrices are generated:
Training set:
                    Classified
                    Real      Zombie
Expected  Real      16618     722
          Zombie    980       13385

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      8235      738
          Zombie    798       6562

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    537759    359584

Table 7
The performance of the decision tree algorithm with feature selection is higher than with all features: 94.6% accuracy on the training set and 90.6% on the testing set, with 60% of the evaluation data classified as real users.
4.3.3 Support Vector Machine
The Support Vector Machine classifier is implemented in WEKA as SMO. Because SMO is very slow to train compared to the other classifiers, the feature selection algorithm is not applied to it, being too time-consuming (after more than 30 hours, model building had not completed). All 35 attributes are used as features. The SMO implementation is configured to use logistic regression as a calibrator, and its results on the different datasets are shown in Table 8.
Training set:
                    Classified
                    Real      Zombie
Expected  Real      15481     1859
          Zombie    2420      11945

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      7968      1005
          Zombie    1279      6081

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    466583    430760

Table 8
The performance of SVM is comparable to naive Bayes: approximately 86.5% accuracy on the training set and 86.0% on the testing set. The evaluation result differs, however: only about 52% are classified as real users and 48% as zombies.
4.3.4 Multi-Layered Perceptron
Using WEKA’s default implementation of MLP, with a learning rate of 0.3 and a momentum of 0.2, the classifier is trained on the training dataset. The following confusion matrices are produced:
Training set:
                    Classified
                    Real      Zombie
Expected  Real      16860     480
          Zombie    4380      9985

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      8629      344
          Zombie    2392      4968

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    715332    182011

Table 9
The accuracy of the MLP model is slightly lower than the other models': 84.6% on the training dataset itself and 83.25% on the testing dataset. MLP tends to give far fewer false positives but far more false negatives. As shown in the last matrix of Table 9, the majority of the evaluation set (79.7%) is classified as real users, leaving only 20.3% zombie accounts.
No feature selection method is used for the MLP classifier, because the backpropagation algorithm will train the weights of unrelated features, driving them close to 0 to reduce error. Moreover, feature selection with an MLP trained by backpropagation is time-consuming and therefore not attempted in this project.
4.4 Meta Classifiers
In this subsection, I list some meta-classifiers and use them with the previous classifiers to obtain either higher accuracy in training or less overfitting in testing and evaluation; implementation details are omitted.
4.4.1 Boosting
Boosting is an ensemble method that uses a single classifier as a base and generates a second classifier focused on the data instances misclassified by the first one. Boosting repeats this process until a set number of iterations or a specified accuracy is reached.
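The "focus on misclassified instances" step can be illustrated with an AdaBoost-style reweighting round. This is a sketch of the idea, not WEKA's AdaBoostM1 code, and it assumes the round's error rate lies strictly between 0 and 0.5:

```python
# One AdaBoost-style reweighting round: misclassified instances get their
# weights increased so the next base classifier concentrates on them.
import math

def adaboost_reweight(weights, correct, error):
    """weights: current instance weights; correct: per-instance bool;
    error: weighted error of this round (assumed 0 < error < 0.5)."""
    alpha = 0.5 * math.log((1 - error) / error)
    new = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]  # renormalise to sum to 1
```

With four equally weighted instances and one mistake (error 0.25), the single misclassified instance ends up carrying half of the total weight.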
4.4.2 Bagging
Bagging (Bootstrap Aggregating) is another ensemble method that draws N bootstrap resamples of the training dataset (sampling with replacement) and trains one classifier on each. The results of the N classifiers are then combined, by mean value or voting, to give the final classification.
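The bootstrap resampling step can be sketched as follows (illustrative, not WEKA's Bagging implementation; the seeding is my own choice to make the sketch reproducible):

```python
# Bootstrap resampling: each of the N base classifiers is trained on a
# same-sized sample drawn from the training data WITH replacement.
import random

def bootstrap_samples(data, n_models, seed=0):
    """Draw n_models bootstrap resamples, each the size of the original data."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_models)]
```

Because sampling is with replacement, each resample typically omits about a third of the original instances and repeats others, which is what decorrelates the base classifiers.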
4.4.3 Voting
Voting is a technique that applies multiple different classifiers to one classification problem. Each data instance is classified by all of them, and the instance is assigned the majority class among their predictions. This is the combination method I use in this project.
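The majority-vote combination rule is simple enough to state in a few lines (an illustrative sketch; WEKA's Vote meta-classifier also supports other combination rules):

```python
# Majority vote: each trained classifier contributes one predicted class,
# and the most common prediction wins.
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted class per classifier, e.g. [0, 1, 1]."""
    return Counter(predictions).most_common(1)[0][0]
```

With an odd number of voters, as in the DT/NB/SVM combination below, ties between two classes cannot occur.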
4.4.4 Results
Different meta-classifiers are tried on all the previous classifiers, and increases in performance are observed in most runs. The following table shows how the different meta-classifiers perform on my datasets.
Classifier          Number of features  Meta-classifier  Accuracy (training)  Accuracy (testing)
Decision Tree       35                  Bagging          96.77%               91.39%
SVM                 35                  Bagging          86.46%               86.09%
Naive Bayes         17                  Boosting         85.6%                85.3%
MLP                 35                  Boosting         86.12%               81.86%
DT, NB, MLP, SVM    35                  Voting           90.50%               88.7%
DT, NB, SVM         35                  Voting           90.22%               88.6%
Table 10
Since finding the best classifier for a problem is always an empirical and experimental exercise, I applied different meta-classifiers on top of the 4 simple classifiers. As shown in Table 10, the decision tree has the best performance among the single classifiers on both training and testing data, whereas MLP performs worst on the testing data. Because of this, I trained two voting meta-classifiers: one using all 4 single classifiers as voters, and one the same but without MLP. The results show no strong evidence of an advantage gained by taking votes from different classifiers.
4.5 Evaluating using incomplete data
As described in the Data Gathering subsection, 20536 rows of data with incomplete blog information were crawled. In big data analysis this type of data is to be expected, so I kept it and ran the different classifiers on it as a way of evaluating the methodology.
Classifier          Meta-classifier  Accuracy on incomplete data
Decision Tree       Bagging          86.6%
Naive Bayes         Boosting         89.4%
MLP                 Boosting         72.64%
SVM                 Bagging          88.97%
DT, NB, MLP, SVM    Voting           90.72%
DT, NB, SVM         Voting           90.91%
Table 11
As can be seen from Table 11, the accuracies of the different classifiers on the incomplete data are still very high, except for MLP, which reaches only about 72%. The decision tree classifier, which outperformed all the others on the good data, performs worse on the incomplete data, while the naive Bayes classifier deals better with data carrying less information.
What is surprising is that, despite showing no better performance on the training and testing data, the voting-based meta-classifier models give significantly better accuracy on the incomplete dataset. The one using all 4 classifiers performs slightly worse than the one without MLP (by around 0.2%), yet both achieve an accuracy above 90.7%, while no single classifier among DT, NB, MLP and SVM exceeds 89.4%. This is evidence that combining multiple classifiers reduces overfitting and increases overall classification strength.
4.6 The composition of Weibo Users
The classifiers are then finally used to classify the 897343 rows of evaluation data, where the classes of the accounts are unknown.
Classifier          Meta-classifier  Real users  Zombie accounts
Decision Tree       Bagging          56.8%       43.2%
Naive Bayes         Boosting         54.7%       45.3%
MLP                 Boosting         68.04%      31.96%
SVM                 Bagging          52.1%       47.9%
DT, NB, MLP, SVM    Voting           54.5%       45.5%
DT, NB, SVM         Voting           53.6%       46.4%
Table 12
Except for MLP, which is optimistic about the proportion of real users (68%), the trained classifier models put the ratio of real users between 52% and 57%. Considering that the maximum accuracy is approximately 90%, the true proportion of real users may vary from 47% (52 - 5%) to 63% (57 + 6%). Since these numbers are close to the confusion matrices generated by these classifiers on the training and testing data, it is reasonable to believe that quite a large portion of the users on Weibo (according to the experimental data, at least 37%) are zombie accounts. These figures can be inaccurate, due to the limitations explained in the next section; nevertheless, we should not underestimate the actual number of zombie accounts.
5 Conclusion & Discussion
5.1 Achievements
The objectives of this project, as listed in the Approach and Objectives subsection of the Introduction, are:
1. Implement a good distributed crawling system in order to obtain large data set from
Weibo.
2. Gather ground truth data of zombie and real accounts on Weibo for the soundness
of this study.
3. Find a good classifier or a combination of classifiers that can maximise the ability to
classify zombie accounts on Weibo using the data gathered by the crawling system.
4. Conduct relatively large scaled experiments to evaluate classifier, and therefore
evaluate the composition of Weibo accounts.
5. Analyse how zombie accounts have been influencing real humans.
The first four objectives are met, and the approaches to them have given promising results.
Firstly, a framework named JPipe has been implemented for Java pipelining; it is the first open-source object-oriented pipelining implementation. The framework is easy to use, allowing users to efficiently create applications that need concurrent work, especially patterns of task parallelism. A distributed crawling system, PipeCrawler, was then built on JPipe, enabling me to obtain millions of rows of data for this project. Furthermore, the JPipe framework and the PipeCrawler system are believed to be useful for future information retrieval research.
Secondly, ground truth data were obtained with proper methods. Zombie accounts were bought directly from 4 different sellers with unbiased sampling, and real accounts were crawled using manually identified real accounts as starting points, with a breadth-first search of depth 1.
Thirdly, using different tuning methods and meta-classifiers, the selected classifiers all performed well at classifying zombie accounts, on both the good and the incomplete datasets: a minimum accuracy of 86% was achieved, more than 20% above the baseline. What is more, a voting classifier built from the 4 selected simple classifiers averages 90% accuracy on any dataset. A single classifier such as the decision tree performs well on properly crawled data, but slightly worse than the naive Bayes classifier on improperly crawled data. Meta-classifiers such as boosting and bagging proved useful, as all 4 simple classifiers gained accuracy from them.
Fourthly, an experiment on the composition of the accounts on Weibo was carried out. About 0.9 million random account records were evaluated using the different classifiers. Based on the results of the different classifier models, the estimated proportion of real users varies from only 47% to 63%. This research shows that the proportion of zombie accounts on Weibo.com may be unexpectedly high, a ratio never recognised in any previous study.
However, due to the limitations of time and space, the final objective was not met, because studying how zombie accounts influence real humans is hard. Achieving it would require implementing and using algorithms and tools for Chinese text analysis, and potentially the complicated work of Chinese semantic analysis. This is another area of machine learning study and is out of the scope of this undergraduate project.
5.2 Limitation
This project has certain limitations that could be improved upon or dealt with:
1. The diversity of the ground truth zombie accounts is relatively low: the accounts were bought from only 4 of more than 60 sellers, and different sellers may have different methods of creating and training their zombie accounts. With more diverse accounts, classifier accuracy could be improved further. Moreover, only 300 initial points were used to crawl real user accounts, which is relatively small and requires a search algorithm to reach more real users; it is possible that a small fraction of non-real users is present in the supposed ground truth real user set. Although I had no time to identify real users manually, in the future it should be possible to obtain a reasonably large set of 100% real user data, improving the performance of the classifiers.
2. In addition to the previous limitation, the real accounts should actually be regarded as active users. There are many inactive users whom I was not able to identify manually, owing to the lack of information about users who do not post or change their profiles and only read others' posts. As a result, the ground truth real user data are more likely a collection of active users. The classifier is accurate in the sense of distinguishing active users from zombie accounts, whereas how inactive real users would be classified is unknown, and these details may never be learned.
3. The data pre-processing methods are limited, because neither the actual text of the blogs nor the self-descriptions of the users is used. In my manual work, the microblog text was in fact the key factor for distinguishing a real user from a zombie one. If more analysis were done on this text, a further improvement in classification accuracy could be made.
4. There is still room for improvement in the classifiers. This project used only 4 of the many available classifiers, without exhaustive optimisation. With more time, I believe an even better classifier model could be found.
5. This project only conducted experiments on million-sized datasets, whereas there are more than 0.5 billion users on Weibo.com. This limitation is imposed by resources such as the budget for buying proxies for data crawling, internet speed and time. With enough of these resources, larger-scale experiments with more ground truth data could be carried out, potentially allowing the problem of zombie accounts to be studied in depth.
5.3 Future Work
The future work of this project includes:
1. Implementing a proxy crawling system, so that PipeCrawler no longer requires paid proxies to obtain more data. Currently the crawling system uses an API from the proxy seller to obtain proxies. With a proxy crawling system, the amount of data PipeCrawler can obtain would no longer be limited by the budget for purchasing proxies, allowing larger-scale experiments.
2. Refurbishing the JPipe code to make it easier to use, more robust and more efficient. The way the pipelining pattern is implemented using JPipe can be further simplified and optimised. A more generalised implementation of JPipe would be good support for all kinds of information retrieval research.
3. Training different classifiers with larger sets of ground truth data, and evaluating more unknown accounts. Given good ground truth data, this project has achieved 90% accuracy in identifying zombie accounts.
4. Most importantly, any research based on retrieving information from social networks such as Twitter or Weibo needs a proper way of identifying zombie accounts in order to obtain authentic and sound results. With extensions of this project and classifiers of better accuracy, research such as sentiment analysis can be carried out on websites such as Weibo without being deceived by the fake information given by zombies.
6 Reference
Amleshwaram, A. A., Reddy, N., Yadav, S., Gu, G. & Yang, C., 2013. CATS: Characterizing Automation of Twitter Spammers. COMSNETS, pp. 1-10.
Benmokhtar, R. & Huet, B., 2006. Classifier Fusion: Combination Methods For Semantic
Indexing in Video Content. 16th International Conference, Athens, Greece, Volume II, pp. 65-
74.
Chang, Y., Wang, X., Mei, Q. & Liu, Y., 2013. Towards Twitter context summarization with user influence models. WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining, pp. 527-536.
Deng, J., Fu, L. & Yang, Y., 2015. ZLOC: Detection of Zombie Users in Online Social Networks.
WEB 2015 : The Third International Conference on Building and Exploring Web Based
Environments.
Jiang, H., Wang, Y. & Zhu, M., 2015. Discrimination of Zombie Fans on Weibo based on Features Extraction and Business-Driven Analysis. ICEC '15: Proceedings of the 17th International Conference on Electronic Commerce.
Kittler, J., Hatef, M., Duin, R. P. & Matas, J., 1998. On Combining Classifiers. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(3), pp. 226 - 239.
Opitz, D. & Maclin, R., 1999. Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research, Volume 11, pp. 169-198.
Sun, Q. et al., 2014. Modeling for User Interaction by Influence Transfer Effect in Online
Social Networks. 39th Annual IEEE Conference on Local Computer Networks, 8 Sept, pp. 486 -
489.
Tavares, G. & Faisal, A., 2013. Scaling-Laws of Human Broadcast Communication Enable Distinction between Human, Corporate and Robot Twitter Users. PLoS ONE, DOI: 10.1371/journal.pone.0065774.
The University of Waikato, 2016. Weka 3: Data Mining Software in Java. [Online]
Available at: http://www.cs.waikato.ac.nz/ml/weka/
[Accessed 15 April 2016].
Zhang, C. M. & Paxson, V., 2011. Detecting and Analyzing Automated Activity on Twitter. In: Passive and Active Measurement: 12th International Conference, PAM 2011. Berlin Heidelberg: Springer-Verlag, pp. 102-111.
7 Appendix A
Field List of Discovery Channel of Weibo.com
social, international, technology, science, digital, best comments, finance, stock market, star, variety, drama, movies, music, cars, sports, sports and fitness, health, weight loss, military, history, beautiful models, beauties, pets, emotion, quotations, jokes, rumour, chicken soup for the soul, religion, government, games, travel, childcare, education, food, real estate, home, sign, reading, agriculture, design, art, fashion, beauty, animation
A screenshot of the above fields on the Discovery Channel of Weibo.com
8 Appendix B
The following are the analytic fields created by pre-processing
Field Name Detail
name_has_character If the name contains Chinese characters
name_has_letter If the name contains English letters
name_has_number If the name contains numbers
name_has_rare_char If the name contains rare Chinese characters outside the range of the 3500 common Chinese characters
name_has_symbol If the name contains any string other than numbers, letters or Chinese characters
name_is_mixture If the name is a combination of more than one of: characters, numbers, letters or symbols
is_default_avatar_img If the avatar (thumbnail) image of the user is the default image given by Weibo.com
is_default_background If the background image of user’s home page is the default image given by Weibo.com
avr_blog_repost The average repost count of the latest 100 microblogs of the user.
avr_blog_comment The average comment count of the latest 100 microblogs of the user.
avr_blog_att The average “like” count of the latest 100 microblogs of the user.
avr_blog_is_retweet The proportion of the latest 100 microblogs of the user that are retweets of other users.
BT_chisquretest_010 Whether the blogging time (as the minute of the day) of the latest 100 microblogs passes Pearson’s χ2 test with significance of 0.10, 0.05 and 0.025
BT_chisquretest_005
BT_chisquretest_0025
BT_mean The mean time (as the minute of the day) the user post microblogs
BT_median The median time (as the minute of the day) the user post microblogs
BT_variance The variance of time (as the minute of the day) the user post microblogs
BT_I_mean The mean interval time between microblogs that the user posts.
BT_I_median The median interval time between microblogs that the user posts.
BT_I_variance The variance of the interval time between microblogs that the user posts.
ff_ratio The follower/fans ratio of the user
9 Appendix C
The following are the final fields used for data mining.
Attribute name Type
BT_I_Variance Float
BT_I_mean Float
BT_I_median Float
BT_chisquare_p Float
BT_chisquretest_0025 Boolean
BT_chisquretest_005 Boolean
BT_chisquretest_010 Boolean
BT_mean Float
BT_median Float
BT_variance Float
att_num Integer
avr_blog_att Float
avr_blog_comment Float
avr_blog_is_retweet Float
avr_blog_like Float
avr_blog_repost Float
blog_num Integer
create_time Integer
fans_num Integer
ff_ratio Float
gender Boolean
is_default_avatar_img Boolean
is_default_background Boolean
member_type Integer
name_has_charactor Integer
name_has_letter Integer
name_has_number Integer
name_has_rare_char Integer
name_has_symbol Integer
name_is_mixture Integer
native_place String
uid Integer
v_type Integer
verified Boolean
accountClass Boolean