Analysing Zombie accounts in Weibo
Final Year Dissertation
Yibo Liang BSc (Hons) Computer Science
H00194508
Supervised by
Dr Yun-Heh Jessica Chen-Burger
Second Reader
Dr Hamish Taylor
Heriot-Watt University, Edinburgh
School of Mathematical and Computer Sciences
2016
Declaration

I, Yibo Liang confirm that this work submitted for assessment is my own and is expressed in
my own words. Any uses made within it of the works of other authors in any form (e.g.
ideas, equations, figures, text, tables, programs) are properly acknowledged at any point of
their use. A list of the references employed is included.
Signed:
Date: 24 April 2016
Abstract

Information published on social networks has proven valuable for research in fields such as economics, politics, culture and social science. Weibo.com is one of the largest social media platforms in China, and many researchers have mined its data for various purposes, e.g. to understand citizens' opinions. Unfortunately, a large number of accounts on such websites are controlled by computer programs (robots) driven by specific agendas or malicious purposes such as advertising, spamming and public opinion manipulation. Such accounts are often referred to as zombie accounts.
Many articles have already discussed how to distinguish zombie accounts from genuine ones and how their behaviour differs, mostly on Twitter and rarely on the Chinese Weibo. However, these studies are rarely based on ground truth data or large datasets.
I therefore propose a study on detecting zombie accounts using machine learning algorithms based on ground truth data, testing and evaluating on a relatively large dataset, and thereby providing a better estimate of the number of zombie accounts on Weibo.com. Given the ground truth data, account features are studied and extracted, and further features are derived from the timestamps of microblog posts. Moreover, the behaviour of zombie accounts is analysed by visualising these features. My study shows that it is possible to detect even relatively well-programmed, human-like zombie accounts with relatively high accuracy if enough ground truth data is available.
Acknowledgements

First, I would like to thank my supervisor, Jessica Chen-Burger, who guided me into the field of research. Having started this project as a student and finished it as a researcher, I thank her for generously sharing her invaluable knowledge and experience.

In addition, I would like to thank my parents for constantly supporting my studies. Their trust in my determination and ability helped me get through this demanding year. I must also thank my parents-in-law, who worked as researchers, for their guidance on research and on attitude towards study.

Finally, I want to thank my wife, Xuecong. Her smile and comfort helped me through the hard times, and her cooking made the good times even better.
Table of Contents
1 Introduction ....................................................................................................................... 7
1.1 The problem .............................................................................................................. 7
1.2 The approach and objectives of this project ............................................................. 7
1.3 Introduction to Literature review and relative works ............................................... 8
1.4 Research Gap ............................................................................................................. 9
1.4.1 Lack of research on Weibo ................................................................................ 9
1.4.2 No large scaled evaluation ................................................................................ 9
1.4.3 Evaluation with no ground truth ..................................................................... 10
2 Background ...................................................................................................................... 10
2.1 Current Situation of China’s Internet ...................................................................... 10
2.2 What is Weibo? ....................................................................................................... 11
2.3 Preliminary research on Zombie Accounts .............................................................. 14
2.3.1 Problem with Weibo ........................................................................................ 14
2.3.2 Investigation on Zombie account market ........................................................ 15
2.4 Literature Review .................................................................................................... 22
2.4.1 Existing Researches ......................................................................................... 22
2.4.2 Data Mining & Machine Learning methods and algorithms ........................... 29
2.4.3 Research gaps and my approach ..................................................................... 31
3 Methodology ................................................................................................................... 32
3.1 Obtaining Ground Truth data .................................................................................. 32
3.1.1 Zombie accounts ............................................................................................. 33
3.1.2 Real user accounts ........................................................................................... 35
3.1.3 Accounts data for Evaluation .......................................................................... 37
3.2 Implementation ....................................................................................................... 38
3.3 Actual System Design .............................................................................................. 38
3.3.1 Data Gathering ................................................................................................ 38
3.3.2 Data Pre-processing ......................................................................................... 51
4 Data Mining & Classifier Evaluating ................................................................................ 59
4.1 Initial analysis of features with visualisation ........................................................... 60
4.2 Base Line .................................................................................................................. 64
4.3 Single Classifier Experiments ................................................................................... 64
4.3.1 Naive Bayes ..................................................................................................... 64
4.3.2 Decision Tree Classifier .................................................................................... 66
4.3.3 Support Vector Machine ................................................................................. 67
4.3.4 Multi-Layered Perceptron ............................................................................... 68
4.4 Meta Classifiers ....................................................................................................... 69
4.4.1 Boosting ........................................................................................................... 69
4.4.2 Bagging ............................................................................................................ 69
4.4.3 Voting .............................................................................................................. 69
4.4.4 Results ............................................................................................................. 69
4.5 Evaluating using incomplete data ........................................................................... 70
4.6 The composition of Weibo Users ............................................................................ 71
5 Conclusion & Discussion .................................................................................................. 72
5.1 Achievements .......................................................................................................... 72
5.2 Limitation ................................................................................................................. 74
5.3 Future Work ............................................................................................................ 75
6 Reference ........................................................................................................................ 77
7 Appendix A ...................................................................................................................... 78
8 Appendix B ....................................................................................................................... 79
9 Appendix C ....................................................................................................................... 80
1 Introduction
1.1 The problem
China's online social networks have been developing rapidly for a decade. Their influence now reaches all walks of life in China and many domains, including the economy, law, politics and, in particular, public opinion. The freedom of the Internet allows everyone to express and exchange opinions. Yet this freedom also allows people with ulterior motives to manipulate public opinion by spreading spam and fake stories built on fabricated information through unchecked social media websites.
Such information is often spread by fake online accounts controlled by computer programs (robots), known for short as zombie accounts. These zombie accounts have flooded every social network website; driven by profit, they create and spread false information to the public, manipulating public opinion.
Therefore, if we can develop an efficient way of understanding and classifying zombie accounts, genuine public opinion can be recovered by filtering out the false voices of zombie accounts. Moreover, classifying these accounts would support further research, including studies of the effect of zombie accounts, of real public opinion, and of any other topic that makes use of online social networks.
1.2 The approach and objectives of this project
Researchers have already proposed many approaches to the problem of zombie accounts. My objective is to find a good classifier, or a combination of existing approaches, that maximises the ability to classify zombie accounts on social networks.

Weibo and Twitter are the two most studied microblogging platforms. In this project, I first implemented a distributed crawling system with a proper framework (described in Section 3.3.1.2) in order to obtain a large dataset. In addition to the crawling system, ground truth data is obtained using several methods (described in Section 3.1). I then conduct large-scale experiments to evaluate the composition of Weibo accounts using the crawled data and the ground truth data. By applying different methodologies to classify Weibo zombie accounts at scale, I aim to obtain an optimal set of classifiers for identifying zombie accounts, trained on the ground truth data. With that, I may also obtain a clearer picture of the current situation of the Weibo social network: approximately how many zombie accounts exist in the network, what the significant patterns and behaviours of these accounts are, and possibly how Weibo and its users are influenced by them.
1.3 Introduction to Literature review and relative works
I have researched various methods and algorithms for classifying zombie accounts. One paper (Zhang & Vern, 2011) offers an efficient way of classifying accounts using only the timestamps of posts. A similar approach (Tavares & Faisal, 2013) uses the time intervals between posts together with probabilistic classifiers and has produced very good results. Another approach (Amit A. Amleshwaram, et al., 2013) analyses different features of zombie accounts and, by selecting the best feature set, also classifies zombie accounts with high accuracy. Furthermore, location information has been found useful (Deng, et al., 2015); although the accuracy reported in that paper is doubtful, the approach is inspiring. Last but not least, I found the research (Sun, et al., 2014) on user interaction and inter-user influence very helpful, because of its assumption that zombie accounts interact with and influence real users differently.

In this project, I aim to evaluate the methods above, but will not be limited to them. I hope to discover further algorithms and methods for distinguishing zombie accounts in the course of this project.
1.4 Research Gap
According to my study, there are noticeable research gaps in the topic of identifying zombie accounts on Weibo, as follows:
1.4.1 Lack of research on Weibo
There is much successful research on zombie classification for Twitter, but far less for Weibo. This is largely because Twitter is a worldwide social network with stronger influence. By contrast, Weibo is much younger than Twitter and its influence is still growing. The problem of zombie accounts on Weibo has only recently been noticed by researchers in China, and related work has begun to appear only in the last few years. This project aims to fill this gap.
1.4.2 No large scaled evaluation
Among the existing studies on Weibo, most data is obtained through the public Weibo API, which limits the number of requests within a fixed time period and therefore limits the size of the data. Considering the total number of registered users on Weibo, the sampling rate of these studies is very low. After studying the HTTP communication of Weibo.com, I devised a solution that does not require the public Weibo API and needs only regular HTTP requests. With the help of proxies and pipelining, I managed to obtain data from Weibo at an unprecedented speed. The large dataset obtained in a short time enabled me to conduct relatively large-scale experiments to evaluate classification algorithms and methods.
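The proxy-and-pipelining approach above amounts to rotating each request through a pool of exit addresses so that no single IP exceeds the rate limit. A minimal round-robin sketch is given below; the host:port entries are placeholders of mine, not real proxy addresses, and a real pool would be loaded from a provider's list.

```java
import java.util.List;

// A minimal round-robin proxy pool, sketching how plain HTTP requests
// can be spread across many exit IPs. The addresses here are
// placeholders for illustration only.
public class ProxyPool {
    private final List<String> proxies;
    private int next = 0;

    public ProxyPool(List<String> proxies) {
        this.proxies = proxies;
    }

    // Returns the next proxy address, cycling through the pool.
    public synchronized String nextProxy() {
        String p = proxies.get(next);
        next = (next + 1) % proxies.size();
        return p;
    }
}
```

A crawler thread would fetch the next address before each request and route the connection through it, so the per-IP request rate stays far below any blocking threshold.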
1.4.3 Evaluation with no ground truth
I have not seen any zombie account analysis that evaluates its algorithm against ground truth, that is, against accounts guaranteed to be zombies or real users. In most studies, the zombie accounts are classified manually. Although the human annotators are presumed to be intelligent and well informed, such identification is unsound and does not guarantee correctness or accuracy. By looking into the black market for zombie accounts, as described in the following pages, I was able to acquire various types of zombie accounts, created and trained for different purposes and guaranteed to be zombies. In addition, I manually identified real, human-controlled accounts as a control group. With these accounts at hand, classifier experiments were carried out, allowing me to obtain more authentic and well-grounded results when evaluating the related classification algorithms and methods.
2 Background
2.1 Current Situation of China’s Internet
The Chinese Internet industry has been one of the fastest-growing sectors for more than 10 years. According to statistics,1 by the end of 2014 China had 649 million Internet users, an increase of 31 million over the previous year. About 86% of the netizen population accessed the Internet through mobile devices, compared to only 81% in 2013. Moreover, netizens' average weekly time online increased from 25 hours to 26.1 hours.
1 Statistical Report on Internet Development in China (January 2015)
2.2 What is Weibo?
Composed of two Chinese characters, Wei (微, meaning "micro") and Bo (博, meaning "blog"), "Weibo" is a literal translation of "microblog". Weibo was established on 14 August 2009, only a month after the Chinese government closed most domestic microblogging websites, such as Fanfou, and banned international social media services including Facebook, Twitter and Plurk. It is a service provided by Sina Corporation with basic functionalities such as messaging, private messaging, commenting and reposting. "Sina Weibo", a compatible API platform, was then opened to the public on 28 July 2010.2
The number of registered users on Weibo reached 100 million before March 2011.3 Weibo has since become very popular among young people for the diversity and completeness of its social media functions and applications. In fact, about 24.8% of Internet users use microblogs,1 and Weibo.com is one of the largest Chinese microblogging services. Just as Twitter plays a significant role in Western countries, Weibo plays a significant role in Chinese social media. According to an official report from Weibo,4 by the end of September 2014 there were 76.6 million daily active users and 160 million monthly active users.
2 "Special: Micro blog's macro impact". Michelle and Uking (China Daily). 2 March 2011. Retrieved 26 October
2011.
3 2010 Sina Annual Financial report, Accessed on 10 Nov 2015, HTTP://tech.sina.com.cn/i/2011-03-
02/06005233783.shtml
4 2014 Weibo User Development Report, Weibo Data Centre
According to market research,5 about 54% of users on Weibo.com are male and 46% female. Moreover, approximately 70% of microbloggers are under the age of 30, and 80% of that group hold a Bachelor's degree. These young people, with their posts and comments on Weibo, together form a large part of public opinion on China's Internet.
Many of Twitter's features and functionalities are implemented in Weibo. As a basic constraint, any single post is limited to 140 characters, Chinese or English. Referring to other users in a post is done with the '@Username' format, and users can add hashtags with the '#tagname#' format. In addition, the '//@Username' format re-posts another user's post, just as 'RT @Username' does on Twitter.
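These formatting conventions can be extracted mechanically. The sketch below pulls mentions and hashtags out of a post text; the assumption that usernames consist of word characters and CJK ideographs is mine, not a documented Weibo rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for the post conventions described above.
// The allowed username character classes are an assumption
// (ASCII word characters plus CJK ideographs).
public class PostParser {
    private static final Pattern MENTION =
        Pattern.compile("@([\\w\\u4e00-\\u9fff]+)");
    private static final Pattern HASHTAG =
        Pattern.compile("#([^#]+)#");

    // Collects every @Username (including //@Username reposts).
    public static List<String> mentions(String post) {
        List<String> out = new ArrayList<>();
        Matcher m = MENTION.matcher(post);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    // Collects every #tagname# hashtag.
    public static List<String> hashtags(String post) {
        List<String> out = new ArrayList<>();
        Matcher m = HASHTAG.matcher(post);
        while (m.find()) out.add(m.group(1));
        return out;
    }
}
```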
The structure of Weibo's social network is built on two concepts: fans and followers. A user is free to 'follow' any other user, thereby becoming a 'fan' of the one being followed. Once user A becomes a follower/fan of user B, all of B's posts are pushed to A's main page.
Besides text, users may post images, for example to carry text that exceeds the 140-character limit. Videos, music and uploaded files are also allowed, provided they do not violate copyright. Additionally, most posts and all comments, re-posts and likes are visible only to logged-in users; this is why Weibo is called a semi-public social network. Unregistered visitors can only see the latest post of a registered user.
5 IResearch (2011) China microblog industry and user research report 2010 (Chinese). Beijing: iResearch.
Accessed on: 11 December, 2012. HTTP://www.iresearch.com.cn.
Weibo has a verification policy, similar to Twitter's verified accounts, which requires a user to submit real-world identification documents: an ID card or passport for an individual, or an official letter with legal proof for a company or organisation. Identities are examined and verified manually (or not, as I will explain later). Successfully verified users are given a large 'V' badge after their name: an orange 'V' for individuals and a blue 'V' for organisations and companies. There are different classes of verification for different organisations, e.g. educational institutes, public organisations, and local or international government departments.
Weibo offers a levelling system for users based on account activity and online time. Each account has a 'next level experience' threshold: as in video games, a certain amount of experience is required to level up the account. Users can speed up their levelling either by earning experience through tasks such as logging in every day or making one new post each day for 5 days, or by paying the website for a boost. Each level grants the user certain advantages on Weibo, such as free recommendation of the account to new users or a free cash lottery. In addition, users may apply for the title of 'Weibo Master' if their account achieves a certain popularity and influence.
Sina has developed a Weibo app for multiple platforms including Android, iOS, Blackberry
OS, Windows Mobile and even Symbian S60. There is also a desktop version client that runs
on Windows PCs.6
6 Weibo Desktop Client home page. Accessed on 11 Nov 2015, HTTP://desktop.weibo.com
2.3 Preliminary research on Zombie Accounts
2.3.1 Problem with Weibo
Many news reports have indicated that websites such as Twitter and Weibo are flooded with program-controlled accounts. For example, it has been reported that the Twitter account of the American President Obama, which had 36.9 million followers, had at least 19.5 million fake followers.7 Similarly, the Weibo followers of Xie Na, a famous Chinese TV hostess, singer and actress,8 dropped dramatically from 3 million to 2 million in a single day after Sina Weibo began its first purge of fake accounts.9 Accounts in such quantities cannot easily be controlled manually by humans; there is no doubt that they are managed by computer programs. Such computer-managed, automated accounts are called zombie accounts. It was reported back in 2009 that 24% of Twitter accounts were bot-controlled,10 and it is possible that Weibo has even more.
7 Barack Obama is the political king of the fake Twitter followers, with more than 19.5 MILLION online fans who
don't really exist. Accessed on 13 Nov 2015. HTTP://www.dailymail.co.uk/news/article-2430875/Barack-Obama-
19-5m-fake-Twitter-followers.html
8 Xie Na’s homepage of Hunan TV station, China. Accessed on 13 Nov, 2015,
HTTP://ent.hunantv.com/v/mxgw/hnzc/xnzy/index_3244.html
9 Xie Na’s Weibo Post complained about the decrease of followers on 23 Nov, 2011. Accessed on 13 Nov 2015.
HTTP://www.weibo.com/1192329374/zF0tcUr1cj?type=comment
10 An In-depth look at the most active twitter user data. Accessed on 13 Nov 2015. HTTP://sysomos.com/inside-
twitter/most-active-twitter-user-data
Although Weibo officially launched another zombie-cleaning operation earlier this year,11 its effectiveness has been doubted by many.12 No available report confirms that the problem of zombie accounts on Sina Weibo has been solved.
2.3.2 Investigation on Zombie account market
The zombie account phenomenon is obviously profit-driven. According to Carlo De Micheli and Andrea Stroppa's research,13 a conservative estimate puts fake Twitter followers at a potential business of $40 million to $360 million. Given that Weibo has at least half as many active users as Twitter,14 the potential zombie account business on Weibo might reach a similar scale. I therefore investigated the zombie account market on Weibo.

Although there is no easy way for me to estimate the size of such an underground business, it is still possible to glance at its surface. According to my own small investigation, there are at least five parties involved in this business, most of them legal (as there is no written law in China covering it), and each possibly operating separately from the others.
11 “Weibo launched the plan to clean fake fans, building an optimum Weibo environment” (微博启动垃圾粉丝清
理计划 打造良性微博生态). Sina Official Website. Accessed on 13 Nov 2015. HTTP://tech.sina.com.cn/i/2015-
02-10/doc-iavxeafs1026932.shtml
12 “Users are losing their real fans, what is Sina’s ‘conspiracy’?” (用户大量掉真粉儿,新浪微博有何“阴
谋”?). Accessed on 13 Nov 2015. HTTP://www.ibailve.com/show/6-579-1.html.
13 Twitter and the underground market, C De Micheli., A Stroppa.
14 Number of monthly active Twitter users worldwide from 1st quarter 2010 to 3rd quarter 2015, Accessed on 13
Nov 2015. HTTP://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
2.3.2.1 Proxy
The first party is the web proxy server provider. A proxy is a server that acts as an intermediary for requests from clients seeking resources from other servers. A proxy with high anonymity, an 'elite proxy', can hide the true geographical position of an Internet user, which is vital for this kind of industry. Websites such as Weibo will block an IP address if the HTTP requests from that IP arrive at more than 2 requests per second for over a minute; I verified this with a small Java program that simply sends HTTP GET requests for a Weibo page. Clearly, proxy technology is essential for controlling thousands or even millions of accounts. Proxy servers are unexpectedly cheap. For example, a seller named TKDaili15 offers different monthly bundles, from 350 thousand IPs for ¥50 (about £5.15) to 50 million different IPs for ¥500 (about £51.50). That means that for only 51 pounds, one can send at least 50 million HTTP requests to Weibo from 50 million different IPs. Moreover, if the program avoids sending requests too frequently, each proxy remains usable for connecting to Weibo for between 10 minutes and 10 hours, according to my tests. The same advantages of proxies also allowed me to perform the large-scale crawling for this research, as I will explain later.
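The blocking rule observed above implies a simple client-side throttle: keep each exit IP under roughly two requests per second. A minimal sketch follows; the 2-requests-per-second budget comes from my informal test, not from any documented Weibo limit.

```java
// A minimal client-side throttle derived from the observed blocking
// rule (an IP sending more than ~2 requests/second for over a minute
// gets blocked). The rate budget is an assumption from informal tests.
public class Throttle {
    private final long minIntervalMillis;
    private long lastRequestAt = Long.MIN_VALUE;

    public Throttle(double maxRequestsPerSecond) {
        this.minIntervalMillis = (long) Math.ceil(1000.0 / maxRequestsPerSecond);
    }

    // Returns how long the caller should sleep before sending the
    // next request at time `nowMillis`, and reserves that slot.
    public synchronized long delayBefore(long nowMillis) {
        if (lastRequestAt == Long.MIN_VALUE) {
            lastRequestAt = nowMillis;
            return 0;
        }
        long earliest = lastRequestAt + minIntervalMillis;
        long delay = Math.max(0, earliest - nowMillis);
        lastRequestAt = Math.max(nowMillis, earliest);
        return delay;
    }
}
```

A crawler would call `delayBefore(System.currentTimeMillis())` and sleep for the returned duration before each request; with a proxy pool, one throttle per exit IP keeps every address under the limit.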
2.3.2.2 Verification Code Typing Service
Verification codes are the characters that one is required to recognise and type when logging in, performing operations or posting messages on certain websites, in order to verify that the user is human. These codes are normally printed on screen twisted, randomly coloured and with varying contrast, making them very hard for machines to recognise. There are small companies, however, that offer the service of recognising and typing verification codes for various websites. A platform called Damatu is famous in this industry; it integrates 4 code-typing companies and offers price comparison between their services.16 As of 13 Nov 2015, the cheapest price for a thousand verification code recognitions was 1.4 RMB (0.14 GBP).16 Be aware that this service is not performed by any algorithm or program: the codes are processed manually by cheaply paid workers. The Damatu platform also offers jobs for such code recognition, and it is easy to register as an online worker on the website. The workflow of this business is as follows:
1. The client sends a picture of a verification code to the recognition server.
2. The server distributes the code to a worker.
3. The worker recognises the letters or characters in the picture and submits the result as text.
4. The server responds to the client with the text.
5. The client verifies the text and sends back the verification result.
6. The worker is rewarded if the code was correct.
According to my own test with a small Java program and 100 thousand samples, the accuracy of this manual service is higher than that of any known algorithm: about 99% of verification codes are correctly recognised. This service evidently serves no good purpose; it is at the heart of all spam systems as well as the zombie accounts on Weibo. The registration process of Weibo requires new users to type in verification codes, and a user logging in from a different place is also required to type one in. With this service, the obstacle of verification codes is overcome at very low cost, which may lead to an underestimate of the quantity of zombie accounts. Common sense suggests that a website should have more real users than zombie users, but this might be an illusion. My study will try to determine the proportion of zombie accounts on Weibo as accurately as possible, without assuming that human users are the majority.

15 TK 代理, TKDaili. Accessed on 13 Nov 2015. HTTP://www.tkdaili.com/charge-HTTP.aspx
16 Damatu 打码兔 (Code Typing Rabbit). Accessed on 13 Nov 2015. HTTP://www.dama2.com/
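The six-step workflow described above can be modelled as a small loop between client, server and worker. The sketch below is a toy simulation of that protocol, not the real Damatu API (whose interface is not documented here); the names and the pay-per-correct-code bookkeeping are my own illustration.

```java
import java.util.function.Function;

// A toy simulation of the six-step workflow: the client submits a
// captcha, the "server" hands it to a worker function, the client
// checks the answer, and the worker is credited only on success.
// All names here are hypothetical, for illustration only.
public class CodeTypingFlow {
    private int workerCredits = 0;

    // `worker` stands in for steps 2-4 (human recognition);
    // `verify` stands in for step 5 (client-side check).
    public String solve(String captchaImage,
                        Function<String, String> worker,
                        Function<String, Boolean> verify) {
        String text = worker.apply(captchaImage);
        if (verify.apply(text)) workerCredits++;   // step 6: reward
        return text;
    }

    public int credits() {
        return workerCredits;
    }
}
```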
2.3.2.3 Account Trainer and Sellers
In the following text, I define account trainers as people who use automated programs to manage Sina Weibo accounts and who, by doing so, maintain the activity, quality and human-likeness of massive numbers of Weibo accounts.

Sellers of zombie accounts appear to be a different group of people in this industry. It is easy to obtain a so-called zombie account from various websites. In order to make an initial study of zombie accounts and this industry, I bought some zombie accounts from the most famous shopping website17 in China, "taobao.com" (a Chinese equivalent of eBay18 that allows sellers to sell both virtual and physical products). The accounts were bought from different sellers, and the usernames of their products usually follow one of two formats:

1. Letters + numbers. The letter prefix is usually the seller's Taobao shop name, and the number suffix is usually sequential. It is quite obvious that these accounts were created by computer programs.

2. A random phone number. In China, a mobile phone number consists of 11 digits and requires a SIM card to be registered. On Weibo.com, users are encouraged to register with their phone numbers as usernames for service and security reasons, but each phone number may register only one account. It remains a mystery how the zombie account sellers obtain such a massive quantity of active mobile numbers and use them to register accounts. I would suggest that this is another black market, one that I am not able to dig into in this paper.
17 "Taobao.com Site Info". Alexa Internet. Retrieved 2015-08-13. HTTP://www.alexa.com/siteinfo/taobao.com
18 “Taobao= eBay+ Rakuten+ Amazon”, ("淘宝网=eBay+乐天+亚马逊). International financial paper (国际金融
报). 9 Jul 2009. Retrieved 14 Nov 2015. HTTP://paper.people.com.cn/gjjrb/html/2009-
07/09/content_292098.htm
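The two username formats above suggest simple pattern checks. The sketch below encodes them as regular expressions; the minimum lengths and the leading '1' for Chinese mobile numbers are my own assumptions for illustration, not rules used later in the thesis.

```java
import java.util.regex.Pattern;

// Heuristic checks for the two username formats observed among the
// purchased accounts. The exact lengths and prefixes are assumptions
// for illustration only.
public class UsernameHeuristics {
    // Format 1: a letter prefix (e.g. a shop name) followed by digits.
    private static final Pattern LETTERS_THEN_NUMBERS =
        Pattern.compile("[A-Za-z]{2,}\\d{2,}");
    // Format 2: an 11-digit Chinese mobile number (starting with 1).
    private static final Pattern CN_PHONE =
        Pattern.compile("1\\d{10}");

    public static boolean looksLikeBatchName(String username) {
        return LETTERS_THEN_NUMBERS.matcher(username).matches();
    }

    public static boolean looksLikePhoneNumber(String username) {
        return CN_PHONE.matcher(username).matches();
    }
}
```

On their own such heuristics prove nothing, but they give a cheap first filter before the behavioural features studied later.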
In order to attract buyers, many sellers offer a trial discount, that is, a very inexpensive price for the first ten or hundred accounts. The normal price is usually between ¥0.1 and ¥0.2 (0.01 to 0.02 GBP) each.19 I spent less than 10 GBP and bought about a thousand Weibo accounts from different sellers. I then discovered that some of the sellers offered accounts with usernames in the same pattern, letters + numbers, with very similar letter prefixes. It appears that different sellers share the same source for their "products". I have two hypotheses in this regard:
1. The same seller operates multiple zombie account shops, gaining an advantage over competing sellers. This strategy works because:
a. On each search page of Taobao.com, only 48 shops are displayed;
b. It is very common for users to sort search results by price;
c. This appears to be a very competitive business: among the first 5 pages of results, all sellers offer very low prices, about ¥1 for 5 to 10 accounts;
d. The authenticity of transactions in such virtual products is in doubt,20 so the reviews of such shops are usually ignored.
All these factors encourage sellers to register more shops to increase their probability of being seen by customers on each search page.
2. The sellers are different groups of people from account trainers. They purchase their
products from sources such as the account trainer and sell them to whoever in need. Apart
19 Search page of Weibo account sellers showing a competitive price list. Accessed 13 Nov 2015.
https://s.taobao.com/search?q=%E5%BE%AE%E5%8D%9A+%E6%89%B9%E5%8F%91&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.7724922.8452-taobao-item.2&ie=utf8&initiative_id=tbindexz_20151118&bcoffset=1&ntoffset=1&p4plefttype=3%2C1&p4pleftnum=1%2C3&s=0
20 "Good reviews on Taobao can be bought easily" ("淘宝付费刷信誉泛滥"). People.cn, 13 Aug 2014. Accessed
13 Nov 2015. http://finance.people.com.cn/n/2014/0813/c1004-25455454.html
from the similarity of the format of the usernames, there is another piece of evidence
that supports this hypothesis: the descriptions of many shops state that buyers who need
a large quantity should contact the seller and order in advance.21 This indicates that the
seller does not hold a large stock of accounts, but if contacted in advance, they will
manage to get enough accounts.
There are quite a few free and paid tools that help users register Weibo accounts en
masse.22 This kind of tool would enable anyone with the appropriate resources to create a
large quantity of Weibo accounts. However, none of the public tools has the capability of
maintaining the accounts, such as logging in regularly or making posts and comments
regularly. The account trainers must have other tools that are not publicly sold, for many
of the Taobao-sold accounts are more than merely registered accounts with default empty
settings: they come with selfie photos, fans, followers, and regular posts. Moreover, it can
safely be assumed that the majority of the followers of such accounts are zombies as well.
Account trainers use programs to make zombie accounts follow each other, imitating
human-like behaviour. These more human-like accounts are called "advanced accounts"
and are usually more expensive. Account prices vary from ¥1 to ¥1000 or even more,
based on account quality, which is usually defined by the account level, number of fans
and registration date.
21 A sample shop of a Weibo accounts seller. All descriptions on the page are presented as images; one line in
red reads "Please contact us for bulk orders" ("量大请联系") before the bold black attention text. Accessed 14
Nov 2015.
https://item.taobao.com/item.htm?spm=a230r.1.14.38.tQNW0p&id=523123002035&ns=1&abbucket=15#detail
22 A download site page that offers this kind of tool. The page describes a Weibo registration tool with a
verification code recognition service integrated into it. Accessed 14 Nov 2015.
http://www.boyuansoft.com/html/cn/product/read_370.html
Rather than speculating groundlessly about what zombie accounts are like, obtaining
various types of zombie account from these sellers is a better way to study the zombie
accounts and their features closely. Therefore, I have tried to buy as many different
zombie accounts as possible within a limited budget. The prices and other details of the
accounts are described in the following sections.
2.3.2.4 Clients
Whoever utilises zombie accounts for their own benefit is a zombie account client.
Celebrities,9 commercial and non-commercial corporations and organisations, government
departments, individuals who want fame, and people from all walks of life are potential
buyers of zombie accounts. Based on the large number of account shops on Taobao.com,19 it
is safe to say that this product has many buyers.
Proxies, verification code recognition, account trainers, sellers, and buyers are the five
parties that affect the analysis in this research and are therefore worth investigating. For
instance, location-based analysis of zombie accounts might be significantly influenced by
the use of proxies. Moreover, well-disguised verification codes have conventionally been
viewed as an effective way of reducing script-controlled registration, but with the
recognition services of providers such as 'Damatu', the verification code is no longer as
strong an obstacle as assumed, and therefore both Sina officially and researchers may
have underestimated the quantity of zombie accounts.
2.4 Literature Review
2.4.1 Existing Research
There is much research related to this topic, but most of it concerns zombie accounts on
Twitter, with far less looking at Weibo. Considering that Weibo has at least half as many
users as Twitter, the quantity and quality of studies on Weibo is disproportionately low. In
this section, I introduce several approaches, on either Twitter or Weibo, classified into 4
types.
2.4.1.1 Automation detection based on timestamp information
This is a study on Twitter (Zhang & Paxson, 2011) in which only the publicly available
timestamp of each tweet23 is used.
The paper uses minutes-of-the-hour and seconds-of-the-minute as two axes to plot the
activities of different accounts. Comparing the graphs of different users, the authors found
that the graphs of some users passed a "χ2 test for expected uniformity, presumably
reflecting organic behaviour" (Zhang & Paxson, 2011), while other accounts exhibited
detectable non-uniformity or hyper-uniformity. By analysing 6 different types of post-time
distribution (see Figure 1), the paper concludes that automated bot accounts can be
recognised by the uniformity of their posting times: if a user's posts are non-uniform or
overly uniform, the user is assumed to be automated by a program. "We can conclude the
presence of automation if we find tweet times either not uniform enough, or too uniform."
(Zhang & Paxson, 2011)
23 A tweet is a post or microblog on Twitter, the length of which is no longer than 140 characters.
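The uniformity test described above can be sketched in a few lines. This is my own illustrative reconstruction, not the paper's code, and the decision thresholds below are placeholders rather than the paper's values:

```python
# Sketch of a chi-square uniformity test on posting times: bin the
# seconds-of-the-minute of each post and compare the bin counts with a
# uniform expectation. The thresholds in looks_automated are illustrative.
from collections import Counter

def chi_square_uniformity(seconds_of_minute, bins=60):
    """Pearson's chi-square statistic of timestamps vs. a uniform spread."""
    n = len(seconds_of_minute)
    expected = n / bins
    counts = Counter(s % bins for s in seconds_of_minute)
    return sum((counts.get(b, 0) - expected) ** 2 / expected for b in range(bins))

def looks_automated(seconds_of_minute, low=30.0, high=100.0):
    """Flag hyper-uniform (statistic too low) or non-uniform (too high) posting."""
    stat = chi_square_uniformity(seconds_of_minute)
    return stat < low or stat > high
```

A perfectly even spread of posting seconds scores 0 (hyper-uniform, flagged), while posts all landing on the same second score very high (non-uniform, also flagged); organic posting falls in between.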
Figure 1 Different uniformity
19463 accounts were tested using their public timelines. The authors found that 16% of
active accounts demonstrate a high likelihood of automation; moreover, 11% of accounts
that post solely through the browser are automated.
As the paper acknowledges, it is quite plausible for an automated account to evade their
test, but there was no evidence at the time of publication that any automated account was
intentionally exhibiting uniformity in order to do so.
The approach of Zhang & Paxson is easy to understand and was effective in their
small-scale test. Although uniformity is a sign of automation, it is not definite proof.
Unlike what the paper reports about Twitter, there is evidence that on Weibo,
organisation-controlled accounts intentionally post at fixed time intervals because they
believe this will increase their popularity.24 Yet these famous Weibo accounts
24 "The controllers of grass-roots popular Weibo accounts: decrypting Weibo influence and relations" ("草根牛
博操控者 解密微博势力关系谱"). Tencent News (May 2011). Accessed 15 Nov 2015.
http://finance.qq.com/a/20110505/003612.htm. This article describes how several very popular (more than 10
million fans) Weibo accounts are managed: the managers hired employees to post routinely, so as to maintain
the freshness and activity of the accounts.
are only a small fraction of all 500 million users. I would expect good results from a
larger-scale test on Weibo if I am able to filter out this type of account.
Another paper (Tavares & Faisal, 2013) describes a similar approach. In it, Twitter
accounts are divided into three classes: personal, organisational and bot-controlled. The
investigation is likewise independent of tweet content. Probabilistic inference algorithms
are used for classification, including a naïve Bayes classifier and a prediction algorithm
that tries to predict the distribution of the time intervals between a user's tweets.
The prediction results are slightly worse than those of other related work. With a naive
Bayes classifier distinguishing between account categories, an accuracy of 84.6% was
achieved when classifying only between individual and organisational accounts, and 75.8%
when classifying between all three types. The accuracy may seem a little low, but the
authors explain that because they used no a priori assumptions about account features,
all classification is based purely on tweeting behaviour, rather than on parsing tweet
contents or analysing user profiles as others did.
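To make the timing-only idea concrete, the sketch below derives simple features from the gaps between consecutive post timestamps. Summarising the gaps by their mean and coefficient of variation is my own simplification, not the paper's exact method:

```python
# Illustrative timing features: a bot posting on a fixed schedule has
# near-zero variation in its inter-post gaps, while organic posting is
# much burstier. Feature names here are my own, not the paper's.
def interval_features(timestamps):
    """Mean and coefficient of variation of inter-post gaps (seconds)."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    cv = (var ** 0.5) / mean if mean else 0.0
    return {"mean_gap": mean, "gap_cv": cv}
```

Such per-account features could then feed any of the classifiers discussed later, with no access to post content at all.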
In this project, these two methods are used as starting points. I have managed to obtain
enough information and carried out simplified versions of experiments based on their
studies.
2.4.1.2 Classifier by using supervised learning on different features
In this paper (Amleshwaram, et al., 2013), researchers develop a set of 15 new features
and use them together with 3 previously proposed features to detect Twitter-based
spammers. Spammers are known to use zombie accounts for all kinds of purposes,
including commercial advertising and spreading malicious URLs. Through experiments and
tests, the paper identifies the subset of features that contributes most to spam detection,
enabling them to detect more than 90% of spammers from only 5 tweets. 600 thousand
accounts were used to evaluate their approach. In detail, they achieve a 96% detection
rate with only a 0.8% false positive rate using different supervised learning algorithms.
The researchers propose 5 categories of detection features. The first focuses on
bait-oriented features, used to identify spammers who lure victims by posting fake tweets
or by mentioning victims in random tweets, expecting the victims to click on the
accompanying URLs. The second set captures behavioural aspects of spammers, including
repeated text and URLs and the domains of the URLs posted in tweets. The next set is
drawn from tweet content that shows signs of an automated program. In addition, the
similarity between tweets of the same account is analysed as a feature. Finally, the user
profile is taken as a feature: the paper assumes that a well-organised profile is less likely
to be malicious.
This feature-based spammer classifier produces promising results. The authors highlight
that they were able to identify more than half of the spammers from only a single tweet.
Very little computation is therefore needed, and yet very good accuracy (96%) is obtained.
Although zombie accounts are defined differently from spammer accounts, and the
bait-oriented features are not available on Weibo.com,25 the features analysed in this
paper are very inspiring for my work, and the fact that only a little computational power is
needed for classification is very helpful for a large-scale approach to zombie
classification.
25 Any URL posted by users on Weibo is converted into an intermediate short URL, whose content is hard to
differentiate.
2.4.1.3 Location-based zombie user detection
A new approach was published recently (Deng, et al., 2015), in which the locations of
accounts and of their fans and followers on Weibo.com are used to classify their zombie
identity. The locations of the followers and fans are compared and recorded: whenever
two of the followers, or two of the fans, share the same registration city or province, two
variables named SAMEC and SAMEP are incremented respectively. The numbers of
followers and fans are also defined as two variables, FER and FING. Thresholds SAMEC_TH,
SAMEP_TH, FER_TH and FING_TH are defined for the above four variables respectively.
Based on these thresholds and variables, an intuitive classification rule with four
conditions is created; together with a logic expression, they form a rule-based classifier,
as shown in Figure 2.
Figure 2 Rule based on 4 conditions
10000 Weibo accounts were used to test the scheme, with zombie accounts identified
manually. Using trial and error, the authors found the configuration of thresholds that
grants this scheme an accuracy of approximately 80%.
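The four-variable rule can be sketched as follows. The pair-counting definitions follow the description above, but the threshold values and the exact boolean combination are illustrative placeholders; the paper's actual rule is the one shown in Figure 2:

```python
# SAMEC/SAMEP count pairs of an account's fans sharing a registration city
# or province; FER/FING are the follow and fan counts. The thresholds and
# the conjunction below are illustrative, not the paper's tuned values.
from itertools import combinations

def location_counts(fan_locations):
    """fan_locations: list of (city, province) pairs for an account's fans."""
    samec = sum(1 for a, b in combinations(fan_locations, 2) if a[0] == b[0])
    samep = sum(1 for a, b in combinations(fan_locations, 2) if a[1] == b[1])
    return samec, samep

def rule_classify(samec, samep, fer, fing,
                  samec_th=10, samep_th=20, fer_th=500, fing_th=10):
    """True = flagged as zombie under this illustrative rule: many fans from
    the same place, many follows made, few fans following back."""
    return samec >= samec_th and samep >= samep_th and fer >= fer_th and fing <= fing_th
```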
This paper does provide a creative method of classifying zombie accounts, which could be
helpful to other types of zombie classifier. However, it also has obvious problems. Firstly,
the accuracy of their test is questionable because they do not explain how zombie
accounts among the 10000 were identified manually, so the soundness of that manual
identification is in doubt. Secondly, the 4 conditions are not explained thoroughly. There is
no evidence to support assumptions such as "At least in the early days, zombie accounts
made large number of follows so that they might get followed back." or "It is hard to find
real user accounts to follow them back" (Deng, et al., 2015). What is more, the thresholds
are found by brute force, tuned to obtain the best classification result on those unreliable
10000 accounts; it is safe to predict that their classifier is overfitted. Most importantly,
the use of proxies can potentially defeat their assumptions, depending on how the
automated registration program uses proxies and how the account-managing program
maintains the accounts. If each proxy IP is used for only a limited number of registrations,
the locations would appear more random than the paper expects.
To sum up, although it falls short of being a sound piece of research, this paper is
inspiring in its use of location for zombie account classification. In my project, account
location is considered a nominal feature, and its usefulness is studied in a later section.
2.4.1.4 User interaction based research
The paper (Sun, et al., 2014) analyses user interactions based on the influence transfer
effect in social networks. Sun and his colleagues present a regional user interaction
model, which describes a dynamic process in an online social network, in order to study
the interaction processes of different users.
Figure 3 Schema of Sun's regional user interaction model
Direct influence and indirect influence are calculated according to how users retweet
others' posts and the distance between them. Consider user interaction as a graph, where
the nodes are users and the edges are direct retweets from one user to another; the
distance between users is then defined as the number of edges on the shortest path
between the two user nodes. The greater the distance, the less influence one user has on
another.
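The distance used in this model is ordinary shortest-path length on the retweet graph, which a plain breadth-first search computes. The sketch below is a minimal illustration, not code from the paper:

```python
# BFS shortest-path distance on a directed retweet graph.
from collections import deque

def shortest_distance(graph, src, dst):
    """graph: dict user -> set of users they directly retweet.
    Returns the edge count on the shortest path, or None if unreachable."""
    if src == dst:
        return 0
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None  # no retweet chain links src to dst
```

This also makes the cost of the method visible: computing influence for one account means traversing its whole follower/fan neighbourhood.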
Figure 4 The distribution of transferred influence of each user node
By calculating the influence of each node on 500 test accounts retrieved from Weibo,
clear patterns were found: users separate easily into 3 groups: key users who have
significant influence among their followers, regular users who have some influence among
others, and isolated users for whom no interaction and no influence are found. The
authors then compare results based on PageRank (Chang, et al., 2013) with their own
interaction model. The Complementary Cumulative Distribution Function (CCDF) is used
as the y value, with follower count, effective follower count, individual post count and
average post count as the x axes, creating 4 scatter plots of the top 50 influential
accounts respectively. Sun's model shows superior results to Chang's on the same
features, with a larger-scale interaction model visible on all plots.
In further tests, where all the test data are evaluated with their interactions, the 3 groups
of users show significant differences in the contrasted distributions against both account
reputation and tweet count.
In conclusion, although their assumption that zombie users should have less influence
than real users is not entirely sound, Sun and his colleagues provide a valuable way of
identifying zombie accounts based on their regional user interaction model. Although the
paper only demonstrates the difference in CCDF between types of users, and does not test
the accuracy of zombie classification, its value in aiding classification should not be
neglected. The method requires more computational power because the program needs to
traverse a user's follower and fan trees to determine the user's influence and thereby
classify the account type. In this project, this method is not tested, as obtaining the user
relation graph is difficult with undergraduate-level time and space resources. However, it
inspired me to implement a scalable web crawling framework that will be helpful for
extended research in the future.
2.4.2 Data Mining & Machine Learning methods and algorithms
In the reviewed literature, different classifiers and clustering methods are used. In this
project, I have selected several supervised classifiers, for the following reasons.
2.4.2.1 Naive Bayes
Naive Bayes is a popular conditional probability model: given a problem instance to be
classified, represented as a vector of feature values, it computes the probability of each
possible class that the instance could belong to. Training a naive Bayes classifier requires
a set of ground truth data for the different classes, as large as possible; the performance
of the classifier improves as the training data set grows. In addition, the naive Bayes
classifier assumes that the features of the data set are conditionally independent of each
other given the class.
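A toy version over binary features makes the model concrete: class priors multiplied by per-feature likelihoods, with Laplace smoothing to avoid zero probabilities on unseen feature values. This is a minimal sketch for intuition, not the WEKA implementation used later in this project:

```python
# Minimal categorical naive Bayes over binary features.
from collections import defaultdict

def train_nb(rows):
    """rows: list of (feature_tuple, label). Returns counts used at prediction."""
    class_count = defaultdict(int)
    feat_count = defaultdict(int)  # (label, feature_index, value) -> count
    for feats, label in rows:
        class_count[label] += 1
        for i, v in enumerate(feats):
            feat_count[(label, i, v)] += 1
    return class_count, feat_count

def predict_nb(model, feats):
    class_count, feat_count = model
    total = sum(class_count.values())
    best, best_p = None, -1.0
    for label, c in class_count.items():
        p = c / total  # class prior
        for i, v in enumerate(feats):
            # Laplace smoothing for a binary feature (2 possible values)
            p *= (feat_count[(label, i, v)] + 1) / (c + 2)
        if p > best_p:
            best, best_p = label, p
    return best
```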
2.4.2.2 Decision Tree Learning
Decision tree learning is a predictive modelling approach used in data mining.
Classification trees, in which the target instance can take values from a finite set of
features, are used to predict the class of the target instance. Leaves are the class labels
and branches represent conjunctions of features leading to those labels. A decision tree
offers a visual representation of how each instance is classified, which I find very useful
for understanding the different features and their significance in classifying zombie
accounts.
2.4.2.3 Multi-layered Perceptron
A multilayer perceptron (MLP) is a feedforward artificial neural network model. It maps
sets of input data onto a set of outputs using a multi-layered neural network, trained with
the backpropagation algorithm. The MLP has been a popular learning model in various
fields, such as machine translation and speech recognition, since the 1980s.
2.4.2.4 Support Vector Machine
A support vector machine (SVM) is a supervised learning model that analyses data for
classification and regression. Given a set of training examples, each tagged with one of
two classes, the training algorithm builds a model that assigns new examples to one of
the classes by constructing a hyperplane from the training data and comparing each new
instance against it. The SVM is a non-probabilistic binary classifier: because it depends
only on the separating hyperplane, and the support vectors that define it, rather than on
estimated class probabilities, it can generalise from a small amount of training data whose
class proportions are unknown, and it is comparatively resilient to overfitting. Unlike a
naive Bayes classifier, the zombie account classification result will not be strongly
affected by the proportion of zombie accounts in the training set, which is a preferable
property since there is no way to determine the prior probability of zombie accounts.
2.4.2.5 Combining multiple classifiers
Research has shown that some classification tasks become more reliable if more than one
classifier is used. For example, one study (Kittler, et al., 1998) shows that a combination
algorithm such as the sum rule applied to multiple classifiers outperforms many individual
classifiers on pattern recognition problems. Another study (Benmokhtar & Huet, 2006)
shows that combining a multilayer neural network with Gaussian mixture models gives an
overall improvement in the performance of video shot recognition. The paper (Opitz &
Maclin, 1999) likewise shows that bagging and boosting increase classifier performance
even given only a few different classifier models. It is quite plausible that none of the
above classifiers alone can distinguish zombie accounts from real ones; therefore,
classifier fusion methods will be considered in this project.
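The sum rule mentioned above is simple to state: sum (or average) the per-class probability estimates of the individual classifiers and take the class with the highest combined score. A minimal sketch, assuming each classifier can emit a probability per class:

```python
# Sum-rule fusion: add up each classifier's per-class probability
# estimates and return the class with the highest combined score.
def sum_rule(prob_lists):
    """prob_lists: one dict {class: probability} per classifier."""
    combined = {}
    for probs in prob_lists:
        for cls, p in probs.items():
            combined[cls] = combined.get(cls, 0.0) + p
    return max(combined, key=combined.get)
```

For example, two weak "zombie" votes can outweigh one confident "real" vote, which is exactly the averaging-out of individual classifier errors that motivates fusion.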
2.4.3 Research gaps and my approach
Some of the above methods have the gaps mentioned in the introduction section, and I
plan to fill these gaps in this project.
Firstly, the sampling rate is too low considering the total number of users on either Weibo
or Twitter, and this reduces the value of the research results. The paper (Sun, et al., 2014)
collects only 500 users' data, whereas there may be a much larger number of zombie
sellers on the market. Each of them might have a different way of producing zombie
accounts, and thus there might be more possible types of user than the 3 found in the
paper. This gap can be addressed by evaluating a larger data set.
Secondly, after reading these papers, I find that they have all used unsound data to
evaluate their classifiers: they either label zombie accounts in the test data according to
naive assumptions or thresholds, or classify the zombie accounts manually, neither of
which is sufficiently justified. In the paper (Amleshwaram, et al., 2013), for example, the
work is evaluated by observing the suspension state of the identified spammer accounts.
This justification rests on the efficiency and correctness of Twitter's anti-spammer
mechanism, which is sounder than the other approaches but still imperfect: the
mechanism is itself an algorithm and can give false positives and negatives. It is a good
source of reference, but not an evaluation standard. What I can do in this project is
acquire a large number of real zombie accounts from different sellers, the details of which
I explain in a later section.
3 Methodology
3.1 Obtaining Ground Truth data
Before obtaining any data, I define zombie accounts and real user accounts according to
the following rules, in order to clarify the scope of the study:
Zombie accounts
Weibo accounts whose activity is either partially or wholly controlled by computer
programs, regardless of their purpose. Even an account of a real authority that uses
software to manage its Weibo activity is considered a zombie account in this project.
Real accounts
Weibo accounts whose activity is controlled only by human beings, regardless of their
behaviour. An account whose user posts no microblogs at all, and an account whose
human user posts punctually every half hour, are both considered real accounts.
In order to clear up any ambiguity caused by the problematic naming conventions of
Sina's programmers, a recap of the definitions of followers and fans follows:
Followers of an account A: a field of an account; the set of accounts that account A
actively chooses to follow. Account A receives updates of its followers' posts.
Fans of an account A: the set of accounts that choose to follow account A. Whenever
account A posts a microblog, every fan receives an update.
Ground truth data for real human users and for zombie accounts are obtained differently.
3.1.1 Zombie accounts
Getting zombie accounts on Weibo is relatively simple and straightforward. The most
budget-friendly method is to buy fake fans26: simply type "微博 粉丝" ("Weibo fans" in
English) into the search box of www.taobao.com, as introduced in the background section.
Various fans sellers are found immediately.
Visiting most of the shops found in the search, the priced accounts can generally be
divided into 4 levels:
Basic level: accounts usually have no fans themselves and only follow other people.
Top level (顶级 in Chinese, as described by sellers): accounts have limited activities,
including following other accounts and posting simple sentences. They differ from
basic-level accounts in having relatively more fans; that is, these zombie accounts are
usually followed by other zombie accounts, making them more human-like.
26 Fake fans are zombie accounts that are programmed to follow target accounts, mainly for the purpose of
popularity. However, on close inspection of the fake-fan samples provided by the sellers, these "fans" post
microblogs with varied content, such as advertisements, sensitive topics, human-mimicking talk, jokes, etc. I
assume that these sellers not only sell the service of adding fans using these accounts, but also other
potentially "beneficial" services, making such fan accounts the most accessible zombie accounts on a limited
budget.
Exquisite level: accounts are like top-level accounts, except that they post various types
of microblog, including retweets of others' blogs, jokes, advertisements, and
pictures/videos; some of their posts are even "liked" by others.
Expert level, or Daren level (达人): accounts built on the exquisite level, but with a longer
registration time, a higher Weibo level, and more frequent posting (one post per day or
more), making them count as daily active users for Sina.
4 sellers were chosen randomly from the search results for unbiased sampling. The
following table summarises the accounts bought from each seller.
Seller ID27    Fans type           Price28
120006216      15000 Top           ¥528
               10000 Exquisite
142313259      2000 Top            ¥220
               2000 Exquisite
               1000 Expert
149080854      1000 Top            ¥215
               3000 Exquisite
               500 Expert
121117246      10000 Basics        ¥90
Sum            34500               ¥1053
Sellers were asked to use their zombie accounts to follow the account owned by me (that
is, to become my fans). All these fans were added gradually to my account within
27 The ID number of the seller's shop. The shop can be accessed at the URL https://shop[ID].taobao.com
28 In Chinese Yuan
3 days of purchase. A simple C# program was then implemented to traverse my fans list,
obtaining the IDs of most of these accounts. However, only about 21,740 accounts were
actually obtained, due to the limitation that even when viewing one's own fans, only the
first 250 pages of fans can be viewed.
A simple manual analysis was carried out on these zombie accounts. Depending on the
level of the zombie account, the data exhibit different patterns. More expensive account
levels have usually made more posts and have more fans/follows as well. Some basic-level
accounts have no avatar (thumbnail) image. Many of their names follow a distinctive
pattern, appearing to be a mixture of numbers and letters, or a mixture of Chinese
characters and numbers. Moreover, some names contain very rare Chinese characters that
are unlikely to be seen or used by anyone in a lifetime. These properties can be used as
references for further implementation; however, there will be no hardcoded thresholds or
classifiers, since this project aims to create a generalised method of identifying zombie
accounts.
3.1.2 Real user accounts
There is no simple method of obtaining a large number of real user accounts directly.
However, there is a workaround based on a simple assumption: human users who take
their account activity seriously, and who have only limited areas of interest, are very
unlikely to follow zombie accounts, whose posts generally fall into random fields that are
not strongly attractive to any particular person, may seem contradictory to a human
reader, and receive no responses to human comments. Especially since the existence of
zombie accounts was exposed, the public has been more careful about whom they follow.
Based on the above assumption, only a small number of real users need to be identified
manually. These identified accounts can then be used as a starting point for crawling:
using breadth-first search over the follower lists of these accounts, many more accounts
belonging to real users can be found.
Weibo.com has a channel called "Discovery" ("发现" in Chinese)29, where posts by popular
users in each field are randomly chosen and listed. This channel is used as the starting
point for finding real users, who are obtained by the following rules:
1. From each field, 7-10 posts were randomly chosen, and users who had made obviously
sensible comments and discussion on a post were initially selected.
2. The more diverse the better: if the first person in a field was commenting on a post
about basketball, then the second is preferably someone commenting on another sport,
such as football.
3. After the initial users have been selected, the blog page of each user is manually visited
and verified by a simple Turing test: whether its activity (microblogs, photos, comments,
etc.) shows any sign of being programmed. If the test passes, the user is added to the real
user group.
NOTE: There are many marginal users (users who do not actively write posts on Weibo
and only read and comment on others' posts). There is no easy way to distinguish these
users from zombie accounts, since it is very hard to search for their comments on
microblogs written by others30. As a result, initial users selected in the previous steps with
very few posts (<2) are not added to the real user group unless there are other obvious
factors that clarify their identity.
Consequently, the ground truth data of real users is kept sound but incomplete, for many
marginal but real users are ruled out by this. It is foreseeable that this is going to affect
the performance of the classifiers. However, if it is the public opinion or posts
29 http://d.weibo.com/102803_ctg1_1199_-_ctg1_1199# ; translations of the categories can be found in Appendix A
30 Weibo.com does not offer any functionality for searching a user's comments.
data that we care about, then real users who make no posts being misclassified as zombie
accounts because of this incompleteness of the training set will not have a serious
influence. I expect some inaccuracy in the classification results because of the way the
initial users were chosen, but there is no better way of doing it, since inactive users and
many zombie accounts show no human-observable difference. The only hope is left to our
classifiers.
300 users commenting on 300 posts across 48 topic fields were found by the previous
steps. A breadth-first search of depth 1 was used to obtain their followers: 26313 users31
were obtained and added to the real user group. Even under the worst-case assumption
that each of the 300 users followed 5% zombie accounts (though this is very unlikely), we
still have about 25000 real users as ground truth data.
3.1.3 Account data for evaluation
One of the purposes of this project is to find out the proportion of zombie accounts on
Weibo.com; it is therefore preferable to obtain as much account information as possible
for evaluation.
Obtaining a large amount of account data for evaluation is not a simple job. Weibo.com
prevents any single IP from accessing it too frequently: the highest frequency at which an
IP address can constantly access the website without being banned is 1 request per 4
seconds in the long run.32 This means that if information on millions of users is to be
obtained, we need a good framework that can access Weibo.com efficiently. Details of this
framework are described in the next section.
31 The 300 users should have more followers in total; however, Weibo.com only allows any user to view the
first 20 pages (400 followers) of a particular user, thus limiting our results.
32 Tested with a simple Java program that uses a proxy and accesses Weibo.com for user information at
different frequencies.
In total, 24 million non-repetitive user IDs33 were obtained. Details of 1,583,135 users
with little or no blog information were obtained. What is more, details of 900,000 users
together with their 100 most recent posts were obtained. All this information will be used
in testing our classifiers.
3.2 Implementation
3.3 Actual System Design
This project is divided into 3 stages: Data Gathering, Data Pre-processing and Data Mining. Software systems were implemented for Data Gathering and Data Pre-processing because of the relatively large datasets involved. An open-source software package called WEKA is used for Data Mining.
Ground truth data of real users and zombie accounts, as well as a set of random accounts from Weibo.com, are obtained in the Data Gathering stage. The fields and features are then analysed and pre-processed in the Data Pre-processing stage. Finally, the data are used in the Data Mining stage.
3.3.1 Data Gathering
3.3.1.1 Requirements
As described above, gathering only about 50K rows of user IDs for the ground truth data is relatively simple: a sequential program on a 12 Mbps internet connection took less than 12 hours for this task. However, getting more detailed information for 50K users, such as their microblogs, takes much longer.
33 Each user ID corresponds to a Weibo user; the main page of the user can be accessed at HTTP://weibo.com/u/[ID] .
In fact, Weibo.com displays 20 user IDs per page when browsing the follower/fans list of an account, so each HTTP request yields 20 more user IDs for a sequential program. By contrast, Weibo.com displays only 10 microblogs per page for one user. If we want blog information, say 100 microblogs, we need 10 HTTP requests to Weibo.com, plus one HTTP request for the account details that are public to everyone: 11 HTTP requests per account. In the worst case of 1 million users, that is 11 million HTTP requests to Weibo.com. A single machine running a sequential program would take far too long to finish this task, and its IP address would likely be treated by Weibo as a DDoS attacker for making so many requests. Last but not least, the limited internet connection in my rented flat (maximum inbound speed 1.5 MB/s) becomes the bottleneck, since each HTTP request retrieves not only the simple information needed for this project but also an HTML-formatted webpage containing many bytes of unnecessary information.
To sum up, the software system for this project should satisfy the following requirements for data gathering:
1. The system can make a large number of HTTP requests in parallel, or even across distributed machines.
2. The speed at which the system gathers information should not be bounded by the internet speed in my flat.
3. The system should avoid being marked as a DDoS attacker for making too many HTTP requests.
4. In addition, the system should be as fast as possible in obtaining, processing and storing the data.
5. Data gathered by the system should be easily accessible: any program in the following stages should be able to read the millions of rows of data at high speed.
After analysing the above requirements and planning thoroughly, I designed and implemented a distributed information crawling system.
3.3.1.2 Software System Design
3.3.1.2.1 Choice of Database
MariaDB
MariaDB is a community fork of MySQL, an open-source relational database management system. It is easy to use and can handle simply structured datasets very efficiently.
MongoDB
MongoDB is an open-source document database designed for ease of development and scaling.34 It is used because nested data structures are stored in this project, such as the microblogs of a user stored together with the other account details.
3.3.1.2.2 Choice of programming language
Considering that I need a software system that can run in parallel and on distributed machines, a cross-platform language is preferable. Thanks to its design goal of portability, a Java program can be executed on most platforms on which the Java Runtime Environment (JRE) is installed. What is more, I have mainly programmed in Java over the past few years, so Java is the best choice for implementing the desired system.
3.3.1.2.3 Use of Proxy
In order to avoid being treated as a DDoS attacker and banned from accessing Weibo.com, multiple IP addresses are required. During the preliminary research on zombie accounts, I
34 https://docs.mongodb.org/manual/
found that the website TKDaili.com offers highly anonymous proxies35 at inexpensive prices. The website also offers an API through which programmers can retrieve the IP addresses of proxies easily.
3.3.1.2.3.1 Proxy Anonymity Check
In order to check the anonymity and availability of the proxies from TKdaili.com, a website or host is needed. There are many online proxy-anonymity check websites, but they are very slow to connect to: the index pages of these websites contain unnecessary information; most of the proxies are based in China (TKdaili.com is a Chinese provider), yet no good China-based proxy check website could be found; and, last but not least, these websites are very unstable, with no guarantee of uptime or connection speed. An Amazon EC2 instance is therefore used as the host, running a PHP server that holds the proxy checker. In testing, it takes only 200 ms to 500 ms to check a single proxy, even one based in China. Moreover, the Amazon EC2 instance is very stable and guaranteed to be available.
3.3.1.2.4 JPipe Framework
This framework was implemented to handle concurrent work in a task-parallel pattern. When I started this project there was no known object-oriented pipelining implementation in Java, apart from the default stream pipelining package java.nio.pipe, which only handles data as bytes or strings and offers very little functionality for task management and parallel thread control. I developed JPipe for the parallelism and pipelining needed to retrieve and process data at scale.
35 Proxy anonymity has three levels: High Anonymous/Elite, Anonymous and Transparent. If an HTTP request is made through a high-anonymity proxy to a target host, that host will only know that a user at the proxy address is visiting, which protects your identity. More information can be found here: http://www.proxynova.com/proxy-articles/proxy-anonymity-levels-explained/
JPipe is a producer-consumer based, object-oriented pipelining framework whose purpose is to make pipelined work easy to create in Java. The basic idea and structure of the framework are shown below:
Figure 5 Basic Pipeline structure using JPipe
The basic structure of a pipeline consists of multiple "Pipe Sections". Each Pipe Section has a number of workers doing the same work. Objects produced by the workers are saved into a buffer, from which they can be polled by the workers of another Pipe Section. A Pipe Section keeps a record of the status of its child workers, such as the number of successful/failed jobs, the throughput and latency of each worker, and the number of consecutive successful jobs. A Pipe Section object can output these states as a JSON string for further adjustment or monitoring purposes. Moreover, if enabled, a Pipe Section can dynamically change the number of its child workers; if the states of all Pipe Sections are used properly, programmers can build a pipeline that dynamically adjusts its workers according to its bottleneck.
What is more, a plain feed-forward pipeline is not always sufficient, depending on the complexity of the problem. Therefore a Buffer Store that manages all buffers is implemented; workers from any section can access any buffer, allowing the programmer to build more complicated pipelines, for example:
Figure 6 A simple Pipeline structure
PipeSection A produces objects of type X and saves them into buffer B1; PipeSection C produces objects of type Y and saves them into buffer B2. Workers in PipeSection B need both an Object X and an Object Y to produce an Object Z, so they poll from buffers B1 and B2 and save each Object Z into B3; finally, workers from PipeSection D poll the results from B3 for further processing.
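JPipe's code is not reproduced in this text, so the following is only a rough sketch of the idea behind a Pipe Section (class and method names are mine, not the real JPipe API): N identical workers take from an input buffer and put results into an output buffer, with `BlockingQueue` playing the role of the buffers.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Rough sketch of a JPipe-style Pipe Section: a fixed pool of identical
// workers that consume from one buffer and produce into another.
final class PipeSection<I, O> {
    private final ExecutorService pool;

    PipeSection(int workers, BlockingQueue<I> in, BlockingQueue<O> out,
                Function<I, O> work) {
        pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        I item = in.take();        // consume from the upstream buffer
                        out.put(work.apply(item)); // produce into the downstream buffer
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

A Buffer Store would then simply map buffer names to such queues, so a section like PipeSection B in Figure 6 can take from two input buffers.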
The Buffer Store also monitors the progress of the pipeline work. The framework is thread-safe, and all read/write operations on buffers are locked. The Buffer Store creates a detailed JSON string describing the state of every thread of each Pipe Section, including the lock state of each worker and the number of objects consumed from and produced to each buffer. JPipe not only lets programmers build complex pipeline programs easily, since it offers parallelism out of the box, but also lets them debug the parallel program with ease, because the states of workers and buffers are monitored in detail.
3.3.1.2.5 PipeCrawler
PipeCrawler is a Java project built on JPipe for making HTTP requests, obtaining webpage data, pre-processing the data and storing it. Using socket connections and JPipe pipelining techniques, I created a distributed master-slave web crawler that can gather information from Weibo.com quickly and with minimal resources.
This crawler is a compound of multiple pipeline programs. Depending on the execution arguments, the program forks into different instances, all implemented as JPipe pipelines. There are 4 different types of instance:
Server Instance
The server instance is the master of all the other instances. It hands out the information and resources needed for the different jobs (such as proxies from TKdaili.com and raw users) to the slave instances, collects the data they retrieve, and saves these data to the databases accordingly. It also monitors the status of all slave instances, such as their response times and hostnames. Figure 7 gives a brief view of the server instance's pipeline; the details of the workflow are discussed in a later section.
Figure 7 The pipeline structure of the server instance
Slave Instances
Shared code between slave instances
All slave instances have 2 pipe sections and 3 buffers in common: a proxy validator section and a socket connector section; a raw proxy buffer, a valid proxy buffer and a message buffer.
The socket connector section has only one worker, which connects to the server's socket receiver whenever there is a message object in the message buffer. A message can be a request for more raw proxies, a set of product objects from this instance, and so on.
A proxy is raw when its validity is unknown. After retrieving raw proxies from the server, the workers in the proxy validator section try to validate the usability of each proxy from the raw proxy buffer by connecting through it to the proxy anonymity test website; if the proxy works, it is saved into the valid proxy buffer for further use. Proxies must be validated because the raw proxies offered by TKdaili.com have a very short life span, from a few minutes to a few hours, and may already have expired by the time the slave gets them from the server. In addition, not all of these proxies are highly anonymous (about 90% are, by my tests).
All threads that make HTTP requests to Weibo.com use validated proxies from this buffer, and no proxy is shared between any two threads. In this way, Weibo.com is deceived into seeing each of our crawling threads as a single normal user from a different IP address.
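The validation step can be sketched as follows (the class name and checker URL are illustrative, and the anonymity check itself is omitted): a worker tries to reach the checker host through the proxy within a timeout, and a proxy that answers with HTTP 200 is considered usable.

```java
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;

// Sketch of a proxy validation worker: an expired or dead proxy either
// times out or throws, and is simply reported as unusable.
final class ProxyValidator {
    static boolean isUsable(String proxyHost, int proxyPort,
                            String checkerUrl, int timeoutMs) {
        try {
            Proxy proxy = new Proxy(Proxy.Type.HTTP,
                                    new InetSocketAddress(proxyHost, proxyPort));
            HttpURLConnection conn =
                (HttpURLConnection) new URL(checkerUrl).openConnection(proxy);
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            int code = conn.getResponseCode(); // performs the request through the proxy
            conn.disconnect();
            return code == 200;
        } catch (Exception e) {
            return false; // unreachable, expired or misbehaving proxy
        }
    }
}
```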
Raw User Crawler Instance
The job of the raw user crawler instances is to obtain as many user IDs as possible. A user is raw when only the user ID of that Weibo user is known. This instance retrieves a list of raw users and a list of raw proxies from the host and uses them to get more raw users. Weibo.com allows a visitor to view the first 20 pages of a user's follower list and the first 50 pages of a user's fans list, which contain only the names and user IDs of those users. Together, a maximum of 20*20 + 50*20 = 1400 raw users can be obtained from one user ID. Duplicate users will no doubt be found, but they are ignored by the MariaDB database: inserting a duplicated ID into the table raises an error that is caught and handled by the server instance.
By modifying part of the code, this instance is also used to get the real users for the ground truth data. Given the 300 real raw users, and with the code that crawls fans disabled, the instance gets all followers of the 300 real users, returning to the server the list of about 25K real users described previously.
Account Detail Crawler Instance
The job of this instance is simpler. It takes the user ID of each raw user and obtains further user information such as registration time, fans/followers counts, gender, and so on. The result is called a detailed account; it is sent back to the server and stored in the MariaDB database. These user attributes will be selected and further processed.
Microblog Crawler Instance
This type of instance takes a detailed account and obtains the first 10 pages36, with 10 microblogs per page, for that account. A nested object containing all the account details plus the details of up to 100 microblogs is then sent back to the server and stored in MongoDB.
3.3.1.3 Hardware Environment
A maximum of 34 computers and 1 Amazon EC2 instance were used to gather the evaluation data.
DELL XPS15 L502x laptop
Quantity 1
Specification 4 Core CPU
8 GB RAM
NVidia GeForce 460M graphic card
36 Due to time and space limits, 10 pages were chosen.
120 GB SSD
Usage Running Server Instance
Pre-processing collected data
Note The original specification of this laptop did not include an SSD; an SSD was bought solely for this project's data storage, to gain faster read/write speeds for the databases.
Beowulf Cluster Nodes in MACS, Heriot-Watt University
Quantity 33
Specification 8 Core CPU
12 GB RAM
NVidia GeForce 520 graphic card
Usage Running different Slave Instances
Amazon EC2 instance
Quantity 1
Specification 1 Core CPU
1 GB RAM
EBS37 only
Usage Running Proxy Validating Website, implemented in PHP
37 Amazon Elastic Block Store, https://aws.amazon.com/ebs/
3.3.1.4 Overall Design & Implementation
Figure 8 explains the overall structure of the data crawling system for Weibo.com. The master instance running on the home server first retrieves raw proxies from TKdaili.com and distributes them to the slave instances. The slave instances validate the proxies against the PHP server I set up on the Amazon EC2 VPS and save the validated proxies into a buffer for further use. Depending on its type, each slave instance requests the corresponding task information from the master instance, connects through the proxy servers to retrieve data from Weibo.com, and returns the results to the master once enough completed task results have accumulated in its buffer. Finally, the master instance saves the results into the databases.
Figure 8 The overall implementation of the data crawling system
Each slave instance runs about 80 threads concurrently: 20 threads for proxy validation against the Amazon EC2 server and 60 threads for obtaining data from Weibo.com through proxies. These numbers were tuned manually for maximum performance. With fewer proxy-validation threads, the 60 crawling threads run out of valid proxies to connect to Weibo, because the proxies have a very short life span; with more threads connecting to Weibo, the instance becomes unstable, because too many socket connections are open.38
Considering that the instances request data from the mobile host of Weibo.com, the above setup is equivalent to 33*60 = 1980 sequential programs connecting to Weibo simultaneously, which is why I was able to collect a huge dataset for this project in a short time.
Figure 9 shows the steps in which account information is gathered. First, for the ground truth data, the set of real user accounts was obtained by running a raw user crawler instance on the 300 initial real raw users, and the set of zombie accounts was simply collected by a sequential C# program. The evaluation data, whose account identities are unknown, were obtained by a raw user crawler instance given 9 random raw users and 1 selected user.39 Afterwards, all raw users were fed to the account detail crawler instances, which obtained the detailed account information, followed by the microblog crawler instances, which collected the most recent 100 microblogs of each user.
Figure 9 The data flow of account gathering: raw users (from manual gathering with a simple C# program and from the raw user crawler instance) are saved to MariaDB; the account detail crawler instance enriches them into account details (MariaDB); the microblog crawler instance then produces account details with microblog details (MongoDB).
38 The Java HttpClient package may open multiple socket connections for each HTTP request, and with many threads doing this the process eventually runs out of file descriptors, since it fails to close these connections immediately: under Linux, each socket connection has to pass through two states (TIME_WAIT and CLOSE_WAIT) before closing. Although I implemented several mitigations, I did not find a perfect solution to this.
39 This selected user is called Sina Weibo Helper; every newly registered account on Weibo has it as a fan, and this fan cannot be deleted. Using this account as an initial raw user and crawling its follower list therefore always yields the most recently registered users.
3.3.1.5 Conclusion of Data Gathering
3.3.1.5.1 Gathered Data
It took about 40 days to gather all the data needed for this project. In total, 35.58 GB of data was collected: 17.2 GB stored in the MariaDB database and 18.4 GB stored in the MongoDB database.
In detail, 26,313 real accounts with their details and most recent 100 microblogs were collected, along with 21,740 zombie accounts with their details and most recent 100 microblogs. Moreover, 897,343 accounts with their details and most recent 100 microblogs were collected for evaluation purposes. I will call these datasets "Good" because they have all the information needed for this project.
What is more, 26,240,000 raw accounts were collected, account details were obtained for 24,543,720 of them, and the set of 897,343 accounts randomly selected from these was crawled for blogs.
Nevertheless, due to a software bug40, 1,536,193 accounts were collected with details but only 0 to 100 recent microblogs. In addition, due to the same bug, 20,536 rows of data, consisting of 4,262 real-account details and 16,274 zombie-account details with incomplete blog information, were collected; this is essentially the same set of users as the "Good" dataset but with improperly crawled microblog information. Although these data seem useless, since the information is insufficient, I would still like to see how the classifiers perform on them. In later sections these data are referred to as "Incomplete".
40 Because I did not handle the case where Weibo.com actually treated a proxy as a DDoS attacker and blocked its IP, the microblog crawler instances sometimes returned their results to the server earlier than expected. The problem was found at a late stage of the project, which is why there is less valid data than bugged data.
3.3.1.5.2 Cost
Gathering a dataset with millions of rows requires more than time and a good program. In order to obtain the data efficiently and store it reliably, money was spent in the areas listed below.
Spent on  Amount (in Chinese Yuan)  In GBP
25,000,000 Proxies ¥800 £87.21
34,500 Zombie accounts ¥1053 £114.80
40 days of Amazon EC2 Instance £8.82
Sum £210.83
3.3.2 Data Pre-processing
Before feeding the datasets into classifiers and tuning for the best results, it is important to study and understand the collected data, and to select or create useful features for further learning.
3.3.2.1 Feature study & extraction of Collected Data
Fields of Account Details
Table 1 below lists each field of the account detail data and its data type, and states whether it is selected for further data mining. These fields are crawled directly from Weibo.com. The selection and extraction are based on 3 rules: 1. whether the attribute provides information about the activity of the account; 2. whether the attribute is easily obtained and analysed; 3. whether the attribute can provide information for studying zombie accounts.
Feature name  Data Type  Detail  Selected  Select reason
uid  Long  A unique large integer representing the user ID of the account on Weibo.com. This field is possibly incremental.  Yes  It is possible that zombie accounts are registered sequentially in large numbers by a program, so uid can potentially be a good source for detecting them.
Gender  Nominal  0 if the account is female, 1 otherwise
Yes It is possible that zombie
accounts are more likely to be
male, since it is the default
option when registering and may
be kept.
Name  String  The unique string that represents the user  Processed  Will be pre-processed before use
att_num  Integer  The number of followers of the user  Yes  Basically all previous research used this attribute, because it gives information about the activity of the account.
fans_num  Integer  The number of fans of the user  Yes  Same as att_num
avatar_img  String  The URL string of the user's thumbnail icon  Processed  Will be converted into a Boolean value: true if it is the default image, false if customised by the user. This also gives information about user activity.
background  String  The URL string of the background image of the user's homepage  Processed  Same as avatar_img
blog_num  Integer  The number of microblogs this user has posted  Yes  User activity
create_time  Integer  Linux timestamp of when this user registered on Weibo.com  Yes  The registration time can potentially be useful given enough training data, if zombie accounts are registered massively by a program.
description  String  The self-description of the user  No  This field could be very useful for distinguishing the user's identity; however, since most descriptions are in Chinese, there is no simple way to process them by machine, which is out of the scope of this project.
member_type  Integer  A number indicating the type of the account; can be one of [0, 2, 11, 12, 13, 14]  Yes  Although the meaning of this field is unclear, it is a potentially distinctive feature. To be analysed in the next section.
native_place  String  The name, in Chinese, of the city where the user is located.  Yes  Considering the research of (Deng, et al., 2015), location-based classification can potentially be useful, regardless of its unsound methodology.
verified  Boolean  Whether this user is verified for the name it uses. E.g. if Obama opens a Weibo.com account and is verified to be him according to some procedure, then this field is true.  Yes  A verified user is more likely to be a real user. However, only a very small fraction of users are verified by Weibo.com.
v_type  Integer  A number representing the type of verification: a personal account, an organisation or an authority. Can be one of [-1, 0, 2, 3, 4, 5, 6, 7, 10, 200, 220]
Yes Same as verified.
Table 1
Fields of Microblogs
As described in the previous section, the most recent 100 microblogs of each selected account were obtained as well; Table 2 below exhibits the fields of a microblog in detail.
Feature name  Data Type  Detail  Selected  Select reason
postid  Long  A unique large integer representing the blog ID of this microblog on Weibo.com.  No  This field does not give any useful information.
timestamp  Integer  Linux timestamp of when this microblog was posted  Processed  According to (Zhang & Vern, 2011), blog timestamps are very useful for detecting automation. However, the timestamp of a single post does not give much information, so it is processed together with the other posts made by the same user for further analysis.
repost_count  Integer  The number of times this microblog was forwarded or reposted by other users  Processed  This figure gives information about user interaction, because the posts of real users are more likely to be reposted.
comments_count  Integer  The count of comments on this microblog by any users  Processed  Same as repost_count
att_count  Integer  The count of "Likes" of this microblog  Processed  Same as repost_count
picture_count  Integer  The number of pictures in this microblog  No  This field should be useful, because zombie accounts may behave differently when posting, e.g. making fewer image posts. However, this was not considered when implementing the gathering system, and the data was not saved.
Is_retweet  Boolean  Whether this microblog is a repost of another microblog from another user  Processed  The repost probability is calculated for every account; a zombie account may repost at a certain random rate.
text  String  The text content of this microblog  No  Since the majority of posts are in Chinese, there is no easy way of analysing them.
Table 2
The 3 fields marked "Processed" in Table 1 and all selected fields from Table 2 are processed further before being used in the data mining stage; more information can be extracted from them with proper algorithms or analysis. Any time-consuming method for analysing the name field is not an option, since we are dealing with large-scale data, so I created a simple processing method that generates the following new Boolean fields from the name:
Field Name  Detail
name_has_character  If the name contains Chinese characters
name_has_letter  If the name contains English letters
name_has_number  If the name contains digits
name_has_rare_char  If the name contains rare Chinese characters outside the range of the 3500 common Chinese characters
name_has_symbol  If the name contains anything other than digits, letters or Chinese characters
name_is_mixture  If the name is a combination of more than one of: Chinese characters, digits, letters or symbols
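A sketch of these checks in Java is shown below (method names are mine; the CJK Unified Ideographs block approximates "Chinese character", and the 3500-common-character test is omitted because it needs the actual character list):

```java
// Sketch of the name-derived Boolean features from the table above.
final class NameFeatures {
    static boolean hasChinese(String name) {   // name_has_character (approximate)
        return name.codePoints().anyMatch(c -> c >= 0x4E00 && c <= 0x9FFF);
    }
    static boolean hasLetter(String name) {    // name_has_letter
        return name.chars().anyMatch(c -> (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
    }
    static boolean hasNumber(String name) {    // name_has_number
        return name.chars().anyMatch(c -> c >= '0' && c <= '9');
    }
    static boolean hasSymbol(String name) {    // name_has_symbol
        return name.codePoints().anyMatch(c ->
            !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'z')
            && !(c >= 'A' && c <= 'Z') && !(c >= 0x4E00 && c <= 0x9FFF));
    }
    static boolean isMixture(String name) {    // name_is_mixture
        int kinds = (hasChinese(name) ? 1 : 0) + (hasLetter(name) ? 1 : 0)
                  + (hasNumber(name) ? 1 : 0) + (hasSymbol(name) ? 1 : 0);
        return kinds > 1;
    }
}
```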
Pre-processing on selected fields
As described in Table 1, fields such as avatar_img and background are processed into Booleans indicating whether they hold the default value. Two new fields, is_default_avatar_img and is_default_background, take their place and are used in further analysis.
The pre-processing of the microblog fields is more statistical. For each account, the repost_count, comments_count, att_count and is_retweet fields of all of the most recent 100 microblogs are summed and their mean values computed, generating avr_blog_repost, avr_blog_comment, avr_blog_att and avr_blog_is_retweet for each account.
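As a sketch (method and class names are mine), the averaging step amounts to:

```java
// Sketch of the per-account averages over the most recent (up to 100) microblogs.
// mean() serves avr_blog_repost / avr_blog_comment / avr_blog_att;
// retweetRate() serves avr_blog_is_retweet (the fraction of posts that are reposts).
final class BlogAverages {
    static double mean(int[] counts) {
        if (counts.length == 0) return 0;
        double sum = 0;
        for (int c : counts) sum += c;
        return sum / counts.length;
    }
    static double retweetRate(boolean[] isRetweet) {
        if (isRetweet.length == 0) return 0;
        int reposts = 0;
        for (boolean b : isRetweet) if (b) reposts++;
        return (double) reposts / isRetweet.length;
    }
}
```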
The automation detection method of (Zhang & Vern, 2011) transforms each timestamp into the second of the minute and the minute of the hour, and uses Pearson's χ2 test to detect the uniformity of the blogs' timestamps. Adapting this idea in a simplified form, the timestamps of the microblogs are converted into the minute of the day (for example, a microblog posted at 3:13 am is posted in the 3*60+13 = 193rd minute of that day). With all these minutes of the day, the test statistic is computed and Pearson's χ2 tests with a bin count of 240 (24 if the user has fewer than 24 microblogs) are carried out at significance levels of 0.1, 0.05 and 0.025, generating 3 new fields respectively (BT = blog time): BT_chisquretest_010, BT_chisquretest_005 and BT_chisquretest_0025. These fields are intended to show how uniformly the timestamps are distributed through the day at different significance levels. The expectation is that, in the long run, automated zombie accounts will have a more uniform distribution of posting times, whereas human beings, who need to rest and work, post within a limited range of times. In addition, the mean, median and variance of the minute of the day are also calculated, for the same reason, generating 3 new fields: BT_mean, BT_median and BT_variance.
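The χ2 test can be sketched as follows. For simplicity the sketch uses 24 bins (the small-sample case from the text) and hard-codes the 0.05 critical value for 23 degrees of freedom, about 35.17; the 240-bin case and the other significance levels would use their own critical values. Class and method names are mine.

```java
// Sketch of the minute-of-day uniformity check: a Pearson chi-square
// goodness-of-fit test of the posting minutes against a uniform distribution.
final class BlogTimeTest {
    static final int BINS = 24;
    static final double CHI2_CRITICAL_005_DF23 = 35.17; // 0.95 quantile, df = 23

    /** minuteOfDay values in [0, 1440); true if uniformity is rejected at 0.05. */
    static boolean rejectsUniformity(int[] minuteOfDay) {
        int[] observed = new int[BINS];
        for (int m : minuteOfDay) {
            observed[m * BINS / 1440]++;   // 60-minute bins
        }
        double expected = (double) minuteOfDay.length / BINS;
        double chi2 = 0;
        for (int o : observed) {
            chi2 += (o - expected) * (o - expected) / expected;
        }
        return chi2 > CHI2_CRITICAL_005_DF23;
    }
}
```

An account that posts at the same minute every day is rejected immediately, while posts spread evenly over the day are not; the BT_chisquretest_* fields record this kind of outcome.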
What is more, as an addition to (Zhang & Vern, 2011), I consider that the interval between posts should also provide valuable information: I expect a program to control these accounts in a loop, and if that is the case, the post time interval can show it clearly. As a result, I created 3 more fields: BT_I_mean, BT_I_median and BT_I_variance. This idea is analysed in the data mining section.
Last but not least, a single field, ff_ratio, is computed for each account as its followers/fans ratio, simply because the initial analysis of zombie accounts suggested that these accounts tend to have similar followers/fans ratios.
In conclusion, pre-processing created new fields and discarded unnecessary ones; together, the final dataset for experiments and evaluation has 35 fields including the class field, as listed in Appendix C. The 14 original fields are shown in Table 1 and the 22 analytical fields created by pre-processing are listed in Appendix B.
3.3.2.2 Creating different data set
After pre-processing the data according to the above methods, two MySQL tables of ground truth data, one for zombie accounts and one for real accounts, were created, each with an extra field called accountClass for classification, whose value is 0 (real user) or 1 (zombie). The two tables were then merged and randomised, creating the final training table.
On the other hand, 897,343 rows of random account details with recent-100-post information were processed in the same way, and 1,583,135 random rows without properly crawled post details were also processed for testing purposes. A brief summary of the processed data is shown in the following table.
Dataset Number of rows
Good training set 31705
Good testing set 16333
Good evaluation set 897343
Incomplete testing set 50956
Incomplete evaluation set 1583135
Table 3
4 Data Mining & Classifier Evaluating
In this section the machine learning tool WEKA is used. WEKA is open-source software: a collection of machine learning algorithms for data mining tasks, with many useful tools for data pre-processing, classification, regression, clustering and visualisation41.
In this section, the training dataset is first visualised and patterns in the fields are analysed. Characteristics of zombie accounts are concluded from these patterns, which hint at the manner in which the zombie accounts are created. Moreover, the different purposes of zombie accounts may also be discernible from the data.
41 http://www.cs.waikato.ac.nz/ml/weka/
After the initial analysis, classifiers using different learning algorithms, such as naive Bayes, SVM and decision trees, are trained on the data. These classifiers are then tested on the training, testing and evaluation datasets respectively.
Finally, a combination of classifiers using a voting algorithm is tested and evaluated in pursuit of better performance.
4.1 Initial analysis of features with visualisation
Using the visualisation functionality of WEKA, 35 histograms (one per field) plus 35*35 = 1225 plots of field pairs were created. The following graphs are particularly interesting.
Figure 10 The distribution of the field of native_place. X axis is the different place in Chinese, and Y axis indicates the class (0=real user, 1=zombie)
It can be seen from Figure 10 that the counts of real users from different places are not balanced, which is expected, since the user population differs from place to place. The distribution of zombie accounts, by contrast, is nearly uniform; it is a safe guess that the programs that registered the zombie accounts did not consider the distribution of the user population and used a uniformly distributed random place. This picture indicates that the native place is useful to some extent, as claimed in (Deng, et al., 2015). Deng et al. use a hard-coded threshold on the number of followers sharing the native_place field with the user being classified; in this project, however, location information is obtained for that user only, since their methodology is still in doubt.
Figure 11 The stacked histogram of the field create_time; the X-axis is in Unix timestamps. The blue part is real users, the red part is zombie accounts, and the height is the total number of users registered during that period of time.
As shown in Figure 11, the create_time field of the training data ranges from Friday 14 August 2009, 20:49:13 GMT+8 to Friday 8 April 2016, 20:15:44. The number of registered real users had its first peak at the opening of Weibo.com, followed by another peak one year later, with new registrations decreasing ever since. The zombie accounts, on the other hand, started in very small numbers and kept growing. Moreover, their registrations reached two abnormal peaks in July 2012 and September 2012, when Weibo was considered the most popular website in China. These two peaks support the hypothesis that many zombie accounts are registered in batches by programs within short periods of time.
Figure 12 The histogram of BT_mean, the mean blogging time as a minute of the day. Note that times in the database are saved as Unix timestamps in the GMT time zone, so non-zero values should have 8*60 = 480 minutes added to obtain the correct figure for Chinese users.
Figure 12 shows the distribution of the mean time at which users write new microblogs each day. Apart from the users who post no microblogs (whose BT_mean is 0, shown as the little peak at the origin), the distribution of the mean time of real users exhibits a healthy normal distribution, whereas that of zombie accounts shows an unexpected single peak. Hypothetically, this peak is caused by massive numbers of zombie accounts that post at exactly the same time, generating the same mean time in this graph.
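The BT_mean computation can be sketched as follows. This is an illustrative reconstruction, not the project's actual pre-processing code; the function names and the stdlib-only style are my own assumptions:

```python
# Illustrative sketch (not the project's actual code): computing BT_mean, the
# mean posting time as a minute of the day, from Unix timestamps stored in GMT.
# The 480-minute shift (8 * 60) converts GMT to Chinese time (GMT+8).

def minute_of_day(ts_gmt, tz_offset_min=480):
    """Convert a Unix timestamp (seconds, GMT) to the minute of the day in GMT+8."""
    return (ts_gmt // 60 + tz_offset_min) % (24 * 60)

def bt_mean(timestamps):
    """Mean posting minute over a user's microblogs; 0 for users who never post."""
    if not timestamps:
        return 0
    minutes = [minute_of_day(t) for t in timestamps]
    return sum(minutes) / len(minutes)
```

Users with no microblogs fall back to 0, which is exactly why the small peak at the origin appears in the histogram.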
Figure 13 The partial plot (points with ff_ratio above 200 are not shown) of create_time as the X-axis and ff_ratio (follower/fan ratio) as the Y-axis.
Previous work such as (Deng, et al., 2015) and (Jiang, et al., 2015) uses manually hard-coded thresholds on the follower/fan ratio, and Figure 13 to some extent justifies their methodology. Zombie accounts show a wide range of follower/fan ratios, from 0 to more than 200, whereas real users are mostly below 5 to 10. Their problem was the lack of a sound set of ground truth data, so their manually chosen threshold values are less sound. The better practice is to let the classification algorithms decide the best line separating zombie accounts from real users.
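The idea of letting an algorithm choose the split, rather than hand-picking a ratio, can be illustrated with a one-feature "decision stump". This is a sketch with made-up data, not the project's code or WEKA's implementation:

```python
# Illustrative decision stump: pick the ff_ratio threshold that minimises
# misclassifications on labelled data, instead of hard-coding one by hand.
# Data and names here are hypothetical.

def best_threshold(ratios, labels):
    """Return (threshold, accuracy): classify ff_ratio > threshold as zombie (1)."""
    best = (0.0, 0.0)
    for t in sorted(set(ratios)):
        correct = sum(1 for r, y in zip(ratios, labels) if (r > t) == (y == 1))
        acc = correct / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best
```

On toy data such as ratios [1, 2, 3, 50, 100, 200] with labels [0, 0, 0, 1, 1, 1], the stump finds a perfectly separating threshold on its own.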
In addition to Figure 11, Figure 13 reveals the following behaviour of the zombie accounts. There are several periods of time on the graph during which the registered zombie accounts have significantly higher ff_ratio, and most zombie accounts have a higher ff_ratio than real users. This implies that the purpose of these accounts is exactly the service offered by the sellers on Taobao.com: zombie fans that follow targeted accounts.
Figure 14 The plot using avr_blog_is_retweet as the X-axis and ff_ratio as the Y-axis (a few points with ff_ratio above 125 are not shown).
One of the plots giving the most separable data is Figure 14. This graph not only indicates a distinct difference between zombie accounts and normal accounts, but also exhibits the basic behaviour pattern of zombie accounts. To generalise from this graph: as shown on its left side, where most red points are plotted, accounts with a low avr_blog_is_retweet (the percentage of microblogs that are simply reposts of other users) and an ff_ratio above a certain amount have a very high probability of being zombies. Moreover, there is an obvious pattern of vertical straight lines on the left side of the plot of zombie accounts, formed by many different accounts sharing exactly the same retweet ratio, which implies that these accounts are controlled by programs. Real users, by contrast, have randomly distributed retweet ratios. I randomly visited a few of the zombie accounts shown on the left side of the graph and, unsurprisingly, found that they had posted many microblogs such as advertisements and chicken-soup-for-the-soul-styled text, trying to mimic human behaviour. This method of imitation, however, can be one of the biggest flaws in hiding from machine learning.
In short, the initial analysis of the data shows some significant differences between zombie accounts and real users, indicating that the classes of the ground truth data obtained are potentially separable.
4.2 Base Line
The baseline correct rate is 54.7%, because the ground truth data contains 54.7% real users. The training set and testing set are both subsets of the ground truth data and so share the same baseline. This does not apply to the evaluation data, as all of its instances are randomly crawled from Weibo.com.
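The baseline is simply the majority-class rate, which any learned classifier must beat. A minimal sketch (illustrative, not the project's code):

```python
# Majority-class baseline: always predicting the most common class yields
# an accuracy equal to that class's share of the data.
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)
```

With 547 real users and 453 zombies per 1000 labelled accounts, this gives the 54.7% figure above.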
4.3 Single Classifier Experiments
In this subsection, the 4 types of classifiers described in the literature review are trained on the training data set using 10-fold cross validation: the training data is split into 10 parts, and each part is evaluated against the classifier trained on the other 9 parts. The generated classifier is then tested on the testing data set. If the accuracy is good, the classifier is used on the evaluation set, where the class of each account is unknown.
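The 10-fold split can be sketched as a pure-Python index partition (illustrative; WEKA performs this internally, and shuffling/stratification are omitted here):

```python
# k-fold cross-validation index split: each fold serves once as the held-out
# test part while the remaining folds form the training part.

def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs covering all n instances exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Every instance appears in exactly one test fold, so the k accuracy figures can be averaged into a single cross-validated estimate.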
4.3.1 Naive Bayes
4.3.1.1 Using all fields
Firstly, the naive Bayes classifier is trained using all available fields of the training set and tested on the training set itself. The confusion matrix shows an unpromising result:
                    Classified
                    Real      Zombie
Expected  Real      6364      10976
          Zombie    80        14285
Table 4
Many real users are classified as zombie accounts, giving a high number of false positives, while very few zombie accounts are classified as real accounts. Overall, this classifier correctly classifies 65.1% of instances. Considering this is only about 10% above the baseline, and that the result comes from evaluating the classifier on the training set itself, more tuning has to be done for the naive Bayes classifier.
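The 65.1% figure can be reproduced from the counts in Table 4 (a small illustrative calculation, not part of the project's code):

```python
# Accuracy from a 2x2 confusion matrix:
# cm = [[real->real, real->zombie], [zombie->real, zombie->zombie]]

def accuracy(cm):
    """Fraction of correctly classified instances (diagonal over total)."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

table4 = [[6364, 10976], [80, 14285]]
round(accuracy(table4) * 100, 1)  # 65.1
```

The same helper reproduces the accuracy figures quoted for the later confusion matrices as well.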
4.3.1.2 Feature Selection
Using all fields of the training set is unsuitable for the naive Bayes classifier if not all of them provide useful information; therefore a forward feature selection algorithm is used. In this project, forward selection starts with 0 fields and adds fields one by one according to the increase in accuracy. A best-first search strategy is used: the field giving the best increase in accuracy is added first, until there is no further increase in correctness.
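The greedy loop just described can be sketched as follows. This is illustrative, not WEKA's AttributeSelection code; `score` stands for any evaluator (e.g. cross-validated accuracy of the classifier on the candidate subset), and all names are hypothetical:

```python
# Greedy forward feature selection: repeatedly add the single feature that
# most improves the score, stopping when no remaining feature helps.

def forward_select(features, score):
    """Return the selected feature subset, in the order features were added."""
    selected = []
    best = score([])
    while True:
        gains = [(score(selected + [f]), f) for f in features if f not in selected]
        if not gains:
            break
        top_score, top_feat = max(gains)
        if top_score <= best:
            break  # no further increase in correctness
        selected.append(top_feat)
        best = top_score
    return selected
```

Keeping `score` as a parameter makes the sketch classifier-agnostic, mirroring how WEKA wraps any base classifier in its attribute-selection search.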
The forward feature selection yields 9 fields plus the class field:
BT_I_median, BT_median, ff_ratio, is_default_background, member_type,
name_has_charactor, name_has_letter, uid, v_type, accountClass
4.3.1.3 Naive Bayes with selected features

Training set:
                    Classified
                    Real      Zombie
Expected  Real      15831     1509
          Zombie    3317      11048

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      8152      821
          Zombie    1686      5674

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    505874    391469

Table 5
Using these 9 fields, the naive Bayes classifier evaluated on the training set itself generates the first confusion matrix in Table 5. An overall accuracy of 84.78% is achieved, though more zombie accounts are now classified as real users, leading to a higher false negative rate.
The performance of the trained classifier on the testing set is very similar, with an almost identical accuracy of 84.65%, as shown in the second confusion matrix of Table 5. Since the testing set is half the size of the training set, these figures are as expected, implying that the trained classifier is not overfitting. The problem with the testing set result is the many false negatives: 1686 of 7360 (23%) zombie accounts are classified as real accounts.
Finally, I apply this trained classifier to the evaluation dataset, where the identities of the accounts are unknown and all instances are therefore labelled as zombies by default before classification. With this classifier model, 56.4% of the 897343 accounts are classified as real users and 43.6% as zombie accounts.
4.3.2 Decision Tree Classifier
WEKA provides a decision tree implementation named J48, which uses the selected attributes to build a decision tree classifier model from the given data. With the same training, testing and evaluation data as naive Bayes, the following confusion matrices are generated:
Training set:
                    Classified
                    Real      Zombie
Expected  Real      15911     1429
          Zombie    1629      12736

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      8173      800
          Zombie    815       6454

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    519708    377635

Table 6
As shown in Table 6, the decision tree classifier performs better on the same data sets. It has an accuracy of 90.35% in predicting account identity on the training set and 90.11% on the testing set; 58% of the evaluation data are classified as real users and 42% as zombie accounts.
The decision tree generated using all 35 attributes has 428 leaves and a total size of 855 nodes, which is highly likely to be overfitting. Although it is commonly said that decision tree algorithms perform better with more features, I still wondered how J48 performs with fewer. Using the feature selection algorithm with J48, the following subset of features is selected, excluding accountClass:
BT_I_Variance, avr_blog_comment, avr_blog_is_retweet, avr_blog_like, blog_num,
create_time, fans_num, ff_ratio, member_type, v_type
Using these selected features, the same algorithm is executed again, and the following confusion matrices are generated:
Training set:
                    Classified
                    Real      Zombie
Expected  Real      16618     722
          Zombie    980       13385

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      8235      738
          Zombie    798       6562

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    537759    359584

Table 7
The performance of the decision tree algorithm with feature selection is higher than with all features: 94.6% accuracy on the training set and 90.6% on the testing set, with 60% of the evaluation data classified as real users.
4.3.3 Support Vector Machine
The Support Vector Machine classifier is implemented in WEKA as SMO. Because SMO is very slow to train compared to the other classifiers, the feature selection algorithm is not applied to it, being too time-consuming (after more than 30 hours, model building had not completed). All 35 attributes are used as features. The SMO implementation is configured to use logistic regression as a calibrator, and its results on the different datasets are shown in Table 8.
Training set:
                    Classified
                    Real      Zombie
Expected  Real      15481     1859
          Zombie    2420      11945

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      7968      1005
          Zombie    1279      6081

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    466583    430760

Table 8
The performance of SVM is comparable to naive Bayes: approximately 86.5% accuracy on the training set and 86.0% on the testing set. The evaluation result differs, however: only about 52% are classified as real users and 48% as zombies.
4.3.4 Multi-Layered Perceptron
Using WEKA’s default implementation of MLP, with a learning rate of 0.3 and a momentum of 0.2, the classifier is trained on the training dataset. The following confusion matrices are produced:
Training set:
                    Classified
                    Real      Zombie
Expected  Real      16860     480
          Zombie    4380      9985

Testing set:
                    Classified
                    Real      Zombie
Expected  Real      8629      344
          Zombie    2392      4968

Evaluation set:
                    Classified
                    Real      Zombie
Expected  Real      0         0
          Zombie    715332    182011

Table 9
The accuracy of the MLP model is slightly lower than the other models': 84.6% on the training dataset itself and 83.25% on the testing dataset. MLP tends to give far fewer false positives but far more false negatives. As shown in the last matrix of Table 9, the majority of the evaluation set (79.7%) is classified as real users, leaving only 20.3% zombie accounts.
No feature selection method is used for the MLP classifier, because the backpropagation algorithm will train the weights of unrelated features, driving them close to 0 to reduce error. Moreover, feature selection with an MLP trained by backpropagation is time-consuming and therefore not attempted in this project.
4.4 Meta Classifiers
In this subsection, I list some meta-classifiers and use them with the previous classifiers to obtain either higher accuracy in training or less overfitting in testing and evaluation; implementation details are omitted.
4.4.1 Boosting
Boosting is an ensemble method that uses a single classifier as a base and generates a second classifier focused on the data instances misclassified by the first one. Boosting repeats this process until a set number of iterations or a specified accuracy is reached.
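The "focus on misclassified instances" step can be illustrated with an AdaBoost-style reweighting round. This is a sketch of the idea, not WEKA's AdaBoostM1 code, and it assumes the round's error rate lies strictly between 0 and 0.5:

```python
# One AdaBoost-style reweighting round: misclassified instances get their
# weights increased so the next base classifier concentrates on them.
import math

def adaboost_reweight(weights, correct, error):
    """weights: current instance weights; correct: per-instance bool;
    error: weighted error of this round (assumed 0 < error < 0.5)."""
    alpha = 0.5 * math.log((1 - error) / error)
    new = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]  # renormalise to sum to 1
```

With four equally weighted instances and one mistake (error 0.25), the single misclassified instance ends up carrying half of the total weight.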
4.4.2 Bagging
Bagging (Bootstrap Aggregating) is another ensemble method that draws N bootstrap resamples of the training dataset (sampling with replacement) and trains one classifier on each. The results of the N classifiers are then combined, by mean value or voting, to give the final classification.
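The bootstrap resampling step can be sketched as follows (illustrative, not WEKA's Bagging implementation; the seeding is my own choice to make the sketch reproducible):

```python
# Bootstrap resampling: each of the N base classifiers is trained on a
# same-sized sample drawn from the training data WITH replacement.
import random

def bootstrap_samples(data, n_models, seed=0):
    """Draw n_models bootstrap resamples, each the size of the original data."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in data] for _ in range(n_models)]
```

Because sampling is with replacement, each resample typically omits about a third of the original instances and repeats others, which is what decorrelates the base classifiers.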
4.4.3 Voting
Voting is a technique that applies multiple different classifiers to one classification problem. Each data instance is classified by all of them, and the instance is assigned the majority class among their predictions. This is the combination method I use in this project.
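The majority-vote combination rule is simple enough to state in a few lines (an illustrative sketch; WEKA's Vote meta-classifier also supports other combination rules):

```python
# Majority vote: each trained classifier contributes one predicted class,
# and the most common prediction wins.
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted class per classifier, e.g. [0, 1, 1]."""
    return Counter(predictions).most_common(1)[0][0]
```

With an odd number of voters, as in the DT/NB/SVM combination below, ties between two classes cannot occur.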
4.4.4 Results
Different meta-classifiers are tried on all the previous classifiers, and increases in performance are observed in most runs. The following table shows how the different meta-classifiers perform on my datasets.
Classifier          Number of features  Meta-classifier  Accuracy (training)  Accuracy (testing)
Decision Tree       35                  Bagging          96.77%               91.39%
SVM                 35                  Bagging          86.46%               86.09%
Naive Bayes         17                  Boosting         85.6%                85.3%
MLP                 35                  Boosting         86.12%               81.86%
DT, NB, MLP, SVM    35                  Voting           90.50%               88.7%
DT, NB, SVM         35                  Voting           90.22%               88.6%
Table 10
Since finding the best classifier for a problem is always an empirical and experimental exercise, I applied different meta-classifiers on top of the 4 simple classifiers. As shown in Table 10, the decision tree has the best performance among the single classifiers on both training and testing data, whereas MLP performs worst on the testing data. Because of this, I trained two voting meta-classifiers: one using all 4 single classifiers as voters, and one the same but without MLP. The results show no strong evidence of an advantage gained by taking votes from different classifiers.
4.5 Evaluating using incomplete data
As described in the Data Gathering subsection, 20536 rows of data with incomplete blog information were crawled. In big data analysis this type of data is to be expected, so I kept it and ran the different classifiers on it as a way of evaluating the methodology.
Classifier          Meta-classifier  Accuracy on incomplete data
Decision Tree       Bagging          86.6%
Naive Bayes         Boosting         89.4%
MLP                 Boosting         72.64%
SVM                 Bagging          88.97%
DT, NB, MLP, SVM    Voting           90.72%
DT, NB, SVM         Voting           90.91%
Table 11
As can be seen from Table 11, the accuracies of the different classifiers on the incomplete data are still very high, except for MLP, which reaches only about 72%. The decision tree classifier, which outperformed all the others on the good data, performs worse on the incomplete data, while the naive Bayes classifier deals better with data carrying less information.
What is surprising is that, despite showing no better performance on the training and testing data, the voting-based meta-classifier models give significantly better accuracy on the incomplete dataset. The one using all 4 classifiers performs slightly worse than the one without MLP (by around 0.2%), yet both achieve an accuracy above 90.7%, while no single classifier among DT, NB, MLP and SVM exceeds 89.4%. This is evidence that combining multiple classifiers reduces overfitting and increases overall classification strength.
4.6 The composition of Weibo Users
The classifiers are then finally used to classify the 897343 rows of evaluation data, where the classes of the accounts are unknown.
Classifier          Meta-classifier  Real users  Zombie accounts
Decision Tree       Bagging          56.8%       43.2%
Naive Bayes         Boosting         54.7%       45.3%
MLP                 Boosting         68.04%      31.96%
SVM                 Bagging          52.1%       47.9%
DT, NB, MLP, SVM    Voting           54.5%       45.5%
DT, NB, SVM         Voting           53.6%       46.4%
Table 12
Except for MLP, which is optimistic about the proportion of real users (68%), the trained classifier models put the ratio of real users between 52% and 57%. Considering that the maximum accuracy is approximately 90%, the true proportion of real users may vary from 47% (52 - 5%) to 63% (57 + 6%). Since these numbers are close to the confusion matrices generated by these classifiers on the training and testing data, it is reasonable to believe that quite a large portion of the users on Weibo (according to the experimental data, at least 37%) are zombie accounts. These figures can be inaccurate, due to the limitations explained in the next section; nevertheless, we should not underestimate the actual number of zombie accounts.
5 Conclusion & Discussion
5.1 Achievements
The objectives of this project, as listed in the Approach and Objectives subsection of the Introduction, are:
1. Implement a good distributed crawling system in order to obtain large data set from
Weibo.
2. Gather ground truth data of zombie and real accounts on Weibo for the soundness
of this study.
3. Find a good classifier or a combination of classifiers that can maximise the ability to
classify zombie accounts on Weibo using the data gathered by the crawling system.
4. Conduct relatively large scaled experiments to evaluate classifier, and therefore
evaluate the composition of Weibo accounts.
5. Analyse how zombie accounts have been influencing real humans.
The first four objectives are met, and the approaches to them have given promising results.
Firstly, a framework named JPipe has been implemented for Java pipelining; it is the first open-source object-oriented pipelining implementation. The framework is easy to use, allowing users to efficiently create applications that need concurrent work, especially patterns of task parallelism. A distributed crawling system, PipeCrawler, was then built on JPipe, enabling me to obtain millions of rows of data for this project. Furthermore, the JPipe framework and the PipeCrawler system are believed to be useful for future information retrieval research.
Secondly, ground truth data were obtained with proper methods. Zombie accounts were bought directly from 4 different sellers with unbiased sampling, and real accounts were crawled using manually identified real accounts as starting points, with a breadth-first search of depth 1.
Thirdly, using different tuning methods and meta-classifiers, the selected classifiers all performed well at classifying zombie accounts, on both the good and the incomplete datasets: a minimum accuracy of 86% was achieved, more than 20% above the baseline. What is more, a voting classifier built from the 4 selected simple classifiers averages 90% accuracy on any dataset. A single classifier such as the decision tree performs well on properly crawled data, but slightly worse than the naive Bayes classifier on improperly crawled data. Meta-classifiers such as boosting and bagging proved useful, as all 4 simple classifiers gained accuracy from them.
Fourthly, an experiment on the composition of the accounts on Weibo was carried out. About 0.9 million random account records were evaluated using the different classifiers. Based on the results of the different classifier models, the estimated proportion of real users varies from only 47% to 63%. This research shows that the proportion of zombie accounts on Weibo.com may be unexpectedly high, a ratio never recognised in any previous study.
However, due to the limitations of time and space, the final objective was not met, because studying how zombie accounts influence real humans is hard. Achieving it would require implementing and using algorithms and tools for Chinese text analysis, and potentially the complicated work of Chinese semantic analysis. This is another area of machine learning study and is out of the scope of this undergraduate project.
5.2 Limitation
This project has certain limitations that could be improved upon or dealt with:
1. The diversity of the ground truth zombie accounts is relatively low: the accounts were bought from only 4 of more than 60 sellers, and different sellers may have different methods of creating and training their zombie accounts. With more diverse accounts, classifier accuracy could be improved further. Moreover, only 300 initial points were used to crawl real user accounts, which is relatively small and requires a search algorithm to reach more real users; it is possible that a small fraction of non-real users is present in the supposed ground truth real user set. Although I had no time to identify real users manually, in the future it should be possible to obtain a reasonably large set of 100% real user data, improving the performance of the classifiers.
2. In addition to the previous limitation, the real accounts should actually be regarded as active users. There are many inactive users whom I was not able to identify manually, owing to the lack of information about users who do not post or change their profiles and only read others' posts. As a result, the ground truth real user data are more likely a collection of active users. The classifier is accurate in the sense of distinguishing active users from zombie accounts, whereas how inactive real users would be classified is unknown, and these details may never be learned.
3. The data pre-processing methods are limited, because neither the actual text of the blogs nor the self-descriptions of the users is used. In my manual work, the microblog text was in fact the key factor for distinguishing a real user from a zombie one. If more analysis were done on this text, a further improvement in classification accuracy could be made.
4. There is still room for improvement in the classifiers. This project used only 4 of the many available classifiers, without exhaustive optimisation. With more time, I believe an even better classifier model could be found.
5. This project only conducted experiments on million-sized datasets, whereas there are more than 0.5 billion users on Weibo.com. This limitation is imposed by resources such as the budget for buying proxies for data crawling, internet speed and time. With enough of these resources, larger-scale experiments with more ground truth data could be carried out, potentially allowing the problem of zombie accounts to be studied in depth.
5.3 Future Work
The future work of this project includes:
1. Implementing a proxy crawling system, so that PipeCrawler no longer requires paid proxies to obtain more data. Currently the crawling system uses an API from the proxy seller to obtain proxies. With a proxy crawling system, the amount of data PipeCrawler can obtain would no longer be limited by the budget for purchasing proxies, allowing larger-scale experiments.
2. Refurbishing the JPipe code to make it easier to use, more robust and more efficient. The way the pipelining pattern is implemented using JPipe can be further simplified and optimised. A more generalised implementation of JPipe would be good support for all kinds of information retrieval research.
3. Training different classifiers with larger sets of ground truth data, and evaluating more unknown accounts. Given good ground truth data, this project has achieved 90% accuracy in identifying zombie accounts.
4. Most importantly, any research based on retrieving information from social networks such as Twitter or Weibo needs a proper way of identifying zombie accounts in order to obtain authentic and sound results. With extensions of this project and classifiers of better accuracy, research such as sentiment analysis can be carried out on websites such as Weibo without being deceived by the fake information given by zombies.
6 Reference
Amleshwaram, A. A., Reddy, N., Yadav, S., Gu, G. & Yang, C., 2013. CATS: Characterizing Automation of Twitter Spammers. COMSNETS, pp. 1-10.
Benmokhtar, R. & Huet, B., 2006. Classifier Fusion: Combination Methods For Semantic
Indexing in Video Content. 16th International Conference, Athens, Greece, Volume II, pp. 65-
74.
Chang, Y., Wang, X., Mei, Q. & Liu, Y., 2013. Towards Twitter context summarization with user influence models. WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining, pp. 527-536.
Deng, J., Fu, L. & Yang, Y., 2015. ZLOC: Detection of Zombie Users in Online Social Networks.
WEB 2015 : The Third International Conference on Building and Exploring Web Based
Environments.
Jiang, H., Wang, Y. & Zhu, M., 2015. Discrimination of Zombie Fans on Weibo based on Features Extraction and Business-Driven Analysis. ICEC '15: Proceedings of the 17th International Conference on Electronic Commerce.
Kittler, J., Hatef, M., Duin, R. P. & Matas, J., 1998. On Combining Classifiers. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(3), pp. 226 - 239.
Opitz, D. & Maclin, R., 1999. Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research, Volume 11, pp. 169-198.
Sun, Q. et al., 2014. Modeling for User Interaction by Influence Transfer Effect in Online
Social Networks. 39th Annual IEEE Conference on Local Computer Networks, 8 Sept, pp. 486 -
489.
Tavares, G. & Faisal, A., 2013. Scaling-Laws of Human Broadcast Communication Enable Distinction between Human, Corporate and Robot Twitter Users. PLoS ONE, DOI: 10.1371/journal.pone.0065774.
The University of Waikato, 2016. Weka 3: Data Mining Software in Java. [Online]
Available at: http://www.cs.waikato.ac.nz/ml/weka/
[Accessed 15 April 2016].
Zhang, C. M. & Paxson, V., 2011. Detecting and Analyzing Automated Activity on Twitter. In: Passive and Active Measurement: 12th International Conference, PAM 2011. Berlin Heidelberg: Springer-Verlag, pp. 102-111.
7 Appendix A
Field List of Discovery Channel of Weibo.com
social, international, technology, science, digital, best comments, finance, stock market, star, variety, drama, movies, music, cars, sports, sports and fitness, health, weight loss, military, history, beautiful models, beauties, pets, emotion, quotations, jokes, rumour, chicken soup for the soul, religion, government, games, travel, childcare, education, food, real estate, home, sign, reading, agriculture, design, art, fashion, beauty, animation
A screenshot of the above fields on the Discovery Channel of Weibo.com
8 Appendix B
The following are the analytic fields created by pre-processing
Field Name Detail
name_has_character If the name contains Chinese characters
name_has_letter If the name contains English letters
name_has_number If the name contains numbers
name_has_rare_char If the name contains rare Chinese characters outside the range of the 3500 common Chinese characters
name_has_symbol If the name contains any string other than numbers, letters or Chinese characters
name_is_mixture If the name is a combination of more than one of: characters, numbers, letters or symbols
is_default_avatar_img If the avatar (thumbnail) image of the user is the default image given by Weibo.com
is_default_background If the background image of user’s home page is the default image given by Weibo.com
avr_blog_repost The average repost count of the latest 100 microblogs of the user.
avr_blog_comment The average comment count of the latest 100 microblogs of the user.
avr_blog_att The average “like” count of the latest 100 microblogs of the user.
avr_blog_is_retweet The proportion of the latest 100 microblogs of the user that are retweets of other users.
BT_chisquretest_010 Whether the blogging time (as the minute of the day) of the latest 100 microblogs passes Pearson’s χ2 test with significance of 0.10, 0.05 and 0.025
BT_chisquretest_005
BT_chisquretest_0025
BT_mean The mean time (as the minute of the day) the user post microblogs
BT_median The median time (as the minute of the day) the user post microblogs
BT_variance The variance of time (as the minute of the day) the user post microblogs
BT_I_mean The mean interval time between microblogs that the user posts.
BT_I_median The median interval time between microblogs that the user posts.
BT_I_variance The variance of the interval time between microblogs that the user posts.
ff_ratio The follower/fans ratio of the user
9 Appendix C
The following are the final fields used for data mining.
Attribute name Type
BT_I_Variance Float
BT_I_mean Float
BT_I_median Float
BT_chisquare_p Float
BT_chisquretest_0025 Boolean
BT_chisquretest_005 Boolean
BT_chisquretest_010 Boolean
BT_mean Float
BT_median Float
BT_variance Float
att_num Integer
avr_blog_att Float
avr_blog_comment Float
avr_blog_is_retweet Float
avr_blog_like Float
avr_blog_repost Float
blog_num Integer
create_time Integer
fans_num Integer
ff_ratio Float
gender Boolean
is_default_avatar_img Boolean
is_default_background Boolean
member_type Integer
name_has_charactor Integer
name_has_letter Integer
name_has_number Integer
name_has_rare_char Integer
name_has_symbol Integer
name_is_mixture Integer
native_place String
uid Integer
v_type Integer
verified Boolean
accountClass Boolean