SPOTTING FAKE RETWEETING ACTIVITY IN TWITTER

Maria Giatsoglou 1, Despoina Chatzakou 1, Neil Shah 2, Alex Beutel 2, Christos Faloutsos 2, Athena Vakali 1

1 Informatics Department, Aristotle University of Thessaloniki, Greece
2 School of Computer Science, Carnegie Mellon University, USA

Content in Twitter

• Great topic diversity

• Varying attention levels (#views, #RTs, #favorites)

04/18/23 2

Retweet Fraud: overview

• Users typically retweet a post because of its high quality / interestingness and its author’s influence / popularity

• The number of retweets serves as an indicator of a post’s popularity

Retweet fraud: falsely creating the impression of popularity by artificially generating a high volume of retweets

• Twitter estimates that 14% (5%) of user accounts are bots (spam bots); the problem is probably much bigger

• Such content is vacuous, spammy or malicious, and detracts from the credibility of Twitter content and from users’ experience

Retweet Fraud: example

Retweet Fraud: examples

Retweet Fraud: dimensions

• Accounts of varying automation level (bots, humans, semi-automated)

• Mixed honest and fake retweets for the same post

• Promiscuous vs. subtle fraudsters: based on the ratio of fraudulent to honest(-like) activity

[Figure labels (examples): occasional retweet buyer; professional content / user promoter; honest humans; paid human; bots]

Complex problem with multiple dimensions

How can we spot fake retweeting activity?

What features tell fake from genuine reactions?

How do they relate to the targeted problems?

RTSCOPE addresses these issues

Hypotheses and problems addressed

There are distinctive patterns in retweet fraud in terms of:

H1. the timing of retweets (use of automation tools)

H2. the accounts that retweet (fraudsters acting in lockstep)

H3. the connectivity of retweeters (bot networks, “camouflage”)

Retweet-thread level problem
Given: the ith tweet of user u; its induced retweet activity (user IDs & timestamps)
Identify: if the activity is organic or not.

User level problem
Given: a user u; a set of tweets of user u; their induced retweet activity
Identify: if u is a spammer.

[Slide annotations: promiscuous fraudsters; cautious fraudsters]

Background

• User u: a given Twitter account (can be honest OR fraudster)

• Tweet tw_{u,i}: the ith post of user u

• Retweet thread R_{u,i}: all re-posts of tweet tw_{u,i}

[Figure: timeline of user u’s tweets tw_{u,1}, tw_{u,2}, tw_{u,3}, tw_{u,4} posted at times t1, t2, t3, t4; retweet thread R_{u,1} of tw_{u,1} with retweeters Alex, Mary, Peter, Debbie, Tim; the “R” network (of R_{u,1}) over Alex, Mary, Peter, Debbie]

Introducing the RTSCOPE approach

• RTSCOPE: a series of tests for spotting fraudsters with varying behaviors

Maria Giatsoglou, Despoina Chatzakou, Neil Shah, Christos Faloutsos, and Athena Vakali. Retweeting Activity on Twitter: Signs of Deception. In PAKDD 2015. 

Connectivity: TRIANGLES pattern

[Figure: two panels, honest “R” network and fraudulent “R” network; extracted axis label: degree2]

Connectivity: DEGREES pattern

[Figure: degree distributions of an honest “R” network and a fraudulent “R” network; annotations: power-law, spike at 30]
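The slides show the two connectivity patterns only as plots. As a rough sketch of how the underlying quantities could be measured, assume the “R” network is given as an undirected edge list over retweeters; this representation, the function name `r_network_stats`, and the toy clique are assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter, defaultdict
from itertools import combinations

def r_network_stats(edges):
    """Return (degree, triangles, degree_histogram) for an undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:                      # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    # A node's triangle count is the number of neighbor pairs that are linked.
    triangles = {
        v: sum(1 for x, y in combinations(sorted(nbrs), 2) if y in adj[x])
        for v, nbrs in adj.items()
    }
    histogram = Counter(degree.values())  # degree -> number of nodes
    return degree, triangles, histogram

# Toy example (hypothetical): four retweeters who are all linked to each other.
clique = list(combinations(["Alex", "Mary", "Peter", "Debbie"], 2))
degree, triangles, histogram = r_network_stats(clique)
print(degree["Alex"], triangles["Alex"], dict(histogram))
```

Plotting `triangles` against `degree` per node, and `histogram` on log-log axes, would give views comparable to the two slides; a spike in the histogram means many nodes share the same degree.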

Activity Summarization: Features

• Temporal & popularity features per retweet thread:

ratio of activated followers: author’s followers who retweeted

response time: time between the tweet’s posting and first retweet

lifespan: time between first and last retweet (constrained to 1 month)

Arr-IQR: inter-quartile range of inter-arrival times for retweets

Log-log pairwise feature scatterplots of retweet threads reveal dense microclusters for fraudsters
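The four features can be computed directly from a thread's timestamps and follower counts. The sketch below is one possible reading, assuming times are given in seconds and that the number of the author's followers who retweeted is already known; the function name `thread_features` and its parameters are hypothetical, and the 1-month cap on lifespan is left out for brevity.

```python
from statistics import quantiles

def thread_features(post_time, retweet_times, n_followers, n_follower_retweeters):
    """Summarize one retweet thread; all times are in seconds."""
    times = sorted(retweet_times)
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    if len(gaps) >= 2:
        q1, _, q3 = quantiles(gaps, n=4, method="inclusive")
        arr_iqr = q3 - q1
    else:
        arr_iqr = 0.0                                  # too few retweets for an IQR
    return {
        "activated_ratio": n_follower_retweeters / n_followers,
        "response_time": times[0] - post_time,
        "lifespan": times[-1] - times[0],
        "arr_iqr": arr_iqr,
    }

# Hypothetical thread whose retweets arrive exactly 2 s apart.
f = thread_features(0, [10, 12, 14, 16, 18], n_followers=50, n_follower_retweeters=5)
print(f)
```

Perfectly regular arrivals give an Arr-IQR of zero, which is the kind of value that would pile up into a dense microcluster on a log-log scatterplot.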

Activity Summarization: Patterns

ENTHUSIASM: High infection probability for followers of fraudsters

MACHINE-GUN: Fraudsters retweet all at once/with similar time delay

REPETITION: Fake retweet threads form microclusters due to similar response time, Arr-IQR, activated followers ratio

[Figure panels: ENTHUSIASM, MACHINE-GUN; legend: Popular++, Popular+, Popular, Fraudulent users]

Retweeters activation: Disparity

• Given the posts of user ui, what is the distribution of retweets across retweeters?

• Disparity reveals if retweeting activity spreads homogeneously over retweeters or it is skewed towards few dedicated users.

Disparity for u_i and a retweet thread size of k:

Y(k, i) = \sum_{j=1}^{k} \left( \frac{r_{i,j}}{\sum_{l=1}^{k} r_{i,l}} \right)^2

[Figure: u_i publishes 100 posts; its k = 5 retweeters (Alex, Mary, Peter, Debbie, Tim) have retweet counts r_{i,1} = 100, r_{i,2} = 2, r_{i,3} = 2, r_{i,4} = 1, r_{i,5} = 1]

Disparity: Intuition

[Figure: u_i publishes 100 posts, each retweeted by the same k = 5 bots (bot1, bot2, bot3, bot4, bot5), so r_{i,1} = r_{i,2} = r_{i,3} = r_{i,4} = r_{i,5} = 100]

Y(5, i) = \frac{100^2 + 4^2 + 2^2 + 1^2 + 1^2}{108^2} \approx 0.859

Y(5, i) = \frac{100^2 + 100^2 + 100^2 + 100^2 + 100^2}{500^2} = \frac{1}{5}
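The two worked values on this slide are consistent with reading disparity as the sum of squared retweet shares, Y(k, i) = \sum_j (r_{i,j} / \sum_l r_{i,l})^2. A minimal sketch under that reading follows; the function name `disparity` is hypothetical, and the input is simply the list of per-retweeter counts.

```python
def disparity(retweet_counts):
    """Sum of squared shares: sum_j (r_j / sum_l r_l) ** 2."""
    total = sum(retweet_counts)
    if total == 0:
        raise ValueError("user has no retweets")
    return sum((r / total) ** 2 for r in retweet_counts)

skewed = disparity([100, 4, 2, 1, 1])            # one dominant retweeter
uniform = disparity([100, 100, 100, 100, 100])   # five equally active retweeters
print(round(skewed, 3), round(uniform, 3))       # 0.859 0.2
```

With k retweeters the value ranges from 1/k (perfectly even participation) up to 1 (a single retweeter accounts for everything), so multiplying by k gives a scale-free indicator of skew.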

FAVORITISM & HOMOGENEITY patterns

FAVORITISM. Participation of honest users in retweets follows a Zipf law.

HOMOGENEITY. Participation of fraudulent users in retweets is homogeneous.

[Figure: disparity plot with the disparity of a Zipf distribution as reference (proof in paper); annotated regions: homogeneity, favoritism, super-skewed favoritism]
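To make the contrast between the two patterns concrete, the sketch below compares the disparity of perfectly homogeneous participation with that of Zipf-distributed participation. It assumes disparity is the sum of squared retweet shares and that the Zipf weights are proportional to 1/j^s with s = 1; the exponent and the function names are assumptions, and the closed-form result proved in the paper is not reproduced here.

```python
def zipf_disparity(k, s=1.0):
    """Disparity when the j-th retweeter's share is proportional to 1 / j**s."""
    h1 = sum(j ** -s for j in range(1, k + 1))        # sum of weights
    h2 = sum(j ** (-2 * s) for j in range(1, k + 1))  # sum of squared weights
    return h2 / h1 ** 2

def uniform_disparity(k):
    """Disparity when all k retweeters participate equally."""
    return 1.0 / k

for k in (5, 50, 500):
    # k * Y stays at 1 for homogeneous participation and grows with k for Zipf.
    print(k, round(k * uniform_disparity(k), 2), round(k * zipf_disparity(k), 2))
```

Under these assumptions, a user whose k * Y stays close to 1 across thread sizes matches HOMOGENEITY, while values growing along the Zipf curve match FAVORITISM.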

Findings

• Patterns: we discovered several patterns for spotting retweet fraud

• All tests are content independent
  • can catch more sophisticated fraudsters
  • are language independent

• But:
  • golden number of tests for flagging fraudsters?

Can we come up with a more generalizable approach?

Synchronization Fraud

• Group of unnaturally synchronized events/entities

• Collective / group anomaly

• e.g. retweets, Facebook likes, subgraphs, image subregions

[Figure: one tweet got 3K retweets in 10 minutes (Alex, Mary, Peter, Debbie, Tim, John, … 3000 times). SUSPICIOUS? Not necessarily. Three more tweets were each retweeted by Alex, Mary, Peter, Debbie, Tim, John, … 3000 times within 10’. SUSPICIOUS? Probably!]

Our goals

• Given: N groups of entities; a representation for each entity in a p-dimensional space;

• Identify groups of entities abnormally synchronized in some feature subspaces.

G1. Design a general, effective approach for collective anomaly detection

G2. Customize it for Retweet Fraud detection

G3. Find features that will assist in distinguishing fraudsters from honest users

Background: Measuring group strangeness

[Figure annotation: average closeness]

Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, Shiqiang Yang. CatchSync: Catching Synchronized Behavior in Large Directed Graphs. KDD 2014.
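The slide summarizes group strangeness as an “average closeness”. One way to read this, loosely following the grid-based idea of the cited CatchSync paper, is to call two entities close when they fall in the same grid cell of the feature space and to average that over pairs. The grid closeness, the cell size, and the function names below are assumptions for illustration, not details given on the slide.

```python
from collections import Counter

def cell_of(point, cell_size):
    """Map a feature vector to the grid cell that contains it."""
    return tuple(int(x // cell_size) for x in point)

def synchronicity(group_points, cell_size=1.0):
    """Average closeness over all ordered pairs of a group's entities."""
    n = len(group_points)
    cells = Counter(cell_of(p, cell_size) for p in group_points)
    return sum(c * c for c in cells.values()) / (n * n)

def normality(group_points, all_points, cell_size=1.0):
    """Average closeness between a group's entities and all entities."""
    group_cells = Counter(cell_of(p, cell_size) for p in group_points)
    all_cells = Counter(cell_of(p, cell_size) for p in all_points)
    shared = sum(c * all_cells[cell] for cell, c in group_cells.items())
    return shared / (len(group_points) * len(all_points))

# Hypothetical 2-D entities: one tightly packed group, one scattered group.
tight = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
spread = [(0.5, 0.5), (1.5, 0.5), (2.5, 3.5), (7.5, 9.5)]
print(synchronicity(tight), synchronicity(spread), normality(tight, tight + spread))
```

A group with high synchronicity and low normality is tightly packed in a region that few other entities occupy, which is the intuition behind treating it as strange.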

Background: Robust outlier detection

• ROBPCA-AO: robust dimensionality reduction approach; finds outlying points

• Suitable for multivariate, high-dimensional data; independent of the features’ distribution; non-deterministic

1. Finds the “best” k-D space to project data based on a subset of points

2. Detects outliers based on two distance scores: orthogonal distance and robust score distance

M. Hubert, P.J. Rousseeuw, T. Verdonck. Robust PCA for skewed data and its outlier map. Comput. Stat. Dat. An., 53 (2009), 2264-2274.

Problem definition

SYNCFRAUD Problem.

• Given: a set of groups of entities G with a variable number of entities e_{m,i} for each group g_m; p features for the entities’ representation,

• Extract: a set of features at the group-level, and

• Identify: suspicious groups S with highly synchronized characteristics.

RTFRAUD Problem.

groups of entities → users
entities → retweet threads
suspicious groups → RTFraudsters

ND-SYNC pipeline

Given N groups of p-D entities and I iterations

Do

1. Feature subspace sweeping;

2. Group scoring;

3. Multivariate outlier detection;

Extract suspicious groups
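The three steps can be laid out as a small skeleton. This is only a sketch of the control flow: the group scorer and the outlier detector are passed in as callables because their internals are not given on this slide, the sweep is fixed to 2-D subspaces as an assumption, and every name (`sweep_subspaces`, `nd_sync_sketch`, `toy_score`, `toy_outliers`) is hypothetical.

```python
from itertools import combinations

def sweep_subspaces(p, dims=2):
    """Step 1: enumerate feature subspaces as tuples of feature indices."""
    return list(combinations(range(p), dims))

def nd_sync_sketch(groups, p, score_group, find_outliers, iterations):
    """Return the ids of groups flagged in a majority of iterations."""
    subspaces = sweep_subspaces(p)
    # Step 2: one suspiciousness score per group and subspace.
    scores = {
        gid: [score_group(gid, groups, sub) for sub in subspaces]
        for gid in groups
    }
    # Step 3: repeated outlier detection on the score vectors, then a vote.
    votes = {gid: 0 for gid in groups}
    for _ in range(iterations):
        for gid in find_outliers(scores):
            votes[gid] += 1
    return {gid for gid, v in votes.items() if v > iterations / 2}

# Stand-ins for the real scorer and detector, for demonstration only.
def toy_score(gid, groups, subspace):
    return 1.0 if gid == "bot_farm" else 0.0

def toy_outliers(scores):
    return [g for g, vec in scores.items() if sum(vec) / len(vec) > 0.5]

groups = {
    "bot_farm": [(1.0, 1.0, 1.0)] * 3,
    "newsroom": [(0.1, 2.0, 5.0), (3.0, 0.5, 9.0)],
}
flagged = nd_sync_sketch(groups, p=3, score_group=toy_score,
                         find_outliers=toy_outliers, iterations=5)
print(flagged)
```

Repeating the detection step only matters when the detector is non-deterministic; with the deterministic stand-in used here every iteration casts the same vote.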


ND-SYNC: Feature subspace sweeping

[Figure annotations: sign of synchronicity; all, for simplicity]

ND-SYNC: Group scoring


ND-SYNC: Multivariate outlier detection

Aim: given the suspiciousness score vectors, identify the suspicious groups

1. Apply ROBPCA-AO for I iterations and find outliers

2. Flag a group as suspicious based on majority vote over all iterations.

• To eliminate parameters:
  • automatic selection of dimensionality k via the 95% cumulative variance explained criterion (heuristic)
  • use of all entities for estimating the robust feature subspaces
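The 95% cumulative variance criterion can be illustrated on its own. The sketch below assumes the per-component variances (eigenvalues) have already been obtained from some PCA-like decomposition; it does not implement ROBPCA-AO, and the function name `choose_k` and the example values are hypothetical.

```python
def choose_k(eigenvalues, threshold=0.95):
    """Smallest k whose top-k components explain at least `threshold` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical variances: the top three components explain 96% of the total.
k = choose_k([60, 30, 6, 3, 1])
print(k)  # 3
```

Fixing the threshold removes k as a tunable parameter, at the cost of tying the chosen dimensionality to how the variance happens to be spread across components.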


Features for retweet threads

Retweets: # retweets

Response time: tweet’s posting → first retweet

Lifespan: first → last (observed) retweet, constrained to 3 weeks

RT-Q3 response time: tweet’s posting → first ¾ of retweets

RT-Q2 response time: tweet’s posting → first ½ of retweets

Arr-MAD: mean absolute deviation of RTs inter-arrival times

Arr-IQR: inter-quartile range of RTs inter-arrival times
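The quantile-based response times and Arr-MAD can be sketched as follows. The sketch assumes that “first ½ of retweets” means the moment the ceil(n/2)-th retweet arrives (likewise ¾), that times are in seconds, and that Arr-MAD is the mean absolute deviation around the mean inter-arrival time; the function name `timing_features` is hypothetical.

```python
import math
from statistics import mean

def timing_features(post_time, retweet_times):
    """Retweet count, RT-Q2, RT-Q3 and Arr-MAD for one retweet thread."""
    times = sorted(retweet_times)
    n = len(times)
    if n == 0:
        raise ValueError("thread has no retweets")

    def time_to_fraction(fraction):
        # Delay until the given fraction of all retweets has arrived.
        index = math.ceil(fraction * n) - 1
        return times[index] - post_time

    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    if gaps:
        center = mean(gaps)
        arr_mad = mean(abs(g - center) for g in gaps)
    else:
        arr_mad = 0.0
    return {
        "retweets": n,
        "rt_q2": time_to_fraction(0.5),
        "rt_q3": time_to_fraction(0.75),
        "arr_mad": arr_mad,
    }

# Hypothetical thread: three quick retweets and one late straggler.
t = timing_features(0, [10, 20, 30, 100])
print(t)
```

Because RT-Q2 and RT-Q3 ignore the slowest retweets, they are less sensitive to a single late straggler than the lifespan is.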


Microclusters of fraudulent retweet threads

[Figure: 2D feature subspaces; annotation: high synchronicity for RTFraudsters]

Dataset generation

• Selection of target users (both honest and fraudulent)

  • users with the most retweeted tweets and heavy use of spammy keywords (casino, buy, followback, etc.) in a 2-day Twitter sample

  • active (frequent tweets) and popular (> 100 retweets) users (http://twittercounter.com/)

  • topic experts (European affairs and Automobile), based on Twitter lists

• Target users tracked for 2-6 months (all tweets & their retweets)

• Pruned “unpopular” users (all retweet threads < 50 retweets or fewer than 20)

Dataset overview

Type         #Retweet threads   #Retweets
honest       83,587             2,939,455
fraudulent   50,435             8,787,803
BOTH         134,022            11,727,258

User categorization

• 28 fraudulent: tweets with spammy links and terms, repetitive promotions; fabricated profiles

• 278 honest

(Available at http://oswinds.csd.auth.gr/project/NDSYNC)

ND-SYNC effectiveness & robustness

• Highly accurate and robust to the selection of k

• Best performance at k = 6 (selected with the 95% cumulative variance explained criterion)

• Only 1% decrease in F1-score using just 2D feature subspaces

97% accuracy, 0.82 F1-score

Detected outliers

[Figure annotations: professional promoters (promiscuous); 65 retweet threads in 4 months, 80% > 1k retweets, 60% > 10k retweets; news media account; news media account; politician]

Conclusions

G1. Design a general, effective approach for collective anomaly detection

ND-SYNC is a general, effective pipeline, which automatically detects group anomalies

G2. Customize it for Retweet Fraud detection

Carefully designed set of features for the retweet fraud case

G3. Find features that will assist in distinguishing fraudsters from honest users

ND-SYNC achieves 97% accuracy in distinguishing fraudulent from honest users on real Twitter data

Questions?

Download datasets at:
http://oswinds.csd.auth.gr/project/RTSCOPE
http://oswinds.csd.auth.gr/project/NDSYNC