
Misinformation in Social Media

Anupam Joshi
Oros Family Professor and Chair, CSEE
Director, UMBC Center for Cybersecurity
University of Maryland Baltimore County


Power of Social Media

• 300 hours of video uploaded every minute
• 500 million tweets posted every day
• 1.44 billion monthly active users
• 60 million photos shared every day

* 2015 Statistics

Motivation

• Social media sites are a rich source of information about current events, and supplement traditional sensors
• Events and accidents are reported on social networks even before they appear on news channels
• E.g., tweets about the 29th July 2009 earthquake in Southern California appeared within a few seconds, while official news emerged 4 minutes later
• Social networks often provide information where more formal channels are censored
• E.g., Arab Spring, Iran Election 2009
• Iranians used Twitter, Flickr, YouTube and some blogs to protest and communicate with the outside world
• #IranElection, #Ahmadinejad, #Mousavi, and #Tehran became trending topics on Twitter
• YouTube set up various channels for uploading videos shot on cellphones and video cameras
• “Iran Protests”, “Iran Riots 2009” and “Tehran Protests” were popular tags on Flickr after the election
• On Facebook, the IRAN page had around 40,000 fans

Areas of Interest

• Building a scalable infrastructure to harvest social media data
• Analysis of social media text and relationship (graph) data for (social) situational awareness:
• Detect events and their attributes
• Temporal evolution
• Detect communities
• Detect sentiment
• Detect and prevent misinformation

Analytics

• Can use both the social network and the content
• Can discover a variety of things
• Individual tastes and preferences
• Groups
• Influential individuals
• Identity across social media
• Does privacy still exist?

A Pessimistic View on Privacy

• Mankind is a social animal; we have a “need to share”
• Internet-enabled social media scales that up
• McLuhan’s view of changing media affecting social organization
• Once data is “gone”, can it ever be controlled?
• Especially if the economic incentives are aligned against it
• Is “privacy” the norm in human affairs?

“I did not inhale” in the Internet Age

• Hat tip: David Chadwick and Ravi Sandhu
• Curating data about self is much harder when “youthful indiscretions” are on social media
• Curating data about self might not be sufficient: empowerment of “whisper campaigns”
• Nikki Haley in South Carolina
• Privacy vs. integrity

Social Media meets Mobile Phones

• Embedded sensors in mobile phones add significant data to social media
• Often not controlled by the users
• Are “18th century laws” sufficient to argue about these questions?
• U.S. v. Jones, 10-1259
• What does “expectation of privacy” mean?

Who owns the data anyway

• Does the creator of the data own it?
• Do you turn it over to the “service”, and they own it?
• “I can haz your Facebook password”, or: can I make your access to some service conditional upon you sharing this data?
• What about data that is created as a side effect of your explicit actions?
• Location data collected by telcos or “environment” creators

Legislation, Schmegislation

• Legislation is often thought of as a solution
• Do we want “laws to catch up to the internet”?
• The debate over CISPA
• Whose legislation is it when these services cross jurisdictional boundaries?
• Explicit censorship
• Implicit “good behavior” requests

Misinformation on Social Media


Misinformation Tweets

FAKE

RUMORS


Background: Hurricane Sandy

• Dates: Oct 22-31, 2012
• Damages worth $75 billion
• Coast of NE America

Faking Sandy: Characterizing and Identifying Fake Images on Twitter during Hurricane Sandy. Aditi Gupta, Hemank Lamba, Ponnurangam Kumaraguru and Anupam Joshi. Accepted at the 2nd International Workshop on Privacy and Security in Online Social Media (PSOSM), in conjunction with the 22nd International World Wide Web Conference (WWW), Rio De Janeiro, Brazil, 2013. Best Paper Award.

Fake Image Tweets


Data Description

Total tweets 1,782,526

Total unique users 1,174,266

Tweets with URLs 622,860

Tweets with fake images 10,350

Users with fake images 10,215

Tweets with real images 5,767

Users with real images 5,678

Network Analysis

Tweet-retweet graph for the propagation of fake images during the first 2 hours

Node -> User ID

Edge -> Retweet

Role of Twitter Network

• Analyzed the role of the follower network in fake image propagation

• Crawled the Twitter network for all users who tweeted the fake image URLs

Graph 1- Nodes: Users, Edges: Retweets

Graph 2- Nodes: Users, Edges: Follow relationships

Network Overlap Algorithm


Results


Total edges in retweet network 10,508

Total edges in follower-followee network 10,799,122

Common edges 1,215

Percentage overlap 11%
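The overlap algorithm itself was shown as a figure and is not recoverable from the text. A minimal sketch of one plausible reading follows, assuming the overlap is the share of retweet edges that also appear as follow edges (1,215 of 10,508 is about 11.6%, in line with the reported figure). The function name `edge_overlap`, the edge direction convention, and the toy edge lists are illustrative assumptions, not the authors' implementation.

```python
def edge_overlap(retweet_edges, follow_edges):
    """Return the common edges and the fraction of retweet edges they cover.

    Assumed convention: an edge (u, v) means "u retweeted v" in the retweet
    graph and "u follows v" in the follower graph.
    """
    retweets = set(retweet_edges)
    follows = set(follow_edges)
    common = retweets & follows
    fraction = len(common) / len(retweets) if retweets else 0.0
    return common, fraction


# Toy data (hypothetical users, not from the Hurricane Sandy dataset).
retweet_edges = [("alice", "bob"), ("carol", "bob"),
                 ("dave", "carol"), ("erin", "alice")]
follow_edges = [("alice", "bob"), ("dave", "carol"),
                ("bob", "alice"), ("erin", "frank")]

common, fraction = edge_overlap(retweet_edges, follow_edges)
print(sorted(common), fraction)
```

A low fraction under this reading would suggest that most retweets of fake images did not travel along existing follow relationships.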

Classification

5-fold cross validation

Tweet Features [F2]

Length of Tweet

Number of Words

Contains Question Mark?

Contains Exclamation Mark?

Number of Question Marks

Number of Exclamation Marks

Contains Happy Emoticon

Contains Sad Emoticon

Contains First Order Pronoun

Contains Second Order Pronoun

Contains Third Order Pronoun

Number of uppercase characters

Number of negative sentiment words

Number of positive sentiment words

Number of mentions

Number of hashtags

Number of URLs

Retweet count

User Features [F1]

Number of Friends

Number of Followers

Follower-Friend Ratio

Number of times listed

User has a URL

User is a verified user

Age of user account
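As an illustration of how some of the tweet features [F2] listed above could be computed from raw text, here is a minimal sketch. The emoticon lists, the pronoun set, and the regular expressions are simplifying assumptions rather than the definitions used in the study; sentiment word counts and retweet count are omitted because they need a lexicon and tweet metadata.

```python
import re

# Small illustrative lists (assumptions, not the study's lexicons).
HAPPY_EMOTICONS = (":)", ":-)", ":D")
SAD_EMOTICONS = (":(", ":-(")
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def tweet_features(text):
    """Compute a subset of the content-based tweet features."""
    words = text.split()
    # Lowercased tokens with surrounding punctuation removed, for pronoun lookup.
    tokens = [w.strip(".,!?:;\"'()").lower() for w in words]
    return {
        "length": len(text),
        "num_words": len(words),
        "has_question_mark": "?" in text,
        "has_exclamation_mark": "!" in text,
        "num_question_marks": text.count("?"),
        "num_exclamation_marks": text.count("!"),
        "has_happy_emoticon": any(e in text for e in HAPPY_EMOTICONS),
        "has_sad_emoticon": any(e in text for e in SAD_EMOTICONS),
        "has_first_person_pronoun": any(t in FIRST_PERSON for t in tokens),
        "num_uppercase": sum(c.isupper() for c in text),
        "num_mentions": len(re.findall(r"@\w+", text)),
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_urls": len(re.findall(r"https?://\S+", text)),
    }


# Made-up example tweet.
features = tweet_features(
    "OMG! Shark swimming in NJ street?! #Sandy @cnn http://t.co/abc"
)
print(features)
```

The resulting dictionaries can be turned into feature vectors for any standard classifier and evaluated with 5-fold cross validation.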

Classification Results


F1 (user) F2 (tweet) F1+F2

Naïve Bayes 56.32% 91.97% 91.52%

Decision Tree 53.24% 97.65% 96.65%

• The best results were obtained with the Decision Tree classifier, which reached 97% accuracy in distinguishing fake images from real ones.

• Tweet-based features are very effective in distinguishing fake image tweets from real ones, while user-based features performed very poorly.

Building and Evaluating a Real-time System


• Learning to Rank model for assessing credibility of Tweets

• Model based on ground truth data for 25 real world events and 45 features

• System evaluation using year long real world experiment

• 1800+ users requested credibility scores for more than 14.2 million tweets.

TweetCred: Real-Time Credibility Assessment of Content on Twitter. Aditi Gupta, Ponnurangam Kumaraguru, Carlos Castillo and Patrick Meier. Proceedings of the 6th International Conference on Social Informatics (SocInfo), Barcelona, Spain, 2014. Honorable Mention for Best Paper.

TweetCred Score


Score

User Feedback

Features for Real-time Analysis

Feature set: Features (45)

Tweet meta-data: Number of seconds since the tweet; Source of tweet (mobile / web / etc.); Tweet contains geo-coordinates

Tweet content (simple): Number of characters; Number of words; Number of URLs; Number of hashtags; Number of unique characters; Presence of stock symbol; Presence of happy smiley; Presence of sad smiley; Tweet contains `via'; Presence of colon symbol

Tweet content (linguistic): Presence of swear words; Presence of negative emotion words; Presence of positive emotion words; Presence of pronouns; Mention of self words in tweet (I; my; mine)

Tweet author: Number of followers; friends; time since the user has been on Twitter; etc.

Tweet network: Number of retweets; Number of mentions; Tweet is a reply; Tweet is a retweet

Tweet links: WOT score for the URL; Ratio of likes / dislikes for a YouTube video

Annotation

• 500 tweets per event (six different events)

• Used CrowdFlower

• Step 1
• R1. Contains information about the event
• R2. Is related to the event, but contains no information
• R3. Not related to the event
• R4. Skip tweet

*45% (class R1), 40% (class R2), and 15% (class R3)

• Step 2
• C1. Definitely credible
• C2. Seems credible
• C3. Definitely incredible
• C4. Skip tweet

*52% (class C1), 35% (class C2), and 13% (class C3)

Ranking Model Evaluation

                  AdaRank      Coord. Ascent   RankBoost    SVM-rank
NDCG@25           0.6773       0.5358          0.6736       0.3951
NDCG@50           0.6861       0.5194          0.6825       0.4919
NDCG@75           0.6949       0.7521          0.689        0.6188
NDCG@100          0.6669       0.7607          0.6826       0.7219
Time (training)   35-40 secs   1 min           35-40 secs   9-10 secs
Time (testing)    <1 sec       <1 sec          <1 sec       <1 sec
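For readers unfamiliar with the NDCG@k metric used in the table above, here is a minimal sketch of how it can be computed. It assumes the common exponential-gain formulation, DCG@k = Σ (2^rel − 1) / log2(rank + 1), and a hypothetical mapping of credibility labels to graded relevance (for example C1 = 2, C2 = 1, C3 = 0); the slides do not state which variant or mapping was used.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)  # position 0 -> log2(2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG@k normalized by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grades of tweets in the order a ranker returned them
# (hypothetical values).
ranked_relevances = [2, 0, 1, 2, 0]
print(ndcg_at_k(ranked_relevances, 3))
```

A perfectly ordered list scores 1.0, and pushing highly credible tweets further down the ranking lowers the score.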

Identity Problem

@BarakObama

@BarackObama

@theUSpresident

Which one is real?

Presenter notes: Which one is the real Obama? Dr. Manmohan Singh?

Why?

• Security applications
• Detect malicious user accounts
• Detect compromised user accounts
• Automatic social aggregation
• Smartly aggregating information, managing privacy risks via other measures
• Characterizing user behavior across OSNs
• User activities across OSNs?
• Targeted phishing / spam attacks

Our Approach

GROUND TRUTH ??

Our Approach (contd.)

Content: words in tweets, hashtags

Meta data: gender, age, location

Links: replied, mentions, retweets, following, followers

Presenter notes: Content- Tweets Links-

Initial Results

Tweets about ‘Romney’ and ‘Massachusetts’ frequently, with TF-IDF scores of 0.16 and 0.13

Fake profile tweets about ‘dudes’ and ‘excuses’, with TF-IDF scores of 0.168 and 0.15

4 out of 6 articles mentioning the US president talk about him mentioning ‘Romney’ and ‘Massachusetts’, with average TF-IDF scores of 0.093 and 0.085

Initial Results

Talks about ‘CSIR’ most frequently, with a TF-IDF score of 0.1

TOI articles (2/2) mention ‘CSIR’ with respect to the PMO, with an average TF-IDF score of 0.09

A fake profile talks about a ‘polar’ ‘satellite’ ‘launch’, with a TF-IDF score of 0.206
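The TF-IDF scores quoted in these results weight a term by how often it occurs in one document relative to how common it is across all documents. A minimal sketch follows, assuming the plain variant tf = count / document length and idf = ln(N / document frequency); the slides do not say which TF-IDF variant or tokenization was used, and the toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus: each document stands in for the text of one profile or article.
docs = [
    "romney spoke in massachusetts about romney".split(),
    "dudes making excuses in class".split(),
    "launch of the polar satellite in india".split(),
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
print(top_term, round(scores[0][top_term], 4))
```

Comparing the top-scoring terms of a profile against those of news articles about the same person is one way to read the comparison these slides describe.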

Current Work

Research Tasks

• Task 1: Identify factors to compute the magnitude / severity of misinformation using predictive models

• Task 2: Building mathematical graph-based models for misinformation cascades

• Task 3: Designing and formulating strategies to mitigate misinformation propagation on social networks

• Task 4: Evaluating and Prototyping

Proposed Methodology

Motivation: Saffir-Simpson scale

• Scale to measure hurricane category
• Based on wind speed
• Adapting to online social media
• Based on speed of propagation
• Identify other factors

Image: https://en.wikipedia.org/wiki/Saffir%E2%80%93Simpson_hurricane_wind_scale

Compute the magnitude of a misinformation

• Identify factors that affect misinformation propagation
• Based on users who are propagating
• Topic of information
• Location of an event

• Develop predictive models

Graph-based Models for Misinformation Diffusion

• Various models exist for information diffusion
• SIR, SIS and SEIS
• Threshold and independent cascade models

• Literature shows features and properties of rumor and true content are distinct

• Need to build new mathematical models for misinformation propagation
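As a reference point for the existing diffusion models named above, here is a minimal sketch of the independent cascade model on a small directed graph. The graph, the single shared activation probability `prob`, and the function name are illustrative assumptions; this is the textbook model, not the new misinformation-specific model the slides call for.

```python
import random
from collections import deque


def independent_cascade(followers, seeds, prob, rng=None):
    """Simulate one run of the independent cascade model.

    followers maps each user to the users who see that user's posts.
    Each newly activated user gets exactly one chance to activate each
    still-inactive follower, succeeding with probability prob.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so runs are repeatable
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        user = frontier.popleft()
        for neighbor in followers.get(user, ()):
            if neighbor not in active and rng.random() < prob:
                active.add(neighbor)
                frontier.append(neighbor)
    return active


# Toy follower graph (hypothetical users).
followers = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
    "e": ["a"],
}
reached = independent_cascade(followers, seeds=["a"], prob=0.5)
print(sorted(reached))
```

Averaging the size of `reached` over many runs with different random seeds gives an estimate of the expected spread from a given seed set.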

Literature Review

• Friggeri et al. tracked the propagation of numerous rumor cascades on Facebook. Their results showed that rumor cascades run deeper than normal re-share cascades on Facebook.

• Mendoza et al. compared rumor and true news tweets and found that tweets related to rumors contained more questions than tweets containing true news.

• From our preliminary work on events such as Hurricane Sandy, we concluded that the temporal, network and user-based properties of rumor tweets are distinct from those of true news tweets.