Misinformation in Social Media
Anupam Joshi
Oros Family Professor and Chair, CSEE
Director, UMBC Center for Cybersecurity
University of Maryland Baltimore County
Power of Social Media
300 hours of video uploaded
every minute
500 million tweets posted
every day
1.44 Billion monthly active
users
60 million photos shared
every day
* 2015 Statistics
Motivation
• Social media sites are a rich source of information about current events, and they supplement traditional sensors
• Events and accidents are reported on social networks even before they appear on news channels
  • E.g., tweets on the 29th July 2009 earthquake in Southern California appeared a few seconds later, while official news emerged 4 minutes later
• Social networks often provide information where more formal channels are censored
  • E.g., Arab Spring, Iran Election 2009
  • Iranians used Twitter, Flickr, YouTube and some blogs to protest and communicate with the outside world
  • #IranElection, #Ahmadinejad, #Mousavi, and #Tehran trended on Twitter
  • YouTube set up various channels for uploading videos shot on cellphones and video cameras
  • “Iran Protests”, “Iran Riots 2009” and “Tehran Protests” were popular tags on Flickr after the election
  • On Facebook, the IRAN page had around 40,000 fans
Areas of Interest
• Building a scalable infrastructure to harvest social media data
• Analysis of social media text and relationship (graph) data for (social) situational awareness:
  • Detect events and their attributes
  • Temporal evolution
  • Detect communities
  • Detect sentiment
  • Detect and prevent misinformation
Analytics
• Can use both the social network and the content
• Can discover a variety of things
  • Individual tastes and preferences
  • Groups
  • Influential individuals
  • Identity across social media
• Does privacy still exist?
A Pessimistic View on Privacy
• Mankind is a social animal; we have a “need to share.”
• Internet-enabled social media scales that up
  • McLuhan’s view of changing media affecting social organization
• Once data is “gone”, can it ever be controlled?
  • Especially if the economic incentives are aligned against it
• Is “privacy” the norm in human affairs?
“I did not inhale” in the Internet Age
• Hat tip: David Chadwick and Ravi Sandhu
• Curating data about self is much harder when “youthful indiscretions” are on social media
• Curating data about self might not be sufficient: empowerment of “whisper campaigns”
  • Nikki Haley in South Carolina
• Privacy vs Integrity
Social Media meets Mobile Phones
• Embedded sensors in mobile phones add significant data to social media
  • Often not controlled by the users
• Are “18th century laws” sufficient to argue about these questions?
  • U.S. v. Jones, 10-1259
  • What does “expectation of privacy” mean?
Who owns the data anyway
• Does the creator of the data own it?
• Do you turn it over to the “service”, and they own it?
• I can haz your facebook password, or: can I make your access to some service conditional upon you sharing this data?
• What about data that is created as a side effect of your explicit actions?
  • Location data collected by telcos or “environment” creators
Legislation, Schmeligslation
• Legislation is often thought of as a solution
  • Do we want “laws to catch up to the internet”?
  • The debate over CISPA
• Whose legislation is it when these services cross jurisdictional boundaries?
  • Explicit censorship
  • Implicit “good behavior” requests
Misinformation on Social Media
Misinformation Tweets
FAKE
RUMORS
Background: Hurricane Sandy
• Dates: Oct 22-31, 2012
• Damages worth $75 billion
• Coast of NE America
Faking Sandy: Characterizing and Identifying Fake Images on Twitter during Hurricane Sandy. Aditi Gupta, Hemank Lamba, Ponnurangam Kumaraguru and Anupam Joshi. Accepted at the 2nd International Workshop on Privacy and Security in Online Social Media (PSOSM), in conjunction with the 22nd International World Wide Web Conference (WWW), Rio de Janeiro, Brazil, 2013. Best Paper Award.
Fake Image Tweets
Data Description
Total tweets 1,782,526
Total unique users 1,174,266
Tweets with URLs 622,860
Tweets with fake images 10,350
Users with fake images 10,215
Tweets with real images 5,767
Users with real images 5,678
Network Analysis
Tweet-retweet graph for the propagation of fake images during the first 2 hours
Node -> User ID
Edge -> Retweet
Role of Twitter Network
• Analyzed the role of the follower network in fake image propagation
• Crawled the Twitter network for all users who tweeted the fake image URLs
Graph 1 - Nodes: Users, Edges: Retweets
Graph 2 - Nodes: Users, Edges: Follow relationships
Network Overlap Algorithm
Results
Total edges in retweet network 10,508
Total edges in follower-followee network 10,799,122
Common edges 1,215
Percentage overlap 11%
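The overlap figures above compare the edges of the retweet graph with the edges of the follower graph. The slide that held the original algorithm did not survive extraction, so the following is only a minimal sketch of one way such an overlap could be computed. It assumes both graphs are given as directed (source, target) pairs with matching direction conventions; the function name `edge_overlap` and the toy user IDs are hypothetical.

```python
def edge_overlap(retweet_edges, follow_edges):
    """Return the retweet edges that are also follow edges, and their share.

    Both inputs are iterables of (source_user, target_user) pairs.
    Assumed convention: a retweet edge (a, b) means "a retweeted b" and
    a follow edge (a, b) means "a follows b".
    """
    retweets = set(retweet_edges)
    follows = set(follow_edges)
    common = retweets & follows
    # Share of retweet edges that coincide with a follow relationship.
    fraction = len(common) / len(retweets) if retweets else 0.0
    return common, fraction


# Toy data with hypothetical user IDs (not taken from the study).
retweet_edges = [("u1", "u2"), ("u3", "u2"), ("u4", "u1"), ("u5", "u3")]
follow_edges = [("u1", "u2"), ("u2", "u1"), ("u5", "u3"), ("u6", "u2")]

common, fraction = edge_overlap(retweet_edges, follow_edges)
print(sorted(common), f"{fraction:.0%}")
```

Using sets keeps the comparison linear in the number of edges, which matters when the follower graph has millions of edges.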
Classification
5-fold cross validation
Tweet Features [F2]
Length of Tweet
Number of Words
Contains Question Mark?
Contains Exclamation Mark?
Number of Question Marks
Number of Exclamation Marks
Contains Happy Emoticon
Contains Sad Emoticon
Contains First Order Pronoun
Contains Second Order Pronoun
Contains Third Order Pronoun
Number of uppercase characters
Number of negative sentiment words
Number of positive sentiment words
Number of mentions
Number of hashtags
Number of URLs
Retweet count
User Features [F1]
Number of Friends
Number of Followers
Follower-Friend Ratio
Number of times listed
User has a URL
User is a verified user
Age of user account
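Several of the tweet features listed above can be derived from the raw tweet text alone. The sketch below illustrates a handful of them under simplifying assumptions: the regular expressions for URLs, mentions and hashtags are rough approximations, the function name `tweet_features` is hypothetical, and emoticon, pronoun and sentiment features are left out because they need word lists that the slides do not give.

```python
import re

# Rough patterns; real tweet parsing has more edge cases than these cover.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def tweet_features(text):
    """Compute a few content-based features from the text of one tweet."""
    return {
        "length": len(text),
        "num_words": len(text.split()),
        "has_question_mark": "?" in text,
        "has_exclamation_mark": "!" in text,
        "num_question_marks": text.count("?"),
        "num_exclamation_marks": text.count("!"),
        "num_uppercase": sum(1 for ch in text if ch.isupper()),
        "num_mentions": len(MENTION_RE.findall(text)),
        "num_hashtags": len(HASHTAG_RE.findall(text)),
        "num_urls": len(URL_RE.findall(text)),
    }


# Made-up example tweet, not from the data set.
features = tweet_features("WOW!! Is this real? #Sandy @user http://example.com/pic")
print(features)
```

A feature dictionary like this can be turned into one row of the matrix that a classifier such as Naïve Bayes or a decision tree is trained on.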
Classification Results
F1 (user) F2 (tweet) F1+F2
Naïve Bayes 56.32% 91.97% 91.52%
Decision Tree 53.24% 97.65% 96.65%
• The best results came from the Decision Tree classifier, which reached about 97% accuracy in distinguishing fake images from real ones.
• Tweet-based features are very effective at distinguishing fake-image tweets from real ones, while user-based features performed very poorly.
Building and Evaluating a Real-time System
• Learning-to-rank model for assessing the credibility of tweets
• Model based on ground truth data for 25 real-world events and 45 features
• System evaluated in a year-long real-world experiment
• 1800+ users requested credibility scores for more than 14.2 million tweets
TweetCred: Real-Time Credibility Assessment of Content on Twitter. Aditi Gupta, Ponnurangam Kumaraguru, Carlos Castillo and Patrick Meier. Proceedings of the 6th International Conference on Social Informatics (SocInfo), Barcelona, Spain, 2014. Honorable Mention for Best Paper.
TweetCred Score
Score
User Feedback
Features for Real-time Analysis
Feature set: Features (45)
Tweet meta-data: Number of seconds since the tweet; Source of tweet (mobile / web / etc.); Tweet contains geo-coordinates
Tweet content (simple): Number of characters; Number of words; Number of URLs; Number of hashtags; Number of unique characters; Presence of stock symbol; Presence of happy smiley; Presence of sad smiley; Tweet contains `via'; Presence of colon symbol
Tweet content (linguistic): Presence of swear words; Presence of negative emotion words; Presence of positive emotion words; Presence of pronouns; Mention of self words in tweet (I; my; mine)
Tweet author: Number of followers; friends; time since the user has been on Twitter; etc.
Tweet network: Number of retweets; Number of mentions; Tweet is a reply; Tweet is a retweet
Tweet links: WOT score for the URL; Ratio of likes / dislikes for a YouTube video
Annotation
• 500 tweets per event (six different events)
• Used CrowdFlower
• Step 1
  • R1. Contains information about the event
  • R2. Is related to the event, but contains no information
  • R3. Not related to the event
  • R4. Skip tweet
  * 45% (class R1), 40% (class R2), and 15% (class R3)
• Step 2
  • C1. Definitely credible
  • C2. Seems credible
  • C3. Definitely incredible
  • C4. Skip tweet
  * 52% (class C1), 35% (class C2), and 13% (class C3)
Ranking Model Evaluation
                  AdaRank     Coord. Ascent  RankBoost   SVM-rank
NDCG@25           0.6773      0.5358         0.6736      0.3951
NDCG@50           0.6861      0.5194         0.6825      0.4919
NDCG@75           0.6949      0.7521         0.689       0.6188
NDCG@100          0.6669      0.7607         0.6826      0.7219
Time (training)   35-40 secs  1 min          35-40 secs  9-10 secs
Time (testing)    <1 sec      <1 sec         <1 sec      <1 sec
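The table above reports NDCG@k, a standard measure of how well a ranking places the most relevant (here, most credible) items near the top. The slides do not say which NDCG variant was used, so the following is a minimal sketch that assumes linear gain and a log2 position discount; the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranking.

    relevances: graded relevance labels in ranked order (best-ranked first).
    Assumes linear gain; some variants use 2**rel - 1 instead.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical credibility labels for five tweets, in the order a model ranked them.
ranked_labels = [3, 2, 3, 0, 1]
print(round(ndcg_at_k(ranked_labels, 5), 4))
```

A perfectly ordered list scores 1.0, and any ordering that pushes highly credible tweets down the list scores lower.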
Identity Problem
@BarakObama
@BarackObama
@theUSpresident
Which one is real??
Why?
• Security applications
  • Detect malicious user accounts!
  • Detect compromised user accounts!
• Automatic social aggregation
  • Smartly aggregating information, managing privacy risks via other measures
• Characterizing user behavior across OSNs
  • User activities across OSNs?
• Targeted phishing / spam attacks
Our Approach
GROUND TRUTH ??
Our Approach (Contd.)
Content: words in tweets; hashtags
Meta data: gender; age; location
Links: replied; mentions; retweets; following; followers
Initial Results
Tweets about ‘Romney’ and ‘Massachusetts’ frequently, with TF-IDF scores of 0.16 and 0.13
Fake profile tweets about ‘dudes’ and ‘excuses’, with TF-IDF scores of 0.168 and 0.15
4 out of 6 articles mentioning the US president talk about him mentioning ‘Romney’ and ‘Massachusetts’, with average TF-IDF scores of 0.093 and 0.085
Initial Results
Talks about ‘CSIR’ most frequently, with a TF-IDF score of 0.1
TOI articles (2/2) mention ‘CSIR’ with respect to the PMO, with an average TF-IDF score of 0.09
A fake profile talks about a ‘polar’ ‘satellite’ ‘launch’, with a TF-IDF score of 0.206
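The comparisons above rest on TF-IDF scores of the terms a profile uses. The slides do not specify the weighting variant, so this is only a sketch of one common form: term frequency normalized by document length, multiplied by the natural log of the inverse document frequency. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents: list of token lists (e.g. the tokenized tweets of one profile).
    Assumed variant: tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(documents)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy token lists (made up for illustration).
docs = [
    ["romney", "debate", "romney"],
    ["dudes", "excuses", "debate"],
    ["satellite", "launch", "debate"],
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
print(top_term, round(scores[0][top_term], 3))
```

Terms that occur in every document get a score of zero under this variant, so the highest-scoring terms are the ones that set a profile apart from the others.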
Current Work
Research Tasks
• Task 1: Identify factors to compute the magnitude / severity of misinformation using predictive models
• Task 2: Building mathematical graph-based models for misinformation cascades
• Task 3: Designing and formulating strategies to mitigate misinformation propagation on social networks
• Task 4: Evaluating and Prototyping
Proposed Methodology
Motivation: Saffir-Simpson scale
• Scale to measure hurricane category
  • Based on wind speed
• Adapting to online social media
  • Based on speed of propagation
  • Identify other factors
Image: https://en.wikipedia.org/wiki/Saffir%E2%80%93Simpson_hurricane_wind_scale
Compute the magnitude of a misinformation
• Identify factors that affect misinformation propagation
  • Based on users who are propagating
  • Topic of information
  • Location of an event
• Develop predictive models
Graph-based Models for Misinformation Diffusion
• Various models exist for information diffusion
  • SIR, SIS and SEIS
  • Threshold and independent cascade models
• Literature shows features and properties of rumor and true content are distinct
• Need to build new mathematical models for misinformation propagation
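As a point of reference for the existing diffusion models named above, the independent cascade model can be sketched in a few lines. This is a generic textbook-style simulation, not the new misinformation model the slides propose; the uniform activation probability, the function name and the toy graph are all assumptions made for illustration.

```python
import random


def independent_cascade(graph, seeds, prob, rng):
    """Simulate one run of the independent cascade model.

    graph: dict mapping each node to a list of its out-neighbours.
    seeds: nodes that are active at step 0.
    prob:  probability that a newly active node activates each inactive
           neighbour (assumed uniform across all edges).
    rng:   a random.Random instance, passed in so runs are reproducible.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            # Each newly active node gets exactly one attempt per neighbour.
            for neighbour in graph.get(node, []):
                if neighbour not in active and rng.random() < prob:
                    active.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return active


# Toy follower graph with hypothetical users: an edge a -> b means b sees a's posts.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "e": ["a"]}
reached = independent_cascade(graph, seeds=["a"], prob=0.5, rng=random.Random(0))
print(sorted(reached))
```

Averaging the size of the reached set over many runs gives an estimate of expected spread, which is the quantity mitigation strategies would try to reduce.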
Literature Review
• Friggeri et al. tracked the propagation of numerous rumor cascades on Facebook; their results showed that rumor cascades run deeper than normal re-share cascades on Facebook.
• Mendoza et al. compared rumor and true news tweets and found that tweets related to rumors contained more questions than tweets containing true news.
• From our preliminary work for events such as Hurricane Sandy, we concluded that temporal, network and user based properties of rumor tweets are distinct from true news tweets.