Deciphering Social Media Messages for #GE2015
Dr. Giuseppe Di Fatta (SSE)
Associate Professor of Computer Science
Director MSc Advanced Computer Science
School of Systems Engineering , University of Reading
http://www.personal.reading.ac.uk/~sis06gd/
Dr. James Reade (SPEIR)
Lecturer in Economics
School of Politics, Economics and International Relations,
University of Reading
http://www.reading.ac.uk/economics/about/staff/j-j-reade.aspx
Henley Business School, University of Reading, Friday 24 April 2015
Workshop on Big Social Data and Interdisciplinary Analytics
by J. Reade and G. Di Fatta
Outline
• Introduction
– Motivation
– University of Reading initiative on Big Social Data
– Case study on the General Election 2015 (#GE2015)
• Nuts and bolts
– Twitter tracking and tweets gathering
– Tweets mining
– A knowledge discovery process
• Data analysis examples
– Analysis of some key moments during the Leaders’ TV debate
Introduction
• Social media has exploded in recent years.
Introduction
• Social media defined:
• Incredible numbers, incredible potential…
• We are the University of Reading Big Social Data Research Group.
• Formed in summer 2014, covering multiple disciplines across the university.
Form and Function
• Social media are of interest to social scientists:
– Social (and other) networks influence decision making.
– Favouritism, discrimination, bias, loyalty, etc. all influence allocations of resources and outcomes.
– Social media are social networks quantified.
• Social networks publicise and propagate information:
– Information availability crucial in decision making.
– More information = better forecasting, better policy making?
• Social media present huge opportunities:
– But huge challenges: collection, processing, understanding the data.
– Cross-disciplinary collaboration essential.
An Open Multidisciplinary Group
Our group consists of:
• Computer scientists (Di Fatta, Stahl)
– Data Mining and Knowledge Discovery in Databases (KDD): collecting, processing and
extracting useful knowledge from data.
• Mathematicians (Vukadinović Greetham)
– Complex analysis of network dynamics.
• Applied Linguists (Jaworska)
– Extracting meaning from qualitative data.
• Economists (Reade, Nanda)
– Information is fundamental: Where does it appear, how is it propagated? Does it
influence prices/voting behaviour, or vice versa?
• Social scientists
– What can we learn about social (and other types of) interaction and outcomes?
Reading and the General Election
• On 1 March we began collecting tweets related to
politics and the general election
– General election related tweets: #GE2015, #Tories, #Labour, etc.
• In 53 days we’ve collected:
– 13M tweets, i.e. about 250K tweets/day or 2.8 tweets/sec,
– with over 1.8M tweets during the three TV debates alone.
• But what to do with this information?
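As a quick sanity check, the daily and per-second rates follow from the two rounded figures quoted above (13M tweets, 53 days). The variable names are only for illustration.

```python
# Rounded figures quoted on the slide.
total_tweets = 13_000_000
days = 53

tweets_per_day = total_tweets / days
tweets_per_second = tweets_per_day / (24 * 60 * 60)

print(f"{tweets_per_day:,.0f} tweets/day")    # roughly a quarter of a million
print(f"{tweets_per_second:.1f} tweets/sec")
```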
[Figure labels: April 2, April 16]
Sentiment Analysis?
• Simple volume of tweets may be interesting, but is it useful?
• Increasing focus on sentiment, or mood: What do people
think?
– Does mood/sentiment yield predictive power?
– Academic papers have considered stock markets and sports events.
– During election time, sentiment hugely interesting…
• Who is ahead? Do big shifts occur?
• What messages stick? Persistence in sentiment?
Sentiment Analysis?
• Perhaps, however, we have jumped a step:
– Sentiment is a latent concept: We never observe its true value.
– We can try to estimate it but we have no true value to compare against.
Nuts and Bolts
• Twitter, described as "the SMS of the Internet", is an online
social networking service that enables users to send and read
short 140-character messages called "tweets".
– launched in 2006
– photos and short videos can also be embedded
• In 2012, 100 million users, 340 million tweets per day
• In December 2014, more than 500 million users, of whom more than
284 million are active.
• Record tweets: on February 3, 2013, Twitter announced that a
record 24.1 million tweets were sent on the night of the Super Bowl.
Tweets Per Second (TPS) Records
1. 143K TPS: TV broadcast of the anime movie "Castle in the Sky" in Japan in August 2013
– At one point viewers joined forces, sending tweets at the same time to symbolically help the movie's characters cast a spell.
2. 15K TPS: Euro 2012 Finals
– as Spain scored the winning goal against Italy in the 2012 European Championship
3. 10K TPS: Last Minutes of Super Bowl 2012
– as the Giants took the lead on a touchdown with 57 seconds left
…
16. 5.5K TPS: Japanese Earthquake and Tsunami on March 11, 2011
– Twitter turned into an emergency service for many following an 8.9 magnitude
earthquake and subsequent tsunami on Japan’s coast, while in Tokyo the
phone system went down.
Gathering Tweets
• Three methods to retrieve tweets
– Search API
• Representational State Transfer (REST) requests
• max 3200 tweets for each request
• free
– Streaming API
• real-time streaming, OAuth for secure delegated access
• max 1% of the total volume of tweets
• free
– Firehose
• real-time streaming
• unlimited and guaranteed
• not free: only from Twitter commercial partners (e.g., DATASIFT)
Twitter Tracking and Tweets Collection
• Tracking terms on the Twitter Streaming API and gathering all
tweets which match them.
– more than 30 tracked terms, e.g.: ge2015, uklabour, conservative,
votetories, ukip, voteukip, LibDems, GreenParty, SNP, etc.
• But what if you track “Cameron”?
Cameron Dallas is an 18-year-old Vine celebrity.
Vine is a short video sharing service and microblogging website.
Twitter Tracking and Tweets Collection
• And what if you track “Labour”?
Twitter Tracking and Tweets Collection
• New software has been developed.
– Objective: collect all tweets from March to May 2015 that are related to
UK politics.
– The software tracks terms on the Twitter Streaming API and gathers all
tweets which match them.
1. tracked terms (~30)
– e.g., ge2015, uklabour, votelabour, conservative, votetories, ukip, voteukip,
LibDems, GreenParty, SNP, etc.
2. tracked terms that require a context check
– e.g. labour, greens, etc.
3. terms for context check (~50)
– e.g., government, politic, vote, election, parliament, economy, etc.
4. rejected terms
– e.g. USA, Canada, Clinton, TCOT, etc.
5. equivalent terms for aggregation of party references
– e.g. Tories, Tory, voteTories, Conservatives, etc.
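The five term lists above can be combined into a simple filter. The following is a minimal sketch, not the group's actual software: the term sets are abbreviated from the examples on the slide, the party labels are assumed, and the rule ordering (rejected terms win, context terms match as prefixes) is an assumption made for illustration.

```python
import re

# Hypothetical term lists, abbreviated from the examples on the slide.
TRACKED = {"ge2015", "uklabour", "votelabour", "conservative", "votetories",
           "ukip", "voteukip", "libdems", "greenparty", "snp"}
NEEDS_CONTEXT = {"labour", "greens"}
CONTEXT = {"government", "politic", "vote", "election", "parliament", "economy"}
REJECTED = {"usa", "canada", "clinton", "tcot"}
# Equivalent terms mapped to one party label (labels are assumed).
EQUIVALENT = {"tories": "Conservative", "tory": "Conservative",
              "votetories": "Conservative", "conservatives": "Conservative",
              "labour": "Labour", "uklabour": "Labour", "votelabour": "Labour"}

def tokens(text):
    """Lower-case word tokens; '#' and '@' prefixes are dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def is_political(text):
    """Apply the rejected, tracked and context-check rules, in that order."""
    words = tokens(text)
    if any(w in REJECTED for w in words):
        return False
    if any(w in TRACKED for w in words):
        return True
    if any(w in NEEDS_CONTEXT for w in words):
        # Context terms match as prefixes, so "politic" also covers "politics".
        return any(w.startswith(c) for w in words for c in CONTEXT)
    return False

def parties(text):
    """Aggregate equivalent terms into a sorted list of party labels."""
    return sorted({EQUIVALENT[w] for w in tokens(text) if w in EQUIVALENT})

print(is_political("My wife went into labour this morning!"))  # no context term
print(is_political("Labour would win my vote"))                # context term present
print(parties("The Tories and Labour clash"))
```

With these lists, an ambiguous term such as "labour" is kept only when a context term appears in the same tweet, which is the behaviour the context check is meant to provide.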
A Multi-Threaded Process
• There are three concurrent threads of execution which never stop:
1. the tweets consumer, which
• manages the stream of tweets for the tracked terms,
• receives and processes tweets from Twitter in real time and
• stores them in secondary storage
2. the controller, which
• checks that the tweets consumer is working properly
• and, if not, starts a new consumer
3. the observer, which
• generates and sends periodic summaries by email
• Further analytics are generated off-line by additional processing, such as
generation of
– counts, word clouds, co-occurrence of terms, sentiment index
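The three-thread arrangement can be sketched with Python's standard threading module. This is an illustration under stated assumptions, not the actual software: `fake_stream` is a hypothetical stand-in for the Streaming API connection that drops after a few tweets, a plain list stands in for secondary storage, reports are collected in a list instead of being emailed, and a stop event is added so the demo terminates (the real threads never stop).

```python
import threading
import time

stop = threading.Event()   # lets this demo end; the threads on the slide never stop
store = []                 # stand-in for secondary storage (e.g. files on disk)
lock = threading.Lock()
starts = []                # start time of every consumer launched by the controller

def fake_stream(fail_after=10):
    """Hypothetical stand-in for the Streaming API: yields tweets, then drops."""
    for n in range(fail_after):
        if stop.is_set():
            return
        time.sleep(0.01)
        yield f"tweet {n}"
    raise ConnectionError("stream dropped")

def consumer():
    """Receive tweets and store them; ends quietly if the stream drops."""
    try:
        for tweet in fake_stream():
            with lock:
                store.append(tweet)
    except ConnectionError:
        pass  # the thread ends; the controller will notice and replace it

def controller(poll=0.05):
    """Check that a consumer is alive and start a new one if it is not."""
    worker = None
    while not stop.is_set():
        if worker is None or not worker.is_alive():
            worker = threading.Thread(target=consumer, daemon=True)
            worker.start()
            starts.append(time.monotonic())
        stop.wait(poll)

def observer(reports, period=0.2):
    """Produce a periodic summary (emailed in the real system)."""
    while not stop.wait(period):
        with lock:
            reports.append(f"{len(store)} tweets stored so far")

def run_demo(seconds=1.0):
    """Run controller and observer for a short time and return the summaries."""
    stop.clear()
    reports = []
    threads = [threading.Thread(target=controller),
               threading.Thread(target=observer, args=(reports,))]
    for t in threads:
        t.start()
    time.sleep(seconds)
    stop.set()
    for t in threads:
        t.join()
    return reports
```

Calling `run_demo()` returns the list of summaries; because the fake stream drops repeatedly, `starts` ends up with more than one entry, showing the controller replacing the consumer.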
Tweets Mining
• Term frequency
– Tweets as bag of words for computing
• Frequent tracked terms
• Frequent words
– Word clouds
• Twitter Sentiment Index
– A list of adjectives has been extracted from ‘political’ tweets
– Each adjective has been classified as positive, negative or neutral by
several team members.
– If a party or one of its equivalent terms is present in a tweet, positive
and negative adjectives contribute to a sentiment index for the party.
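A minimal sketch of such an index follows. The slide does not give the exact formula, so the scoring here (+1 per positive adjective, -1 per negative adjective, summed over tweets that mention a party) is an assumption, and the adjective lexicon and party terms are hypothetical placeholders for the lists compiled by the team.

```python
import re
from collections import defaultdict

# Hypothetical lexicon and party terms; the real lists were compiled by the team.
POSITIVE = {"good", "strong", "honest", "great"}
NEGATIVE = {"bad", "weak", "dishonest", "awful"}
PARTY_TERMS = {
    "Conservative": {"tories", "tory", "votetories", "conservatives"},
    "Labour": {"labour", "uklabour", "votelabour"},
    "SNP": {"snp"},
}

def sentiment_index(tweets):
    """Sum +1 per positive and -1 per negative adjective over tweets naming each party."""
    index = defaultdict(int)
    for text in tweets:
        words = re.findall(r"[a-z0-9']+", text.lower())
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        for party, terms in PARTY_TERMS.items():
            if any(w in terms for w in words):
                index[party] += score
    return dict(index)

sample = [
    "Strong and honest performance from the SNP",
    "Tories looked weak tonight",
    "Labour good, Tories bad",
    "Nice weather today",
]
print(sentiment_index(sample))
```

Note the limitation visible in the third sample tweet: when two parties appear together, both receive the same net score, because this rule does not attribute an adjective to a particular party.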
A Knowledge Discovery Process
• A process of knowledge discovery from social media data streams (Twitter)
[Diagram components: Streaming API; data gathering, filtering and in-line analytics; data storage; off-line data analytics]
To join the mailing list please contact <[email protected]>
Blog URL: http://blogs.reading.ac.uk/reading-general-election-blog/
1h and 24h automatic reports sent to:
From March 01: 13M tweets, currently 350K tweets per day
Number of Tweets per day
• as reported by the observer by email at midnight
• Important TV events:
– debate-1 on 26/03/2015: “Cameron & Miliband: The Battle for Number 10”
– debate-2 on 02/04/2015: “Leaders’ debate”
– debate-3 on 16/04/2015: “Challengers’ debate”
[Figure: number of tweets per day, with debate-1, debate-2 and debate-3 marked]
Twitter Boom on April the 2nd
Leaders’ Debate (02-04-2015, 20:00-22:00)
– If you missed the TV debate, you can watch it on YouTube:
• https://www.youtube.com/watch?v=7Sv2AOQBd_s
• ‘political’ tweets on the entire day (24h)
– recorded: 800,350
– #leadersdebate: 438,944
– missed: 175,959 (18%) (because of Twitter track-limit)
– ext. total: 976,309
• TV debate related/induced tweets from 19:00 to 24:00
– recorded: 614,800
– ext. total: 790,759
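The totals above can be reproduced from the recorded and missed counts. The sketch below only restates the slide's figures; the variable names are illustrative, and adding the full missed count to the 19:00 to 24:00 window assumes that all missed tweets fell within that window, which is what the slide's numbers imply.

```python
recorded_24h = 800_350
hashtag_leadersdebate = 438_944
missed = 175_959             # not delivered because of the Twitter track limit
recorded_evening = 614_800   # 19:00 to 24:00

total_24h = recorded_24h + missed
missed_share = missed / total_24h
# Assumption implied by the slide: every missed tweet fell in the evening window.
total_evening = recorded_evening + missed
hashtag_share = hashtag_leadersdebate / recorded_24h

print(total_24h)               # 976309
print(f"{missed_share:.0%}")   # 18%
print(total_evening)           # 790759
print(f"{hashtag_share:.0%}")  # share of recorded tweets with #leadersdebate
```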
Leaders’ Debate (02-04-2015)
• The number of tweets with a reference to a party
[Figure: number of tweets per 5-minute interval, with the debate period marked]
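Counting party references per 5-minute interval amounts to rounding each tweet's timestamp down to the start of its interval and counting. A minimal sketch, assuming tweets are already reduced to hypothetical (timestamp, party) pairs and that the interval length divides 60:

```python
from collections import Counter
from datetime import datetime, timedelta

def bucket(ts, minutes=5):
    """Round a timestamp down to the start of its interval (minutes must divide 60)."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second, microseconds=ts.microsecond)

def counts_per_interval(tweets, minutes=5):
    """tweets: iterable of (timestamp, party) pairs; keys are (interval_start, party)."""
    return Counter((bucket(ts, minutes), party) for ts, party in tweets)

# Made-up sample data for illustration only.
sample = [
    (datetime(2015, 4, 2, 20, 54, 30), "UKIP"),
    (datetime(2015, 4, 2, 20, 55, 10), "SNP"),
    (datetime(2015, 4, 2, 20, 57, 0), "SNP"),
    (datetime(2015, 4, 2, 20, 59, 59), "SNP"),
    (datetime(2015, 4, 2, 21, 0, 0), "SNP"),
]
counts = counts_per_interval(sample)
for (start, party), n in sorted(counts.items()):
    print(start.strftime("%H:%M"), party, n)
```

Passing `minutes=1` gives the finer 1-minute resolution used for the key moments of the debate.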
Leaders’ Debate (02-04-2015)
• Twitter Sentiment Index: before, during and after the debate
[Figure: Twitter Sentiment Index per 5-minute interval, with the debate period marked]
Leaders’ Debate (02-04-2015)
• Two ‘interesting’ moments during the debate
– Two ‘interesting’ time intervals following those moments
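[Figure: Twitter Sentiment Index per 1-minute interval, marking moment #1 at 20:55 (followed by a 10-minute interval) and moment #2 at 21:35 (followed by a 20-minute interval)]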
Leaders’ Debate (02-04-2015, 20:54)
#1 @20:54: Nigel Farage’s controversial statement (18-second video)
Leaders’ Debate (02-04-2015, 20:55)
#1 @20:55: Nicola Sturgeon’s reply to Nigel Farage (8-second video)
Leaders’ Debate (02-04-2015, #1)
• #1: word cloud for tweets referring to “SNP” from 21:02 to 21:12
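The input to such a word cloud is a table of word frequencies for the tweets that mention the term within the time window. A minimal sketch, assuming tweets are available as hypothetical (timestamp, text) pairs; the stop-word list is a small placeholder, and leaving the search term itself out of the counts is a design choice made here, not something stated on the slide.

```python
import re
from collections import Counter
from datetime import datetime

STOPWORDS = {"the", "a", "to", "of", "and", "is", "in", "for", "on", "rt"}  # placeholder list

def window_word_counts(tweets, term, start, end):
    """Count words in tweets that mention `term` and fall in [start, end)."""
    term = term.lower()
    counts = Counter()
    for ts, text in tweets:
        words = re.findall(r"[a-z0-9']+", text.lower())
        if start <= ts < end and term in words:
            counts.update(w for w in words if w not in STOPWORDS and w != term)
    return counts

# Made-up sample data for illustration only.
sample = [
    (datetime(2015, 4, 2, 21, 3), "SNP leader Sturgeon is brilliant"),
    (datetime(2015, 4, 2, 21, 5), "Sturgeon for PM! #SNP"),
    (datetime(2015, 4, 2, 21, 20), "SNP again"),
    (datetime(2015, 4, 2, 21, 4), "Farage is wrong"),
]
counts = window_word_counts(sample, "SNP",
                            datetime(2015, 4, 2, 21, 2),
                            datetime(2015, 4, 2, 21, 12))
print(counts.most_common(3))
```

The resulting counts can be passed to any word-cloud renderer, with font size proportional to frequency.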
Leaders’ Debate (02-04-2015, 21:35)
#2 @21:35: Nicola Sturgeon’s statement (12-second video)
Leaders’ Debate (02-04-2015, #2)
• #2: word cloud for tweets referring to “SNP” from 21:40 to 22:00
Conclusions
• We have been collecting Big Social Data
– tweets about UK politics and GE2015 from March 2015
• Simple real-time analysis and more complex off-line analytics can provide interesting insights.
• We will use the data in the future to test research ideas on
– Text mining
– Data visualisation
– Complex networks (social networks)
– Economics and Politics
Acknowledgments:
• Prof. Steven Mithen (Deputy VC) for supporting this project, as well as HBS, SPEIR, SSE, SLL
Questions?