Eavesdropping on the Twitter Microblogging Site
EAVESDROPPING ON THE TWITTER MICROBLOGGING SITE
Summer Institute on Distance Learning and Instructional Technology (SIDLIT 2016)
August 4-5, 2016
OVERVIEW
Research analysts go to Twitter to capture the general trends of public conversations, identify and profile influential accounts, and extract subgroups within larger collectives and larger discourses; they also go to eavesdrop on individual self-talk and individual-to-individual conversations. So what, technically, is in your tweets? Dave Rosenberg famously asked this in a CNET article (2010). The answer: a whole lot more than 140 characters. How are the most influential social media accounts identified through #hashtag graphs? How are themes extracted? How are sentiments understood? How can users be profiled through their Tweetstreams? How can locations be mapped in terms of the Twitter conversations occurring in particular physical areas? How can live and trending issues be identified and categorized in terms of sentiment (positive, negative, and neutral)? This presentation will summarize some of the free and open-source tools, as well as commercial and proprietary ones, that enable increased knowability.
2
ABOUT TWITTER
3
TWITTER DEMOGRAPHICS
320 million monthly active users
A billion unique visits monthly to sites with embedded Tweets
80% of active users on mobile
Support for 35 languages (“About Twitter,” 2015)
4
COUNTRIES AND CITIES WITH LOCAL TRENDING TOPICS IN TWITTER (BY FOBOS92)
5
BUSINESS MODEL
Runs on advertising based on delivery of human attention and encouragement of certain types of consumption
Intersperses funded commercial messaging into regular messaging
Enables fine-tuned targeting of desirable audiences along with various metrics (“Advertising on Twitter”)
6
140 CHARACTERS
Geographical Coverage
Foremost microblogging site in major parts of the world and a main part of the social ecosystem
Blocked in some countries: Turkey, Iran, China, and North Korea (Liebelson, Mar. 28, 2014) and with some temporary phases of inaccessibility in a number of other countries (“Censorship of Twitter,” Apr. 20, 2016)
  May be based in part on market protectionism for native microblogging services
  May be political
  May be due to a mix of factors
English is the predominant language used
7
140 CHARACTERS (CONT.)
User Accounts
Verified and unverified user accounts (with regular efforts to clean off spam accounts)
Human, robot, sensor, and cyborg accounts
  User-created profile data
  Image data, video data
  Start-date of account
  Tweets, following, followers, likes, and lists (account status)
Notifications for raising sociality: Who is following you, who retweeted you, who replied, and others
8
140 CHARACTERS (CONT.)
Data Types
Content data: Text messages, URL links (including shortened form links), images, Vine video snippets
Trace data: Online social network relationships based on replies, liking, mentions, following / unfollowing, addressing @ accounts, #hashtagging around a shared topic of interest, and other types of interactions
Metadata: Geolocational information, exchangeable image format (EXIF) data from imagery, and others
Long memory of contents (constant recording), even of deleted messages; recoverability of data
May have private networks (with publicly-inaccessible, unscrape-able, and otherwise hidden data)
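The content and trace elements listed above can often be pulled straight from the message text. The following is a minimal sketch, assuming simplified regular expressions (hypothetical patterns written for this example, not Twitter's own entity parser) and a made-up sample message.

```python
import re

# Hypothetical patterns for illustration; real tweets need more robust parsing
# (for example, a URL containing "#" would confuse the hashtag pattern).
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")


def extract_entities(text):
    """Split one tweet's text into hashtags, mentions, and URLs."""
    return {
        "hashtags": [h.lower() for h in HASHTAG.findall(text)],
        "mentions": [m.lower() for m in MENTION.findall(text)],
        "urls": URL.findall(text),
    }


# Made-up sample message.
sample = "RT @KState: Join us at #SIDLIT https://example.org/info #edtech"
print(extract_entities(sample))
```

Hashtags and mentions are lowercased here so that the same account or topic is counted once regardless of capitalization; that is a design choice for this sketch, not a requirement of the platform.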
9
140 CHARACTERS (CONT.)
Publicly Accessible Data from the Twitter API and Limits
Twitter API (application programming interface) enables very partial capture of available data on a topic
  Is a rate-limited feature
  Enables access to a few percent of the available messaging
  Requires authenticated sign-in for “whitelisting”
There are challenges with assertability because of the non-random capture of data and inherent limits
For full data, need to go with Gnip as a provider of big and social data from a variety of social media platforms
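Because the API is rate-limited, extraction scripts usually pace their own requests. The following is a minimal sketch of a client-side sliding-window throttle; the class name and the limits passed to it are assumptions for illustration, and the real limits must be read from the current Twitter API documentation.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()    # timestamps of recent calls, oldest first

    def wait(self):
        """Block until another call is permitted, then record it."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
```

A script would call `limiter.wait()` before each request, for example with `RateLimiter(max_calls=15, period=900)`; those two numbers are placeholders, not documented limits.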
10
WAYS THE MICROBLOGGING PLATFORM IS USED BY THE CROWD
#hashtag campaigns and social movements
Streaming live events via Web and mobile
Hosting public conversations with others
Expressing social and political power (and solidarity); commenting on social issues and sparking people to action
Ego expression and social performance
Engaging in real-world role-playing games
Calling out other individuals and groups for certain behaviors through mockery and challenges
Engaging in online socio-cultural traditions like sharing food images, #TBTs, selfie-sharing, and others
11
WAYS THE MICROBLOGGING PLATFORM IS USED BY THE CROWD (CONT.)
Connecting social media endeavors across various platforms (image-sharing, video-sharing, and others)
Driving traffic
Deploying automated agents / robots (‘bots) to elicit information, communicate information, and create a sense of artificial virality
Advertising products and services
Exploring particular social and other phenomena through data elicitation and research
Disseminating information about threats to citizens
  Weather, crime, wildfire, and other data
Sharing weather sensor information; sharing air quality information
12
WAYS THE MICROBLOGGING PLATFORM IS USED BY THE CROWD (CONT.)
Enhancing e-governance (the work of democratic governments through electronic means)
  Eliciting citizen feedback for various proposed laws and endeavors
Making social and professional relationships
Acquiring digital coupons and resources
Understanding certain locations and the interests of certain locations (such as around events)
and others…
13
WHY A GOOD SOURCE FOR RESEARCH?
Cyber interlinked with the physical world (cyber-physical confluence)
  May target a particular area to capture microblogging messages being shared in near-real time
Social media platform where people congregate and interact (particularly through mobile devices) and a culture of hyper-sharing
  Is regularly integrated with mainstream media
  Includes highly dynamic data
Ability to share via any language expressible via UTF-8 Unicode character set and with multimedia and with links
14
WHY A GOOD SOURCE FOR RESEARCH? (CONT.)
Data leakage (unintended sharing of information)
  Human impulsivity, with personal guard up; near-constant “status updates” to an imagined audience
  Lack of full control in terms of strategic information sharing
  Inadvertent digital recording of imagery / sound / audio
  Metadata capture (such as EXIF data)
  Narrow-casting to an intended audience but broadcasting to all
  Assumption of ephemeral interactions and erase-ability of messages
  Accidental “send”
  Self-talking / talking to self in online public spaces
  Accidental change to privacy settings
  Letting an untrustworthy member into a private network
15
WHY A GOOD SOURCE FOR RESEARCH? (CONT.)
Human analytical capabilities applied to the data
  May engage in close readings for understandings
  May engage the imagery
  May engage the language
  May engage the URLs
  May engage the public reputations and interrelationships
Ability to collate data across regions, user accounts, topics, events, and other elements using various applications (that tap into the Twitter API)
Ability to access full sets of “N” through commercial means for research purposes
16
WHY A GOOD SOURCE FOR RESEARCH? (CONT.)
Ability to apply state-of-the-art computational analytics
  Social network analysis
  Text analyses
  Linguistic analysis
  Word clusters
  Word co-occurrence / matrix analyses, and others
  Sentiment analysis
  Emotion analysis
  Theme and sub-theme extraction / topic modeling
  Geographical mapping
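As one concrete instance of the word co-occurrence analyses named above, the following is a minimal sketch that counts how often two words appear in the same message. Whitespace tokenization and the three sample messages are assumptions made for brevity; dedicated text-analysis tools do considerably more cleaning.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(tweets):
    """Count how often each unordered word pair shares a tweet."""
    pairs = Counter()
    for text in tweets:
        # Sorting gives each pair one canonical order; set() counts a word once per tweet.
        words = sorted(set(text.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs


# Made-up messages for illustration.
tweets = ["storm warning tonight", "storm tonight", "warning issued"]
counts = cooccurrence(tweets)
print(counts[("storm", "tonight")])
```

The resulting pair counts are the cell values of a word-by-word matrix, which is the usual input to the cluster diagrams and content networks discussed elsewhere in this deck.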
17
SOME COMMON METHODS FOR EAVESDROPPING
18
SOME ASPECTS OF DATA QUALITY IN TWITTER
Raw or processed data (and summary data)
In context of interactivity or not in social context
Verified or not
Dynamic real-time or time-delayed data or historical data
Complete set (N=all) or partial set (albeit not a random selection)
Customized data or not
Filtered data or not
Data analytics enhanced or non-data-analytics enhanced
Limited access or accessible-to-all
  Private or public
19
VARYING QUALITY STANDARDS OF TWITTER INFORMATION (BASED ON ACCESS LEVELS AND ANALYTICAL CAPABILITIES)
• N = all (gold standard)
• Access through commercial means
• Sophisticated research and analytics techniques
• Limited access from Twitter API
• Partial data extraction through scraping
• Partial data extraction through third-party data exporters
• Individual usage (based on EULA and affordances)
• Engaged and interactive
• With community
• With multimedia
• Broad common usage (for informational purposes)
20
SOME COMMON METHODS USED FOR EAVESDROPPING ON TWITTER
Following and interacting with people and groups on their Tweetstreams
  Only requires connectivity and an account on Twitter
Mapping social network graphs
  Requires access to data and software to map network graphs
Drawing content networks (such as word relationships from a Tweet set)
  Requires access to data and software to analyze the text
Mapping eventgraphs through “human sensor networks”
  Requires access to data over time, over space, over topic, and over social media user account
  Requires ability to translate data from other languages back to English (or the base language)
  Requires ability to computationally draw data as data visualizations
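To make the social network graph idea concrete, the following is a minimal sketch that turns @mentions into a directed edge list and ranks accounts by how many distinct accounts mention them (in-degree). The account names and messages are invented, and in-degree is only one of several possible influence measures; tools such as NodeXL compute many more.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Yield (author, mentioned_account) edges from (author, text) pairs."""
    for author, text in tweets:
        for target in MENTION.findall(text):
            if target.lower() != author.lower():  # ignore self-mentions
                yield (author.lower(), target.lower())


def in_degree(edges):
    """Count the distinct accounts pointing at each target."""
    return Counter(target for _, target in set(edges))


# Invented accounts and messages for illustration.
tweets = [
    ("ann", "Thanks @bob and @cat!"),
    ("bob", "@cat great talk"),
    ("cat", "@cat note to self"),
    ("ann", "@bob again"),
]
ranking = in_degree(mention_edges(tweets))
print(ranking.most_common())
```

Repeated mentions between the same two accounts are collapsed into a single edge here; keeping the repeat count as an edge weight would be an equally reasonable choice.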
21
SOME COMMON METHODS USED FOR EAVESDROPPING ON TWITTER (CONT.)
Capturing #hashtag conversations
  Requires access to the data (through software tools, through high-level computer language for data scraping)
  Requires ability to map hashtag conversations based on users, communications, and other dimensions
  Requires ability to interact with the extracted textual and multimedia data
Capturing keyword searches
  Requires access to the data and ability to map networks and interact with the textual and multimedia data (see above)
22
SOME COMMON METHODS USED FOR EAVESDROPPING ON TWITTER (CONT.)
Drawing geographical maps to spatially map social networks
  Requires access to the data
  Requires ability to map geolocational coordinates to spatial locations on a map (including dense clusters)
Profiling individuals and groups remotely (zero-interaction profiling)
  Requires access to the data
  Requires access to profile information
  Requires access to the expressions of the site holder (text, imagery, audio, and video)
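Finding dense clusters of geotagged messages can be approximated before any map is drawn. The following is a minimal sketch that snaps each coordinate pair to a grid cell and ranks cells by message count; the half-degree cell size and the sample coordinates are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def grid_cell(lat, lon, size=0.5):
    """Return the south-west corner of the size-degree cell holding the point."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)


def dense_cells(points, size=0.5, top=3):
    """Rank grid cells by how many geotagged tweets fall inside them."""
    return Counter(grid_cell(lat, lon, size) for lat, lon in points).most_common(top)


# Hypothetical (latitude, longitude) pairs for illustration only.
points = [(39.19, -96.58), (39.18, -96.57), (39.05, -95.68), (39.20, -96.60)]
print(dense_cells(points))
```

A fixed grid is the simplest binning scheme; a mapping tool would typically cluster by distance instead, but the grid is enough to show where activity concentrates.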
23
SOME COMMON TWITTER DATA CAPTURE TOOLS
NodeXL (Network Overview, Discovery and Exploration for Excel, free add-on to Excel in the Basic version)
NCapture web browser add-on linked to NVivo 11 Plus (proprietary software tool)
R (free, high-level programming language)
Python (free, high-level programming language)
Also online cloud-based data download tools:
  Twitter Advanced Search
  Digital Methods Initiative (DMI) Tools
  Netlytic
24
TYPES OF DATA FROM TWITTER
25
DATA SETS OF TWITTER DATA
Row ID
Tweet ID
Username
Tweet
Time
Tweet Type
Retweeted By
Number of Retweets
Hashtags
Mentions
Name
Location
Web
Bio
Number of Tweets
Number of Followers
Number Following
Location Coordinates
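The column headers above suggest a simple record structure for analysis in code. The following is a minimal sketch, assuming the data set has been exported as a CSV file whose header row uses these exact column names (an assumption; actual exports vary by tool and version). Only a subset of columns is loaded, and the two sample rows are made up.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TweetRow:
    tweet_id: str
    username: str
    tweet: str
    tweet_type: str
    retweets: int
    followers: int


def load_rows(handle):
    """Read rows keyed by the column headers listed on the slide."""
    return [
        TweetRow(
            tweet_id=r["Tweet ID"],
            username=r["Username"],
            tweet=r["Tweet"],
            tweet_type=r["Tweet Type"],
            retweets=int(r["Number of Retweets"] or 0),  # blank cell counts as 0
            followers=int(r["Number of Followers"] or 0),
        )
        for r in csv.DictReader(handle)
    ]


# Made-up two-row export for illustration.
sample = io.StringIO(
    "Tweet ID,Username,Tweet,Tweet Type,Number of Retweets,Number of Followers\n"
    "1,ann,Hello #SIDLIT,Tweet,3,120\n"
    "2,bob,RT @ann: Hello #SIDLIT,Retweet,,45\n"
)
rows = load_rows(sample)
print(rows[1])
```

Tweet IDs are kept as strings because they are identifiers rather than quantities, and very long numeric IDs can be mangled by spreadsheet software.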
26
TYPES OF DATA FROM TWITTER
Machine coding and analysis
Text
  Symbolic processing required
Metadata
  Relational (trace) data
  Profile information
  Time zone information
  Locational information
  Locational coordinate data, and others
Manual coding and analysis
Imagery
  High dimensionality data
Video
  High dimensionality data
  Multi-sensory data
Links
  High dimensionality data
  Multi-sensory data
27
FEATURES OF THE DATA SET
Structured data
Quantitative
Unstructured / semi-structured data
Qualitative
  Textual data
  Multimedia
  Digital imagery (including animated GIFs)
  Audio
  Video
  Live video streams
  Interactive contents
28
29
30
31
32
33
34
35
36
EXPLORING MICROBLOGGING DATA
37
UNDERSTANDING DATA LIMITATIONS
Dynamic Tweetsets are rate- and size-limited on the Twitter API
  Captured sets are from the most recent and work backwards
  Twitter sets tend to be highly volatile, with quite a few changes over time as compared to other data from social media like related tags networks on Flickr or Wikipedia article networks
  Extracted data are accurate for a certain short amount of time and then must be updated for accuracy
Tweetstream extractions from a target account may range from 1% to 100% of the available Tweets, depending on the account activity and length of existence (and whether retweets / RTs are included)
It is possible to sample various datasets from Twitter over time for time-based insights
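The 1% to 100% range follows from simple arithmetic: a capped pull of the most recent Tweets covers a smaller share of a more prolific account. The following is a minimal sketch of that calculation. The default cap of 3,200 is an assumption based on the commonly cited per-account timeline limit and should be checked against the current API documentation; the account sizes are hypothetical.

```python
def coverage(total_tweets, cap=3200):
    """Estimate the share of an account's Tweets a capped pull can reach."""
    if total_tweets <= 0:
        return 1.0  # nothing to miss on an empty account
    return min(1.0, cap / total_tweets)


# Hypothetical account sizes, from a light user to a very heavy one.
for total in (500, 3200, 64000, 320000):
    print(total, f"{coverage(total):.0%}")
```

Reporting this share alongside any Tweetstream analysis is one way to apply the qualifiers that research reporting calls for.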
38
UNDERSTANDING DATA LIMITATIONS (CONT.)
When reporting research, it’s important to explain data limitations, and use appropriate qualifiers.
Data visualizations are always summary data and so will need textual augmentation / explanation for sufficient clarity. It may be helpful to share examples of the underlying data as well.
39
RICH DATA HANDLING
Twitter data contains messaging, trace, and metadata
  Messaging consists of text, imagery, audio, video, and other rich data types
  Rich data types require plenty of manual analysis for deeper insights
  Text data may be analyzed using a combination of human “close reading” with manual coding and “distant reading” with machine coding
Twitter trace (relational) data needs to be mapped to networks for analytics
Twitter metadata needs to be analyzed using both human analysis and machine analysis
  Some locational / location coordinates metadata may be mapped to a geographical map
  Time data may be mapped to line charts and bar charts
  Profile metadata may be analyzed using text analytics tools and close human reading
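Mapping time data to a bar chart starts with counting messages per time bin. The following is a minimal sketch that bins timestamps by clock hour and prints a text bar for each; the timestamp format string and the sample times are assumptions, since export formats differ between tools.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count Tweets in each clock hour; the format string is an assumption."""
    hours = Counter()
    for stamp in timestamps:
        moment = datetime.strptime(stamp, fmt)
        hours[moment.replace(minute=0, second=0)] += 1
    return sorted(hours.items())


# Made-up timestamps for illustration.
stamps = ["2016-08-04 09:05:00", "2016-08-04 09:40:10", "2016-08-04 10:01:30"]
for hour, count in tweets_per_hour(stamps):
    # A text bar stands in for the bar chart a plotting tool would draw.
    print(hour.strftime("%H:00"), "#" * count)
```

Time zone handling is left out of this sketch; timestamps from mixed time zones would need to be normalized before binning.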
40
EXPLORING MICROBLOGGING DATA
Manual coding
Theory-informed coding for dominant themes
Emergent coding (from the data set), with qualitative and quantitative insights
  Identification of data types
  Identification of points of interest
  Identification of personalities
Auto coding
Theme and subtheme extraction
Sentiment mining and analysis
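Automated sentiment coding, in its simplest form, compares message words against lists of positive and negative terms. The following is a minimal sketch of that lexicon approach; the two word lists are tiny hypothetical examples, and commercial tools such as NVivo use their own, much richer methods that this sketch does not reproduce.

```python
import re

# Tiny hypothetical lexicons; real tools use far larger, validated word lists.
POSITIVE = {"great", "love", "helpful", "thanks"}
NEGATIVE = {"bad", "hate", "broken", "awful"}


def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in ["Love this session, thanks!", "Wifi is broken again", "Slides are online"]:
    print(sentiment(tweet), "|", tweet)
```

Word counting of this kind misses sarcasm, negation ("not great"), and context, which is one reason machine coding is usually paired with human close reading.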
41
EXPLORING MICROBLOGGING DATA (CONT.)
Data queries
Text frequency count
Text search
Matrix query
Cluster analysis
Data visualizations
Word clouds
Word trees
Dendrograms (vertical and horizontal)
Cluster diagrams (2D and 3D)
Hierarchy charts: treemap diagrams, sunburst graphs
Ring lattice graphs / circle graphs
Intensity matrices
Bar charts
Geographical maps, and others
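A text frequency count, the first of the data queries listed above and the input to a word cloud, can be sketched in a few lines. This is a minimal example, assuming a deliberately short stopword list and made-up messages; links are dropped and #hashtags and @mentions are kept as single tokens.

```python
import re
from collections import Counter

# A deliberately short stopword list, assumed for this example.
STOPWORDS = {"the", "a", "to", "is", "of", "and", "in", "on", "rt"}


def word_frequencies(tweets, top=5):
    """Return the most frequent non-stopword terms across a Tweet set."""
    counts = Counter()
    for text in tweets:
        text = re.sub(r"https?://\S+", " ", text.lower())  # drop links
        counts.update(w for w in re.findall(r"[#@]?\w+", text) if w not in STOPWORDS)
    return counts.most_common(top)


# Made-up messages for illustration.
tweets = [
    "RT the keynote on #opendata is live",
    "Notes on the #opendata keynote https://example.org/n",
    "Keynote slides to follow",
]
print(word_frequencies(tweets, top=2))
```

In a word cloud, each term's count would set its font size; the same counts can also feed a bar chart.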
42
SOME WIDELY AVAILABLE SOFTWARE TOOLS FOR DATA EXTRACTION
Using the right tool for the right questions
  NodeXL
  NCapture of NVivo
  R
  Python
Must read the evolving Twitter API page to understand the limits of the data extraction
43
SOME PRACTICAL RESEARCH APPLICATIONS
44
SOME PRACTICAL RESEARCH APPLICATIONS
Capturing a range of data to try to understand an issue
Using social media data for decision-making
Exploring relationships on social media sites to understand leaders and their messaging, to understand followers, and the state of the social network
Using social media messaging to “remote profile” others (even with zero interactions)
Extracting major themes from online groups / #hashtag networks / keyword communities, and profile-specific Tweetstreams
Analyzing sentiment from online communities / networks and profile-specific Tweetstreams
45
SOME PRACTICAL AWARENESS AND DECISION-MAKING APPLICATIONS
46
SOME PRACTICAL AWARENESS AND DECISION-MAKING APPLICATIONS
Designing and deploying messaging campaigns
Identifying influential individuals in order to pass along a message (whether microcasting or broadcasting or mixed casting)
Learning about online communities
Understanding other online personalities (egos, entities)
47
RECENT PUBLISHED RUMORS OF POSSIBLE CHANGES TO THE SERVICE
Possible raising of the 140-character limit?
Pressures to make the platform more engaging and relevant?
Clearer labeling of commercial accounts?
Improved usage of the platform for learning?
48
DEMOS
49
50
QUESTIONS? COMMENTS?
Ideas for other possible uses?
51
CONCLUSION AND CONTACT
Dr. Shalin Hai-Jew
iTAC, Kansas State University
212 Hale / Farrell Library
[email protected]
785-532-5262
The presenter has no formal tie to Twitter, Inc.
The Twitter logo on the cover is from the company and aligns with the company’s terms of usage to represent Twitter. The map on Slide 5 is by FOBOS92 and was released in 2012 for usage through the Wikimedia Commons and the CC Attribution-Share Alike 3.0 license. The other images were created by the author using a range of software tools, including NodeXL, NCapture, NVivo 11 Plus, and Microsoft Visio.
52