Social Network Analysis Providing · Tweet Specific Fields created_at Tweet timestamp Text: "Tue...

Social Network Analysis Update A Short Overview of the Problems and an Update of Our Twitter Capture/Analysis System Joshua White CS644


Social Network Analysis Update

A Short Overview of the Problems and an Update of Our Twitter Capture/Analysis System

Joshua White
CS644


Background


The Problem

• Social Networking Sites:
  – Provide a communication method thought by many to be at least somewhat private
    • Many never change the default security setting associated with their accounts
  – Support linking of older accounts/sites to new sites through unified login, which often leads to a sort of “most-privileged” escalation
    • This is where the site with the highest public access settings enabled is able to gain private data from the restricted account on another site and re-display it because they share a login system.


Target for Current Work

• Twitter
  – A real-time social information network
  – Various parsable APIs
    • Search, Live, Some Historical (24 hrs)
  – Large userbase:
    • 65 million ‘tweets’ per day*
    • ~750 ‘tweets’ per second
  – International community


Who Uses Twitter

• People
  – Everyday People
  – Politicians
  – Celebrities
  – Professionals
  – Bad-Guys
• Objects
  – Tweeting gadgets (sensors, bots, computers, bot-masters, spammers)
• Labeled Nefarious Groups
  – Lulzsec
  – Anonymous


The Twitter API

• Twitter protocol fields:
  – Typically shown in XML or JSON:
    • Provide:
      – Location (geo fields)
      – Username/Real Name
      – Threading
        » Track conversations and @ replies
        » Track retweets
      – Twitter client software data
      – Timestamping
      – And, of course, the text of the tweet.
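The threading item above can be illustrated with a small sketch. It assumes tweets have already been parsed into dictionaries carrying the `id` and `in_reply_to_status_id` fields that this deck lists for the API; the function name `build_threads` and the sample records are hypothetical, and this is only one possible way to group replies, not necessarily how the capture system does it.

```python
from collections import defaultdict


def build_threads(tweets):
    """Group tweets into conversations by following in_reply_to_status_id.

    Returns a dict mapping each root tweet id to the ids in that
    conversation (root first, then replies in input order).
    """
    parent = {t["id"]: t.get("in_reply_to_status_id") for t in tweets}

    def root_of(tweet_id):
        # Walk up the reply chain until the parent is missing from the capture.
        seen = set()
        while parent.get(tweet_id) in parent and tweet_id not in seen:
            seen.add(tweet_id)
            tweet_id = parent[tweet_id]
        return tweet_id

    threads = defaultdict(list)
    for t in tweets:
        threads[root_of(t["id"])].append(t["id"])
    # Put the root first in each thread.
    return {root: [root] + [i for i in ids if i != root]
            for root, ids in threads.items()}


# Hypothetical sample: tweet 4 replies to a status that was never captured.
sample_tweets = [
    {"id": 1, "in_reply_to_status_id": None},
    {"id": 2, "in_reply_to_status_id": 1},
    {"id": 3, "in_reply_to_status_id": 2},
    {"id": 4, "in_reply_to_status_id": 99},
]
threads = build_threads(sample_tweets)
```

Because the capture is only a sample of the stream, a reply whose parent was not captured is treated here as the root of its own thread.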


Field Name | Description | Example Data
name | User's real name | Text: "Robert Scoble"
screen_name | User's Twitter username | Text: "scobleizer"
profile_image_url | Link to user's profile image | Link: "http://a1.twimg.com/profile_images/456562836/scoblebuilding43crop-fanatiguy_normal.jpg"
url | Link to user's non-Twitter site | Link: "http://www.google.com/profiles/scobleizer"
followers_count | Number of followers user has | Number: "185496"
friends_count | Number of people user follows | Number: "31971"
utc_offset | Offset from GMT (in seconds) | Number: "-28800"
geo_enabled | Whether user has enabled location | Boolean: "True"
statuses_count | Number of statuses user has posted | Number: "53522"

Tweet Specific Fields
created_at | Tweet timestamp | Text: "Tue Jun 14 18:30:13 +0000 2011"
id | Tweet id (useful for URL creation) | Number: "80703603437875201"
text | Contains the actual text + any embedded URLs | Whatever text the person chooses to enter; could be in any supported language.
source | Links to Twitter client URL (not important) | HTML code: "<a href="http://www.echofon.com/" rel="nofollow">Echofon</a>"
in_reply_to_status_id | ID of the status that the user replied to | Number: "80671170374025220"
in_reply_to_screen_name | Screen name of user the current status replies to | Text: "danharmon"
retweet_count | Number of times this status is retweeted | Number: "0"
retweeted | Whether or not the status has been retweeted | Boolean: "false"

'geo' flag specific:
georss:point | Lat. & Long. location | Number: "43.21227199 -75.39866939"
url | Points to a JSON or XML file with further geo info | Link: "http://api.twitter.com/1/geo/id/00228ed265b1139e.xml"
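A few of the fields above need light parsing before they are useful for analysis. The sketch below shows one way to do that, using example values from the table. Combining those values into a single record is an assumption made for illustration, the helper names are hypothetical, and the status URL pattern is an assumption rather than something stated in the slides.

```python
from datetime import datetime, timezone

# Example values copied from the field table; treating them as one record
# is an assumption made only for illustration.
sample = {
    "screen_name": "scobleizer",
    "created_at": "Tue Jun 14 18:30:13 +0000 2011",
    "id": "80703603437875201",
    "georss:point": "43.21227199 -75.39866939",
}


def parse_created_at(value):
    """Parse the created_at text format into a timezone-aware datetime."""
    return datetime.strptime(value, "%a %b %d %H:%M:%S %z %Y")


def tweet_url(screen_name, tweet_id):
    """Build a status URL (assumed pattern, not taken from the slides)."""
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"


def parse_point(value):
    """Split a 'lat lon' georss:point string into a (lat, lon) float pair."""
    lat, lon = value.split()
    return float(lat), float(lon)


created = parse_created_at(sample["created_at"])
url = tweet_url(sample["screen_name"], sample["id"])
lat, lon = parse_point(sample["georss:point"])
```

The tweet id is kept as a string here because the table shows it quoted; it also fits in a 64-bit integer if numeric storage is preferred.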


Benefits to Social Media Awareness As The Gov. Agencies See It

• Track locations with reasonable accuracy
  – If enabled by the user
• Bad guys may have protected feeds
  – Others may ‘retweet’ them; this can be tracked.
• Track trends
  – Who said what, who repeated it
• News before ‘official’ reports


Gov. View

• DHS-identified categories of social media sites [2]:
  – Search
  – Video
  – Maps
  – Photos
  – Blog Aggregators
  – Twitter (related sites)
  – Facebook (related sites)


Gov. View Continued

• Among these categories are sites like:
  – RSSOwl
  – Hulu
  – YouTube
  – Google Flu
  – Flickr
  – Twitter
  – Facebook
  – ABCNews Blotter
  – Myspace


Interesting Facebook Privacy Facts:

• Percentage of Facebook users, by age, who change their account security settings to something other than the default (no security) [1]:
  – 18-29 years old = 71%
  – 30-39 years old = 67%
  – 50-64 years old = 55%
• 80% of all users (according to some websites) fall within that 18-64 age range.
  – That means that potentially 20+ million users have no security on their accounts.


Proliferation of Facebook


Current Work Update


Data Collection

• To date:
  – We have collected over 80 million tweets using John's* Java-based method/system.
    • Located at the GI (Griffiss Institute)
    • Each compressed .tcm capture file is
      – 10 days of capture
      – ~8.5 million tweets and associated data
        » Tweets are only a sampling of the total data being posted to Twitter, but we're rate limited by Twitter's API
  – Uses the Twitter streaming API

* John Stacy
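As a rough check on how small that sample is, the capture figures above can be set against the 65 million tweets per day quoted earlier in the deck. The sketch below is simple arithmetic. It assumes the capture rate was constant across the 10 days and that the 65 million figure held during the capture period; neither is stated in the slides.

```python
TWEETS_PER_FILE = 8_500_000        # ~8.5 million tweets per .tcm file
DAYS_PER_FILE = 10                 # each file covers 10 days of capture
TOTAL_TWEETS_PER_DAY = 65_000_000  # Twitter-wide figure quoted in the deck


def sampled_fraction(captured_per_day, total_per_day=TOTAL_TWEETS_PER_DAY):
    """Share of the full daily stream that the capture represents."""
    return captured_per_day / total_per_day


captured_per_day = TWEETS_PER_FILE / DAYS_PER_FILE
fraction = sampled_fraction(captured_per_day)
print(f"{captured_per_day:,.0f} tweets/day captured, "
      f"about {fraction:.1%} of all tweets")
```

Under those assumptions the capture works out to about 850,000 tweets per day, or roughly 1.3% of the daily total.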


Data Collection Update

• As of 6/21/2011 a new data collection method/system is in place:
  – Located at the GI as well
  – Uses John's JSON analysis method re-implemented in PHP with data storage in MySQL
  – Captured Data:
    • ~160,000 tweets per hour so far
      – Estimated ~4 million per day
    • Uses the Phirehose API [3]
    • DB consists of raw JSON data, parsed-out tweets, and a special stripped-down user section
      – The user section is in preparation for adding crawled account data.
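The three-part layout described above (raw JSON, parsed tweets, stripped-down users) might look something like the following sketch. It uses Python's built-in `sqlite3` as a stand-in for MySQL. Apart from the `tweets` table and `geo_lat` column, which appear in the deck's own query, every table name, column name, and the JSON shape read by `store_status` are assumptions made for illustration, not the actual schema.

```python
import json
import sqlite3

# Hypothetical schema: one table per section of the DB.
SCHEMA = """
CREATE TABLE raw_json (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
CREATE TABLE users (user_id INTEGER PRIMARY KEY, screen_name TEXT);
CREATE TABLE tweets (
    tweet_id   INTEGER PRIMARY KEY,
    user_id    INTEGER REFERENCES users(user_id),
    created_at TEXT,
    text       TEXT,
    geo_lat    REAL,
    geo_long   REAL
);
"""


def store_status(conn, raw):
    """Store one raw JSON status in all three sections."""
    status = json.loads(raw)
    user = status["user"]
    # Assumed shape: "geo" is null or {"coordinates": [lat, long]}.
    coords = (status.get("geo") or {}).get("coordinates") or [None, None]
    conn.execute("INSERT INTO raw_json (payload) VALUES (?)", (raw,))
    conn.execute("INSERT OR IGNORE INTO users VALUES (?, ?)",
                 (user["id"], user["screen_name"]))
    conn.execute("INSERT OR REPLACE INTO tweets VALUES (?, ?, ?, ?, ?, ?)",
                 (status["id"], user["id"], status["created_at"],
                  status["text"], coords[0], coords[1]))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Keeping the raw JSON alongside the parsed rows means fields that were not parsed at first can be extracted later without recapturing. At ~160,000 tweets per hour, 24 hours gives about 3.84 million rows per day, consistent with the ~4 million estimate.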


DB Snapshot


Why Is This New DB Important

• The previous method is perfect for long-term analysis, but we need a method that lets us gather stats and see what does/doesn't work without having to write code.
• The new DB allows for simple SQL queries such as:
  – SELECT * FROM `tweets` WHERE `geo_lat` > 0 LIMIT 0, 30
    • This looks for any tweet that has a greater than 0 value in the latitude field.
      – Out of 1,593,922 tweets at the time of this query
      – 8,470 had a latitude/longitude associated with them
        » We'll need a more complex query to see how many of those are associated with individual users
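One possible shape for that follow-up query is sketched below, using Python's built-in `sqlite3` as a stand-in for MySQL. The `user_id` and `geo_long` columns and the sample rows are hypothetical, and the sketch assumes tweets without a location store NULL in `geo_lat`. It filters on `IS NOT NULL` rather than `> 0` because a greater-than-zero test would skip locations with a negative (southern hemisphere) latitude. For scale, 8,470 of 1,593,922 is roughly 0.5% of captured tweets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tweets ("
    "tweet_id INTEGER PRIMARY KEY, user_id INTEGER, geo_lat REAL, geo_long REAL)"
)
# Hypothetical rows: user 10 has two located tweets, user 11 has one in the
# southern hemisphere, and user 12 has no location at all.
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?, ?)", [
    (1, 10, 43.21, -75.40),
    (2, 10, 43.22, -75.41),
    (3, 11, -33.87, 151.21),
    (4, 12, None, None),
])

# Count located tweets per user, most active first.
GEO_USERS_SQL = """
SELECT user_id, COUNT(*) AS geo_tweets
FROM tweets
WHERE geo_lat IS NOT NULL
GROUP BY user_id
ORDER BY geo_tweets DESC, user_id
"""

rows = conn.execute(GEO_USERS_SQL).fetchall()
distinct_geo_users = len(rows)
```

The number of rows returned is the number of distinct users with at least one located tweet, and the per-user counts show whether a handful of accounts produce most of the geo data.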


Conclusion

• There's a lot we can do with this data
  – I suggest we develop methods using the DB and then port them over to the Coalmine Query code for better scalability.
• The privacy implications of using this data are high.
  – I'm torn on its usage by the Gov.
    • I see the national security implications and also the privacy violations that may ensue.


Citations

• [1] Vaidhyanathan, S., "Welcome to the surveillance society," IEEE Spectrum, vol. 48, no. 6, pp. 48-51, June 2011. doi: 10.1109/MSPEC.2011.5779791. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5779791&isnumber=5779759
• [2] DHS, Office of Operations Coordination and Planning, "Publicly Available Social Media Monitoring and Situational Awareness Initiative," June 22, 2010. http://www.dhs.gov/xlibrary/assets/privacy/privacy_pia_ops_publiclyavailablesocialmedia.pdf
• [3] http://code.google.com/p/phirehose/