Twarfing: Gathering Intelligence from Twitter Data

Twarfing: Gathering Intelligence from Twitter Data Morton Swimmer Senior Threat Researcher Trend Micro, Inc. Thursday, January 21, 2010

description

A Presentation on the use of Semantic Web technology in analyzing Twitter data for attacks against users. This is meant to illustrate how Semantic Web technology can be useful in real-world applications.

Transcript of Twarfing: Gathering Intelligence from Twitter Data

Page 1: Twarfing: Gathering Intelligence from Twitter Data

Twarfing: Gathering Intelligence from Twitter Data

Morton Swimmer, Senior Threat Researcher

Trend Micro, Inc.

Thursday, January 21, 2010

Page 2: Twarfing: Gathering Intelligence from Twitter Data

Overview

• What is Twitter?

• Malicious activity on Twitter

• URL shortening services

• The Twarfing architecture

• Examples

• Conclusions

Page 3: Twarfing: Gathering Intelligence from Twitter Data

Thanks

• Ben April, Trend, for providing the massive hardware

• Based in part on a Virus Bulletin presentation with Costin Raiu, Chief of R&D, Kaspersky Lab

Page 4: Twarfing: Gathering Intelligence from Twitter Data

What is Twitter?

• Founded by Jack Dorsey, Biz Stone and Evan Williams back in 2006

• Publish/Subscribe Communications system

• SMS

• Web site

• WS API and RSS

• Application

Page 5: Twarfing: Gathering Intelligence from Twitter Data

Twitter Internals

• 140 chars max to be SMS compatible

• SMS has a 160 char restriction

• But Twitter needed to add the user name

• Just text and metadata

Page 6: Twarfing: Gathering Intelligence from Twitter Data

Users not necessarily human!

• Devices

• From buoys to power meters

• Search for Twitter on instructables.com

Page 7: Twarfing: Gathering Intelligence from Twitter Data

Users not necessarily human!

• Not surprising that malware would use it, but

• Not good for Command and Control

• Easily blocked after detection

• … and Twitter has been trigger-happy with blocking

Page 8: Twarfing: Gathering Intelligence from Twitter Data

August 2008: Malware on Twitter

Page 9: Twarfing: Gathering Intelligence from Twitter Data

April 2009 – Twitter gets hit by XSS worm

• Multiple variants of JS.Twettir.a-h identified

• Thousands of spam messages containing the word "Mikeyy" filled the timeline

• Proof of concept – no malicious intent

• Later, the author (Mikey Mooney) got a job at exqSoft Solutions, a web security company

• September 2009 - Yet another attack

• Phishing for the login details

Page 10: Twarfing: Gathering Intelligence from Twitter Data

2009: Trending topics start being exploited

Page 11: Twarfing: Gathering Intelligence from Twitter Data

June 2009 – Koobface spreading in Twitter

• Originally only targeted Facebook and MySpace users

• Now spreads through more social networks:

• Facebook, MySpace, Hi5, Bebo, Tagged, Netlog and most recently… Twitter

Page 12: Twarfing: Gathering Intelligence from Twitter Data

Twitter architecture, historically

• Multiple Ruby on Rails servers

• Mongrel HTTP servers

• Centrally MySQL backed

Page 13: Twarfing: Gathering Intelligence from Twitter Data

Twitter architecture, now

• Details super-secret, but this is what we think

• Front end

• Ruby-based front end

• Mongrel HTTP servers

• Back end

• Starling for queuing/messaging

• Scala-based code

• MySQL

• Denormalized data whenever possible

• Only for backup and persistence

• Lots of caching using memcached

Page 14: Twarfing: Gathering Intelligence from Twitter Data

Stats (June 2009)

• 25M users

• I measured 475K different users posting over a one week period in August 2009

• 300 tweets/sec

• MySQL handles 2400 reqs per second

• API traffic == 10x website traffic!

• Indicates that far more people are using applications

• TweetDeck, Twitterrific, Digsby, Twhirl

• Many are Adobe AIR based (!)

• One key to Twitter's success!

Page 15: Twarfing: Gathering Intelligence from Twitter Data

Twitter metadata

<id> <user> <text> <created_at> <source> <truncated> <in_reply_to_status_id> <in_reply_to_user_id> <favorited> <in_reply_to_screen_name>

• <user>: user information, in each tweet!

• <text>: tweet text

Page 16: Twarfing: Gathering Intelligence from Twitter Data

Tweet text

• Retweets: RT passes the note along

• now also in metadata

• Location: L tells friends where I am

• Hashtags: #

• topics

• show group associations

• just for tagging

• User reference: @ for public discussion

• URLs
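
The in-text conventions listed above (RT, #hashtags, @references, URLs) can be pulled out of a tweet with a few regular expressions. This is only a minimal sketch: the patterns and the `extract_entities` name are assumptions made for illustration, not Twitter's tokenization rules or the code used in Twarf.

```python
import re

# Illustrative patterns (assumptions), not Twitter's own tokenization rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
RETWEET_RE = re.compile(r"^\s*RT\b")


def extract_entities(text):
    """Split a tweet into its conventional in-text markers."""
    urls = URL_RE.findall(text)
    # Blank out URLs first so a '#fragment' or '@' inside a URL is not counted.
    stripped = URL_RE.sub(" ", text)
    return {
        "is_retweet": bool(RETWEET_RE.match(text)),
        "mentions": MENTION_RE.findall(stripped),
        "hashtags": HASHTAG_RE.findall(stripped),
        "urls": urls,
    }
```

Calling `extract_entities` on the retweet used later in this deck would report one user reference (`THErealDVORAK`) and one URL.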

Page 17: Twarfing: Gathering Intelligence from Twitter Data

Long URLs, short URLs

• URL shortening services have grown up around Twitter

• longurl.org counts 208 different ones

• Malicious URLs are one potential threat

• obscure the true URL

• May become malicious later

• RickRolling, but maliciously

• Benefits:

• ‘bit.ly’ and others block malicious URLs

Page 18: Twarfing: Gathering Intelligence from Twitter Data

Google SB API

• August 2009

• Twitter began filtering malicious URLs

• Mikko Hyppönen:

• seemed to indicate Google SB API!

• After more testing, we discovered it used some additional filtering

Page 19: Twarfing: Gathering Intelligence from Twitter Data

User data

<id> <name> <screen_name> <location> <description> <profile_image_url> <url> <protected> <followers_count> <profile_background_color> <profile_text_color> <profile_link_color> <profile_sidebar_fill_color> <profile_sidebar_border_color> <friends_count> <created_at> <favourites_count> <utc_offset> <time_zone> <profile_background_image_url> <profile_background_tile> <statuses_count> <notifications> <following> <verified>

Page 20: Twarfing: Gathering Intelligence from Twitter Data

Twitter to RDF

• Goal

• Avoid inventing own vocabulary

• Makes it easier to mix-in data later

• e.g. Facebook, Tumblr, ...

• Most Twitter data fit readily into the SIOC ontology

• The rest used Dublin Core

• Proprietary ontology for internal data

Page 21: Twarfing: Gathering Intelligence from Twitter Data

Using existing vocabs

• Dublin Core, http://dublincore.org/

• SIOC, http://sioc-project.org/

• describe information from online communities

• FOAF, http://www.foaf-project.org/

• machine-readable pages describing people

• GeoOWL

Page 22: Twarfing: Gathering Intelligence from Twitter Data

Example

<http://twitter.com/float3r/status/3492845110> a sioc:Post ;
    dc:created "Sun, 23 Aug 2009 15:10:31 +0000" ;
    sioc:has_creator <http://twitter.com/float3r> ;
    sioc:content "RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf"@no .

<http://twitter.com/float3r> a sioc:User ;
    sioc:id "251294"^^xsd:integer ;
    rdfs:label "float3r" ;
    sioc:avatar "http://a1.twimg.com/profile_images/53008680/Picture_022_normal.jpg"^^xsd:anyURI .

float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf
Sun, 23 Aug 2009 15:10:31 +0000
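
As a rough illustration of the mapping shown above, the following sketch turns a tweet, given as a dictionary with the `id`, `text`, `created_at` and `user` fields from the Twitter metadata, into SIOC/Dublin Core statements in Turtle. The function names and the prefix IRIs are assumptions (the slide does not show its prefix declarations), and the string escaping borrows JSON's rules as an approximation of Turtle's.

```python
import json

# Assumed namespace IRIs; the slide does not show its prefix declarations.
PREFIXES = (
    "@prefix sioc: <http://rdfs.org/sioc/ns#> .\n"
    "@prefix dc: <http://purl.org/dc/terms/> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def literal(value):
    """Quote a string; JSON escaping is close enough to Turtle's for a sketch."""
    return json.dumps(value, ensure_ascii=False)


def tweet_to_turtle(tweet):
    """Render one tweet and its author as Turtle statements."""
    user = tweet["user"]
    user_uri = "http://twitter.com/%s" % user["screen_name"]
    post_uri = "%s/status/%d" % (user_uri, tweet["id"])
    lines = [
        "<%s> a sioc:Post ;" % post_uri,
        "    dc:created %s ;" % literal(tweet["created_at"]),
        "    sioc:has_creator <%s> ;" % user_uri,
        "    sioc:content %s ." % literal(tweet["text"]),
        "<%s> a sioc:User ;" % user_uri,
        '    sioc:id "%d"^^xsd:integer ;' % user["id"],
        "    rdfs:label %s ." % literal(user["screen_name"]),
    ]
    return PREFIXES + "\n".join(lines)
```

Reusing existing vocabularies this way means the output can be loaded next to SIOC data from other communities without a mapping step, which is the goal stated on the "Twitter to RDF" slide.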

Page 23: Twarfing: Gathering Intelligence from Twitter Data

Example, continued

float3r: RT @THErealDVORAK http://www.usdoj.gov/ndic/pubs31/31379/31379p.pdf
Sun, 23 Aug 2009 15:10:31 +0000

Page 24: Twarfing: Gathering Intelligence from Twitter Data

WhiteTwarf/RedTwarf

White Twarf

Page 25: Twarfing: Gathering Intelligence from Twitter Data

WhiteTwarf

• Receives a subset of the tweets via Twitter search

• Stores metadata from Twitter

• Processes text part for internal metadata

• User references, hashtags, informal tags, URLs

• Creates canonical text representations

• Export to an RDF store for analysis

Page 26: Twarfing: Gathering Intelligence from Twitter Data

WhiteTwarf architecture v.2

[Architecture diagram. Components: Twitter stream, tweet processing, CouchDB (tweets, users, URLs), URL processing, domain reputation, two RDF converters, 4Store (RDF graph).]

Page 27: Twarfing: Gathering Intelligence from Twitter Data

Tweet Processing

• Tweet fetch process

• Fetch a limited number of tweets as JSON

• Extract metadata and text

• Separate user metadata from tweet metadata

• Store in the CouchDB database

• Tweet processing

• Extract tags, URLs, user references

• Text signatures

• Meant to remove small differences in text

• Normalization and whitespace removal

• UTF-8 tricks expansion/removal

• CouchDB views
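
The text-signature step above can be sketched as follows. The exact rules Twarf applies are not given in the slides, so everything here is an assumption: the sketch folds Unicode compatibility characters with NFKC, drops URLs, user references and the RT marker, and keeps only lowercased letters and digits. It is chosen so that it reproduces the `thislinkiscool` signature used in the retweet example later in the deck.

```python
import re
import unicodedata

# Assumed rules; the slides only say "normalization and whitespace removal".
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+:?")
RT_RE = re.compile(r"\bRT\b:?")


def text_signature(text):
    """Reduce a tweet to a canonical string that ignores small differences."""
    # Fold look-alike compatibility characters (e.g. full-width letters).
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RT_RE.sub(" ", text)
    # Keep letters and digits only: drops whitespace and punctuation.
    return "".join(ch for ch in text.casefold() if ch.isalnum())
```

Two tweets with the same signature but different URLs are then candidates for closer inspection.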

Page 28: Twarfing: Gathering Intelligence from Twitter Data

CouchDB

• Scalable document and key-value store

• APIs REST and JSON based

• Embedded JavaScript and MapReduce for querying

• http://couchdb.apache.org/
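
CouchDB views are written as JavaScript map and reduce functions stored in the database. To illustrate the idea only, the Python sketch below emulates a view that counts how often each URL appears across stored tweet documents. The `urls` field and all function names are hypothetical, and nothing here calls CouchDB itself.

```python
from collections import defaultdict


def map_urls(doc):
    """Map step: emit one (url, 1) pair per URL in a tweet document."""
    for url in doc.get("urls", []):
        yield url, 1


def reduce_count(values):
    """Reduce step: sum the values emitted for one key."""
    return sum(values)


def run_view(docs, map_fn, reduce_fn):
    """Emulate a view: group map output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}
```

A view like this is one way the list of distinct URLs could be handed to the URL processing stage.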

Page 29: Twarfing: Gathering Intelligence from Twitter Data

4Store

• Scalable RDF Store

• Developed at Garlik, written in C, and released under the GPL

• Has REST and native APIs

• Queries via SPARQL

• http://4store.org/

Page 30: Twarfing: Gathering Intelligence from Twitter Data

RDF Stores in general

• Store RDF triples (often as quads or named graphs)

• Have recently become much more scalable

• Still have limitations, though

• e.g. have locking issues on writes

• AllegroGraph 4 (Franz Inc.) may resolve some of these for us

Page 31: Twarfing: Gathering Intelligence from Twitter Data

URL redirector processing

• For every URL extracted by the view we follow the link

• In most cases we get a 30x response (redirect)

• These are also followed after storing

• Testing showed: usually faster to use shortener APIs

• So we are testing code that uses the shortener's API instead for known shorteners

• We also capture other HTTP metadata

• Basically we are looking for malicious links
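
The redirect-following step above can be sketched as a small loop. To keep the example independent of the network, the HTTP request is abstracted behind a caller-supplied `fetch(url)` function returning a `(status, headers)` pair; that interface, the function name and the hop limit are assumptions for illustration, not the Twarf implementation.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_chain(url, fetch, max_hops=10):
    """Follow 30x responses and return every URL visited, in order.

    `fetch(url)` must return (status_code, headers_dict); it is supplied
    by the caller, e.g. a wrapper around an HTTP HEAD request.
    """
    chain = [url]
    seen = {url}
    for _ in range(max_hops):
        status, headers = fetch(chain[-1])
        location = headers.get("Location")
        if status not in REDIRECT_CODES or not location:
            break
        next_url = urljoin(chain[-1], location)  # Location may be relative
        if next_url in seen:  # redirect loop, stop here
            break
        chain.append(next_url)
        seen.add(next_url)
    return chain
```

Keeping the whole chain, rather than only the final URL, matters here because any hop may point at a malicious domain.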

Page 32: Twarfing: Gathering Intelligence from Twitter Data

Detecting malicious activity

• Data exported to an RDF Store

• This is a graph database

• Allows for complex queries

• Does have some performance issues and is not real time

• Simple Attack scenario

• User is observed posting a link to a malicious domain

• We want to see what else he has posted

Page 33: Twarfing: Gathering Intelligence from Twitter Data

Correlation with domain reputations

[Figure: RDF graph. The user "mal" tw:posts tweet/1234 and tweet/5678; each tweet tw:hasURL one of http://unk.com/what.exe and http://mal.com/evil.exe; http://mal.com/evil.exe drs:hasFQDN mal.com; mal.com drs:hasRating malicious.]

Note that the examples in this presentation are all radically simplified for clarity.
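
The two `drs:` edges amount to a lookup: take the host name of a URL, then ask for its rating. A minimal sketch follows, in which a plain dictionary stands in for the domain reputation service (its contents and the function names are assumptions for illustration).

```python
from urllib.parse import urlsplit

# Hypothetical ratings table standing in for a domain reputation service.
RATINGS = {"mal.com": "malicious"}


def fqdn_of(url):
    """Return the lowercased host name of a URL, or None if it has none."""
    return urlsplit(url).hostname


def rate_url(url, ratings=RATINGS):
    """Look up the rating of a URL's host; unrated hosts are 'unknown'."""
    return ratings.get(fqdn_of(url), "unknown")
```

With the example data, `http://mal.com/evil.exe` is rated malicious while `http://unk.com/what.exe` stays unknown, which is why the graph query is needed to connect the second URL to the same user.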

Page 34: Twarfing: Gathering Intelligence from Twitter Data

Matching this in SPARQL

[Figure: the same graph with nodes replaced by variables. ?m tw:posts ?t1 and ?t2; ?t1 tw:hasURL ?u1; ?t2 tw:hasURL ?u2; ?u1 drs:hasFQDN ?f1; ?f1 drs:hasRating malicious.]

Page 35: Twarfing: Gathering Intelligence from Twitter Data

Matching this in SPARQL

[Figure: the variable graph pattern, shown next to the query.]

SELECT ?m ?u2
WHERE {
  ?m tw:posts ?t1 .
  ?t1 tw:hasURL ?u1 .
  ?u1 drs:hasFQDN ?f1 .
  ?f1 drs:hasRating drs:Malicious .
  ?m tw:posts ?t2 .
  ?t2 tw:hasURL ?u2 .
}
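
To show what the query does without an RDF store, here is a toy matcher in Python that joins the same six triple patterns over the example graph. It is a naive sketch, not how 4Store evaluates SPARQL. The triples are an assumption as well: the slide does not state which tweet carries which URL, so one pairing is picked for illustration.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings that satisfy every triple pattern (a join)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, new)


# Example graph; the tweet-to-URL pairing is assumed.
TRIPLES = [
    ("mal", "tw:posts", "tweet/1234"),
    ("mal", "tw:posts", "tweet/5678"),
    ("tweet/1234", "tw:hasURL", "http://unk.com/what.exe"),
    ("tweet/5678", "tw:hasURL", "http://mal.com/evil.exe"),
    ("http://mal.com/evil.exe", "drs:hasFQDN", "mal.com"),
    ("mal.com", "drs:hasRating", "drs:Malicious"),
]

QUERY = [
    ("?m", "tw:posts", "?t1"),
    ("?t1", "tw:hasURL", "?u1"),
    ("?u1", "drs:hasFQDN", "?f1"),
    ("?f1", "drs:hasRating", "drs:Malicious"),
    ("?m", "tw:posts", "?t2"),
    ("?t2", "tw:hasURL", "?u2"),
]

results = sorted({(b["?m"], b["?u2"]) for b in match(QUERY, TRIPLES)})
```

On this data `results` contains both URLs posted by "mal". Since nothing in the query forces `?t2` to differ from `?t1`, the known-malicious URL is returned alongside the unknown one.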

Page 36: Twarfing: Gathering Intelligence from Twitter Data

Advantages of using RDF

• Avoid multiple explicit joins

• An intuitive visualization is available

• Mixing in new data as easy as adding it to the RDF store

Page 37: Twarfing: Gathering Intelligence from Twitter Data

Another attack scenario

• Observed: User modified URL on retweet to be malicious

@iceman: This link is cool http://cool.com/ice.html

@notniceman: RT: @iceman: This link is cool http://c00l.com/ice.exe

[Figure: RDF graph. iceman tw:posts tweet/1001; notniceman tw:posts tweet/1005; tweet/1001 tw:hasURL http://cool.com/ice.html; tweet/1005 tw:hasURL http://c00l.com/ice.exe; both tweets tw:hasTextSignature "thislinkiscool".]

Page 38: Twarfing: Gathering Intelligence from Twitter Data

Matching it

[Figure: graph pattern. ?m1 tw:posts ?t1; ?m2 tw:posts ?t2; ?t1 tw:hasURL ?u1; ?t2 tw:hasURL ?u2; ?t1 and ?t2 share the same tw:hasTextSignature ?ts.]
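
Outside the RDF store, the same pattern can be approximated by grouping tweets on their text signature and flagging groups where a later tweet from a different user carries different URLs. This is a hedged sketch: the dictionary fields and the function name are hypothetical, and it assumes that a lower tweet id means an earlier tweet.

```python
from collections import defaultdict


def modified_retweets(tweets):
    """Yield (original, suspect) pairs: same signature, different URLs."""
    by_signature = defaultdict(list)
    for tweet in tweets:
        by_signature[tweet["signature"]].append(tweet)
    for group in by_signature.values():
        # Assumption: the lowest id in a group is the original tweet.
        ordered = sorted(group, key=lambda t: t["id"])
        original = ordered[0]
        for later in ordered[1:]:
            different_user = later["user"] != original["user"]
            different_urls = set(later["urls"]) != set(original["urls"])
            if different_user and different_urls:
                yield original, later


tweets = [
    {"id": 1001, "user": "iceman", "signature": "thislinkiscool",
     "urls": ["http://cool.com/ice.html"]},
    {"id": 1005, "user": "notniceman", "signature": "thislinkiscool",
     "urls": ["http://c00l.com/ice.exe"]},
]
pairs = [(o["id"], s["id"]) for o, s in modified_retweets(tweets)]
```

A flagged pair is only a lead. The changed URL still has to be checked, for example against the domain reputation data.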

Page 39: Twarfing: Gathering Intelligence from Twitter Data

A few stats

• New architecture since October

• Since the new version started, we logged 3.6 million users

• and keep about 12 million tweets around

• Which is about 300 million triples

• Once out of experimental phase we need to increase this number dramatically
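
A back-of-the-envelope calculation shows why the store would have to grow. It combines the figures above with the 300 tweets per second quoted in the June 2009 stats; treating the triples-per-tweet ratio as constant is an assumption, since user data also contributes triples.

```python
TWEETS_STORED = 12_000_000      # from this slide
TRIPLES_STORED = 300_000_000    # from this slide
TWEETS_PER_SECOND = 300         # from the June 2009 stats slide

# Average number of triples per stored tweet (user triples included).
triples_per_tweet = TRIPLES_STORED // TWEETS_STORED

# Volume if the whole stream were kept, assuming the ratio holds.
tweets_per_day = TWEETS_PER_SECOND * 86_400
triples_per_day = tweets_per_day * triples_per_tweet
```

Under these assumptions each tweet costs about 25 triples, and the full stream would add roughly 650 million triples per day, more than twice the size of the current store.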

Page 40: Twarfing: Gathering Intelligence from Twitter Data

RedTwarf

• Goal: look for new attack patterns

• Based on WhiteTwarf data

• Using data-mining and text-mining techniques to detect rules

• Probably will be Lucene based

Page 41: Twarfing: Gathering Intelligence from Twitter Data

Conclusions

• Twarf is still an experimental side project

• Twitter is becoming a popular attack vector

• Goal: protecting you, our customers

• Identifying the future development directions of Twitter threats

I would like to thank you for your support with 140 characters or less and guess what? I just did it! #swnyc
