Twitter Stream Processing

Post on 10-May-2015

7.152 views 3 download

Tags:

description

Process the Twitter stream using Storm & Redstorm with Ruby & JRuby. Full working demo, code on github https://github.com/colinsurprenant/tweitgeist and live demo http://tweitgeist.needium.com/

Transcript of Twitter Stream Processing

Twitterstream processing

Colin Surprenant@colinsurprenantLead ninja

Big Data MontrealApril 2012

Twitter - Spring 2012

● 350 million tweets/day● 140 million active users● >1 million applications using API

Daily Twitter @ Needium

50 000 000 processed tweets 5 000 opportunities 500 messages sent 100GB data

Anatomy of a tweet

What's in a tweet? Anything else?

avatar

usertimestamp

message

How to get the tweets?

● Streaming API

Subscribe to realtime feeds moving forward ● REST Search API

Search request on past data (1 week)

Streaming APIpublic statuses from all users

● status/filtertrack/location/follow○ 5000 follow user ids

○ 400 track keywords

○ 25 location boxes

○ rate limited

● status/sample○ 1% of all public statuses (message id mod 100)

○ two status/sample streams will result in same data

Streaming APIper user streams

● User Streams

○ all data required to update a user's display○ requires user's OAuth token○ statuses from followings, direct messages, mentions○ cannot open large number of user streams from same host

● Site Streams○ multiplexing of multiple User Streams

Streaming APIFirehose

need more/full data? only through partners ● gnip.com● datasift.com

● filtering/tracking

● partial to full Firehose

What's the catch? $$$lots of it

Search API

● REST API (http request/response)

○ search query○ geocode (lat, long, radius)○ result type (mixed/recent/popular)○ since id

● max 100 rpp and 1500 results● rate limited (~1 request/sec)

StormDistributed and fault-tolerant realtime computation

https://github.com/nathanmarz/storm

Storm

The promise ● Guaranteed data processing● Horizontal scalability● Fault-tolerance● No intermediate message brokers● Higher level abstraction than message passing● Just work

RedStorm

Storm: Java + Closure

RedStorm: JRuby integration & DSL for Storm

Ruby + Storm on JVM

https://github.com/colinsurprenant/redstorm

+

StormTypical use cases

Streamprocessing

Continuouscomputation

DistributedRPC

StormConcepts

Streams

Unbounded sequence of tuples

Tuple Tuple Tuple Tuple Tuple

StormConcepts

Spouts

Source of streams

TupleTuple

TupleTuple

Tuple

TupleTuple

TupleTuple

Tuple

StormConcepts

Bolts

Processes input streams and produce new streams

Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple

Tuple Tuple Tuple Tuple Tuple

StormConcepts

Topology

Network of spouts and bolts

StormConcepts

bolt A bolt B

bolt A bolt B

field X

field Y

bolt A bolt B

bolt A bolt B

random+fair distribution across tasks

replication to all tasks

partition on specified field entire stream to single task

Shuffle

GlobalFields

All

Grouping

Storm

What Storm does ● Distributes code● Robust process management● Monitors topologies and reassigns failed tasks● Provides reliability by tracking tuple trees● Routing and partitioning of stream

TweitgeistLive top 10 trending hashtags on Twitter

DEMO

https://github.com/colinsurprenant/tweitgeistLive demo: http://tweitgeist.needium.com

TweitgeistArchitecture

Twitterstream

Queue

Parallel extraction

Partitionedcounting

Partitionedcounting

Partitionedranking

Partitionedranking

Merging

Queue

Frontend

N partitions N partitions

TweitgeistArchitecture

● Parallel computation across partitions of the stream● Scalable architecture● Work for full Twitter firehose with very little mods

TweitgeistStorm Topology

TwitterStreamSpout

ExtractMessageBolt

ExtractHashtagBolt

RollingCountBolt

shuf

fle

shuf

fle

field

hash

tag

RankBolt

field

hash

tag

MergeBolt

glob

al

Shuffle grouping: Tuples are randomly distributed across the bolt's tasksFields grouping: The stream is partitioned by the fields specified in the groupingGlobal grouping: The entire stream goes to a single one of the bolt's tasks

Redisqueue

Redisqueue

Twitter U

I

streamreader

messageextract

hashtagextract

rollingcounter

ranking merging

TweitgeistTopology definition

TweitgeistWhere

Codehttps://github.com/colinsurprenant/tweitgeist

Live demohttp://tweitgeist.needium.com