Analyzing twitter data with hadoop
-
Upload
joey-echeverria -
Category
Documents
-
view
937 -
download
2
Transcript of Analyzing twitter data with hadoop
1
Analyzing Twi,er Data with Hadoop Data Science Maryland, May 2013 Joey Echeverria | Director Federal FTS [email protected] | @fwiffo
©2013 Cloudera, Inc.
About Joey
• Director Federal FTS • Technical guy playing manager
• 2 years @ Cloudera • 5 years of Hadoop • Local
2
Analyzing Twi,er Data with Hadoop
BUILDING A BIG DATA SOLUTION
3 ©2013 Cloudera, Inc.
Big Data
• Big • Larger volume than you’ve handled before
• No litmus test • High value, under uRlized
• Data • Structured • Unstructured • Semi-‐structured
• Hadoop • Distributed file system • Distributed, batch computaRon
4 ©2013 Cloudera, Inc.
Data Management Systems
5 ©2013 Cloudera, Inc.
Data Source Data Storage Data
IngesRon
Data Processing
RelaRonal Data Management Systems
6 ©2013 Cloudera, Inc.
Data Source RDBMS ETL
ReporRng
A Canonical Hadoop Architecture
7 ©2013 Cloudera, Inc.
Data Source HDFS Flume
Hive (Impala)
Analyzing Twi,er Data with Hadoop
AN EXAMPLE USE CASE
8 ©2013 Cloudera, Inc.
Analyzing Twi,er
• Social media popular with markeRng teams • Twi,er is an effecRve tool for promoRon • Who is influenRal?
• Tweets • Followers • Retweets
• Similar to e-‐mail forwarding
• Which twi,er user gets the most retweets? • Who is influenRal in our industry?
9 ©2013 Cloudera, Inc.
Analyzing Twi,er Data with Hadoop
HOW DO WE ANSWER THESE QUESTIONS?
10 ©2013 Cloudera, Inc.
Techniques
• SQL • Filtering • AggregaRon • SorRng
• Complex data • Deeply nested • Variable schema
11
Architecture
12 ©2013 Cloudera, Inc.
Twi,er
HDFS Flume Hive
Custom Flume Source
Sink to HDFS
JSON SerDe Parses Data
Oozie
Add ParRRons Hourly
Impala
Queries
Queries and ETL
Analyzing Twi,er Data with Hadoop
TWITTER SOURCE
13 ©2013 Cloudera, Inc.
Flume
• Streaming data flow • Sources
• Push or pull
• Sinks • Event based
14 ©2013 Cloudera, Inc.
Pulling Data From Twi,er
• Custom source, using twi,er4j • Sources process data as discrete events
Loading Data Into HDFS
• HDFS Sink comes stock with Flume • Easily separate files by creaRon Rme
• hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
Flume Source
17 ©2013 Cloudera, Inc.
public class TwitterSource extends AbstractSource! implements EventDrivenSource, Configurable { ! ... ! // The initialization method for the Source. The context contains all ! // the Flume configuration info ! @Override ! public void configure(Context context) { ! ... ! } ! ... ! // Start processing events. Uses the Twitter Streaming API to sample ! // Twitter, and process tweets. ! @Override ! public void start() { ! ... ! } ! ... ! // Stops Source's event processing and shuts down the Twitter stream. ! @Override ! public void stop() { ! ... ! } !} !!
Twi,er API
• Callback mechanism for catching new tweets
18 ©2013 Cloudera, Inc.
/** The actual Twitter stream. It's set up to collect raw JSON data */ !private final TwitterStream twitterStream = new TwitterStreamFactory( ! new ConfigurationBuilder().setJSONStoreEnabled(true).build()) ! .getInstance(); !... !// The StatusListener is a twitter4j API that can be added to a stream, !// and will call a method every time a message is sent to the stream. !StatusListener listener = new StatusListener() { ! // The onStatus method is executed every time a new tweet comes in. ! public void onStatus(Status status) { ! ... ! } !} !... !// Set up the stream's listener (defined above), and set any necessary !// security information. !twitterStream.addListener(listener); !twitterStream.setOAuthConsumer(consumerKey, consumerSecret); !AccessToken token = new AccessToken(accessToken, accessTokenSecret); !twitterStream.setOAuthAccessToken(token); !!
JSON Data
• JSON data is processed as an event and wri,en to HDFS
19 ©2013 Cloudera, Inc.
public void onStatus(Status status) { ! // The EventBuilder is used to build an event using the headers and ! // the raw JSON of a tweet !! headers.put("timestamp", String.valueOf( ! status.getCreatedAt().getTime())); ! Event event = EventBuilder.withBody( ! DataObjectFactory.getRawJSON(status).getBytes(), headers); ! ! channel.processEvent(event); !} !
Analyzing Twi,er Data with Hadoop
FLUME DEMO
20 ©2013 Cloudera, Inc.
Analyzing Twi,er Data with Hadoop
HIVE
21 ©2013 Cloudera, Inc.
What is Hive?
• Created at Facebook • HiveQL
• SQL like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client
22 ©2013 Cloudera, Inc.
Hive Details
• Schema on read • Scalar types (int, float, double, boolean, string) • Complex types (struct, map, array) • Metastore contains table definiRons
• Stored in a relaRonal database • Similar to catalog tables in other DBs
23
Complex Data
24 ©2013 Cloudera, Inc.
SELECT ! t.retweet_screen_name, ! sum(retweets) AS total_retweets, ! count(*) AS tweet_count!FROM (SELECT ! retweeted_status.user.screen_name AS retweet_screen_name, ! retweeted_status.text, ! max(retweeted_status.retweet_count) AS retweets! FROM tweets ! GROUP BY ! retweeted_status.user.screen_name, ! retweeted_status.text) t !GROUP BY t.retweet_screen_name!ORDER BY total_retweets DESC !LIMIT 10; !
Analyzing Twi,er Data with Hadoop
JSON INTERLUDE
25 ©2013 Cloudera, Inc.
What is JSON?
• Complex, semi-‐structured data • Based on JavaScript’s data syntax • Rich, nested data types:
• number • string • Array • object • true, false • null
26 ©2013 Cloudera, Inc.
What is JSON?
27 ©2013 Cloudera, Inc.
{ ! "retweeted_status": { ! "contributors": null, ! "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata", ! "retweeted": false, ! "entities": { ! "hashtags": [ ! { ! "text": "Crowdsourcing", ! "indices": [0, 14] ! }, ! { ! "text": "bigdata", ! "indices": [129,137] ! } ! ], ! "user_mentions": [] ! } ! } !} !
Hive Serializers and Deserializers
• Instructs Hive on how to interpret data • JSONSerDe
28 ©2013 Cloudera, Inc.
Analyzing Twi,er Data with Hadoop
HIVE DEMO
29 ©2013 Cloudera, Inc.
Analyzing Twi,er Data with Hadoop
IT’S A TRAP!
30 ©2013 Cloudera, Inc.
Photo from h,p://www.flickr.com/photos/vanf/6798065626/ Some rights reserved
Not a Database
31 ©2013 Cloudera, Inc.
RDBMS Hive
Language Generally >= SQL-92
Subset of SQL-92 plus Hive specific extensions
Update Capabilities INSERT, UPDATE, DELETE
INSERT OVERWRITE no UPDATE, DELETE
Transactions Yes No Latency Sub-second Minutes Indexes Yes Yes Data size Terabytes Petabytes
ETL
• Hive works great for SQL-‐based ETL • JSON -‐> SequenceFiles
32 ©2013 Cloudera, Inc.
Analyzing Twi,er Data with Hadoop
IMPALA
33 ©2013 Cloudera, Inc.
Cloudera Impala
34
Real-‐Time Query for Data Stored in Hadoop.
Supports Hive SQL
4-‐30X faster than Hive over MapReduce
Uses exisRng drivers, integrates with exisRng metastore, works with leading BI tools
Flexible, cost-‐effecRve, no lock-‐in
Deploy & operate with Cloudera Enterprise RTQ
Supports mulRple storage engines & file formats
©2013 Cloudera, Inc.
Benefits of Cloudera Impala
35
Real-‐Time Query for Data Stored in Hadoop
• Real-‐Rme queries run directly on source data • No ETL delays • No jumping between data silos
• No double storage with EDW/RDBMS • Unlock analysis on more data • No need to create and maintain complex ETL between systems • No need to preplan schemas
• All data available for interacRve queries • No loss of fidelity from fixed data schemas
• Single metadata store from originaRon through analysis • No need to hunt through mulRple data silos
©2013 Cloudera, Inc.
Cloudera Impala Details
36 ©2013 Cloudera, Inc.
HDFS DN
Query Exec Engine
Query Coordinator
Query Planner
HBase
ODBC
SQL App
HDFS DN
Query Exec Engine
Query Coordinator
Query Planner
HBase HDFS DN
Query Exec Engine
Query Coordinator
Query Planner
HBase
Fully MPP Distributed
Local Direct Reads
State Store
HDFS NN Hive Metastore YARN
Common Hive SQL and interface
Unified metadata and scheduler
Low-‐latency scheduler and cache (low-‐impact failures)
Analyzing Twi,er Data with Hadoop
IMPALA DEMO
37 ©2013 Cloudera, Inc.
Analyzing Twi,er Data with Hadoop
OOZIE AUTOMATION
38 ©2013 Cloudera, Inc.
Oozie: Everything in its Right Place
Oozie for ParRRon Management
• Once an hour, add a parRRon • Takes advantage of advanced Hive funcRonality
Analyzing Twi,er Data with Hadoop
OOZIE DEMO
41 ©2013 Cloudera, Inc.
Analyzing Twi,er Data with Hadoop
PUTTING IT ALL TOGETHER
42 ©2013 Cloudera, Inc.
Complete Architecture
43 ©2013 Cloudera, Inc.
Twi,er
HDFS Flume Hive
Custom Flume Source
Sink to HDFS
JSON SerDe Parses Data
Oozie
Add ParRRons Hourly
Impala
Queries
Queries and ETL
Analyzing Twi,er Data with Hadoop
MORE DEMOS
44 ©2013 Cloudera, Inc.
What next?
• Cloudera University • Download Hadoop! • CDH available at www.cloudera.com • Cloudera provides pre-‐loaded VMs
• h,ps://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+EdiRon+Demo+VM
• Clone the source repo • h,ps://github.com/cloudera/cdh-‐twi,er-‐example
My personal preference
• Cloudera Manager • h,ps://ccp.cloudera.com/display/SUPPORT/Downloads
• Free up to 50 unlimited nodes!
Shout Out
• Jon Natkins • @na,yice • Blog posts
• h,p://blog.cloudera.com/blog/2013/09/analyzing-‐twi,er-‐data-‐with-‐hadoop/
• h,p://blog.cloudera.com/blog/2013/10/analyzing-‐twi,er-‐data-‐with-‐hadoop-‐part-‐2-‐gathering-‐data-‐with-‐flume/
• h,p://blog.cloudera.com/blog/2013/11/analyzing-‐twi,er-‐data-‐with-‐hadoop-‐part-‐3-‐querying-‐semi-‐structured-‐data-‐with-‐hive/
49 ©2013 Cloudera, Inc.