Twitter with hadoop for oow

description

"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example

Transcript of Twitter with hadoop for oow

Page 1: Twitter with hadoop for oow

Analyzing Twitter Data with Hadoop
Gwen Shapira, Software Engineer
@Gwenshap

©2012 Cloudera, Inc.

Page 2: Twitter with hadoop for oow

IOUG SIG Meetings at OpenWorld

All meetings located in Moscone South - Room 208

Monday, September 29
• Exadata SIG: 2:00 p.m. - 3:00 p.m.
• BIWA SIG: 5:00 p.m. - 6:00 p.m.

Tuesday, September 30
• Internet of Things SIG: 11:00 a.m. - 12:00 p.m.
• Storage SIG: 4:00 p.m. - 5:00 p.m.
• SPARC/Solaris SIG: 5:00 p.m. - 6:00 p.m.

Wednesday, October 1
• Oracle Enterprise Manager SIG: 8:00 a.m. - 9:00 a.m.
• Big Data SIG: 10:30 a.m. - 11:30 a.m.
• Oracle 12c SIG: 2:00 p.m. - 3:00 p.m.
• Oracle Spatial and Graph SIG: 4:00 p.m. (*OTN lounge)

Page 3: Twitter with hadoop for oow

COLLABORATE 15 – IOUG Forum
April 12-16, 2015
Mandalay Bay Resort and Casino
Las Vegas, NV

The IOUG Forum Advantage

• Save more than $1,000 on education offerings like pre-conference workshops
• Access the brand-new, specialized IOUG Strategic Leadership Program
• Priority access to the hands-on labs with Oracle ACE support
• Advance access to supplemental session material and presentations
• Special IOUG activities with no "ante in" needed - evening networking opportunities and more

COLLABORATE 15 Call for Speakers ends October 10

www.collaborate.ioug.org

Follow us on Twitter at @IOUG or via the conference hashtag #C15LV!

Page 4: Twitter with hadoop for oow

©2014 Cloudera, Inc. All rights reserved.

I have 15 years of experience in moving data around

Page 5: Twitter with hadoop for oow

In my spare time…

• Oracle ACE Director
• Member of Oak Table
• Blogger
• Presenter – Hotsos, IOUG, OOW, OSCON
• NoCOUG board
• Contributor to Apache Oozie, Sqoop, Kafka
• Author – Hadoop Application Architectures

Page 6: Twitter with hadoop for oow

BUILDING A HADOOP APPLICATION

Page 7: Twitter with hadoop for oow

Page 8: Twitter with hadoop for oow

High Level Architecture

Data Source -> Flume -> HDFS -> Hive + Oozie -> Impala / Oracle

Page 9: Twitter with hadoop for oow

AN EXAMPLE USE CASE

Page 10: Twitter with hadoop for oow

Analyzing Twitter

• Social media popular with marketing teams
• Twitter is an effective tool for promotion
• Which Twitter user gets the most retweets?
• Who is influential in our industry?
• Which topics are trending?
• “You mentioned Oracle, please take this survey”

Page 11: Twitter with hadoop for oow

HOW DO WE ANSWER THESE QUESTIONS?

Page 12: Twitter with hadoop for oow

Techniques

• Bring Data with Flume
• Complex data
  • Deeply nested
  • Variable schema
• Clean, Standardize, Partition, etc
• SQL
  • Filtering
  • Aggregation
  • Sorting

Page 13: Twitter with hadoop for oow

FLUME

Page 14: Twitter with hadoop for oow

Flume Agent design

Page 15: Twitter with hadoop for oow

In our case…

• Twitter source
  • Pulls JSON format files from Twitter
• Memory Channel
• HDFS Sink – directory per hour
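The source, channel, and sink roles listed here can be sketched in a few lines of Python. This is a toy model of the data flow only, not Flume's API: the function names, the sample events, and the queue size are all made up for illustration, and a `queue.Queue` stands in for the memory channel.

```python
import json
import queue


def twitter_like_source(events, channel):
    """Source: receive events and put them on the channel."""
    for event in events:
        channel.put(event)


def file_like_sink(channel, output_lines):
    """Sink: drain the channel and write each event as one line of JSON."""
    while True:
        try:
            event = channel.get_nowait()
        except queue.Empty:
            break
        output_lines.append(json.dumps(event))


# In-memory buffer between source and sink; the size 100 is arbitrary.
channel = queue.Queue(maxsize=100)
written = []  # stands in for the files the sink would write

# Hypothetical events, not real tweets.
sample_events = [
    {"text": "hello #oow"},
    {"text": "hadoop demo"},
    {"text": "flume works"},
]

twitter_like_source(sample_events, channel)
file_like_sink(channel, written)
```

The point of the channel is decoupling: the source and the sink never call each other, so either side can be swapped without touching the other.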

Page 16: Twitter with hadoop for oow

What is JSON?

{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        { "text": "Crowdsourcing", "indices": [0, 14] },
        { "text": "bigdata", "indices": [129, 137] }
      ],
      "user_mentions": []
    }
  }
}
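To make "deeply nested" concrete, here is a small sketch that parses the tweet fragment above with Python's standard `json` module and pulls out the hashtags. The variable names are illustrative; only the JSON itself comes from the slide.

```python
import json

# The tweet fragment from the slide, as one JSON document.
raw_tweet = """
{
  "retweeted_status": {
    "contributors": null,
    "text": "#Crowdsourcing – drivers already generate traffic data for your smartphone to suggest alternative routes when a road is clogged. #bigdata",
    "retweeted": false,
    "entities": {
      "hashtags": [
        {"text": "Crowdsourcing", "indices": [0, 14]},
        {"text": "bigdata", "indices": [129, 137]}
      ],
      "user_mentions": []
    }
  }
}
"""

tweet = json.loads(raw_tweet)

# Walk the nesting: retweeted_status -> entities -> hashtags -> text
entities = tweet["retweeted_status"]["entities"]
hashtags = [tag["text"].lower() for tag in entities["hashtags"]]

# JSON null becomes Python None
contributors = tweet["retweeted_status"]["contributors"]
```

Getting to a single hashtag takes four levels of lookup, which is why a flat, CSV-style table definition is not enough for this data.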

Page 17: Twitter with hadoop for oow

But Wait! There’s More!

• Many sources – directory, files, log4j, net, JMS
• Interceptors – process data in flight
• Selectors – choose which sink
• Many channels – Memory, file
• Many sinks – HDFS, HBase, Solr

Page 18: Twitter with hadoop for oow

High Level Pipeline Architecture

[Diagram: eight Web Apps, each with a Flume Avro Client, send events to four Flume Agents. The agents deliver to HDFS and to Spark Streaming / HBase, which feed a Report App.]

Diagram callouts:
• Fan-in pattern
• Multiple agents for failover and rolling restarts
• Client provides multi-threading, compression, encryption, and batching
• Spark Streaming data is a subset of the whole event stream
• ML Map/Reduce jobs
• Batch report updates
• Pull near-real-time results
• Query with the HBase API or Impala

Page 19: Twitter with hadoop for oow

Configuration

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey =
TwitterAgent.sources.Twitter.consumerSecret =
TwitterAgent.sources.Twitter.accessToken =
TwitterAgent.sources.Twitter.accessTokenSecret =
TwitterAgent.sources.Twitter.keywords = hadoop, big data, flume, sqoop, oracle, oow

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://quickstart:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.serializer = text

TwitterAgent.channels.MemChannel.type = memory
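The `%Y/%m/%d/%H` part of `hdfs.path` is what produces one directory per hour. The sketch below imitates that expansion with Python's `strftime`, which happens to use the same four escapes. It is an illustration, not Flume code: the helper name and the sample timestamps are assumptions, and in Flume the time is normally taken from each event's timestamp header.

```python
from datetime import datetime

# Path pattern from the sink configuration (host and port as on the slide).
HDFS_PATH_PATTERN = "hdfs://quickstart:8020/user/flume/tweets/%Y/%m/%d/%H/"


def hourly_directory(event_time: datetime) -> str:
    """Expand the date escapes for one event time (illustration only)."""
    return event_time.strftime(HDFS_PATH_PATTERN)


# Two events in the same hour share a directory; the next hour gets a new one.
first = hourly_directory(datetime(2014, 9, 29, 14, 5))
second = hourly_directory(datetime(2014, 9, 29, 14, 59))
third = hourly_directory(datetime(2014, 9, 29, 15, 0))
```

This year/month/day/hour layout is what lets the data later be exposed as hourly partitions.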

Page 20: Twitter with hadoop for oow

FLUME DEMO

Page 21: Twitter with hadoop for oow

HIVE

Page 22: Twitter with hadoop for oow

What is Hive?

• Created at Facebook
• HiveQL
  • SQL like interface
• Hive interpreter converts HiveQL to MapReduce code
• Returns results to the client

Page 23: Twitter with hadoop for oow

Hive Details

• Metastore contains table definitions
  • Stored in a relational database
  • Basically a data dictionary
• SerDes parse data and convert it to a table/column structure
• SerDe:
  • CSV, XML, JSON, Avro, Parquet, ORC files
  • Or write your own (We created one for CopyBook)
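Conceptually, the deserialize half of a SerDe turns one raw record into a row with a fixed set of columns. The Python sketch below shows that idea for JSON. It is not Hive's SerDe interface (real SerDes are Java classes); the column list, helper names, and sample record are hypothetical.

```python
import json

# Hypothetical table definition: (column name, path into the JSON document)
COLUMNS = [
    ("text", ("text",)),
    ("screen_name", ("user", "screen_name")),
    ("retweet_count", ("retweet_count",)),
]


def get_path(doc, path):
    """Follow a path of keys; return None if any step is missing."""
    for key in path:
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def deserialize(line):
    """Turn one line of JSON text into a row (tuple) matching COLUMNS."""
    doc = json.loads(line)
    return tuple(get_path(doc, path) for _, path in COLUMNS)


# The sample record has no retweet_count, so that column comes back as None.
row = deserialize('{"text": "hi #oow", "user": {"screen_name": "gwenshap"}}')
```

Returning `None` for a missing field mirrors how a variable schema usually shows up in query results: as NULL columns rather than as errors.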

Page 24: Twitter with hadoop for oow

Complex Data

SELECT
  t.retweet_screen_name,
  sum(retweets) AS total_retweets,
  count(*) AS tweet_count
FROM (SELECT
        retweeted_status.user.screen_name AS retweet_screen_name,
        retweeted_status.text,
        max(retweeted_status.retweet_count) AS retweets
      FROM tweets
      GROUP BY retweeted_status.user.screen_name,
               retweeted_status.text) t
GROUP BY t.retweet_screen_name
ORDER BY total_retweets DESC
LIMIT 10;
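To see what the two levels of GROUP BY are doing, here is the same logic in plain Python over a few made-up rows. The sample data is hypothetical. The reading of `max(...)` is an assumption: the same original tweet presumably appears once per captured retweet, each time with a higher running count, so the maximum is the latest count.

```python
from collections import defaultdict

# Hypothetical sample rows; each one is a captured retweet.
tweets = [
    {"retweeted_status": {"user": {"screen_name": "alice"}, "text": "A", "retweet_count": 3}},
    {"retweeted_status": {"user": {"screen_name": "alice"}, "text": "A", "retweet_count": 5}},
    {"retweeted_status": {"user": {"screen_name": "alice"}, "text": "B", "retweet_count": 2}},
    {"retweeted_status": {"user": {"screen_name": "bob"}, "text": "C", "retweet_count": 4}},
]

# Inner query: max(retweet_count) per (screen_name, text)
max_per_tweet = defaultdict(int)
for t in tweets:
    rs = t["retweeted_status"]
    key = (rs["user"]["screen_name"], rs["text"])
    max_per_tweet[key] = max(max_per_tweet[key], rs["retweet_count"])

# Outer query: sum(retweets) and count(*) per screen_name
totals = defaultdict(lambda: [0, 0])  # screen_name -> [total_retweets, tweet_count]
for (screen_name, _text), retweets in max_per_tweet.items():
    totals[screen_name][0] += retweets
    totals[screen_name][1] += 1

# ORDER BY total_retweets DESC LIMIT 10
top10 = sorted(
    ((name, total, count) for name, (total, count) in totals.items()),
    key=lambda row: row[1],
    reverse=True,
)[:10]
```

Without the inner step, summing `retweet_count` directly would count the same tweet several times (3 + 5 for tweet "A" here instead of 5).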

Page 25: Twitter with hadoop for oow

HIVE DEMO

Page 26: Twitter with hadoop for oow

IT’S A TRAP

Page 27: Twitter with hadoop for oow

Not a Database

                     RDBMS                   Hive                                            Impala
Language             Generally >= SQL-92     Subset of SQL-92 plus Hive specific extensions  Subset of SQL-92
Update Capabilities  INSERT, UPDATE, DELETE  Bulk INSERT, UPDATE, DELETE                     Insert, truncate
Transactions         Yes                     Yes                                             No
Latency              Sub-second              Minutes                                         Sub-second
Indexes              Yes                     Yes                                             No
Data size            Few Terabytes           Petabytes                                       Lots of Terabytes

Page 28: Twitter with hadoop for oow

DATA FORMATS

Page 29: Twitter with hadoop for oow

I don’t like our data

• Lots of small files
• JSON – requires parsing
• Can’t compress
• Sensitive to changes

Page 30: Twitter with hadoop for oow

I’d rather use Avro

• Few large files containing records
• Schema in file
• Schema evolution
• Can compress
• Well supported in Hadoop
• Clients in other languages

Page 31: Twitter with hadoop for oow

Let’s convert

• Create table AVRO_TWEETS
• Insert into AVRO_TWEETS select …. from tweets

Page 32: Twitter with hadoop for oow

IMPALA ASIDE

Page 33: Twitter with hadoop for oow

Cloudera Impala
Real-Time Query for Data Stored in Hadoop.

• FAMILIAR: Supports Hive SQL
• FAST: 4-30X faster than Hive over MapReduce
• INTEGRATED: Uses existing drivers, integrates with existing metastore, works with leading BI tools
• 100% OPEN SOURCE: Flexible, cost-effective, no lock-in
• EASY TO USE: Deploy & operate with Cloudera Enterprise RTQ
• FLEXIBLE: Supports multiple storage engines & file formats

Page 34: Twitter with hadoop for oow

Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop

SPEED TO INSIGHT
• Real-time queries run directly on source data
• No ETL delays
• No jumping between data silos

COST SAVINGS
• No double storage with EDW/RDBMS
• Unlock analysis on more data
• No need to create and maintain complex ETL between systems
• No need to preplan schemas

FULL FIDELITY ANALYSIS
• All data available for interactive queries
• No loss of fidelity from fixed data schemas

DISCOVERABILITY
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos

Page 35: Twitter with hadoop for oow

Cloudera Impala Details

[Architecture diagram: a SQL App connects over ODBC. Each of three nodes runs a Query Planner, Query Coordinator, and Query Exec Engine next to an HDFS DN and HBase. Shared services shown: Hive Metastore, HDFS NN, YARN, and State Store.]

Diagram callouts:
• Common Hive SQL and interface
• Unified metadata and scheduler
• Fully MPP, distributed
• Local direct reads
• Low-latency scheduler and cache (low-impact failures)

Page 36: Twitter with hadoop for oow

LOAD DATA TO ORACLE

Page 37: Twitter with hadoop for oow

Oracle Connectors for Hadoop

• Oracle Loader for Hadoop

• Oracle SQL Connector for Hadoop

• Big Data SQL

Page 38: Twitter with hadoop for oow

Oracle Loader for Hadoop

• Load data from Hadoop into Oracle
• MapReduce job inside Hadoop
• Converts data types, partitions and sorts
• Direct path loads
• Reduces CPU utilization on database
• Supports Avro and compression

Page 39: Twitter with hadoop for oow

Oracle SQL Connector for Hadoop

• Run a Java app
• Creates an external table
• Runs MapReduce when external table is queried
• Can use Hive Metastore for schema
• Optimized for parallel queries
• Supports Avro and compression

Page 40: Twitter with hadoop for oow

Big Data SQL

• Also external table
• Can also use Hive metastore for schema
• But …. NO MapReduce
• Instead – an agent will do SMART SCANS
  • Bloom filters
  • Storage indexes
  • Filters
• Supports any Hadoop data format

Page 41: Twitter with hadoop for oow

PUTTING IT ALL TOGETHER

Page 42: Twitter with hadoop for oow

High Level Architecture

Data Source -> Flume -> HDFS -> Hive + Oozie -> Impala / Oracle

Page 44: Twitter with hadoop for oow
