Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

20
Rethinking Data Warehousing & Analytics Ashish Thusoo, Facebook Data Infrastructure Team

Transcript of Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Page 1: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Rethinking Data Warehousing & Analytics

Ashish Thusoo,Facebook Data Infrastructure Team

Page 2: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Why Another Data Warehousing System?

Data, data and more data200GB per day in March 2008

2+TB(compressed) raw data per day in April 20094+TB(compressed) raw data per day today

Page 3: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook
Page 4: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Trends Leading to More Data

Free or low cost of user services

Realization that more insights are derived fromsimple algorithms on more data

Page 5: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Deficiencies of Existing Technologies

Cost of Analysis and Storage on proprietary systems does not support trends

towards more data

Closed and Proprietary Systems

Limited Scalability does not support trends towards more data

Page 6: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Hadoop Advantages

Pros– Superior in availability/scalability/manageability despite

lower single node performance– Open system– Scalable costs

Cons: Programmability and Metadata– Map-reduce hard to program (users know

sql/bash/python/perl)– Need to publish data in well known schemas

Solution: HIVE

Page 7: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

What is HIVE?

A system for managing and querying structured data built on top of Hadoop

Components– Map-Reduce for execution– HDFS for storage– Metadata in an RDBMS

Page 8: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Hive: Simplifying Hadoop – New Technology Familiar Interfaces

hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh

uniq -c | awk '{print $2"\t"$1}‘

$ cat > /tmp/map.sh

awk -F '\001' '{if($1 > 100) print $1}‘

$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1

$ bin/hadoop dfs –cat /tmp/largekey/part*

Page 9: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Hive: Open and Extensible

Query your own formats and types with your own Serializer/Deserializers

Extend the SQL functionality through User Defined Functions

Do any non-SQL transformations through TRANSFORM operator that sends data from Hive to any user program/script

Page 10: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Hive: Smart Execution Plans for Performance

Hash based Aggregations

Map-side Joins

Predicate Pushdown

Partition Pruning

Many more to come in the future

Page 11: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Interoperability

JDBC and ODBC interfaces available

Integrations with some traditional SQL tools with some minor modifications

More improvements in future to support interoperability with existing front end tools

Page 12: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Information

Available as a sub project in Hadoop- http://wiki.apache.org/hadoop/Hive (wiki)- http://hadoop.apache.org/hive (home page)- http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)- ##hive (IRC)- Works with hadoop-0.17, 0.18, 0.19, 0.20

Release 0.4.0 is coming in the next few days

Mailing Lists: – hive-{user,dev,commits}@hadoop.apache.org

Page 13: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Data Warehousing @ Facebookusing

Hive & Hadoop

Page 14: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Data Flow Architecture at Facebook

Web Servers Scribe MidTier

Filers

Production Hive-Hadoop ClusterOracle RAC Federated MySQL

Scribe-Hadoop Cluster

Adhoc Hive-Hadoop Cluster

Hivereplication

Page 15: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Looks like this ..

Disks

Node

Disks

Node

Disks

Node

Disks

Node

Disks

Node

Disks

Node

1 Gigabit 4 Gigabit

Node=

DataNode +

Map-Reduce

Page 16: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Hadoop & Hive Cluster @ Facebook

Hadoop/Hive Warehouse – the new generation– 4800 cores, Storage capacity of 5.5 PetaBytes– 12 TB per node– Two level network topology

1 Gbit/sec from node to rack switch 4 Gbit/sec to top level rack switch

Page 17: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Hive & Hadoop Usage @ Facebook

Statistics per day:– 4 TB of compressed new data added per day– 135TB of compressed data scanned per day– 7500+ Hive jobs on per day– 80K compute hours per day

Hive simplifies Hadoop:– New engineers go though a Hive training session– ~200 people/month run jobs on Hadoop/Hive– Analysts (non-engineers) use Hadoop through

Hive– 95% of jobs are Hive Jobs

Page 18: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Hive & Hadoop Usage @ Facebook

Types of Applications:– Reporting

Eg: Daily/Weekly aggregations of impression/click counts Measures of user engagement Microstrategy dashboards

– Ad hoc Analysis Eg: how many group admins broken down by state/country

– Machine Learning (Assembling training data) Ad Optimization Eg: User Engagement as a function of user attributes

– Many others

Page 19: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook

Facebook’s contributions…

A lot of significant contributions:– Hive– Hdfs features– Scheduler workEtc…

Talks by Dhruba Borthakur and Zheng Shao in the development track for more information on these projects

Page 20: Hadoop World: Rethinking the Data Warehouse with Hadoop and Hive, Facebook