Facebook hadoop-summit

HIVE

data warehousing using Hadoop

Facebook Data Team

Motivation

Structured log and dimension data
– Well known schemas, different serialization formats (binary/text)
– Rich data structures: nesting/maps/lists

Query language over structured data
– SQL helps in easier adoption by business analysts + reduced learning curve for everyone
– Developers love streaming and direct access to map-reduce
– Query Language brings together SQL and Streaming

Data Management
– Tables/Partitions for easy data addressability
– Abstractions allow optimizations:
    Organize data for large joins/sampling
    Add indices/manage compression/replication transparently

What is HIVE?

[Architecture diagram. Components shown: Hive CLI (DDL, Queries, Browsing); Mgmt. Web UI; Hive QL; Parser; Planner; Execution; MetaStore; Thrift API; SerDe (Thrift, Jute, JSON, ..); Map Reduce; HDFS]

Dealing with Structured Data

Type system
– Primitive types
– Recursively build up using Composition/Maps/Lists

Generic (De)Serialization Interface (SerDe)
– To recursively list schema
– To recursively access fields within a row object

Serialization families implement interface
– Thrift (Binary and Delimited Text), RecordIO, JSON/PADS(?)

XPath like field expressions
– profiles.network[@is_primary=1].id

Inbuilt DDL
– Define schema over delimited text files
– Leverages Thrift DDL
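
The recursive type system, the "list schema" half of the SerDe interface, and the XPath like field expression above can be sketched in Python. This is a minimal illustration only, not Hive's actual interface: the class names, `list_schema`, `select`, and the path notation (`[]`, `{key}`, `{value}`) are all hypothetical, and the expression evaluator handles just the `name[@attr=value]` form shown on the slide.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Primitive:
    name: str  # e.g. "long", "string" (type names are assumptions)


@dataclass(frozen=True)
class ListOf:
    element: "HiveType"


@dataclass(frozen=True)
class MapOf:
    key: "HiveType"
    value: "HiveType"


@dataclass(frozen=True)
class Struct:
    fields: tuple  # tuple of (field_name, HiveType) pairs


HiveType = Union[Primitive, ListOf, MapOf, Struct]


def list_schema(t, prefix=""):
    """Recursively list (path, primitive type name) pairs for a type."""
    if isinstance(t, Primitive):
        return [(prefix, t.name)]
    if isinstance(t, ListOf):
        return list_schema(t.element, prefix + "[]")
    if isinstance(t, MapOf):
        return (list_schema(t.key, prefix + "{key}")
                + list_schema(t.value, prefix + "{value}"))
    if isinstance(t, Struct):
        out = []
        for name, field_type in t.fields:
            child = f"{prefix}.{name}" if prefix else name
            out.extend(list_schema(field_type, child))
        return out
    raise TypeError(f"unknown type node: {t!r}")


# One path step: a field name with an optional [@attr=value] predicate.
STEP = re.compile(r"(\w+)(?:\[@(\w+)=(\w+)\])?")


def select(row, expr):
    """Evaluate an expression such as profiles.network[@is_primary=1].id
    against nested dicts/lists and return the list of matching values."""
    current = [row]
    for step in expr.split("."):
        match = STEP.fullmatch(step)
        if match is None:
            raise ValueError(f"unsupported step: {step!r}")
        name, attr, wanted = match.groups()
        matched = []
        for node in current:
            value = node[name]
            items = value if isinstance(value, list) else [value]
            for item in items:
                # Predicates compare the attribute's string form.
                if attr is None or str(item.get(attr)) == wanted:
                    matched.append(item)
        current = matched
    return current
```

For a row whose `profiles.network` list holds several network records, `select(row, "profiles.network[@is_primary=1].id")` returns the ids of the records flagged as primary.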

Data Model

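[Data model diagram. Tables (clicks, views, …) and Dimensions; Logical Partitioning and Hash Partitioning; Schema and Library held in the MetaStore.
Columns of clicks: IP, userId, AdId.
HDFS layout: /hive/clicks, /hive/clicks/ds=2008-03-25, /hive/clicks/ds=2008-03-25/0.
Hash partitioning: #Partitions=32, Sort-key=uid]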
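
The HDFS layout in the diagram (table directory, then a logical partition such as ds=2008-03-25, then a numbered hash partition) can be sketched as a small path helper. The function names are hypothetical, and `zlib.crc32` is only a stand-in: the slides do not say which hash function is used.

```python
import zlib


def bucket_for(key, num_buckets):
    """Map a key (e.g. a uid) to one of num_buckets hash partitions.

    crc32 is an assumed stand-in hash, chosen because it is deterministic
    across runs; it is not taken from the slides.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def partition_path(table, ds, bucket, root="/hive"):
    """Build the directory path for one hash partition of one logical partition."""
    return f"{root}/{table}/ds={ds}/{bucket}"


# Example: locate the file for a given uid with 32 hash partitions.
path = partition_path("clicks", "2008-03-25", bucket_for(12345, 32))
```

Because the logical partition key appears in the path, a query restricted to one day only has to read that day's directory.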

MetaStore

Stores Table/Partition properties:
– Table schema and SerDe library
– Table Location on HDFS
– Logical Partitioning keys and types
– Sort column
– Mapping from columns to well known Dimensions

Thrift API
– Current clients in Php (Web Interface), Python (CLI), Java (Query Engine), Perl (Tests)

Stores all properties in text files
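
As an illustration of the properties listed above, a table record and its text-file form might look like the following. This is a sketch under stated assumptions: the field names, the `TableProperties` class, and the use of JSON as the text format are all hypothetical; the slides only say that properties are stored in text files.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TableProperties:
    name: str
    columns: dict                 # column name -> type name
    serde_library: str            # which SerDe reads the rows
    location: str                 # table location on HDFS
    partition_keys: dict = field(default_factory=dict)  # key -> type name
    sort_column: str = ""
    dimensions: dict = field(default_factory=dict)      # column -> dimension


def dump_properties(props):
    """Serialize table properties to text (JSON is an assumed format)."""
    return json.dumps(asdict(props), indent=2, sort_keys=True)


def load_properties(text):
    """Rebuild a TableProperties record from its text form."""
    return TableProperties(**json.loads(text))


# Column types below are assumptions; the slides give only the column names.
clicks = TableProperties(
    name="clicks",
    columns={"IP": "string", "userId": "long", "AdId": "long"},
    serde_library="thrift",
    location="/hive/clicks",
    partition_keys={"ds": "string"},
    sort_column="uid",
)
```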

Hive CLI

Implemented in Python – uses MetaStore Thrift API

DDL:
– create table/drop table/rename table
– alter table add column etc.

Browsing:
– show tables
– describe table
– cat table

Loading Data
– load data inpath <path1, …> into table <tablename/partition-spec> [bucketed <N> ways by <dimension>]

Queries
– Issue queries in Hive QL.

Hive Query Language

Philosophy
– SQL like constructs + Hadoop Streaming

Query Operators in initial version
– Projections
– Equijoins and Cogroups
– Group by
– Sampling

Output of these operators can be:
– passed to Streaming mappers/reducers
– stored in another Hive Table
– output to HDFS files

Hive Query Language

Package these capabilities into a more formal SQL like query language in next version

Introduce other important constructs:
– Views
– Multi table inserts
– Order bys
– Select distincts
– SQL like column expressions
– A bunch of other builtin functions

Still work in progress

Query Language - Examples

Multi table inserts

    FROM ad_impressions_stg imps
    INSERT INTO ad_legals/ds=2008-03-08 select imps.* where imps.legal = 1
    INSERT INTO ad_non_legals/ds=2008-03-08 select imps.* where imps.legal = 0

Joins

    FROM ad_impressions imps, ad_dimensions ads
    INSERT INTO ad_legals_joined select imps.*, ads.campaignid
    JOIN ON(imps.adid, ads.adid)
    WHERE imps.legal = 1
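
The semantics of the two examples above can be sketched in plain Python: a multi table insert scans the source once and routes each row to every output whose predicate holds, and an equijoin pairs rows that share a key. The function names and the in-memory hash-join strategy are assumptions for illustration, not a description of how Hive plans these queries.

```python
def multi_insert(rows, routes):
    """Scan rows once; append each row to every output whose predicate holds.

    routes maps an output name to a predicate over a row (a dict).
    """
    outputs = {name: [] for name in routes}
    for row in rows:
        for name, predicate in routes.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs


def equijoin(left, right, key):
    """Join two lists of dict rows on equality of one column.

    Builds an in-memory index over the right side, then probes it
    with each left row.
    """
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [(l, r) for l in left for r in index.get(l[key], [])]


# Usage mirroring the first example: one pass, two outputs.
impressions = [{"adid": 1, "legal": 1}, {"adid": 2, "legal": 0}]
split = multi_insert(impressions, {
    "ad_legals": lambda row: row["legal"] == 1,
    "ad_non_legals": lambda row: row["legal"] == 0,
})
```

The point of the multi table insert form is that the source is read a single time however many outputs are written.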

Query Language - Examples

Group By

    FROM ad_legals_joined imps
    INSERT INTO hdfs://hadoop001:9000/user/ads/adid_uu_summary
        select imps.adid, count_distinct(imps.uid) group by(imps.adid)
    INSERT INTO hdfs://hadoop001:9000/user/ads/campaignid_uu_summary
        select imps.campaignid, count_distinct(imps.uid) group by(imps.campaignid)
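
What `count_distinct(...) group by(...)` computes can be sketched as follows. The helper name `count_distinct_by` is hypothetical, and the sketch keeps every distinct value in memory, which a map-reduce implementation would not do.

```python
from collections import defaultdict


def count_distinct_by(rows, group_col, value_col):
    """For each value of group_col, count the distinct values of value_col."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[group_col]].add(row[value_col])
    return {group: len(values) for group, values in seen.items()}


# Unique users per ad and per campaign from the same input rows.
rows = [
    {"adid": 1, "campaignid": 7, "uid": 10},
    {"adid": 1, "campaignid": 7, "uid": 10},
    {"adid": 1, "campaignid": 7, "uid": 11},
    {"adid": 2, "campaignid": 7, "uid": 10},
]
per_ad = count_distinct_by(rows, "adid", "uid")
per_campaign = count_distinct_by(rows, "campaignid", "uid")
```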

Query Language – HadoopStreaming

APPLY ON TABLE

    CREATE OPERATOR filter_legal using 'exec://filter_legal.py' (ts date, adid long, uid long)

    FROM (APPLY filter_legal ON TABLE ad_impression)
    INSERT INTO ad_legals where ts >= '2008-03-11' and ts < '2008-03-12'

APPLY can also be applied after JOIN as reducer script

    FROM ad_impressions imps, ad_dimensions ads
    INSERT INTO ad_legals_joined select imps.*, ads.campaignid
    JOIN ON(imps.adid, ads.adid)
    APPLY filter_legal BEFORE OUTPUT
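
The slides do not show the contents of filter_legal.py. A streaming operator of this kind might look like the sketch below, which assumes (none of this is from the slides) that rows arrive as tab-delimited lines with the fields ts, adid, uid, legal, and that the script keeps rows whose legal flag is 1 and emits the three declared output columns.

```python
import sys


def filter_legal(lines):
    """Yield 'ts<TAB>adid<TAB>uid' for each input line whose legal flag is '1'.

    Assumed input layout: ts, adid, uid, legal separated by tabs.
    Lines with a different number of fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue
        ts, adid, uid, legal = fields
        if legal == "1":
            yield "\t".join((ts, adid, uid))


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming script would call: read stdin, write stdout."""
    for out in filter_legal(stdin):
        stdout.write(out + "\n")
```

Keeping the filtering logic in a generator separate from `main` makes it easy to exercise on a few lines locally before running it as a mapper or reducer.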

Query Language – Views

Used for expressing
– Union alls
– APPLY operators

Example

    CREATE VIEW actions
    SELECT photo_views.*
    UNION ALL
    SELECT video_views.*
    UNION ALL
    SELECT profile_views.* …

Hive Usage in Facebook

Applications:
– Summarization
    Eg: Daily/Weekly aggregations of impression/click counts
– Ad hoc Analysis
    Eg: how many group admins broken down by state/country
– Data Mining (Assembling training data)
    Eg: User Engagement as a function of user attributes

Usage statistics:
– Total Users: ~40 (about 25% of engineering!)
– Hive Data (compressed): 22 TB total, ~200 GB incoming per day
– Jobs over last 7 days:
    Total Jobs: 3514, Projections: 821, Joins: 152, Aggregates: 800, Loaders: 600
    * Aggregates biased because of multi-stage map-reduce

Conclusion

Release to Open Source in 3-4 months

People:
– Suresh Anthony
– Jeff Hammerbacher (jeffh@)
– Joydeep Sarma (jssarma@)
– Ashish Thusoo (athusoo@)
– Pete Wyckoff (pwyckoff@)

