June 2014 HUG : Hive On Tez - Benchmarked at Yahoo Scale

Benchmarking Hive at Yahoo Scale

Presented by Mithun Radhakrishnan | June 18, 2014

Hadoop User Group


About myself

HCatalog Committer, Hive contributor
› Metastore, Notifications, HCatalog APIs
› Integration with Oozie, Data Ingestion

Other odds and ends
› DistCp


Hadoop User Group, 201406181830, Yahoo Sunnyvale


About this talk

Introduction to “Yahoo Scale”
The use-case in Yahoo
The Benchmark
The Setup
The Observations (and, possibly, lessons)
Fisticuffs



The Y!Grid

16 Hadoop Clusters in YGrid
› 32500 Nodes
› 750K jobs a day

Hadoop 0.23.10.x, 2.4.x

Large Datasets
› Daily, hourly, minute-level frequencies
› Terabytes of data, 1000s of files, per dataset instance

Pig 0.11

Hive 0.10 / HCatalog 0.5
› => Hive 0.12



Data Processing Use cases


Pig for Data Pipelines
› Imperative paradigm
› ~45% Hadoop Jobs on Production Clusters
• M/R + Oozie = 41%

Hive for Ad hoc queries
› SQL
› Relatively smaller number of jobs
• *Major* Uptick

Use HCatalog for Inter-op


Hive is Currently the Fastest Growing Product on the Grid

[Chart: monthly job counts, Mar-13 through May-14. Left axis: All Grid Jobs (in Millions), 0 to 30,000,000. Right axis: Hive Jobs (% of All Jobs), 0.0% to 10.0%. Series: All Jobs; Hive (% of all jobs).]

2.4 million Hive jobs


Business Intelligence Tools

{Tableau, MicroStrategy, Excel, … }

Challenges:
› Security
• ACLs, Authentication, Encryption over the wire, Full-disk Encryption
› Bandwidth
• Transporting results over ODBC
› Query Latency
• Query execution time
• Cost of query “optimizations”
• “Bad” queries



The Benchmark

TPC-H
› Industry standard (tpc.org/tpch)
› 22 queries
› dbgen -s 1000 -S 3
• Parallelizable

Reynold Xin’s excellent work:
› https://github.com/rxin
› Transliterated queries to suit Hive 0.9




Relational Diagram


PART (P_), SF*200,000 rows: PARTKEY, NAME, MFGR, BRAND, TYPE, SIZE, CONTAINER, RETAILPRICE, COMMENT

SUPPLIER (S_), SF*10,000 rows: SUPPKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, COMMENT

PARTSUPP (PS_), SF*800,000 rows: PARTKEY, SUPPKEY, AVAILQTY, SUPPLYCOST, COMMENT

CUSTOMER (C_), SF*150,000 rows: CUSTKEY, NAME, ADDRESS, NATIONKEY, PHONE, ACCTBAL, MKTSEGMENT, COMMENT

ORDERS (O_), SF*1,500,000 rows: ORDERKEY, CUSTKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDER-PRIORITY, CLERK, SHIP-PRIORITY, COMMENT

LINEITEM (L_), SF*6,000,000 rows: ORDERKEY, PARTKEY, SUPPKEY, LINENUMBER, QUANTITY, EXTENDEDPRICE, DISCOUNT, TAX, RETURNFLAG, LINESTATUS, SHIPDATE, COMMITDATE, RECEIPTDATE, SHIPINSTRUCT, SHIPMODE, COMMENT

NATION (N_), 25 rows: NATIONKEY, NAME, REGIONKEY, COMMENT

REGION (R_), 5 rows: REGIONKEY, NAME, COMMENT
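To give a feel for what the SF multipliers in the diagram mean at benchmark scale, here is a minimal Python sketch that turns a scale factor into nominal row counts. The multipliers come from the diagram; the function name `nominal_rows` is made up for illustration, and the mapping of the 100 GB, 1 TB and 10 TB datasets to SF 100, 1000 and 10000 is an assumption based on the usual TPC-H convention that SF 1 is roughly 1 GB of raw text.

```python
# Nominal rows per unit of scale factor (SF), as listed in the relational diagram.
ROWS_PER_SF = {
    "PART": 200_000,
    "PARTSUPP": 800_000,
    "LINEITEM": 6_000_000,
    "ORDERS": 1_500_000,
    "CUSTOMER": 150_000,
    "SUPPLIER": 10_000,
}

# NATION and REGION have a fixed size regardless of SF.
FIXED_ROWS = {"NATION": 25, "REGION": 5}


def nominal_rows(scale_factor):
    """Return the nominal row count of every TPC-H table at the given SF."""
    rows = {table: per_sf * scale_factor for table, per_sf in ROWS_PER_SF.items()}
    rows.update(FIXED_ROWS)
    return rows


if __name__ == "__main__":
    # Assumed mapping: 100 GB, 1 TB and 10 TB correspond to SF 100, 1000 and 10000.
    for sf in (100, 1000, 10000):
        rows = nominal_rows(sf)
        print(f"SF {sf}: LINEITEM={rows['LINEITEM']:,} ORDERS={rows['ORDERS']:,}")
```

These are nominal figures only; the row counts dbgen actually emits for LINEITEM differ slightly from SF*6,000,000.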


The Setup

› 350 Node cluster
• Xeon boxen: 2 Slots with E5530s => 16 CPUs
• 24GB memory
– NUMA enabled
• 6 SATA drives, 2TB, 7200 RPM Seagates
• RHEL 6.4
• JRE 1.7 (-d64)
• Hadoop 0.23.7+/2.3+, Security turned off
• Tez 0.3.x
• 128MB HDFS block-size
› Downscale tests: 100 Node cluster
• hdfs-balancer.sh



The Prep

Data generation:
› Text data: dbgen on MapReduce
› Transcode to RCFile and ORC: Hive on MR
• insert overwrite table orc_table partition( … ) select * from text_table;
› Partitioning:
• Only for 1TB, 10TB cases
• Perils of dynamic partitioning
› ORC File:
• 64MB stripes, ZLIB Compression


Observations


100 GB

› 18x speedup over Hive 0.10 (Textfile)
• 6-50x
› 11.8x speedup over Hive 0.10 (RCFile)
• 5-30x
› Average query time: 28 seconds
• Down from 530 seconds (Hive 0.10 Text)
› 85% queries completed in under a minute



1 TB

› 6.2x speedup over Hive 0.10 (RCFile)
• Between 2.5-17x
› Average query time: 172 seconds
• Between 5-947 seconds
• Down from 729 seconds (Hive 0.10 RCFile)
› 61% queries completed in under 2 minutes
› 81% queries completed in under 4 minutes



10 TB

› 6.2x speedup over Hive 0.10 (RCFile)
• Between 1.6-10x
› Average query time: 908 seconds (426 seconds excluding outliers)
• Down from 2129 seconds with Hive 0.10 RCFile
– (1712 seconds excluding outliers)
› 61% queries completed in under 5 minutes
› 71% queries completed in under 10 minutes
› Q6 still completes in 12 seconds!



Explaining the speed-ups

Hadoop 2.x, et al.

Tez
› (Arbitrary DAG)-based Execution Engine
› “Playing the gaps” between M&R
• Temporary data and the HDFS
› Feedback loop
› Smart scheduling
› Container re-use
› Pipelined job start-up

Hive
› Statistics
› “Vector-ized” Execution

ORC
› PPD



ORC File Layout

Data is composed of multiple streams per column

Index allows for skipping rows (defaults to one entry every 10,000 rows); each entry keeps the position in each stream and the min-max for each column

Footer contains directory of stream locations, and the encoding for each column

Integer columns are serialized using run-length encoding

String columns are serialized using dictionary for column values, and the same run length encoding

Stripe footer is used to find the requested column’s data streams and adjacent stream reads are merged
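The two column encodings named above can be illustrated with a toy Python sketch. This is not ORC's actual on-disk format, which is considerably more elaborate; it only shows the idea of dictionary-encoding a string column and then run-length encoding the resulting integers. The function names and the sample column are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(strings):
    """Replace each string with its index into a sorted dictionary of distinct values."""
    dictionary = sorted(set(strings))
    index_of = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [index_of[s] for s in strings]


if __name__ == "__main__":
    # Hypothetical "state" column with long runs of repeated values.
    states = ["CA", "CA", "CA", "NY", "NY", "CA"]
    dictionary, ids = dictionary_encode(states)
    print(dictionary)              # distinct values, stored once
    print(ids)                     # the column as small integers
    print(run_length_encode(ids))  # the integers, run-length encoded
```

Columns with few distinct values and long runs shrink the most under this scheme, which is one reason sorted or clustered data tends to compress well.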


ORC Usage

CREATE TABLE addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
)
STORED AS orc
LOCATION '/path/to/addresses'
TBLPROPERTIES ("orc.compress"="ZLIB");

ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc

SET hive.default.fileformat = orc

SET hive.exec.orc.memory.pool = 0.50 (ORC writer is allowed 50% of JVM heap size by default)

ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

Key | Default | Comments
orc.compress | ZLIB | high-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size | 262,144 (256 KB) | number of bytes in each compression chunk
orc.stripe.size | 67,108,864 (64 MB) | number of bytes in each stripe. Each ORC stripe is processed in one map task (try 32 MB to cut down on disk I/O)
orc.row.index.stride | 10,000 | number of rows between index entries (must be >= 1,000). A larger stride-size increases the probability of not being able to skip the stride, for a predicate.
orc.create.index | true | whether to create row indexes. This is for predicate push-down. If data is frequently accessed/filtered on a certain column, then sorting on the column and using index-filters makes column filters work faster
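The comments on orc.row.index.stride and orc.create.index describe min-max based skipping. A minimal Python sketch of the idea follows, assuming a single integer column and a range predicate; `build_row_index` and `strides_to_read` are hypothetical names and this is not ORC reader code.

```python
def build_row_index(values, stride):
    """Record (start_row, min, max) for each group of `stride` rows."""
    index = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        index.append((start, min(chunk), max(chunk)))
    return index


def strides_to_read(index, lo, hi):
    """Start rows of the strides whose [min, max] overlaps lo <= x <= hi."""
    return [start for start, mn, mx in index if mx >= lo and mn <= hi]


if __name__ == "__main__":
    # Sorted column: only the strides covering the predicate range are read.
    sorted_col = list(range(100))
    print(strides_to_read(build_row_index(sorted_col, 10), 25, 34))

    # Same data with a larger stride: fewer index entries, coarser skipping.
    print(strides_to_read(build_row_index(sorted_col, 50), 25, 34))

    # Interleaved column: every stride spans the range, so nothing is skipped.
    shuffled_col = [(i * 37) % 100 for i in range(100)]
    print(strides_to_read(build_row_index(shuffled_col, 10), 25, 34))
```

On the sorted column with a stride of 10, only two of the ten strides overlap the predicate. The larger stride and the interleaved column both widen each stride's min-max range, which is why sorting on a frequently filtered column helps.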


Configuring ORC

set hive.merge.mapredfiles=true
set hive.merge.mapfiles=true
set orc.stripe.size=67108864
› Half the HDFS block-size
• Tangent: nStripes vs nBlocks
• Tangent: DistCp

set orc.compress=???
› Depends on size and distribution
› Snappy compression hasn’t been explored

YMMV
› Experiment
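The "nStripes vs nBlocks" tangent comes down to simple arithmetic, sketched here in Python. The sketch assumes, as a simplification, that every stripe occupies exactly the configured stripe size on disk; in practice compression changes this. The function name `layout` is hypothetical; the 64 MB and 128 MB defaults are the stripe and HDFS block sizes quoted in this deck.

```python
import math

MB = 1024 * 1024


def layout(file_bytes, stripe_bytes=64 * MB, block_bytes=128 * MB):
    """Estimate stripe and block counts for one file (simplified model)."""
    return {
        "stripes": math.ceil(file_bytes / stripe_bytes),
        "blocks": math.ceil(file_bytes / block_bytes),
        "stripes_per_block": block_bytes // stripe_bytes,
    }


if __name__ == "__main__":
    # A 1 GB file with 64 MB stripes on 128 MB blocks.
    print(layout(1024 * MB))
    # The 32 MB stripe size suggested in the table above.
    print(layout(1024 * MB, stripe_bytes=32 * MB))
```

Since each stripe is processed in one map task, halving the stripe size roughly doubles the number of tasks over the same file, which is part of what "Experiment" means here.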


Conclusions


Y!Grid sticking with Hive

Familiarity
› Existing ecosystem

Community

Scale

Multitenant

Coming down the pike
› CBO
› In-memory caching solutions atop HDFS
• RAMfs a la Tachyon?



We’re not done yet

SQL compliance

Scaling up the metastore performance

Better BI Tool integration

Faster transport
› HiveServer2 result-sets



References

The YDN blog post:
› http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn

Code:
› https://github.com/mythrocks/hivebench (TPC-H scripts, datagen, transcode utils)
› https://github.com/t3rmin4t0r/tpch-gen (Parallel TPC-H gen)
› https://github.com/rxin/TPC-H-Hive (TPC-H scripts for Hive)
› https://issues.apache.org/jira/browse/HIVE-600 (Yuntao’s initial TPC-H JIRA)

Thank You


We are hiring!

Reach out to us at [email protected].

I’m glad you asked.


Sharky comments

Testing with Shark 0.7.x and Shark 0.8
› Compatible with Hive Metastore 0.9
› 100GB datasets: Admirable performance
› 1TB/10TB: Tests did not run completely
• Failures, especially in 10TB cases
• Hangs while shuffling data
• Scaled back to 100 nodes -> More tests ran through, but not completely
› nReducers: Not inferred

Miscellany
› Security
› Multi-tenancy
› Compatibility

