Apache Hive: Present and Future - 2016

Apache Hive: Present and Future
Joe Ooura & Yuta Imai, 2016/4/22

© Hortonworks Inc. 2011–2015. All Rights Reserved


Introduction

• Please submit questions using the QUESTIONS button. Only the presenters can see them.

• Comments and questions via Twitter are also very welcome! #hwxjp


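Self-introduction
• Jotaro (Joe) Ooura, Twitter: @JOOOURA
• Father of a 5-year-old and an 8-year-old
• After working in systems sales for servers and storage, helped launch the Japanese subsidiary of a flash memory storage company in 2011, covering the full range of roles: evangelist, presales SE, PR, and sales. Brought the ioDrive series, which became synonymous with enterprise flash, to many Japanese telecom carriers, financial institutions, web service companies, ad-tech companies, and data center operators.
• Joined Hortonworks Japan in January 2016 as its second salesperson. Currently works on evangelism, enterprise sales, and partner support.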


About Hortonworks

Customer Momentum
• ~800 customers (as of February 2016)
• 152 added in Q3 2015
• Listed on NASDAQ in October 2015: HDP

The Leader in Connected Data Platforms
• Hortonworks DataFlow for data in motion
• Hortonworks Data Platform for data at rest
• Powering new modern data applications

Partner for Customer Success
• Leader in open-source community, focused on innovation to meet enterprise needs
• Unrivaled support subscriptions

Founded in 2011

Founded by 24 architects, developers, and operators who built the original Hadoop at Yahoo!

1000+ EMPLOYEES

1500+ ECOSYSTEM PARTNERS


Our Model: Drive an Enterprise-focused Roadmap

1. Innovate Existing Projects – Hive/Stinger, YARN, HDFS, common ops & security via Ambari & Ranger

2. Incubate New Projects – Metron (was OpenSOC), Ranger, Knox, Atlas, Falcon, Ambari, Tez, etc.

3. Acquire IP & Contribute – Acquired XA Secure and created Apache Ranger; contributed OpenSOC

4. Partner & Deliver Joint Solutions – Microsoft, EMC, HP, SAS, Pivotal, Red Hat, Teradata, etc.

5. Rally the Ecosystem – Fast SQL via Stinger initiative, Data Governance initiative, ODPi

[Figure: platform pillars: Data Access (batch, interactive, real-time); Integration & Governance; Operations; Security]

Apache Project  HWX Committers  HWX PMC  HWX % of Committers
Hadoop          29              24       31%
Accumulo        2               2        9%
Calcite         6               3        43%
HBase           8               5        17%
Hive            19              11       38%
NiFi            5               5        42%
Phoenix         5               5        22%
Pig             5               5        24%
Slider          12              12       100%
Spark           1               0        2%
Storm           4               4        19%
Tez             15              15       44%
Atlas           7               0        35%
Falcon          7               5        41%
Flume           1               1        4%
Kafka           0               0        0%
Sqoop           1               1        4%
Ambari          39              30       76%
Oozie           4               2        22%
ZooKeeper       2               1        13%
Knox            12              2        80%
Ranger          13              11       76%
TOTAL           197             144

Source: Apache Software Foundation. As of October 5, 2015. A committer is someone who has "earned their stripes" within the Apache community and has the ability to commit code directly to their corresponding Apache project source code repository.


100% Open Source Connected Data Platforms

Eliminates Risk of vendor lock-in by delivering 100% Apache open source technology

Maximizes Community Innovation with hundreds of developers across hundreds of companies

Integrates Seamlessly through committed co-engineering partnerships with other leading technologies

[Figure: The Innovation Advantage: innovation over time for the open community (maximum community innovation) vs. proprietary Hadoop]


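Self-introduction
• Yuta Imai, Twitter: @imai_factory
• Solutions Engineer
• First encountered Hadoop when using MapReduce (Perl + Streaming!) to generate reports for an ad-serving server.
• Later, at AWS, worked with ad-tech and gaming customers, focusing mainly on big data products such as EMR and S3. That connection led to joining Hortonworks, where I now work on Hadoop.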


Recent Apache Hive: Key highlights

• Up to Hive 1.2.1
  – Tez
  – Cost Based Optimizer (CBO)
  – ORC File format
  – Vectorization

• Hive 2.0
  – LLAP





Stinger Initiative: making Hive more than 100x faster

Already available on HDP!


Sub-second: aiming for sub-second response times on short queries







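• Each of these improvements can be enabled with a few lines of configuration or a single command.
  – As of today (4/22), Hive 2.0 is not yet included in HDP.

• Today's talk focuses on how these mechanisms work.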



Hive performance recap
• Stinger:
  • A project started with the goal of making Apache Hive 100x faster

Vectorized SQL Engine, Tez Execution Engine, ORC Columnar format, Cost Based Optimizer

Hive 0.10: Batch Processing

Hive 0.14: Human Interactive (5 seconds), 100-150x Query Speedup


TPC-DS Benchmark at 30 Terabyte Scale

• Ran 50 sample queries from TPC-DS at 30 terabyte scale
• Average speedup of 52x, maximum speedup of 160x
• Total benchmark runtime reduced from 7.8 days to 9.3 hours
• The Cost-Based Optimizer added in Hive 0.14 delivers a further 2.5x speedup
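A quick arithmetic check on the figures above: 7.8 days is 187.2 hours, so the end-to-end reduction works out to roughly 20x, well below the 52x average. Assuming the average is taken over per-query speedup ratios (the slide does not say), that gap is expected: a mean of ratios is pulled up by short queries that improved dramatically, while total runtime is dominated by the longest queries. The two-query timings in the second half are hypothetical, chosen only to illustrate the effect.

```python
# End-to-end figures from the benchmark summary above.
before_hours = 7.8 * 24          # 7.8 days expressed in hours (187.2)
after_hours = 9.3
total_speedup = before_hours / after_hours
print(f"end-to-end speedup: {total_speedup:.1f}x")   # about 20.1x

# Hypothetical per-query timings in seconds (NOT from the benchmark):
# a long query with a modest speedup and a short query with a large one.
before = [1000.0, 10.0]
after = [100.0, 0.1]

ratios = [b / a for b, a in zip(before, after)]      # per-query speedups: 10x and 100x
mean_ratio = sum(ratios) / len(ratios)               # mean of ratios: 55x
total_ratio = sum(before) / sum(after)               # ratio of totals: about 10.1x
print(f"mean per-query speedup: {mean_ratio:.1f}x")
print(f"total-runtime speedup:  {total_ratio:.1f}x")
```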


Tez: Beyond MapReduce


Apache Tez

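• A general-purpose distributed processing engine for data-processing applications
  – Aimed at applications (frameworks), not end users
  – Hive on Tez, Pig on Tez, Cascading on Tez, …

• Built on lessons learned from MapReduce
  – Significant performance improvements
  – Batch and interactive
  – Petabyte scale

• Runs on YARN
  – Makes use of cluster resources

• DAG (directed acyclic graph)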


MapReduce & Tez

[Figure: Map – Reduce writes intermediate results to HDFS between each map/reduce stage; Tez runs the same work as one optimized pipeline]

• Intermediate data is not written to HDFS

• Topologies such as Map-Reduce-Reduce are possible

• Containers are reused across a session

• The pipeline is optimized across the whole job


What is DAG & Why DAG

• Directed Acyclic Graph
• However complex a DAG is, it can basically be decomposed into the following three patterns:
  – Sequential (Projection, Filter, GroupBy, …)
  – Merge (Join, Union, Intersect, …)
  – Divide (Split, …)
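To make the three patterns concrete, here is a small sketch in plain Python (not Tez code) that labels the vertices of a DAG given as an edge list. A vertex with two or more incoming edges is treated as a merge point, one with two or more outgoing edges as a divide point, and the rest as sequential. It also computes a topological order with Kahn's algorithm, i.e. an order in which a DAG engine could schedule the vertices, and rejects graphs with cycles. The vertex names are hypothetical.

```python
from collections import defaultdict, deque

def classify_vertices(edges):
    """Label each vertex as 'merge' (2+ inputs), 'divide' (2+ outputs), or 'sequential'."""
    indeg, outdeg, vertices = defaultdict(int), defaultdict(int), set()
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
        vertices.update((src, dst))
    labels = {}
    for v in vertices:
        kinds = []
        if indeg[v] >= 2:
            kinds.append("merge")
        if outdeg[v] >= 2:
            kinds.append("divide")
        labels[v] = kinds or ["sequential"]
    return labels

def topological_order(edges):
    """Kahn's algorithm; raises ValueError if the graph has a cycle (i.e. is not a DAG)."""
    succ, indeg, vertices = defaultdict(list), defaultdict(int), set()
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
        vertices.update((src, dst))
    ready = deque(sorted(v for v in vertices if indeg[v] == 0))
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:      # all inputs of w are done; w can be scheduled
                ready.append(w)
    if len(order) != len(vertices):
        raise ValueError("graph contains a cycle")
    return order

# Hypothetical query plan: two scans, a filter, a join, then a split into two aggregations.
edges = [("scan_a", "filter"), ("filter", "join"), ("scan_b", "join"),
         ("join", "split"), ("split", "agg1"), ("split", "agg2")]
print(classify_vertices(edges))
print(topological_order(edges))
```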


How Tez works, roughly

[Figure: each task is built from an Input, a Processor, and an Output]


Tez – Key benefits

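• Expressiveness of DAGs
  • Easier to express computation in DAG

• No intermediate data written to HDFS
  • Lower latency
  • Less load on the NameNode

• Tez sessions / container reuse
  • Less overhead from AM / task container allocation
  • Less load on the ResourceManager
  • Data reuse via the Object Registry (e.g. tables for MapJoin)
  • JIT optimization of the executing code

• Optimization across the whole DAG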


Tez - architecture

• Client
  – Starts session
  – Submits DAG

• Application Master
  – DAG Scheduler
  – Task Scheduler
  – Vertex Manager

• Tez Task Containers
  – Execution


ORC: Optimized Row Columnar


File formats used in Hadoop

• Text
• SequenceFile
• RCFile
  + Can read only the required columns
  + Compression on each column
  - Type-free binary blobs
  - No index
  - Compression by stream-based codec


ORCFile – Columnar storage for Hive

• High Compression
  – Type-specific compression applied per column
  – ZLIB or SNAPPY compression per stream

• High Performance
  – Indexes and metadata at the file, stripe, and row-group levels
  – Predicate Pushdown

• Flexible Data Model
  – Complex types (struct, list, map, union)
  – New types (datetime, decimal)


ORC at Facebook

• Saved more than 1,400 servers' worth of storage. (2)
• Compression ratio increased from 5x to 8x globally. (2)


ORC at Spotify

• 16x less HDFS read when using ORC versus Avro. (3)
• 32x less CPU when using ORC versus Avro. (3)


ORC at Yahoo!

• 6-50x speedup when using ORC versus TextFile. (4)
• 1.6-30x speedup when using ORC versus RCFile. (4)


ORCFile – File format



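Stripe: the contents of the file are split into large chunks, 256 MB by default.

File Footer: holds the location of each stripe, the schema, and the min/max/sum values of each column across the whole file.

Postscript: holds the file's compression format and the size of the compressed footer; custom metadata can also be stored. This is the only part a reader reads first.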



[Figure: ORC file layout
• Stripe 1 … Stripe N, each holding row groups (default: 10K records per row group) stored as streams: INDEX, BLOOM FILTER, DATA, LENGTH, DICTIONARY
• File-level statistics: Column1 (min, max, sum, hasNull), Column2, Column3, …, ColumnN; Postscript: Compression, FooterLength
• Stripe-level statistics (Stripe1 … StripeN): Column1 (min, max, sum, hasNull), Column2, Column3, …, ColumnN
• Row-group index per column (Column1 … ColumnN): RG1 (min, max, sum, hasNull, pos), …]



Compression

• Type-specific compression (Light-Weight Compression)
  – Applied per column
  – Always applied
  – RLE, Direct, Patched Base, Delta

• Generic stream compression (Generic Compression)
  – Applied uniformly across the whole file
  – In practice, applied to each stream and to the footer
  – Applied on top of the Light-Weight Compression above
  – NONE, ZLIB, SNAPPY, LZO
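The two layers can be illustrated with a toy sketch. This is not ORC's actual RLEv2 byte format (which also has Patched Base and packs bit widths far more cleverly); it only shows the idea: a type-aware light-weight encoding (here, delta followed by run-length) turns a regular integer column into a tiny description, and a generic byte codec (Python's `zlib`, standing in for ZLIB/SNAPPY/LZO) is then applied to the resulting stream. The column values and helper names are hypothetical.

```python
import zlib
from array import array
from itertools import accumulate

def delta_encode(values):
    """Keep the first value, then store each value's difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """A running sum restores the original values."""
    return list(accumulate(deltas))

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def run_length_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

def to_stream(ints):
    """Serialize integers as 64-bit signed values (a stand-in for an ORC stream)."""
    return array("q", ints).tobytes()

# A regular integer column, e.g. IDs or timestamps with a fixed step.
column = list(range(1_000_000, 1_050_000, 5))        # 10,000 values

# Light-weight, type-aware layer: delta, then run-length.
runs = run_length_encode(delta_encode(column))
print(runs)   # [(1000000, 1), (5, 9999)]

# Generic layer: a byte-level codec applied on top of the encoded stream.
raw = zlib.compress(to_stream(column))
encoded = zlib.compress(to_stream([x for run in runs for x in run]))
print(len(raw), len(encoded))
```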


High Compression


High Performance

File-level index

Stripe-level index

Row group-level index


Dumping ORC file information

orcfiledump
hive --service orcfiledump /apps/hive/warehouse/rankings/000045_0

To include the index information for each row group, specify --rowindex <column number>. Specifying 0 returns information for all columns.
hive --service orcfiledump --rowindex 1 /apps/hive/warehouse/rankings/000045_0


File Statistics

File Statistics:
  Column 0: count: 1620325 hasNull: false
  Column 1: count: 1620325 hasNull: false min: 1.0.100.215 max: 99.99.97.199 sum: 21531540
  Column 2: count: 1620325 hasNull: false min: … max: … sum: 88890214
  Column 3: count: 1620325 hasNull: false min: 1970-01-01 max: 2012-04-30
  Column 4: count: 1620325 hasNull: false min: …-8 max: … sum: 810757.3001111746
  Column 5: count: 1620325 hasNull: false min: … max: … sum: 85357610
  Column 6: count: 1620325 hasNull: false min: ALB max: ZAF sum: 4860975


Stripe Statistics

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1545000 hasNull: false
    Column 1: count: 1545000 hasNull: false min: 1.0.100.215 max: 99.99.97.199 sum: 20530443
    Column 2: count: 1545000 hasNull: false min: … max: … sum: 84763272
    Column 3: count: 1545000 hasNull: false min: 1970-01-01 max: 2012-04-30
    Column 4: count: 1545000 hasNull: false min: … max: … sum: 773016.625769496
    Column 5: count: 1545000 hasNull: false min: … max: … sum: 81385950
    Column 6: count: 1545000 hasNull: false min: ALB max: ZAF sum: 4635000


Row Group Indexes

Row group indices for column 1:
  Entry 0: count: 10000 hasNull: false min: 1.101.125.195 max: 99.98.152.204 sum: 132919 positions: 0,0,0,0,0
  Entry 1: count: 10000 hasNull: false min: 1.104.147.167 max: 99.85.51.213 sum: 132976 positions: 0,132919,0,6119,52
  Entry 2: count: 10000 hasNull: false min: 1.1.228.147 max: 99.88.166.75 sum: 132826 positions: 120403,3751,0,12339,3
  Entry 3: count: 10000 hasNull: false min: 1.104.90.89 max: 99.96.30.136 sum: 132853 positions: 120403,136577,0,18482,4
  Entry 4: count: 10000 hasNull: false min: 1.11.252.134 max: 99.71.248.30 sum: 132856 positions: 240743,7286,0,24600,2
  Entry 5: count: 10000 hasNull: false min: 1.119.19.221 max: 99.96.184.74 sum: 132977 positions: 240743,140142,0,30713,8
  Entry 6: count: 10000 hasNull: false min: 1.1.244.95 max: 99.99.242.168 sum: 132735 positions: 360961,10975,0,36946,1
  Entry 7: count: 10000 hasNull: false min: 1.1.146.20 max: 99.93.105.159 sum: 132869 positions: 360961,143710,0,43145,2


SARG & Predicate Pushdown

• SARG: Search ARGument

• SELECT COUNT(*) FROM CUSTOMER WHERE CUSTOMER.state = 'CA';

• For a query like the one above, the RecordReader reads from storage only those ORC files, stripes, and row groups whose index statistics indicate they may match the WHERE clause.
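A minimal sketch of the idea, assuming a toy in-memory model (the classes and `count_equal` below are hypothetical, not ORC's reader API): statistics are checked at file, stripe, and row-group level, and only row groups whose [min, max] range can contain the value are actually scanned. Rows inside a surviving row group still have to be checked, because min/max statistics can only rule data out, never confirm a match.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ColumnStats:
    minimum: str
    maximum: str

@dataclass
class RowGroup:
    stats: ColumnStats
    rows: List[str]

@dataclass
class Stripe:
    stats: ColumnStats
    row_groups: List[RowGroup]

def may_contain(stats, value):
    # An equality predicate can only match if the value lies inside [min, max].
    return stats.minimum <= value <= stats.maximum

def count_equal(file_stats, stripes, value):
    """COUNT(*) WHERE col = value, skipping data whose min/max rules it out.

    Returns (count, row_groups_read) so the pruning effect is visible.
    """
    if not may_contain(file_stats, value):
        return 0, 0                                   # whole file skipped
    count, groups_read = 0, 0
    for stripe in stripes:
        if not may_contain(stripe.stats, value):
            continue                                  # whole stripe skipped
        for rg in stripe.row_groups:
            if not may_contain(rg.stats, value):
                continue                              # row group skipped
            groups_read += 1
            count += sum(1 for v in rg.rows if v == value)   # surviving rows still checked
    return count, groups_read

# Hypothetical `state` column split into two stripes and three row groups.
file_stats = ColumnStats("AK", "WA")
stripes = [
    Stripe(ColumnStats("AK", "CO"), [
        RowGroup(ColumnStats("AK", "AZ"), ["AK", "AL", "AZ"]),
        RowGroup(ColumnStats("CA", "CO"), ["CA", "CA", "CO"]),
    ]),
    Stripe(ColumnStats("NY", "WA"), [
        RowGroup(ColumnStats("NY", "WA"), ["NY", "TX", "WA"]),
    ]),
]
print(count_equal(file_stats, stripes, "CA"))   # (2, 1): only one row group is scanned
```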


Bloom Filter Index

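[Figure: a bit array with m = 10 bits and k = 3 hash functions; elements x, y, z are inserted and w is tested]

For an array of m bits, each input value is hashed with k hash functions and the resulting bit positions are set to 1.

To check a value, hash it k times. If all of the resulting bits are 1, the value may be present in the indexed data; otherwise it is definitely absent, so the data can be skipped. A "present" result can be a false positive.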
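A minimal Python sketch of the structure described above, with m = 10 and k = 3 as in the figure. Deriving the k hash functions by salting SHA-256 with the function index is an assumption made for illustration; ORC's own Bloom filter uses different hash functions and sizes the bit array from the expected number of entries and a configurable false-positive probability.

```python
import hashlib

class BloomFilter:
    """Bit array of m bits with k hash functions (m=10, k=3 as in the figure)."""

    def __init__(self, m=10, k=3):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, value):
        # Illustrative choice: derive k positions by salting one hash with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # All k bits set  -> "maybe present" (false positives are possible).
        # Any bit clear   -> definitely absent, so the data can be skipped.
        return all(self.bits[pos] for pos in self._positions(value))

bf = BloomFilter(m=10, k=3)
for value in ("x", "y", "z"):
    bf.add(value)
print(bf.bits)
print(bf.might_contain("w"))   # False means "skip"; True may be a false positive
```

Note the asymmetry: an inserted value can never be reported absent, which is what makes skipping safe.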


Bloom Filter Indexes Improvements

Rows read for: select * from tpch_1000.lineitem where l_orderkey = 1212000001; (log scale – smaller is better)
• No indexes: 5,999,989,709
• Min-max indexes: 540,000
• Bloom filter indexes: 10,000


Bloom Filter Indexes Improvements

Time taken (seconds) for: select * from tpch_1000.lineitem where l_orderkey = 1212000001; (smaller is better)
• No indexes: 74
• Min-max indexes: 4.5 (~16x improvement)
• Bloom filter indexes: 1.34 (~3.3x improvement over min-max)


ORCFile – Example table definition

• Defined per table or per partition
• Choice of compression codec

create table Addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
) stored as orc tblproperties ("orc.compress"="ZLIB");


ORCFile – Converting text to ORC

• There is no reason not to use ORC
• A single SQL statement converts text to ORC

-- Create Text & ORC tables (the text table declares ',' as its delimiter to match the CSV input)
CREATE TABLE test_details_txt (
  visit_id INT,
  store_id SMALLINT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE test_details_orc (
  visit_id INT,
  store_id SMALLINT
) STORED AS ORC;

-- Load into Text table
LOAD DATA LOCAL INPATH '/home/user/test_details.csv' INTO TABLE test_details_txt;

-- Copy to ORC table
INSERT OVERWRITE TABLE test_details_orc SELECT * FROM test_details_txt;


Vectorized Query Execution: Process 1024 Rows at a Time


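Vectorization – Vectorized SQL engine

• What it does:
  – Processes 1024 rows at a time instead of one row at a time
  – Takes advantage of modern hardware architectures

• Benefits:
  – Large queries run up to 3x faster
  – Reduces CPU time, making better use of cluster resources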
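The difference between the two execution styles can be sketched as follows. Python will not reproduce Hive's speedup (that comes from tight, JIT-friendly Java loops over primitive arrays); the point is only the shape of the computation: an operator works on a column slice of up to 1024 values plus a selection vector of qualifying positions, similar in spirit to the `selected` array in Hive's `VectorizedRowBatch`, instead of pulling one row at a time. The function names and data are hypothetical.

```python
BATCH_SIZE = 1024   # rows per batch, matching the 1024-row batches described above

def filter_sum_row_at_a_time(rows, threshold):
    """SUM(value) WHERE value > threshold, one (key, value) row at a time."""
    total = 0
    for row in rows:
        if row[1] > threshold:
            total += row[1]
    return total

def filter_sum_vectorized(values, threshold):
    """Same query over a single column, one batch at a time.

    The filter produces a selection vector (positions that passed);
    the aggregate then reads only those positions of the batch.
    """
    total = 0
    for start in range(0, len(values), BATCH_SIZE):
        batch = values[start:start + BATCH_SIZE]                       # column slice
        selected = [i for i, v in enumerate(batch) if v > threshold]   # selection vector
        total += sum(batch[i] for i in selected)
    return total

rows = [(i, i % 100) for i in range(10_000)]   # (key, value) rows
values = [v for _, v in rows]                   # the same data as one column
print(filter_sum_row_at_a_time(rows, 50) == filter_sum_vectorized(values, 50))   # True
```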


Column Store Layout

Table
     A    B
1    A1   B1
2    A2   B2

Row Store: [1: A1, B1] [2: A2, B2]

Column Store: [A: A1, A2] [B: B1, B2]


Column Store Characteristics

Row Store
• TextFile, SequenceFile, Avro
• Slower read performance
• Reads all columns
• Lower compression ratio
• Higher local cardinality

Column Store
• RCFile, Parquet, ORC
• Faster read performance
• Reads needed columns only
• Higher compression ratio
• Lower local cardinality
• Room for further optimization
  • Vectorization
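A small sketch of why the two layouts behave differently, using a hypothetical three-column table. Laying the same table out row by row interleaves unrelated columns, so any fixed-size block of the layout contains many distinct values; laying it out column by column keeps similar values together, which is one reason columnar files compress better and why a query can read just the columns it needs. "Average distinct values per block" is only a rough proxy for local cardinality chosen for this illustration.

```python
def to_row_store(table):
    """Lay values out row by row: 1, A1, B1, 2, A2, B2, ..."""
    return [value for row in table for value in row]

def to_column_store(table, names):
    """Lay values out column by column: {name: [values of that column]}."""
    return {name: [row[i] for row in table] for i, name in enumerate(names)}

def avg_distinct_per_block(values, block):
    """Average number of distinct values per fixed-size block of a layout."""
    counts = [len(set(values[i:i + block])) for i in range(0, len(values), block)]
    return sum(counts) / len(counts)

# Hypothetical table: (id, state, status) with low-cardinality state/status columns.
table = [(i, "CA" if i < 500 else "NY", "active") for i in range(1000)]
rows = to_row_store(table)
cols = to_column_store(table, ["id", "state", "status"])

# A query that needs only `state` touches 1,000 values in the column store
# but has to step over all 3,000 values in the row store.
print(len(cols["state"]), len(rows))                 # 1000 3000
print(avg_distinct_per_block(rows, 30))              # mixed columns: many distinct values
print(avg_distinct_per_block(cols["state"], 30))     # one column: close to 1
```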


Hive Vectorization 2014

Rewriting Hive execution engine for performance
• No method calls
• Low instruction count
• Cache locality to 1,024 values
• No pipeline stalls
• SIMD in Java 8
But not excellent without SIMD

set hive.vectorized.execution.enabled = true;

J. Sompolski, M. Zukowski, P. Boncz. Vectorization vs. Compilation in Query Execution. 2011


Cost Based Optimizer


Cost Based Optimizer

• Uses Apache Calcite
• What does it do?
  – Ordering joins
  – Bushy join trees
  – Converting join algorithms

• Paper: https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive

• Anatomy: http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/


Expression tree

SELECT p."product_name", COUNT(*) AS c
FROM "splunk"."splunk" AS s
  JOIN "mysql"."products" AS p
    ON s."product_id" = p."product_id"
WHERE s."action" = 'purchase'
GROUP BY p."product_name"
ORDER BY c DESC

[Figure: operator tree, top to bottom: sort (key: c DESC) → group (key: product_name, agg: count) → filter (condition: action = 'purchase') → join (key: product_id) → scan (table: splunk, in Splunk) and scan (table: products, in MySQL)]


Expression tree (optimized)

[Figure: the same query and operators, with the filter (action = 'purchase') pushed below the join into Splunk, directly above scan (table: splunk); scan (table: products) remains in MySQL]


Query preparation – Hive 0.13

[Figure: Hive SQL → SQL parser → Abstract Syntax Tree (AST) → Semantic analyzer → Annotated AST → Logical Optimizer → Plan → Physical Optimizer → Tuned Plan → Tez]


Query preparation – Hive 0.14

[Figure: the Hive 0.13 pipeline (SQL parser, Semantic analyzer, Logical Optimizer, Physical Optimizer, Tuned Plan, Tez) extended with two new steps, "Translate to algebra" and "Optiq optimizer", which produce an AST with optimized join-ordering]


Star schema

[Figure: star schema with two fact tables (Sales, Inventory) and dimension tables (Time, Product, Customer, Warehouse); each fact table has a many-to-one relationship to its dimension tables]


Query combining two stars

SELECT product.id, sum(sales.units), sum(inventory.on_hand)
FROM sales
JOIN customer ON …
JOIN time ON …
JOIN product ON …
JOIN inventory ON …
JOIN warehouse ON …
WHERE time.year = 2014
AND time.quarter = 'Q1'
AND product.color = 'Red'
AND warehouse.state = 'WA'
GROUP BY …


Left-deep tree

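"Left-deep" tree: all joins are performed serially. Join order is considered, but the shape of the tree is not.

A typical plan:
• Start with the largest table at the bottom left
• Apply the most selective joins first

[Figure: left-deep join tree over Sales, Customer, Time, Product, Inventory, Warehouse]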


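Bushy tree (bush: a shrub or thicket)

• Places no restriction on where joins are performed
• "Bushes" are formed by each fact table (Sales and Inventory) and its related dimension tables
• The dimension tables act as filters
• As a result, fewer rows are read and less data is exchanged over the network

[Figure: bushy join tree over Sales, Customer, Time, Product, Inventory, Warehouse]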


Cost variables

• Hr – The cost of reading 1 byte from HDFS, in nanoseconds.
• Hw – The cost of writing 1 byte to HDFS, in nanoseconds.
• Lr – The cost of reading 1 byte from the local FS, in nanoseconds.
• Lw – The cost of writing 1 byte to the local FS, in nanoseconds.
• NEt – The average cost of transferring 1 byte over the network in the Hadoop cluster from any node to any node, in nanoseconds.

• T(R) – The number of tuples in relation R.
• Tsz – The average size of a tuple in the relation.
• V(R, a) – The number of distinct values of attribute a in relation R.
• CPUc – The CPU cost of a comparison, in nanoseconds.


Assumed values

• CPUc = 1 nanosecond
• NEt = 150 * CPUc nanoseconds
• Lw = 4 * NEt
• Lr = 4 * NEt
• Hw = 10 * Lw
• Hr = 1.5 * Lr
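Working the assumed values through gives concrete per-byte costs: NEt = 150 ns, Lw = Lr = 600 ns, Hw = 6000 ns, and Hr = 900 ns. The sketch below computes these and then prices a hypothetical relation (1,000,000 tuples of 100 bytes) as bytes × per-byte cost. That simple "T(R) × Tsz × unit cost" product is an illustration built from the definitions above, not Hive's full join cost formulas.

```python
# Assumed unit costs from the slide: CPUc per comparison, the rest per byte (ns).
CPUc = 1.0
NEt = 150 * CPUc      # network transfer
Lw = 4 * NEt          # local FS write
Lr = 4 * NEt          # local FS read
Hw = 10 * Lw          # HDFS write
Hr = 1.5 * Lr         # HDFS read

def hdfs_read_cost_ns(num_tuples, avg_tuple_size):
    """Reading relation R from HDFS: T(R) * Tsz bytes at Hr ns per byte."""
    return num_tuples * avg_tuple_size * Hr

def network_transfer_cost_ns(num_tuples, avg_tuple_size):
    """Shipping R across the network once: T(R) * Tsz bytes at NEt ns per byte."""
    return num_tuples * avg_tuple_size * NEt

print(NEt, Lw, Lr, Hw, Hr)   # 150.0 600.0 600.0 6000.0 900.0

# Hypothetical relation: 1,000,000 tuples of 100 bytes each (100 MB).
print(hdfs_read_cost_ns(1_000_000, 100) / 1e9, "s to read from HDFS")           # 90.0
print(network_transfer_cost_ns(1_000_000, 100) / 1e9, "s to ship over network")  # 15.0
```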


Profile Hive queries

• hive.tez.exec.print.summary=true

[Screenshot: the query execution summary, annotated "← this is where the work is being done"]


LLAP: Live Long And Process
Challenge for Sub-Second


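What is LLAP?
• A long-running (resident) process for executing Hive work
  • Lower task startup cost
  • The JIT optimizer is more effective
• Thread-based executors rather than separate processes
  • Metadata, MapJoin tables, and so on can be shared between tasks
• Asynchronous IO and caching
• Query fragment API

[Figure: on each node, an LLAP process with a cache runs query fragments and reads from HDFS]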


What LLAP isn't
• Not a Hive execution engine (like Tez, MR, Spark…)
  • Execution engines take care of composing the processing, etc.
• Not a storage layer
  • LLAP daemons are stateless; they use HDFS as the source of truth for data
• Does not supersede existing Hive
  • Container-based execution will continue to evolve as well


Example execution: MR vs Tez vs Tez+LLAP

[Figure: the same query executed three ways. Map – Reduce: intermediate results in HDFS. Tez: optimized pipeline. Tez with LLAP: resident process on nodes, with an in-memory columnar cache in front of HDFS; map tasks read HDFS]


LLAP in your cluster
• LLAP daemons run on YARN
• Apache Slider provisions and recovers the daemon containers
• Resource management via YARN delegation model (WIP)
• LLAP and containers dynamically balance resource usage (WIP)


Query execution

Tez + LLAP – overview
• DAG-based composition of the processing is used unchanged, and so is the Tez runtime.
• Fragments/tasks can run in LLAP, in regular containers, or inside the AM.
• Where they run is decided by the Hive client
  • Configurable – all in LLAP, none in LLAP, intelligent mix
• Policy for assigning tasks to LLAP (in auto mode)
  • No user code (or only blessed user code)
  • Data source – HDFS
  • ORC and vectorized execution (for now)
  • Others can still run in LLAP in "all" mode, w/o IO elevator and cache
  • Data size limitations (avoid heavy / long running processing within LLAP)
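The placement rules above can be read as a small decision function. The sketch below is hypothetical throughout: `Fragment`, `place`, the mode strings, and the `max_llap_bytes` threshold are stand-ins invented for illustration, not Hive's configuration names or its actual decision logic. It simply encodes the slide's bullets: "none" never uses LLAP, "all" always does, and "auto" sends a fragment to LLAP only if it has no user code, reads from HDFS, uses ORC with vectorized execution, and stays under a size limit.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    """Hypothetical summary of one task/fragment, for illustration only."""
    name: str
    has_user_code: bool     # e.g. a UDF that is not on an approved ("blessed") list
    source: str             # e.g. "hdfs"
    file_format: str        # e.g. "orc"
    vectorized: bool
    input_bytes: int

def place(fragment, mode, max_llap_bytes):
    """Decide where a fragment runs: 'llap' or 'container'.

    mode is 'none', 'all', or 'auto'; max_llap_bytes is a hypothetical
    threshold standing in for the 'data size limitations' bullet.
    """
    if mode == "none":
        return "container"
    if mode == "all":
        return "llap"       # runs in LLAP even without the IO elevator and cache
    if mode != "auto":
        raise ValueError(f"unknown mode: {mode}")
    eligible = (
        not fragment.has_user_code
        and fragment.source == "hdfs"
        and fragment.file_format == "orc"
        and fragment.vectorized
        and fragment.input_bytes <= max_llap_bytes
    )
    return "llap" if eligible else "container"

scan = Fragment("Map 1", False, "hdfs", "orc", True, 64 * 1024 * 1024)
udf = Fragment("Map 2", True, "hdfs", "orc", True, 64 * 1024 * 1024)
limit = 1024 ** 3   # hypothetical 1 GiB cap
print(place(scan, "auto", limit), place(udf, "auto", limit), place(udf, "all", limit))
# llap container llap
```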


So…

[Figure: the same DAG under three configurations, each coordinated by a Tez AM: Tez (all tasks in containers); Tez with LLAP (auto: a mix of LLAP and container execution); Tez with LLAP (all: every task in LLAP)]


Scheduling for LLAP in Tez AM
• Greedy scheduling per query
  • Scheduling assumes the entire cluster is available
• Schedule work to preferred location (HDFS locality)
  • Across multiple queries that access the same data, preferred-location settings let their tasks run on the same daemon


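Queuing fragments
• LLAP daemons execute tasks/fragments using a thread pool
• Each daemon has an internal queue, with a pluggable prioritization mechanism

[Figure: an LLAP daemon whose executors run Q1 Reducer 2, Q1 Map 1, Q1 Map 1, and Q3 Map 19, while further fragments (Q1 Reducer 2, Q1 Map 1, Q3 Map 19, Q1 Reducer 2) wait in its queue]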


LLAP Scheduling – pipelining and preemption
• A fragment can start running before all of its input data is available
• Once all of its input data is available, it is flagged as "finishable"
• A fragment does not release its executor until it becomes finishable
• Non-finishable fragments are preempted

[Figure: an LLAP daemon's executors are occupied by a wide query's reduce fragments ("Well, 10 mappers out of 100 are done!") while an interactive query's map 1/3 waits in the queue; the non-finishable reduce fragment is preempted so that the interactive map can run]
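The queuing and preemption behaviour can be modelled in a few dozen lines. This is a toy sketch, not LLAP's scheduler (which is Java and considers more than this); the class and fragment names are hypothetical. It assumes one possible pluggable priority: finishable fragments before non-finishable ones, then arrival order. When every executor is busy and a finishable fragment is waiting, a running non-finishable fragment is preempted back into the queue, which is the scenario in the figure: a reducer waiting on upstream mappers is not finishable, while an interactive map that reads its input directly is finishable immediately.

```python
import heapq

class LlapDaemonSketch:
    """Toy model of one LLAP daemon: a fixed pool of executors plus a wait queue."""

    def __init__(self, num_executors):
        self.num_executors = num_executors
        self.running = {}       # name -> None; a dict used as an insertion-ordered set
        self.queue = []         # min-heap of (not finishable, arrival, name)
        self.finishable = {}    # name -> True once all inputs are available
        self.arrival = {}       # name -> submission order, used as a FIFO tie-breaker

    def _key(self, name):
        # False sorts before True, so finishable fragments come out first.
        return (not self.finishable[name], self.arrival[name], name)

    def submit(self, name, finishable):
        self.finishable[name] = finishable
        self.arrival[name] = len(self.arrival)
        heapq.heappush(self.queue, self._key(name))
        self._schedule()

    def mark_finishable(self, name):
        """All inputs of `name` have arrived."""
        self.finishable[name] = True
        # Recompute keys in case the fragment is still waiting in the queue.
        self.queue = [self._key(n) for _, _, n in self.queue]
        heapq.heapify(self.queue)
        self._schedule()

    def complete(self, name):
        del self.running[name]
        self._schedule()

    def _schedule(self):
        # 1. Fill idle executors in priority order.
        while self.queue and len(self.running) < self.num_executors:
            _, _, name = heapq.heappop(self.queue)
            self.running[name] = None
        # 2. Preemption: a waiting finishable fragment displaces a running
        #    non-finishable one, which goes back into the queue.
        while self.queue and not self.queue[0][0]:
            victim = next((n for n in self.running if not self.finishable[n]), None)
            if victim is None:
                break
            del self.running[victim]
            heapq.heappush(self.queue, self._key(victim))
            _, _, name = heapq.heappop(self.queue)
            self.running[name] = None

daemon = LlapDaemonSketch(num_executors=2)
daemon.submit("wide-query reduce 1", finishable=False)   # waiting on upstream mappers
daemon.submit("wide-query reduce 2", finishable=False)
daemon.submit("interactive map 1", finishable=True)       # reads its input directly
print(list(daemon.running))   # ['wide-query reduce 2', 'interactive map 1']
```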


IO elevator and other internals


Asynchronous IO
• In Hive until now, IO was performed synchronously
• Data compression and decompression were also synchronous


Asynchronous IO
• In LLAP, IO elevator threads perform disk IO, decompression, and similar work asynchronously
• IO threads can be spindle aware (WIP)
• Depending on workload, IO and processing threads can balance resource usage (throttle IO, etc.) (WIP)
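A minimal sketch of the producer/consumer split, using Python threads and a bounded queue. It is an analogy rather than LLAP's implementation (which is Java, reads ORC from HDFS, and caches decoded data): here a hypothetical "IO elevator" thread decompresses `zlib` chunks while the calling thread processes earlier chunks. The bounded queue also shows one simple way to throttle IO when processing falls behind, echoing the WIP bullet above.

```python
import queue
import threading
import zlib

_END = object()   # sentinel: no more chunks

def io_elevator(compressed_chunks, out_q):
    """IO thread: fetch and decompress each chunk, then hand it to the processor."""
    for chunk in compressed_chunks:
        out_q.put(zlib.decompress(chunk))   # blocks when the queue is full (throttling)
    out_q.put(_END)

def process(in_q):
    """Processing side: consume decoded chunks as soon as they are ready."""
    rows = 0
    while True:
        data = in_q.get()
        if data is _END:
            return rows
        rows += data.count(b"\n")   # stand-in for real query work: count rows

def run(compressed_chunks, prefetch=4):
    buf = queue.Queue(maxsize=prefetch)   # bounded: at most `prefetch` decoded chunks wait
    io_thread = threading.Thread(target=io_elevator, args=(compressed_chunks, buf))
    io_thread.start()
    rows = process(buf)                   # runs on the calling thread, overlapping with IO
    io_thread.join()
    return rows

chunks = [zlib.compress(b"row\n" * 1000) for _ in range(10)]
print(run(chunks))   # 10000
```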


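Caching and off-heap data
• Decompressed data is cached off-heap
  • So that the cache does not have to worry about GC
• Eliminates HDFS IO and decompression costs; especially effective for dimension tables
• Pluggable eviction policy
  • Currently supports FIFO and LRFU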
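LRFU (Least Recently/Frequently Used) blends recency and frequency with a single decay parameter: each entry's score is the sum of F(now - t) over its past accesses t, with F(x) = 0.5 ** (lam * x). With `lam` near 0 it behaves like LFU (pure access counts); with larger `lam` recent accesses dominate and it approaches LRU. A minimal in-memory sketch of the scoring rule follows; it does not model LLAP's off-heap cache, and the capacity counted in entries (rather than bytes) and the `lam` values are assumptions for illustration.

```python
class LrfuCache:
    """LRFU eviction with score(key) = sum of 0.5 ** (lam * (now - t)) over accesses t."""

    def __init__(self, capacity, lam=0.1):
        self.capacity = capacity
        self.lam = lam
        self.clock = 0
        self.entries = {}    # key -> [value, score_at_last_access, last_access_time]

    def _decay(self, age):
        return 0.5 ** (self.lam * age)

    def _score(self, key):
        _, crf, last = self.entries[key]
        return crf * self._decay(self.clock - last)   # decay the stored score to "now"

    def _touch(self, entry):
        # Incremental update: new score = F(0) + F(age) * old score, with F(0) = 1.
        entry[1] = 1.0 + self._decay(self.clock - entry[2]) * entry[1]
        entry[2] = self.clock

    def get(self, key):
        self.clock += 1
        entry = self.entries.get(key)
        if entry is None:
            return None
        self._touch(entry)
        return entry[0]

    def put(self, key, value):
        """Insert or update; returns the evicted key, if any."""
        self.clock += 1
        if key in self.entries:
            self.entries[key][0] = value
            self._touch(self.entries[key])
            return None
        evicted = None
        if len(self.entries) >= self.capacity:
            evicted = min(self.entries, key=self._score)   # lowest recency/frequency score
            del self.entries[evicted]
        self.entries[key] = [value, 1.0, self.clock]
        return evicted

# lam close to 0: frequency dominates, so the often-read "a" survives.
lfu_like = LrfuCache(capacity=2, lam=0.01)
lfu_like.put("a", 1)
lfu_like.get("a")
lfu_like.get("a")
lfu_like.put("b", 2)
print(lfu_like.put("c", 3))   # b

# lam = 1: recency dominates, so the older "a" is evicted despite its extra reads.
lru_like = LrfuCache(capacity=2, lam=1.0)
lru_like.put("a", 1)
lru_like.get("a")
lru_like.get("a")
lru_like.put("b", 2)
print(lru_like.put("c", 3))   # a
```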


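Other benefits
• File metadata and indexes are cached too
  • Faster predicate pushdown
• Hash tables for MapJoin and fragment execution plans are shared within the JVM
  • Lower cost of deserializing the execution plan for each task/fragment
• Better use of JIT optimizer
  • Because the daemon keeps running, the JIT has more time to do its work
  • Especially good with vectorization!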


Summary


Sub-second: aiming for sub-second response times on short queries





