Managing multi tenant resource toward Hive 2.0

Transcript
Page 1: Managing multi tenant resource toward Hive 2.0

Managing multi tenant resource toward Hive 2.0

Kai Sasaki Treasure Data Inc.

Page 2: Managing multi tenant resource toward Hive 2.0

About Me

• Kai Sasaki (佐々木 海)

• @Lewuathe (Twitter)

• Software Engineer at Treasure Data Inc.

• Maintaining and developing Hadoop/Presto infrastructure

Page 3: Managing multi tenant resource toward Hive 2.0

Topic

• Treasure Data infrastructure

• Hive 2.0 change

• Migration architecture

• Resource management for multi tenancy

• Performance comparison

Page 4: Managing multi tenant resource toward Hive 2.0

• Live Data Management Platform

• Original creator of Fluentd/Embulk/Digdag

• 70+ integrations with

• BI tools

• Mobile/IoT

• Cloud Storage

• and more

Page 5: Managing multi tenant resource toward Hive 2.0
Page 6: Managing multi tenant resource toward Hive 2.0

• Hive/Pig/Presto data processing interface

• 40000+ Hive queries / day

• 130000+ Presto queries / day

• Plazma Cloud Storage

• 450000+ records/sec imported

Page 7: Managing multi tenant resource toward Hive 2.0

Hive 1.x Hive 2.x

Any change?

Page 8: Managing multi tenant resource toward Hive 2.0

Hive 2.0

• Includes major new features

• Fixed 600+ bugs

• 140+ improvements or new features

• Backward compatible as much as possible

• Hive 1.x stable line

• 2.1.0 has been available since June 20th, 2016

http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale

Page 9: Managing multi tenant resource toward Hive 2.0

Hive 2.0

• HPLSQL

• LLAP

• HBase metastore

• Improvements to Hive on Spark

• CBO improvements

http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale

Page 10: Managing multi tenant resource toward Hive 2.0

HPLSQL

• Procedural SQL like Oracle’s PL/SQL

• Cursor

• loops (WHILE, FOR, LOOP)

• branches (IF)

• External library which communicates through JDBC

• http://www.hplsql.org/doc

http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale

Page 11: Managing multi tenant resource toward Hive 2.0

LLAP

• Sub-second Queries in Hive

• Save JVM container launch time

• Data caching

• Suited to ad hoc or interactive use cases

• Beta in 2.0

http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png

Page 12: Managing multi tenant resource toward Hive 2.0

LLAP

• Sub-second Queries in Hive

http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png

Page 13: Managing multi tenant resource toward Hive 2.0

HBase metastore

• Use HBase as the metastore of Hive

• Fetching thousands of partitions

• Limitation of concurrent connections

• Will support transactions with Apache Omid

• Alpha in Hive 2.0

http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png

Page 14: Managing multi tenant resource toward Hive 2.0

Many fixes and

Cutting edge features

Page 15: Managing multi tenant resource toward Hive 2.0

That’s all?

• Operation cost of migration

• Managing multiple clusters

• Testing and verifying multiple packages

• Differences in configuration and parameters

Page 16: Managing multi tenant resource toward Hive 2.0

That’s all?

• Operation cost of migration

• Managing multiple clusters

• Testing and verifying multiple packages

• Differences in configuration and parameters

• Need to reduce operation cost at the same time

Page 17: Managing multi tenant resource toward Hive 2.0

Now migration

Page 18: Managing multi tenant resource toward Hive 2.0

Challenge

• NO DOWNTIME

• NO HARMFUL OPERATION

• Change package easily

• Separate from other components (microservices)

• NO DEGRADATION

• Automatic query test and validation

Page 19: Managing multi tenant resource toward Hive 2.0

NO DOWNTIME

• Hadoop cluster Blue-Green deployment

• Reliable queue system separated from Hadoop

→ PerfectQueue

• Reliable storage system separated from Hadoop

→ Plazma
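
A minimal sketch of why an external queue makes a Blue-Green switch possible without downtime. The `Cluster` class, the job names, and the single `accepting` flag are illustrative assumptions, not Treasure Data's implementation: because clusters pull work from a queue that lives outside Hadoop, the old cluster can simply stop pulling while the new one keeps draining the queue.

```python
from collections import deque


class Cluster:
    """Hypothetical stand-in for one Hadoop cluster (names are illustrative)."""

    def __init__(self, version):
        self.version = version
        self.accepting = True   # whether this cluster still pulls new jobs
        self.completed = []

    def pull_and_run(self, queue):
        # Pull-based: a cluster takes work only while it is accepting jobs.
        if not self.accepting or not queue:
            return None
        job = queue.popleft()
        self.completed.append(job)
        return job


def blue_green_switch(queue, blue, green):
    """Drain the shared queue while moving traffic from blue to green."""
    # Phase 1: both clusters pull from the same external queue.
    blue.pull_and_run(queue)
    green.pull_and_run(queue)
    # Phase 2: stop blue from pulling. Queued jobs are untouched because
    # the queue lives outside both clusters.
    blue.accepting = False
    while queue:
        green.pull_and_run(queue)


jobs = deque(f"job-{i}" for i in range(5))
v1, v2 = Cluster("v1"), Cluster("v2")
blue_green_switch(jobs, v1, v2)
print(v1.completed, v2.completed)
```

In this toy run the old cluster finishes the one job it had already pulled and the new cluster handles the rest, so no queued job is lost during the switch.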

Page 20: Managing multi tenant resource toward Hive 2.0

PerfectQueue

• Distributed queue built on top of RDBMS

• At-least-once semantics

• Graceful and live restarting

• State consistency by transaction

• https://github.com/treasure-data/perfectqueue
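
The sketch below illustrates how at-least-once delivery can be built on a relational table. It is not PerfectQueue's schema or API: the table layout, the function names, and the lease length are assumptions, and SQLite stands in for the RDBMS. A worker leases a task by pushing its `visible_at` timestamp into the future; if the worker dies before calling `finish`, the lease expires and the task is delivered again.

```python
import sqlite3


def create_queue(conn):
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, payload TEXT, "
        "visible_at REAL NOT NULL, finished INTEGER NOT NULL DEFAULT 0)"
    )


def submit(conn, task_id, payload, now):
    with conn:  # one transaction per state change
        conn.execute(
            "INSERT INTO tasks (id, payload, visible_at) VALUES (?, ?, ?)",
            (task_id, payload, now),
        )


def acquire(conn, now, lease_sec=60.0):
    """Lease the oldest visible task and hide it until the lease expires."""
    # Single-connection demo. With concurrent workers on a server RDBMS the
    # SELECT would also need row locking (for example SELECT ... FOR UPDATE).
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM tasks "
            "WHERE finished = 0 AND visible_at <= ? "
            "ORDER BY visible_at, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET visible_at = ? WHERE id = ?",
            (now + lease_sec, row[0]),
        )
        return row


def finish(conn, task_id):
    with conn:
        conn.execute("UPDATE tasks SET finished = 1 WHERE id = ?", (task_id,))


conn = sqlite3.connect(":memory:")
create_queue(conn)
submit(conn, "q1", "SELECT COUNT(1) FROM t", now=0.0)
first = acquire(conn, now=0.0)     # worker A leases the task
hidden = acquire(conn, now=10.0)   # lease still active: nothing to take
retry = acquire(conn, now=61.0)    # worker A never finished: redelivered
finish(conn, retry[0])
done = acquire(conn, now=200.0)    # finished tasks are not redelivered
```

Redelivery after a lost lease means a task may run more than once, which is why consumers of such a queue need idempotent or transactional result handling.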

Page 21: Managing multi tenant resource toward Hive 2.0

Plazma

• Distributed cloud-based storage

• PostgreSQL + S3/Riak CS

• Enable time-index push down for Hive/Pig/Presto

• Column-oriented IO (mpc1)

• Data consistency with transactional API
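
The slides do not describe Plazma's internals, so the following is only a sketch of what time-index push down generally means: partition metadata records the time range each file covers, and a query with a time predicate reads only the overlapping files. The `Partition` fields and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    path: str        # hypothetical object key in S3 / Riak CS
    min_time: int    # earliest record timestamp (unix seconds)
    max_time: int    # latest record timestamp (unix seconds)


def prune_by_time(partitions, start, end):
    """Return partitions whose [min_time, max_time] overlaps [start, end)."""
    return [p for p in partitions if p.max_time >= start and p.min_time < end]


# Metadata index, which in this sketch would live in the relational store.
index = [
    Partition("part-0000", 0, 3599),
    Partition("part-0001", 3600, 7199),
    Partition("part-0002", 7200, 10799),
]
selected = prune_by_time(index, start=3600, end=7200)
print([p.path for p in selected])
```

Only the second partition overlaps the requested hour, so the query engine would fetch one object instead of three.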

Page 22: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, Plazma. Arrow: request]

Page 23: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, Plazma. Arrows: request, pull, submit]

Page 24: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, Plazma. Arrows: request, pull, submit, fetch]

Page 25: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, Plazma. Arrows: request, pull, submit, fetch. Annotation: disposable components]

Page 26: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, Plazma. Arrows: request, pull, submit, fetch. Labels: v1, v2]

Page 27: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, Plazma. Arrows: request, pull, submit, fetch. Labels: v1, v2]

Page 28: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, Plazma. Arrows: request, pull, submit, fetch. Label: v2]

Page 29: Managing multi tenant resource toward Hive 2.0

NO HARMFUL OPS

• Automatic package version upgrade

• Chef server specifies the version

• Hadoop package repository

• S3 remote package repository

• Hadoop as a REST service

• elephant-server

Page 30: Managing multi tenant resource toward Hive 2.0

elephant-server

• Hadoop as REST service

• Pluggable executor

• Hive

• Pig

• Embulk MapReduce executor

• Distributed on-memory queue (Hazelcast)
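
A rough sketch of the pluggable-executor idea: each job type is handled by a class behind a common interface, and the service picks one by the type named in the request. elephant-server's actual interfaces are not shown in the slides, so every class and function name here is hypothetical, and the executors only return a string instead of launching a job.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Common interface every pluggable executor implements (hypothetical)."""

    @abstractmethod
    def execute(self, job: dict) -> str:
        ...


class HiveExecutor(Executor):
    def execute(self, job):
        # A real executor would submit the query to Hadoop; this one echoes it.
        return f"hive: {job['query']}"


class PigExecutor(Executor):
    def execute(self, job):
        return f"pig: {job['query']}"


# Registry of available plugins, keyed by the job type named in the request.
EXECUTORS = {"hive": HiveExecutor(), "pig": PigExecutor()}


def handle_request(job):
    """Dispatch a job (e.g. the decoded body of a REST request) by its type."""
    executor = EXECUTORS.get(job["type"])
    if executor is None:
        raise ValueError(f"unsupported job type: {job['type']}")
    return executor.execute(job)


print(handle_request({"type": "hive", "query": "SELECT 1"}))
```

Adding a new engine then means registering one more class, without touching the request handling.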

Page 31: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, elephant-server. Arrows: request, pull, REST]

Page 32: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, three elephant-server instances. Arrows: request, pull, REST]

Page 33: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, three elephant-server instances, hazelcast. Arrows: request, pull, REST]

Page 34: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, three elephant-server instances, hazelcast. Arrows: request, pull, REST. Annotation: service discovery]

Page 35: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, three elephant-server instances, hazelcast. Arrows: request, pull, REST. Annotation: service discovery]

Page 36: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, three elephant-server instances, hazelcast, S3. Arrows: request, pull, REST. Annotations: service discovery, package distribution]

Page 37: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, three elephant-server instances, hazelcast, S3. Arrows: request, pull, REST, request, fetch, submit. Annotations: service discovery, package distribution]

Page 38: Managing multi tenant resource toward Hive 2.0

NO DEGRADATION

• Validation in

• Parameter difference

• Query result difference

• Performance deterioration

• Automatic testing and persistent result tables
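
The three checks listed above can be sketched as small comparison helpers. This is an illustration under assumptions, not the validation tooling from the talk: the function names and the 1.2 tolerance factor are made up, and the two configuration dictionaries are example inputs that use the standard Hive properties `hive.execution.engine` and `hive.vectorized.execution.enabled`.

```python
from collections import Counter


def diff_params(old, new):
    """Return {key: (old_value, new_value)} for every setting that differs."""
    keys = set(old) | set(new)
    return {
        k: (old.get(k), new.get(k))
        for k in sorted(keys)
        if old.get(k) != new.get(k)
    }


def same_rows(rows_a, rows_b):
    """Compare result sets as multisets: row order is not guaranteed
    without ORDER BY, but duplicates must still match."""
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))


def slower_than(old_sec, new_sec, tolerance=1.2):
    """Flag a regression when the new run exceeds old_sec * tolerance."""
    return new_sec > old_sec * tolerance


v1_conf = {"hive.execution.engine": "mr"}
v2_conf = {
    "hive.execution.engine": "tez",
    "hive.vectorized.execution.enabled": "true",
}
param_diff = diff_params(v1_conf, v2_conf)
results_match = same_rows([("a", 1), ("b", 2)], [("b", 2), ("a", 1)])
regressed = slower_than(old_sec=100.0, new_sec=90.0)
print(param_diff, results_match, regressed)
```

Keeping the results of both versions in persistent tables, as the slide suggests, lets checks like these be rerun later without executing the queries again.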

Page 39: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, elephant-server, S3. Arrows: request, pull, REST. Step: 1. upload parameters and configurations]

Page 40: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, elephant-server, S3. Arrows: request, pull, REST, submit. Label: v1. Step: 1. upload parameters and configurations]

Page 41: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, elephant-server, S3, Plazma. Arrows: request, pull, REST, submit. Label: v1. Steps: 1. upload parameters and configurations; 2. upload query result; 3. send metrics]

Page 42: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, elephant-server, S3 and Plazma for v1, S3 and Plazma for v2. Arrows: request, pull, REST, submit. Steps: 1. upload parameters and configurations; 2. upload query result; 3. send metrics]

Page 43: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, elephant-server, S3 and Plazma for v1, S3 and Plazma for v2. Arrows: request, pull, REST, submit. Steps: 1. upload parameters and configurations; 2. upload query result; 3. send metrics. Annotation: Verification between persistent result sets]

Page 44: Managing multi tenant resource toward Hive 2.0

Resource management

• Define 1 resource per 1 account

• Workload type of an account varies

• Batch, Adhoc, BI tool…

• Requires high-level resource management across clusters

• An account can have multiple resource pools

• For service and internal purpose

Page 45: Managing multi tenant resource toward Hive 2.0

[Diagram: tree. request → queue1, queue2; each queue → cluster1, cluster2; each cluster → Hadoop queue A, Hadoop queue B]

Page 46: Managing multi tenant resource toward Hive 2.0

[Diagram: tree. request → queue1, queue2; each queue → cluster1, cluster2; each cluster → Hadoop queue A, Hadoop queue B]

Enables us to define which resource the request can use
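
One way to picture the three-level choice (job queue, Hadoop cluster, Hadoop queue) is a lookup from account and workload type to a resource pool. The table contents, the account name, and the default pool below are invented for illustration. The slides do not show how the assignment is actually stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourcePool:
    job_queue: str      # job queue the request is placed on
    cluster: str        # Hadoop cluster that runs the job
    hadoop_queue: str   # scheduler queue inside that cluster


# Hypothetical assignment table: (account, workload) -> pool.
POOLS = {
    ("acct-1", "batch"): ResourcePool("queue1", "cluster1", "A"),
    ("acct-1", "adhoc"): ResourcePool("queue2", "cluster2", "B"),
}
DEFAULT_POOL = ResourcePool("queue1", "cluster1", "B")


def resolve_pool(account, workload):
    """Pick the pool for a request, falling back to a shared default."""
    return POOLS.get((account, workload), DEFAULT_POOL)


batch_pool = resolve_pool("acct-1", "batch")
adhoc_pool = resolve_pool("acct-1", "adhoc")
other_pool = resolve_pool("acct-2", "batch")
print(batch_pool, adhoc_pool, other_pool)
```

With a table like this, one account can hold several pools, so its batch and ad hoc workloads need not compete for the same Hadoop queue.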

Page 47: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, PQ, elephant-server. Arrows: request, REST]

Page 48: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, multiple PQ instances, elephant-server. Arrows: request, REST. Annotation: 1. multiple job queues]

Page 49: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, multiple PQ instances, elephant-server. Arrows: request, REST. Annotations: 1. multiple job queues; 2. multiple Hadoop clusters]

Page 50: Managing multi tenant resource toward Hive 2.0

[Diagram. Components: App, multiple PQ instances, elephant-server, two groups of queues q1, q2, q3. Arrows: request, REST. Annotations: 1. multiple job queues; 2. multiple Hadoop clusters; 3. multiple Hadoop queues]

Page 51: Managing multi tenant resource toward Hive 2.0

Brief performance comparison

Page 52: Managing multi tenant resource toward Hive 2.0

[Bar chart: COUNT query, 130GB+, 70B+ records. Y axis: Elapsed time (sec), 0 to 800. Series: Hive 1.x + MapReduce; Hive 2.x + Tez + Vectorization]

Page 53: Managing multi tenant resource toward Hive 2.0

[Bar chart: GROUP BY query, 130GB+, 70B+ records. Y axis: Elapsed time (sec), 0 to 1000. Series: Hive 1.x + MapReduce; Hive 2.x + Tez + Vectorization]

Page 54: Managing multi tenant resource toward Hive 2.0

[Bar chart: JOIN query, 130GB+, 70B+ records. Y axis: Elapsed time (sec), 0 to 1100. Series: Hive 1.x + MapReduce; Hive 2.x + Tez + Vectorization]

Page 55: Managing multi tenant resource toward Hive 2.0

Recap

• Hadoop architecture in Treasure Data for Hive 2.0 and beyond

• Resource management for multi tenancy

Page 56: Managing multi tenant resource toward Hive 2.0

We’re hiring!