Managing multi tenant resource toward Hive 2.0
Date posted: 15-Apr-2017
Category: Software
Transcript of Managing multi tenant resource toward Hive 2.0
Managing multi tenant resource toward Hive 2.0
Kai Sasaki Treasure Data Inc.
About Me
Kai Sasaki
@Lewuathe (Twitter)
Software Engineer at Treasure Data Inc.
Maintains and develops Hadoop/Presto infrastructure
Topic
Treasure Data infrastructure
Hive 2.0 changes
Migration architecture
Resource management for multi tenancy
Performance comparison
Live Data Management Platform
Original creator of Fluentd/Embulk/Digdag
70+ integrations with
BI tools
Mobile/IoT
Cloud Storage
and more
Hive/Pig/Presto data processing interface
40000+ Hive queries / day
130000+ Presto queries / day
Plazma Cloud Storage
450000+ records/sec imported
Hive 1.x to Hive 2.x
Any change?
Hive 2.0
Includes major new features
Fixed 600+ bugs
140+ improvements or new features
Backward compatible as much as possible
Hive 1.x remains the stable line
2.1.0 has been available since June 20th, 2016
http://www.slideshare.net/HadoopSummit/apache-hive-20-sql-speed-scale
Hive 2.0
HPLSQL
LLAP
HBase metastore
Improvements of Hive on Spark
CBO improvements
HPLSQL
Procedural SQL like Oracle's PL/SQL
Cursors
loops (WHILE, FOR, LOOP)
Branches (IF)
External library which communicates through JDBC
http://www.hplsql.org/doc
LLAP
Sub-second queries in Hive
Saves JVM container launch time
Data caching
Suited to ad hoc or interactive use cases
Beta in 2.0
http://hortonworks.com/wp-content/uploads/2014/09/Screen-Shot-2014-09-02-at-5.03.47-PM.png
HBase metastore
Use HBase as the Hive metastore
Fetching thousands of partitions
Limit on concurrent connections
Will support transactions with Apache Omid
Alpha in Hive 2.0
Many fixes and cutting-edge features
That's all?
Operation cost of migration
Managing multiple clusters
Testing and verifying multiple packages
Differences in configuration and parameters
Need to reduce operation cost at the same time
Now migration
Challenge
NO DOWNTIME
NO HARMFUL OPERATION
Change package easily
Separate from other components (Micro service)
NO DEGRADATION
Automatic query test and validation
NO DOWNTIME
Hadoop cluster Blue-Green deployment
Reliable queue system separated from Hadoop
PerfectQueue
Reliable storage system separated from Hadoop
Plazma
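A minimal sketch of why a queue kept outside Hadoop makes the blue-green switch possible. Every name here (`BlueGreenRouter`, `drain`, the cluster labels) is hypothetical and not taken from Treasure Data's code; the point is only that jobs wait in an external queue, so changing the active cluster loses nothing.

```python
from collections import deque


class BlueGreenRouter:
    """Routes queued jobs to whichever cluster is currently active (hypothetical)."""

    def __init__(self, active):
        self.active = active   # label of the cluster that receives new submissions
        self.queue = deque()   # stands in for the queue kept outside Hadoop

    def enqueue(self, job):
        self.queue.append(job)

    def switch(self, new_cluster):
        # Queued jobs are untouched; only the submit target changes.
        self.active = new_cluster

    def drain(self, n):
        """Submit up to n queued jobs; return (cluster, job) pairs."""
        submitted = []
        while self.queue and len(submitted) < n:
            submitted.append((self.active, self.queue.popleft()))
        return submitted


router = BlueGreenRouter(active="v1")
for job in ["q1", "q2", "q3"]:
    router.enqueue(job)

first = router.drain(1)   # runs on v1
router.switch("v2")       # the new cluster takes over
rest = router.drain(10)   # remaining jobs run on v2
```

Because the queue and the storage both live outside the cluster, the old cluster can be discarded once its running jobs finish.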
PerfectQueue
Distributed queue built on top of RDBMS
At-least-once semantics
Graceful and live restarting
State consistency by transaction
https://github.com/treasure-data/perfectqueue
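The at-least-once idea can be illustrated with a lease column in a table. This is a sketch using Python's `sqlite3`, not PerfectQueue's actual schema or API; the table layout, `visible_at` column, and function names are assumptions made for illustration.

```python
import sqlite3


def open_queue():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, payload TEXT, visible_at INTEGER)"
    )
    return conn


def submit(conn, task_id, payload, now):
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO tasks VALUES (?, ?, ?)", (task_id, payload, now))


def acquire(conn, now, lease=60):
    """Lease one visible task; it reappears after `lease` seconds unless finished."""
    # A real multi-worker queue would also need row locking on a server RDBMS;
    # this single-connection sketch skips that.
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM tasks WHERE visible_at <= ? "
            "ORDER BY visible_at, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET visible_at = ? WHERE id = ?", (now + lease, row[0])
        )
        return row


def finish(conn, task_id):
    with conn:
        conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))


conn = open_queue()
submit(conn, "job-1", "SELECT 1", now=0)
a = acquire(conn, now=0)    # worker A leases the task, then crashes
b = acquire(conn, now=10)   # still leased: nothing is visible
c = acquire(conn, now=60)   # lease expired: the task is delivered again
finish(conn, "job-1")
d = acquire(conn, now=200)  # finished tasks never come back
```

A task that is never finished is redelivered, so consumers have to tolerate duplicates; that is the trade-off of at-least-once delivery.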
Plazma
Distributed cloud-based storage
PostgreSQL + S3/Riak CS
Enables time-index pushdown for Hive/Pig/Presto
Column-oriented IO (mpc1)
Data consistency with transactional API
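Time-index pushdown means a query with a time range reads only the partitions that can contain matching rows. The sketch below assumes hourly partitions described by a `begin` timestamp; the partition layout and the `prune_partitions` helper are hypothetical, since the slides do not describe Plazma's internal index.

```python
HOUR = 3600


def prune_partitions(partitions, start, end):
    """Return partitions whose [begin, begin + HOUR) range overlaps [start, end)."""
    return [
        p for p in partitions
        if p["begin"] < end and p["begin"] + HOUR > start
    ]


# One day of hypothetical hourly partitions.
partitions = [{"begin": h * HOUR, "path": f"part-{h}"} for h in range(24)]

# A query over [02:00:10, 04:00:00) only needs two of the 24 partitions.
selected = prune_partitions(partitions, start=2 * HOUR + 10, end=4 * HOUR)
paths = [p["path"] for p in selected]
```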
[Figure sequence: blue-green deployment. App -> request -> PQ -> pull -> submit -> Hadoop cluster -> fetch -> Plazma. The Hadoop clusters are disposable components: v2 is added alongside v1, and in the final frame only v2 remains.]
NO HARMFUL OPS
Automatic package version upgrade
Chef server specifies the version
Hadoop package repository
S3 remote package repository
Hadoop as a REST service
elephant-server
elephant-server
Hadoop as a REST service
Pluggable executor
Hive
Pig
Embulk MapReduce executor
Distributed in-memory queue (Hazelcast)
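A pluggable executor can be sketched as a registry that maps a job type to an executor class. The class names, the request shape, and `handle_request` are illustrative assumptions, not elephant-server's real interface.

```python
from abc import ABC, abstractmethod


class Executor(ABC):
    """Common interface every pluggable executor implements (hypothetical)."""

    @abstractmethod
    def run(self, query):
        ...


class HiveExecutor(Executor):
    def run(self, query):
        return f"hive:{query}"  # placeholder for a real Hive submission


class PigExecutor(Executor):
    def run(self, query):
        return f"pig:{query}"   # placeholder for a real Pig submission


# Adding an engine means registering one more class here.
EXECUTORS = {"hive": HiveExecutor, "pig": PigExecutor}


def handle_request(request):
    """Dispatch a REST-style request body (a dict) to the matching executor."""
    executor_cls = EXECUTORS.get(request["type"])
    if executor_cls is None:
        raise ValueError(f"unknown job type: {request['type']}")
    return executor_cls().run(request["query"])


result = handle_request({"type": "hive", "query": "SELECT 1"})
```

The caller only knows the REST contract, so the Hadoop packages behind each executor can change without touching other components.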
[Figure sequence: App -> request -> PQ -> pull -> REST -> elephant-server. Later frames add multiple elephant-server instances with Hazelcast, service discovery, package distribution from S3, and the fetch and submit steps.]
NO DEGRADATION
Validation of
Parameter difference
Query result difference
Performance deterioration
Automatic testing and persistent result tables
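The three checks listed above could look roughly like this. The helper names and the 1.2x slowdown tolerance are assumptions made for illustration; the slides do not say how the comparison is implemented.

```python
from collections import Counter


def diff_params(old, new):
    """Return {key: (old_value, new_value)} for every parameter that differs."""
    keys = old.keys() | new.keys()
    return {
        k: (old.get(k), new.get(k))
        for k in sorted(keys)
        if old.get(k) != new.get(k)
    }


def same_rows(old_rows, new_rows):
    """Compare results as multisets: row order is not guaranteed without ORDER BY."""
    return Counter(map(tuple, old_rows)) == Counter(map(tuple, new_rows))


def is_slower(old_sec, new_sec, tolerance=1.2):
    """Flag the new version when it exceeds the old elapsed time by the tolerance."""
    return new_sec > old_sec * tolerance


v1 = {"hive.execution.engine": "mr", "hive.exec.parallel": "true"}
v2 = {
    "hive.execution.engine": "tez",
    "hive.exec.parallel": "true",
    "hive.vectorized.execution.enabled": "true",
}

param_diff = diff_params(v1, v2)
rows_match = same_rows([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])
regressed = is_slower(old_sec=100, new_sec=130)
```

Storing the inputs to these checks persistently lets the same comparison be re-run whenever a new package version is tested.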
[Figure sequence: App -> request -> PQ -> pull -> REST -> elephant-server -> submit -> v1. Steps shown: 1. upload parameters and configurations (S3), 2. upload query result (Plazma), 3. send metrics. The same steps are repeated for v2, followed by verification between the persistent result sets.]
Resource management
Define one resource per account
The workload type of an account varies
Batch, ad hoc, BI tool
Requires high-level resource management across clusters
An account can have multiple resource pools
For service and internal purposes
[Figure: a request goes through job queues (queue1, queue2) to Hadoop clusters (cluster1, cluster2); each cluster has Hadoop queue A and Hadoop queue B.]
Enables us to define which resource the request can use
[Figure sequence: App -> request -> PQ -> REST -> elephant-server, extended in three steps: 1. multiple job queues (PQ), 2. multiple Hadoop clusters, 3. multiple Hadoop queues (q1, q2, q3) per cluster.]
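One way to picture a resource pool is as a triple of job queue, Hadoop cluster, and Hadoop queue, looked up per account and workload type. The `ResourcePool` type, the account name, and the assignments below are hypothetical; the slides do not show the real data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourcePool:
    job_queue: str     # which job queue accepts the request
    cluster: str       # which Hadoop cluster runs it
    hadoop_queue: str  # which scheduler queue inside that cluster


# Hypothetical assignment: one account may own several pools, one per workload.
POOLS = {
    ("account-1", "batch"): ResourcePool("queue1", "cluster1", "A"),
    ("account-1", "adhoc"): ResourcePool("queue2", "cluster2", "B"),
}
DEFAULT_POOL = ResourcePool("queue1", "cluster1", "B")


def resolve_pool(account, workload):
    """Pick the pool a request may use, falling back to a shared default."""
    return POOLS.get((account, workload), DEFAULT_POOL)


batch_pool = resolve_pool("account-1", "batch")
other_pool = resolve_pool("account-2", "batch")
```

Keeping this mapping above the clusters is what allows a request to be moved to another queue or cluster without the account noticing.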
Brief performance comparison
[Bar charts: elapsed time (sec) for COUNT, GROUP BY, and JOIN queries on 130GB+, 70B+ records, comparing Hive 1.x + MapReduce with Hive 2.x + Tez + Vectorization. The bar values are not recoverable from the transcript.]
Recap
Hadoop architecture in Treasure Data for Hive 2.0 and beyond
Resource management for multi tenancy
W