Qubole hadoop-summit-2013-europe
![Page 1: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/1.jpg)
Cloud Friendly Hadoop & Hive
Joydeep Sen Sarma
Qubole
![Page 2: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/2.jpg)
Agenda
• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud
![Page 3: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/3.jpg)
Qubole Data Service
AWS S3
AWS EC2
![Page 4: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/4.jpg)
Hadoop
Qubole Data Service
Sqoop Oozie Pig Hive
AWS S3
API
AWS EC2
![Page 5: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/5.jpg)
Hadoop
Qubole Data Service
Sqoop Oozie Pig Hive
AWS S3
API
AWS EC2
S3://adco/logs
Mysql
Vertica
![Page 6: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/6.jpg)
Hadoop
Qubole Data Service
Sqoop Oozie Pig Hive
AWS S3
API
ODBC SDK
AWS EC2
Explore – Integrate – Analyze – Schedule
S3://adco/logs
Mysql
Vertica
![Page 8: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/8.jpg)
Agenda
• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud
![Page 9: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/9.jpg)
Step 1 (Optional): Set Up Hadoop
![Page 10: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/10.jpg)
Step 2: Fire Away
AdCo Hadoop
![Page 11: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/11.jpg)
Step 2: Fire Away
select t.county, count(1) from (select
transform(a.zip) using 'geo.py' as
county from SMALL_TABLE a) t
group by t.county;
AdCo Hadoop
![Page 13: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/13.jpg)
Step 2: Fire Away
select t.county, count(1) from (select
transform(a.zip) using 'geo.py' as
county from SMALL_TABLE a) t
group by t.county;
insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id=b.ad_id)
group by a.id, a.zip;
hadoop jar -Dmapred.min.split.size=32000000
myapp.jar -partitioner .org.apache…
AdCo Hadoop
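The first query above pipes each zip through a user script. Hive's TRANSFORM sends rows to the script as tab-separated lines on stdin and reads result rows back from stdout (the script is typically shipped to the cluster first, for example with ADD FILE). The slides do not show geo.py itself, so the following is only a sketch of what such a script could look like. The lookup table and the function names are hypothetical and stand in for whatever geo data the real script used.

```python
import sys

# Hypothetical lookup table for illustration only; a real script would
# load a complete ZIP-to-county mapping from a file shipped with the job.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "94103": "San Francisco",
    "10001": "New York",
}


def zip_to_county(zip_code):
    """Map one zip code to a county name, or UNKNOWN if it is not listed."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def transform(lines, out):
    """Read tab-separated input rows and write one county per row."""
    for line in lines:
        # The query passes a single column (a.zip), so take the first field.
        zip_code = line.rstrip("\n").split("\t")[0]
        out.write(zip_to_county(zip_code) + "\n")


# Only stream when input is piped in, as it is when Hive runs the script.
if __name__ == "__main__" and not sys.stdin.isatty():
    transform(sys.stdin, sys.stdout)
```

Locally, the script can be checked without a cluster by piping a few zip codes into it, one per line.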
![Page 15: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/15.jpg)
Step 2: Fire Away
hadoop jar -Dmapred.min.split.size=32000000
myapp.jar -partitioner .org.apache…
AdCo Hadoop
![Page 17: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/17.jpg)
Step 2: Fire Away
AdCo Hadoop
![Page 18: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/18.jpg)
Come back anytime
![Page 19: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/19.jpg)
Hadoop as Service
1. Detect when cluster is required
– Not all Hive statements require a cluster (EXPLAIN, SHOW, ...)
2. Atomically create cluster
– Long-running process; concurrency control using MySQL
3. Shut down when not in use
– Do it on an hour boundary (whose?)
– Not if user sessions are active!
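The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not Qubole's implementation: the list of metadata-only statement prefixes is a guess, sqlite3 stands in for MySQL so the example is self-contained, the table and function names are hypothetical, and the hour boundary is measured from cluster launch (the slide leaves "whose?" open).

```python
import sqlite3
from datetime import datetime, timedelta

# Assumed list of statement prefixes that can be served without a cluster.
METADATA_ONLY = ("explain", "show", "describe", "use", "set")


def needs_cluster(statement):
    """Step 1: guess whether a Hive statement needs a running cluster."""
    words = statement.strip().lower().split(None, 1)
    if not words:
        return False
    return words[0] not in METADATA_ONLY


def init_claims_table(conn):
    """Create the (hypothetical) table used as a creation lock."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cluster_claims ("
        "account TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )


def try_claim_cluster(conn, account):
    """Step 2: let exactly one caller start the cluster for an account.

    The primary key makes the INSERT succeed for one caller only; every
    other concurrent caller gets an integrity error and should wait.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cluster_claims (account, state) "
                "VALUES (?, 'starting')",
                (account,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def should_shutdown(now, launched_at, active_sessions, running_jobs,
                    window=timedelta(minutes=5)):
    """Step 3: shut down only when idle and close to the end of a paid hour."""
    if active_sessions > 0 or running_jobs > 0:
        return False
    seconds_into_hour = (now - launched_at).total_seconds() % 3600
    return seconds_into_hour >= 3600 - window.total_seconds()
```

Using a unique-key insert as the lock keeps the "who creates the cluster" decision inside the database, which is one simple way to get atomicity across several API servers.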
![Page 20: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/20.jpg)
Hadoop as Service
• Archive Job History/Logs to S3
– Transparent access to old jobs
• Auto-config different node types
– Use ALL ephemeral drives for HDFS/MR
– Use right number of slots per machine
• Scrub, Scrub, Scrub
– Bad nodes, bad clusters, AWS timeouts
![Page 21: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/21.jpg)
Scaling Up
StarCluster
Map Tasks
Reduce Tasks
AWS
Master
Slaves
Job Tracker
![Page 22: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/22.jpg)
Scaling Up
StarCluster
Map Tasks
Reduce Tasks
AWS
Master
Slaves
Job Tracker
insert overwrite table dest
select … from ads join
campaigns on … group by …;
![Page 25: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/25.jpg)
Scaling Up
StarCluster
Map Tasks
Reduce Tasks
AWS
Master
Slaves
Job Tracker
insert overwrite table dest
select … from ads join
campaigns on … group by …;
Progress
![Page 26: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/26.jpg)
Scaling Up
StarCluster
Map Tasks
Reduce Tasks
AWS
Master
Slaves
Job Tracker
insert overwrite table dest
select … from ads join
campaigns on … group by …;
Progress
Demand
Supply
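The diagram shows the Job Tracker comparing demand with supply as the job progresses, but the slides do not give the formula. One possible way to turn that comparison into a node count is sketched below. The `ClusterState` fields, the `target_seconds` parameter, and the slot-seconds arithmetic are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    pending_tasks: int       # tasks waiting for a free slot
    avg_task_seconds: float  # observed average task duration
    total_slots: int         # slots across all current slaves
    slots_per_node: int      # slots one additional slave would add
    current_nodes: int
    max_nodes: int           # upper bound configured for the cluster


def nodes_to_add(state, target_seconds=600.0):
    """Estimate how many slaves to add so queued work fits the target time."""
    if state.pending_tasks <= 0 or state.slots_per_node <= 0:
        return 0
    # Demand: slot-seconds of work still waiting to run.
    demand = state.pending_tasks * state.avg_task_seconds
    # Supply: slot-seconds the current cluster delivers within the target.
    supply = state.total_slots * target_seconds
    if demand <= supply:
        return 0
    missing_slots = math.ceil((demand - supply) / target_seconds)
    wanted_nodes = math.ceil(missing_slots / state.slots_per_node)
    headroom = max(0, state.max_nodes - state.current_nodes)
    return min(wanted_nodes, headroom)
```

Capping the result at the configured maximum keeps a single large job from growing the cluster without bound.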
![Page 30: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/30.jpg)
Scaling Down
1. On hour boundary, check if node is required:
– Can’t remove nodes with map outputs (today)
– Don’t go below minimum cluster size
2. Remove node from Map-Reduce cluster
3. Request HDFS decommissioning (fast!)
– Delete affected cache files instead of re-replicating
– One surviving replica and we are done.
4. Delete instance
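The check in step 1 can be sketched as a small selection function. This is a simplified illustration: the `Node` fields, the five-minute window before the hour boundary, and the rule that a node must have no running tasks are assumptions, while the map-output and minimum-size rules come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    minutes_into_hour: int  # minutes since this node's last hour boundary
    has_map_outputs: bool   # still holds map outputs that reducers may need
    running_tasks: int


def removable_nodes(nodes, min_cluster_size, window_minutes=5):
    """Pick nodes that can be removed without shrinking below the minimum."""
    removable = []
    remaining = len(nodes)
    for node in nodes:
        if remaining <= min_cluster_size:
            break  # never go below the minimum cluster size
        near_boundary = node.minutes_into_hour >= 60 - window_minutes
        idle = node.running_tasks == 0 and not node.has_map_outputs
        if near_boundary and idle:
            removable.append(node.name)
            remaining -= 1
    return removable
```

Each selected node would then go through steps 2 to 4 in order: removal from the Map-Reduce cluster, HDFS decommissioning, and finally instance deletion.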
![Page 31: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/31.jpg)
Spot Instances
On average, 50-60% cheaper
![Page 32: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/32.jpg)
Spot Instance: Challenges
• Can lose Spot nodes anytime
– Disastrous for HDFS
– Hybrid Mode: Use mix of On-Demand and Spot
– Hybrid Mode: Keep one replica in On-Demand nodes
• Spot Instances may not be available
– Timeout and use On-Demand nodes as fallback
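The hybrid mode with an On-Demand fallback can be sketched as below. This is an illustration only: `request_spot` and `request_on_demand` are hypothetical callables that take a node count and return the list of node ids actually granted, and the 50% Spot share, timeout, and polling interval are made-up defaults. No real AWS API calls are shown.

```python
import time


def acquire_nodes(count, request_spot, request_on_demand,
                  spot_fraction=0.5, timeout_seconds=300.0,
                  poll_seconds=5.0, clock=time.monotonic, sleep=time.sleep):
    """Acquire a mix of On-Demand and Spot nodes, falling back on timeout.

    Returns a tuple (on_demand_nodes, spot_nodes).
    """
    spot_wanted = int(count * spot_fraction)
    # The On-Demand share is requested first so the cluster always has
    # stable nodes that can hold one replica of each HDFS block.
    on_demand = list(request_on_demand(count - spot_wanted))
    spot = []
    deadline = clock() + timeout_seconds
    while len(spot) < spot_wanted and clock() < deadline:
        spot += request_spot(spot_wanted - len(spot))
        if len(spot) < spot_wanted:
            sleep(poll_seconds)
    # Spot capacity did not arrive in time: cover the gap with On-Demand.
    shortfall = spot_wanted - len(spot)
    if shortfall > 0:
        on_demand += request_on_demand(shortfall)
    return on_demand, spot
```

The clock and sleep functions are parameters so that the timeout path can be exercised in a test without actually waiting.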
![Page 33: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/33.jpg)
Agenda
• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud
![Page 34: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/34.jpg)
Query History/Results
![Page 35: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/35.jpg)
Cheap to Test
Evaluate expressions on sample data
![Page 36: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/36.jpg)
Cheap to Test
Run Query on Sample
![Page 41: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/41.jpg)
Fastest Hive SaaS
• Works with Small Files!
– Faster Split Computation (8x)
– Prefetching S3 files (30%)
• Direct writes to S3
– HIVE-1620
• NEW – Multi-Tenant Hive Server
• Stable JVM Reuse!
– Fix re-entrancy issues
– 1.2-2x speedup
• Columnar Cache
– Use HDFS as cache for S3
– Up to 5x faster for JSON data
![Page 42: Qubole hadoop-summit-2013-europe](https://reader033.fdocuments.net/reader033/viewer/2022052620/5578f223d8b42a5c5c8b5232/html5/thumbnails/42.jpg)
Questions?
@Qubole
Free Trial: www.qubole.com