Towards Energy Efficient Hadoop
Wednesday, June 10, 2009 Santa Clara Marriott
Yanpei Chen, Laura Keys, Randy Katz
RAD Lab, UC Berkeley
Why Energy?
Cooling, costs, the environment.
Why Energy-Efficient Software?
Power Utilization Efficiency (PUE):

PUE = (total power used by a datacenter) / (IT power used by a datacenter)
    = (IT power + PDU + UPS + HVAC + lighting + other overhead) / (IT power)

where IT power is what the servers, network, and storage draw.

PUE ≥ 2 circa 2006 and before; approaching 1 in present-day facilities.
Most of the further savings are to be had in IT hardware and software.
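As a worked example (the component figures are made up for illustration, not numbers from the talk):

```python
# Illustrative PUE calculation; component figures are made-up examples.
it_power_kw = 1000.0            # servers, network, storage
overhead_kw = {
    "PDU": 50.0,                # power distribution units
    "UPS": 80.0,                # uninterruptible power supplies
    "HVAC": 700.0,              # cooling
    "lighting": 20.0,
    "other": 150.0,
}

total_kw = it_power_kw + sum(overhead_kw.values())
pue = total_kw / it_power_kw
print(f"PUE = {total_kw:.0f} / {it_power_kw:.0f} = {pue:.2f}")  # PUE = 2.00
```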
Energy as a Performance Metric
[Chart: productivity vs. resources used]
The traditional view of the software system design space: increase productivity for the fixed resources of a system.
[Chart: productivity vs. resources used, with energy as a third axis]
Maybe a better view of the design space: decrease energy without compromising productivity?
Methodology
Performance metrics: a basket of metrics – job duration, energy, power (i.e., the time rate of energy use). Performance variance?
Parameters: static – cluster size, workload size, configuration parameters; dynamic – task scheduling? Block placement? Speculative execution?
Workload: exercise all components – sort, HDFS read, HDFS write, shuffle; representative of production workloads – Nutch, GridMix, others?
Energy measurement: wall-plug measurement – 1 W accuracy, 1 reading per second. Fine-grained measurement to correlate energy consumption with hardware components?
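For instance, turning the 1 Hz wall-plug samples into job energy is a simple numerical integration. A minimal sketch, assuming readings arrive as (timestamp, watts) pairs; the trapezoidal rule is one reasonable choice, not necessarily the one used here:

```python
def energy_joules(readings):
    """Integrate (timestamp_s, power_w) samples into joules.

    Trapezoidal rule over successive samples; at 1 reading per second
    this is essentially summing the watt readings.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(readings, readings[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

# Example: a constant 2000 W draw sampled for 10 seconds -> 20,000 J
samples = [(t, 2000.0) for t in range(11)]
print(energy_joules(samples))  # 20000.0
```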
Scaling to More Workers – Sort
JouleSort's highly customized system vs. out-of-the-box Hadoop with the default config.:
11k sorted records per joule vs. 87 sorted records per joule.
TeraSort format, 100-byte records with 10-byte keys, 10 GB of total data.
Out-of-the-box Hadoop 0.18.2 with the default config.
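Where the 87 records/joule figure comes from, roughly: the dataset holds about 1.07e8 records, so the metric implies a total job energy around 1.2 MJ, consistent with the total-energy plot below (the energy is read off the plot, so treat this as approximate):

```python
# Records sorted per joule for the 10 GB (10 GiB) TeraSort-format sort.
records = 10 * 2**30 // 100        # 100-byte records -> 107,374,182 records
records_per_joule = 87             # out-of-the-box Hadoop, from the slide
print(records / records_per_joule) # ~1.23e6 J implied total job energy
```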
[Plots: Sort – total power (W), total energy (J), and job duration (s) for workers + master, vs. number of workers (0–14)]
Reduce energy by adding more workers?
Scaling to More Workers – Sort
TeraSort format, 100-byte records with 10-byte keys, 10 GB of total data.
Out-of-the-box Hadoop with the default config.; workers' energy only.
The energy of the master is amortized by additional workers.
[Plots: Sort – total power (W), total energy (J), and job duration (s) vs. number of workers (0–14)]
Scaling to More Workers – Nutch
Nutch web crawler and indexer, with Hadoop 0.19.1.
Index URLs anchored at www.berkeley.edu, depth 7, 2000 links per page
Workload has some built-in bottlenecks?
[Plots: Nutch – total power (W), total energy (J), and duration (s) vs. number of workers (0, 4, 8)]
Isolating IO Stages
HDFS read, shuffle, and HDFS write jobs, modified from the prepackaged sort example.
Each job reads, shuffles, or writes 10 GB of data in TeraSort format and does nothing else.
HDFS write seems to be the scaling bottleneck.
[Stacked plots: duration fraction and energy fraction by stage (HDFS read, shuffle, HDFS write) vs. number of workers (1, 4, 8, 12)]
HDFS Replication
HDFS read, shuffle, HDFS write, and sort jobs, 10 GB of data, TeraSort format.
Modify the number of HDFS replicas; default config. for everything else.
Some workloads are affected – HDFS write; some are not – shuffle.
[Plots: HDFSWrite and Shuffle – total power (W), total energy (J), and job duration (s), workers + master, vs. HDFS replication factor]
HDFS Replication
[Stacked plots: duration fraction and energy fraction by stage (HDFS read, shuffle, HDFS write) vs. number of workers (1, 4, 8, 12), for replication 3 (the default) and replication 2]
Reducing HDFS replication to 2 makes HDFS write less of a bottleneck?
Changing Input Size
Sort, modified from the prepackaged sort example.
Jobs that handle less than ~1 GB of data per node are bottlenecked by overhead.
Out-of-the-box Hadoop is competitive with the JouleSort winner at 100 MB?!
Here's a somewhat noteworthy result:
[Plot: records sorted per joule vs. number of workers (0–12), for input sizes 10 GB, 5 GB, 1 GB, 500 MB, 100 MB]
HDFS Block Size
HDFS read, shuffle, HDFS write, and sort jobs, 10 GB of data, TeraSort format.
Modify the HDFS block size; default config. for everything else.
[Plots: HDFSRead and Shuffle – total power (W), total energy (J), and job duration (s), workers + master, vs. block size (16, 64, 256, 1024 MB)]
Some workloads are affected – HDFS read; some are not – shuffle.
Slow Nodes
One node in the cluster consistently received fewer blocks.
Removing the slow node leads to a performance improvement.
Are there clever ways to use the slow node instead of taking it offline?
[Bar charts: normalized number of blocks placed per node (r4, r5, r32–r40), with and without the lagger r32]
experiment  | duration (s) | ±95% CI | total energy (J) | ±95% CI    | records/J | avg power (W) | ±95% CI
------------|--------------|---------|------------------|------------|-----------|---------------|--------
with r32    | 387.65       | 73.80   | 827,039.68       | 157,169.67 | 129.83    | 2,134.27      | 1.80
without r32 | 301.15       | 81.15   | 648,813.04       | 174,540.47 | 165.49    | 1,940.78      | 5.62
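The records/J column follows from the raw columns once "10 GB" is read as 10 GiB. A quick check using the table's own numbers:

```python
# records/J = (10 GiB / 100-byte records) / total energy
records = 10 * 2**30 // 100                       # 107,374,182 records
for label, energy_j in [("with r32", 827039.6807),
                        ("without r32", 648813.0401)]:
    print(label, round(records / energy_j, 2))
# with r32 129.83
# without r32 165.49
```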
Predicting IO Energy
[Plots: Sort – job duration (s), total energy (J), and total power (W) vs. number of workers (0–14), measured vs. predicted]
Working example: predict the IO energy for a particular task.
Benchmark the energy in joules per byte for HDFS read, shuffle, and HDFS write. Then:

IO energy = bytes read × joules per byte (HDFS read)
          + bytes shuffled × joules per byte (shuffle)
          + bytes written × joules per byte (HDFS write)

The simple model is effective, but requires prior measurements.
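A minimal sketch of the model; the per-byte coefficients would come from the benchmark runs, and the values below are placeholders rather than measured numbers:

```python
# Linear IO-energy model: benchmark joules per byte once per stage,
# then apply the coefficients to a task's byte counts.
# These coefficients are hypothetical placeholders.
JOULES_PER_BYTE = {
    "hdfs_read": 5e-5,
    "shuffle": 8e-5,
    "hdfs_write": 1.2e-4,
}

def predict_io_energy(bytes_read, bytes_shuffled, bytes_written):
    """IO energy = sum over stages of bytes moved x benchmarked J/byte."""
    return (bytes_read * JOULES_PER_BYTE["hdfs_read"]
            + bytes_shuffled * JOULES_PER_BYTE["shuffle"]
            + bytes_written * JOULES_PER_BYTE["hdfs_write"])

# A sort reads, shuffles, and writes the full dataset once:
size = 10 * 2**30
print(predict_io_energy(size, size, size))  # predicted joules, 10 GB sort
```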
Cluster Provisioning and Configuration
Working example: find the optimal cluster size for a steady job stream.

N = number of workers that we could assign
D(N) = job duration, as a function of N
Pa(N) = power when active, as a function of N
Pi = power when idle
T = average job arrival interval
E(N) = expected energy consumed per job, as a function of N

E(N) = D(N) × Pa(N) + [T – D(N)] × Pi

Optimize E(N) over the range of N such that D(N) ≤ T.
In general, this is a multi-dimensional optimization problem to meet job constraints.
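A sketch of that search, assuming D(N) and Pa(N) come from measurements like the sort-scaling runs; the curves below are illustrative stand-ins, and the idle power is written as a function of N (a larger provisioned cluster also idles at higher power), where the slide uses a single constant Pi:

```python
T = 900.0                        # average job arrival interval (s)

def D(n):                        # job duration: parallel work + fixed overhead
    return 3600.0 / n + 60.0     # illustrative stand-in, not measured

def Pa(n):                       # active power: master + n busy workers (W)
    return 300.0 + 200.0 * n

def Pi(n):                       # idle power of the provisioned cluster (W)
    return 150.0 + 100.0 * n

def E(n):                        # expected energy per job (J)
    return D(n) * Pa(n) + (T - D(n)) * Pi(n)

feasible = [n for n in range(1, 33) if D(n) <= T]  # must keep up with arrivals
best = min(feasible, key=E)
print(f"optimal N = {best}, E(N) = {E(best):.0f} J per job")
```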
Optimal HDFS Replication
Working example: reduce HDFS replication from 3 to 2, i.e., keep the off-rack replica only?

Benefit = probability(no failure) × [energy(3 replicas) – energy(2 replicas)]
Cost = probability(failure and local recovery) × [energy(off-rack recovery) – energy(rack-local recovery)]

A cost-benefit trade-off between lower energy and higher recovery costs.
Need to quantify the probability of failure/recovery to set a sensible replication level.
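A sketch of the comparison as expected energy per job; every probability and energy figure below is a hypothetical placeholder, since quantifying them is exactly the open problem:

```python
# Expected-energy cost-benefit of dropping HDFS replication from 3 to 2.
# All values are hypothetical placeholders illustrating the structure.
p_failure = 0.01                 # chance a lost replica forces recovery
e_3_replicas = 1.50e6            # J per job at replication 3
e_2_replicas = 1.15e6            # J per job at replication 2
e_offrack_recovery = 4.0e5       # J to re-replicate across racks
e_racklocal_recovery = 1.0e5     # J to re-replicate within a rack

benefit = (1 - p_failure) * (e_3_replicas - e_2_replicas)
cost = p_failure * (e_offrack_recovery - e_racklocal_recovery)
print(f"benefit {benefit:.0f} J vs. cost {cost:.0f} J -> "
      f"{'reduce to 2' if benefit > cost else 'keep 3'}")
```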
Faster = More Energy Efficient?
r = fraction of resources used in a system, ranging from 0 to 1
R(r) = work rate of the system, ranging from 0 to RMAX
P(r) = power of the system, ranging from 0 to PMAX
r1, r2 = the lower and upper bounds of the resource operating region
W = workload size
E(r) = energy consumed when running at resource fraction r

[Plots: work rate R(r) = r × RMAX and power P(r) = r × PMAX, both linear in r]

E(r) = P(r) × W / R(r) = (r × PMAX) × W / (r × RMAX) = W × PMAX / RMAX

Constant energy for a fixed workload size, so run as fast as we can.
Faster = More Energy Efficient?
r = fraction of resources used in a system, ranging from 0 to 1
R(r) = work rate of the system, ranging from 0 to RMAX
P(r) = power of the system, ranging from 0 to PMAX
r1, r2 = the lower and upper bounds of the resource operating region
W = workload size
E(r) = energy consumed when running at resource fraction r

[Plots: work rate R(r) = r × RMAX, linear in r; power P(r) = PIDLE + r × (PMAX – PIDLE), linear with a nonzero idle intercept]

E(r) = P(r) × W / R(r) = [PIDLE + r × (PMAX – PIDLE)] × W / (r × RMAX)
     = (W / RMAX) × PIDLE / r + (W / RMAX) × (PMAX – PIDLE)

Reduce energy by using more resources, so run as fast as we can, again.
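A short numeric sketch of both models: under the idealized proportional power curve E(r) is flat in r, while with an idle floor E(r) falls as r grows (the constants are illustrative):

```python
# Energy to finish workload W at resource fraction r, under the two
# power models above (W, R_MAX, P_MAX, P_IDLE are illustrative).
W = 1e10           # workload size (work units)
R_MAX = 1e8        # max work rate (units/s)
P_MAX = 2000.0     # max power (W)
P_IDLE = 800.0     # idle power (W)

def energy_ideal(r):        # P(r) = r * P_MAX  ->  E constant in r
    return (r * P_MAX) * W / (r * R_MAX)

def energy_idle_floor(r):   # P(r) = P_IDLE + r*(P_MAX - P_IDLE)  ->  E falls with r
    return (P_IDLE + r * (P_MAX - P_IDLE)) * W / (r * R_MAX)

for r in (0.25, 0.5, 1.0):
    print(r, energy_ideal(r), energy_idle_floor(r))
# energy_ideal is 200000.0 J at every r; energy_idle_floor
# shrinks from 440000 J at r=0.25 to 200000 J at r=1.0
```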
Faster = More Energy Efficient?
(Same definitions as the previous two slides.)

[Plots: work rate R(r) = r × RMAX; power P(r) = PIDLE + r × (PMAX – PIDLE)]

Caveats: What is meant by "resources"? What is a realistic behavior for R(r)?
Take-Away Thoughts
[Chart: performance vs. resources used]
If work rate ∝ resources used, energy is another aspect of performance.
All prior performance optimization techniques don't need to be re-invented.
What if work rate is not proportional to resources used?
Different hardware?
Productivity benchmarks?
Hadoop as both the TeraSort and the JouleSort winner?