Achieving Power-Efficiency in Clusters without Distributed File System Complexity

Hrishikesh Amur, Karsten Schwan

Georgia Tech


Green Computing Research Initiative at GT

Circuit level: DVFS, power states, clock gating (ECE)

Chip and Package: power multiplexing, spatiotemporal migration (SCS, ECE)

Board: VirtualPower, scheduling/scaling/operating system… (SCS, ME, ECE)

Rack: mechanical design, thermal and airflow analysis, VPTokens, OS and management (ME, SCS)

Power distribution and delivery (ECE)

http://img.all2all.net/main.php?g2_itemId=157

Datacenter and beyond: design, IT management, HVAC control… (ME, SCS, OIT…)

focus of our work:


Focus

Data-intensive applications that use distributed storage


Per-system Power Breakdown

[Figure: per-system power breakdown by component: CPU, memory, PCI slots, motherboard, disks, fan]


Approach to Power-Efficiency of Cluster

Power off entire nodes


Turning Off Nodes Breaks Conventional DFS


Modifications to Data Layout Policy

One replica of all data placed on a small set of nodes

Primary replica maintains availability, allowing nodes storing other replicas to be turned off [Sierra, Rabbit]
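A minimal sketch of this layout policy, under stated assumptions: the 10-node cluster, the 2-node primary set, 3-way replication, and the function names `place_replicas` and `available_blocks` are all hypothetical and do not come from the slides. Each block gets exactly one replica on a primary-set node, so every block stays readable when only the primary set is powered on.

```python
import random

def place_replicas(block_id, nodes, primary_nodes, replication=3, seed=0):
    """Return the nodes holding one block: one primary-set node plus others."""
    rng = random.Random(f"{seed}:{block_id}")   # deterministic per block
    primary = rng.choice(sorted(primary_nodes))
    others = sorted(set(nodes) - set(primary_nodes))
    return [primary] + rng.sample(others, replication - 1)

def available_blocks(layout, powered_on):
    """Blocks that still have at least one replica on a powered-on node."""
    on = set(powered_on)
    return {b for b, holders in layout.items() if set(holders) & on}

# Hypothetical cluster: 10 nodes, the first 2 form the always-on primary set.
nodes = [f"n{i}" for i in range(10)]
primary_nodes = nodes[:2]
layout = {b: place_replicas(b, nodes, primary_nodes) for b in range(100)}

# With only the primary set powered on, every block is still readable.
readable = available_blocks(layout, primary_nodes)
print(len(readable) == len(layout))
```

The non-primary replicas are spread at random here only for brevity; the cited systems use their own placement rules.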


Handling New Data

Where is new data to be written when part of the cluster is turned off?


New Data: Temporary Offloading


Temporary off-loading to ‘on’ nodes is a solution

Cost of additional copying of lots of data

Usage of network bandwidth

Increased complexity!!

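To make the copying cost concrete, here is a small sketch of temporary offloading. The class `OffloadingWriter`, its stand-in selection rule, and the node names and sizes are assumptions for illustration, not taken from the slides. A replica destined for a powered-off node is parked on a powered-on node and must cross the network a second time when the intended node wakes up.

```python
class OffloadingWriter:
    """Parks replicas meant for powered-off nodes on powered-on stand-ins."""

    def __init__(self, powered_on):
        self.powered_on = set(powered_on)
        self.pending = []          # (block_id, size, stand_in, intended)
        self.bytes_recopied = 0    # extra traffic caused by offloading

    def write(self, block_id, size, intended_nodes):
        """Return the nodes that actually receive the replicas right now."""
        on_nodes = sorted(self.powered_on)
        actual = []
        for node in intended_nodes:
            if node in self.powered_on:
                actual.append(node)
                continue
            # Prefer a stand-in that does not already hold this block.
            free = [n for n in on_nodes
                    if n not in actual and n not in intended_nodes]
            stand_in = free[0] if free else on_nodes[0]
            self.pending.append((block_id, size, stand_in, node))
            actual.append(stand_in)
        return actual

    def power_on(self, node):
        """Wake a node and migrate the replicas that were parked for it."""
        self.powered_on.add(node)
        remaining = []
        for entry in self.pending:
            if entry[3] == node:
                self.bytes_recopied += entry[1]   # second copy of the data
            else:
                remaining.append(entry)
        self.pending = remaining

writer = OffloadingWriter(powered_on=["n0", "n1", "n2"])
placed = writer.write("b1", size=64, intended_nodes=["n0", "n5", "n7"])
writer.power_on("n5")
print(placed, writer.bytes_recopied, len(writer.pending))
```

Even this toy version needs a pending list, a stand-in rule, and a migration step, which is the added complexity the slide points to.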


Handling Primary Failures

Failure of primary nodes causes a large number of nodes to be started up to restore availability

To solve this, additional groups with secondary, tertiary, etc. copies have to be made.

Again, increased complexity!!
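The wake-up cost of a primary failure can be illustrated with a small sketch. The function `nodes_to_wake`, the greedy selection, and the toy layout are assumptions, not the method used by any particular DFS. When the failed primary held the only powered-on replica of many blocks, and the remaining replicas are spread over many powered-off nodes, several nodes must be started to cover them.

```python
def nodes_to_wake(layout, failed, powered_on):
    """Greedily pick powered-off nodes that restore access to all blocks."""
    alive = set(powered_on) - {failed}
    # Blocks left without any replica on a powered-on node.
    lost = {b for b, holders in layout.items() if not set(holders) & alive}
    wake = []
    while lost:
        # For each powered-off node, the lost blocks it would recover.
        cover = {}
        for b in lost:
            for n in layout[b]:
                if n != failed and n not in alive:
                    cover.setdefault(n, set()).add(b)
        if not cover:
            break   # remaining blocks have no surviving replica
        best = max(sorted(cover), key=lambda n: len(cover[n]))
        wake.append(best)
        alive.add(best)
        lost -= cover[best]
    return wake

# Hypothetical layout: one primary "p0", other replicas spread over s0..s6.
layout = {b: ["p0", f"s{b}", f"s{b + 1}"] for b in range(6)}
print(nodes_to_wake(layout, failed="p0", powered_on=["p0"]))
# prints ['s1', 's3', 's5']
```

In this toy layout, losing one primary forces three other nodes to start.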


Making a DFS power-proportional increases its complexity significantly


Our Solution

Provide fine-grained control over what components to turn off


How do we save power?

Switch between two extreme power modes: max_perf and io_server
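A minimal sketch of switching between the two modes follows. Only the mode names `max_perf` and `io_server` come from the slides. The component names, the demand metric, the 0.2 threshold, and the functions `choose_mode` and `disks_reachable` are hypothetical.

```python
from enum import Enum

class PowerMode(Enum):
    MAX_PERF = "max_perf"
    IO_SERVER = "io_server"

# Assumed component states per mode (illustrative names only).
# Disks are powered in both modes, so stored data stays reachable.
COMPONENTS_ON = {
    PowerMode.MAX_PERF: {"high_power_cpu", "disks"},
    PowerMode.IO_SERVER: {"low_power_io_path", "disks"},
}

def choose_mode(compute_demand, threshold=0.2):
    """Pick a mode from a normalized compute demand in [0, 1]."""
    if compute_demand >= threshold:
        return PowerMode.MAX_PERF
    return PowerMode.IO_SERVER

def disks_reachable(mode):
    """True if the node can still serve its stored data in this mode."""
    return "disks" in COMPONENTS_ON[mode]

for demand in (0.05, 0.9):
    mode = choose_mode(demand)
    print(demand, mode.value, disks_reachable(mode))
```

A real policy would likely add hysteresis so that a node does not flip modes on every small load change; that is left out here for brevity.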


How does this keep the DFS simple?

Fine-grained control allows all disks to be kept on, maintaining access to stored data


Prototype Node Architecture

[Figure: SATA switch, Asterix node, Obelix node]


Prototype Node Architecture

[Figure: SATA switch, Asterix node, Obelix node, VMM]


max_perf Mode

[Figure: SATA switch, Asterix node, Obelix node, VM]


io_server Mode

[Figure: SATA switch, Asterix node, Obelix node, VM]


Increased Performance/Power

[Figure: throughput per watt (MB/s/W), y-axis 0 to 90, against the number of servers in max_perf mode (1 to 4); legend: Obelix, Asterix-II]


Virtualization Overhead: Reads

[Figure: throughput (MB/s), y-axis 0 to 90, on Obelix and Asterix; legend: Linux, domU, dom0, domU*]


Virtualization Overhead: Writes

[Figure: throughput (MB/s), y-axis 0 to 80, on Obelix and Asterix; legend: Linux, domU, dom0, domU*]


Summary

Turning entire nodes off complicates DFS

Good to be able to turn components off, or achieve more power-proportional platforms/components

Prototype uses separate machines and shared disks


Load Management Policies

Static

◦ e.g., DFS, DMS, monitoring/management tasks…

Dynamic

◦ e.g., based on runtime monitoring and management/scheduling…

◦ helpful to do power metering on per process/VM basis

X86+Atom+IB…


VM-level Power Metering: Our Approach

Built power profiles for various platform resources

◦ CPU, memory, cache, I/O…

Utilize low-level hardware counters to track resource utilization on per VM basis

◦ xenoprofile, IPMI, Xen tools…

◦ track sets of VMs separately

Maintain low/acceptable overheads while maintaining desired accuracy

◦ limit amount of necessary information, number of monitored events: use instructions retired/s and LLC misses/s only

◦ establish accuracy bounds

Apply monitored information to power model to determine VM power utilization at runtime

◦ in contrast to static purely profile-based approaches
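As an illustration of applying the two monitored rates to a power model at runtime, here is a minimal sketch. The slides do not give the form of the model, so the linear form, the coefficient values, and the names `PowerModel` and `meter` are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PowerModel:
    # Hypothetical coefficients, as if obtained from offline power profiling.
    watts_per_giga_inst: float    # watts per 1e9 instructions retired/s
    watts_per_mega_miss: float    # watts per 1e6 LLC misses/s

    def vm_power(self, inst_per_s, llc_miss_per_s):
        """Estimated dynamic power (watts) attributed to one VM."""
        return (self.watts_per_giga_inst * inst_per_s / 1e9
                + self.watts_per_mega_miss * llc_miss_per_s / 1e6)

def meter(model, samples):
    """samples: {vm: (instructions retired/s, LLC misses/s)} -> watts per VM."""
    return {vm: model.vm_power(i, m) for vm, (i, m) in samples.items()}

# Made-up counter readings for two VMs over one sampling interval.
model = PowerModel(watts_per_giga_inst=4.0, watts_per_mega_miss=0.5)
samples = {"vm1": (2e9, 10e6), "vm2": (0.5e9, 40e6)}
estimates = meter(model, samples)
print(estimates)
```

Using only two counters keeps the monitoring overhead low, at the price of model accuracy, which is why the slide pairs this choice with establishing accuracy bounds.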