Unexplored energy aspects of scalable heterogeneous computing systems Holger Fröning Computer Engineering Group Institute of Computer Engineering Ruprecht-Karls University of Heidelberg HiPEAC Computing Systems Week, Milano, IT, 23.09.2015

Transcript of "Unexplored energy aspects of scalable heterogeneous computing systems"

Page 1:

Unexplored energy aspects of scalable heterogeneous computing systems

Holger Fröning Computer Engineering Group

Institute of Computer Engineering Ruprecht-Karls University of Heidelberg

HiPEAC Computing Systems Week, Milano, IT, 23.09.2015

Page 2:

Abstract

The concurrency galore is currently defining computing at all levels, leading to a vast amount of parallelism even for small computing systems. Technology constraints prohibit a reversal of this trend, and the still unsatisfied need for more computing power has led to a pervasive use of accelerators to speed up computations. While this helped to overcome the end of Dennard scaling and significantly increased the energy efficiency of computations, communication is left unaddressed. However, current analyses show that a large fraction of the power consumption originates from communication. In addition, this fraction will actually increase for future technologies, making communication more expensive than computation in terms of energy and time. Amplified by the advent of Big Data, we are observing a fundamental transition to communication-centric systems composed of heterogeneous computing units. This talk will review some of our explorations and optimizations in the area of energy-efficient computing. While, for instance, cooling, power distribution, specialized processors and other aspects have already received plenty of attention, we focus instead on areas that are as yet unexplored but, in our opinion, also contribute significantly to overall power and energy consumption. The talk concludes with some observations about the impact of technology trends on energy and anticipated related research questions.

Page 3:

Post-Dennard performance scaling: The good part

• Multi-/Many-core trend won't revert
  - End of Dennard scaling
  - Technological constraints
  - Single-threaded execution model obsolete
• Technology diversity is pervasive
  - Various processors – x86, ARM, GPUs, MICs, FPGAs, …
  - Storage – DRAM, FLASH, SSD, spinning disks, …
  - Interconnect – {10,40,100}GE, iWARP, IB, …
  - Systems – IBM Blue Gene, Cray Blue Waters, …
  - Amazon EC2 – On-Demand GPUs
  - Upcoming technologies: PCM, photonics, die stacking, …

Page 4:

Post-Dennard performance scaling: The bleak part

Kathy Yelick, 2009

Page 5:

Power and Energy

• Energy consumption
  - Technological, economic and ecological constraints
  - OPEX and CAPEX
  - Not only affects Exascale
  - 20MW = ~20M$/yr
• You should know when you're off- or on-die
  - Note on optical interconnects
• Data movement is more expensive than computation

US DOE, Scientific Grand Challenges: Architectures and Technology for Extreme Scale Computing, San Diego, CA, 2009.

Page 6:

Exascale trend

• Who is still sure about that 20MW number?
• DOE goal
  - 20MW = 50 GFLOPs/Watt
  - Sustained, not theoretical peak
  - Best number today is approx. 30 GFLOPs/Watt (peak)
  - Sustained at scale: 5 GFLOPs/Watt
• Human brain
  - 20W
  - ~1E06x more energy-efficient

Year  Presenter       Statement
?     (forgot)        <10MW
2010  Craig Stunkel   <20-25MW
2010  Bill Dally      15-20MW
2011  Kathy Yelick    20MW
2012  William Harrod  20MW
2013  Horst Simon     20-30MW
2015  Keren Bergman   max. 100MW

The question is not when we will reach Exascale, but when it will come within 20MW.
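The DOE arithmetic on this slide can be checked in a few lines (a sketch; 1 EFLOP/s is the usual Exascale definition, and the 20MW budget and 5 GFLOPs/Watt sustained figure are the slide's numbers):

```python
# Sanity-check the slide's Exascale power arithmetic.
EXAFLOPS = 1e18   # 1 EFLOP/s, the Exascale target
BUDGET_W = 20e6   # DOE power budget: 20 MW

flops_per_watt = EXAFLOPS / BUDGET_W        # FLOP/s per Watt = FLOP per Joule
gflops_per_watt = flops_per_watt / 1e9      # required efficiency
energy_per_flop_pj = 1e12 / flops_per_watt  # Joules per FLOP, in picojoules

print(gflops_per_watt)     # 50.0 GFLOPs/Watt
print(energy_per_flop_pj)  # 20.0 pJ per FLOP

# Distance from today's sustained-at-scale 5 GFLOPs/Watt:
gap = gflops_per_watt / 5.0
print(gap)                 # 10.0x improvement still needed
```

So the 20MW goal demands roughly an order of magnitude beyond what was sustained at scale when this talk was given.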

Page 7:

(Another) fundamental transition

Fundamental transition to communication-centric systems composed of heterogeneous computing units

Multi-/Many-core revolution in combination with Big Data
Energy constraints, technology diversity
Need for specialization, improved sharing

Page 8:

Energy in scalable interconnection networks

Page 9:

Does it matter after all?

• Pitfall: don't make assumptions based on maximum power ratings
  - At TDP, processors outshine anything
  - But are processors always operating at 100% load?
• Energy-proportional: at x% load, a component should only consume x% energy

Quote: "It's a myth that interconnect power is important" (commercial company, panel presentation, mid 2015)

[Chart: Component power share at TDP – CPUs, GPU, Memory, Network]
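The energy-proportionality definition above can be made concrete with a toy linear power model (all numbers below are hypothetical, chosen only to illustrate the gap):

```python
def actual_power(load, p_idle, p_tdp):
    """Linear model of a real component: it draws its idle power even at
    0% load, plus a load-dependent share of the remaining TDP headroom."""
    return p_idle + load * (p_tdp - p_idle)

def proportional_power(load, p_tdp):
    """An ideal energy-proportional component: x% load -> x% of TDP."""
    return load * p_tdp

# Hypothetical network component: 15 W idle, 25 W at TDP.
p_idle, p_tdp = 15.0, 25.0
drawn = actual_power(0.30, p_idle, p_tdp)  # 18.0 W at 30% load
ideal = proportional_power(0.30, p_tdp)    # 7.5 W would be proportional
print(drawn / ideal)                       # 2.4x over the proportional ideal
```

The high idle floor is exactly what the chart's idle column exposes: at partial load, a non-proportional component burns a multiple of what proportionality would allow.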

Page 10:

It does!

• System power
  - Scalable energy-efficient network
  - Direct network, integrated switches
• Dynamic range of components
• Many memory-bound applications
  - E.g., emerging integer applications (R. Murphy, Sandia) & graph computations: DFS & BFS, Connected Components, Isomorphism, Shortest Path, Graph Partitioning, BLAST (alignment search), zChaff (satisfiability)
• Exception: compute-bound applications with perfect overlap

[Chart: Component power share at TDP and idle – CPUs, GPU, Memory, Network]

Page 11:

It does! (Seconding opinions)

• We need energy-proportional components
  - Processors have already improved significantly
• Lesson learned from embedded systems: everything matters
• Google paper on energy-proportional networks: savings of up to 50% on network power; at 32k nodes: 1.1MW for a folded Clos, 0.7MW for a flattened butterfly
• S. Rumley et al. (ISC 2015): networks continue to consume ~20% of system power even with optical links
• DOE Report on Top 10 Exascale Challenges: "Interconnect technology: Increasing the performance and energy efficiency of data movement"

Dennis Abts, Michael R. Marty, Philip M. Wells, Peter Klausler, and Hong Liu, "Energy proportional datacenter networks", ISCA '10, 2010.
S. Rumley et al., "Design Methodology for Optimizing Optical Interconnection Networks in High Performance Systems", ISC 2015.

Page 12:

A short analysis of application/system verbosity

• Verbosity (B/FLOP): inverse of operational intensity (FLOP/B)
• Inspired by Keren Bergman: Optical Interconnection Networks for Ultra-High Bandwidth Energy Efficient Data Movement in HPC, ISC 2015 session on On-Chip & Off-Chip Interconnection Networks for Future HPC Systems
• Examples today: 18pJ/bit (PCIe Gen3), ~25-30pJ/bit (electrical link), 10pJ/bit (optical cable, on top)
• None of these examples is energy-proportional!

Power budget [MWatt]          100    50     20
Energy efficiency [GFLOPs/J]  10     20     50
Energy per FLOP [pJ]          100    50     20
Network power (20%) [MWatt]   20     10     4
Network budget per FLOP [pJ]  20     10     4

Verbosity [B/FLOP]  Network budget per bit [pJ/bit]
1.000               2.5      1.3      0.5     (Amdahl's rule)
0.100               25.0     12.5     5.0     (Anticipated case, Sequoia)
0.017               147.1    73.5     29.4    (Titan)
0.001               2500.0   1250.0   500.0   (Tianhe-2)
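All rows of the table follow from one relation: the per-bit network budget is the per-FLOP network budget divided by the verbosity in bits per FLOP. A sketch that reproduces the table's cells (assuming, as the table does, a 1 EFLOP/s machine and a 20% network power share):

```python
def network_budget_per_bit_pj(power_budget_mw, verbosity_b_per_flop,
                              network_share=0.20, machine_eflops=1.0):
    """pJ/bit the network may spend, for a machine with the given total
    power budget, assuming it delivers machine_eflops EFLOP/s and the
    network takes network_share of the power."""
    flops = machine_eflops * 1e18
    energy_per_flop_pj = (power_budget_mw * 1e6 / flops) * 1e12
    network_pj_per_flop = network_share * energy_per_flop_pj
    bits_per_flop = 8.0 * verbosity_b_per_flop
    return network_pj_per_flop / bits_per_flop

# Reproduce some table cells:
print(network_budget_per_bit_pj(20, 1.0))     # 0.5 pJ/bit (20MW column, Amdahl's rule)
print(network_budget_per_bit_pj(100, 0.017))  # ~147.1 pJ/bit (100MW column, Titan)
```

Comparing these budgets with today's ~18-30 pJ/bit link costs shows why verbose applications on a tight power budget are in trouble.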

Page 13:

Where do my Joules go?

• Serialization technology dominates power consumption
  - Clock recovery, high frequencies, equalization, pre-emphasis, …

[Pie chart: Power share for NIC with integrated switch – Links (6): 71%, PCIe: 14%, Core: 15%]

[Bar chart: Link power scaling – power consumption (normalized) for 4, 8 and 12 lanes at 2.5, 5 and 10 GHz]

• It is link width that matters, not frequency
  - CML = Current Mode Logic
  - Linear scaling for the 10 GHz case
• This leads us to many research thrusts!

Page 14:

Research thrust 1: specialized communication

Page 15:

Beyond CPU-centric communication

[Diagram: Source node and target node, each with CPU, GPU, NIC, PCIe root, GPU memory and host memory; GPU-controlled Put/Get (IBVERBS); start-up latency of 1.5usec vs. 15usec; 100x]

“… a bad semantic match between communication primitives required by the application and those provided by the network.” - DOE Subcommittee Report, Top Ten Exascale Research Challenges. 02/10/2014

Page 16:

Allreduce – Power and Energy analysis

For this case: saved 50% of the energy.

Lena Oden, Benjamin Klenk and Holger Fröning, "Energy-Efficient Collective Reduce and Allreduce Operations on Distributed GPUs", 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2014), May 26-29, 2014, Chicago, IL, US.
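For context, allreduce combines a value from every node and leaves the result on all of them. A minimal host-side sketch of the classic recursive-doubling scheme (for illustration only; this is not the GPU-controlled implementation evaluated in the paper):

```python
def allreduce_recursive_doubling(values, op=lambda a, b: a + b):
    """Simulate recursive-doubling allreduce for a power-of-two node
    count: in round k, node i exchanges its partial result with node
    i XOR 2^k, so all nodes hold the full reduction after log2(n) rounds."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes power-of-two nodes"
    vals = list(values)
    k = 1
    while k < n:
        # Each node combines its value with its partner's for this round.
        vals = [op(vals[i], vals[i ^ k]) for i in range(n)]
        k *= 2
    return vals

print(allreduce_recursive_doubling([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

Every exchange round crosses the network, which is why moving control of these steps onto the GPU, as the paper does, affects both time and energy.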

Page 17:

[Charts: performance and energy consumption of GGAS and RMA normalized to MPI for the benchmarks nbody (small/large), sum (small/large), Himeno and randomAccess on 2-12 nodes]

Towards specialized communication models for heterogeneous systems

• 12 nodes (each with 2x Intel Ivy Bridge, Nvidia K20, EXTOLL FPGA)
• Normalized: >1 better performance, <1 worse performance
• We're currently working on bringing this solution to a wider user community

Benjamin Klenk, Lena Oden and Holger Fröning, "Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time", 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2015), Philadelphia, PA, March 29-31, 2015.

Page 18:

Research thrust 2: Integrated power models

Page 19:

Integrated Power Model

Page 20:

PCIe power

• Power is all about serialization, remember?
  - PCIe for HPC is typically 16x; communication is bulky
  - Nodes will grow wider with more memory and cores
• Newer PCIe 3.0 & 4.0 systems support L0s "standby" states
  - Active State Power Management (ASPM)
• Experimental analysis yields frustrating results => alternative methodology required

Jeffrey Young and Richard Vuduc, "The Near-Term Implications of Network Low-Power States and Next-Generation Interconnects on Power Modeling", Workshop on Modeling and Simulation of Systems and Applications (MODSIM), 2015.

Page 21:

Extending network simulation for power

• Power-aware network simulation by extending OMNeT++
  - Power states for each link
  - Electrical and optical links
  - State selection logic in front of each link
  - Various policies possible; transition time matters!
  - Using traces, not synthetic traffic
• Joint effort with Pedro Garcia et al. (UCLM) & Jeff Young (GT)
  - InfiniBand, Ethernet, EXTOLL, PCIe
• Interested? Talk to us!
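One of the "various policies" mentioned above can be sketched as an idle-timeout rule: a link drops to a low-power state after being idle for longer than a threshold and pays a wake-up delay on the next packet. State names and all numbers below are hypothetical, purely for illustration:

```python
class IdleTimeoutLink:
    """Toy state-selection logic for a single link: the link is ACTIVE
    while packets keep arriving and drops to LOW_POWER once it has been
    idle longer than `timeout`; waking up adds `wake_delay` to the next
    packet's latency."""

    def __init__(self, timeout, wake_delay):
        self.timeout = timeout
        self.wake_delay = wake_delay
        self.state = "ACTIVE"
        self.last_tx = 0.0

    def on_packet(self, t):
        # If the idle gap exceeded the timeout, the link had gone to sleep.
        if self.state == "ACTIVE" and t - self.last_tx > self.timeout:
            self.state = "LOW_POWER"
        penalty = 0.0
        if self.state == "LOW_POWER":
            penalty = self.wake_delay  # pay the state-transition time
            self.state = "ACTIVE"
        self.last_tx = t
        return penalty                 # extra latency for this packet

link = IdleTimeoutLink(timeout=1e-6, wake_delay=10e-6)
p = [link.on_packet(t) for t in (0.2e-6, 0.5e-6, 5e-6)]
print(p)  # [0.0, 0.0, 1e-05] - only the packet after the long gap pays
```

This is exactly the trade-off the simulation explores: a shorter timeout saves more power but exposes more packets to the wake-up penalty.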

[Bar chart (repeated from page 13): Link power scaling – power consumption (normalized) for 4, 8 and 12 lanes at 2.5, 5 and 10 GHz]

Page 22:

Research questions

• Dynamic range – which power state granularity is required?
  - It seems that wide links provide better opportunities
• State transitioning – how much time can we tolerate?
  - Reported numbers vary from O(1) ns over 10 us to milliseconds
  - Can we buffer or predict traffic well enough?
• Compensating potential oversubscription – which techniques do we need to tolerate congestion?
  - Traditional congestion management? Path diversity? Adaptive routing?
• Predictability – what are the key characteristics to predict power consumption?
  - Can I tell power consumption from looking at the code?
  - Simulation or modeling?
  - Performance modeling languages like ASPEN seem very promising
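The state-transitioning question above has a simple break-even form: entering a sleep state only saves energy when the idle gap is long enough that the power saved outweighs the energy spent transitioning in and out. A sketch with hypothetical numbers:

```python
def break_even_idle_time(p_active, p_sleep, e_transition):
    """Minimum idle duration (seconds) for which sleeping pays off:
    savings accrue at (p_active - p_sleep) Watts, while one round trip
    into and out of the sleep state costs e_transition Joules."""
    return e_transition / (p_active - p_sleep)

# Hypothetical link: 10 W active, 2 W asleep, 40 uJ per enter+exit cycle.
t_min = break_even_idle_time(10.0, 2.0, 40e-6)
print(t_min)  # 5e-06: gaps shorter than 5 us waste energy by sleeping
```

This is why the reported transition times (ns versus ms) matter so much: they set which idle gaps in real traces are even worth exploiting.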

Page 23:

Conclusion

Page 24:

Past? Future? History usually repeats.

Solves power for processing, not necessarily power for data movement.

Page 25:

Summary

• Exascale networks
  - Performance is mainly a question of costs (IMHO)
  - Resilience is challenging!
  - Power contribution is key for sustainable Exascale computing
• We anticipate huge efforts to improve energy efficiency for processing and memory
  - Heterogeneity
  - Processing-in-memory (PIM)
  - => The network's power fraction increases!
• Don't assume that all HPC workloads will result in high loads
  - Memory-bound applications, graph traversals and computations, …
• Energy-proportional networks matter!
• In general: it will be all about data movement; computation will be almost for free

Page 26:

Credits

Contributions: Lena Oden (former PhD student), Benjamin Klenk (PhD student), Alexander Matz (PhD student), Felix Zahn (PhD student)
Discussions: Sudha Yalamanchili (Georgia Tech), Jeff Young (Georgia Tech), Pedro Garcia et al. (UCLM), Maximilian Thürmer & Markus Müller (Heidelberg University)
Sponsoring: Nvidia, Xilinx, German Excellence Initiative, Google

Current main interactions

Thank you!