
SPAA 2013 keynote presentation

Transcript

Page 1: Keynote snir spaa

Supercomputing: Technical Evolution & Programming Models

Marc Snir, Argonne National Laboratory & University of Illinois at Urbana-Champaign

Page 2: Keynote snir spaa

Introduction


Page 3: Keynote snir spaa

Theory of Punctuated Equilibrium (Eldredge, Gould, Mayr…)

§ Evolution consists of long periods of equilibrium, with little change, interspersed with short periods of rapid change.
  – Mutations are diluted in large populations in equilibrium – a homogenizing effect prevents the accumulation of multiple changes
  – Small, isolated populations under heavy natural selection pressure evolve rapidly and new species can appear
  – Major cataclysms can be a cause for rapid change
§ Punctuated Equilibrium is a good model for technology evolution:
  – Revolutions are hard in large markets with network effects and technology that evolves gradually
  – Changes can be much faster when small, isolated product markets are created, or when current technology hits a wall (cataclysm)
§ (Not a new idea: e.g., Levinthal 1998, The Slow Pace of Rapid Technological Change: Gradualism and Punctuation in Technological Change)


Page 4: Keynote snir spaa

Why it Matters to SPAA (and PODC)

§ Periods of paradigm shift generate a rich set of new problems (new low-hanging fruit?)
  – It is a time when good theory can help
§ E.g., Internet, Wireless, Big Data
  – Punctuated evolution due to the appearance of new markets
§ Hypothesis: HPC now and, ultimately, much of IT are entering a period of fast evolution: please prepare


Page 5: Keynote snir spaa

Where Analogy with Biological Evolution Breaks Down

§ Technology evolution can be accelerated by genetic engineering
  – Technology developed in one market is exploited in another market
  – E.g., Internet or wireless were enabled by cheap microprocessors, telephony technology, etc.
§ "Genetic engineering" has been essential for HPC in the last 25 years:
  – Progress enabled by reuse of technologies from other markets (micros, GPUs…)


Page 6: Keynote snir spaa

Past & Present


Page 7: Keynote snir spaa

Evidence of Punctuated Equilibrium in HPC


[Chart: core count of the leading Top500 system over time, log scale from 1 to 10,000,000 cores, with annotations for the "attack of the killer micros", multicore, accelerators, and SPAA.]

Page 8: Keynote snir spaa

1990: The Attack of the Killer Micros (Eugene Brooks, 1990)

§ Shift from ECL vector machines to clusters of MOS micros
  – Cataclysm: bipolar evolution reached its limits (nitrogen cooling, gallium arsenide…); MOS was on a fast evolution path
  – MOS had its niche markets: controllers, workstations, PCs
  – Classical example of "good enough, cheaper technology" (Christensen, The Innovator's Dilemma)


Page 9: Keynote snir spaa

2002: Multicore

§ Clock speed stopped increasing; very little return on added CPU complexity; chip density continued to increase
  – Technology push – not market pull
  – Still has limited success


Page 10: Keynote snir spaa

2010: Accelerators

§ New market (graphics) created ecological niche
§ Technology transplanted into other markets (signal processing/vision, scientific computing)
  – Advantage of better power/performance ratio (less logic)
§ Technology still changing rapidly: integration with CPU and evolving ISA


Page 11: Keynote snir spaa

Were the (R)evolutions Successful in HPC?

§ Killer Micros: Yes
  – Totally replaced vector machines
  – All HPC codes enabled for message-passing (MPI)
  – Took > 10 years and > $1B govt. investment (DARPA)
§ Multicore: Incomplete
  – Many codes still use one MPI process per core – use shared memory for message-passing
  – Use of two programming models (MPI+OpenMP) is burdensome (see the sketch below)
  – PGAS is not used, and does not provide (so far) a real advantage over MPI
  – Many open issues on scaling multithreading models (OpenMP, TBB, Cilk…) and combining them with message-passing
  – (See history of large-scale NUMA – which did not become a viable species)
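To make the two-model burden concrete, here is a minimal MPI+OpenMP sketch (illustrative only; it assumes an MPI library providing at least MPI_THREAD_FUNNELED and uses a made-up local computation):

```c
/* Minimal hybrid sketch: one MPI process per node, OpenMP threads within it.
 * Two models must be mixed even for a trivial global reduction. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0;
    /* Intra-node parallelism: shared-memory threads. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (1.0 + i + rank);   /* stand-in for real work */

    /* Inter-node parallelism: explicit message passing. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f across %d ranks\n", global, nprocs);
    MPI_Finalize();
    return 0;
}
```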


Page 12: Keynote snir spaa

Were the (R)evolutions Successful? (2)

§ Accelerators: Just beginning
  – Few HPC codes converted to use GPUs
§ Obstacles:
  – Technology still changing fast (integration of GPU with CPU, continued changes in ISA)
  – No good non-proprietary programming systems are available, and their long-term viability is uncertain


Page 13: Keynote snir spaa

Key Obstacles

§ Scientific codes live much longer than computer systems (two decades or more); they need to be ported across successive HW generations
§ The amount of code to be ported continuously increases (major scientific codes each have > 1 MLOC)
§ Need very efficient, well-tuned codes (HPC platforms are expensive)
§ Need portability across platforms (HPC programmers are expensive)
§ Squaring the circle?
§ The lack of performant, portable programming models has become the major impediment to the evolution of HPC hardware


Page 14: Keynote snir spaa

Did Theory Help?

§ Killer Micros: Helped by work on scalable algorithms and on interconnects
§ Multicore: Helped by work on communication complexity (efficient use of caches)
  – Very little use of work on coordination algorithms or transactional memory
§ Accelerators: Cannot think of relevant work
  – Interesting question: power of branching & power of indirection
  – Surprising result: AKS sorting network
§ Too often, theory follows practice, rather than preceding it.


Page 15: Keynote snir spaa

Future


Page 16: Keynote snir spaa

The End of Moore’s Law is Coming

§ Moore's Law: the number of transistors per chip doubles every two years
§ Stein's Law: if something cannot go on forever, it will stop
§ The question is not whether but when Moore's Law will stop
  – It is difficult to make predictions, especially about the future (Yogi Berra)


Page 17: Keynote snir spaa

Current Obstacle: Current Leakage

§ Transistors do not shut off completely
  "While power consumption is an urgent challenge, its leakage or static component will become a major industry crisis in the long term, threatening the survival of CMOS technology itself, just as bipolar technology was threatened and eventually disposed of decades ago."
  – International Technology Roadmap for Semiconductors (ITRS), 2011
§ The ITRS "long term" is the 2017-2024 timeframe.
§ No "good enough" technology waiting in the wings


Page 18: Keynote snir spaa

Longer-Term Obstacle

§ Quantum effects totally change the behavior of transistors as they shrink
  – 7-5 nm feature size is predicted to be the lower limit for CMOS devices
  – ITRS predicts 7.5 nm will be reached in 2024

   


Page 19: Keynote snir spaa

The 7nm Wall


(courtesy  S.  Dosanjh)  

Page 20: Keynote snir spaa

The Future Is Not What It Was


(courtesy  S.  Dosanjh)  

Page 21: Keynote snir spaa

Progress Does Not Stop

§ It becomes more expensive and slows down
  – New materials (e.g., III-V, germanium thin channels, nanowires, nanotubes or graphene)
  – New structures (e.g., 3D transistor structures)
  – Aggressive cooling
  – New packages
§ More invention at the architecture level
§ Seeking value from features other than speed (More than Moore)
  – System on a chip – integration of analog and digital
  – MEMS…
§ Beyond Moore? (Quantum, biological…) – beyond my horizon


Page 22: Keynote snir spaa

Exascale


Page 23: Keynote snir spaa

Supercomputer Evolution

§ ×1,000 performance increase every 11 years
  – ×50 faster than Moore's Law
§ Extrapolation predicts exaflop/s (10^18 floating point operations per second) before 2020 (see the back-of-the-envelope estimate below)
  – We are now at 50 Petaflop/s
§ Extrapolation may not work if Moore's Law slows down
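A back-of-the-envelope version of the extrapolation, using only the figures above: a factor of 1,000 in 11 years means performance doubles roughly every

$$ \frac{11}{\log_2 1000} \approx 1.1 \ \text{years}. $$

Going from the 2013 level of 50 Pflop/s to 1 Eflop/s requires a further factor of 20, i.e. about $1.1 \times \log_2 20 \approx 4.8$ years, which is where the "before 2020" prediction comes from.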


Page 24: Keynote snir spaa

Do We Care?

§ "It's all about Big Data now; simulations are passé."
§ B***t
§ "All science is either physics or stamp collecting." (Ernest Rutherford)
  – In the physical sciences, experiments and observations exist to validate/refute/motivate theory. "Data mining" not driven by a scientific hypothesis is "stamp collecting".
§ Simulation is needed to go from a mathematical model to predictions on observations.
  – If the system is complex (e.g., climate), then simulation is expensive
  – Predictions are often statistical – complicating both simulation and data analysis

 


Page 25: Keynote snir spaa

Observation Meets Data / Computation Meets Data: The Argonne View (Cosmology)

[Figure, courtesy Salman Habib: the Cosmic Calibration Framework (CCF) – mapping the sky with survey instruments (LSST weak lensing; observational statistical error bars will "disappear" soon), a supercomputer simulation campaign (HACC = Hardware/Hybrid Accelerated Cosmology Code(s)), an emulator based on Gaussian process interpolation in high-dimensional spaces ("precision oracle"), and Markov chain Monte Carlo "cosmic calibration". HACC+CCF combines domain science, CS, math, statistics and machine learning.]

Record-breaking application: 3.6 trillion particles, 14 Pflop/s

Page 26: Keynote snir spaa

Exascale Design Point: 202x with a cap of $200M and 20 MW

Systems                    | 2012 BG/Q Computer       | 2020-2024            | Difference Today & 2019
System peak                | 20 Pflop/s               | 1 Eflop/s            | O(100)
Power                      | 8.6 MW                   | ~20 MW               |
System memory              | 1.6 PB (16*96*1024)      | 32-64 PB             | O(10)
Node performance           | 205 GF/s (16*1.6GHz*8)   | 1.2 or 15 TF/s       | O(10) - O(100)
Node memory BW             | 42.6 GB/s                | 2-4 TB/s             | O(100)
Node concurrency           | 64 threads               | O(1k) or 10k         | O(100) - O(1000)
Total node interconnect BW | 20 GB/s                  | 200-400 GB/s         | O(10)
System size (nodes)        | 98,304 (96*1024)         | O(100,000) or O(1M)  | O(100) - O(1000)
Total concurrency          | 5.97 M                   | O(billion)           | O(1,000)
MTTI                       | 4 days                   | O(<1 day)            | - O(10)

Both price and power envelopes may be too aggressive!

Page 27: Keynote snir spaa

Identified Issues

§ Scale (billion threads)
§ Power (tens of MWatts)
  – Communication: > 99% of power is consumed by moving operands across the memory hierarchy and across nodes
  – Reduced memory size (communication in time)
§ Resilience: something fails every hour; the machine is never "whole"
  – Trade-off between power and resilience
§ Asynchrony: equal work ≠ equal time
  – Power management
  – Error recovery


Page 28: Keynote snir spaa

Other Issues

§ Uncertainty about underlying HW architecture
  – Fast evolution of architecture (accelerators, 3D memory and processing near memory, NVRAM)
  – Uncertainty about the market that will supply components to HPC
  – Possible divergence from commodity markets
§ Increased complexity of software
  – Simulations of complex systems + uncertainty quantification + optimization…
  – Software management of power and failure
  – Scale and tight coupling (tail of distribution matters!)


Page 29: Keynote snir spaa

Research Areas


Page 30: Keynote snir spaa

Scale

§ HPC algorithms are being designed for a 2-level hierarchy (node, global); can they be designed for a multi-level hierarchy? Can they be "hierarchy-oblivious"? (See the sketch below.)
§ Can we have a programming model that abstracts the specific HW mechanisms at each level (message-passing, shared memory) yet can leverage these mechanisms efficiently?
  – Global shared object space + caching + explicit communication
  – Multilevel programming (compilation with human in the loop)
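One way to make "hierarchy-oblivious" concrete, in the spirit of cache-oblivious algorithms (an illustrative sketch, not a proposal from the talk): a recursive reduction never mentions the number of levels, so a runtime can cut its recursion tree at whatever granularities the machine hierarchy offers (nodes, sockets, cores, SIMD lanes).

```c
/* Hierarchy-oblivious reduction: recursive halving exposes independent
 * work at every scale; no level count appears in the code. */
#include <stdio.h>

static double reduce(const double *a, long n)
{
    if (n == 1)
        return a[0];
    long half = n / 2;
    /* The two halves are independent; a tasking runtime (Cilk, TBB,
     * OpenMP tasks) may map them onto any level of the hierarchy. */
    double left  = reduce(a, half);
    double right = reduce(a + half, n - half);
    return left + right;
}

int main(void)
{
    double a[1024];
    for (int i = 0; i < 1024; i++)
        a[i] = 1.0;
    printf("sum = %.1f\n", reduce(a, 1024));
    return 0;
}
```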


Page 31: Keynote snir spaa

Communication

§ Communication-efficient algorithms
§ A better understanding of fundamental communication-computation tradeoffs for PDE solvers (getting away from DAG-based lower bounds; tradeoffs between communication and convergence rate) – see the example below
§ Programming models, libraries and languages where communication is a first-class citizen (other than MPI)
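A canonical instance of the DAG-based lower bounds referred to above (a standard result, quoted here for illustration): for classical dense matrix multiplication with a fast memory of size $M$, the Hong–Kung bound on data movement between fast and slow memory is

$$ W = \Omega\!\left(\frac{n^3}{\sqrt{M}}\right), $$

and its distributed-memory analogue gives $\Omega(n^2/\sqrt{P})$ words communicated per processor when each of the $P$ processors holds $\Theta(n^2/P)$ data. Such bounds are tied to the fixed DAG of the classical algorithm; for iterative PDE solvers the DAG itself can change, trading communication against convergence rate.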

   


Page 32: Keynote snir spaa

Resilient Distributed Systems

§ E.g., a parallel file system with 768 I/O nodes and > 50K disks
  – Systems are built to tolerate disk and node failures
  – However, most failures in the field are due to "performance bugs": e.g., time-outs due to thrashing
§ How do we build feedback mechanisms that ensure stability? (control theory for large-scale, discrete systems)
§ How do we provide quality of service?
§ What is a quantitative theory of resilience? (e.g., impact of failure rate on overall performance – see the checkpointing example below)
  – Focus on systems where failures are not exceptional
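One existing quantitative handle on the failure-rate question (a standard first-order result, offered as an example rather than as part of the talk) is Young's formula for the optimal checkpoint interval. If taking a checkpoint costs time $C$ and the mean time between failures is $M$, the interval minimizing expected overhead is

$$ \tau_{\mathrm{opt}} \approx \sqrt{2\,C\,M}, $$

and the fraction of time lost to checkpointing and rework then grows roughly like $\sqrt{2C/M}$, making explicit how overall performance erodes as $M$ drops from days to hours.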


Page 33: Keynote snir spaa

Resilient Parallel Algorithms – Overcoming Silent Data Corruptions

§ SDCs may be unavoidable in future large systems (due to flips in computation logic)
§ Intuition: an SDC can either
  – Type 1: grossly violate the computation model (e.g., jump to a wrong address, message sent to the wrong node), or
  – Type 2: introduce noise in the data (bit flip in a large array)
§ Many iterative algorithms can tolerate infrequent type 2 errors (see the sketch below)
§ Type 1 errors are often catastrophic and easy to detect in software
§ Can we build systems that avoid or correct easy-to-detect (type 1) errors and tolerate hard-to-detect (type 2) errors?
§ What is the general theory of fault-tolerant numerical algorithms?
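A minimal sketch of why infrequent type 2 errors are often tolerable (illustrative code under assumed conditions, not from the talk): Jacobi iteration on a strictly diagonally dominant system is a contraction, so a corrupted entry of the iterate injected mid-run is damped out by subsequent iterations, while a cheap residual check also flags the gross disturbance.

```c
/* Jacobi iteration on a strictly diagonally dominant system.
 * A "type 2" error (one corrupted entry of x) injected mid-run is
 * absorbed because the fixed-point iteration contracts the error. */
#include <stdio.h>
#include <math.h>

#define N 64

int main(void)
{
    static double A[N][N], b[N], x[N], xnew[N];   /* zero-initialized */

    /* Simple diagonally dominant test problem. */
    for (int i = 0; i < N; i++) {
        b[i] = 1.0;
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 2.0 * N : -1.0;
    }

    for (int it = 0; it < 200; it++) {
        for (int i = 0; i < N; i++) {
            double s = b[i];
            for (int j = 0; j < N; j++)
                if (j != i) s -= A[i][j] * x[j];
            xnew[i] = s / A[i][i];
        }
        for (int i = 0; i < N; i++)
            x[i] = xnew[i];

        if (it == 50)
            x[N / 2] += 1.0e6;   /* inject a silent data corruption */

        /* Residual norm: a cheap detector for gross disturbances. */
        double r2 = 0.0;
        for (int i = 0; i < N; i++) {
            double ri = b[i];
            for (int j = 0; j < N; j++)
                ri -= A[i][j] * x[j];
            r2 += ri * ri;
        }
        if (it % 25 == 0 || it == 51)
            printf("iter %3d  residual %.3e\n", it, sqrt(r2));
    }
    return 0;
}
```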


Page 34: Keynote snir spaa

Asynchrony

§ What is a measure of asynchrony tolerance?
  – Moving away from the qualitative (e.g., wait-free) to the quantitative:
  – How much do intermittently slow processes slow down the entire computation – on average? (See the estimate below.)
§ What are the trade-offs between synchronicity and computation work?
§ Load balancing, driven not by uncertainty about the computation, but by uncertainty about the computer
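One illustration of the kind of quantitative statement being asked for (a standard order-statistics estimate under an assumed noise model, not a result from the talk): if a bulk-synchronous step must wait for $p$ processes whose completion times are a deterministic part $t_0$ plus i.i.d. exponential noise with mean $\mu$, the expected step time is

$$ \mathbb{E}[T_{\text{step}}] = t_0 + \mu H_p = t_0 + \mu \sum_{k=1}^{p} \frac{1}{k} \;\approx\; t_0 + \mu \ln p, $$

so modest per-process jitter is amplified logarithmically in the process count at every global synchronization.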


Page 35: Keynote snir spaa

Architecture-Specific Algorithms

§ GPUs/accelerators
§ Hybrid Memory Cube / near-memory computing
§ NVRAM
  – E.g., flash memory


Page 36: Keynote snir spaa

Portable Performance

§ Can we redefine compilation so that:
  – It supports well a human in the loop (manual high-level decisions vs. automated low-level transformations)
  – It integrates auto-tuning and profile-guided compilation
  – It preserves high-level code semantics
  – It preserves high-level code "performance semantics"


Principle: high-level code → "compilation" → low-level, platform-specific codes

Practice: Code A, Code B, Code C – manual conversion, "ifdef" spaghetti (see the caricature below)
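A caricature of the "ifdef" spaghetti for a single small kernel (hypothetical build macros and a hypothetical axpy_cuda wrapper; a sketch of the practice, not any particular production code):

```c
/* One kernel, platform-specific bodies selected at compile time.
 * Each new platform multiplies the number of variants to maintain. */
void axpy(int n, double a, const double *x, double *y)
{
#if defined(USE_CUDA)
    /* Hypothetical wrapper around a device kernel defined in a .cu file. */
    extern void axpy_cuda(int n, double a, const double *x, double *y);
    axpy_cuda(n, a, x, y);
#elif defined(USE_OPENMP)
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
#else
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
#endif
}
```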

Page 37: Keynote snir spaa

Conclusion

§ Moore's Law is slowing down; the slow-down has many fundamental consequences – only a few of them explored in this talk
§ HPC is the "canary in the mine":
  – Issues appear earlier because of size and tight coupling
§ Optimistic view of the next decades: a frenzy of innovation to continue pushing the current ecosystem, followed by a frenzy of innovation to use totally different compute technologies
§ Pessimistic view: the end is coming
