Keynote snir spaa

Supercomputing: Technical Evolution & Programming Models

Marc Snir, Argonne National Laboratory & University of Illinois at Urbana-Champaign

Introduction

July 13

MCS -- Marc Snir


Theory of Punctuated Equilibrium (Eldredge, Gould, Mayr…)

§  Evolution consists of long periods of equilibrium, with little change, interspersed with short periods of rapid change.
–  Mutations are diluted in large populations in equilibrium; the homogenizing effect prevents the accumulation of multiple changes
–  Small, isolated populations under heavy natural selection pressure evolve rapidly, and new species can appear
–  Major cataclysms can be a cause for rapid change

§  Punctuated Equilibrium is a good model for technology evolution:
–  Revolutions are hard in large markets with network effects and technology that evolves gradually
–  Changes can be much faster when small, isolated product markets are created, or when current technology hits a wall (cataclysm)

§  (Not a new idea: e.g., Levinthal 1998: The Slow Pace of Rapid Technological Change: Gradualism and Punctuation in Technological Change)



Why it Matters to SPAA (and PODC)

§  Periods of paradigm shift generate a rich set of new problems (new low-hanging fruit?)
–  It is a time when good theory can help

§  E.g., Internet, Wireless, Big data
–  Punctuated evolution due to the appearance of new markets

§  Hypothesis: HPC now and, ultimately, much of IT are entering a period of fast evolution: Please prepare


Where Analogy with Biological Evolution Breaks Down

§  Technology evolution can be accelerated by genetic engineering
–  Technology developed in one market is exploited in another market
–  E.g., Internet or wireless were enabled by cheap microprocessors, telephony technology, etc.

§  “Genetic engineering” has been essential for HPC in the last 25 years:
–  Progress enabled by reuse of technologies from other markets (micros, GPUs…)


Past & Present


Evidence of Punctuated Equilibrium in HPC


[Figure: Core count of the leading Top500 system, log scale from 1 to 10,000,000, annotated with three transitions (attack of the killer micros, multicore, accelerators) and a marker labeled SPAA]

1990: The Attack of the Killer Micros (Eugene Brooks, 1990)

§  Shift from ECL vector machines to clusters of MOS micros
–  Cataclysm: bipolar evolution reached its limits (nitrogen cooling, gallium arsenide…); MOS was on a fast evolution path
–  MOS had its niche markets: controllers, workstations, PCs
–  Classical example of “good enough, cheaper technology” (Christensen, The Innovator’s Dilemma)


2002: Multicore

§  Clock speed stopped increasing; very little return on added CPU complexity; chip density continued to increase
–  Technology push, not market pull
–  Still has limited success


2010: Accelerators

§  New market (graphics) created ecological niche

§  Technology transplanted into other markets (signal processing/vision, scientific computing)
–  Advantage of better power/performance ratio (less logic)

§  Technology still changing rapidly: integration with CPU and evolving ISA


Were the (R)evolutions Successful in HPC?

§  Killer Micros: Yes
–  Totally replaced vector machines
–  All HPC codes enabled for message-passing (MPI)
–  Took > 10 years and > $1B govt. investment (DARPA)

§  Multicore: Incomplete
–  Many codes still use one MPI process per core, using shared memory for message-passing
–  Use of two programming models (MPI+OpenMP) is burdensome
–  PGAS is not used, and does not provide (so far) a real advantage over MPI
–  Many open issues on scaling multithreading models (OpenMP, TBB, Cilk…) and combining them with message-passing
–  (See history of large-scale NUMA, which did not become a viable species)


Were the (R)evolutions Successful? (2)

§  Accelerators: Just beginning
–  Few HPC codes converted to use GPUs

§  Obstacles:
–  Technology still changing fast (integration of GPU with CPU, continued changes in ISA)
–  No good non-proprietary programming systems are available, and their long-term viability is uncertain


Key Obstacles

§  Scientific codes live much longer than computer systems (two decades or more); they need to be ported across successive HW generations

§  Amount of code to be ported continuously increases (major scientific codes each have > 1 MLOC)

§  Need very efficient, well-tuned codes (HPC platforms are expensive)

§  Need portability across platforms (HPC programmers are expensive)

§  Squaring the circle?

§  Lack of well-performing, portable programming models has become the major impediment to the evolution of HPC hardware


Did Theory Help?

§  Killer Micros: Helped by work on scalable algorithms and on interconnects

§  Multicore: Helped by work on communication complexity (efficient use of caches)
–  Very little use of work on coordination algorithms or transactional memory

§  Accelerators: Cannot think of relevant work
–  Interesting question: power of branching & power of indirection
–  Surprising result: AKS sorting network

§  Too often, theory follows practice, rather than preceding it.


Future


The End of Moore’s Law is Coming

§  Moore’s Law: The number of transistors per chip doubles every two years

§  Stein’s Law: If something cannot go on forever, it will stop

§  The question is not whether, but when, Moore’s Law will stop
–  It is difficult to make predictions, especially about the future (Yogi Berra)


Current Obstacle: Current Leakage

§  Transistors do not shut off completely

“While power consumption is an urgent challenge, its leakage or static component will become a major industry crisis in the long term, threatening the survival of CMOS technology itself, just as bipolar technology was threatened and eventually disposed of decades ago”
International Technology Roadmap for Semiconductors (ITRS) 2011

§  The ITRS “long term” is the 2017-2024 timeframe.

§  No “good enough” technology waiting in the wings


Longer-Term Obstacle

§  Quantum effects totally change the behavior of transistors as they shrink
–  7-5 nm feature size is predicted to be the lower limit for CMOS devices
–  ITRS predicts 7.5 nm will be reached in 2024


The 7nm Wall

24 July 2013

ANL-LBNL-ORNL-PNNL

(courtesy S. Dosanjh)

The Future Is Not What It Was

(courtesy S. Dosanjh)

Progress Does Not Stop

§  It becomes more expensive and slows down
–  New materials (e.g., III-V, germanium thin channels, nanowires, nanotubes or graphene)
–  New structures (e.g., 3D transistor structures)
–  Aggressive cooling
–  New packages

§  More invention at the architecture level

§  Seeking value from features other than speed (More than Moore)
–  System on a chip: integration of analog and digital
–  MEMS…

§  Beyond Moore? (Quantum, biological…): beyond my horizon


Exascale


Supercomputer Evolution

§  ×1,000 performance increase every 11 years
–  ×50 faster than Moore’s Law

§  Extrapolation predicts exaflop/s (10^18 floating point operations per second) before 2020
–  We are now at 50 Petaflop/s

§  Extrapolation may not work if Moore’s Law slows down
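The extrapolation can be checked with a few lines of arithmetic. The sketch below assumes constant exponential growth at the quoted rate (×1,000 every 11 years) and takes 50 Pflop/s as the starting point; the function name and the choice of 2013 as the base year (the date of the talk) are assumptions made for illustration.

```python
import math

def years_to_reach(target_flops, current_flops, factor=1000.0, period_years=11.0):
    """Years needed to grow from current_flops to target_flops, assuming
    performance multiplies by `factor` every `period_years` (constant rate)."""
    return period_years * math.log(target_flops / current_flops) / math.log(factor)

current = 50e15                            # 50 Pflop/s, the figure quoted on the slide
target = 1e18                              # 1 exaflop/s
annual_growth = 1000.0 ** (1.0 / 11.0)     # implied year-over-year factor
years = years_to_reach(target, current)

base_year = 2013                           # assumption: year of the talk
print(f"annual growth factor: {annual_growth:.2f}")
print(f"years from 50 Pflop/s to 1 Eflop/s: {years:.1f}")
print(f"projected year: {base_year + years:.0f}")
```

Under these assumptions the trend line crosses one exaflop/s a little under five years after the starting point, which is consistent with "before 2020"; the last bullet explains why the trend itself may not hold.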


Do We Care?

§  It’s all about Big Data now; simulations are passé.

§  B***t

§  “All science is either physics or stamp collecting.” (Ernest Rutherford)
–  In Physical Sciences, experiments and observations exist to validate/refute/motivate theory. “Data Mining” not driven by a scientific hypothesis is “stamp collection”.

§  Simulation is needed to go from a mathematical model to predictions on observations.
–  If system is complex (e.g., climate) then simulation is expensive
–  Predictions are often statistical, complicating both simulation and data analysis


Observation Meets Data: Cosmology
Computation Meets Data: The Argonne View

[Figure: workflow linking sky surveys and simulation. Panel labels: Mapping the Sky with Survey Instruments (LSST Weak Lensing); Observations: Statistical error bars will ‘disappear’ soon!; Supercomputer Simulation Campaign; Emulator based on Gaussian Process Interpolation in High-Dimensional Spaces; Markov chain Monte Carlo; ‘Precision Oracle’; ‘Cosmic Calibration’. Curves labeled w = -1 and w = -0.9.]

HACC = Hardware/Hybrid Accelerated Cosmology Code(s)
CCF = Cosmic Calibration Framework
HACC+CCF (Domain science + CS + Math + Stats + Machine learning)

Wednesday, September 19, 12

(courtesy Salman Habib) Record-breaking application: 3.6 trillion particles, 14 Pflop/s

Exascale Design Point 202x with a cap of $200M and 20MW

Systems                    | 2012 BG/Q Computer       | 2020-2024             | Difference Today & 2019
System peak                | 20 Pflop/s               | 1 Eflop/s             | O(100)
Power                      | 8.6 MW                   | ~20 MW                |
System memory              | 1.6 PB (16*96*1024)      | 32 - 64 PB            | O(10)
Node performance           | 205 GF/s (16*1.6GHz*8)   | 1.2 or 15 TF/s        | O(10) – O(100)
Node memory BW             | 42.6 GB/s                | 2 - 4 TB/s            | O(1000)
Node concurrency           | 64 Threads               | O(1k) or 10k          | O(100) – O(1000)
Total Node Interconnect BW | 20 GB/s                  | 200-400 GB/s          | O(10)
System size (nodes)        | 98,304 (96*1024)         | O(100,000) or O(1M)   | O(100) – O(1000)
Total concurrency          | 5.97 M                   | O(billion)            | O(1,000)
MTTI                       | 4 days                   | O(<1 day)             | -O(10)

Both price and power envelopes may be too aggressive!
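The parenthesized expressions in the BG/Q column can be unpacked as a short calculation, and the power cap can be restated as an energy budget per operation. This is a sketch under stated assumptions: 16 GB of memory per node and 8 flops per cycle per core are read off the expressions 16*96*1024 and 16*1.6GHz*8, decimal units are used throughout, and the variable and function names are illustrative.

```python
# BG/Q figures as implied by the table's parenthesized expressions
nodes = 96 * 1024            # 98,304 nodes
cores_per_node = 16
clock_ghz = 1.6
flops_per_cycle = 8          # per core (assumption read from 16*1.6GHz*8)
mem_per_node_gb = 16         # assumption read from 16*96*1024

node_gflops = cores_per_node * clock_ghz * flops_per_cycle   # peak GF/s per node
system_pflops = node_gflops * nodes / 1e6                    # peak Pflop/s
system_mem_pb = mem_per_node_gb * nodes / 1e6                # PB (decimal)

def picojoules_per_flop(power_mw, flops_per_second):
    """Average energy available per floating point operation, in pJ,
    if the whole power budget is divided by the peak rate."""
    return power_mw * 1e6 / flops_per_second * 1e12

bgq_pj = picojoules_per_flop(8.6, 20e15)    # 2012 system
exa_pj = picojoules_per_flop(20.0, 1e18)    # exascale design point

print(f"node peak:   {node_gflops:.1f} GF/s")
print(f"system peak: {system_pflops:.2f} Pflop/s")
print(f"memory:      {system_mem_pb:.2f} PB")
print(f"energy/flop: {bgq_pj:.0f} pJ (BG/Q) vs {exa_pj:.0f} pJ (exascale target)")
```

Read this way, a 50-fold increase in peak performance must fit in roughly 2.3 times the power, so the energy available per operation has to drop by about a factor of 20. That is the arithmetic behind the warning that the power envelope may be too aggressive.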

Identified Issues

§  Scale (billion threads)

§  Power (10’s of MWatts)
–  Communication: > 99% of power is consumed by moving operands across the memory hierarchy and across nodes
–  Reduced memory size (communication in time)

§  Resilience: Something fails every hour; the machine is never “whole”
–  Trade-off between power and resilience

§  Asynchrony: Equal work ≠ equal time
–  Power management
–  Error recovery


Other Issues

§  Uncertainty about underlying HW architecture
–  Fast evolution of architecture (accelerators, 3D memory and processing near memory, NVRAM)
–  Uncertainty about the market that will supply components to HPC
–  Possible divergence from commodity markets

§  Increased complexity of software
–  Simulations of complex systems + uncertainty quantification + optimization…
–  Software management of power and failure
–  Scale and tight coupling (tail of distribution matters!)


Research Areas


Scale

§  HPC algorithms are being designed for a 2-level hierarchy (node, global); can they be designed for a multi-level hierarchy? Can they be “hierarchy-oblivious”?

§  Can we have a programming model that abstracts the specific HW mechanisms at each level (message-passing, shared-memory) yet can leverage these mechanisms efficiently?
–  Global shared object space + caching + explicit communication
–  Multilevel programming (compilation with human in the loop)
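One existing notion close to "hierarchy-oblivious" is the cache-oblivious style of algorithm, which recursively subdivides the problem so that sub-problems eventually fit at every level of a memory hierarchy without the code naming any level's size. The sketch below illustrates that style on matrix transposition; it is not taken from the talk, and the function name and the `base` cutoff (present only to limit recursion overhead, not tied to any cache size) are assumptions.

```python
def transpose_oblivious(a, out, r0, r1, c0, c1, base=16):
    """Write the transpose of the block a[r0:r1][c0:c1] into `out`,
    recursively halving the larger dimension of the block."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        # Small block: copy directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                out[j][i] = a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        transpose_oblivious(a, out, r0, mid, c0, c1, base)
        transpose_oblivious(a, out, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose_oblivious(a, out, r0, r1, c0, mid, base)
        transpose_oblivious(a, out, r0, r1, mid, c1, base)

# Usage: transpose a 37 x 50 matrix into a 50 x 37 one.
n_rows, n_cols = 37, 50
a = [[i * n_cols + j for j in range(n_cols)] for i in range(n_rows)]
out = [[None] * n_rows for _ in range(n_cols)]
transpose_oblivious(a, out, 0, n_rows, 0, n_cols)
```

The same recursive structure applies regardless of how many levels the hierarchy has, which is the property the first bullet asks about; whether it extends to levels connected by message-passing is the open question.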


Communication

§  Communication-efficient algorithms

§  A better understanding of fundamental communication-computation tradeoffs for PDE solvers (getting away from DAG-based lower bounds; tradeoffs between communication and convergence rate)

§  Programming models, libraries and languages where communication is a first-class citizen (other than MPI)


Resilient Distributed Systems

§  E.g., a parallel file system, with 768 I/O nodes, >50K disks
–  Systems are built to tolerate disk and node failures
–  However, most failures in the field are due to “performance bugs”: e.g., time-outs, due to thrashing

§  How do we build feedback mechanisms that ensure stability? (control theory for large-scale, discrete systems)

§  How do we provide quality of service?

§  What is a quantitative theory of resilience? (E.g., impact of failure rate on overall performance)
–  Focus on systems where failures are not exceptional


Resilient Parallel Algorithms: Overcoming Silent Data Corruptions

§  SDCs may be unavoidable in future large systems (due to flips in computation logic)

§  Intuition: SDC can either be
–  Type 1: Grossly violates the computation model (e.g., jump to wrong address, message sent to wrong node), or
–  Type 2: Introduces noise in the data (bit flip in a large array)

§  Many iterative algorithms can tolerate infrequent type 2 errors

§  Type 1 errors are often catastrophic and easy to detect in software

§  Can we build systems that avoid or correct easy to detect (type 1) errors and tolerate hard to detect (type 2) errors?

§  What is the general theory of fault-tolerant numerical algorithms?
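The claim that iterative algorithms can absorb infrequent type 2 errors can be illustrated with a small experiment. The sketch below runs Jacobi iteration on a strictly diagonally dominant system and corrupts one entry of the iterate partway through; the matrix, the size of the corruption, and all names are assumptions chosen for illustration, and the corruption is a large finite perturbation (a flip that produced NaN or infinity would need detection instead).

```python
import random

def jacobi(A, b, iters, corrupt_at=None, rng=None):
    """Jacobi iteration for A x = b. If corrupt_at is set, one randomly
    chosen entry of x is perturbed just before that iteration."""
    n = len(b)
    x = [0.0] * n
    for k in range(iters):
        if corrupt_at is not None and k == corrupt_at:
            x[rng.randrange(n)] += 1e3   # stand-in for a silent bit flip in data
        # Every new entry is computed from the previous iterate only.
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

def residual(A, b, x):
    """Max-norm of A x - b."""
    n = len(b)
    return max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))

# Tridiagonal test system: 4 on the diagonal, -1 next to it.
n = 50
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0] * n

clean = jacobi(A, b, iters=80)
hit = jacobi(A, b, iters=80, corrupt_at=20, rng=random.Random(1))
print("residual, clean run:    ", residual(A, b, clean))
print("residual, corrupted run:", residual(A, b, hit))
```

Because each Jacobi sweep on this matrix shrinks the error by at least a factor of two, the injected error is treated like a bad initial guess and is damped out by the remaining sweeps. The cost is extra iterations rather than a wrong answer, which is the behavior the type 2 bullet describes; it says nothing about type 1 errors.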


Asynchrony

§  What is a measure of asynchrony tolerance?
–  Moving away from the qualitative (e.g., wait-free) to the quantitative:
–  How much do intermittently slow processes slow down the entire computation, on average?

§  What are the trade-offs between synchronicity and computation work?

§  Load balancing, driven not by uncertainty on computation, but uncertainty on computer
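A toy model makes the quantitative question concrete. Assume a barrier-synchronized computation in which every process normally takes one time unit per step but is independently slow, by a fixed factor, with a small probability; each step then lasts as long as its slowest process. The model, its parameters, and the function names below are assumptions for illustration, not a measure proposed in the talk.

```python
import random

def expected_step_time(procs, p_slow, slow_factor):
    """Expected duration of one barrier-synchronized step when each of
    `procs` processes takes 1 unit, or `slow_factor` units with
    independent probability `p_slow`."""
    p_any_slow = 1.0 - (1.0 - p_slow) ** procs
    return 1.0 + (slow_factor - 1.0) * p_any_slow

def simulate(procs, p_slow, slow_factor, steps, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        any_slow = any(rng.random() < p_slow for _ in range(procs))
        total += slow_factor if any_slow else 1.0
    return total / steps

for procs in (10, 1000, 100000):
    t = expected_step_time(procs, p_slow=0.001, slow_factor=2.0)
    print(f"{procs:>6} processes: expected step time {t:.3f}")
```

In this model a process that is slow one time in a thousand costs a single process 0.1% on average, but at large process counts nearly every step contains at least one slow process and the whole computation approaches the slow rate. This is one way to read "equal work ≠ equal time" and the earlier remark that the tail of the distribution matters.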


Architecture-Specific Algorithms

§  GPU/accelerators

§  Hybrid Memory Cube / near-memory computing

§  NVRAM, e.g., flash memory


Portable Performance

§  Can we redefine compilation so that:
–  It supports well a human in the loop (manual high-level decisions vs. automated low-level transformations)
–  It integrates auto-tuning and profile-guided compilation
–  It preserves high-level code semantics
–  It preserves high-level code “performance semantics”


[Diagram: Principle vs. Practice]
Principle: High-level code; “Compilation”; low-level, platform-specific codes
Practice: Code A, Code B, Code C; manual conversion; “ifdef” spaghetti

Conclusion

§  Moore’s Law is slowing down; the slow-down has many fundamental consequences, only a few of them explored in this talk

§  HPC is the “canary in the mine”:
–  Issues appear earlier because of size and tight coupling

§  Optimistic view of the next decades: A frenzy of innovation to continue pushing current ecosystem, followed by frenzy of innovation to use totally different compute technologies

§  Pessimistic view: The end is coming

