Keynote snir spaa
Supercomputing: Technical Evolution & Programming Models
Marc Snir, Argonne National Laboratory & University of Illinois at Urbana-Champaign
Introduction
July 13
MCS -‐-‐ Marc Snir
2
Theory of Punctuated Equilibrium (Eldredge, Gould, Mayr…)
§ Evolution consists of long periods of equilibrium, with little change, interspersed with short periods of rapid change.
– Mutations are diluted in large populations in equilibrium: the homogenizing effect prevents the accumulation of multiple changes
– Small, isolated populations under heavy natural selection pressure evolve rapidly, and new species can appear
– Major cataclysms can be a cause for rapid change
§ Punctuated Equilibrium is a good model for technology evolution:
– Revolutions are hard in large markets with network effects and technology that evolves gradually
– Changes can be much faster when small, isolated product markets are created, or when current technology hits a wall (cataclysm)
§ (Not a new idea: e.g., Levinthal 1998: The Slow Pace of Rapid Technological Change: Gradualism and Punctuation in Technological Change)
Why it Matters to SPAA (and PODC)
§ Periods of paradigm shift generate a rich set of new problems (new low-hanging fruit?)
– It is a time when good theory can help
§ E.g., Internet, Wireless, Big data
– Punctuated evolution due to the appearance of new markets
§ Hypothesis: HPC now and, ultimately, much of IT are entering a period of fast evolution: Please prepare
Where Analogy with Biological Evolution Breaks Down
§ Technology evolution can be accelerated by genetic engineering
– Technology developed in one market is exploited in another market
– E.g., Internet or wireless were enabled by cheap microprocessors, telephony technology, etc.
§ “Genetic engineering” has been essential for HPC in the last 25 years:
– Progress enabled by reuse of technologies from other markets (micros, GPUs…)
Past & Present
Evidence of Punctuated Equilibrium in HPC
[Figure: Core Count of the Leading Top500 System, log scale from 1 to 10,000,000. Annotated transitions: attack of the killer micros, multicore, accelerators. "SPAA" is also marked on the chart.]
1990: The Attack of the Killer Micros (Eugene Brooks, 1990)
§ Shift from ECL vector machines to clusters of MOS micros
– Cataclysm: bipolar evolution reached its limits (nitrogen cooling, gallium arsenide…); MOS was on a fast evolution path
– MOS had its niche markets: controllers, workstations, PCs
– Classical example of “good enough, cheaper technology” (Christensen, The Innovator’s Dilemma)
2002: Multicore
§ Clock speed stopped increasing; very little return on added CPU complexity; chip density continued to increase
– Technology push, not market pull
– Still has limited success
2010: Accelerators
§ New market (graphics) created ecological niche
§ Technology transplanted into other markets (signal processing/vision, scientific computing)
– Advantage of better power/performance ratio (less logic)
§ Technology still changing rapidly: integration with CPU and evolving ISA
Were the (R)evolutions Successful in HPC?
§ Killer Micros: Yes
– Totally replaced vector machines
– All HPC codes enabled for message-passing (MPI)
– Took > 10 years and > $1B govt. investment (DARPA)
§ Multicore: Incomplete
– Many codes still use one MPI process per core and use shared memory for message-passing
– Use of two programming models (MPI+OpenMP) is burdensome
– PGAS is not used, and does not provide (so far) a real advantage over MPI
– Many open issues on scaling multithreading models (OpenMP, TBB, Cilk…) and combining them with message-passing
– (See history of large-scale NUMA, which did not become a viable species)
Were the (R)evolutions Successful? (2)
§ Accelerators: Just beginning
– Few HPC codes converted to use GPUs
§ Obstacles:
– Technology still changing fast (integration of GPU with CPU, continued changes in ISA)
– No good non-proprietary programming systems are available, and their long-term viability is uncertain
Key Obstacles
§ Scientific codes live much longer than computer systems (two decades or more); they need to be ported across successive HW generations
§ Amount of code to be ported continuously increases (major scientific codes each have > 1 MLOC)
§ Need very efficient, well-tuned codes (HPC platforms are expensive)
§ Need portability across platforms (HPC programmers are expensive)
§ Squaring the circle?
§ Lack of performant, portable programming models has become the major impediment to the evolution of HPC hardware
Did Theory Help?
§ Killer Micros: Helped by work on scalable algorithms and on interconnects
§ Multicore: Helped by work on communication complexity (efficient use of caches)
– Very little use of work on coordination algorithms or transactional memory
§ Accelerators: Cannot think of relevant work
– Interesting question: power of branching & power of indirection
– Surprising result: AKS sorting network
§ Too often, theory follows practice, rather than preceding it.
Future
The End of Moore’s Law is Coming
§ Moore’s Law: The number of transistors per chip doubles every two years
§ Stein’s Law: If something cannot go on forever, it will stop
§ The question is not whether but when Moore’s Law will stop
– It is difficult to make predictions, especially about the future (Yogi Berra)
Current Obstacle: Current Leakage
§ Transistors do not shut off completely
“While power consumption is an urgent challenge, its leakage or static component will become a major industry crisis in the long term, threatening the survival of CMOS technology itself, just as bipolar technology was threatened and eventually disposed of decades ago” (International Technology Roadmap for Semiconductors (ITRS) 2011)
§ The ITRS “long term” is the 2017-2024 timeframe.
§ No “good enough” technology waiting in the wings
Longer-Term Obstacle
§ Quantum effects totally change the behavior of transistors as they shrink
– 7-5 nm feature size is predicted to be the lower limit for CMOS devices
– ITRS predicts 7.5 nm will be reached in 2024
The 7nm Wall
24 July 2013
ANL-‐LBNL-‐ORNL-‐PNNL
19
(courtesy S. Dosanjh)
The Future Is Not What It Was
(courtesy S. Dosanjh)
Progress Does Not Stop
§ It becomes more expensive and slows down
– New materials (e.g., III-V, germanium thin channels, nanowires, nanotubes or graphene)
– New structures (e.g., 3D transistor structures)
– Aggressive cooling
– New packages
§ More invention at the architecture level
§ Seeking value from features other than speed (More than Moore)
– System on a chip: integration of analog and digital
– MEMS…
§ Beyond Moore? (Quantum, biological…): beyond my horizon
Exascale
Supercomputer Evolution
§ ×1,000 performance increase every 11 years
– ×50 faster than Moore’s Law
§ Extrapolation predicts exaflop/s (10^18 floating point operations per second) before 2020
– We are now at 50 Petaflop/s
§ Extrapolation may not work if Moore’s Law slows down
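The extrapolation on this slide can be checked with a few lines of arithmetic. The sketch below assumes the two figures stated on the slide (a ×1,000 increase every 11 years, and a current level of 50 Pflop/s); the function name and the mid-2013 starting date are illustrative assumptions, not part of the talk.

```python
import math

def years_to_reach(current, target, factor=1000.0, period_years=11.0):
    """Years needed to grow from `current` to `target`,
    assuming steady exponential growth of `factor` per `period_years`."""
    annual_log_growth = math.log(factor) / period_years
    return math.log(target / current) / annual_log_growth

# Assumed starting point: 50 Pflop/s; target: 1 Eflop/s = 1000 Pflop/s.
years = years_to_reach(current=50.0, target=1000.0)
print(f"about {years:.1f} years from now")
```

Under these assumptions the trend line crosses 1 Eflop/s a little under five years out, which is consistent with the slide's "before 2020", provided the historical rate holds.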
Do We Care?
§ It’s all about Big Data now; simulations are passé.
§ B***t
§ All science is either physics or stamp collecting. (Ernest Rutherford)
– In Physical Sciences, experiments and observations exist to validate/refute/motivate theory. “Data Mining” not driven by a scientific hypothesis is “stamp collection”.
§ Simulation is needed to go from a mathematical model to predictions on observations.
– If system is complex (e.g., climate) then simulation is expensive
– Predictions are often statistical, complicating both simulation and data analysis
Observation Meets Data: Cosmology
[Figure: "Computation Meets Data: The Argonne View". Panels: Mapping the Sky with Survey Instruments (LSST weak lensing); Observations: statistical error bars will ‘disappear’ soon!; Supercomputer Simulation Campaign; Emulator based on Gaussian Process Interpolation in High-Dimensional Spaces; Markov chain Monte Carlo; ‘Precision Oracle’; ‘Cosmic Calibration’. Curves labeled w = -1 and w = -0.9.]
HACC+CCF (Domain science + CS + Math + Stats + Machine learning)
HACC = Hardware/Hybrid Accelerated Cosmology Code(s); CCF = Cosmic Calibration Framework
(courtesy Salman Habib)
Record-breaking application: 3.6 trillion particles, 14 Pflop/s
Exascale Design Point 202x with a cap of $200M and 20 MW

Systems                    | 2012 BG/Q Computer       | 2020-2024            | Difference Today & 2019
System peak                | 20 Pflop/s               | 1 Eflop/s            | O(100)
Power                      | 8.6 MW                   | ~20 MW               |
System memory              | 1.6 PB (16*96*1024)      | 32 - 64 PB           | O(10)
Node performance           | 205 GF/s (16*1.6GHz*8)   | 1.2 or 15 TF/s       | O(10) – O(100)
Node memory BW             | 42.6 GB/s                | 2 - 4 TB/s           | O(1000)
Node concurrency           | 64 threads               | O(1k) or 10k         | O(100) – O(1000)
Total node interconnect BW | 20 GB/s                  | 200 - 400 GB/s       | O(10)
System size (nodes)        | 98,304 (96*1024)         | O(100,000) or O(1M)  | O(100) – O(1000)
Total concurrency          | 5.97 M                   | O(billion)           | O(1,000)
MTTI                       | 4 days                   | O(<1 day)            | - O(10)

Both price and power envelopes may be too aggressive!
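Some of the 2012 column can be re-derived from the parenthesized products in the table, and the power target translates into an energy-efficiency gap. The sketch below is only a sanity check: it reads "16*1.6GHz*8" as cores × clock × flops per cycle and "16*96*1024" as GB per node × racks × nodes per rack, which is an interpretation, and it uses decimal units throughout.

```python
# Values transcribed from the 2012 BG/Q column of the design-point table.
nodes = 96 * 1024                      # system size in nodes
cores_per_node, clock_ghz, flops_per_cycle = 16, 1.6, 8
mem_per_node_gb = 16                   # assumed meaning of the leading 16 in 16*96*1024

node_gflops = cores_per_node * clock_ghz * flops_per_cycle
system_pflops = nodes * node_gflops / 1e6        # GF/s -> PF/s
system_mem_pb = nodes * mem_per_node_gb / 1e6    # GB -> PB (decimal)

def gflops_per_watt(pflops, megawatts):
    """Energy efficiency in Gflop/s per watt."""
    return (pflops * 1e6) / (megawatts * 1e6)

bgq_eff = gflops_per_watt(20.0, 8.6)      # 2012 system: 20 Pflop/s at 8.6 MW
exa_eff = gflops_per_watt(1000.0, 20.0)   # target: 1 Eflop/s at 20 MW

print(f"node peak     {node_gflops:.1f} GF/s")
print(f"system peak   {system_pflops:.2f} PF/s")
print(f"system memory {system_mem_pb:.2f} PB")
print(f"efficiency    {bgq_eff:.2f} -> {exa_eff:.0f} GF/W ({exa_eff / bgq_eff:.1f}x)")
```

With these figures, reaching 1 Eflop/s inside 20 MW needs roughly a 20-fold gain in flops per watt, which gives a sense of why the power envelope is called aggressive.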
Identified Issues
§ Scale (billion threads)
§ Power (10’s of MWatts)
– Communication: > 99% of power is consumed by moving operands across the memory hierarchy and across nodes
– Reduced memory size: (communication in time)
§ Resilience: Something fails every hour; the machine is never “whole”
– Trade-off between power and resilience
§ Asynchrony: Equal work ≠ equal time
– Power management
– Error recovery
Other Issues
§ Uncertainty about underlying HW architecture
– Fast evolution of architecture (accelerators, 3D memory and processing near memory, NVRAM)
– Uncertainty about the market that will supply components to HPC
– Possible divergence from commodity markets
§ Increased complexity of software
– Simulations of complex systems + uncertainty quantification + optimization…
– Software management of power and failure
– Scale and tight coupling (tail of distribution matters!)
Research Areas
Scale
§ HPC algorithms are being designed for a 2-level hierarchy (node, global); can they be designed for a multi-level hierarchy? Can they be “hierarchy-oblivious”?
§ Can we have a programming model that abstracts the specific HW mechanisms at each level (message-passing, shared-memory) yet can leverage these mechanisms efficiently?
– Global shared object space + caching + explicit communication
– Multilevel programming (compilation with human in the loop)
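One established single-node instance of the "oblivious" idea is the cache-oblivious style of algorithm, which recurses on the problem instead of hard-coding block sizes for each memory level. The sketch below is an illustration of that style (a recursive matrix transpose), not something taken from the talk; the `base` cutoff is an assumed tuning knob that only limits recursion overhead, and plain Python lists will not show the locality benefit that a compiled implementation would.

```python
def transpose(src, dst, r0, r1, c0, c1, base=8):
    """Write the transpose of src[r0:r1][c0:c1] into dst by recursive halving.

    The recursion always splits the longer dimension, so sub-blocks shrink
    until they fit in whatever level of the memory hierarchy is present,
    without the code naming any cache or block size.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows <= base and cols <= base:
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2
        transpose(src, dst, r0, mid, c0, c1, base)
        transpose(src, dst, mid, r1, c0, c1, base)
    else:
        mid = c0 + cols // 2
        transpose(src, dst, r0, r1, c0, mid, base)
        transpose(src, dst, r0, r1, mid, c1, base)

# Usage: transpose a 5 x 7 matrix into a preallocated 7 x 5 matrix.
src = [[10 * i + j for j in range(7)] for i in range(5)]
dst = [[None] * 5 for _ in range(7)]
transpose(src, dst, 0, 5, 0, 7, base=2)
```

Whether the same recursive structure extends across nodes, where the "levels" are network and memory rather than caches, is the open question the slide raises.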
Communication
§ Communication-efficient algorithms
§ A better understanding of fundamental communication-computation tradeoffs for PDE solvers (getting away from DAG-based lower bounds; tradeoffs between communication and convergence rate)
§ Programming models, libraries and languages where communication is a first-class citizen (other than MPI)
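A simple way to see the communication-computation tradeoff for a PDE solver is the surface-to-volume ratio of a domain decomposition. The sketch below assumes a face-neighbor stencil on a cubic subdomain of n points per side with a one-cell halo, and ignores edges and corners; the function name and parameters are illustrative.

```python
def halo_ratio(n, dims=3, halo=1):
    """Halo cells exchanged per interior cell updated, for an n^dims subdomain.

    Computation scales with the volume n^dims; communication scales with
    the 2*dims faces, each of size n^(dims-1), times the halo width.
    """
    volume = n ** dims
    surface = 2 * dims * halo * n ** (dims - 1)
    return surface / volume

for n in (10, 100, 1000):
    print(n, halo_ratio(n))
```

The ratio falls as 6/n in 3D, so shrinking the per-node subdomain (for example because memory per node is reduced) raises the relative communication cost per step.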
Resilient Distributed Systems
§ E.g., a parallel file system, with 768 I/O nodes and >50K disks
– Systems are built to tolerate disk and node failures
– However, most failures in the field are due to “performance bugs”: e.g., time-outs, due to thrashing
§ How do we build feedback mechanisms that ensure stability? (control theory for large-scale, discrete systems)
§ How do we provide quality of service?
§ What is a quantitative theory of resilience? (E.g. Impact of failure rate on overall performance)
– Focus on systems where failures are not exceptional
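One small, well-known piece of such a quantitative theory is Young's first-order approximation for checkpoint/restart, which relates the failure rate to lost performance. The sketch below is an illustration under that model only; the 10-minute checkpoint cost is a hypothetical value, and the approximation is meaningful only when the checkpoint cost is much smaller than the mean time to interrupt (MTTI).

```python
import math

def young_interval(checkpoint_cost, mtti):
    """Young's approximation of the best time between checkpoints (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost * mtti)

def overhead_fraction(checkpoint_cost, mtti):
    """First-order fraction of time lost to checkpointing plus rework."""
    tau = young_interval(checkpoint_cost, mtti)
    return checkpoint_cost / tau + tau / (2.0 * mtti)

cost = 600.0  # hypothetical: 10 minutes to write one checkpoint
for label, mtti in (("MTTI 4 days", 4 * 24 * 3600.0), ("MTTI 1 hour", 3600.0)):
    print(label, f"{overhead_fraction(cost, mtti):.0%} overhead")
```

Under these assumptions the overhead is a few percent at an MTTI of days but grows past half of the machine at an MTTI of one hour, where the first-order model itself is no longer trustworthy. That regime, where failures are not exceptional, is the one the slide asks about.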
Resilient Parallel Algorithms: Overcoming Silent Data Corruptions
§ SDCs may be unavoidable in future large systems (due to flips in computation logic)
§ Intuition: SDC can either be
– Type 1: Grossly violates the computation model (e.g. jump to wrong address, message sent to wrong node), or
– Type 2: Introduces noise in the data (bit flip in a large array)
§ Many iterative algorithms can tolerate infrequent type 2 errors
§ Type 1 errors are often catastrophic and easy to detect in software
§ Can we build systems that avoid or correct easy to detect (type 1) errors and tolerate hard to detect (type 2) errors?
§ What is the general theory of fault-tolerant numerical algorithms?
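The claim that iterative algorithms can tolerate infrequent type 2 errors can be illustrated with a contracting fixed-point iteration. The sketch below uses Jacobi iteration on a small diagonally dominant system and injects one large perturbation into the iterate, standing in for a bit flip; the matrix, the size of the perturbation and the `corrupt_at` parameter are all illustrative assumptions.

```python
def jacobi(A, b, iters, corrupt_at=None):
    """Jacobi iteration for A x = b, optionally corrupting x once.

    If corrupt_at is an iteration index, a large error is added to x[0]
    right after that iteration, mimicking silent noise in the data.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(iters):
        x = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if k == corrupt_at:
            x[0] += 1000.0  # injected "type 2" error
    return x

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]          # exact solution is [1, 1, 1]
clean = jacobi(A, b, iters=60)
corrupted = jacobi(A, b, iters=60, corrupt_at=5)
```

Because each sweep shrinks the error, the corrupted run converges to the same answer and only pays extra iterations. This holds for corruption of the iterate; a flip in `A` or `b` changes the problem being solved and is not repaired this way.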
Asynchrony
§ What is a measure of asynchrony tolerance?
– Moving away from the qualitative (e.g., wait-free) to the quantitative:
– How much do intermittently slow processes slow down the entire computation, on average?
§ What are the trade-offs between synchronicity and computation work?
§ Load balancing, driven not by uncertainty on computation, but uncertainty on computer
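A toy model shows what a quantitative answer could look like. Assume a bulk-synchronous step in which each of P processes independently takes one time unit, or `slow` units with probability p, and the step ends when the slowest process finishes. These modeling choices and names are assumptions made for illustration.

```python
def expected_step_time(procs, p_slow, slow=2.0):
    """Expected duration of one barrier-synchronized step.

    The step takes `slow` units if at least one process is slow,
    which happens with probability 1 - (1 - p_slow) ** procs.
    """
    p_any_slow = 1.0 - (1.0 - p_slow) ** procs
    return 1.0 + (slow - 1.0) * p_any_slow

for procs in (1, 1000, 1_000_000):
    print(procs, round(expected_step_time(procs, p_slow=0.001), 3))
```

A process that is slow 0.1% of the time costs a single process almost nothing, but at a thousand processes the expected step is already more than 60% longer, and at a million nearly every step runs at the slow rate. This is one way to read the earlier remark that the tail of the distribution matters.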
Architecture-Specific Algorithms
§ GPU/accelerators
§ Hybrid Memory Cube / Near-memory computing
§ NVRAM
– E.g., flash memory
Portable Performance
§ Can we redefine compilation so that:
– It supports well a human in the loop (manual high-level decisions vs. automated low-level transformations)
– It integrates auto-tuning and profile-guided compilation
– It preserves high-level code semantics
– It preserves high-level code “performance semantics”
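Auto-tuning, in its simplest form, means timing several variants of a kernel on the target machine and keeping the fastest. The sketch below is a minimal illustration of that loop, not a description of any particular tool; `blocked_sum`, `autotune` and the candidate block sizes are hypothetical names and values.

```python
import time

def blocked_sum(data, block):
    """Sum `data` in chunks of `block` elements (the tunable parameter)."""
    total = 0.0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total

def autotune(kernel, data, candidates, repeats=3):
    """Return the candidate parameter with the lowest best-of-`repeats` time."""
    best_param, best_time = None, float("inf")
    for param in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, param)
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param

data = [1.0] * 10_000
best = autotune(blocked_sum, data, candidates=[16, 256, 4096])
print("selected block size:", best)
```

The selected value depends on the machine and may vary from run to run; every variant computes the same result, so only the performance, not the semantics, is being chosen.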
[Diagram: Principle vs. Practice]
Principle: High-level code → “Compilation” → Low-level, platform-specific codes
Practice: Code A, Code B, Code C, kept in step by manual conversion and “ifdef” spaghetti
Conclusion
§ Moore’s Law is slowing down; the slow-down has many fundamental consequences, only a few of them explored in this talk
§ HPC is the “canary in the mine”:
– issues appear earlier because of size and tight coupling
§ Optimistic view of the next decades: A frenzy of innovation to continue pushing current ecosystem, followed by frenzy of innovation to use totally different compute technologies
§ Pessimistic view: The end is coming