Arun Rodrigues, Scott Hemmert, Dave Resnick:
Sandia National Lab (ABQ)
Keren Bergman: Columbia University
Bruce Jacob: U. Maryland
John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory
Gilbert Hendry: Sandia National Laboratory
Dan Quinlan, Chunhua Liao: Lawrence Livermore National Lab
Sudhakar Yalamanchili: Georgia Tech
Data Movement Dominates (DMD) and
CoDEx: CoDesign for Exascale
Codesign Tools Recap
Architectural Simulation to Accelerate CoDesign
SST: System-level models
ACE: Node-level emulation
ROSE: Application analysis
ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST.
ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping of cycle-accurate models of manycore node designs.
SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects.
SST Micro Software Simulators: Software-based simulation at the node level.
CoDEx: CoDesign For Exascale
ASCR-funded Simulation Infrastructure Project
SST: Structural Simulation Toolkit
NNSA-funded Simulation Tools (ASC Program)
CAL: (Sandia/LBL) Computer Architecture Laboratory
Fidelity vs. Scope for Architectural Simulation Methods
ROSE Compiler
Full Program Understanding through Deep Source-Code Analysis
ExaSAT: Exascale Static Analysis Tool
Compiler-Automated Performance Model Extraction
• Can automatically predict performance for many input codes and software optimizations
• Predict performance under different architectural scenarios
• Much faster than hardware simulation and manual modeling
[Figure: ExaSAT workflow. Labels: Combustion Codes; Compiler Analysis; Dependency Graph Optimization; <XML>; User Parameters; Machine Parameters; Performance Model; Performance Prediction Spreadsheet]
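The idea of combining a compiler-extracted model with user and machine parameters can be illustrated with a minimal roofline-style sketch. This is not ExaSAT's actual model or file format. The `KernelModel` and `Machine` types, their fields, and all numbers are hypothetical stand-ins for quantities a static analysis might report (flops and bytes per grid point) and for machine parameters (peak flop rate, memory bandwidth).

```python
from dataclasses import dataclass


@dataclass
class KernelModel:
    """Per-grid-point costs a static analysis might report (hypothetical)."""
    flops_per_point: float
    bytes_per_point: float


@dataclass
class Machine:
    """Machine parameters for one architectural scenario (hypothetical)."""
    peak_flops: float      # flop/s
    mem_bandwidth: float   # byte/s


def predict_seconds(kernel: KernelModel, n_points: int, machine: Machine) -> float:
    """Roofline-style bound: time is set by compute or by data movement,
    whichever is slower."""
    compute_time = kernel.flops_per_point * n_points / machine.peak_flops
    memory_time = kernel.bytes_per_point * n_points / machine.mem_bandwidth
    return max(compute_time, memory_time)


if __name__ == "__main__":
    stencil = KernelModel(flops_per_point=8.0, bytes_per_point=16.0)
    n = 512 ** 3  # user parameter: problem size (assumed)
    scenarios = {
        "baseline": Machine(peak_flops=1e12, mem_bandwidth=1e11),
        "2x bandwidth": Machine(peak_flops=1e12, mem_bandwidth=2e11),
    }
    for name, machine in scenarios.items():
        print(f"{name}: {predict_seconds(stencil, n, machine):.4f} s")
```

Because the model is a closed-form expression, sweeping many architectural scenarios costs almost nothing, which is the sense in which this style of prediction is faster than hardware simulation.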
SST/macro: Coarse-Grained Simulation
An application code with minor modifications
SST/macro implementations of the interfaces (MPI), which simulate execution and communication
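One way to picture coarse-grained simulation is a stand-in communication call that advances a simulated clock instead of moving data. The sketch below uses a simple latency-plus-bandwidth cost. It is not SST/macro's API: `Network`, `SimulatedRank`, `compute`, `send`, and the parameter values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Network:
    latency_s: float               # per-message latency (assumed)
    bandwidth_bytes_per_s: float   # link bandwidth (assumed)

    def transfer_time(self, n_bytes: int) -> float:
        """Latency/bandwidth cost of one message."""
        return self.latency_s + n_bytes / self.bandwidth_bytes_per_s


class SimulatedRank:
    """One simulated process whose clock advances without doing real work."""

    def __init__(self, network: Network) -> None:
        self.network = network
        self.clock = 0.0

    def compute(self, seconds: float) -> None:
        # Stand-in for a compute region: only the modeled time is charged.
        self.clock += seconds

    def send(self, n_bytes: int) -> None:
        # Stand-in for a blocking send: charge the modeled transfer time.
        self.clock += self.network.transfer_time(n_bytes)


if __name__ == "__main__":
    rank = SimulatedRank(Network(latency_s=1e-6, bandwidth_bytes_per_s=1e9))
    rank.compute(0.5)
    rank.send(1_000_000)
    print(f"simulated time: {rank.clock:.6f} s")
```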
SST/micro: Cycle-Accurate Framework
• Has a general simulation framework for integrating models
• Simulation backend is parallel
• Plenty of people involved
Some Models Currently Integrated
Gem5 is a well-known architectural simulator with models for processors, caches, buses, and network components.
MacSim provides a model of GPU/CPU cores or heterogeneous computing nodes, which can be driven from x86 or PTX (CUDA) traces.
IRIS provides a pipelined, cycle-accurate router model capable of modeling a variety of Network-on-Chip (NoC) and inter-node interconnection architectures.
PhoenixSim models photonic networks.
Leveraging Embedded Design Automation For Design Space Exploration
This stuff is essential!
Embedded Design Automation
(Using FPGA emulation to do rapid prototyping)
RAMP FPGA-accelerated emulation of ASIC
Or “tape out” to FPGA
Data Movement Dominates (Sandia, Micron, Columbia, LBL)
Understand the Potential of Intelligent, Stacked DRAM Technology
• Data movement is projected to account for over 75% of the power budget of an exascale platform
• Work to reduce that via
– Optical interconnect(s)
– 3D stacking (logic + memory + optics)
– New memory protocols
Research Questions
– What is the performance potential of stacked memory (power & speed)?
– How much intelligence should be put into the logic layer?
• Atomics, gather/scatter, checksums, full processor-in-memory
– What is the memory consistency model for intelligent DRAM?
– How do we program it if we embed more intelligence into DRAM?
The Cost of Moving Data
Locality Management is Key
What is the best combination of software and hardware mechanisms to maximize data movement efficiency?
[Figure: Vertical Locality Management and Horizontal Locality Management. Labels: Temporal; Topological; Sun Microsystems]
Why Study Chip Stacking (TSVs)?
$\text{Energy} = (V^2 \times C) \times \text{Overhead} + E_{\text{comm}}$
DRAM Cells Efficient
• DRAM cells require < 1 pJ to access
• Current DRAM architectures are not power efficient
• Long distances ➔ high power
• We pay for more than we get at every level
– Cache: throw away 75-80%
– DRAM Row: Charge 1024B for each 64B access
– DIMM: Charge 8-9 chips/access
– ~800 pJ/byte total
• DRAM design driven by packaging constraints
– ~50% of DRAM chip cost is packaging, mainly in pins
– DIMMs use multiple chips with a few data pins to achieve high BW
TSVs Reduce Costs
• TSVs use orders of magnitude less energy
– 250 fJ/bit for reading DRAM
– 5 fJ/bit for TSV
– 250 fJ/bit for mem. controller
– ~0.5 pJ/bit (compared to 30 pJ for conventional DIMM)
– Don’t have to access more data than needed
• Enables....
– Lower Capacitance: Narrower
– Lower Overhead: Smarter
– In-Memory computation
• Requires
– ...changes to how we view the machine & the memory
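The per-bit figures on this slide can be checked with a few lines of arithmetic. The sketch below sums the three stacked-path components and compares the total with the conventional DIMM figure. One assumption is labeled in the code: the slide's "30pJ" is read as a per-bit value, since it is compared against a pJ/bit number.

```python
# Per-bit energies quoted on the slide, in femtojoules.
DRAM_READ_FJ = 250
TSV_FJ = 5
MEM_CONTROLLER_FJ = 250

# Assumption: the slide's "30pJ" for a conventional DIMM is per bit.
DIMM_FJ = 30_000

stacked_fj = DRAM_READ_FJ + TSV_FJ + MEM_CONTROLLER_FJ   # total stacked path
stacked_pj = stacked_fj / 1000                           # same value in pJ/bit
dimm_ratio = DIMM_FJ / stacked_fj                        # DIMM vs. stacked

# Row overfetch: a 1024-byte row is charged to serve a 64-byte access.
ROW_BYTES = 1024
ACCESS_BYTES = 64
row_overfetch = ROW_BYTES // ACCESS_BYTES

print(f"stacked path: {stacked_fj} fJ/bit = {stacked_pj} pJ/bit")
print(f"conventional DIMM / stacked: {dimm_ratio:.1f}x")
print(f"row overfetch: {row_overfetch}x")
```

The sum is 505 fJ/bit, which matches the slide's "~0.5 pJ/bit". Under the per-bit assumption the DIMM costs roughly 59 times more, and the row activation charges 16 times the data actually requested.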
Why Photonics?
ELECTRONICS:
• Buffer, receive, and re-transmit at every router.
• Space Parallelism: Each bus lane routed independently (P ∝ N_LANES).
• Off-chip BW requires much more power than on-chip BW.
Photonics changes the rules for Bandwidth-per-Watt.
PHOTONICS:
• Modulate/receive data stream once per communication event.
• Wavelength Parallelism: Broadband switch routes entire multi-wavelength stream.
• Off-chip BW ≈ on-chip BW for nearly same power.
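The contrast between the two columns can be sketched as a toy energy model: the electronic path pays at every router hop, while the photonic path pays once per communication event. This is an illustration only. The function names and the per-bit energy values are hypothetical, and the photonic model ignores costs such as laser power, thermal tuning, and switch setup.

```python
def electronic_energy_pj(n_bits: int, n_hops: int, pj_per_bit_per_hop: float) -> float:
    """Electronic path: every router buffers, receives, and re-transmits,
    so energy grows with the number of hops."""
    return n_bits * n_hops * pj_per_bit_per_hop


def photonic_energy_pj(n_bits: int, pj_per_bit_end_to_end: float) -> float:
    """Photonic path: modulate and receive once per communication event,
    so energy does not depend on hop count in this simplified model."""
    return n_bits * pj_per_bit_end_to_end


if __name__ == "__main__":
    message_bits = 64 * 8  # one 64-byte message
    for hops in (1, 4, 16):
        e = electronic_energy_pj(message_bits, hops, pj_per_bit_per_hop=1.0)
        p = photonic_energy_pj(message_bits, pj_per_bit_end_to_end=2.0)
        print(f"{hops:2d} hops: electronic {e:.0f} pJ, photonic {p:.0f} pJ")
```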
Why Optically-Connected Memory?
Traditional Memory
• Large pin-out
• Complex wiring
• Low bandwidth density
• Distance constrained by electrical limitations
• High power dissipation
Will not scale to meet power and bandwidth requirements of future high-performance computing systems
Optically-Connected Memory
• All-optical link, no electronic bus to drive
• Bit-rate transparent link
• High bandwidth density, fewer pins
• Distance immunity at computer scale
• Low power dissipation
Enables scaling of high-performance computing through increased memory capacity and bandwidth
[Figure: CPU and HBDRAM blocks connected by an Electronic Bus (Traditional Memory) and by an Optical Link (Optically-Connected Memory)]
Mixed Model Simulation
Cycle-accurate and energy-accurate models
[Figure: mixed-model simulation stack. Labels: SST/macro; skeleton app (C, C++, Fortran); MPI Traces (DUMPI); Workload Translation; kernels; Address Translation; Processor Model (SST/micro & Tensilica); NoC Model (PhoenixSim); Memory Model (DRAMSim2, FLASHsim, NVRAM); SystemC; (C++); Fault Injection; Checkpoint/restart]
Simulator Infrastructure: Interconnects
Cycle-accurate and energy-accurate models
Developed by Sandia Collaborators
CoDEx project
Simulator Infrastructure: Memory
Cycle-accurate and energy-accurate models
Validated against Micron DRAM
HMC model coming this summer
Simulator Infrastructure
Cycle-accurate and energy-accurate models
Rewrote Columbia PhoenixSim (summer 2011)
Orion-2 energy model
Validated against Cornell test parts
Simulator Infrastructure
Cycle-accurate and energy-accurate models
Full gate-level RTL model of processor
Well-characterized energy model