Post on 15-Jan-2016
Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Lab (ABQ)
Keren Bergman: Columbia University
Bruce Jacob: U. Maryland
John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory
Gilbert Hendry: Sandia National Laboratory
Dan Quinlan, Chunhua Liao: Lawrence Livermore National Lab
Sudhakar Yalamanchili: Georgia Tech
Data Movement Dominates (DMD) and
CoDEx: CoDesign for Exascale
Codesign Tools Recap
Architectural Simulation to Accelerate CoDesign
SST
• System level models
ACE
• Node level emulation
ROSE
• Application Analysis
ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST.
ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapid prototyping of cycle-accurate models of manycore node designs.
SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects.
SST Micro Software Simulators: Software simulation at the node level.
CoDEx: CoDesign For Exascale
ASCR-funded Simulation Infrastructure Project
SST: Structural Simulation Toolkit
NNSA-funded Simulation Tools (ASC Program)
CAL: Computer Architecture Laboratory (Sandia/LBL)
Fidelity vs. Scope for Architectural Simulation Methods
ROSE Compiler
Full Program Understanding through Deep Source-Code Analysis
• Can automatically predict performance for many input codes and software optimizations
• Predict performance under different architectural scenarios
• Much faster than hardware simulation and manual modeling
ExaSAT: Exascale Static Analysis Tool
Compiler-Automated Performance Model Extraction
[Figure: ExaSAT workflow. Labels: Combustion Codes; Compiler Analysis; Dependency Graph Optimization; <XML> Performance Model; User Parameters; Machine Parameters; Performance Prediction Spreadsheet]
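The workflow labels suggest a performance model that combines user parameters with machine parameters to produce a prediction. The sketch below illustrates that general idea only. It assumes a simple "compute-bound or memory-bound, whichever is slower" model, which is not necessarily what ExaSAT extracts, and every name and number in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelParams:
    """Hypothetical per-cell costs, as a static analysis pass might report them."""
    flops_per_cell: float
    bytes_per_cell: float


@dataclass
class MachineParams:
    """Hypothetical machine description (peak rates, not measured values)."""
    peak_flops_per_s: float
    mem_bytes_per_s: float


def predict_seconds(kernel: KernelParams, machine: MachineParams, cells: float) -> float:
    """Predict runtime as the slower of the compute-bound and memory-bound estimates."""
    compute_s = kernel.flops_per_cell * cells / machine.peak_flops_per_s
    memory_s = kernel.bytes_per_cell * cells / machine.mem_bytes_per_s
    return max(compute_s, memory_s)


if __name__ == "__main__":
    kernel = KernelParams(flops_per_cell=100.0, bytes_per_cell=80.0)
    # Two made-up architectural scenarios evaluated against the same kernel.
    scenarios = {
        "bandwidth-starved": MachineParams(peak_flops_per_s=1e12, mem_bytes_per_s=1e11),
        "compute-starved": MachineParams(peak_flops_per_s=1e11, mem_bytes_per_s=1e12),
    }
    for name, machine in scenarios.items():
        print(name, predict_seconds(kernel, machine, cells=1e9))
```

Because the model is just arithmetic over extracted parameters, sweeping many architectural scenarios costs almost nothing, which is the property the slide contrasts with hardware simulation and manual modeling.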
SST/macro: Coarse-Grained Simulation
An application code with minor modifications
SST/macro implementations of interfaces (MPI), which simulate execution and communication
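To make the idea concrete, here is a toy sketch of a skeleton application running against a simulated messaging layer: the code keeps its compute/communicate structure, but calls advance a virtual clock instead of doing real work. This is not the SST/macro API. The class, its methods, and the latency/bandwidth cost model are all assumptions made for illustration.

```python
class SimulatedComm:
    """Toy stand-in for a simulated message-passing interface (hypothetical)."""

    def __init__(self, nranks: int, latency_s: float, bandwidth_bytes_per_s: float):
        self.clock = [0.0] * nranks  # one virtual clock per rank
        self.latency_s = latency_s
        self.bandwidth = bandwidth_bytes_per_s

    def compute(self, rank: int, seconds: float) -> None:
        """Model a compute phase by advancing the rank's virtual clock."""
        self.clock[rank] += seconds

    def exchange_ring(self, nbytes: int) -> None:
        """Each rank sends nbytes to its right neighbour and receives from its left.

        A rank finishes once both it and its left neighbour have reached the
        exchange and the message has crossed the modelled link.
        """
        transfer = self.latency_s + nbytes / self.bandwidth
        started = list(self.clock)  # snapshot so update order does not matter
        n = len(self.clock)
        for dst in range(n):
            src = (dst - 1) % n
            self.clock[dst] = max(started[dst], started[src]) + transfer


def run_skeleton(comm: SimulatedComm, steps: int, compute_s: float, msg_bytes: int) -> float:
    """Skeleton app: the real kernels are replaced by modelled compute time."""
    for _ in range(steps):
        for rank in range(len(comm.clock)):
            comm.compute(rank, compute_s)
        comm.exchange_ring(msg_bytes)
    return max(comm.clock)  # simulated time to completion


if __name__ == "__main__":
    comm = SimulatedComm(nranks=4, latency_s=1e-6, bandwidth_bytes_per_s=1e9)
    print(run_skeleton(comm, steps=10, compute_s=1e-3, msg_bytes=1_000_000))
```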
SST/micro: Cycle-Accurate Framework
• Has a general simulation framework for integrating models
• Simulation backend is parallel
• Plenty of people involved
Some Models Currently Integrated
Gem5 is a well-known architectural simulator with models for processors, caches, busses, and network components.
MacSim provides a model of GPU/CPU cores for heterogeneous computing nodes, which can be driven from x86 or PTX (CUDA) traces.
IRIS provides a pipelined, cycle-accurate router model capable of modeling a variety of Network-on-Chip (NoC) and inter-node interconnection architectures.
PhoenixSim models photonic networks.
Leveraging Embedded Design Automation For Design Space Exploration
This stuff is essential!
Embedded Design Automation (using FPGA emulation to do rapid prototyping)
RAMP FPGA-accelerated emulation of ASIC
Or “tape out” to FPGA
Data Movement Dominates (Sandia, Micron, Columbia, LBL)
Understand the Potential of Intelligent, Stacked DRAM Technology
• Data movement is projected to account for over 75% of the power budget for an exascale platform
• Work to reduce that via
  – Optical interconnect(s)
  – 3D stacking (logic + memory + optics)
  – New memory protocols
Research Questions
  – What is the performance potential of stacked memory (power & speed)?
  – How much intelligence to put into the logic layer?
    • Atomics, gather/scatter, checksums, full-processor-in-memory
  – What is the memory consistency model for intelligent DRAM?
  – How to program it if we embed more intelligence into DRAM?
The Cost of Moving Data
Locality Management is Key
What is the best combination of software and hardware mechanisms to maximize data movement efficiency?
Vertical Locality Management
Horizontal Locality Management
Sun Microsystems
Temporal
Topological
Why Study Chip Stacking (TSVs)?
$\mathrm{Energy} = (V^2 \ast C) \ast \mathrm{Overhead} + E_{\mathrm{comm}}$
DRAM Cells Efficient
• DRAM cells require < 1 pJ to access
• Current DRAM architectures are not power efficient
• Long distances ➔ high power
• We pay for more than we get at every level
  – Cache: throw away 75-80%
  – DRAM Row: Charge 1024B for each 64B access
  – DIMM: Charge 8-9 chips/access
  – ~800 pJ/byte total
• DRAM design driven by packaging constraints
  – ~50% of DRAM chip cost is packaging, mainly in pins
  – DIMMs use multiple chips with a few data pins to achieve high BW
TSVs Reduce Costs
• TSVs orders of magnitude less energy
  – 250 fJ/bit for reading DRAM
  – 5 fJ/bit for TSV
  – 250 fJ/bit for mem. controller
  – ~0.5 pJ/bit (compared to 30pJ for conventional DIMM)
  – Don’t have to access more data than needed
• Enables....
  – Lower Capacitance: Narrower
  – Lower Overhead: Smarter
  – In-Memory computation
• Requires
  – ...changes to how we view the machine & the memory
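The per-bit figures on this slide can be checked with a few lines of arithmetic. The sketch below adds the three stacked-memory components and compares the total with the conventional DIMM figure. It assumes the slide's "30pJ" is also a per-bit number, which the slide does not state explicitly; the variable names are illustrative.

```python
FJ_PER_PJ = 1000.0

# Stacked (TSV) access path, femtojoules per bit, as listed on the slide.
stacked_fj_per_bit = {
    "dram_read": 250.0,
    "tsv": 5.0,
    "mem_controller": 250.0,
}

stacked_pj_per_bit = sum(stacked_fj_per_bit.values()) / FJ_PER_PJ

# Assumption: the slide's 30 pJ for a conventional DIMM is per bit.
dimm_pj_per_bit = 30.0
energy_ratio = dimm_pj_per_bit / stacked_pj_per_bit

# Row overfetch on a conventional DRAM: bytes charged per byte actually used.
row_overfetch = 1024 / 64

print(f"stacked path: {stacked_pj_per_bit:.3f} pJ/bit")
print(f"DIMM / stacked energy ratio: {energy_ratio:.1f}x")
print(f"row overfetch factor: {row_overfetch:.0f}x")
```

The components sum to 0.505 pJ/bit, which matches the slide's "~0.5 pJ/bit". Under the per-bit assumption, the conventional DIMM path costs roughly 59 times more energy.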
Why Photonics?
Photonics changes the rules for Bandwidth-per-Watt.
ELECTRONICS:
• Buffer, receive, and re-transmit at every router.
• Space Parallelism: Each bus lane routed independently (P ∝ N_LANES).
• Off-chip BW requires much more power than on-chip BW.
PHOTONICS:
• Modulate/receive data stream once per communication event.
• Wavelength Parallelism: Broadband switch routes entire multi-wavelength stream.
• Off-chip BW ≈ on-chip BW for nearly same power.
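The contrast between per-router retransmission and once-per-event modulation can be shown with a toy energy model. This is only a sketch of the scaling argument: the functions and all per-bit energy values are hypothetical placeholders, not measurements from the slides or from any device.

```python
def electronic_energy_pj(bits: int, hops: int, pj_per_bit_per_hop: float) -> float:
    """Electronic path: data is buffered, received, and retransmitted at every router."""
    return bits * hops * pj_per_bit_per_hop


def photonic_energy_pj(bits: int, pj_per_bit_modulate: float, pj_per_bit_receive: float) -> float:
    """Photonic path: data is modulated once and received once, regardless of hop count."""
    return bits * (pj_per_bit_modulate + pj_per_bit_receive)


if __name__ == "__main__":
    bits = 64 * 8  # one 64-byte transfer
    for hops in (1, 2, 4, 8):
        e = electronic_energy_pj(bits, hops, pj_per_bit_per_hop=1.0)  # made-up value
        p = photonic_energy_pj(bits, 1.0, 1.0)                        # made-up values
        print(f"hops={hops}: electronic={e:.0f} pJ, photonic={p:.0f} pJ")
```

With these placeholder numbers the electronic cost grows linearly with hop count while the photonic cost stays flat. Where the two curves actually cross depends on real device energies, which this sketch does not attempt to supply.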
Why Optically-Connected Memory?
Traditional Memory (electronic bus)
• Large pin-out
• Complex wiring
• Low bandwidth density
• Distance constrained by electrical limitations
• High power dissipation
Will not scale to meet power and bandwidth requirements of future high-performance computing systems
Optically-Connected Memory (optical link)
• All-optical link, no electronic bus to drive
• Bit-rate transparent link
• High bandwidth density, fewer pins
• Distance immunity at computer scale
• Low power dissipation
Enables scaling of high-performance computing through increased memory capacity and bandwidth
[Figure: CPU and HBDRAM modules connected by an electronic bus (traditional) and by an optical link (optically connected)]
Mixed Model Simulation
Cycle-accurate and energy-accurate models
[Figure: mixed-model simulation components. Labels: SST/macro; skeleton app (C, C++, Fortran); MPI Traces (DUMPI); Workload Translation; kernels; Processor Model (SST/micro & Tensilica); Address Translation; Memory Model (DRAMSim2, FLASHsim, NVRAM); NoC Model (PhoenixSim); SystemC; (C++); Fault Injection; Checkpoint/restart]
Simulator Infrastructure: Interconnects
Cycle-accurate and energy-accurate models
Developed by Sandia Collaborators
CoDEx project
Simulator Infrastructure: Memory
Cycle-accurate and energy-accurate models
Validated against Micron DRAM
HMC model coming this summer
Simulator Infrastructure
Cycle-accurate and energy-accurate models
Rewrote Columbia PhoenixSim
summer 2011
Orion-2 energy model
Validated against Cornell test parts
Simulator Infrastructure
Cycle-accurate and energy-accurate models
Full Gate-level RTL model of processor
Well characterized energy model