0 Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Lab (ABQ) Keren Bergman: Columbia...

Post on 15-Jan-2016

Arun Rodrigues, Scott Hemmert, Dave Resnick: Sandia National Lab (ABQ)

Keren Bergman: Columbia University

Bruce Jacob: U. Maryland

John Shalf, Paul Hargrove: Lawrence Berkeley National Laboratory

Gilbert Hendry: Sandia National Laboratory

Dan Quinlan, Chunhua Liao: Lawrence Livermore National Lab

Sudhakar Yalamanchili: Georgia Tech

Data Movement Dominates (DMD) and CoDEx: CoDesign for Exascale

Codesign Tools Recap: Architectural Simulation to Accelerate CoDesign

SST

• System level models

ACE

• Node level emulation

ROSE

• Application Analysis

ROSE Compiler: Enables deep analysis of application requirements, semi-automatic generation of skeleton applications, and code generation for ACE and SST.

ACE Node Emulation: Rapid design synthesis and FPGA-accelerated emulation for rapidly prototyping cycle-accurate models of manycore node designs.

SST Macro System Simulation: Enables system-scale simulation through capture of application communication traces and simulation of large-scale interconnects.

SST Micro Software Simulators: Software-based simulation at the node level.

CoDEx: CoDesign For Exascale

ASCR-funded Simulation Infrastructure Project

SST: Structural Simulation Toolkit

NNSA-funded Simulation Tools (ASC Program)

CAL: (Sandia/LBL) Computer Architecture Laboratory

Fidelity vs. Scope for Architectural Simulation Methods

ROSE Compiler: Full Program Understanding through Deep Source-Code Analysis

• Can automatically predict performance for many input codes and software optimizations

• Predict performance under different architectural scenarios

• Much faster than hardware simulation and manual modeling

ExaSAT: Exascale Static Analysis Tool: Compiler-Automated Performance Model Extraction

[Figure: ExaSAT workflow. Labels: Combustion Codes; Compiler Analysis; Dependency Graph Optimization; XML; User Parameters; Machine Parameters; Performance Model; Performance Prediction Spreadsheet]
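The ExaSAT workflow above feeds user parameters and machine parameters into a performance model. The slides do not say what form that model takes, so the following is only a minimal sketch under a stated assumption: a roofline-style bound in which predicted time is the larger of compute time and memory-transfer time. The class names, fields, and numbers are all hypothetical and are not taken from ExaSAT.

```python
from dataclasses import dataclass


@dataclass
class KernelParams:
    """Hypothetical 'user parameters' for one kernel invocation."""
    flops: float         # floating-point operations performed
    bytes_moved: float   # bytes moved to and from memory


@dataclass
class MachineParams:
    """Hypothetical 'machine parameters' for one architectural scenario."""
    peak_flops: float     # peak compute rate, flop/s
    mem_bandwidth: float  # sustained memory bandwidth, bytes/s


def predict_time(kernel: KernelParams, machine: MachineParams) -> float:
    """Roofline-style estimate (an assumption, not ExaSAT's actual model):
    the kernel is limited by whichever of compute or data movement is slower."""
    compute_time = kernel.flops / machine.peak_flops
    memory_time = kernel.bytes_moved / machine.mem_bandwidth
    return max(compute_time, memory_time)


# Same kernel evaluated under two illustrative architectural scenarios.
kernel = KernelParams(flops=2e9, bytes_moved=8e9)
machine_a = MachineParams(peak_flops=1e12, mem_bandwidth=1e11)
machine_b = MachineParams(peak_flops=1e12, mem_bandwidth=4e11)

t_a = predict_time(kernel, machine_a)
t_b = predict_time(kernel, machine_b)
print(f"scenario A: {t_a:.3f} s, scenario B: {t_b:.3f} s")
```

Because the model is a closed-form expression, sweeping many machine scenarios costs almost nothing, which is the sense in which such a model is much faster than hardware simulation.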

SST/macro: Coarse-Grained Simulation


• Runs an application code with only minor modifications

• SST/macro provides implementations of interfaces (MPI) that simulate execution and communication
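To illustrate what coarse-grained simulation of execution and communication means, here is a minimal sketch that does not use SST/macro itself. It assumes a simple latency-plus-bandwidth cost for each message and a ring exchange pattern; the function name, constants, and communication pattern are all hypothetical.

```python
LATENCY_S = 1e-6          # assumed per-message latency, seconds
BANDWIDTH_BPS = 1e10      # assumed link bandwidth, bytes per second


def simulate_ring(compute_s, msg_bytes, steps):
    """Advance a virtual clock per rank instead of doing real work.

    compute_s: per-rank compute time for one step, in seconds (one entry per rank)
    msg_bytes: size of the message each rank sends to its right neighbour
    steps:     number of compute-then-exchange iterations
    Returns the simulated finish time of the slowest rank.
    """
    num_ranks = len(compute_s)
    clock = [0.0] * num_ranks
    transfer = LATENCY_S + msg_bytes / BANDWIDTH_BPS
    for _ in range(steps):
        # Each rank finishes its modelled computation for this step.
        ready = [clock[r] + compute_s[r] for r in range(num_ranks)]
        # A message from the left neighbour arrives after the transfer delay.
        arrival = [ready[(r - 1) % num_ranks] + transfer for r in range(num_ranks)]
        # A rank proceeds only once it has both computed and received.
        clock = [max(ready[r], arrival[r]) for r in range(num_ranks)]
    return max(clock)


balanced = simulate_ring([1e-3] * 4, msg_bytes=1_000_000, steps=10)
imbalanced = simulate_ring([1e-3, 2e-3, 1e-3, 1e-3], msg_bytes=1_000_000, steps=10)
print(f"balanced: {balanced:.5f} s, imbalanced: {imbalanced:.5f} s")
```

Only the timing of compute and messages is modelled, which is why this style of simulation can reach system scale where cycle-accurate models cannot.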

SST/micro: Cycle-Accurate Framework

• Has a general simulation framework for integrating models

• Simulation backend is parallel

• Plenty of people involved

Some Models Currently Integrated


Gem5 is a well-known architectural simulator with models for processors, caches, buses, and network components.

MacSim provides a model of GPU/CPU cores or heterogeneous computing nodes, which can be driven from x86 or PTX (CUDA) traces.

IRIS provides a pipelined, cycle-accurate router model capable of modeling a variety of Network-on-Chip (NoC) and inter-node interconnection architectures.

PhoenixSim models photonic networks.

Leveraging Embedded Design Automation For Design Space Exploration

This stuff is essential!

Embedded Design Automation (using FPGA emulation to do rapid prototyping)

RAMP FPGA-accelerated emulation of ASIC, or “tape out” to FPGA

Data Movement Dominates (Sandia, Micron, Columbia, LBL): Understand the Potential of Intelligent, Stacked DRAM Technology

• Data movement is projected to account for over 75% of the power budget of an exascale platform

• Work to reduce that via

– Optical interconnect(s)

– 3D stacking (logic + memory + optics)

– New memory protocols

Research Questions

– What is the performance potential of stacked memory (power & speed)?

– How much intelligence to put into the logic layer?

• Atomics, gather/scatter, checksums, full processor-in-memory

– What is the memory consistency model for intelligent DRAM?

– How to program it if we embed more intelligence into DRAM?

The Cost of Moving Data

Locality Management is Key: What is the best combination of software and hardware mechanisms to maximize data movement efficiency?

[Figure: Vertical Locality Management and Horizontal Locality Management. Labels: Temporal; Topological; Sun Microsystems]

Why Study Chip Stacking (TSVs)?

$\text{Energy} = (V^2 \times C) \times \text{Overhead} + E_{\text{comm}}$

DRAM Cells Efficient

• DRAM cells require < 1 pJ to access

• Current DRAM architectures are not power efficient

• Long distances ➔ high power

• We pay for more than we get at every level

– Cache: throw away 75-80%

– DRAM Row: Charge 1024B for each 64B access

– DIMM: Charge 8-9 chips/access

– ~800 pJ/byte total

• DRAM design driven by packaging constraints

– ~50% of DRAM chip cost is packaging, mainly in pins

– DIMMs use multiple chips with a few data pins to achieve high BW

TSVs Reduce Costs

• TSVs: orders of magnitude less energy

– 250 fJ/bit for reading DRAM

– 5 fJ/bit for TSV

– 250 fJ/bit for mem. controller

– ~0.5 pJ/bit (compared to 30 pJ for conventional DIMM)

– Don’t have to access more data than needed

• Enables...

– Lower Capacitance: Narrower

– Lower Overhead: Smarter

– In-Memory computation

• Requires...

– ...changes to how we view the machine & the memory
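The per-bit figures quoted for the TSV-stacked path can be checked with a few lines of arithmetic. The sketch below only adds up the numbers given on the slide and compares them with the quoted conventional-DIMM figure; the variable names are illustrative.

```python
FJ = 1e-15  # one femtojoule, in joules
PJ = 1e-12  # one picojoule, in joules

# Per-bit energy components quoted for the stacked (TSV) path.
dram_read = 250 * FJ
tsv_hop = 5 * FJ
mem_controller = 250 * FJ

stacked_per_bit = dram_read + tsv_hop + mem_controller
dimm_per_bit = 30 * PJ  # quoted figure for a conventional DIMM

# How many times more energy per bit the conventional DIMM path uses.
ratio = dimm_per_bit / stacked_per_bit

# Row overfetch in a conventional DRAM: bytes charged per byte actually used.
row_overfetch = 1024 / 64

print(f"stacked path: {stacked_per_bit / PJ:.3f} pJ/bit")
print(f"DIMM / stacked energy ratio: {ratio:.1f}x")
print(f"row overfetch: {row_overfetch:.0f}x")
```

The three components sum to 505 fJ/bit, which is the "~0.5 pJ/bit" on the slide and roughly a sixtieth of the 30 pJ quoted for a conventional DIMM. The 1024B row charged for each 64B access is a separate 16x overfetch.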


Why Photonics?

ELECTRONICS: Buffer, receive, and retransmit at every router.

Space Parallelism: Each bus lane routed independently (P ∝ N_LANES).

Off-chip BW requires much more power than on-chip BW.

Photonics changes the rules for Bandwidth-per-Watt.

PHOTONICS: Modulate/receive data stream once per communication event.

Wavelength Parallelism: Broadband switch routes entire multi-wavelength stream.

Off-chip BW ≈ on-chip BW for nearly same power.

Why Optically-Connected Memory?

[Figure: CPU and HBDRAM modules connected by an electronic bus (traditional memory) and by an optical link (optically-connected memory)]

Traditional Memory: Will not scale to meet power and bandwidth requirements of future high-performance computing systems

• Large pin-out

• Complex wiring

• Low bandwidth density

• Distance constrained by electrical limitations

• High power dissipation

Optically-Connected Memory: Enables scaling of high-performance computing through increased memory capacity and bandwidth

• All-optical link, no electronic bus to drive

• Bit-rate transparent link

• High bandwidth density, fewer pins

• Distance immunity at computer scale

• Low power dissipation

Mixed Model Simulation: cycle-accurate and energy-accurate models

[Figure: mixed-model simulation stack. Labels: SST/macro; skeleton app (C, C++, Fortran); (C++); NoC Model (PhoenixSim); Memory Model (DRAMSim2, FLASHsim, NVRAM); Address Translation; Processor Model (SST/micro & Tensilica); Workload Translation; kernels; SystemC; Fault Injection; Checkpoint/restart; MPI Traces (DUMPI)]

Simulator Infrastructure: Interconnects

cycle-accurate and energy-accurate models

Developed by Sandia Collaborators

CoDEx project

Simulator Infrastructure: Memory

cycle-accurate and energy-accurate models

Validated against Micron DRAM

HMC model coming this summer

Simulator Infrastructure

cycle-accurate and energy-accurate models

Rewrote Columbia PhoenixSim, summer 2011

Orion-2 energy model

Validated against Cornell test parts

Simulator Infrastructure

cycle-accurate and energy-accurate models

Full gate-level RTL model of processor

Well-characterized energy model