© 2007 Intel Corporation
On Chip Interconnects for TeraScale Computing
Computer Arch Day, April 2, 2009
Princeton University
Partha Kundu, Microprocessor Tech. Labs
Intel Corporation
Overview of Talk
- Intel 80-core research prototype
- Support for Fine-Grained Parallelism
- Partitioning, Isolation and QoS in Interconnects
Teraflops Research Processor
Goals: Deliver tera-scale performance
- Single-precision TFLOP at desktop power
- Frequency target 5GHz
Prototype two key technologies
- On-die interconnect fabric
- 3D stacked memory
Develop a scalable design methodology
- Tiled design approach
- Mesochronous clocking
- Power-aware capability
[Die photo: 12.64mm x 21.72mm die; single tile 1.5mm x 2.0mm; I/O areas, PLL, and TAP around the periphery]
Technology: 65nm, 1 poly, 8 metal (Cu)
Transistors: 100 million (full chip), 1.2 million (tile)
Die area: 275mm2 (full chip), 3mm2 (tile)
C4 bumps: 8390
Key Ingredients for Teraflops on a Chip
- Core
- Communication
- Technology: clocking, power management techniques
[Transistor diagram: gate, source, drain, body]
[Tile block diagram: Processing Engine (PE) with 2KB data memory (DMEM), 3KB instruction memory (IMEM), a 6-read/4-write 32-entry register file, the RIB, and two FPMACs (multiply, add, normalize; 32-bit operands); crossbar router with MSINT interfaces on each link]
- High performance: dual FPMACs
- High bandwidth: low-latency router
- 65nm, eight-metal CMOS
[Fabric diagram: 5GHz tile with router (R) and synchronizers; 36-bit links to neighboring tiles and to 3D-stacked SRAM]
Industry-leading NoC: 8x10 mesh
- Bisection bandwidth of 320GB/s
- 40GB/s peak per node
- 4-byte bidirectional links
- 6-port non-blocking crossbar
- Crossbar/switch double-pumped to reduce area
Router architecture
- Source routed (see the route-computation sketch below)
- Wormhole switching
- 2 virtual lanes
- On/off flow control
Mesochronous clocking
- Key enabler for the tile-based approach
- Tile clock @ 5GHz
- Phase-tolerant synchronizers at the tile interface
High-bandwidth, low-latency fabric
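Since the fabric is source routed, the sender computes the entire port sequence up front and carries it in the flit header. Below is a minimal C sketch of dimension-ordered (X-then-Y) route computation for the 8x10 mesh; the node numbering, port encoding, and function names are illustrative assumptions, not the chip's actual header format.

    #include <stdio.h>

    /* Assumed output-port encoding -- one code per hop, not the chip's real format. */
    enum Port { PORT_N, PORT_S, PORT_E, PORT_W, PORT_PE };

    /* Dimension-ordered (X then Y) source route on a width-column mesh.
       Writes one port per hop into route[] and returns the hop count. */
    int xy_source_route(int src, int dst, int width, enum Port *route)
    {
        int sx = src % width, sy = src / width;
        int dx = dst % width, dy = dst / width;
        int hops = 0;
        while (sx != dx) { route[hops++] = dx > sx ? PORT_E : PORT_W; sx += dx > sx ? 1 : -1; }
        while (sy != dy) { route[hops++] = dy > sy ? PORT_S : PORT_N; sy += dy > sy ? 1 : -1; }
        route[hops++] = PORT_PE;          /* final hop: eject into the local PE */
        return hops;
    }

    int main(void)
    {
        enum Port route[32];
        int n = xy_source_route(0, 79, 8, route);   /* corner to corner on 8x10 */
        printf("%d hops\n", n);                     /* 7 X-hops + 9 Y-hops + eject = 17 */
        return 0;
    }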
Router Architecture
Five-port, 5-stage, two-lane, 16-FLIT FIFO, 100GB/s
Shared crossbar architecture, two-stage arbitration
[Router diagram: inputs from the PE and from the N/S/E/W MSINTs feed two lanes (lane 0, lane 1) of 39-bit flits through the stage-1 queues into the shared crossbar switch (stages 4 and 5); outputs go to PE, N, E, S, W]
Pipeline (toy model below): Buffer Write -> Buffer Read -> Route Compute -> Port/Lane Arbitration -> Switch Traversal -> Link Traversal
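For intuition, the stages behave like a simple in-order pipeline in which a flit advances one stage per cycle unless arbitration stalls it. A toy C model follows; the stage names come from the slide, everything else is an assumed simplification (in the real router, flits from different ports move through concurrently).

    /* Toy single-flit model of the router pipeline. */
    enum Stage { BUF_WRITE, BUF_READ, ROUTE_COMPUTE, PORT_LANE_ARB,
                 SWITCH_TRAVERSAL, LINK_TRAVERSAL, DONE };

    struct Flit {
        enum Stage stage;      /* current pipeline stage */
        unsigned long payload; /* 39 bits on the real links: 32 data + control */
    };

    /* One clock tick: advance unless the flit loses two-stage arbitration. */
    void tick(struct Flit *f, int arb_grant)
    {
        if (f->stage == PORT_LANE_ARB && !arb_grant)
            return;            /* stall in place; the buffer holds the flit */
        if (f->stage != DONE)
            f->stage++;        /* otherwise move one stage per cycle */
    }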
Double-pumped Crossbar Router
[Circuit detail: in stage 4, a 2:1 mux driven by clk/clkb time-multiplexes two data bits onto each 1.0mm crossbar wire; stage 5 recovers both phases with an SDFF plus latch, with Td < 125ps at a 50% duty cycle]
[Layout comparison: this work shares one double-pumped crossbar (0.53mm x 0.65mm) between lanes 0 and 1, versus one 0.8mm x 0.65mm crossbar per lane for the ISSCC 2001 design scaled to 65nm]

Router              ISSCC '01   This Work   Benefit
Latency (stages)        6           5        16.6%
Area (mm2)            0.52        0.34         34%
Transistors           284K        210K         26%
Mesochronous Interface (MSINT)
- Circular FIFO, 4-deep (toy pointer model below)
- Programmable strobe delay
- Low-latency setting
[Synchronizer circuit: Tx_clk, through a scan-programmable delay line, strobes the 38-bit Tx_data into the 4-deep FIFO; write and read state machines, synchronized to Rx_clk, manage the pointers, and data drains through the stage-1 4:1 mux as Rx_data]
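To make the FIFO's pointer discipline concrete, here is a toy C model of a 4-deep circular buffer bridging two same-frequency, unknown-phase clock domains. The real MSINT's correctness rests on the programmable strobe delay and phase-tolerant circuits, which software cannot capture, so every name here is illustrative.

    #include <stdint.h>

    #define DEPTH 4

    struct Msint {
        uint64_t slot[DEPTH];  /* 38-bit flits in the real design */
        unsigned wr;           /* advanced by the write state machine (Tx_clk domain) */
        unsigned rd;           /* advanced by the read state machine (Rx_clk domain) */
    };

    /* On each transmit-clock edge: strobe the incoming flit into the next slot. */
    void msint_write(struct Msint *m, uint64_t flit)
    {
        m->slot[m->wr] = flit;
        m->wr = (m->wr + 1) % DEPTH;
    }

    /* On each receive-clock edge: drain in order. Both clocks run at the same
       frequency, so keeping rd a fixed distance behind wr guarantees each slot
       is stable when sampled -- the job the strobe delay does in silicon. */
    uint64_t msint_read(struct Msint *m)
    {
        uint64_t flit = m->slot[m->rd];
        m->rd = (m->rd + 1) % DEPTH;
        return flit;
    }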
Low Power Clock Distribution
Global mesochronous clocking, extensive clock gating
[Clock distribution: the PLL drives differential clk/clk# up a 20mm vertical spine (M8) and across horizontal spines (M7) spanning the 12.6mm die of 1.5mm x 2mm tiles; clock gating points sit at each tile's PE and router. Simulated clock arrival times at 4GHz spread over 0-250ps across the die]
Fine Grain Power Management
- Modular design enables fine-grained power management
- New instructions to sleep/wake any core as applications demand
- Chip voltage & frequency control (0.7-1.3V, 0-5.8GHz)
- 1680 dynamic power-gating regions on chip
Dynamic sleep
- STANDBY: memory retains data, 50% less power/tile
- FULL SLEEP: memories fully off, 80% less power/tile
21 sleep regions per tile (not all shown)
Sleep coverage per block:
- FP Engine 1: 90% on sleep
- FP Engine 2: 90% on sleep
- Router: 10% on sleep (stays on to pass traffic)
- Data Memory: 57% on sleep
- Instruction Memory: 56% on sleep
Leakage Savings
2X-5X leakage power reduction
[Tile diagram annotated with estimated per-block leakage, awake vs. asleep: 100mW -> 20mW for each FP engine, with reductions such as 63mW -> 21mW, 42mW -> 8mW, and 15mW -> 7.5mW across the instruction memory, data memory, register file, and router]
- NMOS sleep transistors
- Regulated sleep for memory arrays (memory clamping)
- Impact: area 5.4%, FMAX 4%
[Clamp circuit: a comparator with VREF = Vcc - VMIN regulates the virtual ground Vssv to the standby VMIN so the array retains state; the wake signal turns the sleep device fully on]
Estimated breakdown @ 1.2V, 110C
Router Power
Activity-based power management
- Individual port enables
- Queues put on sleep and clock gated when a port is idle
[Diagram: Port_en drives both the sleep transistor on the input queue array (register file) and the gated clock derived from the global clock buffer (Gclk); the same control applies to ports 1-4 across lanes 0 & 1, the MSINT, and stages 2-5]
[Plot: measured router power vs. number of active ports (0-5) at 1.2V, 5.1GHz, 80°C: 924mW with all ports active, down to 126mW when idle]
Measured ~7X power reduction for idle routers (924mW / 126mW ≈ 7.3X)
Communication Power
Estimated power breakdown (4GHz, 1.2V, 110C):
- Clocking: 33%
- Queues + datapath: 22%
- Arbiters + control: 7%
- Links: 17%
- MSINT: 6%
- Crossbar: 15%
Observations:
- A significant share (>80%) of the power is in the router rather than the wires
- Router power goes primarily to the crossbar and buffers
- Clocking power with a forwarded clock can be expensive
Energy Efficiency of Interconnection Networks
- Topology work (low diameter):
  o Concentrated Mesh, Balfour et al., ICS 2006
  o Flattened Butterfly, Kim et al., MICRO 2008
  o Multi-drop Express Channels, Grot et al., HPCA 2009
- Router micro-architecture:
  o Minimize dynamic buffer usage: Express Virtual Channels, Kumar et al., ISCA 2007
  o Reduce buffers: Rotary Router, Abad et al., ISCA 2007; ViChaR, Nicopoulos et al., MICRO 2006
- Clocking schemes: synchronous vs. GALS (async, mesochronous)
Flavors of Parallelism
Support multiple types of parallelism
- Vector parallelism
- Loop (non-vector) parallelism
- Task parallelism (irregular)
A single RMS application might use multiple types of parallelism, sometimes even nested
Need to support them at the same time
Asymmetry
Sources of asymmetry
- Applications
- MCA: heterogeneous cores, SMT, NUCA cache hierarchies
[Charts: the same workload on an 8-core MCA as 8 tasks vs. as 32 tasks]
With coarse tasks there is significant inefficiency due to load imbalance; fine-grained parallelism can mitigate performance asymmetry
Less load imbalance: 25% better performance
Platform Portability
Consider an 8-core MCA:
[Charts: split into 8 tasks, the application achieves 4X speedup on 4 cores and 8X on 8 cores, but only 4X on 6 cores; split into 32 tasks, the same 6 cores yield 5.3X (round-counting arithmetic below)]
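The 6-core numbers follow from simple round counting, assuming equal-sized tasks and one task per core per round:

\[
S_{8\,\mathrm{tasks}} = \frac{8}{\lceil 8/6 \rceil} = \frac{8}{2} = 4\times,
\qquad
S_{32\,\mathrm{tasks}} = \frac{32}{\lceil 32/6 \rceil} = \frac{32}{6} \approx 5.3\times
\]

With 8 coarse tasks, the second round runs only two of the six cores; the finer 32-task split keeps the tail round small.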
Platform Portability, Cont'd
Requires finer-granularity parallelism, even an order of magnitude finer
[Chart: MCA with 64 cores]
Problem Statement
Fine-grained parallelism needs to be efficiently supported in MCA
- Several key RMS modules exhibit very fine-grained parallelism
- Platform portability requires applications to expose parallelism at a finer granularity
- Asymmetries in the architecture must be accounted for
Carbon: Architectural Support for Fine-Grained Parallelism, Kumar et al., ISCA 2007
Loop-Level Parallelism
- Most common form of parallelism; supported by OpenMP, NESL
- Computation per iteration (tile) can vary dramatically, so it requires dynamic load balancing (see the sketch below)
- Significant performance potential: 13-271% over optimized S/W
[Chart: speedup (%) of sparse MVM on 64 cores over optimized software, for problem inputs c18, gis, pds, and Pilot; up to 271%]
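As a software baseline, the dynamic load balancing such a loop needs is usually expressed with a dynamic schedule. A minimal OpenMP/C sketch of sparse MVM in CSR format; the function and parameter names are illustrative:

    /* y = A*x for a CSR sparse matrix. Row cost varies with its non-zero count,
       so schedule(dynamic) rebalances iterations across cores at run time. */
    void spmv(int nrows, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(dynamic, 8)
        for (int r = 0; r < nrows; r++) {
            double sum = 0.0;
            for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[r] = sum;
        }
    }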
Task-Level Parallelism
- Irregular structured parallelism: from trees to complex dependency graphs
- Significant performance potential:
  • 15-32% for backward solve
  • Up to 241% for Canny
[Chart: speedup (%) of backward solve on 64 cores over optimized software, for problem inputs ken, mod, pds, watson, and world]
Typical Parallel Program

    void task(Node *node)
    {
        Do(node);                              /* perform this node's work */
        for (Node *c = node->children; c; c = c->next_sibling)
            Enqueue(task, c);                  /* spawn one task per child */
    }

[Figure: task dependence graph (A; then B, C, D, E; then F, G, H) alongside the sparse matrix whose non-zeros induce it]
Example module: forward solve
- Use task parallelism
- Task = unit of work
Typical Parallel Run-Time System

    void task(Node *node)
    {
        Do(node);                              /* perform this node's work */
        for (Node *c = node->children; c; c = c->next_sibling)
            Enqueue(task, c);                  /* spawn one task per child */
    }

The run-time system creates a tuple (task, child, ...) to represent each task
- GPUs and conventional S/W (e.g., TBB) do this today
[Figure: a software data structure holding the queued tuples T1, T2, ..., Tn]
Multiple implementation options (one sketched below)
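One conventional implementation of that data structure is a lock-protected circular queue of task tuples; a minimal C sketch (the names and fixed capacity are assumptions, and overflow handling is omitted for brevity):

    #include <pthread.h>

    typedef void (*TaskFn)(void *);
    struct Tuple { TaskFn fn; void *arg; };   /* the (task, child) tuple */

    #define QSIZE 4096
    static struct Tuple queue[QSIZE];
    static int head, tail;                    /* single head/tail: serial access */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    void Enqueue(TaskFn fn, void *arg)
    {
        pthread_mutex_lock(&qlock);           /* every thread contends on one lock */
        queue[tail % QSIZE] = (struct Tuple){ fn, arg };
        tail++;
        pthread_mutex_unlock(&qlock);
    }

    int Dequeue(struct Tuple *out)            /* returns 0 when the pool is empty */
    {
        pthread_mutex_lock(&qlock);
        int ok = (head != tail);
        if (ok) { *out = queue[head % QSIZE]; head++; }
        pthread_mutex_unlock(&qlock);
        return ok;
    }

The single lock around head/tail is exactly the serialization the next slide identifies as the bottleneck.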
Need for Hardware Acceleration
Software "Enqueue" & "Dequeue" are slow
- Serial access to head/tail
- Additional overhead for smart ordering of tasks: placement, cache/data locality, process prioritization, etc.
- Overheads increase with more threads
For "frequent" enqueue/dequeue
- Per-operation resource checking overhead is wasteful
- Hardware does a fine job with scheduling
- Allow fall-back to software on hardware queue limits (overflow/underflow), as sketched below
Accelerate data structure accesses & task ordering with H/W
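The fall-back can be pictured as a thin wrapper over the software queue sketched above (reusing its TaskFn, Tuple, Enqueue, and Dequeue): try the hardware task queue first, spill to software on overflow or underflow. The hw_enqueue/hw_dequeue intrinsics are hypothetical placeholders, not Carbon's actual interface:

    /* Hypothetical hardware task-queue hooks -- placeholders, not a real ISA.
       Each returns 0 when the hardware queue is full (enqueue) or empty (dequeue). */
    extern int hw_enqueue(TaskFn fn, void *arg);
    extern int hw_dequeue(struct Tuple *out);

    void enqueue_task(TaskFn fn, void *arg)
    {
        if (!hw_enqueue(fn, arg))
            Enqueue(fn, arg);        /* hardware full: spill to the software queue */
    }

    int dequeue_task(struct Tuple *out)
    {
        if (hw_dequeue(out))
            return 1;                /* common case: hardware hit */
        return Dequeue(out);         /* hardware empty: drain the software spill */
    }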
uArch Support for Carbon
[Diagram: cores C1...Cn, each with an L1 cache and a Local Task Unit (LTU), connected through the cache hierarchy to a Global Task Unit (GTU)]
- Global Task Unit (GTU): task pool storage
- Local Task Unit (LTU): prefetches and buffers tasks
Performance: Loop/Vector Parallelism
- Significant performance benefit over optimized S/W: 88% better on average
- Similar performance to "Ideal": 3% worse on average
[Chart: loop benchmarks]
Performance: Task-Level Parallelism
- Significant performance benefit over optimized S/W: 98% better on average
- Similar performance to "Ideal": 2.7% worse on average
Task Queuing in the Interconnect
- The GTU/LTU arrangement suffices for the apps studied
- But long vectors running on an MCA can be tricky:
  o Requires dynamic resource discovery
  o Vector length breaks Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, Fung et al. (MICRO '07)
Virtualization and Partitioning of On-Chip Resources
[Diagram: a many-core chip of low-power IA cores, four per MLC cluster, tied together by the interconnect; separate domains run data mining, OLTP, networking, and analytics workloads side by side]
Virtualize: interconnect, cache, memory, I/O, etc.
Partition: cores, private caches, dedicated interconnect
Focus: domain isolation for performance and security
- Shared channel reservation
- Application isolation enforced by the interconnect
- Allow "arbitrary" shaped domains
[Diagram: three arbitrarily shaped domains (1, 2, 3) carved out of the mesh]
Isolation with Fault Tolerance
[Diagrams: a fault inside a partition strands good cores; reconfiguring partitions 1, 2, and 3 recovers them]
- A fault can make good cores un-usable
- Reconfigure partitions
- Enable fault discovery & repartitioning
Route Flexibility
Motivation
- Performance isolation
- Fault tolerance
- Topology independence
- Load balancing and improved network efficiency
[Example floorplan: 2 big cores, 12 small cores, 4 MCs]
Challenge: achieving routing flexibility at low area/timing overhead (see the sketch below)
Logic Based Distributed Routing in NoC, Flich & Duato, Computer Architecture Letters, Jan 2008
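The flavor of the logic-based approach: replace per-router routing tables with a handful of connectivity and routing bits, combined with the destination's direction. The C sketch below is a simplified illustration of that idea under assumed bit semantics, not the paper's exact LBDR logic:

    enum Dir { DIR_N, DIR_E, DIR_S, DIR_W };

    struct RouterCfg {
        unsigned conn;       /* connectivity bit per direction: does the link exist? */
        unsigned route[4];   /* routing bits: turns still permitted after exiting d */
    };

    /* Bitmask of output ports a packet at (x,y) may take toward (dx,dy). */
    unsigned route_outputs(const struct RouterCfg *c, int x, int y, int dx, int dy)
    {
        unsigned want = 0;   /* productive directions toward the destination */
        if (dy < y) want |= 1u << DIR_N;
        if (dx > x) want |= 1u << DIR_E;
        if (dy > y) want |= 1u << DIR_S;
        if (dx < x) want |= 1u << DIR_W;

        unsigned ok = 0;
        for (int d = 0; d < 4; d++) {
            if (!(want & (1u << d)) || !(c->conn & (1u << d)))
                continue;    /* not productive, or the link is absent/faulty */
            /* allow d if it finishes the route, or a permitted turn remains */
            if ((want & ~(1u << d)) == 0 || (c->route[d] & want))
                ok |= 1u << d;
        }
        return ok;
    }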
Fair BW Allocation
Adversarial traffic to a shared (critical) resource => network hot spots can lead to unfair resource access (a metering sketch follows)
[Plots: accepted traffic (flits/cycle, 0-0.1) at each node of a 6x6 mesh when all 36 nodes direct traffic to node 0; nodes near the hot spot capture far more than their fair share. A second plot shows the same data rotated 90°]
Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks, Lee, Ng, Asanovic, ISCA 2008
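Roughly, GSF meters injection per source per global frame, so nodes far from a hot spot keep their share. A toy C sketch with assumed constants and a simplified frame-reclaim step (the real design coordinates frame reclamation through a fast, barrier-like network):

    #define FRAMES  4     /* frames in flight (assumed) */
    #define BUDGET 16     /* flits each source may inject per frame (assumed) */

    struct Source { int spent[FRAMES]; };

    /* Inject into the earliest frame with budget left; returns the frame used,
       or -1 meaning the source must stall until the head frame is reclaimed. */
    int gsf_inject(struct Source *s, int head_frame)
    {
        for (int k = 0; k < FRAMES; k++) {
            int f = (head_frame + k) % FRAMES;
            if (s->spent[f] < BUDGET) { s->spent[f]++; return f; }
        }
        return -1;
    }

    /* Called when the network signals the head frame has fully drained:
       its budget is recycled for a future frame. */
    void gsf_reclaim(struct Source *s, int head_frame)
    {
        s->spent[head_frame] = 0;
    }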
Technical Contributors
Mani Azimi, Akhilesh Kumar, Roy Saharoy, Aniruddha Vaidya, Sriram Vangal, Yatin Hoskote, Sanjeev Kumar, Anthony Nguyen, Chris Hughes, Niket Agarwal, and the prototype design team
Acknowledging the support of Mani Azimi and Nitin Borkar
Summary
• Energy-efficient interconnects that scale are important for future multi-cores
• The interconnect can play a part in thread scheduling
• Application consolidation on many-cores requires close cooperation with the run-time
• Efficient support is required for route flexibility