EE482S Lecture 1 Stream Processor...

45
Copyright (C) by William J. Dally, All Rights Reserved EE482C, L1, Apr 4, 2002 1 EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J. Dally Computer Systems Laboratory Stanford University [email protected]

Transcript of EE482S Lecture 1 Stream Processor...

Page 1: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 1

EE482SLecture 1

Stream Processor Architecture

April 4, 2002

William J. DallyComputer Systems Laboratory

Stanford [email protected]

Page 2: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 2

Today’s Class Meeting

• What is EE482C?– Material covered– Course format– Assignments– Scribing

• What is Stream Processor Architecture?– What problem is being solved

• Performance scaling, power efficiency, bandwidth bottlenecks– What is a stream program?– What is a stream processor

Page 3: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 3

What is EE482C?

• New course in EE482 sequence (advanced comp. arch.)– EE482A – superscalar architecture– EE482B – interconnection networks– EE482C – stream processor architecture

• Course format– Readings – typically one or two papers per class meeting

• Read the paper before the meeting for which it is listed• E.g., read Rixner et al. before 4/9/02

– Mix of lecture and discussion in class meetings• Be prepared to discuss each reading

– Two programming assignments• One in StreamC/KernelC, one in Brook

– Class project • Original research on stream architecture or programming

Page 4: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 4

Where to find more information

• Course policy sheet• Web page http://cva.stanford.edu/ee482s• Class schedule

– Subject to change

Page 5: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 5

What is a Stream Processor?

• A stream program is a computation organized as streams of records operated on by computation kernels.

• A stream processor is optimized to exploit the locality and concurrency of stream programs

• More later

Page 6: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 6

Stream Processing is becoming pervasive

InsightsPeter Huber, 01.07.02A new type of computing--"stream computing"--has emerged to handle the back end of sonar, radar, X-ray sources and certain broadband applications such as voice-over Internet and digital TV.

Page 7: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 7

Motivation – Why do we need a stream processor?

• Application demand• Power• Bandwidth• Performance Scaling

Page 8: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 8

Application Pull

• Emerging media applications demand 10s of GOPS to TOPs with low power.– Video compression– Image understanding– Signal processing– Graphics

• Scientific applications also need TOPs of performance with reasonable cost

Page 9: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 9

Conventional Processors No Longer Scale Performance by 50% each year

1e+0

1e+1

1e+2

1e+3

1e+4

1e+5

1e+6

1e+7

1980 1990 2000 2010 2020

Perf (ps/Inst)52%/year

19%/year

ps/gate 19%Gates/clock 9%

Clocks/inst 18%

Page 10: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 10

Future potential of novel architecture is large (1000 vs 30)

1e-41e-31e-21e-11e+01e+11e+21e+31e+41e+51e+61e+7

1980 1990 2000 2010 2020

Perf (ps/Inst)Linear (ps/Inst)

52%/year

74%/year

19%/year30:1

1,000:1

30,000:1

Page 11: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 11

Clock Scaling: Historical and Projected

0

1

10

100

1000

10000

1970

1972

1974

1976

1978

1980

1982

1984

1986

1988

1990

1992

1994

1996

1998

2000

2002

2004

2006

2008

2010

2012

2014

Clo

ck S

peed

(MH

z)

8FO416FO4Intel Processors

375 FO4

70 FO4

85 FO4115 FO4

230 FO4185 FO4

175 FO4

53 FO4

40 FO446 FO4

25 FO4

9 FO4

15 FO4

10µm 3µm 1µm1.5µm 0.8µm 0.35µm 0.18µm0.13µm 0.07µm

0.1µm 0.05µm0.035µm

Page 12: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 12

Anant Agarwal (MIT) Panel at HPCA ‘02

Page 13: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 13

Pentium III vs. Pentium IV

2x108108# Grids

12KBytes16KBytesL1 D$ Capacity

42 million24 millionTransistor Count

217mm2106mm2Die Size

0.350.45SpecInt/MHz

524454SpecInt2000

1.5GHz (10.4 FO4)1GHz (15 FO4)Clock Rate

2010Pipeline Stages

180nm180nmTechnology

Pentium IVPentium III

Page 14: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 14

Why do Special-Purpose Processors Perform Well?

Lots (100s) of ALUs Fed by dedicated wires/memories

Page 15: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 15

Care and Feeding of ALUs

Instr.CacheIP

Instruction Bandwidth

DataBandwidth

IR

Regs

‘Feeding’ Structure Dwarfs ALU

Page 16: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 16

Stream Programs make communication explicit

• This reduces energy and delay

Page 17: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 17

Energy is a matter of distance (interconnect)

1.1nJ 130pJExecute a uP instruction (SB-1)

10pJ 0.6pJ32b Register Read

1.9nJ 1.9nJTransfer 32b off chip (200M HSTL)1.3nJ 400pJTransfer 32b off chip (2.5G CML)

100pJ 17pJTransfer 32b across chip (10mm)50pJ 3pJRead 32b from 8KB RAM

5pJ 0.3pJ32b ALU Operation

Energy(0.13um) (0.05um)

Operation

300: 20:1 off-chip to global to local ratio in 20021300: 56:1 in 2010

Page 18: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 18

Interconnect dominates delay

2800ps 4600psTransfer 32b across chip (20mm)

325ps 125ps32b Register Read

1400ps 2300psTransfer 32b across chip (10mm)780ps 300psRead 32b from 8KB RAM

650ps 250ps32b ALU Operation

Delay(0.13um) (0.05um)

Operation

2:1 global on-chip comm to operation delay9:1 in 2010

Page 19: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 19

What is a Stream Program?

• A program organized as streams of records flowing through kernels

• Example, stereo depth extraction

Page 20: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 20

Stereo Depth Extraction Stream Program

Kernels can be partitioned across chips to exploit task parallelism.

Kernels exploit both instruction (ILP) and data (SIMD) level parallelism.

SAD

convolve

Streams expose producer-consumer locality.

Image 1 convolve

Image 0 convolve convolve

Depth Map

The stream model exploits parallelism without the complexity of traditional parallel programming.

Page 21: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 21

Why Organize an Application This Way?

• Expose parallelism at three levels– ILP within kernels– DLP across stream elements– TLP across sub-streams and across kernels– Keeps ‘easy’ parallelism easy

• Expose locality in two ways– Within a kernel – kernel locality– Between kernels – producer-consumer locality– This locality can be exploited independent of spatial or temporal

locality– Put another way, stream programs make communication explicit

Page 22: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 22

Streams expose Kernel Locality missed by Vectors

• Streams– Traverse operations first

• All operations for one record, then next record

• Smaller working set of temporary values

– Store and access whole records as a unit

• Spatial locality of memory references

– e.g., get contiguous record on gather/scatter

• Vectors– Traverse records first

• All records for one operation, then next operation

• Large set of temporary values– Group like-elements of records

into vectors• Read one word of each record at

a time– No locality on gather/scatter

Page 23: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 23

Example – Vertex Transform

×=

wzyx

T

wzyx

''''

input record

result recordintermediate results

x

y

z

w

t00x

t10x

t20x

t30x

t01y

t11y

t21y

t31y

t02z

t12z

t22z

t32z

t03w

t13w

t23w

t33w

x’

y’

z’

w’

Page 24: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 24

Vertex transform with Streams vs. Vectors

• Small set of intermediate results– enable small and fast LRFs

• Large working set of intermediates– VL times larger (e.g. 64x)– Must use a large, slow global

VRF

Page 25: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 25

What class of applications can be written as stream programs?

• Media applications (signal, image, video, packet, and graphics processing) are naturally expressed in this style

• Scientific applications can be efficiently cast as stream programs

• Others?– This is an open question

• Hypothesis– Any application with a long run time (large operation count) has a

great deal of parallelism and hence can be cast as a stream program.

Page 26: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 26

Example, Ingress Packet Functions

Finalize RouteClassify Pkt

Route Table1st Address

Calc

Rd

Route Table2nd Address

Calc

Rd

Route Table3rd Address

Calc

Rd

PolicingCalculation

ComputeStatisticsUpdates

F&A F&ACompute

Packet QueueAddress

Wr

Extract HeaderFields

Packet Checks

Page 27: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 27

Split into Kernels

Finalize RouteClassify Pkt

Route Table1st Address

Calc

Rd

Route Table2nd Address

Calc

Rd

Route Table3rd Address

Calc

Rd

PolicingCalculation

ComputeStatisticsUpdates

F&A F&ACompute

Packet QueueAddress

Wr

Extract HeaderFields

Packet Checks

Page 28: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 28

Software Pipeline The Loop

i-6 i-4 i-2 iArithmeticClusters

MemoryChannel 1

MemoryChannel 2

i-1i-3i-5

i-7

i-7

Each block represents a kernel ormemory operation on a strip of packets

Page 29: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 29

Is it hard to write stream programs?

• We will let you be the judge of that.• It does constrain how you write a program• Depends on the quality of the programming tools.

Page 30: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 30

What is a Stream Processor?

• A processor that is optimized to execute a stream program

• Features include– Exploit parallelism

• TLP with multiple processors• DLP with multiple clusters within each processor• ILP with multiple ALUs within each cluster

– Exploit locality with a bandwidth hierarchy• Kernel locality within each cluster• Producer-consumer locality within each processor

• Many different possible architectures

Page 31: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 31

The Imagine Stream Processor

Stream Register File NetworkInterface

StreamController

Imagine Stream Processor

HostProcessorA

LU

Clu

ster

0

AL

U C

lust

er 1

AL

U C

lust

er 2

AL

U C

lust

er 3

AL

U C

lust

er 4

AL

U C

lust

er 5

AL

U C

lust

er 6

AL

U C

lust

er 7

SDRAMSDRAM SDRAMSDRAM

Streaming Memory SystemM

icro

cont

rolle

r

Net

wor

k

Page 32: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 32

Arithmetic Clusters

CU

Inte

rclu

ster

Net

wor

k

+

From SRF

To SRF

+ + * * /

Cross Point

Local Register File

Page 33: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 33

A Bandwidth Hierarchy exploits locality and concurrency

2GB/s 32GB/s

SDRAM

SDRAM

SDRAM

SDRAM

Stre

am

Reg

iste

r File

ALU Cluster

ALU Cluster

ALU Cluster

544GB/s

• VLIW clusters with shared control• 41.2 32-bit floating-point operations per word of memory BW

Page 34: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 34

A Bandwidth Hierarchy exploits kernel and producer-consumer locality

Memory BW Global RF BW Local RF BWDepth Extractor 0.80 GB/s 18.45 GB/s 210.85 GB/s

MPEG Encoder 0.47 GB/s 2.46 GB/s 121.05 GB/s

Polygon Rendering 0.78 GB/s 4.06 GB/s 102.46 GB/s

QR Decomposition 0.46 GB/s 3.67 GB/s 234.57 GB/s

2GB/s 32GB/s

SDRAM

SDRAM

SDRAM

SDRAMSt

ream

R

egis

ter F

ile

ALU ClusterALU Cluster

ALU Cluster

544GB/s

Page 35: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 35

Producer-Consumer Locality in the Depth Extractor

Memory/Global Data SRF/Streams Clusters/Kernels

new partial sums

sharpened row

Convolution(Laplacian)

row of pixels

previous partial sumsConvolution(Gaussian)

new partial sums

blurred row

previous partial sums

filtered row segment

filtered row segment

previous partial sums

new partial sumsdepth map row segment

SAD

1 : 23 : 317

Page 36: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 36

Bandwidth Demand of Applications

Page 37: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 37

Overview

Page 38: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 38

Die Photos

• 21 M transistors / TI 0.15µm 1.5V CMOS / 16mm x 16mm• 300 MHz TTTT, hope for 400 MHz in lab• Chips arrived 4/1/02, no fooling!

[Khailany et al., ICCD ’02 (submitted)]

Page 39: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 39

Performance demonstrated on signal and image processing

12.1

19.8

11.0

23.925.6

7.0

0

5

10

15

20

25

30

GO

PS

depth mpeg qrd dct convolve fft

16-bit kernels16-bit

applications

floating-pointapplication

floating-pointkernel

Page 40: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 40

Initial studies indicate that it also applies to solving PDEs and ODEs

Page 41: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 41

Architecture of a Streaming Supercomputer

StreamProcessor64 FPUs

64GFLOPS

16 xDRDRAM2GBytes

38GBytes/s

20GBytes/s32+32 pairs

Node

On-Board Network

Node2

Node16 Board 2

16 Nodes1K FPUs1TFLOPS32GBytes

Intra-Cabinet Network(passive - wires only)

Board 64

160GBytes/s256+256 pairs

10.5" Teradyne GbX

Board

Cabinet

Inter-Cabinet Network

Cabinet 264 Boards1K Nodes64K FPUs64TFLOPS

2TBytes

E/OO/E

5TBytes/s8K+8K links

Ribbon Fiber

Cabinet 16

Bisection 64TBytes/s

All links 5Gb/s per pair or fiberAll bandwidths are full duplex

Page 42: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 42

Streaming processor

Stream Execution Unit

Stream Register File

Scal

arPr

oces

sor

AddressGenerators

MemoryControl

On-

Chi

pM

emor

y

LocalDRDRAM(38 GB/s)

Net

wor

kIn

terfa

ce

NetworkChannels(20 GB/s)

Clu

ster

0

Clu

ster

1

Clu

ster

15

CommFP

MUL

FPMUL

FPADD

FPADD

FPDSQ

Regs

Regs

Regs

Regs

Clu

ster

Sw

itch

Regs

Regs

Regs

Regs

Regs

Regs

Regs

Regs

Scra

tch

Pad

Regs

Stream Execution Unit

Cluster

Stream Processor

To/From SRF(260 GB/s)

LRF BW = 1.5TB/s

Page 43: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 43

Rough per-node budget

200200Processor chip

4$/M-GUPS (250/node)

15$/GFLOPS (64/node)

976Per-Node Cost

501Power

4950000Cabinet

1883000Board/Backplane

32020Memory chip

50200Router chip

Per NodeCostItem

Preliminary numbers, parts cost only, no I/O included.

Page 44: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 44

Many open problems

• A small sampling• Software

– Program transformation– Program mapping– Bandwidth optimization– Conditionals– Irregular data structures

• Hardware– Alternative stream models– Register organization– Bandwidth hierarchies– Memory organization– Short stream issues– ISA design– Cluster organization– Processor organization

Page 45: EE482S Lecture 1 Stream Processor Architecturecva.stanford.edu/classes/ee482c/slides/lect01_slides.pdf– EE482C – stream processor architecture • Course format – Readings –

Copyright (C) by William J. Dally, All Rights ReservedEE482C, L1, Apr 4, 2002 45

Next Time

• Discuss Imagine paper• Discuss the stream programming model