Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.


Transcript of Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

Page 1: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

Stream Register Files with Indexed Access

Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally

Page 2: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 2

Scaling Trends

• ILP increasingly harder and more expensive to extract

CPU data courtesy of Francois Labonte, Stanford University

• Graphics processors exploit data parallelism

[Figure: two charts with a time axis from Jan-85 to Jun-01. First chart: CPUs, SPECint2000 per MHz (0.0 to 0.6), for 80386, 80486, Pentium, Pentium II, Pentium III, and Pentium 4. Second chart: graphics processors, vertices per MHz (0 to 0.8), for nVidia parts, with NV10 and NV35 labeled.]

Page 3: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 3

Renewed Interest in Data Parallelism

• Data parallel application classes
  – Media, signal, network processing, scientific simulations, encryption, etc.
• High-end vector machines
  – Have always been data parallel
• Academic research
  – Stanford Imagine, Berkeley V-IRAM, programming GPUs, etc.
• “Main-stream” industry
  – Sony Emotion Engine, Tarantula, etc.

Page 4: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 4

Storage Hierarchy

• Bandwidth taper
• Only supports sequential streams/vectors
• But many data parallel apps with
  – Data reorderings
  – Irregular data structures
  – Conditional accesses

[Figure: storage hierarchy with DRAM, cache, stream/vector storage, and arithmetic units (+, x).]

Page 5: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 5

Sequential Streams/Vectors Inefficient

Evaluate arbitrary order access to streams

[Figure: timeline across memory/cache, stream/vector storage, and compute units. 4x4 matrices a, b, and c (elements a00 to a33, and so on) move as sequential streams in row-major order (a00, a01, a02, ...) and column-major order (b00, b10, b20, ...), with a Reorder step between the two orders.]

Page 6: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 6

Outline

• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion

Page 7: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 7

Stream Programming

• Streams of records passing through compute kernels
• Parallelism
  – Across stream elements
  – Across kernels
• Locality
  – Within kernels
  – Between kernels

[Figure: input streams in1 and in2 pass through three chained FFT_stage kernels to the output stream (Out).]

Page 8: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 8

Bandwidth Hierarchy

• Stream programming is well matched to bandwidth hierarchy

[Figure: timeline of three FFT_stage kernels across memory, stream register file (SRF), and compute units.]

Page 9: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 9

Stream Processors

• Several lanes
  – Execute in SIMD
  – Operate on records
• Inter-cluster network

[Figure: lanes 0 to N-1, each pairing compute cluster i with SRF bank i, plus the inter-cluster network, a memory switch, and the memory system.]

Page 10: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 10

Outline

• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion

Page 11: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 11

Stream-Level Data Reuse

• Sequential streams only capture in-order reuse
• Arbitrary access patterns in SRF capture more of available temporal locality

[Diagram: taxonomy of stream data reuse]
• Sequential (in-order) reuse, e.g.: linear streams
• Non-sequential reuse
  – Reordered reuse, e.g.: 2-D, 3-D accesses, multi-grid
  – Intra-stream reuse, e.g.: irregular neighborhoods, table lookups

Page 12: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 12

Reordered Reuse

[Figure: timeline across memory/cache, stream register file (SRF), and compute clusters. 4x4 matrix a streams in row order into a first 1D FFT, giving b; b goes through a Reorder step into column order (b00, b10, b20, ...) for a second 1D FFT, giving c.]

• Indexed SRF access eliminates reordering through memory

Page 13: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 13

Reordered Reuse

[Figure: build step of the previous slide's figure. 4x4 matrices a, b, and c move between memory/cache, SRF, and compute clusters through two 1D FFT passes, with the Reorder step labeled.]

• Indexed SRF access eliminates reordering through memory
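The reordering point can be illustrated with a small model. This is only a sketch under stated assumptions: the "SRF" is a flat Python list holding a 4x4 matrix in row-major order, and the function names and the traffic count (one store plus one load per word) are hypothetical, not taken from the slides.

```python
# Hypothetical model: a 4x4 matrix held row-major in a flat "SRF" buffer.
N = 4
srf = [f"b{r}{c}" for r in range(N) for c in range(N)]

def reorder_through_memory(buf, n):
    """Sequential-only SRF: store the data to memory and reload it transposed.
    Returns the reordered buffer and the words moved over the memory interface
    (assumed here to be one store plus one load per word)."""
    transposed = [buf[r * n + c] for c in range(n) for r in range(n)]
    words_moved = 2 * len(buf)
    return transposed, words_moved

def indexed_column_read(buf, n):
    """Indexed SRF: the kernel computes the address r*n + c itself and reads
    the data in column order, with no round trip through memory."""
    return [buf[r * n + c] for c in range(n) for r in range(n)]

via_memory, traffic = reorder_through_memory(srf, N)
via_index = indexed_column_read(srf, N)
```

Both paths deliver the same column-order stream (b00, b10, b20, b30, b01, ...); in this model the indexed path simply avoids the extra memory traffic.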

Page 14: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 14

Intra-stream Reuse

• Indexed SRF access eliminates
  – Replication in SRF
  – Redundant memory transfers

[Figure: timeline across memory/cache, SRF, and compute clusters. Records A to H are loaded, and a Replicate step places repeated copies of records (A, B, C, D) in stream order ahead of the compute step.]

Page 15: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 15

Intra-stream Reuse

[Figure: build step of the previous slide's figure. Records A to H move between memory/cache, SRF, and compute clusters, with Replicate steps producing repeated copies of A, B, C, and D.]

• Indexed SRF access eliminates
  – Replication in SRF
  – Redundant memory transfers
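A rough way to see the replication cost is to count stored words under each scheme. The graph, values, and function names below are made up for illustration; the sketch only contrasts a replicated neighbor stream with a store-once layout read through an index stream.

```python
# Hypothetical irregular graph: each node lists the neighbors it reads.
neighbors = {
    "A": ["B", "D"],
    "B": ["A", "C", "D"],
    "C": ["B", "D"],
    "D": ["A", "B", "C"],
}
values = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}

def replicated_stream(neighbors, values):
    """Sequential-only SRF: neighbor values are copied into stream order,
    so a value appears once per node that reads it."""
    return [values[m] for n in neighbors for m in neighbors[n]]

def indexed_layout(neighbors, values):
    """Indexed SRF: each value is stored once; kernels fetch it by index."""
    order = sorted(values)
    slot = {name: i for i, name in enumerate(order)}
    data = [values[name] for name in order]
    index_stream = [slot[m] for n in neighbors for m in neighbors[n]]
    return data, index_stream

rep = replicated_stream(neighbors, values)
data, idx = indexed_layout(neighbors, values)
gathered = [data[i] for i in idx]
```

The index stream still has one entry per neighbor reference, so the saving in this model comes from storing each record once; it grows when records are several words wide.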

Page 16: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 16

Conditional Accesses

• Fine-grain conditional accesses
  – Expensive in SIMD architectures
  – Translate to conditional address computation
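One way to picture "conditional address computation" is a merge step (merge-sort is one of the benchmarks listed later). The sketch below is an assumption about how such a kernel could be phrased, not the authors' code: every iteration reads both inputs by index and advances exactly one index by a 0/1 predicate, so no data-dependent stream pop is needed. The sentinel padding is a convenience of this sketch.

```python
def merge_indexed(a, b):
    """Merge two sorted lists using only index arithmetic on the inputs."""
    sentinel = float("inf")
    a = a + [sentinel]   # padding keeps reads in range once one side is exhausted
    b = b + [sentinel]
    i = j = 0
    out = []
    for _ in range(len(a) + len(b) - 2):
        x, y = a[i], b[j]          # two indexed reads per step
        take_a = int(x <= y)       # predicate computed as data
        out.append(x if take_a else y)
        i += take_a                # conditional address computation
        j += 1 - take_a
    return out

merged = merge_indexed([1, 4, 9], [2, 3, 10, 11])
```

Every lane executes the same instruction sequence regardless of the comparison outcome, which is what makes this form a fit for SIMD execution.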

Page 17: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 17

Outline

• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion

Page 18: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 18

Base Architecture

• Each SRF bank accesses block of b contiguous words

[Figure: compute clusters 0 to N-1, each attached to its own SRF bank over a data path labeled b*W, with the clusters joined by the inter-cluster network.]

Page 19: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 19

Indexed SRF Architecture

• Address path from clusters

• Lower indexed access bandwidth

[Figure: compute clusters 0 to N-1 and SRF banks 0 to N-1 with the inter-cluster network, plus address FIFOs on the new address path.]

Page 20: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 20

Base SRF Bank

• Several SRAM sub-arrays

• Each access is to one sub-array

[Figure: an SRF bank built from sub-arrays 0 to 3 with local word-line drivers, connected to its compute cluster.]

Page 21: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 21

Indexed SRF Bank

• Extra 8:1 mux at sub-array output
  – Allows 4x 1-word accesses

[Figure: an SRF bank with sub-arrays 0 to 3, each with its own pre-decode and row decoder, and a mux at the sub-array output, connected to the compute cluster.]
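The "4x 1-word accesses" bullet suggests that up to four independent word reads can proceed per cycle, one per sub-array. The toy model below makes two assumptions that are not on the slides: word address w maps to sub-array w % 4, and accesses issue in FIFO order with a cycle ending at the first sub-array conflict.

```python
from collections import deque

NUM_SUBARRAYS = 4  # from the slide: four sub-arrays per bank

def cycles_to_serve(addresses, num_subarrays=NUM_SUBARRAYS):
    """Count cycles needed to serve word addresses in FIFO order.
    Assumption: address w lives in sub-array w % num_subarrays, and each
    sub-array serves at most one word per cycle."""
    fifo = deque(addresses)
    cycles = 0
    while fifo:
        busy = set()
        while fifo and len(busy) < num_subarrays:
            sub = fifo[0] % num_subarrays
            if sub in busy:
                break            # conflict: this access waits for the next cycle
            busy.add(sub)
            fifo.popleft()
        cycles += 1
    return cycles

best_case = cycles_to_serve([0, 1, 2, 3])      # four different sub-arrays
worst_case = cycles_to_serve([0, 4, 8, 12])    # all map to sub-array 0
```

Under these assumptions the achieved indexed bandwidth depends on how the addresses spread over the sub-arrays, ranging from four words per cycle down to one.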

Page 22: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 22

Cross-lane Indexed SRF

• Address switch added

• Inter-cluster network used for cross-lane SRF data

[Figure: compute clusters and SRF banks with address FIFOs, an SRF address network, and the inter-cluster network.]

Page 23: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 23

Overhead - Area

• In-lane indexing overheads
  – 11% over sequential SRF
• Per-sub-array independent addressing overheads
• Cross-lane indexing overheads
  – 22% over sequential SRF
• Address switch
• 1.5% to 3% increase in die area (Imagine processor)

Page 24: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 24

Overhead - Energy

• 0.1 nJ (0.13 µm) per indexed SRF access
• ~4x sequential SRF access
• > order of magnitude lower than DRAM access
• 0.25 nJ per cache access
• Each indexed access replaces many SRF and DRAM/cache accesses
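The energy figures can be combined into a simple break-even estimate. The sketch reads "~4x" as "an indexed access costs about four sequential accesses", which gives roughly 0.025 nJ per sequential access; that derived value and the helper name are assumptions of this example.

```python
E_INDEXED_NJ = 0.1                 # per indexed SRF access (slide)
E_SEQ_NJ = E_INDEXED_NJ / 4        # derived: indexed is ~4x a sequential access
E_CACHE_NJ = 0.25                  # per cache access (slide)

def energy_saved_nj(seq_accesses_replaced, cache_accesses_replaced):
    """Energy saved when one indexed access stands in for the given accesses.
    A positive result means the indexed access is the cheaper option."""
    replaced = (seq_accesses_replaced * E_SEQ_NJ
                + cache_accesses_replaced * E_CACHE_NJ)
    return replaced - E_INDEXED_NJ

break_even = energy_saved_nj(4, 0)    # four sequential accesses: about zero
one_cache = energy_saved_nj(0, 1)     # a single cache access avoided
```

In this model an indexed access pays for itself once it replaces a single cache access, or more than four sequential SRF accesses.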

Page 25: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 25

Outline

• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion

Page 26: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 26

Benchmarks

• 64x64 2D FFT
  – 2D accesses
• Rijndael (AES)
  – Table lookups
• Merge-sort
  – Fine-grain conditionals
• 5x5 convolution filter
  – Regular neighborhood
• Irregular graph
  – Irregular neighborhood access
  – Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips

Page 27: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 27

Machine Organizations

[Figure: three machine organizations, each with compute clusters, SRF banks, an inter-cluster net, a memory switch, and DRAM. Base (sequential SRF): no additions. Base + Cache: adds a cache. Indexed SRF: adds an SRF address net.]

Page 28: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 28

Machine Parameters

Configurations: Base | Base + cache | Indexed SRF

Technology (all):      0.13 µm, 1 GHz
Compute (all):         8 compute clusters, 32 GFLOPs (peak)
SRF (all):             128 KB, 128 GB/s seq.
SRF (Indexed SRF):     128 GB/s in-lane, 32 GB/s x-lane
Cache (Base + cache):  128 KB, 16 GB/s
DRAM (all):            9.14 GB/s
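To compare these bandwidths on a common footing, they can be converted to words per cycle per lane. The 1 GHz clock and 8 clusters come from the table; the 32-bit word size and 1 GB = 10^9 bytes are assumptions of this sketch, as is the helper name.

```python
CLOCK_HZ = 1e9        # 1 GHz (slide)
LANES = 8             # 8 compute clusters, one SRF bank each (slide)
WORD_BYTES = 4        # assumption: 32-bit words

def words_per_cycle_per_lane(gb_per_s):
    """Convert an aggregate bandwidth in GB/s to words per cycle per lane."""
    return gb_per_s * 1e9 / CLOCK_HZ / WORD_BYTES / LANES

bandwidth_gb_s = {
    "SRF sequential": 128,
    "SRF in-lane indexed": 128,
    "SRF cross-lane indexed": 32,
    "Cache": 16,
    "DRAM": 9.14,
}
per_lane = {name: words_per_cycle_per_lane(bw)
            for name, bw in bandwidth_gb_s.items()}
```

Under these assumptions the SRF supplies 4 words per cycle per lane sequentially or in-lane indexed and 1 word cross-lane, while the cache and DRAM supply well under one word, which is the bandwidth taper the earlier slides refer to.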

Page 29: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 29

Off-chip Memory Bandwidth

[Figure: bar chart of normalized memory traffic (0 to 1) for FFT 2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, and IG_SCL. Series: ISRF.]

Page 30: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 30

Off-chip Memory Bandwidth

[Figure: bar chart of normalized memory traffic (0 to 1) for FFT 2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, and IG_SCL. Series: ISRF and Cache.]

Page 31: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 31

Execution Time

[Figure: stacked bar chart of normalized execution time (0 to 1) for Base, Cache, and ISRF on FFT 2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, and IG_SCL. Components: kernel loop body, memory stall, SRF stall, kernel overheads.]

Page 32: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 32

Outline

• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion

Page 33: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 33

Conclusions

• Data parallelism increasingly important
• Current data parallel architectures inefficient for some application classes
  – Irregular accesses
• Indexed SRF accesses
  – Reduce memory traffic
  – Reduce SRF data replication
  – Efficiently support complex/conditional stream accesses
• Performance improvements
  – 3% to 410% for target application classes
• Low implementation overhead
  – 1.5% to 3% die area

Page 34: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 34

Backups

Page 35: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 35

Indexed Access Instruction Overhead

[Figure: bar chart "Relative Instruction Counts of SRF Indexed Kernels" (0 to 1.2) for FFT2D, Rijndael, Sort, IG_1, and IG_2.]

• Excludes address issue instructions

Page 36: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 36

Kernel C API

    while (!eos(in)) {
        in >> a;
        LUT[a] >> b;
        c = foo(a, b);
        out << c;
    }

    LUT.index << a;
    Indep. instructions;
    LUT >> b;

• 2 separate instructions
  – Address issue
  – Data read
• Address-data separation
  – May require loop unrolling, software pipelining, etc.
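The address/data split can be mimicked in a few lines to show why software pipelining comes into play. This is a sketch, not the Kernel C API: `lookup_kernel`, its `separation` parameter (how many iterations an address is issued ahead of its data read), and the two deques are hypothetical stand-ins for the address FIFO and the in-flight loop state.

```python
from collections import deque

def lookup_kernel(inputs, lut, foo, separation=2):
    """Software-pipelined table lookup with split address issue and data read."""
    addr_fifo = deque()      # models the address FIFO between cluster and SRF
    pending = deque()        # inputs whose lookup data has not been read yet
    out = []
    for step in range(len(inputs) + separation):
        if step < len(inputs):
            a = inputs[step]
            addr_fifo.append(a)            # LUT.index << a  (address issue)
            pending.append(a)
        if step >= separation:
            b = lut[addr_fifo.popleft()]   # LUT >> b        (data read)
            out.append(foo(pending.popleft(), b))
    return out

result = lookup_kernel([2, 0, 1], [10, 20, 30], lambda a, b: a + b)
```

The loop runs `separation` extra iterations to drain the FIFO, which corresponds to the prologue and epilogue that software pipelining introduces; the results match the unsplit `LUT[a] >> b` form for any separation.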

Page 37: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 37

Sensitivity to SRF Access Latency (1)

[Figure, two charts, each with x-axis "Index and data separation (cycles)". First chart: loop length (0.9 to 1.6) over 0 to 24 cycles for FFT2D, Rijndael, Sort1, Sort2, Filter, IGraph1, and IGraph2. Second chart: average kernel execution time (0.5 to 1.1) over 0 to 12 cycles for FFT2D, Rijndael, Filter, Sort1, and Sort2.]

Page 38: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 38

Sensitivity to SRF Access Latency (2)

[Figure: average kernel execution time (0.88 to 1.02) versus index and data separation (4 to 28 cycles) for IGraph1 and IGraph2.]

Page 39: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 39

Why Graphics Hardware?

Pentium 4 SSE theoretical*
  3 GHz * 4 wide * .5 inst / cycle = 6 GFLOPS

GeForce FX 5900 (NV35) fragment shader observed:
  MULR R0, R0, R0: 20 GFLOPS
  equivalent to a 10 GHz P4

and getting faster: 3x improvement over NV30 (6 months)

*from Intel P4 Optimization Manual

[Figure: GFLOPS (0 to 25) from Jun-01 to Jul-03 for Pentium 4, NV30, and NV35.]

Slide from Ian Buck, Stanford University

Page 40: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 40

NVIDIA Graphics growth (225%/yr)

Season  Product    Process  # Trans  Gflops  32-bit AA Fill  Mpolys   Notes
2H97    Riva 128   .35      3M       5       20M             3M       Integrated 2D/3D
1H98    Riva ZX    .25      5M       7       31M             3M       AGP2x
2H98    Riva TNT   .25      7M       10      50M             6M       32-bit
1H99    TNT2       .22      9M       15      75M             9M       AGP4x
2H99    GeForce    .22      23M      25      120M            15M      HW T&L
1H00    GF2 GTS    .18      25M      35      200M (1)        25M      Per-Pixel Shading
2H00    GF2 Ultra  .18      25M      45      250M (1)        31M      230 MHz DDR
1H01    GeForce3   .15      57M      80      500M (1)        30M (2)  Programmable

• (1): Dual textured
• (2): Programmable

Essentially Moore’s Law Cubed.

Slide from Pat Hanrahan, Kurt Akeley

Page 41: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 41

NVIDIA Historicals

Season  Product         MT/s  Yr rate  MF/s  Yr rate
2H97    Riva 128        5     -        100   -
1H98    Riva ZX         5     1.0      100   1.0
2H98    Riva TNT        5     1.0      180   3.2
1H99    Riva TNT2       8     1.0      333   3.4
2H99    GeForce         15    3.5      480   2.1
1H00    GeForce2 GTS    25    2.8      666   1.9
2H00    GeForce2 Ultra  31    1.5      1000  2.3
1H01    GeForce3        40    1.7      3200  10.2
1H02    GeForce4        65    1.6      4800  1.5
                              1.8            2.4

Slide from Pat Hanrahan, Kurt Akeley

Page 42: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 42

Base Architecture

• Stream buffers match SRF bandwidth to compute needs

[Figure: compute clusters 0 to 7, each connected to its SRF bank through stream buffers, with data paths labeled 32b and 128b, joined by the inter-cluster network.]

Page 43: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 43

Indexed SRF Architecture

• Address path from clusters

• Lower indexed access bandwidth

[Figure: compute clusters 0 to 7 and SRF banks 0 to 7 with stream buffers, address FIFOs, data paths labeled 32b and 128b, and the inter-cluster network.]

Page 44: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 44

Base SRF Bank

• Several SRAM sub-arrays

[Figure: an SRF bank with sub-arrays 0 to 3 and local word-line (WL) drivers, connected to its compute cluster; sub-array dimensions labeled 256 and 128.]

Page 45: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 45

Indexed SRF Bank

• Extra 8:1 mux at sub-array output
  – Allows 4x 1-word accesses

[Figure: an SRF bank with sub-arrays 0 to 3 (dimensions labeled 256 and 128), each with its own pre-decode and row decoder, and an 8:1 mux at the sub-array output, connected to the compute cluster.]

Page 46: Stream Register Files with Indexed Access Nuwan Jayasena Mattan Erez Jung Ho Ahn William J. Dally.

HPCA-10 NSJ 46

Cross-lane Indexed SRF

• Address switch added

• Inter-cluster network used for cross-lane SRF data

[Figure: compute clusters 0 to 7 and SRF banks 0 to 7 with stream buffers, address FIFOs, a data path labeled 32b, the inter-cluster network, and the SRF address network.]