Stream Register Files with Indexed Access. Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally.
Stream Register Files with Indexed Access
Nuwan Jayasena, Mattan Erez, Jung Ho Ahn, William J. Dally
HPCA-10 NSJ 2
Scaling Trends
• ILP increasingly harder and more expensive to extract
• Graphics processors exploit data parallelism
[Charts: CPUs, SPECint2000 per MHz (80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4), and graphics processors, vertices per MHz (NVIDIA NV10 through NV35), Jan-85 through Jun-01]
CPU data courtesy of Francois Labonte, Stanford University
Renewed Interest in Data Parallelism
• Data parallel application classes
  – Media, signal, network processing, scientific simulations, encryption, etc.
• High-end vector machines
  – Have always been data parallel
• Academic research
  – Stanford Imagine, Berkeley V-IRAM, programming GPUs, etc.
• "Main-stream" industry
  – Sony Emotion Engine, Tarantula, etc.
Storage Hierarchy
• Bandwidth taper
• Only supports sequential streams/vectors
• But many data parallel apps with
  – Data reorderings
  – Irregular data structures
  – Conditional accesses
[Diagram: bandwidth tapers from DRAM through the cache to stream/vector storage]
Sequential Streams/Vectors Inefficient
• Evaluate arbitrary-order access to streams
[Diagram: a 4x4 matrix streams row-major from memory/cache through stream/vector storage to the compute units; producing the column-major stream needed by the next pass requires a reorder through memory]
Outline
• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion
Stream Programming
• Streams of records passing through compute kernels
• Parallelism
  – Across stream elements
  – Across kernels
• Locality
  – Within kernels
  – Between kernels
[Diagram: input streams in1 and in2 pass through a chain of FFT_stage kernels to the output stream]
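The kernel chaining above can be sketched in plain Python (illustrative only; the real toolchain uses StreamC/KernelC on Imagine, and `fft_stage` here is a stand-in operation, not an actual FFT):

```python
# Minimal sketch of the stream programming model: kernels consume and
# produce streams of records; chaining kernels keeps intermediate
# results in on-chip stream storage rather than round-tripping memory.
# Names (run_pipeline, fft_stage) are illustrative, not a real API.

def fft_stage(stream):
    # Stand-in for one FFT stage: each record is processed
    # independently (data parallelism across stream elements).
    return [2 * x for x in stream]

def run_pipeline(stream, kernels):
    # Producer-consumer locality: each kernel's output stream is the
    # next kernel's input stream.
    for kernel in kernels:
        stream = kernel(stream)
    return stream

out = run_pipeline([1, 2, 3], [fft_stage, fft_stage, fft_stage])
print(out)  # [8, 16, 24]
```

Each intermediate stream exists only between adjacent kernels, which is exactly the locality the SRF captures on the following slide.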
Bandwidth Hierarchy
• Stream programming is well matched to the bandwidth hierarchy
[Diagram: chained FFT_stage kernels run out of the stream register file (SRF); only the initial input and final output streams cross the memory interface]
Stream Processors
• Several lanes
  – Execute in SIMD
  – Operate on records
• Inter-cluster network
[Diagram: lanes 0 through N-1, each pairing a compute cluster with an SRF bank; lanes communicate over the inter-cluster network and reach the memory system through a memory switch]
Outline
• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion
Stream-Level Data Reuse
• Sequential streams only capture in-order reuse
• Arbitrary access patterns in SRF capture more of the available temporal locality

Stream data reuse:
• Sequential (in-order) reuse, e.g. linear streams
• Non-sequential reuse
  – Reordered reuse, e.g. 2-D and 3-D accesses, multi-grid
  – Intra-stream reuse, e.g. irregular neighborhoods, table lookups
Reordered Reuse
[Diagram: a 2D FFT. Matrix a streams row-major from memory through the SRF into the compute clusters for the first 1D FFT, producing b. With sequential streams only, b must be written back to memory, reordered, and re-read column-major for the second 1D FFT that produces c]
• Indexed SRF access eliminates reordering through memory
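A back-of-the-envelope sketch of the traffic this saves for the 64x64 2D FFT (illustrative Python; the function names are hypothetical, and the count covers only the reorder step between the two 1D FFT passes):

```python
# Off-chip words moved for a 2D FFT's reorder step with sequential-only
# streams vs. an indexed SRF that can read columns in place.

N = 64  # 64x64 matrix, as in the 2D FFT benchmark

def reorder_traffic_sequential(n):
    # Sequential streams: the intermediate matrix is written back to
    # memory after the row pass and re-read column-major, adding one
    # full write and one full read of n*n words.
    return 2 * n * n

def reorder_traffic_indexed(n):
    # Indexed SRF: the column pass reads the intermediate matrix
    # directly from the SRF using computed addresses.
    return 0

print(reorder_traffic_sequential(N))  # 8192
print(reorder_traffic_indexed(N))     # 0
```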
Intra-stream Reuse
• Indexed SRF access eliminates
  – Replication in SRF
  – Redundant memory transfers
[Diagram: records A through H move from memory/cache into the SRF; with sequential streams, records referenced by several stream elements (e.g. B and D) must be replicated in the SRF stream, while indexed access keeps a single copy]
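The replication cost can be illustrated with a small sketch (Python; the 4-node graph and variable names are made up for illustration):

```python
# SRF storage needed to feed an irregular neighborhood computation.
# With sequential streams, each node's neighbor list is materialized
# in order, replicating shared neighbors; with indexed access, each
# record is stored once and addressed on demand.

neighbors = {            # a small irregular graph
    0: [1, 3],
    1: [0, 2, 3],
    2: [1, 3],
    3: [0, 1, 2],
}

# Sequential stream: one SRF word per neighbor reference (replicated).
sequential_words = sum(len(v) for v in neighbors.values())

# Indexed SRF: one word per distinct record referenced.
indexed_words = len({n for v in neighbors.values() for n in v})

print(sequential_words, indexed_words)  # 10 4
```

The same distinct records would also be fetched from memory repeatedly in the sequential case, which is the redundant transfer the slide refers to.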
Conditional Accesses
• Fine-grain conditional accesses
  – Expensive in SIMD architectures
  – Translate to conditional address computation
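One common way to express such a conditional append as conditional address computation, in the spirit of the slide (a Python sketch with hypothetical names, not the paper's ISA):

```python
# In SIMD, a fine-grain conditional append ("emit this record only if
# the predicate holds") can become an unconditional indexed SRF write
# whose address advances only under the predicate, keeping control
# flow uniform across lanes. This models a single lane's view.

def conditional_append(records, predicate):
    srf = [None] * len(records)
    addr = 0
    for r in records:
        # Every iteration performs the write; the predicate gates only
        # the address increment, so rejected records are overwritten.
        srf[addr] = r
        addr += 1 if predicate(r) else 0
    return srf[:addr]

print(conditional_append([3, 8, 1, 9, 4], lambda x: x > 3))  # [8, 9, 4]
```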
Outline
• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion
Base Architecture
• Each SRF bank accesses a block of b contiguous words
[Diagram: compute clusters 0 through N-1, each paired with an SRF bank delivering b*W bits per access; clusters connect through the inter-cluster network]
Indexed SRF Architecture
• Address path from clusters
• Lower indexed access bandwidth
[Diagram: address FIFOs added between each compute cluster and its SRF bank]
Base SRF Bank
• Several SRAM sub-arrays
• Each access is to one sub-array
[Diagram: SRF bank with sub-arrays 0 through 3 and local word-line drivers, feeding the compute cluster]
Indexed SRF Bank
• Extra 8:1 mux at sub-array output
  – Allows 4x 1-word accesses
[Diagram: sub-arrays 0 through 3, each with its own pre-decode and row decoder, sharing an 8:1 output mux]
Cross-lane Indexed SRF
• Address switch added
• Inter-cluster network used for cross-lane SRF data
[Diagram: an SRF address network routes addresses from the clusters' address FIFOs to the SRF banks; cross-lane data returns over the inter-cluster network]
Overhead - Area
• In-lane indexing overheads: 11% over sequential SRF
  – Per-sub-array independent addressing
• Cross-lane indexing overheads: 22% over sequential SRF
  – Address switch
• 1.5% to 3% increase in die area (Imagine processor)
Overhead - Energy
• 0.1 nJ (0.13 µm) per indexed SRF access
  – ~4x a sequential SRF access
  – More than an order of magnitude lower than a DRAM access
  – Compare: 0.25 nJ per cache access
• Each indexed access replaces many SRF and DRAM/cache accesses
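A rough accounting from the figures above (the sequential-SRF number is inferred from the ~4x ratio and is not stated on the slide; treat all values as illustrative):

```python
# Energy figures from the slide, 0.13 um process.
INDEXED_SRF_NJ = 0.1
SEQ_SRF_NJ = INDEXED_SRF_NJ / 4   # ~0.025 nJ, inferred from the ~4x ratio
CACHE_NJ = 0.25

# An indexed access that eliminates even a single redundant cache
# access is already a net win, before counting saved DRAM traffic:
saving = CACHE_NJ - INDEXED_SRF_NJ
print(round(saving, 3))  # 0.15 (nJ saved per replaced cache access)
```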
Outline
• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion
Benchmarks
• 64x64 2D FFT
  – 2D accesses
• Rijndael (AES)
  – Table lookups
• Merge sort
  – Fine-grain conditionals
• 5x5 convolution filter
  – Regular neighborhood
• Irregular graph
  – Irregular neighborhood access
  – Parameterized (IG_SML/DMS/DCS/SCL): Sparse/Dense graph, Memory/Compute-limited, Short/Long strips
Machine Organizations
[Diagram: three configurations. Base: sequential-SRF banks and compute clusters on an inter-cluster network, connected through a memory switch to DRAM. Base + Cache: the same, with a cache between the memory switch and DRAM. Indexed SRF: the base plus an SRF address network]
Machine Parameters

            Base                 Base + cache         Indexed SRF
Technology  0.13 µm, 1 GHz (all configurations)
Compute     8 compute clusters, 32 GFLOPS peak (all configurations)
SRF         128KB, 128GB/s seq.  128KB, 128GB/s seq.  128GB/s in-lane, 32GB/s cross-lane
Cache       -                    128KB, 16GB/s        -
DRAM        9.14GB/s (all configurations)
Off-chip Memory Bandwidth
[Chart: memory traffic of the indexed-SRF (ISRF) configuration, normalized to the base, for FFT2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, and IG_SCL]
Off-chip Memory Bandwidth
[Chart: normalized memory traffic for both the ISRF and cache configurations, for FFT2D, Rijndael, Sort, Filter, and the four IG variants]
Execution Time
[Chart: execution time for the Base, Cache, and ISRF configurations on FFT2D, Rijndael, Sort, Filter, IG_SML, IG_DMS, IG_DCS, and IG_SCL, normalized to Base and broken down into kernel loop body, kernel overheads, SRF stalls, and memory stalls]
Outline
• Stream processing overview
• Applications
• Implementation
• Results
• Conclusion
Conclusions
• Data parallelism increasingly important
• Current data parallel architectures inefficient for some application classes
  – Irregular accesses
• Indexed SRF accesses
  – Reduce memory traffic
  – Reduce SRF data replication
  – Efficiently support complex/conditional stream accesses
• Performance improvements
  – 3% to 410% for target application classes
• Low implementation overhead
  – 1.5% to 3% die area
Backups
Indexed Access Instruction Overhead
[Chart: relative instruction counts of SRF-indexed kernels (FFT2D, Rijndael, Sort, IG_1, IG_2), normalized to sequential versions]
• Excludes address issue instructions
Kernel C API

  while (!eos(in)) {
    in >> a;
    LUT[a] >> b;
    c = foo(a, b);
    out << c;
  }

At the instruction level, the indexed read splits into two operations:

  LUT.index << a;
  /* independent instructions */
  LUT >> b;

• 2 separate instructions
  – Address issue
  – Data read
• Address-data separation
  – May require loop unrolling, software pipelining, etc.
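The two-instruction access can be modeled with a small sketch (Python; the class and method names are hypothetical stand-ins for the KernelC operations): the address issue pushes into a FIFO, and the later data read pops it, so independent instructions can run in between to hide the access latency.

```python
from collections import deque

class IndexedStream:
    """Models an SRF-resident stream with decoupled indexed access."""
    def __init__(self, data):
        self.data = data          # words held in the SRF
        self.addr_fifo = deque()  # address FIFO fed by the cluster

    def issue(self, addr):
        # Corresponds to: LUT.index << a;
        self.addr_fifo.append(addr)

    def read(self):
        # Corresponds to: LUT >> b;  (consumes the oldest address)
        return self.data[self.addr_fifo.popleft()]

LUT = IndexedStream([10, 20, 30, 40])
LUT.issue(2)
# ... independent instructions would execute here ...
print(LUT.read())  # 30
```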
Sensitivity to SRF Access Latency (1)
[Charts: loop length and average kernel execution time vs. index and data separation (cycles), for FFT2D, Rijndael, Sort1, Sort2, Filter, IGraph1, and IGraph2]
Sensitivity to SRF Access Latency (2)
[Chart: average kernel execution time vs. index and data separation (4 to 28 cycles) for IGraph1 and IGraph2]
Why Graphics Hardware?
• Pentium 4 SSE theoretical*: 3 GHz * 4 wide * 0.5 inst/cycle = 6 GFLOPS
• GeForce FX 5900 (NV35) fragment shader observed: MULR R0, R0, R0: 20 GFLOPS
  – equivalent to a 10 GHz P4
• and getting faster: 3x improvement over NV30 (6 months)
* from Intel P4 Optimization Manual
[Chart: GFLOPS over time (Jun-01 through Jul-03) for Pentium 4, NV30, and NV35]
Slide from Ian Buck, Stanford University
NVIDIA Graphics Growth (225%/yr)
• Essentially Moore's Law cubed

Season  Product    Process  # Trans  Gflops  32-bit AA Fill  Mpolys   Notes
2H97    Riva 128   .35      3M       5       20M             3M       Integrated 2D/3D
1H98    Riva ZX    .25      5M       7       31M             3M       AGP2x
2H98    Riva TNT   .25      7M       10      50M             6M       32-bit
1H99    TNT2       .22      9M       15      75M             9M       AGP4x
2H99    GeForce    .22      23M      25      120M            15M      HW T&L
1H00    GF2 GTS    .18      25M      35      200M[1]         25M      Per-pixel shading
2H00    GF2 Ultra  .18      25M      45      250M[1]         31M      230 MHz DDR
1H01    GeForce3   .15      57M      80      500M[1]         30M[2]   Programmable

[1] Dual textured  [2] Programmable
Slide from Pat Hanrahan, Kurt Akeley
NVIDIA Historicals

Season  Product         MT/s  Yr rate  MF/s  Yr rate
2H97    Riva 128        5     -        100   -
1H98    Riva ZX         5     1.0      100   1.0
2H98    Riva TNT        5     1.0      180   3.2
1H99    Riva TNT2       8     1.0      333   3.4
2H99    GeForce         15    3.5      480   2.1
1H00    GeForce2 GTS    25    2.8      666   1.9
2H00    GeForce2 Ultra  31    1.5      1000  2.3
1H01    GeForce3        40    1.7      3200  10.2
1H02    GeForce4        65    1.6      4800  1.5
Average yearly rate:          1.8            2.4

Slide from Pat Hanrahan, Kurt Akeley
Base Architecture
• Stream buffers match SRF bandwidth to compute needs
[Diagram: each SRF bank supplies 128b-wide accesses to stream buffers, which feed its compute cluster 32b at a time; clusters 0 through 7 are joined by the inter-cluster network]
Indexed SRF Architecture
• Address path from clusters
• Lower indexed access bandwidth
[Diagram: the base architecture with address FIFOs added from each compute cluster to its SRF bank]
Base SRF Bank
• Several SRAM sub-arrays
[Diagram: SRF bank of four 256x128 SRAM sub-arrays with local word-line drivers, feeding the compute cluster]
Indexed SRF Bank
• Extra 8:1 mux at sub-array output
  – Allows 4x 1-word accesses
[Diagram: four 256x128 sub-arrays, each with its own pre-decode and row decoder, sharing an 8:1 output mux]
Cross-lane Indexed SRF
• Address switch added
• Inter-cluster network used for cross-lane SRF data
[Diagram: an SRF address network routes addresses from any cluster's address FIFOs to any SRF bank; cross-lane data returns over the inter-cluster network through the stream buffers]