Electrical Engineering and Computer Sciences B P L ERKELEY...
Transcript of Electrical Engineering and Computer Sciences B P L ERKELEY...
![Page 1: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/1.jpg)
P A R A L L E L C O M P U T I N G L A B O R A T O R Y
EECS Electrical Engineering and
Computer Sciences BERKELEY PAR LAB
Exploring Tradeoffs between Programmability and
Efficiency inData-Parallel Accelerators"
Yunsup Lee1, Rimas Avizienis1, Alex Bishara1, !Richard Xia1, Derek Lockhart2,!
Christopher Batten2, Krste Asanovic1!1The Parallel Computing Lab, UC Berkeley!2Computer Systems Lab, Cornell University!
![Page 2: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/2.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Kernels Dominate Many Computational Workloads
Graphics Rendering Computer Vision
Audio Processing Physical Simulation
![Page 3: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/3.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Accelerators are Getting Popular
Sandy Bridge
Tegra Knights Ferry
Fermi
![Page 4: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/4.jpg)
Yunsup Lee / UC Berkeley Par Lab
Important Metrics when Comparing DLP Accelerator Architectures
• Performance per Unit Area"• Energy per Task!• Flexibility (What can it run well?)!• Programmability (How hard is it to
write code?)!
![Page 5: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/5.jpg)
Yunsup Lee / UC Berkeley Par Lab
Efficiency vs. Programmability: It’s a tradeoff
Programmability
Effi
cien
cy
Programmability
Effi
cien
cy
MIMD
Vector
Irregular DLP
Vector
MIMD
Regular DLP
![Page 6: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/6.jpg)
Yunsup Lee / UC Berkeley Par Lab
Maven Provides Both Greater Efficiency and Easier Programmability
Programmability
Effi
cien
cy
Programmability
Effi
cien
cy
MIMD
Vector
Irregular DLP
Vector
MIMD
Maven/Vector-Thread
Maven/Vector-Thread
Regular DLP
![Page 7: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/7.jpg)
Yunsup Lee / UC Berkeley Par Lab
Where does the GPU/SIMT fit in this picture?
Programmability
Effi
cien
cy
Programmability
Effi
cien
cy
MIMD
Vector GPU SIMT?
Irregular DLP
Vector
MIMD
GPU SIMT?
Maven/Vector-Thread
Maven/Vector-Thread
Regular DLP
![Page 8: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/8.jpg)
Yunsup Lee / UC Berkeley Par Lab
Outline § Data-Parallel Architecture
Design Patterns"§ MIMD, Vector-SIMD, Subword-SIMD,
SIMT, Maven/Vector-Thread!§ Microarchitectural Components!§ Evaluation Framework!§ Evaluation Results!
![Page 9: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/9.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Pattern #1: MIMD
Programmer’s Logical View
FILTER OP }
![Page 10: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/10.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Pattern #1: MIMD
Programmer’s Logical View
Typical Micro- architecture
Examples: Tilera Rigel
![Page 11: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/11.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Pattern #2: Vector-SIMD
Programmer’s Logical View
![Page 12: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/12.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Pattern #2: Vector-SIMD
Programmer’s Logical View
Typical Micro- architecture
Examples: T0 Cray-1
![Page 13: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/13.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Pattern #3: Subword-SIMD
Programmer’s Logical View
Typical Micro- architecture
Examples: AVX/SSE
![Page 14: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/14.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Pattern #4: GPU/SIMT
Programmer’s Logical View
![Page 15: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/15.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Pattern #4: GPU/SIMT
Programmer’s Logical View
Typical Micro- architecture
Example: Fermi
![Page 16: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/16.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Pattern #5: Vector-Thread (VT)
Programmer’s Logical View
![Page 17: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/17.jpg)
Yunsup Lee / UC Berkeley Par Lab
DLP Pattern #5: Vector-Thread (VT)
Programmer’s Logical View
Typical Micro- architecture
Examples: Scale Maven
![Page 18: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/18.jpg)
Yunsup Lee / UC Berkeley Par Lab
Outline § Data Parallel Architectural Design
Patterns!§ Microarchitectural Components"§ Evaluation Framework!§ Evaluation Results!
![Page 19: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/19.jpg)
Yunsup Lee / UC Berkeley Par Lab
Focus on the Tile
MIMD Tile Vector Tile with Four Single-Lane Cores
Vector Tile with One Four-Lane Core
![Page 20: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/20.jpg)
§ Developed a library of parameterized synthesizable RTL components!
uArchitecture"
![Page 21: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/21.jpg)
§ 32-bit integer multiplier, divider!
§ Single-precision floating-point add, multiply, divide, square root!
Retimable Long-latency
Functional Units"
![Page 22: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/22.jpg)
5-stage Multi-threaded
Scalar Core"
§ Change number of entries in register file (32,64,128,256) to vary degree of multi-threading (1,2,4,8 threads)!
![Page 23: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/23.jpg)
§ Vector registers and ALUs!
§ Density-time Execution!
§ Replicate the lanes and execute in lock step for higher throughput!
§ Vector-SIMD: Flag Registers!
Vector Lanes"
![Page 24: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/24.jpg)
Vector Issue Unit"
§ Vector-SIMD: VIU only handles scheduling, data dependent control done by flag registers!
§ Maven: VIU fetches instructions, PVFB handles uT branches and does control flow convergence!
![Page 25: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/25.jpg)
Vector Memory Unit"
§ VMU Handles unit stride, constant stride vector memory operations!
§ Vector-SIMD: VMU handles scatter, gather!
§ Maven: VMU handles uT loads and stores!
![Page 26: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/26.jpg)
Blocking, Non-blocking Caches"
§ Access Port Width!§ Refill Port Width!§ Cache Line Size!§ Total Capacity!§ Associativity!
Only for Non-blocking Caches:!§ # MSHR!§ # secondary
misses per MSHR!
![Page 27: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/27.jpg)
Yunsup Lee / UC Berkeley Par Lab
A Big Design Space …
§ Number of entries in scalar register file!§ 32,64,128,256 (1,2,4,8 threads)!
§ Number of entries in vector register file!§ 32,64,128,256!
§ Architecture of vector register file!§ 6r3w unified register file, 4x 2r1w banked register file!
§ Per-bank integer ALU!§ Density time execution!§ Pending Vector Fragment Buffer (PVFB)!
§ FIFO, 1-stack, 2-stack!
![Page 28: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/28.jpg)
Yunsup Lee / UC Berkeley Par Lab
Outline § Data Parallel Architectural Design
Patterns!§ Microarchitectural Components!§ Evaluation Framework"§ Evaluation Results!
![Page 29: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/29.jpg)
Yunsup Lee / UC Berkeley Par Lab
Programming Methodology
§ Use GCC C++ Cross Compiler (which we ported)!§ MIMD!
§ Custom application-scheduled lightweight threading lib!§ Vector-SIMD!
§ Leverage built-in GCC vectorizer for mapping very simple regular DLP code!
§ Use GCCʼs inline assembly extensions for more complicated code!
§ Maven!§ Use C++ Macros with special library, which glues the
control thread and microthreads!§ Automatic vector register allocation added to GCC!
![Page 30: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/30.jpg)
Yunsup Lee / UC Berkeley Par Lab
Microbenchmarks & Application Kernels
Name Explanation Irregularity vvadd 1000 element FP vector-vector add Regular
bsearch 1000 look-ups into a sorted array Very Irregular bsearch-cmv inner-loop rewritten with cond. mov Somewhat Irregular
Microbenchmarks
Name Explanation Irregularity viterbi Decode frames using Viterbi alg. Regular rsort Radix sort on an array of integers Slightly Irregular
kmeans K-means clustering algorithm Slightly Irregular dither Floyd-Steinberg dithering Somewhat Irregular
physics Newtonian physics simulation Very Irregular strsearch Knuth-Morris-Pratt algorithm Very Irregular
Application Kernels
![Page 31: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/31.jpg)
Yunsup Lee / UC Berkeley Par Lab
Evaluation Methodology
![Page 32: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/32.jpg)
Yunsup Lee / UC Berkeley Par Lab
Three Example Layouts
D$
I$
D$
I$
D$
I$
MIMD Tile 1 Core x 4 Lanes
Maven Tile 4 Cores x 1 Lane
Maven Tile
![Page 33: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/33.jpg)
Yunsup Lee / UC Berkeley Par Lab
Need Gate-level Activity for Accurate Energy Numbers
Configuration Post Place&Route Statistical (mW)
Simulated Gate-level Activity (mW)
MIMD 1 149 137-181
MIMD 2 216 130-247
MIMD 3 242 124-261
MIMD 4 299 221-298
Multi-core Vector-SIMD 396 213-331
Multi-lane Vector-SIMD 224 137-252
Multi-core Vector-Thread 1 428 162-318
Multi-core Vector-Thread 2 404 147-271
Multi-core Vector-Thread 3 445 172-298
Multi-core Vector-Thread 4 409 225-304
Multi-core Vector-Thread 5 410 168-300
Multi-lane Vector-Thread 1 205 111-167
Multi-lane Vector-Thread 2 223 118-173
![Page 34: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/34.jpg)
Yunsup Lee / UC Berkeley Par Lab
Outline § Data Parallel Architectural Design
Patterns!§ Microarchitectural Components!§ Evaluation Framework!§ Evaluation Results"
![Page 35: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/35.jpg)
Yunsup Lee / UC Berkeley Par Lab
Efficiency vs. Number of uTs running bsearch-cmv
1.0 1.4 1.8 2.2 2.6Normalized Tasks / Sec
0.40.50.60.70.80.91.01.11.21.31.41.51.6
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
mimd-c4
r32
0
5
10
15
20
25
30
Ener
gy /
Task
(uJ)
ctrlregmemfpint
cpi$d$leak
![Page 36: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/36.jpg)
Yunsup Lee / UC Berkeley Par Lab
1.0 1.4 1.8 2.2 2.6Normalized Tasks / Sec
0.40.50.60.70.80.91.01.11.21.31.41.51.6
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
mimd-c4
Efficiency vs. Number of uTs running bsearch-cmv
Faster
Lower Energy
r32
0
5
10
15
20
25
30
Ener
gy /
Task
(uJ)
ctrlregmemfpint
cpi$d$leak
![Page 37: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/37.jpg)
Yunsup Lee / UC Berkeley Par Lab
Efficiency vs. Number of uTs running bsearch-cmv
r32r64
0
5
10
15
20
25
30
Ener
gy /
Task
(uJ)
ctrlregmemfpint
cpi$d$leak
1.0 1.4 1.8 2.2 2.6Normalized Tasks / Sec
0.40.50.60.70.80.91.01.11.21.31.41.51.6
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
r64
mimd-c4
![Page 38: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/38.jpg)
Yunsup Lee / UC Berkeley Par Lab
Efficiency vs. Number of uTs running bsearch-cmv
1.0 1.4 1.8 2.2 2.6Normalized Tasks / Sec
0.40.50.60.70.80.91.01.11.21.31.41.51.6
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
r64
r128
r256mimd-c4
r32r64r128r256
0
5
10
15
20
25
30
Ener
gy /
Task
(uJ)
ctrlregmemfpint
cpi$d$leak
![Page 39: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/39.jpg)
Yunsup Lee / UC Berkeley Par Lab
Efficiency vs. Number of uTs running bsearch-cmv
r32r64r128r256r32r64r128r256
0
5
10
15
20
25
30
Ener
gy /
Task
(uJ)
ctrlregmemfpint
cpi$d$leak
1.0 1.4 1.8 2.2 2.6Normalized Tasks / Sec
0.40.50.60.70.80.91.01.11.21.31.41.51.6
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
r64
r128
r256
r32
r64r128
r256
mimd-c4vt-c4v1
![Page 40: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/40.jpg)
Yunsup Lee / UC Berkeley Par Lab
6r3w Vector Register File is Area Inefficient
r32
r64
r128
r256
r32
r64
r128
r256
0.00
0.25
0.50
0.75
1.00
1.25
1.50
1.75
Nor
mal
ized
Are
a
ctrlregmemfp
intcpi$d$
MIMD Tile
Vector-Thread Tile
![Page 41: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/41.jpg)
Yunsup Lee / UC Berkeley Par Lab
Efficiency vs. Number of uTs with Banking running bsearch-cmv
1.0 1.4 1.8 2.2 2.6Normalized Tasks / Sec
0.40.50.60.70.80.91.01.11.21.31.41.51.6
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
r64
r128
r256
r128 r256
mimd-c4vt-c4v1vt-c4v1+b
r32r64r128r256r32r64r128r256r128r256
0
5
10
15
20
25
30
Ener
gy /
Task
(uJ)
ctrlregmemfpint
cpi$d$leak
![Page 42: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/42.jpg)
Yunsup Lee / UC Berkeley Par Lab
Efficiency vs. Number of uTs with Per-Bank Integer ALU running bsearch-cmv
1.0 1.4 1.8 2.2 2.6Normalized Tasks / Sec
0.40.50.60.70.80.91.01.11.21.31.41.51.6
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
r64
r128
r256
r128r256
mimd-c4vt-c4v1vt-c4v1+bvt-c4v1+bi
r32r64r128r256r32r64r128r256r128r256r128r256
0
5
10
15
20
25
30
Ener
gy /
Task
(uJ)
ctrlregmemfpint
cpi$d$leak
![Page 43: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/43.jpg)
Yunsup Lee / UC Berkeley Par Lab
Bank Vector Register File Per-Bank Integer ALUs
r32r64r128r256
r32r64r128r256
r128+br256+br128+bir256+bi
0.00
0.25
0.50
0.75
1.00
1.25
1.50
1.75
Nor
mal
ized
Are
a
ctrlregmemfp
intcpi$d$
MIMD Tile
Vector-Thread Tile Banking
Local ALUs
![Page 44: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/44.jpg)
Yunsup Lee / UC Berkeley Par Lab
Results running bsearch compared to bsearch-cmv
2.0 4.0 6.0 8.0 10.0 12.0 14.0Normalized Tasks / Sec
0.00.10.20.30.40.50.60.70.80.91.0
Nor
mal
ized
Ene
rgy
/ Tas
k
FIFO
cmv+FIFO
FIFO+dt
1-stack
1-stack+dt 2-stack
2-stack+dt cmv+2-stack+dt
Results of Design Space Exploration Apply Density-Time Execution
Convergence Scheme: 2-Stack PVFB
![Page 45: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/45.jpg)
Yunsup Lee / UC Berkeley Par Lab
Results Running Application Kernels
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
0.5 1.0 1.5
r32
1.0 2.0 3.0
r32
1.0 2.0 3.0
r32
0.5 1.0 1.5 2.0 2.5
r32
0.5 1.0 1.5 2.0 2.5
r32
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5
r32
0.5 1.0 1.5
r32
0.5 1.0 1.5
r32
Normalized Tasks / Second
Normalized Tasks / Second / Area
viterbi rsort kmeans dither physics strsearch
![Page 46: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/46.jpg)
Yunsup Lee / UC Berkeley Par Lab
Results Running Application Kernels
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
0.5 1.0 1.5
r32
1.0 2.0 3.0
r32
1.0 2.0 3.0
r32
0.5 1.0 1.5 2.0 2.5
r32
0.5 1.0 1.5 2.0 2.5
r32
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5
r32
0.5 1.0 1.5
r32
0.5 1.0 1.5
r32
Normalized Tasks / Second
Normalized Tasks / Second / Area
Performance
Performance per Unit Area
viterbi rsort kmeans dither physics strsearch
![Page 47: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/47.jpg)
Yunsup Lee / UC Berkeley Par Lab
Results Running Application Kernels
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
0.5 1.0 1.5
r32
1.0 2.0 3.0
r32
1.0 2.0 3.0
r32
0.5 1.0 1.5 2.0 2.5
r32
0.5 1.0 1.5 2.0 2.5
r32
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5
r32
0.5 1.0 1.5
r32
0.5 1.0 1.5
r32
Normalized Tasks / Second
Normalized Tasks / Second / Area
More Irregular
viterbi rsort kmeans dither physics strsearch
![Page 48: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/48.jpg)
Yunsup Lee / UC Berkeley Par Lab
Multi-threading is not Effective on DLP Code
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
0.5 1.0 1.5
r32
0.5 1.0 1.5
r32
1.0 2.0 3.0
r32
1.0 2.0 3.0
r32
0.5 1.0 1.5 2.0 2.5
r32
0.5 1.0 1.5 2.0 2.5
r32
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5
r32
0.5 1.0 1.5
r32
Normalized Tasks / Second
Normalized Tasks / Second / Area
viterbi rsort kmeans dither physics strsearch
![Page 49: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/49.jpg)
Yunsup Lee / UC Berkeley Par Lab
Vector-SIMD is Faster and/or More Efficient than MIMD
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
mlane
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
mlane
0.5 1.0 1.5
r32
mlane
0.5 1.0 1.5
r32
mlane
1.0 2.0 3.0
r32
mlane
1.0 2.0 3.0
r32 mlane
0.5 1.0 1.5 2.0 2.5
r32
mlane
0.5 1.0 1.5 2.0 2.5
r32
mlane
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5 2.0
r32
0.5 1.0 1.5
r32
0.5 1.0 1.5
r32
Normalized Tasks / Second
Normalized Tasks / Second / Area
viterbi rsort kmeans dither physics strsearch
No Vector-SIMD
Implementation
Too hard to map
![Page 50: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/50.jpg)
Yunsup Lee / UC Berkeley Par Lab
Maven Vector-Thread is More Efficient than Vector-SIMD
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
mlane
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
mlane
0.5 1.0 1.5
r32
mlane
0.5 1.0 1.5
r32
mlane
1.0 2.0 3.0
r32
mlane
1.0 2.0 3.0
r32 mlane
0.5 1.0 1.5 2.0 2.5
r32
mlane
0.5 1.0 1.5 2.0 2.5
r32
mlane
0.5 1.0 1.5 2.0
r32
mlane
0.5 1.0 1.5 2.0
r32
mlane
0.5 1.0 1.5
r32
mlane
0.5 1.0 1.5
r32mlane
Normalized Tasks / Second
Normalized Tasks / Second / Area
viterbi rsort kmeans dither physics strsearch
![Page 51: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/51.jpg)
Yunsup Lee / UC Berkeley Par Lab
Multi-Lane Tiles are More Efficient than Multi-Core Tiles
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
mlane
mcore
0.5 1.0 1.50.0
0.5
1.0
1.5
2.0
Nor
mal
ized
Ene
rgy
/ Tas
k
r32
mlane
mcore
0.5 1.0 1.5
r32
mlane
mcore
0.5 1.0 1.5
r32
mlane
mcore
1.0 2.0 3.0
r32
mlanemcore
1.0 2.0 3.0
r32 mcore/mlane
0.5 1.0 1.5 2.0 2.5
r32
mlane
mcore
0.5 1.0 1.5 2.0 2.5
r32
mlane
mcore
0.5 1.0 1.5 2.0
r32
mlanemcore
0.5 1.0 1.5 2.0
r32
mlanemcore
0.5 1.0 1.5
r32
mlanemcore
0.5 1.0 1.5
r32mlane
mcore
Normalized Tasks / Second
Normalized Tasks / Second / Area
viterbi rsort kmeans dither physics strsearch
![Page 52: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/52.jpg)
Yunsup Lee / UC Berkeley Par Lab
Comparing vector load/stores vs. uT load/stores running vvadd
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0Normalized Tasks / Sec
1
2
3
4
5
6
Nor
mal
ized
Ene
rgy
/ Tas
k
vec ld/st
![Page 53: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/53.jpg)
Yunsup Lee / UC Berkeley Par Lab
uT load/stores are Inefficient
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0Normalized Tasks / Sec
1
2
3
4
5
6
Nor
mal
ized
Ene
rgy
/ Tas
k
vec ld/st
uT ld/st
9x Slower 5x More Energy
![Page 54: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/54.jpg)
Yunsup Lee / UC Berkeley Par Lab
Memory Coalescing Helps, but Still Far Off
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0Normalized Tasks / Sec
1
2
3
4
5
6
Nor
mal
ized
Ene
rgy
/ Tas
k
vec ld/st
uT ld/st
uT ld/st + mem coalescing
![Page 55: Electrical Engineering and Computer Sciences B P L ERKELEY ...maven.cs.berkeley.edu/papers/maven-isca2011-talk.pdf · viterbi Decode frames using Viterbi alg. Regular rsort Radix](https://reader034.fdocuments.net/reader034/viewer/2022043017/5f3a07901347a60c1e2f45d1/html5/thumbnails/55.jpg)
Yunsup Lee / UC Berkeley Par Lab
Conclusions § Vector architectures are more area and energy efficient
than MIMD architectures on regular DLP and (surprisingly) on irregular DLP!
§ The Maven vector-thread architecture is a promising alternative to traditional vector-SIMD architectures, providing greater efficiency and easier programmability!
§ Using real RTL implementations and a standard ASIC toolflow is necessary to compare energy-optimized future architectures!
!This work was supported in part by Microsoft (Award #024263) and Intel (Award #024894, equipment donations) funding and by matching funding from U.C. Discovery (Award #DIG07-10227)!