1 2 The Maven Vector-Thread Architecture 2 1 › wp-content › uploads › hc... · 7 Conclusions...

1
The Maven Vector-Thread Architecture Yunsup Lee 1 , Rimas Avi´ zienis 1 , Alex Bishara 1 , Richard Xia 1 , Derek Lockhart 2 , Christopher Batten 2 , Krste Asanovi´ c 1 1 Parallel Computing Laboratory, UC Berkeley 2 Computer Systems Laboratory, Cornell University 1 Motivation Future manycore processors will be energy-constrained, and thus the primary metric for evaluating these architectures will be their energy-efficiency. In this work, we investigate new architectural and mi- croarchitectural mechanisms which enable a wider array of applications to be mapped to energy-efficient vector units. 2 Architectural Patterns Three different architectural patterns (excluding subword-SIMD and SIMT) were evaluated in terms of their performance, energy efficiency and area. Maven is a new vector-thread architecture, which is based on a vector-SIMD architecture, adds minimal hardware to support irregular DLP well, and is considerably simpler to implement than previous vector-thread designs. 3 Maven Programming Methodology 4 Maven Tile Microarchitecture We focus on comparing the various architectural design patterns with respect to a single data-parallel tile. Example tiles are a MIMD tile, a vector tile with four single-lane cores, or one four-lane core. To manage complexity of many design points, we developed a library of parameterized synthesizable RTL components. Tile Configurations Maven Core Microarchitecture 5 Evaluation Framework Toolflow An Example VLSI Layout Core 0 Core 1 Core 2 Core 3 CP I$ VU I$ D$ XBAR 1.8mm 3.3mm 6 Evaluation Results We first compare tile configurations based on their cycle time and area before exploring the impact of various microarchitectural optimizations. We then compare implementation efficiency and performance of the Maven VT pattern against the MIMD, and vector-SIMD patterns for the six application kernel. Area & Cycle Time mimd-c4 vsimd +bi vt-c4v1 vt-c4v1 +bi vt-c1v4 +bi+2s r32 r64 r128 r256 c1v4r256 c4v1r256 r256 r256+b r256+bi r256+2s r256+2s+d r256 r256+mc 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 Normalized Area ctrl reg mem fp int cp i$ d$ !"#$%&'()"# *"+,' ./()0)1(2 3.45&2(/,6 7589 :"/(2 ;',( 755 < 9 !=12, :45, 7#09 !"!#$%&'() *&+ -*(.$*/* (0. *0*1 !"!#$%&'2& )*2 -*(1$)&. &01 *0*( !"!#$%&'*)/ )&) -*)&$)2* &0) *0*+ !"!#$%&')32 )++ -))*$)+/ &0. *0). 45"!#$%&4*')3267" (+2 -)*($((* 302 *0(. 45"!#$%*4&')3267" ))& -*(.$)3) (0+ *0&2 48$%&4*')32 &)/ -*2)$(*/ 20( *0&. 48$%&4*')3267 &1& -*&.$).* 302 *0(* 48$%&4*')3267" &&3 -*.)$)+/ 30+ *0() 48$%&4*')3267"6)5 &1+ -))3$(1& 30+ *0() 48$%&4*')3267"6)56# &*1 -*2/$(11 30+ *0(2 48$%*4&')3267"6)5 )13 -***$*2. (0+ *0&) 48$%*4&')3267"6)56!% ))( -**/$*.( &01 *0&) Performance and Energy Efficiency Impact of Additional Physical Registers, Intra-Lane Regfile Banking, and Additional Per-Bank Integer ALUs 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 Normalized Tasks / Sec 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 Normalized Energy / Task r32 r64 r128 r256 r32 r64 r128 r256 r128 r256 r128 r256 mimd-c4 vt-c4v1 vt-c4v1+b vt-c4v1+bi mimd-c4 vt-c4v1 vt-c4v1+b vt-c4v1+bi r32 r64 r128 r256 r32 r64 r128 r256 r128 r256 r128 r256 0 5 10 15 20 25 30 Energy / Task (uJ) ctrl reg mem fp int cp i$ d$ leak Impact of Density-Time Execution and Stack-Based Convergence Schemes / Impact of Memory Coalescing 2.0 4.0 6.0 8.0 10.0 12.0 14.0 Normalized Tasks / Sec 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Normalized Energy / Task FIFO FIFO+dt 1-stack 1-stack+dt 2-stack 2-stack+dt cmv +FIFO cmv +2-stack+dt 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Normalized Tasks / Sec 1 2 3 4 5 6 Normalized Energy / Task vec ld/st uT ld/st uT ld/st + mem coalescing Implementation Efficiency and Performance for MIMD, vector-SIMD, and VT Patterns Running Application Kernels 0.5 1.0 1.5 0.0 0.5 1.0 1.5 2.0 Normalized Energy / Task r32 mlane mcore 0.5 1.0 1.5 r32 mlane mcore 1.0 2.0 3.0 r32 mcore/mlane 0.5 1.0 1.5 2.0 2.5 r32 mlane mcore 0.5 1.0 1.5 2.0 r32 mlane mcore 0.5 1.0 1.5 r32 mlane mcore Normalized Tasks / Second / Area viterbi radix sort kmeans dither physics sim. string search 7 Conclusions 1. The Maven vector-thread architecture is more area and energy efficient than MIMD architectures on regular DLP and (surprisingly) on irregular DLP 2. The Maven vector-thread architecture is a promising alternative to traditional vector-SIMD architectures, providing greater efficiency and easier programmability 3. Using real RTL implementations and a standard ASIC toolflow is necessary to compare energy-optimized future architectures For more details, see our ISCA ’11 paper below or use the QR code on the right with a barcode reader. ”Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators” This work was supported in part by Microsoft (Award #024263) and Intel (Award #024894, equipment donations) funding and by matching funding from U.C. Discovery (Award #DIG07-10227). The authors acknowledge and thank Jiongjia Fang and Ji Kim for their help writing application kernels, Christopher Celio for his help writing Maven software and developing the vector-SIMD instruction set, and Hidetaka Aoki for his early feedback on the Maven microarchitecture. Yunsup Lee 577C Soda Hall, Berkeley, CA 94720 [email protected] http://www.cs.berkeley.edu/˜yunsup

Transcript of 1 2 The Maven Vector-Thread Architecture 2 1 › wp-content › uploads › hc... · 7 Conclusions...

Page 1: 1 2 The Maven Vector-Thread Architecture 2 1 › wp-content › uploads › hc... · 7 Conclusions 1. The Maven vector-thread architecture is more area and energy efficient than

The Maven Vector-Thread ArchitectureYunsup Lee1, Rimas Avizienis1, Alex Bishara1, Richard Xia1, Derek Lockhart2,

Christopher Batten2, Krste Asanovic1

1Parallel Computing Laboratory, UC Berkeley2Computer Systems Laboratory, Cornell University

1 Motivation

Future manycore processors

will be energy-constrained,

and thus the primary metric for

evaluating these architectures

will be their energy-efficiency.

In this work, we investigate

new architectural and mi-

croarchitectural mechanisms

which enable a wider array of

applications to be mapped to

energy-efficient vector units.

2 Architectural Patterns

Three different architectural patterns (excluding subword-SIMD and SIMT) were

evaluated in terms of their performance, energy efficiency and area. Maven is a

new vector-thread architecture, which is based on a vector-SIMD architecture, adds

minimal hardware to support irregular DLP well, and is considerably simpler to

implement than previous vector-thread designs.

3 Maven Programming Methodology

4 Maven Tile Microarchitecture

We focus on comparing the various architectural design patterns with respect to a

single data-parallel tile. Example tiles are a MIMD tile, a vector tile with four

single-lane cores, or one four-lane core. To manage complexity of many design

points, we developed a library of parameterized synthesizable RTL components.

Tile Configurations

Maven Core Microarchitecture

5 Evaluation Framework

Toolflow

An Example VLSI Layout

Core 0 Core 1 Core 2

Core 3

CP I$

VU I$

D$XBAR

1.8mm

3.3mm

6 Evaluation Results

We first compare tile configurations based on their cycle time and area before exploring the impact of various

microarchitectural optimizations. We then compare implementation efficiency and performance of the Maven VT

pattern against the MIMD, and vector-SIMD patterns for the six application kernel.

Area & Cycle Time

mimd-c4 vsimd+bi

vt-c4v1 vt-c4v1+bi

vt-c1v4+bi+2s

r32

r64

r128

r256

c1v4r256c4v1r256

r256

r256+b

r256+bi

r256+2s

r256+2s+

d

r256

r256+m

c

0.00

0.25

0.50

0.75

1.00

1.25

1.50

1.75

No

rmal

ized

Are

a

ctrlregmemfpintcpi$d$

!"#$%&'()"#

*"+,'-./()0)1(2-3.45&2(/,6-

7589

:"/(2-;',(-755<9

!=12,-:45,-7#09

!"!#$%&'() *&+,-*(.$*/* (0. *0*1

!"!#$%&'2& )*2,-*(1$)&. &01 *0*(

!"!#$%&'*)/ )&),-*)&$)2* &0) *0*+

!"!#$%&')32 )++,-))*$)+/ &0. *0).

45"!#$%&4*')3267" (+2,-)*($((* 302 *0(.

45"!#$%*4&')3267" ))&,-*(.$)3) (0+ *0&2

48$%&4*')32 &)/,-*2)$(*/ 20( *0&.

48$%&4*')3267 &1&,-*&.$).* 302 *0(*

48$%&4*')3267" &&3,-*.)$)+/ 30+ *0()

48$%&4*')3267"6)5 &1+,-))3$(1& 30+ *0()

48$%&4*')3267"6)56# &*1,-*2/$(11 30+ *0(2

48$%*4&')3267"6)5 )13,-***$*2. (0+ *0&)

48$%*4&')3267"6)56!% ))(,-**/$*.( &01 *0&)

Performance and Energy Efficiency

Impact of Additional Physical Registers, Intra-Lane Regfile Banking, and Additional Per-Bank Integer ALUs

0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6Normalized Tasks / Sec

0.40.50.60.70.80.91.01.11.21.31.41.51.6

Norm

aliz

edE

ner

gy

/T

ask

r32

r64

r128

r256

r32

r64r128

r256r128 r256 r128

r256

mimd-c4vt-c4v1vt-c4v1+bvt-c4v1+bi

mimd-c4 vt-c4v1 vt-c4v1+b vt-c4v1+bi

r32r64r128r256

r32r64r128r256

r128r256

r128r256

0

5

10

15

20

25

30

Ener

gy

/T

ask

(uJ)

ctrlregmemfpint

cpi$d$leak

Impact of Density-Time Execution and Stack-Based Convergence Schemes / Impact of Memory Coalescing

2.0 4.0 6.0 8.0 10.0 12.0 14.0Normalized Tasks / Sec

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Norm

aliz

edE

ner

gy

/T

ask FIFO

FIFO+dt

1-stack

1-stack+dt 2-stack

2-stack+dt

cmv+FIFO

cmv+2-stack+dt

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0Normalized Tasks / Sec

1

2

3

4

5

6

Norm

aliz

edE

ner

gy

/T

ask

vec ld/st

uT ld/st

uT ld/st + mem coalescing

Implementation Efficiency and Performance for MIMD, vector-SIMD, and VT Patterns Running Application Kernels

0.5 1.0 1.50.0

0.5

1.0

1.5

2.0

No

rmal

ized

Ener

gy

/T

ask

r32

mlane

mcore

0.5 1.0 1.5

r32

mlane

mcore

1.0 2.0 3.0

r32 mcore/mlane

0.5 1.0 1.5 2.0 2.5

r32

mlane

mcore

0.5 1.0 1.5 2.0

r32

mlane

mcore

0.5 1.0 1.5

r32mlane

mcore

Normalized Tasks / Second / Area

viterbi radix sort kmeans dither physics sim. string search

7 Conclusions

1. The Maven vector-thread architecture is more area and energy efficient than MIMD

architectures on regular DLP and (surprisingly) on irregular DLP

2. The Maven vector-thread architecture is a promising alternative to traditional vector-SIMD

architectures, providing greater efficiency and easier programmability

3. Using real RTL implementations and a standard ASIC toolflow is necessary to

compare energy-optimized future architectures

For more details, see our ISCA ’11 paper below or use the QR code on the right with a barcode reader.

”Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators”

This work was supported in part by Microsoft (Award #024263) and Intel (Award #024894, equipment donations) funding and by matching funding from U.C. Discovery

(Award #DIG07-10227). The authors acknowledge and thank Jiongjia Fang and Ji Kim for their help writing application kernels, Christopher Celio for his help writing

Maven software and developing the vector-SIMD instruction set, and Hidetaka Aoki for his early feedback on the Maven microarchitecture.

Yunsup Lee — 577C Soda Hall, Berkeley, CA 94720 — [email protected] — http://www.cs.berkeley.edu/˜yunsup