
University of Michigan, Electrical Engineering and Computer Science

Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications

Hongtao Zhong, Steven A. Lieberman, and Scott A. Mahlke

Advanced Computer Architecture Laboratory

University of Michigan


Multicore Architectures

• Multicore becomes a trend
  – Intel Core Duo, 2005
  – Intel Core Quad, 2006
  – Sun T1, 8 cores, 2005
  – 16 to 32 cores, near future
• Need for simpler cores
  – Power density
  – Cooling costs
• Multiple cores on a chip
  – High throughput
  – Good for multithreaded apps

[Figure: multicore chip diagrams, including the 8-core Sun T1 processor with cores 0 to 7 connected to L2 cache banks through an interconnect]


How About Single Thread Applications?

[Figure: single thread performance, Core Duo vs. Pentium M (same cache, same platform)]
Source: Mendelson et al., Intel Technology Journal, Vol. 10, Issue 02, 2006


Objective of this Work

• Automatically accelerate single thread applications on multicore systems
  – Exploit irregular parallelism across cores (hybrid parallelism)
    • Instruction level parallelism (ILP)
    • Fine-grain thread level parallelism (TLP)
    • Loop level parallelism (LLP)
  – Adaptive architecture
    • Configure resources to exploit available parallelism
    • Dynamic adaptability


Approach

• Voltron: Hardware/software approach
  – Architecture mechanisms
    • Dual mode execution (coupled, decoupled)
    • Flexible inter-core communication
    • Fast thread spawning
    • Efficient memory ordering
    • High rate-of-return speculation
  – Compiler techniques
    • Compiler controlled distributed branch
    • Fine-grain thread extraction
    • Speculative loop parallelization with recovery


Parallelism Type 1: ILP

• Emulate VLIW
  – Low latency communication
  – Compiler controlled distributed branch
  – Lockstep execution

[Figure: dataflow graph of a code region (loads, stores, arithmetic and logic operations, ending in a branch) partitioned across Core 0, Core 1, Core 2, and Core 3, with a branch (br) shown on every core]


Voltron Architecture for ILP

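[Figure: four cores (Core 0 to Core 3) sharing a banked L2 cache and linked by a stall bus and a br bus. Each core contains instruction fetch/decode, an L1 instruction cache (filled from the banked L2), an L1 data cache (to/from the banked L2), register files (GPR, FPR, PR, BTR), function units (FU), a memory unit (MemFU), and a communication unit (CommFU) with links to the north and to the west.]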


Experimental Setup

• Trimaran toolset
• Simulator
  – Multiple cores, multiple instruction streams
  – Inter-core communication
  – MOESI coherence protocol
• Configuration
  – 1 ALU, 1 memory unit, 1 communication unit per core
  – 1 cycle inter-core move latency per hop
  – 4KB L1 I-cache, 4KB L1 D-cache per core
  – 128KB shared L2 cache
  – Single core baseline
• 25 benchmarks from SpecInt, SpecFP, and MediaBench


[Figure: ILP Speedup. Bar chart of speedup (y-axis 1 to 2.2) for 2-core and 4-core configurations, grouped by suite, plus the average. SpecInt: 132.ijpeg, 164.gzip, 175.vpr, 197.parser, 255.vortex, 256.bzip2. SpecFP: 052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake. MediaBench: cjpeg, djpeg, epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, unepic.]

Achieves more than 80% of the performance of a wide VLIW with the same resources.


Parallelism Type 2: Fine-grain TLP

• Fine-grain threads
  – Few instructions
  – Scalar communication
  – Shared stack frame
• Decoupled execution
  – Different control flow
  – Asynchronous communication
• Fast thread spawning
• Efficient memory ordering
• Compiler algorithm
  – Memory dependences
  – Load balance

[Figure: a control flow graph of blocks A to E with its loads and stores (ld, st), partitioned into fine-grain threads on Core 0 and Core 1; blocks B to E are divided between the cores and block A appears as A and A']
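As a rough illustration of decoupled execution with asynchronous scalar communication, the sketch below uses two Python threads in place of two cores and a `queue.Queue` in place of the hardware message queue. The function name `run_decoupled`, the squaring workload, and the `None` end-of-stream marker are assumptions made for this example, not part of Voltron.

```python
import queue
import threading


def run_decoupled(data):
    """Split a sum of squares into two fine-grain threads linked by a FIFO queue."""
    channel = queue.Queue()  # stands in for a send/receive queue pair
    result = []

    def core0():
        # Producer thread: computes each scalar and sends it asynchronously.
        for x in data:
            channel.put(x * x)
        channel.put(None)  # end-of-stream marker (an assumption of this sketch)

    def core1():
        # Consumer thread: follows its own control flow and blocks on receive.
        total = 0
        while True:
            value = channel.get()
            if value is None:
                break
            total += value
        result.append(total)

    threads = [threading.Thread(target=core0), threading.Thread(target=core1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

The two threads never run in lockstep: the producer can run ahead until the consumer catches up, and the FIFO order of the queue is what keeps the scalars in program order.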


Voltron for Fine-grain TLP

[Figure: the four-core layout (Core 0 to Core 3) with banked L2 cache, highlighting the communication unit (CommFU) in each core. The Comm FU connects to the west and north neighbors and to the register file through direct mode bypass paths, and contains routing logic with a send queue and a receive queue.]


Dual Mode Network

• Coupled mode
  – Direct bypass [Multiflow]
  – Coupled execution
  – Latency of num_hops cycles (1 cycle minimum)
• Decoupled mode
  – Message queues [RAW]
  – SEND / RECV
  – Decoupled execution
  – Latency of 2 + num_hops cycles (3 cycle minimum)
  – Fast fine-grain thread spawning
  – Enforce operation ordering

[Figure: the Comm FU of a core in each mode. Coupled mode uses the direct mode bypass paths to the west and north neighbors and to the register file; decoupled mode adds routing logic with a send queue and a receive queue.]
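The latency figures for the two network modes can be written down as a small model. This is a sketch under stated assumptions: the 2x2 row-major core layout and the Manhattan-distance hop count are guesses based on the core diagram, and the function names are hypothetical. Only the formulas num_hops and 2 + num_hops come from the slide.

```python
MESH_WIDTH = 2  # assumed 2x2 layout: cores 0 and 1 on the top row, 2 and 3 below


def num_hops(src, dst, width=MESH_WIDTH):
    """Manhattan distance between two cores in a row-major mesh (assumed routing)."""
    src_row, src_col = divmod(src, width)
    dst_row, dst_col = divmod(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


def coupled_latency(hops):
    """Direct bypass: one cycle per hop, so 1 cycle between neighboring cores."""
    return hops


def decoupled_latency(hops):
    """Message queues: a fixed 2 cycles plus one cycle per hop, so 3 cycles minimum."""
    return 2 + hops
```

Under these assumptions, a value moving from core 0 to core 3 crosses two hops, which costs 2 cycles in coupled mode and 4 cycles in decoupled mode.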


[Figure: Fine-grain TLP Speedup. Bar chart of speedup (y-axis 1 to 1.8) for 2-core and 4-core configurations across the 25 benchmarks, plus the average. Nine benchmarks are marked with an asterisk.]

Works better for memory intensive applications


Parallelism Type 3: LLP

• DOALL loops
  – No cross-iteration dependences
  – Iterations can execute in parallel
  – Absence of memory dependences is hard to prove
• Statistical DOALL
  – Profile memory dependences
  – Speculatively parallelize
  – Detect violations and roll back

[Figure: an 8-iteration loop split across two cores. Core 0 runs init, iter 0-3, finalize; core 1 runs init, iter 4-7, finalize. On an unexpected dependence the cores reset and the loop restarts as iter 0-7.]

Voltron for LLP

[Figure: the four-core layout (Core 0 to Core 3) with banked L2 cache. Each core's L1 D-cache has transactional memory support; each cache line holds a T bit, tag, state, and data.]

• Detect memory dependence violations
• Roll back memory state
• Compiler rolls back register state


[Figure: LLP Speedup. Bar chart of speedup (y-axis 1 to 2) for 2-core and 4-core configurations across the 25 benchmarks, plus the average. Three bars exceed the axis and are labeled 3.12, 3.03, and 3.03.]

Accelerates non-provable DOALL loops and small loops


[Figure: Speedup for Hybrid Execution. Bar chart of speedup (y-axis 1 to 3.5) for 2-core and 4-core configurations across the 25 benchmarks, plus the average.]

• 2 core average: ILP 1.23, TLP 1.16, LLP 1.17, Hybrid 1.46
• 4 core average: ILP 1.33, TLP 1.23, LLP 1.37, Hybrid 1.83
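To see how much hybrid execution adds over the best single type of parallelism, the averages above can be compared directly. The snippet is plain arithmetic on the reported numbers; the dictionary layout and the function name `hybrid_gain` are choices made for this example.

```python
# Average speedups as reported on the slide, keyed by core count.
AVERAGE_SPEEDUP = {
    2: {"ILP": 1.23, "TLP": 1.16, "LLP": 1.17, "Hybrid": 1.46},
    4: {"ILP": 1.33, "TLP": 1.23, "LLP": 1.37, "Hybrid": 1.83},
}


def hybrid_gain(cores):
    """Ratio of the hybrid speedup to the best single-type speedup."""
    row = AVERAGE_SPEEDUP[cores]
    best_single = max(v for name, v in row.items() if name != "Hybrid")
    return row["Hybrid"] / best_single
```

With these numbers, hybrid execution is roughly 1.19x the best single type on 2 cores (1.46 / 1.23) and roughly 1.34x on 4 cores (1.83 / 1.37).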


[Figure: Time Breakdown. Stacked bar chart (y-axis 0% to 100%) showing the share of execution time spent in coupled and decoupled mode for each of the 25 benchmarks, grouped by SpecInt, SpecFP, and MediaBench, plus the average.]

Both coupled and decoupled modes are necessary.


Conclusions and Future Work

• Voltron
  – Adaptive multicore system
  – Accelerate single thread applications
  – Exploit ILP, fine-grain TLP and statistical LLP
    • Coupled and decoupled execution
    • Dual-mode operand network
    • Compiler managed loop speculation
  – Hybrid parallelism combines the benefits
• Future work
  – Fine-grain thread identification
  – Virtualization of resources


Thank You

• Questions?

For more information:

http://cccp.eecs.umich.edu
