
University of Michigan Electrical Engineering and Computer Science

Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-Thread Applications

Hongtao Zhong, Steven A. Lieberman, and Scott A. Mahlke

Advanced Computer Architecture Laboratory

University of Michigan

Multicore Architectures

• Multicore becomes a trend
  – Intel Core Duo, 2005
  – Intel Core Quad, 2006
  – Sun T1, 8 cores, 2005
  – 16 to 32 cores, near future
• Need for simpler cores
  – Power density
  – Cooling costs
• Multiple cores on a chip
  – High throughput
  – Good for multithreaded apps

[Figure: a four-core chip (core 0 to core 3) and the 8-core Sun T1 processor (core 0 to core 7), with L2 cache banks connected by an interconnect]

How About Single Thread Applications?

[Figure: single-thread performance, Core Duo vs. Pentium M (same cache, same platform)]

Source: Mendelson et al., Intel Technology Journal, Vol. 10, Issue 02, 2006

Objective of this Work

• Automatically accelerate single thread applications on multicore systems
  – Exploit irregular parallelism across cores (hybrid parallelism)
    • Instruction level parallelism (ILP)
    • Fine-grain thread level parallelism (TLP)
    • Loop level parallelism (LLP)
  – Adaptive architecture
    • Configure resources to exploit available parallelism
    • Dynamic adaptability

Approach

• Voltron: Hardware/software approach
  – Architecture mechanisms
    • Dual mode execution (coupled, decoupled)
    • Flexible inter-core communication
    • Fast thread spawning
    • Efficient memory ordering
    • High rate-of-return speculation
  – Compiler techniques
    • Compiler controlled distributed branch
    • Fine-grain thread extraction
    • Speculative loop parallelization with recovery

Parallelism Type 1: ILP

[Figure: dataflow graph of a code region made of loads (L), stores (S), arithmetic, shift, logical, and compare operations, ending in a branch (br)]


Parallelism Type 1: ILP

• Emulate VLIW
  – Low latency communication
  – Compiler controlled distributed branch
  – Lockstep execution

[Figure: a dataflow graph of loads (L), stores (S), and ALU operations partitioned across Core 0 to Core 3, with the branch (br) replicated on every core]
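
As a rough illustration of what emulating a VLIW across cores involves, the sketch below greedily schedules a small dataflow graph onto cores that share one cycle counter (lockstep). The graph, the operation names, the one-cycle operation latency, the linear core layout, and the greedy placement rule are all assumptions made for this example; this is not the Voltron compiler's algorithm. Only the 1-cycle-per-hop move latency comes from the slides.

```python
from typing import Dict, List, Tuple

# Hypothetical dataflow graph: op -> ops whose results it consumes.
# Ops are listed in topological order (producers before consumers).
GRAPH: Dict[str, List[str]] = {
    "ld_a": [],
    "ld_b": [],
    "add": ["ld_a", "ld_b"],
    "shl": ["ld_a"],
    "mul": ["add", "shl"],
    "st": ["mul"],
}

OP_LATENCY = 1   # assumption: every operation takes one cycle
HOP_LATENCY = 1  # from the slides: 1 cycle inter-core move latency per hop


def hops(src: int, dst: int) -> int:
    """Distance between two cores, assuming a simple linear arrangement."""
    return abs(src - dst)


def schedule(graph: Dict[str, List[str]], num_cores: int) -> Dict[str, Tuple[int, int]]:
    """Greedy list scheduling: place each op on the core where it starts earliest.

    Returns op -> (core, start_cycle). Each core issues one op per cycle,
    and all cores advance on the same cycle counter (lockstep).
    """
    placed: Dict[str, Tuple[int, int]] = {}
    busy = [set() for _ in range(num_cores)]  # cycles already used per core
    for op, preds in graph.items():
        best = None
        for core in range(num_cores):
            # An operand is usable once its producer finishes and the value
            # has travelled to this core.
            ready = 0
            for p in preds:
                p_core, p_start = placed[p]
                arrival = p_start + OP_LATENCY + HOP_LATENCY * hops(p_core, core)
                ready = max(ready, arrival)
            cycle = ready
            while cycle in busy[core]:
                cycle += 1
            if best is None or cycle < best[1]:
                best = (core, cycle)
        placed[op] = best
        busy[best[0]].add(best[1])
    return placed


def makespan(placed: Dict[str, Tuple[int, int]]) -> int:
    """Total cycles until the last operation completes."""
    return max(start for _, start in placed.values()) + OP_LATENCY
```

On this toy graph the schedule takes 6 cycles on one core and 5 cycles on four cores. The gain is small because the per-hop move latency makes it cheaper to keep most of the dependence chain on one core.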

Voltron Architecture for ILP

[Figure: four cores (Core 0 to Core 3) connected by a stall bus and a br bus, sharing a banked L2 cache. Each core contains register files (GPR, FPR, PR, BTR), function units (FU, Mem FU, Comm FU) with links to the north and west neighbors, L1 instruction and data caches backed by the banked L2, and instruction fetch/decode.]

Experimental Setup

• Trimaran toolset
• Simulator
  – Multiple cores, multiple instruction streams
  – Inter-core communication
  – MOESI coherence protocol
• Configuration
  – 1 ALU, 1 memory unit, 1 communication unit per core
  – 1 cycle inter-core move latency per hop
  – 4KB L1 I-cache, 4KB L1 D-cache per core
  – 128KB shared L2 cache
  – Single core baseline
• 25 benchmarks from SpecInt, SpecFP, and MediaBench

ILP Speedup

[Figure: bar chart of ILP speedup (y-axis 1 to 2.2) for 2-core and 4-core configurations. SpecInt: 132.ijpeg, 164.gzip, 175.vpr, 197.parser, 255.vortex, 256.bzip2. SpecFP: 052.alvinn, 056.ear, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake. MediaBench: cjpeg, djpeg, epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, rawcaudio, rawdaudio, unepic. The last group of bars is the average.]

Achieved more than 80% of the performance of a wide VLIW with the same resources.


Parallelism Type 2: Fine-grain TLP

• Fine-grain threads
  – Few instructions
  – Scalar communication
  – Shared stack frame
• Decoupled execution
  – Different control flow
  – Asynchronous communication
• Fast thread spawning
• Efficient memory ordering
• Compiler algorithm
  – Memory dependences
  – Load balance

[Figure: a control-flow graph (blocks A, A', B, C, D, E) containing loads (ld) and stores (st), partitioned into two fine-grain threads running on Core 0 and Core 1]
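
The decoupled model above can be sketched in software: two threads follow different control flow and exchange scalars asynchronously through a queue. In this sketch Python threads stand in for cores and `queue.Queue` stands in for the hardware message queue; the function names, the `None` end-of-stream sentinel, and the queue depth are assumptions for illustration only.

```python
import queue
import threading


def producer(values, send_q):
    """Thread on 'core 0': computes scalars and sends them.

    It blocks only when the bounded queue is full.
    """
    for v in values:
        send_q.put(v * v)   # SEND: enqueue one scalar result
    send_q.put(None)        # assumed sentinel marking the end of the stream


def consumer(recv_q, out):
    """Thread on 'core 1': follows its own control flow, blocking only on RECV."""
    total = 0
    while True:
        v = recv_q.get()    # RECV: blocks until a value is available
        if v is None:
            break
        if v % 2 == 0:      # a branch the producer never executes
            total += v
    out.append(total)


def run(values):
    """Spawn both threads, wait for them, and return the consumer's result."""
    q = queue.Queue(maxsize=4)  # bounded, like a finite hardware queue
    out = []
    t0 = threading.Thread(target=producer, args=(values, q))
    t1 = threading.Thread(target=consumer, args=(q, out))
    t0.start()
    t1.start()
    t0.join()
    t1.join()
    return out[0]
```

Calling `run(range(6))` sums the even squares of 0 to 5. Neither thread needs to know how far the other has progressed, which is the point of decoupling.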

Voltron for Fine-grain TLP

[Figure: the four-core Voltron organization (Core 0 to Core 3, banked L2 cache) with a detailed view of the Comm FU: direct-mode bypass paths to the west and north neighbors and to the register file, plus routing logic with a send queue and a receive queue]

Dual Mode Network

• Coupled mode
  – Direct bypass [Multiflow]
  – Coupled execution
  – Latency of num_hops cycles (1 cycle minimum)
• Decoupled mode
  – Message queues [RAW]
  – SEND / RECV
  – Decoupled execution
  – Latency of 2 + num_hops cycles (3 cycles minimum)
  – Fast fine-grain thread spawning
  – Enforce operation ordering

[Figure: a core's Comm FU with direct-mode bypass paths to the west and north neighbors and to the register file (coupled mode), plus routing logic with send and receive queues (decoupled mode)]
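
The two latency figures on this slide can be written as a small model. The formulas (num_hops cycles in coupled mode, 2 + num_hops cycles in decoupled mode) come from the slide; the function names and the 2x2 row-major mesh used to count hops are assumptions for illustration.

```python
def coupled_latency(num_hops: int) -> int:
    """Direct-bypass latency in cycles: one cycle per hop."""
    if num_hops < 1:
        raise ValueError("a transfer crosses at least one hop")
    return num_hops


def decoupled_latency(num_hops: int) -> int:
    """Message-queue latency in cycles: two cycles of overhead plus one per hop."""
    if num_hops < 1:
        raise ValueError("a transfer crosses at least one hop")
    return 2 + num_hops


def mesh_hops(src: int, dst: int, width: int = 2) -> int:
    """Manhattan distance between cores on a grid (assumed row-major numbering)."""
    src_row, src_col = divmod(src, width)
    dst_row, dst_col = divmod(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)
```

Under these assumptions, a value sent between diagonal cores (0 and 3) takes 2 cycles in coupled mode and 4 cycles in decoupled mode.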

Fine-grain TLP Speedup

[Figure: bar chart of fine-grain TLP speedup (y-axis 1 to 1.8) for 2-core and 4-core configurations across the 25 SpecInt, SpecFP, and MediaBench benchmarks, plus the average]

Works better for memory-intensive applications (nine benchmarks are starred in the chart).


Parallelism Type 3: LLP

• DOALL loops
  – No cross-iteration dependences
  – Iterations can execute in parallel
  – Absence of memory dependences is hard to prove
• Statistical DOALL
  – Profile memory dependences
  – Speculatively parallelize
  – Detect violation and roll back

[Figure: timeline of a speculative DOALL loop on two cores. Each core runs init, its share of the iterations (iter 0-3 on core 0, iter 4-7 on core 1), and finalize. On an unexpected dependence the cores reset and iterations 0-7 are restarted.]
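
The statistical DOALL flow (speculate, detect a violation, roll back) can be sketched in software. In this sketch each chunk of iterations writes into a private buffer and records the addresses it reads; if a later chunk read an address that an earlier chunk wrote, all buffers are discarded and the loop is rerun serially. The class and function names, the dictionary used as memory, and the serial fallback are assumptions for illustration; the slides rely on hardware support for detection and rollback rather than software read and write sets.

```python
from typing import Callable, Dict, List, Set

Memory = Dict[int, int]


class SpeculativeChunk:
    """Runs a range of iterations against a private write buffer."""

    def __init__(self, memory: Memory):
        self.memory = memory
        self.buffer: Memory = {}      # speculative writes, not yet visible
        self.reads: Set[int] = set()  # addresses read from shared memory

    def read(self, addr: int) -> int:
        if addr in self.buffer:       # the chunk sees its own earlier writes
            return self.buffer[addr]
        self.reads.add(addr)
        return self.memory[addr]

    def write(self, addr: int, value: int) -> None:
        self.buffer[addr] = value

    def run(self, body: Callable, iters: range) -> None:
        for i in iters:
            body(i, self.read, self.write)


def run_doall(body: Callable, n: int, memory: Memory, num_cores: int = 2) -> bool:
    """Run n iterations speculatively in chunks.

    Returns True if speculation succeeded, False if it was rolled back
    and the loop was executed serially instead.
    """
    size = (n + num_cores - 1) // num_cores
    chunks: List[SpeculativeChunk] = []
    for c in range(num_cores):
        chunk = SpeculativeChunk(memory)
        chunk.run(body, range(c * size, min(n, (c + 1) * size)))
        chunks.append(chunk)

    written: Set[int] = set()
    for chunk in chunks:
        if chunk.reads & written:
            # Violation: this chunk read a stale value. Roll back by dropping
            # every buffer (shared memory was never modified), then rerun serially.
            serial = SpeculativeChunk(memory)
            serial.run(body, range(n))
            memory.update(serial.buffer)
            return False
        written |= set(chunk.buffer)

    for chunk in chunks:              # commit in iteration order
        memory.update(chunk.buffer)
    return True
```

A loop such as `a[i] = a[i] + 1` commits speculatively, while `a[i + 1] = a[i] + 1` triggers the rollback path. Both end with the same memory contents a serial run would produce.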

Voltron for LLP

[Figure: the four-core Voltron organization (Core 0 to Core 3, banked L2 cache) with the L1 D-cache extended with transactional memory support. Each cache line holds a T bit, tag, state, and data.]

• Detect memory dependence violation
• Roll back memory state
• Compiler rolls back register state

LLP Speedup

[Figure: bar chart of LLP speedup (y-axis 1 to 2) for 2-core and 4-core configurations across the 25 SpecInt, SpecFP, and MediaBench benchmarks, plus the average. Three bars exceed the axis and are labeled 3.12, 3.03, and 3.03.]

Accelerates non-provable DOALL loops and small loops.

Speedup for Hybrid Execution

[Figure: bar chart of hybrid-execution speedup (y-axis 1 to 3.5) for 2-core and 4-core configurations across the 25 SpecInt, SpecFP, and MediaBench benchmarks, plus the average]

• 2 core average: ILP 1.23, TLP 1.16, LLP 1.17, Hybrid 1.46
• 4 core average: ILP 1.33, TLP 1.23, LLP 1.37, Hybrid 1.83
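
To put the averages in perspective, the short calculation below compares hybrid execution with the best single technique at each core count. The numbers are the averages reported on this slide; the dictionary layout and the function name are assumptions for illustration.

```python
# Average speedups reported on the slide, keyed by core count.
AVERAGES = {
    2: {"ILP": 1.23, "TLP": 1.16, "LLP": 1.17, "Hybrid": 1.46},
    4: {"ILP": 1.33, "TLP": 1.23, "LLP": 1.37, "Hybrid": 1.83},
}


def hybrid_gain(cores: int) -> float:
    """Ratio of the hybrid speedup to the best single-technique speedup."""
    row = AVERAGES[cores]
    best_single = max(v for name, v in row.items() if name != "Hybrid")
    return row["Hybrid"] / best_single
```

By this measure, hybrid execution is about 19% faster than the best single technique on 2 cores (ILP) and about 34% faster on 4 cores (LLP).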

Time Breakdown

[Figure: stacked bar chart (0% to 100%) of execution time spent in coupled and decoupled mode for the 25 SpecInt, SpecFP, and MediaBench benchmarks, plus the average]

Both coupled and decoupled modes are necessary.

Conclusions and Future Work

• Voltron
  – Adaptive multicore system
  – Accelerate single thread applications
  – Exploit ILP, fine-grain TLP and statistical LLP
    • Coupled and decoupled execution
    • Dual-mode operand network
    • Compiler managed loop speculation
  – Hybrid parallelism combines the benefits
• Future work
  – Fine-grain thread identification
  – Virtualization of resources


Thank You

• Questions?

For more information:

http://cccp.eecs.umich.edu