Boosting Single-thread Performance in Multi-Core Systems (transcript of isca09.cs.columbia.edu/pres/40.pdf)
Boosting Single-thread Performance in Multi-Core Systems through Fine-Grain
Multi-Threading
Carlos Madriles, Pedro López, Josep M. Codina, Enric Gibert,
Fernando Latorre, Alejandro Martínez, Raúl Martínez and Antonio González
ISCA’09
June 24th
Intel Labs Barcelona
Why Is Single Thread Performance Important?
• Exploiting TLP is more power-efficient than exploiting ILP
– The trend is to increase the number of cores integrated into the same die
– Power and thermal constraints advocate for simpler cores
• However, single thread performance matters
– Sequential and lightly threaded apps
– Sequential regions limit scalability of parallelism (Amdahl’s law)
[Graph: Amdahl's-law speedup for 1 to 64 cores, for serial fractions s = 0, s = 0.1, and s = 0.5, with zero and non-zero parallelization overhead]
Speed-up (c) = 1 / (s + ((1 - s) / c)), where s is the serial fraction and c the number of cores
Schemes to boost single-thread performance on CMPs are crucial
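Plugging a few values into the speed-up formula shows why the serial fraction dominates (a quick numeric illustration, not from the slides):

```python
def amdahl_speedup(s: float, c: int) -> float:
    """Upper bound on speedup with serial fraction s on c cores (zero overhead)."""
    return 1.0 / (s + (1.0 - s) / c)

# Even a small serial fraction caps the achievable speedup:
print(amdahl_speedup(0.0, 64))  # 64.0: perfect scaling only when s = 0
print(amdahl_speedup(0.1, 64))  # ~8.77: 10% serial code wastes most of 64 cores
print(amdahl_speedup(0.5, 64))  # ~1.97: half-serial code barely doubles
```

This is exactly the motivation on the slide: as core counts grow, the sequential part of the program, not the parallel part, limits overall performance.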
Main Research Efforts
• Main research efforts exploit both ILP and TLP to boost single-thread performance on CMPs
– Adaptive and clustered microarchitectures; combining or fusing cores
– Parallelism constrained by the instruction window
– Speculative multithreading (SpMT) techniques
– Traditional schemes constrain the exploitable parallelism
New techniques to exploit parallelism are needed
Our Contribution: Anaphase
• Novel speculative multithreading technique to boost single-thread applications on CMPs
• Effectively exploits ILP, TLP, and MLP
– Thread decomposition performed at instruction granularity
– Adapts to the available parallelism in the application
– Minimizes inter-thread dependences
– Improves workload balance
– Increases the amount of exploitable MLP
• Hw / Sw hybrid scheme
– Low hardware complexity and powerful thread decomposition
• Cost-effective hardware support on top of a conventional CMP
– A novel uncore module supports Anaphase thread execution
– Very few changes to the cores
Outline
• The Anaphase Scheme
– Threading Model
– Decomposition Algorithm
• The Anaphase Hardware Support
– Architecture Overview
– Architectural State Management
• Evaluation
• Conclusions
Anaphase Scheme Overview
[Figure: Sw side: the compiler profiles the serial program, selects hot regions, and performs fine-grain thread decomposition to produce a parallelized program. Hw side: a CMP tile (cores C0 to C3 with L2 and L3 caches) augmented with Anaphase Hw acts as the Anaphase parallelizer]
It behaves like a fused core
Anaphase Scheme Overview
[Figure: same compiler flow and CMP tile as the previous slide]
Main distinguishing features:
1. Threading model
2. Fine-grain thread decomposition
3. Hardware support
It behaves like a fused core
Threading Model: Key Features
• Threads may be composed of non-consecutive instructions
[Figure: original region CFG with basic blocks A {A1, A2, A3 (Br)}, B {B1, B2, B3, B4}, D {D1, D2, D3 (Br)}, C {C1, C2 (Br)}, decomposed into Thread 1 = {A2, B2, B3, D1, D2, C1, C2 (Br)} and Thread 2 = {A1, A3 (Br), B1, B4, D3 (Br)}]
Threading Model: Key Features
• Uses a heuristic to decide how to handle each inter-thread dependence:
– Ignore it (Hw recovers in case of error)
– Communicate it explicitly
– Pre-compute it (p-slice)
[Figure: same region decomposition as the previous slide]
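The slide names three options but not the heuristic itself; as a loose sketch (the cost numbers and decision rule here are hypothetical, not the paper's), a decomposer could estimate a cost for each option per dependence and pick the cheapest:

```python
def handle_dependence(misspec_rate: float, comm_latency: int,
                      slice_size: int, squash_cost: int = 100) -> str:
    """Pick how to handle one inter-thread dependence by estimated cost.
    All costs are in cycles and purely illustrative."""
    costs = {
        "ignore": misspec_rate * squash_cost,  # pay a squash when wrong
        "communicate": comm_latency,           # explicit core-to-core send
        "pre-compute": slice_size,             # replicate the producer slice
    }
    return min(costs, key=costs.get)

# A rarely violated dependence is cheapest to ignore;
# a hot one with a short producer slice is best pre-computed.
print(handle_dependence(misspec_rate=0.001, comm_latency=20, slice_size=5))  # ignore
print(handle_dependence(misspec_rate=0.5, comm_latency=20, slice_size=5))    # pre-compute
```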
Threading Model: Key Features
• Pre-computation replicates those instructions needed to generate the consumed value
[Figure: same decomposition, with A1 replicated as A1' into the thread that consumes its value]
Threading Model: Key Features
• Threads include all required control instructions
– Some branches may need to be replicated
– Keeps the Hw simple (no changes in the front-end)
[Figure: Thread 1 = {A2, B2, B3, D1, D2, C1, C2 (Br), A1', A3 (Br), D3 (Br)}, Thread 2 = {A1, A3 (Br), B1, B4, D3 (Br)}; branches A3 and D3 appear in both threads]
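Pre-computation slices like A1' above can be found with a backward walk over the data-dependence graph. A minimal sketch (the graph representation and the toy region contents are hypothetical):

```python
def p_slice(ddg: dict, value: str) -> set:
    """Return the set of instructions needed to produce `value`, following
    producer edges in a data-dependence graph (dict: instr -> list of producers)."""
    slice_, work = set(), [value]
    while work:
        instr = work.pop()
        if instr not in slice_:
            slice_.add(instr)
            work.extend(ddg.get(instr, []))  # walk backwards to producers
    return slice_

# Toy DDG mirroring the slide's region: branch A3 consumes A1's result.
ddg = {"A3": ["A1"], "D3": ["D1"], "C2": ["C1"]}
print(sorted(p_slice(ddg, "A3")))  # ['A1', 'A3']: A1 must be replicated as A1'
```

Every instruction in the returned slice is replicated into the consuming thread, so the value is recomputed locally instead of communicated.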
Decomposition Algorithm [1]
[Figure: compiler flow (profile, select hot regions, fine-grain thread decomposition), highlighting the decomposition step]
• Multi-level graph partitioning
– Coarsening
– First partition of the region DDG
– Based on: critical path, independence, workload balance, and delinquent loads
– Refinement
– Evolution of the traditional K-L (Kernighan-Lin) algorithm
– Refines the first partition based on a model that estimates execution time
[1] C. Madriles, P. López, J.M. Codina, E. Gibert, F. Latorre, A. Martínez, R. Martínez, A. González, “Anaphase: A Fine-Grain Thread Decomposition Scheme for Speculative Multithreading”, to be published in PACT’09.
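The slide only names the phases; as a toy sketch of the refinement idea, here is a greedy K-L-style pass driven by a cost model. The cost model below is a hypothetical stand-in for the paper's execution-time estimator, covering just two of the criteria named on the slide (workload balance and inter-thread dependences):

```python
def est_time(t0: set, t1: set, work: dict, deps: list) -> int:
    """Hypothetical cost model: the heavier thread's work plus a fixed
    penalty per dependence crossing the thread boundary."""
    comm = sum(1 for a, b in deps if (a in t0) != (b in t0))
    return max(sum(work[i] for i in t0), sum(work[i] for i in t1)) + 10 * comm

def refine(t0: set, t1: set, work: dict, deps: list):
    """Greedy K-L-style refinement: keep applying single-instruction moves
    between threads while they lower the estimated execution time."""
    t0, t1 = set(t0), set(t1)
    improved = True
    while improved:
        improved = False
        for n in list(t0 | t1):
            src, dst = (t0, t1) if n in t0 else (t1, t0)
            if len(src) == 1:          # never empty a thread
                continue
            before = est_time(t0, t1, work, deps)
            src.remove(n); dst.add(n)  # tentatively move n
            if est_time(t0, t1, work, deps) < before:
                improved = True        # keep the move
            else:
                dst.remove(n); src.add(n)  # revert it
    return t0, t1

work = {"a": 4, "b": 4, "c": 1, "d": 1}
t0, t1 = refine({"a", "b"}, {"c", "d"}, work, deps=[])
print(est_time(t0, t1, work, []))            # 5: load gets balanced
t0, t1 = refine({"a", "b"}, {"c", "d"}, work, deps=[("a", "b")])
print(est_time(t0, t1, work, [("a", "b")]))  # 8: the dependence pins a and b together
```

The real algorithm operates on a coarsened DDG and a far richer model (critical path, delinquent loads), but the move-evaluate-keep loop is the K-L essence.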
Outline
• The Anaphase Scheme
– Threading Model
– Decomposition Algorithm
• The Anaphase Hardware Support
– Architecture Overview
– Architectural State Management
• Evaluation
• Conclusions
Architecture Overview
[Figure: two tiles connected to memory; each tile contains two cores (each with its IC and DC), a shared L2, and an ICMC]
About 7% increase in area on an Intel® Core™ 2 Duo
Anaphase Execution
[Figure: tile in single-core mode: one core executes the sequential program, instructions 0 through 7]
Anaphase Execution
[Figure: tile in cooperative mode: Core 0 runs Thread 1 (instructions 0, 2, 4, 6) and Core 1 runs Thread 2 (instructions 1, 3, 5, 7)]
Local retire in thread order
Anaphase Execution
[Figure: tile in cooperative mode; instructions retired locally by each core reach the ICMC]
Global commit in program order
Anaphase Execution
[Figure: cooperative execution continues; later instructions (4 through 7) retire locally and commit globally]
Anaphase Execution
[Figure: tile in cooperative mode; during global commit the ICMC detects an inter-core memory violation]
Inter-core memory violation!
Anaphase Execution
[Figure: after the violation, the tile falls back to single-core mode and re-executes the region sequentially, instructions 0 through 7]
Architectural State Management
• Memory
– Speculative state kept private in the L1 and L2 caches
– Merge performed in program order in the L2 cache
– Inter-core memory violations detected at chunk granularity
– Supports frequent memory checkpoints and very fast squash and commit operations
• Registers
– Speculative state kept in the core PRF
– Register architectural state merged in the ICMC in program order
– Supports frequent register checkpoints
– Reduces pressure on the PRF
– Allows a core to make forward progress
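The slides don't detail the violation check; one common way to detect conflicts at chunk granularity (a sketch of the general idea, not the paper's exact mechanism) is to intersect the read and write address sets of chunks executed on different cores:

```python
def chunks_conflict(chunk_a: dict, chunk_b: dict) -> bool:
    """Two chunks from different cores conflict if one wrote an address the
    other read or wrote (cross-core RAW/WAW/WAR). Chunks are dicts with
    'reads' and 'writes' address sets; addresses could be cache-line IDs."""
    return bool(chunk_a["writes"] & (chunk_b["reads"] | chunk_b["writes"]) or
                chunk_b["writes"] & chunk_a["reads"])

c0 = {"reads": {0x40, 0x80}, "writes": {0xC0}}
c1 = {"reads": {0xC0}, "writes": {0x100}}  # reads a line c0 wrote
print(chunks_conflict(c0, c1))  # True: would trigger a squash and the fall-back to single-core mode
```

Tracking whole chunks rather than individual accesses keeps the hardware cheap, at the price of occasional false conflicts when two chunks touch the same line without a true dependence.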
Outline
• The Anaphase Scheme
– Threading Model
– Decomposition Algorithm
• The Anaphase Hardware Support
– Architecture Overview
– Architectural State Management
• Evaluation
• Conclusions
Experimental Framework
• Compiler
– Anaphase partitioning implemented on top of the Intel® Compiler (icc)
• Simulator
– Cycle-accurate x86 CMP (1 tile with 2 OoO cores modeled)
– Two different core configurations evaluated
– Intel® Core™ 2-like (Medium core)
– Half-sized main structures (Tiny core)
– Core Fusion [1] proposal implemented for comparison
• Benchmarks: Spec2006
– Profiling: PinPoint 20M traces using the train input set
– Evaluation: 100M traces using the ref input set
[1] E. Ipek, M. Kirman, N. Kirman, and J.F. Martinez, "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors", in Proc. of the Int. Symp. on Computer Architecture, 2007.
Performance Evaluation
[Figure: speedup (0.9 to 2.0 scale) of CoreFusion and Anaphase for Spec2006. SpecFP: bwaves, gamess, GemsFDTD, lbm, leslie3d, milc, namd, povray, soplex, sphinx3, tonto, wrf, plus the SpecFP average; SpecINT: astar, bzip2, gcc, gobmk, h264ref, hmmer, libquantum, mcf, omnetpp, perlbench, sjeng, zeusmp, plus the SpecINT and overall averages. Two Anaphase bars are clipped at 2.67 and 2.27; callout: MLP exploited by Anaphase]
Results shown for the Tiny core configuration
AVG speedup: Anaphase = 41%, CoreFusion = 28%
Dynamic Instruction Breakdown
[Figure: per-benchmark breakdown of dynamic instructions (%) into Serial, Parallel, Replicated, Communications, and Squashed for the Spec2006 benchmarks, plus the average]
• High coverage: less than 10% serial code
• Low overhead: 22% replicated instructions and explicit communications
• Few explicit communications: only 3%
• High accuracy: less than 1% squashed instructions
Conclusions
• Anaphase is a Sw / Hw technique that leverages multiple cores to boost single-thread code
• It exploits ILP, TLP, and MLP with high accuracy and low overheads
• Low hardware overhead (about 7% extra area)
• Important benefits: 41% average speedup for Spec2006
Thank you!