Dataflow: A Complement to Superscalar


Page 1: Dataflow:  A Complement to Superscalar

Dataflow: A Complement to Superscalar

Mihai Budiu – Microsoft Research

Pedro V. Artigas – Carnegie Mellon University

Seth Copen Goldstein – Carnegie Mellon University

2005

Page 2: Dataflow:  A Complement to Superscalar

Computer Architecture: A Simplified History

[Figure: timeline marked 1967, 1990, and 2005, with tracks labeled superscalar and dataflow.]

Page 3: Dataflow:  A Complement to Superscalar

This Work

• Re-evaluate dataflow
  – Same workloads as superscalar (C programs: Mediabench, Spec)
  – Modern performance analysis tool (whole-program critical path)
• Use of superscalar mechanisms in dataflow

Page 4: Dataflow:  A Complement to Superscalar

Why Study Dataflow

• Naturally exploits ILP
• Potentially very high ILP
• Simple, regular microarchitecture
• Very low power [1/1000 of superscalar]
• Suitable for stream processing

Page 5: Dataflow:  A Complement to Superscalar

Outline

• Motivation
• ASH: A Static Dataflow Model
• Explaining bottlenecks
• Conclusions

Page 6: Dataflow:  A Complement to Superscalar

Application-Specific Hardware

C program => Compiler => Dataflow IR

Page 7: Dataflow:  A Complement to Superscalar

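Computation = Dataflow

Program:

x = a & 7;
...
y = x >> 2;

IR: a and the constant 7 feed an & node; its result x and the constant 2 feed a >> node.

Circuits: a feeds a stage "& 7", whose output feeds a stage ">> 2".

Program          IR               Circuits
Operations       Nodes            Pipeline stages
Variables        Def-use edges    Channels (wires)

Pure dataflow: no program counter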
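The mapping on this slide can be made concrete with a small software model. The sketch below is an illustration only, not the CASH compiler's representation: `node_t`, `send`, `try_fire`, and `eval_graph` are hypothetical names. It shows the one property the slide stresses, that a node fires when its operands have arrived, with no program counter deciding the order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node type: a two-input operator that fires once both
   operands have arrived. Nothing orders the nodes except data arrival. */
typedef enum { OP_AND, OP_SHR } op_t;

typedef struct {
    op_t     op;
    uint32_t in[2];
    bool     ready[2];   /* which operands have arrived */
    bool     fired;
    uint32_t out;
} node_t;

/* Deliver a value on input `port` of node `n`. */
static void send(node_t *n, int port, uint32_t v) {
    assert(port == 0 || port == 1);
    n->in[port] = v;
    n->ready[port] = true;
}

/* Fire the node if both operands are present; returns true if it fired. */
static bool try_fire(node_t *n) {
    if (n->fired || !n->ready[0] || !n->ready[1]) return false;
    n->out = (n->op == OP_AND) ? (n->in[0] & n->in[1])
                               : (n->in[0] >> n->in[1]);
    n->fired = true;
    return true;
}

/* Evaluate y = (a & 7) >> 2, delivering operands in an arbitrary order. */
uint32_t eval_graph(uint32_t a) {
    node_t and_n = { .op = OP_AND };
    node_t shr_n = { .op = OP_SHR };

    send(&shr_n, 1, 2);               /* the constant 2 may arrive first */
    send(&and_n, 1, 7);
    bool early = try_fire(&shr_n);    /* >> cannot fire: x has not arrived */
    assert(!early);
    (void)early;

    send(&and_n, 0, a);
    if (try_fire(&and_n))
        send(&shr_n, 0, and_n.out);   /* the def-use edge for x */
    try_fire(&shr_n);
    return shr_n.out;
}
```

For example, `eval_graph(13)` computes `13 & 7 = 5` and then `5 >> 2 = 1`, regardless of the order in which the constants were delivered.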

Page 8: Dataflow:  A Complement to Superscalar

Basic Computation = Pipeline Stage

[Figure: an adder (+) followed by a latch; the stage communicates through data, valid, and ack signals.]
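A rough software analogue of such a stage, assuming a single output latch and a simple valid/ack handshake (the real circuit protocol is not spelled out on the slide, and `stage_t`, `stage_fire`, `stage_consume`, and `demo_stalls` are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one stage: an adder with one output latch. */
typedef struct {
    int  data;    /* latched result */
    bool valid;   /* latch holds a result that has not been consumed */
} stage_t;

/* Producer side: compute a + b if the latch is free.
   Returns false (a stall) while the previous result is still un-acked. */
bool stage_fire(stage_t *s, int a, int b) {
    assert(s != NULL);
    if (s->valid) return false;
    s->data = a + b;
    s->valid = true;
    return true;
}

/* Consumer side: take the result and ack it, which frees the latch. */
bool stage_consume(stage_t *s, int *out) {
    assert(s != NULL && out != NULL);
    if (!s->valid) return false;
    *out = s->data;
    s->valid = false;   /* this is the ack */
    return true;
}

/* Push n sums through the stage while the consumer only acks on odd
   steps; returns how many times the producer had to stall. */
int demo_stalls(int n) {
    assert(n >= 0);
    stage_t s = { 0, false };
    int stalls = 0, sent = 0, got = 0, out = 0;
    for (int step = 0; got < n; step++) {
        if (sent < n) {
            if (stage_fire(&s, sent, sent)) sent++;
            else stalls++;
        }
        if (step % 2 == 1 && stage_consume(&s, &out)) got++;
    }
    return stalls;
}
```

In this toy model a slow consumer throttles the producer through the missing ack, which is the kind of edge that can later show up on a critical path.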

Page 9: Dataflow:  A Complement to Superscalar

Control Flow => Data Flow

[Figure: three control-flow primitives. Merge (label) joins data inputs; Gateway passes data under a predicate; Split (branch) steers data according to p and !p.]

Page 10: Dataflow:  A Complement to Superscalar

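Loops

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Figure: dataflow graph of the loop. i starts at 0 and feeds the +1 and < 100 nodes; i*i is added to sum, which starts at 0; the negated loop predicate (!) releases sum to ret.]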
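The loop graph can be emulated sequentially to see how the Merge and Split primitives carry values around the back edge. This is a sketch under simplifying assumptions: `merge` and `sum_squares_dataflow` are hypothetical names, and a real dataflow circuit would evaluate the nodes concurrently rather than in program order.

```c
#include <assert.h>

/* Merge (label): selects the initial token on loop entry and the
   back-edge token on every later iteration. */
static int merge(int entering, int init, int back) {
    return entering ? init : back;
}

/* Emulates the dataflow graph for: for (i = 0; i < n; i++) sum += i*i; */
int sum_squares_dataflow(int n) {
    assert(n >= 0);
    int i_back = 0, sum_back = 0;   /* tokens travelling on the back edges */
    int entering = 1;
    for (;;) {
        int i   = merge(entering, 0, i_back);
        int sum = merge(entering, 0, sum_back);
        entering = 0;

        int p = (i < n);            /* loop predicate node */
        if (!p)
            return sum;             /* Split: !p releases sum to ret */

        /* Split: p steers i and sum back into the loop body. */
        sum_back = sum + i * i;     /* the * and + nodes */
        i_back   = i + 1;           /* the +1 node */
    }
}
```

With `n = 100` this returns 328350, the same value as the C loop on the slide.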

Page 11: Dataflow:  A Complement to Superscalar

Comparison: Idealized Simulation

• Compared to 4-wide OOO SimpleScalar
• Same operation latencies
• Same memory hierarchy (LSQ, L1, L2)
• not free

Page 12: Dataflow:  A Complement to Superscalar

Obvious!

ASH runs at full dataflow speed and has no resource limitations, so the CPU cannot do any better (if the compilers are equally good).

Page 13: Dataflow:  A Complement to Superscalar

SpecInt95, ASH vs 4-way OOO

[Figure: bar chart of percent slower / faster (y-axis from -50 to 30) for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, and 147.vortex.]

Page 14: Dataflow:  A Complement to Superscalar

Outline

• Motivation
• ASH: A Static Dataflow Model
• Dissection: explaining bottlenecks
• Conclusions

Page 15: Dataflow:  A Complement to Superscalar

The Scalpel

[Figure: tool flow from C through CASH to ASH and the simulator; the surviving labels are ASH drawings, trace, Dynamic Critical Path, and Automatic analysis.]

Page 16: Dataflow:  A Complement to Superscalar

The (Loop) Body

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

SpecINT95: 124.m88ksim, init_processor()
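To experiment with this loop outside the benchmark, a self-contained version might look like the sketch below. The record layout is an assumption: only the fields r and q appear in the deck, and the five 32-bit fields are chosen to match the 20-byte stride of `addiu $v1,$v1,20` in the MIPS listing. `entry_t`, `find_entry`, and `demo_find` are hypothetical names, not m88ksim's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: r and q are named in the slides; pad is assumed
   filler that brings the size to 20 bytes. */
typedef struct {
    int32_t r;
    int32_t q;
    int32_t pad[3];
} entry_t;

/* Returns the index of the first entry with r == i, or the index of the
   0xF sentinel if there is no match. */
size_t find_entry(const entry_t *X, int32_t i) {
    assert(X != NULL);
    size_t j;
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    return j;
}

/* Searches a small made-up table terminated by the 0xF sentinel. */
size_t demo_find(int32_t i) {
    static const entry_t table[] = {
        { .r = 3,   .q = 30 },
        { .r = 5,   .q = 50 },
        { .r = 9,   .q = 90 },
        { .r = 0xF, .q = 0  },
    };
    return find_entry(table, i);
}
```

Each iteration needs the loaded value X[j].r for both exit tests, which is why the load sits at the center of the analysis that follows.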

Page 17: Dataflow:  A Complement to Superscalar

Dynamic Critical Path

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

[Figure: critical path through the loop's dataflow graph; labels: load predicate, loop predicate, sizeof(X[j]), definition.]

Page 18: Dataflow:  A Complement to Superscalar

MIPS gcc Code

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

LOOP:
L1: beq   $v0,$a1,EXIT   ; X[j].r == i
L2: addiu $v1,$v1,20     ; &X[j+1].r
L3: lw    $v0,0($v1)     ; X[j+1].r
L4: addiu $a0,$a0,1      ; j++
L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
EXIT:

L1=>L2=>L3=>L5=>L1: 4-instruction loop-carried dependence

Page 19: Dataflow:  A Complement to Superscalar

If Branch Prediction Correct

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

LOOP:
L1: beq   $v0,$a1,EXIT   ; X[j].r == i
L2: addiu $v1,$v1,20     ; &X[j+1].r
L3: lw    $v0,0($v1)     ; X[j+1].r
L4: addiu $a0,$a0,1      ; j++
L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
EXIT:

L1=>L2=>L3=>L5=>L1

Page 20: Dataflow:  A Complement to Superscalar

SpecInt95, perfect prediction

[Figure: bar chart of percent slower / faster (y-axis from -60 to 60) for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, and 147.vortex; surviving labels: Speed-up, prediction, no data.]

Page 21: Dataflow:  A Complement to Superscalar

Critical Path with Prediction

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

Loads are not speculative

Page 22: Dataflow:  A Complement to Superscalar

Prediction + Load Speculation

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

~4 cycles! Load not pipelined (self-anti-dependence)

[Figure label: ack edge]

Page 23: Dataflow:  A Complement to Superscalar

OOO Pipe Snapshot

LOOP:
L1: beq   $v0,$a1,EXIT   ; X[j].r == i
L2: addiu $v1,$v1,20     ; &X[j+1].r
L3: lw    $v0,0($v1)     ; X[j+1].r
L4: addiu $a0,$a0,1      ; j++
L5: bne   $v0,$a3,LOOP   ; X[j+1].r != 0xF
EXIT:

[Figure: pipeline snapshot across the stages IF, DA, EX, WB, CT; the surviving labels show three copies of L3 in flight, annotated "register renaming".]

Page 24: Dataflow:  A Complement to Superscalar

Conclusions: Limitations of Static Dataflow

1. dataflow state is “more” distributed
2. “control” dependences still limit ILP
3. nontrivial to squash distributed speculation
4. good prediction may need global information
5. self-antidependences can be critical (removed by register renaming)
6. distributed computation => more remote accesses
7. more synchronization in dataflow (“join” is not free)


Page 26: Dataflow:  A Complement to Superscalar

Unrolling Does Not Help

for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF)
            break;
        if (X[j+1].r == i)
            break;
    }
    Y[i] = X[j].q;
}

[Figure label: when 1 iteration]

Page 27: Dataflow:  A Complement to Superscalar

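How Performance Is Evaluated

[Figure: the C source is compiled by gcc for SimpleScalar and by CASH into unlimited-ILP static dataflow. The memory hierarchy shown is LSQ, L1 (8K), L2 (1/4M), and Mem, labeled with the numbers 2, 8, and 72.]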

Page 28: Dataflow:  A Complement to Superscalar

Last-Arrival Events

[Figure: a + node with data, valid, and ack signals.]

• Event enabling the generation of a result
• May be an ack
• Critical path = collection of last-arrival edges

Page 29: Dataflow:  A Complement to Superscalar

Dynamic Critical Path

1. Start from last node
2. Trace back along last-arrival edges
3. Some edges may repeat
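The three steps can be sketched as a backward walk over a recorded event trace. This is a minimal illustration, assuming a hypothetical trace format in which every dynamic event stores the index of its last-arriving predecessor; `event_t`, `critical_path`, `demo_repeat_count`, and `MAX_NODES` are made-up names, not the actual tool's interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8

/* Hypothetical trace record: one entry per dynamic event. */
typedef struct {
    int node;          /* static node id that produced this event */
    int last_arrival;  /* index of the event (data or ack) that arrived
                          last and enabled this one; -1 for a source */
} event_t;

/* Walks back from the last event along last-arrival edges and counts how
   often each static edge (from -> to) lies on the critical path.
   Returns the number of dynamic edges on the path. */
size_t critical_path(const event_t *trace, size_t n,
                     unsigned count[MAX_NODES][MAX_NODES]) {
    assert(trace != NULL && n > 0);
    size_t len = 0;
    size_t cur = n - 1;                         /* 1. start from last node */
    while (trace[cur].last_arrival >= 0) {      /* 2. trace back */
        size_t prev = (size_t)trace[cur].last_arrival;
        assert(prev < cur);                     /* causes precede effects */
        int from = trace[prev].node, to = trace[cur].node;
        assert(from >= 0 && from < MAX_NODES && to >= 0 && to < MAX_NODES);
        count[from][to]++;                      /* 3. edges may repeat */
        len++;
        cur = prev;
    }
    return len;
}

/* Toy trace: node 0 (a load) enables node 1 (a compare) in two
   consecutive iterations; returns the count of the 0 -> 1 edge. */
unsigned demo_repeat_count(void) {
    static const event_t trace[] = {
        { 0, -1 },   /* load, iteration 1 */
        { 1,  0 },   /* compare, enabled by that load */
        { 0,  1 },   /* load, iteration 2, enabled by the compare */
        { 1,  2 },   /* compare, iteration 2 */
    };
    unsigned count[MAX_NODES][MAX_NODES] = { { 0 } };
    size_t len = critical_path(trace, 4, count);
    assert(len == 3);
    (void)len;
    return count[0][1];
}
```

Static edges with the highest counts are the ones worth examining first, since they account for most of the critical path's length.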

Page 30: Dataflow:  A Complement to Superscalar

History

Topics: out-of-order, branch prediction, speculation

• Tomasulo, IBM 360, 1967
• Thornton, CDC, 1964
• Karp, graph model, 1966
• Smith, branch prediction, 1981
• Fisher, VLIW
• Cocke, superscalar, 1985
• Smith, precise spec, 1988
• Dennis, dataflow language, 1974
• Burger, TRIPS, 2001
• Oskin, WaveScalar, 2003
• Arvind, tagged-token, 1977
• Papadopoulos, Monsoon, 1988