10 June 2015 - Mill Computing, Inc. - Patents pending
One of a series…
Drinking from the Firehose
Compilation for a Belt Architecture
Talks in this series

1. Encoding
2. The Belt
3. Memory
4. Prediction
5. Metadata
6. Execution
7. Security
8. Specification
9. Pipelining
10. Compiling  <- You are here
11. ...

Slides and videos of other talks are at:
MillComputing.com/docs
Caution!
Gross over-simplification!
This talk tries to convey an intuitive understanding to the non-specialist.
The reality is more complicated.
(we try not to over-simplify, but sometimes…)
Specification

[Diagram: the abstract Mill CPU architecture is specialized into family members Tin, Copper, Silver, and Gold.]

The Mill is a family of member CPUs sharing an abstract operation set and micro-architecture. The design is specification driven: members differ in concrete operation set and micro-architecture. A designer describes a concrete member by writing a specification.
Specification

[Diagram: each member's specification (Tin, Copper, Silver, Gold) drives the data-driven tools: compiler, asm, debugger, HWgen, sim.]

Toolchain software automatically creates system software, verification tests, documentation, and a hardware framework for the new member from the specification.
Late binding to family member

Mill compiles to the abstract target - the universal superset.
Mill specializes to the concrete target - the executing family member.

[Diagram of the toolchain: C++ -> clang -> LLVM middle end -> LLVM back end -> genForm/genAsm (gen assembler) -> prelinker -> specializer -> postlinker -> conForm/conAsm (con assembler) -> target CPU.]

This talk is mostly about the specializer.
Specializer inputs: member specification

Micro-architecture attributes:
- functional unit population
- supported data sizes
- resource constraints

Operation attributes (1000+):
- op latency (e.g. +: 1, *: 3, -: 1, &: 1, retn: 0)
- issue-to-retire latency
- arg/result count and size
- bit encoding

A large static data structure, dynamically linked, mechanically generated from a ~2 page spec.
Specializer inputs: code

    int foo(int a, int b, int c, int d) {
        return (a-(b+c)) & ((b+c)*d);
    }

Static Single Assignment dataflow:

    define i32 @foo(i32 %a, i32 %b, i32 %c, i32 %d) {
    entry:
      %1 = add %b %c
      %2 = sub %a %1
      %3 = mul %1 %d
      %4 = and %2 %3
      ret %4
    }

[Diagram: the dataflow graph for foo - the function args a, b, c, d feed +, which feeds both - and *; those feed &, which feeds retn.]
Substitution pass

Goal: replace unsupported ops with emulation code.

Only a subset of operations exist in hardware; few members have native decimal or quad.

- Walk the graph
- For each op, check the spec for support
- Replace each unsupported op with an inline function
- The inline may call out-of-line code

[Diagram: an unsupported op in the dataflow graph is replaced by a call node.]
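The walk above can be sketched in a few lines. This is my reconstruction, not Mill Computing's actual code; the `__emul_` naming for the out-of-line emulation routines is a hypothetical convention for illustration.

```python
# Sketch of the substitution pass: walk the dataflow graph and replace
# any op the member spec does not support with a call to emulation code.
def substitute(graph, supported):
    """graph: dict op_id -> (opname, [arg ids]); supported: set of opnames."""
    out = {}
    for op_id, (name, args) in graph.items():
        if name in supported:
            out[op_id] = (name, args)
        else:
            # hypothetical emulation naming, e.g. quad multiply -> __emul_mulq
            out[op_id] = ("call", ["__emul_" + name] + args)
    return out
```

A real pass would inline a graph fragment rather than a bare call, but the shape - walk, check spec, replace - is the same.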
Wide issue

The Mill is wide-issue, like a VLIW or EPIC.

[Diagram: one instruction at the PC holds ops add, mul, shift in slots 0, 1, 2; each slot feeds a function pipeline containing a multiplier, shifter, and adder.]

Instruction slots correspond to function pipelines. Decode routes ops to matching pipes.
Exposed pipeline

Every operation has a fixed latency. Consider a+b - c*d:

[Diagram, three frames: add produces a+b in one cycle while mul is still computing c*d. Who holds a+b until c*d retires and sub can run?]

Code is best when producers feed directly to consumers.
Latency pass

Goal: compute minimal dataflow latency as if the hardware had infinite FU resources.

- Walk the graph
- Look up each op's latency in the spec
- Mark each op with the max retire cycle of its arguments (its issue cycle)
- Mark each result with issue cycle + op latency (its retire cycle)

Giving schedule priority to longer-latency ops reduces overall schedule latency, for faster execution.

[Diagram: with op latencies +: 1, *: 3, -: 1, &: 1, retn: 0, the function args are available at cycle 0; + retires at cycle 1, - at 2, * at 4, & at 5, and retn issues at 5.]
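The latency pass on the example graph can be sketched directly; this is a reconstruction under the assumption that ops arrive in dependency (topological) order, with the latencies from the spec above.

```python
# Op latencies from the member spec (slide values)
LATENCY = {"add": 1, "mul": 3, "sub": 1, "and": 1, "retn": 0}

def latency_pass(ops):
    """ops: list of (name, [arg producer indices, None for function args]),
       topologically ordered.
       Returns parallel lists of issue and retire cycles per op."""
    issue, retire = [], []
    for name, args in ops:
        # an op can issue once all of its arguments have retired
        iss = max((retire[a] for a in args if a is not None), default=0)
        issue.append(iss)
        retire.append(iss + LATENCY[name])
    return issue, retire
```

Running it on foo's graph reproduces the cycles in the diagram: + retires at 1, - at 2, * at 4, & at 5.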
Dependency count pass

Goal: count outstanding dependencies.

We need to know how many consumers must be placed before a producer op can be placed.

- Mark each op with its number of consumers
- Enter ops with no consumers on a worklist

[Diagram: in the example graph the function args have 4 consumers, + has 2, -, *, and & have 1 each, and retn has 0, so retn goes on the worklist.]
Schedule pass

Goal: schedule producers so their results retire just before their consumers want them.

- Take the last-retiring (longest-latency) op from the worklist
- Schedule it ahead of its consumers
- Decrement the consumer count of the producers of its arguments
- If a producer's consumer count becomes zero, enter the producer on the worklist

[Diagram, six frames: retn is scheduled first; scheduling it releases & (retire cycle 5), then * (retire 4) is taken before - (retire 2), then +, and finally the function args. Each frame shows the unplaced-consumer counts ticking down to zero. The final schedule, last instruction first: retn, &, *, -, +, function args.]
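The dependency-count and schedule passes together form a worklist algorithm. The sketch below is my reconstruction of the two passes as described on the slides, using the example graph's consumer counts and retire cycles.

```python
def schedule(ops, retire):
    """ops: dict op name -> [names of its argument producers].
       retire: dict op name -> retire cycle from the latency pass.
       Returns op names in scheduling order (last instruction first)."""
    # Dependency count pass: number of consumers per op
    consumers = {name: 0 for name in ops}
    for name, producers in ops.items():
        for p in producers:
            consumers[p] += 1
    # Seed the worklist with ops that have no consumers
    worklist = [n for n, c in consumers.items() if c == 0]
    order = []
    while worklist:
        op = max(worklist, key=lambda n: retire[n])  # last-retiring first
        worklist.remove(op)
        order.append(op)
        for p in ops[op]:
            consumers[p] -= 1
            if consumers[p] == 0:   # all consumers placed; producer is ready
                worklist.append(p)
    return order
```

On foo's graph this yields exactly the slide sequence: retn, &, *, -, +, then the function args.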
Placement pass

Goal: place ops in instructions using the limited FUs.

Ops are taken in schedule order and dropped into a tableau: one row per cycle, one column per functional unit (branch, load, ALU, mult).

[Diagram, six frames: retn goes to the branch unit in the last instruction; & to an ALU in the cycle before; * to the mult unit early enough for its 3-cycle latency; - and + to ALUs; and the function args land in cycle 0.]
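A minimal sketch of dropping scheduled ops into the tableau, assuming a toy member with two ALUs and one each of the other units. This is an illustration only: the real pass must also move an op to a different cycle when its row is full, which this sketch reports as an error instead.

```python
# Which FU kind can execute each op (illustrative mapping)
FU_KIND = {"add": "ALU", "sub": "ALU", "and": "ALU",
           "mul": "mult", "retn": "branch", "load": "load"}

def place(scheduled, fus):
    """scheduled: list of (issue cycle, opname); fus: dict kind -> count.
       Returns a dict (cycle, kind, slot) -> opname."""
    tableau = {}
    for cycle, name in scheduled:
        kind = FU_KIND[name]
        for slot in range(fus[kind]):
            if (cycle, kind, slot) not in tableau:
                tableau[(cycle, kind, slot)] = name
                break
        else:
            # a real specializer would retry in an adjacent cycle here
            raise ValueError("no free %s unit in cycle %d" % (kind, cycle))
    return tableau
```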
Symex pass

After instructions have been populated and issue and retire cycles determined, producer results must still be passed to consumer arguments. On a general-register machine they would be passed in registers. The Mill doesn't have registers; it has its own way to pass data between functional units.
We call it the Belt

Like a conveyor belt - a fixed-length FIFO.

[Diagram: a belt holding a row of values; functional units can read any position. New results drop on the front, pushing the last value off the end.]
Multiple reads, multiple drops

Functional units can read any mix of belt positions, and all results retiring in a cycle drop together.

[Diagram: several adders read different belt positions in the same cycle; all of their results drop onto the front of the belt together.]
Belt addressing

Belt operands are addressed by relative position:

    add b3, b5      (no result address!)

"b3" is the fourth most recent value to drop onto the belt; "b5" is the sixth most recent. This is temporal addressing: the temporal address of a datum changes with more drops - after three more results drop, what was b3 is b6.
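A toy model of the belt makes temporal addressing concrete. This is an illustration of the concept, not a hardware description; the belt length of 8 is one of the member sizes mentioned later in the talk.

```python
from collections import deque

class Belt:
    """Fixed-length FIFO: new results drop at the front, and position bN
       names the (N+1)-th most recent drop."""
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)  # slots[0] is b0, the front

    def drop(self, *results):
        # all results retiring in a cycle drop together; the oldest value
        # falls off the far end once the belt is full
        for r in results:
            self.slots.appendleft(r)

    def read(self, n):
        return self.slots[n]   # read belt position bN
```

Usage: after a value drops, every further drop increases its temporal address by one, exactly as on the slide where b3 becomes b6.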
Symex pass

The issue schedule and op latency give retire order, and retire order is belt order. Imagining an infinite belt, the pass walks the schedule and rewrites each argument as a belt position, counting the drops between the producer's retire and the consumer's issue:

    add  b2 b1    ; b+c   (the args a, b, c, d sit at b3, b2, b1, b0)
    mul  b0 b1    ; (b+c)*d
    nop
    sub  b4 b0    ; a-(b+c)
    and  b1 b0    ; (a-(b+c)) & ((b+c)*d)
    retn b0

[Diagram, several frames: the tableau ops at cycles 0..5 are renumbered one at a time; in a longer schedule, at cycles 13..18, the retn ends up asking for b23.]

But what if there isn't a b23?
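The renumbering walk can be sketched as below. This is my reconstruction of the idea, assuming the issue and retire cycles produced by the earlier passes; rebuilding the belt per op is wasteful but keeps the logic obvious.

```python
def belt_rename(drops, ops):
    """drops: list of (retire cycle, [values dropped, front-most last]),
       in cycle order.
       ops: list of (issue cycle, opname, [argument values]).
       Returns the ops with arguments renamed to belt positions."""
    code = []
    for issue, name, args in ops:
        belt = []                      # belt[0] is b0, the front
        for cycle, vals in drops:
            if cycle <= issue:         # drops at or before issue are visible
                for v in vals:
                    belt.insert(0, v)
        code.append((name, ["b%d" % belt.index(a) for a in args]))
    return code
```

On foo's schedule (args at cycle 0, + retiring at 1, - and * at 4, & at 5) this reproduces the slide's code exactly.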
Use it or lose it

- The compiler schedules producers near to consumers
- Nearly all one-use values are consumed while on the belt
- The belt is single-assignment: no hazards, no renames
- 300 rename registers become 8/16/32 belt positions
- But long-lived values must be saved
The scratchpad

[Diagram: values are spilled from the belt to the scratchpad and later filled back onto the belt.]

- Frame local: each function has a new scratchpad
- Fixed max size, must be explicitly allocated
- Static byte addressing, accesses must be aligned
- Three-cycle spill-to-fill latency
Symex pass

Insert spill and fill ops - and reschedule.

[Diagram: in the tableau, a spill is inserted after the producer retires, and fills are inserted ahead of consumers whose values would otherwise have fallen off the belt; retn now takes b0 from a fill.]
Symex pass

Added spill/fill ops may change the schedule, so some other results may need spill/fill too. Add more spills and fills, and reschedule again.

The iteration is guaranteed to stop with a feasible schedule: at the limit, every producer is spilled and there is a fill for every consumer, which is feasible.

In practice, most functions need no spills at all, and more than one reschedule is very rare.
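The detection step of that loop is simple to sketch: after renaming, any argument whose belt position is at or past the member's belt length needs a spill/fill pair. This reconstruction shows only the detection; the insertion and renumbering happen in the reschedule, as the slide describes.

```python
def needs_spill(code, belt_len):
    """code: list of (opname, [belt position ints]) after renaming.
       Returns the belt positions that fall off a belt of length belt_len
       and therefore need a spill before and a fill after."""
    return sorted({pos for _, args in code for pos in args
                   if pos >= belt_len})
```

For example, on a 16-position belt the `retn b23` from the earlier slide is flagged, while on a hypothetical 32-position belt nothing would be.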
The load problem

You write: load, add, shift, store.

You get: the same sequence, with stall cycles inserted after the load while it waits on memory.

Every architecture must deal with this problem.
Every CPU's goal - hide memory latency

General strategy:

- Issue loads as early as possible: as soon as the address is known, or even earlier (aka prefetch)
- Find something else to do while waiting for the data:
  - hardware approach: dynamic scheduling (the Tomasulo algorithm on the IBM 360/91)
  - software approach: static scheduling (exposed pipeline, delay slots)
- Ignore program order: issue operations as soon as their data is ready
Mill "deferred loads"

Generic Mill load operation:

    load(<address>, <width>, <delay>)

- address: 64-bit base; offset; optional scaled index
- width: scalar 1/2/4/8/16 bytes, or a vector of the same
- delay: number of issue cycles before retire

For example, with load(..., ..., 4) the load issues, four more instructions issue while it is in flight, and only then does it retire for its consumer - retire is deferred for four instructions.
Mill "deferred loads"

    int foo(int a, int b, int* p) {
        return a*b + *p;
    }

[Diagram, several frames: scheduled naively (assuming a load latency of 1 for the example), the load's data arrives too late for the + and the tableau stalls. The specializer splits the load into separate "issue" and "retire" ops. What is the latency of "issue"? If it is taken to be maxLatency, the schedule stretches and still stalls. What we want is the needed latency: the highest non-load cycle minus the retire cycle.]
Mill "deferred loads"

The algorithm:

1. Temporarily assign every "issue" a latency of maxLatency
2. Perform the latency pass normally
3. Schedule all ops except "issue" normally
4. When scheduling an "issue", adjust its latency to: the cycle of the highest placed op, minus the cycle of the corresponding "retire", minus the predicted cycle of the "issue" - or to one, whichever is larger

[Diagram: with maxLatency = 8 the latency pass stretches the dataflow; once the other ops are placed, the load's issue latency is adjusted down to a 2-cycle deferral.]
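Step 4 is a one-line computation. The sketch below is my reading of the rule as stated on the slide; the example cycle numbers are an assumption chosen to match the 2-cycle deferral shown in the diagram, not values confirmed by the talk.

```python
def issue_latency(highest_placed_cycle, retire_cycle, issue_cycle):
    """Adjusted latency for a deferred load's "issue" op: stretch the
       deferral to span the already-placed schedule between issue and
       retire, but never defer by less than one cycle."""
    return max(highest_placed_cycle - retire_cycle - issue_cycle, 1)
```

For instance, a highest placed op at cycle 4, the retire at cycle 2, and a predicted issue at cycle 0 gives a 2-cycle deferral; if the subtraction goes negative, the floor of 1 applies.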
Want more?
Sign up for technical announcements, white papers, etc.:
MillComputing.com/mailing-list