10 June 2015 - Mill Computing, Inc. - Patents pending
One of a series…
Drinking from the Firehose
Compilation for a Belt Architecture
Talks in this series

1. Encoding
2. The Belt
3. Memory
4. Prediction
5. Metadata
6. Execution
7. Security
8. Specification
9. Pipelining
10. Compiling  <- You are here
11. ...

Slides and videos of other talks are at:
MillComputing.com/docs
Caution!
Gross over-simplification!
This talk tries to convey an intuitive understanding to the non-specialist.
The reality is more complicated.
(we try not to over-simplify, but sometimes…)
Specification

[Diagram: the abstract Mill CPU architecture is specialized into family members Tin, Copper, Silver, and Gold.]

The Mill is a family of member CPUs sharing an abstract operation set and micro-architecture. The design is specification driven: members differ in concrete operation set and micro-architecture. A designer describes a concrete member by writing a specification.
Specification

[Diagram: each member's specification (Tin, Copper, Silver, Gold) drives the data-driven tools: compiler, asm, debugger, HWgen, sim.]

Toolchain software automatically creates system software, verification tests, documentation, and a hardware framework for the new member from the specification.
Late binding to family member

Mill compiles to the abstract target - the universal superset.
Mill specializes to the concrete target - the executing family member.

[Diagram of the toolchain: C++ -> clang -> LLVM middle end -> LLVM back end -> genForm/genAsm (gen assembler) -> prelinker -> specializer -> postlinker -> conForm/conAsm (con assembler) -> target CPU.]

This talk is mostly about the specializer.
Specializer inputs: member specification

Micro-architecture attributes:
- functional unit population
- supported data sizes
- resource constraints

Operation attributes (1000+):
- op latency (e.g. +: 1, *: 3, -: 1, &: 1, retn: 0)
- issue-to-retire latency
- arg/result count and size
- bit encoding

A large static data structure, dynamically linked, mechanically generated from a ~2 page spec.
Specializer inputs: code

    int foo(int a, int b, int c, int d) {
        return (a-(b+c)) & ((b+c)*d);
    }

Static Single Assignment dataflow:

    define i32 @foo(i32 %a, i32 %b, i32 %c, i32 %d) {
    entry:
      %1 = add %b %c
      %2 = sub %a %1
      %3 = mul %1 %d
      %4 = and %2 %3
      ret %4
    }

[Diagram: the dataflow graph for foo - the function args a, b, c, d feed +, which feeds both - and *; those feed &, which feeds retn.]
Substitution pass

Goal: replace unsupported ops with emulation code.

Only a subset of operations exist in hardware; few members have native decimal or quad.

- Walk the graph
- For each op, check the spec for support
- Replace each unsupported op with an inline function
- The inline may call out-of-line code

[Diagram: an unsupported op in the dataflow graph is replaced by a call node.]
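The walk above can be sketched in a few lines. This is my reconstruction, not Mill Computing's actual code; the `__emul_` naming for the out-of-line emulation routines is a hypothetical convention for illustration.

```python
# Sketch of the substitution pass: walk the dataflow graph and replace
# any op the member spec does not support with a call to emulation code.
def substitute(graph, supported):
    """graph: dict op_id -> (opname, [arg ids]); supported: set of opnames."""
    out = {}
    for op_id, (name, args) in graph.items():
        if name in supported:
            out[op_id] = (name, args)
        else:
            # hypothetical emulation naming, e.g. quad multiply -> __emul_mulq
            out[op_id] = ("call", ["__emul_" + name] + args)
    return out
```

A real pass would inline a graph fragment rather than a bare call, but the shape - walk, check spec, replace - is the same.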
Wide issue

The Mill is wide-issue, like a VLIW or EPIC.

[Diagram: one instruction at the PC holds ops add, mul, shift in slots 0, 1, 2; each slot feeds a function pipeline containing a multiplier, shifter, and adder.]

Instruction slots correspond to function pipelines. Decode routes ops to matching pipes.
Exposed pipeline

Every operation has a fixed latency. Consider a+b - c*d:

[Diagram, three frames: add produces a+b in one cycle while mul is still computing c*d. Who holds a+b until c*d retires and sub can run?]

Code is best when producers feed directly to consumers.
Latency pass

Goal: compute minimal dataflow latency as if the hardware had infinite FU resources.

- Walk the graph
- Look up each op's latency in the spec
- Mark each op with the max retire cycle of its arguments (its issue cycle)
- Mark each result with issue cycle + op latency (its retire cycle)

Giving schedule priority to longer-latency ops reduces overall schedule latency, for faster execution.

[Diagram: with op latencies +: 1, *: 3, -: 1, &: 1, retn: 0, the function args are available at cycle 0; + retires at cycle 1, - at 2, * at 4, & at 5, and retn issues at 5.]
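The latency pass on the example graph can be sketched directly; this is a reconstruction under the assumption that ops arrive in dependency (topological) order, with the latencies from the spec above.

```python
# Op latencies from the member spec (slide values)
LATENCY = {"add": 1, "mul": 3, "sub": 1, "and": 1, "retn": 0}

def latency_pass(ops):
    """ops: list of (name, [arg producer indices, None for function args]),
       topologically ordered.
       Returns parallel lists of issue and retire cycles per op."""
    issue, retire = [], []
    for name, args in ops:
        # an op can issue once all of its arguments have retired
        iss = max((retire[a] for a in args if a is not None), default=0)
        issue.append(iss)
        retire.append(iss + LATENCY[name])
    return issue, retire
```

Running it on foo's graph reproduces the cycles in the diagram: + retires at 1, - at 2, * at 4, & at 5.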
Dependency count pass

Goal: count outstanding dependencies.

We need to know how many consumers must be placed before a producer op can be placed.

- Mark each op with its number of consumers
- Enter ops with no consumers on a worklist

[Diagram: in the example graph the function args have 4 consumers, + has 2, -, *, and & have 1 each, and retn has 0, so retn goes on the worklist.]
Schedule pass

Goal: schedule producers so their results retire just before their consumers want them.

- Take the last-retiring (longest-latency) op from the worklist
- Schedule it ahead of its consumers
- Decrement the consumer count of the producers of its arguments
- If a producer's consumer count becomes zero, enter the producer on the worklist

[Diagram, six frames: retn is scheduled first; scheduling it releases & (retire cycle 5), then * (retire 4) is taken before - (retire 2), then +, and finally the function args. Each frame shows the unplaced-consumer counts ticking down to zero. The final schedule, last instruction first: retn, &, *, -, +, function args.]
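The dependency-count and schedule passes together form a worklist algorithm. The sketch below is my reconstruction of the two passes as described on the slides, using the example graph's consumer counts and retire cycles.

```python
def schedule(ops, retire):
    """ops: dict op name -> [names of its argument producers].
       retire: dict op name -> retire cycle from the latency pass.
       Returns op names in scheduling order (last instruction first)."""
    # Dependency count pass: number of consumers per op
    consumers = {name: 0 for name in ops}
    for name, producers in ops.items():
        for p in producers:
            consumers[p] += 1
    # Seed the worklist with ops that have no consumers
    worklist = [n for n, c in consumers.items() if c == 0]
    order = []
    while worklist:
        op = max(worklist, key=lambda n: retire[n])  # last-retiring first
        worklist.remove(op)
        order.append(op)
        for p in ops[op]:
            consumers[p] -= 1
            if consumers[p] == 0:   # all consumers placed; producer is ready
                worklist.append(p)
    return order
```

On foo's graph this yields exactly the slide sequence: retn, &, *, -, +, then the function args.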
Placement pass

Goal: place ops in instructions using the limited FUs.

Ops are taken in schedule order and dropped into a tableau: one row per cycle, one column per functional unit (branch, load, ALU, mult).

[Diagram, six frames: retn goes to the branch unit in the last instruction; & to an ALU in the cycle before; * to the mult unit early enough for its 3-cycle latency; - and + to ALUs; and the function args land in cycle 0.]
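A minimal sketch of dropping scheduled ops into the tableau, assuming a toy member with two ALUs and one each of the other units. This is an illustration only: the real pass must also move an op to a different cycle when its row is full, which this sketch reports as an error instead.

```python
# Which FU kind can execute each op (illustrative mapping)
FU_KIND = {"add": "ALU", "sub": "ALU", "and": "ALU",
           "mul": "mult", "retn": "branch", "load": "load"}

def place(scheduled, fus):
    """scheduled: list of (issue cycle, opname); fus: dict kind -> count.
       Returns a dict (cycle, kind, slot) -> opname."""
    tableau = {}
    for cycle, name in scheduled:
        kind = FU_KIND[name]
        for slot in range(fus[kind]):
            if (cycle, kind, slot) not in tableau:
                tableau[(cycle, kind, slot)] = name
                break
        else:
            # a real specializer would retry in an adjacent cycle here
            raise ValueError("no free %s unit in cycle %d" % (kind, cycle))
    return tableau
```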
Symex pass

After instructions have been populated and issue and retire cycles determined, producer results must still be passed to consumer arguments. On a general-register machine they would be passed in registers. The Mill doesn't have registers; it has its own way to pass data between functional units.
We call it the Belt

Like a conveyor belt - a fixed-length FIFO.

[Diagram: a belt holding a row of values; functional units can read any position. New results drop on the front, pushing the last value off the end.]
Multiple reads, multiple drops

Functional units can read any mix of belt positions, and all results retiring in a cycle drop together.

[Diagram: several adders read different belt positions in the same cycle; all of their results drop onto the front of the belt together.]
Belt addressing

Belt operands are addressed by relative position:

    add b3, b5      (no result address!)

"b3" is the fourth most recent value to drop onto the belt; "b5" is the sixth most recent. This is temporal addressing: the temporal address of a datum changes with more drops - after three more results drop, what was b3 is b6.
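A toy model of the belt makes temporal addressing concrete. This is an illustration of the concept, not a hardware description; the belt length of 8 is one of the member sizes mentioned later in the talk.

```python
from collections import deque

class Belt:
    """Fixed-length FIFO: new results drop at the front, and position bN
       names the (N+1)-th most recent drop."""
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)  # slots[0] is b0, the front

    def drop(self, *results):
        # all results retiring in a cycle drop together; the oldest value
        # falls off the far end once the belt is full
        for r in results:
            self.slots.appendleft(r)

    def read(self, n):
        return self.slots[n]   # read belt position bN
```

Usage: after a value drops, every further drop increases its temporal address by one, exactly as on the slide where b3 becomes b6.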
Symex pass

The issue schedule and op latency give retire order, and retire order is belt order. Imagining an infinite belt, the pass walks the schedule and rewrites each argument as a belt position, counting the drops between the producer's retire and the consumer's issue:

    add  b2 b1    ; b+c   (the args a, b, c, d sit at b3, b2, b1, b0)
    mul  b0 b1    ; (b+c)*d
    nop
    sub  b4 b0    ; a-(b+c)
    and  b1 b0    ; (a-(b+c)) & ((b+c)*d)
    retn b0

[Diagram, several frames: the tableau ops at cycles 0..5 are renumbered one at a time; in a longer schedule, at cycles 13..18, the retn ends up asking for b23.]

But what if there isn't a b23?
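The renumbering walk can be sketched as below. This is my reconstruction of the idea, assuming the issue and retire cycles produced by the earlier passes; rebuilding the belt per op is wasteful but keeps the logic obvious.

```python
def belt_rename(drops, ops):
    """drops: list of (retire cycle, [values dropped, front-most last]),
       in cycle order.
       ops: list of (issue cycle, opname, [argument values]).
       Returns the ops with arguments renamed to belt positions."""
    code = []
    for issue, name, args in ops:
        belt = []                      # belt[0] is b0, the front
        for cycle, vals in drops:
            if cycle <= issue:         # drops at or before issue are visible
                for v in vals:
                    belt.insert(0, v)
        code.append((name, ["b%d" % belt.index(a) for a in args]))
    return code
```

On foo's schedule (args at cycle 0, + retiring at 1, - and * at 4, & at 5) this reproduces the slide's code exactly.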
Use it or lose it

- The compiler schedules producers near to consumers
- Nearly all one-use values are consumed while on the belt
- The belt is single-assignment: no hazards, no renames
- 300 rename registers become 8/16/32 belt positions
- But long-lived values must be saved
The scratchpad

[Diagram: values are spilled from the belt to the scratchpad and later filled back onto the belt.]

- Frame local: each function has a new scratchpad
- Fixed max size, must be explicitly allocated
- Static byte addressing, accesses must be aligned
- Three-cycle spill-to-fill latency
Symex pass

Insert spill and fill ops - and reschedule.

[Diagram: in the tableau, a spill is inserted after the producer retires, and fills are inserted ahead of consumers whose values would otherwise have fallen off the belt; retn now takes b0 from a fill.]
Symex pass

Added spill/fill ops may change the schedule, so some other results may need spill/fill too. Add more spills and fills, and reschedule again.

The iteration is guaranteed to stop with a feasible schedule: at the limit, every producer is spilled and there is a fill for every consumer, which is feasible.

In practice, most functions need no spills at all, and more than one reschedule is very rare.
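The detection step of that loop is simple to sketch: after renaming, any argument whose belt position is at or past the member's belt length needs a spill/fill pair. This reconstruction shows only the detection; the insertion and renumbering happen in the reschedule, as the slide describes.

```python
def needs_spill(code, belt_len):
    """code: list of (opname, [belt position ints]) after renaming.
       Returns the belt positions that fall off a belt of length belt_len
       and therefore need a spill before and a fill after."""
    return sorted({pos for _, args in code for pos in args
                   if pos >= belt_len})
```

For example, on a 16-position belt the `retn b23` from the earlier slide is flagged, while on a hypothetical 32-position belt nothing would be.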
The load problem

You write: load, add, shift, store.

You get: the same sequence, with stall cycles inserted after the load while it waits on memory.

Every architecture must deal with this problem.
Every CPU's goal - hide memory latency

General strategy:

- Issue loads as early as possible: as soon as the address is known, or even earlier (aka prefetch)
- Find something else to do while waiting for the data:
  - hardware approach: dynamic scheduling (the Tomasulo algorithm on the IBM 360/91)
  - software approach: static scheduling (exposed pipeline, delay slots)
- Ignore program order: issue operations as soon as their data is ready
Mill "deferred loads"

Generic Mill load operation:

    load(<address>, <width>, <delay>)

- address: 64-bit base; offset; optional scaled index
- width: scalar 1/2/4/8/16 bytes, or a vector of the same
- delay: number of issue cycles before retire

For example, with load(..., ..., 4) the load issues, four more instructions issue while it is in flight, and only then does it retire for its consumer - retire is deferred for four instructions.
Mill "deferred loads"

    int foo(int a, int b, int* p) {
        return a*b + *p;
    }

[Diagram, several frames: scheduled naively (assuming a load latency of 1 for the example), the load's data arrives too late for the + and the tableau stalls. The specializer splits the load into separate "issue" and "retire" ops. What is the latency of "issue"? If it is taken to be maxLatency, the schedule stretches and still stalls. What we want is the needed latency: the highest non-load cycle minus the retire cycle.]
Mill "deferred loads"

The algorithm:

1. Temporarily assign every "issue" a latency of maxLatency
2. Perform the latency pass normally
3. Schedule all ops except "issue" normally
4. When scheduling an "issue", adjust its latency to: the cycle of the highest placed op, minus the cycle of the corresponding "retire", minus the predicted cycle of the "issue" - or to one, whichever is larger

[Diagram: with maxLatency = 8 the latency pass stretches the dataflow; once the other ops are placed, the load's issue latency is adjusted down to a 2-cycle deferral.]
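Step 4 is a one-line computation. The sketch below is my reading of the rule as stated on the slide; the example cycle numbers are an assumption chosen to match the 2-cycle deferral shown in the diagram, not values confirmed by the talk.

```python
def issue_latency(highest_placed_cycle, retire_cycle, issue_cycle):
    """Adjusted latency for a deferred load's "issue" op: stretch the
       deferral to span the already-placed schedule between issue and
       retire, but never defer by less than one cycle."""
    return max(highest_placed_cycle - retire_cycle - issue_cycle, 1)
```

For instance, a highest placed op at cycle 4, the retire at cycle 2, and a predicted issue at cycle 0 gives a 2-cycle deferral; if the subtraction goes negative, the floor of 1 applies.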
Want more?
Sign up for technical announcements, white papers, etc.:
MillComputing.com/mailing-list