Industrial Semantics Or How to Stop the Maths Getting in the Way of the Marketing Joe Stoy
description
Transcript of Industrial Semantics Or How to Stop the Maths Getting in the Way of the Marketing Joe Stoy
Industrial SemanticsOr
How to Stop the Maths Getting in the Way of the Marketing
Joe Stoy Founder and Principal Engineer
Bluespec, Inc.
(with help from many at Bluespec)
APPSEM05 Workshop, 15 September 2005
Basic Message
An industrial tool needs good semantics Robust Simple Conforming to users’ model
But the theory must be “under the covers”
Learning curve Perceived learning curve
A technology based on Term-Rewriting Systems
A superstructure based on functional programming semantics
Semantic issues
Outline
Tool and marketFor designing chips (ASICs, FPGAs, ...)
currently low-level with Verilog or VHDL chip complexity rising (millions of gates)
For chip designers, verification engineers, system architects
ASICs have huge NREs ($500K–$1M) mistakes (respins) cost another NRE tools run into millions of $$$ per team,
form a significant fraction of a company’s budget (e.g., ~10%)
tools tend to run on UNIX (Solaris, Linux)
Productization within industryMajor pilot project: Arbiter for 160 Gb/s router (1.5M gates, 200 MHz, 1.3)
History of this technologyResearch@MIT on high-level synthesis & verification(Prof. Arvind et. al.)
Technology
Technology
VC funding
~1996 2000 2003
Bluespec, Inc.High-level synth. tool
2004
Productavailable
BluespecSystemVerilog
Transaction
Verilog RTL
Netlist
Design Flow
DALDesign Assertion Level
CycleAccurateSimulatio
n
EventBased
Simulation
RTLSynthesis
RTLSynthesis
Bluesim
BluespecSynthesisBluespecSynthesis
Cosim: Typical Use ModelsBluespec integrated into SystemC/Verilog
SystemC/Verilog integrated into Bluespec
Bluespec
SystemC/Verilog
SystemC/Verilog
Bluespec
Where Bluespec is:• …integrated into Verilog System-on-Chip (SoC) design• …back-annotated into SystemC model• …part of mixed SystemC/Bluespec model
Where Bluespec is designed with:• …existing Verilog IP re-used• …a SystemC model awaiting the Verilog
A technologybased on
term-rewriting systems
Term-rewriting systemswhile some rule is applicable
choose an applicable rule apply it to the term (or a subcomponent)
FP’s standard operational semanticsOur rewritings are less free-wheeling (don’t change structure of term)
maybe “state transition system” a better name
Clocked synchronous hardware
The compiler translates BSV source code into Verilog RTL
TransitionLogic
IOS“Next” SCollection
ofState
Elements
Scheduling and control logicModules’
(Current state)Rules
Scheduler
1
n
1
n
Muxing
1
nn
n
Modules’(Next state)
cond
action
“CAN_FIRE” “WILL_FIRE”
Term-rewriting systemswhile some rule is applicable
choose an applicable rule apply it to the term (or a subcomponent)
FP’s standard operational semanticsOur rewritings are less free-wheeling (don’t change structure of term)
Rules are constructed from guarded atomic actions with interfaces
Guarded Atomic Actions
Actions are guardedfifo.enq(x);
if fifo full, cannot happen hides lots of tedious bureaucracy
Actions are atomic
Atomicity
atomic
Atomicity
ατομος
Atomicity
a-tomic
Atomicity
a-tomic not
asymmetric atypical amoral
Atomicity
a-tomic not
asymmetric atypical amoral
cut microtome tomography tome (of a multi-volume book)
Atomicity
Rules are atomic“Not cut”
Whenever they run, they run to completion never interrupted
No other activities are interleaved with them
This greatly simplifies design avoids many race conditions
Guarded Atomic Actions
Actions can be composeda1;a2
resulting action atomic guarded by guards of a1 and a2
Guarded Atomic ActionsConditionals:
if (b) a1; Guarded by “b implies (a1’s guards)”
Another conditional:when (b) a1;
guarded by b (and a1’s guards)
Yet anotherperhaps-if (b) a1;
unguarded a1’s guards conjoined to b
Nice algebra (separate question — which to have in BSV)
… with interfaces
A BSV design is structured by modulesModules communicate only through interfaces
Modules and interfaces
interface
state
rule
module
An exampleinterface NumFn; method Action start (int n); method int result ();endinterface
module mkFact2 (NumFn); Reg#(int) x <- mkReg(?); Reg#(int) j <- mkReg(0);
rule step (j > 0); x <= x * j; j <= j - 1; endrule
method start (n) if (j == 0); x <= 1; j <= n; endmethod
method result () if (j == 0); return x; endmethodendmodule: mkFact2
module mkTest ();
int n = 15; // constant
Reg#(int) state <- mkReg(0); NumFn f <- mkFact2();
rule go (state == 0); f.start (n); state <= 1; endrule
rule finish (state == 1); $display (“Result is %d”, f.result()); state <= 2; endruleendmodule: mkTest
module mkTest () ; … Fact f <- mkFact2(); … rule finish (state==1); $display(“…%d”, f.result()) state <= 2; endruleendmodule
Method invocationsfit into rules
Rule condition is: state==1 && j==0 Explicit condition and all implicit conditions
of all method calls in the rule
Thus, a part of the rule’s condition (j == 0), and a part of a rule’s computation (reading x)
are in a different module, via a method invocation
module mkFact2 (Fact); Reg#(int) x <- mkReg(?) Reg#(int) j <- mkReg(n); … method result () if (j == 0); return x; endmethodendmodule
module mkTest () ; … Fact f <- mkFact2();
rule go (state==0); f.start(15); state <= 1; endrule …endmodule
Modularizing rules
Rule condition: state==0 && j==0
Rule actions: state<=1, x<=1 and j<=15
Thus, a part of the rule’s action is in a different module
module mkFact2 (Fact); Reg#(int) x <- mkReg(?) Reg#(int) j <- mkReg(0); … method start (int n) if (j == 0); x <= 1; j <= n; endmethodendmodule
Order of Evaluation
Not Lazy (e.g. Haskell’s)Schedule as many rules as possible in each clock cycle
(patented technology – James Hoe et al)
Clocked synchronous hardware
The compiler translates BSV source code into Verilog RTL
TransitionLogic
IOS“Next” SCollection
ofState
Elements
Rule semanticsmapped to hardware semantics
Rules
HW
Ri Rj Rk
clocks
rule steps
Ri
RjRk
The effect of each cycle is as if a sequence of ruleswas executed one-at-a-time
Consequence: The HW state can never result from an interleaving of actions from different rules
Rule atomicity (therefore, correctness) is preserved
Scheduling and control logicModules’
(Current state)Rules
Scheduler
1
n
1
n
Muxing
1
nn
n
Modules’(Next state)
cond
action
“CAN_FIRE” “WILL_FIRE”
… leads to pragmaticconstraints on rule combination
Initial set: A rule fires within a clock cycle A rule fires at most once in a clock cycle A rule’s effect is only visible in the next clock We only combine rules in a certain fixed order within a cycle All rules which read a register must precede any which write
it We only consider rules enabled at the start of the clock cycle Each rule is independent of previous rules executing in the
same cycle The logic path delay depends on individual rule paths, and
not on combinations of rules …
(Some since relaxed)
Benefits ofatomic-action
semantics
Consider this exampleProcess 0 increments register xProcess 1 transfers a unit from register x to register yProcess 2 decrements register y
This is an abstraction of some real applications: Bank account: 0 = deposit to checking, 1 = transfer from
checking to savings, 2 = withdraw from savings Packet processor: 0 = packet arrives, 1 = packet is
processed, 2 = packet departs …
0 1 2
x y
+1 -1 +1 -1
Concurrency in the example
Process j (= 0,1,2) only updates under condition condjOnly one process at a time can update a register. Note:
Process 0 and 2 can run concurrently if process 1 is not running
Both of process 1’s updates must happen “indivisibly” (else inconsistent state)
Suppose we want to prioritize process 2 over process 1 over process 0
0 1 2x y
+1 -1 +1 -1
Process priority: 2 > 1 > 0
cond0 cond1 cond2
Is either correct?
Which of these solutions are correct, if any?What’s required to verify that they’re correct?Now, what if I Δ’d the priorities: 1 > 2 > 0?And, what if the processes are in different modules?
always @(posedge CLK) // process 0 if (!cond1 && cond0) x <= x + 1;
always @(posedge CLK) // process 1 if (!cond2 && cond1) begin y <= y + 1; x <= x – 1; end
always @(posedge CLK) // process 2 if (cond2) y <= y – 1;
always @(posedge CLK) begin if (!cond2 && cond1) x <= x – 1; else if (cond0) x <= x + 1;
if (cond2) y <= y – 1; else if (cond1) y <= y + 1;end
0 1 2x y
+1 -1 +1 -1 Process priority: 2 > 1 > 0
cond0 cond1 cond2
if ((!cond1 || cond2) && cond0)
Where’sthe error?
With Bluespec, design is direct
(* descending_urgency = “proc2, proc1, proc0” *)
rule proc0 (cond0); x <= x + 1;endrule
rule proc1 (cond1); y <= y + 1; x <= x – 1;endrule
rule proc2 (cond2); y <= y – 1;endrule
Functional correctness follows directly from rule semantics
Related actions are grouped naturally with their conditions—easy to change
Interactions between rules are managed by the compiler (scheduling, muxing, control)
Same hardware as the RTL
0 1 2x y
+1 -1 +1 -1
Process priority: 2 > 1 > 0cond0 cond1 cond2
Reorder BufferVerification-centric design
Decode
Example from CPU design
Nirav Dave, MEMOCODE, 2004
Speculative, out-of-order
Many, many concurrent activities
Branch
RegisterFile
ALUUnitRe-
OrderBuffer(ROB) MEM
Unit
DataMemory
InstructionMemory
Fetch Decode
FIFO
FIFO FIFO FIFO FIFO
FIFO
FIFOFIFO
FIFOFIFORe-
OrderBuffer(ROB)
Branch
RegisterFile
ALUUnit
MEMUnit
DataMemory
InstructionMemory
Fetch
ROB actionsEmpty
WaitingDispatched
KilledDone
EWDiKDo
Head
Tail
V - -Instr - V -
V - -Instr - V -
V - -Instr - V -
V - -Instr - V -
V - -Instr - V -
V - -Instr - V -
V - -Instr - V -
V - -Instr - V -
V - -Instr - V -
V - -Instr - V -
V 0 -Instr B V 0W
V 0 -Instr C V 0W
-Instr D V 0W
V 0 -Instr A V 0W
V - -Instr - V -
V - -Instr - V -E
E
E
E
E
E
E
E
E
E
E
E
V 0
Re-Order Buffer
Insert aninstr into
ROB
DecodeUnit
RegisterFile
Get operandsfor instr
Writebackresults
Get a readyALU instr
Get a readyMEM instr
Put ALU instr results in ROB
Put MEM instr results in ROB
ALUUnit(s)
MEMUnit(s)Resolve
branches
Operand 1 ResultInstruction Operand 2State
But, what about allthe potential race conditions?
Reading from the register file at the same time a separate instruction is writing back to the same location
Which value to read?
An instruction is being inserted into the ROB simultaneously with a dependent upstream instruction’s result coming back from an ALU
Put a tag or the value in the operand slot?
An instruction is being inserted into the ROB simultaneously with a branch mis-prediction
must kill the mis-predicted instructions and restore a “consistent state” across many modules
Rule Atomicity Lets you code each operation in isolation
Eliminates the nightmare of race conditions (“inconsistent state”) under such complex concurrency conditions
Insert Instr in ROB• Put instruction in firstavailable slot• Increment tail pointer• Get source operands
- RF <or> prev instr
Dispatch Instr• Mark instructiondispatched• Forward to appropriateunit
Write Back Results to ROB• Write back results toinstr result• Write back to all waitingtags• Set to done
Commit Instr• Write results to registerfile (or allow memorywrite for store)• Set to Empty• Increment head pointer
Branch Resolution• …• …• …
All behaviors are explicable as a sequence of atomic actions on the state
Performance SemanticsAnother processor example
Rule-based Specifications
RF
iMem dMem
WbIF
bI
Exe
bE
Mem
bW
Dec
bD
rule Execute Add: when (bD.first == (Ri = va + vb)) ==> begin result = va + vb; // compute addition
bE.enq (Ri = result); // enqueue result into bE
bD.deq; // dequeue instruction from bd end
Each pipeline stage is described as a set of atomic rules:
Any legal behavior can be understood in terms of applying one rule at a time
bypasses
R1 =
2 +
3
R1 =
5
Performance ConcernsThe designer wants to make sure that one instruction executes every cycleFIFOs must support both enq and deq in each cycle
RF
iMem dMem
WbIF
bI
Exe
bE
Mem
bW
Dec
bD
I0I1I2I3
A cycle in
slow motion
I4I5
What are thesemanticsof FIFOs?
data_in
push_req_n
pop_req_n
clk
rstn
data_out
full
empty
Example from a commerciallyavailable FIFO IP component
These constraints are taken from several paragraphs of documentation, spread over many pages, interspersed with other text
A FIFO interface in BSV
interface FIFOQueue #(type aType); method Action push (aType val); method aType first(); method ActionValue#(aType) pop(); method Action clear();endinterface
Methods as ports
rdy
enabn
rdy
pus
hcl
ea
r
not full
n
rdy
enab
pop
not empty
always true
FIFO
Qu
eu
em
od
ule
enab
n
rdy first
not empty
push:• n-bit argument• has side effect (action)
first:• n-bit result• has no side effect
pop:• n-bit result• has side effect (action)
clear:• no argument• has side effect (action)
FIFO semantics: types
interface FIFOQueue #(type aType); method Action push (aType val); method aType first(); method ActionValue#(aType) pop(); method Action clear();endinterface
FIFO semantics: laws
Algebra of enq, deq, etc
Not at present part of BSV Though SVA assertions are
Needed for formal verification work
So far, all in atomic-action world
FIFO semantics: performance
“Must support enq and deq in same cycle”
Naïve implementation
iptr
optr
(ignoring wraparound)
Bool not_empty = (iptr != optr);Bool not_full = (iptr+1 != optr);
method Action enq(x) if (not_full); buf[iptr] <= x; iptr <= iptr + 1;endmethod
method Actionvalue deq() if (not_empty); optr <= optr + 1; return (buf[optr]);endmethod
Semantics: concurrencyEmpty Neither Full
Naïve can’t deq enq and deq conflict can’t enq
Normal can’t deq enq and deq conflict-free can’t enq
Turbo can’t deq enq + deqdeq before
enq
Bypassenq before
deqenq + deq can’t enq
Super
(not yet)
enq before deq
enq + deq deq before
enq
A problem with “super”
rdy
enabn
pus
h
not full
n
rdy
enabp
op
not empty
So what are the semantics of FIFOs?
Performance requirements are sometimes an essential part of (interface) specification
Pipeline efficiency
Sometimes notAllow extra optional attributes on interface definitions?
Take care not accidentally to re-invent inheritance (à la OOP) badly
Semantics
Maybe best to separate functional and performance semantics
So functional semantics can be handled entirely in TRS (atomic action) framework
Care: must not rely on performance semantics to achieve functionally necessary concurrency.
If actions must be simultaneous, compose them in same action; don’t put them in separate rules
The processor exampleComposing Atomic Rules with Performance Guarantees
MIT PhD: Rosenband 2005
Performance ConcernsThe designer wants to make sure that one CPU instruction executes every cycleSuppose the designer can specify which rules (if enabled) should execute within a clock cycle, and in which order.
No change in rules
RF
iMem dMem
WbIF
bI
Exe
bE
Mem
bW
Dec
bD
I0I1I2I3
Wb x Mem x Exe x Dec x IF
Correctness of hardware guaranteed by the Compiler Order is important to minimize FIFO size and achieve optimal bypassing
A cycle in
slow motion
I4I5
Experiments in schedulingWhat happens if the user specifies:
No change in rules
RF
iMem dMem
WbIF
bI
Exe
bE
Mem
bW
Dec
bD
Executing 2 instructions per cycle requires more resources but is functionally equivalent to the original design
Wb x Wb x Mem x Mem x Exe x Exe x Dec x Dec x IF x IF
I1 I0I3 I2I5 I4I7 I6I9 I8 I3 I2I5 I4I7 I6I9 I8I11 I10
A cycle in
slow motion
a superscalar processor!
4-Stage Processor Results
benchmark: a program containing additions / jumps / loadc’s
DesignBenchmark
(cycles)Area 10ns
(µm2)
Timing10ns(ns)
Area2ns
(µm2)
Timing2ns(ns)
1 element fifo:
No Spec 18525 24762 5.85 26632 2.00
Spec 1 11115 25094 6.83 33360 2.00
Spec 2 11115 25264 6.78 34099 2.04
2 element fifo:
No Spec. 18525 32240 7.38 39033 2.00
Spec 1 11115 32535 8.38 47084 2.63
Spec 2 7410 45296 9.99 62649 4.72
Dan Rosenband & Arvind 2004
A superstructurebased on
disguised Haskell
Implementing BSV:two-level compilation
Level 2 Synthesis • Scheduling• Control logic generation• Object code generation
Level 1 Elaboration • Desugaring• Type-checking• Massive partial evaluation and static elaboration
BSV source program
Term Rewriting System(rules and actions)
Object code(Verilog RTL or C)
Bluespec Classic:a Haskell-based HDL
package Shift(shift) where
import List
sstep :: Bit m -> Bit n -> Nat -> Bit n
sstep s x i when s[i:i] == 1 = x << (1 << i)
sstep s x i = x
shift :: Bit m -> Bit n -> Bit n
shift s x = foldl (sstep s) x
(map fromInteger
(upto 0 ((valueOf m) - 1)))
Lennart Augustsson (2000)
Selling BS ClassicUnfamiliar syntax a significant barrier
even in marketing slides even ()s in function calls are different!
Many fronts in adoption war new hardware design methodology new unfamiliar syntax new type system new purely functional thinking new FP concepts (map, fold, monads)
Adapt an existing HDLMap matching concepts
expressions, bit vectors, functions, modules
Extend where straightforward higher-order functions, first-class objects,
polymorphism
Standardize where possible tagged unions, pattern matching (SV 3.1a)
Desugar where required imperative assignments, loops
Bluespec Classic:a Haskell-based HDLpackage Shift(shift) where
import List
sstep :: Bit m -> Bit n -> Nat -> Bit n
sstep s x i when s[i:i] == 1 = x << (1 << i)
sstep s x i = x
shift :: Bit m -> Bit n -> Bit n
shift s x = foldl (sstep s) x
(map fromInteger
(upto 0 ((valueOf m) - 1)))
Bluespec SystemVerilog:FP with Verilog Syntax
function Bit#(n) sstep(Bit#(m) s, Bit#(n) x, Nat i); if(s[i] == 1) return(x << (1 << i)); else return x;endfunction
function Bit#(n) shift(Bit#(m) s, Bit#(n) x); return(foldl((sstep(s)), x, (map(fromInteger, upto(0, valueof(m) - 1)))));endfunction
Bluespec SystemVerilog:Imperative circuit constructionfunction Bit#(n) shift(Bit#(m) s, Bit#(n) x);
Integer max = valueof(m);
Bit#(n) xA [max+1];
xA[0] = x;
for (Integer j = 0; j < max; j = j + 1)
if (s[fromInteger(j)] == 1)
xA[j+1] = xA[j] << (1 << fromInteger(j));
else
xA[j+1] = xA[j];
return xA[max];
endfunction
Teaching BSVLimited training time (2-4 days typical)Audience: hardware designers
little or no FP background wires and registers, not abstractions conservative (remember cost of mistakes?)
Format: lectures interspersed with labsNeed to communicate basics
or else evaluation project might be hard
Want to show full range of features or else benefits not perceived and no sale
Teaching conclusionsFunctional features are “advanced”
Get in the way of communicating basics
Strict typing seen as restrictive bit vs. Bool bit-width constraints structures vs. bit representations
Standards less relevant when teaching damn the torpedoes and teach the sugar
Key challenge: build intuition about generated hardware
Time-released HaskellFIFO#(Nat) f();mkSizedFIFO#(depth) the_f(f);
FIFO#(Nat) f <- mkSizedFIFO(depth);
Tuple2#(Nat,Bool) nb = foo(x);
let nb = foo(x);
Compiling FPinto hardware
Implementing BSV:two-level compilation
Level 2 Synthesis • Scheduling• Control logic generation• Object code generation
Level 1 Elaboration • Desugaring• Type-checking• Massive partial evaluation and static elaboration
BSV source program
Term Rewriting System(rules and actions)
Object code(Verilog RTL or C)
For historical reasons, the level one evaluator is lazy
Is this still a good idea as the language becomes more imperative?
Laziness is hard workPerformance is a challenge
graph reduction required to avoid duplicating work
Non-strict primitives (if, and, or) require careful handling
symmetric short-circuiting undetermined values must be propagated
correctly
Error messages can be confusing “Compile-time expression did not evaluate”
Consider: let z = x + yIs this:
a static constant? a fixed incrementer? a full adder?
A lazy evaluator does not care! evaluates what it can defers (or suspends) what it can't
User benefit: can move freely between static and dynamic code
Being lazy pays off
(?)
FP cultural heritage contains lots of goodies
TRS — semantic foundation Typing — unexpected convenience …
Hardware designers can enjoy the fruits without realising what they’re using …… till later, perhaps
Conclusions
The End