Industrial Semantics Or How to Stop the Maths Getting in the Way of the Marketing Joe Stoy

Industrial SemanticsOr

How to Stop the Maths Getting in the Way of the Marketing

Joe Stoy Founder and Principal Engineer

Bluespec, Inc.

(with help from many at Bluespec)

APPSEM05 Workshop, 15 September 2005

Basic Message

An industrial tool needs good semantics Robust Simple Conforming to users’ model

But the theory must be “under the covers”

Learning curve Perceived learning curve

A technology based on Term-Rewriting Systems

A superstructure based on functional programming semantics

Semantic issues

Outline

Tool and marketFor designing chips (ASICs, FPGAs, ...)

currently low-level with Verilog or VHDL chip complexity rising (millions of gates)

For chip designers, verification engineers, system architects

ASICs have huge NREs ($500K–$1M) mistakes (respins) cost another NRE tools run into millions of $$$ per team,

form a significant fraction of a company’s budget (e.g., ~10%)

tools tend to run on UNIX (Solaris, Linux)

Productization within industryMajor pilot project: Arbiter for 160 Gb/s router (1.5M gates, 200 MHz, 1.3)

History of this technologyResearch@MIT on high-level synthesis & verification(Prof. Arvind et. al.)

Technology

Technology

VC funding

~1996 2000 2003

Bluespec, Inc.High-level synth. tool

2004

Productavailable

BluespecSystemVerilog

Transaction

Verilog RTL

Netlist

Design Flow

DALDesign Assertion Level

CycleAccurateSimulatio

n

EventBased

Simulation

RTLSynthesis

RTLSynthesis

Bluesim

BluespecSynthesisBluespecSynthesis

Cosim: Typical Use ModelsBluespec integrated into SystemC/Verilog

SystemC/Verilog integrated into Bluespec

Bluespec

SystemC/Verilog

SystemC/Verilog

Bluespec

Where Bluespec is:• …integrated into Verilog System-on-Chip (SoC) design• …back-annotated into SystemC model• …part of mixed SystemC/Bluespec model

Where Bluespec is designed with:• …existing Verilog IP re-used• …a SystemC model awaiting the Verilog

A technologybased on

term-rewriting systems

Term-rewriting systemswhile some rule is applicable

choose an applicable rule apply it to the term (or a subcomponent)

FP’s standard operational semanticsOur rewritings are less free-wheeling (don’t change structure of term)

maybe “state transition system” a better name

Clocked synchronous hardware

The compiler translates BSV source code into Verilog RTL

TransitionLogic

IOS“Next” SCollection

ofState

Elements

Scheduling and control logicModules’

(Current state)Rules

Scheduler

1

n

1

n

Muxing

1

nn

n

Modules’(Next state)

cond

action

“CAN_FIRE” “WILL_FIRE”

Term-rewriting systemswhile some rule is applicable

choose an applicable rule apply it to the term (or a subcomponent)

FP’s standard operational semanticsOur rewritings are less free-wheeling (don’t change structure of term)

Rules are constructed from guarded atomic actions with interfaces

Guarded Atomic Actions

Actions are guardedfifo.enq(x);

if fifo full, cannot happen hides lots of tedious bureaucracy

Actions are atomic

Atomicity

atomic

Atomicity

ατομος

Atomicity

a-tomic

Atomicity

a-tomic not

asymmetric atypical amoral

Atomicity

a-tomic not

asymmetric atypical amoral

cut microtome tomography tome (of a multi-volume book)

Atomicity

Rules are atomic“Not cut”

Whenever they run, they run to completion never interrupted

No other activities are interleaved with them

This greatly simplifies design avoids many race conditions

Guarded Atomic Actions

Actions can be composeda1;a2

resulting action atomic guarded by guards of a1 and a2

Guarded Atomic ActionsConditionals:

if (b) a1; Guarded by “b implies (a1’s guards)”

Another conditional:when (b) a1;

guarded by b (and a1’s guards)

Yet anotherperhaps-if (b) a1;

unguarded a1’s guards conjoined to b

Nice algebra (separate question — which to have in BSV)

… with interfaces

A BSV design is structured by modulesModules communicate only through interfaces

Modules and interfaces

interface

state

rule

module

An exampleinterface NumFn; method Action start (int n); method int result ();endinterface

module mkFact2 (NumFn); Reg#(int) x <- mkReg(?); Reg#(int) j <- mkReg(0);

rule step (j > 0); x <= x * j; j <= j - 1; endrule

method start (n) if (j == 0); x <= 1; j <= n; endmethod

method result () if (j == 0); return x; endmethodendmodule: mkFact2

module mkTest ();

int n = 15; // constant

Reg#(int) state <- mkReg(0); NumFn f <- mkFact2();

rule go (state == 0); f.start (n); state <= 1; endrule

rule finish (state == 1); $display (“Result is %d”, f.result()); state <= 2; endruleendmodule: mkTest

module mkTest () ; … Fact f <- mkFact2(); … rule finish (state==1); $display(“…%d”, f.result()) state <= 2; endruleendmodule

Method invocationsfit into rules

Rule condition is: state==1 && j==0 Explicit condition and all implicit conditions

of all method calls in the rule

Thus, a part of the rule’s condition (j == 0), and a part of a rule’s computation (reading x)

are in a different module, via a method invocation

module mkFact2 (Fact); Reg#(int) x <- mkReg(?) Reg#(int) j <- mkReg(n); … method result () if (j == 0); return x; endmethodendmodule

module mkTest () ; … Fact f <- mkFact2();

rule go (state==0); f.start(15); state <= 1; endrule …endmodule

Modularizing rules

Rule condition: state==0 && j==0

Rule actions: state<=1, x<=1 and j<=15

Thus, a part of the rule’s action is in a different module

module mkFact2 (Fact); Reg#(int) x <- mkReg(?) Reg#(int) j <- mkReg(0); … method start (int n) if (j == 0); x <= 1; j <= n; endmethodendmodule

Order of Evaluation

Not Lazy (e.g. Haskell’s)Schedule as many rules as possible in each clock cycle

(patented technology – James Hoe et al)

Clocked synchronous hardware

The compiler translates BSV source code into Verilog RTL

TransitionLogic

IOS“Next” SCollection

ofState

Elements

Rule semanticsmapped to hardware semantics

Rules

HW

Ri Rj Rk

clocks

rule steps

Ri

RjRk

The effect of each cycle is as if a sequence of ruleswas executed one-at-a-time

Consequence: The HW state can never result from an interleaving of actions from different rules

Rule atomicity (therefore, correctness) is preserved

Scheduling and control logicModules’

(Current state)Rules

Scheduler

1

n

1

n

Muxing

1

nn

n

Modules’(Next state)

cond

action

“CAN_FIRE” “WILL_FIRE”

… leads to pragmaticconstraints on rule combination

Initial set: A rule fires within a clock cycle A rule fires at most once in a clock cycle A rule’s effect is only visible in the next clock We only combine rules in a certain fixed order within a cycle All rules which read a register must precede any which write

it We only consider rules enabled at the start of the clock cycle Each rule is independent of previous rules executing in the

same cycle The logic path delay depends on individual rule paths, and

not on combinations of rules …

(Some since relaxed)

Benefits ofatomic-action

semantics

Consider this exampleProcess 0 increments register xProcess 1 transfers a unit from register x to register yProcess 2 decrements register y

This is an abstraction of some real applications: Bank account: 0 = deposit to checking, 1 = transfer from

checking to savings, 2 = withdraw from savings Packet processor: 0 = packet arrives, 1 = packet is

processed, 2 = packet departs …

0 1 2

x y

+1 -1 +1 -1

Concurrency in the example

Process j (= 0,1,2) only updates under condition condjOnly one process at a time can update a register. Note:

Process 0 and 2 can run concurrently if process 1 is not running

Both of process 1’s updates must happen “indivisibly” (else inconsistent state)

Suppose we want to prioritize process 2 over process 1 over process 0

0 1 2x y

+1 -1 +1 -1

Process priority: 2 > 1 > 0

cond0 cond1 cond2

Is either correct?

Which of these solutions are correct, if any?What’s required to verify that they’re correct?Now, what if I Δ’d the priorities: 1 > 2 > 0?And, what if the processes are in different modules?

always @(posedge CLK) // process 0 if (!cond1 && cond0) x <= x + 1;

always @(posedge CLK) // process 1 if (!cond2 && cond1) begin y <= y + 1; x <= x – 1; end

always @(posedge CLK) // process 2 if (cond2) y <= y – 1;

always @(posedge CLK) begin if (!cond2 && cond1) x <= x – 1; else if (cond0) x <= x + 1;

if (cond2) y <= y – 1; else if (cond1) y <= y + 1;end

0 1 2x y

+1 -1 +1 -1 Process priority: 2 > 1 > 0

cond0 cond1 cond2

if ((!cond1 || cond2) && cond0)

Where’sthe error?

With Bluespec, design is direct

(* descending_urgency = “proc2, proc1, proc0” *)

rule proc0 (cond0); x <= x + 1;endrule

rule proc1 (cond1); y <= y + 1; x <= x – 1;endrule

rule proc2 (cond2); y <= y – 1;endrule

Functional correctness follows directly from rule semantics

Related actions are grouped naturally with their conditions—easy to change

Interactions between rules are managed by the compiler (scheduling, muxing, control)

Same hardware as the RTL

0 1 2x y

+1 -1 +1 -1

Process priority: 2 > 1 > 0cond0 cond1 cond2

Reorder BufferVerification-centric design

Decode

Example from CPU design

Nirav Dave, MEMOCODE, 2004

Speculative, out-of-order

Many, many concurrent activities

Branch

RegisterFile

ALUUnitRe-

OrderBuffer(ROB) MEM

Unit

DataMemory

InstructionMemory

Fetch Decode

FIFO

FIFO FIFO FIFO FIFO

FIFO

FIFOFIFO

FIFOFIFORe-

OrderBuffer(ROB)

Branch

RegisterFile

ALUUnit

MEMUnit

DataMemory

InstructionMemory

Fetch

ROB actionsEmpty

WaitingDispatched

KilledDone

EWDiKDo

Head

Tail

V - -Instr - V -

V - -Instr - V -

V - -Instr - V -

V - -Instr - V -

V - -Instr - V -

V - -Instr - V -

V - -Instr - V -

V - -Instr - V -

V - -Instr - V -

V - -Instr - V -

V 0 -Instr B V 0W

V 0 -Instr C V 0W

-Instr D V 0W

V 0 -Instr A V 0W

V - -Instr - V -

V - -Instr - V -E

E

E

E

E

E

E

E

E

E

E

E

V 0

Re-Order Buffer

Insert aninstr into

ROB

DecodeUnit

RegisterFile

Get operandsfor instr

Writebackresults

Get a readyALU instr

Get a readyMEM instr

Put ALU instr results in ROB

Put MEM instr results in ROB

ALUUnit(s)

MEMUnit(s)Resolve

branches

Operand 1 ResultInstruction Operand 2State

But, what about allthe potential race conditions?

Reading from the register file at the same time a separate instruction is writing back to the same location

Which value to read?

An instruction is being inserted into the ROB simultaneously with a dependent upstream instruction’s result coming back from an ALU

Put a tag or the value in the operand slot?

An instruction is being inserted into the ROB simultaneously with a branch mis-prediction

must kill the mis-predicted instructions and restore a “consistent state” across many modules

Rule Atomicity Lets you code each operation in isolation

Eliminates the nightmare of race conditions (“inconsistent state”) under such complex concurrency conditions

Insert Instr in ROB• Put instruction in firstavailable slot• Increment tail pointer• Get source operands

- RF <or> prev instr

Dispatch Instr• Mark instructiondispatched• Forward to appropriateunit

Write Back Results to ROB• Write back results toinstr result• Write back to all waitingtags• Set to done

Commit Instr• Write results to registerfile (or allow memorywrite for store)• Set to Empty• Increment head pointer

Branch Resolution• …• …• …

All behaviors are explicable as a sequence of atomic actions on the state

Performance SemanticsAnother processor example

Rule-based Specifications

RF

iMem dMem

WbIF

bI

Exe

bE

Mem

bW

Dec

bD

rule Execute Add: when (bD.first == (Ri = va + vb)) ==> begin result = va + vb; // compute addition

bE.enq (Ri = result); // enqueue result into bE

bD.deq; // dequeue instruction from bd end

Each pipeline stage is described as a set of atomic rules:

Any legal behavior can be understood in terms of applying one rule at a time

bypasses

R1 =

2 +

3

R1 =

5

Performance ConcernsThe designer wants to make sure that one instruction executes every cycleFIFOs must support both enq and deq in each cycle

RF

iMem dMem

WbIF

bI

Exe

bE

Mem

bW

Dec

bD

I0I1I2I3

A cycle in

slow motion

I4I5

What are thesemanticsof FIFOs?

data_in

push_req_n

pop_req_n

clk

rstn

data_out

full

empty

Example from a commerciallyavailable FIFO IP component

These constraints are taken from several paragraphs of documentation, spread over many pages, interspersed with other text

A FIFO interface in BSV

interface FIFOQueue #(type aType); method Action push (aType val); method aType first(); method ActionValue#(aType) pop(); method Action clear();endinterface

Methods as ports

rdy

enabn

rdy

pus

hcl

ea

r

not full

n

rdy

enab

pop

not empty

always true

FIFO

Qu

eu

em

od

ule

enab

n

rdy first

not empty

push:• n-bit argument• has side effect (action)

first:• n-bit result• has no side effect

pop:• n-bit result• has side effect (action)

clear:• no argument• has side effect (action)

FIFO semantics: types

interface FIFOQueue #(type aType); method Action push (aType val); method aType first(); method ActionValue#(aType) pop(); method Action clear();endinterface

FIFO semantics: laws

Algebra of enq, deq, etc

Not at present part of BSV Though SVA assertions are

Needed for formal verification work

So far, all in atomic-action world

FIFO semantics: performance

“Must support enq and deq in same cycle”

Naïve implementation

iptr

optr

(ignoring wraparound)

Bool not_empty = (iptr != optr);Bool not_full = (iptr+1 != optr);

method Action enq(x) if (not_full); buf[iptr] <= x; iptr <= iptr + 1;endmethod

method Actionvalue deq() if (not_empty); optr <= optr + 1; return (buf[optr]);endmethod

Semantics: concurrencyEmpty Neither Full

Naïve can’t deq enq and deq conflict can’t enq

Normal can’t deq enq and deq conflict-free can’t enq

Turbo can’t deq enq + deqdeq before

enq

Bypassenq before

deqenq + deq can’t enq

Super

(not yet)

enq before deq

enq + deq deq before

enq

A problem with “super”

rdy

enabn

pus

h

not full

n

rdy

enabp

op

not empty

So what are the semantics of FIFOs?

Performance requirements are sometimes an essential part of (interface) specification

Pipeline efficiency

Sometimes notAllow extra optional attributes on interface definitions?

Take care not accidentally to re-invent inheritance (à la OOP) badly

Semantics

Maybe best to separate functional and performance semantics

So functional semantics can be handled entirely in TRS (atomic action) framework

Care: must not rely on performance semantics to achieve functionally necessary concurrency.

If actions must be simultaneous, compose them in same action; don’t put them in separate rules

The processor exampleComposing Atomic Rules with Performance Guarantees

MIT PhD: Rosenband 2005

Performance ConcernsThe designer wants to make sure that one CPU instruction executes every cycleSuppose the designer can specify which rules (if enabled) should execute within a clock cycle, and in which order.

No change in rules

RF

iMem dMem

WbIF

bI

Exe

bE

Mem

bW

Dec

bD

I0I1I2I3

Wb x Mem x Exe x Dec x IF

Correctness of hardware guaranteed by the Compiler Order is important to minimize FIFO size and achieve optimal bypassing

A cycle in

slow motion

I4I5

Experiments in schedulingWhat happens if the user specifies:

No change in rules

RF

iMem dMem

WbIF

bI

Exe

bE

Mem

bW

Dec

bD

Executing 2 instructions per cycle requires more resources but is functionally equivalent to the original design

Wb x Wb x Mem x Mem x Exe x Exe x Dec x Dec x IF x IF

I1 I0I3 I2I5 I4I7 I6I9 I8 I3 I2I5 I4I7 I6I9 I8I11 I10

A cycle in

slow motion

a superscalar processor!

4-Stage Processor Results

benchmark: a program containing additions / jumps / loadc’s

DesignBenchmark

(cycles)Area 10ns

(µm2)

Timing10ns(ns)

Area2ns

(µm2)

Timing2ns(ns)

1 element fifo:

No Spec 18525 24762 5.85 26632 2.00

Spec 1 11115 25094 6.83 33360 2.00

Spec 2 11115 25264 6.78 34099 2.04

2 element fifo:

No Spec. 18525 32240 7.38 39033 2.00

Spec 1 11115 32535 8.38 47084 2.63

Spec 2 7410 45296 9.99 62649 4.72

Dan Rosenband & Arvind 2004

A superstructurebased on

disguised Haskell

Implementing BSV:two-level compilation

Level 2 Synthesis • Scheduling• Control logic generation• Object code generation

Level 1 Elaboration • Desugaring• Type-checking• Massive partial evaluation and static elaboration

BSV source program

Term Rewriting System(rules and actions)

Object code(Verilog RTL or C)

Bluespec Classic:a Haskell-based HDL

package Shift(shift) where

import List

sstep :: Bit m -> Bit n -> Nat -> Bit n

sstep s x i when s[i:i] == 1 = x << (1 << i)

sstep s x i = x

shift :: Bit m -> Bit n -> Bit n

shift s x = foldl (sstep s) x

(map fromInteger

(upto 0 ((valueOf m) - 1)))

Lennart Augustsson (2000)

Selling BS ClassicUnfamiliar syntax a significant barrier

even in marketing slides even ()s in function calls are different!

Many fronts in adoption war new hardware design methodology new unfamiliar syntax new type system new purely functional thinking new FP concepts (map, fold, monads)

Adapt an existing HDLMap matching concepts

expressions, bit vectors, functions, modules

Extend where straightforward higher-order functions, first-class objects,

polymorphism

Standardize where possible tagged unions, pattern matching (SV 3.1a)

Desugar where required imperative assignments, loops

Bluespec Classic:a Haskell-based HDLpackage Shift(shift) where

import List

sstep :: Bit m -> Bit n -> Nat -> Bit n

sstep s x i when s[i:i] == 1 = x << (1 << i)

sstep s x i = x

shift :: Bit m -> Bit n -> Bit n

shift s x = foldl (sstep s) x

(map fromInteger

(upto 0 ((valueOf m) - 1)))

Bluespec SystemVerilog:FP with Verilog Syntax

function Bit#(n) sstep(Bit#(m) s, Bit#(n) x, Nat i); if(s[i] == 1) return(x << (1 << i)); else return x;endfunction

function Bit#(n) shift(Bit#(m) s, Bit#(n) x); return(foldl((sstep(s)), x, (map(fromInteger, upto(0, valueof(m) - 1)))));endfunction

Bluespec SystemVerilog:Imperative circuit constructionfunction Bit#(n) shift(Bit#(m) s, Bit#(n) x);

Integer max = valueof(m);

Bit#(n) xA [max+1];

xA[0] = x;

for (Integer j = 0; j < max; j = j + 1)

if (s[fromInteger(j)] == 1)

xA[j+1] = xA[j] << (1 << fromInteger(j));

else

xA[j+1] = xA[j];

return xA[max];

endfunction

Teaching BSVLimited training time (2-4 days typical)Audience: hardware designers

little or no FP background wires and registers, not abstractions conservative (remember cost of mistakes?)

Format: lectures interspersed with labsNeed to communicate basics

or else evaluation project might be hard

Want to show full range of features or else benefits not perceived and no sale

Teaching conclusionsFunctional features are “advanced”

Get in the way of communicating basics

Strict typing seen as restrictive bit vs. Bool bit-width constraints structures vs. bit representations

Standards less relevant when teaching damn the torpedoes and teach the sugar

Key challenge: build intuition about generated hardware

Time-released HaskellFIFO#(Nat) f();mkSizedFIFO#(depth) the_f(f);

FIFO#(Nat) f <- mkSizedFIFO(depth);

Tuple2#(Nat,Bool) nb = foo(x);

let nb = foo(x);

Compiling FPinto hardware

Implementing BSV:two-level compilation

Level 2 Synthesis • Scheduling• Control logic generation• Object code generation

Level 1 Elaboration • Desugaring• Type-checking• Massive partial evaluation and static elaboration

BSV source program

Term Rewriting System(rules and actions)

Object code(Verilog RTL or C)

For historical reasons, the level one evaluator is lazy

Is this still a good idea as the language becomes more imperative?

Laziness is hard workPerformance is a challenge

graph reduction required to avoid duplicating work

Non-strict primitives (if, and, or) require careful handling

symmetric short-circuiting undetermined values must be propagated

correctly

Error messages can be confusing “Compile-time expression did not evaluate”

Consider: let z = x + yIs this:

a static constant? a fixed incrementer? a full adder?

A lazy evaluator does not care! evaluates what it can defers (or suspends) what it can't

User benefit: can move freely between static and dynamic code

Being lazy pays off

(?)

FP cultural heritage contains lots of goodies

TRS — semantic foundation Typing — unexpected convenience …

Hardware designers can enjoy the fruits without realising what they’re using …… till later, perhaps

Conclusions

The End

Industrial Semantics Or How to Stop the Maths Getting in the Way of the Marketing Joe Stoy

Documents

Transcript of Industrial Semantics Or How to Stop the Maths Getting in the Way of the Marketing Joe Stoy