Ece465 High Level Design Strategies

8/3/2019 Ece465 High Level Design Strategies


ECE 465

High Level Design Strategies

Lecture Notes # 9

Shantanu Dutt

Electrical & Computer Engineering

University of Illinois at Chicago


Outline

Circuit Design Problem

Solution Approaches:

    Truth Table (TT) vs. Computational/Algorithmic
    (Yes, hardware, just like software, can implement any algorithm!)

    Flat vs. Divide-&-Conquer

Divide-&-Conquer:

    Associative operations/functions

    General operations/functions

Other Design Strategies for fast circuits:

    Speculative computation

    Best of both worlds (best average and best worst-case)

    Pipelining

Summary


Circuit Design Problem

Design an 8-bit comparator that compares two 8-bit #s available in two registers A[7..0] and B[7..0], and that o/ps F = 1 if A > B and F = 0 if A <= B.


Circuit Design Problem (contd)

Approach 2: Think computationally/algorithmically about what the ckt is supposed to compute.

Approach 2(a): Flat algorithmic approach:

Note: A TT can be expressed as a sequence of if-then-elses:

If A = 00000000 and B = 00000000 then F = 0

else if A = 00000000 and B = 00000001 then F = 0

.

else if A = 00000001 and B = 00000000 then F = 1

.

Essentially a rehashing of the TT, with the same problems as the TT approach.
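A minimal Python sketch of the flat approach, showing how many if-then-else branches (TT rows) it implies. The function name `flat_truth_table` is hypothetical; the comparator spec (F = 1 iff A > B) is taken from the problem statement.

```python
from itertools import product


def flat_truth_table(n_bits=8):
    """Yield one (A, B, F) row per input combination, with F = 1 iff A > B."""
    for a, b in product(range(2 ** n_bits), repeat=2):
        yield a, b, int(a > b)


# Number of rows (if-then-else branches) for the 8-bit comparator.
rows_8bit = sum(1 for _ in flat_truth_table(8))
print(rows_8bit)  # 2**16 = 65536
```

Each extra bit in A and B multiplies the row count by 4, which is why the flat form does not scale.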


Circuit Design Problem: Strategy 1: Divide-&-Conquer

Approach 2(b): Structured algorithmic approach:

Be more innovative: think of the structure/properties of the computational problem. E.g., think about whether the problem can be solved in a hierarchical or divide-&-conquer (D&C) manner.

[Figure: D&C tree. Root problem A is broken into subproblems A1 and A2, which are in turn broken into A1,1, A1,2 and A2,1, A2,2. The solns to A1 and A2 are stitched up to form the complete soln to A. Do this recursively until the subprob size is s.t. a TT-based design is doable.]

D&C approach: See if the problem can be broken up into 2 or more smaller subproblems whose solutions can be stitched up to give a soln. to the parent prob.

Do this recursively for each large subprob until the subprobs are small enough for TT-based solutions.

If the subprobs are of a similar kind (but of smaller size) to the root prob, then the breakup and stitching will also be similar.


Shift Gears: Design of a Parity Detection Circuit: A Series of XORs

[Figure (a) A linearly-connected circuit: inputs x(0), x(1), x(2), x(3), ..., x(15) feed a chain of XOR gates that produces f.]

[Figure (b) 16-bit parity tree: inputs x(15), x(14), ..., x(1), x(0) feed the first-level gates with o/ps w(3,0) .. w(3,7), followed by w(2,0) .. w(2,3), then w(1,0), w(1,1), and finally w(0,0) = f. Delay = (# of levels in AND-OR tree) * td = log2(n) * td.]

No concurrency in design (a): the actual problem has available concurrency, though, and it is not exploited well in the above linear design.

Complete sequentialization leads to a delay that is linear in the # of bits n (delay = n*td), td = delay of 1 gate.

All the available concurrency is exploited in design (b), a parity tree.

An example of simple designer ingenuity: a bad design would have resulted in a linear delay that the VHDL code & the synthesis tool would have been at the mercy of.

Question: When can we have a tree-structured circuit for an operation on multiple operands?

Answer: (1) When the operation makes sense for any # of operands. (2) It should be possible to break it down into operations w/ fewer operands. (3) When the operation is associative. An oper. x is said to be associative if:

a x b x c = (a x b) x c = a x (b x c).

Thus if we have 4 operands a x b x c x d, we can either perform this as a x (b x (c x d)) [getting a linear delay of 3 units] or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that x is associative].

We can extend this idea to n operands (& n-1 operations) to perform as many of the pairwise operations as possible in parallel (& do this recursively for every level of remaining operations), similar to design (b) for the parity detector [xor is an associative operation!], and thus get a (log2 n) delay.

f = (((x(15) xor x(14)) xor (x(13) xor x(12))) xor ((x(11) xor x(10)) xor (x(9) xor x(8)))) xor (((x(7) xor x(6)) xor (x(5) xor x(4))) xor ((x(3) xor x(2)) xor (x(1) xor x(0))))
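The two parity designs can be compared with a small Python model that returns both the parity and the number of gate levels on the longest path. This is only a behavioural sketch; the function names are hypothetical, and the chain is modelled with n-1 two-input gates, which the slide rounds to n*td.

```python
from functools import reduce
from operator import xor


def parity_chain(bits):
    """Design (a): linear XOR chain. Returns (parity, gate levels)."""
    return reduce(xor, bits), len(bits) - 1


def parity_tree(bits):
    """Design (b): balanced XOR tree. Returns (parity, gate levels)."""
    if len(bits) == 1:
        return bits[0], 0
    mid = len(bits) // 2
    left, left_levels = parity_tree(bits[:mid])
    right, right_levels = parity_tree(bits[mid:])
    # One more XOR gate stitches the two half-results together.
    return left ^ right, 1 + max(left_levels, right_levels)


x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
print(parity_chain(x))  # same parity, 15 gate levels
print(parity_tree(x))   # same parity, 4 gate levels
```

Both functions compute the same value; only the grouping of the XORs, and hence the depth, differs.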


D&C for Associative Operations

Let f(x(n-1), .., x(0)) be an associative function.

What is the D&C principle involved in the design of an n-bit xor/parity function? Can it also lead automatically to a tree-based ckt?

[Figure: f(x(n-1), .., x(0)) is broken into f(x(n-1), .., x(n/2)) and f(x(n/2-1), .., x(0)), whose o/ps a and b feed the stitch-up function f(a,b). The stitch-up function is the same as the original function for 2 inputs.]

Using the D&C approach for an associative operation results in the stitch-up function being the same as the original function (not the case for non-assoc. operations), but w/ a constant # of operands (2, if the orig problem is broken into 2 subproblems).

If the two subproblems of the D&C approach are balanced (of the same size or as close to it as possible), then unfolding the D&C results in a balanced operation tree of the type seen earlier for the xor/parity function.
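As a rough illustration of this principle, the sketch below applies balanced D&C to any associative 2-operand function, using that same function as the stitch-up. The name `dc_reduce` is hypothetical.

```python
from operator import and_, xor


def dc_reduce(op, operands):
    """Balanced D&C evaluation of an associative 2-operand function `op`."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    ms_result = dc_reduce(op, operands[:mid])   # f(x(n-1), .., x(n/2))
    ls_result = dc_reduce(op, operands[mid:])   # f(x(n/2-1), .., x(0))
    return op(ms_result, ls_result)             # stitch-up = op on 2 operands


print(dc_reduce(xor, [1, 1, 0, 1]))   # parity of 4 bits
print(dc_reduce(and_, [1, 1, 1, 1]))  # 4-input AND
print(dc_reduce(max, [3, 9, 2, 7]))   # max is associative too
```

Unfolding the recursion for a power-of-2 operand count gives the balanced operation tree described above.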


D&C Approach for Non-Associative Opers: n-bit Comparator

Useful property: At any level, the comp. of the MS (most significant) half determines the o/p if its result is > or <; else the comp. of the LS half determines the o/p.

We can thus break up the problem at any level into MS and LS comparisons &, based on their results, determine which o/p to choose for the higher-level (parent) result.

[Figure: D&C tree for the comparator. Root A: Comp. A[7..0], B[7..0], with subproblems A1: Comp A[7..4], B[7..4] and A2: Comp A[3..0], B[3..0]. A1 breaks into A1,1: Comp A[7..6], B[7..6] and A1,2: Comp A[5,4], B[5,4]. A1,1 breaks into A1,1,1: Comp A[7], B[7] and A1,1,2: Comp A[6], B[6]. Stitch-up at each level: if the MS subproblem's reslt is > or <, take it, else take the LS subproblem's reslt (e.g., if A1 reslt is > or < take A1 reslt, else take A2 reslt). The leaves are small enough to be designed using a TT (2-bit 2-o/p comparator).]

Leaf TT:

A[i]  B[i]  f1(i)  f2(i)
 0     0     0      1
 0     1     0      0
 1     0     1      0
 1     1     0      1

If A[i] = B[i] then { f1(i) = 0; f2(i) = 1 }
    /* f2(i) o/p is an i/p to the stitch logic; f2(i) = 1 means the f1( ), f2( ) o/ps of the LS half of this subtree should be selected by the stitch logic as its o/ps */
else if A[i] < B[i] then { f1(i) = 0; /* indicates < */
    f2(i) = 0 } /* indicates f1(i), f2(i) o/ps should be selected by stitch logic as its o/ps */
else if A[i] > B[i] then { f1(i) = 1; /* indicates > */
    f2(i) = 0 }

The TT may be derived directly or by first thinking of and expressing its computation in a high-level programming language and then converting it to a TT.

Is this associative? Not sure. For a non-associative func, determine its properties that allow determining a break-up & a correct stitch-up function.
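The break-up and stitch-up rule for the comparator can be sketched behaviourally in Python. The leaf follows the (f1, f2) TT on this slide and the stitch follows the "take MS reslt if > or <, else take LS reslt" rule; the function names and the MS-first bit ordering are assumptions made for this illustration.

```python
def leaf(a_bit, b_bit):
    """1-bit comparator TT: returns (f1, f2); f2 = 1 means 'equal'."""
    if a_bit == b_bit:
        return 0, 1
    return (1, 0) if a_bit > b_bit else (0, 0)


def stitch(ms_result, ls_result):
    """Take the MS result if it is > or < (f2 = 0), else the LS result."""
    return ms_result if ms_result[1] == 0 else ls_result


def compare(a_bits, b_bits):
    """D&C comparator on equal-length, MS-first bit lists."""
    if len(a_bits) == 1:
        return leaf(a_bits[0], b_bits[0])
    mid = len(a_bits) // 2
    ms_result = compare(a_bits[:mid], b_bits[:mid])
    ls_result = compare(a_bits[mid:], b_bits[mid:])
    return stitch(ms_result, ls_result)


def to_bits(value, n_bits=8):
    """MS-first list of the n_bits low-order bits of value."""
    return [(value >> i) & 1 for i in range(n_bits - 1, -1, -1)]


def greater_than(a, b, n_bits=8):
    """F = 1 if A > B, else 0 (F is the f1 o/p of the root)."""
    f1, _ = compare(to_bits(a, n_bits), to_bits(b, n_bits))
    return f1


print(greater_than(200, 13), greater_than(13, 200), greater_than(7, 7))
```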


Comparator Circuit Design Using D&C (contd.)

[Figure: the D&C tree for Comp. A[7..0], B[7..0] and the 1-bit leaf TT, repeated from the previous slide.]

Once the D&C tree is formulated, it is easy to get the low-level & stitch-up designs. The stitch-up design is shown here.

Stitch-up logic details:

If f2(i) = 0 then { my_op1 = f1(i); my_op2 = f2(i) } /* select MS comp. o/ps */
else { my_op1 = f1(i-1); my_op2 = f2(i-1) } /* select LS comp. o/ps */

(Compact TT)

f1(i)  f2(i)  f1(i-1)  f2(i-1)  |  my_op1   my_op2
 X      0      X        X       |  f1(i)    f2(i)
 X      1      X        X       |  f1(i-1)  f2(i-1)

OR

(Direct design) A 2-bit 2:1 Mux with data i/ps I0 = f(i) and I1 = f(i-1) (2 bits each), select i/p f2(i), and 2-bit o/p my_op.


Comparator Circuit Design Using D&C: Final Design

[Figure: final 8-bit comparator. Eight 1-bit comparators take A[7], B[7] through A[0], B[0] and produce the 2-bit o/ps f(7) .. f(0). First level of 2-bit 2:1 Muxes: my(3) is selected by f2(7) = f(7)(2), my(2) by f(5)(2), my(1) by f(3)(2), and my(0) by f(1)(2). Second level of 2-bit 2:1 Muxes: my(5) is selected by my(3)(2) and my(4) by my(1)(2). A final 1-bit 2:1 Mux, selected by my(5)(2), chooses between my(5)(1) and my(4)(1) to give F = my1(6). There are log n levels of Muxes.]

Delay(8-bit comp.) = 3*(delay of 2:1 Mux) + delay of 2-bit comp.

Note parallelism at work: multiple logic blocks are processing simult.

Delay(n-bit comp.) = (log n)*(delay of 2:1 Mux) + delay of 2-bit comp.

H/W_cost(8-bit comp.) = 7*(H/W_cost(2:1 Mux)) + 8*(H/W_cost(2-bit comp.))

H/W_cost(n-bit comp.) = (n-1)*(H/W_cost(2:1 Mux)) + n*(H/W_cost(2-bit comp.))
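The delay and cost expressions can be checked by counting blocks level by level. The sketch below assumes n is a power of 2; the function names and the delay values in the usage line are hypothetical placeholders, not figures from the slides.

```python
def comparator_resources(n_bits):
    """Block counts for the D&C comparator; n_bits must be a power of 2."""
    if n_bits < 2 or n_bits & (n_bits - 1):
        raise ValueError("n_bits must be a power of 2 and at least 2")
    mux_levels = n_bits.bit_length() - 1  # log2(n)
    # Level 1 has n/2 Muxes, level 2 has n/4, ..., the last level has 1.
    muxes = sum(n_bits >> level for level in range(1, mux_levels + 1))
    return {"leaf_comparators": n_bits, "muxes": muxes, "mux_levels": mux_levels}


def comparator_delay(n_bits, mux_delay, leaf_delay):
    """Delay = (log n) * (delay of 2:1 Mux) + delay of the leaf comparator."""
    return comparator_resources(n_bits)["mux_levels"] * mux_delay + leaf_delay


print(comparator_resources(8))
print(comparator_delay(8, mux_delay=1, leaf_delay=1))  # unit delays, illustrative
```

For n = 8 this gives 8 leaf comparators, 7 Muxes, and 3 Mux levels, matching the expressions above.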


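D&C: Top-Down vs Bottom-Up: Mux Design

[Figure (a) Top-Down: a 2^n:1 MUX built from two 2^(n-1):1 MUXes, each with select i/ps S(n-2) .. S(0), feeding a final 2:1 MUX with select i/p S(n-1). The first 2^(n-1):1 MUX takes i/ps I(0) .. I(2^(n-1)-1): all bits except the msb should have different combinations; the msb should be at a constant value (here 0). The second takes i/ps I(2^(n-1)) .. I(2^n-1): all bits except the msb should have different combinations; the msb should be at a constant value (here 1). The MSB value should differ among these 2 groups.]

[Figure (b) Bottom-Up: 2^(n-1) 2:1 MUXes, each with select i/p S(0), feed a 2^(n-1):1 MUX with select i/ps S(n-1) .. S(1).]

Generally better to try top-down first.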


An 8:1 MUX example (bottom-up)

[Figure: an 8:1 MUX with data i/ps I0 .. I7, select i/ps S2 S1 S0, and o/p Z, built bottom-up. Four 2:1 MUXes, each with select i/p S0, take the i/p pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7). Their o/ps feed a 4:1 MUX with select i/ps S2 S1, whose o/p is Z. I0, I2, I4, I6 are selected when S0 = 0; I1, I3, I5, I7 are selected when S0 = 1.]

The two i/ps of each 2:1 MUX should have different lsb (S0) values, since their selection is based on S0 (all other remaining, i.e., unselected, bit values should be the same). Similarly for the other i/p pairs at the 2:1 Muxes at this level.


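Opening up the 8:1 MUX's hierarchical design and a top-down view

[Figure: the same 8:1 MUX (data i/ps I0 .. I7, select i/ps S2 S1 S0, o/p Z) opened up into 2:1 MUXes. Level 1: four 2:1 MUXes with select i/p S0 on the pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7); with S0 = 0 they pass I0, I2, I4, I6. Level 2: two 2:1 MUXes with select i/p S1; with S0 = 0, S1 = 1 they pass I2 and I6. These two i/ps should differ in S2. Level 3: one 2:1 MUX with select i/p S2 produces Z; I6 is selected when S0 = 0, S1 = 1, S2 = 1.]

Top-down view: the level-1 and level-2 MUXes group into two 4:1 Muxes. For the i/ps of the first 4:1 Mux, all bits except the msb should have different combinations; the msb should be at a constant value (here 0). For the i/ps of the second 4:1 Mux, all bits except the msb should have different combinations; the msb should be at a constant value (here 1). The MSB value should differ among these 2 groups.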
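Both views of the Mux can be written as short recursive functions. This is a behavioural sketch only; the function names and the convention that `sel` is the MS-first list [S(n-1), .., S(0)] are assumptions for the illustration.

```python
def mux2(i0, i1, s):
    """2:1 MUX: passes i0 when s = 0 and i1 when s = 1."""
    return i1 if s else i0


def mux_top_down(inputs, sel):
    """Split on the msb: two half-size MUXes feed a 2:1 MUX selected by sel[0]."""
    if len(inputs) == 1:
        return inputs[0]
    half = len(inputs) // 2
    msb0_group = mux_top_down(inputs[:half], sel[1:])  # i/ps with msb = 0
    msb1_group = mux_top_down(inputs[half:], sel[1:])  # i/ps with msb = 1
    return mux2(msb0_group, msb1_group, sel[0])


def mux_bottom_up(inputs, sel):
    """Pair neighbours on the lsb S0, then feed a half-size MUX on the rest."""
    if len(inputs) == 1:
        return inputs[0]
    first_level = [mux2(inputs[k], inputs[k + 1], sel[-1])
                   for k in range(0, len(inputs), 2)]
    return mux_bottom_up(first_level, sel[:-1])


data = ["I0", "I1", "I2", "I3", "I4", "I5", "I6", "I7"]
print(mux_top_down(data, [1, 1, 0]), mux_bottom_up(data, [1, 1, 0]))  # S2 S1 S0 = 110
```

With S2 S1 S0 = 110 both functions pass I6, the case traced in the figure.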


Adder Design using D&C

Example: Ripple-Carry Adder (RCA)

    Stitching up: Carry from the LS n/2 bits is input to the carry-in of the MS n/2 bits at each level of the D&C tree.

    Leaf subproblem: Full Adder (FA)

Example: Carry-Lookahead Adder (CLA)

    Division: 4 subproblems per level

    Stitching up: A more complex stitching-up process (generation of super P, Gs to connect up the subproblems)

    Leaf subproblem: 4-bit basic CLA with small p, g bits.

More intricate techniques (like P, G generation in CLA) for complex stitching up for fast designs may need to be devised that are not directly suggested by D&C. But D&C is a good starting point.

[Figure (a) D&C for Ripple-Carry Adder: Add n-bit #s X, Y breaks into Add MS n/2 bits of X, Y and Add LS n/2 bits of X, Y, down to FA leaves.]

[Figure (b) D&C for Carry-Lookahead Adder: Add n-bit #s X, Y breaks into Add ms n/4 bits, Add 3rd n/4 bits, Add 2nd n/4 bits, and Add ls n/4 bits, each down to 4-bit CLA leaves.]
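The ripple-carry D&C can be sketched behaviourally as follows: the leaf is a full adder and the stitch-up is just passing the carry out of the LS half into the MS half. The function names and the LS-first bit ordering are assumptions for this illustration; the CLA's P, G stitch-up is not modelled here.

```python
def full_adder(x, y, c_in):
    """Leaf subproblem: returns (sum bit, carry out)."""
    s = x ^ y ^ c_in
    c_out = (x & y) | (c_in & (x ^ y))
    return s, c_out


def add_dc(x_bits, y_bits, c_in=0):
    """D&C ripple-carry add on LS-first bit lists: returns (sum bits, carry out)."""
    if len(x_bits) == 1:
        s, c_out = full_adder(x_bits[0], y_bits[0], c_in)
        return [s], c_out
    mid = len(x_bits) // 2
    ls_sum, ls_carry = add_dc(x_bits[:mid], y_bits[:mid], c_in)
    # Stitch-up: the carry from the LS half is the carry-in of the MS half.
    ms_sum, c_out = add_dc(x_bits[mid:], y_bits[mid:], ls_carry)
    return ls_sum + ms_sum, c_out


def to_bits_ls(value, n_bits):
    """LS-first list of the n_bits low-order bits of value."""
    return [(value >> i) & 1 for i in range(n_bits)]


def from_bits_ls(bits):
    """Inverse of to_bits_ls."""
    return sum(bit << i for i, bit in enumerate(bits))


sum_bits, carry = add_dc(to_bits_ls(11, 4), to_bits_ls(6, 4))
print(from_bits_ls(sum_bits), carry)  # 11 + 6 = 17 = carry 1, sum 0001
```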


Dependency Resolution in D&C: (1) The Wait Strategy

So far we have seen D&C breakups in which there is no data dependency between the two (or more) subproblems of the breakup.

Data dependency leads to increased delays.

We now look at various ways of speeding up designs that have subproblem dependencies in their D&C breakups.

Strategy 1: Wait for the required o/p of A1 and then perform A2, e.g., as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition of the L.S. n/2 bits, A2 = (n/2)-bit addition of the M.S. n/2 bits.

No concurrency between A1 and A2:

t(A) = t(A1) + t(A2) + t(stitch-up)
     = 2*t(A1) + t(stitch-up) if A1 and A2 are the same problems of the same size (w/ different i/ps)

[Figure: Root problem A with subproblems A1 and A2; data flows from A1 to A2.]


Example of the Wait Strategy in Adder Design

Note: Gate delay is proportional to the # of inputs (since, generally, there is a series connection of transistors in either the pull-up or pull-down network, and the # of transistors in series = # of inputs; the Rs of the transistors in series add up, so the total R is prop. to the # of inputs, and hence the delay ~ RC, where C is the capacitive load, is prop. to the # of inputs).

Assume each gate i/p contributes 2 ns of delay.

For a 16-bit adder the delay will be 160 ns.

For a 64-bit adder the delay will be 640 ns.
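The wait-strategy recurrence t(A) = t(A1) + t(A2) + t(stitch-up) can be unrolled numerically. In the sketch below, the 10 ns per-bit leaf delay is an assumption inferred from the slide's own numbers (160 ns / 16 bits), and the stitch-up is assumed to be a plain wire with zero delay; the function name is hypothetical.

```python
def wait_delay(n_bits, leaf_delay_ns, stitch_delay_ns=0):
    """Unroll t(A) = t(A1) + t(A2) + t(stitch-up) down to 1-bit leaves."""
    if n_bits == 1:
        return leaf_delay_ns
    half = n_bits // 2
    return (wait_delay(half, leaf_delay_ns, stitch_delay_ns)
            + wait_delay(n_bits - half, leaf_delay_ns, stitch_delay_ns)
            + stitch_delay_ns)


# Assumed per-bit delay of 10 ns, inferred from 160 ns for 16 bits.
print(wait_delay(16, 10))  # 160
print(wait_delay(64, 10))  # 640
```

Because A2 always waits for A1, the delay grows linearly with the number of bits.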


Dependency Resolution in D&C: (2) The Design-for-all-cases-&-select or Speculative Strategy

Strategy 2: For a k-bit i/p from A1 to A2, design 2^k copies of A2, each with a different hardwired k-bit i/p to replace the one from A1.

Select the correct o/p from all the copies of A2 via a (2^k)-to-1 Mux that is selected by the k-bit o/p from A1 when it becomes available (e.g., carry-select adder).

t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up)
     = t(A1) + t(Mux) + t(stitch-up) if A1 and A2 are the same problems

[Figure: Root problem A. Subprob. A1 drives the select i/p of a 4-to-1 Mux; four copies of Subprob. A2, with hardwired i/ps 00, 01, 10, and 11, feed the Mux data i/ps.]

Other variations: Predict Strategy: Have a single copy of A2, but choose a highly likely value of the k-bit i/p and perform A1, A2 concurrently. If, after the k-bit i/p from A1 is available, the selection turns out to be incorrect, redo A2 w/ the correct available value.

t(A) = p(correct-choice)*max(t(A1), t(A2)) + [1 - p(correct-choice)]*t(A2) + t(Mux) + t(stitch-up),

where p(correct-choice) is the probability that our choice of the k-bit i/p for A2 is correct.

Need a completion signal to indicate when the final o/p is available for A; assuming worst-case time (when the choice is incorrect) is meaningless in such designs.
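A carry-select adder is the k = 1 case of the design-for-all-cases-&-select strategy: two copies of the MS-half adder, one per possible carry-in, and a 2-to-1 Mux driven by the LS half's carry. The sketch below is behavioural only (Python evaluates the copies one after another, whereas the hardware runs them concurrently); all names and the LS-first bit ordering are assumptions.

```python
def ripple_add(x_bits, y_bits, c_in):
    """Ripple-carry add on LS-first bit lists: returns (sum bits, carry out)."""
    sum_bits, carry = [], c_in
    for x, y in zip(x_bits, y_bits):
        sum_bits.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return sum_bits, carry


def carry_select_add(x_bits, y_bits, c_in=0):
    """A1 = LS half; A2 = MS half, replicated for carry-in 0 and carry-in 1."""
    mid = len(x_bits) // 2
    ls_sum, ls_carry = ripple_add(x_bits[:mid], y_bits[:mid], c_in)    # A1
    a2_copies = [ripple_add(x_bits[mid:], y_bits[mid:], hardwired)     # 2^k copies
                 for hardwired in (0, 1)]
    ms_sum, c_out = a2_copies[ls_carry]                                # Mux select
    return ls_sum + ms_sum, c_out


def to_bits_ls(value, n_bits):
    """LS-first list of the n_bits low-order bits of value."""
    return [(value >> i) & 1 for i in range(n_bits)]


def from_bits_ls(bits):
    """Inverse of to_bits_ls."""
    return sum(bit << i for i, bit in enumerate(bits))


total, carry = carry_select_add(to_bits_ls(200, 8), to_bits_ls(100, 8))
print(from_bits_ls(total), carry)  # 300 = carry 1, sum 44
```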


Example of the Speculative Strategy in Adder Design

For a 16-bit adder, the delay is (9*4 - 8)*2 = 56 ns (2 ns is the delay for a single i/p); a 65% improvement ((160-56)*100/160).

For a 64-bit adder, the delay is (9*8 - 8)*2 = 128 ns; an 80% improvement.
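The quoted improvements follow directly from the wait-strategy delays (160 ns and 640 ns) and the speculative delays (56 ns and 128 ns) stated in these slides. A short check, with a hypothetical helper name:

```python
def improvement_percent(wait_ns, speculative_ns):
    """Relative delay reduction of the speculative design, in percent."""
    return (wait_ns - speculative_ns) * 100 / wait_ns


print(improvement_percent(160, 56))   # 16-bit adder: 65.0
print(improvement_percent(640, 128))  # 64-bit adder: 80.0
```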


Dependency Resolution in D&C: (3) The Lookahead or Pre-Computation Strategy

Strategy 3: Redo the design of A2 so that it can do as much processing as possible that is independent of the i/p from A1 (A2_indep = A2_lookahd). This is the lookahead computation that prepares for the final computation of A2 (A2_dep), which can start once A2_indep and A1 are done.

t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up)

E.g., Carry-lookahead adder: it does lookahead computation; the lookahead computation is also associative, so it is doable in Θ(log n) time. The overall computation is also doable in Θ(log n) time.

A less structured example: Let a1 be the i/p from A1 to A2. If A2 has the logic:

a2 = vx + uvx + wxy + wza1 + uxa1

If this were implemented using 2-i/p AND/OR gates, the delay would be 8 delay units (1 unit = delay for 1 i/p) after a1 is available. If the logic is restructured as

a2 = (vx + uvx + wxy) + (wz + ux)a1,

and if the logic in the 2 brackets is computed before a1 is available (these constitute A2_indep), then the delay is only 4 delay units after a1 is available.

[Figure (Concept): Root problem A. Subprob. A1 sends data to Subprob. A2, which is split into A2_indep (or A2_lookahd) and A2_dep.]

[Figure (Example of an unstructured logic for A2): in the original A2, the critical path after a1 is available has an 8-unit delay; in the restructured A2 (A2_indep followed by A2_dep), the critical path after a1 is available has a 4-unit delay.]
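The restructuring only regroups terms, so both forms of a2 must agree on every input. The sketch below checks this exhaustively, using the expressions exactly as printed on the slide (any complement bars lost in extraction are not modelled); the function names are hypothetical.

```python
from itertools import product


def a2_original(u, v, w, x, y, z, a1):
    """a2 = vx + uvx + wxy + wza1 + uxa1 (flat sum of products)."""
    return (v & x) | (u & v & x) | (w & x & y) | (w & z & a1) | (u & x & a1)


def a2_lookahead(u, v, w, x, y, z, a1):
    """a2 = (vx + uvx + wxy) + (wz + ux)a1, with A2_indep computed first."""
    indep_sum = (v & x) | (u & v & x) | (w & x & y)   # ready before a1 arrives
    indep_factor = (w & z) | (u & x)                  # ready before a1 arrives
    return indep_sum | (indep_factor & a1)            # A2_dep: one AND, one OR


equivalent = all(a2_original(*bits) == a2_lookahead(*bits)
                 for bits in product((0, 1), repeat=7))
print(equivalent)  # True
```

After a1 arrives, only one 2-i/p AND and one 2-i/p OR remain, which is where the 4-unit figure comes from.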


D&C Summary

For complex digital design, we need to think of the computation underlying the design in an algorithmic manner: are there properties of this computation that can be exploited for faster, less expensive, modular design; is it amenable to the D&C approach?

The design is then developed in an algorithmic manner & the corresponding circuit may be synthesized by hand or described compactly using an HDL.

For an operation/func x on n operands (a(n-1) x a(n-2) x ... x a(0)), if x is associative, the D&C approach gives an easy stitch-up function, which is x on 2 operands (the o/ps of applying x on each half). This results in a tree-structured circuit with Θ(log n) delay, instead of a linearly-connected circuit with Θ(n) delay.

If x is non-associative, more ingenuity and determination of the properties of x are needed to determine the stitch-up function. The resulting design may or may not be tree-structured.

D&C can be done top-down or bottom-up. Top-down is generally the better way to think for beginners.

If there is a dependency between the 2 subproblems, then we saw strategies for addressing these dependencies:

    Wait (slowest, least hardware cost)

    Speculative (fastest, highest hardware cost)

    Lookahead (medium speed, medium hardware cost)


Strategy 2: A general view of speculative computations (w/ or w/o D&C)

If there is a data dependency between two or more portions of a computation (which may be obtained w/ or w/o using D&C), don't wait for the previous computation to finish before starting the next one.

Assume all possible input values for the next computation/stage B (e.g., if it has 2 inputs from the prev. stage, there will be 4 possible input value combinations) and perform it using a copy of the design for each possible input value.

All the different o/ps of the diff. copies of B are Muxed using prev. stage A's o/p.

E.g. design: Carry-Select Adder (at each stage performs two additions, one for a carry-in of 0 and another for a carry-in of 1 from the previous stage).

[Figure (a) Original design: stage A (i/p x) feeds its o/ps y, z to stage B. Time = T(A) + T(B).]

[Figure (b) Speculative computation: four copies B(0,0), B(0,1), B(1,0), B(1,1), each with hardwired i/ps, feed a 4:1 Mux selected by A's o/ps y, z. Time = max(T(A), T(B)) + T(Mux). Works well when T(A) approx = T(B) and T(A) >> T(Mux).]
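The general scheme can be sketched for a stage B with 2 one-bit inputs from stage A. The stage functions used in the usage lines are hypothetical stand-ins, as are the delay values; only the structure (4 copies of B plus a 4:1 Mux) and the two timing expressions come from the slide.

```python
from itertools import product


def run_sequential(stage_a, stage_b, x):
    """(a) Original design: B waits for A's o/ps y, z."""
    y, z = stage_a(x)
    return stage_b(y, z)


def run_speculative(stage_a, stage_b, x):
    """(b) One copy of B per possible (y, z); A's o/p selects via a 4:1 Mux."""
    copies = {(y, z): stage_b(y, z) for y, z in product((0, 1), repeat=2)}
    return copies[stage_a(x)]


def original_time(t_a, t_b):
    return t_a + t_b


def speculative_time(t_a, t_b, t_mux):
    return max(t_a, t_b) + t_mux


# Hypothetical stages, used only to exercise the two structures.
def example_a(x):
    return (x >> 1) & 1, x & 1


def example_b(y, z):
    return y ^ z


print(all(run_sequential(example_a, example_b, x)
          == run_speculative(example_a, example_b, x) for x in range(4)))
print(original_time(5, 5), speculative_time(5, 5, 1))  # illustrative delays
```

In hardware the copies of B run concurrently with A; the dictionary lookup here stands in for the Mux.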


Strategy 3: Get the Best of Both Worlds (Average and Worst Case Delays)!

Use 2 circuits with different worst-case and average-case behaviors.

Use the first available output.

Get the best of both (ave-case, worst-case) worlds.

[Figure: the inputs are held in registers that feed two circuits, both triggered by start: a Unary Division Ckt (good ave case, bad worst case) and a Non-Restoring Div. Ckt (bad ave case, good worst case). The circuits report completion on done1 and done2 to an Ext. FSM, which drives the select of a Mux that passes the first available output to the output register.]

In the above schematic, we get the good ave case performance of unary division (assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case performance.


Strategy 4: Pipeline It!

[Figure: the original ckt or datapath is converted to a simple level-partitioned pipeline with Stage 1, Stage 2, ..., Stage k (a level partition may not always be possible, but other pipelineable partitions may be).]

Throughput is defined as # of outputs / sec.

Non-pipelined throughput = 1/D, where D = delay of the original ckt's datapath.

Pipeline throughput = 1/(max stage delay + register delay).

Special case: If the original ckt's datapath is divided into n stages, each of equal delay, and dr is the delay of a register, then pipeline throughput = 1/((D/n) + dr). If dr is negligible compared to D/n, then pipeline throughput = n/D, n times that of the original ckt.
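The throughput expressions can be evaluated directly. In the sketch below the delay values are hypothetical and only serve to exercise the formulas; throughput comes out in outputs per unit of whatever time unit the delays use.

```python
def non_pipelined_throughput(total_delay):
    """1/D, where D is the delay of the original datapath."""
    return 1.0 / total_delay


def pipeline_throughput(stage_delays, register_delay):
    """1/(max stage delay + register delay)."""
    return 1.0 / (max(stage_delays) + register_delay)


# Hypothetical numbers: D = 10 time units, split into n = 4 equal stages.
D, n = 10.0, 4
equal_stages = [D / n] * n

print(non_pipelined_throughput(D))             # 1/D
print(pipeline_throughput(equal_stages, 0.0))  # n/D when dr is negligible
print(pipeline_throughput(equal_stages, 0.5))  # 1/((D/n) + dr)
```

With unequal stages, the slowest stage sets the clock period, so balancing the partition matters as much as the number of stages.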
