THE DESIGN AND IMPLEMENTATION OF A HIGH...

THE DESIGN AND IMPLEMENTATIONOF A HIGH-PERFORMANCEFLOATING-POINT DIVIDER

Stuart Oberman, Nhon Quach, and Michael Flynn

Technical Report: CSL-TR-94-599

January 1994

This work was supported by NSF under contract MIPS%22961, using facilitiessupported by NASA contract NAG2-842.

THE DESIGN AND IMPLEMENTATIONOF A HIGH-PERFORMANCEFLOATING-POINT DIVIDER

bY


Technical Report: CSL-TR-94-599

January 1994

Computer Systems Laboratory

Departments of Electrical Engineering and Computer Science

Stanford University

Stanford, California 94305-4055

Abstract

The increasing computation requirements of modern computer applications have stim-ulated a large interest in developing extremely high-performance floating-point dividers.A variety of division algorithms are available, with SRT being utilized in many computersystems. A careful analysis of SRT divider topologies has demonstrated that a relativelysimple divider designed in an aggressive circuit style can achieve extremely high perfor-mance. Further, an aggressive circuit implementation can minimize many of the perfor-mance advantages of more complex divider algorithms. This paper presents the tradeoffsof the different divider topologies, the design of the divider, and performance results.

Key Words and Phrases: Floating-point, SRT division, domino logic

Copyright @ 1994

bY


2.1 SRT Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.2 SRT Topologies . . . . . .

Contents

1 Introduction

2 Division

3 Mapping to Domino Logic

4 Divider Design4.1 Logic and Circuit Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.2 Layout/Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5 Results I

5.1 Single Stage Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5.2 Multiple Stage Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6 Comparison

7 Conclusions

I _

I12 .-

4

669

111 111

13

13

. . .111

List of Figures

Basic SRT Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3Two-Bank Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3Simplified Operand Scaling Scheme . . . . . . . . . . . . . . . . . . . . . . . 4Domino CSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Allowable quotient digits in each SRT region . . . . . . . . . . . . . . . . . 7Layout of a CSA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 0Multiple stage analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 2

iv

List of Tables

1 Critical path comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 _

2 Look-up table comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Complexity effects of optimizations . . . . . . . . . . . . . . . . . . . . . . . 9

4 Evaluation Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

5 Latencies for varying number of stages . . . . . . . . . . . . . . . . . . . . . 11 _

6 Performance comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 3

V

1 Introduction

Modern computer applications have increased in their computation complexity in recentyears. The development of high-speed floating-point arithmetic is a requirement to meetthe computation demands of modern applications which have increased in their computa-tion complexity in recent years. The emphasis on high performance graphics in worksta-tions has placed further demands on the computation abilities of processors. Furthermore,the industry-wide usage of performance benchmarks, such as SPECMarks, forces processordesigners to pay particular attention to floating-point computation. With die sizes of pro-cessors also increasing, processor designers can realistically include aggressive floating-pointunits directly on-chip to achieve even higher performance.

Floating-point intensive applications typically exhibit a distribution of operations sim-ilar to the following: 45% addition, 45% multiplication, 5% division, and 3% square- root.Past research and development efforts have placed a large emphasis on designing high-performance floating-point adders and multipliers. Thus, these two areas have many fastand efficient solutions. Division, though, has not seen a similar research effort. Beingthe third largest contributor to floating-point arithmetic, an increasing emphasis on high-performance division is vital to producing fast and competitive processors. Many divisionalgorithms exist which can provide high-performance. Moreover, in a standard static CMOSimplementation, the more complex algorithms often exhibit markedly higher performancethan simpler algorithms. For example, Newton-Raphson implementations can achieve ex-tremely low latency, as Newton-Raphson division converges to a result quadratically. Thisperformance comes at the price of additional hardware, complexity, and accuracy.

In this paper, a variety of SRT division schemes will be compared. SRT is often used inworkstations because it provides reasonable performance and it is relatively simple [l, 21.The tradeoffs of these schemes will be analyzed, and the design and performance of anew divider will be presented. It will be shown that by mapping a simple topology to anaggressive circuit style in conjunction with some logic optimization, a high-performanceCMOS divider can be implemented. Analysis of the latencies achievable in a dynamicCMOS circuit implementation demonstrates that the more complex algorithms do not showa distinct performance advantage. Rather, sufficiently high-performance can be obtainedat a lower complexity cost by using a simple divider implementation.

2 Division

2.1 SRT Overv iew

SRT is a non-restoring division algorithm that is basicallyutilizes the following relationship:

pj+l = rpj - Qj+l D

a trial and error process. I t

This equation models the paper-and-pencil method of division. To calculate a next partialremainder, the divisor is multiplied by the next quotient digit, and the result is subtracted

from the product of the last partial remainder, or dividend for the first iteration, and aradix r. The next quotient digit is obtained by supplying a fixed number of bits from thelast partial remainder, approximately 8 bits for a radix-4 divider, to a look-up table. Bychoosing a radix to be a power of 2, the product of the radix and the last partial remaindercan be formed by shifting. Similarly, the various products of the divisor multiplied bythe next quotient digit can be formed by multiplexing different multiples of the divisor.However, the problem with this basic scheme is that it requires a full-width subtractor.Consequently, this scheme can be very slow.

In order to improve upon this basic scheme, some redundancy is introduced into thealgorithm. An extra constraint is added to provide redundancy:

where k = n / (r - 1) and n is the number of positive allowed digits for the next quotientdigit. A design tradeoff can be noted in this relationship. By using a large number of alloweddigits for the next quotient digit, and thus a large value for k, a smaller look-up table isrequired, and thus the complexity and latency of the table look-up can be reduced. However,choosing a smaller number of allowed digits for the quotient simplifies the generation of themultiple of the divisor. Multiples that are powers of two can be formed by simply shifting.If a multiple is required that is not a power of two (e.g. three), an additional operationsuch as addition may also be required, which can add to the complexity and latency of thedivisor multiple generating process. Thus, the complexity of the look-up table and that ofgenerating multiples of the divisor must be balanced.

To increase the performance of the subtraction process, the partial remainders them-selves are kept in a redundant form. Instead of using full-width adders that require carrypropagation to compute partial remainders, a series of carry-save adders are used to computethe next partial remainder in the delay of a single full-adder. In this way, the conversionto a non-redundant form only needs to be done after the final iteration using a full widthadder.

2.2 SRT Topologies

In this study, three different SRT topologies were analyzed for cost and performance. Figure1 shows the topology of a Basic SRT implementation [3]. In this scheme, the critical path,shown by the dotted line, includes delays through the carry-lookahead adder, an XOR gate,an OR gate, a look-up table, driving 56 multiplexors, and finally a CSA. The XOR gatesand OR gates are necessary to convert from a two’s complement form to sign-magnitudebefore the look-up table can be accessed.

Figure 2 is the topology of a Two-Bank SRT implementation as proposed by Quach [4].This design removes some of the latency of the basic SRT scheme by duplicating hardware.The CSA’s compute two sums of rPj - qj+r D in parallel: one with qj+r equal to either 0 or1, and the other with qj+r equal to 2. Because it is known early whether qj+r will be either0 or 1, this value can simply be multiplexed into one of the CSAs. From the figure, it can beseen that the critical path includes delays throuph a CSA, a multiplexor, a carry-lookaheadadder, and a big multiplexor.

2

MUX -

+

Pj

I /

D

i___._..__..__.._ _..._ . . . . . . . . . . . . . . . . . . . ..-................. _ ---,

- / MUX

1 V

i

CSA

*Quottent TABLE

Figure 1: Basic SRT Scheme

1I I 1 D

CSA

---L-l TABLE

Figure 2: Two-Bank Scheme

3

t t

: Radix4CSA

I 1

Ij Two-bit Adders

Figure 3: Simplified Operand Scaling Scheme

A topology based upon operand scaling as proposed by Ercegovac et al [5] is displayed infigure 3. This design attempts to speed-up the critical path of the conventional SRT scheme.To reduce the quotient-selection complexity, the divisor is scaled to be around unity. Thisscaling procedure makes the quotient-selection function independent of the divisor value.Accordingly, the table has fewer terms and thus becomes simpler to implement. The overallalgorithm consists of three steps: scaling of the divisor and the dividend, division recurrence,and quotient conversion. Thus, the critical path is faster due to the decreased look-up tablecomplexity. However, the algorithm does require an extra cycle for scaling.

3 Mapping to Domino Logic

In order to achieve maximum performance from the divider, circuit styles besides conven-tional static CMOS were considered. Specifically, the use of a precharged logic style wasdecided upon due to its inherent performance advantages. Precharged logic is fast, as la-tency is only dependent upon one logic level transition. Latency in static gates is typicallya function of the rise times of the p-transistors in the pull-up network. By devoting a fixedprecharge phase in every clock cycle in which all gates have their outputs precharged high,the delay of the gates become more a function of the fall-times of the n transistors, whichare inherently faster due to their higher mobility. Additionally, in general, precharged logicis denser and provides lower input loading than static logic, as there is only one transistorper input.

The limitation of precharged logic is that precharged gates can not be easily cascaded.

4

Figure 4: Domino CSA

Utilizing a standard timing convention for input and output signals, inputs to a prechargedgate must be stable during the evaluation period, but the outputs of a precharged gates areonly valid during the evaluation period. Thus, this delay between the timings of the inputsand outputs precludes the usage of the output of a precharged gate to directly drive theinput of another precharged gate.

Accordingly, it was decided to utilize domino logic for the circuit level implementation.By placing a static inverter in-between any two precharge gates, one can take advantage ofthe fact that during the evaluation period, the output can only fall. Utilizing a cascadedinverter, the output of the inverter can only rise. Thus, this output can be used to driveanother domino gate. However, because all outputs are monotonically rising, domino logicforms a non-inverting logic family. Given that arithmetic circuits are being designed in thisstudy, it is vital that inverting gates be allowed. By providing the true and complements ofall signals, it is possible to use DeMorgan’s law to produce inverted outputs. In the worstcase, this requires approximately two times the number of precharged transistors, or aboutthe number of transistors in a static implementation. Figure 4 is an implementation of aCSA in domino logic.

To determine the relative performance of the three aforementioned divider topologies,the components of the critical paths of each of the implementations were simulated inHSPICE as domino circuits in 1~ CMOS under typical conditions. Table 1 contains theresults of this analysis.

These results demonstrated that the difference between the schemes was not large. Thus,

5

Implementation Critical Path LatencyConventional CLA + XOR + OR + Table(G) + BigMux + CSA 2.26ns

Two-bank CSA + MUX + CLA + BigMux 1.51nsOperand Scaling 4-2 CSA + 2 bit-adder + Table(3) 1.45nsImproved Conv. 1 CLA + XOR + Table($) + BigMux + fastCSA 1 1.86ns

Table 1: Critical path comparison

the choice of which topology to utilize or modify was not clear. The operand scaling schemerequires more complex CSAs and an extra scaling cycle. The two-bank scheme essentiallydoubles the number of CSAs in each stage, and thus requires nearly double the area. Giventhe lowest cost of the conventional scheme in terms of area and complexity, an analysis ofthe effects of various optimizations among the different topologies compared to an improvedconvcnt~ional scheme (Oberman-Quach) was to be performed.

4 Divider Design

4.1 Logic and Circuit DesignIt was decided to implement a divider based upon the conventional scheme based on itsinherently good cost/performance. However, the basis of this research was the methods bywhich this divider could be optimized to yield even higher performance. The two means ofoptimization were: 1) logic encoding for the look-up table and 2) taking advantage of thedomino logic technology mapping.

The equations for the three different schemes is presented in table 2. The first optimiza-tion, logic encoding, was made possible by noting some optimizations that could be madein the logic equations for the look-up table in the conventional scheme. Figure 5 shows anSRT table with the allowable quotient digits. This is actually a folded table so that onlythe positive half of the original table needs to be implemented. This provided an initialreduction in the size of the look-up table as reported in [3]. To further reduce the size ofthe look-up table, the quotients digits were encoded as follows:

This encoding reduces the complexity of the logic to form ql, as it only has to follow thetop edge of the q = 1 region of figure 5. The bottom edge is followed by ANDing with 42’.This encoding also allows for the removal of the OR gates that were required in converting

6

$XXX.XXy\00.00

00.01

00.10

00.11L

3 01.00SWz2

01.01

2 01.10ka, 01.11

10.00

10.01

10.10

10.11

>ll.OO

0 0 0 0 () () 0 0 0 0 0 0 0 0 () (1 ’

u 0 0 0 0 0 0 0 0 0 0 0 0 0 u 0- _-~_ .___ ____ ~,1 1 1 1 l’(/ 0 0 0 0 0 I) 0 0 u 0

11 1 1 1 11111 1 1 1 1

11111111 111111

11111111 111111

A = 1 if negative and y bit = 1; A = 2 otherwise;

B = 2 if negative and y bit = I: R = 1 otherwise.

Figure 5: Allowable quotient digits in each SRT region

Scheme EquationsOperand Scaling

qm2 = a’bce’ + a’bcd’ + a’bcf’ + a’bcg’ + a’bch’ + ab’c’f’ + ab’c’e’ +ab’c’d’ + ac’d’e’f ‘h’ + ab’d’e’f’h’ + ab’c’h’ + ac’d’e’f ‘g’ +ab’d’e’f ‘g’ + ab’c’g’

qml = ac’d’f’gh + ab’d’f ‘gh + a’cdefgh + ac’d’e’f + ab’d’e’f +abc’d’e + ab’cd’e + abc’de’f’ + ab’cde’f’ + ab’c’defgh

q0 = b’c’d’e’f’ + bcd’e’f ’ + bc’de + b’cde + bc’df + b’cdf

ql = b’c’d’e’f + bcd’e’f + a’b’c’d’e + abcd’e + a’b’c’de’f’ + abcde’f’

q2 = a’c’de + a’c’df + a’bc’ + a’b’c + abcde + abcdfOptimized Baseline

ql = a’c’d’u + a’b’u + uv + y + x

q2 = a’b’d’s’yuw + a’b’c’yuw ’ + a’c’d’yuv + a’b’c’s’yu + a’b’c’d’yu +a’b’yuv + b’c’xv + a’xw ) + xuv + a’s’x + a’d’x + a’xv + b’xu +a’c’x + a’b’x + a’xu + xy + t

Baselineq‘l = cdt’x’yv’ + bst’x’yw + cst’x’yv ’ + dst’x’yv’w + ct’x’yv’w’ +

a’c’d’t’x’y’u + bct’x’uv + t’x’y’uv + bdt’x’uv + bt’x’yv’ +a’b’t ‘x’y’u + at ‘xy’u’v ) + t’x’yu’ + bcdst’xy’u’v’w + abt’xy’u’ +act ‘xy’u’ + abt ‘xy’v’ + at ‘x’y

q2 = t + a’xw’ + a’s’x + a’d’x + a’c’x + a’b’x + b’c’xv + a’xv + b’xu +a’xu + a’b’c’yuw ’ + a’b’d’s’yuw + a’b’c’s’yu + a’b’c’d’yu + xuv +a’c’d’yuvw’ + a’c’d’s’yuv + a’b’yuv + xy

Table 2: Look-up table comparison

the high order 8 bits from a redundant form to the signed-magnitude form required by thelook-up table. Thus, approximately one gate delay can be removed from the critical pathof the conventional scheme.

Second, many of the terms in the new table’s equations are heavily dependent uponthe divisor, which is constant throughout the entire computational process. Synthesis toolscan reduce the overall gate delay of the table by noting this behavior. Specifically, inthis design, Synopsys was utilized to synthesize the improved baseline look-up table. Itgenerated combinational logic to implement the table, with the constraints for the designoptimization being to minimize the critical path for any of the inputs which are from thepartial product (i.e. variables t, x, y, u, v, and w). A library of standard-cells was used asinput to Synopsys, with fan-in of each gate limited to a maximum of three. The optimizedimplementation yielded a critical path of 5 two-input NOR gates.

Third, because two of the three inputs to the partial remainder CSA’s are known early,as they are the result of the previous partial remainder, there is only one late signal. Thus,the CSA is similar to a mux. This can be implemented in domino logic by placing the latestarriving signal, the multiple of the divisor driven by the table output, close to the CSAoutput. In this way, the latency of the CSA can be further reduced.

8

By util izing HSPICE, the CSA’s, muxes, latches, and CLA’s were sized to providemaximum performance. There is always a tradeoff in domino logic between evaluation timeand precharge time. Typically, in domino logic, the designer designs to make one edge, therising edge of the inverter, to be very fast. This can be accomplished by appropriate sizing:increasing the size of the p-transistor in the inverter, and decreasing the n-transistor. Thiscan bring two problems: 1) the falling/precharge edge can become very slow, and 2) noisemargins can become a significant problem.

-

The inverters were designed to have large noise margins. The pull-ups were designedthree times as large as the pull-downs, which almost exactly compensates for the mobilitydifferences in the MOSIS process, placing the DC switching point very near Vdd/2. Thus,noise margins, area, and power were considerations in this design.

Further analysis demonstrated that the n-transistor in the evaluation path of the dominogates is not necessarily needed. As long as all of the inputs to the domino gate are guaranteedto be low during the precharge phase, an explicit evaluation transistor is not required.However, this guarantee becomes difficult to achieve, as without the evaluation transistor,any domino gate can not precharge until its previous gate has precharged, and so on. Thus,while this scheme can decrease the evaluation times in the circuit, it can lead to an increasein precharge times.

A summary of the effects of the various optimizations on the different dividers’ com-plexity is shown in table 3.

Method Sca l ing 2 -banks Oberman-QuachScaling down

Encoding down downRadix, Encoding down down down

Circuit UP down

Table 3: Complexity effects of optimizations

This table demonstrates how the Oberman-Quach design compares in terms of reducingthe complexity, and thus the latency, of the various divider components in relation to theother designs.

4.2 L a y o u t / A r e a

Layout has been done for the radix-256 divider utilizing a two-level metal technology filefor Magic. Figure 6 shows a domino CSA extracted from the magic layout.

Current cell area for a CSA is 11,120 pm2. Thus, the total area for the four-levels of 56-wide CSAs is approximately 2.5mm2. Total area, including control logic and wire routing,should be under 5mm2.

9

Figure 6: Layout of a CSA

1 0

5 Results

5.1 Single Stage Analysis

The HSPICE simulations using 1,~ CMOS and typical conditions provided the followingresults:

Circuit Evaluation Timefast CSA fan-out 8 = .32nsreg CSA fan-out 8 = .42ns4-2 CSA .83ns

2-bit adder .30ns2-input XOR 200fF = .20ns2-Input M UX 200fF=.15ns, 2pF=.33ns4-Input OR 200fF=.lOns

CLA 200fF=.61ns

Table 4: Evaluation Times

The worst case precharge time for all gates was a maximum of .7ns. Thus, the cycletime for a radix-4 (single stage) improved conventional divider would be 2.56ns.

5.2 Multiple Stage Analysis

Higher radix dividers can be formed by cascading single stage radix-4 dividers [4]. Tocalculate the cycle times and latencies for other radices, it is necessary to multiply theevaluation time for a single stage by the number of stages, add the precharge time percycle, and multiply the result by the number of cycles. The total number of cycles is equalt o (56/(bits/iteration) + l(rounding). Table 5 shows the latencies of various radix dividers.

Stages 1 Avg. Cycle time 1 Num. Cycles 1 Total Latency2.56ns 29 74.2ns4.42ns 1 5 66.3ns6.28ns 11 69. Ins8.14ns 8 65.lns13.7ns 5 68.6ns

Table 5: Latencies for varying number of stages

These numbers demonstrate that the highest performance can be achieved by stagingfour radix-4 dividers together to form a radix-256 divider that generates 8 quotient bits periteration. Additionally, the corresponding cycle time is of the order of modern processorcycle times.

11

60

5 0

40Radix 4 Radix 16 Radix 64 Radix 256 Radix 16384

I I II I I I

n Rounding

Precharge

0 Evaluate

I

Figure 7: Multiple stage analysis

1 2

Model C y c l e s C y c l e T i m e L a t e n c ySuperSparc 9 20ns 180ns

Snake 1 5 10ns 150nsOberman-Quach 8 8.14ns 65.lns

Table 6: Performance comparison

The contributions of the variolrs phases of the division process in the above numberscan be seen in figure 7.

6 Comparison

To provide some quantitative reference, table 6 compares the performance of the dividerpresented in this study with two other competitive dividers:

It should be noted that both the SuperSparc and Snake are fabricated in 0.8~ processes,while simulations presented in this study were for a typical 1~ process. If the circuit wasfabricated in a 0.8~ process, it is estimated that the total latency would be reduced byabout 20%, totalling 52ns. If it were fabricated in a 0.35~ process, total latency could bereduced by up to lOO%, yielding a latency of under 33ns.

7 Conclusions

In this study, current divider technology has been analyzed. By using optimizations on aconventional SRT topology, a high-performance divider was designed. The combination oflogic encoding to reduce the latency for conversion and table look-up with circuit style andimplementation optimizations yielded a performance advantage over a conventional SRT di-vider. The other two divider topologies analyzed yield a slight performance advantage overthe presented divider, but at nearly double the hardware and complexity. When comparedwith the conventional SRT scheme, the Oberman-Quach divider realizes a similar perfor-mance advantage, but at a reduced hardware cost. By staging four such radix-4 dividersto form a radix-256 divider, the precharge time of the domino circuit can be amortizedin such a way that a minimum overall latency can be achieved for a full double-precisiondivide. Thus, when all of the topologies presented are mapped into a faster circuit style,the performance advantage of any one scheme is minimized, and the focus shifts to thedesirability of increased hardware and complexity.

References

[l] D. E. Atkins. Higher-radix division using estimates of the divisor and partial remain-ders. IEEE Transactions on Computers, C-17(10):925-934, October 1968.

1 3

THE DESIGN AND IMPLEMENTATION OF A HIGH...

Documents

Transcript of THE DESIGN AND IMPLEMENTATION OF A HIGH...