
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS

A Novel Modulo $2^n - 2^k - 1$ Adder for Residue Number System

Shang Ma, Jian-Hao Hu, Member, IEEE, and Chen-Hao Wang

Abstract: The modular adder is one of the key components in applications of the residue number system (RNS). Moduli of the form $2^n - 2^k - 1$ offer an excellent balance among the RNS channels in multi-channel RNS processing. In this paper, a novel algorithm and its VLSI implementation structure are proposed for the modulo $2^n - 2^k - 1$ adder. In the proposed algorithm, parallel prefix operation and carry correction techniques are adopted to eliminate the re-computation of carries. Any existing parallel prefix structure can be used in the proposed structure, which allows a flexible tradeoff between area and delay. Compared with modular adders of the same type built on traditional structures, the proposed modulo $2^n - 2^k - 1$ adder offers better performance in delay and area.

Index Terms: Carry correction, modular adder, parallel prefix, residue number system (RNS), VLSI.

I. INTRODUCTION

The residue number system (RNS) is an ancient numerical representation system. It is recorded in one of the Chinese arithmetical masterpieces, the Sun Tzu Suan Jing, in the 4th century, and was transferred to Europe, where it became known as the Chinese Remainder Theorem (CRT), in the 12th century. RNS is a non-weighted numerical representation system and has a carry-free property in multiplication and addition operations. In recent years, it has received intensive study in very large scale integration (VLSI) circuit design for digital signal processing (DSP) systems with high speed and low power consumption [1]-[4]. The modular adder is one of the key modules of RNS-based DSP systems.

For integers $X$ and $Y$ with $n$-bit width, the modular addition can be performed by (1) if $X$ and $Y$ are less than the modulus $m$:

$$|X+Y|_m = \begin{cases} X+Y, & X+Y < m \\ X+Y-m = |X+Y+T|_{2^n}, & X+Y \ge m \end{cases} \qquad (1)$$

In (1), $T = 2^n - m$, which is referred to as the correction [5]-[8]. In the general modular adder design, the two values, $X+Y$ and $X+Y+T$, are computed first. Then, one of them is selected as the final output. According to the form of the modulus, modular adders can be classified into two types: the general modular adder and the special modular adder.
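The selection rule around (1) can be illustrated with a short Python sketch. It is a behavioral model only, not the hardware structure proposed in this paper; the function name and the operand values in the usage line are hypothetical, and the check on `n` assumes that `n` is the smallest bit width that can hold the modulus.

```python
def mod_add_with_correction(x: int, y: int, m: int, n: int) -> int:
    """Return (x + y) mod m for 0 <= x, y < m using the correction T = 2**n - m."""
    if not (0 <= x < m and 0 <= y < m):
        raise ValueError("operands must already be reduced modulo m")
    if not (m <= (1 << n) < 2 * m):
        raise ValueError("n must be the bit width of m")
    t = (1 << n) - m                  # correction T
    plain = x + y                     # first candidate: X + Y
    corrected = plain + t             # second candidate: X + Y + T
    carry_out = corrected >> n        # equals 1 exactly when X + Y >= m
    # Select the low n bits of X + Y + T when its carry-out is 1, else X + Y.
    return corrected & ((1 << n) - 1) if carry_out else plain


print(mod_add_with_correction(200, 192, 239, 8))  # 153
```

Both candidates are formed before the carry-out is known, which mirrors the "compute both, then select" organization described above.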

Manuscript received April 13, 2012; revised October 11, 2012 and December 18, 2012; accepted February 05, 2013. This work was supported in part by the National Natural Science Foundation of China under Grants 61101033 and 61070696, and by the Fundamental Research Funds for the Central Universities of China under Grant ZYGX2011J118. This paper was recommended by Associate Editor B.-H. Gwee.

The authors are with the National Key Laboratory of Science and Technology on Communications, University of Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCSI.2013.2252639

For the general modular adder, Bayoumi proposed a scheme for arbitrary moduli that uses two cascaded binary adders [5]. However, its delay is the sum of the delays of the two binary adders. Several works constructed modular adders with two parallel binary adders to calculate $X+Y$ and $X+Y+T$ [6], [7]. This method achieves less delay but needs about twice the area of a binary adder. Dugdale proposed a method to construct a type of general modular adder with a reused binary adder [9]. The shortcoming of this structure is that it uses two operation cycles to perform one modular addition. The area or delay of the modular adders mentioned above is twice or more than that of a binary adder. In recent studies, a few modular adders with better area and delay performance have been presented. Hiasat proposed a class of modular adders in which any regular Carry Look-Ahead (CLA) based binary adder can be used in the final stage [10]. However, it needs an extra CLA unit to get the carry-out bit of $X+Y+T$ before the final CLA addition. As a result, the structure does not reduce the delay significantly. The ELMMA algorithm proposed by Patel et al. [11] uses two carry computation modules for $X+Y$ and $X+Y+T$ in which some carry computation units can be shared. The area reduction of this scheme is dominated by the form of $T$. In the worst case, almost two independent carry generation modules are needed. Patel et al. [12] also proposed several algorithms which can generate carries quickly. A new number representation for modulo addition is proposed in [8]. However, its outputs are represented in a special format. Thus, either extra area and delay are needed to convert from the special representation to binary number representation, or all operations in the RNS-based system must be performed in this number representation format.

On the other hand, the complexity of the special modular

adder is much less than that of the general modular adder, since the structure of the special modular adder can be further optimized according to the modulus. The effective modular adders for modulo $2^n - 1$ and modulo $2^n + 1$ have drawn much more attention than other kinds of modular adders [13], [14]. The works in [15] and [16] proposed an architecture for a modulo $2^n + 1$ adder based on diminished-1 number representation. The works in [17] and [18] presented a structure for modulo $2^n - 1$ and $2^n + 1$ based on parallel prefix and carry correction, respectively. A special modulo adder with an architecture similar to that of [7] is also proposed in [19]. In [20], Patel et al. described an implementation structure for a modulo $2^n - 2^{n-2} - 1$ adder based on the technique of carry offset, which is only required to obtain the carry information of . In order to obtain the carries required in the modular addition, each carry of has to be modified according to the utmost carry of . In this case, the redundant modules of carry computation are eliminated, but the structure of carry


computation is fixed and can only perform the special modular addition, that is, modulo $2^n - 2^{n-2} - 1$ addition.

One of the important issues in RNS-based applications is the selection of moduli sets. In addition- and multiplication-intensive systems, as many residue channels as possible are desired when the dynamic range is fixed, so that the word length of each individual residue can be reduced to achieve better speed performance. Meanwhile, the widths of the channels should be as close as possible so that the channels have similar critical path delays. That is the balance between residue channels. Moreover, the complexity of the modular adder should be evaluated carefully in residue radix selection. At present, high-performance modular adders are available for a few moduli, such as modulo $2^n - 1$ and modulo $2^n + 1$. But these moduli are not always suitable for constructing a multi-channel RNS with fine channel balance. For example, it is hard to construct a multi-channel moduli set with $2^n - 1$ and $2^n + 1$ that is both co-prime and finely balanced between channels. However, moduli of the form $2^n - 2^k - 1$ have a prominent advantage in constructing multi-channel moduli sets with fine balance [21]. Several methods exist for selecting moduli sets with this type of residue. For instance, we can verify that the moduli set

satisfies the co-prime requirement when , 4, 5, 6, 8, 12, and when , 9, 10, 11 by removing a few radixes. Meanwhile, the channel widths of these moduli sets are all $n$ bits. Thus, the residue radix of the form $2^n - 2^k - 1$ has great potential for constructing moduli sets with high efficiency, high dynamic range, and fine balance between channels. Due to the advantages of radix $2^n - 2^k - 1$, it is essential to study its fundamental computation units, that is, the modulo $2^n - 2^k - 1$ adder and the modulo $2^n - 2^k - 1$ multiplier. In [21], a general architecture for the modulo $2^n - 2^k - 1$ multiplier was recently proposed. A modulo adder and a modulo $2^n - 2^{n-2} - 1$ adder are also proposed in [19] and [20], respectively. However, there is little discussion about a general architecture for the modulo $2^n - 2^k - 1$ adder.

In this paper, a new class of modulo $2^n - 2^k - 1$ adder based on

carry correction and parallel prefix algorithm is proposed. The new modular adder can be divided into four units: the pre-processing unit, the prefix computation unit, the carry correction unit, and the sum computation unit. In the proposed scheme, the carry information of computed by the prefix computation unit is modified twice to obtain the final carries required in the sum computation module. Meanwhile, any existing fast prefix structure of binary adders can be used in the proposed modular adder structure, which offers superior flexibility in design. To evaluate the performance of the proposed modular adder, the unit-gate model and Synopsys Design Compiler (DC) are used to estimate its complexity and performance. The results show that the proposed modulo $2^n - 2^k - 1$ adder can achieve the best delay performance. Compared with the special modulo $2^n - 2^{n-2} - 1$ adder proposed in [20], our method offers similar delay performance but is able to design a class of modulo $2^n - 2^k - 1$ adders with different $k$ based on an identical algorithm. Moreover, compared with the ELMMA modular adder, the proposed modulo adder has better area-delay performance in most cases and can achieve a faster operating frequency.

Fig. 1. Prefix computation-based adder structure.

In the rest of the paper, a brief introduction to RNS and modular addition is presented in Section II. Section III introduces the algorithm and hardware architecture of the proposed modulo $2^n - 2^k - 1$ adder. The performance of the proposed modular adder is evaluated and compared with that of other modular adders in Section IV. Finally, we conclude the paper.

II. BACKGROUND

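A. RNS and Modular Addition

RNS is defined by a group of pairwise co-prime moduli $\{m_1, m_2, \ldots, m_L\}$, where $\gcd(m_i, m_j) = 1$ for $i \ne j$, $M = \prod_{i=1}^{L} m_i$, and $\gcd(a, b)$ is the greatest common divisor of $a$ and $b$. An integer $X$ in $[0, M)$ can be represented uniquely by its residues with respect to the moduli $m_i$, that is, $X = (x_1, x_2, \ldots, x_L)$, where $x_i = |X|_{m_i}$, $i = 1, 2, \ldots, L$.

Let $(x_1, x_2, \ldots, x_L)$, $(y_1, y_2, \ldots, y_L)$ and $(z_1, z_2, \ldots, z_L)$ be the RNS representations of integers $X$, $Y$ and $Z$ in the range of $[0, M)$. According to Gaussian modular arithmetic, if $Z = |X \circ Y|_M$, we can get $z_i = |x_i \circ y_i|_{m_i}$, where $\circ$ represents addition, subtraction, or multiplication.

For integers $X$ and $Y$ in the range of $[0, m)$, modulo $m$ addition is defined as

$$|X+Y|_m = \begin{cases} X+Y, & X+Y < m \\ X+Y-m, & X+Y \ge m \end{cases} \qquad (2)$$

If $0 \le X, Y < m$ and the bit width of the modular adder is $n$ bits, where $n = \lceil \log_2 m \rceil$ (that is, $n$ is the smallest integer no less than $\log_2 m$), (2) can be represented as

$$|X+Y|_m = \begin{cases} X+Y, & X+Y+T < 2^n \\ |X+Y+T|_{2^n}, & X+Y+T \ge 2^n \end{cases} \qquad (3)$$

where the correction $T = 2^n - m$ [7], [8], [20]. That is, if the carry-out bit of $X+Y+T$ is 1, the result of the modular addition is the least significant $n$ bits of $X+Y+T$; otherwise, the result is $X+Y$. This is the basic rule in most modular adder designs.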
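The channel-wise property described in this subsection can be checked numerically. The following Python sketch is illustrative only: the moduli set `(7, 8, 9)` and all function names are assumptions chosen for the example and do not come from the paper, and the reconstruction step uses the textbook Chinese Remainder Theorem formula.

```python
from itertools import combinations
from math import gcd, prod
import operator


def is_pairwise_coprime(moduli):
    """True when every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def to_rns(x, moduli):
    """Residues of x with respect to each modulus."""
    return tuple(x % m for m in moduli)


def rns_apply(op, xs, ys, moduli):
    """Apply op (add, sub, mul) independently in every residue channel."""
    return tuple(op(a, b) % m for a, b, m in zip(xs, ys, moduli))


def from_rns(residues, moduli):
    """Recover the integer in [0, M) with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return total % big_m


MODULI = (7, 8, 9)   # hypothetical pairwise co-prime set, dynamic range M = 504
x, y = 123, 45
xs, ys = to_rns(x, MODULI), to_rns(y, MODULI)
print(from_rns(rns_apply(operator.add, xs, ys, MODULI), MODULI))  # 168
print(from_rns(rns_apply(operator.mul, xs, ys, MODULI), MODULI))  # 495
```

No carry crosses from one channel to another, so each channel needs only its own modular adder or multiplier.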

B. Prefix Parallel Addition

Parallel prefix operation is widely adopted in binary adder design. Each sum bit $s_i$ and carry bit $c_i$ can be calculated from the previous carries and inputs [22]. As shown in Fig. 1, prefix-based binary adders can be divided into three units: the pre-processing unit, the prefix computation unit, and the sum computation unit.

In the pre-processing unit, the generation and propagation bits are calculated as

$$g_i = x_i y_i, \qquad p_i = x_i \oplus y_i \qquad (4)$$

where $g_i$ and $p_i$ represent the carry generation bit and carry propagation bit, respectively.

The prefix computation unit is used to compute the carry information used in the sum computation unit. Prefix computation can be performed by

$$(G_{i:k}^{l}, P_{i:k}^{l}) = (G_{i:j+1}^{l-1}, P_{i:j+1}^{l-1}) \bullet (G_{j:k}^{l-1}, P_{j:k}^{l-1}) = (G_{i:j+1}^{l-1} + P_{i:j+1}^{l-1} G_{j:k}^{l-1},\; P_{i:j+1}^{l-1} P_{j:k}^{l-1}) \qquad (5)$$

where $G_{i:i}^{0} = g_i$, $P_{i:i}^{0} = p_i$, $i > j \ge k$, and $l$ represents the stage. A smaller $l$ means a shorter delay of the carry chain. The operator $\bullet$ in (5) is the prefix operator, and $(G_{i:k}^{l}, P_{i:k}^{l})$ is the prefix computation result of the $l$th stage from the $k$th bit to the $i$th bit, which is also called group prefix computation. There are several well-known binary prefix addition structures, such as Sklansky (SK), Brent-Kung (BK), Kogge-Stone (KS), Han-Carlson (HC), ELM, and so forth [22]. The prefix structures mentioned above are usually called prefix trees.

After prefix computation, the carries $c_i$ for the $i$th bit can be obtained. They can be computed as

$$c_i = G_{i:0}^{l} \qquad (6)$$

In the sum computation unit, the carries $c_i$ from the prefix computation unit and the partial sums $p_i$ from the pre-processing unit are used together to compute the final sum bits $s_i$,

$$s_i = p_i \oplus c_{i-1} \qquad (7)$$
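The three units of a prefix-based adder can be sketched as a bit-level Python model. This is a sketch under stated assumptions: it uses a Kogge-Stone arrangement as one possible prefix tree, it folds the carry-in into the generate bit of position 0, and the function name `prefix_add` is hypothetical.

```python
def prefix_add(a: int, b: int, n: int, cin: int = 0):
    """Add two n-bit integers with a Kogge-Stone style prefix network.

    Returns (sum modulo 2**n, carry-out bit).
    """
    # Pre-processing unit: per-bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Prefix computation unit: group signals, with the carry-in folded into bit 0.
    G, P = g[:], p[:]
    G[0] = g[0] | (p[0] & cin)
    d = 1
    while d < n:
        new_g, new_p = G[:], P[:]
        for i in range(d, n):
            new_g[i] = G[i] | (P[i] & G[i - d])   # prefix operator, generate part
            new_p[i] = P[i] & P[i - d]            # prefix operator, propagate part
        G, P = new_g, new_p
        d *= 2
    # After the last level, G[i] is the carry out of bit position i.

    # Sum computation unit: partial sum XOR incoming carry.
    total = 0
    for i in range(n):
        carry_in = cin if i == 0 else G[i - 1]
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1]


print(prefix_add(0b1011, 0b0110, 4))  # (1, 1): 11 + 6 = 17 = 16 + 1
```

Swapping the `while` loop for another prefix arrangement (for example Sklansky or Brent-Kung) would change depth and fan-out but not the pre-processing or sum stages.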

C. Unit-gate Model for Area and Delay Analysis

The unit-gate model is one of the most commonly used models to estimate circuit complexity and performance in VLSI design. In the unit-gate model, simple two-input logic gates, such as AND, OR, NAND, and NOR, are treated as unit gates. They have the same area and delay, which are referred as and in this paper, respectively. For more complicated two-input gates, such as XOR and XNOR, the area and delay are defined as and in our analysis, respectively. Complex logic circuits as well as multi-input gates can be implemented with two-input unit gates, and their gate counts equal the sum of the gate counts of the constituent unit gates [22].
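A small Python sketch shows how such estimates are tallied. It assumes the convention common in the unit-gate literature that an XOR or XNOR gate counts as two unit gates in both area and delay; the table and function names are hypothetical.

```python
# (area, delay) in unit gates; the XOR/XNOR weights are an assumed convention.
UNIT_GATE = {
    "AND": (1, 1), "OR": (1, 1), "NAND": (1, 1), "NOR": (1, 1),
    "XOR": (2, 2), "XNOR": (2, 2),
}


def area(gates):
    """Total area: sum over every gate in the circuit."""
    return sum(UNIT_GATE[g][0] for g in gates)


def delay(path):
    """Delay: sum over the gates along one (critical) path."""
    return sum(UNIT_GATE[g][1] for g in path)


# Prefix operator: G = g_hi OR (p_hi AND g_lo), P = p_hi AND p_lo.
prefix_operator_gates = ["AND", "OR", "AND"]
prefix_operator_critical_path = ["AND", "OR"]

print(area(prefix_operator_gates), delay(prefix_operator_critical_path))  # 3 2
```

Under these weights, a path through one XOR, one AND, and one OR gate costs four unit delays.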

III. PROPOSED MODULO $2^n - 2^k - 1$ ADDER

As shown in Fig. 2, the proposed modulo $2^n - 2^k - 1$ adder is composed of four modules: the pre-processing unit, the carry generation unit, the carry correction unit, and the sum computation unit. In Fig. 2, different shades represent different processing units.

The proposed modular adder can be divided into two general binary adders, and in Fig. 2, with carry correction and sum computation modules, according to the characteristics of the correction $T$ for modulus $2^n - 2^k - 1$. We can get the carries used in the final stage by correcting the carries of , which can be computed by any existing prefix structure with proper pre-processing. At last, we can get the final modular addition result from and partial sum information. The proposed architecture shown in Fig. 2 avoids calculating the carry information for $X+Y$ and $X+Y+T$ separately. Thus, the area and delay in VLSI implementation can be reduced. Meanwhile, the proposed scheme offers a flexible tradeoff between area and delay with different parallel prefix structures.

Fig. 2. The proposed modulo $2^n - 2^k - 1$ adder structure.

A. Pre-Processing Unit

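The pre-processing unit is used to generate the carry generation and carry propagation bits of . From (3), when $m = 2^n - 2^k - 1$,

$$T = 2^n - m = 2^k + 1. \qquad (8)$$

Obviously, the binary representation of $T$ is $(\underbrace{0 \cdots 0}_{n-k-1}\, 1\, \underbrace{0 \cdots 0}_{k-1}\, 1)_2$.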

In Fig. 2, the computation of can be performedby and where and are used for lower- bits andhigher- bits addition, respectively. Let ,

, and the binary representations of and be

and respec-tively. The operation of adder and can be regarded as

(9)

where is the carry-out bit of adder .For , one of the inputs of , every bit is 0 except the

least significant bit. Thus, can be treated as a -bit adder withthe lowest carry-in bit, which is exactly as same as the generalbinary adder. And the way pre-processing of is also similarwith the general binary adder. The difference is that the lowestcarry-in bit should be considered. Therefore, carry generationand carry propagation bits are

(10)

For adder , it does not only add the constant , but alsothe carry-out bit from adder . It can be regarded as athree-inputs adder with the lowest carry-in bit. The three inputsare , and in binary. In this paper,



we reduce the number of inputs from three to two for adderby using Simple Carry Save Adder (SCSA). When

, we can get for and , firstly

(11)

And then is treated as the inputs of the second stagein SCSA. The second stage of SCSA generates the carry gen-eration and carry propagation bits from and . Actu-ally, it is the carry saved addition of these two binary numbers,

and . Thus, the final outputsof pre-processing unit for adder are

(12)From (10) and (12), all of the information required in the

prefix computation is obtained. Furthermore, the carry-out bitof SCSA, , is required to compute the carry-out bit of

, . It is calculated as

(13)
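The split of $X+Y+T$ into a lower $k$-bit addition and a higher $(n-k)$-bit addition can be checked with a behavioral Python sketch. It is not a transcription of (9)-(13): it assumes $T = 2^k + 1$, models the lower adder as a $k$-bit addition with carry-in 1, and reduces the three inputs of the higher adder to two with a generic carry-save step. The function and variable names are hypothetical.

```python
def add_with_t_split(x: int, y: int, n: int, k: int) -> int:
    """Compute x + y + T, with T = 2**k + 1, from a lower and a higher adder."""
    low_mask = (1 << k) - 1
    x_lo, y_lo = x & low_mask, y & low_mask
    x_hi, y_hi = x >> k, y >> k

    # Lower adder: k-bit addition whose carry-in is the least significant 1 of T.
    low = x_lo + y_lo + 1
    carry_k = low >> k                      # carry passed to the higher adder

    # Higher adder: three inputs (x_hi, y_hi, constant 1) plus carry_k.
    # A carry-save step reduces the three inputs to two before the final addition.
    one = 1
    partial = x_hi ^ y_hi ^ one
    saved = ((x_hi & y_hi) | (x_hi & one) | (y_hi & one)) << 1
    high = partial + saved + carry_k

    return (high << k) | (low & low_mask)


n, k = 8, 4
print(add_with_t_split(200, 192, n, k))  # 409 = 200 + 192 + 17
```

The carry-out of the whole $n$-bit addition is bit $n$ of the returned value, which is the bit that decides whether $X+Y$ or $X+Y+T$ supplies the result.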

B. Carry Generation Unit

In the carry generation unit, the carries of can be obtained with the carry generation and carry propagation bits from the pre-processing unit. Any existing prefix structure can be used to get the carries in this paper.

It is worth pointing out that the carry-out bit of SCSA in the pre-processing unit, as shown in (13), is not involved in the prefix computation. Instead, combined with the carry-out bit of the prefix tree is required to determine the carry-out bit of (denoted as )

(14)

where .

C. Carry Correction Unit

The carry correction unit is used to get the real carries for each bit needed in the final sum computation stage. In order to reduce the area, we get the carries of by correcting the carries of in the carry correction unit.

We first derive the relation of and in binary addition in Theorem 1, where and are the carry outputs of the prefix tree when the lowest carry-in is 0 and 1, respectively.

Theorem 1: Let be the carry bits of an $n$-bit adder, and they will be propagated to the higher adjacent positions, be the lowest carry-in (that is, ), and be the final carry-out bit (that is, ). Assuming the carries for each bit be when and the carries for each bit be when , we can get the relationship

Proof: Let and bethe binary representations of and , respectively. Then, wehave , and .According to the parallel prefix algorithm, we have

which can be rewritten as

.If , then , which yields

and . Thus, we have . That is, ,.

If , it means that cannot be propagated to . Hence, , which is irrelevant to . That means . Thus, . Q.E.D.

Theorem 1 means that can be determined from by simple logic operations. That is the foundation of the carry correction for the proposed modular adder. We present the procedure of carry correction in our scheme, based on Theorem 1, as follows.
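The idea behind this kind of correction can be checked exhaustively for small widths. The Python sketch below assumes the standard form of the relation, namely that the carry out of bit $i$ with carry-in 1 equals the carry with carry-in 0 ORed with the group propagate signal from bit $i$ down to bit 0. That form is an assumption of this sketch rather than a quotation of Theorem 1, and the function names are hypothetical.

```python
def ripple_carries(a: int, b: int, n: int, cin: int):
    """Carry out of every bit position for an n-bit addition (reference model)."""
    carries, c = [], cin
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | ((ai ^ bi) & c)
        carries.append(c)
    return carries


def group_propagates(a: int, b: int, n: int):
    """Group propagate from bit i down to bit 0, for every i."""
    out, prop = [], 1
    for i in range(n):
        prop &= ((a >> i) ^ (b >> i)) & 1
        out.append(prop)
    return out


def corrected_carries(a: int, b: int, n: int):
    """Carries for carry-in 1, derived from the carry-in 0 carries by one OR per bit."""
    c0 = ripple_carries(a, b, n, 0)
    return [c | p for c, p in zip(c0, group_propagates(a, b, n))]


n = 5
ok = all(
    corrected_carries(a, b, n) == ripple_carries(a, b, n, 1)
    for a in range(1 << n) for b in range(1 << n)
)
print(ok)  # True
```

Because the correction is a single OR per bit, the carries for carry-in 1 need no second prefix tree.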

and can be represented as in binary. The

computation of can be divided into two steps,and .

The two 1 bits in $T$'s binary representation can be regarded as the carry-in bits for adder and adder shown in Fig. 2, respectively. Correspondingly, the carry bits of can be obtained with two carry corrections of based on Theorem 1. The first correction result is the carries of

. The second correction result is the car-

ries of . Whether carry correction is performed or notdepends on the carry-out bit of , that is, in (14).Carry Correction for AdderSince the binary representation of is

, can be regarded as the

carry bits of and .Therefore, can be modified with Theorem 1 todetermine the carry bits of

, that is

(15)

One point requires attention when performing (15). The lowest propagation bit in , , is not equal to that in (10). Actually, it is equal to . According to Theorem 1, the carries of are corrected under the condition of . We can use a 2-to-1 multiplexer (MUX) to perform the operation. For this MUX, is the control signal, while and



are input signals. And the output is the result of the firstcorrection, denoted as

(16)

Carry Correction forFrom (16), is the carry infor-mation of or after the correctionfor adder . Then we can perform the second correctionbased on and let the carry bits of the second correctionbe . Similar to the first correction, is the carry of

(that is, ) when . Oth-erwise, is the carry of . That is, is thefinal carry information needed in sum computation unit.

When , . The bit 1 in willnot affect . Hence,

(17)

When , the inputs of adder in Fig. 2are and . And the carry-in bitis the carry-out bit of adder , that is, . Considering the leastsignificant bit of is 1, we can treat the oper-ation of adder as the addition of two inputs,and , with the lowest carry-in bit 1. Thatis, the results and carry information of , in (18) are iden-tical

(18)Thus, we can get the carries of by modifying the car-

ries of adder with Theorem 1.Combined with the final carry-out bit of , , thecarries required by the proposed modular adder are deter-mined.Since the second carry correction is performed under the con-

dition that the lowest carry-in bit of adder is a constant 1,the propagation bits used in the carry correction unit shouldbe computed by and . Fromthe above analysis, it is shown that the difference between these

two additions in (18) is that the least significant bits, 1 forin (18) and for in

(18) . The propagation carry information can be computedfrom (11) and (12). Let be the propagate carries of (18) ,we have

(19)

Let be the group propagate carries, then

(20)

When , according to Theorem 1and (16), the carries after the second carry correction are

(21)

Substituting (19) into (21), we get

(22)

Substituting (16) into (22), we get

(23)

When . Similarly, we get

(24)

According to (16), (17), (23) and (24), the carry bits required by the proposed modular adder are determined as shown in (25).

(25)



Let

(26)

Then

.(27)

From the unit-gate evaluation model, the delay of computingis in (25) when , which is identical to the

delay of a prefix computation unit. It is shown from Fig. 2 thatthe pre-processing units of the proposed modular adder guaran-tees that is determined before atleast for most prefix structures.If is determined before no less

than two stages prefix computation, the delay of computingand in (26) is the delay sum of one XOR, one AND, and oneOR gate. That is, the total delay is . That means the outputtime of is identical with that of andin (26). Thus, there is also no extra delay.If is determined before only less

than one stage prefix computation delay, the delay of computingand should be reduced to at most one prefix computation

delay through special pre-processing to eliminate the possibleextra delay. In order to achieve this purpose, can be usedas the selection signal for the MUX. Meanwhile, and canbe pre-computed and used as the inputs of the MUX. Letand be the value of and when respectively.Similarly, let and be the value of and when ,respectively. We get

(28)

and

(29)Thus, we can get the carry information that will be used in

the sum computation unit of the proposed modular adder.

D. The Sum Computation

Generally, the sum computation is the same as that in a prefix-based binary adder. However, is the correction result when

is taken into account. That is, if , is the carrybit of . Otherwise, it is the carry bit of . Thus,the partial sum bits of and are both requiredin the final sum computation. Let andbe the partial sum bits of and respectively.Note that has been determined in the

pre-processing unit (that is, ). Besides, justwhen and . Consequently

(30)Hence

(31)

(32)

When

(33)

At last, the sum bits are

(34)

In (34), and can be obtained at the same time. Therefore, there is no extra delay compared with other sum computation units.

E. Design Example

The VLSI implementation structure of the modulo adder based on the proposed scheme is shown in Fig. 3(a). Fig. 3(b) illustrates the function of each module.

Fig. 3. Modulo adder based on the proposed structure.

Pre-processing Unit: The pattern in Fig. 3 is the pre-processing unit and is used to generate the carry generation and carry propagation bits for the following prefix computation. Since there are fixed 1 inputs at the 1st and the 4th places, the patterns and are used for these special situations. The pattern does not cost any resources in the unit-gate model. The computations of these patterns correspond to (10), (11) and (12).

Prefix Computation: The pattern is the prefix computation unit. In this example, the Sklansky prefix tree is used and there are 11 prefix computation units, which corresponds to (4). The delay of is determined by its carry generation path, which is one OR gate and one AND gate. However, the pattern in the final stage of the prefix tree does not need to compute propagation bits.

The Computation of : The is computed by pattern in Fig. 3. According to (14), , and can be computed concurrently. Then, we can get after an OR gate. Thus, the delay of computation will not exceed the delay of pattern and there is no extra delay. In order to minimize the delay of , the value of can be selected so that the delay difference of and is at least one OR gate delay. In (14), is computed in pre-processing first. Meanwhile, the delay of is always smaller than that of in the prefix tree. Thus, we can compute first if is obtained before . Otherwise, we can compute first. When the last one, or , arrives, only one OR gate is needed to compute the final value of . That is, the delay is if the value of is selected properly. In this example, .

Carry Correction Unit: The pattern in Fig. 3 performs the computation corresponding to (27). In this example, 7 correction operators are used. From (27), there are three different situations, that is , and . The , and can be computed by independent modules. The patterns and in Fig. 3 are used to compute , and in (27). In this example, is computed out before with two prefix computation stages. Hence, we can get and without extra delay by using (26). In the worst case, the group propagation bits required in (26) need to be computed one by one from . However, the extra components for computing these group propagation bits can be removed when the group propagation bits exist in the prefix structure.

Sum Computation Unit: The pattern in Fig. 3 is used for performing the sum computation according to (34). As a matter of fact, this operator is the logic XOR operation. The pattern in Fig. 3 is a modified XOR operator in which one of the inputs is inverted. Because the computation of in (34) can be performed simultaneously with carry correction, only one XOR operation is required to perform the sum computation and no extra delay is introduced.

Numerical Example: For example, for modulo 239 (that is, $n = 8$ and $k = 4$) addition, $T = 17$. If and , the result of the modular addition is 153. According to (10), (11) and (12), pre-processing results are

Then, by using prefix tree and (13), we have

and

From (25), we can get carry correction results

Finally, the modulo addition results can be computed by(34),

That is the binary representation of 153.



This example shows the detailed design of a modulo adder based on the proposed algorithm with the Sklansky prefix tree. Two special measures are used in the proposed scheme to eliminate possible extra delay. The first is the computation of in (14), which shows the way of eliminating the delay. In fact, it is easy to satisfy the requirements of (14) for an adder based on a prefix structure. The second is the pre-processing of temporary variables in carry correction. In the worst case, are needed to determine the group propagate bits required in (25) by using independent modules. Nevertheless, the special logic resources for computing group carry information can always be reduced according to the prefix structure used in the proposed modular adder.

TABLE I. AREA OF MODULO $2^n - 2^k - 1$ ADDER BASED ON UNIT-GATE MODEL

IV. PERFORMANCE ANALYSIS AND COMPARISON

A. Performance Analysis and Comparison Based on Unit-Gate Model

According to (5), the delay of the prefix tree is always determined by the path of carry generation units, which is . However, the delay of the pre-processing units and carry generation units at the first level of the prefix tree can be reduced to . Let ,

be the inputs of pre-processing unitsand , be the outputs of pre-processingunits. If is computed by

(35)

we can get

(36)

Obviously, the critical path delay of pre-processing and the firstlevel prefix computation is . Meanwhile, we can get and

in the computation procedure of (36). As a result, there is no extra area.

The delay of carry correction units and sum computation units are both . As for the prefix operation, its delay depends on the adopted prefix structure. According to the above analysis, the critical path delay of the proposed modulo $2^n - 2^k - 1$ adder is the sum of the delay of the prefix structure and 7 unit gates. That is

(37)

where represents the delay of the prefix structure.

The proposed modulo $2^n - 2^k - 1$ adder has the advantages of a regular structure and freedom in the choice of prefix tree. In order to get more efficient performance on delay, only one special unit in the prefix tree is used. That is the pattern in Fig. 3.

According to the above analysis and the unit-gate model, the area cost of the proposed modulo $2^n - 2^k - 1$ adder is shown in Table I. In Table I, is the number of prefix operation units. The area of the carry computation module includes the area of which performs , and the area of the sum computation module includes the area of the computation of in (34). For the area of pre-processing for the carry correction module, the worst case is considered in Table I. In the worst case, all propagation bits needed in (27) are computed by independent modules, which are the patterns and in Fig. 3. The unit-gate analysis results in Table I show that the area of the proposed modular adder decreases with the increase of $k$. This is because the pre-processing unit shrinks as $k$ increases.

Table II gives the delay and area of the proposed modulo $2^n - 2^k - 1$ adder with different parallel prefix structures, such as Sklansky, Brent-Kung, Kogge-Stone, and Han-Carlson trees [22]. It is shown that the Sklansky prefix tree gives the best delay and area among the four trees. However, the fan-out of Kogge-Stone and Han-Carlson is a constant 2. In practice, we can choose a specific prefix tree as the generation computation unit according to the specific application. In the following performance analysis, the area and delay of the proposed modular adder are estimated under the worst case of the carry correction when using the Sklansky prefix tree.

TABLE II. THE DELAY AND AREA OF MODULO $2^n - 2^k - 1$ ADDER WITH DIFFERENT PREFIX STRUCTURES BASED ON UNIT-GATE MODEL

With the unit-gate model, the comparisons of area and delay

are shown in Table III. The reason why we choose these modular adders for comparison is that their moduli are the same or the algorithms adopted by them are representative.

TABLE III. THE AREA AND DELAY COMPARISON BASED ON UNIT-GATE MODEL

The modular adder based on the ELM algorithm in [11] is a class of general modular adder with a fine inline structure, but there would be considerable duplicate prefix computation units when $T$ is composed of too many 1s. In Table III, the area and delay of the ELMMA adder are estimated under the condition that there are only two 1s in $T$'s binary representation.

In [7], two binary adders are used to get $X+Y$ and $X+Y+T$ simultaneously. Similarly, an extra CLA is used to compute the carry-out bit of $X+Y+T$ in [10]. In order to perform an accurate and impartial performance analysis, the area and delay analysis for [7] and [10] is based on the same prefix tree adopted in our design. That is, the Sklansky prefix tree [22] is used in the binary adders in [7] and in the CLA in [10]. Furthermore, in the following ASIC (Application Specific Integrated Circuit) synthesis, they are also implemented based on the Sklansky prefix tree. Meanwhile, the analysis of [10] is under the assumption that there are only two 1s in $T$'s binary representation.

In [8], a new number representation method is adopted to simplify modulo addition. Conversion from binary to its special representation bears no cost. However, its addition results are in this special number representation format. Either extra area and delay are needed to convert from this format to binary number representation, or all operations, such as addition and multiplication, must be performed in this number representation format in the RNS-based system. In order to perform the comparison without the conversion effect, the conversion from its special number format to binary representation is not included in the analysis and comparisons. Table III shows that its area is similar to that of [20] and its delay is similar to that of [11].

The modulo $2^n - 2^{n-2} - 1$ adder proposed in [20] is a special case of our scheme. Since the position of the 1s in $T$ of the modulo $2^n - 2^{n-2} - 1$ adder is fixed, some optimizations can be done so as to reduce the delay of the pre-processing module. Thus, the total delay of this modular adder is .

Table III shows that the largest area is needed in [7] and the smallest is needed in our scheme. Meanwhile, Table III also shows that the fastest scheme is [20] and the slowest is [10]. However, the unit-gate model is just a reference in performance analysis. In practice, different architectures may have different abilities in the tradeoff between area and delay. In Section IV-B, we implement all schemes mentioned in Table III and perform a detailed comparison based on the commonly used synthesis tool, DC.

B. Performance Analysis and Comparison Based on Design Compiler

In order to get a more accurate performance evaluation, we design the proposed modulo $2^n - 2^k - 1$ adder with the Sklansky prefix tree and the other modulo adders mentioned in Table III in VHDL. Then, we use DC to obtain the area and delay performance. The version of DC is E-2010.12-SP5-2 for Linux, and we use its TOPOGRAPHICAL mode to get a more accurate wire load model. These designs are synthesized with the Taiwan Semiconductor Manufacturing Company (TSMC) 0.13 μm logical library. Meanwhile, the TSMC 0.13 μm physical library is used to get more accurate area and timing evaluation in the logical synthesis procedure. For comprehensive comparison, we first design the modular adders in Table III for , 6, 12 in two cases, and . Then, we design our scheme for , 3, 4, 5 when to get the performance change with different values of $k$. Two different optimization approaches are used in the following ASIC synthesis procedures.

The first optimization approach is that each design is recursively optimized until it achieves its fastest operating frequency without timing violation and the value of slack is zero. The timing constraint step is 0.01 ns in the recursive optimization procedure.

TABLE IV. ASIC SYNTHESIZED RESULTS FOR THE PROPOSED MODULAR ADDER WITH DIFFERENT $k$

TABLE V. ASIC SYNTHESIZED RESULTS FOR TIMING OPTIMIZATION I

Table IV gives the synthesis results of area and delay for our scheme when and varies from 1 to 6. The results in Table IV show that the delay and area decrease as the value of $k$ increases. They also indicate that the area and delay do not change in a linear fashion with the variation of $k$. However, the ASIC synthesis results in Table IV reveal the changing trend in delay and area with the variation of $k$.

Table V, Table VI, and Table VII give the synthesis results of area and delay for these modular adders when , 8, and 12, respectively. The values in the rightmost column of Table V, Table VI and Table VII are the area*delay ratio to ELMMA. In our design, the propagation bits needed in the carry correction unit are calculated by independent modules.

Table V, Table VI and Table VII show that [7] has the largest area and [10] has the largest delay in most cases. As for the modular adder proposed in [20], some optimization of the delay can be done because it works only in a special case, . Thus, the delay of the proposed modular adder is a little worse than that of [20] in theory. Furthermore, the overall performance, area*delay, of the proposed modular adder is similar to that of [20] when and 8. Although the area*delay performance of the proposed modular adder is



TABLE VIASIC SYNTHESIZED RESULTS FOR TIMING OPTIMIZATION II

TABLE VIIASIC SYNTHESIZED RESULTS FOR TIMING OPTIMIZATION III

larger than ELMMA in [11] when and , 12, thedelay is smaller than that of ELMMA. In fact, the delay of theproposed modulo adder is the best one at all cases. According tothe theoretical analysis, our design is not the best one in delay.However, synthesis results indicate that our scheme has betterability in the tradeoff between area and delay. For [8], the designdoes not consider the conversion from the special format in [8]to binary representation. It has similar delay and area with [11].The synthesis results in Table VVII also verify the theoreticalanalysis. These Tables also show that our scheme is better thanthat of [8] in delay and area at most cases.The second optimization approach is that these designs with

same value of are optimized for area under a same timingconstraint. Meanwhile, in order to get better area optimization,these target delays for different are set to the double of the maxvalue in the third column in Table V, VI and VII, respectively.That is, the target delay for all designs is set to 1.72 ns when, 1.82 ns when , and 2.3 ns when .Meanwhile, theset_max_area parameter in DC is set to zero for all designs. Thedifference from timing optimization approach is that we firstoptimize area and followed by delay. Table VIII is the synthesisresults for area optimization. It shows that the maximum area isneeded in [7] and the maximum delay is needed in [10] at mostcases. Our scheme has similar performance in area and delay

TABLE VIIIASIC SYNTHESIZED RESULTS FOR AREA OPTIMIZATION

, , .

with [20] when , 8 with .When with, [20] has the best performance in area because of its special

design for only one case. Table VIII also shows that our designhas little worse in area to [11] when . This is becausethe proposed modular adder needs more carry correction pre-processing units when and these pre-processing unitsare implemented independently. However, the word lengths incommon RNS-based applications are usually shorter than 8 bits.Meanwhile, the proposed adder has better performance in delay,especially when .
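The two bookkeeping rules used in this subsection, the area*delay ratio to a reference design and the target delay for area optimization, can be written down as a minimal sketch. The function and parameter names are hypothetical and no values from Tables V–VIII are reproduced here.

```python
def area_delay_ratio(area, delay, ref_area, ref_delay):
    """Area*delay product of a design, normalized to a reference design.

    In the comparison above the reference is ELMMA [11]; a value below 1.0
    means a smaller area*delay product than the reference.
    """
    return (area * delay) / (ref_area * ref_delay)


def area_opt_target_delay(timing_optimized_delays):
    """Target delay for the area optimization run.

    Follows the rule stated in the text: double the maximum delay reported
    by the timing-optimized runs for the same word length.
    """
    return 2 * max(timing_optimized_delays)
```

For instance, a 1.72 ns target would correspond to a largest timing-optimized delay of 0.86 ns under this rule.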

V. CONCLUSION

In this paper, a new class of modulo adder is proposed. The proposed structure consists of four units: the pre-processing unit, the carry computation unit, the carry correction unit, and the sum computation unit. The performance analysis and comparison show that the proposed algorithm can construct a new class of general modular adder with better performance in delay or area*delay. Its main features are as follows.

Applying carry correction twice improves the area and timing performance of the VLSI implementation and reduces the redundant units needed for the parallel computation of and in the traditional modular adders.

Any existing prefix tree can be used in this structure. This gives the proposed scheme a fine tradeoff between area and delay. The synthesis results also show that our scheme can be optimized to work at a faster operating frequency.

Furthermore, the modulus with the form of facilitates the construction of a new class of RNS with larger dynamic range and more balanced complexity among the residue channels. The work of this paper provides an alternative scheme of modular adder design for this type of RNS.
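For reference, the traditional scheme that the proposed structure improves on can be modeled behaviorally. The sketch below assumes, as in the general modular adder design recalled in the Introduction, that the two values computed in parallel are the plain sum and the sum plus a correction constant, and that the carry out of the corrected sum selects between them. The names x, y, m, n, and t are chosen here for illustration, and the code describes only the arithmetic, not the proposed carry correction hardware.

```python
def modular_add_traditional(x, y, m, n):
    """Behavioral model of (x + y) mod m for n-bit operands.

    Assumptions made for this sketch:
      0 < m <= 2**n, and both operands are already reduced (0 <= x, y < m).
    """
    if not 0 < m <= (1 << n):
        raise ValueError("modulus must satisfy 0 < m <= 2**n")
    if not (0 <= x < m and 0 <= y < m):
        raise ValueError("operands must be reduced modulo m")

    t = (1 << n) - m           # correction constant
    plain = x + y              # first candidate sum
    corrected = x + y + t      # second candidate sum, computed in parallel
    carry_out = corrected >> n  # equals 1 exactly when x + y >= m

    # With a carry out, the low n bits of the corrected sum equal x + y - m.
    return corrected & ((1 << n) - 1) if carry_out else plain
```

Computing both candidates costs two carry chains in hardware, which is the redundancy the carry correction approach is meant to remove.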

REFERENCES

[1] S. Ma, J. H. Hu, L. Zhang, and L. Xiang, "An efficient RNS parity checker for moduli set and its applications," Sci. in China, Ser. F: Inform. Sci., vol. 51, no. 10, pp. 1563–1571, Oct. 2008.

[2] Y. Liu and E. M.-K. Lai, "Design and implementation of an RNS-based 2-D DWT processor," IEEE Trans. Consum. Electron., vol. 50, no. 1, pp. 376–385, Feb. 2004.

[3] P. Patronik, K. Berezowski, S. J. Piestrak, J. Biernat, and A. Shrivastava, "Fast and energy-efficient constant-coefficient FIR filters using residue number system," in Proc. Int. Symp. Low Power Electronics and Design (ISLPED), 2011, pp. 385–390.

[4] J. C. Bajard, L. S. Didier, and T. Hilaire, "-direct form transposed and residue number systems for filter implementations," in Proc. IEEE 54th Int. Midwest Symp. Circuits and Systems (MWSCAS), 2011, pp. 1–4.

[5] M. Bayoumi, G. Jullien, and W. Miller, "A VLSI implementation of residue adders," IEEE Trans. Circuits Syst., vol. CAS-34, no. 3, pp. 284–288, Mar. 1987.

[6] S. J. Piestrak, "Design of residue generators and multioperand modular adders using carry-save adders," IEEE Trans. Comput., vol. 43, no. 1, pp. 68–77, Jan. 1994.

[7] H. Vergos, "On the design of efficient modular adders," J. Circuits, Syst., and Comput., vol. 14, no. 5, pp. 965–972, Oct. 2005.

[8] G. Jaberipur, B. Parhami, and S. Nejati, "On building general modular adders from standard binary arithmetic components," in Proc. 45th Asilomar Conf. Signals, Systems, and Computers, 2011, pp. 6–9.

[9] M. Dugdale, "VLSI implementation of residue adders based on binary adders," IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process., vol. 39, no. 5, pp. 325–329, May 1992.

[10] A. A. Hiasat, "High-speed and reduced-area modular adder structures for RNS," IEEE Trans. Comput., vol. 51, no. 1, pp. 84–89, Jan. 2002.

[11] R. A. Patel, M. Benaissa, N. Powell, and S. Boussakta, "ELMMA: A new low power high-speed adder for RNS," in Proc. IEEE Workshop on Signal Processing Systems, Oct. 2004, pp. 95–100.

[12] R. A. Patel, M. Benaissa, N. Powell, and S. Boussakta, "Novel power-delay-area-efficient approach to generic modular addition," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 6, pp. 1279–1292, Jun. 2007.

[13] E. Vassalos, D. Bakalis, and H. T. Vergos, "Modulo arithmetic units with embedded diminished-to-normal conversion," in Proc. 14th Euromicro Conf. Digital System Design (DSD), 2011, pp. 468–475.

[14] G. Jaberipur and S. Nejati, "Balanced minimal latency RNS addition for moduli set ," in Proc. 18th Int. Conf. Systems, Signals and Image Processing (IWSSIP), 2011, pp. 1–7.

[15] H. T. Vergos and C. Efstathiou, "A unifying approach for weighted and diminished-1 modulo addition," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 10, pp. 1041–1045, Oct. 2008.

[16] S. H. Lin and M. H. Sheu, "VLSI design of diminished-one modulo adder using circular carry selection," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 9, pp. 897–901, Sep. 2008.

[17] C. Efstathiou, H. T. Vergos, and D. Nikolos, "Fast parallel-prefix modulo adders," IEEE Trans. Comput., vol. 53, no. 9, pp. 1211–1216, Sep. 2004.

[18] R. A. Patel and S. Boussakta, "Fast parallel-prefix architectures for modulo addition with a single representation of zero," IEEE Trans. Comput., vol. 56, no. 11, pp. 1484–1492, Nov. 2007.

[19] P. M. Matutino, R. Chaves, and L. Sousa, "Arithmetic units for RNS moduli and operations," in Proc. 13th Euromicro Conf. Digital System Design: Architecture, Methods and Tools (DSD), 2010, pp. 243–246.

[20] R. A. Patel, M. Benaissa, and S. Boussakta, "Fast modulo addition: A new class of adder for RNS," IEEE Trans. Comput., vol. 56, no. 4, pp. 572–576, Apr. 2007.

[21] L. Li, J. Hu, and Y. Chen, "An universal architecture for designing modulo multipliers," IEICE Electron. Expr., vol. 9, no. 3, pp. 193–199, Feb. 2012.

[22] R. Zimmermann, "Binary Adder Architectures for Cell-Based VLSI and their Synthesis," Ph.D. dissertation, Integrated Syst. Lab., Swiss Federal Inst. of Technol., Zurich, 1997.

Shang Ma received the B.Eng. degree from Southwest University of Science and Technology, Mianyang, China, in 2001, and the M.Eng. and Ph.D. degrees from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2006 and 2009, respectively.
From July 2001 to May 2010, he was with Southwest University of Science and Technology, Mianyang, China. Since May 2010, he has been with the UESTC. His current research interests include computer arithmetic and baseband processing for communications.

Jian-Hao Hu received the B.Eng. and Ph.D. degrees in communication systems from the University of Electronic Science and Technology of China (UESTC) in 1993 and 1999, respectively.
He joined City University of Hong Kong from 1999 to 2000 as a postdoctoral researcher. From 2000 to 2004, he served as a Senior System Engineer at the 3G Research Center, University of Hong Kong. He has been a Professor of the National Key Laboratory of Communication of UESTC since 2005. His areas of research include high-speed low-power DSP technology with VLSI, NoC, wireless communications, and software radio.

Chen-Hao Wang received the B.Eng. degree from the University of Electronic Science and Technology of China (UESTC), Chengdu, in 2012, where he is currently pursuing the M.Eng. degree.
