
  • This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS

    A Novel Modulo $2^n - 2^k - 1$ Adder for Residue Number System

    Shang Ma, Jian-Hao Hu, Member, IEEE, and Chen-Hao Wang

    Abstract: The modular adder is one of the key components in applications of the residue number system (RNS). Moduli sets of the form $2^n - 2^k - 1$ can offer excellent balance among the RNS channels in multi-channel RNS processing. In this paper, a novel algorithm and its VLSI implementation structure are proposed for the modulo $2^n - 2^k - 1$ adder. In the proposed algorithm, parallel prefix operation and carry correction techniques are adopted to eliminate the re-computation of carries. Any existing parallel prefix structure can be used in the proposed structure, which gives a flexible tradeoff between area and delay. Compared with modular adders of the same type built on traditional structures, the proposed modulo $2^n - 2^k - 1$ adder offers better performance in delay and area.

    Index Terms: Carry correction, modular adder, parallel prefix, residue number system (RNS), VLSI.

    I. INTRODUCTION

    RESIDUE number system (RNS) is an ancient numerical representation system. It is recorded in one of the Chinese arithmetical masterpieces, the Sun Tzu Suan Jing, in the 4th century, and was transferred to Europe, where it became known as the Chinese Remainder Theorem (CRT), in the 12th century. RNS is a non-weighted numerical representation system and has the carry-free property in multiplication and addition. In recent years, it has received intensive study in very large scale integration (VLSI) design for digital signal processing (DSP) systems with high speed and low power consumption [1]-[4]. The modular adder is one of the key modules of RNS-based DSP systems.

    For integers $A$ and $B$ with $n$-bit width, the modular addition can be performed by (1) if $A$ and $B$ are less than the modulus $m$:

    $$|A+B|_m = \begin{cases} A+B, & A+B < m \\ A+B-m = |A+B+T|_{2^n}, & A+B \ge m \end{cases} \qquad (1)$$

    In (1), $T = 2^n - m$, which is referred to as the correction [5]-[8]. In the general modular adder design, the two values $A+B$ and $A+B+T$ are computed first. Then, one of them is selected as the final output. According to the form of the modulus, modular adders can be classified into two types: the general modular adder and the special modular adder.

    Manuscript received April 13, 2012; revised October 11, 2012 and December 18, 2012; accepted February 05, 2013. This work was supported in part by the National Natural Science Foundation of China under Grants 61101033 and 61070696, and by the Fundamental Research Funds for the Central Universities of China under Grant ZYGX2011J118. This paper was recommended by Associate Editor B.-H. Gwee.

    The authors are with the National Key Laboratory of Science and Technology on Communications, University of Science and Technology of China, Chengdu 611731, China (e-mail: [email protected]).

    Digital Object Identifier 10.1109/TCSI.2013.2252639

    1549-8328/$31.00 © 2013 IEEE

    For the general modular adder, Bayoumi proposed a scheme for an arbitrary modulus using two cascaded binary adders [5]. However, its delay is the sum of the delays of the two binary adders. Several works construct modular adders from two parallel binary adders that calculate $A+B$ and $A+B+T$ [6], [7]. This method achieves less delay but needs about twice the area of a binary adder. Dugdale proposed a method to construct a type of general modular adder with a reused binary adder [9]. The shortcoming of this structure is that it takes two operation cycles to perform one modular addition. The area or delay of the modular adders mentioned above is twice or more that of a binary adder. In recent studies, a few modular adders with better area and delay performance have been presented. Hiasat proposed a class of modular adders in which any regular carry look-ahead (CLA)-based binary adder can be used in the final stage [10]. However, it needs an extra CLA unit to get the carry-out bit of $A+B+T$ before the final CLA addition. As a result, the structure does not reduce the delay significantly. The ELMMA algorithm proposed by Patel et al. [11] uses two carry computation modules, for $A+B$ and $A+B+T$, in which some carry computation units can be shared. The area reduction of this scheme is dominated by the form of the modulus. In the worst case, almost two independent carry generation modules are needed. Patel et al. [12] also proposed several algorithms that generate carries quickly. A new number representation for modulo addition is proposed in [8]. However, its outputs are represented in a special format. Thus, either extra area and delay are needed to convert from the special representation to the binary number representation, or all operations in the RNS-based system must be performed in this number representation.

    On the other hand, the complexity of the special modular adder is much less than that of the general modular adder, since the structure of the special modular adder can be further optimized according to the modulus. The effective modular adders for modulo $2^n - 1$ and modulo $2^n + 1$ have drawn much more attention than other kinds of modular adders [13], [14]. An architecture for the modulo $2^n + 1$ adder based on the "diminished-1" number representation is proposed in [15] and [16]. Structures for special moduli based on parallel prefix and carry correction, respectively, are presented in [17] and [18]. An architecture similar to that of [7] is also proposed in [19] for a special modulus. In [20], Patel et al. described an implementation structure for the modulo $2^n - 2^{n-2} - 1$ adder based on the technique of carry offset, which requires the carry information of only one of the two sums. In order to obtain the carries required in the modular addition, each of these carries has to be modified according to the most significant carry. In this case, the redundant carry computation modules are eliminated, but the structure of the carry computation is fixed and can only perform this special modular addition, that is, modulo $2^n - 2^{n-2} - 1$ addition.

    One of the important issues in RNS-based applications is the selection of the moduli set. In addition- and multiplication-intensive systems, as many residue channels as possible are desired when the dynamic range is fixed, so that the word length of each residue can be reduced to achieve better speed. Meanwhile, the channel widths should be as close as possible so that the channels have similar critical path delays. That is the balance between residue channels. Moreover, the complexity of the modular adder should be evaluated carefully in residue radix selection. At present, high-performance modular adders are available for a few moduli, such as modulo $2^n - 1$ and modulo $2^n + 1$. But these moduli are not always suitable for constructing a multi-channel RNS with fine channel balance. For example, it is hard to construct a multi-channel moduli set from $2^n - 1$ and $2^n + 1$ that is both co-prime and finely balanced between channels. However, moduli of the form $2^n - 2^k - 1$ have a prominent advantage in constructing multi-channel moduli sets with fine balance [21]. Several methods exist for selecting moduli sets of this type. For instance, it can be verified that a moduli set of this type satisfies the co-prime requirement for several values of $n$, including 4, 5, 6, 8, and 12, and for further values, including 9, 10, and 11, after removing a few radixes. Meanwhile, the channel widths of these moduli sets are all $n$ bits. Thus, residue radixes of the form $2^n - 2^k - 1$ have great potential for constructing moduli sets with high efficiency, high dynamic range, and fine balance between channels. Given these advantages, it is essential to study the fundamental computation units for this radix, that is, the modulo $2^n - 2^k - 1$ adder and the modulo $2^n - 2^k - 1$ multiplier. In [21], a general architecture for the modulo $2^n - 2^k - 1$ multiplier was recently proposed. Adders for particular moduli of this kind are also proposed in [19] and [20]. However, there is little discussion of a general architecture for the modulo $2^n - 2^k - 1$ adder.

    In this paper, a new class of modulo $2^n - 2^k - 1$ adders based on carry correction and the parallel prefix algorithm is proposed. The new modular adder can be divided into four units: the pre-processing unit, the prefix computation unit, the carry correction unit, and the sum computation unit. In the proposed scheme, the carry information of $A+B+T$ computed by the prefix computation unit is modified twice to obtain the final carries required in the sum computation module. Meanwhile, any existing fast prefix structure for binary adders can be used in the proposed modular adder structure, which offers superior design flexibility. To evaluate the proposed modular adder, the unit-gate model and Synopsys Design Compiler (DC) are used to estimate its complexity and performance. The results show that the proposed modulo $2^n - 2^k - 1$ adder achieves the best delay performance. Compared with the special modulo $2^n - 2^{n-2} - 1$ adder proposed in [20], our method offers similar delay performance but can be used to design a class of modulo $2^n - 2^k - 1$ adders with different $k$ based on an identical algorithm. Moreover, compared with the ELMMA modular adder, the proposed modulo adder has better area-delay performance in most cases and can achieve a higher operating frequency.

    Fig. 1. Prefix computation-based adder structure.

    In the rest of the paper, a brief introduction to RNS and modular addition is presented in Section II. Section III introduces the algorithm and hardware architecture of the proposed modulo $2^n - 2^k - 1$ adder. The performance of the proposed modular adder is evaluated and compared with other modular adders in Section IV. Finally, we conclude the paper.

    II. BACKGROUND

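    A. RNS and Modular Addition

    RNS is defined by a group of co-prime modular radixes $\{m_1, m_2, \ldots, m_L\}$, where $\gcd(m_i, m_j) = 1$ for $i \neq j$, $M = \prod_{i=1}^{L} m_i$, and $\gcd(m_i, m_j)$ is the greatest common divisor of $m_i$ and $m_j$. An integer $X$ in $[0, M)$ can be represented uniquely by its residues with respect to the moduli $m_i$, that is, $X = (x_1, x_2, \ldots, x_L)$, where $x_i = |X|_{m_i}$, $i = 1, 2, \ldots, L$.

    Let $(a_1, a_2, \ldots, a_L)$, $(b_1, b_2, \ldots, b_L)$, and $(c_1, c_2, \ldots, c_L)$ be the RNS representations of integers $A$, $B$, and $C$ in the range $[0, M)$. According to Gaussian modular algorithms, if $C = |A \circ B|_M$, we get $c_i = |a_i \circ b_i|_{m_i}$, where $\circ$ represents addition, subtraction, or multiplication.

    For integers $A$ and $B$ in the range $[0, m)$, modulo $m$ addition is defined as

    $$|A+B|_m = \begin{cases} A+B, & A+B < m \\ A+B-m, & A+B \ge m \end{cases} \qquad (2)$$

    If the modular adder is $n$ bits wide, where $n = \lceil \log_2 m \rceil$ (that is, $n$ is the smallest integer no less than $\log_2 m$), (2) can be represented as

    $$|A+B|_m = \begin{cases} A+B, & A+B+T < 2^n \\ |A+B+T|_{2^n}, & A+B+T \ge 2^n \end{cases} \qquad (3)$$

    where the correction $T = 2^n - m$ [7], [8], [20]. That is, if the carry-out bit of $A+B+T$ is 1, the result of the modular addition is the $n$ least significant bits of $A+B+T$; otherwise, the result is $A+B$. This is the basic rule in most modular adder designs.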
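    The selection rule above, together with channel-wise RNS addition, can be sketched in a few lines of Python. This is only an illustrative behavioral model, not the hardware structure of the paper: the function names `mod_add`, `to_rns`, and `rns_add` are hypothetical, and the moduli set (239, 247, 251) is an assumed example of three pairwise co-prime 8-bit moduli of the form $2^n - 2^k - 1$, not a set taken from the paper.

```python
def mod_add(a, b, m):
    """Modular addition of residues a, b in [0, m) via the correction T."""
    n = (m - 1).bit_length()        # n = ceil(log2(m)) for m >= 2
    t = (1 << n) - m                # correction T = 2**n - m
    s = a + b + t
    if s >> n:                      # carry-out bit of A + B + T is 1
        return s & ((1 << n) - 1)   # keep the n least significant bits
    return a + b                    # otherwise A + B is already reduced


def to_rns(x, moduli):
    """Residues of x with respect to each modulus."""
    return tuple(x % m for m in moduli)


def rns_add(xs, ys, moduli):
    """Channel-wise addition: each residue channel works independently."""
    return tuple(mod_add(x, y, m) for x, y, m in zip(xs, ys, moduli))


# Assumed example set: 2**8 - 2**4 - 1, 2**8 - 2**3 - 1, 2**8 - 2**2 - 1.
MODULI = (239, 247, 251)
```

    With these definitions, `rns_add(to_rns(1000, MODULI), to_rns(2345, MODULI), MODULI)` gives the same residues as `to_rns(3345, MODULI)`, since no carries pass between channels.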

    B. Prefix Parallel Addition

    Parallel prefix operation is widely adopted in binary adder design. Each sum bit $s_i$ and carry bit $c_i$ can be calculated from the previous carries and the inputs [22]. As shown in Fig. 1, prefix-based binary adders can be divided into three units: the pre-processing unit, the prefix computation unit, and the sum computation unit.

    In the pre-processing unit, the inputs of the prefix computation are calculated as

    $$g_i = a_i \cdot b_i, \qquad p_i = a_i \oplus b_i \qquad (4)$$

    where $g_i$ and $p_i$ represent the carry generation bit and the carry propagation bit, respectively.

    The prefix computation unit is used to compute the carry information used in the sum computation unit. Prefix computation can be performed by

    $$(G_{i:j}^{l}, P_{i:j}^{l}) = (G_{i:h}^{l-1}, P_{i:h}^{l-1}) \circ (G_{h-1:j}^{l-1}, P_{h-1:j}^{l-1}) = (G_{i:h}^{l-1} + P_{i:h}^{l-1} \cdot G_{h-1:j}^{l-1},\; P_{i:h}^{l-1} \cdot P_{h-1:j}^{l-1}) \qquad (5)$$

    where $(G_{i:i}^{0}, P_{i:i}^{0}) = (g_i, p_i)$, $i \ge h > j$, and $l$ represents the stage. A smaller $l$ means a shorter delay of the carry chain. The operator $\circ$ in (5) is the prefix operator, and $(G_{i:j}^{l}, P_{i:j}^{l})$ is the prefix computation result of stage $l$ from bit $j$ to bit $i$, which is also called the group prefix computation. There are several well-known binary prefix addition structures, such as Sklansky (SK), Brent-Kung (BK), Kogge-Stone (KS), Han-Carlson (HC), ELM, and so forth [22]. The prefix structures mentioned above are usually called prefix trees.

    After prefix computation, the carries $c_i$ for each bit can be obtained. They can be computed as

    $$c_{i+1} = G_{i:0} \qquad (6)$$

    In the sum computation unit, the carries $c_i$ from the prefix computation unit and the partial sums $p_i$ from the pre-processing unit are used together to compute the final sum bits $s_i$,

    $$s_i = p_i \oplus c_i \qquad (7)$$
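    The three units of a prefix-based adder can be illustrated with a small bit-level model. The sketch below assumes a Kogge-Stone tree as the prefix structure (any of the trees named above could be substituted) and a zero carry-in; the function name `prefix_add` is hypothetical and the model is for illustration only.

```python
def prefix_add(a, b, n):
    """Return (sum mod 2**n, carry_out) of two n-bit operands."""
    # Pre-processing unit: generate and propagate bits for every position.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    # Prefix computation unit (Kogge-Stone): after the last stage,
    # G[i] is the group generate from bit i down to bit 0.
    G, P = g[:], p[:]
    d = 1
    while d < n:
        G_next, P_next = G[:], P[:]
        for i in range(d, n):
            G_next[i] = G[i] | (P[i] & G[i - d])   # prefix operator, G part
            P_next[i] = P[i] & P[i - d]            # prefix operator, P part
        G, P = G_next, P_next
        d *= 2

    # Carries: c[0] is the carry-in (0 here) and c[i + 1] = G[i].
    c = [0] + G

    # Sum computation unit: s_i = p_i XOR c_i.
    s = sum((p[i] ^ c[i]) << i for i in range(n))
    return s, c[n]
```

    Each pass of the `while` loop corresponds to one stage of the prefix tree, so the number of passes reflects the depth of the carry chain.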

    C. Unit-gate Model for Area and Delay Analysis

    The unit-gate model is one of the most commonly used models for estimating circuit complexity and performance in VLSI design. In the unit-gate model, simple two-input logic gates, such as AND, OR, NAND, and NOR, are treated as unit gates. They have the same area and the same delay, which serve as the unit area and the unit delay in this paper. More complicated two-input gates, such as XOR and XNOR, are counted as two unit areas and two unit delays in our analysis. Complex logic circuits as well as multi-input gates can be implemented with two-input unit gates, and their gate counts equal the sum of the gate counts of their unit gates [22].
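    As a rough illustration of how such estimates are formed, the following sketch tallies area and critical-path delay from gate lists. The weights (1 for AND, OR, NAND, NOR; 2 for XOR, XNOR) follow the common unit-gate convention and are an assumption here, as are the names `UNIT_GATE`, `area`, and `path_delay`.

```python
# (area, delay) in unit gates; XOR/XNOR weights of 2 are the usual convention.
UNIT_GATE = {
    "AND": (1, 1), "OR": (1, 1), "NAND": (1, 1), "NOR": (1, 1),
    "XOR": (2, 2), "XNOR": (2, 2),
}


def area(gates):
    """Total area of a circuit given as a list of two-input gate names."""
    return sum(UNIT_GATE[g][0] for g in gates)


def path_delay(path):
    """Delay of one path given as the list of gates it passes through."""
    return sum(UNIT_GATE[g][1] for g in path)


# One prefix operator: G = G1 OR (P1 AND G0), P = P1 AND P0.
prefix_operator_area = area(["AND", "OR", "AND"])    # three unit gates
prefix_operator_delay = path_delay(["AND", "OR"])    # carry generation path
```

    Under these weights a prefix operator costs three unit areas and two unit delays, and a path through one XOR, one AND, and one OR gate has a delay of four units.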

    III. PROPOSED MODULO $2^n - 2^k - 1$ ADDER

    As shown in Fig. 2, the proposed modulo $2^n - 2^k - 1$ adder is composed of four modules: the pre-processing unit, the carry generation unit, the carry correction unit, and the sum computation unit. In Fig. 2, different shades represent different processing units.

    The proposed modular adder can be divided into two general binary adders in Fig. 2, one for the lower bits and one for the higher bits, together with the carry correction and sum computation modules, according to the characteristics of the correction $T$ for modulus $2^n - 2^k - 1$. The carries used in the final stage are obtained by correcting the carries of $A+B+T$, which can be computed by any existing prefix structure with proper pre-processing. Finally, the modular addition result is obtained from the corrected carries and the partial sum information. The proposed architecture shown in Fig. 2 avoids calculating the carry information of $A+B$ and $A+B+T$ separately. Thus, the area and delay of the VLSI implementation can be reduced. Meanwhile, the proposed scheme offers a flexible tradeoff between area and delay through the choice of parallel prefix structure.

    Fig. 2. The proposed modulo $2^n - 2^k - 1$ adder structure.

    A. Pre-Processing Unit

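    The pre-processing unit is used to generate the carry generation and carry propagation bits of $A+B+T$. From (3), when $m = 2^n - 2^k - 1$,

    $$T = 2^n - m = 2^k + 1 \qquad (8)$$

    Obviously, the binary representation of $T$ is $0 \ldots 010 \ldots 01$, in which only bit $k$ and bit 0 are 1.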

    In Fig. 2, the computation of $A+B+T$ is performed by two adders: a lower adder for the lower $k$ bits and a higher adder for the higher $n-k$ bits. Let $a_{n-1} \ldots a_1 a_0$ and $b_{n-1} \ldots b_1 b_0$ be the binary representations of $A$ and $B$, respectively. The operation of the two adders can be regarded as

    (9)

    where the carry-in of the higher adder is the carry-out bit of the lower adder.

    For the lower $k$ bits of $T$, which form one of the inputs of the lower adder, every bit is 0 except the least significant bit. Thus, the lower adder can be treated as a $k$-bit adder with a carry-in at the lowest bit, which is exactly the same as a general binary adder. Its pre-processing is also similar to that of a general binary adder. The difference is that the lowest carry-in bit must be taken into account. Therefore, the carry generation and carry propagation bits are

    (10)

    The higher adder adds not only the constant 1 but also the carry-out bit of the lower adder. It can be regarded as a three-input adder with a carry-in at the lowest bit. The three inputs are the higher $n-k$ bits of $A$, the higher $n-k$ bits of $B$, and $0 \ldots 01$ in binary. In this paper, we reduce the number of inputs of the higher adder from three to two by using a Simple Carry Save Adder (SCSA). The first stage of the SCSA gives

    (11)

    The outputs of (11) are then treated as the inputs of the second stage of the SCSA, which generates the carry generation and carry propagation bits from them. In effect, it is the carry-save addition of these two binary numbers. Thus, the final outputs of the pre-processing unit for the higher adder are

    (12)

    From (10) and (12), all of the information required in the prefix computation is obtained. Furthermore, the carry-out bit of the SCSA is required to compute the carry-out bit of $A+B+T$. It is calculated as

    (13)

    B. Carry Generation Unit

    In the carry generation unit, the carries of $A+B+T$ are obtained from the carry generation and carry propagation bits produced by the pre-processing unit. Any existing prefix structure can be used to get the carries in this paper.

    It is worth pointing out that the carry-out bit of the SCSA in the pre-processing unit, as shown in (13), is not involved in the prefix computation. Instead, it is combined with the carry-out bit of the prefix tree to determine the carry-out bit of $A+B+T$

    (14)

    where .
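    The split into a lower adder with a folded-in carry-in and a higher adder with a carry-save stage can be checked with a bit-level model. The sketch below is one possible reading of the pre-processing and carry-out steps, written under stated assumptions: $T = 2^k + 1$ with $1 \le k \le n-2$, operands already reduced modulo $2^n - 2^k - 1$, and a ripple carry chain standing in for the prefix tree. The function name `add_with_correction` and the intermediate names `s1` and `u` are hypothetical and are not the paper's notation.

```python
def add_with_correction(a, b, n, k):
    """Bit-level A + B + T for T = 2**k + 1; returns (n-bit sum, carry-out)."""
    abit = [(a >> i) & 1 for i in range(n)]
    bbit = [(b >> i) & 1 for i in range(n)]
    g, p = [0] * n, [0] * n

    # Lower adder (bits 0..k-1): ordinary generate/propagate bits, with the
    # constant carry-in 1 folded into bit 0.
    for i in range(k):
        g[i] = abit[i] & bbit[i]
        p[i] = abit[i] ^ bbit[i]
    g[0] = abit[0] | bbit[0]      # carry out of bit 0 when the carry-in is 1
    p[0] ^= 1                     # sum bit 0 when the carry-in is 1

    # Higher adder (bits k..n-1), first carry-save stage: add A, B and the
    # constant 1 at bit k, giving sum bits s1 and shifted carry bits u.
    s1, u = [0] * n, [0] * (n + 1)
    for i in range(k, n):
        s1[i] = abit[i] ^ bbit[i]
        u[i + 1] = abit[i] & bbit[i]
    s1[k] ^= 1
    u[k + 1] = abit[k] | bbit[k]

    # Second carry-save stage: generate/propagate bits of s1 + u.
    for i in range(k, n):
        g[i] = s1[i] & u[i]
        p[i] = s1[i] ^ u[i]

    # Carry chain (a ripple stands in for any prefix tree) and sum bits.
    c = [0] * (n + 1)
    for i in range(n):
        c[i + 1] = g[i] | (p[i] & c[i])
    total = sum((p[i] ^ c[i]) << i for i in range(n))

    # Carry-out of A + B + T: carry-save carry-out OR carry-chain carry-out.
    return total, u[n] | c[n]
```

    For reduced operands the two carry-out contributions are never 1 at the same time, because $A+B+T < 2^{n+1}$, so a single OR gate is enough to combine them.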

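    C. Carry Correction Unit

    The carry correction unit is used to get the real carries for each bit needed in the final sum computation stage. In order to reduce the area, we get the carries of $A+B$ by correcting the carries of $A+B+T$ in the carry correction unit.

    We first derive the relation between $c_i^0$ and $c_i^1$ in binary addition in Theorem 1, where $c_i^0$ and $c_i^1$ are the carry outputs of the prefix tree when the lowest carry-in is 0 and 1, respectively.

    Theorem 1: Let $c_i$ ($1 \le i \le n$) be the carry bits of an $n$-bit adder, which are propagated to the higher adjacent positions, let $c_{in}$ be the lowest carry-in (that is, $c_0$), and let $c_{out}$ be the final carry-out bit (that is, $c_n$). Denoting the carries for each bit by $c_i^0$ when $c_{in} = 0$ and by $c_i^1$ when $c_{in} = 1$, we get the relationship

    $$c_i^0 = c_i^1 \cdot \overline{P_{i-1:0}}$$

    Proof: Let $a_{n-1} \ldots a_1 a_0$ and $b_{n-1} \ldots b_1 b_0$ be the binary representations of $A$ and $B$, respectively. Then, we have $g_i = a_i \cdot b_i$ and $p_i = a_i \oplus b_i$. According to the parallel prefix algorithm, we have

    $$c_i = G_{i-1:0} + P_{i-1:0} \cdot c_{in}$$

    which can be rewritten as $c_i^0 = G_{i-1:0}$ and $c_i^1 = G_{i-1:0} + P_{i-1:0}$.

    If $P_{i-1:0} = 1$, then $p_j = 1$ and hence $g_j = 0$ for all $0 \le j \le i-1$, which yields $c_i^0 = 0$ and $c_i^1 = 1$. Thus, we have $c_i^0 = c_i^1 \cdot \overline{P_{i-1:0}}$.

    If $P_{i-1:0} = 0$, it means that $c_{in}$ cannot be propagated to $c_i$. Hence, $c_i = G_{i-1:0}$, which is independent of $c_{in}$. That means $c_i^0 = c_i^1$.

    Thus, $c_i^0 = c_i^1 \cdot \overline{P_{i-1:0}}$. Q.E.D.

    Theorem 1 means that $c_i^0$ can be determined from $c_i^1$ by a simple logic operation. That is the foundation of the carry correction in the proposed modular adder. We present the procedure of the carry correction in our scheme based on Theorem 1 as follows.

    For the proposed modulo $2^n - 2^k - 1$ adder, $T = 2^k + 1$ and can be represented as $0 \ldots 010 \ldots 01$ in binary. The computation of $A+B+T$ can be divided into two steps, adding the 1 at bit 0 and adding the 1 at bit $k$.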

    The two 1 bits in the binary representation of $T$ can be regarded as the carry-in bits of the lower adder and the higher adder shown in Fig. 2, respectively. Correspondingly, the carry bits of $A+B$ can be obtained by applying carry correction twice to the carries of $A+B+T$ based on Theorem 1. The first correction removes the effect of the 1 at bit 0, and the second correction removes the effect of the 1 at bit $k$, which yields the carries of $A+B$. Whether carry correction is performed or not depends on the carry-out bit of $A+B+T$ in (14).

    Carry Correction for the Lower Adder

    Since the least significant bit of $T$ is 1, the carries of the lower $k$ bits of $A+B+T$ can be regarded as the carry bits of the lower adder with its lowest carry-in equal to 1. Therefore, they can be modified with Theorem 1 to determine the carry bits of the same addition with the lowest carry-in equal to 0, that is

    (15)
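    The correction step based on Theorem 1 can be checked exhaustively for small widths. The sketch below assumes the relation implied by the proof, namely that a carry computed with the lowest carry-in equal to 1 becomes the carry for carry-in 0 once it is ANDed with the complement of the group propagate bit below it. The function names `carries` and `corrected` are hypothetical.

```python
def carries(a, b, n, cin):
    """Carries c_1..c_n of the n-bit addition a + b + cin (ripple model)."""
    c, out = cin, []
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | ((ai ^ bi) & c)
        out.append(c)
    return out


def corrected(a, b, n):
    """Derive the carry-in-0 carries from the carry-in-1 carries."""
    c1 = carries(a, b, n, 1)
    out, group_p = [], 1
    for i in range(n):
        group_p &= ((a >> i) ^ (b >> i)) & 1   # group propagate of bits i..0
        out.append(c1[i] & (1 - group_p))      # one AND gate per carry
    return out
```

    Comparing `corrected(a, b, n)` with `carries(a, b, n, 0)` over all operand pairs confirms that no second carry chain is needed to remove a constant carry-in.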

    One point requires attention when performing (15). The lowest propagation bit used in (15) is not equal to that in (10), because (10) already absorbs the constant carry-in. According to Theorem 1, the carries are corrected only when the carry-out bit of $A+B+T$ is 0. A 2-to-1 multiplexer (MUX) can perform this operation. For this MUX, the carry-out bit in (14) is the control signal, while the original and corrected carries are the input signals. The output is the result of the first correction, denoted as

    (16)

    Carry Correction forFrom (16), is the carry infor-mation of or after the correctionfor adder . Then we can perform the second correctionbased on and let the carry bits of the second correctionbe . Similar to the first correction, is the carry of

    (that is, ) when . Oth-erwise, is the carry of . That is, is thefinal carry information needed in sum computation unit.

    When , . The bit 1 in willnot affect . Hence,

    (17)

    When , the inputs of adder in Fig. 2are and . And the carry-in bitis the carry-out bit of adder , that is, . Considering the leastsignificant bit of is 1, we can treat the oper-ation of adder as the addition of two inputs,and , with the lowest carry-in bit 1. Thatis, the results and carry information of , in (18) are iden-tical

    (18)Thus, we can get the carries of by modifying the car-

    ries of adder with Theorem 1.Combined with the final carry-out bit of , , thecarries required by the proposed modular adder are deter-mined.Since the second carry correction is performed under the con-

    dition that the lowest carry-in bit of adder is a constant 1,the propagation bits used in the carry correction unit shouldbe computed by and . Fromthe above analysis, it is shown that the difference between these

    two additions in (18) is that the least significant bits, 1 forin (18) and for in

    (18) . The propagation carry information can be computedfrom (11) and (12). Let be the propagate carries of (18) ,we have

    (19)

    Let be the group propagate carries, then

    (20)

    When , according to Theorem 1and (16), the carries after the second carry correction are

    (21)

    Substituting (19) into (21), we get

    (22)

    Substituting (16) into (22), we get

    (23)

    When . Similarly, we get

    (24)

    According to (16), (17), (23) and (24), the carry bits requiredby the proposed modular adder are determined as shown in (25),at the bottom of the page.

    (25)


    Let

    (26)

    Then

    .(27)

    From the unit-gate evaluation model, the delay of computingis in (25) when , which is identical to the

    delay of a prefix computation unit. It is shown from Fig. 2 thatthe pre-processing units of the proposed modular adder guaran-tees that is determined before atleast for most prefix structures.If is determined before no less

    than two stages prefix computation, the delay of computingand in (26) is the delay sum of one XOR, one AND, and oneOR gate. That is, the total delay is . That means the outputtime of is identical with that of andin (26). Thus, there is also no extra delay.If is determined before only less

    than one stage prefix computation delay, the delay of computingand should be reduced to at most one prefix computation

    delay through special pre-processing to eliminate the possibleextra delay. In order to achieve this purpose, can be usedas the selection signal for the MUX. Meanwhile, and canbe pre-computed and used as the inputs of the MUX. Letand be the value of and when respectively.Similarly, let and be the value of and when ,respectively. We get

    (28)

    and

    (29)Thus, we can get the carry information that will be used in

    the sum computation unit of the proposed modular adder.

    D. The Sum Computation

    Generally, the sum computation is as same as that in prefix-based binary adder. However, is the correction result when

    is taken into account. That is, if , is the carrybit of . Otherwise, it is the carry bit of . Thus,the partial sum bits of and are both requiredin the final sum computation. Let andbe the partial sum bits of and respectively.Note that has been determined in the

    pre-processing unit (that is, ). Besides, justwhen and . Consequently

    (30)Hence

    (31)

    (32)

    When

    (33)

    At last, the sum bits are

    (34)In (34), and can be obtained at the same time.Therefore, there is no extra delay comparedwith other sum com-putation units.

    E. Design Example

    The VLSI implementation structure of moduloadder based on the proposed scheme is shown in Fig. 3(a).

    Fig. 3(b) illustrates the function of each module. Pre-processing UnitThe pattern in Fig. 3 is the pre-processing unit andused to generate carry generation and carry propagationbits for the following prefix computation. Since there arefixed 1 inputs at the 1st and the 4th places, the patterns and are used for this special situations. The pat-tern does not cost any resource in unit-gate model.The computations of these patterns correspond to (10), (11)and (12).

    Prefix ComputationThe pattern is the prefix computation unit. In this ex-ample, the Sklansky prefix tree is used and there are 11prefix computation units, which corresponds to (4). Thedelay of is determined by its carry generation pathwhich is one OR gate and one AND gate. However, thepattern in the final stage of prefix tree is not neededto compute propagation bits.

    The Computation ofThe is computed by pattern in Fig. 3. Ac-cording to (14), ,

    and can be computed con-currently. Then, we can get after an OR gate. Thus,the delay of computation will not exceeding thedelay of pattern and there is no extra delay. In orderto minimize the delay of , the value of can be se-lected so that the delay difference of and is at


    Fig. 3. Modulo adder based on the proposed structure.

    least one OR gate delay. In (14), is computed inpre-processing firstly. Meanwhile, the delay of isalways smaller than that of in prefix tree. Thus, wecan compute firstly if is obtainedbefore . Otherwise, we can computefirstly. When the last one, or , arrived, only oneOR gate is needed to compute the final value of . Thatis, the delay is if the value of is selected properly. Inthis example, .

    Carry Correction UnitThe pattern in Fig. 3 performs the computation corre-spond to (27). In this example, 7 correction operators areused. From (27), there are three different situations, that is

    , and . The, and can be computed by independent modules.

    The pattern and in Fig. 3 is used to compute, and in (27). In this example, is computed

    out before with two prefix com-putation stages. Hence, we can get and without extradelay by using (26). In the worst case, the group propaga-tion bits required in (26) are needed to be computed oneby one from . However, the extracomponents for computing these group propagation bitscan be removed when the group propagation bits exist inprefix structure.

    Sum Computation Unit
    The pattern in Fig. 3 is used for performing the sum computation according to (34). As a matter of fact, this operator is the logic XOR operation. The pattern in Fig. 3 is a modified XOR operator in which one of the inputs is inverted. Because the computation of in (34) can be performed simultaneously with carry correction, only one XOR operation is required to perform the sum computation, and no extra delay is introduced.

    Numerical Example
    For example, for modulo 239 (that is, and ) addition, . If and , the result of the modular addition is 153. According to (10), (11) and (12), the pre-processing results are

    Then, by using the prefix tree and (13), we have

    and

    From (25), we can get the carry correction results

    Finally, the modulo addition result can be computed by (34),

    That is the binary representation of 153.
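As a software cross-check of the numerical example, the following minimal Python sketch models modular addition with a correction term, following the general scheme of (1): both A + B and A + B + T are formed, and the carry-out of the corrected sum selects the output. It assumes that the modulus 239 is written as 2^8 - 2^4 - 1 (so n = 8, k = 4 and T = 2^n - 239 = 17). The function name `mod_add_with_correction` and the operand pair 200 and 192 are hypothetical; the operands were chosen only because their sum is 153 modulo 239 and are not claimed to be those of the original example. This is a behavioural model, not the gate-level structure of Fig. 3.

```python
def mod_add_with_correction(a: int, b: int, n: int, k: int) -> int:
    """Behavioural model of (a + b) mod (2**n - 2**k - 1) for residue operands."""
    modulus = (1 << n) - (1 << k) - 1
    if not (0 <= a < modulus and 0 <= b < modulus):
        raise ValueError("operands must be residues of the modulus")
    correction = (1 << n) - modulus        # T = 2**k + 1
    plain = a + b                          # A + B
    corrected = a + b + correction         # A + B + T
    carry_out = corrected >> n             # 1 exactly when A + B >= modulus
    mask = (1 << n) - 1                    # keep the low n bits
    return (corrected & mask) if carry_out else plain


if __name__ == "__main__":
    # Hypothetical operands (assumption): 200 + 192 = 392 = 239 + 153.
    result = mod_add_with_correction(200, 192, n=8, k=4)
    print(result, format(result, "08b"))
```

Selecting on the carry-out of A + B + T avoids an explicit magnitude comparison against the modulus, which is the reason the two candidate sums are formed in the general modular adder design.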


    TABLE I. AREA OF MODULO ADDER BASED ON UNIT-GATE MODEL

    This example shows the detailed design of the modulo adder based on the proposed algorithm with the Sklansky prefix tree. Two special measures are used in the proposed scheme to eliminate possible extra delay. The first is the computation of in (14), which shows how the delay is eliminated. In fact, it is easy to satisfy the requirements of (14) for an adder based on a prefix structure. The second is the pre-processing of temporary variables in carry correction. In the worst case, are needed to determine the group propagation bits required in (25) by using independent modules. Nevertheless, the dedicated logic resources for computing group carry information can always be reduced according to the prefix structure used in the proposed modular adder.
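The prefix computation stage used in this example can be illustrated in software. The following minimal Python sketch computes the carries of a plain n-bit binary addition with a Sklansky-style (divide-and-conquer) parallel prefix network over generate/propagate pairs. It is a generic textbook model that assumes n is a power of two and no carry-in; it does not include the pre-processing of (10)-(12) or the carry correction of the proposed adder, and the names `prefix_op` and `sklansky_carries` are hypothetical.

```python
def prefix_op(hi, lo):
    """Prefix operator: merge the (g, p) pair of a higher group with a lower one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    # One AND gate followed by one OR gate on the generate path.
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def sklansky_carries(a: int, b: int, n: int):
    """Return (carries, op_count); carries[i] is the carry out of bit i."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    # Bit-level generate (AND) and propagate (XOR) signals.
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    ops = 0
    span = 1
    while span < n:
        for i in range(n):
            if i & span:  # bit i lies in the upper half of its 2*span block
                j = (i - i % (2 * span)) + span - 1  # top bit of the lower half
                gp[i] = prefix_op(gp[i], gp[j])
                ops += 1
        span *= 2
    return [g for g, _ in gp], ops


if __name__ == "__main__":
    carries, ops = sklansky_carries(200, 192, 8)  # hypothetical operands
    print(carries, ops)
```

For this unmodified network the operator count is (n/2)·log2(n), so it need not equal the number of prefix units in the tailored tree of Fig. 3, where some operators are simplified or omitted.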

    IV. PERFORMANCE ANALYSIS AND COMPARISON

    A. Performance Analysis and Comparison Based on Unit-Gate Model

    According to (5), the delay of the prefix tree is always determined by the path of the carry generation units, which is . However, the delay of the pre-processing units and the carry generation units at the first level of the prefix tree can be reduced to . Let , be the inputs of the pre-processing units and , be the outputs of the pre-processing units. If is computed by

    (35)

    we can get

    (36)

    Obviously, the critical path delay of the pre-processing and the first-level prefix computation is . Meanwhile, we can get and in the computation procedure of (36). As a result, there is no extra area.

    The delay of the carry correction units and that of the sum computation units are both . As for the prefix operation, its delay depends on the adopted prefix structure. According to the above analysis, the critical path delay of the proposed modulo adder is the sum of the delay of the prefix structure and 7 unit gates. That is

    (37)

    where represents the delay of the prefix structure.

    TABLE II. THE DELAY AND AREA OF MODULO ADDER WITH DIFFERENT PREFIX STRUCTURES BASED ON UNIT-GATE MODEL

    The proposed modulo adder has the advantages of a regular structure and free choice of prefix tree. In order to get better delay performance, only one special unit is used in the prefix tree. That is the pattern in Fig. 3.

    According to the above analysis and the unit-gate model, the area cost of the proposed modulo adder is shown in Table I. In Table I, is the number of prefix operation units. The area of the carry computation module includes the area of , which performs , and the area of the sum computation module includes the area of the computation of in (34). For the pre-processing of the carry correction module, Table I considers the worst case, in which all propagation bits needed in (27) are computed by independent modules, namely the patterns and in Fig. 3. The unit-gate analysis results in Table I show that the area of the proposed modular adder decreases as increases. This is because the number of pre-processing units decreases as increases.

    Table II gives the delay and area of the proposed modulo adder with different parallel prefix structures, such as the Sklansky, Brent-Kung, Kogge-Stone, and Han-Carlson trees [22]. It shows that the Sklansky prefix tree gives the best delay and area among the four trees. However, the fan-out of the Kogge-Stone and Han-Carlson trees is a constant 2. In practice, we can choose a specific prefix tree as the generation computation unit according to the specific application. In the following performance analysis in this paper, the area and delay of the proposed modular adder are estimated under the worst case of the carry correction when using the Sklansky prefix tree.

    With the unit-gate model, the comparisons of area and delay
    are shown in Table III. These modular adders are chosen for comparison because their moduli are the same or because the algorithms they adopt are representative.

    TABLE III. THE AREA AND DELAY COMPARISON BASED ON UNIT-GATE MODEL

    The modular adder based on the ELM algorithm in [11] is a class of general modular adder with a fine inline structure, but there would be considerable duplicate prefix computation units when is composed of too many 1s. In Table III, the area and delay of the ELMMA adder are estimated under the condition that there are only two 1s in the binary representation of .

    In [7], two binary adders are used to get and simultaneously. Similarly, an extra CLA is used to compute the carry-out bit of in [10]. In order to perform an accurate and impartial performance analysis, the area and delay analysis for [7] and [10] is based on the same prefix tree adopted in our design. That is, the Sklansky prefix tree [22] is used in the binary adders in [7] and in the CLA in [10]. Furthermore, in the following ASIC (Application Specific Integrated Circuit) synthesis they are also implemented based on the Sklansky prefix tree. Meanwhile, the analysis of [10] is under the assumption that there are only two 1s in the binary representation of .

    In [8], a new number representation method is adopted to simplify modulo addition. Conversion from binary to its special representation bears no cost. However, its addition results are in this special number representation format. Either extra area and delay must be spent on the conversion from this format to binary number representation, or all operations, such as addition and multiplication, must be performed in this number representation format in the RNS-based system. In order to perform the comparison without the conversion effect, the conversion from its special number format to binary representation is not included in the analysis and comparisons. Table III shows that its area is similar to that of [20] and its delay is similar to that of [11].

    The modulo adder proposed by [20] is a special case of our scheme. Since the position of 1 in of the modulo adder is fixed, some optimizations can be done so as to reduce the delay of the pre-processing module. Thus, the total delay of this modular adder is .

    Table III shows that the largest area is needed in [7] and the smallest is needed in our scheme. Meanwhile, Table III also shows that the fastest scheme is [20] and the slowest is [10]. However, the unit-gate model is just a reference in performance analysis. In practice, different architectures may differ in their ability to trade off area against delay. In Section IV-B, we will implement all schemes mentioned in Table III and perform a detailed comparison based on the commonly used synthesis tool, Design Compiler (DC).

    B. Performance Analysis and Comparison Based on Design Compiler

    In order to get a more accurate performance evaluation, we design the proposed modulo adder with the Sklansky prefix tree and the other modulo adders mentioned in Table III in VHDL. Then, we use DC to get the area and delay performance. The version of DC is E-2010.12-SP5-2 for Linux, and we use its topographical mode to get a more accurate wire-load model. These designs are synthesized with the Taiwan Semiconductor Manufacturing Company (TSMC) 0.13 µm logical library. Meanwhile, the TSMC 0.13 µm physical library is used to get a more accurate area and timing evaluation in the logical synthesis procedure. For a comprehensive comparison, we first design the modular adders in Table III for , 6, 12 in two cases, and . Then, we design our scheme for , 3, 4, 5 when to get the performance change with different values of . Two different optimization approaches are used in the following ASIC synthesis procedures.

    TABLE IV. ASIC SYNTHESIZED RESULTS FOR THE PROPOSED MODULAR ADDER WITH DIFFERENT

    TABLE V. ASIC SYNTHESIZED RESULTS FOR TIMING OPTIMIZATION I

    In the first optimization approach, each design is recursively optimized until it achieves the fastest operating frequency without timing violation and the slack is zero. The timing constraint step is 0.01 ns in the recursive optimization procedure.

    Table IV gives the synthesis results of area and delay for our
    scheme when and varies from 1 to 6. The results in Table IV show that the delay and area decrease as the value of increases. They also indicate that the area and delay do not change linearly with . Nevertheless, the ASIC synthesis results in Table IV reveal the trend of delay and area as varies.

    Table V, Table VI, and Table VII give the synthesis results of area and delay for these modular adders when , 8, and 12, respectively. The values in the rightmost column of Table V, Table VI and Table VII are the area*delay ratios relative to ELMMA. In our design, the propagation bits needed in the carry correction unit are calculated by independent modules.

    TABLE VI. ASIC SYNTHESIZED RESULTS FOR TIMING OPTIMIZATION II

    TABLE VII. ASIC SYNTHESIZED RESULTS FOR TIMING OPTIMIZATION III

    Table V, Table VI and Table VII show that [7] has the largest area and [10] has the largest delay in most cases. As for the modular adder proposed by [20], some optimization of the delay can be done because it works only in a special case, . Thus, the delay of the proposed modular adder is a little worse than that of [20] in theory. Furthermore, the overall performance (area*delay) of the proposed modular adder is similar to that of [20] when and 8. Although the area*delay product of the proposed modular adder is larger than that of ELMMA in [11] when and , 12, its delay is smaller than that of ELMMA. In fact, the delay of the proposed modulo adder is the best in all cases. According to the theoretical analysis, our design is not the best in delay; however, the synthesis results indicate that our scheme offers a better tradeoff between area and delay. For [8], the design does not consider the conversion from the special format in [8] to binary representation. It has similar delay and area to [11]. The synthesis results in Tables V–VII also verify the theoretical analysis. These tables also show that our scheme is better than that of [8] in delay and area in most cases.

    The second optimization approach is that these designs with
    same value of are optimized for area under the same timing constraint. Meanwhile, in order to get better area optimization, the target delays for different are set to double the maximum value in the third column of Table V, VI and VII, respectively. That is, the target delay for all designs is set to 1.72 ns when , 1.82 ns when , and 2.3 ns when . Meanwhile, the set_max_area parameter in DC is set to zero for all designs. Unlike the timing optimization approach, area is optimized first, followed by delay. Table VIII gives the synthesis results for area optimization. It shows that the maximum area is needed in [7] and the maximum delay is needed in [10] in most cases. Our scheme has similar performance in area and delay to [20] when , 8 with . When with , [20] has the best performance in area because of its special design for only one case. Table VIII also shows that our design is slightly worse in area than [11] when . This is because the proposed modular adder needs more carry correction pre-processing units when and these pre-processing units are implemented independently. However, the word lengths in common RNS-based applications are usually shorter than 8 bits. Meanwhile, the proposed adder has better performance in delay, especially when .

    TABLE VIII. ASIC SYNTHESIZED RESULTS FOR AREA OPTIMIZATION

    V. CONCLUSION

    In this paper, a new class of modulo adder is proposed. The proposed structure consists of four units: the pre-processing, the carry computation, the carry correction and the sum computation units. The performance analysis and comparison show that the proposed algorithm can construct a new class of general modular adder with better performance in delay or area*delay. Its main features are as follows.

    Performing carry correction twice improves the area and timing of the VLSI implementation and removes the redundant units for the parallel computation of and in the traditional modular adders.

    Any existing prefix tree can be used in this structure, which gives the proposed scheme a flexible tradeoff between area and delay. The synthesis results also show that our scheme can be optimized to work at a higher operating frequency.

    Furthermore, moduli of the form facilitate the construction of a new class of RNS with larger dynamic range and more balanced complexity among the residue channels. The work of this paper provides an alternative scheme of modular adder design for this type of RNS.

    REFERENCES

    [1] S. Ma, J. H. Hu, L. Zhang, and L. Xiang, "An efficient RNS parity checker for moduli set and its applications," Sci. in China, Ser. F: Inform. Sci., vol. 51, no. 10, pp. 1563–1571, Oct. 2008.

    [2] Y. Liu and E. M.-K. Lai, "Design and implementation of an RNS-based 2-D DWT processor," IEEE Trans. Consum. Electron., vol. 50, no. 1, pp. 376–385, Feb. 2004.


    [3] P. Patronik, K. Berezowski, S. J. Piestrak, J. Biernat, and A. Shrivastava, "Fast and energy-efficient constant-coefficient FIR filters using residue number system," in Proc. Int. Symp. Low Power Electronics and Design (ISLPED), 2011, pp. 385–390.

    [4] J. C. Bajard, L. S. Didier, and T. Hilaire, "-direct form transposed and residue number systems for filter implementations," in Proc. IEEE 54th Int. Midwest Symp. Circuits and Systems (MWSCAS), 2011, pp. 1–4.

    [5] M. Bayoumi, G. Jullien, and W. Miller, "A VLSI implementation of residue adders," IEEE Trans. Circuits Syst., vol. CAS-34, no. 3, pp. 284–288, Mar. 1987.

    [6] S. J. Piestrak, "Design of residue generators and multioperand modular adders using carry-save adders," IEEE Trans. Comput., vol. 43, no. 1, pp. 68–77, Jan. 1994.

    [7] H. Vergos, "On the design of efficient modular adders," J. Circuits, Syst., and Comput., vol. 14, no. 5, pp. 965–972, Oct. 2005.

    [8] G. Jaberipur, B. Parhami, and S. Nejati, "On building general modular adders from standard binary arithmetic components," in Proc. 45th Asilomar Conf. Signals, Systems, and Computers, 2011, pp. 6–9.

    [9] M. Dugdale, "VLSI implementation of residue adders based on binary adders," IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process., vol. 39, no. 5, pp. 325–329, May 1992.

    [10] A. A. Hiasat, "High-speed and reduced-area modular adder structures for RNS," IEEE Trans. Comput., vol. 51, no. 1, pp. 84–89, Jan. 2002.

    [11] R. A. Patel, M. Benaissa, N. Powell, and S. Boussakta, "ELMMA: A new low power high-speed adder for RNS," in Proc. IEEE Workshop on Signal Processing Systems, Oct. 2004, pp. 95–100.

    [12] R. A. Patel, M. Benaissa, N. Powell, and S. Boussakta, "Novel power-delay-area-efficient approach to generic modular addition," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 6, pp. 1279–1292, Jun. 2007.

    [13] E. Vassalos, D. Bakalis, and H. T. Vergos, "Modulo arithmetic units with embedded diminished-to-normal conversion," in Proc. 14th Euromicro Conf. Digital System Design (DSD), 2011, pp. 468–475.

    [14] G. Jaberipur and S. Nejati, "Balanced minimal latency RNS addition for moduli set ," in Proc. 18th Int. Conf. Systems, Signals and Image Processing (IWSSIP), 2011, pp. 1–7.

    [15] H. T. Vergos and C. Efstathiou, "A unifying approach for weighted and diminished-1 modulo addition," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 10, pp. 1041–1045, Oct. 2008.

    [16] S. H. Lin and M. H. Sheu, "VLSI design of diminished-one modulo adder using circular carry selection," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 55, no. 9, pp. 897–901, Sep. 2008.

    [17] C. Efstathiou, H. T. Vergos, and D. Nikolos, "Fast parallel-prefix modulo adders," IEEE Trans. Comput., vol. 53, no. 9, pp. 1211–1216, Sep. 2004.

    [18] R. A. Patel and S. Boussakta, "Fast parallel-prefix architectures for modulo addition with a single representation of zero," IEEE Trans. Comput., vol. 56, no. 11, pp. 1484–1492, Nov. 2007.

    [19] P. M. Matutino, R. Chaves, and L. Sousa, "Arithmetic units for RNS moduli and operations," in Proc. 13th Euromicro Conf. Digital System Design: Architecture, Methods and Tools (DSD), 2010, pp. 243–246.

    [20] R. A. Patel, M. Benaissa, and S. Boussakta, "Fast modulo addition: A new class of adder for RNS," IEEE Trans. Comput., vol. 56, no. 4, pp. 572–576, Apr. 2007.

    [21] L. Li, J. Hu, and Y. Chen, "An universal architecture for designing modulo multipliers," IEICE Electron. Expr., vol. 9, no. 3, pp. 193–199, Feb. 2012.

    [22] R. Zimmermann, "Binary Adder Architectures for Cell-Based VLSI and their Synthesis," Ph.D. dissertation, Integrated Syst. Lab., Swiss Federal Inst. of Technol., Zurich, 1997.

    Shang Ma received the B.Eng. degree from Southwest University of Science and Technology, Mianyang, China, in 2001, and the M.Eng. and Ph.D. degrees from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2006 and 2009, respectively.

    From July 2001 to May 2010, he was with Southwest University of Science and Technology, Mianyang, China. Since May 2010, he has been with UESTC. His current research interests include computer arithmetic and baseband processing for communications.

    Jian-Hao Hu received the B.Eng. and Ph.D. degrees in communication systems from the University of Electronic Science and Technology of China (UESTC) in 1993 and 1999, respectively.

    He joined City University of Hong Kong from 1999 to 2000 as a postdoctoral researcher. From 2000 to 2004, he served as a Senior System Engineer at the 3G Research Center, University of Hong Kong. He has been a Professor of the National Key Laboratory of Communication of UESTC since 2005. His areas of research include high-speed low-power DSP technology with VLSI, NoC, wireless communications, and software radio.

    Chen-Hao Wang received the B.Eng. degree from the University of Electronic Science and Technology of China (UESTC), Chengdu, in 2012, where he is currently pursuing the M.Eng. degree.