
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 12, DECEMBER 2014

A Systematic Design Methodology for

Low-Power NoCs

Gursharan Reehal, Member, IEEE, and Mohammed Ismail, Fellow, IEEE

Abstract: Network-on-chip (NoC) communication architectures are emerging as the most scalable and efficient solution to handle on-chip communication challenges in the multicore era. In NoCs, power estimation in the early stages of the design helps designers optimize the design for energy consumption and efficiently map applications to achieve low-power solutions. However, in 90-nm designs or below, the impact of parasitics not only influences timing closure, but also leads to variability in power and area budgets among different NoC architectures. There is a growing need for advanced design methodologies to overcome these issues in NoC designs. This paper presents a system-level design methodology based on layout and power models to achieve low-power and high-performance NoC designs. The impact of global interconnects, with and without repeater insertion, on bandwidth and power is considered. The width and spacing of global interconnects and their effect on performance and power dissipation are analyzed. For architectural-level power analysis, different router designs for the Chip-Level Integration of Communicating Heterogeneous Elements (CLICHE), Butterfly Fat Tree (BFT), Scalable, Programmable, Integrated Network (SPIN), and Octagon NoC architectures are implemented using ARM's 65-nm standard cell library in the 65-nm Taiwan Semiconductor Manufacturing Corporation (TSMC) process. The router designs are synthesized in the RVT process using a Vdd of 1.0 V and a temperature of 25 °C. The Synopsys PrimeTime PX design tool is used for calculating the average power dissipation of the router designs.

Index Terms: Bandwidth, Butterfly Fat Tree (BFT), Chip-Level Integration of Communicating Heterogeneous Elements (CLICHE), delay, interconnects, IP-based, network-on-chip (NoC), Octagon, performance, power models, Scalable Programmable Integrated Network (SPIN).

I. INTRODUCTION

AS THE SEMICONDUCTOR industry moves toward complex system-on-chip (SoC) designs containing hundreds or thousands of heterogeneous IP blocks, network-on-chip (NoC) designs are emerging as one of the most effective and reliable choices of communication fabric. NoCs are packet-switched interconnection networks, integrated onto a single chip, and their operation is based on the operating principles of macronetworks. NoC attempts to simplify the global communication problem by providing various component-level

Manuscript received September 17, 2012; revised April 21, 2013 and August 25, 2013; accepted November 26, 2013. Date of publication March 3, 2014; date of current version November 20, 2014. This work was supported in part by ATIC, Abu Dhabi, and in part by the KSRC, Kustar, UAE.

G. Reehal is with the Department of Electrical and Computer Engi-neering, The Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected] ).

M. Ismail was with the Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210 USA. He is now with KUSTAR, UAE (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2013.2296742

Fig. 1. NoC architecture and its main components.

architectures with specific interconnection network topologies. Some of the main topologies for NoC architectures are Chip-Level Integration of Communicating Heterogeneous Elements (CLICHE), Butterfly Fat Tree (BFT), Scalable, Programmable, Integrated Network (SPIN), and Octagon [2]-[4]. NoC is scalable due to its inherent structure and design. An NoC architecture is primarily composed of three main components: 1) switches (or routers); 2) interswitch links; and 3) repeaters. The NoC architecture (or topology) specifies the physical arrangement of the communication network. It defines how nodes (IPs), switches (routers), and links are connected to each other. The success of an NoC design heavily depends on its power budget. A high-level network model of an NoC architecture is shown in Fig. 1.

In an NoC design, the wires linking two switches are called interconnects. These interconnects play a crucial role in the overall system performance and can have a large impact on the total power consumption. In older technologies, when wires were much wider and thicker, it was possible to treat on-chip interconnects as purely capacitive loads of logic gates, i.e., these wires had no intrinsic delays of their own and were modeled as short circuits. However, with further reduction in wire widths, interconnect capacitance became comparable with gate capacitance and had to be included in wire modeling. Now, with technology scaling into the deep nanometer regime, wires have become much narrower, driving up their resistance and capacitance to the point that, in many paths, the wire RC delay exceeds the gate delay and can severely impact the achievable system bandwidth and thus NoC performance.

One of the physical constraints in the implementation of NoC networks is the available wiring area, as most complex SoC designs are generally wire limited. The silicon area required by these systems is primarily determined by the interconnect area, and the choice of network dimension is therefore influenced by how well the resulting topology makes use of the available wiring area. One such measure that can relate network topology to the wiring constraint is the performance measure bisection bandwidth, inherited from macrocomputer networks. Bisection bandwidth is defined as the minimum number of wires that must be cut when the network is divided into two equal sets of nodes. Since the primary goal here is to judge the bandwidth and resulting wiring demand, only data lines are considered in the estimation process. Bisection bandwidth is a static measure, provides only rough estimates, and can only be used in the early stages of the design process, when little or no information about the physical layout is available. In nanometer technologies, however, the relation between bisection bandwidth and achievable bandwidth is not one to one. Bisection bandwidth ignores the fact that, as technology scales and systems employ larger networks, wire delays begin to dominate and the delay through long wires can be substantial. For these reasons, interconnects have become the center of attention with respect to area, delay/performance, and power consumption in nanometer designs. Interconnects have not scaled like transistors with technology scaling. To overcome these power and performance issues of interconnects, advanced mitigation schemes may be necessary in designs. In addition, these enhancement solutions often result in area and power budgets different from those preestimated in the early stages of design. Hence, many researchers have pointed out that economical design of present and future nanometer designs is limited by their wiring demands.

1063-8210 (c) 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
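As a concrete illustration of this measure, the bisection cut of a square 2-D mesh severs one row of links; the sketch below counts the bisection data wires under that assumption. The mesh-specific cut count is an illustration, not a formula from this paper, and `n_wires` (data lines per link) is a hypothetical parameter.

```python
import math

def mesh_bisection_wires(n_ip: int, n_wires: int) -> int:
    """Bisection wire count of a sqrt(N) x sqrt(N) mesh: cutting the
    network into two equal halves severs sqrt(N) links, each of which
    carries n_wires data lines (control lines are excluded)."""
    side = math.isqrt(n_ip)
    assert side * side == n_ip, "expects a square mesh"
    return side * n_wires
```

For example, a 16-IP mesh with 8 data lines per link has a bisection of 32 wires; note that this static count says nothing about the delay of the wires being cut, which is exactly the limitation discussed above.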

With regard to power, the situation is similar: the portion of power associated with interconnects keeps increasing with each technology node. This is an important issue because, in conventional design, the analysis and synthesis of very large scale integration circuits are based on the assumption that gates are the dominant sources of on-chip power consumption. With power being the most critical design constraint in large SoC designs, architectural-level power estimation has therefore become extremely important to verify that power budgets can be met not only by the communication network, but by the entire system. Some early design optimization techniques for NoCs are highly dependent on early power budget estimations.

In this paper, a design methodology based on NoC architectural-level layouts and high-level power models is presented to achieve low-power and high-performance NoC designs. The methodology is efficient in selecting an appropriate NoC architecture for low-power results and thus for shorter timing closure. The rest of this paper is organized as follows. In Section II, we study interconnect performance and the effect of scaling on the interconnect resistance and capacitance. In Section III, an NoC link optimization technique based on an RC delay model is discussed. Section IV provides insight into the buffer insertion technique for performance enhancement in longer interconnects. Section V provides NoC power dissipation models for various NoC architectures. In Section VI, an IP-based design methodology for low-power results is presented. In Section VII, simulation results based on the design models are presented. Finally, the conclusion is presented in Section VIII.

Fig. 2. Gate delay versus interconnect delay.

Fig. 3. NoC interconnects.

II. NoC PERFORMANCE ANALYSIS

One of the most critical challenges for an NoC design is to provide the bandwidth desired by the SoC design in order to maintain certain performance thresholds. As technology scales into the nanometer domain, however, achieving higher bandwidth can be tricky and may require mitigation schemes. As wires continue to shrink, wiring delay is dominating gate delay, as shown in Fig. 2.

The wire delay doubles with each technology node and increases quadratically as a function of wire length. Even a small number of global interconnects, where the signal delay is very high, can have a significant impact on system performance and may also influence timing closure. The key to solving this problem is to know more about the physical design, i.e., the placement of IPs and the estimated interconnects, early in the design cycle, for accurate budgeting and shorter design cycles. To achieve higher bandwidth in an NoC, it is possible to design pipelined routers such that they process one flit per cycle, but the duration of the clock cycle usually determines how fast each flit can be processed in the network. In nanometer NoCs, this cycle time is not limited by the logic between two clocked elements, but by the links between two routers. NoC links typically consist of a number of parallel signal wires of fixed width and spacing, as shown in Fig. 3.
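The claim that the cycle time is set by the inter-router link rather than the router logic can be captured by a minimal model; the delay figures in the usage note are hypothetical, not values from this paper.

```python
def achievable_frequency(router_logic_delay_s: float, link_delay_s: float) -> float:
    """Clock frequency of a pipelined NoC: the cycle must cover the slower
    of the router logic stage and the inter-router link."""
    return 1.0 / max(router_logic_delay_s, link_delay_s)
```

With a hypothetical 300-ps router logic stage and a 900-ps link, the link limits the network to roughly 1.1 GHz, even though the logic alone could run at over 3 GHz.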

These links can be used directly to express a number of metrics, such as data rate, bandwidth density, or bisection bandwidth. However, data rate is the preferred and most appropriate metric for estimating system performance and can be expressed as follows:

Bandwidth = Nwires/delay = Awire/[(W + S) · delay]    (1)

where Nwires is the total number of signal wires in the link and delay is the delay of a single wire. Thus, to achieve higher bandwidth, it is important that the delay is kept to a minimum. Interconnect delay is a function of wire resistance and capacitance. The delay of a distributed RC line driven by an ideal driver (zero output impedance) at the near end, with an open termination at the far end, can be represented as

Twire(delay) = 0.4 · R · C · L²    (2)

where Twire is the wiring delay, L is the wire length, R is the wire resistance per unit length, and C is the capacitance per unit length. This is believed to be a good approximation and is reported to be accurate within 5% for a wide range of R and C [30], [31], [35]. In NoC design, the minimum conceivable clock cycle time in a highly pipelined design can be assumed to be equal to the value of 15FO4, with FO4 defined as the delay of an inverter driving four identical ones [6].

In different technology nodes, FO4 can be estimated as 425 · Lmin, where Lmin is the minimum gate length in any technology node [5]. In long wires, the intrinsic delay can easily exceed this limit of 15FO4, thereby limiting the clock cycle time; as a result, system bandwidth may suffer. The 15FO4 delay for different technology nodes is shown in Fig. 4.

Fig. 4. 15FO4 delay in different technology nodes.

Depending on the length of the interconnects, different techniques may be necessary to reduce the intrinsic RC delay. Two methods for reducing interconnect delays are discussed in Sections III and IV.

III. NoC LINK OPTIMIZATION USING INTRINSIC RC MODEL

The two important parameters for interconnect delay are wire resistance and wire capacitance. The resistance per unit length of a wire can be defined as

R = ρ/(T · W)    (3)

where ρ is the resistivity of the wire and is material dependent. In modern technologies, copper is used for its low resistivity (2.2 μΩ·cm) to reduce wiring resistance. Wire capacitance, on the other hand, is more complex, and many of its components are geometry dependent, as shown in the cross-sectional view of global interconnects in Fig. 5.

Fig. 5. Cross-sectional view of global interconnects.

The three major components of wire capacitance are related to the geometry by the following relation:

CT = Cf + Cb · W + Cc/S    (4)

where CT is the total capacitance, Cf is the fringing capacitance, Cb is the parallel-plate capacitance due to the top and bottom layers of metal and is proportional to the interconnect width W, and Cc is the coupling capacitance between neighboring interconnects and is inversely proportional to the interconnect spacing S. The parallel-plate capacitance is

Cb = εox · W · L/H    (5)

where εox is the SiO2 dielectric constant, W is the width, L is the length of the wire, and H is the dielectric height. Fringing and coupling capacitances are more difficult to compute by hand, but empirical formulas, which are computationally efficient and relatively accurate, are given by the following equations:

Cf = εox · L · [W/H + 0.77 + 1.06 · (W/H)^0.25 + 1.06 · (T/H)^0.5]    (6)

Cc = εox · L · [0.03 · (W/H) + 0.83 · (T/H) − 0.07 · (T/H)^0.222] · (S/H)^(−4/3)    (7)

Widening a wire proportionally reduces its resistance but increases the capacitance due to the top and bottom layers. This leads to a less than proportional increase in capacitance, so the overall RC delay still improves. On the other hand, increasing the spacing between the wires reduces the capacitance to the adjacent wires and leaves the resistance unchanged; this also reduces the RC delay by significantly reducing the coupling capacitance. While the T and H parameters are fixed for each metal layer in a given process technology, the parameters W and S can be chosen by the link designer to achieve an acceptable delay. If a design is limited by the wiring space available, then varying W and S for optimal delay will have an impact on the number of wires in the link by the following relation:

Awire = Nwires · W + (Nwires − 1) · S    (8)

Increasing interconnect width and spacing in a limited area will reduce the number of wires and thus the overall system bandwidth. As a result, these geometric adjustments to achieve lower delay can create an upper bound on the conceivable bandwidth.

IV. PERFORMANCE OPTIMIZATION USING BUFFER INSERTION

Both the resistance and the capacitance of a wire increase with wire length, so the RC delay of a wire increases with L². Therefore, for longer NoC interconnects, wire sizing and spacing alone are not sufficient to limit the quadratic growth of delay with respect to wire length. Buffer insertion (also called repeater insertion) may be necessary, as shown in Fig. 6. The delay of a wire can be made linear with distance by splitting the line into multiple segments and inserting a repeater between each segment to actively drive the wire.

Fig. 6. NoC interconnects with optimal buffer insertion.

Using the methodology in [8], the optimal repeater size kopt and the optimal interbuffer segment line length hopt can be calculated using

kopt = sqrt[(Rs · C)/(R · Cs)]    (9)

hopt = sqrt[2 · Rs · (Cs + Cp)/(R · C)]    (10)

where Rs and Cs are the resistance and capacitance of a minimum size inverter, R is the resistance of the wire per unit length, and C is the capacitance of the wire per unit length. Similarly, the optimal width Wopt and the optimal spacing Sopt are given as follows:

Wopt = (Ca · Sopt + Cc)/Cb    (11)

Sopt = (Cc · Wopt)/(Ca + Cb · Wopt)    (12)

Using (2)-(7), the delay of the longest interconnect (10 mm) in 65-nm technology was calculated to be 4254 ps, whereas the 15FO4 time in the same technology node is 414.375 ps. The frequency achievable by the system is 2.41 GHz, whereas, due to the length and associated delay of the longest interconnect, the achievable frequency is limited to only 0.24 GHz. After buffer insertion, the wire delay improved to 412.3 ps, as shown in Fig. 7. With optimal repeater insertion, the growth of the interconnect delay becomes linear with the wire length. However, for large high-performance designs, the number of such repeaters can be prohibitively large and can take up a significant portion of the silicon and routing area and, additionally, can consume a significant amount of power. NoC power dissipation is discussed in the following section.

Fig. 7. Delay of 10-mm-long unbuffered versus buffered wire in 65 nm.

V. NoC POWER ANALYSIS

Power consumption is a major concern for any large chip design, including the design of an NoC. In NoCs, power is dissipated because of the large amount of activity generated by so many transistors. High power consumption can result in excessively high temperatures during operation, which can lead to reliability concerns because of electromigration and other heat-related failure mechanisms. Careful analysis of power consumption at all stages of design is essential for keeping power consumption within acceptable limits. In NoC design, power dissipation is mainly due to three main components of the network, namely: 1) routers; 2) interconnects (or wires); and 3) repeaters. The closed-form total power equation for an NoC can thus be defined as

Ptotal = Pswitches + Σ(Pline + Prep)    (13)

where Pswitches is the total power consumed by the switches in the network, Pline is the total power dissipation of the interswitch links, and Prep is the total power dissipation of the repeaters, which are required for long interconnects. The number of repeaters required depends on the length of the interswitch link

Pline = α · C · L · Vdd² · f    (14)

Prep = Nrep · (hopt · Co · α · Vdd² · f + Ileakrep · Vdd + Ishortrep · Vdd)    (15)
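The delay analysis of (2), the 15FO4 cycle budget, and the repeater sizing of (9) and (10) can be combined into a short sketch. The per-unit-length R and C and the inverter parameters Rs, Cs, and Cp below are illustrative placeholders, not the paper's extracted 65-nm parasitics, so the resulting delays only approximate the figures reported in the text.

```python
import math

# Illustrative 65-nm-class parameters (placeholders, not the paper's values)
R = 5.0e5        # wire resistance per unit length [ohm/m]
C = 2.0e-10      # wire capacitance per unit length [F/m]
R_s = 2.0e3      # output resistance of a minimum-size inverter [ohm]
C_s = 1.0e-15    # input capacitance of a minimum-size inverter [F]
C_p = 1.0e-15    # parasitic output capacitance of the inverter [F]

def wire_delay(length_m: float) -> float:
    """Distributed RC delay of an unbuffered line, per (2): T = 0.4*R*C*L^2."""
    return 0.4 * R * C * length_m ** 2

def fo4_budget(lmin_um: float, n: int = 15) -> float:
    """Clock-cycle floor of a pipelined NoC: n*FO4, with FO4 ~ 425*Lmin ps [5]."""
    return n * 425e-12 * lmin_um          # seconds, Lmin given in micrometers

def repeater_sizing() -> tuple:
    """Optimal repeater size (9) and inter-repeater segment length (10), per [8]."""
    k_opt = math.sqrt((R_s * C) / (R * C_s))
    h_opt = math.sqrt(2.0 * R_s * (C_s + C_p) / (R * C))
    return k_opt, h_opt

if __name__ == "__main__":
    budget = fo4_budget(0.065)            # 15FO4 in 65 nm: 414.375 ps
    delay = wire_delay(10e-3)             # a 10-mm unbuffered link
    k_opt, h_opt = repeater_sizing()
    print(f"15FO4 budget: {budget * 1e12:.3f} ps")
    print(f"10-mm unbuffered delay: {delay * 1e12:.1f} ps")
    print(f"k_opt = {k_opt:.1f}, h_opt = {h_opt * 1e3:.2f} mm")
```

Even with these placeholder parasitics, the unbuffered 10-mm delay (about 4 ns) exceeds the 414-ps 15FO4 budget by an order of magnitude, which is the motivation for the repeater insertion of Section IV.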

where α is the activity factor of the interswitch link, C is the interconnect capacitance, and f is the clock frequency. Nrep is the total number of repeaters, hopt is the optimal repeater size, and Co is the input capacitance of a minimum size repeater. Vdd is the supply voltage.

A. CLICHE Architecture

The CLICHE topology is proposed in [1]. It is a 2-D mesh consisting of an m × n mesh of switches interconnecting computational resources (IPs). Every switch, except those at the edges, is connected to four neighboring switches and one IP block. In CLICHE, the number of switches is equal to the number of functional IPs. The layout of a CLICHE topology consisting of 16 IPs is shown in Fig. 8.

Fig. 8. Layout for a 16-IP CLICHE network.

The IPs and switches are connected through communication channels. A channel consists of two uniform bidirectional links consisting of data and control signals. This topology is widely used in NoC designs because of its regular structure and shorter interswitch interconnects. In this architecture, all interswitch wire segments are of the same length and can be determined using the following expression:

L = sqrt(Area/N)    (16)

If the number of IPs is equal in the x and y directions, then the number of horizontal links is equal to the number of vertical links, and each can be calculated using sqrt(N) · (sqrt(N) − 1). Depending on the technology node, the optimal length for repeater insertion can be obtained using (9). The total interconnect length and the required number of repeaters for a CLICHE topology can thus be calculated using

lCLICHE = 2 · sqrt(N) · (sqrt(N) − 1) · Nwires · sqrt(Area/N)    (17)

Nrepcliche = 2 · sqrt(N) · (sqrt(N) − 1) · Nwires · sqrt(Area/N)/kopt    (18)

Using the total number of switches needed, the total wire length for the interconnects, and the total number of required repeaters, the total power consumption for the CLICHE architecture can be calculated using the following expression:

Ptotalcliche = Nsw · Pswitch + 2 · sqrt(N) · (sqrt(N) − 1) · Nwires · sqrt(Area/N) · α · C · VDD² · f + [2 · sqrt(N) · (sqrt(N) − 1) · Nwires · sqrt(Area/N)/kopt] · (hopt · co · α · VDD² · f + Prepleak + Prepshort)    (19)

where Nsw is the total number of switches in the network and Pswitch is the power consumed by a single switch.

B. BFT Architecture

The BFT topology as an NoC architecture is proposed in [5]. In this architecture, the IPs are placed at the leaves and the switches at the vertices. At the lowest level (level 0), there are N IPs, and the IPs are connected to N/4 switches at the first level. The number of levels in the BFT architecture depends on the total number of IPs and can be calculated using (21). The layout scheme of a BFT architecture consisting of 16 IPs is shown in Fig. 9.

Fig. 9. Layout for a 16-IP BFT network.

With the layout scheme shown in Fig. 9, the total number of switches needed and the interswitch wire lengths for a BFT architecture can be calculated using the following equations:

Nsw = (N/2) · [1 − (1/2)^levels]    (20)

where

levels = log2(sqrt(N))    (21)

switches at jth level = N/2^(j+1)    (22)

As shown in Fig. 9, there are two different types of interconnect lengths between switches in a 16-IP BFT architecture. The interswitch wire lengths can be calculated using the following expression [10]:

l(a+1,a) = sqrt(Area)/2^(levels − a)    (23)

where l(a+1,a) is the length of the interconnect spanning the distance between the level a and level a + 1 switches, and a can take integer values between zero and levels − 1.
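The CLICHE bookkeeping of (16)-(19) can be sketched as a short script. The electrical inputs (activity factor, capacitance per unit length, switch power, repeater parameters) are illustrative placeholders, not the paper's extracted 65-nm values, and kopt is used here in the sense of the critical inter-repeater segment length.

```python
import math

def cliche_link_length(area_m2: float, n_ip: int) -> float:
    """(16): every interswitch segment of the mesh has length sqrt(area/N)."""
    return math.sqrt(area_m2 / n_ip)

def cliche_totals(area_m2, n_ip, n_wires, k_opt):
    """(17), (18): total wire length and repeater count for the mesh."""
    n_links = 2 * math.sqrt(n_ip) * (math.sqrt(n_ip) - 1)  # horiz. + vert.
    l_total = n_links * n_wires * cliche_link_length(area_m2, n_ip)
    n_rep = l_total / k_opt    # one repeater per critical segment of length k_opt
    return l_total, n_rep

def cliche_power(area_m2, n_ip, n_wires, k_opt, p_switch,
                 alpha, c, vdd, f, h_opt, c_o, p_leak=0.0, p_short=0.0):
    """(19): switch power + interswitch wire power + repeater power."""
    l_total, n_rep = cliche_totals(area_m2, n_ip, n_wires, k_opt)
    p_wire = alpha * c * l_total * vdd ** 2 * f
    p_rep = n_rep * (h_opt * c_o * alpha * vdd ** 2 * f + p_leak + p_short)
    return n_ip * p_switch + p_wire + p_rep   # CLICHE has one switch per IP
```

For a 16-IP mesh on a 20 mm × 20 mm die, each interswitch segment is 5 mm, and 24 links contribute to the total wire length, so the wire and repeater terms grow quickly with the number of data lines per link.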

Thus, the total length of the interconnects and the total number of repeaters can be calculated using the following expressions:

ltotal = [sqrt(Area)/2^levels] · [levels · N · Nwires]    (24)

Nrepeaters = N · Nwires · [l(1,0)/kopt + (1/2) · l(2,1)/kopt + (1/4) · l(3,2)/kopt + ... + (1/2^(log4(N)−1)) · l(levels,levels−1)/kopt]    (25)

where kopt is the optimal length of the global interconnect between two repeaters. Using the total number of switches, the total wire length, and the total number of required repeaters, the total power dissipation of the BFT architecture can be calculated using the following expression:

PtotalBFT = (N/2) · [1 − (1/2)^levels] · Pswitch + [sqrt(Area)/2^levels] · [levels · N · Nwires] · α · c · Vdd² · f + N · Nwires · [l(1,0)/kopt + (1/2) · l(2,1)/kopt + (1/4) · l(3,2)/kopt + ... + (1/2^(log4(N)−1)) · l(levels,levels−1)/kopt] · (hopt · co · α · Vdd² · f + Prepleak + Prepshort)    (26)

C. SPIN Architecture

SPIN is proposed in [3]. This network makes use of a fat-tree topology; every node has four sons, and the father is replicated four times at every level of the tree. This topology carries some redundant paths and therefore offers higher throughput at the cost of added area. This topology is scalable and uses a small number of routers for a given number of IPs. In a large SPIN (>16 IPs), the total number of switches is 3N/4 [3]. An efficient floor plan for the SPIN architecture is shown in Fig. 10.

Fig. 10. Layout for a 16-IP SPIN architecture.

With this floor plan, the interswitch wire length can be determined using (23). The total wire length and the number of repeaters can be calculated using the following expressions:

lspintot = 0.875 · sqrt(Area) · N · Nwires    (27)

Nrepeaters = N · Nwires · [sqrt(Area)/(8 · kopt) + sqrt(Area)/(4 · kopt) + sqrt(Area)/(2 · kopt)]    (28)

The total power consumption of a SPIN architecture, using the total length of the interconnects and the total number of required repeaters, can thus be calculated using the following equation:

Ptotalspin = (3N/4) · Pswitch + 0.875 · sqrt(Area) · N · Nwires · α · c · Vdd² · f + N · Nwires · [sqrt(Area)/(8 · kopt) + sqrt(Area)/(4 · kopt) + sqrt(Area)/(2 · kopt)] · (hopt · co · α · Vdd² · f + Prepleak + Prepshort)    (29)

D. Octagon Architecture

The Octagon network topology is proposed in [4] as an on-chip communication architecture for network processors. A basic octagon unit consists of eight nodes and 12 bidirectional links. Each node is associated with one IP and two neighboring switches. Communication between any pair of nodes takes at most two hops. The number of switches required in an octagon unit is equal to the number of IPs. For a system containing more than eight nodes, the octagon unit is expanded to multidimensional space using multiple basic octagon units. An efficient layout scheme for a basic octagon unit is shown in Fig. 11.

Fig. 11. Layout for an Octagon architecture.

Switches are represented by the smaller rectangles and IPs by the big rectangles. Depending on the layout style presented, such as the one shown in Fig. 11, there are four different interswitch wire lengths needed in the octagon architecture [8]. The first set connects nodes 1-5 and 4-8, the second set connects nodes 2-6 and 3-7, the third connects nodes 1-8 and 4-5, and the fourth connects nodes 1-2, 2-3, 3-4, 5-6, 6-7, and 7-8. The interswitch wire lengths can be calculated using the following

expressions:

l1 = 3L/4    (30)

l2 = 13 · wl · Nwires + L/4    (31)

l3 = 13 · wl · Nwires    (32)

l4 = L/4    (33)

where L is the length of four nodes and is equal to 4 · sqrt(Area/N), and wl is the sum of the global interconnect width and space. Considering the different interswitch wire lengths, the total length of the interconnect and the total number of required repeaters can be calculated using the following expressions:

ltotal = [(7/2) · L + 52 · wl · Nwires] · Nwires · Noct    (34)

Nrepeaters = [2 · (3L/4)/kopt + 2 · (13 · wl · Nwires + L/4)/kopt + 2 · (13 · wl · Nwires)/kopt + 6 · (L/4)/kopt] · Nwires · Noct    (35)

where Noct is the number of basic octagon units. The total power dissipation for the octagon network can thus be calculated using the following expression:

Ptotal = Pswitches + [(7/2) · L + 52 · wl · Nwires] · Nwires · Noct · α · c · Vdd² · f + Nwires · Noct · [2 · (3L/4)/kopt + 2 · (13 · wl · Nwires + L/4)/kopt + 2 · (13 · wl · Nwires)/kopt + 6 · (L/4)/kopt] · (hopt · co · α · Vdd² · f + Prepleak + Prepshort)    (36)

VI. IP-BASED DESIGNS AND EFFECT OF NoC SIZES

IP-based designs are now the dominant way to design any large system containing billions of transistors in a reasonable amount of time. IP-based design differs from custom design in that IPs are designed well before they are used. Therefore, in these designs, most of the system requirements, such as bandwidth, area, and power consumption, are known a priori. The life cycle of finely designed IPs may stretch well over the years, from the time they are first created, through several generations of technology, until their final retirement. Due to this, IP-based NoC design is a dominant design methodology. There are many different types of configuration possible with IP designs. Depending on the number of IPs, different sizes of NoCs may be required. As an example, an NoC with a large number of IPs is a fine-grained network, whereas one with fewer IPs is a coarse-grained network. The granularity of an NoC directly impacts its power consumption. There is a direct relationship between the size of an NoC and its impact on the length of the interconnects, if the die size is kept constant. In addition, from one technology node to another, the design may experience similar effects, since it is a natural progression that, with every generation of technology scaling, the capacity to integrate similar types of IPs doubles, or the area they occupy halves. The natural progression of the number of IPs that can be fit on the same die due to technology scaling is shown in Fig. 12.

Fig. 12. Scaling of IP cores as technology scales.

It is important to know the impact of the NoC dimension on the interconnects, and hence on the total power consumption. Recently, a fair amount of research has been dedicated to efficiently mapping IPs in NoC designs. In this paper, functional IP blocks are not discussed, since they are dependent on the specific applications. However, for the purposes of this paper, they may be considered as a set of embedded processors. In this experiment, we show that NoC power is a function of the number of IPs being integrated and the die size. Depending on the number of IPs and an estimate of the die area they require, a topology for a low-power design can be selected using the power models presented in the previous sections. For different NoC architectures, power dissipation varies due to differences in the interconnect wire lengths, the number of switches, and the total number of repeaters required by the topology. The total number of IPs being integrated on a given area can make a difference in whether repeater insertion is required for the interconnects or not, and may result in different area and power results. As the number of IPs is increased for a given die area, some topologies scale well with shorter interconnect lengths, whereas others do not. The lengths of the longest interconnects for different NoC architectures, as the number of IPs is increased on a 20 mm × 20 mm die, are shown in Fig. 13.

Fig. 13. Scaling of the longest interconnect versus the number of IPs.

The interconnects for the CLICHE and Octagon architectures scale well, i.e., the length of the longest interconnect is reduced with an increased number of IPs. In other topologies, such as the BFT and SPIN architectures, the length of the longest interconnect does not scale, due to the physical arrangement of the switches in the layout. The layout of a BFT architecture for 64 and 256 IPs is shown in Fig. 14.

Fig. 14. Layout of BFT architecture for 64 and 256 IPs.

In the layout, the longest interconnects are marked in red, and they remain unchanged as the number of IPs increases.
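This contrast can be sketched numerically. The longest-link expressions below follow the layout models of Section V (the CLICHE interswitch segment sqrt(Area/N) from (16), and the top-level BFT link sqrt(Area)/2 from (23) with a = levels − 1); they illustrate the trend of Fig. 13 rather than reproduce its exact curves.

```python
import math

def cliche_longest_link(area_m2: float, n_ip: int) -> float:
    """CLICHE interswitch segment, sqrt(area/N): shrinks as IPs are added."""
    return math.sqrt(area_m2 / n_ip)

def bft_longest_link(area_m2: float, n_ip: int) -> float:
    """Top-level BFT link, sqrt(area)/2^(levels - (levels - 1)) = sqrt(area)/2:
    set by the die size alone, independent of the IP count."""
    return math.sqrt(area_m2) / 2.0
```

On a 20 mm × 20 mm die, quadrupling the IP count halves the CLICHE segment, while the top-level BFT link stays pinned at 10 mm, which is why BFT and SPIN keep needing repeaters on their longest links.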

For a desired bandwidth requirement, longer interconnects require optimization techniques in terms of width and spacing, along with repeater insertion. As the link length starts to increase, the link power consumption grows substantially. This shows that the wire power consumption must be considered during the initial design phases. Using the power models presented in the previous section, a systematic design approach to select an optimal topology for a low-power design solution is shown in Fig. 15.

The proposed synthesis approach can be used as a design space exploration tool to evaluate the efficiency of different NoC topologies. The flow is applicable to any NoC topology, and is consistent with the flow presented in this paper. Power analysis of different NoC topologies (based on the models developed earlier) is presented in Section VII.

VII. SIMULATION RESULTS

To observe the importance of considering power consump-tion during the synthesis process, we evaluated the power dissipation performance for different NoC topologies through various experimental setups. Many different router designs to support different NoC topologies are implemented using ARMs standard cell library in 65-nm Taiwan Semiconductor

Fig. 15. IP-based design methodology for low-power NoC.

Manufacturing Corporation design process. Synopsyss Prime Time-PX tool is used to calculate average power dissipation of the router designs. Power dissipation of a router design is directly proportional to the number of ports in the design. For example, a six-port router design consumes 9.62 MW of power at a frequency of 200 MHz. Using the method presented in [8], for 65-nm technology node, the critical interconnect length is 1.44 mm and an optimal repeater size of 105 is used. The links are assumed to be bidirectional, with eight data lines and two control signal lines per link. For power calculations, an optimal interconnect width of 799 (nm) and an optimal interconnect spacing of 329 (nm) [8] are used. Using the power models and design flow presented earlier, power variance among different NoC topologies is shown in Figs. 1619. A range of 161024 IPs and a die size of 25400 mm2 are used. SPIN topology consumes the highest power, whereas BFT is more power efficient. SPIN topology


Fig. 16. Power of the CLICHÉ architecture.

Fig. 19. Power of the Octagon architecture.

TABLE I
TOTAL NUMBER OF REPEATERS AND METAL RESOURCES REQUIRED TO IMPLEMENT CLICHÉ ARCHITECTURE

TABLE II
TOTAL NUMBER OF REPEATERS AND METAL RESOURCES REQUIRED TO IMPLEMENT BFT ARCHITECTURE

Fig. 17. Power of the BFT architecture.

Fig. 18. Power of the SPIN architecture.

has the highest wiring demand, and as link lengths increase, its link power consumption dominates. This again shows that interconnect power consumption must be included in the initial NoC synthesis phase, as it is done

in the approach presented here. Considering a die size of 20 mm x 20 mm, the power consumed by the different NoC architectures for 16, 64, and 256 IPs is presented in Tables I-IV. The power consumed by wires and repeaters is also presented. A system overhead of 100 W of power [17] is evaluated.
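For the mesh-based CLICHÉ topology, a simple geometric estimate relates die size and IP count to per-hop link length and link count. This back-of-the-envelope model is an assumption for illustration, not the paper's exact layout model; only the 20 mm x 20 mm die and the 16/64/256 IP counts come from the text.

```python
import math

def mesh_link_length_mm(die_side_mm, n_ips):
    """Approximate inter-router spacing on a square k x k mesh."""
    return die_side_mm / math.sqrt(n_ips)

def mesh_num_links(n_ips):
    """Number of (bidirectional) links in a k x k mesh: 2k(k-1)."""
    k = int(round(math.sqrt(n_ips)))
    return 2 * k * (k - 1)

# 20 mm x 20 mm die, as in Tables I-IV:
for n in (16, 64, 256):
    print(n, mesh_link_length_mm(20.0, n), mesh_num_links(n))
```

Note that quadrupling the IP count halves the per-hop length but roughly quadruples the link count, so total wire length, and hence wire power, keeps growing.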

A detailed analysis of power consumption helps designers save more power through whichever approaches are applicable. In the SPIN topology with 256 IPs, the repeaters alone can consume as much as 1209.6 mW of power; this is quite significant, considering the size and high-end power budget of the chip. The total power consumed by the BFT architecture is the lowest in all three cases; however, it is important to observe how the different components contribute to the total power


TABLE III
TOTAL NUMBER OF REPEATERS AND METAL RESOURCES REQUIRED TO IMPLEMENT SPIN ARCHITECTURE

TABLE IV
TOTAL NUMBER OF REPEATERS AND METAL RESOURCES REQUIRED TO IMPLEMENT OCTAGON ARCHITECTURE

VIII. CONCLUSION

In this paper, an efficient design methodology for estimating NoC power at the architecture level is presented. The analysis is based on layout and power models. To achieve a low-power NoC architecture, an accurate estimation of the power and area budgets is important in the early phases of design. A mismatch between the area and power budgets of the logic design and those of the physical design can invalidate a design completely or result in endless iterations. In a conventional digital design flow, several iterations of logic synthesis and physical design are required before convergence to the design specification is achieved. This paper shows a systematic approach that tackles the issue through power modeling and performance analysis with area and layout awareness. A system-level SoC designer can apply these models to accelerate the NoC design process toward a low-power solution and faster timing closure. The impact of interconnect width and spacing on area and power dissipation is analyzed. The tradeoff between delay and bandwidth is used as a figure of merit for interconnect performance. 3-D graphs of power as a function of die area and number of IPs are presented. In nanometer designs, interconnect power consumption is significant and thus needs to be included in the early stages of the design cycle.

REFERENCES

Fig. 20. Contribution to total power by different NoC components (router, wires, and repeaters).

consumption. A more detailed, parameterized breakdown for the case of 64 IPs is shown in Fig. 20. It is interesting to note that in the CLICHÉ architecture, the biggest source of power consumption is the switches, although CLICHÉ is second only to BFT in total power consumption. CLICHÉ consumes less power in interconnects and repeaters than the other architectures. Thus, exploring the power consumption of individual components provides meaningful insight.
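The per-component decomposition behind Fig. 20 can be sketched from the setup numbers quoted earlier. The six-port 9.62 mW router figure at 200 MHz, the 1.44-mm critical length, and the ten lines per link come from the text; scaling router power linearly to other port counts is our extrapolation of the stated proportionality.

```python
import math

P_SIX_PORT_MW = 9.62   # six-port router power [mW] at 200 MHz (from the text)
L_CRIT_MM = 1.44       # critical interconnect length [mm] (from [8])
LINES_PER_LINK = 10    # 8 data + 2 control lines per link (paper setup)

def router_power_mw(ports):
    """Router power scaled linearly with port count."""
    return P_SIX_PORT_MW * ports / 6.0

def repeaters_per_link(link_len_mm):
    """Repeaters for one link: one per critical length, on every line."""
    return math.floor(link_len_mm / L_CRIT_MM) * LINES_PER_LINK
```

Summing these two terms over all routers and links of a topology, plus the wire term, yields the router/wire/repeater split plotted in Fig. 20.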

[1] S. Pasricha and N. Dutt, On-Chip Communication Architectures: System on Chip Interconnect. San Mateo, CA, USA: Morgan Kaufmann, 2008.

[2] S. Kumar et al., A network on chip architecture and design methodology, in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2002, pp. 117-124.

[3] P. Guerrier and A. Greiner, A generic architecture for on-chip packet-switched interconnections, in Proc. Design Autom. Test Eur. Conf. Exhibit., Mar. 2000, pp. 250-256.

[4] F. Karim, A. Nguyen, and S. Dey, An interconnect architecture for networking systems on chips, IEEE Micro, vol. 22, no. 5, pp. 36-45, Sep./Oct. 2002.

[5] P. Pande, C. Grecu, A. Ivanov, and R. Saleh, Design of a switch for network on chip applications, in Proc. IEEE Int. Symp. Circuits Syst., vol. 5, May 2003, pp. 217-220.

[6] K. Sundaresan and N. Mahapatra, An accurate energy and thermal model for global signal buses, in Proc. 18th Int. Conf. VLSI Design, Jan. 2005, pp. 685-690.

[7] K. Sundaresan and N. Mahapatra, Accurate energy dissipation and thermal modeling for nanometer-scale buses, in Proc. 11th Int. Symp. HPCA, Feb. 2005, pp. 51-60.

[8] X.-C. Li, J.-F. Mao, H.-F. Huang, and Y. Liu, Global interconnect width and spacing optimization for latency, bandwidth, and power dissipation, IEEE Trans. Electron Devices, vol. 52, no. 10, pp. 2272-2279, Oct. 2005.

[9] G. Reehal and M. Ismail, Layout-aware high performance interconnects for network-on-chip design in deep nanometer technologies, in Proc. IEEE 6th Int. Design Test Workshop, Dec. 2011, pp. 58-61.

[10] C. Grecu, P. Pande, A. Ivanov, and R. Saleh, Timing analysis of network on chip architectures for MP-SoC platforms, Microelectron. J., vol. 36, no. 9, pp. 833-845, Sep. 2005.

[11] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, Performance evaluation and design trade-offs for network-on-chip interconnect architectures, IEEE Trans. Comput., vol. 54, no. 8, pp. 1025-1040, Aug. 2005.

[12] L. Benini and G. de Micheli, Networks on chips: A new SoC paradigm, IEEE Comput., vol. 35, no. 1, pp. 70-78, Jan. 2002.

[13] A. Balakrishnan and A. Naeemi, Optimal global interconnects for network-on-chip in many-core architectures, IEEE Electron Device Lett., vol. 31, no. 4, pp. 290-292, Apr. 2010.

[14] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, A 5-GHz mesh interconnect for a teraflops processor, IEEE Micro, vol. 27, no. 5, pp. 51-61, Sep./Oct. 2007.


[15] G. Reehal, Designing low power and high performance network-on-chip communication architectures for nanometer SoCs, Ph.D. dissertation, Dept. Electr. Comput. Eng., Ohio State Univ., Columbus, OH, USA, 2012.

[16] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, et al., Evaluation of the raw microprocessor: An exposed-wire-delay architecture for ILP and streams, in Proc. IEEE ISCA, vol. 32, no. 2, pp. 2-13, Mar. 2004.

[17] L. P. Carloni, A. B. Kahng, S. Muddu, A. Pinto, K. Samadi, and P. Sharma, Interconnect modeling for improved system-level design optimization, in Proc. ASPDAC, Mar. 2008, pp. 258-264.

[18] H. Elmiligi, A. Morgan, M. El-Kharashi, and F. Gebali, Power optimization for application-specific networks-on-chip: A topology-based approach, J. Microprocess. Microsyst., vol. 33, nos. 5-6, pp. 343-355, Aug. 2009.

[19] S. Vangal, A. Singh, J. Howard, S. Dighe, N. Borkar, and A. Alvandpour, A 5.1 GHz 0.34 mm² router for network-on-chip applications, in IEEE Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2007, pp. 42-43.

[20] K. Bhardwaj and R. Jena, Energy and bandwidth aware mapping of IPs onto regular NoC architectures using multi-objective genetic algorithm, in Proc. IEEE Int. Symp. Syst. Chip, Oct. 2009, pp. 27-31.

[21] X. Wang, M. Yang, Y. Jiang, and P. Liu, Power-aware mapping for network-on-chip architectures under bandwidth and latency constraints, in Proc. IEEE 4th Int. Conf. Embedded Multimedia Comput., Dec. 2009, pp. 1-6.

[22] L. Ost, G. Guindani, L. Indrusiak, and S. Maatta, Exploring NoC-based MPSoC design space with power estimation models, IEEE Des. Test Comput., vol. 28, no. 2, pp. 16-29, Mar./Apr. 2011.

[23] L. Xue, W. Ji, Q. Zuo, and Y. Zhang, Floorplanning exploration and performance evaluation of a new network-on-chip, in Proc. IEEE Design, Autom. Test Conf. Eur. (DATE), Mar. 2011, pp. 1-6.

[24] K. Latif, A. Rahmani, T. Seceleanu, and H. Tenhunen, Power- and performance-aware IP mapping for NoC-based MPSoC platform, in Proc. 17th IEEE ICECS, Dec. 2010, pp. 758-761.

[25] G. Reehal, M. A. Abd Elghany, and M. Ismail, Octagon architecture for low power and high performance NoC design, in Proc. IEEE NAECON, Jul. 2012, pp. 63-67.

[26] J. Postman, T. Krishna, C. Edmonds, L. Peh, and P. Chiang, SWIFT: A low-power network-on-chip implementing the token flow control router architecture with swing-reduced interconnects, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 8, pp. 1432-1446, Aug. 2013.

[27] S. Murali, D. Atienza, P. Meloni, S. Carta, L. Benini, G. De Micheli, et al., Synthesis of predictable networks-on-chip-based interconnect architectures for chip multiprocessors, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 8, pp. 869-880, Aug. 2007.

[28] G. Khan and A. Tino, Synthesis of NoC interconnects for multi-core architectures, in Proc. IEEE 6th Int. Conf. CISIS, Jul. 2012, pp. 432-437.

[29] International Technology Roadmap for Semiconductors: System Drivers, 2007. [Online]. Available: http://www.itrs.net/

[30] J. Liu, L.-R. Zheng, D. Pamunuwa, and H. Tenhunen, A global wire planning scheme for network-on-chip, in Proc. ISCAS, vol. 4, May 2003, pp. 892-895.

[31] W. Dally and J. Poulton, Digital Systems Engineering. Cambridge, U.K.: Cambridge Univ. Press, 2008.

[32] M. Kim, D. Kim, and G. Sobelman, Network-on-chip link analysis under power and performance constraints, in Proc. Int. Symp. Circuits Syst., 2003, pp. 4163-4166.

[33] D. Pandini, C. Forzan, and L. Baldi, Design methodologies and architecture solutions for high-performance interconnects, in Proc. IEEE ICCD, Oct. 2004, pp. 152-159.

[34] Y. Shin and H. Kim, Analysis of power consumption in VLSI global interconnects, in Proc. IEEE Int. Symp. Circuits Syst., May 2005, pp. 4713-4716.

Gursharan Reehal (M'09) received the B.S. (Hons.) degree in electrical engineering and the M.S. and Ph.D. degrees in electrical and computer engineering from The Ohio State University, Columbus, OH, USA, in 1996, 1998, and 2012, respectively.

After receiving the M.S. degree, she joined Lucent Technologies, Columbus, OH, USA, as a Hardware Test Engineer. She is currently a Senior Lecturer with the Department of Electrical and Computer Engineering, The Ohio State University. Her current research interests include low power digital VLSI

design, network-on-chip (NoC) communication architectures, high-performance I/O interfaces, embedded systems for medical applications, and reconfigurable computing.

Dr. Reehal is a member of the Engineering Honor Society, Tau Beta Pi, and the Electrical Engineering Honor Society, Eta Kappa Nu. She received the Best Paper Award from the IEEE System on Chip Conference in 2010 for her work in the area of low power and high performance NoCs and the prestigious Shining Star award from the Wireless Network Group, Lucent Technologies.

Mohammed Ismail (F'09) is a prolific author and entrepreneur in the field of chip design and test in academia and industry in the U.S. and Europe. He is the Founder of The Ohio State University (OSU) Analog VLSI Laboratory, one of the foremost research entities in the field of analog, mixed-signal, and RF integrated circuits. He served on the Faculty of the ElectroScience Laboratory, OSU. He held a Research Chair position at the Swedish Royal Institute of Technology, Stockholm, Sweden, where he founded the Radio and Mixed Signal Integrated

Systems (RaMSIS) Research Group. He was with Aalto University, Espoo, Finland; NTH and the University of Oslo, Norway; Twente University, Enschede, The Netherlands; and the Tokyo Institute of Technology, Tokyo, Japan. He joined the Khalifa University of Science, Technology and Research (KUSTAR), Abu Dhabi, UAE, in 2011, where he holds the ATIC Professor Chair and is the Head of the Electrical and Computer Engineering Department on both KUSTAR's campuses in Sharjah and Abu Dhabi. He is serving as Co-Director of the ATIC-SRC Center of Excellence on Energy Efficient Electronic Systems, targeting self-powered chip sets for wireless sensing and monitoring, biochips, and power management solutions. He has advised the work of over 50 Ph.D. students and over 100 M.S. students. He has authored or co-authored over 12 books and over 150 journal publications and holds seven U.S. patents. His current research interests include self-healing design techniques for CMOS RF and mm-wave ICs in deep nanometer nodes. He has served as a corporate consultant to over 30 companies and is a Co-Founder of Micrys Inc., Columbus, OH, USA; Spirea AB, Stockholm; Firstpass Technologies, Inc., Dublin, OH, USA; and ANACAD-Egypt (now part of Mentor Graphics).

Dr. Ismail is the Founding Editor of the Springer Journal of Analog Integrated Circuits and Signal Processing and serves as the Journal's Editor-in-Chief. He served the IEEE in many editorial and administrative capacities. He is the Founder of the IEEE International Conference on Electronics, Circuits and Systems, the flagship Region 8 conference of the IEEE Circuits and Systems Society. He received the U.S. Presidential Young Investigator Award, the Ohio State Lumley Research Award four times (in 1992, 1997, 2002, and 2007), and the U.S. Semiconductor Research Corporation's Inventor Recognition Award twice.

[35] C. Grecu, P. Pande, A. Ivanov, and R. Saleh, A scalable communication-centric SoC interconnect architecture, in Proc. IEEE ISQED, 2004, pp. 343-348.