Reining in Time-To-Market for Next Generation Embedded Design€¦ · based Solutions...

ARM and Magma Implementation Reference Methodologies Addressing

High Performance, Low Power, Small Area and Full Automation

Reining in Time-To-Marketfor Next Generation Embedded Design

C O N T E N T S

Talus® and AutomationAccelerating Hierarchical Implementation of the

ARM® Cortex™-A9 MPCore™ processors with

Magma’s Hydra™ 2

Jumpstarting ARM Cortex-A9 MPCore

Processor-based SoC Designs with

Talus and Hydra 6

High-PerformanceA Fully Automated High Performance

Implementation of ARM Cortex-A8 7

Low-PowerMagma iRM for ARM Powered-optimized

Cortex-R4 Processors 12

Rapid Implementation of Low Power ARM

Microprocessors 16

A Compendium of Articles by Magma Design Automation from the ARM Information Quarterly Magazine,a publication of ARM and Convergence Promotions.

ARMRTL

Deliverables

ARM® and Magma® in Partnership

Industry StandardViews/Models forSoC Integration

High-QualityARM Processor

Implementation

Hardened Core

Ab

stra

cte

d M

od

el

ARM andMagma

iRMApplicationConstraints

+Technology Libraries

+

Implementation Reference Methodologies (iRMs) for customer-specific hardening of synthesizable ARM processors• Co-developed by ARM and Magma

• The proven route to successful silicon

• The basis of a custom deployment methodology

• Eases learning curve to adoption of ARM IP

• Enables rapid, application-specific implementation

A process of continuous improvement• Reap the benefit of researching new tools and techniques

• Leverage best practices for embedded design implementation

1

Most mobile phone chips embed ARM® processors.Most mobile phone chips are designed with Magma® software.

Shouldn’t you be using ARM and Magma?

which increases power consumption andheat dissipation. Moreover, pushing syn-thesis tools to achieve that last increase inMHz results in a significant area penalty.In summary, both power and areaincrease exponentially as the envelope ofachievable frequencies is pushed. The solu-tion lies in scaling high-performance com-puting and power consumption to meet aparticular device’s requirement.

Advanced technologies available withinthe ARM Cortex-A9 MPCore processorshelp accelerate system performance,reduce power consumption, and includekey new features and approaches thatenhance embedded multicore designs.These multicore processors have betterpower and area efficiency at the same per-formance point compared with uni-proces-sors. For example, the ARM Cortex-A9MPCore processors have a smaller

eed for Multicore Processor-based SolutionsNext-generation smart phones and MobileInternet Devices (MID) will incorporateexciting features that allow watching highdefinition movies, making video phonecalls, playing 3-D video games, watchinglive TV, GPS satellite-based navigation andnot least of all, provide a rich Internetbrowsing experience. These capabilitiesare possible in large part due to the rapidconvergence of technologies in the embed-ded System-on-Chip (SoC) platform thatearlier required desktop computing powerand bandwidth to accomplish. Designingsuch handheld or mobile devices meancombining rich functionality and high-performance computing while minimizingpower consumption.

Delivering the necessary performancerequires higher processor frequencies,

Accelerating HierarchicalImplementation of the ARM

Cortex-A9 with Magma’s HydraN

Author:Stuart Riches and Philip Watson, ARM Ltd., Cambridge; Jim Schultz, Pete Churchill, Somasekhar Eerappa and Vasu Madabushi, Magma DesignAutomation

Synopsis:Multicore processor-based systems are the future of embedded computing in high-performance hand-held and power-hungry devices. The ARM Cortex-A9 MPCore multicore processor is ideally poised to meet this demand for System-on-Chip (SoC) designs thatembed such processors. The sheer size of such SoCs powered by the ARM Cortex-A9 MPCore multicore processor will make it difficult to manage traditional tasks such as design partitioning, time budgeting, hierarchymanagement, block shaping, power planning, etc. With time-to-market pressures shrinking the design time, SoC implementers are turning to hierarchical chip planning and finishing systems to automate these traditional tasks and provide them with faster ways to achieve quality floorplans to close their designs.

This article will describe the quad-core hierarchical implementation of ARM’s latest Cortex-A9 MPCore multicore processor using Hydra, Magma’s automated hierarchical design solution.New technologies, features and the ability to provide hand-off quality floorplans in different stages of the design cycle and early-prototyping-to-final-implementation stages will be discussed.

Volume 7, Number 3, 2008Information Quarterly [20]

Figure 1: ARM Cortex-A9 MPCore architecture of a quad-core configuration

2


memory footprint that contributes to areaand power reduction. The ability to turnoff processors with power gating depend-ing on the workload requirement, andvoltage and frequency scaling also enablesignificant power savings. Multicoreprocessors thus offer excellent scalability ofperformance and power.

Design Size and Complexity of SoCswith Embedded ARM Cortex-A9MPCore Multicore ProcessorsTo meet the increasing demand for com-pute power the size and complexity ofembedded processors has increased. Therelative size of an ARM processor hasalways been just a fraction of the wholeSoC. However, there is currently a signifi-cant, in fact, exponential change. Forexample, the ARM 926EJ-S™ processor canbe efficiently implemented with under 75Kplaceable instances, while the ARM11™MPCore™ (quad) and the Cortex™-A8processors require approximately 500Kplaceable instances. There were tremen-dous advantages in maintaining a flatapproach with these implementations,which were more straightforward whenartificial timing boundaries and floorplanregions were not required. The Power,Performance and Area (PPA) targets werealways easier to achieve in a flat imple-mentation since the macro and cell placershad more degrees of freedom to exercise.

However, we have reached a point where,in balancing the need for turnaround time(TAT) and straightforward implementationmethodologies, some compromise isrequired. Take for example the Cortex-A9MPCore, where the dual configurationincluding the NEON™ and Floating Point

Unit (FPU) blocks is about 800K instancesin size and a full quad configurationincludes well over one millions cells. Inaddition, the ARM Cortex-A9 MPCoreMulticore processor allows excellent con-figurability with choice of single, dual andquad-cores, processor trace macrocell(PTM), L1 cache size; including NEON(SIMD media processing/DSP engine), FPU,L2 cache controllers, debug interfaces andAXI bus masters. Such a full configurationcan contribute in minor ways to increasingthe design size, complexity, and challengesposed by deep sub-micron process tech-nologies and libraries.

Complex Cortex-A9 MPCore-based SoCscan have hundreds of macros, multiplevoltage domains and complex clock topol-ogy, particularly in the mobile market seg-ment where power management is para-mount. Such exponential growth in thecomplexity of chips means several weeksof runtime to implement physical blocksflat, exceeding the bounds of reasonableTAT. We have moved into an era where theprocessor core has become a very large IPsubsystem in its own right and the targetfor a hardened re-useable block is muchlarger. As a result, a hierarchical approachbecomes a natural choice, but that willrequire high quality floorplanning andhierarchical planning solutions.

Today a methodolo-gy that enables hier-archical designplanning and proto-typing of designsusing advancedtechnologies is need-ed. A methodologythat performs auto-mated partitioningand shaping, opti-mized macro place-ment, global route-based pin assign-ment, accuratebudgeting andblack-box handling.The advantages ofsuch an approachwould be ease of use,

faster TAT, early feedback and predictabili-ty, and tight correlation with the finalimplementation. Magma’s Hydra providessuch an infrastructure for a comprehensivefloorplan synthesis and hierarchical plan-ning solution. Hydra also uses the sameunderlying engines for standard cell and

macro placement, physical optimization,timing and routing from the Talus® ICImplementation platform, ensuring thequality and accuracy of the floorplan. Thisalso ensures that both the logic and physi-cal design engineer can close on a solutionfaster. An added advantage is the hierar-chical design planning capabilities ofHydra, which allow designers to quicklyexplore multiple floorplan implementa-tions. Utilizing Relative FloorplanningConstraints™ (RFCs), designers can createconvergent floorplans by retaining floor-plan changes between iterations. Thisbecomes very useful when accommodat-ing late-arriving changes to the designthat require small changes to the core sizeand shape without performing time-con-suming macro placement updates.

With lessons learned from an earlierproven Talus-based hierarchical imple-mentation methodology for ARM11MPCore, we set out to implement thequad-core Cortex-A9 MPCore processorwith Hydra.

Design Planning: First Impressionswith HydraDesign exploration is a difficult and timeconsuming, iterative task. Designers typi-cally make a large number of iterationsbefore homing in on an optimal floorplan.Don’t we often hear SoC Project Managerscomplain when floorplan changes aremade? A common refrain is “I thoughtyou finished the floorplan last week?” Foran embedded processor core in an SoC,these changes will likely include size,shape, pin placement, macro placementand power grid.

Figure 2: Quad-Cortex-A9 MPCore-based SoCs: Increasing design size con-tributing to complexity and challenge

Figure 3: Hydra analyzes data and suggests animplementation: virtual flat placement beforeautomatic partitioning

3


The challenge posed by the full configura-tion quad Cortex-A9 MPCore multicoreprocessor was that it was almost threetimes the size of an ARM11 MPCoreprocessor with 2200 top-level boundarypins, 150 macros and a frequency target of600 MHz with a TSMC 65GP ARMAdvantage-HS physical IP library. In addi-tion, it also included memory-bist andscan runtime.

One of the conundrums of floorplanning alarge SoC with an embedded multicoreprocessor is figuring out where to start.This is where an automated solution isextremely valuable. Pictured in Figure 4 isthe result after the initial floorplan of thequad-Cortex-A9 MPCore processor. Thelogic was placed and the design was parti-tioned and automatically shaped intohierarchical blocks. In the first pass, asquare aspect ratio of 4mm x 4mm wasspecified. While the chances of having asquare area to place such a large multi-core processor in a design are almost neg-ligible, the important point is that Hydraprovided us a solid starting point for fur-ther exploration, in less than three hoursof runtime.

Channel SizingLet us now look at the results of multipleshaping experiments. Hydra offers flexibil-ity in design style with support for near-abutted and channel-based hierarchicaldesigns. From a congestion analysis stand-point, it was clear that locating the highly-connected Snoop Control Unit (SCU) blockin the middle (orange color in Figure 5)would generate quite a bit of congestion,leading to pin-assignment issues. Thisallowed for further exploration with the

tool to seek a better solution for the heavi-ly congested SCU block.

At first, some of the critical logic (timingdriven) from the SCU block was re-parti-tioned by placing them at the top-leveland allowing them to move in to the chan-nel. Additionally, the SCU partition wasgiven a smaller area utilization target inorder to be able to grow the block. Finally,given that there was more top-level logic,the global channels were widened.Honoring these constraints, new partitionswere shaped by Hydra in under an hourfor the entire design.

Partial Floorplan Re-useEven after widening the channels andpulling the SCU logic to the top, the blockcontinued to have issues. Each of the fourCPUs has 1500 pins and the SCU con-tained 5500 pins just by itself. To solvethis, some radical approaches were adopt-

ed. It was decided to delete the SCU parti-tion completely and put the logic at thetop-level. The real challenge lay in re-using what was already achieved untilthat point and to build from there. TheCPU’s highlighted by the bright green andyellow blocks were frozen while theturquoise and purple CPU blocks weremanually re-shaped. A few of the macroswere manually fixed, while placement ofother macros was reused using RelativeFloorplanning Constraints derived fromthe previous iteration by the tool. Thisgave us the results seen in figure 6.

Wider Aspect Ratio for aBetter/Efficient FloorplanIn a real design scenario, if the chip inte-grator changes the floorplan to say, awider aspect ratio, how would one makeuse of the knowledge gained from the pre-vious run? The solution is to do an intelli-gent restart! Pre-defined partitions andtop-level logic from the previous run wereused. Relative Floorplanning Constraintscan be used to assign relative locations tofloorplan objects. The channel size deter-mined in the previous run was also re-used. Given that the previous pin assign-ment was good, the relative pin locationswere re-used too. In 20 minutes, we wereable to replace and shape the design.

Also important to note in Figure 7 above isthe rectilinear shape of the four CPU hier-archical blocks. If trapped in a world ofpure rectangles, floorplanning wouldbecome a nearly impossible task.Rectilinear shapes are a must to fully uti-lize the silicon area available. Various hier-archical blocks in a design will have differ-ent shape requirements as dictated byinternal hard macros, external locationand connectivity. Hydra’s shaper provided

Figure 5: Quad-Cortex-A9 MPCore: Hydra useschannel sizing to provide mixed hierarchy/flatsupport

Figure 6: Floorplan results after partial reuse ofresults

Figure 7: Quad-Cortex-A9 MPCore: Hydra re-useallows rapid prototyping

Figure 4: Result after the initial floorplan of thequad-Cortex-A9 MPCore processor

4


an easy process for creating initial rectilin-ear shapes and refining current shapes asthe design matures, as seen by the progres-sion of our work.

Using Relative FloorplanningConstraints (RFC’s) on the QuadCortex-A9 MPCore ProcessorImplementationThe auto-interactive macro placer inHydra is straightforward and easy to run

and the resultsare excellent forgetting quickfeedback on afloorplan shapeand pin location.This is a solidstarting point touse Relative Floor-planning Con-straints effectivelyfor furtherchanges. Hydraoffers the abilityto extract the rel-ative constraints

from an existingmacro placement,which can then beadjusted andmaintained goingforward. The RFCextraction capa-

bility is very noteworthy—this feature pro-vides the bridge from the prototype worldto the production world.

So, why would one want to use RelativeFloorplanning Constraints? They can beused to guide the macro placer to get thefinal floorplan. One can fully specify a rel-ative set of constraints without the need torun the macro placer. It offers a method forcapturing and implementing true “design-er intent”. It allows a floorplan to be spec-ified (scripted) as a human would think(such as “a group of 8 rams in the upperright corner, 4 more rams stacked justbelow them,” etc.), and not in a Cartesianbased dump-file (as a tool would think).

Most importantly, Relative FloorplanningConstraints will allow small changes in thedesign to be absorbed without a need tochange the macro placement. Case inpoint was the real-life example of havingto change the RAM size two days beforerelease of the ARM and Magma imple-mentation Reference Methodology (iRM)for the Cortex-A9 MPCore. An updated

floorplan file was not necessary because itwas all placed with relative constraints,thereby saving us several days in TAT.

Addressing Channel CongestionCongestion analysis will help determineon the custom channel sizing. Throughour previous work we were successfullyable to remove the congestion in the cen-tral area. However, there were still somechannels that needed closer investigation.The solution was to custom size thosechannels. It wouldn’t be prudent to glob-ally make all channels larger. This wasaccomplished by reviewing the congestionin the Hydra GUI with the congestion mapand the channel report to identify hotspots. The channel report also suggestschannel sizes for a given utilization.Additionally, blockages can be added tomanually alleviate any boundary pin con-gestion. It took under 20 minutes to turn-around the design, including shaping,macro placement and global route.

Congestion Clean FloorplanImplementationIn summary, we have a nearly congestion-free design before full blown optimization.The key here is to fix any gross timing vio-

lations and congestion issues before goinginto optimization. This reduces the overallruntime since the tool will not try to fiximpossible paths. A very acceptable floor-plan was settled in a matter of days notweeks. A natural progression would be toexperiment implementing the four CPUsas repeated blocks.

ConclusionsBased on the results we obtained, it is clearthat the above approaches offer manybenefits. Rectilinear shaping and auto-interactive macro placement are valuablefeatures of Hydra that allowed optimal useof the quad Cortex-A9 MPCore floorplanarea. When late-arriving RTL and aspectratios of macros change, RelativeFloorplanning Constraints may be effec-tively utilized to feed back to the shaperand cluster placement engine to reduceTAT significantly. Gross timing violationsand congestion can be fully addressedeven before going into full optimization.Overall, the auto-interactive features ofHydra offers a fast prototyping, hierarchi-cal chip planning and finishing system forSoCs that embed different configurationsof the Cortex-A9 MPCore multicore proces-sor. Utilizing these and other features ofHydra early in the design process will yieldconfidence in closing the overall designpost placement optimization and routing,as well as saving months of time com-pared to a manual approach, therebymaximizing productivity.

Figure 8: Arrows showinganchor points of usingRe-lative FloorplanningCon-straints in theCortex-A9 MPCore floor-plan: Define arrays easilyand locate arrays usingrelation to an object

Figure 9: The wide floorplan and dispersed SCUlogic was successful in reducing the central con-gestion and further analysis of local channel con-gestion suggested custom sizing.

Figure 10: After modifying the channels and re-running the placer & global router the design isnearly congestion-free.

Figure 11: Early analysis and partitioning solvesgross timing/congestion issues before optimiza-tion, reducing overall TAT

Hear Magma Design Automation give twopresentations at the ARM Developer’sConference:1. Quad-core Cortex-A9 MPCore Multicore

Processor Implementation

2. Advanced Techniques for Implementing Cortex-M3 based Ultra-Low Power Designs

Visit www.rtcgroup.com/arm/2008/for more information.

5

tringent Requirements ofCortex-R4 Embedded ApplicationsCortex-R4 covers a wide area of applica-tions, including mass storage/hard-diskdrive controllers, digital video and stillcameras, car chassis/braking systems,mobile wireless modems, intelligent PC-independent printers, networking andhome gateways. In such deeply embeddedhigh-volume systems, balancing perform-ance targets with overall cost is a delicateact. For example, given the relative lowcost (price per chip) of disk-drive con-trollers with embedded Cortex-R4 proces-sors, any area savings at a given perform-ance point may directly translate intoincreased profitability. On the other hand,safety-critical automotive applicationsrequire design margins to be built into thechips to address reliability concerns due toextreme temperature variations and anextended product lifetimein the vicinity of 10+ years.Under such conditions,poor power supply (mesh)design can lead to Voltage(IR)-drop issues. Analyzingand preventing electro-migration during thedesign phase is necessaryin order to avoid thermalbreakdown on the powernetwork and to address sig-nal integrity issues. All ofthese can place severerestrictions on processchoices, maximum operat-ing frequency and designcomplexity.

Here’s an overview ofCortex-R4 features thatmitigate some of theserequirements:• ARMv7-R architecture

– Thumb®-2 technology ensures a more efficientprocessor design

D E S I G N S T R AT E G I E S A N D M E T H O D O L O G I E S


Magma iRM for ARMPowered-optimized Cortex™-R4 Processors

SAuthor:Vasu Madabushi, Gary Powell and JoeWalston, Magma Design Automation;Stuart Riches, ARM Limited, UK

Synopsis:As the feature size of process technologybecomes smaller, the leakage power dominates overall power dissipation and the problem gets worse if the design is supposed to perform at very high frequen-cies. This article addresses specific problemsfor achieving high performance and lowpower while implementing nanometer system-on-a chip (SoC) designs with theMagma iRM for the ARM Cortex-R4, whichis based on Magma Design Automation’s IC implementation software. This article alsopresents solutions and identifies necessaryenhancements to the overall methodology so that the entire design flow can be automated to accelerate turnaround time.

The ARM® Cortex™-R4 processor targetsdeeply embedded applications in theextremely competitive imaging, automotive,wireless base-band and storage vertical markets, where achieving optimal perform-ance at the least possible overall cost is paramount.

Magma’s IC implementation software lends itself to a power-aware design flowthat lets designers make timing-vs.-powerand area-vs.-power trade-offs at differentstages of the flow. It provides access toappropriate low-power analysis and optimization engines that are integratedwith,and applied throughout,the entire RTL-to-GDSII flow.

– Optional MPU—applications that don’t require it can save on area

– 32-bit signed/unsigned hard-ware divider for control applications

– Improvements to interrupt handling for hard real time applications

• Micro-architecture– 8 stage selective dual issue pipeline

ensures minimized area overhead– Optional I and D caches, Tightly

Coupled Memories (TCM)– AMBA 3 AXI master port for efficient

on-chip interconnect• DMA access to TCMs through slave

port– Global history branch predictor and

function call return stack

Figure 1: The ARM Cortex-R4 architecture

12

neers concentrate on analyzing andaddressing power considerations duringphysical design. Hence, poor decisionsmade during the early stages of the designmakes it almost impossible to fix anyproblems during implementation.

What today’s SoC designers need is accessto proven solutions to address all of theabove issues, shrinking process geometries(65-nm and below) and lower power con-sumption, while simultaneously improv-ing performance and reducing area. TheMagma iRM for ARM adds tremendousvalue by helping ARM licensees becomefamiliar with best practices that are robustand deliver a completely hardened macrofor re-use in a hierarchical SoC design. TheMagma iRM for ARM takes advantage ofpowerful features in Magma’s tools, suchas a single executable with a single timingengine and a single, unified data model.The common data model for algorithmsallows for analysis, optimization andimplementation for timing and power andsignal integrity to be performed concur-rently. Standard format outputs allow end-users to perform quick and easy verifica-tion of the implementation.

Salient Features of the Cortex-R4Magma iRM for ARM The current Magma iRM for Cortex-R4 isbased on Magma’s Blast 5.0 toolset whichcontains:

• Full RTL-to-GDSII Flow- Verilog RTL synthesis- DFT scan chain insertion- Floorplanning, power grid & physical

synthesis- Clock tree synthesis- Final routing- DRC checking- Static timing analysis

• Flexible Packaged Flow- Scalable across different technologies- Suitable for different core

configurations- Direct Tcl script access for quick

customizations

• Advanced Feature Support- Cross-talk delay analysis and

avoidance optimization techniques- Insertion and optimization of clock

gating- Options to control optimization effort



• Flexible synthesis-time configurabilityensures efficiency and significant area sav-ings depending on application marketrequirement

– Each cache can be 4KB - 64KB or can be removed

– 1 to 3 TCMs (Up to 8MB), or can be removed

– 8 or 12 regions in MPU, or no MPU– 2 – 8 Breakpoints, 1 – 8 Watchpoints

Relatively lower target clock frequenciesmean that Cortex-R4 implementations aresmaller than the ARM11 family of proces-sors. Cortex-R4 includes a number of fea-tures that reduce supporting memory cost.The processor pipeline has fewer stagesand yet accesses local RAMs over two clockcycles. These enable the use of a lowerspeed memory library, which drasticallyreduces silicon area and power consump-tion. Moreover, use of the RAMs from ARMPhysical IP Metro® library in Cortex-R4implementations give about 35% area andgreater than 50% power savings. All of theabove and the Magma iRM for ARM con-tribute towards making timing closure eas-ier, shortening design cycle and reducingrisk.

Cortex R4 also offers an increase in per-formance over the ARM946E-S processor;in terms of maximum operating frequencyand improved computing efficiency, with-out compromising on low power and size.In order to closely match Cortex-R4 fea-tures with its wide application coverage,end-users can take advantage of the excel-lent configurability at synthesis time.

The Need for an Integrated Power-Aware Design Flow in ImplementingCortex-R4-Based SoC DesignsIn traditional flows, power considerationsare addressed by stand-alone tools andwithout paying enough attention to howthey simultaneously impact timing, areaand turnaround time. The lack of integra-tion between point tools and the rest of thedesign environment can result in atremendous amount of “false errors” thatcan render a design impossible to close.Worse yet, this lack of integration coupledwith limited repair capabilities can resultin an unreliable power network, causingnumerous time-consuming design itera-tions.

Another problem with the majority oftoday’s design environments is that engi-

All of the above will also equally apply tothe recently announced ARM Cortex-R4Fprocessor, because they share a commonflow. The additional advanced features ofthe Cortex-R4F processor include supportfor Error-Correcting Code (ECC) in thecache memories and TCM, the extensionof error detection into the interconnect, asynthesis-optional Floating-Point Unit(FPU) and additional synthesis configura-tion of DMA.

Key Low-power Considerations forImplementing the ARM Cortex-R4ProcessorThe key low power design considerationsinclude dynamic and static (leakage)power dissipation, temperature and per-formance, and voltage drop effects. Theseare addressed continually through outMagma’s RTL-to-GDSII flow.

Dynamic Power DissipationDuring synthesis, gate sizes and cell countsare reduced, which directly translates tolower dynamic power. Automatic clock-gate insertion and optimization has dualadvantages in reducing the overall areaand dynamic power consumption.

Moreover, the clock tree in any design willtypically consume a significant portion ofthe budgeted power. During clock tree syn-thesis (CTS), advanced clock gate cloning,buffering, clustering and multi-Vt tech-niques are used to lower total power. Poweraware routing within the Magma environ-ment minimizes capacitance on highswitching nets by spreading the wires.

Static Power DissipationStatic power dissipation is associated withlogic gates when they are inactive. Staticpower dissipation has an exponentialdependence on temperature. This meansthat as the chip heats up, its static powerdissipation increases exponentially. Staticpower dissipation also has an exponentialdependence on the switching threshold ofthe transistors (Vt). The delay (switchingtime) associated with a transistor is affect-ed by the switching threshold of that tran-sistor (Vt) and the supply voltage to thattransistor (Vdd).

All of this means that engineers have toperform a complicated balancing act,because lowering the supply voltagereduces the amount of heat being generat-ed, which in turn lowers the static power

13

Every power and ground track segmenthas a small amount of resistance associat-ed with it. This means that the logic gateclosest to the IC's primary power or groundpins is presented with the optimal supply.The next gate in the chain will be present-ed with a slightly degraded supply, and soon down the chain.

The problem is exacerbated with transientor alternating current (AC) voltage dropeffects. These occur when gates switch fromone value to another or in worst-cases,when entire blocks are switched on/off.This causes transitory power surges, whichmomentarily reduce the voltage supply togates farther down the power supplychain.

The reason voltage drop effects are soimportant is that the input-to-outputdelays across a logic gate increase as thevoltage supplied to that gate is reduced,which can cause the gate to miss its timingspecifications. There is also an increase inthe interconnect delays associated withwires driven by under-powered gates.Furthermore, a gate's input switchingthresholds are modified when its supply isreduced, which causes that gate to becomemore susceptible to noise.

Voltage drop effects are becoming increas-ingly significant because the resistivity ofthe power and ground tracks rises as afunction of decreasing feature sizes (track

widths). These effects can be minimized byincreasing the width of power and groundtracks, but this consumes valuable realestate on the silicon, which typically caus-es routing congestion. In order to solvethese problems, the logic functions have tobe spaced farther apart, which increasesdelays (and power consumption) due tolonger signal tracks. Thus, implementingan optimal power network requires thebalancing of many diverse factors.

Advanced Features Address Variationsin Temperature, Power and VoltageDrop EffectsAdvanced power analysis and repair fea-tures of Magma tools enables the analysisof power, voltage drop, temperature andthe impact of voltage drop on timing.Automatic power grid synthesis consists ofIR drop analysis and is used to ensure opti-mal power distribution without over-designing the power grid. These play animportant part in avoiding electro-migra-tion and thermal break-down, while keep-ing the cost to the absolute minimum.These are very crucial in safety-criticalautomotive applications of Cortex-R4,such as anti-lock breaking systems (ABS)and electronic stability control (ESC).Subsequently, intelligent de-couplingcapacitance insertion can be employed onthe power grid to minimize the transients,keeping leakage power in check as well asimproving yield.



dissipation. However, lowering the supplyvoltage also increases gate delays. By com-parison, lowering the transistors' switchingthresholds speeds them up, but this expo-nentially increases their leakage andtherefore their static power dissipation.The Magma iRM for ARM helps automatethe leakage power trade-off and optimiza-tion by effectively using low Vt transistorsonly on timing-critical paths and high Vttransistors on non-critical paths. Thisautomation is a default feature of theMagma iRM for ARM and is very easy touse and implement. The only pre-requisiteis the multi-Vt library preparation stepprior to running the Magma iRM for ARM,which sorts the standard cell modelsaccording to logic and Vt class. Magmaapplication notes are available to helpwith the multi-Vt library preparation.

Challenges Faced with Variations inTemperature, Power and Voltage DropEffectsPower consumption - both static anddynamic - increases a device's operatingtemperature. This may force engineers toemploy expensive device packaging andexternal cooling technology.

To accommodate variations in operatingtemperature and supply voltage, designershave traditionally been obliged to paddevice characteristics and design margins.However, creating a device's power net-work using excessively conservative designpractices consumes valuable silicon realestate (leading to increased costs), increas-es routing congestion, and results in per-formance that is significantly below the sil-icon's full potential. This simply is unac-ceptable in today's highly competitivemarketplace.

Yet another consideration is that the on-chip temperature gradient (the differencein temperatures at different portions of thedevice caused by unbalanced power con-sumption) can produce mechanical stress,which may degrade the device's reliability.

65-nm devices are prone to voltage dropeffects, which are caused by the resistanceassociated with the network of wires usedto distribute power and ground from theexternal pins to the internal circuitry [withdirect current (DC) related voltage drops,these are also often referred to as IR dropeffects].

Table 1: Vital statistics of the results achieved with the Magma iRM for ARM, in internal implemen-tations

What Detail Advantages

Target Library ARM Physical IP Sage X High-density, High-speedfor TSMC 90G process Libraries

Cortex-R4 Cache 16K Data and 16K Configuration Instruction Cache

Target Frequency 385 MHz Top performance target achieved including Cross-talk delay avoidance and optimization

Max Power Leakage: 9.8 mW; Internal: Low power of approximately97.8 mW; swcap: 50.0 mW 0.4 mW/MHzTotal Max Power: 157.7mW

Standard Cell Count 115K cells (at 63% Low AreaUtilization)

Total Cell Area 1.010mm2

Total Area with Memories 3.487 mm2 Low overall cost of implementation

14

tures and ease of use for chip building. TheMagma iRM for ARM with Cortex-R4 sup-port delivers a low-risk; high performancepower optimized embedded processor hard-ening with a top-down approach. Thisgreatly simplifies large SoC implementa-tions in deeply embedded applicationsof automotive, imaging and storagemarkets.



Typical Results Achieved with theMagma iRM for ARMEnabling a rapid timing closure and fastturn-around-time (TAT), the RTL-to-GDSIIMagma iRM for ARM takes designers fromthe simplicity of an ‘out of the box’ experi-ence to first working Cortex-R4 processorin less than one engineering day (oncesetup, it typically takes an overnight run toachieve predictable results, given the config-urability of the Cortex-R4 at synthesis time).

Customer Success with Magma iRMfor ARMBroadcom Corporation has benefited fromthe tremendous value of the Magma iRMfor ARM that captures best practices andprovided them with a working, provenflow right out of the box. Given the exper-iments SoC designers typically perform,the user-configurability and repeatablenature of the Magma iRM for ARM afford-ed Broadcom quicker TAT. This allowedtheir system designers to explore “what-if”scenarios on chip function (given the syn-thesis-time configurability of the Cortex-R4 processor) versus cost of different physi-cal implementations. With ease of setup,the pre-packaged RTL-to-GDSII flowoffered a complete turnkey methodologyfor hardening the Cortex-R4 processor andembedding it in their SoCs. Here’re somesample results from two of Broadcom’sdesigns that validates the Magma iRM forARM and exceeded their expectations:

SummaryThe ARM andMagma PartnershipDeveloped in collabo-ration between ARMand Magma engi-neers, Magma iRMfor ARM has allowedembedded SoC basedon Cortex-R4 procces-sor implementationsto take advantage ofrapid time-to-market,process portabilityand a predictableand proven route tofirst-time working sil-icon. The MagmaiRM for ARM is a pre-verified flow that aidsrapid, applicationspecific implementa-tion of ARM poweredSoCs and providesease of integrationwith the rest of thechip using theMagma flow. It pro-vides an excellentstarting point forimplementing sever-al ARM processorsand eases the learn-ing curve to adoptionof ARM IP.

As a result of the ARM-Magmapartnership, end-users benefitfrom continuous improvementof Magma iRM for ARM newEDA technologies and flows aswell as reference methodologydevelopment and support fornewer ARM processors.

Magma is pushing forwardwith excellent low power fea-

Table 2: High speed 65nm Implementation of Cortex-R4: 10GBit PHYTransceiver at Broadcom

Figure 2: Layout of the Cortex-R4 used in the 10GBit PHY transceiver chip

Overnight Run to Harden Cortex-R4 with ARM-Magma RM

Performance target achieved 400MHzFloorplan Area .73mm X .73mm (0.53 mm2)Library TSMC 65G 7 LMStandard Cell Count 102,060Standard Cell Area 0.43 mm2

Utilization 81.2%RTL-to GDSII Runtime: 13.5 CPU Hours

Figure 3: Layout of the Cortex-R4 used in the disk drive controllerchip

Performance target achieved 150 MHzFloorplan Area .98mm X .80 mm (0.784 mm2)Library TSMC 65LPStandard Cell Count 96,239

(Regular Vt: 15,338High Vt: 80,901)

Standard Cell Area 0.58 mm2Utilization 74 %RTL-to GDSII Runtime: 9.6 CPU Hours

Figure 3: Layout of the Cortex-R4 used in the disk drive controller chip

40% Reduction in Leakage Power with Concurrent Multi-VtOptimization

15



Rapid Implementation of Low PowerARM Microprocessors

EAuthor:Alan Gibbons,Magma Design Automation, Inc.

Synopsis:As the levels of integration on mobiledevices increases, the competing requirements of high performance with low power consumption beingplaced on the implementation of leadingedge microprocessors is set to continue.Concurrent optimization and analysis ofpower, timing and area must form anintegral part of any implementation flowfor these processor cores. The BlastPower based low power implementationsolution by Magma Design Automation,is one answer.

xtended battery life, color displaysand the ability to support multiple con-current applications on mobile devices isplacing considerable focus on power man-agement in the development of ARMbased system chips.

Careful consideration must be given tominimizing both static and dynamicpower dissipation without sacrificing per-formance and market forces dictate thatdesigners must meet this challenge withever decreasing time to market windows.Consequently there is a pressing industryrequirement for the ability to rapidlyimplement application specific high per-formance, low power processor cores.

ARM has been addressing the needs of lowpower applications for years throughnovel microprocessor architecture anddesign techniques that have culminatedrecently in the launch of the flagshipARM1176JZF-S processor. A processor tar-geted specifically at the low power needsof the consumer and wireless applicationmarket.

In the era of synthesizable cores however,the work required to implement very lowpower processors is split across both ARMand their licensee - the development of theprocessor architecture and design fallinginto ARM’s domain and the ability toimplement technology specific versions ofthe processor in the hands of the licensee.In order to satisfy an increasinglydemanding customer base, these technol-ogy specific implementations must be cre-ated rapidly to be both high performanceand energy efficient and created rapidlythrough the use of comprehensive, inte-grated design flows.

The power consumption of a CMOS deviceincludes both the dynamic power associat-ed with activity and static power whichreflects the energy consumed when thedevice is idle. As we move from oneprocess to the next, this power consump-

tion grows exponentially and the need toaddress it as a primary design goalbecomes more apparent.

The commercial impacts of increasedpower consumption can be severe. In themobile world, where we have a finite ener-gy budget, trade-offs must be made in thefeature set of the mobile device – forexample providing a color display at theexpense of prolonged battery life, or theability to provide support for both videoand messaging simultaneously. In thewired world, increased power consump-tion directly affects packaging costs andform factors as well as device performanceand failure.

Cleary the need to reduce the power dissi-pation is a critical factor in the continueddevelopment of highly integrated portabledevices. The methods used to reduce thepower budget have application at the sys-tem level, during sub-system architecture,IC design and library development. Thisarticle focuses on the techniques we canemploy to reduce the power dissipation atthe IC level.

Reducing Dynamic PowerDissipationDynamic power dissipation can be repre-sented as

Where af is the activity expressed as afunction of frequency, V is the supply volt-age and C is the capacitance beingswitched.

Clearly, by reducing the switching activity,the voltage or the operating frequency ofthe design, we will reduce the dynamicpower dissipation and by simultaneouslyreducing all three we will realize signifi-cant power reduction.

The most obvious way to reduce theswitching activity of a design is to drop thefrequency. However, there are many other

DynamicPower ≡ af x C x V2

16



well documented techniques that can beemployed within a design to help reducethe switching activity at a given frequen-cy, and include: Multi-Level Clock gating,Operand Isolation, Pin Swapping,Technology mapping (hiding high togglerate nets within more complex cells), andFactoring.

Various combinations of these techniquesare commonly employed in CMOS designstoday, however, with the exception ofclock gating, these techniques have a mar-ginal impact on the overall power dissipa-tion. By far the greatest gain in dynamicpower reduction can be achieved by reduc-ing the voltage at which the design oper-ates. Reducing the voltage would necessar-ily reduce the performance of the design,so if we can couple this reduction in volt-age with a corresponding reduction in fre-quency then we get an almost cubic reduc-tion in the dynamic power dissipation ofthe design.

The architecture of the ARM1176JZF-Sprocessor can be easily partitioned intoseparate voltage domains or islands andcan operate synchronously where eachdomain is at the same voltage or asyn-chronously where the domains are at dif-ferent voltage levels and these voltage lev-els can change over time through interac-tion with the operating system.

When operating asynchronously, thedomains communicate through levelshifters that sit on the domain interfaces.When operating synchronously, theselevel shifters can be bypassed restoring thecycles per instruction (CPI) performance ofthe core. The ARM1176JZF-S offers the

designer both high performance when insynchronous mode coupled with a signifi-cant power savings when operating asyn-chronously during times where the workload is light.

Reducing static power dissipation Addressing the dynamic power dissipationof a high performance microprocessor iscertainly going to provide significantpower savings to the designer. However, asprocess migration passes 65nm and con-tinues towards 45nm where the operatingvoltages are lower, and the switchingthresholds roll-of more rapidly, staticpower dissipation is expected to exceeddynamic power dissipation and becomethe dominant contributor to the totalpower of a device. Static power minimiza-tion must be considered as an integralpart of the power reduction strategy.

Static power dissipation is dominated bysub-threshold leakage current through theindividual transistors and although smallin magnitude, the cumulative effect of thisleakage current in a system chip with hun-dred’s of millions of transistors is signifi-cant and cannot be ignored.

The leakage current through a transistor isa combination of a number of compo-nents (including sub threshold and gate-oxide leakage), this can be approximated:

One important point about this equationis that it shows that static power dissipa-tion has an exponential dependence ontemperature. This means that as the chipheats up, its static power dissipationincreases exponentially. Furthermore wesee that static power dissipation has aninverse exponential dependence on theswitching threshold of the transistors.However, as mentioned previously, thechallenge to the designer is to design to asignificantly reduced power budget whilealso maintaining high levels of perform-

ance for the design. Increasing the switch-ing threshold of a transistor has the effectof increasing delay through the device andconsequently reducing the performance.

The ability to use multi-threshold transis-tors in a design is an excellent techniquefor reduction of static power dissipation.Low (or regular) threshold transistors areused on timing critical parts of the designand high threshold transistors on non-crit-ical paths to minimize leakage.

Power Gating In addition to the use of multi-thresholdlibraries, power gating can be used to fur-ther reduce the effects of leakage power.Leakage is state dependant, but unlikedynamic power it is not activity depen-dant. Therefore even when a device has noswitching activity it is still dissipating leak-age power, Multi-Threshold CMOS (MTC-MOS) switch cells can be used to isolatespecific regions of the design. Theseregions can then be powered down wheninactive to significantly reduce leakagepower. The ARM1176JZF-S architecturelends itself well to this approach where thedesign is partitioned into multiple voltagedomains with active regions and MTC-MOS regions defined. MTCMOS switchcells are inserted into the power mesh inthe MTCMOS region. These switch cellsare enabled by a sleep signal sourced fromthe active region. When enabled theseswitch cells disconnect the inactive part ofthe design from the power network. Thisswitch cell isolates the logic from thepower mesh reducing the leakage current.

Comprehensive, integrated tool flowImplementation of an ARM1176JZF-Sprocessor with optimal performance thatsupports the previously mentioned tech-niques to reduce power, places a signifi-cant burden on the design flow and tech-nology/library combination used. Thedesign flow must be capable of achievingexcellent performance with a traditionalSI aware approach for nanometer technol-ogy while automating the handling ofmultiple voltage domains and minimiz-ing leakage. Specifically, support isrequired for automatic insertion of levelshifters and isolation cells between voltagedomains, domain specific optimization,power grid synthesis, multi-Vt optimiza-tion and switch cell insertion.

Managing these low power techniquesmust be an integral part of the optimiza-tion flow and cannot be an afterthought.Figure 1: ARM1176JZF-S Voltage Domains

Figure 2: Static vs. Dynamic Power with processMigration

Leakage ≅ exp (-qVth)kT

17



Simultaneous optimization for timing,area, power and SI is a minimum require-ment for an ARM1176JZF-S design flow.Further, in order to meet aggressive time tomarket requirements, simultaneous opti-mization for these design characteristicsrequires a unified data model with supportfor concurrent processing. This level ofintegration extends to the analysis envi-ronment. Lack of integration betweenimplementation and analysis tools canlead to a significant time penalty inresolving false errors and inconsistentdata that can force designers to overcom-pensate in certain areas of the flow result-ing in a sub-optimal implementation.

Magma Design Automation’s Blast FusionRTL-to-GDSII design flow enables thedesigner to meet this challenge todaythrough the ability to continually opti-mize for timing, power and area throughall phases of the design flow. Blast Fusionintegrates a number of focused solutionsaround a unified data model that enablesthe optimization, implementation andanalysis engines to get immediate accessto continually updated logical, physical,timing and power data. This single passapproach allows the engines to makeinstant decisions that ensure optimalresults. Specifically, Blast Create is used forsynthesis, Blast Plan Pro for prototyping,Blast Noise for signal integrity, Blast Railfor power integrity and Blast Power, forpower management.

Blast Power forms the heart of the imple-mentation flow for the ARM1176JZF-S andis used to provide a complete low powerimplementation solution supportingpower aware synthesis, leakage power

minimization with Multi-Vt libraries,automated support for dynamic voltageand frequency scaling, including auto-mated level shifter insertion, domainbased optimization and multi-corner opti-mization. In addition, Blast Power pro-vides automated power grid synthesis andcan automatically insert and optimizedecap cells based on transient voltagedrop analysis.

Multiple Voltage Domains Blast Power provides a domain basedmethodology to handle multiple voltagedomains within a design. Using domainswith associated floorplans a design can bepartitioned into a number of regions eachoperating at different voltages and fre-quencies.

Through the specification of domains, thedesigner is able to identify the power andground nets, the nature of the power sup-ply (constant, switched, variable etc.) andother salient process and temperaturecharacteristics. By associating domainswith floorplans, the designer is able toidentify the logical to physical relation-ship and identify which cells are attachedto which domain and the physical loca-tion of the domain within the design. Thedomains and floorplans are maintainedthrough synthesis and physical optimiza-tion as well as during timing and poweranalysis. New cells added during theimplementation process are automaticallyattached to the correct domain and con-nected to the corresponding power supply.

Having partitioned the design into specificvoltage islands (domains) and physicalregions (floorplans) Blast Power can auto-matically determine which interfaces needto be level shifted and insert the appropri-ate type of level shifter and/or isolationcell. The sensitivities associated with levelshifter insertion, such as buffering and sec-ondary power supply routing are handledautomatically within the tool.

This automated domain based approachsignificantly reduces the complexity of themulti-voltage implementation process.

Leakage Mitigation Typically a multi-Vt library contains twoor more versions of the same standard cellset: one set contains high-Vt cells and theother contains low-Vt cells. Blast Powerautomatically reduces leakage current inthe design by using high-Vt (slow, lowerleakage) cells for non-critical paths and

low-Vt (faster, higher leakage) cell for thetiming critical paths in the design.Through the unified data model and inte-grated analysis environment, concurrentoptimization for both timing and leakageis possible to yield the most optimal imple-mentation. Unlike conventional flows, theBlast Power multi-Vt flow performs leak-age optimization at different stages of thedesign flow resulting in superior QoR.

Multiple Corner Analysis Technology libraries are characterized forleakage and dynamic power for all possi-ble arcs that can be exercised. For multi-corner analysis traditional flows handleone PVT/library at a time leading to mul-tiple sequential runs. Depending on thenumber of corners for the design this couldpotentially result in long run times, multi-ple design iterations and cause conver-gence problems. Magma uses the supercorner approach to concurrently analyzeand optimize multi corner designs muchfaster, helping cover timing criticality forall corners. When performing concurrentoptimization and analysis at differentoperating corners the correct selection andderating of the characterization data isrequired.

When changing the voltage of a supplynet, the timing and power behavior of allcells supplied by that net changes. If thevoltage differs only slightly from the char-acterization value then derating may suf-fice. Blast Power supports numerous derat-ing methods, including k-factor, polyno-mial models, and support for ECSM char-acterized libraries.

However, to improve accuracy and avoidderating altogether, the cell library can becharacterized at many different operatingconditions. This more extensive character-ization data is then used by BlastPower to target the operating conditionof choice. Figure 3: Comprehensive RTL to GDSII Solution

Figure 4: ARM1176JZF-S Physical Domains

18

1650 Technology Drive, San Jose, CA 95110 USA | Tel: 408-565-7500 | Fax: 408-565-7501 | www.magma-da.com

© 2008 Magma Design Automation,Inc.All rights reserved. Magma is a registered trademark of Magma Design Automation.

All other product and company names are trademarks or registered trademarks of their respective companies. 10/08

Reining in Time-To-Market for Next Generation Embedded Design€¦ · based Solutions...

Documents

Transcript of Reining in Time-To-Market for Next Generation Embedded Design€¦ · based Solutions...