Download - Hgh Perf Arm

8/8/2019 Hgh Perf Arm

1/4

StrongARM: A High-PerformanceARM Processor

Rich Witek and James MontanaroDig al Equipment Corporation

AbstractA 32-bit 162MHzI215MHz custom VLSI ARM(tm)microprocessor is described. The chip contains two1GKbyte, 32-way set associative caches fo r instructionsand data. The 2.1M transistor chip is abrica ted i n a 2.0V,

0.35p m, 3-layer metal CMOS rocess. I t dissipates 0.5Wat 162M Hzl1.5~ nd 1.1W at 215MHzI2.0~

IntroduetionStrongARM 110 (un), a custom VLSI implementationof the ARM (tm) architecture, delivers 184Drystone/MIPS at 162MHz while dissipating0.5W usingan intemal supply of 1.5V. In applications which requirehigher performance, the chip may be operated at 215MHzwith a 2.0V internal supply dissipating 1.1W. The extemalinterface always runs at 3.3V. The die contains 2.1million transistors and measures 7.8" x 64" It isfabricated in a 2.0V, 0.35pm drawn (0.25pm effective), 3-layer metal CMOS process and packaged in a 144-pin thinquad flat pack. Clock generation is accomplished using anon-chip PLL with 3.68MHz input clock to minimize highfrequency clock signals on the board 111. The chip ispseudo-static and clocks may be stopped in either phase tominimize power consumption

Processor OverviewThe major elements of the processor are shown in theblock diagram (figure 1).The register file has three read ports and two write ports.The three read ports were chosen to simplify the controlby providing all the arguments to most instructions in asingle cycle. The two write ports allowed mostinstructions to be fully pipelined and write their results ina single cycle.The EBOX contains a single-cycleALU with a full 32-bit bidirectional shifter on one of the input operands. Inone cycle, the EBOX can perform a 0 to 32-bit shift, a 32-bit ALU operation, and compute the condition codes of

1063-6390/96$5.000 1996 IEEEProceedings of COMPCON '96

zero, overflow, carry, and negative on the result in time toaffect a branch in the Issue stage of the same cycle.

n

Figure 1:StrongARM Block DiagramThe EBOX also contains a 32x32-bit multiplier. Themultiplier consist of a 12x32-bit carry save multiplierarray which is used for one to three cycles depending onone of the input operands and a 32-bit final adder toreduce the carry save result. For Multiply Accumulateinstructions, the accumulate value is inserted into thearray so that an additional cycle for the add operation isnot required.The processor contains separate 32 entry, fullyassociative translation buffers(TB) or instruction fetchesand data accesses. Each entry in the TB can map a 1Mbytesection, or a 64Kbyte large page,or a 4Kbyte small page.TB fills are performed by hardware without using asoftware exception routine.The processor features separate 16Kbyte, 32-way setassociative virtual caches for instructions (Icache) anddata (Dcache). The split Instruction and Data cachessimplify the control and allow full pipelining of load andstore instructions. The Dcache is writeback and stores thephysical address along with the data at fill time. Thestored physical address is used when th e block is castout

188


2/4

of th e cache so a translation is not need at replacementtime. The Dcache has one dirty bit for each 16 bytesubblock. This reduces the amount of data written back tomemory when only part of the block has written. Eachcache is implemented as 16 fully associative blocks. Thisallows a small section of the cache to be powered up andonly the needed 32-bit word to be read without resortingto a direct mapped cache.The processor includes a write buffer with 8 entries of16bytes. The write buffer is used to buffer stores that missin the Dcache and castouts from the data cache. Loadmisses are allowed to pass stores in the write buffer. Eachblock of the write buffer has a physical address and acomparator to check for load to a piece of data currently inthe write buffer. In the case of a load hit in the writebuffer, the load is stalled until the write is completed.Processor Pipeline

,

The processor is a single issue design with a classical 5stage pipeline. The stages of the pipeline are Fetch 0,Issue (I), Execute (E), Buffer and cache access (B), andresult Write (W) as show in figure 2.

Figure 2: StrongARM PipelineThe instruction unit (IBOX) operates in the first 2 stages(Fetch and Issue) of the pipeline. In th e Fetch cycle, thenext instruction is fetched from the instruction cache andlatched into the instruction buffer. In the Issue stage, allregister dependencies and execution unit (EBOX)availabilities are checked and the instruction is issued ifthere are no conflicts. If there are conflicts, the instructionis held in the IBOX until al l the dependencies havecleared. Result forwarding is provided for all resultsbetween their generation and register file writing.Most instructions only need one Issue cycle. The LoadMultiple, Store Multiple, Swap, and Multiply Longinstructions need additional cycles. The IBOX holds theseinstructions while advancing the rest of the pipeline to

decompose these complex instructions to theirfundamental single issue cycle operations. The LoadMultiple and Store Multiple instructions are decomposedby the IBOX into a series of single Loadstore instructionsas seen by the rest of the machine. Likewise the Swapinstruction is decomposed into a Load and a Store. TheMultiply Long and Multiply Long Accumulate (MLA)

instructions use a second issue slot to acquire a secondregister file write slot for the second 32 bits of the result.The' MLA instruction performs an B=A*X+B operationwhere A and X are 32 bits, and B is 64 bits. The highorder 32bits of B are read and the second register file writeslot is reserved in the second issue cycle of a MLA .The IBOX performs PC-relative branches andsubroutine calls, and MOV to PC (return instructions) inthe Issue stage. A dedicated PC adder is used along withthe PC incrementer to perform these operations. TheIBOX can resolve conditional branches in the Issue :stageeven when th e condition codes are being updated iin th ecurrent Execute cycle. By providing this optimized path,the IBOX is able to turn branches with only a one cyclepenalty and no cycles lost for branch not taken. Byhandling the branches very early in the pipe the need forbranch prediction hardware was elhinated (Figure 3).

10 0ADDS

Figure 3: Branch PipelineFor Load and Store operations the effective address iscalculated in the E stage of th e pipe and the cache andtranslation buffers (TB) are accessed in the B stage. Thecache is accessed in a single cycle for both reads andwrites. On writes the cache is first read and the data islatched in the sense amps. If the write is aborted due tomemory management unit (MMU) access violations, theoriginal data is written back in the cycle that follows theexception while the first instruction of th e exceptionhandler is being fetched. The Dcache provides a 2-cyclelatency for load return data to the register fib.

Low Power DesignOur past designs have emphasized performance withoutsignificant concem for power dissipation.[2,3] Since itwas intended for the portable market, this design neededthe maximum performance possible for 0.5W or less..This

goal was addressed through a variety of avenues:Reducing the power supply voltage was paramount. Theuse of a 2.0v, 0.35mp process with 0 . 3 5 ~ evicethresholds joined with traditional high-performancecustom design techniques to enable a high speed design,even at 1 3 .

189


3/4

Conditional clocking was pursued aggressively fromthe beginning of the design. While this created some of themost challenging critical paths in the design,it allowed usto reduce switching and, in some cases, eliminate latches.Edge-triggered latches were used through the majority ofthe design to minimize the gate load on the clock. Poweris further reduced by running the internal clocks off theslower bus clock during cache fills.The constraints on total power led to limiting the clockfrequency and moving instead to a longer tick model ofthe design. One single cycle path includes a 32-bit shift,32-bit add and setting condition codes to direct the PCMUX in the next cycle. Another is the 16K associativecache access, from address in to data out in a single cycle.In past designs driven solely by performance, direct-mapped instruction and data caches havebeen used sincetheir shorter access time provides higher overallperformance, despite the higher hit rate of an associativecache. The long cycle time of this design allows time forthe associative cache access and the increased hit rateminimizes pin power and increase performance.

The choice of internal supply voltage and clockfrequency was made based on power analysis of an earlierdesign.[2] This analysis indicated that we could meet the0.5W goal with largely the same circuit techniques usedon past chips, adapted slightly to make the chip fullystatic. This conclusion was dependent on the low-voltagecharacteristics of the process, particularly the low Vts.However, the leakage characteristics of these devicesposed problems for the powerdown modes and oncertain pseudo-dynamic nodes. The chip supports tw opowerdown modes, Idle (2OmW) and Sleep (50uA). InIdle, the PLL is running but the internal clock grid isstopped. The cache array devices were lengthened toreduce their leakage so that the power requirement for Idlecould be met. In Sleep, the leakage requirement is muchtighterso it is addressed by turning off the intemal powersupply. The I/O circuitry is powered in Sleep so that wecan drive valid dataon the pins.Clocking

In an earlier design[2], the clock was distributed as asingle monolithic node. While this technique hasadvantages with respect to speed and simplicity, it is notappropriate on a low-power chip using conditionalclocking. On this design, the internal clock grid is astandard H-tree distribution systemwith minor variations.Since the pins can ru n asynchronous to the core, there is aseparate clock region associated with the pin circuitry.Inverter buffers were used in both distribution systems toallow use of narrow wires and reduce the grid capacitance.The chip has five clock regions, the pins, bus interfacecontrol and write buffer, core (integer unit and memorymanagment units), Icache, and Dcache. The first tworegions run off the bus clock while the final three areconnected to the internal clock grid. During cache fills, theinternal clock grid is clocked by the bus clock rather than

the PLL to save power and to minimize thesynchronization time during fills. The write bufferreceives clocks from the internal clock grid and the busclocks since during stores it must pass data across thisboundary.Latches

Differential edge-triggered latches were used to reducetotal latch delay and reduce clock load. The majority ofthe latching occurs on the rising edge of the clock butfalling edge-triggered latches were sometimes used,primarily when crossing clock region boundaries.External Interface

The pin interface supports different modes of operationto allow compatibility with systems using existing ARMchips while enabling system cost savings and performanceimprovements for new systems. The on chip PLLmultiplies the 3.68MHz input clock by 24 to 78 togenerate the high speed core clock. The speed is selectedby four pins. For compatibility with existing ARMsystems the bus clock can be supplied by the system. Fornew designs the processor can drive the bus clock by adivider on the PLL clock. The dividers supported rangefrom 2..9 to provide a large range of bus speeds.Configuration pins allow selection of PLL speed, sourceof bus clock, speed of bus clock and other bus timingoptions. When the processor is driving the bus clock aWait For Interrupt instruction places the processor in Idlemode where the core and bus clock are stopped until anexternal interrupt is asserted.

SummaryThe StrongARM 110 provides a low cost, low power,high performance processor that will enable more complexapplications to be provided in handheld devices. TheStrongARM 110 will also provide a new level ofperformance for low cost embedded processors.

References[l ]von Kaenel, V. et al., A 320MHz, 1.5mW CMOS PLL formicroprocessor clock generation,ISSCC 96, February 1996.[2] Dobberpuhl D., et al., A 200MHz 64b Dual-Issue CMOSMicroprocessor,E E E Journal of Solid State Circuits, No . 11,Vol. 27, November 1992.[3]Bowhill B., et al., A 300MHz 64b Quad-IssueCMOS RISCMicroprocessor,SSCC 1995, February 1995.ARM and StrongARM are registered trademarks ofAdvanced RISC Machines, Ltd.

190


4/4

ACKNOWLEDGMENTSWe would like to acknowledge the contributions of thefollowing people: Krishna Anne, Andrew J. Black,Elizabeth M. Cooper, Daniel W. obberpuhl, Paul M.Donahue, Jim Eno, Alejandro Farell, Gregory W .

Hoeppner, David Kruckemyer, Thomas H. Lee,PeterLin,Liam Madden, Daniel Murray, Mark Pearce, SribalanSanthanam, Kathryn J. Snyder, Ray Stephany, Stephen C.Thierauf, F. Aires, M. Bazor, G. Cheney, K. Chui, M.Culbert, T. Daum, K. Fielding, J. Gee, J. Grodstein, L.Hall, J. Hancock, H. Horovitz, C. Houghton, L. Howarth,D. Jaggar, G. Joe, R. Kaye, I. Kim, . Lum, D. Noorlag, L.O'Donnell, K. Patton, J. Reinschmidt, S . Roberts,A.Silveria, D. Souyadalay,E. Supnet,L. Tran, D. oehrerand the PLL design team at CSEM.The support which we received on many aspects of thedesign from he people at Advanced RISC Machines, Ltd.was very important and appreciated.

191