StrongARM: A High-Performance ARM Processor
Rich Witek and James Montanaro
Digital Equipment Corporation
Abstract
A 32-bit 162MHz/215MHz custom VLSI ARM(tm) microprocessor is described. The chip contains two 16Kbyte, 32-way set associative caches for instructions and data. The 2.1M transistor chip is fabricated in a 2.0V, 0.35µm, 3-layer metal CMOS process. It dissipates 0.5W at 162MHz/1.5V and 1.1W at 215MHz/2.0V.
Introduction
StrongARM 110 (tm), a custom VLSI implementation of the ARM (tm) architecture, delivers 184 Dhrystone MIPS at 162MHz while dissipating 0.5W using an internal supply of 1.5V. In applications which require higher performance, the chip may be operated at 215MHz with a 2.0V internal supply, dissipating 1.1W. The external interface always runs at 3.3V. The die contains 2.1 million transistors and measures 7.8mm x 6.4mm. It is fabricated in a 2.0V, 0.35µm drawn (0.25µm effective), 3-layer metal CMOS process and packaged in a 144-pin thin quad flat pack. Clock generation is accomplished using an on-chip PLL with a 3.68MHz input clock to minimize high frequency clock signals on the board [1]. The chip is pseudo-static and clocks may be stopped in either phase to minimize power consumption.
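As a rough cross-check of the two operating points quoted above, first-order CMOS dynamic power scales with the square of the supply voltage times the clock frequency. The sketch below applies that textbook approximation (an assumption here, not a model taken from this paper) to the 162MHz/1.5V point. The function and variable names are illustrative, and leakage and the 3.3V interface power are ignored.

```python
def scaled_dynamic_power(p_ref_w, v_ref, f_ref_mhz, v_new, f_new_mhz):
    """Scale a reference power by (V_new / V_ref)^2 * (f_new / f_ref)."""
    return p_ref_w * (v_new / v_ref) ** 2 * (f_new_mhz / f_ref_mhz)


# Operating points quoted in the text.
LOW = {"watts": 0.5, "volts": 1.5, "mhz": 162.0}
HIGH = {"watts": 1.1, "volts": 2.0, "mhz": 215.0}

# Predict the high-performance point from the low-power point.
estimate_w = scaled_dynamic_power(
    LOW["watts"], LOW["volts"], LOW["mhz"], HIGH["volts"], HIGH["mhz"]
)

# Performance per watt at the low-power point (184 Dhrystone MIPS at 0.5W).
mips_per_watt = 184 / LOW["watts"]

print(f"estimated power at 215MHz/2.0V: {estimate_w:.2f} W (quoted: {HIGH['watts']} W)")
print(f"Dhrystone MIPS per watt at 162MHz/1.5V: {mips_per_watt:.0f}")
```

The estimate comes out near 1.18W, within about 10% of the quoted 1.1W, which suggests the two quoted operating points are broadly consistent with simple voltage and frequency scaling.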
1063-6390/96 $5.00 © 1996 IEEE. Proceedings of COMPCON '96

Processor Overview
The major elements of the processor are shown in the block diagram (figure 1).

The register file has three read ports and two write ports. The three read ports were chosen to simplify the control by providing all the arguments to most instructions in a single cycle. The two write ports allowed most instructions to be fully pipelined and write their results in a single cycle.

The EBOX contains a single-cycle ALU with a full 32-bit bidirectional shifter on one of the input operands. In one cycle, the EBOX can perform a 0 to 32-bit shift, a 32-bit ALU operation, and compute the condition codes of zero, overflow, carry, and negative on the result in time to affect a branch in the Issue stage of the same cycle.
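The combined shift, ALU operation, and condition code computation described above can be illustrated with a small behavioral model. This is only a sketch of the arithmetic, not of the EBOX circuitry: it covers a left shift followed by an add, ignores the shifter's own carry-out, and all function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left of a 32-bit value by 0 to 32 bits."""
    return (value << amount) & MASK32


def add_with_flags(a, b):
    """32-bit add of two unsigned 32-bit values, returning (result, flags)."""
    total = a + b
    result = total & MASK32
    flags = {
        "N": result >> 31,             # negative: top bit of the result
        "Z": int(result == 0),         # zero: all result bits clear
        "C": int(total > MASK32),      # carry: unsigned overflow out of bit 31
        # overflow: both operands differ in sign from the result
        "V": ((a ^ result) & (b ^ result)) >> 31 & 1,
    }
    return result, flags


def add_shifted(rn, rm, shift):
    """Model of an add whose second operand passes through the shifter."""
    return add_with_flags(rn, lsl(rm, shift))


result, flags = add_shifted(0x7FFFFFFF, 1, 0)
print(hex(result), flags)
```

In the model, the flags are available as soon as the result is, which mirrors the point made in the text: the condition codes must be ready early enough to steer a branch sitting in the Issue stage during the same cycle.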
Figure 1: StrongARM Block Diagram

The EBOX also contains a 32x32-bit multiplier. The multiplier consists of a 12x32-bit carry save multiplier array, which is used for one to three cycles depending on one of the input operands, and a 32-bit final adder to reduce the carry save result. For Multiply Accumulate instructions, the accumulate value is inserted into the array so that an additional cycle for the add operation is not required.

The processor contains separate 32 entry, fully associative translation buffers (TB) for instruction fetches and data accesses. Each entry in the TB can map a 1Mbyte section, a 64Kbyte large page, or a 4Kbyte small page. TB fills are performed by hardware without using a software exception routine.

The processor features separate 16Kbyte, 32-way set associative virtual caches for instructions (Icache) and data (Dcache). The split Instruction and Data caches simplify the control and allow full pipelining of load and store instructions. The Dcache is writeback and stores the physical address along with the data at fill time. The stored physical address is used when the block is cast out of the cache, so a translation is not needed at replacement time. The Dcache has one dirty bit for each 16 byte subblock. This reduces the amount of data written back to memory when only part of the block has been written. Each cache is implemented as 16 fully associative blocks. This allows a small section of the cache to be powered up and only the needed 32-bit word to be read without resorting to a direct mapped cache.

The processor includes a write buffer with 8 entries of 16 bytes. The write buffer is used to buffer stores that miss in the Dcache and castouts from the data cache. Load misses are allowed to pass stores in the write buffer. Each block of the write buffer has a physical address and a comparator to check for a load to a piece of data currently in the write buffer. In the case of a load hit in the write buffer, the load is stalled until the write is completed.

Processor Pipeline
The processor is a single issue design with a classical 5 stage pipeline. The stages of the pipeline are Fetch (F), Issue (I), Execute (E), Buffer and cache access (B), and result Write (W), as shown in figure 2.
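The five stages named above can be visualized with a short sketch that prints the stage each instruction occupies in each cycle. It assumes the ideal case of one instruction issued per cycle with no stalls; the function name and the text layout are illustrative only.

```python
STAGES = ["F", "I", "E", "B", "W"]  # Fetch, Issue, Execute, Buffer, Write


def pipeline_diagram(n_instructions):
    """Return one row per instruction; column k is the stage held in cycle k."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["."] * total_cycles
        # Instruction i enters Fetch in cycle i and advances one stage per cycle.
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append("".join(row))
    return rows


for row in pipeline_diagram(3):
    print(row)
```

With no stalls, n instructions complete in n + 4 cycles, so the pipeline approaches one instruction per cycle once it is full.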
Figure 2: StrongARM Pipeline

The instruction unit (IBOX) operates in the first 2 stages (Fetch and Issue) of the pipeline. In the Fetch cycle, the next instruction is fetched from the instruction cache and latched into the instruction buffer. In the Issue stage, all register dependencies and execution unit (EBOX) availabilities are checked and the instruction is issued if there are no conflicts. If there are conflicts, the instruction is held in the IBOX until all the dependencies have cleared. Result forwarding is provided for all results between their generation and register file writing.

Most instructions only need one Issue cycle. The Load Multiple, Store Multiple, Swap, and Multiply Long instructions need additional cycles. The IBOX holds these instructions while advancing the rest of the pipeline to decompose these complex instructions into their fundamental single issue cycle operations. The Load Multiple and Store Multiple instructions are decomposed by the IBOX into a series of single Load/Store instructions as seen by the rest of the machine. Likewise, the Swap instruction is decomposed into a Load and a Store. The Multiply Long and Multiply Long Accumulate (MLA) instructions use a second issue slot to acquire a second register file write slot for the second 32 bits of the result. The MLA instruction performs a B=A*X+B operation where A and X are 32 bits, and B is 64 bits. The high order 32 bits of B are read and the second register file write slot is reserved in the second issue cycle of an MLA.

The IBOX performs PC-relative branches and subroutine calls, and MOV to PC (return instructions) in the Issue stage. A dedicated PC adder is used along with the PC incrementer to perform these operations. The IBOX can resolve conditional branches in the Issue stage even when the condition codes are being updated in the current Execute cycle. By providing this optimized path, the IBOX is able to turn branches with only a one cycle penalty, and no cycles are lost for a branch not taken. By handling the branches very early in the pipe, the need for branch prediction hardware was eliminated (Figure 3).
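The branch cost stated above (one extra cycle when a branch is taken, none when it is not) can be turned into a simple cycle count. The sketch below is a hedged illustration: the ten-iteration loop is a hypothetical example, not a measurement from the paper, and the function name is an assumption.

```python
def branch_cycles(outcomes, taken_penalty=1):
    """Cycles spent on a sequence of branches.

    Each branch uses one issue cycle; a taken branch adds `taken_penalty`
    cycles, and a branch that is not taken adds none.
    """
    return sum(1 + (taken_penalty if taken else 0) for taken in outcomes)


# Hypothetical loop of ten iterations: the closing branch is taken nine
# times and falls through once on exit.
loop_branch = [True] * 9 + [False]
print(branch_cycles(loop_branch))
```

For this hypothetical loop the branches cost 19 cycles in total, 9 of which are penalty cycles, which gives a feel for why a one cycle penalty was judged small enough to do without branch prediction hardware.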
Figure 3: Branch Pipeline

For Load and Store operations, the effective address is calculated in the E stage of the pipe and the cache and translation buffers (TB) are accessed in the B stage. The cache is accessed in a single cycle for both reads and writes. On writes, the cache is first read and the data is latched in the sense amps. If the write is aborted due to memory management unit (MMU) access violations, the original data is written back in the cycle that follows the exception while the first instruction of the exception handler is being fetched. The Dcache provides a 2-cycle latency for load return data to the register file.
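The geometry of the cache accessed in the B stage follows from the figures given in the processor overview: 16Kbytes, 32 ways, and 16 fully associative blocks. The sketch below works out the implied line size and the number of dirty bits per line. The address split in `split_address` assumes that the low-order line-address bits select the block, which the text does not state; that function and all names are illustrative.

```python
CACHE_BYTES = 16 * 1024   # 16Kbyte cache
WAYS = 32                 # 32-way set associative
BANKS = 16                # "16 fully associative blocks"
SUBBLOCK_BYTES = 16       # one dirty bit per 16 byte subblock (Dcache)

LINES = WAYS * BANKS                                # total cache lines
LINE_BYTES = CACHE_BYTES // LINES                   # bytes per line
DIRTY_BITS_PER_LINE = LINE_BYTES // SUBBLOCK_BYTES  # dirty bits per line


def split_address(vaddr):
    """Split a virtual address into (tag, bank, offset).

    Assumption: the bits just above the line offset select the bank.
    """
    offset = vaddr % LINE_BYTES
    bank = (vaddr // LINE_BYTES) % BANKS
    tag = vaddr // (LINE_BYTES * BANKS)
    return tag, bank, offset


print(LINES, LINE_BYTES, DIRTY_BITS_PER_LINE)
print(split_address(0x1234))
```

Under these figures each line is 32 bytes with two dirty bits, so a castout of a line with a single dirty subblock writes back 16 bytes rather than 32.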
Low Power Design
Our past designs have emphasized performance without significant concern for power dissipation [2,3]. Since it was intended for the portable market, this design needed the maximum performance possible for 0.5W or less. This goal was addressed through a variety of avenues:

Reducing the power supply voltage was paramount. The use of a 2.0V, 0.35µm process with 0.35V device thresholds joined with traditional high-performance custom design techniques to enable a high speed design, even at 1.5V.
Conditional clocking was pursued aggressively from the beginning of the design. While this created some of the most challenging critical paths in the design, it allowed us to reduce switching and, in some cases, eliminate latches. Edge-triggered latches were used through the majority of the design to minimize the gate load on the clock. Power is further reduced by running the internal clocks off the slower bus clock during cache fills.

The constraints on total power led to limiting the clock frequency and moving instead to a longer tick model of the design. One single cycle path includes a 32-bit shift, a 32-bit add, and setting condition codes to direct the PC MUX in the next cycle. Another is the 16K associative cache access, from address in to data out in a single cycle. In past designs driven solely by performance, direct-mapped instruction and data caches have been used since their shorter access time provides higher overall performance, despite the higher hit rate of an associative cache. The long cycle time of this design allows time for the associative cache access, and the increased hit rate minimizes pin power and increases performance.
The choice of internal supply voltage and clock frequency was made based on power analysis of an earlier design. This analysis indicated that we could meet the 0.5W goal with largely the same circuit techniques used on past chips, adapted slightly to make the chip fully static. This conclusion was dependent on the low-voltage characteristics of the process, particularly the low Vts. However, the leakage characteristics of these devices posed problems for the powerdown modes and on certain pseudo-dynamic nodes. The chip supports two powerdown modes, Idle (20mW) and Sleep (50µA). In Idle, the PLL is running but the internal clock grid is stopped. The cache array devices were lengthened to reduce their leakage so that the power requirement for Idle could be met. In Sleep, the leakage requirement is much tighter, so it is addressed by turning off the internal power supply. The I/O circuitry is powered in Sleep so that we can drive valid data on the pins.

Clocking
In an earlier design, the clock was distributed as a single monolithic node. While this technique has advantages with respect to speed and simplicity, it is not appropriate on a low-power chip us