A Loop Accelerator for Low Power Embedded VLIW Processors

Binu Mathew, Al Davis
School of Computing, University of Utah
Deepika Ranade, Sharanna Chowdhury


"Super" DSP
- Clustered execution units
- VLSI rocks
- ILP limitation
- Improve throughput
- Timely data delivery to execution units
- Limited memory, power
- VLIW
- Specialized on-chip memory system: scratch pad memory, cache banks, bit-reversal and auto-increment addressing modes
- SW control

Authors' solution: Sneak peek
- Multiple SRAMs
- Distributed address generators
- Loop acceleration unit
- Array variable renaming

VLIW Processor Architecture
(Block diagram. Labels: PC; microcode memory / I-cache; decode stage; loop unit; Address Generator x 2 (shown three times); SRAM 0, SRAM 1, ..., SRAM n; interconnect; function units x 8.)

Issues with load/store ports
- A limited number of load/store ports limits performance and data availability
- A large number of SRAM ports is needed to feed data to the function units efficiently
- BUT a large number of ports degrades access time and power consumption
- Solution: banking
  - Multiple small software-managed scratch SRAMs
  - Power down unused SRAMs

Working
1. VLIW instruction bundle
2. Load/store decoded
3. Issued to the address generators
   (VLIW execution unit: loop unit)
4. The compiler configures the loop unit before entering loop-intensive code
5. The loop unit works autonomously
6. When the PC passes the first instruction of the loop body, the loop count is incremented; the count values are used by the address generators (AGs)

Instructions Needed
- write context instruction
  - Transfers loop parameters and data access patterns to the context registers (in the loop unit and the address generators)
- load.context and store.context instructions
  - Enhanced versions of the load/store instructions
  - Two fields, context index and modulo period, in the immediate constant field
  - Context index: controls address calculation
- push loop instruction
  - Used by the compiler to inform the hardware that a nested inner loop is being entered

Modulo Scheduling
- The loop unit offers hardware support for modulo scheduling
- Software pipelining
- High loop performance
- Execution of a new instance of the loop body every II cycles
- A non-modulo-scheduled loop is converted to a modulo-scheduled loop with II = N
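The folding of a loop body into an II-cycle kernel can be illustrated with a small Python sketch. It is not the authors' tool flow: the function name `modulo_schedule`, the four-instruction loop body, and the reading of "modulo period" as the integer quotient of the original cycle by II are all assumptions made for illustration.

```python
from collections import defaultdict


def modulo_schedule(body, ii):
    """Fold a one-iteration schedule into a kernel of `ii` cycles.

    body: list of (cycle, instruction_name) pairs for one loop iteration.
    Returns {kernel_cycle: [(instruction_name, stage), ...]}.
    """
    if ii <= 0:
        raise ValueError("II must be positive")
    kernel = defaultdict(list)
    for cycle, name in body:
        slot = cycle % ii    # wrap-around position inside the kernel
        stage = cycle // ii  # assumed meaning of "modulo period"
        kernel[slot].append((name, stage))
    return dict(kernel)


# Hypothetical loop body of N = 4 cycles, one instruction per cycle.
body = [(0, "load"), (1, "mul"), (2, "add"), (3, "store")]

# With II = 2 a new iteration starts every 2 cycles, so each kernel
# cycle holds instructions from two different iterations.
kernel = modulo_schedule(body, ii=2)

# With II = N nothing overlaps: the kernel is the original loop body.
unpipelined = modulo_schedule(body, ii=4)
```

With II = 2, kernel cycle 0 holds `load` from the newest iteration and `add` from the previous one, which is the overlap that software pipelining relies on.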

Modulo Scheduling (cont.)
- Figure: a loop body of N cycles, with II < N
- Original loop body => modulo scheduled
- Replicating instructions
  - An instruction scheduled in cycle n is replicated so that it appears every II cycles
  - This amounts to pasting a new copy of the loop body at intervals of II cycles over the original loop body
  - All instructions that appear after cycle N wrap around
  - For an instruction scheduled for cycle n, n/II => modulo period

Modulo Scheduling: Context registers
- Compiler: static parameters
  - II
  - Loop count limits
- Loop counter register file: dynamic values of the loop variables
- PC -> loop body ??

Loop Unit
(Block diagram. Labels: loop contexts (loop_type, II, start_count, end_count, increment); loop stack; II count register; loop count registers x 4; two 2x1 MUXes; +1 and + adders; = comparators. Signals: current loop, clear, write enable, start_count, end_count, increment, next count, push loop, pop loop.)

Loop Unit: prior to starting a loop-intensive section of code
- Loop parameters are written with write context

Loop Unit: on entry into each loop body
- push_loop with the index of the context register
- Top of stack = current loop body
- Reset!

Loop Unit: when the end count of the loop is reached
- Innermost loop completed
- pop_loop: the top entry is popped off the stack
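The push/pop behaviour of the loop unit can be mimicked in software. The Python model below is only a sketch under stated assumptions: the class `LoopUnit`, its method names, the helper `run_nested`, the positive increment, and the exclusive end count are all hypothetical choices, and zero-trip loops are not handled. It does not describe the actual hardware timing.

```python
class LoopUnit:
    """Toy model of loop contexts, loop stack and loop count registers."""

    def __init__(self):
        self.contexts = {}  # context index -> (start_count, end_count, increment)
        self.stack = []     # context indices of active loops; last = current loop
        self.counts = {}    # context index -> current loop count

    def write_context(self, index, start_count, end_count, increment=1):
        # Done by the compiler before the loop-intensive code starts.
        self.contexts[index] = (start_count, end_count, increment)

    def push_loop(self, index):
        # On loop entry: the loop becomes top of stack and its count is reset.
        self.stack.append(index)
        self.counts[index] = self.contexts[index][0]

    def step(self):
        """Advance the current loop; pop it and return False when it ends."""
        index = self.stack[-1]
        _, end_count, increment = self.contexts[index]
        next_count = self.counts[index] + increment
        if next_count >= end_count:  # assumption: end_count is exclusive
            self.stack.pop()         # innermost loop completed
            return False
        self.counts[index] = next_count
        return True


def run_nested(unit, outer, inner):
    """Walk a two-level loop nest and record the (outer, inner) counts."""
    visited = []
    unit.push_loop(outer)
    while True:
        unit.push_loop(inner)  # nested inner loop entered
        while True:
            visited.append((unit.counts[outer], unit.counts[inner]))
            if not unit.step():
                break
        if not unit.step():    # the outer loop is current again after the pop
            break
    return visited


unit = LoopUnit()
unit.write_context(0, start_count=0, end_count=2)  # outer loop: i = 0, 1
unit.write_context(1, start_count=0, end_count=3)  # inner loop: j = 0, 1, 2
visited = run_nested(unit, outer=0, inner=1)
```

The recorded counts are the values that the address generators would consume; the stack is empty again once the outer loop has been popped.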

Stream address generator
- Inputs: loop unit counters; address context register file (base address, row size)
- How the compiler generates addresses (word-oriented addressing):
  - elem size => size of the Complex struct
  - row size => elem size * N
  - offset => offset of imag within the struct
  - Base_imag = Base_A + offset
  - Load into t1 from Base_imag + i * row size + j * elem size
- Vector = 1-D array: row size = 0
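The address arithmetic above can be checked with a short Python sketch. The function name `stream_address`, the base address 100, N = 4, and the assumption that each field of the Complex struct occupies one word are all illustrative choices, not values from the slides.

```python
def stream_address(base, row_size, elem_size, i, j, offset=0):
    """Word address of field `offset` in element [i][j] of a row-major array."""
    return base + offset + i * row_size + j * elem_size


# Hypothetical layout: struct Complex { real; imag; }, one word per field.
N = 4
ELEM_SIZE = 2            # size of the Complex struct in words
ROW_SIZE = ELEM_SIZE * N
IMAG_OFFSET = 1          # offset of imag within the struct
BASE_A = 100             # made-up base address of A

# Fill a flat word-addressed memory so each word records what it holds.
memory = [None] * (BASE_A + N * N * ELEM_SIZE)
for row in range(N):
    for col in range(N):
        word = BASE_A + (row * N + col) * ELEM_SIZE
        memory[word] = ("real", row, col)
        memory[word + 1] = ("imag", row, col)

# Address of A[2][3].imag: 100 + 1 + 2*8 + 3*2
addr = stream_address(BASE_A, ROW_SIZE, ELEM_SIZE, i=2, j=3, offset=IMAG_OFFSET)

# A vector is the 1-D case: row size = 0, so only j contributes.
vec_addr = stream_address(BASE_A, 0, ELEM_SIZE, i=5, j=3)
```

Because the row size is zero in the vector case, the outer loop count has no effect on the address.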


Stream address generator (2)
- C is row major
- Select the correct loop variables
- Address calc (2)
- Data packing
- Complex index P*i + Q: Q => base address, P => row size
- Column walk: A[j][i]
- If row size and elem size are powers of two: address = Base + [(i