Exploiting Vector Parallelism in Software Pipelined Loops
Transcript of Exploiting Vector Parallelism in Software Pipelined Loops
-
Multimedia Extensions
  Short vector extensions in ILP processors
    AltiVec, 3DNow!, SSE, etc.
  Accelerate loops in multimedia & DSP codes
  New designs have floating point support
-
Multimedia Extensions
  Vector resources do not overwhelm the scalar resources
    Scalar: 2 FP ops / cycle
    Vector: 4 FP ops / cycle
  Full vectorization may underutilize scalar resources
  ILP techniques do not target vector resources
  Need both
  Courtesy of International Business Machines Corporation. Unauthorized use not permitted.
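A back-of-the-envelope sketch of why both kinds of resources are needed, using the slide's rates (2 scalar and 4 vector FP ops per cycle). It assumes every FP op is independent, can run on either side, and needs no communication. The helper names `cycles_for_split` and `best_split_cycles` are hypothetical.

```c
#include <assert.h>

/* Rates taken from the slide: 2 scalar and 4 vector FP ops per cycle. */
enum { SCALAR_OPS_PER_CYCLE = 2, VECTOR_OPS_PER_CYCLE = 4 };

/* Ceiling division for non-negative work and a positive rate. */
static int ceil_div(int work, int rate) {
    assert(work >= 0 && rate > 0);
    return (work + rate - 1) / rate;
}

/* Cycles needed when vector_ops of total_ops run on the vector side and the
 * rest run on the scalar side; both sides are assumed to work in parallel. */
int cycles_for_split(int total_ops, int vector_ops) {
    assert(vector_ops >= 0 && vector_ops <= total_ops);
    int scalar_cycles = ceil_div(total_ops - vector_ops, SCALAR_OPS_PER_CYCLE);
    int vector_cycles = ceil_div(vector_ops, VECTOR_OPS_PER_CYCLE);
    return scalar_cycles > vector_cycles ? scalar_cycles : vector_cycles;
}

/* Try every split and return the smallest cycle count. */
int best_split_cycles(int total_ops) {
    int best = cycles_for_split(total_ops, 0);
    for (int v = 1; v <= total_ops; v++) {
        int c = cycles_for_split(total_ops, v);
        if (c < best) best = c;
    }
    return best;
}
```

Under these assumptions, 12 ops take 6 cycles all-scalar, 3 cycles fully vectorized, and 2 cycles with 8 ops on the vector side and 4 on the scalar side.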
-
Modulo Scheduling
  for (i=0; i
-
Traditional Vectorization
  for (i=0; i
-
Vectorization without Distribution
  for (i=0; i
-
Selective Vectorization
  for (i=0; i
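The loop shown on these slides did not survive in the transcript, so the following is a stand-in sketch on a hypothetical two-statement loop. It illustrates the idea of selective vectorization: within one loop body (no loop distribution), the multiply is done as a 2-element vector op while the add stays scalar and is unrolled twice. The `v2d` type and its helpers only emulate a short vector in plain C; all names are assumptions.

```c
#include <assert.h>

/* Emulated vector of two 64-bit elements (stand-in for a real vector type). */
typedef struct { double lane[2]; } v2d;

static v2d v2d_load(const double *p) { v2d v = {{ p[0], p[1] }}; return v; }
static void v2d_store(double *p, v2d v) { p[0] = v.lane[0]; p[1] = v.lane[1]; }
static v2d v2d_mul(v2d a, v2d b) {
    v2d r = {{ a.lane[0] * b.lane[0], a.lane[1] * b.lane[1] }};
    return r;
}

/* Reference loop: every operation is scalar. */
void kernel_scalar(int n, const double *a, const double *b, const double *c,
                   double *x, double *y) {
    for (int i = 0; i < n; i++) {
        x[i] = a[i] * b[i];
        y[i] = a[i] + c[i];
    }
}

/* Selectively vectorized loop: one body covers two iterations. */
void kernel_selective(int n, const double *a, const double *b, const double *c,
                      double *x, double *y) {
    assert(n % 2 == 0);  /* sketch only: no cleanup loop for odd n */
    for (int i = 0; i < n; i += 2) {
        /* Vector side: the multiply for iterations i and i+1. */
        v2d_store(&x[i], v2d_mul(v2d_load(&a[i]), v2d_load(&b[i])));
        /* Scalar side: the add stays scalar, repeated for both iterations. */
        y[i]     = a[i]     + c[i];
        y[i + 1] = a[i + 1] + c[i + 1];
    }
}
```

Both functions compute the same results; the second one spreads the work across the vector unit and the scalar FPUs so that a modulo scheduler can overlap them.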
-
Complications
  Complex scheduling requirements
    Particularly in statically scheduled machines
  Memory alignment
  Example assumes no communication cost
    In reality, explicit operations required
    Often through memory
    Reserve critical resources
    Potential long latency
  Performance improvement still possible
-
Tomcatv main loop (50%)
-
Tomcatv (SpecFP 95)
  1.7x speedup over modulo scheduling

  Issue Width     6
  Memory Units    2
  ALUs            4
  FPUs            2
  Vector Units    1
  Vector Length   2*

  Technique                 ALU  MEM  FPU  VEC
  Modulo Scheduling           6   22   46    0
  Full Vectorization          7   13    0   46
  Selective Vectorization     7   27   19   27
-
Tomcatv (SpecFP 95)
-
Selective Vectorization
  Balance computation among resources
    Minimize II when loop is modulo scheduled
  Carefully manage communication
    Incorporate alignment information
    Software pipelining hides latency
  Adapt a 2-cluster partitioning heuristic
    [Fiduccia & Mattheyses 82]
    [Kernighan & Lin 70]
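A minimal sketch of what a move-based two-way partitioning pass could look like here, loosely in the style of Fiduccia and Mattheyses: each pass moves every free op once between the scalar and vector sides, always taking the cheapest move, and keeps the best partition seen. This is not the authors' heuristic or cost function. The `Loop` structure, the cost model (FPU, vector, and memory pressure only, with 1 vector plus VL scalar memory ops per value that crosses sides), and all names are assumptions; the unit counts come from the machine table in the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_OPS 32

/* Vector length and unit counts follow the machine table in the slides. */
enum { VL = 2, FPUS = 2, VEC_UNITS = 1, MEM_UNITS = 2 };

typedef struct {
    int n_ops;                       /* FP operations in one loop iteration */
    bool vectorizable[MAX_OPS];      /* false: the op must stay scalar */
    int n_edges;                     /* values flowing between those ops */
    int src[MAX_OPS], dst[MAX_OPS];  /* producer and consumer of each value */
} Loop;

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Assumed cost: projected II of a kernel that covers VL iterations. */
int partition_cost(const Loop *l, const bool in_vec[]) {
    int fpu = 0, vec = 0, mem = 0;
    for (int op = 0; op < l->n_ops; op++) {
        if (in_vec[op]) vec += 1;    /* one vector op covers VL iterations */
        else fpu += VL;              /* a scalar op is replicated VL times */
    }
    for (int e = 0; e < l->n_edges; e++)
        if (in_vec[l->src[e]] != in_vec[l->dst[e]])
            mem += VL + 1;           /* assumed transfer through memory */
    int cost = ceil_div(fpu, FPUS);
    if (ceil_div(vec, VEC_UNITS) > cost) cost = ceil_div(vec, VEC_UNITS);
    if (ceil_div(mem, MEM_UNITS) > cost) cost = ceil_div(mem, MEM_UNITS);
    return cost;
}

/* Improves in_vec in place (start with every op scalar); returns its cost. */
int selective_partition(const Loop *l, bool in_vec[]) {
    assert(l->n_ops <= MAX_OPS && l->n_edges <= MAX_OPS);
    int best = partition_cost(l, in_vec);
    bool improved = true;
    while (improved) {
        improved = false;
        bool locked[MAX_OPS] = { false };
        bool trial[MAX_OPS] = { false };
        memcpy(trial, in_vec, (size_t)l->n_ops * sizeof(bool));
        for (int step = 0; step < l->n_ops; step++) {
            int pick = -1, pick_cost = 0;
            for (int op = 0; op < l->n_ops; op++) {
                if (locked[op] || !l->vectorizable[op]) continue;
                trial[op] = !trial[op];          /* tentatively move op */
                int c = partition_cost(l, trial);
                trial[op] = !trial[op];          /* undo the tentative move */
                if (pick < 0 || c < pick_cost) { pick = op; pick_cost = c; }
            }
            if (pick < 0) break;                 /* nothing left to move */
            trial[pick] = !trial[pick];          /* commit, even if worse */
            locked[pick] = true;
            if (pick_cost < best) {              /* remember the best prefix */
                best = pick_cost;
                memcpy(in_vec, trial, (size_t)l->n_ops * sizeof(bool));
                improved = true;
            }
        }
    }
    return best;
}
```

For six independent vectorizable ops, this model costs 6 cycles with everything scalar or everything vector, and the pass settles on three ops per side for a cost of 3.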
-
Selective Vectorization
  (Figure labels: scalar, vector, cost)
-
Cost Function
  Projected II due to resources (ResMII)
    Bin-packing approach [Rau MICRO 94]
    With some modifications
  Can ignore operation latency
    Software pipelining hides latency
    Vectorizable ops not on dependence cycles
for (i=0; i
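A simplified sketch of a resource-constrained II bound of the kind the cost function projects. It is not the bin-packing formulation cited on the slide: it assumes each op uses exactly one unit of one resource class for one cycle, and it treats the issue width as an extra bound. The unit counts and issue width are the ones in the evaluation machine table; the op counts in the usage note are hypothetical.

```c
#include <assert.h>

enum { RES_ALU, RES_MEM, RES_FPU, RES_VEC, RES_COUNT };

/* Machine description from the evaluation slides. */
static const int units[RES_COUNT] = { 4, 2, 2, 1 };
enum { ISSUE_WIDTH = 6 };

static int ceil_div(int a, int b) {
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* Lower bound on II from resources alone: the busiest resource class,
 * or the issue slots if those run out first. Latency is ignored. */
int res_mii(const int ops[RES_COUNT]) {
    int total = 0, bound = 1;
    for (int r = 0; r < RES_COUNT; r++) {
        int need = ceil_div(ops[r], units[r]);
        if (need > bound) bound = need;
        total += ops[r];
    }
    int issue = ceil_div(total, ISSUE_WIDTH);
    return issue > bound ? issue : bound;
}
```

For hypothetical counts {4, 4, 8, 0} (ALU, MEM, FPU, VEC) the FPUs are the bottleneck and the bound is 4. Shifting work to the vector unit so that the counts become {4, 4, 4, 2} lowers it to 3, at which point the issue width (14 ops over 6 slots) is the limit.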
-
Evaluation
  SUIF front-end
    Dependence analysis
    Dataflow optimization
  Trimaran back-end
    Modulo scheduler
    Register allocator
    VLIW Simulator
    Added vector ops
  Simulation Binary
  C or Fortran
-
Evaluation
  Operands communicated through memory
  Software responsible for realignment

  Issue Width     6
  Memory Units    2
  ALUs            4
  FPUs            2
  Vector Units    1
  Vector Length   2*
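A sketch of what software realignment can involve, assuming a machine whose vector loads only work at vector-aligned addresses (the slides do not spell out the hardware rule). A misaligned 2-element load is built from two aligned loads and a merge. The `v2d` type and both helpers are hypothetical emulations in plain C.

```c
#include <assert.h>

/* Emulated vector of two 64-bit elements. */
typedef struct { double lane[2]; } v2d;

/* Aligned vector load: the element index must be a multiple of 2. */
static v2d load_aligned(const double *base, int i) {
    assert(i >= 0 && i % 2 == 0);
    v2d v = {{ base[i], base[i + 1] }};
    return v;
}

/* Load elements i and i+1 for any i >= 0 using only aligned loads.
 * For odd i this reads base[i-1] through base[i+2], so those elements
 * must be readable. */
v2d load_realigned(const double *base, int i) {
    assert(i >= 0);
    if (i % 2 == 0) return load_aligned(base, i);
    v2d lo = load_aligned(base, i - 1);          /* holds element i   */
    v2d hi = load_aligned(base, i + 1);          /* holds element i+1 */
    v2d merged = {{ lo.lane[1], hi.lane[0] }};   /* the extra merge op */
    return merged;
}
```

The odd-index path costs two loads and one merge instead of a single load, which is the kind of overhead a cost model has to account for when alignment information is available.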
-
Evaluation
  SpecFP 92, 95, 2000
    Easier to extract dependence information
    Detectable data parallelism
    64-bit data means vector length of 2
    Considered amenable to vectorization & SWP
  Apply selective vectorization to DO loops
    No control flow, no function calls
  Fully simulate with training sets
-
Traditional Vectorization
-
Vectorization without Distribution
-
Vectorization + Free Communication
-
Vectorization without Distribution
-
Selective Vectorization
-
Selective Vectorization
  (Chart: tomcatv, su2cor, swim, mgrid)
-
Communication Support
  Transfer through memory
  Register to register copy
    Uses fewer issue slots
    Frees memory resources
  Shared register file
    Vector elements addressable in scalar ops
    Requires no extra issue slots
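A small emulation of the first option, transfer through memory, to make its cost concrete. It assumes a value crossing between the vector and scalar sides takes one vector memory op plus one scalar memory op per element; the slides do not give exact counts, so treat these as illustrative. All names are hypothetical.

```c
#include <assert.h>

enum { VL = 2 };
typedef struct { double lane[VL]; } v2d;   /* emulated vector register */

static int mem_ops = 0;   /* operations that would occupy a memory unit */

static void vec_store(double *p, v2d v) {
    for (int k = 0; k < VL; k++) p[k] = v.lane[k];
    mem_ops++;
}
static v2d vec_load(const double *p) {
    v2d v;
    for (int k = 0; k < VL; k++) v.lane[k] = p[k];
    mem_ops++;
    return v;
}
static double scalar_load(const double *p) { mem_ops++; return *p; }
static void scalar_store(double *p, double x) { mem_ops++; *p = x; }

/* Vector result consumed by scalar ops: 1 vector store + VL scalar loads. */
void vector_to_scalar(v2d v, double out[VL]) {
    double buf[VL];
    vec_store(buf, v);
    for (int k = 0; k < VL; k++) out[k] = scalar_load(&buf[k]);
}

/* Scalar results consumed by a vector op: VL scalar stores + 1 vector load. */
v2d scalar_to_vector(const double in[VL]) {
    double buf[VL];
    for (int k = 0; k < VL; k++) scalar_store(&buf[k], in[k]);
    return vec_load(buf);
}

int memory_ops_used(void) { return mem_ops; }
```

In this model each crossing costs three memory operations with a vector length of 2. Register-to-register copies would move that work off the memory units, and a shared register file would remove the explicit transfer altogether.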
-
Through Memory
  (Chart: tomcatv, su2cor, swim, mgrid)
-
Reg to Reg Transfer Support
  (Chart: tomcatv, su2cor, swim, mgrid)
-
Shared Register File
  (Chart: tomcatv, su2cor, swim, mgrid)
-
Related Work
  Traditional vectorization
    Allen & Kennedy, Wolfe
  Software Pipelining
    Rau's iterative modulo scheduling
  Clustered VLIW
    [Aleta MICRO34], [Codina PACT01], [Nystrom MICRO31], [Sanchez MICRO33], [Zalamea MICRO34]
    Partitioning among clusters similar
    Ours is also an instruction selection problem
    No dedicated communication resources
-
Conclusion
  Targeting all FUs improves performance
    Selective vectorization
  Vectorization better in the backend
    Cost analysis more accurate
  Software pipeline vectorized loops
    Good idea anyway
    Facilitates selective vectorization
    Hides communication and alignment latency
ILP techniques are instruction scheduling techniques; vectorization is a type of instruction selection. That's why we need both.
Mention what the notation in the code means
Mention loop distribution
Communication between vector and scalar loops
This is our contribution
Example was very simple, but in reality there are complications
Planning to make publicly available
PACT
Traditional never beats modulo scheduling for this architecture
Free communication is unrealistic
Mention theoretical maximum for this architecture
Say what percentages mean
Consider two other design points