Dynamic Hardware/Software Partitioning: A First Approach
Greg Stitt, Roman Lysecky, Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine
Introduction
Dynamic optimizations an increasing trend
  – Examples
      Dynamo
        – Dynamic software optimizations
      Transmeta Crusoe
        – Dynamic code morphing
      Just In Time Compilation
        – Interpreted languages
Advantages
  – Transparent optimizations
      No designer effort
      No tool restrictions
  – Adapts to actual usage
Introduction
Drawbacks of current dynamic optimizations
  – Currently limited to software optimizations
      Limited speedup (1.1x to 1.3x common)
Alternatively, we could perform hw/sw partitioning
  – Achieves large speedups (2x to 10x common)
  – However, partitioning presently cannot be performed dynamically
[Figure: Sw and Hw; Profiler; Critical Regions; Processor and ASIC/FPGA]
Introduction
Ideally, we would perform hardware/software partitioning dynamically
  – Transparent partitioning
      Supports all sw languages/tools
      Most partitioning approaches have complex tool flows
  – Achieves better results than software optimizations
      >2x speedup, energy savings
  – Adapts to actual usage
Appropriate architecture required
  – Requires a processor and configurable logic
Introduction
Microprocessor/FPGA single-chip platforms make partitioning more attractive
  – More efficient communication, smaller size
      Higher performance, low power
Examples
  – Xilinx Virtex II Pro, Triscend E5/A7, Altera Excalibur, Atmel FPSLIC
Makes dynamic hw/sw partitioning more feasible
  – However, partitioning must be performed at binary level
[Figure: Processor and FPGA, 1990s versus 2003]
Introduction
Binary-level hw/sw partitioning
  – Binary is profiled and hardware candidates are determined
  – Regions to be partitioned are decompiled into CDFG
  – CDFG is synthesized to hardware
  – Binary is updated to use hardware
Many advantages over source-level partitioning
  – Supports any language or software compiler
      No change in tools
  – Better software size and performance estimation at binary level
Enables dynamic hw/sw partitioning
[Figure: binary-level partitioning flow. Stages: Profiling, Hw Exploration, Decompilation, Behavioral Synthesis, Binary Updater. Inputs and outputs: Binary, Netlist, Updated Binary. Targets: Processor, FPGA]
Dynamic Hw/Sw Partitioning
[Figure, animated over several slides: a chip with microprocessors, memory, configurable logic, and a dynamic partitioning module. As the software (SW) executes, its instructions (add, beq, ...) are observed by the dynamic partitioning module, which identifies the frequent loops; the frequent loops are then implemented as hardware (HW) in the configurable logic. A final bar chart compares Time and Energy (scale 0 to 100) for SW versus HW/SW.]
Dynamic Partitioning Module
Dynamic partitioning module executes partitioning tools on chip
  – Profiler, partitioning compiler, synthesis, place&route
[Figure: SW Source, SW Binary, Profiler, Partitioning Compiler, Synthesis, Place&Route, HW; chip with microprocessors, memory, configurable logic, and dynamic partitioning module]
Dynamic Partitioning Module
Synthesis and place & route tools all moved on-chip
  – These tools typically execute on powerful workstations
  – Most people will cringe at the idea of moving these tools on-chip
However, dynamic partitioning deals with small regions of code
  – Typically, small innermost loops
Therefore, we can develop lean tools that work specifically for these small loops
  – Lean tools make on-chip execution possible
Area overhead becoming less critical due to Moore’s Law
System Architecture
Microprocessors
  – MIPS (may be many)
On-chip memory
Configurable logic
Dynamic partitioning module
[Figure: chip with microprocessors, memory, configurable logic, and dynamic partitioning module]
Dynamic Partitioning Module
Dynamically detects frequent loops and then reimplements the loops in hardware running on the configurable logic
Architectural components
  – Profiler
  – Additional processor and memory
      But SOCs may have dozens anyway
      Alternatively, we could share main processor
[Figure: Profiler, Partitioning Co-Processor, Memory]
Configurable Logic
Greatly simplified in order to create lean place & route tools
DMA used to access memory
Two registers
  – R0_Input stores data from memory
  – R1_InOut stores temporary data & data to write back to memory
Fabric
  – Supports combinational logic
  – Implies loops must have body implemented in single cycle (temporary restriction)
[Figure: DMA, R0_Input, Configurable Logic Fabric, R1_InOut]
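The execution model described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `run_hw_loop`, its parameters, the 32-bit wrap-around, and the example summation loop are all hypothetical and do not come from the slides. It shows one loop iteration per cycle, with the DMA filling R0_Input and a purely combinational body updating R1_InOut.

```python
from typing import Callable, List


def run_hw_loop(memory: List[int], base: int, count: int,
                body: Callable[[int, int], int], r1_init: int = 0) -> int:
    """Simulate the simplified fabric: one loop iteration per cycle.

    All names here are illustrative assumptions, not part of the original design.
    """
    r1_inout = r1_init                       # temporary / write-back register
    for i in range(count):                   # DMA block request with incrementing address
        r0_input = memory[base + i]          # DMA delivers the next word into R0_Input
        # Combinational loop body: new R1_InOut from R0_Input and old R1_InOut.
        r1_inout = body(r0_input, r1_inout) & 0xFFFFFFFF  # assumed 32-bit datapath
    return r1_inout


# Example: accumulate eight words, as a simple summation loop would.
mem = [3, 1, 4, 1, 5, 9, 2, 6]
total = run_hw_loop(mem, 0, 8, lambda r0, r1: r0 + r1)
```

Because the body is a pure function of the two registers, the single-cycle restriction is visible directly: anything that needs state beyond R1_InOut does not fit this model.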
Configurable Logic Fabric
Fabric
  – 3-input 2-output LUTs surrounded by switch matrices
Switch Matrix
  – Connect wire to same channel on different side
LUT
  – 3-input (8 word) 2-output SRAM
[Figure, three panels. Configurable Logic Fabric: array of LUTs and switch matrices (SM). Switch Matrix: channels 0 to 3 on each side. LUT: Inputs, SRAM (8x2), Outputs]
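To make the "3-input (8 word) 2-output SRAM" description concrete, here is a small Python model of such a LUT. The class name `Lut3x2`, its methods, and the bit ordering of the address are assumptions made for illustration; the full-adder contents are just one example of a pair of functions that fits in a single LUT.

```python
from typing import Callable, List, Tuple


class Lut3x2:
    """3-input, 2-output LUT modeled as an 8-word, 2-bit-wide SRAM (illustrative)."""

    def __init__(self, words: List[Tuple[int, int]]) -> None:
        if len(words) != 8:
            raise ValueError("a 3-input LUT needs exactly 8 words")
        self.words = words

    @classmethod
    def from_functions(cls, f0: Callable[[int, int, int], int],
                       f1: Callable[[int, int, int], int]) -> "Lut3x2":
        # Fill the SRAM by evaluating both output functions for every input combination.
        words = []
        for addr in range(8):
            a, b, c = (addr >> 2) & 1, (addr >> 1) & 1, addr & 1
            words.append((f0(a, b, c) & 1, f1(a, b, c) & 1))
        return cls(words)

    def read(self, a: int, b: int, c: int) -> Tuple[int, int]:
        # The three inputs form the SRAM address; the word holds both outputs.
        return self.words[(a << 2) | (b << 1) | c]


# A full adder fits in one LUT: output 0 is the sum bit, output 1 the carry-out.
full_adder = Lut3x2.from_functions(
    lambda a, b, c: a ^ b ^ c,
    lambda a, b, c: (a & b) | (a & c) | (b & c),
)
```

Having two outputs per LUT is what lets one cell produce both the sum and the carry of an adder bit, which suits the adder-heavy loops targeted here.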
Tool Overview
Tool flow slightly different from standard partitioning flow
  – Decompilation
  – Binary modification
[Figure: tool flow with the labels Binary, Loop Profiling, Small Frequent Loops, Decompilation, DMA Configuration, RT and Logic Synthesis, Tech. Mapping, Place & Route, Bitfile Creation, HW, Binary Modification, Updated Binary]
Loop Profiling
Non-intrusive profiler
  – Monitors instruction bus
Very little overhead
  – Small cache (~16 entries) and 2,300 logic gates
      Less than 1% power overhead
[Figure: profiler attached to the bus between the microprocessor and L1 memory. Signals: rd/wr, addr, data, sbb. Blocks: Frequent Loop Cache, Frequent Loop Cache Controller, incrementer (+), saturation]
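A rough software model of such a profiler is sketched below. It assumes that the `sbb` signal in the figure marks a taken short backward branch and that the cache is indexed by the branch target address; the counter width, the eviction policy, and all names (`FrequentLoopCache`, `observe_sbb`, `frequent_loops`) are hypothetical choices for illustration, not details from the slides.

```python
from typing import Dict, List


class FrequentLoopCache:
    """Small table of saturating counters, one per observed loop (illustrative)."""

    def __init__(self, entries: int = 16, counter_bits: int = 8) -> None:
        self.entries = entries                      # ~16 entries per the slide
        self.max_count = (1 << counter_bits) - 1    # assumed counter width
        self.counts: Dict[int, int] = {}            # loop address -> iteration count

    def observe_sbb(self, target_addr: int) -> None:
        """Record one taken short backward branch to target_addr."""
        if target_addr in self.counts:
            # Saturating increment: the counter never wraps around.
            self.counts[target_addr] = min(self.counts[target_addr] + 1, self.max_count)
            return
        if len(self.counts) >= self.entries:
            # Assumed policy: evict the loop with the lowest count.
            coldest = min(self.counts, key=self.counts.get)
            del self.counts[coldest]
        self.counts[target_addr] = 1

    def frequent_loops(self, n: int = 1) -> List[int]:
        """Return the n loop addresses with the highest counts."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]


profiler = FrequentLoopCache()
for _ in range(300):
    profiler.observe_sbb(0x400100)   # hot inner loop
for _ in range(5):
    profiler.observe_sbb(0x400200)   # rarely executed loop
```

The model only needs the branch target seen on the instruction bus, which is why a profiler of this kind can stay non-intrusive.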
Decompilation
Decompilation recovers high-level information
Creates optimized CDFG
  – All instruction-set inefficiencies are removed
Binary partitioning has been shown to achieve similar results to source-level partitioning for many applications
  – [Greg Stitt, Frank Vahid, ICCAD 2002]
DMA Configuration
Maps memory accesses to our DMA architecture
  – Reads/writes
  – Increment/decrement address updates
  – Single/block request modes
Optimizes DFG for DMA
  – Removes address calculations
  – Removes loop counters/exit conditions
[Figure: DFG before and after DMA optimization. Before: an r1 + 1 address update, a Read, and an add involving r2 and r3. After: a DMA Read (Memory Read, Increment Address, Block Request) feeding the add]
Register Transfer Synthesis
Maps DFG operations to hw library components
  – Adders, Comparators, Multiplexors, Shifters
Creates Boolean expression for each output bit in dataflow graph by replacing hw components with corresponding expressions
    r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
    r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = …….
    …….
[Figure: DFG in which r4 = r1 + r2 maps to a 32-bit adder and r5 = r3 < 8 maps to a 32-bit comparator]
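The expansion of an adder into per-bit equations can be sketched as follows. The sum-bit expressions and `carry[0]` follow the slide; the carry expression for higher bits is the standard full-adder carry, which is an assumption here because the slide elides it. The function names `adder_expressions` and `ripple_add` are hypothetical.

```python
from typing import List


def adder_expressions(dst: str, a: str, b: str, width: int) -> List[str]:
    """Return one Boolean equation per sum bit and carry bit of dst = a + b."""
    eqs = []
    for i in range(width):
        if i == 0:
            eqs.append(f"{dst}[0] = {a}[0] xor {b}[0]")
            eqs.append(f"carry[0] = {a}[0] and {b}[0]")
        else:
            eqs.append(f"{dst}[{i}] = ({a}[{i}] xor {b}[{i}]) xor carry[{i - 1}]")
            # Standard full-adder carry (assumed; not spelled out on the slide).
            eqs.append(
                f"carry[{i}] = ({a}[{i}] and {b}[{i}]) or "
                f"(({a}[{i}] xor {b}[{i}]) and carry[{i - 1}])"
            )
    return eqs


def ripple_add(x: int, y: int, width: int) -> int:
    """Evaluate the same bit-level equations to check them against integer addition."""
    result, carry = 0, 0
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        result |= (xi ^ yi ^ carry) << i
        carry = (xi & yi) | ((xi ^ yi) & carry)
    return result


eqs = adder_expressions("r4", "r1", "r2", 32)
```

`ripple_add` is only a sanity check that the bit-level equations agree with ordinary addition modulo 2^width.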
Logic Synthesis
Optimizes Boolean equations from RT synthesis
  – Large opportunity for logic minimization due to use of immediate values in the binary
Simple on-chip 2-level logic minimization method
  – Lysecky/Vahid DAC’03, session 20.4 (9:45 Wed)
Example: r2 = r1 + 4
Before minimization:
    r2[0] = r1[0] xor 0 xor 0
    r2[1] = r1[1] xor 0 xor carry[0]
    r2[2] = r1[2] xor 1 xor carry[1]
    r2[3] = r1[3] xor 0 xor carry[2]
    …
After minimization:
    r2[0] = r1[0]
    r2[1] = r1[1] xor carry[0]
    r2[2] = r1[2]’ xor carry[1]
    r2[3] = r1[3] xor carry[2]
    …
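The simplification in the r1 + 4 example can be reproduced with plain constant propagation over xor terms, as in the sketch below. This is not the 2-level minimization method cited on the slide; it only illustrates why immediate values create easy minimization opportunities. The function names are hypothetical, and the complement is written with an ASCII apostrophe.

```python
from typing import List


def simplify_xor(terms: List[str]) -> str:
    """Simplify an xor of signal names and the constants "0" and "1"."""
    invert = False
    signals: List[str] = []
    for term in terms:
        if term == "0":
            continue                 # x xor 0 = x
        if term == "1":
            invert = not invert      # x xor 1 = x'
        else:
            signals.append(term)
    if not signals:
        return "1" if invert else "0"
    if invert:
        signals[0] += "'"            # fold the inversion into the first signal
    return " xor ".join(signals)


def add_immediate_equations(dst: str, src: str, imm: int, width: int) -> List[str]:
    """Sum-bit equations for dst = src + imm with the immediate's bits folded in."""
    eqs = []
    for i in range(width):
        imm_bit = str((imm >> i) & 1)
        carry_in = "0" if i == 0 else f"carry[{i - 1}]"
        eqs.append(f"{dst}[{i}] = " + simplify_xor([f"{src}[{i}]", imm_bit, carry_in]))
    return eqs


eqs = add_immediate_equations("r2", "r1", 4, 4)
```

Running this for the immediate 4 yields the same four simplified sum-bit equations as the slide's second listing.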
Technology Mapping
Maps logic operations to 3-input, 2-output LUTs
  1. Traverse logic network and combine nodes to determine single output LUTs
  2. Combine nodes to form two output LUTs
[Figure: logic network covered by 3-input, 2-output LUTs]
Placement
Nodes along critical path are placed in single horizontal row
Build dependencies between remaining nodes and placed nodes
  – Use dependencies to place remaining nodes
      Either above or below placed nodes
[Figure: grid of LUTs]
Routing
Greedy algorithm
  1. At each switch matrix, choose direction to route
  2. Continue to route until reaching switch matrix that is already in use
  3. Backtrack to previous switch matrix, and try another direction
Place and route most complex task; currently working on improvements
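The three routing steps can be sketched as a greedy depth-first search over a grid of switch matrices. This is a simplified model under stated assumptions: occupancy is tracked per switch matrix rather than per channel, "choose direction" is taken to mean "prefer the neighbour closest to the destination", and the function name `route` and its parameters are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]   # (row, column) of a switch matrix


def route(src: Cell, dst: Cell, rows: int, cols: int,
          in_use: Set[Cell]) -> Optional[List[Cell]]:
    """Greedy routing with backtracking; returns the path or None if unroutable."""
    path: List[Cell] = [src]
    visited: Set[Cell] = {src}

    def step(cur: Cell) -> bool:
        if cur == dst:
            return True
        r, c = cur
        neighbours = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
        # Step 1: greedy choice, try the direction that ends closest to dst first.
        neighbours.sort(key=lambda n: abs(n[0] - dst[0]) + abs(n[1] - dst[1]))
        for nxt in neighbours:
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if nxt in in_use or nxt in visited:
                continue              # Step 2: switch matrix already in use
            visited.add(nxt)
            path.append(nxt)
            if step(nxt):
                return True
            path.pop()                # Step 3: backtrack and try another direction
        return False

    return path if step(src) else None


# Route around an occupied switch matrix on a 3x3 grid.
occupied = {(0, 1)}
path = route((0, 0), (0, 2), 3, 3, occupied)
```

After a net is routed, a caller would typically add its path to the occupied set (for example `occupied.update(path)`) before routing the next net.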
Bitfile Creation
Combines place&routed hardware description with DMA configuration into bitfile
  – Used to initialize the configurable logic
[Figure: HW Netlist and DMA Configuration feed Bitfile Creation, which produces the Bitfile; target blocks: DMA, R0_Input, Configurable Logic Fabric, R1_InOut]
Binary Modification
Updates the application binary in order to utilize the new hardware
  – Loop replaced with jump to hw initialization code
  – Wisconsin Architectural Research Tool Set (WARTS)
      EEL (Executable Editing Library)
  – We assume memory is RAM or programmable ROM

Original binary:
loop:
    Load r2, 0(r1)
    Add r1, r1, 1
    Add r3, r3, r2
    Blt r1, 8, loop
after_loop:
    …..

Updated binary:
loop:
    Jump hw_init
    ..
after_loop:
    …..
hw_init:
    1. Initialize HW registers
    2. Enable HW
    3. Shutdown processor
        • Woken up by HW interrupt
    4. Store any results
    5. Jump to after_loop
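The patching step can be illustrated on a toy program held as a list of strings. This sketch does not use EEL or any real binary-editing API; `patch_loop`, the stub mnemonics (`InitHwRegs`, `EnableHw`, `Sleep`, `StoreResults`), and the trailing `Nop` are hypothetical placeholders. It overwrites only the first loop instruction so that every other address stays unchanged, and appends the initialization stub at the end.

```python
from typing import List


def patch_loop(program: List[str], loop_label: str, exit_label: str,
               hw_stub: List[str]) -> List[str]:
    """Overwrite the loop's first instruction with a jump and append the hw_init stub."""
    start = program.index(f"{loop_label}:")
    end = program.index(f"{exit_label}:")
    if end - start < 2:
        raise ValueError("loop has no instruction to overwrite")
    patched = list(program)                  # leave the original listing untouched
    patched[start + 1] = "Jump hw_init"      # rest of the old loop body becomes dead code
    # A real binary would place the stub in unused memory; here it is simply appended.
    return patched + ["hw_init:"] + hw_stub + [f"Jump {exit_label}"]


original = [
    "loop:",
    "Load r2, 0(r1)",
    "Add r1, r1, 1",
    "Add r3, r3, r2",
    "Blt r1, 8, loop",
    "after_loop:",
    "Nop",
]
stub = ["InitHwRegs", "EnableHw", "Sleep", "StoreResults"]   # hypothetical mnemonics
patched = patch_loop(original, "loop", "after_loop", stub)
```

Overwriting in place, rather than deleting the loop, keeps all branch targets elsewhere in the binary valid without further relocation.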
Tool Statistics
Executed on SimpleScalar
  – Similar to a MIPS instruction set
  – Used 60 MHz clock (like Triscend A7 device)
Statistics
  – Total run time of only 1.09 seconds
  – Requires less than ½ megabyte of RAM
  – Code size much smaller than standard synthesis tools

Tools: Decompilation, DMA Config., RT Synthesis, Logic Synthesis, Tech. Mapping, Place & Route
[Table: the grouping of the tools into the two rows below did not survive extraction]

Code Size (Lines)   Binary Size (Kbytes)   Data Size (Kbytes)   Time (s)
4,695               88                     360                  1.04
7,203               125                    452                  0.05
Experiments
Benchmark Information
  – Powerstone (brev, g3fax1&2)
  – NetBench (url)
  – Logic minimization kernel (logmin)
Statistics
  – 55% of total time spent in loops that are moved to hardware
  – Ideal speedup of 2.8
  – These loops were only 2.4% of the size of the original application

Example   Total Ins   Loop Ins   Loop Time%   Loop Size%   Ideal Speedup
brev      992         104        70.0%        10.5%        3.3
g3fax1    1094        6          31.4%        0.5%         1.5
g3fax2    1094        6          31.2%        0.5%         1.5
url       13526       17         79.9%        0.1%         5.0
logmin    8968        38         63.8%        0.4%         2.8
Avg:                             55.3%        2.4%         2.8
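The Ideal Speedup column is consistent with the Amdahl's-law limit in which the loop takes zero time once moved to hardware, i.e. 1 / (1 - loop time fraction). The short check below reproduces the column from the Loop Time% values; the function and variable names are illustrative only.

```python
def ideal_speedup(loop_time_fraction: float) -> float:
    """Speedup if the loop ran in zero time in hardware (Amdahl's-law limit)."""
    return 1.0 / (1.0 - loop_time_fraction)


# Loop Time% values from the table, expressed as fractions.
loop_time = {
    "brev": 0.700,
    "g3fax1": 0.314,
    "g3fax2": 0.312,
    "url": 0.799,
    "logmin": 0.638,
}

ideal = {name: round(ideal_speedup(f), 1) for name, f in loop_time.items()}
average_ideal = round(sum(ideal_speedup(f) for f in loop_time.values()) / len(loop_time), 1)
```

For example, brev spends 70.0% of its time in the loop, so the remaining 30% bounds the speedup at 1 / 0.3, or about 3.3.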
Experiments
Results
  – Achieved average speedup of 2.6, close to ideal 2.8
  – Hardware loops were 20X faster than software loops
Even with simple architecture and tools, large speedups were achieved

Example   Sw Time   Sw Loop Time   Hw Loop Time   Sw/Hw Time   Speedup
brev      0.05      0.03           0.001          0.02         3.1
g3fax1    23.50     7.35           0.82           16.98        1.4
g3fax2    23.50     7.39           1.49           17.61        1.3
url       379.90    303.74         13.29          89.45        4.2
logmin    16.32     10.42          0.21           6.12         2.7
Avg:                65.78          3.16           26.03        2.6
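The columns of the results table are related by simple arithmetic: the partitioned (Sw/Hw) time is the software time with the software loop time replaced by the hardware loop time, and the speedup is the software time divided by that. The sketch below checks this for four of the benchmarks; brev is left out because its times are rounded too coarsely in the table to reproduce its speedup. The function names are illustrative.

```python
def partitioned_time(sw: float, sw_loop: float, hw_loop: float) -> float:
    """Total time after the loop is moved from software to hardware."""
    return sw - sw_loop + hw_loop


def speedup(sw: float, sw_loop: float, hw_loop: float) -> float:
    return sw / partitioned_time(sw, sw_loop, hw_loop)


# (Sw Time, Sw Loop Time, Hw Loop Time) from the table.
results = {
    "g3fax1": (23.50, 7.35, 0.82),
    "g3fax2": (23.50, 7.39, 1.49),
    "url": (379.90, 303.74, 13.29),
    "logmin": (16.32, 10.42, 0.21),
}

speedups = {name: round(speedup(*times), 1) for name, times in results.items()}
sw_hw_times = {name: round(partitioned_time(*times), 2) for name, times in results.items()}
```

The recomputed Sw/Hw times differ from the table by at most 0.01, which is consistent with the table's values having been rounded from more precise measurements.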
Conclusion
Dynamic hardware/software partitioning has advantages over other partitioning approaches
  – Completely transparent
  – Designers get performance/energy benefits of hw/sw partitioning by simply writing software
  – Quality likely not as good as desktop CAD for some applications, so most suitable when transparency is critical (very often!)
Achieved average speedup of 2.6
  – Very close to ideal speedup of 2.8
Future work
  – More complex configurable logic fabric
      Designed in close conjunction with on-chip CAD tools
      Sequential logic and increased inputs/outputs
      Support larger hardware regions, not just simple loops
      Improved algorithms (especially place and route)
  – Handle more complex memory access patterns