Dynamic Hardware/Software Partitioning: A First Approach

Greg Stitt, Roman Lysecky, Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems at UC Irvine

Introduction

Dynamic optimizations an increasing trend
– Examples
    • Dynamo
        – Dynamic software optimizations
    • Transmeta Crusoe
        – Dynamic code morphing
    • Just In Time Compilation
        – Interpreted languages
Advantages
– Transparent optimizations
    • No designer effort
    • No tool restrictions
– Adapts to actual usage


Drawbacks of current dynamic optimizations
– Currently limited to software optimizations
    • Limited speedup (1.1x to 1.3x common)
Alternatively, we could perform hw/sw partitioning
– Achieve large speedups (2x to 10x common)
– However, presently dynamic optimization not possible

[Figure: a profiler identifies critical regions of the software (Sw), which are moved to hardware (Hw); the Sw runs on the processor and the Hw on an ASIC/FPGA]


Ideally, we would perform hardware/software partitioning dynamically
– Transparent partitioning
    • Supports all sw languages/tools
    • Most partitioning approaches have complex tool flows
– Achieves better results than software optimizations
    • >2x speedup, energy savings
– Adapts to actual usage
Appropriate architecture required
– Requires a processor and configurable logic


Microprocessor/FPGA single-chip platforms make partitioning more attractive
– More efficient communication, smaller size
    • Higher performance, low power
Examples
– Xilinx Virtex II Pro, Triscend E5/A7, Altera Excalibur, Atmel FPSLIC
Makes dynamic hw/sw partitioning more feasible
– However, partitioning must be performed at binary level

[Figure: separate processor and FPGA chips (1990s) versus a single chip containing both processor and FPGA (2003)]


Binary-level hw/sw partitioning
– Binary is profiled and hardware candidates are determined
– Regions to be partitioned are decompiled into CDFG
– CDFG is synthesized to hardware
– Binary is updated to use hardware
Many advantages over source-level partitioning
– Supports any language or software compiler
    • No change in tools
– Better software size and performance estimation at binary level
Enables dynamic hw/sw partitioning

[Figure: binary-level partitioning flow. A Binary passes through Profiling, Hw Exploration, Decompilation, Behavioral Synthesis and the Binary Updater, producing an Updated Binary for the processor and a Netlist for the FPGA]

Dynamic Hw/Sw Partitioning

[Figure (animation): a chip with several microprocessors, memory, configurable logic and a dynamic partitioning module. The module observes the instructions (add, beq, ...) of the software (SW) as it executes, identifies the frequent loops, and moves them from SW to HW on the configurable logic.]

[Chart: Time and Energy on a 0 to 100 scale, SW versus HW/SW]

Dynamic Partitioning Module

Dynamic partitioning module executes partitioning tools on chip
– Profiler, partitioning compiler, synthesis, place&route

[Figure: tool chain with SW Source, Compiler, SW Binary, Profiler, Partitioning, Synthesis, Place&Route and HW, shown inside the dynamic partitioning module of a chip that also holds memory, configurable logic and microprocessors]

Synthesis and place & route tools all moved on-chip
– These tools typically execute on powerful workstations
– Most people will cringe at the idea of moving these tools on-chip
However, dynamic partitioning deals with small regions of code
– Typically, small innermost loops
Therefore, we can develop lean tools that work specifically for these small loops
– Lean tools make on-chip execution possible
Area overhead becoming less critical due to Moore’s Law

System Architecture

Microprocessors
– MIPS (may be many)
On-chip memory
Configurable logic
Dynamic partitioning module

[Figure: chip with memory, dynamic partitioning module, configurable logic and four microprocessors]

Dynamically detects frequent loops and then reimplements the loops in hardware running on the configurable logic
Architectural components
– Profiler
– Additional processor and memory
    • But SOCs may have dozens anyways
    • Alternatively, we could share main processor

[Figure: dynamic partitioning module containing a profiler, a partitioning co-processor and memory]

Configurable Logic

Greatly simplified in order to create lean place & route tools
DMA used to access memory
Two registers
– R0_Input stores data from memory
– R1_InOut stores temporary data & data to write back to memory
Fabric
– Supports combinational logic
– Implies loops must have body implemented in single cycle (temporary restriction)

[Figure: DMA connected to registers R0_Input and R1_InOut, which connect to the configurable logic fabric]

Configurable Logic Fabric

Fabric
– 3-input 2-output LUTs surrounded by switch matrices
Switch Matrix
– Connect wire to same channel on different side
LUT
– 3-input (8 word) 2-output SRAM

[Figure: three panels. Configurable Logic Fabric: grid of LUTs and switch matrices (SM). Switch Matrix: channels 0 to 3 on each side. LUT: inputs addressing an 8x2 SRAM that drives the outputs.]

Tool Overview

Tool flow slightly different from standard partitioning flow
– Decompilation
– Binary modification

[Figure: tool flow. Binary, Loop Profiling, Small Frequent Loops, Decompilation, DMA Configuration, RT and Logic Synthesis, Tech. Mapping, Place & Route, Bitfile Creation (HW), Binary Modification (Updated Binary)]

Loop Profiling

Non-intrusive profiler
– Monitors instruction bus
Very little overhead
– Small cache (~16 entries) and 2,300 logic gates
    • Less than 1% power overhead

[Figure: profiler attached to the rd/wr, addr and data lines between the microprocessor and L1 memory. An sbb signal and the address feed the frequent loop cache controller, which increments (++) counters with saturation in the frequent loop cache.]
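
To make the profiling idea concrete, here is a minimal software model of a frequent loop cache. It is only a sketch under stated assumptions: the sbb signal in the figure is taken to flag a loop-closing backward branch, and the class name, the 8-bit saturation limit and the eviction policy are hypothetical choices, not details given on the slide.

```python
class FrequentLoopCache:
    """Software model of a small cache that counts loop-closing branches."""

    def __init__(self, entries=16, max_count=255):
        self.entries = entries        # ~16 entries, as stated on the slide
        self.max_count = max_count    # saturation limit (assumed 8-bit counter)
        self.counts = {}              # loop address -> saturating count

    def observe(self, loop_addr):
        """Record one taken backward branch whose target is loop_addr."""
        if loop_addr in self.counts:
            # Saturating increment: the counter never wraps back to zero.
            self.counts[loop_addr] = min(self.counts[loop_addr] + 1, self.max_count)
            return
        if len(self.counts) >= self.entries:
            # Assumed policy: evict the loop that has been seen least often.
            coldest = min(self.counts, key=self.counts.get)
            del self.counts[coldest]
        self.counts[loop_addr] = 1

    def frequent_loops(self, n=1):
        """Return the n loop addresses with the highest counts."""
        return sorted(self.counts, key=self.counts.get, reverse=True)[:n]


cache = FrequentLoopCache()
trace = [0x400] * 300 + [0x480] * 20 + [0x4C0] * 5   # hypothetical branch targets
for addr in trace:
    cache.observe(addr)
hottest = cache.frequent_loops(1)
```

In this trace the loop at 0x400 saturates its counter and is reported as the hottest loop, which is the candidate the later tools would work on.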

Decompilation

Decompilation recovers high-level information
Creates optimized CDFG
– All instruction-set inefficiencies are removed
Binary partitioning has been shown to achieve similar results to source-level partitioning for many applications
– [Greg Stitt, Frank Vahid, ICCAD 2002]

DMA Configuration

Maps memory accesses to our DMA architecture
– Reads/writes
– Increment/decrement address updates
– Single/block request modes
Optimizes DFG for DMA
– Removes address calculations
– Removes loop counters/exit conditions

[Figure: a DFG containing the address update r1 + 1, a Read from address r1 into r2, and the addition r3 + r2 is classified as Memory Read, Increment Address, Block Request. After optimization only a DMA Read producing r2 and the addition r3 + r2 remain.]
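
The mapping described above can be sketched as a small pass over a loop body. The tuple format, the function name and the descriptor fields are all hypothetical; the block request mode is simply assumed, and a real tool would also derive the transfer length from the exit condition.

```python
def configure_dma(loop_body):
    """Split a loop body into a DMA descriptor and the remaining dataflow ops.

    loop_body is a list of (op, dest, src1, src2) tuples in a made-up format.
    """
    dma = {}
    for op, dest, src1, src2 in loop_body:
        if op == "load":
            # A load becomes a DMA read; block mode is an assumption here.
            dma = {"access": "read", "addr_reg": src1, "mode": "block"}

    remaining = []
    for op, dest, src1, src2 in loop_body:
        if op == "load":
            continue  # handled by the DMA
        is_addr_update = (
            op == "add" and dma and dest == src1 == dma["addr_reg"] and src2 in (1, -1)
        )
        if is_addr_update:
            # The address calculation moves into the DMA and leaves the DFG.
            dma["update"] = "increment" if src2 == 1 else "decrement"
            continue
        if op == "blt":
            continue  # loop exit condition is dropped from the DFG
        remaining.append((op, dest, src1, src2))
    return dma, remaining


# The example loop from the binary modification slide.
loop = [
    ("load", "r2", "r1", 0),
    ("add", "r1", "r1", 1),
    ("add", "r3", "r3", "r2"),
    ("blt", None, "r1", 8),
]
dma, dfg = configure_dma(loop)
```

For this loop only the accumulation r3 + r2 is left for synthesis, which matches the shape of the optimized DFG in the figure.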

Register Transfer Synthesis

Maps DFG operations to hw library components
– Adders, Comparators, Multiplexors, Shifters
Creates Boolean expression for each output bit in dataflow graph by replacing hw components with corresponding expressions

r4[0] = r1[0] xor r2[0], carry[0] = r1[0] and r2[0]
r4[1] = (r1[1] xor r2[1]) xor carry[0], carry[1] = …….
…….

[Figure: DFG in which r4 = r1 + r2 maps to a 32-bit adder and r5 = r3 < 8 maps to a 32-bit comparator]
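
As an illustration of expanding a library component into per-bit expressions, the sketch below generates the equations of a ripple-carry adder in the notation used above. The carry equation for bits above 0 is the standard full-adder form, supplied here as an assumption because the slide elides it; the function names are hypothetical.

```python
def adder_bit_exprs(a, b, out, width):
    """Return Boolean expressions (as strings) for out = a + b, bit by bit."""
    exprs = []
    for i in range(width):
        if i == 0:
            # Bit 0 has no incoming carry.
            exprs.append(f"{out}[0] = {a}[0] xor {b}[0]")
            exprs.append(f"carry[0] = {a}[0] and {b}[0]")
        else:
            exprs.append(f"{out}[{i}] = ({a}[{i}] xor {b}[{i}]) xor carry[{i - 1}]")
            # Standard full-adder carry (assumed; not shown on the slide).
            exprs.append(
                f"carry[{i}] = ({a}[{i}] and {b}[{i}]) or "
                f"(carry[{i - 1}] and ({a}[{i}] xor {b}[{i}]))"
            )
    return exprs


def ripple_add(x, y, width):
    """Evaluate the same equations on integers to check them."""
    result, carry = 0, 0
    for i in range(width):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        result |= (xi ^ yi ^ carry) << i
        carry = (xi & yi) | (carry & (xi ^ yi))
    return result


exprs = adder_bit_exprs("r1", "r2", "r4", 2)
```

Calling adder_bit_exprs with width 32 would give the full set of equations for the 32-bit adder in the figure; ripple_add confirms that the equations compute addition modulo 2^width.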

Logic Synthesis

Optimizes Boolean equations from RT synthesis
– Large opportunity for logic minimization due to use of immediate values in the binary
Simple on-chip 2-level logic minimization method
– Lysecky/Vahid DAC’03, session 20.4 (9:45 Wed)

[Figure: DFG for r2 = r1 + 4]

Before minimization:
r2[0] = r1[0] xor 0 xor 0
r2[1] = r1[1] xor 0 xor carry[0]
r2[2] = r1[2] xor 1 xor carry[1]
r2[3] = r1[3] xor 0 xor carry[2]
…

After minimization:
r2[0] = r1[0]
r2[1] = r1[1] xor carry[0]
r2[2] = r1[2]’ xor carry[1]
r2[3] = r1[3] xor carry[2]
…
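
The simplification in this example comes from folding the constant bits of the immediate into each sum bit. The sketch below reproduces just that folding step for r2 = r1 + 4. It is not the two-level minimization method cited on the slide, only a small stand-in with hypothetical function names, and it leaves the carry signals untouched as the slide does.

```python
def sum_bit_expr(reg, bit, imm_bit, carry_in):
    """Simplified expression for one sum bit of reg + immediate.

    carry_in is None for bit 0 (no incoming carry) or a signal name.
    """
    # xor with constant 0 leaves the bit unchanged; xor with 1 complements it.
    term = f"{reg}[{bit}]" if imm_bit == 0 else f"{reg}[{bit}]'"
    if carry_in is None:
        return term
    return f"{term} xor {carry_in}"


def add_immediate_exprs(src, dst, imm, width):
    """Per-bit sum expressions for dst = src + imm with constants folded."""
    lines = []
    for i in range(width):
        imm_bit = (imm >> i) & 1
        carry_in = None if i == 0 else f"carry[{i - 1}]"
        lines.append(f"{dst}[{i}] = {sum_bit_expr(src, i, imm_bit, carry_in)}")
    return lines


minimized = add_immediate_exprs("r1", "r2", 4, 4)
```

Because 4 is 0100 in binary, only bit 2 of r1 is complemented, which is why r1[2] is the single negated term in the minimized equations.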

Technology Mapping

Maps logic operations to 3-input, 2-output LUTs
1. Traverse logic network and combine nodes to determine single output LUTs
2. Combine nodes to form two output LUTs

[Figure: logic network covered by 3-input, 2-output LUTs]
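
A small example of why two-output LUTs are useful: the sum and carry of one adder bit depend on the same three inputs, so both fit in a single 3-input, 2-output LUT. The sketch below builds the 8x2 SRAM contents for that case. The helper names and the input-to-address ordering are assumptions made for illustration.

```python
from itertools import product


def build_lut(f0, f1):
    """Return the 8x2 SRAM contents for two functions of the same 3 inputs."""
    return [(f0(a, b, c), f1(a, b, c)) for a, b, c in product((0, 1), repeat=3)]


def read_lut(sram, a, b, c):
    """Look up both outputs; the three inputs form the SRAM address."""
    return sram[(a << 2) | (b << 1) | c]


# One full-adder bit: output 0 is the sum, output 1 is the carry out.
full_adder = build_lut(
    lambda a, b, c: a ^ b ^ c,
    lambda a, b, c: (a & b) | (c & (a ^ b)),
)
```

Mapping the two functions separately would use two single-output LUTs; combining them, as in step 2 above, halves the LUT count for this pair.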

Placement

Nodes along critical path are placed in single horizontal row
Build dependencies between remaining nodes and placed nodes
– Use dependencies to place remaining nodes
    • Either above or below placed nodes

[Figure: grid of LUTs with the critical path placed along one horizontal row]

Routing

Greedy algorithm
1. At each switch matrix, choose direction to route
2. Continue to route until reaching switch matrix that is already in use
3. Backtrack to previous switch matrix, and try another direction
Place and route most complex task; currently working on improvements
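
The three routing steps can be sketched as a depth-first search over a grid of switch matrices. This is a simplified model under stated assumptions: each switch matrix is treated as either free or in use (channels are ignored), the greedy choice is assumed to prefer the direction closest to the destination, and all names are hypothetical.

```python
def route(start, goal, used, width, height):
    """Greedy routing with backtracking between two switch matrices.

    used is a set of (x, y) switch matrices already occupied by other nets.
    Returns the list of switch matrices on the path, or None if unroutable.
    """
    path = [start]
    visited = {start}

    def step(pos):
        if pos == goal:
            return True
        x, y = pos
        neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        # Step 1: choose the direction that gets closest to the goal first.
        neighbours.sort(key=lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1]))
        for nxt in neighbours:
            nx, ny = nxt
            if not (0 <= nx < width and 0 <= ny < height):
                continue
            if nxt in used or nxt in visited:
                continue  # Step 2: switch matrix already in use
            visited.add(nxt)
            path.append(nxt)
            if step(nxt):
                return True
            path.pop()  # Step 3: backtrack and try another direction
        return False

    return path if step(start) else None


# Hypothetical 4x3 fabric with one occupied switch matrix on the direct path.
path = route((0, 1), (3, 1), used={(2, 1)}, width=4, height=3)
blocked = route((0, 1), (3, 1), used={(1, 0), (1, 1), (1, 2)}, width=4, height=3)
```

The first call detours around the occupied switch matrix; the second returns None because an entire column is in use and no route exists.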

Bitfile Creation

Combines place&routed hardware description with DMA configuration into bitfile
– Used to initialize the configurable logic

[Figure: the HW Netlist and the DMA Configuration enter Bitfile Creation, which produces the Bitfile that configures the DMA, R0_Input, R1_InOut and the fabric]

Binary Modification

Updates the application binary in order to utilize the new hardware
– Loop replaced with jump to hw initialization code
– Wisconsin Architectural Research Tool Set (WARTS)
    • EEL (Executable Editing Library)
– We assume memory is RAM or programmable ROM

Original binary:

    loop:
        Load r2, 0(r1)
        Add  r1, r1, 1
        Add  r3, r3, r2
        Blt  r1, 8, loop
    after_loop:
        …..

Updated binary:

    loop:
        Jump hw_init
        ..
    after_loop:
        …..

    hw_init:
        1. Initialize HW registers
        2. Enable HW
        3. Shutdown processor
            • Woken up by HW interrupt
        4. Store any results
        5. Jump to after_loop
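
The patching step can be sketched on a list of assembly-like lines. This does not use EEL or any real binary editing API; the function name and the stub instruction names are hypothetical. Only the first loop instruction is overwritten, so the rest of the program stays where it was.

```python
def patch_loop(program, loop_label, after_label, hw_init_body):
    """Redirect a loop to hardware initialization code.

    program is a list of assembly-like lines; labels end with ':'.
    """
    patched = list(program)
    first_insn = patched.index(loop_label + ":") + 1
    # Overwrite only the first loop instruction so later lines do not shift.
    patched[first_insn] = "Jump hw_init"
    # Append the initialization stub, ending with a jump past the loop.
    patched += ["hw_init:"] + hw_init_body + [f"Jump {after_label}"]
    return patched


program = [
    "loop:",
    "Load r2, 0(r1)",
    "Add r1, r1, 1",
    "Add r3, r3, r2",
    "Blt r1, 8, loop",
    "after_loop:",
]
# Placeholder names for steps 1 to 4 of hw_init on the slide.
stub = ["init_hw_registers", "enable_hw", "sleep_until_hw_interrupt", "store_results"]
patched = patch_loop(program, "loop", "after_loop", stub)
```

The remaining loop instructions become unreachable but keep their positions, so the position of after_loop is unchanged.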

Tool Statistics

Executed on SimpleScalar
– Similar to a MIPS instruction set
– Used 60 MHz clock (like Triscend A7 device)
Statistics
– Total run time of only 1.09 seconds
– Requires less than ½ megabyte of RAM
– Code size much smaller than standard synthesis tools

Tool                             Code Size  Binary size  Data size  Time
                                 (Lines)    (Kbytes)     (Kbytes)   (s)
Decompilation, DMA Config.,
RT Synthesis, Logic Synthesis,
Tech. Mapping                    4,695      88           360        1.04
Place & Route                    7,203      125          452        0.05

Experiments

Benchmark Information
– Powerstone (Brev, g3fax1&2)
– NetBench (url)
– Logic minimization kernel (logmin)
Statistics
– 55% of total time spent in loops that are moved to hardware
– Ideal speedup of 2.8
– These loops were only 2.4% of the size of the original application

Example  Total Ins  Loop Ins  Loop Time%  Loop Size%  Ideal Speedup
brev           992       104       70.0%       10.5%            3.3
g3fax1        1094         6       31.4%        0.5%            1.5
g3fax2        1094         6       31.2%        0.5%            1.5
url          13526        17       79.9%        0.1%            5.0
logmin        8968        38       63.8%        0.4%            2.8
Avg:                               55.3%        2.4%            2.8
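
The ideal speedup column is consistent with assuming that the loops moved to hardware take no time at all, so that only the non-loop fraction of the run time remains. The slide does not state the formula; the sketch below is a check under that assumption, with a hypothetical function name.

```python
def ideal_speedup(loop_time_fraction):
    """Speedup if the loop portion of the run time shrank to zero."""
    return 1.0 / (1.0 - loop_time_fraction)


# Loop Time% column from the table above, as fractions.
loop_time = {
    "brev": 0.700,
    "g3fax1": 0.314,
    "g3fax2": 0.312,
    "url": 0.799,
    "logmin": 0.638,
}
ideal = {name: round(ideal_speedup(f), 1) for name, f in loop_time.items()}
average = round(sum(ideal.values()) / len(ideal), 1)
```

Under this assumption the computed values agree with the table, including the average of 2.8.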

Experiments

Results
– Achieved average speedup of 2.6, close to ideal 2.8
– Hardware loops were 20X faster than software loops
Even with simple architecture and tools, large speedups were achieved

Example  Sw Time  Sw Loop Time  Hw Loop Time  Sw/Hw Time  Speedup
brev        0.05          0.03         0.001        0.02      3.1
g3fax1     23.50          7.35          0.82       16.98      1.4
g3fax2     23.50          7.39          1.49       17.61      1.3
url       379.90        303.74         13.29       89.45      4.2
logmin     16.32         10.42          0.21        6.12      2.7
Avg:                     65.78          3.16       26.03      2.6
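
The Sw/Hw Time and Speedup columns appear to follow from the other columns: the partitioned run time is the software time with the software loop time replaced by the hardware loop time. The sketch below checks this reading on three rows. It is an interpretation of the table, not a formula given on the slide, and the function names are hypothetical.

```python
def partitioned_time(sw_time, sw_loop_time, hw_loop_time):
    """Run time after the loop is moved from software to hardware."""
    return sw_time - sw_loop_time + hw_loop_time


def speedup(sw_time, sw_loop_time, hw_loop_time):
    """Software-only time divided by the partitioned time."""
    return sw_time / partitioned_time(sw_time, sw_loop_time, hw_loop_time)


# (Sw Time, Sw Loop Time, Hw Loop Time) from the table above.
rows = {
    "g3fax1": (23.50, 7.35, 0.82),
    "url": (379.90, 303.74, 13.29),
    "logmin": (16.32, 10.42, 0.21),
}
speedups = {name: round(speedup(*times), 1) for name, times in rows.items()}
```

For these rows the computed partitioned times match the Sw/Hw Time column to within 0.01, and the rounded speedups match the Speedup column.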

Conclusion

Dynamic hardware/software partitioning has advantages over other partitioning approaches
– Completely transparent
– Designers get performance/energy benefits of hw/sw partitioning by simply writing software
– Quality likely not as good as desktop CAD for some applications, so most suitable when transparency is critical (very often!)
Achieved average speedup of 2.6
– Very close to ideal speedup of 2.8
Future work
– More complex configurable logic fabric
    • Designed in close conjunction with on-chip CAD tools
    • Sequential logic and increased inputs/outputs
    • Support larger hardware regions, not just simple loops
    • Improved algorithms (especially place and route)
– Handle more complex memory access patterns
