04/10/23
The Microarchitecture of FPGA-Based Soft Processors
Peter Yiannacouras, Jonathan Rose and J. Gregory Steffan
Dept. of Electrical and Computer Engineering, University of Toronto
Presented By: Deepak Tomar
CS08M054, M.Tech II Year, CS & E Dept

Outline
▪ Aim
▪ The Basics First
▪ Motivation
▪ Understanding Soft Processor Microarchitecture
▪ Overview of SPREE System
▪ Experimental Framework
▪ Exploring Soft Processor Architecture (Partially)

Aim
▪ To build a system for automatically generating soft processors
▪ To develop a methodology for comparing soft processor architectures
▪ To begin to populate and analyze the soft processor design space

The Basics First
▪ What is an FPGA?
▪ How is it different from an ASIC?
▪ What is a soft processor?
▪ Is there a hard processor too?

Field Programmable Gate Array (FPGA)
▪ FPGAs are programmable digital logic chips
▪ They can be programmed to perform almost any digital function
▪ [company logos] are important makers of FPGAs
▪ ASICs are application-specific logic chips that are built for one dedicated task

How FPGAs work?
Logic Cells
▪ FPGAs are built from one logic cell duplicated hundreds or thousands of times. A logic cell is basically a small lookup table (LUT), a D flip-flop and a 2-to-1 mux. A LUT is a small RAM that can implement any logic function of its inputs.
[Figure: a logic cell containing a LUT and a flip-flop]
Interconnect
▪ Each logic cell can be connected to other logic cells through interconnect resources (wires/muxes placed around the logic cell). Each cell can do only a little, but with many of them connected together, complex logic functions can be created.
General Work Flow when working with FPGAs
▪ The logic function is written as a text file
▪ The text file is compiled into a binary file
▪ The binary file is sent from the computer to the FPGA over a cable
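The idea that a LUT is a small RAM can be made concrete with a short sketch. The Python below is an illustration only and is not taken from the slides or from any FPGA tool: the names `make_lut` and `logic_cell`, the 2-input width and the stored truth tables are assumptions chosen for the example.

```python
from itertools import product


def make_lut(truth_table):
    """Model a LUT as a tiny RAM: the inputs form the address, the stored bit is the output."""
    table = list(truth_table)

    def lut(*inputs):
        address = 0
        for bit in inputs:  # the first input is the most significant address bit
            address = (address << 1) | (bit & 1)
        return table[address]

    return lut


def logic_cell(lut, inputs, ff_state, use_ff):
    """One logic cell: LUT, D flip-flop and 2-to-1 mux.

    Returns (cell_output, next_ff_state). The flip-flop captures the LUT
    output on the clock edge; the mux selects either the registered value
    or the combinational LUT value as the cell output.
    """
    lut_out = lut(*inputs)
    output = ff_state if use_ff else lut_out
    return output, lut_out


# "Programming" the cell means filling the RAM with a truth table.
xor2 = make_lut([0, 1, 1, 0])  # contents for addresses 00, 01, 10, 11
and2 = make_lut([0, 0, 0, 1])

xor_outputs = [xor2(a, b) for a, b in product((0, 1), repeat=2)]
and_outputs = [and2(a, b) for a, b in product((0, 1), repeat=2)]
```

The same `make_lut` helper yields XOR or AND purely by changing the stored bits, which is the sense in which one physical cell can implement any small logic function.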

Soft Processor / Hard Processor
▪ In a soft processor, the processor is implemented using the FPGA fabric itself
▪ In a hard processor, a ready-made processor is built into the chip as fixed hardware
Examples
Soft Processor   Developer   Hard Processor   Developer
MicroBlaze       Xilinx      Virtex II Pro    Xilinx
Nios             Altera      Excalibur        Altera

Motivation
▪ More and more embedded systems use FPGA platforms
▪ Increasing cost and time-to-market of designing a state-of-the-art ASIC
▪ Drawbacks of hard processors
  ▪ Mismatch between the number of hard processors on the FPGA chip and the number required by the application
  ▪ Mismatch between the performance an application requires of a processor and the performance provided by available FPGA-based hard processors
  ▪ Difficulty in routing between the processor and custom logic
  ▪ Leads to specialization of the FPGA chip, impacting yield and customer base

Understanding Soft Processor Microarchitecture
▪ A soft processor is comparatively slower and less area-efficient than a hard processor
▪ Processor architectures are usually studied with high-level functional simulators because it is difficult to vary a design at the logic layout level
▪ In contrast, FPGA CAD tools allow quick and accurate measurement of exact speed, area and power
▪ A full understanding makes it possible to make intelligent application-specific architectural trade-offs
▪ The Soft Processor Rapid Exploration Environment (SPREE) was developed to meet this aim
▪ SPREE is a system for architectural exploration

Overview of the SPREE system
[Figure: SPREE flow. An Architecture Description is the input to the SPREE RTL Generator, which produces efficiently synthesizable RTL. An RTL Simulator running embedded benchmark applications reports (1) correctness and (2) cycle count; an RTL CAD flow reports (3) area, (4) clock frequency and (5) power.]

Preview of capabilities of SPREE
[Figure: average wall clock time (µs) versus area (equivalent LEs). Series: Multiply Full Hardware Support, Multiply Software Routine, Altera NiosIIe, Altera NiosIIs, Altera NiosIIf.]

SPREE RTL Generator
▪ Input: The Architecture Description
  ▪ Describing the Datapath
  ▪ Selecting and Interchanging Components
  ▪ Creating and Describing Custom Components
  ▪ Describing the Instruction Set Architecture (ISA)
▪ Generating a Soft Processor
  ▪ Datapath Verification
  ▪ Datapath Instantiation
  ▪ Control Generation

SPREE RTL Generator
[Figure: block diagram of the SPREE RTL Generator. Inputs: Datapath Description and ISA Description. Steps: Datapath Verification, Datapath Instantiation and Control Generation, drawing on a Component Library (efficient RTL). Output: Efficient RTL Description.]

Datapath Description as Interconnection of Components
[Figure: example datapath made of an instruction memory, register file, shifter, ALU and data memory, connected through four muxes]

Sample component description for a simplified ALU
    Module alu_small {
        Input opA 32
        Input opB 32
        Output result 32
        Opcode opcode 2 {
            ADD 0 0
            SUB 1 0
            SLT 2 0
        }
    }
▪ Interface: the Input and Output lines name each port and give its bit width
▪ Functionality: the Opcode block lists the GENOPs (ADD, SUB and SLT), each followed by its opcode port value and its latency in cycles
[Figure: ALU block with inputs inA, inB and opcode, output result, supporting ADD, SUB and SLT]
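To show what the description above declares, here is a rough Python behavioral model of `alu_small`. It is a sketch under assumptions: the opcode values 0, 1 and 2 come from the description, but the signed comparison for SLT and the 32-bit wraparound follow the usual MIPS conventions rather than anything stated on the slide, and the helper `to_signed32` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF

# Opcode port values as listed in the component description.
ADD, SUB, SLT = 0, 1, 2


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def alu_small(opA, opB, opcode):
    """Behavioral model of the 32-bit ALU: returns the 32-bit result."""
    if opcode == ADD:
        return (opA + opB) & MASK32
    if opcode == SUB:
        return (opA - opB) & MASK32
    if opcode == SLT:
        # Assumed signed set-on-less-than, as in MIPS.
        return 1 if to_signed32(opA) < to_signed32(opB) else 0
    raise ValueError(f"unsupported opcode {opcode}")
```

The model ignores latency; in the description each GENOP carries a latency of 0 cycles, which is consistent with a purely combinational function like this one.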

MIPS ADDI instruction shown as Data Dependence Graph
[Figure: data dependence graph. IFETCH feeds REGREAD and SIGN_EXT, both feed ADD, and ADD feeds REGWRITE.]
▪ Rule: No GENOP can execute until all its inputs are ready
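The rule can be illustrated with a small Python sketch that orders the GENOPs of ADDI by readiness. The edge list is an assumption based on the usual meaning of ADDI (fetch, read the source register and sign-extend the immediate, add, write back), and the function name `schedule` is hypothetical. The grouping into steps only illustrates the dependence rule; it is not a claim about how SPREE assigns cycles.

```python
# Each GENOP maps to the GENOPs whose results it consumes.
ADDI_GRAPH = {
    "IFETCH": [],
    "REGREAD": ["IFETCH"],
    "SIGN_EXT": ["IFETCH"],
    "ADD": ["REGREAD", "SIGN_EXT"],
    "REGWRITE": ["ADD"],
}


def schedule(graph):
    """Group GENOPs into steps; a GENOP runs only once all its inputs are done."""
    done = set()
    steps = []
    while len(done) < len(graph):
        ready = sorted(
            op for op, inputs in graph.items()
            if op not in done and all(i in done for i in inputs)
        )
        if not ready:
            raise ValueError("dependence cycle: no GENOP can become ready")
        steps.append(ready)
        done.update(ready)
    return steps


steps = schedule(ADDI_GRAPH)
```

REGREAD and SIGN_EXT land in the same step because neither depends on the other, while ADD has to wait for both.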

Generating a soft processor
▪ Datapath Verification
  ▪ Ensures that each instruction's GENOP graph in the ISA description is a subgraph of the datapath's GENOP graph
▪ Datapath Instantiation
  ▪ Generates an equivalent Verilog description from the input datapath description
▪ Control Generation
  ▪ SPREE generates the logic that controls the datapath's operation so that it correctly implements the ISA
  ▪ The control logic tells each component what operation to perform (opcodes) and when to perform it (enables)
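The datapath verification step can be sketched in Python as an edge-containment check. This is a simplification under stated assumptions, not SPREE's actual algorithm: graphs are plain sets of (producer, consumer) GENOP pairs, names must match exactly, and the datapath, the `SLL` entry and the `SHIFT` GENOP name are hypothetical examples.

```python
def is_subgraph(instr_edges, datapath_edges):
    """True if every GENOP-to-GENOP edge the instruction needs exists in the datapath."""
    return set(instr_edges) <= set(datapath_edges)


def verify_datapath(isa, datapath_edges):
    """Return the names of instructions the datapath cannot implement."""
    return sorted(
        name for name, edges in isa.items()
        if not is_subgraph(edges, datapath_edges)
    )


# Hypothetical datapath: supports the ADDI flow but has no shifter.
DATAPATH = {
    ("IFETCH", "REGREAD"), ("IFETCH", "SIGN_EXT"),
    ("REGREAD", "ADD"), ("SIGN_EXT", "ADD"), ("ADD", "REGWRITE"),
}

ISA = {
    "ADDI": [("IFETCH", "REGREAD"), ("IFETCH", "SIGN_EXT"),
             ("REGREAD", "ADD"), ("SIGN_EXT", "ADD"), ("ADD", "REGWRITE")],
    "SLL": [("IFETCH", "REGREAD"), ("REGREAD", "SHIFT"), ("SHIFT", "REGWRITE")],
}

unsupported = verify_datapath(ISA, DATAPATH)
```

Here the check flags `SLL` because the example datapath has no path through a shifter, which is the kind of mismatch verification is meant to catch before any RTL is generated.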

Experimental Framework
▪ Required for measuring and comparing the soft processors produced by SPREE
▪ Processor Verification
  ▪ Trace-based verification: traces from a cycle-accurate industrial RTL simulator are compared against MINT (a MIPS instruction set simulator)
▪ FPGA used: Altera's Stratix I
▪ Quartus II v4.2 CAD software for synthesis, technology mapping, placement and routing

[Figure: An Altera Stratix FPGA]

Experimental Framework (contd.)
Metrics for measuring soft processors
▪ Area: measured in Logic Elements (LEs)
  ▪ An LE is composed of a 4-input lookup table (LUT) and a flip-flop
▪ Performance: wall-clock time for executing a collection of benchmark (BM) applications
  ▪ Wall-clock time = clock period × CPI × avg. number of instructions
▪ Power: measured with Quartus' PowerPlay tool, based on the switching activities of post-placed-and-routed nodes determined by simulating the BM applications
  ▪ Static power and the power of I/O pins are subtracted
  ▪ For each benchmark, energy per instruction is calculated
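As a worked example of the performance and power metrics, the Python sketch below evaluates the wall-clock formula and one plausible way to obtain energy per instruction. The function names and all numeric inputs are made up for illustration, and computing energy as dynamic power times run time divided by instruction count is an assumption; the slides do not spell out that calculation.

```python
def wall_clock_time_us(clock_mhz, cpi, instruction_count):
    """Wall-clock time in microseconds = clock period * CPI * instruction count."""
    clock_period_us = 1.0 / clock_mhz  # a 1 MHz clock has a 1 us period
    return clock_period_us * cpi * instruction_count


def energy_per_instruction_nj(dynamic_power_mw, time_us, instruction_count):
    """Energy per instruction in nanojoules (1 mW * 1 us = 1 nJ)."""
    return dynamic_power_mw * time_us / instruction_count


# Hypothetical benchmark: 100,000 instructions on an 80 MHz core with CPI 1.5
# drawing 200 mW of dynamic power (all figures are made up).
time_us = wall_clock_time_us(clock_mhz=80, cpi=1.5, instruction_count=100_000)
energy_nj = energy_per_instruction_nj(200, time_us, 100_000)
```

With these inputs the run takes about 1875 µs and each instruction costs about 3.75 nJ, which shows how clock frequency and CPI trade off inside a single figure of merit.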

Exploring Soft Processor Microarchitecture
▪ Comparison of generated processors with NiosII variations
  ▪ Three points in the design space: NiosIIe (smallest area, lowest performance), NiosIIf (largest area, highest performance), NiosIIs (in between)
▪ A SPREE-generated 80 MHz, 3-stage pipelined processor is 9% smaller and 11% faster than NiosIIs
  ▪ This processor has a CPI of 1.36 at 80 MHz, whereas NiosIIs and NiosIIf have 2.36 at 120 MHz and 1.97 at 135 MHz respectively
▪ The smallest generated processor is within 15% of the area of NiosIIe and 11% faster
  ▪ Its CPI of 2-3, against the 6 CPI of NiosIIe, shrinks to an 11% net win in wall-clock time because the clock frequencies are 82 MHz and 159 MHz respectively
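The interplay of CPI and clock frequency in these figures can be checked with a few lines of Python. The sketch uses only the CPI and clock values quoted above; the function name is hypothetical, and because it ignores differences in instruction count and in how the benchmarks were averaged, it is not expected to reproduce the reported 11% exactly.

```python
def ns_per_instruction(cpi, clock_mhz):
    """Average time per instruction in nanoseconds: CPI * clock period."""
    return cpi * 1000.0 / clock_mhz


# (CPI, clock in MHz) as quoted in the comparison.
processors = {
    "SPREE 3-stage": (1.36, 80),
    "NiosIIs": (2.36, 120),
    "NiosIIf": (1.97, 135),
}

times = {name: ns_per_instruction(cpi, mhz) for name, (cpi, mhz) in processors.items()}
fastest = min(times, key=times.get)
```

The generated processor needs about 17 ns per instruction against roughly 19.7 ns for NiosIIs, so its lower CPI more than offsets its slower clock, while NiosIIf remains the quickest of the three at roughly 14.6 ns.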

Avg wall-clock-time vs area of NiosII and generated processor
[Figure: average wall clock time (µs) versus area (equivalent LEs). Series: Multiply Full Hardware Support, Multiply Software Routine, Altera NiosIIe, Altera NiosIIs, Altera NiosIIf.]

Comparison with NiosII variations
Processor                   CPI    Clock (MHz)   Comment
SPREE Generated Processor   1.36   80            9% smaller and 11% faster than NiosIIs
NiosIIs                     2.36   120
NiosIIf                     1.97   135

Comparison with NiosII variations
Processor                            CPI   Clock (MHz)   Comment
SPREE Smallest Generated Processor   2-3   82            Within 15% of area and 11% faster than NiosIIe
NiosIIe                              6     159

Conclusion
▪ Results show a generated processor that came within 15% of the area of the smallest NiosII variation while outperforming it by 11%
▪ Other generated processors both outperformed and were smaller than the standard NiosII variation
▪ The generator can populate the design space while remaining relatively competitive with commercial, hand-optimized soft processors

NiosII variations
▪ NiosIIe: unpipelined 6-CPI processor with a serial shifter and software multiplication support
▪ NiosIIs: 5-stage pipeline with a multiplier-based shifter, hardware multiplication and an instruction cache
▪ NiosIIf: large 6-stage pipeline with dynamic branch prediction, instruction and data caches and an optional hardware divider

Generation of Enable Signals
▪ The generator collects all timing information from each component
▪ It analyzes the datapath and infers the pipeline stage of each component
▪ In each pipeline stage, local stall signals are extracted and propagated (stall network) to earlier stages
▪ Enables are generated if the component is not stalled
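The stall network can be sketched in Python as a backward pass over the pipeline stages. This is a minimal illustration under assumptions: it works per stage rather than per component, treats a stall in any stage as stalling every earlier stage, and the function name `generate_enables` is hypothetical rather than part of SPREE.

```python
def generate_enables(local_stalls):
    """Given one local stall flag per pipeline stage (index 0 = earliest stage),
    propagate stalls to earlier stages and return one enable per stage."""
    enables = [True] * len(local_stalls)
    stalled = False
    # Walk from the last stage back to the first so a stall reaches all earlier stages.
    for stage in range(len(local_stalls) - 1, -1, -1):
        stalled = stalled or local_stalls[stage]
        enables[stage] = not stalled
    return enables


# A stall raised in the middle stage of a 3-stage pipeline.
example = generate_enables([False, True, False])
```

In the example the middle stage and the stage before it are held, while the final stage keeps its enable and can drain the instruction it already holds.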

▪ FPGA-based soft processors are being adopted more widely in embedded processing, so there is a need to understand the architectural trade-offs in order to maximize efficiency
▪ SPREE is an infrastructure for rapidly generating soft processors
▪ Generated processors were compared with Altera's NiosII family of commercial soft processors