Improving Pipelined Soft Processors with Multithreading
![Page 1: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/1.jpg)
Improving Pipelined Soft Processors with Multithreading
Martin Labrecque, Gregory Steffan
ECE Dept. University of Toronto
Presented at RAAW 2006, Orlando, FL
![Page 2: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/2.jpg)
Processors and FPGAs
FPGAs increasingly implement SoCs, with CPUs
Soft processors: processors in the FPGA fabric
Soft processors are:
• Easier to program than HDL
• Customizable
[Figure: an FPGA containing custom logic and a soft processor; the processor datapath shows the PC, instruction memory, register array, ALU, and data memory]
![Page 3: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/3.jpg)
Soft processors in Embedded Systems
What do designers care about?
• Minimizing area?
• Matching frequency?
• Hitting performance target?
We trade off 4 criteria (soft proc. power is related to area)
Area efficiency: a combined metric

$$\text{Area efficiency} = \frac{\text{Performance}}{\text{Area}} = \frac{\text{Instr. Count} \times \text{Frequency}}{\text{Cycle Count} \times \text{Area}}$$
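As a rough illustration of the combined metric, the sketch below computes area efficiency in MIPS per 1000 LEs from the four quantities in the formula. The function name, parameter names, and sample numbers are assumptions made for illustration, not values from the talk.

```python
def area_efficiency(instr_count, cycle_count, freq_mhz, area_les):
    """Area efficiency in MIPS per 1000 LEs (hypothetical helper).

    Performance (MIPS) = instr_count * freq_mhz / cycle_count,
    which is then divided by the area expressed in thousands of LEs.
    """
    mips = instr_count * freq_mhz / cycle_count
    return mips / (area_les / 1000.0)


# Made-up example: 1M instructions in 2M cycles at 80 MHz on 1000 LEs.
baseline = area_efficiency(1_000_000, 2_000_000, 80.0, 1000)
# Halving the cycle count at equal area and frequency doubles the metric.
improved = area_efficiency(1_000_000, 1_000_000, 80.0, 1000)
print(baseline, improved)
```

Because cycle count sits in the denominator, any technique that removes stall cycles without growing area or lowering frequency raises the metric directly.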
![Page 4: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/4.jpg)
Multithreading
• Replace processor stalls
– Fill them with instructions from other threads
• When to switch thread?
– Every instruction (e.g. Sun’s Niagara)
– Convenient technique for in-order processors
• Fine-grained multithreading: 1 instr. per thread in round-robin

$$\text{Area efficiency} = \frac{\text{Million Instr.} \times \text{Frequency}}{\#\,\text{Cycles} \times \text{Area}}$$
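A minimal sketch of fine-grained multithreading, assuming each thread is just a list of instruction labels: one instruction is issued per thread per turn, in round-robin order. The function name and the labels are hypothetical, and skipping finished threads is a simplification (real hardware might issue a bubble instead).

```python
def interleave(streams):
    """Issue one instruction per thread per turn, in round-robin order.

    streams: list of per-thread instruction lists.
    Returns a list of (thread_id, instruction) in issue order.
    """
    pcs = [0] * len(streams)  # one program counter per thread
    issued = []
    while any(pc < len(s) for pc, s in zip(pcs, streams)):
        for tid, stream in enumerate(streams):
            if pcs[tid] < len(stream):  # simplification: skip finished threads
                issued.append((tid, stream[pcs[tid]]))
                pcs[tid] += 1
    return issued


order = interleave([["a1", "a2"], ["b1", "b2"], ["c1", "c2"]])
print(order)
```

Consecutive instructions in the issue order always come from different threads, which is what removes dependences between neighbouring pipeline stages.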
![Page 5: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/5.jpg)
Avoiding processor stall cycles
Data and control hazards create stall cycles
Multithreading: execute streams of independent instructions
Ideally, eliminates all stalls
[Figure: timing diagrams of a 3-stage pipeline (F, E, W) over time. BEFORE: traditional execution, with stall cycles. AFTER: instructions from Thread1, Thread2, and Thread3 interleaved.]
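The effect can be sketched with a toy cycle counter. The model is an assumption made for illustration, not the authors' simulator: there is no forwarding, a dependent instruction may be fetched no earlier than `depth` cycles after its producer, and threads take turns in strict round-robin.

```python
def total_cycles(threads, depth):
    """Cycles to drain all threads through a `depth`-stage in-order pipeline.

    threads: one list per thread; entry i is True when instruction i
             depends on instruction i-1 of the same thread.
    Assumed model: a dependent instruction waits until `depth` cycles
    after its producer was fetched. A round-robin turn is wasted
    (a bubble) when the selected thread is not ready or has finished.
    """
    n = len(threads)
    next_idx = [0] * n          # next instruction index per thread
    last_fetch = [None] * n     # cycle of the most recent fetch per thread
    remaining = sum(len(t) for t in threads)
    cycle = 0
    final_fetch = 0
    while remaining:
        tid = cycle % n
        i = next_idx[tid]
        if i < len(threads[tid]):
            gap = depth if threads[tid][i] else 1
            if last_fetch[tid] is None or cycle >= last_fetch[tid] + gap:
                last_fetch[tid] = cycle
                next_idx[tid] += 1
                remaining -= 1
                final_fetch = cycle
        cycle += 1
    return final_fetch + depth  # the last instruction still has to drain


deps = [False, True, True, False, True]       # made-up dependence trace
single = total_cycles([deps], depth=3)        # one thread, stalls on hazards
multi = total_cycles([deps] * 3, depth=3)     # three threads, round-robin
print(single, multi)
```

In this made-up trace the single thread needs 13 cycles for 5 instructions, while three interleaved copies need 17 cycles for 15 instructions: with as many threads as stages, every dependence is already resolved when a thread's turn comes around.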
![Page 6: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/6.jpg)
How useful is multithreading?
• Commercial SPs: single-threaded (NIOS-II, Microblaze)
• Fort et al. [FCCM’06] have shown: a multithreaded SP is smaller than multiple SPs, with some performance degradation
• We go further by showing that the area efficiency of a multithreaded SP is GREATER THAN the area efficiency of a single-threaded SP
This is not straightforward; here is how we did it
![Page 7: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/7.jpg)
Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading
Architectural Support for Multiple Threads
![Page 8: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/8.jpg)
Single-Threaded Processor (simplified)
[Figure: datapath with PC (+4), instruction memory, register array, ALU, and data memory, plus hazard detection logic and forwarding lines]
![Page 9: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/9.jpg)
2-Threaded Processor (simplified)
• Replicate state for each thread
• Simplify control logic
[Figure: datapath with two PCs (+4), instruction memory, register array, ALU, and data memory, plus a Ctrl. block and hazard detection logic]
![Page 10: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/10.jpg)
Additional storage for multiple threads
• More efficiently done in FPGA than in ASIC
• Increase memory size while preserving frequency
• Replicated N times: program counters, registers, data mem.
Multithreading builds on the strengths of FPGAs
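One way to picture the replicated state is a single larger register array indexed by thread id, plus one program counter per thread. The class below is a sketch under that assumption; the class name, the 32-register default, and the indexing scheme are illustrative and not taken from the talk.

```python
class ThreadState:
    """Per-thread PCs and registers held in shared arrays (hypothetical)."""

    def __init__(self, n_threads, regs_per_thread=32):
        self.n_threads = n_threads
        self.regs_per_thread = regs_per_thread
        self.pcs = [0] * n_threads                        # N program counters
        self.regfile = [0] * (n_threads * regs_per_thread)  # N x registers

    def _index(self, tid, reg):
        # The thread id selects a bank; the register number selects a word.
        return tid * self.regs_per_thread + reg

    def read(self, tid, reg):
        return self.regfile[self._index(tid, reg)]

    def write(self, tid, reg, value):
        self.regfile[self._index(tid, reg)] = value


state = ThreadState(n_threads=2)
state.write(0, 5, 111)   # thread 0 writes its own r5
state.write(1, 5, 222)   # thread 1 writes its own r5, no conflict
print(state.read(0, 5), state.read(1, 5), len(state.regfile))
```

Growing the array only deepens the memory, which matches the slide's point that storage can be increased while preserving frequency.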
![Page 11: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/11.jpg)
Outline
• Architectural Support for Multiple Threads
• Soft Processor Infrastructure
• Improvements to Baseline Multithreading
![Page 12: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/12.jpg)
Measurement Infrastructure
• Single-thread processors: SPREE System [FPGA’06], which produces RTL
• Benchmarks (MiBench, Dhrystone 2.1, RATES, XiRisc)
• Modelsim RTL simulator: 1. Cycle Count
• Quartus II 5.0 CAD software, Stratix 1S40C5: 2. Resource Usage, 3. Clock Frequency, 4. Power
We can measure area/performance/energy accurately
![Page 13: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/13.jpg)
Evaluation methodology
• Same benchmark running on all threads
– Some mixed-benchmark results are in the paper
• Run until completion of the last thread
– Same instruction space
• We present results with fixed-latency on-chip RAM
– We are implementing a solution for off-chip RAM
![Page 14: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/14.jpg)
Processors: 3, 5 and 7 stages
Stage legend: F: Fetch, D: Decode, R: Register, EX: Execute, M: Memory, WB: Writeback
• Pipe3: F/D, R/EX/M, WB (1174 LEs, 78.3 MHz)
• Pipe5: F, D, R/EX1, EX2/M, WB (1283 LEs, 86.79 MHz)
• Pipe7: F, D, R, EX1, EX2/M, EX3/WB1, WB2 (1557 LEs, 100.59 MHz)
Best of each pipeline depth generated by SPREE
By default: thread count = number of pipeline stages
![Page 15: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/15.jpg)
Area efficiency results
[Figure: bar chart of area efficiency (MIPS / 1000 LEs), single-threaded vs. multithreaded (MT), for the 3-stage, 5-stage, and 7-stage processors; MT gains of 33%, 77%, and 106%]
• Area efficiency is most improved with deeper pipelines
• 3- and 7-stages have similar area efficiency
![Page 16: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/16.jpg)
IPC results for 3, 5 and 7 stages
[Figure: bar chart of IPC (instructions/cycle, axis 0 to 1) for pipe3_mt, pipe5_mt, and pipe7_mt on bubble_sort, crc, des, fft, fir, quant, iquant, vlc, bitcnts, gol, and the mean. Ideal IPC = 1.]
[Figure: mean normalized IPC versus single-threaded proc. (axis 0 to 2.5)]
24%, 45% and 104% more instructions per cycle, respectively
![Page 17: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/17.jpg)
Improvements to the Baseline Multithreaded Soft Processors
• Optimize away unpipelined multi-cycle paths
• Selection of architectural features
– 1) Multiplier implementation
– 2) Number of registers
– 3) Number of threads
• Combination of techniques optimizing area efficiency
![Page 18: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/18.jpg)
1- Changing multiplication support
[Figure: register file, multiplier, Hi/Lo registers, and a MUX]
• Default MIPS has Hi/Lo registers
• 3-operand multiplies (NIOS2 and Microblaze)
– Two instructions compute high and low parts
– Avoids replicating the Hi and Lo registers for each thread
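The difference between the two styles can be sketched as follows. The Hi/Lo class models a multiply that deposits its 64-bit result in two special registers (as MIPS `mult`, `mfhi`, `mflo` do), while the 3-operand functions return each half directly and keep no extra state. All Python names here are hypothetical, and signed 32-bit operands are assumed.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


class HiLoMultiplier:
    """Hi/Lo style: the product lives in extra state that must be
    replicated for every hardware thread."""

    def __init__(self):
        self.hi = 0
        self.lo = 0

    def mult(self, a, b):
        product = to_signed32(a) * to_signed32(b)
        self.lo = product & MASK32
        self.hi = (product >> 32) & MASK32


def mul_lo(a, b):
    """3-operand style: low 32 bits written straight to a destination."""
    return (to_signed32(a) * to_signed32(b)) & MASK32


def mul_hi(a, b):
    """3-operand style: high 32 bits written straight to a destination."""
    return ((to_signed32(a) * to_signed32(b)) >> 32) & MASK32


unit = HiLoMultiplier()
unit.mult(0xFFFFFFFF, 2)  # -1 * 2 = -2
print(hex(unit.hi), hex(unit.lo))
print(hex(mul_hi(0xFFFFFFFF, 2)), hex(mul_lo(0xFFFFFFFF, 2)))
```

Both styles produce the same bits; the 3-operand form simply has no per-thread Hi/Lo state to duplicate.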
![Page 19: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/19.jpg)
2- Reducing the register file
• Not all registers are utilized [RAAW’06]
• Many threads can combine the savings
• Results in saved memory blocks
• Applicable to the 5-stage processor
• Slightly increases cycle count due to increased register pressure
• Allows area and frequency improvements
[Figure: two threads with registers 1..N each occupy 2N entries; with registers 1..N-k each, they occupy 2N-2k entries]
20
Reducing the Number of Threads
• Usually: # threads = # pipeline stages• Last stage: writeback to non-conflicting register
Positive effect on the 5 and 7-stage processorsHelps meet processing latency deadline (shorter round-robin)Gives designers more flexibility
F F
E E
F
E
W W W
F F
E E
W W
F
E
W3 st
ages
Time
LegendThread1Thread2Thread3
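The latency argument can be sketched with simple arithmetic: under stall-free round-robin, each thread issues once every `n_threads` cycles, so a shorter round-robin shortens the time one thread needs to finish a fixed amount of work. The function name and the numbers are hypothetical, and the model ignores any stalls that fewer threads might reintroduce.

```python
def thread_latency_us(n_instr, n_threads, freq_mhz):
    """Approximate time (in microseconds) for one thread to issue
    n_instr instructions when it gets one slot every n_threads cycles."""
    cycles = n_instr * n_threads
    return cycles / freq_mhz  # cycles divided by MHz gives microseconds


five_threads = thread_latency_us(1000, n_threads=5, freq_mhz=100.0)
four_threads = thread_latency_us(1000, n_threads=4, freq_mhz=100.0)
print(five_threads, four_threads)
```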
![Page 21: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/21.jpg)
Conclusions
• Multithreaded SPs outperform single-threaded SPs
– Assumes independent threads
– Assumes use of on-chip memory
• 33%, 77% and 106% increase in area efficiency (3-, 5- and 7-stage)
• Demonstrated that benefits increase with pipeline depth
• Techniques to optimize away unpipelined multi-cycle paths
• Selection and combination of architectural features
– Multiplier support
– Number of threads
– Number of registers
• Commercial FPGA makers should have a Multi-Threaded SP
![Page 22: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/22.jpg)
Long term goals
• Multiple multithreaded soft processors
• Research using off-chip memory hierarchy
• Study of synchronization mechanisms
• Make easy to target and scale up for non-HW people

Experimental Testbed: NetFPGA
• Stanford/Xilinx platform
• Collaboration with network researchers
• Perform real high bandwidth experiments
–Virtex-II Pro
–4 x 1 Gbps Ethernet
–PCI board
–64 MB DDR2 DRAM
![Page 24: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/24.jpg)
Where do threads come from?
• Event processing, e.g. multiple sources of interrupts
• Packet processing, e.g. CAN, RS-485, Ethernet, etc.
• Systems handling requests, e.g. bus controllers
For now, we consider independent threads
![Page 25: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/25.jpg)
SPREE vs Nios II [IEEE TCAD’07]
[Figure: scatter plot of geomean wall clock time (us) versus area (equivalent LEs) for SPREE processors and Altera Nios II/e, Nios II/s, and Nios II/f; annotations mark the "smaller" and "faster" directions]
![Page 26: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/26.jpg)
Architectural Parameters Used in SPREE
We focus on core microarchitecture (for now)
• Multiplication support: hardware FU or software routine
• Shifter implementation: flipflops, multiplier, or LUTs
• Pipelining depth (2-7 stages)
• Forwarding lines
![Page 27: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/27.jpg)
Contributions on Multithreaded Soft Processors
• Multithreaded SPs dominate single-threaded processors in area and IPC
• Demonstrated that these benefits increase with the # of pipeline stages
• Explained techniques to optimize away unpipelined multi-cycle paths
• Selection of architectural features
– Number of threads
– Number of registers
– Multiplier support
• Combination of techniques that optimize area efficiency
![Page 28: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/28.jpg)
Unpipelined Multicycle Paths
• Important source of IPC improvement
• Not practical in ST because of hazard detection
[Figure: example of a 3-stage pipeline with multicycle paths on loads, stores, shifts, and multiplies, drawn for ST (stage labels F/D, R/EX, EX, WB) and MT (stage labels F/D, R/EX, M, WB)]
![Page 29: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/29.jpg)
Changing multiplication support
[Figure: bar chart of normalized area (equiv. LEs), frequency (MHz), and energy per instruction (nJ/instr) for Hi/Lo versus 3op multiplies on the 3-stage, 5-stage, and 7-stage processors]
For multithreaded SPs, 3op-multiplies always win
![Page 30: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/30.jpg)
Reducing the Number of Threads
[Figure: bar chart of normalized area (equiv. LEs), frequency (MHz), and energy per instruction (nJ/instr) for pipe3_mt_2T, pipe5_mt_4T, and pipe7_mt_6T]
Positive effect on the 5 and 7-stage processors
![Page 31: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/31.jpg)
SPREE System (Soft Processor Rapid Exploration Environment)
■ Input: Processor description (ISA and datapath)
■ Made of hand-coded components
■ SPREE System
1. Verify ISA against datapath
2. Datapath Instantiation
3. Control Generation
■ Output: Synthesizable Verilog (RTL)
![Page 32: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/32.jpg)
Multithreading
• Replace processor stalls
– Fill them with instructions from other threads
• When to switch thread?
– Multiple techniques
– Most common: every instruction (e.g. Sun’s Niagara)
• Fine-grained multithreading: 1 instr. per thread in round-robin
Interleaved instructions in pipeline, over time: T1 T2 T3 T1 T2 T3

$$\text{Area efficiency} = \frac{\text{Million Instr.} \times \text{Frequency}}{\#\,\text{Cycles} \times \text{Area}}$$
![Page 34: Improving Pipelined Soft Processors with Multithreading](https://reader035.fdocuments.net/reader035/viewer/2022081513/56813ff8550346895dab2373/html5/thumbnails/34.jpg)
Removed load and branch delay slots in the code