Posted on 23-Feb-2016
HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing
Michael Adler, Elliott Fleming, Michael Pellauer, Joel Emer
2
Simulating Multicores
Simulating an N-core multicore target is:
• Fundamentally N times the work
• Plus the on-chip network
Duplicating cores will quickly fill the FPGA; going multi-FPGA will slow simulation.
[Figure: an 8-CPU grid connected by an on-chip network]
3
Trading Time for Space
We can leverage the separation of the model clock and the FPGA clock to save space.
• Two techniques: serialization and time-multiplexing
But doesn’t this just slow down our simulator?
The tradeoff is a good idea if we can:
• Save a lot of space
• Improve the FPGA critical path
• Improve utilization
• Slow down rare events while keeping common events fast
LI approach enables a wide range of tradeoff options
4
Serialization: A First Tradeoff
5
Example Tradeoff: Multi-Port Register File
2 read ports, 2 write ports
• 5-bit index, 32-bit data
• Reads take zero clock cycles
Virtex-2 Pro FPGA: 9242 slices (>25%), 104 MHz
[Figure: a 2R/2W register file with two read-address inputs, two write-address/value pairs, and two read-value outputs]
6
Trading Time for Space
Simulate the circuit sequentially using a BlockRAM:
• 94 slices (<1%), 1 BlockRAM, 224 MHz (2.2x)
• Simulation rate is 224 / 3 ≈ 75 MHz
[Figure: the same register-file interface implemented with a 1R/1W BlockRAM driven by an FSM]
FPGA-cycle to Model-cycle Ratio (FMR)
• Each module may have a different FMR
• A-Ports allow us to connect many such modules together
• Maintain a consistent notion of model time
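The arithmetic behind the 75 MHz figure is simply the FPGA clock divided by the FMR. A minimal sketch (the function name is illustrative, not from the HAsim codebase):

```python
# Illustrative sketch: effective simulation rate given an FPGA clock
# frequency and an FPGA-to-Model cycle Ratio (FMR).

def simulation_rate_mhz(fpga_mhz: float, fmr: float) -> float:
    """Model cycles per second = FPGA cycles per second / FMR."""
    return fpga_mhz / fmr

# The serialized register file: 224 MHz FPGA clock, 3 FPGA cycles
# per model cycle.
rate = simulation_rate_mhz(224, 3)
print(round(rate, 1))  # 74.7
```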
7
Example: Inorder Front End
[Figure: the in-order front end as A-Port-connected modules (FET, BranchPred, LinePred, IMEM with I$ and ITLB, PC Resolve, InstQ) joined by ports of latency 0, 1, or 2 model cycles on paths such as pred/training, redirect, mispred, fault, first/deq/slot, enq-or-drop, and vaddr/paddr to and from the back end. A legend marks each module as ready to simulate or not]
• Modules may simulate at any wall-clock rate
• Corollary: adjacent modules may not be simulating the same model cycle
8
Simulator “Slip”
Adjacent modules can be simulating different model cycles!
• In the paper: a distributed resynchronization scheme
This can speed up simulation:
• Case study: achieved 17% better performance than a centralized controller
• Performance can approach the dynamic average
[Figure: two FET→DEC module pairs, lock-step vs. slipped]
Let’s see how...
9
Traditional Software Simulation
[Figure: a software simulator advances one stage per wallclock step, simulating FET, DEC, EXE, MEM, and WB of each model cycle sequentially, NOP stages included]
10
2008.06.30: Challenges in Conducting Compelling Architecture Research
Global Controller “Barrier” Synchronization
[Figure: with global "barrier" synchronization, every stage (FET, DEC, EXE, MEM, WB) advances to the next model cycle together on each round of FPGA clock cycles (CCs), so each model cycle costs as many FPGA CCs as its slowest module; instructions A through D trickle through with many NOP slots]
11
A-Ports Distributed Synchronization
[Figure: with A-Port synchronization, stages run ahead in model time until their port buffering fills, and long-running operations can overlap even when they fall on different FPGA CCs; instructions A through G move through FET/DEC/EXE/MEM/WB with far fewer wasted slots]
Takeaway: LI makes serialization tradeoffs more appealing
12
Leveraging Latency-Insensitivity
• Expensive instructions: the EXE module's FPU can call out over RRR to the LEAP Instruction Emulator (M5) [with Parashar, Adler]
• Modeling large caches: the cache controller backs the modeled L2$ with the LEAP Scratchpad hierarchy: BRAM (KBs, 1 CC), SRAM (MBs, 10s of CCs), and system memory (GBs, 100s of CCs), e.g. 256 KB of RAM on the FPGA board
13
Time-Multiplexing: A Tradeoff to Scale Multicores
14
Multicores Revisited
What if we duplicate the cores?
Benefits:
• Simple to describe
• Maximum parallelism
Drawbacks:
• Probably won’t fit
• Low utilization of functional units
[Figure: three duplicated cores, CORE 0 through CORE 2, each with its own state]
15
Module Utilization
A module is unutilized on an FPGA cycle if it is:
• Waiting for all input ports to be non-empty, or
• Waiting for all output ports to be non-full
Case study: in-order functional units were utilized on only 13% of FPGA cycles, on average.
[Figure: two FET→DEC module pairs connected by latency-1 A-Ports]
16
Time-Multiplexing: First Approach
Duplicate state, sequentially share logic.
Benefits:
• Better unit utilization
Drawbacks:
• More expensive than duplication(!)
[Figure: virtual instances with duplicated state multiplexed onto one physical pipeline]
17
Round-Robin Time-Multiplexing
Fix the ordering, remove the multiplexors.
Benefits:
• Much better area
• Good unit utilization
Drawbacks:
• Head-of-line blocking may limit performance
• Need to limit the impact of slow events
• Pipeline at a fine granularity
• Need a distributed, controller-free mechanism to coordinate...
[Figure: per-instance state rotated round-robin through one physical pipeline]
18
Port-Based Time-Multiplexing
• Duplicate the local state in each module
• Change the port implementation:
  – Minimum buffering: N × latency + 1
  – Initialize each FIFO with N × latency tokens
• Result: adjacent modules can simultaneously simulate different virtual instances
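A software sketch of the modified port, under the buffering rule above (class and names are illustrative; the real A-Ports are hardware FIFOs):

```python
from collections import deque

NO_MESSAGE = None  # stand-in for the port's NoMessage token

class MultiplexedPort:
    """Sketch of a time-multiplexed A-Port: for N virtual instances
    and port latency L, capacity is N*L + 1 and the FIFO starts with
    N*L initial tokens."""
    def __init__(self, n_instances, latency):
        self.capacity = n_instances * latency + 1
        self.fifo = deque([NO_MESSAGE] * (n_instances * latency))

    def can_enq(self):
        return len(self.fifo) < self.capacity

    def can_deq(self):
        return len(self.fifo) > 0

    def enq(self, msg):
        assert self.can_enq(), "port full"
        self.fifo.append(msg)

    def deq(self):
        assert self.can_deq(), "port empty"
        return self.fifo.popleft()

# 4 virtual instances, latency 1: 4 initial tokens, capacity 5, so the
# producer can run a full pass (4 model cycles) ahead of the consumer.
port = MultiplexedPort(4, 1)
print(len(port.fifo), port.capacity)  # 4 5
```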
19
The Front End Multiplexed
[Figure: the same front-end diagram, now time-multiplexed. The legend shows which virtual CPU each module is currently simulating, e.g. FET on CPU 1 while IMEM is on CPU 2 in the same FPGA cycle]
20
On-Chip Networks in a Time-Multiplexed World
21
Problem: On-Chip Network
[Figure: CPUs 0 through 2, each with an L1/L2 $, plus a memory controller, each attached to a router (r); the routers exchange msg/credit pairs over the on-chip network]
• Problem: routing wires to and from each router
• Similar to the "global controller" scheme
• Also, utilization is low
22
Multiplexing On-Chip Network Routers
[Figure: routers 0 through 3 multiplexed onto one physical Router 0..3. Each "to" output stream is reordered by a rotation before feeding back as a "from" input: σ(x) = (x + 1) mod 4, σ(x) = (x + 2) mod 4, and σ(x) = (x + 3) mod 4 for the +1, +2, and +3 neighbor links]
Simulate the network without a network
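In software terms the reorder step is just a static rotation of the time-multiplexed message stream (an illustrative sketch, not the Bluespec implementation):

```python
def rotate(msgs, k):
    """Apply sigma(x) = (x + k) mod N to one pass of N multiplexed
    messages: the message produced on step x is consumed on step
    (x + k) mod N of the next pass."""
    n = len(msgs)
    out = [None] * n
    for x, m in enumerate(msgs):
        out[(x + k) % n] = m
    return out

# Virtual routers 0..3 each send to their +1 neighbor:
print(rotate(["m0", "m1", "m2", "m3"], 1))  # ['m3', 'm0', 'm1', 'm2']
```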
23
Ring/Double Ring Topology Multiplexed
[Figure: routers 0 through 3 in a ring, multiplexed onto one physical Router 0..3 whose "to next" output is permuted by σ(x) = (x + 1) mod 4 back into its "from prev" input]
Opposite direction: flip "to" and "from".
24
Implementing Permutations on FPGAs Efficiently
Side buffer
• Fits rotation permutations like ring/torus (e.g., σ(x) = (x + 1) mod N)
• Realized by moving the first element to the Nth slot, the Nth to the first, or every Kth to slot N−K
Indirection table
• More general, but more expensive
[Figure: a permutation table indexing a RAM buffer under FSM control, implementing σ(x) = (x + 1) mod 4]
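The side buffer can be sketched as a single element held across passes (illustrative Python for the σ(x) = (x + 1) mod N case; the slide's circuit does this with one held register rather than lists):

```python
def side_buffer_passes(passes):
    """Sketch of the side-buffer realization of sigma(x) = (x+1) mod N:
    hold the last element of each pass aside and emit it as slot 0 of
    the next pass, so element x lands in slot (x+1) mod N without a
    full indirection table."""
    held = None
    out = []
    for p in passes:
        out.append([held] + p[:-1])  # slot 0 <- element N-1 of prior pass
        held = p[-1]                 # side-buffer the last element
    return out

# Two passes through a 4-way multiplexed link:
print(side_buffer_passes([[0, 1, 2, 3], [4, 5, 6, 7]]))
# [[None, 0, 1, 2], [3, 4, 5, 6]]
```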
25
Torus/Mesh Topology Multiplexed
Mesh: Don’t transmit on non-existent links
26
Dealing with Heterogeneous Networks
Compose “Mux Ports” with permutation ports.
In the paper: generalized to any topology.
27
Putting It All Together
28
Typical HAsim Model Leveraging these Techniques
• 16-core chip multiprocessor
• 10-stage pipeline (speculative, bypassed)
• 64-bit Alpha ISA, floating point
• 8 KB lockup-free L1 caches
• 256 KB 4-way set-associative L2 cache
• Network: 2 virtual channels, 4 slots, x-y wormhole routing
[Figure: the modeled pipeline stages F, BP1, BP2, PCC, IQ, D, X, DM, CQ, C, with ITLB/I$, DTLB/D$, a load/store queue, the L2$, and a Route stage]
• Single detailed pipeline, 16-way time-multiplexed
• 64-bit Alpha functional partition, floating point
• Modeled caches backed by a separate physical cache hierarchy
• Single router, multiplexed, 4 permutations
[Chart: synthesis results as a percentage of a Xilinx Virtex-5 330T: registers, LUTs, and BRAM, broken down into LEAP, functional partition, OCN, L1/L2, and core]
29
Time-Multiplexed Multicore Simulation Rate Scaling

               Best    Worst   Avg
FMR            15.7    27.1    18.4
FMR per core    5.4    14.4     8.95
FMR per core    8.5    13.5    11.6
FMR per core    8.45   19.8    11.5
33
Takeaways
The latency-insensitive approach provides a unified way to explore interesting tradeoffs.
Serialization: leverage FPGA-efficient circuits at the cost of FMR
• A-Port-based synchronization can amortize the cost by giving the dynamic average
• Especially if long events are rare
Time-multiplexing: reuse datapaths and duplicate only the state
• The A-Port-based approach means not all modules are fully utilized
• Increased utilization means that the performance degradation is sublinear
• Time-multiplexing the on-chip network requires permutations
34
Next Steps
Here we were able to push one FPGA to its limits.
What if we want to scale further?
Next, we’ll explore how latency-insensitivity can help us scale to multiple FPGAs with better performance than traditional techniques.
We’ll also see how we can increase designer productivity by abstracting the platform.
36
Resynchronizing Ports
Modules follow a modified scheme:
• If any incoming port is heavy, or any outgoing port is light, simulate the next cycle (when ready)
• Result: balanced without centralized coordination
Argument:
• Modules farthest ahead in time will never proceed
  – Ports into (out of) this set will be light (resp. heavy)
  – Therefore those modules may try to proceed, but will not be able to
• There is also a set farthest behind in time
  – Always able to proceed
  – Since the graph is connected, simulating only enables modules, making progress towards quiescence
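The local decision each module makes can be sketched as a pure predicate (names are illustrative; "heavy" and "light" are taken relative to a port's balanced steady-state token count):

```python
def ready_to_simulate(in_occ, out_occ, balanced):
    """Sketch of the distributed resynchronization rule: a module
    simulates its next model cycle if any incoming port is heavy
    (occupancy above the balanced level) or any outgoing port is
    light (below it)."""
    heavy_in = any(occ > balanced for occ in in_occ)
    light_out = any(occ < balanced for occ in out_occ)
    return heavy_in or light_out

# A module whose inputs have piled up should run:
print(ready_to_simulate([2, 1], [1, 1], balanced=1))  # True
# A perfectly balanced module has no local reason to run:
print(ready_to_simulate([1, 1], [1, 1], balanced=1))  # False
```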
37
Other Topologies
Tree and Butterfly
[Figure: tree and butterfly topologies multiplexed onto physical routers. Because their port sets are not full permutations, each physical port carries a per-step schedule vector, e.g. [1, 1, 1, 0, 0, 0, 0] or [2, 0, 1, 0, 1, 0, 1], listing which virtual router or physical core is the endpoint on each multiplexing step]
38
Generalizing OCN Permutations
• Represent the model as a directed graph G = (M, P)
• Label the modules M with a simulation order: 0..(N−1)
• Partition the ports into sets P0..Pm where:
  – No two ports in a set Pm share a source
  – No two ports in a set Pm share a destination
• Transform each Pm into a permutation σm
  – For all (s, d) in Pm, σm(s) = d
  – Holes in the range represent “don’t cares”
  – Always send NoMessage on those steps
• Time-multiplex the module as usual
  – Associate each σm with a physical port
39
Example: Arbitrary Network
[Figure: an arbitrary six-router network (routers 0 through 5, with modules A, B, C) whose ports are partitioned into three partial permutations: P0 = {(1, 0), (3, 1)}, P1 = {(5, 1), (1, 2), (2, 3), (4, 0)}, P2 = {(0, 4), (4, 1)}]
40
Results: Multicore Simulation Rate
              FMR                   Simulation Rate
              Min   Max   Avg      Min        Max       Avg
Overall       16    218   80       160 KHz    3.2 MHz   625 KHz
Per core      5     27    11       1.84 MHz   9.5 MHz   4.54 MHz
• Must simulate multiple cores to get the full benefit of time-multiplexed pipelines
• Functional cache pressure is the rate-limiting factor