NetThreads: Programming NetFPGA with Threaded Software
Martin Labrecque, Gregory Steffan (ECE Dept.)
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)
University of Toronto
Real-Life Customers
● Hardware:
– NetFPGA board, 4 GigE ports, Virtex II Pro FPGA
● Collaboration with CS researchers
– Interested in performing network experiments
– Not in coding Verilog
– Want to use GigE link at maximum capacity
Requirements:
– Easy-to-program system
– Efficient system
What would the ideal solution look like?
Envisioned System (Someday)
● Many compute engines
● Delivers the expected performance
● Hardware handles communication and synchronization
[Figure: arrays of processors and hardware accelerators, labeled with control-flow parallelism and data-level parallelism]
Processors inside an FPGA?
Soft Processors in FPGAs
● Soft processors: processors in the FPGA fabric
● FPGAs increasingly implement SoCs with CPUs
● Commercial soft processors: NIOS-II and MicroBlaze
• Easier to program than HDL
• Customizable
[Figure: an FPGA containing a soft processor (datapath with PC, instruction memory, register array, ALU, and data memory) alongside DDR controllers and Ethernet MACs]
What is the performance requirement?
Performance In Packet Processing
● The application defines the throughput required
– Home networking (~100 Mbps/link)
– Edge routing (≥ 1 Gbps/link)
– Scientific instruments (< 100 Mbps/link)
● Our measure of throughput:
– Bisection search of the minimum packet inter-arrival time
– Must not drop any packet
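The throughput measure above can be sketched as a bisection over the inter-arrival time. This is only an illustration: the callback type `drops_fn`, the toy model `toy_drops`, and the choice of cycles as the time unit are assumptions, not part of NetThreads. It also assumes that drops are monotonic (if one inter-arrival time causes no drops, every longer one causes none either) and that the upper bound of the search range causes no drops.

```c
#include <stdint.h>

/* Hypothetical callback: runs one experiment and returns nonzero if any
 * packet was dropped at the given inter-arrival time (in cycles). */
typedef int (*drops_fn)(uint32_t inter_arrival_cycles, void *ctx);

/* Bisection search for the smallest inter-arrival time in [lo, hi] at which
 * no packet is dropped. Assumes `hi` itself causes no drops. */
uint32_t min_inter_arrival(uint32_t lo, uint32_t hi, drops_fn drops, void *ctx)
{
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        if (drops(mid, ctx))
            lo = mid + 1;   /* too fast: packets were dropped */
        else
            hi = mid;       /* sustainable: try a shorter inter-arrival time */
    }
    return lo;
}

/* Toy stand-in for a real run: drops whenever packets arrive faster than
 * the threshold stored in ctx. */
int toy_drops(uint32_t inter_arrival_cycles, void *ctx)
{
    uint32_t threshold = *(const uint32_t *)ctx;
    return inter_arrival_cycles < threshold;
}
```

In a real measurement the callback would replay a packet trace at the requested rate and report whether the system dropped anything.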
Are soft processors fast enough?
Realistic Goals
● 10^9 bps (1 Gbps) stream with normal inter-frame gap of 12 bytes
● 2 processors running at 125 MHz
● Cycle budget:
– 152 cycles for minimally-sized 64B packets;
– 3060 cycles for maximally-sized 1518B packets
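The two budgets can be reproduced with a short calculation. The sketch below assumes that each packet occupies its frame bytes plus the 12-byte gap on the wire (the preamble is not counted, which is inferred from the figures rather than stated on the slide) and that the two processors share the packet stream evenly. The function name `cycle_budget` is illustrative.

```c
#include <stdint.h>

/* Cycles available per packet across all processors when packets arrive
 * back to back at line rate. */
uint64_t cycle_budget(uint64_t packet_bytes, uint64_t gap_bytes,
                      uint64_t line_rate_bps, uint64_t clock_hz,
                      uint64_t num_processors)
{
    /* Bits occupied on the wire by one packet and its inter-frame gap. */
    uint64_t bits_on_wire = (packet_bytes + gap_bytes) * 8;
    /* Clock cycles that elapse on one processor during that time. */
    uint64_t cycles_per_slot = bits_on_wire * clock_hz / line_rate_bps;
    /* Each processor handles every num_processors-th packet. */
    return cycles_per_slot * num_processors;
}
```

At 125 MHz and 1 Gbps one byte arrives per cycle, so a 64-byte packet gives (64 + 12) × 2 = 152 cycles and a 1518-byte packet gives (1518 + 12) × 2 = 3060 cycles.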
Soft processors: non-trivial processing at line rate!
How can they efficiently be organized?
Key Design Features
Efficient Network Processing
● Memory system with specialized memories
● Multithreaded soft processor
● Multiple processors support
Multiprocessor System Diagram
[Figure: two 4-thread processors, each with its own instruction cache (I$), connected to a synchronization unit, an input buffer (packet input), a data cache backed by off-chip DDR, and an output buffer (packet output); separate memories for instructions, data, input, and output]
- Overcomes the 2-port limitation of block RAMs
- Shared data cache is not the main bottleneck in our experiments
Performance of Single-Threaded Processors
● Single-issue, in-order pipeline
● Should commit 1 instruction every cycle, but:
– stall on instruction dependences
– stall on memory, I/O, and accelerator accesses
● Throughput depends on sequential execution:
– packet processing
– device control
– event monitoring
Solution to Avoid Stalls: Multithreading
many concurrent threads
Avoiding Processor Stall Cycles
BEFORE: single-threaded, traditional execution
[Figure: 5-stage pipeline over time, with bubbles caused by data or control hazards]
AFTER: multithreading, executing streams of independent instructions
[Figure: 5-stage pipeline over time, with instructions from Thread1 to Thread4 interleaved]
● Ideally, eliminates all stalls
● 4 threads eliminate hazards in 5-stage pipeline
● 5-stage pipeline is 77% more area efficient [FPL’07]
Multithreading Evaluation
Infrastructure
• Compilation:
– modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
• Timing:
– no free PLL: processors run at the speed of the Ethernet MACs, 125 MHz
• Platform:
– 2 processors, 4 MAC + 1 DMA ports, 64 Mbytes 200 MHz DDR2 SDRAM
– Virtex II Pro 50 (speed grade 7ns)
– 16KB private instruction caches and shared data write-back cache
– Capacity would be increased on a more modern FPGA
• Validation:
– Reference trace from MIPS simulator
– Modelsim and online instruction trace collection
- PC server can send ~0.7 Gbps of maximally sized packets
- Simple packet echo application can keep up
- Complex applications are the bottleneck, not the architecture
Our benchmarks

| Benchmark | Description | Dynamic instructions per packet (x1000) | Variance of instructions per packet (x1000) |
|---|---|---|---|
| UDHCP | DHCP server | 35 | 36 |
| Classifier | Regular expression + QOS | 13 | 35 |
| NAT | Network Address Translation + Accounting | 6 | 7 |

Realistic non-trivial applications, dominated by control flow
What is limiting performance?
[Figure: number of active threads (0 to 9) versus time (0 to 7,000,000 cycles), annotated "Packet backlog due to synchronization serializing tasks"]
Let’s focus on the underlying problem: synchronization
Addressing Synchronization Overhead
Real Threads Synchronize
• All threads execute the same code
• Concurrent threads may access shared data
• Critical sections ensure correctness
Lock();
shared_var = f();
Unlock();
Impact on round-robin scheduled threads?
Thread1 Thread2 Thread3 Thread4
Multithreaded processor with Synchronization
[Figure: 5-stage pipeline over time with the threads' instructions interleaved; one thread acquires a lock and later releases it]
Synchronization Wrecks Round-Robin Multithreading
[Figure: 5-stage pipeline over time, sparsely occupied between "Acquire lock" and "Release lock"]
With round-robin thread scheduling and contention on locks:
- < 4 threads execute concurrently
- > 18% cycles are wasted while blocked on synchronization
Better Handling of Synchronization
BEFORE: [Figure: 5-stage pipeline over time under round-robin scheduling, with unused slots while threads wait for the lock]
AFTER: [Figure: 5-stage pipeline over time with Thread3 and Thread4 descheduled ("DESCHEDULE"), so the remaining threads fill the pipeline]
Thread scheduler
• Suspend any thread waiting for a lock
• Round-robin among the remaining threads
• Unlock operation resumes threads across processors
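The policy in these three bullets can be sketched in software for one processor. The real scheduler is hardware; the type `sched_t` and the function names below are hypothetical and only model the selection rule (skip suspended threads, otherwise round-robin).

```c
#include <stdint.h>

#define NUM_THREADS 4

/* Hypothetical scheduler state for one processor. */
typedef struct {
    uint8_t active;  /* bit i set: thread i is not waiting for a lock */
    int last;        /* thread that issued most recently */
} sched_t;

void sched_init(sched_t *s)
{
    s->active = (uint8_t)((1u << NUM_THREADS) - 1);  /* all threads runnable */
    s->last = NUM_THREADS - 1;                       /* so thread 0 goes first */
}

/* Called when a thread blocks on a lock. */
void sched_suspend(sched_t *s, int tid)
{
    s->active &= (uint8_t)~(1u << tid);
}

/* Called when an unlock operation releases a lock the thread was waiting for. */
void sched_resume(sched_t *s, int tid)
{
    s->active |= (uint8_t)(1u << tid);
}

/* Picks the next thread in round-robin order among the active ones.
 * Returns -1 if every thread is suspended. */
int sched_next(sched_t *s)
{
    for (int step = 1; step <= NUM_THREADS; step++) {
        int tid = (s->last + step) % NUM_THREADS;
        if (s->active & (1u << tid)) {
            s->last = tid;
            return tid;
        }
    }
    return -1;
}
```

With thread 2 suspended, successive calls return 3, 0, 1, 3, and so on. Once fewer than four threads are active, the same thread can issue again after fewer than four cycles.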
- Multithreaded processor hides hazards across active threads
- Fewer than N threads requires hazard detection
But, hazard detection was on critical path of single threaded processor
Is there a low cost solution?
Static Hazard Detection
• Hazards can be determined at compile time

| Instruction | Hazard distance (maximum 2) | Min. issue cycle |
|---|---|---|
| addi r1,r1,r4 | 0 | 0 |
| addi r2,r2,r5 | 1 | 1 |
| or r1,r1,r8 | 0 | 3 |
| or r2,r2,r9 | 0 | 4 |

- Hazard distances are encoded as part of the instructions
- Static hazard detection allows scheduling without an extra pipeline stage
- Very low area overhead (5%), no frequency penalty
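One reading of the table, offered as an assumption rather than as the authors' algorithm: a consumer may issue no earlier than 3 cycles after its producer, and the hazard distance stored with an instruction is the number of extra cycles the next instruction of the same thread must wait. The sketch below computes both columns under that reading; `DEP_LATENCY`, `instr_t`, and `hazard_distances` are illustrative names.

```c
#include <stddef.h>

#define DEP_LATENCY 3   /* assumption: consumer issues >= 3 cycles after producer */
#define NO_REG (-1)

typedef struct {
    int dest;        /* destination register, NO_REG if none */
    int src1, src2;  /* source registers, NO_REG if unused */
} instr_t;

/* For a straight-line sequence of one thread, fills issue[i] with the earliest
 * issue cycle of instruction i and dist[i] with the number of stall cycles
 * needed between instruction i and instruction i + 1. */
void hazard_distances(const instr_t *code, size_t n, int *issue, int *dist)
{
    for (size_t i = 0; i < n; i++) {
        /* Without hazards, one instruction issues per cycle. */
        int earliest = (i == 0) ? 0 : issue[i - 1] + 1;
        for (size_t p = 0; p < i; p++) {
            int depends = code[p].dest != NO_REG &&
                          (code[p].dest == code[i].src1 ||
                           code[p].dest == code[i].src2);
            if (depends && issue[p] + DEP_LATENCY > earliest)
                earliest = issue[p] + DEP_LATENCY;
        }
        issue[i] = earliest;
        if (i > 0)
            dist[i - 1] = issue[i] - issue[i - 1] - 1;
    }
    if (n > 0)
        dist[n - 1] = 0;  /* nothing follows the last instruction */
}
```

For the four instructions in the table this gives issue cycles 0, 1, 3, 4 and hazard distances 0, 1, 0, 0. Back-to-back dependent instructions give the maximum distance of 2.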
Thread Scheduler Evaluation
Results on 3 benchmark applications
[Figure: bar chart of throughput (packets per second, 0 to 5000) for UDHCP, NAT, and Classifier under four configurations: Round-Robin 1p, Round-Robin 2p, Scheduler 1p, Scheduler 2p]
- Thread scheduling improves throughput by 63%, 31%, and 41%
- Why isn’t the 2nd processor always improving throughput?
Cycle Breakdown in Simulation
[Figure: stacked bars of cycles per instruction (0 to 12) for UDHCP, Classifier, and NAT, each under RR and S1, broken down into Busy, No Packet, Locked, and Other]
- Removed cycles stalled waiting for a lock
- What is the bottleneck?
Impact of Allowing Packet Drops
[Figure: throughput (packets/sec, 0 to 25000) versus allowed percentage of packet drops (0 to 5), for 1 processor and 2 processors]
- System still under-utilized
- Throughput still dominated by serialization
Future Work
• Adding custom hardware accelerators
– Same interconnect as processors
– Same synchronization interface
• Evaluate speculative threading
– Alleviate need for fine-grained synchronization
– Reduce conservative synchronization overhead
Conclusions
• Efficient multithreaded design
– Parallel threads hide stalls on one thread
– Thread scheduler mitigates synchronization costs
• System Features
– System is easy to program in C
– Performance from parallelism is easy to get
On the lookout for relevant applications suitable for benchmarking
NetThreads available with compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
Backup
Software Network Processing
• Not meant for:
– Straightforward tasks accomplished at line speed in hardware
– E.g. basic switching and routing
• Advantages compared to hardware:
– Complex applications are best described in high-level software
– Easier to design, with fast time-to-market
– Can interface with custom accelerators, controllers
– Can be easily updated
• Our focus: stateful applications
– Data structures modified by most packets
– Difficult to pipeline the code into balanced stages
Run-to-Completion/Pool-of-Threads model for parallelism:
− Each thread processes a packet from beginning to end
− No thread-specific behavior
Impact of allowing packet drops
[Figure: normalized throughput (packets/sec, 0 to 7) versus allowed percentage of packet drops (0 to 5) for the NAT benchmark, comparing Round-Robin 1p, Round-Robin 2p, Sched 1p, and Sched 2p]
Cycle Breakdown in Simulation
[Figure: stacked bars of cycles per instruction (0 to 10) for UDHCP, Classifier, and NAT, each under RR and S1, broken down into Busy, No Packet, Locked, Hazard/Bubble/Squashed, and Other]
- Removed cycles stalled waiting for a lock
- Throughput still dominated by serialization
More Sophisticated Thread Scheduling
• Add pipeline stage to pick hazard-free instruction
• Result:
– Increased instruction latency
– Increased hazard window
– Increased branch mis-prediction cost
[Figure: pipeline stages Fetch, Thread Selection (MUX), Register Read, Execute, Memory, Writeback]
Add hazard detection without an extra pipeline stage?
Implementation
• Where to store the hazard distance bits?
– Block RAMs are a multiple of 9 bits wide
– A 36-bit word leaves 4 bits available
• Also encode lock and unlock flags
[Figure: 36-bit word = 4 bits (lock/unlock + hazard distance) + 32-bit instruction; the I$ of each 4-thread processor is 36 bits wide, while off-chip DDR is 32 bits wide]
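A minimal sketch of such a 36-bit word follows. The slide only says that 4 bits hold the lock/unlock flags and the hazard distance; the exact bit positions used here (2 bits of hazard distance, then one lock bit and one unlock bit) are an assumption made for illustration, as are the names `tagged_instr_t`, `pack36`, and `unpack36`.

```c
#include <stdint.h>

/* Assumed layout, not taken from the slides:
 *   bits 35..34  hazard distance (0 to 2)
 *   bit  33      lock flag
 *   bit  32      unlock flag
 *   bits 31..0   original 32-bit instruction */
typedef struct {
    uint32_t instr;
    unsigned hazard_dist;
    unsigned lock;
    unsigned unlock;
} tagged_instr_t;

/* Packs the instruction and its 4 tag bits into the low 36 bits of a uint64_t. */
uint64_t pack36(tagged_instr_t t)
{
    uint64_t tag = ((uint64_t)(t.hazard_dist & 0x3u) << 2) |
                   ((uint64_t)(t.lock & 0x1u) << 1) |
                   (uint64_t)(t.unlock & 0x1u);
    return (tag << 32) | t.instr;
}

/* Recovers the instruction and tag bits from a 36-bit word. */
tagged_instr_t unpack36(uint64_t word)
{
    tagged_instr_t t;
    unsigned tag = (unsigned)((word >> 32) & 0xFu);
    t.instr = (uint32_t)(word & 0xFFFFFFFFu);
    t.hazard_dist = (tag >> 2) & 0x3u;
    t.lock = (tag >> 1) & 0x1u;
    t.unlock = tag & 0x1u;
    return t;
}
```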
How to convert instructions from 36 bits to 32 bits?
Instruction Compaction: 36 → 32 bits
R-Type Instructions: opcode (6) | rs (5) | rt (5) | rd (5) | sa (5) | function (6)
Example: add rd, rs, rt
I-Type Instructions: opcode (6) | rs (5) | rt (5) | immediate (16)
Example: addi rt, rs, immediate
J-Type Instructions: opcode (6) | target (26)
Example: j label
- De-compaction: 2 block RAMs + some logic between DDR and cache
- Not a critical path of the pipeline
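The field widths listed for the three formats are the standard MIPS-I encodings, so extracting them from a 32-bit word is a matter of shifts and masks. The sketch below shows only that field extraction. It does not implement the 36-to-32-bit compaction itself, whose details the slides do not give; `mips_fields_t` and `mips_decode` are illustrative names.

```c
#include <stdint.h>

/* All fields are extracted unconditionally; which ones are meaningful
 * depends on the instruction format. */
typedef struct {
    unsigned opcode;  /* bits 31..26, all formats */
    unsigned rs;      /* bits 25..21, R- and I-type */
    unsigned rt;      /* bits 20..16, R- and I-type */
    unsigned rd;      /* bits 15..11, R-type */
    unsigned sa;      /* bits 10..6,  R-type */
    unsigned funct;   /* bits 5..0,   R-type */
    unsigned imm;     /* bits 15..0,  I-type */
    uint32_t target;  /* bits 25..0,  J-type */
} mips_fields_t;

mips_fields_t mips_decode(uint32_t w)
{
    mips_fields_t f;
    f.opcode = (w >> 26) & 0x3Fu;
    f.rs     = (w >> 21) & 0x1Fu;
    f.rt     = (w >> 16) & 0x1Fu;
    f.rd     = (w >> 11) & 0x1Fu;
    f.sa     = (w >> 6)  & 0x1Fu;
    f.funct  = w & 0x3Fu;
    f.imm    = w & 0xFFFFu;
    f.target = w & 0x03FFFFFFu;
    return f;
}
```

For example, the word 0x00221820 decodes as an R-type `add` with rs = 1, rt = 2, rd = 3, and 0x20220005 decodes as an I-type `addi` with rs = 1, rt = 2, immediate = 5.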