An Efficient Programmable 10 Gigabit Ethernet Network Interface Card
Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai
2
Designing a 10 Gigabit NIC
Programmability for performance: computation offloading improves performance
NICs have power and area concerns: architecture solutions should be efficient
Above all, must support 10 Gb/s links:
What are the computation and memory requirements?
What architecture efficiently meets them?
What firmware organization should be used?
3
Mechanisms for an Efficient Programmable 10 Gb/s NIC
A partitioned memory system: low-latency access to control structures; high-bandwidth, high-capacity access to frame data
A distributed task-queue firmware: uses frame-level parallelism to scale across many simple, low-frequency processors
New RMW instructions: reduce firmware frame-ordering overheads by 50% and reduce the clock frequency requirement by 17%
4
Outline
Motivation
How Programmable NICs work
Architecture Requirements, Design
Frame-parallel Firmware
Evaluation
5
How Programmable NICs Work
[Diagram: the host bus connects to the NIC's PCI interface; on-board processor(s) and memory sit between the PCI interface and the Ethernet interface, which drives the Ethernet link]
6
Per-frame Requirements
            Instructions   Data Accesses
TX Frame        281            101
RX Frame        253             85
Processing and control data requirements per frame, as determined by dynamic traces of relevant NIC functions
7
Aggregate Requirements: 10 Gb/s, Maximum-Sized Frames

            Instruction   Control Data   Frame Data
            Throughput    Bandwidth      Bandwidth
TX Frame    229 MIPS      2.6 Gb/s       19.75 Gb/s
RX Frame    206 MIPS      2.2 Gb/s       19.75 Gb/s
Total       435 MIPS      4.8 Gb/s       39.5 Gb/s

1514-byte frames at 10 Gb/s = 812,744 frames/s
8
Meeting 10 Gb/s Requirements with Hardware
Processor architecture: at least 435 MIPS within an embedded device. Does NIC firmware have ILP?
Memory architecture: low-latency control data; high-bandwidth, high-capacity frame data … both, how?
9
ILP Processors for NIC Firmware?
ILP is limited by data and control dependences; analysis of the dynamic trace reveals these dependences.

              Perfect BP    1BP    No BP
In-order 1       0.87       0.87    0.87
In-order 2       1.19       1.19    1.13
In-order 4       1.34       1.33    1.17
Out-order 1      1.00       1.00    0.88
Out-order 2      1.96       1.74    1.21
Out-order 4      2.65       2.00    1.29
10
Processors: 1-Wide, In-order
2x performance is costly: branch prediction, a reorder buffer, renaming logic, and wakeup logic. These overheads translate to greater than 2x core power and area costs. Great for a general-purpose processor; not for an embedded device.

Other opportunities for parallelism? Yes! Many steps to process a frame: run them simultaneously. Many frames need processing: process them simultaneously.

Use parallel single-issue cores.
              1BP    No BP
In-order 1    0.87    0.87
Out-order 2   1.74    1.21
11
Memory Architecture
Competing demands:
Frame data: high bandwidth and high capacity for many offload mechanisms
Control data: low latency; coherence among processors, the PCI interface, and the Ethernet interface

The traditional solution: caches. Advantages: low latency, transparent to the programmer. Disadvantages: hardware costs (tag arrays, coherence). In many applications, the advantages outweigh the costs.
12
Are Caches Effective?
SMPCache trace analysis of a 6-processor NIC architecture
[Figure: 6-processor hit ratio (percent, axis 0-60) versus cache size, from 16 B to 32 KB]
13
Choosing a Better Organization
[Figure: a cache hierarchy compared with a partitioned organization]
14
Putting it All Together
[Block diagram: P CPUs (CPU 0 through CPU P-1), each with its own I-cache fed from a shared instruction memory, connect through a (P+4) x S 32-bit crossbar to S scratchpads (scratchpad 0 through scratchpad S-1), the PCI interface (to the PCI bus), the Ethernet interface, and an external memory interface to off-chip DRAM]
15
Parallel Firmware
NIC processing steps are already well defined
Previous Gigabit NIC firmware divides the steps between 2 processors
… but does this mechanism scale?
16
Task Assignment with an Event Register
[Diagram: an event register with one bit per event type (a PCI read bit, a SW event bit, and other bits). Hardware such as the PCI interface sets a bit when it finishes work; processors inspect the register, process the associated transactions, enqueue TX data when needed, and pass data to the Ethernet interface]
17
Task-level Parallel Firmware
[Timeline: in the task-level firmware, the PCI read hardware status bit serializes the handler. Proc 0 transfers DMAs 0-4 and then processes them while Proc 1 idles; Proc 1 then transfers and processes DMAs 5-9 while Proc 0 idles]
18
Frame-level Parallel Firmware
[Timeline: in the frame-level firmware, Proc 0 transfers DMAs 0-4, processes them, and builds the resulting event while Proc 1 concurrently transfers DMAs 5-9, processes them, and builds its own event; idle time shrinks because the handlers overlap]
19
Evaluation Methodology
Spinach: a library of cycle-accurate LSE simulator modules for network interfaces
Memory latency, bandwidth, and contention modeled precisely
Processors modeled in detail
NIC I/O (PCI and Ethernet interfaces) modeled in detail
Verified when modeling the Tigon 2 Gigabit NIC (LCTES 2004)

Idea: model everything inside the NIC; gather performance and trace data
20
Scaling in Two Dimensions
[Figure: full-duplex throughput (Gb/s, axis 0-20) versus core frequency (100-300 MHz) for 1, 2, 4, 6, and 8 processors, with the Ethernet limit marked]
21
Processor Performance
Processor Behavior

IPC Component                Value
Execution                    0.72
Miss Stalls                  0.01
Load Stalls                  0.12
Scratchpad Conflict Stalls   0.05
Pipeline Stalls              0.10
Total                        1.00

Achieves 83% of theoretical peak IPC
Small I-caches work
Sensitive to memory stalls: half of loads are part of a load-to-use sequence
Conflict stalls could be reduced with more ports or more banks
22
Reducing Frame Ordering Overheads
Firmware frame ordering is costly: 30% of execution time
Synchronization and bitwise check/update operations occupy processors and memory
Solution: atomic bitwise operations that also update a pointer according to the last set location
23
Maintaining Frame Ordering
[Diagram: a frame status array with one status bit per frame index. CPUs A and B prepare frames and set the corresponding bits; CPU C detects completed frames by acquiring a lock, iterating over the array, notifying the hardware of contiguously completed frames, and releasing the lock]
24
RMW Instructions Reduce Clock Frequency
Performance: six cores at 166 MHz with RMW instructions match six cores at 200 MHz without them; performance is equivalent at all frame sizes, a 17% reduction in the frequency requirement
The dynamically tasked firmware balances the benefit: send cycles are reduced by 28.4%, and receive cycles are reduced by 4.7%
25
Conclusions: A Programmable 10 Gb/s NIC

This NIC architecture relies on:
Data memory system: a partitioned organization, not coherent caches
Processor architecture: parallel scalar processors
Firmware: a frame-level parallel organization
RMW instructions: reduce ordering overheads

A programmable NIC: a substrate for offload services
26
Comparing Frame Ordering Methods
[Figure: full-duplex throughput (Gb/s, axis 0-20) versus UDP datagram size (0-1400 bytes), comparing 6x200 MHz software-only ordering and 6x166 MHz RMW-enhanced ordering against the duplex Ethernet limit]