An FPGA-based Network Interface Card with GPUDirect enabling real-time GPU computing in HEP experiments
Alessandro Lonardo (INFN Roma 1)
On behalf of the NaNet collaboration
Perspectives of GPU Computing in Physics and Astrophysics
Roma, September 15-17 2014
■ Real-time GPGPU in High Energy Physics
  - CERN NA62 Experiment Physics Case
■ NaNet FPGA-based NIC design
■ NA62 RICH detector application
■ KM3NeT-IT on-shore read-out system application
■ Conclusions
■ Hard real-time: a system that must respond strictly within a given time budget to avoid failures (or data loss).
  - Real-time control systems (e.g. Tokamak plasma control)
  - First-level (or low-level) triggers
■ System latency:
  - Is the GPU processing latency small and stable enough for the given task?
  - Is the latency of network communications to and from GPU memory small and stable enough?
■ System throughput:
  - Does the GPU have enough computing power to execute the assigned task within the given time budget?
  - Does the GPU's network interface have enough bandwidth?
■ Goal: a precision measurement of the BR of an ultra-rare decay process.
  - High-precision theoretical prediction: sensitive to new physics.
■ Many (uninteresting) events: 10⁷ decays/s.
■ Ultra-rare: 1 target event per 10¹⁰ particle decays.
  - Expect to collect ~100 events in 2-3 years of operation.
  - Need to select the interesting events, filtering the data stream coming from the detectors efficiently and quickly.
■ K⁺ → π⁺ ν ν̄ decay (BR ≈ 8×10⁻¹¹)
■ Huge background from other kaon decays
■ 750 MHz beam rate (~6% kaons, 75 GeV/c)
■ 10 MHz rate from decays
[Diagram: NA62 trigger and data acquisition (scheme dated April 15th, 2012). The detectors (RICH, MUV, CEDAR, LKR, STRAWS, LAV) send trigger primitives at 10 MHz to the L0TP; the L0 trigger reduces the rate to 1 MHz; event data flow through a GigaEth switch to the L1/L2 PC farm, where the L1 trigger reduces 1 MHz to 100 kHz; selected data go to CDR at O(kHz).]
■ L0: hardware synchronous level
  - 10 MHz to 1 MHz, 1 ms max. latency
  - Primitives (MUV, RICH, LAV, LKR)
■ L1: software level
  - "Single detector", 1 MHz to 100 kHz
■ L2: software level
  - "Complete information", 100 kHz to 10 kHz
[Figure: RICH detector. The beam enters along the beam pipe; Cherenkov light is focused by a mirror mosaic onto 2 spots, each instrumented with ~1000 photodetectors of 18 mm diameter.]
■ Replace custom electronics with a GPU-based system:
  - More selective trigger algorithms.
  - Programmable.
  - Upgradable.
■ Rough detection of particle speed (ring radius) and direction (ring centre):
  - Efficient matching of circular hit patterns on GPUs.
■ Network protocol: UDP on 4 GbE channels.
■ System throughput: 10 M events/s
■ Network bandwidth: < 500 MB/s
■ System response latency: < 1 ms
  - determined by the size of the Readout Board memory buffer that stores event data to be passed to the higher trigger levels.
■ lat_L0TP-GPU = lat_proc + lat_comm
  - Estimate the two components and their fluctuations (a measurement sketch follows below).
■ Perform the estimation using a single GbE channel.
  - Extrapolate the data for the design of the "full bandwidth" system.
[Diagram: test setup. Tel62 readout boards drive GbE channels 0-3 into the host: GbE NIC → chipset → CPU and system memory → PCIe → GPU and GPU memory.]
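To make the decomposition concrete, here is a minimal CUDA sketch of how the two components can be timed. It is an illustration only: the stub receive_into_gpu() stands in for the real Readout-Board-to-GPU receive path and the empty kernel for the actual ring fit. lat_comm is taken with a host clock around the receive, lat_proc with CUDA events around the kernel.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Stub standing in for the real receive path (NIC -> GPU memory);
// here it just clears the buffer so the example is self-contained.
static void receive_into_gpu(char *gpu_buf, size_t bytes) {
    cudaMemset(gpu_buf, 0, bytes);
    cudaDeviceSynchronize();
}

// Stub ring-fit kernel; the real one runs the MATH least-squares fit.
__global__ void ring_fit_kernel(const char *events, int n_events) { }

int main() {
    const size_t buf_bytes = 1400;  // one UDP payload: 40 events
    char *d_buf;
    cudaMalloc(&d_buf, buf_bytes);

    // lat_comm: wall-clock time from start of receive to data in GPU memory.
    auto t0 = std::chrono::steady_clock::now();
    receive_into_gpu(d_buf, buf_bytes);
    auto t1 = std::chrono::steady_clock::now();
    double lat_comm_us = std::chrono::duration<double, std::micro>(t1 - t0).count();

    // lat_proc: kernel time with input and output resident in device memory.
    cudaEvent_t k0, k1;
    cudaEventCreate(&k0);
    cudaEventCreate(&k1);
    cudaEventRecord(k0);
    ring_fit_kernel<<<1, 64>>>(d_buf, 40);
    cudaEventRecord(k1);
    cudaEventSynchronize(k1);
    float lat_proc_ms;
    cudaEventElapsedTime(&lat_proc_ms, k0, k1);

    printf("lat_comm = %.1f us, lat_proc = %.3f ms\n", lat_comm_us, lat_proc_ms);
    cudaFree(d_buf);
    return 0;
}
```

Repeating the loop many times gives the latency distributions, i.e. the fluctuations of the two components.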
■ lat_proc: time needed to perform the ring pattern matching on the GPU, with input and output data in device memory, using the MATH least-squares algorithm (sketched below).
■ lat_comm: time needed to move event data from the GbE NIC to GPU memory.
  - Data for 40 events (1400 bytes, i.e. 35 bytes per event) sent from the Readout Board to the GbE NIC are stored in a receive buffer in GPU memory.
  - T = 0: start of the send operation on the Readout Board.
■ Result: lat_comm ≈ 104 µs ≅ 3.5 × lat_proc, so the total latency is dominated by the communication component.
[Diagram: measured data path. GbE NIC → chipset → CPU and system memory → PCIe → GPU memory; lat_comm covers the path up to GPU memory, lat_proc the processing on the GPU.]
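The MATH fitter itself is described in the NA62 GPU trigger literature; as a sketch of the kind of per-event work lat_proc covers (an illustration, not the actual NA62 code), the CUDA kernel below performs an algebraic least-squares single-ring fit, one thread per event, by solving the 3×3 normal equations of x² + y² + Dx + Ey + F = 0 with Cramer's rule. The centre is (-D/2, -E/2) and the radius is sqrt(D²/4 + E²/4 - F).

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

struct Hit { float x, y; };

// One thread fits one event's ring: accumulate the 3x3 normal equations
// of x^2 + y^2 + D*x + E*y + F = 0 over the hits, solve by Cramer's rule.
__global__ void fit_rings(const Hit *hits, const int *first, const int *nhits,
                          float *cx, float *cy, float *r, int n_events)
{
    int ev = blockIdx.x * blockDim.x + threadIdx.x;
    if (ev >= n_events) return;

    double sxx = 0, sxy = 0, syy = 0, sx = 0, sy = 0;
    double sxz = 0, syz = 0, sz = 0;
    int n = nhits[ev];
    for (int i = 0; i < n; ++i) {
        double x = hits[first[ev] + i].x, y = hits[first[ev] + i].y;
        double z = x * x + y * y;
        sxx += x * x;  sxy += x * y;  syy += y * y;
        sx  += x;      sy  += y;
        sxz += x * z;  syz += y * z;  sz += z;
    }
    // M * [D E F]^T = -[sxz syz sz]^T,
    // M = [[sxx sxy sx], [sxy syy sy], [sx sy n]]
    double det = sxx * (syy * n - sy * sy) - sxy * (sxy * n - sy * sx)
               + sx * (sxy * sy - syy * sx);
    double D = (-sxz * (syy * n - sy * sy) - sxy * (sy * sz - syz * n)
               + sx * (syy * sz - syz * sy)) / det;
    double E = (sxx * (sy * sz - syz * n) + sxz * (sxy * n - sy * sx)
               + sx * (syz * sx - sxy * sz)) / det;
    double F = (sxx * (syz * sy - syy * sz) - sxy * (syz * sx - sxy * sz)
               - sxz * (sxy * sy - syy * sx)) / det;

    cx[ev] = (float)(-D / 2.0);
    cy[ev] = (float)(-E / 2.0);
    r[ev]  = (float)sqrt(D * D / 4.0 + E * E / 4.0 - F);
}

int main() {
    const int N = 32;
    Hit h[N];
    for (int i = 0; i < N; ++i) {   // synthetic ring: centre (1,-2), radius 10
        float t = 2.0f * 3.14159265f * i / N;
        h[i].x =  1.0f + 10.0f * cosf(t);
        h[i].y = -2.0f + 10.0f * sinf(t);
    }
    int first0 = 0, n0 = N;
    Hit *d_hits; int *d_first, *d_n; float *d_cx, *d_cy, *d_r;
    cudaMalloc(&d_hits, sizeof h);
    cudaMalloc(&d_first, sizeof(int));  cudaMalloc(&d_n, sizeof(int));
    cudaMalloc(&d_cx, sizeof(float));   cudaMalloc(&d_cy, sizeof(float));
    cudaMalloc(&d_r, sizeof(float));
    cudaMemcpy(d_hits, h, sizeof h, cudaMemcpyHostToDevice);
    cudaMemcpy(d_first, &first0, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_n, &n0, sizeof(int), cudaMemcpyHostToDevice);

    fit_rings<<<1, 32>>>(d_hits, d_first, d_n, d_cx, d_cy, d_r, 1);

    float cx, cy, r;
    cudaMemcpy(&cx, d_cx, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&cy, d_cy, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&r, d_r, sizeof(float), cudaMemcpyDeviceToHost);
    printf("centre (%.2f, %.2f)  radius %.2f\n", cx, cy, r);  // expect (1,-2), 10
    return 0;
}
```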
■ Latency of the standard sockets path measured with sockperf (min ≈ 64 µs, average ≈ 99 µs, 99th percentile ≈ 150 µs):

sockperf: Summary: Latency is 99.129 usec
sockperf: Total 100816 observations; each percentile contains 1008.16 observations
sockperf: ---> <MAX> observation = 657.743
sockperf: ---> percentile 99.99 = 474.758
sockperf: ---> percentile 99.90 = 201.321
sockperf: ---> percentile 99.50 = 163.819
sockperf: ---> percentile 99.00 = 149.694
sockperf: ---> percentile 95.00 = 116.730
sockperf: ---> percentile 90.00 = 105.027
sockperf: ---> percentile 75.00 = 97.578
sockperf: ---> percentile 50.00 = 96.023
sockperf: ---> percentile 25.00 = 95.775
sockperf: ---> <MIN> observation = 64.141
■ Fluctuations in the latency of the data communication task may break the real-time constraints of the system.
■ Challenge: lower the latency of the data communication task (and its fluctuations).
■ How?
  1. Inject data directly from the NIC into GPU memory, without intermediate buffering.
  2. Offload network protocol stack management from the CPU, eliminating possible OS jitter effects.
■ NaNet design:
  - Re-use part of the APEnet+ design, implementing a PCIe interface with GPUDirect P2P/RDMA capabilities.
  - Support multiple link technologies.
  - Provide a network protocol stack management unit.
  - Use FPGA resources to perform processing on the data stream (e.g. reformat data on the fly in a GPU-friendly fashion).
■ 2D/3D toroidal mesh topology for point-to-point, deadlock-free communications
■ Up to 6 fully bidirectional channels at 34 Gbps each over QSFP+
■ PCI Express x8 Gen2
■ RDMA support for CPU offload
■ GPUDirect P2P (and RDMA) support
■ GPUDirect allows direct data exchange over the PCIe bus with no CPU involvement
■ No bounce buffers in host memory
  - Latency reduction for small messages (a usage sketch follows below)
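A minimal host-side sketch of the resulting usage model. nanet_register_buffer() is a hypothetical name standing in for the real NaNet driver call (the actual API is not shown in the slides) and is stubbed here so the example compiles: the application allocates the receive buffer directly in GPU memory and hands it to the NIC, which then DMAs incoming payloads straight into it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical driver call, standing in for the real NaNet API: it would
// pin the GPU pages via GPUDirect and program the NIC to DMA incoming
// payloads straight into them. Stubbed so the sketch compiles stand-alone.
extern "C" int nanet_register_buffer(void *gpu_ptr, size_t bytes) {
    (void)gpu_ptr; (void)bytes;
    return 0;
}

int main() {
    const size_t buf_bytes = 64 * 1024;
    char *d_recv;
    cudaMalloc(&d_recv, buf_bytes);          // receive buffer in GPU memory

    if (nanet_register_buffer(d_recv, buf_bytes) != 0) {
        fprintf(stderr, "registration failed\n");
        return 1;
    }
    // From here on, received event data land in d_recv with no bounce
    // buffer in host memory and no CPU copy on the data path; the host
    // only launches kernels that consume d_recv.
    cudaFree(d_recv);
    return 0;
}
```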
■ GPUDirect RDMA latency: ~6.5 µs
■ GPUDirect P2P latency: ~8.2 µs
■ Staging (bounce buffer in host memory) latency: ~17.0 µs
■ Physical Link Coding
  - Standard: 1 GbE (1000BASE-T), 10 GbE (10GBASE-R).
  - Custom: APElink (20 Gb/s over QSFP), KM3link deterministic latency (2.5 Gb/s optical).
■ Protocol Manager
  - Data/Network/Transport layer offloader module.
  - UDP, Time Division Multiplexing (a UDP sketch follows after this list).
■ Data Processing
  - Application-dependent processing on the data stream.
  - NA62 Decompressor: re-formats event data (data size and alignment).
■ APEnet Protocol Encoder
  - Protocol adaptation between the on-board and off-board protocols.
  - In the RX path, generates CPU/GPU destination memory addresses for incoming packets.
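To make concrete what the UDP offloader removes from the host CPU, here is a behavioural model in C++ (an illustration, not the FPGA implementation, and simplified: no checksums or IP fragmentation) of the per-frame work: validate the Ethernet/IPv4/UDP headers and expose the payload to the next pipeline stage.

```cpp
#include <cstdint>
#include <cstddef>

struct Payload { const uint8_t *data; size_t len; };

// Behavioural model of the UDP offload stage: for each received Ethernet
// frame, skip the Ethernet + IPv4 + UDP headers and expose the payload.
static bool udp_extract(const uint8_t *frame, size_t frame_len, Payload *out) {
    const size_t eth_len = 14;                     // dst MAC, src MAC, EtherType
    if (frame_len < eth_len + 20 + 8) return false;
    if (!(frame[12] == 0x08 && frame[13] == 0x00)) return false;  // not IPv4

    const uint8_t *ip = frame + eth_len;
    size_t ihl = (ip[0] & 0x0F) * 4;               // IPv4 header length
    if (ihl < 20) return false;
    if (ip[9] != 17) return false;                 // protocol != UDP

    const uint8_t *udp = ip + ihl;
    uint16_t udp_len = (uint16_t)((udp[4] << 8) | udp[5]);  // header + payload
    if (udp_len < 8 || eth_len + ihl + udp_len > frame_len) return false;

    out->data = udp + 8;
    out->len = udp_len - 8;
    return true;
}

int main() {
    uint8_t frame[64] = {};
    frame[12] = 0x08; frame[13] = 0x00;   // EtherType: IPv4
    frame[14] = 0x45;                     // IPv4, IHL = 5 (20 bytes)
    frame[23] = 17;                       // protocol: UDP
    frame[38] = 0; frame[39] = 16;        // UDP length: 8 header + 8 payload
    Payload p;
    return udp_extract(frame, sizeof frame, &p) && p.len == 8 ? 0 : 1;
}
```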
[Diagram: NaNet architecture. Per-link I/O Interface stacks feed a Router connected to the Network Interface and the PCIe x8 Gen2 core. Each stack comprises Physical Link Coding (NaNet-1: GbE TSE MAC; NaNet-10: 10GBASE-R; APElink: APElink I/O IF; KM3link: deterministic latency), a Protocol Manager (UDP for the Ethernet links, TDM for KM3link, Word Stuff for APElink), Data Processing (NA62 Decompressor on the Ethernet links, N/A elsewhere) and an APEnet Protocol Encoder (NaNet Ctrl / Link Ctrl). The Network Interface contains the TX/RX Block, the GPU I/O accelerator, a 32-bit microcontroller, the memory controller and on-board memory.]
■ 5-port full crossbar switch:
  - Router: dynamically interconnects the ports.
  - Arbitration: resolves contention on port requests (static or round-robin; a behavioural sketch follows below).
  - Supports 10 simultaneous 2.8 GB/s data streams.
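A behavioural C++ model of the round-robin policy (an illustration, not the FPGA logic), assuming one request bit per port: the search starts just after the previous winner, so no requesting port can be starved.

```cpp
#include <cstdio>
#include <cstdint>

// Behavioural model of a 5-port round-robin arbiter (not the FPGA RTL).
struct RoundRobinArbiter {
    static const int kPorts = 5;
    int last_grant = kPorts - 1;

    // Returns the granted port, or -1 if nobody is requesting.
    int arbitrate(uint8_t requests) {
        for (int i = 1; i <= kPorts; ++i) {
            int port = (last_grant + i) % kPorts;
            if (requests & (1u << port)) {
                last_grant = port;
                return port;
            }
        }
        return -1;
    }
};

int main() {
    RoundRobinArbiter arb;
    // Ports 1 and 3 request simultaneously: grants alternate 1, 3, 1, ...
    printf("%d %d %d\n", arb.arbitrate(0x0A), arb.arbitrate(0x0A),
           arb.arbitrate(0x0A));
    return 0;
}
```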
■ In TX, gathers data from the PCIe interface and forwards them to the destination port.
■ In RX, performs the RDMA receive operation, managing the CPU/GPU virtual-to-physical address translation (modelled in the sketch below):
  - Translation Lookaside Buffer (associative cache).
  - Microcontroller handles the miss case.
■ GPU I/O accelerator (P2P protocol).
■ TX/RX Block: DMA engines, PCIe transactions.
■ Altera NIOS II microcontroller.
■ PLDA PCIe x8 Gen2 core.
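A behavioural C++ model of the RX-path translation (an illustration, not the RTL), with an assumed 4 KiB page size and a toy miss handler standing in for the microcontroller's page-table walk:

```cpp
#include <cstdio>
#include <cstdint>

// Behavioural model of RX-path address translation: a small fully
// associative TLB maps virtual page numbers to physical ones; on a miss
// the request goes to the (here: toy) microcontroller routine, which
// refills an entry.
struct TlbEntry { uint64_t vpn; uint64_t ppn; bool valid; };

struct Tlb {
    static const int kEntries = 32;
    static const unsigned kPageShift = 12;            // 4 KiB pages (assumption)
    TlbEntry e[kEntries] = {};
    int victim = 0;
    uint64_t (*miss_handler)(uint64_t vpn) = nullptr; // microcontroller path

    uint64_t translate(uint64_t vaddr) {
        uint64_t vpn = vaddr >> kPageShift;
        uint64_t off = vaddr & ((1u << kPageShift) - 1);
        for (int i = 0; i < kEntries; ++i)            // associative search
            if (e[i].valid && e[i].vpn == vpn)
                return (e[i].ppn << kPageShift) | off;
        uint64_t ppn = miss_handler(vpn);             // slow path on a miss
        e[victim] = { vpn, ppn, true };               // refill (FIFO victim)
        victim = (victim + 1) % kEntries;
        return (ppn << kPageShift) | off;
    }
};

static uint64_t walk_page_tables(uint64_t vpn) { return vpn + 1000; }  // toy walker

int main() {
    Tlb tlb;
    tlb.miss_handler = walk_page_tables;
    uint64_t p1 = tlb.translate(0x12345);   // miss: microcontroller path
    uint64_t p2 = tlb.translate(0x12345);   // hit: fast associative path
    printf("%llx %llx\n", (unsigned long long)p1, (unsigned long long)p2);
    return 0;
}
```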
■ Implemented on the Altera Stratix IV dev board (EP4SGX230KF40C2)
■ GbE PHY: Marvell 88E1111
■ Supports 3 additional APElink channels (20 Gb/s each) via an HSMC daughtercard
[Diagram: NaNet-1 block diagram. A 1 GbE port feeds the UDP offload and NaNet CTRL blocks, which connect to the Network Interface (TX/RX Block, GPU I/O accelerator, 32-bit microcontroller, memory controller, on-board memory) and the PCIe x8 Gen2 core.]
■ TTC daughtercard with HSMC connector (Ferrara).
■ NaNet receives the TTC stream with timing (40 MHz clock, SOB, EOB) and trigger signals from the experiment.
■ Allows synchronous operation (accurate latency measurements).
■ First integration and test activities took place during the August technical run.
■ Parasitic operation of the GPU-based L0 trigger using NaNet-1 is scheduled for October (with a reduced event rate).
[Plot: communication latency [µs] (0-1000) vs. buffer size [bytes] (2K-64K) for NaNet-1 with M2070 and K20Xm GPUs, compared with sockperf on kernel 2.6.33 and on the real-time kernel 2.6.33.9-rt31-EL6RT (M2070).]
RT kernel: fewer fluctuations, but also increased latency…
[Plot: NaNet-1 GbE performance. Bandwidth [MB/s] (25-150) and throughput [MEvents/s] (0.25-2) vs. receive buffer size [number of events, 16-4096; bytes, 2K-256K], for M2070 and K20Xm GPUs.]
[Plot: NaNet-1 APElink communication + kernel latencies (M2070). Latency [µs] (0-2500) vs. receive buffer size [number of events, 16-16K; bytes, 2K-1M]; series: NaNet-1 APElink communication latency, kernel processing latency, total latency.]
[Plot: NaNet-1 APElink performance (M2070). Bandwidth [MB/s] (0-1600) and throughput [MEvents/s] (0-22) vs. receive buffer size [number of events, 16-16K; bytes, 2K-1M].]
■ Overlap between GPU computing and communication → 10M events/s
■ Note: we performed the latency measurements in worst-case conditions, with the communication and processing tasks serialized; in normal operation, data communication and GPU processing are overlapped.
■ Circular list of receive buffers: when one buffer is full, the GPU kernel is launched, and data arriving from the network are accumulated in the next buffer of the circular list concurrently with the processing (see the sketch below).
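A CUDA sketch of that overlap. nanet_wait_buffer_full() is a hypothetical stand-in for the real completion notification (the actual NaNet API is not shown in the slides) and the kernel is a stub; the point is that the launch is asynchronous, so the NIC keeps filling the next buffer while the GPU works on the current one.

```cuda
#include <cuda_runtime.h>

#define N_BUFS 4
#define BUF_BYTES (64 * 1024)

// Stub ring-processing kernel; the real one runs the MATH fit.
__global__ void process_events(const char *buf, int nbytes) { }

// Hypothetical stand-in: in the real system this blocks until the NIC
// has completely filled circular buffer 'i'.
static void nanet_wait_buffer_full(int i) { (void)i; }

int main() {
    char *d_bufs[N_BUFS];
    for (int i = 0; i < N_BUFS; ++i) cudaMalloc(&d_bufs[i], BUF_BYTES);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int iter = 0; iter < 1000; ++iter) {   // endless loop in production
        int i = iter % N_BUFS;
        nanet_wait_buffer_full(i);              // buffer i handed to the GPU
        // Asynchronous launch: while this kernel works on buffer i, the
        // NIC keeps DMA-ing fresh events into buffer (i+1) % N_BUFS.
        process_events<<<64, 256, 0, stream>>>(d_bufs[i], BUF_BYTES);
    }
    cudaStreamSynchronize(stream);
    for (int i = 0; i < N_BUFS; ++i) cudaFree(d_bufs[i]);
    return 0;
}
```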
■ Summary:
  - A GPU-based RICH detector Level 0 Trigger Processor using NaNet-1 (one GbE link) and the simple MATH processing sustains a throughput of ~1.6M events/s with real-time behaviour.
  - Using a single APElink channel instead of the GbE one, the system sustains ~10M events/s: good scaling with link bandwidth.
■ Implemented on the Altera Stratix V dev board; porting to a cheaper board (Terasic TR5-F40W) is possible.
  - PCIe Gen2 x8; a PCIe Gen3 (8 GB/s) interface is under development
  - Faster embedded Altera transceivers (up to 14.1 Gbps)
  - Hardened 10GBASE-R PCS features to support 10 Gbps Ethernet (10GbE)
[Diagram: NaNet-10 block diagram. 4 × 10 GbE channels (QSFP+, broken out with a QSFP+ to 4× SFP+ cable) feed UDP offload and NaNet Ctrl blocks, connected to the Network Interface (TX/RX Block, GPU I/O accelerator, 32-bit microcontroller, memory controller, on-board memory) and the PCIe x8 Gen3 core.]
[Diagram: planned integration. The RICH readout system feeds the FPGA-based L0TP over 1 GbE and, in parallel, the GPU-based L0TP through a 1/10 GbE switch that aggregates 4 × 1 GbE into a single SFP+ 10 GbE link into NaNet-10 (PCIe 2.0 host).]
■ NaNet-10 is required to reach the L0-TP target performance (10M events/s).
■ Ready for the March 2015 run:
  - Will run in parallel, in "parasitic" mode.
  - Aggregation of the 4 GbE channels' payload into a single 10GbE link through an HP 2920 switch.
■ KM3NeT-IT: a European deep-sea research infrastructure hosting a neutrino telescope with a volume of a cubic kilometre at the bottom of the Mediterranean Sea.
[Figure: KM3NeT-IT read-out layout. Off-shore: optical modules (PMT + FEM) along the tower between anchor and buoy, with a Floor Control Module (FCM) per floor; an electro-optical cable carries the data to shore. On-shore: DWDM mux, laser, and twin Ethernet FCM (eFCM) boards with Ethernet connections to the TriDAS.]
■ The current read-out system design employs a large number of non-state-of-the-art components:
  - a 2.5 Gb/s optical link per 800 Mb/s payload
  - 2 twinned FCM boards (one on-shore, one off-shore) per floor (14 floors per tower)
  - many servers hosting the read-out hardware
■ When scaling to km³: many cost/size/power/reliability issues.
NaNet3 specifications for an enhanced KM3NeT-IT read-out system:
■ 4 channels per PCIe board ✔
  - i.e. fewer servers, fewer read-out boards, …
■ Link speed up to 10 Gb/s ✔
  - a further reduction by a factor of 4 in the number of optical channels is possible
■ GPUDirect P2P/RDMA capability ✔
  - to support GPU-based trigger developments
■ Deterministic-latency link ✔
  - "fixed latency" clock distribution for the time-stamping of under-water events
Implemented on the Terasic DE5-NET board:
■ Altera Stratix V based dev board
■ PCIe Gen2 x8 (Gen3 under development)
■ 4 independent SFP+ ports (up to 10 Gb/s)
■ Deterministic-latency mode for the transceivers
[Diagram: deterministic-latency clock path between NaNet3 on-shore (Stratix V) and the off-shore FCM (Virtex 5). The on-shore TX clock is embedded in the data sent off-shore; the off-shore board recovers this clock from the incoming data and uses it to transmit back; the on-shore board in turn recovers the clock from the off-shore data.]
■ Testbed: FCM vs. Terasic DE5-Net
  - Custom hardware mode for the FCM (Xilinx) transceivers
  - Deterministic-latency mode for the Stratix V transceivers
  - 2 m of copper and 2 m of fiber
■ Test: 12 hours of periodic (roughly once per second) TX clock resets, to verify PLL locking and RX word alignment.
■ NaNet3 link (currently 1 out of 4):
  - Physical layer: Altera deterministic-latency transceivers (8B10B encoding scheme).
  - Data layer: Time Division Multiplexing (TDM) data transmission protocol (see the sketch below).
  - RX path: payloads of the different off-shore devices, multiplexed on a continuous data stream at fixed time slots (PCIe DMA transactions to CPU/GPU memory).
  - TX path: limited data rate per FCM, for slow-control device messages (PCIe TARGET transactions from CPU/GPU memory).
■ NaNet Ctrl:
  - Protocol translation: encapsulates the TDM data stream in the APElink packet protocol.
  - Destination virtual address generation.
■ PCIe x8 Gen2: CPU/GPU memory write bandwidth ~2.5 GB/s.
[Diagram: NaNet3 block diagram. Multiple NaNet3 links, each with deterministic-latency Physical Link Coding, a TDM Protocol Manager and a NaNet Ctrl block, feed the Router; the Network Interface (TX/RX Block, GPU I/O accelerator, 32-bit microcontroller, memory controller) connects to the PCIe x8 Gen2 core.]
■ The NaNet design has proved effective in two different experimental contexts, thanks to its modularity and its high-performance/low-latency features (RDMA, GPUDirect, network protocol offloading).
■ NA62 GPU-based low-level trigger:
  - Demonstrated real-time data communication between the NA62 RICH read-out system and the GPU-based L0 trigger processor over a single GbE link.
  - Demonstrated scalability of the design up to multiple 10GbE read-out channels.
  - Developing the 4-port 10GbE version (1Q2015).
■ KM3NeT-IT on-shore read-out system:
  - Demonstrated serial-link interoperability between high-end FPGAs of different vendors (with deterministic latency).
  - Compliant with the real-time constraints of TDM processing.
  - Ready for future enhancements (off-shore link aggregation, GPU-based trigger, …).
F. Ameli(a), R. Ammendola(b), A. Biagioni(a), A. Cotta Ramusino(c), M. Fiorini(c), O. Frezza(a), G. Lamanna(d), F. Lo Cicero(a), A. Lonardo(a), M. Martinelli(a), I. Neri(c), P. S. Paolucci(a), E. Pastorelli(a), L. Pontisso(f), D. Rossetti(e), F. Simeone(a), F. Simula(a), M. Sozzi(f), L. Tosoratto(a), P. Vicini(a)
(a) INFN Sezione di Roma (b) INFN Sezione di Roma Tor Vergata (c) Università di Ferrara e INFN Sezione di Ferrara (d) INFN LNF and CERN (e) nVIDIA Corporation, USA (f) INFN Sezione di Pisa and CERN