An FPGA-based Network Interface Card with GPUDirect enabling real-time GPU computing in HEP experiments

Transcript of "An FPGA-based Network Interface Card with GPUDirect enabling real-time GPU computing in HEP experiments"

Page 1:

An FPGA-based Network Interface Card with GPUDirect enabling real-time GPU computing in HEP experiments

Alessandro Lonardo (INFN Roma 1)

On behalf of the NaNet collaboration

Perspectives of GPU Computing in

Physics and Astrophysics

Roma, September 15-17 2014


Page 2:

■  Real-time GPGPU in High Energy Physics
  –  CERN NA62 experiment physics case
■  NaNet FPGA-based NIC design
■  NA62 RICH detector application
■  KM3NeT-IT on-shore read-out system application
■  Conclusions


Page 3:

■  Hard real-time: a system that must respond strictly within a given time budget, to avoid failures (or data loss).
  –  Real-time control systems (e.g. Tokamak plasma control)
  –  First-level (or low-level) triggers
■  System latency:
  –  Is the GPU processing latency small and stable enough for the given task?
  –  Is the latency of network communications to and from GPU memory small and stable enough?
■  System throughput:
  –  Does the GPU have enough computing power to execute the assigned task within the given time budget?
  –  Does the GPU's network interface have enough bandwidth?


Page 4:

■  Goal: precision measurement of the branching ratio (BR) of an ultra-rare decay process.
  –  High-precision theoretical prediction: sensitive to new physics.
■  Many (uninteresting) events: 10⁷ decays/s.
■  Ultra-rare: 1 target event per 10¹⁰ particle decays.
  –  Expect to collect ~100 events in 2–3 years of operation.
  –  Need to select the interesting events, filtering the data stream coming from the detectors efficiently and quickly.


Page 5:

■  K⁺ → π⁺νν̄ decay (BR ~ 8×10⁻¹¹)
■  Huge background from other kaon decays:
  –  750 MHz beam rate (~6% kaons, 75 GeV/c)
  –  10 MHz rate from decays


Page 6:

[Diagram: NA62 trigger and data acquisition — the detectors (RICH, MUV, CEDAR, LKR, STRAWS, LAV) send trigger primitives at 10 MHz to the L0TP; the L0 trigger reduces the rate to 1 MHz towards the L1/L2 PCs through a GigaEth switch; the L1 trigger reduces it to 100 kHz; data flow to CDR at O(kHz).]

■  L0: hardware synchronous level
  –  10 MHz to 1 MHz, 1 ms max. latency
  –  Primitives (MUV, RICH, LAV, LKR)
■  L1: software level
  –  "single detector", 1 MHz to 100 kHz
■  L2: software level
  –  "complete information", 100 kHz to 10 kHz

Page 7:

[Diagram: NA62 RICH detector — the beam enters along the beam pipe; mirror mosaic at the far end; two spots with 1000 18 mm diameter photo-detectors each; trigger/DAQ overview repeated from the previous slide.]

Page 8:

Replace custom electronics with a GPU-based system:
  –  More selective trigger algorithms.
  –  Programmable.
  –  Upgradable.
  –  Rough detection of particle speed (ring radius) and direction (ring centre).
  –  Efficient matching of circular hit patterns on GPUs.


Page 9:

■  Network protocol: UDP on GbE channels (4).
■  System throughput: 10 M events/s.
■  Network bandwidth: < 500 MB/s.
■  System response latency: < 1 ms
  –  determined by the size of the Readout Board memory buffer storing event data to be passed to the higher trigger levels.

[Diagram: Readout → L0TP, latency < 1 ms, bandwidth < 500 MB/s.]
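As a quick consistency check (my arithmetic, using the 1400-byte / 40-event packets quoted on page 12, i.e. ~35 B per event):

    10^7 events/s × 35 B/event ≈ 350 MB/s  <  4 × 125 MB/s = 500 MB/s  (the aggregate of 4 GbE links)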


Page 10:

■  lat_L0TP→GPU = lat_proc + lat_comm
  –  Estimate the two components and their fluctuations.
■  Perform the estimation using a single GbE channel.
  –  Extrapolate the data to the design of the "full bandwidth" system.

[Diagram: Tel62 read-out boards feed GbE channels 0–3; data cross the NIC, the chipset, and PCIe to CPU system memory and GPU memory.]
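As a sketch of how the processing component can be isolated, CUDA events can bracket the kernel alone, with input and output already resident in GPU memory (the kernel here is a stand-in; the fit itself is sketched on the next slide):

    #include <cuda_runtime.h>

    __global__ void fitKernel(const float2 *hits, float3 *out, int nEvents) {
        // stand-in body; the ring fit is sketched on the next slide
        (void)hits; (void)out; (void)nEvents;
    }

    // Measure lat_proc alone: the events time only the kernel execution,
    // excluding any network or host<->device copy time.
    float measureLatProcMs(const float2 *dHits, float3 *dOut, int nEvents) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        fitKernel<<<(nEvents + 127) / 128, 128>>>(dHits, dOut, nEvents);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }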

Page 11:


■  lat_proc: time needed to perform the ring pattern-matching on the GPU, with input and output data in device memory, using the MATH least-squares algorithm.
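The slides do not spell out the MATH formulation; as an illustration, the closely related single-ring algebraic least-squares (Kåsa) circle fit maps naturally onto one GPU thread per event. The hit indexing via firstHit/nHits offsets is my assumption about the buffer layout:

    #include <cuda_runtime.h>
    #include <math.h>

    // One thread per event: algebraic least-squares circle fit (Kåsa method).
    // hits: photo-detector hit coordinates; out: fitted (xc, yc, R) per event.
    __global__ void ringFitKasa(const float2 *hits, const int *firstHit,
                                const int *nHits, int nEvents, float3 *out) {
        int ev = blockIdx.x * blockDim.x + threadIdx.x;
        if (ev >= nEvents) return;
        const float2 *h = hits + firstHit[ev];
        int n = nHits[ev];
        if (n < 3) { out[ev] = make_float3(0.f, 0.f, 0.f); return; }  // circle needs >= 3 hits
        float mx = 0.0f, my = 0.0f;                        // hit centroid
        for (int i = 0; i < n; ++i) { mx += h[i].x; my += h[i].y; }
        mx /= n; my /= n;
        float Suu = 0, Svv = 0, Suv = 0, Suuu = 0, Svvv = 0, Suvv = 0, Svuu = 0;
        for (int i = 0; i < n; ++i) {                      // centred moments
            float u = h[i].x - mx, v = h[i].y - my;
            Suu  += u * u;     Svv  += v * v;     Suv += u * v;
            Suuu += u * u * u; Svvv += v * v * v;
            Suvv += u * v * v; Svuu += v * u * u;
        }
        // 2x2 normal equations for the centre, in centred coordinates
        float r1 = 0.5f * (Suuu + Suvv), r2 = 0.5f * (Svvv + Svuu);
        float det = Suu * Svv - Suv * Suv;
        float uc = (r1 * Svv - r2 * Suv) / det;
        float vc = (r2 * Suu - r1 * Suv) / det;
        out[ev] = make_float3(uc + mx, vc + my,
                              sqrtf(uc * uc + vc * vc + (Suu + Svv) / n));
    }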

Page 12:

[Timeline: lat_comm spans from T = 0 to ~104 µs, followed by lat_proc; data path GbE → chipset / CPU system memory → PCIe → GPU memory.]

#  lat_comm: time needed to move event data from the GbE NIC to GPU memory.
#  40 events of data (1400 bytes) sent from the Readout Board to the GbE NIC are stored in a receiving GPU buffer.
#  T = 0: start of the send operation on the Readout Board.
#  lat_comm = 104 µs ≅ 3.5 × lat_proc → the total latency is dominated by the communication component.


Page 13:

[Diagram: GbE → chipset → PCIe → CPU system memory / GPU memory; measured host UDP latency ~99 µs, minimum 64 µs, 99th percentile ~150 µs.]

    sockperf: Summary: Latency is 99.129 usec
    sockperf: Total 100816 observations; each percentile contains 1008.16 observations
    sockperf: ---> <MAX> observation = 657.743
    sockperf: ---> percentile 99.99 = 474.758
    sockperf: ---> percentile 99.90 = 201.321
    sockperf: ---> percentile 99.50 = 163.819
    sockperf: ---> percentile 99.00 = 149.694
    sockperf: ---> percentile 95.00 = 116.730
    sockperf: ---> percentile 90.00 = 105.027
    sockperf: ---> percentile 75.00 = 97.578
    sockperf: ---> percentile 50.00 = 96.023
    sockperf: ---> percentile 25.00 = 95.775
    sockperf: ---> <MIN> observation = 64.141

Fluctuations in the latency of the data-communication task may break the real-time guarantees of the system.
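For reference, numbers like these come from a sockperf ping-pong run; an invocation of the following shape should reproduce the measurement (IP, port, and duration here are illustrative — the exact options used are not given in the slides):

    sockperf server -i 192.168.1.10 -p 11111                       # on the receiving host
    sockperf ping-pong -i 192.168.1.10 -p 11111 -m 1400 -t 100     # 1400-byte messages, 100 s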

Page 14:

■  Challenge:
  –  Lower the latency of the data-communication task (and its fluctuations).
■  How?
  1.  Inject data directly from the NIC into GPU memory, without intermediate buffering.
  2.  Offload network protocol stack management from the CPU, eliminating possible OS jitter effects.
■  NaNet design:
  –  Re-use part of the APEnet+ design, implementing a PCIe interface with GPUDirect P2P/RDMA capabilities (see the receive-flow sketch below).
  –  Support multiple link technologies.
  –  Provide a network protocol stack management unit.
  –  Use FPGA resources to perform processing on the data stream (e.g. re-format data on the fly in a GPU-friendly fashion).
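In user space, the intended receive flow could look like the following sketch. The nanet_* calls are hypothetical placeholders for NaNet's driver API, which is not shown in the slides; only the CUDA calls are real:

    #include <cuda_runtime.h>

    // Hypothetical NaNet driver interface (illustrative names only).
    extern "C" int nanet_register_gpu_buffer(void *devPtr, size_t bytes, int *bufId);
    extern "C" int nanet_wait_buffer(int bufId);  // blocks until the NIC has DMA-filled the buffer

    int main() {
        const size_t kBufBytes = 64 * 1024;
        char *dRecv = 0;
        cudaMalloc((void **)&dRecv, kBufBytes);               // receive buffer lives in GPU memory
        int bufId = -1;
        nanet_register_gpu_buffer(dRecv, kBufBytes, &bufId);  // driver pins it for P2P DMA
        nanet_wait_buffer(bufId);  // event data arrive directly in GPU memory:
                                   // no bounce buffer in host memory, no CPU copy
        // ... launch the processing kernel on dRecv here ...
        cudaFree(dRecv);
        return 0;
    }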


Page 15:

■  2D/3D toroidal mesh topology for point-to-point, deadlock-free communications
■  Up to 6 fully bidirectional channels, 34 Gbps each, over QSFP+
■  PCI Express x8 Gen2
■  RDMA support for CPU offload
■  GPUDirect P2P (and RDMA) support
■  GPUDirect allows direct data exchange over the PCIe bus with no CPU involvement
■  No bounce buffers in host memory
  →  latency reduction for small messages

Page 16:


■  GPUDirect RDMA latency: ~6.5 µs
■  GPUDirect P2P latency: ~8.2 µs
■  Staging latency: ~17.0 µs

Page 17:

Page 18:

Page 19:

Page 20:

#  Physical Link Coding
  –  Standard: 1GbE (1000BASE-T), 10GbE (10GBASE-R).
  –  Custom: APElink (20 Gb/s, QSFP), KM3link deterministic latency (2.5 Gb/s, optical).
#  Protocol Manager
  –  Data/network/transport-layer offloader module.
  –  UDP, Time Division Multiplexing.
#  Data Processing
  –  Application-dependent processing on the data stream.
  –  NA62 Decompressor: re-formats event data (data size and alignment); see the sketch below.
#  APEnet Protocol Encoder
  –  Protocol adaptation between the on-board and off-board protocols.
  –  In the RX path, generates CPU/GPU destination memory addresses for incoming packets.
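As an illustration of the Data Processing stage, a host-side model of such a re-format might look like this. The packed wire format and the 64-hit maximum are my assumptions — the slides only state that the NA62 Decompressor adjusts data size and alignment, and on the board this happens on the fly in FPGA logic:

    #include <cuda_runtime.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint16_t channel; uint16_t time; } PackedHit;  // assumed wire format

    // Fixed-stride, aligned per-event layout that GPU threads can index directly.
    typedef struct __align__(16) {
        uint32_t  nHits;
        PackedHit hits[64];          // fixed maximum number of hits per event
    } EventBlock;

    // Variable-length packed events in, constant-stride aligned blocks out.
    void decompress(const uint8_t *pkt, const uint16_t *evLen,
                    int nEvents, EventBlock *out) {
        size_t off = 0;
        for (int e = 0; e < nEvents; ++e) {
            uint32_t n = evLen[e] / sizeof(PackedHit);
            if (n > 64) n = 64;                          // clamp to the assumed maximum
            out[e].nHits = n;
            memcpy(out[e].hits, pkt + off, n * sizeof(PackedHit));
            off += evLen[e];
        }
    }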


[Block diagram: NaNet — PCIe x8 Gen2 core; network interface with TX/RX block, 32-bit microcontroller, GPU I/O accelerator, custom logic, memory controller, and on-board memory; router; I/O interface stacking Physical Link Coding / Protocol Manager / Data Processing / APEnet Protocol Encoder per channel: NaNet-1 (GbE: TSE MAC, UDP, NaNet Ctrl), KM3link (deterministic latency, TDM, NaNet Ctrl), NaNet-10 (10GBASE-R, UDP, NaNet Ctrl, Decompressor), APElink (APElink I/O IF, Word Stuff, Link Ctrl).]

Page 21:

#  5-port full crossbar switch
  –  Router: dynamically interconnects the ports.
  –  Arbitration: resolves contention on port requests (static or round-robin; a round-robin model is sketched below).
  –  Supports 10 simultaneous 2.8 GB/s data streams.
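A round-robin grant of the kind used in the arbitration stage can be modelled in a few lines (a software model only; the real arbiter is FPGA logic, and the 5-port width comes from this slide):

    // Grant the first requesting port after the last one served (round-robin).
    // reqMask: bit i set if port i requests the crossbar; returns granted port or -1.
    int roundRobinGrant(unsigned reqMask, int *lastGranted) {
        const int kPorts = 5;
        for (int i = 1; i <= kPorts; ++i) {
            int p = (*lastGranted + i) % kPorts;
            if (reqMask & (1u << p)) { *lastGranted = p; return p; }
        }
        return -1;  // no port is requesting
    }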



Page 22:

#  In TX, gathers data from the PCIe interface and forwards it to the destination port.
#  In RX, performs the RDMA receive operation, managing the CPU/GPU virtual-to-physical address translation (modelled below):
  –  Translation Lookaside Buffer (associative cache).
  –  Microcontroller handles TLB misses.
#  GPU I/O accelerator (P2P protocol).
#  TX/RX Block: DMA engines, PCIe transactions.
#  Altera NIOS II microcontroller.
#  PLDA PCIe x8 Gen2 core.
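The RX-path address translation can be pictured with a small model (a sketch; the 4 KiB page size and the replacement policy are assumptions — the slides only specify an associative TLB with a microcontroller fallback on miss):

    #include <stdint.h>

    typedef struct { uint64_t vpage, ppage; int valid; } TlbEntry;

    // Associative TLB lookup, with the microcontroller-serviced page walk
    // invoked on a miss, as on the NaNet RX path.
    uint64_t translate(TlbEntry *tlb, int nEntries, uint64_t vaddr,
                       uint64_t (*ucMissHandler)(uint64_t vpage)) {
        const uint64_t kPageBits = 12;                       // assumed 4 KiB pages
        uint64_t vpage  = vaddr >> kPageBits;
        uint64_t offset = vaddr & ((1ull << kPageBits) - 1);
        for (int i = 0; i < nEntries; ++i)                   // fully associative search
            if (tlb[i].valid && tlb[i].vpage == vpage)
                return (tlb[i].ppage << kPageBits) | offset;
        uint64_t ppage = ucMissHandler(vpage);               // slow path: microcontroller
        tlb[0].vpage = vpage; tlb[0].ppage = ppage; tlb[0].valid = 1;  // naive refill
        return (ppage << kPageBits) | offset;
    }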



Page 23:

■  Implemented on an Altera Stratix IV dev board (EP4SGX230KF40C2).
■  GbE PHY: Marvell 88E1111.
■  Supports 3 additional APElink channels (20 Gb/s each) via an HSMC daughtercard.

[Block diagram: NaNet-1 — PCIe x8 Gen2 core; network interface with TX/RX block, 32-bit microcontroller, UDP offload, NaNet CTRL, memory controller, GPU I/O accelerator, on-board memory; 1 GbE port.]


Page 24:

■  TTC daughtercard with HSMC connector (Ferrara).
■  NaNet receives the TTC stream with timing (40 MHz clock, SOB, EOB) and trigger signals from the experiment.
■  Allows synchronous operation (accurate latency measurements).
■  First integration and test activities took place during the August technical run.
■  Parasitic operation of the GPU-based L0 trigger using NaNet-1 is scheduled for October (with a reduced event rate).

Page 25:

[Plot: communication latency (µs) vs. buffer size (2K–64K bytes) for NaNet-1 with M2070 and K20Xm GPUs, and for sockperf on kernel 2.6.33 and on kernel 2.6.33.9-rt31-EL6RT (M2070).]

RT kernel: fewer fluctuations, but also increased latency…


Page 26:

Page 27:

Page 28:

[Plot: NaNet-1 GbE performance — bandwidth (MB/s, 25–150) and throughput (M events/s, 0.25–2) vs. receive buffer size (16–4096 events, 2K–256K bytes), for M2070 and K20Xm.]


Page 29:

[Plot: NaNet-1 APElink communication + kernel latencies (M2070) — kernel processing latency, NaNet-1 APElink communication latency, and total latency (µs, 0–2500) vs. receive buffer size (16–16K events, 2K–1M bytes).]


Page 30:

[Plot: NaNet-1 APElink performance — bandwidth (MB/s, 0–1600) and throughput (M events/s, 0–22) vs. receive buffer size (16–16K events, 2K–1M bytes), M2070.]

Overlap between GPU computing and communication → 10 M events/s

Page 31:

■  Note:
  –  We performed the latency measurements in worst-case conditions: communication and processing tasks were serialized, while in normal operation data communication and GPU processing are overlapped.
  –  Circular list of receive buffers: when one buffer is full, the GPU kernel is launched, and data arriving from the network are accumulated in the next buffer of the circular list, concurrently with processing (see the sketch below).
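A minimal sketch of that overlap, with two receive buffers and a CUDA stream (nanet_wait_buffer is the same hypothetical driver call used in the earlier sketch; the kernel body is a stand-in):

    #include <cuda_runtime.h>

    extern "C" int nanet_wait_buffer(int bufId);  // hypothetical: blocks until the NIC fills buffer bufId

    __global__ void processEvents(const char *buf) {
        (void)buf;  // stand-in for the ring-fit kernel sketched on page 11
    }

    // While the kernel digests buffer i, the NIC is already DMA-filling the
    // other buffer: communication and GPU processing overlap.
    void receiveLoop(char *dBuf[2], cudaStream_t stream, int nIterations) {
        for (int it = 0; it < nIterations; ++it) {
            int i = it & 1;
            nanet_wait_buffer(i);                            // buffer i now full
            processEvents<<<64, 128, 0, stream>>>(dBuf[i]);  // async launch: host returns at once
        }
        cudaStreamSynchronize(stream);                       // drain the last kernel
    }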

■  Summary:
  –  A GPU-based RICH detector Level 0 Trigger Processor using NaNet-1 (1 GbE link) and performing the simple MATH processing can sustain a throughput of ~1.6 M events/s with real-time behaviour.
  –  Using a single APElink channel instead of the GbE one, the system reaches a throughput of ~10 M events/s → good scaling with link bandwidth.


Page 32:

■  Implemented on the Altera Stratix V dev board, but porting to a cheaper board (Terasic TR5-F40W) is possible:
  –  PCIe Gen2 x8; PCIe Gen3 (8 GB/s) under development.
  –  Faster embedded Altera transceivers (up to 14.1 Gbps).
  –  Hardened 10GBASE-R PCS features to support 10 Gbps Ethernet (10GbE).

[Block diagram: NaNet-10 — PCIe x8 Gen3 core; network interface with TX/RX block, 32-bit microcontroller, UDP offload, NaNet Ctrl, memory controller, GPU I/O accelerator, on-board memory; 4 × 10GbE via QSFP+, with a QSFP+-to-4×SFP+ cable.]


Page 33:

[Diagram: RICH read-out system → 1/10 GbE switch; FPGA-based L0TP on 1 GbE; GPU-based L0TP on 10 GbE (SFP+), connected via PCIe 2.0 and equipped with the TTC card.]

■  NaNet-10 is required to achieve the L0-TP target performance (10 M events/s).
■  Ready for the March 2015 run:
  –  running in parallel, in "parasitic" mode;
  –  aggregation of the payload of 4 GbE channels into a single 10GbE link through an HP 2920 switch.

Page 34:

■  KM3NeT-IT: a European deep-sea research infrastructure hosting a neutrino telescope with a volume of a cubic kilometre at the bottom of the Mediterranean Sea.

[Diagram: KM3NeT-IT read-out — off-shore: optical modules (PMT + FEM) on floors between anchor and buoy, Floor Control Modules (FCM), electro-optical cable to shore; on-shore: DWDM mux, laser, twin Ethernet FCM (eFCM) boards, Ethernet connection to TriDAS.]

■  The current read-out system design employs a large number of non-state-of-the-art components:
  –  one 2.5 Gb/s optical link per 800 Mb/s of payload;
  –  2 twinned (on-shore and off-shore) FCM boards per floor (14 floors per tower);
  –  many servers hosting the read-out hardware.
■  When scaling to km³, this brings many cost/size/power/reliability issues.


Page 35:

NaNet3 specifications for the enhanced KM3NeT-IT read-out system:
#  4 channels per PCIe board
  –  i.e. fewer servers, fewer read-out boards, …
#  Link speed up to 10 Gb/s ✔
  –  a further reduction by a factor of 4 in the number of optical channels is possible
#  GPUDirect P2P/RDMA capability ✔
  –  to support GPU-based trigger developments
#  Deterministic-latency link ✔
  –  "fixed latency" clock distribution for under-water event time-stamping

Implemented on the Terasic DE5-NET board:
#  Altera Stratix V based dev board
#  PCIe Gen2 x8 (Gen3 under development)
#  4 independent SFP+ ports (up to 10 Gb/s)
#  Deterministic-latency mode for the transceivers


Page 36:

[Diagram: clock-distribution loop — NaNet3 on-shore (Stratix V) sends clock (+ data) off-shore; the off-shore FCM (Virtex 5) recovers the clock from the on-shore data and uses it to transmit back; on-shore, the clock is recovered from the off-shore data.]

•  Testbed: FCM vs. Terasic DE5-NET
  –  custom hardware mode for the FCM transceivers (Xilinx);
  –  deterministic-latency mode for the Stratix V transceivers;
  –  2 m of copper and 2 m of fiber.
•  Test: 12 hours of periodic (~1 s period) TX clock resets, to verify PLL locking and RX word alignment.


Page 37:

•  NaNet3 link (currently 1 out of 4):
  –  Physical layer: Altera deterministic-latency transceivers (8B10B encoding scheme).
  –  Data layer: Time Division Multiplexing (TDM) data transmission protocol.
     ·  RX path: payloads of the different off-shore devices, multiplexed on a continuous data stream at fixed time slots (PCIe DMA transactions to CPU/GPU memory); a demultiplexing model is sketched below.
     ·  TX path: limited data rate per FCM, for slow-control device messages (PCIe TARGET transactions from CPU/GPU memory).
•  NaNet Ctrl:
  –  protocol translation: encapsulates the TDM data stream in the APElink packet protocol;
  –  destination virtual address generation.
•  PCIe x8 Gen2: CPU/GPU memory write bandwidth ~2.5 GB/s.
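The fixed-slot demultiplexing of the RX path can be pictured with a short host-side model (the slot size and the slot→channel mapping are assumptions; the slides specify only that each off-shore device owns a fixed time slot in the continuous stream):

    #include <stdint.h>
    #include <string.h>

    // TDM demultiplexing model: the continuous stream carries one fixed-size
    // slot per off-shore device, repeating in a fixed round-robin order.
    void tdmDemux(const uint8_t *stream, size_t len,
                  uint8_t **chanBuf, size_t *chanFill,
                  int nChannels, size_t slotBytes) {
        size_t nSlots = len / slotBytes;
        for (size_t s = 0; s < nSlots; ++s) {
            int ch = (int)(s % nChannels);                // fixed slot -> device mapping
            memcpy(chanBuf[ch] + chanFill[ch],
                   stream + s * slotBytes, slotBytes);    // copy this device's payload
            chanFill[ch] += slotBytes;
        }
    }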

[Block diagram: NaNet3 — PCIe x8 Gen2 core; network interface with TX/RX block, 32-bit microcontroller, custom logic, GPU I/O accelerator, memory controller; router; I/O interface with NaNet3 link channels, each stacking deterministic-latency physical link coding, a TDM protocol manager, and a NaNet Ctrl (NaNet Controller).]


Page 38:

■  The NaNet design has proved effective in two different experimental contexts, thanks to its modularity and its high-performance, low-latency features (RDMA, GPUDirect, network protocol offloading).
■  NA62 GPU-based low-level trigger:
  –  Demonstrated real-time data communication between the NA62 RICH read-out system and the GPU-based L0 trigger processor over a single GbE link.
  –  Demonstrated scalability of the design up to multiple 10GbE read-out channels.
  –  A 4-port 10GbE version is under development (1Q2015).
■  KM3NeT-IT on-shore read-out system:
  –  Demonstrated serial-link interoperability between high-end FPGAs from different vendors (with deterministic latency).
  –  Compliant with the real-time constraints of TDM processing.
  –  Ready for future enhancements (off-shore link aggregation, GPU-based trigger, …).


Page 39:

F. Ameli(a), R. Ammendola(b), A. Biagioni(a), A. Cotta Ramusino(c), M. Fiorini(c), O. Frezza(a), G. Lamanna(d), F. Lo Cicero(a), A. Lonardo(a), M. Martinelli(a), I. Neri(c), P. S. Paolucci(a), E. Pastorelli(a), L. Pontisso(f), D. Rossetti(e), F. Simeone(a), F. Simula(a), M. Sozzi(f), L. Tosoratto(a), P. Vicini(a)

(a) INFN Sezione di Roma (b) INFN Sezione di Roma Tor Vergata (c) Università di Ferrara e INFN Sezione di Ferrara (d) INFN LNF and CERN (e) nVIDIA Corporation, USA (f) INFN Sezione di Pisa and CERN
