12th ANNUAL WORKSHOP 2016
RDMA AND USER SPACE ETHERNET BONDING
Tzahi Oved
April 2016, Mellanox
OpenFabrics Alliance Workshop 2016
AGENDA
§ Introduction
  • NIC Teaming
  • RoCE and ib_device
  • Application view
§ RDMA Device HW Bonding
§ HW Bond and virtualization
  • Embedded Switch SW Model
  • Embedded Switch and HW Bonding
§ Multi-PCI Socket NIC
  • Introduction
  • HW Bonding for app transparency
§ Summary
INTRODUCTION
Bonding / Team drivers
§ IEEE 802.3ad defines how to combine multiple physical network ports into a single logical port for:
  • High availability
  • Load balancing
§ Linux uses a bonding or teaming device to build a link aggregation trunk
§ Both expose a software net_dev that provides the LAG interface toward the networking stack
§ The team/bond is considered the “upper” device to the “lower” enslaved NIC net_devices
§ Different modes of operation
  • Active/Passive
  • 802.3ad (LAG), static and dynamic (LACP)
§ The traditional network stack sees a single “upper” net_dev
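
To make the two modes concrete, here is a minimal sketch of how an "upper" device might pick a "lower" device on transmit. The `Bond` class and its method names are hypothetical and not part of any driver. The hash is loosely modeled on the layer2 transmit policy described in the Linux bonding documentation (last MAC bytes XOR packet type, modulo slave count) and is simplified for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Bond:
    """Toy model of an "upper" bond device over "lower" slave net_devs."""
    slaves: list                                   # e.g. ["eth0", "eth1"]
    link_up: dict = field(default_factory=dict)    # slave name -> link state

    def __post_init__(self):
        # Assumption: every link starts up unless the caller says otherwise.
        for name in self.slaves:
            self.link_up.setdefault(name, True)

    def active_slaves(self):
        return [s for s in self.slaves if self.link_up[s]]

    def select_active_backup(self):
        """Active/Passive: always use the first slave whose link is up."""
        active = self.active_slaves()
        return active[0] if active else None

    def select_xor(self, src_mac: bytes, dst_mac: bytes, ether_type: int):
        """Hash-based LAG: spread flows over all slaves whose link is up."""
        active = self.active_slaves()
        if not active:
            return None
        h = src_mac[5] ^ dst_mac[5] ^ ether_type
        return active[h % len(active)]
```

In both modes the caller only ever talks to the bond; which lower device carries a given frame is decided inside the upper device.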
INTRODUCTION
RDMA over Ethernet (RoCE) / RDMA-CM
§ The upstream RDMA stack supports multiple transports: RoCE, IB, iWARP
§ RoCE – RDMA over Converged Ethernet. RoCE v2 (upstream in kernel 4.5) carries IBTA RDMA headers over UDP
§ RoCE uses IPv4/IPv6 addresses set on the regular Ethernet NIC port net_dev
§ RoCE apps use the RDMA-CM API for the control path and the verbs API for the data path
§ RDMA-CM API (include/rdma/rdma_cm.h)
  • Address resolution – local route lookup + ARP/ND services (rdma_resolve_addr())
  • Route resolution – path lookup in IB networks (rdma_resolve_route())
  • Connection establishment – per-transport CM to wire the offloaded connection (rdma_connect())
§ Verbs API
  • Send/RDMA – send a message or perform an RDMA operation (post_send())
  • Poll – poll for completion of a Send/RDMA or Receive operation (poll_cq())
    • Async completion handling and fd semantics are supported
  • Post Receive Buffer – hand receive buffers to the NIC (post_recv())
§ RDMA device – ib_device
  • The device structure; exposes all of the above operations
  • Associated with a net_device
§ Available for both RoCE and user-mode Ethernet programming (e.g. DPDK)
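
The ordering of the three RDMA-CM steps can be sketched as a small state machine. This is a model only, not a binding to librdmacm or the kernel API: the `CmClient` class is hypothetical, and the event names are the ones librdmacm reports after each call on the client side.

```python
# Client-side RDMA-CM control path modeled as a tiny state machine.
# The class is illustrative; no RDMA hardware or library is touched.

STEPS = [
    # (call the app makes, event it then waits for)
    ("rdma_resolve_addr", "RDMA_CM_EVENT_ADDR_RESOLVED"),
    ("rdma_resolve_route", "RDMA_CM_EVENT_ROUTE_RESOLVED"),
    ("rdma_connect", "RDMA_CM_EVENT_ESTABLISHED"),
]


class CmClient:
    def __init__(self):
        self.step = 0          # index of the next call to issue
        self.pending = None    # event we are waiting for, if any

    @property
    def established(self):
        return self.step == len(STEPS)

    def call(self, name):
        """Issue the next control-path call; reject out-of-order calls."""
        if self.pending is not None:
            raise RuntimeError(f"still waiting for {self.pending}")
        if self.established:
            raise RuntimeError("connection already established")
        expected_call, event = STEPS[self.step]
        if name != expected_call:
            raise RuntimeError(f"expected {expected_call}, got {name}")
        self.pending = event

    def on_event(self, event):
        """Consume the completion event for the call in flight."""
        if event != self.pending:
            raise RuntimeError(f"unexpected event {event}")
        self.pending = None
        self.step += 1
```

Only once `established` is true would an application move on to the verbs data path (post_send(), post_recv(), poll_cq()).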
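ETHERNET BONDING
Application Point of View
[Figure: application view of the stack across User, Kernel and HW layers. Socket path elements: Socket App, Sockets, TCP/IP, Linux Bonding (bond0), eth0 and eth1. Verbs path elements: Verbs App, User Verbs, two ib_devs, QP on the HCA.]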
RDMA DEVICE HW BONDING
§ Register a new ib_dev associated with the bond net_dev
  • eth0 and eth1 will listen for Linux bond enslavement netlink events
  • The new device will use the provider's pick of PCIe function (PF0, PF1 or both) for device I/O
§ Registered RDMA devices associated with eth0 and eth1
  • Will unregister and re-register to drop existing consumers on enslavement
  • Will be used for port management only, through the port immutable ops (get_port_immutable())
  • Similar to the Linux bonding enslaved net_devs
[Figure: HW bond topology. Elements shown: NIC with Phys Port1, Phys Port2 and a HW Bond block; PCIe PF0 and PCIe PF1; eth0 and eth1, each with an RDMA Device; Linux Bonding/Teaming exposing bond0 with its own RDMA Device.]
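
The registration flow on this slide can be sketched as a toy registry. Everything here is hypothetical: the `RdmaRegistry` class, the role strings, and the assumption that the bond device is registered in the same step as the slaves are re-registered. It only illustrates the sequence "drop consumers, re-register for port management, add a device for the bond".

```python
class RdmaRegistry:
    """Toy registry of RDMA devices keyed by their associated net_dev."""

    def __init__(self):
        # net_dev name -> {"role": str, "consumers": list}
        self.devices = {}

    def register(self, netdev, role="full"):
        self.devices[netdev] = {"role": role, "consumers": []}

    def unregister(self, netdev):
        # Unregistering drops whatever consumers were attached.
        return self.devices.pop(netdev)["consumers"]

    def on_enslave(self, bond, slaves):
        """React to a (hypothetical) 'slaves joined bond' notification."""
        dropped = []
        for s in slaves:
            dropped += self.unregister(s)         # drop existing consumers
            self.register(s, role="port-mgmt")    # re-register, management only
        self.register(bond, role="full")          # new ib_dev for the bond
        return dropped
```

After `on_enslave("bond0", ["eth0", "eth1"])`, applications would open the device associated with bond0, while the per-port devices stay visible for port management only.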
RDMA DEVICE HW BONDING – CONT.
§ HW Bond
  • NIC logic for HW forwarding of ingress traffic to the bond/team RDMA device
  • net_dev traffic is passed directly to the owner net_dev according to the ingress port
§ Failover
  • For RoCE and user-mode Ethernet traffic, the transport object (QP) port is migrated transparently in HW
  • Traditional net_dev interface traffic remains associated with the slave net_dev
§ Verbs
  • Use the transport object (QP) attribute: port affinity
§ Configuration
  • Native Linux administration
  • RoCE bonding is mainly auto-configured
§ LACP (802.3ad)
  • Either handled by the Linux bonding/teaming driver
  • Or in HW/FW for supporting NICs (required for configurations with many PFs on a single physical port)
[Figure: HW bond topology. Elements shown: NIC with Phys Port1, Phys Port2 and a HW Bond block; PCIe PF0 and PCIe PF1; eth0 and eth1, each with an RDMA Device; Linux Bonding/Teaming exposing bond0 with its own RDMA Device.]
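
A rough model of the failover behavior: each QP carries a port affinity, and when a port fails its QPs are moved to a surviving port without the application recreating them. The `BondedRdmaDevice` class and its methods are hypothetical. In the design described here the migration happens in NIC hardware; the code only mirrors the bookkeeping.

```python
class BondedRdmaDevice:
    """Toy model of a bonded RDMA device with two physical ports."""

    def __init__(self, ports=(1, 2)):
        self.port_up = {p: True for p in ports}
        self.qp_port = {}                 # QP number -> current physical port

    def create_qp(self, qpn, port_affinity):
        """Create a QP bound to the requested port (the affinity attribute)."""
        if not self.port_up.get(port_affinity, False):
            raise ValueError(f"port {port_affinity} is not available")
        self.qp_port[qpn] = port_affinity

    def port_down(self, port):
        """Mark a port failed and move its QPs to a surviving port."""
        self.port_up[port] = False
        survivors = [p for p, up in self.port_up.items() if up]
        if not survivors:
            return []                     # nothing to migrate to
        moved = [q for q, p in self.qp_port.items() if p == port]
        for q in moved:
            self.qp_port[q] = survivors[0]
        return moved
```

The QP numbers stay the same across the move, which is what keeps the failover transparent to the verbs application.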
HW BOND AND VIRTUALIZATION
eSwitch Software Model – Option I
[Figure: eSwitch software model, option I. Elements shown: NIC with Phys Port and eSwitch; PCIe PF0 (native OS, eth0, RDMA Device); PCIe VF0.0 and VF0.1 (SRIOV VM0, SRIOV VM1); Linux switch device with rep_vf0 and rep_vf1; Linux/OVS Bridge (br0); VM2 and VM3.]
HW BOND AND VIRTUALIZATION
eSwitch Software Model – Option II
[Figure: eSwitch software model, option II. Elements shown: NIC with Phys Port and eSwitch; PCIe PF0 (native OS, eth0, RDMA Device); PCIe VF0.0 and VF0.1 (SRIOV VM0, SRIOV VM1); Linux switch device with rep_phy0, rep_vf0, rep_vf1 and rep_eth0; Linux/OVS Bridge; VM2 and VM3.]
HW BOND AND VIRTUALIZATION
eSwitch Software Model with HA
[Figure: eSwitch software model with HA. Elements shown: NIC with Phys Port1, Phys Port2, eSwitch and a HW Bond block; PCIe PF0 and PCIe PF1 (native OS, eth0, RDMA Device); PCIe VF0.0 and VF1.0 (SRIOV VM0, SRIOV VM1); Linux switch device with rep_phy0, rep_phy1, rep_vf0, rep_vf1 and rep_eth0; Linux Bonding; Linux/OVS Bridge; VM2 and VM3.]
HW BOND AND VIRTUALIZATION
eSwitch Software Model with Tunneling
[Figure: eSwitch software model with tunneling. Elements shown: NIC with Phys Port, eSwitch and HW Tunnel; PCIe PF0 (eth0, RDMA Device); PCIe VF0.0 and VF0.1 (SRIOV VM0, SRIOV VM1); Linux switch device with rep_phy0, rep_vf0, rep_vf1 and rep_eth0; Linux/OVS Bridge; OVS-VX Bridge with a vxlan net_device keyed by VNI; UDP/IP stack; VM2 and VM3.]
MULTI-PCI SOCKET NIC
§ A single NIC can be connected through one or more PCIe buses
§ Each PCIe bus is connected to a different NUMA node
§ To the OS, the NIC is exposed as two or more net_devices, each with its own associated RDMA device
§ Applications enjoy direct device access from the local NUMA node
  • Using a local network interface per NUMA node
§ Boosts performance for HPC and cloud
  • QPI avoidance for I/O – optimal performance
  • Enables GPU / peer direct on both slots
  • Enables Data Direct I/O (DDIO) acceleration for both sockets
[Figure: two CPUs linked by QPI, each with its own PCIe x8 link.]
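
As an illustration of "a local network interface per NUMA node", the sketch below picks the net_device that sits on the same NUMA node as the calling thread. The helper names are hypothetical. The sysfs attribute `/sys/class/net/<ifname>/device/numa_node` is the standard Linux location for a PCI device's NUMA node and reads -1 when the platform reports none.

```python
from pathlib import Path


def read_numa_node(ifname, sysfs_root="/sys/class/net"):
    """Return the NUMA node of a net_device, or None if it cannot be read."""
    path = Path(sysfs_root) / ifname / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None    # -1 means "no NUMA information"


def pick_local_interface(cpu_node, if_nodes):
    """Pick the interface on cpu_node; otherwise fall back to the first one.

    if_nodes maps interface name -> NUMA node (or None when unknown).
    """
    for ifname, node in if_nodes.items():
        if node == cpu_node:
            return ifname
    return next(iter(if_nodes), None)
```

A caller might build `if_nodes` with `{name: read_numa_node(name) for name in ("eth0", "eth1")}` and pass the NUMA node of the CPU its thread is pinned to.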
MULTI-PCI SOCKET NIC
Benchmark
20% Lower Latency
50% of CPU Overhead
(Lower is better)
MULTI-PCI SOCKET NIC
Transparency to the App
§ Application look and feel – applications would like to work with a single network interface
§ Use Linux bonding with RDMA device bonding
§ For TCP/IP traffic
  • On TX, select the slave according to TX queue affinity
  • On RX, use accelerated RFS to teach the NIC which slave to use per flow
§ For RDMA / user-mode Ethernet traffic, select the slave according to:
  • Explicit – the transport object (QP) logical port affinity attribute set at creation
  • Or the CPU affinity of the thread that creates the transport object
  • The QPn namespace is divided across slaves
  • On receive, use the QPn-to-slave mapping
    • From the BTH or from a Flow Steering action
§ Don’t share HW resources (CQ, SRQ) across different CPU sockets
  • Each device has its own HW resources
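[Figure: multi-PCI socket bond topology. Elements shown: NIC with a single Phys Port and a HW Bond block; PCIe PF0 and PCIe PF1; eth0 and eth1, each with an RDMA Device; Linux Bonding/Teaming exposing bond0 with its own RDMA Device.]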
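
One way to picture "QPn namespace is divided across slaves" is to give each slave a contiguous slice of the 24-bit QP number space, so the receive side can recover the slave from the QPn alone. The contiguous split, the function names and the `QpnAllocator` class are assumptions made for illustration; the slides do not say how the hardware partitions the space, and the sketch ignores reserved QP numbers.

```python
QPN_BITS = 24                     # the BTH destination QP field is 24 bits wide
QPN_SPACE = 1 << QPN_BITS


def qpn_range(slave_index, num_slaves):
    """Contiguous slice of the QPn space owned by one slave (assumed scheme)."""
    size = QPN_SPACE // num_slaves
    start = slave_index * size
    return range(start, start + size)


def slave_for_qpn(qpn, num_slaves):
    """Receive side: map the QPn taken from the BTH back to its slave."""
    if not 0 <= qpn < QPN_SPACE:
        raise ValueError("QPn out of range")
    size = QPN_SPACE // num_slaves
    # Any remainder at the top of the space is attributed to the last slave.
    return min(qpn // size, num_slaves - 1)


class QpnAllocator:
    """Hands out QPns from the slice of the slave chosen at creation time."""

    def __init__(self, num_slaves):
        self.num_slaves = num_slaves
        self.next_free = [qpn_range(i, num_slaves).start
                          for i in range(num_slaves)]

    def create_qp(self, port_affinity):
        qpn = self.next_free[port_affinity]
        if qpn not in qpn_range(port_affinity, self.num_slaves):
            raise RuntimeError("slice exhausted")
        self.next_free[port_affinity] += 1
        return qpn
```

With two slaves, a QP created with affinity 1 gets a number in the upper half of the space, and `slave_for_qpn` steers incoming packets for it to the second PCIe function without any per-QP lookup table.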
SUMMARY
§ Traditional stack transport logic is managed in software (TCP/IP)
§ RDMA transport logic is managed in NIC HW
§ Migrating the HW-managed transport object from a failed port requires HW aid
  • Currently limited to physical ports of the same adapter
§ Building on top of existing infrastructure provides seamless configuration for both administrators and applications
  • Allows HW awareness of the configuration and of failover events
§ The same logic may be used for representing multiple logical devices to a single physical device interface
THANK YOU Tzahi Oved
Mellanox Technologies