LinuxCon 2015 Linux Kernel Networking Walkthrough
Kernel Networking Walkthrough
LinuxCon 2015, Seattle
Thomas Graf, Kernel & Open vSwitch Team, Noiro Networks (Cisco)
Agenda
● Getting packets from/to the NIC
  ● NAPI, Busy Polling, RSS, RPS, XPS, GRO, TSO
● Packet processing
  ● RX Handler, IP Processing, TCP Processing, TCP Fast Open
● Queuing from/to userspace
  ● Socket Buffers, Flow Control, TCP Small Queues
● Q&A
Receive & Transmit Process
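[Diagram with three columns: NIC, Network Stack (Kernel Space), Process (User Space)]
● Receive: Ring Buffer (DMA) → Parse L2 & IP → Local? → Parse TCP/UDP → Socket Buffer → read() in Task / Container
● Forward: packets that are not local → Route? → Ring Buffer
● Transmit: write() → Socket Buffer → Construct TCP/UDP → Construct IP → Route? → Ring Buffer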
The 3 ways into the Network Stack
● A: Interrupt Driven (Ring Buffer → Network Stack)
● B: NAPI based Polling (Network Stack calls poll() on the Ring Buffer)
● C: Busy Polling (Task calls busy_poll() on the Ring Buffer, via the Network Stack)
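Variant B can be illustrated with a toy model: after an interrupt, the ring is polled with a packet budget, and the interrupt is only re-enabled once a poll round finishes below that budget. The class and method names are hypothetical, and the budget value is an assumption of this sketch, not something stated on the slide.

```python
from collections import deque


class ToyNapiQueue:
    """Toy model of one RX ring polled NAPI-style (all names hypothetical)."""

    def __init__(self, budget=64):
        self.ring = deque()       # stands in for the RX ring buffer
        self.budget = budget      # assumed per-poll packet budget
        self.irq_enabled = True

    def interrupt(self):
        # The interrupt handler only masks the IRQ; the work happens in poll().
        self.irq_enabled = False

    def poll(self):
        # Process at most `budget` packets in one poll round.
        done = 0
        while self.ring and done < self.budget:
            self.ring.popleft()
            done += 1
        if done < self.budget:
            # Ring drained: leave polling mode and re-enable the IRQ.
            self.irq_enabled = True
        return done
```

With a budget of 4 and 10 queued packets, three poll rounds handle 4, 4 and 2 packets, and only the last one switches back to interrupt mode.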
RSS – Receive Side Scaling
● NIC distributes packets across multiple RX queues, allowing for parallel processing.
● Separate IRQ per RX queue, thus selects CPU to run hardware interrupt handler on.
[Diagram: a filter in the NIC spreads packets over RX-queue-1 to RX-queue-4, which are served by CPU 1, CPU 2, CPU 1, CPU 2]
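The filter step can be sketched as hash, then table lookup. This is a simplified model: `zlib.crc32` stands in for the NIC's hash function (real NICs typically use a Toeplitz hash), and the function name and table size are hypothetical.

```python
import zlib


def rss_queue(src_ip, dst_ip, src_port, dst_port, indirection_table):
    """Pick an RX queue for a flow; crc32 is a stand-in for the NIC hash."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode()
    flow_hash = zlib.crc32(key)
    # The low bits of the hash index a table that maps to RX queues.
    return indirection_table[flow_hash % len(indirection_table)]


# Hypothetical 8-entry table spreading flows over 4 RX queues (0..3).
TABLE = [entry % 4 for entry in range(8)]
```

Because the queue depends only on the flow tuple, all packets of one connection land on the same queue and therefore on the same CPU.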
RPS – Receive Packet Steering
● Software filter to select CPU # for processing
● Use it to ...
  ● ... redo queue - CPU mapping
  ● ... distribute single queue to multiple CPUs
[Diagram: RX-queue-1 to RX-queue-4 steered in software across CPU 1, CPU 2 and CPU 3]
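RPS is configured per RX queue through a hexadecimal CPU bitmask in sysfs. The helpers below are a small sketch of how such a mask could be computed; the function names are hypothetical, CPUs are numbered from 0 as the kernel does (the slide labels start at 1), and masks wider than 32 CPUs, which use comma-separated groups, are not handled.

```python
def cpus_to_rps_mask(cpus):
    """Return the hex bitmask for a list of 0-based CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def rps_cpus_path(dev, rx_queue):
    """Return the sysfs file that holds the RPS CPU mask of one RX queue."""
    return f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus"
```

For example, distributing queue 0 of eth0 to the first three CPUs would mean writing `cpus_to_rps_mask([0, 1, 2])` to `rps_cpus_path("eth0", 0)`.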
Hardware Offload
● RX/TX Checksumming
  ● Perform CPU intensive checksumming in hardware.
● Virtual LAN filtering and tag stripping
  ● Strip 802.1Q header and store VLAN ID in network packet metadata.
  ● Filter out unsubscribed VLANs.
● Segmentation Offload
Generic Receive Offload (ethtool -K eth0 gro on)
[Diagram: NAPI based GRO. During poll(), MTU-sized packets from the Ring Buffer are merged into packets of up to 64K before they enter the Network Stack.]
It is more efficient to process one 64K-byte packet than about forty 1500-byte packets.
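The merging idea can be sketched as follows. This is a deliberately simplified model, not the kernel's algorithm: it only merges back-to-back segments of the same flow, and the function name and the `(flow_id, payload_len)` tuple format are assumptions made for illustration.

```python
def gro_coalesce(segments, max_size=65535):
    """Merge adjacent same-flow segments into packets of up to max_size bytes.

    segments: list of (flow_id, payload_len) tuples in arrival order.
    """
    merged = []
    for flow, length in segments:
        if merged and merged[-1][0] == flow and merged[-1][1] + length <= max_size:
            # Same flow and still room: grow the previous packet.
            merged[-1] = (flow, merged[-1][1] + length)
        else:
            merged.append((flow, length))
    return merged
```

Forty 1460-byte segments of one flow collapse into a single 58400-byte packet, so the upper layers run once instead of forty times.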
Segmentation Offload (ethtool -K eth0 tso on) (ethtool -K eth0 gso on)
● Generic Segmentation Offload (GSO): ethtool -K eth0 gso on
● TCP Segmentation Offload (TSO): ethtool -K eth0 tso on
[Diagram: packets of up to 64K from the Network Stack are split into MTU-sized packets, by GSO before the Ring Buffer and by TSO after it.]
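Segmentation is the reverse of receive offload: one large buffer is cut into wire-sized pieces as late as possible. A minimal sketch, assuming an MSS of 1460 bytes (a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers without options); the function name is hypothetical.

```python
def segment(payload_len, mss=1460):
    """Return the payload sizes of the segments a large packet is split into."""
    sizes = []
    remaining = payload_len
    while remaining > 0:
        size = min(mss, remaining)
        sizes.append(size)
        remaining -= size
    return sizes
```

A 64000-byte payload becomes 44 segments under these assumptions, with only the last one shorter than the MSS.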
Packet Processing
[Diagram: packets arriving from the Link Layer]
● Ingress QoS
● Proto Handler: IPv4, IPv6, ARP, IPX, ... (otherwise: Drop)
● RX Handler: Open vSwitch, Team, Bonding, Bridge, macvlan, macvtap
● Packet Socket (ETH_P_ALL): tcpdump
The Feast!
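The protocol handler step amounts to a lookup on the EtherType field of the Ethernet header. A minimal sketch: the EtherType values are the standard ones, while the function name, the handler table and the string return values are hypothetical, and VLAN tags are not handled.

```python
import struct

ETH_P_IP = 0x0800
ETH_P_ARP = 0x0806
ETH_P_IPX = 0x8137
ETH_P_IPV6 = 0x86DD

# Hypothetical handler table: EtherType -> name of the protocol handler.
HANDLERS = {ETH_P_IP: "IPv4", ETH_P_IPV6: "IPv6", ETH_P_ARP: "ARP", ETH_P_IPX: "IPX"}


def proto_handler(frame):
    """Return the handler name for an Ethernet frame, or "Drop" if none matches."""
    if len(frame) < 14:
        return "Drop"  # shorter than an Ethernet header
    # Bytes 12..13 of the header carry the EtherType in network byte order.
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return HANDLERS.get(ethertype, "Drop")
```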
IP Processing
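[Diagram: netfilter hooks along the IP paths]
● Receive: Link Layer → IP Handler → PREROUTING → Route Lookup
● Local Delivery: INPUT → L4 (TCP, ...) → User Space
● Forwarding: FORWARD → POSTROUTING → Link Layer
● Local Output: User Space → IPv4 Construction and Route Lookup → OUTPUT → POSTROUTING → Link Layer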
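Which hooks a packet traverses depends only on where it originates and whether the route lookup says it is for this host. A small sketch of that decision; the function name and its arguments are hypothetical.

```python
def netfilter_hooks(origin, local_destination=False):
    """Return the netfilter hooks a packet passes, in order.

    origin: "local" for packets sent by this host, "wire" for received ones.
    local_destination: result of the route lookup for received packets.
    """
    if origin == "local":
        return ["OUTPUT", "POSTROUTING"]
    if local_destination:
        return ["PREROUTING", "INPUT"]
    return ["PREROUTING", "FORWARD", "POSTROUTING"]
```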
TCP Processing
[Diagram: IP → Socket Filter → Receive TCP → Parse TCP, Lookup Socket, then one of three queues]
● Backlog (socket locked)
● Prequeue (task exists)
● Receive Socket Buffer
process context ← softirq: the Task picks up the data via poll() / read()
TCP Fast Open (net.ipv4.tcp_fastopen)
Regular (Client ↔ Server)
● 1st Req: SYN, SYN+ACK, ACK+HTTP GET, Data (2x RTT)
● 2nd Req: SYN, SYN+ACK, ACK+HTTP GET, Data (2x RTT)
Fast Open (Client ↔ Server)
● 1st Req: SYN, SYN+ACK+Cookie, ACK+HTTP GET, Data (2x RTT)
● 2nd Req: SYN+Cookie+HTTP GET, SYN+ACK+Data (1x RTT)
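The cookie exchange can be modelled with a toy server: the first SYN earns a cookie, and a later SYN that presents a valid cookie may carry data that is accepted immediately. This is only a sketch; the class, its methods and the returned dictionaries are hypothetical, and an HMAC over the client address stands in for however the kernel actually derives its cookies.

```python
import hashlib
import hmac


class ToyTfoServer:
    """Toy model of the server side of TCP Fast Open (names hypothetical)."""

    def __init__(self, secret=b"server-secret"):
        self.secret = secret

    def cookie_for(self, client_ip):
        # Assumption: HMAC of the client IP stands in for the real cookie.
        digest = hmac.new(self.secret, client_ip.encode(), hashlib.sha256).digest()
        return digest[:8]

    def on_syn(self, client_ip, cookie=None, data=None):
        expected = self.cookie_for(client_ip)
        valid = cookie is not None and hmac.compare_digest(cookie, expected)
        if valid and data is not None:
            # 2nd request: data in the SYN is delivered right away (1x RTT).
            return {"reply": "SYN+ACK+Data", "accepted_data": data}
        # 1st request or bad cookie: regular handshake, hand out a cookie.
        return {"reply": "SYN+ACK+Cookie", "cookie": expected, "accepted_data": None}
```

Binding the cookie to the client address is what keeps a cookie obtained from one address from being replayed from another.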
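Socket Buffers & Flow Control (net.ipv4.tcp_{r|w}mem)
Transmit: ssh → write() → Socket Buffer → TCP/IP → TX Ring Buffer
● wmem over limit? Block or EWOULDBLOCK
● wmem += packet-size when a packet is queued
● wmem -= packet-size when the packet is released
Receive: RX Ring Buffer → TCP/IP → Socket Buffer → ssh
● rmem += packet-size when a packet is queued
● rmem over limit? Reduce TCP Window
● rmem -= packet-size when the task consumes the data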
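The accounting on both sides can be sketched with two toy classes. All names are hypothetical, the send side models a non-blocking socket (so it raises instead of blocking), and the window calculation is a simplification of what TCP actually advertises.

```python
class ToySendBuffer:
    """Toy wmem accounting for a non-blocking socket (names hypothetical)."""

    def __init__(self, limit):
        self.limit = limit
        self.wmem = 0

    def write(self, size):
        if self.wmem + size > self.limit:
            # Over the limit: a blocking socket would sleep here instead.
            raise BlockingIOError("EWOULDBLOCK: send buffer full")
        self.wmem += size

    def tx_done(self, size):
        # The packet has been released, so its memory is uncharged.
        self.wmem -= size


class ToyReceiveBuffer:
    """Toy rmem accounting; the advertised window shrinks as rmem grows."""

    def __init__(self, limit):
        self.limit = limit
        self.rmem = 0

    def packet_in(self, size):
        self.rmem += size

    def read(self, size):
        self.rmem -= size

    def advertised_window(self):
        return max(self.limit - self.rmem, 0)
```

A sender that writes faster than packets drain hits the wmem limit, and a receiver that reads too slowly shrinks its window until the peer stops sending.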
TCP Small Queues (net.ipv4.tcp_limit_output_bytes)
[Diagram: ssh and torrent each write() into their own Socket Buffer; both feed TCP/IP → Queuing Discipline → Driver → TX Ring Buffer]
TSQ: max 128 KB in flight per socket
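The effect of the per-socket cap can be shown with a toy calculation: however much a bulk sender has buffered, only a bounded amount is handed to the queuing discipline at once, so an interactive flow is not stuck behind it. The function name and input format are hypothetical, and the 128 KB constant is the figure from the slide.

```python
TSQ_LIMIT = 128 * 1024  # bytes per socket, as stated on the slide


def tsq_admit(pending, limit=TSQ_LIMIT):
    """Return how many bytes each socket may pass to the qdisc right now.

    pending: dict mapping a socket name to the packet sizes waiting in its
    socket buffer, in order.
    """
    admitted = {}
    for name, packets in pending.items():
        total = 0
        for size in packets:
            if total + size > limit:
                break  # the rest stays in the socket buffer
            total += size
        admitted[name] = total
    return admitted
```

With ten 64 KB packets pending, the torrent socket gets only two of them below the stack at a time, while the small ssh packet passes untouched.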