Parallelizing Packet Processing in Container Overlay ...

51
Parallelizing Packet Processing in Container Overlay Networks (EuroSys '21) Jiaxin Lei 1 , Manish Munikar 2 , Kun Suo 3 , Hui Lu 1 , Jia Rao 2 1 SUNY Binghamton 2 The University of Texas at Arlington 3 Kennesaw State University

Transcript of Parallelizing Packet Processing in Container Overlay ...

Page 1: Parallelizing Packet Processing in Container Overlay ...

Parallelizing Packet Processing in Container Overlay Networks

(EuroSys '21)

Jiaxin Lei1, Manish Munikar2, Kun Suo3, Hui Lu1, Jia Rao2

1 SUNY Binghamton2 The University of Texas at Arlington

3 Kennesaw State University

Page 2: Parallelizing Packet Processing in Container Overlay ...

Containers are everywhere

● Containers are revolutionizing cloud.

○ Lightweight OS-level virtualization

2

Page 3: Parallelizing Packet Processing in Container Overlay ...

Containers are everywhere

● Containers are revolutionizing cloud.

○ Lightweight OS-level virtualization

● Containers communicate using overlay network

○ VXLAN encapsulation

3

Page 4: Parallelizing Packet Processing in Container Overlay ...

Containers are everywhere

● Containers are revolutionizing cloud.

○ Lightweight OS-level virtualization

● Containers communicate using overlay network

○ VXLAN encapsulation

4

Ethernet IP UDP Payload

Original packet

Page 5: Parallelizing Packet Processing in Container Overlay ...

Containers are everywhere

● Containers are revolutionizing cloud.

○ Lightweight OS-level virtualization

● Containers communicate using overlay network

○ VXLAN encapsulation

5

Ethernet IP UDP VXLAN Ethernet IP UDP Payload FCS

VXLAN encapsulation Original packet

Outer packet

Page 6: Parallelizing Packet Processing in Container Overlay ...

Containers are everywhere

● Containers are revolutionizing cloud.

○ Lightweight OS-level virtualization

● Containers communicate using overlay network

○ VXLAN encapsulation

6

Ethernet IP UDP

Flags Reserved VNI Reserved

VXLAN Ethernet IP UDP Payload FCS

VXLAN encapsulation Original packet

Page 7: Parallelizing Packet Processing in Container Overlay ...

Overlay network is slow

7

● Compared to host, overlay network has:

Page 8: Parallelizing Packet Processing in Container Overlay ...

Overlay network is slow

8

● Compared to host, overlay network has:

○ Half the throughput

53%

47%

Page 9: Parallelizing Packet Processing in Container Overlay ...

Overlay network is slow

9

● Compared to host, overlay network has:

○ Half the throughput

○ Double per-packet latency

2x

3x53%

47%

Page 10: Parallelizing Packet Processing in Container Overlay ...

Why are overlay networks so slow?

10

Page 11: Parallelizing Packet Processing in Container Overlay ...

Why are overlay networks so slow?

11

NIC

Application

L 2

L 3&4

L 7

Hardware

SoftIRQ

IRQ

Container packetHost packet● Host packet

○ 1 IRQ + 1 SoftIRQ

Page 12: Parallelizing Packet Processing in Container Overlay ...

Why are overlay networks so slow?

12

NIC

Application

Bridge

pNIC

VxLAN

vNIC

L 2

L 3&4

L 7

Hardware

Application

SoftIRQ

IRQ IRQ

SoftIRQ SoftIRQ SoftIRQ

Container packetHost packet● Host packet

○ 1 IRQ + 1 SoftIRQ

● Container packet

○ 1 IRQ + 3 SoftIRQs

Page 13: Parallelizing Packet Processing in Container Overlay ...

Why are overlay networks so slow?

1. Prolonged datapath

○ Multiple virtual devices to

traverse for each packet

○ 3x more softirq

13

NIC

Application

Bridge

pNIC

VxLAN

vNIC

L 2

L 3&4

L 7

Hardware

Application

SoftIRQ

IRQ IRQ

SoftIRQ SoftIRQ SoftIRQ

Container packetHost packet

Page 14: Parallelizing Packet Processing in Container Overlay ...

Why are overlay networks so slow?

1. Prolonged datapath

○ Multiple virtual devices to

traverse for each packet

○ 3x more softirq

2. Serialized softirq execution

○ Load imbalance

○ Longer queue delay

14

NIC

Application

Bridge

pNIC

VxLAN

vNIC

L 2

L 3&4

L 7

Hardware

Application

SoftIRQ

IRQ IRQ

SoftIRQ SoftIRQ SoftIRQ

Container packetHost packet

Page 15: Parallelizing Packet Processing in Container Overlay ...

Existing optimizations

15

Page 16: Parallelizing Packet Processing in Container Overlay ...

Existing optimizations

● Kernel-bypass [DPDK, mTCP, TAS]✅ Avoid OS overheads; custom minimal network stack

❌ Loose security, compatibility

16

Page 17: Parallelizing Packet Processing in Container Overlay ...

Existing optimizations

● Kernel-bypass [DPDK, mTCP, TAS]✅ Avoid OS overheads; custom minimal network stack

❌ Loose security, compatibility

● Connection-level metadata manipulation [Slim, FreeFlow]✅ Avoids overhead of virtual devices; as fast as host

❌ Limited scope and scalability; cannot support dataplane policies

17

Page 18: Parallelizing Packet Processing in Container Overlay ...

Existing optimizations

● Kernel-bypass [DPDK, mTCP, TAS]✅ Avoid OS overheads; custom minimal network stack

❌ Loose security, compatibility

● Connection-level metadata manipulation [Slim, FreeFlow]✅ Avoids overhead of virtual devices; as fast as host

❌ Limited scope and scalability; cannot support dataplane policies

● Hardware offload [Mellanox ASAP2, AccelNet, RDMA]✅ Fastest; completely avoids CPU overheads

❌ Requires hardware upgrade; limited flexibility

18

Page 19: Parallelizing Packet Processing in Container Overlay ...

Our approach

FALCON = Fast and Balanced Container Networking

Key idea: Leverage multicore architecture to accelerate overlay packet processing

19

Page 20: Parallelizing Packet Processing in Container Overlay ...

Our approach

FALCON = Fast and Balanced Container Networking

Key idea: Leverage multicore architecture to accelerate overlay packet processing

✅ Software-based solution

✅ Full network isolation / flexibility

✅ Completely backward compatible

✅ Better performance

20

Page 21: Parallelizing Packet Processing in Container Overlay ...

Design 1: Softirq Pipelining

Key idea: Pipeline different softirqs onto different cores

21

Page 22: Parallelizing Packet Processing in Container Overlay ...

Design 1: Softirq Pipelining

Key idea: Pipeline different softirqs onto different cores

● Original hash: flow → core

22

Page 23: Parallelizing Packet Processing in Container Overlay ...

Design 1: Softirq Pipelining

Key idea: Pipeline different softirqs onto different cores

● Original hash: flow → core

23

123. . . Core 1

Time

Application1

Core 2

Core 3

2 3 1 2 3 21 3

1

1

1

1st stage

2nd stage

3rd stage

Page 24: Parallelizing Packet Processing in Container Overlay ...

Design 1: Softirq Pipelining

Key idea: Pipeline different softirqs onto different cores

● Original hash: flow → core

● New hash: (flow, device) → core

24

123. . . Core 1

Time

Application1

Core 2

Core 3

2 3

1 2 3

21 3

1

1

1

1st stage

2nd stage

3rd stage

Page 25: Parallelizing Packet Processing in Container Overlay ...

Design 1: Softirq Pipelining

Key idea: Pipeline different softirqs onto different cores

● Original hash: flow → core

● New hash: (flow, device) → core

● Order of packets is still preserved

25

123. . . Core 1

Time

Application1

Core 2

Core 3

2 3

1 2 3

21 3

1

1

1

1st stage

2nd stage

3rd stage

Page 26: Parallelizing Packet Processing in Container Overlay ...

Design 2: Softirq Splitting

Key idea: Split one big softirq into two that can be pipelined

26

Page 27: Parallelizing Packet Processing in Container Overlay ...

Design 2: Softirq Splitting

Key idea: Split one big softirq into two that can be pipelined

● Overlay TCP processing is heavily dominated by first stage

27

1123. . . Core 1

Time

Application

Core 2

Core 3

2 3

Page 28: Parallelizing Packet Processing in Container Overlay ...

Design 2: Softirq Splitting

Key idea: Split one big softirq into two that can be pipelined

● Overlay TCP processing is heavily dominated by first stage○ Two main functions: (a) SKB allocation, (b) GRO processing

28

1a, 1b123. . . Core 1

Time

Application

Core 2

Core 3

2a, 2b 3a, 3b

Page 29: Parallelizing Packet Processing in Container Overlay ...

Design 2: Softirq Splitting

Key idea: Split one big softirq into two that can be pipelined

● Overlay TCP processing is heavily dominated by first stage○ Two main functions: (a) SKB allocation, (b) GRO processing

● Split them by adding a softirq in the middle

29

123. . . Core 1

Time

Application

1a

Core 2

Core 3

1b

2a 3a

2b 3b

Page 30: Parallelizing Packet Processing in Container Overlay ...

Design 3: Softirq Balancing

Key idea: Try to dispatch softirqs on idle cores, else disable Falcon

3070% 95% 25% 45%

1 2 3 4

1 2 3

hashhash hash

● Static hashing

○ Prone to load imbalance

○ Hurts performance if load is already high

Page 31: Parallelizing Packet Processing in Container Overlay ...

Design 3: Softirq Balancing

Key idea: Try to dispatch softirqs on idle cores, else disable Falcon

3170% 95% 25% 45%

1 2 3 4

1 2 3

hashhash hash

● Static hashing

○ Prone to load imbalance

○ Hurts performance if load is already high

● Dynamic rehashing

○ More balanced CPU utilization

rehash

Page 32: Parallelizing Packet Processing in Container Overlay ...

Design 3: Softirq Balancing

Key idea: Try to dispatch softirqs on idle cores, else disable Falcon

3270% 95% 25% 45%

1 2 3 4

1 2 3

hashhash hash

● Static hashing

○ Prone to load imbalance

○ Hurts performance if load is already high

● Dynamic rehashing

○ More balanced CPU utilization

● Disable FALCON when overall system usage is high.

rehash

Page 33: Parallelizing Packet Processing in Container Overlay ...

Evaluation — Setup

Hardware: Intel Xeon, 40 logical cores @ 2.2GHz, 128 GB RAM

NIC: Mellanox ConnectX-5 EN (100 Gbps)

Software: Ubuntu 18.04, with Linux kernel 5.4

33

Page 34: Parallelizing Packet Processing in Container Overlay ...

Evaluation — Setup

Hardware: Intel Xeon, 40 logical cores @ 2.2GHz, 128 GB RAM

NIC: Mellanox ConnectX-5 EN (100 Gbps)

Software: Ubuntu 18.04, with Linux kernel 5.4

Comparison: FALCON vs. Container vs. Host

34

Page 35: Parallelizing Packet Processing in Container Overlay ...

Evaluation — Setup

Hardware: Intel Xeon, 40 logical cores @ 2.2GHz, 128 GB RAM

NIC: Mellanox ConnectX-5 EN (100 Gbps)

Software: Ubuntu 18.04, with Linux kernel 5.4

Comparison: FALCON vs. Container vs. Host

Experiments:

● Single-flow and multi-flow microbenchmarks

● Application benchmarks (CloudSuite web & data-caching)

● [many others in the paper]35

Page 36: Parallelizing Packet Processing in Container Overlay ...

Single-flow throughput

36

Page 37: Parallelizing Packet Processing in Container Overlay ...

Single-flow throughput

37

● FALCON is results at more

than 2x better packet rate

than Container

> 2x

Page 38: Parallelizing Packet Processing in Container Overlay ...

Single-flow throughput

38

● FALCON is results at more

than 2x better packet rate

than Container

● Closer to Host performance

for large packet sizes

Page 39: Parallelizing Packet Processing in Container Overlay ...

Single-flow latency

39

Page 40: Parallelizing Packet Processing in Container Overlay ...

Single-flow latency

40

● Container latency is 2x of host

2x2x

Page 41: Parallelizing Packet Processing in Container Overlay ...

Single-flow latency

41

● Container latency is 2x of host

● FALCON achieves latency closer to host

Page 42: Parallelizing Packet Processing in Container Overlay ...

Multi-flow throughput

42

Page 43: Parallelizing Packet Processing in Container Overlay ...

Multi-flow throughput

43

● UDP: Improves overlay network as much as 55%

55%

Page 44: Parallelizing Packet Processing in Container Overlay ...

Multi-flow throughput

44

● UDP: Improves overlay network as much as 55%

● TCP: Improves overlay network by 45% (host network by 56%)

56%45%

Page 45: Parallelizing Packet Processing in Container Overlay ...

Cloud benchmarks: Web serving

45

Page 46: Parallelizing Packet Processing in Container Overlay ...

Cloud benchmarks: Web serving

46

● Throughput improved by up to 300%

Page 47: Parallelizing Packet Processing in Container Overlay ...

Cloud benchmarks: Web serving

47

● Throughput improved by up to 300%

● Response time reduced by up to 31%

Page 48: Parallelizing Packet Processing in Container Overlay ...

Cloud benchmarks: Data Caching

48

● Memcached benchmark

○ 4 server threads

○ 10 clients

Page 49: Parallelizing Packet Processing in Container Overlay ...

Cloud benchmarks: Data Caching

49

● Memcached benchmark

○ 4 server threads

○ 10 clients

● Avg and tail latency reduced

to 50%

Page 50: Parallelizing Packet Processing in Container Overlay ...

Conclusion

● Overlay packet processing in current OS is not optimized to utilize multicore

● FALCON accelerates overlay packet processing

○ Without losing any features such as security, flexibility, compatibility

● Purely software-based solution that is easy to deploy and upgrade

● Our implementation is available at github.com/munikarmanish/falcon

50

Page 51: Parallelizing Packet Processing in Container Overlay ...

Conclusion

● Overlay packet processing in current OS is not optimized to utilize multicore

● FALCON accelerates overlay packet processing

○ Without losing any features such as security, flexibility, compatibility

● Purely software-based solution that is easy to deploy and upgrade

● Our implementation is available at github.com/munikarmanish/falcon

51

Thank you!