PFQ@ PAM12

PFQ: a Novel Architecture for Packet Capture on Parallel Commodity Hardware

Nicola Bonelli, Andrea Di Pietro, Stefano Giordano, Gregorio Procissi

CNIT and Dip. di Ingegneria dell’Informazione - Università di Pisa

Outline

• Introduction and motivation
• Multi-core programming guidelines
• PFQ architecture
• Performance evaluation
• Conclusion and future work

Introduction and Motivations

• Designing monitoring applications has become a very challenging task:
– The hardware has evolved: 10 Gbit/s links, multi-core architectures and multi-queue network devices (MSI-X)…
• Current traffic-monitoring software, including some parts of the Linux kernel, is not optimized for the new hardware:
– (+) Kernel support for multi-queue network adapters is implemented
– (-) The Linux kernel has very poor support for monitoring applications
– (-) PF_PACKET is extremely slow, even when used in memory-map mode (pcap)
– (-) PF_RING was designed for single-processor systems
• Traffic monitoring should:
– Exploit modern hardware, scaling possibly linearly with the number of cores
– Decouple the hardware parallelism from the software one
– Use a divide-and-conquer approach to steer packets to applications or threads

Multi-thread on Multi-core

• What’s wrong with the current software?
– Multi-threading paradigms designed for single-processor systems are still valid, but they prevent the software from scaling with the number of cores.
• For software to be effective on a multi-core system…
– Semaphores, mutexes, and spinlocks are out of the question!
– R/W mutexes prevent readers from scaling, even though they are supposed to grant concurrent access to readers
– Atomic operations are sometimes required, but must be used in moderation:
  • use sparse counters instead of atomic ones
  • design algorithms so that they can use amortized atomic operations
– Sharing (writes to shared data) has a serious impact on performance
– Writes to shared memory are delayed by the hardware, and reads must be synchronized
– False sharing must, and always can, be avoided
• Wait-free algorithms are mandatory; lock-free algorithms should be avoided where possible…
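The "sparse counter" guideline can be illustrated with a minimal C++11 sketch. It is not taken from PFQ: the names `PaddedCounter` and `SparseCounter`, the 64-byte cache-line size, and the one-writer-per-slot rule are all assumptions made for illustration. Each core updates its own cache-line-aligned slot, so increments need no locked instruction and never cause false sharing; a reader sums the slots.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical on x86-64.
constexpr std::size_t kCacheLine = 64;

// One counter per core, each alone on its own cache line, so writers
// on different cores never invalidate each other's line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

template <std::size_t Cores>
class SparseCounter {
    PaddedCounter slots_[Cores];

public:
    // Assumption: only the thread pinned to `core` calls add() for that slot.
    // With a single writer per slot, a relaxed load/store pair is enough.
    void add(std::size_t core, std::uint64_t n = 1) {
        assert(core < Cores);
        std::atomic<std::uint64_t>& v = slots_[core].value;
        v.store(v.load(std::memory_order_relaxed) + n,
                std::memory_order_relaxed);
    }

    // Readers sum all slots; the total is a snapshot that may lag slightly.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const PaddedCounter& s : slots_)
            total += s.value.load(std::memory_order_relaxed);
        return total;
    }
};
```

The trade-off is memory for scalability: one cache line per core, in exchange for increments that never contend.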

PFQ preamble

• PFQ is a novel capture system that natively supports 64-bit multi-core architectures and is built according to all the guidelines presented above
• PFQ is not a custom driver
• It is an architecture running on top of standard Ethernet drivers, as well as slightly modified ones, the “PFQ-aware drivers” (an idea inherited from PF_RING-aware drivers)
• PFQ enables packet capturing, filtering, aggregation of hardware queues and devices, packet classification, packet steering and so forth…
• It decouples the hardware parallelism (e.g. Intel RSS) from the software one

PFQ architecture

Built on top of the following components…

• User-space C++11 library that provides the same abstractions as the STL: containers and iterators
• DB-MPSC queue: double-buffered multiple-producer, single-consumer queue (for communication with user space):
– Allows NAPI contexts to enqueue packets concurrently
– Reduces sharing and eliminates false sharing between user-space and NAPI contexts
– Enables user space to copy packets from the queue to a private buffer in batches
• De-multiplexing matrix:
– A perfectly wait-free, concurrently accessible data structure
– No serialization is required to steer/copy packets
• SPSC queue:
– Enables batching of socket buffers (skb), to increase temporal locality for the memory manager (SLAB for kernels prior to 2.6.39)
• Driver aware:
– An effective idea inherited from PF_RING
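As an illustration of the SPSC building block, here is a generic wait-free single-producer, single-consumer ring in C++11. It is a textbook sketch, not PFQ's kernel code: the class name `SpscQueue`, the fixed capacity, and the 64-byte alignment are assumptions. Each index has exactly one writer, so neither side needs a lock or a read-modify-write atomic.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical SPSC ring; it holds at most N - 1 elements.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(N >= 2, "ring needs at least two slots");

    T ring_[N];
    // Each index lives on its own (assumed 64-byte) cache line
    // to avoid false sharing between producer and consumer.
    alignas(64) std::atomic<std::size_t> head_{0};  // written by the consumer only
    alignas(64) std::atomic<std::size_t> tail_{0};  // written by the producer only

public:
    // Producer side: returns false if the ring is full.
    bool push(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire))
            return false;
        ring_[t] = v;
        tail_.store(next, std::memory_order_release);  // publish the element
        return true;
    }

    // Consumer side: returns false if the ring is empty.
    bool pop(T& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;
        out = ring_[h];
        head_.store((h + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }
};
```

Both operations complete in a bounded number of steps regardless of what the other thread does, which is what makes the structure wait-free.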

PFQ architecture

Packet steering

Given a packet and a set of sockets, which sockets need to receive it?

• For capture engines that do not support steering, filtering can be used to dispatch packets across a number of sockets:
– Traversing the socket list to find the sockets interested in a packet has linear complexity, O(n).
– It is a flexible approach, because it enables dispatching as well as copies
• We designed a “packet steering” paradigm that:
– Identifies the destination sockets with O(1) complexity
– Supports both balancing and copies of packets
– Allows custom hash functions for packet dispatching
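The O(1) dispatch idea can be sketched as a hash followed by a modulo, instead of a walk over the socket list. The `FlowKey` layout, the XOR-based `flow_hash`, and the `steer` function below are hypothetical and only stand in for whatever custom hash function an application would plug in.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical flow identifier extracted from a packet header.
struct FlowKey {
    std::uint32_t src_ip;
    std::uint32_t dst_ip;
    std::uint16_t src_port;
    std::uint16_t dst_port;
};

// Symmetric hash: XOR is commutative, so both directions of a
// connection produce the same value and reach the same socket.
inline std::uint32_t flow_hash(const FlowKey& k) {
    return (k.src_ip ^ k.dst_ip) ^
           (static_cast<std::uint32_t>(k.src_port) ^
            static_cast<std::uint32_t>(k.dst_port));
}

// Constant-time choice of the destination socket index.
// Precondition: n_sockets > 0.
inline std::size_t steer(const FlowKey& k, std::size_t n_sockets) {
    return flow_hash(k) % n_sockets;
}
```

A plain XOR distributes poorly on some traffic mixes; a production hash would likely mix the bits further, but the constant-time structure stays the same.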

Packet steering

• Completely concurrent block (wait-free):
– Shared state (the de-multiplexing matrix) is mostly read-only
– Writes, which are in general rare events, are serialized with each other to prevent race conditions. The update of the state in the matrix is atomic
• Load-balancing groups:
– A socket can create or subscribe to a load-balancing group
– It will then receive a fraction of the overall traffic
• Socket binding:
– To one or more hardware queues of a given NIC
– To one or more NICs
• Binding and balancing groups are orthogonal and can be used concurrently
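One way to picture a read-mostly de-multiplexing matrix is a table of atomic bitmasks indexed by device and hardware queue, where bit i is set if socket i is bound to that cell. This is an assumed layout, not PFQ's actual data structure; `DemuxMatrix`, the table dimensions, and the 64-socket limit are illustrative. The mutex sits only on the rare control path and serializes writers; the per-packet lookup is a single atomic load.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

// Hypothetical table limits.
constexpr std::size_t kDevices = 4;
constexpr std::size_t kQueues = 16;

class DemuxMatrix {
    std::atomic<std::uint64_t> cells_[kDevices][kQueues];
    std::mutex writers_;  // control path only: serializes rare updates

    void update(unsigned socket_id, std::size_t dev, std::size_t queue, bool set) {
        assert(socket_id < 64 && dev < kDevices && queue < kQueues);
        std::lock_guard<std::mutex> lock(writers_);
        std::atomic<std::uint64_t>& cell = cells_[dev][queue];
        std::uint64_t bit = std::uint64_t{1} << socket_id;
        std::uint64_t cur = cell.load(std::memory_order_relaxed);
        // The new mask becomes visible to readers in one atomic store.
        cell.store(set ? (cur | bit) : (cur & ~bit), std::memory_order_release);
    }

public:
    DemuxMatrix() {
        for (auto& row : cells_)
            for (auto& cell : row)
                cell.store(0, std::memory_order_relaxed);
    }

    void bind(unsigned socket_id, std::size_t dev, std::size_t queue) {
        update(socket_id, dev, queue, true);
    }

    void unbind(unsigned socket_id, std::size_t dev, std::size_t queue) {
        update(socket_id, dev, queue, false);
    }

    // Data path: O(1) and wait-free. Returns the set of destination sockets.
    std::uint64_t lookup(std::size_t dev, std::size_t queue) const {
        return cells_[dev][queue].load(std::memory_order_acquire);
    }
};
```

Because a reader always sees either the old mask or the new one, never a partial update, packet processing does not have to be paused while sockets bind or unbind.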

Socket queue: DB-MPSC

• The socket queue is an unavoidable contention point:
– Load balancing shuffles packets across sockets
• How can contention be handled without impacting performance?
– Use an atomic operation to reserve a slot within the queue (will be amortized in future implementations)
– Reduce coherence traffic among the cores running the kernel threads and the user-space thread
– The swap between buffers is triggered by the user-space thread or by a watermark
– Packets can be copied in batches, or consumed in place
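The slot-reservation and buffer-swap steps can be sketched in user-space C++11 as follows. This is a simplified model under stated assumptions, not the PFQ implementation: the class name `DoubleBufferQueue`, the packing of buffer index and slot count into one 64-bit word, and the per-slot `ready` flag are all hypothetical, and watermark-triggered swaps are omitted.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical double-buffered multi-producer, single-consumer queue.
template <typename T, std::size_t Capacity>
class DoubleBufferQueue {
    struct Slot {
        T value{};
        std::atomic<bool> ready{false};  // set once the producer has written value
    };

    std::array<Slot, Capacity> buffers_[2];
    // Bit 32 selects the active buffer; the low 32 bits count reserved slots.
    // Assumption: fewer than 2^32 pushes happen between two swaps.
    std::atomic<std::uint64_t> state_{0};

public:
    // Producers: a single atomic add reserves a slot in the active buffer.
    // Returns false (element dropped) when that buffer is full.
    bool push(const T& v) {
        std::uint64_t s = state_.fetch_add(1, std::memory_order_acq_rel);
        std::size_t index = static_cast<std::size_t>((s >> 32) & 1);
        std::uint64_t slot = s & 0xffffffffu;
        if (slot >= Capacity) return false;
        Slot& dst = buffers_[index][static_cast<std::size_t>(slot)];
        dst.value = v;
        dst.ready.store(true, std::memory_order_release);
        return true;
    }

    // Single consumer: activate the other buffer, then drain the retired
    // one as a batch while producers keep filling the new one.
    std::vector<T> swap_and_drain() {
        std::uint64_t cur = state_.load(std::memory_order_relaxed);
        std::uint64_t next = (((cur >> 32) & 1) ^ 1) << 32;
        std::uint64_t old = state_.exchange(next, std::memory_order_acq_rel);
        std::size_t index = static_cast<std::size_t>((old >> 32) & 1);
        std::size_t n = static_cast<std::size_t>(
            std::min<std::uint64_t>(old & 0xffffffffu, Capacity));
        std::vector<T> out;
        out.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            Slot& src = buffers_[index][i];
            // Wait for a producer that reserved this slot but is still writing.
            while (!src.ready.load(std::memory_order_acquire)) {}
            out.push_back(src.value);
            src.ready.store(false, std::memory_order_relaxed);
        }
        return out;
    }
};
```

In this sketch producers pay one atomic add per element and never wait; the only waiting happens on the consumer side, for slots that were reserved but not yet filled when the swap occurred.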

Testbed: Mascara & Monsters

• Two hosts, Mascara and Monsters, connected by a 10 Gb link
• Traffic generation:
– Xeon 6-core X5650 @2.57 GHz, 12 GBytes RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– New socket PF_DIRECT for generation
– By deploying 3-4 cores, it is possible to generate up to ~12 Mpps of 64-byte packets
• Traffic capture:
– Xeon 6-core X5650 @2.57 GHz, 12 GBytes RAM
– Intel 82599 multi-queue 10G Ethernet adapter
– PFQ on board for traffic capture
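To put the ~12 Mpps generation rate in context, the theoretical packet rate of an Ethernet link can be computed from the frame size plus the fixed per-frame overhead on the wire (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap). The helper name `max_pps` is hypothetical; the arithmetic is standard Ethernet framing.

```cpp
// Maximum frames per second on an Ethernet link.
// Each frame occupies frame_bytes + 8 (preamble and SFD) + 12 (inter-frame gap)
// bytes of wire time.
constexpr double max_pps(double link_bps, unsigned frame_bytes) {
    return link_bps / ((frame_bytes + 8 + 12) * 8.0);
}
```

For a 10 Gbit/s link and 64-byte frames this gives roughly 14.88 Mpps, so ~12 Mpps corresponds to about 80% of line rate at the minimum frame size.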

Single socket layout

Fully parallel layout

Load balancing across sockets

• Using 12 capturing NAPI contexts

• Varying the number of user space threads

Packet copy

• Copying packets to a variable number of user space threads

• 12 NAPI contexts within the kernel

Future directions

We are working to improve the packet steering framework…

• How can we better distribute packets according to application-specific semantics?
• Enhance balancing groups, allowing a single socket to join multiple balancing groups
• Each group is associated with a “specific steering function”
• Investigate the implementation of wait-free stateful algorithms (pimp/CAS)
• Add support for control-plane and data-plane sockets
• Implement a filtering mechanism by means of some Bloom filter variant (capture filters)

Conclusions

• Modern commodity architectures are increasingly parallel
• Multi-threaded software is today not ready for multi-core architectures:
– Coding and design rules must be strictly followed to achieve linear scalability
• PFQ: a novel Linux packet capturing engine
– Better scalability than its competitors
– Flexible packet steering that eases the implementation of multi-threaded user-space applications
– Decouples kernel-space and user-space parallelism
• PFQ webpage and download:
– netgroup.iet.unipi.it/software/pfq
