Raoul Kernel Slides

download Raoul Kernel Slides

of 25

Transcript of Raoul Kernel Slides

  • 7/29/2019 Raoul Kernel Slides

    1/25

    Linux Kernel Networking

    Raoul Rivas

  • 7/29/2019 Raoul Kernel Slides

    2/25

    Kernel vs Application Programming

    No memory protection

    We share memory withdevices, scheduler

    Sometimes no preemption

    Can hog the CPU

    Concurrency is difficult

    No libraries

    Printf, fopen No security descriptors

    In Linux no access to files

    Direct access to hardware

    Memory Protection

    Segmentation Fault

    Preemption

    Scheduling isn't ourresponsibility

    Signals (Control-C)

    Libraries

    Security Descriptors

    In Linux everything is a filedescriptor

    Access to hardware as files

  • 7/29/2019 Raoul Kernel Slides

    3/25

    Outline

    User Space and Kernel Space Running Context in the Kernel

    Locking

    Deferring Work

    Linux Network Architecture

    Sockets, Families and Protocols

    Packet Creation

    Fragmentation and Routing

    Data Link Layer and Packet Scheduling

    High Performance Networking

  • 7/29/2019 Raoul Kernel Slides

    4/25

    System Calls

    A system call is an interrupt

    syscall(number,arguments)

    The kernel runs in adifferent address space

    Data must be copied backand forth

    copy_to_user(),copy_from_user()

    Never directly dereferenceany pointer from user space

    Kernel Space

    User Space

    Syscalltable

    write(ptr, size);

    ptr

    syscall(WRITE, ptr, size)

    sys_write()

    Copy_from_user()INT0x80

    0xFFFF50

    0x011075

  • 7/29/2019 Raoul Kernel Slides

    5/25

    Context

    Context: Entity whom the kernel is running code on behalf of

    Process context and Kernel Context are preemptible

    Interrupts cannot sleep and should be small

    They are all concurrent

    Process context and Kernel context have a PID:

    Struct task_struct* current

    Kernel Context Process

    Context

    Interrupt

    Context

    Preemptible Yes Yes No

    PID Itself Application PID No

    Can Sleep? Yes Yes No

    Example Kernel Thread System Call Timer Interrupt

  • 7/29/2019 Raoul Kernel Slides

    6/25

    Race Conditions

    Process context, Kernel Context and Interruptsrun concurrently

    How to protect critical zones from race

    conditions? Spinlocks

    Mutex

    Semaphores Reader-Writer Locks (Mutex, Semaphores)

    Reader-Writer Spinlocks

  • 7/29/2019 Raoul Kernel Slides

    7/25

    Inside Locking Primitives Spinlock

    //spinlock_lock:disable_interrupts();while(locked==true);

    //critical region//spinlock_unlock:enable_interrupts();locked=false;

    Mutex

    //mutex_lock:If (locked==true){

    Enqueue(this);Yield();

    }locked=true;

    //critical region//mutex_unlock:If !isEmpty(waitqueue){wakeup(Dequeue());}Else locked=false;

    We can't sleep while thespinlock is locked!

    We can't use a mutex inan interrupt because

    interrupts can't sleep!

    THE MUTEX SLEEPSTHE SPINLOCK SPINS...

  • 7/29/2019 Raoul Kernel Slides

    8/25

    When to use what?

    Mutex SpinlockShort Lock Time

    Long Lock Time

    Interrupt Context

    Sleeping

    Usually functions that handle memory, user space ordevices and scheduling sleep

    Kmalloc, printk, copy_to_user, schedule

    wake_up_process does not sleep

  • 7/29/2019 Raoul Kernel Slides

    9/25

    Linux Kernel Modules

    Extensibility

    Ideally you don't want topatch but build a kernelmodule

    Separate Compilation

    Runtime-Linkage

    Entry and Exit Functions

    Run in Process Context LKM Hello-World

    #define MODULE

    #define LINUX#define __KERNEL__

    #include #include #include

    static int __init myinit(void){printk(KERN_ALERT "Hello,

    world\n");Return 0;

    }

    static void __exit myexit(void){printk(KERN_ALERT "Goodbye,

    world\n");}

    module_init(myinit);module_exit(myexit);MODULE_LICENSE("GPL");

  • 7/29/2019 Raoul Kernel Slides

    10/25

    The Kernel Loop

    The Linux kernel uses the concept ofjiffies to measure time

    Inside the kernel there is a loop tomeasure time and preempt tasks

    A jiffy is the period at which the timerin this loop is triggered

    Varies from system to system 100Hz, 250 Hz, 1000 Hz.

    Use the variable HZ to get thevalue.

    The schedule function is thefunction that preempts tasks

    schedule()

    Timer1/HZ

    add_timer(1 jiffy)jiffies++

    scheduler_tick()

    tick_periodic:

  • 7/29/2019 Raoul Kernel Slides

    11/25

    Deferring Work / Two Halves

    Kernel Timers are used to createtimed events

    They use jiffies to measure time

    Timers are interrupts

    We can't do much in them!

    Solution: Divide the work in twoparts

    Use the timer handler to signal athread. (TOP HALF)

    Let the kernel thread do thereal job. (BOTTOM HALF)

    Timer

    Timer Handler:wake_up(thread);

    Thread:While(1){

    Do work();Schedule();

    }

    Interruptcontext

    Kernel

    context

    TOP HALF

    BOTTOMHALF

  • 7/29/2019 Raoul Kernel Slides

    12/25

    Linux Kernel Map

  • 7/29/2019 Raoul Kernel Slides

    13/25

    Linux Network ArchitectureSocket Access

    INET UNIX

    VFS

    Socket Splice

    Protocol Families

    NFS SMB iSCSI

    Network Storage

    UDP TCP

    Protocols

    IP

    802.11ethernet

    Network Interface

    Network Device Driver

    File Access

    Logical Filesystem

    EXT4

  • 7/29/2019 Raoul Kernel Slides

    14/25

    Socket Access

    Contains the system callfunctions like socket, connect,accept, bind

    Implements the POSIX

    socket interface Independent of protocols or

    socket types

    Responsible of mapping socket

    data structures to integerhandlers

    Calls the underlying layerfunctions

    sys_socket()sock_create

    sys_socket

    socketIntegerhandler

    Socketcreate

    Handlertable

  • 7/29/2019 Raoul Kernel Slides

    15/25

    Protocol Families

    Implements different socketfamilies INET, UNIX

    Extensible through the useof pointers to functions and

    modules. Allocates memory for the

    socket

    Calls net_proto_familiy

    create for familiy specificinitilization

    *pf

    inet_create

    net_proto_family

    AF_LOCAL

    AF_UNIX

  • 7/29/2019 Raoul Kernel Slides

    16/25

    Socket Splice

    Unix uses the abstraction of Files as first classobjects

    Linux supports to send entire files between file

    descriptors. A descriptor can be a socket

    Also Unix supports Network File Systems

    NFS, Samba, Coda, Andrew The socket splice is responsible of handling

    these abstractions

  • 7/29/2019 Raoul Kernel Slides

    17/25

    Protocols

    Families have multipleprotocols

    INET: TCP, UDP

    Protocol functions are

    stored in proto_ops

    Some functions are notused in that protocol so theypoint to dummies

    Some functions are thesame across manyprotocols and can beshared

    inet_bind

    inet_listen

    inet_stream_connect

    socket inet_stream_ops

    proto_ops

    inet_bind

    NULL

    inet_dgram_connect

    inet_dgram_ops

  • 7/29/2019 Raoul Kernel Slides

    18/25

    Packet Creation

    At the sending function, thebuffer is packetized.

    Packets are represented bythe sk_buff data structure

    Contains pointers the:

    transport layer header

    Link-layer header

    Received Timestamp Device we received it

    Some fields can be NULL

    tcp_send_msg

    tcp_transmit_skb

    ip_queue_xmit

    Struct sk_buf

    char*

    Struct sk_buf

    TCP Header

  • 7/29/2019 Raoul Kernel Slides

    19/25

    Fragmentation and Routing

    Fragmentation is performedinside ip_fragment

    If the packet does not havea route it is filled in by

    ip_route_output_flow There are routing

    mechanisms used

    Route Cache

    Forwarding InformationBase

    Slow Routing

    ip_fragment

    FIB

    Slow routing

    ip_route_output_flow

    Route cache

    forward dev_queue_xmit(queue packet)

    NY

    N

    N

    N

    Y

    Y

    Y

    ip_forward(packet forwarding)

  • 7/29/2019 Raoul Kernel Slides

    20/25

    Data Link Layer The Data Link Layer is

    responsible of packetscheduling

    The dev_queue_xmit isresponsible of enqueing

    packets for transmission inthe qdisc of the device

    Then in process context it istried to send

    If the device is busy weschedule the send for alater time

    The dev_hard_start_xmit isresponsible for sending tothe device

    Dev_queue_xmit(sk_buf)

    Dev qdisc enqueue

    dev_hard_start_xmit()

    Dev qdisc dequeue

  • 7/29/2019 Raoul Kernel Slides

    21/25

    Case Study: iNET

    INET is an EDF (EarliestDeadline First) packetscheduler

    Each Packet has a deadline

    specified in the TOS field We implemented it as a

    Linux Kernel Module

    We implement a packet

    scheduler at the qdisc level. Replace qdisc enqueue and

    dequeue functions

    Enqueued packets are put

    in a heap sorted by deadline

    enqueue(sk_buf)

    dequeue(sk_buf)

    HW

    Deadline

    heap

  • 7/29/2019 Raoul Kernel Slides

    22/25

    High-Performance Network Stacks Minimize copying

    Zero copy technique

    Page remapping

    Use good data structures

    Inet v0.1 used a list instead of a heap

    Optimize the common case

    Branch optimization

    Avoid process migration orcache misses

    Avoid dynamic assignment of interrupts to different CPUs

    Combine Operations within the same layer to minimizepasses to the data

    Checksum + data copying

  • 7/29/2019 Raoul Kernel Slides

    23/25

    High-Performance Network Stacks

    Cache/Reuse as much as you can

    Headers, SLAB allocator

    Hierarchical Design + Information Hiding

    Data encapsulation Separation of concerns

    Interrupt Moderation/Mitigation

    Receive packets in timed intervals only (e.g. ATM)

    Packet Mitigation

    Similar but at the packet level

  • 7/29/2019 Raoul Kernel Slides

    24/25

    Conclusion

    The Linux kernel has 3 main contexts: Kernel, Process andInterrupt.

    Use spinlock for interrupt context and mutexes if you plan tosleep holding the lock

    Implement a module avoid patching the kernel main tree

    To defer work implement two halves. Timers + Threads

    Socket families are implemented through pointers tofunctions (net_proto_family and proto_ops)

    Packets are represented by the sk_buf structure

    Packet scheduling is done at the qdisc level in the Link Layer

  • 7/29/2019 Raoul Kernel Slides

    25/25

    References Linux Kernel Map http://www.makelinux.net/kernel_map

    A. Chimata, Path of a Packet in the Linux Kernel Stack, Universityof Kansas, 2005

    Linux Kernel Cross Reference Source

    R. Love, Linux Kernel Development , 2nd Edition, Novell Press,

    2006 H. Nguyen, R. Rivas, iDSRT: Integrated Dynamic Soft Realtime

    Architecture for Critical Infrastructure Data Delivery over WAN,Qshine 2009

    M. Hassan and R. Jain, High Performance TCP/IP Networking:Concepts, Issues, and Solutions, Prentice-Hall, 2003

    K. Ilhwan, Timer-Based Interrupt Mitigation for High PerformancePacket Processing, HPC, 2001

    Anand V., TCPIP Network Stack Performance in Linux Kernel 2.4

    and 2.5, IBM Linux Technology Center

    http://www.makelinux.net/kernel_maphttp://www.makelinux.net/kernel_map