Post on 16-Apr-2017
Debugging techniques in the Linux kernel
Agenda
Overview (kernel programming)
Kernel crash classification
Debugging techniques
Example(s)
Q/A
What's a kernel?
The kernel provides an abstraction layer for the applications to use the physical hardware resources
Kernel basic facilities
Process management
Memory management
Device management
System call interface
User space
Good for debugging (gdb)
Lots of user-space libraries available
Unpredictable latency (context switch, scheduler, syscall, ...)
Overhead
Cannot fully interact with interrupt routines
Cannot access certain memory addresses
More difficult to share certain features with other drivers
Reliability: user processes can be terminated upon critical system events (OOM, filesystem errors, etc.)
Kernel space
Written in C and assembly
No easy debugging tools (only kgdb, UML, ...)
Bugs can hang the entire system
User memory is swappable, kernel memory can't be swapped out
Kernel stack size is small (8K / 4K - THREAD_SIZE_ORDER)
Floating point is forbidden
Userspace libraries are not available
Linux kernel must be portable (this is important if you plan to contribute to mainline)
Closed source kernel modules taint the kernel
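The "8K / 4K" stack figures in the list above follow from the way the kernel sizes each thread's stack: THREAD_SIZE is PAGE_SIZE shifted left by THREAD_SIZE_ORDER. A minimal user-space sketch of that arithmetic, assuming 4 KiB pages; the helper name kernel_stack_bytes is made up for illustration and is not a kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a kernel API): mirrors the kernel's
 * THREAD_SIZE = PAGE_SIZE << THREAD_SIZE_ORDER definition. */
size_t kernel_stack_bytes(size_t page_size, unsigned int order)
{
	/* A page size is always a non-zero power of two. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return page_size << order;
}
```

With 4096-byte pages, order 1 gives the 8K stack and order 0 gives the 4K stack. Either way the budget is tiny compared with user space, which is why large local arrays and deep recursion are avoided in kernel code.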
Example kernel module
#include <linux/init.h>
#include <linux/module.h>

/* Module constructor: runs when the module is loaded */
static int __init hello_init(void)
{
	printk(KERN_INFO "Hello, world!\n");
	return 0;
}

/* Module destructor: runs when the module is unloaded */
static void __exit hello_exit(void)
{
	printk(KERN_INFO "Goodbye\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Andrea Righi");
MODULE_DESCRIPTION("BetterEmbedded hello world example");
Kernel problems
Kernel panic (fatal error for the system)
Kernel oops (non-fatal error)
Wrong result (fatal from user's perspective)
Kernel panic
No recovery is possible
Example: exception in an atomic context (e.g., in an interrupt handler)
Typically results in a system reboot (panic=N), a blinking LED, or just a hang
[ 165.552280] general protection fault: 0000 [#1] PREEMPT SMP
[ 165.553055] Modules linked in: crashtest(O) [last unloaded: crashtest]
[ 165.553092] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G O 3.10.0-rc7+ #535
[ 165.553092] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 165.553092] task: ffff88003d90a2c0 ti: ffff88003d92e000 task.ti: ffff88003d92e000
[ 165.553092] RIP: 0010:[] [] __kmalloc_track_caller+0xd5/0x2b0
[ 165.553092] RSP: 0018:ffff88003e003988 EFLAGS: 00010206
[ 165.553092] RAX: 0000000000000000 RBX: ffff88003e1d6a20 RCX: 00000000000be841
[ 165.553092] RDX: 00000000000be801 RSI: 0000000000000000 RDI: 0000000000000001
[ 165.553092] RBP: ffff88003e0039c8 R08: 00000000001d6a20 R09: 0000000000000000
[ 165.553092] R10: 0000000000000000 R11: 0000000000000001 R12: 7878787878787878
[ 165.553092] R13: 0000000000010220 R14: 0000000000000240 R15: ffff88003d801780
[ 165.553092] FS: 0000000000000000(0000) GS:ffff88003e000000(0000) knlGS:0000000000000000
[ 165.553092] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 165.553092] CR2: 00000000081ab008 CR3: 0000000037dc8000 CR4: 00000000000006e0
[ 165.553092] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 165.553092] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 165.553092] Stack:
[ 165.553092] 00000000000be801 ffff88003d92ffd8 ffffffff8161683d ffff880034e3f300
[ 165.553092] ffff88003e003a17 0000000000000020 0000000000000240 0000000000000000
[ 165.553092] ffff88003e003a00 ffffffff8161433c ffff880034e3f300 0000000000000020
...
[ 165.553092] Call Trace:
[ 165.553092]
[ 165.553092] [] ? __alloc_skb+0x7d/0x290
[ 165.553092] [] __kmalloc_reserve.isra.52+0x3c/0xa0
[ 165.553092] [] __alloc_skb+0x7d/0x290
[ 165.553092] [] tcp_send_ack+0x3b/0xf0
[ 165.553092] [] __tcp_ack_snd_check+0x5e/0xa0
[ 165.553092] [] tcp_rcv_established+0x204/0x6f0
[ 165.553092] [] ? put_lock_stats.isra.26+0xe/0x40
[ 165.553092] [] tcp_v4_do_rcv+0x161/0x360
[ 165.553092] [] ? _raw_spin_lock_nested+0x79/0x90
[ 165.553092] [] tcp_v4_rcv+0x731/0x980
[ 165.553092] [] ? __lock_is_held+0x5f/0x80
[ 165.553092] [] ip_local_deliver_finish+0xc8/0x2f0
[ 165.553092] [] ? ip_local_deliver_finish+0x4a/0x2f0
[ 165.553092] [] ip_local_deliver+0x47/0x80
[ 165.553092] [] ip_rcv_finish+0x140/0x5e0
[ 165.553092] [] ip_rcv+0x233/0x380
[ 165.553092] [] __netif_receive_skb_core+0x6a2/0x970
[ 165.553092] [] ? __netif_receive_skb_core+0x50/0x970
[ 165.553092] [] __netif_receive_skb+0x21/0x70
[ 165.553092] [] netif_receive_skb+0x23/0x1f0
[ 165.553092] [] napi_gro_receive+0x98/0xd0
[ 165.553092] [] e1000_clean_rx_irq+0x18a/0x520
[ 165.553092] [] e1000_clean+0x251/0x910
[ 165.553092] [] ? put_lock_stats.isra.26+0xe/0x40
[ 165.553092] [] ? lock_release_holdtime.part.27+0xd4/0x160
[ 165.553092] [] net_rx_action+0xd5/0x2e0
[ 165.553092] [] __do_softirq+0xf7/0x420
[ 165.553092] [] irq_exit+0xb5/0xc0
[ 165.553092] [] do_IRQ+0x63/0xd0
[ 165.553092] Code: c8 48 8b 55 c0 48 8b 81 38 e0 ff ff a8 08 0f 85 5f 01 00 00 4c 8b 23 4d 85 e4 0f 84 15 01 00 00 49 63 47 20 48 8d 4a 40 4d 8b 07 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 97 49 63
[ 165.553092] RIP [] __kmalloc_track_caller+0xd5/0x2b0
[ 165.553092] RSP
[ 165.553092] ---[ end trace baac76a23c6da73c ]---
[ 165.553092] Kernel panic - not syncing: Fatal exception in interrupt
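Each entry in the call trace above has the form symbol+offset/size: the offset is how far into the function the address lies, and the size is the total length of that function. A small user-space sketch of splitting such an entry into its parts; parse_trace_entry is a hypothetical helper written for this example, not part of any kernel tool:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: split "symbol+0xOFF/0xSIZE" into its parts.
 * Returns 0 on success, -1 if the entry does not match the format. */
int parse_trace_entry(const char *entry, char sym[64],
		      unsigned long *offset, unsigned long *size)
{
	assert(entry && sym && offset && size);
	/* %63[^+] reads the symbol up to '+'; %lx accepts the 0x prefix. */
	if (sscanf(entry, "%63[^+]+%lx/%lx", sym, offset, size) != 3)
		return -1;
	return 0;
}
```

For the faulting instruction in this panic, __kmalloc_track_caller+0xd5/0x2b0 means the fault happened 0xd5 bytes into a function that is 0x2b0 bytes long. Entries prefixed with "? " are addresses found on the stack that may not belong to the real call chain; the prefix would have to be stripped before calling the helper.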
Kernel oops
A message is displayed in the log when a recoverable error has occurred in kernel space
Example: access to a bad address (e.g., NULL pointer dereference)
An oops does not mean the system has crashed
Current process is killed
Oops message is displayed along with a registers dump and a stack trace
[ 75.962412] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 75.963046] IP: [] procfs_write+0x2d6/0x320 [crashtest]
[ 75.963046] PGD 3a78d067 PUD 362be067 PMD 0
[ 75.963046] Oops: 0002 [#1] PREEMPT SMP
[ 75.963046] Modules linked in: crashtest(O)
[ 75.963046] CPU: 0 PID: 1587 Comm: bash Tainted: G O 3.10.0-rc7+ #535
[ 75.963046] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 75.963046] task: ffff88003a7ec580 ti: ffff8800362f6000 task.ti: ffff8800362f6000
[ 75.963046] RIP: 0010:[] [] procfs_write+0x2d6/0x320 [crashtest]
[ 75.963046] RSP: 0018:ffff8800362f7e78 EFLAGS: 00010297
[ 75.963046] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 000000000000004e
[ 75.963046] RDX: 0000000000000000 RSI: ffffffffa0000469 RDI: ffff8800362f7eaa
[ 75.963046] RBP: ffff8800362f7ee0 R08: 0000000000000000 R09: 0000000000000000
[ 75.963046] R10: ffff88003a7ec580 R11: 0000000000000000 R12: 0000000000000003
[ 75.963046] R13: 000000000000000a R14: ffff8800362f7f50 R15: 0000000000000000
[ 75.963046] FS: 0000000000000000(0000) GS:ffff88003de00000(0063) knlGS:00000000f75f76c0
[ 75.963046] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 75.963046] CR2: 0000000000000000 CR3: 0000000036209000 CR4: 00000000000006f0
[ 75.963046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 75.963046] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 75.963046] Stack:
[ 75.963046] ffffffff811b66cb 0000000000000000 0000000000000000 ffff88003a7ec580
[ 75.963046] ffff8800362f7ec8 4f49545045435845 000000000000004e 0000000000000000
[ 75.963046] 0000000000000000 00000000463b9fa0 ffff8800362fd300 000000000000000a
[ 75.963046] Call Trace:
[ 75.963046] [] ? vfs_write+0x1bb/0x1f0
[ 75.963046] [] proc_reg_write+0x3d/0x80
[ 75.963046] [] vfs_write+0xc8/0x1f0
[ 75.963046] [] SyS_write+0x55/0xa0
[ 75.963046] [] sysenter_dispatch+0x7/0x1f
[ 75.963046] [] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 75.963046] Code: e1 f3 6f e1 48 c7 c7 60 09 00 a0 e8 d5 f3 6f e1 e9 e2 fd ff ff c7 45 d0 78 56 34 12 e9 d6 fd ff ff e8 bf fc ff ff e9 cc fd ff ff 04 25 00 00 00 00 00 00 00 00 e9 bc fd ff ff eb fe 66 c7 07
[ 75.963046] RIP [] procfs_write+0x2d6/0x320 [crashtest]
[ 75.963046] RSP
[ 75.963046] CR2: 0000000000000000
[ 75.998054] ---[ end trace 33bbddb47601039c ]---
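The "Oops: 0002" field in the log above is the x86 page-fault error code, and its low bits say what kind of access failed. The sketch below decodes the five lowest bits in user space; the struct and function names are invented for this example, and only the bit meanings come from the x86 architecture:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Decoded view of the low bits of the x86 page-fault error code. */
struct oops_code {
	bool protection;   /* bit 0: 1 = protection violation, 0 = page not present */
	bool write;        /* bit 1: 1 = write access, 0 = read access */
	bool user_mode;    /* bit 2: 1 = fault in user mode, 0 = kernel mode */
	bool reserved_bit; /* bit 3: 1 = reserved bit set in a page-table entry */
	bool instr_fetch;  /* bit 4: 1 = fault on an instruction fetch */
};

/* Hypothetical helper: fill *out from the number printed after "Oops:". */
void decode_oops_code(unsigned long code, struct oops_code *out)
{
	assert(out != NULL);
	out->protection   = (code & 0x01) != 0;
	out->write        = (code & 0x02) != 0;
	out->user_mode    = (code & 0x04) != 0;
	out->reserved_bit = (code & 0x08) != 0;
	out->instr_fetch  = (code & 0x10) != 0;
}
```

For 0002 this yields a write, in kernel mode, to a page that is not present, which matches the NULL pointer dereference reported in the first line of the log (CR2 holds the faulting address, 0).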
Kernel fault classification
panic(have a nice day... ;-))
BUG() / BUG_ON(condition)
exception (i.e., invalid opcode, division by zero, ...)
memory corruption
stack overflow/underflow
NOTE: in kernel space the stack size is limited to 2 pages (8K on almost all architectures)
write after free
write to a bad address
concurrent access without protections (locks, etc.)
soft lockup: a task holds a CPU without giving other tasks a chance to run
hard lockup: a CPU is held without giving other tasks or interrupts a chance to run
hung task: task doesn't get a chance to run for more than N seconds
scheduling while atomic
deadlock
use FPU registers in kernel space
Useful debugging kernel options
Kernel Hacking section ->
CONFIG_KALLSYMS_ALL: print function names instead of addresses in kernel messages
CONFIG_FRAME_POINTER: get useful stack info in case of kernel bugs
CONFIG_DEBUG_ATOMIC_SLEEP: check for sleeping inside atomic sections (e.g., sleeping in an interrupt handler or while holding a spinlock)
CONFIG_LOCKUP_DETECTOR: detect hard and soft lockups
CONFIG_LOCKDEP: lock dependency engine (deadlock detection)
CONFIG_DYNAMIC_FTRACE: enable individual function tracing dynamically (via debugfs /sys/kernel/debug/tracing)
Debugging techniques
blinking LED
printk()
procfs
SysRq key (Documentation/sysrq.txt)
function instrumentation (kprobes)
dynamic ftrace (CONFIG_DYNAMIC_FTRACE)
debugger (kgdb)
printk()
Advantages
easy to use
no need for any other system support
Disadvantages
have to modify and rebuild the kernel/modules
no interactive debugging
printk(): levels
printk levels
KERN_EMERG: system is unusable
KERN_ALERT: action must be taken immediately
KERN_CRIT: critical condition
KERN_ERR: error condition
KERN_WARNING: warning condition
KERN_NOTICE: normal but significant condition
KERN_INFO: informational
KERN_DEBUG: debug message
Show kernel messages:
# dmesg
Redirect all kernel messages to the console:
# echo 8 > /proc/sys/kernel/printk
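Why does writing 8 send everything to the console? The first value in /proc/sys/kernel/printk is the console log level, and a message reaches the console only when its own level is numerically lower than that value. A user-space sketch of the rule; the enum and function names below are made up for illustration, while the numeric values match the KERN_* levels listed above (0 for KERN_EMERG through 7 for KERN_DEBUG):

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric values behind the KERN_* prefixes (0 is the most severe). */
enum loglevel {
	LVL_EMERG = 0, LVL_ALERT, LVL_CRIT, LVL_ERR,
	LVL_WARNING, LVL_NOTICE, LVL_INFO, LVL_DEBUG
};

/* Hypothetical helper: true if a message of msg_level is shown on the
 * console for the given console log level. */
bool reaches_console(int msg_level, int console_loglevel)
{
	assert(msg_level >= LVL_EMERG && msg_level <= LVL_DEBUG);
	return msg_level < console_loglevel;
}
```

With a console log level of 8, even KERN_DEBUG (7) passes the test. With a value of 4, only KERN_ERR and more severe messages reach the console, although everything is still stored in the log buffer shown by dmesg.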
procfs
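#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

/* Called by seq_file to produce the content returned on read */
static int procfs_read(struct seq_file *m, void *v)
{
	...
}

/* Called when user space writes to the proc entry */
static ssize_t procfs_write(struct file *file, const char __user *ubuf,
			    size_t count, loff_t *pos)
{
	...
}

static int procfs_open(struct inode *inode, struct file *file)
{
	return single_open(file, procfs_read, NULL);
}

static int procfs_release(struct inode *inode, struct file *file)
{
	/* Free what single_open() allocated */
	return single_release(inode, file);
}

/* Kernels 5.6 and later take a struct proc_ops here instead */
static const struct file_operations procfs_fops = {
	.open = procfs_open,
	.read = seq_read,
	.write = procfs_write,
	.llseek = seq_lseek,
	.release = procfs_release,
};

static int __init myproc_init(void)
{
	/* Creates /proc/myproc */
	if (!proc_create("myproc", 0666, NULL, &procfs_fops))
		return -ENOMEM;
	return 0;
}

static void __exit myproc_exit(void)
{
	remove_proc_entry("myproc", NULL);
}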
Kprobes (Kernel probes)
Kprobes let you dynamically break into almost any kernel routine and collect debugging and performance information (CONFIG_KPROBES=y)
Trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit
How does it work?
Make a copy of the probed instruction and replace the original instruction with a breakpoint instruction (int3 on x86)
When the breakpoint is hit, a trap occurs, the CPU registers are saved, and control passes to the Kprobes pre-handler
The saved instruction is executed in single-step mode
The Kprobes post-handler is executed
The rest of the original function is executed
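The ordering of the steps above can be pictured with a toy model. This is plain user-space C and does not use the Kprobes API at all: every type and function in it (step_log, toy_probe, hit_probed_address, and so on) is invented for this sketch, and the "trap" and "single-step" are just recorded events rather than real CPU behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* The steps from the list above, in the order they are expected to run. */
enum step {
	STEP_TRAP, STEP_PRE_HANDLER, STEP_SINGLE_STEP,
	STEP_POST_HANDLER, STEP_RESUME
};

#define MAX_STEPS 8

/* Records which steps ran, so the ordering can be inspected afterwards. */
struct step_log {
	enum step steps[MAX_STEPS];
	size_t count;
};

void record(struct step_log *log, enum step s)
{
	assert(log != NULL && log->count < MAX_STEPS);
	log->steps[log->count++] = s;
}

/* Toy stand-in for struct kprobe: just the two optional handlers. */
struct toy_probe {
	void (*pre_handler)(struct step_log *log);
	void (*post_handler)(struct step_log *log);
};

void toy_pre(struct step_log *log)  { record(log, STEP_PRE_HANDLER); }
void toy_post(struct step_log *log) { record(log, STEP_POST_HANDLER); }

/* Models execution reaching an address that has a probe installed. */
void hit_probed_address(const struct toy_probe *p, struct step_log *log)
{
	record(log, STEP_TRAP);        /* breakpoint fires, registers saved */
	if (p->pre_handler)
		p->pre_handler(log);
	record(log, STEP_SINGLE_STEP); /* saved copy of the instruction runs */
	if (p->post_handler)
		p->post_handler(log);
	record(log, STEP_RESUME);      /* rest of the probed function continues */
}
```

Calling hit_probed_address() with both handlers set records trap, pre-handler, single-step, post-handler, resume, in that order. Leaving a handler NULL simply skips that step, much as a real probe may register only a pre-handler.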
Kprobes (example)
#include <linux/kprobes.h>
#include <linux/module.h>

/* Pre-handler: runs just before the probed instruction executes */
static int my_handler(struct kprobe *p, struct pt_regs *regs)
{
	/* Do something here... */
	return 0;
}

static struct kprobe my_kp = {
	.pre_handler = my_handler,
	.symbol_name = "schedule_timeout",
};

static int __init my_kprobe_init(void)
{
	int ret;

	ret = register_kprobe(&my_kp);
	if (ret < 0) {
		printk(KERN_INFO "%s: error %d\n", __func__, ret);
		return ret;
	}
	return 0;
}

static void __exit my_kprobe_exit(void)
{
	unregister_kprobe(&my_kp);
}
Dump a stack trace
#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/sched.h>

static const char function_name[] = "schedule_timeout";

/* Pre-handler: dump the stack and log the caller and first argument */
static int my_handler(struct kprobe *p, struct pt_regs *regs)
{
	dump_stack();
	/* regs->di holds the first function argument on x86-64 */
	printk(KERN_INFO "%s called %s(%d)\n",
	       current->comm, function_name, (int)regs->di);
	return 0;
}

static struct kprobe my_kp = {
	.pre_handler = my_handler,
	.symbol_name = function_name,
};

static int __init my_kprobe_init(void)
{
	int ret;

	ret = register_kprobe(&my_kp);
	if (ret < 0) {
		printk(KERN_INFO "%s: error %d\n", __func__, ret);
		return ret;
	}
	return 0;
}

static void __exit my_kprobe_exit(void)
{
	unregister_kprobe(&my_kp);
}
Dynamic ftrace
# mount -t debugfs none /sys/kernel/debug
# cd /sys/kernel/debug/tracing
# echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
# echo function > current_tracer
# echo 1 > tracing_on
# usleep 1
# echo 0 > tracing_on
# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 5/5   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
          usleep-2665  [001] ....  4186.475355: sys_nanosleep