K. Gondow (Titech, Japan) T. Suzuki (Elmic System Inc, Japan) H. Kawashima (JAIST, Japan)
2004/12/2 APSEC@BUSAN 1
Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C
Overview
Problems:
  - Imprecision in C tools.
  - High development cost of C tools.
Our solution:
  - Binary-level lightweight data integration.
  - As a testbed, DWARF2 is used for developing:
      dxref, rxref: cross-referencers
      bscg: a call-graph extractor
Imprecision in C tools (1/3)
e.g., GNU GLOBAL cannot distinguish a variable 'foo' from a label 'foo'.
  - Users must select one from the candidate list.
  - This is because GNU GLOBAL only partially analyzes source code in order to run very fast.

  int main (void)
  {
    int foo;
    foo: goto foo;
  }

Candidate list shown when 'foo' is clicked:
  foo  3  test.c  int foo;
  foo  4  test.c  foo: goto foo;
Imprecision in C tools (2/3)
e.g., Murphy's study: "An Empirical Study of Static Call Graph Extractors", by Murphy et al., ICSE, 1996.
It reports that "call graphs extracted by several broadly distributed tools vary significantly enough to surprise many experienced software engineers."
Imprecision in C tools (3/3)
(Figure: quantitative results from mosaic, quoted from Murphy's paper, comparing the calls reported by cflow and Field: cflow∩Field, cflow-Field, Field-cflow.)
Why imprecision? (1/2)
Reason #1: many tools only partially parse source code, resulting in incomplete analysis.
  - e.g., GNU GLOBAL, cxref, LXR, cscope, cflow, ...
At first glance, full parsing seems to solve this problem, but...
Why imprecision? (2/2)
Reason #2: C source code is difficult to fully analyze because of:
  - Compiler-specific extensions.
      e.g., asm for inline assembly code.
  - Ambiguous behaviors in the C standards.
      undefined, unspecified, implementation-defined.
      e.g., padding in a structure.
Compiler-specific extensions
Essential in C and embedded software.
  - e.g., asm is used to obtain H/W error code.
  - e.g., long long is used in C89's <stdio.h>.
They make it hard to analyze source code.
  - Different compilers have different semantics.

  void page_fault_handler (uint32_t error)
  {
    uint32_t cr2;
    /* IA-32 control register #2 */
    asm volatile ("movl %%cr2,%0":"=r"(cr2));
    ...
  }
Ambiguous behaviors in C (1/2)
Intentional and essential to keep C compilers fast and simple.
e.g., padding in a structure is an implementation-defined behavior.
  - This makes pointer analysis hard.
  - "Pointer analysis for programs with structures and casts", by Suan Hsi Yong et al., PLDI'99.
Ambiguous behaviors in C (2/2)
Different padding on different platforms.
  - To obtain precise dataflow, tools need to know the padding values of the compiler.
  - But it is hard...

  struct S {char c; int *ip; } *p;
  struct T {char c; int i; } t;
  t.i = 0x1234;
  p = (struct S *)&t;
  printf ("%p\n", p->ip);

(Figure: memory layouts of struct S (c, padding, ip) and struct T (c, padding, i) on Solaris8 (32bit) and Solaris8 (64bit); whether p->ip depends on t.i or not differs between the two.)
Possible solutions
  - To modify compilers (e.g., GCC) to emit their analyzed internal data.
      Seemingly high development cost.
      Many compilers to be modified.
  - To use binary information in executables emitted by compilers.
      Relatively easy, although it lacks some information, e.g., statements.
Our solution and result
Our solution:
  - Uses DWARF2 debugging information as binary information.
Preliminary experiment:
  - Good results for our cross-referencers and call-graph extractor.
  - Better precision, although:
      some false negatives increased.
      quantitative results are not yet obtained.
Demonstration
Using DWARF2, we implemented:
  - two cross-referencers:
      dxref: only uses DWARF2. Sample output: dxref
      rxref: hybrid of dxref and GNU GLOBAL. Sample output: dxref
  - a static call-graph extractor:
      bscg: uses DWARF2 and a disassembler. Sample outputs: fact, dxref, bash, bash
DWARF2-XML
(Figure: C code is compiled into a binary (ELF/DWARF2: text, data, symbol info., relocation info., debug info.); DWARF2-XML is extracted from the binary as a common format (data integration) and used by dxref, rxref (cross-referencers) and bscg (call graph extractor).)
How bscg works
(1) extract call instructions by disassembling text.
(2) convert addresses to symbols using DWARF2.
      1234: call 5678  ->  main: call fact
(3) trim call graphs according to options (usage).
(4) output graph topology in DOT of Graphviz.
      digraph G { main -> fact; fact -> fact; }
Advantages of bscg
  - Advantages of binary-level DI (explained later).
      e.g., high applicability and few false positives.
  - Can identify inlined functions.
  - Can extract a call from asm ("call fact");
  - Can exclude:
      library functions: e.g., printf
      system calls: e.g., open, fork
      functions in runtime systems: _start, _fini
Disadvantages of bscg
  - No support for macro calls, signals, function pointers, optimization.
      gprof-callgraph.pl can handle function pointers, since it uses dynamic information.
      Source-level ones (e.g., cflow) don't suffer from the optimization problem.
So, is bscg good?
Yes! (not the best, of course) Not easy to compare.
What is binary-level DI?
Provides common formats by extracting information from binary code.
(Figure: source code (*.c) is compiled into binary code (a.out). Source DI analyzes the source code and binary DI analyzes the binary code to produce common formats (e.g., DWARF2-XML), which tools use.)
Why binary-level DI?
Many advantages:
  - High applicability.
  - Few false positives.
  - More true positives for low-level info.
  - Low development cost.
Can improve C tools' precision.
What is lightweight DI?
Allows several common formats.
  - To be practical! It is hard to integrate perfectly.
(Figure: light-weight DI vs. heavy-weight DI; DWARF2-XML.)
Summary
Imprecision in C tools.
Our solution:
  - Binary-level lightweight data integration.
  - As a testbed, DWARF2 is used for developing:
      dxref, rxref: cross-referencers
      bscg: call-graph extractor
Future works
  - Apply our technique to other tools:
      e.g., memory profilers, slicers, test coverage tools, ...
  - Develop new binary formats suitable for lower CASE tools.
      tool-information carrying code.
      cf. proof-carrying code, model-carrying code, schedule-carrying code.
Taxonomy of cross referencers.
  - Source-level
      Partial-parsing: GNU GLOBAL, LXR, ...
      Full-parsing: Sapid, ACML
  - Binary-level
      Symbol tables: Visual Studio .NET(?)
      Debug info.: dxref
      Hybrid: rxref
What is DWARF2?
A binary format for debugging information.
Primary target languages: C, C++, Fortran, Modula2, Pascal.
Includes: types, nested blocks, line numbers, function/object names, addresses, stack frame information, ...
DWARF2-XML
Our common format in XML for DWARF2.
A testbed of binary-level lightweight DI.
Makes it easier to process DWARF2. cf. libdwarf
About 15 times larger than DWARF2.
DWARF2-XML example
C source: { int i; ... }

  <section name=".debug_info">
    <tag name="DW_TAG_lexical_block" offset="id:27">
      <attribute name="DW_AT_low_pc" value="67328"/>
      <attribute name="DW_AT_high_pc" value="67356"/>
      ...
      <tag name="DW_TAG_variable" offset="id:27">
        <attribute name="DW_AT_name" value="i"/>
        <attribute name="DW_AT_type" value_ref="id:161">
        <attribute name="DW_AT_location">
          <description>DW_OP_fbreg: -24</description></></></></>
    ...
    <tag name="DW_TAG_base_type" offset="id:161">
      <attribute name="DW_AT_name" value="int"/>
      <attribute name="DW_AT_byte_size" value="4"/>
      <attribute name="DW_AT_encoding" value="5">
        <description>signed</description></></></>

Annotations:
  - DW_AT_low_pc / DW_AT_high_pc: address range
  - DW_AT_name: variable name
  - DW_AT_location (DW_OP_fbreg): offset to base ptr.
  - DW_AT_type value_ref: ID/IDREF link
DWARF2-XML file sizes
  - About 15 times larger than DWARF2.
  - Size increase is almost cancelled by gzip.
  - Consumes much memory when using DOM.
      e.g., we cannot build a DOM tree for gdb in our environment.
      Tradeoff between memory consumption and low development cost.

  program      source   a.out    .debug_*  DWARF2-XML  DWARF2-XML compressed by gzip
  x_debug.c    27KB     77KB     50KB      1.1MB       58KB
  readelf+.c   315KB    575KB    137KB     2.1MB       128KB
  bash         1.2MB    2.9MB    705KB     16.3MB      815KB
  gdb          12MB     21.5MB   14.4MB    276MB       14MB

gdb's LOC is about 400,000.
Execution speed
bscg is slower than the others, but acceptable for practical use.
  - 12000 lines in 8.8 sec.
  - but too bad in the case of bash-2.03.
bscg has a scalability problem due to the heavy overhead of the DOM library.
Why XML?
  - Highly readable, portable, interoperable.
      plain-text and self-descriptiveness.
  - Powerful enough to describe complex structures and relations in programs.
      Nested tags and ID/IDREF links.
      DTD for checking XML documents.
      Flexibility to process semi-structured documents.
  - Easy to query/display/modify.
      XML parsers, DOM/SAX, XPath.
      XPath's description is much smaller than boring tree traversal code.
Drawbacks in API integration
  - Insufficient abstraction.
      The many and varied data structures and access patterns make it hard to encapsulate them well in a fixed API.
      e.g., libdwarf has a poor API for traversing a wide range of the data tree (only dwarf_siblingof and dwarf_child are provided).
  - High cost to implement an API in many languages.
  - High cost to learn how to use an API.
      e.g., libdwarf
false/true positive/negative
  - false positives: tool's incorrect output.
  - true positives: tool's correct output.
  - false negatives: tool's incorrect silence. The tool should have produced output, but did not.
  - true negatives: tool's correct silence. The tool should not have produced output, and did not.
bscg's graph trimming options
Why lightweight DI?
To be practical! It is hard to integrate perfectly.
  - Supported by the fact that most technologies gave up perfect integration/definition.
      e.g., undefined behaviors in C.
      e.g., GNU BFD gives an API integrating different binary formats.
        useful, but not perfect.
        cannot convert ELF/DWARF2 into Windows PE.
Why is function pointer analysis difficult in C?
  - Pointer arithmetic and casting.
      e.g., (int (*)())(base + offset)
  - Dynamic library
      e.g., handle = dlopen (libname, RTLD_LAZY); func = dlsym (handle, funcname); func ();
  - Inline assembly code
      e.g., asm ("call foo");
CASE tools development cost
Generally very high.
  - individual parsers & analyzers.
  - internal data is less interoperable and portable.
IBM Eclipse: $40,000,000 (?)
E.g., function pointer
  - Cflow: apply calls f (false positive)
  - gprof-callgraph.pl: apply calls add5 (true positive)
  - Other tools (bscg): apply calls ? (false negative)

  int add5 (int x) { return x + 5; }
  int apply (int (*f)(int), int x) { return f (x); }
  int main (void) { return apply (add5, 10); }
Our homepage
http://www.sde.cs.titech.ac.jp/~gondow/dwarf2-xml/
  - DTD for DWARF2-XML
  - Source code of readelf+, dxref, rxref, bscg
  - Some sample outputs