Recovering Components from Executables [Cooperative Agreement HR0011-12-2-0012] Thomas Reps...

Recovering Components from Executables[Cooperative Agreement HR0011-12-2-0012]

Thomas RepsUniversity of Wisconsin

Thomas Reps JungheeLim

EvanDriscoll

VenkateshSrinivasan

TusharSharma

AdityaThakur

DARPA BET IPR 2

Project Goals

• Develop a “redeveloper’s workbench”– Aid in the harvesting of components from an

executable• identify components• make adjustments to components identified• issue queries about a component’s properties

– Queries• type information; function prototypes• side-effect “footprint”• error-triggering properties

DARPA BET IPR 3

Basic scenario

DARPA BET IPR 4

Project Activities

• Component identification– generalized concept analysis– object traces

• Component extraction– specialization slicing– inferred invariants for function summaries– partial evaluation (planned)

• Verifying component properties– affine-inequality domain for modular arithmetic– symbolic abstraction (BET + ONR STTR)– format-compatibility checking (ONR)– stretched-Tree-IC3 (planned)

DARPA BET IPR 5

Project Activities

• Component identification– generalized concept analysis– object traces

• Component extraction– specialization slicing– inferred invariants for function summaries– partial evaluation (planned)

• Verifying component properties– affine-inequality domain for modular arithmetic– symbolic abstraction (BET + ONR STTR)– format-compatibility checking (ONR)– stretched-Tree-IC3 (planned)

DARPA BET IPR 6

Outline of Talk

• Review of goals• Progress (May-Oct. 2012) and plans (2013)

– Component identification• generalized concept analysis• object traces

– Component extraction• specialization slicing• inferred invariants for function summaries• partial evaluation (planned)

– Verifying component properties• affine-inequality domain for modular arithmetic• symbolic abstraction (BET + ONR STTR)• format-compatibility checking (ONR)• stretched-Tree-IC3 (planned)

• Recap of publications/submissions• Recap of plans for 2013

DARPA BET IPR 7

Outline of Talk






Dagstuhl Seminar 2012 8

Component Identification – Attempt 1

• Aim: Identify classes/modules in stripped binaries• Key Idea: Group procedures by the data that they use

and/or modify• Methodology– Formal Concept Analysis (FCA)– procedures versus (GMOD set x GREF set)

VenkateshSrinivasan

9

Concept Analysis

Universe

{humans, whales}

{dolphins, gibbons, humans, whales}

{dolphins, gibbons} {humans, gibbons}

Generalized Concept Analysis

• Semi-lattice attributes instead of Boolean attributes– Values in range [0.0 ... 1.0]

• degree of uncertainty about an attribute’s value– Physical types

• subclass relationships– MOD/REF information

• procedure footprints

10

Footprint Example

// Linked-list operationsint ll_length(struct node* head);void ll_push(struct node** headRef, int newData);int ll_count(struct node* head, int searchFor);int ll_sum(struct node* head);void ll_reverse(struct node** headRef);

// Binary-search-tree operationsint bst_lookup(struct bnode* bnode, int target);struct bnode* bst_newbnode(int data);struct bnode* bst_insert(struct bnode* bnode, int data);struct bnode* bst_build123c();int bst_size(struct bnode* bnode);

11

1. Compile2. Build CodeSurfer/x86 project3. Compute IMOD/IREF (and GMOD/GREF) information4. Run concept analysis


IMOD, IREF, GMOD, and GREF sets

• IMOD (IREF): the set of a-locs (abstract locations) that a procedure directly modifies (uses)

• GMOD (GREF): of a procedure is the set of a-locs that the procedure and it callees modify (use)

• Elements include– heap a-locs (heap allocated objects)– global a-locs (global objects) – local a-locs (objects allocated on the stack) by itself or

other procedures

13

Footprint Example (IMOD U IREF)

Access next fields

Access data, left, and right fields

Access data and next fields

Access left and right fields

14

Footprint Example (IMOD x IREF)

Write data and next

Read next;write next

Write next

Read next

Read data and

next


Experiments and Results

• Experiments– 3 open source C++ apps– ground truth (classes in

application) obtained from source code

– FCA using GMOD X GREF – each class is matched

against a concept using weighted bipartite matching

• Accuracy = • - set of functions in class • – set of functions in

matched concept

Application Accuracy

TinyXML(Xml parser)

0.37

SmillaEnlarger(Bitmap enlarger)

0.21

PonyProg(Serial device programmer)

0.14


Component Identification – Attempt 2

• Switch gears to dynamic analysis– identify classes/modules in stripped binaries– modulo code coverage

• Key idea– common idiom in object-oriented programming:

“this” pointer = 1st argument of methods of a class– record <function, 1st-argument value> pairs– use to classify sets of functions


Object Traces

• Instrument binary to trace allocated objects and values of 1st-arguments of methods

• Group methods based on “this”-pointer values• Emit object traces, pairs <A, S> where– A is an object address– S is the sequence of methods that were passed A as the

value of the “this” pointer (1st argument)


Challenges

• Re-use of memory– heap-allocated objects

• instrument malloc, free, new, and delete• track lifetimes

– stack-allocated objects• trace stack pointer, activation-record boundaries, call-suffix

• Static methods do not have “this” pointer as 1st argument– track all currently allocated objects and filter out methods that

have an invalid address as first argument


Challenges

• False traces– Two or more object

traces appear concatenated because of the reuse of same stack space by compiler for different objects in different scopes

– Locate initializer and finalizer methods to split false traces

void f() {{

Foo f;…

}{

Bar b;…

}}

Stack space reused for f

and b

AR of f()


Future Work: Class Hierarchies

• Aim: Identify class boundaries and class hierarchies from object traces

• Key Idea: Exploit order of constructor/ destructor calls for derived class objects

• Challenges:– Constructor/destructor inlining– Finding the right classes for static methods

DARPA BET IPR 21

Outline of Talk






Specialization Slicing

• Problem statement– Ordinary “closure slices” can have mismatches between

call sites and called procedures• different call sites have different subsets of the parameters

– Idea: specialize the called procedures– Challenge: find a minimal solution (minimal duplication)

22MinAung

23


int g1, g2, g3;

void p(int a, int b) {

g1 = a; g2 = b; g3 = g2;}

int main() { g2 = 100; p(g2, 2); p(g2, 3); p(4, g1+g2); printf("%d", g2);}

int g1, g2;


g1 = a; g2 = b; }

int main() { p( 2); p(g2, 3); p( g1+g2); printf("%d", g2);}

int g1, g2;

void p1(int b) { g2 = b; }

void p2(int a, int b) {

g1 = a; g2 = b; }

int main() { p1(2); p2(g2, 3); p1(g1+g2); printf("%d", g2);}

(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)

Closure slice Specialized slice

24

m4: g2m3: call p

m21: call printf

m5: 2 m6: g3m7: g2

m23: g2C1m2: g2=100

m8: g1

C2C3

p3: b p7: g3 p8: g2p2: a p9: g1p6: g3=g2

p4: g1=a

m22: “%d”m10: g2

m11: 3 m12: g3m13: g2m14: g1

m16: 4m15: call p

m17: g1+g2 m18: g3m19: g2m20: g1

p1: p

m1: main

m9: call p

p5: g2=b

System Dependence Graph (SDG)

25

C2

(p2, C1)

C1

(m4, ε) (m6, ε) (m8, ε)

(p7, C1) (p9, C1)(p6, C1)

(p4, C1)

(m12, ε)

(p7, C2)(p6, C2)

(m13, ε) (m14, ε)

(p3, C2) (p8, C2)(p2, C2) (p9, C2)

(p4, C2)

C3

(m16, ε) (m18, ε) (m20, ε)

(p7, C3)(p2, C3) (p9, C3)(p6, C3)

(p4, C3)

(m5, ε) (m7, ε)

(p3, C1) (p8, C1)

(m10, ε) (m11, ε)

(m21, ε)(m23, ε)(m22, ε)

(m17, ε)

(p3, C3) (p8, C3)

(m2, ε)

(p1, C2)

(p5, C2)

(m15, ε)(m3, ε)

(p1, C1)

(p5, C1)

(m1, ε)

(m9, ε)(m19, ε)

(p1, C3)

(p5, C3)

Unrolled SDG

26

C2C1 C3

p8

m21m23m22

m17m9

p1

p4 p5p5p3

m3,

m1

p1)

m7

m15

m5 m13 m14

p3 p8p2 p9

m10 m11 m19

Specialized SDG

DARPA BET IPR 27

Specialization slice of a recursive program

(1) int g1, g2;(2)(3) void s(int a,(4) int b){(5)(6) g1 = b;(7) g2 = a;(8) }(9)(10)(11)(12) int r(int k) {(13)(14) if (k > 0) {(15) s(g1, g2);(16) r(k-1);(17) s(g1, g2);(18) }(19)(20) }(21)(22)(23)(24) int main() {(25) g1 = 1;(26) g2 = 2;(27) r(3);(28) printf("%d\n", g1);(29) }

int g1, g2;

void s_1(int b) { g1 = b;}void s_2(int a) { g2 = a;}

void r_1(int k) { if (k > 0) { s_2(g1); r_2(k-1); s_1(g2); }}void r_2(int k) { if (k > 0) { s_1(g2); r_1(k-1); s_2(g1); }}

int main() { g1 = 1; r_1(3); printf("%d\n", g1);}

Calling pattern:

(27) ((16)(16))*

Calling pattern:

(27)(16)((16)(16))*


28

• Problem statement– Ordinary “closure slices” can have mismatches between

call sites and called procedures• different call sites have different subsets of the parameters


1. In the worst case, specialization causes an exponential increase in size

2. In practice, observed a 9.4% increase

Relatively Few Specialized Procedures

29


• Problem statement– Ordinary “closure slices” can have mismatches between call

sites and called procedures• different call sites have different subsets of the parameters


• Key insight– minimal solution involves solving a partitioning problem on a

certain infinite graph– problem solvable using PDSs: all node-sets in infinite graph can

be represented via FSMs– algorithm: a few automata-theoretic operations

30

31

Algorithm

Input: SDG S and slicing criterion COutput: An SDG R for the specialized slice of S with respect to C

// Create A6, a minimal reverse-deterministic automaton for the// stack-configuration slice of S with respect to C1 PS = the PDS for S2 A0 = a PS-automaton that accepts C3 A1 = Prestar[PS](A0)4 A2 = reverse(A1)5 A3 = determinize(A2)6 A4 = minimize(A3)7 A5 = reverse(A4)8 A6 = removeEpsilonTransitions(A5)

// Read out SDG R from A6

...

Automata used to hold onto sets of points in possibly infinite graph

Partition obtained by determinizing and

minimizing:Each state = set of calling

contexts for one specialized procedure

32

C2

(p2, C1)

C1

(m4, ε) (m6, ε) (m8, ε)

(p7, C1) (p9, C1)(p6, C1)

(p4, C1)

(m12, ε)

(p7, C2)(p6, C2)

(m13, ε) (m14, ε)

(p3, C2) (p8, C2)(p2, C2) (p9, C2)

(p4, C2)

C3

(m16, ε) (m18, ε) (m20, ε)

(p7, C3)(p2, C3) (p9, C3)(p6, C3)

(p4, C3)

(m5, ε) (m7, ε)

(p3, C1) (p8, C1)

(m10, ε) (m11, ε)

(m21, ε)(m23, ε)(m22, ε)

(m17, ε)

(p3, C3) (p8, C3)

(m2, ε)

(p1, C2)

(p5, C2)

(m15, ε)(m3, ε)

(p1, C1)

(p5, C1)

(m1, ε)

(m9, ε)(m19, ε)

(p1, C3)

(p5, C3)

Unrolled SDG

Each yellow name has the same set of stack configurations {C1,C3}Such sets are infinite for recursive programs => FSMs

33

C2C1 C3

p8

m21m23m22

m17m9

p1

p4 p5p5p3

m3,

m1

p1

m7

m15

m5 m13 m14

p3 p8p2 p9

m10 m11 m19

Specialized SDG

Each yellow name has the same set of stack configurations {C1,C3}Such sets are infinite for recursive programs => FSMs

DARPA BET IPR 34

Feature Removalint add(int a,int b) { q: return a+b;} int mult(int a,int b) { int i = 0; int ans = 0; while(i < a) { c5: ans = add(ans,b); c6: i = add(i,1); } return ans;}

void tally(int& sum,int& prod,int N) { int i = 1; while(i <= N) { c2: sum = add(sum,i); c3: prod = mult(prod,i); c4: i = add(i,1); }} int main() { int sum = 0; int prod = 1; c1: tally(sum,prod,10); printf("%d ",sum); printf("%d ",prod);}

int add(int a,int b) { q: return a+b;} int mult( int b) { int i = 0; int ans = 0;

return;}

void tally(int& sum, int N) { int i = 1; while(i <= N) { c2: sum = add(sum,i); c3: mult( i); c4: i = add(i,1); }} int main() { int sum = 0;

c1: tally(sum, 10); printf("%d ",sum);

}

35

Feature Removal

int g1, g2, g3;


g1 = a; g2 = b; g3 = g2;}


int g1, g2, g3;


g1 = a; g2 = b; g3 = g2;}


int g1, g2;

void p1(int a) { g1 = a; }

void p2(int b) { g2 = b; g3 = g2; }

int main() { g2 = 100; p1(g2); p2(3); p1(4); }

(1)(2)(3)(4)(5)(6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17)(18)

Forwardclosure slice

Specialized slice

36

C2

(p2, C1)

C1

(m4, ε) (m6, ε) (m8, ε)

(p7, C1) (p9, C1)(p6, C1)

(p4, C1)

(m12, ε)

(p7, C2)(p6, C2)

(m13, ε) (m14, ε)

(p3, C2) (p8, C2)(p2, C2) (p9, C2)

(p4, C2)

C3

(m16, ε) (m18, ε) (m20, ε)

(p7, C3)(p2, C3) (p9, C3)(p6, C3)

(p4, C3)

(m5, ε) (m7, ε)

(p3, C1) (p8, C1)

(m10, ε) (m11, ε)

(m21, ε)(m23, ε)(m22, ε)

(m17, ε)

(p3, C3) (p8, C3)

(m2, ε)

(p1, C2)

(m15, ε)(m3, ε)

(p1, C1)

(m1, ε)

(m9, ε)(m19, ε)

(p1, C3)

(p5, C3)(p5, C1) (p5, C2)

Unrolled SDG

37

C2

(p2, C1)

C1

(m4, ε) (m6, ε) (m8, ε)

(p7, C1) (p9, C1)(p6, C1)

(p4, C1)

(m12, ε)

(p7, C2)(p6, C2)

(m13, ε) (m14, ε)

(p3, C2) (p8, C2)(p2, C2) (p9, C2)

(p4, C2)

C3

(m16, ε) (m18, ε) (m20, ε)

(p7, C3)(p2, C3) (p9, C3)(p6, C3)

(p4, C3)

(m5, ε) (m7, ε)

(p3, C1) (p8, C1)

(m10, ε) (m11, ε)

(m21, ε)(m23, ε)(m22, ε)

(m17, ε)

(p3, C3) (p8, C3)

(m2, ε)

(p1, C2)

(m15, ε)(m3, ε)

(p1, C1)

(m1, ε)

(m9, ε)(m19, ε)

(p1, C3)

(p5, C3)(p5, C1) (p5, C2)

Complemented Unrolled SDG

38

C2

(p2, C1)

C1

(m4, ε) (m6, ε) (m8, ε)

(p7, C1) (p9, C1)(p6, C1)

(p4, C1)

(m12, ε)

(p7, C2)(p6, C2)

(m13, ε) (m14, ε)

(p3, C2) (p8, C2)(p2, C2) (p9, C2)

(p4, C2)

C3

(m16, ε) (m18, ε) (m20, ε)

(p7, C3)(p2, C3) (p9, C3)(p6, C3)

(p4, C3)

(m5, ε) (m7, ε)

(p3, C1) (p8, C1)

(m10, ε) (m11, ε)

(m21, ε)(m23, ε)(m22, ε)

(m17, ε)

(p3, C3) (p8, C3)

(m2, ε)

(p1, C2)

(m15, ε)(m3, ε)

(p1, C1)

(m1, ε)

(m9, ε)(m19, ε)

(p1, C3)

(p5, C3)(p5, C1) (p5, C2)


39

C2

(p2, C1)

C1

(m4, ε) (m6, ε) (m8, ε)

(p7, C1) (p9, C1)(p6, C1)

(p4, C1)

(m12, ε)

(p7, C2)(p6, C2)

(m13, ε) (m14, ε)

(p3, C2) (p8, C2)(p2, C2) (p9, C2)

(p4, C2)

C3

(m16, ε) (m18, ε) (m20, ε)

(p7, C3)(p2, C3) (p9, C3)(p6, C3)

(p4, C3)

(m5, ε) (m7, ε)

(p3, C1) (p8, C1)

(m10, ε) (m11, ε)

(m21, ε)(m23, ε)(m22, ε)

(m17, ε)

(p3, C3) (p8, C3)

(m2, ε)

(p1, C2)

(m15, ε)(m3, ε)

(p1, C1)

(m1, ε)

(m9, ε)(m19, ε)

(p1, C3)

(p5, C3)(p5, C1) (p5, C2)


40

C2

(p2, C1)

C1

(m4, ε) (m8, ε)

(p9, C1)

(p4, C1)

(m12, ε)

(p7, C2)(p6, C2)

(m13, ε)

(p3, C2) (p8, C2)

C3

(m16, ε) (m20, ε)

(p2, C3) (p9, C3)

(p4, C3)

(m11, ε)

(m2, ε)

(p1, C2)

(m15, ε)(m3, ε)

(p1, C1)

(m1, ε)

(m9, ε)

(p1, C3)

(p5, C2)


41

C2

p2

C1

m4 m8

p9

p4

m12

p7p6

m13)

p3, p8

C3

m16 m20m11

m2

p1

m15m3

p1

m1

m9

p5


BET 42

Future Work: Partial Evaluator for Machine-Code

• Slicing has limitations– limited semantic information – i.e., just dependence edges– no evaluation/simplification

• Partial evaluation: a framework for specializing programs – software specialization, optimization, etc.

• Binding-time analysis– what patterns are foo and bar called with?

• e.g, { foo(S,S,D,D), foo(S,D,S,D), bar(S,D), bar(D,S) }

– polyvariant binding-time analysis? specialized slicing!

• Design and implement an algorithm for partial evaluation of machine code

JungheeLim

VenkateshSrinivasan

DARPA BET IPR 43

Outline of Talk






DARPA BET IPR 44

Possible-overflow example

char* concat(char* a, char* b){ unsigned size = strlen(a)+strlen(b)+1; char* out = (char*)malloc(size*sizeof(char)); // Possible overflow for(unsigned i = 0; i < strlen(a); i++) { out[i] = a[i]; // Potential memory corruption } for(unsigned i = 0; i < strlen(b); i++) { out[i+strlen(a)] = b[i]; // Potential memory corruption } out[i+strlen(a)] = '\0'; return out;}

DARPA BET IPR 45

Convex Polyhedra[Figures from Halbwachs et al. FMSD97]

BET 46

Bitvector Inequality domain

• Conventional domain for representing inequalities– polyhedra: conjunctions of linear inequalities a1 x1 + a2 x2 + . . . + ak xk ≤ c– operations on polyhedra: linear transformations• unsound for machine arithmetic• machine integers wrap while mathematical integers do not

• Solution: Bitvector Inequality Domain

TusharSharma

DARPA BET IPR 47

Not so well-behaved . . .

DARPA BET IPR 48

Outline of Talk






DARPA BET IPR 49

Symbolic abstraction: Who cares?

• Win, win, win• Easier/faster implementation of analysis tools

– just state concrete (actual!) semantics in logic – supply an abstract domain

• e.g., as a class that meets a specific interface

– obtain analyzer/decision-procedure

• More precise results in abstract interpretation– can identify loop and procedure summaries that are more

precise than ones obtained via conventional techniques

• Applies to interesting, non-standard logics (we think!)– separation logic: memory safety properties

Basic scenario

DARPA BET IPR 48

interfere?

50

In 1977, Cousot & Cousot gave us a beautiful theory of overapproximation

α

Universe of States

x [2,5]y [1,3]

{(x2, y1), (x5, y3)}

{(2,1), (2,2), (2,3), (3,1), (3,2), (3,3), (4,1), (4,2), (4,3), (5,1), (5,2), (5,3)}

γ

α

51

In 1979, Cousot & Cousot gave us:

τ#τα

γ

Universe of States

53

In 1979, Cousot & Cousot gave us:

τ#τα

γ

Universe of States

54

In 2004, Reps, Sagiv, and Yorsh gave us:

Symbolic Abstract Interpretation

Symbolic ConcretizationUniverse of States

55

Symbolic Abstraction

In 2004, Reps, Sagiv, and Yorsh gave us:

Universe of States

56

Universe of States

57

ans

Universe of States

S

58

ans

Universe of States

60

From “Below” vs. From “Above”

• Reps, Sagiv, and Yorsh 2004: approximation from “below”• Desirable: approximation from “above”

– always have safe over-approximation in hand– can stop algorithm at any time (e.g., if taking too long)– Thakur, A. and Reps, T., A method for symbolic computation of abstract

operations. In Proc. Computer-Aided Verification (CAV), 2012

AdityaThakur

61

[CAV 2012]

62

Tunable

More time more precision

Stop at any time sound answer

[CAV 2012]

63

Stålmarck’s method (1989)

Dilemma Rule

𝑅∪{𝑣=True \} 𝑅∪{𝑣=False \}

𝑅

𝑅1′ 𝑅2

′

• Split

• Propagate

• Merge

64


1-saturation

65


2-saturation

66⊥

⊤

𝑅∪{𝑣=False \}𝑅∪{𝑣=True \}𝐴⊓𝑎1

𝐴

Stålmarck’s method for

Dilemma Rule

𝑅

𝐹 1′ 𝐹 2

′

• Split

• Propagate

• Merge

𝐴⊓𝑎2

𝐴1′ 𝐴2

′

𝛾 (𝑎1 )∪𝛾 (𝑎2 )⊇𝛾 ( 𝐴 )

67

⊤

⊥

Stålmarck’s method

~𝜶

68

⊤

⊥

Reasoning: Using in an SMT solver

~𝛼↓ (𝜑 )=¿ is unsatisfiable

Property verificationvia model checking:

OK if Unsat(Program Bad)

Dual use:• for abstract interpretation• Unsat/validity checking for pure logical reasoning abstract interpretation in service to logic!

[CAV 2012]

DARPA BET IPR 69

The importance of data structures

• Classic union-find– plus layers– plus least-upper bound

• Given UF1 and UF2, find the coarsest partition that is finer than UF1 and UF2

• Roughly, “confluent, partially-persistent union-find”

Example: gzip

70

gzip ARs zip’s AR gzip heap zip’s heap

R RW RWR

stack heap

Array Separation Logic• Satisfiable examples elem(a1) * arr(a2) elem(a1) * arr_seg(a2,a3) * arr(a3) arr_seg(a1,a2) * arr_seg(a2,a3) * arr(a3)

• Unsatisfiable arr_seg(a1,a2) * arr_seg(a2,a1)

71

a1 a2 a3

Separation Logic Frame Rule

72

{ P } S { Q }{ P * R } S { Q * R}

“Whatever the context is, it remains unchanged”

Local Reasoning:Do not have to describe unchanged context

Property Verification via Model CheckingOK if Unsat(Program Bad)

73

ProgramCounter == 2042 [a1 a2] * list(a2)

BET 74

Extend WALi to use

• Weighted Automaton Library (WALi):– supports context-sensitive

interprocedural analysis– weights = dataflow transformers– weighted version of PDSs (a la

material on specialized slicing)

• More precise results in abstract interpretation

• Easier implementation of analysis tools

JungheeLim

AdityaThakur

BET 75

AlphaHat

• AlphaHat technique in three ways– WALi + AlphaHat (Aditya Thakur and Junghee Lim)

• ~October 2012

– Boogie + AlphaHat for source code (Akash Lal at Microsoft India)• ~November 2012

– Boogie + AlphaHat for machine code (Aditya Thakur and Junghee Lim)• ~November 2012

DARPA BET IPR 76

Outline of Talk






77

Goal: check format compatibility

Producer componen

t

Consumer componen

t

1. Infer output format2. Infer accepted format3. Check compatibility

DARPA BET IPREvan

Driscoll

78

Formats are strings over “types”

ID CM FG FG OS …M TIME

Header of gzip format:

short byte byte word byte byte

DARPA BET IPR

79

Current work: enhance format spec

nrows ncols pix11 pix12 pix13 pix14 pix21 pix22 pix23 …

DARPA BET IPR

80



nrows

DARPA BET IPR

81



nrows ncols

ncols ncols

nrows

DARPA BET IPR

82


nrows ncols pix11 pix12 pix13 pix14 pix21 pix22 pix23 …ncols ncols

nrows

nrows:int ncols:int ((byte byte byte byte)ncols)nrows

Infer an automaton equivalent to:

DARPA BET IPR

83

Roadmap: Inference

Program TracesI/O equalities

ICFG Inferred XFA

Inputs

DARPA BET IPR

84

Roadmap: Compatibility

Producer component

Consumer component

Inferred XFA Inferred XFA ?

DARPA BET IPR

85

Status

Prototype essentially done, but not well-tested. Working on performance and on finding tests.

DARPA BET IPR

86

How we do it

Exponents start as standard Kleene *,and correspond to program loops

nrows:int ncols:int ((byte byte byte)*)*

DARPA BET IPR

87

How we do it


We instrument loops with trip countsWe instrument I/O calls to remember values

DARPA BET IPR

88

How we do it



When two of these are found to always equal, replace the * with an exponent

trip countremembered I/O value

DARPA BET IPR

89

How we do it

nrows:int ncols:int ((byte byte byte)*)nrows



trip countremembered I/O value

DARPA BET IPR

90

How we do it

nrows:int ncols:int ((byte byte byte)*)nrows



DARPA BET IPR

91

How we do it

nrows:int ncols:int ((byte byte byte)ncols)nrows



DARPA BET IPR

92

We use Daikon

Daikon identifies dynamic invariants• Hold over all test runs; might not actually be invariants• Could use statically inferred instead

We wrote our own Daikon front end for machine code• Assumes debugging information

• can we remove this restriction?• Front ends supplied with Daikon not sufficient

• checks only entry-to-exit invariants, whereas we need• loop trip-count instrumentation• I/O-to-loop-exit invariants

• Instruments program using DyninstDARPA BET IPR

93

Instrumentation remembers I/O vals

If value is returned:x = read_int(); x = __io1 = read_int();

If value is “returned” via out parameter:err = read_int(&x); err = read_int(&x);

__io2 = *(&x);

If value is passed by parameter:write_int(x); __io3 = x;

write_int(x);

DARPA BET IPR

94

Instrumentation finds trip counts

DARPA BET IPR

95


On loop entry:Set trip count to 0__trip1 = 0;

DARPA BET IPR

96



Entering loop body:Increment trip count__trip1++;

DARPA BET IPR

97



Entering loop body:Increment trip count__trip1++;

On loop exit:Output current value of variablesInterested in invariants hereprint(__io1, __io2, …, __trip1);

DARPA BET IPR

98

We use Daikon to find I/O equalities

I/O equalitiesValue trace

Instrumentedprogram

Dakion dynamicinvariant detector

LOOP_EXIT_A__io2 = 2__io4 = 5__trip_count_A = 5

LOOP_EXIT_B__io2 = 6__io4 = 5__trip_count_B = 6

__trip_count_A = __io4 = 5__trip_count_B = __io2 = 6

DARPA BET IPR

99

We model programs as XFAs

XFAs: extended finite automata

Add separate bounded “data state” to standard FAsTransformers on transitions describe data-state changes

DARPA BET IPR

DARPA BET IPR 100

Outline of Talk


– Component identification• generalized concept analysis• object traces• component dependences (planned)




BET 101

Stretched-TreeIC3

• IC3 (Bradley et al.)– SAT-based symbolic model checking technique for

hardware specifications– Incrementally refines the property P, eventually producing

either an inductive strengthening of the property or a counter-example trace

BET 102

Stretched-TreeIC3 (Cont.)

• TreeIC3 (Cimatti et al.)– Software model-checking technique that adapts IC3-

inspired ideas to CFG-guided search

Unwinding CFG in an abstract space (with CEGAR)

CFG ART (Abstract Reachability Tree)

BET 103


• Stretched-TreeIC3 (UW)– An extension of TreeIC3 for machine-code programs– Use trip-count-based invariant generation for better handling of loops

BET 104


• We use– TSL

• Machine-code analysis framework

– CodeSurfer/x86• To obtain program variables from executables

– Daikon-Dyninst developed by Evan Driscoll• To obtain loop invariants associated with trip-count variables that

count the number of iterations taken through a loop

BET 105

Stretched-TreeIC3

• Finish the base implementation of Stretched-TreeIC3– ~ November 2012

• Integrate Daikon-Dyninst to Stretched-TreeIC3– ~ December 2012

DARPA BET IPR 106

Outline of Talk


– Component identification• generalized concept analysis• object traces• component dependences (planned)




DARPA BET IPR 107

Recap of publications/submissions1. Thakur, A., Elder, M., and Reps, T., Bilateral algorithms for symbolic abstraction. In Proc. Static

Analysis Symposium (SAS), Sept. 2012. http://research.cs.wisc.edu/wpis/papers/sas12-bilateral-alphahat.pdf

2. Thakur, A. and Reps, T., A generalization of Staalmarck's method. In Proc. Static Analysis Symposium (SAS), Sept. 2012. Extended version: http://research.cs.wisc.edu/wpis/papers/tr1699r.pdf

3. Thakur, A. and Reps, T., A method for symbolic computation of abstract operations. In Proc. Computer-Aided Verification (CAV), July 2012. http://www.cs.wisc.edu/wpis/papers/CAV12-Staalmarck.pdf

4. Driscoll, E., Thakur, A., and Reps, T., OpenNWA: A nested-word-automaton library (tool paper). In Proc. Computer-Aided Verification (CAV), July 2012. http://www.cs.wisc.edu/wpis/papers/CAV12-OpenNWA.pdf

5. Lim, J. and Reps. T., TSL: A system for generating abstract interpreters and its application to machine-code analysis. TR-1775, Computer Sciences Department, University of Wisconsin, Madison, WI, Oct. 2012. (Submitted for journal publication.) http://research.cs.wisc.edu/wpis/papers/tr1775.pdf

6. Aung, M., Horwitz, S., Joiner, R., and Reps, T., Specialization slicing. To be submitted for journal publication, Oct. 2012.

http://research.cs.wisc.edu/wpis/papers/sas12-bilateral-alphahat.pdf

http://research.cs.wisc.edu/wpis/papers/tr1699r.pdf

http://www.cs.wisc.edu/wpis/papers/CAV12-Staalmarck.pdf

http://www.cs.wisc.edu/wpis/papers/CAV12-OpenNWA.pdf

http://research.cs.wisc.edu/wpis/papers/tr1775.pdf

DARPA BET IPR 108

Papers in preparation

• Elder, M., Lim, T., Sharma, T., Andersen, T., and Reps, T., Abstract domains of affine relations. Invited by the SAS 2011 PC for accelerated refereeing for ACM TOPLAS.

• Thakur, A., Lal, A., Lim, J., and Reps, T., Alphahat and all that: Attaining most-precise abstract interpretation.

• Driscoll, E. and Reps, T., An algorithm for format-compatibility checking.

• Sharma, T., Thakur, A., and Reps, T., An abstract domain of bitvector inequalities.

DARPA BET IPR 109

Recap of plans for 2013

• Component identification– object traces class hierarchies

• Component extraction– partial evaluator for machine code

• Verifying component properties

• separation logic• WALi-based and Boogie-based invariant finding

– bitvector-inequality domain– Stretched-TreeIC3

Recovering Components from Executables [Cooperative Agreement HR0011-12-2-0012] Thomas Reps...

Documents

Transcript of Recovering Components from Executables [Cooperative Agreement HR0011-12-2-0012] Thomas Reps...