Recovery of Variables and Heap Structure in x86 Executables
Gogul Balakrishnan, Thomas Reps
University of Wisconsin
Overview
• Introduction
• Challenges
• Background
• Recovering A-locs via Iteration
• An Abstraction for Heap-Allocated Storage
• Experiments
Introduction
• The Need for Analyzing Executables
  – What You See Is Not What You eXecute
    e.g., memset(password, '\0', len); free(password);
    (the compiler may remove the memset as a dead store, so the password stays in memory)
• Many Obstacles in Analyzing Executables
  – Data Objects are Not Easily Identifiable.
  – Absence of Symbol Table & Debugging Information
  – Determining the Memory Addresses of Data Objects
  – Difficult to Track the Flow of Data through Memory
  – Challenging to Get Useful Information about the Heap
Challenges(1/3)
• Recovering Variable-like Entities
  – The Layout of Memory is Known at Compile Time or Assembly Time (IDAPro's Approach)
  – To Recover y, the Set of Values that eax Holds at Instruction 5 Needs to be Determined.

void main() {
  int x, y;
  x = 1;
  y = 2;
  return;
}

proc main
1 mov ebp, esp
2 sub esp, 8
3 mov [ebp-8], 1
4 mov eax, ebp
5 mov [eax-4], 2
6 add esp, 8
7 retn
Challenges(2/3)
• Granularity of Recovered Variable-like Entities
  – Affects the complexity and accuracy of subsequent analyses
• The Structure of Heap-Allocated Objects
  – Only the Size of the Allocated Block is Known.
  – Using an Abstraction-Refinement Algorithm
Challenges(3/3)
• Resolving Virtual-Function Calls
  – A Definite Link between the Object and the Virtual-Function Table is Never Established. (Weak Update)
  – Cause: the one-variable-per-malloc-site abstraction
Background(1/6)
• Abstract Locations (A-locs)
  – Memory Region
    • A Set of Disjoint Memory Areas
    • Represents a Group of Locations that have Similar Runtime Properties
  – Abstract Locations
    • Locations between two addresses/offsets in a Memory-Region
    • Addresses & Offsets are Statically Determined
Background(2/6)
• Abstract Locations (cont'd)

proc main
    mov ebp,esp
    sub esp,40
    mov ecx,0
    lea eax,[ebp-40]
L1: mov [eax],1
    mov [eax+4],2
    add eax,8
    inc ecx
    cmp ecx,5
    jl L1
    mov eax,[ebp-36]
    add esp,40
    retn
Background(3/6)
• Value-Set Analysis (VSA)
  – Combined Numeric-Analysis & Pointer-Analysis
  – Over-Approximation of the values that each a-loc holds at each program point
  – Value-Set
    • The Set of Addresses and Numeric Values
    • N-tuple of strided intervals of the form s[l, u]
      N : the number of memory-regions
      e.g., 8[-40, -8] = {-40, -32, -24, -16, -8}
    • (Global Region, Procedure Region, …)
    • (1[0, 9], ∅) versus (∅, 8[-40, -8])
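The strided-interval notation s[l, u] can be made concrete with a small sketch. The helper name `concretize` is hypothetical and not from the slides; it only enumerates the integers that a strided interval stands for, assuming the bounds are small enough to list.

```python
def concretize(stride, lo, hi):
    """Enumerate the integers denoted by the strided interval stride[lo, hi].

    Hypothetical helper for illustration; a stride of 0 is taken to
    denote a singleton, so lo and hi must then coincide.
    """
    if stride == 0:
        if lo != hi:
            raise ValueError("stride 0 requires lo == hi")
        return [lo]
    return list(range(lo, hi + 1, stride))


# The slide's example: 8[-40, -8]
offsets = concretize(8, -40, -8)
```

With two memory-regions (global, procedure), a value-set is then a pair of such intervals, where an empty component means the a-loc holds no value in that region.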
Background(4/6)
• Value-Set Analysis (cont'd)
  – The Value-Set of eax at L1
    • (∅, 8[-40, -8])
    • eax holds the offsets {-40, -32, -24, -16, -8}
    • These are the starting addresses of field x of the elements of p
proc main
    mov ebp,esp
    sub esp,40
    mov ecx,0
    lea eax,[ebp-40]
L1: mov [eax],1
    mov [eax+4],2
    add eax,8
    inc ecx
    cmp ecx,5
    jl L1
    mov eax,[ebp-36]
    add esp,40
    retn
typedef struct {
  int x, y;
} Point;

int main() {
  int i;
  Point p[5];
  for (i = 0; i < 5; ++i) {
    p[i].x = 1;
    p[i].y = 2;
  }
  return p[0].y;
}
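To see where 8[-40, -8] comes from, the loop can be replayed concretely and the observed offsets summarized as a strided interval. This is only an illustration: VSA itself is a static analysis and never runs the program, and the function names below are hypothetical.

```python
from functools import reduce
from math import gcd


def eax_offsets_at_L1():
    """Replay the loop from the listing and record eax (as an offset
    from ebp) each time control reaches L1."""
    offsets = []
    eax, ecx = -40, 0          # lea eax,[ebp-40] ; mov ecx,0
    while True:
        offsets.append(eax)    # L1: mov [eax],1
        eax += 8               # add eax,8
        ecx += 1               # inc ecx
        if not ecx < 5:        # cmp ecx,5 ; jl L1
            break
    return offsets


def to_strided_interval(values):
    """Smallest strided interval (stride, lo, hi) covering the values."""
    lo, hi = min(values), max(values)
    stride = reduce(gcd, (v - lo for v in values), 0)
    return stride, lo, hi


si = to_strided_interval(eax_offsets_at_L1())
```

The result corresponds to the procedure-region component of the value-set of eax at L1.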
Background(5/6)
• Aggregate Structure Identification (ASI)
  – Can Distinguish between Accesses to Different Parts of the Same Aggregate
  – Aggregate is broken up into smaller parts (atoms)
  – Data-Access Constraint Language (DAC)
    • Specifying Data-Access Pattern in the Program
    • DataRef : Reference to a set of sequences of bytes
    • UnifyConstraint : Flow of Data in the Program
Background(6/6)
• Aggregate Structure Identification (cont'd)
  – Data-Access Constraint Language (DAC)
    • DataRef[l : u] refers to bytes l through u in DataRef
    • DataRef\n : n is the number of elements
      e.g., P[0:11]\3 = P[0:3], P[4:7], or P[8:11]
    • Constraints for the example program:
      p[0:39]\5[0:3] ≈ const_1[0:3];
      p[0:39]\5[4:7] ≈ const_2[0:3];
      return_main[0:3] ≈ p[4:7]
  – ASI DAG
    [Figure: ASI DAG for p and return_main]
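A data reference such as p[0:39]\5[0:3] can be unfolded into the byte ranges it may denote. The sketch below assumes that the `\n` operator splits a range into n equally sized parts, as in the P[0:11]\3 example; `split_ref` and `sub_range` are hypothetical helper names.

```python
def split_ref(lo, hi, n):
    # Byte ranges denoted by ref[lo:hi]\n: n equal-sized consecutive parts.
    total = hi - lo + 1
    if total % n != 0:
        raise ValueError("range is not divisible into n equal parts")
    part = total // n
    return [(lo + i * part, lo + (i + 1) * part - 1) for i in range(n)]


def sub_range(ranges, lo, hi):
    # Apply a trailing [lo:hi] selector inside each element range.
    return [(start + lo, start + hi) for start, _ in ranges]


# P[0:11]\3 from the slide
parts = split_ref(0, 11, 3)

# p[0:39]\5[0:3]: bytes 0..3 of each of the five 8-byte elements,
# which are the x fields of p[0] .. p[4]
x_fields = sub_range(split_ref(0, 39, 5), 0, 3)
```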
Recovering A-locs via Iteration
• Problems of VSA
  – Can only Represent a Contiguous Sequence of Memory Locations
  – Cannot Detect Internal Substructure
• Basic Idea
  1. VSA is used to obtain memory-access patterns in the executable;
  2. ASI is used as a heuristic to determine a set of a-locs according to the memory-access patterns obtained from the information recovered by VSA.
[Figure: iteration loop over IDAPro, VSA, and ASI, producing the final value-sets]
Recovering A-locs via Iteration
• Generating Data-Access Constraints from Value-Sets

<Algorithm 1: SI2ASI>
Input : (r, s[l, u], length)
Output : (ASI Ref, Boolean)

if s[l, u] is a singleton then                  (e.g., s[l, l])
    return <"r[l : l+length-1]", true>
else
    size ← max(s, length)
    n ← (u – l + size – 1) / size               (the number of array elements)
    ref ← "r[l : u+size-1]\n[0 : size-1]"       (r[l : u+size-1] is the actual byte range)
    return <ref, (s = size)>
end if

e.g., (AR_main, 8[-40, -8], length) => {AR_main[(-40):(-1)]\5[0:7]}
    AR_main[-40:-33][0:7]
    AR_main[-32:-25][0:7]
    AR_main[-24:-17][0:7]
    AR_main[-16:-9][0:7]
    AR_main[-8:-1][0:7]
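Algorithm 1 can be sketched in a few lines. Two points are assumptions rather than something the slide states: the access length is taken to be 4 bytes in the usage example, and the element count is computed as (u - l + size) / size, because the slide's formula read with integer division would give 4 instead of the 5 elements shown in its own example. The function names are hypothetical.

```python
def si2asi(r, s, l, u, length):
    """Convert strided interval s[l, u] in region r into an ASI reference.

    Returns (reference string, exact?) following the shape of Algorithm 1.
    """
    if l == u:                        # singleton s[l, l]
        return f"{r}[{l}:{l + length - 1}]", True
    size = max(s, length)
    n = (u - l + size) // size        # assumed element count (see lead-in)
    ref = f"{r}[{l}:{u + size - 1}]\\{n}[0:{size - 1}]"
    return ref, s == size


def element_ranges(l, u, size):
    """Byte range of each array element covered by the reference."""
    return [(start, start + size - 1) for start in range(l, u + 1, size)]


ref, exact = si2asi("AR_main", 8, -40, -8, 4)
ranges = element_ranges(-40, -8, 8)
```

Here `ref` is the reference for the five 8-byte elements of the activation record, and `ranges` lists the same five byte ranges as the slide.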
Recovering A-locs via Iteration
• Generating Data-Access Constraints from Value-Sets

<Algorithm 2>
if s1[l1, u1] or s2[l2, u2] is a singleton then
    return SI2ASI(r, s1[l1, u1] ⊕ s2[l2, u2], length)
end if
(determine the base register)
if s1 ≥ (u2 – l2 + length) then
    baseSI ← s1[l1, u1]                         (base address)
    indexSI ← s2[l2, u2]                        (index)
else if s2 ≥ (u1 – l1 + length) then
    baseSI ← s2[l2, u2]                         (base address)
    indexSI ← s1[l1, u1]                        (index)
else
    return SI2ASI(r, s1[l1, u1] ⊕ s2[l2, u2], length)
end if
<baseRef, exactRef> ← SI2ASI(r, baseSI, stride(baseSI))
if exactRef is false then
    return SI2ASI(r, s1[l1, u1] ⊕ s2[l2, u2], length)
else
    return concat(baseRef, SI2ASI('', indexSI, length))     (row-major order)
end if

e.g., eax : (1[0, 9], ∅)
      ecx : (∅, 16[-160, -16])
      In case of [ecx+eax] => AR[-160:-1]\10[0:15][0:9]\10[0:0]
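The base/index selection of Algorithm 2 can be sketched as follows. Several things are assumptions: strided intervals are plain (stride, lo, hi) tuples, ⊕ is approximated as gcd of the strides with added bounds (ignoring overflow), the element count in `si2asi` uses (u - l + size) / size so that the slide's examples are reproduced, and only the reference string is returned. All function names are hypothetical.

```python
from math import gcd


def si2asi(r, s, l, u, length):
    # Algorithm 1 (sketch): strided interval -> (ASI reference, exact?)
    if l == u:
        return f"{r}[{l}:{l + length - 1}]", True
    size = max(s, length)
    n = (u - l + size) // size        # assumed element count
    return f"{r}[{l}:{u + size - 1}]\\{n}[0:{size - 1}]", s == size


def si_add(a, b):
    # Assumed approximation of strided-interval addition (no overflow).
    (s1, l1, u1), (s2, l2, u2) = a, b
    return gcd(s1, s2), l1 + l2, u1 + u2


def two_si_to_asi(r, a, b, length):
    # Algorithm 2 (sketch): pick a base and an index interval if possible.
    (s1, l1, u1), (s2, l2, u2) = a, b

    def fallback():
        return si2asi(r, *si_add(a, b), length)[0]

    if l1 == u1 or l2 == u2:          # one operand is a singleton
        return fallback()
    if s1 >= u2 - l2 + length:        # a strides over the whole index range
        base, index = a, b
    elif s2 >= u1 - l1 + length:
        base, index = b, a
    else:
        return fallback()
    base_ref, exact = si2asi(r, *base, base[0])
    if not exact:
        return fallback()
    return base_ref + si2asi("", *index, length)[0]


# [ecx+eax] with ecx = 16[-160, -16] and eax = 1[0, 9], one byte accessed
ref = two_si_to_asi("AR", (16, -160, -16), (1, 0, 9), 1)
```

For the slide's example, ecx is chosen as the base because its stride (16) is at least as large as the span of eax plus the access length (10).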
Recovering A-locs via Iteration
• Interpreting Indirect Memory-References
  – Lookup Algorithm
    • NodeDesc : <name, length>
        name : the name associated with the ASI tree node
        length : the length of that node
    • NodeDescList : An Ordered List of NodeDesc
        e.g., [nd1, nd2, …, ndn]
    • Three Operations

      Name                    Output
      GetChildren(aloc)       List of Child Nodes
      GetRange(start, end)    List of Nodes with offsets in the given range [start, end]
      GetArrayElements(m)     List of Nodes with m elements
Recovering A-locs via Iteration
• Lookup Algorithm Examples
  e.g., Lookup p[0:39]\5[0:3]
    GetChildren(p) = [<a3, 4>, <a4, 4>, <i2, 32>]
    GetRange(0, 39) = [<a3, 4>, <a4, 4>, <i2, 32>]
    GetArrayElements(5) = [<a3, 4>, <a4, 4>], [<a5, 4>, <a6, 4>]
    GetRange(0, 3) = [<a3, 4>, <a5, 4>]
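The GetRange step of the lookup can be sketched over a NodeDescList represented as a list of (name, length) pairs. The sketch assumes that nodes are laid out consecutively from offset 0 and that a node is selected when its byte span overlaps [start, end]; the slide does not spell out either detail, and `get_range` is a hypothetical name.

```python
def get_range(nodes, start, end):
    """Return the nodes whose byte span overlaps [start, end].

    nodes is an ordered list of (name, length) pairs laid out
    back to back starting at offset 0 (assumption).
    """
    selected, offset = [], 0
    for name, length in nodes:
        first, last = offset, offset + length - 1
        if first <= end and last >= start:
            selected.append((name, length))
        offset += length
    return selected


children_of_p = [("a3", 4), ("a4", 4), ("i2", 32)]
whole = get_range(children_of_p, 0, 39)

# The two element layouts that GetArrayElements(5) yields on the slide
elements = [[("a3", 4), ("a4", 4)], [("a5", 4), ("a6", 4)]]
first_fields = [nd for elem in elements for nd in get_range(elem, 0, 3)]
```

Applying the final [0:3] selector to each element layout picks out a3 and a5, matching the last line of the slide's example.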
An Abstraction for Heap-Allocated Storage
• Previous Abstraction
  – All of the nodes allocated at a given allocation site s are folded together into a single summary node n_s.
• Recency Abstraction
  – Allowing VSA & ASI to recover information about virtual-function tables
  – Use Two Memory-Regions per allocation site s
    • MRAB[s] : Most Recently Allocated Block
    • NMRAB[s] : Non-Most Recently Allocated Block
    • count : How many concrete blocks the memory-region represents (MRAB[s].count, NMRAB[s].count)
      – SmallRange = {[0, 0], [0, 1], [1, 1], [0, ∞], [1, ∞], [2, ∞]}
    • size : over-approximation of the size of block (MRAB[s].size, NMRAB[s].size)
An Abstraction for Heap-Allocated Storage
• Operation
  – AbsEnv[s] : MRAB[s]/NMRAB[s] → <count, size, alocEnv>
  – AlocEnv = a-loc → ValueSet
  – Allocation site s transforms absEnv to absEnv'
    • absEnv'(MRAB[s]) = <[0,1], size, a-loc.Value-Set>
    • absEnv'(NMRAB[s]).count = absEnv(NMRAB[s]).count + absEnv(MRAB[s]).count
    • absEnv'(NMRAB[s]).size = absEnv(NMRAB[s]).size ∪ absEnv(MRAB[s]).size
    • absEnv'(NMRAB[s]).alocEnv = absEnv(NMRAB[s]).alocEnv ∪ absEnv(MRAB[s]).alocEnv
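The transformer for an allocation site can be sketched as below. This is a simplified illustration under stated assumptions: counts are (lo, hi) pairs that saturate into SmallRange, sizes and value-sets are plain Python sets standing in for strided intervals, and the fresh MRAB block starts with an empty alocEnv. None of these representation choices come from the slides, and all names are hypothetical.

```python
INF = float("inf")


def add_counts(a, b):
    """Add two SmallRange counts, saturating so the result stays in
    {[0,0], [0,1], [1,1], [0,inf], [1,inf], [2,inf]}."""
    lo = min(a[0] + b[0], 2)
    hi = a[1] + b[1]
    if hi > 1:
        hi = INF
    return (lo, hi)


def join_envs(e1, e2):
    # Pointwise union of two a-loc -> value-set maps.
    return {k: e1.get(k, set()) | e2.get(k, set()) for k in e1.keys() | e2.keys()}


def allocate(abs_env, size):
    """Effect of one execution of the allocation site on its two regions."""
    mrab, nmrab = abs_env["MRAB"], abs_env["NMRAB"]
    new_nmrab = {
        "count": add_counts(nmrab["count"], mrab["count"]),
        "size": nmrab["size"] | mrab["size"],
        "alocEnv": join_envs(nmrab["alocEnv"], mrab["alocEnv"]),
    }
    # The old MRAB block is folded into NMRAB; a fresh block takes its place.
    new_mrab = {"count": (0, 1), "size": {size}, "alocEnv": {}}
    return {"MRAB": new_mrab, "NMRAB": new_nmrab}


empty = {"count": (0, 0), "size": set(), "alocEnv": {}}
env = {"MRAB": dict(empty), "NMRAB": dict(empty)}
for _ in range(3):
    env = allocate(env, 16)
```

After three allocations the NMRAB count has saturated to [0, ∞], while MRAB still describes at most one block and can therefore be updated strongly.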
Experiments
• Environments
• Software
  OS        Compiler             Language   Target Files
  Windows   Visual Studio 6.0    C++        .obj
Experiments
• Results of Virtual-Function Call Resolution
Experiments
• Results of A-loc Identification
  – Comparing the Results of the Algorithm with Debugging Information
  – The structure of 87% of the local variables is correct
Experiments
• Results of A-loc Identification
The structure of 72% of the objects in the heap is correct
Q & A