University of Wisconsin-Madison, WCRE 2006
Extracting File Formats from Executables
Junghee Lim, Thomas Reps, and Ben Liblit
University of Wisconsin-Madison
13th Working Conference on Reverse Engineering, Oct. 26, 2006
http://www.cs.wisc.edu/~junghee/WCRE2006.ppt
Data Format (File Format)
• Goal: automatically extract a specification of a program's output format
– E.g., something similar to the file-format specification for gzip
• FFE (File Format Extractor)
– Input: an executable without source code or documentation
– Output: a representation of the output data format (e.g., a regular expression)
[Figure: extracted gzip output structure as a sequence of size/value fields, with two * marks]
(size:1, value:0x1F) (size:1, value:0x8B) (size:1, value:0x08) (size:1, value:Top) (size:Top, value:Top) (size:Top, value:Top) (size:4, value:Top) (size:4, value:Top) (size:4, value:Top) (size:1, value:Top) (size:1, value:Top)
Gzip specification vs. our structure
[Figure: the gzip file-format specification compared with the extracted structure, shown as size/value fields with two * marks]
(size:1, value:0x1F) (size:1, value:0x8B) (size:1, value:0x08) (size:1, value:Top) (size:Top, value:Top) (size:Top, value:Top) (size:4, value:Top) (size:4, value:Top) (size:4, value:Top) (size:1, value:Top) (size:1, value:Top)
Usage Scenarios
• Reuse components of a tool chain
– COTS (Commercial Off-The-Shelf) products
• Detect malware
– Recover output format (= network-communication pattern) from captured malware
– Detect variants in the wild by detecting network traffic with that pattern
• Characterize what a program computes/creates
• Find inconsistencies between specifications and implementations
Programming Styles
1. Individual writes, e.g.
- gzip
- compress95
- png2ico
2. Bulk writes, e.g.
- tar
- cpio
What are the Steps?
• Disassemble executable
• Recover
– Interprocedural CFG
– Variables (and their sizes)
– Possible values of variables
• Construct Hierarchical Finite-State Machine (HFSM)
• Annotate HFSM with size/value information
• [Construct regular expression]
– Perform in-line expansion
• [Validation]
– Regular exp. → flex spec. → recognizer
– Examples → recognizer → success/failure
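[Figure: FSM vs. HFSM for three procedures: foo (nodes 1-4, two "call bar" sites), bar (nodes 5-8, two "call baz" sites), and baz (nodes 9-10)]
FSM (calls expanded in line): 1 2 5 6 9 10 7 9 10 8 3 5' 6' 9' 10' 7' 9' 10' 8' 4
HFSM: one FSM per procedure, connected through the call sites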
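The difference between the two representations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Call` class, the `HFSM` dictionary, and `inline_expand` are hypothetical names, and each procedure is modeled as a straight-line sequence of nodes, whereas real FSMs may branch and loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """A call-site node that refers to another procedure's FSM."""
    callee: str


# Hypothetical, simplified HFSM for the foo/bar/baz figure: one node
# sequence per procedure, with call sites kept as references.
HFSM = {
    "foo": [1, 2, Call("bar"), 3, Call("bar"), 4],
    "bar": [5, 6, Call("baz"), 7, Call("baz"), 8],
    "baz": [9, 10],
}


def inline_expand(hfsm, proc, active=()):
    """Flatten one procedure into a single node sequence.

    Every call site is replaced by a copy of the callee's body.
    Recursion is rejected because it cannot be expanded finitely.
    """
    if proc in active:
        raise ValueError(f"recursive call to {proc}; cannot in-line expand")
    flat = []
    for item in hfsm[proc]:
        if isinstance(item, Call):
            flat.extend(inline_expand(hfsm, item.callee, active + (proc,)))
        else:
            flat.append(item)
    return flat


flat = inline_expand(HFSM, "foo")
# Entries stored in the hierarchical form (plain nodes plus call sites).
hfsm_size = sum(len(body) for body in HFSM.values())
```

The flattened list repeats the bodies of bar and baz once per call site, which is why the hierarchical form stays smaller as procedures are reused.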
Well-known concepts from formal-language theory
• but we use varying-sized alphabet symbols
1. Individual writes
[Figure: example code is compiled into an executable, which is then disassembled]
The disassembled code for our example

Output functions (user-supplied information: a library function, or a wrapped library function)
  sub_401050 (put_byte) : void put_byte(char c);
  sub_401075 (put_long) : void put_long(int n);
  sub_4010E4 (writes) : void writes(char* str, int size);

401120 sub_401120 proc near ; type
401120   push ebp
401121   mov ebp, esp
401123   sub esp, 0Ch
401126   mov eax, [ebp-4]
401129   mov [ebp-8], eax
40112C   cmp [ebp-8], 0
401130   jz short loc_40113A
401132   cmp [ebp-8], 1
401136   jz short loc_401147
401138   jmp short loc_401152
40113A loc_40113A:
40113A   mov eax, [ebp-4]
40113D   mov [esp], eax
401140   call sub_401050
401145   jmp short loc_401152
401147 loc_401147:
401147   mov eax, [ebp-4]
40114A   mov [esp], eax
40114D   call sub_401050
401152 loc_401152:
401152   leave
401153   retn

401154 sub_401154 proc near ; chksum
401154   push ebp
401155   mov ebp, esp
401157   sub esp, 8
40115A   mov eax, [ebp-4]
40115D   mov [esp], eax
401160   call sub_401075
401165   leave
401166   retn

401167 sub_401167 proc near ; fill_data
401167   push ebp
401168   mov ebp, esp
40116A   sub esp, 8
40116D loc_40116D:
40116D   cmp [ebp-1], 0
401171   jz short loc_401181
401173   movsx eax, [ebp-1]
401177   mov [esp], eax
40117A   call sub_401050
40117F   jmp short loc_40116D
401181 loc_401181:
401181   leave
401182   retn

401183 sub_401183 proc near ; main
401183   push ebp
401184   mov ebp, esp
401186   sub esp, 28h
401189   and esp, 0FFFFFFF0h
40118C   mov eax, 0
401191   add eax, 0Fh
401194   add eax, 0Fh
401197   shr eax, 4
40119A   shl eax, 4
40119D   mov [ebp-14h], eax
4011A0   mov eax, [ebp-14h]
4011A3   call sub_401200
4011A8   call __main
4011AD   mov eax, [ebp-10h]
4011B0   mov [esp], eax
4011B3   call sub_401075
4011B8   mov eax, [ebp-0Ch]
4011BB   mov [esp], eax
4011BE   call sub_401075
4011C3   mov [esp+4], 4
4011CB   mov eax, [ebp-8]
4011CE   mov [esp], eax
4011D1   call sub_4010E4
4011D6   call sub_401120
4011DB   call sub_401167
4011E0   mov eax, [ebp-4]
4011E3   mov [esp], eax
4011E6   call sub_401075
4011EB   call sub_401154
4011F0   mov eax, 0
4011F5   leave
4011F6   retn

Output operations: 401140, 40114D, 401160, 40117A, 4011B3, 4011BE, 4011D1, 4011E6
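Given the user-supplied output functions, the output operations are the call sites that target them. A minimal sketch of that selection, assuming the call instructions have already been pulled out of the listing as "address call target" lines (the `CALLS` text, `CALL_RE`, and `find_output_operations` are hypothetical names, not part of FFE/x86):

```python
import re

# Call instructions taken from the example listing.
CALLS = """\
401140 call sub_401050
40114D call sub_401050
401160 call sub_401075
40117A call sub_401050
4011A3 call sub_401200
4011A8 call __main
4011B3 call sub_401075
4011BE call sub_401075
4011D1 call sub_4010E4
4011D6 call sub_401120
4011DB call sub_401167
4011E6 call sub_401075
4011EB call sub_401154
"""

# User-supplied information: which procedures are output functions.
OUTPUT_FUNCTIONS = {
    "sub_401050": "put_byte",
    "sub_401075": "put_long",
    "sub_4010E4": "writes",
}

CALL_RE = re.compile(r"^([0-9A-Fa-f]+)\s+call\s+(\S+)$")


def find_output_operations(listing, output_functions):
    """Return (address, function name) for every call to an output function."""
    ops = []
    for line in listing.splitlines():
        match = CALL_RE.match(line.strip())
        if match and match.group(2) in output_functions:
            ops.append((match.group(1), output_functions[match.group(2)]))
    return ops


output_ops = find_output_operations(CALLS, OUTPUT_FUNCTIONS)
```

Calls to type, fill_data, chksum, sub_401200, and __main are skipped here; they are not output operations themselves.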
HFSM for our example
[Figure: one FSM per procedure; nodes are call sites]
main (sub_401183):
  4011B3 call sub_401075 (put_long)
  4011BE call sub_401075 (put_long)
  4011D1 call sub_4010E4 (writes)
  4011D6 call sub_401120 (type)
  4011DB call sub_401167 (fill_data)
  4011E6 call sub_401075 (put_long)
  4011EB call sub_401154 (chksum)
type (sub_401120):
  401140 call sub_401050 (put_byte)
  40114D call sub_401050 (put_byte)
fill_data (sub_401167):
  40117A call sub_401050 (put_byte)
chksum (sub_401154):
  401160 call sub_401075 (put_long)
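To make the later steps concrete, here is a rough sketch of how this HFSM could be expanded in line into a regular expression over varying-sized symbols and then used as a recognizer. Several things are assumptions made for illustration: symbol sizes are read off the C signatures (1 byte for put_byte, 4 bytes for put_long, assuming a 32-bit int) and off the constant 4 passed to writes; all values are treated as Top; the branch and loop structure is read off the disassembly; and Python's `re` module stands in for the flex-generated recognizer.

```python
import re


def field(size):
    """One output symbol of `size` bytes with an unknown (Top) value."""
    return ".{%d}" % size


def seq(*parts):
    return "".join(parts)


def alt(*parts):
    return "(?:" + "|".join(parts) + ")"


def star(part):
    return "(?:" + part + ")*"


PUT_BYTE = field(1)
PUT_LONG = field(4)
WRITES_4 = field(4)  # writes(str, 4): four bytes in one operation

# Per-procedure expressions, following the control flow in the listing.
TYPE = alt(PUT_BYTE, PUT_BYTE, "")  # either branch, or neither
FILL_DATA = star(PUT_BYTE)          # loop around a single put_byte
CHKSUM = PUT_LONG

# In-line expansion of main's call sequence.
MAIN = seq(PUT_LONG, PUT_LONG, WRITES_4, TYPE, FILL_DATA, PUT_LONG, CHKSUM)

RECOGNIZER = re.compile(MAIN.encode("ascii"), re.DOTALL)


def accepts(data: bytes) -> bool:
    """Validation step: does `data` fit the extracted format?"""
    return RECOGNIZER.fullmatch(data) is not None
```

With every value left as Top, the expression only constrains length and field boundaries; constant values recovered by VSA would tighten it.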
HFSM for gzip
[Figure: HFSM for gzip, made up of per-procedure FSMs whose nodes are call sites (e.g., call 4056df, call 4051b4, call 404f0e) and other program points; FSM entries shown include 4051b4_ENTRY, 404366_ENTRY, 404145_ENTRY, 403d20_ENTRY, 40510c_ENTRY, 4059c8_ENTRY, 408281_ENTRY, 4057a5_ENTRY, and 404f0e_ENTRY]
- 12 FSMs
- 64 nodes
- 36 call-sites
[Figure: a fragment of the call graph of gzip]
Regular Expression for gzip
[Figure: regular expression for gzip as a sequence of size/value fields, with two * marks]
(size:1, value:0x1F) (size:1, value:0x8B) (size:1, value:0x08) (size:1, value:Top) (size:Top, value:Top) (size:Top, value:Top) (size:4, value:Top) (size:4, value:Top) (size:4, value:Top) (size:1, value:Top) (size:1, value:Top)
If the HFSM is too complicated and there is no recursion, in-line expand it to create a regular expression
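As an illustration of the validation step, the size/value fields can be turned into a byte-level recognizer and tried on real gzip output. This is a sketch under assumptions, not the authors' tool chain: Python's `re` replaces flex, the field order is taken as listed on the slide, the positions of the * marks are not modeled, a field of size Top is approximated by "any number of bytes", and `TOP`, `GZIP_FIELDS`, `field_to_regex`, and `build_recognizer` are hypothetical names.

```python
import gzip
import re

TOP = None  # stands for an unknown size or value (Top)

# (size, value) pairs as listed on the slide.
GZIP_FIELDS = [
    (1, 0x1F), (1, 0x8B), (1, 0x08), (1, TOP),
    (TOP, TOP), (TOP, TOP),
    (4, TOP), (4, TOP), (4, TOP),
    (1, TOP), (1, TOP),
]


def field_to_regex(size, value):
    """Translate one annotated field into a bytes regex fragment."""
    if size is TOP:
        return b".*"              # unknown size: any run of bytes
    if value is TOP:
        return b".{%d}" % size    # known size, unknown value
    return re.escape(value.to_bytes(size, "little"))  # known constant


def build_recognizer(fields):
    pattern = b"".join(field_to_regex(s, v) for s, v in fields)
    return re.compile(pattern, re.DOTALL)


recognizer = build_recognizer(GZIP_FIELDS)
sample = gzip.compress(b"hello")
is_gzip_like = recognizer.fullmatch(sample) is not None
is_zip_like = recognizer.fullmatch(b"PK\x03\x04" + bytes(30)) is not None
```

Because the format is an over-approximation, acceptance only says the data is consistent with the extracted structure; rejection is the more informative outcome.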
Augmenting an HFSM with VSA and ASI information
Organization of CodeSurfer/x86
[Figure: components of CodeSurfer/x86: the executable; IDA Pro (disassemble executable, build CFGs); the Connector (VSA*, ASI*); the CodeSurfer back-end; and the File Format Extractor (FFE/x86)]
* VSA (Value Set Analysis): A combined numeric-analysis and pointer-analysis algorithm that determines an over-approximation of the set of numeric values and addresses that each abstract memory location holds at each program point. (G. Balakrishnan and T. Reps, "Analyzing memory accesses in x86 executables", CC04)
* ASI (Aggregate Structure Identification): A unification-based, flow-insensitive algorithm to identify a program's arrays and structs. (G. Ramalingam et al., "Aggregate structure identification and its application to program analysis", POPL99) (G. Balakrishnan and T. Reps, "Recovery of variables and heap structure in x86 executables", TR-1533, Comp. Sci. Dept., UW-Madison, 2005)
Value Set Analysis (VSA)

Output function:
void put_long(int n) {
  put_short(n&0xffff);
  put_short((ulong)n >> 16);
}
Output operation:
  push 12345678h
  call put_long
[Figure: stack at the call; esp points to the 4-byte argument, stored as the bytes 78h 56h 34h 12h]
size:4
LookupVSA(esp-4x8, 4)=12345678h

Output function:
void writes(char* c, uint len) {
  for(int i=0; i<len; i++) {
    outbuf[outcnt++]=(uchar)(c[i]);
    if(outcnt==OUTBUFSIZE) flush_outbuf();
  }
}
Output operation:
  mov ebx, 1000
  ...
  push 4
  push ebx
  call writes
[Figure: memory starting at address 1000 holds the bytes a, b, c, d; the stack at the call holds 1000 (at esp) and 4]
size:4
LookupVSA(*(esp-4*8))="abcd"
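The two lookups can be mimicked with a concrete byte-addressed memory. This is only a stand-in: real VSA works on abstract stores and over-approximated value sets rather than a single concrete state, and the helper names (`store`, `lookup`) and the initial `esp` value are hypothetical. The state modeled is the one just before each call instruction, and 1000 is read as a decimal address.

```python
def store(memory, addr, data: bytes):
    """Write `data` byte by byte starting at `addr`."""
    for offset, byte in enumerate(data):
        memory[addr + offset] = byte


def lookup(memory, addr, size):
    """Read `size` bytes starting at `addr`."""
    return bytes(memory[addr + offset] for offset in range(size))


memory = {}
esp = 0x2000  # hypothetical initial stack pointer

# push 12345678h ; call put_long
esp -= 4
store(memory, esp, (0x12345678).to_bytes(4, "little"))
put_long_bytes = lookup(memory, esp, 4)            # 78h 56h 34h 12h
put_long_value = int.from_bytes(put_long_bytes, "little")

# mov ebx, 1000 ; push 4 ; push ebx ; call writes
store(memory, 1000, b"abcd")
esp -= 4
store(memory, esp, (4).to_bytes(4, "little"))      # len
esp -= 4
store(memory, esp, (1000).to_bytes(4, "little"))   # pointer argument
pointer = int.from_bytes(lookup(memory, esp, 4), "little")
length = int.from_bytes(lookup(memory, esp + 4, 4), "little")
writes_value = lookup(memory, pointer, length)     # dereference the pointer
```

For put_long the argument itself is the output value, while for writes the size comes from one argument and the value from dereferencing the other.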
[Figure: output structure before and after annotation with value information; fields are listed in the order they appear on the slide, and the diagram also carries * marks]
Before (every value is ?), sizes: 4, 4, 4, 4, 4, 2, 2, 4, 4, 4, 4, 4, Top, 1, Top, 2, 2, 2, 1, 1, 1, 1, 2, 4, 4, 2
After (size:value): 4:40, 4:Top, 4:0, 4:Top, 4:Top, 2:1, 2:Top, 4:0, 4:Top, 4:0, 4:0, 4:0, Top:Top, 1:0, Top:Top, 2:0, 2:1, 2:Top, 1:Top, 1:Top, 1:Top, 1:0, 2:0, 4:Top, 4:Top, 2:Top
Aggregate Structure Identification (ASI)
ASI output: [14][14]
call sendto call sendto ...
Experiments
• gzip
– GNU data-compression program
• png2ico
– converts PNG files to Windows icon-resource files
• ping
– sends ICMP ECHO_REQUEST packets to a host to see if the host is reachable via the network
gzip
[Figure: extracted structure for gzip as a sequence of size/value fields, with two * marks]
(size:1, value:0x1F) (size:1, value:0x8B) (size:1, value:0x08) (size:1, value:Top) (size:Top, value:Top) (size:Top, value:Top) (size:4, value:Top) (size:4, value:Top) (size:4, value:Top) (size:1, value:Top) (size:1, value:Top)
png2ico (1)
• Usage scenario
– Find inconsistencies between specifications and implementations
png2ico (2)
[Figure: annotated output structure for png2ico, with one field flagged "bug?"; fields are listed in the order they appear on the slide, and the diagram also carries * marks]
Fields (size:value): 4:40, 4:Top, 4:0, 4:Top, 4:Top, 2:1, 2:Top, 4:0, 4:Top, 4:0, 4:0, 4:0, Top:Top, 1:0, Top:Top, 2:0, 2:1, 2:Top, 1:Top, 1:Top, 1:Top, 1:0, 2:0, 4:Top, 4:Top, 2:Top
png2ico (3)
• We found an inconsistency between the file-format specification for Windows icons and the converter png2ico
– png2ico → regular exp. → flex spec. → recognizer
– Windows icon files → recognizer → failure!
png2ico (4)
[Figure: the annotated png2ico output structure from png2ico (2), shown with the source line below]
writeWord(outfile,0); // wPlanes
ping (1)
[Figure: HFSM for ping, with FSMs for main (main entry, main exit) and catcher (catcher entry, catcher exit) whose nodes are calls to pinger and catcher, together with the resulting expression over pinger using the * and ? operators]
The HFSM gives a hint about the behavior of ping.
ping (2)
typedef struct icmp {
  uint8 icmp_type;      /* type of message, see below */
  uint8 icmp_code;      /* type sub code */
  uint16 icmp_checksum; /* ones complement cksum of struct */
#define icmp_cksum icmp_checksum
  union {
    uint8 ih_pptr;            /* ICMP_PARAMPROB */
    struct in_addr ih_gwaddr; /* ICMP_REDIRECT */
    struct ih_idseq {
      uint16 icd_id;
      uint16 icd_seq;
    } ih_idseq;
    int ih_void;
    /* ICMP_UNREACH_NEEDFRAG – Path MTU Discovery (RFC1191) */
    struct ih_pmtu {
      uint16 ipm_void;
      uint16 ipm_nextmtu;
    } ih_pmtu;
    struct ih_rtradv {
      uint8 irt_num_addrs;
      uint8 irt_wpa;
      uint16 irt_lifetime;
    } ih_rtradv;
  } icmp_hun;
#define icmp_pptr icmp_hun.ih_pptr
  ...
  union {
    struct id_ts {
      uint32 its_otime;
      uint32 its_rtime;
      uint32 its_ttime;
    } id_ts;
    struct id_ip {
      struct ip idi_ip;
      /* options and then 64 bits of data */
    } id_ip;
    struct icmp_ra_addr id_radv;
    uint32 id_mask;
    char id_data[1];
  } icmp_dun;
#define icmp_otime icmp_dun.id_ts.its_otime
  ...
} icmp_t;
[Figure: expression over pinger, transcribed as "pinger pinger* pinger* ? pinger", with the structure of one pinger output given below]
(size:1, value:Top) (size:1, value:Top) (size:2, value:Top) (size:2, value:Top) (size:2, value:Top)
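The extracted sizes 1, 1, 2, 2, 2 line up with the leading members of the struct when the icd_id/icd_seq branch of the union is used, as in an echo request. A small sketch of that layout with Python's `struct` module; the pairing of extracted fields with member names is an interpretation rather than something the slide states, and the id and sequence values are made up for the example.

```python
import struct

# Network byte order: uint8, uint8, uint16, uint16, uint16.
FIELD_CODES = "BBHHH"
FIELD_NAMES = ["icmp_type", "icmp_code", "icmp_checksum", "icd_id", "icd_seq"]
ECHO_HEADER = struct.Struct("!" + FIELD_CODES)

ICMP_ECHO = 8  # ICMP type number of an echo request

# Size in bytes of each field, in order.
field_sizes = [struct.calcsize("!" + code) for code in FIELD_CODES]

# Hypothetical header: checksum left as 0, id and sequence number invented.
header = ECHO_HEADER.pack(ICMP_ECHO, 0, 0, 0x1234, 1)
layout = dict(zip(FIELD_NAMES, field_sizes))
```

Every value in the extracted structure is Top, so only the field boundaries are recovered here, not the constant type code an echo request carries.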
Conclusion
- A technique for extracting an over-approximation of a program's output data format, including
  - a way to extract a preliminary structure for the output data format
  - a way to elaborate the structure by annotating it with information about possible output values and sizes
Over-Approximation?
• Yes, modulo . . .
– All operations must append to the output
  • No tracking of file-pointer rewind, seek, . . .
– Multiple different formats in a program
– Signals and exceptions ignored
  • In principle, could use the same technique used in the MOPS tool
Possible Future Work
- Automatic detection of output functions
- Other operation sequences → other formats
– Input operations
– Network-communication operations
- Adoption of a learning technique for refining output formats
Thank you! Clarifications?
Identifying Output Operations
• The IDA Pro disassembler identifies library output procedures
• Typically, inspect the call graph to choose which application procedures should be considered output wrappers