Mining Specifications

Glenn Ammons, Dept. Computer Science University of Wisconsin

Rastislav Bodik, Computer Science Division University of California, Berkeley

James R. Larus, Microsoft Research

POPL 2002

Motivation

Formal verification is a promising alternative to software testing.

But verifiers are of little use without correctness specifications to verify programs against.

The Assumption

Common behavior is (often) correct behavior.

If we can identify common behavior we can produce correct specifications, even from programs that contain errors.

A Program Using socket API

    int s = socket(AF_INET, SOCK_STREAM, 0);   /* create the listening socket */
    …
    bind(s, &serv_addr, sizeof(serv_addr));
    …
    listen(s, 5);
    …
    while (1) {
        int ns = accept(s, &addr, &len);       /* one new socket per connection */
        if (ns < 0) break;
        do {
            read(ns, buffer, 255);
            …
            write(ns, buffer, size);
            if (cond1) return;
        } while (cond2);
        close(ns);
    }
    close(s);

An Example Trace

1 socket(domain = 2, type = 1, proto = 0, return = 7)

2 bind(so = 7, addr = 0x400120, addr_len = 6, return = 0)

3 listen(so = 7, backlog = 5, return = 0)

4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)

5 read(fd = 8, buf = 0x400320, len = 255, return = 12)

6 write(fd = 8, buf = 0x400320, len = 12, return = 12)

7 read(fd = 8, buf = 0x400320, len = 255, return = 7)

8 write(fd = 8, buf = 0x400320, len = 7, return = 7)

9 close(fd = 8, return = 0)

10 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 10)

11 read(fd = 10, buf = 0x400320, len = 255, return = 13)

12 write(fd = 10, buf = 0x400320, len = 13, return = 13)

13 close(fd = 10, return = 0)

14 close(fd = 7, return = 0)

Design Decisions

1. Learn from traces, not from source.

• Traces contain fewer bugs.

2. Take a “vote” on what the common program behavior is.

• The high-probability core encodes the frequently followed protocol.

Mining System

[Figure: the mining pipeline. Program → Tracer → Instrumented program → Run (on test inputs) → Traces → Flow dependence annotator → Annotated traces → Scenario extractor (given a scenario seed) → Abstract scenario strings → Automaton learner → Specifications]

The (unsolvable) Problem

• I - the set of all traces of interaction with an API or ADT.

• C ⊆ I - the set of all correct traces of interaction.

• T - an unlabelled training set of interaction traces.

Find an automaton A that generates exactly the traces in C.

Restriction 1

• C must be a regular language.

– Model checkers require finite-state specifications.

– Algorithms for learning finite-state automata are relatively well developed.

Interaction Scenarios

[Figure: a trace of LinkedList(n) interleaves malloc and free calls on n objects O1, O2, …, On. It is split into one scenario per object: malloc(return = O1) free(p = O1), malloc(return = O2) free(p = O2), …, malloc(return = On) free(p = On). Renaming every object to a standard name Ostd makes all n scenarios identical: malloc(return = Ostd) free(p = Ostd).]

The Problem – Take 2

• I_S - the set of all interaction scenarios with an API or ADT that manipulate no more than k data objects.

• C_S ⊆ I_S - the regular set of all correct scenarios.

• T_S - an unlabelled training set of interaction scenarios from I_S.

Find a finite-state automaton A_S that generates exactly the scenarios in C_S.

Restriction 2 - Linking T_S and C_S

Let T_S = c0, c1, … be an infinite sequence of elements from C_S in which each element of C_S occurs at least once.

• For each n ≥ 0, the learner maps the prefix c0, c1, …, cn to an automaton A_S^(n).

• For some N ≥ 0, A_S^(N) generates exactly the scenarios in C_S, and A_S^(n) = A_S^(N) for all n ≥ N.

Then the sequence A_S^(0), A_S^(1), … identifies C_S in the limit.

The Probabilistic Approach

• I_S - as before.

• M - a target probabilistic finite-state automaton (PFSA), and P_M - the distribution over I_S that M generates.

“Efficiently” find a PFSA M’ such that its distribution P_M’ is an ε-good approximation of P_M.

Mining System

[Figure: the mining pipeline again, introducing its stages in order: Tracer, Flow dependence annotator, Scenario extractor, Automaton learner.]

Tracer

1. C stdio replacement (requires recompilation)

2. Executable editing

1 socket(domain = 2, type = 1, proto = 0, return = 7)

2 bind(so = 7, addr = 0x400120, addr_len = 6, return = 0)

3 listen(so = 7, backlog = 5, return = 0)

4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)

Skeleton: interaction(attribute0, …, attributen)

Flow Dependence

[Figure: Traces → Dependence analysis → Untyped trace with dependencies → Type inference → Annotated traces]

Dependence Analysis

• Takes a list of attributes that define or use objects (manually created).

• Creates a flow dependence between users and definers.

Definers: socket.return, bind.so, listen.so, accept.return, close.fd

Users: bind.so, listen.so, accept.so, read.fd, write.fd, close.fd
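The slide does not say how a user is matched to a definer. The following is a minimal Python sketch under one plausible assumption: each use depends on the most recent earlier definer that carried the same value. The `TRACE` encoding and the function name `flow_dependences` are hypothetical; the definer and user lists are the ones given on the slide.

```python
# Hypothetical sketch: link each "user" attribute to the most recent
# earlier "definer" attribute that carried the same value (an assumption).

DEFINERS = {"socket.return", "bind.so", "listen.so", "accept.return", "close.fd"}
USERS = {"bind.so", "listen.so", "accept.so", "read.fd", "write.fd", "close.fd"}

# (event number, interaction name, {attribute: value}), taken from the
# example trace. Event 7 keeps its byte count (return = 7) to show that a
# value matching a descriptor is ignored when the attribute is not listed.
TRACE = [
    (1, "socket", {"return": 7}),
    (2, "bind", {"so": 7}),
    (3, "listen", {"so": 7}),
    (4, "accept", {"so": 7, "return": 8}),
    (5, "read", {"fd": 8}),
    (6, "write", {"fd": 8}),
    (7, "read", {"fd": 8, "return": 7}),
    (8, "write", {"fd": 8}),
    (9, "close", {"fd": 8}),
]


def flow_dependences(trace):
    """Return (definer_event, definer_attr, user_event, user_attr) tuples."""
    last_definer = {}  # value -> (event number, qualified attribute name)
    deps = []
    for number, name, attrs in trace:
        # Resolve uses first, so an attribute that both uses and defines
        # (e.g. bind.so) depends on the previous definer, not on itself.
        for attr, value in attrs.items():
            qualified = f"{name}.{attr}"
            if qualified in USERS and value in last_definer:
                deps.append((*last_definer[value], number, qualified))
        for attr, value in attrs.items():
            qualified = f"{name}.{attr}"
            if qualified in DEFINERS:
                last_definer[value] = (number, qualified)
    return deps


DEPS = flow_dependences(TRACE)
for dep in DEPS:
    print(dep)
```

Under this assumption the first three calls form a chain (socket, bind, listen, accept), and every read, write, and close on descriptor 8 depends on the accept that returned it.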

Type Inference

If a flow dependence exists between two attributes, type inference gives both attributes the same type.

Type(socket.return)=T0

Type(bind.so)=T0

Type(listen.so)=T0

Type(accept.so)=T0

Type(accept.return)=T0

Type(read.fd)=T0

Type(write.fd)=T0

Type(close.fd)=T0
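The typing rule above amounts to grouping attributes into equivalence classes. A minimal Python sketch using union-find follows; the dependence pairs are an assumption derived from the example trace (each use linked to the most recent definer of the same value), and `infer_types` is a hypothetical name.

```python
# Hypothetical sketch: attributes joined by a flow dependence share a type.

# (definer attribute, user attribute) pairs assumed for the socket trace.
DEPENDENCES = [
    ("socket.return", "bind.so"),
    ("bind.so", "listen.so"),
    ("listen.so", "accept.so"),
    ("accept.return", "read.fd"),
    ("accept.return", "write.fd"),
    ("accept.return", "close.fd"),
    ("listen.so", "close.fd"),  # close(fd = 7) closes the listening socket
]


def infer_types(dependences):
    """Map each attribute to a type name, one name per connected group."""
    parent = {}

    def find(attr):
        parent.setdefault(attr, attr)
        while parent[attr] != attr:
            parent[attr] = parent[parent[attr]]  # path halving
            attr = parent[attr]
        return attr

    for definer, user in dependences:
        parent[find(definer)] = find(user)  # merge the two groups

    names = {}  # group representative -> type name
    types = {}
    for attr in parent:
        root = find(attr)
        if root not in names:
            names[root] = f"T{len(names)}"
        types[attr] = names[root]
    return types


TYPES = infer_types(DEPENDENCES)
for attr, type_name in sorted(TYPES.items()):
    print(f"Type({attr})={type_name}")
```

In this sketch, close.fd is what merges the two groups: it is used on both the listening socket and the accepted connection, so accept.return ends up with the same type as socket.return, as on the slide.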

Scenario Extraction

[Figure: Annotated traces and scenario seeds → Extraction → Scenarios → Simplification → Simplified scenarios → Standardization → Abstract scenario strings]

Extraction

• A scenario is a set of interactions related by flow dependences.

1 socket(domain = 2, type = 1, proto = 0, return = 7)

2 bind(so = 7, addr = 0x400120, addr_len = 6, return = 0)

3 listen(so = 7, backlog = 5, return = 0)

4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)

5 read(fd = 8, buf = 0x400320, len = 255, return = 12)

6 write(fd = 8, buf = 0x400320, len = 12, return = 12)

7 read(fd = 8, buf = 0x400320, len = 255, return = 7)

8 write(fd = 8, buf = 0x400320, len = 7, return = 7)

9 close(fd = 8, return = 0)
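One way to read "related by flow dependences" is sketched below in Python: the scenario is the seed event together with the events it transitively depends on and the events that transitively depend on it. This reading, the `EDGES` list (built by assuming each use depends on the most recent definer of the same value), and the function names are assumptions, not the authors' stated algorithm.

```python
# Hypothetical sketch: a scenario is the seed event plus its ancestors and
# descendants in the flow-dependence graph.

# (definer event, user event) edges for the 14-event example trace.
EDGES = [
    (1, 2), (2, 3), (3, 4),                  # socket, bind, listen, accept
    (4, 5), (4, 6), (4, 7), (4, 8), (4, 9),  # first connection (fd = 8)
    (3, 10),                                 # second accept on so = 7
    (10, 11), (10, 12), (10, 13),            # second connection (fd = 10)
    (3, 14),                                 # close(fd = 7)
]


def reachable(start, edges):
    """Events reachable from start by following edges forward."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in successors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def extract_scenario(seed, edges):
    """Return the sorted event numbers of the scenario around seed."""
    descendants = reachable(seed, edges)
    ancestors = reachable(seed, [(dst, src) for src, dst in edges])
    return sorted(ancestors | {seed} | descendants)


print(extract_scenario(4, EDGES))   # seed: first accept
print(extract_scenario(10, EDGES))  # seed: second accept
```

With the first accept (event 4) as the seed, this yields events 1 through 9, the scenario listed above. The second accept yields events 1, 2, 3, 10, 11, 12, and 13.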

Simplification

Eliminate all interaction attributes that do not carry a flow dependence.

1 socket(return = 7)

2 bind(so = 7)

3 listen(so = 7)

4 accept(so = 7, return = 8) [seed]

5 read(fd = 8)

6 write(fd = 8)

7 read(fd = 8)

8 write(fd = 8)

9 close(fd = 8)
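Simplification can be sketched in Python as a filter over attributes. The sketch assumes an attribute carries a flow dependence exactly when it appears in the definer or user lists; the event encoding and the name `simplify` are hypothetical.

```python
# Hypothetical sketch: keep only attributes that carry a flow dependence.

# Attributes that define or use objects (the manually written list).
OBJECT_ATTRS = {
    "socket.return", "bind.so", "listen.so", "accept.so",
    "accept.return", "read.fd", "write.fd", "close.fd",
}


def simplify(event):
    """Drop every attribute of an event that is not in OBJECT_ATTRS."""
    name, attrs = event
    kept = {a: v for a, v in attrs.items() if f"{name}.{a}" in OBJECT_ATTRS}
    return name, kept


accept = ("accept", {"so": 7, "addr": 0x400200, "addr_len": 0x400240, "return": 8})
read = ("read", {"fd": 8, "buf": 0x400320, "len": 255, "return": 12})
print(simplify(accept))
print(simplify(read))
```

Buffer addresses, lengths, and byte counts disappear, which is why read and write keep only their descriptor.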

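Standardization

1. Naming: replaces attribute values with symbolic variables.

2. Reordering: puts interactions that do not depend on each other into a standard order (here, both reads come before both writes).

Each interaction is then abbreviated by a letter, shown in parentheses.

1 socket(return = x0:T0) (A)

2 bind(so = x0:T0) (B)

3 listen(so = x0:T0) (C)

4 accept(so = x0:T0, return = x1:T0) [seed] (D)

5 read(fd = x1:T0) (E)

7 read(fd = x1:T0) (E)

6 write(fd = x1:T0) (F)

8 write(fd = x1:T0) (F)

9 close(fd = x1:T0) (G)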
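The naming step can be sketched in Python as follows. The sketch assumes variables are numbered x0, x1, … in order of first appearance and that every attribute has type T0, as in the type inference slide; `standardize_names` is a hypothetical name, and the reordering step is not attempted here.

```python
# Hypothetical sketch of the naming step: replace concrete values with
# symbolic variables numbered in order of first appearance.

SCENARIO = [
    ("socket", {"return": 7}),
    ("bind", {"so": 7}),
    ("listen", {"so": 7}),
    ("accept", {"so": 7, "return": 8}),
    ("read", {"fd": 8}),
    ("write", {"fd": 8}),
    ("close", {"fd": 8}),
]


def standardize_names(scenario, type_name="T0"):
    """Return the scenario with values rewritten as 'xN:type' strings."""
    variables = {}  # concrete value -> symbolic variable
    result = []
    for name, attrs in scenario:
        renamed = {}
        for attr, value in attrs.items():
            if value not in variables:
                variables[value] = f"x{len(variables)}"
            renamed[attr] = f"{variables[value]}:{type_name}"
        result.append((name, renamed))
    return result


NAMED = standardize_names(SCENARIO)
for name, attrs in NAMED:
    args = ", ".join(f"{a} = {v}" for a, v in attrs.items())
    print(f"{name}({args})")
```

Because names depend only on order of first appearance, two scenarios that use different descriptor numbers map to the same standardized text.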

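Automaton Learning

1. An off-the-shelf (OTS) learner learns a PFSA.

2. A corer removes infrequently traversed edges and converts the PFSA into an NFA.

[Figure: a PFSA with a start state and a final state. Three edges are labeled with a traversal count of 10000 and four edges with a count of 5.]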
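The slide does not give the corer's criterion. The Python sketch below substitutes the simplest possible one, a fixed count threshold, and then keeps only edges still reachable from the start state. The topology, edge labels, and state names are invented for illustration; only the counts (10000 and 5) come from the figure.

```python
# Hypothetical sketch: threshold-based coring of a PFSA given as edge counts.
# This is a simple stand-in, not necessarily the criterion the authors use.

# (source state, label, target state, traversal count)
PFSA_EDGES = [
    ("start", "a", "s1", 10000),
    ("s1", "b", "s2", 10000),
    ("s2", "c", "final", 10000),
    ("start", "x", "s3", 5),
    ("s3", "y", "s1", 5),
    ("s1", "z", "s4", 5),
    ("s4", "w", "final", 5),
]


def core(edges, min_count):
    """Drop rarely traversed edges and forget the counts, leaving an NFA."""
    kept = [(src, label, dst)
            for src, label, dst, count in edges if count >= min_count]
    # Discard edges that can no longer be reached from the start state.
    reachable, changed = {"start"}, True
    while changed:
        changed = False
        for src, _, dst in kept:
            if src in reachable and dst not in reachable:
                reachable.add(dst)
                changed = True
    return [(src, label, dst) for src, label, dst in kept if src in reachable]


NFA = core(PFSA_EDGES, min_count=100)
for edge in NFA:
    print(edge)
```

With any threshold between 6 and 10000, only the heavily traversed path from start to final survives.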

Specification Automaton for the Socket Protocol

[Figure: automaton whose edges are labeled as follows.]

socket(return = x)

bind(so = x)

listen(so = x)

accept(so = x, return = y)

read(fd = y)

write(fd = y)

close(fd = x)

close(fd = y)

Experimental Results

• Analyzed traces from programs that use the Xlib and X Toolkit Intrinsics libraries for the X11 windowing system.

• Traces were generated manually.

• Compared the mined specifications to the rules in the Interclient Communication Conventions Manual (ICCCM).

Experimental Results

• A small and buggy training set prevented the miner from discovering the rule.

• Solution: an expert chooses correct traces as the training set.

Benefits

• Exploits the massive programmer effort that is reflected in the code (and nowhere else).

• Offers convenience and insights. It is easier to approve a mined formal specification than to write one.

Conclusion

• Introduced a (semi-)automatic machine-learning approach for discovering formal specifications.

• Reduced the problem to learning regular languages.

• Initial experience is promising.