Mining Specifications

Glenn Ammons, Dept. Computer Science University of Wisconsin

Rastislav Bodik, Computer Science Division University of California, Berkeley

James R. Larus, Microsoft Research

POPL 2002


Motivation

Formal verification is a promising alternative to software testing

But verifiers will be of little use without enough correctness specifications to verify.


The Assumption

Common behavior is (often) correct behavior.

If we can identify common behavior, we can produce correct specifications, even from programs that contain errors.

A Program Using the socket API

    int s = socket(AF_INET, SOCK_STREAM, 0);
    /* … */
    bind(s, &serv_addr, sizeof(serv_addr));
    /* … */
    listen(s, 5);
    /* … */
    while (1) {
        int ns = accept(s, &addr, &len);
        if (ns < 0) break;
        do {
            read(ns, buffer, 255);
            /* … */
            write(ns, buffer, size);
            if (cond1) return;
        } while (cond2);
        close(ns);
    }
    close(s);


An Example Trace

1 socket(domain = 2, type = 1, proto = 0, return = 7)

2 bind(so = 7, addr = 0x400120, addr_len = 6, return = 0)

3 listen(so = 7, backlog = 5, return = 0)

4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)

5 read(fd = 8, buf = 0x400320, len = 255, return = 12)

6 write(fd = 8, buf = 0x400320, len = 12, return = 12)

7 read(fd = 8, buf = 0x400320, len = 255, return = 7)

8 write(fd = 8, buf = 0x400320, len = 7, return = 7)

9 close(fd = 8, return = 0)

10 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 10)

11 read(fd = 10, buf = 0x400320, len = 255, return = 13)

12 write(fd = 10, buf = 0x400320, len = 13, return = 13)

13 close(fd = 10, return = 0)

14 close(fd = 7, return = 0)
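The trace format on this slide is regular enough to parse mechanically. The following is a minimal sketch in Python, assuming one event per line in exactly the form shown; `parse_event` and `EVENT_RE` are hypothetical names used for illustration, not part of the authors' tracer.

```python
import re

# Matches e.g. "4 accept(so = 7, addr = 0x400200, return = 8)";
# the leading event number is optional.
EVENT_RE = re.compile(r"^\s*(?:\d+\s+)?(\w+)\((.*)\)\s*$")

def parse_event(line):
    """Parse one trace line into (interaction name, {attribute: value})."""
    m = EVENT_RE.match(line)
    if m is None:
        raise ValueError(f"not a trace event: {line!r}")
    name, body = m.groups()
    attrs = {}
    for field in body.split(","):
        if not field.strip():
            continue
        key, _, value = field.partition("=")
        # Base 0 lets int() accept both decimal and 0x-prefixed values.
        attrs[key.strip()] = int(value.strip(), 0)
    return name, attrs

event = parse_event(
    "4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)"
)
```

Here `event` holds the interaction name `accept` together with its four attributes, which is the shape the later sketches assume for a trace event.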

Design Decisions

1. Learn from traces, not from source.

• Traces contain fewer bugs.

2. Take a “vote” on what the common program behavior is.

• The high-probability core encodes the frequently followed protocol.

Mining System

[Figure: the mining pipeline. Program → Tracer → Instrumented program; Instrumented program + Test inputs → Run → Traces → Flow dependence annotator → Annotated traces; Annotated traces + Scenario seed → Scenario extractor → Abstract scenario strings → Automaton learner → Specifications.]

The (unsolvable) Problem

• I - the set of all traces of interaction with an API or ADT.

• C ⊆ I - the set of all correct traces of interaction.

• T - an unlabelled training set of interaction traces.

Find an automaton A that generates exactly the traces in C.

Restriction 1

• C must be a regular language.

– Model checkers require finite-state specifications.

– Algorithms for learning finite-state automata are relatively well developed.

Interaction Scenarios

[Figure: the trace of LinkedList(n) at four levels of detail.]

1. Interactions only: malloc, malloc, …, malloc, free, …, free, free

2. With object attributes: malloc(return = O1), malloc(return = O2), …, malloc(return = On), free(p = On), …, free(p = O2), free(p = O1)

3. One scenario per object: O1 {malloc(return = O1), free(p = O1)}, O2 {malloc(return = O2), free(p = O2)}, …, On {malloc(return = On), free(p = On)}

4. With standardized object names: O1 {malloc(return = Ostd), free(p = Ostd)}, O2 {malloc(return = Ostd), free(p = Ostd)}, …, On {malloc(return = Ostd), free(p = Ostd)}
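The step from one long trace to many identical short scenarios can be sketched in a few lines of Python. This is an illustration only: it assumes each event is a pair of an interaction name and an object name, and `split_by_object` and `standardize` are hypothetical helpers.

```python
from collections import defaultdict

def split_by_object(trace):
    """Group a trace of (interaction, object) pairs into one scenario per object."""
    scenarios = defaultdict(list)
    for name, obj in trace:
        scenarios[obj].append((name, obj))
    return dict(scenarios)

def standardize(scenario, std_name="Ostd"):
    """Replace the concrete object name with one standard name."""
    return [(name, std_name) for name, _ in scenario]

# The LinkedList(n) trace: n mallocs followed by n frees in reverse order.
n = 3
trace = [("malloc", f"O{i}") for i in range(1, n + 1)]
trace += [("free", f"O{i}") for i in range(n, 0, -1)]

scenarios = split_by_object(trace)
standardized = {tuple(standardize(s)) for s in scenarios.values()}
```

Whatever value `n` takes, `standardized` contains a single scenario, malloc followed by free on `Ostd`. The full trace grows with `n`, but the set of scenarios does not, which is what makes a finite-state description possible.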

The Problem – Take 2

• I_S - the set of all interaction scenarios with an API or ADT that manipulate no more than k data objects.

• C_S ⊆ I_S - the regular set of all correct scenarios.

• T_S - an unlabelled training set of interaction scenarios from I_S.

Find a finite-state automaton A_S that generates exactly the scenarios in C_S.

Restriction 2 - Linking T_S and C_S

Let T_S = c_0, c_1, … be an infinite sequence of elements from C_S in which each element of C_S occurs at least once.

For each n > 0: c_0, c_1, …, c_n → A_S^n (the automaton learned from the first n + 1 scenarios).

If, for some N ≥ 0, A_S^N generates exactly the scenarios in C_S and A_S^n = A_S^N for all n ≥ N, then A_S^0, A_S^1, … identifies C_S in the limit.

The Probabilistic Approach

• I_S – as before.

• M – a target probabilistic finite-state automaton (PFSA), and P_M – the distribution over I_S that M generates.

“Efficiently” find a PFSA M' such that its distribution P_M' is an ε-good approximation of P_M.

Mining System

[Figure repeated: the mining pipeline, from Tracer and Run through Flow dependence annotator and Scenario extractor to Automaton learner.]

Tracer

1. C stdio replacement (requires recompilation)

2. Executable editing

1 socket(domain = 2, type = 1, proto = 0, return = 7)

2 bind(so = 7, addr = 0x400120, addr_len = 6, return = 0)

3 listen(so = 7, backlog = 5, return = 0)

4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)

skeleton: interaction(attribute_0, …, attribute_n)

Flow Dependence

[Figure: Traces → Dependence analysis → Untyped trace with dependencies → Type inference → Annotated traces.]

Dependence Analysis

• Takes a list of attributes that define or use objects (manually created).

• Creates a flow dependence between users and definers.

Definers: socket.return, bind.so, listen.so, accept.return, close.fd

Users: bind.so, listen.so, accept.so, read.fd, write.fd, close.fd
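A rough Python sketch of how such dependences might be computed from the two lists. It rests on an assumption that the slide does not state: a user depends on the most recent earlier event that defined the same value. The function name `flow_dependences` and the event representation are hypothetical.

```python
# Definer and user attributes as listed on the slide.
DEFINERS = {("socket", "return"), ("bind", "so"), ("listen", "so"),
            ("accept", "return"), ("close", "fd")}
USERS = {("bind", "so"), ("listen", "so"), ("accept", "so"),
         ("read", "fd"), ("write", "fd"), ("close", "fd")}

def flow_dependences(trace):
    """Return sorted (definer index, user index) pairs.

    Each event is (name, {attribute: value}). Assumption: a user depends on
    the most recent earlier event that defined the same value.
    """
    last_definer = {}  # value -> index of the event that last defined it
    deps = set()
    for i, (name, attrs) in enumerate(trace):
        # Handle uses before definitions so an event never depends on itself.
        for attr, value in attrs.items():
            if (name, attr) in USERS and value in last_definer:
                deps.add((last_definer[value], i))
        for attr, value in attrs.items():
            if (name, attr) in DEFINERS:
                last_definer[value] = i
    return sorted(deps)

trace = [
    ("socket", {"return": 7}),
    ("bind", {"so": 7, "return": 0}),
    ("listen", {"so": 7, "return": 0}),
    ("accept", {"so": 7, "return": 8}),
    ("read", {"fd": 8, "return": 12}),
    ("close", {"fd": 8, "return": 0}),
]
deps = flow_dependences(trace)
```

Because `bind.so` and `listen.so` appear in both lists, the dependences form a chain socket → bind → listen → accept, and both `read` and `close` depend on the `accept` that returned descriptor 8.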

Type Inference

If a flow dependence exists between two attributes, type inference gives these attributes the same type.

Type(socket.return)=T0

Type(bind.so)=T0

Type(listen.so)=T0

Type(accept.so)=T0

Type(accept.return)=T0

Type(read.fd)=T0

Type(write.fd)=T0

Type(close.fd)=T0
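Giving dependent attributes the same type amounts to computing equivalence classes, for which union-find is a natural fit. The sketch below is one possible realization in Python, not the authors' implementation; `infer_types` and the list of attribute pairs are illustrative assumptions.

```python
def infer_types(attribute_pairs):
    """Assign one type name per group of attributes linked by flow dependences."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in attribute_pairs:
        parent[find(a)] = find(b)

    # Name the groups T0, T1, ... in a deterministic order.
    type_names = {}
    types = {}
    for attr in sorted(parent):
        root = find(attr)
        type_names.setdefault(root, f"T{len(type_names)}")
        types[attr] = type_names[root]
    return types

# Pairs of attributes connected by a flow dependence in the example trace.
pairs = [
    ("socket.return", "bind.so"),
    ("bind.so", "listen.so"),
    ("listen.so", "accept.so"),
    ("accept.return", "read.fd"),
    ("accept.return", "write.fd"),
    ("accept.return", "close.fd"),
    ("socket.return", "close.fd"),  # close(fd = 7) closes the listening socket
]
types = infer_types(pairs)
```

The last pair is what merges the two groups: `close` is applied both to the listening socket and to accepted connections, so all eight attributes end up with type `T0`, as on the slide.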

Scenario Extraction

[Figure: Annotated traces + Scenario seeds → Extraction → scenarios → Simplification → simplified scenarios → Standardization → Abstract scenario strings.]


Extraction

• A scenario is a set of interactions related by flow dependences.

1 socket(domain = 2, type = 1, proto = 0, return = 7)

2 bind(so = 7, addr = 0x400120, addr_len = 6, return = 0)

3 listen(so = 7, backlog = 5, return = 0)

4 accept(so = 7, addr = 0x400200, addr_len = 0x400240, return = 8)

5 read(fd = 8, buf = 0x400320, len = 255, return = 12)

6 write(fd = 8, buf = 0x400320, len = 12, return = 12)

7 read(fd = 8, buf = 0x400320, len = 255, return = 7)

8 write(fd = 8, buf = 0x400320, len = 7, return = 7)

9 close(fd = 8, return = 0)
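One way to read "related by flow dependences" is as reachability from the seed in the dependence graph. The Python sketch below takes that reading as an assumption: the scenario is the seed, everything the seed transitively depends on, and everything that transitively depends on the seed. The dependence pairs are also assumed (each user depends on the most recent definer of its value), and `extract_scenario` is a hypothetical name.

```python
def extract_scenario(deps, seed):
    """Return sorted indices of events linked to `seed` by flow dependences.

    `deps` holds (definer index, user index) pairs.
    """
    def reach(start, edges):
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    forward, backward = {}, {}
    for definer, user in deps:
        forward.setdefault(definer, []).append(user)
        backward.setdefault(user, []).append(definer)
    return sorted({seed} | reach(seed, forward) | reach(seed, backward))

# Assumed dependences for the 14-event example trace, 0-based
# (event 1 on the slide is index 0 here).
deps = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5), (3, 6), (3, 7), (3, 8),
        (2, 9), (9, 10), (9, 11), (9, 12), (2, 13)]
first = extract_scenario(deps, seed=3)   # seed: first accept
second = extract_scenario(deps, seed=9)  # seed: second accept
```

Under these assumptions the first `accept` yields events 1 through 9 of the trace, matching the slide, and the second `accept` yields a separate scenario that shares the socket, bind, and listen prefix.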


Simplification

Eliminate all interaction attributes that do not carry a flow dependence.

1 socket(return = 7)

2 bind(so = 7)

3 listen(so = 7)

4 accept(so = 7, return = 8) [seed]

5 read(fd = 8)

6 write(fd = 8)

7 read(fd = 8)

8 write(fd = 8)

9 close(fd = 8)
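Simplification is a filter over attributes. A minimal Python sketch, assuming events are (name, attribute dictionary) pairs and taking the dependence-carrying attributes from the definer and user lists of the Dependence Analysis slide; `simplify` is a hypothetical helper.

```python
# Attributes that define or use objects (Dependence Analysis slide).
DEPENDENCE_ATTRS = {
    ("socket", "return"), ("bind", "so"), ("listen", "so"),
    ("accept", "so"), ("accept", "return"),
    ("read", "fd"), ("write", "fd"), ("close", "fd"),
}

def simplify(scenario, keep=DEPENDENCE_ATTRS):
    """Keep only the attributes that carry a flow dependence."""
    return [
        (name, {a: v for a, v in attrs.items() if (name, a) in keep})
        for name, attrs in scenario
    ]

scenario = [
    ("bind", {"so": 7, "addr": 0x400120, "addr_len": 6, "return": 0}),
    ("accept", {"so": 7, "addr": 0x400200, "addr_len": 0x400240, "return": 8}),
    ("read", {"fd": 8, "buf": 0x400320, "len": 255, "return": 12}),
]
simplified = simplify(scenario)
```

Buffer addresses, lengths, and status codes disappear, leaving only the descriptors that tie the interactions together.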

Standardization

1. Naming: replaces attribute values with symbolic variables.

2. Reordering

1 socket(return = x0:T0) (A)

2 bind(so = x0:T0) (B)

3 listen(so = x0:T0) (C)

4 accept(so = x0:T0, return = x1:T0) [seed] (D)

5 read(fd = x1:T0) (E)

7 read(fd = x1:T0) (E)

6 write(fd = x1:T0) (F)

8 write(fd = x1:T0) (F)

9 close(fd = x1:T0) (G)
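The naming step can be sketched in Python as follows. This covers naming only, not reordering, and it assumes variables are numbered in order of first appearance; `name_values` and `type_of` are hypothetical names.

```python
def name_values(scenario, type_of):
    """Replace concrete values with symbolic variables x0, x1, ...

    Variables are numbered in order of first appearance (an assumption).
    `type_of` maps (interaction, attribute) to a type name.
    """
    names = {}
    result = []
    for name, attrs in scenario:
        new_attrs = {}
        for attr, value in attrs.items():
            var = names.setdefault(value, f"x{len(names)}")
            new_attrs[attr] = f"{var}:{type_of[(name, attr)]}"
        result.append((name, new_attrs))
    return result

simplified = [
    ("socket", {"return": 7}),
    ("bind", {"so": 7}),
    ("listen", {"so": 7}),
    ("accept", {"so": 7, "return": 8}),
    ("read", {"fd": 8}),
    ("write", {"fd": 8}),
    ("close", {"fd": 8}),
]
# Every attribute here has type T0, as on the Type Inference slide.
type_of = {(name, attr): "T0" for name, attrs in simplified for attr in attrs}
named = name_values(simplified, type_of)
```

Descriptor 7 becomes `x0` and descriptor 8 becomes `x1`, so two runs that happened to receive different descriptor numbers produce the same abstract scenario.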

Automaton Learning

1. An off-the-shelf (OTS) learner learns a PFSA.

2. A corer removes infrequently traversed edges and converts the PFSA into an NFA.

[Figure: a PFSA from a start state to a final state, with edge counts of 10000 on three edges and 5 on four edges.]
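To make the coring idea concrete, here is a deliberately simplified Python sketch that drops edges below a fixed count and discards the counts. The actual corer may use a different criterion; the threshold, the state names, and the edge labels are all hypothetical. Only the counts 10000 and 5 come from the slide.

```python
def core(edges, min_count):
    """Drop rarely traversed edges and forget the counts.

    `edges` maps (source state, label, target state) to a traversal count.
    The result is the edge set of an NFA. A fixed threshold is a
    simplification chosen for illustration.
    """
    return {edge for edge, count in edges.items() if count >= min_count}

# Hypothetical PFSA: a frequent path and a rare detour.
pfsa = {
    ("start", "a", "s1"): 10000,
    ("s1", "b", "s2"): 10000,
    ("s2", "c", "final"): 10000,
    ("start", "d", "s3"): 5,
    ("s3", "e", "final"): 5,
}
nfa = core(pfsa, min_count=100)
```

The rare detour through `s3` is removed, which is how a few buggy traces in the training set can be voted out of the specification.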

Specification Automaton for the Socket Protocol

[Figure: the specification automaton; its edges carry the following labels.]

socket(return = x)

bind(so = x)

listen(so = x)

accept(so = x, return = y)

read(fd = y)

write(fd = y)

close(fd = x)

close(fd = y)
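A specification in this form can be checked against a scenario by simulating the automaton. In the Python sketch below, only the edge labels come from the slide. The states and the transition structure are a guess that is consistent with the example program, not a transcription of the figure, and `accepts` is a hypothetical helper.

```python
# Assumed transition structure; only the edge labels come from the slide.
SPEC = {
    ("s0", "socket(return = x)"): {"s1"},
    ("s1", "bind(so = x)"): {"s2"},
    ("s2", "listen(so = x)"): {"s3"},
    ("s3", "accept(so = x, return = y)"): {"s4"},
    ("s4", "read(fd = y)"): {"s4"},
    ("s4", "write(fd = y)"): {"s4"},
    ("s4", "close(fd = y)"): {"s3"},
    ("s3", "close(fd = x)"): {"end"},
}

def accepts(spec, scenario, start="s0", final="end"):
    """Simulate the NFA on a list of labels; True if it can end in `final`."""
    states = {start}
    for label in scenario:
        states = {t for s in states for t in spec.get((s, label), ())}
        if not states:
            return False
    return final in states

prefix = ["socket(return = x)", "bind(so = x)", "listen(so = x)",
          "accept(so = x, return = y)", "read(fd = y)", "write(fd = y)"]
good = accepts(SPEC, prefix + ["close(fd = y)", "close(fd = x)"])
bad = accepts(SPEC, prefix)  # the early return: nothing is closed
```

Under this assumed structure the complete scenario is accepted, while the scenario produced by the early `return` in the example program is rejected because neither descriptor is closed.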

Experimental Results

• Analyzed traces from programs that use the Xlib and X Toolkit Intrinsics libraries for the X11 windowing system.

• Traces were generated manually.

• Compared mined specifications to the rules in the Inter-Client Communication Conventions Manual (ICCCM).


Experimental Results

• A small and buggy training set prevented the miner from discovering the rule.

• Solution: an expert chooses correct traces as the training set.

Benefits

• Exploits the massive programmer effort that is reflected in the code (and nowhere else).

• Offers convenience and insights: it is easier to approve a mined formal specification than to write one.


Conclusion

• Introduced a (semi-)automatic machine-learning approach for discovering formal specifications.

• Reduced the problem to learning regular languages.

• Initial experience is promising.