Learning Programs from Noisy Data
Veselin Raychev, Pavol Bielik, Martin Vechev, Andreas Krause
ETH Zurich

Transcript of the talk slides.

  • Learning Programs from Noisy Data

    Veselin Raychev, Pavol Bielik, Martin Vechev, Andreas Krause

    ETH Zurich


  • Why learn programs from examples?

    Input/output examples: the user provides pairs and asks for a function p such that p(input_i) = output_i for every example. It is often easier to provide examples than a full specification (e.g., in FlashFill).

    But the user may make a mistake in the examples. The actual goal is to produce the p that the user really wanted and tried to specify.

    Key problem of synthesis: it overfits and is not robust to noise.


  • Learning Programs from Data: Defining Dimensions

                                   Number of examples   Handles errors in the dataset   Learned program complexity
    Program synthesis (PL)         tens                 no                              interesting programs
    Deep learning (ML)             millions             yes                             simple but unexplainable functions
    This paper (bridges a gap)     millions             yes                             interesting programs

    This work expands the capabilities of existing synthesizers and reaches new state-of-the-art precision for programming tasks. It bridges the gap between ML and PL and advances both areas.

  • In this paper

    ● A general framework that
      ○ handles errors in the training dataset
      ○ learns statistical models on data
      ○ handles synthesis with millions of examples
    ● Instantiated with two synthesizers that generalize existing works

  • Contributions

    Three contributions, illustrated on a pipeline that takes input/output examples (possibly with incorrect examples), (1) synthesizes a program p and (2) uses a probabilistic model parametrized with p, driven by a loop between a representative dataset sampler and a program generator:
    ● Handling noise
    ● New probabilistic models
    ● Handling large datasets

    First: handling noise.


  • Synthesis with noise: usage model

    The user provides input/output examples and a Domain Specific Language (DSL). The synthesizer returns a program p such that p(input_i) = output_i for the examples. If an example is incorrect (e.g., a typo), the synthesizer can report p(input) ≠ output for it. This is a new kind of feedback from the synthesizer:
    ● tell the user to remove the suspicious example, or
    ● ask for more examples.

  • Handling noise: problem statement

    D: a dataset of input/output examples, possibly containing incorrect examples. P is the space of possible programs in the DSL.

    A first attempt is pbest = arg min_{p ∈ P} errors(D, p). Issue: a program of the form "if ... return ...; if ... return ...; ..." that hardcodes the input/outputs makes no errors and may be the only solution to this minimization problem. Such a program is too long; synthesis must penalize such answers.

    Our problem formulation:

        pbest = arg min_{p ∈ P} errors(D, p) + λ·r(p)

    where errors(D, p) is the error rate of p on D, r(p) is a regularizer that penalizes long programs, and λ is the regularization constant. (A toy sketch of this objective follows.)
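    A minimal Python sketch of the regularized objective over an explicit candidate set, just to make the formula concrete; it is not the synthesizer from the talk, and the names errors, size, and the toy candidate programs are illustrative assumptions.

        def errors(examples, program):
            """Count the input/output pairs the candidate program gets wrong."""
            return sum(1 for x, y in examples if program(x) != y)

        def synthesize(examples, candidates, size, lam=0.6):
            """Return arg min_p errors(D, p) + lam * size(p) over the candidates."""
            return min(candidates, key=lambda p: errors(examples, p) + lam * size(p))

        # Toy usage: one noisy example out of three. The regularizer keeps the
        # synthesizer from picking a lookup table that hardcodes all three pairs.
        examples = [(1, 2), (2, 4), (3, 7)]               # (3, 7) is a typo for (3, 6)
        double = lambda x: 2 * x                          # short program, makes 1 error
        table = lambda x: {1: 2, 2: 4, 3: 7}[x]           # hardcoded program, 0 errors
        best = synthesize(examples, [double, table],
                          size=lambda p: 1 if p is double else 4)   # best == double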


  • Noisy synthesis using SMT

    pbest = arg min_{p ∈ P} errors(D, p) + λ·r(p), where r(p) is the number of instructions and errors(D, p) + λ·r(p) is the total solution cost.

    Encoding Ψ, the formula given to the SMT solver (here for three examples):
        err1 = if p(input1) = output1 then 0 else 1
        err2 = if p(input2) = output2 then 0 else 1
        err3 = if p(input3) = output3 then 0 else 1
        errors = err1 + err2 + err3
        p ∈ Pr (programs with r instructions)

    Ask a number of SMT queries in increasing order of solution cost (cost = errors + λ·r). E.g., for λ = 0.6 the costs are:

                        number of errors
        r                0     1     2     3
        1               0.6   1.6   2.6   3.6
        2               1.2   2.2   3.2   4.2
        3               1.8   2.8   3.8   4.8

    In this example the queries at costs 0.6, 1.2, 1.6 and 1.8 are UNSAT and the query at cost 2.2 is SAT: the best program has two instructions and makes one error. (A toy z3py sketch of one such query follows.)
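    A hedged z3py sketch of one bounded-cost query from this sweep, purely illustrative: the one-instruction toy DSL p(x) = x & (x + c), the examples, and the error budget MAX_ERRORS are assumptions, not BitSyn's actual encoding.

        from z3 import BitVec, BitVecVal, If, Int, Solver, sat

        examples = [(4, 0), (8, 0), (6, 1)]   # (input, output); the last pair is noisy
        MAX_ERRORS = 1                        # error budget at this cost level

        s = Solver()
        c = BitVec("c", 32)                   # unknown constant of the candidate program

        def run_program(x):
            """Candidate program p(x) = x & (x + c) for the symbolic constant c."""
            xv = BitVecVal(x, 32)
            return xv & (xv + c)

        err_bits = []
        for i, (x, y) in enumerate(examples):
            err = Int(f"err{i}")
            # err_i = 0 if p(input_i) = output_i, else 1
            s.add(err == If(run_program(x) == BitVecVal(y, 32), 0, 1))
            err_bits.append(err)

        s.add(sum(err_bits) <= MAX_ERRORS)    # allow at most MAX_ERRORS wrong examples

        if s.check() == sat:
            # one possible model is c = 2**32 - 1, i.e. p(x) = x & (x - 1)
            print("SAT: constant =", s.model()[c])
        else:
            print("UNSAT at this cost level; move on to the next cost")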

  • Noisy synthesizer: example

    Take an actual synthesizer and show that we can make it handle noise.


  • Implementation: BitSyn

    For BitStream programs, using Z3; similar to Jha et al. [ICSE'10] and Gulwani et al. [PLDI'11]. BitSyn synthesizes short, loop-free programs.

    Example program:
        function check_if_power_of_2(int32 x) {
          var o = add(x, -1)
          return bitwise_and(x, o)   // 0 iff x is a power of two
        }

    Question: how well does our synthesizer discover noise (in programs from prior work)?

    The regularization constant λ is picked empirically; the slide's plot marks the best area to be in.

  • So far… handling noise

    ● Problem statement and regularization
    ● Synthesis procedure using SMT
    ● Presented one synthesizer

    Handling noise enables us to solve new classes of problems beyond normal synthesis.

  • Contributions (continued)

    Back to the overview pipeline (input/output examples with possible incorrect examples, a program generator, and a representative dataset sampler). Next: handling large datasets.


  • Fundamental problem

    Large number of examples:

        pbest = arg min_{p ∈ P} cost(D, p)

    where D contains millions of input/output examples. Computing cost(D, p) is O(|D|), so synthesis over the full dataset is practically intractable.

    Key idea: iterative synthesis on a fraction of the examples.


  • Our solution: two components

    ● Program generator: a synthesizer for a small number of examples; given a dataset d, it finds the best program pbest = arg min_{p ∈ P} cost(d, p).
    ● Dataset sampler: picks a dataset d ⊆ D. We introduce a representative dataset sampler, generalizing a user who provides input/output examples.


  • In a loop

    Start with a small random sample d ⊆ D, then iteratively generate programs and samples: the program generator produces p1 from d, the representative dataset sampler picks a new d given p1, the generator produces p1, p2, the sampler picks again, and so on until the loop settles on pbest. (A sketch of this outer loop follows.)

    The algorithm generalizes synthesis-by-examples techniques.
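    A minimal sketch of the outer loop, assuming two hypothetical helpers: synthesize(d) plays the role of the program generator and sample(D, programs, k) the role of the representative dataset sampler (one way to approximate the sampler is sketched after the next slide).

        import random

        def iterative_synthesis(D, synthesize, sample, k=100, rounds=10):
            """Alternate program generation on a small dataset d with re-sampling d from D."""
            d = random.sample(D, k)            # start with a small random sample d ⊆ D
            programs = []
            for _ in range(rounds):
                p = synthesize(d)              # program generator: best program on d
                programs.append(p)
                d = sample(D, programs, k)     # pick a d on which p1..pn behave like on D
            return programs[-1]                # candidate for pbest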


  • Representative dataset sampler

    Idea: pick a small dataset d on which the already generated programs p1, ..., pn behave like they do on the full dataset:

        d = arg min_{d ⊆ D} max_{i ∈ 1..n} | cost(d, pi) - cost(D, pi) |

    The slide compares the costs of p1 and p2 on the small dataset d with their costs on the full dataset D: on a representative d the two cost profiles match. (A sketch of one way to approximate this arg min follows.)

    Theorem: this sampler shrinks the candidate program search space. In the evaluation it gives a significant speedup of synthesis.
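    A hedged sketch of one simple way to approximate the arg min above: draw several random candidate subsets and keep the one whose per-program costs track the full dataset best. The cost function and the candidate count are illustrative assumptions, not the paper's algorithm.

        import random

        def cost(examples, program):
            """Fraction of examples the program gets wrong (illustrative cost)."""
            return sum(1 for x, y in examples if program(x) != y) / max(len(examples), 1)

        def representative_sample(D, programs, k, candidates=50):
            """Pick d ⊆ D with |d| = k, approximately minimizing
            max_i |cost(d, p_i) - cost(D, p_i)| over random candidate subsets."""
            full_costs = [cost(D, p) for p in programs]
            best_d, best_gap = None, float("inf")
            for _ in range(candidates):
                d = random.sample(D, k)
                gap = max(abs(cost(d, p) - fc) for p, fc in zip(programs, full_costs))
                if gap < best_gap:
                    best_d, best_gap = d, gap
            return best_d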

  • So far... handling large datasets

    ● Iterative combination of synthesis and sampling
    ● A new way to perform approximate empirical risk minimization
    ● Guarantees (in the paper)

  • Contributions (continued)

    Back to the overview pipeline. Next: new probabilistic models.


  • Statistical programming tools

    A new breed of tools: learn from large existing codebases (e.g., "Big Code") to make predictions about programs. The workflow is: 1. train a machine learning model, 2. make predictions with the model. The catch: the model is hard-coded and its precision is low.


  • Existing machine learning models

    They essentially remember a mapping from a context in the training data to a prediction (with probabilities).

    Hindle et al. [ICSE'12]: learn a mapping from the preceding tokens, e.g. "+ name ." → slice; the model will predict slice when it sees "+ name ." before the completion point. This model comes from NLP.

    Raychev et al. [PLDI'14]: learn a mapping from the preceding API calls per object, e.g. charAt → slice; the model will predict slice when it sees charAt. It relies on static analysis.


  • Problem of existing systems

    Precision: they rarely predict the next statement. The Hindle et al. [ICSE'12] model has very low precision, and the Raychev et al. [PLDI'14] model has low precision for JavaScript.

    Core problem: existing machine learning models are limited and not expressive enough.


  • Key idea: second-order learning

    Learn a program that parametrizes a probabilistic model that makes predictions:
    1. Synthesize a program describing a model
    2. Train the model (i.e., learn the mapping)
    3. Make predictions with this model

    Prior models are described by simple hard-coded programs. Our approach: learn a better program.


  • Training and evaluation

    Training example: an input code snippet whose output is slice. Compute a context from the input with program p and learn a mapping from that context to the output, e.g. toUpperCase → slice.

    Evaluation example: compute the context of the new input with the same program p, then use the learned mapping to predict the completion (slice). (A toy sketch of this conditioning-and-counting scheme follows.)
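    A toy Python sketch of this idea under stated assumptions: the conditioning program p is represented as a function from a token sequence to a context (here, hypothetically, the last alphabetic token), and the probabilistic model is a simple count-based mapping from contexts to completions.

        from collections import Counter, defaultdict

        def p_last_api(tokens):
            """Hypothetical conditioning program: the context is the last token
            that looks like an API name (here simply the last alphabetic token)."""
            apis = [t for t in tokens if t.isalpha()]
            return apis[-1] if apis else ""

        def train(model_program, training_examples):
            """Learn a mapping context -> Counter of completions."""
            table = defaultdict(Counter)
            for tokens, completion in training_examples:
                table[model_program(tokens)][completion] += 1
            return table

        def predict(model_program, table, tokens):
            """Compute the context with the same program and return the most likely completion."""
            counts = table.get(model_program(tokens))
            return counts.most_common(1)[0][0] if counts else None

        # Toy usage with a single training pair resembling the slide's example.
        table = train(p_last_api, [(["str", ".", "toUpperCase", "(", ")", "."], "slice")])
        print(predict(p_last_api, table, ["name", ".", "toUpperCase", "(", ")", "."]))  # -> slice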


  • Observation

    Synthesis of the probabilistic model can be done with the same optimization problem as before:

        pbest = arg min_{p ∈ P} errors(D, p) + λ·r(p)

    where D is now the evaluation data (input/output examples), errors(D, p) plays the role of cost(D, p), r is the regularizer that penalizes long programs, and λ is the regularization constant.

  • So far...

    Handling noise, synthesizing a model, and the representative dataset sampler: these techniques are generally applicable to program synthesis.

    Next, an application for "Big Code" called DeepSyn.


  • DeepSyn: Training

    Trained on 100,000 JavaScript files from GitHub. The program synthesizer (representative dataset sampler + program generator) runs on the dataset D to find the best conditioning program p; the model is then trained on the full data D with that best program p.


  • DeepSyn: Evaluation

    50,000 evaluation files (not used in training or synthesis); API completion task.

    Conditioning program p                                                    Accuracy
    Last two tokens, Hindle et al. [ICSE'12]                                  22.2%
    Last two APIs per object, Raychev et al. [PLDI'14]                        30.4%
    Program synthesis with noise (this work)                                  46.3%
    Program synthesis with noise + dataset sampler (this work)                50.4%

    We can explain the best program: it looks at the APIs preceding the completion position and at the tokens prior to those APIs.

  • Summary / Q&A

    ● Extending synthesizers to handle noise
    ● Scalability: handling large datasets via the loop between a representative dataset sampler and a program generator
    ● Second-order learning: synthesis of probabilistic models (synthesize p, then use a probabilistic model parametrized with p)

    Bridges the gap between ML and PL and advances both areas.

  • What did we synthesize?

    The best conditioning program found for DeepSyn:

        Left PrevActor WriteAction WriteValue PrevActor WriteAction PrevLeaf WriteValue PrevLeaf WriteValue