Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13...

41
Synthesis for Automated Grading and Feedback EXCAPE NSF SITE VISIT AUGUST 2013

Transcript of Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13...

Page 1: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Synthesis for Automated Grading

and Feedback

EXCAPE NSF SITE VISIT AUGUST 2013

Page 2: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

ExCAPE Collaboration on Education

Solar-Lezama Alur Seshia Hartmann

Sumit Gulwani

Leverage Synthesis in Education Technology◦ Computer aided approach to introduce automation in

education

Page 3: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Education Technology in ExCapeFocus on 2 problem domains◦ Introductory programming

◦ Automata theory

Key application areas for synthesis◦ Problem Generation

◦ Automated Grading

◦ Feedback

Page 4: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

The challenge

Page 5: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Meaningful Feedback is Hard

• Test-cases based feedback

• Hard to relate failing inputs to errors

• Manual feedback by TAs

• Time consuming and error prone

Page 6: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Student Feedback

"Not only did it take 1-2 weeks to grade

problem, but the comments were entirely

unhelpful in actually helping us fix our errors. …. Apparently they don't read the code -- they just

ran their tests and docked points mercilessly.

What if I just had a simple typo, but my

algorithm was fine? ...."

Page 7: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Bigger Challenge in MOOCs

Scalability Challenges (>10k

students)

Page 8: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

AutoGraderAUTOMATED GRADING FOR PROGRAMMING ASSIGNMENTS

STUDENT: RISHABH SINGH AT MIT

Page 9: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Today’s Grading Workflowdef computeDeriv(poly):

deriv = []

zero = 0

if (len(poly) == 1):

return deriv

for e in range(0, len(poly)):

if (poly[e] == 0):

zero += 1

else:

deriv.append(poly[e]*e)

return deriv

replace derive by [0]

def computeDeriv(poly):

deriv = []

zero = 0

if (len(poly) == 1):

return deriv

for e in range(0, len(poly)):

if (poly[e] == 0):

zero += 1

else:

deriv.append(poly[e]*e)

return deriv

Teacher’s

Solution

Grading

Rubric

Page 10: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Autograder Workflowdef computeDeriv(poly):

deriv = []

zero = 0

if (len(poly) == 1):

return deriv

for e in range(0, len(poly)):

if (poly[e] == 0):

zero += 1

else:

deriv.append(poly[e]*e)

return deriv

replace derive by [0]

def computeDeriv(poly):

deriv = []

zero = 0

if (len(poly) == 1):

return deriv

for e in range(0, len(poly)):

if (poly[e] == 0):

zero += 1

else:

deriv.append(poly[e]*e)

return deriv

Teacher’s

Solution

Error Model

Page 11: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Technical Challenges

Large space of possible corrections

Minimal corrections

Dynamically-typed language

Constraint-based Synthesis to the rescue

Page 12: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Solution Idea

1

chang

e

2

changes

4

changes

Progra

ms

satisfyin

g spec

Minimum

changes

.S

All Programs

1015 candidates

Constraint-based

Synthesis

Page 13: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Autograder Algorithm

Page 14: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriter Translator Solver Feedback

Algorithm

.p

y

. 𝒑𝒚 .sk .out ---

-

---

-

Page 15: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriter Translator Solver Feedback

Algorithm: Rewriter

.p

y

. 𝒑𝒚

Page 16: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

• return a return {[0],?a}

• range(a1, a2) range(a1+1,a2)

• a a + 1

Simplified Error Model

Page 17: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriting using Error Model

range(0, len(poly))

a a+1

range({0 ,1}, len(poly))

default

choice

Page 18: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriting using Error Model

range(0, len(poly))

a a+1

range({0 ,1}, len(poly))

Page 19: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriting using Error Model

range(0, len(poly))

a a+1

range({0 ,1}, len({poly, poly+1}))

Page 20: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriting using Error Model

range(0, len(poly))

a a+1

range({0 ,1}, {len({poly, poly+1}),

len({poly, poly+1})+1})

Page 21: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriting using Error Model ( 𝑷𝒚)

def computeDeriv(poly):

deriv = []

zero = 0

if ({len(poly) == 1, False}):

return {deriv,[0]}

for e in range({0,1}, len(poly)):

if (poly[e] == 0):

zero += 1

else:

deriv.append(poly[e]*e)

return {deriv,[0]}

Problem: Find a program

that minimizes cost

metric and is functionally

equivalent with teacher’s solution

Page 22: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriter Translator Solver Feedback

Algorithm: Translator

. 𝒑𝒚 .sk

Page 23: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Sketch [Solar-Lezama et al. ASPLOS06]

void main(int x){

int k = ??;

assert x + x == k * x;

}

void main(int x){

int k = 2;

assert x + x == k * x;

}

Statically typed C-like language with

holes

Page 24: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

𝑃𝑦 Translation to Sketch

(1) Handling python’s dynamic types

(2) Translation of expression choices

Page 25: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriter Translator Solver Feedback

Algorithm: Solver

.sk .out

Page 26: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Rewriter Translator Solver Feedback

Algorithm: Feedback

.out ---

-

---

-

Page 27: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Feedback Generation

Correction rules associated with Feedback

Template

Extract synthesizer choices to fill templates

Page 28: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Evaluation

Page 29: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Autograder Tool for Python

Currently supports:

- Integers, Bool, Strings, Lists, Dictionary, Tuples

- Closures, limited higher-order fn, list comprehensions

Page 30: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Benchmarks

Exercises from first five weeks of 6.00x and

6.00

int: prodBySum, compBal, iterPower, recurPower, iterGCD

tuple: oddTuple

list: compDeriv, evalPoly

string: hangman1, hangman2

arrays(C#): APCS dynamic programming (Pex4Fun)

Page 31: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Benchmark Test Set

evalPoly-6.00 13

compBal-stdin-6.00 52

compDeriv-6.00 103

hangman2-6.00x 218

prodBySum-6.00 268

oddTuples-6.00 344

hangman1-6.00x 351

evalPoly-6.00x 541

compDeriv-6.00x 918

oddTuples-6.00x 1756

iterPower-6.00x 2875

recurPower-6.00x 2938

iterGCD-6.00x 2988

Page 32: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Average Running Time (in s)

0

5

10

15

20

25

30

35

Tim

e (i

n s

)

Page 33: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Feedback Generated (Percentage)

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

80.00%

90.00%

Fee

db

ack

Per

cen

tage

Page 34: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Why low % in some cases?

• Completely Incorrect Solutions

• Unimplemented Python Features

• Timeout

•comp-bal-6.00

• Big Conceptual errors

Page 35: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

AutomataTutorSTUDENTS AND POSTDOCS: LORIS D’ANTONI (PENN)

DILEEP KINI (UIUC), MAHESH VISWANATHAN (UIUC)

Page 36: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344
Page 37: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Automata Feedback Example

The student misunderstood the problem

at least 2 occurrences of ‘ab’ instead of

exactly 2 occurrences of ‘ab’

Tool answer

◦ Feedback: The correct language is { s | ‘ab’ appears in s exactly

2 times }

◦ Grade: 5/10

Page 38: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Big Error: Misunderstanding Spec

• hangman2-6.00x

def getGuessedWord(secretWord, lettersGuessed):

for letter in lettersGuessed:

secretWord = secretWord.replace(letter,’_’)

return secretWord

Page 39: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Big Error: Misunderstanding APIs

• eval-poly-6.00x

def evaluatePoly(poly, x):

result = 0

for i in list(poly):

result += i * x ** poly.index(i)

return result

Page 40: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

PLOOC Workshop

PL technology for Open Online Courses

Co-located with PLDI

◦ ~25 attendees

Page 41: Synthesis for Automated Grading and Feedback · Benchmark Test Set evalPoly-6.00 13 compBal-stdin-6.00 52 compDeriv-6.00 103 hangman2-6.00x 218 prodBySum-6.00 268 oddTuples-6.00 344

Acknowledgements