The Art Of Parsing @ Devoxx France 2014

Transcript
Page 1: The Art Of Parsing @ Devoxx France 2014

@dbolkensteyn @_godin_#parsing

The Art of Parsing

Evgeny Mandrikov @_godin_
Dinesh Bolkensteyn @dbolkensteyn
http://sonarsource.com

Page 2: The Art Of Parsing @ Devoxx France 2014

The Art of Parsing

// TODO: don't forget to add a huge disclaimer that all opinions hereinbelow are our own and not our employer's (they wish they had them)

Evgeny Mandrikov @_godin_

Dinesh Bolkensteyn @dbolkensteyn

Page 3: The Art Of Parsing @ Devoxx France 2014

I want to create a parser

«Done»!

Use Yacc, JavaCC, ANTLR, SSLR, …

or hand-written?

Page 4: The Art Of Parsing @ Devoxx France 2014

What is the plan?

Why
• javac and GCC are hand-written
• do we use parser-generators?

Together we will implement a parser for
• arithmetic expressions
• common constructions from Java
• C++ ;)

Page 5: The Art Of Parsing @ Devoxx France 2014

Java formal grammar

JLS8

JLS7

Page 6: The Art Of Parsing @ Devoxx France 2014

Answer is

42

Page 7: The Art Of Parsing @ Devoxx France 2014

Pill of theory

NUM ➙ 42

Nonterminal: NUM
Production: NUM ➙ 42
Terminal (token): 42

Page 8: The Art Of Parsing @ Devoxx France 2014

Grammar for numbers

NUM ➙ NUM DIGIT | DIGIT
DIGIT ➙ 0|1|2|3|4|5|6|7|8|9

4, 8, 15, 16, 23, 42, …

Alternatives are separated by «|»
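
A minimal Java sketch of how the NUM rule above can be evaluated: each application of NUM ➙ NUM DIGIT multiplies the value so far by ten and adds the new digit. The names NumGrammar and parseNum are illustrative assumptions, not taken from the slides.

```java
final class NumGrammar {
    // NUM ➙ NUM DIGIT | DIGIT, evaluated left to right.
    static int parseNum(String input) {
        if (input.isEmpty()) {
            throw new IllegalArgumentException("NUM needs at least one DIGIT");
        }
        int value = 0;
        for (char c : input.toCharArray()) {
            // DIGIT ➙ 0|1|2|3|4|5|6|7|8|9
            if (c < '0' || c > '9') {
                throw new IllegalArgumentException("not a DIGIT: " + c);
            }
            value = value * 10 + (c - '0'); // NUM ➙ NUM DIGIT
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseNum("42"));
    }
}
```

The left recursion of the rule becomes a plain loop here, which is the same trick the later slides apply to expressions.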

Page 9: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

4 – 3 – 2 = ?

Page 10: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM

4 – 3 – 2 = ?

Page 11: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM

(4 – 3) – 2 = -1

[Parse tree: the inner expr groups 4 – 3, the outer expr subtracts 2]

Page 12: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM

(4 – 3) – 2 = -1
4 – (3 – 2) = 3

[Two parse trees for the same input: one groups 4 – 3 first, the other groups 3 – 2 first]
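
To make the ambiguity concrete, here is a small Java sketch that enumerates the value of every parse tree the rule expr ➙ expr – expr | NUM allows for a list of numbers. The class and method names are hypothetical, and the brute-force enumeration is only an illustration, not a parsing technique from the talk.

```java
import java.util.ArrayList;
import java.util.List;

final class AmbiguousSubtraction {
    // Values of every parse tree of nums[from..to) under expr ➙ expr – expr | NUM.
    static List<Integer> allValues(int[] nums, int from, int to) {
        List<Integer> results = new ArrayList<>();
        if (to - from == 1) {
            results.add(nums[from]); // expr ➙ NUM
            return results;
        }
        // expr ➙ expr – expr: try every position for the top-level minus.
        for (int split = from + 1; split < to; split++) {
            for (int left : allValues(nums, from, split)) {
                for (int right : allValues(nums, split, to)) {
                    results.add(left - right);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(allValues(new int[] {4, 3, 2}, 0, 3)); // prints [3, -1]
    }
}
```

Two different results for one input is exactly why the grammar has to be rewritten before it can drive an evaluator.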

Page 13: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM   (ambiguous)
expr ➙ NUM – expr | NUM    (right-recursive: 4 – (3 – 2) = 3)

(4 – 3) – 2 = -1
4 – (3 – 2) = 3

[Parse trees for both groupings]

Page 14: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM   (ambiguous)
expr ➙ NUM – expr | NUM    (right-recursive: 4 – (3 – 2) = 3)
expr ➙ expr – NUM | NUM    (left-recursive: (4 – 3) – 2 = -1)

(4 – 3) – 2 = -1
4 – (3 – 2) = 3

[Parse trees for both groupings]

Page 15: The Art Of Parsing @ Devoxx France 2014

Show me the code

expr ➙ expr – NUM | NUM

int expr() {
  int res = expr();
  if (token == '-')
    return res - num();
  return num();
}

Page 16: The Art Of Parsing @ Devoxx France 2014

Show me the right code

expr ➙ expr – NUM | NUM

int expr() {
  int res = expr(); // ? ? expr() calls itself before consuming any token
  if (token == '-')
    return res - num();
  return num();
}

Page 17: The Art Of Parsing @ Devoxx France 2014

Show me the right code

expr ➙ expr – NUM | NUM

// Direct translation of the left-recursive rule: never terminates
int expr() {
  int res = expr();
  if (token == '-')
    return res - num();
  return num();
}

// Left recursion replaced by a loop
int expr() {
  int res = num();
  while (token == '-')
    res = res - num();
  return res;
}
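
The loop version on the slide leaves token handling implicit. A self-contained Java sketch, assuming a character-based input with a position counter (SubtractionParser, eval and pos are illustrative names, not from the slides), could look like this:

```java
final class SubtractionParser {
    private final String input;
    private int pos = 0;

    private SubtractionParser(String input) {
        this.input = input.replace(" ", ""); // whitespace is ignored in this sketch
    }

    // NUM: one or more digits.
    private int num() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalStateException("NUM expected at position " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    // expr ➙ expr - NUM | NUM, with the left recursion turned into a loop.
    private int expr() {
        int res = num();
        while (pos < input.length() && input.charAt(pos) == '-') {
            pos++; // consume '-'
            res = res - num();
        }
        return res;
    }

    static int eval(String text) {
        SubtractionParser p = new SubtractionParser(text);
        int value = p.expr();
        if (p.pos != p.input.length()) {
            throw new IllegalStateException("unexpected input at position " + p.pos);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(eval("4 - 3 - 2")); // prints -1
    }
}
```

Because the accumulator is updated from left to right, the result is the left-associative one, (4 - 3) - 2.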

Page 18: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

4 – 3 * 2 = ?

Page 19: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

4 – 3 * 2 = -2

expr ➙ expr – NUM | expr * NUM | NUM

Page 20: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – NUM | expr * NUM | NUM

4 – (3 * 2) = -2
(4 – 3) * 2 = 2

Page 21: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

subs ➙ subs – mult | mult
mult ➙ mult * NUM | NUM

4 – (3 * 2) = -2

Page 22: The Art Of Parsing @ Devoxx France 2014

Show me the code

subs ➙ subs – mult | mult
mult ➙ mult * NUM | NUM

int subs() {
  int res = mult();
  while (token == '-')
    res = res - mult();
  return res;
}

int mult() {
  int res = num();
  while (token == '*')
    res = res * num();
  return res;
}
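
A runnable Java sketch of the two-level grammar above, with one method per nonterminal so that * binds tighter than -. The scanner helpers (accept, pos) and the class name PrecedenceParser are assumptions added for illustration; the slides leave token handling out.

```java
final class PrecedenceParser {
    private final String input;
    private int pos = 0;

    private PrecedenceParser(String input) {
        this.input = input.replace(" ", ""); // whitespace is ignored in this sketch
    }

    // Consumes c if it is the next character.
    private boolean accept(char c) {
        if (pos < input.length() && input.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private int num() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalStateException("NUM expected at position " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    // mult ➙ mult * NUM | NUM
    private int mult() {
        int res = num();
        while (accept('*')) {
            res = res * num();
        }
        return res;
    }

    // subs ➙ subs - mult | mult
    private int subs() {
        int res = mult();
        while (accept('-')) {
            res = res - mult();
        }
        return res;
    }

    static int eval(String text) {
        PrecedenceParser p = new PrecedenceParser(text);
        int value = p.subs();
        if (p.pos != p.input.length()) {
            throw new IllegalStateException("unexpected input at position " + p.pos);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(eval("4 - 3 * 2")); // prints -2
    }
}
```

Precedence falls out of the call structure: subs() only ever sees complete mult() results.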

Page 23: The Art Of Parsing @ Devoxx France 2014

LL(1)

● back to 1969
● one token lookahead
● no left-recursion

Page 24: The Art Of Parsing @ Devoxx France 2014

What is the plan?

✔ arithmetic expressions
✔ LL(1)

• a few common constructions from Java
• C++ ;)

Page 25: The Art Of Parsing @ Devoxx France 2014

The real deal

expr-stmt ➙ expr ;

obj.method();
a = obj.field;

Page 26: The Art Of Parsing @ Devoxx France 2014

The real deal

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment

obj.method();
a = obj.field;

Page 27: The Art Of Parsing @ Devoxx France 2014

The real deal

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id

obj.method();
a = obj.field;

Page 28: The Art Of Parsing @ Devoxx France 2014

The real deal

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
method-call ➙ qualified-id ()

obj.method();
a = obj.field;

Page 29: The Art Of Parsing @ Devoxx France 2014

The real deal

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
method-call ➙ qualified-id ()
assignment ➙ qualified-id = expr

obj.method();
a = obj.field;

Page 30: The Art Of Parsing @ Devoxx France 2014

Show me the code

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
method-call ➙ qualified-id ()
assignment ➙ qualified-id = expr

obj.method();
a = obj.field;

int qualified_id() { /* easy */ }
int field_access() { /* easy */ }
int method_call()  { /* easy */ }
int assignment()   { /* easy */ }

int expr() {
  // ???
}

Page 31: The Art Of Parsing @ Devoxx France 2014

The LL(1) way

expr ➙ field-access
     | method-call
     | assignment

obj.method();
a = obj.field;

int expr() {
  String id = qualified_id();
  if (token == '(')
    return method_call();
  else if (token == '=')
    return assignment();
  else
    return field_access();
}
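
The LL(1) approach above parses the shared qualified-id prefix once and then picks an alternative from a single token of lookahead. A self-contained Java sketch of that idea follows; it returns a textual tree instead of an int, and every name in it (Ll1Expr, token, expect, parse) is an assumption made for illustration.

```java
final class Ll1Expr {
    private final String in;
    private int pos = 0;

    private Ll1Expr(String in) {
        this.in = in.replace(" ", ""); // whitespace is ignored in this sketch
    }

    // Current lookahead character, or '\0' at end of input.
    private char token() {
        return pos < in.length() ? in.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (token() != c) {
            throw new IllegalStateException("'" + c + "' expected at position " + pos);
        }
        pos++;
    }

    private void id() {
        int start = pos;
        while (Character.isLetterOrDigit(token())) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalStateException("id expected at position " + pos);
        }
    }

    // qualified-id ➙ qualified-id . id | id, written as a loop.
    private String qualifiedId() {
        int start = pos;
        id();
        while (token() == '.') {
            pos++;
            id();
        }
        return in.substring(start, pos);
    }

    // Common prefix first, then one token of lookahead selects the alternative.
    private String expr() {
        String id = qualifiedId();
        if (token() == '(') {
            pos++;
            expect(')');
            return "method-call(" + id + ")";
        } else if (token() == '=') {
            pos++;
            return "assignment(" + id + ", " + expr() + ")";
        }
        return "field-access(" + id + ")";
    }

    // expr-stmt ➙ expr ;
    static String parse(String statement) {
        Ll1Expr p = new Ll1Expr(statement);
        String tree = p.expr();
        p.expect(';');
        if (p.pos != p.in.length()) {
            throw new IllegalStateException("unexpected input at position " + p.pos);
        }
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("obj.method();"));
        System.out.println(parse("a = obj.field;"));
    }
}
```

The price is that expr() no longer mirrors the grammar: the three alternatives have been merged by hand into one method.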

Page 32: The Art Of Parsing @ Devoxx France 2014

Reality

http://hg.openjdk.java.net/jdk8/jdk8/langtools/.../JavacParser.java

Page 33: The Art Of Parsing @ Devoxx France 2014

The better way

expr ➙ field-access
     | method-call
     | assignment

obj.method();
a = obj.field;

int expr() {
  try {
    return field_access();
  } catch (RE e1) {
    try {
      return method_call();
    } catch (RE e2) {
      return assignment();
    }
  }
}

Page 34: The Art Of Parsing @ Devoxx France 2014

Show me the right code

expr ➙ method-call
     / assignment
     / field-access

obj.method();
a = obj.field;

int expr() {
  try {
    return method_call();
  } catch (RE e1) {
    try {
      return assignment();
    } catch (RE e2) {
      return field_access();
    }
  }
}
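
Why the order of alternatives matters can be checked with a small backtracking recognizer. The Java sketch below uses boolean results and a saved position instead of exceptions; the Rule interface, the firstOf helper and the methodCallFirst flag are assumptions introduced for this illustration and are not the SSLR API.

```java
final class OrderedChoice {
    interface Rule { boolean match(); }

    private final String in;
    private final boolean methodCallFirst;
    private int pos = 0;

    private OrderedChoice(String in, boolean methodCallFirst) {
        this.in = in.replace(" ", "");
        this.methodCallFirst = methodCallFirst;
    }

    private boolean eat(char c) {
        if (pos < in.length() && in.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private boolean id() {
        int start = pos;
        while (pos < in.length() && Character.isLetterOrDigit(in.charAt(pos))) pos++;
        return pos > start;
    }

    // qualified-id ➙ id ( . id )*
    private boolean qualifiedId() {
        if (!id()) return false;
        while (true) {
            int mark = pos;
            if (!(eat('.') && id())) { pos = mark; return true; }
        }
    }

    private boolean fieldAccess() { return qualifiedId(); }
    private boolean methodCall() { return qualifiedId() && eat('(') && eat(')'); }
    private boolean assignment() { return qualifiedId() && eat('=') && expr(); }

    // Ordered choice: the first alternative that matches wins; rewind after a failure.
    private boolean firstOf(Rule... alternatives) {
        int mark = pos;
        for (Rule alternative : alternatives) {
            if (alternative.match()) return true;
            pos = mark; // backtrack
        }
        return false;
    }

    private boolean expr() {
        return methodCallFirst
            ? firstOf(this::methodCall, this::assignment, this::fieldAccess)
            : firstOf(this::fieldAccess, this::methodCall, this::assignment);
    }

    // expr-stmt ➙ expr ;
    static boolean parses(String statement, boolean methodCallFirst) {
        OrderedChoice p = new OrderedChoice(statement, methodCallFirst);
        return p.expr() && p.eat(';') && p.pos == p.in.length();
    }

    public static void main(String[] args) {
        System.out.println(parses("obj.method();", false)); // prints false
        System.out.println(parses("obj.method();", true));  // prints true
    }
}
```

With field-access tried first, the choice succeeds on the prefix obj.method and is never revisited, so the statement fails at the opening parenthesis. Putting the longer alternatives first avoids that.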

Page 35: The Art Of Parsing @ Devoxx France 2014

Parsing Expression Grammars

● 2002
● ordered choice «/»
● backtracking
● no left-recursion

Page 36: The Art Of Parsing @ Devoxx France 2014

DSL for PEG

expr ➙ method-call
     / assignment
     / field-access

obj.method();
a = obj.field;

enum Nonterminals { EXPR, METHOD_CALL, … }

void grammar() {
  rule(EXPR).is(
    firstOf(
      METHOD_CALL,
      ASSIGNMENT,
      FIELD_ACCESS));
}

Page 37: The Art Of Parsing @ Devoxx France 2014

What is the plan?

✔ arithmetic expressions
✔ LL(1)

✔ common constructions from Java
✔ PEG

• C++ ;)

Page 38: The Art Of Parsing @ Devoxx France 2014

Tea Break

Page 39: The Art Of Parsing @ Devoxx France 2014

Quiz

if (false) if (true) System.out.println("foo"); else System.out.println("bar");

Page 40: The Art Of Parsing @ Devoxx France 2014

«Dangling else»

if (false) if (true) System.out.println("foo"); else System.out.println("bar");

if-stmt ➙ IF (cond) stmt ELSE stmt
        / IF (cond) stmt
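
Because the alternative with ELSE is tried first, the inner if claims the else, which matches how Java itself resolves the quiz. A minimal Java sketch that makes the binding observable (the DanglingElse class and its run method are illustrative, and a returned string stands in for the println calls):

```java
final class DanglingElse {
    // Same shape as the quiz, with the conditions turned into parameters.
    static String run(boolean outer, boolean inner) {
        String printed = "";
        if (outer)
            if (inner)
                printed = "foo";
            else // binds to the nearest if, that is if (inner)
                printed = "bar";
        return printed;
    }

    public static void main(String[] args) {
        System.out.println("[" + run(false, true) + "]"); // prints []
    }
}
```

With the outer condition false nothing is printed at all, since the else belongs to the inner if.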

Page 41: The Art Of Parsing @ Devoxx France 2014

Java is awesome

(A)*B

Page 42: The Art Of Parsing @ Devoxx France 2014

C++ all the pains of the world

(A)*B

int *B;
typedef int A;
(A)*B; // cast to type 'A' ('int' alias)
       // of dereference of expression 'B'

int A, B;
(A)*B; // multiplication of 'A' and 'B'
       // with redundant parentheses around 'A'

Java is good, because it was influenced by the bad experience of C++
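
The two C++ snippets above differ only in how A was declared, so the syntax alone cannot settle what (A)*B means. A deliberately tiny Java sketch of that dependency, assuming the parser has access to the set of names declared as types (CastOrMultiply and classify are hypothetical names, and this is not how any particular C++ front end is implemented):

```java
import java.util.Set;

final class CastOrMultiply {
    // Classifies "(name)*operand" given the names currently declared as types.
    static String classify(String name, Set<String> typeNames) {
        return typeNames.contains(name)
            ? "cast to " + name + " of a dereference"
            : "multiplication by " + name;
    }

    public static void main(String[] args) {
        System.out.println(classify("A", java.util.Collections.singleton("A")));
        System.out.println(classify("A", java.util.Collections.<String>emptySet()));
    }
}
```

The point of the sketch is only that the answer needs information from earlier declarations, which a context-free grammar does not carry.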

Page 43: The Art Of Parsing @ Devoxx France 2014

Hit the wall!

(A)*B

rule(MUL_EXPR).is(
  UNARY_EXPR, zeroOrMore('*', UNARY_EXPR));
rule(UNARY_EXPR).is(
  firstOf(
    sequence('(', TYPE_ID, ')', UNARY_EXPR),
    PRIMARY,
    sequence('*', UNARY_EXPR)));
rule(PRIMARY).is(
  firstOf(
    sequence('(', EXPR, ')'),
    ID));

Page 45: The Art Of Parsing @ Devoxx France 2014

Dream

mul-expr ➙ mul-expr * unary-expr
         | unary-expr
unary-expr ➙ ( type-id ) unary-expr
           | * unary-expr
           | primary
primary ➙ ( expr )
        | id

(A)*B

Page 46: The Art Of Parsing @ Devoxx France 2014

Generalized parsers

● Earley (1968)
● slow

● GLR (1984)
● complex

Page 47: The Art Of Parsing @ Devoxx France 2014

Chicken and egg problem

(A)*B

mul-expr ➙ mul-expr * unary-expr
         | unary-expr
unary-expr ➙ ( type-id ) unary-expr
           | * unary-expr
           | primary
primary ➙ ( expr )
        | id

[Diagram: the two competing readings of (A)*B, as a unary-expr or as a mul-expr]

Page 48: The Art Of Parsing @ Devoxx France 2014

Back to the future: «dangling else»

if (…) if (…) then-stmt else else-stmt

[Diagram: outer-if, inner-if, then-stmt and else-stmt nodes, showing the inner-if followed by the else-stmt]

Page 49: The Art Of Parsing @ Devoxx France 2014

GLL: How does it work?

mul-expr ➙ mul-expr * unary-expr
         | unary-expr

Page 50: The Art Of Parsing @ Devoxx France 2014

Generalized LL

● 2010
● no grammar left behind (left-recursive, ambiguous)

● simpler than GLR
● syntactic ambiguities

Page 51: The Art Of Parsing @ Devoxx France 2014

Summary

Page 52: The Art Of Parsing @ Devoxx France 2014

Summary

LL(1)

• trivial
• major grammar changes
• only good for arithmetic expressions
• on steroids, as in JavaCC, usable for real languages

Page 53: The Art Of Parsing @ Devoxx France 2014

Summary

PEG

• trivial
• fewer grammar changes
• no ambiguities
• usable for real languages
• nice tools such as SSLR
• dead-end for C/C++

Page 54: The Art Of Parsing @ Devoxx France 2014

Summary

GLL

• any grammar
• relatively simple
• ambiguities
• reasonable performance
• the only clean choice for C/C++
• only «academic» tools for now... ;)

Page 55: The Art Of Parsing @ Devoxx France 2014

Summary

Hand-written

● based on LL(1)
● precise error-reporting and recovery

● best performance
● maintenance hell

Page 56: The Art Of Parsing @ Devoxx France 2014

Q & A