The Art Of Parsing @ Devoxx France 2014

The Art of Parsing
Evgeny Mandrikov @_godin_, Dinesh Bolkensteyn @dbolkensteyn
#parsing
http://sonarsource.com

description

What has attracted researchers from the 1960s until today? What do computer science engineers study at university and then successfully forget? What is at the heart of the compilers used daily by every software developer? Parsers! Taking a practical point of view with a small pill of theory, this session sheds light on questions like: if there are so many parser generators based on formal theory, why are javac, GCC and Clang all hand-written? And how do we, insiders of the world of parsing, do this at SonarSource for languages like Java, C/C++, C#, JavaScript, Python and COBOL?

Transcript of The Art Of Parsing @ Devoxx France 2014

Page 1: The Art Of Parsing @ Devoxx France 2014

The Art of Parsing

Evgeny Mandrikov @_godin_
Dinesh Bolkensteyn @dbolkensteyn
http://sonarsource.com

Page 2: The Art Of Parsing @ Devoxx France 2014

The Art of Parsing

// TODO: don't forget to add huge disclaimer that all opinions hereinbelow are our own and not our employer's (they wish they had them)

Evgeny Mandrikov @_godin_

Dinesh Bolkensteyn @dbolkensteyn

Page 3: The Art Of Parsing @ Devoxx France 2014

I want to create a parser

«Done»!

Use Yacc, JavaCC, ANTLR, SSLR, …

or hand-written ?

Page 4: The Art Of Parsing @ Devoxx France 2014

What is the plan?

Why
• javac and GCC are hand-written
• do we use parser-generators?

Together we will implement a parser for
• arithmetic expressions
• common constructions from Java
• C++ ;)

Page 5: The Art Of Parsing @ Devoxx France 2014

Java formal grammar

JLS8

JLS7

Page 6: The Art Of Parsing @ Devoxx France 2014

Answer is

42

Page 7: The Art Of Parsing @ Devoxx France 2014

Pill of theory

NUM ➙ 42

Nonterminal: NUM
Production: NUM ➙ 42
Terminal (token): 42

Page 8: The Art Of Parsing @ Devoxx France 2014

Grammar for numbers

NUM ➙ NUM DIGIT | DIGIT
DIGIT ➙ 0|1|2|3|4|5|6|7|8|9

4, 8, 15, 16, 23, 42,…

Alternatives are separated by «|»
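
As a rough illustration of the number grammar, the sketch below checks whether a string can be derived from NUM and computes its value. The class and method names (NumGrammar, parseNum) are hypothetical and not from the slides, and the left-recursive rule NUM ➙ NUM DIGIT is unrolled into a loop.

```java
// Hypothetical helper, not from the slides: recognizes NUM ➙ NUM DIGIT | DIGIT.
final class NumGrammar {
    /** Returns the value of a NUM, or -1 if the text cannot be derived from NUM. */
    static int parseNum(String text) {
        if (text.isEmpty()) {
            return -1;                           // NUM needs at least one DIGIT
        }
        int value = 0;
        for (char c : text.toCharArray()) {
            if (c < '0' || c > '9') {
                return -1;                       // not a DIGIT terminal
            }
            value = value * 10 + (c - '0');      // one step of NUM ➙ NUM DIGIT
        }
        return value;                            // overflow is ignored in this sketch
    }
}
```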

Page 9: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

4 – 3 – 2 = ?

Page 10: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM

4 – 3 – 2 = ?

Page 11: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM

[Parse tree: expr( expr(4 – 3) – 2 )]    (4 – 3) – 2 = -1

Page 12: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM

[Parse tree: expr( expr(4 – 3) – 2 )]    (4 – 3) – 2 = -1
[Parse tree: expr( 4 – expr(3 – 2) )]    4 – (3 – 2) = 3

Page 13: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM

[Parse tree: expr( expr(4 – 3) – 2 )]    (4 – 3) – 2 = -1

expr ➙ NUM – expr | NUM    [Parse tree: expr( 4 – expr(3 – 2) )]    4 – (3 – 2) = 3

Page 14: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – expr | NUM

expr ➙ NUM – expr | NUM    [Parse tree: expr( 4 – expr(3 – 2) )]    4 – (3 – 2) = 3
expr ➙ expr – NUM | NUM    [Parse tree: expr( expr(4 – 3) – 2 )]    (4 – 3) – 2 = -1
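
To make the difference between the two unambiguous grammars concrete, here is a small sketch that evaluates a chain of subtractions both ways. The class Associativity and its methods are assumptions made for illustration. The operands are passed as an array, so no tokenizing is involved.

```java
// Hypothetical illustration: the same operands, grouped by two different grammars.
final class Associativity {
    /** expr ➙ NUM – expr | NUM : right recursion groups to the right. */
    static int rightRecursive(int[] nums, int i) {
        if (i == nums.length - 1) {
            return nums[i];                              // expr ➙ NUM
        }
        return nums[i] - rightRecursive(nums, i + 1);    // expr ➙ NUM – expr
    }

    /** expr ➙ expr – NUM | NUM : left recursion groups to the left (written as a loop). */
    static int leftRecursive(int[] nums) {
        int res = nums[0];
        for (int i = 1; i < nums.length; i++) {
            res = res - nums[i];
        }
        return res;
    }
}
```

For the operands 4, 3, 2 the first method computes 4 – (3 – 2) and the second (4 – 3) – 2, which is the usual meaning of subtraction.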

Page 15: The Art Of Parsing @ Devoxx France 2014

Show me the code

expr ➙ expr – NUM | NUM

int expr() {
  int res = expr();
  if (token == '-') return res - num();
  return num();
}

Page 16: The Art Of Parsing @ Devoxx France 2014

Show me the right code

expr ➙ expr – NUM | NUM

int expr() {
  int res = expr(); // ?? calls itself before consuming any token: infinite recursion
  if (token == '-') return res - num();
  return num();
}

Page 17: The Art Of Parsing @ Devoxx France 2014

Show me the right code

expr ➙ expr – NUM | NUM

// wrong: direct translation of the left-recursive rule
int expr() {
  int res = expr();
  if (token == '-') return res - num();
  return num();
}

// right: the left recursion becomes a loop (consuming the '-' token is elided)
int expr() {
  int res = num();
  while (token == '-') res = res - num();
  return res;
}
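
The slide code leaves token handling implicit. A self-contained version might look like the sketch below, where the class name SubtractionParser, the character-level tokens and the error handling are all assumptions added for illustration.

```java
// Hypothetical, runnable variant of the slide code; tokens are single characters.
final class SubtractionParser {
    private final String input;
    private int pos = 0;

    SubtractionParser(String input) {
        this.input = input.replace(" ", "");   // whitespace carries no meaning here
    }

    /** Current token: one character, or '\0' at end of input. */
    private char token() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    private boolean atDigit() {
        return token() >= '0' && token() <= '9';
    }

    /** NUM: one or more digits, consumed left to right. */
    private int num() {
        if (!atDigit()) {
            throw new IllegalStateException("digit expected at " + pos);
        }
        int value = 0;
        while (atDigit()) {
            value = value * 10 + (token() - '0');
            pos++;
        }
        return value;
    }

    /** expr ➙ expr - NUM | NUM, with the left recursion turned into a loop. */
    int expr() {
        int res = num();
        while (token() == '-') {
            pos++;                 // consume '-'
            res = res - num();
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(new SubtractionParser("4 - 3 - 2").expr()); // prints -1
    }
}
```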

Page 18: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

4 – 3 * 2 = ?

Page 19: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

4 – 3 * 2 = -2

expr ➙ expr – NUM | expr * NUM | NUM

Page 20: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

expr ➙ expr – NUM | expr * NUM | NUM

4 – (3 * 2) = -2
(4 – 3) * 2 = 2

Page 21: The Art Of Parsing @ Devoxx France 2014

Arithmetic expressions

subs ➙ subs – mult | mult
mult ➙ mult * NUM | NUM

4 – (3 * 2) = -2

Page 22: The Art Of Parsing @ Devoxx France 2014

Show me the code

subs ➙ subs – mult | mult
mult ➙ mult * NUM | NUM

int subs() {
  int res = mult();
  while (token == '-') res = res - mult();
  return res;
}

int mult() {
  int res = num();
  while (token == '*') res = res * num();
  return res;
}
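
A complete version of the two-level grammar could be sketched as follows. The class name PrecedenceParser and the character-based token handling are assumptions. Only subs() and mult() correspond to the slide.

```java
// Hypothetical, runnable variant: one method per precedence level.
final class PrecedenceParser {
    private final String input;
    private int pos = 0;

    PrecedenceParser(String input) {
        this.input = input.replace(" ", "");
    }

    private char token() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    private boolean atDigit() {
        return token() >= '0' && token() <= '9';
    }

    private int num() {
        if (!atDigit()) {
            throw new IllegalStateException("digit expected at " + pos);
        }
        int value = 0;
        while (atDigit()) {
            value = value * 10 + (token() - '0');
            pos++;
        }
        return value;
    }

    /** subs ➙ subs - mult | mult : lowest precedence, parsed first. */
    int subs() {
        int res = mult();
        while (token() == '-') {
            pos++;                 // consume '-'
            res = res - mult();
        }
        return res;
    }

    /** mult ➙ mult * NUM | NUM : binds tighter because it sits deeper in the grammar. */
    int mult() {
        int res = num();
        while (token() == '*') {
            pos++;                 // consume '*'
            res = res * num();
        }
        return res;
    }
}
```

Calling new PrecedenceParser("4 - 3 * 2").subs() evaluates 3 * 2 inside mult() before the subtraction in subs() is applied.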

Page 23: The Art Of Parsing @ Devoxx France 2014

LL(1)

● back to 1969
● one token lookahead
● no left-recursion

Page 24: The Art Of Parsing @ Devoxx France 2014

What is the plan?

✔ arithmetic expressions
✔ LL(1)

• a few common constructions from Java
• C++ ;)

Page 25: The Art Of Parsing @ Devoxx France 2014

The real deal

obj.method();
a = obj.field;

expr-stmt ➙ expr ;

Page 26: The Art Of Parsing @ Devoxx France 2014

The real deal

obj.method();
a = obj.field;

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment

Page 27: The Art Of Parsing @ Devoxx France 2014

The real deal

obj.method();
a = obj.field;

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id

Page 28: The Art Of Parsing @ Devoxx France 2014

The real deal

obj.method();
a = obj.field;

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
method-call ➙ qualified-id ()

Page 29: The Art Of Parsing @ Devoxx France 2014

The real deal

obj.method();
a = obj.field;

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
method-call ➙ qualified-id ()
assignment ➙ qualified-id = expr

Page 30: The Art Of Parsing @ Devoxx France 2014

Show me the code

obj.method();
a = obj.field;

expr-stmt ➙ expr ;
expr ➙ field-access
     | method-call
     | assignment
field-access ➙ qualified-id
qualified-id ➙ qualified-id . id
             | id
method-call ➙ qualified-id ()
assignment ➙ qualified-id = expr

int qualified_id() { /* easy */ }
int field_access() { /* easy */ }
int method_call() { /* easy */ }
int assignment() { /* easy */ }

int expr() {
  // ???
}

Page 31: The Art Of Parsing @ Devoxx France 2014

The LL(1) way

obj.method();
a = obj.field;

expr ➙ field-access
     | method-call
     | assignment

int expr() {
  String id = qualified_id();
  if (token == '(')
    return method_call();
  else if (token == '=')
    return assignment();
  else
    return field_access();
}
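
A runnable take on the LL(1) approach might look like the sketch below: the common qualified-id prefix is parsed once, then a single lookahead token picks the alternative. The class Ll1ExprParser, the pre-split token list and the string result format are assumptions made for illustration.

```java
import java.util.List;

// Hypothetical sketch: expr decided by one token of lookahead after the shared prefix.
final class Ll1ExprParser {
    private final List<String> tokens;
    private int pos = 0;

    Ll1ExprParser(List<String> tokens) {
        this.tokens = tokens;
    }

    private String token() {
        return pos < tokens.size() ? tokens.get(pos) : "<eof>";
    }

    private void expect(String expected) {
        if (!token().equals(expected)) {
            throw new IllegalStateException("expected " + expected + " but found " + token());
        }
        pos++;
    }

    private String id() {
        String t = token();
        if (t.isEmpty() || !Character.isJavaIdentifierStart(t.charAt(0))) {
            throw new IllegalStateException("identifier expected but found " + t);
        }
        pos++;
        return t;
    }

    /** qualified-id ➙ qualified-id . id | id, with the left recursion as a loop. */
    private String qualifiedId() {
        StringBuilder name = new StringBuilder(id());
        while (token().equals(".")) {
            pos++;
            name.append('.').append(id());
        }
        return name.toString();
    }

    /** The grammar is left-factored by hand: the lookahead token selects the rule. */
    String expr() {
        String name = qualifiedId();
        if (token().equals("(")) {
            pos++;
            expect(")");
            return "call(" + name + ")";
        } else if (token().equals("=")) {
            pos++;
            return "assign(" + name + ", " + expr() + ")";
        }
        return "field(" + name + ")";
    }

    /** expr-stmt ➙ expr ; */
    String exprStmt() {
        String e = expr();
        expect(";");
        return e;
    }
}
```

The price of this style is visible in expr(): the three grammar rules no longer map to three methods, because their bodies had to be merged around the shared prefix.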

Page 32: The Art Of Parsing @ Devoxx France 2014

Reality

http://hg.openjdk.java.net/jdk8/jdk8/langtools/.../JavacParser.java

Page 33: The Art Of Parsing @ Devoxx France 2014

The better way

obj.method();
a = obj.field;

expr ➙ field-access
     | method-call
     | assignment

int expr() {
  try {
    return field_access();
  } catch (RE e1) {
    try {
      return method_call();
    } catch (RE e2) {
      return assignment();
    }
  }
}

Page 34: The Art Of Parsing @ Devoxx France 2014

Show me the right code

obj.method();
a = obj.field;

expr ➙ method-call
     / assignment
     / field-access

int expr() {
  try {
    return method_call();
  } catch (RE e1) {
    try {
      return assignment();
    } catch (RE e2) {
      return field_access();
    }
  }
}
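
The try/catch code above relies on the input position being rewound when an alternative fails. A self-contained sketch with that rewind made explicit could look like this. The names BacktrackingExprParser and ParseError are hypothetical, and tokens are assumed to arrive already split.

```java
import java.util.List;

// Hypothetical sketch of ordered choice with backtracking.
final class BacktrackingExprParser {
    /** Stands in for the RE exception of the slides. */
    static final class ParseError extends RuntimeException {
        ParseError(String message) {
            super(message);
        }
    }

    private final List<String> tokens;
    private int pos = 0;

    BacktrackingExprParser(List<String> tokens) {
        this.tokens = tokens;
    }

    private String token() {
        return pos < tokens.size() ? tokens.get(pos) : "<eof>";
    }

    private void expect(String expected) {
        if (!token().equals(expected)) {
            throw new ParseError("expected " + expected + " but found " + token());
        }
        pos++;
    }

    private String qualifiedId() {
        StringBuilder name = new StringBuilder();
        while (true) {
            String t = token();
            if (t.isEmpty() || !Character.isJavaIdentifierStart(t.charAt(0))) {
                throw new ParseError("identifier expected but found " + t);
            }
            name.append(t);
            pos++;
            if (!token().equals(".")) {
                return name.toString();
            }
            name.append('.');
            pos++;
        }
    }

    private String methodCall() {
        String name = qualifiedId();
        expect("(");
        expect(")");
        return "call(" + name + ")";
    }

    private String assignment() {
        String name = qualifiedId();
        expect("=");
        return "assign(" + name + ", " + expr() + ")";
    }

    private String fieldAccess() {
        return "field(" + qualifiedId() + ")";
    }

    /** expr ➙ method-call / assignment / field-access : first success wins. */
    String expr() {
        int start = pos;
        try {
            return methodCall();
        } catch (ParseError e1) {
            pos = start;           // backtrack before trying the next alternative
        }
        try {
            return assignment();
        } catch (ParseError e2) {
            pos = start;
        }
        return fieldAccess();
    }

    /** expr-stmt ➙ expr ; */
    String exprStmt() {
        String e = expr();
        expect(";");
        return e;
    }
}
```

Here each grammar rule keeps its own method, at the cost of re-reading the qualified-id prefix once per failed alternative.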

Page 35: The Art Of Parsing @ Devoxx France 2014

Parsing Expression Grammars

● 2002
● ordered choice «/»
● backtracking
● no left-recursion

Page 36: The Art Of Parsing @ Devoxx France 2014

DSL for PEG

obj.method();
a = obj.field;

expr ➙ method-call
     / assignment
     / field-access

enum Nonterminals { EXPR, METHOD_CALL, … }

void grammar() {
  rule(EXPR).is(
    firstOf(
      METHOD_CALL,
      ASSIGNMENT,
      FIELD_ACCESS));
}

Page 37: The Art Of Parsing @ Devoxx France 2014

What is the plan?

✔ arithmetic expressions
✔ LL(1)

✔ common constructions from Java
✔ PEG

• C++ ;)

Page 38: The Art Of Parsing @ Devoxx France 2014

Tea Break

Page 39: The Art Of Parsing @ Devoxx France 2014

Quiz

if (false) if (true) System.out.println("foo"); else System.out.println("bar");

Page 40: The Art Of Parsing @ Devoxx France 2014

«Dangling else»

if (false) if (true) System.out.println("foo"); else System.out.println("bar");

if-stmt ➙ IF (cond) stmt ELSE stmt / IF (cond) stmt
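
The effect of putting the ELSE alternative first can be sketched with a toy parser. The class DanglingElse is hypothetical. It assumes well-formed input where conditions and simple statements are single tokens and parentheses are dropped.

```java
import java.util.List;

// Hypothetical sketch: the alternative with ELSE is tried first, so an else
// attaches to the nearest if that is still open.
final class DanglingElse {
    private final List<String> tokens;
    private int pos = 0;

    DanglingElse(List<String> tokens) {
        this.tokens = tokens;
    }

    private boolean accept(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) {
            pos++;
            return true;
        }
        return false;
    }

    /** stmt: an if-stmt, or any other single token taken as a simple statement. */
    String stmt() {
        if (pos >= tokens.size()) {
            throw new IllegalStateException("statement expected");
        }
        if (tokens.get(pos).equals("if")) {
            return ifStmt();
        }
        return tokens.get(pos++);
    }

    /** if-stmt ➙ IF cond stmt ELSE stmt / IF cond stmt */
    String ifStmt() {
        accept("if");
        String cond = tokens.get(pos++);   // well-formed input assumed
        String then = stmt();
        if (accept("else")) {
            return "if(" + cond + ", " + then + ", else " + stmt() + ")";
        }
        return "if(" + cond + ", " + then + ")";
    }
}
```

For the quiz input the result is if(false, if(true, foo, else bar)): the else belongs to the inner if, which matches Java's rule, so the quiz snippet prints nothing.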

Page 41: The Art Of Parsing @ Devoxx France 2014

Java is awesome

(A)*B

Page 42: The Art Of Parsing @ Devoxx France 2014

C++ all the pains of the world

(A)*B

int *B;
typedef int A;
(A)*B; // cast to type 'A' ('int' alias)
       // of dereference of expression 'B'

int A, B;
(A)*B; // multiplication of 'A' and 'B'
       // with redundant parentheses around 'A'

Java is good, because it was influenced by the bad experience of C++
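
One conventional way around this ambiguity, which the slides do not show, is to let the parser consult the declarations seen so far. The sketch below is only an illustration of that idea. The class TypedefAwareParser and its result strings are invented for the example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the meaning of "(name)*operand" depends on whether
// name was previously declared as a type.
final class TypedefAwareParser {
    private final Set<String> typeNames = new HashSet<>();

    /** Records a declaration such as "typedef int A;". */
    void declareTypedef(String name) {
        typeNames.add(name);
    }

    /** Interprets the token sequence ( name ) * operand. */
    String parse(String name, String operand) {
        if (typeNames.contains(name)) {
            return "cast(" + name + ", deref(" + operand + "))";
        }
        return "mul(" + name + ", " + operand + ")";
    }
}
```

The same five tokens give two different trees, which is why a purely context-free grammar cannot settle the question on its own.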

Page 43: The Art Of Parsing @ Devoxx France 2014

Hit the wall!

(A)*B

rule(MUL_EXPR).is(
  UNARY_EXPR,
  zeroOrMore('*', UNARY_EXPR));
rule(UNARY_EXPR).is(
  firstOf(
    sequence('(', TYPE_ID, ')', UNARY_EXPR),
    PRIMARY,
    sequence('*', UNARY_EXPR)));
rule(PRIMARY).is(
  firstOf(
    sequence('(', EXPR, ')'),
    ID));
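
To see where the wall is, the three rules can be hand-translated into a backtracking parser. This is a sketch under assumptions: the class PegCastWall is hypothetical, it is not the SSLR API, and TYPE_ID is treated as any identifier because the grammar alone cannot tell types from variables.

```java
import java.util.List;

// Hypothetical hand-written equivalent of the three PEG rules above.
final class PegCastWall {
    private final List<String> tokens;
    private int pos = 0;

    PegCastWall(List<String> tokens) {
        this.tokens = tokens;
    }

    private boolean accept(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) {
            pos++;
            return true;
        }
        return false;
    }

    /** ID and TYPE_ID both match any identifier token; returns null on failure. */
    private String id() {
        if (pos < tokens.size() && Character.isJavaIdentifierStart(tokens.get(pos).charAt(0))) {
            return tokens.get(pos++);
        }
        return null;
    }

    /** MUL_EXPR: UNARY_EXPR zeroOrMore('*' UNARY_EXPR) */
    String mulExpr() {
        String left = unaryExpr();
        if (left == null) {
            return null;
        }
        while (true) {
            int mark = pos;
            if (!accept("*")) {
                return left;
            }
            String right = unaryExpr();
            if (right == null) {
                pos = mark;
                return left;
            }
            left = "mul(" + left + ", " + right + ")";
        }
    }

    /** UNARY_EXPR: firstOf('(' TYPE_ID ')' UNARY_EXPR, PRIMARY, '*' UNARY_EXPR) */
    String unaryExpr() {
        int start = pos;
        if (accept("(")) {
            String type = id();
            if (type != null && accept(")")) {
                String operand = unaryExpr();
                if (operand != null) {
                    return "cast(" + type + ", " + operand + ")";
                }
            }
        }
        pos = start;                       // backtrack to the second alternative
        String primary = primary();
        if (primary != null) {
            return primary;
        }
        pos = start;                       // backtrack to the third alternative
        if (accept("*")) {
            String operand = unaryExpr();
            if (operand != null) {
                return "deref(" + operand + ")";
            }
        }
        pos = start;
        return null;
    }

    /** PRIMARY: firstOf('(' EXPR ')', ID) */
    String primary() {
        int start = pos;
        if (accept("(")) {
            String inner = mulExpr();
            if (inner != null && accept(")")) {
                return inner;
            }
        }
        pos = start;
        return id();
    }
}
```

With this translation, ( A ) * B always comes out as a cast of a dereference: the first alternative of the ordered choice succeeds, so the multiplication reading is never produced, whatever A was declared as.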

Page 45: The Art Of Parsing @ Devoxx France 2014

Dream

(A)*B

mul-expr ➙ mul-expr * unary-expr
         | unary-expr
unary-expr ➙ ( type-id ) unary-expr
           | * unary-expr
           | primary
primary ➙ ( expr )
        | id

Page 46: The Art Of Parsing @ Devoxx France 2014

Generalized parsers

● Earley (1968)
● slow

● GLR (1984)
● complex

Page 47: The Art Of Parsing @ Devoxx France 2014

Chicken and egg problem

(A)*B

mul-expr ➙ mul-expr * unary-expr
         | unary-expr
unary-expr ➙ ( type-id ) unary-expr
           | * unary-expr
           | primary
primary ➙ ( expr )
        | id

[Figure: candidate parses of (A)*B, with nodes labeled unary-expr, mul-expr, (A), (A)*B and B*...]

Page 48: The Art Of Parsing @ Devoxx France 2014

Back to the future «dangling else»

if (…) if (…) then-stmt else else-stmt

[Figure: parse of the nested ifs, with nodes labeled outer-if, inner-if, then-stmt, else-stmt and inner-if · else-stmt]

Page 49: The Art Of Parsing @ Devoxx France 2014

GLL: How does it work?

mul-expr ➙ mul-expr * unary-expr
         | unary-expr

Page 50: The Art Of Parsing @ Devoxx France 2014

Generalized LL

● 2010
● no grammar left behind (left-recursive, ambiguous)

● simpler than GLR
● syntactic ambiguities

Page 51: The Art Of Parsing @ Devoxx France 2014

Summary

Page 52: The Art Of Parsing @ Devoxx France 2014

Summary

LL(1)

• trivial
• major grammar changes
• only good for arithmetic expressions
• on steroids, as in JavaCC: usable for real languages

Page 53: The Art Of Parsing @ Devoxx France 2014

Summary

PEG

• trivial
• fewer grammar changes
• no ambiguities
• usable for real languages
• nice tools such as SSLR
• dead-end for C/C++

Page 54: The Art Of Parsing @ Devoxx France 2014

Summary

GLL

• any grammar
• relatively simple
• ambiguities
• reasonable performance
• the only clean choice for C/C++
• only «academic» tools for now... ;)

Page 55: The Art Of Parsing @ Devoxx France 2014

Summary

Hand-written

● based on LL(1)
● precise error-reporting and recovery

● best performance
● maintenance hell

Page 56: The Art Of Parsing @ Devoxx France 2014

Q & A