The Art Of Parsing @ Devoxx France 2014

Post on 06-May-2015

What has attracted researchers from the 1960s until today? What do computer science engineers study at university and then successfully forget? What is at the heart of the compilers that every software developer uses daily? Parsers! Taking a practical point of view with a small pill of theory, this session will shed light on questions such as: if there are so many parser generators based on formal theory, why are javac, GCC and Clang all hand-written? And how do we, insiders of the world of parsing, do this at SonarSource for languages like Java, C/C++, C#, JavaScript, Python and COBOL?

@dbolkensteyn @_godin_#parsing

The Art of Parsing

Evgeny Mandrikov @_godin_
Dinesh Bolkensteyn @dbolkensteyn
http://sonarsource.com

The Art of Parsing

// TODO: don't forget to add a huge disclaimer that all opinions hereinbelow are our own and not our employer's (they wish they had them)

Evgeny Mandrikov @_godin_

Dinesh Bolkensteyn @dbolkensteyn

I want to create a parser

«Done»!

Use Yacc, JavaCC, ANTLR, SSLR, …

or hand-written ?

What is the plan?

Why
• javac and GCC are hand-written
• do we use parser-generators?

Together we will implement a parser for
• arithmetic expressions
• common constructions from Java
• C++ ;)

Java formal grammar

JLS8

JLS7

Answer is

42

Pill of theory

NUM ➙ 42

Nonterminal: NUM
Terminal (token): 42
Production: NUM ➙ 42

Grammar for numbers

NUM ➙ NUM DIGIT | DIGIT
DIGIT ➙ 0|1|2|3|4|5|6|7|8|9

4, 8, 15, 16, 23, 42, …

Alternatives: separated by |

Arithmetic expressions

4 – 3 – 2 = ?

Arithmetic expressions

expr ➙ expr – expr | NUM

4 – 3 – 2 = ?

Arithmetic expressions

expr ➙ expr – expr | NUM

(4 – 3) – 2 = -1

[Parse tree: the inner expr groups 4 – 3, the outer expr subtracts 2]

Arithmetic expressions

expr ➙ expr – expr | NUM

(4 – 3) – 2 = -1
4 – (3 – 2) = 3

[Two parse trees: one groups 4 – 3 first, the other groups 3 – 2 first; the grammar is ambiguous]

Arithmetic expressions

expr ➙ expr – expr | NUM
expr ➙ NUM – expr | NUM

(4 – 3) – 2 = -1
4 – (3 – 2) = 3

[Parse trees: the right-recursive rule NUM – expr can only produce 4 – (3 – 2)]

Arithmetic expressions

expr ➙ expr – expr | NUM
expr ➙ NUM – expr | NUM
expr ➙ expr – NUM | NUM

(4 – 3) – 2 = -1
4 – (3 – 2) = 3

[Parse trees: the right-recursive rule NUM – expr produces 4 – (3 – 2); the left-recursive rule expr – NUM produces (4 – 3) – 2]

Show me the code

expr ➙ expr – NUM | NUM

int expr() {
  int res = expr(); // calls itself before consuming any token
  if (token == '-') return res - num();
  return num();
}

Show me the right code

expr ➙ expr – NUM | NUM

Direct translation of the left-recursive rule (never terminates):

int expr() {
  int res = expr();
  if (token == '-') return res - num();
  return num();
}

Left recursion replaced by a loop:

int expr() {
  int res = num();
  while (token == '-') res = res - num();
  return res;
}
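The loop version can be fleshed out into something runnable. The following is only a minimal sketch: the class name, the character-based `token()` helper and the `evaluate` entry point are assumptions made for illustration and do not appear in the slides.

```java
// Sketch of expr ➙ expr - NUM | NUM with the left recursion turned into a loop.
// All names here are illustrative, not taken from the talk.
final class SubtractionParser {
    private final String input;
    private int pos = 0;

    SubtractionParser(String input) {
        this.input = input.replace(" ", ""); // ignore spaces
    }

    // Current lookahead character, or '\0' at end of input.
    private char token() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    // NUM ➙ NUM DIGIT | DIGIT
    private int num() {
        if (!Character.isDigit(token())) {
            throw new IllegalStateException("digit expected at position " + pos);
        }
        int value = 0;
        while (Character.isDigit(token())) {
            value = value * 10 + (token() - '0');
            pos++;
        }
        return value;
    }

    // expr ➙ NUM ('-' NUM)*: the loop folds operands from left to right.
    int expr() {
        int res = num();
        while (token() == '-') {
            pos++; // consume '-'
            res = res - num();
        }
        return res;
    }

    static int evaluate(String source) {
        SubtractionParser parser = new SubtractionParser(source);
        int result = parser.expr();
        if (parser.pos != parser.input.length()) {
            throw new IllegalStateException("unexpected input at position " + parser.pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("4 - 3 - 2")); // prints -1
    }
}
```

Because the accumulator `res` is updated inside the loop, the operands are combined left to right, which is the grouping (4 – 3) – 2.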

Arithmetic expressions

4 – 3 * 2 = ?

Arithmetic expressions

4 – 3 * 2 = -2

expr ➙ expr – NUM | expr * NUM | NUM

Arithmetic expressions

expr ➙ expr – NUM | expr * NUM | NUM

4 – (3 * 2) = -2
(4 – 3) * 2 = 2

Arithmetic expressions

subs ➙ subs – mult | mult
mult ➙ mult * NUM | NUM

4 – (3 * 2) = -2

Show me the code

subs ➙ subs – mult | mult
mult ➙ mult * NUM | NUM

int subs() {
  int res = mult();
  while (token == '-') res = res - mult();
  return res;
}

int mult() {
  int res = num();
  while (token == '*') res = res * num();
  return res;
}
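To see how one nonterminal per precedence level gives `*` priority over `-`, here is a small runnable sketch of the two functions above. The scanning helpers (`token()`, `num()`), the class name and the `evaluate` entry point are assumptions added for the example.

```java
// Sketch of subs ➙ subs - mult | mult and mult ➙ mult * NUM | NUM,
// written with loops instead of left recursion. Names are illustrative.
final class PrecedenceParser {
    private final String input;
    private int pos = 0;

    PrecedenceParser(String input) {
        this.input = input.replace(" ", ""); // ignore spaces
    }

    // Current lookahead character, or '\0' at end of input.
    private char token() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    private int num() {
        if (!Character.isDigit(token())) {
            throw new IllegalStateException("digit expected at position " + pos);
        }
        int value = 0;
        while (Character.isDigit(token())) {
            value = value * 10 + (token() - '0');
            pos++;
        }
        return value;
    }

    // Lowest precedence: each operand of '-' is a complete mult.
    int subs() {
        int res = mult();
        while (token() == '-') {
            pos++; // consume '-'
            res = res - mult();
        }
        return res;
    }

    // Higher precedence: '*' binds numbers before subs ever sees them.
    int mult() {
        int res = num();
        while (token() == '*') {
            pos++; // consume '*'
            res = res * num();
        }
        return res;
    }

    static int evaluate(String source) {
        PrecedenceParser parser = new PrecedenceParser(source);
        int result = parser.subs();
        if (parser.pos != parser.input.length()) {
            throw new IllegalStateException("unexpected input at position " + parser.pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("4 - 3 * 2")); // prints -2
    }
}
```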

LL(1)

● back to 1969
● one token lookahead
● no left-recursion

What is the plan?

✔ arithmetic expressions
✔ LL(1)

• a few common constructions from Java
• C++ ;)

The real deal

obj.method();
a = obj.field;

expr-stmt ➙ expr ;

expr ➙ field-access
     | method-call
     | assignment

field-access ➙ qualified-id

qualified-id ➙ qualified-id . id
             | id

method-call ➙ qualified-id ()

assignment ➙ qualified-id = expr

Show me the code

expr ➙ field-access
     | method-call
     | assignment

int qualified_id() { /* easy */ }
int field_access() { /* easy */ }
int method_call() { /* easy */ }
int assignment() { /* easy */ }

int expr() {
  // ???
}

obj.method();
a = obj.field;

The LL(1) way

expr ➙ field-access
     | method-call
     | assignment

int expr() {
  String id = qualified_id();
  if (token == '(') return method_call();
  else if (token == '=') return assignment();
  else return field_access();
}

obj.method();
a = obj.field;
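The idea of parsing the shared `qualified-id` prefix once and then deciding on a single token of lookahead can be tried out with the sketch below. It is an illustration under assumptions: the toy tokenizer, the string results ("method-call", "assignment", "field-access") and the `classify` helper are invented for the example and are not how javac or SSLR do it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of LL(1) dispatch for expr; all names are hypothetical.
final class Ll1ExprParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    Ll1ExprParser(String source) {
        // Toy tokenizer: identifiers and single-character punctuation.
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i++;
            if (Character.isJavaIdentifierStart(c)) {
                while (i < source.length() && Character.isJavaIdentifierPart(source.charAt(i))) i++;
            }
            tokens.add(source.substring(start, i));
        }
    }

    // The single token of lookahead; "<eof>" once the input is exhausted.
    private String token() {
        return pos < tokens.size() ? tokens.get(pos) : "<eof>";
    }

    private void expect(String expected) {
        if (!token().equals(expected)) {
            throw new IllegalArgumentException("expected '" + expected + "' but found '" + token() + "'");
        }
        pos++;
    }

    private void id() {
        if (!Character.isJavaIdentifierStart(token().charAt(0))) {
            throw new IllegalArgumentException("identifier expected but found '" + token() + "'");
        }
        pos++;
    }

    // qualified-id ➙ id ('.' id)*
    private void qualifiedId() {
        id();
        while (token().equals(".")) {
            pos++; // consume '.'
            id();
        }
    }

    // The common prefix is consumed once; the next token selects the alternative.
    private String expr() {
        qualifiedId();
        if (token().equals("(")) {
            expect("(");
            expect(")");
            return "method-call";
        } else if (token().equals("=")) {
            expect("=");
            expr(); // right-hand side
            return "assignment";
        }
        return "field-access";
    }

    // expr-stmt ➙ expr ;
    static String classify(String statement) {
        Ll1ExprParser parser = new Ll1ExprParser(statement);
        String kind = parser.expr();
        parser.expect(";");
        return kind;
    }

    public static void main(String[] args) {
        System.out.println(classify("obj.method();"));  // prints method-call
        System.out.println(classify("a = obj.field;")); // prints assignment
    }
}
```

Note that the code no longer mirrors the grammar: the three alternatives have been merged by hand around their common prefix, which is the kind of grammar change LL(1) forces.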

Reality

http://hg.openjdk.java.net/jdk8/jdk8/langtools/.../JavacParser.java

The better way

expr ➙ field-access
     | method-call
     | assignment

int expr() {
  try {
    return field_access();
  } catch (RE e1) {
    try {
      return method_call();
    } catch (RE e2) {
      return assignment();
    }
  }
}

obj.method();
a = obj.field;

Show me the right code

expr ➙ method-call
     / assignment
     / field-access

int expr() {
  try {
    return method_call();
  } catch (RE e1) {
    try {
      return assignment();
    } catch (RE e2) {
      return field_access();
    }
  }
}

obj.method();
a = obj.field;

Parsing Expression Grammars

● 2002
● ordered choice «/»
● backtracking
● no left-recursion
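Ordered choice with backtracking can be sketched without exceptions by remembering the input position and rewinding it after a failed alternative. The code below is a hypothetical illustration: the `Rule` interface, the `firstOf` helper written here, the toy tokenizer and the string results are assumptions for the example and are not the SSLR API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of PEG ordered choice «/» with backtracking; names are hypothetical.
final class PegExprParser {
    // A rule returns the name of what it matched, or null on failure.
    private interface Rule { String parse(); }

    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    PegExprParser(String source) {
        // Toy tokenizer: identifiers and single-character punctuation.
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i++;
            if (Character.isJavaIdentifierStart(c)) {
                while (i < source.length() && Character.isJavaIdentifierPart(source.charAt(i))) i++;
            }
            tokens.add(source.substring(start, i));
        }
    }

    private boolean match(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) { pos++; return true; }
        return false;
    }

    private boolean id() {
        if (pos < tokens.size() && Character.isJavaIdentifierStart(tokens.get(pos).charAt(0))) { pos++; return true; }
        return false;
    }

    // qualified-id ➙ id ('.' id)*
    private boolean qualifiedId() {
        if (!id()) return false;
        while (true) {
            int save = pos;
            if (!(match(".") && id())) { pos = save; return true; }
        }
    }

    // Ordered choice: try alternatives in order, rewind after each failure,
    // and commit to the first one that succeeds.
    private String firstOf(Rule... alternatives) {
        int start = pos;
        for (Rule alternative : alternatives) {
            String kind = alternative.parse();
            if (kind != null) return kind;
            pos = start; // backtrack
        }
        return null;
    }

    // expr ➙ method-call / assignment / field-access
    private String expr() {
        return firstOf(this::methodCall, this::assignment, this::fieldAccess);
    }

    private String methodCall() {
        return qualifiedId() && match("(") && match(")") ? "method-call" : null;
    }

    private String assignment() {
        return qualifiedId() && match("=") && expr() != null ? "assignment" : null;
    }

    private String fieldAccess() {
        return qualifiedId() ? "field-access" : null;
    }

    // expr-stmt ➙ expr ;
    static String classify(String statement) {
        PegExprParser parser = new PegExprParser(statement);
        String kind = parser.expr();
        if (kind == null || !parser.match(";") || parser.pos != parser.tokens.size()) {
            throw new IllegalArgumentException("cannot parse: " + statement);
        }
        return kind;
    }

    public static void main(String[] args) {
        System.out.println(classify("obj.method();"));  // prints method-call
        System.out.println(classify("a = obj.field;")); // prints assignment
    }
}
```

The order of alternatives matters: if `fieldAccess` were listed first, it would succeed on the `obj.method` prefix of `obj.method();` and the choice would commit to it, so the statement would then fail at `(`. This is why the alternatives are ordered method-call / assignment / field-access.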

DSL for PEG

expr ➙ method-call
     / assignment
     / field-access

enum Nonterminals { EXPR, METHOD_CALL, … }

void grammar() {
  rule(EXPR).is(
    firstOf(
      METHOD_CALL,
      ASSIGNMENT,
      FIELD_ACCESS));
}

obj.method();
a = obj.field;

What is the plan?

✔ arithmetic expressions
✔ LL(1)

✔ common constructions from Java
✔ PEG

• C++ ;)

Tea Break

Quiz

if (false) if (true) System.out.println("foo"); else System.out.println("bar");

«Dangling else»

if (false) if (true) System.out.println("foo"); else System.out.println("bar");

if-stmt ➙ IF (cond) stmt ELSE stmt
        / IF (cond) stmt
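In Java the `else` belongs to the nearest `if`. With the ordered choice above, the inner `if` tries the alternative with ELSE first and therefore takes the `else` for itself, which gives the same result. The quiz can be checked with the sketch below; the `run` helper and its parameters are added for the example, and the output is collected in a string instead of being printed.

```java
// Sketch reproducing the quiz; the else attaches to the inner if.
final class DanglingElseDemo {
    static String run(boolean outer, boolean inner) {
        StringBuilder out = new StringBuilder();
        // Same shape as the quiz, with the literals replaced by parameters.
        if (outer)
            if (inner)
                out.append("foo");
            else
                out.append("bar");
        return out.toString();
    }

    public static void main(String[] args) {
        // Quiz values: outer = false, inner = true. Nothing is appended.
        System.out.println("[" + run(false, true) + "]"); // prints []
        System.out.println("[" + run(true, false) + "]"); // prints [bar]
    }
}
```

So the quiz prints nothing: the outer condition is false, and the `else` is part of the inner statement that is never reached.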

Java is awesome

(A)*B

C++: all the pains of the world

int *B;
typedef int A;
(A)*B; // cast to type 'A' ('int' alias)
       // of dereference of expression 'B'

int A, B;
(A)*B; // multiplication of 'A' and 'B'
       // with redundant parentheses around 'A'

Java is good, because it was influenced by the bad experience of C++

(A)*B

Hit the wall!

rule(MUL_EXPR).is(
  UNARY_EXPR, zeroOrMore('*', UNARY_EXPR));
rule(UNARY_EXPR).is(
  firstOf(
    sequence('(', TYPE_ID, ')', UNARY_EXPR),
    PRIMARY,
    sequence('*', UNARY_EXPR)));
rule(PRIMARY).is(
  firstOf(
    sequence('(', EXPR, ')'),
    ID));

(A)*B

Dream

mul-expr ➙ mul-expr * unary-expr | unary-expr
unary-expr ➙ ( type-id ) unary-expr | * unary-expr | primary
primary ➙ ( expr ) | id

(A)*B

Generalized parsers

● Earley (1968)
  ● slow

● GLR (1984)
  ● complex

Chicken and egg problem

(A)*B

mul-expr ➙ mul-expr * unary-expr
         | unary-expr

unary-expr ➙ ( type-id ) unary-expr
           | * unary-expr
           | primary

primary ➙ ( expr )
        | id

[Diagram: the input (A)*B with spans labelled unary-expr and mul-expr]

Back to the future: «dangling else»

if (…) if (…) then-stmt else else-stmt

[Diagram: outer-if and inner-if, with the else-stmt attached either to the inner-if or after it]

GLL: how does it work?

mul-expr ➙ mul-expr * unary-expr
         | unary-expr

Generalized LL

● 2010
● no grammar left behind (left-recursive, ambiguous)
● simpler than GLR
● syntactic ambiguities

Summary

Summary

LL(1)

• trivial
• major grammar changes
• only good for arithmetic expressions
• on steroids, as in JavaCC, usable for real languages

Summary

PEG

• trivial
• fewer grammar changes
• no ambiguities
• usable for real languages
• nice tools such as SSLR
• dead-end for C/C++

Summary

GLL

• any grammar
• relatively simple
• ambiguities
• reasonable performance
• the only clean choice for C/C++
• only «academic» tools for now... ;)

Summary

Hand-written

● based on LL(1)
● precise error-reporting and recovery
● best performance
● maintenance hell

Q & A