Post on 18-Jan-2018
1
Introduction to Parsing
2
Outline
Regular languages revisited
Parser overview
Context-free grammars (CFGs)
Derivations
3
Languages and Automata
Formal languages are very important in CS
» Especially in programming languages
Regular languages
» The weakest formal languages widely used
» Many applications
We will also study context-free languages
4
Limitations of Regular Languages
Intuition: A finite automaton that runs long enough must repeat states
Finite automaton can’t remember # of times it has visited a particular state
Finite automaton has finite memory
» Only enough to store in which state it is
» Cannot count, except up to a finite limit
E.g., language of balanced parentheses is not regular: $\{\, (^i\, )^i \mid i \ge 0 \,\}$
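To see why counting matters, here is a minimal C sketch of a recognizer for this language. The function name `is_nested_parens` is made up for illustration. The point is the counter `depth`, which has to grow with the input, and that is exactly what a fixed set of states cannot do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true iff s is i opening parentheses followed by exactly
   i closing parentheses, for some i >= 0.
   (Hypothetical helper, not part of the lecture.) */
bool is_nested_parens(const char *s) {
    assert(s != NULL);
    size_t depth = 0;                 /* grows with the input length */
    while (*s == '(') {               /* count the opening run */
        depth++;
        s++;
    }
    while (*s == ')' && depth > 0) {  /* cancel one opener per closer */
        depth--;
        s++;
    }
    /* Accept only if everything was consumed and every '(' was matched. */
    return *s == '\0' && depth == 0;
}
```

Calling `is_nested_parens("(())")` accepts, while `"(()"` and `"())"` are rejected because the counts differ.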
5
The Functionality of the Parser
Input: sequence of tokens from lexer
Output: parse tree of the program
6
Example
Java expr: x == y ? 1 : 2
Parser input: ID == ID ? INT : INT
Parser output:
        ?:
      /  |  \
    ==  INT  INT
   /  \
  ID  ID
7
Comparison with Lexical Analysis
Phase    Input                    Output
Lexer    Sequence of characters   Sequence of tokens
Parser   Sequence of tokens       Parse tree
8
The Role of the Parser
Not all sequences of tokens are programs ...
... Parser must distinguish between valid and invalid sequences of tokens
We need
» A language for describing valid sequences of tokens
» A method for distinguishing valid from invalid sequences of tokens
9
Context-Free Grammars
Programming language constructs have recursive structure
An EXPR is
  if EXPR then EXPR else EXPR fi, or
  while EXPR loop EXPR pool, or
  …
Context-free grammars are a natural notation for this recursive structure
10
CFGs (Cont.)
A CFG consists of
» A set of terminals T
» A set of non-terminals N
» A start symbol S (a non-terminal)
» A set of productions
Assuming $X \in N$, each production has the form $X \to \varepsilon$, or $X \to Y_1 Y_2 \ldots Y_n$ where $Y_i \in N \cup T$
11
Notational Conventions
In these lecture notes
» Non-terminals are written upper-case
» Terminals are written lower-case
» The start symbol is the left-hand side of the first production
12
Examples of CFGs
Expr → if Expr then Expr else Expr
     | while Expr do Expr
     | id
13
Examples of CFGs
Simple arithmetic expressions:
E → E * E
  | E + E
  | ( E )
  | id
14
The Language of a CFG
Read productions as replacement rules:
$X \to Y_1 \ldots Y_n$
  Means X can be replaced by $Y_1 \ldots Y_n$
$X \to \varepsilon$
  Means X can be erased (replaced with empty string)
15
Key Idea
1. Begin with a string consisting of the start symbol “S”
2. Replace any non-terminal X in the string by a right-hand side of some production $X \to Y_1 \ldots Y_n$
3. Repeat (2) until there are no non-terminals in the string
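The three steps can be sketched in C as plain string rewriting. This is only an illustration: it assumes the small grammar S → (S) | ε for nested parentheses, and the helper names `rewrite_first` and `derive_parens` are invented here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Replace the first occurrence of non-terminal nt in form by rhs.
   Returns 1 on success, 0 if nt does not occur or the result
   would not fit in cap bytes. */
int rewrite_first(char *form, size_t cap, char nt, const char *rhs) {
    assert(form != NULL && rhs != NULL);
    char *pos = strchr(form, nt);
    size_t len = strlen(form), rlen = strlen(rhs);
    if (pos == NULL || len + rlen > cap) return 0;
    /* Move the tail (including the terminating '\0') to its new place. */
    memmove(pos + rlen, pos + 1, strlen(pos + 1) + 1);
    memcpy(pos, rhs, rlen);            /* drop in the right-hand side */
    return 1;
}

/* Derive n nested pairs of parentheses from S, printing each
   intermediate string of the derivation. */
int derive_parens(int n, char *out, size_t cap) {
    assert(out != NULL);
    if (cap < 2) return 0;
    strcpy(out, "S");                  /* step 1: the start symbol */
    puts(out);
    for (int i = 0; i < n; i++) {      /* step 2, repeated: S -> (S) */
        if (!rewrite_first(out, cap, 'S', "(S)")) return 0;
        puts(out);
    }
    /* Last step: S -> epsilon removes the only remaining non-terminal. */
    if (!rewrite_first(out, cap, 'S', "")) return 0;
    puts(out);
    return 1;
}
```

With `n = 2` and a large enough buffer, the printed strings are `S`, `(S)`, `((S))`, `(())`: each line is obtained from the previous one by a single replacement.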
16
The Language of a CFG (Cont.)
More formally, write
$X_1 \cdots X_i \cdots X_n \to X_1 \cdots X_{i-1}\, Y_1 \cdots Y_m\, X_{i+1} \cdots X_n$
if there is a production
$X_i \to Y_1 \cdots Y_m$
17
The Language of a CFG (Cont.)
Write
$X_1 \cdots X_n \to^{*} Y_1 \cdots Y_m$
if
$X_1 \cdots X_n \to \cdots \to \cdots \to Y_1 \cdots Y_m$
in 0 or more steps
18
The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
$\{\, a_1 \cdots a_n \mid S \to^{*} a_1 \cdots a_n \text{ and every } a_i \text{ is a terminal} \,\}$
19
Terminals
Terminals are so called because there are no rules for replacing them
Once generated, terminals are permanent
Terminals ought to be tokens of the language
20
Examples
L(G) is the language of CFG G
Strings of balanced parentheses: $\{\, (^i\, )^i \mid i \ge 0 \,\}$
Two grammars:
S → (S)
S → ε
OR
S → (S) | ε
21
Arithmetic Example
Simple arithmetic expressions:
E → E+E | E*E | (E) | id
Some elements of the language:
id           id + id
(id)         id * id
(id) * id    id * (id)
22
Notes
The idea of a CFG is a big step. But:
Membership in a language is “yes” or “no”; also need parse tree of the input
Must handle errors gracefully
Need an implementation of CFG’s (e.g., Cup)
23
More Notes
Form of the grammar is important
» Many grammars generate the same language
» Tools are sensitive to the grammar
» Note: Tools for regular languages (e.g., flex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice
24
Derivations and Parse Trees
A derivation is a sequence of productions
S → … → …
A derivation can be drawn as a tree
» Start symbol is the tree’s root
» For a production $X \to Y_1 \ldots Y_n$ add children $Y_1, \ldots, Y_n$ to node $X$
25
Derivation Example
Grammar
E → E+E | E*E | (E) | id
String
id * id + id
26
Derivation Example (Cont.)
E
→ E+E
→ E*E+E
→ id*E+E
→ id*id+E
→ id*id+id

          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id
27
Derivation in Detail (1)
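E
[Parse tree so far: the root E]
28
Derivation in Detail (2)
E → E+E
[Parse tree so far: root E with children E + E]
29
Derivation in Detail (3)
E → E+E → E*E+E
[Parse tree so far: the left child E now has children E * E]
30
Derivation in Detail (4)
E → E+E → E*E+E → id*E+E
[Parse tree so far: the first E of E * E now has child id]
31
Derivation in Detail (5)
E → E+E → E*E+E → id*E+E → id*id+E
[Parse tree so far: the second E of E * E now has child id]
32
Derivation in Detail (6)
E → E+E → E*E+E → id*E+E → id*id+E → id*id+id
[Complete parse tree: the right child E of the root now has child id]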
33
Notes on Derivations
A parse tree has
» Terminals at the leaves
» Non-terminals at the interior nodes
An in-order traversal of the leaves is the original input
The parse tree shows the association of operations, the input string does not
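A small C sketch of these properties, under the assumption that a node stores a symbol name and up to three children (the `Node` type and the `yield` function are illustrative names, not part of the lecture). Reading the leaves from left to right recovers the input string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_KIDS 3

/* Interior nodes hold a non-terminal and children;
   leaves hold a terminal and have nkids == 0. */
typedef struct Node {
    const char *symbol;
    int nkids;
    const struct Node *kids[MAX_KIDS];
} Node;

/* Append the leaves of t, left to right and space separated, to out.
   out must start as an empty string with enough room for the result. */
void yield(const Node *t, char *out) {
    assert(t != NULL && out != NULL);
    if (t->nkids == 0) {                       /* a leaf: emit its terminal */
        if (*out != '\0') strcat(out, " ");
        strcat(out, t->symbol);
        return;
    }
    for (int i = 0; i < t->nkids; i++)         /* interior: visit children in order */
        yield(t->kids[i], out);
}
```

For the tree of id * id + id shown earlier, `yield` produces the string "id * id + id". The grouping of the multiplication under the left child of the root is visible only in the tree, not in that string.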
34
Left-most and Right-most Derivations
The example is a left-most derivation
» At each step, replace the left-most non-terminal
There is an equivalent notion of a right-most derivation
E
→ E+E
→ E+id
→ E*E+id
→ E*id+id
→ id*id+id
35
Right-most Derivation in Detail (1)
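E
[Parse tree so far: the root E]
36
Right-most Derivation in Detail (2)
E → E+E
[Parse tree so far: root E with children E + E]
37
Right-most Derivation in Detail (3)
E → E+E → E+id
[Parse tree so far: the right child E now has child id]
38
Right-most Derivation in Detail (4)
E → E+E → E+id → E*E+id
[Parse tree so far: the left child E now has children E * E]
39
Right-most Derivation in Detail (5)
E → E+E → E+id → E*E+id → E*id+id
[Parse tree so far: the second E of E * E now has child id]
40
Right-most Derivation in Detail (6)
E → E+E → E+id → E*E+id → E*id+id → id*id+id
[Complete parse tree: the first E of E * E now has child id]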
41
Derivations and Parse Trees
Note that right-most and left-most derivations have the same parse tree
The difference is the order in which branches are added
42
Summary of Derivations
We are not just interested in whether $s \in L(G)$
» We need a parse tree for s
A derivation defines a parse tree
» But one parse tree may have many derivations
Left-most and right-most derivations are important in parser implementation
43
Issues
A parser consumes a sequence of tokens s and produces a parse tree
Issues:
» How do we recognize that $s \in L(G)$?
» A parse tree of s describes how $s \in L(G)$
» Ambiguity: more than one parse tree (interpretation) for some string s
» Error: no parse tree for some string s
» How do we construct the parse tree?
44
Ambiguity
Grammar
E → E + E | E * E | ( E ) | int
String
int * int + int
45
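Ambiguity (Cont.)
This string has two parse trees
[Tree 1: root E → E + E, whose left E → E * E; groups as (int * int) + int]
[Tree 2: root E → E * E, whose right E → E + E; groups as int * (int + int)]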
46
Ambiguity (Cont.)
A grammar is ambiguous if it has more than one parse tree for some string
» Equivalently, there is more than one right-most or left-most derivation for some string
Ambiguity is bad
» Leaves meaning of some programs ill-defined
Ambiguity is common in programming languages
» Arithmetic expressions
» IF-THEN-ELSE
47
Dealing with Ambiguity
There are several ways to handle ambiguity
Most direct method is to rewrite the grammar unambiguously
E → T + E | T
T → int * T | int | ( E )
Enforces precedence of * over +
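One way to see the precedence at work is a tiny evaluator that follows this grammar rule by rule. The sketch below is an assumption-laden illustration: int is restricted to a single digit, the names `parse_E`, `parse_T`, and `eval` are invented, and the parser decides by peeking at one character rather than by the backtracking scheme discussed later.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *p;   /* cursor into the input (illustrative global) */
static int ok;          /* cleared when the input does not fit the grammar */

static int parse_E(void);

/* T -> int * T | int | ( E ) */
static int parse_T(void) {
    if (*p == '(') {                          /* T -> ( E ) */
        p++;
        int v = parse_E();
        if (*p == ')') p++; else ok = 0;
        return v;
    }
    if (!isdigit((unsigned char)*p)) { ok = 0; return 0; }
    int v = *p++ - '0';                       /* T -> int ... */
    if (*p == '*') { p++; return v * parse_T(); }   /* ... * T */
    return v;
}

/* E -> T + E | T */
static int parse_E(void) {
    int v = parse_T();                        /* both rules start with T */
    if (ok && *p == '+') { p++; return v + parse_E(); }
    return v;
}

/* Returns 1 and stores the value if s is in the language, else 0. */
int eval(const char *s, int *result) {
    assert(s != NULL && result != NULL);
    p = s;
    ok = 1;
    *result = parse_E();
    return ok && *p == '\0';
}
```

Because a `*` can only appear inside a T, the product in `2*3+4` is computed before the sum, giving 10, and `2+3*4` gives 14. Note that this grammar has no rule of the form ( E ) * T, so `(3+4)*2` is rejected.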
48
Ambiguity: The Dangling Else
Consider the grammar
E → if E then E
  | if E then E else E
  | OTHER
This grammar is also ambiguous
49
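The Dangling Else: Example
The expression
if E1 then if E2 then E3 else E4
has two parse trees
[Tree 1: outer if with children E1, an inner if over E2 and E3, and E4; the else belongs to the outer if]
[Tree 2: outer if with children E1 and an inner if over E2, E3 and E4; the else belongs to the inner if]
• Typically we want the second form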
50
The Dangling Else: A Fix
else matches the closest unmatched then
We can describe this in the grammar
E → MIF    /* all then are matched */
  | UIF    /* some then are unmatched */
MIF → if E then MIF else MIF
    | OTHER
UIF → if E then E
    | if E then MIF else UIF
Describes the same set of strings
51
The Dangling Else: Example Revisited
The expression if E1 then if E2 then E3 else E4
[Tree 1: if(MIF) with children E1, if(UIF) over E2 and E3, and E4]
• Not valid because the then expression is not a MIF
[Tree 2: if(UIF) with children E1 and if(MIF) over E2, E3 and E4]
• A valid parse tree (for a UIF)
52
Ambiguity
No general techniques for handling ambiguity
Impossible to automatically convert an ambiguous grammar to an unambiguous one
Used with care, ambiguity can simplify the grammar
» Sometimes allows more natural definitions
» We need disambiguation mechanisms
53
Precedence and Associativity Declarations
Instead of rewriting the grammar
» Use the more natural (ambiguous) grammar
» Along with disambiguating declarations
Most tools allow precedence and associativity declarations to disambiguate grammars
Examples …
54
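Associativity Declarations
Consider the grammar E → E + E | int
Ambiguous: two parse trees of int + int + int
[Tree 1: root E → E + E, whose left E → E + E; groups as (int + int) + int]
[Tree 2: root E → E + E, whose right E → E + E; groups as int + (int + int)]
• Left-associativity declaration: %left +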
55
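Precedence Declarations
Consider the grammar E → E + E | E * E | int
» And the string int + int * int
[Tree 1: root E → E * E, whose left E → E + E; groups as (int + int) * int]
[Tree 2: root E → E + E, whose right E → E * E; groups as int + (int * int)]
• Precedence declarations:
  %left +
  %left *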
56
Abstract Syntax Trees
So far a parser traces the derivation of a sequence of tokens
The rest of the compiler needs a structural representation of the program
Abstract syntax trees
» Like parse trees but ignore some details
» Abbreviated as AST
57
Abstract Syntax Tree. (Cont.)
Consider the grammar
E → int | ( E ) | E + E
And the string
5 + (2 + 3)
After lexical analysis (a list of tokens)
int5 '+' '(' int2 '+' int3 ')'
During parsing we build a parse tree …
58
Example of Parse Tree
            E
          / | \
         E  +  E
         |   / | \
       int5 (  E  )
             / | \
            E  +  E
            |     |
          int2   int3
Traces the operation of the parser
Does capture the nesting structure
But too much info
» Parentheses
» Single-successor nodes
59
Example of Abstract Syntax Tree
Also captures the nesting structure
But abstracts from the concrete syntax => more compact and easier to use
An important data structure in a compiler
   PLUS
   /  \
  5   PLUS
      /  \
     2    3
60
Constructing An AST
We first define the AST class hierarchy
» ASTNode: IntNode, PlusNode
Consider an abstract tree type with two constructors:
new IntNode(n) = a leaf node labeled n
new PlusNode(T1, T2) = a PLUS node with children T1 and T2
61
Semantic Actions
This is what we’ll use to construct ASTs
Each grammar symbol may have attributes
» For terminal symbols (lexical tokens) attributes can be calculated by the lexer
Each production may have an action
» Written as: X → Y1 … Yn { action }
» That can refer to or compute symbol attributes
62
Constructing an AST
We define an attribute ast for non-terminals
» Values of ast attributes are ASTs
» We assume that int.lexval is the value of the integer lexeme
» Computed using semantic actions
E → int        { E.ast = new IntNode(int.lexval) }
  | E1 + E2    { E.ast = new PlusNode(E1.ast, E2.ast) }
  | ( E1 )     { E.ast = E1.ast }
63
Parse Tree Example
Consider the string int5 '+' '(' int2 '+' int3 ')'
A bottom-up evaluation of the ast attribute:
E.ast = new PlusNode(new IntNode(5), new PlusNode(new IntNode(2), new IntNode(3)))
   PLUS
   /  \
  5   PLUS
      /  \
     2    3
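The two constructors can be sketched in C with a tagged struct in place of a class hierarchy. This is one possible encoding, not the lecture's implementation: the names `Ast`, `new_int_node`, `new_plus_node`, `eval_ast`, and `free_ast` are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { INT_NODE, PLUS_NODE } Kind;

typedef struct Ast {
    Kind kind;
    int value;                  /* meaningful when kind == INT_NODE */
    struct Ast *left, *right;   /* meaningful when kind == PLUS_NODE */
} Ast;

/* Counterpart of "new IntNode(n)": a leaf labeled n. */
Ast *new_int_node(int n) {
    Ast *t = malloc(sizeof *t);
    assert(t != NULL);
    t->kind = INT_NODE;
    t->value = n;
    t->left = t->right = NULL;
    return t;
}

/* Counterpart of "new PlusNode(T1, T2)": a PLUS node over two subtrees. */
Ast *new_plus_node(Ast *t1, Ast *t2) {
    assert(t1 != NULL && t2 != NULL);
    Ast *t = malloc(sizeof *t);
    assert(t != NULL);
    t->kind = PLUS_NODE;
    t->value = 0;
    t->left = t1;
    t->right = t2;
    return t;
}

/* A later compiler phase can walk the AST without seeing parentheses. */
int eval_ast(const Ast *t) {
    assert(t != NULL);
    if (t->kind == INT_NODE) return t->value;
    return eval_ast(t->left) + eval_ast(t->right);
}

/* Release a whole tree. */
void free_ast(Ast *t) {
    if (t == NULL) return;
    free_ast(t->left);
    free_ast(t->right);
    free(t);
}
```

Building `new_plus_node(new_int_node(5), new_plus_node(new_int_node(2), new_int_node(3)))` mirrors the bottom-up evaluation above, and `eval_ast` on that tree returns 10.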
64
Review
We can specify language syntax using CFG
A parser will answer whether $s \in L(G)$
» and will build a parse tree
» which we convert to an AST
» and pass on to the rest of the compiler
Next lectures:
» How do we answer $s \in L(G)$ and build a parse tree?
65
Introduction to Top-Down Parsing
Terminals are seen in order of appearance in the token stream:
t2 t5 t6 t8 t9
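The parse tree is constructed
» From the top
» From left to right
[Figure: parse tree whose nodes are numbered 1 to 9 in the order they are built; interior nodes 1, 3, 4, 7 and leaves t2, t5, t6, t8, t9]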
66
Recursive Descent Parsing
Consider the grammar
E → T + E | T
T → int | int * T | ( E )
Token stream is: int5 * int2
Start with top-level non-terminal E
Try the rules for E in order
67
Recursive Descent Parsing. Example (Cont.)
Try E0 → T1 + E2
Then try a rule for T1 → ( E3 )
» But ( does not match input token int5
Try T1 → int. Token matches.
» But + after T1 does not match input token *
Try T1 → int * T2
» This will match but + after T1 will be unmatched
Have exhausted the choices for T1
» Backtrack to choice for E0
68
Recursive Descent Parsing. Example (Cont.)
Try E0 → T1
Follow same steps as before for T1
» And succeed with T1 → int * T2 and T2 → int
» With the following parse tree
        E0
        |
        T1
      / | \
  int5  *  T2
           |
          int2
69
Recursive Descent Parsing. Notes.
Easy to implement by hand
But does not always work …
70
Implementation of a Recursive Descent Parser
71
A Recursive Descent Parser. Preliminaries
Let TOKEN be the type of tokens
» Special tokens INT, OPEN, CLOSE, PLUS, TIMES
Let the global next point to the next token
72
A Recursive Descent Parser (2)
Define boolean functions that check the token string for a match of
» A given token terminal
    bool term(TOKEN tok) { return *next++ == tok; }
» A given production of S (the nth)
    bool Sn() { … } // and test
» Any production of S:
    bool S() { … } // or test
These functions advance next
73
A Recursive Descent Parser (3)
For production E → T + E
    bool E1() { return T() && term(PLUS) && E(); }
For production E → T
    bool E2() { return T(); }
For all productions of E (with backtracking)
    bool E() {
        TOKEN *save = next;
        return (next = save, E1())
            || (next = save, E2());
    }
74
A Recursive Descent Parser (4)
Functions for non-terminal T
    bool T1() { return term(OPEN) && E() && term(CLOSE); }
    bool T2() { return term(INT) && term(TIMES) && T(); }
    bool T3() { return term(INT); }
    bool T() {
        TOKEN *save = next;
        return (next = save, T1())
            || (next = save, T2())
            || (next = save, T3());
    }
75
Recursive Descent Parsing. Notes.
To start the parser
» Initialize next to point to first token
» Invoke E()
Notice how this simulates our backtracking example from lecture
Easy to implement by hand
But does not always work …
» Predictive parsing is better
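Putting the pieces from the previous slides together gives a complete recognizer. The sketch below keeps the slide functions as they are and adds three things that are assumptions of this example rather than part of the lecture: an `END` sentinel token that closes the token array, forward declarations, and a `parse` wrapper that also checks that all input was consumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* END is an added sentinel marking the end of the token array. */
typedef enum { INT, OPEN, CLOSE, PLUS, TIMES, END } TOKEN;

static const TOKEN *next;   /* points to the next unread token */

static bool E(void);
static bool T(void);

/* Match one terminal and advance. */
static bool term(TOKEN tok) { return *next++ == tok; }

static bool E1(void) { return T() && term(PLUS) && E(); }   /* E -> T + E */
static bool E2(void) { return T(); }                         /* E -> T     */
static bool E(void) {
    const TOKEN *save = next;       /* remember where to backtrack to */
    return (next = save, E1()) || (next = save, E2());
}

static bool T1(void) { return term(OPEN) && E() && term(CLOSE); }  /* T -> ( E )   */
static bool T2(void) { return term(INT) && term(TIMES) && T(); }   /* T -> int * T */
static bool T3(void) { return term(INT); }                          /* T -> int     */
static bool T(void) {
    const TOKEN *save = next;
    return (next = save, T1()) || (next = save, T2()) || (next = save, T3());
}

/* True iff the whole END-terminated token array can be derived from E. */
bool parse(const TOKEN *tokens) {
    assert(tokens != NULL);
    next = tokens;
    return E() && *next == END;
}
```

For the token stream int * int, that is `{INT, TIMES, INT, END}`, `parse` succeeds by the same route as the worked example: E1 fails on the missing +, and E2 succeeds through T2. One reason the scheme does not always work is that a function never revisits a production once it has returned true. If T3 were tried before T2, T would accept the lone int and the input int * int would be rejected.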