cos301-4.ppt - University of Maine...



10/4/2012

1

COS 301

Programming Languages

Sebesta Chapter 4.1-4.4

Lexical and Syntactic Analysis

Lexical and Syntactic Analysis

• Language implementation systems must analyze source code, regardless of the specific implementation approach (compiler or interpreter)

• Nearly all syntax analysis is based on a formal description of the syntax of the source language (BNF)
– Lexical analysis uses less powerful grammars than syntactic analysis

Source Code Syntax Analysis

• The syntax analysis portion of a language processor nearly always consists of two parts:
– A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)

– A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)

Why Separate Lexical and Syntax Analysis?

• Simplicity - less complex approaches can be used for lexical analysis; separating them simplifies the parser

• Efficiency - separation allows optimization of the lexical analyzer
– About 75% of execution time for a non-optimizing compiler is lexical analysis
• Portability - parts of the lexical analyzer may not be portable, but the parser always is portable
– The lexical analyzer has to deal with low-level details of the character set – such as what a newline character looks like, EOF, etc.

Lexical Analysis

• A lexical analyzer is a pattern matcher for character strings

• A lexical analyzer is a “front-end” for the parser

• Identifies substrings of the source program that belong together - lexemes
– Lexemes match a character pattern, which is associated with a lexical category called a token
– sum is a lexeme; its token may be IDENT

• Often “token” is used in place of lexeme

Lexical Analyzer

• Purpose: transform program representation from sequence of characters to sequence of tokens

• Input: a stream of characters
• Output: lexemes / tokens
• Discard: whitespace, comments


Example Tokens

• Identifiers
• Literals: 123, 5.67, 'x', true
• Keywords or reserved words: bool, while, char ...
• Operators: + - * / ...
• Punctuation: ; , ( ) { }

Other Sequences

• Whitespace: space, tab
• Comments, e.g.
// {any-char} end-of-line
/* {any-char} */
• End-of-line
• End-of-file
• Note: in some languages end-of-line or newline characters are considered white space (C, C++, Java…)

• In other languages (BASIC, Fortran, etc.) they are statement delimiters

Lexical Analyzer (continued)

• The lexical analyzer is usually a function that is called by the parser when it needs the next token

• Three approaches to building a lexical analyzer:
– Write a formal description of the tokens (grammar or regular expressions) and use a software tool that constructs table-driven lexical analyzers given such a description

• Ex. lex, flex, flex++

– Design a state diagram that describes the tokens and write a program that implements the state diagram

– Design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram

The Chomsky Hierarchy (Again)

• Four levels of grammar:
1. Regular
2. Context-free
3. Context-sensitive
4. Unrestricted (recursively enumerable)
• CFGs are used for syntax parsing
• Regular grammars are used for lexical analysis

Productions

• All grammars are tuples {P, T, N, S}
– Where P is a set of productions, T a set of terminal symbols, N a set of non-terminal symbols, and S is the start symbol – a member of N

• The form of the production rules distinguishes the grammars in the hierarchy

Three models of the lexical level

• Although the lexical level can be described with BNF, regular grammars can be used

• Equivalent to regular grammars are:
– Regular expressions
– Finite state automata


Context-Sensitive Grammars

• Productions have the form α → β, with |α| ≤ |β|, where α, β ∈ (N ∪ T)*
– The left-hand side can be composed of strings of terminals and nonterminals
– Length of RHS cannot be less than length of LHS (sentential form cannot shrink in derivation), except that S → ε is allowed
• Note that context-sensitive grammars can have productions such as
– aXb => aYZc
– aXc => aaXb

Context-free Grammars

• Already discussed as BNF - a stylized form of CFG

• Every production is of the form A → α, where A is a single non-terminal and α is a string of terminals and/or non-terminals (possibly empty)

• Equivalent to a pushdown automaton
• For a wide class of unambiguous CFGs, there are table-driven, linear-time parsers

Regular Grammars

• Simplest and least powerful; equivalent to:
– Regular expression
– Finite-state automaton

• All productions must be right-regular or left-regular

• Right-regular grammar: with ω ∈ T*, B ∈ N
A → ωB
A → ω

• I.e., the rhs of any production must contain at most one nonterminal AND it must be the rightmost symbol

• Direct recursion is permitted: A → ωA

Regular Grammars

• Left-regular grammar: with ω ∈ T*, B ∈ N
A → Bω
A → ω

• A regular grammar is a right-regular or a left-regular grammar
– If we have both types of rules we have a linear grammar – a more powerful formalism than a regular grammar
– Regular langs ⊂ linear langs ⊂ context-free langs

• Example of a linear language that is not a regular language: { aⁿ bⁿ | n ≥ 1 }

i.e., we cannot balance symbols that have matching pairs, such as ( ), { }, begin end, with a regular grammar
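The limitation can be seen in code: a single counter, the simplest use of a pushdown stack, is enough to balance pairs, while no fixed set of states is. A minimal C sketch (the function name is illustrative, not from the slides):

```c
/* A finite automaton cannot match arbitrarily nested pairs, but one
   counter (the simplest pushdown behavior) can: this checks that
   '(' and ')' balance. */
int balanced(const char *s) {
    int depth = 0;                      /* plays the role of the stack */
    for (; *s != '\0'; s++) {
        if (*s == '(')
            depth++;
        else if (*s == ')') {
            if (depth == 0)
                return 0;               /* closing with nothing open */
            depth--;
        }
    }
    return depth == 0;                  /* everything opened was closed */
}
```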

Right-regular Integer grammar

Integer → 0 Integer | 1 Integer | ... | 9 Integer
Integer → 0 | 1 | ... | 9

• In EBNF
Integer → (0 | ... | 9) Integer
Integer → 0 | ... | 9
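The right-regular Integer grammar corresponds directly to a simple recognizer. A minimal C sketch (the function name is illustrative, not from the text):

```c
#include <ctype.h>

/* Returns 1 if s is a non-empty string of digits, matching the
   grammar: Integer -> (0|...|9) Integer | 0 | ... | 9 */
int is_integer(const char *s) {
    if (*s == '\0')
        return 0;                        /* must have at least one digit */
    while (*s != '\0') {
        if (!isdigit((unsigned char)*s))
            return 0;                    /* any non-digit rejects */
        s++;
    }
    return 1;                            /* accepting state */
}
```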

Summary of Grammatical Forms

• Regular Grammars
– Only one nonterminal on left; rhs of any production must contain at most one nonterminal AND it must be the rightmost (leftmost) symbol

• Context-Free Grammars
– Only one non-terminal symbol on lhs

• Context-Sensitive Grammars
– Lhs can contain any number of terminals and non-terminals
– Sentential form cannot shrink in derivation

• Unrestricted Grammars
– Same as CSGs but remove restriction on shrinking sentential forms


Left-regular Integer grammar

Integer → Integer 0 | Integer 1 | ... | Integer 9
Integer → 0 | 1 | ... | 9

• In EBNF
Integer → Integer (0 | ... | 9)
Integer → 0 | ... | 9

Finite State Automata

• An abstract machine that is useful for lexical analysis
– Also known as Finite State Machines

• Two varieties (equivalent in power):
– Non-deterministic finite state automata (NFSA)
– Deterministic finite state automata (DFSA)

• Only DFSAs are directly useful for constructing programs
– Any NFSA can be converted into an equivalent DFSA

• We will use an informal approach to describe DFSAs
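As a sketch of how a DFSA becomes a program, here is a table-driven C recognizer for the identifier automaton used later in these slides, Letter followed by Letters or Digits; the state and class names are illustrative, not from the text:

```c
#include <ctype.h>

/* Table-driven DFSA for identifiers: Letter (Letter | Digit)*.
   States: START, IN_ID (accepting), ERR (dead state). */
enum { START = 0, IN_ID = 1, ERR = 2 };
enum { C_LETTER = 0, C_DIGIT = 1, C_OTHER = 2 };

static const int trans[3][3] = {
    /*            LETTER  DIGIT  OTHER */
    /* START */ { IN_ID,  ERR,   ERR },
    /* IN_ID */ { IN_ID,  IN_ID, ERR },
    /* ERR   */ { ERR,    ERR,   ERR }
};

static int char_class(int c) {
    if (isalpha(c)) return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;
}

/* Returns 1 if s is accepted as an identifier, 0 otherwise */
int accepts_identifier(const char *s) {
    int state = START;
    for (; *s != '\0'; s++)
        state = trans[state][char_class((unsigned char)*s)];
    return state == IN_ID;               /* accept only in IN_ID */
}
```

Because the transitions are a table, the same driver loop works for any DFSA; only the table changes.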

What is a Finite State Machine?

• A device that has a finite number of states
• It accepts input from a “tape”
• Each state and each input symbol uniquely determine another state (hence deterministic)
• The device starts operation before any input is read – this is the “start state”
• At the end of input the device may be in an “accepting” state
– If inputs are characters then the device recognizes a language

• Some inputs may cause the device to enter an “error” state (not usually explicitly represented)

Other uses of FSAs / FSMs

• Finite state machines can be used to describe things other than languages

• Many relatively simple embedded systems can be described with a finite state machine

FSA Graph Representation

• A finite state automaton has
1. A set of states: represented by nodes in a graph
2. An input alphabet augmented with a unique end-of-input symbol
3. A state transition function, represented by directed edges in the graph, labeled with symbols from the alphabet or set of inputs
4. A unique start state
5. One or more final (accepting) states – no exiting edges

Example: Vending Machine
• Adapted from Wulf, Shaw, Hilfinger, Flon, Fundamental Structures of Computer Science, p. 17.


Example: Battery Charger
• From http://www.jcelectronica.com/articles/state_machines.htm

A Finite State Automaton for Identifiers

• This diagram indicates an explicit transition to an accepting state
• We could also use this diagram:

[State diagrams not reproduced: states S, 1, and F, with transitions labeled Letter (L) and Letter, Digit (L, D); the first diagram takes an explicit transition to the accepting state F on the end marker $]

FSM for a childish language

• What language is described by this diagram?

[State diagram not reproduced: start state S with transitions labeled m, d, and a]

Quiz Oct 2

1. Draw a DFSA that recognizes binary strings that start with 1 and end with 0

2. Draw a DFSA that recognizes binary strings with at least three consecutive 1’s

3. Below is a BNF grammar for fractional numbers. Rewrite as EBNF
S -> -FN | FN
FN -> DL | DL.DL
DL -> D | D DL
D -> 0|1|2|3|4|5|6|7|8|9

Regular Expressions

• An alternative to regular grammars for specifying a language at the lexical level

• Also used extensively in text processing
• Very useful for web applications
• Built-in support in many languages, e.g., Perl, Ruby, Java, JavaScript, Python, .NET languages
• There are several different syntactic conventions for regexes

Regular Expressions

Regex   Meaning
x       a character x (stands for itself)
\x      an escaped character, e.g., \n
M | N   M or N
M N     M followed by N
M*      zero or more occurrences of M

Note: \ usage varies with software; typical uses:

certain non-printable characters (e.g., \n = newline and \t = tab)

ASCII hex (\xFF) or Unicode hex (\xFFFF)

shorthand character classes (\w = word, \s = whitespace, \d = digit)

escaping a literal, e.g., \* or \.


Regular Expression Metasymbols

Regex    Meaning
M+       One or more occurrences of M
M?       Zero or one occurrence of M
M*       Zero or more occurrences of M
[aeiou]  the set of vowels
[0-9]    the set of digits
.        Any single character
( )      Grouping

Regex Examples - 1

Let Σ = { a, b, c }

r = ( a | b )* c

This regex specifies repetition (0, 1, 2, etc. occurrences) of either a or b, followed by c. Strings that match this regular expression include:

c
ac
bc
abc
aabbaabbc

Regex Examples – 2

Let Σ = { a, b, c }

r = ( a | c )* b ( a | c )*

This regular expression specifies repetition of either a or c, followed by b, followed by repetition of either a or c. Matches include:

b
ab
bcccc
abcaacca
aba
acabccca

Regex Examples – 3

• A regular expression to represent a signed integer
• There is an optional leading sign (+ or -) followed by at least one digit in the range 0 .. 9

(\+ | \-)? [0-9]+

Matches include +1, 0, -0, 827356, -98686, …

Regex Examples - 4

• A regular expression to represent a signed floating point number. There is an optional leading sign ( + or - ) followed by 1 or more digits in the range 0 .. 9 followed by an optional decimal point and then 1 or more digits in the range 0 .. 9. The \ symbol indicates . is the literal period and not the . symbol for “any character.”

1. (\+|\-)?[0-9]+(\.[0-9]+)?
2. [-+]?([0-9]+\.[0-9]+|[0-9]+)
3. [-+]?[0-9]+\.?[0-9]+

This illustrates how complex regexes can be!

Regex Libraries

• Many sources available online
• See for example http://regexlib.com/Default.aspx


Lexical Syntax for a simple C-like language

anyChar      [ -~]       Note: space (0x20) to tilde (0x7E)
Letter       [a-zA-Z]
Digit        [0-9]
Whitespace   [ \t]       Again note the literal space (0x20)
Eol          \n
Eof          \004

Lexical Syntax for a simple C-like language

Keyword      bool | char | else | false | float | if | int | main | true | while
Identifier   {Letter}({Letter} | {Digit})*
integerLit   {Digit}+
floatLit     {Digit}+\.{Digit}+
charLit      '{anyChar}'
Operator     = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
Separator    ; | . | { | } | ( | )
Comment      // ({anyChar} | {Whitespace})* {Eol}

Some Common Conventions

• When expressing lexical rules for a language:
– Explicit terminator typically is used only for the program as a whole, not each token.
– An unlabeled arc represents any other valid input symbol.
– Recognition of a token ends in a final state.
– Recognition of a non-token (e.g., whitespace, comment) transitions back to the start state.

• Recognition of the end symbol (end of file) ends in a final state.

• Automaton must be deterministic.
– Drop keywords; handle separately with a lookup table
– We must consider all sequences with a common prefix together. Examples: 1. floats and ints 2. comments and division
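The floats-and-ints case can be sketched in C: the scanner consumes the common Digit+ prefix first, and only then looks for "." Digit+ to decide which token it has (names and return strings are illustrative, not from front.c):

```c
#include <ctype.h>

/* Classify a whole string as an integer literal, a float literal,
   or neither, using the shared Digit+ prefix plus lookahead on '.' */
const char *int_or_float(const char *s) {
    if (!isdigit((unsigned char)*s))
        return "ERROR";                  /* must start with a digit */
    while (isdigit((unsigned char)*s))
        s++;                             /* consume the Digit+ prefix */
    if (*s == '\0')
        return "INT";                    /* no '.', an integer */
    if (*s != '.')
        return "ERROR";
    s++;                                 /* consume the '.' */
    if (!isdigit((unsigned char)*s))
        return "ERROR";                  /* '.' must be followed by digits */
    while (isdigit((unsigned char)*s))
        s++;                             /* fractional Digit+ */
    return (*s == '\0') ? "FLOAT" : "ERROR";
}
```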

DFSAs for a small C-like Language

ws = whitespace, l = letter, d = digit, eoln = \n, eof = end of input; all others are literal

Whitespace

// comments

Identifiers

DFSAs for a small C-like language

Ints and floats

Single & double quotes

Assignment & comparison

Addition

Logical and bitwise AND


Translations

• A DFSA that accepts binary strings with an even number of 1 bits

[State diagram not reproduced: states A (start, accepting) and B; a 1 moves between A and B, a 0 stays in the same state]

• Right Regular Grammar
A -> 0A | 1B | ε
B -> 0B | 1A

• Regex
(0*10*1)*0*
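The same automaton is easy to render in C; the two non-terminals A and B collapse to one toggling state variable (a sketch, not from the slides):

```c
/* DFSA for binary strings with an even number of 1 bits.
   Two states: A (even so far, accepting) and B (odd so far).
   A 0 leaves the state unchanged; a 1 toggles it. */
int even_ones(const char *s) {
    int state = 0;                 /* 0 = state A (even), 1 = state B (odd) */
    for (; *s != '\0'; s++) {
        if (*s == '1')
            state = !state;        /* 1 toggles between A and B */
        else if (*s != '0')
            return 0;              /* not a binary string */
    }
    return state == 0;             /* accept only in state A */
}
```

Note that the empty string is accepted, matching the ε-production A -> ε in the grammar.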

State Diagram Design

– A naive state diagram would have a transition from every state on every character in the source language
– All keywords would be captured in the state diagram
– Such a diagram would be very large!

Lexical Analysis (cont.)

• In many cases, transitions can be combined to simplify the state diagram
– When recognizing an identifier, all uppercase and lowercase letters are equivalent
• Use a character class that includes all letters
– When recognizing an integer literal, all digits are equivalent - use a digit class

Lexical Analysis (cont.)

• Reserved words and identifiers can be recognized together (rather than having a part of the diagram for each reserved word)
– Use a table lookup to determine whether a possible identifier is in fact a reserved word
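A minimal version of such a lookup table in C, using the keyword list from the lexical-syntax slides (the function name is illustrative; front.c itself does not include this):

```c
#include <string.h>

/* Keyword table for the small C-like language in these slides */
static const char *keywords[] = {
    "bool", "char", "else", "false", "float",
    "if", "int", "main", "true", "while"
};

/* Returns 1 if lexeme is a reserved word, 0 if an ordinary identifier */
int is_keyword(const char *lexeme) {
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;
}
```

A production scanner would typically use a sorted table with binary search or a hash table, but the linear scan shows the idea.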

Lexical Rules

<id> ::= <letter> | <letter> <id2>

<id2> ::= <letter> <id2> | <digit> <id2> | <letter> | <digit>

<int> ::= <digit> | <digit> <int>

<other> ::= + | - | * | / | ( | )

State Diagram


Lexical Analyzer from Text

Implementation: front.c (pp. 176-181)

- Following is the output of the lexical analyzer of front.c when used on (sum + 47) / total

Next token is: 25 Next lexeme is (
Next token is: 11 Next lexeme is sum
Next token is: 21 Next lexeme is +
Next token is: 10 Next lexeme is 47
Next token is: 26 Next lexeme is )
Next token is: 24 Next lexeme is /
Next token is: 11 Next lexeme is total
Next token is: -1 Next lexeme is EOF

Program Structure

• Program is a DFSA with global variables
• Utility routines:
– getChar - gets the next character of input, puts it in nextChar, determines its class and puts the class in charClass
– getNonBlank – advances over whitespace to the first char of a token
– addChar - puts the character from nextChar into the array where the lexeme is being accumulated, lexeme
– lookup - determines whether the string in lexeme is a reserved word (returns a code)

front.c 1

#include <stdio.h>
#include <ctype.h>

/* global declarations */
/* variables */
int charClass;
char lexeme[100];
char nextChar;
int lexLen;
int nextToken;
FILE *in_fp, *fopen();

/* Function declarations */
void addChar();
void getChar();
void getNonBlank();
int lex();

front.c 2

/* Character classes */
#define LETTER 0
#define DIGIT 1
#define UNKNOWN 99

/* Token codes */
#define INT_LIT 10
#define IDENT 11
#define ASSIGN_OP 20
#define ADD_OP 21
#define SUB_OP 22
#define MULT_OP 23
#define DIV_OP 24
#define LEFT_PAREN 25
#define RIGHT_PAREN 26

front.c 3

/* main driver */
main() {
  /* open the input data file and process its contents */
  if ((in_fp = fopen("front.in", "r")) == NULL)
    printf("ERROR - cannot open front.in \n");
  else {
    getChar();
    do {
      lex();
    } while (nextToken != EOF);
  }
}

front.c 4

/* lookup - a function to look up operators and parentheses and return the token */
int lookup(char ch) {
  switch (ch) {
    case '(':
      addChar();
      nextToken = LEFT_PAREN;
      break;
    case ')':
      addChar();
      nextToken = RIGHT_PAREN;
      break;
    case '+':
      addChar();
      nextToken = ADD_OP;
      break;
    case '-':
      addChar();
      nextToken = SUB_OP;
      break;
    case '*':
      addChar();
      nextToken = MULT_OP;
      break;
    case '/':
      addChar();
      nextToken = DIV_OP;
      break;
    default:
      addChar();
      nextToken = EOF;
      break;
  }
  return nextToken;
}


front.c 5

/* addChar - a function to add nextChar to lexeme */
void addChar() {
  if (lexLen <= 98) {
    lexeme[lexLen++] = nextChar;
    lexeme[lexLen] = 0;
  } else {
    printf("Error - lexeme too long \n");
  }
}

/* getChar - a function to get the next character of input and determine its character class */
void getChar() {
  if ((nextChar = getc(in_fp)) != EOF) {
    if (isalpha(nextChar))
      charClass = LETTER;
    else if (isdigit(nextChar))
      charClass = DIGIT;
    else
      charClass = UNKNOWN;
  } else
    charClass = EOF;
}

front.c 6

/* getNonBlank - a function to call getChar until it returns a non-whitespace character */
void getNonBlank() {
  while (isspace(nextChar))
    getChar();
}

/* lex - a simple lexical analyzer for arithmetic expressions */
int lex() {
  lexLen = 0;
  getNonBlank();
  switch (charClass) {
    case LETTER:
      /* parse identifiers */
      addChar();
      getChar();
      while (charClass == LETTER || charClass == DIGIT) {
        addChar();
        getChar();
      }
      nextToken = IDENT;
      break;

front.c 7

    case DIGIT:
      /* parse integer literals */
      addChar();
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      nextToken = INT_LIT;
      break;

    case UNKNOWN:
      /* parentheses and operators */
      lookup(nextChar);
      getChar();
      break;

    case EOF:
      /* end of file */
      nextToken = EOF;
      lexeme[0] = 'E';
      lexeme[1] = 'O';
      lexeme[2] = 'F';
      lexeme[3] = 0;
      break;
  } /* end of switch */

  printf("Next token is: %d, next lexeme is %s\n", nextToken, lexeme);
  return nextToken;
} /* end lex */

Example output (sum + 47) / total

Next token is: 25 lexeme is (
Next token is: 11 lexeme is sum
Next token is: 21 lexeme is +
Next token is: 10 lexeme is 47
Next token is: 26 lexeme is )
Next token is: 24 lexeme is /
Next token is: 11 lexeme is total
Next token is: -1 lexeme is EOF

Syntactic Analysis

• Syntactic analysis or parsing determines whether a program is legal or syntactically correct.

• There are two distinct goals:
1. If a program is not syntactically correct, produce diagnostic messages. Many parsers try to recover and continue analysis as long as possible in order to diagnose as many problems as possible
2. If a program is syntactically correct, produce a parse tree

Two general types of parsers

• Top-down parsers start with the start symbol of the language and build a parse tree in preorder:
– Visit the node
– Visit the left subtree
– Visit the right subtree

• This corresponds to a leftmost derivation
• Example: Given current string x A y, and a rule A → w, rewrite the string as x w y


Bottom-up parsers

• Bottom-up parsers construct a tree starting with the leaves – in the reverse order of a rightmost derivation

• In broad terms, given a right sentential form α, the parser finds the substring of α (called the handle) that is the RHS of the rule used to produce α in the previous step of the rightmost derivation
– The handle is then reduced to the rule's LHS
– Example: If the current string is x w y and there is a rule A → w, rewrite the string as x A y

Computational Complexity of Parsing

• Parsing CFLs in the general case is inefficient and exponential in the length of the program string
– Each possible rule has to be tried (exhaustive search)

• There are a number of algorithms that can reduce complexity to O(n³)
– Still too complex for commercial compilers

• By reducing the generality of the languages to be parsed, complexity can be reduced to approximately linear, O(n)

Top-Down Parsing

• Given the sentential form xAα, where
– x is a string of terminal symbols
– A is the leftmost non-terminal
– α is a string of terminals and non-terminals

• Our goal is to find the next sentential form in a leftmost derivation
– We need to choose a rule where A is the LHS
– Suppose the possibilities are
• A => bB    A => cBb    A => a
– We need to choose among the sentential forms
• xbBα    xcBbα    xaα

How to choose?

• Examine the next token of input: is it a, b, or c?
• This of course is easy, but it may get considerably more complex if the RHSs begin with non-terminals

Recursive Descent Parsing

• An easy and straightforward top-down parsing algorithm (at least for humans to write)
– It only works with a subset of CFGs called LL(k)
• L = Left-to-right scan of the input
• L = Leftmost derivation
• (k) means at most k tokens of lookahead – usually 1 for an efficient parser

– LR grammars are left-to-right scan with rightmost derivation
• Handle a wider class of grammars than LL parsers
• Better at error reporting
• Table-driven parser, harder for humans to write than LL
• Easy to generate by machine (e.g., yacc)

Recursive Descent Parsing

• Constructed from a set of mutually recursive routines that mirror the productions of the grammar
– EBNF is well-suited as a model for a recursive descent parser

• Each non-terminal in the grammar has a single routine or function
– Its purpose is to trace the parse tree starting from that symbol
– It is effectively a parser for that language where the nonterminal is the start symbol


Example

• EBNF
<expr> => <term> {(+ | -) <term>}
<term> => <factor> {(* | /) <factor>}
<factor> => <id> | int_constant | ( expr )

• In the following example, remember that the lexer has global variables:
char nextChar;
int lexLen;
int nextToken;

Defines from front.c 2

/* Character classes */
#define LETTER 0
#define DIGIT 1
#define UNKNOWN 99

/* Token codes */
#define INT_LIT 10
#define IDENT 11
#define ASSIGN_OP 20
#define ADD_OP 21
#define SUB_OP 22
#define MULT_OP 23
#define DIV_OP 24
#define LEFT_PAREN 25
#define RIGHT_PAREN 26

Expr

void expr() {
  /* parses <expr> => <term> {(+|-) <term>} */
  printf("enter <expr>\n");
  term();
  while (nextToken == ADD_OP || nextToken == SUB_OP) {
    lex();
    term();
  }
  printf("exit <expr>\n");
}

/* Q: Where does nextToken come from?
   A: Each function leaves the next unconsumed token in nextToken;
      each function assumes on entry that it is available in nextToken */

Term

void term() {
  /* parses <term> => <factor> {(* | /) <factor>} */
  printf("enter <term>\n");
  factor();
  while (nextToken == MULT_OP || nextToken == DIV_OP) {
    lex();
    factor();
  }
  printf("exit <term>\n");
}

<Factor> is a bit more complex…

• Factor has to choose between the several alternate RHSs

<factor> => <id> | int_constant | ( expr )

• Also, we may be able to detect a syntax error in this function
– The previous two functions could not

Factor

void factor() {
  /* parses <factor> => <id> | int_constant | ( expr ) */
  printf("enter <factor>\n");
  if (nextToken == IDENT || nextToken == INT_LIT)
    lex();
  else {
    if (nextToken == LEFT_PAREN) {
      lex();
      expr(); /* recursion! */
      if (nextToken == RIGHT_PAREN)
        lex();
      else
        error();
    } else
      error();
  }
  printf("exit <factor>\n");
}


Example output (sum + 47) / total

Next token is: 25 lexeme is (
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 11 lexeme is sum
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 21 lexeme is +
Exit <factor>
Exit <term>
Next token is: 10 lexeme is 47
Enter <term>
Enter <factor>

Example output (sum + 47) / total

Next token is: 26 lexeme is )
Exit <factor>
Exit <term>
Exit <expr>
Next token is: 24 lexeme is /
Exit <factor>
Next token is: 11 lexeme is total
Enter <factor>
Next token is: -1 lexeme is EOF
Exit <factor>
Exit <term>
Exit <expr>

Example 2: if statement

<ifstmt> -> if ( <boolexpr> ) <stmt> [else <stmt>]

• Recursive descent subprogram has to
– Check that current token is IF
– Lex() and check that current token is (
– Lex() and call <boolexpr>
– Check that current token is )
– Lex() and call <stmt>
– Check if current token is ELSE; if so, Lex() and call <stmt>

Example 2: if statement

void ifstmt() {
  if (nextToken != IF_CODE)
    error();
  else {
    lex();
    if (nextToken != LEFT_PAREN)
      error();
    else {
      lex(); /* error in text; this was omitted */
      boolexpr();
      if (nextToken != RIGHT_PAREN)
        error();
      else {
        lex(); /* error in text; this was omitted */
        stmt();

Example 2: if statement

        if (nextToken == ELSE_CODE) {
          lex();
          stmt();
        } /* end if (nextToken == ELSE_CODE) */
      } /* end if (nextToken != RIGHT_PAREN) */
    } /* end if (nextToken != LEFT_PAREN) */
  } /* end if (nextToken != IF_CODE) */
} /* end ifstmt */


LL Grammars

• Top-down parsing algorithms are simple and easy to hand-code
– But the class of grammars that can be recognized using top-down parsing is limited to LL(k) (and it is easiest when k = 1: one symbol of lookahead)

• Rule #1: left recursion is prohibited
– Given a rule <A> => <A> + <B> we would obviously have infinite recursion, as the routine for A has to start with a recursive call to A
– Note that this applies only to the FORM of the grammar
– EBNF can be useful for top-down parsing

BNF and EBNF

• BNF
<expr> → <expr> + <term>
       | <expr> - <term>
       | <term>
<term> → <term> * <factor>
       | <term> / <factor>
       | <factor>

• EBNF
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}

Eliminating Direct Left Recursion

• Direct left recursion can be removed by rewriting any rule of the form
– A => AxB | B | C

• As
– A => BA' | CA'
– A' => xBA' | ε

Left recursion removal

<expr> → <expr> + <term> | <expr> - <term> | <term>
<term> → <term> * <factor> | <term> / <factor> | <factor>
<factor> → <id> | ( <expr> )

<expr> → <term> <expr'>
<expr'> → + <term> <expr'> | - <term> <expr'> | ε
<term> → <factor> <term'>
<term'> → * <factor> <term'> | / <factor> <term'> | ε
<factor> → <id> | ( <expr> )

Indirect Left Recursion

• Indirect left recursion also presents a problem:
A => B x A
B => A B

• It is possible to remove indirect left recursion but this is beyond our scope

Rule #2

• In order to use one-symbol lookahead, the rules on the RHS of any production must be distinguishable by examining only one token
– The text refers to this as the "pairwise disjointness" rule
– For any RHS α of a nonterminal A (A → α), we can compute a set called FIRST(α), which contains the terminals that can appear first in a string derived from α
– So for A → α | β, we want the intersection of FIRST(α) and FIRST(β) to be empty


Example

• Consider
– A => aB | bAb | Bb
– B => cB | d
– FIRST(aB) = {a}; FIRST(bAb) = {b}; FIRST(Bb) = {c, d}

• Consider
– A => aB | BAb
– B => aB | b
– FIRST(aB) = {a}; FIRST(BAb) = {a, b}

• When parsing A we can't determine what production to apply by looking at the next terminal

Left factoring

• Rewriting the grammar can solve many lookahead problems

• Consider subscript expressions
<var> => <ident> | <ident> [ <expr> ]

• Rewrite as
<var> => <ident> <subscriptExpr>
<subscriptExpr> => [ <expr> ] | ε

• Which is identical to the EBNF
<var> => <ident> [ [<expr>] ]
(here the outer brackets mean "optional"; the inner brackets are literal)
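A recursive-descent routine for the left-factored rule might look like the following C sketch; the token stream, the bracket token codes, and the stub expr() are illustrative scaffolding, not part of front.c:

```c
/* Illustrative token codes; front.c does not define bracket tokens */
enum { INT_LIT = 10, IDENT = 11, LEFT_BRACKET = 27, RIGHT_BRACKET = 28, END = -1 };

/* Stub token stream so the sketch is self-contained */
static const int *tokens;
static int pos, nextToken, errors;

static void lex(void)   { nextToken = tokens[pos++]; }
static void error(void) { errors++; }

/* Stub: accept a single integer literal as the index expression */
static void expr(void)  { if (nextToken == INT_LIT) lex(); else error(); }

/* Recursive-descent routine for the left-factored rule
   <var> => <ident> [ "[" <expr> "]" ] */
void var(void) {
    if (nextToken != IDENT) { error(); return; }
    lex();                              /* consume the identifier */
    if (nextToken == LEFT_BRACKET) {    /* optional subscript part */
        lex();
        expr();                         /* parse the index */
        if (nextToken == RIGHT_BRACKET)
            lex();
        else
            error();                    /* missing ] */
    }                                   /* else: empty alternative */
}

/* Parse one <var> from a token array; returns 1 on success */
int parse_var(const int *toks) {
    tokens = toks; pos = 0; errors = 0;
    lex();                              /* prime nextToken */
    var();
    return errors == 0 && nextToken == END;
}
```

One token of lookahead (nextToken) is enough: the optional part is taken exactly when the next token is the opening bracket.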

Quiz Answers

• Draw a DFSA that recognizes binary strings that start with 1 and end with 0

[State diagram not reproduced]

• Below is a BNF grammar for fractional numbers. Rewrite as EBNF
S -> -FN | FN
FN -> DL | DL.DL
DL -> D | D DL
D -> 0|1|2|3|4|5|6|7|8|9

S -> [-]FN
FN -> DL[.DL]
DL -> D{D}

DFSA for q2

• Draw a DFSA that recognizes binary strings with at least three consecutive 1's

[State diagram not reproduced: from the start state S, each 1 advances one state toward an accepting state reached after three consecutive 1's; a 0 returns to S; the accepting state loops on 1, 0]

Quiz 4

• For the language of binary strings that contain at least 3 consecutive 1's, write:
1. A regular grammar
2. A regular expression