cos301-4.ppt - University of Maine...
10/4/2012
COS 301
Programming Languages
Sebesta Chapter 4.1-4.4
Lexical and Syntactic Analysis
Lexical and Syntactic Analysis
• Language implementation systems must analyze source code, regardless of the specific implementation approach (compiler or interpreter)
• Nearly all syntax analysis is based on a formal description of the syntax of the source language (BNF)
– Lexical analysis uses less powerful grammars than syntactic analysis
Source Code Syntax Analysis
• The syntax analysis portion of a language processor nearly always consists of two parts:
– A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
– A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)
Why Separate Lexical and Syntax Analysis?
• Simplicity - less complex approaches can be used for lexical analysis; separating them simplifies the parser
• Efficiency - separation allows optimization of the lexical analyzer
– About 75% of execution time for a non-optimizing compiler is lexical analysis
• Portability - parts of the lexical analyzer may not be portable, but the parser always is portable
– The lexical analyzer has to deal with low-level details of the character set – such as what a newline character looks like, EOF, etc.
Lexical Analysis
• A lexical analyzer is a pattern matcher for character strings
• A lexical analyzer is a “front-end” for the parser
• Identifies substrings of the source program that belong together - lexemes
– Lexemes match a character pattern, which is associated with a lexical category called a token
– sum is a lexeme; its token may be IDENT
• Often “token” is used in place of lexeme
Lexical Analyzer
• Purpose: transform program representation from sequence of characters to sequence of tokens
• Input: a stream of characters
• Output: lexemes / tokens
• Discard: whitespace, comments
Example Tokens
• Identifiers
• Literals: 123, 5.67, 'x', true
• Keywords or reserved words: bool, while, char ...
• Operators: + - * / ...
• Punctuation: ; , ( ) { }
Other Sequences
• Whitespace: space, tab
• Comments, e.g.
// {any-char} end-of-line
/* {any-char} */
• End-of-line
• End-of-file
• Note: in some languages end-of-line or newline characters are considered white space (C, C++, Java…)
• In other languages (BASIC, Fortran, etc.) they are statement delimiters
Lexical Analyzer (continued)
• The lexical analyzer is usually a function that is called by the parser when it needs the next token
• Three approaches to building a lexical analyzer:
– Write a formal description of the tokens (grammar or regular expressions) and use a software tool that constructs table-driven lexical analyzers from such a description
• Ex. lex, flex, flex++
– Design a state diagram that describes the tokens and write a program that implements the state diagram
– Design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram
The Chomsky Hierarchy (Again)
• Four levels of grammar:
1. Regular
2. Context-free
3. Context-sensitive
4. Unrestricted (recursively enumerable)
• CFGs are used for syntax parsing
• Regular grammars are used for lexical analysis
Productions
• All grammars are tuples {P, T, N, S}
– Where P is a set of productions, T a set of terminal symbols, N a set of non-terminal symbols, and S is the start symbol – a member of N
• The form of production rules distinguishes grammars in hierarchy
Three models of the lexical level
• Although the lexical level can be described with BNF, the less powerful regular grammars suffice
• Equivalent to regular grammars are:
– Regular expressions
– Finite state automata
Context-Sensitive Grammars
• Productions have the form α → β, where |α| ≤ |β| and α, β ∈ (N ∪ T)*
– The left-hand side can be composed of strings of terminals and nonterminals
– The length of the RHS cannot be less than the length of the LHS (the sentential form cannot shrink in a derivation), except that S → ε is allowed
• Note that context-sensitive grammars can have productions such as
– aXb → aYZc
– aXc → aaXb
Context-free Grammars
• Already discussed as BNF - a stylized form of CFG
• Every production is of the form A → α, where A is a single non-terminal and α is a string of terminals and/or non-terminals (possibly empty)
• Equivalent to a pushdown automaton
• For a wide class of unambiguous CFGs, there are table-driven, linear-time parsers
Regular Grammars
• Simplest and least powerful; equivalent to:
– Regular expressions
– Finite-state automata
• All productions must be right-regular or left-regular
• Right regular grammar: productions have the form
A → αB
A → α
where α ∈ T*, B ∈ N
• I.e., the RHS of any production must contain at most one nonterminal AND it must be the rightmost symbol
• Direct recursion is permitted: A → αA
Regular Grammars
• Left regular grammar: productions have the form
A → Bα
A → α
where α ∈ T*, B ∈ N
• A regular grammar is a right-regular or a left-regular grammar
– If we have both types of rules we have a linear grammar – a more powerful formalism than a regular grammar
– Regular langs ⊂ linear langs ⊂ context-free langs
• Example of a linear language that is not a regular language:
{ aⁿ bⁿ | n ≥ 1 }
i.e., we cannot balance symbols that have matching pairs such as ( ), { }, begin end, with a regular grammar
Right-regular Integer grammar
Integer → 0 Integer | 1 Integer | ... | 9 Integer
Integer → 0 | 1 | ... | 9
• In EBNF
Integer → (0 | ... | 9) Integer
Integer → 0 | ... | 9
Summary of Grammatical Forms
• Regular Grammars
– Only one nonterminal on the left; the RHS of any production must contain at most one nonterminal AND it must be the rightmost (leftmost) symbol
• Context Free Grammars
– Only one non-terminal symbol on the LHS
• Context-Sensitive Grammars
– LHS can contain any number of terminals and non-terminals
– Sentential form cannot shrink in a derivation
• Unrestricted Grammars
– Same as CSGs but remove the restriction on shrinking sentential forms
Left-regular Integer grammar
Integer → Integer 0 | Integer 1 | ... | Integer 9
Integer → 0 | 1 | ... | 9
• In EBNF
Integer → Integer (0 | ... | 9)
Integer → 0 | ... | 9
Finite State Automata
• An abstract machine that is useful for lexical analysis
– Also known as Finite State Machines
• Two varieties (equivalent in power):
– Non-deterministic finite state automata (NFSA)
– Deterministic finite state automata (DFSA)
• Only DFSAs are directly useful for constructing programs
– Any NFSA can be converted into an equivalent DFSA
• We will use an informal approach to describe DFSAs
What is a Finite State Machine?
• A device that has a finite number of states.
• It accepts input from a “tape”
• Each state and each input symbol uniquely determine another state (hence deterministic)
• The device starts operation before any input is read – this is the “start state”
• At the end of input the device may be in an “accepting” state
– If inputs are characters then the device recognizes a language
• Some inputs may cause the device to enter an “error” state (not usually explicitly represented)
Other uses of FSAs / FSMs
• Finite state machines can be used to describe things other than languages
• Many relatively simple embedded systems can be described with a finite state machine
FSA Graph Representation
• A finite state automaton has
1. A set of states: represented by nodes in a graph
2. An input alphabet, augmented with a unique end-of-input symbol
3. A state transition function, represented by directed edges in the graph, labeled with symbols from the alphabet or set of inputs
4. A unique start state
5. One or more final (accepting) states – no exiting edges
Example: Vending Machine
• Adapted from Wulf, Shaw, Hilfinger, Flon, Fundamental Structures of Computer Science, p. 17.
Example: Battery Charger
• From http://www.jcelectronica.com/articles/state_machines.htm
A Finite State Automaton for Identifiers
• This diagram indicates an explicit transition to an accepting state
• We could also use this diagram:
[Diagrams: from start state S, a Letter transition enters state 1, which loops on Letter or Digit; an end-of-input symbol ($) leads to final state F. The second diagram abbreviates the labels to L and L, D.]
FSM for a childish language
• What language is described by this diagram?
[Diagram: an FSM with start state S and transitions labeled a, m, and d.]
Quiz Oct 2
1. Draw a DFSA that recognizes binary strings that start with 1 and end with 0
2. Draw a DFSA that recognizes binary strings with at least three consecutive 1’s
3. Below is a BNF grammar for fractional numbers. Rewrite as EBNF
S -> -FN | FN
FN -> DL | DL.DL
DL -> D | D DL
D -> 0|1|2|3|4|5|6|7|8|9
Regular Expressions
• An alternative to regular grammars for specifying a language at the lexical level
• Also used extensively in text-processing • Very useful for web applications• Built-in support in many languages, e.g., Perl,
Ruby, Java, Javascript, Python, .NET languages• There are several different syntactic
conventions for regexes
Regular Expressions
Regex   Meaning
x       a character x (stands for itself)
\x      an escaped character, e.g., \n
M | N   M or N
M N     M followed by N
M*      zero or more occurrences of M
Note: \ varies with software; typical usage:
– certain non-printable characters (e.g., \n = newline and \t = tab)
– ASCII hex (\xFF) or Unicode hex (\uFFFF)
– shorthand character classes (\w = word, \s = whitespace, \d = digit)
– escaping a literal, e.g. \* or \.
Regular Expression Metasymbols
Regex     Meaning
M+        One or more occurrences of M
M?        Zero or one occurrence of M
M*        Zero or more occurrences of M
[aeiou]   the set of vowels
[0-9]     the set of digits
.         Any single character
( )       Grouping

Regex Examples - 1
Let Σ = { a, b, c }
r = ( a | b )* c
This regex specifies repetition (0, 1, 2, etc. occurrences) of either a or b, followed by c. Strings that match this regular expression include:
c
ac
bc
abc
aabbaabbc
Regex Examples – 2
Let Σ = { a, b, c }
r = ( a | c )* b ( a | c )*
This regular expression specifies repetition of either a or c, followed by b, followed by repetition of either a or c. Strings that match include:
b
ab
bcccc
abcaacc
aab
aacabccca
Regex Examples – 3
• A regular expression to represent a signed integer.
• There is an optional leading sign (+ or -) followed by at least one digit in the range 0 .. 9.
(\+|\-)?[0-9]+
Matches include +1, 0, -0, 827356, -98686, …
Regex Examples - 4
• A regular expression to represent a signed floating point number. There is an optional leading sign ( + or - ) followed by 1 or more digits in the range 0 .. 9 followed by an optional decimal point and then 1 or more digits in the range 0 .. 9. The \ symbol indicates . is the literal period and not the . symbol for “any character.”
1. (\+|\-)?[0-9]+(\.[0-9]+)?
2. [-+]?([0-9]+\.[0-9]+|[0-9]+)
3. [-+]?[0-9]+\.?[0-9]+
This illustrates how complex regexes can be!
Regex Libraries
• Many sources available online • See for example
http://regexlib.com/Default.aspx
Lexical Syntax for a simple C-like language
anyChar      [ -~]       Note: space (0x20) to tilde (0x7E)
Letter       [a-zA-Z]
Digit        [0-9]
Whitespace   [ \t]       Again, note the literal space (0x20)
Eol          \n
Eof          \004
Lexical Syntax for a simple C-like language
Keyword      bool | char | else | false | float | if | int | main | true | while
Identifier   {Letter}({Letter} | {Digit})*
integerLit   {Digit}+
floatLit     {Digit}+\.{Digit}+
charLit      '{anyChar}'
Operator     = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
Separator    ; | , | { | } | ( | )
Comment      // ({anyChar} | {Whitespace})* {Eol}
Some Common Conventions
• When expressing lexical rules for a language:
– An explicit terminator typically is used only for the program as a whole, not for each token.
– An unlabeled arc represents any other valid input symbol.
– Recognition of a token ends in a final state.
– Recognition of a non-token (e.g., whitespace, comment) transitions back to the start state.
• Recognition of the end symbol (end of file) ends in a final state.
• The automaton must be deterministic.
– Drop keywords; handle them separately with a lookup table
– We must consider all sequences with a common prefix together. Examples:
1. floats and ints
2. Comments and division
DFSAs for a small C-like Language
ws = whitespace, l = letter, d = digit, eoln = \n, eof = end of input; all others are literal
Whitespace
// comments
Identifiers
DFSAs for a small C-like language
Ints and floats
Single & double quotes
Assignment & comparison
Addition
Logical and bitwise AND
Translations
• A DFSA that accepts binary strings with an even number of 1 bits
[Diagram: two states A and B; A is both the start state and the accepting state; an input of 1 moves between A and B in either direction, and an input of 0 loops on each state.]
• Right Regular Grammar
A -> 0A | 1B | ε
B -> 0B | 1A
• Regex
0*(10*10*)*
State Diagram Design
– A naive state diagram would have a transition from every state on every character in the source language
– All keywords would be captured in the state diagram
– Such a diagram would be very large!
Lexical Analysis (cont.)
• In many cases, transitions can be combined to simplify the state diagram
– When recognizing an identifier, all uppercase and lowercase letters are equivalent
• Use a character class that includes all letters
– When recognizing an integer literal, all digits are equivalent - use a digit class
Lexical Analysis (cont.)
• Reserved words and identifiers can be recognized together (rather than having a part of the diagram for each reserved word)
– Use a table lookup to determine whether a possible identifier is in fact a reserved word
Lexical Rules
<id> ::= <letter> | <letter> <id2>
<id2> ::= <letter> <id2> | <digit> <id2> | <letter> | <digit>
<int> ::= <digit> | <digit> <int>
<other> ::= + | - | * | / | ( | )
State Diagram
Lexical Analyzer from Text
Implementation: front.c (pp. 176-181)
- Following is the output of the lexical analyzer of front.c when used on (sum + 47) / total

Next token is: 25 Next lexeme is (
Next token is: 11 Next lexeme is sum
Next token is: 21 Next lexeme is +
Next token is: 10 Next lexeme is 47
Next token is: 26 Next lexeme is )
Next token is: 24 Next lexeme is /
Next token is: 11 Next lexeme is total
Next token is: -1 Next lexeme is EOF
Program Structure
• Program is a DFSA with global variables • Utility routines:
– getChar - gets the next character of input, puts it in nextChar, determines its class and puts the class in charClass
– getNonBlank – advances over whitespace to the first char of a token
– addChar - puts the character from nextChar into the place the lexeme is being accumulated, lexeme
– lookup - determines whether the string in lexeme is a reserved word (returns a code)
front.c 1
#include <stdio.h>
#include <ctype.h>

/* global declarations */
/* variables */
int charClass;
char lexeme[100];
char nextChar;
int lexLen;
int nextToken;
FILE *in_fp, *fopen();
/* Function declarations */
void addChar();
void getChar();
void getNonBlank();
int lex();
front.c 2
/* Character classes */
#define LETTER 0
#define DIGIT 1
#define UNKNOWN 99

/* Token codes */
#define INT_LIT 10
#define IDENT 11
#define ASSIGN_OP 20
#define ADD_OP 21
#define SUB_OP 22
#define MULT_OP 23
#define DIV_OP 24
#define LEFT_PAREN 25
#define RIGHT_PAREN 26
front.c 3
/* main driver */
main() {
  /* open the input data file and process its contents */
  if ((in_fp = fopen("front.in", "r")) == NULL)
    printf("ERROR - cannot open front.in \n");
  else {
    getChar();
    do {
      lex();
    } while (nextToken != EOF);
  }
}
front.c 4
/* lookup - a function to look up operators and parentheses and return the token */
int lookup(char ch) {
  switch (ch) {
    case '(':
      addChar();
      nextToken = LEFT_PAREN;
      break;
    case ')':
      addChar();
      nextToken = RIGHT_PAREN;
      break;
    case '+':
      addChar();
      nextToken = ADD_OP;
      break;
    case '-':
      addChar();
      nextToken = SUB_OP;
      break;
    case '*':
      addChar();
      nextToken = MULT_OP;
      break;
    case '/':
      addChar();
      nextToken = DIV_OP;
      break;
    default:
      addChar();
      nextToken = EOF;
      break;
  }
  return nextToken;
}
front.c 5
/* addChar - a function to add the next char to lexeme */
void addChar() {
  if (lexLen <= 98) {
    lexeme[lexLen++] = nextChar;
    lexeme[lexLen] = 0;
  } else {
    printf("Error - lexeme too long \n");
  }
}

/* getChar - a function to get the next char of input and determine its character class */
void getChar() {
  if ((nextChar = getc(in_fp)) != EOF) {
    if (isalpha(nextChar))
      charClass = LETTER;
    else if (isdigit(nextChar))
      charClass = DIGIT;
    else
      charClass = UNKNOWN;
  } else
    charClass = EOF;
}
front.c 6
/* getNonBlank - a function to call getChar until it
   returns a non-whitespace character */
void getNonBlank() {
  while (isspace(nextChar))
    getChar();
}

/* lex - a simple lexical analyzer for arithmetic expressions */
int lex() {
  lexLen = 0;
  getNonBlank();
  switch (charClass) {
    case LETTER:
      /* parse identifiers */
      addChar();
      getChar();
      while (charClass == LETTER || charClass == DIGIT) {
        addChar();
        getChar();
      }
      nextToken = IDENT;
      break;
front.c 7
    case DIGIT:
      /* parse integer literals */
      addChar();
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      nextToken = INT_LIT;
      break;

    case UNKNOWN:
      /* parentheses and operators */
      lookup(nextChar);
      getChar();
      break;

    case EOF:
      /* EOF */
      nextToken = EOF;
      lexeme[0] = 'E';
      lexeme[1] = 'O';
      lexeme[2] = 'F';
      lexeme[3] = 0;
      break;
  } /* end of switch */
  printf("Next token is: %d, next lexeme is %s\n", nextToken, lexeme);
  return nextToken;
} /* end lex */
Example output: (sum + 47) / total

Next token is: 25 lexeme is (
Next token is: 11 lexeme is sum
Next token is: 21 lexeme is +
Next token is: 10 lexeme is 47
Next token is: 26 lexeme is )
Next token is: 24 lexeme is /
Next token is: 11 lexeme is total
Next token is: -1 lexeme is EOF
Syntactic Analysis
• Syntactic analysis or parsing determines whether a program is legal or syntactically correct.
• There are two distinct goals:
1. If a program is syntactically correct, produce a parse tree
2. If not, produce diagnostic messages. Many parsers try to recover and continue analysis as long as possible in order to diagnose as many problems as possible
Two general types of parsers
• Top-down parsers start with the start symbol of the language and build a parse tree in preorder:
– Visit the node
– Visit the left subtree
– Visit the right subtree
• This corresponds to a leftmost derivation
• Example: Given the current string xAy and a rule A → w, rewrite the string as xwy
10/4/2012
11
Bottom-up parsers
• Bottom-up parsers construct a tree starting with the leaves – the reverse order of a rightmost derivation
• In broad terms, the parser finds a substring of the current right sentential form (called a handle) that is the RHS of a rule that produces the previous sentential form
– The sentential form is then reduced to the rule's LHS
– Example: If the current string is xwy and there is a rule A → w, rewrite the string as xAy
Computational Complexity of Parsing
• Parsing CFLs in the general case is inefficient and exponential in the length of the program string
– Each possible rule has to be tried (exhaustive search)
• There are a number of algorithms that can reduce complexity to O(n³)
– Still too complex for commercial compilers
• By reducing the generality of the languages to be parsed, complexity can be reduced to approximately linear, O(n)
Top-Down Parsing
• Given the sentential form xAα where
– x is a string of terminal symbols
– A is the leftmost non-terminal
– α is a string of terminals and non-terminals
• Our goal is to find the next sentential form in a leftmost derivation
– We need to choose a rule where A is the LHS
– Suppose the possibilities are
A => bB    A => cBb    A => a
– We need to choose among
xAα => xbBα    xAα => xcBbα    xAα => xaα
How to choose?
• Examine the next token of input: is it a, b or c?
• This of course is easy, but it may get considerably more complex if the RHSs begin with non-terminals
Recursive Descent Parsing
• An easy and straightforward top-down parsing algorithm (at least for humans to write)
– It only works with a subset of CFGs called LL(k)
• L = Left-to-right parsing
• L = Leftmost derivation
• (k) means at most k tokens of lookahead – usually 1 for an efficient parser
– LR grammars are left-to-right parsing with rightmost derivation
• Handle a wider class of grammars than LL parsers
• Better at error reporting
• Table-driven parser, harder for humans to write than LL
• Easy to generate by machine (e.g., yacc)
Recursive Descent Parsing
• Constructed from a set of mutually recursive routines that mirror the productions of the grammar
– EBNF is well-suited as a model for a recursive descent parser
• Each non-terminal in the grammar has a single routine or function
– Its purpose is to trace the parse tree starting from that symbol
– It is effectively a parser for the language in which that nonterminal is the start symbol
10/4/2012
12
Example
• EBNF
<expr> => <term> {(+ | -) <term>}
<term> => <factor> {(* | /) <factor>}
<factor> => <id> | int_constant | ( <expr> )
• In the following example, remember that the lexer has global variables:
char nextChar;
int lexLen;
int nextToken;
Defines from front.c 2
/* Character classes */
#define LETTER 0
#define DIGIT 1
#define UNKNOWN 99

/* Token codes */
#define INT_LIT 10
#define IDENT 11
#define ASSIGN_OP 20
#define ADD_OP 21
#define SUB_OP 22
#define MULT_OP 23
#define DIV_OP 24
#define LEFT_PAREN 25
#define RIGHT_PAREN 26
Expr
void expr() {
  /* parses <expr> => <term> {(+|-) <term>} */
  printf("enter <expr>\n");
  term();
  while (nextToken == ADD_OP || nextToken == SUB_OP) {
    lex();
    term();
  }
  printf("exit <expr>\n");
}

/* Q: Where does nextToken come from?
   A: Each function leaves the next unconsumed token in nextToken;
      each function assumes on entry that it is available in nextToken */
Term
void term() {
  /* parses <term> => <factor> {(* | /) <factor>} */
  printf("enter <term>\n");
  factor();
  while (nextToken == MULT_OP || nextToken == DIV_OP) {
    lex();
    factor();
  }
  printf("exit <term>\n");
}
<Factor> is a bit more complex…
• Factor has to choose among the several alternative RHSs
<factor> => <id> | int_constant | ( <expr> )
• Also, we may be able to detect a syntax error in this function
– The previous two functions could not
Factor
void factor() {
  /* parses <factor> => <id> | int_constant | ( <expr> ) */
  printf("enter <factor>\n");
  if (nextToken == IDENT || nextToken == INT_LIT)
    lex();
  else {
    if (nextToken == LEFT_PAREN) {
      lex();
      expr();   /* recursion! */
      if (nextToken == RIGHT_PAREN)
        lex();
      else
        error();
    } else
      error();
  }
  printf("exit <factor>\n");
}
Example output (sum + 47) / total
Next token is: 25 lexeme is (
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 11 lexeme is sum
Enter <expr>
Enter <term>
Enter <factor>
Next token is: 21 lexeme is +
Exit <factor>
Exit <term>
Next token is: 10 lexeme is 47
Enter <term>
Enter <factor>
Example output (sum + 47) / total
Next token is: 26 lexeme is )
Exit <factor>
Exit <term>
Exit <expr>
Next token is: 24 lexeme is /
Exit <factor>
Next token is: 11 lexeme is total
Enter <factor>
Next token is: -1 lexeme is EOF
Exit <factor>
Exit <term>
Exit <expr>
Example 2: if statement
<ifstmt> -> if ( <boolexpr> ) <stmt> [else <stmt>]
• The recursive descent subprogram has to
– Check that the current token is IF
– lex() and check that the current token is (
– lex() and call <boolexpr>
– Check that the current token is )
– lex() and call <stmt>
– Check if the current token is ELSE; if so, lex() and call <stmt>
Example 2: if statement
void ifstmt() {
  if (nextToken != IF_CODE)
    error();
  else {
    lex();
    if (nextToken != LEFT_PAREN)
      error();
    else {
      lex();   /* error in text; this was omitted */
      boolexpr();
      if (nextToken != RIGHT_PAREN)
        error();
      else {
        lex();   /* error in text; this was omitted */
        stmt();
Example 2: if statement
        if (nextToken == ELSE_CODE) {
          lex();
          stmt();
        } /* end if (nextToken == ELSE_CODE) */
      } /* end if (nextToken != RIGHT_PAREN) */
    } /* end if (nextToken != LEFT_PAREN) */
  } /* end if (nextToken != IF_CODE) */
} /* end ifstmt */
LL Grammars
• Top-down parsing algorithms are simple and easy to hand-code
– But the class of grammars that can be recognized using top-down parsing is limited to LL(k) (and it is easiest when k = 1: one symbol of lookahead)
• Rule #1: left recursion is prohibited
– Given a rule <A> => <A> + <B> we would obviously have infinite recursion, as A has to start with a recursive call
– Note that this applies only to the FORM of the grammar
– EBNF can be useful for top-down parsing
BNF and EBNF
• BNF
<expr> -> <expr> + <term>
        | <expr> - <term>
        | <term>
<term> -> <term> * <factor>
        | <term> / <factor>
        | <factor>
• EBNF
<expr> -> <term> {(+ | -) <term>}
<term> -> <factor> {(* | /) <factor>}
Eliminating Direct Left Recursion
• Direct left recursion can be removed by rewriting any rule of the form – A => AxB | B | C
• As– A => BA' | C– A' => xBA' |
Left recursion removal
<expr> -> <expr> + <term> | <expr> - <term> | <term>
<term> -> <term> * <factor> | <term> / <factor> | <factor>
<factor> -> <id> | ( <expr> )

<expr> -> <term> <expr'>
<expr'> -> + <term> <expr'> | - <term> <expr'> | ε
<term> -> <factor> <term'>
<term'> -> * <factor> <term'> | / <factor> <term'> | ε
<factor> -> <id> | ( <expr> )
Indirect Left Recursion
• Indirect left recursion also presents a problem:
A => B x A
B => A B
• It is possible to remove indirect left recursion but this is beyond our scope
Rule #2
• In order to use one-symbol lookahead, the alternative RHSs of any production must be distinguishable by examining only one token
– The text refers to this as the "pairwise disjointness" rule
– For any RHS α of a nonterminal A -> α, we can compute a set called FIRST(α), which contains the terminals that can appear first in a string derived from α
– So for A -> α | β, we want the intersection of FIRST(α) and FIRST(β) to be empty
Example
• Consider
– A => aB | bAb | Bb
– B => cB | d
– FIRST(aB) = {a}; FIRST(bAb) = {b}; FIRST(Bb) = {c, d}
• Consider
– A => aB | BAb
– B => aB | b
– FIRST(aB) = {a}; FIRST(BAb) = {a, b}
• When parsing A we can't determine which production to apply by looking at the next terminal
Left factoring
• Rewriting the grammar can solve many lookahead problems
• Consider subscript expressions
<var> => <ident> | <ident> [ <expr> ]
• Rewrite as
<var> => <ident> <subscriptExpr>
<subscriptExpr> => [ <expr> ] | ε
• Which is identical to the EBNF
<var> => <ident> [ [ <expr> ] ]
(the outer brackets are the EBNF "optional" metasymbols; the inner brackets are literal)
Quiz Answers
• Draw a DFSA that recognizes binary strings that start with 1 and end with 0
• Below is a BNF grammar for fractional numbers. Rewrite as EBNF
S -> -FN | FN
FN -> DL | DL.DL
DL -> D | D DL
D -> 0|1|2|3|4|5|6|7|8|9
S -> [-]FN
FN -> DL[.DL]
DL -> D{D}
[Diagram: start state S; a 1 leads to a state that loops on 1; from there a 0 leads to an accepting state that loops on 0 and returns on 1; a string beginning with 0 is rejected.]
DFSA for q2
• Draw a DFSA that recognizes binary strings with at least three consecutive 1’s
[Diagram: start state S; states count consecutive 1's (S → 1 → 11 → 111); a 0 returns to S from the counting states; the 111 state is accepting and loops on both 1 and 0.]
Quiz 4
• For the language of binary strings that contain at least 3 consecutive 1’s write:
1. A regular grammar
2. A regular expression