Chapter 2 Programming Language Syntax and Language Implementation
Programming Languages and Paradigms
J. Fenwick, B. Kurtz, C. Norris
(to be published in 2012)
Introduction
• In this chapter you will learn about
– A small programming language named Wren
– How to generate Wren Intermediate Code by hand
– Specification of language syntax using BNF
– Lexical analysis and parsing techniques
– Compiling and code generation
– Interpretation of intermediate code
• We start off with a case study that introduces the programming language Wren and Wren Intermediate Code, a stack-based assembly-like language
Finding the GCD
• We will use three simple arithmetic programs
– Finding the greatest common divisor
– Find the product and quotient/remainder for integers
• Euclid’s Algorithm
while m != n
if m < n then
n := n - m
else
m := m - n
return m
m n
84 35
49 35
14 35
14 21
14 7
7 7
GCD = 7
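The trace above can be reproduced with a short Python sketch of the subtraction form of Euclid's algorithm. The function name gcd_by_subtraction and the recorded trace list are illustrative choices, not part of the slides, and the sketch assumes both inputs are positive integers.

```python
def gcd_by_subtraction(m, n):
    """Return (gcd, trace) for positive integers m and n.

    The trace records the (m, n) pair before the loop and after each pass,
    which mirrors the two-column table on the slide.
    """
    trace = [(m, n)]
    while m != n:
        if m < n:
            n = n - m
        else:
            m = m - n
        trace.append((m, n))
    return m, trace


result, steps = gcd_by_subtraction(84, 35)
for m, n in steps:
    print(m, n)
print("GCD =", result)
```

Running it with 84 and 35 walks through the same six rows as the table and ends with 7.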
A gcd program in Wren
program gcd is
var m,n : integer;
begin
read m; read n;
while m <> n do
if m < n then
n := n - m
else
m := m - n
end if
end while;
write m
end
Declarations appear at the top of the program
Read is used for console input
Describe some of the syntactic differences with C-based languages
Write is used for console output
Alternative Algorithms for GCD
• Consider the given algorithm
– Why would this algorithm be slow for certain input values? (give a specific example)
– How could the algorithm be made more efficient? Be specific
• Writing a gcd algorithm in Java
– By using recursion it is possible to write a two line method in Java to find the gcd
– Try to write that method now
A Second Wren Program
program product is
var a,b,p : integer;
begin
read a; read b; p := 0;
while b > 0 do
if (b - (b/2) * 2) > 0 then
p := p + a
end if;
a := a * 2;
b := b / 2
end while;
write p
end
Express this test condition logically
a b p
5 13 5
10 6 5
20 3 25
40 1 65
80 0 65
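As a cross-check of the trace, here is a Python sketch of the same loop. It assumes that Wren's / on integers is integer division (which the trace row 13 to 6 suggests) and that b is non-negative; the function name and the rows list are illustrative only.

```python
def product(a, b):
    """Multiply a by a non-negative b using doubling and halving."""
    p = 0
    rows = []
    while b > 0:
        if b - (b // 2) * 2 > 0:   # same test as the Wren program: is b odd?
            p = p + a
        rows.append((a, b, p))     # state after the if, before a and b change
        a = a * 2
        b = b // 2
    rows.append((a, b, p))         # final state when the loop exits
    return p, rows


answer, rows = product(5, 13)
for row in rows:
    print(*row)
print("product =", answer)
```

Each row is recorded after the if command and before a and b are updated, which is the convention the table on the slide appears to use.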
A Third Wren Program
program quotient is
var x,y,r,q,w : integer;
begin
read x;
read y;
r:=x;
q:=0;
w:=y;
while w <= r do
w := 2 * w
end while;
while w > y do
q := q * 2;
w := w / 2;
if w <= r then
r := r - w;
q := q + 1
end if
end while;
write q;
write r
end
w q r
8 0 42
16 0 42
32 0 42
64 0 42
32 1 10
16 2 10
8 5 2
X = 42 Y = 8
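The quotient program can be checked the same way. The following Python sketch mirrors the two loops and records (w, q, r) after each step; it assumes x is non-negative and y is positive (with y = 0 the first loop would not terminate), and the names are illustrative.

```python
def quotient(x, y):
    """Return (q, r, rows) with x == q * y + r, using shift-and-subtract division."""
    r, q, w = x, 0, y
    rows = [(w, q, r)]
    while w <= r:              # double w until it exceeds the dividend
        w = 2 * w
        rows.append((w, q, r))
    while w > y:               # halve w back down, building q one bit at a time
        q = q * 2
        w = w // 2
        if w <= r:
            r = r - w
            q = q + 1
        rows.append((w, q, r))
    return q, r, rows


q, r, rows = quotient(42, 8)
for row in rows:
    print(*row)
print("quotient =", q, "remainder =", r)
```

For x = 42 and y = 8 the recorded rows match the w, q, r table, ending with quotient 5 and remainder 2.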
Wren Intermediate Code
get A
get B
push A
push A
mul
push B
push B
mul
sub
pop Result
put Result
halt
The above program produces the following interaction:
enter a > 5
enter b > 3
Result = 16
program halted
Stack after each instruction (top of stack on right):
push A: 5
push A: 5 5
mul: 25
push B: 25 3
push B: 25 3 3
mul: 25 9
sub: 16
pop Result: empty
If Command
• The label structure for the two alternative if command is:
<code for Boolean expression and test>
jf L1 % jump to else when false
<code for the true alternative of if command>
j L2 % jump unconditionally to L2
L1 label % the else code is next
<code for the false alternative of if command>
L2 label % end if is here
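Wren Intermediate Code
get A
get B
push A
push B
sub
tstlt
jf L1
push B
pop MAX
j L2
L1 label
push A
pop MAX
L2 label
put MAX
halt
The above program produces the following interaction:
enter a > 5
enter b > 3
Max = 5
program halted
Symbol table after execution:
A 5
B 3
MAX 5
Stack after each instruction (top of stack on right):
initially: empty
push A: 5
push B: 5 3
sub: 2
tstlt: 0
jf L1: empty
push A: 5
pop MAX: empty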
Test instructions compare the top of stack with zero
While Command
L1 label % top of the while loop
<code for Boolean expression and test>
jf L2 % jump out of loop if false
<code for the body of the while command>
j L1 % jump unconditionally to top of loop
L2 label % end while is here
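Wren Intermediate Code
get num
push 0
pop count
L1 label
push num
push 0
sub
tstgt
jf L2
push num
push 2
div
pop num
push count
push 1
add
pop count
j L1
L2 label
put count
halt
The above program produces the following interaction:
enter num > 20
count = 5
program halted
Partial stack trace (top of stack on right):
push 0: 0
pop count: empty
push num: 20
push 0: 20 0
sub: 20
tstgt: 1
jf L2: empty
push num: 20
push 2: 20 2
div: 10
pop num: empty
push count: 0
push 1: 0 1
add: 1
pop count: empty
push num: 10
push 0: 10 0
sub: 10
tstgt: 1
jf L2: empty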
As a homework exercise complete the stack trace for this program
Code generation for the gcd program - 1
Wren source                 WIC
begin
  read m;                   get m
  read n;                   get n
                            <<code for while command>>
  while m <> n do
    if m < n then
      n := n - m
    else
      m := m - n
    end if
  end while;
  write m                   put m
end                         halt
Code generation for the gcd program - 2
Wren source                 WIC
begin
  read m;                   get m
  read n;                   get n
                            L1 label
  while m <> n do           push m
                            push n
                            sub
                            tstne
                            jf L2
    if m < n then           <<code for if command>>
      n := n - m
    else
      m := m - n
    end if
  end while;                j L1
                            L2 label
  write m                   put m
end                         halt
Code generation for the gcd program - 3
Wren source                 WIC
begin
  read m;                   get m
  read n;                   get n
                            L1 label
  while m <> n do           push m
                            push n
                            sub
                            tstne
                            jf L2
    if m < n then           push m
                            push n
                            sub
                            tstlt
                            jf L3
      n := n - m            <<code for assignment>>
                            j L4
    else                    L3 label
      m := m - n            <<code for assignment>>
    end if                  L4 label
  end while;                j L1
                            L2 label
  write m                   put m
end                         halt
Notice that label numbers must be unique throughout the program
Code generation for the gcd program - 4
Wren source                 WIC
begin
  read m;                   get m
  read n;                   get n
                            L1 label
  while m <> n do           push m
                            push n
                            sub
                            tstne
                            jf L2
    if m < n then           push m
                            push n
                            sub
                            tstlt
                            jf L3
      n := n - m            push n
                            push m
                            sub
                            pop n
                            j L4
    else                    L3 label
      m := m - n            push m
                            push n
                            sub
                            pop m
    end if                  L4 label
  end while;                j L1
                            L2 label
  write m                   put m
end                         halt
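The four hand-compilation steps follow a mechanical pattern, which the Python sketch below tries to capture. It is only an illustration under stated assumptions: the nested-tuple program representation, all function names, and the label generator are hypothetical; the mnemonics tsteq, tstle and tstge are inferred from the comparison choices listed later in the slides (only tstne, tstlt and tstgt actually appear); and write is restricted to a single variable.

```python
import itertools

ARITH = {"+": "add", "-": "sub", "*": "mul", "/": "div"}
# tstne, tstlt and tstgt appear in the slides; the other three names are assumed.
TESTS = {"=": "tsteq", "<>": "tstne", "<": "tstlt",
         "<=": "tstle", ">": "tstgt", ">=": "tstge"}


def gen_expr(e):
    """An expression is a name, an integer, or a tuple (op, left, right)."""
    if isinstance(e, tuple):
        op, left, right = e
        return gen_expr(left) + gen_expr(right) + [ARITH[op]]
    return [f"push {e}"]


def gen_test(b):
    """A comparison (rel, left, right) becomes left - right tested against zero."""
    rel, left, right = b
    return gen_expr(left) + gen_expr(right) + ["sub", TESTS[rel]]


def gen_seq(cmds, labels):
    code = []
    for c in cmds:
        code += gen_cmd(c, labels)
    return code


def gen_cmd(c, labels):
    kind = c[0]
    if kind == "read":
        return [f"get {c[1]}"]
    if kind == "write":            # simplification: write takes a variable name
        return [f"put {c[1]}"]
    if kind == "assign":
        return gen_expr(c[2]) + [f"pop {c[1]}"]
    if kind == "while":
        top, done = next(labels), next(labels)
        return ([f"{top} label"] + gen_test(c[1]) + [f"jf {done}"]
                + gen_seq(c[2], labels) + [f"j {top}", f"{done} label"])
    if kind == "if":               # two-alternative if: (if, test, then, else)
        other, done = next(labels), next(labels)
        return (gen_test(c[1]) + [f"jf {other}"]
                + gen_seq(c[2], labels) + [f"j {done}", f"{other} label"]
                + gen_seq(c[3], labels) + [f"{done} label"])
    raise ValueError(f"unknown command: {kind}")


def compile_program(cmds):
    labels = (f"L{i}" for i in itertools.count(1))   # unique across the program
    return gen_seq(cmds, labels) + ["halt"]


GCD = [
    ("read", "m"), ("read", "n"),
    ("while", ("<>", "m", "n"), [
        ("if", ("<", "m", "n"),
         [("assign", "n", ("-", "n", "m"))],
         [("assign", "m", ("-", "m", "n"))]),
    ]),
    ("write", "m"),
]
print("; ".join(compile_program(GCD)))
```

Because every while and if draws its two labels from one shared generator, label numbers stay unique throughout the program, and for the gcd program the output matches the hand-compiled listing instruction for instruction.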
Homework exercise: Hand compile the product program into WIC
U.S. telephone numbers
G2.2.A = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, (, -, ) },
N = {<PhoneNumber>, <CountryCode>, <AreaCode>, <Prefix>, <Extension>, <Digit>},
P = {
<PhoneNumber> ::= <CountryCode> <AreaCode> <Prefix> - <Extension>
<PhoneNumber> ::= <AreaCode> <Prefix> - <Extension>
<CountryCode> ::= <Digit> <Digit>
<AreaCode> ::= ( <Digit> <Digit> <Digit> )
<Prefix> ::= <Digit> <Digit> <Digit>
<Extension> ::= <Digit> <Digit> <Digit> <Digit>
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
}, and
S = <PhoneNumber>.
Leftmost Derivation
Step 1: <PhoneNumber>
Step 2: <AreaCode> <Prefix> - <Extension>
Step 3: ( <Digit> <Digit> <Digit> ) <Prefix> - <Extension>
Step 4: ( 8 <Digit> <Digit> ) <Prefix> - <Extension>
Step 5: ( 8 0 <Digit> ) <Prefix> - <Extension>
Step 6: ( 8 0 0 ) <Prefix> - <Extension>
Step 7: ( 8 0 0 ) <Digit> <Digit> <Digit> - <Extension>
Step 8: ( 8 0 0 ) 5 <Digit> <Digit> - <Extension>
Step 9: ( 8 0 0 ) 5 5 <Digit> - <Extension>
Step 10: ( 8 0 0 ) 5 5 5 - <Extension>
Step 11: ( 8 0 0 ) 5 5 5 - <Digit> <Digit> <Digit> <Digit>
Step 12: ( 8 0 0 ) 5 5 5 - 1 <Digit> <Digit> <Digit>
Step 13: ( 8 0 0 ) 5 5 5 - 1 2 <Digit> <Digit>
Step 14: ( 8 0 0 ) 5 5 5 - 1 2 1 <Digit>
Step 15: ( 8 0 0 ) 5 5 5 - 1 2 1 2
How can you tell that this is the proper choice?
Parse Tree
• A rightmost derivation would produce the same parse tree
Step 1: <PhoneNumber>
Step 2: <AreaCode> <Prefix> - <Extension>
Step 3: <AreaCode> <Prefix> - <Digit> <Digit> <Digit> <Digit>
And so forth
A Grammar for Expressions – version 1
G2.2.x = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, / },
N = {<Expression>, <Operator>, <Number>, <Digit>},
P = {
<Expression> ::= <Number>
<Expression> ::= <Expression> <Operator> <Expression>
<Operator> ::= + | - | * | /
<Number> ::= <Digit> | <Digit> <Number>
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
}, and
S = <Expression>.
Do you spot a potential problem with this grammar?
Grammar Ambiguity
Review of Operator Precedence
• Multiplicative operations have higher precedence than additive operations
• Where in the parse tree would multiplicative operations appear relative to additive operations?
• How could the grammar be modified to make this happen?
A Grammar for Expressions – version 2
G2.2.F = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, / },
N = {<Expression>, <Term>, <Number>, <Digit>},
P = {
<Expression> ::= <Term>
<Expression> ::= <Term> + <Expression>
<Expression> ::= <Term> - <Expression>
<Term> ::= <Number>
<Term> ::= <Number> * <Term>
<Term> ::= <Number> / <Term>
<Number> ::= <Digit> | <Digit> <Number>
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
}, and
S = <Expression>.
A Remaining Problem
• How are operators of the same precedence processed?
– Consider 34 - 12 + 16
– What would be the result if minus is done first?
– What would be the result if plus is done first?
– Which choice is correct?
• What associativity does version 2 of the grammar have?
• How can this be corrected?
A Grammar for Expressions – version 3
G2.2.G = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, *, / },
N = {<Expression>, <Term>, <Number>, <Digit>},
P = {
<Expression> ::= <Term>
<Expression> ::= <Expression> + <Term>
<Expression> ::= <Expression> - <Term>
<Term> ::= <Number>
<Term> ::= <Term> * <Number>
<Term> ::= <Term> / <Number>
<Number> ::= <Digit> | <Digit> <Number>
<Digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
}, and
S = <Expression>.
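To see the effect of version 3 on evaluation order, here is a small Python evaluator sketch. It is not a direct transcription of the grammar: a recursive descent parser cannot call itself on a left-recursive rule, so each left-recursive production is replaced by a loop that folds operands from left to right, which gives the same left-associative grouping. The function names are illustrative, / is assumed to be integer division, and the tokenizer silently skips characters it does not recognize.

```python
import re


def evaluate(text):
    """Evaluate +, -, *, / over unsigned integers with the grouping of grammar version 3."""
    tokens = re.findall(r"\d+|[-+*/]", text)
    pos = 0

    def number():
        nonlocal pos
        value = int(tokens[pos])
        pos += 1
        return value

    def term():
        # <Term> ::= <Number> | <Term> * <Number> | <Term> / <Number>
        nonlocal pos
        value = number()
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            pos += 1
            right = number()
            value = value * right if op == "*" else value // right
        return value

    def expression():
        # <Expression> ::= <Term> | <Expression> + <Term> | <Expression> - <Term>
        nonlocal pos
        value = term()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            right = term()
            value = value + right if op == "+" else value - right
        return value

    result = expression()
    if pos != len(tokens):
        raise SyntaxError("unexpected token: " + tokens[pos])
    return result


print(evaluate("34 - 12 + 16"))   # prints 38
```

With this grouping 34 - 12 + 16 is evaluated as (34 - 12) + 16; a right-associative grammar such as version 2 would instead group it as 34 - (12 + 16) and give 6.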
Homework Exercise
Draw the complete parse tree for the expression 2 - 3 + 4 * 5
Use <E>, <T>, <N> and <D> for the internal nodes
A Homework Exercise
• Suppose exponentiation (written with a ^) is added to a programming language
– What is the precedence of exponentiation?
– Consider the expression 2 * 3 ^ 2
– How would the BNF change to give the correct result?
– What is the associativity for exponentiation?
– Consider the expression 4 ^ 3 ^ 2
– How would the BNF change to give the correct result?
• Your answer to this homework exercise will be version 4 of the grammar for expressions that handles exponentiation correctly
Syntax Diagrams
An alternative to BNF. Sometimes used in CS1/CS2 level classes to express the syntax of a programming language.
Extended BNF (EBNF)
• EBNF uses special meta-symbols
– “*” to indicate 0 or more occurrences of a symbol
– “+” indicates 1 or more occurrences
– This replaces recursion with repetition
• Some Examples
– BNF notation: <Number> ::= <Digit> | <Digit> <Number>
– EBNF notation: <Number> ::= <Digit>+
– BNF notation: <AlphanumSeq> ::= empty | <Alphanum> <AlphanumSeq>
<Identifier> ::= <Alpha> <AlphanumSeq>
– EBNF notation: <Identifier> ::= <Alpha> <Alphanum>*
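The EBNF repetition operators correspond closely to the + and * of regular expressions, which the Python sketch below uses to check the two example rules. It assumes <Alpha> means an ASCII letter and <Alphanum> an ASCII letter or digit, which the slides do not spell out; the helper names are illustrative.

```python
import re

# <Number> ::= <Digit>+
NUMBER = re.compile(r"[0-9]+")
# <Identifier> ::= <Alpha> <Alphanum>*   (assuming ASCII letters and digits)
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_number(text):
    return NUMBER.fullmatch(text) is not None


def is_identifier(text):
    return IDENTIFIER.fullmatch(text) is not None


for sample in ["123", "count", "x1", "1x", ""]:
    print(repr(sample), is_number(sample), is_identifier(sample))
```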
BNF for Wren - 1
BNF for Wren - 2
Introduction to Lexical Analysis
• A lexical analyzer, also known as a scanner, processes the source code one character at a time
– The goal is to produce a sequence of tokens that can be passed to a parser in order to build an abstract syntax tree
– A single character, such as ‘;’, may produce a token named semicolon
– An identifier will produce an identifier token, such as ide(count)
– A number will produce a number token, such as num(123)
– An FSA (finite state automaton) is a convenient way to represent the scanning process
– Here is an FSA for an identifier: it starts with an alphabetic character, followed by any sequence of alphanumeric characters
– What would the FSA for a number look like?
Handling of Reserved Words
• Every programming language contains reserved words
– Normally a declared identifier name cannot be a reserved word
– Consider the reserved word int: it must be distinguished from variable names that are a proper prefix of it, such as i, and from variable names of which int is a proper prefix, such as integral
– The following FSA handles this situation
• The scanner as a large FSA
– The production of tokens must be unambiguous
– Some character sequences, such as 123a, should produce errors since they are neither an identifier nor a number
The Output of a Scanner for Wren
Enter name of source file: gcd.wren
program gcd is
var m,n : integer;
begin
read m; read n;
while m <> n do
if m < n then n := n - m
else m := m - n
end if
end while;
write m
end
Scan successful
[program,ide(gcd),is,var,ide(m),comma,ide(n),colon,integer,
semicolon,begin,read,ide(m),semicolon,read,ide(n),semicolon,
while,ide(m),neq,ide(n),do,if,ide(m),less,ide(n),then,ide(n),
assign,ide(n),minus,ide(m),else,ide(m),assign,ide(m),minus,
ide(n),end,if,end,while,semicolon,write,ide(m),end,eop]
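A scanner that produces this kind of token list can be sketched in Python as follows. This is a regular-expression approximation, not the FSA from the slides: the reserved-word set and the symbol names cover only what appears in the gcd program and its token output, and the function name scan is illustrative. Reserved words are recognized by scanning a whole identifier first and then looking it up, which handles the prefix cases (i, int, integral) discussed earlier.

```python
import re

# Only the reserved words and symbols that occur in the gcd example are listed.
RESERVED = {"program", "is", "var", "integer", "begin", "end", "read", "write",
            "while", "do", "if", "then", "else"}
SYMBOLS = {",": "comma", ":": "colon", ";": "semicolon", ":=": "assign",
           "<>": "neq", "<": "less", "-": "minus"}
# Longer symbols come first so that := and <> are not split into two tokens.
TOKEN = re.compile(r"(?P<ide>[A-Za-z][A-Za-z0-9]*)|(?P<num>[0-9]+)|(?P<sym>:=|<>|[,:;<-])")


def scan(source):
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"illegal character {source[pos]!r} at position {pos}")
        text = match.group()
        if match.lastgroup == "ide":
            tokens.append(text if text in RESERVED else f"ide({text})")
        elif match.lastgroup == "num":
            # A sequence such as 123a is neither an identifier nor a number.
            if match.end() < len(source) and source[match.end()].isalpha():
                raise ValueError(f"malformed number at position {pos}")
            tokens.append(f"num({text})")
        else:
            tokens.append(SYMBOLS[text])
        pos = match.end()
    tokens.append("eop")
    return tokens


print(scan("while m <> n do n := n - m end while; write m"))
```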
Recall our phone number example
G2.2.A = (T, N, P, S) where
T = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, (, -, ) },
N = {<PN>, <CC>, <AC>, <P>, <E>, <D>},
P = {
1: <PN> ::= <CC> <AC> <P> - <E>
2: <PN> ::= <AC> <P> - <E>
3: <CC> ::= <D> <D>
4: <AC> ::= ( <D> <D> <D> )
5: <P> ::= <D> <D> <D>
6: <E> ::= <D> <D> <D> <D>
7-16: <D> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
}, and
S = <PN>.
Here is the Parsing Stack
Parsing stack       Input stream   Commentary
(top on left)
<PN>                (800)555-1212  Initial state
<AC><P>-<E>         (800)555-1212  Replaced <PN>: two possible productions, but the next input character is ( so choose production #2
(<D><D><D>)<P>-<E>  (800)555-1212  Replaced <AC> by production #4
<D><D><D>)<P>-<E>   800)555-1212   Top of stack is a terminal that matches the next input character: pop stack, read character
8<D><D>)<P>-<E>     800)555-1212   Replaced <D>: ten possibilities, but the input character suggests using <D> ::= 8
<D><D>)<P>-<E>      00)555-1212    Matched top of stack with input: pop and read
0<D>)<P>-<E>        00)555-1212    Replaced <D>
<D>)<P>-<E>         0)555-1212     Stack matches input: pop, read
0)<P>-<E>           0)555-1212     Replaced <D>
)<P>-<E>            )555-1212      Stack matches input: pop, read
<P>-<E>             555-1212       Stack matches input: pop, read
<D><D><D>-<E>       555-1212       Replaced <P>
5<D><D>-<E>         555-1212       Replaced <D>
<D><D>-<E>          55-1212        Stack matches input: pop, read
5<D>-<E>            55-1212        Replaced <D>
<D>-<E>             5-1212         Stack matches input: pop, read
5-<E>               5-1212         Replaced <D>
-<E>                -1212          Stack matches input: pop, read
<E>                 1212           Stack matches input: pop, read
<D><D><D><D>        1212           Replaced <E>
1<D><D><D>          1212           Replaced <D>
<D><D><D>           212            Stack matches input: pop, read
2<D><D>             212            Replaced <D>
<D><D>              12             Stack matches input: pop, read
1<D>                12             Replaced <D>
<D>                 2              Stack matches input: pop, read
2                   2              Replaced <D>
Empty-stack         End-of-file    Stack matches input: pop, read
Parse successful! Stack is empty and input consumed.
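The table-driven walk above can be sketched in Python with an explicit stack. This is a minimal illustration for grammar G2.2.A only: the dictionary layout and the function name parse_phone are hypothetical, and the choice between the two <PN> productions is hard-coded on the next input character, as in the commentary column.

```python
DIGITS = "0123456789"

# Nonterminals that have exactly one production.
SINGLE = {
    "<CC>": ["<D>", "<D>"],
    "<AC>": ["(", "<D>", "<D>", "<D>", ")"],
    "<P>": ["<D>", "<D>", "<D>"],
    "<E>": ["<D>", "<D>", "<D>", "<D>"],
}


def parse_phone(text):
    """Return True if text is derivable from <PN>; index 0 of stack is the top."""
    stack = ["<PN>"]
    pos = 0
    while stack:
        top = stack.pop(0)
        look = text[pos] if pos < len(text) else None
        if top == "<PN>":
            if look == "(":                      # production 2
                stack = ["<AC>", "<P>", "-", "<E>"] + stack
            elif look is not None and look in DIGITS:   # production 1
                stack = ["<CC>", "<AC>", "<P>", "-", "<E>"] + stack
            else:
                return False                     # nonterminal cannot be replaced
        elif top == "<D>":
            if look is None or look not in DIGITS:
                return False
            stack.insert(0, look)                # use <D> ::= look
        elif top in SINGLE:
            stack = SINGLE[top] + stack
        else:                                    # terminal on top of the stack
            if top != look:
                return False
            pos += 1
    return pos == len(text)                      # input must be fully consumed


print(parse_phone("(800)555-1212"))   # prints True
```

The three return False paths and the final length check correspond to the ways a parse can fail: a nonterminal that cannot be replaced, a terminal that does not match the input, and leftover input once the stack is empty.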
Unsuccessful Parses
• The parse on the previous slide was successful
• Not all parses will be successful; a parse fails if
– A non-terminal on the stack cannot be replaced
– A terminal on the stack is not matched by the next terminal in the input
– The stack becomes empty but the input has not been completely consumed
More on Parsing Techniques
• The textbook covers the following topics
– Use of FIRST sets for each nonterminal
– Left factoring
– Elimination of left recursion
– Recursive descent parsing
– Bottom up parsing
• These topics are important if you study compilers but we will not cover them in lecture at this time
• We will cover elimination of left recursion and recursive descent parsing when we build a scanner and parser in Prolog
Abstract Syntax Tree
A Simplified Tree
• The Wren statement is
if a <> 0 then x := x / a end if
• The simplified tree, shown below, contains all the necessary information for code generation
Abstract Production Rules for Wren
• Notice the use of EBNF notation such as * and +
Output of a Wren Parser
Parse successful
prog([dec(integer,[m,n])],[read(m),read(n),while(bexp(neq,
ide(m),ide(n)),[if(bexp(less,ide(m),ide(n)),
[assign(n,exp(minus,ide(n),ide(m)))],
[assign(m,exp(minus,ide(m),ide(n)))])]),write(ide(m))])
Parse tree (children indented under their parent):
prog
  decSeq
    dec
      integer
      varList
        m
        n
  cmdSeq
    read
      m
    read
      n
    while
      <finish>
    write
      m
Complete the parse tree for the while loop as a homework problem
Three Address Intermediate Code
• Given the Wren command
if a <> 0 then x := x / a end if
• This can be represented by the following three-address code:
(<>, t1, a, 0), // perform the test
(jump-false, t1, Label1, nil), // if the test failed (t1 is false), jump over the then block
(/, x, x, a), // here is the then block
(label, Label1, nil, nil) // end of the if-then
Stack-based Intermediate Code
• Used to build virtual machines, such as the JVM for Java bytecode or the Common Intermediate Language (CIL) in the .NET environment
• Advantages
– Compact object code
– Simple compilers and interpreters
– Minimal processor state
• Disadvantages
– More memory references
– More instructions, slower interpretation
– Does not take direct advantage of registers
Hand compiled code vs. machine compiled
Hand compiled:
get m; get n; L1 label; push m; push n; sub; tstne; jf L2;
push m; push n; sub; tstlt; jf L3; push n; push m; sub; pop n;
j L4; L3 label; push m; push n; sub; pop m; L4 label; j L1;
L2 label; put m; halt;
Machine compiled:
[[get,m],[get,n],[L1,label],[push,m],[pop,T1],[push,n],[pop,T2],
[push,T1],[push,T2],sub,tstne,[jf,L2],[push,m],[pop,T1],
[push,n],[pop,T2],[push,T1],[push,T2],sub,tstlt,[jf,L3],
[push,n],[pop,T1],[push,m],[pop,T2],[push,T1],[push,T2],sub,
[pop,n],[j,L4],[L3,label],[push,m],[pop,T1],[push,n],[pop,T2],
[push,T1],[push,T2],sub,[pop,m],[L4,label],[j,L1],[L2,label],
[push,m],[pop,T1],[put,T1],halt]
• Why is the machine compiled code so much longer?
Code Generation for if Command
Homework Exercise
• A single alternative if command contains a true task but no false task
• The BNF is if <boolean expression> then <command sequence> end if
• Draw the flow diagram and then propose the code generation scheme for the single alternative if
Code generation for a while command
Context Sensitivity
• BNF is NOT context sensitive
– Each production rule has a single nonterminal on the left-hand side
– This means it cannot be context sensitive
• Name several instances where context sensitivity is essential for a programming language
• Context sensitive grammars are possible but they are overly complex
• Attribute grammars can solve these problems and retain the simplicity of a BNF grammar
What are attribute grammars?
• They can transport information anywhere within an abstract syntax tree
– Inherited attributes pass information to nodes at lower levels in the tree
– Synthesized attributes pass information from child nodes back up to parent nodes
• Use of attribute grammars in Wren
– Ensure label numbers are unique for the entire program; uses inherited and synthesized attributes
– Ensure temporary variable numbers are unique within a single expression; uses inherited attributes
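One way to picture the two attribute directions is a function that receives the next free label number from its context (the inherited value) and returns the updated number to its caller (the synthesized value). The Python sketch below is an analogy under that assumption, not the attribute grammar formalism itself; the tuple-based command representation and the name assign_labels are hypothetical.

```python
def assign_labels(commands, next_label):
    """Attach two labels to every while/if command.

    next_label plays the role of an inherited attribute (passed down), and the
    returned number plays the role of a synthesized attribute (passed back up).
    Commands are tuples: ("while", body), ("if", then_body, else_body), or
    any other tuple, which is left unchanged.
    """
    labelled = []
    for cmd in commands:
        kind = cmd[0]
        if kind in ("while", "if"):
            first, second = f"L{next_label}", f"L{next_label + 1}"
            next_label += 2
            bodies = []
            for body in cmd[1:]:
                # Pass the current number down, receive the updated number back.
                body_labelled, next_label = assign_labels(body, next_label)
                bodies.append(body_labelled)
            labelled.append((kind, first, second, *bodies))
        else:
            labelled.append(cmd)
    return labelled, next_label


gcd_shape = [("read",), ("read",),
             ("while", [("if", [("assign",)], [("assign",)])]),
             ("write",)]
tree, next_free = assign_labels(gcd_shape, 1)
print(tree)
print("next free label number:", next_free)
```

For the shape of the gcd program the while command receives L1 and L2 and the nested if receives L3 and L4, the same numbering used in the hand-compiled code.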
Using Attributes for Label Numbers
Executing a WIC program
(0)get m (1)get n (2)L1 label (3)push m (4)pop T1 (5)push n
(6)pop T2 (7)push T1 (8)push T2 (9)sub (10)tstne (11)jf L2
(12)push m (13)pop T1 (14)push n (15)pop T2 (16)push T1
(17)push T2 (18)sub (19)tstlt (20)jf L3 (21)push n (22)pop T1
(23)push m (24)pop T2 (25)push T1 (26)push T2 (27)sub (28)pop n
(29)j L4 (30)L3 label (31)push m (32)pop T1 (33)push n
(34)pop T2 (35)push T1 (36)push T2 (37)sub (38)pop m
(39)L4 label (40)j L1 (41)L2 label (42)push m (43)pop T1
(44)put T1 (45)halt
• Instructions are numbered for reference purposes
• There are two stages for executing this program
– Read in the WIC code and perform all necessary preprocessing steps
– Execute the program until a halt instruction is encountered
Preprocessing
Symbol table:
m 0
n 0
T1 0
T2 0
Jump table:
L1 2
L2 41
L3 30
L4 39
• Reading WIC from a file
– Put the instructions into an indexed list of instructions
– Put the label instruction in <operator, operand> format
• Building a Symbol Table
– Create a table for all variables and temps
– Initialize all values to zero
• Building a Jump Table
– Every label has a unique name
– The jump table stores the location of the label
• Creating the Runtime Stack and the Program Counter
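The preprocessing steps can be sketched in Python as a single pass over the instruction lines. The function name preprocess and the (operator, operand) tuple layout are illustrative assumptions, and the sketch takes a list of strings rather than reading a file.

```python
def preprocess(lines):
    """Return (instructions, symbol_table, jump_table) for a WIC program."""
    instructions, symbol_table, jump_table = [], {}, {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        # "L1 label" is rewritten into <operator, operand> order: ("label", "L1").
        if len(parts) == 2 and parts[1] == "label":
            parts = ["label", parts[0]]
        op = parts[0]
        arg = parts[1] if len(parts) > 1 else None
        if op == "label":
            jump_table[arg] = len(instructions)   # index of the label instruction
        elif op in ("get", "put", "pop"):
            symbol_table[arg] = 0                 # variables and temps start at zero
        elif op == "push" and not arg.lstrip("-").isdigit():
            symbol_table[arg] = 0                 # push of a variable, not a literal
        instructions.append((op, arg))
    return instructions, symbol_table, jump_table


COUNT = ["get num", "push 0", "pop count", "L1 label", "push num", "push 0", "sub",
         "tstgt", "jf L2", "push num", "push 2", "div", "pop num", "push count",
         "push 1", "add", "pop count", "j L1", "L2 label", "put count", "halt"]
instructions, symbol_table, jump_table = preprocess(COUNT)
print(symbol_table)
print(jump_table)
```

The runtime stack and program counter would then start as an empty list and 0.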
Running the Interpreter
The basic operation of the interpreter is:
(1) fetch the instruction as specified by the PC
(2) increment the PC in anticipation of sequential instruction execution
(3) execute the current instruction; some instructions will change values in the ST or on the runtime stack, the jump instructions may change the PC value
(4) if the instruction just executed is not halt, go to step (1)
Groups of Instructions - 1
• Input, Output and Halt
– get prompts the user to enter a value and stores it in the symbol table (ST)
– put prints out the value of a variable from the ST
– halt stops the interpreter
• Push and Pop
– push places either a literal value (e.g., push 0) or a variable value from the ST on top of the runtime stack
– pop removes a value from the top of the runtime stack and stores in the specified variable in the ST
Groups of Instructions - 2
• Arithmetic Instruction operation
– Pop the right hand operand off the stack
– Pop the left hand operand off the stack
– Perform the operation (check for 0 divisors)
– Push the result back onto the stack
• Logical Instruction operation
– and & or instructions perform the same four steps as arithmetic instructions
– The not instruction negates the value on top of the stack (0 → 1 or 1 → 0)
Groups of Instructions - 3
• Test Instruction operation
– Pop the value off the top of the stack
– Perform the indicated comparison of that value with 0 (choices: eq, ne, lt, le, gt, ge)
– Either push a 0 or a 1 onto the top of the stack based on whether the comparison is false or true
• Jump Instructions
– Unconditional jump: look up the label location in the jump table and place in the program counter
– Conditional jump: remove the value from the stack
• If it is false (= 0) look up the label location in the jump table and place in the program counter
• If it is true (= 1) do nothing
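Putting the instruction groups together, a WIC interpreter can be sketched in Python as below. Several details are assumptions made for the sketch: input values come from a list instead of a console prompt, output values are collected in a list, div is integer division, and the mnemonics tsteq, tstle and tstge are inferred from the comparison choices (the slides show only tstne, tstlt and tstgt).

```python
ARITH = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a // b,               # assumption: integer division
    "and": lambda a, b: 1 if a and b else 0,
    "or": lambda a, b: 1 if a or b else 0,
}
TESTS = {
    "tsteq": lambda v: v == 0, "tstne": lambda v: v != 0,
    "tstlt": lambda v: v < 0, "tstle": lambda v: v <= 0,
    "tstgt": lambda v: v > 0, "tstge": lambda v: v >= 0,
}


def is_literal(arg):
    return arg.lstrip("-").isdigit()


def run_wic(source, inputs):
    """Preprocess and execute a WIC program; return the list of values put."""
    program, jump_table, st = [], {}, {}
    for line in source.strip().splitlines():
        parts = line.split()
        if not parts:
            continue
        if len(parts) == 2 and parts[1] == "label":
            parts = ["label", parts[0]]
        op, arg = parts[0], (parts[1] if len(parts) > 1 else None)
        if op == "label":
            jump_table[arg] = len(program)
        elif op in ("get", "put", "pop") or (op == "push" and not is_literal(arg)):
            st[arg] = 0
        program.append((op, arg))

    inputs = iter(inputs)
    stack, outputs, pc = [], [], 0
    while True:
        op, arg = program[pc]                 # (1) fetch
        pc += 1                               # (2) increment the PC
        if op == "halt":                      # (3) execute
            return outputs
        elif op == "get":
            st[arg] = next(inputs)
        elif op == "put":
            outputs.append(st[arg])
        elif op == "push":
            stack.append(int(arg) if is_literal(arg) else st[arg])
        elif op == "pop":
            st[arg] = stack.pop()
        elif op in ARITH:
            right = stack.pop()               # right operand is popped first
            left = stack.pop()
            if op == "div" and right == 0:
                raise ZeroDivisionError("WIC div by zero")
            stack.append(ARITH[op](left, right))
        elif op == "not":
            stack.append(1 - stack.pop())
        elif op in TESTS:
            stack.append(1 if TESTS[op](stack.pop()) else 0)
        elif op == "j":
            pc = jump_table[arg]
        elif op == "jf":
            if stack.pop() == 0:
                pc = jump_table[arg]
        elif op != "label":                   # a label itself does nothing
            raise ValueError(f"unknown instruction: {op}")


SQUARES = """
get A
get B
push A
push A
mul
push B
push B
mul
sub
pop Result
put Result
halt
"""
print(run_wic(SQUARES, [5, 3]))   # prints [16]
```

The fetch, increment, execute loop follows the four numbered steps of the interpreter; returning from the halt branch takes the place of step (4).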
Preprocess and Interpret this WIC
get A
get B
push A
push B
sub
tstlt
jf L1
push B
pop MAX
j L2
L1 label
push A
pop MAX
L2 label
put MAX
halt
1. Show the indexed instruction list after preprocessing
2. Show the initial Symbol Table
3. Show the Jump Table
4. Execute the instructions one-by-one
a. give the instruction being executed
b. show the symbol table, stack, and program counter after the instruction is executed
Summary of Homework Problems
• Complete the stack trace of the program on slide 12
• Hand compile the product program into WIC
• Draw the complete parse tree for the expression 2 - 3 + 4 * 5. Use <E>, <T>, <N> and <D> for the internal nodes
• Develop version 4 with exponentiation (^) of the grammar for expressions
• Complete the parse tree for the while loop in the gcd program
• Specify code generation for a single alternative if
• Complete the preprocessing and interpretation of the program shown on the previous slide