Post on 28-Mar-2015

Lexical Analysis

Dragon Book: chapter 3

Compiler structure

Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program

The symbol table and error handling are shared by all phases.

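Compiler structure

Source program → Lexical analyzer → Syntax analyzer

The syntax analyzer asks the lexical analyzer to get the next token, and the lexical analyzer returns a token. Both use the symbol table and error handling.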

Tokens in programming languages

Token   Sample instances                          Description
if      if                                        keyword
rel     <, <=, <>, >=, >                          relation
id      count, length, point2                     variable
num     3.1415927, 7, 145e-3                      numerical constant
str     "abc", "some space", "\7\" is a char"     constant string

Tokens may be difficult to recognize

Fortran: DO 5 I=1.25 versus DO 5 I=1,25 (spaces do not count). The first assigns 1.25 to the variable DO5I; the second starts a DO loop.

PL/I: IF THEN THEN THEN=ELSE; ELSE ELSE=THEN; (no reserved keywords).

PL/I: PR1(2, 7, 18, D*3, 175.14)=3 (procedure call or array reference).

Strings, languages

A string is a sequence of characters over some alphabet, e.g., 0100110 over {0, 1}. In computers the alphabet is usually ASCII or EBCDIC.

Length of a string: its number of characters.

Empty string: ε (length 0).

Concatenation: putting one string after another. X=dog, Y=house, XY=doghouse (also written X.Y).

Prefix: ban is a prefix of banana.

Suffix: ana is a suffix of banana.

Language: a set of strings

The alphabet is a language: L={A, B, …, Z, a, b, …, z}.

Constant languages: X={ab, ba}, Y={a}.

Concatenation: X.Y = {aba, baa}. Y.X = {aab, aba}.

Union: X∪Y = X+Y = X|Y = {ab, ba, a}.

Exponentiation: X^3 = X.X.X

Star: X* = zero or more occurrences.

L* = all words with letters from L.

L+ = all words with one or more letters from L.
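The operations above can be tried on small finite languages. The following is a minimal sketch in Python, using sets of strings as languages; the function names `concat`, `power` and `star_up_to` are chosen for illustration, and the star is only approximated up to a bound because X* is infinite in general.

```python
from itertools import product

def concat(X, Y):
    """X.Y: every string of X followed by every string of Y."""
    return {x + y for x, y in product(X, Y)}

def power(X, n):
    """X^n: X concatenated with itself n times; X^0 is the set holding the empty string."""
    result = {""}
    for _ in range(n):
        result = concat(result, X)
    return result

def star_up_to(X, n):
    """Finite approximation of X*: the union of X^0 .. X^n."""
    words = set()
    for i in range(n + 1):
        words |= power(X, i)
    return words

X = {"ab", "ba"}
Y = {"a"}
print(sorted(concat(X, Y)))  # ['aba', 'baa']
print(sorted(concat(Y, X)))  # ['aab', 'aba']
print(sorted(X | Y))         # ['a', 'ab', 'ba']
```

Union is Python's built-in set union, so only concatenation and exponentiation need helper functions.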

Regular expressions

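$X \mid Y = X \cup Y = \{\, s \mid s \in X \text{ or } s \in Y \,\}$.

$X.Y = \{\, x.y \mid x \in X \text{ and } y \in Y \,\}$.

$X^{*} = \bigcup_{i=0}^{\infty} X^{i}$.

$X^{+} = \bigcup_{i=1}^{\infty} X^{i}$.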

Examples

a|b = {a, b}.

(a|b).(a|b) = {aa, ab, ba, bb}.

a* = { ε, a, aa, aaa, … }.

(a|b)* = { ε, a, b, ab, ba, aa, aba, … }

Defining tokens

digit → [0-9]
digits → digit+
fraction → . digits | ε
exponent → E ( + | - | ε ) digits | ε
const → digits fraction exponent
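As a rough illustration, the definitions above can be written with Python's `re` module. This is a sketch that follows the definitions literally: the names `DIGITS`, `FRACTION`, `EXPONENT`, `CONST` and `is_const` are chosen here, and the exponent marker is the uppercase E of the definition (so the lowercase sample 145e-3 from the token table would need an extra alternative).

```python
import re

DIGITS = r"[0-9]+"                      # digits   -> digit+
FRACTION = rf"(?:\.{DIGITS})?"          # fraction -> . digits | empty
EXPONENT = rf"(?:E[+-]?{DIGITS})?"      # exponent -> E (+ | - | empty) digits | empty
CONST = re.compile(DIGITS + FRACTION + EXPONENT)

def is_const(text):
    """True when the whole text is a numeric constant under the definitions."""
    return CONST.fullmatch(text) is not None

for sample in ["7", "3.1415927", "145E-3", ".5", "1."]:
    print(sample, is_const(sample))
```

The last two samples are rejected because the definition requires digits both before and after the decimal point.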

Not everything is regular!

All the words of the form w c w, where w is a word and c a letter.

The syntax of a program, e.g., the recursive definition of if-then-else: stmt → if expr then stmt else stmt.

Reading the input

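Sometimes need to “look ahead”. For example: identifying the variable done (after reading do, the next character decides whether this is a keyword or the start of an identifier).

May need to “unread” a character.

Example input: If a>8 then goto nextloop else begin while z>8 do

[Figure: two pointers into the input, one marking where the token starts and one marking the last character read.]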

Returning: token + attributes.

Input: if xyz > 11 then

if, keyword
id, value=xyz
op, value=">"
const, value=11
then, keyword
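A small scanner that returns token and attribute pairs for this input might look as follows. It is a sketch only: the keyword set, the token patterns and the name `tokenize` are assumptions made for this example, not part of the slides.

```python
import re

KEYWORDS = {"if", "then", "else"}  # assumed keyword set for this example

TOKEN_SPEC = [
    ("const", r"[0-9]+"),
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),
    ("op",    r"<=|>=|<>|<|>"),   # longer operators listed first
    ("skip",  r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, attribute) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue                      # white space produces no token
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, "keyword"))
        elif kind == "const":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, lexeme))
    return tokens

print(tokenize("if xyz > 11 then"))
# [('if', 'keyword'), ('id', 'xyz'), ('op', '>'), ('const', 11), ('then', 'keyword')]
```

Keywords are matched as identifiers first and then looked up, so a name such as ifx is returned as a single id rather than the keyword if followed by x.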

Finite Automata

[Figure: automaton with states s1, s2, s3, s4, s5 and transitions labeled a, b and c.]

Includes:

States {s1,s2,…,s5}.

Initial states {s1}.

Accepting states {s3,s5}.

Alphabet {a, b, c}.

Transitions: {(s1,a,s2), (s2, a, s3), …}.

Deterministic?

Automaton. What is the language?

[Figure: automaton with states s0 and s1 and transitions labeled a and b.]

Formally:

An input is a word over the alphabet Σ.

A run over a word is an alternating sequence of states and letters, starting from the initial state.

Accepting run: ends with an accepting state.

Example

[Figure: automaton with states s0 and s1 and transitions labeled a and b.]

Input: aabbb

Run: s0 a s0 a s0 b s1 b s1 b s1. Accepts.

Input: aba

Run: s0 a s0 b s1 a s0. Does not accept.
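The two runs can be reproduced with a table-driven simulation. The transition table below is inferred from the runs shown in the example, and s1 is taken to be the only accepting state because the first run accepts and the second does not; both are assumptions about the figure, not a copy of it.

```python
# Transition table inferred from the two example runs (an assumption about the figure).
TRANSITIONS = {
    ("s0", "a"): "s0",
    ("s0", "b"): "s1",
    ("s1", "a"): "s0",
    ("s1", "b"): "s1",
}
INITIAL = "s0"
ACCEPTING = {"s1"}  # assumed from the outcome of the two runs

def run(word):
    """Return the run as an alternating list of states and letters."""
    state = INITIAL
    trace = [state]
    for letter in word:
        state = TRANSITIONS[(state, letter)]
        trace += [letter, state]
    return trace

def accepts(word):
    """A deterministic automaton accepts when its single run ends in an accepting state."""
    return run(word)[-1] in ACCEPTING

print(" ".join(run("aabbb")), accepts("aabbb"))  # s0 a s0 a s0 b s1 b s1 b s1 True
print(" ".join(run("aba")), accepts("aba"))      # s0 a s0 b s1 a s0 False
```

Under this table the automaton always ends in s1 after reading b and in s0 after reading a, so it accepts exactly the words that end with b.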

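Automaton. What is the language?

[Figure: automaton with states s0 and s1 and transitions labeled a and b.]

Automaton. What is the language?

[Figure: automaton with states s1 and s0 and transitions labeled a and b.]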

Identifying tokens

[Figure: automaton that recognizes the keywords IF, THEN and ELSE, and identifiers as a letter followed by letters or digits.]

Nondeterministic automata

Allows more than a single transition from a state with the same label.

There does not have to be a transition from every state with every label.

Allows multiple initial states.

Allows ε transitions.

[Figure: automaton with states s0, s1, s2, s3 and transitions labeled 0,1 and 1.]

Nondeterministic runs

Input: 0100

Run 1: s0 0 s0 1 s0 0 s0 0 s0. Does not accept.

Run 2: s0 0 s0 1 s1 0 s2 0 s3. Accepts.

Accepts when there exists an accepting run.
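One way to check whether some accepting run exists is to track the set of all states the automaton could be in after each letter. The sketch below assumes the transitions suggested by the figure labels and the two runs: s0 loops on 0 and 1, moves to s1 on 1, and s1 and s2 each move forward on 0 or 1, with s3 accepting.

```python
# Transition relation assumed from the figure labels and the two runs above.
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}
INITIAL = {"s0"}
ACCEPTING = {"s3"}

def accepts(word):
    """Accept when at least one run over the word ends in an accepting state."""
    current = set(INITIAL)
    for letter in word:
        # Follow every transition on this letter; states with no transition drop out.
        current = {t for s in current for t in NFA.get((s, letter), set())}
    return bool(current & ACCEPTING)

print(accepts("0100"))  # True
print(accepts("0000"))  # False
```

For the input 0100 the tracked sets are {s0}, {s0}, {s0,s1}, {s0,s2}, {s0,s3}, which covers both Run 1 and Run 2 at once.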

Determinizing Automata

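[Figure: nondeterministic automaton N with states s0, s1, s2, s3 and transitions labeled 0,1 and 1.]

N is the nondeterministic automaton; D is the deterministic automaton being built.

Each state of D is a set of the states of N.

$S \xrightarrow{a} T$ when $T = \{\, t \mid s \in S \text{ and } s \xrightarrow{a} t \,\}$.

The initial state of D includes all the initial states of N.

Accepting states in D include at least one accepting state of N.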
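The rules above can be sketched as a worklist algorithm that only builds the sets reachable from the initial one. The automaton N below uses the same assumed transitions as the four-state example (s0 loops on 0 and 1, s3 accepting); the function name `determinize` and its return shape are chosen for this illustration.

```python
from collections import deque

# Assumed transitions of N for the four-state example.
NFA = {
    ("s0", "0"): {"s0"},
    ("s0", "1"): {"s0", "s1"},
    ("s1", "0"): {"s2"},
    ("s1", "1"): {"s2"},
    ("s2", "0"): {"s3"},
    ("s2", "1"): {"s3"},
}

def determinize(nfa, initial, accepting, alphabet):
    """Subset construction: returns (start, states, transitions, accepting states) of D."""
    start = frozenset(initial)
    states = {start}
    transitions = {}
    queue = deque([start])
    while queue:
        S = queue.popleft()
        for a in alphabet:
            # T collects every t with s -a-> t for some s in S.
            T = frozenset(t for s in S for t in nfa.get((s, a), ()))
            transitions[(S, a)] = T
            if T not in states:
                states.add(T)
                queue.append(T)
    d_accepting = {S for S in states if S & accepting}
    return start, states, transitions, d_accepting

start, states, transitions, d_accepting = determinize(NFA, {"s0"}, {"s3"}, "01")
print(len(states))       # 8
print(len(d_accepting))  # 4
```

With these assumed transitions every reachable set contains s0, which is why only 8 of the 16 possible subsets appear.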

Determinization

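[Figure: the nondeterministic automaton with states s0, s1, s2, s3, and the resulting deterministic automaton with states {s0}, {s0,s1}, {s0,s2}, {s0,s3}, {s0,s1,s2}, {s0,s1,s3}, {s0,s2,s3}, {s0,s1,s2,s3} and transitions labeled 0 and 1.]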

Determinization

[Figure: deterministic automaton with states 000, 001, 010, 011, 100, 101, 110, 111 and transitions labeled 0 and 1.]

Translating regular expressions into automata

[Figure: automata for L1 and L2 combined into automata for L1∪L2, L1.L2 and L*.]

Automatic translation

(a|b).(a|b) = (a∪b)(a∪b) = (a+b).(a+b) = …

[Figure: automata with transitions labeled a and b.]

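Determinization with ε transitions.

[Figure: automaton with states s0 to s11; s1 goes to s3 on a, s2 goes to s4 on b, s7 goes to s9 on a, s8 goes to s10 on b, and the remaining transitions are ε transitions.]

Add to each set the states reachable using ε transitions.

[Figure: resulting deterministic automaton with states {s0,s1,s2}, {s3,s5,s6,s7,s8}, {s4,s5,s6,s7,s8}, {s9,s11}, {s10,s11} and transitions labeled a and b.]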
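Adding the states reachable by ε transitions is usually written as a closure function. The sketch below assumes a particular set of ε edges for the automaton with states s0 to s11; they were chosen so that the closures agree with the state sets listed on the slide, and the figure itself may differ in detail.

```python
# Assumed epsilon edges, chosen to be consistent with the sets shown on the slide.
EPSILON = {
    "s0": {"s1", "s2"},
    "s3": {"s5"},
    "s4": {"s5"},
    "s5": {"s6"},
    "s6": {"s7", "s8"},
    "s9": {"s11"},
    "s10": {"s11"},
}
# Labeled transitions read off the figure residue.
MOVES = {
    ("s1", "a"): {"s3"},
    ("s2", "b"): {"s4"},
    ("s7", "a"): {"s9"},
    ("s8", "b"): {"s10"},
}

def closure(states):
    """All states reachable from the given ones using only epsilon transitions."""
    result = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in EPSILON.get(s, ()):
            if t not in result:
                result.add(t)
                stack.append(t)
    return frozenset(result)

def step(S, letter):
    """Move on a letter from every state of S, then add the epsilon-reachable states."""
    moved = {t for s in S for t in MOVES.get((s, letter), ())}
    return closure(moved)

start = closure({"s0"})
after_a = step(start, "a")
after_ab = step(after_a, "b")
```

Here `start` is {s0,s1,s2}, `after_a` is {s3,s5,s6,s7,s8} and `after_ab` is {s10,s11}, matching three of the five sets on the slide.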

Minimization

Group all the states together.

Separate states according to available exit transitions.

Split a set in two if some of its states can reach another set and the others cannot.

Repeat until no set can be split.

[Figure: automaton with states p0, p1, p2, p3, p4 and transitions labeled a and b.]

Minimization

Group all the states together: {p0, p1, p2, p3, p4}.

Minimization

Separate states according to available exit transitions.

Minimization

Split a set in two if some of its states can reach another set and the others cannot.

Repeat until no set can be split.
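The steps can be sketched as a partition refinement loop. Two things here are assumptions rather than slide content: the transition table for p0 to p4 (taken to be the deterministic automaton for (a|b).(a|b), with p3 and p4 accepting), and the initial split by accepting status, which is the usual starting point and is added alongside the split by available exit transitions.

```python
def minimize(states, alphabet, delta, accepting):
    """Return the groups of equivalent states, found by repeated splitting."""
    def number(key_of):
        # Give states with equal keys the same block number.
        ids = {}
        return {s: ids.setdefault(key_of(s), len(ids)) for s in states}

    # First split: accepting status and the letters that have an exit transition.
    block = number(lambda s: (s in accepting,
                              tuple(a for a in alphabet if (s, a) in delta)))
    while True:
        # Split a block when its states reach different blocks on some letter.
        refined = number(lambda s: (block[s],
                                    tuple(block.get(delta.get((s, a))) for a in alphabet)))
        if len(set(refined.values())) == len(set(block.values())):
            break  # nothing was split, so the partition is stable
        block = refined

    groups = {}
    for s in states:
        groups.setdefault(block[s], []).append(s)
    return list(groups.values())

# Assumed transition table for the five-state example.
DELTA = {
    ("p0", "a"): "p1", ("p0", "b"): "p2",
    ("p1", "a"): "p3", ("p1", "b"): "p4",
    ("p2", "a"): "p3", ("p2", "b"): "p4",
}
print(minimize(["p0", "p1", "p2", "p3", "p4"], "ab", DELTA, {"p3", "p4"}))
# [['p0'], ['p1', 'p2'], ['p3', 'p4']]
```

With this table the first split separates p3 and p4 (no exit transitions) from the rest, and the second separates p0 from p1 and p2, because p0 stays inside its own set on both letters while p1 and p2 leave it.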

Can minimize now

[Figure: automaton with transitions labeled a and b.]

Lex

Declarations
%%
Translation rules
%%
Auxiliary procedures

Lex behavior

Lex source program lex.l → Lex → lex.yy.c

lex.yy.c → C compiler → a.out

Input stream → a.out → Output tokens

Lex behavior

Translates the definitions into an automaton.

The automaton looks for the longest matching string.

Either returns some value to the reading program (parser), or looks for the next token.

Lookahead operator: x/y allows the token x only if y follows it (but y is not part of the token).
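The longest-match rule can be imitated outside Lex by trying every rule at the current position and keeping the longest lexeme. The sketch below is in Python, not Lex; the rule list and the names `next_token` and `scan` are made up for this example. Ties go to the earlier rule, which is also how Lex resolves them.

```python
import re

# Hypothetical rule list; order matters only for ties.
RULES = [
    ("if",  re.compile(r"if")),
    ("id",  re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("le",  re.compile(r"<=")),
    ("lt",  re.compile(r"<")),
    ("num", re.compile(r"[0-9]+")),
]

def next_token(text, pos):
    """Try every rule at pos and keep the longest match (first rule wins ties)."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and (best is None or m.end() > best[1].end()):
            best = (name, m)
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best[0], best[1].group(), best[1].end()

def scan(text):
    pos, out = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        name, lexeme, pos = next_token(text, pos)
        out.append((name, lexeme))
    return out

print(scan("if iffy <= 3 < x"))
# [('if', 'if'), ('id', 'iffy'), ('le', '<='), ('num', '3'), ('lt', '<'), ('id', 'x')]
```

The word iffy becomes one id because that match is longer than the keyword if, while if alone matches both rules with the same length and the earlier rule is chosen.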

Lex Project

Project collection date: Feb 11th.

Work in pairs (singles).

Use lex to take a text and check whether the number of open parentheses of any kind is equal to the number of closed parentheses.

Exception: inside quotes. \" is not a closing quote.
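To make the specification concrete, the intended check can be prototyped in Python before writing the Lex rules. This is only a sketch of the counting logic, not a Lex solution, and it assumes that "parentheses of any kind" means ( ), [ ] and { } and that quotes are double quotes.

```python
OPEN = "([{"    # assumed set of opening parentheses
CLOSE = ")]}"   # assumed set of closing parentheses

def counts_match(text):
    """True when the numbers of open and closed parentheses outside quotes are equal."""
    opens = closes = 0
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == "\\" and i + 1 < len(text) and text[i + 1] == '"':
                i += 2          # \" inside quotes is not a closing quote
                continue
            if ch == '"':
                in_quotes = False
        elif ch == '"':
            in_quotes = True
        elif ch in OPEN:
            opens += 1
        elif ch in CLOSE:
            closes += 1
        i += 1
    return opens == closes

print(counts_match('f("(")'))      # True
print(counts_match('f("\\")")'))   # True
print(counts_match("(("))          # False
```

As in the project statement, only the counts are compared, so the order and nesting of the parentheses are not checked.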