Lexical Analysis Natawut Nupairoj, Ph.D. Department of Computer Engineering Chulalongkorn...

Post on 28-Dec-2015


Lexical Analysis

Natawut Nupairoj, Ph.D.

Department of Computer Engineering, Chulalongkorn University

Outline

Overview. Token, Lexeme, and Pattern. Lexical Analysis Specification. Lexical Analysis Engine.

Front-End Components

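[Figure: front-end pipeline. The source program (text stream, e.g. "main() {") enters the Scanner, which groups characters into tokens (identifier main, symbol (, ...). The Parser requests each token with next-token and constructs the parse tree. The Semantic Analyzer checks semantic/contextual constraints on the parse tree and produces the intermediate representation (file or in memory). A Symbol Table is shown alongside the front-end components.]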

Tasks for Scanner

Read the input and group characters into tokens for the parser. Strip comments and white space. Count line numbers. Create entries in the symbol table. Perform preprocessing functions.

Benefits

Simpler design: the parser doesn’t have to worry about comments and white space.

More efficient scanner: optimize the scanning process only; use specialized buffering techniques.

Portability: handle standard symbols on different platforms.

Basic Terminology

Token: a set of strings. Ex: token = identifier

Lexeme: a sequence of characters in the source program matched by the pattern for a token. Ex: lexeme = counter

Basic Terminology

Pattern: a description of the strings that can belong to a particular token set. Ex: pattern = a letter followed by letters or digits

{A,…,Z,a,…,z}{A,…,Z,a,…,z,0,…,9}*

Token       Lexeme              Pattern

const       const               const

if          if                  if

relation    <, <=, …, >=        comparison symbols

id          counter, x, y       letter (letter | digit)*

num         12.53, 1.42E-10     any numeric constant

literal     “Hello World”       characters between double quotes

Language and Lexical Analysis

Fixed-format input, e.g. FORTRAN: must consider the alignment of a lexeme; difficult to scan.

No reserved words, e.g. PL/I: keyword or identifier? The rules are complex.

if if = then then then := else; else else := then;

Regular Expression Revisited

ε is a regular expression that denotes {ε}. If a is a symbol in the alphabet, a is a regular expression that denotes {a}. Suppose r and s are regular expressions:

(r)|(s) denotes L(r) ∪ L(s). (r)(s) denotes L(r)L(s). (r)* denotes (L(r))*.
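The three language operations can be tried out on small finite sets of strings. The following is a minimal sketch, not part of the original slides: the function names are illustrative, and because (L(r))* is infinite whenever L(r) contains a non-empty string, the star is approximated with an assumed repetition bound `max_reps`.

```python
def union(L, M):
    """L(r) U L(s): strings that belong to either language."""
    return L | M

def concat(L, M):
    """L(r)L(s): every string of L followed by every string of M."""
    return {x + y for x in L for y in M}

def star(L, max_reps):
    """Approximate (L(r))* with at most max_reps repetitions (assumed bound)."""
    result = {""}      # zero repetitions yields the empty string
    current = {""}
    for _ in range(max_reps):
        current = concat(current, L)
        result |= current
    return result
```

For example, `star({"a"}, 3)` yields the empty string together with a, aa, and aaa.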

Precedence of Operator

Level of precedence (highest to lowest): Kleene closure (*), concatenation, union (|).

All operators are left associative. Ex: a*b | cd* = ((a*)b) | (c(d*))
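One way to check the grouping is to compare the unparenthesized expression with its fully parenthesized form on sample strings. The sketch below uses Python's `re` module, whose `*`, concatenation, and `|` follow the same precedence order; the helper `same_language`, the contrasting pattern `WRONG`, and the sample strings are illustrative assumptions.

```python
import re

IMPLICIT = re.compile(r"a*b|cd*")            # relies on operator precedence
EXPLICIT = re.compile(r"((a*)b)|(c(d*))")    # grouping from the slide
WRONG = re.compile(r"a*(b|c)d*")             # a different (incorrect) grouping

SAMPLES = ["b", "aab", "c", "cddd", "ab", "abcd", "cb", "bd", "", "a", "d"]

def same_language(p, q, samples):
    """True if p and q accept exactly the same strings among the samples."""
    return all(bool(p.fullmatch(s)) == bool(q.fullmatch(s)) for s in samples)

def accepted(pattern, samples):
    """List the sample strings that the pattern matches in full."""
    return [s for s in samples if pattern.fullmatch(s)]
```

On these samples the implicit and explicit forms agree, while the incorrect grouping differs on the string bd, which only `WRONG` accepts.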

Regular Definition

A sequence of definitions:

d1 → r1

d2 → r2

...

dn → rn

Each di is a distinct name. Each ri is a regular expression over:

Σ ∪ {d1, …, di-1}

Examples

letter → A | B | … | Z | a | b | … | z

digit → 0 | 1 | … | 9

id → letter ( letter | digit )*

digits → digit digit*

opt_fraction → . digits | ε

opt_exponent → ( E ( + | - | ε ) digits ) | ε

num → digits opt_fraction opt_exponent

Notational Shorthands

One or more instances: r+ = rr*

Zero or one instance: r? = r | ε, (rs)? = rs | ε

Character class: [A-Za-z] = A | B | … | Z | a | b | … | z

Examples

digit → [0-9]

digits → digit+

opt_fraction → ( . digits )?

opt_exponent → ( E ( + | - )? digits )?

num → digits opt_fraction opt_exponent

id → [A-Za-z][A-Za-z0-9]*
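These definitions translate almost directly into Python regular expressions. The sketch below builds each definition from the earlier ones by string substitution; the constant names and the helpers `is_num` and `is_id` are illustrative assumptions, not part of the slides.

```python
import re

# Each definition may only refer to the definitions above it.
DIGIT = r"[0-9]"
DIGITS = rf"{DIGIT}+"
OPT_FRACTION = rf"(\.{DIGITS})?"
OPT_EXPONENT = rf"(E[+-]?{DIGITS})?"
NUM = re.compile(rf"{DIGITS}{OPT_FRACTION}{OPT_EXPONENT}")
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_num(s):
    """True if the whole string s matches the num definition."""
    return NUM.fullmatch(s) is not None

def is_id(s):
    """True if the whole string s matches the id definition."""
    return ID.fullmatch(s) is not None
```

With these definitions, 31, 31.02, and 31.02E-15 are accepted as num, while 31. and .5 are rejected because a fraction needs digits on both sides of the point.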

Recognition of Tokens

Consider the tokens from the grammar. For each one, identify the token, its pattern, and its attribute.

Draw NFAs with retracting options.

Example : Grammar

stmt ::= if expr then stmt

| if expr then stmt else stmt

| expr

expr ::= term relop term

| term

term ::= id | num

Example : Regular Definition

if → if

then → then

else → else

relop → < | <= | = | <> | > | >=

id → letter (letter | digit)*

num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?

delim → blank | tab | newline

ws → delim+

Example: Pattern-Token-Attribute

Regular Expression    Token    Attribute-Value

ws                    -        -

if                    if       -

then                  then     -

else                  else     -

id                    id       Index in table

num                   num      Index in table

<                     relop    LT

<=                    relop    LE

=                     relop    EQ

<>                    relop    NE

...                   ...      ..

Attributes for Tokens

if count >= 0 then ...

<if, >

<id, index for count in symbol table>

<relop, GE>

<num, integer value 0>

<then, >
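The token/attribute pairs above can be reproduced with a small regex-driven scanner. This is a rough sketch under several assumptions: keywords are checked against a fixed set rather than a preloaded symbol table, identifier indices start at 0, the attribute of num is its numeric value as in this example, and all names (`TOKEN_SPEC`, `tokenize`, and so on) are illustrative.

```python
import re

TOKEN_SPEC = [
    ("ws", r"[ \t\n]+"),
    ("num", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("relop", r"<=|<>|>=|<|=|>"),   # two-character operators listed first
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"if", "then", "else"}
RELOP_NAMES = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

def tokenize(text):
    """Return (tokens, symbols); tokens is a list of (token, attribute) pairs."""
    symbols = []   # simplified symbol table: list position is the id's index
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                           # white space yields no token
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))      # keywords carry no attribute
        elif kind == "id":
            if lexeme not in symbols:
                symbols.append(lexeme)
            tokens.append(("id", symbols.index(lexeme)))
        elif kind == "num":
            value = int(lexeme) if lexeme.isdigit() else float(lexeme)
            tokens.append(("num", value))
        else:
            tokens.append(("relop", RELOP_NAMES[lexeme]))
    return tokens, symbols
```

Calling `tokenize("if count >= 0 then")` gives the pairs (if, None), (id, 0), (relop, GE), (num, 0), (then, None), with count stored as the first symbol.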

NFA – Lexical Analysis Engine

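[Figure: transition diagram for relop, with states 0 to 8. Edges are labeled <, =, >, and other. The accepting states return (relop, LE), (relop, EQ), (relop, NE), (relop, LT), (relop, GE), and (relop, GT). The two states reached on "other" are marked *, meaning the input is retracted by one character.]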
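A transition diagram of this kind can be hand-coded as nested branches on the next character. The sketch below is one possible encoding, not the author's implementation: the function name `scan_relop` and its return convention are assumptions, and "retract" is modeled by simply not advancing past the lookahead character.

```python
def scan_relop(text, pos=0):
    """Return ((token, attribute), next_pos), or None if no relop starts at pos."""
    def peek(i):
        # An empty string stands for end of input.
        return text[i] if i < len(text) else ""

    c = peek(pos)
    if c == "<":
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2
        return ("relop", "LT"), pos + 1    # "other": retract the lookahead
    if c == "=":
        return ("relop", "EQ"), pos + 1
    if c == ">":
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2
        return ("relop", "GT"), pos + 1    # "other": retract the lookahead
    return None
```

For the input "<5" the function returns LT with the next position at 1, so the character 5 is left for the next token.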

Handle Numbers

The pattern for a number contains optional parts: num → digit+ ( . digit+ )? ( E (+ | -)? digit+ )?

Ex: 31, 31.02, 31.02E-15

Always take the longest possible match: try the longest pattern first; if it does not match, try the next possible pattern.
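The longest-match rule can be sketched by trying the alternatives for num from the longest form to the shortest and returning the first one that matches. This is an illustration using Python's `re` module rather than hand-built diagrams; the pattern list and the name `scan_num` are assumptions.

```python
import re

# Alternatives for num, ordered from the longest form to the shortest.
PATTERNS = [
    re.compile(r"[0-9]+(?:\.[0-9]+)?E[+-]?[0-9]+"),  # with exponent
    re.compile(r"[0-9]+\.[0-9]+"),                   # with fraction only
    re.compile(r"[0-9]+"),                           # digits only
]

def scan_num(text, pos=0):
    """Return (lexeme, next_pos) for the longest num at pos, or None."""
    for pattern in PATTERNS:
        m = pattern.match(text, pos)
        if m:
            return m.group(), m.end()
    return None
```

For the input 31.02E- the exponent form fails because no digit follows the sign, so the scanner falls back to the fraction form and returns 31.02, leaving E- unread.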

Handle Numbers

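[Figure: three transition diagrams for num, tried in order. States 12 to 19 recognize a number with an exponent (digit+, an optional . digit+, then E, an optional + or -, and digit+). States 20 to 24 recognize digit+ . digit+. States 25 to 27 recognize digit+. Each diagram ends on "other" in a state marked * (retract one character), with the action return(num, getnum()).]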

Handle Keywords

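Two approaches:

Encode keywords into an NFA (if, then, etc.): complex NFA (too many states).

Use the symbol table: simple, but requires some tricks.

[Figure: transition diagram for identifiers and keywords, with states 9, 10, and 11. A letter starts the lexeme, further letters or digits extend it, and "other" leads to a state marked * (retract one character) with the action return(gettoken(), install_id()).]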

Handle Keywords

Symbol table contains both lexeme and token type.

Initialize symbol table with all keywords and corresponding token types.

lexeme: if token type: if

lexeme: then token type: then

lexeme: else token type: else

Handle Keywords

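[Figure: initial state. The Scanner and Parser both refer to the Symbol Table, which has the columns Lexeme and Token Type and rows numbered 1 to 5. The keyword entries (if, if), (then, then), and (else, else) are preloaded; the remaining rows are empty.]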

Handle Keywords

gettoken():

If the id is not found in the table, return token type ID. Otherwise, return the token type from the table.

Handle Keywords

[Figure: gettoken for the source text "if count <=". The Parser asks for next-token, the Scanner reads the lexeme if and looks it up in the Symbol Table. The lexeme is found with token type if, so gettoken returns if.]

Handle Keywords

install_id():

If the id is not found in the table, it’s a new id: insert the new id into the table and return a pointer to the new entry.

If the id is found and its type is ID, return a pointer to that entry.

Otherwise, it’s a keyword: return 0.
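The two routines can be sketched together with a small symbol-table class. This is a minimal illustration under stated assumptions: a "pointer" is modeled as an integer row index, row 0 is left unused so that 0 can mean "keyword", and the names `SymbolTable`, `_insert`, and `scan_word` are hypothetical.

```python
class SymbolTable:
    def __init__(self, keywords=("if", "then", "else")):
        self.entries = [None]    # row 0 unused; 0 means "keyword, no entry"
        self.index = {}          # lexeme -> row number
        for kw in keywords:
            self._insert(kw, kw)   # a keyword's token type is the keyword itself

    def _insert(self, lexeme, token_type):
        self.entries.append((lexeme, token_type))
        self.index[lexeme] = len(self.entries) - 1
        return self.index[lexeme]

    def gettoken(self, lexeme):
        """Token type from the table, or "id" if the lexeme is not found."""
        row = self.index.get(lexeme)
        return "id" if row is None else self.entries[row][1]

    def install_id(self, lexeme):
        """Row of the identifier (inserting it if new), or 0 for a keyword."""
        row = self.index.get(lexeme)
        if row is None:
            return self._insert(lexeme, "id")
        if self.entries[row][1] == "id":
            return row
        return 0

def scan_word(table, lexeme):
    """Mimic return(gettoken(), install_id()); gettoken is evaluated first."""
    return table.gettoken(lexeme), table.install_id(lexeme)
```

With the three keywords preloaded in rows 1 to 3, `scan_word(table, "if")` gives (if, 0) and the first `scan_word(table, "count")` gives (id, 4); later occurrences of count return the same row.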


Handle Keywords

[Figure: install_id for the lexeme if from the source text "if count <=". The entry found in the Symbol Table is a keyword, so install_id returns 0 and the Scanner answers the Parser's next-token request with the token if and attribute 0.]

Handle Keywords

[Figure: gettoken for the next lexeme, count, from the source text "if count <=". The lookup in the Symbol Table reports "Not found!", so gettoken returns token type id.]

Handle Keywords

[Figure: install_id for the lexeme count. A new entry (count, id) is inserted at row 4 of the Symbol Table and install_id returns 4, so the Scanner passes the token id with attribute 4 to the Parser.]