Lexical Analysis

Mooly Sagiv
msagiv@post.tau.ac.il
Schrierber 317
03-640-7606
Wed 10:00-12:00
html://www.math.tau.ac.il/~msagiv/courses/wcc.html

Textbook: Modern Compiler Implementation in C, Chapter 2


A motivating example

• Create a program that counts the number of lines in a given input file


A motivating example: solution

%option noyywrap
%{
#include <stdio.h>
int num_lines = 0;
%}
%%
\n      ++num_lines;
.       ;
%%
int main(void) {
    yylex();
    printf("# of lines = %d\n", num_lines);
    return 0;
}
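For comparison, the same task can be written by hand in C. This is only a minimal sketch: `count_lines` is a hypothetical helper name that does not appear in the slides, and it simply mirrors the two rules of the specification above (count every newline, ignore every other character).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not from the slides): counts the '\n' characters
   read from `in`, mirroring the two flex rules of the specification. */
long count_lines(FILE *in)
{
    assert(in != NULL);
    long num_lines = 0;
    int c;                          /* int, so that EOF is distinguishable */
    while ((c = getc(in)) != EOF) {
        if (c == '\n')
            ++num_lines;            /* rule:  \n   ++num_lines;  */
        /* rule:  .   ;   (any other character is ignored) */
    }
    return num_lines;
}
```

Calling `count_lines(stdin)` from `main` and printing the result would reproduce the behaviour of the generated scanner for this one task. The flex version becomes more attractive as soon as the patterns are more complex than a single character.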


Subjects

• Roles of lexical analysis

• The straightforward solution: a manual scanner for C

• Regular Expressions

• Finite automata

• From regular languages into finite automata

• Flex


Basic Compiler Phases

Source program (string)
  → lexical analysis → Tokens
  → syntax analysis → Abstract syntax tree
  → semantic analysis
  → Translate → Intermediate representation
  → Instruction selection → Assembly
  → Register Allocation → Fin. Assembly

Techniques used along the way: Finite automata, Pushdown automata, Memory organization, graph algorithms, Dynamic programming


Example

• Input string

a := 5 + 3 ;\nb := (print(a, a-1), 10 * a) ;\nprint(b)

• Tokens

id("a") assign num(5) + num(3) ;
id("b") assign ( print ( id("a") , id("a") - num(1) ) , num(10) * id("a") ) ;
print ( id("b") )


Lexical Analysis (Scanning)

• Functionality

  – input
    • program text (file)

  – output
    • sequence of tokens

– Read input file

– Identify language keywords and standard identifiers

– Handle include files and macros

– Count line numbers

– Remove whitespace

– Report illegal symbols

– Produce symbol table


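A simplified scanner for C

Token nextToken()
{
    int c;                      /* int, so that EOF can be represented */
loop:
    c = getchar();
    switch (c) {
    case ' ': goto loop;        /* skip blanks */
    case ';': return SemiColumn;
    case '+':
        c = getchar();          /* one character of lookahead */
        switch (c) {
        case '+': return PlusPlus;
        case '=': return PlusEqual;
        default:
            ungetc(c, stdin);   /* push the lookahead back (the slide had putchar) */
            return Plus;
        }
    case '<': /* ... */
    case 'w': /* ... */
    }
}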


Automatic Generation of Lexical Analysis

• The matching of input strings can be performed by a finite automaton

• Examples:
  – An automaton for while
  – An automaton for C identifiers
  – An automaton for C comments

• The program for the automaton is automatically generated from regular expressions
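To make the idea of "a program for the automaton" concrete, here is a hand-coded sketch of the automaton for C identifiers. It is written by hand rather than generated, and the state names and the function `is_c_identifier` are illustrative assumptions, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* States of a hand-coded DFA for [a-zA-Z_][a-zA-Z0-9_]* */
enum { START, IN_ID, DEAD };

static int is_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Returns 1 if the whole string s is accepted by the automaton, else 0. */
int is_c_identifier(const char *s)
{
    assert(s != NULL);
    int state = START;
    for (size_t i = 0; s[i] != '\0' && state != DEAD; i++) {
        switch (state) {
        case START:
            state = is_letter(s[i]) ? IN_ID : DEAD;
            break;
        case IN_ID:
            state = (is_letter(s[i]) || is_digit(s[i])) ? IN_ID : DEAD;
            break;
        }
    }
    return state == IN_ID;      /* IN_ID is the only accepting state */
}
```

Note that this automaton also accepts keywords such as `while`. Telling keywords from identifiers is left to the ambiguity rules of the scanner.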


Flex

• Input

  – regular expressions and actions (C code)

• Output

  – A scanner program that reads the input and applies actions when an input regular expression is matched

regular expressions → flex → scanner

input program → scanner → tokens


Regular Expression Notations

a           An ordinary character stands for itself
M|N         M or N
MN          M followed by N
M*          Zero or more occurrences of M
M+          One or more occurrences of M
M?          Zero or one occurrence of M
[a-zA-Z]    Character set alternation (single character)
.           Any (single) character but newline
"a.+"       Quotation
\           Convert an operator into text
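Most of these notations can be tried out without flex, using the POSIX `<regex.h>` interface. The sketch below assumes a POSIX system, and `matches` is a hypothetical helper, not something from the slides. Two differences from flex should be kept in mind: POSIX extended regular expressions have no double-quote quotation (a backslash is used instead), and `.` also matches a newline unless `REG_NEWLINE` is given.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the whole of `text` matches the POSIX
   extended regular expression `pattern`, 0 if it does not, and -1 if the
   pattern is too long for the buffer or fails to compile. */
int matches(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    char anchored[256];
    regex_t re;

    /* Anchor the pattern so that it must cover the entire text. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, `matches("ab?", "a")` exercises the `M?` row and `matches("a\\.\\+", "a.+")` exercises the backslash row of the table.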


Ambiguity Resolving

• Find the longest matching token

• When two rules match tokens of the same length, use the rule declared first
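Both rules can be sketched in a few lines of C. The code below is an illustration under stated assumptions: the three hand-written matchers stand in for regular expressions, and the names `match_if`, `match_id`, `match_num` and `next_token` are hypothetical. It is not how flex itself is implemented.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token { T_NONE, T_IF, T_ID, T_NUM };

/* Each matcher returns the length of the longest prefix of s that its
   rule accepts, or 0 if the rule does not match at all. */
static size_t match_if(const char *s)           /* if */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_id(const char *s)           /* [a-z][a-z0-9]* */
{
    size_t n = 0;
    if (!islower((unsigned char)s[0]))
        return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_num(const char *s)          /* [0-9]+ */
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Rules in declaration order; earlier entries win ties. */
struct rule { enum token tok; size_t (*match)(const char *); };
static const struct rule rules[] = {
    { T_IF, match_if }, { T_ID, match_id }, { T_NUM, match_num },
};

/* Picks the longest match at the start of s; *len receives its length. */
enum token next_token(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    enum token best = T_NONE;
    *len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) {         /* strict '>' keeps the earlier rule on a tie */
            *len = n;
            best = rules[i].tok;
        }
    }
    return best;
}
```

On the input `if8` the identifier rule wins because its match is longer. On the input `if` both rules match two characters, so the rule declared first (the keyword) is chosen.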


A Flex specification of C Scanner

Letter  [a-zA-Z_]
Digit   [0-9]
%%
[ \t]                           { ; }
[\n]                            { line_count++; }
";"                             { return SemiColumn; }
"++"                            { return PlusPlus; }
"+="                            { return PlusEqual; }
"+"                             { return Plus; }
"while"                         { return While; }
{Letter}({Letter}|{Digit})*     { return Id; }
"<="                            { return LessOrEqual; }
"<"                             { return LessThen; }


Running Example

if                                      { return IF; }
[a-z][a-z0-9]*                          { return ID; }
[0-9]+                                  { return NUM; }
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)     { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)              { ; }
.                                       { error(); }


int edges[][256] = {
    /*               ..., 0, 1, 2, 3, ..., -, e, f, g, h, i, j, ... */
    /* state 0 */  { 0, ..., 0, 0, ..., 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0 },
    /* state 1 */  { 13, ..., 7, 7, 7, 7, ..., 9, 4, 4, 4, 4, 2, 4, ..., 13, 13 },
    /* state 2 */  { 0, ..., 4, 4, 4, 4, ..., 0, 4, 3, 4, 4, 4, 4, ..., 0, 0 },
    /* state 3 */  { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
    /* state 4 */  { 0, ..., 4, 4, 4, 4, ..., 0, 4, 4, 4, 4, 4, 4, ..., 0, 0 },
    /* state 5 */  { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
    /* state 6 */  { 0, ..., 6, 6, 6, 6, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 },
    /* state 7 */
    ...
    /* state 13 */ { 0, ..., 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, ..., 0, 0 }
};


Pseudo Code for Scanner

Token nextToken()
{
    lastFinal = 0;
    currentState = 1;
    inputPositionAtLastFinal = input;
    currentPosition = input;
    while (not(isDead(currentState))) {
        nextState = edges[currentState][character at currentPosition];
        advance currentPosition;
        if (isFinal(nextState)) {
            /* remember the end of the longest token seen so far */
            lastFinal = nextState;
            inputPositionAtLastFinal = currentPosition;
        }
        currentState = nextState;
    }
    input = inputPositionAtLastFinal;   /* back up to the last final state */
    return action[lastFinal];
}
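A runnable counterpart of the pseudo code can be sketched with a much smaller automaton. The version below is an assumption-laden illustration: it recognises only NUM (`[0-9]+`) and a simplified REAL (`[0-9]+"."[0-9]+`), it uses three character classes instead of 256 table columns, and the names `scan_number`, `char_class` and the state constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { DEAD = 0, START = 1, S_NUM = 2, S_DOT = 3, S_REAL = 4 };
enum token { T_ERROR, T_NUM, T_REAL };

/* Character classes keep the table small: 0 = digit, 1 = '.', 2 = other. */
static int char_class(char c)
{
    if (c >= '0' && c <= '9') return 0;
    if (c == '.') return 1;
    return 2;
}

static const int edges[5][3] = {
    /* DEAD   */ { DEAD,   DEAD,  DEAD },
    /* START  */ { S_NUM,  DEAD,  DEAD },
    /* S_NUM  */ { S_NUM,  S_DOT, DEAD },
    /* S_DOT  */ { S_REAL, DEAD,  DEAD },
    /* S_REAL */ { S_REAL, DEAD,  DEAD },
};

/* action[s] is the token of final state s; T_ERROR marks non-final states. */
static const enum token action[5] = {
    T_ERROR, T_ERROR, T_NUM, T_ERROR, T_REAL
};

/* Scans the longest token at the start of input; *len receives its length. */
enum token scan_number(const char *input, size_t *len)
{
    assert(input != NULL && len != NULL);
    int state = START, last_final = DEAD;
    size_t pos = 0;
    *len = 0;
    while (state != DEAD && input[pos] != '\0') {
        state = edges[state][char_class(input[pos])];
        pos++;                          /* advance past the consumed character */
        if (action[state] != T_ERROR) { /* final state: remember this position */
            last_final = state;
            *len = pos;
        }
    }
    return action[last_final];
}
```

On the input `12.x` the automaton reads `12.` before dying on `x`, but the last final state was reached after `12`, so the scanner backs up and reports NUM with length 2.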


Example

Input: “if --not-a-com”


Efficient Scanners

• Efficient state representation

• Input buffering

• Using switch and goto instead of tables


Constructing Automaton from Specification

• Create a non-deterministic automaton (NDFA) from every regular expression

• Merge all the automata using epsilon moves (like the | construction)

• Construct a deterministic finite automaton (DFA)

• Minimize the automaton, starting from a partition that keeps the accepting states of different tokens separate
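The first three steps can be sketched on a toy example. The code below is a minimal illustration of the subset construction, not flex's implementation: the NFA merges two rules, `if` and `[a-z]+`, through epsilon moves from a common start state, the alphabet is reduced to three symbol classes, and all names (`nfa_eps`, `nfa_move`, `build_dfa`, ...) are hypothetical. Minimization is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define NFA_STATES 6
#define SYMBOLS    3    /* 0 = 'i', 1 = 'f', 2 = any other lowercase letter */
#define MAX_DFA    64

typedef unsigned set_t; /* bit k set <=> NFA state k belongs to the set */

/* NFA: 0 -eps-> 1 and 4;  1 -i-> 2 -f-> 3 (accepts IF);
        4 -letter-> 5 -letter-> 5 (state 5 accepts ID). */
static const set_t nfa_eps[NFA_STATES] = { (1u << 1) | (1u << 4), 0, 0, 0, 0, 0 };
static const set_t nfa_move[NFA_STATES][SYMBOLS] = {
    /* 0 */ { 0, 0, 0 },
    /* 1 */ { 1u << 2, 0, 0 },
    /* 2 */ { 0, 1u << 3, 0 },
    /* 3 */ { 0, 0, 0 },
    /* 4 */ { 1u << 5, 1u << 5, 1u << 5 },
    /* 5 */ { 1u << 5, 1u << 5, 1u << 5 },
};

/* Adds every state reachable through epsilon moves. */
static set_t closure(set_t s)
{
    set_t prev;
    do {
        prev = s;
        for (int q = 0; q < NFA_STATES; q++)
            if (s & (1u << q))
                s |= nfa_eps[q];
    } while (s != prev);
    return s;
}

/* Set of NFA states reachable from s on symbol sym, epsilon-closed. */
static set_t step(set_t s, int sym)
{
    set_t out = 0;
    for (int q = 0; q < NFA_STATES; q++)
        if (s & (1u << q))
            out |= nfa_move[q][sym];
    return closure(out);
}

set_t dfa_sets[MAX_DFA];            /* NFA-state set behind each DFA state */
int dfa_edges[MAX_DFA][SYMBOLS];    /* -1 means "no transition" */

/* Builds the DFA; returns the number of DFA states (state 0 is the start). */
int build_dfa(void)
{
    int n = 0;
    dfa_sets[n++] = closure(1u << 0);
    for (int d = 0; d < n; d++) {           /* n grows as new sets appear */
        for (int sym = 0; sym < SYMBOLS; sym++) {
            set_t t = step(dfa_sets[d], sym);
            int j = -1;
            if (t != 0) {
                for (j = 0; j < n && dfa_sets[j] != t; j++)
                    ;                       /* look for an existing DFA state */
                if (j == n) {
                    assert(n < MAX_DFA);
                    dfa_sets[n++] = t;      /* new DFA state */
                }
            }
            dfa_edges[d][sym] = j;
        }
    }
    return n;
}
```

One of the resulting DFA states contains both NFA accepting states (3 for IF and 5 for ID). That is where the "declared first" rule decides which token the DFA state reports.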


NDFA Construction

if                                      { return IF; }
[a-z][a-z0-9]*                          { return ID; }
[0-9]+                                  { return NUM; }
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+)     { return REAL; }
(\-\-[a-z]*\n)|(" "|\n|\t)              { ; }
.                                       { error(); }


DFA Construction


Minimization


%{
/* C declarations */
#include "tokens.h"     /* Mapping of tokens into integers */
#include "errormsg.h"   /* Shared by all the phases */
union { int ival; string sval; double fval; } yylval;
int charPos = 1;
#define ADJ (EM_tokPos=charPos, charPos+=yyleng)
%}
/* Lex Definitions */
digits [0-9]+
%%
if                              { ADJ; return IF; }
[a-z][a-z0-9]*                  { ADJ; yylval.sval = String(yytext); return ID; }
{digits}                        { ADJ; yylval.ival = atoi(yytext); return NUM; }
({digits}\.{digits}?)|({digits}?\.{digits}) {
                                  ADJ; yylval.fval = atof(yytext); return REAL; }
(\-\-[a-z]*\n)|([\n\t]|" ")+    { ADJ; }
.                               { ADJ; EM_error("illegal character"); }


Start States

• Regular expressions may be more complicated than automata
  – C comments

• Solutions
  – Conversion of automata into regular expressions
  – Start States

%start s1 s2
%%
<INITIAL>r1     { action0; BEGIN s1; }
<s1>r1          { action1; BEGIN s2; }
<s2>r2          { action2; BEGIN INITIAL; }


Realistic Example

%start Comment
%%
<INITIAL>"/*"   { BEGIN Comment; }
<INITIAL>r1     { Usual actions; }
<INITIAL>r2     { Usual actions; }
...
<INITIAL>rk     { Usual actions; }
<Comment>"*/"   { BEGIN INITIAL; }
<Comment>.|\n   ;
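What the start states buy can be seen in a hand-written equivalent. The sketch below is an assumption, not code from the slides: `strip_comments` is a hypothetical function that keeps an explicit state variable playing the role of the flex start state, and copying a character to the output stands in for the "usual actions". It deliberately ignores string literals.

```c
#include <assert.h>
#include <stddef.h>

enum start_state { INITIAL, COMMENT };

/* Copies src to dst without C comments. dst must have room for
   strlen(src) + 1 bytes. Returns the state at the end of the input;
   COMMENT means that a comment was left unterminated. */
enum start_state strip_comments(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    enum start_state state = INITIAL;
    size_t i = 0, o = 0;
    while (src[i] != '\0') {
        if (state == INITIAL && src[i] == '/' && src[i + 1] == '*') {
            // <INITIAL>"/*"  { BEGIN Comment; }
            state = COMMENT;
            i += 2;
        } else if (state == COMMENT && src[i] == '*' && src[i + 1] == '/') {
            // <Comment>"*/"  { BEGIN INITIAL; }
            state = INITIAL;
            i += 2;
        } else if (state == COMMENT) {
            // <Comment>.|\n  ;
            i++;
        } else {
            // <INITIAL>r1 ... rk: stands in for the usual actions
            dst[o++] = src[i++];
        }
    }
    dst[o] = '\0';
    return state;
}
```

A caller can treat a returned `COMMENT` state as an "unterminated comment" error, which a flex scanner would typically report from an end-of-file action.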


Summary

• For most programming languages lexical analyzers can be easily constructed

• Exceptions:
  – Fortran
  – PL/1

• Flex is a useful tool beyond compilers