UNIT - III INTRODUCTION TO COMPILERS
Phase structure of Compiler and entire compilation process.
Lexical Analyzer: The Role of the Lexical Analyzer, InputBuffering. Specification of Tokens, Recognition of
Tokens, Design of Lexical Analyzer using Uniform SymbolTable, Lexical Errors.
LEX: LEX Specification, Generation of Lexical Analyzer by LEX.
What is Compiler?
• A compiler is software that converts a program written in a high-level language (the source language) into a low-level language (object/target/machine language).
Symbol Table
• Symbol Table – It is a data structure built and maintained by the compiler that stores all the identifiers' names along with their types. It helps the compiler function smoothly by finding identifiers quickly.
Phases of Compiler
• The structure of a compiler has two parts:
1. Analysis phase (front end)
2. Synthesis phase (back end)
• The front end consists of the lexical analyzer, syntax analyzer, semantic analyzer and intermediate code generator; the remaining phases form the back end.
1.Lexical Analyzer
• Lexical Analyzer – It reads the program character by character, groups the character stream into lexemes, and emits a stream of tokens. Tokens are defined by regular expressions, which the lexical analyzer understands. It also removes white space and comments.
• Example: x := z + y
• 5 tokens: x, :=, z, +, y; after this stage the token stream is id1 assign id2 binop id3
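The example above can be sketched in C. This is a minimal hand-written scanner for the single statement x := z + y; the token names (T_ID, T_ASSIGN, T_BINOP) and helper functions are illustrative, not part of any real compiler:

```c
#include <ctype.h>
#include <string.h>

/* Token classes for the tiny statement x := z + y (illustrative names). */
enum { T_ID, T_ASSIGN, T_BINOP, T_EOF };

/* Read one token starting at *p, copy its lexeme into buf,
   advance *p past it, and return the token class. */
static int next_token(const char **p, char *buf) {
    while (**p == ' ')
        (*p)++;                          /* skip blanks */
    if (**p == '\0')
        return T_EOF;
    if (isalpha((unsigned char)**p)) {   /* identifier: letter (letter|digit)* */
        int i = 0;
        while (isalnum((unsigned char)**p))
            buf[i++] = *(*p)++;
        buf[i] = '\0';
        return T_ID;
    }
    if (**p == ':' && (*p)[1] == '=') {  /* assignment operator */
        strcpy(buf, ":=");
        *p += 2;
        return T_ASSIGN;
    }
    buf[0] = *(*p)++;                    /* single-character binary operator */
    buf[1] = '\0';
    return T_BINOP;
}

/* Count the tokens in a statement. */
int count_tokens(const char *src) {
    char buf[32];
    int n = 0;
    while (next_token(&src, buf) != T_EOF)
        n++;
    return n;
}
```

Running next_token repeatedly over "x := z + y" yields the five tokens id(x), assign, id(z), binop(+), id(y).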
Tokens, Patterns, and Lexemes

Token     Sample Lexemes           Informal Description of Pattern
const     const                    const
if        if                       if
relation  <, <=, =, <>, >, >=      < or <= or = or <> or >= or >
id        pi, count, D2            letter followed by letters and digits
num       3.1416, 0, 6.02E23       any numeric constant
literal   "core dumped"            any characters between " and " except "
The lexical analyzer classifies each lexeme by the pattern it matches. The actual values are critical; this information is:
1. Stored in the symbol table
2. Returned to the parser
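As a sketch, a symbol table can be as simple as a linear array of (name, type) entries; the names struct sym and sym_lookup below are hypothetical, and real compilers use hashing and record much more per entry:

```c
#include <string.h>

/* One symbol-table entry: the identifier's name and type.  Field names
   are illustrative; a real compiler also records scope, offset, etc. */
struct sym {
    char name[32];
    char type[16];
};

static struct sym table[64];
static int nsyms = 0;

/* Look the identifier up, inserting it on first sight, and return its
   index -- this index (e.g. id1, id2) is what gets handed to the parser
   in place of the raw lexeme. */
int sym_lookup(const char *name, const char *type) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;                 /* already present */
    strcpy(table[nsyms].name, name);  /* new entry */
    strcpy(table[nsyms].type, type);
    return nsyms++;
}
```

Looking up the same identifier twice returns the same index, which is how repeated uses of a variable map to the same id entry.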
2.Syntax Analyzer
• It is sometimes called the parser. It constructs the parse tree.
• It takes the tokens one by one and uses a context-free grammar to construct the parse tree.
• Why grammar? The rules of a programming language can be entirely represented by a few productions. Using these productions we can represent what the program actually is, and check whether the input is in the desired format.
• Syntax errors can be detected at this level if the input is not in accordance with the grammar.
3.Semantic Analyzer
• Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful. It then produces a verified parse tree.
• Semantic analysis deals with type checking and constraint checking with the help of rules.
4.Intermediate Code Generator
• It acts as a bridge between the analysis phase and the synthesis phase of the compilation process.
• It generates intermediate code, a form that can be readily translated into target machine code. There are many popular intermediate representations.
• Example – three-address code (TAC), quadruples, triples, postfix, etc.
• Intermediate code is converted to machine language by the last two phases, which are platform dependent. Up to the intermediate code, compilation is the same for every compiler; after that, it depends on the platform. To build a new compiler we therefore do not need to start from scratch: we can take the intermediate code from an existing compiler and build just the last two parts.
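As a hand-worked illustration (assuming the conventional temporary names t1 and t2), the three-address code for the statement a = b + c * d, where each instruction has at most one operator on its right-hand side, is:

```
t1 = c * d
t2 = b + t1
a  = t2
```

The multiplication is emitted first because * binds more tightly than +.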
5.Code Optimizer
• Code Optimizer – It transforms the code so that it consumes fewer resources and runs faster.
• The meaning of the code being transformed is not altered.
• Optimization can be categorized into two types: machine dependent and machine independent.
• Optimization techniques:
1. Removing redundant identifiers
2. Removing unreachable sections of code
3. Identifying common subexpressions
4. Unfolding (unrolling) loops
5. Eliminating unused procedures
6.Target Code Generator
• Target Code Generator – Its main purpose is to produce code that the machine can understand.
• The output is dependent on the type of assembler.
• This is the final stage of compilation.
1.Lexical Analysis
• Lexical Analysis is the first phase of the compiler, also known as scanning. It converts the input program into a sequence of tokens.
• Lexical analysis can be implemented with a deterministic finite automaton (DFA).
• What is a token? A lexical token is a sequence of characters that can be treated as a single logical entity.
• Examples of tokens: keywords, operators, constants, identifiers, special symbols.
• Keywords: e.g. for, while, if
• Identifiers: e.g. variable names, function names
• Operators: e.g. '+', '++', '-'
• Separators: e.g. ',', ';'
Tokens, Patterns, Lexemes
• Pattern: A set of strings in the input for which the same token is produced as output. This set of string is described by a rule called a pattern associated with the token.– e.g., id => “letter followed by letters and digits”
• Lexeme: a sequence of characters in the source program that is matched by the pattern for a token
• Example: int a;
First string: int -- pattern: int, lexeme: int, token: keyword
Second string: a -- pattern: [a-zA-Z][a-zA-Z0-9]*, lexeme: a, token: identifier
It is the first phase of the compiler.
It reads the input characters and produces as output a sequence of tokens that the parser uses for syntax analysis.
It strips out comments and white space (blank, tab and newline characters) from the source program.
It also correlates error messages from the compiler with the source program (because it keeps track of line numbers).
The role of the lexical analyzer
Interaction Of The Lexical Analyzer With The Parser
[Figure: the parser repeatedly requests "get next token"; the lexical analyzer reads the source program and returns a (token, tokenval) pair; both components consult the symbol table, and both can report errors.]
Recognition of Tokens
Data structures used in the lexical analyzer:
• Terminal Table (TRM)
• Identifier Table (IDN)
• Uniform Symbol Table
• Literal Table
Lexical Errors
• A lexical error is a sequence of characters that does not match the pattern of any token. Lexical phase errors are detected during the lexical analysis phase of compilation.
• A lexical phase error can be:
1. A spelling error.
2. Exceeding the length limit of an identifier or numeric constant.
3. The appearance of illegal characters.
4. Removal of a character that should be present.
5. Replacement of a character with an incorrect character.
6. Transposition of two characters.
Example:
void main() {
    int x = 10, y = 20;
    char *a;
    a = &x;
    x = 1xab;
}
In this code, 1xab is neither a number nor an identifier, so this code will produce a lexical error.
Lexical Errors
A lexical error is any input that can be rejected by the lexer. This generally results from token recognition falling off the end of the rules you've defined. For example (in no particular syntax):

[0-9]+        ===> NUMBER token
[a-zA-Z]+     ===> LETTERS token
anything else ===> error!

Think of the lexer as a finite state machine that accepts valid input strings; any input that cannot reach an accepting state generates an error.
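That finite-state view can be sketched in C. The classify function and TOK_* constants below are illustrative (a generated lexer works on a character stream, not a whole string), but they implement exactly the two rules above:

```c
#include <ctype.h>

enum { TOK_NUMBER, TOK_LETTERS, TOK_ERROR };

/* Run the two rules above as a tiny finite state machine over a whole
   string: all digits -> NUMBER, all letters -> LETTERS, and anything
   that falls off the end of the rules -> error. */
int classify(const char *s) {
    if (*s == '\0')
        return TOK_ERROR;            /* empty input matches no rule */
    if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s))
            s++;
        return *s == '\0' ? TOK_NUMBER : TOK_ERROR;
    }
    if (isalpha((unsigned char)*s)) {
        while (isalpha((unsigned char)*s))
            s++;
        return *s == '\0' ? TOK_LETTERS : TOK_ERROR;
    }
    return TOK_ERROR;
}
```

Note that this rejects 1xab, the string from the earlier example: it starts in the digit state but then meets a letter, so it never reaches an accepting state.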
Input buffering
The lexical analyzer scans the input string from left to right, one character at a time.
The input characters are read from secondary storage, but reading one character at a time is costly.
Hence a buffering technique is used: a block of data is first read into a buffer and then scanned by the lexical analyzer.
Two pointers are used:
1. begin_ptr (bp)
2. forward_ptr (fp)
forward_ptr moves ahead to search for the end of the lexeme; when a blank space is encountered, it marks the end of the lexeme.
The begin pointer points to the start of the current lexeme (the lexeme pointer).
eof means end of buffer: the input is at an end.
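A minimal sketch of the two-pointer scheme in C, assuming lexemes are separated by blanks; next_lexeme is a hypothetical helper, not code from any actual lexical analyzer:

```c
#include <string.h>

/* Extract the next lexeme from buf into out, using two pointers:
   *bp (begin_ptr) marks the start of the current lexeme, and fp
   (forward_ptr) scans ahead until a blank ends it.  Returns 1 if a
   lexeme was found, 0 at the end of the buffer. */
int next_lexeme(const char *buf, int *bp, char *out) {
    int fp = *bp;
    while (buf[fp] != '\0' && buf[fp] != ' ')
        fp++;                           /* forward_ptr searches for end of lexeme */
    memcpy(out, buf + *bp, (size_t)(fp - *bp));
    out[fp - *bp] = '\0';               /* lexeme = buf[*bp .. fp-1] */
    while (buf[fp] == ' ')
        fp++;                           /* skip the blank delimiter */
    *bp = fp;                           /* begin_ptr moves up to the next lexeme */
    return out[0] != '\0';
}
```

Calling next_lexeme repeatedly on "int x = 10" yields the lexemes int, x, =, 10 and then reports end of buffer.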
Introduction to Lex
• The main job of a lexical analyzer (scanner) is to break up an input stream into more usable elements (tokens). For example, the statement
a = b + c * d;
is tokenized as
ID ASSIGN ID PLUS ID MULT ID SEMI
• Lex is a utility to help you rapidly generate your scanners
What is Lex?
• Lexical analyzers tokenize input streams
• Tokens are the terminals of a language
– English
• words, punctuation marks, …
– Programming language
• Identifiers, operators, keywords, …
• Regular expressions define terminals/tokens
Lex – Lexical Analyzer
An Overview of Lex
Lex source program --> Lex --> lex.yy.c
lex.yy.c --> C compiler --> a.out
input --> a.out --> tokens
Lex Source
• Lex source is separated into three sections by %% delimiters
• The general format of Lex source is:

{definitions}       (optional)
%%
{transition rules}  (required)
%%
{user subroutines}  (optional)

• The absolute minimum Lex program is thus a lone
%%
• Lex source is a table of
– regular expressions and
– corresponding program fragments
Lex Source Program
digit [0-9]
letter [a-zA-Z]
%%
{letter}({letter}|{digit})* printf("id: %s\n", yytext);
\n printf("new line\n");
%%
main() {
yylex();
}
Regular Expressions
• A regular expression matches a set of strings
• Regular expression
– Operators
– Character classes
– Arbitrary character
– Optional expressions
– Alternation and grouping
– Context sensitivity
– Repetitions and definitions
Lex Regular Expressions (Extended Regular Expressions)
• [abc] matches a single character, which may be a, b, or c
• Inside a character class, every operator loses its special meaning except \, -, and ^
• e.g.
[ab]      => a or b
[a-z]     => a or b or c or ... or z
[-+0-9]   => all the digits and the two signs
[^a-zA-Z] => any character which is not a letter
Character Classes []
Pattern Matching Primitives

Metacharacter  Matches
.              any character except newline
\n             newline
*              zero or more copies of the preceding expression
+              one or more copies of the preceding expression
?              zero or one copy of the preceding expression
^              beginning of line / complement (inside [ ])
$              end of line
a|b            a or b
(ab)+          one or more copies of ab (grouping)
[ab]           a or b
a{3}           3 instances of a
"a+b"          literal "a+b" (C escapes still work)
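Most of these operators can be tried out directly with POSIX regcomp/regexec, which supports the same extended-regular-expression notation. The matches helper below is illustrative only; Lex itself additionally applies longest-match and rule-order tie-breaking:

```c
#include <regex.h>
#include <stdio.h>

/* Return 1 if text as a whole matches the extended regular expression
   pattern, 0 otherwise.  POSIX regcomp/regexec is not Lex, but the
   metacharacters in the table above behave the same way. */
int matches(const char *pattern, const char *text) {
    regex_t re;
    char anchored[128];
    int ok;
    /* Anchor the pattern so the whole string must match, as a Lex rule
       matching a complete lexeme would. */
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (regcomp(&re, anchored, REG_EXTENDED) != 0)
        return 0;                     /* bad pattern: treat as no match */
    ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

For example, matches("[0-9]+", "3141"), matches("(ab)+", "ababab") and matches("a{3}", "aaa") all succeed, while matches("a{3}", "aa") fails.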
• regexp <one or more blanks> action (C code);
• regexp <one or more blanks> { actions (C code) }
• A null action ; ignores the matched input (no action is taken):
[ \t\n]   ;
Transition Rules
• | indicates that the action for this rule is the action for the next rule; the rule [ \t\n] ; can therefore also be written as:
" "  |
"\t" |
"\n" ;
Transition Rules (cont’d)
• yytext -- a string containing the lexeme
• yyleng -- the length of the lexeme
• yyin -- the input stream pointer (the default input of the default main() is stdin)
• yyout -- the output stream pointer (the default output of the default main() is stdout)
• e.g. % ./a.out < inputfile > outfile
• E.g. [a-z]+ printf("%s", yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
Lex Predefined Variables
• yylex() -- the default main() contains a call of yylex()
• yymore() -- appends the next matched string to the current yytext
• yyless(n) -- retains the first n characters in yytext and returns the rest to the input
• yywrap() -- is called whenever Lex reaches an end-of-file
– The default yywrap() always returns 1
Lex Library Routines
Review of Lex Predefined Variables
Name Function
char *yytext pointer to matched string
int yyleng length of matched string
FILE *yyin input stream pointer
FILE *yyout output stream pointer
int yylex(void) call to invoke lexer, returns token
char* yymore(void) append the next matched string to yytext
int yyless(int n) retain the first n characters in yytext
int yywrap(void) wrapup, return 1 if done, 0 if not done
ECHO write matched string
REJECT go to the next alternative rule
INITIAL initial start condition
BEGIN condition switch start condition
• You can use your Lex routines in the same ways you use routines in other programming languages.
User Subroutines Section
%{
void foo();
%}
letter [a-zA-Z]
%%
{letter}+ foo();
%%
…
void foo() {
…
}
• The section where main() is placed
User Subroutines Section (cont’d)
%{
int counter = 0;
%}
letter [a-zA-Z]
%%
{letter}+ {printf("a word\n"); counter++;}
%%
main() {
yylex();
printf("There are total %d words\n", counter);
}
• To run Lex on a source file, type
lex scanner.l
• It produces a file named lex.yy.c, which is a C program for the lexical analyzer.
• To compile lex.yy.c, type
cc lex.yy.c -ll
• To run the lexical analyzer program, type
./a.out < inputfile
Usage
• AT&T -- lex: http://www.combo.org/lex_yacc_page/lex.html
• GNU -- flex: http://www.gnu.org/manual/flex-2.5.4/flex.html
• a Win32 version of flex: http://www.monmouth.com/~wstreett/lex-yacc/lex-yacc.html
• Lex on different machines is not created equal.
Versions of Lex