UNIT - III INTRODUCTION TO COMPILERS
Phase structure of Compiler and entire compilation process.
Lexical Analyzer: The Role of the Lexical Analyzer, InputBuffering. Specification of Tokens, Recognition of
Tokens, Design of Lexical Analyzer using Uniform SymbolTable, Lexical Errors.
LEX: LEX Specification, Generation of Lexical Analyzer by LEX.
What is Compiler?
• A compiler is software that converts a program written in a high-level language (the source language) into a low-level language (object/target/machine language).
Symbol Table
• Symbol Table – It is a data structure built and maintained by the compiler that stores all the identifiers' names along with their types. It helps the compiler function smoothly by finding identifiers quickly.
Phases of Compiler
• The structure of a compiler has two parts:
1. Analysis phase (front end)
2. Synthesis phase (back end)
• The front end consists of the lexical analyzer, syntax analyzer, semantic analyzer and intermediate code generator; the remaining phases form the back end.
1.Lexical Analyzer
• Lexical Analyzer – It reads the program character by character, groups the character stream into lexemes, and emits a stream of tokens. Tokens are defined by regular expressions, which the lexical analyzer understands. It also removes white space and comments.
• Example: x := z + y
• 5 tokens: x, :=, z, +, y; after this stage the token stream is id1 assign id2 binop id3
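The example above can be sketched in C. This is a minimal hand-written scanner for the single statement x := z + y; the token names (T_ID, T_ASSIGN, T_BINOP) and helper functions are illustrative, not part of any real compiler:

```c
#include <ctype.h>
#include <string.h>

/* Token classes for the tiny statement x := z + y (illustrative names). */
enum { T_ID, T_ASSIGN, T_BINOP, T_EOF };

/* Read one token starting at *p, copy its lexeme into buf,
   advance *p past it, and return the token class. */
static int next_token(const char **p, char *buf) {
    while (**p == ' ')
        (*p)++;                          /* skip blanks */
    if (**p == '\0')
        return T_EOF;
    if (isalpha((unsigned char)**p)) {   /* identifier: letter (letter|digit)* */
        int i = 0;
        while (isalnum((unsigned char)**p))
            buf[i++] = *(*p)++;
        buf[i] = '\0';
        return T_ID;
    }
    if (**p == ':' && (*p)[1] == '=') {  /* assignment operator */
        strcpy(buf, ":=");
        *p += 2;
        return T_ASSIGN;
    }
    buf[0] = *(*p)++;                    /* single-character binary operator */
    buf[1] = '\0';
    return T_BINOP;
}

/* Count the tokens in a statement. */
int count_tokens(const char *src) {
    char buf[32];
    int n = 0;
    while (next_token(&src, buf) != T_EOF)
        n++;
    return n;
}
```

Running next_token repeatedly over "x := z + y" yields the five tokens id(x), assign, id(z), binop(+), id(y).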
Tokens, Patterns, and Lexemes

Token     Sample Lexemes           Informal Description of Pattern
const     const                    const
if        if                       if
relation  <, <=, =, <>, >, >=      < or <= or = or <> or >= or >
id        pi, count, D2            letter followed by letters and digits
num       3.1416, 0, 6.02E23       any numeric constant
literal   "core dumped"            any characters between " and " except "
The lexical analyzer classifies each lexeme by the pattern it matches. The actual values are critical; this information is:
1. Stored in the symbol table
2. Returned to the parser
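As a sketch, a symbol table can be as simple as a linear array of (name, type) entries; the names struct sym and sym_lookup below are hypothetical, and real compilers use hashing and record much more per entry:

```c
#include <string.h>

/* One symbol-table entry: the identifier's name and type.  Field names
   are illustrative; a real compiler also records scope, offset, etc. */
struct sym {
    char name[32];
    char type[16];
};

static struct sym table[64];
static int nsyms = 0;

/* Look the identifier up, inserting it on first sight, and return its
   index -- this index (e.g. id1, id2) is what gets handed to the parser
   in place of the raw lexeme. */
int sym_lookup(const char *name, const char *type) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;                 /* already present */
    strcpy(table[nsyms].name, name);  /* new entry */
    strcpy(table[nsyms].type, type);
    return nsyms++;
}
```

Looking up the same identifier twice returns the same index, which is how repeated uses of a variable map to the same id entry.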
2.Syntax Analyzer
• It is sometimes called the parser. It constructs the parse tree.
• It takes the tokens one by one and uses a context-free grammar to construct the parse tree.
• Why grammar? The rules of a programming language can be entirely represented by a few productions. Using these productions we can represent what the program actually is, and check whether the input is in the desired format.
• Syntax errors can be detected at this level if the input is not in accordance with the grammar.
3.Semantic Analyzer
• Semantic Analyzer – It verifies the parse tree, checking whether it is meaningful. It then produces a verified parse tree.
• Semantic analysis deals with type checking and constraint checking with the help of rules.
4.Intermediate Code Generator
• It acts as a bridge between the analysis phase and the synthesis phase of the compilation process.
• It generates intermediate code, a form that can be readily translated into target machine code. There are many popular intermediate representations.
• Example – three-address code (TAC), quadruples, triples, postfix, etc.
• Intermediate code is converted to machine language by the last two phases, which are platform dependent. Up to the intermediate code, compilation is the same for every compiler; after that, it depends on the platform. To build a new compiler we therefore do not need to start from scratch: we can take the intermediate code from an existing compiler and build just the last two parts.
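As a hand-worked illustration (assuming the conventional temporary names t1 and t2), the three-address code for the statement a = b + c * d, where each instruction has at most one operator on its right-hand side, is:

```
t1 = c * d
t2 = b + t1
a  = t2
```

The multiplication is emitted first because * binds more tightly than +.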
5.Code Optimizer
• Code Optimizer – It transforms the code so that it consumes fewer resources and runs faster.
• The meaning of the code being transformed is not altered.
• Optimization can be categorized into two types: machine dependent and machine independent.
• Optimization techniques:
1. Removing redundant identifiers
2. Removing unreachable sections of code
3. Identifying common subexpressions
4. Unfolding (unrolling) loops
5. Eliminating unused procedures
6.Target Code Generator
• Target Code Generator – Its main purpose is to produce code that the machine can understand.
• The output is dependent on the type of assembler.
• This is the final stage of compilation.
1.Lexical Analysis
• Lexical Analysis is the first phase of the compiler, also known as scanning. It converts the input program into a sequence of tokens.
• Lexical analysis can be implemented with a deterministic finite automaton (DFA).
• What is a token? A lexical token is a sequence of characters that can be treated as a single logical entity.
• Examples of tokens: keywords, operators, constants, identifiers, special symbols.
• Keywords: e.g. for, while, if
• Identifiers: e.g. variable names, function names
• Operators: e.g. '+', '++', '-'
• Separators: e.g. ',', ';'
Tokens, Patterns, Lexemes
• Pattern: A set of strings in the input for which the same token is produced as output. This set of string is described by a rule called a pattern associated with the token.– e.g., id => “letter followed by letters and digits”
• Lexeme: a sequence of characters in the source program that is matched by the pattern for a token
• Example: int a;
First string: int -- pattern: int, lexeme: int, token: keyword
Second string: a -- pattern: [a-zA-Z][a-zA-Z0-9]*, lexeme: a, token: identifier
It is the first phase of the compiler.
It reads the input characters and produces as output a sequence of tokens that the parser uses for syntax analysis.
It strips out comments and white space (blank, tab and newline characters) from the source program.
It also correlates error messages from the compiler with the source program (because it keeps track of line numbers).
The role of the lexical analyzer
Interaction Of The Lexical Analyzer With The Parser
[Figure: the parser repeatedly requests "get next token"; the lexical analyzer reads the source program and returns a (token, tokenval) pair; both components consult the symbol table, and both can report errors.]
Recognition of Tokens
Data structures used in the lexical analyzer:
• Terminal Table (TRM)
• Identifier Table (IDN)
• Uniform Symbol Table
• Literal Table
Lexical Errors
• A lexical error is a sequence of characters that does not match the pattern of any token. Lexical phase errors are detected during the lexical analysis phase of compilation.
• A lexical phase error can be:
1. A spelling error.
2. Exceeding the length limit of an identifier or numeric constant.
3. The appearance of illegal characters.
4. Removal of a character that should be present.
5. Replacement of a character with an incorrect character.
6. Transposition of two characters.
Example:
void main() {
    int x = 10, y = 20;
    char *a;
    a = &x;
    x = 1xab;
}
In this code, 1xab is neither a number nor an identifier, so this code will produce a lexical error.
Lexical Errors
A lexical error is any input that can be rejected by the lexer. This generally results from token recognition falling off the end of the rules you've defined. For example (in no particular syntax):

[0-9]+        ===> NUMBER token
[a-zA-Z]+     ===> LETTERS token
anything else ===> error!

Think of the lexer as a finite state machine that accepts valid input strings; any input that cannot reach an accepting state generates an error.
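That finite-state view can be sketched in C. The classify function and TOK_* constants below are illustrative (a generated lexer works on a character stream, not a whole string), but they implement exactly the two rules above:

```c
#include <ctype.h>

enum { TOK_NUMBER, TOK_LETTERS, TOK_ERROR };

/* Run the two rules above as a tiny finite state machine over a whole
   string: all digits -> NUMBER, all letters -> LETTERS, and anything
   that falls off the end of the rules -> error. */
int classify(const char *s) {
    if (*s == '\0')
        return TOK_ERROR;            /* empty input matches no rule */
    if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s))
            s++;
        return *s == '\0' ? TOK_NUMBER : TOK_ERROR;
    }
    if (isalpha((unsigned char)*s)) {
        while (isalpha((unsigned char)*s))
            s++;
        return *s == '\0' ? TOK_LETTERS : TOK_ERROR;
    }
    return TOK_ERROR;
}
```

Note that this rejects 1xab, the string from the earlier example: it starts in the digit state but then meets a letter, so it never reaches an accepting state.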
Input buffering
The lexical analyzer scans the input string from left to right, one character at a time.
The input characters are read from secondary storage, but reading one character at a time is costly.
Hence a buffering technique is used: a block of data is first read into a buffer and then scanned by the lexical analyzer.
Two pointers are used:
1. begin_ptr (bp)
2. forward_ptr (fp)
forward_ptr moves ahead to search for the end of the lexeme; when a blank space is encountered, it marks the end of the lexeme.
The begin pointer points to the start of the current lexeme (the lexeme pointer).
eof means end of buffer: the input is at an end.
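A minimal sketch of the two-pointer scheme in C, assuming lexemes are separated by blanks; next_lexeme is a hypothetical helper, not code from any actual lexical analyzer:

```c
#include <string.h>

/* Extract the next lexeme from buf into out, using two pointers:
   *bp (begin_ptr) marks the start of the current lexeme, and fp
   (forward_ptr) scans ahead until a blank ends it.  Returns 1 if a
   lexeme was found, 0 at the end of the buffer. */
int next_lexeme(const char *buf, int *bp, char *out) {
    int fp = *bp;
    while (buf[fp] != '\0' && buf[fp] != ' ')
        fp++;                           /* forward_ptr searches for end of lexeme */
    memcpy(out, buf + *bp, (size_t)(fp - *bp));
    out[fp - *bp] = '\0';               /* lexeme = buf[*bp .. fp-1] */
    while (buf[fp] == ' ')
        fp++;                           /* skip the blank delimiter */
    *bp = fp;                           /* begin_ptr moves up to the next lexeme */
    return out[0] != '\0';
}
```

Calling next_lexeme repeatedly on "int x = 10" yields the lexemes int, x, =, 10 and then reports end of buffer.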
Introduction to Lex
• The main job of a lexical analyzer (scanner) is to break up an input stream into more usable elements (tokens). For example, the statement
a = b + c * d;
is tokenized as
ID ASSIGN ID PLUS ID MULT ID SEMI
• Lex is a utility to help you rapidly generate your scanners
What is Lex?
• Lexical analyzers tokenize input streams
• Tokens are the terminals of a language
– English
• words, punctuation marks, …
– Programming language
• Identifiers, operators, keywords, …
• Regular expressions define terminals/tokens
Lex – Lexical Analyzer
An Overview of Lex
Lex source program --> Lex --> lex.yy.c
lex.yy.c --> C compiler --> a.out
input --> a.out --> tokens
Lex Source
• Lex source is separated into three sections by %% delimiters
• The general format of Lex source is:

{definitions}       (optional)
%%
{transition rules}  (required)
%%
{user subroutines}  (optional)

• The absolute minimum Lex program is thus a lone
%%
• Lex source is a table of
– regular expressions and
– corresponding program fragments
Lex Source Program
digit [0-9]
letter [a-zA-Z]
%%
{letter}({letter}|{digit})* printf("id: %s\n", yytext);
\n printf("new line\n");
%%
main() {
yylex();
}
Regular Expressions
• A regular expression matches a set of strings
• Regular expression
– Operators
– Character classes
– Arbitrary character
– Optional expressions
– Alternation and grouping
– Context sensitivity
– Repetitions and definitions
Lex Regular Expressions (Extended Regular Expressions)
• [abc] matches a single character, which may be a, b, or c
• Inside a character class, every operator loses its special meaning except \, -, and ^
• e.g.
[ab]      => a or b
[a-z]     => a or b or c or ... or z
[-+0-9]   => all the digits and the two signs
[^a-zA-Z] => any character which is not a letter
Character Classes []
Pattern Matching Primitives

Metacharacter  Matches
.              any character except newline
\n             newline
*              zero or more copies of the preceding expression
+              one or more copies of the preceding expression
?              zero or one copy of the preceding expression
^              beginning of line / complement (inside [ ])
$              end of line
a|b            a or b
(ab)+          one or more copies of ab (grouping)
[ab]           a or b
a{3}           3 instances of a
"a+b"          literal "a+b" (C escapes still work)
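Most of these operators can be tried out directly with POSIX regcomp/regexec, which supports the same extended-regular-expression notation. The matches helper below is illustrative only; Lex itself additionally applies longest-match and rule-order tie-breaking:

```c
#include <regex.h>
#include <stdio.h>

/* Return 1 if text as a whole matches the extended regular expression
   pattern, 0 otherwise.  POSIX regcomp/regexec is not Lex, but the
   metacharacters in the table above behave the same way. */
int matches(const char *pattern, const char *text) {
    regex_t re;
    char anchored[128];
    int ok;
    /* Anchor the pattern so the whole string must match, as a Lex rule
       matching a complete lexeme would. */
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (regcomp(&re, anchored, REG_EXTENDED) != 0)
        return 0;                     /* bad pattern: treat as no match */
    ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

For example, matches("[0-9]+", "3141"), matches("(ab)+", "ababab") and matches("a{3}", "aaa") all succeed, while matches("a{3}", "aa") fails.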
• regexp <one or more blanks> action (C code);
• regexp <one or more blanks> { actions (C code) }
• A null action ; ignores the matched input (no action is taken):
[ \t\n]   ;
Transition Rules
• | indicates that the action for this rule is the action for the next rule; the rule [ \t\n] ; can therefore also be written as:
" "  |
"\t" |
"\n" ;
Transition Rules (cont’d)
• yytext -- a string containing the lexeme
• yyleng -- the length of the lexeme
• yyin -- the input stream pointer (the default input of the default main() is stdin)
• yyout -- the output stream pointer (the default output of the default main() is stdout)
• e.g. % ./a.out < inputfile > outfile
• E.g. [a-z]+ printf("%s", yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
Lex Predefined Variables
• yylex() -- the default main() contains a call of yylex()
• yymore() -- appends the next matched string to the current yytext
• yyless(n) -- retains the first n characters in yytext and returns the rest to the input
• yywrap() -- is called whenever Lex reaches an end-of-file
– The default yywrap() always returns 1
Lex Library Routines
Review of Lex Predefined Variables
Name Function
char *yytext pointer to matched string
int yyleng length of matched string
FILE *yyin input stream pointer
FILE *yyout output stream pointer
int yylex(void) call to invoke lexer, returns token
char* yymore(void) append the next matched string to yytext
int yyless(int n) retain the first n characters in yytext
int yywrap(void) wrapup, return 1 if done, 0 if not done
ECHO write matched string
REJECT go to the next alternative rule
INITIAL initial start condition
BEGIN condition switch start condition
• You can use your Lex routines in the same ways you use routines in other programming languages.
User Subroutines Section
%{
void foo();
%}
letter [a-zA-Z]
%%
{letter}+ foo();
%%
…
void foo() {
…
}
• The section where main() is placed
User Subroutines Section (cont’d)
%{
int counter = 0;
%}
letter [a-zA-Z]
%%
{letter}+ {printf("a word\n"); counter++;}
%%
main() {
yylex();
printf("There are total %d words\n", counter);
}
• To run Lex on a source file, type
lex scanner.l
• It produces a file named lex.yy.c, which is a C program for the lexical analyzer.
• To compile lex.yy.c, type
cc lex.yy.c -ll
• To run the lexical analyzer program, type
./a.out < inputfile
Usage
• AT&T -- lex: http://www.combo.org/lex_yacc_page/lex.html
• GNU -- flex: http://www.gnu.org/manual/flex-2.5.4/flex.html
• a Win32 version of flex: http://www.monmouth.com/~wstreett/lex-yacc/lex-yacc.html
• Lex on different machines is not created equal.
Versions of Lex