CSC324 Formal Language Theory - University of …afsaneh/csc324w13/fltI.pdf · Example IDEs:...

Post on 27-Sep-2018

220 views 0 download

Transcript of CSC324 Formal Language Theory - University of …afsaneh/csc324w13/fltI.pdf · Example IDEs:...

CSC324 – Formal Language Theory

Afsaneh Fazly1

January 9, 2013

1Thanks to A.Tafliovich, S.McIlraith, E.Joanis, S.Stevenson, G.Penn, D.Horton

1

The only language men ever speak perfectly is the onethey learn in babyhood, when no one can teach themanything.

— Maria Montessori

2

What is a Programming Language (PL)?

We tend to think of a compiler or an IDE as a programminglanguage:

• Example compilers: javac, gcc

• Example IDEs: Eclipse, PyDev, Java Workshop

But the language is not these things:

• A programming language is an abstract entity with somespecficiation.

• Compiler and IDE are pieces of software that implement thelanguage.

3

Language Specification

Natural language:

• Syntax: specification of the ways in which words are puttogether to form phrases and sentences.

• Semantics: specification of the relation among the meaningsof words and phrases and sentences.

Programming language:

• Formal Syntax

units are not words but tokens

• Formal Semantics

defining semantics is hard!

4

Language Specification

Natural language:

• Syntax: specification of the ways in which words are puttogether to form phrases and sentences.

• Semantics: specification of the relation among the meaningsof words and phrases and sentences.

Programming language:

• Formal Syntaxunits are not words but tokens

• Formal Semanticsdefining semantics is hard!

4

More on Syntax

Syntax of a language tells us:

• what is legal or acceptable.

• what the relationships or dependecies in a legal sentence are.

Which one is acceptable in English?

• “used kids clothing store”

4

• “clothing store used kids”

4

What is the structure (dependencies among words)?

• need a way to represent structure: parse tree.

5

More on Syntax

Syntax of a language tells us:

• what is legal or acceptable.

• what the relationships or dependecies in a legal sentence are.

Which one is acceptable in English?

• “used kids clothing store” 4

• “clothing store used kids”

4

What is the structure (dependencies among words)?

• need a way to represent structure: parse tree.

5

More on Syntax

Syntax of a language tells us:

• what is legal or acceptable.

• what the relationships or dependecies in a legal sentence are.

Which one is acceptable in English?

• “used kids clothing store” 4

• “clothing store used kids” 4

What is the structure (dependencies among words)?

• need a way to represent structure: parse tree.

5

More on Syntax

Syntax of a language tells us:

• what is legal or acceptable.

• what the relationships or dependecies in a legal sentence are.

Which one is acceptable in English?

• “used kids clothing store” 4

• “clothing store used kids” 4

What is the structure (dependencies among words)?

• need a way to represent structure: parse tree.

5

Stages of Translation Process

1. Lexical analysisconverts source code into sequence of tokens.

2. Syntactic analysisstructures tokens into initial parse tree.

3. Semantic analysisaugments parse tree with semantic information, and performssemantic checks such as type checking, object binding, etc.

4. Code generationperforms optimizations, and produces final machine code.

6

Specifying Syntax Informally

Example:

“Everything between /* and */ is a comment and should beignored.”

/* Do such and such, watching out for problems.

Store the result in y. */

x = 3;

y = x * 17.2;

When syntax is defined informally, incompatible dialects of thelanguage may evolve.

7

Specifying Syntax Formally

• The state of the art is to define PL syntax formally.

• There are a number of well-understood formalisms for doingso. We’ll talk about this in some detail.

8

Specifying Syntax Formally

• Lexical rules specify the form of the building blocks of thelanguage:

• what’s a token(keyword, identifier, literal, operator, punctuation; plus the listof keywords and the form of identifiers)

• how tokens are delimited• where white space can go• syntax of comments

• Syntactic rules specify how to put the building blockstogether:

• what are the acceptable combinations of keywords, identifiers,operators, etc. into larger units, such as mathematical/logicalexpressions and statements?

9

Specifying Syntax Formally

• Lexical rules specify the form of the building blocks of thelanguage:

• what’s a token(keyword, identifier, literal, operator, punctuation; plus the listof keywords and the form of identifiers)

• how tokens are delimited• where white space can go• syntax of comments

• Syntactic rules specify how to put the building blockstogether:

• what are the acceptable combinations of keywords, identifiers,operators, etc. into larger units, such as mathematical/logicalexpressions and statements?

⇒ A tool for specifying lexical/syntactic rules is Grammer

9

Today

• Grammar: a tool for specifying syntax of a language• Regular Grammar• Context-Free Grammar

• Derivations and Parse Trees: useful tools for implementationof a PL (compiler)

• Ambiguity in a PL

• Translation Process

10

Grammar

Informal idea of grammar: a bunch of rules. For example:

• Don’t end a sentence with a preposition.

• Subject and verb must agree in number.

A formal grammar is a different concept.

A language is a set of strings. A grammar generates a language,i.e., it specifies which strings are in the language.

There are many kinds of formal grammar.

11

Grammar

Informal idea of grammar: a bunch of rules. For example:

• Don’t end a sentence with a preposition.

• Subject and verb must agree in number.

A formal grammar is a different concept.

A language is a set of strings. A grammar generates a language,i.e., it specifies which strings are in the language.

There are many kinds of formal grammar.

11

Grammar

Informal idea of grammar: a bunch of rules. For example:

• Don’t end a sentence with a preposition.

• Subject and verb must agree in number.

A formal grammar is a different concept.

A language is a set of strings. A grammar generates a language,i.e., it specifies which strings are in the language.

There are many kinds of formal grammar.

11

Chomsky’s Hierarchy

[ named after linguist & political activist Noam Chomsky, whoresearched grammars for natural languages. ]

There are several categories of grammar, ordered from least tomost expressive:

• Regular Grammars

=⇒ generate Regular Languages

• Context-Free Grammars

=⇒ generate Context-Free Languages

• Context-Sensitive Grammars

• Unrestricted Grammars

12

Chomsky’s Hierarchy

[ named after linguist & political activist Noam Chomsky, whoresearched grammars for natural languages. ]

There are several categories of grammar, ordered from least tomost expressive:

• Regular Grammars =⇒ generate Regular Languages

• Context-Free Grammars =⇒ generate Context-Free Languages

• Context-Sensitive Grammars

• Unrestricted Grammars

12

Regular Expressions

Regular expressions and regular grammars are two equivalent waysfor describing regular languages.

Example Regular Expressions (REs):

• (0 + 1)∗

• 1∗(: + ; )∗

• (a + b)∗aa(a + b)∗ + ε

Notation:

• Kleene star: ∗ superscript denotes 0 or more repetitions.

• alternation: binary “+” denotes choice.It is sometimes denoted by |, i.e., (0|1)∗.

• grouping: “(” and “)” are used for grouping

• empty string: ε (epsilon) denotes the empty or ”null” string.

• empty language: ∅ denotes the language with no strings.

13

Regular Expressions: Formal Definition

Let Σ be a given finite alphabet. A string is a regular expression(RE) if and only if it can be derived from the primitive regularexpressions:• ∅• ε• any a, s.t. a ∈ Σ

by a finite number of applications of the following rules:

If r1 and r2 are REs, then so are:• r1 + r2, or r1|r2 (union, alternation)• r1r2, or r1 · r2 (concatenation)• r∗1 (Kleene closure)• (r1) (grouping)

14

Regular Languages

Each RE r (defined over an alphabet Σ) specifies a regularlanguage, denoted L(r).

Languages specified by primitive REs: L(∅), L(ε), ...

Languages specified by more complex REs: L(r1 + r2), L(r∗), ...

15

Regular Languages (cont’d)

Languages specified by primitive REs:

• L(∅) is

the empty language, i.e., the language that containsno strings, denoted by {}

• L(ε) is

the language {ε}

• for a ∈ Σ, L(a) is

the language {a}

16

Regular Languages (cont’d)

Languages specified by primitive REs:

• L(∅) is the empty language, i.e., the language that containsno strings, denoted by {}

• L(ε) is

the language {ε}

• for a ∈ Σ, L(a) is

the language {a}

16

Regular Languages (cont’d)

Languages specified by primitive REs:

• L(∅) is the empty language, i.e., the language that containsno strings, denoted by {}

• L(ε) is the language {ε}• for a ∈ Σ, L(a) is

the language {a}

16

Regular Languages (cont’d)

Languages specified by primitive REs:

• L(∅) is the empty language, i.e., the language that containsno strings, denoted by {}

• L(ε) is the language {ε}• for a ∈ Σ, L(a) is the language {a}

16

Regular Languages (cont’d)

Languages specified by more complex REs:

If r1 and r2 are regular expressions, then

• L(r1 + r2) = L(r1) ∪ L(r2),where S ∪ T = {s | s ∈ S or s ∈ T}

• L(r1r2) = L(r1)L(r2),where ST = {st | s ∈ S , t ∈ T}

• L((r1)) = L(r1)

• L(r∗1 ) = (L(r1))∗,where S∗ = S0 ∪ S1 ∪ S2 ∪ · · · ,and S0 = {ε},

S1 = S ,S j = set of all possible concatenations of j strings from S

17

Regular Languages: Examples

Let Σ = {a, b}, r1 = a, and r2 = b.

L(r1|r2) =?

L(r1.r2) =?

L((r1)) =?

L(r∗1 ) =?

L((r1|r2)∗) =?

18

Regular Languages: Summary

A Regular Language is a language that:

• is generated by a Regular Grammar.

• can be described / specified by a regular expression.

• is accepted by a finite-state automaton.

We won’t talk about automata in this course: you studied them inCSC236/... .

19

Regular Expressions: Examples

Give regular expressions for these languages:

1. All alphanumeric strings beginning with an upper-case letter.

2. All strings of a’s and b’s in which the third-last character is b.

3. All strings of 0’s and 1’s in which every pair of adjacent 0’sappears before any pairs of adjacent 1’s.

4. All binary numbers with exactly three 1’s.

20

Quiz

Could we use a regular grammar to describe a Natural Language(e.g., English)? This question was first asked and answered byChomsky (1956, 1957)

21

Regular Expressions: Limitaions

Regular expressions are not powerful enough to describe somelanguages.

Examples:

• The language consisting of all strings of one or more a’sfollowed by the same number of b’s.

• The language consisting of strings containing a’s, b’s, leftbrackets, and right brackets, such that the brackets match.

Question: How can we be sure there is no regular expression forthese languages?

Question: Exactly what things can and cannot be expressed witha regular expression?

22

Regular Expressions: Limitaions

Regular expressions are not powerful enough to describe somelanguages.

Examples:

• The language consisting of all strings of one or more a’sfollowed by the same number of b’s.

• The language consisting of strings containing a’s, b’s, leftbrackets, and right brackets, such that the brackets match.

Question: How can we be sure there is no regular expression forthese languages?

Question: Exactly what things can and cannot be expressed witha regular expression?

22

Context-Free Grammars (CFGs)

CFGs are more powerful than regular expressions.

A CFG has four parts:

• A set of terminals:the atomic symbols (tokens) of the language.

• A set of non-terminals:“variables” used in the grammar.

• A special non-terminal chosen as the starting non-terminal orstart symbol:represents the top-level construct of the language.

• A set of rules (or productions), each specifying one legal waythat a non-terminal can be rewritten to a sequence ofterminals and non-terminals.

23

Context-Free Grammars: Example

A CFG for real numbers:

• Terminals:

0 1 2 3 4 5 6 7 8 9 . ±

• Non-terminals:

real-number, part, digit, sign

• Productions:

• A real-number is a ± sign followed by a numerical part,possibly followed by “.” and another numerical part

• A part is a sequence of digits, i.e., a digit, or a digit followedby a part

• A digit is any single terminal except “.”, “+”, and “-”

• Start symbol:

real-number

Note that we use recursion to specify repeated occurrences.

We have “defined” this CFG using plain English. Not a good idea.

24

Context-Free Grammars: Example

A CFG for real numbers:

• Terminals: 0 1 2 3 4 5 6 7 8 9 . ±• Non-terminals:

real-number, part, digit, sign

• Productions:• A real-number is a ± sign followed by a numerical part,

possibly followed by “.” and another numerical part• A part is a sequence of digits, i.e., a digit, or a digit followed

by a part• A digit is any single terminal except “.”, “+”, and “-”

• Start symbol:

real-number

Note that we use recursion to specify repeated occurrences.

We have “defined” this CFG using plain English. Not a good idea.

24

Context-Free Grammars: Example

A CFG for real numbers:

• Terminals: 0 1 2 3 4 5 6 7 8 9 . ±• Non-terminals: real-number, part, digit, sign

• Productions:• A real-number is a ± sign followed by a numerical part,

possibly followed by “.” and another numerical part• A part is a sequence of digits, i.e., a digit, or a digit followed

by a part• A digit is any single terminal except “.”, “+”, and “-”

• Start symbol: real-number

Note that we use recursion to specify repeated occurrences.

We have “defined” this CFG using plain English. Not a good idea.

24

Context-Free Grammars: Example

A CFG for real numbers:

• Terminals: 0 1 2 3 4 5 6 7 8 9 . ±• Non-terminals: real-number, part, digit, sign

• Productions:• A real-number is a ± sign followed by a numerical part,

possibly followed by “.” and another numerical part• A part is a sequence of digits, i.e., a digit, or a digit followed

by a part• A digit is any single terminal except “.”, “+”, and “-”

• Start symbol: real-number

Note that we use recursion to specify repeated occurrences.

We have “defined” this CFG using plain English. Not a good idea.

24