Post on 13-Jan-2016
Languages, Grammars, and Regular Expressions
Chuck Cusack
• Based partly on Chapter 11 of “Discrete Mathematics and its Applications,” 5th edition, by Kenneth Rosen
Alphabets and Languages
• Definition: A vocabulary (or alphabet) V is a finite, nonempty set of symbols.
• Definition: A word or sentence over V is a finite string of symbols from V.
• Definition: The empty string or null string, denoted by λ, is the string containing no symbols.
• Definition: The set of all words over V is denoted by V*.
• Definition: A language over V is a subset of V*.
Language Examples
• Let V={0,1}
• 00110, 11111, 00, and 11 are words over V
• 012, a234, and 222 are not words over V
• V*={λ,0,1,00,01,10,11,000,…}
• In other words, V* is the set of all binary strings
• The set of strings consisting of only 0s is a language over V
• {1,10,100,1000,10000,…} is a language over V
Concatenation
• Definition: Let V be a vocabulary, and A and B be subsets of V*. The concatenation of A and B, denoted by AB, is the set of all strings of the form xy, where x∈A and y∈B.
• Example: Let A={0, 10}, and B={1,12}. Then
– AB={01, 012, 101, 1012}
– BA={10, 110, 120, 1210}
– AA={00, 010, 100, 1010}
– AAA=A(AA)={000, 0010, 0100, 01010, 1000, 10010, 10100, 101010}
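The concatenation of two finite sets of strings can be checked mechanically. The following is a minimal Python sketch; the function name concat is only an illustrative choice, and strings stand in for words.

```python
def concat(A, B):
    """Return the concatenation AB = {xy | x in A, y in B}."""
    return {x + y for x in A for y in B}

A = {"0", "10"}
B = {"1", "12"}

print(sorted(concat(A, B)))  # the set AB
print(sorted(concat(B, A)))  # the set BA, which differs from AB
print(sorted(concat(A, A)))  # the set AA
```

Because the result is a set, duplicate strings produced by different pairs (x, y) are stored only once.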
Concatenation: A^n
• Definition: Let V be a vocabulary, and A a subset of V*. Then A^0={λ}, and for n>0, we can define A^n=A^(n-1)A
• Example: Let A={0, 10}. Then
– A^0={λ}
– A^1=A^0A={λ}A=A={0,10}
– A^2=A^1A={00, 010, 100, 1010}
– A^3=A^2A={000, 0010, 0100, 01010, 1000, 10010, 10100, 101010}
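The recursive definition translates directly into a loop. This is a small Python sketch under the assumption that the empty Python string "" plays the role of λ; the name power is hypothetical.

```python
def power(A, n):
    """Return A^n, using A^0 = {λ} and A^n = A^(n-1) A."""
    result = {""}  # A^0 contains only the empty string
    for _ in range(n):
        # One more concatenation with A on the right.
        result = {x + y for x in result for y in A}
    return result

A = {"0", "10"}
for n in range(4):
    print(n, sorted(power(A, n)))
```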
Kleene Closure
• Definition: Let V be a vocabulary, and A a subset of V*. The Kleene closure of A, denoted by A*, is the set consisting of concatenations of an arbitrary number of strings from A. That is,
$$A^* = \bigcup_{k=0}^{\infty} A^k$$
• Definition: A+ is the set of nonempty strings over A. In other words,
$$A^+ = \bigcup_{k=1}^{\infty} A^k = A^* - \{\lambda\}$$
(the last equality assumes λ∉A)
Kleene Closure Example
• Example: Let A={0, 1}. Then
– A^0={λ}
– A^1={0,1}
– A^2={00, 01, 10, 11}
– A^3={000, 001, 010, 011, 100, 101, 110, 111}
– A*={0,1}*={All binary strings}
• Example: Let B={111}. Then
– B^0={λ}, B^1={111}, B^2={111111}
– B^3={111111111}
– B* is the set of strings consisting of 3n 1s, for every n≥0
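A Kleene closure is usually infinite, so a program can only list part of it. The sketch below, in Python, collects the strings of A* up to a chosen length; the name kleene_upto and the length bound are assumptions made for illustration.

```python
def kleene_upto(A, max_len):
    """Return the strings of A* whose length is at most max_len."""
    result = {""}    # A^0 = {λ} is always part of A*
    frontier = {""}  # strings found in the most recent round
    while frontier:
        # Extend each recent string by one more string from A.
        frontier = {x + y for x in frontier for y in A
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result

print(sorted(kleene_upto({"0", "1"}, 2), key=lambda w: (len(w), w)))
print(sorted(kleene_upto({"111"}, 9), key=len))
```

The loop stops as soon as a round produces no new string within the bound.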
Regular Sets
• Definition: A regular set is a set that can be generated starting from the empty set, empty string, and single elements from the vocabulary, using concatenations, unions, and Kleene closures in arbitrary order.
• We will give a more precise definition after we define a regular expression.
Regular Expressions
• Definition: The regular expressions over a set I are defined recursively by:
– ∅ (the empty set) is a regular expression,
– λ (the set containing the empty string) is a regular expression,
– x is a regular expression for all x∈I,
– (AB), (A∪B), and A* are regular expressions if A and B are regular expressions
• Definition: A regular set is a set represented by a regular expression.
• Examples: 001*, 1(0(01)*11, and AB*C are regular expressions
Regular Expression Example
• The regular set defined by the regular expression 01* is the set of strings starting with a 0 followed by 0 or more 1s.
• The regular set defined by (10)* is the set of strings consisting of 0 or more copies of 10.
• The regular set defined by 0(0∪1)*1 is the set of all binary strings beginning with 0 and ending with 1.
• The regular set defined by (0∪1)1(0∪1) is the set of strings {010, 011, 110, 111}.
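These four regular sets can be explored with Python's re module, where the union of 0 and 1 is written (0|1). The sketch below lists the short binary strings that each expression matches in full; the helper name regular_set and the length bound are illustrative assumptions.

```python
import re
from itertools import product

def regular_set(pattern, alphabet="01", max_len=4):
    """List the strings over alphabet, of length <= max_len, matched in full by pattern."""
    words = ("".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n))
    return [w for w in words if re.fullmatch(pattern, w)]

print(regular_set(r"01*"))          # a 0 followed by zero or more 1s
print(regular_set(r"(10)*"))        # zero or more copies of 10
print(regular_set(r"0(0|1)*1"))     # begins with 0 and ends with 1
print(regular_set(r"(0|1)1(0|1)"))  # exactly four strings
```

re.fullmatch is used rather than re.search so that the whole string, not just a substring, must belong to the set.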
Regular Expression Applications
• Regular expressions are actually used quite often in computer science.
• For instance, if you are editing a file with vi, and want to see if it contains the string blah followed by a number followed by any character followed by the letter Q, you can use the regular expression
blah[0-9][0-9]*.Q
• This works because vi uses regular expressions for searching.
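The same pattern can be tried outside vi. The following Python sketch uses the standard re module, whose syntax agrees with vi for this particular expression; the function name contains_match and the sample strings are made up for illustration.

```python
import re

# blah, then one or more digits, then any one character, then Q
PATTERN = re.compile(r"blah[0-9][0-9]*.Q")

def contains_match(text):
    """Return True if some substring of text matches the pattern."""
    return PATTERN.search(text) is not None

for sample in ["xx blah42-Q yy", "blah7xQ", "blah7Q", "blahQ"]:
    print(sample, contains_match(sample))
```

Note that "any character" includes digits, so a string such as blah12Q also matches: 1 is the number and 2 is the arbitrary character.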
Grammars and Languages
• Many languages can be defined by grammars.
• We are particularly interested in phrase-structure grammars.
• Before we can define phrase-structure grammars, we need to define a few more terms.
Special Symbols
• Definition: A nonterminal symbol (or just nonterminal) is a symbol which can be replaced by other symbols.
• Definition: A terminal symbol (or just terminal) is a symbol which cannot be replaced by other symbols.
• Definition: The start symbol is a special symbol, usually denoted by S.
• The set of terminals is denoted by T, and the set of nonterminals by N.
• S is a nonterminal.
Productions
• Definition: A production is a rule which tells how to replace one string from V* with another string.
• Productions are denoted by a→b, which denotes that a can be replaced by b.
• Example
– Let S→A0, A→A1, and A→0 be productions
– Then I can replace S with A0
– Since I can replace A with A1, A0 can become A10
– Since I can replace A with 0, A10 can become 010
– Thus, I can replace S with 010
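The replacement steps in this example can be replayed as string rewriting. This is a minimal Python sketch; apply_production is a hypothetical helper that rewrites the leftmost occurrence of a production's left side.

```python
def apply_production(string, lhs, rhs):
    """Replace the leftmost occurrence of lhs in string with rhs."""
    if lhs not in string:
        raise ValueError(f"{lhs!r} does not occur in {string!r}")
    return string.replace(lhs, rhs, 1)

step1 = apply_production("S", "S", "A0")    # use S -> A0
step2 = apply_production(step1, "A", "A1")  # use A -> A1
step3 = apply_production(step2, "A", "0")   # use A -> 0
print(step1, step2, step3)
```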
Phrase-Structure Grammars
• Definition: A phrase-structure grammar is a 4-tuple G=(V,T,S,P), where
– V is a vocabulary
– T⊆V is a set of terminals
– S∈V is a start symbol
– P is a set of productions
• N=V-T is the set of nonterminals
• Each production contains at least one nonterminal on its left side.
• We will always use S as the start symbol.
Direct Derivations
• Let G=(V,T,S,P) be a phrase-structure grammar.
• Let A=lar and B=lbr, where l, a, b, r ∈ V*.
• Let a→b be a production.
• Then we can derive B from A.
• Thus we say that B is directly derivable from A.
• We write this as A⇒B
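One way to make direct derivation concrete is to list every string obtainable from A by a single production. The Python sketch below assumes each symbol is one character and that productions are given as (left side, right side) pairs; the name directly_derivable is illustrative.

```python
def directly_derivable(A, productions):
    """Return every B obtained from A by replacing one occurrence of a
    production's left side a with its right side b."""
    results = set()
    for a, b in productions:
        start = A.find(a)
        while start != -1:
            # A = l a r becomes B = l b r
            results.add(A[:start] + b + A[start + len(a):])
            start = A.find(a, start + 1)
    return results

P = [("S", "A0"), ("A", "A1"), ("A", "0")]
print(directly_derivable("A10", P))
```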
Derivations
• Let G=(V,T,S,P) be a phrase-structure grammar
• Let A1, A2,…,An ∈ V* be such that
A1⇒A2⇒…⇒An
• Then we say that An is derivable from A1.
• We write A1 ⇒* An
• The sequence of productions used is called a derivation.
Generating Languages
• Let G=(V,T,S,P) be a grammar
• Definition: The language generated by G, denoted L(G) , is the set of all strings of terminals that are derivable from S.
• Put another way,
L(G)={w ∈ T* | S ⇒* w}
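The definition of L(G) suggests a brute-force way to list short words: search through everything derivable from S and keep the strings made only of terminals. The Python sketch below is one possible approach, not an efficient one. It assumes single-character symbols and that no production shortens a string, so over-long intermediate strings can be dropped; the name generate and the sample grammar (productions S→S0 and S→0) are chosen for illustration.

```python
from collections import deque

def generate(productions, terminals, start="S", max_len=4):
    """Return the words of L(G) of length <= max_len by breadth-first search."""
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        current = queue.popleft()
        if all(ch in terminals for ch in current):
            words.add(current)  # a string of terminals derivable from S
            continue
        for lhs, rhs in productions:
            pos = current.find(lhs)
            while pos != -1:
                nxt = current[:pos] + rhs + current[pos + len(lhs):]
                if len(nxt) <= max_len and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                pos = current.find(lhs, pos + 1)
    return words

print(sorted(generate([("S", "S0"), ("S", "0")], "01", max_len=3), key=len))
```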
Example 1
Let G be the grammar with
– V={S,0,1}
– T={0,1}
– P={S→S0, S→0}
• Clearly S⇒0, so 0∈L(G)
• Also, S⇒S0⇒00, so 00∈L(G)
• And, S⇒S0⇒S00⇒000, so 000∈L(G)
• It is not hard to see that L(G) is the language consisting of all strings made up of 1 or more 0s and nothing else.
Example 2
Let G be the grammar with V={S,0,1}, T={0,1}, and P={S→SS, S→1, S→0}
• Clearly S⇒0, so 0∈L(G)
• Also, S⇒1, so 1∈L(G)
• Since S⇒SS⇒S1⇒01, 01∈L(G)
• In general, we can get a sequence of Ss, and replace each with either 0 or 1.
• Given this fact, it is easy to see that L(G)={0,1}+, the set of all non-empty binary strings
Example 3
Let G be the grammar with V={S,A,B,0,1}, T={0,1}, and P={S→AB, B→BB, A→AA, A→0, B→1}
• Clearly S⇒AB⇒0B⇒01, so 01∈L(G)
• Also, S⇒AB⇒AAB⇒0AB⇒00B⇒001, so 001∈L(G)
• Similarly, we can get 011, 0011, 0001, etc.
• In general, we can get a sequence of n 0s followed by m 1s, where n>0, m>0.
• Thus L(G)={0^n 1^m | m and n are positive integers}
Type 0 Grammars
• Type 0 grammars have no restrictions on the types of productions that are allowed.
• Thus type 0 grammars are just phrase-structure grammars.
• This is not too exciting, so we will move on to type 1 grammars.
Type 1 Grammars
• In a type 1 grammar, productions are of the form
– aXb→acb, where X∈N and a,b,c∈V* with c≠λ
– (or S→λ, but ignore this for now)
• Thus, a production can only be applied if the symbol X is surrounded by a and b.
• In other words, the production can only be applied in a certain context.
• This is why type 1 grammars are also called context-sensitive grammars.
Type 2 Grammars
• Productions are of the form
– X→a, where X∈N and a∈V*.
• Thus, if X is in a string, we can replace X with a no matter what surrounds X.
• In other words, the context in which X appears does not matter.
• This is why type 2 grammars are called context-free grammars.
• Context-free grammars produce context-free languages.
Type 3 Grammars
• Productions are of the form
– X→a, where X∈N and a∈T
– X→aY, where X,Y∈N and a∈T
– S→λ
• Type 3 grammars are called regular grammars.
• Regular grammars produce regular languages.
• It is easy to see that a type 3 grammar is a type 2 grammar.
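The restrictions for type 2 and type 3 grammars are simple enough to check by program. The following Python sketch assumes single-character symbols, productions given as (left side, right side) pairs, and λ written as the empty string; the function names are hypothetical.

```python
def is_context_free(productions, nonterminals):
    """Type 2: every left side is a single nonterminal."""
    return all(lhs in nonterminals for lhs, _ in productions)

def is_regular(productions, nonterminals, terminals, start="S"):
    """Type 3: each production is X -> a, X -> aY, or S -> λ."""
    for lhs, rhs in productions:
        if lhs not in nonterminals:
            return False
        if rhs == "":
            if lhs != start:          # only S may produce λ
                return False
        elif len(rhs) == 1:
            if rhs not in terminals:  # X -> a
                return False
        elif len(rhs) == 2:
            if rhs[0] not in terminals or rhs[1] not in nonterminals:
                return False          # X -> aY
        else:
            return False
    return True

N, T = {"S", "A", "B"}, {"0", "1"}
g1 = [("S", "0A"), ("A", "0A"), ("A", "1A"), ("A", "1")]
g2 = [("S", "AB"), ("B", "BB"), ("A", "AA"), ("A", "0"), ("B", "1")]
print(is_regular(g1, N, T), is_regular(g2, N, T), is_context_free(g2, N))
```

Every grammar accepted by is_regular is also accepted by is_context_free, which mirrors the remark that a type 3 grammar is a type 2 grammar.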
Types of Grammars
Type | Productions allowed
0    | Almost any kind allowed
1    | aXb→acb, where X∈N, a,b,c∈V*, c≠λ; S→λ
2    | X→a, where X∈N and a∈V*
3    | X→a, where X∈N and a∈T; X→aY, where X,Y∈N and a∈T; S→λ
Types of Grammars
• The following summarizes the relationships between the types of grammars
Type 0: phrase-structure
Type 1: context-sensitive
Type 2: context-free
Type 3: regular
Regular Grammar Example
• Let G be the grammar with
– V={S,A,0,1},
– T={0,1}, and
– P={S→0A, A→0A, A→1A, A→1}
• We can determine what the language is by constructing a few words.
– S⇒0A⇒01
– S⇒0A⇒00A⇒001, S⇒0A⇒01A⇒011
– S⇒0A⇒00A⇒000A⇒0001, S⇒0A⇒00A⇒001A⇒0011
– S⇒0A⇒01A⇒010A⇒0101, S⇒0A⇒01A⇒011A⇒0111
• We can see that in general, L(G) is the set of binary strings beginning with 0 and ending with 1.
Regular Languages and Sets
• Theorem: Let A be a subset of V* . Then A is a regular language if and only if A is a regular set.
• In other words, a language defined by a regular grammar can also be defined by a regular expression, and vice-versa.
• Example: We just saw that the grammar with V={S,A,0,1}, T={0,1}, and P={S→0A, A→0A, A→1A, A→1} generates the set of binary strings beginning with 0 and ending with 1.
• Recall that the regular set defined by 0(0∪1)*1 is also the set of all binary strings beginning with 0 and ending with 1.
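The agreement between this grammar and this regular expression can be spot-checked on all short binary strings. The Python sketch below is a check, not a proof: it decides membership in the grammar with productions S→0A, A→0A, A→1A, A→1 by tracking which nonterminals can remain after each symbol, and compares the answer with Python's re module, where the union is written (0|1). The dictionary layout and function name are illustrative assumptions.

```python
import re
from itertools import product

# Each right side is (terminal, next nonterminal); None marks X -> a.
P = {
    "S": [("0", "A")],
    "A": [("0", "A"), ("1", "A"), ("1", None)],
}

def grammar_accepts(word, start="S"):
    """Decide whether the regular grammar P derives word."""
    current = {start}  # nonterminals that may still be awaiting rewriting
    for symbol in word[:-1]:
        current = {Y for X in current for a, Y in P[X]
                   if a == symbol and Y is not None}
    # The final symbol must come from a production of the form X -> a.
    return bool(word) and any(a == word[-1] and Y is None
                              for X in current for a, Y in P[X])

words = ["".join(p) for n in range(7) for p in product("01", repeat=n)]
agree = all(grammar_accepts(w) == bool(re.fullmatch(r"0(0|1)*1", w))
            for w in words)
print(agree)
```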
Grammar Applications
• Context-free grammars are used to define the syntax of most programming languages.
• Regular grammars are used in several applications, including the following
– Searching text for patterns
– Lexical analysis (during program compilation)
• Efficient algorithms exist to determine if a string is in a context-free or regular language.
• This is important for tasks like determining whether or not a program is syntactically valid.
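One well-known membership algorithm for context-free languages is the CYK algorithm, which is not covered in these slides and is named here only as an example. It requires the grammar to be in Chomsky normal form, meaning every production is X→a or X→YZ. The grammar with productions S→AB, B→BB, A→AA, A→0, B→1 happens to have that form, so the Python sketch below uses it; the function name and the way productions are encoded are assumptions.

```python
def cyk(word, unit, binary, start="S"):
    """CYK membership test for a grammar in Chomsky normal form.

    unit:   set of (X, a) pairs for productions X -> a
    binary: set of (X, Y, Z) triples for productions X -> YZ
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive word[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = {X for X, a in unit if a == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for X, Y, Z in binary:
                    if Y in left and Z in right:
                        table[i][length - 1].add(X)
    return start in table[0][n - 1]

unit = {("A", "0"), ("B", "1")}
binary = {("S", "A", "B"), ("B", "B", "B"), ("A", "A", "A")}
print(cyk("0011", unit, binary), cyk("0101", unit, binary))
```

The three nested loops over substring length, start position, and split point give a running time cubic in the length of the word.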
Backus-Naur Form
• Backus-Naur form (BNF) is a more compact representation of productions in a type 2 grammar.
• All productions with the same left hand side are combined into one production
• The symbol → is replaced with ::=
• All nonterminals are enclosed in < and >
• The right hand sides of the various productions are combined, and separated by |
Backus-Naur Form Example
• Consider the set of productions
– S→AB
– B→BB
– A→AA
– A→0
– B→1
• In BNF, they are represented by
– <S> ::= <A><B>
– <B> ::= <B><B> | 1
– <A> ::= <A><A> | 0
Backus-Naur Form Example 2
• The Backus Naur form for the production of a signed integer is
– <signed integer> ::= <sign><integer>
– <sign> ::= + | -
– <integer> ::= <digit> | <digit><integer>
– <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
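A BNF description like this one maps naturally onto a small recognizer with one function per nonterminal. The Python sketch below is one possible rendering; all function names are hypothetical, and each parser returns the index just past what it consumed, or None on failure.

```python
def parse_sign(s, i):
    """<sign> ::= + | -"""
    return i + 1 if i < len(s) and s[i] in "+-" else None

def parse_digit(s, i):
    """<digit> ::= 0 | 1 | ... | 9"""
    return i + 1 if i < len(s) and s[i] in "0123456789" else None

def parse_integer(s, i):
    """<integer> ::= <digit> | <digit><integer>"""
    i = parse_digit(s, i)
    if i is None:
        return None
    rest = parse_integer(s, i)  # try the longer alternative first
    return rest if rest is not None else i

def is_signed_integer(s):
    """<signed integer> ::= <sign><integer>, matched against the whole string."""
    i = parse_sign(s, 0)
    return i is not None and parse_integer(s, i) == len(s)

for sample in ["+42", "-0", "42", "+", "+4a"]:
    print(sample, is_signed_integer(sample))
```

According to this grammar the sign is mandatory, so an unsigned string such as 42 is rejected.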
Backus-Naur Form Applications
• Specifying the syntax for programming languages including
– Java
– LISP
• Specifying database languages
– SQL
• Specifying markup languages
– XML