CMSC 723: Intro to Computational Linguistics
Lecture 2: February 4, 2004
Regular Expressions and Finite State Automata
Professor Bonnie J. DorrDr. Nizar Habash
TAs: Nitin Madnani and Nate Waisbrot
Regular Expressions and Finite State Automata
• REs: Language for specifying text strings
• Search for document containing a string– Searching for “woodchuck”
• Finite-state automata (FSA)(singular: automaton)
• How much wood would a woodchuck chuck if a woodchuck would chuck wood?
– Searching for “woodchucks” with an optional final “s”
Regular Expressions
• Basic regular expression patterns
• Perl-based syntax (slightly different from other notations for regular expressions)
• Disjunctions /[wW]oodchuck/
Regular Expressions
• Optional characters ? ,* and +– ? (0 or 1)
• /colou?r/ color or colour
– * (0 or more)• /oo*h!/ oh! or Ooh! or Ooooh!
*+
Stephen Cole Kleene
– + (1 or more) • /o+h!/ oh! or Ooh! or Ooooh!
• Wild cards .– /beg.n/ begin or began or begun
Regular Expressions
• Anchors ^ and $– /^[A-Z]/ “Ramallah, Palestine”
– /^[^A-Z]/ “¿verdad?” “really?”
– /\.$/ “It is over.”
– /.$/ ?
• Boundaries \b and \B– /\bon\b/ “on my way” “Monday”
– /\Bon\b/ “automaton”
• Disjunction |– /yours|mine/ “it is either yours or mine”
Disjunction, Grouping, Precedence
• Column 1 Column 2 Column 3 …How do we express this?/Column [0-9]+ *//(Column [0-9]+ *)*/
• Precedence– Parenthesis ()– Counters * + ? {}– Sequences and anchors the ^my end$– Disjunction |
• REs are greedy!
Writing correct expressions
Exercise: write a Perl regular expression to match the English article “the”:
/the//[tT]he//\b[tT]he\b//[^a-zA-Z][tT]he[^a-zA-Z]//(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
A more complex example
Exercise: Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:/$[0-9]+//$[0-9]+\.[0-9][0-9]//\b$[0-9]+(\.[0-9][0-9])?\b//\b$[0-9][0-9]?[0-9]?(\.[0-9][0-9])?\b//\b[0-9]+ *([MG]Hz|[Mm]egahertz| [Gg]igahertz)\b//\b[0-9]+ *(Mb|[Mm]egabytes?)\b//\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
Substitutions and Memory
• Substitutions
• Memory (\1, \2, etc. refer back to matches)
s/colour/color/s/colour/color/g
s/([0-9]+)/<\1>/
/the (.*)er they were, the \1er they will be/
/the (.*)er they (.*), the \1er they \2/
Substitute as many times as possible!
Case insensitive matching
s/colour/color/i
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 1: replace first person references with second person referencess/\bI(’m| am)\b/YOU ARE/g
s/\bmy\b/YOUR/g
S/\bmine\b/YOURS/g
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
Finite-state Automata (Machines)
q0 q1 q2 q3 q4
b a a !
a
/baa+!/
state transition finalstate
baa! baaa! baaaa! baaaaa! ...
Finite-state Automata
• Q: a finite set of N states – Q = {q0, q1, q2, q3, q4}
: a finite input alphabet of symbols = {a, b, !}
• q0: the start state• F: the set of final states
– F = {q4} (q,i): transition function
– Given state q and input symbol i, return new state q' (q3,!) q4
D-RECOGNIZEfunction D-RECOGNIZE (tape, machine) returns accept or reject index Beginning of tape current-state Initial state of machine loop if End of input has been reached then if current-state is an accept state then return accept else return reject elsif transition-table [current-state, tape[index]] is empty then return reject else current-state transition-table [current-state, tape[index]] index index + 1end
Languages and Automata
• Can use FSA as a generator as well as a recognizer
• Formal language L: defined by machine M that both generates and recognizes all and only the strings of that language. – L(M) = {baa!, baaa!, baaaa!, …}
• Regular languages vs. non-regular languages
Using NFSAs to accept strings
• Backup: add markers at choice points, then possibly revisit unexplored arcs at marked choice point.
• Look-ahead: look ahead in input
• Parallelism: look at alternatives in parallel
More about FSAs
• Transducers
• Equivalence of DFSAs and NFSAs
• Recognition as search: depth-first, breadth-search
Regular languages
• Regular languages are characterized by FSAs
• For every NFSA, there is an equivalent DFSA.
• Regular languages are closed under concatenation, Kleene closure, union.
Top Related