ICS 482 Natural Language Processing Regular Expression and Finite Automata
Dr. Muhammed Al-Mulhem
March 1, 2009
Regular Expressions
Regular expression (RE): A formula for specifying a set of strings.
String: A sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation).
Regular Expression Patterns
RE String matched
woodchucks “interesting links to woodchucks and lemurs”
a “Sarah Ali stopped by Mona’s”
Ali says, “My gift please,” Ali says,”
book “all our pretty books”
! “Leave him behind!” said Sami
Specify Options and Range using [ ] and -
RE Match
[wW]ood Wood or wood
[abc] “a”, “b”, or “c”
[A-Z] an uppercase letter
[a-z] a lowercase letter
[0-9] a single digit
RE Operators
RE Description
a* Zero or more a’s
a+ One or more a’s
a? Zero or one a
[ab]* Zero or more a’s or b’s. Matches aaa.., ababab.., bbbb..
[0-9]+ Sequence of one or more digits.
. Wildcard expression: matches any single character.
\b Matches a word boundary. /\bthe\b/ matches “the” but not “other”.
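The operators in the table can be tried out directly. The following is a minimal sketch using Python's `re` module (the slides use Perl-style notation, so the choice of Python and the sample strings are assumptions made for illustration).

```python
import re

# Each pair is (pattern, sample text); the patterns come from the operator table.
examples = [
    (r"a*", "aaa"),
    (r"a+", "aaa"),
    (r"a?", ""),
    (r"[ab]*", "ababab"),
    (r"[0-9]+", "2009"),
    (r".", "x"),
]

# fullmatch succeeds only if the whole sample is described by the pattern.
all_match = all(re.fullmatch(p, s) is not None for p, s in examples)

# \b marks a word boundary: "the" is found as a word, but not inside "other".
the_word = re.compile(r"\bthe\b")
found_in_sentence = the_word.search("find the word") is not None
found_in_other = the_word.search("other") is not None
```

Here `all_match` and `found_in_sentence` come out true, while `found_in_other` is false because the letters on both sides of "the" in "other" are word characters.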
Sidebar: Errors
Find all instances of the word “the” in a text.
– /the/
What about ‘The’?
– /[tT]he/
What about ‘Theater’, ‘Another’?
Sidebar: Errors
The process we just went through was based on fixing two kinds of errors:
– Matching strings that we should not have matched (there, then, other): false positives
– Not matching things that we should have matched (The): false negatives
Sidebar: Errors
Reducing the error rate for an application often involves two efforts
– Increasing accuracy (minimizing false positives)
– Increasing coverage (minimizing false negatives)
Regular expressions
Basic regular expression patterns
Perl-based syntax (slightly different from other notations for regular expressions)
Disjunctions [abc]
Ranges [A-Z]
Negations [^Ss]
Optional and repeated characters ?, + and *
Wild cards .
Anchors \b and \B
Disjunction, grouping, and precedence |
Preceding character or nothing using ?
Wildcard
Negation using ^
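The three operators named on these slides can be illustrated with a short sketch. The sample words below are assumptions chosen for illustration, and Python's `re` module stands in for the Perl notation used in the slides.

```python
import re

# ? makes the preceding character optional.
optional = [w for w in ["woodchuck", "woodchucks", "woodchuckss"]
            if re.fullmatch(r"woodchucks?", w)]

# . matches any single character between "beg" and "n".
wildcard = [w for w in ["begin", "begun", "beg'n", "began", "begn"]
            if re.fullmatch(r"beg.n", w)]

# [^Ss] matches one character that is neither "S" nor "s".
negated = re.findall(r"[^Ss]", "Sass!")
```

Note that the caret only means negation when it is the first symbol inside the brackets.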
Writing correct expressions
Exercise: write a regular expression to match the English article “the”:
/the/ missed ‘The’
/[tT]he/ included ‘the’ in ‘others’
/\b[tT]he\b/ missed ‘the25’ ‘the_’
/[^a-zA-Z][tT]he[^a-zA-Z]/ missed ‘The’ at the beginning of a line
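The progression in this exercise can be checked by counting matches on a small test sentence. This is a sketch only: the sentence is made up for illustration, and the last pattern, which also allows the start or end of a line on either side, is an assumed refinement that does not appear on the slide.

```python
import re

text = "The other day the25 cats saw the dog"

attempts = {
    "/the/": r"the",
    "/[tT]he/": r"[tT]he",
    r"/\b[tT]he\b/": r"\b[tT]he\b",
    "/[^a-zA-Z][tT]he[^a-zA-Z]/": r"[^a-zA-Z][tT]he[^a-zA-Z]",
    # Assumed final refinement: accept start or end of line as a neighbour.
    "/(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)/": r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",
}

# Count non-overlapping matches of each attempt in the test sentence.
counts = [sum(1 for _ in re.finditer(rx, text)) for rx in attempts.values()]
```

Each attempt trades one kind of error for another: the count rises when false negatives are fixed and falls when false positives are removed.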
A more complex example
Exercise: Write a Perl regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:
Example
Price
– /$[0-9]+/ # whole dollars
– /$[0-9]+\.[0-9][0-9]/ # dollars and cents
– /$[0-9]+(\.[0-9][0-9])?/ # cents optional
– /\b$[0-9]+(\.[0-9][0-9])?\b/ # word boundaries
Specifications for processor speed
– /\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/
Memory size
– /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
– /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
Vendors
– /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
– /\b(Mac|Macintosh|Apple)\b/
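A rough sketch of how the price, speed, and disk patterns might be applied to one advertisement follows. It makes several assumptions: Python's `re` is used instead of Perl, the dollar sign is written `\$` because an unescaped `$` is an end-of-line anchor, the disk pattern accepts whole numbers with an optional decimal part, and the advertisement text is invented. The patterns only extract the fields; comparing the numbers against 500 MHz, 32 Gb, and $1000 would be a separate step.

```python
import re

# Assumption: a literal dollar sign must be escaped as \$.
price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")
speed = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b")
# Assumption: whole numbers with an optional decimal part, e.g. "40 Gb" or "1.5 Gb".
disk = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

ad = "Fast PC, 600 MHz, 40 Gb disk, only $999.99"  # hypothetical advertisement

found = {
    "price": price.search(ad).group(0),
    "speed": speed.search(ad).group(0),
    "disk": disk.search(ad).group(0),
}
```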
Advanced Operators – Aliases for common ranges
Underscore: Correct figure 2.6
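The alias table itself did not survive in this transcript, so the sketch below assumes the usual Perl-style aliases and checks each one against an explicit character class in Python. The sample string is invented. Note that `\w` includes the underscore.

```python
import re

# Pairs of (alias, explicit character class) assumed to be equivalent for ASCII text.
aliases = [
    (r"\d", r"[0-9]"),
    (r"\D", r"[^0-9]"),
    (r"\w", r"[a-zA-Z0-9_]"),
    (r"\W", r"[^a-zA-Z0-9_]"),
    (r"\s", r"[ \t\n\r\f\v]"),
    (r"\S", r"[^ \t\n\r\f\v]"),
]

sample = "Blue_moon 2009,\tparty of 5!\n"

# re.ASCII restricts \d, \w and \s to ASCII so the comparison is fair.
same = all(
    re.findall(alias, sample, flags=re.ASCII) == re.findall(explicit, sample)
    for alias, explicit in aliases
)
```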
\ to Reference special characters
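As a small illustration of the backslash, the sketch below contrasts an unescaped and an escaped period, and matches a literal asterisk and question mark. The sample strings are assumptions.

```python
import re

# Without the backslash, "." is a wildcard; with it, only a literal period matches.
wild = re.findall(r"Dr.", "Dr. Drs Dr!")
literal = re.findall(r"Dr\.", "Dr. Drs Dr!")

# \* and \? match a literal asterisk and a literal question mark.
stars = re.findall(r"\*", "2*3*4")
questions = re.findall(r"\?", "Why? How?")
```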
Operators for counting
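The counting table is missing from this transcript. Assuming the usual brace notation ({n}, {n,m}, {n,}), a brief sketch in Python might look like this:

```python
import re

# {n}: exactly n occurrences of the preceding character.
exactly_three = [s for s in ["aa", "aaa", "aaaa"] if re.fullmatch(r"a{3}", s)]

# {n,m}: from n to m occurrences.
two_to_three = [s for s in ["a", "aa", "aaa", "aaaa"] if re.fullmatch(r"a{2,3}", s)]

# {n,}: at least n occurrences.
at_least_two = [s for s in ["a", "aa", "aaaaa"] if re.fullmatch(r"a{2,}", s)]
```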
Finite State Automata
An FSA recognizes the regular languages represented by regular expressions
– SheepTalk: /baa+!/
• Directed graph with labeled nodes and arc transitions
• Five states: q0 the start state, q4 the final state, 5 transitions
[Figure: SheepTalk automaton with states q0 to q4 and arcs labeled b, a, a, a, !]
Formally
An FSA is a 5-tuple consisting of
– Q: set of states {q0,q1,q2,q3,q4}
– Σ: an alphabet of symbols {a,b,!}
– q0: a start state
– F: a set of final states in Q {q4}
– δ(q,i): a transition function mapping Q x Σ to Q
[Figure: SheepTalk automaton with states q0 to q4]
An FSA recognizes (accepts) strings of a regular language
– baa!
– baaa!
– baaaa!
– …
A rejected input: a b a ! b
[Figure: SheepTalk automaton with states q0 to q4]
State Transition Table
An FSA can be represented with a state transition table
State   Input
        b   a   !
0       1   Ø   Ø
1       Ø   2   Ø
2       Ø   3   Ø
3       Ø   3   4
4       Ø   Ø   Ø
[Figure: SheepTalk automaton with states q0 to q4]
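A state transition table translates almost directly into a lookup structure. The following is a minimal sketch of a table-driven recognizer for SheepTalk; the names `TABLE` and `recognize` are illustrative, and a missing dictionary entry plays the role of an Ø cell.

```python
# Transition table for SheepTalk; a missing entry plays the role of an empty cell.
TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START, FINAL = 0, {4}

def recognize(tape):
    """Return True if the table-driven automaton accepts the whole tape."""
    state = START
    for symbol in tape:
        nxt = TABLE[state].get(symbol)
        if nxt is None:       # empty cell: no legal move, so reject
            return False
        state = nxt
    return state in FINAL     # accept only if the input ends in a final state

results = {s: recognize(s) for s in ["baa!", "baaa!", "baaaa!", "ba!", "aba!b", "baa"]}
```

The first three strings are accepted. The others are rejected either because a cell is empty ("ba!", "aba!b") or because the input ends before the final state is reached ("baa").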
Non-Deterministic FSAs for SheepTalk
[Figure: two non-deterministic automata over states q0 to q4, one with arcs labeled b, a, a, a, ! and one with arcs labeled b, a, a, !]
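The figures are not recoverable from the transcript, so the sketch below assumes one common non-deterministic variant: in state q2 the input "a" may either stay in q2 or move on to q3. It simulates the automaton by tracking every state it could currently be in. The names `NFA` and `nd_recognize` are hypothetical.

```python
# Hypothetical non-deterministic table: in state 2, "a" may stay in 2 or move to 3.
NFA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
START, FINAL = 0, {4}

def nd_recognize(tape):
    """Track every state the automaton could be in after each symbol."""
    current = {START}
    for symbol in tape:
        current = set().union(*(NFA.get((q, symbol), set()) for q in current))
        if not current:       # no state has a legal move: reject
            return False
    return bool(current & FINAL)

results = [nd_recognize(s) for s in ["baa!", "baaa!", "ba!", "baa"]]
```

Tracking a set of states is one way to handle the choice points; backtracking over individual paths is another.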
Languages
A language is a set of strings
String: A sequence of letters
Tracing FSA - Initial Configuration
Input String: a b b a
[Figure: automaton with states q0 to q5; arcs q0 to q1 on a, q1 to q2 on b, q2 to q3 on b, q3 to q4 on a; the remaining inputs lead to q5, which loops on a, b; the automaton starts in q0]
Reading the Input
[Figure sequence: the automaton reads a b b a one symbol at a time, moving through q0, q1, q2, q3, q4]
Output: “accept”
Rejection
Input String: a b a
[Figure sequence: the automaton reads a b a one symbol at a time, moving through q0, q1, q2, q5; q5 is not a final state]
Output: “reject”
Another Example
[Figure sequence: three-state automaton in which q0 loops on a and moves to q1 on b, q1 moves to q2 on a, b, and q2 loops on a, b; it reads the input a a b, moving through q0, q0, q0, q1]
Output: “accept”
Rejection
[Figure sequence: the same three-state automaton reads an input made of one a and two b’s and does not end in q1]
Output: “reject”
Formalities
Deterministic Finite Accepter (DFA)
M = (Q, Σ, δ, q0, F)
Q : set of states
Σ : input alphabet
δ : transition function
q0 : initial state
F : set of final states
About Alphabets
An alphabet means we need a finite set of symbols in the input.
These symbols can and will stand for bigger objects that can have internal structure.
Input Alphabet Σ
Σ = {a, b}
Set of States Q
Q = {q0, q1, q2, q3, q4, q5}
Initial State q0
Set of Final States F
F = {q4}
Transition Function δ
δ : Q × Σ → Q
δ(q0, a) = q1
δ(q0, b) = q5
δ(q2, b) = q3
Transition Function δ
δ     a    b
q0    q1   q5
q1    q5   q2
q2    q5   q3
q3    q4   q5
q4    q5   q5
q5    q5   q5
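The transition function can be stored as a dictionary keyed by (state, symbol). The sketch below assumes that every input off the path q0, q1, q2, q3, q4 leads to the trap state q5, with integer i standing for state q_i. The helper names `trace` and `accepts` are illustrative.

```python
# Transition function of the six-state automaton; integer i stands for state q_i.
# Assumption: every move off the path q0..q4 goes to the trap state q5.
DELTA = {
    (0, "a"): 1, (0, "b"): 5,
    (1, "a"): 5, (1, "b"): 2,
    (2, "a"): 5, (2, "b"): 3,
    (3, "a"): 4, (3, "b"): 5,
    (4, "a"): 5, (4, "b"): 5,
    (5, "a"): 5, (5, "b"): 5,
}
FINAL = {4}

def trace(word, start=0):
    """Return the list of states visited while reading word."""
    states = [start]
    for symbol in word:
        states.append(DELTA[(states[-1], symbol)])
    return states

def accepts(word):
    """Accept when the last visited state is a final state."""
    return trace(word)[-1] in FINAL

accepted_path = trace("abba")
rejected_path = trace("aba")
```

The two traces correspond to the accept and reject walkthroughs for the inputs a b b a and a b a.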
Extended Transition Function δ* (Reads the entire string)
δ* : Q × Σ* → Q
δ*(q0, ab) = q2
δ*(q0, abba) = q4
δ*(q0, abbbaa) = q5
Observation: There is a walk from q0 to q5 with label abbbaa
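One way to read the extended transition function is as δ applied once per symbol, left to right, starting from the given state. A minimal sketch follows, assuming the six-state automaton with trap state q5 and integer i standing for q_i; `delta_star` is an illustrative name.

```python
from functools import reduce

# Assumed transition function of the six-state automaton (integer i stands for q_i).
DELTA = {
    (0, "a"): 1, (0, "b"): 5,
    (1, "a"): 5, (1, "b"): 2,
    (2, "a"): 5, (2, "b"): 3,
    (3, "a"): 4, (3, "b"): 5,
    (4, "a"): 5, (4, "b"): 5,
    (5, "a"): 5, (5, "b"): 5,
}

def delta_star(state, word):
    """Extended transition function: fold delta over the symbols of word."""
    return reduce(lambda q, symbol: DELTA[(q, symbol)], word, state)

values = [delta_star(0, w) for w in ["", "ab", "abba", "abbbaa"]]
```

For the empty string the fold applies no transitions, so the state is returned unchanged.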
Example
L(M) = {abba}
[Figure: the six-state automaton M with q4 marked accept]
Another Example
L(M) = {λ, ab, abba}
[Figure: the same automaton with three states marked accept]
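The language of an automaton depends on which states are final. The sketch below enumerates all strings over {a, b} up to a fixed length and keeps those that end in a final state. It assumes the six-state transition function with trap state q5, and it assumes that the three accept states in the second example are q0, q2, and q4, which the transcript does not state explicitly.

```python
from itertools import product

# Assumed transition function of the six-state automaton (integer i stands for q_i).
DELTA = {
    (0, "a"): 1, (0, "b"): 5,
    (1, "a"): 5, (1, "b"): 2,
    (2, "a"): 5, (2, "b"): 3,
    (3, "a"): 4, (3, "b"): 5,
    (4, "a"): 5, (4, "b"): 5,
    (5, "a"): 5, (5, "b"): 5,
}

def language(final_states, max_len=5):
    """List every string over {a, b} up to max_len that ends in a final state."""
    accepted = []
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            state = 0
            for symbol in letters:
                state = DELTA[(state, symbol)]
            if state in final_states:
                accepted.append("".join(letters))
    return accepted

only_q4 = language({4})
three_finals = language({0, 2, 4})   # assumed choice of accept states
```

The empty string in the second result corresponds to λ: with q0 final, the automaton accepts before reading anything.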
More Examples
L(M) = {a^n b : n ≥ 0}
[Figure: q0 loops on a and moves to q1 on b; q1 is the accept state and moves on a, b to the trap state q2, which loops on a, b]
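This example connects back to regular expressions: the same language is described by /a*b/. The sketch below encodes the three-state automaton as read from the figure residue (an assumption) and checks that it agrees with the regular expression on every string over {a, b} up to length 5.

```python
import re
from itertools import product

# Assumed transitions: q0 loops on a, b leads to q1, everything else falls into trap q2.
DELTA = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "a"): 2, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
}
FINAL = {1}

def accepts(word):
    state = 0
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in FINAL

# Compare the automaton with the regular expression a*b on all short strings.
words = ["".join(p) for n in range(6) for p in product("ab", repeat=n)]
agree = all(accepts(w) == (re.fullmatch(r"a*b", w) is not None) for w in words)
```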
L(M) = { all strings with prefix ab }
[Figure: q0 moves to q1 on a and q1 moves to q2 on b; q2 is the accept state and loops on a, b; q0 on b and q1 on a lead to q3, which loops on a, b]
L(M) = { all strings without substring 001 }
[Figure: automaton over the alphabet {0, 1} whose states include 0, 00, and 001; state 001 loops on 0, 1]
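A possible encoding of this automaton is sketched below. The state names record how much of 001 has just been seen, which is an assumed reading of the figure labels; the empty string names the start state and "001" is the trap state. The check compares the automaton with a direct substring test on all binary strings up to length 7.

```python
from itertools import product

# States are named after the part of "001" seen so far (an assumed labeling).
DELTA = {
    ("", "0"): "0",      ("", "1"): "",
    ("0", "0"): "00",    ("0", "1"): "",
    ("00", "0"): "00",   ("00", "1"): "001",
    ("001", "0"): "001", ("001", "1"): "001",   # trap state
}
FINAL = {"", "0", "00"}

def accepts(word):
    state = ""
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in FINAL

# The automaton should accept exactly the strings that do not contain 001.
words = ["".join(p) for n in range(8) for p in product("01", repeat=n)]
agree = all(accepts(w) == ("001" not in w) for w in words)
```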
Regular Languages
A language L is regular if there is a DFA M such that L = L(M)
All regular languages form a language family
Example
The language L = {awa : w ∈ {a, b}*} is regular:
[Figure: DFA with states q0, q2, q3, q4 and arcs labeled a, b]
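To make the claim concrete, the sketch below gives one DFA for this language and checks it against the regular expression /a[ab]*a/ on all strings up to length 6. The transitions are an assumption reconstructed from the state names in the figure residue (q0 start, q3 accept, q4 trap).

```python
import re
from itertools import product

# Assumed transitions: q2 means "started with a, last symbol was not a closing a",
# q3 means "started with a and the last symbol read is a", q4 is the trap state.
DELTA = {
    (0, "a"): 2, (0, "b"): 4,
    (2, "a"): 3, (2, "b"): 2,
    (3, "a"): 3, (3, "b"): 2,
    (4, "a"): 4, (4, "b"): 4,
}
FINAL = {3}

def accepts(word):
    state = 0
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in FINAL

# Compare with the regular expression a[ab]*a on all short strings.
words = ["".join(p) for n in range(7) for p in product("ab", repeat=n)]
agree = all(accepts(w) == (re.fullmatch(r"a[ab]*a", w) is not None) for w in words)
```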
Finite State Automata
Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.
More Formally
You can specify an FSA by enumerating the following things.
– The set of states: Q
– A finite alphabet: Σ
– A start state
– A set of accept/final states
– A transition function that maps Q x Σ to Q