ICS 482 Natural Language Processing Regular Expression and Finite Automata


Page 1: ICS 482 Natural Language Processing Regular Expression and Finite Automata

ICS 482 Natural Language Processing

Regular Expression and Finite Automata

Dr. Muhammed Al-Mulhem

March 1, 2009

Page 2: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Regular Expressions

Regular expression (RE): A formula for specifying a set of strings.

String: A sequence of alphanumeric characters (letters, numbers, spaces, tabs, and punctuation).

Page 3: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Regular Expression Patterns

RE            String matched
woodchucks    “interesting links to woodchucks and lemurs”
a             “Sarah Ali stopped by Mona’s”
Ali says,     “My gift please,” Ali says,”
book          “all our pretty books”
!             “Leave him behind!” said Sami

Page 4: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Specify Options and Range using [ ] and -

RE         Match
[wW]ood    Wood or wood
[abc]      “a”, “b”, or “c”
[A-Z]      an uppercase letter
[a-z]      a lowercase letter
[0-9]      a single digit
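A minimal sketch of the literal and bracket patterns above, assuming Python's standard re module as a stand-in for the Perl-style notation on the slides. The helper name first_match and the sample sentences are made up for illustration.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# A literal pattern matches itself anywhere in the string.
print(first_match(r"woodchucks", "interesting links to woodchucks and lemurs"))

# Brackets match any one of the listed characters; a dash gives a range.
print(first_match(r"[wW]ood", "Wood is stacked by the woodshed"))
print(first_match(r"[A-Z]", "we should call it 'Drenched Blossoms'"))
print(first_match(r"[0-9]", "Chapter 1: Down the Rabbit Hole"))
```

Like the slide examples, a search reports the first place where the pattern matches, so [wW]ood finds "Wood" before "wood".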

Page 5: ICS 482 Natural Language Processing Regular Expression and Finite Automata

RE Operators

RE        Description
a*        Zero or more a’s
a+        One or more a’s
a?        Zero or one a
[ab]*     Zero or more a’s or b’s. Matches aaa.., ababab.., bbbb..
[0-9]+    Sequence of one or more digits.
.         Wildcard expression: matches any single character.
\b        Matches a word boundary: /\bthe\b/ matches “the” but not “other”.
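The operators in the table can be tried out with a short sketch, again assuming Python's re module in place of Perl. The test strings are invented for illustration.

```python
import re

# fullmatch asks whether the WHOLE string belongs to the pattern's language.
print(re.fullmatch(r"a*", "") is not None)        # zero a's is allowed
print(re.fullmatch(r"a+", "") is not None)        # but a+ needs at least one
print(re.fullmatch(r"a?", "aa") is not None)      # a? allows at most one
print(re.fullmatch(r"[ab]*", "ababab") is not None)

# findall lists every non-overlapping match inside a longer text.
print(re.findall(r"[0-9]+", "room 42, floor 7"))
print(re.findall(r"beg.n", "begin began begun"))
print(re.findall(r"\bthe\b", "the other one"))
```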

Page 6: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Sidebar: Errors

Find all instances of the word “the” in a text.
– /the/
  What about ‘The’?
– /[tT]he/
  What about ‘Theater’, ‘Another’?

Page 7: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Sidebar: Errors

The process we just went through was based on fixing two kinds of errors:
– Matching strings that we should not have matched (there, then, other): false positives
– Not matching things that we should have matched (The): false negatives

Page 8: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Sidebar: Errors

Reducing the error rate for an application often involves two efforts:
– Increasing accuracy (minimizing false positives)
– Increasing coverage (minimizing false negatives)
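The two error types can be counted directly. The sketch below is illustrative only: the hand-labelled token list and the function name errors are assumptions, not part of the slides.

```python
import re

# Toy gold standard (made up for this sketch): token -> is it the article "the"?
gold = {"the": True, "The": True, "there": False, "other": False, "then": False}

def errors(pattern):
    """Return (false positives, false negatives) of pattern over the toy data."""
    fp = fn = 0
    for token, is_article in gold.items():
        predicted = re.search(pattern, token) is not None
        if predicted and not is_article:
            fp += 1          # matched something we should not have matched
        elif not predicted and is_article:
            fn += 1          # missed something we should have matched
    return fp, fn

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    print(pattern, errors(pattern))
```

On this toy data /the/ makes both kinds of error, /[tT]he/ removes the false negative, and adding word boundaries removes the false positives.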

Page 9: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Regular expressions

Basic regular expression patterns

Perl-based syntax (slightly different from other notations for regular expressions)

Disjunctions: [abc]
Ranges: [A-Z]
Negations: [^Ss]
Optional and repeated characters: ?, + and *
Wild cards: .
Anchors: \b and \B
Disjunction, grouping, and precedence: |

Page 10: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Preceding character or nothing using ?

Page 11: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Wildcard

Page 12: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Negation using ^
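The example tables for the optional marker, the wildcard, and negation did not survive in this transcript. As a hedged stand-in, the sketch below shows the three operators in Python's re module. The sample strings are invented and are not the ones from the original slides.

```python
import re

# ? makes the preceding character optional.
print(re.findall(r"woodchucks?", "a woodchuck met two woodchucks"))
print(re.findall(r"colou?r", "color or colour"))

# . matches any single character (in Python, any character except a newline).
print(re.findall(r"beg.n", "begin, begun, beg'n"))

# ^ as the FIRST symbol inside [] negates the class...
print("".join(re.findall(r"[^A-Z]", "Oyfn Pripetchik")))
# ...anywhere else inside [] it is just a literal caret.
print(re.findall(r"[e^]", "look up ^ now"))
```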

Page 13: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Writing correct expressions

Exercise: write a regular expression to match the English article “the”:

/the/                         missed ‘The’
/[tT]he/                      included ‘the’ in ‘others’
/\b[tT]he\b/                  missed ‘the25’, ‘the_’
/[^a-zA-Z][tT]he[^a-zA-Z]/    missed ‘The’ at the beginning of a line
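The last pattern fails at the start of a line because it demands a non-letter on both sides. One common way to finish the exercise is to allow the line start or end as an alternative to the non-letter. That final pattern is an assumption here, since the slide stops before giving it. The sketch counts matches of each step on one invented sentence.

```python
import re

steps = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he([^a-zA-Z]|$)",   # assumed final step, not on the slide
]

def count_matches(pattern, text):
    """Number of non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

text = "The other day the25 cats saw the theater"
for p in steps:
    print(f"{p:40} {count_matches(p, text)}")
```

On this sentence only the last pattern finds all three intended occurrences (“The”, “the” in “the25”, and “the”) without also matching inside “other” or “theater”.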

Page 14: ICS 482 Natural Language Processing Regular Expression and Finite Automata

A more complex example

Exercise: Write a Perl regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:

Page 15: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Example

Price
– /\$[0-9]+/                           # whole dollars
– /\$[0-9]+\.[0-9][0-9]/               # dollars and cents
– /\$[0-9]+(\.[0-9][0-9])?/            # cents optional
– /(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/    # word boundaries

Specifications for processor speed
– /\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/

Memory size
– /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
– /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/

Vendors
– /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
– /\b(Mac|Macintosh|Apple)\b/
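A hedged sketch of how such patterns could be run over an advertisement, assuming Python's re module. Two adaptations are assumptions of this sketch: the dollar sign is escaped so that it is a literal character, and the boundary in front of it is written as (^|\W) because $ is not a word character and \b cannot match between a space and $. The gigabyte pattern allows several digits with an optional decimal part so that “32 Gb” matches. The sample text is invented.

```python
import re

PRICE = re.compile(r"(^|\W)\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

ad = "Apple laptop, 800 MHz, 40 Gb disk, only $999.99 this week"

for name, rx in (("price", PRICE), ("speed", SPEED), ("disk", DISK)):
    m = rx.search(ad)
    # strip() drops the non-word character that (^|\W) may have consumed
    print(name, "->", m.group(0).strip() if m else None)
```

The patterns only locate the fields. Checking “more than 500MHz” or “less than $1000” would still need the matched numbers to be converted and compared.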

Page 16: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Advanced Operators – Aliases for common ranges

Underscore: Correct figure 2.6
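The alias table itself is missing from this transcript. The sketch below assumes the usual Perl-style aliases (\d, \D, \w, \W, \s, \S) and checks each against the explicit range it abbreviates. The underscore note presumably refers to \w, which includes the underscore. Python patterns on text strings are Unicode-aware by default, so the re.ASCII flag is used to get the plain ASCII ranges.

```python
import re

# Alias -> the explicit class it abbreviates (ASCII semantics).
aliases = {
    r"\d": r"[0-9]",
    r"\D": r"[^0-9]",
    r"\w": r"[a-zA-Z0-9_]",       # note the underscore
    r"\W": r"[^a-zA-Z0-9_]",
    r"\s": r"[ \t\n\r\f\v]",
    r"\S": r"[^ \t\n\r\f\v]",
}

sample = "Blue_moon 42\tparty-of-5!\n"
for alias, expanded in aliases.items():
    same = re.findall(alias, sample, re.ASCII) == re.findall(expanded, sample)
    print(alias, "==", expanded, same)
```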

Page 17: ICS 482 Natural Language Processing Regular Expression and Finite Automata

\ to Reference special characters

Page 18: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Operators for counting
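The table of counting operators is not preserved here. Assuming it covers the usual brace counters {n}, {n,m} and {n,}, a short Python sketch of their behaviour follows. The helper name whole and the test strings are illustrative.

```python
import re

def whole(pattern, s):
    """True if pattern matches the whole string s."""
    return re.fullmatch(pattern, s) is not None

print(whole(r"a{3}", "aaa"))                    # exactly three a's
print(whole(r"a{2,4}", "aaaaa"))                # from two to four: five is too many
print(whole(r"a{2,}", "aaaaa"))                 # at least two
print(whole(r"[0-9]{3}-[0-9]{4}", "555-0199"))  # counters apply to classes too
```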

Page 19: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Finite State Automata

FSA recognizes the regular languages represented by regular expressions
– SheepTalk: /baa+!/

• Directed graph with labeled nodes and arc transitions
• Five states: q0 the start state, q4 the final state, 5 transitions

[Figure: q0 –b→ q1 –a→ q2 –a→ q3 –!→ q4, with a self-loop on q3 labeled a]

Page 20: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Formally

FSA is a 5-tuple consisting of
– Q: set of states {q0,q1,q2,q3,q4}
– Σ: an alphabet of symbols {a,b,!}
– q0: A start state
– F: a set of final states in Q {q4}
– δ(q,i): a transition function mapping Q x Σ to Q

[Figure: the SheepTalk automaton over states q0 to q4]

Page 21: ICS 482 Natural Language Processing Regular Expression and Finite Automata

FSA recognizes (accepts) strings of a regular language
– baa!
– baaa!
– baaaa!
– …

A rejected input

a b a ! b

[Figure: the SheepTalk automaton over states q0 to q4]

Page 22: ICS 482 Natural Language Processing Regular Expression and Finite Automata

State Transition Table

FSA can be represented with State Transition Table

State    Input
         b    a    !
0        1    Ø    Ø
1        Ø    2    Ø
2        Ø    3    Ø
3        Ø    3    4
4        Ø    Ø    Ø
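The table can drive a recognizer directly. The following is a minimal sketch in Python: the names TABLE and recognize are illustrative, and a missing dictionary entry plays the role of Ø.

```python
# State-transition table for SheepTalk /baa+!/; absent entries stand for Ø.
TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START, FINAL = 0, {4}

def recognize(tape):
    """Return True if the automaton accepts the whole input string."""
    state = START
    for symbol in tape:
        if symbol not in TABLE[state]:
            return False             # no transition defined: reject
        state = TABLE[state][symbol]
    return state in FINAL            # accept only if we end in a final state

for s in ("baa!", "baaaa!", "ba!", "aba!b"):
    print(s, recognize(s))
```

The recognizer looks at each input symbol exactly once, so its running time is linear in the length of the input.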

Page 23: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Non-Deterministic FSAs for SheepTalk

[Figure: non-deterministic automaton over states q0 to q4 with arcs labeled b, a, a, a, !]

[Figure: non-deterministic automaton over states q0 to q4 with arcs labeled b, a, a, !]
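The arc layout of the two figures cannot be recovered from this transcript. The sketch below assumes one plausible reading of the first one, in which state q2 has both a self-loop on a and an arc on a to q3. It simulates the automaton by tracking the set of all states it could be in. That is one standard technique and is not necessarily the search procedure used in the course.

```python
# Assumed NFA: in state 2, input "a" may stay in 2 or move on to 3.
NFA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}
START, FINAL = 0, {4}

def nd_recognize(tape):
    """Track every state the NFA could be in after each symbol."""
    current = {START}
    for symbol in tape:
        current = {t for s in current for t in NFA.get((s, symbol), set())}
        if not current:
            return False             # every possible path has died
    return bool(current & FINAL)     # accept if any path ends in a final state

for s in ("baa!", "baaa!", "ba!"):
    print(s, nd_recognize(s))
```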

Page 24: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Languages

A language is a set of strings

String: A sequence of letters

Page 25: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Tracing FSA - Initial Configuration

Input String: a b b a

[Figure: automaton with states q0 to q5. Main path q0 –a→ q1 –b→ q2 –b→ q3 –a→ q4; q0 on b, q1 on a, q2 on a, q3 on b, and q4 on a,b all lead to q5, which loops on a,b. The automaton starts in q0.]

Page 26: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Reading the Input

[Figure sequence, slides 26 to 30: the automaton reads a b b a one symbol at a time, moving q0 → q1 → q2 → q3 → q4.]

Output: “accept”

Page 31: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Rejection

Input String: a b a

[Figure sequence, slides 31 to 35: the same automaton reads a b a one symbol at a time, moving q0 → q1 → q2 → q5. q5 is not a final state.]

Output: “reject”

Page 36: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Another Example

Input String: a a b

[Figure sequence, slides 36 to 40: automaton with states q0, q1, q2. q0 loops on a and moves to q1 on b; q1 moves to q2 on a,b; q2 loops on a,b. Reading a a b ends in q1.]

Output: “accept”

Page 41: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Rejection

[Figure sequence, slides 41 to 45: the same three-state automaton reads a three-symbol input (extracted as “ab b”) and ends in a non-final state.]

Output: “reject”

Page 46: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Formalities

Deterministic Finite Accepter (DFA)

M = (Q, Σ, δ, q0, F)

Q: set of states
Σ: input alphabet
δ: transition function
q0: initial state
F: set of final states

Page 47: ICS 482 Natural Language Processing Regular Expression and Finite Automata

About Alphabets

Alphabets means we need a finite set of symbols in the input.

These symbols can and will stand for bigger objects that can have internal structure.

Page 48: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Input Alphabet Σ

Σ = {a, b}

[Figure: the automaton with states q0 to q5 from the tracing example]

Page 49: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Set of States Q

Q = {q0, q1, q2, q3, q4, q5}

[Figure: the automaton with states q0 to q5 from the tracing example]

Page 50: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Initial State q0

[Figure: the automaton with states q0 to q5, with q0 marked as the initial state]

Page 51: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Set of Final States F

F = {q4}

[Figure: the automaton with states q0 to q5, with q4 marked as the final state]

Page 52: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Transition Function δ

δ : Q × Σ → Q

[Figure: the automaton with states q0 to q5 from the tracing example]

Page 53: ICS 482 Natural Language Processing Regular Expression and Finite Automata

δ(q0, a) = q1

δ(q0, b) = q5

δ(q2, b) = q3

[Figure sequence, slides 53 to 55: the corresponding arc is highlighted in the automaton for each transition.]

Page 56: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Transition Function δ

δ     a     b
q0    q1    q5
q1    q5    q2
q2    q5    q3
q3    q4    q5
q4    q5    q5
q5    q5    q5

Page 57: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Extended Transition Function δ* (Reads the entire string)

δ* : Q × Σ* → Q

[Figure: the automaton with states q0 to q5 from the tracing example]

Page 58: ICS 482 Natural Language Processing Regular Expression and Finite Automata

δ*(q0, ab) = q2

δ*(q0, abba) = q4

δ*(q0, abbbaa) = q5

Observation: There is a walk from q0 to q5 with label abbbaa
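The extended transition function is the one-step function folded over the input string. The sketch below assumes the six-state automaton of this example, in which every transition off the main path a b b a leads to the trap state q5. The names MAIN, delta, delta_star and accepts are illustrative.

```python
from functools import reduce

TRAP = "q5"
# Transitions on the main path; every pair not listed goes to the trap state q5.
MAIN = {("q0", "a"): "q1", ("q1", "b"): "q2", ("q2", "b"): "q3", ("q3", "a"): "q4"}

def delta(state, symbol):
    """One-step transition function δ : Q × Σ → Q."""
    return MAIN.get((state, symbol), TRAP)

def delta_star(state, word):
    """Extended transition function δ* : Q × Σ* → Q, folding δ over the word."""
    return reduce(delta, word, state)

def accepts(word, final=frozenset({"q4"})):
    """A word is accepted when δ*(q0, word) lands in a final state."""
    return delta_star("q0", word) in final

print(delta_star("q0", "ab"))       # q2
print(delta_star("q0", "abba"))     # q4
print(delta_star("q0", "abbbaa"))   # q5
```

With the default final set the only accepted word is abba. Passing a different final set changes the accepted language without touching the transitions.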

Page 62: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Example

L(M) = {abba}

[Figure: the automaton M with states q0 to q5; q4 is the accept state]

Page 63: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Another Example

L(M) = {λ, ab, abba}

[Figure: the same automaton M with three accept states: q0, q2, and q4]

Page 64: ICS 482 Natural Language Processing Regular Expression and Finite Automata

More Examples

L(M) = {a^n b : n ≥ 0}

[Figure: automaton with states q0, q1, q2. q0 loops on a and moves to q1 on b; q1 is the accept state and moves to q2 on a,b; q2 is a trap state looping on a,b.]

Page 65: ICS 482 Natural Language Processing Regular Expression and Finite Automata

L(M) = { all strings with prefix ab }

[Figure: automaton with states q0 to q3. q0 –a→ q1 –b→ q2; q2 is the accept state and loops on a,b; q0 on b and q1 on a lead to q3, which loops on a,b.]

Page 66: ICS 482 Natural Language Processing Regular Expression and Finite Automata

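L(M) = { all strings without substring 001 }

[Figure: automaton with a start state and states labeled 0, 00, and 001. The start state loops on 1 and moves to 0 on 0; state 0 returns to the start on 1 and moves to 00 on 0; state 00 loops on 0 and moves to 001 on 1; state 001 loops on 0,1.]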
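A hedged sketch of this automaton in Python. Each state is named after the longest suffix read so far that is also a prefix of 001, the empty string stands for the start state, and all states except the trap state 001 are taken to be accepting. The sketch then cross-checks the automaton against Python's own substring test.

```python
from itertools import product

DELTA = {
    ("", "0"): "0",       ("", "1"): "",
    ("0", "0"): "00",     ("0", "1"): "",
    ("00", "0"): "00",    ("00", "1"): "001",
    ("001", "0"): "001",  ("001", "1"): "001",   # trap: 001 has been seen
}
ACCEPT = {"", "0", "00"}

def avoids_001(word):
    """Run the DFA on a binary string and report whether it is accepted."""
    state = ""
    for bit in word:
        state = DELTA[(state, bit)]
    return state in ACCEPT

# Compare with the substring test on every binary string of length 0 to 8.
all_ok = all(
    avoids_001("".join(w)) == ("001" not in "".join(w))
    for n in range(9)
    for w in product("01", repeat=n)
)
print(all_ok)
```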

Page 67: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Regular Languages

A language L is regular if there is a DFA M such that L = L(M)

All regular languages form a language family

Page 68: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Example

The language L = {awa : w ∈ {a,b}*} is regular:

[Figure: automaton with states q0, q2, q3, and q4 that accepts L; arcs are labeled a, b, and a,b]

Page 69: ICS 482 Natural Language Processing Regular Expression and Finite Automata

Finite State Automata

Regular expressions can be viewed as a textual way of specifying the structure of finite-state automata.
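To illustrate the correspondence, the sketch below takes the language {awa : w ∈ {a,b}*} from the earlier example and compares a hand-built DFA with the regular expression /a[ab]*a/ on every string over {a,b} up to length 7. The state names are hypothetical and the check uses Python's re module.

```python
import re
from itertools import product

# Hand-built DFA for strings that start with a and end with a (length >= 2).
DELTA = {
    ("start", "a"): "need_a", ("start", "b"): "dead",
    ("need_a", "a"): "ok",    ("need_a", "b"): "need_a",
    ("ok", "a"): "ok",        ("ok", "b"): "need_a",
    ("dead", "a"): "dead",    ("dead", "b"): "dead",
}

def dfa_accepts(word):
    state = "start"
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state == "ok"

def regex_accepts(word):
    return re.fullmatch(r"a[ab]*a", word) is not None

words = ["".join(w) for n in range(8) for w in product("ab", repeat=n)]
agree = all(dfa_accepts(w) == regex_accepts(w) for w in words)
print(agree)
```

Agreement on a finite sample is only evidence, not a proof, that the two descriptions define the same language.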

Page 70: ICS 482 Natural Language Processing Regular Expression and Finite Automata

More Formally

You can specify an FSA by enumerating the following things.
– The set of states: Q
– A finite alphabet: Σ
– A start state
– A set of accept/final states
– A transition function that maps QxΣ to Q