The Simplest NL Applications: Text Searching and Pattern Matching
Read J & M Chapter 2
Searching for a Single String Using a Nondeterministic FSM
[Figure: nondeterministic FSM with states 1 through 8 and transitions labeled c, o, c, o, n, u, t]
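A nondeterministic machine like this one can be simulated by tracking the set of states it could currently be in. The following is a minimal sketch, assuming the state numbering 1 through 8 used on the slide and a self-loop on state 1 for every input character; the function name `nfsm_search` is hypothetical.

```python
def nfsm_search(pattern, text):
    """Return the end positions (exclusive) of every occurrence of pattern in text.

    States are numbered 1..len(pattern)+1. State 1 is the start state
    (assumed to loop on every character) and the last state accepts.
    """
    accept = len(pattern) + 1
    active = {1}                      # states the machine could be in right now
    hits = []
    for i, ch in enumerate(text):
        nxt = {1}                     # self-loop on state 1: a match may start anywhere
        for s in active:
            if s < accept and pattern[s - 1] == ch:
                nxt.add(s + 1)        # advance one step along the pattern
        active = nxt
        if accept in active:
            hits.append(i + 1)
    return hits


print(nfsm_search("coconut", "cococonut"))
```

Keeping a set of active states avoids backtracking: each input character is read exactly once.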
Searching for a Single String Using the Boyer-Moore Algorithm
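The slide gives no details of the algorithm, so the following is only a sketch of the Horspool simplification of Boyer-Moore, which keeps the bad-character shift and omits the good-suffix rule. It compares the pattern right to left and uses the text character under the last pattern position to decide how far to slide. The name `horspool_search` is hypothetical.

```python
def horspool_search(pattern, text):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # For each character except the last, the distance from its rightmost
    # occurrence to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1                    # compare right to left
        if j < 0:
            return pos                # every position matched
        # Slide so the text character under the window's last position
        # lines up with its rightmost occurrence in the pattern.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("coconut", "lococonut"))
```

The payoff is that many text characters are never examined, because a mismatch near the right end can justify a shift of up to the full pattern length.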
Searching for Multiple Strings
[Figure: FSM for multiple strings, extending the coconut machine (states 1 through 8) with additional arcs labeled o, c, o, s, and l]
Example: lococonut
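Searching for several strings at once can be simulated the same way, with one partial match ("thread") per pattern prefix that is still alive. This is a minimal sketch; the pattern list used below is a placeholder chosen for illustration, not the set of strings on the original slide, and `multi_search` is a hypothetical name. Patterns are assumed to be non-empty.

```python
def multi_search(patterns, text):
    """Return sorted (start, pattern) pairs for every occurrence of any pattern."""
    threads = set()   # each thread is (pattern index, characters matched so far)
    found = []
    for i, ch in enumerate(text):
        # A fresh attempt for every pattern may begin at every text position.
        candidates = threads | {(p, 0) for p in range(len(patterns))}
        threads = set()
        for p, k in candidates:
            if patterns[p][k] == ch:
                if k + 1 == len(patterns[p]):
                    found.append((i + 1 - len(patterns[p]), patterns[p]))
                else:
                    threads.add((p, k + 1))
    return sorted(found)


print(multi_search(["coconut", "coco", "nut"], "lococonut"))
```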
Converting to a Deterministic FSM
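[Figure: deterministic version of the multiple-string FSM (states 1 through 8 for c, o, c, o, n, u, t, with additional arcs labeled o, c, o, s, and l)]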
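One standard way to do this conversion is the subset construction: each deterministic state is a set of nondeterministic states. The sketch below applies it to the single-string machine for simplicity, assuming the same 1-based state numbering and the self-loop on state 1; `build_dfsm` and `dfsm_contains` are hypothetical names, and the pattern is assumed to be non-empty.

```python
def build_dfsm(pattern, alphabet):
    """Subset construction for the single-pattern search machine."""
    accept = len(pattern) + 1

    def step(states, ch):
        nxt = {1}                                  # state 1 loops on every character
        for s in states:
            if s < accept and pattern[s - 1] == ch:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({1})
    table = {}                                     # DFSM state -> {char: DFSM state}
    todo = [start]
    while todo:
        cur = todo.pop()
        if cur in table:
            continue
        table[cur] = {}
        for ch in alphabet:
            nxt = step(cur, ch)
            table[cur][ch] = nxt
            if nxt not in table:
                todo.append(nxt)
    return start, table, accept


def dfsm_contains(pattern, text):
    """Run the deterministic machine: exactly one table lookup per character."""
    start, table, accept = build_dfsm(pattern, set(pattern) | set(text))
    state = start
    for ch in text:
        state = table[state][ch]
        if accept in state:
            return True
    return False


print(dfsm_contains("coconut", "lococonut"))
```

For a single pattern, only subsets that correspond to a matched prefix are reachable, so the deterministic machine has no more states than the nondeterministic one.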
Regular Expressions
Two different (but related) uses of the term:
•Expressions that define all and only the regular languages
•(aa ∪ ab ∪ ba ∪ bb)*
•Expressions in a useful pattern language
Matching ip addresses:
s!<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>!<inet>$1</inet>!
Finding doubled words:
\< ([A-Za-z]+) \s+ \1 \>
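Both patterns can be tried out in Python's `re` module. This is a sketch rather than a translation of any particular tool: Python has no `\<` and `\>` word-boundary operators, so `\b` is used instead, and the replacement group is written `\1` rather than `$1`. The function names are hypothetical.

```python
import re

# Re-tag dotted sequences of four numbers that appear inside <emphasis> tags.
IP_IN_EMPHASIS = re.compile(r"<emphasis>([0-9]+(\.[0-9]+){3})</emphasis>")

# A word, some whitespace, then the same word again.
DOUBLED_WORD = re.compile(r"\b([A-Za-z]+)\s+\1\b")


def tag_ip(text):
    return IP_IN_EMPHASIS.sub(r"<inet>\1</inet>", text)


def find_doubled(text):
    return [m.group(1) for m in DOUBLED_WORD.finditer(text)]


print(tag_ip("host <emphasis>128.83.120.1</emphasis> is up"))
print(find_doubled("Paris in the the spring"))
```

The backreference `\1` in the doubled-word pattern goes beyond the formal regular expressions defined in the first sense of the term; it is a feature of the pattern language.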
REs: Syntax and Semantics
Syntax
The regular expressions over an alphabet Σ are all strings over the alphabet Σ ∪ {(, ), ∅, ∪, *} that can be obtained as follows:
1. ∅ and each member of Σ is a regular expression.
2. If α, β are regular expressions, then so is αβ.
3. If α, β are regular expressions, then so is α ∪ β.
4. If α is a regular expression, then so is α*.
5. If α is a regular expression, then so is (α).
6. Nothing else is a regular expression.
REs: Syntax and Semantics
Regular expressions define languages via a semantic interpretation function we'll call L:
1. L(∅) = ∅ and L(a) = {a} for each a ∈ Σ
2. If α, β are regular expressions, then L(αβ) = L(α) L(β) = all strings that can be formed by concatenating some string from L(α) with some string from L(β).
3. If α, β are regular expressions, then L(α ∪ β) = L(α) ∪ L(β)
4. If α is a regular expression, then L(α*) = L(α)*
5. If (α) is a regular expression, then L( (α) ) = L(α)
A language is regular if and only if it can be described by a regular expression.
Note: L is compositional.
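Compositionality means the meaning of an expression is computed only from the meanings of its parts, which maps directly onto a recursive function. The sketch below is one way to write L in Python, assuming a hypothetical nested-tuple encoding of regular expressions and a length bound `max_len` so that the (possibly infinite) language can be listed.

```python
def L(regex, max_len):
    """Strings of length <= max_len in the language denoted by regex.

    Hypothetical encoding: ("empty",), ("sym", a), ("cat", r1, r2),
    ("union", r1, r2), ("star", r), ("paren", r).
    """
    op = regex[0]
    if op == "empty":                              # L(∅) = ∅
        return set()
    if op == "sym":                                # L(a) = {a}
        return {regex[1]} if len(regex[1]) <= max_len else set()
    if op == "cat":                                # concatenation of the two languages
        left, right = L(regex[1], max_len), L(regex[2], max_len)
        return {x + y for x in left for y in right if len(x + y) <= max_len}
    if op == "union":                              # union of the two languages
        return L(regex[1], max_len) | L(regex[2], max_len)
    if op == "star":                               # zero or more concatenations
        base = L(regex[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    if op == "paren":                              # parentheses do not change meaning
        return L(regex[1], max_len)
    raise ValueError("unknown operator: " + str(op))


a, b = ("sym", "a"), ("sym", "b")
pairs = [("cat", x, y) for x in (a, b) for y in (a, b)]      # aa, ab, ba, bb
union = ("union", ("union", ("union", pairs[0], pairs[1]), pairs[2]), pairs[3])
even = ("star", ("paren", union))                            # (aa ∪ ab ∪ ba ∪ bb)*
print(sorted(L(even, 2)))
```

Each branch of the function corresponds to one clause of the definition of L and calls L only on subexpressions.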
The Importance of Compositionality
What is the meaning of:
Mary cooked the yujutes.
Mary tyroked the yujutes.
Morphological Analysis
•Read J & M Chapter 3
•Recognize words
•Parse words
Morphological Parsing
Goal: to represent the facts declaratively so that a single representation can be used for both recognition and generation.
Note: ^ marks morpheme boundaries. # marks word boundaries.
From Lexical to Intermediate
Note: All the transducers in the book are described as lexical:intermediate, but they can also run in the other direction.
Where Did reg-noun-stem Come From?
We Can Cascade or Compose
From Intermediate to Surface
For text, we need spelling rules.
ε → e / {x, s, z} ^ ___ s #
Read this as "Replace ε with e in the context after the /."
Turning the Rule into a Transducer
foxes
xerox
fox#sat
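The effect of the e-insertion rule on strings like these can be checked with a short program. The sketch below uses a regular-expression substitution as a stand-in for the transducer, not a state-by-state implementation of it. It assumes intermediate forms such as `fox^s#`, and it assumes that the surface form drops `^` and turns `#` into a space; `to_surface` is a hypothetical name.

```python
import re


def to_surface(intermediate):
    """Apply e-insertion, then remove boundary symbols."""
    # Insert e after a morpheme boundary that follows x, s, or z
    # and precedes a word-final s.
    with_e = re.sub(r"(?<=[xsz])\^(?=s#)", "^e", intermediate)
    # Morpheme boundaries disappear; word boundaries become spaces.
    return with_e.replace("^", "").replace("#", " ").strip()


for form in ["fox^s#", "cat^s#", "xerox#", "fox#sat#"]:
    print(form, "->", to_surface(form))
```

The last two inputs show why the context matters: an x that is not followed by `^s#` must pass through unchanged.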
Disambiguation - Local
Local ambiguities:
asses#
s#luxury
Disambiguation - Harder
Sometimes additional knowledge is necessary:
foxes: fox +N +PL or fox +V +SG
Can we think of nouns that cannot also be verbs?
Search
•For FSMs, we can build a deterministic machine.
•In other cases, we will have to search:
•Depth-first
•Breadth-first – chart parsing
[Figure: parse trees (node labels S, VP, NP, PP, V, PR, N, DET, PREP) for the sentence "I hit the boy with a bat."]
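The difference between depth-first and breadth-first search comes down to how the agenda of unexplored alternatives is managed: as a stack or as a queue. The sketch below illustrates this on nondeterministic FSM recognition rather than on parsing; the machine used in the example (accepting strings over {a, b} that end in "ab") and the name `accepts` are hypothetical.

```python
from collections import deque


def accepts(transitions, start, finals, text, depth_first=True):
    """Agenda-based recognition with a nondeterministic FSM.

    transitions maps (state, char) to a list of possible next states.
    """
    agenda = deque([(start, 0)])          # search states: (machine state, text position)
    seen = set()
    while agenda:
        # Stack discipline gives depth-first search; queue gives breadth-first.
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        if pos == len(text):
            if state in finals:
                return True
            continue
        for nxt in transitions.get((state, text[pos]), ()):
            agenda.append((nxt, pos + 1))
    return False


ends_in_ab = {(0, "a"): [0, 1], (0, "b"): [0], (1, "b"): [2]}
print(accepts(ends_in_ab, 0, {2}, "aab", depth_first=True))
print(accepts(ends_in_ab, 0, {2}, "aab", depth_first=False))
```

Both strategies give the same answer here; they differ in the order in which alternatives are explored and in how much of the agenda is held in memory at once.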