Post on 31-Dec-2015
description
Natural Language Processing
Finite State Transducers
May 2007 CLINT-CL 2
Acknowledgement
• Material in this lecture derived/copied from
• Richard Sproat CL46 Lectures
• Lauri Karttunen LSA lectures 2005
May 2007 CLINT-CL 3
Three Key Concepts
Finite StateTransducers
RegularRelations
ComputationalMorphology
May 2007 CLINT-CL 4
Three Key Concepts
Finite StateTransducers
RegularRelations
ComputationalMorphology
May 2007 CLINT-CL 5
A Regular Set
ababababababababababababababab..
L1
May 2007 CLINT-CL 6
Two Regular Sets
ababababababababababababababab..
bababababababababababababababa..
L1 L2
May 2007 CLINT-CL 7
A Regular Relation L1 x L2
ababababababababababababababab..
bababababababababababababababa..
L1 L2
or {("ab","ba"), ("abab","baba"),...}
May 2007 CLINT-CL 8
Concatenation and Power
ConcatenationR1 = {("a","b")}R2 = {("c","d")}[R1 R2] = {("ac","bd")}
PowerR1+ = {("a","b"),("aa","bb"), ("aaa","bbb"), ...}
May 2007 CLINT-CL 961
Composition
• R1 ○ R2 denotes the composition of relations R1 and R2.
• DefinitionIf R1 contains <x,y>
And R2 contains <y,z>
Then R1 ○ R2 contains <x,z>
• R1 and R2 and B must be relations. If either is just a language, it is assumed to abbreviate the identity relation.
• R1 ○ R2 is written [R1 .o. R2] in xfst
May 2007 CLINT-CL 10
Closure Properties of Regular Languages and Relations
Operation Regular Languages Regular RelationsUnion yes yesConcatenation yes yesIteration yes yes
Intersection yes noSubtraction yes noComplementation yes no
Composition n/a yes
May 2007 CLINT-CL 11
Three Key Concepts
Finite StateTransducers
RegularRelations
ComputationalMorphology
May 2007 CLINT-CL 12
Morphology
• Study of word structure and formation in terms of morphemes
• Morpheme: smallest independent unit in a word that bear some meaning, such as rabbit and s.
• Two kinds of morphology– Inflectional– Derivational
May 2007 CLINT-CL 13
Inflectional/DerivationalMorphology
• Inflectional+s plural+ed past
• category preserving• productive: always
applies (esp. new words, e.g. fax)
• systematic: same semantic effect
• Derivational+ment
• category changingallot+ment
• not completely productive: detractment*
• not completely systematic: department
May 2007 CLINT-CL 14
Inflectional Morphology ExampleEnglish Noun Inflections
Regular Irregular
Singular cat church mouse ox
child
Plural cats churches mice oxen
children
May 2007 CLINT-CL 15
Computational Morphology: Concrete Processes
Analysis
leaves
leaf N Pl leave N Pl leave V Sg3
Generation
hang V Past
hanged hung
May 2007 CLINT-CL 16
Two Challenges for Computational Morphology
• Handling Morphotactics– Words are composed of smaller elements that must
be combined in a certain order:• piti-less-ness is English• piti-ness-less is not English
• Handling Phonological Alternations– The shape of an element may vary depending on the
context• pity is realized as piti in pitilessness• die becomes dy in dying
May 2007 CLINT-CL 17
Morphological Parsing
• The goal of morphological parsing is to find out what morphemes a given word is built from.
cats cat Noun PL
mice mouse Noun PL
foxes fox N PL
foxes fox V 3SING PRES• Two kinds of information
– Lexeme (which dictionary word)– Morphosyntactic information
May 2007 CLINT-CL 18
Morphological Parsing
MorphologicalParser
Input Word
cats
OutputAnalysis
cat N PL
Issues• Non-determinism• Arbitrary choice of morpheme symbols
May 2007 CLINT-CL 19
Morphological Generation
MorphologicalGenerator
Output Word
cats
InputAnalysis
cat N PL
Issues• Reversibility: how much can be shared regardless of direction?
May 2007 CLINT-CL 20
Morphology as a Regular Relation
catcatsmicelives...
catcat+N+PLmouse+N+PLlife+N+PLlive+V+3SING..
surface language lexical language
or {("cat,cat"),("cats","cat+N+PL"),......}
May 2007 CLINT-CL 21
Three Key Concepts
Finite StateTransducers
RegularRelations
ComputationalMorphology
May 2007 CLINT-CL 22
FSA
a
Used for• Recognition• Generation
May 2007 CLINT-CL 23
Finite State Transducers
• A finite state transducer (FST) is essentially an FSA finite state automaton that works on two (or more) tapes.
• The most common way to think about transducers is as a kind of translating machine which works by reading from one tape and writing onto the other.
May 2007 CLINT-CL 24
FST Definition
• A 2 way FST is a quintuple (K,s,F,ixo,) where
i, o are input and output alphabets
• K is a finite set of states
• s K is an initial state
• FK are final states is a transition relation of type
K x i x o x K
May 2007 CLINT-CL 25
FST
a
Used for• Recognition• Generation• Translation
b
upper tape
lower tape
May 2007 CLINT-CL 26
A Very Simple Transducer
a
b
Relation { ("a","b") }
Notation a:b encodes the transition
May 2007 CLINT-CL 27
A Very Simple Transducer
a
b
a:b
also written as
May 2007 CLINT-CL 28
Symbol Pairs
• Symbols vs. symbol pairs– In general, no distinction is made in
xfst betweena the language {“a”}a:a the identity relation
{(“a”, “a”)}
a
a
May 2007 CLINT-CL 29
A Very Simple Transducer
a
b
a:b
upper side
lower side
May 2007 CLINT-CL 30
A (more interesting) Transducer
• Relation
{ ("a","b"), ("aa","bb"), ...}
• Notationa:b*
• N.B. with this notation a and b must be single symbols
1
a:b
May 2007 CLINT-CL 31
Transducer have SeveralModes of Operation
• generation mode: It writes on both tapes. A string of as on one tape and a string of bs on the other tape. Both strings have the same length.
• recognition mode: It accepts when the word on the first tape consists of exactly as many as as the word on the second tape consists of bs.
• translation mode (left to right): It reads as from the first tape and writes a b for every a that it reads onto the second tape.
• translation mode (right to left): It reads bs from the second tape and writes an a for every b that it reads onto the first tape.
May 2007 CLINT-CL 32
The Basic Idea
• Morphology is regular
• Morphology is finite state
May 2007 CLINT-CL 33
Morphology is Regular
• The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation, e.g.
{ ("leaf+N+Pl","leaves"),("hang+V+Past","hung"),...}
• Regular relations are closed under operations such as concatenation, iteration, union, and composition.
• Complex regular relations can be derived from simpler relations.
May 2007 CLINT-CL 34
Morphology is finite-state
• A regular relation can be defined using the metalanguage of regular expressions.
• [{talk} | {walk} | {work}]• [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}];
• A regular expression can be compiled into a finite-state transducer that implements the relation computationally.
May 2007 CLINT-CL 35
Compilation
• [{talk} | {walk} | {work}]• [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:
{ed}];
Regular expression
k
t
a
a
wo
l
r
+Progr:i :g
+3rdSg:s
+Past:e :d
:n
+Base:
Finite-state transducer
finalstate
initialstate
May 2007 CLINT-CL 36
work+3rdSg --> works
k:k
t:t
a:a
a:a
w:wo:o
l:l
r:r
+Progr:i :g
+3rdSg:s
+Past:e :d
:n
+Base:
Generation
May 2007 CLINT-CL 37
talked --> talk+Past
k:k
t:t
a:a
a:a
w:wo:o
l:l
r:r
+Progr:i :g
+3rdSg:s
+Past:e :d
:n
+Base:
Analysis
May 2007 CLINT-CL 38
XFST Demo 2
• xfst[0]: regex • [{talk} | {walk} | {work}]• [% +Base:0 | %+SgGen3:s | %+Progr:{ing} | %
+Past:{ed}];
% xfstxfst[0]:
start xfst
compile a regular expression
apply the resultxfst[1]: apply up walkedwalk+Past
xfst[1]: apply down talk+SgGen3talks
May 2007 CLINT-CL 39
Lexical transducer
veut
vouloir +IndP +SG + P3
Finite-state transducer
inflected form
citation form inflection codes
v o u l o i r +IndP +SG +P3
v e u t
• Bidirectional: generation or analysis• Compact and fast• Comprehensive systems have been
built for over 40 languages:– English, German, Dutch, French,
Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, …
May 2007 CLINT-CL 40
How lexical transducers are made
LexiconFST
RuleFSTs
Compiler
f a t +Adj
r
+Comp
f a t t e
Lexical Transducer(a single FST)composition
LexiconRegular Expression
RulesRegular Expressions
Morphotactics
Alternations
May 2007 CLINT-CL 41
Sequential Model
...
Surface form
Intermediate form
Lexical form
fst 1
fst 2
fst n
Ordered sequenceof rewrite rules(Chomsky & Halle ‘68)can be modeledby a cascade offinite-state transducersJohnson ‘72Kaplan & Kay ‘81
May 2007 CLINT-CL 42
Sequential application
N -> m / _ p
p -> m / m _
k a N p a n
k a m p a n
k a m m a n
May 2007 CLINT-CL 43
Sequential application in detail
N:m
N
?? 0
2
1
pN:m
m
pN
m
p:m
?? 0 1
mp
m
k a N p a n
k a m p a n
k a m m a n
0 0 0 2 0 0 0
0 0 0 1 0 0 0
May 2007 CLINT-CL 44
Composition of Two Stages
N:m
N
?? 0
3
1
N:m
m
p
N
?
m2
p:m
p:m
N m
N:mk a N p a n
k a m m a n
0 0 0 3 0 0 0
May 2007 CLINT-CL 45
Composition Example
• A .o. B The relation C such that if A maps x to y and B maps y to z, C maps x to z.
b:B c:Ca:A
b ca
a:A
b:B
c:C
d:D {abc} .o. [a:A | b:B | c:C | d:D]*= [a:A b:B c:C]
May 2007 CLINT-CL 46
Crossproduct
• A .x. B The relation that maps every string in A to every string in B, and vice versa
• A:B Same as [A .x. B].
b:y c:0a:x
a b c .x. x y [a b c] : [x y] {abc}:{xy}
May 2007 CLINT-CL 47
Xerox RE Operators
• $ containment• => restriction• -> replacement• @-> longest match replacement
– Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.
May 2007 CLINT-CL 48
Containment
aa?? ?? aa$a$a
[?* a ?*][?* a ?*]
May 2007 CLINT-CL 49
Restriction
??cc
bb
bb
cc?? aa
cc
a => b _ ca => b _ c
““AnyAny aa must be preceded bymust be preceded by bband followed byand followed by cc.”.”
~[~[?* b] a ?*] & ~[?* a ~[c ?*]] ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
Equivalent expression Equivalent expression
May 2007 CLINT-CL 50
Replacement
a:ba:b
bb
aa
??
??
b:ab:a
aa
a:ba:b
a b -> b a
““Replace ‘ab’ by ‘ba’.”Replace ‘ab’ by ‘ba’.”
[[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]
Equivalent expression Equivalent expression
May 2007 CLINT-CL 51
Marking
0:[0:[
[[
0:]0:]
??
aa
ee
iioo
uu]]
a|e|i|o|u -> %[ ... %]
p o t a t op o t a t op[o]t[a]t[o]p[o]t[a]t[o]
May 2007 CLINT-CL 52
a b | b | b a | a b a -> x
(a) b (a) -> x
applied to “aba”
a b a a b a a b a a b a
a x a a x x a x
Multiple Results
Four factorizations of the input string.
May 2007 CLINT-CL 53
@-> Left-to-right, Longest-match Replacement
(a) b (a) @-> x
applied to “aba”
a b a a b a a b a a b a
a x a a x x a x
May 2007 CLINT-CL 54
Conditional Replacement
The relation that replaces A by B between L and R leaving everything else unchanged.
A -> BA -> B
Replacement
L _ RL _ R
Context
Sources of complexity:
Replacements and contexts may overlap
Alternative ways of interpreting “between left and right.”A -> B || L _ R both contexts on the inputA -> B // L _ R left context on the outputA -> B \\ L _ R right context on the output