Morphology, Phonology & FSTs
Shallow Processing Techniques for NLP
Ling570
October 12, 2011
Roadmap
Motivation: Representing words
A little (mostly English) morphology
Stemming
FSTs & Morphology
  Stemming
  Morphological analysis
FSTs & Phonology
Lexicon
Goal: Compact representation of all surface forms in a language
Enumeration:
  Impractical for morphologically rich languages
  Descriptively unsatisfying for most languages
Orthographic variation: fly + er -> flier
Morphological variation: saw + s -> saws; fish + s -> fish; goose + s -> geese
Phonological variation: dog + s -> dog + /z/; fox + s -> fox + /IH Z/
Morphological Parsing
Goal: Take a surface word form and generate a linguistic structure of component morphemes
A morpheme is the minimal meaning-bearing unit in a language.
  Stem: the morpheme that forms the central meaning unit in a word
  Affix: prefix, suffix, infix, circumfix
    Prefix: e.g., possible -> impossible
    Suffix: e.g., walk -> walking
    Infix: e.g., hingi -> humingi (Tagalog)
    Circumfix: e.g., sagen -> gesagt (German)
Combining Morphemes
Inflection: Stem + gram. morpheme -> same class
  E.g.: help + ed -> helped
Derivation: Stem + gram. morpheme -> new class
  E.g.: walk + er -> walker (N)
Compounding: multiple stems -> new word
  E.g.: doghouse, catwalk, …
Clitics: stem + clitic
  E.g.: I + ll -> I’ll; he + is -> he’s
Inflectional Morphology (Mostly English)
Relatively simple inflectional system
  Nouns, verbs, some adjectives
Noun inflection: Only plural, possessive
  Non-English???
Plural: mostly stem + ‘s’; ‘es’ after s, z, sh, ch, x
Possessive: singular and irregular plural: + ’s; regular plural, or after s, z: + ’
            Regular            Irregular
Singular    cat    thrush      goose   ox
Plural      cats   thrushes    geese   oxen
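The plural and possessive rules on this slide are regular enough to write down directly. A minimal sketch, assuming a tiny hand-listed table of irregular plurals; the function names and the `is_regular_plural` flag are illustrative and not from the lecture:

```python
# Irregular plurals must be listed; they cannot be derived from the stem.
IRREGULAR_PLURALS = {"goose": "geese", "ox": "oxen"}


def pluralize(stem):
    """Regular plural: 'es' after s, z, sh, ch, x; otherwise 's'."""
    if stem in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[stem]
    if stem.endswith(("s", "z", "sh", "ch", "x")):
        return stem + "es"
    return stem + "s"


def possessive(noun, is_regular_plural=False):
    """Bare apostrophe for regular plurals and after s, z; otherwise 's."""
    if is_regular_plural or noun.endswith(("s", "z")):
        return noun + "'"
    return noun + "'s"


print(pluralize("thrush"), possessive("cats", is_regular_plural=True))
```

Note that the irregular lookup has to come first: `ox` ends in `x` and would otherwise become `oxes`.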
Verb Inflectional Morphology
Classes:
  Main (eat, hit), modal (can, should), primary (be, have)
  Only main, primary inflected
Regular verbs: Forms predictable from stem, productive
Irregular verbs: Only about 250, but very frequent
Regular verbs:
Form          walk      merge     try      map
-s form       walks     merges    tries    maps
-ing part     walking   merging   trying   mapping
past (-ed)    walked    merged    tried    mapped
Irregular verbs:
Stem     -s form    -ing part   past     past part
eat      eats       eating      ate      eaten
catch    catches    catching    caught   caught
cut      cuts       cutting     cut      cut
Derivational Morphology
Relatively complex, common in English
Nominalization: Verb or Adj + affix -> Noun
Suffix    Base          Derived Noun
-ation    computerize   computerization
-ee       appoint       appointee
-er       kill          killer
-ness     fuzzy         fuzziness
Adjectives: Verb or Noun + affix -> Adj
Suffix    Base          Derived Adjective
-al       computation   computational
-able     embrace       embraceable
-less     clue          clueless
Cliticization
Clitics: between affix and word
  Like an affix: short, reduced
  Like a word: act as pronouns, articles, conjunctions, verbs
In English:
  Presence is (mostly) unambiguous: marked by an apostrophe (’)
  Meaning is often ambiguous: e.g. he’s (he is / he has)
More complex in other languages: e.g. Arabic
  Proclitics (prefixed clitics) can be articles, prepositions, conjunctions
  No markers set them off in writing
Removal of such clitics is often referred to as light stemming
Stemming
Simple type of morphological analysis
Commonly used in information retrieval (IR)
  Supports matching using base form
  E.g. television, televised, televising -> televise
  Typically improves retrieval of short documents. Why?
Most popular: Porter stemmer (snowball.tartarus.org)
Task: Given surface form, produce base form
  Typically, removes suffixes
Model: Rule cascade
  No lexicon!
Porter Stemmer
Rule cascade:
  Rule form: (condition) PATT1 -> PATT2
    E.g. (stem contains vowel) ING -> ε
         ATIONAL -> ATE
  Rule partial order:
    Step 1a: -s
    Step 1b: -ed, -ing
    Steps 2-4: derivational suffixes
    Step 5: cleanup
Pros: Simple, fast, buildable for a variety of languages
Cons: Overaggressive and underaggressive
  Limited in application
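To make the rule-cascade idea concrete, here is a toy cascade in the same rule form, (condition) PATT1 -> PATT2. It is a sketch only and not the real Porter stemmer: it has three steps and a handful of rules, the `always` condition stands in for Porter's measure-based conditions, and all names are illustrative.

```python
VOWELS = set("aeiou")


def contains_vowel(stem):
    return any(ch in VOWELS for ch in stem)


def always(stem):
    return True


# Each step is an ordered list of (suffix, replacement, condition on the stem).
STEPS = [
    # Step 1a: plural -s
    [("sses", "ss", always), ("ies", "i", always), ("ss", "ss", always), ("s", "", always)],
    # Step 1b: -ed, -ing (only if the remaining stem contains a vowel)
    [("ing", "", contains_vowel), ("ed", "", contains_vowel)],
    # One derivational rule standing in for steps 2-4
    [("ational", "ate", always)],
]


def apply_step(word, rules):
    """Apply the first rule whose suffix matches; at most one rule fires per step."""
    for suffix, replacement, condition in rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return stem + replacement if condition(stem) else word
    return word


def stem(word):
    word = word.lower()
    for rules in STEPS:
        word = apply_step(word, rules)
    return word


for w in ["walking", "sing", "relational", "televised", "television", "news"]:
    print(w, "->", stem(w))
```

Even this toy version shows both failure modes from the slide: `news` is cut to `new` (overaggressive), while `televised` and `television` end up with different stems (underaggressive). No lexicon is consulted at any point.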
FST Morphological Analysis
Focus on English morphology
FSA acceptor: cats -> yes; foxes -> yes; childs -> no
FST morphological analyzer: fox + N + pl -> fox^s#
FST for orthographic rules: fox^s# -> foxes#
Morphological Analysis Components
Lexicon: List of stems and affixes
  E.g.: cat: N; -s: Pl
Morphotactics: Model of morpheme ordering
  Association with classes, affix ordering
  E.g. Pl follows N
Orthographic rules: Spelling rules
  Changes when morphemes combine
  E.g. y -> ie in try + s
Example
Goal: foxes -> fox + N + Pl
Surface: foxes
  (orthographic rules)
Intermediate: fox^s
  (lexicon + morphotactics)
Lexical: fox + N + Pl
Multiple Levels
Generation and Analysis
Generation: fox + N + Pl -> fox^s#; fox^s# -> foxes#
Analysis: foxes# -> fox^s#; fox^s# -> fox + N + Pl
The Lexicon
Repository for words:
  Simplest would be enumeration
  Impractical (at least) for many languages
Includes stems, affixes, some morphotactics
  E.g. cat: N, +sg; fly: V, +base
  What about: flies: V, +sg +3rd?
Common model of morphotactics: FSA
Basic Noun Lexicon (J&M, Ch. 3)
As an FSA (diagram not reproduced here), over these classes:
reg-noun    irreg-pl-noun    irreg-sg-noun    plural
fox         geese            goose            -s
cat         sheep            sheep
dog         mice             mouse
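The noun lexicon above can be read as a small FSA over morpheme classes. The sketch below assumes the usual three-state layout for this example (regular nouns optionally followed by the plural affix; irregular singular and plural forms accepted as they are). The state names `q0`-`q2` and the accepting set are assumptions, since the slide's diagram is not in the transcript.

```python
LEXICON = {
    "reg-noun": {"fox", "cat", "dog"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "plural": {"-s"},
}

# Assumed morphotactics: reg-noun (+ plural), or an irregular form alone.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
ACCEPTING = {"q1", "q2"}


def classes_of(morpheme):
    """A morpheme may belong to several classes (e.g. sheep)."""
    return [cls for cls, members in LEXICON.items() if morpheme in members]


def accepts(morphemes):
    """Track all reachable states, since class membership can be ambiguous."""
    states = {"q0"}
    for morpheme in morphemes:
        states = {
            TRANSITIONS[(state, cls)]
            for state in states
            for cls in classes_of(morpheme)
            if (state, cls) in TRANSITIONS
        }
        if not states:
            return False
    return bool(states & ACCEPTING)


print(accepts(["cat", "-s"]), accepts(["goose", "-s"]))
```

This only checks morpheme order. It happily accepts `fox` + `-s`, so spelling still has to be handled separately.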
Lexicon for English Verbs
Verbs and classes:
reg-v-stem   irreg-v-stem   irreg-past-v-form   past   past-part   pres-part   3sg
walk         cut            caught              -ed    -ed         -ing        -s
fry          speak          ate
talk         sing           eaten
impeach                     sang
FSAs for Morphotactics
We have:
  stems (and stem class identities)
    e.g. cat: reg-noun; goose: irreg-noun
    e.g. walk: reg-verb-stem; cut: irreg-verb-stem
  affixes (by form and class)
    e.g. -s: Plural
    e.g. -ed: past, past-part
  morphotactic FSAs:
    Accept combinations of stems & affixes in the language
    Reject otherwise
Recognition vs Analysis/Generation
Recognition: Can validate a morphological sequence
  Recognition not usually the main goal
Analysis: Given a surface form, produce component morphemes
Generation: Given some morphological structure, produce full surface form
Requires translation from one form to another
  FSTs
Schematic FST
cat + N + Pl -> cat^s#
Map morph features to empty string if there is no corresponding output
Updating the Lexicon
Need words, not just classes, as FST
  fox -> fox
  Need: geese -> goose + N + Pl
Assume f:f written as f
reg-noun    irreg-pl-noun      irreg-sg-noun
fox         g o:e o:e s e      g o o s e
cat         sheep              sheep
aardvark    m o:i u:ε s:c e    mouse
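The pair notation in the table can be unpacked mechanically: each symbol is a lexical:surface pair, and a bare `f` abbreviates `f:f`. A small sketch, with hypothetical helper names, that reads such an entry and recovers both tapes:

```python
EPSILON = "ε"


def parse_pairs(spec):
    """Turn 'g o:e o:e s e' into [(lexical, surface), ...]; bare 'f' means f:f."""
    pairs = []
    for token in spec.split():
        upper, _, lower = token.partition(":")
        pairs.append((upper, lower if lower else upper))
    return pairs


def tape(pairs, index):
    """Read one side of the pairs (0 = lexical, 1 = surface), dropping epsilons."""
    return "".join(p[index] for p in pairs if p[index] != EPSILON)


IRREG_PL_NOUNS = ["g o:e o:e s e", "s h e e p", "m o:i u:ε s:c e"]

# Surface form -> lexical stem, as needed for analysis (geese -> goose + N + Pl).
surface_to_lexical = {
    tape(parse_pairs(spec), 1): tape(parse_pairs(spec), 0) for spec in IRREG_PL_NOUNS
}
print(surface_to_lexical)
```

Because every entry carries both tapes, the same table serves generation (read lexical, write surface) and analysis (read surface, write lexical).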
Adding Orthographic Rules
Current transducer concatenates morphemes
  Should work for cats, aardvarks, mice, ...
  foxs?
Problem: spelling changes at morpheme boundaries
Many such rules:
  Consonant doubling before -ing, -ed
  E-deletion: silent e dropped before -ing, -ed
  Y replacement: y -> ie before -s, i before -ed, etc.
Approach: Transducers for orthographic rules
Creating an Orthographic Rule
Goal: Correct e insertion in plurals
  E.g. fox^s# -> foxes
Approach 1: ε -> e
  foxes, but also cates, doges, etc.
  Only apply in context: after s, z, x, etc., before s
Approach 2: ε -> e / (s|z|x) _ s
  Issue? glass -> glases
Approach 3: ε -> e / (s|z|x)^ _ s#
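The difference between Approach 2 and Approach 3 is easy to see with regular expressions standing in for the context conditions. This is a sketch using Python's `re` lookarounds rather than a transducer; the function names are illustrative.

```python
import re


def approach_2(word):
    """ε -> e / (s|z|x) _ s : no boundary symbols, so it fires inside stems."""
    return re.sub(r"(?<=[szx])(?=s)", "e", word)


def approach_3(intermediate):
    """ε -> e / (s|z|x)^ _ s# : only at a morpheme boundary before final s."""
    with_e = re.sub(r"(?<=[szx]\^)(?=s#)", "e", intermediate)
    return with_e.replace("^", "")  # the boundary marker is not spelled out


print(approach_2("glass"))     # the over-application from the slide
print(approach_3("fox^s#"))
print(approach_3("cat^s#"))
print(approach_3("glass#"))
```

Requiring the `^` and `#` markers in the context is what keeps the rule from rewriting stem-internal `ss`.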
Rewrite Rules
Format: a -> b / c _ d
Rewrite rules can be optional or obligatory
Rewrite rules can be ordered to reduce ambiguity.
Under some conditions, rewrite rules are equivalent to FSTs.
  a is not allowed to match something introduced in a prior rule application
E-insertion Rule Transducer
ε -> e / (s|z|x)^ _ s#
Input: ...(s|z|x)^s#     intermediate level
Output: ...(s|z|x)es#    surface level
Using the E-insertion FST
(fox, fox): q0, q0, q0, q1, accept
(fox#, fox#): q0, q0, q0, q1, q0, accept
(fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0, accept
(fox^s, foxs): q0, q0, q0, q1, q2, q5, reject
(fox^z#, foxz#)?
What will it accept?
  (f, f)
  (fox#, fox#)
  (fox^s#, foxes#)
  (fox^z#, foxz#)
Goal: write rules that capture only those constraints
  Let all other input pass through
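The state sequences above can be reproduced by simulating the transducer. The transition table below is an assumption: it is reconstructed from the standard six-state e-insertion machine in J&M Ch. 3, because the slide's diagram is not in the transcript. The traces it produces agree with the ones listed.

```python
ACCEPTING = {"q0", "q1", "q2"}


def symbol_class(ch):
    if ch in "zx":
        return "zx"
    if ch in ("s", "^", "#"):
        return ch
    return "other"


# (state, input class) -> next state. "^" is deleted on output; all else is copied.
DELTA = {
    ("q0", "s"): "q1", ("q0", "zx"): "q1", ("q0", "^"): "q0",
    ("q0", "#"): "q0", ("q0", "other"): "q0",
    ("q1", "s"): "q1", ("q1", "zx"): "q1", ("q1", "^"): "q2",
    ("q1", "#"): "q0", ("q1", "other"): "q0",
    ("q2", "s"): "q5", ("q2", "zx"): "q1", ("q2", "^"): "q0",
    ("q2", "#"): "q0", ("q2", "other"): "q0",
    ("q3", "s"): "q4",
    ("q4", "#"): "q0",
    ("q5", "s"): "q1", ("q5", "zx"): "q1", ("q5", "^"): "q2", ("q5", "other"): "q0",
}
EPSILON_INSERT = {"q2": ("e", "q3")}  # the ε:e arc


def transduce(intermediate):
    """Return every (surface, state path) the machine accepts for this input."""
    results = []

    def step(i, state, out, path):
        if state in EPSILON_INSERT:  # try inserting e without consuming input
            inserted, nxt = EPSILON_INSERT[state]
            step(i, nxt, out + inserted, path + [nxt])
        if i == len(intermediate):
            if state in ACCEPTING:
                results.append((out, path))
            return
        ch = intermediate[i]
        nxt = DELTA.get((state, symbol_class(ch)))
        if nxt is not None:
            step(i + 1, nxt, out + ("" if ch == "^" else ch), path + [nxt])

    step(0, "q0", "", ["q0"])
    return results


for form in ["fox", "fox#", "fox^s#", "fox^s", "fox^z#", "cat^s#"]:
    print(form, transduce(form))
```

Under this reconstruction `fox^s` is rejected on both branches (q4 and q5 are non-accepting), and `fox^z#` passes through unchanged because the rule only constrains `s` after the boundary.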
Combining FST Lexicon & Rules
Two-level morphological system: ‘Cascade’
  Transducer from Lexicon to Intermediate
  Rule transducers from Intermediate to Surface
Generation & Parsing
Generation:
  Given lexicon tape, cascade to produce surface form
  fox + N + PL -> foxes#
Parsing:
  Given surface form, generate analysis
  foxes# -> fox + N + PL or fox + V + 3Sg
  How can we disambiguate?
    We can’t here; need outside information
  What about ‘assess’?
    Need same sort of search as NFAs
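The cascade and its ambiguity can be sketched end to end on a three-word lexicon. Everything here is a simplification: the lexicon, the feature labels `Sg` and `Base`, and the function names are hypothetical; the e-insertion rule is applied as a regular expression; and parsing is done by brute-force generate-and-test rather than by running the transducers in reverse.

```python
import re

# Toy lexicon: stem -> parts of speech it can take.
STEMS = {"fox": {"N", "V"}, "cat": {"N"}, "assess": {"V"}}
# (POS, feature) -> intermediate-tape suffix.
SUFFIXES = {("N", "Sg"): "", ("N", "Pl"): "^s", ("V", "Base"): "", ("V", "3Sg"): "^s"}


def generate(stem, pos, feature):
    """Lexical -> intermediate -> surface (fox + N + Pl -> fox^s# -> foxes#)."""
    intermediate = stem + SUFFIXES[(pos, feature)] + "#"
    with_e = re.sub(r"(?<=[szx]\^)(?=s#)", "e", intermediate)  # e-insertion
    return with_e.replace("^", "")


def parse(surface):
    """Return every lexical analysis whose generated surface form matches."""
    analyses = []
    for stem, tags in sorted(STEMS.items()):
        for pos, feature in SUFFIXES:
            if pos in tags and generate(stem, pos, feature) == surface:
                analyses.append(f"{stem} + {pos} + {feature}")
    return analyses


print(generate("fox", "N", "Pl"))
print(parse("foxes#"))
print(parse("assess#"))
```

`parse("foxes#")` returns both the noun and the verb reading, and nothing inside the morphology can choose between them. `assess#` gets a single analysis here only because every candidate split is tried and the ones that do not regenerate the surface form are discarded, which is the search the slide refers to.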
FST Morphological Analysis
Summary:
Main components
  Lexicon
  Morphotactics
  Orthographic rules
Morphotactics as FSTs, expanded with FST Lexicon
Orthographic rules as FSTs
Combine FSTs, e.g. in cascade
Issues
What do you think of creating all the rules for a language by hand?
  Time-consuming, complicated
Proposed approach: Unsupervised morphology induction
Potentially useful for many applications
  IR, MT
Unsupervised Morphology
Start from tokenized text (or word frequencies)
  talk 60
  talked 120
  walked 40
  walk 30
Treat as coding/compression problem
  Find most compact representation of lexicon
Popular model: MDL (Minimum Description Length)
  Smallest total encoding:
    Weighted combination of lexicon size & ‘rules’
Approach
Generate initial model:
  Base set of words, compute MDL length
Iterate:
  Generate a new set of words + some model to create a smaller description size
E.g. for talk, talked, walk, walked:
  4 words
  2 words (talk, walk) + 1 affix (-ed) + combination info
  2 words (t, w) + 2 affixes (alk, -ed) + combination info
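The three candidate models can be compared with a toy description length. The cost function below is an assumption made for illustration, not the measure used in any particular MDL system: it charges one unit per character of each distinct morpheme (lexicon size) plus a weighted unit per morpheme pointer in each word's recipe (combination info), and it ignores the word frequencies.

```python
# Each model maps every word to the morphemes it is built from.
MODELS = {
    "4 words": {
        "talk": ["talk"], "talked": ["talked"],
        "walk": ["walk"], "walked": ["walked"],
    },
    "talk, walk + -ed": {
        "talk": ["talk"], "talked": ["talk", "ed"],
        "walk": ["walk"], "walked": ["walk", "ed"],
    },
    "t, w + alk, -ed": {
        "talk": ["t", "alk"], "talked": ["t", "alk", "ed"],
        "walk": ["w", "alk"], "walked": ["w", "alk", "ed"],
    },
}


def description_length(recipes, pointer_weight=1.0):
    """Characters in the distinct morphemes + weighted count of pointers."""
    # Sanity check: every recipe must spell its word.
    assert all("".join(parts) == word for word, parts in recipes.items())
    morphemes = {part for parts in recipes.values() for part in parts}
    lexicon_cost = sum(len(m) for m in morphemes)
    combination_cost = sum(len(parts) for parts in recipes.values())
    return lexicon_cost + pointer_weight * combination_cost


for name, recipes in MODELS.items():
    print(name, description_length(recipes))

best = min(MODELS, key=lambda name: description_length(MODELS[name]))
print("smallest:", best)
```

With a pointer weight of 1, the stem-plus-affix model is smallest. Splitting further into `t`, `w`, `alk` shrinks the lexicon but pays more in combination info, which is the trade-off the weighted combination is meant to capture.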