Linguistically-motivated, statistically-driven
induction of morphology
Erwin Chan
Dept. of Computer and Information Science, University of Pennsylvania
Overview
• Problem: induction of morphology from unannotated text
• Main idea: knowledge of linguistic and statistical properties of morphology allows for a simple induction algorithm
• Develops ideas from previous work:
  – Goldsmith (2001)
  – Schone & Jurafsky (2000)
  – Yarowsky & Wicentowski (2000, 2004)
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Computational modeling of language acquisition
Raw corpus → Induction algorithm ("fully" unsupervised), guided by linguistic knowledge
Desired properties of output
1. Analysis of the input data
   – morphology, POS, parse
2. Generalize the analysis
   – produce a tool to apply to new data
   – morphological analyzer, POS tagger, parser
Generalize morphological structure
• Word-specific morphological analysis:
  dogs = dog + s
  cats = cat + s
  churches = church + es
  finches = finch + es
• Out-of-vocabulary words?
• Summarize the phonological properties:
  if the word ends in ch, add es; otherwise add s
Morphophonological rules
• generative phonology, finite-state morphology
• Analysis: inflected → base form
• Generation: base form → inflected
• A rule specifies:
  – a rewrite pattern
  – a context of application
• N.PL rule (sketched below):
  $ → es / ch _ #
  $ → s / _ #        ( $ is the null suffix )
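A minimal Python sketch of applying this rule pair, assuming a plain regex test rather than the finite-state machinery the slide alludes to:

```python
import re

def pluralize(noun: str) -> str:
    """Apply the N.PL rule from the slide."""
    # $ -> es / ch _ #   (insert 'es' after a word-final 'ch')
    if re.search(r"ch$", noun):
        return noun + "es"
    # $ -> s / _ #       (insert 's' at the word boundary otherwise)
    return noun + "s"

assert pluralize("finch") == "finches"
assert pluralize("church") == "churches"
assert pluralize("dog") == "dogs"
```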
Towards induction of rules
• This presentation: from a corpus,
  – select words to be base forms
  – formulate rewrite patterns (transforms)
• Future: learn the other rule components
  – context of application
  – POS categories (e.g. "Noun")
  – fine-grained inflectional categories (e.g. Noun.PL)
  – allomorphs
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Linguistic model of morphology
• A model that generates inflectional morphological paradigms
  – base forms
  – transforms
  – transform signatures
• Simplifying assumptions:
  – one inflectional property per word (not adequate for agglutinative languages such as Finnish)
  – omit derivational morphology
Base-and-transforms model of morphological paradigms
• Apply transforms to a base form to generate each inflection
[Diagram: Lexemes 1–3, each with a single base form from which transforms derive the inflected forms.]
Base forms
• The same inflectional type across lexemes for a particular POS category
  – e.g. Nom.Sg for all nouns
• The representation of the lexeme in the lexicon
• A surface form
  – not abstract or underlying
Transforms
• Specifies the conversion process between base and inflected forms
• Similar to a rule, but omits the context of application
• A tuple of two regular expressions (X, Y)
  – X: the replaced portion of the base form
  – Y: the corresponding portion of the inflected form
Transform examples (for English)
  Base form   Inflected form   Transform
  eat         eating           ( $, ing )
  time        times            ( $, s )
  time        timing           ( e, ing )
  hang        hung             ( *a*, *u* )   non-concatenative
Transform signatures
• Summarizes the inflections of a set of words
  – a set of base forms × a set of transforms
  – each base form belongs to exactly one transform signature

             Base forms       Transforms
  t-sig #1   { time, save }   { ( $, s ), ( e, ing ) }
  t-sig #2   { walk }         { ( $, s ) }

  generates: time, times, timing, save, saves, saving, walk, walks
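A small Python sketch of how a transform and a transform signature generate surface forms; the data structures here are illustrative, not the paper's:

```python
import re

def apply_transform(base: str, transform: tuple) -> str:
    """Apply a transform (X, Y): replace the final X of the base with Y.
    '$' denotes the null suffix, i.e. plain concatenation."""
    x, y = transform
    if x == "$":
        return base + y
    return re.sub(re.escape(x) + r"$", y, base)

# t-sig #1 above: bases { time, save }, transforms { ($, s), (e, ing) }
for base in ["time", "save"]:
    forms = [base] + [apply_transform(base, t) for t in [("$", "s"), ("e", "ing")]]
    print(forms)   # ['time', 'times', 'timing'] / ['save', 'saves', 'saving']
```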
Comparison to stem-suffix signatures
• Stem-suffix signature (Goldsmith 2001, 2007)

           Stems                  Suffixes
  sig #1   { time, save, walk }   { $, s }
  sig #2   { tim, sav }           { ing }

• Compare the lexical representations:
  – stem-suffix signature: multiple stems per lexeme
  – transform signature: one base form per lexeme
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Statistical model of morphology
• Need to show learnability of linguistic model
• Understand the distribution of the data: look for patterns that hold across languages
• Propose simple model of distribution of inflections
• Implications for linguistic model
Examine annotated corpora
• Word representation: (lemma, inflectional category)
  e.g. went = ( go, verb-past-tense )
• Collapse phonological sub-classes
  e.g. N.Masc.Sg → N.Sg
       N.Fem.Sg → N.Sg
[Figures: log frequency plotted by lemma and inflection, for Spanish newswire verbs and for adult Spanish verbs in CHILDES; both show sparse data.]
Distribution of inflectional categories
• (roughly) Zipfian
• Slovene nouns; 3 inflections do not occur at all

  Inflection   # types     Inflection    # types
  N.Nom.Sg     7950        N.Inst.Pl     1630
  N.Gen.Sg     5967        N.Dat.Sg      1515
  N.Acc.Sg     5157        N.Gen.Dual     876
  N.Nom.Pl     4154        N.Nom.Dual     682
  N.Gen.Pl     3900        N.Dat.Pl       626
  N.Inst.Sg    3334        N.Acc.Dual     586
  N.Loc.Sg     3252        N.Loc.Dual     160
  N.Acc.Pl     2967        N.Inst.Dual    120
  N.Loc.Pl     1848        N.Dat.Dual      14
High type frequency of base form
• The most type-frequent inflection accords with intuitive notions of what inflection a base form should be
  – Slovene: A.Pos.Nom.Sg.Indef, N.Nom.Sg, V.Main.Ind.Pres.3.Sg
  – Swedish: A.Pos.Sg.Indef.Nom, N.Sg.Indef.Nom, V.Inf.Act
  – Spanish: A.Sg, N.Sg, V.Inf
Multinomial distribution
• Urn-and-balls problem
  – assume inflectional categories have constant probabilities
  – choose a lexeme and a number of words, then generate inflections according to their probability distribution
• Let an inflection set be the set of inflectional types of the words generated for a particular lexeme
• What is the probability distribution over inflection sets? It can be calculated from the multinomial (simulated below)
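A short simulation of the urn-and-balls model; the inflection probabilities are hypothetical, chosen only to give the base form the highest frequency:

```python
import random

# Hypothetical probabilities for one POS category, loosely Zipfian;
# the base (N.Nom.Sg) is the most frequent inflection.
inflections = ["N.Nom.Sg", "N.Gen.Sg", "N.Acc.Sg", "N.Nom.Pl", "N.Gen.Pl"]
probs = [0.45, 0.25, 0.15, 0.10, 0.05]

def inflection_set(n_tokens: int) -> frozenset:
    """Draw n_tokens inflections for one lexeme; keep the set of types."""
    return frozenset(random.choices(inflections, weights=probs, k=n_tokens))

# Estimate P(base in inflection set | set size >= 2) by simulation.
samples = [inflection_set(5) for _ in range(10_000)]
multi = [s for s in samples if len(s) >= 2]
print(sum("N.Nom.Sg" in s for s in multi) / len(multi))   # close to 1
```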
Inflection sets and base forms
• If the base form is usually the most frequent inflection, the multinomial predicts:
  – inflection sets containing the base have relatively high probability
  – inflection sets without the base have relatively low probability
  – if a rare inflection occurs, its base form is likely to occur as well
Occurrence of base in inflection sets
• Percentage of inflection sets of size ≥ 2 that contain the most type-frequent inflection

                Adj    Noun   Verb
  Slovene       64%    68%    80%
  Greek         89%    83%    62%
  Swedish       80%    84%    57%
  Spanish                     82%
  Sp. CHILDES                 70%
Implications for linguistic model
• The Zipfian and multinomial distributions predict that the data needed to support rule learning will exist in the corpus:
  – prominence of the base form
  – (base, inflected) pairs exist even for rare inflections
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Overview of induction algorithm
• Learn transform signatures for a portion of the vocabulary
  – select words to be base forms
• Construct increasingly complex data structures:
  1. suffixes
  2. transforms
  3. transform signatures
• Ranking and filtering based on the linguistic and statistical models
Additional simplifying assumptions
• Assume language is suffixing
• Not learning POS categories
Step 1. Suffixes
• Find the 50 most type-frequent suffixes
• Keep track of the words that end in each suffix
  ing: { beating, eating, cheating, etc. }
• Rank by number of word types (sketched below)
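A minimal sketch of this step; the bound on suffix length (max_len) is an assumption, since the slide does not specify one:

```python
from collections import defaultdict

def rank_suffixes(vocab, max_len=5, k=50):
    """Rank word-final substrings by the number of word types ending in
    them, keeping track of those words; '$' is the null suffix."""
    ending_in = defaultdict(set)
    for word in vocab:
        ending_in["$"].add(word)   # every word ends in the null suffix
        for i in range(1, min(max_len, len(word) - 1) + 1):
            ending_in[word[-i:]].add(word)
    return sorted(ending_in.items(), key=lambda kv: -len(kv[1]))[:k]

vocab = {"beating", "eating", "cheating", "dogs", "cats"}
for suffix, words in rank_suffixes(vocab, k=5):
    print(suffix, len(words), sorted(words))
```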
Most type-frequent suffixes (Brown)

       suffix   # types          suffix   # types
   1.  $        42596       41.  les        237
   2.  s        10730       42.  ses        230
   3.  e         4967       43.  et         224
   4.  d         4800       44.  ck         223
   5.  ed        3868       45.  ding       220
   6.  y         3648       46.  ning       219
   7.  n         3226       47.  ded        219
   8.  g         3107       48.  ment       217
   9.  ng        2951       49.  ngs        216
  10.  ing       2869       50.  rd         211
Step 2. Transforms
• For each pair of suffixes s1 and s2, construct two transforms: (s1, s2) and (s2, s1)
  – do not allow deletion: ( _ , $ )
• Hypothesize base forms (next slide; sketched below)
• Rank transforms by number of base forms
• Keep the top 50
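A sketch of transform construction and ranking under the same assumptions as the Step 1 sketch; ending_in is the suffix-to-words map built there:

```python
def build_transforms(ending_in, k=50):
    """For each ordered suffix pair (s1, s2), a word ending in s1 is a
    hypothesized base form if replacing its s1 with s2 yields another
    attested word; rank transforms by number of base forms."""
    vocab = ending_in["$"]                     # '$' holds every word type
    transforms = {}
    for s1, words1 in ending_in.items():
        for s2 in ending_in:
            if s1 == s2 or s2 == "$":          # disallow deletion ( _ , $ )
                continue
            bases = set()
            for w in words1:
                stem = w if s1 == "$" else w[: -len(s1)]
                if stem + s2 in vocab:
                    bases.add(w)
            if bases:
                transforms[(s1, s2)] = bases
    return sorted(transforms.items(), key=lambda kv: -len(kv[1]))[:k]

ending_in = {"$": {"walk", "walks", "walked", "walking"},
             "s": {"walks"}, "ed": {"walked"}, "ing": {"walking"}}
print(build_transforms(ending_in, k=3))
```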
Transform construction
[Diagram: the set of s1-words and the set of s2-words are linked by the relation (s1, s2); the s1-words participating in that relation are the hypothesized base forms for (s1, s2).]
Top transforms (Brown corpus)

       transform      # base forms         transform      # base forms
   1.  ( $, s )       5257            41.  ( on, ng )     229
   2.  ( ing, ed )    1922            42.  ( ng, on )     229
   3.  ( ed, ing )    1922            43.  ( $, r )       221
   4.  ( $, 's )      1609            44.  ( ion, e )     216
   5.  ( $, ed )      1481            45.  ( e, ion )     216
   6.  ( $, ing )     1335            46.  ( y, e )       214
   7.  ( $, ly )      1069            47.  ( e, y )       214
   8.  ( $, d )       1041            48.  ( $, al )      213
   9.  ( s, ed )       925            49.  ( y, ed )      212
  10.  ( ed, s )       925            50.  ( ed, y )      212
Step 3. Transform signatures
• Intersect base form sets of different transforms
[Diagram: Venn diagram of the base forms for transform 1 ( $, s ) and transform 2 ( $, ing ); the intersection and the two remainders yield 3 transform signatures.]
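One way to realize this intersection in Python, grouping each base form under the exact set of transforms it occurs with (the data structures are illustrative):

```python
def transform_signatures(transforms):
    """Intersect base-form sets: group each base form under the exact
    set of transforms it occurs with, so that every base form belongs
    to exactly one transform signature."""
    transforms_of = {}
    for t, bases in transforms.items():
        for b in bases:
            transforms_of.setdefault(b, set()).add(t)
    sigs = {}
    for base, ts in transforms_of.items():
        sigs.setdefault(frozenset(ts), set()).add(base)
    return sigs

transforms = {("$", "s"): {"time", "save", "walk"},
              ("e", "ing"): {"time", "save"}}
for sig, bases in transform_signatures(transforms).items():
    print(sorted(sig), "->", sorted(bases))
# {($,s), (e,ing)} -> {save, time};  {($,s)} -> {walk}
```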
Rank, filter transform signatures
• Rank by number of words
• Go down the list and filter out signatures that are missing the base form, i.e. whose transforms consist of "derived" suffixes:
  #4. ( $, s ) ( $, ed ) ( $, ing )   ← kept
  #5. ( s, ed ) ( s, ing )            ← filtered: missing the base form
Filter transform signatures
• Remove redundant signatures (we want a grammar of minimal size; a check is sketched below)
  #1.  ( $, s )
  #2.  ( $, 's )
  #14. ( $, s ) ( $, 's )   ← redundant: the combination of #1 and #2
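A small sketch of such a redundancy check, treating each signature as a set of transforms; exhaustive search is acceptable at this scale:

```python
from itertools import combinations

def is_redundant(candidate, kept):
    """A candidate signature is redundant if its transform set equals
    the union of some signatures already kept (minimal-grammar bias)."""
    for r in range(2, len(kept) + 1):
        for combo in combinations(kept, r):
            if set().union(*combo) == candidate:
                return True
    return False

kept = [{("$", "s")}, {("$", "'s")}]
print(is_redundant({("$", "s"), ("$", "'s")}, kept))   # True: union of #1 and #2
print(is_redundant({("$", "ly")}, kept))               # False
```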
Final transform signatures
1. ( $, s )
2. ( $, 's )
3. ( $, s ) ( $, ed ) ( $, ing )
4. ( $, ly )
5. ( $, s ) ( $, d ) ( e, ing )      ← (e, ing): deletion from the base
6. ( y, ies )                        ← deletion from the base
7. ( $, ly ) ( $, ness )
8. ( $, s ) ( $, ed ) ( $, ing ) ( $, er ) ( $, ers )
9. ( $, ed ) ( $, ing ) ( $, es )
10. ( $, ' )
11. ( $, s ) ( $, al )
12. ( $, e )                         ← spurious
13. ( $, y )                         ← spurious
Evaluation: precision of relation
• Precision: whether the (base, derived-from-base) relationship is inflectional
• Gold standard: Brown corpus lemmas
• 96.7% correct
Error Analysis
                                                 Inflected    Base       Gold base
  1. Agglutinative morphology                    survivors'   survivors  survivor
  2. Gold standard lacks the derivational base   hunters      hunt       hunter
  3. Spurious morphological relationship         hone         hon        hone
                                                 louise       louis      louise
Evaluation: vocab coverage
• Brown open-class POS categories
  – 31709 base forms
  – 539494 tokens (all inflections)
• 13 transform signatures
  – 5846 base forms = 18.4% coverage
  – 113165 tokens = 21.0% coverage
• (including redundant signatures: 27% base-form coverage, 41.9% token coverage)
How to expand coverage
• Start from the initial, high-precision set of base forms
• Bootstrap (a sketch follows below):
  – find other inflections of the base forms
  – use the new inflections to acquire more base forms
  – repeat
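A hedged sketch of one way such a loop could look; the slide gives no details, so the acceptance criterion here (a learned transform maps the word to an attested form) is an assumption:

```python
import re

def apply_transform(base, transform):
    x, y = transform   # '$' is the null suffix
    return base + y if x == "$" else re.sub(re.escape(x) + r"$", y, base)

def bootstrap(seed_bases, transforms, vocab):
    """Accept a word as a new base form when a learned transform maps it
    to an attested word; repeat until no new base forms are found."""
    bases = set(seed_bases)
    while True:
        new = {w for w in vocab - bases
               if any(apply_transform(w, t) in vocab for t in transforms)}
        if not new:
            return bases
        bases |= new

vocab = {"walk", "walks", "talk", "talks", "tallish"}
print(sorted(bootstrap({"walk"}, [("$", "s")], vocab)))
# ['talk', 'walk'] ('talk' is acquired because 'talks' is attested)
```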
Why induction algorithm works
• Exploits the combinatorics of the multinomial
• Finds legitimate morphological relationships
  – intersection filters out non-linguistic features
  – only linguistic features are likely to co-occur across a large portion of the vocabulary
• Finds base forms
  – t-sigs with the base are more probable than t-sigs without it, so t-sigs with the base are ranked high
Comparison to other algorithms
• Components:
  – spelling and frequencies
  – set intersection, set cover (greedy approximation algorithm)
  – knowledge of the base-and-transforms model
• Does not use:
  – entropy
  – parameter optimization
  – minimum description length
  – transitional probability between characters
  – distributional semantics
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Summary
• Task: induction of morphology from raw data
  – importance of generalization
  – generalization through morphophonological rules
• Linguistic model:
  – base forms, transforms, transform signatures
  – improved lexical representation
Summary
• Statistical model:
  – Zipfian + multinomial → prominence of base forms
  – the data distribution is sufficient to learn the linguistic model
• Induction algorithm:
  – builds increasingly complex representations: suffix → transform → transform signature
  – uses knowledge of the linguistic and statistical models
Main ideas
• Knowledge of the linguistic and statistical properties of morphology allows for a simple induction algorithm
• Look for "universal" properties of the data
• Incorporate the "universals" into the algorithm as a learning bias
Relevance to cognitive science
• Linguistics:
  – statistical / algorithmic evidence for rules
  – a statistical origin of rules?
• Psycholinguistics:
  – "past tense" learning models (Rumelhart & McClelland, Pinker) presuppose a list of (base, inflected) forms
• Computational linguistics:
  – towards the induction of phonological rules and finite-state models of morphology