Linguistically-motivated, statistically-driven
induction of morphology
Erwin Chan
Dept. of Computer and Information Science, University of Pennsylvania
Overview
• Problem: induction of morphology from unannotated text
• Main idea: knowledge of linguistic and statistical properties of morphology allows for a simple induction algorithm
• Develops ideas from previous work:
  – Goldsmith (2001)
  – Schone & Jurafsky (2000)
  – Yarowsky & Wicentowski (2000, 2004)
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Computational modeling of language acquisition
Raw corpus → Induction algorithm ("fully" unsupervised), guided by linguistic knowledge
Desired properties of output
1. Analysis of the input data
   – morphology, POS, parse
2. Generalize the analysis
   – produce a tool to apply to new data
   – morphological analyzer, POS tagger, parser
Generalize morphological structure
• Word-specific morphological analysis:
  dogs = dog + s
  cats = cat + s
  churches = church + es
  finches = finch + es
• Out-of-vocabulary words?
• Summarize the phonological properties:
  if the word ends in ch, add es; otherwise add s
Morphophonological rules
• generative phonology, finite-state morphology
• Analysis: inflected → base form
• Generation: base form → inflected
• A rule specifies:
  – a rewrite pattern
  – a context of application
• N.PL rule (sketched below):
  $ → es / ch _ #
  $ → s / _ #        ( $ is the null suffix )
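A minimal Python sketch of applying this rule pair, assuming a plain regex test rather than the finite-state machinery the slide alludes to:

```python
import re

def pluralize(noun: str) -> str:
    """Apply the N.PL rule from the slide."""
    # $ -> es / ch _ #   (insert 'es' after a word-final 'ch')
    if re.search(r"ch$", noun):
        return noun + "es"
    # $ -> s / _ #       (insert 's' at the word boundary otherwise)
    return noun + "s"

assert pluralize("finch") == "finches"
assert pluralize("church") == "churches"
assert pluralize("dog") == "dogs"
```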
Towards induction of rules
• This presentation: from a corpus,
  – select words to be base forms
  – formulate rewrite patterns (transforms)
• Future: learn the other rule components
  – context of application
  – POS categories (e.g. "Noun")
  – fine-grained inflectional categories (e.g. Noun.PL)
  – allomorphs
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Linguistic model of morphology
• A model that generates inflectional morphological paradigms
  – base forms
  – transforms
  – transform signatures
• Simplifying assumptions:
  – one inflectional property per word (not adequate for agglutinative languages such as Finnish)
  – omit derivational morphology
Base-and-transforms model of morphological paradigms
• Apply transforms to a base form to generate each inflection
[Diagram: Lexemes 1–3, each with a single base form from which transforms derive the inflected forms.]
Base forms
• The same inflectional type across lexemes for a particular POS category
  – e.g. Nom.Sg for all nouns
• The representation of the lexeme in the lexicon
• A surface form
  – not abstract or underlying
Transforms
• Specifies the conversion process between base and inflected forms
• Similar to a rule, but omits the context of application
• A tuple of two regular expressions (X, Y)
  – X: the replaced portion of the base form
  – Y: the corresponding portion of the inflected form
Transform examples (for English)
  Base form   Inflected form   Transform
  eat         eating           ( $, ing )
  time        times            ( $, s )
  time        timing           ( e, ing )
  hang        hung             ( *a*, *u* )   non-concatenative
Transform signatures
• Summarizes the inflections of a set of words
  – a set of base forms × a set of transforms
  – each base form belongs to exactly one transform signature

             Base forms       Transforms
  t-sig #1   { time, save }   { ( $, s ), ( e, ing ) }
  t-sig #2   { walk }         { ( $, s ) }

  generates: time, times, timing, save, saves, saving, walk, walks
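A small Python sketch of how a transform and a transform signature generate surface forms; the data structures here are illustrative, not the paper's:

```python
import re

def apply_transform(base: str, transform: tuple) -> str:
    """Apply a transform (X, Y): replace the final X of the base with Y.
    '$' denotes the null suffix, i.e. plain concatenation."""
    x, y = transform
    if x == "$":
        return base + y
    return re.sub(re.escape(x) + r"$", y, base)

# t-sig #1 above: bases { time, save }, transforms { ($, s), (e, ing) }
for base in ["time", "save"]:
    forms = [base] + [apply_transform(base, t) for t in [("$", "s"), ("e", "ing")]]
    print(forms)   # ['time', 'times', 'timing'] / ['save', 'saves', 'saving']
```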
Comparison to stem-suffix signatures
• Stem-suffix signature (Goldsmith 2001, 2007)

           Stems                  Suffixes
  sig #1   { time, save, walk }   { $, s }
  sig #2   { tim, sav }           { ing }

• Compare the lexical representations:
  – stem-suffix signature: multiple stems per lexeme
  – transform signature: one base form per lexeme
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Statistical model of morphology
• Need to show learnability of linguistic model
• Understand the distribution of the data: look for patterns that hold across languages
• Propose simple model of distribution of inflections
• Implications for linguistic model
Examine annotated corpora
• Word representation: (lemma, inflectional category)
  e.g. went = ( go, verb-past-tense )
• Collapse phonological sub-classes
  e.g. N.Masc.Sg → N.Sg
       N.Fem.Sg → N.Sg
[Figures: log frequency plotted by lemma and inflection, for Spanish newswire verbs and for adult Spanish verbs in CHILDES; both show sparse data.]
Distribution of inflectional categories
• (roughly) Zipfian
• Slovene nouns; 3 inflections do not occur at all

  Inflection   # types     Inflection    # types
  N.Nom.Sg     7950        N.Inst.Pl     1630
  N.Gen.Sg     5967        N.Dat.Sg      1515
  N.Acc.Sg     5157        N.Gen.Dual     876
  N.Nom.Pl     4154        N.Nom.Dual     682
  N.Gen.Pl     3900        N.Dat.Pl       626
  N.Inst.Sg    3334        N.Acc.Dual     586
  N.Loc.Sg     3252        N.Loc.Dual     160
  N.Acc.Pl     2967        N.Inst.Dual    120
  N.Loc.Pl     1848        N.Dat.Dual      14
High type frequency of base form
• The most type-frequent inflection accords with intuitive notions of what inflection a base form should be
  – Slovene: A.Pos.Nom.Sg.Indef, N.Nom.Sg, V.Main.Ind.Pres.3.Sg
  – Swedish: A.Pos.Sg.Indef.Nom, N.Sg.Indef.Nom, V.Inf.Act
  – Spanish: A.Sg, N.Sg, V.Inf
Multinomial distribution
• Urn-and-balls problem
  – assume inflectional categories have constant probabilities
  – choose a lexeme and a number of words, then generate inflections according to their probability distribution
• Let an inflection set be the set of inflectional types of the words generated for a particular lexeme
• What is the probability distribution over inflection sets? It can be calculated from the multinomial (simulated below)
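A short simulation of the urn-and-balls model; the inflection probabilities are hypothetical, chosen only to give the base form the highest frequency:

```python
import random

# Hypothetical probabilities for one POS category, loosely Zipfian;
# the base (N.Nom.Sg) is the most frequent inflection.
inflections = ["N.Nom.Sg", "N.Gen.Sg", "N.Acc.Sg", "N.Nom.Pl", "N.Gen.Pl"]
probs = [0.45, 0.25, 0.15, 0.10, 0.05]

def inflection_set(n_tokens: int) -> frozenset:
    """Draw n_tokens inflections for one lexeme; keep the set of types."""
    return frozenset(random.choices(inflections, weights=probs, k=n_tokens))

# Estimate P(base in inflection set | set size >= 2) by simulation.
samples = [inflection_set(5) for _ in range(10_000)]
multi = [s for s in samples if len(s) >= 2]
print(sum("N.Nom.Sg" in s for s in multi) / len(multi))   # close to 1
```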
Inflection sets and base forms
• If the base form is usually the most frequent inflection, the multinomial predicts:
  – inflection sets containing the base have relatively high probability
  – inflection sets without the base have relatively low probability
  – if a rare inflection occurs, its base form is likely to occur as well
Occurrence of base in inflection sets
• Percentage of inflection sets of size ≥ 2 that contain the most type-frequent inflection

                Adj    Noun   Verb
  Slovene       64%    68%    80%
  Greek         89%    83%    62%
  Swedish       80%    84%    57%
  Spanish                     82%
  Sp. CHILDES                 70%
Implications for linguistic model
• The Zipfian and multinomial distributions predict that the data needed to support rule learning will exist in the corpus:
  – prominence of the base form
  – (base, inflected) pairs exist even for rare inflections
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Overview of induction algorithm
• Learn transform signatures for a portion of the vocabulary
  – select words to be base forms
• Construct increasingly complex data structures:
  1. suffixes
  2. transforms
  3. transform signatures
• Ranking and filtering based on the linguistic and statistical models
Additional simplifying assumptions
• Assume language is suffixing
• Not learning POS categories
Step 1. Suffixes
• Find the 50 most type-frequent suffixes
• Keep track of the words that end in each suffix
  ing: { beating, eating, cheating, etc. }
• Rank by number of word types (sketched below)
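A minimal sketch of this step; the bound on suffix length (max_len) is an assumption, since the slide does not specify one:

```python
from collections import defaultdict

def rank_suffixes(vocab, max_len=5, k=50):
    """Rank word-final substrings by the number of word types ending in
    them, keeping track of those words; '$' is the null suffix."""
    ending_in = defaultdict(set)
    for word in vocab:
        ending_in["$"].add(word)   # every word ends in the null suffix
        for i in range(1, min(max_len, len(word) - 1) + 1):
            ending_in[word[-i:]].add(word)
    return sorted(ending_in.items(), key=lambda kv: -len(kv[1]))[:k]

vocab = {"beating", "eating", "cheating", "dogs", "cats"}
for suffix, words in rank_suffixes(vocab, k=5):
    print(suffix, len(words), sorted(words))
```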
Most type-frequent suffixes (Brown)

       suffix   # types          suffix   # types
   1.  $        42596       41.  les        237
   2.  s        10730       42.  ses        230
   3.  e         4967       43.  et         224
   4.  d         4800       44.  ck         223
   5.  ed        3868       45.  ding       220
   6.  y         3648       46.  ning       219
   7.  n         3226       47.  ded        219
   8.  g         3107       48.  ment       217
   9.  ng        2951       49.  ngs        216
  10.  ing       2869       50.  rd         211
Step 2. Transforms
• For each pair of suffixes s1 and s2, construct two transforms: (s1, s2) and (s2, s1)
  – do not allow deletion: ( _ , $ )
• Hypothesize base forms (next slide; sketched below)
• Rank transforms by number of base forms
• Keep the top 50
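A sketch of transform construction and ranking under the same assumptions as the Step 1 sketch; ending_in is the suffix-to-words map built there:

```python
def build_transforms(ending_in, k=50):
    """For each ordered suffix pair (s1, s2), a word ending in s1 is a
    hypothesized base form if replacing its s1 with s2 yields another
    attested word; rank transforms by number of base forms."""
    vocab = ending_in["$"]                     # '$' holds every word type
    transforms = {}
    for s1, words1 in ending_in.items():
        for s2 in ending_in:
            if s1 == s2 or s2 == "$":          # disallow deletion ( _ , $ )
                continue
            bases = set()
            for w in words1:
                stem = w if s1 == "$" else w[: -len(s1)]
                if stem + s2 in vocab:
                    bases.add(w)
            if bases:
                transforms[(s1, s2)] = bases
    return sorted(transforms.items(), key=lambda kv: -len(kv[1]))[:k]

ending_in = {"$": {"walk", "walks", "walked", "walking"},
             "s": {"walks"}, "ed": {"walked"}, "ing": {"walking"}}
print(build_transforms(ending_in, k=3))
```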
Transform construction
[Diagram: the set of s1-words and the set of s2-words are linked by the relation (s1, s2); the s1-words participating in that relation are the hypothesized base forms for (s1, s2).]
Top transforms (Brown corpus)

       transform      # base forms         transform      # base forms
   1.  ( $, s )       5257            41.  ( on, ng )     229
   2.  ( ing, ed )    1922            42.  ( ng, on )     229
   3.  ( ed, ing )    1922            43.  ( $, r )       221
   4.  ( $, 's )      1609            44.  ( ion, e )     216
   5.  ( $, ed )      1481            45.  ( e, ion )     216
   6.  ( $, ing )     1335            46.  ( y, e )       214
   7.  ( $, ly )      1069            47.  ( e, y )       214
   8.  ( $, d )       1041            48.  ( $, al )      213
   9.  ( s, ed )       925            49.  ( y, ed )      212
  10.  ( ed, s )       925            50.  ( ed, y )      212
Step 3. Transform signatures
• Intersect base form sets of different transforms
[Diagram: Venn diagram of the base forms for transform 1 ( $, s ) and transform 2 ( $, ing ); the intersection and the two remainders yield 3 transform signatures.]
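One way to realize this intersection in Python, grouping each base form under the exact set of transforms it occurs with (the data structures are illustrative):

```python
def transform_signatures(transforms):
    """Intersect base-form sets: group each base form under the exact
    set of transforms it occurs with, so that every base form belongs
    to exactly one transform signature."""
    transforms_of = {}
    for t, bases in transforms.items():
        for b in bases:
            transforms_of.setdefault(b, set()).add(t)
    sigs = {}
    for base, ts in transforms_of.items():
        sigs.setdefault(frozenset(ts), set()).add(base)
    return sigs

transforms = {("$", "s"): {"time", "save", "walk"},
              ("e", "ing"): {"time", "save"}}
for sig, bases in transform_signatures(transforms).items():
    print(sorted(sig), "->", sorted(bases))
# {($,s), (e,ing)} -> {save, time};  {($,s)} -> {walk}
```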
Rank, filter transform signatures
• Rank by number of words
• Go down the list and filter out signatures that are missing the base form, i.e. whose transforms consist of "derived" suffixes:
  #4. ( $, s ) ( $, ed ) ( $, ing )   ← kept
  #5. ( s, ed ) ( s, ing )            ← filtered: missing the base form
Filter transform signatures
• Remove redundant signatures (we want a grammar of minimal size; a check is sketched below)
  #1.  ( $, s )
  #2.  ( $, 's )
  #14. ( $, s ) ( $, 's )   ← redundant: the combination of #1 and #2
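A small sketch of such a redundancy check, treating each signature as a set of transforms; exhaustive search is acceptable at this scale:

```python
from itertools import combinations

def is_redundant(candidate, kept):
    """A candidate signature is redundant if its transform set equals
    the union of some signatures already kept (minimal-grammar bias)."""
    for r in range(2, len(kept) + 1):
        for combo in combinations(kept, r):
            if set().union(*combo) == candidate:
                return True
    return False

kept = [{("$", "s")}, {("$", "'s")}]
print(is_redundant({("$", "s"), ("$", "'s")}, kept))   # True: union of #1 and #2
print(is_redundant({("$", "ly")}, kept))               # False
```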
Final transform signatures
1. ( $, s )
2. ( $, 's )
3. ( $, s ) ( $, ed ) ( $, ing )
4. ( $, ly )
5. ( $, s ) ( $, d ) ( e, ing )      ← (e, ing): deletion from the base
6. ( y, ies )                        ← deletion from the base
7. ( $, ly ) ( $, ness )
8. ( $, s ) ( $, ed ) ( $, ing ) ( $, er ) ( $, ers )
9. ( $, ed ) ( $, ing ) ( $, es )
10. ( $, ' )
11. ( $, s ) ( $, al )
12. ( $, e )                         ← spurious
13. ( $, y )                         ← spurious
Evaluation: precision of relation
• Precision: whether the (base, derived-from-base) relationship is inflectional
• Gold standard: Brown corpus lemmas
• 96.7% correct
Error Analysis
                                                 Inflected    Base       Gold base
  1. Agglutinative morphology                    survivors'   survivors  survivor
  2. Gold standard lacks the derivational base   hunters      hunt       hunter
  3. Spurious morphological relationship         hone         hon        hone
                                                 louise       louis      louise
Evaluation: vocab coverage
• Brown open-class POS categories
  – 31709 base forms
  – 539494 tokens (all inflections)
• 13 transform signatures
  – 5846 base forms = 18.4% coverage
  – 113165 tokens = 21.0% coverage
• (including redundant signatures: 27% base-form coverage, 41.9% token coverage)
How to expand coverage
• Start from the initial, high-precision set of base forms
• Bootstrap (a sketch follows below):
  – find other inflections of the base forms
  – use the new inflections to acquire more base forms
  – repeat
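A hedged sketch of one way such a loop could look; the slide gives no details, so the acceptance criterion here (a learned transform maps the word to an attested form) is an assumption:

```python
import re

def apply_transform(base, transform):
    x, y = transform   # '$' is the null suffix
    return base + y if x == "$" else re.sub(re.escape(x) + r"$", y, base)

def bootstrap(seed_bases, transforms, vocab):
    """Accept a word as a new base form when a learned transform maps it
    to an attested word; repeat until no new base forms are found."""
    bases = set(seed_bases)
    while True:
        new = {w for w in vocab - bases
               if any(apply_transform(w, t) in vocab for t in transforms)}
        if not new:
            return bases
        bases |= new

vocab = {"walk", "walks", "talk", "talks", "tallish"}
print(sorted(bootstrap({"walk"}, [("$", "s")], vocab)))
# ['talk', 'walk'] ('talk' is acquired because 'talks' is attested)
```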
Why induction algorithm works
• Exploits the combinatorics of the multinomial
• Finds legitimate morphological relationships
  – intersection filters out non-linguistic features
  – only linguistic features are likely to co-occur across a large portion of the vocabulary
• Finds base forms
  – t-sigs with the base are more probable than t-sigs without it, so t-sigs with the base are ranked high
Comparison to other algorithms
• Components:
  – spelling and frequencies
  – set intersection, set cover (greedy approximation algorithm)
  – knowledge of the base-and-transforms model
• Does not use:
  – entropy
  – parameter optimization
  – minimum description length
  – transitional probability between characters
  – distributional semantics
Outline
1. Goals of morphology induction
2. Linguistic model of morphology
3. Statistical model of morphology
4. Induction algorithm
5. Conclusion, relevance to cognitive science
Summary
• Task: induction of morphology from raw data
  – importance of generalization
  – generalization through morphophonological rules
• Linguistic model:
  – base forms, transforms, transform signatures
  – improved lexical representation
Summary
• Statistical model:
  – Zipfian + multinomial → prominence of base forms
  – the data distribution is sufficient to learn the linguistic model
• Induction algorithm:
  – builds increasingly complex representations: suffix → transform → transform signature
  – uses knowledge of the linguistic and statistical models
Main ideas
• Knowledge of the linguistic and statistical properties of morphology allows for a simple induction algorithm
• Look for "universal" properties of the data
• Incorporate the "universals" into the algorithm as a learning bias
Relevance to cognitive science
• Linguistics:
  – statistical / algorithmic evidence for rules
  – a statistical origin of rules?
• Psycholinguistics:
  – "past tense" learning models (Rumelhart & McClelland, Pinker) presuppose a list of (base, inflected) forms
• Computational linguistics:
  – towards the induction of phonological rules and finite-state models of morphology