comp_morpho.pdf

78
Computational Morphology Pawan Goyal CSE, IIT Kharagpur August 14, 2014 Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 1 / 34

Transcript of comp_morpho.pdf

  • Computational Morphology

    Pawan Goyal

    CSE, IIT Kharagpur

    August 14, 2014

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 1 / 34

  • Prosody and orthography

    Orthography: The conventional spelling system of a language.Prosody: The pattern of sounds in a language.

    Case for EnglishA given morpheme is represented with a single orthography despite the factthat it has different surface phonetic representations in different contexts.Ex: The past tense suffix -ed is so written despite three distinct phoneticrealizations in three clearly defined contexts:

    /t/ e.g. dip, dipped/d/ e.g. boom, boomed/1d/ e.g. loot, looted

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 2 / 34

  • Prosody and orthography

    Orthography: The conventional spelling system of a language.Prosody: The pattern of sounds in a language.

    Case for EnglishA given morpheme is represented with a single orthography despite the factthat it has different surface phonetic representations in different contexts.

    Ex: The past tense suffix -ed is so written despite three distinct phoneticrealizations in three clearly defined contexts:

    /t/ e.g. dip, dipped/d/ e.g. boom, boomed/1d/ e.g. loot, looted

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 2 / 34

  • Prosody and orthography

    Orthography: The conventional spelling system of a language.Prosody: The pattern of sounds in a language.

    Case for EnglishA given morpheme is represented with a single orthography despite the factthat it has different surface phonetic representations in different contexts.Ex: The past tense suffix -ed is so written despite three distinct phoneticrealizations in three clearly defined contexts:

    /t/ e.g. dip, dipped/d/ e.g. boom, boomed/1d/ e.g. loot, looted

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 2 / 34

  • Prosody and orthography

    Orthography: The conventional spelling system of a language.Prosody: The pattern of sounds in a language.

    Case for SanskritAn advanced discipline of phonetics explicitly described prosodic changes,these prosodic changes, well known by the term sandhi, are represented inwriting.

    Ex: past passive participle suffix -ta variously realized as ta or dha dependingsolely upon the phonetic context, is written as follows:

    /ta/ e.g. from su press, suta pressed/dha/ e.g. from budh awake, buddha awakened

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 3 / 34

  • Prosody and orthography

    Orthography: The conventional spelling system of a language.Prosody: The pattern of sounds in a language.

    Case for SanskritAn advanced discipline of phonetics explicitly described prosodic changes,these prosodic changes, well known by the term sandhi, are represented inwriting.Ex: past passive participle suffix -ta variously realized as ta or dha dependingsolely upon the phonetic context, is written as follows:

    /ta/ e.g. from su press, suta pressed/dha/ e.g. from budh awake, buddha awakened

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 3 / 34

  • Morphology

    Morphology studies the internal structure of words, how words are built upfrom smaller meaningful units called morphemes

    dogs2 morphemes, dog and s

    s is a plural marker on nouns

    unladylike3 morphemes

    un- not

    lady well-behaved woman

    -like having the characteristic of

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 4 / 34

  • Morphology

    Morphology studies the internal structure of words, how words are built upfrom smaller meaningful units called morphemes

    dogs2 morphemes, dog and s

    s is a plural marker on nouns

    unladylike3 morphemes

    un- not

    lady well-behaved woman

    -like having the characteristic of

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 4 / 34

  • Morphology

    Morphology studies the internal structure of words, how words are built upfrom smaller meaningful units called morphemes

    dogs2 morphemes, dog and s

    s is a plural marker on nouns

    unladylike3 morphemes

    un- not

    lady well-behaved woman

    -like having the characteristic of

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 4 / 34

  • Allomorphs

    Variants of the same morpheme, but cannot be replaced by one another

    ExamplePlural morphemes: cat-s, judge-s, dog-s

    opposite: un-happy, in-comprehensible, im-possible, ir-rational

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 5 / 34

  • Bound and Free Morphemes

    BoundCannot appear as a word by itself.-s (dog-s), -ly (quick-ly), -ed (walk-ed)

    FreeCan appear as a word by itself; often can combine with other morphemes too.house (house-s), walk (walk-ed), of, the, or

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 6 / 34

  • Bound and Free Morphemes

    BoundCannot appear as a word by itself.-s (dog-s), -ly (quick-ly), -ed (walk-ed)

    FreeCan appear as a word by itself; often can combine with other morphemes too.house (house-s), walk (walk-ed), of, the, or

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 6 / 34

  • Stems and Affixes

    Stems and AffixesStems (roots): The core meaning bearing units

    Affixes: Bits and pieces adhering to stems to change their meanings andgrammatical functions

    Mostly, stems are free morphemes and affixes are bound morphemes

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 7 / 34

  • Stems and Affixes

    Stems and AffixesStems (roots): The core meaning bearing units

    Affixes: Bits and pieces adhering to stems to change their meanings andgrammatical functions

    Mostly, stems are free morphemes and affixes are bound morphemes

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 7 / 34

  • Types of affixes

    Prefix: un-, anti-, etc (a-, ati-, pra- etc.)un-happy, pre-existing

    Suffix: -ity, -ation, etc (-taa, -ke, -ka etc.)talk-ing, quick-ly

    Infix: n in vindati (he knows), as contrasted with vid (to know).Philippines: basa read b-um-asa readEnglish: abso-bloody-lutely (emphasis)

    Circumfixes - precedes and follow the stemDutch: berg mountain, ge-berg-te mountains

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 8 / 34

  • Types of affixes

    Prefix: un-, anti-, etc (a-, ati-, pra- etc.)un-happy, pre-existing

    Suffix: -ity, -ation, etc (-taa, -ke, -ka etc.)talk-ing, quick-ly

    Infix: n in vindati (he knows), as contrasted with vid (to know).Philippines: basa read b-um-asa readEnglish: abso-bloody-lutely (emphasis)

    Circumfixes - precedes and follow the stemDutch: berg mountain, ge-berg-te mountains

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 8 / 34

  • Types of affixes

    Prefix: un-, anti-, etc (a-, ati-, pra- etc.)un-happy, pre-existing

    Suffix: -ity, -ation, etc (-taa, -ke, -ka etc.)talk-ing, quick-ly

    Infix: n in vindati (he knows), as contrasted with vid (to know).Philippines: basa read b-um-asa readEnglish: abso-bloody-lutely (emphasis)

    Circumfixes - precedes and follow the stemDutch: berg mountain, ge-berg-te mountains

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 8 / 34

  • Types of affixes

    Prefix: un-, anti-, etc (a-, ati-, pra- etc.)un-happy, pre-existing

    Suffix: -ity, -ation, etc (-taa, -ke, -ka etc.)talk-ing, quick-ly

    Infix: n in vindati (he knows), as contrasted with vid (to know).Philippines: basa read b-um-asa readEnglish:

    abso-bloody-lutely (emphasis)

    Circumfixes - precedes and follow the stemDutch: berg mountain, ge-berg-te mountains

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 8 / 34

  • Types of affixes

    Prefix: un-, anti-, etc (a-, ati-, pra- etc.)un-happy, pre-existing

    Suffix: -ity, -ation, etc (-taa, -ke, -ka etc.)talk-ing, quick-ly

    Infix: n in vindati (he knows), as contrasted with vid (to know).Philippines: basa read b-um-asa readEnglish: abso-bloody-lutely (emphasis)

    Circumfixes - precedes and follow the stemDutch: berg mountain, ge-berg-te mountains

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 8 / 34

  • Types of affixes

    Prefix: un-, anti-, etc (a-, ati-, pra- etc.)un-happy, pre-existing

    Suffix: -ity, -ation, etc (-taa, -ke, -ka etc.)talk-ing, quick-ly

    Infix: n in vindati (he knows), as contrasted with vid (to know).Philippines: basa read b-um-asa readEnglish: abso-bloody-lutely (emphasis)

    Circumfixes - precedes and follow the stemDutch: berg mountain, ge-berg-te mountains

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 8 / 34

  • Content and functional morphemes

    Content morphemesCarry some semantic contentcar, -able, un-

    Functional morphemesProvide grammatical information-s (plural), -s (3rd singular)

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 9 / 34

  • Content and functional morphemes

    Content morphemesCarry some semantic contentcar, -able, un-

    Functional morphemesProvide grammatical information-s (plural), -s (3rd singular)

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 9 / 34

  • Inflectional and Derivational Morphology

    Two different kind of relationship among words

    Inflectional morphologyGrammatical: number, tense, case, genderCreates new forms of the same word: bring, brought, brings, bringing

    Derivational morphologyCreates new words by changing part-of-speech: logic, logical, illogical,illogicality, logicianFairly systematic but some derivations missing: sincere - sincerity, scarce -scarcity, curious - curiosity, fierce - fiercity?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 10 / 34

  • Inflectional and Derivational Morphology

    Two different kind of relationship among words

    Inflectional morphologyGrammatical: number, tense, case, genderCreates new forms of the same word: bring, brought, brings, bringing

    Derivational morphologyCreates new words by changing part-of-speech: logic, logical, illogical,illogicality, logicianFairly systematic but some derivations missing: sincere - sincerity, scarce -scarcity, curious - curiosity, fierce - fiercity?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 10 / 34

  • Inflectional and Derivational Morphology

    Two different kind of relationship among words

    Inflectional morphologyGrammatical: number, tense, case, genderCreates new forms of the same word: bring, brought, brings, bringing

    Derivational morphologyCreates new words by changing part-of-speech: logic, logical, illogical,illogicality, logician

    Fairly systematic but some derivations missing: sincere - sincerity, scarce -scarcity, curious - curiosity, fierce - fiercity?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 10 / 34

  • Inflectional and Derivational Morphology

    Two different kind of relationship among words

    Inflectional morphologyGrammatical: number, tense, case, genderCreates new forms of the same word: bring, brought, brings, bringing

    Derivational morphologyCreates new words by changing part-of-speech: logic, logical, illogical,illogicality, logicianFairly systematic but some derivations missing: sincere - sincerity, scarce -scarcity, curious - curiosity, fierce - fiercity?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 10 / 34

  • Morphological processes

    ConcatenationAdding continuous affixes - the most common process:

    hope+less, un+happy, anti+capital+ist+s

    Often, there are phonological/graphemic changes on morpheme boundaries:

    book + s [s], shoe + s [z]

    happy +er happier

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 11 / 34

  • Morphological processes

    ConcatenationAdding continuous affixes - the most common process:

    hope+less, un+happy, anti+capital+ist+s

    Often, there are phonological/graphemic changes on morpheme boundaries:

    book + s [s], shoe + s [z]

    happy +er happier

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 11 / 34

  • Morphological processes

    Reduplication: part of the word or the entire word is doubled

    Nama: go (look), go-go (examine with attention)

    Tagalog: basa (read), ba-basa(will read)

    Sanskrit: pac (cook), papaca (perfect form, cooked)

    Phrasal reduplication (Telugu): pillavad. u nad. ustu nad. ustu pad. i poyad. u(The child fell down while walking)

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 12 / 34

  • Morphological processes

    Reduplication: part of the word or the entire word is doubled

    Nama: go (look), go-go (examine with attention)

    Tagalog: basa (read), ba-basa(will read)

    Sanskrit: pac (cook), papaca (perfect form, cooked)

    Phrasal reduplication (Telugu): pillavad. u nad. ustu nad. ustu pad. i poyad. u(The child fell down while walking)

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 12 / 34

  • Morphological processes

    Reduplication: part of the word or the entire word is doubled

    Nama: go (look), go-go (examine with attention)

    Tagalog: basa (read), ba-basa(will read)

    Sanskrit: pac (cook), papaca (perfect form, cooked)

    Phrasal reduplication (Telugu): pillavad. u nad. ustu nad. ustu pad. i poyad. u(The child fell down while walking)

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 12 / 34

  • Morphological processes

    Reduplication: part of the word or the entire word is doubled

    Nama: go (look), go-go (examine with attention)

    Tagalog: basa (read), ba-basa(will read)

    Sanskrit: pac (cook), papaca (perfect form, cooked)

    Phrasal reduplication (Telugu): pillavad. u nad. ustu nad. ustu pad. i poyad. u(The child fell down while walking)

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 12 / 34

  • Morphological processes

    Reduplication: part of the word or the entire word is doubled

    Nama: go (look), go-go (examine with attention)

    Tagalog: basa (read), ba-basa(will read)

    Sanskrit: pac (cook), papaca (perfect form, cooked)

    Phrasal reduplication (Telugu): pillavad. u nad. ustu nad. ustu pad. i poyad. u(The child fell down while walking)

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 12 / 34

  • Morphological processes

    Suppletionirregular relation between the wordsgo - went, good - better

    Morpheme internal changesThe word changes internallysing - sang - sung, man - men, goose - geese

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 13 / 34

  • Morphological processes

    Suppletionirregular relation between the wordsgo - went, good - better

    Morpheme internal changesThe word changes internallysing - sang - sung, man - men, goose - geese

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 13 / 34

  • Word Formation

    CompoundingWords formed by comnining two or more wordsExample in English:

    Adj + Adj Adj: bitter-sweetN + N N: rain-bowV + N V: pick-pocketP + V V: over-do

    Particular to languagesroom-temperature: Hindi translation?

    Can be non-compositionalasva-karn. a (horse -ear?)name of a medicinal plant

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 14 / 34

  • Word Formation

    CompoundingWords formed by comnining two or more wordsExample in English:

    Adj + Adj Adj: bitter-sweetN + N N: rain-bowV + N V: pick-pocketP + V V: over-do

    Particular to languagesroom-temperature: Hindi translation?

    Can be non-compositionalasva-karn. a (horse -ear?)name of a medicinal plant

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 14 / 34

  • Word Formation

    CompoundingWords formed by comnining two or more wordsExample in English:

    Adj + Adj Adj: bitter-sweetN + N N: rain-bowV + N V: pick-pocketP + V V: over-do

    Particular to languagesroom-temperature: Hindi translation?

    Can be non-compositionalasva-karn. a (horse -ear?)

    name of a medicinal plant

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 14 / 34

  • Word Formation

    CompoundingWords formed by comnining two or more wordsExample in English:

    Adj + Adj Adj: bitter-sweetN + N N: rain-bowV + N V: pick-pocketP + V V: over-do

    Particular to languagesroom-temperature: Hindi translation?

    Can be non-compositionalasva-karn. a (horse -ear?)name of a medicinal plant

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 14 / 34

  • Word Formation

    Acronymslaser: Light Amplification by Simulated Emission of Radiation

    BlendingParts of two different words are combined

    breakfast + lunch brunchsmoke + fog smogmotor + hotel motel

    ClippingLonger words are shorteneddoctor, laboratory, advertisement, dormitory, examination, bicycle, refrigerator

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 15 / 34

  • Word Formation

    Acronymslaser: Light Amplification by Simulated Emission of Radiation

    BlendingParts of two different words are combined

    breakfast + lunch brunchsmoke + fog smogmotor + hotel motel

    ClippingLonger words are shorteneddoctor, laboratory, advertisement, dormitory, examination, bicycle, refrigerator

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 15 / 34

  • Word Formation

    Acronymslaser: Light Amplification by Simulated Emission of Radiation

    BlendingParts of two different words are combined

    breakfast + lunch brunchsmoke + fog smogmotor + hotel motel

    ClippingLonger words are shortened

    doctor, laboratory, advertisement, dormitory, examination, bicycle, refrigerator

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 15 / 34

  • Word Formation

    Acronymslaser: Light Amplification by Simulated Emission of Radiation

    BlendingParts of two different words are combined

    breakfast + lunch brunchsmoke + fog smogmotor + hotel motel

    ClippingLonger words are shorteneddoctor, laboratory, advertisement, dormitory, examination, bicycle, refrigerator

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 15 / 34

  • Processing morphology

    Lemmatization: word lemmasaw {see, saw}

    Morphological analysis : word setOf(lemma +tag)saw { , < saw, noun.sg>}Tagging: word tag, considers contextPeter saw her { }Morpheme segmentation: de-nation-al-iz-ation

    Generation: see + verb.past saw

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 16 / 34

  • Processing morphology

    Lemmatization: word lemmasaw {see, saw}Morphological analysis : word setOf(lemma +tag)saw { , < saw, noun.sg>}

    Tagging: word tag, considers contextPeter saw her { }Morpheme segmentation: de-nation-al-iz-ation

    Generation: see + verb.past saw

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 16 / 34

  • Processing morphology

    Lemmatization: word lemmasaw {see, saw}Morphological analysis : word setOf(lemma +tag)saw { , < saw, noun.sg>}Tagging: word tag, considers contextPeter saw her { }

    Morpheme segmentation: de-nation-al-iz-ation

    Generation: see + verb.past saw

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 16 / 34

  • Processing morphology

    Lemmatization: word lemmasaw {see, saw}Morphological analysis : word setOf(lemma +tag)saw { , < saw, noun.sg>}Tagging: word tag, considers contextPeter saw her { }Morpheme segmentation: de-nation-al-iz-ation

    Generation: see + verb.past saw

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 16 / 34

  • Processing morphology

    Lemmatization: word lemmasaw {see, saw}Morphological analysis : word setOf(lemma +tag)saw { , < saw, noun.sg>}Tagging: word tag, considers contextPeter saw her { }Morpheme segmentation: de-nation-al-iz-ation

    Generation: see + verb.past saw

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 16 / 34

  • What are the applications?

    Text-to-speech synthesis:lead:

    verb or noun?read: present or past?

    Search and information retrieval

    Machine translation, grammar correction

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 17 / 34

  • What are the applications?

    Text-to-speech synthesis:lead: verb or noun?read:

    present or past?

    Search and information retrieval

    Machine translation, grammar correction

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 17 / 34

  • What are the applications?

    Text-to-speech synthesis:lead: verb or noun?read: present or past?

    Search and information retrieval

    Machine translation, grammar correction

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 17 / 34

  • Morphological Analysis

    GoalTo take input forms like those in the first column and produce output forms likethose in the second column.Output contains stem and additional information; +N for noun, +SG forsingular, +PL for plural, +V for verb etc.

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 18 / 34

  • Morphological Analysis

    GoalTo take input forms like those in the first column and produce output forms likethose in the second column.Output contains stem and additional information; +N for noun, +SG forsingular, +PL for plural, +V for verb etc.

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 18 / 34

  • Issues involved

    boy boys

    fly flys flies (y i rule)

    Toiling toilDuckling duckl?

    Getter get + erDoes do + erBeer be + er?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 19 / 34

  • Issues involved

    boy boysfly flys flies (y i rule)

    Toiling toilDuckling duckl?

    Getter get + erDoes do + erBeer be + er?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 19 / 34

  • Issues involved

    boy boysfly flys flies (y i rule)

    Toiling toil

    Duckling duckl?

    Getter get + erDoes do + erBeer be + er?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 19 / 34

  • Issues involved

    boy boysfly flys flies (y i rule)

    Toiling toilDuckling duckl?

    Getter get + erDoes do + erBeer be + er?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 19 / 34

  • Issues involved

    boy boysfly flys flies (y i rule)

    Toiling toilDuckling duckl?

    Getter get + erDoes do + er

    Beer be + er?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 19 / 34

  • Issues involved

    boy boysfly flys flies (y i rule)

    Toiling toilDuckling duckl?

    Getter get + erDoes do + erBeer be + er?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 19 / 34

  • Knowledge Required

    Knowledge of stems or rootsDuck is a possible root, not duckl.We need a dictionary (lexicon)

    MorphotacticsWhich class of morphemes follow other classes of orphemes inside the word?Ex: plural morpheme follows the noun

    Only some endings go on some wordsDo+er: ok

    Be+er: not so

    Spelling change rulesAdjust the surface form using spelling change rules

    Get + er getterFox + s foxes

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 20 / 34

  • Knowledge Required

    Knowledge of stems or rootsDuck is a possible root, not duckl.We need a dictionary (lexicon)

    MorphotacticsWhich class of morphemes follow other classes of orphemes inside the word?Ex: plural morpheme follows the noun

    Only some endings go on some wordsDo+er: ok

    Be+er: not so

    Spelling change rulesAdjust the surface form using spelling change rules

    Get + er getterFox + s foxes

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 20 / 34

  • Knowledge Required

    Knowledge of stems or rootsDuck is a possible root, not duckl.We need a dictionary (lexicon)

    MorphotacticsWhich class of morphemes follow other classes of orphemes inside the word?Ex: plural morpheme follows the noun

    Only some endings go on some wordsDo+er: ok

    Be+er: not so

    Spelling change rulesAdjust the surface form using spelling change rules

    Get + er getterFox + s foxes

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 20 / 34

  • Knowledge Required

    Knowledge of stems or rootsDuck is a possible root, not duckl.We need a dictionary (lexicon)

    MorphotacticsWhich class of morphemes follow other classes of orphemes inside the word?Ex: plural morpheme follows the noun

    Only some endings go on some wordsDo+er: ok

    Be+er: not so

    Spelling change rulesAdjust the surface form using spelling change rules

    Get + er getterFox + s foxes

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 20 / 34

  • Why cant this be put in a big lexicon?

    English: just 317,477 forms from 90,196 lexical entries, a ratio of 3.5:1

    Sanskrit: 11 million forms from a lexicon of 170,000 entries, a ratio of64.7:1

    New forms can be created, compounding etc.

    One of the most common methods is finite-state-machines

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 21 / 34

  • Why cant this be put in a big lexicon?

    English: just 317,477 forms from 90,196 lexical entries, a ratio of 3.5:1

    Sanskrit: 11 million forms from a lexicon of 170,000 entries, a ratio of64.7:1

    New forms can be created, compounding etc.

    One of the most common methods is finite-state-machines

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 21 / 34

  • Why cant this be put in a big lexicon?

    English: just 317,477 forms from 90,196 lexical entries, a ratio of 3.5:1

    Sanskrit: 11 million forms from a lexicon of 170,000 entries, a ratio of64.7:1

    New forms can be created, compounding etc.

    One of the most common methods is finite-state-machines

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 21 / 34

  • Finite State Automaton (FSA)

    What is FSA?A kind of directed graph

    Nodes are called states, edges are labeled with symbols (possibly empty)

    Start state and accepting states

    Recognizes regular languages, i.e., languages specified by regularexpressions

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 22 / 34

  • Finite State Automaton (FSA)

    What is FSA?A kind of directed graph

    Nodes are called states, edges are labeled with symbols (possibly empty)

    Start state and accepting states

    Recognizes regular languages, i.e., languages specified by regularexpressions

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 22 / 34

  • Finite State Automaton (FSA)

    What is FSA?A kind of directed graph

    Nodes are called states, edges are labeled with symbols (possibly empty)

    Start state and accepting states

    Recognizes regular languages, i.e., languages specified by regularexpressions

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 22 / 34

  • Finite State Automaton (FSA)

    What is FSA?A kind of directed graph

    Nodes are called states, edges are labeled with symbols (possibly empty)

    Start state and accepting states

    Recognizes regular languages, i.e., languages specified by regularexpressions

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 22 / 34

  • FSA for nominal inflection in English

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 23 / 34

  • FSA for English Adjectives

    Word modeledhappy, happier, happiest, real, unreal, cool, coolly, clear, clearly, unclear,unclearly, ...

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 24 / 34

  • FSA for English Adjectives

    Word modeledhappy, happier, happiest, real, unreal, cool, coolly, clear, clearly, unclear,unclearly, ...

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 24 / 34

  • Morphotactics

    The last two examples model some parts of the English morphotactics

    But what about the information about regular and irregular roots?

    LexiconCan we include the lexicon in the FSA?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 25 / 34

  • Morphotactics

    The last two examples model some parts of the English morphotactics

    But what about the information about regular and irregular roots?

    LexiconCan we include the lexicon in the FSA?

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 25 / 34

  • FSA for nominal inflection in English

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 26 / 34

  • After adding a mini-lexicon

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 27 / 34

  • Some properties of FSAs: Elegance

    Recognizing problem can be solved in linear time (independent of thesize of the automaton)

    There is an algorithm to transform each automaton into a uniqueequivalent automaton with the least number of states

    An FSA is deterministic iff it has no empty () transition and for each stateand each symbol, there is at most one applicable transition

    Every non-deterministic automaton can be transformed into adeterministic one

    Pawan Goyal (IIT Kharagpur) Computational Morphology August 14, 2014 28 / 34