
HELSINKI UNIVERSITY OF TECHNOLOGY

NEURAL NETWORKS RESEARCH CENTRE

Inducing the Morphological Lexicon of a Natural Language from Unannotated Text

{ Mathias.Creutz, Krista.Lagus }@hut.fi

International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05)

Espoo, 17 June 2005

kahvi + n + juo + ja + lle + kin

nyky + ratkaisu + i + sta + mme

tietä + isi + mme + kö + hän

open + mind + ed + ness

un + believ + able

17 June 2005, Mathias Creutz


Challenge for NLP: too many words

• E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):
– kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
– nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
– tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)

• Huge number of different possible word forms
• Important to know the inner structure of words
• The number of morphemes per word varies greatly


Goal

• Learn representations of
– the smallest individually meaningful units of language (morphemes)
– and their interaction
– in an unsupervised and data-driven manner from raw text
– making as general and language-independent assumptions as possible.

Morfessor


State of the art

• Rule-based systems
– accurate, language-dependent, adaptivity issues

• Unsupervised word segmentation
– sentences can be of different length
– context-insensitive, so poor modeling of syntax:
  • undersegmentation of frequent strings (“forthepurposeof”)
  • oversegmentation of rare strings (“in + s + an + e”)
  • no syntactic / morphotactic constraints (“s + can”)

Morfessor Baseline


State of the art (cont’d)

• Morphology learning
– Beyond segmentation: allomorphy (“foot – feet, goose – geese”)
– Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”)
– Learning of paradigms (e.g., John Goldsmith’s Linguistica)
  [Figure: example paradigm in which the stems believ, hop, liv, mov, us combine with the suffixes e, ed, es, ing]

Very restricted syntax / morphotactics in terms of number of morphemes per word form!


Morfessor with morpheme categories

• Lexicon / Grammar dualism
– Word structure captured by a regular expression: word = ( prefix* stem suffix* )+
– Morph sequences (words) are generated by a Hidden Markov model:

[Figure: the morph sequence # over simpl ific ation s # generated by the HMM]
Transition probs, e.g., P(STM | PRE), P(SUF | SUF)
Emission probs, e.g., P(’over’ | PRE), P(’s’ | SUF)
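The regular expression and the HMM factorization on this slide can be illustrated with a small sketch. All probability values below are made-up toy numbers, and the names TRANSITION, EMISSION, is_valid_word and log_prob are hypothetical, not part of Morfessor. The category assignments for "ific" and "ation" are likewise assumed for illustration.

```python
import math
import re

# Toy transition probabilities P(next category | previous category).
# "#" is the word boundary. Values are assumptions, not from the talk.
TRANSITION = {
    ("#", "PRE"): 0.2, ("#", "STM"): 0.8,
    ("PRE", "PRE"): 0.1, ("PRE", "STM"): 0.9,
    ("STM", "STM"): 0.1, ("STM", "SUF"): 0.5,
    ("STM", "PRE"): 0.05, ("STM", "#"): 0.35,
    ("SUF", "SUF"): 0.4, ("SUF", "STM"): 0.05,
    ("SUF", "PRE"): 0.05, ("SUF", "#"): 0.5,
}

# Toy emission probabilities P(morph | category), also assumptions.
EMISSION = {
    ("PRE", "over"): 0.05,
    ("STM", "simpl"): 0.001,
    ("SUF", "ific"): 0.01,
    ("SUF", "ation"): 0.02,
    ("SUF", "s"): 0.1,
}

# word = ( prefix* stem suffix* )+ written over one-letter category codes.
CODE = {"PRE": "P", "STM": "S", "SUF": "U"}
WORD_PATTERN = re.compile(r"(?:P*SU*)+")


def is_valid_word(categories):
    """Check a category sequence against ( prefix* stem suffix* )+."""
    codes = "".join(CODE[c] for c in categories)
    return WORD_PATTERN.fullmatch(codes) is not None


def log_prob(tagged_morphs):
    """Log-probability of a word given as (morph, category) pairs."""
    total = 0.0
    previous = "#"
    for morph, category in tagged_morphs:
        total += math.log(TRANSITION[(previous, category)])
        total += math.log(EMISSION[(category, morph)])
        previous = category
    # Close the word with a transition back to the boundary.
    total += math.log(TRANSITION[(previous, "#")])
    return total


word = [("over", "PRE"), ("simpl", "STM"), ("ific", "SUF"),
        ("ation", "SUF"), ("s", "SUF")]
print(is_valid_word([category for _, category in word]))
print(log_prob(word))
```

The toy transition table only contains transitions that the regular expression allows, so a sequence such as suffix followed by word start is rejected by both components.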


Lexicon

Morphs:

“Meaning”                                        “Form”
Frequency   Right perplexity   Left perplexity   Length   String
14029       136                1                 4        over
41          4                  1                 5        simpl
17259       1                  4618              1        s
...


How meaning affects morphotactic role

• Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’)

• Assume asymmetries between the categories:

[Figure: three plots, each with a likeness value between 0 and 1 on the vertical axis: suffix-likeness as a function of left perplexity, prefix-likeness as a function of right perplexity, and stem-likeness as a function of morph length]


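How meaning affects role (cont’d)

• There is an additional non-morpheme category for cases where none of the proper classes is likely:

$$P(\mathrm{NON} \mid \text{'over'}) = [1 - \mathrm{Prefixlike}(\text{'over'})] \cdot [1 - \mathrm{Stemlike}(\text{'over'})] \cdot [1 - \mathrm{Suffixlike}(\text{'over'})]$$

• Distribute remaining probability mass proportionally, e.g.,

$$P(\mathrm{PRE} \mid \text{'over'}) = \frac{\mathrm{Prefixlike}(\text{'over'})^{q} \cdot [1 - P(\mathrm{NON} \mid \text{'over'})]}{\mathrm{Prefixlike}(\text{'over'})^{q} + \mathrm{Stemlike}(\text{'over'})^{q} + \mathrm{Suffixlike}(\text{'over'})^{q}}$$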
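The two formulas for the category priors can be sketched in a few lines. The function name category_priors, the example likeness values and the default exponent q = 2.0 are assumptions for illustration; the slide does not state a value for q.

```python
def category_priors(prefix_like, stem_like, suffix_like, q=2.0):
    """Turn three likeness values in [0, 1] into priors over PRE, STM, SUF, NON.

    q is the exponent from the slide; its default here is an assumption.
    """
    # Non-morpheme mass: large only when no proper class is likely.
    non = (1 - prefix_like) * (1 - stem_like) * (1 - suffix_like)
    powers = {
        "PRE": prefix_like ** q,
        "STM": stem_like ** q,
        "SUF": suffix_like ** q,
    }
    total = sum(powers.values())
    priors = {"NON": non}
    for category, value in powers.items():
        # Share the remaining mass 1 - P(NON) in proportion to likeness^q.
        priors[category] = value * (1 - non) / total if total > 0 else 0.0
    return priors


# Hypothetical likeness values for a prefix-like morph such as 'over'.
priors = category_priors(prefix_like=0.9, stem_like=0.2, suffix_like=0.1)
print(priors)
```

By construction the four priors sum to one, and a larger q concentrates the remaining mass more strongly on the category with the highest likeness.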


Maximum a posteriori optimization

Morfessor Categories-MAP:

$$\arg\max_{\mathrm{Lexicon}} P(\mathrm{Lexicon} \mid \mathrm{Corpus}) = \arg\max_{\mathrm{Lexicon}} P(\mathrm{Corpus} \mid \mathrm{Lexicon}) \cdot P(\mathrm{Lexicon})$$

Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically)

[Figure: the lexicon entries (over, simpl, s, ...) and the Hidden Markov model for # over simpl ific ation s #, repeated from the earlier slides]

Balance accuracy of representation of data against size of lexicon


Over- and undersegmentation still a problem?

• Probability of adding an entry to the lexicon:

$$P(\text{'morgana'}) = P(\mathrm{Freq}=1) \cdot P(\mathrm{RightPpl}=1) \cdot P(\mathrm{LeftPpl}=1) \cdot P(\mathrm{Length}=7) \cdot P(\text{'m'}) \cdot P(\text{'o'}) \cdot P(\text{'r'}) \cdot P(\text{'g'}) \cdot P(\text{'a'}) \cdot P(\text{'n'}) \cdot P(\text{'a'})$$

Rare strings are split into smaller parts (e.g., morgan + a)

• Probability of sequences in the corpus:

# hands # vs. # hand s #

Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands)
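The product for P('morgana') can be sketched as a log-probability. This is only an illustration under strong assumptions: a uniform distribution over 26 letters and a placeholder probability of 0.1 for each of the four property values. Neither is taken from the talk, and the function name is hypothetical.

```python
import math
import string

# Assumption: uniform letter distribution over a-z.
LETTER_PROBS = {letter: 1 / 26 for letter in string.ascii_lowercase}


def log_prob_new_entry(morph, p_freq, p_right_ppl, p_left_ppl, p_length,
                       letter_probs=LETTER_PROBS):
    """Log-probability of adding `morph` to the lexicon.

    The four p_* arguments stand for P(Freq), P(RightPpl), P(LeftPpl)
    and P(Length) of this entry; they are supplied by the caller.
    """
    log_p = (math.log(p_freq) + math.log(p_right_ppl)
             + math.log(p_left_ppl) + math.log(p_length))
    # Every letter of the string contributes one more factor below one.
    for letter in morph:
        log_p += math.log(letter_probs[letter])
    return log_p


# Placeholder property probabilities (assumed, identical for both entries).
long_entry = log_prob_new_entry("morgana", 0.1, 0.1, 0.1, 0.1)
short_entry = log_prob_new_entry("a", 0.1, 0.1, 0.1, 0.1)
print(long_entry, short_entry)
```

Under these assumptions a long string is far less probable as a new lexicon entry than a short one, which is the pressure that splits rare strings into parts that already exist in the lexicon.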


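Solution: Hierarchical structures in lexicon

[Figure: tree for the word oppositio + kansanedustaja; each node is marked as non-morpheme, stem or suffix]

oppositio + kansanedustaja
  oppositio = op + positio
  kansanedustaja = kansan + edustaja
    kansan = kansa + n
    edustaja = edusta + ja

• Make morphs consist of submorphs.
• Expand the tree when performing morpheme segmentation.
• Do not expand morphs consisting of non-morphemes.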
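The three rules above can be sketched as a recursive tree expansion. The Morph class and the expand function are hypothetical names, and the category labels given to the individual nodes (for example that "op" is a non-morpheme) are assumptions made for the example, since the node markings of the original figure are not recoverable from the transcript. The sketch reads "consisting of non-morphemes" as "at least one submorph is a non-morpheme".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Morph:
    string: str
    category: str  # "PRE", "STM", "SUF" or "NON"
    submorphs: Optional[Tuple["Morph", "Morph"]] = None


def expand(morph: Morph) -> List[str]:
    """Return the finest segmentation that avoids non-morphemes."""
    if morph.submorphs is None:
        return [morph.string]
    # Keep the morph whole if splitting would expose a non-morpheme.
    if any(sub.category == "NON" for sub in morph.submorphs):
        return [morph.string]
    parts: List[str] = []
    for sub in morph.submorphs:
        parts.extend(expand(sub))
    return parts


# Assumed category labels for the example tree.
oppositio = Morph("oppositio", "STM",
                  (Morph("op", "NON"), Morph("positio", "STM")))
kansan = Morph("kansan", "STM", (Morph("kansa", "STM"), Morph("n", "SUF")))
edustaja = Morph("edustaja", "STM",
                 (Morph("edusta", "STM"), Morph("ja", "SUF")))
kansanedustaja = Morph("kansanedustaja", "STM", (kansan, edustaja))

segmentation = expand(oppositio) + expand(kansanedustaja)
print(segmentation)  # ['oppositio', 'kansa', 'n', 'edusta', 'ja']
```

With these labels the frequent compound is split down to its smallest morphs, while "oppositio" stays whole because one of its halves is not a proper morpheme.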


Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)

• Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation = Hutmegs

• Covers
– 1.4 million Finnish word forms
– 120 000 English word forms

• Publicly available and described in the technical report: M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.


Evaluation against the Hutmegs Gold Standard

[Figure: two plots, one for Finnish and one for English, showing F-measure [%] as a function of corpus size [1000 words]; the horizontal axes run from 10 to 12000 in one plot and from 10 to 16000 in the other. Compared methods: context-insensitive (Baseline), paradigms (Linguistica), heuristic (Categories-ML), Categories-MAP]
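As an illustration of how an F-measure for segmentation can be computed, the sketch below compares proposed and gold-standard morpheme boundaries. That the F-measure is taken over boundary positions is an assumption here, since the slides do not say how it is defined, and the function names are hypothetical. The example words reuse segmentations shown earlier in the talk.

```python
def boundaries(segmentation):
    """Character offsets of the morph boundaries inside one word."""
    positions = set()
    offset = 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def f_measure(proposed, gold):
    """Harmonic mean of boundary precision and recall over a word list."""
    hits = proposed_total = gold_total = 0
    for proposed_word, gold_word in zip(proposed, gold):
        proposed_b = boundaries(proposed_word)
        gold_b = boundaries(gold_word)
        hits += len(proposed_b & gold_b)
        proposed_total += len(proposed_b)
        gold_total += len(gold_b)
    precision = hits / proposed_total if proposed_total else 0.0
    recall = hits / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


proposed = [["un", "believ", "able"], ["hand", "s"], ["in", "s", "an", "e"]]
gold = [["un", "believ", "able"], ["hand", "s"], ["insane"]]
score = f_measure(proposed, gold)
print(round(score, 3))  # 0.667
```

In this toy example every gold boundary is found (recall 1.0), but the oversegmented "in + s + an + e" adds three spurious boundaries, which halves the precision.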


Example segmentations

Finnish                             English
[ aarre kammio ] issa               [ accomplish es ]
[ aarre kammio ] on                 [ accomplish ment ]
bahama laiset                       [ beautiful ly ]
bahama [ saari en ]                 [ insur ed ]
[ epä [ [ tasa paino ] inen ] ]     [ insure s ]
maclare n                           [ insur ing ]
[ nais [ autoili ja ] ] a           [ [ [ photo graph ] er ] s ]
[ sano ttiin ] ko                   [ present ly ] found
töhri ( mis istä )                  [ re siding ]
[ [ voi mme ] ko ]                  [ [ un [ expect ed ] ] ly ]


Discussion

• Possibility to extend the model
– rudimentary features used for “meaning”
– more fine-grained categories
– beyond concatenative phenomena (e.g., goose – geese)
– allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)

• Already now useful in applications
– automatic speech recognition (Finnish, Turkish)


Morpho project page: http://www.cis.hut.fi/projects/morpho/


Demo 6

http://www.cis.hut.fi/projects/morpho/


Demo 7