Word Sense Disambiguation in Old English

GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/

Transcript of Word Sense Disambiguation in Old English

Page 1: Word Sense Disambiguation in Old English


"God Wat þæt Ic Eom God"

Word Sense Disambiguation in Old English

Bamberg, Staatsbibliothek, Msc.Nat.1 (9th century)

Martin Wunderlich and Alexander Fraser (LMU München)

Paul Sander Langeslag (University of Göttingen)

Page 2: Word Sense Disambiguation in Old English


Can we apply WSD techniques to a historical language like Old English, and what are the specific challenges?

Page 3: Word Sense Disambiguation in Old English


Overview

● Background on the Old English language

● NLP and historical languages – problems and opportunities

● Old English digital resources

● WSD methodologies applied here

● Experiments and results

● Summary and discussion

Page 4: Word Sense Disambiguation in Old English


Background on the OE language 1

● Spoken ca. 450 – 1100 AD

● A Germanic language:

„God Wat þæt Ic Eom God‟ → „Gott weiß, dass ich gut bin‟

(„God knows I'm good‟ - David Bowie)

● 5 cases, 3 genders, 3 numbers (singular, dual, plural)

An example:

– „Seo cwen geseah þone guman.‟ *

– „Se guma geseah þa cwen.‟ **

(from Crystal, 2010)

* „The woman saw the man.‟

** „The man saw the woman.‟

Page 5: Word Sense Disambiguation in Old English


Background on the OE language 2

● Initially a runic alphabet known as „futhorc‟ (after the first letters: ᚠᚢᚦᚩᚱᚳ)

● ...keeping Thorn ᚦ and Wynn ƿ and adding Latin

● 24-letter alphabet:

a æ b c d ð e f ᵹ/g h i l m n o p r s/ſ t þ u ƿ/w x y

● Introduced around 600 AD

Page 6: Word Sense Disambiguation in Old English


Background on the OE language 3

Migrations and settlements:

https://www.uni-due.de/SHE/Germanic_Migration_to_Britain.gif (site maintained by Prof. Raymond Hickey, Chair of Linguistics)

Page 7: Word Sense Disambiguation in Old English


NLP & historical languages: problems

● Stopword lists

● POS taggers

● Word and sentence tokenizers

● Standard tools and libraries

● Shared tasks with prepared training data

● Existing research

Page 8: Word Sense Disambiguation in Old English


NLP & historical languages: problems

● Stopword lists

● POS taggers

● Word and sentence tokenizers

● Standard tools and libraries

● Shared tasks with prepared training data

● Existing research … well, a bit ...

Page 9: Word Sense Disambiguation in Old English


NLP & historical languages: related work

● Annotation projection in Germanic languages with parallel Bible texts (Sukhareva and Chiarcos, 2014)

● Application of existing NLP tools to ancient Italian (Pennacchiotti and Zanzotto, 2008)

● Tagging Old East Slavonic texts (Meyer, 2011)

● POS tagging Early Modern German texts (Bollmann, 2013)

● Projection of tags from contemporary English to Middle English (Moon and Baldridge, 2007)

Page 10: Word Sense Disambiguation in Old English


NLP & historical languages: opportunities

1. Digital corpora & dictionaries/lexicons do exist (incl. OE Wikipedia: https://ang.wikipedia.org/wiki/H%C4%93afodtramet)

2. Static corpus

3. Few existing NLP applications → lots to explore

Page 11: Word Sense Disambiguation in Old English


Old English digital resources: corpora

● York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE); ca. 1.5 million words

● York-Toronto-Helsinki Parsed Corpus of Old English poetry (YCOEP); 71,490 words

● Dictionary of Old English Corpus in Electronic Form (DOEC); ca. 3.8 million words

→ all available through the University of Oxford Text Archive (http://www.ota.ahds.ac.uk/)

Page 12: Word Sense Disambiguation in Old English


Old English digital resources: dictionary

Dictionary of Old English (DOE) corpus stats:

Number of HTML documents      3,037
Token count                   3,786,753
Type count                    343,135
Token count / type count      ca. 11
Total number of sentences     234,113
Average sentence length       5.5
Minimum sentence length       1
Maximum sentence length       263

Compare to Brown corpus: ca. 1 million tokens and ca. 50,000 types (token count / type count = 20)

Spelling variations, e.g. „wundarlic‟, „wundorlic‟, „wunderlic‟

12,568 DOE entries for the letters from A to G (http://tapor.library.utoronto.ca/doe/)
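The token/type figures above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `corpus_stats`, the lowercasing step and the whitespace tokenization are assumptions made for illustration, and the toy input reuses the two example sentences from the background slide.

```python
from collections import Counter

def corpus_stats(sentences):
    """Token count, type count and tokens per type for tokenized sentences.

    Lowercasing is an assumption; the slides do not say how types were counted.
    """
    counts = Counter(tok.lower() for sent in sentences for tok in sent)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "tokens_per_type": n_tokens / n_types if n_types else 0.0,
    }

# Toy input: two example sentences, split on whitespace.
toy = [
    "Seo cwen geseah þone guman".split(),
    "Se guma geseah þa cwen".split(),
]
stats = corpus_stats(toy)

# Ratios implied by the counts quoted on the slide.
doe_ratio = 3_786_753 / 343_135    # DOE corpus
brown_ratio = 1_000_000 / 50_000   # Brown corpus (approximate figures)
print(stats, round(doe_ratio, 2), brown_ratio)
```

A lower tokens-per-type ratio means each word form is seen less often, which fits the spelling variation noted above: variants such as „wundarlic‟ and „wundorlic‟ count as separate types.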

Page 13: Word Sense Disambiguation in Old English


WSD methodologies 1

Criteria for selecting the target terms:

➔ minimum count 200, minimum length 3 characters

➔ non-Latin (i.e. no „dictum‟, „confundantur‟, „magister‟...)

➔ common nouns

➔ no proper nouns (e.g. no „Egypta‟, „Micel‟, „Iulianus‟...)
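The frequency and length criteria can be applied mechanically; the Latin and proper-noun exclusions need word lists. The sketch below assumes such lists are curated by hand (the slides do not say how they were obtained), and `select_candidates` is a hypothetical helper name. The common-noun criterion needs part-of-speech information and is left as a manual step here.

```python
from collections import Counter

MIN_COUNT = 200   # threshold from the slide
MIN_LENGTH = 3    # threshold from the slide

def select_candidates(tokens, latin_words, proper_nouns,
                      min_count=MIN_COUNT, min_length=MIN_LENGTH):
    """Return candidate target terms that pass the count, length and exclusion filters."""
    counts = Counter(t.lower() for t in tokens)
    # Assumption: exclusion lists are hand-curated sets of word forms.
    excluded = {w.lower() for w in latin_words} | {w.lower() for w in proper_nouns}
    return sorted(
        term for term, n in counts.items()
        if n >= min_count and len(term) >= min_length and term not in excluded
    )

# Toy corpus with a lowered threshold so the effect of each filter is visible.
tokens = ["boc"] * 3 + ["dictum"] * 3 + ["Egypta"] * 3 + ["on"] * 5 + ["ban"]
selected = select_candidates(tokens, {"dictum"}, {"Egypta"}, min_count=3)
print(selected)
```

In the toy run, „dictum‟ and „Egypta‟ are dropped by the exclusion lists, „on‟ is too short, and „ban‟ is too rare.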

Page 14: Word Sense Disambiguation in Old English


WSD methodologies 2

Target terms:

Target term    Token count in DOE corpus    Basic translation
Anweald        242                          Power, realm, order of angels
Fultum         574                          Help, aid, remedy
Fæder          416                          Father, lord (relig.)
For            955                          Movement, journey...
Eadigan        263                          To bless, to make happy
Boc            567                          Book, volume, legal doc
Ban            314                          Bone, ivory
Are            308                          Honour, mercy, property
Andlang        1743                         Continuous, upright
Dryhten        261                          Lord (worldly & relig.), chief

100 concordance matches each (random selection)
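Drawing 100 random concordance matches per term could look like the sketch below. The helper `sample_concordance`, the fixed seed and the exact-token matching are assumptions; exact matching misses inflected forms and spelling variants, and the slides do not say how matches were found.

```python
import random

def sample_concordance(sentences, target, k=100, seed=0):
    """Pick up to k random sentences that contain `target` as a token.

    Matching is case-insensitive and exact (an assumption for illustration).
    """
    wanted = target.lower()
    matches = [s for s in sentences if wanted in (t.lower() for t in s)]
    if len(matches) <= k:
        return matches
    # A seeded generator keeps the selection reproducible.
    return random.Random(seed).sample(matches, k)

# Toy corpus: 250 sentences with the target term and 10 without.
corpus = [["seo", "boc", str(i)] for i in range(250)] + [["se", "guma"]] * 10
sample = sample_concordance(corpus, "Boc")
print(len(sample))
```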

Page 15: Word Sense Disambiguation in Old English


WSD methodologies 3

Selected word senses of "bōc": (http://tapor.library.utoronto.ca/doe/dict/indices/headwordsd.html#E03007)

A. book

A.1. in general, without particular reference to form or content

Lk (WSCp) 4.17: he þa boc unfeold

B. major division of a larger work

JnArgGl (Li) 3: ðis uutedlice godspell aurat in ðær meigð æfter ðon in Pathma ealond þæt boc ðæra sighðana eac awrat.

D. legal document

Birch 862: Þis is ðæs landes boc æt Duntune ðe Eadred cyng edniwon gæbocodæ sanctæ trinitate & Sanctæ Pætræ & Sanctæ Paule into ealdan mynstræ.

Page 16: Word Sense Disambiguation in Old English


WSD methodologies 4

From corpus to feature vectors – bag-of-words model with fixed-size token window

from Ch 540 (Birch 862):
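A bag-of-words vector over a fixed-size token window can be sketched as follows. The window size of 3, the lowercasing and the whitespace tokenization are assumptions chosen for illustration; the function `window_features` is not taken from the authors' implementation. The sentence is the opening of the Birch 862 charter quoted on the word-sense slide.

```python
from collections import Counter

def window_features(tokens, index, window=3):
    """Count the tokens within `window` positions left and right of tokens[index].

    The target token itself is left out; word order inside the window is discarded.
    """
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return Counter(t.lower() for t in left + right)

sentence = "Þis is ðæs landes boc æt Duntune ðe Eadred cyng".split()
target_index = sentence.index("boc")
features = window_features(sentence, target_index, window=3)
print(features)
```

Each concordance match yields one such count vector, labelled with the sense assigned by the annotator.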

Page 17: Word Sense Disambiguation in Old English


Implementation

● Libraries used:

– Mallet (NLP and ML library)

– Jsoup (HTML processing)

● Own implementation:

– Parsing of corpus and dictionary data

– Feature extraction and instance creation

– Pipes for baseline classifiers (Mallet additions)

– Metrics, summarization and output of results

...and much more...

Page 18: Word Sense Disambiguation in Old English


Experiments and results 1

● Baseline 1: most frequent class

– Accuracy: 0.67

● Baseline 2: random class

– Accuracy: 0.44

Human annotators' lower and upper bounds: 0.75 – 0.97 (Gale et al., 1992)
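The two baselines can be illustrated in a few lines of Python. This is a sketch only: the authors implemented theirs as Mallet additions in Java, and whether their random baseline samples senses uniformly (as here) or by class frequency is not stated on the slides. The labels A, B, D are toy data.

```python
import random
from collections import Counter

def most_frequent_class_baseline(train_labels, test_labels):
    """Accuracy when every test instance gets the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)

def random_class_baseline(train_labels, test_labels, seed=0):
    """Accuracy when each test instance gets a uniformly random known label.

    Uniform sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    correct = sum(1 for y in test_labels if rng.choice(classes) == y)
    return correct / len(test_labels)

train = ["A"] * 7 + ["B"] * 2 + ["D"]
test = ["A", "A", "B", "A", "D", "A"]
mfc_acc = most_frequent_class_baseline(train, test)
rnd_acc = random_class_baseline(train, test)
print(round(mfc_acc, 2), round(rnd_acc, 2))
```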

Page 19: Word Sense Disambiguation in Old English


Experiments and results 2

One-vs-all classification

[Figure, two panels titled "A vs. notA - Naive Bayes": horizontal axis from 0 to 20, vertical axis from 0 to 1; plotted series: Accuracy, Avg Precision, Avg Recall, Avg F1. The second panel adds a linear regression trend line ("Lin Reg trend").]
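The "A vs. notA" setup relabels every instance as either the sense of interest or its complement and then trains a binary classifier. The sketch below pairs that relabeling with a small multinomial Naive Bayes using add-one smoothing. It is a stand-in written for illustration, not Mallet's implementation, and the training contexts are toy data assembled from words on the word-sense slide.

```python
import math
from collections import Counter, defaultdict

def to_one_vs_all(labels, positive):
    """Relabel multi-class labels as `positive` vs. 'not' + positive."""
    return [y if y == positive else "not" + positive for y in labels]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over bag-of-words tokens."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for y, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[y].values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in doc:
                if w in self.vocab:  # skip words never seen in training
                    score += math.log(
                        (self.word_counts[y][w] + 1) / (total_words + len(self.vocab))
                    )
            if score > best_score:
                best, best_score = y, score
        return best

# Toy context windows for "boc": sense A (book) and sense D (legal document).
docs = [["he", "þa", "unfeold"], ["þa", "unfeold"],
        ["ðæs", "landes", "æt"], ["landes", "cyng", "edniwon"]]
labels = ["A", "A", "D", "D"]

binary_labels = to_one_vs_all(labels, "A")
model = NaiveBayes().fit(docs, binary_labels)
pred_charter = model.predict(["ðæs", "landes"])
pred_gospel = model.predict(["he", "unfeold"])
print(binary_labels, pred_charter, pred_gospel)
```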

Page 20: Word Sense Disambiguation in Old English


Experiments and results 3

One-vs-all classification

[Figure, two panels titled "A vs. notA - MaxEnt": horizontal axis from 0 to 20, vertical axis from 0 to 1; plotted series: Accuracy, Avg Precision, Avg Recall, Avg F1. The second panel adds a linear regression trend line ("Lin Reg trend").]

Page 21: Word Sense Disambiguation in Old English


WSD methodologies 3

Selected word senses of "bōc": (http://tapor.library.utoronto.ca/doe/dict/indices/headwordsd.html#E03007)

A. book

A.1. in general, without particular reference to form or content

Lk (WSCp) 4.17: he þa boc unfeold

B. major division of a larger work

JnArgGl (Li) 3: ðis uutedlice godspell aurat in ðær meigð æfter ðon in Pathma ealond þæt boc ðæra sighðana eac awrat.

D. legal document

Birch 862: Þis is ðæs landes boc æt Duntune ðe Eadred cyng edniwon gæbocodæ sanctæ trinitate & Sanctæ Pætræ & Sanctæ Paule into ealdan mynstræ.

Page 22: Word Sense Disambiguation in Old English


Experiments and results 4

Algorithm         Feature vector   Accuracy           Precision          Recall             F1
                                   Avg      Std Dev   Avg      Std Dev   Avg      Std Dev   Avg
NB, multi-class   BoW              0.7635   0.11      0.7205   0.18      0.7865   0.16      0.7521
ME, multi-class   BoW              0.7520   0.17      0.8610   0.10      0.6915   0.17      0.7670
NB, one-vs-all    BoW              0.8400   0.09      0.8458   0.10      0.8368   0.11      0.8295
ME, one-vs-all    BoW              0.7950   0.12      0.7875   0.13      0.8080   0.12      0.7662
NB, multi-class   Coll.            0.7245   0.12      0.8135   0.08      0.6510   0.12      0.5895
ME, multi-class   Coll.            0.7910   0.13      0.8845   0.08      0.6875   0.16      0.6510
NB, one-vs-all    Coll.            0.8200   0.09      0.8305   0.12      0.8085   0.10      0.7970
ME, one-vs-all    Coll.            0.7290   0.09      0.7395   0.10      0.7145   0.14      0.6890
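For reference, precision, recall and F1 for one sense can be computed as in the sketch below. The function names are illustrative; the slides do not spell out how the averages in the table were aggregated. The last lines check the harmonic mean of the averaged precision and recall for the two multi-class BoW rows.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for the `positive` label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy predictions for one target term.
gold = ["A", "A", "notA", "A"]
pred = ["A", "notA", "notA", "A"]
p, r, f = precision_recall_f1(gold, pred, "A")

# Harmonic mean of the averaged precision and recall from the table.
nb_multi_bow = round(f1_from(0.7205, 0.7865), 4)
me_multi_bow = round(f1_from(0.8610, 0.6915), 4)
print(p, round(r, 4), round(f, 4), nb_multi_bow, me_multi_bow)
```

For those two rows the harmonic mean reproduces the tabulated Avg F1. For the other rows it does not, which would be consistent with F1 being averaged per target term there, although the slides do not confirm this.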

Page 23: Word Sense Disambiguation in Old English


Summary

● Historical languages: interesting, rewarding and difficult to work with

● WSD does give satisfactory results even without stemming etc.

● Best WSD performance: NB (F1), one vs. all, window size: ??

● Annotated data set (available on website)

● Baseline classifiers as contributions to MALLET

● Possible extensions:

– More advanced vector representations

– Bootstrapping

– Train classifiers based on other corpora

– Distributional thesaurus (DT)?

● Acknowledgements: Winfried Rudolf, Göttingen & Juan Carmona Ramirez, Jena

Page 24: Word Sense Disambiguation in Old English


Thanks a lot for your attention!

Any questions?

Paul S. Langeslag, Göttingen

New book: Seasons in the Literatures of the Medieval North

Alexander Fraser, München

Page 25: Word Sense Disambiguation in Old English


References

● Mark Stevenson. Word Sense Disambiguation: The Case for Combinations of Knowledge Sources. CSLI Studies in Computational Linguistics. CSLI Publications, Stanford, Calif., 2003.

● D. Yarowsky. Word sense disambiguation. In Alexander Clark, editor, The Handbook of Computational Linguistics and Natural Language Processing, Blackwell Handbooks in Linguistics. Wiley-Blackwell, Oxford [etc.], 1st edition, 2010.

● D. Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 2010.

● Clara Cabezas, Philip Resnik, and Jessica Stevens. Supervised sense tagging using support vector machines. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL ’01, pages 59–62, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics.

● Andrew Kachites McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.

● Marcel Bollmann. POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability in Discourse, pages 11–18, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.

● Taesun Moon and Jason Baldridge. Part-of-speech tagging for Middle English through alignment and projection of parallel diachronic texts. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 390–399, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

● Roland Meyer. New wine in old wineskins? Tagging Old Russian via annotation projection from modern translations. Russian Linguistics, 35(2):267–281, 2011.

● Marco Pennacchiotti and Fabio Massimo Zanzotto. Natural Language Processing across time: an empirical investigation on Italian, volume 5221, pages 371–382. Springer, 2008.