PlWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources Marek Maziarz, Maciej...

plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz* G4.19 Research Group, Institute of Informatics Wrocław University of Technology * School of Electrical Engineering and Computer Science University of Ottawa www.plwordnet.pwr.wroc.pl



plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources

Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz*

G4.19 Research Group, Institute of Informatics

Wrocław University of Technology

* School of Electrical Engineering and Computer Science

University of Ottawa

www.plwordnet.pwr.wroc.pl


Wordnet as a Lexical Resource

• Princeton WordNet defines the de facto standard
  – large size and coverage
  – open access
  – thousands of applications
• Applications: dictionary vs knowledge representation
• Range of description
• Ideal size and natural development limits


plWordNet model: linguistic resource

• Wordnet vs ontology
  – O: a strict knowledge representation
  – W: concepts expressed entirely in a natural language
  – W: synonymy is a matter of degree
  – O: certainty and a rigorous construction
  – W: shaped by the lexico-semantic dependencies
• Alternative to formalisation
  – Corpus analysis and substitution tests
  – Minimal commitment: defining lexico-semantic relations without committing to any particular theory of lexical semantics or human cognition


plWordNet model: corpus-based development

• Main source of lexical knowledge: a very large monolingual corpus
  – tools for corpus browsing
  – semi-automatic knowledge extraction
• Additional sources: dictionaries and encyclopedias
• Lexical unit
  – lemma-sense pair
  – a linguistically motivated primitive


plWordNet model: synset definition

• Synsets
  – groups of lexical units sharing certain relations
    {afekt 1 ‘passion’, uczucie 2 ‘feeling’}
      hypernym
    {miłość 1 ‘love’, umiłowanie 1 ‘affection’, kochanie 1 ~‘loving’}
• Constitutive relations
  – fairly frequent (to describe many LUs)
  – shared among LUs (to define groups)
  – grounded in the linguistic tradition (to facilitate their consistent understanding)
  – used in other wordnets (to improve compatibility)
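The synset definition above (lexical units grouped because they share constitutive relations) can be illustrated with a minimal sketch. The data layout, the relation label and the function name are assumptions made for this example; they are not plWordNet's actual data model.

```python
from collections import defaultdict

# Hypothetical toy data: each lexical unit (lemma, sense number) is mapped to
# the set of constitutive relation instances it takes part in. The relation
# target is written as a plain string label for illustration only.
LU_RELATIONS = {
    ("miłość", 1): {("has_hypernym", "{afekt 1, uczucie 2}")},
    ("umiłowanie", 1): {("has_hypernym", "{afekt 1, uczucie 2}")},
    ("kochanie", 1): {("has_hypernym", "{afekt 1, uczucie 2}")},
    ("afekt", 1): set(),
    ("uczucie", 2): set(),
}


def group_into_synsets(lu_relations):
    """Group lexical units whose sets of constitutive relations are identical."""
    groups = defaultdict(list)
    for lexical_unit, relations in lu_relations.items():
        # frozenset makes the relation set usable as a dictionary key
        groups[frozenset(relations)].append(lexical_unit)
    return [sorted(members) for members in groups.values()]


synsets = group_into_synsets(LU_RELATIONS)
```

In this toy setting the three "love" units end up in one group and the two "feeling" units in another, mirroring the example on the slide. In practice the grouping is an editorial decision supported by substitution tests, not a purely mechanical one.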


plWordNet model: non-relational aspects

• Constitutive features
  – stylistic registers
  – verb aspect
  – semantic verb classes
• Referred to in the relation definitions
  – e.g. relations limited to verbs of the same aspect and semantic class
• Glosses help wordnet editors
• Usage examples: direct links to the corpus
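The restriction mentioned above, where a relation may only link verbs of the same aspect and semantic class, can be sketched as a simple check. The class `VerbLU`, the function `may_link` and the class label "motion" are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbLU:
    """A verbal lexical unit with its constitutive features (illustrative)."""
    lemma: str
    sense: int
    aspect: str          # "perfective" or "imperfective"
    semantic_class: str  # hypothetical class label


def may_link(source: VerbLU, target: VerbLU) -> bool:
    """Allow a relation only between verbs of the same aspect and class."""
    return (source.aspect == target.aspect
            and source.semantic_class == target.semantic_class)


biec = VerbLU("biec", 1, "imperfective", "motion")
pobiec = VerbLU("pobiec", 1, "perfective", "motion")
poruszac_sie = VerbLU("poruszać się", 1, "imperfective", "motion")
```

Here `may_link(biec, poruszac_sie)` holds, while `may_link(pobiec, poruszac_sie)` does not, because the aspects differ.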


Relation density

• Synset relation density in PWN 3.1 and in plWordNet 2.0

[Bar chart: synset relation density (axis 0 to 4.5) for nouns, verbs, adjectives and total, PWN vs plWN. Data labels in extraction order: 3.54, 2.21, 2.43, 3.11; 3.99, 3.06, 1.56, 3.51.]


Size matters: lexical coverage

Coverage of PWN/plWN for lemmas of different frequency in two similar 1.2G-word corpora (Wikipedia)

[Bar chart: coverage (axis 0% to 70%) by corpus frequency of lemmas (≥1000, ≥500, ≥200, ≥100, ≥50), PWN vs plWN. Data labels in extraction order: 38.3%, 28.0%, 17.0%, 10.7%, 6.4%; 58.3%, 45.6%, 35.0%, 27.7%, 21.0%.]
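One way to compute coverage figures of this kind is sketched below. It assumes that coverage means the share of corpus lemmas at or above a frequency threshold that also occur in the wordnet; the slide does not spell out the definition. The function name and the toy counts are made up for the example.

```python
from collections import Counter


def coverage_by_threshold(lemma_counts, wordnet_lemmas, thresholds):
    """For each threshold t, return the fraction of corpus lemmas with
    frequency >= t that are present in the wordnet (None if there are none)."""
    result = {}
    for t in thresholds:
        frequent = [lemma for lemma, count in lemma_counts.items() if count >= t]
        if not frequent:
            result[t] = None
            continue
        covered = sum(1 for lemma in frequent if lemma in wordnet_lemmas)
        result[t] = covered / len(frequent)
    return result


# Toy corpus counts and a toy wordnet lemma list (illustrative only)
counts = Counter({"dom": 1200, "kot": 600, "pies": 250, "afekt": 120, "xyzzy": 60})
wordnet = {"dom", "kot", "afekt"}
coverage = coverage_by_threshold(counts, wordnet, [1000, 500, 200, 100, 50])
```

As on the slide, coverage drops as the threshold is lowered, because rarer lemmas are less likely to have been described yet.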


Size matters: plWordNet 2.2

POS          Synsets    Lemmas     LUs        Average synset size
Nouns        102 613    105 883    140 701    1.37
Verbs         21 897     17 554     32 180    1.47
Adjectives    15 145     11 677     18 787    1.24
All          139 656    135 115    191 669    1.37
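The last column of the table is consistent with the number of lexical units divided by the number of synsets. The short check below recomputes it from the other columns; the dictionary layout and function name are chosen only for this illustration.

```python
# (synsets, lexical units) per part of speech, copied from the table
PLWN_2_2 = {
    "Nouns": (102_613, 140_701),
    "Verbs": (21_897, 32_180),
    "Adjectives": (15_145, 18_787),
    "All": (139_656, 191_669),
}


def average_synset_size(synsets, lexical_units):
    """Average number of lexical units per synset."""
    return lexical_units / synsets


averages = {
    pos: round(average_synset_size(synsets, lus), 2)
    for pos, (synsets, lus) in PLWN_2_2.items()
}
```

Rounded to two decimals, the recomputed values match the table: 1.37, 1.47, 1.24 and 1.37.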

www.plwordnet.pwr.wroc.pl


plWordNet: ongoing work


Size matters: comparison of wordnets

[Bar chart: numbers of synsets, lemmas and lexical units (axis 0 to 250 000) for plWN 2.2, PWN 3.1 and GermaNet.]


How many words are there? - existing dictionaries

● Woordenboek der Nederlandsche Taal: 430k lemmas
● dictionary of the Grimm brothers: 330k lemmas
● Oxford English Dictionary: 300k lemmas
● ‘Warsaw’ Polish Dictionary: 280k lemmas
● contemporary Polish dictionaries: 130k lemmas

[Side label: unabridged dictionaries]


How many words are there? - approximation

[Plot, COBUILD data: N10+ [x 10^3 words] against corpus size [x 10^6 words] (x axis 0 to 1800, y axis 0 to 200). Data labels: 19.8, 45.2, 93, 174. Annotations: "N10+ = 6,67"; "~174k (10+ lemmas)".]


How many words are there?

Source                                        # entries
Polish dictionaries                           100-280k
plWordNet corpus (10+ lemmas) [K]             174k
doubled plWordNet corpus (0+ lemmas) [GT]     +200k

K - Krishnamurthy’s data (2002), GT - Good & Toulmin approximation (1956)

plWordNet 3.0: 200k lemmas
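The Good & Toulmin approximation cited above estimates how many previously unseen types would appear if a sample were enlarged. A minimal sketch of the classical estimator follows: with Φ_k the number of types seen exactly k times, the expected number of new types in an additional sample of relative size t is Σ (-1)^(k+1) t^k Φ_k, which for a doubled corpus (t = 1) reduces to Φ_1 - Φ_2 + Φ_3 - ... How exactly the slide's figure was derived is not stated, so the function below illustrates the technique and does not reconstruct the authors' computation. The function name and the toy sample are assumptions.

```python
from collections import Counter


def good_toulmin_new_types(type_counts, t=1.0):
    """Estimate the number of unseen types expected in an additional sample
    t times the size of the observed one (reliable only for t <= 1)."""
    # freq_of_freq[k] = number of types observed exactly k times
    freq_of_freq = Counter(type_counts.values())
    return sum((-1) ** (k + 1) * (t ** k) * n_k
               for k, n_k in freq_of_freq.items())


# Toy sample: "a" occurs 3 times, "b" twice, "c", "d", "e" once each
sample = "a a a b b c d e".split()
counts = Counter(sample)
estimate = good_toulmin_new_types(counts, t=1.0)
```

For the toy sample Φ_1 = 3, Φ_2 = 1 and Φ_3 = 1, so doubling the sample is expected to reveal 3 - 1 + 1 = 3 new types.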


Toolkit of Lexico-semantic Resources

• Lexicon of lexico-syntactic structures of multi-word expressions
• plWordNet 3.0 (Słowosieć 3.0)
• plWordNet 3.0 to WordNet 3.1 mapping
• Semantic lexicon of proper names
• Mapping to an ontology
• And a valency lexicon linked to plWordNet


Lexicon of multi-word expressions

• Non-trivial morphology of Polish MWEs
  – more than 100 nominal structural patterns
• Description of the lexico-syntactic structures of MWEs
• Multi-word LUs as semantic atoms
  – no internal semantic relations
• Dynamic lexicon
  – a tool for automatic MWE extraction
  – 60 000 MWEs described in the lexicon and plWordNet


Lexicon of Proper Names

• PNs are not a part of the lexicon
• A PN is an instance of a type
  – characterised by referents
  – not by their semantic properties
• Linking PNs via a wordnet
  – some lexico-syntactic contexts signal the instance-of relation
  – PNs are represented in wordnets
• PNs as derivational bases for Common Nouns
• Dynamic lexicon with 2.5 million PNs verified manually
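The idea that some lexico-syntactic contexts signal the instance-of relation can be illustrated with a very small pattern matcher. The pattern "type noun followed by a capitalised word", the two type nouns and the function name are hypothetical choices for this sketch; the slides do not list the contexts actually used, and a real system would also have to handle Polish inflection.

```python
import re

# Hypothetical common nouns that may introduce a proper name ("city", "river")
TYPE_NOUNS = ("miasto", "rzeka")

# A type noun followed by a word starting with an upper-case letter
PATTERN = re.compile(
    r"\b(" + "|".join(TYPE_NOUNS) + r")\s+([A-ZĄĆĘŁŃÓŚŹŻ]\w+)"
)


def instance_of_candidates(text):
    """Return (proper name, type noun) pairs suggested by the pattern."""
    return [(name, type_noun) for type_noun, name in PATTERN.findall(text)]


pairs = instance_of_candidates("miasto Wrocław i rzeka Odra")
```

The pairs produced this way are only candidates; linking a name to a wordnet synset would still need the manual verification mentioned on the slide.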


plWordNet to WordNet 3.1 mapping

• plWordNet: built independently to obtain a faithful description
• Manual mapping
  – bottom-up order
  – comparison of the relation structures
  – a cascading list of interlingual relations
• plWordNet verification as an important side effect
• Present state: 72 000 N and Adj synsets mapped
• Target: complete plWordNet 3.0 mapped
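A cascading list of interlingual relations can be read as an ordered preference: the editor tries the strongest relation first and falls back to weaker ones only when it does not apply. The sketch below shows that control flow. The relation names, their order and the synset identifiers are assumptions made for the example, not the project's actual inventory.

```python
# Hypothetical cascade, ordered from the most to the least preferred relation
CASCADE = ["I-synonymy", "I-partial-synonymy", "I-hyponymy", "I-meronymy"]


def choose_relation(candidates):
    """Pick the first relation in the cascade for which the editor found a target.

    `candidates` maps a relation name to the target synset judged applicable
    (an illustrative identifier string), or omits the relation entirely.
    """
    for relation in CASCADE:
        target = candidates.get(relation)
        if target is not None:
            return relation, target
    return None  # no interlingual link could be established


# No exact or partial equivalent was found, so the cascade falls through
link = choose_relation({"I-hyponymy": "feeling.n.01", "I-meronymy": "mind.n.01"})
```

In this example `link` is the hyponymy link, because the two preferred relations have no candidate target.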


Wordnet editor: WordnetLoom


WordnetLoom: editing the mapping


Mapping to ontology

• Ontology: unambiguous concepts defined formally
• Lexical meanings
  – imprecisely delimited
  – constrained by usage, stylistic register and sentiment
• Mapping to ontology
  – precise, formal description for meanings
  – association between concepts and their lexical embodiment
• SUMO selected
  – Princeton WordNet mapping
  – Semi-automated mapping of plWordNet


Expectations

[Diagram nodes: plWordNet 3.0; Valence lexicon; MWE lexicon; WordNet 3.1 + extension; Proper Names; Ontology: SUMO + intermediate level. Edge label: describes.]


Applications

• Strong universal basis
  – a comprehensive wordnet of >200 000 lemmas, resulting in ~285 000 LUs and ~210 000 synsets
  – one of the largest ever Polish dictionaries
• Modularly constructed toolkit
  – a layered architecture of large software systems
  – separate but linked layers
  – each layer based on a limited set of notions and principles, and exchangeable
• The core of the CLARIN-PL language technology infrastructure



www.plwordnet.pwr.wroc.pl

Thank you!