plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources
Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz*
G4.19 Research Group, Institute of Informatics
Wrocław University of Technology
* School of Electrical Engineering and Computer Science
University of Ottawa
www.plwordnet.pwr.wroc.pl
Wordnet as a Lexical Resource
• Princeton WordNet defines the de facto standard
  – large size and coverage
  – open access
  – thousands of applications
• Applications: dictionary vs knowledge representation
• Range of description
• Ideal size and natural development limits
plWordNet model: linguistic resource
• Wordnet vs ontology
  – O: a strict knowledge representation
  – W: concepts expressed entirely in a natural language
  – W: synonymy is a matter of degree
  – O: certainty and a rigorous construction
  – W: shaped by the lexico-semantic dependencies
• Alternative to formalisation
  – Corpus analysis and substitution tests
  – Minimal commitment: defining lexico-semantic relations without committing to any particular theory of lexical semantics or human cognition
plWordNet model: corpus-based development
• Main source of lexical knowledge: a very large monolingual corpus
  – tools for corpus browsing
  – semi-automatic knowledge extraction
• Additional sources: dictionaries and encyclopedias
• Lexical unit
  – lemma-sense pair
  – a linguistically motivated primitive
plWordNet model: synset definition
• Synsets
  – groups of lexical units sharing certain relations
    {afekt 1 ‘passion’, uczucie 2 ‘feeling’}
      hypernym
    {miłość 1 ‘love’, umiłowanie 1 ‘affection’, kochanie 1 ~‘loving’}
• Constitutive relations
  – fairly frequent (to describe many LUs)
  – shared among LUs (to define groups)
  – grounded in the linguistic tradition (to facilitate their consistent understanding)
  – used in other wordnets (to improve compatibility)
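The synset definition above can be sketched as a small data model. This is only an illustration, not plWordNet's actual schema: the class names `LexicalUnit` and `Synset`, the `relations` dictionary and the relation label `"hypernym"` are assumptions made for this example. The lemmas and sense numbers come from the slide, read here as the first synset being the hypernym of the second.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma-sense pair, the basic building block."""
    lemma: str
    sense: int


@dataclass
class Synset:
    """A group of lexical units that share the same constitutive relations."""
    units: frozenset
    # relation name -> list of target synsets (hypothetical storage layout)
    relations: dict = field(default_factory=dict)

    def add_relation(self, name, target):
        self.relations.setdefault(name, []).append(target)


feeling = Synset(frozenset({LexicalUnit("afekt", 1), LexicalUnit("uczucie", 2)}))
love = Synset(frozenset({
    LexicalUnit("miłość", 1),
    LexicalUnit("umiłowanie", 1),
    LexicalUnit("kochanie", 1),
}))

# Assumed reading of the slide: the first synset is the hypernym of the second.
love.add_relation("hypernym", feeling)

hypernym_lemmas = sorted(u.lemma for s in love.relations["hypernym"] for u in s.units)
print(hypernym_lemmas)
```

Attaching relations to the synset rather than to each lexical unit mirrors the idea that the units in a synset are grouped because they share those relations.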
plWordNet model: non-relational aspects
• Constitutive features
  – stylistic registers
  – verb aspect
  – semantic verb classes
• Referred to in the relation definitions
  – e.g. relations limited to verbs of the same aspect and semantic class
• Glosses help wordnet editors
• Usage examples: direct links to the corpus
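A constraint such as "relations limited to verbs of the same aspect and semantic class" could be checked along the following lines. This is a sketch under stated assumptions: the `VerbUnit` class, the function `may_be_linked`, the class label `"motion"` and the sense numbers are all hypothetical and are not taken from plWordNet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbUnit:
    lemma: str
    sense: int
    aspect: str          # "perfective" or "imperfective"
    semantic_class: str  # hypothetical class label


def may_be_linked(a: VerbUnit, b: VerbUnit) -> bool:
    """Allow a relation only between verbs of the same aspect and semantic class."""
    return a.aspect == b.aspect and a.semantic_class == b.semantic_class


# Illustrative entries; sense numbers and the class label are invented.
biec = VerbUnit("biec", 1, "imperfective", "motion")
isc = VerbUnit("iść", 1, "imperfective", "motion")
pobiec = VerbUnit("pobiec", 1, "perfective", "motion")

print(may_be_linked(biec, isc))     # same aspect, same class
print(may_be_linked(biec, pobiec))  # aspect differs
```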
Relation density
• Synset relation density in PWN 3.1 and in plWordNet 2.0
[Figure: bar chart of synset relation density (vertical axis 0 to 4.5) for nouns, verbs, adjectives and total; series: PWN, plWN; data labels as extracted: 3.54, 2.21, 2.43, 3.11, 3.99, 3.06, 1.56, 3.51]
Size matters: lexical coverage
Coverage of PWN/plWN for lemmas of different frequency in two similar 1.2G-word corpora (Wikipedia)
[Figure: bar chart of lexical coverage (vertical axis 0% to 70%) by corpus frequency threshold (≥1000, ≥500, ≥200, ≥100, ≥50); series: PWN, plWN; data labels as extracted: 38.3%, 28.0%, 17.0%, 10.7%, 6.4%, 58.3%, 45.6%, 35.0%, 27.7%, 21.0%]
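The slide does not say exactly how coverage was measured. One plausible reading, sketched below, is the share of corpus lemmas at or above a frequency threshold that also appear in the wordnet. The function name `coverage` and the toy data are assumptions made for illustration, and the numbers have no connection to the figures reported above.

```python
def coverage(corpus_freq, wordnet_lemmas, min_freq):
    """Share of corpus lemmas with frequency >= min_freq that the wordnet contains."""
    frequent = [lemma for lemma, n in corpus_freq.items() if n >= min_freq]
    if not frequent:
        return 0.0
    covered = sum(1 for lemma in frequent if lemma in wordnet_lemmas)
    return covered / len(frequent)


# Toy data, invented for illustration only.
corpus_freq = {"dom": 1500, "kot": 700, "pies": 650, "zamek": 120, "xyzzy": 60}
wordnet_lemmas = {"dom", "kot", "zamek"}

for threshold in (1000, 500, 100, 50):
    print(threshold, round(coverage(corpus_freq, wordnet_lemmas, threshold), 2))
```

Lowering the threshold admits rarer lemmas into the denominator, which is why coverage tends to fall as the threshold drops.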
Size matters: plWordNet 2.2
POS          Synsets    Lemmas     LUs        Average synset size
Nouns        102 613    105 883    140 701    1.37
Verbs         21 897     17 554     32 180    1.47
Adjectives    15 145     11 677     18 787    1.24
All          139 656    135 115    191 669    1.37
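The last column appears to be the number of lexical units divided by the number of synsets. That reading is an assumption, but it can be checked against the table's own figures with a short calculation (the variable names are chosen for this example):

```python
# Figures copied from the plWordNet 2.2 table: (synsets, lexical units) per part of speech.
counts = {
    "Nouns": (102_613, 140_701),
    "Verbs": (21_897, 32_180),
    "Adjectives": (15_145, 18_787),
    "All": (139_656, 191_669),
}

# Average synset size = lexical units per synset, rounded to two decimals.
average_size = {pos: round(lus / synsets, 2) for pos, (synsets, lus) in counts.items()}
for pos, value in average_size.items():
    print(f"{pos}: {value}")
```

The computed ratios round to 1.37, 1.47, 1.24 and 1.37, matching the reported column.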
plWordNet: ongoing work
Size matters: comparison of wordnets
[Figure: bar chart comparing the number of synsets, lemmas and lexical units (vertical axis 0 to 250 000) in plWN 2.2, PWN 3.1 and GermaNet]
How many words are there? - existing dictionaries
● Woordenboek der Nederlandsche Taal: 430k lemmas
● dictionary of the Grimm brothers: 330k lemmas
● Oxford English Dictionary: 300k lemmas
● ‘Warsaw’ Polish Dictionary: 280k lemmas
● contemporary Polish dictionaries: 130k lemmas
[Side label: unabridged dictionaries]
[Figure: N10+ [x 10^3 words] plotted against corpus size [x 10^6 words], horizontal axis 0 to 1800, vertical axis 0 to 200; COBUILD data; data labels as extracted: 174, 19.8, 45.2, 93; annotations: N10+ = 6,67 and ~174k (10+ lemmas)]
How many words are there? - approximation
                                             # entries
Polish dictionaries                          100-280k
plWordNet corpus (10+ lemmas) [K]            174k
doubled plWordNet corpus (0+ lemmas) [GT]    +200k
How many words are there?
K - Krishnamurthy’s data (2002), GT - Good & Toulmin approximation (1956)
plWordNet 3.0 200k lemmas
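The slides cite Good & Toulmin (1956) but do not show how the estimate was obtained. In the form usually given, the estimator predicts the number of new types in a further sample t times the size of the original as Σ_k (-1)^(k+1) · t^k · Φ_k, where Φ_k is the number of types seen exactly k times. Below is a minimal sketch of that formula on invented toy data; the function name `good_toulmin` is an assumption, and nothing here reproduces the authors' actual computation.

```python
from collections import Counter


def good_toulmin(tokens, t=1.0):
    """Estimate how many unseen types a further sample of t * len(tokens) tokens would add."""
    type_freq = Counter(tokens)                 # type -> number of occurrences
    freq_of_freq = Counter(type_freq.values())  # k -> number of types seen exactly k times
    return sum((-1) ** (k + 1) * t ** k * n_k for k, n_k in freq_of_freq.items())


# Toy sample, invented for illustration: two types seen once, one twice, one three times.
sample = ["a", "a", "a", "b", "b", "c", "d"]
print(good_toulmin(sample))  # doubling the sample (t = 1): 2 - 1 + 1
```

With t = 1 the formula reduces to the alternating sum Φ1 - Φ2 + Φ3 - ..., which corresponds to doubling the corpus. For t > 1 the estimate is known to become unstable.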
Toolkit of Lexico-semantic Resources
• Lexicon of lexico-syntactic structures of multi-word expressions
• plWordNet 3.0 (Słowosieć 3.0)
• plWordNet 3.0 to WordNet 3.1 mapping
• Semantic lexicon of proper names
• Mapping to an ontology
• And a valency lexicon linked to plWordNet
Lexicon of multi-word expressions
• Non-trivial morphology of Polish MWEs
  – more than 100 nominal structural patterns
• Description of the lexico-syntactic structures of MWEs
• Multi-word LUs as semantic atoms
  – no internal semantic relations
• Dynamic lexicon
  – a tool for automatic MWE extraction
  – 60 000 MWEs described in the lexicon and plWordNet
Lexicon of Proper Names
• PNs are not a part of the lexicon
• A PN is an instance of a type
  – characterised by referents
  – not by their semantic properties
• Linking PNs via a wordnet
  – some lexico-syntactic contexts signal the instance-of relation
  – PNs are represented in wordnets
• PNs as derivational bases for Common Nouns
• Dynamic lexicon with 2.5 million PNs verified manually
plWordNet to WordNet 3.1 mapping
• plWordNet: built independently to obtain a faithful description
• Manual mapping
  – bottom-up order
  – comparison of the relation structures
  – a cascading list of interlingual relations
• plWordNet verification as an important side effect
• Present state: 72 000 N and Adj synsets mapped
• Target: complete plWordNet 3.0 mapped
Wordnet editor: WordnetLoom
WordnetLoom: editing the mapping
Mapping to ontology
• Ontology: unambiguous concepts defined formally
• Lexical meanings
  – imprecisely delimited
  – constrained by usage, stylistic register and sentiment
• Mapping to ontology
  – precise, formal description for meanings
  – association: concepts – their lexical embodiment
• SUMO selected
  – Princeton WordNet mapping
  – Semi-automated mapping of plWordNet
Expectations
[Diagram of the toolkit components: plWordNet 3.0; Valence lexicon; MWE lexicon; WordNet 3.1 + extension; Proper Names; Ontology: SUMO + intermediate level; edge label: describes]
Applications
• Strong universal basis
  – a comprehensive wordnet: >200 000 lemmas resulting in ~285 000 LUs and ~210 000 synsets
  – one of the largest ever Polish dictionaries
• Modularly constructed toolkit
  – a layered architecture of large software systems
  – separate but linked layers
  – each layer based on a limited set of notions and principles, and exchangeable
• The core of the CLARIN-PL language technology infrastructure
Thank you!