PanLex
Panlingual Lexical Collaboration
Jonathan Pool
University of Washington Computational Linguistics Laboratory
22 April 2008
Pre-Demo
Task 1.
You encounter the word “प्रधानाध्यापक”.
What language is it in?
What does it mean?
Task 2.
You encounter the word list at http://www.geonames.de/peace.html.
Is its content already in PanLex?
If not, how can you contribute it?
PanLex (http://panlex.org/cgi-bin/panlex13.cgi)
Goals
Facilitate panlingual:
4. Vigor.
3. Discursive intertranslatability.
2. Lexical intertranslatability.
1. Lexical collaboration.
Goal 1: Facilitate panlingual lexical collaboration
How?
Strategy
1. Assemble valuable panlingual data.
2. Make the data accessible.
3. Invite contributions to the data.
4. Localize the interface panlingually.
Tactic 1: Assemble valuable panlingual data.
How?
Tactics
Borrow data from TransGraph.
Expression (lexeme) equivalences from 357 dictionaries.
13 multilingual, 344 bilingual.
1050 languages.
2.5 million expressions.
8 million expression tokens.
Accept (mainly) TransGraph’s lightweight schema.
An expression is just a string in a language.
A meaning is just a source-specific ID.
A denotation is just a source assigning a meaning to an expression.
A translation is just 2+ denotations with the same meaning.
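The four definitions above can be sketched as a small data model. This is a minimal illustration, not PanLex's or TransGraph's actual implementation: the class names, the `translations` helper, the source names "source-A" and "source-B", and the meaning IDs are all hypothetical. Because a meaning is source-specific, the sketch keys each meaning by the pair (source, meaning ID).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Expression:
    language: str  # language code, e.g. "eng"
    text: str      # the string itself


@dataclass(frozen=True)
class Denotation:
    source: str             # hypothetical source name
    meaning: int            # ID that is meaningful only within `source`
    expression: Expression


def translations(denotations, expression):
    """Return every expression sharing a (source, meaning) pair with `expression`."""
    by_meaning = defaultdict(set)
    for d in denotations:
        by_meaning[(d.source, d.meaning)].add(d.expression)
    found = set()
    for group in by_meaning.values():
        if expression in group:
            found |= group - {expression}
    return found


linguistics = Expression("eng", "linguistics")
# Illustrative data only; the sources and IDs are assumptions.
DEMO = [
    Denotation("source-A", 1, linguistics),
    Denotation("source-A", 1, Expression("ell", "γλωσσολογία")),
    Denotation("source-B", 1, linguistics),
    Denotation("source-B", 1, Expression("tur", "dilbilim")),
    Denotation("source-B", 1, Expression("est", "keeleteadus")),
    Denotation("source-B", 2, Expression("eng", "language")),
]

for e in sorted(translations(DEMO, linguistics), key=lambda e: e.language):
    print(e.language, e.text)
```

Both hypothetical sources use the ID 1, but the two IDs are unrelated, so under this schema the Greek expression is a translation of the English one only, not of the Turkish or Estonian ones.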
TransGraph Data
Example
eng: linguistics
ell: γλωσσολογία
tur: dilbilim
est: keeleteadus
6290415713
Tactic 2: Make the data accessible.
How?
Tactics
Open-source (PostgreSQL) database (vs. TransGraph).
Perl CGI-DBI application to query and modify the data.
Domain "panlex.org" to access the application.
All data exposed (vs. PanImages).
Data retrievable interactively and by plain-text or XML file export.
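As a rough illustration of the two export formats, the sketch below writes the same denotations as tab-separated plain text and as XML. The column order, the element and attribute names, and the sample rows are assumptions made for this example; the slides do not specify PanLex's export formats.

```python
import xml.etree.ElementTree as ET

# Each row: (source, meaning ID, language code, expression text). Illustrative only.
ROWS = [
    ("source-A", 1, "eng", "linguistics"),
    ("source-A", 1, "ell", "γλωσσολογία"),
]


def to_plain_text(rows):
    """Return one tab-separated line per denotation (hypothetical layout)."""
    return "".join(f"{src}\t{mid}\t{lang}\t{text}\n" for src, mid, lang, text in rows)


def to_xml(rows):
    """Return the denotations as an XML string (hypothetical element names)."""
    root = ET.Element("denotations")
    for src, mid, lang, text in rows:
        el = ET.SubElement(root, "denotation", source=src, meaning=str(mid), lang=lang)
        el.text = text
    return ET.tostring(root, encoding="unicode")


print(to_plain_text(ROWS), end="")
print(to_xml(ROWS))
```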
Tactic 3: Invite contributions to the data.
How?
Tactics
User contributions nondestructive.
Not a Wiki, not moderated.
Contributable data:
[Language varieties (vs. TransGraph languages).]
Expressions.
Sources.
Denotations.
Contribution modes:
Batch (file upload; plain-text or XML).
Incremental (interactive editing).
Tactic 4: Localize the interface panlingually.
How?
Tactics
In vivo localization.
Interface entirely lemmatic.
Therefore, PanLex can translate the interface.
Translation core: developer-attested translations.
Translation periphery: election with sources voting.
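The core and periphery rule can be sketched as follows. This is one possible reading, not the actual PanLex procedure: a developer-attested translation always wins, and otherwise each source casts one vote per candidate it attests. The function name, the data layout, the alphabetical tie-break, and the fallback to the untranslated lemma are all assumptions.

```python
from collections import Counter


def localize(lemma, target, core, votes):
    """Pick a translation of an interface lemma into the target language."""
    # Core: developer-attested translations take precedence.
    if (lemma, target) in core:
        return core[(lemma, target)]
    # Periphery: count each (source, candidate) pair once.
    ballots = {
        (source, candidate)
        for source, lem, lang, candidate in votes
        if lem == lemma and lang == target
    }
    tally = Counter(candidate for _, candidate in ballots)
    if not tally:
        return lemma  # assumed fallback: show the lemma untranslated
    best = max(tally.values())
    # Assumed tie-break: alphabetically first among the top candidates.
    return min(c for c, n in tally.items() if n == best)


# Illustrative data only; the source names are hypothetical.
CORE = {("language", "epo"): "lingvo"}
VOTES = [
    ("source-A", "language", "deu", "Sprache"),
    ("source-B", "language", "deu", "Sprache"),
    ("source-B", "language", "deu", "Sprache"),  # repeat entry, counted once
    ("source-C", "language", "deu", "Zunge"),
]

print(localize("language", "deu", CORE, VOTES))
print(localize("language", "epo", CORE, VOTES))
```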
Evaluation
Test 1 (expert user):
15 query and modification tasks with test questions.
Failures and comments inspired interface changes.
Test 2 (expert user):
Found, formatted, checked, and uploaded data from:
Nepali-Esperanto dictionary.
English-Yiddish dictionary.
Eight-language medical glossary.
Future Work
Coverage
Add dictionaries.
Recruit user-added dictionaries.
Add source types:
Thesauri.
WordNets.
Library subject headings.
Locale repositories.
Monolingual resources.
Export additions to TransGraph.
[Figure: three bar charts giving a count for each of 100 language varieties, labeled by language code and name. Asterisks mark eng, fra, deu, tur, epo, rus, and nob.]
[Chart 1, axis 0 to 450,000: 22 varieties with counts from 24,495 to 428,550 (eng, hun, fra, deu, ces, hrv, tur, spa, est, ita, epo, ara, fin, jpn, nld, por, rus, bre, srp, ron, kur, swe).]
[Chart 2, axis 0 to 24,000: 39 varieties with counts from 4,798 to 20,513 (isl, sqi, pol, nob, nci, nah, bel, cat, lat, gle, dan, oci, plt, tuk, slk, slv, oji, chy, lij, zho, eus, frp, ell, glv, glg, cym, mlt, art, afr, nep, yid, heb, kor, ltz, ang, ind, fao, aym, pap).]
[Chart 3, axis 0 to 5,000: 39 varieties with counts from 1,525 to 4,747 (lit, hbs, bul, gla, yua, yor, fro, quz, cor, pqm, fry, nds, vie, hmn, qul, ido, lav, bos, tel, roh, ina, urd, ary, ukr, kab, pcd, fas, tgl, cos, got, mly, tpi, swa, msa, tha, nap, pes, hin, qus).]
Future Work
Features
More query functions.
User SQL entry.
Usability
Test and improve interface.
Non-expert interface.
Standards
Lemmatic forms (e.g., English “to”).
Multiword lexemes.