Evaluating word sketches and corpora
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds and Sussex
Adam Kilgarriff, IWSG-1, 2010
Word sketches
Over 10 years
• Since 1999
Feedback
• Good but anecdotal
Formal evaluation
Goal
Collocations dictionary
• Model: Oxford Collocations Dictionary
• Publication-quality
Ask a lexicographer
• For 42 headwords
• For the 20 best collocates per headword: “should we include this collocation in a published dictionary?”
Sample of headwords
Nouns, verbs, adjectives, random
High (Top 3000)
• N: space, solution, opinion, mass, corporation, leader
• V: serve, incorporate, mix, desire
• Adj: high, detailed, open, academic
Mid (3000-9999)
• N: cattle, repayment, fundraising, elder, biologist, sanitation
• V: grieve, classify, ascertain, implant
• Adj: adjacent, eldest, prolific, ill
Low (10,000-30,000)
• N: predicament, adulterer, bake, bombshell, candy, shellfish
• V: slap, outgrow, plow, traipse
• Adj: neoclassical, votive, adulterous, expandable
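The sample above has 6 nouns, 4 verbs and 4 adjectives in each of three frequency bands, 42 headwords in all. A minimal sketch of how such a stratified random sample could be drawn is given below. The function name, the input format (a frequency-ranked list of lemma and part-of-speech pairs) and the exact treatment of the band boundary at rank 3000 are assumptions for illustration, not details taken from the talk.

```python
import random

# Frequency bands as (label, first rank, last rank), with 1-based ranks.
# Assumption: rank 3000 is placed in the Mid band.
BANDS = [("High", 1, 2999), ("Mid", 3000, 9999), ("Low", 10000, 30000)]

# Headwords per part of speech in each band, as counted on the slide.
QUOTA = {"N": 6, "V": 4, "Adj": 4}


def sample_headwords(ranked, seed=0):
    """Draw a stratified random sample of headwords.

    ranked: list of (lemma, pos) pairs ordered by descending frequency
            (hypothetical input format).
    Returns a dict mapping (band label, pos) to a list of sampled lemmas.
    """
    rng = random.Random(seed)
    sample = {}
    for label, first, last in BANDS:
        band = ranked[first - 1:last]  # convert 1-based ranks to a slice
        for pos, k in QUOTA.items():
            candidates = [lemma for lemma, p in band if p == pos]
            # random.sample raises ValueError if a band has fewer than
            # k candidates for this part of speech.
            sample[(label, pos)] = rng.sample(candidates, k)
    return sample
```

Fixing the seed makes the sample reproducible, which matters if the same headwords are to be reused across corpora and languages.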
Precision and recall
We test precision
Recall is harder
• How do we find all the collocations that the system should have found?
Current work
• 200 collocates per headword
• Selected from
  • All the corpora we have
  • Various parameter settings
• Plus just-in-time evaluation for 'new' collocates
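One way to read the current-work plan is as a pooling approach: candidates from every corpus and parameter setting are merged into one pool per headword, the pool is judged once, and recall is then estimated against the judged-good items. The sketch below illustrates that reading only; the round-robin merging, the function names and the data shapes are assumptions, and the talk does not say how the 200 collocates are actually selected.

```python
def build_pool(runs, pool_size=200):
    """Merge ranked candidate lists into one pool for a headword.

    runs: list of ranked collocate lists, one per (corpus, setting).
    Takes items round-robin by rank so every run contributes its top
    candidates, skipping duplicates, until pool_size is reached.
    """
    pool, seen = [], set()
    max_len = max((len(run) for run in runs), default=0)
    depth = 0
    while len(pool) < pool_size and depth < max_len:
        for run in runs:
            if depth < len(run) and run[depth] not in seen:
                seen.add(run[depth])
                pool.append(run[depth])
                if len(pool) == pool_size:
                    break
        depth += 1
    return pool


def pooled_recall(system_output, judged_good):
    """Share of the judged-good pool items that the system returned."""
    if not judged_good:
        return None
    return len(set(system_output) & set(judged_good)) / len(judged_good)


def needs_judging(system_output, judgements):
    """Collocates with no judgement yet ('new' collocates), which would
    have to be judged just in time before scoring."""
    return [c for c in system_output if c not in judgements]
```

Recall measured this way is relative to the pool: a good collocation that no corpus or setting proposed is never counted as missed.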
Four languages, three families
Dutch: ANW, 102m-word lexicographic corpus
English: UKWaC, 1.5b-word web corpus
Japanese: JpWaC, 400m-word web corpus
Slovene: FidaPlus, 620m-word lexicographic corpus
User evaluation
Evaluate whole system
• Will it help with my task?
  • E.g. preparing a collocations dictionary
Contrast: developer evaluation
• Can I make the system better?
• Evaluate each module separately
• Current work
Components
Grammar
NLP tools
• Segmenter, lemmatiser, POS-tagger
Sketch grammar
Statistics
Practicalities
Interface
• Good, Good-but
  • Merge to good
• Maybe, Maybe-specialised, Bad
  • Merge to bad
For each language
• Two/three linguists/lexicographers
• If they disagree
  • Don't use for computing performance
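The scoring procedure described above can be sketched as follows: merge the five interface labels into good or bad, drop every collocation on which the judges disagree, and report the share of good items among the rest. The data layout and function names are hypothetical, and checking for disagreement after merging (rather than on the raw labels) is an assumption, since the slide does not specify the order.

```python
GOOD_LABELS = {"Good", "Good-but"}
BAD_LABELS = {"Maybe", "Maybe-specialised", "Bad"}


def merge(label):
    """Collapse an interface label into 'good' or 'bad'."""
    if label in GOOD_LABELS:
        return "good"
    if label in BAD_LABELS:
        return "bad"
    raise ValueError(f"unknown label: {label!r}")


def precision(judgements):
    """Precision over the collocations on which all judges agree.

    judgements: dict mapping (headword, collocate) to a list of labels,
                one per judge (hypothetical layout).
    Returns None if no collocation has an agreed judgement.
    """
    good = bad = 0
    for labels in judgements.values():
        merged = {merge(label) for label in labels}
        if len(merged) != 1:
            continue  # judges disagree after merging: excluded from scoring
        if merged == {"good"}:
            good += 1
        else:
            bad += 1
    total = good + bad
    return good / total if total else None
```

For example, with made-up judgements where two collocations are agreed good, one is agreed bad and one is disputed, the disputed item is ignored and the function returns 2/3.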
Results
Dutch: 66%
English: 71%
Japanese: 87%
Slovene: 71%
Corpus evaluation
Collocation-finding
• Typical corpus task
Recall
Hold all else constant
• Statistic, NLP tools, grammar
Best results: best corpus
• (for collocation-finding)
Pomikalek: de-duplication