E-MELD Conference, 15-18 July 2004 1
Fixing a Legacy Lexicon
Mike Maxwell [email protected]
University of Pennsylvania
Linguistic Data Consortium and Department of Linguistics
3600 Market Street, Philadelphia, PA 19104 U.S.A.
www.ldc.upenn.edu
The Problem
• Shoebox lexicon of Mawukakan
– Inconsistencies:
» Inconsistencies among POSs etc. (fixable in Shoebox)
» Spelling errors: English, French and Mawu (import into Word, use English and French spell correctors)
» Errors in hierarchy: missing fields, mis-ordered fields
» Missing reciprocal cross-references
• Absolutely typical of Shoebox-style lexicons
• Repairs needed for
– Archiving
– Publication
– Export/import
Old Solution
• Parse until error, characterize error, find error in Shoebox, fix error…
• Find all errors, send list to user, user fixes them, re-do…
Partial solutions
• Inconsistencies among POSs etc.
– Fixable in Shoebox
– Helpful addition: counts of POS tokens
• Spelling errors
– Import into Word with automatic marking of language, use English and French spell correctors to fix errors, export back to Shoebox
– No solution for Mawu spelling (n-grams)
• Missing cross-references
– Easy to find with shell script, send list to users
– Would be better to mark errors in lexicon
• Missing bi-directional references
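Two of the checks above (POS token counts and missing reciprocal cross-references) can be sketched in a few lines of Python. This is only an illustration, not the shell script mentioned on the slide. The markers \w and \pos come from the examples in this talk; the cross-reference marker \cf and the headwords in the sample are hypothetical, and the sketch assumes one field per line.

```python
import re
from collections import Counter

def parse_records(text):
    """Split SFM text into records; each record is a list of (marker, value) pairs."""
    records = []
    for line in text.splitlines():
        m = re.match(r"\\(\S+)\s*(.*)", line)
        if not m:
            continue  # this sketch ignores continuation lines
        marker, value = m.group(1), m.group(2).strip()
        if marker == "w" or not records:
            records.append([])  # \w starts a new record
        records[-1].append((marker, value))
    return records

def pos_counts(records):
    """Count each distinct \\pos value; rare values are likely typos or variants."""
    return Counter(v for rec in records for m, v in rec if m == "pos")

def missing_reciprocals(records, xref="cf"):
    """Return (source, target) pairs where target does not refer back to source."""
    refs = {}
    for rec in records:
        head = dict(rec).get("w")
        refs.setdefault(head, set()).update(v for m, v in rec if m == xref)
    return sorted((a, b) for a, targets in refs.items()
                  for b in targets if a not in refs.get(b, set()))

# Hypothetical sample: "gamma" points to "alpha", but "alpha" does not point back.
SAMPLE = r"""\w alpha
\pos n
\cf beta
\w beta
\pos N
\cf alpha
\w gamma
\pos n
\cf alpha
"""

records = parse_records(SAMPLE)
print(pos_counts(records))
print(missing_reciprocals(records))
```

A target that has no record at all is reported the same way as a target that lacks the back-reference; a fuller tool would probably distinguish the two cases.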
Partial solutions
• Errors in hierarchy
\w ba’el
\pos v.i
\ex Yax bo’on ta sna Antonio.
\exEn I’m going to Antonio’s house.|
\ex Ban yax ba’at?
\exEn Where are you going?
\exFr Ou allez-vous?
Repairing the hierarchy
• Solution: special purpose parser, mark SFM file with errors and suggested fixes
• Need hierarchy
– Cannot (reliably) extract hierarchy from Shoebox typ file
• User or consultant must provide definition of hierarchy, as regex:
(w ( (pos defn (ex exEn exFr)* (syn)?) | (num pos defn (ex exEn exFr)* (syn)?)+ ))
– Tool to extract a list of all occurring record/field patterns
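One way to read the hierarchy definition is as a regular expression over the sequence of field markers in a record. The Python sketch below takes that reading: it compiles the definition from the slide into a regex, checks a record's marker sequence against it, and tallies every marker pattern that occurs. It is not the special-purpose parser described in this talk, which also suggests fixes; the function names and the placeholder \defn values are assumptions made for illustration.

```python
import re
from collections import Counter

# Hierarchy definition copied from the slide.
HIERARCHY = ("(w ( (pos defn (ex exEn exFr)* (syn)?) | "
             "(num pos defn (ex exEn exFr)* (syn)?)+ ))")

def compile_hierarchy(spec):
    """Turn the marker-level definition into a regex over 'w pos defn ' strings."""
    parts = []
    for token in re.findall(r"\w+|[()|*+?]", spec):
        if token in "()|*+?":
            parts.append(token)  # grouping and repetition operators pass through
        else:
            parts.append("(?:" + re.escape(token) + " )")  # marker name plus a space
    return re.compile("".join(parts))

def markers(record):
    """List the field markers of one record, in order of appearance."""
    return [m.group(1) for m in re.finditer(r"^\\(\S+)", record, re.MULTILINE)]

def is_well_formed(record, hierarchy):
    """True if the record's marker sequence matches the whole hierarchy regex."""
    sequence = "".join(name + " " for name in markers(record))
    return hierarchy.fullmatch(sequence) is not None

def pattern_inventory(records):
    """Count every distinct marker sequence that occurs in the lexicon."""
    return Counter(" ".join(markers(r)) for r in records)

# Sample records; the \defn lines are placeholders added for this sketch.
GOOD = r"""\w ba’el
\pos v.i
\defn (placeholder)
\ex Ban yax ba’at?
\exEn Where are you going?
\exFr Ou allez-vous?
"""
BAD = r"""\w ba’el
\pos v.i
\defn (placeholder)
\ex Yax bo’on ta sna Antonio.
\exEn I’m going to Antonio’s house.
\ex Ban yax ba’at?
\exEn Where are you going?
\exFr Ou allez-vous?
"""

hierarchy = compile_hierarchy(HIERARCHY)
for record in (GOOD, BAD):
    print(is_well_formed(record, hierarchy), " ".join(markers(record)))
```

A yes/no match like this only flags a record as malformed. Locating the error inside the record and proposing a repair takes more than a plain regex match.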
Sample output
• regex
… (ex exEn exFr)* …
• Input
…
\ex Yax bo’on ta sna Antonio.
\exEn I’m going to Antonio’s house.|
\ex Ban yax ba’at?
\exEn Where are you going?
\exFr Ou allez-vous?
• Output:
…
\ex Yax bo’on ta sna Antonio.
\exEn I’m going to Antonio’s house.|
\exFr ***Missing field inserted***
\ex Ban yax ba’at?
\exEn Where are you going?
\exFr Ou allez-vous?
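The repair shown above can be imitated for the single group (ex exEn exFr) with a small state machine. The Python sketch below is a simplified stand-in for the real parser, which works from the full hierarchy definition. The function name is hypothetical, and the sketch assumes one field per line and a group that always begins with \ex.

```python
GROUP = ["ex", "exEn", "exFr"]          # the repeating group from the regex
NOTE = "***Missing field inserted***"   # marking text taken from the sample output

def repair_examples(lines):
    """Insert a marked placeholder wherever an (ex exEn exFr) group is cut short."""
    out = []
    expected = 0  # index in GROUP of the next field we expect; 0 means no open group
    for line in lines:
        if not line.startswith("\\"):
            out.append(line)  # continuation line: copy unchanged
            continue
        parts = line[1:].split(None, 1)
        marker = parts[0] if parts else ""
        if expected and marker != GROUP[expected]:
            # The open group ended early: fill in the fields it still lacks.
            out.extend("\\" + name + " " + NOTE for name in GROUP[expected:])
            expected = 0
        if marker == GROUP[expected]:
            expected = (expected + 1) % len(GROUP)
        out.append(line)
    if expected:  # the record ended in the middle of a group
        out.extend("\\" + name + " " + NOTE for name in GROUP[expected:])
    return out

record = [
    "\\ex Yax bo’on ta sna Antonio.",
    "\\exEn I’m going to Antonio’s house.",
    "\\ex Ban yax ba’at?",
    "\\exEn Where are you going?",
    "\\exFr Ou allez-vous?",
]
print("\n".join(repair_examples(record)))
```

A stray \exEn or \exFr with no preceding \ex is left untouched here; handling that case would need the erroneous-field marking rather than an insertion.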
More sample output
• Input
\w yax
\pos AUX-V
\pos Adj
\defn green
• Output
\w yax
\pos AUX-V
\pos Adj ***Erroneous field***
\defn green
More sample output
• Input
\w yax
\pos AUX-V
\foo bar
\degn green
• Output
\w yax
\error ***Unable to parse record structure***
\pos AUX-V
\foo bar
\degn green
The next language
• Nahuatl lexicon
– 11,000 entries
– 5000 record/field patterns
– 147 SFMs…