![Page 1: LDMT MURI Data Collection and Linguistic Annotation November 2, 2012 Jason Baldridge, UT Austin Lori Levin, CMU.](https://reader035.fdocuments.net/reader035/viewer/2022081514/5697bfa61a28abf838c98362/html5/thumbnails/1.jpg)
LDMT MURI
Data Collection and Linguistic Annotation
November 2, 2012
Jason Baldridge, UT Austin
Lori Levin, CMU
Purpose
Collect and build data
• Monolingual text
• Bilingual text
• Linguistic annotations
to support work on machine translation for
• Kinyarwanda-English
• Malagasy-English
Kinyarwanda Data Resources
[Figure: diagram of Kinyarwanda data resources, labeled with word counts. Legend: 1.0 Release 02/11, 2.0 Release 10/11.]
Panels: BILINGUAL (285k); ENGLISH monolingual (huge); KINYARWANDA monolingual (7m); ENG treebank; ENG text; KIN text; KIN treebank.
Source labels, in extraction order: KGMC (270k), KGMC (225k); Pbook (0.9k), Pbook (0.7k); GWord (8b); PTB (1m); News (7m); KGMC (5.8k), KGMC (4.8k); BBC (0.3k), BBC (0.3k); IGT (0.1k), IGT (0.06k); Dict (9k), Dict (8k); KGMC (2.9k); KGMC (3.8k); BBC (0.3k), BBC (0.3k); IGT (0.06k), IGT (0.1k).
Kinyarwanda Data Resources
[Figure: the Kinyarwanda data resources diagram from the previous slide, with 3.0 Release 11/12 added to the legend (1.0 Release 02/11, 2.0 Release 10/11, 3.0 Release 11/12).]
Labels added on this slide: Part-of-speech (2k); GFL (4.7k); "Reviewed & improved" (twice). All other panels and word counts are unchanged from the previous slide.
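Malagasy Data Resources
[Figure: diagram of Malagasy data resources, labeled with word counts. Legend: 1.0 Release 02/11, 2.0 Release 10/11.]
Panels: BILINGUAL (732k); ENGLISH monolingual (huge); MALAGASY monolingual; ENG treebank; ENG text; MLG text; MLG treebank.
Source labels, in extraction order: Bible (730k), Bible (725k); News (2.1k), News (2.3k); Gword (8b); PTB (1m); News (2.1k); News (2.3k).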
Malagasy Data Resources
[Figure: the Malagasy data resources diagram from the previous slide, with 3.0 Release 11/12 added to the legend (1.0 Release 02/11, 2.0 Release 10/11, 3.0 Release 11/12).]
Labels added on this slide: News (2.1k), reviewed & improved; News (2.3k), reviewed & improved; Part-of-speech (2k); Global voices (1.8m); Global voices (1.9m); Leipzig (600k); Global voices GFL (3.7k); Dictionary (77.5k). All other panels and word counts are unchanged from the previous slide.
Malagasy Data Resources
• Year 1: 19th-century Malagasy Bible
• Year 2:
– Univ. of Leipzig Web Corpus
  • Monolingual Malagasy, very clean
– CMU Global Voices Archive
Malagasy Resources

| Corpus | Tokens | Types | Hapax |
|---|---:|---:|---:|
| Bible (Year 1) | 579,578 | 19,460 | 8,401 |
| Leipzig corpus (Year 2) | 618,282 | 41,462 | 23,659 |
| CMU Global Voices (Year 2) | 2,148,976 | 84,744 | 46,627 |
| Total | 3,346,836 | 115,172 | 62,517 |

Malagasy - English Resources

| Corpus | eng-Tokens | eng-Types | mlg-Tokens | mlg-Types |
|---|---:|---:|---:|---:|
| Bible (Year 1) | 584,872 | 13,084 | 579,578 | 19,460 |
| CMU Global Voices (Year 2) | 1,785,472 | 63,357 | 2,148,976 | 84,744 |
| Total | 2,370,344 | 67,790 | 3,346,836 | 115,172 |
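
The token, type, and hapax counts above can be reproduced in outline with a few lines of Python. This is a minimal sketch, not the project's own tooling: it assumes plain whitespace tokenization with no case folding, and the function name `corpus_stats` and the sample lines are hypothetical. The tokenizer actually used for the reported figures is not stated in the slides.

```python
from collections import Counter


def corpus_stats(lines):
    """Return (tokens, types, hapax) for an iterable of text lines.

    Assumption: tokens are whitespace-separated strings, counted as-is.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    tokens = sum(counts.values())                        # running words
    types = len(counts)                                  # distinct word forms
    hapax = sum(1 for c in counts.values() if c == 1)    # forms seen exactly once
    return tokens, types, hapax


# Hypothetical two-line sample, for illustration only.
sample = ["ny teny dia teny", "ny boky"]
print(corpus_stats(sample))  # (6, 4, 2)
```

Note that type counts do not add across corpora: the reported total of 115,172 types is smaller than the sum of the three per-corpus type counts (145,666), which is consistent with types being counted over the combined vocabulary.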
CMU Global Voices Corpus
• Domains include Twitter, blogs, news about popular democracy movements
• Actively published by volunteer translators
– We are gathering ~500k words / language / year of high-quality parallel data

| Corpus | eng-Tokens | eng-Types | mlg-Tokens | mlg-Types |
|---|---:|---:|---:|---:|
| Global Voices < Jun 2011 | 1,318,780 | 56,414 | 1,569,343 | 72,906 |
| Global Voices < Jun 2012 | 1,732,674 | 59,750 | 2,066,419 | 79,269 |

http://www.ark.cs.cmu.edu/global-voices/
Morphological analysis
• We decided against creating morphological gold-standard annotations from the output of finite-state transducers.
• Initially tried to use the XFST analyzer created by Dalrymple, Liakata and Mackie (2006).
– Quality of the output of the Dalrymple transducer was poor (ambiguous, many incorrect analyses).
• No existing Kinyarwanda transducer
– Any annotations would be subject to changing analyses during transducer development.
Morphological analysis (continued)
• Developed new transducers for both Kinyarwanda and Malagasy.
– Less ambiguity
– Cautious guessing for unknown stems => better precision
• Improvements driven by measuring ambiguity/coverage on data and effect on performance in other tasks.
• We may produce annotations once transducer development is deemed sufficient.
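
The ambiguity/coverage measurement mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the project's evaluation code: it assumes the transducer output has already been collected into a dict mapping each word to its list of analyses, it takes coverage to be the fraction of words with at least one analysis, and it takes ambiguity to be the mean number of analyses per analyzed word. The slides do not define either measure, and the function name and placeholder data are hypothetical.

```python
def coverage_and_ambiguity(analyses):
    """Return (coverage, ambiguity) for a dict: word -> list of analyses.

    coverage  = share of words that received at least one analysis
    ambiguity = average number of analyses among the analyzed words
    """
    total = len(analyses)
    analyzed = [a for a in analyses.values() if a]
    coverage = len(analyzed) / total if total else 0.0
    ambiguity = sum(len(a) for a in analyzed) / len(analyzed) if analyzed else 0.0
    return coverage, ambiguity


# Placeholder words and analyses, not real transducer output.
output = {
    "w1": ["analysis-a", "analysis-b"],
    "w2": ["analysis-a"],
    "w3": [],  # unknown to the analyzer
    "w4": ["analysis-a", "analysis-b", "analysis-c"],
}
print(coverage_and_ambiguity(output))  # (0.75, 2.0)
```

Under these definitions, cautious guessing for unknown stems trades some coverage for lower ambiguity, which is one way to read the "better precision" bullet.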
Syntactic annotations
• During the past year, we reviewed and revised the phrase structures annotated for kin and mlg texts.
– Analyses and labels made more consistent across languages
– Head annotations added to enable dependency parsing training/evaluation.
– All tokenization standardized.
• GFL annotations: 4k tokens each for kin and mlg
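
To illustrate why head annotations enable dependency parsing, here is a minimal sketch of the standard head-percolation conversion from a head-marked phrase structure tree to dependency arcs. The tree encoding (a leaf is a token string; an internal node is a `(label, head_child_index, children)` tuple), the function name, and the English example are assumptions made for illustration; the treebank's actual file format is not described in the slides.

```python
def to_dependencies(tree):
    """Convert a head-annotated tree to a list of head positions.

    Tokens are numbered left to right from 1. The result lists, for each
    token in order, the position of its head (0 marks the root).
    """
    arcs = {}
    counter = [0]

    def visit(node):
        # Returns the position of the lexical head of this subtree.
        if isinstance(node, str):
            counter[0] += 1
            return counter[0]
        _label, head_idx, children = node
        heads = [visit(child) for child in children]
        for i, h in enumerate(heads):
            if i != head_idx:
                # Each non-head child attaches to the head child's lexical head.
                arcs[h] = heads[head_idx]
        return heads[head_idx]

    root = visit(tree)
    arcs[root] = 0
    return [arcs[i] for i in range(1, counter[0] + 1)]


# Hypothetical example tree: the head child index follows each label.
tree = ("S", 1, [
    ("NP", 1, ["the", "dog"]),
    ("VP", 0, ["chased", ("NP", 1, ["a", "cat"])]),
])
print(to_dependencies(tree))  # [2, 3, 0, 5, 3]
```

In the output, "the" attaches to "dog", "dog" and "cat" attach to "chased", "a" attaches to "cat", and "chased" is the root.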
Data accomplishments
• Fieldwork on Kinyarwanda that informs theoretical linguistic work and transducers.
• New morphological transducers for kin and mlg.
• V 3.0 of monolingual, bilingual, and treebanked data for Kinyarwanda and Malagasy to be released this coming week.
– Order of magnitude more parallel data (mlg)
– Better & more syntactic data (kin/mlg)
Data accomplishments (continued)
• Evaluation
– Pilot annotations for linguistically targeted test suites
• Formal linguistic advances
– GFL specification and tools for annotation and visualization
– Abstract Meaning Representation (AMR): leverage ideas, data and tools from ISI as part of other synergistic projects.