The Human Language Project: Uniting Computational Linguistics with Documentary Linguistics
Steven Bird
University of Melbourne & University of Pennsylvania
Fieldwork as a Computational Problem
• three data types
• three kinds of metadata
• relations
• computational challenge
• http://www.ldc.upenn.edu/sb/fieldwork/
• this isn't computational linguistics
Convergences
• concern with data
• use of speech data
• bilingual text
Convergences: Bitext + morph = IGT
• bilingual text
• morphologically analyzed text
• comparative wordlists
• bilingual lexicons
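One way to picture "Bitext + morph = IGT" is as a record that joins a morphologically analyzed source line with its free translation. The sketch below is only an illustration: the class name `IGTLine`, its fields, and the sample forms are assumptions made for this example (the forms are invented and do not come from any real language), not a format proposed in the talk.

```python
from dataclasses import dataclass


@dataclass
class IGTLine:
    """One line of interlinear glossed text (hypothetical layout)."""
    morphemes: list[str]   # source sentence, split into morphemes
    glosses: list[str]     # one gloss per morpheme
    translation: str       # free translation into the contact language

    def __post_init__(self):
        # The interlinear alignment only makes sense if every
        # morpheme has exactly one gloss.
        if len(self.morphemes) != len(self.glosses):
            raise ValueError("each morpheme needs exactly one gloss")

    def render(self) -> str:
        """Return the three-line display with aligned columns."""
        widths = [max(len(m), len(g))
                  for m, g in zip(self.morphemes, self.glosses)]
        top = " ".join(m.ljust(w) for m, w in zip(self.morphemes, widths))
        mid = " ".join(g.ljust(w) for g, w in zip(self.glosses, widths))
        return "\n".join([top.rstrip(), mid.rstrip(), f"'{self.translation}'"])


# Invented placeholder forms, for illustration only.
line = IGTLine(["ka", "tam", "i"], ["1SG", "see", "PST"], "I saw")
print(line.render())
```

The morpheme and gloss tiers carry the "morph" half, and the translation field carries the "bitext" half; keeping them in one record is what makes the line interlinear.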
Documentary and Descriptive Linguistics
Nikolaus Himmelmann (1998) "Documentary and Descriptive Linguistics" Linguistics 36:161-195
Documentation types: Interlinear text
Guwamu, Peter Austin (2010)
Documentation types: Lexicons
Kröger, F. Buli-English dictionary: With an Introductory Grammar and an Index. Münster: Lit, 1992.
Documentary and Descriptive Linguistics: Use of Computation
Nikolaus Himmelmann (1998) "Documentary and Descriptive Linguistics" Linguistics 36:161-195
• documentarists
• innovation, tool development
• descriptivists
• Evans, Hyman
Karaim CD-ROM
Eva Csato and David Nathan
Nathan, D. (1998) The spoken Karaim CD: Sound, text, lexicon and "active morphology" for language learning multimedia, Proceedings of the Ninth Annual Conference on Turkish Linguistics.
Where's the science?
After years of neglect in which linguistics lost sight of the value of empirical field research, new life has finally been breathed into this fundamentally important component of our discipline. But in the process, linguistic fieldwork has ironically lost sight of linguistics! That is, if by linguistics one means the scientific study of language, fieldwork ideology and practice have gone askew. The major movements and individuals that we can thank for the resurgence of interest in linguistic fieldwork all promote (in words or deeds) approaches to field research that fall far short of the tenets of science. Examples of such misguided directions include (a) the endangered languages movement, (b) language documentation, and (c) the "Dixon school". In my talk, I expose the failings of these non-scientific approaches to linguistic field research and set out what would be required for linguistic fieldwork to qualify as truly scientific and thus be entitled to recognition as an essential subfield within linguistics per se.
Paul Newman -- Linguistic Fieldwork as a Scientific Enterprise, International Conference on Language
Key Questions
• What does computational linguistics offer to the problem of documenting and describing the world's languages?
• How can CL help improve the descriptive value of language documentation?
• three places where this might happen
Basic Oral Language Documentation
Pilot project: Synopsis of 1 week in Moife
1. Discussions re orthography, literacy
2. training, practice, listening, tone orthography experiment
3. training in oral transcription and translation; gave out recorders
4. re-assigned recorders
5. (Saturday)
6. oral transcription, vitality survey, orthography recommendations
7. more oral transcription
Pilot project
Main Phase
Preparation
• Batteries
• Date
• Identifiers
Training
Basic Oral Language Documentation: Overview of one week's activity...
Oral Annotation Protocol
Transcription
Cross Checking
Evaluation
• What is the quality of the collected materials?
• Can we correctly establish the phonemic inventory of the language from the recorded materials?
• What semantic domains are covered?
• What can trained linguists get from the raw transcripts?
Back to the computational questions...
Axioms
• Limited funding, but costs for local participation are negligible
• Cannot assume continuous presence of a linguist: primary collection work is "unsupervised"
• Cannot assume an orthography
• Can give training in documentation, but not description
• Contact language has every conceivable resource
• No time limit
Transcription
• contact-language orthography: issues with normalisation
• lexical inventory, diphone inventory
• sense tagging
• multiple instances of one story
• ASR?
• resegmentation
• active learning in interlinear text glossing
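As a rough illustration of the "lexical inventory, diphone inventory" bullet, the sketch below counts word forms and adjacent symbol pairs in a set of transcripts. It rests on a strong assumption that is unlikely to hold in practice: that each character of the contact-language orthography stands for exactly one phone. The function names and the sample transcripts are hypothetical.

```python
from collections import Counter


def lexical_inventory(transcripts):
    """Count word forms across transcripts (lowercased, whitespace-split)."""
    counts = Counter()
    for text in transcripts:
        counts.update(text.lower().split())
    return counts


def diphone_inventory(transcripts):
    """Count adjacent symbol pairs inside each word.

    Assumption: one character equals one phone. Digraphs, tone marks
    and inconsistent spelling would all need normalisation first.
    """
    counts = Counter()
    for text in transcripts:
        for word in text.lower().split():
            counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


# Invented transcripts, for illustration only.
sample = ["tala ta", "Tala"]
print(lexical_inventory(sample).most_common())
print(diphone_inventory(sample).most_common())
```

Pairs that never occur, or occur only once, are the interesting output here: they point at gaps in coverage or at spelling variants that the normalisation step has not yet merged.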
MT to help with eliciting morphology?
• problems with recording and translating isolated words
• short complete sentences with translations
• fix nouns and vary the form of the verb?
• bilingual texts as the key means a user would train the system
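The idea of fixing the nouns and varying the verb can be sketched as a small prompt generator for the contact language. This is only a sketch under assumptions: the function name `elicitation_prompts`, the use of English as the contact language, and the sample verb forms are all chosen for illustration.

```python
def elicitation_prompts(subject, obj, verb_forms):
    """Build one short contact-language sentence per verb form.

    The nouns stay fixed so that differences between the elicited
    translations can be attributed to the verb.
    """
    return [f"{subject} {verb} {obj}." for verb in verb_forms]


# Hypothetical prompt set; a speaker would translate each sentence.
prompts = elicitation_prompts(
    "The woman", "the fish",
    ["cooks", "cooked", "will cook", "is cooking"],
)
for p in prompts:
    print(p)
```

Each prompt is a short complete sentence, which sidesteps the problems noted above with recording and translating isolated words.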
MT as the measure of adequacy?
• inspect MT output to see what is lost
• supply a corrected version when it gets something wrong
• supply other examples, much as you would do with a child
Data mining
Bird (1999) Multidimensional exploration of online linguistic field data. NELS 29: 33-50.