Indian Language Initiatives at LDC

Denise DiPersio

Tamil Internet Conference 2011, Philadelphia, PA, 17 June 2011

Overview

Introduction to LDC

Tamil Projects/Resources

Indian Language Projects/Resources

LDC: Origin and Model

• Linguistic Data Consortium established in 1992
  • Via open, competitive government solicitation, won by U. Penn
  • Initial 5-year funding followed by self-sufficiency through membership fees, data licenses
  • Power of the collective
• Language resource distributor/archive
  • Centralized distribution, archiving, licensing
  • Resources from donations, funded projects, community initiatives, LDC initiatives
• Membership
  • Members support the consortium through fees, data, services
  • Ongoing rights to data published in membership years
  • Reduced fees on older corpora, extra copies

LDC: Roles

• Data collection
• Language resource (LR) production, including quality control
• LR distribution and archiving
• Intellectual property rights management and license management
• Human subjects protocol management
• Annotation, lexicon building
• Creation of tools, specifications, best practices
• Knowledge transfer: documentation, metadata, consulting, training
• Corpus creation research and academic publication
• Resource coordination in large multisite programs
• Serving multiple research communities
  • Funding panelists, workshop participants, oversight committee members

LDC: Data Collection

• News text
• Web text (newsgroups, blogs, chatrooms, Twitter)
• Biomedical texts and abstracts
• Printed, handwritten and hybrid documents
• Broadcast programming (news, conversation)
• Conversational telephone speech
• Lectures, meetings, interviews
• Read and prompted speech
• Role play
• Video (broadcast, web)
• Animal vocalizations

LDC: Annotation

• Data scouting, selection, triage
• Audio-audio alignment: bandwidth, signal quality, language, dialect, program, speaker
• Quick and careful transcription, aligned at turn, sentence, word level
• Phonetic, dialect, sociolinguistic feature, supralexical
• Tokenization, tagging of morphology, part-of-speech, gloss
• Syntactic, semantic, discourse functions, disfluency, sense disambiguation
• Identification/classification of entities, relations, events and coreference
• Translation, alignment of translated text
• Identification/classification of entities/events in video
• Document zoning

LDC: Distribution

• Since 1992, LDC has distributed
  • Nearly 75,000 copies of 1300 titles to more than 3000 organizations in over 65 countries
• Approximately 8000 scholars and research groups receive LDC’s monthly newsletter
• Non-exclusive distribution of donated data
• LDC research communities span human language technologies, computer science, social sciences
• Uniform licensing within and across research communities
• Stable infrastructure
  • LRs permanently accessible, ongoing access to data
  • Standardized, simple terms of use and distribution methods

LDC: Data Scholarships

• Formalizes LDC’s long practice of $0 distribution of data to students without the means to otherwise license it
• Competitive process
• Student submits application that contains:
  • Data set requested, proposed need and use of data
  • Description of research agenda
  • Demonstration of high probability of success for work
  • Letter of support from department chair/advisor including statement of financial need
• Two cycles completed; next will be Fall 2011
  • 16 recipients
  • Argentina, China, India, Indonesia, Mexico, UK, USA
  • ~USD 40,000 in data awarded

Tamil Projects: REFLEX/LCTL 1/3

• REFLEX-LCTL (Less Commonly Taught Languages)
  • Goal: to create human language technologies for the target languages, especially machine translation, information extraction
• Language selection criteria
  • Large population of native speakers
  • Relatively few language resources (electronic text, intentional difficulty variation in LR creation)
  • Linguistic and geographic diversity
  • Include some related languages
  • Make use of existing collaborations
• Thirteen languages: Amazigh (Berber), Bengali, Hungarian, Kurdish, Pashto, Panjabi, Tamil, Tagalog, Thai, Tigrinya, Urdu, Uzbek, Yoruba
  • Bengali, Panjabi, Urdu – related languages

Tamil Projects: REFLEX/LCTL 2/3

• LDC created language packs for each language consisting of
  • a monolingual news text corpus (500k words)
  • a parallel text corpus (250k words)
  • a lexicon (10k entries)
  • a grammatical sketch
  • an encoding converter
  • a sentence segmenter
  • a tokenizer
  • a name transliterator
  • a part of speech tagger and tagged text
  • a named entity tagger and tagged text
  • a morphological analyzer and tagged text

Tamil Projects: REFLEX/LCTL 3/3

• Resources identified through individual scouting, “Harvest Festivals”, native speakers
• Tamil Language Pack
  • Text sources included websites (for monolingual and parallel text)
  • Collaboration with Harold Schiffman, Vasu Renganathan
    • Tamil lexicon – An English Dictionary of the Tamil Verb
    • Consulted on encoding conversion
  • Project sponsor has not yet released pack for publication; potential use in ongoing technology evaluations
  • Will be published in LDC catalog when cleared for distribution

Tamil Projects: Language Resource Wiki

• Language Resource (LR) Wiki designed to be
  • Publicly accessible, world-readable
  • Portal of found resources “harvested” in REFLEX-LCTL project
  • Editable by authenticated others outside LDC
• Pages for seven languages, including Tamil
  • http://lrwiki.ldc.upenn.edu/mediawiki/index.php/Tamil/Tamil
  • Bengali, Berber, Panjabi, Pashto, Tagalog, Tamil, Urdu
  • Breton, Ewe pages in progress
  • Language summary, linguistic resources, encoding and fonts, data sources, portals, tools and other natural language processing resources

Tamil Projects: CALLFRIEND

• CALLFRIEND project supported the development of language identification technology
• LDC recruited native speakers in the target languages to make telephone calls to other native speakers
  • Calls were unscripted and lasted 5 to 30 minutes
  • Target languages: American English, Canadian French, Egyptian Arabic, Farsi, German, Hindi, Japanese, Korean, dialectal Mandarin Chinese, Spanish (Caribbean, non-Caribbean), Tamil, Vietnamese
• CALLFRIEND Tamil LDC96S59
  • 60 telephone conversations
  • Demographic data: sex, age, education
  • Call information: channel quality, number of speakers
  • Calls originated inside the continental United States and Canada

Tamil Resources

• An English Dictionary of the Tamil Verb, Second Edition, LDC2009L01
  • Harold Schiffman, Vasu Renganathan (U Penn, Department of South Asia Studies)
  • Translations for 6597 English verbs and definitions for 9716 Tamil verbs
  • Associated sound files for pronunciation; example sentences
  • Windows search and browse application
  • Complimentary copy in conference packet

Indian Language Projects/Resources: Hindi

• Hindi Surprise Language Exercise (2003)
  • Goal: to assemble found resources under timed conditions
  • LDC collected newswire, web data, some parallel text
  • Not all resources can be released due to intellectual property, license restraints
  • Further work needed for public release
• Hindi WordNet LDC2008L02
  • Joint distribution with IIT Bombay
  • First WordNet for an Indian language
• CALLFRIEND Hindi LDC96S52

Indian Language Resources: POS Tagsets

• Indian Language Part of Speech Tagsets (IL-POST)
  • Developed by Microsoft Research India; Anna University, Chennai; Delhi University; IIT Bombay; Jawaharlal Nehru University, Delhi; Tamil University, Tamilnadu
  • Goal: to provide a common tagset framework for Indian languages that offers flexibility, cross-linguistic compatibility and reusability across languages
• LDC currently distributes three IL-POST sets at no cost: Bengali, Hindi, Sanskrit
  • IL-POST Bengali LDC2010T16 – 103k words from web text, EMILLE corpus (parallel newswire)
  • IL-POST Hindi LDC2010T24 – 98k words from web text
  • IL-POST Sanskrit LDC2011T04 – 57k words from Panchatantra stories
• More languages planned, Tamil among them

LDC: Need to Know

• LDC website, http://www.ldc.upenn.edu/
• The LDC Corpus Catalog, http://www.ldc.upenn.edu/Catalog/
• Submitting Corpora and Other Resources to LDC, http://www.ldc.upenn.edu/Providing/
• LDC Online, https://online.ldc.upenn.edu/login.html
• Member Resources, http://www.ldc.upenn.edu/Membership/

Questions? Thank you!