In association with CIIL-Mysore, IIT-Mumbai, IIIT-Hyderabad 1st International Conference.


Page 1:

In association with CIIL-Mysore, IIT-Mumbai, IIIT-Hyderabad

1st International Conference

Page 2:

Words unite people. Words can divide nations; they indulge in ‘war of words’…

Word-smiths fashion texts

Word-mongers talk nineteen to the dozen

Word-lords don’t tell you that they ‘double-speak’

Word-poets open the inner abyss of lanes & bye-lanes of meaning

And so do WordNets

Which is why we are all here!

Page 3:

Welcome to the 1st Global WordNet Conference

MY ADDRESS HAS TWO PARTS

First, I shall tell you a little about what the Indian linguistic scene is like, and what we at CIIL have been doing.

Then, we will offer our suggestions on what we in India could do in WordNet.

Page 4:

CENTRAL INSTITUTE OF INDIAN LANGUAGES

Bharatiya Bhasha Sansthan, Department of Education, Government of India

Initiatives in LANGUAGE TECHNOLOGY

Page 5:

CIIL in the first three decades: Equipping language teachers and analysts technologically

Page 6:

1. An Apex Institution under Languages Division, MHRD

CIIL completed 32 years in July 2001. This 287-person institution works for the development of Indian languages.

CIIL has five Centers with Research Groups (16) and Service Groups (6).

7 Regional Language Centers are at Bhubaneswar, Guwahati, Lucknow, Mysore, Patiala, Pune & Solan.

Page 7:

2. Four Main Objectives

1. Develops languages by creating content, corpus, techniques and technologies.

2. Protects & documents minority & tribal languages.

3. Creates linguistic harmony by teaching 15 Indian tongues to non-native learners.

4. Above all, advises both Central and State governments on matters related to language.

Page 8:

3. Functionality and Multi-disciplinarity

Although the mainstay is Indian Languages & Linguistics, the focus of all projects and programmes is on developing materials & products in print, audio, video and computational form.

In addition, there is enough interest in Comp. Lit, Education, Language Technology & NLP, Folklore, Geography, Statistics, Psychology, Sociology & Translation.

Page 9:

4. Coverage of CIIL - sizable

Archived 118 lgs data

Creating Voice Corpora

Studied 80 Tribal lgs

35 grammars on-line soon

Published 490 books

Cassette Courses in: Assamese, Urdu, Bengali, Kashmiri & Marathi

Radio courses in Hindi through Kannada

Page 10:

5. Major Publications: 490+ books, all produced in-house

22 Grammars

30 Intensive Courses

24 2nd Lg Textbooks

5 Common Vocab.

18 Dictionaries

49 Apni Boli (KVS)

15 Pictorial Glossaries

16 Literacy Books

12 Folklore

9 Bibliographies

12 Rhymes/Lg Games

16 Proceedings

Page 11:

6. The Challenge before CIIL: Enormous

The Gigantic World of Indian Languages

Page 12:

A truly plural world of languages

1,576 rationalized mother-tongues;

1,796 other mother-tongues;

114 languages with 10,000+ speakers;

Large variation: Hindi (337 m) to Maram of Manipur with 10,144;

Large non-scheduled lgs: Bhili (6 m) and Santali (5 m);

146 radio lgs / 69 school lgs / 35 lg dailies.

Page 13:

7. Programs - Modes of Delivery

10 months L2 teaching: 8000 teachers trained

Distance Courses in Tamil/Telugu/Bengali/Urdu

On-line Programs in 15 Indian languages

Kannada for officials in Karnataka

Radio courses with AIR’s collaboration

3-months Courses in Communication

Orientation for Mother-tongue teachers

Refresher Courses in Linguistics

NLP Training modules

Page 14:

8. Language Technology: Further Goals

Enlargement of 3-million word Corpora: 100 m word corpora for Hindi-Urdu

Multilingual multidirectional E-Dictionaries

On-line Administrative Glossaries

Lexical databases for MT Programs

Tagging & Corpus Tools

E-Zines and E-Journals

Language Information Services

Anukriti: Web-based Translation services

Page 15:

9. Indian Lgs & IT at CIIL

132-node LAN set up

V-SAT through STPI

Browsing centre

Has 2400 E-Journals & 350 paper journals.

Collaborating with Schoolnet for electronic materials

New generation Lg Labs

Focus: Visual Phonetics

Page 16:

10. LIS-India Website

Type Language Name:

Type Area Name:

Home or http://www.ciil.org/

General Information

Language/Area Profile: Geolinguistic; Sociolinguistic; Cultural; Literary

Language/Area History: Genealogical; Archaeological; Cultural; Textual

Language Vitality: Attitudinal; Utilitarian; Socio-political; Referential

Grammatical Information: Phonetic; Graphemic; Phonological; Morphological; Lexical; Syntactic; Semantic; Stylistic

Biblio search

Page 17:

11. Anukriti A Translation with NBT/SA

Web-based service site called ANUKRITI.

To be maintained with NBT/Sahitya Akademi

E-journals

Technological Tools

Electronic lexicon

Corpus & tools

Parallel corpora

Cultural Glossaries

Thesauri

Word finders

WordNets

Page 18:

12. Bhasha Bharati Project

To be set up in collaboration with

Sahitya Akademi

Sangeet Natak Academy

All India Radio

Doordarshan

National Library

National Archive

National Book Trust

Major TV Channels

Films Division

Major Newspaper houses

Numerous Foundations

Individual writers

Heirs of writers

Personal libraries

Little magazines

This rich manuscriptorium will display the plural literary and linguistic landscape of India.

Page 19:

13. Doctoral Programs under planning

Already available through 22 Universities:

Linguistics & Psychology

Now being planned in

NLP

Folklore/Communication

Translation

Indian Gram. Tradition

Page 20:

14. Future Programs

Dip in Experimental Phonetics

Masters by Research in Field Linguistics

Courses in Statistical Linguistics

Diploma in Translation Studies

Dip in Folklore/Comp. Lit. & Semiotics

Internship in Linguistic Geography

Internship in NLP & Corpus Linguistics

Page 21:

WHAT COULD WE DO TO CREATE

AN

Page 22:

India has already had a strong lexicographical tradition

Working on WordNet, therefore, should come naturally to us.

Efforts have already begun, as we see in Hindi, Tamil, Oriya and a few other languages.

There does not seem to be any academic coordination, however.

Early 20th-century Indian linguistics was dominated by studies on sound-systems and etymologies

Mid-20th C focussed on word-formation patterns

Late 20th C emphasized syntax

Page 23:

We haven’t so far worked seriously on Lexical Semantics

While Sociolinguistics was a favourite, serious Psycholinguistics was almost absent.

Formal Syntax was highly valued, but the intricacies of Semantics were not so attractive.

Making of Dictionaries continued throughout, but major concerted efforts in each language were highly individualistic or had happened long ago.

While writing software or applying it means money, and is hence a crowded field, Language Technology has so far been neglected.

Page 24:

So, what do we need to do now?

Create an Indian WordNet Association

Work in a coordinated manner

Remember to focus on areal semantic features because, with so much linguistic & cultural diversity, India is ideal to test and validate the concept of WordNet.