Hindi to English Wordnet Linkage: Challenges and Solutions
Jaya Saraswati, Rajita Shukla
Ripple P. Goyal, Pushpak Bhattacharyya
Linking of Hindi wordnet (version 1.2) to the Princeton WordNet (version 2.1)
Introduction
Scenario in a Multilingual Country like India
o 22 official languages and hundreds of dialects
o Several linguistic families: Indo-European (Indo-Aryan), Dravidian, Austro-Asiatic, Tibeto-Burman
Roadmap
1. Need for Linkage
2. Challenges in Linkage
3. Solutions
4. WN Synset Linkage Tool
5. Statistics
Need for Linkage
o Creation of bilingual dictionaries
o NLP tasks like Machine Translation and Cross-Lingual Information Retrieval
o Word Sense Disambiguation even in the absence of sense-tagged corpora in the target language
o Creation of a wide wordnet grid of shared concepts
Challenges in Linkage
o Kinship Relations
o Musical Instruments
o Kitchen Utensils
o Tools
o Species
o Grains
o Castes
o Occupations
o Wages
o Women denoting Caste and Occupation
Kinship Relations
Uncle: चाचा (caacaa), मामा (maamaa), फूफा (phoophaa), मौसा (mausaa)
Musical Instruments
Drum: तबला (tabalaa), ढोल (dhol), मृदंग (mridang), नगाड़ा (nagaadaa)
Kitchen Utensils
◦Specific utensils
डोंगा (dongaa - bowl); कटोरदान (katoradaan - container)
◦Size difference
कलछा (kalachhaa - big ladle); कलछी (kalachhii - small ladle)
Tools
o Problem of exact matches in English
कनखोदनी (kanakhodanii); अंकुसी (ankusii) (very specific kinds of tools)
o Size difference
खुरपा (khurpaa - big spud); खुरपी (khurpii - small spud)
Species
o English WordNet does not always have synsets for the male and female of the species
मेंढक (meⁿḍhaka - male frog); मेंढकी (meⁿḍhakii - female frog)
o Some English concepts do not have separate synsets for species and male of the species
शेर (śera - denoting the species tiger); शेर (śera - denoting male tiger)
Grains
Millet: ज्वार, बाजरा, मँड़ुआ
Castes
लुहार (luhaara - a member of the caste of the ironsmiths)
धोबी (dhobi - a member of the caste of people who wash clothes)
Occupations
लुहारी (luhaarii - occupation/work of an ironsmith)
सुनारी (sunaarii - occupation/work of a goldsmith)
Wages
ढुलाई (dhulaaii - wages for carrying/transporting)
पुताई (putaaii - wages for housepainting)
Women denoting Caste and Occupation
o Women of various castes
धोबिन (dhobina - a woman belonging to the caste of the washermen)
o Wives of men from a certain caste or profession
धोबिन (dhobina - wife of a washerman)
Solutions
Two kinds of linkages:
◦Direct Linkage for synsets having exact equivalents in English
◦Hypernymy Linkage for synsets which cannot be linked directly to English concepts
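The two kinds of linkage can be pictured as a small record type. The following is only an illustrative sketch: the names LinkType and SynsetLink are hypothetical and are not taken from the linking tool, and the English target "spud" for खुरपी is assumed from the gloss given in these slides.

```python
from dataclasses import dataclass
from enum import Enum


class LinkType(Enum):
    DIRECT = "direct"        # an exact English equivalent exists
    HYPERNYMY = "hypernymy"  # linked to a more general English concept


@dataclass(frozen=True)
class SynsetLink:
    hindi_word: str      # representative word of the Hindi synset
    english_target: str  # English concept the synset is linked to
    link_type: LinkType


# Illustrative records only; the pairings follow the examples in the slides.
links = [
    SynsetLink("खुरपी", "spud", LinkType.DIRECT),
    SynsetLink("चाचा", "uncle", LinkType.HYPERNYMY),
    SynsetLink("तबला", "drum", LinkType.HYPERNYMY),
]

# Collect the English concepts that are reached through hypernymy links.
hypernymy_targets = sorted(
    link.english_target for link in links if link.link_type is LinkType.HYPERNYMY
)
print(hypernymy_targets)
```

Storing the link type next to the target keeps the two cases distinguishable later, for example when a bilingual dictionary should list only exact equivalents.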
Solutions (contd.)
Examples of hypernymy linkage:
◦डोंगा (dongaa) - to be linked to tableware
◦चाचा (caacaa) and मामा (maamaa) - to be linked to uncle
◦तबला (tabalaa) etc. - to be linked to drum
◦ज्वार (jwaara), बाजरा (baajaraa) - to be linked to millet
◦कनखोदनी (kanakhodanii) - to be linked to tool
Solutions (contd.)
o Terms denoting caste - to be linked to jati
o Terms denoting professions - to be linked to occupation
o Terms denoting remunerations - to be linked to wage
o Terms for women of various castes - to be linked to jati
o Terms for wives of men belonging to various castes and occupations - to be linked to wife
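The category rules listed above amount to a lookup from a term's category to an English hypernym. A minimal sketch, assuming hypothetical category labels (the slides name the target concepts, not the labels):

```python
from typing import Optional

# Category labels are hypothetical; the target concepts are those named in the slides.
HYPERNYM_FOR_CATEGORY = {
    "caste": "jati",
    "profession": "occupation",
    "remuneration": "wage",
    "woman_of_caste": "jati",
    "wife_of_caste_or_occupation": "wife",
}


def hypernym_target(category: str) -> Optional[str]:
    """Return the English concept for a category, or None if no rule applies."""
    return HYPERNYM_FOR_CATEGORY.get(category)


# Example terms from the slides, paired with an assumed category label.
examples = [
    ("लुहार", "caste"),
    ("लुहारी", "profession"),
    ("ढुलाई", "remuneration"),
    ("धोबिन", "wife_of_caste_or_occupation"),
]

for word, category in examples:
    print(word, "->", hypernym_target(category))
```

Note that धोबिन has two senses in the slides (a woman of the caste, and the wife of a washerman), so the category has to be assigned per synset rather than per word.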
Solutions (contd.)
Size Differentiation in Tools and Utensils
o Direct linkage for the more popular term (as in खुरपी khurpii)
o Hypernymy linkage to be used for the other (as in खुरपा khurpaa)
Species and the male of the species
o Direct linkage for term denoting species (शेर śera - linked to tiger)
o Hypernymy linkage to be used to denote the male (शेर śera - again linked to tiger)
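The rule for the species case can be sketched as a simple preference order: use a direct link when an exact English equivalent exists, otherwise fall back to a hypernym. The function name choose_link is hypothetical, and the "skipped" outcome for a synset with no usable target is an assumption about how such cases are handled.

```python
from typing import Optional, Tuple


def choose_link(exact_equivalent: Optional[str],
                hypernym: Optional[str]) -> Tuple[str, Optional[str]]:
    """Prefer a direct link; otherwise use hypernymy; otherwise skip (assumed)."""
    if exact_equivalent is not None:
        return ("direct", exact_equivalent)
    if hypernym is not None:
        return ("hypernymy", hypernym)
    return ("skipped", None)


# शेर as the species: English has an exact equivalent.
species_link = choose_link("tiger", None)
# शेर as the male of the species: no separate English synset, so link upward.
male_link = choose_link(None, "tiger")

print(species_link)
print(male_link)
```

Both senses end up pointing at tiger, but the link type records that only the species sense is an exact match.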
WN Synset Linking Tool (Offline)
Hindi Wordnet Online Interface
Statistics
Total Hindi synsets: 34343
Number of synsets linked: 15091
o Direct linked: 15071
o Hypernymy linked: 20
Number of synsets skipped: 15550
Number of synsets left for first consideration: 3702
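The reported counts can be cross-checked with a few lines of arithmetic. This sketch uses only the figures from the statistics slide; the variable names and the pct helper are illustrative.

```python
# Figures as reported on the statistics slide.
total = 34343
linked = 15091
skipped = 15550
left_for_first_consideration = 3702
direct_linked = 15071
hypernymy_linked = 20

# The three top-level counts should partition the total,
# and the two link types should partition the linked synsets.
assert linked + skipped + left_for_first_consideration == total
assert direct_linked + hypernymy_linked == linked


def pct(part: int, whole: int) -> float:
    """Share of `whole` taken by `part`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


for label, count in [
    ("linked", linked),
    ("skipped", skipped),
    ("left for first consideration", left_for_first_consideration),
]:
    print(f"{label}: {count} ({pct(count, total)}%)")
```

The checks confirm that the counts are internally consistent, with somewhat under half of the Hindi synsets linked so far and almost all of those links being direct.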
Conclusions
o Linking of the Hindi wordnet to the English wordnet
o The challenges therein
o The solutions: the strategy of using direct and hypernymy linkages helps in maximizing linkages
THANK YOU!!!