Shalini R. Urs, Vidyanidhi Digital Library, University of Mysore, Mysore, India
Indo-US Workshop, June 25, 2003
XML-Unicode environment for creating and accessing of
Indian language theses: Vidyanidhi experiences
Shalini R. Urs
Vidyanidhi Digital Library
University of Mysore, Mysore, India
Vidyanidhi Digital Library
• Vidyanidhi began as a pilot project in 2000
• Supported by NISSAT, DSIR, GOI
• Objective was to demonstrate the feasibility of an Electronic Thesis and Dissertation (ETD) initiative in the Indian context
• It is now evolving into a national effort
• Supported by the Ford Foundation
Vidyanidhi: Vision
To evolve into an information infrastructure that strengthens the research capacities of Indian universities by:
• Developing accessible digital libraries of theses and dissertations
• Sensitizing and training doctoral research students in scholarly writing, e-publishing and ETDs
• Developing appropriate policies
• Developing or making available the requisite tools and resources
Vidyanidhi: Strategies
• Policy framework: through meetings, liaison, participation
• Education and training
• Content building: full text and metadata
• Resources and tools (software, interfaces…)
Indian Academic Research Output
• Large system of higher education
• More than 300 universities: a reservoir of extensive doctoral research work
• Doctoral research output: around 30,000 annually
• English is the predominant language
• Increasing vernacularisation: 20-25% in Indian languages
• This trend is resulting in more and more research output in Indian languages
Language Interoperability
• The Vidyanidhi approach has been guided by language interoperability
• Our choice of technology and tools must be interoperable across languages
Indian Languages: Diversity
• The rich diversity in Indian Languages and scripts is simply overwhelming.
• India is made up of a number of separate linguistic communities, each of which shares a common language and culture.
• Number of languages listed for India: 418
• 407 are living languages
• 11 are extinct
• Many languages have no script of their own
Eighteen Indian languages
• Assamese • Gujarati • Kashmiri • Malayalam • Marathi • Oriya • Punjabi • Sindhi • Telugu
• Bengali • Hindi • Kannada • Konkani • Manipuri • Nepali • Sanskrit • Tamil • Urdu
Language Families of Indian Languages
• Indo-European: North and Central India
• Dravidian: South India
• Mon-Khmer: Assam and some eastern parts of India
• Sino-Tibetan: Northern Himalayan and Burmese border area
Indian Scripts
• Interestingly, though the languages belong to four different language groups, Indian scripts have a common root/origin
• Scripts of all Indian languages are derived from Brahmi
• Greater uniformity in the arrangement of Alphabets
Indian Alphabet: Characteristics
• Consonants
– Five vargs (groups)
– Non-varg
– Have an implicit vowel
• Anuswar (a nasal consonant)
• Chandrabindu (a nasalisation sign)
• Visarg
• Vowels and vowel signs
• Vowel omission sign (Halant)
• Conjuncts
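The implicit vowel and the vowel omission sign listed above can be seen directly in Unicode. The following is a minimal Python sketch, assuming Devanagari as the example script; the dictionary keys and the helper name describe are illustrative only.

```python
import unicodedata

KA = "\u0915"            # DEVANAGARI LETTER KA: a consonant carrying its implicit vowel
VIRAMA = "\u094D"        # DEVANAGARI SIGN VIRAMA: the halant (vowel omission sign)
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I: a dependent vowel (matra)

# Three forms built from the same consonant (labels are informal romanisations).
syllables = {
    "ka": KA,                 # bare consonant, implicit vowel kept
    "k": KA + VIRAMA,         # halant removes the implicit vowel
    "ki": KA + VOWEL_SIGN_I,  # vowel sign replaces the implicit vowel
}


def describe(text):
    """Return the Unicode character names that make up the text."""
    return [unicodedata.name(ch) for ch in text]


for label, text in syllables.items():
    print(label, text, describe(text))
```

Each form is a sequence of one or two code points; how it is drawn is left to the font and the shaping engine.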
Indian Languages and scripts
• Indic scripts are syllable-oriented and phonetically based, with imprecise character sets
• The different scripts look different (different shapes) but share a largely similar, subtly different alphabet base and script grammar
Indian Languages and scripts: Issues
• Indic characters consist of consonants, vowels, dependent vowels (called ‘matras’), or combinations of these, called conjuncts.
• Collation (sorting) is a contentious issue because the script is phonetically based rather than alphabet based
Handling Indian Languages: Possible approaches
• Transliteration / glyph-based approach
– Indic characters are encoded in either ASCII or a proprietary encoding
– Use glyph technologies to display and print Indic scripts
– Currently the most popular approach for desktop publishing
Handling Indian Languages: Possible approaches
• Develop an encoding system for all the possible characters and combinations, running to nearly 13,000 characters in each language, with the possibility that a new combination leads to a new character; an approach developed and adopted by the IIT Madras development team
• Adopt the ISCII/Unicode encoding
ISCII: Indian Script Code for Information Interchange
• ISCII-91: BIS standard IS 13194:1991
• An outcome of the efforts of the Govt. of India, DOE, MIT, C-DAC and many other institutions
• Is an 8-bit code
• Is an extension of the 7-bit ASCII code
• Top 128 characters cater to the 10 Indian scripts
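The 8-bit layout described above (lower half ASCII, upper half for Indic characters) can be illustrated with a short Python sketch. It only separates the two halves of the code space; it does not decode ISCII, and the function names and region labels are assumptions made for illustration.

```python
def classify_byte(value: int) -> str:
    """Say which half of an 8-bit ISCII-style code space a byte falls in."""
    if not 0 <= value <= 0xFF:
        raise ValueError("not an 8-bit value")
    # 0x00-0x7F: the 7-bit ASCII range, kept unchanged.
    # 0x80-0xFF: the top 128 positions, used for Indic script characters.
    return "ascii" if value < 0x80 else "indic"


def split_regions(data: bytes):
    """Label every byte of a buffer with the region it belongs to."""
    return [(byte, classify_byte(byte)) for byte in data]


print(split_regions(bytes([0x41, 0x20, 0xB3, 0xDA])))
```

Because the lower half is plain ASCII, English text passes through such an encoding unchanged.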
Unicode
• The Unicode Consortium has set out to encode all of the world’s scripts
• Unicode represents a carefully thought out, technically impressive and full-featured attempt at encoding Indic scripts
• Unicode has unique code points for all of the Indic scripts
Script       Unicode Range       Major Languages
Devanagari   U+0900 to U+097F    Hindi, Marathi, Sanskrit
Bengali      U+0980 to U+09FF    Bengali, Assamese
Gurmukhi     U+0A00 to U+0A7F    Punjabi
Gujarati     U+0A80 to U+0AFF    Gujarati
Oriya        U+0B00 to U+0B7F    Oriya
Tamil        U+0B80 to U+0BFF    Tamil
Telugu       U+0C00 to U+0C7F    Telugu
Kannada      U+0C80 to U+0CFF    Kannada
Malayalam    U+0D00 to U+0D7F    Malayalam
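Since each script occupies its own 128-code-point block, the script of a character can be found by a simple range check. A minimal Python sketch using the ranges listed above follows; the names SCRIPT_RANGES and script_of are illustrative.

```python
# (script name, first code point, last code point), as listed in the table.
SCRIPT_RANGES = [
    ("Devanagari", 0x0900, 0x097F),
    ("Bengali", 0x0980, 0x09FF),
    ("Gurmukhi", 0x0A00, 0x0A7F),
    ("Gujarati", 0x0A80, 0x0AFF),
    ("Oriya", 0x0B00, 0x0B7F),
    ("Tamil", 0x0B80, 0x0BFF),
    ("Telugu", 0x0C00, 0x0C7F),
    ("Kannada", 0x0C80, 0x0CFF),
    ("Malayalam", 0x0D00, 0x0D7F),
]


def script_of(ch: str):
    """Return the Indic script block a single character falls in, or None."""
    code_point = ord(ch)
    for name, first, last in SCRIPT_RANGES:
        if first <= code_point <= last:
            return name
    return None


print(script_of("\u0915"))  # a Devanagari letter
print(script_of("\u0C95"))  # a Kannada letter
print(script_of("A"))       # outside the Indic blocks
```

A check like this could be used, for example, to route a metadata record to the table for its script.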
Unicode implementation for Indic scripts
• Despite its robustness, technical soundness and practical viability, Unicode implementation for Indic scripts is almost non-existent
• Our search of the major databases (LISA, INSPEC, WOS) did not turn up any initiative in this direction
• Vidyanidhi is an example of successful implementation of Unicode for Indic scripts
Vidyanidhi approaches
• Taking Indian language theses to the Web
– Full text
– Metadata
MS Word to XML
• Template for thesis in MS Word
• Student submits thesis in Word
• Convert to XML using the RTF to XML converter
• Take them to the Web
Full Text
• Vidyanidhi provides tools for the creation of theses in Indian languages
• Our approach is to:
– provide a style sheet/template online
– convert the thesis, once submitted, into XML encoded in Unicode
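As a rough illustration of the target of that conversion, the Python sketch below writes a small thesis record as XML encoded in UTF-8. It does not reproduce the RTF to XML converter itself; the element names (thesis, title, author, body, p) and the sample values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def thesis_to_xml(title: str, author: str, paragraphs) -> bytes:
    """Serialise a thesis as UTF-8 encoded XML (element names are hypothetical)."""
    root = ET.Element("thesis")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "author").text = author
    body = ET.SubElement(root, "body")
    for paragraph in paragraphs:
        ET.SubElement(body, "p").text = paragraph
    # encoding="utf-8" makes tostring() return bytes with an XML declaration.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Made-up sample data: a Kannada-script title stored as Unicode text.
xml_bytes = thesis_to_xml("ಕನ್ನಡ", "A. Student", ["First paragraph."])
print(xml_bytes.decode("utf-8"))
```

Because the text is stored as Unicode code points rather than font-specific glyph codes, the same file can be parsed and searched by any XML tool.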
Vidyanidhi database-approach…
• Each script/language will have one table. Currently there are three separate tables for the three scripts: one each for Roman, Hindi (Devanagari) and Kannada
• Theses in Indic languages will have two records: one in the Roman script (transliterated) and the other in the vernacular. Theses in English will have only one record (in English)
Vidyanidhi database-approach…
• The two records are linked by the ThesisID number, a unique ID shared by both records
• The bibliographic description of Vidyanidhi follows the ThesisMS Dublin Core standard adopted by the NDLTD and OCLC
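The one-table-per-script layout with records linked by ThesisID might look roughly like the Python sketch below. It uses an in-memory SQLite database as a stand-in for MS SQL, and the table names, column names and sample rows are all invented for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create one table per script and link records through thesis_id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE roman (thesis_id INTEGER PRIMARY KEY, title TEXT, language TEXT);
        CREATE TABLE devanagari (thesis_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE kannada (thesis_id INTEGER PRIMARY KEY, title TEXT);
        """
    )
    # A Kannada thesis has two records: transliterated and vernacular.
    conn.execute("INSERT INTO roman VALUES (?, ?, ?)", (1, "Kannada Sahitya", "Kannada"))
    conn.execute("INSERT INTO kannada VALUES (?, ?)", (1, "ಕನ್ನಡ ಸಾಹಿತ್ಯ"))
    # An English thesis has only the Roman-script record.
    conn.execute(
        "INSERT INTO roman VALUES (?, ?, ?)", (2, "Digital Libraries in India", "English")
    )
    return conn


def kannada_title(conn: sqlite3.Connection, thesis_id: int):
    """Follow thesis_id from the Roman record to its Kannada record, if any."""
    row = conn.execute(
        "SELECT k.title FROM roman r JOIN kannada k ON k.thesis_id = r.thesis_id "
        "WHERE r.thesis_id = ?",
        (thesis_id,),
    ).fetchone()
    return row[0] if row else None


conn = build_demo_db()
print(kannada_title(conn, 1))
print(kannada_title(conn, 2))
```

The shared thesis_id is what lets a search result in the Roman table offer a link to the vernacular record.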
Vidyanidhi - Platform
• Microsoft
• Windows XP supports all the 10 Indic scripts
• Using Windows glyph processing:
– OpenType font format
– Uniscribe (Unicode Script Processor)
– OpenType Layout Services library
Vidyanidhi - Platform
– MS SQL 2000
• A truly multilingual-capable SQL
• Achieves satisfactory collation
– Front end: ASP
– JavaScript
Vidyanidhi: Accessing and Searching
• One can search the Vidyanidhi database in either of two ways:
– In English (Roman script): the integrated (master) database has metadata records for theses in all languages
– In the vernacular: the vernacular database has records of the specific language only
Two approaches: differences
• One affords search in the English language and the other in the vernacular.
• The first approach returns Roman-script records for all theses that satisfy the query, with an option to view the vernacular-script record for theses written in the vernacular
• The second approach searches only the vernacular database and is thus limited to records in that language.
• However, this approach enables the search to be in the vernacular language and script
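The difference between the two search approaches can be sketched in Python over a small in-memory list of records. The record fields, function names and sample data are assumptions for illustration and do not reflect the actual Vidyanidhi schema or its ASP front end.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Thesis:
    thesis_id: int
    language: str
    roman_title: str                        # Roman-script (transliterated) record
    vernacular_title: Optional[str] = None  # present only for Indic-language theses


SAMPLE = [
    Thesis(1, "Kannada", "Kannada Sahitya", "ಕನ್ನಡ ಸಾಹಿತ್ಯ"),
    Thesis(2, "English", "Digital Libraries in India"),
]


def search_master(records: List[Thesis], term: str) -> List[Thesis]:
    """First approach: search Roman-script records of theses in all languages."""
    needle = term.lower()
    return [t for t in records if needle in t.roman_title.lower()]


def search_vernacular(records: List[Thesis], language: str, term: str) -> List[Thesis]:
    """Second approach: search one language's records in its own script."""
    return [
        t
        for t in records
        if t.language == language and t.vernacular_title and term in t.vernacular_title
    ]


print([t.thesis_id for t in search_master(SAMPLE, "india")])
print([t.thesis_id for t in search_vernacular(SAMPLE, "Kannada", "ಕನ್ನಡ")])
```

A hit from search_master still carries the vernacular title when one exists, which corresponds to the option of viewing the record in the vernacular script.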
Unicode and Indic Scripts
• The Vidyanidhi implementation dispels certain misconceptions and misconstructions about Unicode
• Supposed problems:
– Data input
– Display and printing
– Collation
Data input/Keyboard layout
Our test bed and comparison with other methods:
• The Unicode keyboard layout is as fast to use as the other methods
• In number of keystrokes there is no difference, and sometimes the Unicode method needs fewer keystrokes
• Data input productivity was almost comparable to that for English records
Display and Printing
• It is fairly satisfactory except for a few problem areas:
– Handling of certain conjuncts
– Inability to display a non-terminating pure consonant
– Limited choice of font types
• Unicode can handle conjunct clusters of four consonants
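In Unicode a conjunct is not a separate character: it is written as consonants joined by the halant (virama), and the rendering engine forms the cluster. The Python sketch below builds a four-consonant sequence in Devanagari; the helper name make_conjunct and the chosen consonants are illustrative only.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA (halant)


def make_conjunct(consonants):
    """Join consonants with the halant so that only the last keeps its implicit vowel."""
    return VIRAMA.join(consonants)


# Four Devanagari consonants: RA, TA, SA, NA.
cluster = make_conjunct(["\u0930", "\u0924", "\u0938", "\u0928"])

print(cluster)
print([f"U+{ord(ch):04X}" for ch in cluster])
```

The stored sequence is the same whatever the font; whether the cluster is drawn as a single joined shape depends on the font and the shaping engine, which is where the display problems listed above arise.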
Collation issues-some observations
• Consensus on collation for Indic scripts is hard to come by
• Differences of opinion are not uncommon, as Indic scripts are a cross between syllabic and phonemic writing systems
• Collation according to phonetic order would be different from alphabetic order
Collation Issues
• A few of the ordering anomalies stem from the common script base and order used for all Indic scripts
• Despite their strong similarity, Indic scripts differ in the number and arrangement of consonants and vowels
Collation by Unicode
• Given the above collation problems, the collation achieved by Unicode is fairly satisfactory and compares very well with Nudi, a more popular font-based software package
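One reason plain Unicode ordering works reasonably well is that letters within each Indic block are laid out in roughly the traditional alphabet order, so sorting by code point already gives a usable sequence. The Python sketch below shows this for four Devanagari words; it illustrates code-point ordering only and is not the collation used by MS SQL or by Nudi. The function name and word list are made up for the example.

```python
def codepoint_sort(words):
    """Sort strings by Unicode code point (Python's default string ordering)."""
    return sorted(words)


# Words beginning with MA, KA, A and GA respectively.
words = ["मन", "कमल", "अमर", "गगन"]

for word in codepoint_sort(words):
    print(word, [f"U+{ord(ch):04X}" for ch in word])
```

Here the vowel A (U+0905) sorts before the consonants KA (U+0915), GA (U+0917) and MA (U+092E). Language-specific rules that depart from block order would need a tailored collation instead.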
Conclusion
Unicode handles admirably the challenges of a multilanguage, multi-script database implementation, despite the complexity and minutiae of a family of Indian languages and scripts that share strong commonalities and only faint distinctions