Shalini R. Urs Vidyanidhi Digital Library University of Mysore,Mysore, India

Indo-US Workshop, June 25, 2003

XML-Unicode environment for creating and accessing Indian language theses: Vidyanidhi experiences



Vidyanidhi Digital Library

• Vidyanidhi began as a pilot project in 2000

• Supported by NISSAT, DSIR, GOI

• Objective was to demonstrate the feasibility of an Electronic Thesis and Dissertation (ETD) initiative in the Indian context

• It is now evolving into a national effort

• Supported by the Ford Foundation


Vidyanidhi: Vision

To evolve into an information infrastructure that strengthens the research capacities of Indian universities by:

• Developing accessible digital libraries of theses and dissertations

• Sensitizing and training doctoral research students in scholarly writing, e-publishing and ETDs

• Developing appropriate policies

• Developing/making available requisite tools and resources


Vidyanidhi: Strategies

• Policy Framework – through meetings, liaison, participation

• Education and Training

• Content Building: full text and metadata

• Resources and tools (software, interfaces…)


Indian Academic Research Output

• Large system of higher education

• More than 300 universities: a reservoir of extensive doctoral research work

• Doctoral research output: around 30,000 annually

• English is the predominant language

• Increasing vernacularisation: 20-25% in Indian languages

• This trend is increasing, resulting in more and more research output in Indian languages


Language Interoperability

• The Vidyanidhi approach has been guided by the language interoperability factor

• Our choice of technology and tools has to be interoperable across languages


Indian Languages: Diversity

• The rich diversity in Indian Languages and scripts is simply overwhelming.

• India is made up of a number of separate linguistic communities, each of which shares a common language and culture.

• Number of languages listed for India is 418

• 407 are living languages

• 11 are extinct

• Many languages are without a script of their own


Eighteen Indian languages

• Assamese • Gujarati • Kashmiri • Malayalam • Marathi • Oriya • Punjabi • Sindhi • Telugu

• Bengali • Hindi • Kannada • Konkani • Manipuri • Nepali • Sanskrit • Tamil • Urdu


Language Families of Indian Languages

• Indo-European: North and Central India

• Dravidian: South India

• Mon-Khmer: Assam and some Eastern parts of India

• Sino-Tibetan: Northern Himalayan and Burmese border area


Indian Scripts

• Interestingly, though the languages belong to four different language groups, Indian scripts have a common root/origin

• Scripts of all Indian languages are derived from Brahmi

• Greater uniformity in the arrangement of Alphabets


Indian Alphabet: Characteristics

• Consonants

– Five Vargs (groups)

– Non varg

– Have an implicit (inherent) vowel

• Anuswar (a nasal consonant)

• Chandrabindu (a nasalisation sign)

• Visarg

• Vowels and Vowel Signs

• Vowel omission sign (Halant)

• Conjuncts
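
Each category listed above corresponds to named characters in Unicode. The following is a minimal sketch, assuming Devanagari as the example script and Python's standard `unicodedata` module; the sample characters and the helper name `describe` are illustrative choices and do not come from the slides.

```python
import unicodedata

# One sample character per category from the slide (Devanagari, chosen for illustration).
SAMPLES = {
    "consonant": "\u0915",     # KA, first consonant of the first varg
    "vowel": "\u0905",         # independent vowel A
    "vowel sign": "\u093E",    # dependent vowel sign AA (matra)
    "anuswar": "\u0902",
    "chandrabindu": "\u0901",
    "visarg": "\u0903",
    "halant": "\u094D",        # vowel omission sign, called VIRAMA in Unicode
}


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for label, ch in SAMPLES.items():
    print(f"{label:13} {describe(ch)}")
```

Note that Unicode uses the Sanskrit-derived names (VIRAMA, CANDRABINDU) rather than the Hindi terms used on the slide.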


Indian Languages and scripts

• Indic scripts are syllable-oriented and phonetically based, with imprecise character sets

• The scripts look different (different shapes) but share a largely similar, yet subtly different, alphabet base and script grammar


Indian Languages and scripts: Issues

• The Indic characters consist of consonants, vowels, dependent vowels (called ‘matras’), or a combination of any or all of them, called conjuncts.

• Collation (sorting) is a contentious issue as the script is phonetic based and not alphabet based
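
To illustrate why a written syllable is not the same thing as a single encoded character, here is a rough sketch that groups Unicode code points into syllable-like clusters. The rule it uses (attach combining marks, and attach any character that follows a halant) is a deliberate simplification assumed for illustration; it is not the segmentation used by Vidyanidhi or by any rendering engine, and the function name `clusters` is hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari halant (vowel omission sign)


def clusters(text: str) -> list[str]:
    """Group code points into rough syllable clusters (simplified rule)."""
    out: list[str] = []
    for ch in text:
        # Matras, anuswar, visarg and halant are all Unicode "Mark" characters.
        is_mark = unicodedata.category(ch).startswith("M")
        # A consonant that follows a halant continues the same conjunct.
        joins_previous = bool(out) and out[-1].endswith(VIRAMA)
        if out and (is_mark or joins_previous):
            out[-1] += ch
        else:
            out.append(ch)
    return out


# "kshatriya": KA + halant + SSA, TA + halant + RA + vowel sign I, YA
word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
for cluster in clusters(word):
    print(cluster, [f"U+{ord(c):04X}" for c in cluster])
```

The eight code points of the sample word fall into three clusters, which is why sorting and editing by code point and by syllable can give different results.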


Handling Indian Languages: Possible approaches

• Transliteration - Glyph based approach

– Indic characters are encoded in either ASCII or any other proprietary encoding

– Use glyph technologies to display and print Indic scripts

– Currently the most popular approach for desktop publishing.


Handling Indian Languages: Possible approaches

• Develop an encoding system for all possible characters and combinations, running to nearly 13,000 characters in each language, with the possibility that a new combination leads to a new character. This approach was developed and adopted by the IIT Madras development team

• Adopt the ISCII/Unicode encoding


ISCII: Indian Script Code for Information Interchange

• ISCII-91: BIS Standard, IS 13194:1991

• An outcome of the efforts of Govt. of India, DOE, MIT, C-DAC and many other institutions

• Is an 8 bit code

• Is an extension of the 7 bit ASCII code

• Top 128 characters cater to the 10 Indian scripts
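
The 8-bit layout described above can be sketched as a simple partition of byte values: the lower half (0x00-0x7F) is plain ASCII and the upper half (0x80-0xFF) is reserved for Indic characters. The function below only separates the two halves; it makes no attempt to decode upper-half bytes into characters, and the sample bytes are arbitrary values chosen for illustration, not claimed ISCII assignments. The function name is hypothetical.

```python
def split_by_half(data: bytes) -> list[tuple[str, bytes]]:
    """Split an 8-bit byte string into runs of ASCII and upper-half bytes."""
    runs: list[tuple[str, bytes]] = []
    for b in data:
        kind = "ascii" if b < 0x80 else "indic"
        if runs and runs[-1][0] == kind:
            # Extend the current run of the same kind.
            runs[-1] = (kind, runs[-1][1] + bytes([b]))
        else:
            runs.append((kind, bytes([b])))
    return runs


# Arbitrary sample: two ASCII bytes, two upper-half bytes, one ASCII byte.
print(split_by_half(b"ab\xb3\xe8c"))
```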


Unicode

• The Unicode consortium has encoded all of the world’s scripts

• Unicode represents a carefully thought out, technically impressive and full-featured attempt at encoding Indic scripts

• Unicode has unique code points for all of the Indic scripts


Script       Unicode Range      Major Languages
Devanagari   U+0900 to U+097F   Hindi, Marathi, Sanskrit
Bengali      U+0980 to U+09FF   Bengali, Assamese
Gurmukhi     U+0A00 to U+0A7F   Punjabi
Gujarati     U+0A80 to U+0AFF   Gujarati
Oriya        U+0B00 to U+0B7F   Oriya
Tamil        U+0B80 to U+0BFF   Tamil
Telugu       U+0C00 to U+0C7F   Telugu
Kannada      U+0C80 to U+0CFF   Kannada
Malayalam    U+0D00 to U+0D7F   Malayalam
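
Because each script occupies its own contiguous range, the script of a character can be found from its code point alone. A minimal sketch using the ranges in the table above; the names `SCRIPT_RANGES` and `script_of` are illustrative, not part of any library.

```python
from typing import Optional

# Code point ranges copied from the table above.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_of(ch: str) -> Optional[str]:
    """Return the script whose range contains ch, or None if outside the table."""
    cp = ord(ch)
    for script, (low, high) in SCRIPT_RANGES.items():
        if low <= cp <= high:
            return script
    return None


print(script_of("\u0915"))  # Devanagari letter KA
print(script_of("\u0C95"))  # Kannada letter KA
print(script_of("A"))       # Latin letter, not in the table
```

A lookup of this kind could, for example, route a metadata record to the table for its script.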


Unicode implementation for Indic scripts

• Despite its robustness, technical soundness and practical viability, Unicode implementation for Indic scripts is almost non-existent

• Our search of the major databases (LISA, INSPEC, WOS) did not turn up any initiative in this direction

• Vidyanidhi is an example of successful implementation of Unicode for Indic scripts


Vidyanidhi approaches

• Taking Indian language theses to the Web

– Full Text

– Metadata


Template for thesis in MS Word

Student submits thesis in Word

Convert to XML using the RTF to XML Converter

MS Word to XML

Take them to the Web


Full Text

• Vidyanidhi provides tools for the creation of theses in Indian Languages

• Our approach is to:

– Provide a style sheet/template online

– When the thesis is submitted, convert it into XML encoded in Unicode
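
The target of the conversion, XML encoded in Unicode, can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration of writing Indic text as UTF-8 XML: the element names (`thesis`, `title`, `author`), the `lang` attribute and the sample values are assumptions, not the schema or converter used by Vidyanidhi.

```python
import xml.etree.ElementTree as ET


def thesis_to_xml(title: str, author: str, language: str) -> bytes:
    """Serialise a few thesis fields as UTF-8 encoded XML (hypothetical schema)."""
    root = ET.Element("thesis", {"lang": language})
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "author").text = author
    # encoding="utf-8" makes tostring() return bytes with the Indic text stored as UTF-8.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


kannada_title = "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1"  # the word "Kannada" in Kannada script
data = thesis_to_xml(kannada_title, "A. Author", "kn")
print(data.decode("utf-8"))
```

Parsing the bytes back with `ET.fromstring(data)` returns the same Kannada string, so no transliteration or font-specific encoding is involved at any step.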


Vidyanidhi database-approach…

• Each script/language will have one table. Currently there are three separate tables for the three scripts: one each for Roman, Hindi (Devanagari) and Kannada

• Theses in Indic languages will have two records: one in the Roman script (transliterated) and the other in the vernacular. Theses in English, however, will have only one record (in English)


Vidyanidhi database-approach…

• The two records are linked by the ThesisID number, a unique ID for the record

• The bibliographic description of Vidyanidhi follows the ThesisMS Dublin Core standard adopted by the NDLTD and OCLC
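
The two-record design can be sketched as follows. The slides describe an MS SQL 2000 database; this sketch substitutes Python's built-in `sqlite3` module so that it runs anywhere, and the table names, column names and sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per script; thesis_id plays the role of the ThesisID link.
    CREATE TABLE thesis_roman (
        thesis_id INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        language  TEXT NOT NULL
    );
    CREATE TABLE thesis_kannada (
        thesis_id INTEGER PRIMARY KEY REFERENCES thesis_roman(thesis_id),
        title     TEXT NOT NULL
    );
    """
)

# Made-up sample rows: one Kannada thesis (two records), one English thesis (one record).
conn.execute("INSERT INTO thesis_roman VALUES (1, 'Kannada', 'Kannada')")
conn.execute("INSERT INTO thesis_kannada VALUES (1, ?)", ("\u0C95\u0CA8\u0CCD\u0CA8\u0CA1",))
conn.execute("INSERT INTO thesis_roman VALUES (2, 'Digital libraries', 'English')")


def records_for(thesis_id: int):
    """Return (roman_title, vernacular_title or None) for one thesis."""
    return conn.execute(
        """
        SELECT r.title, k.title
        FROM thesis_roman AS r
        LEFT JOIN thesis_kannada AS k ON k.thesis_id = r.thesis_id
        WHERE r.thesis_id = ?
        """,
        (thesis_id,),
    ).fetchone()


print(records_for(1))
print(records_for(2))
```

The left join reflects the rule above: every thesis has a Roman-script record, while the vernacular record exists only for theses written in an Indic language.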


Vidyanidhi - Platform

• Microsoft

• Windows XP supports all the 10 Indic scripts

• Using Windows Glyph processing

– OpenType Font Format

– Uniscribe (Unicode Script Processor)

– OpenType Layout Services library


Vidyanidhi - Platform

• MS SQL 2000

– A truly multilingual-capable SQL

– Achieves satisfactory collation

• Front End: ASP

• JavaScript


Vidyanidhi: Accessing and Searching

• One can search the Vidyanidhi database in either of two ways:

– In English (Roman script): the integrated (Master) database has metadata records for theses in all languages

– In the vernacular: the vernacular database has records of the specific language only


Two approaches: differences

• One affords search in the English language and the other in the vernacular.

• The first approach returns, for every thesis that satisfies the query, a record in Roman script; for theses in a vernacular language it also offers the option of viewing the record in the vernacular script


• The second approach enables one to search only the vernacular database and thus is limited to records in that language.

• However, this approach enables the search to be in the vernacular language and script
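
The difference between the two search paths can be sketched with plain Python data structures. The record layout, the sample titles and the function names below are hypothetical; the sketch only mirrors the behaviour described above, not the actual ASP front end.

```python
# Master database: Roman-script records for theses in all languages (made-up samples).
MASTER = [
    {"thesis_id": 1, "language": "Kannada", "title": "Kannada"},
    {"thesis_id": 2, "language": "English", "title": "Digital libraries"},
    {"thesis_id": 3, "language": "Hindi", "title": "Hindi"},
]

# Vernacular databases: one per language, keyed by the same thesis_id.
VERNACULAR = {
    "Kannada": {1: "\u0C95\u0CA8\u0CCD\u0CA8\u0CA1"},
    "Hindi": {3: "\u0939\u093F\u0928\u094D\u0926\u0940"},
}


def search_master(term: str):
    """Search in Roman script across all languages; attach the vernacular record if any."""
    hits = []
    for rec in MASTER:
        if term.lower() in rec["title"].lower():
            vernacular = VERNACULAR.get(rec["language"], {}).get(rec["thesis_id"])
            hits.append((rec["thesis_id"], rec["title"], vernacular))
    return hits


def search_vernacular(language: str, term: str):
    """Search in the vernacular script, limited to the database of one language."""
    return [tid for tid, title in VERNACULAR.get(language, {}).items() if term in title]


print(search_master("kannada"))
print(search_vernacular("Hindi", "\u0939\u093F"))
```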


Unicode and Indic Scripts

• Vidyanidhi implementation dispels certain misconceptions and misconstructions about Unicode

• Supposed problems:

– Data input

– Display and printing

– Collation


Data input/Keyboard layout

Our Test bed and comparison with other methods:

• The Unicode layout is as easy as the others in terms of speed

• In terms of number of keystrokes there is no difference, and sometimes the Unicode method involves fewer keystrokes

• Data input was almost comparable to English records in terms of productivity


Display and Printing

• It is fairly satisfactory except for a few problem areas:

– Handling of certain conjuncts

– Inability to display a non-terminating pure consonant

– Limited choice of font types

• Unicode can handle conjunct clusters of four consonants
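
At the encoding level, a conjunct is simply a sequence of consonants joined by halants, so a cluster of four consonants takes seven code points. The sketch below builds such a sequence in Devanagari. It also shows the convention in the Unicode Standard of placing a zero width non-joiner (ZWNJ) after the halant to request a visible halant instead of a conjunct form. Whether either sequence displays as intended depends on the font and shaping engine, and nothing here is claimed to resolve the display issues reported above; the function names are illustrative.

```python
VIRAMA = "\u094D"  # Devanagari halant
ZWNJ = "\u200C"    # zero width non-joiner


def conjunct(*consonants: str) -> str:
    """Join consonants with halants so that they form one conjunct cluster."""
    return VIRAMA.join(consonants)


def explicit_halant(first: str, rest: str) -> str:
    """Request a visible halant on `first` instead of a conjunct with `rest`."""
    return first + VIRAMA + ZWNJ + rest


# Four consonants (RA, TA, SA, NA) joined by three halants.
four = conjunct("\u0930", "\u0924", "\u0938", "\u0928")
print(len(four), [f"U+{ord(c):04X}" for c in four])

# KA with a visible halant followed by SSA, rather than the usual conjunct.
print([f"U+{ord(c):04X}" for c in explicit_halant("\u0915", "\u0937")])
```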


Collation issues-some observations

• Consensus with respect to collation of Indic scripts is hard to come by

• Difference of opinion is not uncommon as Indic languages are a cross between syllabic and phonemic writing systems

• Collation according to phonetic order would be different from alphabetic order


Collation Issues

• Some of the ordering problems stem from the common script base and order shared by all Indic scripts

• Differences between Indic scripts -in the number and arrangement of consonants and vowels-despite strong similarity


Collation by Unicode

• Given the above collation problems, the collation achieved by Unicode is fairly satisfactory and compares very well with Nudi, a more popular font-based software package
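
As a rough illustration of ordering by Unicode code point, the sketch below sorts three Devanagari words with Python's default string comparison, which compares code points. This is not the collation performed by MS SQL 2000 in the Vidyanidhi system, and it is not a full language-aware collation; the sample words and function names are chosen for illustration only.

```python
def codepoint_key(word: str) -> tuple[int, ...]:
    """Sort key that makes the code point comparison explicit."""
    return tuple(ord(ch) for ch in word)


def codepoint_sort(words: list[str]) -> list[str]:
    """Sort words by their sequence of Unicode code points."""
    return sorted(words, key=codepoint_key)


# gaj (GA JA), kamal (KA MA LA), khag (KHA GA)
words = ["\u0917\u091C", "\u0915\u092E\u0932", "\u0916\u0917"]

for w in codepoint_sort(words):
    print(w, [f"U+{ord(c):04X}" for c in w])
```

Because the Devanagari block lists its consonants in the traditional varg order (KA, KHA, GA, ...), code point order already places these words in their usual sequence; harder cases such as conjuncts and nasal signs are where opinions on the correct order differ.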


Conclusion

Unicode handles admirably the challenges of a multi-language, multi-script database implementation, despite the complexity and the minutiae of a family of Indian languages and scripts that have strong commonalities and faint distinctions among themselves


Contact [email protected]
