Asian Languages on the Web

Recent Experiences on Measuring Languages in Cyberspace – 2007, UNESCO Headquarters, Room XVI

S. T. Nandasara, Lecturer, UCSC, University of Colombo, Sri Lanka
Ashu Marasinghe, Associate Professor, LOP, Nagaoka University of Technology, Japan
Yoshiki Mikami, Professor, Leader, LOP, Nagaoka University of Technology, Japan


Asian Languages on the Web

Introduction of Asian Languages
Survey Objectives and Methodology
Asian Language Presence on the Web
Multilingualism in the Asian Web
Script and Encoding Issues
Asian Language Resource Network (ALRN) Project

Survey Objectives

Give an overview of Asian languages on the web.
Describe the state of multilingualism in Asian country domains.
  Defined at various levels, from a personal or document level to a societal level
  Multiple language presence in each country domain
  Give an overview of cross-border languages
Shed light on script and encoding issues of Asian languages.
  To what extent is UCS/Unicode employed for Asian languages?
  What scripts are actually used to represent a specific language?
  To what extent are locally developed encodings used?
Define a future agenda that can guide us in realizing the vision of creating an observation-collection instrument for Asian languages.

Survey Methodology

A web crawler (UbiCrawler) was used.
  It traces links within pages and recursively crawls the newly discovered pages.
The collection of downloaded web pages was passed to the language identification engine.
The language properties of the pages were then identified.
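The link-tracing step described above can be sketched as a breadth-first crawl restricted to a set of country domains. This is a minimal illustration only and not UbiCrawler's implementation: the function names, the `fetch_links` callback and the example URLs are all hypothetical, and the link graph is held in memory so that the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urlparse

def in_scope(url, cctlds):
    """True if the URL's host name ends in one of the surveyed ccTLDs."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] in cctlds

def crawl(seeds, fetch_links, cctlds, max_pages=1000):
    """Visit pages breadth-first, following only links that stay in scope."""
    queue = deque(dict.fromkeys(u for u in seeds if in_scope(u, cctlds)))
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):  # links found on the downloaded page
            if link not in seen and in_scope(link, cctlds):
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical in-memory link graph standing in for real page downloads.
graph = {
    "http://a.example.lk/": ["http://b.example.th/", "http://c.example.com/"],
    "http://b.example.th/": ["http://a.example.lk/", "http://d.example.il/"],
    "http://d.example.il/": [],
}
visited = crawl(["http://a.example.lk/"],
                lambda url: graph.get(url, []),
                cctlds={"lk", "th", "il"})
```

In this toy run the `.com` link is never queued because it falls outside the listed country domains, which mirrors the survey's restriction to Asian ccTLDs.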

Web Pages Collected

The survey focused on web pages in 42 country domains in Asia.
The crawl began from a seed file containing 13,286 URLs.
The ccTLDs covered were ae, af, az, bd, bh, bn, bt, cy, id, il, in, iq, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, my, np, om, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, vn and ye.
The Asia crawl started on 5 July 2006 at 11:00 and ended on 19 July 2006 at 19:03.
It downloaded 107,141,679 web pages in total, 652,710,237,381 bytes in size.
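The crawl figures above imply an average page size and an average download rate. The short calculation below derives both, assuming the start and end times are given in the same time zone (the slides do not say).

```python
from datetime import datetime

# Figures taken from the slide; time zone is assumed to be the same for both stamps.
start = datetime(2006, 7, 5, 11, 0)
end = datetime(2006, 7, 19, 19, 3)
pages = 107_141_679
total_bytes = 652_710_237_381

duration = end - start                                # length of the crawl
avg_page_bytes = total_bytes / pages                  # mean bytes per downloaded page
pages_per_second = pages / duration.total_seconds()   # mean download rate

print(duration, round(avg_page_bytes), round(pages_per_second, 1))
```

Under that assumption the crawl ran for 14 days, 8 hours and 3 minutes, with pages averaging roughly 6 kB each.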

Downloaded Pages by ccTLD – Top 10

Country      ccTLD       Pages  Percentage
Israel       il     30,943,029      29.48%
Thailand     th     12,556,807      11.96%
Turkey       tr     11,363,633      10.83%
Malaysia     my      6,865,800       6.54%
Kazakhstan   kz      6,441,378       6.14%
Singapore    sg      5,771,191       5.50%
Indonesia    id      5,742,097       5.47%
Vietnam      vn      4,490,288       4.28%
India        in      4,262,378       4.06%
Iran         ir      4,022,270       3.83%

Downloaded Pages by ccTLD – Least 10

Country      ccTLD   Pages  Percentage
Iraq         iq          0       0.00%
East Timor   tp     13,213       0.01%
Myanmar      mm     16,759       0.02%
Yemen        ye     34,128       0.03%
Maldives     mv     37,393       0.04%
Bhutan       bt     44,594       0.04%
Syria        sy     51,555       0.05%
Qatar        qa     52,888       0.05%
Kuwait       kw     59,152       0.06%

Language Identification Process

The language identification engine used was LIM (Language Identification Module).
LIM consists of two components:
  Training component. The training data are translations of the Universal Declaration of Human Rights (UDHR) provided by the Office of the United Nations High Commissioner for Human Rights.
  Identification component.
LIM can simultaneously detect the triplet of language, script and encoding scheme.
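The slides do not describe LIM's internal algorithm, so the following is only a sketch of one common way a train-then-identify module could return a (language, script, encoding) triplet: byte n-gram profiles built from labelled training text. The function names and the toy training texts are assumptions; the English sample is the first sentence of UDHR Article 1, and the Thai sample is the language name as it appears elsewhere in this deck.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 3) -> Counter:
    """Count overlapping byte n-grams in a byte string."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def train(samples: dict) -> dict:
    """Build one profile per (language, script, encoding) label.

    Each text is encoded with the label's own encoding, so the same
    language produces a different profile for each encoding scheme.
    """
    return {label: byte_ngrams(text.encode(label[2]))
            for label, text in samples.items()}

def identify(page: bytes, profiles: dict) -> tuple:
    """Return the label whose profile covers the largest share of the page's n-grams."""
    grams = byte_ngrams(page)
    total = sum(grams.values()) or 1

    def coverage(label):
        profile = profiles[label]
        return sum(c for g, c in grams.items() if g in profile) / total

    return max(profiles, key=coverage)

# Toy training data (assumption): one short text per label.
samples = {
    ("English", "Latin", "utf-8"):
        "All human beings are born free and equal in dignity and rights.",
    ("Thai", "Thai", "utf-8"): "ภาษาไทย",
    ("Thai", "Thai", "tis_620"): "ภาษาไทย",
}
profiles = train(samples)

thai_legacy = identify("ภาษาไทย".encode("tis_620"), profiles)
thai_unicode = identify("ภาษาไทย".encode("utf-8"), profiles)
english = identify(b"human rights", profiles)
```

Working on raw bytes rather than decoded text is what lets a single pass separate the same language in two encodings, here Thai in TIS-620 and in UTF-8.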

Discovered 55 Asian languages

Chinese, Japanese and Korean are excluded from the analysis.

Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, Farsi, Javanese, Indonesian, Malay, Sundanese, Hindi, Dari, Uzbek, Mongolian, Kazakh, Madurese, Uighur, Kashmiri, Pushtu, Balochi, Turkmen, Minangkabau, Bikol, Kyrgyz, Balinese, Punjabi, Sindhi, Achehnese, Sinhala, Kapampangan, Iloko, Bengali & Assamese, Filipino, Waray, Buginese, Burmese, Kurdish, Tajiki, Azeri, Tamil, Hiligaynon, Dhivehi, Bhojpuri, Tibetan, Cebuano, Telugu, Saraiki, Lao, Gujarati, Pashto, Kannada, Urdu, Khmer, Hani

[Figure: number of web pages per 1,000 population]

Number of pages by language – Top 10

Language     Script   Speaker population  Total number of pages  No. of pages per 1000 speakers
Hebrew       Hebrew            4,612,000             11,957,314                         2592.65
Thai         Thai             21,000,000              7,752,785                          369.18
Turkish      Latin            59,000,000              3,959,328                           67.11
Vietnamese   Latin            66,897,000              2,006,469                           29.99
Arabic       Arabic          280,000,000              1,671,122                            5.97
Tatar        Latin             7,000,000              1,575,442                          225.06
Farsi        Latin            33,000,000              1,293,880                           39.21
Javanese     Latin            75,000,000              1,267,981                           16.91
Indonesian   Latin           140,000,000                866,238                            6.19
Malay        Latin            17,600,000                432,784                           24.59

Number of pages by language – Least 10

Language   Script    Speaker population  Total number of pages  No. of pages per 1000 speakers
Cebuano    Latin             15,230,000                  1,107                            0.07
Telugu     Telugu            73,000,000                  1,072                            0.01
Saraiki    Arabic            15,020,000                  1,036                            0.07
Lao        Lao                4,000,000                    799                            0.20
Gujarati   Gujarati          44,000,000                    765                            0.02
Pashto     Arabic             9,585,000                    259                            0.03
Kannada    Kannada           33,663,000                    164                            0.00
Urdu       Arabic            54,000,000                     70                            0.00
Khmer      Khmer              7,063,200                     65                            0.01
Hani       Latin                747,000                     63                            0.08
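The last column of the two language tables is the page count divided by the speaker population, scaled to 1,000 speakers. The snippet below recomputes it for three rows taken from the tables; the function name is an illustrative choice.

```python
def pages_per_1000_speakers(pages: int, speakers: int) -> float:
    """Pages found on the web per 1,000 speakers, rounded as in the tables."""
    return round(pages / speakers * 1000, 2)

# (speaker population, total number of pages) copied from the tables.
rows = {
    "Hebrew": (4_612_000, 11_957_314),
    "Thai": (21_000_000, 7_752_785),
    "Hani": (747_000, 63),
}
density = {lang: pages_per_1000_speakers(pages, speakers)
           for lang, (speakers, pages) in rows.items()}
print(density)
```

Normalizing by speaker population is what moves Hebrew far ahead of much larger languages such as Arabic and Indonesian in the ranking.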

Multilingualism by Country Domain

The most recent version of Ethnologue lists close to seven thousand languages around the world. More than 2,600 of them are spoken in the Asian region.
Large-scale linguistic diversity is observable in Asia. Among the 2,600, only around 51 languages are recognized by Asian governments as official or national languages.
Indonesia has the richest diversity of languages in the region.
  It is interesting to note that there are significantly more pages in Javanese than in either Indonesian or Malay.
  The major languages found in Indonesia, Malaysia, Brunei, Singapore, southern Thailand and the Philippines can be categorized under a single root, the Malay language, spoken in different dialects.
  Javanese has a dominating web presence in Indonesia.
  The lesser Sundanese, Madurese, Achehnese and Buginese languages are found to be of great importance to Indonesia’s local language diversity on the Internet.

Cross-Border Languages

Another aspect of multilingualism in the region is the overwhelming presence of cross-border languages on the web.
Two categories of languages are defined:
  The first category is “local languages”, the officially recognized language(s) of the state and the home languages of its speakers.
  The second category is “cross-border languages”, such as English, French, Russian and Arabic, which are used as languages of communication among the peoples of different nations.
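To illustrate how the two categories can be turned into per-domain shares, the sketch below groups page counts into Local, English, Russian, Arabic and Others, which are the groups used in the regional charts of this deck. The function name and the page counts are hypothetical and are not survey data. A language is counted as local first, so Arabic pages in a country where Arabic is a local language fall under Local.

```python
def language_shares(page_counts: dict, local_languages: set) -> dict:
    """Percentage of pages per category for one country domain."""
    shares = {"Local": 0, "English": 0, "Russian": 0, "Arabic": 0, "Others": 0}
    for language, count in page_counts.items():
        if language in local_languages:
            key = "Local"          # local status takes precedence
        elif language in ("English", "Russian", "Arabic"):
            key = language         # cross-border languages tracked separately
        else:
            key = "Others"
        shares[key] += count
    total = sum(page_counts.values())
    return {key: 100 * value / total for key, value in shares.items()}

# Hypothetical page counts for one country domain (not survey results).
counts = {"Kazakh": 200, "Russian": 600, "English": 150, "German": 50}
shares = language_shares(counts, local_languages={"Kazakh"})
```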

Cross-Border Language Presence: West Asia

[Figure: stacked bar chart, 0% to 100%, showing %Local, %Arabic, %Others, %Russian and %English for Cyprus, Turkey, Israel, Lebanon, Jordan, Syria, Palestine, GCC, Iran and Afghanistan]

Central Asia

[Figure: stacked bar chart of cross-border language presence, 0% to 100%, showing %Local, %Arabic, %Others, %Russian and %English for Kazakhstan, Kyrgyzstan, Uzbekistan, Turkmenistan, Tajikistan, Azerbaijan and Mongolia]

South East Asia

[Figure: stacked bar chart of cross-border language presence, 0% to 100%, showing %Local, %Arabic, %Others, %Russian and %English for Myanmar, Thailand, Lao, Cambodia, Malaysia, Indonesia, Philippines, Brunei, Vietnam and Singapore]

South Asia

[Figure: stacked bar chart of cross-border language presence, 0% to 100%, showing %Local, %Arabic, %Others, %Russian and %English for Pakistan, India, Sri Lanka, Maldives, Bhutan, Nepal and Bangladesh]

Script Diversity of Asia

[Figure: language names written in their own scripts]

Chinese: 普通話
Urdu: اردو
Kyrgyz: Кыргыз
English: English
Sindhi: سنڌي
Filipino (Tagalog): Tagalog
Arabic (Alarabia): العربية
Turkish (Türkçe): Türkçe
Assamese: অসমীয়া
Bengali: বাংলা
Turkmen: түркменче
Azeri/Azerbaijani (Cyrillic): Азәрбајҹан дили
Hindi: हिन्दी
Gujarati: ગુજરાતી
Malayalam: മലയാളം
Portuguese: Português
Tamil: தமிழ்
Kashmiri: कॉशुर / كٲشُر
Indonesian: Indonesia
Kannada: ಕನ್ನಡ
Pashto/Pakhto: پښتو
Japanese (Nihongo): 日本語
Punjabi/Panjabi: ਪੰਜਾਬੀ / پنجابی
Kazakh: Қазақ / قازاق
Korean (Hankuko): 한국어 [韓國語]
Thai: ภาษาไทย
Uighur (Uyghur): Уйғур / ئۇيغۇر
Telugu: తెలుగు
Fijian: vaka-Viti
Uzbek (Cyrillic): Ўзбек
Vietnamese: Tiếng Việt
Dari: دری
Marathi: मराठी
Sanskrit: संस्कृतम्
Tatar: татарча / تاتارچا
Balinese: Bahasa Bali
Maldivian (Dhivehi): ދިވެހި
Tahitian: Te Reo Tahiti
Maori: Te Reo Māori
Malay (Bahasa Melayu): Bahasa Melayu
Hawaiian: Ōlelo Hawai'i

Same Script Shared by Various Languages

Devanagari script is used by:
More than 480 million speakers: Hindi
More than 10 million speakers: Marathi, Nepali
More than 1 million speakers: Awadhi, Bhojpuri, Braj Bhasha, Chhattisgarhi, Konkani, Kachchi, Marwari, Maithili, Magahi
Less than 1 million speakers: Garhwali, Mundari, Newari, Bagheli, Bhatneri, Bathi, Bateri, Bhili, Gondi, Jaipuri, Harauti, Ho, Kachchhi, Kanauji, Khadiya, Khortha, Kului, Kumaoni, Kurku, Kurukh, Kurmali, Palpa, Panchpargania, Santali, Nagpuri, Kankan, Limbu, Sherpa
Scholars’ language: Sanskrit

UDHR Document by Major Script Grouping

Representation of the UDHR Document by Major Script Grouping

Script Region                        Encoded  PDFs  Images
Latin                                    253     2       1
Cyrillic                                  19     4       3
Arabic                                     1     2       7
Ideographic                                3     0       0
Indic                                      -     7      12
Others                                     1    10       7
Speaker Population in Millions [1]     4,644   254     905

[1] Cumulated speaker population based on Ethnologue, “Languages of the World”, 15th ed. (2005)

UTF-8 Encoding in Selected Languages

Language                                                    UTF-8 encoded documents  Documents encoded otherwise  Examples of other encodings found [1]
Vietnamese                                                        1,934,392 (96.4%)               72,077 (3.6%)  TCVN, VIQR, VPS
Mongolian                                                            48,834 (95.5%)                2,300 (4.5%)  Latin-Cyrillic
Hindi, Bhojpuri, Magahi, Marathi, Nepali, Sanskrit, Tamang           81,800 (78.4%)              22,544 (21.6%)  Agra, Arjun, Kiran, Kruti, Hungama, Naidunia, Shivaji, Shree, Shusha
Sinhala                                                               4,793 (44.5%)               5,977 (55.5%)  Metta, Kaputa
Arabic                                                              400,933 (24.0%)           1,270,189 (76.0%)  Latin-Arabic
Telugu                                                                  178 (16.6%)                 894 (83.4%)  Shree, TLH
Tamil                                                                   566 (14.9%)               3,232 (85.1%)  Amudham, Kumudam, Shree, Vikatan
Hebrew                                                            1,468,344 (12.3%)          10,488,970 (87.7%)  Latin-Hebrew
Thai                                                                  207,901 (2.7%)           7,544,884 (97.3%)  TIS 620
Burmese                                                                   24 (0.7%)               3,261 (99.3%)  WinResearcher
Turkish                                                               20,591 (0.5%)           3,938,737 (99.5%)  Latin-Turkish
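The percentages in this table are each language's UTF-8 document count divided by its total document count. The sketch below recomputes two of them and adds a simple validity check for UTF-8 bytes. The check is only an illustration and is not the survey's detection method: plain ASCII also passes it, and short legacy-encoded text can occasionally be valid UTF-8 by chance. Function names are hypothetical.

```python
def is_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 without error (ASCII also passes)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def utf8_share(utf8_docs: int, other_docs: int) -> float:
    """Percentage of documents encoded in UTF-8, rounded as in the table."""
    return round(100 * utf8_docs / (utf8_docs + other_docs), 1)

# Document counts copied from the table.
vietnamese_share = utf8_share(1_934_392, 72_077)
thai_share = utf8_share(207_901, 7_544_884)

# The same Thai text in the legacy TIS-620 encoding and in UTF-8.
legacy_ok = is_utf8("ภาษาไทย".encode("tis_620"))
unicode_ok = is_utf8("ภาษาไทย".encode("utf-8"))
```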

Asian Language Resources – Agenda

ALRN Mission

To create a network of qualified Asian partners to specify and support the development of high priority Language Resources (LRs) for Asian Languages in a systematic, standards-driven, collaborative and learning context.
The project will focus on:
  identifying the state of the art of LRs in the region,
  assessing priority requirements through consultations with language research, industry and communication players, and
  establishing a protocol and standards for developing an LR Network for the languages spoken in the region.

ALRN Action Plan

The project will focus on South, South East, Central and West Asian languages.
Act as an umbrella with Asian Language Resources (ALR).
Accommodate secure and sustainable UTF-based encoding.
Take advantage of existing organizations such as the Language Observatory Project (LOP, TCL).
Corpus collection from the web using LO’s crawler/language identifier.
Language resources originating from Japan, with parallel corpora available in other languages (UDHR, Oshin, One Straw Revolution, etc.).
Multilingual Terminology Dictionary.
Information standards for language corpus building.
Liaison with international organizations such as UNESCO, UDHR, etc.
Information resource sharing web site (www.language-resource.net).
Asian Academy of Languages …?

Thank you, Danke schön, Merci, Gracias, Obrigado, Grazie, Danke, Spasibo, Ευχαριστώ

Language Presence in Asian Countries

(The exact number of languages may never be determined.)

Language Diversity

(Half of the world’s languages are spoken in only eight countries.)

Asian Language Recognition

Country      Number of Languages  Country Population  Official or National Languages
Indonesia                    742         245,452,739  Indonesian
India                        427       1,095,351,995  Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Marwari, Nepali, Oriya, Panjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu
China                        241       1,313,973,713  Chinese, Zhuang, Uighur, Hmong, Hani
Philippines                  180          89,468,677  Filipino, English
Malaysia                     147          24,385,858  Malay
Nepal                        125          28,287,147  Nepali, Gurung, Tamang
Myanmar                      109          47,382,633  Burmese
Vietnam                       93          84,402,966  Vietnamese
Laos                          82           6,368,481  Lao
Thailand                      75          64,631,595  Thai
Iran                          74          68,688,433  Arabic, Farsi
Pakistan                      69         165,803,560  Urdu, Panjabi, Sindhi, English
Afghanistan                   45          31,056,997  Dari, Pashto
Bangladesh                    38         147,365,352  Bengali
Bhutan                        24           2,279,723  Dzongkha
Iraq                          23          26,783,383  Arabic, Kurdi
Cambodia                      19          13,881,427  Khmer
Brunei                        17             379,444  Malay, English
Mongolia                      12           2,832,224  Halh Mongolian
Sri Lanka                      8          20,222,240  Sinhala, Tamil, English

Asian Language Resources Network - Agenda

Will cover:
9 Language Families
62 Languages
4 Asian Regions (West, Central, South & South East Asia)
42 Countries
18 Major Scripts