
Barriers to Localisation: a case study of Nepal

Pat Hall

brief background to Nepal

populations – 2009 estimated, millions (Wikipedia, 2009 July 1)

• China: 1,333.2
• India: 1,169.5
• Pakistan: 167.5
• Bangladesh: 162.2
• Iran: 74.2
• Myanmar: 50.0
• Nepal: 29.3
• Afghanistan: 28.2
• Sri Lanka: 20.2
• Bhutan: 0.7
• Maldives: 0.3
• major migrant communities worldwide

28th Sept 2009 LRC 2009

language diversity in South Asia

• INDIA
– over 500 languages
– 22 mother-tongues with > 1 million speakers
– 114 mother-tongues with > 10,000 speakers
– 22 official in India
• English pervasive – only 5%
• language families: Indo-European, Dravidian, Tibeto-Burmese, Austro-Asiatic (Munda)
• languages on the map: Pushtu, Hindi, Tibetan, Urdu, Bangla, Sinhala, Tamil, Kashmiri, Punjabi, Dzongkha, Nepali, Sindhi, Gujarati, Telugu, Malayalam, Kannada, Marathi, Assamese, Oriya
• scholarship: Sanskrit, Farsi
• large diaspora

linguistic history and politics

• Gorkhali (later Nepali) official language since 1768
– mother tongue of ruling elite from Gorkha
– 1951-1990 ‘one nation, one culture, one language’
• around 100 other languages
– Nepali now almost universally spoken – 98%
   • and in neighbouring regions
– 5 other languages with mature written traditions
– 60 completely unwritten
• 1990 enabled other mother-tongue education
– the irony of English
• 2006 interim constitution enabled use of other languages
• 2008 mandating mother-tongue teaching

Writing through the computer

• all South Asian writing derived from ancient Brahmi
– abugida (alphabetic with implicit vowel)
– largely phonetic - but phonology changes with time
– multiple forms (conjuncts)
– maybe only 50 languages across South Asia are written
– around 50% literacy
• new writing systems for unwritten languages
– by linguists, by missionaries (SIL)

[script samples: Tamil, Kannada, Hindi (Devanagari)]
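The abugida and conjunct points can be illustrated with Unicode Devanagari. The following is a minimal sketch, not part of the original slides: the choice of the letters KA and SSA is an assumption made for illustration, and whether the last sequence displays as a single conjunct depends on the font and renderer in use.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA, carries an implicit "a" vowel
SSA = "\u0937"     # DEVANAGARI LETTER SSA
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, suppresses the implicit vowel
SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I, replaces the implicit vowel


def describe(text):
    """Return the Unicode character names that make up `text`."""
    return [unicodedata.name(ch) for ch in text]


syllable_ka = KA                   # "ka": one code point, vowel is implicit
syllable_ki = KA + SIGN_I          # "ki": consonant plus dependent vowel sign
conjunct_kssa = KA + VIRAMA + SSA  # "ksha": three code points, usually drawn as one conjunct

for label, text in [("ka", syllable_ka), ("ki", syllable_ki), ("ksha", conjunct_kssa)]:
    print(label, text, len(text), describe(text))
```

The stored text is a sequence of letters; the visual form (vowel sign placed before the consonant, or a fused conjunct) is left to the renderer.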

an old mechanical typewriter

direct connection between key and type

early computers – based on typewriters

• keyboard
• electrical signals, “codes”
• internal codes: ASCII, ISO 646
• communications/applications
• font table: code -> dotmap
• dot matrix printer

• ASCII / ISO 646: 60-70 Roman letters, 5-10 control, 7+1 bits
• EBCDIC: 8 bits, more characters

Indic “hack font” developers – keyboards, codes, fonts dependent

1. design your keyboard
2. accept whatever internal code arises
3. use tool to create fonts

• keyboard, internal codes, font table (code -> bitmap): one-to-one correspondence
• char: conceptual linkage
• 8 bit BYTE: 128 Roman/control, 128 Indic characters
• communications/applications: disaster, NO communication
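Why hack fonts end in "NO communication" can be sketched as follows. The two font tables are hypothetical: the byte assignments are invented for illustration and do not correspond to any real Nepali font. The sketch shows the same stored bytes reading as different text under two fonts, and a per-font conversion table recovering Unicode text.

```python
# Two hypothetical hack fonts; the byte assignments are invented for illustration.
FONT_A = {0x80: "\u0915", 0x81: "\u092E", 0x82: "\u0932"}  # ka, ma, la
FONT_B = {0x80: "\u0932", 0x81: "\u0915", 0x82: "\u092E"}  # la, ka, ma


def to_unicode(data: bytes, font_table: dict) -> str:
    """Convert hack-font bytes to Unicode; bytes below 0x80 are kept as ASCII."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            # Unknown upper-half bytes become U+FFFD REPLACEMENT CHARACTER.
            out.append(font_table.get(b, "\uFFFD"))
    return "".join(out)


typed_in_a = bytes([0x80, 0x81, 0x82])  # a word typed and saved with font A

as_a = to_unicode(typed_in_a, FONT_A)   # what the author meant
as_b = to_unicode(typed_in_a, FONT_B)   # what a reader with font B would see

print(as_a, as_b, as_a == as_b)
```

A converter of this kind needs one table per font variant, which is why fonts that were "copied widely and then changed slightly" make conversion laborious.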

encodings

• ISCII
– IIT Kanpur, then CDAC Pune
– letters, not glyphs
   • like Arabic
   • needs renderer
– based on Brahmi view
– single code table with language switch to render “same” letter appropriately
– sequence of characters as spoken
• Unicode
– based on ISCII but separate tables for each script
   • still controversial
– new Tamil standard

hardware GIST card, then software (DLL): cost prohibitive
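The ISCII heritage is still visible in how Unicode lays out its Indic blocks: corresponding letters tend to sit at the same offset in each script's block. A minimal sketch, using only the letter KA (the function name and the choice of four scripts are assumptions for illustration; not every offset is assigned in every script, so this is not a general transliterator):

```python
import unicodedata

# Start of each Unicode block, as given in the Unicode code charts.
BLOCK_START = {
    "DEVANAGARI": 0x0900,
    "BENGALI": 0x0980,
    "TAMIL": 0x0B80,
    "KANNADA": 0x0C80,
}


def same_letter_in(ch: str, target_script: str) -> str:
    """Map a character to the same block offset in another script's block."""
    # Each of these blocks is 128 code points wide and starts on a multiple of 128.
    offset = ord(ch) % 0x80
    return chr(BLOCK_START[target_script] + offset)


devanagari_ka = "\u0915"
for script in BLOCK_START:
    ch = same_letter_in(devanagari_ka, script)
    print(script, hex(ord(ch)), unicodedata.name(ch))
```

Where ISCII used one table plus a language switch, Unicode gives each script its own block, so the "same" letter has a different code point per script.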

modern computers – separate issues

• keyboard
• key mapping: key -> code
• internal Unicodes
• communications/applications
• font table: code -> bitmap
• laser or inkjet printer

the essential writing system: letters and characters, not graphics

Nepali language into Computers

• Desk top publishing with PCs
– hack fonts to give some representation of the writing
   • 128 places not enough for everything
   • implicitly defined an internal code
– fonts copied widely and then changed slightly
– result – inability to exchange data
– 1997 attempts to standardise in context of emerging Unicode
   • impressed by ISCII, but not international
– some fonts also developed for other languages
• today, accept Unicode Devanagari for Nepali
– 1997 IDRC and later UNDP, UNESCO funded keyboard drivers and fonts
   • can use Indian open type fonts, though not liked
– still no agreed keyboard layout for Nepal

two key issues in technology projects

Research versus application
• doing ICT4D projects requires new understanding
• what understanding?
– ACM paper claims must deliver computer science research

Technology transfer versus knowledge transfer
• do we do it for our beneficiaries?
• or teach them how to do it for themselves?

Could academic objectives cloud development objectives?

Nepalinux at MPP funded by IDRC

• 1997, 2000 produced Nepali Unicode-driver add on to Windows
• 2004 joined PAN localisation project
• Debian Linux with GNOME desktop, KDE desktop, OpenOffice, ...
– release 1, December 2005
– release 1.1, October 2006
• FOSS critically important here
• then PDAs, mobile phones, OCR
• in parallel Microsoft produced LIP for Nepali
– launched November 2005

what is involved in localisation?

• translation of all text in screens, menus, and help
– available in separate ‘resource’ files
– agree terminology in national committees
– store past translations in ‘translation memory’
• develop spell checker
– need standard spellings
– morphology
• other capabilities that might be needed
– code converters = hack fonts to Unicode
– develop fonts
– text-to-speech
– OCR
– is a grammar checker needed?

need to research into languages, but not into technology
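The ‘translation memory’ idea in the list above can be sketched as a store of past translations that is consulted before a string is translated again. This is a toy illustration, not the tooling used in the project: the class name, the 0.8 similarity cutoff and the placeholder translations are all assumptions.

```python
import difflib


class TranslationMemory:
    """Toy translation memory: exact lookup first, then the closest past source string."""

    def __init__(self):
        self._store = {}

    def add(self, source: str, target: str) -> None:
        self._store[source] = target

    def lookup(self, source: str, cutoff: float = 0.8):
        """Return (matched_source, translation), or None if nothing is close enough."""
        if source in self._store:
            return source, self._store[source]
        close = difflib.get_close_matches(source, list(self._store), n=1, cutoff=cutoff)
        if close:
            return close[0], self._store[close[0]]
        return None


tm = TranslationMemory()
# Placeholder targets stand in for real translations agreed by a terminology committee.
tm.add("Open file", "<translation of 'Open file'>")
tm.add("Save file", "<translation of 'Save file'>")

exact = tm.lookup("Open file")      # exact hit
fuzzy = tm.lookup("Open files")     # near match, offered to the translator for review
miss = tm.lookup("Print preview")   # nothing similar, must be translated from scratch

print(exact, fuzzy, miss, sep="\n")
```

A fuzzy hit is only a suggestion; a translator still has to confirm it, which is where agreed terminology matters.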

NeLRaLEC – funded by the EU
OpenU, LancasterU, GöteborgU, ELRA, MPP, TribhuvanU

for Nepali, even though not (yet) endangered

• Nepali National Corpus
– text (5 M words, 100K parallel) and speech (4hrs+130K words)
• Nepali dictionary – aiming at 100,000 words
– word list and entries for most frequent
• linguistic tools
• fonts for Nepali
– and Maithili
• speech generation
• trials in schools and universities
– localised schools management software
• computational and corpus linguistics course in University

ensures sustainability

renamed Bhasha Sanchar

Nepali National Corpus – what people actually write

advice from Lancaster University

• wanted material from 1991/2 to match corpora in US and UK
– not much material then, just after end of repressive Panchayat era
• wanted in fixed genres with strong western bias
– very little science and technology, 1 science fiction, no westerns (or Kung Fu)
• CORE – aimed at 1 million words
– collected 500 documents, had them typed
– some got lost in civil disturbances and lack of records
• General – opportunistic, at least 4 million words
– already digitised but in hack fonts, needed conversion
– when to stop?
   • English dictionaries used 200 million words or more
– eventually stopped at 13 million words, good enough will do

Nepali Corpus-based Dictionary

first ever in South Asia

• needed software to store entries as produced
– multiple lexicographers
– could not find suitable existing system
• developed own system to meet needs of Nepali and linguists
– 2 software engineers in exploratory development
• developed entries using Oxford’s Xaira concordance system
– linguists still learning, each did things differently
– at around 20,000 entries realised quality problem
• started again, reusing earlier entries or producing new ones
– only reached 8,000, okay for an on-line dictionary, but not in print
• expected linguists to continue after project
– but didn’t because not paid

Nepali Font Development

• hiring font developers
– original person went to UK to do masters
– recruited two graphic designers, arranged training
• short training from Reading in Kathmandu, open to public
– some hazards from civil disturbances
• font development process
– draw many examples of characters needed, including ligatures
– scan into computer, add rules of combination
– test and improve constantly
• eventually agreed needed further training in UK
– two months in Reading
• outputs
– font for Nepali in two versions
– font for Maithili in Mithilakshar style of writing

Nepali TTS speech generation

based on Festival/Festvox concatenative synthesis

• expected help from UK, Roger Tucker and Ksenia Shalapova
– Ksenia refused to travel to Nepal to give training and guidance
• had to help our 2 software engineers in other ways
– spent 1 month in Hyderabad with Kishore and Rajeev Sangal
• recorded voices
– selected words containing all 1764 diphones, chose 1,200 sentences
– had to research speech – eg “Schwa deletion”
• developed TTS through several versions, adding prosody
– reasonable quality, judged reasonably “natural”
• wanted to make screen reader for visually disabled and illiterates
– latest Linux supported this, so in Nepalinux and Ubuntu
– Festival not on Windows, could not find generation engine
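Selecting recording material that contains every diphone is a coverage problem. The sketch below uses a simple greedy strategy on a toy phone inventory and lexicon; both are invented for illustration, and the slides do not say which selection method the project actually used. As a side note, 1764 equals 42 × 42, which would fit an inventory of 42 phones if every ordered pair were counted, though the slides do not state this.

```python
from itertools import product


def diphones(phones):
    """Adjacent phone pairs in one utterance, e.g. [k, a, m] -> {(k, a), (a, m)}."""
    return set(zip(phones, phones[1:]))


def greedy_cover(lexicon, wanted):
    """Pick words greedily until no word in the lexicon adds a new wanted diphone."""
    remaining = set(wanted)
    chosen = []
    while remaining:
        best = max(lexicon, key=lambda w: len(diphones(lexicon[w]) & remaining))
        gain = diphones(lexicon[best]) & remaining
        if not gain:  # nothing left in the lexicon adds coverage
            break
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


# Toy phone inventory and lexicon, invented for illustration only.
inventory = ["k", "a", "m", "l"]
wanted = set(product(inventory, repeat=2))  # 4 * 4 = 16 ordered pairs
lexicon = {
    "kamal": ["k", "a", "m", "a", "l"],
    "mala": ["m", "a", "l", "a"],
    "kam": ["k", "a", "m"],
}

chosen, missing = greedy_cover(lexicon, wanted)
print(chosen, len(wanted) - len(missing), "of", len(wanted), "diphones covered")
```

Diphones left in `missing` are those no word in the lexicon supplies; in practice they would need extra words or carrier phrases, or may simply not occur in the language.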

evaluation of use of Nepali software

Nepalinux and MS LIP launched late 2005
• 1000 CDs distributed, 500 attended demos
– heard not being used, interesting novelty
• Conducted surveys to find out
– what was really going on, and why?
– grounded theory, analysed with NVIVO
– done by Ganesh Ghimire and Maria Newton

what we found out: ITID journal, Vol 5, Issue 1 - SPRING 2009
– normal social processes were at work
• should we move to other 100 languages of Nepal?

1. first impressions of computers

Problem: hardware not localised
– cabinets labeled in English
– keyboards not marked in local script

Example: delivered computers to schools
– could only give keyboard layout charts
   • not well produced
– techies claimed phonetic keyboards easy
   • but nobody spoke or typed English!
   • ironical – claim systems for non-speakers of English.

Solution: get keyboards marked for Nepali
– keyboards in Thailand are marked in Thai!

2. translation quality

Problem: technical terms not understood
– Nepali terms agreed by committee

Example:
– “I have used all its function because I am a writer. Some of the words like “radditokari” and “anuprayog” seem to be the unusual ones. They look like they have been directly borrowed from Sanskrit and that make Nepali even more difficult than English” – banking officer
– “It’s good for people who are trained in Nepali that don’t have exposure to English. For people like us who have already started to use one system it becomes difficult to switch over” – linguist from Kathmandu

Solution:
– don’t translate, English terminology often bizarre, maybe transliterate
– listen to users, standardise terminology
– will they get used to it anyway?

3. teacher/trainer inertia

• Problem: trainer not familiar with the localised software
• Example: at Sankhu telecentre
– language of instruction - Nepali
– started teaching Nepali interfaces
– switched back to English interfaces
• Solution:
– train the trainers
   • in software as well as hardware
– don’t give option of English interface

4. critical mass of users

Problem: want to get help from others

Examples:
– Nepal’s national library
   • one user, only used to catalogue Nepali books
– government officer taught himself

Solution: need to create critical mass
– train complete organisation
– restrict opportunity to use English
   • example – telephone, video tape formats

Sociology
– “social interaction” Brock and Durlauf
– “social embeddedness”

5. language shift and social mobility

Problem: people change to dominant language
– see economic advantage in English or Nepali or ...

Example:
– “Yeah I liked, but when we used Nepali windows that time I feel we are going to forget English language” – telecentre social mobilizer
– “I have daughter and I will not ask my daughter to use the Nepali interface because I want my daughter to be good in English” – computer academic

Solution: accept that for some languages
– not worth localising the OS, but enable content
– supporting the language may reduce the shift

Sociology
– “sanskritisation” caste mobility - Srinivas

6. software eco-system

Problem: software works with other software
– cannot use Unicode for information exchange!
– other software is not Unicode compliant, use hack fonts
– but can do Desk Top Publishing

Example: journalists
– “But it has font problem. In publication houses mainly they use preeti, kantipur so it’s not worthy for nepali computing” – journalist

Solution: raise an Open Source project to produce compliant publishing software.

7. support all communities

Problem: language communities very small
– commercial development not viable

Example: Lohrung Rai in Nepal
   • subject of socio-linguist study by Jens Allwood, Yogendra Yadava, and Bhim Regmi
– 1,207 ‘mother-tongue’ speakers
– language not yet written, want Roman system.

Solution:
– localise for local lingua franca
   • create writing close to that of lingua franca
– only enable content in local language

8. cost of entry for new language

Problem: Translation cost can be significant

Example: GNOME interface for Linux
– 40,000 ‘strings’ = 500,000 words approx
– grows by 5 to 10% a distribution
– at 1,500 words per day, this takes 300 days
   • 30 days for a new release

Solution: avoid manual translation
– machine translation?
– be smart technically
– language generation from model of software
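The cost figures above can be checked with a few lines of arithmetic. All inputs are taken from the slide; the variable names are ours. The results land close to the slide's rounded "300 days" and "30 days".

```python
total_strings = 40_000   # GNOME interface strings (from the slide)
total_words = 500_000    # approximate word count of those strings (from the slide)
words_per_day = 1_500    # translation throughput (from the slide)
growth_low_pct, growth_high_pct = 5, 10  # growth per distribution, in percent (from the slide)

words_per_string = total_words / total_strings
initial_days = total_words / words_per_day
update_days_low = total_words * growth_low_pct / 100 / words_per_day
update_days_high = total_words * growth_high_pct / 100 / words_per_day

print(f"average string length: {words_per_string} words")
print(f"initial translation: about {initial_days:.0f} days")
print(f"each new release: about {update_days_low:.0f} to {update_days_high:.0f} days")
```

Even at the low end, keeping up with each release costs weeks of translator time, which is the argument for avoiding manual translation for small language communities.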

Conclusions

localization must be TOTAL
– all hardware including keyboards
– all software in use by a community
– translations sensitive to language politics
   • otherwise it will be rejected

entry cost must be minimal
• cheap and easy to localize for a new language
– base interaction by language generation
– part of s/w development process
– sometimes only enable content

next step – harmonised writing and encoding
• then machine (assisted) translation