Isalin Translate eWika: Towards the Digitalization of Philippine Languages Charibeth K. Cheng...


Page 1: Isalin Translate eWika: Towards the Digitalization of Philippine Languages Charibeth K. Cheng (koc@dlsu.edu.ph) DLSU, College of Computer Studies Natural.

Isalin Translate

eWika: Towards the Digitalization of Philippine Languages

Charibeth K. Cheng (koc@dlsu.edu.ph)

DLSU, College of Computer Studies

Natural Language Processing Research Lab

Page 2:

MT Research in RP

• started in 1993 at UP Los Baños

• Dr. Rachel Roxas and Allan Borra: grammar-based

• started at DLSU in 2004: hybrid approach

Page 3:

ENG-FIL MT System Project

• 3-year project

• started 2005

• funded by DOST-PCASTRD

• composition:
– 6 faculty members of the College of Computer Studies
– 15 computer science majors
– assisted by the Filipino Dept and the Dept of English & Applied Linguistics of DLSU-M

Page 4:

Architectural Design of the Program

Language Resources:

• Lexicon (electronic dictionary)

• Morphological Analyzer & Generator

• Part-of-Speech tagger

• Grammar

• Corpus (Tagged)

[Figure: architecture diagram. Components: User Interface, Translator Engine, MT: Rule-based, MT: Example-based, Output Modeller. Input: Source Text. Output: Target Text.]

Page 5:

Rule-Based approach

Apply translation rules

The boy ate apples.

Kumain ng mga mansanas ang batang lalaki.

Where do we get the translation rules?

Page 6:

Example-Based

• Learn the rules from examples

The boy ate apples.
A = The, B = boy, C = ate, D = apples

Kumain ng mga mansanas ang batang lalaki.
C = Kumain, D = mga mansanas, A = ang, B = batang lalaki

Rule Learned:

A B C D → C ng D A B
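
The rule above can be read as a reordering template induced from one aligned sentence pair. The following is a minimal sketch of that idea, assuming the chunk alignment is supplied by hand; the function name `learn_rule` and the index-based alignment format are hypothetical and are not taken from the project.

```python
def learn_rule(source_chunks, target_chunks, alignment):
    """Induce a reordering template from one aligned example.

    source_chunks: source-language chunks, in order
    target_chunks: target-language chunks, in order
    alignment: maps a target chunk index to the index of the source
               chunk it translates; unaligned target chunks (such as
               the marker "ng") are copied into the template literally.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    source_pattern = [letters[i] for i in range(len(source_chunks))]
    target_pattern = []
    for j, chunk in enumerate(target_chunks):
        if j in alignment:
            target_pattern.append(letters[alignment[j]])
        else:
            target_pattern.append(chunk)
    return source_pattern, target_pattern


# Hand-aligned example from the slide (the alignment is an assumption).
source = ["The", "boy", "ate", "apples"]
target = ["Kumain", "ng", "mga mansanas", "ang", "batang lalaki"]
alignment = {0: 2, 2: 3, 3: 0, 4: 1}

src_pat, tgt_pat = learn_rule(source, target, alignment)
print(" ".join(src_pat), "->", " ".join(tgt_pat))
```

In this sketch the alignment is the hard part; a real example-based system would have to obtain it from the lexicon and corpus rather than from a hand-written dictionary.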

Page 7:

Using the rule

A B C D → C ng D A B

The mother cooked fish.
A = The, B = mother, C = cooked, D = fish

Nagluto ng isda ang nanay.
C = Nagluto, D = isda, A = ang, B = nanay

Page 8:

Using the rule

A B C D → C ng D A B

The mother went home.
A = The, B = mother, C = went, D = home

Umuwi ng bahay ang nanay.
C = Umuwi, D = bahay, A = ang, B = nanay

Page 9:

Limitation of a Rule

A B C D → C ng D A B

The boy ate the fish.

A B C D
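
A minimal sketch of applying the rule A B C D → C ng D A B, and of one way its limitation can surface. It assumes the rule is matched token by token and that a toy word-for-word lexicon supplies the Filipino forms; `LEXICON` and `apply_rule` are hypothetical names for illustration. Under these assumptions the four-slot pattern simply fails to match the five-token sentence "The boy ate the fish."

```python
RULE_SOURCE = ["A", "B", "C", "D"]
RULE_TARGET = ["C", "ng", "D", "A", "B"]

# Toy word-for-word lexicon (hypothetical; a real system would rely on
# the lexicon and the morphological generator to choose these forms).
LEXICON = {
    "the": "ang",
    "mother": "nanay",
    "boy": "batang lalaki",
    "cooked": "nagluto",
    "ate": "kumain",
    "fish": "isda",
}


def apply_rule(sentence):
    """Translate with the template, or return None if it does not apply."""
    tokens = sentence.rstrip(".").lower().split()
    if len(tokens) != len(RULE_SOURCE):
        return None  # the pattern has four slots; this sentence does not fit
    if any(tok not in LEXICON for tok in tokens):
        return None  # a word is missing from the toy lexicon
    # Bind each slot letter to the translation of the token in that position.
    bindings = {slot: LEXICON[tok] for slot, tok in zip(RULE_SOURCE, tokens)}
    # Slot letters are replaced by their bindings; literals like "ng" pass through.
    words = [bindings.get(item, item) for item in RULE_TARGET]
    out = " ".join(words)
    return out[0].upper() + out[1:] + "."


print(apply_rule("The mother cooked fish."))
print(apply_rule("The boy ate the fish."))
```

The first call reproduces the slide's translation; the second returns `None` because the extra determiner leaves the sentence outside the learned pattern, so more examples (or more general rules) would be needed to cover it.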

Page 10:

Results of the MT Engine

• Qualities of a Good Translation
– Clarity: 3.3
– Accuracy: 3.2
– Naturalness: 2.8

• highest possible score: 5

• 100 respondents (5 linguists)

Page 11:

Challenge!

• Language resources
– Quality of translation depends on them.
– Built from almost non-existent digital forms
– manual vs. automatic construction

Page 12:

Lexicon

• Diksyunaryo ng Wikang Filipino

• automatic construction (AeFLEX):
– accuracy rate: 57%

• Currently contains more than 30,000 entries

• Challenge: Lexical resources
– translation documents
– part-of-speech tagger

Page 13:

Morphological Analyzer and Generator

• Dictionary is incomplete

• Create software that:
– analyzes: determines the root word
– generates: produces the inflected word

Given: eating -> eat -> kain -> kumakain

• Challenge: Lexical resources
– lexicon
– part-of-speech tagger
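
The chain eating -> eat -> kain -> kumakain can be sketched as three small steps: analyze the English word, look up the root, and generate the Filipino form. This is only an illustration under strong assumptions: the suffix-stripping rule, the two-entry `ROOT_LEXICON`, and the function names are hypothetical, and the generator covers only -um- verbs whose consonant-initial roots begin with a single consonant followed by a vowel.

```python
VOWELS = "aeiou"

# Hypothetical toy lexicon mapping English roots to Filipino roots.
ROOT_LEXICON = {"eat": "kain", "drink": "inom"}


def analyze_english(word):
    """Recover the root by stripping a progressive -ing suffix (toy rule).

    This ignores doubled consonants ("running") and dropped e ("making").
    """
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    return word


def generate_um_imperfective(root):
    """Generate the -um- imperfective: reduplicate the first syllable, add -um-.

    Assumes the root starts with a vowel, or with one consonant plus a vowel.
    """
    if root[0] in VOWELS:
        # vowel-initial root: um + V + root, e.g. inom -> umiinom
        return "um" + root[0] + root
    # consonant-initial root: C + um + V + root, e.g. kain -> kumakain
    return root[0] + "um" + root[1] + root


def translate_progressive(word):
    """Analyze the English word, look up the root, generate the Filipino form."""
    root = analyze_english(word)
    return generate_um_imperfective(ROOT_LEXICON[root])


print(translate_progressive("eating"))
print(translate_progressive("drinking"))
```

Not every Filipino verb takes -um-, so a fuller generator would need the affix class of each root from the lexicon, which is one reason the slide lists the lexicon as a dependency.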

Page 14:

Part-Of-Speech Tagger

• automatic association of parts-of-speech to words in a document
– Can? kaya (be able to) vs. lata (tin can)
– Baba? chin or go down

• Challenge: Lexical resources
– corpora
– lexicon
– morphological analyzer
– grammar
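
To illustrate why the tag matters for translation, here is a minimal sketch that disambiguates "can" with a single hand-written context rule: a preceding determiner suggests the noun (lata), otherwise the modal (kaya). The rule, the tag names, and the `GLOSS` table are assumptions for illustration only; the slides do not describe the tagger's actual method, and practical taggers are usually trained on a tagged corpus.

```python
DETERMINERS = {"a", "an", "the", "this", "that"}

# Hypothetical sense table: (word, tag) -> Filipino gloss from the slide.
GLOSS = {("can", "MODAL"): "kaya", ("can", "NOUN"): "lata"}


def tag_can(tokens, i):
    """Tag the word at position i as NOUN after a determiner, else MODAL."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    return "NOUN" if prev in DETERMINERS else "MODAL"


def gloss_can(sentence):
    """Return (tag, gloss) for the first 'can' in the sentence."""
    tokens = sentence.rstrip(".?!").split()
    i = [t.lower() for t in tokens].index("can")
    tag = tag_can(tokens, i)
    return tag, GLOSS[("can", tag)]


print(gloss_can("I can swim."))
print(gloss_can("Open the can."))
```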

Page 15:

Corpora

• collection of translation-pair documents

• used by the lexicon extractor, the part-of-speech tagger, and example-based MT

• drawn from translations by DLSU English majors, verified by linguists

• consists of 207,000 words

Page 16:

Lexicon Resource Dependency

Corpus

Page 17:

Bringing it home …

• 171 Philippine Languages (SIL)

• No Philippine Corpora

• Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc)

• “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)

Page 18:

eWika: Digitalization of Philippine Languages

• Build the Philippine Corpus

• Build software tools to study or use the corpus

– Across Regions

– Across Forms and Genres

– Across Languages

Page 19:

Across Regions

• Web-based application: GLOBALIZATION
– upload, download, tools

• Contributors (Main players)

• Verifiers

• Server: DLSU-M commits to host the server for the next three years.

• Terms of Use: Research purposes.

Page 20:

Across Languages

• 171 Philippine Languages (SIL List)

• start with 8 major languages
– Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano

• Filipino Sign Language

Page 21:

Across Forms and Genres

• In various forms:
– Text
– Speech
– Video: Filipino sign language

• In various genres:
– Text: literary & creative, essays, news articles, religious, etc.
– Speech: scripted, conversations, etc.
– Video: common signs, regional signs, signs for specific purposes (legal, IT, etc.)
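
One way to picture a corpus organized across regions, languages, forms, and genres is as a collection of records that carry those dimensions as metadata. The sketch below is an assumption for illustration, not the eWika schema: the class `CorpusEntry`, its field names, and the sample file names are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    language: str           # e.g. "Tagalog", "Cebuano", "Filipino Sign Language"
    form: str               # "text", "speech", or "video"
    genre: str              # e.g. "news article", "conversation", "common signs"
    region: str             # hypothetical: where the contribution came from
    content_ref: str        # hypothetical pointer to the uploaded file
    verified: bool = False  # set once a verifier has checked the entry


def count_by(entries, field):
    """Count entries along one metadata dimension (e.g. 'form' or 'language')."""
    return Counter(getattr(entry, field) for entry in entries)


# Made-up sample records, for illustration only.
entries = [
    CorpusEntry("Tagalog", "text", "news article", "NCR", "tl_news_001.txt", True),
    CorpusEntry("Cebuano", "speech", "conversation", "Cebu", "ceb_conv_001.wav"),
    CorpusEntry("Filipino Sign Language", "video", "common signs", "NCR", "fsl_001.mp4"),
]

print(count_by(entries, "form"))
print(count_by(entries, "region"))
```

Keeping the dimensions as plain fields lets the same tools slice the corpus by region, language, form, or genre without a separate store for each.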

Page 22:

• The dream of building electronic, online Philippine language resources and tools

• Many many many major hurdles to overcome

• NEEDED: Language Resources, Tools, & Peopleware