Current Status of Machine Translation
Research in Vietnam
Towards Asian wide multi language machine translation project
Content
Status of machine translation research
Dictionaries and corpora
Activities in organization and experts
Others
Overview: Main machine translation groups
G1. National Center for Technology Progress (Dr. LE Khanh Hung)
Experience: Rule-based approach to English-Vietnamese MT systems. These are the only commercial MT systems in Vietnam (EVTRAN 3.0, VETRAN 3.0).
G2. Univ. of Natural Sciences, VNUHCM (Dr. DINH Dien)
Experience: Transfer-based MT using BTL (Bitext Transfer Learning) for an English-Vietnamese MT system. Experience in building dictionaries and bilingual corpora.
G3. HCM Univ. of Technology, VNUHCM (Prof. PHAN Thi Tuoi)
Experience: Active since 1989 with various trials. Statistical approach to Vietnamese-English translation (since 2002); phrase-based approach to English-Vietnamese translation and phrase extraction from the Penn Treebank (since 2003).
G4. JAIST (Mr. LE Anh Cuong)
Experience: Previously, a rule-based approach to an English-Vietnamese MT system; the system is completed but not yet published. Currently, the focus is on statistical MT and on improving the rule-based MT system using statistical techniques.
G1 (Nacentech): About the group
People: 12 members, leader: Dr. LE Khanh Hung
2 Ph.D. candidates on NLP
3 masters
Other: 6 engineers and B.A.
Institution: National Center for Technology Progress, MOST, C6 Thanh Xuan Bac, Hanoi, Email: [email protected]
G1 (Nacentech): Approach
[Figure: G1 translation pipeline from Source Text to Target Language. Stages: Morphological Analysis, Phrase Analysis, Semantic Linking, Translation, Phrase Synthesis, Morphological Synthesis. Intermediate structures: Lexical Lattice, Dependency Trees, Semantic Lattice. Resources: Lexical Rules, Grammar Rules.]
G1 (Nacentech): Current status
MT research group was established in 1989
1990: Started with an English-to-Vietnamese MT system
Transfer technology
Dictionary with 12,000 entries, 500 grammar rules
1997: EVTRAN 1.0
2,000 grammar rules, 60,000 entries
1999: EVTRAN 2.0
3,000 grammar rules, 250,000 entries
Commercial software in Vietnam
Listed in Compendium of Translation Software (EAMT)
2005: EVTRAN 3.0
Automatic source language identification
10,000 grammar rules, 530,000 entries
Phrase-sensitive grammar and dependency trees
Γ = (Σ, ℵ, S, E, ℘), where S = {S1, …, Sn} is the set of start symbols; E is the set of semantic lattices defined in (Σ ∪ ℵ)*, each element of E being a list of phrases; and ℘ is the rule set defined in (E × E).
G1 (Nacentech): Interlingual MT (plan)
[Figure: languages connected through interlinguas. Vietnamese side: Austro-Asiatic, Mon-Khmer, Viet-Muong, then Vietnamese and Muong. English side: Germanic, West Germanic, English. Japanese is also shown.]
G2 (UNS-VNUHCM): About the group
People
Dr. Dinh Dien, PhD in CS and in Linguistics (leader)
Dr. Ho Bao Quoc, Prof. Dong Thi Bich Thuy
5 MS students in CS + 1 Ph.D. student
Institution
Department of Information Technology, Univ. of Natural Sciences, Vietnam National Univ. in HCMC
227 Nguyen Van Cu, HCMC
Email: [email protected]
G2 (UNS-VNUHCM): Approach
BTL model (Bitext Transfer Learning) for English-Vietnamese MT:
start from the annotated EVC (bitext),
automatically extract "transfer rules" (lexical and structural) with a learning algorithm (fTBL),
then apply those rules to tag the target language (Vietnamese sentence).
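The rule-application step can be illustrated with a minimal sketch in the spirit of transformation-based learning. The rule format, the tag names, and the sample rule are hypothetical and are not taken from the BTL system or from fTBL; the sketch only shows how an ordered list of transformation rules rewrites a baseline tagging.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical transformation rule: change from_tag to to_tag
    when the previous token carries prev_tag (None means no condition)."""
    from_tag: str
    to_tag: str
    prev_tag: Optional[str] = None


def apply_rules(tags: List[str], rules: List[Rule]) -> List[str]:
    """Apply each rule in order, scanning left to right.

    A change takes effect immediately, so later positions see it.
    """
    out = list(tags)
    for rule in rules:
        for i in range(len(out)):
            if out[i] != rule.from_tag:
                continue
            prev = out[i - 1] if i > 0 else None
            if rule.prev_tag is None or prev == rule.prev_tag:
                out[i] = rule.to_tag
    return out


# Baseline tagging for "the display works" (illustrative tags only).
baseline = ["DET", "VERB", "VERB"]
rules = [Rule(from_tag="VERB", to_tag="NOUN", prev_tag="DET")]
print(apply_rules(baseline, rules))
```

In a learning setting, the rules themselves would be selected by how much each one reduces the error against an annotated corpus; that training loop is omitted here.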
Group 2: BTL-based MT
[Figure: EVT system architecture. Labels: English; Morphology, Grammar, Semantics (linguistic annotating); Unannotated parallel corpus; Word Align; Baseline Tagging; Annotated parallel corpus; KFTBL; Transformation Rules; Transfer; Generation; Post-editing; VNese.]
Group 2: Current status
The group has developed two machine translation systems: EVT 1.0 and VCLEVT 2.0
EVT 1.0
- A rule-based MT system
- Evaluated by PC World Vietnam Magazine in 1998: 65% for simple sentences; 50% for normal sentences; and 35% for complex sentences
VCLEVT 2.0
- Uses the BTL model
- Learns automatically from a bilingual corpus
- Gains better translation quality on informatics documents
G3 (HCMUT): About the group
People
Prof. PHAN Thi Tuoi (leader), Prof. CAO Hoang Tru
5 Ph.D. candidates and master students
Institution
HoChiMinh City University of Technology, VNUHCM
268 Ly Thuong Kiet Street, District 10, HoChiMinh City, Vietnam
Email: [email protected]
G3 (HCMUT): Research activities
Syntax-based English-Vietnamese translation for simple sentences (1989)
Vietnamese word segmentation using corpus and statistical models (2002)
Vietnamese POS tagging by context and style
Text alignment and statistical models (2004)
Statistical model for Vietnamese-English MT (since 2002)
English-Vietnamese MT based on lexicon and phrase extraction from treebank (since 2003)
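The statistical word segmentation item in the list above can be sketched as follows. Vietnamese is written with spaces between syllables, so the task is to group syllables into words. The unigram probabilities below are invented for illustration, and the dynamic-programming search is a generic technique, not necessarily the model used by the HCMUT group.

```python
import math
from typing import Dict, List

# Hypothetical unigram word probabilities (illustrative values only).
WORD_PROB: Dict[str, float] = {
    "máy tính": 0.02,   # "computer"
    "máy": 0.01,
    "tính": 0.005,
    "hiển thị": 0.02,   # "display"
    "hiển": 0.001,
    "thị": 0.001,
}
UNKNOWN_PROB = 1e-6      # fallback for unseen single syllables
MAX_WORD_SYLLABLES = 3   # longest word considered, in syllables


def segment(sentence: str) -> List[str]:
    """Return the most probable grouping of syllables into words."""
    syllables = sentence.split()
    n = len(syllables)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of the first i syllables
    back = [0] * (n + 1)          # back[i]: start index of the last word in that solution
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_SYLLABLES), end):
            word = " ".join(syllables[start:end])
            if word in WORD_PROB:
                prob = WORD_PROB[word]
            elif end - start == 1:
                prob = UNKNOWN_PROB
            else:
                continue  # unseen multi-syllable candidates are skipped
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the sentence.
    words: List[str] = []
    i = n
    while i > 0:
        words.append(" ".join(syllables[back[i]:i]))
        i = back[i]
    return words[::-1]


print(segment("máy tính hiển thị"))
```

A real system would estimate the probabilities from a segmented corpus and would typically use richer models than unigrams.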
G4 (JAIST): About the group
People: Prof. Ho Tu Bao, Le Anh Cuong (leader), Nguyen Phuong Thai, Dr. Nguyen Le Minh, Phan Xuan Hieu, Nguyen Van Vinh
Institution
JAIST (Japan Advanced Institute of Science and Technology)
1-1 Asahidai, Nomi, Ishikawa 923-1292 Japan
Email: [email protected], [email protected]
G4 (JAIST): Status
History
1999-2003: Developed an English-Vietnamese MT system at an information company in Vietnam. The system is based on the transfer approach.
2004-present: Research on modern technologies in MT: example-based, SMT, phrase-based SMT
- Le Anh Cuong: word sense disambiguation
- Nguyen Phuong Thai: syntactic parsing
- Nguyen Le Minh: example-based approach
- Phan Xuan Hieu: part-of-speech tagging, chunking
- Nguyen Van Vinh: dictionary
G4 (JAIST): Current Translation System
[Figure: components of the current translation system. Processing modules: Format Processing, Tokenizer, Morphological Analyzer, POS Tagger, Named Entity Recognizer, Parser, Word Sense Disambiguation, Transfer and Synthesis, Format Processing. Resources: Grammar Rules; Dictionaries (Common Dictionary, User Dictionary, Domain Dictionary).]
G4 (JAIST): MT improvement direction
Develop a new MT system that combines the advantages of rule-based, example-based, and statistical machine translation
Apply advances in English processing to improve the current MT system
Build powerful and intuitive tools that support users in modifying and editing the dictionary
Content
Status of machine translation research
Dictionaries and corpora
Activities in organization and experts
Others
Overview
Dictionaries and corpora have been developed by each group according to its needs and abilities.
Dictionary
- E-V dictionaries are well done; V-E dictionaries are still under debate
- Model for Japanese EDR-based dictionary
- No J-V, V-J dictionaries on computer
Corpora
- Some work in the past
- New plan for corpora
Japanese EDR-based Dictionary (JAIST)
Model for such a dictionary (in NLP project 2001-2003)
Can benefit from Japanese EDR
- English word dictionary
- Concept dictionary with concept primary illustration and concept explication in Vietnamese
- English co-occurrence dictionary
- EDR Corpus (English Corpus)
Components to be newly done
- Vietnamese word dictionary
- English co-occurrence dictionary
- Bilingual dictionary English-Vietnamese, Vietnamese-English
- EDR Corpus (Vietnamese Corpus)
A Vietnamese MRD (UNS-VNUHCM)
Word Morp POS GRM SEM English Freq Field
máy tính C Ns cnt ART computer 2.221 cpt
hiển thị C Vt Vcom display 1.956 cpt
đường W Ns cnt LIN line 2.087
đường W Nm uncnt CHM sugar 1.987
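As an illustration of how such MRD entries might be used, the sketch below stores three of the rows as records and looks up the senses of the homograph "đường". The record layout and field names are assumptions derived from the column headers, not the actual UNS-VNUHCM format, and the frequency values are copied as printed. The "hiển thị" row is left out because one of its columns is missing in the table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class MrdEntry:
    """Hypothetical record mirroring the table columns."""
    word: str
    morph: str
    pos: str
    grm: str
    sem: str
    english: str
    freq: float
    field: str = ""  # empty when the table gives no field


ENTRIES = [
    MrdEntry("máy tính", "C", "Ns", "cnt", "ART", "computer", 2.221, "cpt"),
    MrdEntry("đường", "W", "Ns", "cnt", "LIN", "line", 2.087),
    MrdEntry("đường", "W", "Nm", "uncnt", "CHM", "sugar", 1.987),
]


def build_index(entries: List[MrdEntry]) -> Dict[str, List[MrdEntry]]:
    """Group entries by headword so homographs share one key."""
    index: Dict[str, List[MrdEntry]] = {}
    for entry in entries:
        index.setdefault(entry.word, []).append(entry)
    return index


def most_frequent_sense(index: Dict[str, List[MrdEntry]], word: str) -> Optional[MrdEntry]:
    """Return the sense with the highest Freq value, or None if the word is unknown."""
    senses = index.get(word, [])
    return max(senses, key=lambda e: e.freq) if senses else None


index = build_index(ENTRIES)
for sense in index["đường"]:
    print(sense.pos, sense.sem, sense.english)
```

Picking the most frequent sense is only a baseline; the SEM and Field columns suggest that context could be used to choose between senses.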
Dictionary for machine translation (example from JAIST group)
<word> take
  <grammar> verb_i
    <semantic-category> [none] </semantic-category>
    <translation-default> có hiệu lực </translation-default>
    <translation>
      cắn câu
      <constraint> subj: {fish} </constraint>
    </translation>
    <translation-pattern>
      $VP[inf]:=take off for $Obj
      <translation-default> "vội vàng đến" $Obj </translation-default>
      <translation>
        "cất cánh đi" $Obj
        <constraint> subj: {plane, aeroplane, airplane, aircraft} </constraint>
      </translation>
    </translation-pattern>
  </grammar>
  . . .
</word>
Entries: 95,000 words, 15,000 phrases, 18,000 translation patterns (lexical rules)
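The way constraints select a translation in the "take" entry above can be sketched as follows. The in-memory layout, the function names, and the matching logic are assumptions made for illustration; the actual JAIST system may represent and evaluate constraints differently.

```python
from typing import Optional

# Hypothetical in-memory form of the "take" entry.
TAKE_ENTRY = {
    "default": "có hiệu lực",
    "translations": [
        {"text": "cắn câu", "subj": {"fish"}},
    ],
    "patterns": {
        "take off for": {
            "default": "vội vàng đến",
            "translations": [
                {"text": "cất cánh đi",
                 "subj": {"plane", "aeroplane", "airplane", "aircraft"}},
            ],
        },
    },
}


def choose(block: dict, subject: str) -> str:
    """Return the first translation whose subject constraint matches, else the default."""
    for option in block["translations"]:
        if subject in option["subj"]:
            return option["text"]
    return block["default"]


def translate_take(subject: str, pattern: Optional[str] = None, obj: str = "") -> str:
    """Translate "take" alone, or a pattern such as "take off for" followed by its object."""
    if pattern is None:
        return choose(TAKE_ENTRY, subject)
    text = choose(TAKE_ENTRY["patterns"][pattern], subject)
    return f"{text} {obj}".strip()


print(translate_take("fish"))
print(translate_take("plane", "take off for", "Hanoi"))
```

The object is passed through untranslated here to keep the sketch short; in a full system it would itself be translated before substitution into the pattern.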
JAIST group's MT system on the Web
Some corpora
Monolingual corpora: VLC (Vietnam Lexicography Centre), UNS-VNUHCM, etc. for Vietnamese
Bilingual corpora: The EVC corpus (UNS-VNUHCM) consists of 400,000 pairs of E-V sentences (approx. 5,500,000 words) in the fields of science and technology (computing, electronics, etc.). The EVC has been partially annotated, semi-automatically, with morphology (word boundaries, lemmas), POS, and sense tags.
VLC: Development of supporting tools
[Figure: Vietnamese Corpus Tools. Components: raw text; spelling check; sentence determination; word segmentation; POS tagging; syntactic annotation (chunking and parsing); Vietnamese corpus (treebank). Dictionaries: monolingual dictionary; multilingual dictionary (labeled "Bilingual dictionary" on the slide).]
VLC: Capacity and realization
Available tools
- Word segmentation, POS tagging, deep parsing in TAG formalism, syllable list and morpho-syntactic lexicon, editor for segmentation and tagging revision
- Some utilities for corpus exploration
Ongoing work
- Improvement of available tools
- Improvement of the tagset for POS tagging
- Building syntactic lexicon based on morpho-syntactic lexicon
- Collection of a balanced corpus following the above criteria
Human resources
- About 10 persons working on the project
Content
Status of machine translation research
Dictionary and corpora
Activities in organization and experts
Others
VLSP national project 2006-2010
National project with participation of more than ten research groups (all active groups on VLSP)
Leaders: Prof. Ho Tu Bao (JAIST & IOIT) and Assoc. Prof. Luong Chi Mai (IOIT, VAST)
Objectives:
1. Build and develop several typical products for VLSP for public end-users.
2. Build and develop indispensable resources and tools for the VLSP development
Content: Basic research (1/3)
Basic research on methods for processing Vietnamese language and speech.
Applied research to adapt methods and technologies developed for other languages, as well as advanced techniques, to Vietnamese language and speech.
[Diagram: computation methods for VLSP; typical products for the end-users; resources and tools for VLSP]
Content: Products for end-user (2/3)
P1: VnVoice system for VN synthesis
P2: Embedded speech synthesis and recognition system
P3: Large lexicon based speech recognizer
P4: Domain-specific English-Vietnamese translation system
P5: IREST system for information retrieval, extraction, summarization, and translation
P6: Vietnamese spelling checker
Content: Resources and tools (3/3)
P7: Basic resources for speech
- Corpus for speech synthesis and recognition
P8: Three basic resources for language
- P81: Vietnamese MRD
- P82: Annotated corpora (mono, multi)
- P83: Entities (KB) (Rules of VN grammar)
P9: Five basic tools for language
- P91: Spelling checker
- P92: Vietnamese word segmentation
- P93: Vietnamese POS tagger
- P94: Vietnamese chunker
- P95: Vietnamese syntax analyzer
Recent events
VLSP workshop, 29 March 2005, Hanoi
VLSP workshop, 21 May 2005, Hanoi
VLSP workshop, July 2005
VLSP meeting, 21-25 Nov. 2005, JAIST
Content
Status of machine translation research
Dictionary and corpora
Activities in organization and experts
Others
Current and future demands
Current needs for development: tourism, economy, communication, etc.
Demand is increasing for both human translation and automatic translation, especially translation on the Internet.
Lack of translation experts, especially for foreign languages other than English that are important for Vietnam, such as Japanese and Chinese.
Demand for translation will increase in the future as global integration grows.