Survey of Current Speech Translation Research†

Ying Zhang

Language Technologies Institute, Carnegie Mellon University 5000 Forbes Ave. Pittsburgh, PA 15213

[email protected]

† This paper is the final report for the course 11-733 “Multilingual Speech-to-Speech Translation Seminar”, Fall 2003. Related information can be found at: http://projectile.is.cs.cmu.edu/research/public/talks/speechTranslation/facts.htm and http://projectile.is.cs.cmu.edu/research/public/talks/speechTranslation/talk.ppt

1 Introduction

There are no commercial speech translation systems in operation to date, but a number of industrial and government projects are exploring their feasibility. The feasibility of speech translation depends largely on the scope of the application, and ranges from applications that are well within range, such as voice activated dictionaries, to those that will remain impossible for the foreseeable future (e.g., unrestricted simultaneous translation). Current research therefore aims at milestones between these extremes, namely limited-domain speech translation. Such systems restrict what the user can talk about, and hence constrain the otherwise daunting task of modelling the world of discourse. Nevertheless, such systems could be of practical and commercial interest, as they could be used to provide language assistance in common yet critical situations, such as registering for conferences, booking hotels, airlines, car rentals and theatre tickets, ordering food, getting directions, scheduling meetings, or in medical doctor-patient situations. If successful, it may also be possible to combine such domains to achieve translation in a class of domains (say, travel) [Waibel 96].

We surveyed about 20 major speech translation systems. These projects vary in technical aspects: they use different automatic speech recognition (ASR) methods, different machine translation (MT) approaches, and different coupling strategies between ASR and MT. The surveyed projects also differ in management aspects: they are of different sizes, ranging from single-institution research efforts to large-scale international cooperations such as the Verbmobil project. They are funded mainly by governments, e.g., DARPA and the EU. Compared with other research areas in computer science, industrial contributions are quite limited. In the following sections, we first describe the major speech translation projects. Since it is not easy to classify all these projects using a single criterion, we organize the discussion around one of each system’s most important features. Then we compare the projects across all features.

2 Speech Translation Projects

2.1 Voice Activated Phrase Lookup

Voice activated phrase lookup systems are not true speech translation systems by definition. A typical voice activated phrase lookup system is the Phraselator system [Sarich 2001]. The Phraselator (Figure 1) is a one-way device that can recognize a set of pre-defined phrases and play a recorded translation. This device can be ported easily to new languages, requiring only a hand translation of the phrases and a set of recorded sentences. However, such a system severely limits communication as the translation is one way, thus reducing one party’s responses to simple pointing and perhaps yes and no.

Figure 1. Phraselator


2.2 Loosely Coupled Speech Translation Systems

A typical loosely coupled speech-to-speech translation system contains three main components: an automatic speech recognizer (ASR), a machine translation engine (MT) and a text-to-speech synthesizer (TTS). In a loosely coupled system, these three components are connected sequentially: ASR converts the user’s speech to text in the source language, MT then translates the source text into the target language, and finally TTS synthesizes speech from the target-language text. Usually there is no feedback of information from MT to ASR or from TTS to MT in a loosely coupled system.
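
To make the data flow concrete, here is a minimal sketch of such a pipeline in Python, assuming hypothetical component interfaces (`recognize`, `translate`, `synthesize`) that merely stand in for real engines such as Sphinx, an MT engine and Festival.

```python
class LooselyCoupledTranslator:
    """Sequential ASR -> MT -> TTS pipeline with no backward feedback."""

    def __init__(self, recognize, translate, synthesize):
        self.recognize = recognize      # speech -> source-language text
        self.translate = translate      # source-language text -> target-language text
        self.synthesize = synthesize    # target-language text -> waveform (here: a string)

    def run(self, audio):
        # Each stage consumes only the single best output of the previous stage.
        source_text = self.recognize(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)

# Toy stand-ins so the sketch runs end to end.
pipeline = LooselyCoupledTranslator(
    recognize=lambda audio: "ich moechte ein zimmer reservieren",
    translate=lambda text: "i would like to reserve a room",
    synthesize=lambda text: f"<waveform for: {text}>",
)
print(pipeline.run(b"...audio bytes..."))
```

The point of the sketch is the strictly sequential coupling: each stage sees only the 1-best output of the previous one, so recognition errors propagate unchecked into translation.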

2.2.1 MASTOR (IBM)

MASTOR [Liu 2003] (Multilingual Automatic Speech-To-Speech TranslatOR) is IBM’s highly trainable speech-to-speech translation system, targeting conversational spoken language translation between English and Mandarin Chinese for limited domains. Figure 2 depicts the architecture of MASTOR. The speech input is processed and decoded by a large-vocabulary speech recognition system. The transcribed text is then analyzed by a statistical parser for semantic and syntactic features. A sentence-level natural language generator based on maximum entropy (ME) modelling [Zhou 2002] is used to generate sentences in the target language from the parser output. The resulting target-language sentence is synthesized into speech by a high-quality text-to-speech system.

Figure 2. The architecture of MASTOR

2.2.2 DIPLOMAT/Tongues (CMU)

The DIPLOMAT system [Black 2002] was developed as a speech-to-speech translation system that could be readily adapted to new languages. It was designed to run on a small platform, such as a laptop or wearable computer. These requirements introduce constraints different from those of larger, more general speech-to-speech translation systems such as those in C-STAR. The later TONGUES project built a prototype speech-to-speech translation system designed to run on a sub-notebook computer, for use by US Army chaplains communicating with local people about refugee issues and similar topics. This prototype was tested in the field in Zagreb in April 2001; see [Frederking 2002] for a full description of that evaluation. In DIPLOMAT and TONGUES it was not just the end system that was being developed but also the processes involved in building the components, so that porting to new languages and domains requires less effort. DIPLOMAT developed basic versions in Croatian, Korean, Spanish and Haitian Creole. TONGUES targeted Croatian alone, though almost all aspects of the models were rebuilt for that system. The CMU Sphinx II recognition engine [Huang 92] was used for DIPLOMAT and TONGUES. Sphinx II is a semi-continuous HMM-based recognition engine with relatively low computational requirements. The Festival Speech Synthesis System was used as the TTS module. The translation engine was a Multi-Engine MT (MEMT) system, whose primary engines were an Example-Based MT (EBMT) engine and a bilingual dictionary/glossary. Carnegie Mellon’s EBMT system uses a “shallower” approach than many other EBMT systems; examples to be used are selected based on string matching and inflectional and other heuristics, with no deep structural analysis. The MEMT architecture uses a trigram language model of the output language to select among competing partial translations produced by several engines. In this system it is used primarily to select among competing (and possibly overlapping) EBMT translation hypotheses.


The translated chaplain dialogs provided some of the training data. Pre-existing parallel English-Croatian corpora were also used. An additional finite-state word-reordering mechanism was added to improve the placement of clitics in Croatian.
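
The following Python sketch illustrates the selection principle. It is a simplification, not the actual MEMT search: the real system selects among partial, possibly overlapping hypotheses over a lattice, whereas this toy merely rescores whole candidate strings with an add-one-smoothed trigram language model; the corpus, hypotheses and smoothing choices here are all assumptions for illustration.

```python
import math
from collections import Counter

def ngram_counts(corpus, n):
    """Count n-grams over tokenized sentences padded with boundary tags."""
    counts = Counter()
    for sent in corpus:
        toks = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts

def trigram_logprob(sent, tri, bi, vocab_size):
    """Add-one smoothed trigram log-probability of a tokenized sentence."""
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    logp = 0.0
    for i in range(2, len(toks)):
        history = tuple(toks[i - 2:i])
        num = tri[history + (toks[i],)] + 1
        den = bi[history] + vocab_size
        logp += math.log(num / den)
    return logp

# Toy target-language corpus standing in for the trigram LM training data.
corpus = [["the", "hotel", "is", "near", "the", "station"],
          ["the", "room", "is", "near", "the", "station"]]
tri = ngram_counts(corpus, 3)
bi = ngram_counts(corpus, 2)
vocab = {w for s in corpus for w in s} | {"</s>"}

# Competing hypotheses, e.g. from the EBMT engine and the glossary engine.
hypotheses = [["hotel", "the", "near", "station", "is"],
              ["the", "hotel", "is", "near", "the", "station"]]
best = max(hypotheses, key=lambda h: trigram_logprob(h, tri, bi, len(vocab)))
print(" ".join(best))   # the fluent hypothesis wins under the target LM
```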

2.2.3 Matrix (ATR)

The Spoken Language Translation Research Laboratories of the Advanced Telecommunications Research Institute International (ATR) have five departments. Each department focuses on a certain area of speech translation:

o Department 1: robust multilingual ASR;
o Department 2: integrating ASR and NLP to make SST usable in real situations;
o Department 3: corpus-based spoken language translation technology, constructing a large-scale bilingual database;
o Department 4: Japanese-to-English translation for monologues, e.g., simultaneous interpretation in international conferences;
o Department 5: TTS.

The speech translation system at ATR is called ATR-MATRIX (ATR’s Multilingual Automatic Translation System for Information Exchange) [Takezawa 98]. This system can recognize natural Japanese utterances such as those used in daily life, translate them into English, and output synthesized speech. The system runs on a workstation or a high-end PC and achieves nearly real-time processing. Unlike its predecessor ASURA [Morimoto 93], ATR-MATRIX is designed for spontaneous speech input, and it is much faster. The current implementation deals with a hotel room reservation task. ATR-MATRIX adopts a cooperative, integrated language translation model. Figure 3 shows the system configuration. The system consists of a speech recognition subsystem, a language translation subsystem, a speech synthesis subsystem (CHATR [Campbell 96]) and a main controller. Each subsystem is connected to the main controller via its own satellite controller, and the main controller interacts with all of them in a uniform way using a standard packet message format.

Figure 3. ATR-MATRIX system architecture

Some key features of the MATRIX system are:

o Real-time speech recognition using speaker-independent, phoneme-context-dependent acoustic models and a variable-order N-gram language model. The ASR vocabulary is about 2,000 words.

o Robust language translation that can deal with speech recognition results. MATRIX can handle various spoken-language expressions because it uses not only sentence structure but also examples, i.e., translation pairs. Furthermore, there is a partial translation mechanism for accepting speech recognition results that contain recognition errors. Two heuristics are used here (a minimal scoring sketch is given at the end of this section):

1. Constituents similar to translation examples are preferred. Similarity is calculated using a semantic distance based on translation pairs represented by patterns.

2. Larger constituents are preferred. The size of a constituent is measured by the number of word sequences in the constituent.

o Personalized speech synthesis.

Although the ATR-MATRIX system claims to adopt a cooperative, integrated translation model in which ASR, MT and TTS all connect to a main controller, it is not clear whether there is any information exchange from MT to ASR or from TTS to MT. Based on our current understanding of ATR’s publications, the MATRIX system is also a loosely coupled speech translation system.
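
A minimal sketch of the two partial-translation heuristics mentioned above is given below. The scoring is a toy assumption: word overlap stands in for ATR’s semantic distance over translation patterns, and constituent size is simply the number of words.

```python
def similarity(constituent, example):
    """Toy stand-in for semantic distance: fraction of shared words (Jaccard)."""
    a, b = set(constituent), set(example)
    return len(a & b) / max(len(a | b), 1)

def best_constituent(candidates, examples):
    """Heuristic 1: prefer constituents similar to examples; heuristic 2: prefer larger ones."""
    def score(c):
        sim = max(similarity(c, e) for e in examples)   # heuristic 1
        size = len(c)                                    # heuristic 2 (tie-breaker)
        return (sim, size)
    return max(candidates, key=score)

examples = [["i", "would", "like", "a", "room"], ["for", "two", "nights"]]
candidates = [["a", "room"], ["i", "would", "like", "a", "room"], ["like", "uh"]]
print(best_constituent(candidates, examples))
```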


2.3 Tight Coupling between ASR and MT using Finite State Transducer

A loosely coupled speech translation system amounts to translating the output of a conventional ASR front-end. This implies that some restrictions present in the translation and in the output language, which could enhance the acoustic search, are not taken into account. In this sense, it is preferable to integrate the translation model within a conventional ASR system and carry out a simultaneous search for the recognised sentence and its corresponding translation. Two research groups, the Polytechnic University of Valencia (UPV) and AT&T Research Laboratories, use finite-state translation models to integrate ASR and MT.
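
The core idea can be sketched as the composition of weighted finite-state transducers: composing a (toy) recognition transducer with a translation transducer yields a single machine over which one search applies both sets of constraints. The construction below is a simplified, epsilon-free composition in Python and does not reflect the actual ATROS or AT&T implementations; the example transducers and state names are invented.

```python
def compose(t1, t2):
    """Compose transducers given as dicts: state -> [(in_sym, out_sym, weight, next_state)]."""
    arcs = {}
    stack = [("q0", "p0")]            # assumed start states of the two machines
    seen = {("q0", "p0")}
    while stack:
        s1, s2 = stack.pop()
        out_arcs = []
        for (a, b, w1, n1) in t1.get(s1, []):
            for (b2, c, w2, n2) in t2.get(s2, []):
                if b == b2:           # intermediate symbols must match
                    out_arcs.append((a, c, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
        arcs[(s1, s2)] = out_arcs
    return arcs

# t1: maps each source word to itself (a stand-in for the recognition lattice);
# t2: maps source words to target words with translation costs.
t1 = {"q0": [("una", "una", 0.1, "q1"), ("la", "la", 0.3, "q1")],
      "q1": [("habitacion", "habitacion", 0.2, "q2")]}
t2 = {"p0": [("una", "a", 0.0, "p1"), ("la", "the", 0.0, "p1")],
      "p1": [("habitacion", "room", 0.0, "p2")]}
for state, out in compose(t1, t2).items():
    print(state, out)
```

Searching the composed machine for the cheapest path yields the recognised sentence and its translation at the same time, which is the integration the text describes.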

2.3.1 EuTrans

The EuTrans project aims at building translation systems for text and speech input in limited-domain applications by 1) using example-based techniques, and 2) tightly integrating translation, syntactic and acoustic constraints into global models (Figure 4).

Figure 4. Example of the integrated architecture for speech translation using FST

The system (Figure 5) is based on the ATROS (Automatically Trainable Recognizer Of Speech) engine. ATROS is a continuous-speech recognition/translation system that uses stochastic finite-state models at all of its levels: acoustic-phonetic, lexical and syntactic/translation. All of these models can be learnt automatically from speech and/or text data, which makes the system easily adaptable to different recognition/translation tasks. The use of finite-state models allows the system to obtain the translation synchronously with the recognition process [Pastor 2001].

Figure 5. EuTrans prototype Architecture

An important drawback of this approach is the large amount of bilingual examples required to learn useful translation models. In order to reduce the severity of this requirement, the EuTrans system trains its models from a categorized bilingual corpus in which words or short phrases (for instance, numbers, dates, or proper names) are replaced by adequate labels, thus simplifying the task that the learning algorithms have to tackle. The categorizing method consists of the following steps (Figure 6; a minimal categorization sketch follows the list):

o Definition of categories: determine the set of categories.

o Corpus categorization: replace words and short phrases in the corpus by their category labels.

o Basic structure model learning: use the categorized corpus to train a model, which will be referred to as initial SST.

o Category modelling: for each category, learn a so-called category SST (cSST).

o Category expansion: expand the edges in the initial SST corresponding to the different categories using their respective cSSTs.
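
As an illustration of the corpus categorization step, the following Python sketch replaces dates and numbers with category labels; the category set, the regular expressions and the label names are assumptions for illustration, not the categories actually used in EuTrans.

```python
import re

# Hypothetical categories; order matters so dates are not consumed by the number rule.
CATEGORIES = [
    ("$DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("$NUM", re.compile(r"\b\d+\b")),
]

def categorize(sentence):
    """Replace words/short phrases by their category labels."""
    for label, pattern in CATEGORIES:
        sentence = pattern.sub(label, sentence)
    return sentence

# Both sides of a bilingual pair are categorized consistently.
pair = ("quisiera una habitacion para 2 noches el 3/8/2004",
        "i would like a room for 2 nights on 3/8/2004")
print(tuple(categorize(s) for s in pair))
```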


Figure 6. General scheme of the treatment of categories in the learning and translation processes in the EuTrans system

2.3.2 AT&T

The AT&T research group takes an approach very similar to UPV’s FST method. In AT&T’s approach, multimodal parsing, understanding, and integration are achieved using a finite-state model [Mohri 1997]. The system subdivides the translation task into a lexical choice phase and a lexical reordering phase (Figure 7). The lexical choice phase is decomposed into phrase-level and sentence-level translation models. A tree-based alignment model is used to obtain a bilingual lexicon. The phrase-level translation is learned based on joint entropy reduction of the source and target, and a variable-length n-gram model (VNSA) is learned for the sentence-level translation. The reordering step uses position markers on a tree structure, but approximates a tree transducer using a string transducer [Bangalore 2000, 2001].

Figure 7. Lexical choice and reordering model

2.4 Interlingua Based Speech Translation Systems

Several interlingua-based speech translation projects are carried out by partners of the C-STAR consortium. The Consortium for Speech Translation Advanced Research (C-STAR) has an informal organizational style. It is not a funding organization, nor does it have funding of its own. Each partner organization obtains its own support and collaborates with the other partners by voluntary association and collaboration. Partner members commit to building complete speech translation systems under a common scenario and to linking them with other partners' systems in periodic international speech translation experiments and evaluations. Partners also participate in working groups and share databases and know-how as needed in mutual exchange. In order to translate spoken dialogues among the seven languages, an interlingua was designed and accepted by all the partners. An interlingua is a computer-readable intermediate language that describes the intention of a spoken utterance independently of any particular language. C-STAR members use an interlingua designed for travel planning [C-STAR, http://www.c-star.org]. There are three main advantages to the interlingua approach (a minimal mapping sketch follows the list):

o Sloppy ramblings in the input, or different variations of the same meaning, are not “translated”; instead, a cleaned-up sentence is produced in the other language without the garble. What matters is the intent of the input utterance, whatever the way it was expressed. Sentences like “I don't have time on Tuesday”, “Tuesday is shot”, “I am on vacation, ...er.. I, I, I, will be on eeeh... taking a vacation....on Tuesday”, can now all be mapped onto the same intended meaning “I am unavailable on Tuesday”, and an appropriate sentence in the output language can be generated. Even culturally dependent expressions can be translated in a culturally appropriate fashion. Thus “Tuesday's no good” could be translated into “Kayoobi-wa chotto tsugo-ga warui”, literally: “As for Tuesday, the circumstance is a little bit bad”.

o It is easier to add additional languages. To add a new language we need to translate only into that interlingua, and generate from that interlingua into our own language. All the other languages come for free. In other words, we would have to build one more translator instead of one for each language.


o We can generate an output utterance in our own language and thereby obtain a paraphrase of the input. This permits the user to verify whether an input utterance was properly understood by the system. This is an important usability feature, since the user otherwise has no way of knowing whether an output translation in an unknown language is correct.
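
A minimal sketch of this analysis-and-generation pattern is shown below. The frame, the cue-matching rule and the generation templates are hypothetical stand-ins; the real C-STAR interchange format and the analyzers behind it are far richer.

```python
# Several surface variants map onto one language-independent frame.
def analyze(utterance):
    text = utterance.lower()
    if "tuesday" in text and any(cue in text
                                 for cue in ("no good", "shot", "don't have time", "vacation")):
        return {"act": "give-information+unavailability", "when": "tuesday"}
    return None

# Each language only needs a generator from the frame into its own surface form.
GENERATORS = {
    "en": lambda f: f"I am unavailable on {f['when'].capitalize()}.",
    "ja": lambda f: "Kayoobi-wa chotto tsugo-ga warui.",   # fixed toy output
}

frame = analyze("Tuesday is shot")
print(frame)
print(GENERATORS["en"](frame))
print(GENERATORS["ja"](frame))
```

Generating back into the speaker’s own language from the same frame (the English generator here) is exactly the paraphrase check described in the last point above.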

2.4.1 JANUS

Developed by the Interactive Systems Laboratories, the earlier JANUS system, JANUS I [Woszczyna 93], used a connectionist speech recognizer. The ASR generated an N-best list of sentence hypotheses and passed them to three parsers, which in parallel mapped the input sentence into an interlingua representation (Figure 8).

Figure 8. Overview of JANUS I

In the later JANUS II and III systems [Lavie 96], the acoustic units were context-dependent 3-state triphones, modelled via continuous-density HMMs. Explicit noise models were added to help the system cope with breathing, lip-smacks, and other human and non-human noises inherent in a spontaneous speech task. JANUS II/III employs two robust translation modules with complementary strengths (Figure 9). The GLR module gives more complete and accurate translations, whereas the Phoenix module is more robust to the disfluencies of spoken language. The two modules can run separately or can be combined to gain the strengths of both. Each of the approaches has some clear strengths and weaknesses. One strategy for combining them is to use the Phoenix module as a back-up to the GLR module when the parse result of GLR* is judged as “Bad”.
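
The combination strategy just described can be sketched as a simple fallback rule; the parser interfaces and the quality judgement below are hypothetical stand-ins for GLR* and Phoenix.

```python
def combine(utterance, glr_parse, phoenix_parse):
    """Use Phoenix as a back-up when the GLR* result is judged 'Bad'."""
    result, quality = glr_parse(utterance)
    if quality != "Bad":
        return result, "GLR"
    return phoenix_parse(utterance), "Phoenix"

# Toy stand-ins for the two parsers.
def glr_parse(utt):
    ok = not any(tok in utt for tok in ("uh", "er"))   # pretend disfluencies break GLR
    return ("glr-interlingua", "Good" if ok else "Bad")

def phoenix_parse(utt):
    return "phoenix-interlingua"

print(combine("i want a room uh for tuesday", glr_parse, phoenix_parse))
```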

Figure 9. Overview of Janus II/III

2.4.2 Nespole!

Nespole! (Negotiating through SPOken Language in E-commerce) is co-funded by the EU and NSF. Two partners in this project are the Interactive Systems Laboratories and ITC-irst. Taking an approach very similar to JANUS, Nespole! translates speech via an interlingua (Figure 10).

Figure 10. Nespole! system overview


2.4.3 LodeStar (CAS)

Developed by the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, LodeStar is an interactive Chinese-to-English speech translation system based on dialogue management [Zong 2002]. In this system (Figure 11), the input utterance is first pre-processed and then translated serially by a template-based translator and an interlingua-based translator. If both translators fail to translate the input, the dialogue management mechanism is brought into play to supervise an interactive analysis that disambiguates the input. The interaction is led by the system, so the system always takes the initiative in the interactive procedure. In this approach, complicated semantic analysis is not involved.

Figure 11. The paradigm of interactive translation in the LodeStar system

2.4.4 Speech Translation Program in the Digital Olympics

As one of the 16 major tasks for the Digital Olympics, the multilingual speech translation project is co-funded by the European Union and the Chinese government. The objective of the program, as stated by the Beijing Organizing Committee for the Games of the XXIX Olympiad, is to “Make use of artificial intelligence technology to understand natural languages so as to remove language barriers and offer multi-linguistic intellectual information service for the people related to the Olympics at any time, in any place and with facilities of various kinds. As a result, people can get to know each other with the help of this technology so that friendship and mutual understanding will be promoted and the goal of ‘people’s Olympics’ better realized.” Given the current status of speech translation technology and the limited time before deployment in 2008, it is not yet clear what strategy the research team will use to achieve such an ambitious goal. The current research effort is on the interlingua-based approach and on statistical approaches for limited domains [Zong 2003].

2.5 Speech Translation Systems on Hand-held Devices

2.5.1 NEC Speech Translation System

The NEC system [Isotani 2003] performs bi-directional speech translation between Japanese and English in travel situations. The system is implemented as application software on PDAs running Pocket PC 2002. In one of the papers [Yamabana 2003], the system is designed as a client-server system, using PDAs with mobile wireless connections as clients to a back-end translation server (Figure 12).

Figure 12. Relation between client and server in the NEC client/server-mode speech translation system

Most other papers describe the system as a stand-alone speech translation device that is compact enough to carry around. One of the major design considerations is how to make the system compact enough to fit into a PDA, where the operating system allows only 32MB of memory.


The speech recognition module performs large-vocabulary continuous speech recognition of conversational Japanese and English based on HMMs and a statistical language model. The recognition result is passed to the translation module. The reading and duration of each word are added to the result, to be used for disambiguation in the translation process. To implement the system on a PDA, efforts focused on downsizing the models and reducing the computational cost and working memory of the search engine. Three techniques are used to reduce memory and computation:

1. Gaussian reduction based on the minimum description length (MDL) criterion, which efficiently removes redundant Gaussian mixtures in the state emission probabilities without significant loss of accuracy.

2. Global tying of the diagonal covariance matrices of the Gaussian mixtures. This reduces the HMM storage size by about 50% and turns the Gaussian likelihood calculation into a Euclidean distance calculation between input feature vectors and Gaussian mean vectors (see the sketch after this list).

3. A hierarchical tree of Gaussians is used for high-speed calculation of emission probabilities. For an input feature vector, the likelihood calculation is made only for the Gaussians whose likelihoods are assumed to be large by traversing the tree from the root node. This method increased the computation speed by more than 10 times while keeping the accuracy loss to a minimum.
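
The sketch below illustrates why technique 2 works: once all Gaussians share one diagonal covariance, the normalisation term is identical for every Gaussian, so ranking them by likelihood is equivalent to ranking them by distance to the mean. The dimensions and values here are invented toy numbers, not NEC’s actual models.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mi) ** 2 / v)
               for xi, mi, v in zip(x, mean, var))

TIED_VAR = [1.0, 1.0, 1.0]          # one shared diagonal covariance for all Gaussians
means = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.5], [3.0, 1.0, 1.0]]
x = [0.9, 1.8, 0.4]

# With the covariance shared, the normalisation term is the same for every Gaussian,
# so the best Gaussian by likelihood is the one closest to x in Euclidean distance.
by_likelihood = max(range(len(means)), key=lambda i: log_gaussian_diag(x, means[i], TIED_VAR))
by_distance = min(range(len(means)),
                  key=lambda i: sum((xi - mi) ** 2 for xi, mi in zip(x, means[i])))
print(by_likelihood, by_distance)   # the two selections agree
```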

The translation module accepts a recognized expression from the speech recognition module and performs bi-directional translation between Japanese and English. The NEC group developed and implemented Lexicalized Tree Automata-based Grammar (LTAM), a lexicalized grammar formalism, as the framework for grammar writing (Figure 13). In a strongly lexicalized grammar, the grammatical knowledge is localized in the word dictionary, which is an advantage for a compact implementation. The memory consumption of the translation module is about 8MB after the module starts up; the working memory is about 1 to 4MB, depending on the input data.

Figure 13. Translation module in the NEC speech translation system

For the whole system, it takes about 27MB to load and 1 to 4MB of working memory for each sentence.

2.5.2 Babylon

The Babylon program [http://darpa-babylon.mitre.org/] aims to develop rapid, two-way, multilingual speech translation interfaces for users in combat and other field environments. This technology will replace the DARPA RMS (one-way) translator in two phases. The program will attack the hard, unsolved problems involved in speech translation technologies, including multilingual automatic speech recognition (ASR), parsing, semantic knowledge representation, portability between languages and domains, and robustness to noise and coding distortions. Specifically, the program will focus on overcoming the many technical and engineering constraints limiting the multilingual ASR robustness, translation accuracy, and response time of current translation technology. Babylon will provide an enabling technology to give language support to the warfighter in deciphering possibly critical language communications during operations in foreign territories. In particular, the Babylon translation capabilities will be useful to personnel in Special Forces, crisis response, intelligence (interview/interrogation), and force protection.


The program will evaluate the operational utility of rapid multilingual speech translation technologies in realistic military experiments tied to service laboratories and the LASER ACTD. These experiments will feature rapid two-way speech translation units that will improve the ability of the operator to converse with, extract information from, and give instructions to a foreign-language speaker encountered in the field. The evaluation of such experiments will be based upon user acceptance paradigms, translation accuracy, and speed of translation normalized to CPU performance. The technology will be scalable for easy porting to different platforms, including PDAs and workstations. The language development process will be systematized so that people who are not language experts can configure a new language or add to an existing language. These technology advancements will positively impact all research and transition efforts for language translation. Quantitative goals will initially use measures and metrics associated with the relatively mature MT (machine translation) community. During this program, new metrics will be developed to account for the unique characteristics of speech-to-speech communications; no validated measurements exist today to establish performance baselines for speech-to-speech translation systems. Taking both ASR and MT into account, potential metrics include translation accuracy, task completion rate and time, usable vocabulary, and translation response time under tactical situations. Program performance goals for the two-way technology include: a) 1-1.5x real time, b) ASR accuracy of 90%, c) translation accuracy of 90%, and d) task completion of 80-85%. Qualitative goals include perceived user satisfaction/acceptance, ergonomic compliance with the uniform ensemble, error recovery procedures, and the quality of user tools for field modification and repair.

2.5.3 Speechlator

The Babylon project addresses the issues of two-way communication, where either party can use the device for conversation. A number of different groups throughout the US were asked to address specific aspects of the task, such as different languages, translation techniques and platform specifications. The Pittsburgh group, including Carnegie Mellon University, Cepstral LLC, Multimodal Technologies Inc and Mobile Technologies Inc, was presented with three challenges. First, the target language is Arabic, a language with which the group had little experience, to test its capability of moving to new languages quickly. Second, the Pittsburgh group was instructed to use an interlingua approach to translation, where the source language is translated into an intermediate form that is shared between all languages. This streamlines expansion to new languages, and CMU has a long history of working with interlingua-based translation systems. Third, the group was constrained to a single portable PDA-class device hosting the entire two-way system: two recognizers, two translation engines, and two synthesizers. For the translation engine, two different techniques were investigated, both interlingua based. The first was purely knowledge-based, following previous work [Lavie 2002]. The engine developed for this was too large to run on the device, although the generation part could be run off-device, seamlessly connected to the hand-held device by a wireless link. The second technique used statistical training to build a model that translates the structured interlingua (IF) into text in the target language. Because this approach was developed with the handheld in mind, it is efficient enough to run directly on the device, and it is the one used in the demonstration system.

2.6 European Speech Translation Projects

2.6.1 EuTrans

Sponsored by the European Commission program ESPRIT, EuTrans involved the University of Aachen (RWTH), the research center of the Fondazione Ugo Bordoni, Italy, ZERES GmbH, and the Universitat Politecnica de Valencia, Spain. More information can be found in Section 2.3.1.

2.6.2 Verbmobil

Verbmobil [Wahlster 2000] is a speaker-independent and bidirectional speech-to-speech translation system for spontaneous dialogs in mobile situations (Figure 14). It recognizes spoken input, analyses and translates it, and finally utters the translation.


The multilingual system handles dialogs in three business-oriented domains, with context-sensitive translation between three languages (German, English, and Japanese).

Figure 14. Verbmobil context-sensitive speech-to-speech translation

Since Verbmobil emphasizes the robust processing of spontaneous dialogs, it poses difficult challenges to human language technology. Verbmobil is a hybrid system incorporating both deep and shallow processing schemes. Verbmobil has a multi-blackboard architecture that is based on packed representations at all processing stages (Figure 15, Figure 16).

Figure 15. A comparison of the architectures of Verbmobil I and II

These packed representations, together with formalisms for underspecification, capture the non-determinism in each processing phase, so that the remaining uncertainties can be reduced by linguistic, discourse and domain constraints as soon as they become applicable.
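
As a small illustration of the packing idea, the toy chart below stores alternative analyses per span so that shared sub-analyses are represented once and ambiguity stays local; Verbmobil’s packed structures and underspecification formalisms are of course far more elaborate, and the example data here is invented.

```python
# Each span stores its alternative analyses; alternatives refer to sub-spans,
# so shared sub-analyses are represented once rather than once per reading.
chart = {
    (0, 2): [("NP", "the man")],
    (2, 7): [("VP-high-attach", "saw her with the telescope"),
             ("VP-low-attach", "saw her with the telescope")],
    (0, 7): [("S", [(0, 2), (2, 7)])],     # one packed node covers both readings
}

def count_readings(span, chart):
    """Number of distinct analyses encoded by the packed chart for a span."""
    total = 0
    for label, body in chart[span]:
        if isinstance(body, list):          # internal node: product over children
            n = 1
            for child in body:
                n *= count_readings(child, chart)
            total += n
        else:                               # lexical/leaf analysis
            total += 1
    return total

print(count_readings((0, 7), chart))   # 2 readings, stored without duplicating "the man"
```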

Figure 16. Some key blackboards with their subscribing modules in Verbmobil

The translation module in Verbmobil takes a multi-engine approach (Figures 17 and 18). It uses five concurrent translation engines: statistical translation, case-based translation, substring-based translation, dialog-act-based translation, and semantic transfer.

Figure 17. Verbmobil’s multi-engine parsing approach

Figure 18. The multi-engine translation approach of Verbmobil


Distinguishing features include the multilingual prosody module and the generation of dialog summaries. Verbmobil successfully met the project goals, with more than 80% approximately correct translations and a 90% success rate for dialog tasks. One of the main lessons learned from the Verbmobil project is that the problem of speech-to-speech translation of spontaneous dialogs can only be cracked by the combined muscle of deep and shallow processing approaches:

o Deep processing can be used for merging, completing and repairing the results of shallow processing strategies

o Shallow methods can be used to guide the search in deep processing

o Statistical methods must be augmented by symbolic models to achieve higher accuracy and broader coverage

o Statistical methods can be used to learn operators or selection strategies for symbolic processes

Verbmobil involved both industrial and academic research groups, including IBM, Siemens, Alcatel, DaimlerChrysler, CMU, RWTH, and UKA; see [Wahlster 2000] for the complete list of project partners.

2.6.3 PF-Star

One of the priorities of the EU’s Information Society Technologies (IST) programme in the Sixth Framework Programme (FP6) is to build on the European research community to lead the next generation of technology and applications by making them more user- and people-centred. PF-STAR (Preparing Future Multisensorial Interaction Research, http://pfstar.itc.it/) intends to help establish future activities in the field of multisensorial and multilingual communication (interface technologies) on firmer bases by providing technological baselines, comparative evaluations, and assessments of the prospects of core technologies on which future research and development efforts can build. The project addresses three crucial areas: technologies for speech-to-speech translation, the detection and expression of emotional states, and core speech technologies for children. From the expected results as described on the project information page, it is not clear whether the speech translation research will benefit from the other two areas, especially from emotion detection. It seems that the research in each area is independent of the others and is carried out by the corresponding group separately. The participants of this project are ITC-irst, RWTH, UERLN, KTH, UB, and CNR ISTC-SPFD.

2.6.4 LC-Star

LC-STAR, launched officially on 1 February 2002, is a project focused on creating language resources (LR) for transferring Speech-to-Speech Translation (SST) components and thus improving human-to-human and man-machine communication in multilingual environments. LC-STAR is funded by the European Commission within the IST programme. The goal of the project is to create lexica for 13 languages and text corpora for 3 languages, and to build a demonstrator translating among 3 languages. The language resources are specifically designed as linguistically oriented resources needed for the language transfer of SST components, which include:

o Flexible vocabulary speech recognition
o High quality text-to-speech synthesis
o Speech centred translation

Such linguistically oriented resources include suitable text corpora for producing lexica that are enriched with phonetic, prosodic and morpho-syntactic information. These generic LRs are needed to build SST components covering a wide range of application domains in different languages. Currently, such large-scale LRs are not publicly or commercially available, and industrial standards are lacking. The current language resources have the following drawbacks [Hartikainen 2003]:

o Lack of coverage for application domains
o Lack of suitability for synthesis and recognition
o Lack of quality control
o Lack of standards
o Lack of coverage in languages
o Mostly limited to research purposes


The main objective of the LC-STAR project is to make large lexica available for many languages, covering a wide range of domains, along with the development of standards relating to content and quality. Pioneering work is being done on defining standards with respect to content and format. For speech-centred translation, the project will focus on statistical approaches allowing an efficient transfer to other languages using suitable LRs. The LRs needed for this purpose are aligned bilingual text corpora and monolingual lexica with morpho-syntactic information. The languages covered and the responsible partners are presented in Table 1 below.

Partner          Languages
IBM              Italian, Greek
Nokia            Finnish, Mandarin
NSC              Hebrew, US-English
RWTH Aachen      German, Classical Arabic
Siemens          Turkish, Russian
UPC              Spanish, Catalan

Table 1. List of languages and responsible partners in LC-Star

2.6.5 TC-STAR_P

The objective of the TC-STAR_P project (http://www.tc-star.org/) is to prepare a future integrated project named “Technology and Corpora for Speech to Speech Translation” (TC-STAR), which will be proposed under the 6th Framework Programme and will aim at making speech-to-speech translation real. TC-STAR_P will be driven by industrial requirements and will involve key industrial actors active in the development of SST systems and components, academic research institutions active in SST systems and components research, infrastructure centres active in the development of language resources for SST components, and SMEs using the provided technologies. Roadmaps for the development of SST will be prepared, and further key actors from the industrial, research and infrastructure groups, as well as SMEs working with SST applications, will be involved. A new organisational model will be developed. Current participants include ELDA, IBM, ITC-irst, KUN, LIMSI-CNRS, Nokia, NSC, RWTH, Siemens AG, Siemens RIT SAS, Sony, TNO, UniKarl, and UPC.

3 International Cooperation

Table 2 shows the cooperation between research sites in different research projects. This is neither a complete list of projects nor of research groups; its main purpose is to illustrate the international cooperation in the speech translation area.

Table 2. International cooperation in speech translation projects

4 Conclusion

In this paper, we surveyed the major speech translation systems. Systems vary in all aspects:

o Speech Recognizer: the majority of the surveyed systems use HMM ASR. A few systems use FST to integrate the ASR with the MT component.

o Machine Translation Paradigms: we see interlingua-based MT, EBMT, SMT, grammar-based MT and multi-engine translation systems. Each approach has its pros and cons; there is no single “best” MT solution for speech translation problems.

o Coupling between ASR and MT: most of the surveyed systems are loosely coupled, except for the UPV and AT&T groups, where FSTs are used to integrate the ASR and MT components.

5 Acknowledgement

I would like to thank the instructors of this course, Stephan Vogel, Tanja Schultz and Alan Black, for their teaching and discussions.

6 Bibliographical References²

J. Amengual, J. Benedí, F. Casacuberta, M. Castaño, A. Castellanos, V. Jiménez, D. Llorens, A. Marzal, M. Pastor, F. Prat, E. Vidal, and J. Vilar. The EuTrans-I speech translation system. Machine Translation, 1999.

Srinivas Bangalore and Giuseppe Riccardi. Stochastic Finite-State Models for Spoken Language Machine Translation. In Workshop on Embedded Machine Translation Systems, Seattle, April 2000.

Srinivas Bangalore and Giuseppe Riccardi. A Finite-State Approach to Machine Translation. In North American ACL 2001 (NAACL-2001), Pittsburgh, May 2001.

Alan Black, et al. Rapid Development of Speech-to-Speech Translation Systems. In Proceedings of ICSLP 2002.

Nick Campbell. CHATR: A High-Definition Speech Re-Sequencing System. In Proceedings of the ASA/ASJ Joint Meeting, pp. 1223-1228, 1996.

R. Frederking, A. Black, R. Brown, J. Moody, and E. Steinbrecher. Field Testing the Tongues Speech-to-Speech Machine Translation System. In LREC, 2002.

Elviira Hartikainen, et al. Large Lexica for Speech-to-Speech Translation: From Specification to Creation. In Proceedings of EuroSpeech 2003

X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, K.-F. Lee, and R. Rosenfeld. The SPHINX-II speech recognition system: an overview. Computer Speech and Language, vol. 7(2), pp. 137-148, 1992.

Ryosuke Isotani, et al. Speech-to-Speech Translation Software on PDAs for Travel Conversation. 2003

Tsuyoshi Morimoto, et al. ATR’s Speech Translation System: ASURA. In Proceedings of EuroSpeech 93, pp. 1291-1294.

² More references can be found at “Speech to Speech Translation Systems Survey” (http://projectile.is.cs.cmu.edu/research/public/talks/speechTranslation/facts.htm)

Alon Lavie, Alex Waibel, et al. Translation of Conversational Speech with JANUS-II. In Proceedings of ICSLP 96. 1996.

A. Lavie, et al. “A multi-perspective evaluation of the NESPOLE! speech-to-speech translation system,” in Proceedings of ACL 2002 workshop on Speech-to-speech Translation: Algorithms and Systems, Philadelphia,PA., 2002.

Fu-Hua Liu, et al. Noise Robustness in Speech to Speech Translation. IBM Tech Report RC22874, 2003.

M. Mohri, et al. Weighted Determinization and Minimization for Large Vocabulary Speech Recognition. In Proceedings of EuroSpeech 97.

M. Pastor. EuTrans: A Speech-to-Speech Translator Prototype. In Proceedings of EuroSpeech 2001.

Toshiyuki Takezawa, et al. A Japanese-to-English Speech Translation System: ATR-MATRIX. In Proceedings of ICSLP 98.

Sarich, A., Phraselator, one-way speech translation system, http://www.sarich.com/translator/, 2001.

Wolfgang Wahlster, ed. 2000. Verbmobil: Foundations of Speech-to-Speech Translation. Springer. http://verbmobil.dfki.de/Vm-Buch.final.html .

M. Woszczyna, et al. Recent Advances in JANUS: A Speech Translation System. HLT 93.

Alex Waibel. 1996. Chapter 8.6: Multilingual Speech Processing. In Survey of the State of the Art in Human Language Technology, http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html

Kiyoshi Yamabana, et al. A Speech Translation System with Mobile Wireless Clients. In Proceedings of ACL 2003.

B. Zhou, et al. Statistical Natural Language Generation for Speech-to-Speech Machine Translation Systems. In Proceedings of ICSLP 2002, 2002.

Chengqing Zong, Bo Xu and Taiyi Huang. Interactive Chinese-to-English Speech Translation Based on Dialogue Management. In Proceedings of the ACL 2002 Workshop on Speech-to-Speech Translation: Algorithms and Systems, Philadelphia, July 2002, pp. 61-68.

Chengqing Zong and Mark Seligman. 2003. Toward Practical Spoken Language Translation. To be published.