CLINT-CS Dialogue II: Verbmobil (May 2006)

CLINT-CS

Dialogue II

Verbmobil


Verbmobil

• Verbmobil is a spoken dialogue system that provides phone users with simultaneous dialogue interpretation services for restricted topics.

• Recognises spoken input, translates it, and then utters the translation.

• Three languages: German, English and Japanese


Challenges for S and L Technology

Four dimensions, each listed in order of increasing difficulty:

• Input conditions: close speaking, PTT (push-to-talk); telephone, pause-based segmentation; open microphone, GSM quality

• Naturalness: isolated words; read continuous speech; spontaneous speech

• Adaptability: speaker dependent; speaker independent; speaker adaptive

• Dialogue capabilities: monologue dictation; information-seeking dialogue; multiparty negotiation


Grand Challenges

• Not a push-to-talk system. Has to decide for itself when user input is complete.

• Spontaneous speech including disfluencies and repair phenomena.

• Speaker adaptive.

• Mixed initiative dialogue

• Three different domains of discourse
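The first challenge, deciding without a push-to-talk button when the user has finished speaking, can be illustrated with a simple energy-based endpoint detector. This is only a minimal sketch and not the method Verbmobil used; the per-frame energies, the `threshold` and the `min_pause_frames` parameter are assumptions made for illustration.

```python
def find_endpoint(frame_energies, threshold=0.1, min_pause_frames=3):
    """Return the index of the first frame after the utterance, or None.

    The utterance counts as complete once speech has been observed and is
    followed by `min_pause_frames` consecutive low-energy frames.
    """
    seen_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= threshold:
            seen_speech = True
            silent_run = 0          # a short dip does not end the utterance
        elif seen_speech:
            silent_run += 1
            if silent_run == min_pause_frames:
                # The utterance ended where the silent run started.
                return i - min_pause_frames + 1
    return None


# Hypothetical energies: leading silence, speech with a short dip, a long pause.
energies = [0.0, 0.0, 0.5, 0.6, 0.05, 0.7, 0.0, 0.0, 0.0, 0.4]
endpoint = find_endpoint(energies)
```

The short dip at frame 4 is ignored because it is shorter than `min_pause_frames`; only the longer pause ends the utterance.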


Domains

• Scenario 1: Appointment Scheduling
  – Questions: When?
  – Focus on temporal expressions
  – Vocabulary 2.5-6K

• Scenario 2: Travel Planning
  – Questions: When? Where? How?
  – Focus on temporal and spatial expressions
  – Vocabulary 7-10K

• Scenario 3: Remote PC Maintenance
  – Questions: What? When? Where? How?
  – Focus on integration of special sublanguage lexica
  – Vocabulary 15-30K


Data Collection

A significant programme of data collection was performed to extract statistical properties of different kinds of data:

• Transliterated speech data

• Segmented speech with prosodic labels

• Dialogues annotated with dialogue acts

• Treebanks and predicate-argument structures

• Aligned bilingual corpora


Speech Data

• Multi channel recording
  – close-speaking microphone
  – room microphone
  – various telephones

• Speech recognisers trained on data sets of different audio quality


Multi Level Data Annotation

• Speech Data
  – Transliteration
  – Orthography
  – Pronunciation
  – Phonological Segmentation
  – Word Segmentation
  – Prosodic Segmentation

• Non Speech
  – Dialogue Acts
  – Treebanks


Statistical Models

• Data used to train different statistical models using Machine Learning.

• Models include
  – Neural Networks
  – Probabilistic Automata (HMMs for speech)
  – Probabilistic CFGs (robust parsing)
  – Probabilistic Transfer Rules
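As an illustration of how a probabilistic automaton is used once it has been trained, the following sketch runs standard Viterbi decoding on a toy HMM. The two states, the two observation symbols and all probabilities are invented for the example and are not taken from Verbmobil.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor that maximises the path probability.
            prob, path = max(
                ((best[p][0] * trans_p[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy model (hypothetical numbers): is a frame speech or pause, given
# whether its energy is high or low?
states = ("speech", "pause")
start_p = {"speech": 0.5, "pause": 0.5}
trans_p = {"speech": {"speech": 0.8, "pause": 0.2},
           "pause": {"speech": 0.2, "pause": 0.8}}
emit_p = {"speech": {"high": 0.9, "low": 0.1},
          "pause": {"high": 0.1, "low": 0.9}}

probability, path = viterbi(["high", "high", "low"], states, start_p, trans_p, emit_p)
```

In a trained system the probabilities would be estimated from the annotated corpora described above rather than written by hand.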


Architecture

• Different input devices (microphone, telephone, mobile, internet)

• Multilingual speech recognition (EN, DE, JP) including prosodic analysis

• Parsing

• Multi-level translation

• Multi-lingual generation


Multi Engine Parsing Architecture

• Three different parsing models are employed
  – Probabilistic LR Parser
  – Robust Chunk Parsing
  – HPSG Chart Parser

• All parsing models produce trees that are transformed into the same multistratal representation called VIT (Verbmobil Interface Terms)

• This facilitates integration of partial results from the different parsing models
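One way to picture the integration of partial results: because every parser emits the same kind of object, partial analyses from different engines can be placed in one chart and combined into a sequence that spans the whole utterance. The sketch below is an assumption-laden illustration, not the Verbmobil algorithm; `PartialAnalysis`, its `score` field and the product-of-scores criterion are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartialAnalysis:
    start: int     # index of the first word covered
    end: int       # index one past the last word covered
    parser: str    # engine that produced the analysis
    score: float   # hypothetical confidence in (0, 1]


def best_spanning_sequence(analyses, n_words):
    """Return adjacent analyses covering words 0..n_words with the highest
    product of scores, or None if the utterance cannot be covered."""
    best = {0: (1.0, [])}  # position -> (score, analyses used so far)
    for pos in range(n_words):
        if pos not in best:
            continue
        score, used = best[pos]
        for a in analyses:
            if a.start == pos and a.end > pos:
                candidate = (score * a.score, used + [a])
                if a.end not in best or candidate[0] > best[a.end][0]:
                    best[a.end] = candidate
    return best.get(n_words, (None, None))[1]


chart = [
    PartialAnalysis(0, 2, "chunk", 0.9),
    PartialAnalysis(2, 5, "hpsg", 0.8),
    PartialAnalysis(0, 5, "statistical", 0.6),
    PartialAnalysis(2, 4, "chunk", 0.9),
]
spanning = best_spanning_sequence(chart, n_words=5)
```

Here the combination of a chunk-parser fragment and an HPSG fragment (0.9 × 0.8 = 0.72) is preferred over the single statistical analysis (0.6).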


Translation Models

• Substring Based

• Template Based

• Dialogue Act Based


Substring Based Translation

• Starts with the best sentence hypothesis of the speech recogniser

• Uses prosodic information to determine phrase boundaries and sentence mode

• Machine Learning methods applied to a sentence-aligned bilingual corpus

• The output of this module is a sequence of words in the target language together with a confidence measure that is used for selecting the best translation.


Template Based Translation

• Based on 30K translation templates learned from a sentence-aligned corpus

$T_i = (T_i^s, T_i^t)\{x_1, \dots, x_n\}$

• 3 phases:
  – SL Template matching
  – Subphrase Translation
  – TL utterance generation
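The three phases can be sketched as follows. This is a toy illustration only: the single template pair, the subphrase lexicon and the helper names `template_to_regex` and `translate` are hypothetical, whereas the real system used about 30K templates learned from a corpus.

```python
import re

# Hypothetical template pair sharing the variable x1, and a tiny lexicon.
TEMPLATES = [("treffen wir uns am {x1}", "let us meet on {x1}")]
SUBPHRASES = {"montag": "Monday", "dienstag": "Tuesday"}


def template_to_regex(template):
    """Turn 'a {x1} b' into a regex with one named group per variable."""
    parts = re.split(r"\{(\w+)\}", template)  # literal, var, literal, ...
    regex = ""
    for i, part in enumerate(parts):
        regex += re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
    return re.compile(regex)


def translate(utterance):
    for source, target in TEMPLATES:
        # Phase 1: source-language (SL) template matching.
        match = template_to_regex(source).fullmatch(utterance.lower())
        if match is None:
            continue
        # Phase 2: translate the subphrase bound to each variable.
        bindings = {var: SUBPHRASES.get(text, text)
                    for var, text in match.groupdict().items()}
        # Phase 3: target-language (TL) utterance generation.
        return target.format(**bindings)
    return None  # corresponds to the "no translation" outcome


result = translate("Treffen wir uns am Montag")
```

An utterance that matches no template yields `None`, which is why a template-based engine benefits from being combined with other translation engines.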


Template Translation Results

                      WL Best Hypothesis   All Word Lattice
Perfect Translation   47%                  67%
Approx. Correct       16%                  6%
Bad Translation       15%                  5%
No Translation        22%                  22%


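Multi Engine Translation

• Input segments
  – Segment 1: "If you prefer another hotel"
  – Segment 2: "please let me know"

• Translation engines feeding the selection module
  – case based translation (CBT)
  – substring based translation
  – statistical translation
  – dialogue based translation
  – semantic transfer

• Output of the selection module
  – Segment 1: semantic transfer
  – Segment 2: CBT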
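A selection module of this kind can be sketched as a per-segment choice among the engines' outputs. The sketch assumes that each engine attaches a comparable confidence value to its output; the confidences below are invented and the translations are placeholders, so only the selection logic is meaningful.

```python
def select_per_segment(candidates):
    """Map each segment to the (engine, translation) with highest confidence.

    `candidates` maps a segment id to a list of
    (engine, translation, confidence) tuples.
    """
    selected = {}
    for segment, options in candidates.items():
        engine, translation, _ = max(options, key=lambda option: option[2])
        selected[segment] = (engine, translation)
    return selected


# Hypothetical confidences; the translations are placeholders.
candidates = {
    "segment 1": [("semantic transfer", "<translation A>", 0.9),
                  ("case based", "<translation B>", 0.6)],
    "segment 2": [("semantic transfer", "<translation C>", 0.4),
                  ("case based", "<translation D>", 0.8)],
}
chosen = select_per_segment(candidates)
```

With these numbers the result mirrors the slide: semantic transfer is chosen for segment 1 and case based translation for segment 2.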


Dialogue Act Based Translation

• Meaning based translation

• Statistical classification of 19 dialogue acts.

• Extraction of propositional content using finite state transducers.

• Content built from an ontology covering appointment scheduling and travel planning tasks.

• Template based approach to generation of target language from content.
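The pipeline described above (classify the dialogue act, extract the content, generate from a template) can be sketched in a few lines. Everything here is a stand-in chosen for brevity: keyword cues replace the statistical classifier, a regular expression replaces the finite state transducers, and the cue lists, lexicon and target templates are hypothetical.

```python
import re

# Stand-in for the statistical classifier: cue phrases per dialogue act.
ACT_CUES = {"greet": ("hello",), "thank": ("thank",), "suggest": ("how about",)}
# Stand-in for the finite state transducers: a pattern for weekday names.
DAY = re.compile(r"\b(monday|tuesday|wednesday|thursday|friday)\b")
DAYS_DE = {"monday": "Montag", "tuesday": "Dienstag", "wednesday": "Mittwoch",
           "thursday": "Donnerstag", "friday": "Freitag"}
# Target-language templates, one per dialogue act.
TARGET_TEMPLATES = {"greet": "Guten Tag.", "thank": "Danke.",
                    "suggest": "Wie wäre es am {day}?"}


def classify(text):
    for act, cues in ACT_CUES.items():
        if any(cue in text for cue in cues):
            return act
    return None


def extract_content(text):
    match = DAY.search(text)
    return {"day": DAYS_DE[match.group(1)]} if match else {}


def translate(utterance):
    text = utterance.lower()
    act = classify(text)
    if act is None:
        return None
    try:
        return TARGET_TEMPLATES[act].format(**extract_content(text))
    except KeyError:
        return None  # the template needs content that was not found


suggestion = translate("How about Monday?")
```

Because the translation goes through the dialogue act and its content rather than the words, the output need not follow the wording of the input.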


Part of Ontology for Propositional Content

[Figure: fragment of the ontology tree rooted at "top". Node labels: object, situation, quality; abstract, concrete; agent, location; event, action; journey, move, stay, show, meeting; move by public transport; move-by-rail, move-by-plane.]


Dialogue Act Hierarchy

• DialogueAct
  – control dialogue: greet, bye, introduce, thank, deliberate
  – manage task: init, defer, close
  – promote task: request, suggest, inform, feedback, commit, offer

• Subtypes of request: request suggest, request clarify, request comment, request commit

• Subtypes of inform: digress, exclude, clarify, justify
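A hierarchy like this is convenient to store as a child-to-parent map, so that a specific dialogue act can be generalised to its broader class. The sketch below covers only part of the hierarchy, and the parent links are an assumed reading of the slide's diagram; the identifiers are hypothetical spellings of the slide's labels.

```python
# Child -> parent links for a fragment of the hierarchy (assumed reading).
PARENT = {
    "control_dialogue": "dialogue_act",
    "manage_task": "dialogue_act",
    "promote_task": "dialogue_act",
    "greet": "control_dialogue",
    "thank": "control_dialogue",
    "init": "manage_task",
    "close": "manage_task",
    "request": "promote_task",
    "suggest": "promote_task",
    "request_clarify": "request",
}


def ancestors(act):
    """Return the chain of increasingly general classes above `act`."""
    chain = []
    while act in PARENT:
        act = PARENT[act]
        chain.append(act)
    return chain


def is_a(act, category):
    """True if `act` equals `category` or lies below it in the hierarchy."""
    return act == category or category in ancestors(act)


chain = ancestors("request_clarify")
```

Backing off to a more general class in this way is useful when a classifier is unsure about the most specific act.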


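Dialogue-Based Translation: Transfer Component

[Figure: transfer rules map the semantic representation of the source language (a VIT) to the semantic representation of the target language (a VIT), which is passed on to generation. A "dialogue and context evaluation" component is shown alongside the transfer step.]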


Prosody

• Input
  – Speech signal
  – Word Hypothesis Graph (WHG)

• Output
  – Annotated WHG including, per word: duration, pitch, energy, pause info

• Used to classify phrase and clause boundaries, accented words, and sentence mood.
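The per-word annotation can be pictured as a record attached to each word hypothesis. The sketch below simplifies the graph to a single path and marks a phrase boundary after any sufficiently long pause; the field names, the feature values and the 200 ms threshold are assumptions for illustration, and a real classifier would combine all of the features.

```python
from dataclasses import dataclass


@dataclass
class WordHypothesis:
    word: str
    duration_ms: float
    pitch_hz: float
    energy: float
    pause_after_ms: float


def mark_boundaries(hypotheses, pause_threshold_ms=200.0):
    """Return the words with '|' inserted after each long pause."""
    marked = []
    for hypothesis in hypotheses:
        marked.append(hypothesis.word)
        if hypothesis.pause_after_ms >= pause_threshold_ms:
            marked.append("|")
    return marked


# Hypothetical feature values for one path through the WHG.
path = [
    WordHypothesis("if", 120, 180, 0.6, 20),
    WordHypothesis("you", 110, 175, 0.6, 10),
    WordHypothesis("prefer", 300, 190, 0.7, 15),
    WordHypothesis("another", 280, 185, 0.7, 10),
    WordHypothesis("hotel", 350, 160, 0.6, 350),
    WordHypothesis("please", 260, 200, 0.8, 30),
    WordHypothesis("let", 100, 190, 0.6, 10),
    WordHypothesis("me", 90, 185, 0.5, 10),
    WordHypothesis("know", 240, 150, 0.5, 0),
]
segmented = mark_boundaries(path)
```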


Prosody – Sentence Mood

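[Figure: pitch plotted against time for the utterance "You are coming tomorrow", once as a question ("tomorrow?"), where the pitch rises over the final word, and once as a statement ("tomorrow."), where it falls.]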
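The contrast between the two contours can be turned into a very crude sentence mood classifier: compare the average pitch at the very end of the utterance with the average just before it. This is a minimal sketch with invented pitch values and a hypothetical `tail` parameter, not the classifier used in Verbmobil.

```python
def sentence_mood(pitch_hz, tail=3):
    """Classify an utterance as 'question' or 'statement' from its pitch.

    Compares the mean of the last `tail` pitch samples with the mean of
    the `tail` samples before them; a final rise suggests a question.
    """
    if len(pitch_hz) < 2 * tail:
        raise ValueError("need at least 2 * tail pitch samples")
    final = sum(pitch_hz[-tail:]) / tail
    before = sum(pitch_hz[-2 * tail:-tail]) / tail
    return "question" if final > before else "statement"


# Hypothetical pitch tracks (Hz) for "You are coming tomorrow".
rising = [120, 118, 119, 121, 140, 160, 180]
falling = [130, 128, 125, 120, 110, 100, 95]
moods = (sentence_mood(rising), sentence_mood(falling))
```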


Use of Prosodic Information

• Prosodic information is used systematically at all processing stages

• A prosodic difference can lead to a different translation: "… wir haben noch" can mean "we still have" or "we have another".


Multi Blackboard Architecture

• Final system comprises 69 highly interactive modules.

• No direct communication between modules.

• Communication is handled by 198 blackboards.

• Shared representation structures

• A module typically subscribes to several blackboards.
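The communication pattern can be sketched as a set of named blackboards to which modules subscribe and publish. The class, the blackboard names and the two toy modules below are hypothetical and greatly simplified; the point is only that no module ever calls another module directly.

```python
from collections import defaultdict


class Blackboards:
    """Named blackboards; modules interact only by publishing and subscribing."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.contents = defaultdict(list)

    def subscribe(self, board, callback):
        self._subscribers[board].append(callback)

    def publish(self, board, item):
        self.contents[board].append(item)
        for callback in self._subscribers[board]:
            callback(item)


boards = Blackboards()


# Hypothetical modules: each reads one blackboard and writes to another.
def prosody_module(whg):
    boards.publish("whg_with_prosody", whg + " +prosody")


def parser_module(annotated_whg):
    boards.publish("vit", "VIT(" + annotated_whg + ")")


boards.subscribe("whg", prosody_module)
boards.subscribe("whg_with_prosody", parser_module)
boards.publish("whg", "word graph")
```

Replacing or adding a module only requires changing its subscriptions, which is one reason this style suits a system with many interacting components.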


Blackboards & Modules

[Figure: modules arranged around the blackboards through which they exchange data.]

• Modules: command recogniser, spontaneous speech recogniser, speaker adaptation, prosodic analysis, statistical parser, chunk parser, HPSG parser, dialogue act recognition, semantic construction, robust dialogue semantics, semantic transfer, generation

• Blackboards: Audio Data, WHG with prosodic labels, VIT discourse representation


Multi Engine Approach

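[Figure: data flow of the multi engine approach.]

• Input: augmented WHG

• Parsers: statistical parser, chunk parser, HPSG parser

• Intermediate result: chart containing partial VITs

• Robust dialogue semantics: knowledge-based reconstruction

• Output: complete and spanning VIT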


Achievements

• Three language pairs, three domains and a vocabulary size of over 100K word forms

• Average processing time 4x original signal duration

• Word recognition rate of 75% for spontaneous speech

• 80% approximately correct translations

• 90% success rate for dialogue tasks in end-to-end evaluation


Conclusion

• Speech to speech translation of spontaneous dialogues can only be cracked by combining deep and shallow processing

• The final architecture maximises the necessary interaction between processing modules

• Software engineering considerations must be taken seriously in such a project.
