Automatic summarization of voicemail messages using lexical and prosodic features
Koumpis and Renals
Presented by Daniel Vassilev
The Domain
Voicemail is a special case of spontaneous speech
Goal: Enable the voicemail user to receive his/her messages anywhere and any time, in particular on mobile devices
Key components: caller identification, reason for call, information that the caller requires, and a return phone number
The Task
Summarization: obtain the most important information from the voicemail
A complete system requires spoken language understanding and language generation; current technology is not adequate
Solution: simplify the task by deciding, for each word, whether it will be in the summary
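This word-level framing can be sketched as follows, assuming we already have a per-word inclusion score from some classifier; the scores, threshold, and character budget here are illustrative, not the paper's actual model:

```python
def summarize(words, scores, max_chars=140, threshold=0.5):
    """Keep words whose inclusion score passes the threshold,
    preserving original order, until the character budget is spent."""
    kept = []
    length = 0
    for word, score in zip(words, scores):
        if score < threshold:
            continue
        extra = len(word) + (1 if kept else 0)  # +1 for the joining space
        if length + extra > max_chars:
            break
        kept.append(word)
        length += extra
    return " ".join(kept)

words = ["hi", "this", "is", "anna", "call", "me", "at", "555", "1234"]
scores = [0.1, 0.2, 0.2, 0.9, 0.8, 0.7, 0.6, 0.95, 0.95]
print(summarize(words, scores))  # -> "anna call me at 555 1234"
```

Note that the summary is just a subsequence of the input words: no language generation is involved, which is exactly the simplification the slide describes.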
The Task
Voicemail is short, 40s on average
Summaries must fit into 140 characters (mobile devices)
Content more important than coherence and document flow
ASR used on the voicemail; a significant word error rate must be assumed (30%-40% error!)
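For reference, word error rate is the standard ASR metric: substitutions, insertions, and deletions from a Levenshtein alignment of the hypothesis against the reference transcript, divided by the reference length. A self-contained sketch:

```python
def wer(ref, hyp):
    """Word error rate via edit distance over word sequences:
    (substitutions + insertions + deletions) / len(reference)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(r)][len(h)] / len(r)

# One substitution ("me" -> "be") and one deletion ("today"): 2/5 = 0.4
print(wer("please call me back today", "please call be back"))
```

At a 30%-40% WER, roughly every third word of a message is wrong or missing, which is why the summarizer cannot rely on clean transcripts.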
Voicemail Corpus
IBM Voicemail Corpus-Part I
1801 messages (1601 for training, 200 for testing)
14.6 h
On average 90 words / message
Message topics: 27% business related, 25% personal, 17% work-related, 13% technical and 18% in other categories
The classification problem
Classifier decides if a word is included in the summary
Parcel (a feature selection algorithm) with Receiver Operating Characteristic (ROC) graphs used for feature selection
Hybrid multi-layer perceptron (MLP) / hidden Markov model (HMM) classifier
Receiver Operating Characteristic
Graph plots the true positive rate (sensitivity) vs. 1 - the true negative rate (specificity)
We can shift the positive vs. negative error trade-off by choosing different acceptance thresholds (we move along the ROC curve)
Different classifiers will have different ROC curves
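The threshold sweep described above can be made concrete with a small sketch: sort items by classifier score, lower the acceptance threshold one item at a time, and record the (false positive rate, true positive rate) point after each step. The scores and labels here are made up for illustration:

```python
def roc_points(scores, labels):
    """Sweep the acceptance threshold over classifier scores and
    return the (FPR, TPR) points traced out on the ROC curve."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Lowering the threshold admits one more item at a time, highest score first
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = []
    tp = fp = 0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A classifier whose curve lies closer to the top-left corner (high TPR at low FPR) dominates one whose curve lies closer to the diagonal, which is what Parcel exploits when comparing feature subsets.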
Sample ROC graph
System setup
The team built a sophisticated, multi-component system that can capture the different types of information occurring in voicemail
Initial trigram language model, augmented with sentences from the Hub-4 Broadcast News and Switchboard language model training corpora
System Setup
Pronunciation dictionary with 10,000 words from the training data
+ pronunciations obtained from the SPRACH broadcast news system
Annotated summary words in 1,000 messages
System overview
Entities in summaries
Annotation procedures
1. Pre-annotated NEs were marked as targets, unless unmarked by later rules;
2. The first occurrences of the names of the speaker and recipient were always marked as targets; later repetitions were unmarked unless they resolved ambiguities;
3. Any words that explicitly determined the reason for calling, including important dates/times and action items, were marked;
4. Words in a stopword list with 54 entries were unmarked.
Annotation procedures
Labeled only on transcribed messages (no audio)
Annotators tended to eliminate irrelevant words (as opposed to marking content words)
Produced summaries about 30% shorter than the original message
Relatively good level of inter-annotator agreement
Lexical features
Lexical features from ASR output:
collection frequency (less frequent words are more informative)
acoustic confidence (ASR confidence)
All other features considered before and after stemming:
NEs, proper names, tel. numbers, dates and times, word position
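A toy illustration of the collection-frequency idea, rarer words across the corpus carry more information. The negative-log-frequency weighting below is one common formulation, not necessarily the paper's exact definition, and the tiny corpus is made up:

```python
import math
from collections import Counter

messages = [
    "hi it's anna please call me back",
    "hi this is bob call me about the meeting",
    "please send me the report",
]
# Collection frequency: how often each word occurs across the whole corpus
cf = Counter(w for m in messages for w in m.split())
total = sum(cf.values())

def informativeness(word):
    """Rarer words in the collection get a higher weight (negative log frequency)."""
    return -math.log(cf[word] / total)

# A name occurring once outweighs a filler word occurring three times
print(informativeness("anna") > informativeness("me"))  # -> True
```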
Prosodic features
Prosodic features from audio using signal processing algorithms:
duration (normalized over the corpus)
pauses (preceding and succeeding)
mean energy
F0 range, average, onset and offset
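One plausible form of the duration normalization mentioned above is a per-corpus z-score, so "long for this corpus" is comparable across speakers and messages; the paper's exact scheme may differ, and the durations below are invented:

```python
import math

def zscore_durations(durations):
    """Normalize word durations by the corpus mean and standard deviation."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((d - mean) ** 2 for d in durations) / n
    std = math.sqrt(var) or 1.0  # guard against a degenerate all-equal corpus
    return [(d - mean) / std for d in durations]

durs = [0.12, 0.30, 0.25, 0.80, 0.20]  # word durations in seconds, illustrative
normed = zscore_durations(durs)
# The 0.80s word stands out with the largest positive z-score
print(max(normed) == normed[3])
```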
Results
Named entities identified very accurately (without stemming)
Telephone numbers also recognized well by specific named entity lists. Position also a good feature, as numbers appear towards the end of the message
All prosodic features but duration had no predictive power
ROC curves for different tasks / features
Results
Dates / times: best matched by specific named entity list and collection frequency
Prosodic features (duration, energy, F0 range) more important here, but still not the best predictors
The Parcel bootstrapping algorithm
Conclusions
Trade-off between length of summary and retaining essential content words
70%-80% agreement with human summary for hand-annotated messages
50% agreement when using ASR
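The slides do not spell out the agreement metric; a simple recall-style word overlap against the human summary illustrates what such a number can mean (the function and the example strings are hypothetical):

```python
def word_agreement(system, human):
    """Fraction of the human summary's words that the system summary kept.
    A simple recall-style overlap; the paper's exact metric may differ."""
    sys_words = set(system.split())
    hum_words = human.split()
    hits = sum(1 for w in hum_words if w in sys_words)
    return hits / len(hum_words)

human = "anna call back 555 1234"
system = "anna call me 555 1234"
print(word_agreement(system, human))  # 4 of 5 human words kept -> 0.8
```

The drop from 70%-80% to 50% under ASR is consistent with the 30%-40% word error rate: a word the recognizer got wrong cannot match the human summary no matter how well it is scored.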
Conclusions
Automatic summaries perceived as worse than human summaries (duh!)
However, if the summarizer used human-annotated data (as opposed to ASR output), the perceived quality improved significantly