
Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News

Gina-Anne Levow

University of Chicago

SIGHAN

July 25, 2004

Roadmap

• The Problem: Mandarin Story Segmentation

• The Tools: Prosodic and Text Cues
  – Mandarin Chinese

• Individual Results

• Integrating Cues

• Conclusion & Future Work

The Problem: Mandarin Speech Topic Segmentation

• Separate audio stream into component topics

Why Segment?

• Enables language understanding tasks
  – Information Retrieval
    • Only regions of interest
  – Summarization
    • Cover all main topics
  – Reference Resolution
    • Pronouns tend to refer within segments

The Challenge

• How do we define/measure topicality?
  – Are two regions on the same topic?
  – Fundamentally requires full understanding
    • How can we approach with partial understanding?
• How do we identify boundaries sharply?
  – Association of sentences may be ambiguous
    • Especially “filler”

The Tools: Prosodic and Text Cues

• Represent local changes at boundaries with audio
  – Silence!, speaker change, pitch, loudness, rate (GHN, AT&T00)
• Represent topicality with text
  – Component words in audio stream
    • Possibly noisy
    • Many possible models (Hearst 94, Beeferman 99, ...)
• Combining Prosody and Text
  – Human annotators more accurate, confident if use BOTH transcribed text and original audio!! (Swerts 97)
  – English broadcast news (Tur et al, 2001)

Data and Processing

• Broadcast News
  – Topic Detection and Tracking TDT3 corpus
  – Voice of America broadcast news
    • ASR transcription
    • Manually segmented: known boundaries
  – ~4,000 stories, ~750K words
• Acoustic analysis (Praat)
  – Automatic pitch, intensity tracking
    • Smoothed, speaker-normalized, per-word
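
The slide does not say which normalization was used, so the following is only a minimal sketch of one common choice: z-scoring each pitch sample against its speaker's mean and standard deviation, then averaging per word. The function name `speaker_normalize` and the `(speaker, word_index, pitch_hz)` input layout are assumptions for illustration, and smoothing is left out.

```python
from collections import defaultdict
from statistics import mean, pstdev


def speaker_normalize(frames):
    """frames: list of (speaker, word_index, pitch_hz) tuples (hypothetical layout).

    Returns {word_index: mean z-scored pitch of that word}.
    Z-scoring is an assumed normalization, not necessarily the one used in the talk.
    """
    # Collect all pitch samples per speaker to estimate that speaker's range.
    by_speaker = defaultdict(list)
    for speaker, _, pitch in frames:
        by_speaker[speaker].append(pitch)
    # Fall back to a unit deviation when a speaker has constant pitch.
    stats = {s: (mean(v), pstdev(v) or 1.0) for s, v in by_speaker.items()}

    # Normalize each sample, then average the samples belonging to each word.
    per_word = defaultdict(list)
    for speaker, word, pitch in frames:
        mu, sigma = stats[speaker]
        per_word[word].append((pitch - mu) / sigma)
    return {w: mean(z) for w, z in per_word.items()}
```

With values normalized this way, a negative per-word score means the word sits below that speaker's own average, which is the scale on which final and non-final words can be compared across speakers.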

Acoustic-Prosodic Cues

• Languages differ in use of intonation
  – E.g. English: declarative fall, question rise
  – Chinese: pitch contour determines word meaning
• At segment boundaries?
  – Surprisingly similar, though not identical
  – Significantly lower pitch at end of segment
  – Significantly lower amplitude at end of segment
  – Significantly longer duration at end of segment

Acoustic-Prosodic Contrasts

[Figure: non-final vs. final words; series: Mandarin normalized pitch, Mandarin normalized intensity; axis range -0.25 to 0]

Learning Boundaries

• Decision tree classifier (Quinlan C4.5)
  – Classification problem
    • For each word, classify as final/non-final
• Features
  – Acoustic-Prosodic:
    • Duration, Pitch, Loudness, Silence
      – Word average, Between-word difference
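
As a rough illustration of the feature list above, the sketch below builds one feature row per word, holding the word-level value and the difference to the following word. The field names (`duration`, `pitch`, `loudness`, `silence_after`) and the function `boundary_features` are hypothetical; the slides do not specify the exact feature encoding.

```python
def boundary_features(words):
    """words: list of dicts with per-word 'duration', 'pitch', 'loudness'
    and 'silence_after' values (hypothetical field names).

    Returns one feature dict per word, suitable as input to a classifier
    that labels each word as final or non-final.
    """
    feats = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        row = {"silence": w["silence_after"]}
        for key in ("duration", "pitch", "loudness"):
            # Word-level average value.
            row[key] = w[key]
            # Between-word difference; 0.0 for the last word (an assumption).
            row[key + "_diff"] = (nxt[key] - w[key]) if nxt else 0.0
        feats.append(row)
    return feats
```

Each row can then be paired with a final/non-final label and passed to any decision tree learner.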

Text Boundary Features

  – Text
    • Information retrieval style
      – Cosine similarity between weighted term vectors
        » tf*idf in 50-word windows
    • Cue phrases
      – N-gram features
        » Identified by BoosTexter (Schapire & Singer, 2000)
      – E.g. “Voice of America”, “Audience”, “Reporting”
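
The similarity feature can be sketched as follows, assuming a precomputed `idf` dictionary and a tokenized word list; both, along with the function names, are illustrative assumptions rather than the talk's actual implementation. Low similarity between the window before a word and the window after it suggests a topic change.

```python
import math
from collections import Counter


def tfidf_vector(window, idf):
    """Weight each term count in a window by its idf (0.0 for unseen terms)."""
    counts = Counter(window)
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def boundary_similarity(tokens, i, idf, window=50):
    """Similarity between the `window` words ending at position i
    and the `window` words that follow it."""
    left = tokens[max(0, i - window + 1): i + 1]
    right = tokens[i + 1: i + 1 + window]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))
```

The 50-word default mirrors the window size on the slide; windows near the start or end of the stream are simply shorter.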

Classification Results

• Balanced training and test sets
  – Results on held-out subsets
• Acoustic cues only
  – 95.6% accuracy
• Text cues (+ silence)
  – 95.6% accuracy
• Combined text and prosody
  – 96.4% accuracy
• Typically, false alarms twice as common as misses

Joint Decision Tree


Feature Assessment

• Role of silence
  – Useful in both text and acoustic classifiers
  – More necessary for text
    • Text captures topicality, not locality
    • Cannot identify boundaries sharply
• Prosodic cues:
  – Localize boundaries
  – Multiple supporting cues: intensity, pitch (contrastive use)

Issue: False Alarms

• Evaluate representative sample
  – Boundaries far rarer than non-boundaries (Boundary <<< Non-boundary)
  – 95.6% accuracy
    • 2% miss, 4.4% false alarms
    • Non-boundary frequent
    • False alarms frequent
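
A short worked calculation shows why the false alarm rate dominates overall accuracy when boundaries are rare. The boundary prior below is an assumption: it takes the "~4,000 stories, ~750K words" figures from the data slide as roughly one boundary per story, which may not match the exact ratio in the evaluated sample.

```python
def overall_accuracy(miss_rate, false_alarm_rate, boundary_prior):
    """Accuracy on a skewed sample.

    miss_rate: fraction of true boundaries classified as non-boundary.
    false_alarm_rate: fraction of non-boundaries classified as boundary.
    boundary_prior: fraction of words that are true boundaries.
    """
    error = boundary_prior * miss_rate + (1.0 - boundary_prior) * false_alarm_rate
    return 1.0 - error


# Assumed prior: about 4,000 boundaries among about 750,000 words.
prior = 4000 / 750000
acc = overall_accuracy(0.02, 0.044, prior)
```

Under this assumed prior, `acc` comes out at roughly 0.956: nearly all of the error mass comes from false alarms on the very frequent non-boundary words.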

Voting Against False Alarms

• Error analysis:
  – Construct per-feature classifiers:
    • Prosody-only, text-only, silence-only
  – Compare classifiers: per-feature, joint
    • Joint + 0 or 1 per-feature classifiers: FALSE ALARM
• Approach: Voting
  – Require joint + 2 per-feature classifiers
• Result: 1/3 reduction in false alarms
  – ~97% accuracy: 2.8% miss, 3.15% false alarm
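
The voting rule can be sketched as a small function. This is a minimal reading of "joint + 2 per-feature classifiers": a boundary is accepted only when the joint classifier predicts one and at least two of the three per-feature classifiers agree. The function name `vote` and its boolean inputs are assumptions, as is the choice to return non-boundary whenever the joint classifier says so.

```python
def vote(joint, prosody, text, silence, min_support=2):
    """Return True (boundary) only if the joint classifier predicts a boundary
    and at least `min_support` per-feature classifiers also predict one.

    All inputs are booleans: True means "boundary".
    """
    if not joint:
        # Assumption: the joint classifier's non-boundary decision is final.
        return False
    support = sum(bool(x) for x in (prosody, text, silence))
    return support >= min_support
```

For example, `vote(True, True, False, False)` rejects the boundary because only one per-feature classifier supports the joint decision, which is the pattern the error analysis associates with false alarms.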

Conclusion

• Mandarin broadcast news segmentation
  – Identify topicality and boundary locality
• Integrate text and acoustic cues
  – Text similarity: vector space model, n-gram cues
  – Prosodic cues: Silence, intensity, pitch, duration
    » Robust across range of languages
• Provide supporting and orthogonal information
• Majority agreement of per-feature classifiers:
  – 1/3 fewer false alarms

Current & Future Work

• Improving the model of topicality
  – Richer text similarity models; broader acoustic models
• Alternative classifiers
  – Preliminary experiments:
    • Boosting, Boosted Decision trees, MaxEnt
      – Comparable
  – Alternative integration strategies
• Hierarchical subtopic segmentation
  – Broadcast news
  – Dialogue: human-computer, human-human
• Integration with multi-modal features: e.g. gesture, gaze

Acoustic-Prosodic Contrasts

[Figure: non-final vs. final words; series: Mandarin normalized pitch, Mandarin normalized intensity, English normalized intensity, English normalized pitch; axis range -0.25 to 0]

Text Decision Tree

Prosodic Decision Tree

The Problem: Speech Topic Segmentation

• Separate audio stream into component topics

On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit. ||