Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News

Gina-Anne Levow

University of Chicago

SIGHAN, July 25, 2004

Roadmap

• The Problem: Mandarin Story Segmentation
• The Tools: Prosodic and Text Cues
  – Mandarin Chinese
• Individual Results
• Integrating Cues
• Conclusion & Future Work

The Problem: Mandarin Speech Topic Segmentation

• Separate audio stream into component topics

Why Segment?

• Enables language understanding tasks
  – Information Retrieval
    • Only regions of interest
  – Summarization
    • Cover all main topics
  – Reference Resolution
    • Pronouns tend to refer within segments

The Challenge

• How do we define/measure topicality?
  – Are two regions on the same topic?
  – Fundamentally requires full understanding
    • How can we approach with partial understanding?
• How do we identify boundaries sharply?
  – Association of sentences may be ambiguous
    • Especially “filler”

The Tools: Prosodic and Text Cues

• Represent local changes at boundaries with audio
  – Silence!, speaker change, pitch, loudness, rate (GHN, AT&T00)
• Represent topicality with text
  – Component words in audio stream
    • Possibly noisy
    • Many possible models (Hearst 94, Beeferman 99, ...)
• Combining Prosody and Text
  – Human annotators are more accurate and confident if they use BOTH transcribed text and original audio (Swerts 97)
  – English broadcast news (Tur et al., 2001)

Data and Processing

• Broadcast News
  – Topic Detection and Tracking TDT3 corpus
  – Voice of America broadcast news
    • ASR transcription
    • Manually segmented – known boundaries
  – ~4,000 stories, ~750K words
• Acoustic analysis (Praat)
  – Automatic pitch, intensity tracking
    • Smoothed, speaker-normalized, per-word
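The slide does not say how pitch and intensity were speaker-normalized. As a minimal sketch, assuming per-speaker z-scoring followed by a per-word average (smoothing is omitted), the step might look like the following. The function name `speaker_normalize` and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def speaker_normalize(frames):
    """frames: list of (speaker, word_index, pitch_hz) tuples,
    where word_index is unique across the whole recording.
    Returns {word_index: mean z-scored pitch for that word}.
    Assumption: z-scoring per speaker; the slides do not name the method."""
    # Collect all pitch samples per speaker to estimate mean and spread.
    by_speaker = defaultdict(list)
    for speaker, _, pitch in frames:
        by_speaker[speaker].append(pitch)
    # Fall back to 1.0 when a speaker has zero variance, to avoid dividing by 0.
    stats = {s: (mean(v), pstdev(v) or 1.0) for s, v in by_speaker.items()}

    # Z-score each sample against its speaker, then average within each word.
    by_word = defaultdict(list)
    for speaker, word, pitch in frames:
        mu, sigma = stats[speaker]
        by_word[word].append((pitch - mu) / sigma)
    return {w: mean(z) for w, z in by_word.items()}
```

The same function could be applied to intensity values; only the input column changes.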

Acoustic-Prosodic Cues

• Languages differ in use of intonation
  – E.g. English: declarative fall, question rise
  – Chinese: pitch contour determines word meaning
• At segment boundaries???
  – Surprisingly similar, though not identical
  – Significantly lower pitch at end of segment
  – Significantly lower amplitude at end of segment
  – Significantly longer duration at end of segment

Acoustic-Prosodic Contrasts

[Figure: Mandarin normalized pitch and normalized intensity, non-final vs. final words; vertical axis from -0.25 to 0]

Learning Boundaries

• Decision tree classifier (Quinlan C4.5)
  – Classification problem
    • For each word, classify as final/non-final
• Features
  – Acoustic-Prosodic:
    • Duration, Pitch, Loudness, Silence
      – Word average, Between-word difference
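To illustrate the "word average" and "between-word difference" features, here is a small sketch. It assumes the difference is taken between each word and the word that follows it; the slides do not specify the direction, and the names `boundary_features`, `word_avg`, and `diff_next` are hypothetical.

```python
def boundary_features(word_values):
    """word_values: list of per-word averages (e.g. normalized pitch).
    Returns one feature dict per word: the word's own average and the
    difference to the following word (None for the last word).
    Assumption: difference = next word minus current word."""
    feats = []
    for i, value in enumerate(word_values):
        has_next = i + 1 < len(word_values)
        diff = word_values[i + 1] - value if has_next else None
        feats.append({"word_avg": value, "diff_next": diff})
    return feats
```

Each dict would be one row for the per-word final/non-final classifier, with one such pair of features per cue (duration, pitch, loudness).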

Text Boundary Features

– Text
  • Information retrieval style
    – Cosine similarity between weighted term vectors
      » tf*idf in 50-word windows
  • Cue phrases
    – N-gram features
      » Identified by BoosTexter (Schapire & Singer, 2000)
    – E.g. “Voice of America”, “Audience”, “Reporting”
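The text-similarity feature can be sketched as a cosine between tf*idf vectors of two adjacent windows. This is an illustration only: it assumes the windows are the 50 words ending at a candidate boundary and the 50 words after it, and that an idf table is supplied by the caller. The slides fix only the window size and the weighting; the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(window, idf):
    """Weight raw term counts in a window by the supplied idf table."""
    tf = Counter(window)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def window_similarity(words, i, idf, size=50):
    """Similarity between the `size` words ending at position i and the
    `size` words that follow it (assumed window placement)."""
    left = words[max(0, i - size + 1): i + 1]
    right = words[i + 1: i + 1 + size]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))
```

A low similarity at position `i` suggests a topic change near that word, but it does not say exactly where, which is why the deck pairs it with local cues such as silence.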

Classification Results

• Balanced training and test sets
  – Results on held-out subsets
• Acoustic cues only
  – 95.6% accuracy
• Text cues (+ silence)
  – 95.6% accuracy
• Combined text and prosody
  – 96.4% accuracy
• Typically, false alarms twice as common as misses

Joint Decision Tree


Feature Assessment

• Role of silence
  – Useful in both text and acoustic classifiers
  – More necessary for text
    • Text captures topicality, not locality
    • Cannot identify boundaries sharply
• Prosodic cues:
  – Localize boundaries
  – Multiple supporting cues: intensity, pitch: contrastive use

Issue: False Alarms

• Evaluate representative sample
  – Boundary <<< Non-boundary
  – 95.6% accuracy
    • 2% miss, 4.4% false alarms
• Non-boundary frequent
• False alarms frequent
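The slide does not define how the miss and false-alarm percentages were computed. The sketch below assumes the common convention: miss rate is measured over true boundaries and false-alarm rate over true non-boundaries. The function name `error_rates` is hypothetical. Under this assumption, when non-boundary words vastly outnumber boundaries, overall accuracy is dominated by the false-alarm rate, which would be consistent with 95.6% accuracy appearing alongside 4.4% false alarms.

```python
def error_rates(gold, pred):
    """gold, pred: equal-length lists of booleans (True = boundary).
    Assumed definitions: miss rate = missed boundaries / true boundaries;
    false-alarm rate = spurious boundaries / true non-boundaries."""
    pairs = list(zip(gold, pred))
    misses = sum(1 for g, p in pairs if g and not p)
    false_alarms = sum(1 for g, p in pairs if p and not g)
    n_boundary = sum(1 for g in gold if g)
    n_non_boundary = len(gold) - n_boundary
    correct = sum(1 for g, p in pairs if g == p)
    return {
        "miss_rate": misses / n_boundary if n_boundary else 0.0,
        "false_alarm_rate": false_alarms / n_non_boundary if n_non_boundary else 0.0,
        "accuracy": correct / len(gold) if gold else 0.0,
    }
```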

Voting Against False Alarms

• Error analysis:
  – Construct per-feature classifiers:
    • Prosody-only, text-only, silence-only
  – Compare classifiers: per-feature, joint
    • Joint + 0 or 1 per-feature classifiers agreeing: FALSE ALARM
• Approach: Voting
  – Require joint + 2 per-feature classifiers
• Result: 1/3 reduction in false alarms
  – ~97% accuracy: 2.8% miss, 3.15% false alarm
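The voting rule can be written in a few lines. This is a sketch of the rule as stated on the slide (joint classifier plus at least two per-feature classifiers must agree); the function name `vote` and the dict keys are hypothetical.

```python
def vote(joint, per_feature, min_support=2):
    """joint: bool decision of the joint classifier (True = boundary).
    per_feature: dict mapping classifier name -> bool decision,
    e.g. {"prosody": True, "text": False, "silence": True}.
    A boundary is accepted only if the joint classifier fires AND at
    least `min_support` per-feature classifiers also fire."""
    if not joint:
        return False
    support = sum(1 for decision in per_feature.values() if decision)
    return support >= min_support
```

For example, `vote(True, {"prosody": True, "text": False, "silence": False})` returns `False`: the joint classifier is backed by only one per-feature classifier, the pattern the error analysis associates with false alarms.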

Conclusion

• Mandarin broadcast news segmentation
  – Identify topicality and boundary locality
• Integrate text and acoustic cues
  – Text similarity: vector space model, n-gram cues
  – Prosodic cues: Silence, intensity, pitch, duration
    » Robust across range of languages
• Provide supporting and orthogonal information
• Majority agreement of per-feature classifiers:
  – 1/3 fewer false alarms

Current & Future Work

• Improving the model of topicality
  – Richer text similarity models; broader acoustic models
• Alternative classifiers
  – Preliminary experiments:
    • Boosting, Boosted Decision trees, MaxEnt
      – Comparable
  – Alternative integration strategies
• Hierarchical subtopic segmentation
  – Broadcast news
  – Dialogue: human-computer, human-human
• Integration with multi-modal features: e.g. gesture, gaze

Acoustic-Prosodic Contrasts

[Figure: normalized pitch and normalized intensity for Mandarin and English, non-final vs. final words; vertical axis from -0.25 to 0]

Text Decision Tree

Prosodic Decision Tree

The Problem: Speech Topic Segmentation

• Separate audio stream into component topics

On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. || And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit. ||
