Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News
Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News
Gina-Anne Levow
University of Chicago
SIGHAN
July 25, 2004
Roadmap
• The Problem: Mandarin Story Segmentation
• The Tools: Prosodic and Text Cues
  – Mandarin Chinese
• Individual Results
• Integrating Cues
• Conclusion & Future Work
The Problem: Mandarin Speech Topic Segmentation
• Separate audio stream into component topics
Why Segment?
• Enables language understanding tasks
  – Information Retrieval
    • Only regions of interest
  – Summarization
    • Cover all main topics
  – Reference Resolution
    • Pronouns tend to refer within segments
The Challenge
• How do we define/measure topicality?
  – Are two regions on the same topic?
  – Fundamentally requires full understanding
    • How can we approach it with partial understanding?
• How do we identify boundaries sharply?
  – Association of sentences may be ambiguous
    • Especially “filler”
The Tools: Prosodic and Text Cues
• Represent local changes at boundaries with audio
  – Silence!, speaker change, pitch, loudness, rate (GHN, AT&T00)
• Represent topicality with text
  – Component words in audio stream
    • Possibly noisy
    • Many possible models (Hearst 94, Beeferman 99, ...)
• Combining Prosody and Text
  – Human annotators are more accurate and more confident if they use BOTH transcribed text and original audio (Swerts 97)
  – English broadcast news (Tur et al., 2001)
Data and Processing
• Broadcast News
  – Topic Detection and Tracking TDT3 corpus
  – Voice of America broadcast news
    • ASR transcription
    • Manually segmented: known boundaries
  – ~4,000 stories, ~750K words
• Acoustic analysis (Praat)
  – Automatic pitch, intensity tracking
    • Smoothed, speaker-normalized, per-word
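
The slide does not say how pitch was speaker-normalized. One common choice is a per-speaker z-score over per-word pitch values, sketched below. The record layout (`speaker`, `pitch` keys) and the z-score itself are assumptions for illustration, not the method confirmed by the talk.

```python
from collections import defaultdict
from statistics import mean, pstdev


def speaker_normalize(words):
    """Add a per-speaker z-scored pitch to each word record.

    `words` is a list of dicts with hypothetical keys:
      'speaker': speaker label, 'pitch': per-word mean pitch (e.g. in Hz).
    """
    # Collect every pitch value observed for each speaker.
    by_speaker = defaultdict(list)
    for w in words:
        by_speaker[w["speaker"]].append(w["pitch"])

    # Mean and (population) standard deviation per speaker.
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}

    normalized = []
    for w in words:
        mu, sigma = stats[w["speaker"]]
        # A speaker with constant pitch has sigma == 0; map to 0.0.
        z = (w["pitch"] - mu) / sigma if sigma > 0 else 0.0
        normalized.append({**w, "pitch_norm": z})
    return normalized
```

The same function could be applied to intensity by swapping the key; smoothing of the raw pitch track would happen before this step.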
Acoustic-Prosodic Cues
• Languages differ in use of intonation
  – E.g. English: declarative fall, question rise
  – Chinese: pitch contour determines word meaning
• At segment boundaries?
  – Surprisingly similar, though not identical
  – Significantly lower pitch at end of segment
  – Significantly lower amplitude at end of segment
  – Significantly longer duration at end of segment
Acoustic-Prosodic Contrasts
[Figure: bar chart of Mandarin normalized pitch and Mandarin normalized intensity for non-final vs. final words; vertical axis from -0.25 to 0]
Learning Boundaries
• Decision tree classifier (Quinlan C4.5)
  – Classification problem
    • For each word, classify as final/non-final
• Features
  – Acoustic-Prosodic:
    • Duration, Pitch, Loudness, Silence
      – Word average, Between-word difference
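
As a rough illustration of the per-word and between-word features listed above, the sketch below builds one feature dictionary per candidate word. The field names (`start`, `end`, `pitch`, `intensity`) and the choice to difference against the following word are assumptions; the slides do not specify the exact feature definitions.

```python
def boundary_features(words, i):
    """Acoustic-prosodic features for deciding whether words[i] is segment-final.

    Each word is a dict with hypothetical keys:
      'start', 'end': word times in seconds
      'pitch', 'intensity': speaker-normalized per-word averages
    """
    cur = words[i]
    nxt = words[i + 1] if i + 1 < len(words) else None

    # Word-level features.
    feats = {
        "duration": cur["end"] - cur["start"],
        "pitch": cur["pitch"],
        "intensity": cur["intensity"],
    }

    if nxt is None:
        # Last word in the stream: no following word to compare against.
        feats.update(pitch_diff=0.0, intensity_diff=0.0, silence=0.0)
    else:
        # Between-word differences and the pause before the next word.
        feats.update(
            pitch_diff=nxt["pitch"] - cur["pitch"],
            intensity_diff=nxt["intensity"] - cur["intensity"],
            silence=max(0.0, nxt["start"] - cur["end"]),
        )
    return feats
```

Each dictionary would then be one training instance for the final/non-final classifier.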
Text Boundary Features
  – Text
    • Information retrieval style
      – Cosine similarity between weighted term vectors
        » tf*idf in 50-word windows
    • Cue phrases
      – N-gram features
        » Identified by BoosTexter (Schapire & Singer, 2000)
      – E.g. “Voice of America”, “Audience”, “Reporting”
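
The text similarity feature can be sketched as the cosine between tf*idf vectors of the 50-word windows on either side of a candidate boundary. The function names, the precomputed `idf` dictionary, and the exact window alignment are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Weight raw term counts by idf; terms missing from `idf` get weight 0."""
    counts = Counter(tokens)
    return {term: n * idf.get(term, 0.0) for term, n in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def window_similarity(tokens, i, idf, width=50):
    """Similarity between the `width` words ending at position i
    and the `width` words that follow it."""
    left = tokens[max(0, i - width + 1): i + 1]
    right = tokens[i + 1: i + 1 + width]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))
```

A low similarity at position `i` suggests a topic change near that word; on its own it does not say exactly where the boundary falls, which is where silence and the prosodic cues come in.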
Classification Results
• Balanced training and test sets
  – Results on held-out subsets
• Acoustic cues only
  – 95.6% accuracy
• Text cues (+ silence)
  – 95.6% accuracy
• Combined text and prosody
  – 96.4% accuracy
• Typically, false alarms are twice as common as misses
Joint Decision Tree
[Figure: joint decision tree]
Feature Assessment
• Role of silence
  • Useful in both text and acoustic classifiers
  • More necessary for text
    • Text captures topicality, not locality
    • Cannot identify boundaries sharply
• Prosodic cues:
  • Localize boundaries
  • Multiple supporting cues: intensity, pitch: contrastive use
Issue: False Alarms
• Evaluate representative sample
  – Boundary <<< Non-boundary
  – 95.6% accuracy
    • 2% miss, 4.4% false alarms
    • Non-boundary frequent
    • False alarms frequent
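
To make the miss and false-alarm figures concrete, the sketch below computes both rates along with accuracy from gold and predicted labels. It assumes the miss rate is taken over true boundaries and the false-alarm rate over true non-boundaries; the slides do not state the denominators explicitly.

```python
def error_rates(gold, pred):
    """Miss rate, false-alarm rate, and accuracy for boundary detection.

    `gold` and `pred` are equal-length sequences of booleans,
    where True means the word is segment-final.
    """
    pairs = list(zip(gold, pred))
    misses = sum(1 for g, p in pairs if g and not p)
    false_alarms = sum(1 for g, p in pairs if p and not g)
    n_boundary = sum(1 for g in gold if g)
    n_nonboundary = len(pairs) - n_boundary
    correct = len(pairs) - misses - false_alarms
    return {
        "miss_rate": misses / n_boundary if n_boundary else 0.0,
        "false_alarm_rate": false_alarms / n_nonboundary if n_nonboundary else 0.0,
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }
```

Under these definitions, when non-boundary words vastly outnumber boundaries, accuracy is dominated by the false-alarm rate, and even a small false-alarm rate produces many more spurious boundaries than true ones.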
Voting Against False Alarms
• Error analysis:
  – Construct per-feature classifiers:
    • Prosody-only, text-only, silence-only
  – Compare classifiers: per-feature, joint
    • Joint + 0 or 1 per-feature classifiers: FALSE ALARM
• Approach: Voting
  – Require joint + 2 per-feature classifiers
• Result: 1/3 reduction in false alarms
  – ~97% accuracy: 2.8% miss, 3.15% false alarm
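
The voting rule can be sketched as follows: a boundary is accepted only when the joint classifier proposes it and at least two of the three per-feature classifiers agree. The function name and the boolean-input interface are assumptions for illustration.

```python
def vote_boundary(joint, prosody, text, silence, min_support=2):
    """Accept a boundary only if the joint classifier proposes it and at
    least `min_support` per-feature classifiers also predict a boundary.

    All four inputs are booleans (True = predicts segment-final).
    """
    if not joint:
        return False
    # Booleans sum as 0/1, giving the number of agreeing classifiers.
    support = sum([prosody, text, silence])
    return support >= min_support
```

For example, `vote_boundary(True, True, False, False)` rejects the boundary, matching the error-analysis observation that joint decisions backed by only 0 or 1 per-feature classifiers tend to be false alarms.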
Conclusion
• Mandarin broadcast news segmentation
  – Identify topicality and boundary locality
• Integrate text and acoustic cues
  – Text similarity: vector space model, n-gram cues
  – Prosodic cues: Silence, intensity, pitch, duration
    » Robust across range of languages
    • Provide supporting and orthogonal information
• Majority agreement of per-feature classifiers:
  – 1/3 fewer false alarms
Current & Future Work
• Improving the model of topicality
  – Richer text similarity models; broader acoustic models
• Alternative classifiers
  – Preliminary experiments:
    • Boosting, Boosted Decision trees, MaxEnt
      – Comparable
  – Alternative integration strategies
• Hierarchical subtopic segmentation
  – Broadcast news
  – Dialogue: human-computer, human-human
• Integration with multi-modal features: e.g. gesture, gaze
Acoustic-Prosodic Contrasts
[Figure: bar chart of normalized pitch and normalized intensity for non-final vs. final words, Mandarin and English; vertical axis from -0.25 to 0]
Text Decision Tree
Prosodic Decision Tree
The Problem: Speech Topic Segmentation
• Separate audio stream into component topics
On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. ||And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit.||