Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News
Gina-Anne Levow
University of Chicago
SIGHAN
July 25, 2004
Roadmap
• The Problem: Mandarin Story Segmentation
• The Tools: Prosodic and Text Cues
  – Mandarin Chinese
• Individual Results
• Integrating Cues
• Conclusion & Future Work
The Problem: Mandarin Speech Topic Segmentation
• Separate audio stream into component topics
Why Segment?
• Enables language understanding tasks
  – Information Retrieval
    • Only regions of interest
  – Summarization
    • Cover all main topics
  – Reference Resolution
    • Pronouns tend to refer within segments
The Challenge
• How do we define/measure topicality?
  – Are two regions on the same topic?
  – Fundamentally requires full understanding
• How can we approach with partial understanding?
• How do we identify boundaries sharply?
  – Association of sentences may be ambiguous
    • Especially “filler”
The Tools: Prosodic and Text Cues
• Represent local changes at boundaries with audio
  – Silence!, speaker change, pitch, loudness, rate (GHN, AT&T00)
• Represent topicality with text
  – Component words in audio stream
    • Possibly noisy
    • Many possible models (Hearst 94, Beeferman 99, ...)
• Combining Prosody and Text
  – Human annotators are more accurate and confident when they use BOTH transcribed text and original audio (Swerts 97)
  – English broadcast news (Tur et al., 2001)
Data and Processing
• Broadcast News
  – Topic Detection and Tracking TDT3 corpus
  – Voice of America broadcast news
    • ASR transcription
    • Manually segmented: known boundaries
  – ~4,000 stories, ~750K words
• Acoustic analysis (Praat)
  – Automatic pitch, intensity tracking
    • Smoothed, speaker-normalized, per-word
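The per-word, speaker-normalized values above could be computed in several ways; the slides do not say which. A minimal sketch, assuming z-score normalization against each speaker's own mean and standard deviation (the function name `speaker_normalize` and the record fields `speaker` and `pitch` are hypothetical, chosen only for illustration):

```python
from collections import defaultdict
from statistics import mean, pstdev


def speaker_normalize(words):
    """Return one z-score per word, relative to that word's speaker.

    `words` is a list of dicts with hypothetical keys:
      'speaker' - speaker label
      'pitch'   - mean pitch over the word (any consistent unit)
    Z-scoring is an assumption; the source only says "speaker-normalized".
    """
    # Collect every per-word pitch value for each speaker.
    by_speaker = defaultdict(list)
    for w in words:
        by_speaker[w["speaker"]].append(w["pitch"])

    # Per-speaker mean and population standard deviation.
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}

    normalized = []
    for w in words:
        mu, sd = stats[w["speaker"]]
        # Guard against a speaker with no variation (e.g. a single word).
        normalized.append((w["pitch"] - mu) / sd if sd > 0 else 0.0)
    return normalized
```

The same routine would apply unchanged to per-word intensity values.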
Acoustic-Prosodic Cues
• Languages differ in use of intonation
  – E.g. English: declarative fall, question rise
  – Chinese: pitch contour determines word meaning
• At segment boundaries?
  – Surprisingly similar, though not identical
  – Significantly lower pitch at end of segment
  – Significantly lower amplitude at end of segment
  – Significantly longer duration at end of segment
Acoustic-Prosodic Contrasts
[Figure: Mandarin normalized pitch and Mandarin normalized intensity, non-final vs. final words; axis range -0.25 to 0]
Learning Boundaries
• Decision tree classifier (Quinlan C4.5)
  – Classification problem
    • For each word, classify as final/non-final
• Features
  – Acoustic-Prosodic:
    • Duration, Pitch, Loudness, Silence
      – Word average, Between-word difference
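As an illustration of the "word average" and "between-word difference" features, here is a minimal sketch. It assumes the difference is taken between each word and the word that follows it, which the slides do not specify; `boundary_features` is a hypothetical helper name.

```python
def boundary_features(values):
    """Pair each word's value with its difference to the following word.

    `values` holds one per-word average (e.g. normalized pitch) per word.
    Comparing with the *next* word is an assumption made for illustration.
    The last word has no successor, so its difference is set to 0.0.
    """
    features = []
    for i, value in enumerate(values):
        following = values[i + 1] if i + 1 < len(values) else value
        features.append((value, following - value))
    return features
```

Each resulting pair would be one part of the feature vector for classifying that word as final or non-final.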
Text Boundary Features
– Text
  • Information retrieval style
    – Cosine similarity between weighted term vectors
      » tf*idf in 50-word windows
  • Cue phrases
    – N-gram features
      » Identified by BoosTexter (Schapire & Singer, 2000)
    – E.g. “Voice of America”, “Audience”, “Reporting”
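The text-similarity feature can be sketched as follows. This is a minimal illustration, assuming the two windows are the 50 words ending at the candidate word and the 50 words after it, and that idf weights are supplied in a precomputed dictionary; the slides confirm only tf*idf weighting, cosine similarity, and the 50-word window size. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vector(window, idf):
    """Sparse tf*idf vector for a list of words; unseen terms get weight 0."""
    tf = Counter(window)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def window_similarity(words, i, idf, size=50):
    """Similarity of the window ending at word i and the window after it.

    Window placement relative to the candidate word is an assumption.
    """
    left = words[max(0, i - size + 1): i + 1]
    right = words[i + 1: i + 1 + size]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))
```

Low similarity across a candidate position suggests a topic change there, while high similarity suggests the same story continues.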
Classification Results
• Balanced training and test sets
  – Results on held-out subsets
• Acoustic cues only
  – 95.6% accuracy
• Text cues (+ silence)
  – 95.6% accuracy
• Combined text and prosody
  – 96.4% accuracy
• Typically, false alarms are twice as common as misses
Joint Decision Tree
Feature Assessment
• Role of silence
  • Useful in both text and acoustic classifiers
  • More necessary for text
    • Text captures topicality, not locality
    • Cannot identify boundaries sharply
• Prosodic cues:
  • Localize boundaries
  • Multiple supporting cues: intensity, pitch: contrastive use
Issue: False Alarms
• Evaluate representative sample
  – Boundary <<< Non-boundary
  – 95.6% accuracy
    • 2% miss, 4.4% false alarms
• Non-boundary frequent
• False alarms frequent
Voting Against False Alarms
• Error analysis:
  – Construct per-feature classifiers:
    • Prosody-only, text-only, silence-only
  – Compare classifiers: per-feature, joint
    • Joint + 0 or 1 per-feature classifier: FALSE ALARM
• Approach: Voting
  – Require joint + 2 per-feature classifiers
• Result: 1/3 reduction in false alarms
  – ~97% accuracy: 2.8% miss, 3.15% false alarm
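The voting rule above can be written as a small predicate. This is a minimal sketch, assuming each classifier outputs a boolean "segment-final" decision for a word and that "joint + 2" means at least two of the three per-feature classifiers agree; the function and parameter names are hypothetical.

```python
def vote(joint, prosody, text, silence, min_support=2):
    """Accept a boundary only if the joint classifier fires and at least
    `min_support` of the three per-feature classifiers agree.

    Every argument except `min_support` is a bool: True means that
    classifier labels the word as segment-final.
    """
    support = sum([prosody, text, silence])
    return bool(joint and support >= min_support)
```

For example, `vote(True, True, False, False)` rejects the boundary because only one per-feature classifier supports the joint decision, which is the pattern the error analysis associates with false alarms.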
Conclusion
• Mandarin broadcast news segmentation
  – Identify topicality and boundary locality
• Integrate text and acoustic cues
  – Text similarity: vector space model, n-gram cues
  – Prosodic cues: Silence, intensity, pitch, duration
    » Robust across range of languages
• Provide supporting and orthogonal information
• Majority agreement of per-feature classifiers:
  – 1/3 fewer false alarms
Current & Future Work
• Improving the model of topicality
  – Richer text similarity models; broader acoustic models
• Alternative classifiers
  – Preliminary experiments:
    • Boosting, Boosted Decision trees, MaxEnt
      – Comparable
  – Alternative integration strategies
• Hierarchical subtopic segmentation
  – Broadcast news
  – Dialogue: human-computer, human-human
• Integration with multi-modal features: e.g. gesture, gaze
Acoustic-Prosodic Contrasts
[Figure: normalized pitch and normalized intensity for Mandarin and English, non-final vs. final words; axis range -0.25 to 0]
Text Decision Tree
Prosodic Decision Tree
The Problem: Speech Topic Segmentation
• Separate audio stream into component topics
On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. ||And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit.||