Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News
![Page 1: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/1.jpg)
Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News
Gina-Anne Levow
University of Chicago
SIGHAN
July 25, 2004
![Page 2: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/2.jpg)
Roadmap
• The Problem: Mandarin Story Segmentation
• The Tools: Prosodic and Text Cues
  – Mandarin Chinese
• Individual Results
• Integrating Cues
• Conclusion & Future Work
![Page 3: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/3.jpg)
The Problem: Mandarin Speech Topic Segmentation
• Separate audio stream into component topics
![Page 4: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/4.jpg)
Why Segment?
• Enables language understanding tasks
  – Information Retrieval
    • Only regions of interest
  – Summarization
    • Cover all main topics
  – Reference Resolution
    • Pronouns tend to refer within segments
![Page 5: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/5.jpg)
The Challenge
• How do we define/measure topicality?
  – Are two regions on the same topic?
  – Fundamentally requires full understanding
    • How can we approach with partial understanding?
• How do we identify boundaries sharply?
  – Association of sentences may be ambiguous
    • Especially, “filler”
![Page 6: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/6.jpg)
The Tools: Prosodic and Text Cues
• Represent local changes at boundaries with audio
  – Silence!, speaker change, pitch, loudness, rate (GHN, AT&T00)
• Represent topicality with text
  – Component words in audio stream
    • Possibly noisy
    • Many possible models (Hearst 94, Beeferman 99, ...)
• Combining Prosody and Text
  – Human annotators more accurate, confident if use BOTH transcribed text and original audio!! (Swerts 97)
  – English broadcast news (Tur et al., 2001)
![Page 7: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/7.jpg)
Data and Processing
• Broadcast News
  – Topic Detection and Tracking TDT3 corpus
  – Voice of America broadcast news
    • ASR transcription
    • Manually segmented – known boundaries
  – ~4,000 stories, ~750K words
• Acoustic analysis (Praat)
  – Automatic pitch, intensity tracking
    • Smoothed, speaker-normalized, per-word
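One way to read "speaker-normalized, per-word" is sketched below: each pitch sample is z-scored against its own speaker's mean and standard deviation, and the z-scores are then averaged over each word. The slides do not say which normalization was used, so the z-score choice, the function name `per_word_normalized_pitch`, and the input layout are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_word_normalized_pitch(samples):
    """samples: list of (speaker, word_index, pitch_hz) tuples (hypothetical layout).

    Returns {word_index: mean z-scored pitch}, where every pitch value is
    z-scored against the statistics of its own speaker.
    """
    # Collect all pitch values per speaker to estimate speaker statistics.
    by_speaker = defaultdict(list)
    for speaker, _, pitch in samples:
        by_speaker[speaker].append(pitch)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_speaker.items()}

    # Z-score each sample, then group the z-scores by word.
    by_word = defaultdict(list)
    for speaker, word, pitch in samples:
        mu, sigma = stats[speaker]
        z = (pitch - mu) / sigma if sigma > 0 else 0.0
        by_word[word].append(z)
    return {w: mean(zs) for w, zs in by_word.items()}
```

The same function could be applied to intensity samples; smoothing of the raw tracks is left out of this sketch.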
![Page 8: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/8.jpg)
Acoustic-Prosodic Cues
• Languages differ in use of intonation
  – E.g. English: declarative fall, question rise
  – Chinese: pitch contour determines word meaning
• At segment boundaries???
  – Surprisingly similar, though not identical
  – Significantly lower pitch at end of segment
  – Significantly lower amplitude at end of segment
  – Significantly longer duration at end of segment
![Page 9: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/9.jpg)
Acoustic-Prosodic Contrasts
[Figure: chart comparing non-final and final words on Mandarin normalized pitch and Mandarin normalized intensity; vertical axis from -0.25 to 0]
![Page 10: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/10.jpg)
Learning Boundaries
• Decision tree classifier (Quinlan C4.5)
  – Classification problem
    • For each word, classify as final/non-final
• Features
  – Acoustic-Prosodic:
    • Duration, Pitch, Loudness, Silence
      – Word average, Between-word difference
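The per-word feature set above (word averages plus between-word differences) can be sketched as follows. This is an illustrative assumption about the layout, not the feature extractor used in the work: the dictionary keys, the choice to difference only pitch and intensity, and the name `boundary_features` are hypothetical. The resulting rows would then be handed to a decision tree learner such as C4.5.

```python
def boundary_features(words):
    """words: list of dicts with per-word 'pitch', 'intensity', 'duration',
    and 'silence_after' (seconds of silence following the word).

    Returns one feature dict per word: the word's own averages plus the
    difference between the following word and this word.
    """
    feats = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        f = {
            "pitch": w["pitch"],
            "intensity": w["intensity"],
            "duration": w["duration"],
            "silence_after": w["silence_after"],
        }
        for key in ("pitch", "intensity"):
            # Between-word difference; 0.0 for the last word, which has no successor.
            f[key + "_diff"] = (nxt[key] - w[key]) if nxt is not None else 0.0
        feats.append(f)
    return feats
```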
![Page 11: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/11.jpg)
Text Boundary Features
  – Text
    • Information retrieval style
      – Cosine similarity between weighted term vectors
        » tf*idf in 50-word windows
    • Cue phrases
      – N-gram features
        » Identified by BoosTexter (Schapire & Singer, 2000)
      – E.g. “Voice of America”, “Audience”, “Reporting”
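A minimal sketch of the similarity feature: the cosine between tf*idf vectors built from the 50 words before and the 50 words after a candidate boundary. The slides give only "tf*idf in 50-word windows", so the window placement, the precomputed `idf` dictionary, and the function names are assumptions.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Term frequency in the window times a precomputed idf weight."""
    counts = Counter(tokens)
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def window_similarity(tokens, i, idf, window=50):
    """Similarity between the `window` tokens ending at position i (inclusive)
    and the `window` tokens that follow it."""
    left = tokens[max(0, i + 1 - window): i + 1]
    right = tokens[i + 1: i + 1 + window]
    return cosine(tfidf_vector(left, idf), tfidf_vector(right, idf))
```

A low similarity at position `i` suggests that the vocabulary changes there, which is weak evidence of a topic shift but does not by itself pinpoint the boundary word.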
![Page 12: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/12.jpg)
Classification Results
• Balanced training and test sets
  – Results on held-out subsets
• Acoustic cues only
  – 95.6% accuracy
• Text cues (+ silence)
  – 95.6% accuracy
• Combined text and prosody
  – 96.4% accuracy
• Typically, false alarms twice as common as misses
![Page 13: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/13.jpg)
Joint Decision Tree
![Page 14: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/14.jpg)
Feature Assessment
• Role of silence
  • Useful in both text and acoustic classifiers
  • More necessary for text
    • Text captures topicality, not locality
    • Cannot identify boundaries sharply
• Prosodic cues:
  • Localize boundaries
  • Multiple supporting cues: intensity, pitch: contrastive use
![Page 15: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/15.jpg)
Issue: False Alarms
• Evaluate representative sample
  – Boundary <<< Non-boundary
  – 95.6% accuracy
    • 2% miss, 4.4% false alarms
• Non-boundary frequent
• False alarms frequent
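The arithmetic behind this slide can be worked through under two assumptions that the slides do not state: that the miss and false-alarm figures are rates within their own class (boundary words and non-boundary words respectively), and that the corpus has roughly one boundary per story (~4,000 boundaries in ~750K words). The function name `expected_counts` is hypothetical.

```python
def expected_counts(n_words, n_boundaries, miss_rate, false_alarm_rate):
    """Expected error counts when rates are conditional on the true class."""
    n_non_boundaries = n_words - n_boundaries
    misses = miss_rate * n_boundaries
    false_alarms = false_alarm_rate * n_non_boundaries
    accuracy = 1.0 - (misses + false_alarms) / n_words
    return {"misses": misses, "false_alarms": false_alarms, "accuracy": accuracy}


# Corpus-scale figures taken from the Data and Processing slide (approximate).
counts = expected_counts(750_000, 4_000, miss_rate=0.02, false_alarm_rate=0.044)
```

Under these assumptions the accuracy comes out close to the reported 95.6%, yet the expected number of false alarms is several times the number of true boundaries, because non-boundary words vastly outnumber boundary words.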
![Page 16: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/16.jpg)
Voting Against False Alarms
• Error analysis:
  – Construct per-feature classifiers:
    • Prosody-only, text-only, silence-only
  – Compare classifiers: per-feature, joint
    • Joint + 0 or 1 per-feature classifier: FALSE ALARM
• Approach: Voting
  – Require joint + 2 per-feature classifiers
• Result: 1/3 reduction in false alarms
  – ~97% accuracy: 2.8% miss, 3.15% false alarm
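The voting rule can be sketched as a small predicate: a word is accepted as a boundary only when the joint classifier says so and at least two of the three per-feature classifiers agree. The function name `vote` and the boolean inputs are illustrative assumptions; how each classifier produces its decision is outside this sketch.

```python
def vote(joint, prosody, text, silence, min_support=2):
    """Return True (boundary) only if the joint classifier predicts a boundary
    and at least `min_support` per-feature classifiers also predict one."""
    support = sum(bool(p) for p in (prosody, text, silence))
    return bool(joint) and support >= min_support
```

For example, `vote(True, True, False, False)` rejects the boundary because only one per-feature classifier supports the joint decision, which is the pattern the error analysis associates with false alarms.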
![Page 17: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/17.jpg)
Conclusion
• Mandarin broadcast news segmentation
  – Identify topicality and boundary locality
• Integrate text and acoustic cues
  – Text similarity: vector space model, n-gram cues
  – Prosodic cues: Silence, intensity, pitch, duration
    » Robust across range of languages
• Provide supporting and orthogonal information
• Majority agreement of per-feature classifiers:
  – 1/3 fewer false alarms
![Page 18: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/18.jpg)
Current & Future Work
• Improving the model of topicality
  – Richer text similarity models; broader acoustic models
• Alternative classifiers
  – Preliminary experiments:
    • Boosting, Boosted Decision trees, MaxEnt
      – Comparable
  – Alternative integration strategies
• Hierarchical subtopic segmentation
  – Broadcast news
  – Dialogue: human-computer, human-human
• Integration with multi-modal features: e.g. gesture, gaze
![Page 19: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/19.jpg)
Acoustic-Prosodic Contrasts
[Figure: chart comparing non-final and final words on normalized pitch and normalized intensity for Mandarin and English; vertical axis from -0.25 to 0]
![Page 20: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/20.jpg)
Text Decision Tree
![Page 21: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/21.jpg)
Prosodic Decision Tree
![Page 22: Combining Prosodic and Text Features for Segmentation of Mandarin Broadcast News](https://reader035.fdocuments.net/reader035/viewer/2022062521/56812c2b550346895d90a863/html5/thumbnails/22.jpg)
The Problem: Speech Topic Segmentation
• Separate audio stream into component topics
On "World News Tonight" this Thursday, another bad day on stock markets, all over the world global economic anxiety. || Another massacre in Kosovo, the U.S. and its allies prepare to do something about it. Very slowly. ||And the millennium bug, Lubbock Texas prepares for catastrophe, India sees only profit.||