
Recognizing Discourse Structure: Speech

Discourse & Dialogue

CMSC 35900-1

October 11, 2006

Roadmap

• Recognizing discourse structure in speech

• Analyzing spoken monologue

• Automatic topic segmentation
  – Acoustic cues, text cues, and integration

• Conclusions & Plans

Recognizing Discourse Structure

• Hypothesis:
  – Discourse can be decomposed into subunits

• Formal written text
  – Clues to structure: paragraphs, chapters, sections

• Spoken discourse
  – Lacks orthographic cues
  – Are compensating features available?

Prosody & Discourse Structure

• Discourse structure model
  – Grosz & Sidner 1986
  – Global structure: discourse segments, embedding
  – Local structure: prominence, salience

• Linguistic structure includes intonation
  – Signals global or local structure

• Use of phrases to signal global structure

• Signals parentheticals

Intonational Features

• Theoretical framework
  – Tones and Break Indices (ToBI; Pierrehumbert)

• Tone: pitch contours; Breaks: phrase units

• “Intermediate” phrases are basic units

• Features:
  • Pitch range within and between phrases

• Amplitude (loudness)

• Pitch contour type

• Speaking rate (syll/sec)

• Inter-phrase pause duration
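As a rough illustration, phrase-level features like these could be computed as follows; the `Phrase` record and all sample values are hypothetical, not from the lecture:

```python
# Sketch of the phrase-level prosodic features listed above.
# The Phrase record and sample values are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Phrase:
    f0: List[float]    # pitch samples (Hz) within the phrase
    n_syllables: int
    start: float       # seconds
    end: float

def pitch_range(p: Phrase) -> float:
    """Pitch range within a phrase: max f0 minus min f0."""
    return max(p.f0) - min(p.f0)

def speaking_rate(p: Phrase) -> float:
    """Syllables per second over the phrase duration."""
    return p.n_syllables / (p.end - p.start)

def inter_phrase_pause(prev: Phrase, nxt: Phrase) -> float:
    """Silent gap between the end of one phrase and the start of the next."""
    return nxt.start - prev.end

a = Phrase(f0=[180.0, 220.0, 200.0], n_syllables=6, start=0.0, end=2.0)
b = Phrase(f0=[150.0, 170.0], n_syllables=4, start=2.5, end=4.0)
print(pitch_range(a), speaking_rate(a), inter_phrase_pause(a, b))  # 40.0 3.0 0.5
```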

Speech Corpora

• Vary on:
  – Speaker type: professional/not
  – Speaking style: read/spontaneous
  – Speech content: news/directions/etc.

• Variability in prosody too…

Pilot Study I: Newswire

• 3 professionally read AP newswire stories

• Manual segmentation: text-only vs. speech
  – Consensus labels: SB, SF

• Correlation of pitch range, amplitude, rate
  – Can identify structure via hand-labelings

• Issues:
  – Difficult labeling; idiosyncratic broadcast news (BN) speech

Pilot Study II: Prominence and Discourse

• Prominence: accent/stress on a word
  – Typically associated with NEW information
  – Contrast:
    • Locally NEW (in segment) vs. globally NEW

• Analyzed all NPs in 20 min of spontaneous speech

• Differences in position and form influence accent
  – Full forms accented; pronouns etc. not
  – Mismatches imply role of global/local status

• Issues:
  – Difficult labeling; use of full names or pronouns

Direction-giving Corpus

• Spontaneous/read speech; non-professional speakers
  – Task-oriented: give directions, vary complexity

• Speakers returned later to read their original transcriptions

• Discourse segment labeling: text vs. speech
  – More consensus labels for speech than text

• Speech allows more reliable segmentation

• Spontaneous more reliable than read (medial)

Acoustic Analysis

• Features:
  • Max/mean f0 (pitch), amplitude, rate, pause (pre/post)

• Findings:
  • Segment beginnings: higher max/mean f0, amplitude
    – Shorter following pause (longer preceding pause in read speech)
  • Segment endings: lower max/mean f0, amplitude

• Similar for text (T) and speech (S) annotations

• Issues: Single speaker

Prominence and Discourse

• NPs annotated for:
  – Lexical form (full NP/pronoun), grammatical role, surface position (sentence/phrase), accent
  – 23% reduced stress

• Effect of form, role

• Repetition not necessarily reduced
  – Also find reduced forms in contrasts

Summary

• Clear prosodic cues to discourse structure
  – Across speakers, speaking style, content
  – Initiation:
    • High max/average pitch, amplitude; preceding pause
  – Finality is the converse

• Information status
  – Few clear correlates with accentuation

• Mediated by form, grammatical role

Prosodic and Lexical Cues to Topic Segmentation

• Broadcast news story-level segmentation
  – Television and radio

• Contrast with GHN:
  – Fully automatic: transcription, prosodic labeling
  – Large data set: multiple speakers
  – All teleprompted news

Possible Signals

• Lexical topic similarity in vector space
  – Hearst (1994)

• Lexical discourse cues (Beeferman et al.)
  • E.g., “CNN” – reporter sign-off
  – HMM topic model

• Prosodic cues
  – Pitch, loudness, duration, speaker change, …
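The vector-space similarity signal can be sketched as cosine similarity between word-count vectors of adjacent text windows, in the spirit of Hearst's approach; a dip in similarity suggests a topic boundary. The example sentences here are invented:

```python
# Sketch of lexical topic similarity: cosine between word-count
# vectors of adjacent windows. Toy sentences, for illustration only.
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def window_similarity(left_words, right_words):
    return cosine(Counter(left_words), Counter(right_words))

same_topic = window_similarity("the storm moved east".split(),
                               "the storm winds increased".split())
new_topic = window_similarity("the storm moved east".split(),
                              "stocks fell sharply today".split())
print(same_topic > new_topic)  # similarity drops across a topic shift
```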

Basic Approach

• Chop audio stream into “sentences”

• Group “sentences” into topics

• Classify each sentence boundary as topic boundary or not

• Probabilistic framework
  – argmax_B Pr(B | W, F)

• B is sequence of boundaries, W words, F features
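A toy rendering of the argmax_B Pr(B | W, F) formulation: with made-up per-boundary probabilities and an independence assumption across boundaries, the best label sequence can be found by enumeration:

```python
# Sketch of argmax_B Pr(B | W, F): each candidate sentence boundary
# gets a (hypothetical) probability of being a topic boundary, and we
# enumerate label sequences to find the most probable one. Under the
# independence assumption this reduces to thresholding each
# probability at 0.5, but the enumeration mirrors the formula.
from itertools import product

p_boundary = [0.9, 0.2, 0.6]  # Pr(topic boundary | W, F), invented

def seq_prob(labels, probs):
    prob = 1.0
    for lab, p in zip(labels, probs):
        prob *= p if lab else (1.0 - p)
    return prob

best = max(product([0, 1], repeat=len(p_boundary)),
           key=lambda labels: seq_prob(labels, p_boundary))
print(best)  # (1, 0, 1): boundaries wherever Pr > 0.5
```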

Prosodic Classification

• Features:
  – Pitch (f0): before and after possible boundary
  – Duration: final phoneme, final rhyme, pause

• No amplitude – viewed as redundant with pitch

• Classifier: decision trees
  – Features selected by wrapper loop on training data
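The wrapper-style selection can be sketched with a toy single-feature "stump" classifier standing in for the decision trees; the data and feature names are invented for illustration:

```python
# Sketch of a wrapper loop for feature selection: score each feature
# by the training accuracy of a one-feature threshold rule ("stump"),
# then keep the best. Toy stand-in for wrapper selection with
# decision trees; data and features are invented.

def stump_accuracy(xs, ys, feat):
    """Best accuracy of a single-feature threshold rule on (xs, ys)."""
    best = 0.0
    for thr in sorted({x[feat] for x in xs}):
        for sign in (1, -1):
            preds = [int(sign * (x[feat] - thr) >= 0) for x in xs]
            acc = sum(p == y for p, y in zip(preds, ys)) / len(ys)
            best = max(best, acc)
    return best

# toy data: feature 0 = pause duration (s), feature 1 = irrelevant noise
xs = [(0.1, 5.0), (0.2, 1.0), (0.8, 4.0), (0.9, 2.0)]
ys = [0, 0, 1, 1]  # 1 = topic boundary

scores = {f: stump_accuracy(xs, ys, f) for f in range(2)}
selected = max(scores, key=scores.get)
print(selected)  # 0: pause duration separates the classes perfectly
```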

Lexical Classification

• HMM topic language models
  – Train one model per topic
  – Begin/End state

• Train on previous topics

• Later augment with Topic Boundary states

Integrating Models

• With decision trees:
  – Incorporate HMM topic boundary probability as additional feature
  – Boundary labeled if probability exceeds some threshold

• With HMMs:
  – Use prosodic trees to estimate likelihoods
  – Use standard Viterbi decoding to find best boundary sequence
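The HMM-style combination can be sketched with a small Viterbi decoder over boundary/no-boundary states, where the emission scores stand in for the prosodic tree likelihoods; all probabilities below are invented:

```python
# Sketch of Viterbi decoding over CONT (topic continues) and BOUND
# (topic boundary) states. Emission scores play the role of the
# prosodic tree likelihoods; all numbers are invented.
import math

states = ("CONT", "BOUND")
trans = {("CONT", "CONT"): 0.8, ("CONT", "BOUND"): 0.2,
         ("BOUND", "CONT"): 0.9, ("BOUND", "BOUND"): 0.1}
# per-sentence likelihoods Pr(prosodic features | state)
emits = [{"CONT": 0.7, "BOUND": 0.3},
         {"CONT": 0.2, "BOUND": 0.9},
         {"CONT": 0.9, "BOUND": 0.1}]

def viterbi(emissions):
    # log-space scores; uniform start distribution
    score = {s: math.log(0.5) + math.log(emissions[0][s]) for s in states}
    back = []
    for e in emissions[1:]:
        new, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] + math.log(trans[(p, s)]))
            new[s] = score[prev] + math.log(trans[(prev, s)]) + math.log(e[s])
            ptr[s] = prev
        score, back = new, back + [ptr]
    # backtrace from the best final state
    path = [max(states, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(emits))  # ['CONT', 'BOUND', 'CONT']: sentence 2 ends a topic
```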

Testing & Evaluation

• Based on 6 shows
  – 104 shows used for training

• Used ASR output for words/positions
  – Contrast with correct forced alignment

• Used manual speaker segmentation

• Bizarre cost metric

• Basic units: Chop at 0.572 sec pause

Decision Tree Classification

• Prosody-only features:
  – Pause duration, F0 difference, speaker change, gender

• Consistent with GHN

• Gender? Different styles for males/females

• Combined:
  – HMM LM likelihoods, pause, F0 difference

Best Results

• Integrate prosody and lexical cues

• HMM-based model combination better
  – Decision tree thresholding inconsistent

• Improves over HMM classifier only