May 30th, 2006 Speech Group Lunch Talk
Features for Improved Speech Activity Detection for Recognition
of Multiparty Meetings
Kofi A. Boakye
International Computer Science Institute
Overview
● Background
● Previous work and proposed changes
● HMM segmenter and ASR system
● Features investigated
● Experimental results
● Conclusions
Background
● Segmentation of audio into speech/nonspeech is a critical first step in ASR
● Especially true for the Individual Headset Microphone (IHM) condition in meetings
– Issues:
1) Crosstalk
2) Breath/contact noise
– Single-channel energy-based methods ineffective
Background
● Initiatives such as AMI, IM2, and the NIST RT eval show interest in recognition and understanding of multispeaker meetings
Background
● Major source of error for IHM recognition: speech activity detection errors
Previous Work
● Previous approach: time-based intersection of two distinct segmenters
1) HMM-based segmenter with standard cepstral features
– 12 MFCCs
– Log-energy
– First and second differences
Previous Work
● Previous approach: time-based intersection of two distinct segmenters
2) Local-energy detector
– Generates segments by zero-thresholding a "crosstalk-compensated" energy-like signal
Proposed Changes
● Though the intersection approach was effective, it was believed to be limited
– Cross-channel analysis disjoint from speech activity modeling
– Fixed threshold potentially lacks robustness
– Fails to incorporate other acoustically derived features (e.g., cross-correlation)
● New approach: integrate features directly into the HMM segmenter
– Append features to the cepstral feature vector
HMM Segmenter
● Derived from an HMM-based speech recognition system
● Two-class HMM with three-state phone model
● Multivariate GMM with 256 components
● Segmentation proceeds by repeatedly decoding the waveform with decreasing transition penalties
– Results in segments shorter than 60 s
HMM Segmenter
● Post-processing
– Pad segments by a fixed amount (40 ms) to prevent "clipping" effects
– Merge segments with small separation (< 0.4 s) to "smooth" the segmentation
– Constraints optimized based on recognition accuracy and runtime for the segmenter with standard cepstral features
ASR System
● The ICSI-SRI RT-05S system was used for development and validation experiments
– Multiple decoding passes and front-ends for cross-adaptation and hypothesis refinement
– PLP and MFCC+MLP features
– Features transformed with VTLN and HLDA, along with feature-level constrained MLLR
– Models trained on 2000 hours of telephone data and MAP-adapted to 100 hours of meeting data
– 4-gram LM trained on telephone, meeting transcript, broadcast, and Web data
Features: Cross-channel
● Log-Energy Differences (LEDs)
– Log of the ratio of short-time energy between the target and each non-target channel
● Normalized Log-Energy Differences (NLEDs)
– Subtract the minimum frame energy of a channel from all energy values in that channel:

  E_i^norm[n] = E_i[n] − E_i^min

– Addresses significant gain differences
– Largely independent of the amount of speech in a channel
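In code, the min-subtraction and per-channel differencing might look like this. This is a sketch with assumed names (`log_energy`, `normalized_log_energy_diffs`), not the original feature-extraction code:

```python
import numpy as np

def normalized_log_energy_diffs(log_energy, target):
    """log_energy: (num_channels, num_frames) array of frame log-energies.
    Subtract each channel's minimum frame energy (its noise floor),
    then return the difference between the target channel and each
    non-target channel."""
    norm = log_energy - log_energy.min(axis=1, keepdims=True)
    return np.array([norm[target] - norm[j]
                     for j in range(log_energy.shape[0]) if j != target])
```

Note that after the floor subtraction, two channels that differ only by a constant gain offset produce identical normalized contours, which is the point of the normalization.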
Features: Cross-channel
● Normalized Maximum Cross-correlation (NMXC)
– Serves as an indicator of crosstalk
– More common cross-channel feature than energy differences

  Γ_ij = max_τ φ_ij(τ) / φ_jj(0)
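A direct transcription of the NMXC formula, as an illustrative sketch only (frame extraction and windowing are omitted, and the names are assumptions):

```python
import numpy as np

def nmxc(target_frame, nontarget_frame):
    """Gamma_ij: the cross-correlation phi_ij(tau) between target
    channel i and non-target channel j, maximized over lag tau and
    normalized by the non-target channel's zero-lag autocorrelation
    phi_jj(0) (i.e., its energy)."""
    phi_ij = np.correlate(target_frame, nontarget_frame, mode="full")
    phi_jj0 = np.dot(nontarget_frame, nontarget_frame)
    return phi_ij.max() / phi_jj0
```

Identical signals give a value of 1, and a louder target relative to the non-target channel pushes the value higher, which is why the feature flags crosstalk.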
Features: Cross-channel
● Feature vector length standardization
– For cross-channel features, the number of channels may vary, but the feature vector length must be fixed
– Proposed solution: use order statistics (maximum and minimum) of the feature values generated on the different channels
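The order-statistics trick can be written as follows (assumed names; `features` stacks one row of cross-channel feature values per non-target channel):

```python
import numpy as np

def standardize_length(features):
    """features: (num_nontarget_channels, num_frames) cross-channel
    feature values. Returns a fixed (2, num_frames) array of the
    per-frame maximum and minimum over channels, so the appended
    feature vector has the same length for any number of mics."""
    return np.vstack([features.max(axis=0), features.min(axis=0)])
```

Whether the meeting had 3 or 15 microphones, only two values per frame are appended to the cepstral vector.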
Experiments: AMI devtest
● Performance of features initially investigated on AMI development set
● Testing
– 12-minute excerpts from 4 meetings
● Training
– First 10 minutes from each of 35 meetings
● “Fast” (two-decoding pass) version of recognition system used for quick turnaround
Experiments: AMI devtest
● Results
[Chart: recognition performance (Del, Subs, Ins, WER) for the baseline, base + LEDs, base + NLEDs, base + NMXC, and reference segmenters]

System        Del   Subs  Ins  WER
baseline      17.4  13.0  7.4  37.8
base + LEDs   17.4  12.8  4.5  34.7
base + NLEDs  17.1  12.0  4.4  33.5
base + NMXC   17.4  12.1  4.5  34.1
reference     18.3  10.2  3.4  32.0

● New features give a significant improvement over the baseline
– Reduced insertions
● NLEDs give a ~1% reduction over LEDs
Experiments: Eval04
● Having established the effectiveness of the features, the systems were evaluated on the RT-04S set
● Meetings vary in style, number of participants, and room acoustics
● Testing
– 11-minute excerpts from 8 meetings, 2 from each of CMU, ICSI, NIST, and LDC
● Training
– First 10 minutes from each of 15 NIST meetings and 73 ICSI meetings
Experiments: Eval04
● Results

System (WER)  ALL   CMU   ICSI  NIST  LDC
baseline      29.6  33.1  23.4  20.0  38.7
intersection  27.9  32.5  21.4  20.2  34.9
base + LEDs   27.3  32.8  20.1  20.0  33.7
base + NLEDs  26.9  32.8  18.5  19.6  34.0
base + NMXC   28.1  31.7  24.9  19.0  33.8
reference     25.1  30.3  18.0  17.0  31.9

[Chart: recognition performance (WER) per site for the systems above]

● Features give improvement over the baseline and the previous (intersection) system
● NMXC features not as robust
– Removed from consideration for the final SAD system
System Validation: Eval05 (and 06)
● Finalized system: HMM segmenter with baseline and NLED features*
● Training
– Union of previous training sets
● AMI (35 mtgs), NIST (15 mtgs), ICSI (73 mtgs)
– Baseline and intersection systems used two models (ICSI+NIST and AMI)
– New systems used a single model with pooled data
*Eval06 official submission used LEDs
System Validation: Eval05 (and 06)
WER by segmenter method on eval05:

Segmenter Method  ALL   AMI   CMU   ICSI  NIST  VT
Reference         19.5  19.2  19.9  16.8  21.4  20.6
NLEDs + SDM       22.7  21.9  23.1  20.6  25.2  22.9
LEDs + SDM        24.7  22.0  23.5  20.9  33.0  23.8
LEDs              25.6  22.0  23.5  20.9  37.3  23.8

(The SDM signal affects only the NIST meetings, so the LEDs and LEDs + SDM rows differ only in the NIST and ALL columns.)

● Using the SDM signal
– Eval05 included a meeting with an unmiked participant
– The SDM served as a "stand-in" mic for that participant
– Including the SDM signal (and energy normalization) improved results by >12% on NIST meetings!
– The SDM signal was not used for eval06 since there were no unmiked speakers
System Validation: Eval05 (and 06)
Results on eval05 and eval06 (Sub/Del/Ins are for eval05; rows sum to the eval05 WER):

System        Sub   Del   Ins  WER (eval05)  WER (eval06)
Reference     11.2  6.7   1.6  19.5          20.2
NLEDs         10.9  10.2  1.6  22.7          22.8
LEDs          11.1  10.2  3.3  24.7          24.0
intersection  11.0  11.5  3.4  25.9          —
baseline      11.0  10.3  8.0  29.3          —

● 1.2% gain over last year's (intersection) segmenter on eval05
● Energy normalization gave an extra 1.2% gain on eval06, and 2.0% on eval05 (due to the unmiked speaker in a NIST meeting)
Additional Experiments: MLP Features
● Use the features as inputs to a Multi-Layer Perceptron (MLP) to see if additional gains can be made
● Training
– Inputs consist of baseline and either LED or NLED features (41 components)
– Input context window of 11 frames and 400 hidden units
– 90/10 split for cross-validation
Additional Experiments: MLP Features
● AMI devtest results

[Chart: recognition performance (WER) for the feature combinations below]

System              WER
base + LEDs         34.7
base + NLEDs        33.5
MLP + LEDs          33.9
MLP + NLEDs         34.4
base + MLP + LEDs   34.7
base + MLP + NLEDs  35.0

● MLP with LEDs better than with NLEDs
● Addition of the baseline features degrades performance
● No combination outperforms the NLED features alone
Conclusions
● Integrating cross-channel analysis with speech activity modeling yields large WER reductions
● Simple cross-channel energy-based features perform very well and are more robust than cross-correlation-based features
● Minimum-energy subtraction produces still further gains
● Inclusion of an omnidirectional mic allows crosstalk suppression even for speakers without dedicated microphones
● There is still room for improvement, as a significant gap (>2%) remains between automatic and ideal segmentation