Speaker Segmentation (2006)
Campus da FEUP | Rua Dr. Roberto Frias, 378 | 4200-465 Porto
Tel. 222 094 000 | Fax 222 094 350 | www.inescporto.pt | [email protected]
Real-time Automatic Speaker Segmentation
Luís Gustavo Martins
UTM – INESC Porto
[email protected]
http://www.inescporto.pt/~lmartins
LabMeetings
March 16, 2006
INESC Porto
16.03.2006 Automatic Speaker Segmentation 2
Notice
This work is licensed under the Creative Commons Attribution-Share Alike 2.5 Portugal License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/2.5/pt/ or send a letter to
Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Summary
– System Overview
– Audio Analysis front-end
– Speaker Coarse Segmentation
– Speaker Change Validation
– Speaker Model Update
– Experimental Results
– Achievements
– Conclusions
Scope
• Objective
– Development of a real-time, automatic Speaker Segmentation module
• Already with future development in mind:
– Speaker Tracking
– Speaker Identification
• Challenges
– No prior knowledge about the number or identities of the speakers
– On-line and real-time operation:
• Audio data is not available beforehand
• Only small amounts of arriving speaker data can be used, so iterative and computationally intensive methods are unfeasible
System Overview
[System flowchart] Speech stream → Front-end: processing, feature extraction and pre-segmentation → Potential change? → Compare the current speech segment with the last speaker model → Real change? (BIC) → No: false speaker change, update the current speaker model / Yes: positive speaker change.
Audio Analysis front-end
• Front-end Processing
– 8 kHz, 16-bit, pre-emphasized, mono speech streams
– 25 ms analysis frames with no overlap
– Speech segments of 2.075 secs with 1.4 secs overlap
• Consecutive sub-segments of 1.375 secs each
[Figure: the speech samples are split into 25 ms frames (200 samples); each frame yields a 10-dimensional feature vector (LSP1 … LSP10); frames are grouped into 1.375 secs sub-segments (55 frames) hopping every 0.675 secs (27 frames) along the time axis.]
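A minimal sketch of this framing scheme (frame and segment sizes taken from the slide; the function names and list-of-lists layout are mine, not the original C++ implementation):

```python
# Sketch of the front-end framing described above (sizes from the slide;
# names and data layout are illustrative, not the original code).

FRAME_LEN = 200        # 25 ms at 8 kHz
SEG_FRAMES = 55        # 1.375 secs sub-segment
HOP_FRAMES = 27        # 0.675 secs hop

def frames(samples):
    """Split the sample stream into non-overlapping 25 ms frames."""
    n = len(samples) // FRAME_LEN
    return [samples[i * FRAME_LEN:(i + 1) * FRAME_LEN] for i in range(n)]

def sub_segments(frame_list):
    """Group frames into 55-frame sub-segments hopping every 27 frames."""
    segs = []
    start = 0
    while start + SEG_FRAMES <= len(frame_list):
        segs.append(frame_list[start:start + SEG_FRAMES])
        start += HOP_FRAMES
    return segs
```

In the real system each frame would then be mapped to its 10-dimensional LSP vector before grouping.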
Audio Analysis front-end
• Feature Extraction (1)
– Speaker Modeling
• 10th-order LPC / LSP
– Source / filter approach
• Other possible features:
– MFCC
– Pitch
– …
[Figure: source-filter model of speech production, from the glottis (source) to the lips (filter), modeled as a cascade of tube sections with areas A0 … A5.] The all-pole model:

  s_n = \sum_{k=1}^{p} a_k s_{n-k} + G u_n
Audio Analysis front-end
• LPC Modeling (1) [Rabiner93, Campbell97] – Linear Predictive Coding
• Order p:

  s_n = \sum_{k=1}^{p} a_k s_{n-k} + G u_n, \qquad \hat{s}_n = \sum_{k=1}^{p} a_k s_{n-k}

• The prediction error is e_n = s_n - \hat{s}_n; minimizing the total squared error E = \sum_n e_n^2 (\partial E / \partial a_i = 0, i = 1, 2, \ldots, p) with the autocorrelation method, R(i) = \sum_n s(n)\, s(n+i), yields the Yule-Walker equations:

  \sum_{k=1}^{p} a_k R(|i-k|) = R(i), \qquad i = 1, 2, \ldots, p

• In matrix form (Toeplitz autocorrelation matrix), solved efficiently by Durbin's recursive algorithm:

  \begin{bmatrix} R(0) & R(1) & R(2) & \cdots & R(p-1) \\ R(1) & R(0) & R(1) & \cdots & R(p-2) \\ R(2) & R(1) & R(0) & \cdots & R(p-3) \\ \vdots & & & \ddots & \vdots \\ R(p-1) & R(p-2) & R(p-3) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ R(3) \\ \vdots \\ R(p) \end{bmatrix}
[Figure: LPC analysis/synthesis. Input speech → LPC analysis → coefficients a_k and residual e(n); an impulse-train (voiced) or noise (unvoiced) source drives the LPC filter to produce output speech.]
Audio Analysis front-end
• LPC Modeling (2)
– In the Z-domain, with S(z) = Z[s_n] and U(z) = Z[u_n], the synthesis filter is all-pole:

  H(z) = \frac{S(z)}{U(z)} = \frac{G}{A(z)}, \qquad A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k}

[Figure: LPC spectrum (smooth envelope) overlaid on the FFT spectrum of a speech frame; pitch harmonics appear in the FFT spectrum, and the inverse filter A(z) acts as a whitening filter.]
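The autocorrelation method and Durbin's recursion described above can be sketched as follows (a minimal pure-Python illustration; the function names are mine, not the original C++ routines):

```python
# Levinson-Durbin recursion solving the Yule-Walker equations for the
# LPC coefficients a_k (illustrative sketch, not the original code).

def autocorr(s, p):
    """Autocorrelation values R(0..p) of the frame s."""
    n = len(s)
    return [sum(s[t] * s[t + i] for t in range(n - i)) for i in range(p + 1)]

def levinson_durbin(R, p):
    """Return LPC coefficients [a_1..a_p] and the residual energy E."""
    a = [0.0] * (p + 1)
    E = R[0]
    for i in range(1, p + 1):
        # reflection coefficient for order i
        k = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / E
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        E *= (1.0 - k * k)
    return a[1:], E
```

For example, a slowly decaying exponential s[n] = 0.9^n is well predicted by a first-order model with a_1 ≈ 0.9.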
Audio Analysis front-end
• LSP Modeling [Campbell97] – Line Spectral Pairs
• More robust to quantization, hence commonly used in speech coding
– Derived from the LPC a_k coefficients
• Zeros of A(z) are mapped to the unit circle in the Z-domain
• Uses a pair of (p+1)-order polynomials:

  P(z) = A(z) + z^{-(p+1)} A(z^{-1})
  Q(z) = A(z) - z^{-(p+1)} A(z^{-1})
  A(z) = \tfrac{1}{2} [P(z) + Q(z)]
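As a sketch, the P(z)/Q(z) construction above can be written directly; numpy's generic root finder stands in for the dedicated polynomial-root routine mentioned later in the deck, and the function name is mine:

```python
import numpy as np

def lsp_from_lpc(a):
    """Line spectral frequencies from LPC coefficients a = [a_1..a_p].

    Uses the predictor convention A(z) = 1 - sum_k a_k z^-k and
    P(z) = A(z) + z^-(p+1) A(1/z), Q(z) = A(z) - z^-(p+1) A(1/z).
    """
    A = np.concatenate(([1.0], -np.asarray(a)))   # coefficients of A(z)
    rev = A[::-1]                                 # z^-(p+1) A(1/z) part
    P = np.append(A, 0.0) + np.append(0.0, rev)
    Q = np.append(A, 0.0) - np.append(0.0, rev)
    # zeros of P and Q lie on the unit circle; keep angles in (0, pi),
    # discarding the trivial roots at z = 1 and z = -1
    angles = np.angle(np.concatenate((np.roots(P), np.roots(Q))))
    return np.sort(angles[(angles > 1e-9) & (angles < np.pi - 1e-9)])
```

For a stable 2nd-order predictor this returns two frequencies in (0, π).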
• Speaker Modeling
– Speaker information is mostly contained in the voiced part of the speech signal…
• Can you identify who's speaking?
– LPC / LSP analysis behaves badly with non-voiced (i.e. non-periodic) signals
• Unvoiced/silence data degrades speaker model accuracy!
→ Select only voiced data for processing…
[Figure: speech waveform with unvoiced and voiced speech frames highlighted.]
Audio Analysis front-end
• Voiced / Unvoiced / Silence (V/U/S) detection
– Feature Extraction (2)
• Short-Time Energy (STE) → silence detection:

  STE(n) = \frac{1}{N} \sum_{i=0}^{N-1} s_n(i)^2

• Zero-Crossing Rate (ZCR) → voiced / unvoiced detection:

  ZCR(n) = \frac{f_s}{2N} \sum_{i=1}^{N-1} \left| \operatorname{sign}(s_n(i)) - \operatorname{sign}(s_n(i-1)) \right|
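The two frame-level features above can be computed directly (an illustrative sketch; function names are mine):

```python
# Short-Time Energy and Zero-Crossing Rate per frame, as defined above
# (illustrative sketch; names are mine, not the original routines).

def ste(frame):
    """STE(n) = (1/N) * sum_i s_n(i)^2"""
    return sum(x * x for x in frame) / len(frame)

def zcr(frame, fs):
    """ZCR(n) = fs/(2N) * sum_i |sign(s(i)) - sign(s(i-1))|"""
    sign = lambda x: 1 if x >= 0 else -1
    N = len(frame)
    return fs / (2.0 * N) * sum(
        abs(sign(frame[i]) - sign(frame[i - 1])) for i in range(1, N))
```

A fully alternating frame gives the maximum ZCR for its length, while a silent frame gives STE near zero.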
Audio Analysis front-end
• V/U/S speech classes modeled by Gaussian Distributions
– Each class modeled by a 2-D Gaussian distribution over (STE, ZCR)
• Simple and fast real-time operation
[Figure: scatter plot in the (ZCR, STE) plane showing the voiced, unvoiced and silence clusters.]

  N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad n = 2

Dataset: ~4 minutes of manually annotated speech signals, 2 male and 2 female Portuguese speakers
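A maximum-likelihood V/U/S frame classifier with one 2-D Gaussian per class can be sketched as below; the class parameters here are placeholders, whereas in the system they are estimated from the annotated dataset:

```python
import numpy as np

# Maximum-likelihood classification of a (STE, ZCR) frame among V/U/S
# classes, each modeled by a 2-D Gaussian (illustrative sketch).

def gauss_logpdf(x, mu, cov):
    """log N(x; mu, cov) for a 2-D Gaussian."""
    d = x - mu
    inv = np.linalg.inv(cov)
    return -0.5 * (d @ inv @ d
                   + np.log(np.linalg.det(cov))
                   + 2 * np.log(2 * np.pi))

def classify(x, models):
    """models: dict class -> (mu, cov); pick the most likely class."""
    return max(models, key=lambda c: gauss_logpdf(x, *models[c]))
```

With well-separated class means, a frame near a class center is assigned to that class.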
Audio Analysis front-end
• Manual Annotation of V/U/S segments in a speech signal
Audio Analysis front-end
• V/U/S Speech dataset
– Voiced / Unvoiced / Silence stratification in the manually segmented audio files (60 secs per speaker):

Speaker              | Voiced         | Unvoiced       | Silence        | V/(V+U) | U/(V+U)
Portuguese Male 1    | 37 secs (62%)  | 12 secs (20%)  | 10 secs (17%)  | 76%     | 24%
Portuguese Male 2    | 30 secs (50.0%)| 14 secs (23.3%)| 13 secs (21.6%)| 68%     | 32%
Portuguese Female 1  | 32 secs (53.3%)| 17 secs (28.3%)| 10 secs (17%)  | 65.3%   | 32.7%
Portuguese Female 2  | 30 secs (50.0%)| 19 secs (31.6%)| 10 secs (17%)  | 61.2%   | 38.7%
Audio Analysis front-end
• Automatic Classification of V/U/S speech frames:
– 10-fold Cross-Validation
• Confusion matrix:

Classified as: → | voiced  | unvoiced | silence
voiced           | 92.32 % | 4.17 %   | 0.41 %
unvoiced         | 6.8 %   | 62.28 %  | 34.66 %
silence          | 0.88 %  | 33.55 %  | 64.92 %

– Some voiced frames are being discarded as unvoiced ("waste" of relevant and scarce data)
– A few unvoiced and silence frames are being misclassified as voiced ("contamination" of the data to be analyzed)

Total correct classifications = 81.615 ± 1.139% (total classification error = 18.385%; a theoretical random classifier would achieve 33.33% correct)
Audio Analysis front-end
• Voiced / Unvoiced / Silence (V/U/S) detection
– Advantages
• Only quasi-stationary parts of the speech signal are used
– These include most of the speaker information in a speech signal
– Avoids model degradation in LPC/LSP
• Potentially more robust to different speakers/languages
– Different languages may have distinct V/U/S stratification
– Speakers talk differently (e.g. more paused → more silence frames)
– Drawbacks
• May leave few data points per speech sub-segment
– Ill-estimation of the covariance matrices
» number of data points (i.e. voiced frames) must be ≥ d(d+1)/2
» d = dim(cov matrix) = 10 (i.e. 10 LSP coefficients)
» nr. of data points per sub-segment must be ≥ 55 frames
» Not always guaranteed!
→ use of dynamically sized windows
– Does this really work?
Speaker Coarse Segmentation
• Divergence Shape – only uses LSP features
• Assumes Gaussian distributions
– Calculated between consecutive sub-segments
[Figure: speech stream with 4 speech segments.]
The symmetric Kullback-Leibler divergence between densities p_i and p_j:

  D(i,j) = \int_x [p_i(x) - p_j(x)] \ln \frac{p_i(x)}{p_j(x)} \, dx

For Gaussians:

  D(i,j) = \tfrac{1}{2} \operatorname{tr}[(\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1})] + \tfrac{1}{2} \operatorname{tr}[(\Sigma_i^{-1} + \Sigma_j^{-1})(\mu_i - \mu_j)(\mu_i - \mu_j)^T]

Discarding the mean-dependent term gives the Divergence Shape:

  D(i,j) = \tfrac{1}{2} \operatorname{tr}[(\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1})]
[Figure: consecutive 1.375 secs sub-segments; a covariance matrix Σ_i is estimated from the 10-dimensional LSP vectors of each sub-segment (25 ms frames, 200 samples each), and the divergence shape D(i−1, i) is computed between consecutive sub-segments.] [Campbell97] [Lu2002]
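The Divergence Shape between two sub-segments, each summarized by the covariance of its LSP vectors, is a one-liner (illustrative sketch; the function name is mine):

```python
import numpy as np

# Divergence Shape between two sub-segment covariances:
# D(i,j) = 1/2 * tr[(Si - Sj)(Sj^-1 - Si^-1)]  (illustrative sketch).

def divergence_shape(Si, Sj):
    diff = Si - Sj
    return 0.5 * np.trace(diff @ (np.linalg.inv(Sj) - np.linalg.inv(Si)))
```

Identical covariances give zero; increasingly different covariances give larger values.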
Speaker Coarse Segmentation
• Dynamic Threshold [Lu2002]

  Th_i = \frac{\alpha}{N} \sum_{n=1}^{N} D(i-2n,\, i+1-2n)

• Speaker change at (i, i+1) whenever:

  D(i, i+1) > D(i+2, i+3)
  D(i, i+1) > D(i-2, i-1)
  D(i, i+1) > Th_i

[Figure: plot of D(i,j) over sub-segment pairs (i−1,i), (i,i+1), (i+1,i+2) against the threshold Th_i; a local maximum above Th_i marks a potential speaker change point.]
[Figure: coarse segmentation pipeline. 25 ms frames (200 samples) yield 10-dimensional LSP vectors; 1.375 secs sub-segments (55 frames), hopping every 0.675 secs (27 frames), yield covariance matrices Σ_{i−6} … Σ_{i+3}; the divergence shapes D(i−6,i−5), D(i−4,i−3), D(i−2,i−1), D(i,i+1), D(i+2,i+3) are compared to flag a potential speaker change point between Σ_i and Σ_{i+1}.]
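The local-maximum-plus-threshold test above can be sketched as follows. D is represented here as a dict mapping sub-segment pairs (i, j) to divergence values, and the default α = 0.8 is the best setting reported later in the deck; the data layout and names are mine:

```python
# Potential-change test: D(i,i+1) must be a local maximum and exceed
# the dynamic threshold Th_i = (alpha/N) * sum_{n=1..N} D(i-2n, i+1-2n).
# (Illustrative sketch; D's dict layout and the names are mine.)

def dynamic_threshold(D, i, alpha, N):
    return (alpha / N) * sum(D[(i - 2 * n, i + 1 - 2 * n)]
                             for n in range(1, N + 1))

def potential_change(D, i, alpha=0.8, N=3):
    d = D[(i, i + 1)]
    return (d > D[(i + 2, i + 3)]
            and d > D[(i - 2, i - 1)]
            and d > dynamic_threshold(D, i, alpha, N))
```

A spike in D(i, i+1) relative to its neighbours and to the running average is flagged; a flat divergence profile is not.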
Speaker Coarse Segmentation
• Coarse Segmentation performance
– Presents a high False Alarm Rate (FAR = Type I errors)
• Possible solution:
– Use a speaker change validation strategy
• Should allow decreasing the FAR…
• …but should also avoid an increase in Missed Detections (MDR = Type II errors)
Speaker Change Validation
• Bayesian Information Criterion (BIC) (1)
– Hypothesis 0:
• A single model θz for the speaker data in segments X and Y:

  L_0 = \sum_{i=0}^{N_x} \log p(x_i \mid \theta_z) + \sum_{i=0}^{N_y} \log p(y_i \mid \theta_z)

[Figure: feature scatter plots for segments X and Y (jointly Z) fitted by a single model θz; L0 is high when the same speaker produced X and Y, and lower when different speakers did.]
Speaker Change Validation
• Bayesian Information Criterion (BIC) (2)
– Hypothesis 1:
• Separate models θx, θy for the speakers in segments X and Y, respectively:

  L_1 = \sum_{i=0}^{N_x} \log p(x_i \mid \theta_x) + \sum_{i=0}^{N_y} \log p(y_i \mid \theta_y)

[Figure: feature scatter plots for segments X and Y fitted by separate models θx and θy; L1 is high in both cases, but the improvement over L0 is large only when different speakers produced X and Y.]
Speaker Change Validation
• Bayesian Information Criterion (BIC) (3)
– Log-Likelihood Ratio (LLR):

  d_{LLR} = L_1 - L_0   →   need to define a threshold…

– However, this is not a fair comparison: the models do not have the same number of parameters!
• More complex models always fit the data better
– They should be penalized when compared with simpler models (ΔK = difference in the nr. of parameters between the two hypotheses):

  d_{BIC} = L_1 - L_0 - \frac{\lambda}{2} \Delta K \log(N_x + N_y)

– No threshold needed! Or is it!?
Speaker Change Validation
• Bayesian Information Criterion (BIC) (4)
– Using Gaussian models for θx, θy and θz:

  BIC(x, y) = \frac{1}{2} \left[ (N_x + N_y) \log|\Sigma_z| - N_x \log|\Sigma_x| - N_y \log|\Sigma_y| \right] - \frac{\lambda}{2} \left[ d + \frac{1}{2} d(d+1) \right] \log(N_x + N_y)

– Validate the speaker change point when:

  BIC(x, y) > 0

[Figure: timeline of speech sub-segments Σ_{i−2} … Σ_{i+1}; at the potential speaker change point, the change is validated when BIC(Σ_i, Σ_{i+1}) > 0.]

– Threshold free! …but λ must be set…
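The Gaussian ΔBIC test above can be sketched directly from two blocks of feature vectors; the default λ = 0.6 is the best setting reported later in the deck, and the function name is mine:

```python
import numpy as np

# Gaussian BIC test from the slide: positive values validate a speaker
# change between segments X and Y (illustrative sketch).

def delta_bic(X, Y, lam=0.6):
    """X, Y: (N, d) arrays of feature vectors; returns BIC(x, y)."""
    Nx, d = X.shape
    Ny = Y.shape[0]
    Z = np.vstack((X, Y))
    logdet = lambda A: np.linalg.slogdet(np.cov(A, rowvar=False))[1]
    model_term = 0.5 * ((Nx + Ny) * logdet(Z)
                        - Nx * logdet(X) - Ny * logdet(Y))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(Nx + Ny)
    return model_term - penalty
```

Two well-separated clusters produce a clearly positive value, and increasing λ strengthens the penalty against declaring a change.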
Speaker Change Validation
• Bayesian Information Criterion (BIC) (5)
– BIC needs large amounts of data for good accuracy!
• Each speech segment contains only 55 data points (1.375 secs)… too few!
– Solution:
• Speaker Model Update…
[Figure: at the potential speaker change point, BIC(Σ_i, Σ_{i+1}) is computed from single 1.375 secs (55-frame) sub-segments only.]
Speaker Model Update
• “Quasi-GMM” speaker modeling [Lu2002]
– Approximation to GMM (Gaussian Mixture Models)
• Uses segmental clustering of Gaussian models instead of EM
– Gaussian models are incrementally updated with newly arriving speaker data
– Less accurate than GMM…
• …but feasible for real-time operation
[Figure: sub-segments Σ_{i−6} … Σ_{i+1} incrementally merged into the quasi-GMM speaker model (Σ_QGMM) over time.]
Speaker Model Update
• “Quasi-GMM” speaker modeling [Lu2002]
– Segmental Clustering
• Start with one Gaussian mixture (~GMM1)
• DO:
– Update the mixture as speaker data is received
• WHILE:
– The dissimilarity between the mixture model before and after the update is sufficiently small
• Otherwise, create a new Gaussian mixture (GMMn+1)
– Up to a maximum of 32 mixtures (GMM32)
• Mixture weight (w_m):

  w_m = \frac{N_m}{N_{qGMM}}, \qquad N_{qGMM} = \sum_{m=1}^{S} N_m
Speaker Model Update
• “Quasi-GMM” speaker modeling [Lu2002]
– Gaussian model on-line updating: a new sub-segment (N frames, mean μ_s, covariance Σ_s) is merged into mixture m (N_m frames):

  \mu'_m = \frac{N_m \mu_m + N \mu_s}{N_m + N}
  \Sigma'_m = \frac{N_m}{N_m + N} \Sigma_m + \frac{N}{N_m + N} \Sigma_s + \frac{N_m N}{(N_m + N)^2} (\mu_m - \mu_s)(\mu_m - \mu_s)^T

– The μ-dependent terms are discarded [Lu2002]:

  \Sigma'_m = \frac{N_m \Sigma_m + N \Sigma_s}{N_m + N}

• Increases robustness to changes in noise and background sound
• ~ Cepstral Mean Subtraction (CMS)
– Each mixture is a Gaussian over the LSP features:

  N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad n = 10
Speaker Change Validation
• BIC and Quasi-GMM Speaker Models
– Validate the speaker change point when:

  BIC(\Sigma_{qGMM}, \Sigma_j) = \sum_{m=1}^{S} w_m \, BIC(\Sigma_m, \Sigma_j) > 0, \qquad w_m = \frac{N_m}{N_{qGMM}}, \qquad N_{qGMM} = \sum_{m=1}^{S} N_m

[Figure: sub-segments Σ_{i−6} … Σ_i merged into the quasi-GMM speaker model (Σ_qGMM); at the potential speaker change point, the change is validated when BIC(Σ_qGMM, Σ_{i+1}) > 0.]
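The weighted decision above can be sketched independently of how each per-mixture BIC is computed (illustrative sketch; `bic_fn` stands in for a per-Gaussian BIC routine and the names are mine):

```python
# Weighted BIC against the quasi-GMM model:
# BIC(S_qGMM, S_j) = sum_m w_m * BIC(S_m, S_j), with w_m = N_m / N_qGMM.
# (Illustrative sketch; bic_fn is a stand-in for the per-Gaussian BIC.)

def qgmm_bic(mixtures, seg, bic_fn):
    """mixtures: list of (N_m, model_m); seg: the new sub-segment."""
    total = sum(n for n, _ in mixtures)
    return sum((n / total) * bic_fn(m, seg) for n, m in mixtures)

def validate_change(mixtures, seg, bic_fn):
    return qgmm_bic(mixtures, seg, bic_fn) > 0
```

Heavily weighted mixtures dominate the decision, so a segment that disagrees only with a small mixture need not trigger a change.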
Complete System
[Figure: complete system. The speech stream is split into 25 ms frames (200 samples), each yielding a 10-dimensional LSP feature vector; 1.375 secs sub-segments (55 frames), hopping every 0.675 secs (27 frames), yield covariance matrices Σ_{i−6} … Σ_{i+3}; divergence shapes D(i−6,i−5) … D(i+2,i+3) flag a potential speaker change point between Σ_i and Σ_{i+1}; a quasi-GMM speaker model N(μ_qGMM, Σ_qGMM) is maintained and, if flagged, the change point is validated when BIC(Σ_qGMM, Σ_{i+1}) > 0.]
Experimental Results
• Speaker Datasets:
– INESC Porto dataset:
• Sources:
– MPEG-7 Content Set CD1 [MPEG.N2467]
– Broadcast news from assorted sources
– Male and female speakers, various languages
• 43 minutes of speaker audio
– 16-bit @ 22.05 kHz PCM, single-channel
• Ground Truth:
– 181 speaker changes
– Manually annotated
– Speaker segment durations:
» Maximum ≈ 120 secs
» Minimum = 2.25 secs
» Mean = 19.81 secs
» Std. dev. = 27.08 secs
Experimental Results
• Speaker Datasets:
– TIMIT/AUTH dataset:
• Sources:
– TIMIT database
» 630 English speakers
» 6300 sentences
• 56 minutes of speaker audio
– 16-bit @ 22.05 kHz PCM, single-channel
• Ground Truth:
– 983 speaker changes
– Manually annotated
– Speaker segment durations:
» Maximum ≈ 12 secs
» Minimum = 1.139 secs
» Mean = 3.28 secs
» Std. dev. = 1.52 secs
Experimental Results
• Efficiency Measures
– With GT = nr. of ground-truth speaker changes, DET = nr. of detected changes, CFC = correctly found changes, FA = false alarms and MD = missed detections:

  FAR = \frac{FA}{GT + FA}
  MDR = \frac{MD}{GT}
  PRC = \frac{CFC}{DET}
  RCL = \frac{CFC}{GT} = 1 - MDR
  F_1 = \frac{2.0 \cdot PRC \cdot RCL}{PRC + RCL}
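The measures above reduce to a few ratios; a minimal sketch (the function name and argument names are mine):

```python
# Efficiency measures as defined above (illustrative sketch).
# GT: ground-truth changes, FA: false alarms, MD: missed detections,
# DET: detected changes, CFC: correctly found changes.

def metrics(GT, FA, MD, DET, CFC):
    far = FA / (GT + FA)
    mdr = MD / GT
    prc = CFC / DET
    rcl = CFC / GT
    f1 = 2.0 * prc * rcl / (prc + rcl)
    return far, mdr, prc, rcl, f1
```

Note that RCL = 1 − MDR holds by construction, since CFC = GT − MD.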
Experimental Results
• System parameter fine-tuning
– Parameters:
• Dynamic Threshold: α and nr. of previous frames
• BIC: λ
• qGMM: mixture creation thresholds
• Detection Tolerance Interval: set to [−1; +1] secs
– Tune the system for higher FAR and lower MDR:
• Missed speaker changes cannot be recovered by subsequent processing
• False speaker changes will hopefully be discarded by subsequent processing
– Speaker Tracking module (future work)
» Merge adjacent segments identified as belonging to the same speaker
Experimental Results
• Dynamic Threshold and BIC parameters (α and λ)
[Figure: ROC curves (MDR vs. FAR, both 0-100%) using BIC, for λ = 0.5, 0.7 and 0.9, with α swept from 0.0 to 2.0 along each curve.]
– Best results found for: α = 0.8, λ = 0.6
Experimental Results
• INESC Porto dataset evaluation (1)
– INESC System ver.1: LSP features; Voiced Filter disabled; on-line processing (real-time); uses BIC
– INESC System ver.2: LSP features; Voiced Filter enabled; on-line processing (real-time); uses BIC
[Figure: results table.]
Experimental Results
• TIMIT/AUTH dataset evaluation (1)
– INESC System ver.1: LSP features; Voiced Filter disabled; on-line processing (real-time); uses BIC
– INESC System ver.2: LSP features; Voiced Filter enabled; on-line processing (real-time); uses BIC
[Figure: results table.]
Experimental Results
• INESC Porto dataset evaluation (2)
– INESC System ver.2: LSP features; Voiced Filter disabled; on-line processing (real-time); uses BIC
– AUTH System 1: AudioSpectrumCentroid and AudioWaveformEnvelope features; multiple-pass (non-real-time); uses BIC
[Figure: results table.]
Experimental Results
• TIMIT/AUTH dataset evaluation (2)
– AUTH System 1: AudioSpectrumCentroid and AudioWaveformEnvelope features; multiple-pass (non-real-time); uses BIC
– INESC System ver.2: LSP features; Voiced Filter disabled; on-line processing (real-time); uses BIC
[Figure: results table.]
Experimental Results
• TIMIT/AUTH dataset evaluation (3)
– AUTH System 2: DFT magnitude, STE, AudioWaveformEnvelope, AudioSpectrumCentroid and MFCC features; fast system (real-time?); uses BIC
– INESC System ver.2: LSP features; Voiced Filter disabled; on-line processing (real-time); uses BIC
[Figure: results table.]
Experimental Results
• Time shifts of the detected Speaker Change Points
– Detection tolerance interval = [−1, 1] secs
[Figure: histogram of speaker change detection time shifts (secs) for INESC System ver.1, from −1 to 1 s, with up to ~6000 occurrences per bin.]
Achievements
• Software
– C++ routines
• Numerical routines
– Matrix determinant
– Polynomial roots
– Levinson-Durbin
• LPC (adapted from Marsyas)
• LSP
• Divergence and Bhattacharyya Shape metrics
• BIC
• Quasi-GMM modeling class
– Automatic Speaker Segmentation prototype application
• As a library (DLL)
– Integrated into “4VDO - Annotator”
• As a stand-alone application
• Reports
– VISNET deliverables
• D29, D30, D31, D40, D41
• Publications (co-author)
– “Speaker Change Detection using BIC: A comparison on two datasets”
• Accepted at ISCCSP2006
– “Automatic Speaker Segmentation using Multiple Features and Distance Measures: A Comparison of Three Approaches”
• Submitted to ICME2006
DEMO
Conclusions
• Open issues…
– Voiced detection procedure
• Results should improve…
– Parameter fine-tuning
• Dynamic Threshold
• BIC parameter λ
• Quasi-GMM model
• Further work
– Audio features
• Evaluate other features for speaker segmentation, tracking and identification
– Pitch
– MFCC
– …
– Speaker Tracking
• Clustering of speaker segments
• Evaluation
– Ground truth needs manual annotation work
– Speaker Identification
• Speaker model training
• Evaluation
– Ground truth needs manual annotation work
Contributors
• INESC Porto
– Rui Costa
– Jaime Cardoso
– Luís Filipe Teixeira
– Sílvio Macedo
• VISNET
– Aristotle University of Thessaloniki (AUTH), Greece
• Margarita Kotti
• Emmanouil Benetos
• Constantine Kotropoulos