Speaker Segmentation (2006)


Transcript of Speaker Segmentation (2006)

Campus da FEUP | Rua Dr. Roberto Frias, 378 | 4200-465 Porto

Tf 222 094 000 | Fx 222 094 350 | www.inescporto.pt | [email protected]

Real-time Automatic Speaker Segmentation

Luís Gustavo Martins

UTM – INESC Porto

[email protected]

http://www.inescporto.pt/~lmartins

LabMeetings

March 16, 2006

INESC Porto

16.03.2006 Automatic Speaker Segmentation 2

Notice

This work is licensed under the Creative Commons Attribution-Share Alike 2.5 Portugal License. To view a copy of this license, visit

http://creativecommons.org/licenses/by-sa/2.5/pt/ or send a letter to

Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.


Summary

• Summary

– System Overview
– Audio Analysis front-end
– Speaker Coarse Segmentation
– Speaker Change Validation
– Speaker Model Update
– Experimental Results
– Achievements
– Conclusions


Scope

• Objective

– Development of a real-time, automatic Speaker Segmentation module

• Already having in mind for future development:
  – Speaker Tracking
  – Speaker Identification

• Challenges

– No pre-knowledge about the number and identities of speakers

– On-line and real-time operation
  • Audio data is not available beforehand
  • Only small amounts of arriving speaker data may be used, so iterative and computationally intensive methods are unfeasible


System Overview

[System block diagram] The speech stream enters a front-end that performs processing, feature extraction and pre-segmentation. When a potential change is flagged, the current speech segment is compared with the last speaker model using BIC ("Real change?"): if no, it is a false speaker change and the current speaker model is updated; if yes, it is a positive speaker change.


Audio Analysis front-end



Audio Analysis front-end

• Front-end Processing
– 8 kHz, 16-bit, pre-emphasized, mono speech streams
– 25 ms analysis frames with no overlap
– Speech segments of 2.075 secs with 1.4 secs overlap
  • Consecutive sub-segments of 1.375 secs each

[Figure: speech samples are split into 25 ms frames (200 samples); each frame yields a feature vector of 10 LSP coefficients (LSP1 … LSP10); feature vectors are grouped into 1.375 sec sub-segments (55 frames) advancing in 0.675 sec (27 frame) hops along the time axis.]
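The framing scheme above can be sketched in a few lines. This is a minimal illustration with hypothetical helper names, not the original INESC implementation:

```python
# Sketch of the front-end framing described above. Helper names and the
# plain-list representation are mine, not the original INESC code.
SAMPLE_RATE = 8000        # Hz
FRAME_LEN = 200           # 25 ms at 8 kHz, no overlap
FRAMES_PER_SUBSEG = 55    # 55 frames * 25 ms = 1.375 secs

def frame_stream(samples):
    """Split a raw sample sequence into non-overlapping 25 ms frames."""
    n = len(samples) // FRAME_LEN
    return [samples[i * FRAME_LEN:(i + 1) * FRAME_LEN] for i in range(n)]

def sub_segments(frames):
    """Group frames into 1.375 sec sub-segments (trailing partial group dropped)."""
    n = len(frames) // FRAMES_PER_SUBSEG
    return [frames[i * FRAMES_PER_SUBSEG:(i + 1) * FRAMES_PER_SUBSEG] for i in range(n)]
```

Two seconds of audio yield 80 frames, i.e. one complete 55-frame sub-segment plus a partial remainder.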


Audio Analysis front-end

• Feature Extraction (1)

– Speaker Modeling
  • 10th-order LPC / LSP
    – Source / Filter approach

• Other possible features…
  – MFCC
  – Pitch
  – …

[Figure: source/filter model of speech production — a lossless tube from glottis to lips with section areas A0 … A5; the source (glottal excitation) drives the filter (vocal tract).]

$$s_n = \sum_{k=1}^{p} a_k\, s_{n-k} + G\, u_n$$


Audio Analysis front-end

• LPC Modeling (1) [Rabiner93, Campbell97] – Linear Predictive Coding
  • Order p

$$s_n = \sum_{k=1}^{p} a_k\, s_{n-k} + G\, u_n$$

$$\hat{s}_n = \sum_{k=1}^{p} a_k\, s_{n-k}$$

$$e_n = s_n - \hat{s}_n = s_n - \sum_{k=1}^{p} a_k\, s_{n-k} = G\, u_n$$

• Autocorrelation method – minimize the prediction error energy:

$$E_n = \sum_{i} e_n^2(i), \qquad \frac{\partial E_n}{\partial a_i} = 0, \quad i = 1, 2, \dots, p$$

$$\sum_{k=1}^{p} a_k\, R_n(|i-k|) = R_n(i), \quad i = 1, 2, \dots, p, \qquad R_n(i) = \sum_{m} s_n(m)\, s_n(m+i)$$

• Yule-Walker equations, with a Toeplitz autocorrelation matrix:

$$\begin{bmatrix} R_0 & R_1 & R_2 & \cdots & R_{p-1} \\ R_1 & R_0 & R_1 & \cdots & R_{p-2} \\ R_2 & R_1 & R_0 & \cdots & R_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R_{p-1} & R_{p-2} & R_{p-3} & \cdots & R_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R_1 \\ R_2 \\ R_3 \\ \vdots \\ R_p \end{bmatrix}$$

• Solved efficiently with Durbin's recursive algorithm
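The Yule-Walker system above is solved by the Levinson-Durbin recursion. A compact plain-Python sketch (helper names are mine), following the slides' convention ŝ_n = Σ a_k s_{n−k}:

```python
# Levinson-Durbin solver for the Yule-Walker system above (sketch).
def autocorr(s, p):
    """Autocorrelation lags R_0 .. R_p of one analysis frame."""
    return [sum(s[n] * s[n - i] for n in range(i, len(s))) for i in range(p + 1)]

def levinson_durbin(R, p):
    """Solve sum_k a_k R(|i-k|) = R(i) recursively; returns (a_1..a_p, error)."""
    a = [0.0] * (p + 1)
    err = R[0]
    for i in range(1, p + 1):
        # reflection coefficient for order i
        k = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)   # residual prediction-error energy
    return a[1:], err
```

For the impulse response s[n] = 0.9^n of a one-pole filter, an order-1 analysis recovers a1 ≈ 0.9, as expected.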


[Figure: LPC analysis/synthesis chain — input speech → LPC ANALYSIS → coefficients a_k and residual e(n); an impulse-train (voiced) or noise (unvoiced) source drives the LPC FILTER to produce output speech.]

Audio Analysis front-end

• LPC Modeling (2)

$$s_n = \sum_{k=1}^{p} a_k\, s_{n-k} + G\, u_n$$

$$S(z) = Z[s_n], \qquad U(z) = Z[u_n]$$

$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{A(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k\, z^{-k}}$$

[Figure: smooth LPC spectrum envelope overlaid on the FFT spectrum of a speech frame; A(z) acts as a whitening filter; pitch harmonics are visible in the FFT spectrum.]


Audio Analysis front-end

• LSP Modeling [Campbell97] – Line Spectral Pairs

  • More robust to quantization, which is why they are normally used in speech coding

– Derived from the LPC a_k coefficients
  • Zeros of A(z) are mapped to the unit circle in the Z-domain
  • Uses a pair of (p+1)-order polynomials:

$$P(z) = A(z) + z^{-(p+1)}\, A(z^{-1})$$

$$Q(z) = A(z) - z^{-(p+1)}\, A(z^{-1})$$

$$A(z) = \frac{1}{2}\left[\, P(z) + Q(z) \,\right]$$
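Building the P(z) and Q(z) coefficient vectors from the a_k takes only a few lines. The sketch below (hypothetical helper, plain Python) also illustrates the well-known symmetry: P(z) is palindromic, Q(z) anti-palindromic, and their average recovers A(z):

```python
# Sketch of the P(z)/Q(z) construction above, with A(z) = 1 - sum_k a_k z^-k.
# Coefficient lists are in powers of z^-1, padded to degree p+1.
def lsp_polynomials(a):
    """Return coefficients of P(z) = A(z) + z^-(p+1) A(1/z)
    and Q(z) = A(z) - z^-(p+1) A(1/z)."""
    c = [1.0] + [-ak for ak in a]   # A(z) coefficients
    A_ext = c + [0.0]               # A(z), padded to degree p+1
    A_rev = [0.0] + c[::-1]         # z^-(p+1) * A(1/z)
    P = [x + y for x, y in zip(A_ext, A_rev)]
    Q = [x - y for x, y in zip(A_ext, A_rev)]
    return P, Q
```

The roots of P and Q interleave on the unit circle; their angles are the line spectral frequencies actually used as features.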


Audio Analysis front-end

• Speaker Modeling

– Speaker information is mostly contained in the voiced part of the speech signal…
  • Can you identify who's speaking?

– LPC / LSP analysis behaves badly with non-voiced (i.e. non-periodic) signals
  • Unvoiced/silence data degrades speaker model accuracy!

⇒ Select only voiced data for processing…

[Figure: speech waveform with unvoiced and voiced speech frames marked.]


Audio Analysis front-end

• Voiced / Unvoiced / Silence (V/U/S) detection

– Feature Extraction (2)
  • Short Time Energy (STE) → silence detection
  • Zero Crossing Rate (ZCR) → voiced / unvoiced detection

$$STE(n) = \frac{1}{N} \sum_{i=0}^{N-1} s_n^2(i)$$

$$ZCR(n) = \frac{f_s}{2N} \sum_{i=1}^{N-1} \left|\, \mathrm{sign}\!\left(s_n(i)\right) - \mathrm{sign}\!\left(s_n(i-1)\right) \right|$$
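Both features are one-liners over a frame. A plain-Python transcription of the two formulas above (helper names are mine):

```python
# Direct transcription of the STE and ZCR formulas above.
def short_time_energy(frame):
    """STE(n): mean squared amplitude over one analysis frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame, fs=8000):
    """ZCR(n) in crossings per second: fs/(2N) * sum of |sign differences|."""
    sign = lambda x: 1 if x >= 0 else -1
    total = sum(abs(sign(frame[i]) - sign(frame[i - 1]))
                for i in range(1, len(frame)))
    return fs * total / (2 * len(frame))
```

A frame that alternates sign every sample gives the maximum ZCR of fs/2 (minus the one missing boundary pair), while a silent frame gives zero energy.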


Audio Analysis front-end

• V/U/S speech classes modeled by Gaussian Distributions

– modeled by 2-D Gaussian distributions over the (STE, ZCR) feature plane
  • Simple and fast real-time operation

[Figure: STE vs. ZCR feature plane with voiced, unvoiced and silence clusters.]

$$N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad n = 2$$

Dataset: ~4 minutes of manually annotated speech signals, from 2 male and 2 female Portuguese speakers
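Classification then amounts to evaluating the three class Gaussians at a frame's (STE, ZCR) point and keeping the most likely class. A sketch with made-up toy model parameters (the real means and covariances were trained on the annotated dataset above):

```python
import math

# Maximum-likelihood V/U/S classification sketch; MODELS holds toy
# (mean, covariance) values per class, not the trained ones.
def gauss2d(x, mu, cov):
    """2-D Gaussian density at x = (STE, ZCR); cov is a 2x2 matrix."""
    dx = (x[0] - mu[0], x[1] - mu[1])
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    md = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
          + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * md) / (2.0 * math.pi * math.sqrt(det))

MODELS = {  # toy values: voiced = high STE / low ZCR, etc.
    "voiced":   ((0.8, 500.0),  [[0.05, 0.0], [0.0, 4.0e4]]),
    "unvoiced": ((0.1, 3000.0), [[0.05, 0.0], [0.0, 4.0e4]]),
    "silence":  ((0.0, 500.0),  [[0.05, 0.0], [0.0, 4.0e4]]),
}

def classify_vus(x, models=MODELS):
    """Pick the class whose Gaussian assigns x the highest likelihood."""
    return max(models, key=lambda c: gauss2d(x, *models[c]))
```

Per-frame evaluation of three 2-D Gaussians is what makes this stage cheap enough for real-time use.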


Audio Analysis front-end

• Manual Annotation of V/U/S segments in a speech signal

[Figure: speech waveform with manually annotated V/U/S segment labels.]


Audio Analysis front-end

• V/U/S Speech dataset

– Voiced / Unvoiced / Silence stratification in manually segmented audio files

Speaker               Total     Voiced           Unvoiced         Silence          V/(V+U)   U/(V+U)
Portuguese Male 2     60 secs   30 secs (50.0%)  14 secs (23.3%)  13 secs (21.6%)  68%       32%
Portuguese Male 1     60 secs   37 secs (62%)    12 secs (20%)    10 secs (17%)    76%       24%
Portuguese Female 2   60 secs   30 secs (50.0%)  19 secs (31.6%)  10 secs (17%)    61.2%     38.7%
Portuguese Female 1   60 secs   32 secs (53.3%)  17 secs (28.3%)  10 secs (17%)    65.3%     32.7%


Audio Analysis front-end

• Automatic Classification of V/U/S speech frames:

– 10-fold Cross-Validation
  • Confusion matrix:

True class     classified as:  voiced    unvoiced   silence
voiced                         92.32 %   4.17 %     0.41 %
unvoiced                       6.8 %     62.28 %    34.66 %
silence                        0.88 %    33.55 %    64.92 %

– Some voiced frames are being discarded as unvoiced… ("waste")
  • Waste of relevant and scarce data…
– A few unvoiced and silence frames are being misclassified as voiced ("contamination")
  • Contamination of the data to be analyzed

Total Correct Classifications = 81.615 ± 1.13912 %
Total Classification Error = 18.385 %
(Theoretical random classifier correct classifications = 33.33 %)


Audio Analysis front-end

• Voiced / Unvoiced / Silence (V/U/S) detection

– Advantages
  • Only quasi-stationary parts of the speech signal are used
    – Includes most of the speaker information in a speech signal
    – Avoids model degradation in LPC/LSP
  • Potentially more robust to different speakers/languages
    – Different languages may have distinct V/U/S stratification
    – Speakers talk differently (e.g. more pauses → more silence frames)

– Drawbacks
  • May leave few data points per speech sub-segment
    – Ill-estimation of the covariance matrices
      » number of data points (i.e. voiced frames) >= d(d+1)/2
      » d = dim(cov matrix) = 10 (i.e. 10 LSP coefficients)
      » nr. data points / sub-segment >= 55 frames
      » Not always guaranteed!!
  ⇒ use of dynamically sized windows
    – Does this really work??


Speaker Coarse Segmentation



Speaker Coarse Segmentation

• Divergence Shape [Campbell97][Lu2002] – Only uses LSP features
  • Assumes Gaussian distributions
– Calculated between consecutive sub-segments

$$D(i,j) = \int_x \left[\, p_i(x) - p_j(x) \,\right] \ln \frac{p_i(x)}{p_j(x)}\, dx$$

For Gaussian $p_i$, $p_j$:

$$D(i,j) = \frac{1}{2} tr\!\left[ (\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1}) \right] + \frac{1}{2} tr\!\left[ (\Sigma_i^{-1} + \Sigma_j^{-1})(\mu_i - \mu_j)(\mu_i - \mu_j)^T \right]$$

Divergence Shape (mean-dependent term dropped):

$$D(i,j) = \frac{1}{2} tr\!\left[ (\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1}) \right]$$

[Figure: speech stream with 4 speech segments; LSP feature vectors of consecutive 1.375 sec sub-segments yield covariances Σ_{i−1}, Σ_i, compared through D(i−1, i).]
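For diagonal covariances the divergence-shape trace collapses to a per-dimension sum, which makes a compact sketch possible. The real system uses full 10×10 LSP covariance matrices; the helper below (name mine) is the simplified diagonal form:

```python
# Divergence shape 0.5 * tr[(Si - Sj)(Sj^-1 - Si^-1)] specialised to
# diagonal covariances, where it reduces to a sum over dimensions:
# 0.5 * sum_k (vi_k - vj_k)^2 / (vi_k * vj_k). Simplified sketch only.
def divergence_shape_diag(var_i, var_j):
    return 0.5 * sum((a - b) ** 2 / (a * b) for a, b in zip(var_i, var_j))
```

The measure is zero for identical models, symmetric in its arguments, and grows as the covariances diverge, which is exactly what the coarse segmentation stage needs.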


Speaker Coarse Segmentation

• Dynamic Threshold [Lu2002]

• Speaker change whenever:

$$D(i, i+1) > D(i+2, i+3)$$

$$D(i, i+1) > D(i-2, i-1)$$

$$D(i, i+1) > Th_i, \qquad Th_i = \frac{\alpha}{N} \sum_{n=1}^{N} D(i-2n,\, i-2n+1)$$

[Figure: the distance curve D(i,j) over time — a local maximum of D(i,i+1) exceeding the dynamic threshold Th_i marks a POTENTIAL SPEAKER CHANGE POINT. Covariances Σ_{i−6} … Σ_{i+3} of consecutive 1.375 sec sub-segments (55 frames, 0.675 sec / 27 frame hop) feed the distances D(i−6,i−5), D(i−4,i−3), D(i−2,i−1), D(i,i+1), D(i+2,i+3).]
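The decision rule reads naturally as code. The sketch below assumes the stride-2 pairing of sub-segments shown in the figure (D(i−2,i−1), D(i,i+1), D(i+2,i+3), …); the helper name and the dictionary representation of the distance table are mine:

```python
# Dynamic-threshold speaker change test (sketch). `D` maps an index pair
# (a, b) to the divergence-shape distance between sub-segments a and b.
def is_potential_change(D, i, alpha=0.8, N=3):
    """True when D(i,i+1) is a local maximum that exceeds the
    dynamic threshold Th_i = alpha/N * sum of the N previous distances."""
    th = alpha * sum(D[(i - 2 * n, i - 2 * n + 1)] for n in range(1, N + 1)) / N
    d = D[(i, i + 1)]
    return d > D[(i - 2, i - 1)] and d > D[(i + 2, i + 3)] and d > th
```

Because Th_i tracks the recent distance history, a fixed global threshold is avoided; only the scaling factor α has to be tuned.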


Speaker Coarse Segmentation

• Coarse Segmentation performance

– Presents a high False Alarm Rate (FAR = Type I errors)

• Possible solution:

– Use a Speaker Change Validation strategy
  • Should allow decreasing the FAR…
  • … but should also avoid an increase in Missed Detections (MDR = Type II errors)


Speaker Change Validation




x

Speaker Change Validation

• Bayesian Information Criterion (BIC) (1)

– Hypothesis 0:
  • A single model θz for the speaker data in segments X and Y

$$L_0 = \sum_{i=0}^{N_x} \log p(x_i | \theta_z) + \sum_{i=0}^{N_y} \log p(y_i | \theta_z)$$

[Figure: feature distributions of segments X and Y fitted by one model θz — L0 is high when the same speaker utters X and Y, and low when different speakers do.]


Speaker Change Validation

• Bayesian Information Criterion (BIC) (2)

– Hypothesis 1:
  • Separate models θx, θy for the speakers in segments X and Y, respectively

$$L_1 = \sum_{i=0}^{N_x} \log p(x_i | \theta_x) + \sum_{i=0}^{N_y} \log p(y_i | \theta_y)$$

[Figure: feature distributions of segments X and Y fitted by separate models θx, θy — L1 fits well in both the same-speaker and different-speaker cases.]


Speaker Change Validation

• Bayesian Information Criterion (BIC) (3)

– Log Likelihood Ratio (LLR)

$$d_{LLR} = L_1 - L_0$$

  • Need to define a threshold…

– However, this is not a fair comparison…
  • The models do not have the same number of parameters!
    – More complex models always fit the data better
      » They should be penalized when compared with simpler models

$$d_{BIC} = L_1 - L_0 - \frac{\lambda}{2}\, \Delta K \cdot \log(N_x + N_y)$$

ΔK = difference in the number of parameters of the two hypotheses. No threshold needed! Or is it!?


Speaker Change Validation

• Bayesian Information Criterion (BIC) (4)

– Using Gaussian models for θx, θy and θz:

$$\Delta BIC(x, y) = \frac{1}{2}\left[ (N_x + N_y) \log |\Sigma_z| - N_x \log |\Sigma_x| - N_y \log |\Sigma_y| \right] - \lambda\, \frac{1}{2}\left( d + \frac{1}{2}\, d\,(d+1) \right) \log (N_x + N_y)$$

– Validate Speaker Change Point when:

$$\Delta BIC(x, y) > 0$$

[Figure: at a POTENTIAL SPEAKER CHANGE POINT between sub-segments Σi and Σi+1, test BIC(Σi, Σi+1) > 0 to VALIDATE the SPEAKER CHANGE POINT.]

Threshold free! … but λ must be set…
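A one-dimensional sketch of the Gaussian ΔBIC test above makes the structure visible (d = 1, so the complexity term ΔK = d + d(d+1)/2 = 2; the function name is mine, and the real system works on 10-dimensional LSP covariances):

```python
import math

# 1-D Gaussian Delta-BIC sketch: positive values support a speaker
# change between segments x and y; negative values reject it.
def delta_bic_1d(x, y, lam=0.6):
    nx, ny = len(x), len(y)
    n = nx + ny
    def var(s):
        m = sum(s) / len(s)
        return sum((v - m) ** 2 for v in s) / len(s)
    # likelihood gain of modeling x and y separately vs. jointly
    gain = 0.5 * (n * math.log(var(x + y))
                  - nx * math.log(var(x)) - ny * math.log(var(y)))
    penalty = lam * 0.5 * 2 * math.log(n)   # lam/2 * DeltaK * log(Nx+Ny)
    return gain - penalty
```

Two segments drawn from well-separated distributions score strongly positive, while comparing a segment with itself leaves only the negative penalty term.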


Speaker Change Validation

• Bayesian Information Criterion (BIC) (5)

– BIC needs large amounts of data for good accuracy!
  • Each speech segment only contains 55 data points… too few!

– Solution:
  • Speaker Model Update…

[Figure: same validation diagram as before, emphasizing that each sub-segment spans only 1.375 secs (55 frames).]


Speaker Model Update



Speaker Model Update

• "Quasi-GMM" speaker modeling [Lu2002] – Approximation to a GMM (Gaussian Mixture Model)

  • using segmental clustering of Gaussian models instead of EM
    – Gaussian models incrementally updated with newly arriving speaker data
  • less accurate than a GMM…
    – … but feasible for real-time operation

[Figure: past sub-segment covariances Σ_{i−6} … Σ_i merged over time into a Quasi-GMM Speaker Model (Σ_QGMM).]


Speaker Model Update

• "Quasi-GMM" speaker modeling [Lu2002]

– Segmental Clustering
  • Start with one Gaussian mixture (~GMM1)
  • DO:
    – Update the mixture as speaker data is received
  • WHILE:
    » dissimilarity between the mixture model before and after the update is sufficiently small
  • Create a new Gaussian mixture (GMMn+1)
    – Up to a maximum of 32 mixtures (GMM32)
  • Mixture weight (w_m):

$$w_m = \frac{N_m}{N_{qGMM}}, \qquad N_{qGMM} = \sum_{m=1}^{S} N_m$$


Speaker Model Update

• "Quasi-GMM" speaker modeling [Lu2002]

– Gaussian model on-line updating (mixture m holds N_m frames; N new frames arrive with mean μ and covariance Σ):

$$\mu_m' = \frac{N_m\, \mu_m + N\, \mu}{N_m + N}$$

$$\Sigma_m' = \frac{N_m}{N_m + N}\, \Sigma_m + \frac{N}{N_m + N}\, \Sigma + \frac{N_m\, N}{(N_m + N)^2}\, (\mu_m - \mu)(\mu_m - \mu)^T$$

– μ-dependent terms are discarded [Lu2002]:

$$\Sigma_m' = \frac{N_m}{N_m + N}\, \Sigma_m + \frac{N}{N_m + N}\, \Sigma$$

  • Increases robustness to changes in noise and background sound
    – ~ Cepstral Mean Subtraction (CMS)

$$N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad n = 10$$
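The mean-free covariance update is just a sample-count-weighted average. A per-dimension (diagonal) sketch, with a helper name of my own:

```python
# On-line covariance update with the mean-dependent term discarded, as in
# [Lu2002]. Diagonal sketch: var_old holds the current mixture variances
# over n_old frames, var_new the variances of the n_new new frames.
def update_variance(var_old, n_old, var_new, n_new):
    n = n_old + n_new
    return [(n_old * a + n_new * b) / n for a, b in zip(var_old, var_new)]
```

Dropping the (μ_m − μ)(μ_m − μ)ᵀ term means a constant offset in the features (e.g. a change in background level) cannot inflate the model's covariance, which is what gives the CMS-like robustness.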


Speaker Change Validation



Speaker Change Validation

• BIC and Quasi-GMM Speaker Models

– Validate Speaker Change Point when:

$$\Delta BIC(\Sigma_{qGMM}, \Sigma_j) = \sum_{m=1}^{S} w_m\, \Delta BIC(\Sigma_m, \Sigma_j) > 0, \qquad w_m = \frac{N_m}{N_{qGMM}}, \quad N_{qGMM} = \sum_{m=1}^{S} N_m$$

[Figure: the Quasi-GMM Speaker Model (ΣqGMM), built from Σ_{i−6} … Σi, is tested against the incoming sub-segment: BIC(ΣqGMM, Σi+1) > 0 VALIDATEs the SPEAKER CHANGE POINT.]


Complete System



Complete System

[Figure: complete system pipeline — speech samples → 25 ms frames (200 samples) → LSP feature vectors (LSP1 … LSP10) → 1.375 sec sub-segments (55 frames, 0.675 sec / 27 frame hop) → covariances Σi → divergence-shape distances D(i−6,i−5) … D(i+2,i+3) → POTENTIAL SPEAKER CHANGE POINT → Quasi-GMM Speaker Model N(μqGMM, ΣqGMM) → BIC(ΣqGMM, Σi+1) > 0 → VALIDATE SPEAKER CHANGE POINT.]


Experimental Results

• Speaker Datasets:

– INESC Porto dataset:
  • Sources:
    – MPEG-7 Content Set CD1 [MPEG.N2467]
    – broadcast news from assorted sources
    – male, female, various languages
  • 43 minutes of speaker audio
    – 16 bit @ 22.05 kHz PCM, single-channel
  • Ground Truth
    – 181 speaker changes
    – Manually annotated
    – Speaker segment durations:
      » Maximum ≈ 120 secs
      » Minimum = 2.25 secs
      » Mean = 19.81 secs
      » Std. Dev. = 27.08 secs


Experimental Results

• Speaker Datasets:

– TIMIT/AUTH dataset:
  • Sources:
    – TIMIT database
      » 630 English speakers
      » 6300 sentences
  • 56 minutes of speaker audio
    – 16 bit @ 22.05 kHz PCM, single-channel
  • Ground Truth
    – 983 speaker changes
    – Manually annotated
    – Speaker segment durations:
      » Maximum ≈ 12 secs
      » Minimum = 1.139 secs
      » Mean = 3.28 secs
      » Std. Dev. = 1.52 secs


Experimental Results

• Efficiency Measures

With GT = nr. of ground-truth speaker changes, DET = nr. of detected changes, CFC = correctly found changes, FA = false alarms, MD = missed detections:

$$FAR = \frac{FA}{GT + FA} \qquad\qquad MDR = \frac{MD}{GT}$$

$$PRC = \frac{CFC}{DET} \qquad\qquad RCL = \frac{CFC}{GT} = 1 - MDR$$

$$F_1 = \frac{2.0 \cdot PRC \cdot RCL}{PRC + RCL}$$
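The measures above as code (function name mine), taking the raw counts and returning all five quantities:

```python
# Speaker-segmentation efficiency measures. GT = ground-truth changes,
# CFC = correctly found changes, FA = false alarms, MD = missed
# detections; DET = CFC + FA is the total of detected change points.
def segmentation_metrics(gt, cfc, fa, md):
    det = cfc + fa
    far = fa / (gt + fa)            # False Alarm Rate
    mdr = md / gt                   # Missed Detection Rate
    prc = cfc / det                 # Precision
    rcl = cfc / gt                  # Recall (= 1 - MDR when cfc + md = gt)
    f1 = 2.0 * prc * rcl / (prc + rcl)
    return {"FAR": far, "MDR": mdr, "PRC": prc, "RCL": rcl, "F1": f1}
```

For the INESC Porto dataset (GT = 181), plugging in a hypothetical detection outcome immediately yields the operating point on the FAR/MDR plane used in the tuning slides.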


Experimental Results

• System's Parameters fine-tuning

– Parameters
  • Dynamic Threshold: α and nr. of previous frames
  • BIC: λ
  • qGMM: mixture creation thresholds
  • Detection Tolerance Interval: set to [-1; +1] secs

– Tune the system towards higher FAR & lower MDR
  • Missed speaker changes cannot be recovered by subsequent processing
  • False speaker changes will hopefully be discarded by subsequent processing
    – Speaker Tracking module (future work)
      » Merge adjacent segments identified as belonging to the same speaker


Experimental Results

• Dynamic Threshold and BIC parameters (α and λ)

[Figure: ROC curves (MDR vs. FAR, both 0-100 %) using BIC, for λ = 0.5, 0.7 and 0.9, with alpha swept from 0.0 to 2.0 along each curve.]

Best results found for: α = 0.8, λ = 0.6


Experimental Results

• INESC Porto dataset evaluation (1)

INESC System ver.1:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC

INESC System ver.2:
  • Features: LSP
  • Voiced Filter enabled
  • On-line processing (real-time)
  • Uses BIC


Experimental Results

• TIMIT/AUTH dataset evaluation (1)

INESC System ver.1:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC

INESC System ver.2:
  • Features: LSP
  • Voiced Filter enabled
  • On-line processing (real-time)
  • Uses BIC


Experimental Results

• INESC Porto dataset evaluation (2)

INESC System ver.2:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC

AUTH System 1:
  • Features: AudioSpectrumCentroid, AudioWaveformEnvelope
  • Multiple-pass (non-real-time)
  • Uses BIC


Experimental Results

• TIMIT/AUTH dataset evaluation (2)

AUTH System 1:
  • Features: AudioSpectrumCentroid, AudioWaveformEnvelope
  • Multiple-pass (non-real-time)
  • Uses BIC

INESC System ver.2:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC


Experimental Results

• TIMIT/AUTH dataset evaluation (3)

AUTH System 2:
  • Features: DFT Mag, STE, AudioWaveformEnvelope, AudioSpectrumCentroid, MFCC
  • Fast system (real-time?)
  • Uses BIC

INESC System ver.2:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC


Experimental Results

• Time Shifts on the detected Speaker Change Points

– Detection tolerance interval = [-1, 1] secs

[Figure: histogram of speaker change detection time shifts (secs.) vs. nr. of occurrences, for INESC System ver.1, over the [-1, +1] sec tolerance interval.]


Achievements

• Software
– C++ routines
  • Numerical routines
    – Matrix Determinant
    – Polynomial Roots
    – Levinson-Durbin
  • LPC (adapted from Marsyas)
  • LSP
  • Divergence and Bhattacharyya Shape metrics
  • BIC
  • Quasi-GMM modeling class
– Automatic Speaker Segmentation prototype application
  • As a Library (DLL)
    – Integrated into "4VDO - Annotator"
  • As a stand-alone application

• Reports
– VISNET deliverables
  • D29, D30, D31, D40, D41

• Publications (co-author)
– "Speaker Change Detection using BIC: A comparison on two datasets"
  • Accepted to ISCCSP2006
– "Automatic Speaker Segmentation using Multiple Features and Distance Measures: A Comparison of Three Approaches"
  • Submitted to ICME2006

DEMO


Conclusions

• Open Issues…
– Voiced detection procedure
  • Results should improve…
– Parameter fine-tuning
  • Dynamic Threshold
  • BIC parameter
  • Quasi-GMM Model

• Further Work
– Audio Features
  • Evaluate other features for speaker segmentation, tracking and identification
    – Pitch
    – MFCC
    – …
– Speaker Tracking
  • Clustering of speaker segments
  • Evaluation
    – Ground Truth: needs manual annotation work
– Speaker Identification
  • Speaker Model Training
  • Evaluation
    – Ground Truth: needs manual annotation work


Contributors

• INESC Porto
– Rui Costa
– Jaime Cardoso
– Luís Filipe Teixeira
– Sílvio Macedo

• VISNET
– Aristotle University of Thessaloniki (AUTH), Greece
  • Margarita Kotti
  • Emmanouil Benetos
  • Constantine Kotropoulos


Thank you!

Questions?

[email protected]