Speaker Segmentation (2006)


Transcript of Speaker Segmentation (2006)

Campus da FEUP | Rua Dr. Roberto Frias, 378 | 4200-465 Porto

Tf 222 094 000 | Fx 222 094 350 | www.inescporto.pt | [email protected]

Real-time Automatic Speaker Segmentation

Luís Gustavo Martins

UTM – INESC Porto

[email protected]

http://www.inescporto.pt/~lmartins

LabMeetings

March 16, 2006

INESC Porto

16.03.2006 Automatic Speaker Segmentation 2

Notice

This work is licensed under the Creative Commons Attribution-Share Alike 2.5 Portugal License. To view a copy of this license, visit

http://creativecommons.org/licenses/by-sa/2.5/pt/ or send a letter to

Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.


Summary

• Summary

– System Overview
– Audio Analysis front-end
– Speaker Coarse Segmentation
– Speaker Change Validation
– Speaker Model Update
– Experimental Results
– Achievements
– Conclusions


Scope

• Objective

– Development of a real-time, automatic Speaker Segmentation module

• Already having in mind for future development:
  – Speaker Tracking
  – Speaker Identification

• Challenges

– No pre-knowledge about the number and identities of speakers

– On-line and real-time operation
  • Audio data is not available beforehand
  • Only small amounts of arriving speaker data may be used, so iterative and computationally intensive methods are unfeasible


System Overview

[System block diagram] The speech stream enters a front-end that performs processing, feature extraction and pre-segmentation. When a potential change is flagged, the current speech segment is compared with the last speaker model using BIC ("Real change?"): if no, it is a false speaker change and the current speaker model is updated; if yes, it is a positive speaker change.


Audio Analysis front-end



Audio Analysis front-end

• Front-end Processing
– 8 kHz, 16-bit, pre-emphasized, mono speech streams
– 25 ms analysis frames with no overlap
– Speech segments of 2.075 secs with 1.4 secs overlap
  • Consecutive sub-segments of 1.375 secs each

[Figure: speech samples are split into 25 ms frames (200 samples); each frame yields a feature vector of 10 LSP coefficients (LSP1 … LSP10); feature vectors are grouped into 1.375 sec sub-segments (55 frames) advancing in 0.675 sec (27 frame) hops along the time axis.]
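The framing scheme above can be sketched in a few lines. This is a minimal illustration with hypothetical helper names, not the original INESC implementation:

```python
# Sketch of the front-end framing described above. Helper names and the
# plain-list representation are mine, not the original INESC code.
SAMPLE_RATE = 8000        # Hz
FRAME_LEN = 200           # 25 ms at 8 kHz, no overlap
FRAMES_PER_SUBSEG = 55    # 55 frames * 25 ms = 1.375 secs

def frame_stream(samples):
    """Split a raw sample sequence into non-overlapping 25 ms frames."""
    n = len(samples) // FRAME_LEN
    return [samples[i * FRAME_LEN:(i + 1) * FRAME_LEN] for i in range(n)]

def sub_segments(frames):
    """Group frames into 1.375 sec sub-segments (trailing partial group dropped)."""
    n = len(frames) // FRAMES_PER_SUBSEG
    return [frames[i * FRAMES_PER_SUBSEG:(i + 1) * FRAMES_PER_SUBSEG] for i in range(n)]
```

Two seconds of audio yield 80 frames, i.e. one complete 55-frame sub-segment plus a partial remainder.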


Audio Analysis front-end

• Feature Extraction (1)

– Speaker Modeling
  • 10th-order LPC / LSP
    – Source / Filter approach

• Other possible features…
  – MFCC
  – Pitch
  – …

[Figure: source/filter model of speech production — a lossless tube from glottis to lips with section areas A0 … A5; the source (glottal excitation) drives the filter (vocal tract).]

$$s_n = \sum_{k=1}^{p} a_k\, s_{n-k} + G\, u_n$$


Audio Analysis front-end

• LPC Modeling (1) [Rabiner93, Campbell97] – Linear Predictive Coding
  • Order p

$$s_n = \sum_{k=1}^{p} a_k\, s_{n-k} + G\, u_n$$

$$\hat{s}_n = \sum_{k=1}^{p} a_k\, s_{n-k}$$

$$e_n = s_n - \hat{s}_n = s_n - \sum_{k=1}^{p} a_k\, s_{n-k} = G\, u_n$$

• Autocorrelation method – minimize the prediction error energy:

$$E_n = \sum_{i} e_n^2(i), \qquad \frac{\partial E_n}{\partial a_i} = 0, \quad i = 1, 2, \dots, p$$

$$\sum_{k=1}^{p} a_k\, R_n(|i-k|) = R_n(i), \quad i = 1, 2, \dots, p, \qquad R_n(i) = \sum_{m} s_n(m)\, s_n(m+i)$$

• Yule-Walker equations, with a Toeplitz autocorrelation matrix:

$$\begin{bmatrix} R_0 & R_1 & R_2 & \cdots & R_{p-1} \\ R_1 & R_0 & R_1 & \cdots & R_{p-2} \\ R_2 & R_1 & R_0 & \cdots & R_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R_{p-1} & R_{p-2} & R_{p-3} & \cdots & R_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R_1 \\ R_2 \\ R_3 \\ \vdots \\ R_p \end{bmatrix}$$

• Solved efficiently with Durbin's recursive algorithm
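The Yule-Walker system above is solved by the Levinson-Durbin recursion. A compact plain-Python sketch (helper names are mine), following the slides' convention ŝ_n = Σ a_k s_{n−k}:

```python
# Levinson-Durbin solver for the Yule-Walker system above (sketch).
def autocorr(s, p):
    """Autocorrelation lags R_0 .. R_p of one analysis frame."""
    return [sum(s[n] * s[n - i] for n in range(i, len(s))) for i in range(p + 1)]

def levinson_durbin(R, p):
    """Solve sum_k a_k R(|i-k|) = R(i) recursively; returns (a_1..a_p, error)."""
    a = [0.0] * (p + 1)
    err = R[0]
    for i in range(1, p + 1):
        # reflection coefficient for order i
        k = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)   # residual prediction-error energy
    return a[1:], err
```

For the impulse response s[n] = 0.9^n of a one-pole filter, an order-1 analysis recovers a1 ≈ 0.9, as expected.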


[Figure: LPC analysis/synthesis chain — input speech → LPC ANALYSIS → coefficients a_k and residual e(n); an impulse-train (voiced) or noise (unvoiced) source drives the LPC FILTER to produce output speech.]

Audio Analysis front-end

• LPC Modeling (2)

$$s_n = \sum_{k=1}^{p} a_k\, s_{n-k} + G\, u_n$$

$$S(z) = Z[s_n], \qquad U(z) = Z[u_n]$$

$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{A(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k\, z^{-k}}$$

[Figure: smooth LPC spectrum envelope overlaid on the FFT spectrum of a speech frame; A(z) acts as a whitening filter; pitch harmonics are visible in the FFT spectrum.]


Audio Analysis front-end

• LSP Modeling [Campbell97] – Line Spectral Pairs

  • More robust to quantization, which is why they are normally used in speech coding

– Derived from the LPC a_k coefficients
  • Zeros of A(z) are mapped to the unit circle in the Z-domain
  • Uses a pair of (p+1)-order polynomials:

$$P(z) = A(z) + z^{-(p+1)}\, A(z^{-1})$$

$$Q(z) = A(z) - z^{-(p+1)}\, A(z^{-1})$$

$$A(z) = \frac{1}{2}\left[\, P(z) + Q(z) \,\right]$$
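Building the P(z) and Q(z) coefficient vectors from the a_k takes only a few lines. The sketch below (hypothetical helper, plain Python) also illustrates the well-known symmetry: P(z) is palindromic, Q(z) anti-palindromic, and their average recovers A(z):

```python
# Sketch of the P(z)/Q(z) construction above, with A(z) = 1 - sum_k a_k z^-k.
# Coefficient lists are in powers of z^-1, padded to degree p+1.
def lsp_polynomials(a):
    """Return coefficients of P(z) = A(z) + z^-(p+1) A(1/z)
    and Q(z) = A(z) - z^-(p+1) A(1/z)."""
    c = [1.0] + [-ak for ak in a]   # A(z) coefficients
    A_ext = c + [0.0]               # A(z), padded to degree p+1
    A_rev = [0.0] + c[::-1]         # z^-(p+1) * A(1/z)
    P = [x + y for x, y in zip(A_ext, A_rev)]
    Q = [x - y for x, y in zip(A_ext, A_rev)]
    return P, Q
```

The roots of P and Q interleave on the unit circle; their angles are the line spectral frequencies actually used as features.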


Audio Analysis front-end

• Speaker Modeling

– Speaker information is mostly contained in the voiced part of the speech signal…
  • Can you identify who's speaking?

– LPC / LSP analysis behaves badly with non-voiced (i.e. non-periodic) signals
  • Unvoiced/silence data degrades speaker model accuracy!

⇒ Select only voiced data for processing…

[Figure: speech waveform with unvoiced and voiced speech frames marked.]


Audio Analysis front-end

• Voiced / Unvoiced / Silence (V/U/S) detection

– Feature Extraction (2)
  • Short Time Energy (STE) → silence detection
  • Zero Crossing Rate (ZCR) → voiced / unvoiced detection

$$STE(n) = \frac{1}{N} \sum_{i=0}^{N-1} s_n^2(i)$$

$$ZCR(n) = \frac{f_s}{2N} \sum_{i=1}^{N-1} \left|\, \mathrm{sign}\!\left(s_n(i)\right) - \mathrm{sign}\!\left(s_n(i-1)\right) \right|$$
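Both features are one-liners over a frame. A plain-Python transcription of the two formulas above (helper names are mine):

```python
# Direct transcription of the STE and ZCR formulas above.
def short_time_energy(frame):
    """STE(n): mean squared amplitude over one analysis frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame, fs=8000):
    """ZCR(n) in crossings per second: fs/(2N) * sum of |sign differences|."""
    sign = lambda x: 1 if x >= 0 else -1
    total = sum(abs(sign(frame[i]) - sign(frame[i - 1]))
                for i in range(1, len(frame)))
    return fs * total / (2 * len(frame))
```

A frame that alternates sign every sample gives the maximum ZCR of fs/2 (minus the one missing boundary pair), while a silent frame gives zero energy.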


Audio Analysis front-end

• V/U/S speech classes modeled by Gaussian Distributions

– modeled by 2-D Gaussian distributions over the (STE, ZCR) feature plane
  • Simple and fast real-time operation

[Figure: STE vs. ZCR feature plane with voiced, unvoiced and silence clusters.]

$$N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad n = 2$$

Dataset: ~4 minutes of manually annotated speech signals, from 2 male and 2 female Portuguese speakers
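Classification then amounts to evaluating the three class Gaussians at a frame's (STE, ZCR) point and keeping the most likely class. A sketch with made-up toy model parameters (the real means and covariances were trained on the annotated dataset above):

```python
import math

# Maximum-likelihood V/U/S classification sketch; MODELS holds toy
# (mean, covariance) values per class, not the trained ones.
def gauss2d(x, mu, cov):
    """2-D Gaussian density at x = (STE, ZCR); cov is a 2x2 matrix."""
    dx = (x[0] - mu[0], x[1] - mu[1])
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    md = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
          + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.exp(-0.5 * md) / (2.0 * math.pi * math.sqrt(det))

MODELS = {  # toy values: voiced = high STE / low ZCR, etc.
    "voiced":   ((0.8, 500.0),  [[0.05, 0.0], [0.0, 4.0e4]]),
    "unvoiced": ((0.1, 3000.0), [[0.05, 0.0], [0.0, 4.0e4]]),
    "silence":  ((0.0, 500.0),  [[0.05, 0.0], [0.0, 4.0e4]]),
}

def classify_vus(x, models=MODELS):
    """Pick the class whose Gaussian assigns x the highest likelihood."""
    return max(models, key=lambda c: gauss2d(x, *models[c]))
```

Per-frame evaluation of three 2-D Gaussians is what makes this stage cheap enough for real-time use.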


Audio Analysis front-end

• Manual Annotation of V/U/S segments in a speech signal

[Figure: speech waveform with manually annotated V/U/S segment labels.]


Audio Analysis front-end

• V/U/S Speech dataset

– Voiced / Unvoiced / Silence stratification in manually segmented audio files

Speaker               Total     Voiced           Unvoiced         Silence          V/(V+U)   U/(V+U)
Portuguese Male 2     60 secs   30 secs (50.0%)  14 secs (23.3%)  13 secs (21.6%)  68%       32%
Portuguese Male 1     60 secs   37 secs (62%)    12 secs (20%)    10 secs (17%)    76%       24%
Portuguese Female 2   60 secs   30 secs (50.0%)  19 secs (31.6%)  10 secs (17%)    61.2%     38.7%
Portuguese Female 1   60 secs   32 secs (53.3%)  17 secs (28.3%)  10 secs (17%)    65.3%     32.7%


Audio Analysis front-end

• Automatic Classification of V/U/S speech frames:

– 10-fold Cross-Validation
  • Confusion matrix:

True class     classified as:  voiced    unvoiced   silence
voiced                         92.32 %   4.17 %     0.41 %
unvoiced                       6.8 %     62.28 %    34.66 %
silence                        0.88 %    33.55 %    64.92 %

– Some voiced frames are being discarded as unvoiced… ("waste")
  • Waste of relevant and scarce data…
– A few unvoiced and silence frames are being misclassified as voiced ("contamination")
  • Contamination of the data to be analyzed

Total Correct Classifications = 81.615 ± 1.13912 %
Total Classification Error = 18.385 %
(Theoretical random classifier correct classifications = 33.33 %)


Audio Analysis front-end

• Voiced / Unvoiced / Silence (V/U/S) detection

– Advantages
  • Only quasi-stationary parts of the speech signal are used
    – Includes most of the speaker information in a speech signal
    – Avoids model degradation in LPC/LSP
  • Potentially more robust to different speakers/languages
    – Different languages may have distinct V/U/S stratification
    – Speakers talk differently (e.g. more pauses → more silence frames)

– Drawbacks
  • May leave few data points per speech sub-segment
    – Ill-estimation of the covariance matrices
      » number of data points (i.e. voiced frames) >= d(d+1)/2
      » d = dim(cov matrix) = 10 (i.e. 10 LSP coefficients)
      » nr. data points / sub-segment >= 55 frames
      » Not always guaranteed!!
  ⇒ use of dynamically sized windows
    – Does this really work??


Speaker Coarse Segmentation



Speaker Coarse Segmentation

• Divergence Shape [Campbell97][Lu2002] – Only uses LSP features
  • Assumes Gaussian distributions
– Calculated between consecutive sub-segments

$$D(i,j) = \int_x \left[\, p_i(x) - p_j(x) \,\right] \ln \frac{p_i(x)}{p_j(x)}\, dx$$

For Gaussian $p_i$, $p_j$:

$$D(i,j) = \frac{1}{2} tr\!\left[ (\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1}) \right] + \frac{1}{2} tr\!\left[ (\Sigma_i^{-1} + \Sigma_j^{-1})(\mu_i - \mu_j)(\mu_i - \mu_j)^T \right]$$

Divergence Shape (mean-dependent term dropped):

$$D(i,j) = \frac{1}{2} tr\!\left[ (\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1}) \right]$$

[Figure: speech stream with 4 speech segments; LSP feature vectors of consecutive 1.375 sec sub-segments yield covariances Σ_{i−1}, Σ_i, compared through D(i−1, i).]
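For diagonal covariances the divergence-shape trace collapses to a per-dimension sum, which makes a compact sketch possible. The real system uses full 10×10 LSP covariance matrices; the helper below (name mine) is the simplified diagonal form:

```python
# Divergence shape 0.5 * tr[(Si - Sj)(Sj^-1 - Si^-1)] specialised to
# diagonal covariances, where it reduces to a sum over dimensions:
# 0.5 * sum_k (vi_k - vj_k)^2 / (vi_k * vj_k). Simplified sketch only.
def divergence_shape_diag(var_i, var_j):
    return 0.5 * sum((a - b) ** 2 / (a * b) for a, b in zip(var_i, var_j))
```

The measure is zero for identical models, symmetric in its arguments, and grows as the covariances diverge, which is exactly what the coarse segmentation stage needs.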


Speaker Coarse Segmentation

• Dynamic Threshold [Lu2002]

• Speaker change whenever:

$$D(i, i+1) > D(i+2, i+3)$$

$$D(i, i+1) > D(i-2, i-1)$$

$$D(i, i+1) > Th_i, \qquad Th_i = \frac{\alpha}{N} \sum_{n=1}^{N} D(i-2n,\, i-2n+1)$$

[Figure: the distance curve D(i,j) over time — a local maximum of D(i,i+1) exceeding the dynamic threshold Th_i marks a POTENTIAL SPEAKER CHANGE POINT. Covariances Σ_{i−6} … Σ_{i+3} of consecutive 1.375 sec sub-segments (55 frames, 0.675 sec / 27 frame hop) feed the distances D(i−6,i−5), D(i−4,i−3), D(i−2,i−1), D(i,i+1), D(i+2,i+3).]
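The decision rule reads naturally as code. The sketch below assumes the stride-2 pairing of sub-segments shown in the figure (D(i−2,i−1), D(i,i+1), D(i+2,i+3), …); the helper name and the dictionary representation of the distance table are mine:

```python
# Dynamic-threshold speaker change test (sketch). `D` maps an index pair
# (a, b) to the divergence-shape distance between sub-segments a and b.
def is_potential_change(D, i, alpha=0.8, N=3):
    """True when D(i,i+1) is a local maximum that exceeds the
    dynamic threshold Th_i = alpha/N * sum of the N previous distances."""
    th = alpha * sum(D[(i - 2 * n, i - 2 * n + 1)] for n in range(1, N + 1)) / N
    d = D[(i, i + 1)]
    return d > D[(i - 2, i - 1)] and d > D[(i + 2, i + 3)] and d > th
```

Because Th_i tracks the recent distance history, a fixed global threshold is avoided; only the scaling factor α has to be tuned.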


Speaker Coarse Segmentation

• Coarse Segmentation performance

– Presents a high False Alarm Rate (FAR = Type I errors)

• Possible solution:

– Use a Speaker Change Validation strategy
  • Should allow decreasing the FAR…
  • … but should also avoid an increase in Missed Detections (MDR = Type II errors)


Speaker Change Validation




x

Speaker Change Validation

• Bayesian Information Criterion (BIC) (1)

– Hypothesis 0:
  • A single model θz for the speaker data in segments X and Y

$$L_0 = \sum_{i=0}^{N_x} \log p(x_i | \theta_z) + \sum_{i=0}^{N_y} \log p(y_i | \theta_z)$$

[Figure: feature distributions of segments X and Y fitted by one model θz — L0 is high when the same speaker utters X and Y, and low when different speakers do.]


Speaker Change Validation

• Bayesian Information Criterion (BIC) (2)

– Hypothesis 1:
  • Separate models θx, θy for the speakers in segments X and Y, respectively

$$L_1 = \sum_{i=0}^{N_x} \log p(x_i | \theta_x) + \sum_{i=0}^{N_y} \log p(y_i | \theta_y)$$

[Figure: feature distributions of segments X and Y fitted by separate models θx, θy — L1 fits well in both the same-speaker and different-speaker cases.]


Speaker Change Validation

• Bayesian Information Criterion (BIC) (3)

– Log Likelihood Ratio (LLR)

$$d_{LLR} = L_1 - L_0$$

  • Need to define a threshold…

– However, this is not a fair comparison…
  • The models do not have the same number of parameters!
    – More complex models always fit the data better
      » They should be penalized when compared with simpler models

$$d_{BIC} = L_1 - L_0 - \frac{\lambda}{2}\, \Delta K \cdot \log(N_x + N_y)$$

ΔK = difference in the number of parameters of the two hypotheses. No threshold needed! Or is it!?


Speaker Change Validation

• Bayesian Information Criterion (BIC) (4)

– Using Gaussian models for θx, θy and θz:

$$\Delta BIC(x, y) = \frac{1}{2}\left[ (N_x + N_y) \log |\Sigma_z| - N_x \log |\Sigma_x| - N_y \log |\Sigma_y| \right] - \lambda\, \frac{1}{2}\left( d + \frac{1}{2}\, d\,(d+1) \right) \log (N_x + N_y)$$

– Validate Speaker Change Point when:

$$\Delta BIC(x, y) > 0$$

[Figure: at a POTENTIAL SPEAKER CHANGE POINT between sub-segments Σi and Σi+1, test BIC(Σi, Σi+1) > 0 to VALIDATE the SPEAKER CHANGE POINT.]

Threshold free! … but λ must be set…
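A one-dimensional sketch of the Gaussian ΔBIC test above makes the structure visible (d = 1, so the complexity term ΔK = d + d(d+1)/2 = 2; the function name is mine, and the real system works on 10-dimensional LSP covariances):

```python
import math

# 1-D Gaussian Delta-BIC sketch: positive values support a speaker
# change between segments x and y; negative values reject it.
def delta_bic_1d(x, y, lam=0.6):
    nx, ny = len(x), len(y)
    n = nx + ny
    def var(s):
        m = sum(s) / len(s)
        return sum((v - m) ** 2 for v in s) / len(s)
    # likelihood gain of modeling x and y separately vs. jointly
    gain = 0.5 * (n * math.log(var(x + y))
                  - nx * math.log(var(x)) - ny * math.log(var(y)))
    penalty = lam * 0.5 * 2 * math.log(n)   # lam/2 * DeltaK * log(Nx+Ny)
    return gain - penalty
```

Two segments drawn from well-separated distributions score strongly positive, while comparing a segment with itself leaves only the negative penalty term.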


Speaker Change Validation

• Bayesian Information Criterion (BIC) (5)

– BIC needs large amounts of data for good accuracy!
  • Each speech segment only contains 55 data points… too few!

– Solution:
  • Speaker Model Update…

[Figure: same validation diagram as before, emphasizing that each sub-segment spans only 1.375 secs (55 frames).]


Speaker Model Update



Speaker Model Update

• "Quasi-GMM" speaker modeling [Lu2002] – Approximation to a GMM (Gaussian Mixture Model)

  • using segmental clustering of Gaussian models instead of EM
    – Gaussian models incrementally updated with newly arriving speaker data
  • less accurate than a GMM…
    – … but feasible for real-time operation

[Figure: past sub-segment covariances Σ_{i−6} … Σ_i merged over time into a Quasi-GMM Speaker Model (Σ_QGMM).]


Speaker Model Update

• "Quasi-GMM" speaker modeling [Lu2002]

– Segmental Clustering
  • Start with one Gaussian mixture (~GMM1)
  • DO:
    – Update the mixture as speaker data is received
  • WHILE:
    » dissimilarity between the mixture model before and after the update is sufficiently small
  • Create a new Gaussian mixture (GMMn+1)
    – Up to a maximum of 32 mixtures (GMM32)
  • Mixture weight (w_m):

$$w_m = \frac{N_m}{N_{qGMM}}, \qquad N_{qGMM} = \sum_{m=1}^{S} N_m$$


Speaker Model Update

• "Quasi-GMM" speaker modeling [Lu2002]

– Gaussian model on-line updating (mixture m holds N_m frames; N new frames arrive with mean μ and covariance Σ):

$$\mu_m' = \frac{N_m\, \mu_m + N\, \mu}{N_m + N}$$

$$\Sigma_m' = \frac{N_m}{N_m + N}\, \Sigma_m + \frac{N}{N_m + N}\, \Sigma + \frac{N_m\, N}{(N_m + N)^2}\, (\mu_m - \mu)(\mu_m - \mu)^T$$

– μ-dependent terms are discarded [Lu2002]:

$$\Sigma_m' = \frac{N_m}{N_m + N}\, \Sigma_m + \frac{N}{N_m + N}\, \Sigma$$

  • Increases robustness to changes in noise and background sound
    – ~ Cepstral Mean Subtraction (CMS)

$$N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}, \qquad n = 10$$
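The mean-free covariance update is just a sample-count-weighted average. A per-dimension (diagonal) sketch, with a helper name of my own:

```python
# On-line covariance update with the mean-dependent term discarded, as in
# [Lu2002]. Diagonal sketch: var_old holds the current mixture variances
# over n_old frames, var_new the variances of the n_new new frames.
def update_variance(var_old, n_old, var_new, n_new):
    n = n_old + n_new
    return [(n_old * a + n_new * b) / n for a, b in zip(var_old, var_new)]
```

Dropping the (μ_m − μ)(μ_m − μ)ᵀ term means a constant offset in the features (e.g. a change in background level) cannot inflate the model's covariance, which is what gives the CMS-like robustness.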


Speaker Change Validation



Speaker Change Validation

• BIC and Quasi-GMM Speaker Models

– Validate Speaker Change Point when:

$$\Delta BIC(\Sigma_{qGMM}, \Sigma_j) = \sum_{m=1}^{S} w_m\, \Delta BIC(\Sigma_m, \Sigma_j) > 0, \qquad w_m = \frac{N_m}{N_{qGMM}}, \quad N_{qGMM} = \sum_{m=1}^{S} N_m$$

[Figure: the Quasi-GMM Speaker Model (ΣqGMM), built from Σ_{i−6} … Σi, is tested against the incoming sub-segment: BIC(ΣqGMM, Σi+1) > 0 VALIDATEs the SPEAKER CHANGE POINT.]


Complete System



Complete System

[Figure: complete system pipeline — speech samples → 25 ms frames (200 samples) → LSP feature vectors (LSP1 … LSP10) → 1.375 sec sub-segments (55 frames, 0.675 sec / 27 frame hop) → covariances Σi → divergence-shape distances D(i−6,i−5) … D(i+2,i+3) → POTENTIAL SPEAKER CHANGE POINT → Quasi-GMM Speaker Model N(μqGMM, ΣqGMM) → BIC(ΣqGMM, Σi+1) > 0 → VALIDATE SPEAKER CHANGE POINT.]


Experimental Results

• Speaker Datasets:

– INESC Porto dataset:
  • Sources:
    – MPEG-7 Content Set CD1 [MPEG.N2467]
    – broadcast news from assorted sources
    – male, female, various languages
  • 43 minutes of speaker audio
    – 16 bit @ 22.05 kHz PCM, single-channel
  • Ground Truth
    – 181 speaker changes
    – Manually annotated
    – Speaker segment durations:
      » Maximum ≈ 120 secs
      » Minimum = 2.25 secs
      » Mean = 19.81 secs
      » Std. Dev. = 27.08 secs


Experimental Results

• Speaker Datasets:

– TIMIT/AUTH dataset:
  • Sources:
    – TIMIT database
      » 630 English speakers
      » 6300 sentences
  • 56 minutes of speaker audio
    – 16 bit @ 22.05 kHz PCM, single-channel
  • Ground Truth
    – 983 speaker changes
    – Manually annotated
    – Speaker segment durations:
      » Maximum ≈ 12 secs
      » Minimum = 1.139 secs
      » Mean = 3.28 secs
      » Std. Dev. = 1.52 secs


Experimental Results

• Efficiency Measures

With GT = nr. of ground-truth speaker changes, DET = nr. of detected changes, CFC = correctly found changes, FA = false alarms, MD = missed detections:

$$FAR = \frac{FA}{GT + FA} \qquad\qquad MDR = \frac{MD}{GT}$$

$$PRC = \frac{CFC}{DET} \qquad\qquad RCL = \frac{CFC}{GT} = 1 - MDR$$

$$F_1 = \frac{2.0 \cdot PRC \cdot RCL}{PRC + RCL}$$
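The measures above as code (function name mine), taking the raw counts and returning all five quantities:

```python
# Speaker-segmentation efficiency measures. GT = ground-truth changes,
# CFC = correctly found changes, FA = false alarms, MD = missed
# detections; DET = CFC + FA is the total of detected change points.
def segmentation_metrics(gt, cfc, fa, md):
    det = cfc + fa
    far = fa / (gt + fa)            # False Alarm Rate
    mdr = md / gt                   # Missed Detection Rate
    prc = cfc / det                 # Precision
    rcl = cfc / gt                  # Recall (= 1 - MDR when cfc + md = gt)
    f1 = 2.0 * prc * rcl / (prc + rcl)
    return {"FAR": far, "MDR": mdr, "PRC": prc, "RCL": rcl, "F1": f1}
```

For the INESC Porto dataset (GT = 181), plugging in a hypothetical detection outcome immediately yields the operating point on the FAR/MDR plane used in the tuning slides.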


Experimental Results

• System's Parameters fine-tuning

– Parameters
  • Dynamic Threshold: α and nr. of previous frames
  • BIC: λ
  • qGMM: mixture creation thresholds
  • Detection Tolerance Interval: set to [-1; +1] secs

– Tune the system towards higher FAR & lower MDR
  • Missed speaker changes cannot be recovered by subsequent processing
  • False speaker changes will hopefully be discarded by subsequent processing
    – Speaker Tracking module (future work)
      » Merge adjacent segments identified as belonging to the same speaker


Experimental Results

• Dynamic Threshold and BIC parameters (α and λ)

[Figure: ROC curves (MDR vs. FAR, both 0-100 %) using BIC, for λ = 0.5, 0.7 and 0.9, with alpha swept from 0.0 to 2.0 along each curve.]

Best results found for: α = 0.8, λ = 0.6


Experimental Results

• INESC Porto dataset evaluation (1)

INESC System ver.1:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC

INESC System ver.2:
  • Features: LSP
  • Voiced Filter enabled
  • On-line processing (real-time)
  • Uses BIC


Experimental Results

• TIMIT/AUTH dataset evaluation (1)

INESC System ver.1:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC

INESC System ver.2:
  • Features: LSP
  • Voiced Filter enabled
  • On-line processing (real-time)
  • Uses BIC


Experimental Results

• INESC Porto dataset evaluation (2)

INESC System ver.2:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC

AUTH System 1:
  • Features: AudioSpectrumCentroid, AudioWaveformEnvelope
  • Multiple-pass (non-real-time)
  • Uses BIC


Experimental Results

• TIMIT/AUTH dataset evaluation (2)

AUTH System 1:
  • Features: AudioSpectrumCentroid, AudioWaveformEnvelope
  • Multiple-pass (non-real-time)
  • Uses BIC

INESC System ver.2:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC


Experimental Results

• TIMIT/AUTH dataset evaluation (3)

AUTH System 2:
  • Features: DFT Mag, STE, AudioWaveformEnvelope, AudioSpectrumCentroid, MFCC
  • Fast system (real-time?)
  • Uses BIC

INESC System ver.2:
  • Features: LSP
  • Voiced Filter disabled
  • On-line processing (real-time)
  • Uses BIC


Experimental Results

• Time Shifts on the detected Speaker Change Points

– Detection tolerance interval = [-1, 1] secs

[Figure: histogram of speaker change detection time shifts (secs.) vs. nr. of occurrences, for INESC System ver.1, over the [-1, +1] sec tolerance interval.]


Achievements

• Software
– C++ routines
  • Numerical routines
    – Matrix Determinant
    – Polynomial Roots
    – Levinson-Durbin
  • LPC (adapted from Marsyas)
  • LSP
  • Divergence and Bhattacharyya Shape metrics
  • BIC
  • Quasi-GMM modeling class
– Automatic Speaker Segmentation prototype application
  • As a Library (DLL)
    – Integrated into "4VDO - Annotator"
  • As a stand-alone application

• Reports
– VISNET deliverables
  • D29, D30, D31, D40, D41

• Publications (co-author)
– "Speaker Change Detection using BIC: A comparison on two datasets"
  • Accepted to ISCCSP2006
– "Automatic Speaker Segmentation using Multiple Features and Distance Measures: A Comparison of Three Approaches"
  • Submitted to ICME2006

DEMO


Conclusions

• Open Issues…
– Voiced detection procedure
  • Results should improve…
– Parameter fine-tuning
  • Dynamic Threshold
  • BIC parameter
  • Quasi-GMM Model

• Further Work
– Audio Features
  • Evaluate other features for speaker segmentation, tracking and identification
    – Pitch
    – MFCC
    – …
– Speaker Tracking
  • Clustering of speaker segments
  • Evaluation
    – Ground Truth: needs manual annotation work
– Speaker Identification
  • Speaker Model Training
  • Evaluation
    – Ground Truth: needs manual annotation work


Contributors

• INESC Porto
– Rui Costa
– Jaime Cardoso
– Luís Filipe Teixeira
– Sílvio Macedo

• VISNET
– Aristotle University of Thessaloniki (AUTH), Greece
  • Margarita Kotti
  • Emmanouil Benetos
  • Constantine Kotropoulos


Thank you!

Questions?

[email protected]