Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker...

speaker change detection using fundamental frequencywith application to multi-talker segmentation

May 16, 2019Aidan Hogg, Christine Evers and Patrick NaylorElectrical and Electronic Engineering, Imperial College London, UK

diarizationMotivation

What is speaker diarization?

Answers the question “who spoke when?” in an audio recording.

Is diarization really that useful?

∙ Speaker indexing and rich transcription∙ Speaker segmentation and clustering helping Automatic SpeechRecognition (ASR) systems

∙ Preprocessing modules for single speaker-based algorithms

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 1

diarization method

speech signalDiarization Method


segmentationDiarization Method


clusteringDiarization Method


segmentation motivationDiarization Method

Is good segmentation really that useful?

Why not just segment the audio stream into small uniform segmentsand cluster with realignment?

If the speech segments are small then each segment only containsa small amount of information that can be used for clustering.


speaker pitch tracks

the ami meeting room corpusSpeaker Pitch Tracks

Multi-modal data set consisting of 100 hours of meeting recordings.

Recorded in English using three different rooms with differentacoustic properties and includes mostly non-native speakers.


speaker pitch tracks from ‘es2004b’Speaker Pitch Tracks

200 400 600 800 1000Time (s)

100

150

200

250

300

Estim

ated

pitc

h(H

z) Speaker ASpeaker BSpeaker CSpeaker D


speaker pitch tracks from ‘ts3003b’Speaker Pitch Tracks

200 400 600 800 1000Time (s)

100

150

200

250

300

Estim

ated

pitc

h(H

z) Speaker ASpeaker BSpeaker CSpeaker D


pitch segmentation

the new ideaPitch Segmentation

Assumption: If the speaker’s pitch only varies in a smooth mannerdue to physiological constraints (Xu, 2002) it should be possible toestimate the future pitch of the speaker based on their current pitch.

Main Idea: Use a Kalman filter to carry out this future pitchestimation. If the pitch can’t be estimated then the speaker haspotentially changed.


proposed systemPitch Segmentation

Changedetection

Segmentationfile

Kalmanfilter

PitchEstimation

Audio input

VAD

Proposed pitch segmentation system


kalman filterPitch Segmentation

The pitch 𝑥(𝑛) for a given frame 𝑛 can be written in the followingway:

𝑥(𝑛 + 1) = 𝑥(𝑛) + 𝑤,𝑤 ∈ 𝒩(0, 𝜎2

𝑤) .

The measurement 𝑧(𝑛) of the true pitch 𝑥(𝑛) can be modelledaccording to:

𝑧(𝑛) = 𝑥(𝑛) + 𝑣,𝑣 ∈ 𝒩(0, 𝜎2

𝑣) .


predictionPitch Segmentation

Performed on every frame

Predicted pitch estimate:

𝑥𝑛|𝑛−1 = 𝑥𝑛−1|𝑛−1.

Predicted estimate variance:

𝑃𝑛|𝑛−1 = 𝑃𝑛−1|𝑛−1 + 𝜎2𝑤.


variance ‘p’Pitch Segmentation

0

0.2

0.4

0.6

kHz

− 40 − 35 − 30 − 25 − 20 − 15 − 10 − 5 0[dB]

0 1 2 3 4 5 6Seconds

0.5

1.0

1.5

2.0

P

Variance


speaker change detectionPitch Segmentation

A Kalman filter is initialised and tracks first speaker.

If the error between measurement and prediction becomes largerthan a threshold (10 Hz) then all previously generated Kalman tracksare checked.

∙ If the closest previous Kalman pitch track is below a threshold(50 Hz) then this Kalman filter is continued.

∙ If on the other hand, the closest Kalman filter to themeasurement does not satisfy this threshold then a newKalman filter is generated.


ground truth

a comparison of pitch and speaker changesGround Truth

Meeting SC | PCES2004a 94.49ES2004b 89.25ES2004c 95.21ES2004d 91.85IS1009a 96.12IS1009b 98.94IS1009c 97.67IS1009d 98.55EN2002a 92.35EN2002b 87.01EN2002c 79.37EN2002d 86.00TS3003a 76.54TS3003b 76.59TS3003c 75.82TS3003d 81.34

Meeting PC | SCES2004a 78.76ES2004b 68.60ES2004c 70.22ES2004d 73.38IS1009a 68.91IS1009b 64.27IS1009c 59.38IS1009d 66.60EN2002a 88.59EN2002b 83.40EN2002c 87.70EN2002d 81.02TS3003a 52.08TS3003b 48.46TS3003c 56.47TS3003d 62.68

SC | PC The probability that there is a ‘speaker change’ given that there is a ‘pitch change’PC | SC The probability that there is a ‘pitch change’ given that there is a ‘speaker change’


evaluation

mfcc vs pitch segmentationEVALUATION

Segmentationfile

MFCCextraction

VADAudio input

Benchmark system (‘Sidekit’)https://projets-lium.univ-lemans.fr/s4d/

Changedetection

Segmentationfile

Kalmanfilter

PitchEstimation

Audio input

VAD

Proposed system


https://projets-lium.univ-lemans.fr/s4d/

benchmark system evaluationEVALUATION

0 2 4 6 8 10 12 14 16Meeting

0

20

40

60

80

100

Rate

(%)

Hit Miss Multi-Hit

500 ms collar around each speaker change boundary (250 ms before and after)


proposed system evaluationEVALUATION

0 2 4 6 8 10 12 14 16Meeting

0

20

40

60

80

100

Rate

(%)

Hit Miss Multi-Hit



evaluation comparisonEVALUATION

Pitch System MFCC System0

10

20

30

40

50

60

70

80

Rate

(%)

HitMissMulti-Hit



conclusionEVALUATION

The proposed Kalman filter prediction error-based approachperformed well when compared against a previous MFCC-basedmethod.

An evaluation on the AMI corpus showed a speaker changeddetection increase from 43.3% to 70.5%.


paper contributionsEVALUATION

In this paper we have...

...carried out a study of meetings in the AMI corpus that has shownthat a pitch change is a strong indicator of a speaker change.

...highlighted that an individual’s pitch is smoothly varying and,therefore, can be predicted by using a Kalman filter.

...proposed a Kalman filtering approach to identify speaker changeboundaries based on a model of the temporal variation of pitch.


Questions?

Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker...

Documents

Transcript of Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker...