Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker...

29
speaker change detection using fundamental frequency with application to multi-talker segmentation May 16, 2019 Aidan Hogg, Christine Evers and Patrick Naylor Electrical and Electronic Engineering, Imperial College London, UK

Transcript of Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker...

Page 1: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

speaker change detection using fundamental frequencywith application to multi-talker segmentation

May 16, 2019Aidan Hogg, Christine Evers and Patrick NaylorElectrical and Electronic Engineering, Imperial College London, UK

Page 2: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

diarizationMotivation

What is speaker diarization?

Answers the question “who spoke when?” in an audio recording.

Is diarization really that useful?

∙ Speaker indexing and rich transcription∙ Speaker segmentation and clustering helping Automatic SpeechRecognition (ASR) systems

∙ Preprocessing modules for single speaker-based algorithms

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 1

Page 3: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

diarization method

Page 4: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

speech signalDiarization Method

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 3

Page 5: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

segmentationDiarization Method

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 4

Page 6: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

clusteringDiarization Method

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 5

Page 7: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

segmentation motivationDiarization Method

Is good segmentation really that useful?

Why not just segment the audio stream into small uniform segmentsand cluster with realignment?

If the speech segments are small then each segment only containsa small amount of information that can be used for clustering.

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 6

Page 8: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

speaker pitch tracks

Page 9: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

the ami meeting room corpusSpeaker Pitch Tracks

Multi-modal data set consisting of 100 hours of meeting recordings.

Recorded in English using three different rooms with differentacoustic properties and includes mostly non-native speakers.

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 8

Page 10: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

speaker pitch tracks from ‘es2004b’Speaker Pitch Tracks

200 400 600 800 1000Time (s)

100

150

200

250

300

Estim

ated

pitc

h(H

z) Speaker ASpeaker BSpeaker CSpeaker D

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 9

Page 11: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

speaker pitch tracks from ‘ts3003b’Speaker Pitch Tracks

200 400 600 800 1000Time (s)

100

150

200

250

300

Estim

ated

pitc

h(H

z) Speaker ASpeaker BSpeaker CSpeaker D

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 10

Page 12: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

pitch segmentation

Page 13: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

the new ideaPitch Segmentation

Assumption: If the speaker’s pitch only varies in a smooth mannerdue to physiological constraints (Xu, 2002) it should be possible toestimate the future pitch of the speaker based on their current pitch.

Main Idea: Use a Kalman filter to carry out this future pitchestimation. If the pitch can’t be estimated then the speaker haspotentially changed.

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 12

Page 14: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

proposed systemPitch Segmentation

Changedetection

Segmentationfile

Kalmanfilter

PitchEstimation

Audio input

VAD

Proposed pitch segmentation system

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 13

Page 15: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

kalman filterPitch Segmentation

The pitch 𝑥(𝑛) for a given frame 𝑛 can be written in the followingway:

𝑥(𝑛 + 1) = 𝑥(𝑛) + 𝑤,𝑤 ∈ 𝒩(0, 𝜎2

𝑤) .

The measurement 𝑧(𝑛) of the true pitch 𝑥(𝑛) can be modelledaccording to:

𝑧(𝑛) = 𝑥(𝑛) + 𝑣,𝑣 ∈ 𝒩(0, 𝜎2

𝑣) .

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 14

Page 16: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

predictionPitch Segmentation

Performed on every frame

Predicted pitch estimate:

𝑥𝑛|𝑛−1 = 𝑥𝑛−1|𝑛−1.

Predicted estimate variance:

𝑃𝑛|𝑛−1 = 𝑃𝑛−1|𝑛−1 + 𝜎2𝑤.

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 15

Page 17: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

updatePitch Segmentation

Performed if the frame is considered to be voiced

Updated pitch estimate and updated estimate variance:

𝑥𝑛|𝑛 = 𝑥𝑛|𝑛−1 + 𝐾𝑛(𝑧𝑛 − 𝑥𝑛|𝑛−1)

𝑃𝑛|𝑛 = (1 − 𝐾𝑛)2𝑃𝑛|𝑛−1 + 𝐾2𝑛𝜎2

𝑣.

If the Kalman gain is 𝐾𝑛 = 1: 𝑥𝑛|𝑛 = 𝑧𝑛 ( just the measurement)

If the Kalman gain is 𝐾𝑛 = 0: 𝑥𝑛|𝑛 = 𝑥𝑛|𝑛−1 ( just the prediction)

Optimal Kalman gain:

𝐾𝑛 =𝑃𝑛|𝑛−1

𝑆𝑛.

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 16

Page 18: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

variance ‘p’Pitch Segmentation

0

0.2

0.4

0.6

kHz

− 40 − 35 − 30 − 25 − 20 − 15 − 10 − 5 0[dB]

0 1 2 3 4 5 6Seconds

0.5

1.0

1.5

2.0

P

Variance

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 17

Page 19: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

speaker change detectionPitch Segmentation

A Kalman filter is initialised and tracks first speaker.

If the error between measurement and prediction becomes largerthan a threshold (10 Hz) then all previously generated Kalman tracksare checked.

∙ If the closest previous Kalman pitch track is below a threshold(50 Hz) then this Kalman filter is continued.

∙ If on the other hand, the closest Kalman filter to themeasurement does not satisfy this threshold then a newKalman filter is generated.

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 18

Page 20: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

ground truth

Page 21: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

a comparison of pitch and speaker changesGround Truth

Meeting SC | PCES2004a 94.49ES2004b 89.25ES2004c 95.21ES2004d 91.85IS1009a 96.12IS1009b 98.94IS1009c 97.67IS1009d 98.55EN2002a 92.35EN2002b 87.01EN2002c 79.37EN2002d 86.00TS3003a 76.54TS3003b 76.59TS3003c 75.82TS3003d 81.34

Meeting PC | SCES2004a 78.76ES2004b 68.60ES2004c 70.22ES2004d 73.38IS1009a 68.91IS1009b 64.27IS1009c 59.38IS1009d 66.60EN2002a 88.59EN2002b 83.40EN2002c 87.70EN2002d 81.02TS3003a 52.08TS3003b 48.46TS3003c 56.47TS3003d 62.68

SC | PC The probability that there is a ‘speaker change’ given that there is a ‘pitch change’PC | SC The probability that there is a ‘pitch change’ given that there is a ‘speaker change’

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 20

Page 22: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

evaluation

Page 23: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

mfcc vs pitch segmentationEVALUATION

Segmentationfile

MFCCextraction

VADAudio input

Benchmark system (‘Sidekit’)https://projets-lium.univ-lemans.fr/s4d/

Changedetection

Segmentationfile

Kalmanfilter

PitchEstimation

Audio input

VAD

Proposed system

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 22

Page 24: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

benchmark system evaluationEVALUATION

0 2 4 6 8 10 12 14 16Meeting

0

20

40

60

80

100

Rate

(%)

Hit Miss Multi-Hit

500 ms collar around each speaker change boundary (250 ms before and after)

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 23

Page 25: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

proposed system evaluationEVALUATION

0 2 4 6 8 10 12 14 16Meeting

0

20

40

60

80

100

Rate

(%)

Hit Miss Multi-Hit

500 ms collar around each speaker change boundary (250 ms before and after)

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 24

Page 26: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

evaluation comparisonEVALUATION

Pitch System MFCC System0

10

20

30

40

50

60

70

80

Rate

(%)

HitMissMulti-Hit

500 ms collar around each speaker change boundary (250 ms before and after)

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 25

Page 27: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

conclusionEVALUATION

The proposed Kalman filter prediction error-based approachperformed well when compared against a previous MFCC-basedmethod.

An evaluation on the AMI corpus showed a speaker changeddetection increase from 43.3% to 70.5%.

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 26

Page 28: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

paper contributionsEVALUATION

In this paper we have...

...carried out a study of meetings in the AMI corpus that has shownthat a pitch change is a strong indicator of a speaker change.

...highlighted that an individual’s pitch is smoothly varying and,therefore, can be predicted by using a Kalman filter.

...proposed a Kalman filtering approach to identify speaker changeboundaries based on a model of the temporal variation of pitch.

A. Hogg, C. Evers and P. Naylor | Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation 27

Page 29: Speaker Change Detection Using Fundamental Frequency With Application … · 2019-10-10 · Speaker Change Detection Using Fundamental Frequency With Application To Multi-talker Segmentation

Questions?