
Adaptive Speaker Identification with AudioVisual Cues for Movie Content Analysis

Ying Li, Shri Narayanan, and C.-C. Jay Kuo

Integrated Media Systems Center and Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089-2564

E-mail: {yingli, shri, cckuo}@sipi.usc.edu

An adaptive speaker identification system which employs both audio and visual cues is proposed in this work for movie content analysis. Specifically, a likelihood-based approach is first applied for speaker identification using pure speech data, and techniques such as face detection/recognition and mouth tracking are applied for talking face recognition using pure visual data. These two information cues are then effectively integrated under a probabilistic framework for achieving more robust results. Moreover, to account for speakers' voice variations along time, we propose to update their acoustic models on the fly by adapting to their incoming speech data. An improved system performance (80% identification accuracy) has been observed on two test movies.

Keywords: Movie content analysis; Speech segmentation and clustering; Mouth detection and tracking; Adaptive speaker identification; Unsupervised speaker model adaptation

1. Introduction

A fundamental task in video analysis is to organize and index multimedia data in a meaningful manner so as to facilitate users' access, such as browsing and retrieval. This work proposes to extract an important type of information, the speaker identity, from feature films for content indexing and browsing purposes.

So far, a large amount of speaker identification work has been reported on standard speech databases. For instance, Chagnolleau et al. (1999) adopted a likelihood-based speaker detection approach to estimate target speakers' segments with Hub4 broadcast news, which is the benchmark test set provided by NIST (National Institute of Standards and Technology). However, while acceptable results were reported in its one-target-speaker case, the system performance degraded dramatically in the two-target-speaker case. Similar work was also reported by Rosenberg et al. (1998), where NBC Nightly News broadcasts were used. Johnson (1999) addressed the problem of labeling speaker turns by automatically segmenting and clustering a continuous audio stream. A frame-based clustering approach was proposed, and an accuracy of 70% was obtained on the 1996 Hub4 development data.

Recently, with the increased accessibility of other media sources, researchers have attempted to improve system performance by integrating knowledge from all available media cues. For instance, Tsekeridou and Pitas (2001) proposed to identify speakers by integrating cues from both speaker recognition and facial analysis schemes. This system is, however, impracticable for generic video types since it assumes there is only one face in each video frame. Similar work was also reported by Li et al. (2001), where TV sitcoms were used as test sequences. In (Li et al., 2002), a speaker identification system was proposed for indexing movie content, where both speech and visual cues were employed. This system, however, has certain limitations since it only identifies speakers in movie dialogs.

From another point of view, most existing work in this field deals with supervised identification problems, where speaker models are not allowed to change once they are trained. Two drawbacks arise when we apply supervised identification techniques to feature films.


1. The insufficiency of training data. A speaker's voice can have dramatic variations along time, especially in feature films. Therefore, a model built with limited training data (e.g., 40 or 50 seconds of speech) cannot model a speaker well for a long video sequence. Moreover, manually transcribing training data is a very time-consuming task.

2. The decrease of system efficiency. Because we have to go through movies at least once to collect and transcribe training data before the actual identification process can be started, time is wasted and system efficiency decreases.

An adaptive speaker identification system is thus proposed in this work, which aims to offer a better solution to speaker identification for movie content analysis. Specifically, a set of initial speaker models is first constructed during the system initialization stage; then we keep updating them on the fly by adapting to speakers' newly incoming data. By doing so, we are able to capture the speakers' voice variations along time and thus achieve a higher identification accuracy. Both audio and visual sources are exploited in the identification process, where the audio content is analyzed to recognize speakers using a likelihood-based approach, while the visual content is parsed to recognize talking faces using face detection/recognition and mouth tracking techniques.

The rest of the paper is organized as follows. Section 2 elaborates on the proposed identification scheme, which includes speech segmentation and clustering, mouth detection and tracking, audiovisual-based speaker identification, and unsupervised speaker model adaptation. Experimental results obtained on two test movies are reported and discussed in Section 3. Finally, concluding remarks and possible future extensions are given in Section 4.

2. Adaptive speaker identification

Figure 1 shows the proposed system framework, which consists of the following six major modules: (1) shot detection and audio classification, (2) face detection, recognition and mouth tracking, (3) speech segmentation and clustering, (4) initial speaker modeling, (5) audiovisual (AV)-based speaker identification, and (6) unsupervised speaker model adaptation. As shown, given an input video, we first split it into audio and visual streams, then perform shot detection on the visual source. Following this, a shot-based audio classification is carried out which categorizes each shot into either environmental sound, silence, music, or speech. Next, with non-speech shots discarded, all speech shots are further processed in the speech segmentation and clustering module, where speeches from the same person are grouped into one cluster. Meanwhile, a face detection/recognition and mouth tracking process is also performed on speech shots for recognizing talking faces. Both the speech and face cues are then effectively integrated to recognize speakers in the audiovisual-based identification module, with the assistance of either initial or updated speaker models. Finally, the identified speaker's model is updated in the unsupervised model adaptation module, which becomes effective in the next round of the identification process.

2.1. Shot Detection and Audio Classification

The first step towards visual content analysis is shot detection. In this work, a color histogram-based approach is employed to fulfill this task. Specifically, once a distinct peak is detected in the frame-to-frame histogram difference, we declare it as a shot cut (Li and Kuo, 2000).
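As a concrete illustration, the sketch below detects shot cuts from frame-to-frame color histogram differences. It is a minimal reading of this approach; the histogram bin count, the L1 difference measure, and the fixed peak threshold are illustrative assumptions rather than the settings of (Li and Kuo, 2000).

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Normalized, quantized RGB color histogram of one frame (H x W x 3 uint8)."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / (frame.shape[0] * frame.shape[1])

def detect_shot_cuts(frames, threshold=0.4):
    """Declare a shot cut wherever the frame-to-frame histogram difference
    shows a distinct peak (approximated here by exceeding a fixed threshold)."""
    cuts, prev_hist = [], None
    for idx, frame in enumerate(frames):
        hist = color_histogram(frame)
        if prev_hist is not None and 0.5 * np.abs(hist - prev_hist).sum() > threshold:
            cuts.append(idx)            # first frame of the new shot
        prev_hist = hist
    return cuts
```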

In the second step, we analyze the audio content of each shot and classify it into one of the following four classes: silence, speech (including speech with music), music, and environmental sound. Five different audio features, including the short-time energy function, short-time average zero-crossing rate (ZCR), short-time fundamental frequency (SFuF), energy band ratio (EBR) and silence ratio (SR), are extracted for this purpose. Specifically, silence is detected by thresholding the energy and ZCR features; speech is recognized by exploiting the inter-relationship between the energy, ZCR and SFuF features, as well as by checking the EBR and SR values; finally, music is distinguished from other audio signals by measuring both the ZCR and SFuF features (Zhang and Kuo, 1999).

Figure 1. Block diagram of the proposed adaptive speaker identification system.

2.2. Face Detection, Recognition and Mouth Tracking

The goal of this module is to detect and recognize talking faces in speech shots.

2.2.1. Face Detection and Recognition

The face detection and recognition library we used was developed by Hewlett-Packard Research Labs based on a neural network technology (HP, 1998). Currently, this library can only detect upright frontal faces or faces rotated by plus or minus 10 degrees from the vertical. To speed up the process, we only carry out face detection on speech shots, as shown in Figure 1. Figure 2(a) shows a detection example where the detected face is bounded by a rectangle and the two eyes are indicated by crosses. To facilitate the subsequent recognition process, we organize the detection results into a set of face sequences, where all frames within each sequence contain the same (nonzero) number of human faces.

The face database used for face recognition is constructed as follows. During the system initialization, we first ask users to select their N casts of interest (also called target speakers) by randomly choosing video frames that contain the casts' faces. These faces are then detected, associated with the casts' names, and added into the face database.

During the face recognition process, each detected face in the first frame of each face sequence is recognized. The result is returned as a face vector f = [f1, . . . , fN], where fi is a value in [0, 1] indicating the confidence of the face being target cast i.

2.2.2. Mouth detection and tracking

At this step, we first apply a weighted block-matching approach to detect the mouth in the first frame of a face sequence, and then track it for the rest of its frames. Note that if more than two faces are present in the sequence, we virtually split it into a number of sub-sequences, each focusing on one face.

There has been some work reported on mouth detection and tracking for video indexing and speaker detection purposes. For example, Tsekeridou and Pitas (2001) employed a mouth-template tracking process to assist in talking face detection. In (Sobottka and Pitas, 1998), a facial feature extraction and tracking system was reported, where features including eyebrows, eyes, nostrils, mouth and chin were determined by searching the minima and maxima in a topographic grey-level relief. A block matching mechanism was further applied to track the mouth in the sequence. This approach, however, is not suitable for our research since the block matching method does not work well in tracking a talking mouth.

Figure 2. (a) A detected human face, (b) the coarse mouth center, the mouth search area, and two small squares for skin-color determination, and (c) the detected mouth region.

1. Mouth detection

Figure 3 shows three abstracted face models, where (a) gives a model of an upright face, and (b), (c) give models for rotated faces. According to facial biometric analogies, we know that there is a certain ratio between the interocular distance and the distance dist between the eyes and the mouth. Thus, once we obtain the eye positions from the face detector, which are denoted by (x1, y1) and (x2, y2) in the figure, we can subsequently locate the coarse mouth center (x, y) for an upright face.

Figure 3. Three abstracted face models for (a): upright face, (b) and (c): rotated faces.

When a face is rotated as shown in Figure 3(b) or (c), we compute its mouth center as

x = (x1 + x2)/2 ± dist × sin(θ),
y = (y1 + y2)/2 + dist × cos(θ),          (1)

where θ is the head rotation angle.

We then expand the coarse mouth center (x, y) into a rectangular mouth search area as shown in Figure 2(b), and perform a weighted block-matching process to locate the target mouth. The intuition we use to detect the mouth is that it presents the largest color difference from the skin color, which is determined from the average pixel color in the two small under-eye squares shown in Figure 2(b).

Now, denote the size of the mouth search area by (SW, SH) and the size of the desired mouth by (MW, MH); then the mouth's upper-left corner is located as

(i, j) = arg max_{(i,j)} [ c_{i,j} × Σ_{w=i}^{i+MW} Σ_{h=j}^{j+MH} dist(w, h) ],          (2)

where

dist(w, h) = dist(YUV(w, h), YUV(skin-color)),

0 ≤ i ≤ (SW − MW), and 0 ≤ j ≤ (SH − MH). dist() is a plain Euclidean distance in YUV color space, and c_{i,j} is a Gaussian weighting coefficient. An example of a correctly detected mouth is shown in Figure 2(c). For the rest of the discussion, we denote the detected mouth center by (cx, cy).
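The sketch below illustrates Equations (1) and (2): a coarse mouth center is derived from the two eye positions, and a weighted block-matching search then picks the block whose color differs most from the skin color. The eye-to-mouth ratio, the search-area and mouth-block sizes, and the Gaussian weight width are assumed values for illustration, not the paper's settings; the image is indexed as rows (y) by columns (x).

```python
import numpy as np

def coarse_mouth_center(eye1, eye2, ratio=1.1):
    """Equation (1): coarse mouth center from the eye positions.
    `ratio` (eye-to-mouth distance over interocular distance) is an assumed
    biometric constant; the '-' branch of the +/- sign matches image
    coordinates with y growing downward."""
    (x1, y1), (x2, y2) = eye1, eye2
    dist = ratio * np.hypot(x2 - x1, y2 - y1)
    theta = np.arctan2(y2 - y1, x2 - x1)            # head rotation angle
    return (x1 + x2) / 2 - dist * np.sin(theta), (y1 + y2) / 2 + dist * np.cos(theta)

def detect_mouth(frame_yuv, center, skin_color, search=(40, 30), mouth=(24, 12)):
    """Equation (2): weighted block matching inside a search area around the
    coarse center; the winning block shows the largest skin-color difference."""
    (SW, SH), (MW, MH) = search, mouth
    x0, y0 = int(center[0]) - SW // 2, int(center[1]) - SH // 2
    area = frame_yuv[y0:y0 + SH, x0:x0 + SW].astype(float)
    dist = np.linalg.norm(area - np.asarray(skin_color, float), axis=2)
    best_score, best_pos = -np.inf, (0, 0)
    for i in range(SH - MH + 1):                    # block top row
        for j in range(SW - MW + 1):                # block left column
            # Gaussian weight favoring blocks near the coarse center (sigma assumed).
            c = np.exp(-((i - (SH - MH) / 2) ** 2 + (j - (SW - MW) / 2) ** 2) / (2 * 10.0 ** 2))
            score = c * dist[i:i + MH, j:j + MW].sum()
            if score > best_score:
                best_score, best_pos = score, (i, j)
    i, j = best_pos
    return x0 + j + MW // 2, y0 + i + MH // 2       # detected mouth center (cx, cy)
```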


2. Mouth tracking

To track the mouth over the rest of the frames, we assume that for each subsequent frame, the centroid of its mouth mask can be derived from that of the previous frame as well as from its eye positions. Moreover, we assume that the distance between the coarse mouth center (x, y) and the detected mouth center (cx, cy) remains the same for all frames.

Now, assume that we have obtained all feature data for frame fi, including the eye positions and the coarse and final mouth centers (x, y) and (cx, cy). Frame fi+1's mouth centroid (cx′, cy′) can then be computed as

cx′ = cx − (x − x′),
cy′ = cy − (y − y′),

where x, x′, y and y′ are calculated via Equation (1).

Figure 4 shows the mouth detection and tracking results, marked by rectangles, on a face sequence containing ten consecutive frames.

Finally, we apply a color histogram-based approach to determine whether the tracked mouth is talking. Particularly, if the normalized accumulated histogram difference in the mouth area of the entire or part of the face sequence f exceeds a certain threshold, we label it as a talking mouth; correspondingly, we mark sequence f as a talking face sequence.
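A minimal sketch of this talking-mouth test is given below: frame-to-frame grey-level histogram differences over the tracked mouth patches are accumulated and normalized, and the mouth is labeled as talking when the result exceeds a threshold. The bin count and threshold are illustrative assumptions.

```python
import numpy as np

def is_talking(mouth_patches, bins=32, threshold=0.25):
    """Decide whether a tracked mouth is talking from the normalized accumulated
    histogram difference; `mouth_patches` is a list of grey-level mouth regions
    (2-D uint8 arrays), one per frame of the face sequence."""
    hists = [np.histogram(p, bins=bins, range=(0, 256))[0] / max(p.size, 1)
             for p in mouth_patches]
    diffs = [0.5 * np.abs(hists[k] - hists[k - 1]).sum() for k in range(1, len(hists))]
    return bool(diffs) and (sum(diffs) / len(diffs)) > threshold
```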

2.3. Speech segmentation and clustering

For each speech shot, the two major speech processing tasks are speech segmentation and speech clustering. In the segmentation step, all individual speech segments are separated from the background noise or silence. In the clustering step, we group the same speaker's segments into homogeneous clusters so as to facilitate successive processes.

2.3.1. Speech segmentation

A general solution to separate speech from the background, or equivalently, to detect silence from the voice segment, is to apply a global energy/zero-crossing ratio thresholding scheme (Zhang and Kuo, 1999). However, while a global threshold works well on static audio content, it is not suitable for movies, which have complex audio backgrounds.

In this work, we propose to determine the threshold by adapting it to the dynamic audio content. Particularly, given the audio signal of a speech shot, we first segment it into 15ms-long audio frames, then sort them into an array based on their energies computed on the dB scale. Next, for all frames whose energy values are greater than a preset threshold Discard_thresh, we quantize them into M bins where bin1 has the lowest and binM has the highest average energy. Since both speech and silence signals are present in the shot, the threshold that separates them must be a value between these two extremes. Specifically, we determine it from the sums of the first and last three bins, as well as a predefined Speech_thresh that gives the minimum dB difference between the two signals. Currently, we set Discard_thresh to 30.0, Speech_thresh to 9.0, and M to 10.
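Since the exact combination rule for the bin sums is not spelled out, the sketch below reconstructs one plausible reading: average the three lowest-energy and three highest-energy bins, require their gap to exceed Speech_thresh, and place the threshold at the midpoint. The midpoint rule and the assumption that the samples are integer PCM (so that the dB constants above apply) are ours.

```python
import numpy as np

def adaptive_energy_threshold(signal, sr, frame_ms=15,
                              discard_thresh=30.0, speech_thresh=9.0, m_bins=10):
    """Derive a shot-adaptive speech/silence energy threshold in dB.
    `signal` is assumed to be integer PCM samples (e.g., 16-bit)."""
    frame_len = int(sr * frame_ms / 1000)
    n = len(signal) // frame_len
    frames = np.asarray(signal[:n * frame_len], float).reshape(n, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)
    kept = np.sort(energy_db[energy_db > discard_thresh])   # discard very weak frames
    if kept.size < m_bins:
        return None                                         # not enough usable frames
    bins = np.array_split(kept, m_bins)                     # bin1 = lowest ... binM = highest
    low = np.mean(np.concatenate(bins[:3]))                 # average of the 3 lowest bins
    high = np.mean(np.concatenate(bins[-3:]))               # average of the 3 highest bins
    if high - low < speech_thresh:
        return None                                         # no clear speech/silence contrast
    return 0.5 * (low + high)                               # adaptive threshold T (assumed midpoint)
```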

Next, a 4-state transition diagram is employed to separate speech segments from the background. As shown in Figure 5, the diagram has four states: in-silence, in-speech, leaving-silence and leaving-speech. Either in-silence or in-speech can be the starting state, and any state can be a final state. The input of this state machine is a sequence of frame energies, and the output is the beginning and ending frame indices of detected speech segments. The transition conditions are labeled on each edge with the corresponding actions described in parentheses. Here, Count is a frame counter, E denotes the frame energy, and L indicates the minimum length of a silence or speech segment. As we can see, this state machine basically groups blocks of continuous silence/speech frames into silence/speech segments while removing impulsive noises.
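The following sketch implements a segmenter of this kind, following the transitions of Figure 5 as we read them; the exact counter bookkeeping and the handling of a segment that runs to the end of the shot are our assumptions.

```python
def segment_speech(frame_energy, T, L=10):
    """4-state speech/silence segmenter driven by per-frame energy.
    Returns (begin_frame, end_frame) index pairs of detected speech segments."""
    IN_SIL, LEAVE_SIL, IN_SP, LEAVE_SP = range(4)
    state, count, begin, segments = IN_SIL, 0, None, []
    for t, e in enumerate(frame_energy):
        if state == IN_SIL:
            if e >= T:
                state, count, begin = LEAVE_SIL, 1, t      # candidate speech onset
        elif state == LEAVE_SIL:
            if e >= T:
                count += 1
                if count > L:                              # long enough: real speech
                    state = IN_SP
            else:
                state = IN_SIL                             # impulsive noise, discard
        elif state == IN_SP:
            if e < T:
                state, count = LEAVE_SP, 1                 # candidate speech offset
        else:                                              # LEAVE_SP
            if e < T:
                count += 1
                if count > L:                              # long enough: real silence
                    segments.append((begin, t - count))    # close the speech segment
                    state = IN_SIL
            else:
                state = IN_SP                              # brief dip, keep speaking
    if state in (IN_SP, LEAVE_SP) and begin is not None:
        segments.append((begin, len(frame_energy) - 1))    # segment runs to the end
    return segments
```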

This algorithm works best when it is performed within a shot range, since the background can be assumed to be quasi-stationary in that case. A segmentation example is shown in the upper part of Figure 8, where the detected speech segments are bounded by the passbands of a pulse curve. As we can see, all speech fragments have been successfully isolated from the impulse noise. The loud background sounds are removed as well.


Figure 4. Mouth detection and tracking results on ten consecutive video frames, where eyes are indicated by crosses and mouths are bounded by rectangles.

2.3.2. Speech clustering

Speech clustering has been studied for a long time, and many approaches have been proposed. Among them, the Kullback-Leibler distance metric, the Generalized Likelihood Ratio (GLR) criterion, the Bayesian Information Criterion (BIC), the GMM likelihood measure, and the VQ distortion criterion are the most popularly applied (Siegler et al., 1997), (Mori et al., 2001), (Chen et al., 1998). In this work, we use the BIC to measure the similarity between two speech segments.

When comparing two segments using the BIC, the distance measure can be stated as a model selection criterion where one model is represented by two separate segments X1 and X2, and the other model represents the joined segment X = {X1, X2}. The difference between these two modeling approaches is given by (Chen et al., 1998)

∆BIC(X1, X2) = (1/2)(M12 log |Σ12| − M1 log |Σ1| − M2 log |Σ2|) − (1/2)λ(d + (1/2)d(d + 1)) log M12,          (3)

where Σ1, Σ2, Σ12 are the covariance matrices of X1, X2 and X, and M1, M2, M12 are their numbers of feature vectors, respectively. λ is a penalty weight and equals 1 in this case, and d gives the dimension of the feature space. According to the BIC theory, when ∆BIC(X1, X2) is negative, the two speech segments can be considered to come from the same speaker.

Now, assume cluster C contains n homogeneous speech segments. Given an incoming speech segment X, we compute their distance as

Dist(X, C) = Σ_{i=1}^{n} wi × ∆BIC(X, Xi),          (4)

where wi is the weighting coefficient of segment Xi, computed as wi = Mi / Σ_{j=1}^{n} Mj. Finally, if Dist(X, C) is less than 0, we merge X into cluster C; otherwise, if no existing cluster is matched, a new cluster is initialized.
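A compact sketch of Equations (3) and (4) and the clustering rule follows; segments are arrays of MFCC frames, assumed long enough (more than d frames) for their covariance matrices to be estimated, and merging into the closest negative-distance cluster is one reasonable reading of the rule.

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """∆BIC between two segments of feature vectors (Equation (3));
    x1 and x2 have shapes (M1, d) and (M2, d)."""
    x12 = np.vstack([x1, x2])
    m1, m2, m12, d = len(x1), len(x2), len(x12), x1.shape[1]
    logdet = lambda x: np.linalg.slogdet(np.cov(x, rowvar=False))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(m12)
    return 0.5 * (m12 * logdet(x12) - m1 * logdet(x1) - m2 * logdet(x2)) - penalty

def cluster_distance(x, cluster_segments):
    """Weighted ∆BIC distance between segment x and a cluster (Equation (4))."""
    sizes = np.array([len(s) for s in cluster_segments], float)
    return sum(w * delta_bic(x, s) for w, s in zip(sizes / sizes.sum(), cluster_segments))

def assign_to_cluster(x, clusters):
    """Merge x into the closest cluster whose distance is negative,
    otherwise start a new cluster."""
    dists = [cluster_distance(x, c) for c in clusters]
    if dists and min(dists) < 0:
        clusters[int(np.argmin(dists))].append(x)
    else:
        clusters.append([x])
    return clusters
```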

2.4. Initial speaker modeling

To bootstrap the identification process, we need initial speaker models, as shown in Figure 1. This is achieved by exploiting the inter-relation between the face and speech cues. Specifically, the following two steps are applied.

First, for each target cast A, we identify its 1-face shot, where a 1-face shot is a speech shot that contains only one face in most of its frames. When multiple 1-face shots are identified for cast A, we choose the one having the highest face recognition rate and the longest shot length. Moreover, to make sure that the selected 1-face shot only contains A's speech, we can return it to the user for further verification.


Figure 5. A state transition diagram for speech-silence segmentation, where T stands for the derived adaptive threshold, E denotes the frame energy, Count is a frame counter and L indicates the minimum speech/silence segment length.

Next, we collect A's speech segments from its 1-face shot as described in Section 2.3.1, and build its initial model. Currently, Gaussian Mixture Models (GMM), which will be introduced in the next section, are employed to model speakers. Note that at this stage, each model only contains one Gaussian component, with its mean and covariance computed as the global ones due to the limited training data.
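For concreteness, a minimal sketch of such an initial one-component model is shown below, using a small dictionary layout (weight, mean, covariance, frame count) that the later adaptation sketches also assume; this layout is our own convenience, not the paper's data structure.

```python
import numpy as np

def initial_speaker_model(mfcc_frames):
    """Build an initial one-component speaker model from the MFCC frames
    collected in the cast's 1-face shot: a single Gaussian with the global
    mean and covariance, plus the frame count used later for adaptation."""
    x = np.asarray(mfcc_frames, float)          # shape (num_frames, 14)
    return {"components": [{
        "weight": 1.0,
        "mean": x.mean(axis=0),
        "cov": np.cov(x, rowvar=False),
        "count": len(x),
    }]}
```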

2.5. Likelihood-based speaker identification

At this stage, we identify speakers based on pure speech information. Specifically, given a speech signal, we first decompose it into a set of overlapped audio frames; then 14 Mel-frequency cepstral coefficients (MFCC) are extracted from each frame to form an observation sequence X. Next, we calculate the likelihood L(X; Mi) between X and every speaker model Mi based on multivariate analysis. Finally, we obtain a speaker vector v with its ith component indicating the confidence of being target speaker i.

2.5.1. Likelihood calculation

Due to their successful usage in both speech and speaker recognition, MFCC coefficients are chosen as the audio features in this work. Moreover, considering that there are various noises in movie data, we have also performed a cepstral mean normalization on all cepstral coefficients (Young et al., 2000).

To model speakers, we choose to use the GMM (Gaussian Mixture Model), since the Gaussian mixture density can provide a smooth approximation to the underlying long-term sample distribution of a speaker's utterances (Reynolds and Rose, 1995). A GMM model M can be represented by the notation M = {pj, µj, Σj}, j = 1, . . . , m, where m is the total number of components in M, and pj, µj, Σj are the weight, mean vector and covariance matrix of the jth component, respectively.

Now, let Mi be the GMM model corresponding to the ith enrolled speaker, with Mi = {pij, µij, Σij}, and let X be the observation sequence consisting of T cepstral vectors x_t, t = 1, . . . , T. Under the assumption that all observation vectors are independent, the likelihood between X and Mi can be computed as

L(X; Mi) = Σ_{j=1}^{m} pij × L(X; µij, Σij),          (5)

where L(X; µij, Σij) is the likelihood of X belonging to the jth component of Mi. Based on the multivariate analysis in Mardia et al. (1979), the log likelihood ℓ(X; µij, Σij) can be computed as

ℓ(X; µij, Σij) = −(T/2) log |2πΣij| − (T/2) tr(Σij⁻¹ S) − (T/2)(X̄ − µij)′ Σij⁻¹ (X̄ − µij),

where S and X̄ are X's covariance and mean, respectively. Note that when S equals Σij, the above log likelihood becomes the Mahalanobis distance.

Now, based on this identification scheme, given any speech cluster C, we assign it a speaker vector v = [v1, . . . , vN], where vi is a value in [0, 1] which equals the normalized log likelihood value ℓ(X; µij, Σij) and indicates the confidence of being target speaker i.
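The sketch below scores an observation sequence against each enrolled model using the segment-level log likelihood above, combining the per-component terms of Equation (5) with a log-sum-exp, which is a numerically stable reading of the mixture likelihood; it assumes the model layout of the earlier initialization sketch and leaves the final normalization of the scores into [0, 1] to the caller.

```python
import numpy as np
from scipy.special import logsumexp

def segment_log_likelihood(X, mean, cov):
    """Segment-level log likelihood computed from the sample mean and
    covariance of the whole observation sequence X (shape (T, d))."""
    T, _ = X.shape
    S, xbar = np.cov(X, rowvar=False), X.mean(axis=0)
    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(2 * np.pi * cov)
    diff = xbar - mean
    return -0.5 * T * (logdet + np.trace(cov_inv @ S) + diff @ cov_inv @ diff)

def speaker_scores(X, speaker_models):
    """Log of Equation (5) for every enrolled speaker model."""
    scores = []
    for model in speaker_models:
        comp_ll = [np.log(c["weight"]) + segment_log_likelihood(X, c["mean"], c["cov"])
                   for c in model["components"]]
        scores.append(logsumexp(comp_ll))
    return np.array(scores)        # normalize these to obtain the speaker vector v
```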

2.6. Audiovisual integration for speaker identification

In this step, we finalize the speaker identification task for cluster C (in shot S) by integrating the audio and visual cues obtained in Sections 2.2, 2.3 and 2.5. Specifically, given cluster C and all recognized talking face sequences F in shot S, we examine whether there is a temporal overlap between C and any sequence Fi. If yes, and the overlap ratio exceeds a preset threshold, we assign Fi's face vector f to C; otherwise, we set its face vector to null. However, if C overlaps with multiple Fi due to speech clustering or talking face detection errors, we choose the one with the highest overlap ratio. Finally, to accommodate detection errors, we can extend the talking face sequence at both ends by a certain number of frames during the overlap checking. A slightly better performance has been gained from this extra effort.

Now, we determine the speaker's final identity in cluster C as

speaker(C) = arg max_{1≤j≤N} (w1 · f[j] + w2 · v[j]),          (6)

where f and v are C's face and speaker vectors, respectively, and w1 and w2 are two weights that sum to 1.0. Currently we set them to be equal in the experiment. In fact, instead of choosing the top speaker in the above equation, we can also obtain a sorted list of possible speakers for cluster C, which can be used to smooth the identification results over neighboring shots.
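A minimal sketch of this fusion rule is given below; passing None for a missing face cue (no overlapping talking face) is our convention, not the paper's.

```python
import numpy as np

def fuse_av(face_vector, speaker_vector, w1=0.5, w2=0.5):
    """Equation (6): combine the face vector f and speaker vector v (both in
    [0, 1] per target speaker) and return the winner plus the full ranking."""
    v = np.asarray(speaker_vector, float)
    f = np.zeros_like(v) if face_vector is None else np.asarray(face_vector, float)
    ranking = np.argsort(w1 * f + w2 * v)[::-1]    # sorted list, usable for smoothing
    return int(ranking[0]), ranking
```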

The detected speaker's model is then updated using his or her current speech data, as detailed in the next section. One thing worth mentioning here is that if the current shot contains music (i.e., it is a speech-with-music shot), no model adaptation is carried out, since it would otherwise corrupt the model.

2.7. Unsupervised Speaker Model Adaptation

Now, after we identify speaker P for cluster C, we update P's model using C's data in this step. Meanwhile, a background model is either initialized or updated to account for all non-target speakers. Specifically, when there is no a priori background model, we use C's data to initialize it if the minimum of L(C; Mi), i = 1, . . . , N, is less than a preset threshold. Otherwise, if the background model produces the largest likelihood, we denote the identified speaker as "unknown", and use C's data to update the background model.

The following three approaches have been investigated to update the speaker model: average-based model adaptation, MAP-based model adaptation, and Viterbi-based model adaptation.

2.7.1. Average-based model adaptation

With this approach, P's model is updated in the following three steps.

Step 1: Compute the BIC distances between cluster C and each of P's mixture components bi. Denote the component that gives the minimum distance as b0.

Step 2: If the minimum distance is less than an empirically determined threshold, we consider C to be acoustically close to b0, and use C's data to update this component. Specifically, let N(µ1, Σ1) and N(µ2, Σ2) be C's and b0's Gaussian models, respectively; we then update b0's mean and covariance as (Mokbel, 2001)

µ2′ = (N1/N12) µ1 + (N2/N12) µ2,          (7)

Σ2′ = (N1/N12) Σ1 + (N2/N12) Σ2 + (N1N2/N12²) (µ1 − µ2)(µ1 − µ2)^T,          (8)

where N1 and N2 are the numbers of feature vectors in C and b0, respectively, and N12 = N1 + N2. In practice, because the mean can be easily biased by different environment conditions, the third term in Equation (8) is actually discarded in our implementation, which results in a slightly better performance.

Otherwise, if the minimum distance is larger than the threshold, we initialize a new mixture component for P, with its mean and covariance equal to µ1 and Σ1. However, once the total number of P's components reaches a certain value, which is currently set to 32, no more components are initialized and only component adaptation is allowed. This is adopted to avoid having too many Gaussian components in one model.

Step 3: Update the weight pi for each of P's mixture components. Specifically, pi is proportional to the number of feature vectors contributing to component bi.
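The sketch below implements Steps 2 and 3 for one matched component, following Equations (7) and (8) with the cross term dropped as in our implementation; it assumes the model layout of the earlier initialization sketch and omits the new-component creation and the 32-component cap for brevity.

```python
import numpy as np

def average_adapt(model, cluster_mean, cluster_cov, n1, comp_index,
                  drop_cross_term=True):
    """Update the closest component (index found via ∆BIC in Step 1) with a
    cluster's statistics using Equations (7) and (8), then refresh weights."""
    comp = model["components"][comp_index]
    n2 = comp["count"]
    n12 = n1 + n2
    old_mean = comp["mean"]
    comp["mean"] = (n1 / n12) * cluster_mean + (n2 / n12) * old_mean        # Eq. (7)
    new_cov = (n1 / n12) * cluster_cov + (n2 / n12) * comp["cov"]           # Eq. (8)
    if not drop_cross_term:                     # third term, discarded in practice
        diff = (cluster_mean - old_mean).reshape(-1, 1)
        new_cov += (n1 * n2 / n12 ** 2) * (diff @ diff.T)
    comp["cov"], comp["count"] = new_cov, n12
    # Step 3: weights proportional to the per-component frame counts.
    total = sum(c["count"] for c in model["components"])
    for c in model["components"]:
        c["weight"] = c["count"] / total
    return model
```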

2.7.2. MAP-based model adaptation

MAP adaptation has been widely and successfully used in speech recognition, yet it has not been well explored in speaker recognition. In this work, due to the limited speech data, only the Gaussian means are updated. Specifically, given P's model Mp, we update its component bi's mean via

µi′ = (Li/(Li + τ)) µ + (τ/(Li + τ)) µi,          (9)

where τ defines the "adaptation speed" and is currently set to 10.0. Li gives the occupation likelihood of the adaptation data with respect to component bi, and is defined as

Li = Σ_{t=1}^{T} p(i | x_t, Mp),          (10)

where p(i | x_t, Mp) is the a posteriori probability of x_t belonging to bi, computed as

p(i | x_t, Mp) = pi bi(x_t) / Σ_{k=1}^{m} pk bk(x_t).          (11)

Finally, µ in Equation (9) gives the mean of the observed adaptation data, and is defined as

µ = Σ_{t=1}^{T} p(i | x_t, Mp) x_t / Σ_{t=1}^{T} p(i | x_t, Mp).          (12)

Unlike the previous method, this MAP adaptation is applied to every component of P, based on the principle that every feature vector has a certain probability of belonging to every component. Thus, MAP adaptation provides a soft decision on which feature vector belongs to which component.

Now that we can no longer say which feature vector occupies which component, we must define a new way to update the number of feature vectors belonging to each component. Specifically, if cluster C contains M frames, then the number of bi's feature vectors is increased by M × Li / Σ_{j=1}^{m} Lj.
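A sketch of this MAP update is given below, covering Equations (9)-(12) and the soft frame-count rule; it reuses the model layout of the earlier sketches, and refreshing the mixture weights from the updated counts is our assumption about how those counts are used.

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt(model, X, tau=10.0):
    """MAP-adapt all Gaussian means of a speaker model with cluster data X
    (shape (M, d)), following Equations (9)-(12)."""
    comps = model["components"]
    # Per-frame weighted component densities p_i * b_i(x_t), shape (m, M).
    dens = np.array([c["weight"] * multivariate_normal.pdf(X, c["mean"], c["cov"])
                     for c in comps])
    post = dens / np.maximum(dens.sum(axis=0, keepdims=True), 1e-300)   # Eq. (11)
    L = post.sum(axis=1)                                                # Eq. (10)
    for i, c in enumerate(comps):
        if L[i] <= 0:
            continue
        mu_bar = (post[i][:, None] * X).sum(axis=0) / L[i]              # Eq. (12)
        c["mean"] = (L[i] / (L[i] + tau)) * mu_bar + (tau / (L[i] + tau)) * c["mean"]  # Eq. (9)
        c["count"] += len(X) * L[i] / L.sum()                           # soft count update
    total = sum(c["count"] for c in comps)
    for c in comps:                                                     # refresh weights
        c["weight"] = c["count"] / total
    return model
```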

2.7.3. Viterbi-based Model Adaptation

Widely used in speech recognition, the Viterbi algorithm performs an alignment of training observations to the mixture component that gives the highest probability within a certain state (Young et al., 2000). As a result, every observation is associated with one single mixture component. This is the key concept that we employ in this approach.

Similar to the MAP-based approach, this approach also allows different feature vectors to belong to different components. Nevertheless, while the MAP approach gives a soft decision, this approach implies a hard one, i.e., any particular feature vector x_t either occupies component bi or does not. Therefore, the probability function p(i | x_t, Mp) in Equation (10) is now replaced by an indicator function which is either 0 or 1. Thus, given any feature vector x_t, the mixture component it occupies is determined by

m0 = arg max_{1≤i≤m} p(i | x_t, Mp).          (13)


Finally, we use Equations (7) and (8) to update P's components after assigning every feature vector to its component. Clearly, this approach is a compromise between the previous two methods.
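The sketch below follows this recipe: each frame is hard-assigned to its most likely component via Equation (13), and every occupied component is then updated with Equations (7) and (8) (cross term dropped). It assumes the same model layout as the earlier sketches and skips components that receive fewer than two frames, since a covariance cannot be estimated from them.

```python
import numpy as np
from scipy.stats import multivariate_normal

def viterbi_adapt(model, X):
    """Viterbi-style adaptation with cluster data X of shape (M, d)."""
    comps = model["components"]
    dens = np.array([c["weight"] * multivariate_normal.pdf(X, c["mean"], c["cov"])
                     for c in comps])                   # shape (m, M)
    owner = dens.argmax(axis=0)                         # m0 for every frame, Eq. (13)
    for i, c in enumerate(comps):
        frames = X[owner == i]
        n1, n2 = len(frames), c["count"]
        if n1 < 2:                                      # too few frames for a covariance
            continue
        n12 = n1 + n2
        c["mean"] = (n1 / n12) * frames.mean(axis=0) + (n2 / n12) * c["mean"]         # Eq. (7)
        c["cov"] = (n1 / n12) * np.cov(frames, rowvar=False) + (n2 / n12) * c["cov"]  # Eq. (8)
        c["count"] = n12
    total = sum(c["count"] for c in comps)
    for c in comps:                                     # refresh mixture weights
        c["weight"] = c["count"] / total
    return model
```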

To summarize, based on the proposed model adaptation approaches, a speaker model grows from 1 Gaussian mixture component up to 32 components as we go through the entire movie sequence. Strictly speaking, the GMMs generated in this way are not the same as original GMMs, and they are also less accurate than models trained using the EM (Expectation Maximization) algorithm. However, compared to EM, this approach is computationally simpler. Moreover, since no iteration is needed, it can better meet the real-time processing goal.

3. Experimental results

To evaluate the performance of the proposed adaptive speaker identification system, studies have been carried out on two movies, each of which is approximately one hour long. The first movie is a comedic drama ("When Harry Met Sally") with many conversation scenes, while the second one is a tragic romance ("The Legend of the Fall") with fewer dialogs but more background music.

3.1. Results for Movie 1

Three characters of interest were chosen for Movie 1, and a total of 952 speech clusters were generated. An average clustering purity of 90% was achieved, where purity is defined as the ratio between the number of segments from the dominant speaker and the total number of segments in a cluster.

Regarding the talking face detection, we achieved 83% precision and 88% recall rates on 425 detected face sequences. However, the ratio of frames containing talking faces over the total number of video frames is as low as 11.5%. This is because movie casts are constantly moving, which makes it difficult to detect their faces.

The identification results for all obtained speech clusters are reported in the form of a confusion matrix, as shown in Table 1. The three speakers are indexed by A, B, C, and their corresponding movie characters are denoted by A', B' and C'. "Unknown" is used for all non-target speakers. The number in each grid, say grid (A', B), indicates the number of speech segments where character A' is talking yet actor B is identified. Obviously, the larger the numbers on the diagonal, the better the performance. Three parameters, namely false acceptance (FA), false rejection (FR) and identification accuracy (IA), are calculated to evaluate the system performance. Particularly, for each cast or character, we have

FR = (sum of off-diagonal numbers in the row) / (sum of all numbers in the row),

FA = (sum of off-diagonal numbers in the column) / (sum of all numbers in the column),

IA = 1 − FR.
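These three metrics follow directly from a confusion matrix; a small sketch is shown below, where the matrix is passed with true characters as rows and identified actors as columns (including an "unknown" row/column if desired) and the metrics are read off for the target speakers.

```python
import numpy as np

def fr_fa_ia(confusion):
    """Per-speaker FR, FA and IA from a square confusion matrix whose rows are
    true characters and whose columns are identified actors."""
    confusion = np.asarray(confusion, float)
    diag = np.diag(confusion)
    fr = (confusion.sum(axis=1) - diag) / confusion.sum(axis=1)   # false rejection
    fa = (confusion.sum(axis=0) - diag) / confusion.sum(axis=0)   # false acceptance
    return fr, fa, 1.0 - fr                                       # IA = 1 - FR

# Usage: fr, fa, ia = fr_fa_ia(conf_matrix)
```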

Table 1(a) gives the identification result when the average-based model adaptation is applied. An average of 75.3% IA and 22.3% FA are observed. The result obtained from the MAP-based approach is given in Table 1(b), where we have an average 78.6% IA and 21% FA. This result is slightly better than that in (a), yet at the cost of a higher computational complexity. The performance improvement from applying MAP adaptation to speaker recognition has also been reported by Ahn et al. (2000), where speaker verification was carried out on a Korean speech database.

Table 1(c) shows the result for the Viterbi-based approach. As we can see, this table presents the best performance, with an average 82% IA and 20% FA. The fact that this approach outperforms the MAP approach may imply that, for speaker identification, a hard decision is good enough.

By carefully studying the results, we found two major factors that degrade the system performance: (a) imperfect speech segmentation and clustering, and (b) inaccurate facial analysis results. Due to the various sounds/noises existing in movies, it is extremely difficult to achieve perfect speech segmentation and clustering results. Besides, incorrect facial data can result in mouth detection and tracking errors, which further affect the identification accuracy.


Table 1. Adaptive speaker identification results for Movie 1 using: (a) the average-based, (b) the MAP-based, and (c) the Viterbi-based model adaptation approaches.

(a)
       A     B     C     Ukn   FR    IA
A'     228   42    6     32    26%   74%
B'     37    281   35    24    25%   75%
C'     10    9     115   16    23%   77%
Ukn    10    15    4     88
FA     20%   19%   28%

(b)
       A     B     C     Ukn   FR    IA
A'     239   25    22    22    22%   78%
B'     59    302   5     11    20%   80%
C'     10    8     117   15    22%   78%
Ukn    19    16    7     75
FA     27%   14%   22%

(c)
       A     B     C     Ukn   FR    IA
A'     246   29    13    20    20%   80%
B'     41    317   13    6     16%   84%
C'     10    10    123   7     18%   82%
Ukn    18    22    14    63
FA     22%   14%   24%

Moreover, to examine the robustness of the three sets of speaker models (denoted by AVG, MAP and VTB) obtained from the three adaptation processes, we carried out a supervised speaker identification based on these models. The identification results are shown in Table 2, and a slightly degraded system performance is observed. This is because, when generating these models during the adaptation process, we have gradually adapted them to the later part of the movie data. Thus, when we apply them in supervised speaker identification, they may not model speakers well for the entire movie. Nevertheless, this table still presents acceptable results, especially for the Viterbi-based approach, which gives 80% IA and 23% FA. These results are also comparable to those reported by other supervised identification work in an adverse audio environment (Chagnolleau et al., 1999), (Yu and Gish, 1993).

Table 2. Supervised speaker identification results.

                   IA                   FA
Model set     A'    B'    C'       A     B     C
AVG           68%   74%   70%      23%   21%   33%
MAP           71%   81%   72%      25%   21%   26%
VTB           74%   90%   76%      24%   17%   28%

To determine the upper limit of the number of mixture components in each speaker model, we examined the average identification accuracy with 32 and 64 components for all three adaptation methods and plotted them in Figure 6(a). As shown, except for the average-based method where similar performance is observed, the use of 32 Gaussian mixture components produces better performance.

Finally, the average identification accuracy obtained with and without face cues is compared in Figure 6(b). Clearly, without the assistance of the face cue, the system performance degrades significantly, especially for the average-based adaptation approach. This indicates that the face cue plays an important role in model adaptation.

3.2. Results for Movie 2

We also tested our algorithms on Movie 2, which is a tragic romance with fewer dialogs. Four casts were chosen as target speakers, and the speech clustering process resulted in 738 speech clusters. An average 83% clustering purity was achieved.

As for the talking face detection, similar precision and recall rates were achieved. However, the percentage of frames containing talking faces is only around 3.6% in this movie. Therefore, compared to Movie 1, this time we got less help from the face cue.

The identification results for all three adaptation approaches are tabulated in Table 3. As shown, they have yielded similar results. However, the system performance has significantly degraded from that in Table 1 for the following reasons: 1) the fewer dialogs in Movie 2 have made identification errors more costly, since the target speakers now have much less speech data; 2) the frequent use of background music in this movie poses a big challenge to our speech segmentation and clustering scheme; and 3) the casts' strong emotions have resulted in wide variations in their talking rates, volumes and other related acoustic characteristics, which also challenges the current system.

Figure 6. Identification accuracy comparison for the average-based, the MAP-based, and the Viterbi-based approaches with: (a) 32-component vs. 64-component speaker models, and (b) using vs. without using face cues.

In order to find the optimal number of Gaussian mixture components for the initial speaker models, we examined the average identification accuracy with 1, 2 and 4 components for all three adaptation methods and plotted them in Figure 7(a). As shown, for all three cases, the use of 1 component gives the best performance. This means that when the amount of training data is very limited, e.g., the speech collected from one single shot, the use of multiple components tends to degrade performance.

Another experiment was carried out to examine the performance change when the speaker models were updated with different amounts of training data. Specifically, we measured the identification accuracy when the first 0%, 10%, 30%, 60%, 90%, and 100% of the entire movie data were used to update the models. The system performance evolution for the average-based, MAP-based and Viterbi-based approaches is plotted in Figure 7(b) using circle-, triangle- and square-marked curves, respectively.

We see from this figure that all curves show a continuous performance improvement as the training data volume increases. This indicates the effectiveness of the proposed adaptive identification scheme. Moreover, we observe that there is a significant performance increase when the data amount rises from 10% to 30%, and from 30% to 60%, for all three cases. However, the identification accuracy remains relatively stable when the data amount increases from 90% to 100%. Similarly, the performance gain is relatively minor when the speaker models are updated using the first 10% of data, except for the Viterbi-based approach. This is because the initial speaker models are trained using data collected from randomly distributed shots, and thus they may not work well at the very beginning of the movie.

Table 3. Adaptive speaker identification results for Movie 2 using the average-based, the MAP-based, and the Viterbi-based model adaptation approaches.

                         IA                            FA
Method          A'    B'    C'    D'          A     B     C     D
AVG-based       73%   80%   79%   64%         27%   43%   26%   13%
MAP-based       66%   78%   80%   70%         30%   34%   19%   28%
VTB-based       63%   82%   79%   65%         26%   36%   28%   32%

Figure 7. Comparison of identification accuracy for the average-based, the MAP-based, and the Viterbi-based approaches with: (a) 1, 2 and 4 components in the initial speaker models, and (b) different amounts of model training data.

Figure 8 gives a detailed description of a speaker identification example. Specifically, the upper part shows the waveform of an audio signal recorded from a speech shot where two speakers take turns talking. The superimposed pulse curve illustrates the speech-silence separation result, where all detected speech segments are bounded by the passbands of the curve. The speaker identity for each speech segment is given right below this sub-figure, where speakers A and B are represented by dark- and light-colored blocks, respectively. The likelihood-based speaker identification result is given right below the ground truth, where only the top speaker's identity is shown. Besides, since two of the speech segments are too short (indicated by the circles) for speaker identification, we disregard them in later processes. As shown, there are two false alarms in this result, where B is falsely recognized as A twice. The talking face detection result is shown in the next sub-figure, including both talking and non-talking cases. Finally, the last sub-figure shows the ultimate identification result obtained by integrating both speech and face cues, as discussed in Section 2.6. As shown, although the first error still exists because the face cue cannot offer help, the second error has been corrected.

4. CONCLUSION AND FUTURE WORK

In this research, an adaptive speaker identification system was proposed which recognizes speakers in a real-time fashion for movie content indexing and annotation purposes. Both audio and visual cues were exploited, where the audio source was analyzed to recognize speakers using a likelihood-based approach, and the visual source was parsed to recognize talking faces using face detection/recognition and mouth tracking techniques. Moreover, to better capture speakers' voice variations along time, we update their models on the fly. Extensive experiments were carried out on two test movies of different genres, and encouraging results have been achieved.

As future work, we plan to continue our research on the face detection and tracking module, since the current face detector has difficulty locating non-upright faces, which turn out to be very common in feature films. Moreover, the performance of the speech clustering module needs to be further improved.

References

Chagnolleau, I., Rosenberg, A., Parthasarathy, S., Detection of target speakers in audio databases, ICASSP'99, Phoenix, 1999.

Rosenberg, A., Chagnolleau, I., Parthasarathy, S., Huang, Q., Speaker detection in broadcast speech databases, ICSLP'98, 1339-1342, Sydney, Australia, 1998.

Johnson, S., Who spoke when? - Automatic segmentation and clustering for determining speaker turns, Eurospeech'99, 1999.

Yu, G., Gish, H., Identification of speakers engaged in dialog, ICASSP'93, 383-386, 1993.

Tsekeridou, S., Pitas, I., Content-based video parsing and indexing based on audio-visual interaction, IEEE Transactions on Circuits and Systems for Video Technology, 11(4), 522-535, 2001.

Li, D., Wei, G., Sethi, I., Dimitrova, N., Person identification in TV programs, Journal of Electronic Imaging, 10(4), 930-938, 2001.

Li, Y., Narayanan, S., Kuo, C., Identification of speakers in movie dialogs using audiovisual cues, ICASSP'02, Orlando, May 2002.

Li, Y., Kuo, C., Real-time segmentation and annotation of MPEG video based on multimodal content analysis I & II, Technical Report, University of Southern California, 2000.

Zhang, T., Kuo, C., Audio-guided audiovisual data segmentation, indexing and retrieval, Proc. of SPIE, 3656, 316-327, 1999.

HP Labs, Computational Video Group, The HP face detection and recognition library, User's Guide and Reference Manual, Version 2.2, December 1998.

Sobottka, K., Pitas, I., A novel method for automatic face segmentation, facial feature extraction and tracking, Image Communication, 12(3), 263-281, 1998.

Chen, S., Gopalakrishnan, P., Speaker, environment and channel change detection and clustering via the Bayesian information criterion, Proc. of DARPA Broadcast News Transcription and Understanding Workshop, 1998.

Siegler, M., Jain, U., Raj, B., Stern, R., Automatic segmentation, classification, and clustering of broadcast news, Proc. of Speech Recognition Workshop, Chantilly, Virginia, February 1997.

Mori, K., Nakagawa, S., Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition, ICASSP'01, 2001.

Reynolds, D., Rose, R., Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Transactions on Speech and Audio Processing, 3(1), 72-83, 1995.

Mardia, K., Kent, J., Bibby, J., Multivariate Analysis, Academic Press, San Diego, 1979.

Mokbel, C., Online adaptation of HMMs to real-life conditions: A unified framework, IEEE Transactions on Speech and Audio Processing, 9(4), 342-357, May 2001.

Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P., HTK Book, Version 3.0, downloaded from http://htk.eng.cam.ac.uk/index.shtml, July 2000.

Ahn, S., Kang, S., Ko, H., Effective speaker adaptations for speaker verification, ICASSP'00, 2, 1081-1084, 2000.


Figure 8. A detailed description of a speaker identification example. (Sub-figures, top to bottom: speech-silence separation using the adaptive silence detector; speakers (ground truth); likelihood-based speaker identification result; talking face detection result; final speaker identification result.)