[IEEE 2012 International Conference on Signal Processing and Communications (SPCOM) - Bangalore,...

Robust Two Dimensional Source Localization using

the MUSIC-Group Delay Spectrum

Ardhendu Tripathy, Lalan Kumar, and Rajesh M Hegde

Dept. of Electrical Engg.

Indian Institute of Technology

Kanpur, 208016, India

Email: ardhendu,[email protected]

Abstract—Subspace-based methods require a large number of sensors for localization of closely spaced sources since the spectral magnitude of MUltiple SIgnal Classification (MUSIC) is used. However, the MUSIC-Group delay (MUSIC-GD) method has been used earlier to resolve closely spaced sources with a limited number of sensors. In this work, the MUSIC-GD method is used in high resolution azimuth and elevation estimation of spatially close sources under reverberant environments over a planar array. The efficiency of the MUSIC-GD method in effectively resolving closely spaced sources, even when the noise eigenvalues change considerably under reverberation, is described and illustrated. Localization error analysis is performed on the proposed method and its performance is illustrated using two dimensional scatter plots. Cramer-Rao lower bound (CRB) analysis is also performed and the CRB is compared with the Root Mean Square Error (RMSE) of the proposed method. Large vocabulary speaker dependent speech recognition experiments are conducted on sentences from the TIMIT database acquired over a planar microphone array. The proposed MUSIC-GD method indicates reasonable improvements in terms of localization and the Cramer-Rao lower bound error analysis. A reasonable reduction is also observed in terms of word error rate (WER) from the experiments conducted on distant speech recognition.

I. INTRODUCTION

Conventional approaches to acquiring clean speech from

distant microphones include microphone array based beam-

forming techniques. It is also shown in earlier work that speech

recognition systems are sensitive to speaker location errors.

Direction of arrival (DOA) estimation techniques that can re-

solve spatially contiguous speech sources therefore assume im-

portance in speech recognition on speech acquired from distant

microphones [1]. Both time and frequency domain approaches

have been used for DOA estimation using microphone arrays

[2], [3]. Subspace-based methods for DOA estimation based

on the spectral magnitude of MUSIC require a large number

of sensors and are prone to errors under reverberant conditions

[4]. However, these methods are computationally efficient.

The phase information in the MUSIC spectrum has not been

widely used in DOA estimation. In [5], a spatial spectral

analysis method based on the MUSIC-Group Delay (MUSIC-

GD) spectrum has been proposed. It has been illustrated

that the MUSIC-GD spectrum resolves spatially contiguous

sources better than the MUSIC-Magnitude (MM) spectrum

[6], using a minimal number of sensors. In [7], the superior

performance of the MUSIC-GD has been demonstrated for

2-D source localization over circular arrays. In this paper,

the significance of the MUSIC-GD method for two-dimensional

source localization under reverberant conditions is discussed.

First, we describe the MUSIC-GD method for estimation of

both azimuth and elevation angles of arrival using a planar

array. We then describe the performance of the group-delay

method under different types of errors [8] in reverberant envi-

ronments. Simulation results on DOA estimation are compared

using spectral plots and root mean square error plots [9], [10].

Experiments on distant speech recognition over a circular array

[11], [12], are conducted using sentences from the TIMIT

database [13].

II. MUSIC-GD FOR SPEECH SOURCE LOCALIZATION IN

REVERBERANT ENVIRONMENT

In [7], group-delay based methods were used to estimate

the DOA of an acoustic source with a circular microphone

array. These methods were developed with the assumption that

there was no multipath in the received signal. The effects of

multipath are encountered in a received signal when the source

signal reflects off of surrounding objects and gets added to

the direct path signal with a delay. The larger the number

of surrounding objects, the more reflected signals are added

to the direct path signal. This reverberation gains importance

in a meeting room environment, and makes DOA estima-

tion difficult. In subspace methods like MUSIC, the noise

eigenvalues of the correlation matrix take very small values

when compared to the signal eigenvalues. However, the noise

eigenvalues will change considerably under reverberation as

the room impulse response will also make a contribution.

In general, the output of an array of microphones is given

by the matrix,

$$X = S A + V \tag{1}$$

where S is a matrix of signal values at the reference micro-

phone, V is the matrix of noise values, and A is called the

array manifold or the steering vector matrix. The reverberant

version of X can be obtained by convolving it with the room

impulse response h as

$$Y = X * h \tag{2}$$

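978-1-4673-2014-6/12/$31.00 ©2012 IEEE

Hence, assuming the signal and the noise are uncorrelated, the reverberant signal correlation matrix can on simplification be expressed as

$$
\begin{aligned}
R &= E[Y Y^H] \\
  &= E[A (S * h)(S * h)^H A^H] + E[(V * h)(V * h)^H] \\
  &= A R_{S*h} A^H + R_{V*h}
\end{aligned}
\tag{3}
$$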

Note that the noise eigenvalues of the correlation matrix R will not be small under reverberation and sensor imperfections. This is because the room impulse response contributes significantly to the computation of the matrix R, entering both terms of Equation 3. This degrades the performance of subspace methods, especially MUSIC, in DOA estimation under reverberation.
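The step from Equation 2 to Equation 3 rests on the linearity of convolution. The NumPy sketch below checks this numerically. It assumes the array output is stored as an M×N matrix (sensors by samples), so the model is written as `A @ S + V`. The manifold `A`, the impulse response `h`, and all sizes are illustrative placeholders, not values from the paper.

```python
import numpy as np

def convolve_rows(X, h):
    """Convolve every row (channel) of X with h, keeping the first N samples."""
    n = X.shape[1]
    return np.stack([np.convolve(row, h)[:n] for row in X])

rng = np.random.default_rng(0)
M, K, N = 6, 2, 256  # sensors, sources, samples (illustrative)

# Placeholder unit-modulus manifold; a real one would follow Eq. (5).
A = np.exp(1j * rng.uniform(0, 2 * np.pi, (M, K)))
S = rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))
V = 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))

# Toy impulse response: direct path followed by two decaying echoes.
h = np.array([1.0, 0.0, 0.5, 0.0, 0.25])

X = A @ S + V                      # anechoic array output
Y = convolve_rows(X, h)            # reverberant output, Eq. (2)

# Linearity: the same output is obtained by filtering signal and noise separately.
Y_split = A @ convolve_rows(S, h) + convolve_rows(V, h)

# Sample estimate of the reverberant correlation matrix of Eq. (3).
R = Y @ Y.conj().T / N
```

Because `h` acts on both the signal and the noise term, its energy appears in every eigenvalue of `R`, which is the effect the paragraph above describes.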

For a uniform circular array (UCA), with M number of

microphones and K number of sources (M > K), Equation

1 can be written [2] as,

$$X = S A(\phi, \theta) + V \tag{4}$$

where θ is the azimuth angle and φ is the elevation angle for

a source. The array manifold consisting of time delays [11],

is given by

$$A = \left[\, e^{j\omega\tau_1} \;\; e^{j\omega\tau_2} \;\; \cdots \;\; e^{j\omega\tau_M} \,\right]^T \tag{5}$$

The time delay τi at the ith microphone with respect to the

reference microphone manifests as a function of both θ and φ

and it is given by [2]

$$\tau_i = \frac{R_i \cos(\theta - \alpha_i) \sin\phi}{v} \tag{6}$$

where Ri and αi are the radius and azimuth angle of the ith microphone with the center of the circular array as the reference, and v is the speed of sound.
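Equations 5 and 6 translate directly into code. The sketch below is a minimal illustration, not the authors' implementation. The ring radii (2.5 cm and 5 cm, four and eight microphones) are taken from the experimental setup described later in the paper. The angular placement of the microphones, the analysis frequency of 2 kHz and the speed of sound of 343 m/s are assumptions made only for this example.

```python
import numpy as np

def uca_delays(theta, phi, radii, alphas, v=343.0):
    """Eq. (6): tau_i = R_i cos(theta - alpha_i) sin(phi) / v, angles in radians."""
    return radii * np.cos(theta - alphas) * np.sin(phi) / v

def steering_vector(theta, phi, radii, alphas, omega, v=343.0):
    """Eq. (5): a = [exp(j omega tau_1), ..., exp(j omega tau_M)]^T."""
    return np.exp(1j * omega * uca_delays(theta, phi, radii, alphas, v))

# Two concentric rings: 4 microphones at 2.5 cm and 8 microphones at 5 cm.
radii = np.concatenate([np.full(4, 0.025), np.full(8, 0.05)])
# Assumed layout: microphones equally spaced on each ring, starting at 0 rad.
alphas = np.concatenate([2 * np.pi * np.arange(4) / 4,
                         2 * np.pi * np.arange(8) / 8])

omega = 2 * np.pi * 2000.0  # assumed narrowband analysis frequency (rad/s)

# Steering vector for azimuth 47 degrees and elevation 18 degrees.
a = steering_vector(np.deg2rad(47), np.deg2rad(18), radii, alphas, omega)
```

Every entry of `a` has unit modulus, and for φ = 0 all delays vanish, so the steering vector reduces to a vector of ones.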

The conventional MUSIC spectrum herein called the

MUSIC-Magnitude (MM) spectrum is given by [2]

$$P_{MUSIC}(\phi, \theta) = \frac{1}{\sum_{i=1}^{M-K} |a^H(\phi, \theta)\, q_i|^2} = \frac{1}{\|Q_n^H\, a(\phi, \theta)\|^2} \tag{7}$$

where a(φ, θ) is a particular steering vector corresponding to

the direction (φ, θ), which forms the array manifold matrix A,

for that particular source location and Qn is the M × (M − K) matrix of eigenvectors spanning the noise subspace. Since

the eigenvectors of Qn are orthogonal to the signal steering

vectors, the denominator reaches a null value when (φ, θ) is

the signal direction. Hence the MUSIC-Magnitude spectrum

PMUSIC(φ, θ) shows a peak at the DOA represented in this

case by both the elevation and azimuth angles (φ, θ). However,

when the speech sources are spaced very close to each other, the MUSIC-Magnitude spectrum fails to resolve them with a computationally plausible number of sensors. This can be seen in Figure 1, where the MUSIC-Magnitude spectrum has been used to estimate the azimuth and elevation for two sources at (47°,18°) and (51°,37°) at a DRR (direct-to-reverberant energy ratio) of 12 dB. However, the group delay function of the MUSIC spectrum can be used to resolve closely spaced sources using a computationally plausible number of sensors, as shown for a uniform linear array (ULA) in [5]. The phase information in the MUSIC-based noise eigenbeam is utilized herein in the DOA estimation of spatially close sources. In practice, abrupt changes can occur in the phase due to small variations in the signal caused by microphone calibration errors, and these appear as spurious peaks in the group delay. To avoid these peaks in the MUSIC-GD spectrum, a product of the MUSIC spectral magnitude and the group delay of MUSIC can be used so that spurious peaks are subdued and only the peaks corresponding to DOAs are retained.

Fig. 1. Azimuth and Elevation estimation using the MUSIC-Magnitude spectrum for sources at (47°,18°) and (51°,37°) at DRR = 12 dB.

Fig. 2. Azimuth and Elevation estimation using the MUSIC-GD spectrum for sources at (47°,18°) and (51°,37°) at DRR = 12 dB.

The MUSIC-GD spectrum for two dimensional DOA (az-

imuth and elevation) estimation over circular arrays [7], is

defined as follows:

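$$P_{GDM}(\phi, \theta) = \left( \sum_{i=1}^{M-K} \left| \nabla \arg\left( a^H(\phi, \theta)\, q_i \right) \right|^2 \right) \cdot P_{MUSIC}(\phi, \theta) \tag{8}$$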

where ∇ arg indicates the gradient of the argument of (aH(φ, θ).qi). The azimuth and elevation estimation plots for the MUSIC-GD method at a DRR of 12 dB are shown in Figure 2. Improvements in terms of azimuth and elevation estimation can be noted when the MUSIC-GD spectrum is used as compared to the conventional MUSIC-Magnitude spectrum. Two concentric circles of sensors (four in the inner circle and eight in the outer circle) are used in these simulations. When the number of sensors is increased, the MUSIC-Magnitude spectrum may well localize the spatially close sources, albeit at the expense of using a large number of sensors. However, the MUSIC-GD spectrum is able to localize spatially close sources with a reduced number of sensors.
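Equations 7 and 8 can be evaluated on an angular grid as in the following sketch. Several choices here are assumptions rather than details from the paper: the gradient in Equation 8 is approximated by finite differences of the unwrapped phase along the φ and θ axes, an ideal anechoic covariance with two uncorrelated unit-power sources is used instead of reverberant data, and the microphone layout, the 2 kHz frequency, the noise power and the small regularizer `eps` are illustrative.

```python
import numpy as np

V_SOUND = 343.0  # assumed speed of sound (m/s)

def steering(theta, phi, radii, alphas, omega):
    """Steering vector from Eqs. (5)-(6); angles in radians."""
    tau = radii * np.cos(theta - alphas) * np.sin(phi) / V_SOUND
    return np.exp(1j * omega * tau)

def music_spectra(R, K, thetas, phis, radii, alphas, omega, eps=1e-12):
    """Return (MUSIC-Magnitude, MUSIC-GD) spectra on a (phi, theta) grid."""
    M = R.shape[0]
    _, Q = np.linalg.eigh(R)          # eigenvalues in ascending order
    Qn = Q[:, :M - K]                 # noise-subspace eigenvectors
    PH, TH = np.meshgrid(phis, thetas, indexing="ij")
    tau = radii * np.cos(TH[..., None] - alphas) * np.sin(PH[..., None]) / V_SOUND
    A = np.exp(1j * omega * tau)      # shape (len(phis), len(thetas), M)
    Z = A.conj() @ Qn                 # Z[p, t, i] = a^H(phi_p, theta_t) q_i
    p_mm = 1.0 / (np.sum(np.abs(Z) ** 2, axis=-1) + eps)      # Eq. (7)
    phase = np.angle(Z)
    # Finite-difference approximation of the phase gradient (assumption).
    d_phi = np.gradient(np.unwrap(phase, axis=0), phis, axis=0)
    d_theta = np.gradient(np.unwrap(phase, axis=1), thetas, axis=1)
    gd = np.sum(d_phi ** 2 + d_theta ** 2, axis=-1)
    return p_mm, gd * p_mm            # Eq. (8): group delay times magnitude

# Assumed layout: 4 microphones at 2.5 cm and 8 at 5 cm, equally spaced.
radii = np.concatenate([np.full(4, 0.025), np.full(8, 0.05)])
alphas = np.concatenate([2 * np.pi * np.arange(4) / 4,
                         2 * np.pi * np.arange(8) / 8])
omega = 2 * np.pi * 2000.0

# Two sources given as (azimuth, elevation) in degrees.
sources_deg = [(47, 18), (51, 37)]
A_src = np.stack([steering(np.deg2rad(t), np.deg2rad(p), radii, alphas, omega)
                  for t, p in sources_deg], axis=1)
R = A_src @ A_src.conj().T + 0.01 * np.eye(12)   # ideal covariance

thetas = np.deg2rad(np.arange(0, 91))   # azimuth grid, 1 degree steps
phis = np.deg2rad(np.arange(1, 91))     # elevation grid, 1 degree steps
p_mm, p_gd = music_spectra(R, 2, thetas, phis, radii, alphas, omega)
```

With an exact covariance both spectra are dominated by the true source directions, so this example only illustrates the mechanics of the computation. Reproducing the resolution differences reported in Figures 1 and 2 would require reverberant, finite-sample data.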

III. LOCALIZATION ERROR ANALYSIS

Errors in the MUSIC and MUSIC-GD estimates can arise

from many sources like finite sample effects, imprecisely

known noise covariance and a perturbed array manifold. Finite

sample effects occur since a perfect covariance measurement R cannot be obtained. In practice, the sample covariance R̂ is defined by

$$\hat{R} = \frac{1}{N} \sum_{k=1}^{N} x(t_k)\, x^H(t_k) \tag{9}$$
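A minimal NumPy sketch of Equation 9 follows. It assumes the snapshots are stored as the columns of an M×N matrix. The random test data is illustrative only.

```python
import numpy as np

def sample_covariance(X):
    """Eq. (9): (1/N) * sum_k x(t_k) x(t_k)^H, with snapshots as columns of X."""
    n_snapshots = X.shape[1]
    return X @ X.conj().T / n_snapshots

rng = np.random.default_rng(1)
# 4 sensors, 1000 snapshots of unit-variance complex noise per real/imag part.
X = rng.standard_normal((4, 1000)) + 1j * rng.standard_normal((4, 1000))
R_hat = sample_covariance(X)
```

The matrix product computes the same sum as an explicit loop over snapshots, and the result is Hermitian by construction.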

In the present work, we take N (the length of the snapshot) to

be large and thus we can neglect the finite sample effects. In

the following sections, we analyze the error in localization

caused by perturbation errors and compare the root mean

square error caused by reverberation with the Cramer-Rao

lower bound for circular arrays.

A. Perturbation Errors

Here we analyze the effects of inexact sensor position on

group-delay based estimation of DOA (position noise). The

position matrix X is formed from the nominal sensor positions

as follows:

$$X = [\, x_1 \;\; x_2 \;\; \ldots \;\; x_M \,]$$

A steering vector associated with the array defines the

complex array response for a source at DOA (φ, θ), and has

the form:

$$a(\phi, \theta) = a(X, k_i) = \left[\, e^{-j k_i^T x_1} \;\; e^{-j k_i^T x_2} \;\; \ldots \;\; e^{-j k_i^T x_M} \,\right]^T$$

Here xi is the ith sensor position vector and kj the wavevector of the jth signal. The displacement from the nominal position of each sensor is given as:

$$w_i \sim \mathcal{N}\left( 0,\; \sigma^2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right)$$

The position perturbations are assumed to be i.i.d. Gaussian

random variables and are independent of the signals or any

additive noises that may occur at the sensor outputs. In any

DOA estimation snapshot, the perturbations are assumed to be

time-invariant, i.e. the same fixed perturbation is used for t = 1, 2, ..., N. The perturbation matrix W is formed similarly to the position matrix X:

$$W = [\, w_1 \;\; w_2 \;\; \ldots \;\; w_M \,]$$

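The perturbed sensor positions are given by X̃ = X + W. Considering single parameter estimation for each source (e.g. azimuth angle only), the ith steering vector associated with the perturbed sensor positions is ã(θi) = a(X̃, ki). Since its mth entry is $e^{-j k_i^T (x_m + w_m)} = e^{-j k_i^T w_m}\, e^{-j k_i^T x_m}$, it can be written as [14]

$$\tilde{a}(\theta_i) = \Gamma_i\, a(\theta_i), \qquad \Gamma_i = \begin{bmatrix} e^{-j k_i^T w_1} & 0 & \cdots & 0 \\ 0 & e^{-j k_i^T w_2} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & e^{-j k_i^T w_M} \end{bmatrix}$$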
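The factorization of the perturbed steering vector can be verified numerically. The sketch below takes the steering-vector entries as e^{-j k^T x_m}, following the definition of a(X, k) given above, so the diagonal factor works out to e^{-j k^T w_m}. The sensor positions, perturbation level `sigma` and wavevector `k` are illustrative assumptions.

```python
import numpy as np

def steering_from_positions(X, k):
    """a(X, k) with entries exp(-j k^T x_m); X is 2 x M, k has length 2."""
    return np.exp(-1j * (k @ X))

rng = np.random.default_rng(2)
M = 12
X = rng.uniform(-0.05, 0.05, size=(2, M))   # nominal positions in metres
sigma = 0.002                               # assumed perturbation std (metres)
W = sigma * rng.standard_normal((2, M))     # w_m ~ N(0, sigma^2 I), i.i.d.

# Illustrative in-plane wavevector (rad/m) for a single source.
k = 36.6 * np.array([np.cos(0.9), np.sin(0.9)])

a_nominal = steering_from_positions(X, k)
Gamma = np.diag(np.exp(-1j * (k @ W)))      # diagonal perturbation matrix
a_perturbed = steering_from_positions(X + W, k)
```

Here `Gamma @ a_nominal` reproduces `a_perturbed`, so the position errors act as a unit-modulus multiplicative phase error on each sensor.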

The signal model for subspace estimation in a reverberant

environment when the sensor positions are perturbed is

y(t) = A(Θ)s(t) ∗ h(t) + v(t) (10)

Here t = 1, 2, . . . , N .

Fig. 3. Contour plots showing the spectrum in presence of sensor perturbation errors in (a) MUSIC-Magnitude and (b) MUSIC-GD. Both panels plot azimuth angle (θ) against elevation angle (φ) for Source 1 at (θ,φ) = (55,35) and Source 2 at (θ,φ) = (57,37).

The effects of the position perturbation on the array autocorrelation matrix are simulated exactly as described in [8], and the analysis is carried out. The resolution of the MUSIC-Magnitude method is compared to that of the proposed MUSIC-GD method under perturbation errors in Figure 3 (a) and Figure 3 (b), respectively. The plots shown are contour plots for the respective spectra. Note that the spectrum for MUSIC-Magnitude shows a single peak with contours around it, while the spectrum for MUSIC-GD shows two distinct peaks with different contours.

B. Cramer-Rao bound analysis

In this Section, Cramer-Rao bound analysis for the MUSIC-

GD Method is carried out. It is compared with the root mean

square error in DOA estimates for both the MUSIC-GD and

MUSIC-Magnitude methods. The Cramer-Rao inequality for estimating multiple parameters is given as [9], [10]

$$\operatorname{var}(\hat{\theta}_m) \ge [F^{-1}]_{mm}$$

where the mnth element of the Fisher information matrix F is given by

$$F_{mn} = N \operatorname{tr}\left\{ R^{-1} \frac{\partial R}{\partial p_m} R^{-1} \frac{\partial R}{\partial p_n} \right\}$$

N is the number of snapshots. Also,

$$R = E[X(t) X(t)^H] = A R_s A^H + \sigma^2 I$$

Here, Rs is the source correlation matrix, I is the identity

matrix and σ2 denotes the variance of the white Gaussian

noise. For 2-D DOA estimation, the unknown parameter vector

is p = [φ, θ], where φ is the elevation angle and θ is the

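azimuth angle. The Fisher information matrix is given by

$$F = \begin{bmatrix} F_{\theta\theta} & F_{\theta\phi} \\ F_{\phi\theta} & F_{\phi\phi} \end{bmatrix}$$

where

$$
\begin{aligned}
F_{\theta\theta} &= 2N \,\mathrm{Re}\left[ (R_s A^H R^{-1} A R_s) \times (A_\theta P_A^{\perp} R^{-1} A_\theta)^T \right] \\
F_{\theta\phi} &= 2N \,\mathrm{Re}\left[ (R_s A^H R^{-1} A R_s) \times (A_\theta P_A^{\perp} R^{-1} A_\phi)^T \right] \\
F_{\theta\phi} &= F_{\phi\theta} \\
P_A^{\perp} &= I - A (A^H A)^{-1} A^H \\
A_\theta &= \sum_{n=1}^{K} \frac{\partial A}{\partial \theta_n}
\end{aligned}
$$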
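The general trace expression for F_mn given above can be evaluated numerically without the closed-form blocks. The following sketch does so for a single source (K = 1), using central differences for the derivatives of R. The array layout, the 2 kHz frequency, the signal and noise powers, and the step size are assumptions for illustration, and the numbers it produces are not the values plotted in the paper.

```python
import numpy as np

V_SOUND = 343.0  # assumed speed of sound (m/s)

def steering(theta, phi, radii, alphas, omega):
    """Steering vector built from the delays of Eq. (6); angles in radians."""
    tau = radii * np.cos(theta - alphas) * np.sin(phi) / V_SOUND
    return np.exp(1j * omega * tau)

def covariance(p, radii, alphas, omega, sig_pow, noise_pow):
    """R = sig_pow * a a^H + noise_pow * I for one source, p = [phi, theta]."""
    a = steering(p[1], p[0], radii, alphas, omega)
    return sig_pow * np.outer(a, a.conj()) + noise_pow * np.eye(a.size)

def fisher_and_crb(p, n_snap, radii, alphas, omega,
                   sig_pow=1.0, noise_pow=0.01, step=1e-6):
    """F_mn = N tr{R^-1 dR/dp_m R^-1 dR/dp_n}; CRB = diag(F^-1)."""
    p = np.asarray(p, dtype=float)
    args = (radii, alphas, omega, sig_pow, noise_pow)
    R_inv = np.linalg.inv(covariance(p, *args))
    dR = []
    for m in range(p.size):
        e = np.zeros_like(p)
        e[m] = step
        # Central-difference approximation of dR/dp_m.
        dR.append((covariance(p + e, *args) - covariance(p - e, *args)) / (2 * step))
    F = np.empty((p.size, p.size))
    for m in range(p.size):
        for n in range(p.size):
            F[m, n] = n_snap * np.trace(R_inv @ dR[m] @ R_inv @ dR[n]).real
    return F, np.diag(np.linalg.inv(F))

# Assumed layout: 4 microphones at 2.5 cm and 8 at 5 cm, equally spaced.
radii = np.concatenate([np.full(4, 0.025), np.full(8, 0.05)])
alphas = np.concatenate([2 * np.pi * np.arange(4) / 4,
                         2 * np.pi * np.arange(8) / 8])
omega = 2 * np.pi * 2000.0

p_true = np.deg2rad([35.0, 55.0])   # [elevation phi, azimuth theta]
F, crb = fisher_and_crb(p_true, 500, radii, alphas, omega)
F2, crb2 = fisher_and_crb(p_true, 1000, radii, alphas, omega)
```

Since F scales linearly with the number of snapshots N, doubling N halves the bound on the variance, which `crb2` illustrates relative to `crb`.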

The Cramer-Rao bound is plotted for various combinations

of (φ, θ) along with the RMSE values for estimates obtained

by MUSIC-Magnitude and MUSIC-GD in Figure 4. The blue

plane at the bottom is the Cramer-Rao bound for circular

arrays. It can be seen that the performance of both algorithms

is similar.

Fig. 4. Comparison of CRB with RMSE of MUSIC-Magnitude and MUSIC-GD methods. The root mean square error is plotted on a logarithmic scale against azimuth (θ) and elevation (φ) for the CRB, MUSIC-GD (MGD) and MUSIC-Magnitude (MM).

IV. EXPERIMENTS ON DISTANT SPEECH RECOGNITION

AND DOA ESTIMATION

In this Section, the performance of the proposed method (MUSIC-GD) of source localization is evaluated by conducting two experiments. In the first experiment, two closely spaced sources in a reverberant environment are localized with multiple trials and the scatter plots are shown to highlight the accuracy of the proposed method. In the second experiment, distant speech recognition trials in a meeting room environment are conducted over a circular array. In both cases the results are compared to the MUSIC-Magnitude method.

A. Experiments on source localization

In this set of experiments, twelve uniformly spaced microphones are used over two concentric rings. Among these, four are placed in a circular ring of radius 2.5 cm and the rest are placed in a ring of radius 5 cm. The number of signals incident on the microphone array can be varied. An SNR of 20 dB is selected for the experiments. Reverberation is modeled

by generating a room-impulse response and convolving it with

the incident signals. We use the image method to generate

a room impulse response according to the measurement of

the meeting room environment. We can change the DRR by

either changing the distance of the source from the array or

by increasing the number of virtual sources considered in

the image method. A DRR of 9.74 dB has been set for the

localization experiments. Several DOA estimation trials are

then conducted. The estimated DOAs are compared with the

actual DOAs each time. The projections of the unit vectors

along the direction of signal arrival on the two dimensional

plane are plotted as scatter plots.
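The projection used for the scatter plots can be sketched as follows. It assumes the elevation φ is measured from the normal to the array plane, which is consistent with the sin φ factor in Equation 6, so the projected point lies at radius sin φ and angle θ. The function name is hypothetical.

```python
import numpy as np

def doa_to_plane(theta_deg, phi_deg):
    """Project the unit vector toward (azimuth theta, elevation phi) onto the array plane."""
    theta = np.deg2rad(theta_deg)
    phi = np.deg2rad(phi_deg)
    return np.stack([np.sin(phi) * np.cos(theta),
                     np.sin(phi) * np.sin(theta)], axis=-1)

# Actual DOAs of the two sources used in the localization experiment.
xy_true = doa_to_plane([55, 57], [35, 37])
```

Each row of `xy_true` is one point of the scatter plot. Estimated DOAs from repeated trials can be passed through the same function and plotted alongside.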

The two dimensional scatter plots for the estimation of azimuth and elevation angles of two closely spaced sources at (55°,35°) and (57°,37°), over multiple trials in the case of MUSIC-Magnitude and MUSIC-GD, are illustrated in Figure 5. The estimated DOAs and the actual DOAs are marked for both the MUSIC-Magnitude and MUSIC-GD methods. The blue stars denote the actual DOA while the red and black dots show the estimated DOA for the two sources. It can be seen

that the scatter of the estimated DOA is nearer to the actual

DOA for MUSIC-GD. In the case of DOA estimation using

MUSIC-Magnitude, there were several cases when the two

estimates overlapped each other, leading to poor localization

of source DOA. However, the estimates were more separated

in the case of MUSIC-GD.

B. Experiments on Distant speech recognition

Speaker dependent large vocabulary speech recognition ex-

periments are conducted for speech acquired from circular

microphone arrays [1], [3], in a meeting environment. The

setup has four participants and a presenter, all connected to a

close talking microphone. It is also equipped with a circular

microphone array. The experiments are conducted for four

spatially located speakers. The circular array setup is similar

and consistent with the setup used for DOA estimation as

described earlier. The experimental results on distant speech

recognition are presented as % word error rate (WER). The

% WER is calculated as

$$\mathrm{WER} = 100 - \frac{W_n - (W_s + W_d + W_i)}{W_n} \cdot 100$$

where Wn is the total number of words, Ws the total number

of substitutions, Wd the total number of deletions, and Wi the

total number of insertions. To ensure conformity with standard

databases we record sentences generated from words that

constitute the TIMIT database [13]. For each of the following

methods used in the DOA estimation process,

• Close talking microphone (CTM).

• A filter and sum beamformer realized using DOAs esti-

mated from the MUSIC-GD spectrum.

• A filter and sum beamformer realized using DOAs esti-

mated from the MUSIC-Magnitude (MM) spectrum.

Fig. 5. Two dimensional scatter plot for localization of two closely spaced sources, (55°,35°) and (57°,37°), indicated by blue stars using (a) MUSIC-Magnitude method and (b) MUSIC-GD method. Axes in both panels: sine of elevation angle and azimuth angle.

Location   CTM      MUSIC-GD                      MM
                    9.74 dB DRR   4.69 dB DRR     9.74 dB DRR   4.69 dB DRR
1          7.60%    13.71%        21.41%          15.62%        25.32%
2          13.42%   31.28%        33.21%          35.72%        38.13%
3          9.09%    18.21%        27.45%          25.73%        29.40%
4          11.18%   21.38%        29.73%          28.67%        31.23%

TABLE I. DISTANT SPEECH RECOGNITION PERFORMANCE AS WORD ERROR RATE (WER), FOR FOUR SPATIALLY CLOSE SPEAKERS (LOCATION 1-4) IN A MEETING ROOM AT A DRR OF 9.74 DB AND 4.69 DB.

results of speech recognition for speakers at four spatially close locations (Location 1 - 4) using the commercially available Dragon speech recognition SDK are listed as WER in Table I. Note that a wide band version of MUSIC [12] is used in the above experiments. The proposed method based on the MUSIC-GD indicates a reduction in WER when compared to the MUSIC-Magnitude method, ranging from 1.50 to 7.52 percentage points across the locations and DRRs in Table I. The results are also encouraging for a low DRR of 4.69 dB as can be seen from Table I.

V. CONCLUSION

A high resolution source localization method to resolve

closely spaced sources using a limited number of sensors

has been proposed in this work. The MUSIC-Group Delay

method is able to resolve both azimuth and elevation angles

for such closely spaced sources and indicates higher robustness

when compared to conventional subspace methods in terms

of localization error analysis. The proposed method indicates

a reasonable improvement in terms of word error rates for

distant speech recognition. It can therefore be used in an

effective way to reduce word error rates for speech recognition

over microphone arrays which are generally far worse than

those obtained with a close talking microphone. Computing localization error expressions for the proposed method requires complex number analysis, which is currently being explored. The application of the proposed method in meeting room camera systems and simple cooperative robotic applications is also being explored.

ACKNOWLEDGMENT

This work was funded in part by TCS Research Scholarship

Program under project number TCS/CS/20110191.

REFERENCES

[1] M. L. Seltzer, “Bridging the gap: Towards a unified framework for hands-free speech recognition using microphone arrays,” in Hands-Free Speech Communication and Microphone Arrays, 2008. HSCMA 2008, May 2008, pp. 104–107.

[2] H. L. Van Trees, Optimum Array Processing, Wiley-Interscience, 2002.

[3] W. Zhang and B. Rao, “Robust broadband beamformer with diagonally loaded constraint matrix and its application to speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2006, pp. 785–788.

[4] E. Tuncer and B. Friedlander, Classical and Modern Direction-of-Arrival Estimation, Academic Press, 2009.

[5] M. Shukla and R. M. Hegde, “Significance of the MUSIC Group Delay Spectrum in Speech Acquisition from Distant Microphones,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 2010, pp. 2738–2741.

[6] R. Mandala, M. Shukla, and R. Hegde, “Group delay based methods for recognition of distant talking speech,” in Signals, Systems and Computers (ASILOMAR), 2010 Conference Record of the Forty Fourth Asilomar Conference on, Nov. 2010, pp. 1702–1706.

[7] A. Tripathy, L. Kumar, and R. M. Hegde, “Group delay based methods for speech source localization over circular arrays,” in Hands-free Speech Communication and Microphone Arrays (HSCMA), 2011 Joint Workshop on, May 30–June 1, 2011, pp. 64–69.

[8] V. Cevher and J. H. McClellan, “2-D sensor perturbation analysis: equivalence to AWGN on array outputs,” in SAM 2002, Washington, DC, 4–6 August 2002.

[9] P. Stoica and A. Nehorai, “MUSIC, maximum likelihood, and Cramer-Rao bound,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 37, no. 5, pp. 720–741, May 1989.

[10] T. Filik and T. E. Tuncer, “Design and evaluation of V-shaped arrays for 2-D DOA estimation,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, Mar. 31–Apr. 4, 2008, pp. 2477–2480.

[11] M. Brandstein and D. Ward, Eds., Microphone Arrays, Springer Verlag, Berlin, 2001.

[12] T. L. Tung, K. Yao, D. Chen, R. E. Hudson, and C. W. Reed, “Source localization and spatial filtering using wideband MUSIC and maximum power beamforming for multimedia applications,” in Signal Processing Systems, 1999. SiPS 99. 1999 IEEE Workshop on, 1999, pp. 625–634.

[13] J. S. Garofolo, TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, Philadelphia, 1993.

[14] V. Cevher and J. H. McClellan, “2-D sensor perturbation analysis: equivalence to AWGN on array outputs,” in SAM 2002, Washington, DC, 4–6 August 2002.
