[IEEE 2012 International Conference on Signal Processing and Communications (SPCOM) - Bangalore,...
Robust Two Dimensional Source Localization using
the MUSIC-Group Delay Spectrum
Ardhendu Tripathy, Lalan Kumar, and Rajesh M Hegde
Dept. of Electrical Engg.
Indian Institute of Technology
Kanpur, 208016, India
Email: ardhendu,[email protected]
Abstract—Subspace-based methods require a large number of sensors for localization of closely spaced sources since the spectral magnitude of MUltiple SIgnal Classification (MUSIC) is used. However, the MUSIC-Group delay (MUSIC-GD) method has been used earlier to resolve closely spaced sources with a limited number of sensors. In this work, the MUSIC-GD method is used in high resolution azimuth and elevation estimation of spatially close sources under reverberant environments over a planar array. The efficiency of the MUSIC-GD method in effectively resolving closely spaced sources, even when the noise eigenvalues change considerably under reverberation, is described and illustrated. Localization error analysis is performed on the proposed method and its performance is illustrated using two dimensional scatter plots. Cramer-Rao lower bound (CRB) analysis is also performed and the CRB is compared with the Root Mean Square Error (RMSE) of the proposed method. Large vocabulary speaker dependent speech recognition experiments are conducted on sentences from the TIMIT database acquired over a planar microphone array. The proposed MUSIC-GD method indicates reasonable improvements in terms of localization and the Cramer-Rao lower bound error analysis. A reasonable reduction is also observed in terms of word error rate (WER) from the experiments conducted on distant speech recognition.
I. INTRODUCTION
Conventional approaches to acquiring clean speech from
distant microphones include microphone array based beam-
forming techniques. It is also shown in earlier work that speech
recognition systems are sensitive to speaker location errors.
Direction of arrival (DOA) estimation techniques that can re-
solve spatially contiguous speech sources therefore assume im-
portance in speech recognition on speech acquired from distant
microphones [1]. Both time and frequency domain approaches
have been used for DOA estimation using microphone arrays
[2], [3]. Subspace-based methods for DOA estimation based
on the spectral magnitude of MUSIC require a large number
of sensors and are prone to errors under reverberant conditions
[4]. However, these methods are computationally efficient.
The phase information in the MUSIC spectrum has not been
widely used in DOA estimation. In [5], a spatial spectral
analysis method based on the MUSIC-Group Delay (MUSIC-
GD) spectrum has been proposed. It has been illustrated
that the MUSIC-GD spectrum resolves spatially contiguous
sources better than the MUSIC-Magnitude (MM) spectrum
[6], using a minimal number of sensors. In [7], the superior
performance of the MUSIC-GD has been demonstrated for
2-D source localization over circular arrays. In this paper,
the significance of the MUSIC-GD method for two dimensional
source localization under reverberant conditions is discussed.
First, we describe the MUSIC-GD method for estimation of
both azimuth and elevation angles of arrival using a planar
array. We then describe the performance of the group-delay
method under different types of errors [8] in reverberant envi-
ronments. Simulation results on DOA estimation are compared
using spectral plots and root mean square error plots [9], [10].
Experiments on distant speech recognition over a circular array
[11], [12] are conducted using sentences from the TIMIT
database [13].
II. MUSIC-GD FOR SPEECH SOURCE LOCALIZATION IN
REVERBERANT ENVIRONMENT
In [7], group-delay based methods were used to estimate
the DOA of an acoustic source with a circular microphone
array. These methods were developed with the assumption that
there was no multipath in the received signal. The effects of
multipath are encountered in a received signal when the source
signal reflects off of surrounding objects and gets added to
the direct path signal with a delay. The larger the number
of surrounding objects, the more reflected signals are added
to the direct path signal. This reverberation gains importance
in a meeting room environment, and makes DOA estima-
tion difficult. In subspace methods like MUSIC, the noise
eigenvalues of the correlation matrix take very small values
when compared to the signal eigenvalues. However, the noise
eigenvalues will change considerably under reverberation as
the room impulse response will also make a contribution.
In general, the output of an array of microphones is given
by the matrix,
$$X = S A + V \tag{1}$$
where S is a matrix of signal values at the reference micro-
phone, V is the matrix of noise values, and A is called the
array manifold or the steering vector matrix. The reverberant
version of X can be obtained by convolving it with the room
impulse response h as
$$Y = X * h \tag{2}$$
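As an illustration of Equation 2, the following minimal sketch convolves each channel of a toy multichannel signal with a single shared impulse response. The function name `add_reverberation`, the sample values, and the three-tap impulse response are assumptions made for this example only; a measured or image-method room response would normally be much longer and may differ per microphone.

```python
import numpy as np

def add_reverberation(x, h):
    """Convolve every channel (row) of x with the impulse response h (Eq. 2)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    h = np.asarray(h, dtype=float)
    return np.stack([np.convolve(channel, h) for channel in x])

# Toy data (hypothetical): 2 channels, 3 samples each.
x_clean = np.array([[1.0, 2.0, 3.0],
                    [0.0, 1.0, 0.0]])
# Toy impulse response (hypothetical): direct path plus one echo two samples later.
h_room = np.array([1.0, 0.0, 0.5])

y_reverb = add_reverberation(x_clean, h_room)
```

Each output channel is longer than its input by `len(h) - 1` samples, since the echo tail extends past the end of the clean signal.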
978-1-4673-2014-6/12/$31.00 ©2012 IEEE
Hence the reverberant signal correlation matrix can, on simplification, be expressed as
$$\begin{aligned} R &= E[Y Y^H] \\ &= E[A (S * h)(S * h)^H A^H] + E[(V * h)(V * h)^H] \\ &= A R_{S*h} A^H + R_{V*h} \end{aligned} \tag{3}$$
Note that the noise eigenvalues of the correlation matrix R
will not be small under reverberation and sensor imperfections.
This is because the reverberation contributes significantly to
the computation of the matrix R, as can be seen from Equation 3.
This degrades the performance of subspace methods, especially
MUSIC, in DOA estimation under reverberation.
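For reference, the eigenvalue structure that subspace methods rely on in the ideal case can be checked numerically. The sketch below assumes white noise of variance `sigma2`, uncorrelated unit-power sources, and a random complex matrix standing in for the array manifold; all of these values are hypothetical. It does not model the reverberant term of Equation 3, which is what raises the noise eigenvalues in practice.

```python
import numpy as np

M, K = 6, 2          # hypothetical numbers of sensors and sources
sigma2 = 0.01        # hypothetical white-noise variance
rng = np.random.default_rng(0)

# Random complex M x K matrix standing in for the array manifold A.
A = rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))
Rs = np.eye(K)       # uncorrelated, unit-power sources

# Ideal (non-reverberant) correlation matrix: R = A Rs A^H + sigma^2 I.
R = A @ Rs @ A.conj().T + sigma2 * np.eye(M)

eigvals = np.linalg.eigvalsh(R)   # real eigenvalues in ascending order
noise_eigs = eigvals[:M - K]      # the M-K smallest eigenvalues
signal_eigs = eigvals[M - K:]     # the K largest eigenvalues
```

In this ideal setting the M-K smallest eigenvalues all equal the noise variance, which is the gap between signal and noise subspaces that MUSIC exploits.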
For a uniform circular array (UCA), with M number of
microphones and K number of sources (M > K), Equation
1 can be written [2] as,
$$X = S A(\phi, \theta) + V \tag{4}$$
where θ is the azimuth angle and φ is the elevation angle for
a source. The array manifold consisting of time delays [11],
is given by
$$A = [\, e^{j\omega\tau_1} \;\; e^{j\omega\tau_2} \;\; \cdots \;\; e^{j\omega\tau_M} \,]^T \tag{5}$$
The time delay τi at the ith microphone with respect to the
reference microphone manifests as a function of both θ and φ
and it is given by [2]
$$\tau_i = \frac{R_i \cos(\theta - \alpha_i) \sin\phi}{v} \tag{6}$$
where R_i and α_i are the radius and azimuth angle of the
ith microphone, with the center of the circular array as the
reference, and v is the speed of sound.
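Equations 5 and 6 can be sketched in a few lines. The geometry below (four microphones on a 2.5 cm ring and eight on a 5 cm ring) follows the setup described in Section IV-A, but the uniform angular placement within each ring, the value of 343 m/s for v, the 1 kHz analysis frequency, and all function names are assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for v

def ring_geometry(counts=(4, 8), ring_radii=(0.025, 0.05)):
    """Radii R_i and angles alpha_i for microphones on concentric rings."""
    radii = np.concatenate([np.full(n, r) for n, r in zip(counts, ring_radii)])
    alphas = np.concatenate([2 * np.pi * np.arange(n) / n for n in counts])
    return radii, alphas

def time_delays(radii, alphas, phi, theta, v=SPEED_OF_SOUND):
    """Eq. 6: tau_i = R_i cos(theta - alpha_i) sin(phi) / v, angles in radians."""
    return radii * np.cos(theta - alphas) * np.sin(phi) / v

def steering_vector(radii, alphas, phi, theta, freq_hz, v=SPEED_OF_SOUND):
    """Eq. 5: a = [exp(j w tau_1), ..., exp(j w tau_M)]^T."""
    omega = 2 * np.pi * freq_hz
    return np.exp(1j * omega * time_delays(radii, alphas, phi, theta, v))

radii, alphas = ring_geometry()
tau_example = time_delays(radii, alphas, np.deg2rad(35.0), np.deg2rad(55.0))
a_example = steering_vector(radii, alphas, np.deg2rad(35.0), np.deg2rad(55.0), 1000.0)
# With phi = 0 the sin(phi) factor vanishes, so every delay is zero.
a_zero_elevation = steering_vector(radii, alphas, 0.0, np.deg2rad(55.0), 1000.0)
```

Every entry of the steering vector has unit magnitude; only the phases carry the direction information, which is why the delays enter through both θ and φ.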
The conventional MUSIC spectrum herein called the
MUSIC-Magnitude (MM) spectrum is given by [2]
$$P_{MUSIC}(\phi, \theta) = \frac{1}{\sum_{i=1}^{M-K} |a^H(\phi, \theta)\, q_i|^2} = \frac{1}{\| Q_n^H a(\phi, \theta) \|^2} \tag{7}$$
where a(φ, θ) is a particular steering vector corresponding to
the direction (φ, θ), which forms the array manifold matrix A,
for that particular source location and Qn is the M × (M − K)
matrix of eigenvectors spanning the noise subspace. Since
the eigenvectors of Qn are orthogonal to the signal steering
vectors, the denominator reaches a null value when (φ, θ) is
the signal direction. Hence the MUSIC-Magnitude spectrum
PMUSIC(φ, θ) shows a peak at the DOA represented in this
case by both the elevation and azimuth angles (φ, θ). However,
when the speech sources are spaced very close to each other, the
MUSIC-Magnitude spectrum fails to resolve them with a
computationally plausible number of sensors. This can be seen
in Figure 1, where the MUSIC-Magnitude spectrum has been
used to estimate the azimuth and elevation for two sources
at (47°,18°) and (51°,37°) at a DRR (direct-to-reverberant
energy ratio) of 12 dB. However, the group delay function
of the MUSIC spectrum can be used to resolve closely spaced
sources using a computationally plausible number of sensors
for a ULA, as in [5]. The phase information in the MUSIC-based
noise eigenbeam is utilized herein in the DOA estimation of
spatially close sources. In practice, abrupt changes can occur
in the phase due to small variations in the signal caused
by microphone calibration errors, and these produce spurious
peaks in the group delay. To avoid these peaks in
the MUSIC-GD spectrum, a product of the MUSIC spectral
magnitude and the group delay of MUSIC can be used so that
spurious peaks are subdued and only the peaks corresponding
to DOAs are retained.
Fig. 1. Azimuth and elevation estimation using the MUSIC-Magnitude spectrum for sources at (47°,18°) and (51°,37°) at DRR = 12 dB.
Fig. 2. Azimuth and elevation estimation using the MUSIC-GD spectrum for sources at (47°,18°) and (51°,37°) at DRR = 12 dB.
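The MUSIC-Magnitude spectrum of Equation 7 can be sketched as a grid search over elevation and azimuth. The example below is a simplified narrowband illustration, not the wideband implementation used in the experiments: it assumes an ideal single-source covariance, a 2 kHz analysis frequency, 343 m/s for the speed of sound, and uniform microphone placement on two rings. The function names and the small `eps` guard are also assumptions.

```python
import numpy as np

V_SOUND = 343.0  # m/s, assumed

def steering(radii, alphas, phi, theta, freq_hz):
    tau = radii * np.cos(theta - alphas) * np.sin(phi) / V_SOUND   # Eq. 6
    return np.exp(1j * 2 * np.pi * freq_hz * tau)                   # Eq. 5

def music_magnitude(Rxx, n_sources, radii, alphas, phis, thetas, freq_hz, eps=1e-12):
    """Evaluate Eq. 7 on a grid of elevations (rows) and azimuths (columns)."""
    M = Rxx.shape[0]
    _, eigvecs = np.linalg.eigh(Rxx)        # eigenvalues in ascending order
    Qn = eigvecs[:, :M - n_sources]         # noise-subspace eigenvectors
    P = np.empty((len(phis), len(thetas)))
    for i, phi in enumerate(phis):
        for j, theta in enumerate(thetas):
            a = steering(radii, alphas, phi, theta, freq_hz)
            denom = np.linalg.norm(Qn.conj().T @ a) ** 2
            P[i, j] = 1.0 / max(denom, eps)  # eps guards against division by zero
    return P

# Hypothetical setup: 4 + 8 microphones on rings of radius 2.5 cm and 5 cm.
radii = np.r_[np.full(4, 0.025), np.full(8, 0.05)]
alphas = np.r_[2 * np.pi * np.arange(4) / 4, 2 * np.pi * np.arange(8) / 8]
freq_hz = 2000.0
a0 = steering(radii, alphas, np.deg2rad(35.0), np.deg2rad(55.0), freq_hz)
Rxx = np.outer(a0, a0.conj()) + 0.01 * np.eye(12)   # ideal single-source covariance

phis = np.deg2rad(np.arange(20.0, 51.0))      # elevation grid, 1 degree steps
thetas = np.deg2rad(np.arange(40.0, 71.0))    # azimuth grid, 1 degree steps
P = music_magnitude(Rxx, 1, radii, alphas, phis, thetas, freq_hz)
i_pk, j_pk = np.unravel_index(np.argmax(P), P.shape)
phi_hat_deg, theta_hat_deg = np.rad2deg(phis[i_pk]), np.rad2deg(thetas[j_pk])
```

With an exact covariance the peak falls on the true grid point; with a finite number of snapshots or reverberation the peak broadens, which is where closely spaced sources start to merge.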
The MUSIC-GD spectrum for two dimensional DOA (az-
imuth and elevation) estimation over circular arrays [7], is
defined as follows:
$$P_{GDM}(\phi, \theta) = \left( \sum_{i=1}^{M-K} |\nabla \arg(a^H(\phi, \theta)\, q_i)|^2 \right) P_{MUSIC}(\phi, \theta) \tag{8}$$
where ∇ arg denotes the gradient of the argument of
a^H(φ, θ) q_i. The azimuth and elevation estimation plots for
the MUSIC-GD method at a DRR of 12 dB are shown in Figure
2. Improvements in azimuth and elevation estimation
can be noted when the MUSIC-GD spectrum is used instead of
the conventional MUSIC-Magnitude spectrum.
Two concentric circles of sensors (four in the inner circle and
eight in the outer circle) are used in these simulations. When
the number of sensors is increased, the
MUSIC-Magnitude spectrum may well localize the spatially
close sources, albeit at the expense of using a large number of
sensors. The MUSIC-GD spectrum, however, is able to localize
spatially close sources with a reduced number of sensors.
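One possible numerical reading of Equation 8 is sketched below. It approximates the gradient of the phase by finite differences of the unwrapped phase along the elevation and azimuth axes of a grid, and takes the squared gradient magnitude as the sum of the two squared partial derivatives. These discretisation choices, the function names, the `eps` guard, the 2 kHz frequency and the ideal single-source covariance are all assumptions of this example and are not taken from [5] or [7].

```python
import numpy as np

V_SOUND = 343.0  # m/s, assumed

def steering_grid(radii, alphas, phis, thetas, freq_hz):
    """Steering vectors on a (phi, theta) grid, shape (n_phi, n_theta, M)."""
    ph = phis[:, None, None]
    th = thetas[None, :, None]
    tau = radii * np.cos(th - alphas) * np.sin(ph) / V_SOUND
    return np.exp(1j * 2 * np.pi * freq_hz * tau)

def music_gd(Rxx, n_sources, radii, alphas, phis, thetas, freq_hz, eps=1e-12):
    """Return (MUSIC-GD spectrum, MUSIC-Magnitude spectrum) on the grid."""
    M = Rxx.shape[0]
    _, eigvecs = np.linalg.eigh(Rxx)
    Qn = eigvecs[:, :M - n_sources]               # noise subspace
    A = steering_grid(radii, alphas, phis, thetas, freq_hz)
    proj = A.conj() @ Qn                          # a^H q_i, shape (n_phi, n_theta, M-K)
    p_music = 1.0 / np.maximum(np.sum(np.abs(proj) ** 2, axis=-1), eps)
    phase = np.angle(proj)
    # Finite-difference gradient of the unwrapped phase along each angular axis.
    d_phi = np.gradient(np.unwrap(phase, axis=0), phis, axis=0)
    d_theta = np.gradient(np.unwrap(phase, axis=1), thetas, axis=1)
    group_delay = np.sum(d_phi ** 2 + d_theta ** 2, axis=-1)
    return group_delay * p_music, p_music

# Hypothetical setup: 4 + 8 microphones on rings of radius 2.5 cm and 5 cm.
radii = np.r_[np.full(4, 0.025), np.full(8, 0.05)]
alphas = np.r_[2 * np.pi * np.arange(4) / 4, 2 * np.pi * np.arange(8) / 8]
freq_hz = 2000.0
phis = np.deg2rad(np.arange(20.0, 51.0))
thetas = np.deg2rad(np.arange(40.0, 71.0))
a0 = steering_grid(radii, alphas, np.deg2rad([35.0]), np.deg2rad([55.0]), freq_hz)[0, 0]
Rxx = np.outer(a0, a0.conj()) + 0.01 * np.eye(12)   # ideal single-source covariance

P_gd, P_mm = music_gd(Rxx, 1, radii, alphas, phis, thetas, freq_hz)
i_pk, j_pk = np.unravel_index(np.argmax(P_gd), P_gd.shape)
phi_hat_deg, theta_hat_deg = np.rad2deg(phis[i_pk]), np.rad2deg(thetas[j_pk])
```

The phase of a^H q_i flips by roughly π as the scan direction crosses a source, so the gradient term is large there; multiplying by the magnitude spectrum suppresses phase jumps that do not coincide with a null of the denominator in Equation 7. A covariance built from two sources can be passed with `n_sources=2`, although this sketch makes no claim about the resolution obtained.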
III. LOCALIZATION ERROR ANALYSIS
Errors in the MUSIC and MUSIC-GD estimates can arise
from many sources like finite sample effects, imprecisely
known noise covariance and a perturbed array manifold. Finite
sample effects occur since a perfect covariance measurement
R cannot be obtained. In practice, the sample covariance $\hat{R}$ is
defined by
$$\hat{R} = \frac{1}{N} \sum_{k=1}^{N} x(t_k)\, x^H(t_k) \tag{9}$$
In the present work, we take N (the number of snapshots) to
be large, so that the finite sample effects can be neglected. In
the following subsections, we analyze the error in localization
caused by perturbation errors and compare the root mean
square error caused by reverberation with the Cramer-Rao
lower bound for circular arrays.
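The sample covariance of Equation 9 and the effect of a large N can be illustrated as follows. The data here are synthetic unit-variance complex white noise, and the array size, snapshot count and function name are hypothetical choices for the example.

```python
import numpy as np

def sample_covariance(snapshots):
    """Eq. 9: snapshots has shape (M, N), one column x(t_k) per snapshot."""
    X = np.asarray(snapshots)
    N = X.shape[1]
    return (X @ X.conj().T) / N

rng = np.random.default_rng(1)
M, N = 4, 20000   # hypothetical array size and number of snapshots
# Unit-variance circular complex white noise on M sensors.
noise = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
R_hat = sample_covariance(noise)
```

For this synthetic input the true covariance is the identity matrix, and the estimate approaches it as N grows, which is the sense in which finite sample effects are neglected.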
A. Perturbation Errors
Here we analyze the effects of inexact sensor position on
group-delay based estimation of DOA (position noise). The
position matrix X is formed from the nominal sensor positions
as follows:
$$X = [\, x_1 \;\; x_2 \;\; \cdots \;\; x_M \,]$$
A steering vector associated with the array defines the
complex array response for a source at DOA (φ, θ), and has
the form:
$$a(\phi, \theta) = a(X, k_i) = [\, e^{-j k_i^T x_1} \;\; e^{-j k_i^T x_2} \;\; \cdots \;\; e^{-j k_i^T x_M} \,]^T$$
Here x_m is the mth sensor position vector and k_i is the wavevector
of the ith signal. The displacements from the nominal sensor
positions are given for each sensor as:
$$w_m \sim \mathcal{N}\left(0,\; \sigma^2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right)$$
The position perturbations are assumed to be i.i.d. Gaussian
random variables and are independent of the signals or any
additive noises that may occur at the sensor outputs. In any
DOA estimation snapshot, the perturbations are assumed to be
time-invariant, i.e. the same fixed perturbation is used for t = 1, 2, ..., N.
The perturbation matrix W is formed similarly
to the position matrix X:
$$W = [\, w_1 \;\; w_2 \;\; \cdots \;\; w_M \,]$$
The perturbed sensor positions are given by $\tilde{X} = X + W$.
Considering single parameter estimation for each source (e.g.
azimuth angle only), the ith steering vector associated with the
perturbed sensor positions is $\tilde{a}(\theta_i) = a(\tilde{X}, k_i)$, which can be written as
[14]
$$\tilde{a}(\theta_i) = \Gamma_i\, a(\theta_i)$$
$$\Gamma_i = \begin{bmatrix} e^{-j k_i^T w_1} & 0 & \cdots & 0 \\ 0 & e^{-j k_i^T w_2} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & e^{-j k_i^T w_M} \end{bmatrix}$$
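The factorisation of the perturbed steering vector into a diagonal phase matrix times the nominal steering vector can be verified numerically. In the sketch below the diagonal entries of Γ follow the sign convention of the steering vector defined in the code, with entries exp(-j k^T x_m). The nominal positions, the 2 mm position noise, the wavelength, and the direction convention chosen for the wavevector are all hypothetical.

```python
import numpy as np

def steering_from_positions(X, k):
    """a(X, k) with entries exp(-j k^T x_m); X is 2 x M, k is a length-2 wavevector."""
    return np.exp(-1j * (k @ X))

rng = np.random.default_rng(2)
M = 12
wavelength = 0.17                       # metres, hypothetical (about 2 kHz in air)
theta = np.deg2rad(55.0)
# Assumed convention: wavevector pointing along the azimuth direction theta.
k = (2 * np.pi / wavelength) * np.array([np.cos(theta), np.sin(theta)])

X_nominal = rng.uniform(-0.05, 0.05, size=(2, M))   # hypothetical nominal positions
sigma = 0.002                                       # hypothetical 2 mm position noise
W = sigma * rng.standard_normal((2, M))             # columns w_m ~ N(0, sigma^2 I_2)

Gamma = np.diag(np.exp(-1j * (k @ W)))              # diagonal phase-error matrix
a_nominal = steering_from_positions(X_nominal, k)
a_perturbed = steering_from_positions(X_nominal + W, k)
```

Because the positions enter only through the exponent, a position error appears as a unit-magnitude phase factor on each sensor and leaves the steering vector magnitudes unchanged.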
The signal model for subspace estimation in a reverberant
environment when the sensor positions are perturbed is
$$y(t) = A(\Theta)\, s(t) * h(t) + v(t) \tag{10}$$
Here t = 1, 2, ..., N.
[Figure 3: two contour plots over elevation angle (φ) and azimuth angle (θ): (a) "Contour plot of MUSIC-Magnitude spectrum" and (b) "Contour plot of MUSIC-GD spectrum". Source 1: (θ,φ)=(55,35); Source 2: (θ,φ)=(57,37).]
Fig. 3. Contour plots showing the spectrum in the presence of sensor perturbation errors for (a) MUSIC-Magnitude and (b) MUSIC-GD.
The effects of the position perturbation on the array
autocorrelation matrix are simulated exactly as described in [8],
and the analysis is carried out. The resolution of the MUSIC-Magnitude
method is compared to that of the proposed MUSIC-GD
method under perturbation errors in Figure 3 (a) and Figure
3 (b) respectively. The plots shown are contour plots of
the respective spectra. Note that the spectrum for MUSIC-Magnitude
shows a single peak with contours around it, while
the spectrum for MUSIC-GD shows two distinct peaks with
different contours.
B. Cramer-Rao bound analysis
In this section, Cramer-Rao bound analysis for the MUSIC-GD
method is carried out. The bound is compared with the root mean
square error in DOA estimates for both the MUSIC-GD and
MUSIC-Magnitude methods. The Cramer-Rao inequality for
estimating multiple parameters is given as [9], [10]
$$\mathrm{var}(\theta_m) \ge [F^{-1}]_{mm}$$
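where the mnth element of the Fisher information matrix F
is given by
$$F_{mn} = N \, \mathrm{tr}\left\{ R^{-1} \frac{\partial R}{\partial p_m} R^{-1} \frac{\partial R}{\partial p_n} \right\}$$
N is the number of snapshots. Also,
$$R = E[X(t) X(t)^H] = A R_s A^H + \sigma^2 I$$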
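Here, Rs is the source correlation matrix, I is the identity
matrix and σ² denotes the variance of the white Gaussian
noise. For 2-D DOA estimation, the unknown parameter vector
is p = [φ, θ], where φ is the elevation angle and θ is the
azimuth angle. The Fisher information matrix is given by
$$F = \begin{bmatrix} F_{\theta\theta} & F_{\theta\phi} \\ F_{\phi\theta} & F_{\phi\phi} \end{bmatrix}$$
where
$$F_{\theta\theta} = 2N \, \mathrm{Re}[(R_s A^H R^{-1} A R_s) \times (A_\theta P_A^{\perp} R^{-1} A_\theta)^T]$$
$$F_{\theta\phi} = 2N \, \mathrm{Re}[(R_s A^H R^{-1} A R_s) \times (A_\theta P_A^{\perp} R^{-1} A_\phi)^T]$$
$$F_{\theta\phi} = F_{\phi\theta}$$
$$P_A^{\perp} = I - A (A^H A)^{-1} A^H$$
$$A_\theta = \sum_{n=1}^{K} \frac{\partial A}{\partial \theta_n}$$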
The Cramer-Rao bound is plotted for various combinations
of (φ, θ) along with the RMSE values for estimates obtained
by MUSIC-Magnitude and MUSIC-GD in Figure 4. The blue
plane at the bottom is the Cramer-Rao bound for circular
arrays. It can be seen that the performance of both algorithms
is similar.
[Figure 4: surface plot with axes Azimuth (θ), Elevation (φ) and Root Mean Square Error; legend: CRB, MGD, MM.]
Fig. 4. Comparison of CRB with RMSE of MUSIC-Magnitude and MUSIC-GD methods
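The general trace expression for the Fisher information matrix given in this subsection can be evaluated numerically. The sketch below does so for a single source with p = [φ, θ], using central finite differences for the derivatives of R. It assumes that the source power and noise variance are known, and uses a hypothetical two-ring geometry, a 2 kHz frequency and 100 snapshots. It does not reproduce the closed-form block expressions or the curves of Figure 4.

```python
import numpy as np

V_SOUND = 343.0  # m/s, assumed

def steering(radii, alphas, phi, theta, freq_hz):
    tau = radii * np.cos(theta - alphas) * np.sin(phi) / V_SOUND
    return np.exp(1j * 2 * np.pi * freq_hz * tau)

def covariance(p, radii, alphas, freq_hz, source_power, sigma2):
    """R = A Rs A^H + sigma^2 I for one source at p = [phi, theta] (radians)."""
    a = steering(radii, alphas, p[0], p[1], freq_hz)
    return source_power * np.outer(a, a.conj()) + sigma2 * np.eye(len(a))

def fisher_information(p, n_snapshots, step=1e-6, **model):
    """F_mn = N tr{R^-1 dR/dp_m R^-1 dR/dp_n}, derivatives by central differences."""
    p = np.asarray(p, dtype=float)
    R_inv = np.linalg.inv(covariance(p, **model))
    dR = []
    for m in range(len(p)):
        dp = np.zeros_like(p)
        dp[m] = step
        dR.append((covariance(p + dp, **model) - covariance(p - dp, **model)) / (2 * step))
    F = np.empty((len(p), len(p)))
    for m in range(len(p)):
        for n in range(len(p)):
            F[m, n] = n_snapshots * np.real(np.trace(R_inv @ dR[m] @ R_inv @ dR[n]))
    return F

# Hypothetical model parameters.
model = dict(
    radii=np.r_[np.full(4, 0.025), np.full(8, 0.05)],
    alphas=np.r_[2 * np.pi * np.arange(4) / 4, 2 * np.pi * np.arange(8) / 8],
    freq_hz=2000.0, source_power=1.0, sigma2=0.1,
)
p0 = np.deg2rad([35.0, 55.0])                    # [phi, theta]
F = fisher_information(p0, n_snapshots=100, **model)
crb = np.diag(np.linalg.inv(F))                  # variance bounds in rad^2
crb_rmse_deg = np.rad2deg(np.sqrt(crb))          # bound on RMSE in degrees
crb_double = np.diag(np.linalg.inv(fisher_information(p0, n_snapshots=200, **model)))
```

Since F scales linearly with N, doubling the number of snapshots halves the variance bound, which gives a quick sanity check on any implementation.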
IV. EXPERIMENTS ON DISTANT SPEECH RECOGNITION
AND DOA ESTIMATION
In this section, the performance of the proposed method
(MUSIC-GD) of source localization is evaluated by conducting
two experiments. In the first experiment, two closely
spaced sources in a reverberant environment are localized over
multiple trials, and scatter plots are shown to highlight the
accuracy of the proposed method. In the second experiment,
distant speech recognition trials in a meeting room environment
are conducted over a circular array. In both cases the
results are compared to the MUSIC-Magnitude method.
A. Experiments on source localization
In this set of experiments, twelve uniformly spaced microphones
are used over two concentric rings. Among these, four
are placed in a circular ring of radius 2.5 cm and the rest
are placed in a ring of radius 5 cm. The number of signals
incident on the microphone array can be varied. An SNR of 20
dB is selected for the experiments. Reverberation is modeled
by generating a room impulse response and convolving it with
the incident signals. We use the image method to generate
a room impulse response according to the measurements of
the meeting room environment. The DRR can be changed
either by changing the distance of the source from the array or
by increasing the number of virtual sources considered in
the image method. A DRR of 9.74 dB has been set for the
localization experiments. Several DOA estimation trials are
then conducted. The estimated DOAs are compared with the
actual DOAs each time. The projections of the unit vectors
along the direction of signal arrival on the two dimensional
plane are plotted as scatter plots.
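The direct-to-reverberant ratio of a given room impulse response can be estimated with a short helper. The sketch below uses a common convention in which the direct path is taken as the strongest tap, optionally with a few neighbouring samples, and everything else counts as reverberant energy. This convention, the function name and the toy impulse responses are assumptions, and the sketch is not the image-method simulation used in the experiments.

```python
import numpy as np

def direct_to_reverberant_ratio_db(h, direct_halfwidth=0):
    """DRR in dB: energy around the strongest tap over the energy of all other taps."""
    h = np.asarray(h, dtype=float)
    peak = int(np.argmax(np.abs(h)))
    lo = max(peak - direct_halfwidth, 0)
    hi = peak + direct_halfwidth + 1
    direct = np.sum(h[lo:hi] ** 2)
    reverberant = np.sum(h ** 2) - direct
    return 10.0 * np.log10(direct / reverberant)

# Toy impulse responses (hypothetical): same direct path, echoes of different strength.
h_strong_echoes = np.array([1.0, 0.0, 0.0, 0.5, 0.5])
h_weak_echoes = np.array([1.0, 0.0, 0.0, 0.25, 0.25])

drr_strong = direct_to_reverberant_ratio_db(h_strong_echoes)
drr_weak = direct_to_reverberant_ratio_db(h_weak_echoes)
```

Weaker reflections relative to the direct path give a higher DRR, consistent with the observation that moving the source or changing the number of virtual sources changes the DRR.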
The two dimensional scatter plots for the estimation of
azimuth and elevation angles of two closely spaced sources at
(55,35) and (57,37) over multiple trials, for
MUSIC-Magnitude and MUSIC-GD, are illustrated in
Figure 5. The estimated DOAs and the actual DOAs are marked
for both the MUSIC-Magnitude and MUSIC-GD methods. The
blue stars denote the actual DOAs, while the red and black dots
show the estimated DOAs for the two sources. It can be seen
that the scatter of the estimated DOAs is nearer to the actual
DOAs for MUSIC-GD. In the case of DOA estimation using
MUSIC-Magnitude, there were several cases when the two
estimates overlapped each other, leading to poor localization
of the source DOAs. The estimates were more separated
in the case of MUSIC-GD.
B. Experiments on Distant speech recognition
Speaker dependent large vocabulary speech recognition
experiments are conducted for speech acquired from circular
microphone arrays [1], [3] in a meeting environment. The
setup has four participants and a presenter, each connected to a
close talking microphone. It is also equipped with a circular
microphone array. The experiments are conducted for four
spatially located speakers. The circular array setup is similar
to and consistent with the setup used for DOA estimation as
described earlier. The experimental results on distant speech
recognition are presented as % word error rate (WER). The
% WER is calculated as
$$\mathrm{WER} = 100 - \frac{W_n - (W_s + W_d + W_i)}{W_n} \cdot 100$$
where Wn is the total number of words, Ws the total number
of substitutions, Wd the total number of deletions, and Wi the
total number of insertions. To ensure conformity with standard
databases we record sentences generated from words that
constitute the TIMIT database [13]. For each of the following
methods, the results of speech recognition for speakers at four
spatially close locations (Location 1 - 4), obtained using the
commercially available Dragon speech recognition SDK, are listed
as WER in Table I:
• Close talking microphone (CTM).
• A filter and sum beamformer realized using DOAs estimated from the MUSIC-GD spectrum.
• A filter and sum beamformer realized using DOAs estimated from the MUSIC-Magnitude (MM) spectrum.
Note that a wide band version of MUSIC [12] is
used in the above experiments. The proposed method based on
the MUSIC-GD indicates a reasonable reduction in WER when
compared to the MUSIC-Magnitude method. The results are also
encouraging for a low DRR of 4.69 dB, as can be seen from
Table I.

Location   CTM      MUSIC-GD                    MM
                    9.74 dB DRR   4.69 dB DRR   9.74 dB DRR   4.69 dB DRR
1          7.60%    13.71%        21.41%        15.62%        25.32%
2          13.42%   31.28%        33.21%        35.72%        38.13%
3          9.09%    18.21%        27.45%        25.73%        29.40%
4          11.18%   21.38%        29.73%        28.67%        31.23%
TABLE I. DISTANT SPEECH RECOGNITION PERFORMANCE AS WORD ERROR RATE (WER), FOR FOUR SPATIALLY CLOSE SPEAKERS (LOCATION 1 - 4) IN A MEETING ROOM AT A DRR OF 9.74 DB AND 4.69 DB.

[Figure 5: two scatter plots, (a) and (b), with axes "Azimuth angle" and "Sine of elevation angle".]
Fig. 5. Two dimensional scatter plot for localization of two closely spaced sources, (55°,35°) and (57°,37°), indicated by blue stars using (a) MUSIC-Magnitude method and (b) MUSIC-GD method.

V. CONCLUSION
A high resolution source localization method to resolve
closely spaced sources using a limited number of sensors
has been proposed in this work. The MUSIC-Group Delay
method is able to resolve both azimuth and elevation angles
for such closely spaced sources and indicates higher robustness
when compared to conventional subspace methods in terms
of localization error analysis. The proposed method indicates
a reasonable improvement in terms of word error rates for
distant speech recognition. It can therefore be used in an
effective way to reduce word error rates for speech recognition
over microphone arrays which are generally far worse than
those obtained with a close talking microphone. Computing
localization error expressions for the proposed method requires
complex number analysis, which is currently being explored.
The application of the proposed method in meeting room
camera systems and simple cooperative robotic applications
is also being explored.
ACKNOWLEDGMENT
This work was funded in part by TCS Research Scholarship
Program under project number TCS/CS/20110191.
REFERENCES
[1] M.L. Seltzer, “Bridging the gap: Towards a unified framework for hands-free speech recognition using microphone arrays,” in Hands-Free Speech Communication and Microphone Arrays, 2008. HSCMA 2008, May 2008, pp. 104–107.
[2] H. L. Van Trees, Optimum Array Processing, Wiley-Interscience, 2002.
[3] W. Zhang and B. Rao, “Robust broadband beamformer with diagonally loaded constraint matrix and its application to speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 2006, pp. 785–788.
[4] E. Tuncer and B. Friedlander, Classical and Modern Direction-of-Arrival Estimation, Academic Press, 2009.
[5] M. Shukla and R. M. Hegde, “Significance of the MUSIC Group Delay Spectrum in Speech Acquisition from Distant Microphones,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 2010, pp. 2738–2741.
[6] R. Mandala, M. Shukla, and R. Hegde, “Group delay based methods for recognition of distant talking speech,” in Signals, Systems and Computers (ASILOMAR), 2010 Conference Record of the Forty Fourth Asilomar Conference on, Nov. 2010, pp. 1702–1706.
[7] A. Tripathy, L. Kumar, and R.M. Hegde, “Group delay based methods for speech source localization over circular arrays,” in Hands-free Speech Communication and Microphone Arrays (HSCMA), 2011 Joint Workshop on, May 30-June 1, 2011, pp. 64–69.
[8] V. Cevher and J. H. McClellan, “2-D sensor perturbation analysis: equivalence to AWGN on array outputs,” in SAM 2002, Washington, DC, 4–6 August 2002.
[9] P. Stoica and A. Nehorai, “MUSIC, maximum likelihood, and Cramer-Rao bound,” Acoustics, Speech and Signal Processing, IEEE Transactions on, vol. 37, no. 5, pp. 720–741, May 1989.
[10] T. Filik and T.E. Tuncer, “Design and evaluation of V-shaped arrays for 2-D DOA estimation,” in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, Mar. 31-Apr. 4, 2008, pp. 2477–2480.
[11] M. Brandstein and D. Ward, Eds., Microphone Arrays, Springer Verlag, Berlin, 2001.
[12] T.L. Tung, K. Yao, D. Chen, R.E. Hudson, and C.W. Reed, “Source localization and spatial filtering using wideband MUSIC and maximum power beamforming for multimedia applications,” in Signal Processing Systems, 1999. SiPS 99. 1999 IEEE Workshop on, 1999, pp. 625–634.
[13] John S. Garofolo, TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, Philadelphia, 1993.
[14] V. Cevher and J. H. McClellan, “2-D sensor perturbation analysis: equivalence to AWGN on array outputs,” in SAM 2002, Washington, DC, 4–6 August 2002.