Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural...

13
Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood, a) Douglas L. Jones, Robert C. Bilger, Charissa R. Lansing, William D. O’Brien, Jr., Bruce C. Wheeler, and Albert S. Feng Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, 405 North Mathews Ave., Urbana, Illinois 61801 ~Received 2 November 2002; accepted for publication 22 August 2003! Extraction of a target sound source amidst multiple interfering sound sources is difficult when there are fewer sensors than sources, as is the case for human listeners in the classic cocktail-party situation. This study compares the signal extraction performance of five algorithms using recordings of speech sources made with three different two-microphone arrays in three rooms of varying reverberation time. Test signals, consisting of two to five speech sources, were constructed for each room and array. The signals were processed with each algorithm, and the signal extraction performance was quantified by calculating the signal-to-noise ratio of the output. A frequency-domain minimum-variance distortionless-response beamformer outperformed the time-domain based Frost beamformer and generalized sidelobe canceler for all tests with two or more interfering sound sources, and performed comparably or better than the time-domain algorithms for tests with one interfering sound source. The frequency-domain minimum-variance algorithm offered performance comparable to that of the Peissig–Kollmeier binaural frequency-domain algorithm, but with much less distortion of the target signal. Comparisons were also made to a simple beamformer. In addition, computer simulations illustrate that, when processing speech signals, the chosen implementation of the frequency-domain minimum-variance technique adapts more quickly and accurately than time-domain techniques. © 2004 Acoustical Society of America. @DOI: 10.1121/1.1624064# PACS numbers: 43.72.Ew, 43.66.Pn, 43.66.Ts @DOS# Pages: 379–391 I. INTRODUCTION A major problem for hearing aids, speech recognition, hands-free telephony, teleconferencing, and other acoustic processing applications is extracting, with good fidelity, a target sound in the presence of multiple competing sounds. This is particularly true of speech sounds, which are highly nonstationary in spectrum and intensity, and which may change position with respect to the listener over time. Thus, the cancellation of multiple, nonstationary, interfering speech sources requires fast, accurate tracking of the sources and robustness to reverberation and correlation between sources. Many interference suppression techniques have been ex- plored to address this problem. The most common approach is the use of an adaptive beamformer to process the sampled, time-domain outputs of a multimicrophone array, such that reception of the target sound from a particular direction is enhanced @see the reviews by Van Veen and Buckley, 1988; Brandstein and Ward, 2001#. These techniques use signals from the array to estimate the gradient of an error function and then iteratively move the filter coefficients closer to an optimal solution in small steps. Two algorithms that have been used extensively to address this problem are the iterative-adaptive techniques of Frost @1972# and Griffiths and Jim @1982#. Other variations of adaptive algorithms in- clude those of Berghe and Wouters @1998# and Welker and Greenberg @1997#. Although these adaptive algorithms gen- erally work well for suppressing statistically stationary inter- ference sources that are uncorrelated with the target source, our experience has shown that they tend to adapt slowly or inaccurately in the presence of multiple, nonstationary inter- ference sources such as speech, especially when there are more sources than sensors. As a result, the performance of the algorithms is compromised. Greenberg and Zurek @1992# suggested that a solution to the problem of having more speech sources than sensors was to add more microphones. This solution is effective if all microphones are located far enough away from each other that they provide added useful inputs. However, accomplish- ing this in a hearing-aid system is difficult because locating microphones away from the ears is undesirable. Some cur- rent behind-the-ear ~BTE! hearing aids contain two micro- phones per instrument; however, the small separation of the microphones limits the effectiveness of such systems. Like- wise, it is impractical to use more than one microphone per instrument in systems that are located in the ear canal. Thus, a preferable system would use two sensors ~one per instru- ment, located at each ear or in the ear canals!, providing greater spatial separation of the microphones. Several previous studies of adaptive beamformers @Greenberg and Zurek, 1992; Kompis and Dillier, 1994; Hoffman et al. 1994; Kates and Weiss, 1996# have avoided the problem of slow algorithm adaptation by allowing the adaptive filters sufficient time ~at least 2 s for these studies! to converge before processing test signals. To provide more challenging test conditions, Greenberg et al. @2003# used a a! Electronic mail: [email protected] 379 J. Acoust. Soc. Am. 115 (1), January 2004 0001-4966/2004/115(1)/379/13/$20.00 © 2004 Acoustical Society of America

Transcript of Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural...

Page 1: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

Performance of time- and frequency-domain binauralbeamformers based on recorded signals from real rooms

Michael E. Lockwood,a) Douglas L. Jones, Robert C. Bilger, Charissa R. Lansing,William D. O’Brien, Jr., Bruce C. Wheeler, and Albert S. FengBeckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign,405 North Mathews Ave., Urbana, Illinois 61801

~Received 2 November 2002; accepted for publication 22 August 2003!

Extraction of a target sound source amidst multiple interfering sound sources is difficult when thereare fewer sensors than sources, as is the case for human listeners in the classic cocktail-partysituation. This study compares the signal extraction performance of five algorithms using recordingsof speech sources made with three different two-microphone arrays in three rooms of varyingreverberation time. Test signals, consisting of two to five speech sources, were constructed for eachroom and array. The signals were processed with each algorithm, and the signal extractionperformance was quantified by calculating the signal-to-noise ratio of the output. Afrequency-domain minimum-variance distortionless-response beamformer outperformed thetime-domain based Frost beamformer and generalized sidelobe canceler for all tests with two ormore interfering sound sources, and performed comparably or better than the time-domainalgorithms for tests with one interfering sound source. The frequency-domain minimum-variancealgorithm offered performance comparable to that of the Peissig–Kollmeier binauralfrequency-domain algorithm, but with much less distortion of the target signal. Comparisons werealso made to a simple beamformer. In addition, computer simulations illustrate that, whenprocessing speech signals, the chosen implementation of the frequency-domain minimum-variancetechnique adapts more quickly and accurately than time-domain techniques. ©2004 AcousticalSociety of America.@DOI: 10.1121/1.1624064#

PACS numbers: 43.72.Ew, 43.66.Pn, 43.66.Ts@DOS# Pages: 379–391

nsad

hlauca

rce

aplha

i8lsioavet

-

n-

r-rce,or

ter-e aree of

owas

alltherish-ingcur--the

ke-perhus,

ers4;

e

ore

I. INTRODUCTION

A major problem for hearing aids, speech recognitiohands-free telephony, teleconferencing, and other acouprocessing applications is extracting, with good fidelity,target sound in the presence of multiple competing sounThis is particularly true of speech sounds, which are hignonstationary in spectrum and intensity, and which mchange position with respect to the listener over time. Ththe cancellation of multiple, nonstationary, interfering speesources requires fast, accurate tracking of the sourcesrobustness to reverberation and correlation between sou

Many interference suppression techniques have beenplored to address this problem. The most common approis the use of an adaptive beamformer to process the samtime-domain outputs of a multimicrophone array, such treception of the target sound from a particular directionenhanced@see the reviews by Van Veen and Buckley, 198Brandstein and Ward, 2001#. These techniques use signafrom the array to estimate the gradient of an error functand then iteratively move the filter coefficients closer tooptimal solution in small steps. Two algorithms that habeen used extensively to address this problem areiterative-adaptive techniques of Frost@1972# and Griffithsand Jim@1982#. Other variations of adaptive algorithms include those of Berghe and Wouters@1998# and Welker andGreenberg@1997#. Although these adaptive algorithms ge

a!Electronic mail: [email protected]

J. Acoust. Soc. Am. 115 (1), January 2004 0001-4966/2004/115(1)/3

,tic

s.yys,hndes.x-

ched,t

s;

nn

he

erally work well for suppressing statistically stationary inteference sources that are uncorrelated with the target souour experience has shown that they tend to adapt slowlyinaccurately in the presence of multiple, nonstationary inference sources such as speech, especially when thermore sources than sensors. As a result, the performancthe algorithms is compromised.

Greenberg and Zurek@1992# suggested that a solution tthe problem of having more speech sources than sensorsto add more microphones. This solution is effective ifmicrophones are located far enough away from each othat they provide added useful inputs. However, accompling this in a hearing-aid system is difficult because locatmicrophones away from the ears is undesirable. Somerent behind-the-ear~BTE! hearing aids contain two microphones per instrument; however, the small separation ofmicrophones limits the effectiveness of such systems. Liwise, it is impractical to use more than one microphoneinstrument in systems that are located in the ear canal. Ta preferable system would use two sensors~one per instru-ment, located at each ear or in the ear canals!, providinggreater spatial separation of the microphones.

Several previous studies of adaptive beamform@Greenberg and Zurek, 1992; Kompis and Dillier, 199Hoffman et al. 1994; Kates and Weiss, 1996# have avoidedthe problem of slow algorithm adaptation by allowing thadaptive filters sufficient time~at least 2 s for these studies!to converge before processing test signals. To provide mchallenging test conditions, Greenberget al. @2003# used a

37979/13/$20.00 © 2004 Acoustical Society of America

Page 2: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

a

retreusoslkrraan

oc

ndala

tio

cea

diusi

hoes.v

ion-gon

rmc

si

cohanoeheth

ss

om

lsen

veo

Ra-gple-nce

e-theegof

almhedi-

edme-

inrep-s,

the

rcesbe-dly ine-ects

ith

15edo-

-

er

ssedto

s of-thein

the

of

he

‘‘roving’’ interfering sound source that changed locationrandom times. However, this roving source was not usedall test conditions, and the algorithms in that study wepermitted to converge for 1 s before the onset of the targsignal. We argue that a beamforming algorithm used in ahearing-aid instrument cannot be preadapted in an acocally crowded environment due to the changing head ption of the listener and the movement of the interfering taers, and it may be unable to adapt quickly enough to perfoeffective interference suppression. Thus, the adaptationof a hearing-aid signal-processing algorithm is an importconsideration.

To improve the adaptation rate of time-domain algrithms, an alternate approach to an iterative-adaptive tenique is direct solution of the optimal beamformer@Capon,1969#. While this in theory provides rapid convergence aimproved interference cancellation, this technique generrequires the inversion of large, time-domain correlation mtrices, a process that is inherently unstable and computaally impractical@Golub and Van Loan, 1996#.

To suppress multiple, nonstationary interfering sourusing only two sensors, frequency-domain beamforminggorithms appear to have distinct advantages comparetime-domain algorithms. For example, the algorithm of Let al. @1997, 2000, 2001# first determines source locationand strengths, and then performs constrained beamformin each frequency band to remove interference. This metwas shown to adapt quickly and could effectively supprmultiple speech interferers using only two microphonesdemonstrates a distinct improvement in performance otime-domain methods, but it requires intensive computatThe LENS algorithm@Deslodge, 1998# extracts a target signal by placingn21 spatial nulls usingn sensors. Greenberet al. @2003# evaluated a four-sensor LENS implementatiwith up to three interfering sources.

Other frequency-domain algorithms do not perfobeamforming, but rather attenuate individual frequenbands that contain interference. The algorithm of PeisKollmeier, and colleagues~Kollmeier, 1997; Kollmeieret al., 1993; Peissig and Kollmeier, 1997; Wittkopet al.,1997!, hereafter referred to as the P–K algorithm, usesherence and phase and amplitude differences between cnels to determine a gain for each frequency band. SlyhMoses@1993# describe another example of this type of algrithm. These techniques have been shown to be effectivattenuating off-axis sources using only two sensors, but talso introduce signal distortion by attenuating part oftarget signal.

Frequency-domain minimum-variance distortionleresponse~MVDR! beamformers@Cox et al., 1986; 1987# aremore computationally efficient than both the techniquetime-domain correlation matrix inversion and the algorithof Liu et al. @2000, 2001#. MVDR beamformers pass signafrom a target direction with no distortion, assuming the ssors are matched. Kates and Weiss@1996# used adaptivefrequency-domain algorithms~preadapted for 2 s! to extractspeech in a diffuse noise field using signals from a fisensor end-fire array. Their study included an MVDR algrithm with a limited adaptation rate.

380 J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

tine

alti-i--mtet

-h-

ly-n-

sl-to

ngds

Iter.

yg,

-an-d

-iny

e

-

f

-

--

We hypothesized that a frequency-domain MVD~FMV! algorithm, specifically implemented for fast adapttion, might be effective for suppressing multiple interferinspeech sources using only two sensors. We have immented such an algorithm and evaluated its performawith computer simulations@Lockwood, 1999; Lockwoodet al., 1999#. Initial tests showed that, compared to timdomain adaptive algorithms, the computational cost ofFMV algorithm is similar, but it converges much morquickly. For simulated signals with up to four interferinsources, the FMV algorithm outperformed the algorithmsFrost @1972# and Griffiths and Jim@1982# in terms of SNRgain @Yang et al., 2000#.

The focus of the current study is threefold. The first gois to further evaluate the performance of the FMV algorithwith a two-sensor array in real environments. Although tperformance of the FMV algorithm under simulated contions is promising@Larsen et al., 2001#, it has not beenevaluated under actual room conditions with real recordsignals. The second goal is to evaluate and compare tiand frequency-domain algorithms in acoustic sceneswhich there are more speech sources than sensors. Thisresents a challenging condition for beamforming algorithmand a condition that most studies have not explored. Onlystudy of Kates and Weiss@1996# evaluated algorithms in anenvironment with more sources than sensors, but the souwere all multitalker babble rather than speech. Finally,cause speech sources and acoustic scenes change rapireal-world listening environments, the third goal is to dvelop a better understanding of how adaptation speed affthe performance of these algorithms.

Recordings were made in three different rooms wvarying reverberation times~RTs: 0.10, 0.37, 0.65 s! usingthree different microphone arrays:~1! two microphonescoupled to the ear canals of a KEMAR mannequin;~2! twoomnidirectional microphones in free field separated bycm; and~3! two cardioid microphones in free field separatby 15 cm. The performances of a two-channel FMV algrithm @Lockwood, 1999; Lockwoodet al., 1999#, the Frostadaptive beamformer@Frost, 1972#, a version of the generalized sidelobe canceler~GSC! @Greenberg, 1998; Griffiths andJim, 1982#, and an implementation of the Peissig–Kollmei~P–K! @Kollmeier et al., 1993; Wittkopet al., 1997# binauralalgorithm were assessed. The signals were also procewith a fixed beamformer. The algorithms were not allowedpreadapt, and their performance was compared in termthe signal-to-noise ratio~SNR! gain and target signal distortion produced by each. The adaptation characteristics ofFMV, Frost, and GSC algorithms were further evaluatedcomputer simulation. Finally, the computational costs ofalgorithms were examined.

II. ALGORITHM IMPLEMENTATIONS

A. Frequency-domain MVDR „FMV… algorithm

Time-domain input signals, with a sampling rate22.05 kHz, are transformed periodically~every L516samples! into the frequency domain via a length-N FFT, us-ing a Hamming window. For a two-microphone system, t

Lockwood et al.: Performance of beamformers

Page 3: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

n

0esfro

kcoor

imen

ith

p

thetical

i-

;

therytime

ds2tem,on-issta-

alFT

ime

tat-

ctsedan

asgth24-for-th ofcedtheelyin-ta

ronre

ueero

frequency-domain signals from the sensors are represeby the components of the vectorXk5@X1k X2k#, wherekindexes the frequency bins. TheF most recent FFTs arestored in a buffer, and a correlation matrixRk is calculatedfor each frequency bink by using:

Rk5F M

F (i 51

F

X1k,i* X1k,i

1

F (i 51

F

X1k,i* X2k,i

1

F (i 51

F

X2k,i* X1k,i

M

F (i 51

F

X2k,i* X2k,i

G , ~1!

where* represents complex conjugation, andM is a multi-plicative ‘‘regularization’’ constant slightly greater than 1.0that helps avoid matrix singularity and improves robustnto sensor mismatch. Coxet al. @1987# described the use oadditive regularization to control the trade-off betweenbustness and white-noise gain. Values ofN, M, and F arefound in Table I. The correlation matricesRk are updatedevery L516 samples, allowing them to quickly tracchanges of the input signals in all frequency bands. Therelation matrices and FFT buffers were set to zero befprocessing each signal.

For each frequency bandk, the monaural output of thebeamformer is

Yk5wkHXk , ~2!

where wk is a vector of frequency-domain weights andH

represents the Hermitian transpose of a vector. The optzation goal and constraint are expressed for each frequband as

minwk

E$uYku2%, ~3a!

subject to eHwk51, ~3b!

where min represents the minimization of a function wrespect to selected variables~the weights,wk , in this case!,E$ % represents the expected-value operation, ande is a vec-tor indicating the desired arrival direction. This general a

TABLE I. Algorithm parameters for best performance.

Microphone

Processing algorithm:

FMV Frost GSC P–K

Omnidirectional N51024 NF5401 KGSC5401 N51024~SennheiserMKEII !

F532 mF51.0 a50.15 c155

M51.03 cF50.01 c251c351

Cardioids N51024 NF5401 KGSC5401 N51024~SennheiserME-104!

F532 mF51.0 a50.15 c155

M51.03 cF50.01 c251c351

KEMAR N51024 NF5401 KGSC5401 N51024~EtymoticER-1!

F532 mF51.0 a50.07 c151.0

M51.10 cF50.01 c251.5c351.0

J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

ted

s

-

r-e

i-cy

-

proach is originally attributed to Capon@1969#. For an on-axis target source, both detectors receive the signal atsame time and with the same amplitude, assuming idendetectors. Thus,eH5@1 1#. If the desired receive directionwere off-axis, thene would be complex valued. For the minmization goal and constraint given in Eqs.~3a! and ~3b!, anoptimal solution is known@Capon, 1969; McDonough, 1979Cox et al., 1987#. For each frequency bink, the optimalweight vectorwopt,k is given by

wopt,k5Rk

21e

eHRk21e1s

, ~4!

where Rk is defined in Eq.~1!, Rk21 represents the matrix

inverse ofRk , ands is a very small positive quantity thaprevents division by zero. Inherent to this solution is tassumption that it is valid only if the inputs are stationarandom processes. This is assumed to be true for smallintervals of speech signals in each frequency band.

To respond quickly to changes inRk , new optimalweights wk are calculated for half of the frequency banevery L samples, so all weights are updated everyLsamples. This is possible because for a two-sensor systhe matrix inversion for each frequency band is computatially inexpensive. It will be demonstrated in Sec. V that thtechnique yields faster and more accurate tracking of nontionary sources than time-domain techniques.

To obtain the time-domain output, the newest optimweights for each frequency band are applied to buffered Fdata to obtain the output@Eq. ~2!#. The resulting Nfrequency-domain values are then transformed to the tdomain using a length-N inverse FFT. This occurs everyLsamples, and the centralL samples of time-domain outpuare used. As the outer samples of the FFT window aretenuated by the Hamming window, this minimizes the effeof circular convolution which arise due to the FFT-basfiltering, while requiring less computation and delay thanoverlap-save or overlap-add method@Joho and Moschytz,2000#.

The main consideration in the FMV implementation wfrequency resolution, which is determined by the FFT lenrelative to the sampling rate. For our experiments, a 10point FFT was chosen because it provided the best permance. Increasing the FFT length decreases the bandwideach frequency bin and should improve FMV performan~for stationary signals!, as this provides the more detaileestimates of signal spectra. However, in practice, whenFFT length was too long the performance decreased, likbecause the signal was not sufficiently stationary for theterval of the FFT. Also, longer FFTs required more dapoints, and objectionably increased the system delay.

B. Frost, GSC, P–K, and fixed beamformerimplementations

Optimized parameter values~see the next section fodetails! for three of the algorithms described in this sectiare shown in Table I. All time-domain adaptive weights weinitialized to that of a conventional beamformer, with a valof 0.5 for each channel at an appropriate delay, and z

381Lockwood et al.: Performance of beamformers

Page 4: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

cerry

at

-roAde-

ts-

rgrehea

r

erth

roee

c

a-

-g-a

g

he-

pa-t oftherst.

s itetero-

rees ob-heha-

ro-ple

os-ts.notwasilesig-

gs

sces24alestgu-

nal,d-

i-le

to-

oor,

otherwise. All time-domain weights were updated easample. The correlation matrices for the P–K algorithm walso set to zero before processing and were updated evesamples~as with the FMV algorithm!.

The Frost algorithm was implemented with the updequation

Wnew5P•S Wold22

3•

WoldT "xf "xf

mF•xfT"xf1cF

D 1F, ~5!

whereWold is the previous set of time-domain filter coefficients, andmF and cF are adjustable parameters to contthe step size and to prevent divide-by-zero, respectively.ditionally, xf is a column vector composed of the timdomain input signals from both channels,P is a precomputedprojection matrix, andF represents the response constrainall as per Frost@1972#. The factor of 2/3 facilitates comparison of the step size with a bound derived by Frost.NF isdefined as the length of the adaptive filter.

The generalized sidelobe canceler~GSC! algorithm@Griffiths and Jim, 1982# was implemented as per Greenbe@1998#. This implementation improves performance andduces target distortion in nonstationary environments wthe target signal is strong. The weight update equation w

Wnew5Wold1asum

KGSC@se2~n!1sx

2~n!#•e~n!•xG~n!, ~6!

where asum is a step size parameter,n is an index of thecurrent sample,Wold is the previous set of time-domain filtecoefficients,e is the processed output,xG is a vector ofsamples of the signal passed by the blocking matrix~mostlyinterference!, KGSC is the filter length, andse

2 andsx2 repre-

sent the average powers~updated every sample! of e andxG ,respectively.

The P–K algorithm@Kollmeier et al., 1993; Wittkopet al., 1997# was implemented with three variable parametto control the attenuation of the signal as a function ofphase and amplitude differences between channels andcoherence between channels. For thekth frequency band, the~real-valued! filter weight Gk was determined using

Gk5g1k•g2k•g3k ,~7!

g1k5maxS 0, 12c1u/R12u

p D ,

g2k5c2•minS R11

R22,R22

R11D , g3k5S Re@R12

2 #

R11•R22D C3

,

wherec1 , c2 , andc3 are adjustable parameters that contthe sensitivity of the algorithm to phase differences betwchannels, amplitude differences between channels, andherence between channels, respectively;/ represents thephase angle in radians, Re@ # represents the real part ofcomplex value, andRij is an element from a frequencydomain correlation matrix@see Eq.~1!#, with M51.00.

A simple fixed beamformer~referred to as a conventional beamformer! was implemented by averaging the sinals from the two microphones after a matching filter wapplied.

382 J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

he16

e

l-

,

-ns

sethe

lno-

s

C. Optimization

All algorithms and metrics were implemented usinMATLAB 6.5 ~The MathWorks, Inc., Natick, MA!. Floating-point calculations were performed with 64-bit precision. Talgorithm parameters~Table I! were adjusted for best performance in terms of the SNR metric@Sec. III F, Eq.~9!#. Thetest signals were processed with many different sets oframeters. For the frequency-domain algorithms, the effecchanging the FFT length was generally independent ofeffects of changing other parameters, so this was set fiAdditionally, the additive constantcF in the Frost algorithmwas found to have little effect on performance so long awas above a minimum value. This narrowed the paramspace to two variables for the FMV, Frost, and GSC algrithms, and to three variables for the P–K algorithm.

Many sets of values were chosen for the remaining fparameters, and the signals were processed and resulttained for all. It was found that performance varied with tnumber of interfering sources. Because this study empsizes the effects of multiple interfering sources~more sourcesthan sensors!, the parameter set for each algorithm that pduced the best overall performance for tests with multiinterferers was chosen as optimal. This was anad hocdeci-sion made by plotting the performance on a graph and choing the line that was highest for the multiple-interferer tesFor the time-domain algorithms, special care was takento choose parameter sets that caused instability, as itpossible to have good multiple-interferer performance whthe algorithm became unstable for the one-interferer testnals.

III. EXPERIMENTAL METHODS

A. Test materials

A series of high-context sentences by eight talkers~fourfemales and four males! from the revised R-SPIN test@Bilgeret al., 1984# were recorded on digital audio tape~DAT! at asampling rate of 48 kHz, quantized to 16 bits. Recordinwere made in a sound-treated studio~model: Studio 73535ft., Acoustic Systems, Austin, TX!. The recorded sentencewere downsampled to 44.1 kHz. Three different sentenwere chosen from each talker. This provided a total ofdifferent sentences, 12 by male talkers and 12 by femtalkers. A section of multitalker babble from the R-SPIN tewas also used as an interfering signal in several test confirations. Its sampling rate was also 44.1 kHz.

B. Recording techniques and setup

Each of the 24 sentences, the multitalker babble sigand 10 s of white noise were played back from eight louspeakers housed in a semicircular enclosure~SPATS, Sensi-metrics Corp., Somerville, MA!. Each loudspeaker was equdistant~75 cm! from a central point and comprised a sing7.6-cm-diameter driver with frequency response restricted200-Hz to ;13 kHz. The two-microphone arrays were located at the central point of the array, 1.15 m above the fl

Lockwood et al.: Performance of beamformers

Page 5: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

ircle

FIG. 1. ~A! Diagram of room 1~treated with acoustic foam! with loudspeaker array.~B! Diagram of room 2~conference room with windows! withloudspeaker array.~C! Diagram of room 3~conference room, bare walls! with loudspeaker array. Filled circle represents target loudspeaker, open crepresents interferer; distances to central point of the array are shown.

s

ee

d

R

ct

24edimle

reasithioem-

o3

um

re

oatF

nt ofThengedfre-

atednd

iseIRs ofateo aenes,thefores,

fra-rage

dthe

pli-to

re-

om,

pro-sticthe.

and oriented such that the loudspeakers were at azimuth60°, 40°, 20°, 0°,220°, 240°, 260°, 280° with respect tothe broadside array.

Recordings were made with three sets of microphon~1! Sennheiser MKEII omnidirectional microphones spac15 cm apart in free field;~2! Sennheiser ME104 cardioimicrophones spaced 15 cm apart in free field; and~3! Ety-motic ER-1 microphones mounted in the ears of a KEMAmannequin~Knowles Electronics, Itasca, IL!. The omnidi-rectional and cardioid microphones were connected direto a microphone preamplifier~Millennia Media HV-3B, Plac-erville, CA! that was connected to the inputs of the Aarksystem. The KEMAR microphones were connected to thown dedicated preamps, and then to the Millennia Mepreamplifier. Recording and playback were done with a sapling rate of 44.1 kHz. The recorded data were downsampto 22.05 kHz prior to being saved to hard disk.

C. Room descriptions

Recordings were made in three rooms with differentverberation characteristics. In all rooms, the ceiling hacoustic tile suspended at a height of 9 ft. Above it waconcrete ceiling. The floors of all rooms were covered wshort carpet. The dimensions of the rooms and the positing of the loudspeaker and microphone arrays within thare shown in Figs. 1~a!, ~b!, and~c!. Room 1 was an acoustically controlled space with 6.4-cm-thick foam~SONEXValueline, Illbruck Corp., Minneapolis, MN! attached to allwall surfaces; Room 2 was a rectangular conference rowith bay windows and bookshelves on two walls; Roomwas a rectangular conference room with painted gypsboard walls.

The reverberation times of the rooms were measuwith a sound-level meter~Bruel & Kjaer, model 2260!. Amore suitable source of an impulse was not available, shand clap was used as a stimulus after it was found thprovided consistent results over several measurements.

J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

of

s:d

ly

ira-d

-da

n-

m

-

d

ait

ive

measurements were taken in each room at the central poithe loudspeaker array and the results were averaged.meter calculated theT60 times for 1/3-octave bands betwee200-Hz and 10 kHz. The values for each band were averaover the five measurements, and then averaged acrossquency bands to obtain the reverberation time. The estimaverageT60 values for rooms 1, 2, and 3 were 0.10, 0.37, a0.65 s, respectively.

D. Compensation, energy equalization, test creation

To match the microphones, the recorded white nofrom the loudspeaker at 0° was filtered with a 43-tap Ffilter adapted by an LMS algorithm to match the responsethe microphones for the target direction. For a sampling rof 22.05 kHz, this filter has a 2-ms delay, corresponding tsound propagation distance of approximately 0.7 m. Givthe dimension of the rooms and location of the microphondirect reflections should require more than 2 ms to reachmicrophones; the filters should thus compensate mostlythe differences in frequency response of the microphonrather than responding to reflections from the room.

All acoustic recordings~except for the recordings owhite noise! were of the same duration; however, the dution of the spoken sentences varied. Therefore, the avepower of each recorded sentence was calculated~taking intoaccount the varying durations! and all sentences were scaleto have the same average power. This is referred to asnormalized energy of a single interfering source. The amtude of each interfering source could then be adjustedprovide a specified relative amount of interference withspect to the target sentence.

To construct a test, the recordings of each source frits appropriate location~loudspeaker! were scaled and addedkeeping the left and right channels separated, therebyducing a binaural test signal. This assumed that the acouaddition of the sources and the various components inrecording and playback system all act in a linear manner

383Lockwood et al.: Performance of beamformers

Page 6: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

B

ble

384 J. Acoust. S

TABLE II. Summary of test configurations.

Configurationnumber

Azimuth angle of source

160° 120° 0° 240° 280°

1 Interferera Target2 Interferer Target3 MTBa Target4 MTB Target5 Interferer Target Interferer6 Interferer Target Interferer7 Interferer Target Interferer8 Interferer Interferer Target Interferer9 Interferer Interferer Target Interferer

10 Interferer Interferer Target Interferer Interferer11 Interferer or MTB Interferer or MTB Target Interferer or MTB Interferer or MT

a‘‘Interferer’’ and ‘‘MTB’’ indicate interference from the angle listed by a single talker and multitalker babsource, respectively.

titeind

era

asFoner

uestt

org

m

slu

rcrgiff

e,rge

reeso

, fte

a

inefteron

bebe-nal--

d,heral.

ls

Inhe

di-

ftlly,ro-

E. Test terminology and guidelines

A test configuration was defined as one particular spaarrangement of target and interfering sources. Elevenconfigurations were used; Table II describes the positionof the target and interferer~s!. Each configuration containeone on-axis target source~always a single talker! presentedfrom the loudspeaker at 0°, and one to four off-axis interfing sources presented from the loudspeakers at possiblemuth angles of160°, 120°, 240°, or 280°, each a singletalker or multitalker babble source. Multitalker babble wused in test configurations 3 and 4 as the only interferer.test configuration 11, a multitalker babble was placed atof the four interference locations and single-talker interferwere placed at the other three.

All sources were located in the front half-plane, and ththe effects of the directionality of the cardioid microphonwas minimal. Additionally, a front–back ambiguity did noexist as there was no source located greater than 90° fromtarget signal.

For each spatial test configuration, 12 tests were cstructed. A test was defined as one permutation of a tasentence and interfering sentence~s!. Twelve different sen-tences, three sentences each from two male and two fetalkers, were used exclusively as target sentences.~Thus, foreach test configuration, six target talkers were males andwere females.! The remaining 12 sentences were used excsively as interferers. Additionally, in any test, each souwas a different talker, and all interferers had the same ene

Each test was constructed and processed at three dent SNRs. For configurations 1–4~one interferer!, three nor-malized energies were used:23, 0, and13 dB ~correspond-ing to each interference source having 3 dB less, the sam3 dB more average power, respectively, than the tasource!. For test configurations 5–11, three lower normalizenergies~26, 23, 0 dB! were used. Therefore, the entibattery of tests consisted of 11 test configurations, 12 tper configuration, and 3 SNR levels per test, for a total396 test signals. Each test was approximately 2.5 s longa total of 16.5 minutes of test signals. Results for eachconfiguration are presented in Sec. IV, and were obtainedaveraging the performance metrics over the 12 teststhree SNR levels for each configuration.

oc. Am., Vol. 115, No. 1, January 2004

alstg

-zi-

ores

s

he

n-et

ale

ix-

ey.

er-

oretd

tsforstbynd

F. Performance metrics

Processing with the various algorithms was done off-land the signals from the individual sources before and aprocessing were known, greatly simplifying the calculatiof performance metrics. Because the target signal coulddistorted by significant amounts during processing, andcause distortion is detrimental to speech perception, a sigto-noise ratio~SNR! metric that incorporated both interference and signal bias~distortion! error was chosen. Theoutput SNR~after processing! is defined as

SNROUT510• log10S (v51V tu~v !2

(v51V ~yp~v !2tu~v !!2D , ~8!

whereyp(v) and tu(v) are thevth samples of the processeoutput and unprocessed~ideal! target signals, respectivelyandV is the length of the signal in samples. In this work, tunprocessed signal is binaural and the output is monauThus, the expression is modified to be

SNROUT

510• log10S ( i 51V @g1tu,L~v !1g2tu,R~v !#2

( i 51V ~yp~v !2~g1tu,L~v !1g2tu,R~v !!!2D ,

~9!

wheretu,L(v) and tu,R(v) are the unprocessed target signafrom the left and right microphones, respectively, andg1 andg2 represent the gains applied by the algorithm~effectivelythe steering vector! to the target signal in each channel.this study,g1 , g250.5 because filters are used to match tresponses of the microphones in the target direction~see Sec.III D !. If the microphones are not matched for the targetrection, the gains must be replaced with filters.

The input SNR for one channel is defined as

SNRIN510• log10S (v51V tu,L~v !2

(v51V i u,L~v !2D , ~10!

where i u,L(v) is the interference signal received by the lemicrophone. Because all signals were recorded individuathe target and interference signals received by the mic

Lockwood et al.: Performance of beamformers

Page 7: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

dithoioe

th, steoe

rteu

-

cendnme

.

e-

go-n

sts,by

e.

n

le

ho

E-

data

phones are known and the SNRIN calculation yields the SNRafter microphone reception.

A second metric used was a measure of the targettortion. Access to the processed target signal alloweddistortion to be quantified. The FMV, Frost, and GSC algrithms are constrained to pass the target with zero distortand are referred to as distortionless response beamformHowever, microphone mismatch and reverberation meanonly a portion of the target signal satisfies the constrainttarget distortion may occur. The P–K algorithm attenuafrequency bands that contain interference, and thus almalways cancels some of the target signal. We define a msure of target distortion as

TD5(v51

V ~ tp~v !2@g1tu,L~v !1g2tu,R~v !# !2

(v51V @g1tu,L~v !1g2tu,R~v !#2

, ~11!

wheretp(v) is the processed~monaural! target signal. WhenTD is greater than zero, the target signal has been distoNote that this metric does not distinguish between attention and phase distortion.

IV. RESULTS AND DISCUSSION

A. Algorithm comparisons

Figure 2~results for omnidirectional and cardioid microphones, left channel! and Fig. 3~results for KEMAR micro-phones, both channels! show the average SNROUT producedby the algorithms after processing. The SNRIN, as receivedby the omnidirectional, cardioid, and KEMAR~ER-1! micro-phones, is labeled as~SNR, omni!, ~SNR, card!, and~SNR,ER-1!, respectively. To facilitate the comparisons in SeIV C, ~SNR, omni! is also shown with the results for thcardioid and KEMAR microphones. In all three rooms afor all microphone types, the FMV and P–K algorithms cosistently outperform the Frost and GSC algorithms in terof SNROUT for all test configurations with more than on

FIG. 2. ~A!, ~C!, ~E! Processed and unprocessed SNR values for theomnidirectional microphone data in rooms 1, 2, and 3, respectively.~B!, ~D!,~F! Processed and unprocessed SNR values for the left cardioid micropdata in rooms 1, 2, and 3, respectively.

J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

s-is-n,rs.atossta-

d.a-

.

-s

interfering source~configurations 5–11!. This is the mainadvantage conferred by the frequency-domain algorithms

For the one-interferer tests~configurations 1–4!, theFMV and P–K algorithms perform better than the timdomain algorithms in terms of SNROUT for configurations 1and 2. For configurations 3 and 4, the time-domain alrithms perform similarly or slightly better in room 1, and iroom 2 with cardioid microphones.~Though the algorithmparameters were not optimized for these one-interferer tethe improvement in performance that can be obtainedoptimizing the parameters for these tests is not very larg!

The FMV and P–K algorithms perform similarly in

FIG. 4. ~A!, ~D!, ~G! Target distortion for omnidirectional microphones irooms 1, 2, and 3, respectively.~B!, ~E!, ~H! Target distortion from cardioidmicrophones in rooms 1, 2, and 3, respectively.~C!, ~F!, ~I! Target distortionfor KEMAR microphones in rooms 1, 2, and 3, respectively.

ft

ne

FIG. 3. ~A!, ~C!, ~E! Processed and unprocessed SNR values for left KMAR microphone data in rooms 1, 2, and 3, respectively.~B!, ~D!, ~F!Processed and unprocessed SNR values for right KEMAR microphonein rooms 1, 2, and 3, respectively.

385Lockwood et al.: Performance of beamformers

Page 8: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

r-t

ve

a

lg

nlgovh

ap

fes

sKenr

thgoormre

twa

th

ere

Kgo

eth

or

oncioare

r

eyateion

to

in-dis-i-ainveKndino-themp-timehistill

ts

oxi-

e-

s.rom-ht.

als

R

v-on

ereob-Ifhese

ece

thatresedingme-

he

terms of SNROUT, but the P–K produces more target distotion, as can be seen in Fig. 4. This is to be expected, asFMV algorithm uses constrained spatial filtering to remointerference without distorting the target signal~withmatched sensors and no reverberation!, while the P–K algo-rithm simply attenuates frequency bands that appear todominated by interference. Therefore, target distortion iscepted in exchange for an improved SNROUT.

The differences between the GSC and the Frost arithm are generally small. Griffiths and Jim@1982# showedthat their GSC structure converges to the same solutionthe Frost beamformer, although their paths to convergemay be different. In the tests conducted here, the GSC arithm appears to have a slight performance advantagethe Frost algorithm for the single-interferer tests, while tFrost algorithm appears to perform better than the GSCthe number of interferers rises. This difference is moreparent with the KEMAR microphones~Fig. 3!, for which theGSC has a more pronounced advantage in the one-intertests~up to 3 dB! and for which the Frost algorithm performbetter in the multiple-interferer tests~by up to 2 dB!. Thecause of these differences is not known.

The FMV and GSC algorithms generally produce ledistortion~Fig. 4! of the target signal than the Frost and P–algorithms. The FMV and GSC distortion figures are oftcomparable, but the FMV has an advantage in the moreverberant environments. The high distortion figures ofP–K algorithm are expected, but those for the Frost alrithm are not. They suggest that the Frost algorithm is msensitive to the amount of reverberation and the type ofcrophone being used than the GSC or FMV algorithms a

The conventional beamformer, which averages theinputs, performs more poorly than all other algorithms,shown in Figs. 2 and 3.

B. Effects of reverberation

The performance, in terms of SNROUT ~Figs. 2 and 3!, ofall algorithms generally decreases by 1 dB or less fortests in room 2~RT50.37 s! as compared to room 1~RT50.10 s!. This decrease is fairly consistent across the diffent types of microphones. Performance differences betwroom 2 and room 3~RT50.65 s! are slightly more pro-nounced, including a drop of 1–2 dB for the FMV and P–algorithms across all configurations. Frost and GSC alrithm performance is reduced by a similar amount.

In general, the performance of all algorithms is dcreased by similar amounts as the reverberation time ofroom increases. Importantly, this implies that the perfmance advantage of the frequency-domain algorithmsmaintained as the reverberation time increases. For theinterferer configurations, the advantage of the frequendomain algorithms actually increases with reverberattime; this implies that the advantages of these algorithmsnot limited to the configurations in which there are mosources than sensors.

It is important to point out that distortion fodistortionless-response beamformers~FMV, Frost, GSC! isdue to sensor mismatch~including mis-steering! and rever-

386 J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

he

bec-

o-

asceo-er

eas-

rer

s

e-e-ei-.os

e

-en

-

-e

-ise-

y-nre

beration. The matching filters were chosen carefully, but thwere kept sufficiently short so that they did not compensfor reflections and reverberation. Therefore, the distortfigures generally reveal how sensitive the algorithms arereverberation.

Increasing the amount of reverberation consistentlycreases target distortion. Figure 4 reveals that the targettortion for the FMV, Frost, and GSC algorithms approxmately doubles from room 1 to room 2, and doubles agfrom room 2 to room 3. The FMV and GSC generally hathe lowest distortion for all rooms. The distortion of the P–algorithm rises by approximately 0.05 for both rooms 2 a3. The P–K algorithm has by far the highest distortionrooms 1 and 2 with omnidirectional and cardioid micrphones; Frost has the second highest. The distortion ofFrost algorithm is comparable to that of the P–K algorithfor room 3. While the FMV, Frost, and GSC algorithms apear to be more sensitive to increases in reverberationthan the P–K algorithm, for the range of RTs used in tstudy, the distortion of the FMV and GSC algorithms is snotably less than that of the P–K algorithm.

C. Microphone effects

The directionality of the cardioid microphones accounfor up to a 3-dB improvement in the SNRIN ~SNR, card! overthat of the omnidirectional microphones~SNR, omni!, as canbe seen in Fig. 2. This appears to account for the apprmately 1–2-dB improvement in the SNROUT that is observedfor all algorithms for the multiple-interferer tests in all threrooms. Overall, the FMV algorithm with cardioid microphones performs best, albeit by a small margin.

The KEMAR microphones~Fig. 3! cause a reduction inthe SNRIN for the left channel and an increase in the SNRIN

~SNR, ER-1! for the right channel for single-interferer testThis is because the interfering source for these tests is fthe left of the array, at160° or 120°, and thus the interference is stronger in the left ear of the KEMAR than the rigFor multiple interferer tests, the SNRIN is lower than with theother microphones. Thus, processing the KEMAR signyields 1–2-dB lower SNROUT than with omnidirectional orcardioid microphones. Another notable effect of the KEMAmicrophones is the dramatic increase in [email protected]~c!,~f!,~i!# for the Frost algorithm, likely caused by sensitiity to reverberation or microphone mismatch. The distortifor the GSC algorithm remains low.

For the experiments in this study, all sound sources wplaced in the front half-plane. This reduced the benefitstained by using the cardioid or KEMAR microphones.sources had been placed in the back half-plane, using tmicrophones would have resulted in higher values of SNRIN.However, for this study, the effect of the directionality of thmicrophones that is most interesting is the ability to reduthe reverberation in the recorded signals. It was hopedthe directional microphones would make all algorithms morobust to reverberation effects. Compared to the processignals from the omnidirectional microphones, processthe signals from the cardioid microphones produces a sowhat higher output SNR~for all three rooms! and generallylower target distortion, while processing signals from t

Lockwood et al.: Performance of beamformers

Page 9: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

8

J. Acoust. Soc. Am.

TABLE III. Frequencies of sinusoidal interferers and positions as a function of time.

Timeinterval Sinusoid frequencies~Hz!

500 611 682 769 921 1016 1095 1187 1331 144

Azimuth angles

0.00–0.75 s

265° 255° 245° 235° 225° 20° 30° 40° 50° 60°

0.75–1.50 s

20° 30° 40° 50° 60° 265° 255° 245° 235° 225°

1.50–2.25 s

60° 50° 40° 30° 20° 225° 235° 245° 255° 265°

ndo

nai-

a-fra

arfily

aghey

th-m

anis

-e

fetio

ththraiffenisha

terns

ests-

llytheidfec-

u-ce

wo-on-ache-elyin-

forerers

KEMAR microphones produces slightly lower SNRs ahigher distortion. Overall, the performances of the algrithms were improved by the use of the cardioid directiomicrophones, and only slightly reduced by the KEMAR mcrophones.

D. Best parameter sets

The optimal parameter sets for each algorithm~Table I!were found to be the same for both omnidirectional and cdioid microphones, but different for the KEMAR microphones. As compared to the best parameter sets for thefield microphones, the FMV algorithm performs best withlarger regularization valueM, the GSC algorithm requiressmaller step sizea to remain stable in multiple interferetests, and the P–K algorithm performs best with values oc1

and c2 that weight the amplitude difference most heavinstead of the phase difference.

V. TIME-VERSUS FREQUENCY-DOMAIN MVDRBEAMFORMERS

A. Overview

The results in Sec. IV show the performance advantof the FMV over conventional time-domain algorithms. TFrost and GSC algorithms differ from the FMV in that theare iterative-adaptive time-domain techniques, whileFMV calculates optimal solutions directly for many frequency bands. However, all are classified as MVDR beaformers, because they have identical optimization goalsconstraints~minimize output power, pass target source undtorted!. These MVDR beamformers~time- and frequency-domain! will all asymptotically converge to identical solutions for inputs that are stationary random processTherefore, differences in SNR performance are due to difent adaptation characteristics in the presence of nonstaary signals such as speech.

Using specific simulated signals, we now examinereasons for the performance advantage of the FMV overother algorithms. The test signals are not true binauralcordings as in the previous experiments, rather theysimulated. The signals for each source in each channel donly by time delay. Therefore, no sensor mismatch is presthe FMV, Frost, and GSC algorithms produce no target dtortion, and performance differences result only from tvarying adaptation characteristics of the algorithms. The

, Vol. 115, No. 1, January 2004

-l

r-

ee-

e

e

-d-

s.r-n-

ee

e-reert,-

eb-

sence of sensor mismatch also allows accurate beam patto be calculated to observe these characteristics.

The algorithms were all optimized to produce the bSNR gain for the simulated signals. The first example illutrates the behavior of MVDR beamformers with statisticastationary interference. The second example shows thatFMV is able to adapt more quickly and accurately to rapchanges in interfering sound sources, and thus it more eftively attenuates multiple-interfering sound sources.

B. Simulation example 1: Multiple spatially separatedsinusoidal interferers

MVDR beamforming algorithms may, in theory, attenate multiple, statistically stationary narrow-band interferensignals at different frequencies. To demonstrate this, a tchannel test signal was synthesized containing a singleaxis speech source and ten off-axis sinusoidal signals, ewith a different frequency and azimuth. The sinusoid frquencies ranged from 500 to 1450 Hz, with approximat100-Hz spacing. Neglecting frequency smearing due to w

FIG. 5. ~A! Spectrogram of the target speech signal in example 1~‘‘The warwas fought with armored tanks.’’! before processing.~B! Spectrogram of thecombined target and sinusoidal interference signals in example 1 beprocessing. The target signal is at 0° azimuth, and ten sinusoidal interfeoriginate from angles between665° azimuth.~C!, ~D! Spectrograms ofexample 1 after processing with the FMV algorithm~C! and the Frost algo-rithm ~D!.

387Lockwood et al.: Performance of beamformers

Page 10: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

er

sas

fo

fas

y-eeree

-Ho,

cs

aes

-ne

arrthio

ger

o-llalsaapt

cedof

e in

wlap.le,ig-

er-of

Anech

-

FT

pro-are athe

dowing, there is no spatial or frequency overlap of the intferers. So that the adaptation rate of both algorithms canobserved, the sinusoids instantaneously change location0.75 and 1.5 s. The sinusoid frequencies and azimuthsfunction of time are listed in Table III.

Spectrograms of the target and combined signals beprocessing are shown in Figs. 5~a! and~b!, respectively. TheSNR before processing is27.35 dB. The spectrograms othe processed output from the FMV and Frost algorithmsshown in Figs. 5~c! and~d!, respectively, and the SNR gainwere 18.56 and 16.14 dB, respectively.

While processing this signal, all of the frequencdomain weights calculated by the FMV algorithm for thFFT bin centered on 925 Hz were recorded. All of the timdomain filter coefficients from the Frost algorithm wetransformed to frequency-domain coefficients and recordUsing these coefficients, the beam pattern for the 925FFT bin ~the gain, in decibels, for signals as a functionazimuth angle and time! was calculated for both algorithmsand is shown in Fig. 6~a!. A two-element beamformer~with aconstraint! can place at most one spatial null per frequenband, and the null direction changes with time as it adaptthe input signal. The null is easily visible, occurring at225°,near 60°, and at 20° for time [email protected], 0.75 s#, @0.75, 1.5s#, and @1.5, 2.25 s#, respectively. Table III shows thatsinusoidal interferer with a frequency of 921 Hz was at thlocations at the above time intervals.

Figure 6~a! also shows that the FMV algorithm null deviates from the direction of the interferer between 0.75 a1.5 s. Examining Fig. 5~b! reveals that the spectra of thtarget~speech! and interference~sinusoidal! sources overlapin the 925-Hz FFT bin between 0.75 and 1.5 s. The optimfilter weights are calculated assuming that there is no colation between target and interfering sources; in practice,is rarely true over short time intervals. Thus, the correlatmatrix estimates contain some error.~To reduce this error,

FIG. 6. ~A! Magnitude~in decibels! of the beam pattern for the FMV algorithm for the 925-Hz FFT bin. Note the null at225° (t50.00 to 0.75 s!,'60° (t50.75 to 1.50 s!, and 20° (t51.50 to 2.25 s!. ~B! Magnitude~indecibels! of the beam pattern for the Frost algorithm for the 925-Hz Fbin. The null location varies considerably, but on average is near225° (t50.00 to 0.75 s!, 60° (t50.75 to 1.50 s!, and 20° (t51.50 to 2.25 s!.

388 J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

-be

ata

re

re

-

d.z

f

yto

e

d

le-isn

correlation matrices may be formed using data from lontime intervals with the trade-off of slower adaptation.!

Figure 6~b! shows the beam pattern for the Frost algrithm for the 925-Hz FFT bin. The direction of the nuplaced by the Frost algorithm is erratic when the signoverlap ~between 0.75 and 1.5 s!. This occurs becausenoisy time-domain estimate of the gradient is used to adthe filter weights. This misadjustment error can be reduby lowering the adaptation rate, but only at the expenseslower convergence, thus leading to poorer performancthe presence of multiple interferers.

In this example the FMV and Frost algorithms shononideal effects when target and interference signals overHowever, both algorithms effectively attenuate multipnonoverlapping statistically stationary interferers using snals from two sensors.

C. Simulation example 2: Multiple speech interferers

This example uses statistically nonstationary interfence sources~speech! to compare the speed and accuracyconvergence of the FMV, Frost, and GSC algorithms.on-axis speech source and two off-axis interfering spesources~at 645°! are simulated. Figures 7~a! and ~b! show

FIG. 7. ~A! Spectrogram of target speech signal in example 2~‘‘He killedthe dragon with his sword.’’! before processing.~B! Spectrogram of com-bined target and two interfering speech signals in example 2, beforecessing. The target signal is at 0° azimuth and the interference sources645°. ~C!, ~D!, ~E! Spectrogram of example 2 after processing with tFMV ~C!, Frost~D!, and GSC~E! algorithms.

Lockwood et al.: Performance of beamformers

Page 11: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

nsinc

gs.hent

e

Ramin

e

ec-m’s

and

d

o-en

go-ults

ise-nly

T.-or

the spectrograms of the target and target plus interferesignals, respectively. The SNR gains produced by proceswith the FMV, Frost, and GSC algorithms are 9.14, 5.41, a4.30 dB, respectively, indicating that the FMV most effe

FIG. 8. ~A!, ~B!, ~C! Magnitude~in decibels! of the beam patterns for theFMV ~A!, Frost~B!, and GSC~C! algorithms for the 800-Hz FFT bin. Nullsnear645° correspond to cancellation of interfering speech sources at645°.~D!, ~E!, ~F! Magnitude~in decibels! of the beam patterns att50.6875 s forthe FMV ~D!, Frost~E!, and GSC~F! algorithms.

J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

cengd-

tively reduced the interference. This is supported by Fi7~c!–~e!, which show the spectrograms of the outputs of tFMV, Frost, and GSC algorithms, respectively. It is evidethat the FMV output@Fig. 7~c!# more closely resembles thtarget signal@Fig. 7~a!#.

The better performance of the FMV in terms of the SNgain may be explained by examining the time-varying bepatterns for the FMV, Frost, and GSC algorithms, shownFigs. 8~a!, ~b!, and~c!, respectively. The nulls placed by thFMV converge consistently and quickly to645°. In contrast,the Frost algorithm’s nulls rarely converge to exact the dirtions of the interference sources, and the GSC algorithconvergence is even less precise. Figures 8~d!, ~e!, and ~f!show instantaneous beam patterns for the FMV, Frost,GSC algorithms. At this point in time, the FMV algorithmplaced nulls closer to the directions of the interferers~at645°!, especially for the245° source between 2300 an5000 Hz.

This example illustrates the typical behavior of the algrithms for signals containing multiple speech sources. Whcompared to the Frost and GSC algorithms, the FMV alrithm exhibits faster, more accurate adaptation. This resin better interference cancellation.

D. Computational complexity

The computational complexity of the FMV algorithmderived after some simplifications of the computation. Timdomain microphone data are real-valued, so weights for oN/211 frequency bins are calculated~whereN is the numberof FFT bins!, and the remainingN/221 bins are obtained byutilizing the conjugate-symmetry property of the FFBecause cross-correlation terms inR are conjugate symmetric, only three of the four terms need to be calculated. Fan n-microphone system, weights for onlyn21 channels

TABLE IV. Computational expense for various parts of the FMV algorithm.

Part ofalgorithm Complex adds

Complexmultiplies

Complexdivisions

Realmultiplies Time

Windowinga 0 0 0 n•NL

fs

FFTs ~n11!•N

2log2 N ~n11!•

N

4log2 N 0 0

L

fs

Correlation ~n21n!SN2 11D2

~n21n!SN2 11D2

0 0L

fs

Weightapplication ~n21!SN2 11D nSN2 11D 0 0

L

fs

Matrix inversionn3

3

n3

30 0

BL

fs

Weightcalculation~notincluding matrixinversion!

~n21n22!SN2 11D ~n21n!SN2 11D ~n21!SN2 11D 0BL

fs

aFor example, the windowing operation requiresn•N real multiplies everyL/ f s seconds, whereL is the numberof samples between FFTs, andf s is the sampling rate in Hz.

389Lockwood et al.: Performance of beamformers

Page 12: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

n

l,lt

pu

go-in

mthmdr

a-re

s

goeAsmro-veo

bgo

tollinMVt oBai

ereof

er-ain

ide-–estr-estr-in

reheheerero-buthisorm-lotsbysts

actnot

m,ghnse

rve

al-

un-s indi-

theigh

nceto-ra-ms

need be calculated using Eq.~4!, because Eq.~3b! maybe utilized to obtain weights for thenth channel with lesscomputation.

The computational complexity of the FMV is a functioof f s , the sampling frequency,n, the number of sensors,B,the block length~the number of FFTs, for a single channethat are calculated between each computation of new fiweights!, N, the number of points in the FFT, andL, thenumber of output samples obtained from each IFFT.~It isassumed that there is no overlap of groups of outsamples.! Assumptions made for this analysis are:~1! aradix-2-real-valued FFT is calculated everyL samples;~2!calculating a matrix inverse requiresn3/3 complex multipli-cations andn3/3 complex additions; and~3! the secondweight is obtained using Eq.~3b!.

Table IV summarizes the costs of each part of the alrithm in terms of complex additions, complex multiplications, complex divisions, real multiplications, and the timewhich each must be completed. These cost estimates doinclude the cost of overhead within a computer prograsuch as retrieving data from memory; they reflect onlymathematical operations required to execute the algorith

Assuming that two real additions for each complex adition, four real multiplications and two real additions foeach complex multiplication, and two complex multiplictions and two real divisions for each complex division arequired, the total number of operations per second~OPS!required for the algorithm is

OPS5f s

L FNS n15

2~n11!log2 ND1~N12!S 4

3Bn3

1S 214

BDn21S 2014

BDn21522

BD G . ~12!

Additionally, the cost of the Frost and GSC algorithmare, respectively,

OPSF5 f s•~2KF14KF•n!, ~13!

OPSGSC5 f s•~n214KGSC•~n21!!. ~14!

For the best parameter sets, the FMV and P–K alrithms require 178 million OPS, the Frost algorithm requir88 million OPS, and the GSC requires 36 million OPS.implemented, the FMV requires about five times more coputation than the GSC and about twice as much as the FBy increasingL from 16 to 32 in the FMV and P–K algorithms, the cost is roughly cut in half. In practice, we hafound that by increasing the interval between calculationFFTs, the computational cost of the FMV algorithm canmade approximately equal to that of the GSC or Frost alrithms with only a small decrease in performance.

The P–K algorithm implementation was very similarthat of the FMV, because it also performs FFTs periodicaand uses correlation matrices and frequency-domain filterIts cost was considered to be comparable to that of the Falgorithm. Thus, with similar computational cost, the FMand P–K algorithms provide performance superior to thathe Frost and GSC algorithms in terms of SNR gain.comparison, the cost of the direct inversion of a time-dom

390 J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

er

t

-

not,

e.-

-s

-st.

fe-

yg.V

fyn

correlation matrix~as per Capon@1969#! of dimension 401~the length of the Frost and GSC filters!, done every 32samples~as with the calculation of the FMV weights!, isapproximately 15 billion OPS.

VI. CONCLUSION

Test signals containing multiple speech sources wcreated from recordings made with three different typestwo-microphone arrays in three rooms with varying revberation times. The performance of a frequency-domminimum-variance distortionless-response~FMV! beam-former, the Frost adaptive beamformer, the generalized slobe canceler~GSC!, and an implementation of the PeissigKollmeier ~P–K! binaural algorithm were evaluated. ThFMV and P–K algorithms outperform the time-domain Froand GSC algorithms in terms of output SNR. A pair of cadioid microphones yields the best performance and lowtarget distortion. A pair of omnidirectional microphones peforms almost as well, followed by microphones mountedeach of the ear canals of a KEMAR.

The performances of the FMV and P–K algorithms avery similar in terms of the SNR of the output signal, but tP–K algorithm generates significantly more distortion of ttarget signal. In the tests conducted in this study, filters wused to match approximately the response of the micphones for sounds received from the target direction,some inherent error is still present in the test signals. Terror means that even the distortionless response beamfers could distort the target signal, as is evidenced in the pof the target distortion.~This issue has been addressedElledge @2000#.! The FMV generally produces the smalleamount of distortion, while the Frost and P–K algorithmgenerally produce the largest distortion figures. The impof the distortion on speech perception and quality hasbeen determined with human listeners.

Informal observations indicate that the P–K algorithas implemented here, produces highly intelligible, thoudistorted output. This suggests that if distortionless respois a requirement~such as in a high-fidelity hearing aid!, thenthe FMV algorithm is preferred, but if maximal intelligibilityis the principal goal, then the P–K algorithm~as imple-mented here! is also promising. A hybridization of the FMVand P–K algorithms may prove beneficial if it can presesome of the benefits of each approach.

The two-channel FMV has been implemented in a retime system@Elledge et al., 2000#. Preliminary tests showthat the real-time system helps normal-hearing listenersderstand an on-axis talker amidst many interfering sourcecontrolled environments as compared to conditions withchotic unprocessed signals and diotic signals~channelssummed! @Larsenet al., 2001; Feng, 2002#. Informal testsindicate that the FMV improves intelligibility in rooms thavary widely in size and reverberation characteristics. Toutput of the real-time system sounds natural and is of hfidelity.

This study uses recordings that were made at a distaof 75 cm from the loudspeaker, where the direct-reverberant ratio is relatively high. Even though reverbetion was not dominant, the performance of all the algorith

Lockwood et al.: Performance of beamformers

Page 13: Performance of time- and frequency-domain …Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms Michael E. Lockwood,a) Douglas

–Kfoe

anrar-

thdaocecrc

Nais

a-e

nfono

C.J

-

ly-

ss

in

L.

2Xe

:

y

-

gs.

Soc.

istic

,’’ J.

-st.

B.

or-

ded

ent

.,

.,

tiple

ine-

S.,er,-

ired

-

s.

ll-on

g,’’

is compromised to varying degrees, with the FMV and Palgorithms showing the smallest degradation. Our goalthe present study was to determine the relative effectivenof these algorithms in rooms having different reverbercharacteristics, but not to quantify the effect of the reverbetion per se. Future work is required to evaluate the perfomance of the algorithms with microphones farther fromsources where the direct-to-reverberant ratio is lower. Adtionally, the 7.6-cm-diameter loudspeakers in the arrayquite directional at high frequencies due to the directivitythe driver, thereby reducing the amount of high-frequenreverberation. In the future, loudspeakers with less dirtional responses should be used in tests that involve souat greater distances.

ACKNOWLEDGMENTS

This research was supported by a grant from thetional Institute on Deafness and Other Communication Dorders~R21DC-04840! of the National Institutes of Healthand by grants from the University of Illinois at UrbanChampaign~the Beckman Institute, the Mary Jane Neer Rsearch Fund, and the Charles M. Goodenberger Fund!. Wethank Chen Liu and Kong Yang for their earlier work oalgorithm simulations, Mark Elledge and Jeffrey Larsentheir participation in the group, and the many staff who geerously donated their time to participate in the recordingthe speech materials used in our tests.

Berghe, J. V., and Wouters, J.~1998!. ‘‘An adaptive noise canceller forhearing aids using two nearby microphones,’’ J. Acoust. Soc. Am.103,3621–3626.

Bilger, R. C., Nuetzel, J. M., Rabinowitz, W. M., and Rzeczkowski,~1984!. ‘‘Standardization of a test of speech perception in noise,’’Speech Hear. Res.27, 32–48.

Brandstein, M., and Ward, D.~2001!. Microphone Arrays: Signal Processing Techniques and Applications~Springer, Berlin!.

Capon, J.~1969!. ‘‘High-resolution frequency-wavenumber spectrum anasis,’’ Proc. IEEE57~8!, 1408–1419.

Cox, H., Zeskind, R. M., and Kooij, T.~1986!. ‘‘Practical supergain,’’ IEEETrans. Acoust., Speech, Signal Process.ASSP-34~3!, 393–398.

Cox, H., Zeskind, R. M., and Owen, M. M.~1987!. ‘‘Robust adaptive beam-forming,’’ IEEE Trans. Acoust., Speech, Signal Process.ASSP-35~10!,1365–1376.

Deslodge, J. G.~1998!. ‘‘The location-estimating null-steering~LENS! al-gorithm for adaptive microphone-array processing,’’ Ph.D. thesis, Machusetts Institute of Technology, Cambridge, MA.

Elledge, M. E.~2000!. ‘‘Real-time implementation of a frequency-domaarray processor,’’ M.S. thesis, University of Illinois, Urbana, IL.

Elledge, M. E., Lockwood, M. E., Bilger, R. C., Feng, A. S., Jones, D.Lansing, C. R., O’Brien, W. D., and Wheeler, B. C.~2000!. ‘‘Real-timeimplementation of a frequency-domain beamformer on the TI C6EVM,’’ presented in 10th Annual DSP Technology Education and Rsearch Conference Houston, TX, 2–4 August.

Feng, A. S.~2002!. ‘‘Biologically inspired binaural hearing aid algorithmsDesign principles and effectiveness,’’ J. Acoust. Soc. Am.111, 2354.

Frost, O. L.~1972!. ‘‘An algorithm for linearly constrained adaptive arraprocessing,’’ Proc. IEEE60~8!, 926–935.

Golub, G., and Van Loan, C.~1996!, Matrix Computations, 3rd ed.~TheJohns Hopkins University Press Ltd., London!.

Greenberg, J. E., and Zurek, P. M.~1992!. ‘‘Evaluation of an adaptive beamforming method for hearing aids,’’ J. Acoust. Soc. Am.91, 1662–1676.

J. Acoust. Soc. Am., Vol. 115, No. 1, January 2004

rsst-

ei-refy-es

--

-

r-f

.

a-

,

-

Greenberg, J. E.~1998!. ‘‘Modified LMS algorithms for speech processinwith an adaptive noise canceller,’’ IEEE Trans. Speech Audio Proces6,338–351.

Greenberg, J. E., Deslodge, J. G., and Zurek, P. M.~2003!. ‘‘Evaluation ofarray-processing algorithms for a headband hearing aid,’’ J. Acoust.Am. 113, 1646–1657.

Griffiths, L. J., and Jim, C. W.~1982!. ‘‘An alternative approach to linearlyconstrained adaptive beamforming,’’ IEEE Trans. Antennas Propag.AP-30~1!, 27–34.

Hoffman, M. W., Trine, T. D., Buckley, K. M., and Van Tasell, D. J.~1994!.‘‘Robust adaptive microphone array processing for hearing aids: Realspeech enhancement,’’ J. Acoust. Soc. Am.96, 759–770.

Joho, M., and Moschytz, G. S.~2000!. ‘‘Connecting partitioned frequency-domain filters in parallel or in cascade,’’ IEEE Trans. Circuits Syst.47,685–698.

Kates, J. M., and Weiss, M. R.~1996!. ‘‘A comparison of hearing-aid array-processing techniques,’’ J. Acoust. Soc. Am.99, 3138–3148.

Kollmeier, B. ~1997!. Psychoacoustics, Speech and Hearing Aids~WorldScientific, Singapore!.

Kollmeier, B., Peissig, J., and Hohmann, V.~1993!. ‘‘Real-time multibanddynamic compression and noise reduction for binaural hearing aidsRehabil. Res. Dev.30~1!, 82–94.

Kompis, M., and Dillier, N.~1994!. ‘‘Noise reduction for hearing aids: Combining directional microphones with an adaptive beamformer,’’ J. AcouSoc. Am.96, 1910–1913.

Larsen, J. B., Lockwood, M. E., Lansing, C. R., Bilger, R. C., Wheeler,C., O’Brien, Jr., W. D., Jones, D. L., and Feng, A. S.~2001!. ‘‘Performanceof a frequency-based minimum-variance beamforming algorithm for nmal and hearing-impaired listeners,’’ J. Acoust. Soc. Am.109, 2494.

Liu, C., Feng, A. S., Wheeler, B. C., O’Brien, W. D., Bilger, R. C., anLansing, C. R.~1997!. ‘‘Speech enhancement via directional hearing bason the place theory,’’ 2nd Biennial Hearing Aid Research & DevelopmConference, Washington, DC, Sept. 1997, 58A~II-P-29!.

Liu, C., Wheeler, B. C., O’Brien, Jr., W. D., Bilger, R. C., Lansing, C. Rand Feng, A. S.~2000!. ‘‘Localization of multiple sound sources with twomicrophones,’’ J. Acoust. Soc. Am.108, 1888–1905.

Liu, C., Wheeler, B. C., O’Brien, Jr., W. D., Lansing, C. R., Bilger, R. CJones, D. L., and Feng, A. S.~2001!. ‘‘A two-microphone dual delay-lineapproach for extraction of a speech sound in the presence of mulinterferers,’’ J. Acoust. Soc. Am.110, 3218–3231.

Lockwood, M. E.~1999!. ‘‘Development and testing of a frequency-domaminimum-variance algorithm for use in a binaural hearing aid,’’ M.S. thsis, University of Illinois, Urbana, IL.

Lockwood, M. E., Jones, D. L., Bilger, R. C., Elledge, M. E., Feng, A.Goueygou, M., Lansing, C. R., Liu, C., O’Brien, Jr., W. D., and WheelB. C. ~1999!. ‘‘A minimum-variance frequency domain algorithm for binaural hearing aid processing,’’ J. Acoust. Soc. Am.106, 2278.

McDonough, R. N. ~1979!. ‘‘Application of the maximum-likelihoodmethod and the maximum-entropy method to array processing,’’ inNon-linear Methods of Spectral Analysis: Topics in Applied Physics, edited byS. Haykin~Springer, New York!, pp. 181–243.

Peissig, J., and Kollmeier, B.~1997!. ‘‘Directivity of binaural noise reduc-tion in spatial multiple noise-source arrangements for normal and impalisteners,’’ J. Acoust. Soc. Am.101, 1660–1670.

Slyh, R. E., and Moses, R. L.~1993!. ‘‘Microphone array speech enhancement in overdetermined signal scenarios,’’ ICASSP-93,2, 347–350.

Van Veen, B. D., and Buckley, K. M.~1988!. ‘‘Beamforming: A versatileapproach to spatial filtering,’’ IEEE ASSP Magazine, April, 4–24.

Welker, D. P., and Greenberg, J. E.~1997!. ‘‘Microphone-array hearing aidswith binaural output II. A two-microphone adaptive system,’’ IEEE TranSpeech Audio Process.5~6!, 543–551.

Wittkop, T., Albani, S., Hohmann, V., Peissig, J., Woods, W. S., and Komeier, B. ~1997!. ‘‘Speech processing for hearing aids: Noise reductimotivated by models of binaural interaction,’’ Acustica83, 684–699.

Yang, K. L., Lockwood, M. E., Elledge, M. E. and Jones, D. L.~2000!. ‘‘Acomparison of beamforming algorithms for binaural acoustic processinProc. 9th IEEE Digital Signal Processing Workshop, Hunt, TX.

391Lockwood et al.: Performance of beamformers