Application of Acoustic Echo Cancellation

Master Thesis IMIT/LECS/2007-63

Implementation of

Acoustic Echo Cancellation

For PC Applications

Using MATLAB

Master of Science Thesis

In System on Chip Design

by

Lu Lu

Stockholm, 05/2007

Supervisor: Temujin Gautama (NXP Software Leuven Belgium)

Examiner: Axel Jantsch (ICT/KTH Stockholm Sweden)

Abstract

Communication technology has changed considerably in recent years. Today people are increasingly interested in hands-free communication using a loudspeaker and a microphone instead of a conventional telephone handset. However, strong acoustic coupling between the loudspeaker and the microphone produces a loud echo that makes conversation difficult. The remedy is to eliminate the echo with an echo cancellation or echo suppression algorithm. However, traditional methods are not sufficient.

The objective of this thesis is to find a good echo removal algorithm that is capable of providing convincing results for PC applications. The basic components of an echo canceller are an adaptive filter and a double-talk detector. The adaptive filter estimates the echo path; based on this estimate, a replica of the echo is created and subtracted from the combination of the actual echo and the near-end speech signal. Double talk occurs when both ends are talking. The task of a double-talk detector is to sense the double talk and to stop the adaptation of the filter in order to avoid divergence. Since there has been a revolution in the field of personal computers in recent years, this work attempts to implement the acoustic echo canceller algorithm on a PC with the help of the MATLAB software.

Acknowledgement

First, I would like to thank my supervisor Temujin Gautama from NXP Software (Leuven, Belgium) for his patient help and friendly support throughout my work. Without him, this project would not have existed. His advice and constant guidance have helped me through many difficult situations.

Special thanks go to my favorite professor, Luc Bienstman of GroupT Engineering School (Leuven, Belgium), who generously made a great effort to help make this thesis project possible and, as always, gave me continuous encouragement and taught me how to regain confidence when I doubted myself.

Besides my advisers, I would also like to thank all my friends here in Leuven, who are always so supportive and never let me feel lonely.

Table of Contents:

CONCLUSION

CHAPTER I: INTRODUCTION
1.1 Need for Echo Cancellation
1.2 Basics of Echo Cancellation
1.2.1 System overview
1.2.2 Adaptive Filter
1.2.3 Double talk Detector
1.3 Measures of Performance
1.3.1 Echo Return Loss Enhancement (ERLE)
1.3.2 Near-end Attenuation (NEA)
1.4 Thesis Organization

CHAPTER II: ECHO CANCELLATION ALGORITHMS
2.1 Acoustic Echo Canceller
2.2 Acoustic Echo Suppressor
2.2.1 Noise Suppression with Spectral Subtraction
2.2.2 Acoustic Echo Suppression with Spectral Subtraction
2.2.3 Overlapping-windowed FFT
2.2.4 Comparison of AEC and AES
2.3 Adaptive filters
2.3.1 Wiener Filter
2.3.2 Least Mean Square Algorithm (LMS)
2.3.3 Normalized Least Mean Square Algorithm (NLMS)
2.3.4 Problem with NLMS algorithm
2.3.5 A Simplified Echo Path Model
2.4 Double Talk Detector
2.4.1 The Generic Doubletalk Detection Scheme
2.4.2 Geigel DTD
2.4.3 Normalized Cross-correlation (NCR DTD)
2.4.4 Variable Impulse Response (VIRE DTD)
2.4.5 Double talk detection performance evaluation

CHAPTER III: OTHER ISSUES
3.1 Room Impulse Response
3.1.1 Measure the Testing Room acoustics
3.2 Measure of the delay between the loudspeaker and the microphone
3.3 Noise issues
3.3.1 Typing noise cancellation based on cross-correlation
3.3.2 High Pass Filtering

CHAPTER IV: EVALUATION
4.1 Requirements of AEC
4.2 Speech Stimuli
4.3 Acoustic Echo Canceller
4.4 Acoustic Echo Suppressor Based on Spectral Subtraction
4.4.1 NLMS-Based AES
4.4.2 Coloration Effect Filter-Based AES
4.5 DTD Performance Evaluation
4.5.1 Geigel DTD
4.5.2 NCR DTD
4.5.3 VIRE DTD
4.6 AES with different DTD Algorithms
4.6.1 The Influence of the NES and the FES
4.6.2 The Noise Performance

CHAPTER V: CONCLUSION AND FURTHER WORK
5.1 Summary and Conclusion
5.2 Further Works

LIST OF ACRONYMS
LIST OF FIGURES
REFERENCES
APPENDIX

CONCLUSION

This thesis deals with the implementation of acoustic echo cancellation algorithms and their analysis based on simulations in MATLAB. It focuses on the Normalized Least Mean Square (NLMS) algorithm and on the method recently proposed by Christof Faller et al., which uses a simplified echo path model based on a frequency-domain coloration effect filter.

As an important part of successful acoustic echo cancellation, several double-talk detection methods are also studied and analyzed, including the Geigel algorithm, the Normalized Cross-correlation method (NCR DTD) and the Variable Impulse Response Double Talk Detector (VIRE DTD). Some possible further work is discussed at the end of this thesis.

Key words: AEC, AES, NLMS, Coloration-effect filter (CF), DTD, Geigel, NCR, VIRE,

SER, SNR, ERLE, NEA

CHAPTER I

INTRODUCTION

Echo is a delayed and distorted version of an original sound or electrical signal which is reflected back to the source. If a reflected wave arrives very shortly after the direct sound, it is perceived as spectral distortion or reverberation. However, when the reflected wave arrives a few tens of milliseconds after the direct sound, it is heard as a distinct echo. In data communication, echo can cause significant transmission errors. In applications such as hands-free telecommunication, echo can, in extreme conditions, make conversation impossible. Echo has long been a major issue in communication networks. Hence this thesis is devoted to the investigation and development of an effective way to control acoustic echo in hands-free communication.

This chapter gives a general review of the basic techniques in AEC such as the echo

cancellation structure, the adaptive filter, double-talk detector and performance measures.

Section 1.1 addresses the causes of echo and the echo cancellation environment. Section

1.2 details the basics of an acoustic echo removal system and the system structure of the

echo cancellation process. Section 1.3 introduces the two important measures when

evaluating the echo removal performance. Finally, the organization of the thesis is

described.

1.1 Need for Echo Cancellation

There are two types of echo in telecommunication networks, namely electrical echo and acoustic echo. Electrical echo is due to impedance mismatch at various points along the transmission medium. It can be found in the public switched telephone network (PSTN), in mobile and in IP phone systems. Electrical echo is created at the hybrid connections located at the two-wire / four-wire PSTN conversion points, as shown in Figure 1.1. It is outside the scope of this thesis.

Figure 1.1: Hybrid Connections and the Resulting Electric Echo

However, the development of hands-free communication systems gave rise to another kind of echo, known as acoustic echo. The sound travels from the loudspeaker to the microphone, either through vibrations of the circuit or through the open air, and generates an echo. Examples of such systems are mobile phones, VoIP calls using, for instance, Skype, and teleconferencing for meetings or remote education. Hands-free operation has gained more and more popularity in recent years, and it is the situation addressed in this thesis. The basic setup of a typical hands-free system is shown in Figure 1.2:

Fig.1.2 Basic setup of a hands-free communication system

Each side of the communication process is called an 'end'. The end remote from the talker under consideration is called the far end (FE), and the near end (NE) refers to the end being measured. The acoustic echo is due to the coupling between the loudspeaker and the microphone. The speech of the far-end speaker is sent to the loudspeaker at the near end, is reflected from the floor, walls and other neighboring objects, and is then picked up by the near-end microphone and transmitted back to the far-end speaker, yielding an echo, as illustrated in Figure 1.3.

Fig 1.3: Generation of acoustic echo through direct coupling and reverberations

Acoustic echo can severely reduce conversation quality. Adaptive cancellation of such

acoustic echoes has become very important in hands-free communication systems.

However, not all echoes reduce voice quality. In order for telephone conversations to

sound natural, callers must be able to hear themselves speaking. For this reason, a short

instantaneous echo, termed side tone, is deliberately inserted. The side tone is coupled

with the caller’s speech from the telephone mouthpiece to the earpiece so that the line

sounds connected, and it also allows the speaker to adjust his/her own speaking level.

Nevertheless, the necessity of the side tone in mobile phones has frequently been questioned, because side tone poses more problems in the outdoor environment in which mobile phones are used. If you sneeze or blow into your microphone, or when there is wind noise, you hear it loud and clear. Hence nowadays the side tone in mobile phones is either eliminated or made adjustable by the user.

1.2 Basics of Echo Cancellation

As stated in the previous section, undesired echoes need to be removed during telecommunication. This section therefore begins the investigation of echo removal methods. Echo can be either cancelled in the time domain or suppressed in the frequency domain. The system schematic of acoustic echo cancellation, which is the basic structure for all echo removal methods, is introduced first. Then the two major components of the echo cancelling process, the adaptive filter and the double-talk detector, are briefly reviewed.

1.2.1 System Overview

Since the original signal sent to the loudspeaker is known, it can be used to predict the echo picked up by the microphone and to remove it from the microphone signal. This process is called Acoustic Echo Cancellation.

Schematically, an AEC system can be described as in Figure 1.4. The remote speaker signal, referred to as the far-end signal and denoted $x(t)$, passes through the room acoustic filter $h$, producing an acoustic echo $y(t)$. Disregarding the surrounding noise, the microphone receives the near-end speech signal $v(t)$ together with the echo; the received signal $z(t)$ thus consists of both $v(t)$ and $y(t)$:

$$z(t) = y(t) + v(t) = f(x, h) + v(t)$$

where $h$ is the room acoustic filter.

The task of the AEC is to model the room acoustic path as well as possible with an adaptive filter $\hat{h}$ and to remove the echo from the measured signal, yielding a residual signal $e(t)$ that ideally consists only of the near-end speech. The filter must be adaptive because the echo path in the room is most likely time-varying, for example when objects move or when the loudspeaker or the microphone is moved from one place to another. To capture the full complexity of an acoustic echo path, the filter would in principle need to be infinitely long, but a large filter order brings a high computational load. There is therefore a trade-off between the complexity and the performance of the AEC.

The residual signal,

$$e(t) = z(t) - \hat{y}(t) = z(t) - \hat{h}^T x(t),$$

should consist only of the near-end signal. This is the case when the adaptive filter is close to the echo path: if $\hat{y}(t) \approx y(t)$, then $e(t) \approx v(t)$.

Fig. 1.4 General schematic of Acoustic Echo Cancellation

The adaptive filter uses the residual signal $e(t)$ as the error estimate with which it updates its coefficients, but this is valid only when there is no near-end speech. When near-end speech is present, the estimated error is incorrect and will distort the filter or even cause it to diverge, so it is important to determine whether near-end speech is present or not. Hence an acoustic echo canceller normally comprises the adaptive filter, a double-talk detector that detects whether near-end speech is present, and possibly a nonlinear processor that eliminates the residual echo. Each of them is discussed later in this thesis.

1.2.2 Adaptive Filtering

There are two main types of digital filters: Finite Impulse Response (FIR) and Infinite Impulse Response (IIR). An IIR filter can normally achieve performance similar to that of an FIR filter with fewer coefficients and less computation. However, as the complexity of the filter grows, the order of the IIR filter increases considerably and its computational advantage becomes less pronounced. IIR filters can also become unstable. The filters used in AEC are therefore usually of the FIR type.

The adaptive filter is the critical part of the AEC: it estimates the echo path of the room in order to obtain a replica of the echo signal. It must be updated continuously to adapt to changes in the environment, for example people moving in the room. An important property of the adaptive filter is its convergence speed, which measures how fast the filter converges to the best estimate of the room acoustic path.

Many adaptive filters have been derived and employed for AEC. This thesis mainly studies the standard Normalized Least Mean Square (NLMS) algorithm, which is old, has a low computational complexity and has been shown to work well compared with many newer methods. The recently proposed simplified echo path model using a frequency-domain coloration filter is also studied and analyzed.

1.2.3 Double Talk Detection

One of the most difficult issues in AEC is to know when the filter should stop or slow down its adaptation. As discussed in the previous section, it is important to know whether near-end speech is present while the far-end signal is active. The situation in which both the near end and the far end are active is referred to as double talk. If double talk occurs, the error signal $e(t)$ contains not only the echo estimation error but also the near-end signal. If this signal is used to update the filter coefficients, the filter might create an artificial echo and even diverge. Detecting double talk is thus a vital yet difficult job, and it is the task of the double-talk detector.

There is a variety of double-talk detection methods. This thesis considers three well-known ones: the Geigel algorithm, which is quite simple, the Normalized Cross-correlation method (NCR) and the Variance of Impulse Response algorithm (VIRE). All of them are implemented and compared later in this thesis.

1.3 Measures of Performance

To examine the performance of the echo removal algorithms in a standard way, some performance measures are required. The most important task of an AEC (or AES) is to suppress the echo, so it is necessary to know by how much the echo is reduced. During double talk, the cancellation or suppression process affects the near-end signal as well as the undesired echo, so the amount of near-end attenuation is also of interest.

1.3.1 Echo Return Loss Enhancement (ERLE)

Echo Return Loss Enhancement (ERLE) is the most important measure of how much the echo is suppressed by the acoustic echo cancellation. It is defined as the ratio, in dB, of the power of the original echo to the power of the residual echo after cancellation:

$$\mathrm{ERLE} = 10 \log_{10}\left(\frac{\sigma_z^2}{\sigma_r^2}\right),$$

where $\sigma_z^2$ is the power of the microphone signal and $\sigma_r^2$ is the power of the residual echo.

For a precise measurement, ERLE should be computed over a portion of the signal that contains only echo and no near-end signal, so that the microphone power equals the echo power. The higher the ERLE, the better the AEC works.
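As a rough illustration of the ERLE definition, the following Python sketch computes it for a synthetic far-end-only segment. The thesis itself works in MATLAB; Python is used here only for illustration, and the helper names `power` and `erle_db` as well as the test signals are assumptions, not part of the original work.

```python
import math

def power(signal):
    """Mean power of a sequence: the average of its squared samples."""
    return sum(s * s for s in signal) / len(signal)

def erle_db(mic, residual):
    """ERLE in dB: 10*log10(power of microphone signal / power of residual echo)."""
    return 10.0 * math.log10(power(mic) / power(residual))

# Hypothetical far-end-only segment: the microphone signal is pure echo,
# and the canceller is assumed to leave 10 % of its amplitude as residual.
mic = [math.sin(0.1 * n) for n in range(1000)]
residual = [0.1 * z for z in mic]

print(round(erle_db(mic, residual), 1))  # 20.0
```

An amplitude reduction by a factor of 10 corresponds to a power reduction by a factor of 100, hence 20 dB.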

1.3.2 Near-end Attenuation (NEA)

Near-end attenuation (NEA) is a measure of how much, in dB, the near-end signal is suppressed by the cancellation process during double talk. It is defined as the ratio of the power of the near-end signal after suppression to its power before suppression during double talk:

$$\mathrm{NEA} = 10 \log_{10}\left(\frac{\sigma_{aft}^2}{\sigma_{bef}^2}\right),$$

where $\sigma_{aft}^2$ is the power of the near-end speech during DT in the residual signal and $\sigma_{bef}^2$ is the power of the near-end signal during DT in the microphone signal.

To make the NEA calculation practical, the signals recorded for this thesis consist of three segments: far-end single talk, double talk and near-end single talk. The ERLE is calculated on the far-end-only segment. To calculate the NEA, a second, synthetic signal is derived from the recorded microphone signal by subtracting twice the near-end part, so that it contains a sign-inverted (counter-phase) near-end speech during double talk, as shown in Figure 1.5. Both signals are passed through the AEC, giving the residuals $e_{plus}(t)$ and $e_{minus}(t)$. Half their difference gives the near-end speech during DT after the AEC, and half their sum gives the residual echo:

$$\text{Near\_end\_residual} = \big(e_{plus}(t) - e_{minus}(t)\big)/2$$

$$\text{Far\_end\_residual} = \big(e_{plus}(t) + e_{minus}(t)\big)/2$$

The power of Near_end_residual is used as $\sigma_{aft}^2$.

Evidently low near-end attenuation is desired.
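The sign-inversion procedure can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the echo canceller is replaced by a hypothetical fixed linear gain (`process`), for which the separation into near-end and far-end residuals is exact. With a real adaptive canceller the separation only holds to the extent that both signals are processed identically.

```python
import math

def power(sig):
    """Mean power of a sequence."""
    return sum(s * s for s in sig) / len(sig)

def process(sig, gain=0.5):
    """Stand-in for the echo canceller: a fixed linear gain (an assumption)."""
    return [gain * s for s in sig]

n = 500
echo = [math.sin(0.05 * k) for k in range(n)]         # echo part y(t)
near = [0.8 * math.cos(0.21 * k) for k in range(n)]   # near-end speech v(t)

z_plus = [y + v for y, v in zip(echo, near)]           # recorded microphone signal
z_minus = [z - 2.0 * v for z, v in zip(z_plus, near)]  # near-end part sign-inverted

e_plus, e_minus = process(z_plus), process(z_minus)
near_res = [(a - b) / 2.0 for a, b in zip(e_plus, e_minus)]  # near-end residual
far_res = [(a + b) / 2.0 for a, b in zip(e_plus, e_minus)]   # far-end residual

nea_db = 10.0 * math.log10(power(near_res) / power(near))
print(round(nea_db, 2))  # -6.02
```

With this definition, an NEA close to 0 dB means that the near-end speech passes almost unattenuated; the stand-in gain of 0.5 gives about -6 dB.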

Fig 1.5 Composition of the signals used to calculate the NEA

1.4 Thesis Organization

This thesis focuses on two main issues of acoustic echo cancellation, namely the

adaptation algorithm and the control of adaptation in double talk situation.

Chapter 2 presents the theoretical background. It first reviews and compares the two major ways to achieve echo removal: the acoustic echo canceller (AEC) in the time domain and the acoustic echo suppressor (AES) in the frequency domain. The adaptive filter, which is used to model the acoustic echo path, is the central part of the AEC, and much research effort has been devoted to it. LMS is an old, simple and proven algorithm which has turned out to work well in comparison with newer, more advanced algorithms. In this project the normalized LMS (NLMS) is used for the main filter in the AEC, since NLMS is so far the most popular algorithm in practice owing to its computational simplicity. For frequency-domain adaptive filtering in the AES, the recently introduced simplified echo path method with a frequency-domain coloration effect filter is studied. After that, the generic double-talk detection scheme is outlined and several well-known double-talk detectors are discussed. The Geigel algorithm is simple and works well when the far-end signal is sufficiently smaller than the near-end speech; in other words, it relies on an assumption about the echo path, so in practice it is not widely applied in echo cancellation algorithms. The Normalized Cross-correlation method uses the correlation between the far-end signal and the near-end signal, and is also normalized, which gives more promising results than the Geigel algorithm. The Variance of Impulse Response algorithm is based on the fact that the presence of near-end speech causes dramatic variations in the filter taps, and it can give good results. However, it is more sensitive to the microphone signal than the NCR. Finally, the measures and the receiver operating characteristic (ROC) curve used to evaluate the DTD are introduced.

Chapter 3 discusses other issues that arise during the echo cancellation process. First, since the adaptive filter tries to mimic the room acoustics, it is interesting to measure a room impulse response in order to get a general idea of what it looks like. Second, adaptive filtering algorithms normally require synchronization between the far-end and near-end signals, but there is always a delay from the loudspeaker to the microphone because the sound wave needs time to propagate. To estimate this delay, a method based on cross-correlation is adopted. Noise is also a big issue when the quality of the microphone is poor, as is the case for the internal microphones of laptops. The noise includes the hard-disk and fan noise from the laptop itself, the typing noise from the near-end user, and the environmental noise. The typing noise is probably the most annoying one, since the keyboard is always close to the internal microphone in a laptop. A high-pass filter attenuates most of the noise well, because most of the noise is concentrated at low frequencies. The nonlinear processor, a possible part of an AEC, is introduced in general terms at the end of this chapter.

Chapter 4 is devoted to the evaluation of all the algorithms discussed above. Through a series of recordings and simulations in MATLAB, we try to find out which adaptive filtering and double-talk detection algorithms are better suited to PC applications.

Chapter 5 draws the conclusions and presents possible future work.

CHAPTER II

ECHO CANCELLATION ALGORITHMS

In this chapter, the theoretical background for echo cancellation is reviewed. There are two common ways to remove acoustic echo. The basic method is the traditional Acoustic Echo Canceller (AEC), which was discussed schematically in section 1.2 and is covered again in section 2.1. The other is the Acoustic Echo Suppressor (AES). The AES for telephony applications is usually half-duplex: after comparing the signal strength of both ends, it completely shuts off the speech from the direction with the lower power. It is simple but not effective, and full-duplex communication is more comfortable for real-time conversations. Another AES method, derived from noise suppression based on spectral subtraction, makes full-duplex operation possible and is introduced in section 2.2. In section 2.3, adaptive filtering algorithms are presented in detail, including the LMS and NLMS filters, which are derived from the Wiener optimal filter, and the coloration effect filter in the frequency domain. Different double-talk detection (DTD) methods, together with the measures used to evaluate DTD performance, are discussed in section 2.4.

2.1 Acoustic Echo Canceller

The traditional solution to the acoustic echo problem is the acoustic echo canceller (AEC). An acoustic echo canceller removes the echo by modeling the echo path impulse response with an adaptive filter and subtracting the resulting echo estimate from the microphone signal.

The acoustic echo path is assumed to be a linear filter $h = [h_1, h_2, h_3, \ldots, h_L]^T$, where $L$ is the length of the echo path and $(\cdot)^T$ denotes the transpose of a matrix or a vector. The microphone signal is then expressed as:

$$z(k) = h^T x(k) + v(k) + n(k)$$

where $x(k) = [x(k-L+1), x(k-L+2), \ldots, x(k)]^T$, so that $h^T x(k)$ is the echo signal, $v(k)$ is the near-end speech and $n(k)$ stands for the ambient noise signal.

A modeling filter $\hat{h} = [\hat{h}_1, \hat{h}_2, \hat{h}_3, \ldots, \hat{h}_L]^T$ is used to approximate the true echo path $h$, where $L$ is the length of the filter. The echo estimate is

$$\hat{y}(k) = \hat{h}^T x(k)$$

Adaptive algorithms are used to search for the optimum $\hat{h}$. Once the adaptive filter has converged, the residual signal is the echo-cancelled outgoing signal.

The echo can be cancelled successfully when the modeling filter approaches the true echo path. In practice, however, the modeling filter often differs from the true echo path for reasons such as loudspeaker nonlinearities, changes in the environment and the lack of knowledge about the length of the echo path, which results in residual echo.
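The signal model of this section can be illustrated with a short Python sketch (the thesis itself uses MATLAB). It builds a microphone signal from a hypothetical three-tap echo path, with the ambient noise set to zero, and shows that the residual equals the near-end speech when the modeling filter matches the echo path, whereas a mismatched filter leaves residual echo. The filter values, the signals and the helper `filter_output` are assumptions chosen for illustration; no adaptation algorithm is involved.

```python
import random

def filter_output(coeffs, x, k):
    """Inner product h^T x(k) with x(k) = [x(k-L+1), ..., x(k)]^T.
    Samples before the start of the signal are taken as zero."""
    L = len(coeffs)
    total = 0.0
    for i in range(L):
        idx = k - L + 1 + i
        if idx >= 0:
            total += coeffs[i] * x[idx]
    return total

random.seed(0)
N = 200
h = [0.1, 0.3, 0.6]          # hypothetical true echo path, L = 3
h_hat = list(h)              # modeling filter assumed to have converged to h
h_bad = [0.0, 0.0, 0.6]      # modeling filter that misses part of the path

x = [random.uniform(-1.0, 1.0) for _ in range(N)]        # far-end signal
v = [0.2 * random.uniform(-1.0, 1.0) for _ in range(N)]  # near-end speech

z = [filter_output(h, x, k) + v[k] for k in range(N)]    # microphone, n(k) = 0
e = [z[k] - filter_output(h_hat, x, k) for k in range(N)]
e_bad = [z[k] - filter_output(h_bad, x, k) for k in range(N)]

max_dev = max(abs(e[k] - v[k]) for k in range(N))
residual_echo_power = sum((e_bad[k] - v[k]) ** 2 for k in range(N)) / N

print(max_dev < 1e-12)            # True: residual equals the near-end speech
print(residual_echo_power > 0.0)  # True: the mismatched filter leaves echo
```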

2.2 Acoustic Echo Suppressor

Unlike the AEC, an acoustic echo suppressor attenuates the echo in the frequency domain and works in a manner similar to the traditional noise suppression algorithm. The AES can achieve results similar to those of the AEC while remaining full duplex.

2.2.1 Noise Suppression with Spectral Subtraction

Noise suppression is introduced here because echo can be treated in a similar way to noise, so a method for noise suppression is also of interest for echo elimination. Various speech enhancement techniques exist for eliminating noise; spectral subtraction is one of them. Spectral subtraction for noise suppression basically means that an estimate $|\hat{N}(f)|$ of the noise magnitude spectrum is subtracted from the instantaneous input magnitude spectrum $|X(f)|$. The noise can also be attenuated by a certain factor. The aim of this process is to obtain an audio signal that contains less noise than the original.

The basic flowchart of spectral subtraction is as follows:

Figure.2.1 Noise suppression with spectral subtraction

The noisy speech basically consists of two parts, the clean speech and the noise:

$$x(t) = s(t) + n(t)$$

After the Fourier transform:

$$X(f) = S(f) + N(f),$$

and the magnitude of the frequency spectrum can be approximately expressed as:

$$|X(f)| \approx |S(f)| + |N(f)|$$

So the magnitude of the clean speech can be calculated by subtracting the estimate of the average noise spectrum:

$$|S(f)| \approx |X(f)| - |\hat{N}(f)|$$

and for the phase of the clean signal, the phase of the noisy speech is adopted:

$$\angle S(f) \approx \angle X(f)$$

Combining the amplitude and phase information gives the estimate of the clean speech spectrum:

$$S_i(f) = X_i(f) \cdot \left(\frac{\max\big(|X_i(f)|^{\alpha} - \beta\,|\hat{N}(f)|^{\alpha},\, 0\big)}{|X_i(f)|^{\alpha}}\right)^{1/\alpha} = G_i(f) \cdot X_i(f)$$

where $\alpha$ and $\beta$ are design parameters that control the performance, and $i$ denotes the $i$th frame, since the frequency-domain calculation relies on the FFT, which is frame-based.

Short-time estimates of $|X_i(f)|$ fluctuate randomly in noise-only frames, resulting in randomly fluctuating gains $G_i(f)$. Statistical analysis shows that, after noise suppression, broadband noise is transformed into a signal composed of short-lived tones with randomly distributed frequencies. This is called musical noise and is heard as a warbling or watery effect on the enhanced speech. These artifacts are due to randomly distributed spectral peaks in the residual noise spectrum. One possible remedy is to overestimate the average noise power in order to lower the peaks, but the original speech signal might then also be distorted.
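To make the role of $\alpha$ and $\beta$ more concrete, the following Python sketch evaluates a spectral subtraction gain per frequency bin. It assumes the gain has the form $G = \big(\max(|X|^{\alpha} - \beta|\hat{N}|^{\alpha}, 0) / |X|^{\alpha}\big)^{1/\alpha}$, a common form of the spectral subtraction rule; the function name `subtraction_gain` and the magnitude values are hypothetical.

```python
def subtraction_gain(x_mag, n_mag, alpha=2.0, beta=1.0):
    """Spectral subtraction gain for one frequency bin (assumed form):
    G = (max(|X|^alpha - beta*|N|^alpha, 0) / |X|^alpha)^(1/alpha)."""
    if x_mag <= 0.0:
        return 0.0  # nothing to scale in an empty bin
    numerator = max(x_mag ** alpha - beta * n_mag ** alpha, 0.0)
    return (numerator / x_mag ** alpha) ** (1.0 / alpha)

# Hypothetical magnitudes of one frame (four bins) and a flat noise estimate.
x_mags = [1.0, 0.5, 0.2, 2.0]
n_est = [0.3, 0.3, 0.3, 0.3]

gains = [subtraction_gain(x, n) for x, n in zip(x_mags, n_est)]
s_mags = [g * x for g, x in zip(gains, x_mags)]  # enhanced magnitudes

for x, g in zip(x_mags, gains):
    print(x, round(g, 3))
```

Bins whose magnitude lies well above the noise estimate are left almost unchanged, while the bin at 0.2, which lies below the noise estimate, is set to zero. Raising `beta` above 1 subtracts more than the estimate, which is the overestimation remedy mentioned above.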

In addition, a lot of clicking noise occurs due to steep changes in the gain function. It can be removed by adding a gain smoothing function as follows:

$$G_{s,i} = (1 - \text{smooth\_factor}) \cdot G_{s,i-1} + \text{smooth\_factor} \cdot G_i$$

where $G_{s,i}$ is the smoothed gain of frame $i$ and the smooth factor determines the time constant of the exponentially smoothed gain function. To understand how the smoothing works, consider the step gain function in Figure 2.2 (solid blue line): applying the gain smoothing yields a smoothed version of the steep corners (dotted line). The smooth factor in this figure is 0.01. In this way, the sharp glitches in the gain function are eliminated.
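The smoothing of a step can be reproduced with a few lines of Python. The sketch assumes the first-order recursion $G_{s,i} = (1 - \text{smooth\_factor})\,G_{s,i-1} + \text{smooth\_factor}\,G_i$ with a zero initial state; the function name `smooth_gain` and the step signal are hypothetical.

```python
def smooth_gain(gains, smooth_factor=0.01, initial=0.0):
    """Exponential smoothing of a gain sequence (assumed recursion):
    Gs[i] = (1 - smooth_factor) * Gs[i-1] + smooth_factor * G[i]."""
    smoothed = []
    prev = initial
    for g in gains:
        prev = (1.0 - smooth_factor) * prev + smooth_factor * g
        smoothed.append(prev)
    return smoothed

# Step gain function: 0 for 100 frames, then 1 for 1000 frames.
step = [0.0] * 100 + [1.0] * 1000
gs = smooth_gain(step, smooth_factor=0.01)

print(gs[99])             # 0.0, still before the step
print(round(gs[199], 3))  # 0.634, i.e. 1 - 0.99**100 after 100 frames
```

With a smooth factor of 0.01 the smoothed gain needs roughly 100 frames to cover about 63 % of the step, so a smaller factor gives a longer time constant and a smoother but slower gain.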

Figure 2.2 Smoothing of a step function

2.2.2 Acoustic Echo Suppression with Spectral Subtraction

Echo suppression is basically performed in the same manner as noise suppression. Unlike an AEC, an acoustic echo suppressor attenuates the echo by manipulating the magnitude spectrum of the microphone signal in the frequency domain, while leaving the phase spectrum untouched.

The adaptive filter in the AES plays the role of the noise density estimation unit in the noise suppressor; combined with an FFT, it produces an estimate of the echo magnitude spectrum. The echo spectrum estimate is used together with the spectrum of the microphone signal to form the gain function:

$$G_i(f) = \left(\frac{\max\big(|Z(f)|^{\alpha} - \beta\,|\hat{Y}(f)|^{\alpha},\, 0\big)}{|Z(f)|^{\alpha}}\right)^{1/\alpha}$$

where $\alpha$ and $\beta$ are design parameters that control the echo suppression performance. $\beta > 1$ is used if the echo is underestimated, and $\beta < 1$ if it is overestimated.

Multiplying the gain function with the microphone spectrum gives the spectrum of the residual signal, which is supposed to be echo-free. After the inverse FFT, the echo-suppressed outgoing signal is obtained as:

$$e(n) = F^{-1}[G_i(f)\,Z(f)]$$

where $F^{-1}(\cdot)$ denotes the inverse FFT.

2.2.3 Overlapping-windowed FFT

For the transformation into the spectral domain, the choice of data window and overlap is also important. Windowing a simple waveform such as $\cos(\omega t)$ causes its Fourier transform to have non-zero values at frequencies other than $\omega$, an effect commonly called leakage. The rectangular window is the simplest window and has the best resolution, but it suffers most from leakage. Other windows, such as the Hann, Hamming and Kaiser windows, are more moderate. Hamming has more leakage than the other two. Kaiser can have the smallest leakage, at the price of lower resolution. On the other hand, performing the FFT on a small windowed segment of the input introduces distortions because of transient effects, so overlapped windowing is employed. The windows overlap in time, that is, the window is shifted by only a part of the total window size instead of the whole size. The FFT and subsequent processing are then performed for every window. To restore the signal, the data reconstructed through the IFFT are summed up at the end.

Different windows are compared in Figures 2.3, 2.4 and 2.5 to find out which one gives the better result, namely the smaller error. The window is shifted by half its length each time, and the FFT and IFFT are applied. The error between the original and the recovered signal is displayed in dB.

The Hann window has a relatively smaller error than the other two windows. The performance of the Kaiser window depends strongly on the value of beta: the larger beta is, the wider the main lobe and the smaller the side lobes become. According to simulations in MATLAB, a beta of around 5.8 gives the smallest error after performing the FFT and IFFT, as shown in Figure 2.5.

Figure 2.3 Error caused by Hann-Windowing FFT

Figure 2.4 Error caused by Hamming-Windowing FFT

Figure 2.5 Error caused by Kaiser-Windowing FFT with different beta (panels: beta = 2.5, 4.5, 5.8 and 6.8)

In conclusion, the simulation results show that the Hann window gives the smallest reconstruction error. Hence, the Hann window is chosen for the overlap-add method in the AES process.
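One reason why a Hann window with a half-window shift reconstructs well can be checked numerically: in its periodic form, two Hann windows shifted by half their length sum to one. The Python sketch below windows a test signal, overlap-adds the frames and compares the result with the original. It is only an illustration under stated assumptions: the periodic Hann definition, the frame length of 64 and the test signal are chosen here, and the FFT, gain and IFFT steps are omitted because an FFT followed directly by an IFFT returns the frame unchanged.

```python
import math

def hann_periodic(N):
    """Periodic Hann window: w[n] = 0.5 * (1 - cos(2*pi*n/N)), n = 0..N-1."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / N)) for n in range(N)]

def overlap_add(signal, N):
    """Window the signal in frames of length N with a hop of N/2 and sum them."""
    hop = N // 2
    w = hann_periodic(N)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - N + 1, hop):
        frame = [w[n] * signal[start + n] for n in range(N)]
        # The FFT, spectral gain and inverse FFT would act on `frame` here.
        for n in range(N):
            out[start + n] += frame[n]
    return out

N = 64
sig = [math.sin(0.07 * k) + 0.3 * math.cos(0.31 * k) for k in range(640)]
rec = overlap_add(sig, N)

# Only samples covered by two overlapping frames are compared; the first and
# last half frame are covered by a single window and are therefore tapered.
max_err = max(abs(rec[k] - sig[k]) for k in range(N // 2, len(sig) - N // 2))
print(max_err < 1e-12)  # True
```

The symmetric Hann definition, which ends in a zero sample, does not sum exactly to one at a half-window shift, so the small error reported for the Hann window may depend on which definition was used.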

2.2.4 Comparison of AEC and AES

Both AEC and AES have their advantages and disadvantages. The AEC is a well-defined technique. When the modeling filter approaches the true echo path, an AEC can eliminate the echo successfully without introducing much distortion into the outgoing signal. In reality, however, the modeling filter often differs from the true echo path: for example, the modeling filter may be shorter than the true echo path, the echo path may change, or the echo path may contain nonlinearities. As a result, some residual echo may remain. In addition, an AEC is often computationally expensive. In comparison, an AES is able to achieve higher echo attenuation and is more robust. The AES algorithm may also have a lower computational complexity, as is the case for the simplified echo path method discussed later. However, as with the noise suppressor, this technique sometimes introduces audible distortions into the outgoing signal.

2.3 Adaptive Filters

The adaptive filter, which is used to mimic the acoustic echo path, is the central part of the AEC. Numerous adaptive algorithms are applicable to acoustic echo cancellation, such as least mean squares (LMS), recursive least squares (RLS) and the affine projection algorithm (APA). LMS is an old, simple and proven algorithm which has turned out to work well in comparison with newer, more advanced algorithms. In this project the normalized LMS (NLMS) is used for the main filter in the AEC, since NLMS is so far the most popular algorithm in practice owing to its computational simplicity. The following sections outline the Normalized Least Mean Square algorithm, an adaptation process based on a linear FIR filter that aims at approximating the room acoustic path with the best possible model. For frequency-domain adaptive filtering in the AES, the recently introduced simplified echo path method with a frequency-domain coloration effect filter is studied, which has the advantages of lower computational complexity and greater robustness.

2.3.1 Wiener Filter

The Wiener filter represents the optimum filter in the sense of the Mean-Squared Error (MSE). It minimizes a cost function of the filter coefficients, which can be expressed as:

$$J(\mathbf{w}) = E\{e^2\}$$

where $\mathbf{w}$ stands for the filter coefficients and $E\{e^2\}$ represents the mean power of the error signal $e(k)$. With the optimal filter coefficients, the minimum of the cost function

$$J(\mathbf{w}_{opt}) = \min(E\{e^2\})$$

is reached.

The error signal is the difference between the desired signal and the output of the adaptive filter,

$$e(k) = d(k) - y(k), \qquad y(k) = \mathbf{w}^T \mathbf{x}(k)$$

with $\mathbf{x}(k) = [x(k-L+1), x(k-L+2), \ldots, x(k)]^T$, where $L$ is the length of the filter.

The squared error is then:

$$e^2(k) = d^2(k) - 2\,d(k)\,\mathbf{w}^T\mathbf{x}(k) + \mathbf{w}^T\mathbf{x}(k)\,\mathbf{x}^T(k)\,\mathbf{w}$$

The auto-correlation matrix $\mathbf{R}$ is defined by

$$\mathbf{R} = E\{\mathbf{x}(k)\,\mathbf{x}^T(k)\}$$

and the cross-correlation vector is:

$$\mathbf{p} = E\{\mathbf{x}(k)\,d(k)\}$$

Assuming that the desired signal is real and wide-sense stationary, with power $\sigma_d^2 = E\{d^2(k)\}$, the cost function can be written as:

$$J(\mathbf{w}) = \sigma_d^2 - 2\,\mathbf{w}^T\mathbf{p} + \mathbf{w}^T\mathbf{R}\,\mathbf{w}$$

The minimum of the cost function is found at the point where the gradient is zero. The gradient of the cost function is:

$$\nabla_{\mathbf{w}}\{J(\mathbf{w})\} = -2\,\mathbf{p} + 2\,\mathbf{R}\mathbf{w} = 2\,(\mathbf{R}\mathbf{w} - \mathbf{p})$$

Setting the gradient to zero leads to the discrete-time Wiener-Hopf equation:

$$\mathbf{w}_{opt} = \mathbf{R}^{-1}\mathbf{p}$$

which gives the filter coefficients of a Wiener filter, optimal in the sense of the MSE.
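The Wiener-Hopf solution can be illustrated with a small MATLAB sketch that estimates R and p from sample averages and solves for the coefficients. The "true" system `h`, the filter length and the white-noise input are hypothetical values chosen for illustration; the input vector is ordered newest sample first so that the result lines up with the taps used by `filter`.

```matlab
% Sketch: sample-based Wiener solution for a hypothetical 4-tap system
L = 4;  K = 5000;
h = [0.8; -0.4; 0.2; 0.1];         % hypothetical "true" system (assumption)
x = randn(K, 1);                   % white input signal
d = filter(h, 1, x);               % desired signal, noise-free
R = zeros(L);  p = zeros(L, 1);
for k = L:K
    xk = x(k:-1:k-L+1);            % input vector, newest sample first
    R  = R + xk * xk';             % accumulate auto-correlation estimate
    p  = p + xk * d(k);            % accumulate cross-correlation estimate
end
R = R / (K - L + 1);
p = p / (K - L + 1);
w_opt = R \ p;                     % solves R * w = p without forming inv(R)
```

Because the desired signal here is an exact, noise-free FIR filtering of the input, `w_opt` should coincide with `h` up to rounding errors.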

The Wiener filter is a linear optimum filter. It depends on the statistics R and p being known. In practice, we do not know R and p exactly, and in an adaptive context they may vary slowly with time. The adaptive filter should be able to track these changes in the statistics, and hence a changing w_opt, so some approximations are necessary. One idea is to approximate R and p, which leads to the Recursive Least Squares (RLS) algorithm. Another is to approximate the gradient, as in the Least Mean Square (LMS) algorithm presented in the following section. The LMS algorithm was introduced much earlier than the RLS algorithm. RLS algorithms have the advantage of fast convergence, while LMS requires far fewer computations. For PC software applications, the low computational cost of the LMS method is the more attractive property.

2.3.2 Least Mean Square Algorithm (LMS)

The Least Mean Square algorithm is derived from the steepest descent method. Instead of going directly from the starting point to the optimum, it is easier to follow the gradient of the error function, which leads to the optimum iteratively. The gradient, as shown in Figure 2.6, is a vector pointing in the steepest uphill direction on the error surface at a given point w(k). The filter coefficients are updated by taking a step opposite to the gradient direction, i.e. locally “downhill” in the steepest direction, to approach the optimum:

$$\mathbf{w}(k+1) = \mathbf{w}(k) - c \cdot \nabla_{\mathbf{w}}\{J(\mathbf{w})\}$$

Replacing the expectations in the gradient by their instantaneous values gives

$$\nabla_{\mathbf{w}}\{J(\mathbf{w})\} = -2\,\mathbf{x}(k)\,d(k) + 2\,\mathbf{x}(k)\,\mathbf{x}^T(k)\,\mathbf{w} = -2\,\mathbf{x}(k)\,[d(k) - \mathbf{x}^T(k)\,\mathbf{w}] = -2\,\mathbf{x}(k)\,e(k)$$

This leads to the Least Mean Square algorithm

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \mu\,\mathbf{x}(k)\,e(k)$$

where $\mu = 2c$ is defined as the step size. The step-size parameter μ controls the convergence speed of the filter.
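A minimal MATLAB sketch of the LMS recursion is given below. The echo path `h`, the step size `mu` and the white-noise far-end signal are hypothetical values chosen so that the loop converges; they are not taken from the experiments in this thesis.

```matlab
% Sketch: LMS identification of a hypothetical 8-tap echo path
L  = 8;  K = 20000;
h  = [0.5; -0.3; 0.2; 0.1; -0.05; 0.02; 0.01; -0.01];  % hypothetical echo path
x  = randn(K, 1);                  % far-end (loudspeaker) signal
d  = filter(h, 1, x);              % desired signal (echo only, noise-free)
mu = 0.01;                         % step size (assumption)
w  = zeros(L, 1);                  % adaptive filter coefficients
e  = zeros(K, 1);
for k = L:K
    xk   = x(k:-1:k-L+1);          % input vector, newest sample first
    e(k) = d(k) - w' * xk;         % error between desired signal and filter output
    w    = w + mu * xk * e(k);     % LMS update: w(k+1) = w(k) + mu * x(k) * e(k)
end
misalignment = norm(w - h);        % distance to the true echo path
```

A larger `mu` speeds up convergence but eventually makes the recursion unstable, which is the trade-off the step size controls.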

Figure 2.6 Gradient of the Error function

2.3.3 Normalized Least Mean Square Algorithm (NLMS)

The NLMS algorithm is derived from the LMS algorithm. The motivation is that the power of the input signal varies with time, so the size of the step between two successive coefficient updates varies as well, and with it the convergence speed. Convergence slows down for weak signals, while for loud signals the overshoot error increases. The idea is therefore to adjust the step-size parameter continuously according to the input power. The step size is normalized by the current input power, resulting in the Normalized Least Mean Square algorithm, with

$$\mu(k) = \frac{2\alpha}{\|\mathbf{x}(k)\|^2}$$

where α is again a design parameter that adjusts the convergence speed, with 0 < α < 1. NLMS usually converges much more quickly than LMS at very little extra cost, so it is very commonly used.
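The normalization can be sketched in MATLAB as follows. The sketch assumes the normalized step size has the form 2α/‖x(k)‖²; the echo path `h`, the value of `alpha` and the slowly varying input level are illustrative assumptions.

```matlab
% Sketch: NLMS with an input whose level varies over time (values assumed)
L = 8;  K = 20000;
h = [0.5; -0.3; 0.2; 0.1; -0.05; 0.02; 0.01; -0.01];  % hypothetical echo path
level = 0.1 + abs(sin(2*pi*(1:K)'/4000));              % slowly varying input level
x = level .* randn(K, 1);          % far-end signal with time-varying power
d = filter(h, 1, x);               % echo only, noise-free
alpha = 0.25;                      % design parameter, 0 < alpha < 1 (assumption)
w = zeros(L, 1);
for k = L:K
    xk = x(k:-1:k-L+1);            % input vector, newest sample first
    e  = d(k) - w' * xk;           % a-priori error
    mu = 2 * alpha / (xk' * xk);   % step size normalized by the current input power
    w  = w + mu * xk * e;          % NLMS update
end
misalignment = norm(w - h);
```

Because the step is rescaled at every sample, the convergence behaviour no longer depends on whether the input is currently loud or quiet.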

2.3.4 The Problem with NLMS:

The performance of the fast-converging NLMS algorithm degrades considerably when double talk or only near-end speech is present. The reason is that the update is calculated from the ratio between the error signal and the power of the far-end signal:

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \frac{2\alpha}{\|\mathbf{x}(k)\|^2}\,\mathbf{x}(k)\,e(k)$$

During pauses in double talk, or when only near-end speech exists, the coefficients become highly unstable, since the input power approaches zero while the error signal is relatively large because of the near-end signal. The filter weights start to diverge. The LMS algorithm does not suffer from this problem.

There are several possible solutions to this problem, which are described below:

There are several possible solutions to solve this, which will be illustrated as follows:

1. Safety constant

One possibility is simply to add a safety constant ρ to the denominator:

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \frac{2\alpha}{\|\mathbf{x}(k)\|^2 + \rho}\,\mathbf{x}(k)\,e(k)$$

The value of ρ influences the output quality: the larger the constant, the smaller the jitter of the weights, but also the lower the ERLE.

2. Threshold

Another common and low-cost possibility is to introduce a threshold on the input power. The weights are kept unchanged if the input power is lower than the threshold, which avoids large jitter of the weights. This is essentially a far-end signal detector based on the input power.

$$\mathbf{w}(k+1) = \begin{cases} \mathbf{w}(k) + \dfrac{2\alpha}{\|\mathbf{x}(k)\|^2}\,\mathbf{x}(k)\,e(k) & \text{if } \|\mathbf{x}(k)\|^2 > \text{threshold} \\[2ex] \mathbf{w}(k) & \text{if } \|\mathbf{x}(k)\|^2 < \text{threshold} \end{cases}$$

3. Combination of LMS and NLMS

Suitable values for both the safety constant and the input threshold depend on the input power. Hence we introduce a new idea which combines the advantages of NLMS and LMS. Two adaptive filters are adapted in parallel and weighted by a factor γ (0 < γ < 1). The two filters contribute a fraction γ and 1 − γ, respectively, to the calculation of the error signal:

$$e = z - \gamma \cdot y_1 - (1-\gamma) \cdot y_2$$

where y1 is the echo estimate of the NLMS filter and y2 that of the LMS filter. This method tries to find the optimal combination of LMS and NLMS at each time instant, in order to achieve fast convergence and a relatively large ERLE for echo cancellation while also gaining stability.

To derive an update rule for γ, we proceed in the same way as for LMS: the steepest descent method is applied to approach the minimum of the mean-squared error.

$$\frac{\partial E[e^2]}{\partial \gamma} = -(y_1 - y_2) \cdot 2\,\big(z - \gamma\,(y_1 - y_2) - y_2\big) = -(y_1 - y_2) \cdot 2e$$

$$\gamma_{i+1} = \gamma_i + c \cdot (y_1 - y_2) \cdot 2e$$

where c is the step-size parameter, playing the same role as μ in the LMS algorithm.

Theoretically, γ should be close to 1 during far-end (FE) single talk, which indicates that the NLMS algorithm is employed. This is because the NLMS filter adapts faster and achieves a higher ERLE than LMS in this situation. During double talk (DT) and near-end (NE) single talk, γ goes towards 0, since the LMS algorithm does not suffer from the stability problem of NLMS when near-end speech (NES) exists.

We tested this algorithm with a recorded signal consisting of three segments: far-end speech only, DT and near-end speech only. The γ value plotted in Figure 2.7 behaves as expected.

Figure 2.7 Plot of γ (α = 0.01, μ = 0.8)

The MATLAB simulations indicate that the new algorithm does not improve the results enough to justify the computational complexity it adds. In conclusion, for the same ERLE achieved, and according to the listening test results of the three methods, the safety constant method is chosen as an efficient solution that gives acceptable results.
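The effect of the safety constant and of the input-power threshold can be illustrated with the following MATLAB sketch, which runs plain NLMS and the two protected variants side by side. The signals are synthetic: the far end is active in the first half and almost silent in the second, where only near-end speech (modelled as noise) is present. The values of `alpha`, `rho`, `thr` and the echo path `h` are assumptions for illustration, not the tuned values of this work.

```matlab
% Sketch: plain NLMS versus safety constant versus input-power threshold
L = 8;  K = 4000;
h = [0.5; -0.3; 0.2; 0.1; -0.05; 0.02; 0.01; -0.01];   % hypothetical echo path
x = [randn(K/2, 1); 1e-4 * randn(K/2, 1)];              % far end: active, then nearly silent
v = [zeros(K/2 + 2*L, 1); 0.5 * randn(K/2 - 2*L, 1)];   % near-end signal in the second half
z = filter(h, 1, x) + v;                                % microphone signal
alpha = 0.25;  rho = 0.1;  thr = 1e-3;                  % assumed parameters
w_plain = zeros(L, 1);  w_safe = zeros(L, 1);  w_thr = zeros(L, 1);
for k = L:K
    xk = x(k:-1:k-L+1);
    pw = xk' * xk;                                      % current input power
    w_plain = w_plain + 2*alpha/pw * xk * (z(k) - w_plain' * xk);
    w_safe  = w_safe  + 2*alpha/(pw + rho) * xk * (z(k) - w_safe' * xk);
    if pw > thr                                         % adapt only when the far end is active
        w_thr = w_thr + 2*alpha/pw * xk * (z(k) - w_thr' * xk);
    end
end
err_plain = norm(w_plain - h);    % plain NLMS: expected to diverge
err_safe  = norm(w_safe  - h);    % safety constant: small weight jitter remains
err_thr   = norm(w_thr   - h);    % threshold: weights frozen during near-end speech
```

In this synthetic setting the plain NLMS weights drift far away from `h` during the near-end segment, while both protected variants stay close to it.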

2.3.5 A Simplified Echo Path Model

A conventional adaptive algorithm aims at approximating the real acoustic echo path and therefore inherently suffers from the effects of echo path changes and nonlinearity. Christof Faller and Christophe Tournery recently proposed a new AES that does not require the computationally complex estimation of the acoustic echo path. Instead of identifying the echo path impulse response, the proposed method estimates only the magnitude spectrum of the echo, which is what echo suppression needs. A filter mimicking the coloration effect of the echo path on the loudspeaker signal is adopted, and the gain filter for the AES is computed using this coloration effect filter. The proposed AES has low complexity and higher robustness because its estimate does not depend on the exact physical echo path.

Coloration in audio processing means that some frequency ranges are attenuated or amplified while others are not. For the AES it is necessary to know which frequencies of the loudspeaker signal are attenuated, left unmodified or amplified. A typical room impulse response consists of the direct sound, which travels directly from the loudspeaker to the microphone, several early reflections, and then the late reflections, which form a long, dense tail, as shown in Figure 2.8. The dense late reflections hardly influence the amplitude of the frequency spectrum. The strong direct sound and the early reflections are what color the signal. Hence, to obtain the information needed for echo suppression it is enough to consider only the direct sound and the early reflections, which reduces the computational complexity.

Figure 2.8 Typical room impulse response

A real-valued “coloration effect filter” $G_v(i,k)$, mimicking the spectral modification effect of the echo path on the loudspeaker signal, is estimated. To obtain an approximate echo magnitude spectrum, the estimated delay and the coloration effect filter are applied to the loudspeaker signal spectra,

$$|\hat{Y}(i,k)| = G_v(i,k)\,|X_d(i,k)|$$

where d stands for the delay in samples. Since it takes a certain amount of time for the loudspeaker signal to reach the microphone, the magnitude spectrum of the echo is calculated from the delayed loudspeaker signal.

The coloration effect filter is computed as the magnitude of the least squares estimator

$$G_v(i,k) = \left| \frac{E\{X_d^*(i,k)\,Y(i,k)\}}{E\{X_d^*(i,k)\,X_d(i,k)\}} \right|$$

where $^*$ denotes the complex conjugate. Since the acoustic echo path is likely to vary in time, $G_v(i,k)$ is estimated iteratively as

$$G_v(i,k) = \left| \frac{a_{12}(i,k)}{a_{22}(i,k)} \right|$$

where

$$a_{12}(i,k) = \sigma \cdot E\{X_d^*(i,k)\,Z(i,k)\} + (1-\sigma)\,a_{12}(i,k-1)$$

$$a_{22}(i,k) = \sigma \cdot E\{X_d^*(i,k)\,X_d(i,k)\} + (1-\sigma)\,a_{22}(i,k-1)$$

and $\sigma \in [0,1]$ determines the time constant of the exponentially decaying estimation window.

The magnitude spectrum of the echo signal is then used to form the gain filter:

$$G(i,k) = \left( \frac{\max\left(|Z(i,k)|^{\alpha} - \beta\,|\hat{Y}(i,k)|^{\alpha},\; 0\right)}{|Z(i,k)|^{\alpha}} \right)^{1/\alpha}$$

During double talk, the coloration effect filter will affect the near-end speech and may even diverge, in the same way as the NLMS algorithm. To prevent this, a double talk detector (or near-end speech detector) may be needed to freeze the coloration effect filter while double talk is present.
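The recursive estimate of the coloration effect filter and the resulting gain filter can be sketched in MATLAB as follows. The spectra are synthetic: the microphone spectrum contains only an echo produced by a hypothetical per-bin response `C`. The expectations are replaced by the instantaneous products of the current frame, and the values of `sigma`, `alpha` and `beta` are assumptions for illustration.

```matlab
% Sketch: coloration effect filter and gain filter on synthetic spectra
nbins = 64;  nframes = 200;
sigma = 0.1;                                  % smoothing constant (assumption)
alpha = 2;   beta = 1;                        % gain filter parameters (assumption)
bins  = (0:nbins-1)';
C     = (1 - 0.8*bins/nbins) .* exp(1j*pi*bins/nbins);  % hypothetical echo path per bin
a12   = zeros(nbins, 1);
a22   = zeros(nbins, 1);
for k = 1:nframes
    Xd   = randn(nbins, 1) + 1j*randn(nbins, 1);        % delayed loudspeaker spectrum
    Z    = C .* Xd;                                      % microphone spectrum (echo only)
    a12  = sigma * conj(Xd) .* Z  + (1 - sigma) * a12;   % smoothed cross-spectrum
    a22  = sigma * conj(Xd) .* Xd + (1 - sigma) * a22;   % smoothed auto-spectrum
    Gv   = abs(a12 ./ a22);                              % coloration effect filter
    Yhat = Gv .* abs(Xd);                                % estimated echo magnitude
    G    = (max(abs(Z).^alpha - beta * Yhat.^alpha, 0) ./ abs(Z).^alpha).^(1/alpha);
end
```

Since the microphone signal here consists of echo only, `Gv` should match `abs(C)` and the gain `G` should be close to zero in every bin, i.e. the echo is suppressed completely.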

2.4 Double Talk Detector

An important feature of an AEC is its capability to provide full-duplex service, which means that it allows both ends to speak simultaneously, namely the case of double talk (DT). If DT occurs, the microphone signal used for adaptation contains not only the echo but also the near-end signal. This can lead to divergence of the adaptive filter, since the near-end speech acts as strong uncorrelated noise for the adaptive algorithm. It is therefore necessary to detect when double talk occurs and to stop the adaptation process. This is done by a double talk detector (DTD).

2.4.1 The Generic Doubletalk Detection Scheme

The generic DTD is based on a detection statistic ξ, which is formed from the available signals, such as the loudspeaker signal, the microphone signal and the output signal. By comparing ξ with a preset threshold T, double talk is either declared or not. Once double talk is detected, the filter adaptation is disabled for a minimum period of time T_hold. Adaptation is resumed only if the detection statistic indicates no DT continuously over a time T_hold.
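The hold logic of the generic scheme can be sketched in MATLAB as follows. The per-frame statistic `xi`, the threshold `T` and the hold time `T_hold` (counted in frames) are made-up example values, and DT is declared here when the statistic exceeds the threshold; for detectors with the opposite convention the comparison would be reversed.

```matlab
% Sketch: generic DTD decision with a hold time (example values assumed)
xi     = [0.2 0.3 1.5 0.4 0.3 0.2 0.1 0.2 0.3 0.2];  % detection statistic per frame
T      = 1.0;                      % decision threshold (assumption)
T_hold = 3;                        % hold time in frames (assumption)
hold_cnt = 0;                      % remaining frames with adaptation disabled
adapt    = true(size(xi));         % true where filter adaptation is allowed
for n = 1:numel(xi)
    if xi(n) > T
        hold_cnt = T_hold;         % DT declared: (re)start the hold period
    elseif hold_cnt > 0
        hold_cnt = hold_cnt - 1;   % no DT in this frame: count down
    end
    adapt(n) = (hold_cnt == 0);    % adapt only after T_hold DT-free frames
end
```

In this example the single detection in frame 3 disables adaptation for frames 3 to 5, and adaptation resumes in frame 6.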

There is a variety of double talk detectors, which differ in the algorithm used to calculate the decision statistic ξ. The most popular ones are the Geigel algorithm, the Normalized Cross-correlation (NCR) method and the Variance Impulse Response (VIRE) algorithm.

2.4.2 Geigel DTD

The most basic algorithm for double talk detection is the one originally developed by Geigel. It is a simple approach that compares the magnitude of the received (microphone) signal with that of the far-end signal. Since the room normally attenuates the far-end signal, DT is declared when the microphone signal divided by the maximum of the most recent far-end samples is larger than a certain threshold.

The decision statistic is calculated as:

$$\xi = \frac{|z(t)|}{\max\big(|x(t)|, |x(t-1)|, \ldots, |x(t-N+1)|\big)}$$

If ξ is larger than a preset threshold T, DT is deemed to be occurring, otherwise not, i.e.

ξ > T: double talk present

ξ ≤ T: double talk not present

The choice of T strongly affects the performance of the detector. In the MATLAB analysis, it can be found by plotting the decision variable and determining which threshold best separates DT from far-end single talk. The Geigel detector has the benefit of being computationally simple and needing little memory. However, its detection performance is quite poor.
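A minimal MATLAB sketch of the Geigel statistic on synthetic signals is shown below. The echo path (a delayed copy attenuated to one quarter), the window length `N`, the threshold `T` and the noise-like near-end signal are assumptions chosen so that the two cases are clearly separated.

```matlab
% Sketch: Geigel double talk detector on synthetic signals (values assumed)
N = 64;  T = 0.5;  K = 2000;
x = randn(K, 1);                           % far-end signal
h = [0; 0; 0.25];                          % hypothetical echo path: delay and attenuation
v = [zeros(K/2, 1); 2 * randn(K/2, 1)];    % near-end signal in the second half
z = filter(h, 1, x) + v;                   % microphone signal
xi = zeros(K, 1);
for t = N:K
    xi(t) = abs(z(t)) / max(abs(x(t:-1:t-N+1)));  % |z(t)| over recent far-end maximum
end
dt = xi > T;                               % true where double talk is declared
```

During far-end single talk the statistic cannot exceed the echo path attenuation of 0.25, so no false alarm occurs; with near-end activity the threshold is exceeded frequently.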

2.4.3 Normalized Cross-correlation (NCR) DTD

An alternative method is the normalized cross-correlation algorithm. The microphone signal z(t) can be expressed as the sum of the echo signal and the near-end speech (NES) signal, where we ignore the influence of noise for now:

$$z(t) = y(t) + v(t)$$

Suppose the echo path impulse response of the room is h, such that the echo signal is:

$$y(t) = \mathbf{h}^T \mathbf{x}(t)$$

The power of the measured microphone signal can be written as:

$$\sigma_z^2(t) = \mathbf{h}^T \mathbf{R}_{xx}\,\mathbf{h} + \sigma_v^2(t)$$

where $\mathbf{R}_{xx} = E\{\mathbf{x}\mathbf{x}^T\}$.

The cross-correlation between the loudspeaker signal and the echo signal is, by definition:

$$\mathbf{r}_{xy} = E\{\mathbf{x}\,y\} = \mathbf{R}_{xx}\,\mathbf{h}$$

yielding:

$$\mathbf{h} = \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{xy}$$

The power of the microphone signal can then be rewritten as:

$$\sigma_z^2(t) = \mathbf{r}_{xy}^T \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{xy} + \sigma_v^2(t)$$

When there is no NES present, i.e. $v(t) = 0$, then $z(t) = y(t)$ and

$$\sigma_z^2(t) = \mathbf{r}_{xz}^T \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{xz} \quad \text{with} \quad \mathbf{r}_{xz} = E\{\mathbf{x}\,z\}$$

The detection statistic is suggested as:

$$\xi = \left( \frac{\mathbf{r}_{xz}^T \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{xz}}{\sigma_z^2(t)} \right)^{1/2}$$

The numerator is the power of the measured signal if no near-end speech is present, whereas the denominator is the actual power of the measured signal. Thus, if no near-end speech is present, ξ ≈ 1; otherwise ξ < 1.

The DT decision is formed as

ξ < T: double talk present

ξ ≥ T: double talk not present

where T is selected between 0 and 1.

The NCR method in this form is normally computationally too expensive, as it requires not only the estimation of the cross-correlation vector $\mathbf{r}_{xy}$ and the far-end covariance matrix $\mathbf{R}_{xx}$, but also the inversion of the covariance matrix. A practical approach is therefore adopted. The room echo path response is approximated by the response of the adaptive filter w, which results in:

$$\mathbf{h} = \mathbf{R}_{xx}^{-1}\,\mathbf{r}_{xy} \approx \mathbf{w}$$

$$\xi = \frac{\mathbf{r}_{xz}^T \mathbf{w}}{\sigma_z^2(t)} = \frac{\mathbf{w}^T \mathbf{R}_{xx}\,\mathbf{w}}{\sigma_z^2(t)} = \frac{\sigma_{\hat{y}}^2(t)}{\sigma_z^2(t)}$$

The numerator is the power of the estimated echo signal and the denominator is the actual microphone signal power. This is the form of the cheap normalized cross-correlation algorithm. Since we use the Hann-window-based overlap-add method, the decision statistic is calculated for each window frame.
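The cheap NCR statistic can be sketched per frame in MATLAB as the ratio of estimated echo power to microphone power. The sketch assumes that the adaptive filter `w` has already converged to the hypothetical echo path `h`; the frame length, the threshold `T` and the noise-like near-end signal are illustrative assumptions, and plain non-overlapping frames are used instead of the windowed overlap-add frames.

```matlab
% Sketch: frame-based cheap NCR statistic (converged filter assumed)
frame = 256;  nfr = 8;
h = [0.5; -0.3; 0.2];                      % hypothetical echo path
w = h;                                     % assumption: adaptive filter has converged
x = randn(frame * nfr, 1);                 % far-end signal
v = [zeros(frame*nfr/2, 1); randn(frame*nfr/2, 1)];  % near-end signal in the second half
z = filter(h, 1, x) + v;                   % microphone signal
yhat = filter(w, 1, x);                    % echo estimate from the adaptive filter
xi = zeros(nfr, 1);
for m = 1:nfr
    idx   = (m-1)*frame + (1:frame);
    xi(m) = sum(yhat(idx).^2) / sum(z(idx).^2);  % estimated echo power / microphone power
end
T  = 0.9;                                  % threshold between 0 and 1 (assumption)
dt = xi < T;                               % double talk declared when xi drops below T
```

In the far-end-only frames the statistic equals one, while near-end activity lowers it well below the threshold.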

2.4.4 Variance Impulse Response (VIRE DTD)

VIRE DTD is a recently introduced method based on the variance of the adaptive filter coefficients. Since the near-end speech acts as corrupting noise, it induces dramatic variations in the adaptive filter taps. The algorithm uses the maximum value of the adaptive filter coefficients as a measure of these fluctuations and tracks its variance with an exponential forgetting factor λ:

$$\xi(n) = \lambda \cdot \xi(n-1) + (1-\lambda) \cdot [\gamma - \bar{\gamma}]^2$$

where γ is the maximum value of the filter coefficients and $\bar{\gamma}$ is the expected value of γ, which is again formed with the exponential forgetting factor:

$$\bar{\gamma}(n) = \lambda \cdot \bar{\gamma}(n-1) + (1-\lambda) \cdot \gamma$$

$$\gamma = \max\big(h(0), h(1), \ldots, h(k-1)\big)$$

By defining a threshold for the variation of the adaptive filter taps, the DT decision is made as follows:

ξ > T: double talk present

ξ ≤ T: double talk not present

The detection is again frame-based and calculated once at the end of each frame. For this reason the frame length used in this work is relatively small. It is also worth mentioning that the VIRE algorithm is more sensitive to the power of the near-end speech, as we will see later in the simulations.
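The VIRE recursion can be sketched in MATLAB on a synthetic sequence of filter snapshots. The matrix `W` holds one hypothetical set of filter coefficients per frame; the taps are constant in the first half and randomly perturbed in the second half to imitate the effect of near-end speech. The forgetting factor, the threshold and the order of the two updates within a frame are assumptions.

```matlab
% Sketch: VIRE detection statistic on synthetic filter snapshots
L = 16;  nfr = 40;
lambda = 0.9;                              % forgetting factor (assumption)
T      = 1e-3;                             % decision threshold (assumption)
h0 = 0.5 * exp(-(0:L-1)'/4);               % hypothetical converged filter
W  = repmat(h0, 1, nfr);                   % one column of coefficients per frame
W(:, 21:end) = W(:, 21:end) + 0.2 * randn(L, 20);  % taps disturbed by near-end speech
gbar = max(W(:, 1));                       % running mean of the maximum coefficient
xi   = zeros(1, nfr);
for n = 2:nfr
    g     = max(W(:, n));                  % maximum coefficient of the current frame
    gbar  = lambda * gbar + (1 - lambda) * g;                  % update expected value
    xi(n) = lambda * xi(n-1) + (1 - lambda) * (g - gbar)^2;    % update variance estimate
end
dt = xi > T;                               % double talk declared when the variance is large
```

As long as the taps are stable the statistic stays at zero; once they fluctuate it rises above the threshold.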

2.4.5 Double Talk Detection Performance Evaluation

Certain criteria are necessary to compare different types of DTD, since their performance cannot be compared directly when different thresholds are used. A systematic approach is also required to select the value of the threshold.

The criteria used to evaluate DTD performance are as follows:

Probability of false alarm (Pf): the probability of declaring detection when DT does not exist.

Probability of detection (Pd): the probability of successful detection when DT does exist.

Probability of miss (Pm = 1 - Pd): the probability of detection failure when DT is present.

Pf is calculated when only the far-end signal is present, namely v = 0,

$$P_f = \frac{\sum \xi \cdot X_{active}}{N}$$

where ξ is the output decision of the DTD, $X_{active}$ is the output of the activity detector and N is the length of the entire far-end speech signal x, which is the first 15 seconds in our case.

The miss probability Pm is measured as the proportion of near-end speech that remains undetected when far-end speech also exists. Pm is a meaningful criterion for a fair comparison of different DTD methods, because the disruptive effect of undetected double talk on an adaptive filter depends on the near-end speech that goes undetected.

$$P_m = 1 - \frac{\sum \xi \cdot X_{active} \cdot V_{active}}{\sum X_{active} \cdot V_{active}} \quad \text{and} \quad P_d = 1 - P_m$$

where ξ is the output decision of the DTD, and $X_{active}$ and $V_{active}$ are the outputs of the activity detectors for the far end and the near end, respectively. The product acts as a logical AND, which ensures that the miss probability is only counted when both NE and FE are active.

A good detection method should maximize Pd while minimizing Pf, even at a low signal-to-noise ratio. In general, a higher Pd is achieved at the cost of a higher Pf, so there is a trade-off that depends on the cost of a false alarm and of a miss. A Receiver Operating Characteristic (ROC) curve is widely used to characterize detection schemes, for example in radar applications. In the ROC curve, Pm is plotted against Pf for different thresholds in order to find a threshold that achieves a certain performance. Also, for a given probability of false alarm, one can plot the probability of miss as a function of the Signal-to-Echo Ratio (SER) or the Signal-to-Noise Ratio (SNR) to evaluate the DTD algorithm under different speech power or environmental noise conditions.
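The two probabilities can be computed from binary decision and activity sequences as in the following MATLAB sketch. The three sequences are made-up toy data with ten samples; in this example the far-end-only part used for Pf consists of the first four samples.

```matlab
% Sketch: false-alarm and miss probabilities from toy decision sequences
xi       = [0 1 0 0 1 1 0 1 0 0];   % DTD output decision (1 = DT declared)
X_active = [1 1 1 1 1 1 1 1 0 0];   % far-end activity detector output
V_active = [0 0 0 0 1 1 1 1 0 0];   % near-end activity detector output

% False alarms are counted on the far-end-only part (v = 0), here samples 1 to 4
far_only = 1:4;
Pf = sum(xi(far_only) .* X_active(far_only)) / numel(far_only);

% Misses are counted only where both ends are active (product acts as logical AND)
both = X_active .* V_active;
Pm = 1 - sum(xi .* both) / sum(both);
Pd = 1 - Pm;
```

Repeating this computation for a range of thresholds yields the (Pf, Pm) pairs that make up the ROC curve.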

CHAPTER III:

OTHER ISSUES

This chapter presents some issues beyond the echo cancellation theory. To decide the length of the adaptive filter, we need to know the length of the actual echo path that the filter is trying to model. Section 3.1 discusses room acoustics and how to measure them. Another important issue is the synchronization between the loudspeaker signal and the microphone signal, which is essential for the echo cancellation process and is covered in section 3.2. Last but not least, noise is always an important consideration for speech and audio applications. The noise sources in hands-free communication are various and are discussed in section 3.3.

3.1 Room Acoustics

The AEC will be tested in real rooms. An AEC that works well in one room, however, might not perform adequately in another, because the acoustics of every room are different. Designers can therefore test an AEC in the type of room it was designed for, but this also means that the user has the responsibility of determining whether the AEC will operate in his or her particular environment. An AEC solution that was designed to operate in an office may not work properly in a conference room. If an echo canceller works in one room and not in another, this is most likely due to a tail length that is too short for the second room.

The tail length of an AEC is the length of time over which it can cancel echoes (in ms). It is directly related to the reverberation time of the room: as the room reverberation time increases, a longer tail length is needed. If the reverberation time is much longer than the tail length, a significant amount of the echo will remain audible.

There are two main factors that affect the reverberation time of a room. They are room

size, and the materials used to construct the walls and objects in the room. Most sound is

absorbed when it strikes walls or other surfaces. If materials are used that absorb sound

well (such as carpet, curtains, or acoustic tile), the reverberation will die out more quickly

than if the room contains mostly reflective materials (hard wood, or glass). If a room is

small, the sound waves will bounce off the walls more frequently, and will be absorbed

more quickly.

3.1.1 Measure the Testing Room acoustics

Since the job of the adaptive filter is to model the room acoustics, it is interesting to have an idea of what the room impulse response looks like. In addition, the adaptive filter should be long enough to cover a sufficient part of the impulse response of the room it is operated in.

The impulse response is the system response to a Dirac pulse, which theoretically has infinite amplitude at one time instant and is zero everywhere else. It is impossible, however, to produce a true Dirac pulse in practice. The room impulse response, namely the acoustic echo path in the room, can be measured in several ways.

As a conceptual method, consider a room with a balloon in it at point p. The balloon pops and makes a "pou" sound, which is similar (due to its short duration) to a Dirac delta, and the output h[n] is the sequence of the damped sound. Here h[n] depends on the location (point p) of the balloon. If we know h[n] at point p of the room, then we know the impulse response of the room at point p, and it is possible to predict the room's response to any sound produced at this point. However, it is not easy to apply this method exactly at our actual recording position. We can approximate it with a sudden loud sound, but this does not give good results.

A second way is to excite the room with sine waves of different frequencies. Using sine waves would give the best results, but it is quite time-consuming.

The third choice is to measure with a white noise signal, which in theory has a flat magnitude spectrum, so that the de-convolution in the time domain can be calculated easily as a division in the frequency domain. Recording white noise has the potential of giving a good result, while it can be realized quite easily using MATLAB and does not require anything other than a computer equipped with a microphone and a loudspeaker. This is the method that we have chosen.
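The de-convolution by division in the frequency domain can be sketched in MATLAB as follows. Instead of a real recording, the microphone signal is simulated by convolving the white noise with a hypothetical, exponentially decaying impulse response `h`; the simulation is noise-free, so a real measurement would be less accurate and would typically need averaging.

```matlab
% Sketch: impulse response estimate from a (simulated) white-noise recording
fs = 8000;                                   % sampling rate in Hz
Lh = 128;                                    % impulse response length: 16 ms at 8 kHz
h  = exp(-(0:Lh-1)'/20) .* randn(Lh, 1);     % hypothetical room impulse response
x  = randn(fs, 1);                           % one second of white noise (loudspeaker)
z  = conv(x, h);                             % simulated microphone recording
Nfft  = 2^nextpow2(length(z));               % long enough to avoid circular wrap-around
H     = fft(z, Nfft) ./ fft(x, Nfft);        % de-convolution as a spectral division
h_est = real(ifft(H));
h_est = h_est(1:Lh);                         % keep the part covered by the adaptive filter
err   = max(abs(h_est - h));
```

The estimated response `h_est` can then be plotted against time to read off how long the adaptive filter needs to be.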

Figure 3.1 Room Impulse Response in the Scream room in NXP Leuven (amplitude versus time in ms; sampling rate = 8 kHz)

Our testing room has a size of 5 m x 6 m with hard walls. As we can see, an adaptive filter length of approximately 16 ms is needed for our tests, namely 128 taps at a sampling rate of 8 kHz.

3.2 Measure of the delay between the loudspeaker and the microphone

Not only the new AES algorithm proposed by Christof Faller and Christophe Tournery, but also the other AEC and AES algorithms require an estimate of the delay between the microphone and loudspeaker signals, since all of these algorithms need the two signals to be synchronized.

To estimate the time delay, the cross-correlation (CC) is used in this thesis. Cross-correlation is the standard way of measuring how strongly two signals are correlated. The correlation is high if the microphone signal is similar to the reference loudspeaker signal, so the maximum of the CC indicates the lag at which the two signals are most correlated.

The cross-correlation can be related to the convolution as:

$$r_{xz}(n) = x^*(-n) * z(n)$$

where the time-reversed complex conjugate of one signal is used in the convolution. The convolution of two signals in the time domain can be carried out as a multiplication in the frequency domain and converted back to a time series by the IFFT. The MATLAB code is as follows:

x_inv = [flipud(x(index)); zeros(fs, 1)];  % time-reverse the loudspeaker segment (column vector) and zero-pad
z_pad = [z(index); zeros(fs, 1)];          % zero-pad the microphone segment to the same length
temp  = fft(x_inv) .* fft(z_pad);          % multiplication in the frequency domain
cc    = abs(ifft(temp));                   % back to the time domain
[value, lag] = max(cc);                    % position of the cross-correlation peak
delay = lag - numel(index);                % delay in samples; a peak at numel(index) means zero delay

To get a relatively accurate estimation, we calculate the delay for several frames and

average the results.
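The frame-wise estimation and averaging can be sketched in MATLAB on synthetic data as follows. The microphone signal is simply a delayed and attenuated copy of a white-noise loudspeaker signal; the true delay `d_true`, the frame length `M` and the number of frames are assumptions for illustration.

```matlab
% Sketch: delay estimation over several frames with averaging (synthetic data)
fs     = 8000;
d_true = 37;                                   % hypothetical delay in samples
x = randn(4*fs, 1);                            % loudspeaker signal
z = [zeros(d_true, 1); 0.5 * x(1:end-d_true)]; % delayed, attenuated microphone signal
M   = fs/2;                                    % frame length in samples (assumption)
nfr = 4;                                       % number of frames to average
delays = zeros(nfr, 1);
for m = 1:nfr
    index = (m-1)*M + (1:M)';
    x_inv = [flipud(x(index)); zeros(fs, 1)];  % time-reversed, zero-padded loudspeaker frame
    z_pad = [z(index); zeros(fs, 1)];          % zero-padded microphone frame
    cc    = abs(ifft(fft(x_inv) .* fft(z_pad)));  % cross-correlation via the FFT
    [~, lag]  = max(cc);                       % position of the correlation peak
    delays(m) = lag - M;                       % a peak at index M corresponds to zero delay
end
delay = round(mean(delays));                   % averaged delay estimate in samples
```

The zero-padding of `fs` samples limits the largest delay that can be detected without wrap-around to one second.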

3.3 Noise issues

During hands-free communication on a PC, a lot of noise may be present and disturb the speech reaching the microphone. The noise problem is of particular concern when the internal microphone of a laptop is used. The amplification of a laptop's internal microphone is usually high so that it can pick up the near-end speech. Hence, a variety of noises, such as the hard disk, the laptop fan, typing on the keyboard, mouse clicks and various ambient noises, are likely to be picked up.

The possible noise sources when using the internal microphone of a laptop are illustrated in Figure 3.2. The hard drive and cooling fan are close to the microphone, so their noise, together with other mechanical sounds, is transmitted to the internal microphone through vibrations. The clicking sounds of the keyboard and the mouse are also major noise sources. Since the keyboard is usually close to the microphone, the typing noise can be quite loud, which makes it the most annoying noise source in this case. The situation improves considerably when a good-quality external microphone is used. During our recordings, the Trust MC-1200 high-sensitivity external microphone lowered the noise floor by 10-13 dB compared to the internal microphone.

Figure 3.2 Common Laptop Noise Sources

3.3.1 Typing noise cancellation based on cross-correlation

As mentioned above, the internal microphone is likely to pick up a lot of noise from sources such as the hard disk and fan of the laptop and from the environment. The typing noise of the keyboard can be the most annoying of all, which motivates finding a way to reduce it. The most direct way is to use an external microphone. There are also dedicated keyboards on the market which use special pads to reduce the typing noise. What we attempt below is to find an algorithm that could be embedded into our AEC (or AES) algorithms to cancel the typing noise.

First we need to know what the typing sound looks like before dealing with it, so we recorded typing with the internal microphone of the laptop. Analysis of the recording shows that each typing sound generally consists of two separate parts, namely a press sound and a trailing release sound, as can be seen in Figure 3.3.

Figure 3.3 Typical look of a typing sound on the keyboard (panels, amplitude versus time in s: one typing sound (press and release); typical press sound for the keyboard of our testing laptop; typical release sound for the keyboard of our testing laptop)

The first idea that comes to mind is to use a cross-correlation algorithm to recognize the typing sounds and then cancel them out. Two masks are needed, one for the press sound and one for the release sound, because the pause between them varies from one keystroke to another. The cross-correlation is calculated with a sliding window and normalized so that it is independent of the input power. Once the cross-correlation value exceeds a certain threshold, as shown in Figure 3.4, a press or release action is considered to be happening, and a scaled mask is subtracted from the input signal.

The way the mask is scaled is important. The projection operation gives a good estimate of one vector in the direction of another, as defined by:

$$\mathrm{proj}_B A = \frac{\langle A, B \rangle}{\langle B, B \rangle} \cdot B$$

where $\langle \cdot , \cdot \rangle$ denotes the inner product of two vectors:

$$\langle a, b \rangle = \sum_{i=1}^{n} a_i b_i \quad \text{or} \quad \langle a, b \rangle = a^T b$$

The estimated typing noise is thus calculated as the projection of the actual microphone signal onto the mask whenever typing is detected.
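The scaling by projection can be sketched in MATLAB with a toy example. The five-sample mask and the microphone segment are made-up vectors; in practice both would be as long as a press or release sound.

```matlab
% Sketch: scaling a typing-noise mask by projection (toy vectors)
mask = [0; 1; -1; 0.5; -0.25];                 % hypothetical press mask
seg  = 0.6 * mask + 0.01 * [1; -1; 1; -1; 1];  % microphone segment: scaled mask plus a small deviation
scale     = (seg' * mask) / (mask' * mask);    % <A,B> / <B,B>
noise_est = scale * mask;                      % projection of the segment onto the mask
residual  = seg - noise_est;                   % signal after subtracting the scaled mask
ncc = (seg' * mask) / (norm(seg) * norm(mask));  % normalized cross-correlation for detection
```

By construction the residual is orthogonal to the mask, so whatever remains after subtraction is the part of the keystroke that the fixed mask cannot represent.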

As can be observed from the residual signal, the typing noise cannot be cancelled completely and remains audible. This is because individual keystrokes can differ somewhat from the mask, which may indicate that there is no linear model for the typing noise. Hence, in the following we try to model the noise with an LMS adaptive filter.

The input to the LMS system is a pulse train that marks the typing actions, which can be generated by the cross-correlation method described above. If a linear time-invariant model exists for the typing noise, the cancellation should perform well once the filter has converged, and otherwise not.

Through simulation we observed that the typing noise estimate generated by the LMS filter matches the input signal well only occasionally; in other cases it leads or lags the input. Hence, we conclude that no linear time-invariant model is applicable to typing noise cancellation, i.e. the noise varies with time. A much more sophisticated method is required, which is beyond the scope of this thesis.

Figure 3.4 Typing noise cancellation (panels, versus samples: recorded typing noise z(t); cross-correlation result between z(t) and press mask; press flag; cross-correlation result between z(t) and release mask; release flag; residual typing noise)

3.3.2 High Pass Filtering

Analysis of the recorded noise shows that most noise components, including the typing noise, are concentrated in the low-frequency part of the spectrum. A high-pass filter is therefore an efficient way to reduce the noise. A second-order Butterworth filter with a cutoff frequency of 200 Hz is adopted.
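Such a filter can be sketched in MATLAB without any toolbox by applying the bilinear transform to the second-order Butterworth high-pass prototype. The sampling rate of 8 kHz and the two-tone test signal are assumptions; with the Signal Processing Toolbox, coefficients of this kind are normally obtained from `butter(2, fc/(fs/2), 'high')`.

```matlab
% Sketch: second-order Butterworth high-pass filter at 200 Hz (fs assumed)
fs = 8000;                                   % sampling rate in Hz (assumption)
fc = 200;                                    % cutoff frequency in Hz
K   = tan(pi * fc / fs);                     % pre-warped cutoff for the bilinear transform
nrm = 1 / (1 + sqrt(2)*K + K^2);             % normalization so that a(1) = 1
b   = nrm * [1, -2, 1];                      % numerator: double zero at DC
a   = [1, 2*(K^2 - 1)*nrm, (1 - sqrt(2)*K + K^2)*nrm];  % denominator coefficients
t = (0:fs-1)' / fs;
x = sin(2*pi*50*t) + sin(2*pi*1000*t);       % low-frequency disturbance plus a 1 kHz tone
y = filter(b, a, x);                         % high-pass filtered signal
zc = exp(-1j * 2*pi*fc/fs);                  % z^-1 evaluated at the cutoff frequency
gain_fc = abs((b(1) + b(2)*zc + b(3)*zc^2) / (a(1) + a(2)*zc + a(3)*zc^2));
```

The gain at the cutoff frequency is 1/sqrt(2), i.e. -3 dB, while the 50 Hz component is strongly attenuated and the 1 kHz tone passes almost unchanged.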

CHAPTER IV

EVALUATION

So far, we have studied and discussed two echo removal methods: the Acoustic Echo Canceller (AEC) and the Acoustic Echo Suppressor (AES). Most AEC products are based on the adaptive LMS or NLMS digital filter, a well-defined algorithm that has been used for years. To achieve larger echo attenuation without the help of other devices such as a Nonlinear Processor, the Acoustic Echo Suppressor based on spectral subtraction is a good option. Instead of combining NLMS with an FFT, the AES can also adopt a simplified echo path in the frequency domain, which has even lower computational complexity. To deal with the annoying situation during double talk, three Double Talk Detection methods are presented: the Geigel, Normalized Cross-correlation and Variance Impulse Response algorithms. In the following sections, the performance of all these methods is examined and compared using MATLAB simulations.

Many parameters occur in the algorithms, e.g., the learning rate, the safety constant and the suppression factor in the AES. They all affect the performance of the AEC or AES in some way. Hence, parameter tuning plays an important role in the simulation process when a certain target performance is to be achieved.

The evaluation of the algorithms is primarily based on the ERLE they achieve, since echo attenuation is the goal of an AEC. When different algorithms achieve similar ERLE, the initial convergence time and the near-end attenuation (NEA) become more important. In real hands-free communication, the volumes of the near-end and far-end speech vary widely. Hence, it is necessary to check how each algorithm reacts to changes in the Signal-to-Echo Ratio (SER), which is defined as the ratio between the power of the near-end speech and the power of the echo signal. Noise has long been an important issue for audio and speech systems, so the performance under different noise strengths also needs to be evaluated for each algorithm.
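The two power ratios used throughout this chapter can be sketched as follows. This is a minimal Python illustration, not the MATLAB code of this work. It assumes the usual definition of ERLE as the ratio of microphone-signal power to residual power during FE single talk, and the SER definition given above. The function names and the small constant `eps` are hypothetical.

```python
import math

def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def power_ratio_db(num, den, eps=1e-20):
    """Ratio of two average powers in dB; eps guards against log of zero."""
    return 10.0 * math.log10((mean_power(num) + eps) / (mean_power(den) + eps))

def erle_db(mic, residual):
    """ERLE: microphone (echo) power over residual power, FE single talk only."""
    return power_ratio_db(mic, residual)

def ser_db(near_end, echo):
    """SER: near-end speech power over echo power."""
    return power_ratio_db(near_end, echo)

# Toy signals: the residual has a tenth of the echo amplitude,
# the near-end speech has twice the echo amplitude.
echo = [0.2, -0.2, 0.2, -0.2]
residual = [0.02, -0.02, 0.02, -0.02]
near_end = [0.4, -0.4, 0.4, -0.4]
erle = erle_db(echo, residual)   # about 20 dB
ser = ser_db(near_end, echo)     # about 6 dB
```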


4.1 Requirements for AEC

The performance evaluation of an AEC (or AES) solution is based on specifications and listening tests. As discussed in section 1.3, several measures exist for the evaluation of an AEC. The International Telecommunication Union (ITU) has specified criteria for a number of performance characteristics of an AEC, including the rate of convergence, the amount of cancellation and the bandwidth. Although these criteria are necessary, they are not sufficient to determine whether an AEC is good enough, since the performance of an AEC is quite sensitive to location and noise, and a specification can only cover certain test environments. Hence, evaluation through auditory tests is necessary for a given application. In the end, how an AEC sounds is the final criterion.

4.2 Speech Stimuli

In hands-free communication systems, the input signal is primarily speech, and the output signal consists of speech disturbed by noise and other speech signals. Speech has highly time-varying characteristics. It is not stationary, but can be approximated as stationary over short time intervals. Speech is sometimes quasi-periodic (e.g., vowels) and sometimes noise-like (e.g., fricatives) or impulse-like (e.g., plosives). Speech also contains pauses. Speech signals are wide-band, with frequency content ranging from 100 Hz to more than 8 kHz. According to the sampling theorem, the audio signals (with frequencies between 300 Hz and 3400 Hz) should be sampled at a frequency equal to or greater than 6800 Hz (2 x 3400 Hz). In practice, telephone applications usually use a sampling rate of 8 kHz. The most popular choices for VoIP are 8 kHz and 16 kHz. A higher sampling rate improves the speech quality but also requires more bandwidth. Throughout our simulations, a sampling frequency of 8 kHz is used. In all, speech provides a non-persistent excitation for the adaptive filters used in AEC.

The speech stimuli consist of two channels: the channel to be played on the NE speaker (male voice) and the channel to be played on the FE (female voice). The recorded signals consist of three segments, FE single talk, double talk (DT) and NE single talk, which are used to examine the achieved ERLE and the performance during DT. Each segment has a length of 15 seconds, and there are pauses of 1 second in between.

[Figure: two waveform panels against Time (s), 0 to 50 s: FE and NE, with the segments FE single talk, Double talk and NE single talk marked]

Figure 4.1 Speech stimuli segmentation

The recording is made with a DELL Latitude-D600 laptop. The near-end setup can differ from one laptop user to another. Different configurations of internal or external microphones and internal or external loudspeakers differ in the nominal Signal-to-Echo Ratio (SER) and Signal-to-Noise Ratio (SNR). The SER is defined as the ratio of the power of the near-end speech (NES) to the power of the echo in the recorded signal. The nominal SER is obtained from a recording made at a common perceived loudness of the NE and FE speech. When an external microphone is used, a higher nominal SER and SNR can be achieved than with the internal microphone at the same loudspeaker positions.

4.3 Acoustic Echo Canceller

Firstly, the AEC based on a simple NLMS-adapted FIR filter is evaluated through parameter tuning. The learning rate is always an important parameter for NLMS, since it controls the convergence speed. Another parameter, the safety constant discussed in section 2.3.4, is added to the denominator of the NLMS coefficient adaptation equation to avoid divergence. In order to find a proper value for the safety constant, the ERLE is plotted against the safety constant for different learning rates, as shown in Figure 4.2. The simulation is performed on the signal recorded at a common perceived loudness of the NE and FE speech, using the external microphone and external speaker setup. The nominal SER in this case is 5 dB. The length of the filter is 128.
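To make the roles of the learning rate and the safety constant concrete, the following Python sketch shows a textbook NLMS update with a constant added to the normalisation term. It is an illustration under stated assumptions, not the MATLAB or MEX code of this work. The exact scaling of the learning rate in section 2.3.4 may differ, and the short echo path and white-noise far-end signal in the demo are hypothetical.

```python
import random

def nlms(far_end, mic, num_taps=128, learning_rate=0.5, safety=0.1):
    """Return (residual, weights) of an NLMS-adapted FIR echo canceller."""
    w = [0.0] * num_taps
    buf = [0.0] * num_taps              # recent far-end samples, newest first
    residual = []
    for x_t, z_t in zip(far_end, mic):
        buf = [x_t] + buf[:-1]
        y_hat = sum(wi * xi for wi, xi in zip(w, buf))   # echo estimate
        e = z_t - y_hat                                  # residual signal
        energy = sum(xi * xi for xi in buf)
        # The safety constant keeps the step bounded when the energy is small.
        step = learning_rate / (energy + safety)
        w = [wi + step * e * xi for wi, xi in zip(w, buf)]
        residual.append(e)
    return residual, w

# Demo: identify a hypothetical 4-tap echo path from white noise.
random.seed(0)
echo_path = [0.5, -0.3, 0.2, 0.1]
far_end = [random.gauss(0.0, 1.0) for _ in range(3000)]
mic = [sum(h * far_end[t - k] for k, h in enumerate(echo_path) if t - k >= 0)
       for t in range(len(far_end))]
residual, weights = nlms(far_end, mic, num_taps=8)
```

In this noise-free, linear toy case the weights converge to the echo path. With real recordings, the limited filter length, nonlinearities and near-end speech bound the achievable ERLE.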

[Figure: ERLE (dB) versus Safety constant (0 to 1), with curves for Learning rate = 1.0, Learning rate = 0.5 and Learning rate = 0.1]

Figure 4.2 How ERLE changes with the safety constant (the three lines correspond to different learning rates)

It is observed that the ERLE has a peak value, which is reasonable. If the safety constant is negligible, the step size is large, and the filter may over-adapt and take longer to reach the optimum. Conversely, if the step size is significantly lowered by the safety constant, the adaptation also takes more steps to approach the optimum. In both cases, the ERLE is lower as a consequence of the slower adaptation. Hence, the ERLE peaks where the filter coefficients reach the optimum in the minimum number of steps.


Next, the learning rate is tuned in a similar manner. The result is illustrated in Figure 4.3. For the same reason as for the safety constant, higher learning rates result in over-adaptation and lower values result in overly fine adaptation; both slow down the convergence. There is an optimal value of α which leads to the fastest convergence.

[Figure: ERLE (dB) versus Learning rate (2 x Alpha), 0 to 2]

Figure 4.3 The effect of Learning rate on ERLE (safety constant = 0.1)

The simulation results in Figure 4.4 show that the NLMS-based AEC can only achieve a limited amount of attenuation. One reason is that the adaptive filter can never model the echo path impulse response completely because of its limited number of filter taps. Another important reason is that NLMS assumes a linear echo path, whereas in reality the loudspeaker-room-microphone path is nonlinear. The nonlinearities come from saturation effects in the amplifiers and loudspeakers. In listening tests, the residual echo is still audible. During double talk, although the influence on the near-end signal is small (1 to 2 dB near-end attenuation), the remaining echo is still significant. Figure 4.4 also demonstrates the improvement brought by the safety constant. Without it, the filter becomes unstable when DT starts, so that a lot of jitter is observed.

[Figure: three waveform panels against Samples (0 to 4 x 10^5): Recorded signal z(t); Without Safety constant; Safety constant = 0.1]

Figure 4.4 Echo Cancellation Result of NLMS AEC under nominal SER (learning rate = 1; note that the y-axis of the panel without safety constant has a larger scale)

4.4 Acoustic Echo Suppressor Based on Spectral Subtraction

As seen above, the time-domain AEC can only achieve a very low ERLE. Hence, an echo suppression filter is added after the NLMS-based echo canceller. The echo estimate from the NLMS algorithm is transformed to the frequency domain using a Hann window and subtracted there. Over-suppression can be applied to gain a higher ERLE. An alternative is the simplified echo path, which only estimates the magnitude spectrum of the echo signal and leaves out the phase information to reduce the computational load.

4.4.1 NLMS-Based AES

There are a variety of parameters in an NLMS AES that adjust the performance, as discussed in section 2.2, including the learning rate and the safety constant of the NLMS process, the exponent α and the echo suppression ratio β, and the smooth factor of the gain function. All of them face a trade-off between echo rejection and speech distortion. In the following paragraphs, each parameter is tuned individually to investigate its effect. The test signal is still the nominal recording with 5 dB SER, made with the external microphone and external loudspeakers. With this recording, it is found that an ERLE of 30 dB is the minimum required to make the echo in the FE single talk section acceptable.

As discussed in the last section, the learning rate of the NLMS adaptive filter influences the ERLE as well as the NEA during DT. As observed in Figure 4.5, the larger the learning rate, the higher the ERLE and the NEA. In fact, the adaptation speed of the NLMS filter is not as crucial for the AES as it is for the AEC. Other parameters can tune the initial convergence time and the ERLE, e.g. the suppression ratio β and the smooth factor of the gain function. Hence, unlike the NLMS AEC, which uses a learning rate of 1, a small learning rate is chosen here to ensure a low NEA.

[Figure: ERLE (dB) and NEA (dB) versus Learning rate, 0.1 to 0.9]

Figure 4.5 The effect of learning rate on the ERLE and the NEA (α = 1, β = 25, safety factor = 0.1, smooth = 0.1)

The smooth factor, as introduced in section 2.21, controls the fluctuation of the suppression gain function. Its influence on the ERLE and the NEA is shown in Figure 4.6. Larger smoothing (a smaller smooth factor) results in a flatter suppression gain, which attenuates both the echo and the NES during DT. As observed in Figure 4.7, large smoothing (smooth factor = 0.01) strongly suppresses the NES during DT: the residual speech sounds natural but has a very low volume. Low smoothing (smooth factor = 0.99) results in a severe clicking effect during DT, which sounds artificial. Hence, a smooth factor in the middle range should be chosen. We use a smooth factor of 0.3.

[Figure: ERLE (dB) and NEA (dB) versus Smooth factor of the Gain function, 0 to 1]

Figure 4.6 The effect of the smooth factor on the ERLE and the NEA (α = 1, β = 25, Learning rate = 0.2, safety factor = 0.1)

[Figure: three waveform panels against Samples (0 to 3.5 x 10^5): Recorded signal z(t); Residual signal with smooth=0.01; Residual signal with smooth=0.99]

Figure 4.7 The effect of smooth factor on the NES during DT

Recalling the gain function of the AES,

$$G_i(f) = \left( \frac{\max\left( Z(f)^{\alpha} - \beta\,\hat{Y}(f)^{\alpha},\; 0 \right)}{Z(f)^{\alpha}} \right)^{1/\alpha},$$

to find the optimal values of α and β, we use the minimization function in MATLAB to return, for a given α, the β that minimizes the squared difference between the achieved ERLE and the desired value. In this way, for a series of given α values we obtain the corresponding β values that achieve 30 dB ERLE, as shown in Figure 4.8. We choose α = 1, which brings the least computational load, so that β needs to be at least 27 to achieve 30 dB ERLE.
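The gain function recalled above can be sketched per frequency bin as follows. This is a Python illustration, not the implementation of this work. It assumes that Z(f) and Ŷ(f) denote magnitude spectra. The first-order recursive form of the smoothing is also an assumption, chosen so that a smaller smooth factor gives a flatter gain, as described for Figure 4.6. The function names are hypothetical.

```python
def suppression_gain(z_mag, y_hat_mag, alpha=1.0, beta=27.0):
    """Spectral-subtraction gain for one bin, floored at zero."""
    if z_mag <= 0.0:
        return 0.0
    numerator = max(z_mag ** alpha - beta * y_hat_mag ** alpha, 0.0)
    return (numerator / z_mag ** alpha) ** (1.0 / alpha)

def smooth_gain(previous_gain, new_gain, smooth=0.3):
    """Assumed recursive smoothing: a small smooth factor follows the old gain."""
    return smooth * new_gain + (1.0 - smooth) * previous_gain

# With alpha = 1 and beta = 27, a bin whose echo estimate is 1% of the
# microphone magnitude keeps 73% of its amplitude.
g_weak_echo = suppression_gain(1.0, 0.01)
# A bin dominated by the echo estimate is removed completely.
g_strong_echo = suppression_gain(1.0, 0.5)
```

With α = 1 no powers or roots have to be evaluated per bin, which is consistent with the remark above that this choice brings the least computational load.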

[Figure: beta versus alpha for alpha from 0.2 to 2, with a zoomed-in panel for alpha from 0.2 to 1]

Figure 4.8 Alpha – Beta values to achieve 30dB ERLE (Learning rate = 0.2, safety constant = 0.1, smooth factor = 0.3)

The simulation workload is quite high. To speed up the simulation, which is slowed down by the slow execution of for-loops in MATLAB, the NLMS is implemented as a MEX function, which is more than 10 times faster.

Simulations and listening tests show that the NLMS AES achieves a much higher ERLE than the NLMS AEC, at the cost of additional computational load. During the DT period, the echo is also inaudible, yet the NES is strongly affected, as seen in Figure 4.7. Some portions of the NES are attenuated more than others because of sharp changes in the suppression gain. This results in discontinuities in the residual signal during DT.

4.4.2 Coloration-Effect-Filter-Based AES

As introduced in section 2.3.5, the AES based on a simplified echo path magnitude spectrum has much lower computational complexity.

Recalling the equations:

$$G_v(i,k) = \frac{a_{12}(i,k)}{a_{22}(i,k)}$$

and

$$a_{12}(i,k) = \sigma \cdot E\{X_d^{*}(i,k)\,Z(i,k)\} + (1-\sigma)\,a_{12}(i,k-1)$$

$$a_{22}(i,k) = \sigma \cdot E\{X_d^{*}(i,k)\,X_d(i,k)\} + (1-\sigma)\,a_{22}(i,k-1),$$

here σ plays a role similar to the learning rate in the NLMS algorithm, controlling the adaptation speed of the coloration-effect filter. If it is too large, the attenuation of both the echo and the NES is high, to the point that the NES is no longer audible. If it is too small, the initial convergence is too slow, as shown in Figure 4.9. The initial convergence time is defined as the time it takes for the echo to become totally silent. The coloration-effect filter also suffers from the divergence problem when NES is present. Hence, a constant is added to the denominator in the same way as in the NLMS algorithm:

$$G_v(i,k) = \frac{a_{12}(i,k)}{a_{22}(i,k) + c}$$

A large constant smooths the variation of the filter taps and slows down the convergence significantly, but brings less attenuation to the NES during DT, as illustrated in Figure 4.10. According to the simulation and listening results, a sigma of 0.05 and a safety constant of 0.01 are chosen to ensure relatively fast convergence and small attenuation of the NES.
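The recursive estimate of the coloration-effect filter with the constant c in the denominator can be sketched for a single frequency bin as follows. This Python illustration makes several assumptions: the expectation is replaced by the instantaneous product of the current frame, the conjugate is placed on the far-end spectrum, and any delay applied to the far-end spectrum is left out. The function name and the constant test spectra are hypothetical.

```python
def update_coloration_gain(a12, a22, x_bin, z_bin, sigma=0.05, c=0.01):
    """One frame update of the smoothed cross- and auto-spectrum and the gain."""
    a12 = sigma * (x_bin.conjugate() * z_bin) + (1.0 - sigma) * a12
    a22 = sigma * (abs(x_bin) ** 2) + (1.0 - sigma) * a22
    gain = a12 / (a22 + c)      # c > 0 also avoids a division by zero
    return a12, a22, gain

# Demo: the microphone bin is the far-end bin scaled by 0.5 (pure echo).
x_bin = complex(1.0, 1.0)
z_bin = 0.5 * x_bin
a12, a22, gain = 0j, 0.0, 0j
for _ in range(500):
    a12, a22, gain = update_coloration_gain(a12, a22, x_bin, z_bin)
```

In this stationary toy case the gain settles at 0.5 * 2 / (2 + 0.01), slightly below the true echo path gain of 0.5, which shows how the constant trades a small bias for robustness.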

[Figure: three waveform panels against Samples (0 to 5 x 10^4): Recorded Echo; Echo Residual with sigma=0.01; Echo Residual with sigma = 0.05; annotated "Slow convergence" and "Fast convergence"]

Figure 4.9 Different convergence time corresponding to different sigma (α = 1, β = 30; note that the axis of the recorded echo has a larger scale)

[Figure: ERLE (dB) and NEA (dB) versus Sigma (0 to 0.8), with curves for c = 0, c = 0.0001, c = 0.001 and c = 0.01]

Figure 4.10 The influence of the sigma to the ERLE and NEA with different safety constant values (α = 1, β = 30)

Then the optimal values of α and β to achieve 30 dB ERLE are computed again in the same way as for the NLMS AES, as shown in Figure 4.11. When α = 1, β = 27 is required.

[Figure: beta versus alpha for alpha from 0.2 to 2, with a zoomed-in panel for alpha from 0.2 to 1]

Figure 4.11 Alpha – Beta values to achieve 30dB ERLE (Sigma = 0.05, Safety constant = 0.01, smooth factor = 0.3)

To draw a conclusion at this point, the performance of the AES is superior to that of the AEC. The AES achieves a higher ERLE and is able to eliminate the echo completely. During DT, although the NES is affected as well, the background echo is no longer distinct. Simulations and listening tests show that the coloration-effect filter AES achieves results similar to those of the NLMS AES with much lower computational complexity. However, its convergence is slower than that of the NLMS AES.

4.5 DTD Performance Evaluation

All the algorithms discussed above suffer from the DT problem. The NES is attenuated to a greater or lesser extent by the different methods, especially by the AES, and many discontinuities occur during DT. Hence, we introduce Double Talk Detectors (DTD) into the AES. When DT is detected, the adaptation of the filter is frozen for a hold time and a lower β is adopted to reduce the NEA. A clear voice with some residual echo is preferable to not being able to hear the speaker properly. A longer hold time protects the NES better, but leads to a long recovery time from DT to FE single talk, which may bring in a boost of the echo. The hold time is normally chosen to be tens of milliseconds; a hold time of 32 ms is used throughout this thesis. As stated in section 2.4.5, the performance of a DTD is evaluated by the probability of false alarm (Pf) during FE single talk and the probability of miss (Pm) during DT, as defined in Table 4.1. Based on the computations of Pf and Pm, a plot referred to as the Receiver Operating Characteristic (ROC) curve is adopted as a criterion for comparing different algorithms. In the ROC curve, the probability of false alarm is plotted against the probability of miss as the threshold is varied. This curve tells us which threshold achieves a certain DTD performance in terms of Pf and Pm, and vice versa. For example, we can find the threshold corresponding to a Pf of 0.1. A typical ROC curve is illustrated in Figure 4.12. It shows the trade-off between correct detections and false ones. The smaller the area enclosed by the ROC curve, the better the DTD performance, since a small area indicates that low Pf and low Pm can be attained at the same time.

          | FE single talk | DT
DTD = 0   | Correct        | Miss
DTD = 1   | False alarm    | Correct

Table 4.1 Definition of False alarm and Miss

Figure 4.12 Typical ROC curve illustration
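The definitions in Table 4.1 translate into a short computation. The following Python sketch, which is an illustration and not the evaluation code of this work, counts false alarms over the FE single talk samples and misses over the DT samples for one threshold, and sweeps the threshold to obtain the points of an ROC curve. It assumes that DT is declared when the detection statistic exceeds the threshold. For a statistic that drops during DT, the comparison has to be reversed. The sample values are hypothetical.

```python
def false_alarm_and_miss(statistic, is_double_talk, threshold):
    """Return (Pf, Pm) for one threshold.

    statistic      : detection statistic per sample
    is_double_talk : True for DT samples, False for FE single talk samples
    """
    false_alarms = misses = n_single = n_double = 0
    for value, double_talk in zip(statistic, is_double_talk):
        detected = value > threshold
        if double_talk:
            n_double += 1
            misses += 0 if detected else 1
        else:
            n_single += 1
            false_alarms += 1 if detected else 0
    return false_alarms / n_single, misses / n_double

def roc_points(statistic, is_double_talk, thresholds):
    """(Pf, Pm) pairs obtained by sweeping the threshold."""
    return [false_alarm_and_miss(statistic, is_double_talk, t) for t in thresholds]

# Toy example: three FE single talk samples followed by three DT samples.
statistic = [0.1, 0.4, 0.2, 0.3, 0.8, 0.9]
labels = [False, False, False, True, True, True]
points = roc_points(statistic, labels, [0.25, 0.5])
```

In this toy data the lower threshold gives one false alarm and no miss, while the higher threshold gives no false alarm and one miss, which is the trade-off the ROC curve displays.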


4.5.1 Geigel DTD

We first implemented the simple Geigel scheme. The most important parameter of a DTD is the threshold. The threshold marked in Figure 4.13 lies above the whole FE single talk segment, so that the attenuation there is uniform. Because different suppression factors are used for FE single talk and double talk in the AES, a false alarm during the FE single talk segment results in a sudden boost of the residual echo, which sounds annoying. Hence, a low probability of false alarm is required. However, part of the DT situations are then left undetected.

[Figure: Decision variable for Geigel (dB) versus Samples (0 to 4 x 10^5), with a possible threshold marked]

Figure 4.13 The detection statistic for Geigel algorithm for the nominal recording with external microphone and loudspeakers

A higher threshold reduces the probability of false alarm at the price of a higher probability of miss during DT; a lower threshold yields more correct DT detections, yet may cause many false detections and thereby reduce the ERLE. The ROC curve of the Geigel algorithm is shown in Figure 4.14.

As stated in section 2.4.2, the Geigel DTD operates by comparing the magnitude of the received signal with that of the far-end signal. Recalling the equation:

$$\xi = \frac{|z(t)|}{\max\left( x(t),\, x(t-1),\, \ldots,\, x(t-N+1) \right)}$$


which shows that it works well when the NES is much stronger than the FES, namely when the SER is large, as illustrated in Figure 4.15. The SER is changed by adjusting the NES while keeping the nominal FES. The probability of false alarm is almost constant because the strength of the echo is kept the same. As discussed before, the probability of false alarm is required to be as low as possible, so the threshold is chosen to attain zero probability of false alarm. When the SER is low, the chance of missing DT becomes higher, because most of the decision variable falls below the threshold due to the small NES.

Overall, the Geigel algorithm only works well under the assumptions that the echo path attenuates the FES and that the NES is sufficiently strong. Hence, Geigel is not a strong candidate in reality, where both the echo path and the NES are unknown.
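The Geigel decision variable can be sketched in a few lines of Python. This is an illustration, not the MATLAB code of this work. It takes the magnitudes of the last N far-end samples, as in the usual statement of the Geigel detector. The window length, the threshold and the small constant guarding the division are hypothetical choices.

```python
def geigel_statistic(mic, far_end, window=128, eps=1e-12):
    """|z(t)| divided by the largest far-end magnitude in the last `window` samples."""
    stats = []
    for t, z_t in enumerate(mic):
        start = max(0, t - window + 1)
        peak = max(abs(v) for v in far_end[start:t + 1])
        stats.append(abs(z_t) / (peak + eps))
    return stats

# Toy example with a two-sample window: the last microphone sample is much
# larger than the recent far-end samples, which indicates near-end speech.
far_end = [1.0, -0.5, 0.25, 0.1]
mic = [0.1, 0.1, 0.1, 2.0]
stats = geigel_statistic(mic, far_end, window=2)
threshold = 0.5                     # hypothetical value
double_talk = [s > threshold for s in stats]
```

The sketch makes the limitation discussed above visible: the statistic only exceeds the threshold when the microphone level is high relative to the far-end level, so weak near-end speech goes undetected.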

[Figure: ROC curve, with axes Probability of False alarm and Probability of Miss, both from 0 to 1]

Figure 4.14 ROC curve of Geigel Algorithm under nominal SER

[Figure: Probability of miss and Probability of false alarm versus SER (dB), -5 to 10 dB]

Figure 4.15 Probability of miss decreases as the SER is increased by enlarging the amplitude of near-end speech (Pf = 0)

4.5.2 NCR DTD

The cheap-NCR algorithm is normally adopted for its efficient calculation. Recalling from section 2.4.3, the detection statistic of cheap NCR is calculated as:

$$\xi = \frac{\sigma_{\hat{y}}^{2}(t)}{\sigma_{z}^{2}(t)}$$


which is the ratio between the power of the estimated echo and the power of the microphone signal. Since it needs the time-domain echo estimate from the adaptive filter, it is easier to apply to the NLMS algorithm than to the simplified echo path. The convergence speed of the NLMS filter also needs to be lower, to slow down the erroneous adaptation during DT and thereby keep the detection statistic valid. As shown in Figure 4.16, a smaller learning rate gives a better result.
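A minimal Python sketch of this statistic is given below for illustration. It is not the implementation of this work. The powers are estimated over a rectangular sliding window, which is an assumption, since a recursive estimate would serve equally well. The function name, the toy signals and the constant `eps` are hypothetical.

```python
def cheap_ncr_statistic(echo_estimate, mic, window=128, eps=1e-12):
    """Ratio of estimated-echo power to microphone power over a sliding window."""
    stats = []
    p_y = p_z = 0.0
    for t in range(len(mic)):
        p_y += echo_estimate[t] ** 2
        p_z += mic[t] ** 2
        if t >= window:                       # drop the sample leaving the window
            p_y -= echo_estimate[t - window] ** 2
            p_z -= mic[t - window] ** 2
        stats.append(p_y / (p_z + eps))
    return stats

# Toy example: a perfect echo estimate; near-end speech starts at sample 150.
n = 300
echo = [0.1 if t % 2 == 0 else -0.1 for t in range(n)]
near_end = [0.0 if t < 150 else (0.3 if (t // 2) % 2 == 0 else -0.3) for t in range(n)]
mic = [e + s for e, s in zip(echo, near_end)]
stats = cheap_ncr_statistic(echo, mic)
```

During FE single talk the statistic stays close to one. Once near-end speech is present the microphone power grows and the statistic falls, here to about 0.1, so DT is declared when the statistic drops below the threshold.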

The ROC curve is drawn in Figure 4.17 to look for the threshold that achieves a low probability of false alarm. The performance of the NCR also depends on the window size. The larger the window, the more precise the power calculation and hence the better the detection, as Figure 4.17 also shows. At the same probability of false alarm, the ROC curve with a window size of 800 improves the probability of miss by 10% over the one with a window size of 128. However, a larger window means longer computation time and a larger output delay, while real-time communication demands low processing delay.

Next, the variation of the probabilities against the SER is evaluated, as illustrated in Figure 4.18. The threshold is chosen to attain zero probability of false alarm and is fixed for all SER values. The probability of miss increases as the near-end signal gets weaker, because the detection statistic is then larger during double talk, so that fewer DT events are detected. Compared to the Geigel algorithm in Figure 4.15, the probability of miss of the cheap-NCR DTD varies less with the SER, at a similar probability of false alarm and the same window size. In all, the NCR DTD is a more reliable method than the Geigel DTD, yet it requires slower adaptation and a longer processing delay to obtain better detection.

[Figure: ROC curves (Probability of miss axis, 0 to 1) for Learning rate = 1.0, Learning rate = 0.2 and Learning rate = 0.1]

Figure 4.16 ROC curves of cheap-NCR DTD for NLMS AES with different Learning rate under nominal SER

[Figure: ROC curves (Probability of miss axis, 0 to 1) for Window size = 128 and Window size = 800]

Figure 4.17 ROC curve of cheap-NCR DTD with different window size under nominal SER (learning rate = 0.1)

[Figure: Probability of miss versus SER (dB), -5 to 10 dB]

Figure 4.18 Probability of miss increases as the SER is decreased by reducing the amplitude of near-end speech (learning rate = 0.1, window size = 128)

4.5.3 VIRE DTD

The VIRE DTD uses the maximum value of the adaptive filter as a measure of the fluctuations, as presented in section 2.4.4. The variations of the adaptive filter taps are high for both the NLMS filter and the coloration-effect filter, so the method can be applied to both algorithms. The faster the filter adapts, the larger the variations of the filter coefficients. A large learning rate makes the VIRE DTD work extremely well, as observed in Figure 4.19 and Figure 4.20. Although a high learning rate brings a large NEA, as shown in Figure 4.5, the DTD now protects the NES from being over-attenuated, so a large learning rate can be adopted to ensure good DTD performance. Hence, one advantage of the VIRE DTD is its inherently fast adaptation.

The ROC curves in Figures 4.19 and 4.20 also show that the VIRE DTD can achieve a very low probability of false alarm and a very low probability of miss at the same time for both the NLMS and the coloration-effect filter AES, especially for the latter. The detection statistic of the VIRE algorithm for the coloration-effect filter AES in Figure 4.21 makes it possible to draw a threshold that separates single talk and double talk completely, unlike the Geigel and NCR algorithms. The detection statistic during double talk is much larger than during single talk, which leads to the excellent performance of the VIRE DTD.

[Figure: ROC curves (Probability of miss axis, 0 to 1) for Learning rate = 1.0, Learning rate = 0.2 and Learning rate = 0.1]

Figure 4.19 ROC curves of VIRE DTD for NLMS AES with different Learning rate under nominal SER (window size = 128)

[Figure: ROC curves (Probability of miss axis, 0 to 1) for sigma = 0.05 and sigma = 0.01]

Figure 4.20 ROC curves of VIRE DTD for the Coloration-effect filter AES with different Learning rate under nominal SER (window size = 128)

[Figure: Detection Statistic in dB unit versus Samples (0 to 4 x 10^5), with the segments Double Talk, FE Single Talk and NE Single Talk labelled and a possible threshold marked]

Figure 4.21 Detection decision obtained using the VIRE DTD for the Coloration-effect filter AES under nominal SER

After tuning the SER by adjusting the power of the NES, similar results are obtained for the NLMS AES and the coloration-effect filter AES. Although the performance of the VIRE DTD is excellent under the nominal SER, it varies dramatically with the SER, as we can see in Figure 4.22. When the echo is much larger than the near-end speech, the near-end signal is weak during DT. After the NLMS filter has converged during FE single talk, the filter weights are not influenced much by the NES during DT, so the variation of the filter coefficients is not large enough for the situation to be detected as DT, which leads to a high probability of miss. The VIRE DTD behaves in a similar manner in the coloration-effect filter AES. The DTD performance is quite stable over a certain SER range, as illustrated in Figure 4.23. A lower threshold covers a larger SER operating range, because it allows a weaker NES during DT to still be separated from the FE single talk, as implied in Figure 4.21.

In conclusion, the VIRE DTD achieves the best detection performance among all the algorithms. In particular, the VIRE DTD embedded in the coloration-effect filter AES provides the most promising results. In addition, compared to the cheap-NCR algorithm, the VIRE DTD has the advantage of fast adaptation and convergence. However, the VIRE DTD is quite sensitive to the strength of the NES: it is hard to detect DT when the NES is too weak. This limits the applicability of the method.

[Figure: Probability of miss versus SER (dB), -5 to 10 dB]

Figure 4.22 Probability of miss of the VIRE DTD in the NLMS AES varies dramatically with near-end speech (Pf = 0)

[Figure: Probability versus SER (dB), -5 to 15 dB; legend entry "Probability of miss when T = 0.03" (other legend entries not recoverable); annotation "Probability of false alarm for all three thresholds is zero"]

Figure 4.23 Performance of the VIRE DTD in the Coloration-effect filter AES

4.6 AES with Different DTD Algorithms

In this section, the improvement that each DTD algorithm brings to the AES is evaluated and compared. In an AES without a DTD, the same suppression ratio is used everywhere, which results in a large attenuation of the NES during DT and makes conversation during DT rather difficult. Hence, a DTD is adopted to detect DT. When DT is detected, the filter adaptation is stopped and the suppression ratio β is lowered in order to limit the NEA. However, a certain amount of echo attenuation is still required during DT. Hence, a different β is used to obtain 20 dB ERLE during DT, which is 10 dB lower than during FE single talk.
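The control logic described here, together with the hold time of section 4.5, can be sketched as follows. This Python illustration rests on several assumptions. The hold period is counted in samples, so 32 ms at 8 kHz gives 256 samples. The value β = 27 for FE single talk is taken from section 4.4. The lower β used during DT is not stated in the text, so the value 5.0 below is purely hypothetical, as is the function name.

```python
def dtd_control(detections, hold_samples=256, beta_single=27.0, beta_double=5.0):
    """Return one (adapt, beta) pair per sample.

    A detection freezes the adaptation and selects the lower beta for the
    following hold_samples samples, including the sample of the detection.
    """
    controls = []
    hold = 0
    for detected in detections:
        if detected:
            hold = hold_samples              # (re)start the hold period
        in_double_talk = hold > 0
        adapt = not in_double_talk           # freeze the filter during DT
        beta = beta_double if in_double_talk else beta_single
        controls.append((adapt, beta))
        if hold > 0:
            hold -= 1
    return controls

# Toy example with a three-sample hold: one detection at the second sample.
controls = dtd_control([False, True, False, False, False, False], hold_samples=3)
```

A longer hold keeps the lower β active for longer after the last detection, which protects the NES but delays the return to full suppression, in line with the trade-off described in section 4.5.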

The performance of each method under the nominal SER is compared first, as shown in Figure 4.24. With the same 30 dB ERLE achieved during the FE single talk segment, it can be observed that all the DTD algorithms help to reduce the NEA during DT. Overall, the coloration-effect filter AES results in less NEA than the NLMS AES. The results also show that the coloration-effect filter AES with VIRE DTD performs best. This is expected because of the excellent detection capability of the VIRE DTD for the coloration-effect filter AES. In auditory tests, the NES during DT produced by the coloration-effect filter AES with VIRE DTD sounds most natural, with the least attenuation and without discontinuities.

In practice, the near-end speech is an unknown variable, as is the far-end speech. The effects of the NES and the FES are evaluated separately in section 4.6.1. As discussed in section 3.3, various kinds of noise exist in hands-free communication. Some noises are stationary background noise, while others are abrupt sounds. The impact of stationary noise is easier to test than that of unpredictable sudden noise, and is discussed in section 4.6.2.

[Figure: waveform panels against Samples (0 to 3.5 x 10^5): Original microphone signal z(t); NLMS AES without DTD; NLMS AES with Geigel DTD; NLMS AES with cheap-NCR DTD; NLMS AES with VIRE DTD; Coloration-effect Filter AES without DTD; Coloration-effect Filter AES with Geigel DTD; Coloration-effect Filter AES with VIRE DTD]

Figure 4.24 Comparison of different AES methods under nominal SER

4.6.1 The Influence of the NES and the FES

The performance of each DTD algorithm against a varying NES has been studied in section 4.5. The ERLE is affected by the Pf in the FE single talk segment, while the NEA is determined by the Pm during DT. Figure 4.25 shows how the power of the NES influences the ERLE and the NEA. The FES is kept the same, so that the detection statistic of each DTD during the FE single talk section is stable over the whole SER range. The threshold of each DTD is chosen to achieve zero Pf, to avoid echo boost. In this way, all methods use the same suppression ratio during FE single talk, so that a similar ERLE is achieved and maintained over the whole SER range. As observed before, the Pm increases as the NES diminishes, and therefore the NEA also rises. The results in Figure 4.25 confirm the analysis in section 4.5.
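Varying the SER by adjusting the NES amounts to scaling the near-end signal by a gain derived from the two signal powers. The following Python sketch illustrates the arithmetic. The function names and the toy signals are hypothetical, and the sketch assumes that separate near-end and echo signals are available.

```python
import math

def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def ser_db(near_end, echo):
    """SER in dB: near-end speech power over echo power."""
    return 10.0 * math.log10(mean_power(near_end) / mean_power(echo))

def scale_near_end_to_ser(near_end, echo, target_ser_db):
    """Scale the near-end speech so that the SER equals target_ser_db."""
    target_ratio = 10.0 ** (target_ser_db / 10.0)
    gain = math.sqrt(target_ratio * mean_power(echo) / mean_power(near_end))
    return [gain * v for v in near_end]

# Toy signals; the SER values span the range shown in the figures.
echo = [0.2, -0.1, 0.15, -0.25]
near_end = [0.5, 0.3, -0.4, 0.1]
scaled = {ser: scale_near_end_to_ser(near_end, echo, ser) for ser in (-5, 0, 5, 10)}
```

The same gain formula, applied to the far-end signal instead, covers the case in which the NES is kept at its nominal level and the FES is varied.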

Now the nominal NES is kept the same and the FES is varied. The result is illustrated in Figure 4.26. The louder the FES, and hence the larger the echo, the larger the ERLE that is achieved. However, for the VIRE and Geigel algorithms the echo attenuation drops when the echo becomes too large. Because of the amplification introduced by the volume adjustment, the magnitude of the echo in the microphone signal z(t) may be comparable to or even larger than that of the FES x(t). The performance of the Geigel algorithm and of the VIRE DTD for the Coloration-effect Filter AES depends on the ratio between the microphone signal z(t) and the speaker signal x(t), and the Pf increases as z(t) grows. For the VIRE DTD in the NLMS AES, the variation of the filter taps during FE single talk becomes larger as the microphone signal increases, which also increases the Pf. As a result, with the Geigel and VIRE DTDs the filter adaptation is slowed down and a low suppression ratio is applied, so a significant amount of echo is left after cancellation. The NCR algorithm does not suffer from this problem because it estimates the echo path from both the FES and the echo signal, so the ratio between the estimated echo and the actual echo is relatively stable, which gives a stable DTD performance.
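The dependence of the Geigel detector on the ratio between z(t) and x(t) can be illustrated with a small experiment. The sketch below is a simplified model and not the thesis implementation: the echo path is assumed to be a pure gain and delay, the far-end samples are taken in magnitude (the usual form of the Geigel statistic), and the threshold 0.5, the window length 32 and the delay of 4 samples are illustrative values chosen for this example.

```python
import math


def geigel_statistic(mic, far_end, window):
    """xi(t) = |z(t)| / max(|x(t)|, ..., |x(t-N+1)|) for every sample t."""
    xi = []
    for t in range(len(mic)):
        start = max(0, t - window + 1)
        peak = max(abs(v) for v in far_end[start:t + 1])
        # Guard against an all-zero far-end window.
        xi.append(abs(mic[t]) / peak if peak > 0.0 else 0.0)
    return xi


def false_alarm_rate(xi, threshold):
    """Fraction of samples declared double talk although only echo is present."""
    return sum(1 for v in xi if v > threshold) / len(xi)


# Far-end single talk only: the microphone picks up nothing but the echo.
far_end = [math.sin(2.0 * math.pi * n / 50.0) for n in range(1000)]
delay = 4          # assumed echo-path delay in samples
window = 32        # assumed Geigel window length N
threshold = 0.5    # assumed threshold T


def echo_only_mic(gain):
    """Microphone signal for a hypothetical gain-and-delay echo path."""
    return [0.0] * delay + [gain * v for v in far_end[:-delay]]


pf_quiet = false_alarm_rate(geigel_statistic(echo_only_mic(0.3), far_end, window), threshold)
pf_loud = false_alarm_rate(geigel_statistic(echo_only_mic(1.2), far_end, window), threshold)
print(f"Pf with echo gain 0.3: {pf_quiet:.2f}")
print(f"Pf with echo gain 1.2: {pf_loud:.2f}")
```

With an echo gain below the threshold the statistic can never exceed T, so no false alarm occurs. Once the echo is amplified above the far-end level, the detector fires during far-end single talk, which is the mechanism behind the drop in echo attenuation described above.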

Figure 4.25 Performance variation against SER with varying NES

[Panels: NEA (dB) vs. SER (dB); ERLE (dB) vs. SER (dB). Legend: NLMS AES; NLMS AES with NCR DTD; NLMS AES with Geigel DTD; NLMS AES with VIRE DTD; CF AES; CF AES with Geigel DTD; CF AES with VIRE DTD]

Figure 4.26 Performance variation against SER with varying FES

[Panels: NEA (dB) vs. SER (dB); ERLE (dB) vs. SER (dB). Legend entries recovered: NLMS AES; CF AES]

[Two panels plotted against SER (dB); one y-axis: Probability of miss. Legend: NCR DTD; Geigel DTD; VIRE DTD for CF; VIRE DTD for NLMS]

Figure 4.27 DTD performance variation caused by varying FES

4.6.2 The Noise Performance

To examine the influence of the strength of stationary noise, white noise with adjusted power is added to the nominal recording. As shown in Figure 4.28, the resulting ERLE drops as the noise power increases. As the analysis above indicates, the performance of the DTD determines how the AES behaves overall, so the influence of the stationary noise on the DTD is examined in Figure 4.29. The decline of the ERLE is caused on the one hand by the increase of the Pf, and on the other hand by the larger noise contribution to the residual signal.
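Adding white noise at a prescribed SNR can be sketched as follows. This is a minimal illustration under assumptions: the noise is zero-mean Gaussian, the SNR is defined as the power ratio of the clean recording to the added noise in dB, and the sine wave is a hypothetical stand-in for the nominal recording. The function name `add_white_noise` and the seed are choices made for this example; the SNR values match the range on the SNR axis of the figures.

```python
import math
import random


def mean_power(signal):
    """Return the average power (mean squared value) of a list of samples."""
    return sum(s * s for s in signal) / len(signal)


def add_white_noise(signal, snr_db, seed=0):
    """Return (noisy_signal, noise), where the white Gaussian noise is scaled
    so that the signal-to-noise power ratio equals snr_db."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    noise = [scale * v for v in noise]
    noisy = [s + v for s, v in zip(signal, noise)]
    return noisy, noise


# Hypothetical stand-in for the nominal recording.
recording = [math.sin(0.03 * n) for n in range(4000)]

for snr in (10, 15, 20, 25, 30):
    noisy, noise = add_white_noise(recording, snr)
    achieved = 10.0 * math.log10(mean_power(recording) / mean_power(noise))
    print(f"requested SNR {snr:2d} dB -> achieved {achieved:5.2f} dB")
```

Scaling the noise by its measured power, rather than relying on the nominal variance of the generator, makes the achieved SNR match the requested value for every realisation.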

The NEA is almost constant because the variation of the Pm is small. The Pf of the VIRE DTD for the Coloration-effect Filter AES barely changes, which guarantees a steady attenuation of the echo during FE single talk regardless of the noise strength. The Pm rises as the noise gets stronger, so some discontinuities occur, as in the NCR and Geigel methods.

[Panels: NEA (dB) vs. SNR (dB); ERLE (dB) vs. SNR (dB). Legend entries recovered: NLMS AES; CF AES]

Figure 4.28 Noise Performance under nominal SER

[Two panels plotted against SNR (dB); one y-axis: Probability of miss. Legend: NCR DTD; Geigel DTD; VIRE DTD for CF; VIRE DTD for NLMS]

Figure 4.29 Comparison of DTD noise performance

CHAPTER V

CONCLUSION & FURTHER WORK

5.1 Summary and Conclusion

Nowadays, conventional and hands-free telephones play an increasingly important role in meeting people’s communication needs. One of the major problems in telecommunication over a telephone system is echo. This thesis is devoted to finding a solution for acoustic echo cancellation during a hands-free conversation using a PC. The basic echo canceller based on the well-known NLMS algorithm is studied first. The resulting ERLE of the basic AEC is so low that the residual echo is still audible. Therefore the acoustic echo suppressor is introduced, which is able to eliminate the echo completely. The NLMS algorithm is first adopted to calculate the magnitude spectrum of the echo signal in the AES, but its computational complexity is high. Then a recently proposed algorithm, which uses a coloration-effect filter to estimate the magnitude spectrum of the main portion of the acoustic path, is studied and modified to ease the calculation. The disadvantage of the Coloration-effect Filter AES is its slower adaptation compared to the NLMS AES. Both AES algorithms are capable of making the echo inaudible during far-end single talk, but both suffer from near-end attenuation and discontinuity problems during double talk. Hence, Double Talk Detection algorithms are investigated, including the Geigel DTD, the (cheap) NCR DTD and the VIRE DTD. Each DTD algorithm is analyzed and evaluated individually based on two parameters: Pf and Pm. After that, the DTD algorithms are integrated into each AES method and compared. The simulations and auditory tests show that the Coloration-effect Filter AES with the VIRE DTD gives the best result, with the least attenuation and without discontinuities. However, this conclusion only holds when the near-end signal picked up by the microphone is strong enough compared to the far-end speech. Also, the performance only degrades once the noise becomes stronger than a certain point.

In all, the echo cancellation algorithm presented in this thesis provides a software solution to the problem of echoes in the telecommunications environment. Furthermore, considerable effort has been devoted to the ways of tuning the parameters, to a general framework for evaluating and comparing different algorithms, and to the interpretation of the results. The proposed algorithm is a pure software approach that does not use any DSP hardware components. It can run on any PC with MATLAB installed. In addition, the results obtained were convincing. The quality of the output speech signals was highly satisfactory and met the goals of this research.

5.2 Possible Further Work

The algorithm was tested entirely ‘off-line’. The test speech was recorded beforehand as input to the algorithm, and the output was examined after the simulation. Therefore, a real-time implementation for testing purposes could be the most interesting future work.

A high background noise level is annoying to the listener during a conversation and affects the performance of the algorithm. However, background noise is a natural part of a conversation, since it conveys the surrounding environment of the person we talk to. Hence, a noise suppression algorithm is needed that reduces the background noise to a comfortable level. Moreover, the handling of music noise, which is trickier to solve, could also be studied in the future.

In practice, the echo could still be noticeable due to large variations of the echo path characteristics. Therefore, the reaction of the algorithm to echo path changes should be further investigated and evaluated.

LIST OF ACRONYMS

AEC: Acoustic Echo Canceller

AES: Acoustic Echo Suppressor

DT: Double Talk

DTD: Double Talk Detection (Detector)

FES: Far End Speech

LMS: Least Mean Square

NCR: Normalized Cross-correlation

NES: Near End Speech

NLMS: Normalized Least Mean Square

Pf: Probability of false alarm

Pm: Probability of miss

SER: Signal To Echo Ratio

SNR: Signal To Noise Ratio

VIRE: Variance Impulse Response

LIST OF FIGURES

Figure 1.1: Hybrid Connections and the Resulting Electric Echo

Figure 1.2: Basic setup of a hands-free communication system

Figure 1.3: Generation of acoustic echo through direct coupling and reverberations

Figure 1.4: General schematic of Acoustic Echo Cancellation

Figure 1.5: Composition of the signals used to calculate the NEA

Figure 2.1: Noise suppression with spectral subtraction

Figure 2.2: Smoothing of a step function

Figure 2.3: Error caused by Hann-Windowing FFT

Figure 2.4: Error caused by Hamming-Windowing FFT

Figure 2.5: Error caused by Kaiser-Windowing FFT with different beta

Figure 2.6: Gradient of the Error function

Figure 2.7: γ plot

Figure 2.8: Typical room impulse response

Figure 3.1: Room Impulse Response in the Scream room in NXP Leuven

Figure 3.2: Common Laptop Noise Sources

Figure 3.3: Typical look of a typing sound on the keyboard

Figure 3.4: Typing noise cancellation

Figure 4.1: Speech stimuli segmentation

Figure 4.2: How ERLE changes with the safety constant

Figure 4.3: The effect of Learning rate on ERLE (safety constant = 0.1)

Figure 4.4: Echo Cancellation Result of NLMS AEC under nominal SER

Figure 4.5: The effect of learning rate on the ERLE and the NEA

Figure 4.6: The effect of the smooth factor on the ERLE and the NEA

Figure 4.7: The effect of smooth factor on the NES during DT

Figure 4.8: Alpha – Beta values to achieve 30dB ERLE

Figure 4.9: Different convergence time corresponding to different sigma

Figure 4.10: The influence of the sigma to the ERLE and NEA with different safety constant values

Figure 4.11: Alpha – Beta values to achieve 30dB ERLE

Figure 4.12: Typical ROC curve illustration

Figure 4.13: The detection statistic for Geigel algorithm for the nominal recording with external microphone and loudspeakers

Figure 4.14: ROC curve of Geigel Algorithm under nominal SER

Figure 4.15: Probability of miss decreases as the SER is increased by enlarging the amplitude of near-end speech (Pf = 0)

Figure 4.16: ROC curves of cheap-NCR DTD for NLMS AES with different Learning rate under nominal SER

Figure 4.17: ROC curve of cheap-NCR DTD with different window size under nominal SER

Figure 4.18: Probability of miss increases as the SER is decreased by reducing the amplitude of near-end speech

Figure 4.19: ROC curves of VIRE DTD for NLMS AES with different Learning rate under nominal SER

Figure 4.20: ROC curves of VIRE DTD for the Coloration-effect filter AES with different Learning rate under nominal SER

Figure 4.21: Detection decision obtained using the VIRE DTD for the Coloration-effect filter AES under nominal SER

Figure 4.22: Probability of miss of the VIRE DTD in the NLMS AES varies dramatically with near-end speech

Figure 4.23: Performance of the VIRE DTD in the Coloration-effect filter AES

Figure 4.24: Comparison of different AES methods under nominal SER

Figure 4.25: Performance variation against SER with varying NES

Figure 4.26: Performance variation against SER with varying FES

Figure 4.27: DTD performance variation caused by varying FES

Figure 4.28: Noise Performance under nominal SER

Figure 4.29: Comparison of DTD noise performance

REFERENCES

Christof Faller and Christophe Tournery, “Robust Acoustic Echo Control Using a Simple Echo Path Model,” Audiovisual Communications Laboratory, EPFL, Lausanne, Switzerland, 2006.

C. Faller and C. Tournery, “Estimating the delay and coloration effect of the acoustic echo path for low complexity echo suppression,” in Proc. Intl. Workshop on Acoustic Echo and Noise Control (IWAENC), Sept. 2005.

Andreas Jakobsson (Karlstad University) and Per Åhgren (Uppsala University), “Acoustic Echo Cancellation.”

S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust. Speech Sig. Processing, vol. 27, no. 2, pp. 113–120, Nov. 1979.

K. Ochiai, T. Araseki, and T. Ogihara, “Echo canceller with two echo path models,” IEEE Trans. on Communications, vol. 25, no. 6, pp. 589–595, June 1977.

Jun H. Cho, Dennis R. Morgan and Jacob Benesty, “An Objective Technique for Evaluating Doubletalk Detectors in Acoustic Echo Cancellers,” IEEE Trans. on Speech and Audio Processing, vol. 7, no. 6, Nov. 1999.

Srinivasaprasath Raghavendran, “Implementation of an Acoustic Echo Canceller Using Matlab,” 2003.

P. Åhgren, “On System Identification and Acoustic Echo Cancellation,” PhD thesis, Uppsala University, 2004.

J. Benesty, D. R. Morgan, and J. H. Cho, “A new class of doubletalk detectors based on cross-correlation,” IEEE Trans. Speech Audio Processing, vol. 8, pp. 168–172, March 2000.

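Form of the detection statistic and comparison with T

Geigel DTD:

\[ \xi = \frac{|z(t)|}{\max\big(x(t),\, x(t-1),\, \ldots,\, x(t-N+1)\big)} \]

Cheap NCR DTD:

\[ \xi = \frac{\sigma_{\hat{y}}^{2}(t)}{\sigma_{z}^{2}(t)} \]

VIRE DTD:

\[ \xi(n) = \lambda \cdot \xi(n-1) + (1-\lambda) \cdot [\gamma - \bar{\gamma}]^{2} \]

\[ \bar{\gamma}(n) = \lambda \cdot \bar{\gamma}(n-1) + (1-\lambda) \cdot \gamma \]

\[ \gamma = \max\big(h(0),\, h(1),\, \ldots,\, h(k-1)\big) \]

Comparison with T:

ξ > T: Double Talk present

ξ ≤ T: Double Talk not present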
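The recursive form of the VIRE statistic can be sketched in Python as follows. This is an illustration under assumptions and not the thesis implementation: the table does not fix the update order or the initial values, so the sketch updates the running mean before the variance estimate and initialises the mean with the first value of γ. The forgetting factor 0.9 and the tap vectors are hypothetical.

```python
def vire_statistic(tap_history, forgetting=0.9):
    """Track the variance of the largest filter tap over time.

    tap_history is a sequence of tap vectors h(0)..h(k-1), one per update.
    Returns the list xi(n) of smoothed squared deviations of gamma = max(h).
    """
    xi_values = []
    xi = 0.0
    gamma_mean = None
    for taps in tap_history:
        gamma = max(taps)
        if gamma_mean is None:
            gamma_mean = gamma  # assumed initialisation: start at the first value
        else:
            gamma_mean = forgetting * gamma_mean + (1.0 - forgetting) * gamma
        # Assumed order: the deviation is taken from the already updated mean.
        xi = forgetting * xi + (1.0 - forgetting) * (gamma - gamma_mean) ** 2
        xi_values.append(xi)
    return xi_values


# Hypothetical tap trajectories: a settled filter, then a sudden disturbance
# such as the onset of near-end speech.
steady = [[0.1, 0.5, 0.2] for _ in range(50)]
disturbed = steady + [[0.1, 0.9, 0.2] for _ in range(10)]

xi_steady = vire_statistic(steady)
xi_disturbed = vire_statistic(disturbed)
print(f"largest xi while settled:   {max(xi_steady):.6f}")
print(f"xi right after disturbance: {xi_disturbed[50]:.6f}")
```

While the largest tap stays put, ξ remains essentially zero. A sudden change in the taps pushes ξ up, and comparing it with the threshold T then flags double talk as in the table above.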
