Friedrich-Alexander-Universität Erlangen-Nürnberg
Lehrstuhl für Multimediakommunikation und Signalverarbeitung
Prof. Dr.-Ing. Walter Kellermann
Master's Thesis
Ego-noise reduction for the robot NAO using multichannel nonnegative matrix factorization
by Benjamin Thum
September 2015
Supervisor: Dr. Antoine Deleforge
Declaration
I affirm that I have produced the present thesis without outside help and without the use of sources other than those indicated, and that the thesis has not been submitted in the same or a similar form to any other examination authority and been accepted by it as part of an examination. All passages taken verbatim or in substance from other sources are marked as such.
————————————
Place, date
————————————
Signature
Contents
Abstract

List of Abbreviations

Notation

1 Introduction

2 Literature review of audio source separation
   2.1 Models for audio source separation
      2.1.1 Instantaneous and convolutive mixing
      2.1.2 Amplitude and phase ambiguity
      2.1.3 Mixing by spatial source images
      2.1.4 Spatial covariance matrix
      2.1.5 Characteristics of source separation algorithms
   2.2 Properties that allow the separation of sources
      2.2.1 Spectral diversity
      2.2.2 Spatial diversity
      2.2.3 Non-Gaussianity and non-stationarity of sources
   2.3 Common approaches to audio source separation
   2.4 (Almost) blind source separation
      2.4.1 Independent Component Analysis (ICA)
   2.5 Model based source separation via Gaussian processes
      2.5.1 Model for single channel scenarios
      2.5.2 Model for multichannel scenarios
   2.6 The permutation problem

3 Literature review on matrix factorization
   3.1 Cost functions for matrix factorization
      3.1.1 Generalized α divergence as a Csiszár divergence
      3.1.2 The β divergence as a Bregman divergence
      3.1.3 Generalized α-β divergence
   3.2 Properties of (nonnegative) matrix factorization
      3.2.1 Indeterminacies
      3.2.2 Local optimality
   3.3 Solving the matrix factorization problem
      3.3.1 Nonnegative matrix factorization via multiplicative update rules
      3.3.2 Multichannel EM algorithm for the sum of Gaussian components
   3.4 Constraints and variants of matrix factorization
      3.4.1 Sparsity
      3.4.2 Smoothness and temporal continuity
      3.4.3 Local NMF
      3.4.4 Semi-orthogonal NMF
      3.4.5 Fisher NMF
      3.4.6 Scaling problem with a single constraint

4 Ego-noise suppression for the robot NAO
   4.1 Ego-noise analysis of NAO
      4.1.1 Fan and static noise
      4.1.2 Movement noise
      4.1.3 Speech
   4.2 Reduction of stationary components as preprocessing
      4.2.1 Exploiting the high sparsity in the frequency domain to reduce fan spikes
      4.2.2 Static noise reduction by single channel Wiener filter
   4.3 Reduction of non-stationary ego-noise
      4.3.1 Informed source separation by pretraining of the dictionary
      4.3.2 Approximate additivity assumption
      4.3.3 Euclidean MF with a single dictionary
      4.3.4 Euclidean MF with two dictionaries
      4.3.5 Dictionary training based on desired and undesired data
   4.4 Multiplicative Update (MU) algorithm for multichannel data
   4.5 The challenge of multiple mixtures
      4.5.1 Multichannel Wiener filter (MWF)
      4.5.2 Parametric multichannel Wiener filter (PMWF)
      4.5.3 MVDR beamformer

5 Experimental results
   5.1 Experimental setup
   5.2 Prefiltering
      5.2.1 Fan noise reduction
      5.2.2 Static and residual fan noise reduction
      5.2.3 Prefiltering results
      5.2.4 Evaluation approach for movement reduction
   5.3 Movement noise reduction
      5.3.1 Experimental results for single channel data
      5.3.2 Performance of PMWF for spatial preprocessing
      5.3.3 Experimental results for multichannel data
      5.3.4 Discussion on using NMF with MWF as spatial prefilter
      5.3.5 Performance of MVDR beamformer for spatial preprocessing

6 Conclusion

A Single channel results
   A.1 Walking
      A.1.1 α-β-divergence with speech and noise codebook
      A.1.2 MF with the α-β-divergence and Euclidean predictive by a noise codebook
   A.2 Waving
      A.2.1 α-β-divergence with speech and noise codebook
      A.2.2 MF with the α-β-divergence and Euclidean predictive by a noise codebook

B Multichannel results
   B.1 Walking
      B.1.1 MF with the α-β-divergence, MWF as prefilter and speech and noise codebook
      B.1.2 MF with the α-β-divergence and Euclidean predictive, MWF as prefilter and noise codebook
      B.1.3 Multichannel matrix factorization with the α-β-divergence
   B.2 Waving
      B.2.1 PMWF as prefilter
      B.2.2 MF with the α-β-divergence, MWF as prefilter and speech and noise codebook
      B.2.3 MF with the α-β-divergence, MWF as prefilter and noise codebook
      B.2.4 Multichannel matrix factorization with the α-β-divergence

C Overview on related algorithms
   C.0.5 Minimum Mean Square Error (MMSE) estimation
   C.0.6 Wiener filtering
   C.0.7 Nonnegative Graph Embedding
   C.1 Terminology of non-blind source separation algorithms

D Derivations
   D.1 Parameter estimation by matrix factorization
      D.1.1 Euclidean distance as ML estimation of additive Gaussian noise
      D.1.2 ML estimation of the sum of Poisson distributed magnitude components
      D.1.3 Itakura-Saito divergence as ML estimation of gamma multiplicative noise
   D.2 Deriving the α-divergence as a Csiszár divergence
   D.3 Deriving the β-divergence as a Bregman divergence
   D.4 Derivation of MU rules for the α-β divergence (single channel)
   D.5 Derivation of a multichannel EM algorithm for the Itakura-Saito divergence
   D.6 Computational growth for parallel matrix inversion

References
Abstract
This thesis studies ego-noise reduction for the robot NAO using multichannel nonnegative matrix factorization (NMF). The movement ego-noise that the robot creates by walking on the spot and by waving its arm is investigated. It is shown that a significant reduction of these ego-noises on a single channel is possible using a recently proposed NMF algorithm based on the generalized α-β divergence. The existing approach of multichannel multiplicative update rules, in which the mutual information between the channels is discarded, is then derived for the generalized α-β divergence as cost function. This multichannel NMF approach results in worse performance than the algorithm applied to a single channel.
Furthermore, a novel approach that splits the reduction into a spatial prefilter and a single channel NMF algorithm is investigated. Experimental results show that the large speech artifacts introduced by the spatial filter significantly degrade the subsequent single channel NMF algorithm. This leaves room for improvement.
From a theoretical point of view, it is also pointed out that current approaches in the literature for pretraining the dictionary are not well suited for most signals. A new algorithm based on NMF with the Euclidean cost function is developed, whose design takes the pretraining of the dictionaries with the reference signals into account. The experimental results suggest that ego-noise reduction with NMF could benefit from more suitable training algorithms.
Kurzfassung
Diese Arbeit erforscht die Eigengeräuschreduktion für den Roboter NAO mittels nichtnegativer Matrixfaktorisierung (NMF). Dabei werden die Bewegungsgeräusche untersucht, die der Roboter erzeugt, wenn er sich auf der Stelle bewegt, sowie wenn er seinen rechten Arm bewegt. Es wird gezeigt, dass eine signifikante Verringerung des Eigenrauschens mit einem einzigen Kanal möglich ist. Dabei wird ein vor kurzem veröffentlichter NMF Algorithmus mit der generalisierten α-β Divergenz verwendet. Der existierende Ansatz mehrkanaliger multiplikativer Update-Regeln, bei denen der mittlere Transinformationsgehalt vernachlässigt wird, wird für die generalisierte α-β Divergenz verallgemeinert. Experimentelle Ergebnisse zeigen, dass dieser Algorithmus eine schlechtere Performance erreicht als mit einem einzigen Kanal.
Darüber hinaus wird ein neuer Ansatz untersucht, in dem der Algorithmus in einen räumlichen Filter und einen einkanaligen NMF Algorithmus getrennt wird. Die Experimente zeigen, dass die großen Artefakte des räumlichen Filters den nachfolgenden NMF Algorithmus signifikant beeinträchtigen.
Eine theoretische Untersuchung zeigt, dass die aktuellen Herangehensweisen in der Literatur, bei denen das Wörterbuch vorher trainiert wird, für die meisten Signale nicht geeignet sind. Dabei wird ein neuartiger Algorithmus auf Grundlage von NMF mit der euklidischen Kostenfunktion entwickelt. In dessen Design ist das Training des Wörterbuchs durch Referenzsignale berücksichtigt. Die experimentelle Auswertung legt nahe, dass die Eigenrauschunterdrückung durch NMF von besser geeigneten Trainingsalgorithmen profitieren kann.
List of Abbreviations
BSS blind source separation
DFT discrete Fourier transform
EM expectation maximization
i.i.d. independent and identically distributed
IIR infinite impulse response
MF matrix factorization
ML maximum likelihood
MMSE minimum mean square error
MSE mean square error
MU multiplicative update
MVDR minimum variance distortionless response
MWF multichannel Wiener filter
NMF nonnegative matrix factorization
pdf probability density function
psd power spectral density
PMWF parametric multichannel Wiener filter
RAR residual to artifact ratio
SAR speech/signal to artifact ratio
SDR speech/signal to distortion ratio
SIR speech/signal to interference ratio
STFT short time Fourier transform
Notation
Notation for signals
Y or Y single/multiple microphone mixture samples
X or X single/multiple (hidden) microphone source samples
S or S single/multiple (hidden) actual source samples
A single sample is always represented by a normal capital letter and usually accom-
panied with one or more indices. A vector or matrix of observations is indicated by a
bold capital letter and possibly also used with indices. For the definition of a column
vector the following notation is used
\[ \mathbf{Y} = [Y_1, \ldots, Y_I]^T \triangleq [Y_i]_I \]
Parameters and indices
K number of codewords and therefore size of codebook
F number of frequencies
R number of frames
U number of mixtures/microphones
TR to indicate training data
TE to indicate test data
N or S to indicate noise or speech data
The capital letters describe the size of a variable and the corresponding small letters
are used to index it. For example f is used to index the f -th frequency. Similarly t is
used to index time (samples).
Mathematical expressions
$\overset{c}{=}$ equal up to a constant
$\overset{m}{=}$ equal up to a scalar multiplication
$\triangleq$ is defined as
$\overset{!}{=}$ has to be
$\mathbf{Y}^{.[b]}$ element-wise $b$-th power
$\mathbf{Y}.\mathbf{X}$ element-wise matrix multiplication
Chapter 1
Introduction
The rapid growth of computational power in recent years has led to a fast spread of electronic devices into many different areas of human life - in industry, in our homes, in transportation and even in our pockets. One particular example is the manufacturing of human-like robots, whose potential task is to assist humans in their daily life. One of these humanoid robots is called NAO and is developed by Aldebaran Robotics. In order to become human-like, however, a robot must be capable of performing many different tasks - walking, talking and listening, to name a few. More complex tasks, such as understanding and following a given order, are desired as well.
This thesis investigates one of these tasks, speech enhancement. Algorithms designed to understand speech are usually referred to as automatic speech recognition engines. In recent years, efficient algorithms have been developed for scenarios with low reverberation and background noise and are nowadays deployed in many smartphones around the world. These early algorithms lose much of their performance in more challenging scenarios, such as multiple speakers, high reverberation or strong background noise. For this reason, the development of robust speech recognition algorithms, or of suitable preprocessing algorithms for these more challenging scenarios, is still an ongoing research topic.
In robot audition - a general term used for tasks such as speech recognition or speaker
localization for robots - a quite unique problem arises. The ego-noise, meaning the noise the robot creates itself, has to be taken into account as well. When processing real-world audio signals, the static noise of the microphones and sometimes also the noise of a fan have to be taken into account; both must be considered ego-noise, too. Conventional applications such as smartphones, however, do not create audio noise from mechanical movement.
Hence mechanical movement ego-noise is uncommon and, in addition, particularly challenging for the following two reasons. First, the movement noise is intense; in the case of the robot NAO it is significantly louder than the speech signal. The robot's sound might not appear that loud to a person standing in the same room as the robot, but the distance to the noise source has to be taken into account: for the NAO robot, the microphones are located less than one meter from the noise origin. Second, the movement of the robot is quite versatile. It is very difficult to predict, as its properties such as spectral shape or total energy vary strongly over time.
The ego-noise reduction problem for the robot NAO can, as is usually also done for other robots, be separated into the reduction of non-stationary and stationary noise. It is quite common to tackle these two types with two different algorithms, see for example [Inc11], [LBD+14]. Although this work has to deal with both stationary and non-stationary noises, the main focus is on the reduction of the non-stationary movement noise.
Ego-noise reduction systems (especially for the non-stationary noise) can again be divided into those relying only on the audio data and those including additional information. It is especially of interest to consider motor data, such as position, velocity and acceleration of the robot's joints and other moving parts. This work solely focuses on approaches using the audio data only. An overview of ego-noise reduction exploiting sensor data can for example be found in [Inc11].
The reduction of non-stationary noise from non-stationary speech can be seen as a
special case of audio source separation, where only one source, namely the speech
signal, is of interest. Due to this interpretation, (audio) source separation algorithms
are in theory also applicable for the task of ego-noise reduction. As the investigated
robot has four and thus multiple microphones, audio source separation techniques for
multiple signals are of interest.
There exist a number of different approaches to audio source separation. The focus of this thesis lies on one particular type, in which the non-stationarity is modeled in such a way that the resulting optimization problem can be interpreted as a nonnegative matrix factorization (NMF) problem.
The basic idea of NMF is the factorization (or approximation) of a large nonnegative matrix into the product of two smaller matrices. In the last 15 years, NMF has received considerable attention in the literature and has been successfully applied to neuroscience, music separation and pattern recognition tasks such as face recognition.
The task of this thesis can thus be formulated as follows: it shall be investigated whether multichannel audio source separation techniques that can be solved as a nonnegative matrix factorization problem, such as [OF10], are suitable for the ego-noise reduction of the robot NAO. The results shall then be compared to the recently proposed ego-noise reduction method in [DK14].
The ego-noise reduction is hereby considered in a very specific scenario. It is assumed that suitable training material is given. Suitable training material contains a recording of the ego-noise which is similar to the movement the robot is about to perform. In addition, a training recording of the desired speech signal is presumed, and the spatial properties are assumed to be time independent. Speaker and robot have to be positioned at the same positions in the same room for training and testing.
A related real-world scenario is, for example, the NAO robot waiting patiently in a room until it is talked to by a speaker. The first communication, before the robot starts to
move, can then be seen as training data for the spatial features of the speaker. Training data for the movement noise can be created beforehand and then used for the corresponding movements. The assumption that the robot only performs movements where the spatial properties are roughly unchanged is quite restrictive: the robot cannot do much except for waving its hand or walking on the spot. However, this work focuses on the applicability of NMF for ego-noise reduction in this kind of artificial scenario and not on the tracking of, and adaptation to, the spatial characteristics of the speaker.
The novel ideas in this work are the following.
• In chapter 4.2.1 an algorithm for the reduction of fan spikes is proposed, which exploits the narrow spike bandwidth of a few Hertz.
• In chapter 4.3.2 the idea of the approximate additivity assumption is further developed and evaluated using the example of Euclidean matrix factorization.
• The approach commonly used to pretrain the dictionary is not well suited for most signals, and a better suited algorithm based on NMF with the Euclidean distance is derived in chapter 4.3.5.
• An existing multichannel MU algorithm for the Itakura-Saito divergence is modified to solve the problem with the generalized α-β divergence in chapter 4.4.
• Furthermore, in chapter 4.5 a two-stage approach with a spatial prefilter and a single channel NMF algorithm is discussed.
• In the experimental results chapter, audio source separation by NMF is applied for the first time to the robot ego-noise reduction problem.
The thesis is structured as follows. The literature review of works related to the given
problem of ego-noise reduction is divided into chapter 2 for audio source separation
and chapter 3 for the overview of matrix factorization.
Chapter 4 then describes the overall structure of the noise reduction algorithm and contains the novel contributions of this thesis mentioned above.
In chapter 5 the experimental results with real recordings of the robot are shown for the algorithms selected in the previous chapter.
A conclusion and proposals for future research are given in chapter 6.
Chapter 2
Literature review of audio source
separation
Many of the well-known noise suppression techniques assume stationarity, or at least short-term stationarity that lasts long enough so that it is sufficient to estimate the noise characteristics, for example, in the speech pauses. However, if the noise is only short-time stationary over the same time span as the speech signal or less, these algorithms no longer achieve a good noise suppression performance.
For this scenario, the problem of noise reduction can be seen as a source separation problem of two short-time stationary processes, although strictly speaking the estimate of the noise is not of interest. But as source separation can provide an estimate of the speech signal from the mixture, audio source separation algorithms are appropriate for the presented ego-noise reduction problem.
2.1 Models for audio source separation
In source separation it is common to call the unknown signals sources and the observed signals mixtures. Further on, the number of sources is denoted by $J$ and the number of mixtures by $U$. In the robot's ego-noise reduction scenario, the mixtures are simply given by the microphone signals.
2.1.1 Instantaneous and convolutive mixing
The mixing model describes the relation between the sources and the mixtures. Depending on the environment, the mixing is, as described in the following, modeled either as instantaneous or by convolution.
Instantaneous mixing, as for example used in [PC01], is the simplest model. Each mixture is the weighted sum of the $J$ sources, such that $Y_{u,t} = \sum_j G_{uj} S_{j,t}$. The mixing can be written as a set of linear equations using $\mathbf{Y}_t = [Y_{u,t}]_U$ and $\mathbf{S}_t = [S_{j,t}]_J$:
\[ \mathbf{Y}_t = \mathbf{G}\,\mathbf{S}_t \tag{2.1} \]
where $\mathbf{G} = [G_{uj}]_{U\times J}$ is referred to as the mixing matrix.
Unfortunately, audio mixtures in real rooms cannot be described well by instantaneous mixing. Instead, the recorded mixtures are sums of the reverberated sound sources. Reverberation is well described by a convolution with the mixing filters, in this case the room impulse responses $G_{uj}$. The convolutive mixing model will be denoted by
\[ Y_{u,t} = \sum_j G_{uj} * S_{j,t} \tag{2.2} \]
as there exists no matrix notation as convenient as for the instantaneous mixing. In the above form, the convolutive mixing is very difficult to handle. It is therefore often approximated by a complex valued multiplication in the STFT domain; [VBGB14] calls this approximation the narrowband assumption. The mixing process is then written as
\[ \mathbf{Y}_{fr} = \mathbf{G}_f\,\mathbf{S}_{fr} \tag{2.3} \]
where each frequency is modeled by its own mixing matrix [PC01].
If one can truly identify the $U\times J$ mixing matrix $\mathbf{G}$ and its Moore-Penrose pseudo-inverse exists, the sources can be extracted from the mixtures by
\[ \mathbf{S}_t = \mathbf{G}^{+}\,\mathbf{Y}_t \tag{2.4} \]
Note that this also implies that the sources are properly scaled in case of instantaneous
mixing or (approximately) dereverberated in case of convolutive mixing in the STFT
domain.
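As a purely illustrative sketch of equations 2.1 and 2.4 (not part of the algorithms developed in this thesis), the following Python/numpy example simulates a noiseless instantaneous mixture and recovers the sources with the Moore-Penrose pseudo-inverse; the source distributions, dimensions and mixing matrix are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

J, U, T = 2, 3, 1000          # sources, mixtures, time samples
S = rng.laplace(size=(J, T))  # two non-Gaussian source signals
G = rng.normal(size=(U, J))   # instantaneous mixing matrix (eq. 2.1)

Y = G @ S                     # mixtures Y_t = G S_t

# Source extraction with the Moore-Penrose pseudo-inverse (eq. 2.4).
# This only works in the (over)determined case and with known G.
S_hat = np.linalg.pinv(G) @ Y

print(np.allclose(S, S_hat))  # True: noiseless and exactly determined
```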
Dereverberation itself is, however, a very challenging task and has frequently been addressed in the literature. In recent years, nonnegative matrix factorization based approaches have also been developed, for example [KSRS11] for single channel and [MH14] for multichannel scenarios. Though the mixing model as described above is in theory capable of dereverberation, its practical use is limited due to the mixing model's amplitude and phase ambiguity.
2.1.2 Amplitude and phase ambiguity
If not stated otherwise, all algorithms in this work that are based on the above mixing models lack the ability to determine the amplitude and sign/phase of source signal and mixing filter properly. Assume that observations are created by instantaneous mixing in the time domain as $Y_{j,t} = G_j S_{j,t} = \frac{G_j}{c}(c\,S_{j,t}) = \tilde{G}_j \tilde{S}_{j,t}$. This reformulation is valid for any constant $c$, such that all filter-source pairs $\{\frac{G_j}{c},\, c\,S_{j,t}\}$ are equally valid representations of the observations $Y_{j,t}$. Source and mixing coefficients can therefore only be determined up to a signed scaling factor.
In a similar way this can be done for the convolutive mixing in the time-frequency domain. As these observations are typically complex, the constant can be complex as well. The ambiguity is thus over amplitude and phase.
If this ambiguity is not resolved, a proper scaling/dereverberation is of course impossible. For many algorithms this ambiguity is resolved by adding some constraints to the source or propagation vectors. In [HO00], for example, the source variance can be constrained to unity using $E\{S_{jfr}S_{jfr}^H\} = 1$.
2.1.3 Mixing by spatial source images
To circumvent the amplitude and phase ambiguity, [Car98b] introduced the concept of spatial source images (though the name was introduced later in [VAT+12]) by defining $X_{ujt} = G_{uj} S_{jt}$, $X_{ujt} = G_{uj} * S_{jt}$ and $\mathbf{X}_{jfr} = \mathbf{G}_{jf} S_{jfr}$ for the three above models, such that
\[ \mathbf{Y}_t = \sum_j \mathbf{X}_{jt} \quad\text{or}\quad \mathbf{Y}_{fr} = \sum_j \mathbf{X}_{jfr} \tag{2.5} \]
in the time or STFT domain, respectively. $X_{ujt}$ can be interpreted as the actual contribution of the $j$-th source to the $u$-th mixture. The model is for example used in [VBGB14]. In this thesis it is used to avoid the scaling and phase ambiguity in the single channel case.
Note that in the definitions so far, the mixing models are assumed to be independent of time and therefore rather easy to estimate. In real-world ego-noise reduction scenarios, robot and speakers move around, so the mixing vectors are time dependent. This thesis is, however, dedicated to time-independent mixing vectors, which already proves to be a challenging task.
2.1.4 Spatial covariance matrix
The mixing models of equation 2.2 are furthermore limited to single point sources, as each source is defined by a single mixing filter per mixture. They are therefore not suited for spatially diffuse sources. One way to address this challenge (for Gaussian processes) is via the spatial covariance matrix. In chapter 2.5 it is shown that the spatial covariance matrix of a single source with the convolutive mixing model is
\[ \Sigma_{G,jf} = E\{\mathbf{G}_{jf}\mathbf{G}_{jf}^H\} \tag{2.6} \]
and thus of rank 1, where $\mathbf{G}_{jf} = [G_{ujf}]_U$. In [DVG10] a full rank spatial covariance
matrix is used to model diffuse sources.
2.1.5 Characteristics of source separation algorithms
One important assumption made in (almost) all source separation techniques is the independence of the source signals. Besides this assumption and the mixing model, there exist several more properties that can be used to describe a source separation technique. A specific source separation problem can for example be characterized by the following properties:
• Convolutive or instantaneous mixing
• Overdetermined or underdetermined, depending on the ratio of sources to mixtures
• Stationary or non-stationary (short-time stationary) sources
• Gaussian or non-Gaussian observations
• Temporal correlation / spectral shape
Consequently, there exist many different methods for source separation. Nevertheless, their general concept is similar, as they all exploit at least one of the properties described in the next chapter.
2.2 Properties that allow the separation of sources
[Car01] points out that two or more sources cannot be separated if they are stationary,
Gaussian distributed and white, thus if they are i.i.d. Gaussian. In order to make
separation possible, the sources need to fulfill certain properties. In accordance with
[VBGB14], the two most prominent properties will be called spectral diversity and
spatial diversity.
2.2.1 Spectral diversity
In the above example, the sources can still not be separated by applying a filter, i.e. by spectral shaping of the mixture. The problem is thus not the whiteness of the signals but, more precisely, their identical spectral shape. The property of signals having different spectral shapes will further on be referred to as spectral diversity, which is equivalent to different time correlation [Car01].
Roughly speaking, when the probabilities of a certain frequency component being generated by the sources are different, it is possible to distribute each frequency component onto the sources in some optimum manner. For example, by assuming that each frequency component is created by a single source, the source with the highest probability has to be selected for each component.
In chapter 2.5.1 a model is introduced that allows exploiting the spectral diversity of sources through a low-rank approximation of their magnitude squared STFT matrix.
2.2.2 Spatial diversity
Besides the possibility to separate sources based on different spectral characteristics, another possibility exists if more than one (related) signal is observed. In audio source separation, multiple signals are usually obtained with a microphone array. As source separation can only benefit from a microphone array if the sources are located at different positions in space, this property will further on be referred to as spatial diversity.
Spatial diversity is usually exploited for source separation by
• estimating the mixing matrix (algorithms to do so are described later in this thesis), and
• estimating the source signal by the pseudo inverse of the mixing matrix (see equation 2.4).
2.2.3 Non-Gaussianity and non-stationarity of sources
Non-Gaussianity has been used for blind signal separation since its very beginnings (see for example [Car98a]). Some authors ([Car01]) claim that source separation can also be achieved by exploiting non-stationarity.
However, it has to be noted that in all work reviewed for this thesis, non-Gaussianity or non-stationarity of a source was only used to identify spectral or spatial parameters. Thus the source separation, as for example in chapter 2.4.1, was actually performed based on spectral or spatial parameters. This has also been pointed out by [PC01] in their discussion of non-stationary and non-Gaussian models.
Though it is not exploited in any of the reviewed literature, one can think of source separation based on the sources' (non-Gaussian) probability distribution without spatial or spectral diversity. This is for example given by two random processes with statistical independence over time, which are observed by a single mixture. Based on the probability distribution of the two sources, there still exist situations where nontrivial estimates of the two sources can be obtained. As a simple example, two random processes are added up, one creating either plus or minus 1 for each sample and the other one plus or minus 10. The resulting observations are thus from the set {−11, −9, 9, 11}. Obviously, for any observation the realizations of the two underlying processes can be determined. This work will not further investigate such scenarios; they merely suggest that spatial and spectral diversity are not the only possibilities for source separation.
A final interesting remark is that non-Gaussianity and non-stationarity can be used equivalently in some cases [PC01]. In [Li06] the author points out that by applying a time-dependent linear filter to Gaussian i.i.d. processes, the output can either be seen as a non-stationary Gaussian process or as a stationary non-Gaussian process.
Concluding this chapter, all unsupervised algorithms reviewed for this thesis perform source separation by exploiting the spectral and/or spatial diversity of the sources. The identification of the diversity parameters (e.g. the mixing matrix) is then performed
by exploiting non-stationarity or non-Gaussianity of the sources.
2.3 Common approaches to audio source separation
In this chapter the two most common approaches to source separation are introduced,
blind source separation and model based source separation.
2.4 (Almost) blind source separation
An algorithm will be termed almost blind or blind when it does not explicitly model the sources as processes with a certain probability distribution (see chapter 2.5) and does not use reference signals (see chapter 4.3.1). Still, blind source separation algorithms usually require some assumptions, for example about non-Gaussianity or non-stationarity and the statistical independence of the sources.
2.4.1 Independent Component Analysis (ICA)
The most common type of blind source separation problem is Independent Component Analysis (ICA). Its main concept, as presented in [HO00], is to extract sources from the mixture which are again independent. It thus directly exploits the independence of the sources to identify their spatial diversity.
Recall the instantaneous mixing model from equation 2.1,
\[ \mathbf{Y}_t = \mathbf{G}\,\mathbf{S}_t \]
where $\mathbf{S}_t$ now represents a vector of $J$ statistically independent sources. ICA and related algorithms can also be applied in the STFT domain to account for convolutive mixtures. These methods are referred to as frequency dependent ICA (FDICA). See [VJA+10] and [VBGB14] for more information.
If $U \geq J$, and therefore in the (over)determined case, there exists at least one matrix $\mathbf{V} = \mathbf{G}^{+}$ such that
\[ \mathbf{S}_t = \mathbf{V}\,\mathbf{Y}_t \tag{2.7} \]
Conventional ICA methods can therefore not be used for separation in underdetermined scenarios ($J > U$). Approaches for these scenarios are often termed sparse component analysis (SCA), and we refer to [VJA+10] for further information.
However, there may also exist another matrix $\mathbf{V}_P = \mathbf{P}\mathbf{V}$ such that
\[ \mathbf{S}_{P,t} = \mathbf{P}\,\mathbf{S}_t \tag{2.8} \]
As $\mathbf{S}_{P,t}$ and $\mathbf{S}_t$ are both statistically independent, $\mathbf{P}$ has to be a permutation matrix. The permutation problem will be discussed in chapter 2.6, so at this point it is sufficient to calculate a demixing matrix by restoring statistical independence.
One now needs an algorithm that calculates the matrix $\mathbf{V}$ such that the sources in $\mathbf{S}_t$ have minimum mutual information. As pointed out earlier, source separation is impossible for i.i.d. Gaussian processes. In order to achieve separation, an additional assumption has to be made.
ICA based on non-Gaussianity of the sources results from the problem that, if more than one source is Gaussian, the mixing matrix can only be estimated up to an orthogonal transformation [HO00], [Car98a]. Statistical dependence can then be minimized by means of higher order moments. These types of ICA algorithms can be seen as a generalization of principal component analysis, which ensures statistical independence only up to the second moment [Com94].
Again we want to point out that spatial diversity is exploited to separate the sources, while the higher order moments, and thus non-Gaussianity, are only used to identify the spatial parameters.
For completeness, we point out that non-Gaussianity is not the only working assumption for ICA. In [PC01] the authors exploit the non-stationarity of the sources to estimate the mixing matrix.
At this point we are not interested in specific algorithms for obtaining the independent components, but only in the concept of ICA, in order to compare it with the other approaches shown later in this thesis. For more detail on actual ICA algorithms we refer to the many works cited in this chapter.
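As a purely illustrative aside (none of the algorithms in this thesis are based on it), the following sketch demonstrates the ICA concept with scikit-learn's FastICA on a synthetic instantaneous mixture of two non-Gaussian sources; the recovered components come back in arbitrary order and scale, which is exactly the ambiguity discussed above and in chapter 2.6.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
T = 5000

# Two statistically independent, non-Gaussian sources
S = np.vstack([np.sign(rng.normal(size=T)),   # binary +/-1 process
               rng.laplace(size=T)])          # Laplacian process

G = np.array([[1.0, 0.5],
              [0.3, 1.0]])                    # instantaneous mixing matrix
Y = G @ S                                     # observed mixtures

# Estimate independent components; rows of S_hat are the demixed signals.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(Y.T).T

# Order and scale of S_hat are arbitrary (permutation/scaling ambiguity).
print(S_hat.shape)  # (2, 5000)
```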
2.5 Model based source separation via Gaussian
processes
In contrast to blind source separation, model based source separation is based on explicit models for the sources. After the parameters of the model have been estimated, parameters and model are used to estimate the actual source signals. Hereby the model has to be chosen according to the actual signal.
Statistically independent Gaussian processes are of particular interest, as their (joint) pdf depends on first and second order moments only. This greatly limits the number of parameters that have to be identified. Furthermore, the observations resulting from these processes are jointly Gaussian, and it is known that the optimum MMSE estimator is then linear (see appendix C.0.5).
It is well known that speech, and audio signals in general, can be considered short-time stationary for about 5 to 20 ms. As the focus here is on audio source separation, it is further on assumed that at least one source is short-time stationary.
Due to the limited stationarity of at least one source, it is very common to apply the algorithm in the STFT domain. Each frame is approximately stationary and estimated separately. Note that this approach implicitly assumes independence between the frames. Furthermore, it is very common to assume independence between the frequency bins. The applied filter can then be interpreted as the STFT approximation of the optimum Wiener filter (more details on this matter are given in appendix C.0.5).
As this is (almost) never completely fulfilled with real data, it has to be noted that, strictly speaking, the Wiener filter loses its optimality in the STFT domain. Nevertheless, as it works reasonably well, it is the most common approach for short-time stationary Gaussian processes.
2.5.1 Model for single channel scenarios
Recall the spatial sound source model of equation 2.5,
\[ Y_{fr} = \sum_j X_{jfr} \]
Due to the statistical independence over frequencies and frames and the jointly Gaussian observations, the estimate of the $j$-th spatial sound source in the STFT domain from a single mixture is given by
\[ \hat{X}_{jfr} = E\{X_{jfr}|Y_{fr}\} = \frac{E\{Y_{fr} X_{jfr}^H\}}{E\{Y_{fr} Y_{fr}^H\}}\, Y_{fr} \tag{2.9} \]
Introducing the variance or psd in the STFT domain, $V_{X,jfr} = E\{X_{jfr}X_{jfr}^H\}$, and using the independence of the sources, equation 2.9 simplifies to
\[ \hat{X}_{jfr} = \frac{V_{X,jfr}}{\sum_j V_{X,jfr}}\, Y_{fr} \tag{2.10} \]
Estimating the PSDs
The optimum source estimation filter is given by the single channel Wiener filter of equation 2.10. However, in order to apply the filter, the variance parameters $V_{X,jfr}$ have to be determined. As already pointed out, speech is only short-time stationary. It is therefore further on assumed that at least one source is short-time stationary.
Separation of a non-stationary source from one or more stationary source(s)
This is the common scenario for the suppression of (almost) stationary noise in speech recordings. The most popular algorithm for this task is the following:
When the speech (and noise) is active, the combined speech-and-noise variance is recursively estimated by
\[ V_{\text{speech+noise},fr} = \alpha\, V_{\text{speech+noise},f(r-1)} + (1-\alpha)\,|Y_{fr}|^2 \tag{2.11} \]
Hereby the recursion parameter $\alpha$ has to be chosen to account for how much the variance of the speech varies between frames. As an extreme case it can be set to zero, in which case the variance is estimated by its instantaneous value.
When the speech is not active, one can set $V_{\text{speech},fr} = 0$ in order to get zero output (or $V_{\text{speech},fr} = \epsilon$ to get some comfort noise). The speech pauses can then be used to re-estimate the psd of very slowly changing noise; if the noise does not change at all, its psd can be determined beforehand.
Finally, the estimated speech is obtained by reformulating equation 2.10 as
\[ \hat{X}_{\text{speech},fr} = \frac{V_{\text{speech+noise},fr} - V_{\text{noise},fr}}{V_{\text{speech+noise},fr}}\, Y_{fr} \tag{2.12} \]
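A minimal sketch of this classical scheme is given below, assuming the STFT of the mixture and a previously estimated noise psd (e.g. from a speech pause) are available; voice activity detection and the STFT itself are omitted, and the function name and signature are illustrative only.

```python
import numpy as np

def wiener_stationary_noise(Y, V_noise, alpha=0.8, eps=1e-12):
    """Single channel Wiener filter against stationary noise.

    Y       : complex STFT of the mixture, shape (F, R)
    V_noise : noise psd estimated beforehand, shape (F,)
    alpha   : recursion parameter of eq. (2.11)
    """
    F, R = Y.shape
    X_speech = np.zeros_like(Y)
    V_mix = np.abs(Y[:, 0]) ** 2            # initialize speech+noise psd

    for r in range(R):
        # recursive estimate of the speech+noise variance (eq. 2.11)
        V_mix = alpha * V_mix + (1.0 - alpha) * np.abs(Y[:, r]) ** 2
        # spectral Wiener gain (eq. 2.12), clipped to stay nonnegative
        gain = np.maximum(V_mix - V_noise, 0.0) / (V_mix + eps)
        X_speech[:, r] = gain * Y[:, r]
    return X_speech
```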
Separation of non-stationary sources
If several sources are only short-time stationary, the approach of the previous section is no longer suitable. The variance parameters of all sources have to be estimated at once. The problem can be described as estimating $JFR$ variance parameters out of $FR$ observations in the single channel case and $UFR$ observations in the multichannel case. This is especially problematic for underdetermined mixtures ($J > U$).
One approach to tackle this problem is to replace the variance with the instantaneous power spectrogram and to reduce the number of parameters that have to be estimated by nonnegative matrix factorization (NMF). Though NMF is often applied with different cost functions and therefore without a model, there exists a model that is suitable to introduce NMF as a model-based source separation technique. The NMF model is introduced in the following chapter.
Sum of short-time stationary Gaussian components (single mixture)
Given a single mixture ($U = 1$), the model of independent Gaussian time-frequency bins can be evolved into the generative model first introduced by [BDBG03]. Each frame-frequency bin of the source $X_{jfr}$ is represented by $K_j$ components $C_{jkfr}$:
\[ X_{jfr} = \sum_k^{K_j} C_{jkfr} \tag{2.13} \]
Furthermore, each component can be described by a time (or frame) dependent gain $A_{jkr}^{1/2}$ and a stationary complex Gaussian distributed process with variance $D_{jfk}$. This can be seen as a model for the non-stationarity via amplitude modulated stationary processes. By shifting the gain of the components into the variance, the model can be equivalently described ([FBD09], [OF10]) as
\[ C_{jkfr} \sim \mathcal{N}(0,\, D_{jfk}A_{jkr}) \tag{2.14} \]
with a frame dependent variance. Here $\mathcal{N}(x|\mu,\Sigma) = \det(\pi\Sigma)^{-1} e^{-(x-\mu)^H \Sigma^{-1}(x-\mu)}$ denotes the proper multivariate complex Gaussian distribution.
As the spatial sound sources are assumed to be mutually and individually independent across frequencies and frames, they are also Gaussian and their means and variances add up, such that
\[ X_{jfr} \sim \sum_k^{K_j} \mathcal{N}(0,\, D_{jfk}A_{jkr}) = \mathcal{N}(0,\, \mathbf{D}_{jf}^T \mathbf{A}_{jr}) \tag{2.15} \]
The mixture is the sum of the independent spatial sound sources, such that
\[ Y_{fr} = \sum_j^J X_{jfr} \sim \mathcal{N}\Big(0,\, \sum_j^J \mathbf{D}_{jf}^T \mathbf{A}_{jr}\Big) \tag{2.16} \]
Joint dictionaries and activation matrices for all sources are defined by
\[ \mathbf{D}_f = [\mathbf{D}_{jf}^T]_J \tag{2.17} \]
\[ \mathbf{A}_r = [\mathbf{A}_{jr}^T]_J \tag{2.18} \]
with the total number of components $K = \sum_j^J K_j$. The mixture in the STFT domain is then distributed as
\[ Y_{fr} \sim \mathcal{N}(0,\, \mathbf{D}_f^T \mathbf{A}_r) \tag{2.19} \]
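The generative model can be made tangible by sampling from it: in the following short sketch, every time-frequency bin of a synthetic single channel observation is drawn as a zero-mean proper complex Gaussian whose variance is the corresponding entry of $\mathbf{D}^T\mathbf{A}$; all dimensions and matrices are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
F, R, K = 257, 100, 8                      # frequencies, frames, components

D = rng.uniform(0.1, 1.0, size=(F, K))     # nonnegative dictionary
A = rng.gamma(shape=1.0, size=(K, R))      # nonnegative activations

V = D @ A                                  # variance of every TF bin (eq. 2.19)

# proper complex Gaussian: real and imaginary parts each carry variance V/2
Y = np.sqrt(V / 2) * (rng.normal(size=(F, R)) + 1j * rng.normal(size=(F, R)))

print(np.mean(np.abs(Y) ** 2) / np.mean(V))   # close to 1
```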
ML estimation of variance parameters
[FBD09] describes the maximum likelihood estimation of $\mathbf{D}$ and $\mathbf{A}$ from $\mathbf{Y}$ via the likelihood
\[ C_{ML}(\mathbf{D},\mathbf{A}) = p(\mathbf{Y}|\mathbf{D},\mathbf{A}) \tag{2.20} \]
Due to the independence assumption, $p(\mathbf{Y}|\mathbf{D},\mathbf{A}) = \prod_{fr} p(Y_{fr}|\mathbf{D}_f,\mathbf{A}_r)$.
The negative log-likelihood function is given by
\[ C_{ML}(\mathbf{D},\mathbf{A}) = -\sum_{fr} \log p(Y_{fr}|\mathbf{D}_f,\mathbf{A}_r) = RF\log(\pi) + \sum_{fr}\left[ \log\big(\mathbf{D}_f^T\mathbf{A}_r\big) + \frac{|Y_{fr}|^2}{\mathbf{D}_f^T\mathbf{A}_r} \right] \tag{2.21} \]
\[ \overset{c}{=} \sum_{fr} d_{IS}\big(|Y_{fr}|^2 \,\big\|\, \mathbf{D}_f^T\mathbf{A}_r\big) \tag{2.22} \]
Note that $|Y_{fr}|^2$ is the instantaneous variance or instantaneous power, and its estimate is given as
\[ V_{Y,fr} = \mathbf{D}_f^T\mathbf{A}_r = \sum_j^J \mathbf{D}_{jf}^T\mathbf{A}_{jr} = \sum_j^J V_{X,jfr} \tag{2.23} \]
such that
\[ C_{ML}(\mathbf{D},\mathbf{A}) = \sum_{fr} d_{IS}\big(|Y_{fr}|^2 \,\big\|\, V_{Y,fr}\big) \tag{2.24} \]
Due to the equivalence of NMF with the Itakura-Saito divergence to ML estimation of the sum of Gaussian components (equation 2.21), it is the most commonly used NMF technique for audio source separation. Yet there exist also other models that result in solving an NMF problem; an overview is given in appendix D.1. An algorithm that solves the NMF problem with the Itakura-Saito divergence is given in chapter 3.3.1.
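As an illustration of how the cost in equation 2.24 is typically minimized, the sketch below applies the standard multiplicative update rules for Itakura-Saito NMF (in the spirit of [FBD09]) to a power spectrogram. It is a generic textbook-style implementation, not the exact algorithm reviewed in chapter 3.3.1.

```python
import numpy as np

def is_nmf(P, K, n_iter=200, eps=1e-12, seed=0):
    """NMF of a power spectrogram P (F x R) under the Itakura-Saito divergence."""
    rng = np.random.default_rng(seed)
    F, R = P.shape
    D = rng.uniform(0.1, 1.0, size=(F, K))
    A = rng.uniform(0.1, 1.0, size=(K, R))
    for _ in range(n_iter):
        V = D @ A + eps
        D *= ((P / V**2) @ A.T) / ((1.0 / V) @ A.T + eps)
        V = D @ A + eps
        A *= (D.T @ (P / V**2)) / (D.T @ (1.0 / V) + eps)
    return D, A

# Wiener-style reconstruction of source j from its components (eq. 2.10):
# X_j = (D_j @ A_j) / (D @ A) * Y, applied element-wise to the complex STFT Y.
```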
It is also important to note that the processes of a source are assumed to be independent, but the variances are dependent; they are even explicitly modeled jointly in the time-frequency domain.
By modeling each source as a sum of $K_j$ frame-wise stationary Gaussian components with frame-varying variance, one estimates $KF + KR$ parameters instead of $JFR$ out of $FR$ samples. In that sense, the NMF model can be interpreted as exploiting the spectral diversity. The variance parameters of source $j$ are sparse in the time-frequency domain, as $\mathbf{D}_j\mathbf{A}_j$ is of rank $K_j$ for $K_j < \min(R,F)$ and therefore not of full rank. The NMF model can also be compared with a stationary Gaussian process with variance $V_{jfr} = V_{jf}$: the stationary Gaussian process is simply represented as a rank-1 STFT variance matrix, with an activation vector of all ones.
If more than one mixture is observed, a similar algorithm can be derived to exploit
both spatial and spectral diversity.
2.5.2 Model for multichannel scenarios
Recall the convolutive mixing model in the STFT domain from equation 2.3,
\[ \mathbf{Y}_{fr} = \mathbf{G}_f\,\mathbf{S}_{fr} \]
The time-frequency bins are not independent between the mixtures, such that the optimum linear estimator is jointly calculated from the $U$ mixture observations $\mathbf{Y}_{fr} = [Y_{ufr}]_U$ as
\[ \hat{S}_{jfr} = E\{S_{jfr}|\mathbf{Y}_{fr}\} = E\{S_{jfr}\mathbf{Y}_{fr}^H\}\, E\{\mathbf{Y}_{fr}\mathbf{Y}_{fr}^H\}^{-1}\, \mathbf{Y}_{fr} = V_{S,jfr}\,\mathbf{G}_{jf}^H\,\Sigma_{Y,fr}^{-1}\,\mathbf{Y}_{fr} \tag{2.25} \]
The covariance matrices of sources and observations are hereby defined as
\[ \Sigma_{X,jfr} = \Sigma_{G,jf}\, V_{S,jfr} \tag{2.26} \]
\[ \Sigma_{Y,fr} = E\{\mathbf{Y}_{fr}\mathbf{Y}_{fr}^H\} = E\{\mathbf{G}_f\mathbf{S}_{fr}(\mathbf{G}_f\mathbf{S}_{fr})^H\} = \sum_j V_{S,jfr}\,\Sigma_{G,jf} \tag{2.27} \]
with the spatial covariance matrix $\Sigma_{G,jf} = E\{\mathbf{G}_{jf}\mathbf{G}_{jf}^H\} = \mathbf{G}_{jf}\mathbf{G}_{jf}^H$ and the source variance $V_{S,jfr} = E\{S_{jfr}S_{jfr}^H\}$.
As an alternative, the spatial sound sources can be estimated as
\[ \hat{\mathbf{X}}_{jfr} = \mathbf{G}_{jf}\,\hat{S}_{jfr} = \Sigma_{G,jf}\,V_{S,jfr}\,\Sigma_{Y,fr}^{-1}\,\mathbf{Y}_{fr} = \Sigma_{X,jfr}\,\Sigma_{Y,fr}^{-1}\,\mathbf{Y}_{fr} \tag{2.28} \]
Equation 2.28 describes the multichannel Wiener filter in the STFT domain and is used
in the source separation context for example in [VBGB14] and [LDDR13]. It denotes
an approximation of the optimum filter for mixtures of Gaussian processes, with the
assumptions made above.
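Written out per time-frequency bin, the multichannel Wiener filter of equation 2.28 reduces to a few lines; the following sketch assumes the spatial covariance matrices and source variances are already known (e.g. from a fitted model) and merely applies the filter.

```python
import numpy as np

def multichannel_wiener(Y_fr, Sigma_G, V_S):
    """Estimate the spatial images of all sources in one TF bin (eq. 2.28).

    Y_fr    : observed mixture vector, shape (U,), complex
    Sigma_G : spatial covariance matrices, shape (J, U, U)
    V_S     : source variances in this bin, shape (J,)
    returns : spatial source images, shape (J, U)
    """
    Sigma_X = V_S[:, None, None] * Sigma_G          # eq. (2.26)
    Sigma_Y = Sigma_X.sum(axis=0)                   # eq. (2.27)
    W = Sigma_X @ np.linalg.inv(Sigma_Y)            # J filters of size U x U
    return np.einsum('juv,v->ju', W, Y_fr)
```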
The optimum estimation filter is already given in equation 2.28, but as in the single
channel case the variance parameters have to be estimated by a suitable model. In
[VJA+10] this is referred to as variance modeling, in contrast to linear models, which are
not discussed in this work. The spatial modeling via the spatial covariance matrix was
already mentioned in chapter 2.1.4. The authors of [VJA+10] furthermore divide the spectral modeling into monophonic and polyphonic spectral models. Gaussian mixture models (GMMs), hidden Markov models (HMMs) and extensions such as Gaussian scaled mixture models (GSMMs) are listed as monophonic spectral models, and we refer to [VJA+10] for an overview of these methods. Our interest lies in the polyphonic spectral models, as their factorization of the power spectrogram again leads to the NMF algorithms described in the last chapter for a single mixture and in the following for multiple mixtures.
Modeling multiple mixtures by the sum of short time stationary Gaussian
components
[OF10] was the first to model multiple mixtures as the sum of short-time stationary Gaussian processes. The model results from the combination of the multichannel mixing model of equation 2.3 with equation 2.13 and an additional additive spatial Gaussian white noise term $\mathbf{B}_{fr}$ for each mixture, with diagonal spatial covariance matrix $\Sigma_{B,fr} = \mathrm{diag}([V_{B,ufr}]_U)$. The multichannel model is then written as
\[ \mathbf{Y}_{fr} = \mathbf{G}_{C,f}\,\mathbf{C}_{fr} + \mathbf{B}_{fr} \tag{2.29} \]
The difference to the mixing model is that, instead of $J$ independent Gaussian sources, the mixture is composed of $K$ independent Gaussian components (where each component belongs to one of the $J$ sources) and an additional Gaussian noise term. The variance of the components is modeled in the NMF-typical way by $V_{C,kfr} = D_{fk}A_{kr}$.
The optimum filter of equation 2.25 can also be formulated for the components as
\[ \hat{C}_{kfr} = V_{C,kfr}\,\mathbf{G}_{j_k f}^H\,\Sigma_{Y,fr}^{-1}\,\mathbf{Y}_{fr} \tag{2.30} \]
\[ \text{with}\quad \Sigma_{Y,fr} = \sum_{k\in\mathcal{K}} V_{C,kfr}\,\Sigma_{G,kf} + \Sigma_{B,fr} \tag{2.31} \]
Note that in [OF10] the mixing vectors of components of the same source are identical, such that $\mathbf{G}_{k_1 f} = \mathbf{G}_{k_2 f}$ if $k_1, k_2 \in \mathcal{K}_j$. Here $j_k$ denotes the source that component $k$ belongs to. The estimate of a source can then simply be obtained as $\hat{S}_{j,fr} = \sum_{k\in\mathcal{K}_j} \hat{C}_{kfr}$.
Again, in order to apply the filter, the spatial parameters $\mathbf{G}$ and the component variances, expressed via $\mathbf{D}$ and $\mathbf{A}$, have to be determined beforehand.
Estimation of parameters by discarding the mutual information
When the mutual information between the mixtures is discarded, the multichannel cost function is simply the sum of equation 2.24 over all mixtures:
\[ C_{ML}(\mathbf{G},\mathbf{D},\mathbf{A}) = \sum_{u,f,r} d_{IS}\big(|Y_{ufr}|^2 \,\big\|\, V_{Y,ufr}\big) \tag{2.32} \]
In the single channel case, the estimated variance of the observation was simply given by the sum of the variances of the spatial sound images. In the original mixing model, where the mixing matrices are used, dictionary and activation coefficients form the variance of the components by $V_{S,jfr} = \sum_{k\in\mathcal{K}_j} D_{fk}A_{kr}$. The components' impact on the variance of the $u$-th mixture is then obtained by multiplication with the squared mixing coefficient. As the components are assumed to be statistically independent, their variances add up, such that in the noiseless case, $V_{B,fr} = 0$, the variance of a single mixture's time-frequency bin can be written as
\[ V_{Y,ufr} = \sum_j |G_{ujf}|^2 \sum_{k\in\mathcal{K}_j} D_{fk}A_{kr} \tag{2.33} \]
The algorithm of [OF10] to optimize equation 2.32 with respect to the parameters $\mathbf{G}$, $\mathbf{D}$, $\mathbf{A}$ is reviewed in chapter 4.4.
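For illustration, the variance model of equation 2.33 and the summed Itakura-Saito cost of equation 2.32 can be evaluated as in the following sketch, assuming the component-to-source assignment is encoded in an indicator matrix; the multiplicative updates that actually minimize this cost are derived in chapter 4.4 and are not reproduced here.

```python
import numpy as np

def mc_variance(absG2, D, A, K_of_j):
    """Variance model of eq. (2.33).

    absG2  : |G_{ujf}|^2, shape (U, J, F)
    D, A   : dictionary (F, K) and activations (K, R)
    K_of_j : indicator matrix, shape (J, K), K_of_j[j, k] = 1 if k in K_j
    returns: V_Y with shape (U, F, R)
    """
    # per-source variances V_{S,jfr} = sum_{k in K_j} D_fk A_kr
    V_S = np.einsum('jk,fk,kr->jfr', K_of_j, D, A)
    return np.einsum('ujf,jfr->ufr', absG2, V_S)

def is_cost(P, V, eps=1e-12):
    """Summed Itakura-Saito divergence of eq. (2.32); P = |Y|^2, shape (U, F, R)."""
    ratio = P / (V + eps)
    return np.sum(ratio - np.log(ratio + eps) - 1.0)
```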
Estimation of parameters via the EM approach
The authors also propose a second algorithm, in which the mutual information between the channels is not discarded. They perform maximum likelihood estimation on the complete data $\{\mathbf{Y},\mathbf{C}\}$ via
\[ C_{ML}(\mathbf{G},\mathbf{D},\mathbf{A}) = p(\mathbf{Y},\mathbf{C}|\mathbf{G},\mathbf{D},\mathbf{A}) \tag{2.34} \]
Note that while the observations $\mathbf{Y}$ are known (observable), the components $\mathbf{C}$ are hidden and can only be estimated (for example by 2.30). As shown later, this model therefore results in an expectation maximization (EM) algorithm.
Equation 2.34 can be reformulated using Bayes' rule as
\[ C_{ML}(\mathbf{G},\mathbf{D},\mathbf{A}) = p(\mathbf{Y}|\mathbf{C},\mathbf{G},\mathbf{D},\mathbf{A})\, p(\mathbf{C}|\mathbf{G},\mathbf{D},\mathbf{A}) \tag{2.35} \]
The negative log-likelihood is then given by
\[ -\log p(\mathbf{Y},\mathbf{C}|\mathbf{G},\mathbf{D},\mathbf{A}) = \sum_{fr} \Big[ -\log p(\mathbf{Y}_{fr}|\mathbf{C}_{fr},\mathbf{G}_{C,f},\mathbf{D}_f,\mathbf{A}_r) - \log p(\mathbf{C}_{fr}|\mathbf{G}_{C,f},\mathbf{D}_f,\mathbf{A}_r) \Big] \tag{2.36} \]
Given the components $\mathbf{C}_{fr}$ and all parameters, the observations are simply distributed with the probability distribution of the additive noise $\mathbf{B}_{fr}$ around the mean vector $\mathbf{G}_{C,f}\mathbf{C}_{fr}$ defined by the components:
\[ -\log p(\mathbf{Y}_{fr}|\mathbf{C}_{fr},\mathbf{G}_{C,f},\mathbf{D}_f,\mathbf{A}_r) = -\log\mathcal{N}(\mathbf{Y}_{fr}|\mathbf{G}_{C,f}\mathbf{C}_{fr},\Sigma_{B,fr}) = \log\pi + \log\det(\Sigma_{B,fr}) + (\mathbf{Y}_{fr}-\mathbf{G}_{C,f}\mathbf{C}_{fr})^H\Sigma_{B,fr}^{-1}(\mathbf{Y}_{fr}-\mathbf{G}_{C,f}\mathbf{C}_{fr}) \tag{2.37} \]
As the components are also Gaussian distributed and statistically independent, their negative log-likelihood can be written as
\[ -\log p(\mathbf{C}_{fr}|\mathbf{G}_{C,f},\mathbf{D}_f,\mathbf{A}_r) = \sum_k -\log\mathcal{N}(C_{kfr}|0,\, D_{fk}A_{kr}) = \sum_k \Big[ \log\pi + \log(D_{fk}A_{kr}) + \frac{|C_{kfr}|^2}{D_{fk}A_{kr}} \Big] \tag{2.38} \]
The negative log-likelihood is thus, up to a constant, equal to
\[ -\log p(\mathbf{Y},\mathbf{C}|\mathbf{G},\mathbf{D},\mathbf{A}) \overset{c}{=} \sum_{fr} \Big[ \log\det(\Sigma_{B,fr}) + (\mathbf{Y}_{fr}-\mathbf{G}_{C,f}\mathbf{C}_{fr})^H\Sigma_{B,fr}^{-1}(\mathbf{Y}_{fr}-\mathbf{G}_{C,f}\mathbf{C}_{fr}) + \sum_k \Big( \log(D_{fk}A_{kr}) + \frac{|C_{kfr}|^2}{D_{fk}A_{kr}} \Big) \Big] \tag{2.39} \]
The algorithm of [OF10] to optimize this cost function via an EM algorithm is reviewed in chapter 3.3.2.
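For reference, the complete-data cost of equation 2.39 (up to its additive constant) can be evaluated as in the following sketch; all arrays are assumed given, for example from the E-step estimates of the components.

```python
import numpy as np

def complete_data_nll(Y, C, G, Sigma_B, D, A, eps=1e-12):
    """Negative complete-data log-likelihood of eq. (2.39), up to a constant.

    Y       : observations, shape (U, F, R), complex
    C       : component signals, shape (K, F, R), complex
    G       : mixing vectors per component, shape (U, K, F), complex
    Sigma_B : diagonal noise covariance entries V_{B,ufr}, shape (U, F, R)
    D, A    : component variance model, shapes (F, K) and (K, R)
    """
    # residual Y_fr - G_{C,f} C_fr for every TF bin
    GC = np.einsum('ukf,kfr->ufr', G, C)
    E = Y - GC
    noise_term = np.sum(np.log(Sigma_B + eps) + np.abs(E) ** 2 / (Sigma_B + eps))
    V_C = np.einsum('fk,kr->kfr', D, A)                 # D_fk A_kr
    comp_term = np.sum(np.log(V_C + eps) + np.abs(C) ** 2 / (V_C + eps))
    return noise_term + comp_term
```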
2.6 The permutation problem
So far, source separation was considered for the case that only the mixtures are observable. Obviously this implies that no labeled reference signals are used. A recording containing only the ego-noise would, for example, be considered a labeled reference for the ego-noise. Algorithms that are based not only on the observed mixtures but also on additional recordings will further on be termed guided or informed. The most common such approach for NMF is discussed in chapter 4.3.1. An algorithm can be guided/informed regardless of its assumptions, for example on statistical properties or on an underlying model such as random processes.
For the spatial diversity in ICA, the permutation problem was already pointed out in [Car98a]. [HO00] gives a formal proof of the ambiguity in the order of the components. Consider the mixing model $\mathbf{Y} = \mathbf{G}\mathbf{S}$ with the assumption that the sources $\mathbf{S}$ are statistically independent, and assume there exists a demixing matrix $\mathbf{W} = \mathbf{G}^{+}$ such that $\mathbf{W}\mathbf{Y} = \mathbf{S}$. However, there also exists a matrix $\tilde{\mathbf{W}} = \mathbf{P}\mathbf{W}$ such that $\tilde{\mathbf{W}}\mathbf{Y} = \mathbf{P}\mathbf{S}$. When $\mathbf{P}$ is a permutation matrix, the demixed signals $\mathbf{P}\mathbf{S}$ are still statistically independent. Hence $\mathbf{W}$ and $\tilde{\mathbf{W}}$ are both optimum solutions regarding statistical independence. This is closely related to the statement of [Car98a] about a theorem of Darmois. By using statistical independence as the optimization criterion, the correct order of the sources cannot be determined [VBGB14].
Spatial diversity can also be interpreted as extracting a number of point sources distributed in space. As stated earlier, the order of the extracted sources is arbitrary. This random ordering prevents several point sources from being combined correctly into a single overall source. With this idea, the transition to the equivalent permutation problem of the spectral diversity is rather straightforward. The spectral equivalent to a point source in space is a certain frequency pattern. Assume that $K$ frequency patterns or components have been successfully extracted. As for the spatial diversity, the order of the components is arbitrary. It is known that the $K$ components form $J$ sources by addition. But, without exploiting additional data, there is no information available as to which components belong to which source.
The formulation of the permutation problem is no longer straightforward if spatial and spectral diversity are exploited simultaneously. The ambiguity of the components is resolved in the multichannel NMF algorithm described in chapter 2.5.2, as all components of one source share the same mixing filters.
A component $k_1$ of source $j_1$ with mixing vector $G_{uj_1f}$, codeword $\mathbf{D}_{k_1}$ and activation vector $\mathbf{A}_{k_1}$ leads to the estimated variance $V_{X,uj_1k_1fr} = |G_{uj_1f}|^2 D_{fk_1}A_{k_1r}$. For a given mixture $u$, the same variance can easily be constructed by a component $k_2$ of a different source $j_2$, such that
\[ V_{X,uj_2k_2fr} = |G_{uj_2f}|^2 D_{fk_2}A_{k_2r} = V_{X,uj_1k_1fr} \tag{2.40} \]
by setting $A_{k_2r} = A_{k_1r}$ and
\[ D_{fk_2} = D_{fk_1}\, \frac{|G_{uj_1f}|^2}{|G_{uj_2f}|^2} \tag{2.41} \]
But equation 2.41 has to be fulfilled for all mixtures, such that component $k_1$ of source $j_1$ leads to the same variance parameters as component $k_2$ of source $j_2$. This is in general not granted, as a component's contribution cannot be equivalently represented by a component of another source.
At this point it is, however, not certain what happens if $J \gg U$. For this evaluation it could be beneficial to interpret the problem as a nonnegative tensor factorization problem, as is done in recent literature as an extension of NMF for multichannel data (for an introduction see for example [CZPA09]).
In the case of informed/guided source separation, the ordering or permutation problem can efficiently be solved by using labeled training data, as discussed in chapter 4.3.1.
Chapter 3
Literature review on matrix
factorization
In chapter 2.5.1 it was shown that nonnegative matrix factorization can be beneficial for audio source separation. In recent years, NMF has received a lot of attention, and not only in the audio source separation community. For this reason this chapter is dedicated to reviewing NMF algorithms that are of potential interest for the ego-noise reduction.
Matrix factorization (MF) is a technique to represent a matrix $\mathbf{X} \in \mathbb{R}^{F\times R}$ by the product of two matrices $\mathbf{D} \in \mathbb{R}^{F\times K}$ and $\mathbf{A} \in \mathbb{R}^{K\times R}$ such that
\[ \mathbf{X} = \mathbf{D}\mathbf{A} \tag{3.1} \]
As reviewed earlier, the separation of audio signals is often performed in the DFT domain. In this case $\mathbf{X}$ represents observations in the time-frequency domain with the number of frequencies $F$ and the number of frames $R$. Further on, $\mathbf{D}$ will be referred to as the dictionary and each of its $K$ columns $\mathbf{D}_k$ as a codeword. A codeword hence represents a set of simultaneously excited frequencies, while the entries in $\mathbf{D}_k$ fix the ratio of their amplitudes. The matrix $\mathbf{A}$ will be referred to as the activation matrix, as it shows the activity of the codewords over the frames and therefore over time.
If all involved matrices are nonnegative ($\mathbf{X} \in \mathbb{R}_{\geq 0}^{F\times R}$, $\mathbf{D} \in \mathbb{R}_{\geq 0}^{F\times K}$ and $\mathbf{A} \in \mathbb{R}_{\geq 0}^{K\times R}$), the problem is referred to as nonnegative matrix factorization (NMF).
Regardless of the constraint on nonnegativity, an exact solution $\mathbf{X} = \mathbf{D}\mathbf{A}$ exists only if $\mathrm{rank}(\mathbf{X}) \leq K$. Due to the additive (white) noise generated by the microphones, the rank of $\mathbf{X}$ is $F$ for audio source separation (as usually $R > F$). For this reason, the problem is sometimes also referred to as (nonnegative) matrix approximation. Dictionary and activation matrix form only an estimate of $\mathbf{X}$:
\[ \hat{\mathbf{X}} = \mathbf{D}\mathbf{A} \tag{3.2} \]
In order to determine dictionary and activation matrix, or to evaluate the performance of some given $\mathbf{A}$ and $\mathbf{D}$, a cost function or distance measure has to be defined. Some popular cost functions for matrix factorization are presented in the following.
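As a first concrete illustration of how such a factorization can be computed, the classical multiplicative update rules for the Euclidean (squared Frobenius) cost are sketched below; this is a generic implementation and not the constrained variant developed later in this thesis.

```python
import numpy as np

def nmf_euclidean(X, K, n_iter=200, eps=1e-12, seed=0):
    """Multiplicative-update NMF minimizing ||X - D A||_F^2."""
    rng = np.random.default_rng(seed)
    F, R = X.shape
    D = rng.uniform(0.1, 1.0, size=(F, K))
    A = rng.uniform(0.1, 1.0, size=(K, R))
    for _ in range(n_iter):
        A *= (D.T @ X) / (D.T @ D @ A + eps)
        D *= (X @ A.T) / (D @ A @ A.T + eps)
    return D, A

# Example: factorize a random nonnegative matrix
X = np.random.default_rng(1).uniform(size=(64, 200))
D, A = nmf_euclidean(X, K=10)
print(np.linalg.norm(X - D @ A) / np.linalg.norm(X))
```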
3.1 Cost functions for matrix factorization
As cost functions, any function can be of interest that measures the similarity between the estimated samples $\hat X$ and the true samples $X$. One set of functions for such a comparison are norms. A function $D(\hat X - X)$ is called a norm if the following properties hold:
• absolute homogeneity: $D(aX) = |a|\,D(X)$
• triangle inequality: $D(X + Y) \leq D(X) + D(Y)$
• if $D(X) = 0$ then $X$ is the zero vector
A more general set of functions are divergences, which will further on be denoted as $D(X\|\hat X)$. Divergences only have to satisfy the following properties:
• nonnegativity: $D(X\|\hat X) \geq 0$
• $D(X\|\hat X) = 0$ if and only if $X = \hat X$
For example, a divergence does not need to satisfy the triangle inequality and does not need to be symmetric, such that in general $D(X\|\hat X) \neq D(\hat X\|X)$.
In the context of matrix factorization, the divergence is usually applied to each element of the matrices, and the total cost is the sum of the divergences of all elements in the matrix. Further on, cost functions and divergences on matrix data will be referred to with the capital letters $C$ and $D(X\|\hat X)$, and the divergence on a single element by the lower-case function $d(X_{fr}\|\hat X_{fr})$. If not stated otherwise, the following relations hold:
$$C(X,\hat X) = D(X\|\hat X) = \sum_{f,r} d(X_{fr}\|\hat X_{fr}) \qquad (3.3)$$
In appendix D.1 it is pointed out that using the elementwise or separable divergence of the observation matrix can be seen as a statistical independence assumption between the observations.
[CZA06] names three classes of generalized divergences that could be of interest for NMF:
• Csiszár's divergences
• Bregman divergences
• Amari's alpha divergence
In the following two special cases of the first two divergences and a generalization,
which is named generalized α-β divergence, will be reviewed.
3.1.1 Generalized α divergence as a Csiszár divergence
In [CZA06] the focus is on the separable Csiszár φ-divergence
$$D_C\left(X\|\hat X\right) = \sum_{r,f} d_\varphi\left(X_{rf}\|\hat X_{rf}\right) \qquad (3.4)$$
with $d_\varphi(x\|y) = x\,\varphi\!\left(\frac{y}{x}\right)$.
For this work one particular case of the Csiszár divergences is of interest, the generalized α divergence. The derivation is shown in appendix D.2. In this work the divergence is defined in accordance with [CZPA09] and [CCA11] as
$$d_\alpha(x\|y) = \begin{cases} \frac{1}{\alpha(\alpha-1)}\left(x^{\alpha}y^{1-\alpha} - \alpha x + (\alpha-1)y\right) & \alpha\in\mathbb{R}\setminus\{0,1\}\\ x\ln\frac{x}{y} + y - x & \alpha = 1\\ y\ln\frac{y}{x} + x - y & \alpha = 0 \end{cases} \qquad (3.5)$$
3.1.2 The β divergence as a Bregman divergence
Again, the separable Bregman divergence is of interest for matrix factorization, such that
$$C_B\left(X,\hat X\right) = \sum_{i,j} d_\varphi\left(X_{ij}\|\hat X_{ij}\right) \qquad (3.6)$$
with the Bregman divergence $d_\varphi(x\|y) = \varphi(x) - \varphi(y) - \nabla\varphi(y)(x-y)$. $\varphi$ has to be a strictly convex function with a continuous gradient $\nabla\varphi$.
In [DS05] multiplicative update rules for the Bregman divergence are derived. Later
[LLP12] propose a coordinate descent algorithm using Taylor series expansion. At
this point we will only use the Bregman divergence to derive the β-divergence in the
following.
The β-divergence was introduced by [EK01] in their work on robust maximum likeli-
hood estimation. Later [CAZ+06] and [FBD09] used it for matrix factorization.
In this thesis the β-divergence is defined as it is used by [CCA11] where β is shifted
by +1 compared to other authors. The new notation is used for the comparison with
the generalized α-β-divergence in chapter 3.1.3.
In appendix D.3 the β-divergence is derived as a special case of the Bregman divergences by using $\varphi_\beta''(x) = x^{\beta-1}$, $\varphi_\beta'(x) = \frac{x^{\beta}}{\beta}$ and $\varphi_\beta(x) = \frac{x^{\beta+1}}{(\beta+1)\beta}$.
The β-divergence is then given as
$$d_\beta(x\|y) = \begin{cases} \frac{1}{\beta(\beta+1)}\left(x^{\beta+1} + \beta y^{\beta+1} - (\beta+1)xy^{\beta}\right) & \beta\in\mathbb{R}\setminus\{-1,0\}\\ x\ln\frac{x}{y} + y - x & \beta = 0\\ \frac{x}{y} - \ln\frac{x}{y} - 1 & \beta = -1 \end{cases} \qquad (3.7)$$
3.1.3 Generalized α-β divergence
In [CCA11] a new type of generalized divergence is proposed:
$$d_{\alpha,\beta}(x\|y) = \begin{cases}
-\frac{1}{\alpha\beta}\left(x^{\alpha}y^{\beta} - \frac{\alpha}{\alpha+\beta}x^{\alpha+\beta} - \frac{\beta}{\alpha+\beta}y^{\alpha+\beta}\right) & \text{for } \alpha,\beta,\alpha+\beta\neq 0\\
\frac{1}{\alpha^{2}}\left(x^{\alpha}\ln\left(\frac{x^{\alpha}}{y^{\alpha}}\right) - x^{\alpha} + y^{\alpha}\right) & \text{for } \alpha\neq 0,\ \beta = 0\\
\frac{1}{\alpha^{2}}\left(\ln\left(\frac{y^{\alpha}}{x^{\alpha}}\right) + \left(\frac{y^{\alpha}}{x^{\alpha}}\right)^{-1} - 1\right) & \text{for } \alpha = -\beta\neq 0\\
\frac{1}{\beta^{2}}\left(y^{\beta}\ln\left(\frac{y^{\beta}}{x^{\beta}}\right) - y^{\beta} + x^{\beta}\right) & \text{for } \alpha = 0,\ \beta\neq 0\\
\frac{1}{2}\left(\ln(x) - \ln(y)\right)^{2} & \text{for } \alpha,\beta = 0
\end{cases} \qquad (3.8)$$
As shown in table 3.1, the generalized α-β divergence includes the previously defined α divergence and β divergence as special cases. Before investigating some special cases, first some of the generalized α-β divergence's properties are worth noting:
• Duality: $D_{\alpha,\beta}(X\|\hat X) = D_{\beta,\alpha}(\hat X\|X)$
• Inversion: $D_{-\alpha,-\beta}(X\|\hat X) = D_{\alpha,\beta}(X^{.[-1]}\|\hat X^{.[-1]})$
• Scaling: $D_{\omega\alpha,\omega\beta}(X\|\hat X) \overset{m}{=} D_{\alpha,\beta}(X^{.[\omega]}\|\hat X^{.[\omega]})$
• Zoom: as described in the following.
For $\alpha \neq 0$ and with $\tilde\beta = \frac{\beta}{\alpha}$, it holds that
$$D_{\alpha,\beta=\alpha\tilde\beta}\left(X\|\hat X\right) = \sum_{f,r} -\frac{1}{\alpha^{2}\tilde\beta}\left(X_{fr}^{\alpha}\hat X_{fr}^{\alpha\tilde\beta} - \frac{1}{1+\tilde\beta}X_{fr}^{\alpha(1+\tilde\beta)} - \frac{\tilde\beta}{1+\tilde\beta}\hat X_{fr}^{\alpha(1+\tilde\beta)}\right) = \frac{1}{\alpha^{2}}D_{\tilde\beta}\left(X^{.[\alpha]}\|\hat X^{.[\alpha]}\right) \overset{m}{=} D_{\tilde\beta}\left(X^{.[\alpha]}\|\hat X^{.[\alpha]}\right) \qquad (3.9)$$
So the α-β-divergence can be interpreted as a β-divergence with parameter $\tilde\beta = \frac{\beta}{\alpha}$ where an α-zoom is applied to its arguments.
This α-zoom property can hence also be applied to all the special cases of the β-divergence. A possible use of this interpretation is to define the α parameter so as to make the observations for the Itakura-Saito divergence more Gaussian (see chapter 2.5.1).
For the sake of completeness, a similar equivalence can be determined for the α-divergence, if $\alpha + \beta \neq 0$.
$$D_{\alpha,\beta}\left(X\|\hat X\right) \overset{m}{=} D_{\tilde\alpha}\left(X^{.[\alpha+\beta]}\|\hat X^{.[\alpha+\beta]}\right) \quad\text{with } \tilde\alpha = \frac{\alpha}{\alpha+\beta} \qquad (3.10)$$
An overview of the special cases of the generalized α-β divergence is listed in table 3.1.
Table 3.1: List of divergences that are included in the generalized α-β divergence

                                  α      β      model
  α-divergence                    α      1−α
  β-divergence                    1      β
  Squared Euclidean distance      1      1      Gaussian
  Kullback-Leibler divergence     1      0      Poisson
  Itakura-Saito divergence        1      −1     Gaussian or Gamma multiplicative noise
  Hellinger distance              0.5    0.5
  Log-Euclidean distance          0      0
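As a compact reference, the following sketch evaluates the generalized α-β divergence of equation 3.8 elementwise and sums over all entries. The function name, the small eps safeguard and the random test matrices are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def ab_divergence(x, y, alpha, beta, eps=1e-12):
    """Generalized alpha-beta divergence of eq. 3.8, summed over all matrix entries."""
    x = np.maximum(np.asarray(x, float), eps)
    y = np.maximum(np.asarray(y, float), eps)
    if alpha != 0 and beta != 0 and alpha + beta != 0:
        d = -(x**alpha * y**beta
              - alpha / (alpha + beta) * x**(alpha + beta)
              - beta / (alpha + beta) * y**(alpha + beta)) / (alpha * beta)
    elif alpha != 0 and beta == 0:
        d = (x**alpha * np.log(x**alpha / y**alpha) - x**alpha + y**alpha) / alpha**2
    elif alpha == -beta and alpha != 0:
        r = y**alpha / x**alpha            # alpha = -beta != 0 (Itakura-Saito for alpha = 1)
        d = (np.log(r) + 1.0 / r - 1.0) / alpha**2
    elif alpha == 0 and beta != 0:
        d = (y**beta * np.log(y**beta / x**beta) - y**beta + x**beta) / beta**2
    else:                                  # alpha = beta = 0: log-Euclidean distance
        d = 0.5 * (np.log(x) - np.log(y))**2
    return d.sum()

# special cases from table 3.1:
X, Xh = np.random.rand(4, 6), np.random.rand(4, 6)
d_euc = ab_divergence(X, Xh, 1, 1)    # half the squared Euclidean distance per entry
d_kl  = ab_divergence(X, Xh, 1, 0)    # generalized Kullback-Leibler divergence
d_is  = ab_divergence(X, Xh, 1, -1)   # Itakura-Saito divergence
```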
In the following we will explicitly view some divergences that are of particular impor-
tance for NMF, as they can be interpreted as ML estimate of some model. A review
on the models can be found in appendix D.1.
Squared Euclidean distance
For α = 1, β = 1, equation 3.8 results in the squared Euclidean distance or least squares cost, which can be described by the squared Frobenius norm of the estimation error:
$$D_{\alpha=1,\beta=1}\left(X\|\hat X\right) = C_{LS}\left(X,\hat X\right) = \|X - \hat X\|_F^2 = \sum_{i,j}\left(X_{ij} - \hat X_{ij}\right)^2 \qquad (3.11)$$
Furthermore, one can define a generalized squared Euclidean distance with α-zoom as
$$D_{\alpha,\beta=\alpha}\left(X\|\hat X\right) = \sum_{f,r} -\frac{1}{\alpha^{2}}\left(X_{fr}^{\alpha}\hat X_{fr}^{\alpha} - \frac{1}{2}X_{fr}^{2\alpha} - \frac{1}{2}\hat X_{fr}^{2\alpha}\right) \overset{m}{=} \|X^{.[\alpha]} - \hat X^{.[\alpha]}\|_F^2 = C_{LS}\left(X^{.[\alpha]},\hat X^{.[\alpha]}\right) \qquad (3.12)$$
Kullback-Leibler divergence
For α = 1, β = 0:
$$D_{\alpha=1,\beta=0}\left(X\|\hat X\right) = C_{KL}\left(X,\hat X\right) = \sum_{i,j} X_{ij}\ln\frac{X_{ij}}{\hat X_{ij}} - X_{ij} + \hat X_{ij} \qquad (3.13)$$
For $\sum_{i,j}X_{ij} = \sum_{i,j}\hat X_{ij} = 1$ this cost function is the Kullback-Leibler divergence or relative entropy. Again, a generalized Kullback-Leibler divergence with α-zoom results from
$$D_{\alpha,\beta=0}\left(X\|\hat X\right) \overset{m}{=} C_{KL}\left(X^{.[\alpha]},\hat X^{.[\alpha]}\right) \qquad (3.14)$$
Itakura-Saito divergence
Another cost function, used for NMF in [FBD09], is the Itakura-Saito divergence. It is the limit case for α = 1, β = −1:
$$D_{\alpha=1,\beta=-1}\left(X\|\hat X\right) = C_{IS}\left(X,\hat X\right) = \sum_{i,j}\frac{X_{ij}}{\hat X_{ij}} - \ln\frac{X_{ij}}{\hat X_{ij}} - 1 \qquad (3.15)$$
One interesting property of the Itakura-Saito divergence is its scale-invariance:
$$d_{IS}(\gamma x\|\gamma y) = d_{IS}(x\|y) \qquad (3.16)$$
For α = −β, a generalized Itakura-Saito divergence with α-zoom can be derived as
$$D_{\alpha,\beta=-\alpha}\left(X\|\hat X\right) \overset{m}{=} C_{IS}\left(X^{.[\alpha]},\hat X^{.[\alpha]}\right) \qquad (3.17)$$
As an example of divergences that cannot be represented by the generalized α-β divergence, a class of generalized γ-divergences can be derived by applying a nonlinear transformation [CCA11]. In this report, though, the focus regarding divergences is on the versatile generalized α-β divergence.
3.2 Properties of (nonnegative) matrix factorization
In this section some important properties of matrix factorization are described.
3.2.1 Indeterminacies
The factorization of a matrix Y into two matrices D and A introduces new indeterminacies. It is important to distinguish the MF indeterminacies from the indeterminacies introduced with a mixing model (see chapters 2.1.2 and 2.6).
Scaling
In the mixing model there is the amplitude and phase ambiguity between mixing matrix and source signal. In addition to that, a scaling ambiguity arises for the NMF parameters. It can best be explained by formally introducing a $K\times K$ diagonal matrix $\Lambda$ into the estimate as $\hat Y = D_\Lambda\Lambda A_\Lambda$, where $D_\Lambda$ and $A_\Lambda$ are scaled versions of $D$, $A$ such that $\|D_{\Lambda,k}\|_p = \|A_{\Lambda,k}\|_p = 1\ \forall k$. $p$ can be chosen such that the entries of $\Lambda$ correspond to the energy of the components (e.g. $p = 1$ if $Y$ represents the magnitude squared or $p = 2$ for the magnitude of the actual observed microphone signal). By shifting the energy from $\Lambda$ arbitrarily onto $D$ and $A$ for each component $k$, one can create an infinite number of pairs of activation matrix and dictionary which all result in the same cost.
In [OF10] this scaling ambiguity is for example removed by rescaling activation matrix
and dictionary, such that all energy is transferred to the activation coefficients.
Permutation
Besides the scaling ambiguity NMF also has a permutation ambiguity. This spectral
permutation was already discussed in chapter 2.6, after NMF was introduced as ML
estimator for variance parameters.
3.2.2 Local optimality
In chapter 3.3 algorithms are introduced to optimize the MF cost function. At this
point only the optimality of these algorithms is discussed.
It is well known that typical NMF cost functions are convex in A alone and convex in D alone, but non-convex in both parameters jointly, which is how NMF is treated in most of the literature (e.g. in [CCA11]). These algorithms therefore perform local optimization and thus require a suitable initialization. On the positive side, when considering NMF as an ML parameter estimator in source separation, we are not necessarily interested in the optimal solution, but only in a set of parameters that provides sufficiently good separation results.
Nevertheless some research in the area of convex NMF algorithms has also been per-
formed (see e.g. [VGA09] or convex-hull NMF in [TKWB11]).
3.3 Solving the matrix factorization problem
This chapter gives examples for the two most common approaches to solve the MF
problem, expectation maximization (EM) and multiplicative update (MU) rules.
3.3.1 Nonnegative matrix factorization via multiplicative update rules
Multiplicative update rules were first used by [LS00] for the squared Euclidean distance and the Kullback-Leibler divergence.
Multiplicative updates ensure nonnegativity by an elementwise multiplication with a nonnegative matrix in each iteration. As the inverse of a nonnegative matrix is itself nonnegative only if the matrix is a scaled permutation (monomial) matrix, it is also convenient that MU rules usually do not require the calculation of matrix inverses.
Another interesting property is that parameters that are initialized with zero stay zero.
Multiplicative Update (MU) algorithm for the generalized α-β-divergence
In [CCA11] multiplicative update rules for their generalized α-β-divergence are pre-
sented. They can thus be used for all the divergences listed in table 3.1. In appendix
D.4 the derivation is reviewed to highlight its similarity to the multichannel MU rules
derived in chapter 4.4.
For α ≠ 0 the update rules can be formulated for the matrices as
$$A \leftarrow A\,.\,\left(\frac{D^T\left(\hat X^{.[\beta-1]}\,.\,X^{.[\alpha]}\right)}{D^T\hat X^{.[\alpha+\beta-1]}}\right)^{.\left[\frac{1}{\alpha}\right]} \qquad (3.18)$$
$$D \leftarrow D\,.\,\left(\frac{\left(\hat X^{.[\beta-1]}\,.\,X^{.[\alpha]}\right)A^T}{\hat X^{.[\alpha+\beta-1]}A^T}\right)^{.\left[\frac{1}{\alpha}\right]} \qquad (3.19)$$
For completeness, the update rules for the β-divergence are derived by setting α = 1. In matrix notation this writes as
$$A \leftarrow A\,.\,\left(\frac{D^T\left(\hat X^{.[\beta-1]}\,.\,X\right)}{D^T\hat X^{.[\beta]}}\right) \qquad (3.20)$$
$$D \leftarrow D\,.\,\left(\frac{\left(\hat X^{.[\beta-1]}\,.\,X\right)A^T}{\hat X^{.[\beta]}A^T}\right) \qquad (3.21)$$
and is equivalent to the updates in [FBD09] when considering that their β-divergence's β is shifted by +1.
Also the α-divergence update rules can be derived by inserting β = 1 − α:
$$A \leftarrow A\,.\,\left(\frac{D^T\left(\hat X^{.[-\alpha]}\,.\,X^{.[\alpha]}\right)}{D^T\hat X^{.[0]}}\right)^{.\left[\frac{1}{\alpha}\right]} \qquad (3.22)$$
$$D \leftarrow D\,.\,\left(\frac{\left(\hat X^{.[-\alpha]}\,.\,X^{.[\alpha]}\right)A^T}{\hat X^{.[0]}A^T}\right)^{.\left[\frac{1}{\alpha}\right]} \qquad (3.23)$$
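To make the structure of these updates concrete, the following sketch implements the single-channel MU iteration of equations 3.18 and 3.19 for α ≠ 0; the function name, random initialization and eps flooring are assumptions made for illustration, not the exact implementation used later in this thesis.

```python
import numpy as np

def ab_nmf_mu(X, K, alpha, beta, n_iter=200, eps=1e-12, rng=None):
    """Multiplicative updates of eqs. 3.18/3.19 for a nonnegative F x R matrix X."""
    rng = np.random.default_rng(rng)
    F, R = X.shape
    D = rng.random((F, K)) + eps
    A = rng.random((K, R)) + eps
    for _ in range(n_iter):
        Xh = D @ A + eps
        # eq. 3.18: activation update
        A *= ((D.T @ (Xh**(beta - 1) * X**alpha)) /
              (D.T @ Xh**(alpha + beta - 1) + eps))**(1.0 / alpha)
        Xh = D @ A + eps
        # eq. 3.19: dictionary update
        D *= (((Xh**(beta - 1) * X**alpha) @ A.T) /
              (Xh**(alpha + beta - 1) @ A.T + eps))**(1.0 / alpha)
    return D, A
```

Setting alpha = 1 and beta = -1 reproduces the Itakura-Saito NMF used in the multichannel rules below.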
Multiplicative Update (MU) algorithm for multichannel data
[OF10] presented the following MU update rules for the optimization problem derived in chapter 2.5.2, obtained by discarding the mutual information between the channels. Recall the cost function of equation 2.32:
$$C_{ML}(G,D,A) = \sum_{u,f,r} d_{IS}\left(|Y_{ufr}|^2\,\|\,V_{Y,ufr}\right)$$
The corresponding multichannel MU rules are given by
$$A \leftarrow A\,.\,\left(\frac{\sum_u\left(\mathrm{diag}(Q_{u,j_k})D\right)^T\left(\hat X_u^{.[-2]}\,.\,X_u\right)}{\sum_u\left(\mathrm{diag}(Q_{u,j_k})D\right)^T\hat X_u^{.[-1]}}\right) \qquad (3.24)$$
$$D \leftarrow D\,.\,\left(\frac{\sum_u \mathrm{diag}(Q_{u,j_k})\left(\hat X_u^{.[-2]}\,.\,X_u\right)A^T}{\sum_u \mathrm{diag}(Q_{u,j_k})\,\hat X_u^{.[-1]}A^T}\right) \qquad (3.25)$$
$$Q_{u,j} \leftarrow Q_{u,j}\,.\,\left(\frac{\left((D_jA_j)\,.\,\hat X_u^{.[-2]}\,.\,X_u\right)\mathbf{1}_{R\times 1}}{\left((D_jA_j)\,.\,\hat X_u^{.[-1]}\right)\mathbf{1}_{R\times 1}}\right) \qquad (3.26)$$
In chapter 4.4 it is shown that equation 3.24 is a special case of the multichannel MU rules for the α-β-divergence, considering that the Itakura-Saito divergence is obtained for α = 1 and β = −1.
Furthermore, the authors use the following normalization in each iteration to remove the scaling indeterminacy:
• scaling of $D_j$ and $G_j$, such that $\sum_u |G_{S,ujf}|^2 = 1$ and $G_{S,1jf} \in \mathbb{R}_+$
• scaling of $D_j$ and $A_j$, such that $\sum_f D_{fk} = 1$
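A possible realization of equations 3.24–3.26 is sketched below. The array shapes, the einsum-based implementation and the per-source tying of the squared mixing gains Q are our own illustrative choices and do not reproduce the exact implementation of [OF10]; the normalization step above is omitted for brevity.

```python
import numpy as np

def multichannel_is_nmf_mu(X, J, K_per_source, n_iter=200, eps=1e-12, rng=None):
    """Multichannel IS-NMF MU rules (eqs. 3.24-3.26) on channel powers X of shape U x F x R."""
    rng = np.random.default_rng(rng)
    U, F, R = X.shape
    K = J * K_per_source
    src = np.repeat(np.arange(J), K_per_source)          # source index j_k of component k
    M = np.zeros((J, K)); M[src, np.arange(K)] = 1.0     # component-to-source map
    D = rng.random((F, K)) + eps
    A = rng.random((K, R)) + eps
    Q = rng.random((U, J, F)) + eps                      # squared mixing gains per source

    Qk = lambda: np.einsum('ujf,jk->ukf', Q, M)          # gains expanded per component

    def estimate():
        return np.einsum('ukf,fk,kr->ufr', Qk(), D, A) + eps

    for _ in range(n_iter):
        Xh = estimate()
        A *= (np.einsum('ukf,fk,ufr->kr', Qk(), D, Xh**-2 * X) /
              (np.einsum('ukf,fk,ufr->kr', Qk(), D, Xh**-1) + eps))      # eq. 3.24
        Xh = estimate()
        D *= (np.einsum('ukf,ufr,kr->fk', Qk(), Xh**-2 * X, A) /
              (np.einsum('ukf,ufr,kr->fk', Qk(), Xh**-1, A) + eps))      # eq. 3.25
        Xh = estimate()
        S = np.einsum('jk,fk,kr->jfr', M, D, A)           # D_j A_j per source
        Q *= (np.einsum('jfr,ufr->ujf', S, Xh**-2 * X) /
              (np.einsum('jfr,ufr->ujf', S, Xh**-1) + eps))              # eq. 3.26
    return D, A, Q, src
```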
3.3.2 Multichannel EM algorithm for the sum of Gaussian components
The second cost function derived in chapter 2.5 was
$$C_{ML}(G,D,A) = \sum_{f,r}\left[\log\det(\Sigma_{B,f}) + \left(Y_{fr} - G_{C,f}C_{fr}\right)^H\Sigma_{B,f}^{-1}\left(Y_{fr} - G_{C,f}C_{fr}\right) + \sum_k\left(\log(D_{fk}A_{kr}) + \frac{|C_{k,fr}|^2}{D_{fk}A_{kr}}\right)\right]$$
In appendix D.5 it is shown that the parameters can be estimated by the following equations:
$$D_{fk} = \frac{1}{R}\sum_r \frac{|C_{kfr}|^2}{A_{kr}} \triangleq \frac{1}{R}\sum_r \frac{U_{kfr}}{A_{kr}} \qquad (3.27)$$
$$A_{kr} = \frac{1}{F}\sum_f \frac{U_{kfr}}{D_{fk}} \qquad (3.28)$$
$$G_{S,f} = R_{YS,f}R_{SS,f}^{-1} \qquad (3.29)$$
$$\Sigma_{B,f} = \mathrm{diag}\left(R_{YY,f} - G_{S,f}R_{YS,f}^H - R_{YS,f}G_{S,f}^H + G_{S,f}R_{SS,f}G_{S,f}^H\right) \qquad (3.30)$$
It can be seen in equations 3.27 to 3.30 that the set $\{U_{kfr}, R_{YY,f}, R_{YS,f}, R_{SS,f}\ \forall f,r,k\}$ is sufficient to calculate all parameters. In accordance with the authors, they will be called natural (sufficient) statistics. However, as the sources and components cannot be observed, these statistics have to be estimated before the parameters can be estimated.
Rewriting the estimate of the components from equation 2.30 as a vector over all components, with $\Sigma_{C,fr} = \mathrm{diag}\left([V_{C,kfr}]_K\right)$ and $V_{C,kfr} = D_{fk}A_{kr}$,
$$\hat C_{fr} = [\hat C_{k,fr}]_K = \Sigma_{C,fr}G_{j_k,f}^H\Sigma_{Y,fr}^{-1}Y_{fr} \qquad (3.31)$$
and for the sources, using $\Sigma_{S,fr} = \mathrm{diag}\left([V_{S,jfr}]_J\right)$ and $V_{S,jfr} = \sum_{k\in K_j}D_{fk}A_{kr}$,
$$\hat S_{fr} = [\hat S_{j,fr}]_J = \Sigma_{S,fr}G_{j,f}^H\Sigma_{Y,fr}^{-1}Y_{fr} \qquad (3.32)$$
[OF10] proposes the following estimates for the natural statistics in their EM algorithm:
$$\hat R_{YY,f} = \frac{1}{R}\sum_r Y_{fr}Y_{fr}^H \qquad (3.33)$$
$$\hat R_{YS,f} = \frac{1}{R}\sum_r Y_{fr}\hat S_{fr}^H \qquad (3.34)$$
$$\hat R_{SS,f} = \frac{1}{R}\sum_r\left(\hat S_{fr}\hat S_{fr}^H + \Sigma_{S,fr} - \Sigma_{S,fr}G_{j,f}^H\Sigma_{Y,fr}^{-1}G_{j,f}\Sigma_{S,fr}\right) \qquad (3.35)$$
$$\hat u_{kfr} = \left(\hat C_{fr}\hat C_{fr}^H + \Sigma_{C,fr} - \Sigma_{C,fr}G_{j_k,f}^H\Sigma_{Y,fr}^{-1}G_{j_k,f}\Sigma_{C,fr}\right)_{k,k} \qquad (3.36)$$
The EM algorithm then executes the following steps for several iterations:
• E-step: updating the natural statistics $\hat R_{YS,f}$, $\hat R_{SS,f}$, $\hat u_{kfr}$
• M-step: updating the parameters $D_{fk}$, $A_{kr}$, $G_{S,f}$
• Normalization to remove scale and phase ambiguities:
  – rescaling $D_j$ and $G_j$, such that $\sum_u |G_{S,ujf}|^2 = 1$ and $G_{S,1jf} \in \mathbb{R}_+$
  – rescaling $D_j$ and $A_j$, such that $\sum_f D_{fk} = 1$
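The following sketch outlines one possible implementation of this EM loop under the stated model. The variable names, random initialization and the omission of the normalization step are simplifying assumptions, and the nested loops over frequencies and frames favor readability over speed.

```python
import numpy as np

def multichannel_em_nmf(Y, J, K_per_source, n_iter=50, eps=1e-10, rng=None):
    """EM iterations of chapter 3.3.2 (after [OF10]) on complex STFT data Y of shape U x F x R."""
    rng = np.random.default_rng(rng)
    U, F, R = Y.shape
    K = J * K_per_source
    src = np.repeat(np.arange(J), K_per_source)          # source index of each component
    D = rng.random((F, K)) + eps
    A = rng.random((K, R)) + eps
    G = rng.standard_normal((F, U, J)) + 1j * rng.standard_normal((F, U, J))
    Sigma_B = np.ones((F, U))                            # diagonal noise covariance

    for _ in range(n_iter):
        u_stat = np.zeros((K, F, R))
        for f in range(F):
            Gf = G[f]                                    # U x J mixing matrix
            Gc = Gf[:, src]                              # components share their source's gains
            Ryy = np.zeros((U, U), complex)
            Rys = np.zeros((U, J), complex)
            Rss = np.zeros((J, J), complex)
            for r in range(R):
                vC = D[f] * A[:, r] + eps                # component variances
                vS = np.array([vC[src == j].sum() for j in range(J)])
                Sig_Y = (Gf * vS) @ Gf.conj().T + np.diag(Sigma_B[f])
                iSig_Y = np.linalg.inv(Sig_Y)
                # E-step: Wiener estimates (eqs. 3.31/3.32)
                S_hat = vS * (Gf.conj().T @ iSig_Y @ Y[:, f, r])
                C_hat = vC * (Gc.conj().T @ iSig_Y @ Y[:, f, r])
                # natural statistics (eqs. 3.33-3.36)
                Ryy += np.outer(Y[:, f, r], Y[:, f, r].conj()) / R
                Rys += np.outer(Y[:, f, r], S_hat.conj()) / R
                Rss += (np.outer(S_hat, S_hat.conj()) + np.diag(vS)
                        - (vS[:, None] * (Gf.conj().T @ iSig_Y @ Gf)) * vS[None, :]) / R
                post_C = np.diag(vC) - (vC[:, None] * (Gc.conj().T @ iSig_Y @ Gc)) * vC[None, :]
                u_stat[:, f, r] = np.abs(C_hat) ** 2 + np.real(np.diag(post_C))
            # M-step for the spatial parameters (eqs. 3.29/3.30)
            G[f] = Rys @ np.linalg.inv(Rss)
            Sigma_B[f] = np.real(np.diag(Ryy - G[f] @ Rys.conj().T
                                         - Rys @ G[f].conj().T
                                         + G[f] @ Rss @ G[f].conj().T)) + eps
        # M-step for the NMF parameters (eqs. 3.27/3.28)
        D = (u_stat / A[:, None, :]).mean(axis=2).T
        A = (u_stat / D.T[:, :, None]).mean(axis=1)
    return D, A, G, Sigma_B
```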
3.4 Constraints and variants of matrix factorization
Very often matrix factorization is not used in its plain form, as defined at the beginning
of chapter 3. In order to improve the estimation results, to incorporate labeled data
or to take priors into account, constraints are added to the MF cost function in many
publications.
The constraints are usually added to the NMF optimization problem as a penalty term with a Lagrange multiplier λ as
$$\operatorname*{argmin}_{\theta}\ C_{NMF}(\theta) + \lambda f(\theta) \qquad (3.37)$$
The NMF optimization can then either be seen as an equality constraint $f(\theta) = c$ or as an additional minimization task $\min f(\theta)$, where λ controls the trade-off between the cost functions.
Due to the additive implementation, the constraints can usually be rather straightfor-
wardly integrated into gradient based algorithms of the original NMF problem, such
as the MU rules.
The number of proposed constraints has grown so much in recent years that it is impossible to review all of them. There exist, for example, adaptations of NMF to the source-filter model of speech production [OVB12], Bayesian NMF approaches [D+09] [VG+08], or NMF models incorporating priors [OVB12][VJA+10]. At this point the focus is mainly on two types of constraints: constraints that resolve ambiguities in the most common NMF training scenario (see chapter 4.3.4) and constraints that directly incorporate labeled training data. In the following, the most promising algorithms and ideas for the investigated scenario of ego-noise reduction are reviewed.
3.4.1 Sparsity
Sparsity of the activation coefficients or of the dictionary can be used on audio data to enforce a harmonic structure in the dictionary or to account for components that may be active only in a few time frames (for example a drum in music separation).
A straightforward way to force zero elements into the parameters is to penalize the number of nonzero elements. The number of nonzero elements is equivalent to the L0-norm and unfortunately not well suited for optimization. It is hence frequently approximated with the L1 norm. A sparser dictionary can for example be achieved by adding the constraint
$$\min \|D_k\|_1 = \min\sum_f |D_{fk}|\quad\forall k \qquad (3.38)$$
Note that due to the nonnegativity constraint, $|D_{fk}| = D_{fk}$.
3.4. CONSTRAINTS AND VARIANTS OF MATRIX FACTORIZATION 41
[LN13] derives MU rules for the Euclidean NMF including L1 norm sparsity constraints. More information on sparsity in the NMF literature can be found in [Vir07]. The authors also report that the sparsity constraint did not lead to significant improvements in their single channel audio source separation experiments.
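As an illustration, the sketch below adds an L1 penalty on the dictionary to the Euclidean MU updates; the penalty simply enters the denominator of the dictionary update, which is the common heuristic form and not necessarily the exact update derived in [LN13].

```python
import numpy as np

def sparse_euclidean_nmf(X, K, lam=0.1, n_iter=200, eps=1e-12, rng=None):
    """Euclidean NMF with an L1 sparsity penalty on the dictionary,
    i.e. argmin ||X - DA||_F^2 + lam * sum_k ||D_k||_1 (cf. eqs. 3.37/3.38)."""
    rng = np.random.default_rng(rng)
    F, R = X.shape
    D = rng.random((F, K)) + eps
    A = rng.random((K, R)) + eps
    for _ in range(n_iter):
        A *= (D.T @ X) / (D.T @ D @ A + eps)
        D *= (X @ A.T) / (D @ (A @ A.T) + lam + eps)   # lam penalizes nonzero entries of D
    return D, A
```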
3.4.2 Smoothness and temporal continuity
In [LN13] the authors enforce smoothness via an L2 norm on the dictionary and activation matrix,
$$\min \|D_k\|_2^2\quad\forall k \qquad (3.39)$$
$$\text{and/or}\quad \min \|A_k\|_2^2\quad\forall k \qquad (3.40)$$
and derive MU rules for the Euclidean NMF.
In [Vir07], the following constraint is used to enforce continuity of the activation vectors
$$\min\sum_k \frac{\sum_{r=2}^{R}\left(A_{kr} - A_{k(r-1)}\right)^2}{\frac{1}{R}\sum_r A_{kr}^2} \qquad (3.41)$$
and it is solved via MU rules. A similar temporal continuity constraint is used in [EF13].
3.4.3 Local NMF
A lot of NMF constraints have been developed for image processing tasks such as face recognition. One of these is the local nonnegative matrix factorization in [LHZC01].
The authors add the following four constraints to the Kullback-Leibler divergence:
• L1 sparsity constraint on the dictionary: subject to $\sum_f D_{fk} = 1\ \forall k$
• L2 smoothness constraint on the dictionary: $\min\sum_f D_{fk}^2\ \forall k$
• penalizing non-orthogonal codewords: $\min\sum_f D_{fk_1}D_{fk_2}\ \forall k_1\neq k_2$
• maximizing the activity of the components: $\min -\sum_r A_{kr}^2\ \forall k$
3.4.4 Semi-orthogonal NMF
In [CZPA09] the authors propose to apply a constraint on the dictionary or activation matrix by applying one of the following updates in each iteration:
$$D = D\left(D^TD\right)^{-\frac{1}{2}} \qquad (3.42)$$
$$A = \left(AA^T\right)^{-\frac{1}{2}}A \qquad (3.43)$$
3.4.5 Fischer NMF
In [WJ04] the Fischer constraint from the Fischer linear discriminant analysis is added to the NMF formulation:
$$\min\ S_W - S_B \qquad (3.44)$$
with the within-class scatter $S_W = \frac{1}{J}\sum_j\frac{1}{K_j}\sum_{k\in K_j}\|A_k - \bar A_j\|_2^2$ and the between-class scatter $S_B = \frac{1}{J(J-1)}\sum_{j_1}\sum_{j_2}\|\bar A_{j_1} - \bar A_{j_2}\|_2^2$. $J$ denotes the number of classes or, in our case, sources, $K_j$ is the codebook size of the $j$-th source, and the mean over a class is given by $\bar A_j = \frac{1}{K_j}\sum_{k\in K_j}A_k$.
However, there is a severe difference between face recognition and source separation. For face recognition only one face is used as input, while in source separation any number of sources can be active at any time. In other words, Fischer NMF tries to find a codebook where the probability of an observed activation belonging to a certain class is maximal. In source separation the goal is to find codewords where the activation of one component is maximal and the activation of all other components is minimal. A suitable measure could for example be to find the dictionary whose corresponding activation matrices have a maximum ratio of desired to undesired representation.
Nevertheless, Fischer NMF is worth noting as it was one of the first NMF algorithms that explicitly used the class information in the training phase. In appendix C.0.7, a generalization of the Fischer NMF, the nonnegative graph embedding, is reviewed.
3.4.6 Scaling problem with a single constraint
It is worth noting that if the scaling ambiguity of NMF is not considered in the design (e.g. by a suitable extra constraint), the scaling property allows the energy to be shifted to the unconstrained parameter(s). In the multichannel case, the scaling ambiguity between source and mixing coefficients has to be taken into account as well.
For example, if the dictionary is penalized by an L1 or L2 norm, the algorithm can simply scale the energy into the activation matrix. This may also be the reason why both an L1 and an L2 norm constraint are applied in the local NMF algorithm (see chapter 3.4.3).
There are also constraints, such as the temporal continuity constraint in chapter 3.4.2, that are invariant to scaling due to their normalization.
Chapter 4
Ego-noise suppression for the robot
NAO
In the two previous chapters, algorithms and ideas from audio source separation and NMF were reviewed. This chapter intends to connect this information with the scenario of robot ego-noise reduction. It is separated into two parts. First, the noises the robot has to suppress are identified and analyzed. Second, a general structure and suitable algorithms are selected and adapted to tackle the ego-noise reduction.
4.1 Ego-noise analysis of NAO
Before suitable suppression techniques can be designed, the different ego-noise types
of NAO have to be identified and analyzed. In the introduction it was already pointed
out that a robot typically suffers from non-stationary and stationary noises. A picture
of NAO is shown in figure 4.1.
The recordings used in this thesis were made with 4 microphones, which are all located
in the head of the robot. As can be seen in figure 4.2, two are located at the front (positions A and B) and two in the back of the head (positions C and D). The microphones create their own noise with their sensors and amplifiers.
1 https://jv.wikipedia.org/wiki/Nao_(robot)
Figure 4.1: Aldebaran's humanoid robot NAO 1
This microphone self-noise is further on referred to as static noise.
Figure 4.2: Four microphones are located at NAO's head 2
The robot's processor is located in its head, too. It is cooled by a fan located between microphones C and D, marked by position E. The fan creates audio noise as well, further on referred to as fan noise.
Besides the static and fan noise, the robot creates noise through its movement. For the
movement the robot has a lot of degrees of freedom, as it can be seen in figure 4.3. For
this thesis two movements are selected as examples, the noise created by NAO waving
his right arm and the noise of NAO walking on the spot. Furthermore, as it proves to
be a challenging enough task, only one movement noise is tackled at a time.
2 http://doc.aldebaran.com/2-1/family/robots/microphone_robot.html
This leads to the two scenarios where the robot creates either walking noise or waving noise.
In the next sections the four noises described above are analyzed. The main focus is hereby on whether the noises fulfill criteria relevant for their representation by NMF. Other assumptions such as Gaussianity or statistical independence, which were made for the derivation of the estimation by the Wiener filter, are not investigated, because they are not essential for the NMF model.
Figure 4.3: The degrees of freedom for NAO’s movement3
4.1.1 Fan and static noise
As the static noise is always present, fan and static noise are analyzed jointly in this chapter. A short time power spectrogram with a frame size of 1024 samples is shown in figure 4.4. As can be seen, the spectrum changes very little over time. The power in the STFT domain can thus be assumed to be stationary. One result of this assumption is that an average over all frames is a suitable representation for these noises. Figure 4.5 shows the average power of the DFT frequency bins calculated for each microphone. As can be seen in the upper plot, most of the noise is at frequencies below three kHz. The lower plot shows that the spikes created by the fan noise are roughly at the same position for all microphones.
3 https://jv.wikipedia.org/wiki/Nao_(robot)
Figure 4.4: Short time power spectrogram of fan and static noise for all microphones
Their intensity, however, varies between the microphones by up to several tens of dB.
Furthermore, in order to get an idea of what part of the noise in figure 4.5 is created by the static microphone noise, the average power of the static noise alone is compared to it in figure 4.6.
Figure 4.5: Average power over frequencies of fan and static noise for all microphones
As can be seen, the static noise is clearly below the fan noise for all frequencies. Even after removing the fan spikes, the remaining broadband fan noise is significantly higher than the static noise. Nevertheless, if the fan is turned off, the static noise can be heard in speech pauses. Static and fan noise of the NAO have already been analyzed in the frequency domain with a similar frame length in [LBD+14].
Figure 4.6: Energy in fan and static noise that is actually created by static microphone noise
We propose a new approach for the analysis of the fan noise by exploiting its approximate stationarity. By significantly increasing the DFT length (in this example to one second), the average power of the fan in the frequency domain reveals some quite surprising results that were not observable before. The frequency range from 0 to 3 kHz, as shown in the upper plot of figure 4.7, shows far more spikes compared to the result from a 1024-point DFT in figure 4.5. By zooming into a certain frequency band (lower plot of figure 4.7) it can be seen that an individual spike of the fan noise is very narrowband, in most cases only a few Hertz wide. Obviously, this frequency sparsity cannot be exploited by Wiener filtering in the STFT domain if the frequency spacing is larger than the bandwidth of the fan spikes.
Furthermore, we propose a new model for the fan noise by transforming a single spike of a microphone back into the time domain. If this is for example performed for the frequency band from 480 Hz to 500 Hz, the time domain signal is shown in figure 4.8. In the upper plot it is obvious that the fan spike is slightly amplitude modulated. By adding a sinusoid as reference, which matches the spike signal at the beginning and the end of the frame (e.g. by counting zero crossings), it can be shown that the spike is also slightly frequency modulated (lower middle plot).
A suitable model for the fan noise could therefore be
$$X_{\mathrm{fan},t} = \sum_{f\in F_{\mathrm{fan}}} a_{ft}\sin\left(2\pi(f + b_{ft})t\right) \qquad (4.1)$$
Figure 4.7: Average power of fan and static noise for all microphones by a 1 second DFT
where $a_{ft}$ and $b_{ft}$ are modeled as random processes with statistical dependencies over time and possibly also over frequencies. Even a statistical dependency between amplitude and frequency is possible.
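A minimal synthesis sketch of this model is given below. Modeling a_ft and b_ft as low-pass filtered white noise and realizing the frequency modulation by integrating the instantaneous frequency (instead of inserting b_ft directly into the argument of eq. 4.1) are assumptions made here for illustration only; the thesis only states that the processes are statistically dependent over time.

```python
import numpy as np

def synthesize_fan_noise(spike_freqs, fs=16000, duration=1.0, am_depth=0.2,
                         fm_dev=1.0, mod_bw=2.0, rng=None):
    """Synthesize the fan-noise model of eq. 4.1 with slowly varying a_ft and b_ft."""
    rng = np.random.default_rng(rng)
    t = np.arange(int(fs * duration)) / fs
    alpha = np.exp(-2 * np.pi * mod_bw / fs)           # one-pole low-pass coefficient

    def slow_noise():
        w = rng.standard_normal(t.size)
        y = np.empty_like(w)
        acc = 0.0
        for n, wn in enumerate(w):                      # low-pass filtered white noise
            acc = alpha * acc + (1 - alpha) * wn
            y[n] = acc
        return y / (np.abs(y).max() + 1e-12)

    x = np.zeros_like(t)
    for f in spike_freqs:
        a = 1.0 + am_depth * slow_noise()               # amplitude modulation a_ft
        b = fm_dev * slow_noise()                       # frequency modulation b_ft [Hz]
        phase = 2 * np.pi * np.cumsum(f + b) / fs       # integrate instantaneous frequency
        x += a * np.sin(phase)
    return x

# e.g. 21 spike frequencies between 0 and 3 kHz
noise = synthesize_fan_noise(spike_freqs=np.linspace(200, 2800, 21))
```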
Figure 4.8: single fan spike in the time domain with reference sinusoids (bottom plots)
4.1.2 Movement noise
As mentioned earlier, the movement noise can only be observed with additive fan and static noise. For this reason, this analysis and later the algorithms are not performed on the direct recordings, but on prefiltered versions. The applied prefiltering techniques are described in chapter 4.2.
Walking
The walking noise, which NAO creates by walking on the spot, is discussed as the first movement noise. Figure 4.9 displays the short time power, using a frame length of 512 samples. The walking noise clearly shows a structure in the time and frequency domain. In the time domain, each excitation lasts less than a second, and the excitations show a similar frequency characteristic. The microphones in the top row look quite similar, whereas the microphones in the bottom row are clearly different.
Figure 4.9: Short time power spectrogram of walking noise for all microphones
Furthermore, it is of interest how well the power in the time-frequency domain can be represented by a low rank matrix, or in other words, whether the NMF model is reasonable. The NMF residual error using the Euclidean cost function is displayed in figure 4.10 for the four microphones. The error is normalized by the total power of the matrix to be factorized, such that the normalized residual error is 0 dB for a codebook size of zero.
As it can be seen, the residual can be decreased for three of the microphones by more
than 10 dB with less than 10 codewords. For 15 dB reduction, very large codebook
sizes of up to 60 are required.
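A sketch of the evaluation behind these curves is given below, using Euclidean multiplicative updates and normalizing the residual by the total power of the factorized matrix; the exact NMF variant, iteration count and normalization used for the figures may differ, so this is an illustrative reconstruction rather than the thesis code.

```python
import numpy as np

def nmf_residual_curve(P, codebook_sizes, n_iter=200, rng=0):
    """Normalized NMF residual error (in dB) of a power spectrogram P versus codebook size,
    using Euclidean multiplicative updates (eqs. 3.20/3.21 with beta = 1)."""
    eps = 1e-12
    curve = []
    for K in codebook_sizes:
        g = np.random.default_rng(rng)
        D = g.random((P.shape[0], K)) + eps
        A = g.random((K, P.shape[1])) + eps
        for _ in range(n_iter):
            A *= (D.T @ P) / (D.T @ D @ A + eps)
            D *= (P @ A.T) / (D @ (A @ A.T) + eps)
        residual = np.sum((P - D @ A) ** 2) / np.sum(P ** 2)   # 0 dB for K = 0
        curve.append(10 * np.log10(residual + eps))
    return curve
```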
Figure 4.10: Residual error for NMF approximation of walking power in time-frequency domain
Waving
The waving noise also shows a structure in the time and frequency domain, as shown in figure 4.11. Compared to the walking noise, the excitation over time is, at several seconds, rather long. Also the frequencies above 3 kHz are more distorted than for the walking noise. The individual peaks show some harmonic structure, which is not constant but has the highest frequencies in the middle of the excitation.
The representation of the short time power by a low rank matrix is very similar to the walking noise (see figure 4.12). With the first 10 codewords, a residual error below -10 dB can be achieved for three out of the four microphones. Compared to the walking noise, the residual error is roughly 5 dB lower for large codebooks.
4.1.3 Speech
Besides the movement noise, it is also important whether the short time power of the speech can be represented by a low rank matrix.
Figure 4.11: Short time power spectrogram of waving noise for all microphones
Figure 4.12: Residual error for NMF approximation of waving power in time-frequency domain
As shown in figure 4.13, the speech has similarly short excitations as the walking noise and shows quite some harmonic structure.
Figure 4.13: Short time power spectrogram of speech for all microphones
Figure 4.14 shows that in order to represent the speech power with a residual of -10 dB, more than 20 codewords are required. A reduction by 15 dB cannot be achieved for all microphones, regardless of the codebook size. It has to be noted, however, that for large codebooks the local optimality of the NMF algorithm is decisive. Obviously, it is possible to represent 129 frequencies by 129 codewords without error, for example by only picking codewords with one frequency each.
Figure 4.14: Residual error for NMF approximation of speech power in time-frequency domain
The average power of the two movement noises and the speech can be compared, too (see figure 4.15). In the band from 200 Hz to 1 kHz, where the speech has most of its power, the speech-to-interference ratio (SIR) lies between -3 and +13 dB. For lower frequencies the walking SIR becomes very poor, as does the waving SIR between 1 and 2 kHz.
4.2 Reduction of stationary components as preprocessing
In chapter 4.1, two stationary noise sources were identified, the fan noise and the static
noise. Due to the stationarity, there exist efficient algorithms to reduce these noises
in a pre- or postprocessing step. At this point, however, the question arises whether a separate treatment of the stationary noise types is necessary.
Figure 4.15: Average power (over time and microphones) of movement noise and speech over frequency
In chapter 2.5.2 a model was defined that can cope with spatially white noise, such as the static noise. It was also pointed out at the end of chapter 2.5.1 that a stationary Gaussian process can be modeled in the STFT domain by a source with a single component and unity activation.
Nevertheless, both noise types are reduced in a preprocessing step in this work. The main reason is that it is not possible to obtain reference signals for training without fan and static noise (with the exception of a speech training signal where the fan was turned off). One could still record a reference with static and fan noise and train those static noise parameters beforehand. Then one could train the other sources while incorporating the additional static noise on the reference files with the pretrained parameters. It is however questionable whether this additional effort results in sufficient benefit compared to a reduction by preprocessing.
Probably the most common algorithm for such a problem is a multichannel Wiener filter. In [LBD+14] a speech distortion weighted multichannel Wiener filter (see [SMW04]) is proposed for the NAO robot, for both the static and the fan noise. In this thesis, however, the static noise reduction is used in a preprocessing step. In order to avoid losing spatial information for later processing, it is essential to apply the exact same filter to all channels.
4.2.1 Exploiting the high sparsity in the frequency domain to reduce fan spikes
Instead of applying a single channel Wiener filter for static and fan noise, a different
approach is proposed in this thesis. Recall the properties of the fan noise described in
chapter 4.1.1. It is composed of 20-50 frequencies with an average bandwidth of 5-10
Hz. The exact numbers strongly depend on the definition of what is counted as a spike
of the fan noise and what is not.
In general a Wiener filter is very well suited for sparse signals in the frequency domain. However, to fully exploit the fact that the bandwidth of the active components is only a few Hertz, we aim at a frequency spacing of about 1 Hz (and will assume it is 1 Hz further on). Hence a DFT of approximately length $N_{DFT} = \frac{F_s}{1\,\mathrm{Hz}}$ has to be used, where $F_s$ denotes the sample rate. The idea is now to determine and attenuate all frequencies where the fan noise can be assumed to be a lot stronger than the speech. To do so, we propose the following approach:
• First, the mean energy of the fan noise is calculated over all channels and frames (blue line in figure 4.16).
• Second, all frequency bins are extracted which are 20 dB above the local average. We calculate the local average by convolution with a (rectangular) window of size 500 and proper scaling.
• Third, the determined frequencies are reduced to actual spikes: we define as spikes all frequencies which are a local maximum within a bandwidth of 20 Hz to both sides. In order to find the true center of the peak positions, the search can be applied on a smoothed version of the power spectrum (e.g. by convolution with a 3 Hz rectangular window, as depicted by the cyan line in figure 4.16).
• Last, the bandwidth of the fan noise has to be determined for each spike. As bandwidth, for example, the minimum distance to the center frequency can be used at which the energy (we again use the curve smoothed by a 3 Hz rectangular window) drops below a certain threshold. For this report the threshold (green curve in figure 4.16) was set to 7 times the energy of the background (static) noise in this frequency band. We approximate the background noise power by the median of the energy in a 40 Hz band. Due to the median operator, the static noise level estimate is less affected by the fan noise peak. A code sketch of this detection procedure is given below.
Figure 4.16: Fan frequency detection algorithm
Using the above specified parameters, the algorithm detects 21 frequencies for the fan noise with an average bandwidth of 6 Hz. Compared to the total number of $\frac{F_s}{2} = 8000$ frequency bins (at 1 Hz spacing), 1.6% of the frequencies are corrupted by the fan noise spikes.
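The sketch below implements the detection procedure described above on a 1 Hz frequency grid. Parameter names and the exact smoothing/threshold implementation are assumptions chosen to match the quoted values (20 dB threshold, 500 Hz local average, 3 Hz smoothing, 20 Hz peak spacing, factor 7, 40 Hz median), not a verbatim reproduction of the thesis code.

```python
import numpy as np

def detect_fan_spikes(avg_power, rel_thresh_db=20.0, local_win=500,
                      smooth_win=3, peak_dist=20, bw_factor=7.0, noise_win=40):
    """Return (center_frequency, bandwidth) pairs of fan spikes from an average power
    spectrum sampled at 1 Hz spacing (mean over frames and microphones)."""
    p = np.asarray(avg_power, float)
    box = lambda n: np.ones(n) / n
    local_avg = np.convolve(p, box(local_win), mode='same')   # step 2: local average
    smooth = np.convolve(p, box(smooth_win), mode='same')     # 3 Hz smoothed spectrum
    candidates = np.where(p > local_avg * 10 ** (rel_thresh_db / 10))[0]

    spikes = []
    for f in candidates:
        lo, hi = max(0, f - peak_dist), min(p.size, f + peak_dist + 1)
        if smooth[f] < smooth[lo:hi].max():
            continue                                           # step 3: keep only local maxima
        nlo, nhi = max(0, f - noise_win // 2), min(p.size, f + noise_win // 2 + 1)
        thresh = bw_factor * np.median(p[nlo:nhi])             # step 4: static-noise threshold
        bw = 0
        while (f - bw - 1 >= 0 and f + bw + 1 < p.size
               and max(smooth[f - bw - 1], smooth[f + bw + 1]) > thresh):
            bw += 1
        spikes.append((f, 2 * bw + 1))
    return spikes
```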
With this knowledge there exist several possibilities to get rid of the fan noise. The power of the determined frequency bins can be
• set to zero.
• set to a fixed low value (e.g. the threshold described earlier).
• set to an estimated value interpolated from the surrounding frequencies that are not distorted by the fan.
• reduced by the power the frequency bin had in the training phase (spectral subtraction).
In our experiments the results were rather similar for all methods described above, so further on the simplest version, where all fan frequency bins are set to zero, is applied.
The approach via the long DFT is not very suitable for real-time implementation, as either a delay of up to one second is introduced or many overlapping frames have to be used, such that the computational complexity increases rapidly.
The computational complexity can be reduced by taking into account that the fan
noise only significantly contributes to approximately 21 · 6 out of the NDFT = 16000
frequency bins. Thus only these frequencies have to be calculated. Then, a suitable
manipulation can be performed (or nothing is done if the frequency bins shall be set to
zero). Afterwards the fan frequency bins are transformed back into the time domain
and subtracted from the original signal.
The computational complexity or delay could, for example, further be decreased by
developing a scheme for recursive calculation of the desired DFT bins. Such an algorithm would then be closely related to frequency tracking algorithms (see for example [VZ12]).
If the frequency bins are set to zero, this approach can equivalently be seen as applying a
filter with 21 stopbands of varying bandwidth. An alternative approach could therefore
also be to design suitable (IIR) filters for the given scenario.
As the focus of this work is on evaluating if NMF is well suited for (nonstationary)
ego-noise reduction, the straightforward implementation via a long DFT was used in
this thesis.
4.2.2 Static noise reduction by single channel Wiener filter
The static noise and the residual of the fan noise are afterwards suppressed by a Wiener filter with pretrained noise power $V_{N_{\mathrm{static}},fr}$:
$$H_{WF,fr} = \max\left[0,\ \left(1 - \frac{V_{N_{\mathrm{static}},fr}}{\frac{1}{U}\sum_u |Y_{ufr}|^2}\right)\right] \qquad (4.2)$$
The estimate of the signal without static noise is then given by $\hat Y_{ufr} = H_{WF,fr}Y_{ufr}$.
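A minimal sketch of this preprocessing step, applying the identical gain of equation 4.2 to all channels so that the spatial information is preserved, could look as follows; the function name and array layout are illustrative.

```python
import numpy as np

def static_noise_wiener(Y, V_static, eps=1e-12):
    """Apply the single-channel Wiener gain of eq. 4.2 identically to all channels.

    Y        : U x F x R complex STFT of the microphone signals.
    V_static : F x R pretrained power of the static (plus residual fan) noise.
    """
    mean_power = np.mean(np.abs(Y) ** 2, axis=0)               # (1/U) sum_u |Y_ufr|^2
    H = np.maximum(0.0, 1.0 - V_static / (mean_power + eps))   # eq. 4.2
    return H[None, :, :] * Y                                   # same gain on every channel
```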
4.3 Reduction of non-stationary ego-noise
After removing the stationary parts, the non-stationary suppression technique can be
trained with less interference from the stationary components.
In the literature review, two multichannel NMF algorithms were highlighted. In chapter 3.3.1, MU rules for multiple mixtures with the Itakura-Saito divergence were presented; we will generalize this algorithm in chapter 4.4. The second multichannel algorithm reviewed was EM based. As discussed in chapter 4.5, its computational complexity reduces its usefulness for the given problem. In the same chapter we will thus propose a computationally inexpensive approach with a multichannel prefilter.
Yet one important part of any NMF method has not been discussed so far: the pretraining of NMF methods. Especially in the first years of NMF, this topic received little attention. In the following chapter the most common NMF training approach is reviewed and an interpretation is given of why it is not well suited for less sparse signals.
4.3.1 Informed source separation by pretraining of the dictio-
nary
In chapter 2.6 it was shown that blindness leads to a permutation problem and that model based source separation approaches can also suffer from permutation. As presented in appendix C.1, there is no established term in the literature for source separation using reference signals. In this work, reference based source separation algorithms will be termed informed or guided.
The first approach to incorporate training data in the NMF algorithms was to initialize
(and fix) the dictionary of each source by a clean recording similar to that source (see
e.g. [Sma07],[OF10],[SFM+14]). Training observations are further on indicated by an
additional index TR, for example the j-th source of the u-th mixture as YTR,ujfr in the
time frequency domain.
Training can then be performed by the same algorithm as the source separation later. Solving the NMF problem with observations $Y_1$ is written as $\operatorname{argmin} C_{NMF}(Y_1)$, such that the training is represented as
$$\{G_{TR,j}, D_{TR,j}, A_{TR,j}\} = \operatorname*{argmin}_{G,D,A}\ C_{NMF}(Y_{TR,j})\quad\forall j \qquad (4.3)$$
Afterwards a unified dictionary is created by concatenating the dictionaries of all sources for the test scenario:
$$D_{TE} = [D_{TR,1},\ldots,D_{TR,j},\ldots,D_{TR,J}] \qquad (4.4)$$
Usually the dictionary is then fixed for the testing phase, such that the testing optimization simplifies to
$$\{G_{TE}, A_{TE}\} = \operatorname*{argmin}_{G,A}\ C_{NMF}(Y_{TE}) \qquad (4.5)$$
If the NMF algorithm is for example applied to the time-frequency domain variance, the variance of the $j$-th source is estimated by extracting the activation vectors corresponding to the source's components out of the complete activation matrix. The estimate is then calculated as $\hat X_{TE,j} = D_{TR,j}A_{TE,j}$.
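The following sketch illustrates this train/concatenate/fix workflow of equations 4.3–4.5 for a single channel. Euclidean NMF is used here purely for brevity, and all function and variable names are illustrative rather than taken from the thesis.

```python
import numpy as np

def euclidean_mu(X, D, A, n_iter=200, update_D=True, eps=1e-12):
    """Euclidean MU updates (eqs. 3.20/3.21 with beta = 1); D can be kept fixed."""
    for _ in range(n_iter):
        A *= (D.T @ X) / (D.T @ D @ A + eps)
        if update_D:
            D *= (X @ A.T) / (D @ (A @ A.T) + eps)
    return D, A

def informed_separation(Y_train, Y_test, K, rng=0, eps=1e-12):
    """Train one dictionary per clean source (eq. 4.3), concatenate them (eq. 4.4),
    then fit only the activations on the test mixture (eq. 4.5)."""
    g = np.random.default_rng(rng)
    dicts = []
    for Yj in Y_train:                                      # eq. 4.3: one NMF per source
        F, R = Yj.shape
        Dj, _ = euclidean_mu(Yj, g.random((F, K)) + eps, g.random((K, R)) + eps)
        dicts.append(Dj)
    D_te = np.concatenate(dicts, axis=1)                    # eq. 4.4: unified dictionary
    F, R = Y_test.shape
    _, A_te = euclidean_mu(Y_test, D_te, g.random((D_te.shape[1], R)) + eps,
                           update_D=False)                  # eq. 4.5: dictionary fixed
    # per-source estimates: X_hat_j = D_TR,j A_TE,j
    return [dicts[j] @ A_te[j * K:(j + 1) * K] for j in range(len(dicts))]
```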
However, other combinations can be interesting as well for audio source separation. If
the mixing matrix is approximately identical between training and testing, it can be
initialized by the training data and fixed for testing.
If there exists a reference with a similar time structure (in [SLVB15] music source
separation is discussed, where cover songs are used as references) it may be beneficial
to pretrain the activation matrices.
Of course it is also possible to initialize a parameter by its reference, but still update
it in the test optimization. This can be interpreted as exploiting the local optimality
of the NMF algorithms. If this initialization is for example applied for the dictionary,
one could describe this approach as finding a similar dictionary for that source, which
is more suitable for the new data. Due to the local optimality it is likely, that the
codewords in the j-th source dictionary still represent the j-th source and not any
other source.
4.3.2 Approximate additivity assumption
The training approach described in the last chapter, however, has a major drawback. In accordance with [SFM+14] we will term it the approximate additivity assumption. Consider a two-source single-channel scenario, where dictionary and activation matrix are trained such that
$$Y_{TR,j} = D_{TR,j}A_{TR,j} \qquad (4.6)$$
For simplicity the assumption is made that the training observations have the same rank as dictionary and activation matrix (noiseless case) and that the global optimum is found, such that $Y_{TR,j} = \hat Y_{TR,j}$.
If the dictionary is trained and fixed, the activation matrix can be expressed by using an activation function $f_A$ as $A_{TR,j} = f_A(Y_{TR,j}, D_{TR,j})$. The function $f_A$ depends on the NMF algorithm (e.g. the used divergence) as well as the initialization. In general, the output of the activation function has some randomness due to the random initialization. In the training, a first criterion can thus be formulated as
$$Y_{TR,j} = \hat Y_{TR,j} = D_{TR,j}\,f_A(Y_{TR,j}, D_{TR,j}) \qquad (4.7)$$
In the described training scenario it is essential that the activation matrices can also be extracted from the observations of the mixture, given the trained dictionaries. This differs from the training step, as the observations depend on both noise and speech signals. In mathematical terms we express this dependency by the mixing function $f_M$ as $Y_{TR} = f_M(Y_{N_{TR}}, Y_{X_{TR}})$. If an algorithm cannot obtain the activation matrices from mixed training observations, it is unlikely that the algorithm can do so for the actual test observations.
If the observations represented the variance and speech and noise were statistically independent, the speech and noise variances would add up. Yet in our model the observations represent the instantaneous power; the observations thus do not simply add up (they only do so on average). Even worse, the mixing function is not deterministic regarding the speech and the noise. Take, for example, the magnitude squared of the sum of two complex values. The result obviously depends on the amplitudes and the phase difference of the two values. Yet as only the magnitude squared is observable, there exists an (infinite) number of complex speech and noise samples that lead to the same observation.
Regardless of the mixing function, the new estimate can be expressed as
$$\hat Y_{TR} = D_{TR,1}\hat A_{TR,1} + D_{TR,2}\hat A_{TR,2} \qquad (4.8)$$
where $\hat A_{TR,j}$ denotes the activations obtained from the mixture. (Perfect) separation/reconstruction is achieved if $\hat A_{TR,1} \overset{!}{=} A_{TR,1}$ and $\hat A_{TR,2} \overset{!}{=} A_{TR,2}$.
At this point [SFM+14] claims that the activation matrices are approximately equivalent if the dictionaries are sufficiently different and the elements of the mixture are therefore modeled by only one dictionary. In the following we try to give a mathematical background for this statement. The perfect reconstruction criterion for source 1 can be rewritten as
$$f_A(Y_{TR}, D_{TR,1}) = \hat A_{TR,1} \overset{!}{=} A_{TR,1} = f_A(Y_{TR,1}, D_{TR,1}) \qquad (4.9)$$
If the following four assumptions hold, perfect reconstruction can be achieved:
• The initialization is perfect: the activation matrices are initialized in such a way that $\hat Y_{TR} = Y_{TR}$.
• Observations are additive: $f_M(Y_{N_{TR}}, Y_{X_{TR}}) = Y_{N_{TR}} + Y_{X_{TR}}$.
• The activation function is additive with respect to the observations: $f_A(Y_{TR,1} + Y_{TR,2}, D_{TR,1}) = f_A(Y_{TR,1}, D_{TR,1}) + f_A(Y_{TR,2}, D_{TR,1})$.
• The sources are separable: $f_A(Y_{TR,2}, D_{TR,1}) \overset{!}{=} 0$.
For example, take the very simple linear activation function $f_A(Y_{TR}, D_{TR,1}) = Y_{TR}^T D_{TR,1}$. Obviously the function is additive regarding the observations, and the assumption that the first codebook is unable to represent the observations of the second source simplifies to an uncorrelatedness constraint: $Y_{TR,2}^T D_{TR,1} \overset{!}{=} 0$. Each codeword has to be orthogonal to all observation vectors over time. As both codewords and observations are nonnegative, this is only given if for each frequency either all codewords of source 1 are zero or all observation frames of source 2 are zero.
On the other hand, perfect reconstruction is not necessarily impossible if one of the four above assumptions is not fulfilled, but the impact of the second source on the first source's estimate is then unknown in general.
Recently, some new approaches to reference based source separation have been proposed that can take the additivity requirement into account. A brief introduction to these approaches is given in chapter 6.
Further on, it is investigated, whether it is important to train and use dictionaries
for all sources. In the context of robot ego-noise reduction, the sources reduce to one
source for the desired speech signal and one source for the undesired ego-noise. In
the same context of robot ego-noise reduction, [DK14] only uses a dictionary for the
noise. This approach is also possible for the NMF algorithms described in this work, as
the power of the untrained source is approximately the observed power of the mixture
minus the estimated power of the trained source.
In the next section we try to give insights into using NMF with only a noise codebook. Afterwards the approach is compared to the same scenario with a speech and a noise codebook. This also gives some more insight into the approximate additivity requirement discussed in this chapter, by investigating its impact on an actual algorithm.
4.3.3 Euclidean MF with a single dictionary
For simplicity, the case of Euclidean MF is considered. The nonnegativity constraint is neglected for simpler mathematical derivations. The training noise observations are denoted as $Y_{N_{TR}}$, such that the training optimization problem is given by
$$[A_{N_{TR}}, D_N] = \operatorname*{argmin}_{A,D}\ C_{EUC}(Y_{N_{TR}}, DA) = \operatorname*{argmin}_{A,D}\ \|Y_{N_{TR}} - DA\|_F^2 \qquad (4.10)$$
As in NMF, the dictionary and activation matrix are updated iteratively, and the algorithm converges at some point. By setting the gradient to zero, the following equations hold for the (locally) optimal parameters:
$$D_N = Y_{N_{TR}}A_{N_{TR}}^T\left(A_{N_{TR}}A_{N_{TR}}^T\right)^{-1} \qquad (4.11)$$
$$A_{N_{TR}} = f_A(Y_{N_{TR}}, D_N) = \left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \qquad (4.12)$$
For a given dictionary we can thus form the estimate resulting from Euclidean matrix factorization as
$$\hat Y_{N_{TR}} = D_NA_{N_{TR}} = D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \qquad (4.13)$$
In equation 4.13, two observations can be made for the training estimator. The estimator is
• linear.
• of rank K.
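Both properties can be checked numerically: with the nonnegativity constraint dropped, alternating the closed-form updates 4.11 and 4.12 yields the orthogonal projection of equation 4.13, which is linear and of rank K. The following small sketch with arbitrary random data illustrates this; it is a didactic check, not part of the thesis experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
F, R, K = 8, 50, 3
Y_ntr = rng.random((F, R))                       # noise training observations

A = rng.random((K, R))
for _ in range(100):                             # alternating least squares (eqs. 4.11/4.12)
    D_n = Y_ntr @ A.T @ np.linalg.inv(A @ A.T)
    A = np.linalg.inv(D_n.T @ D_n) @ D_n.T @ Y_ntr

P = D_n @ np.linalg.inv(D_n.T @ D_n) @ D_n.T     # projection matrix of eq. 4.13
Y_hat = P @ Y_ntr
print(np.linalg.matrix_rank(P))                  # K  -> the estimator has rank K
print(np.allclose(P @ P, P))                     # idempotent, i.e. a linear projection
```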
The perfect training criterion from equation 4.7 is now given as
$$Y_{N_{TR}} \overset{!}{=} D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \qquad (4.14)$$
or
$$\left(I_{F\times F} - D_N\left(D_N^TD_N\right)^{-1}D_N^T\right)Y_{N_{TR}} \overset{!}{=} 0 \qquad (4.15)$$
and the perfect reconstruction criterion of equation 4.9 leads to
$$\left(D_N^TD_N\right)^{-1}D_N^TY_{TR} \overset{!}{=} \left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \qquad (4.16)$$
by inserting the activation function. Using the mixing function $Y_{TR} = f_M(Y_{X_{TR}}, Y_{N_{TR}})$ it rewrites to
$$\left(D_N^TD_N\right)^{-1}D_N^T\left(f_M(Y_{X_{TR}}, Y_{N_{TR}}) - Y_{N_{TR}}\right) \overset{!}{=} 0 \qquad (4.17)$$
If the mixing is linear, perfect reconstruction simplifies to
$$\left(D_N^TD_N\right)^{-1}D_N^TY_{X_{TR}} \overset{!}{=} 0 \qquad (4.18)$$
Equation 4.18 can be interpreted as follows: the best estimate is achieved if the noise dictionary is not able to represent the speech observations.
However, at no point in the training process was the impact of the speech on the training dictionary considered. This is equivalent to assuming that the corrupting signal (here the speech) is white, as white noise can be represented equally well by any codebook.
Obviously, an estimator trained under this assumption will lose performance when corrupted by a signal more similar to the desired observations, and will gain performance when corrupted by a signal less similar to the desired observations.
In the next chapter, it is shown, that the problem identified above does not vanish by
simply training a codebook for speech and noise.
4.3.4 Euclidean MF with two dictionaries
A codebook for speech and noise can simply be used by modifying equation 4.10 into
$$[A_{N_{TR}}, A_{X_{TR}}] = \operatorname*{argmin}_{A_N,A_X}\ \|Y_{TR} - D_NA_N - D_XA_X\|_F^2 \qquad (4.19)$$
Similarly to equations 4.15 and 4.17, good estimates are achieved if the following equations are fulfilled:
$$\left(I_{F\times F} - D_N\left(D_N^TD_N\right)^{-1}D_N^T\right)Y_{N_{TR}} \overset{!}{=} 0 \qquad (4.20)$$
$$\left(I_{F\times F} - D_X\left(D_X^TD_X\right)^{-1}D_X^T\right)Y_{X_{TR}} \overset{!}{=} 0 \qquad (4.21)$$
$$\left(D_N^TD_N\right)^{-1}D_N^TY_{X_{TE}} \overset{!}{=} 0 \qquad (4.22)$$
$$\left(D_X^TD_X\right)^{-1}D_X^TY_{N_{TE}} \overset{!}{=} 0 \qquad (4.23)$$
The first two equations ensure that the codebooks are able to represent the data; they can be fulfilled by perfect initialization and a suitable codebook size.
The last two equations make sure that the noise dictionary is unable to model the speech observations and, in addition, that the speech dictionary is unable to model the noise observations. But again, none of this was considered in the training of the dictionaries in previous NMF approaches.
This is again the property which was pointed out in [SFM+14] by stating that separate pretraining of the dictionaries is applicable if the dictionaries are sufficiently different, so that the elements of the mixture are modeled by only one dictionary. Yet the authors did not discuss the consequences if the assumptions above are not fulfilled. Thus we will do so in the following:
For simplicity, a scenario is assumed in which the speech and the noise dictionary both contain an identical codeword. Both codewords can thus represent the exact same data in the frequency domain. Regarding the cost function it does, however, not matter which of the codewords is used; the cost stays the same. The algorithm therefore arbitrarily distributes the optimum activation onto the two codewords.
If equation 4.20 is fulfilled, the estimate of the noise parameters obviously does not benefit from using the speech codebook, and vice versa. If the equation is not fulfilled, the algorithm arbitrarily distributes the joint parts of the spectrum onto the speech and noise dictionaries. This arbitrary distribution is an undesired property.
In comparison with the single dictionary, we can thus conclude that if two sources are easily separable, it is not essential whether one or both sources are trained. For two similar sources, neither of the training strategies is suitable.
Solving the arbitrary behavior by constraints
One interesting approach to resolve this arbitrary behavior is the use of constraints. Many works have shown that sparsity or temporal continuity constraints improve the separation results. Sparsity can for example be interpreted as follows: energy that can be arbitrarily shifted between two codewords is now shifted (arbitrarily) to one of the codewords in order to get a sparse representation. If the codewords are not identical but can, for example, both model a certain frequency, a sparsity constraint would model this frequency by codewords that are nonzero anyway because they also model a part the other codeword cannot model.
For example, consider three codewords. One codeword only models frequency f1, the second models frequency f2, and both belong to the same source. The third codeword can model both frequencies with the same amplitude and belongs to another source. Assume now that the two frequencies are observed with the same power. Without sparsity, the observations can be modeled by the equal combination of the first two codewords, by the third codeword, or by any combination of them. But if nonzero activations are penalized, the observations will be modeled by the third codeword only.
In a similar manner, a temporal continuity constraint resolves the ambiguity by choosing the ratio of the two activation coefficients that is best suited in relation to the neighboring frames. By constraining the activation vectors with an L2 norm or higher, one could make sure that as many codewords as possible are used.
At that point it is interesting to note that in [SFM+14], the authors apply source
separation by NMF with identical codebooks by exploiting the temporal statistics of
the sources.
In this report we try to solve the problem described above by investigating two different approaches: one for training only a noise dictionary, at the end of this chapter, and one for a speech and a noise dictionary, as described in the next section.
Penalizing similar codewords
An orthogonality constraint can be applied to the speech and noise dictionaries in accordance with chapter 3.4.3. In order to do so, the training of the speech and noise dictionaries has to be performed jointly, for example by the following equation:
$$[D_N, A_{N_{TR}}, D_X, A_{X_{TR}}] = \operatorname*{argmin}_{D_N,A_N,D_X,A_X}\ \|Y_{N_{TR}} - D_NA_N\|_F^2 + \|Y_{X_{TR}} - D_XA_X\|_F^2 + \lambda\sum_{k_N,k_X} D_{N,k_N}^T D_{X,k_X} \qquad (4.24)$$
The orthogonality penalty in equation 4.24 is problematic in combination with the nonnegativity constraint of the dictionary: it actually penalizes all dictionary pairs which have some nonzero frequencies in common, which is obviously not desired. Nevertheless, one can argue that the constraint penalizes more strongly if two codewords are similar.
One could also think about different penalty functions for similar codewords. For example, one could penalize identical codewords by
$$[D_N, A_{N_{TR}}, D_X, A_{X_{TR}}] = \operatorname*{argmin}_{D_N,A_N,D_X,A_X}\ \|Y_{N_{TR}} - D_NA_N\|_F^2 + \|Y_{X_{TR}} - D_XA_X\|_F^2 + \lambda\sum_{k_N,k_X} \frac{1}{\|D_{N,k_N} - D_{X,k_X}\|_2^2} \qquad (4.25)$$
The usefulness of such penalty functions on similar codewords still has to be investigated.
The approach of jointly training all dictionaries is related to recent publications, which are briefly reviewed in chapter 6.
4.3.5 Dictionary training based on desired and undesired data
Recall the estimate after training a single dictionary given by equation 4.13 with the Euclidean distance:
$$\hat Y_{N_{TR}} = D_NA_{N_{TR}} = D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}}$$
The novel approach is now to train a dictionary which exactly matches the above estimator. This can be achieved by optimizing the following cost function:
$$[D_N] = \operatorname*{argmin}_{D_N}\ \|Y_{N_{TR}} - D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{TR}\|_F^2 \qquad (4.26)$$
If a good solution can be found, the trained dictionary has the following properties:
$$D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{N_{TR}} \approx Y_{N_{TR}} \qquad (4.27)$$
$$D_N\left(D_N^TD_N\right)^{-1}D_N^TY_{X_{TR}} \approx 0 \qquad (4.28)$$
Interestingly, equation 4.28 coincides (up to a multiplication with $D_N$) with the earlier defined criterion for a good estimator in equation 4.18, and equation 4.27 coincides with the criterion for perfect training of a dictionary in equation 4.15.
The algorithm selects codewords by maximizing the squared difference between the representation of the part to be estimated (equation 4.27) and the representation of the part to be suppressed (equation 4.28). Recall that in chapter 4.3.3 the noise codewords were simply selected by their ability to represent the part to be estimated.
The proposed algorithm will hence not choose a codeword that is able to represent the undesired part better than the desired part. It also favors a codeword that represents a medium amount of the part to be estimated and a small amount of the part to be suppressed over a codeword that represents both the part to be estimated and the part to be suppressed well.
The original algorithm simply selects the codewords according to how well they represent the desired components. It can thus be interpreted as assuming that the undesired part can be equally well represented by all possible codewords, which is equivalent to assuming distortion by white noise.
Solving the optimization problem through relaxation
Still the problem of optimizing the dictionary in equation 4.26 is not addressed. The gradient \frac{\partial}{\partial D_N} \|Y_{N_{TR}} - D_N (D_N^T D_N)^{-1} D_N^T Y_{TR}\|_F^2 is difficult to calculate analytically. For that reason the optimization problem is relaxed by introducing the matrix B = (D_N^T D_N)^{-1} D_N^T. The noise observations are now estimated as \hat{Y}_{N_{TR}} = D_N B Y_{TR}. The relaxed optimization problem thus writes as
[D_N, B] = \operatorname{argmin}_{D_N, B} \|Y_{N_{TR}} - D_N B Y_{TR}\|_F^2   (4.29)
The matrices can then iteratively be updated by
\frac{\partial}{\partial D_N} \|Y_{N_{TR}} - D_N B Y_{TR}\|_F^2 = (Y_{N_{TR}} - D_N B Y_{TR})(B Y_{TR})^T \overset{!}{=} 0   (4.30)
D_N = Y_{N_{TR}} Y_{TR}^T B^T (B Y_{TR} Y_{TR}^T B^T)^{-1}   (4.31)
and
\frac{\partial}{\partial B} \|Y_{N_{TR}} - D_N B Y_{TR}\|_F^2 = D_N^T (Y_{N_{TR}} - D_N B Y_{TR}) Y_{TR}^T \overset{!}{=} 0   (4.32)
B = (D_N^T D_N)^{-1} D_N^T Y_{N_{TR}} Y_{TR}^T (Y_{TR} Y_{TR}^T)^{-1}   (4.33)
The above algorithm does not enforce a nonnegativity constraint on the matrices. At this point we no longer see a necessity to do so. Nevertheless, it is likely that multiplicative update rules can be derived to optimize equation 4.29 with a nonnegativity constraint.
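As an illustration, the alternating updates of equations 4.31 and 4.33 can be implemented as in the following minimal NumPy sketch. The function name, the random initialization and the small regularization terms are assumptions added for this sketch and are not part of the derivation.

```python
import numpy as np

def train_predictive_dictionary(Y_n, Y, K, n_iter=50, eps=1e-8, seed=0):
    """Alternating closed-form updates (4.31) and (4.33) for the relaxed problem (4.29).
    Y_n: noise training spectrogram (F x R), Y: training observations used as
    estimator input (F x R), K: number of codewords."""
    F, R = Y.shape
    rng = np.random.default_rng(seed)
    D = rng.random((F, K))          # dictionary D_N
    B = rng.random((K, F))          # relaxed matrix B, replacing (D^T D)^{-1} D^T
    for _ in range(n_iter):
        BY = B @ Y
        # eq. (4.31): D_N = Y_N Y^T B^T (B Y Y^T B^T)^{-1}
        D = Y_n @ BY.T @ np.linalg.inv(BY @ BY.T + eps * np.eye(K))
        # eq. (4.33): B = (D^T D)^{-1} D^T Y_N Y^T (Y Y^T)^{-1}
        B = (np.linalg.inv(D.T @ D + eps * np.eye(K)) @ D.T
             @ Y_n @ Y.T @ np.linalg.inv(Y @ Y.T + eps * np.eye(F)))
    return D, B

# The noise power of new observations Y_test is then predicted as D @ B @ Y_test.
# In the special case K = F (equation 4.34) the estimator collapses to the single
# linear map U = Y_n @ np.linalg.pinv(Y).
```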
The proposed method also has some drawbacks. First, it is based on the Euclidean distance and thus enforces a linear low-rank estimator. Second, it is based on the assumption that the speech and noise observations are combined by an additive mixing function f_M.
Finally, the special case where K = F is investigated. As D_N B \overset{\Delta}{=} U is now of full rank, the optimum estimator can straightaway be calculated via the pseudo-inverse as
U = Y_{N_{TR}} Y_{TR}^T (Y_{TR} Y_{TR}^T)^{-1} = Y_{N_{TR}} Y_{TR}^{+}   (4.34)
4.4 Multiplicative Update (MU) algorithm for multichannel data
In chapter 3.3.1, MU rules for the factorization of multichannel data by NMF with the Itakura-Saito divergence were reviewed. In this chapter a generalization of this algorithm for the α-β divergence is derived. The cost function is given by
C_{ML}(G, D, A) = \sum_u D_{\alpha,\beta}(X_u \| \hat{X}_u)   (4.35)
with the mixing model for the observations
\hat{X}_{u,fr} = \sum_j Q_{u,j,f} \sum_{k \in K_j} D_{fk} A_{kr}   (4.36)
The gradient for the new model is given by
\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \theta} = -\sum_{ufr} \hat{X}_{u,fr}^{\alpha+\beta-1} \frac{\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)^{\alpha} - 1}{\alpha} \frac{\partial \hat{X}_{u,fr}}{\partial \theta}   (4.37)
The MU rules for the new model can, similar to equation D.33, be derived as
A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\left(\frac{1}{A_{kr}^{\alpha}} \eta_{kr} \sum_{uf} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right) \frac{\partial \hat{X}_{u,fr}}{\partial \ln_{1-\alpha}(A_{kr})}\right)   (4.38)
Again the gradients of the transformed parameters have to be calculated for the new mixing model as
\frac{\partial \hat{X}_{u,fr}}{\partial \ln_{1-\alpha}(A_{kr})} = \frac{\partial}{\partial \ln_{1-\alpha}(A_{kr})} \sum_j Q_{u,j,f} \sum_{k \in K_j} D_{fk} \exp_{1-\alpha}(\ln_{1-\alpha}(A_{kr})) = Q_{u,j_k,f} D_{fk} A_{kr}^{1-\alpha}   (4.39)
where j_k denotes the source to which codeword k belongs. Inserted into equation 4.38, this rewrites as follows:
A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\left(\eta_{kr} \sum_{uf} Q_{u,j_k,f} D_{fk} A_{kr}^{1-2\alpha} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)\right)   (4.40)
Again, defining suitable step-sizes
\eta_{kr} = \frac{A_{kr}^{2\alpha-1}}{\sum_{uf} Q_{u,j_k,f} D_{fk} \hat{X}_{u,fr}^{\alpha+\beta-1}}   (4.41)
the MU rule writes as
A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\left(\frac{\sum_{uf} Q_{u,j_k,f} D_{fk} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)}{\sum_{uf} Q_{u,j_k,f} D_{fk} \hat{X}_{u,fr}^{\alpha+\beta-1}}\right)   (4.42)
For α ≠ 0 the MU rules can be rewritten as
A_j \leftarrow A_j \, . \left(\frac{\sum_u (\operatorname{diag}(Q_{u,j}) D_j)^T (\hat{X}_u^{\beta-1} . X_u^{\alpha})}{\sum_u (\operatorname{diag}(Q_{u,j}) D_j)^T \hat{X}_u^{\alpha+\beta-1}}\right)^{\frac{1}{\alpha}}   (4.43)
By using the step-size \eta_{fk} = \frac{D_{fk}^{2\alpha-1}}{\sum_{ur} Q_{u,j_k,f} A_{kr} \hat{X}_{u,fr}^{\alpha+\beta-1}}, the update rule for the dictionary can be calculated in a similar manner and writes as
D_j \leftarrow D_j \, . \left(\frac{\sum_u \operatorname{diag}(Q_{u,j}) (\hat{X}_u^{\beta-1} . X_u^{\alpha}) A_j^T}{\sum_u \operatorname{diag}(Q_{u,j}) \hat{X}_u^{\alpha+\beta-1} A_j^T}\right)^{\frac{1}{\alpha}}   (4.44)
In the multichannel case a third update rule has to be derived for Q_{u,j,f}. Using
\frac{\partial \hat{X}_{u,fr}}{\partial \ln_{1-\alpha}(Q_{u,j,f})} = Q_{u,j,f}^{1-\alpha} \sum_{k \in K_j} D_{fk} A_{kr} = Q_{u,j,f}^{1-\alpha} D_{j,f} A_{j,r}   (4.45)
the update rule is
Q_{u,j,f} \leftarrow Q_{u,j,f} \exp_{1-\alpha}\left(\eta_{u,j,f} Q_{u,j,f}^{1-2\alpha} \sum_r D_{j,f} A_{j,r} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)\right)   (4.46)
Using the step-size \eta_{u,j,f} = \frac{1}{Q_{u,j,f}^{1-2\alpha} \sum_r D_{j,f} A_{j,r} \hat{X}_{u,fr}^{\alpha+\beta-1}}, the update rule writes as
Q_{u,j,f} \leftarrow Q_{u,j,f} \exp_{1-\alpha}\left(\frac{\sum_r D_{j,f} A_{j,r} \hat{X}_{u,fr}^{\alpha+\beta-1} \ln_{1-\alpha}\left(\frac{X_{u,fr}}{\hat{X}_{u,fr}}\right)}{\sum_r D_{j,f} A_{j,r} \hat{X}_{u,fr}^{\alpha+\beta-1}}\right)   (4.47)
For α ≠ 0 the update rule simplifies to
Q_{u,j} \leftarrow Q_{u,j} \, . \left(\frac{((D_j A_j) . \hat{X}_u^{\beta-1} . X_u^{\alpha}) \mathbf{1}_{R \times 1}}{((D_j A_j) . \hat{X}_u^{\alpha+\beta-1}) \mathbf{1}_{R \times 1}}\right)^{\frac{1}{\alpha}}   (4.48)
By inserting α = 1 and β = −1, such that the α-β-divergence becomes the Itakura-
Saito divergence, the update rules are equivalent to the version from [OF10], which
were reviewed in chapter 3.3.1.
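To make the structure of the update rules 4.43, 4.44 and 4.48 concrete, the following is a minimal NumPy sketch for α ≠ 0. It is an illustrative implementation of these equations written for this chapter's notation, not the code used for the experiments; the random initialization and the small eps terms are assumptions added for numerical stability.

```python
import numpy as np

def ab_mu_multichannel(X, Ks, alpha=1.0, beta=-1.0, n_iter=100, eps=1e-12, seed=0):
    # X: nonnegative power spectrograms, shape (U, F, R) = (channels, freqs, frames)
    # Ks: number of codewords per source, e.g. [K_speech, K_noise]
    U, F, R = X.shape
    J = len(Ks)
    rng = np.random.default_rng(seed)
    D = [rng.random((F, K)) + eps for K in Ks]       # per-source dictionaries D_j
    A = [rng.random((K, R)) + eps for K in Ks]       # per-source activations A_j
    Q = rng.random((U, J, F)) + eps                  # spatial gains Q_{u,j,f}

    def model():
        # X_hat[u,f,r] = sum_j Q[u,j,f] * (D_j A_j)[f,r]   (eq. 4.36)
        Xh = np.zeros((U, F, R))
        for j in range(J):
            Xh += Q[:, j, :, None] * (D[j] @ A[j])[None, :, :]
        return np.maximum(Xh, eps)

    for _ in range(n_iter):
        for j in range(J):
            # activation update, eq. (4.43)
            Xh = model()
            Num, Den = Xh ** (beta - 1) * X ** alpha, Xh ** (alpha + beta - 1)
            QD = Q[:, j, :, None] * D[j][None, :, :]            # diag(Q_{u,j}) D_j
            A[j] *= (np.einsum('ufk,ufr->kr', QD, Num) /
                     (np.einsum('ufk,ufr->kr', QD, Den) + eps)) ** (1 / alpha)

            # dictionary update, eq. (4.44)
            Xh = model()
            Num, Den = Xh ** (beta - 1) * X ** alpha, Xh ** (alpha + beta - 1)
            D[j] *= (np.einsum('uf,ufr,kr->fk', Q[:, j, :], Num, A[j]) /
                     (np.einsum('uf,ufr,kr->fk', Q[:, j, :], Den, A[j]) + eps)) ** (1 / alpha)

            # spatial gain update, eq. (4.48)
            Xh = model()
            Num, Den = Xh ** (beta - 1) * X ** alpha, Xh ** (alpha + beta - 1)
            S = (D[j] @ A[j])[None, :, :]
            Q[:, j, :] *= ((S * Num).sum(axis=2) /
                           ((S * Den).sum(axis=2) + eps)) ** (1 / alpha)
    return D, A, Q
```

Setting alpha=1 and beta=-1 reproduces the Itakura-Saito special case mentioned above.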
4.5 The challenge of multiple mixtures
So far two algorithms were presented that can be applied to multichannel data. How-
ever, in the MU algorithm of chapter 4.4 the mutual information between the channels
is discarded.
Also, the EM algorithm described in chapter 3.3.2 has an undesirable property: its computational complexity. The most computationally expensive operation is the inversion of the mixture covariance matrix Σ_{Y,fr}, which is for example required to calculate equation 3.31. The dimension of Σ_{Y,fr} is M × M. If no advanced inversion algorithm is used, the computational complexity of this operation is O(M^3). It is problematic that the inversion has to be performed for each time-frequency bin and in each iteration. When I denotes the number of iterations, the overall computational complexity of the EM algorithm is given as
O(M^3 F R I)   (4.49)
In [OF10] the algorithm is used for binaural music separation (M = 2). In this work it shall however be applied to a robot with 4 microphones. According to equation 4.49, the shift from M = 2 to M = 4 results in an 8 times higher computational complexity. 2 × 2 matrices can easily be inverted for all time and frequency bins at once by elementwise matrix multiplication, division and addition. For 4 × 4 matrices a similar approach is also possible. However, instead of growing by a factor of 8, as suggested by equation 4.49, the number of multiplications increases from 4 to 280 and the number of additions/subtractions from 1 to 104, as sketched in appendix D.6.
The above comparison and experimental tests showed that in the given scenario the source separation cannot be performed within a few hours. Moreover, 4 microphones is not a particularly high number for a robot, and the algorithm's complexity increases rapidly for even more microphones.
As both methods considered so far have some drawbacks, an alternative approach is proposed in this thesis. In spatial filter design it is well known that a multichannel Wiener filter (MWF), which is the optimum linear MSE filter (see appendix C.0.5), can be separated into an MVDR beamformer and a single channel postfilter (see for example [BW01], chapter 3.2.2).
It is not guaranteed that the optimum filter resulting from the NMF model for the reduction of non-stationary noise can be separated into a multichannel and a single channel part. Nevertheless it is worth trying, for the following two reasons:
• reduced computational complexity, as the multichannel problem has to be solved only once and not in every iteration
• all existing single channel algorithms can be applied after the multichannel preprocessing
At this point some properties of the investigated ego-noise reduction scenario are introduced that do not hold in general for audio source separation. By restricting the scenario to 2 sources (ego-noise and speech), of which only the performance of the speech estimate is of importance, it is sufficient to design a single multichannel filter for the speech. Recall that a scenario is considered where microphone and loudspeaker positions do not change over time. The propagation vector is therefore approximately constant over time. This and the similarity of speech and noise between training and testing allow designing a fixed filter with the training data and then applying it during testing.
4.5.1 Multichannel Wiener filter (MWF)
As filter we propose to use the optimum spatial linear MSE estimator, often called the MWF, for the u-th spatial sound source of the speech training data, denoted as X_{S_{TR},ufr}:
\min_{H_f} E\{|X_{S_{TR},ufr} - H_f^H Y_{TR,fr}|^2\}   (4.50)
The algorithm is designed to estimate one of the spatial sound sources instead of the true source signals, as the true source signals are not always accessible for the training data. As derived in appendix C.0.5, the optimum filter is
H_f = E\{Y_{TR,fr} Y_{TR,fr}^H\}^{-1} E\{Y_{TR,fr} X_{S_{TR},ufr}^{*}\} = \Sigma_{YY,fr}^{-1} \Sigma_{Y X_S,fr}   (4.51)
As the observations Y_{TR,fr} = X_{S_{TR},fr} + X_{N_{TR},fr} add up (at this point the mixing is performed in the complex time-frequency domain, where two sources actually add up; hence the problem with the unknown mixing behavior described in chapter 4.3.2 does not arise) and speech and noise are uncorrelated, E\{X_{N_{TR},fr} X_{S_{TR},ufr}^{*}\} = 0, the equation simplifies to H_f = \Sigma_{YY,fr}^{-1} \Sigma_{X_S X_S,fr}. The spatial covariance matrices can then be estimated as time averages over the training data by
\Sigma_{YY,f} = \frac{1}{R} \sum_r Y_{TR,fr} Y_{TR,fr}^H   (4.52)
\Sigma_{X_S X_S,f} = \frac{1}{R} \sum_r X_{S_{TR},fr} X_{S_{TR},ufr}^{*} \overset{\Delta}{=} V_{S_{TR},uf}\, d_X   (4.53)
The vector d_X can be interpreted as the average propagation vector of the source relative to the selected microphone. The MWF then rewrites as
H_f = \Sigma_{YY,f}^{-1} V_{S_{TR},uf}\, d_X   (4.54)
and is used for all frames of the testing data.
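The fixed MWF of equation 4.54 can be computed directly from the training STFTs. The following NumPy sketch is illustrative (function and variable names are hypothetical); it estimates the covariances of equations 4.52 and 4.53 as time averages and solves one linear system per frequency bin. The small diagonal loading is an assumption added for numerical stability.

```python
import numpy as np

def fixed_mwf(Y_tr, Xs_tr, ref_mic=0, eps=1e-10):
    # Y_tr:  mixture training STFT,     shape (M, F, R), complex
    # Xs_tr: speech-only training STFT, shape (M, F, R), complex
    # returns H of shape (F, M): one fixed filter per frequency bin (eq. 4.54)
    M, F, R = Y_tr.shape
    H = np.zeros((F, M), dtype=complex)
    for f in range(F):
        Yf = Y_tr[:, f, :]                          # (M, R)
        Sigma_YY = (Yf @ Yf.conj().T) / R           # eq. (4.52)
        # cross-covariance with the reference-microphone speech image, eq. (4.53)
        Sigma_YXs = (Xs_tr[:, f, :] @ Xs_tr[ref_mic, f, :].conj()) / R
        H[f] = np.linalg.solve(Sigma_YY + eps * np.eye(M), Sigma_YXs)
    return H

def apply_filter(H, Y):
    # Y: test mixture STFT (M, F, R); output: single-channel speech estimate (F, R)
    return np.einsum('fm,mfr->fr', H.conj(), Y)
```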
4.5.2 Parametric multichannel Wiener filter (PMWF)
In chapter 5.3.4 it will be pointed out that the speech distortion introduced by the MWF results in a bad performance of the following processing step.
The parametric multichannel Wiener filter (PMWF), as for example presented in [SBA10] and [LN11], connects the multichannel Wiener filter with the MVDR (minimum variance distortionless response) beamformer. It is a well-known approach to control the trade-off between speech distortion and noise reduction. The PMWF can be derived by reformulating the optimization problem in equation 4.50. Using the uncorrelatedness of speech and noise, the cost function can be reformulated as
E\{|X_{S_{TR},ufr} - H_f^H Y_{TR,fr}|^2\} = E\{|(X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}) - H_f^H X_{N_{TR},fr}|^2\}
= E\{|X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}|^2\} + E\{|H_f^H X_{N_{TR},fr}|^2\}   (4.55)
E\{|X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}|^2\} represents the speech distortion and E\{|H_f^H X_{N_{TR},fr}|^2\} the noise reduction. A trade-off can easily be achieved by adding a weighting factor ν to the speech distortion and thus solving the optimization problem
\min_{H_f} E\{|H_f^H X_{N_{TR},fr}|^2\} + \nu E\{|X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}|^2\}   (4.56)
76
If ν is interpreted as a Lagrangian multiplier, the optimization problem is equivalent to
\min_{H_f} E\{|H_f^H X_{N_{TR},fr}|^2\} \quad \text{subject to} \quad E\{|X_{S_{TR},ufr} - H_f^H X_{S_{TR},fr}|^2\} < \epsilon   (4.57)
where ε is the allowed signal distortion. The PMWF can then be derived as
H_f = (\mu \Sigma_{X_N X_N,fr} + \Sigma_{X_S X_S,fr})^{-1} V_{S_{TR},ufr}\, d_X   (4.58)
with the substitution \mu \overset{\Delta}{=} \frac{1}{\nu}. For μ → 0 the beamformer does not care about the noise reduction and only minimizes the speech distortion. Unfortunately there is a problem if the speech is not created by a point source or if the microphone signals contain some additive noise: in that case, zero distortion can only be achieved by the trivial solution, where the reference microphone receives gain 1 and all other microphones gain 0.
4.5.3 MVDR beamformer
For this reason the MVDR beamformer is derived in its original version [CZO87] by the following optimization problem:
\min_{H_f} E\{|H_f^H Y_{TR,fr}|^2\}   (4.59)
\text{subject to } H_f^H d_f = 1   (4.60)
Equation 4.59 has the well-known solution
H_f = \frac{\Sigma_{YY,fr}^{-1} d_f}{d_f^H \Sigma_{YY,fr}^{-1} d_f}   (4.61)
Compared to the MWF of equation 4.54, the MVDR beamformer is just scaled in such a way that the look direction (represented by d) is not distorted.
If the speech is a point source with propagation vector d, the optimization problem can equivalently be rewritten as
\min_{H_f} E\{|H_f^H X_{N_{TR},fr}|^2\}   (4.62)
\text{subject to } H_f^H d_f = 1   (4.63)
Again the solution is
H_f = \frac{\Sigma_{NN,fr}^{-1} d_f}{d_f^H \Sigma_{NN,fr}^{-1} d_f}   (4.64)
It is important to consider that, as the propagation vector is only an average propagation vector of the source, the MVDR beamformer will indeed introduce distortion. A distortionless response is only guaranteed for a point source with the exact propagation vector. It is however likely that the distortions are less severe than for the MWF, without obtaining the trivial solution of the PMWF for μ = 0.
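The following NumPy sketch summarizes how the PMWF of equation 4.58 and the MVDR beamformer of equations 4.61/4.64 can be computed once the covariance matrices and the average propagation vector have been estimated. It is a minimal illustration under the stated assumptions; the function names and the eps regularization are not from the text.

```python
import numpy as np

def mvdr_weights(Sigma, d, eps=1e-10):
    # MVDR beamformer of eq. (4.61)/(4.64): H = Sigma^{-1} d / (d^H Sigma^{-1} d)
    # Sigma: (M, M) covariance (mixture or noise), d: (M,) propagation vector
    Sd = np.linalg.solve(Sigma + eps * np.eye(len(d)), d)
    return Sd / (d.conj() @ Sd)

def pmwf_weights(Sigma_n, Sigma_s, d, V_s, mu=1.0, eps=1e-10):
    # Parametric MWF of eq. (4.58); mu trades noise reduction against speech
    # distortion (mu = 1 gives the plain MWF, mu -> 0 emphasizes low distortion)
    M = len(d)
    return np.linalg.solve(mu * Sigma_n + Sigma_s + eps * np.eye(M), V_s * d)
```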
Chapter 5
Experimental results
The experimental results are summarized in this chapter. First the experimental setup
is explained. Afterwards the reduction performance of the stationary noises achieved
through the prefiltering is discussed. The chapter is then finished with a comparison
of the results of the different movement noise reduction algorithms.
5.1 Experimental setup
If not stated otherwise in the specific sections, the following setup and parameters are
used for the experiments evaluated later on.
The samples are recorded with a sample rate of 48 kHz, but downsampled to 16 kHz
to reduce the computational complexity. The algorithms are performed in the STFT
domain. Each frame is a rectangular window of length 256 and the individual frames
do not overlap.
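For illustration, the non-overlapping rectangular-window STFT used in the experiments can be implemented as in the following minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def stft_rect(x, frame_len=256):
    # Non-overlapping rectangular-window STFT (16 kHz signals, frame length 256)
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.fft.rfft(frames, axis=1).T        # (F, R) = (frequency bins, frames)

def istft_rect(X, frame_len=256):
    # Inverse transform; reconstruction is exact because the frames do not overlap
    return np.fft.irfft(X.T, n=frame_len, axis=1).reshape(-1)
```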
The training and test signals are obtained as follows: All recordings are made in the same room, while NAO was standing or walking on the spot. The following four signals are recorded with the robot NAO in a real room with a moderate reverberation level (T60 ≈ 200 ms):
• A speech signal is played back by a loudspeaker. It contains short sentences of different speakers. It is recorded while the fan was off; thus it contains only the speech and some static noise.
• A fan noise reference is recorded, containing only fan and static noise.
• Walking noise is recorded, where the robot walks on the spot. As the fan has to be turned on, the recording consists of fan, static and walking noise.
• Waving noise is recorded, where the robot waves while standing on the spot. As the fan has to be turned on, the recording consists of fan, static and waving noise.
Either the walking or the waving noise is used for an experiment, never both at the same time. For simplicity, the walking or waving noise recording that is currently used for an experiment is referred to as movement noise.
From the speech and the noise recordings, 20 seconds are extracted starting at 3 seconds to form the testing data. Then a random offset of up to 1000 samples is applied and another 100 seconds are extracted to form the training data. The random offset is applied to avoid getting exactly the same results when running a 'deterministic' algorithm several times; it thus prevents obtaining particularly good or bad results by chance.
5.2 Prefiltering
This section evaluates the performance of the fan spike reduction and the static and
residual fan noise reduction via the prefilters.
5.2.1 Fan noise reduction
First the fan noise reduction algorithm described in chapter 4.2.1 is applied to all signals. The training is performed with the separate fan noise recording and then applied to each microphone of speech and noise, training and test data.
Obviously the algorithm removes energy from all signals. In order to determine whether the algorithm works as planned, the residual to artifact ratio (RAR) as defined by equation 5.1 is used.
\text{RAR} = 10 \log_{10}\left(\frac{V_{out}}{V_{in} - V_{out}}\right) \; \text{[dB]}   (5.1)
V_{in} is the power of the signal before the algorithm and V_{out} the power of the signal after the algorithm. The RAR is used as it can be calculated individually for each of the sources in the mixture (although not all of them can be observed individually). It is thus a measure of how each source is affected by the algorithm. A RAR of ∞ dB corresponds to a signal not changed by the algorithm and a RAR of −∞ dB to complete suppression. The corresponding RAR values for all signals and microphones can be seen in table 5.1. The distortion of speech, static, walking and waving noise is considerably small for all microphones, but the distortion, or in other words the reduction, of the fan noise is quite large. On average, 4.9 dB more fan noise is removed than remains.
Table 5.1: The residual to artifact ratio after fan spike reduction shows a reduction of the fan noise, while the other signals are hardly affected
mic 1 mic 2 mic 3 mic 4 average
RAR speech [dB] 11.5 11.5 11.7 11.8 11.6
RAR static [dB] 11.8 9.8 10.5 11.0 10.8
RAR fan [dB] -3.3 -4.7 -6.4 -5.1 -4.9
RAR walking [dB] 12.9 14.7 9.6 11.9 12.3
RAR waving [dB] 11.2 11.4 13.2 12.3 12.0
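For reference, the RAR of equation 5.1 can be computed as in the following minimal sketch, assuming the individual source component is available before and after the algorithm (as is the case here for the separately recorded signals):

```python
import numpy as np

def rar_db(source_in, source_out):
    # Residual-to-artifact ratio of eq. (5.1) for one source component
    v_in = np.sum(np.abs(source_in) ** 2)
    v_out = np.sum(np.abs(source_out) ** 2)
    return 10 * np.log10(v_out / (v_in - v_out))
```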
5.2.2 Static and residual fan noise reduction
The second filter, the single channel Wiener filter described in chapter 4.2.2, is applied
to each microphone of speech and noise, training and test data as well. As the filter
5.2. PREFILTERING 81
depends on the power of the mixture, its impact on the static and fan noise is calculated
with the average filter of the walking mixture.
Although experiments showed that a slightly better result can be obtained by larger
frames, a rectangular window of size 256 is applied, as this frame-size is used later for
the movement noise reduction, too.
The RAR values in table 5.2 are again calculated in relation to the input before the
fan noise reduction. They contain thus the distortion/reduction introduced by the fan
spike reduction algorithm and the Wiener filter.
Table 5.2: The RAR after Wiener filtering shows a reduction of static and fan noise with little distortion of the other sources
mic 1 mic 2 mic 3 mic 4 average
RAR speech [dB] 8.3 8.4 8.8 8.8 8.5
RAR static [dB] -1.6 -1.7 -1.6 -1.6 -1.6
RAR fan [dB] -6.7 -7.8 -9.4 -8.1 -8.0
RAR walking [dB] 6.4 6.5 5.1 4.0 5.5
RAR waving [dB] 8.2 8.4 9.6 8.4 8.6
With an average of 8.5 dB, the artifacts (distortion) of the speech signal are still acceptable, whereas most of the fan noise is removed. More than half of the static noise is reduced as well.
5.2.3 Prefiltering results
After showing that the introduced speech artifacts are small, it is investigated how the speech-to-interference ratios are affected by the prefilters. An overview is given in table 5.3. As can be seen, microphone 4 has the best speech-to-interference ratio (SIR) for all noise types. The speech-to-static-noise ratio increases by about 3 dB through the Wiener filter and, with values over 20 dB, can be neglected further on. The speech-to-fan-noise ratio increases by about 6 dB through the spike reduction and by an additional 2 dB through the Wiener filtering. With values from 10 to 16 dB it is also negligible compared to the movement noise, which has SIR values below 8 dB, as shown later.
Table 5.3: Impact of the prefilters on the speech to interference ratio (SIR) of static and fan noise
mic 1 mic 2 mic 3 mic 4
static SIR at input [dB] 18.0 18.8 22.4 22.7
static SIR after fan spike reduction [dB] 18.0 18.9 22.5 22.8
static SIR after WF [dB] 21.3 22.1 25.7 26.1
fan SIR at input [dB] 2.3 2.6 3.3 8.3
fan SIR after fan spike reduction [dB] 7.0 8.3 10.3 14.3
fan SIR after WF [dB] 9.2 10.5 12.6 16.5
Average fan and static SIR gain by both prefilters 6.9 7.9 9.3 8.2
5.2.4 Evaluation approach for movement reduction
For the performance evaluation of the movement noise reduction algorithms, the speech and movement estimates obtained through the prefiltering are used as reference. The performance measure of the movement noise reduction is thus not limited by the artifacts of the prefilters. If an algorithm performed a perfect separation of the speech signal, the resulting estimate would not be the true speech signal, but the speech recording distorted by the two prefilters (with the residual static noise). As the RAR in table 5.2 for speech and movement noise is between 5 and 9 dB, this approach can be justified.
The SIR can then also be determined for the input of the movement noise reduction step. It is listed in table 5.4. In the new definition, the movement noise also includes the residual fan noise power, just as the speech signal includes the residual static noise power. In the table, the BSS toolbox version 3.0 is used for the calculation of the performance; it is also used further on for the performance measures of the movement noise reduction.
Table 5.4: SIR and SDR by the BSS toolbox are equivalent after prefiltering
SIR [dB] SDR [dB]
mic 1 mic 2 mic 3 mic 4 mic 1 mic 2 mic 3 mic 4
walking via BSS -0.6 1.1 4.3 7.4 -0.7 1.0 4.2 7.3
waving via BSS -8.6 -7.3 -6.3 -3.1 -8.7 -7.4 -6.3 -3.1
The BSS toolbox is especially designed for the performance evaluation of source separation algorithms. Roughly explained, the toolbox takes into account that the estimated speech can be expressed as the sum of four components:
\hat{X}_{target} = X_{target} + X_{interferer} + X_{noise} + X_{artifacts}   (5.2)
where X = [X_t]^T represents a vector of time samples. The four terms can be interpreted as follows:
• Xtarget: desired source (speech) in the estimate
• Xinterferer: undesired/interfering sources (movement noise) in the estimate
• Xnoise: residual of the estimate, that is not included in one of the sources
• Xartifacts: represents the deviation of the desired source in the estimate compared
to the true signal of the desired source (comparable to the artifacts used in
equation 5.1)
For a more sophisticated explanation, we refer at this point to the authors' website¹. Furthermore the noise term is assumed to be zero. In this case, two of the BSS parameters are sufficient to describe the separation performance. Hence the source to distortion ratio (SDR) and the source to interference ratio (SIR), defined as follows,
are chosen:
\text{SIR} = 10 \log_{10} \frac{\|X_{target}\|_2^2}{\|X_{interferer}\|_2^2}   (5.3)
\text{SDR} = 10 \log_{10} \frac{\|X_{target}\|_2^2}{\|X_{interferer} + X_{noise} + X_{artifacts}\|_2^2}   (5.4)
¹ http://bass-db.gforge.inria.fr/bss_eval/
The SDR is the measure that is used to determine which of the algorithms performs best. The numerator of the SDR is not changed by the algorithm, whereas the denominator can be reformulated into the MSE of the estimate, \|X_{target} - \hat{X}_{target}\|_2^2, by using equation 5.2.
In addition, the SIR is useful to determine the relation of artifact and interferer power. If SIR ≫ SDR the artifacts are the main problem, if SIR ≈ SDR there are no severe artifacts, and if SIR ≈ SDR + 3 dB artifacts and interference are approximately equally strong.
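A minimal sketch of equations 5.3 and 5.4 is given below. Note that it assumes the four decomposition terms of equation 5.2 are already known; the actual evaluation uses the BSS toolbox, which estimates these terms by projections, a step not reproduced here.

```python
import numpy as np

def sir_sdr(x_target, x_interferer, x_noise=None, x_artifacts=None):
    # Simplified computation of eq. (5.3) and (5.4) from known decomposition terms
    x_noise = np.zeros_like(x_target) if x_noise is None else x_noise
    x_artifacts = np.zeros_like(x_target) if x_artifacts is None else x_artifacts
    num = np.sum(np.abs(x_target) ** 2)
    sir = 10 * np.log10(num / np.sum(np.abs(x_interferer) ** 2))
    sdr = 10 * np.log10(num / np.sum(np.abs(x_interferer + x_noise + x_artifacts) ** 2))
    return sir, sdr
```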
In table 5.4, a first separation estimate, obtained by simply using the prefiltered microphone signal, is already listed. As can be seen, SIR and SDR are identical, as the distortion in the estimate is just the additive interfering movement noise (no artifacts).
5.3 Movement noise reduction
For the movement noise reduction four different multichannel algorithms are applied:
• α-β MU with prefiltering by a MWF, where a speech and a noise codebook are
pretrained
• α-β MU with prefiltering by a MWF, where only a noise codebook is pretrained
• Euclidean MU with prefiltering by a MWF, where a noise codebook is pretrained
such that it predicts the noise power from the observations (see chapter 4.3.5).
• α-β MU for multichannel data, where a speech and a noise codebook are pre-
trained
Each algorithm is tested for different codebook sizes K and if applicable for the fol-
lowing parameters α and β:
Table 5.5: Selected parameter for the generalized α-β divergence
α β
1.0 1.0 squared Euclidean distance
1.0 0.0 Kullback-Leibler divergence
1.0 -1.0 Itakura-Saito divergence
0.5 1.0
2.0 1.0
1.0 -2.0
1.0 2.0
The complete results can be seen in appendices A and B. In the following sections, the best performance of each algorithm (regarding codebook size K and parameters α, β) is discussed.
5.3.1 Experimental results for single channel data
Before proceeding to the multichannel results, this chapter shows that the three single channel methods indeed achieve a performance increase for single channel data.
As shown in table 5.6, the best SDR is achieved by training a speech and a noise codebook. An SDR increase of 5.6 dB for walking and 10.3 dB for waving can be achieved. With 1 and 9, the codebook sizes are small. The predictive algorithm performs second best and NMF with only the noise codebook performs worst. If selected by best SIR, the order is the same (see table 5.7).
5.3.2 Performance of PMWF for spatial preprocessing
As described in chapter 4.5 a beamformer can be used as additional prefilter for solving
the multichannel problem with less computational complexity. The algorithm proposed
Table 5.6: Single channel overview; parameters selected by best SDR
walking waving
Algorithm α; β K SIR SDR α; β K SIR SDR
without an algorithm - - -0.6 -0.7 - - -8.6 -8.7
α-β MU, speech + noise 1;0 1 8.1 4.9 0.5;1 9 3.4 1.6
α-β MU, noise 1;1 6 5 1.9 1;-1 129 -5.4 -0.5
Euclidean predictive, noise - 90 5.9 2.7 - 129 -0.5 -0.9
Table 5.7: Single channel overview; parameters selected by best SIR
walking waving
Algorithm α; β K SIR SDR α; β K SIR SDR
without an algorithm - - -0.6 -0.7 - - -8.6 -8.7
α-β MU, speech + noise 1;2 4 12.9 1.9 0.5;1 9 3.4 1.6
α-β MU, noise 2,1 27 6.6 1.5 1;0 27 -2.6 -1.1
Euclidean predictive, noise - 80 5.9 2.7 - 22 0 -1
earlier is the parametric multichannel Wiener filter (PMWF). First the parameter µ that controls the trade-off between noise reduction and speech artifacts has to be determined. As can be seen in figure 5.1 (for completeness, the corresponding figure for the waving noise is shown in appendix B.2.1), the SDR is best for µ = 1, where the PMWF simply becomes the multichannel Wiener filter (MWF). This result is not surprising, as it is well known that the MWF is the optimum linear estimator regarding the MSE. For this reason the MWF is selected as prefilter. The achieved performance and the corresponding improvement are listed in table 5.8.
Furthermore, figure 5.1 shows that for µ → 0, SIR and SDR converge towards the input performance of the microphone that was used as reference. The estimated filter is thus the trivial solution, where one microphone is completely preserved and all others are attenuated to zero. Choosing µ = 0 would thus result in the performance shown in the single channel result section. Though, it has to be mentioned that the single channel evaluation was, as explained earlier, performed with the first microphone, which does not have the best input SDR.
Figure 5.1: SIR, SDR and SAR of the parametric Wiener filter for the walking noise. [Plots over 10 log10(µ) from −60 to 50 dB; not reproduced.]
Table 5.8: BSS performance after MWF (µ = 1)
SIR [dB] SDR [dB]
mic 1 mic 2 mic 3 mic 4 mic 1 mic 2 mic 3 mic 4
waving after MWF 15.0 14.5 15.2 14.6 6.2 6.4 7.2 7.3
waving improvement 23.6 21.8 21.5 17.7 14.8 13.8 13.6 10.4
walking after MWF 19.5 18.5 20.6 18.5 10.7 11.7 13.3 13.5
walking improvement 20.0 17.4 16.3 11.1 11.4 10.7 9.1 6.1
Table 5.8 shows that microphone 3 is best for both noise types regarding SIR, whereas microphone 4 is best regarding SDR. As microphone 1 was used as reference for most of the experiments, it is also picked as reference here, although the MWF designed for microphone 4 performs better.
5.3.3 Experimental results for multichannel data
As stated earlier, the main criterion for selecting the best algorithm is the SDR. The four multichannel algorithms are listed in table 5.9 with their optimum parameters. As indicated by the bold letters, the MWF results in the best SDR. The three NMF methods applied after the MWF thus always lead to a loss in SDR. Furthermore, the true multichannel algorithm has a far worse SDR than the beamformer, with a loss of 9.7 dB for walking and 5.1 dB for waving. For reference, the table also shows the single channel results and the performance of PO-KSVD+ from [DK14] on the same scenario. As can be seen, the multichannel algorithm performs even worse than the single channel one. The single channel method even achieves a better SDR for the walking noise compared to PO-KSVD+, but performs worse for the waving noise.
Table 5.9: Multichannel overview best SDR
walking waving
Algorithm α; β K SIR SDR α; β K SIR SDR
best input microphone - - 7.4 7.3 - - -3.1 -3.1
MWF - - 19.8 10.7 - - 15.7 6.2
α-β MU, prefilter, speech + noise 1; -1 1 25.3 7.9 1; 0 1 22.2 5.7
α-β MU, prefilter, noise 1; 2 1 19.4 2.4 1; 2 1 13.2 2.6
Euclidean pred., prefilter, noise 50 21.6 4.9 80 18.7 3.3
α-β MU, multich., speech + noise 1; 2 24 19.8 1 2;1 18 24.1 1.1
α-β MU, single ch., speech + noise 1;0 1 8.1 4.9 0.5;1 9 3.4 1.6
PO-KSVD+ from [DK14] 22.3 1.8 22.5 2.3
The results are different if one only cares about the SIR. The corresponding performances are listed in table 5.10.
The α-β MU with prefilter and a speech and noise codebook with small codebook size (K = 4) achieves a quite impressive SIR increase of 5.2 dB compared to the MWF alone, but the SDR decreases by 7.7 dB for the walking noise.
Interestingly the multichannel algorithm achieves best SIR for the waving noise. The
SIR increases by 11.6 dB, whereas the SDR drops by 5.1 dB.
Table 5.10: Multichannel overview best SIR
walking waving
Algorithm α; β K SIR SDR α; β K SIR SDR
α-β MU, prefilter, speech + noise 1; 1 4 26 3 1,-1 1 25.4 5
α-β MU, prefilter, noise 1; -1 1 22 2.3 1;-1 129 20 1.6
Euclidean pred., prefilter, noise - 50 21.6 4.9 - 129 18.7 3.3
α-β MU, multich., speech + noise 1; 0 26 23.4 0 0.5; 1 4 27.3 1.1
Several conclusions can be drawn from the data of the last two chapters:
• The three investigated single channel algorithms improve the SDR of single channel data, yet they cannot do so after the MWF as spatial prefilter.
• The multichannel NMF algorithm can achieve some improvement, yet its SDR is worse compared to the equivalent single channel algorithm.
• All multichannel algorithms result in a quite impressive SIR.
In order to explain why the NMF algorithms after an MWF result in an increase in SIR but a decrease in SDR, the behavior of desired source, interferer and artifacts is investigated before, between and after the two stages.
5.3.4 Discussion on using NMF with MWF as spatial prefilter
At the input of the MWF, no artifacts are assumed, such that SIR and SDR are equivalently given as
\text{SIR}_{in} = \text{SDR}_{in} = 10 \log_{10} \frac{\|X_{target,in}\|_2^2}{\|X_{interferer,in}\|_2^2}   (5.5)
The first stage, represented by MWF, reduces the interferer error term as X_{interferer,in} \xrightarrow{MWF} X_{interferer,MWF}. By doing so, the target source is also distorted; it is reduced according to X_{target,in} \xrightarrow{MWF} X_{target,MWF}. On the other hand, the modification of the target by the stage can be interpreted as adding some artifacts to the signal, where the artifact error term is X_{artifact,MWF} = X_{target,in} - X_{target,MWF}.
The SDR at the output of the first stage is then given as
\text{SDR}_{MWF} = 10 \log_{10} \frac{\|X_{target,in}\|_2^2}{\|X_{interferer,MWF} + X_{artifact,MWF}\|_2^2}   (5.6)
From equation 5.5 to 5.6, only the denominator changes. The best estimator regarding the SDR is thus the one that finds the best trade-off between average interferer and artifact power.
As mentioned in chapter 5.2.4, the denominator can also be seen as the MSE of an estimate. Minimizing the MSE is hence equivalent to maximizing the SDR. As pointed out earlier, the MWF uses the MSE cost function. The residual cost of the estimator is given by
C_{MSE,MWF} = \|X_{interferer,MWF} + X_{artifact,MWF}\|^2   (5.7)
The second stage, abbreviated by NMF, further reduces the interferer as X_{interferer,MWF} \xrightarrow{NMF} X_{interferer,NMF} while introducing the artifacts X_{artifact,NMF} = X_{target,MWF} - X_{target,NMF}. The MSE cost of the second stage is similar to equation 5.7:
C_{MSE,NMF} = \|X_{interferer,NMF} + X_{artifact,NMF}\|^2   (5.8)
In contrast to the MWF, the generalized α-β divergence MU rules do not minimize the MSE for all parameters; thus they do not maximize the SDR. But even if they did minimize the MSE, this would no longer be equivalent to maximizing the overall SDR, given by
\text{SDR}_{MWF+NMF} = 10 \log_{10} \frac{\|X_{target,in}\|^2}{\|X_{interferer,NMF} + X_{artifact,MWF} + X_{artifact,NMF}\|^2}   (5.9)
The second stage should consider the artifacts introduced in the first stage, yet it is independent of those artifacts. As a result, the SDR drops and the SIR, which is independent of the artifacts, increases. Another interpretation is that the second stage can only reduce the artifacts of the first stage by chance, as it does not know them. At this point this mismatch can be tackled by two different approaches:
• The second stage has to estimate the speech input of the first stage rather than
the speech output of the first stage.
• The first stage is not allowed to add artifacts.
The first idea is difficult to transfer to the NMF algorithms used as second stage. It may be possible to adapt the predictive Euclidean matrix factorization to account for the first stage. As a drawback, the Wiener filter can then no longer be restricted to values between 0 and 1, as the second stage has to be able to equalize the attenuation (and thus the artifacts) of the first stage. The only possibility left is to keep the artifacts of the first stage at a minimum. As shown earlier, the PMWF could only do so by the trivial solution of using the reference microphone only. In the next chapter the MVDR beamformer in its original form is therefore tested.
5.3.5 Performance of MVDR beamformer for spatial preprocessing
As explained in chapter 4.5.3, the MVDR beamformer can be calculated in two versions, which are equivalent for a non-moving point source. The achieved performance for both versions is shown in table 5.11 for the waving noise. The SDR can be slightly increased by both methods. However, compared to the MWF, the resulting SDR is significantly worse.
The table shows that SIR ≫ SDR regardless of which microphone is used as reference or which method is used. The artifacts are thus more troublesome than the residual interference. In contrast to the theoretical background, the MVDR beamformer introduces severe artifacts; the objective of no speech distortion is not met.
Table 5.11: BSS performance after MVDR
SIR [dB] SDR [dB]
mic 1 mic 2 mic 3 mic 4 mic 1 mic 2 mic 3 mic 4
waving MVDR, ΣNN 12.7 12.6 11.5 10.1 -7.6 -6.5 -1.8 -2.2
waving improvement 21.3 19.9 17.8 13.2 1.1 0.9 4.5 0.9
waving MVDR, ΣY Y 11.6 11.3 11.3 9.8 -7.5 -6.4 -1.8 -2.2
waving improvement 20.2 18.7 17.6 12.9 1.2 1.0 4.5 0.9
The only reasonable explanation is that the used relative propagation function is not at all constant over time. This can also be observed in the actual data. The average propagation vector is thus not well suited for most of the frames. In all those frames the speech signal gets distorted.
The reason for this behavior is however not clear. If the speech source moved around or were not point-like, a changing propagation vector over frames would make sense. Yet the loudspeaker is static and fairly point-like, and the robot was standing still as well.
At this point it can only be guessed that the frame length of 256 samples was not well suited. We conclude this chapter by pointing out that the investigated spatial preprocessing methods introduce too many speech artifacts for a postprocessing stage to make sense.
Chapter 6
Conclusion
The experimental results of this work show that noise reduction by a Wiener filter with variance parameters estimated by NMF is possible. With the generalized α-β divergence, an SDR increase of 5 to 10 dB, depending on the input SDR, is possible. The results strongly depend on the used parameters, the number of codewords and the training technique. The experiments clearly show that the best parameters depend on the scenario.
Using a speech and a noise codebook generally proves to be superior to the case where only a noise codebook is used. A novel matrix factorization approach, designed to account for the pretraining of the codebook with a reference signal, is derived for the Euclidean cost function. It performs better than the Euclidean NMF with a single codebook for both tested movement noises, yet fails to outperform Euclidean NMF with a speech and a noise codebook. The new estimator is indeed linear and of rank K. It highlights the fact that the way the standard NMF procedure uses the observations is unknown due to the iterative optimization. Also, the new approach does not feature the performance loss for large codebooks that is typical for NMF. This property confirms the theoretical investigation of NMF with pretrained dictionaries. We have argued that the discussed training approach is based on several assumptions which are not fulfilled in most scenarios and which can lead to a severe loss in performance.
The experimental results with the multichannel data show that NMF is, in the way it was used in this work, not suited for multichannel ego-noise reduction. The multichannel MU rules for the generalized α-β divergence perform worse regarding SDR than when applied to a single channel.
The novel two-stage approach with a beamformer as spatial prefilter does not work in combination with any of the applied algorithms. Although the spatial filter achieves a significant increase in SDR, supported by the time-invariant spatial properties, the subsequent NMF algorithm actually leads to a decrease in SDR. The evaluation shows that the NMF algorithm does not take into account the artifacts created by the MWF, which are more severe than the residual interferer power. The MVDR beamformer is tried as an alternative spatial prefilter. The use of an average relative transfer function turns out to be unsuitable for keeping the speech artifacts at a tolerable level.
Future work
This work develops several ideas which can be investigated in future work.
The concept of multichannel prefiltering can be investigated further with the objective of finding a beamformer with few artifacts on the speech signal. In this context, the evaluation of the SDR as the decisive performance criterion has to be questioned and should be compared to the keyword recognition rate of an actual recognition engine.
The results of the fixed MWF suggest that if the spatial covariance matrix of the ego-noise can be estimated accurately enough, a very good noise reduction performance can be achieved. At this point, the incorporation of the motor data seems very promising.
The proposed interpretation of NMF models via an activation function and a mixing function is another starting point for further research. In particular, the property of the mixing of being non-deterministic regarding the observations of the individual sources requires the development of mixing models for the observations via probability distributions.
The theoretical work on the pretraining of the dictionary results in the claim that the NMF algorithm can produce an arbitrary distribution of the power over the activation functions. Both experimental and theoretical investigations of this behavior could lead to a better understanding of source separation using NMF. In the course of that, constrained NMF gains new importance.
Finally, the development of better NMF training approaches is desirable. The importance of new training approaches is underlined by recent developments in [SLOVB14] and [SLVB15] on their so-called deformed references. As an outlook on future NMF techniques, a review of their ideas concludes this thesis.
Instead of two individual training and testing stages, the authors propose to share parameters (for example the dictionaries) between training and testing. Parameters are hereby coupled by transition probabilities. The optimization problem is then given by jointly minimizing the training and the testing cost functions. Transferring this idea to the approach of this work, it should be possible to train a speech and a noise dictionary such that the speech dictionary represents the speech training observations, the noise dictionary represents the noise training observations, and together they represent the training mixture observations. It is likely that this approach is not the perfect solution, but it is still better suited regarding the approximate additivity assumption.
Appendix A
Single channel results
A.1 walking
A.1.1 α-β-divergence with speech and noise codebook
Figure A.1: Walking-noise reduction by the α-β-divergence and speech and noise codebook (single channel). [Plot of SDR and SIR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Table A.1: Walking-noise reduction by the α-β-divergence and speech and noise codebook (single channel)
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 2 3 2 1 1 1 4
SIR [dB] 12.7 8.9 6.3 11.5 11.8 3.2 12.9
SDR [dB] 2.2 4.0 3.1 3.4 3.4 1.5 1.9
Selected after best SDR
K 10 1 7 16 6 1 80
SIR [dB] 8.1 8.1 5.9 7.6 8.2 3.2 8.1
SDR [dB] 3.9 4.9 3.9 4.4 4.3 1.5 3.4
A.1.2 MF with the α-β-divergence and Euclidean predictive
by a noise codebook
Figure A.2: Walking-noise reduction by the α-β-divergence with noise codebook. [Plot of SIR and SDR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Figure A.3: Walking-noise reduction by Euclidean predictive MF (single channel). [Plot of SDR and SIR over codebook size K; not reproduced.]
Table A.2: Walking-noise reduction by the α-β-divergence with noise codebook
α;β 1; 1 1; 0 1; -1 0.5; 1 2; 1 1; -2 1; 2 EUC pred.
Selected after best SIR
K 15 10 4 14 27 1 16 80
SIR [dB] 5.3 4.9 3.8 4.3 6.6 5.6 5.0 5.9
SDR [dB] 1.7 1.5 1.5 1.7 1.5 1.7 1.6 2.7
Selected after best SDR
K 6 10 60 5 4 1 11 90
SIR [dB] 5.0 4.9 3.2 4.1 5.6 5.6 4.7 5.9
SDR [dB] 1.9 1.5 1.6 1.8 1.6 1.7 1.6 2.7
A.2 waving
A.2.1 α-β-divergence speech and noise codebook
Figure A.4: Waving-noise reduction by the α-β-divergence and speech and noise codebook (single channel). [Plot of SDR and SIR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Table A.3: Waving-noise reduction by the α-β-divergence and speech and noise codebook (single channel)
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 3 2 1 9 3 29 9
SIR [dB] 1.4 2.4 3.4 3.4 -2.7 -4.1 1.4
SDR [dB] 0.2 -0.6 0.3 1.6 -3.5 -0.8 0.4
Selected after best SDR
K 3 3 1 9 3 29 9
SIR [dB] 1.4 2.0 3.4 3.4 -2.7 -4.1 1.4
SDR [dB] 0.2 -0.3 0.3 1.6 -3.5 -0.8 0.4
A.2.2 MF with the α-β-divergence and Euclidean predictive
by a noise codebook
Figure A.5: Waving-noise reduction by the α-β-divergence with noise codebook. [Plot of SDR and SIR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Figure A.6: Waving-noise reduction by Euclidean predictive MF (single channel). [Plot of SDR and SIR over codebook size K; not reproduced.]
Table A.4: Waving-noise reduction by the α-β-divergence and Euclidean predictive with noise codebook
α;β 1; 1 1; 0 1; -1 0.5; 1 2; 1 1; -2 1 1; 2 EUC pred.
Selected after best SIR
K 80 27 60 90 129 1 100 22
SIR [dB] -3.4 -2.6 -5.2 -3.9 -3.4 5.2 -3.9 0.0
SDR [dB] -2.3 -1.1 -1.2 -2.6 -2.3 0.0 -3.1 -1.0
Selected after best SDR
K 80 90 129 129 70 1 100 129
SIR [dB] -3.4 -3.9 -5.4 -3.9 -3.5 5.2 -3.9 -0.5
SDR [dB] -2.3 -0.7 -0.5 -2.5 -2.3 0.0 -3.1 -0.9
¹ This parameter set is excluded while searching for the best parameters due to its strange behavior over K.
Appendix B
Multichannel results
B.1 walking
B.1.1 MF with the α-β-divergence, MWF as prefilter and
speech and noise codebook
Figure B.1: Walking-noise reduction by the α-β-divergence with MWF prefilter and speech and noise codebook. [Plot of SDR and SIR over codebook size K for the parameter sets of table 5.5; not reproduced.]
Table B.1: Walking-noise reduction by the α-β-divergence with MWF prefilter and speech and noise codebook
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 4 1 1 2 5 0 7
SIR [dB] 26.0 24.2 25.3 25.9 24.4 19.8 23.6
SDR [dB] 3.0 6.6 7.9 3.0 3.0 10.7 1.0
Selected after best SDR
K 0 0 0 0 0 0 0
SIR [dB] 19.8 19.8 19.8 19.8 19.8 19.8 19.8
SDR [dB] 10.7 10.7 10.7 10.7 10.7 10.7 10.7
B.1.2 MF with α-β-divergence and Euclidean predictive, MWF
as prefilter and noise codebook
Figure B.2: Walking-noise reduction by the α-β-divergence with MWF prefilter and noise codebook. [Plot of SDR and SIR over codebook size K; not reproduced.]
Figure B.3: Walking-noise reduction by Euclidean predictive MF. [Plot of SDR and SIR over codebook size K; not reproduced.]
Table B.2: Walking-noise reduction by the α-β-divergence and Euclidean predictive, MWF as prefilter and noise codebook
α;β 1; 1 1; 0 1; -1 0.5; 1 2; 1 1; -2 1; 2 EUC pred.
Selected after best SIR
K 5 1 1 2 1 0 0 50
SIR [dB] 20.6 21.6 22.0 19.8 20.4 19.7 19.7 21.6
SDR [dB] 2.1 2.1 2.3 2.5 1.8 10.7 10.7 4.9
Selected after best SDR
K 0 0 0 0 0 0 0 0
SIR [dB] 19.7 19.7 19.7 19.7 19.7 19.7 19.7 19.8
SDR [dB] 10.7 10.7 10.7 10.7 10.7 10.7 10.7 10.7
B.1.3 Multichannel matrix factorization with the α-β-divergence
Figure B.4: Walking-noise reduction by the α-β-divergence for multichannel data with speech and noise codebook. [Plot of SDR and SIR over codebook size K; not reproduced.]
Table B.3: Walking-noise reduction by the α-β-divergence for multichannel data with speech and noise codebook
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 30 26 6 6 15 1 10
SIR [dB] 21.2 23.4 19.0 21.1 20.9 4.4 21.8
SDR [dB] 0.0 0.0 0.4 0.3 0.5 0.7 0.6
Selected after best SDR
K 5 3 3 1 9 1 24
SIR [dB] 20.6 18.5 18.9 19.5 20.4 4.4 19.8
SDR [dB] 0.7 0.1 0.8 0.5 0.7 1.0
B.2 waving
B.2.1 PMWF as prefilter
Figure B.5: SIR, SDR and SAR of the parametric Wiener filter for the waving noise. [Plots over 10 log10(µ) from −60 to 50 dB; not reproduced.]
B.2.2 MF with the α-β-divergence, MWF as prefilter and
speech and noise codebook
Figure B.6: Waving-noise reduction by the α-β-divergence with MWF prefilter and speech and noise codebook. [Plot of SIR and SDR over codebook size K; not reproduced.]
Table B.4: Waving-noise reduction by the α-β-divergence with MWF prefilter
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 5 1 1 2 14 24 22
SIR [dB] 22.9 22.2 25.4 22.2 21.0 17.9 22.2
SDR [dB] 2.5 5.7 5.0 4.6 2.5 0.7 1.6
Selected after best SDR
K 0 0 0 0 0 0 0
SIR [dB] 15.7 15.7 15.7 15.7 15.7 15.7 15.7
SDR [dB] 6.2 6.2 6.2 6.2 6.2 6.2 6.2
B.2.3 MF with the α-β-divergence, MWF as prefilter and
noise codebook
Figure B.7: Waving-noise reduction by the α-β-divergence with MWF prefilter and noise codebook. [Plot of SDR and SIR over codebook size K; not reproduced.]
Figure B.8: Waving-noise reduction by Euclidean predictive MF. [Plot of SDR and SIR over codebook size K; not reproduced.]
Table B.5: Waving-noise reduction by the α-β-divergence and Euclidean predictive with MWF prefilter and noise codebook
α;β 1; 1 1; 0 1; -1 0.5; 1 2; 1 1; -2 1 1; 2 EUC pred.
Selected after best SIR
K 3 1 129 4 8 1 0 129
SIR [dB] 16.1 17.6 20.0 16.0 17.0 23.7 15.6 18.7
SDR [dB] 1.5 1.6 1.6 1.6 0.9 0.0 6.2 3.3
Selected after best SDR
K 0 0 0 0 0 0 0 0
SIR [dB] 15.6 15.6 15.6 15.6 15.6 15.6 15.6 15.7
SDR [dB] 6.2 6.2 6.2 6.2 6.2 6.2 6.2 6.2
B.2.4 Multichannel matrix factorization with the α-β-divergence
Figure B.9: Waving-noise reduction by the α-β-divergence for multichannel data with speech and noise codebook. [Plot of SDR and SIR over codebook size K; not reproduced.]
¹ This parameter set is excluded while searching for the best parameters due to its strange behavior over K.
Table B.6: Waving-noise reduction by the α-β-divergence for multichannel data with speech and noise codebook
α;β 1.0; 1.0 1.0; 0.0 1.0; -1.0 0.5; 1.0 2.0; 1.0 1.0; -2.0 1.0; 2.0
Selected after best SIR
K 9 6 4 4 65 3 65
SIR [dB] 26.1 26.3 27.0 27.3 26.1 -3.4 24.9
SDR [dB] 0.4 0.2 0.6 0.3 0.6 -3.2 0.7
Selected after best SDR
K 16 2 2 2 18 40 26
SIR [dB] 25.2 25.1 25.2 26.4 24.1 -12.1 21.1
SDR [dB] 0.5 0.5 0.6 0.5 1.1 -1.0 1.0
Appendix C
Overview on related algorithms
C.0.5 Minimum Mean Square Error (MMSE) estimation
As can for example be seen in [OV10], pages 140-160, the mean square error (MSE) can be formulated as follows:
E\{|X - \hat{x}(y)|^2 \mid Y = y\} = \int |x - \hat{x}(y)|^2 f_{X|Y}(x|y)\, dx   (C.1)
with the conditional or a posteriori density f_{X|Y}(x|y).
The estimate is obtained by taking the derivative with respect to \hat{x}:
\frac{\partial}{\partial \hat{x}} \int |x - \hat{x}(y)|^2 f_{X|Y}(x|y)\, dx = -2 \int (x - \hat{x}(y)) f_{X|Y}(x|y)\, dx \overset{!}{=} 0   (C.2)
and therefore
\int \hat{x}(y) f_{X|Y}(x|y)\, dx = \hat{x}(y) = \int x\, f_{X|Y}(x|y)\, dx   (C.3)
Finally, by rewriting equation C.3, the estimate is given by the conditional expectation
\hat{x}(y) = E\{X|Y = y\}   (C.4)
In a similar manner, the MMSE estimate for L observations Y_1 = y_1, \dots, Y_L = y_L can be shown to be
\hat{x}(y_1, \dots, y_L) = E\{X|Y_1 = y_1, \dots, Y_L = y_L\}   (C.5)
By grouping the measured random variables in the vector Y and the measurements in y, the MMSE estimate writes as
\hat{x}(\mathbf{y}) = E\{X|\mathbf{Y} = \mathbf{y}\}   (C.6)
For the calculation, the conditional density f_{X|Y}(x|\mathbf{Y} = \mathbf{y}) has to be known.
Similar to calculating an estimator for the known observation y, an estimator can be formulated for the random variables Y:
\hat{X} = \hat{x}(\mathbf{Y}) = E\{X|\mathbf{Y}\}   (C.7)
The estimator \hat{x} hence has to minimize the mean square error for each y out of Y, or written as an integral,
\int E\{|X - \hat{x}(\mathbf{y})|^2 \mid \mathbf{Y} = \mathbf{y}\}\, f_Y(\mathbf{y})\, d\mathbf{y}   (C.8)
This is however given, as \hat{x}(\mathbf{y}) minimizes the inner expectation and f_Y(\mathbf{y}) is nonnegative.
Linear Minimum Mean Square Error (LMMSE) estimation for jointly Gaussian observations
As shown in the previous chapter, the MMSE estimate for the observation vector Y is
\hat{X}(\mathbf{Y}) = E\{X|\mathbf{Y}\}   (C.9)
If the observations Y and X are jointly Gaussian and of zero mean, it can be shown that the optimum MMSE estimator is linear. Assuming that
\hat{X}(\mathbf{Y}) = E\{X|\mathbf{Y}\} = \mathbf{h}_{LMMSE}^H \mathbf{Y}   (C.10)
and introducing the error of the estimate E = X - \hat{X} = X - \mathbf{h}_{LMMSE}^H \mathbf{Y}, equation C.9 rewrites as
\hat{X}(\mathbf{Y}) = E\{E + \hat{X}|\mathbf{Y}\} = E\{E|\mathbf{Y}\} + E\{\hat{X}|\mathbf{Y}\}   (C.11)
Furthermore, E and Y are
• orthogonal (see [OV10]),
• of zero mean, as E is a linear combination of the zero-mean variables X and Y,
• hence uncorrelated,
• jointly Gaussian, as E is a linear combination of the jointly Gaussian Y,
• statistically independent, as they are uncorrelated, of zero mean and jointly Gaussian.
Due to the statistical independence of E and Y,
E\{E|\mathbf{Y}\} = 0   (C.12)
and finally
\hat{X}(\mathbf{Y}) = E\{\hat{X}|\mathbf{Y}\}   (C.13)
which is obviously fulfilled for the linear estimator. A linear MMSE filter is thus an optimum filter regarding the MMSE for jointly Gaussian observations.
The optimum LMMSE coefficients can be calculated by
\mathbf{h}_{LMMSE} = \operatorname{argmin}_{\mathbf{h}} E\{|X - \mathbf{h}^H \mathbf{Y}|^2\}   (C.14)
with
\frac{\partial}{\partial \mathbf{h}^H} E\{(X - \mathbf{h}^H \mathbf{Y})(X^* - \mathbf{Y}^H \mathbf{h})\} = -E\{\mathbf{Y}(X^* - \mathbf{Y}^H \mathbf{h})\} \overset{!}{=} 0   (C.15)
In the end, the LMMSE coefficients are given by
\mathbf{h} = E\{\mathbf{Y}\mathbf{Y}^H\}^{-1} E\{\mathbf{Y} X^*\} = \mathbf{\Sigma}_{YY}^{-1} \mathbf{\Sigma}_{YX}   (C.16)
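As a small numeric illustration of equation C.16 (an assumed toy example, not taken from the text), the LMMSE coefficients can be estimated from sample moments:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10000)                        # hidden zero-mean signal
Y = np.vstack([x + 0.5 * rng.standard_normal(10000)   # two noisy observations of x
               for _ in range(2)])
Sigma_YY = Y @ Y.conj().T / Y.shape[1]                 # sample E{Y Y^H}
Sigma_YX = Y @ x.conj() / Y.shape[1]                   # sample E{Y X*}
h = np.linalg.solve(Sigma_YY, Sigma_YX)                # eq. (C.16)
x_hat = h.conj() @ Y                                   # LMMSE estimate of x
```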
C.0.6 Wiener filtering
For Gaussian random processes, the LMMSE estimator leads to the well-known Wiener filter. If the observations contain all previous samples, y(t) = [y(-\infty), \dots, y(t-1), y(t)]^T, i.e. the signal is semi-infinite, the filter is referred to as the causal Wiener filter ([Li06]) and the MMSE estimate is calculated as
\hat{x}(t) = \mathbf{h}^H \mathbf{y}(t) \quad \text{with} \quad \mathbf{h} = \mathbf{\Sigma}_{YY}^{-1} \mathbf{\Sigma}_{YX}   (C.17)
If the observations are given by a signal that is infinite on both sides, y(t) = [y(-\infty), \dots, y(t), \dots, y(\infty)]^T, the LMMSE estimator is referred to as the non-causal Wiener filter ([Li06]). The complete signal can be estimated by
\hat{x}(t) = \mathbf{h}^H \mathbf{y}(t) \quad \text{with} \quad \mathbf{h} = \mathbf{\Sigma}_{YY}^{-1} \mathbf{\Sigma}_{YX}   (C.18)
It can also be expressed in the Fourier domain as an element-wise multiplication
\hat{x}(f) = H(f)\, y(f) \quad \text{with} \quad H(f) = \frac{S_{YX}(f)}{S_{YY}(f)}   (C.19)
where S_{YX}(f) and S_{YY}(f) denote the cross- and power spectral density. Note that each frequency is handled separately; the frequencies therefore have to be statistically independent. Furthermore, the estimator uses the noisy phase.
The original Wiener theory cannot be used in practice, as its filters and the moments the filters are composed of are of infinite length. For the causal Wiener filter it is sufficient to assume that the estimate \hat{x}(t) only depends on the last N observations.
The frequency domain Wiener filter is usually approximated by using a DFT of length N. At this point it may be noted that the frequencies of the DFT of length N may no longer be statistically independent (e.g. they may be correlated). The approximation via the DFT of length N may therefore no longer be an optimum MMSE estimator given the N observations. This was also pointed out by [LA04].
Furthermore, to apply the filter it is implied that the second order moments are known for each time instance. Strictly speaking, the optimum filter for the observations y(t) is h(t). The way to estimate the moments strongly depends on the given scenario and is therefore not discussed at this point.
C.0.7 Nonnegative Graph Embedding
In [YYF+08] the authors propose a general nonnegative data decomposition formulation that combines NMF with the graph embedding framework (see [YXZ+07], later rewritten in [YXZZ05]). Let
G = \{\mathbf{X}, \mathbf{S}\}   (C.20)
denote the graph G with the vertex set X and the similarity measure S. Furthermore, the diagonal matrix S_D is defined as
S_{D,ii} = \sum_{j \neq i} S_{ij} \quad \forall i   (C.21)
and the Laplacian L as
\mathbf{L} = \mathbf{S}_D - \mathbf{S}   (C.22)
Also let the penalty graph be defined as G^P = \{\mathbf{X}^P, \mathbf{S}^P\}. Let \mathbf{A}^1 denote the desired low-dimensional representation of the training data; the graph preserving problem can then be formulated as follows:
\max_{\mathbf{A}^1} \sum_{r_1 \neq r_2} \|\mathbf{A}^1_{r_1} - \mathbf{A}^1_{r_2}\|^2 S^P_{r_1,r_2} = \operatorname{tr}(\mathbf{A}^1 \mathbf{L}^P \mathbf{A}^{1T})   (C.23)
\min_{\mathbf{A}^1} \sum_{r_1 \neq r_2} \|\mathbf{A}^1_{r_1} - \mathbf{A}^1_{r_2}\|^2 S_{r_1,r_2} = \operatorname{tr}(\mathbf{A}^1 \mathbf{L} \mathbf{A}^{1T})   (C.25)
To combine the graph embedding with the NMF cost function, the authors divide the activation matrix and the dictionary into
\mathbf{A} = \begin{bmatrix} \mathbf{A}^1 \\ \mathbf{A}^2 \end{bmatrix}, \quad \mathbf{D} = [\mathbf{D}^1\; \mathbf{D}^2]   (C.26)
\mathbf{A}^2, \mathbf{D}^2 represent the complementary space of \mathbf{A}^1, \mathbf{D}^1, such that their addition results in the original data. The objective function in equation C.23 is equivalent to
\min_{\mathbf{A}^2} \sum_{r_1 \neq r_2} \|\mathbf{A}^2_{r_1} - \mathbf{A}^2_{r_2}\|^2 S^P_{r_1,r_2} = \operatorname{tr}(\mathbf{A}^2 \mathbf{L}^P \mathbf{A}^{2T})   (C.28)
The (Euclidean) non-negative graph embedding problem is then formulated as
\min_{\mathbf{A},\mathbf{D}} \operatorname{tr}(\mathbf{A}^2 \mathbf{L}^P \mathbf{A}^{2T}) + \operatorname{tr}(\mathbf{A}^1 \mathbf{L} \mathbf{A}^{1T}) + \lambda \|\mathbf{X} - \mathbf{A}\mathbf{D}\|_F^2   (C.29)
\text{subject to: } \mathbf{A}, \mathbf{D} \geq 0   (C.30)
Furthermore, the authors prove that there exists an iterative optimization procedure.
C.1 Terminology of non-blind source separation algorithms
In [VJA+10] and [LGS+12] the terms informed source separation and blind source separation are used. Later, [LDDR13] further divided informed audio source separation into model-based and signal-based, depending on the available side information. Signal-based side information is then further divided into score-informed and exemplar-based source separation. Score-informed data would for example be information over time about whether a source is on or off. Exemplar-based source separation instead is based on additional actual signals, such as cover tracks for music.
In [VBGB14] the authors suggest that the term informed source separation is used rather inconsistently. They therefore use strongly guided and weakly guided, depending on the type of side information.
In papers more related to machine learning, the terms supervised and semi-supervised are often used [YYF+08] [CHH07] [Zhu05]. Semi-supervised learning techniques are especially promising, as they can be initially trained with the labeled data and can increase their performance with the unlabeled data later on. In chapter 6 an NMF training technique is referred to that exploits labeled and unlabeled data.
In our scenario all training data is labeled; in other words, there exists training data for the speech and training data for the ego-noise of the robot.
Appendix D
Derivations
D.1 Parameter estimation by matrix factorization
In chapter 2.5.1, NMF with the Itakura-Saito divergence was introduced as the ML
estimate for the sum of Gaussian components. However, the presented model is not the
only statistical model where the estimation of parameters is equivalent to solving a
matrix factorization problem. Furthermore, it has to be noted that some source models
developed for audio source separation in the literature are not equivalent to an MF
problem. In [Att03], for example, Gaussian mixture models (GMMs) are used.
D.1.1 Euclidean distance as ML estimation of additive Gaussian noise
In [Vir07] and [FC09] Euclidean distance is named as the ML estimator in the presence
of additive Gaussian noise. The difference to chapter 2.5.1 is that the mean of the
processes is modeled by the NMF parameters, whereas the variance is identical for all
components.
Y_{fr} = \sum_{k=1}^{K} C_{k,fr} \quad \text{with} \quad C_{k,fr} \sim \mathcal{N}(D_{fk} A_{kr}, \sigma^2)        (D.1)
The negative log-likelihood thus writes as

C_{ML}(D, A) = -\sum_{fr} \log p(Y_{fr} \,|\, D_f, A_r)
             = \sum_{fr} \log(\pi) + \log(\sigma^2/K) + \frac{(Y_{fr} - D_f^T A_r)^2}{\sigma^2}
             \overset{c}{=} \frac{1}{\sigma^2} \sum_{fr} (Y_{fr} - D_f^T A_r)^2 = \frac{1}{\sigma^2} D_{EUC}(Y, DA)        (D.2)
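As a minimal sketch of the corresponding estimator, the standard multiplicative update rules for the Euclidean cost [LS00] can be written in a few lines of NumPy; this is an illustration only, not the algorithm evaluated in this thesis.

    import numpy as np

    def nmf_euclidean(Y, K, n_iter=200, eps=1e-12):
        # Minimize ||Y - D A||_F^2 with the multiplicative updates of Lee & Seung.
        F, R = Y.shape
        rng = np.random.default_rng(0)
        D = rng.random((F, K)) + eps
        A = rng.random((K, R)) + eps
        for _ in range(n_iter):
            A *= (D.T @ Y) / (D.T @ D @ A + eps)
            D *= (Y @ A.T) / (D @ A @ A.T + eps)
        return D, A

    # toy usage on a random nonnegative "spectrogram"
    Y = np.abs(np.random.randn(64, 100))
    D, A = nmf_euclidean(Y, K=5)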
D.1.2 ML estimation of the sum of Poisson distributed magnitude components
[FBD09] pointed out that the generative model of equation 2.13 can be modified into

|Y_{fr}| = \sum_{k=1}^{K} |C_{k,fr}| \quad \text{with} \quad |C_{k,fr}| \sim \mathcal{P}(D_{fk} A_{kr})        (D.3)

where the magnitude of each component is assumed to be Poisson distributed with
\mathcal{P}(x|\lambda) = e^{-\lambda} \frac{\lambda^x}{x!}.
As shown for example in [GW], the sum of statistically independent Poisson random
variables is also Poisson distributed and its rate parameter is the sum of the individual
rate parameters, such that

|Y_{fr}| \sim \mathcal{P}\left( \sum_{k=1}^{K} D_{fk} A_{kr} \right)        (D.4)
Again assuming the components to be mutually independent and individually independent
across frequencies and frames, the negative log-likelihood is given by

C_{ML}(D, A) = -\sum_{fr} \log p(|Y_{fr}| \,|\, D_f, A_r)
             = \sum_{fr} D_f^T A_r - |Y_{fr}| \log(D_f^T A_r) + \log(|Y_{fr}|!)
             = \sum_{fr} D_f^T A_r + |Y_{fr}| \log\frac{|Y_{fr}|}{D_f^T A_r} - |Y_{fr}| \log|Y_{fr}| + \log(|Y_{fr}|!)
             \overset{c}{=} \sum_{fr} D_f^T A_r + |Y_{fr}| \log\frac{|Y_{fr}|}{D_f^T A_r} - |Y_{fr}| = C_{KL}(|Y|, DA)        (D.5)
This shows that ML estimation of the sum of Poisson distributed magnitude components
is equivalent, up to a constant, to matrix factorization of the magnitude spectrogram
with the Kullback-Leibler divergence.
Note that the factorial operator is only defined for non-negative integer values. [FBD09]
gives an interpretation for sampled audio spectrograms by scaling them to integer values.
For solving the optimization problem, the factorial operator vanishes anyway, as
the term |Y_{fr}|! is independent of the parameters to be optimized.
One has to be aware that the model described above no longer assumes Gaussian
sources. In fact, the sources themselves are not modeled directly. Their magnitude is modeled,
but no assumption is made on their phase. As the amplitudes of the components
in equation D.3 add up, this can be interpreted as if all components had the same
phase. Moreover, if only the components' amplitudes are estimated, the phase remains
undetermined. One obvious estimate is used for example in [Vir07] and [Sma07], where
simply the phase of the mixture is used for all components, \arg(c_{k,fr}) = \arg(Y_{fr}) \; \forall k \in
[1; K]. [FBD09] points out that this approach is not conservative: the sum
of the estimated sources does not necessarily result in the observed signal.
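The phase assignment just described can be written compactly. The following NumPy sketch reconstructs complex component spectrograms from estimated magnitudes D_{fk}A_{kr} by borrowing the mixture phase; the variable names are illustrative and this is not code from [Vir07] or [Sma07].

    import numpy as np

    def components_with_mixture_phase(Y, D, A):
        # Assign the mixture phase arg(Y_fr) to every estimated magnitude component.
        # Y: complex mixture spectrogram (F x R), D: dictionary (F x K), A: activations (K x R).
        phase = np.exp(1j * np.angle(Y))
        return np.einsum('fk,kr->kfr', D, A) * phase[None, :, :]

    # toy usage; note that summing the components gives (D @ A) * exp(j*arg(Y)),
    # which in general differs from Y itself -- the estimate is not conservative.
    Y = np.random.randn(64, 100) + 1j * np.random.randn(64, 100)
    D = np.abs(np.random.randn(64, 3)); A = np.abs(np.random.randn(3, 100))
    C = components_with_mixture_phase(Y, D, A)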
D.1.3 Itakura-Saito divergence as ML estimation of gamma multiplicative noise
[FBD09] showed the connection between ML estimation in gamma multiplicative noise
and matrix factorization with the Itakura-Saito distance.
Consider a generative model where the observations Y are a rank-K matrix distorted by
multiplicative noise

Y = (DA) \,.\, E        (D.6)

and where the multiplicative noise E is independent and identically distributed (i.i.d.)
according to

e_{fr} = \frac{Y_{fr}}{D_f^T A_r} \sim \mathcal{G}(\alpha, \beta)        (D.7)

with \mathcal{G}(x|\alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}, x \geq 0, denoting the gamma distribution and \Gamma the
gamma function. Again using independence across frequency and time, the negative
log-likelihood for the parameters that form the estimate of the rank-K matrix is given
by

C_{ML}(D, A) = -\sum_{fr} \log p(Y_{fr} \,|\, D_f, A_r)
             = -\sum_{fr} \log \mathcal{G}\Big( \frac{Y_{fr}}{D_f^T A_r} \,\Big|\, \alpha, \beta \Big)
             = \sum_{fr} -\alpha \log(\beta) + \log(\Gamma(\alpha)) - (\alpha - 1) \log\Big( \frac{Y_{fr}}{D_f^T A_r} \Big) + \beta \frac{Y_{fr}}{D_f^T A_r}
             \overset{c}{=} \beta \sum_{fr} \frac{Y_{fr}}{D_f^T A_r} - \frac{\alpha - 1}{\beta} \log\Big( \frac{Y_{fr}}{D_f^T A_r} \Big) - 1        (D.8)

If \frac{\alpha - 1}{\beta} = 1, the maximum likelihood criterion is a scaled version of NMF with the Itakura-Saito
divergence:^1

C_{ML}(D, A) = \beta \, C_{IS}(Y, D^T A)        (D.9)

^1 Note that the author writes the equivalence is given for \frac{\alpha}{\beta} = 1.
Note that in this model the observations Y are not specified further (in contrast to the previous
chapters). The model could therefore be applied to the magnitude, the squared magnitude, or any
other form of the observations. As the relation of Y to the actual observed signal is
not specified, there is also no method defined to calculate estimates of
the individual sources.
Nevertheless, this interpretation of matrix factorization with the Itakura-Saito distance
is interesting, as the Itakura-Saito divergence is scale invariant.
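The scale invariance can be checked numerically with a few lines (a small sketch using the usual definition d_IS(x||y) = x/y - log(x/y) - 1):

    import numpy as np

    def d_is(x, y):
        # element-wise Itakura-Saito divergence
        return x / y - np.log(x / y) - 1.0

    x = np.abs(np.random.randn(1000)) + 1e-3
    y = np.abs(np.random.randn(1000)) + 1e-3
    lam = 7.3   # arbitrary positive scale
    print(np.allclose(d_is(lam * x, lam * y), d_is(x, y)))   # True: d_IS is scale invariant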
D.2 Deriving the α-divergence as a Csiszar divergence
Recall the Csiszar φ-divergence of equation 3.4:

D_C(X || \hat{X}) = \sum_{r,f} d_{\phi}(X_{r,f}, \hat{X}_{r,f}) \quad \text{with} \quad d_{\phi}(x||y) = x \, \phi\Big( \frac{y}{x} \Big)
For \phi(z) = \frac{z(z^{\nu-1} - 1)}{\nu(\nu - 1)} + \frac{1 - z}{\nu} = \frac{z^{\nu} - \nu z + \nu - 1}{\nu(\nu - 1)} the resulting divergence is

d_{\nu}(x||y) = x \, \frac{\left( \frac{y}{x} \right)^{\nu} - \nu \frac{y}{x} + \nu - 1}{\nu(\nu - 1)} = \frac{y^{\nu} x^{1-\nu} - \nu y + (\nu - 1) x}{\nu(\nu - 1)}        (D.10)
In their later works ([CZPA09] and [CCA11]) the authors use a slightly different notation,
which can be derived by substituting \nu = 1 - \alpha:

d_{\alpha}(x||y) = \frac{y^{1-\alpha} x^{\alpha} - (1 - \alpha) y - \alpha x}{(1 - \alpha)(-\alpha)} = \frac{y^{1-\alpha} x^{\alpha} + (\alpha - 1) y - \alpha x}{(\alpha - 1)\alpha}        (D.11)

Note that equations D.10 and D.11 are both denoted generalized α-divergence; we
will use the second definition in this thesis to ease comparison with the generalized
α-β-divergence in chapter 3.1.3.
The limits for α → 0 and α → 1 are usually reformulated. For α → 0:

\lim_{\alpha \to 0} d_{\alpha}(x||y) = \lim_{\alpha \to 0} \frac{y^{1-\alpha} x^{\alpha} - y}{-\alpha} + \lim_{\alpha \to 0} \frac{\alpha y - \alpha x}{-\alpha}
 = x - y + y \lim_{\alpha \to 0} \frac{y^{-\alpha} x^{\alpha} - 1}{-\alpha} = x - y + y \lim_{\alpha \to 0} \frac{e^{\alpha(\ln(x) - \ln(y))} - 1}{-\alpha}        (D.12)

Using l'Hospital's rule the limit rewrites as

\lim_{\alpha \to 0} d_{\alpha}(x||y) = x - y + y \lim_{\alpha \to 0} \frac{(\ln(x) - \ln(y)) \, e^{\alpha(\ln(x) - \ln(y))}}{-1}
 = x - y + y(-\ln(x) + \ln(y)) = y \ln\Big( \frac{y}{x} \Big) + x - y        (D.13)
Similarly, for α → 1:

\lim_{\alpha \to 1} d_{\alpha}(x||y) = \lim_{\alpha \to 1} \frac{(\alpha - 1)(y - x)}{\alpha - 1} + x \lim_{\alpha \to 1} \frac{y^{1-\alpha} x^{\alpha-1} - 1}{\alpha - 1}
 = y - x + x \lim_{\alpha \to 1} \frac{\left( \frac{x}{y} \right)^{\alpha-1} - 1}{\alpha - 1}        (D.14)

Again using l'Hospital's rule the term simplifies to

\lim_{\alpha \to 1} d_{\alpha}(x||y) = y - x + x \lim_{\alpha \to 1} \frac{\ln\big( \frac{x}{y} \big) \, e^{(\alpha-1)\ln\left(\frac{x}{y}\right)}}{1} = y - x + x \ln\Big( \frac{x}{y} \Big)        (D.15)
Using equations D.13 and D.15 the α-divergence can be written, as defined in equation 3.5, as

d_{\alpha}(x||y) =
\begin{cases}
\frac{1}{\alpha(\alpha-1)} \left( x^{\alpha} y^{1-\alpha} - \alpha x + (\alpha - 1) y \right) & \alpha \in \mathbb{R} \setminus \{0, 1\} \\
x \ln\frac{x}{y} + y - x & \alpha = 1 \\
y \ln\frac{y}{x} + x - y & \alpha = 0
\end{cases}
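As a sanity check on these limits, the piecewise definition can be implemented and compared against the generic formula for α close to 0 and 1; this is a small numerical sketch, not part of the thesis framework.

    import numpy as np

    def d_alpha(x, y, alpha):
        # generalized alpha-divergence with its limit cases at alpha = 0 and alpha = 1
        if alpha == 0:
            return y * np.log(y / x) + x - y
        if alpha == 1:
            return x * np.log(x / y) + y - x
        return (x**alpha * y**(1 - alpha) - alpha * x + (alpha - 1) * y) / (alpha * (alpha - 1))

    x, y = 2.0, 3.0
    print(d_alpha(x, y, 1e-6), d_alpha(x, y, 0))      # generic formula approaches the alpha -> 0 limit
    print(d_alpha(x, y, 1 - 1e-6), d_alpha(x, y, 1))  # ... and the alpha -> 1 limit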
D.3 Deriving the β-divergence as a Bregman divergence
[HDB11] shows that the β-divergence can be derived from the Bregman divergences by
using \varphi''_{\beta}(x) = x^{\beta-1}, \varphi'_{\beta}(x) = \frac{x^{\beta}}{\beta} and \varphi_{\beta}(x) = \frac{x^{\beta+1}}{(\beta+1)\beta}.

d_{\varphi_{\beta}}(x||y) = \varphi_{\beta}(x) - \varphi_{\beta}(y) - \varphi'_{\beta}(y)(x - y)
 = \frac{x^{\beta+1}}{(\beta+1)\beta} - \frac{y^{\beta+1}}{(\beta+1)\beta} - \frac{y^{\beta}}{\beta}(x - y) = \frac{x^{\beta+1} + \beta y^{\beta+1} - (\beta+1) x y^{\beta}}{\beta(\beta+1)}        (D.16)
Again two limits have to be determined. For β → −1 the limit of equation D.16 is

\lim_{\beta \to -1} d_{\varphi_{\beta}}(x||y) = \lim_{\beta \to -1} \frac{(\beta+1)(-x y^{\beta} + y^{\beta+1})}{-(\beta+1)} + \lim_{\beta \to -1} \frac{x^{\beta+1} - y^{\beta+1}}{-(\beta+1)}
 = \frac{x}{y} - 1 - \lim_{\beta \to -1} \frac{e^{(\beta+1)\ln(x)} - e^{(\beta+1)\ln(y)}}{\beta + 1}        (D.17)

Using l'Hospital's rule the limit rewrites as

\lim_{\beta \to -1} d_{\varphi_{\beta}}(x||y) = \frac{x}{y} - 1 - \lim_{\beta \to -1} \frac{\ln(x) e^{(\beta+1)\ln(x)} - \ln(y) e^{(\beta+1)\ln(y)}}{1}
 = \frac{x}{y} - 1 - \ln(x) + \ln(y) = -\ln\Big( \frac{x}{y} \Big) + \frac{x}{y} - 1        (D.18)
For β → 0 the limit of equation D.16 is

\lim_{\beta \to 0} d_{\varphi_{\beta}}(x||y) = \lim_{\beta \to 0} \frac{\beta(y^{\beta+1} - x y^{\beta})}{\beta} + \lim_{\beta \to 0} \frac{x^{\beta+1} - x y^{\beta}}{\beta}
 = y - x + x \lim_{\beta \to 0} \frac{e^{\beta \ln(x)} - e^{\beta \ln(y)}}{\beta}        (D.19)

Again using l'Hospital's rule the term rewrites as

\lim_{\beta \to 0} d_{\varphi_{\beta}}(x||y) = y - x + x \lim_{\beta \to 0} \frac{\ln(x) e^{\beta \ln(x)} - \ln(y) e^{\beta \ln(y)}}{1} = y - x + x \ln\Big( \frac{x}{y} \Big)        (D.20)
Finally the β-divergence can be written, as already defined in equation 3.7, as

d_{\beta}(x||y) =
\begin{cases}
\frac{1}{\beta(\beta+1)} \left( x^{\beta+1} + \beta y^{\beta+1} - (\beta+1) x y^{\beta} \right) & \beta \in \mathbb{R} \setminus \{-1, 0\} \\
x \ln\frac{x}{y} + y - x & \beta = 0 \\
\frac{x}{y} - \ln\frac{x}{y} - 1 & \beta = -1
\end{cases}
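The same kind of numerical check applies here; the sketch below implements the piecewise β-divergence and compares the generic expression against its KL (β = 0) and Itakura-Saito (β = −1) limits.

    import numpy as np

    def d_beta(x, y, beta):
        # beta-divergence with its limit cases at beta = 0 (KL) and beta = -1 (IS)
        if beta == 0:
            return x * np.log(x / y) + y - x
        if beta == -1:
            return x / y - np.log(x / y) - 1.0
        return (x**(beta + 1) + beta * y**(beta + 1) - (beta + 1) * x * y**beta) / (beta * (beta + 1))

    x, y = 2.0, 3.0
    print(d_beta(x, y, 1e-6), d_beta(x, y, 0))        # approaches the KL limit
    print(d_beta(x, y, -1 + 1e-6), d_beta(x, y, -1))  # approaches the IS limit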
D.4 Derivation of MU rules for the α-β divergence (single channel)
Calculating the gradient of the generalized α-β-divergence with respect to some parameter θ can
be formulated as follows:

\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \theta} = \sum_{fr} \frac{\partial}{\partial \theta} d_{\alpha,\beta}(X_{fr} || \hat{X}_{fr}) = \sum_{fr} \frac{\partial d_{\alpha,\beta}(X_{fr} || \hat{X}_{fr})}{\partial \hat{X}_{fr}} \, \frac{\partial \hat{X}_{fr}}{\partial \theta}
 = -\sum_{fr} \hat{X}_{fr}^{\alpha+\beta-1} \, \frac{\left( \frac{X_{fr}}{\hat{X}_{fr}} \right)^{\alpha} - 1}{\alpha} \, \frac{\partial \hat{X}_{fr}}{\partial \theta}        (D.21)
[CCA11] now introduces the (1−α) deformed logarithm

\ln_{1-\alpha}(z) = \begin{cases} \frac{z^{\alpha} - 1}{\alpha} & \text{for } \alpha \neq 0 \\ \ln(z) & \text{for } \alpha = 0 \end{cases}        (D.22)

such that

\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \theta} = -\sum_{fr} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \theta}        (D.23)
For matrix factorization the parameters θ = \{A, D\} form the estimate \hat{X} = DA such
that

\hat{X}_{fr} = D_f^T A_r = \sum_k D_{fk} A_{kr}        (D.24)
The gradients for the parameters are therefore given by

\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial A_{kr}} = -\sum_{f} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) D_{fk}        (D.25)

\frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial D_{fk}} = -\sum_{r} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) A_{kr}        (D.26)
Instead of applying a gradient descent algorithm directly to the parameters, it is
applied in the natural parameter space \phi(z), such that

A_{kr} \leftarrow \phi^{-1}\Big( \phi(A_{kr}) - \eta_{kr} \frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \phi(A_{kr})} \Big)        (D.27)

D_{fk} \leftarrow \phi^{-1}\Big( \phi(D_{fk}) - \eta_{fk} \frac{\partial C_{\alpha,\beta}(X, \hat{X})}{\partial \phi(D_{fk})} \Big)        (D.28)
If the (1−α) deformed logarithm is used as the transformation, \phi(z) = \ln_{1-\alpha}(z), the
inverse transform is the (1−α) deformed exponential:

\phi^{-1}(z) = \exp_{1-\alpha}(z) = \begin{cases} e^{z} & \text{for } \alpha = 0 \\ (1 + \alpha z)^{\frac{1}{\alpha}} & \text{for } 1 + \alpha z \geq 0 \\ 0 & \text{for } 1 + \alpha z < 0 \end{cases}        (D.29)
A_{kr} \leftarrow \exp_{1-\alpha}\Big( \ln_{1-\alpha}(A_{kr}) + \eta_{kr} \sum_{f} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(A_{kr})} \Big)        (D.30)

D_{fk} \leftarrow \exp_{1-\alpha}\Big( \ln_{1-\alpha}(D_{fk}) + \eta_{fk} \sum_{r} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(D_{fk})} \Big)        (D.31)
by using

\exp_{1-\alpha}(\ln_{1-\alpha}(z) + v) = \Big( 1 + \alpha \Big( \frac{z^{\alpha} - 1}{\alpha} + v \Big) \Big)^{\frac{1}{\alpha}} = (1 + z^{\alpha} - 1 + \alpha v)^{\frac{1}{\alpha}} = z \Big( 1 + \frac{\alpha v}{z^{\alpha}} \Big)^{\frac{1}{\alpha}} = z \exp_{1-\alpha}\Big( \frac{v}{z^{\alpha}} \Big)        (D.32)
equation D.30 can be reformulated to

A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\Big( \frac{1}{A_{kr}^{\alpha}} \eta_{kr} \sum_{f} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(A_{kr})} \Big)        (D.33)

D_{fk} \leftarrow D_{fk} \exp_{1-\alpha}\Big( \frac{1}{D_{fk}^{\alpha}} \eta_{fk} \sum_{r} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(D_{fk})} \Big)        (D.34)
Inserting

\frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(A_{kr})} = \frac{\partial}{\partial \ln_{1-\alpha}(A_{kr})} \exp_{1-\alpha}(\ln_{1-\alpha}(A_{kr})) \, D_{fk} = A_{kr}^{1-\alpha} D_{fk}        (D.35)

\frac{\partial \hat{X}_{fr}}{\partial \ln_{1-\alpha}(D_{fk})} = A_{kr} D_{fk}^{1-\alpha}        (D.36)
this simplifies to

A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\Big( \eta_{kr} \sum_{f} A_{kr}^{1-2\alpha} D_{fk} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \Big)        (D.37)

D_{fk} \leftarrow D_{fk} \exp_{1-\alpha}\Big( \eta_{fk} \sum_{r} A_{kr} D_{fk}^{1-2\alpha} \hat{X}_{fr}^{\alpha+\beta-1} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \Big)        (D.38)
By selecting suitable step sizes

\eta_{kr} = \frac{A_{kr}^{2\alpha-1}}{\sum_{f} D_{fk} \hat{X}_{fr}^{\alpha+\beta-1}}        (D.39)

\eta_{fk} = \frac{D_{fk}^{2\alpha-1}}{\sum_{r} A_{kr} \hat{X}_{fr}^{\alpha+\beta-1}}        (D.40)
the MU rules rewrite as

A_{kr} \leftarrow A_{kr} \exp_{1-\alpha}\Big( \sum_{f} \frac{D_{fk} \hat{X}_{fr}^{\alpha+\beta-1}}{\sum_{f'} D_{f'k} \hat{X}_{f'r}^{\alpha+\beta-1}} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \Big)        (D.41)

D_{fk} \leftarrow D_{fk} \exp_{1-\alpha}\Big( \sum_{r} \frac{A_{kr} \hat{X}_{fr}^{\alpha+\beta-1}}{\sum_{r'} A_{kr'} \hat{X}_{fr'}^{\alpha+\beta-1}} \ln_{1-\alpha}\Big( \frac{X_{fr}}{\hat{X}_{fr}} \Big) \Big)        (D.42)
For α ≠ 0 the generalized MU rules simplify to

A_{kr} \leftarrow A_{kr} \Big( 1 + \alpha \sum_{f} \frac{D_{fk} \hat{X}_{fr}^{\alpha+\beta-1}}{\sum_{f'} D_{f'k} \hat{X}_{f'r}^{\alpha+\beta-1}} \, \frac{\left( \frac{X_{fr}}{\hat{X}_{fr}} \right)^{\alpha} - 1}{\alpha} \Big)^{\frac{1}{\alpha}} = A_{kr} \Big( \frac{\sum_{f} D_{fk} \hat{X}_{fr}^{\beta-1} X_{fr}^{\alpha}}{\sum_{f} D_{fk} \hat{X}_{fr}^{\alpha+\beta-1}} \Big)^{\frac{1}{\alpha}}        (D.43)

D_{fk} \leftarrow D_{fk} \Big( \frac{\sum_{r} A_{kr} \hat{X}_{fr}^{\beta-1} X_{fr}^{\alpha}}{\sum_{r} A_{kr} \hat{X}_{fr}^{\alpha+\beta-1}} \Big)^{\frac{1}{\alpha}}        (D.44)
Rewriting this with matrix multiplications leads to

A_{kr} \leftarrow A_{kr} \Big( \frac{D_k^T (\hat{X}_r^{.[\beta-1]} . X_r^{.[\alpha]})}{D_k^T \hat{X}_r^{.[\alpha+\beta-1]}} \Big)^{\frac{1}{\alpha}}        (D.45)

D_{fk} \leftarrow D_{fk} \Big( \frac{(\hat{X}_f^{.[\beta-1]} . X_f^{.[\alpha]})^T A_k}{(\hat{X}_f^{.[\alpha+\beta-1]})^T A_k} \Big)^{\frac{1}{\alpha}}        (D.46)

and finally to the form of equation 3.18

A \leftarrow A \,.\, \Big( \frac{D^T (\hat{X}^{.[\beta-1]} . X^{.[\alpha]})}{D^T \hat{X}^{.[\alpha+\beta-1]}} \Big)^{.[\frac{1}{\alpha}]}

D \leftarrow D \,.\, \Big( \frac{(\hat{X}^{.[\beta-1]} . X^{.[\alpha]}) A^T}{\hat{X}^{.[\alpha+\beta-1]} A^T} \Big)^{.[\frac{1}{\alpha}]}
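For illustration, the matrix-form update rules above translate directly into NumPy; the following is a minimal sketch for α ≠ 0 (with a small constant to avoid divisions by zero), not the implementation used in the experiments of this thesis.

    import numpy as np

    def nmf_alpha_beta(X, K, alpha=0.5, beta=0.5, n_iter=200, eps=1e-12):
        # Multiplicative updates for the generalized alpha-beta divergence (alpha != 0),
        # following the matrix form of the update rules derived above.
        F, R = X.shape
        rng = np.random.default_rng(0)
        D = rng.random((F, K)) + eps
        A = rng.random((K, R)) + eps
        for _ in range(n_iter):
            Xh = D @ A + eps
            A *= ((D.T @ (Xh**(beta - 1) * X**alpha)) / (D.T @ Xh**(alpha + beta - 1) + eps))**(1.0 / alpha)
            Xh = D @ A + eps
            D *= (((Xh**(beta - 1) * X**alpha) @ A.T) / (Xh**(alpha + beta - 1) @ A.T + eps))**(1.0 / alpha)
        return D, A

    # toy usage; alpha = 1, beta = -1 reproduces the familiar Itakura-Saito updates
    X = np.abs(np.random.randn(64, 100))**2 + 1e-6
    D, A = nmf_alpha_beta(X, K=5, alpha=1.0, beta=-1.0)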
D.5 Derivation of a multichannel EM algorithm for the Itakura-Saito divergence
C_{ML}(G, D, A) = \sum_{fr} \Big[ \log(\det(\Sigma_{B,f})) + (Y_{fr} - G_{C,f} C_{fr})^H \Sigma_{B,f}^{-1} (Y_{fr} - G_{C,f} C_{fr})
 + \sum_{k} \log(D_{fk} A_{kr}) + \frac{|C_{k,fr}|^2}{D_{fk} A_{kr}} \Big]

By using G_{C,f} C_{fr} = G_{S,f} S_{fr} and adding the trace to the scalar expression, the equation
above can be reformulated as

C_{ML}(G, D, A) = \sum_{fr} \Big[ \log(\det(\Sigma_{B,f})) + \sum_{k} \log(D_{fk} A_{kr}) + \frac{|C_{k,fr}|^2}{D_{fk} A_{kr}} \Big]
 + R \sum_{f} \mathrm{tr}\Big( \frac{1}{R} \sum_{r} \Sigma_{B,f}^{-1} Y_{fr} Y_{fr}^H - \Sigma_{B,f}^{-1} G_{S,f} S_{fr} Y_{fr}^H
 - \Sigma_{B,f}^{-1} Y_{fr} S_{fr}^H G_{S,f}^H + \Sigma_{B,f}^{-1} G_{S,f} S_{fr} S_{fr}^H G_{S,f}^H \Big)        (D.47)
By defining the spatial covariance matrices

R_{YY,f} = \frac{1}{R} \sum_{r} Y_{fr} Y_{fr}^H        (D.48)

R_{YS,f} = \frac{1}{R} \sum_{r} Y_{fr} S_{fr}^H        (D.49)

R_{SS,f} = \frac{1}{R} \sum_{r} S_{fr} S_{fr}^H        (D.50)
equation D.47 rewrites as

C_{ML}(G, D, A) = \sum_{fr} \Big[ \log(\det(\Sigma_{B,f})) + \sum_{k} \log(D_{fk} A_{kr}) + \frac{|C_{k,fr}|^2}{D_{fk} A_{kr}} \Big]        (D.51)
 + R \sum_{f} \mathrm{tr}\Big( \Sigma_{B,f}^{-1} \big( R_{YY,f} - G_{S,f} R_{YS,f}^H - R_{YS,f} G_{S,f}^H + G_{S,f} R_{SS,f} G_{S,f}^H \big) \Big)        (D.52)
By setting the gradient to zero, D_{fk} can be calculated as

\frac{\partial C_{ML}(G, D, A)}{\partial D_{fk}} = \sum_{r} \frac{1}{D_{fk}} - \frac{|C_{k,fr}|^2}{(D_{fk} A_{kr})^2} A_{kr} \overset{!}{=} 0        (D.53)
From equation D.53, and in a similar manner for the other parameters, the parameter
estimates result, as presented in chapter 3.3.2 in equation 3.27, as

D_{fk} = \frac{1}{R} \sum_{r} \frac{|C_{k,fr}|^2}{A_{kr}} \overset{\Delta}{=} \frac{1}{R} \sum_{r} \frac{U_{k,fr}}{A_{kr}}

A_{kr} = \frac{1}{F} \sum_{f} \frac{U_{k,fr}}{D_{fk}}

G_{S,f} = R_{YS,f} R_{SS,f}^{-1}
and, by considering that \Sigma_{B,f} has to be diagonal and using \frac{\partial \, \mathrm{tr}(V^{-1} U)}{\partial V} = -(V^{-1} U V^{-1})^T and
\frac{\partial \log|\det(V)|}{\partial V} = (V^{-1})^T,

\Sigma_{B,f} = \mathrm{diag}\big( R_{YY,f} - G_{S,f} R_{YS,f}^H - R_{YS,f} G_{S,f}^H + G_{S,f} R_{SS,f} G_{S,f}^H \big)
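A compact NumPy sketch of the spatial part of this M-step for a single frequency bin f is given below; it assumes point estimates of the multichannel source images S_f from the E-step (in the actual EM algorithm the moments are posterior expectations), and the NMF parameters D and A are then updated from U_{k,fr} = |C_{k,fr}|^2 as in equation 3.27. Variable names are illustrative.

    import numpy as np

    def spatial_m_step_bin(Y_f, S_f):
        # Y_f: observed STFT frames of bin f, shape (M, R); S_f: source estimates, shape (J, R).
        R = Y_f.shape[1]
        R_YY = (Y_f @ Y_f.conj().T) / R        # (D.48)
        R_YS = (Y_f @ S_f.conj().T) / R        # (D.49)
        R_SS = (S_f @ S_f.conj().T) / R        # (D.50)

        G = R_YS @ np.linalg.inv(R_SS)         # G_S,f = R_YS,f R_SS,f^{-1}
        residual = R_YY - G @ R_YS.conj().T - R_YS @ G.conj().T + G @ R_SS @ G.conj().T
        Sigma_B = np.diag(np.diag(residual).real)   # Sigma_B,f constrained to be diagonal
        return G, Sigma_B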
D.6 Computational growth for parallel matrix inversion

In the EM algorithm proposed by [OF10], a huge number of M × M matrices have to be
inverted. This chapter shows that the computational complexity grows severely from
2 to 4 microphones.
The inversion of a 2 × 2 matrix A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} is simply given, using only
multiplications and additions, as

\det(A) = a_{11} a_{22} - a_{12} a_{21}        (D.54)

A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}        (D.55)

Thus it can be calculated via matrix operations for all time-frequency bins at once. For each
time-frequency bin, the determinant has to be calculated with 2 multiplications and 1
subtraction, and another 4 multiplications are needed for the scaling of the entries. In total this gives
6 multiplications and 1 subtraction.
A similar formulation can be made for 4 × 4 matrices². The calculation of the determinant
then requires 3 · 24 multiplications/divisions and 10 additions/subtractions. It
furthermore requires 16 multiplications for scaling and, for each of the 16 components,
12 multiplications and 5 additions. So in total it requires 280 multiplications and 103
additions/subtractions.

²for example with the formula at http://www.cg.info.hiroshima-cu.ac.jp/~miyazaki/knowledge/teche23.html
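The parallel 2 × 2 inversion can therefore be carried out with plain array arithmetic over all time-frequency bins at once; the NumPy sketch below is illustrative and not the thesis implementation.

    import numpy as np

    def invert_2x2_batch(A):
        # Closed-form inverse of a batch of 2x2 matrices, shape (..., 2, 2),
        # using equations D.54/D.55 for every time-frequency bin at once.
        a11, a12 = A[..., 0, 0], A[..., 0, 1]
        a21, a22 = A[..., 1, 0], A[..., 1, 1]
        idet = 1.0 / (a11 * a22 - a12 * a21)     # determinant: 2 multiplications, 1 subtraction per bin
        inv = np.empty_like(A)
        inv[..., 0, 0], inv[..., 0, 1] = a22 * idet, -a12 * idet   # 4 multiplications
        inv[..., 1, 0], inv[..., 1, 1] = -a21 * idet, a11 * idet   # for the scaling
        return inv

    # toy usage: F x R Hermitian positive definite 2x2 matrices
    B = np.random.randn(513, 100, 2, 2) + 1j * np.random.randn(513, 100, 2, 2)
    A = B @ np.swapaxes(B.conj(), -1, -2) + np.eye(2)
    print(np.allclose(invert_2x2_batch(A) @ A, np.eye(2)))   # True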
Bibliography
[Att03] Attias, Hagai: New EM Algorithms for Source Separation and Deconvo-
lution with a Microphone Array, 2003
[BDBG03] Benaroya, Laurent ; Donagh, Lorcan M. ; Bimbot, Frederic ; Gribonval,
Remi: Non negative sparse representation for Wiener based source
separation with a single sensor. In: Proc. IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), 2003, S. 613–616
[BW01] Brandstein, Michael ; Ward, Darren: Microphone arrays: signal pro-
cessing techniques and applications. Springer Science & Business Media,
2001
[Car98a] Cardoso, Jean-Francois: Blind signal separation: statistical principles.
In: Proceedings of the IEEE 86 (1998), Nr. 10, S. 2009–2025
[Car98b] Cardoso, Jean-Francois: Multidimensional independent component
analysis. In: Acoustics, Speech and Signal Processing, 1998. Proceedings
of the 1998 IEEE International Conference on Bd. 4 IEEE, 1998, S. 1941–
1944
[Car01] Cardoso, Jean-Francois: The three easy routes to independent compo-
nent analysis; contrasts and geometry. In: Proc. ICA Bd. 2001, 2001
[CAZ+06] Cichocki, Andrzej ; Amari, Shun-ichi ; Zdunek, Rafal ; Kompass,
Raul ; Hori, Gen ; He, Zhaohui: Extended SMART algorithms for
non-negative matrix factorization. In: Artificial Intelligence and Soft
Computing–ICAISC 2006. Springer, 2006, S. 548–562
[CCA11] Cichocki, Andrzej ; Cruces, Sergio ; Amari, Shun-ichi: Generalized
alpha-beta divergences and their application to robust nonnegative matrix
factorization. In: Entropy 13 (2011), Nr. 1, S. 134–170
[CHH07] Cai, Deng ; He, Xiaofei ; Han, Jiawei: Semi-supervised discriminant
analysis. In: Computer Vision, 2007. ICCV 2007. IEEE 11th International
Conference on IEEE, 2007, S. 1–7
[Com94] Comon, Pierre: Independent component analysis, a new concept? In:
Signal processing 36 (1994), Nr. 3, S. 287–314
[CZA06] Cichocki, Andrzej ; Zdunek, Rafal ; Amari, Shun-ichi: Csiszar's divergences
for non-negative matrix factorization: Family of new algorithms.
In: LNCS, Springer, 2006, S. 32–39
[CZO87] Cox, Henry ; Zeskind, Robert M. ; Owen, Mark M.: Robust adaptive
beamforming. In: Acoustics, Speech and Signal Processing, IEEE Trans-
actions on 35 (1987), Nr. 10, S. 1365–1376
[CZPA09] Cichocki, Andrzej ; Zdunek, Rafal ; Phan, Anh H. ; Amari, Shun-ichi:
Nonnegative matrix and tensor factorizations: applications to exploratory
multi-way data analysis and blind source separation. John Wiley & Sons,
2009
[D+09] Dikmen, Onur u. a.: Unsupervised single-channel source separation us-
ing Bayesian NMF. In: Applications of Signal Processing to Audio and
Acoustics, 2009. WASPAA’09. IEEE Workshop on IEEE, 2009, S. 93–96
[DK14] Deleforge, Antoine ; Kellermann, Walter: Phase-Optimized
K-SVD for Signal Extraction from Underdetermined Multi-
channel Sparse Mixtures. In: CoRR abs/1410.2430 (2014).
http://arxiv.org/abs/1410.2430
[DS05] Dhillon, Inderjit S. ; Sra, Suvrit: Generalized nonnegative matrix ap-
proximations with Bregman divergences. In: In: Neural Information Proc.
Systems, 2005, S. 283–290
[DVG10] Duong, Ngoc Q. ; Vincent, Emmanuel ; Gribonval, Remi: Under-
determined reverberant audio source separation using a full-rank spatial
covariance model. In: Audio, Speech, and Language Processing, IEEE
Transactions on 18 (2010), Nr. 7, S. 1830–1840
[EF13] Essid, Slim ; Fevotte, Cedric: Smooth nonnegative matrix factorization
for unsupervised audiovisual document structuring. In: Multimedia, IEEE
Transactions on 15 (2013), Nr. 2, S. 415–425
[EK01] Eguchi, Shinto ; Kano, Yutaka: Robustifying maximum likelihood es-
timation. In: Tokyo Institute of Statistical Mathematics, Tokyo, Japan,
Tech. Rep (2001)
[FBD09] Fevotte, Cedric ; Bertin, Nancy ; Durrieu, Jean-Louis: Nonnegative
Matrix Factorization with the Itakura-saito Divergence: With Application
to Music Analysis. Cambridge, MA, USA : MIT Press, Marz 2009. – ISSN
0899–7667, 793–830
[FC09] Fevotte, Cedric ; Cemgil, A T.: Nonnegative matrix factorizations as
probabilistic inference in composite models. In: Signal Processing Confer-
ence, 2009 17th European IEEE, 2009, S. 1913–1917
[GW] Grimmett, Geoffrey ; Welsh, Dominic: Probability: An Introduction
[HDB11] Hennequin, Romain ; David, Bertrand ; Badeau, Roland: Beta-
divergence as a subclass of bregman divergence. In: IEEE Signal Pro-
cessing Letters 18 (2011), Nr. 2, S. 83–86
[HO00] Hyvarinen, Aapo ; Oja, Erkki: Independent component analysis: algo-
rithms and applications, Elsevier, 2000, S. 411–430
[Inc11] Ince, Gokhan: Ego Noise Estimation for Robot Audition, 2011
[KSRS11] Kumar, Kshitiz ; Singh, Rita ; Raj, Bhiksha ; Stern, Richard: Gam-
matone sub-band magnitude-domain dereverberation for ASR. In: Acous-
tics, Speech and Signal Processing (ICASSP), 2011 IEEE International
Conference on IEEE, 2011, S. 4604–4607
[LA04] Li, Chunjian ; Andersen, Søren V.: Inter-frequency dependency in
MMSE speech enhancement. In: Proceedings of the 6th Nordic Signal
Processing Symposium Citeseer, 2004
[LBD+14] Loellmann, Heinrich W. ; Barfuss, Hendrik ; Deleforge, Antoine ;
Meier, Stefan ; Kellermann, Walter: Challenges in acoustic signal en-
hancement for human-robot communication. In: Speech Communication;
11. ITG Symposium; Proceedings of VDE, 2014, S. 1–4
[LDDR13] Liutkus, Antoine ; Durrieu, Jean-Louis ; Daudet, Laurent ; Richard,
Guilhem: An overview of informed audio source separation. In: Image
Analysis for Multimedia Interactive Services (WIAMIS), 2013 14th Inter-
national Workshop on IEEE, 2013, S. 1–4
[LGS+12] Liutkus, Antoine ; Gorlow, Stanislaw ; Sturmel, Nicolas ; Zhang,
Shuhua ; Girin, Laurent ; Badeau, Roland ; Daudet, Laurent ; Marc-
hand, Sylvain ; Richard, Gael: Informed audio source separation: A
comparative study. In: Signal Processing Conference (EUSIPCO), 2012
Proceedings of the 20th European IEEE, 2012, S. 2397–2401
[LHZC01] Li, Stan Z. ; Hou, Xin W. ; Zhang, HongJiang ; Cheng, QianSheng:
Learning spatially localized, parts-based representation. In: Computer
Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the
2001 IEEE Computer Society Conference on Bd. 1 IEEE, 2001, S. I–207
[Li06] Li, Chunjian: Non-Gaussian, Non-stationary and Nonlinear Signal Pro-
cessing Methods-with Applications to Speech Processing and Channel Es-
timation, Videnbasen for Aalborg UniversitetVBN, Aalborg Universite-
tAalborg University, Det Teknisk-Naturvidenskabelige FakultetThe Fac-
ulty of Engineering and Science, Multimedia Information and Signal Pro-
cessingMultimedia Information and Signal Processing, 2006
[LLP12] Li, Liangda ; Lebanon, Guy ; Park, Haesun: Fast bregman divergence
nmf using taylor expansion and coordinate descent. In: Proceedings of the
18th ACM SIGKDD international conference on Knowledge discovery and
data mining ACM, 2012, S. 307–315
[LN11] Low, Siow Y. ; Nordholm, Sven: An insight into the parametric mul-
tichannel wiener formulation. In: Computer Science and Automation En-
gineering (CSAE), 2011 IEEE International Conference on Bd. 1 IEEE,
2011, S. 513–516
[LN13] Li, Yifeng ; Ngom, Alioune: Versatile sparse matrix factorization and
its applications in high-dimensional biological data analysis. In: Pattern
Recognition in Bioinformatics. Springer, 2013, S. 91–101
[LS00] Lee, Daniel D. ; Seung, H. S.: Algorithms for Non-negative Matrix
Factorization. In: In NIPS, MIT Press, 2000, S. 556–562
[MH14] Mirsamadi, Seyedmahdad ; Hansen, John H.: Multichannel speech
dereverberation based on convolutive nonnegative tensor factorization for
ASR applications. In: Fifteenth Annual Conference of the International
Speech Communication Association, 2014
[OF10] Ozerov, Alexey ; Fevotte, Cedric: Multichannel nonnegative matrix
factorization in convolutive mixtures for audio source separation. In: Au-
dio, Speech, and Language Processing, IEEE Transactions on 18 (2010),
Nr. 3, S. 550–563
[OV10] Oppenheim, Alan ; Verghese, George: 6.011 Introduction to Commu-
nication Control and Signal Processing, Massachusetts Institute of Tech-
nology: MIT OpenCourseWare, Spring 2010
[OVB12] Ozerov, Alexey ; Vincent, Emmanuel ; Bimbot, Frederic: A general
flexible framework for the handling of prior information in audio source
separation. In: Audio, Speech, and Language Processing, IEEE Transac-
tions on 20 (2012), Nr. 4, S. 1118–1133
[PC01] Pham, Dinh-Tuan ; Cardoso, Jean-Francois: Blind separation of instan-
taneous mixtures of nonstationary sources, IEEE, 2001, S. 1837–1848
[SBA10] Souden, Mehrez ; Benesty, Jacob ; Affes, Sofiene: On optimal
frequency-domain multichannel linear filtering for noise reduction. In: Au-
dio, Speech, and Language Processing, IEEE Transactions on 18 (2010),
Nr. 2, S. 260–276
[SFM+14] Smaragdis, Paris ; Fevotte, Cedric ; Mysore, Gautham J. ; Mo-
hammadiha, Nasser ; Hoffman, Matthias: Static and dynamic source
separation using nonnegative factorizations: A unified view. In: Signal
Processing Magazine, IEEE 31 (2014), Nr. 3, S. 66–75
[SLOVB14] Souviraa-Labastie, Nathan ; Olivero, Anaik ; Vincent, Emmanuel
; Bimbot, Frederic: Audio source separation using multiple deformed ref-
erences. In: Signal Processing Conference (EUSIPCO), 2014 Proceedings
of the 22nd European IEEE, 2014, S. 311–315
[SLVB15] Souviraa-Labastie, Nathan ; Vincent, Emmanuel ; Bimbot,
Frederic: Music separation guided by cover tracks: designing the joint
NMF model. In: 40th IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP) 2015. Brisbane, Australia, April 2015
[Sma07] Smaragdis, Paris: Convolutive Speech Bases and Their Ap-
plication to Supervised Speech Separation. In: Trans. Au-
dio, Speech and Lang. Proc. 15 (2007), Januar, Nr. 1, 1–
12. http://dx.doi.org/10.1109/TASL.2006.876726. – DOI
10.1109/TASL.2006.876726. – ISSN 1558–7916
[SMW04] Spriet, Ann ; Moonen, Marc ; Wouters, Jan: Spatially pre-processed
speech distortion weighted multi-channel Wiener filtering for noise reduc-
tion. In: Signal Processing 84 (2004), Nr. 12, S. 2367–2387
[TKWB11] Thurau, Christian ; Kersting, Kristian ; Wahabzada, Mirwaes ;
Bauckhage, Christian: Convex non-negative matrix factorization for
massive datasets. In: Knowledge and information systems 29 (2011), Nr.
2, S. 457–478
[VAT+12] Vincent, Emmanuel ; Araki, Shoko ; Theis, Fabian ; Nolte, Guido ;
Bofill, Pau ; Sawada, Hiroshi ; Ozerov, Alexey ; Gowreesunker,
Vikrham ; Lutter, Dominik ; Duong, Ngoc Q.: The signal separa-
tion evaluation campaign (2007–2010): Achievements and remaining chal-
lenges. In: Signal Processing 92 (2012), Nr. 8, S. 1928–1936
[VBGB14] Vincent, Emmanuel ; Bertin, Nancy ; Gribonval, Remi ; Bimbot,
Frederic: From blind to guided audio source separation: How models and
side information can improve the separation of sound. In: IEEE Signal
Processing Magazine 31 (2014), Nr. 3, S. 107–115
[VG+08] Virtanen, Tuomas ; Godsill, Simon u. a.: Bayesian extensions to non-
negative matrix factorisation for audio signal modelling. In: Acoustics,
Speech and Signal Processing, 2008. ICASSP 2008. IEEE International
Conference on IEEE, 2008, S. 1825–1828
[VGA09] Vasiloglou, Nikolaos ; Gray, Alexander G. ; Anderson, David V.:
Non-negative Matrix Factorization, Convexity and Isometry. In: SDM
SIAM, 2009, S. 673–684
[Vir07] Virtanen, Tuomas: Monaural sound source separation by nonnegative
matrix factorization with temporal continuity and sparseness criteria. In:
IEEE Trans. On Audio, Speech and Lang. Processing (2007)
[VJA+10] Vincent, Emmanuel ; Jafari, Maria G. ; Abdallah, Samer A.
; Plumbley, Mark D. ; Davies, Mike E.: Probabilistic modeling
paradigms for audio source separation, Hershey: IGI Global, 2010, S. 162–
185
[VZ12] Van Zaen, Jerome: Efficient schemes for adaptive frequency tracking
and their relevance for EEG and ECG. (2012)
[WJ04] Wang, Yuan ; Jia, Yunde: Fisher non-negative matrix factorization for
learning local features. In: In Proc. Asian Conf. on Comp. Vision Citeseer,
2004
[YXZ+07] Yan, Shuicheng ; Xu, Dong ; Zhang, Benyu ; Zhang, Hong-Jiang ;
Yang, Qiang ; Lin, Stephen: Graph embedding and extensions: a general
framework for dimensionality reduction. In: Pattern Analysis and Machine
Intelligence, IEEE Transactions on (2007), Nr. 1, S. 40–51
[YXZZ05] Yan, Shuicheng ; Xu, Dong ; Zhang, Benyu ; Zhang, Hong-Jiang:
Graph embedding: A general framework for dimensionality reduction. In:
Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Com-
puter Society Conference on IEEE, 2005, S. 830–837
[YYF+08] Yang, Jianchao ; Yang, Shuicheng ; Fu, Yun ; Li, Xuelong ; Huang,
Thomas: Non-negative graph embedding. In: Computer Vision and Pat-
tern Recognition, 2008. CVPR 2008. IEEE Conference on IEEE, 2008, S.
1–8
[Zhu05] Zhu, Xiaojin: Semi-Supervised Learning Literature Survey / Computer
Sciences, University of Wisconsin -Madison. 2005 (1530). – Forschungs-
bericht